Merge or Die - CI for AI Agents
hackernews
🔬 Research
#ai agents
#ci
#review
#merge
#test
#python
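Original source: hackernews · Summarized and analyzed by Genesis Park
Summary
Trajectly has built a new CI (Continuous Integration) challenge for verifying the reliability of AI agents. An agent succeeds only if it passes Trajectly's CI gate. When an agent fails, the challenge reports the step where the problem occurred, the relevant code, how to reproduce it, and a shrunk failure trace. The goal is to make agent testing more concrete and easier to understand than evaluating only the final result.
Body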
Your agent passed the demo. Then CI killed it.
A GitHub-native challenge where bad agent PRs die with witness-level evidence: violated contract, repro command, shrink result, and the exact step where behavior broke.
The failure artifacts read like a CI death screen, not a benchmark spreadsheet. The arena is backed by live workflow state, committed specs, and the same Trajectly gate you would use in a production repo. You get the witness step, the violated contract, the repro command, and a minimized counterexample instead of a vague red X.
These eight scenarios cover six categories of silent failure that text-based checks cannot see. Final-answer evals can pass broken agents. If the behavior regressed, the agent is broken even when the final sentence still sounds right.
Example failures:
Procurement Chaos (witness 6): forged approval path
    Calls: fetch_requisition, fetch_vendor_quotes, create_purchase_order
    Repro: python -m trajectly repro procurement-chaos
Secret leaked in outbound payload
    Calls: fetch_logs, summarize, post_summary
    Repro: python -m trajectly repro secret-karaoke
Invite fired before room reservation
    Calls: lookup_oncall, send_invite, reserve_room
    Repro: python -m trajectly repro calendar-thunderdome
Start from a known-good trajectory and commit the baseline so future runs compare against real behavior, not vibes.
Replay with fixtures and contracts enabled so final-answer luck cannot hide missing calls, bad order, or unsafe branches.
Trajectly identifies the earliest event where behavior diverged and attaches a concrete violation code to that step.
Replay the failure locally with one command, then reduce it to the shortest counterexample that still dies the same way.
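The baseline comparison and earliest-divergence idea described above can be illustrated with a small sketch. This is not Trajectly's implementation or API. It assumes a trajectory is just an ordered list of tool-call names, and the helper names `first_divergence` and `violates_order`, as well as the baseline ordering, are hypothetical. The candidate trace is the calendar example from the text.

```python
from typing import Optional


def first_divergence(baseline: list[str], candidate: list[str]) -> Optional[int]:
    """Return the index of the earliest step where two tool-call traces differ."""
    for i, (expected, actual) in enumerate(zip(baseline, candidate)):
        if expected != actual:
            return i
    # One trace is a strict prefix of the other: they diverge where the shorter ends.
    if len(baseline) != len(candidate):
        return min(len(baseline), len(candidate))
    return None  # identical traces


def violates_order(trace: list[str], before: str, after: str) -> bool:
    """True if `after` is called without `before` having been called earlier."""
    seen_before = False
    for call in trace:
        if call == before:
            seen_before = True
        elif call == after and not seen_before:
            return True
    return False


# Hypothetical known-good baseline: reserve the room, then send the invite.
baseline = ["lookup_oncall", "reserve_room", "send_invite"]
# Trace from the text: the invite fires before the room reservation.
candidate = ["lookup_oncall", "send_invite", "reserve_room"]

witness = first_divergence(baseline, candidate)
broken = violates_order(candidate, before="reserve_room", after="send_invite")
print(f"witness step: {witness}, order contract violated: {broken}")
```

Both checks look only at the sequence of calls, so they flag the regression even if the agent's final message is unchanged.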
Each scenario demonstrates a trajectory failure that output-only checks can miss.
The final PO text can still look valid while the agent forges the approval path.
A calm support reply can hide the fact that the required escalation never happened.
The summary can read clean while the outbound payload still contains secret-like values.
The agent can claim the audit passed while quietly taking the disallowed tool branch.
"Bridge arranged" can still pass while reservation order is broken or invites fire twice.
The graph can finish and print success while a node argument silently violates contract.
The agent can report success even though it reached out to a forbidden domain.
The final text can stay identical while execution cost and tool usage quietly regress.
Same engine, same contracts, same repro workflow.
    python -m trajectly repro procurement-chaos
Witness index, violated contract, repro command, and shrink result map directly to the same GitHub review flow teams already use.
The debug loop is not theoretical. You can collapse a noisy failure into the smallest trace that still proves the regression. Refinement and contract findings make missing calls and broken order legible in a way final-answer checks cannot.
Shrink example: 14 events reduced to 3 events (- route_for_approval)
Open the repo, run the arena, then inspect the failure like you would in a real CI gate.
    git clone https://github.com/trajectly/trajectly-survival-arena.git
    cd trajectly-survival-arena
    pip install -r requirements.txt
    python -m trajectly init
    python -m trajectly run specs/challenges/*.agent.yaml --project-root .
    python -m trajectly report
    python -m trajectly repro
    python -m trajectly shrink
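To make the shrink step concrete, here is a minimal sketch of reducing a noisy failing trace to a smaller one that still fails. It uses a simple greedy one-event-at-a-time reduction, which is an assumption for illustration and not necessarily how `trajectly shrink` works. The contract `forged_approval`, the noisy trace, and the event names `log_event` and `compare_quotes` are hypothetical. Only the procurement call names and `route_for_approval` come from the text.

```python
from typing import Callable


def shrink(trace: list[str], still_fails: Callable[[list[str]], bool]) -> list[str]:
    """Greedily drop events one at a time while the failure still reproduces."""
    if not still_fails(trace):
        raise ValueError("trace does not fail; nothing to shrink")
    current = list(trace)
    changed = True
    while changed:
        changed = False
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            if still_fails(candidate):
                # The failure survives without event i, so keep the shorter trace.
                current = candidate
                changed = True
                break
    return current


def forged_approval(trace: list[str]) -> bool:
    """Hypothetical contract check: a PO created without prior route_for_approval."""
    approved = False
    for call in trace:
        if call == "route_for_approval":
            approved = True
        elif call == "create_purchase_order" and not approved:
            return True
    return False


# Hypothetical noisy failing run: route_for_approval never happens.
noisy = [
    "fetch_requisition", "log_event", "fetch_vendor_quotes",
    "log_event", "compare_quotes", "create_purchase_order", "log_event",
]
minimal = shrink(noisy, forged_approval)
print(f"{len(noisy)} events -> {len(minimal)} events: {minimal}")
```

Because this toy contract only looks at approval ordering, the sketch shrinks further than the 3-event result quoted above. A real gate would presumably keep whatever context its contracts need for the failure to die the same way.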
This analysis was written by the Genesis Park editorial team with the help of AI. The original is available via the source link.