Show HN: TracePact – Catch tool-call regressions in AI agents before production.
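Source: hackernews · Summarized and analyzed by Genesis Park
Summary
TracePact, built to address tool-call problems in AI agents, records a known-good run as a "cassette" and compares new runs against it, reporting what changed. It detects bugs in advance, such as an agent that used to run npm test switching to npm run build, and classifies changes as structural (block) or argument changes (warn) to support CI integration. It integrates with Vitest so you can write assertions on tool-call results, and it offers noise-reduction features such as ignoring specific tool calls (for example, file reads) and filtering out timestamps.
Body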
Catch tool-call regressions before they hit production.

Most agent failures don't look like bad text. They look like this:

- yesterday your agent read context, validated input, then wrote changes
- today, after a prompt tweak, it writes too early
- the final answer still looks plausible
- production is now broken

TracePact catches that. Record a known-good run, replay it in CI without API calls, and diff against new runs to see exactly what changed: which tools, in what order, with what arguments.

    # 1. Record a baseline (one-time, live)
    npx tracepact run --record

    # 2. Change your prompt, model, or tool wiring

    # 3. Record again and diff
    npx tracepact run --record
    npx tracepact diff cassettes/before.json cassettes/after.json

    # 4. CI: fail on behavioral regressions
    npx tracepact diff baseline.json latest.json --fail-on warn

    # Ignore noisy args (timestamps, request IDs)
    npx tracepact diff baseline.json latest.json --ignore-keys timestamp,requestId

    # Ignore tools you don't care about
    npx tracepact diff baseline.json latest.json --ignore-tools read_file

Example diff output:

    Comparing cassettes
    A: cassettes/before.json
    B: cassettes/after.json

    3 changes detected:
    - read_file (seq 1) (removed)
    + write_file (seq 3) (added)
    ~ bash.cmd: "npm test" -> "npm run build"

    Summary: 1 removed, 1 added, 1 arg changed [BLOCK]

You changed the prompt. The output still looks fine. But the agent stopped reading the config before deploying and switched from running tests to running builds. TracePact caught it.

Teams already try to catch this, but usually in fragile ways:

- manually reviewing traces in agent UIs
- parsing raw session logs after tests
- writing custom hooks to extract tool calls
- comparing old vs new runs by hand
- debugging regressions only after a user reports them

TracePact turns that into deterministic tests and replayable behavior contracts.
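To make the block/warn distinction concrete, here is a minimal sketch of how a diff between two recorded runs could be classified. It is not TracePact's implementation: the `ToolCall` shape, `normalizeArgs`, and `diffCalls` are hypothetical names introduced for illustration, and the real cassette format may differ. The sketch assumes that any change in which tools are called (or in what order) is structural and maps to "block", while a change in arguments alone maps to "warn", with ignored keys dropped before comparison.

```typescript
// Hypothetical cassette entry; TracePact's real on-disk format may differ.
interface ToolCall {
  tool: string;
  args: Record<string, unknown>;
}

type Severity = "none" | "warn" | "block";

interface DiffResult {
  removed: string[];
  added: string[];
  argChanges: string[];
  severity: Severity;
}

// Drop noisy keys (timestamps, request IDs) and serialize the remaining
// keys in sorted order so that key order does not affect the comparison.
function normalizeArgs(args: Record<string, unknown>, ignoreKeys: string[]): string {
  const kept = Object.keys(args)
    .filter((key) => !ignoreKeys.includes(key))
    .sort()
    .map((key) => [key, args[key]]);
  return JSON.stringify(kept);
}

function diffCalls(before: ToolCall[], after: ToolCall[], ignoreKeys: string[] = []): DiffResult {
  const result: DiffResult = { removed: [], added: [], argChanges: [], severity: "none" };

  const sameShape =
    before.length === after.length &&
    before.every((call, i) => call.tool === after[i].tool);

  if (!sameShape) {
    // Structural change: match tool names one-for-one and report the leftovers.
    // A pure reordering yields empty removed/added lists but is still "block".
    const remaining = after.map((call) => call.tool);
    for (const call of before) {
      const idx = remaining.indexOf(call.tool);
      if (idx === -1) result.removed.push(call.tool);
      else remaining.splice(idx, 1);
    }
    result.added = remaining;
    result.severity = "block";
    return result;
  }

  // Same tools in the same order: compare arguments position by position.
  before.forEach((call, i) => {
    if (normalizeArgs(call.args, ignoreKeys) !== normalizeArgs(after[i].args, ignoreKeys)) {
      result.argChanges.push(call.tool);
    }
  });
  if (result.argChanges.length > 0) result.severity = "warn";
  return result;
}
```

Under these assumptions, a run that swaps `npm test` for `npm run build` with the same tool sequence would be reported as "warn", and a run that drops `read_file` in favor of `write_file` would be reported as "block".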
TracePact is designed for assertions like these:

- read before write
- validate input before mutation
- never call shell for read-only tasks
- never call destructive tools without confirmation
- look up an existing record before creating a new one
- query the database before writing cache
- run tests before finishing a code-editing task
- inspect logs before restart actions
- do not write outside allowed paths
- do not call sensitive tools in low-trust flows

These are often easier and more stable than trying to assert that an entire response is "good."

A coding agent should read enough context before editing code.

    import { describe, expect, test } from 'vitest';
    import { TraceBuilder } from '@tracepact/vitest';

    describe('refactor agent', () => {
      test('reads context before editing code', () => {
        const trace = new TraceBuilder()
          .addCall('read_file', { path: 'src/service.ts' }, '...')
          .addCall('read_file', { path: 'src/types.ts' }, '...')
          .addCall('write_file', { path: 'src/service.ts', content: '...' })
          .addCall('run_tests', {}, 'PASS')
          .build();

        expect(trace).toHaveCalledToolsInOrder([
          'read_file',
          'read_file',
          'write_file',
          'run_tests',
        ]);
        expect(trace).toHaveToolCallCount('read_file', 2);
        expect(trace).toNotHaveCalledTool('bash');
      });
    });

This test fails immediately if a prompt or model change causes the agent to write before reading, skip required steps, or introduce a forbidden tool call. No API calls. No tokens. Deterministic. Runs in milliseconds.

Capture a known-good run once, then replay it to detect drift caused by changes to system prompts, model choice, tool descriptions, agent logic, or MCP server wiring.
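Contracts such as "read before write" or "never call shell for read-only tasks" reduce to simple checks over an ordered list of tool calls. The sketch below shows that idea in plain TypeScript, independent of TracePact: `TraceCall`, `firstOrderViolation`, and `neverCalls` are hypothetical helpers, not part of the library's API.

```typescript
// Hypothetical minimal trace entry; not TracePact's actual trace type.
interface TraceCall {
  tool: string;
}

// Returns the index of the first call to `later` that occurs before any call
// to `earlier`, or -1 if the "earlier before later" contract holds.
function firstOrderViolation(calls: TraceCall[], earlier: string, later: string): number {
  let seenEarlier = false;
  for (let i = 0; i < calls.length; i++) {
    if (calls[i].tool === earlier) {
      seenEarlier = true;
    } else if (calls[i].tool === later && !seenEarlier) {
      return i;
    }
  }
  return -1;
}

// Returns true when none of the forbidden tools appear anywhere in the trace.
function neverCalls(calls: TraceCall[], forbidden: string[]): boolean {
  return calls.every((call) => !forbidden.includes(call.tool));
}

// Usage: a trace that writes before it has read anything violates the contract.
const writesTooEarly: TraceCall[] = [
  { tool: "write_file" },
  { tool: "read_file" },
  { tool: "run_tests" },
];
const violationIndex = firstOrderViolation(writesTooEarly, "read_file", "write_file");
const shellFree = neverCalls(writesTooEarly, ["bash"]);
```

Because these checks look only at tool names and their order, they stay stable even when the wording of the agent's final answer changes between runs.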
    import { expect } from 'vitest';
    import { runSkill } from '@tracepact/vitest';

    // `skill` and `sandbox` are assumed to be defined elsewhere in the test setup.

    // Record (requires TRACEPACT_LIVE=1)
    const result = await runSkill(skill, {
      prompt: 'deploy to staging',
      record: './cassettes/deploy.json',
      sandbox,
    });

    // Replay (no API key needed, instant)
    const replayed = await runSkill(skill, {
      prompt: 'deploy to staging',
      replay: './cassettes/deploy.json',
    });

    expect(replayed.trace).toHaveCalledTool('deploy', { env: 'staging' });

Or use the CLI with automatic cassette recording:

    # Record all tests (cassettes saved automatically)
    npx tracepact run --live --record

    # Replay all tests (zero tokens, instant)
    npx tracepact run --replay ./cassettes

TracePact is especially useful for agents that use multiple tools, operate across several steps, mutate files or systems, and can silently regress after prompt or model updates.

Agents that read files, search code, edit code, run tests, use shell, open PRs. Typical contracts: read context before writing, do not use shell unless required, run tests before completion, never edit restricted files.

Agents that use GitHub, Jira, Slack, docs, or internal APIs via MCP servers. Typical contracts: use the correct system for the correct task, do not update tickets before validating context.

Agents that inspect logs, query metrics, read runbooks, restart services. Typical contracts: inspect before acting, never restart before checking evidence, require confirmation for destructive steps.

Agents that create tickets, update CRM records, reconcile data, route tasks. Typica
This analysis was written by the Genesis Park editorial team with the help of AI. The original text is available via the source link.