I reversed Opus 4.7 costs

hackernews | April 18, 2026 05:48 | 📦 Open Source

#claude

Original source: hackernews · Summary and analysis by Genesis Park

Summary

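Forge is a plugin for Claude Code that takes a one-line idea from the user and automatically carries it from spec writing through parallel task execution to verification. Each task runs independently in its own git worktree, and when tests fail a backpropagation step traces the failure back to the spec and proposes a spec update. Strict token budget limits, protection of the main branch when a task fails or a merge conflicts, and a recovery layer that can resume after a system crash give it a stable environment for automated coding.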

Body

Watch the architecture video · Read the docs

You start a feature in Claude Code. You write the prompt. It writes the code. You review it. You re-prompt. It tries again. It loses context. You re-explain. You watch the "context: 87%" warning crawl up. You restart. You re-explain again. You're three hours in, you have half a feature, and you're the one keeping the whole thing from falling apart.

You are the project manager. You are the state machine. You are the glue.

Forge replaces you as the glue. You describe what you want in one line. Forge writes the spec, plans the tasks, runs them in parallel git worktrees with TDD, reviews the code, verifies it against the acceptance criteria, and commits atomically. You read the diffs in the morning. Two minutes.

Requires Claude Code v1.0.33+. Zero npm install, zero build step, zero dependencies.

    claude plugin marketplace add LucasDuys/forge
    claude plugin install forge@forge-marketplace

Three commands. One autonomous loop. One squash-merge to main.

    /forge brainstorm "add rate limiting to /api/search with per-user quotas"
    /forge plan
    /forge execute --autonomy full

Then walk away.

Here is what you actually see while Forge runs.

    $ /forge brainstorm "add rate limiting to /api/search with per-user quotas"
    [forge-speccer] generating spec from idea...
    spec written: .forge/specs/spec-rate-limiting.md
      R001 per-user quotas, configurable per tier (free / pro / enterprise)
      R002 sliding window counters (1 minute, 1 hour, 1 day)
      R003 429 response with Retry-After header
      R004 bypass for admin tokens
      R005 redis-backed counters with atomic increment
      R006 structured logs for rate-limit events
      R007 integration test against /api/search
    $ /forge plan
    [forge-planner] decomposing into task DAG...
    8 tasks across 3 tiers (depth: standard)
      T001 add redis client + connection pool [haiku, quick]
      T002 implement sliding window counter [sonnet, standard]
      T003 build rate-limit middleware [sonnet, standard]
      T004 wire middleware to /api/search route [haiku, quick]
      T005 add 429 response with Retry-After [haiku, quick]
      T006 admin token bypass [haiku, quick]
      T007 structured logging [haiku, quick]
      T008 integration test [sonnet, standard]
      deps: T001 T002 T003 T004 T005 T006 T007
    $ /forge execute --autonomy full
    ══ FORGE iteration 3/100 ══════════════════════════════════ phase: executing ══
    Task T002 [in_progress] @ tests_written → tests_passing
    Tasks [████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░] 1/8 (12%)
    Tokens 47k in / 12k out / 23k cached budget 47k/500k (9%)
    Per-task 8k/15k tok (53%)
    Lock alive pid 18432, 4s ago restarts 0/10
    ──────────────────────────────────────────────────────────────────────
    [14:02:48] T001 PASS 4 lines, 1 commit, budget 1820/5000
    [14:02:48] T002 T003 dispatched in parallel (disjoint files)
    [14:06:01] T003 PASS 62 lines, 8 tests, budget 13880/15000
    [14:08:27] tier 2 complete, squash-merged 6 worktrees
    [14:14:18] forge-verifier: existence > substantive > wired > runtime
    [14:14:18] verifier PASS all 7 requirements satisfied
    [14:14:18] FORGE_COMPLETE
    8 tasks. 12 minutes. 218 lines. 9 commits squash-merged to main.
    session budget: 47200 / 500000 used. lock released.

You read the diffs. You merge the branch. You move on.

The pipeline is strictly sequential, enforced programmatically: brainstorm → plan → execute. You cannot skip brainstorming, skip planning, or bypass the approval gate. The spec is the contract. Every acceptance criterion has an R-number; every task maps to at least one R-number; the verifier checks R-numbers, not checklists.

Six outcomes, each traceable to a mechanism.

- No silent token overruns at 3am. Per-task and session budgets are hard ceilings, not warnings.
  At 100% the state machine transitions to budget_exhausted, writes a handoff at .forge/resume.md, and stops cleanly. Resume picks up where it died, no re-explaining. See budgets.
- Failed tasks never touch your main branch. Every task runs in its own git worktree. Success squash-merges with a structured commit message. Failure discards the worktree; main stays green. See worktrees.
- Crashes survive. Lock file with heartbeat, per-step checkpoints, forensic resume from the git log. Machine reboots mid-feature, /forge resume reconstructs phase, current task, completed tasks, orphan worktrees, and continues. No lost work, no re-running passing tests. See recovery.
- Verification checks the spec, not the checklist. Four levels: existence, substantive (not a stub), wired (imported where used), runtime (tests pass, webhooks handle, CI green). Catches "looks done but isn't" before it ships. See verification.
- Headless-ready. Proper exit codes, JSON state query in ~2ms, zero interactive prompts. Drop /forge status --json into Prometheus or a cron job. See headless.
- Native Claude Code plugin. Lives in your session. No separate harness, no TUI to learn, no API key to manage. Install in two minutes. See architecture.

Backpropagation is one of Forge's more novel ideas. When the executor's tests fail, a PostToolUse hook catches it and trips a flag. The next iteration runs a five-step workflow before resuming the failing task:

- Trace. Which spec and R-number does this failure map to?
- Analyze. Is the gap a missing criterion, an incomplete one, or a whole missing requirement?
- Propose. A spec update for your approval.
- Generate. A regression test that would have caught it.
- Log. Record in .forge/backprop-log.md; after three gaps in the same category, suggest systemic changes at the brainstorming layer.

The failure becomes a better spec, not just a fixed bug. Opt out with auto_backprop: false in .forge/config.json or FORGE_AUTO_BACKPROP=0. Manual invocation is /forge backprop "description".
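As a rough illustration of the Log step and the opt-out switches described above, here is a minimal Python sketch. It is not Forge's code: the entry shape, the category names, the function names, and the rule that either switch on its own disables auto-backprop are assumptions made for the example. Only the threshold of three, the auto_backprop config key, and the FORGE_AUTO_BACKPROP variable come from the text.

```python
from collections import Counter

# From the text: after three gaps in the same category, suggest systemic changes.
SYSTEMIC_THRESHOLD = 3


def auto_backprop_enabled(config, env):
    """Return False when either documented opt-out is set.

    Assumption: the env var and the config key are each sufficient on
    their own; the source does not state which one takes precedence.
    """
    if env.get("FORGE_AUTO_BACKPROP") == "0":
        return False
    return bool(config.get("auto_backprop", True))


def categories_needing_review(gap_log):
    """Return gap categories that have reached the systemic threshold.

    gap_log is a hypothetical parsed form of .forge/backprop-log.md:
    a list of dicts, each with at least a "category" key.
    """
    counts = Counter(entry["category"] for entry in gap_log)
    return sorted(c for c, n in counts.items() if n >= SYSTEMIC_THRESHOLD)


# Hypothetical log entries; the category names mirror the Analyze step.
gap_log = [
    {"requirement": "R003", "category": "missing_criterion"},
    {"requirement": "R005", "category": "missing_criterion"},
    {"requirement": "R002", "category": "incomplete_criterion"},
    {"requirement": "R007", "category": "missing_criterion"},
]

print(categories_needing_review(gap_log))  # ['missing_criterion']
```

In a real run the second argument to auto_backprop_enabled would presumably be os.environ and the first the parsed .forge/config.json; both are passed in explicitly here so the sketch stays easy to test.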
Full detail in backpropagation.

| Action | Forge | You |
|---|---|---|
| Write the spec from a one-line idea | autonomous (via Q&A) | approve the approach |
| Decompose spec into task DAG | autonomous | review if you like |
| Write code + tests for each task | autonomous | |
| Review each task against the spec | autonomous (separate agent) | |
| Verify all R-numbers satisfied | autonomous | |
| Squash-merge passing tasks to main | autonomous | read the diffs before pushing |
| Propose spec updates when tests fail | autonomous (via backprop) | approve the proposed update |
| Install new dependencies, hit paid APIs, push to remote | never without you | authorize explicitly |

Five mechanisms. Numbers are either measured from the repo's own benchmark suite or tagged as estimates.

Hard per-task and session budgets (budgets.md). Task budgets scale with detected complexity:

- quick: 5,000 tokens
- standard: 15,000 tokens
- thorough: 40,000 tokens
- session: 500,000 tokens

At 80% the next prompt gets a warning injected. At 100% the phase flips to budget_exhausted, state halts, a resume doc is written. No silent drift over hours of autonomous work.

Caveman compression (caveman.md, benchmark). Three intensity modes compress internal agent artifacts (state notes, handoff bundles, checkpoint context, review reports). Never compresses source code, commits, specs, or PR descriptions.

| Mode | When | Measured reduction |
|---|---|---|
| lite | budget > 50% | ~1% on mixed artifacts (filler-word strip only) |
| full | 20-50% | 12% on the 10-scenario benchmark |
| ultra | [...]

[...] substantive > wired > runtime) | Whatever the prompt says | Auto-fix retries on test/lint |
| Setup | claude plugin install | Built into Claude Code | npm install -g gsd-pi |

- Pick Forge if you want autonomous execution inside your existing Claude Code session with hard cost controls, adaptive depth, and crash recovery.
- Pick GSD-2 if you want a battle-tested standalone TUI harness with more engineering hours behind it.
- Pick Ralph Loop if you have a tightly-scoped greenfield task with binary verification and want the absolute minimum infrastructure.

Full honest comparison with all trade-offs: docs/comparison.md.

Forge is a state machine that lives inside your Claude Code session. A spec becomes a tier-ordered task DAG; an autonomous loop dispatches parallel executors in git worktrees; each task is gated by review and verification; successful tasks squash-merge atomically. Seven hooks fire on every tool call to cap tokens, condense test output, cache repeat reads, track progress, and trigger auto-backprop when tests hit a spec gap. State files under .forge/ are the single source of truth; the TUI and headless query both read them without writing.

End-to-end. Three commands, one autonomous loop, one merge.

    flowchart LR
      User([You: one line idea]) --> Bs["/forge brainstorm"]
      Bs --> Spec[".forge/specs/spec-{domain}.md<br/>R001…R0NN + acceptance criteria"]
      Spec --> Plan["/forge plan"]
      Plan --> Frontier[".forge/plans/{spec}-frontier.md<br/>tier 1 ┃ tier 2 ┃ tier 3<br/>dependency DAG"]
      Frontier --> Exec["/forge execute"]
      Exec --> Loop{"autonomous<br/>loop"}
      Loop -->|all done| Done([squash-merge to main<br/>FORGE_COMPLETE])
      Loop -.->|read-only| Watch["/forge watch<br/>live TUI dashboard"]
      Loop -.->|read-only| Headless["/forge status --json<br/>headless query"]
      Crash[crash / context reset] -.->|/forge resume| Loop
      classDef cmd fill:#e3f2fd,stroke:#1565c0,color:#0d47a1
      classDef state fill:#fff3e0,stroke:#e65100,color:#bf360c
      classDef ui fill:#e0f7fa,stroke:#006064,color:#004d40
      classDef done fill:#c8e6c9,stroke:#1b5e20,color:#0d2818
      class Bs,Plan,Exec,Loop cmd
      class Spec,Frontier state
      class Watch,Headless ui
      class Done,User done
      class Crash state

Four deeper diagrams cover the execute loop, hooks pipeline, backpropagation, and recovery layer. The full one-piece view sits at the bottom.

Execute loop (state machine + DAG dispatch)

What /forge execute actually runs.
State machine drives everything; the Stop hook re-fires it after every Claude turn.

    flowchart TB
      Stop["Stop hook<br/>fires after every Claude turn"] --> SM{{"routeDecision()<br/>12 phases"}}
      SM --> Dispatch["streaming-DAG dispatch<br/>tiers sequential ┃ tasks parallel"]
      Dispatch --> Router["forge-router<br/>haiku=1 ┃ sonnet=5 ┃ opus=25"]
      Router --> Wts["per-task worktrees<br/>forge-executor"]
      Wts --> Reviewer["forge-reviewer"]
      Reviewer -->|issues| SM
      Reviewer -->|pass| Verifier["forge-verifier"]
      Verifier -->|gap| SM
      Verifier -->|R's met| Squash["squash-merge to main"]
      Squash -->|merge fail| Conflict["conflict_resolution<br/>preserve worktree"]
      Squash -->|ok| SM
      Conflict -.-> SM
      SM -->|next iteration| Stop

Hooks pipeline (every tool call)

Seven hooks fire on every executor tool call. They keep the loop fast, cheap, and self-correcting.

    flowchart LR
      Tool["executor tool call"] --> Pre[PreToolUse]
      Pre --> Cache["tool-cache<br/>120s TTL on read-only ops"]
      Cache -.->|hit| Skip([cached, no LLM])
      Cache -.->|miss| Run[run real tool]
      Run --> Post["PostToolUse fan-out"]
      Post --> Tok["token-monitor<br/>80%/100% gates"]
      Post --> Filt["test-output-filter<br/>>2000 chars"]
      Post --> Prog[progress-tracker]
      Post --> AutoBP["auto-backprop<br/>FAIL pattern detect"]
      Post --> Store[tool-cache-store]
      Tok -->|>=100%| Exhaust([budget_exhausted])
      AutoBP -->|FAIL| Flag([flag file + state flag])

Backpropagation and replanning loops

Two feedback loops that change what runs next based on what just happened.
    flowchart TB
      subgraph Auto["Auto-backprop: test failure → spec fix"]
        Fail[test failure] --> Hook[auto-backprop.js captures context]
        Hook --> Flag[.auto-backprop-pending.json]
        Flag --> Inject[stop-hook injects directive]
        Inject --> BP5["TRACE → ANALYZE → PROPOSE → GENERATE test → LOG"]
        BP5 --> SpecUpd[spec updated + regression test]
        SpecUpd --> Resume[resume original task]
      end
      subgraph Replan["Replanning: concerns → re-decompose"]
        Tier[tier completes] --> Check{"shouldReplan()<br/>concerns ÷ done ≥ 0.3?"}
        Check -->|yes| Redec["planner re-invoked<br/>T003 → T003.1, T003.2"]
        Redec --> Continue[continue with new frontier]
        Check -->|no| Continue
      end
      SpecUpd -.->|can trigger| Check

Recovery layer

Three independent layers cooperate so nothing is lost.

    flowchart LR
      subgraph Live
        Acquire["acquireLock()<br/>or take over stale (5 min)"] --> HB[heartbeat every 30s]
        HB --> WriteCP[writeCheckpoint after each step]
      end
      Live --> Files[".forge-loop.lock + progress/T###.json + git log"]
      Files --> Resume{"/forge resume"}
      Resume --> Forensic[performForensicRecovery]
      Forensic --> Loop2[resume at exact step]

Full one-piece architecture diagram

All subsystems in one flow. The four focused diagrams above are easier to read individually; this is the holistic view.
    flowchart TB
      User([You: one line idea]) --> Bs["forge-speccer<br/>R-numbered spec"]
      Bs --> Planner["forge-planner<br/>tier DAG + token estimates"]
      Planner --> SM{"routeDecision()<br/>12-phase state machine"}
      SM --> Dispatch["streaming-DAG dispatch"]
      Dispatch --> Exec["forge-executor<br/>TDD + tests"]
      Exec --> Gates["reviewer + verifier<br/>existence > substantive > wired > runtime"]
      Gates --> Merge["squash-merge worktree"]
      Merge --> SM
      SM -->|tier done + concerns| Planner
      SM -->|all done| Done([FORGE_COMPLETE])
      Exec -.->|every tool call| Hooks["hooks: tool-cache, token-monitor,<br/>test-filter, progress, auto-backprop"]
      Hooks -.->|test failure| Planner
      SM -.->|writes| Recovery["lock + checkpoints + forensic resume"]

Short version. Full table with every file pointer: docs/architecture.md.

| Layer | Key files | What it does |
|---|---|---|
| State machine | scripts/forge-tools.cjs::routeDecision | 12-phase router called by the Stop hook every Claude turn |
| DAG dispatch | scripts/forge-tools.cjs::findAllUnblockedTasks | Tiers sequential, tasks within a tier parallel |
| Model routing | scripts/forge-router.cjs::selectModel | Per-role baseline + complexity + budget → haiku / sonnet / opus |
| Budget tracking | scripts/forge-budget.cjs | Per-task + session spend with model cost weights, hard 100% gate |
| Agents | agents/forge-*.md | Speccer, planner, researcher, executor, reviewer, verifier, complexity |
| Hooks | hooks/*.{js,sh} | Seven hooks: tool cache, token monitor, test filter, progress, auto-backprop, cache store, stop |
| Recovery | scripts/forge-tools.cjs (lock + checkpoints + forensic) | Lock with heartbeat, 10-step checkpoints, rebuild from git log |
| TUI + headless | scripts/forge-tui.cjs, scripts/forge-tools.cjs::queryHeadlessState | Read-only; /forge watch renders at 10Hz, JSON snapshot in ~2ms |

- 206 tests, 0 dependencies. Full suite runs in 2.6 seconds. Pure node:assert, zero npm install.
- Headless state query: ~2ms. Zero LLM calls, 17-field versioned JSON schema.
- Caveman compression: 12% measured on the 10-scenario agent-output benchmark at full intensity, rising to 18% at ultra and up to 65% on dense prose. See benchmark.
- Seven hooks fire on every tool call. Tool cache, token monitor, test filter, progress tracker, auto-backprop, cache store, stop. See architecture.
- Seven circuit breakers. Test failures, debug exhaustion, Codex rescue, re-decomposition, review iterations, no-progress detection, max iterations. Nothing runs forever. See verification.
- Lock heartbeat survives crashes, reboots, OOMs, and context resets. Five-minute stale threshold, never auto-deletes user work.
- Seven specialized agents, each route
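The lock-heartbeat bullet above, together with the 30-second heartbeat shown in the recovery diagram, can be pictured with a small Python sketch. This is an illustration under stated assumptions, not Forge's implementation (which lives in scripts/forge-tools.cjs): the record shape, the function names, and the choice to treat an age of exactly five minutes as stale are all hypothetical.

```python
from typing import Optional

HEARTBEAT_INTERVAL_S = 30    # from the recovery diagram: heartbeat every 30s
STALE_THRESHOLD_S = 5 * 60   # from the text: five-minute stale threshold


def is_stale(last_heartbeat: float, now: float) -> bool:
    """A lock is stale once its last heartbeat is at least five minutes old.

    Assumption: the boundary is inclusive; the source does not say.
    """
    return now - last_heartbeat >= STALE_THRESHOLD_S


def try_acquire(existing: Optional[dict], pid: int, now: float) -> Optional[dict]:
    """Return a new lock record if the lock is free or stale, else None.

    Only the lock record is replaced on takeover. Worktrees and checkpoints
    are left alone, in the spirit of "never auto-deletes user work".
    """
    if existing is None or is_stale(existing["heartbeat"], now):
        return {"pid": pid, "heartbeat": now}
    return None


def beat(lock: dict, now: float) -> dict:
    """Refresh the heartbeat; a live loop would call this every 30 seconds."""
    return {**lock, "heartbeat": now}


# A lock whose holder was last seen 4 seconds ago is still alive.
held = {"pid": 18432, "heartbeat": 1000.0}
print(try_acquire(held, pid=999, now=1004.0))   # None

# After a crash the heartbeat stops; six minutes later another loop takes over.
print(try_acquire(held, pid=999, now=1360.0))   # {'pid': 999, 'heartbeat': 1360.0}
```

Timestamps are passed in rather than read from the clock so the staleness rule can be checked deterministically.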

View original (hackernews)

This analysis was written by the Genesis Park editorial team with the help of AI. The original text is available via the source link.
