AI Agent Sessions, 21 Human Interventions: Batch Orchestration Data

hackernews | 🔬 Research
#ai agent #automation #claude #minitest #orchestration #review #rspec
Original source: hackernews · Summarized and analyzed by Genesis Park

Summary

A batch job using AI agents to migrate model tests from RSpec to Minitest ran for roughly 16 working days. Two orchestrators executed a total of 764 sessions across 98 models and ~161 files, ultimately generating close to 10,000 tests. The first orchestrator (Ralph) handled the migration and the second (Felix) handled fixture cleanup; thanks to layered error handling, the run achieved high autonomous completion rates of 92% at the file level and 85% at the model level, with only 21 human interventions.

Body

The orchestrator is a while read loop. That is the actual answer when someone asks how I ran AI agents across 98 models and ~161 fixture and test files to migrate model tests from RSpec to Minitest. The shell script is maybe 100 lines. It reads a list of files and calls a Claude command for each one, sequentially. The layers of error handling below that loop determined whether the project succeeded or stalled.

The previous article covered the pipeline itself: the command file, agent definitions with I/O contracts, gate types, and the Clean Room context strategy. This article covers what happens when you take that pipeline and run it across an entire codebase: two batch orchestrators, their aggregate execution data, and the layered failure handling that determined the real intervention rate.

```mermaid
flowchart LR
    DISC["Discovery Script"] --> LOOP["Shell Loop"]
    LOOP -->|"claude -p"| PIPE["6-Gate Pipeline"]
    PIPE --> FIX{"/fix-tests?"}
    FIX -->|"pass"| COMMIT["Git Commit"]
    FIX -->|"fail, retry"| PIPE
    COMMIT --> HUMAN{"Human Layer"}
    HUMAN -->|"~85% pass (model level)"| DONE["Done"]
    HUMAN -->|"~15% intervene"| LOOP
    style DISC fill:#2a5f8f,stroke:#4a9eed,color:#ffffff
    style LOOP fill:#2a5f8f,stroke:#4a9eed,color:#ffffff
    style PIPE fill:#1e6e45,stroke:#4ae68a,color:#ffffff
    style FIX fill:#7a5c2e,stroke:#e6b44a,color:#ffffff
    style COMMIT fill:#5a2e7a,stroke:#b44ae6,color:#ffffff
    style HUMAN fill:#8b3a3a,stroke:#e06060,color:#ffffff
    style DONE fill:#1e6e45,stroke:#4ae68a,color:#ffffff
```

Two orchestrators ran 764 Claude sessions across ~259 unique files over 16 working days. Ralph migrated 98 models through the 6-gate pipeline, producing close to 10,000 tests in 283 sessions. Felix cleaned up the fixture debt Ralph left behind: enum corrections, consolidation, and YAML anchors across ~161 unique fixture and test files in 481 sessions. Between them, 21 problems required a human to step in: an ~85% autonomous rate at the model level (where go/no-go decisions happened) and ~92% at the file level. Each layer caught a different category of failure, so the rate that reached the human stayed low even when individual layers failed often.

Batch Orchestrating RSpec to Minitest Migration

```mermaid
flowchart TD
    subgraph Discovery["Discovery Phase"]
        direction LR
        D1["Ruby script finds untested models"] --> D2["Sort by size, largest first"]
    end
    subgraph Loop["Main Loop"]
        direction LR
        L1["/minitest-migration, 6-gate pipeline"] --> L2{"Pass?"}
        L2 -->|Yes| L3["Commit"]
        L2 -->|No| L4["/fix-tests, max 3 retries"]
        L4 --> L3
    end
    subgraph Output["Per-Batch Output"]
        direction LR
        O1["Git commit per component"] --> O2["PR with all migrations"]
    end
    Discovery --> Loop
    Loop --> Output
    style D1 fill:#2a5f8f,stroke:#4a9eed,color:#ffffff
    style D2 fill:#2a5f8f,stroke:#4a9eed,color:#ffffff
    style L1 fill:#1e6e45,stroke:#3cb371,color:#ffffff
    style L2 fill:#7a5c2e,stroke:#d4a04a,color:#ffffff
    style L3 fill:#1e6e45,stroke:#3cb371,color:#ffffff
    style L4 fill:#8b3a3a,stroke:#e06060,color:#ffffff
    style O1 fill:#5a2e7a,stroke:#9b59b6,color:#ffffff
    style O2 fill:#5a2e7a,stroke:#9b59b6,color:#ffffff
    style Discovery fill:#1e3a5f,stroke:#4a9eed,color:#e0e0e0
    style Loop fill:#1a3d2e,stroke:#3cb371,color:#e0e0e0
    style Output fill:#2e1a3d,stroke:#9b59b6,color:#e0e0e0
```

Ralph is the first layer. The name comes from Geoffrey Huntley's Ralph Wiggum Loop: run Claude in a while true loop, let it fail, let it retry, let naive persistence do the work. My version adds structure beyond that: discovery scripts, a 6-gate pipeline, bounded retries, and commit-after-each-component.
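Stripped of that structure, the underlying pattern needs almost no code. A minimal sketch in Ruby (the prompt string is illustrative, not the project's actual command):

```ruby
# Bare Ralph Wiggum loop: run the agent, let it fail, let it retry,
# and let naive persistence do the work. Illustrative sketch only.
loop do
  system("claude", "-p", "migrate one untested model spec to Minitest",
         "--dangerously-skip-permissions")
end
```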
But the core pattern is the same. It processes one model at a time: discover which models lack Minitest coverage, run the /minitest-migration pipeline for each, handle failures with a bounded retry loop, commit after each component, and move on.

Discovery: Finding What to Migrate

The loop starts with a Ruby discovery script that scans the codebase for models without corresponding test files:

```ruby
# Simplified discovery script
# Finds models without Minitest coverage, sorted by size (largest first)
model_files = Dir.glob("app/models/**/*.rb")

untested = model_files.select do |model_path|
  test_path = model_path
    .sub("app/models/", "test/models/")
    .sub(".rb", "_test.rb")
  !File.exist?(test_path)
end

sorted = untested.sort_by { |f| -File.readlines(f).size }
sorted.each { |f| puts f }
```

Largest first is deliberate. The biggest models are the hardest, and I wanted to hit the hardest problems early, when I had the most energy for debugging. A batch of small models at the end is a reward, not a risk.

Shell Loop for Sequential Claude Sessions

The shell script structure is straightforward:

```bash
#!/bin/bash
# Simplified Ralph main loop structure
BATCH_SIZE=10
COMMIT_INTERVAL=3
counter=0

discover_models | head -n "$BATCH_SIZE" | while read -r model_path; do
  echo "Migrating: $model_path"

  # Run the 6-gate pipeline
  claude -p "/minitest-migration $model_path" \
    --dangerously-skip-permissions

  # Check if tests pass
  test_file=$(echo "$model_path" | sed 's|app/models/|test/models/|; s|\.rb$|_test.rb|')
  if ! bin/rails test "$test_file" 2>/dev/null; then
    fix_tests "$test_file"
  fi

  counter=$((counter + 1))
  if [ $((counter % COMMIT_INTERVAL)) -eq 0 ]; then
    git add -A && git commit -m "batch: migrate $counter models"
  fi
done
```

The /minitest-migration command is where the actual intelligence lives. It runs the 6-gate pipeline with I/O contracts and parallel writers. Ralph just calls it and handles the result.

Automated Test Fix Retries

```mermaid
flowchart TD
    RUN["bin/rails test"] --> CHK{"Exit code 0?"}
    CHK -->|"yes"| DONE["Continue to next model"]
    CHK -->|"no"| EXT["Extract failure lines"]
    EXT --> FIX["claude -p /fix-tests"]
    FIX --> RUN
    FIX -->|"max 3 retries"| WARN["Log warning, move on"]
    style RUN fill:#2a5f8f,stroke:#4a9eed,color:#ffffff
    style CHK fill:#7a5c2e,stroke:#e6b44a,color:#ffffff
    style DONE fill:#1e6e45,stroke:#4ae68a,color:#ffffff
    style EXT fill:#8b3a3a,stroke:#ed4a4a,color:#ffffff
    style FIX fill:#8b3a3a,stroke:#ed4a4a,color:#ffffff
    style WARN fill:#5a2e7a,stroke:#b44ae6,color:#ffffff
```

About 40-50% of models did not pass on the first /minitest-migration run. This is where the second layer kicks in. The fallback function extracts the specific failure from test output and feeds it back to Claude with focused context:

```bash
fix_tests() {
  local test_file="$1"
  local max_retries=3
  local attempt=0

  while [ "$attempt" -lt "$max_retries" ]; do
    # Capture test output, extract failure block
    test_output=$(bin/rails test "$test_file" 2>&1)
    exit_code=$?
    [ $exit_code -eq 0 ] && return 0

    # Extract only the failure lines for focused context
    failures=$(echo "$test_output" | grep -A 5 "Failure:\|Error:")

    # The tail of this function is reconstructed: feed the failure
    # context back to the agent and bound the number of retries.
    claude -p "/fix-tests $test_file

$failures" \
      --dangerously-skip-permissions
    attempt=$((attempt + 1))
  done

  echo "WARN: $test_file still failing after $max_retries fix attempts"
  return 1
}
```
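Put together, the first two layers reduce to one control structure. A schematic sketch (helper names like run_pipeline, tests_pass?, run_fix_tests, and flag_for_human are illustrative, not real script functions):

```ruby
# Schematic of the layered loop, one model at a time.
models.each do |model|
  run_pipeline(model)              # layer 1: /minitest-migration, 6 gates
  3.times do                       # layer 2: bounded /fix-tests retries
    break if tests_pass?(model)
    run_fix_tests(model)
  end
  flag_for_human(model) unless tests_pass?(model)  # escalate to the human layer
  commit(model)                    # commit-after-each-component
end
```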
The pipeline did not produce nearly 10,000 tests on the first attempt. It scaled in visible increments, each one surfacing new failure categories that fed back into the agent instructions.

```mermaid
flowchart LR
    W1["Week 1"] -->|"overnight runs"| W2["Week 2: 4,830 tests"]
    W2 --> W3["Week 3: 8,201 tests"]
    W3 --> W4["Week 4: ~10,000 tests"]
    style W1 fill:#2a5f8f,stroke:#4a9eed,color:#ffffff
    style W2 fill:#7a5c2e,stroke:#e6b44a,color:#ffffff
    style W3 fill:#1e6e45,stroke:#3cb371,color:#ffffff
    style W4 fill:#5a2e7a,stroke:#9b59b6,color:#ffffff
```

After the first week, the test suite was already running faster than the RSpec suite it was replacing:

```text
bin/rails test test/models/
15.23s · 4,830 tests (317.15/s) with 9,453 assertions (620.71/s)
Line Coverage: 14.19% (14,582 / 102,732)
```

By the end of week two, the test count had grown to 8,201 across 32 seconds of wall-clock time:

```text
bin/rails test test/models/
32.62s · 8,201 tests (251.4/s) with 15,907 assertions (487.62/s)
Line Coverage: 17.58% (17,645 / 100,378)
```

The coverage percentages look small, but those are whole-app numbers (100,000+ lines). The per-model coverage on migrated models was already hitting 80-95%. As the pipeline processed the remaining models over the following week, the test count climbed to close to 10,000 across 98 models, with line coverage approaching 20%.

141,000 lines of generated test code deserve scrutiny. That is a lot of code to maintain, regardless of who wrote it, and “generated code” carries legitimate risk: if the generator produces shallow tests (tests that exercise code paths without verifying meaningful behavior), you end up maintaining code that provides coverage numbers without confidence. Three factors make the maintenance burden manageable in practice, though the long-term cost is still unproven. First, model tests are structurally repetitive (association tests, validation tests, scope tests), so reading and modifying them is fast even at scale. If a factory changes or a validation moves, the fix is mechanical. Second, the mathematical derivation from the Analyst/Writer split means every test traces to a specific technique and source method, which makes it possible to regenerate sections rather than hand-editing. Third, Minitest's flat structure (no nested context blocks, no let chains) means each test method is self-contained and readable in isolation.

One comparison from later in the project: TransferProcessingService had 19.39% line coverage under RSpec (71 examples, 82 seconds). The Minitest migration reached 87.0% coverage (228 tests, 13.67 seconds). More coverage, more tests, 6x faster.

The real test of this claim is not now, when the code is fresh, but six months from now, when the models have evolved and the generated tests need updating. I do not yet have that data.

Fixture Cleanup: Why a Second Orchestrator Was Needed

```mermaid
flowchart LR
    subgraph Phase1["Phase 1"]
        P1["Selective Loading, reverted"]
    end
    subgraph Phase2["Phase 2"]
        P2["Fix Enums, 65 files"]
    end
    subgraph Phase3["Phase 3"]
        P3["Consolidation, 96 files"]
    end
    subgraph Phase4["Phase 4"]
        P4["YAML Anchors, low impact"]
    end
    Phase1 --> Phase2
    Phase2 --> Phase3
    Phase3 --> Phase4
    style P1 fill:#2a5f8f,stroke:#4a9eed,color:#ffffff
    style P2 fill:#8b3a3a,stroke:#e06060,color:#ffffff
    style P3 fill:#7a5c2e,stroke:#d4a04a,color:#ffffff
    style P4 fill:#5a2e7a,stroke:#9b59b6,color:#ffffff
    style Phase1 fill:#1e3a5f,stroke:#4a9eed,color:#e0e0e0
    style Phase2 fill:#3d1a1a,stroke:#e06060,color:#e0e0e0
    style Phase3 fill:#3d2e1a,stroke:#d4a04a,color:#e0e0e0
    style Phase4 fill:#2e1a3d,stroke:#9b59b6,color:#e0e0e0
```

Ralph creates fixture debt as a side effect, the same category of structural overhead that TestProf surfaced in the RSpec suite. Every model migration generates JIT fixtures, adds references, and modifies YAML files. After 98 models, the fixture layer accumulated systematic problems that individual migrations could not see.
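One concrete example of the kind of debt Phase 2 below targets (the model, fixture, and enum names here are hypothetical):

```ruby
# Hypothetical Rails model: the enum maps symbolic states to integers.
# (Rails 7 syntax; older apps write `enum state: { ... }`.)
class Transfer < ApplicationRecord
  enum :state, { draft: 0, pending: 1, settled: 2 }
end

# A JIT-generated fixture that hardcodes the integer:
#
#   # test/fixtures/transfers.yml
#   pending_transfer:
#     state: 1        # silently wrong if the enum mapping changes
#
# versus the symbolic form the cleanup rewrites it to:
#
#   pending_transfer:
#     state: pending
```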
This is the third layer in the error-handling stack: Felix exists to catch what Ralph structurally could not, because the problems only become visible across files, not within them. Felix is structured as four separate discovery-and-fix loops, each targeting a specific class of fixture problem:

Phase 1: Selective Loading (96 files, 144 sessions, fully reverted). The goal was to replace the global fixtures :all default with explicit per-file fixtures declarations, so each test class only loads the fixture tables it needs. The orchestrator ran overnight, traced fixture dependencies for all 96 files, and added declarations. The next morning, I discovered that the model dependencies in this codebase were too deeply interconnected for selective loading to work. Foreign key chains meant that loading one fixture table required loading most of the others anyway. I reverted to fixtures :all. Those 144 sessions, roughly 21% of Felix's total, produced zero lasting value. A manual spike on 2-3 files would have revealed the interconnection problem in an hour, before committing to a full batch run.

Phase 2: Fix Enums (65 files, 227 sessions). Fixture files contained hardcoded integer enum values (state: 1) instead of symbolic references. When enum definitions change, hardcoded integers silently produce wrong test data. This phase was the most session-heavy because each file required the agent to read the model source, trace enum definitions through concerns, build the integer-to-symbol mapping, and replace the values. That is why a “simple” transformation averaged 3.5 sessions per file.

Phase 3: Consolidation (96 files, 88 sessions). Fragmented fixture references accumulated across migrations; one test file referenced 43 separate fixtures. The consolidation pass merged related fixture references and removed unused ones.

Phase 4: YAML Anchors (5 files, 22 sessions). I expected this to be a big win: the largest fixture files had massive duplication across 50-141 entries. In practice, only 5 files qualified, and the impact was minimal. Most fixture files needed entirely new fixture designs rather than deduplication of existing ones.

Each phase had its own Ruby discovery script. Phase 2's, for example, scanned fixture YAML for integer values in known enum columns:

```ruby
# Simplified Phase 2 discovery: find fixtures with hardcoded enum integers
fixture_files = Dir.glob("test/fixtures/**/*.yml")

fixture_files.each do |path|
  content = File.read(path)
  # Match lines like "  state: 1" or "  status: 0" (exact 2-space indent)
  if content.match?(/^  \w+:\s+\d+\s*$/)
    puts path
  end
end
```

(The initial regex was ^\s+\w+, which matched nested YAML hashes and produced 80 false positives across 7 files. Tightening it to an exact 2-space indent eliminated the false matches. I caught this mid-run and had to restart, one of the interventions discussed below.)
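A minimal repro of that false positive (the YAML snippet is hypothetical, and the loose pattern's full form is inferred from the described fix, since only its prefix is quoted above):

```ruby
yaml = <<~YAML
  alice:
    state: 1
    profile:
      age: 42
YAML

loose  = /^\s+\w+:\s+\d+\s*$/  # any indent: also matches nested hash values
strict = /^  \w+:\s+\d+\s*$/   # exact 2-space indent: top-level attributes only

yaml.lines.grep(loose).map(&:strip)   # => ["state: 1", "age: 42"]  (false positive)
yaml.lines.grep(strict).map(&:strip)  # => ["state: 1"]
```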
Fixture Cleanup Results: 481 Sessions Across 4 Phases

| Phase | Files | Sessions | Wall-Clock |
|---|---|---|---|
| 1: Selective Loading | 96 test files | 144 | ~2-3 days |
| 2: Fix Enums | 65 fixture files | 227 | ~2-3 days |
| 3: Consolidation | 96 test files | 88 | ~1-2 days |
| 4: YAML Anchors | 5 fixture files | 22 | ~half day |
| Total | 262 file-phase ops (~161 unique) | 481 | ~6-8 days |

Phases 1 and 3 operate on the same 96 test files. Phases 2 and 4 operate on fixture files (Phase 4's 5 files are a subset of Phase 2's 65). The 262 figure counts file-phase operations; the unique file count is 96 test files + 65 fixture files = ~161.

Felix generated 1.7x more sessions than Ralph (481 vs 283) despite targeting narrower per-file transformations. Two factors explain the volume: ~161 files vs 98 models, and Phase 2's enum replacement being harder than it looked (3.5 sessions per file, because tracing enum definitions through model inheritance and concerns required more context than a “simple find-and-replace” would suggest). Volume dominates. Simple work multiplied by hundreds of file-phase operations produces more total agent sessions than complex work on fewer files.

Felix required 6 manual interventions across all 4 phases, mostly concentrated in early Phase 2 debugging. Different failure modes than Ralph: regex false positives in discovery scripts, parallel execution blockers from partial fixture state, and the Phase 1 revert decision. The third layer catches different problems than the first two.

Combined Results: 764 Sessions
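The headline rates follow directly from the counts above; a quick consistency check (the 15/6 split of interventions between Ralph and Felix is derived from the stated totals, not reported directly for Ralph):

```ruby
sessions      = 283 + 481           # Ralph + Felix              => 764
interventions = 21                  # total human interventions
felix_share   = 6                   # stated for Felix
ralph_share   = interventions - felix_share            # => 15 (derived)

models       = 98
unique_files = 98 + 161             # model + fixture/test files => ~259

(models - ralph_share) / models.to_f                # => ~0.85 model-level rate
(unique_files - interventions) / unique_files.to_f  # => ~0.92 file-level rate
```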

This analysis was produced by the Genesis Park editorial team with the help of AI. The original article is available via the source link.
