Maintenance agent: 412 fixed, 14 refused. The 14 is the point

#claude #opensource #maintenance #automation #codebase
Source: hackernews · Summarized and analyzed by Genesis Park

Summary

Running a maintenance agent loop against a real SaaS codebase for 10 days surfaced a total of 559 bugs. The agent fixed 412 of them on its own, escalated 14 for developer help, marked 31 as won't-fix, and left 100 in the open queue. Seven specialist agents worked in parallel over roughly 200k lines of code (a Go backend and a TypeScript/Next.js frontend), writing regression tests and applying minimal fixes in isolated environments.

Full text

I ran a maintenance agent loop against a real codebase for 10 days. It filed 559 bugs, fixed 412 on its own, asked me for help on 14, and marked 31 as won't-fix. The remaining 100 are in the open queue. The interesting number isn't the 412. It's the 14.

The setup

The codebase is a working SaaS: Go backend, TypeScript/Next.js frontend, roughly 200k lines between them, plus a non-trivial migration history. Not a toy project.

The loop is boring. Fresh claude -p per bug, no conversation chains, no long-lived agent memory. Seven specialist agents run in parallel (security, performance, tests, database, code-quality, config, api-design), each scanning the codebase independently and writing findings as markdown files. A consolidator pass deduplicates. Then a fix loop processes each bug in its own isolated subprocess: read the bug, read the code, write a regression test that reproduces the issue, apply a minimal fix, run the full test suite, and only commit if everything passes. If tests fail after two retry cycles, the fix reverts and the bug flips to needs-human. A fix that breaks the build doesn't ship.

One rule, strictly enforced in the prompt: audit-only, never add. The agent isn't allowed to introduce new infrastructure, new services, new dependencies, or new tests beyond regression coverage. If a fix requires adding something, the agent doesn't fix it. It escalates. That constraint is the whole bet.
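The post describes this loop without publishing it, so the following is only a sketch under those constraints. The prompt wording, the test command, and the markNeedsHuman helper are illustrative assumptions; what comes from the post is the shape: a cold `claude -p` subprocess per bug, a full-suite gate, two retry cycles, then revert and escalate.

```go
// fixloop.go: a minimal sketch of the per-bug fix loop described above.
// The prompt wording and helper names are illustrative assumptions; only the
// shape (fresh `claude -p` per bug, two retries, revert on red) is from the post.
package main

import (
	"fmt"
	"os/exec"
)

const maxRetries = 2

// attemptFix runs one fresh `claude -p` subprocess for a single bug file.
// No conversation chain, no long-lived memory: each attempt starts cold.
func attemptFix(bugFile string) error {
	prompt := fmt.Sprintf("Read %s and the code it points at. Write a regression test "+
		"that reproduces the issue, then apply a minimal fix. Audit-only: never add "+
		"new services, dependencies, or infrastructure.", bugFile)
	if out, err := exec.Command("claude", "-p", prompt).CombinedOutput(); err != nil {
		return fmt.Errorf("agent subprocess failed: %w\n%s", err, out)
	}
	// Only a green full suite may ship; a fix that breaks the build doesn't.
	if out, err := exec.Command("go", "test", "./...").CombinedOutput(); err != nil {
		return fmt.Errorf("test suite red: %w\n%s", err, out)
	}
	return nil
}

// processBug commits on the first green attempt; after two failed retry
// cycles it reverts the working tree and flips the bug to needs-human.
func processBug(bugFile string) {
	for attempt := 0; attempt <= maxRetries; attempt++ {
		if attemptFix(bugFile) == nil {
			exec.Command("git", "commit", "-am", "fix: "+bugFile).Run()
			return
		}
		exec.Command("git", "checkout", "--", ".").Run() // discard the red attempt
	}
	markNeedsHuman(bugFile)
}

// markNeedsHuman would update the bug file's status frontmatter; stubbed here.
func markNeedsHuman(bugFile string) { fmt.Println("needs-human:", bugFile) }

func main() { processBug("bugs/bug-audit-001.md") }
```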
What happened

Over 10 days and 35 pipeline runs, the distribution landed here:

- 412 auto-resolved (73.7%)
- 100 open (17.9%): the working queue
- 31 won't-fix (5.5%): bugs the agent judged not worth fixing, each with reasoning
- 14 needs-human (2.5%): escalations
- 2 regressions across 754 ever-created bug IDs: bugs that got deduplicated and re-filed by a later run

"Bug" here is loose. These are what the seven specialists flag: security gaps, perf concerns, missing tests, config drift, code-quality nits, API contract questions. Audit findings, not user-reported production failures. I'm using the agent's own word.

The retry rate is low. Of the 412 auto-resolved bugs, only 28 needed more than one commit, and the max was 5. The loop isn't thrashing on any single bug.

But the headline of "73.7% auto-resolved" isn't the story. Any agent loop can post a high auto-resolution rate if you let it touch anything. The question is whether that number is honest work, or whether the agent papered over things it shouldn't have touched.

The specialist variance

One chart carries most of the argument: security bugs auto-fix, API design bugs don't. Security sits at 90% because most real-world security fixes are small and self-contained: one missing authorization check, one query that should be parameterized, one validator that accepts the empty string. You can see the fix in three lines and verify it with a regression test. There's nothing downstream to break.

API design sits at 51% because the fix usually means changing a public contract: renaming a field, changing a response shape, splitting an endpoint, tightening a validator that someone's client already depends on. Contracts have consumers. The correct move when the blast radius leaves the file is to stop and ask.

The spread is the argument. If auto-resolution were uniformly high across specialists, I'd be suspicious. The fact that it varies in the way it should (high where fixes are small and localized, low where fixes are structural) is evidence the agent is making real calls, not just shipping green checkmarks.

The 14

The needs-human bugs are where the audit-only rule earns its keep. These are the ones the agent refused. Each one has a written reason. Here are five, trimmed:

- Schema migration with data-loss risk. Changing the salt changes every derived key in the system. This would: 1. Make all existing AES-encrypted data undecryptable. 2. Invalidate all existing PIN hashes; employees would be unable to clock in. A fix requires a coordinated data migration. The current nil-salt behavior is valid per RFC 5869; HKDF is designed to work without a salt.
- Cross-service infrastructure. SSO journey blocked. Not wired in this round because the change crosses three services, requires generating and storing a matching client_secret_encrypted under the app's MASTER_KEY (otherwise the seeded provider won't load), and needs Kratos test config. Doing it half-way is worse than leaving the journey marked blocked.
- Auth infrastructure change. This is not a localized bug fix; it's an infrastructure-level auth change. Switching Vault authentication from a static long-lived token to AppRole changes public function signatures, requires the production Vault server to be reconfigured with AppRole enabled and a role created before the code change can deploy, and needs renewal logic for token TTLs. Both services using the static-token pattern would need coordinated updates.
- Breaking API change. Change the users.email unique index to per-company scope. This exceeds auto-fix scope on multiple criteria: schema change, auth code changes, a new API error code (a public API surface change), and a blast radius of 8+ files across backend, frontend, i18n catalogs, SQL queries, and observability config, well beyond the <=5 file auto-fix threshold.
- Diagnostic dead-end. The root cause of the sync path failure in E2E is unclear: it could be file system permissions on exportDir, RLS context not being set up in the test API, the CSV generator failing silently, or the employee resolution returning unexpected counts. Needs actual test execution with debug logging to diagnose.

Every one of these could have been a fix commit. The agent had the ability to write code. It chose not to, and explained why in prose I could act on. This is what I care about. Not that the agent is good at fixing bugs. That it's good at knowing when not to.
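The post never shows the gate itself, only the criteria the escalation reasons keep citing: anything that adds something new, touches a schema, changes a public contract, or blows past the file threshold. A reconstruction of that check might look like the sketch below; only the audit-only rule and the <=5 file threshold come from the post, and the Finding type and every field name are assumptions.

```go
// gate.go: an illustrative reconstruction of the auto-fix gate implied by the
// escalation reasons above. Only the audit-only rule and the <=5 file
// threshold come from the post; the type and field names are assumptions.
package main

import "fmt"

// Finding summarizes what fixing one bug would have to touch.
type Finding struct {
	FilesTouched     int
	AddsNewThing     bool // new service, dependency, infra, or non-regression test
	ChangesSchema    bool // requires a database migration
	ChangesContract  bool // public API shape, field names, error codes
	RootCauseUnclear bool // diagnosis needs human-driven debugging
}

// needsHuman returns a written reason when the bug must escalate,
// or "" when a minimal in-place fix may proceed.
func needsHuman(f Finding) string {
	switch {
	case f.AddsNewThing:
		return "audit-only: the fix requires adding something new"
	case f.ChangesSchema:
		return "schema change: needs a coordinated data migration"
	case f.ChangesContract:
		return "public contract change: the blast radius leaves the file"
	case f.FilesTouched > 5:
		return "exceeds the <=5 file auto-fix threshold"
	case f.RootCauseUnclear:
		return "diagnostic dead-end: needs a human with debug logging"
	}
	return ""
}

func main() {
	f := Finding{FilesTouched: 8, ChangesSchema: true}
	if reason := needsHuman(f); reason != "" {
		fmt.Println("needs-human:", reason)
	}
}
```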
What didn't work

Everything above makes this sound cleaner than it was.

I ran out of token budget for six days. The loop ran Apr 12-13, then nothing until Apr 20. The whole setup runs inside a Claude Code Max subscription ($180/month): no enterprise plan, no special quota. The maintenance loop didn't burn it; a different loop I was running in parallel, driving a browser, did. But the result is the same: one agent eating the budget stalls every other agent sharing it. On a project this size with this cadence, "unlimited runs forever" is not the plan. You budget across loops, not just within one.

Commits weren't cleanly linked to bug IDs. The pipeline wrote fix commits with the bug title as the subject line, but no stable ID, so when I went to analyze retry rates I had to fuzzy-match commit subjects against bug titles. That worked (78% of bugs had a matching commit, 98% among auto-resolved), but it's not the kind of data you want to publish without caveats. I've since fixed it: every fix commit now includes a Bug: bug-audit-NNN trailer (sketched at the end of this post). The dataset in this post predates that change.

The auto-resolve rate is a slight undercount. Consolidator passes sometimes delete bug files in-place when they're duplicates. 187 of those deletions happened while the files weren't tracked in git, so their frontmatter is unrecoverable. Real auto-resolution volume is probably 5-6% higher than what I reported.

"Won't-fix" is not one category. The 31 won't-fix bugs range from "intentional design, confirmed via git blame" to "premature optimization" to "I think this is actually fine and don't want to spend more token budget on it." The label is honest but coarse.

When this pattern fits

The audit-only rule is the whole thing. It works because the constraint removes the biggest failure mode of autonomous agents, which is unrequested scope expansion. An agent that can add new services, new tests, new dependencies in the name of "fixing" something is an agent you can't sleep through. An agent that can only edit existing code, or refuse, is an agent that does one of two safe things per bug.

This pattern doesn't fit every use case. It won't work if your codebase has so much missing infrastructure that "don't add anything" means "don't make progress." It won't work for feature development, because features are addition by definition. It will feel slow if you want the agent to be heroic. But for maintenance, the boring, necessary, never-quite-finished work of keeping a codebase from rotting, the constraint makes the agent trustworthy enough to leave running. 14 escalations out of 559 bugs is a manageable review queue. Five minutes of human attention per escalation is about an hour a week. That's a trade I'll make.

The point isn't that the agent triaged faster than I could have. I could have worked through those 14 escalations in an afternoon. The point is that the agent did the 10 days of continuous scanning I wasn't going to do.

The reframe

The discourse around coding agents tends to fixate on what they can do. New features, new test suites, whole subsystems generated from a spec. That's the impressive demo. The less impressive but more useful question is what they should refuse to do. An agent that fixes 412 bugs and says "I don't know" 14 times is more useful to me than one that fixes 426 bugs, because I trust the first one. I know where its boundaries are. I can leave it running.

Teach your agents to say "I don't know." The number that matters isn't what they finished. It's what they chose not to touch.
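Appendix: the commit-trailer fix mentioned under "What didn't work" specifies only the trailer format, Bug: bug-audit-NNN. A minimal sketch of writing and querying that link could look like the following; the helper names and the example bug ID are hypothetical, and the query leans on git's built-in trailer parsing rather than subject-line matching.

```go
// trailers.go: a minimal sketch of the stable bug-to-commit link described in
// "What didn't work". Only the `Bug: bug-audit-NNN` trailer format comes from
// the post; the helper names and example bug ID here are hypothetical.
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// commitFix records a fix with a git trailer, so later analysis can join
// commits to bug IDs exactly instead of fuzzy-matching titles.
func commitFix(title, bugID string) error {
	msg := fmt.Sprintf("%s\n\nBug: %s", title, bugID)
	return exec.Command("git", "commit", "-am", msg).Run()
}

// commitsForBug returns the hashes of all commits carrying the given bug ID
// in a Bug: trailer. %x2C is a comma, used as the value separator.
func commitsForBug(bugID string) ([]string, error) {
	out, err := exec.Command("git", "log",
		"--format=%H\t%(trailers:key=Bug,valueonly,separator=%x2C)").Output()
	if err != nil {
		return nil, err
	}
	var hashes []string
	for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
		hash, ids, ok := strings.Cut(line, "\t")
		if ok && strings.Contains(ids, bugID) {
			hashes = append(hashes, hash)
		}
	}
	return hashes, nil
}

func main() {
	hashes, err := commitsForBug("bug-audit-042") // hypothetical bug ID
	if err != nil {
		panic(err)
	}
	fmt.Println(hashes)
}
```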

This analysis was produced by the Genesis Park editorial team with the help of AI. The original can be found via the source link.
