AI coding benchmarks are hiding a 2x quality gap.
hackernews
🔬 Research
#ai
#claude
#gpt-5
#review
#swe-bench
#benchmark
#coding
#quality
Original source: hackernews · Summarized and analyzed by Genesis Park
Summary
Judging current AI coding models by test pass rate alone overlooks a gap of 2x or more in actual code quality. Comparing models with nearly identical pass rates, the best model matched the human-written code 45.3% of the time, more than 1.6x higher than the worst model, and it cost less to run. Research from METR and others also shows that roughly half of the code that passes automated evaluation is rejected for merge when reviewed by actual maintainers. This suggests that test pass rate is only a minimum bar, not an accurate indicator of real code quality or acceptability.
Body
Your AI coding benchmark is hiding a 2x quality gap

The assumption
The base assumption that coding agent evals (SWE-Bench, Terminal Bench) make is that there is one primary metric to measure agent quality, and that metric is test pass rate. Claude Code with Opus 4.6 passes tests 73% of the time. Codex with GPT 5.4 passes 80%. Ship GPT 5.4. EZ, right?

How we measured
We ran 3 models against 87 tasks, drawn from 3 real open-source repos: Zod, graphql-go-tools, and sqlparser-rs. Each task is a real PR or commit that was merged to the repo. The agent gets the repo prior to the merge, and instructions to do the task. The PR's own tests decide if the agent's change passes or fails.
Pass rate is the gate. But we also score three quality dimensions above it:
- Equivalence — how closely does the agent's patch match the real PR that was merged?
- Code review — would another model pass or fail the agent's patch in review?
- Footprint risk — how many unnecessary changes did the agent make?

The dead heat
On 87 shared W2 tasks, the pass rates are almost identical:
- gpt-5.1-codex-mini: 77/87 (88.5%)
- gpt-5.3-codex: 78/87 (89.7%)
- gpt-5.4: 78/87 (89.7%)
That sounds like a tie, but it isn't. Mini and 5.3 agree on 82/87 tasks: 75 both-pass, 7 both-fail, 5 mixed. The pass-rate headline is only moving on five tasks. So I looked at the 75 tasks where both agents pass the tests. Same pass rate. Completely different code. 5.3 is 1.6x more likely than mini to match the human patch. 5.4 is best across the board — highest equivalence, best review pass rate, lowest footprint risk — and the cheapest at $1.34/task.

METR confirmed it
Different methodology, same conclusion. METR had 4 active maintainers from scikit-learn, Sphinx, and pytest review 296 AI-generated PRs that passed the automated grader. ~50% would not be merged.
"We find that roughly half of test-passing SWE-bench Verified PRs written by mid-2024 to mid/late-2025 agents would not be merged into main by repo maintainers, even after adjusting for noise in maintainer merge decisions."

Others are seeing the same thing
Voratiq found the same pattern in their own workflow across 4,784 candidate patches: test-passing candidates were selected 1.8x more often, but top-reviewed candidates were selected 9.9x more often.
"Tests are a weak proxy for the code teams actually accept." — Voratiq, March 2026

Tests are the gate, not the source of truth
Pass rate is where models agree. The quality above the gate — equivalence, review, footprint, cost — is where they diverge. If you're choosing agents by the one metric where they all look the same, you're not choosing.
If you want the full picture: /why
If this matches what you're seeing: [email protected]
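To make the "gate plus quality dimensions" setup concrete, here is a minimal Python sketch. It is not the author's harness; the field names (equivalence, review_passed, footprint) and the scoring are illustrative assumptions that mirror the three dimensions described above. The idea it shows: pass rate splits tasks into both-pass / both-fail / mixed buckets, and quality is compared only on the tasks where both agents clear the gate.

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    """One agent's outcome on one benchmark task (field names are illustrative)."""
    task_id: str
    passed: bool          # did the PR's own tests pass? (the gate)
    equivalence: float    # 0..1 similarity to the merged human patch
    review_passed: bool   # did a reviewer model approve the patch?
    footprint: int        # lines touched beyond what the task required

def pass_rate(results: list[TaskResult]) -> float:
    return sum(r.passed for r in results) / len(results)

def compare_above_the_gate(a: list[TaskResult], b: list[TaskResult]) -> dict:
    """Compare two agents only on tasks where BOTH pass the tests,
    which is exactly where pass rate alone can no longer separate them."""
    b_by_id = {r.task_id: r for r in b}
    both_pass, both_fail, mixed = [], [], []
    for ra in a:
        rb = b_by_id[ra.task_id]
        if ra.passed and rb.passed:
            both_pass.append((ra, rb))
        elif not ra.passed and not rb.passed:
            both_fail.append((ra, rb))
        else:
            mixed.append((ra, rb))
    n = len(both_pass) or 1
    return {
        "pass_rate": (pass_rate(a), pass_rate(b)),
        "agreement": (len(both_pass), len(both_fail), len(mixed)),
        # quality dimensions, scored only above the gate
        "mean_equivalence": (sum(ra.equivalence for ra, _ in both_pass) / n,
                             sum(rb.equivalence for _, rb in both_pass) / n),
        "review_pass_rate": (sum(ra.review_passed for ra, _ in both_pass) / n,
                             sum(rb.review_passed for _, rb in both_pass) / n),
        "mean_footprint":   (sum(ra.footprint for ra, _ in both_pass) / n,
                             sum(rb.footprint for _, rb in both_pass) / n),
    }
```

Run against the numbers in the article, the "agreement" tuple for mini vs. 5.3 would read (75, 7, 5) on the 87 shared tasks, while the two pass rates differ by a single task; the separation between the models only shows up in the equivalence, review, and footprint entries.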
This analysis was written by the Genesis Park editorial team with the help of AI. The original article is available via the source link.