The Benchmark Gap: 1,472 runs show what happens when a coding agent's context changes

hackernews | 🔬 Research
#claude #gemini #gpt-5 #review
Original source: hackernews · Summarized and analyzed by Genesis Park

Summary

Across 1,472 OpenCode test runs, this study analyzed how the context limits of AI coding tools affect real-world performance. Tool-heavy coding agents consumed roughly 21,000 tokens of built-in overhead before work even began, leaving only about 11,000 of a 32,000-token context actually usable for the task. This suggests the gap between vendor benchmarks and third-party tools arises because vendor benchmarks use cleaner contexts and more aggressive reset strategies.

Full Text

1,472 OpenCode runs reveal what vendor scores don't tell you about AI coding tools. The study also includes a 60-run supplemental Aider/Cline probe, a 48-run installed-CLI smoke test, and a 10-row current-OpenCode context probe to test whether the 32K failure transfers across coding runtimes.

Author: Doruk Ardahan. Independent personal research. I work at 0G Foundation; this project is personal and has no connection to 0G.

| What we tested | What the vendor reports | What we got |
|---|---|---|
| GLM-5.1 in OpenCode at 32K context | GLM-5 technical report: 75.9% (BrowseComp) | 0/72 (0%) |
| GLM-5.1 in OpenCode at 80K+ context | — | 98.7% (clear thinking) |
| GLM-5.1 in OpenCode at 80K, preserved thinking | — | 100% (50/50) |
| GLM-5.1 in Aider/Cline at 32K | — | Aider 9/15, Cline 15/15 artifact checks |
| GLM-5.1 in current OpenCode 1.14.18 | — | 32K simple-edit timeout; 80K smoke subset 9/9 |

The 75.9% BrowseComp figure is from Z.AI's GLM-5 technical report. Our 0/72 OpenCode result uses GLM-5.1 from the same model family at a nominal 32K context, but not the same exact model, task suite, or harness.

Why the gap? Tool-heavy coding agents can carry substantial built-in context before the task really starts. In our OpenCode measurement, the built-in overhead was ~21K tokens, leaving only ~11K tokens for actual work at 32K context (see the headroom sketch below). Vendor benchmarks can use custom harnesses with much cleaner context and more aggressive reset strategies than third-party tools expose. Z.AI's own technical paper transparently documents their custom harness. The point here is the gap between published harness results and real tool use.

The supplemental Aider/Cline probe narrows the claim: OpenCode's 32K failure is runtime-dependent, not proof that GLM-5.1 is universally unable to work at 32K. A small current-OpenCode context probe still found the same direction inside OpenCode itself: 32K timed out on a simple-edit probe, while 80K passed all nine smoke-task repeats.

This repo does show:

- What happened in an OpenCode 1.3.17 --pure baseline across 1,472 runs on this task suite.
- What happened in a small 60-run Aider/Cline probe, a 48-run installed-CLI smoke, and a current-OpenCode context probe on three of the same tasks.
- How a large built-in context budget can crush effective working context at 32K.
- That preserved thinking beat `clear_thinking: true` in the matched GLM-5.1 80K/t=0.0 control.

This repo does not show:

- That vendor benchmarks are fake or dishonest.
- That every 32K deployment of GLM-5.1 fails.
- That all coding tools have OpenCode's startup overhead.
- That the supplemental probes are sufficient to rank coding tools.
- That these model rankings transfer unchanged to other tools or task suites.

For users:

- If you use GLM models in OpenCode, check preserved thinking (the vendor default). In the matched 80K/t=0.0 control it improved the success rate and used 13.3% fewer total tokens overall, with much larger savings on shorter tasks.
- In this suite, GLM-5-Turbo was the speed-efficiency pick (93.6% success; among passing runs, 1.7x faster than GLM-5.1 and 35% fewer tokens).
- In this suite, GLM-5.1 had the highest benchmark success (98.7% with clear thinking, 100% in the matched preserved-thinking control).
- Avoid 32K context in the tested OpenCode 1.3.17 baseline. If your setup has similar overhead, start at 80K+.
- The supplemental probes show 32K can work in other runtimes on a small task subset, while a current-OpenCode context probe still timed out at 32K and passed at 80K.

Treat the OpenCode 0/72 as a runtime-specific warning, not a universal GLM-5.1 limit.
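To make the headroom arithmetic concrete, here is a minimal back-of-the-envelope sketch (not part of the benchmark code) that subtracts a tool's built-in overhead from its nominal context window. The overhead values are the approximate figures reported in this study and in the tool table below; treat them as illustrative assumptions, not measured constants for your own setup.

```python
# Back-of-the-envelope effective-context calculator.
# Overhead figures are approximate values from this study and the public
# reports summarized in the tool table below; the Claude Code number is an
# assumed midpoint of its ~10-20K config-dependent range.

OVERHEAD_TOKENS = {
    "OpenCode (--pure)": 21_000,
    "Cline": 12_000,
    "Claude Code": 15_000,  # assumed midpoint, configuration-dependent
}

def effective_headroom(context_limit: int, overhead: int) -> int:
    """Tokens left for actual task work after built-in context is loaded."""
    return max(context_limit - overhead, 0)

for tool, overhead in OVERHEAD_TOKENS.items():
    for limit in (32_000, 80_000):
        room = effective_headroom(limit, overhead)
        print(f"{tool:>18} @ {limit // 1000:>2}K: ~{room:>6,} tokens free "
              f"({100 * room // limit}% of the window)")
```

At 32K, OpenCode's measured ~21K baseline leaves roughly a third of the window for the task, which is consistent with the 0/72 result above.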
For tool builders:

- Built-in background context matters more than you think. At 32K, a 21K baseline leaves ~11K for actual work.
- Public reports suggest some other tool-heavy coding agents also pay meaningful background-context costs:

| Tool | Built-in/background context | 32K headroom implication |
|---|---|---|
| Cline | ~12K | Tight but not immediately exhausted |
| Claude Code (bare/config-dependent) | ~10-20K | Configuration-dependent |
| OpenCode (--pure) | ~21K | Failed in this study |

For the research community:

- Runtime environment can dominate observed tool-mediated outcomes. Same model family, same nominal context limit, different harness/task suite = 0% vs 75.9%.
- In the supplemental probe, changing the coding runtime also changed 32K outcomes: Aider passed 9/15 artifact checks and Cline passed 15/15 on a three-task subset.
- Published benchmarks should always specify the full runtime environment.
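If you want to estimate your own tool's baseline, one rough approach is to capture everything the agent sends before the first real task (system prompt, tool schemas, injected project context) and count tokens. The sketch below is a hypothetical illustration, not code from this repo: the capture file names are made up, and the 4-characters-per-token heuristic should be replaced with your model's actual tokenizer for serious measurement.

```python
from pathlib import Path

# Crude heuristic: ~4 characters per token for English prose and code.
# Swap in your model's real tokenizer for accurate numbers.
def estimate_tokens(text: str) -> int:
    return max(len(text) // 4, 1)

def baseline_overhead(capture_paths: list[str]) -> int:
    """Estimate tokens across everything sent before the first user task."""
    return sum(
        estimate_tokens(Path(p).read_text(encoding="utf-8"))
        for p in capture_paths
    )

# Hypothetical capture files: dump your agent's pre-task payloads here.
captures = ["system_prompt.txt", "tool_schemas.json", "project_context.md"]
overhead = baseline_overhead(captures)
print(f"Estimated built-in overhead: ~{overhead:,} tokens")
print(f"Estimated headroom at 32K:  ~{32_000 - overhead:,} tokens")
```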
Repository layout:

```
REPRODUCTION.md            # How to verify headline numbers without API calls
ARTIFACTS.md               # Public package inventory and dry-run verification status
CLAIMS.md                  # Claim-to-evidence table for reviewer audit
REVIEWER-FAQ.md            # Direct answers to likely reviewer/community objections
BENCHMARK-PROTOCOL-v2.md   # Protocol for clean future replication and runtime probes
ENVIRONMENT-ISOLATION.md   # Clean host/user/config isolation runbook
0G-SANDBOX-EVALUATION.md   # 0G Sandbox feasibility and smoke-test plan
RELEASE-CHECKLIST.md       # Final public-repo release checklist
PUBLIC-REPO-MANIFEST.txt   # Exact public file list
SHA256SUMS                 # Checksums for public release files
the-benchmark-gap.pdf      # PDF version of the paper
benchmark/
  runner.py                # Orchestrates headless OpenCode runs
  config.py                # Test matrix configuration
  verify.py                # Per-task verification with tolerant matching
  db.py                    # Results database + OpenCode session parser
  rebuild_metrics_with_reruns.py  # Rebuilds corrected headline metrics after Task 05/10 reruns
  cluster_uncertainty.py   # Cluster-aware bootstrap sensitivity check
  cross_runtime_probe.py   # Runs the supplemental Aider/Cline probe
  cli_runtime_smoke.py     # Runs the supplemental installed-CLI/model smoke probe
  0g_sandbox_smoke.py      # Non-counted 0G Sandbox smoke wrapper
  0g-sandbox.env.example   # Secret-free template for local 0G Sandbox env
  tasks/                   # 10 task categories with setup, prompts, expected output
  reruns/                  # Replacement DBs for the strengthened Task 05/10 cells
  results-glm5-final.db
  results-glm51-final.db
  results-glm5-turbo-final.db
  results-control-c1.db    # Preserved thinking at 32K
  results-control-c4.db    # Preserved thinking at 80K
  results.db               # Mixed utility DB; use the C3 slice (temperature=0.0 AND max_context=32000)
  reports/                 # Per-model analysis reports and supplemental smoke reports
  cross-runtime-probe.db   # Supplemental Aider/Cline probe results
  cli-runtime-smoke.db     # Supplemental installed-CLI/model smoke results
  results-opencode-current-32k-probe.db  # Current OpenCode 32K context probe
  results-opencode-current-mini-80k.db   # Current OpenCode 80K smoke rerun
paper/
  paper.md                 # Full paper with methodology, results, limitations
  charts/                  # Publication figures
panel-reviews/             # Independent reviews by Claude Opus 4.6, GPT-5.4, Gemini 3.1 Pro
```

To reproduce:

```bash
# Clone the public release package
git clone https://github.com/dorukardahan/benchmark-gap.git
cd benchmark-gap/benchmark

# Create venv and install deps
python3 -m venv .venv
source .venv/bin/activate
pip install pytest

# Dry run (no API calls)
python3 runner.py --tasks p0 --dry-run

# Run P0 tasks (requires OpenCode 1.3.17 + Z.AI API access)
python3 runner.py --tasks p0 --resume

# Rebuild the corrected headline metrics used in the paper
python3 rebuild_metrics_with_reruns.py

# Inspect the raw GLM-5.1 base DB before Task 05/10 rerun replacement
sqlite3 results-glm51-final.db "SELECT task_category, COUNT(*), SUM(pass_fail), ROUND(100.0*SUM(pass_fail)/COUNT(*),0) AS pct FROM runs WHERE max_context != 32000 AND status='completed' GROUP BY task_category ORDER BY task_category"

# C4 control (preserved thinking)
sqlite3 results-control-c4.db "SELECT task_category, COUNT(*), SUM(pass_fail), ROUND(AVG(total_tokens_in + total_tokens_out),0) AS avg_tokens FROM runs GROUP BY task_category"
```

Database notes:

- `results-glm5-turbo-final.db` contains both GLM-5-Turbo (450 runs) and GLM-5.1 (500 runs including Matrix B). Filter by the `model` column.
- `results.db` is a mixed utility database. The C3 control slice is `temperature=0.0 AND max_context=32000`.
- Raw SQL queries against the base final DBs will not match the paper's corrected Task 05/10 headline numbers.
- The paper's headline tables are rebuilt from the base DBs plus `benchmark/reruns/` using `benchmark/rebuild_metrics_with_reruns.py` after the pre-publication verifier audit.

Matrix A (main benchmark): 3 models x 3 temperatures (0.0, 0.3, 0.7) x 3 context limits x 10 tasks x 5 runs = 450 per model, 1,350 total. Matrix B: GLM-5.1 at 32K context, t=0.7, 50 runs.

Controls:

| Control | Context | Temp | Thinking | Runs | Result |
|---|---|---|---|---|---|
| C1 | 32K | 0.7 | Preserved | 9 | 0% |
| C3 | 32K | 0.0 | Clear | 13 | 0% |
| C4 | 80K | 0.0 | Preserved | 50 | 100% |

Supplemental probe: GLM-5.1 in Aider and Cline, 2 contexts (32K, 80K) x 3 tasks x 5 runs = 60 runs. This is a targeted runtime-transfer check, not a full cross-tool benchmark.

Limitations:

- All main runs used `clear_thinking: true` (non-default). This is the study's largest confound. Our matched control shows preserved thinking improves success and reduces total token use overall.
- Author-designed synthetic tasks, not drawn from SWE-bench or HumanEval. Task difficulty is uncalibrated without a non-GLM baseline.
- Verifiers use tolerant matching (80% list overlap, ±1 integers, ±20% word count), so "pass" means "approximately correct"; see the sketch after this list.
- Primary runtime is OpenCode 1.3.17. The Aider/Cline probe, installed-CLI smoke, and current-OpenCode context probe are supplemental and not normalized enough to rank tools.
- Benchmark was not frozen before execution: 4 verifier bugs were fixed during the first model's run, and Task 05/10 verifiers were later strengthened with targeted reruns.
- Aggregate percentages are run-level for this fixed benchmark mix. The 450 runs per model are clustered across repeated tasks and configs, not 450 independent tasks.
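To make the tolerant-matching limitation concrete, the sketch below implements checks in the same spirit as the thresholds described above (80% list overlap, ±1 on integers, ±20% on word count). It is an illustration of the verification style under those assumed semantics; the repo's actual logic lives in benchmark/verify.py and may differ in detail.

```python
def list_overlap_ok(expected: list[str], actual: list[str],
                    threshold: float = 0.80) -> bool:
    """Pass if at least 80% of expected items appear in the output."""
    if not expected:
        return True
    hits = sum(1 for item in expected if item in actual)
    return hits / len(expected) >= threshold

def integer_ok(expected: int, actual: int, tolerance: int = 1) -> bool:
    """Pass if an extracted integer is within +/-1 of the expected value."""
    return abs(expected - actual) <= tolerance

def word_count_ok(expected: int, actual: int,
                  rel_tolerance: float = 0.20) -> bool:
    """Pass if the word count is within +/-20% of the expected count."""
    return abs(actual - expected) <= rel_tolerance * expected

# "Pass" therefore means "approximately correct", as the limitations note:
assert list_overlap_ok(list("abcde"), ["a", "b", "c", "d", "x"])  # 4/5 = 80%
assert integer_ok(42, 43)        # off by one still passes
assert word_count_ok(100, 115)   # 15% over still passes
```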
Read the full paper for complete methodology, limitations, and statistical notes. Use REPRODUCTION.md to verify the headline numbers from the SQLite databases and rerun replacement files. Use CLAIMS.md to map major claims to evidence. Use ARTIFACTS.md and SHA256SUMS to inspect the public package and verify file integrity. Use REVIEWER-FAQ.md for direct answers to the likely objections: synthetic tasks, same-family comparison, verifier evolution, `clear_thinking`, n=5/cell, and the Aider/Cline probe. Use BENCHMARK-PROTOCOL-v2.md, ENVIRONMENT-ISOLATION.md, and 0G-SANDBOX-EVALUATION.md only for future clean replication or supplemental sandbox work; they do not change the published headline numbers.

If you reference this work: Ardahan, D. (2026). The Benchmark Gap: 1,472 Runs Reveal What Vendor Scores Don't Tell You About AI Coding Tools. MIT License.

This analysis was written by the Genesis Park editorial team with the assistance of AI. The original article is available via the source link.
