Show HN: AWB – A benchmark that tests AI coding workflows, not just models

hackernews | 📦 Open source
#ai-coding #claude #review #swe-bench #benchmark #workflow #testing
Original source: hackernews · Summarized and analyzed by Genesis Park

Summary

AWB is an open-source tool that goes beyond evaluating raw model capability to benchmark an AI coding tool's workflow, configuration, and tooling as a whole. It measures performance on 100 tasks drawn from real open-source repositories across seven metrics, including correctness, cost efficiency, and speed, weighting each task by its difficulty. This lets users analyze how custom configuration (such as a tuned CLAUDE.md) affects a model's results and visualize per-capability gaps in order to optimize their workflow.

Body

Measure AI coding tool + workflow performance, not just model capability. Install from PyPI, validate 100 tasks, run vanilla vs. custom, and get capability profiles and improvement suggestions.

SWE-bench tests models. AWB tests workflows. The same model running vanilla Claude Code vs. a purpose-built setup with a tuned CLAUDE.md, hooks, and structured agents produces meaningfully different results on real engineering tasks. No existing benchmark captures that gap; they all evaluate the model in isolation. AWB benchmarks the full stack: tool + configuration + workflow + model, together, on 100 tasks drawn from real open-source repositories.

```bash
pip install awb
awb quickstart                            # verify your setup
awb run --runs 3 --parallel --adaptive    # full 100-task benchmark (parallel, smart re-runs)
awb run --category workflow --runs 1      # workflow tasks only (quick test)
awb gap results/runs//                    # analyze capability gaps
```

Pipeline: Clone repo at pinned SHA → Run setup commands → Capture baseline lint/security counts → Execute tool with task prompt → Run test suite + partial credit rubric → Sigmoid-normalize 7 metrics → Produce weighted composite + capability profile.

Each task starts from a fresh git clone at a pinned commit. Every tool gets the same prompt, the same timeout, and the same verification suite. Results are scored with sigmoid normalization, so scores are never negative and never collapse at the boundary.

Seven dimensions, sigmoid-normalized with per-task baselines derived from difficulty:

| Dimension | Weight | What It Measures |
|---|---|---|
| Correctness | 55% | Pass/fail (60%) + partial credit rubric (40%) |
| Cost efficiency | 15% | Estimated USD per task |
| Speed | 10% | Wall-clock seconds vs. estimated task time |
| Code quality | 10% | Lint warning delta (pre vs. post) |
| Reliability | 5% | Pre-existing tests broken by the change |
| Security | 3% | New security issues introduced |
| Efficiency | 2% | Tool turns used vs. task max |

Sigmoid curve: `score = 100 / (1 + exp(k * (value - baseline)))`

- Optimal performance (excellent) → ~95
- Baseline performance (adequate) → ~50
- Above baseline → smooth decay, never negative

Difficulty-weighted aggregation: hard tasks count 2.5×, medium 1.5×, easy 1.0×. A tool that solves hard tasks beats one that only solves easy ones, even if the easy-task count is higher.

Per-task baselines by difficulty:

| Metric | Easy | Medium | Hard |
|---|---|---|---|
| Cost optimal / baseline | $0.05 / $0.30 | $0.20 / $1.00 | $1.00 / $3.00 |
| Speed | 50% / 100% of estimated_minutes | same | same |
| Iterations | 3 / max_iters | 8 / max_iters | 15 / max_iters |

Real open-source repos, pinned to release tag SHAs. Setup runs in under 15 seconds via venv + pip (Python) or npm (TypeScript).
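For readers who want to see the shape of that pipeline in code, here is a minimal sketch of a single evaluation pass. It assumes ruff and pytest as the lint and test commands and an injected `run_tool` callable standing in for the coding tool; the task dictionary keys are illustrative, not AWB's actual interface.

```python
import json
import subprocess
import tempfile
from pathlib import Path

def count_lint_warnings(repo: Path) -> int:
    """Illustrative baseline capture: count ruff diagnostics (tool choice is an assumption)."""
    proc = subprocess.run(["ruff", "check", "--output-format=json", "."],
                          cwd=repo, capture_output=True, text=True)
    return len(json.loads(proc.stdout or "[]"))

def evaluate_task(task: dict, run_tool) -> dict:
    """One pass of the pipeline above: fresh clone at a pinned SHA, setup,
    baseline counts, tool execution, then verification."""
    repo = Path(tempfile.mkdtemp(prefix=task["id"] + "-"))
    subprocess.run(["git", "clone", task["repo_url"], str(repo)], check=True)
    subprocess.run(["git", "checkout", task["pinned_sha"]], cwd=repo, check=True)
    for cmd in task["setup_commands"]:                 # e.g. venv + pip, or npm install
        subprocess.run(cmd, cwd=repo, shell=True, check=True)

    baseline_lint = count_lint_warnings(repo)          # pre-change lint count
    result = run_tool(task["prompt"], cwd=repo, timeout=task["timeout_s"])

    tests = subprocess.run(["pytest", "-q"], cwd=repo, capture_output=True)
    return {
        "tests_passed": tests.returncode == 0,         # feeds the 60% pass/fail share
        "lint_delta": count_lint_warnings(repo) - baseline_lint,
        "cost_usd": result["cost_usd"],
        "seconds": result["seconds"],
        "turns": result["turns"],
    }
```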
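And a short sketch of the scoring math: sigmoid normalization against a difficulty-dependent baseline, the seven-dimension weighted composite, and the 2.5×/1.5×/1.0× difficulty weighting. The steepness `k` is not given in the post, so the value here is an assumption, presumably tuned per metric so the optimal value lands near the stated ~95.

```python
import math

METRIC_WEIGHTS = {   # the seven dimensions and weights from the table above
    "correctness": 0.55, "cost": 0.15, "speed": 0.10, "quality": 0.10,
    "reliability": 0.05, "security": 0.03, "efficiency": 0.02,
}
DIFFICULTY_WEIGHT = {"easy": 1.0, "medium": 1.5, "hard": 2.5}
COST_BASELINE_USD = {"easy": 0.30, "medium": 1.00, "hard": 3.00}   # from the baseline table

def sigmoid_score(value: float, baseline: float, k: float = 1.5) -> float:
    """score = 100 / (1 + exp(k * (value - baseline))); k is an assumed steepness."""
    return 100.0 / (1.0 + math.exp(k * (value - baseline)))

def composite(scores: dict) -> float:
    """Weighted composite over the seven normalized dimensions (weights sum to 1.0)."""
    return sum(METRIC_WEIGHTS[m] * s for m, s in scores.items())

def benchmark_score(task_results: list) -> float:
    """Difficulty-weighted mean of per-task composites: hard tasks count 2.5x easy ones."""
    num = sum(DIFFICULTY_WEIGHT[t["difficulty"]] * composite(t["scores"]) for t in task_results)
    den = sum(DIFFICULTY_WEIGHT[t["difficulty"]] for t in task_results)
    return num / den

# A hard task at the $1.00 optimal cost scores ~95; at the $3.00 baseline it scores exactly 50.
print(round(sigmoid_score(1.00, COST_BASELINE_USD["hard"]), 1))   # ~95.3
print(round(sigmoid_score(3.00, COST_BASELINE_USD["hard"]), 1))   # 50.0
```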
| Category | Count | Easy / Med / Hard | What It Tests |
|---|---|---|---|
| bug-fix | 12 | 7 / 1 / 4 | Root cause analysis, test-first diagnosis, N+1 queries |
| feature-addition | 9 | 3 / 0 / 6 | Convention adherence, ambiguous requirements, Dockerfiles, TypeScript typing |
| refactoring | 11 | 5 / 2 / 4 | Multi-file consistency, O(n^2) optimization, CI/CD config, async migration |
| code-review | 9 | 4 / 2 / 3 | Security review (report-only), concurrency analysis, migration guides, OWASP |
| debugging | 10 | 7 / 0 / 3 | Performance profiling, regression bisection, stack trace diagnosis |
| multi-file | 7 | 4 / 0 / 3 | Merge conflicts, plugin systems, auth chains |
| legacy-code | 12 | 9 / 0 / 3 | SQLAlchemy 2.0 migration, 20-file codebase navigation, dead code removal |
| workflow | 30 | 9 / 12 / 9 | Completeness tracking, convention discovery, security methodology, context utilization, async safety, config extraction, test-driven implementation |

Repos used: FastAPI, httpx, Flask, Starlette, Click, Pydantic, SQLAlchemy 2.0, Hono

Task IDs: BF-001–014 · FA-001–010 · RF-001–012 · CR-001–010 · DB-001–011 · MF-001–009 · LC-001–012 · WF-001–030

Each task maps to 1–3 capabilities, producing a radar chart of tool strengths:

| Capability | Tasks | What It Measures |
|---|---|---|
| code_comprehension | 41 | Understanding existing code before modifying |
| framework_knowledge | 35 | Knowing API patterns (Pydantic v2, async SQLAlchemy, etc.) |
| bug_diagnosis | 26 | Structured root cause analysis, test-first diagnosis |
| refactoring_discipline | 26 | Changing code without breaking behavior |
| multi_file_reasoning | 23 | Coordinating changes across multiple files |
| completeness_tracking | 10 | Following all requirements, not stopping at 80% |
| convention_adherence | 10 | Discovering and following project conventions |
| context_discovery | 10 | Reading project docs and config before editing |
| test_writing | 10 | Writing correct, meaningful tests |
| security_awareness | 10 | Identifying and fixing vulnerabilities |
| security_methodology | 10 | Applying security checklists systematically |
| cost_discipline | derived | Token efficiency across all tasks |

Example `awb gap` output:

```
Capability Profile
------------------
code_compr
```
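One way to read that capability profile is as an average of task scores over the capabilities each task exercises. The sketch below illustrates that idea only; the function name and the example task-to-capability mapping are assumptions, not AWB's internal API.

```python
from collections import defaultdict

def capability_profile(task_scores: dict, task_capabilities: dict) -> dict:
    """Average each task's composite score into every capability it maps to (1-3 per task)."""
    totals, counts = defaultdict(float), defaultdict(int)
    for task_id, score in task_scores.items():
        for cap in task_capabilities.get(task_id, []):
            totals[cap] += score
            counts[cap] += 1
    return {cap: totals[cap] / counts[cap] for cap in totals}

# Illustrative call with made-up scores; task IDs follow the ranges listed above.
profile = capability_profile(
    {"BF-001": 82.0, "WF-003": 47.5},
    {"BF-001": ["bug_diagnosis", "code_comprehension"],
     "WF-003": ["completeness_tracking", "context_discovery"]},
)
for cap, score in sorted(profile.items(), key=lambda kv: kv[1]):
    print(f"{cap:24s} {score:5.1f}")   # lowest first = largest capability gaps
```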

This analysis was written by the Genesis Park editorial team with the help of AI. The original post is available via the source link.
