Show HN: Mdarena – Benchmark your Claude.md against your own PRs

hackernews | 📦 Open source
#claude #claude.md #pr #machine-learning/research #benchmark #agents #token-optimization
Original source: hackernews · Summarized and analyzed by Genesis Park

Summary

mdarena is a tool that helps developers benchmark CLAUDE.md files against their own codebase. It builds a task set from merged PRs, auto-detects test commands from CI/CD configs and package files, and grades the agent's patches against the repo's real test suite. In a run against a large production monorepo covering 20 merged PRs, the existing CLAUDE.md improved test resolution by roughly 27% over a bare baseline, while an alternative that consolidated all guidance into a single file introduced noise and regressed to the same level as having no CLAUDE.md at all. The tool also protects evaluation integrity with history-free checkouts, in the same spirit as SWE-bench's grading, and reports statistical significance via a paired t-test.
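
The "paired t-test" point is easy to make concrete: because every condition is run on the same task set, each task acts as its own control. Below is a minimal sketch of that comparison, with entirely hypothetical per-task scores; it illustrates the statistic mdarena reports, not its actual code.

```python
from scipy.stats import ttest_rel

# Hypothetical resolution outcomes (1 = repo tests pass) for the SAME
# 20 tasks under two conditions; the pairing is what justifies ttest_rel.
with_claude_md = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
bare_baseline  = [1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0]

stat, p_value = ttest_rel(with_claude_md, bare_baseline)
print(f"t = {stat:.2f}, p = {p_value:.3f}")
```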

Full text

Benchmark your CLAUDE.md against your own PRs. Most CLAUDE.md files are written blindly. Research shows they often reduce agent success rates and cost 20%+ more tokens. mdarena lets you measure whether yours helps or hurts, on tasks from your actual codebase.

```bash
pip install mdarena

# Mine 50 merged PRs into a test set
mdarena mine owner/repo --limit 50 --detect-tests

# Benchmark multiple CLAUDE.md files + baseline (no context)
mdarena run -c claude_v1.md -c claude_v2.md -c agents.md

# See who wins
mdarena report
```

How it works:

- `mdarena mine` → fetch merged PRs, filter, build the task set; auto-detect test commands from CI/package files
- `mdarena run` → for each task × condition:
  - Checkout the repo at the pre-PR commit
  - Baseline: all CLAUDE.md files stripped
  - Context: inject the CLAUDE.md, let Claude discover it
  - Run tests if available, capture the git diff
- `mdarena report` → compare patches against gold (the actual PR diff):
  - Test pass/fail (same as SWE-bench)
  - File/hunk overlap, cost, tokens
  - Statistical significance (paired t-test)

mdarena can run your repo's actual tests to grade agent patches, the same way SWE-bench does it.

```bash
# Auto-detect from CI/CD
mdarena mine owner/repo --detect-tests

# Or specify manually
mdarena mine owner/repo --test-cmd "make test" --setup-cmd "npm install"
```

It parses `.github/workflows/*.yml`, `package.json`, `pyproject.toml`, `Cargo.toml`, and `go.mod`. When tests aren't available, it falls back to diff overlap scoring.

Pass a directory to benchmark a full CLAUDE.md tree:

```bash
mdarena run -c ./configs-v1/ -c ./configs-v2/
```

Each directory mirrors your repo structure. The baseline strips ALL CLAUDE.md and AGENTS.md files from the entire tree.

We ran mdarena against a large production monorepo: 20 merged PRs, Claude Opus 4.6, three conditions (bare baseline, existing CLAUDE.md, hand-written alternative). Patches were graded against real test suites, not string matching and not LLM-as-judge. Key findings:

- The existing CLAUDE.md improved test resolution by ~27% over the bare baseline
- A consolidated alternative that merged all per-directory guidance into one file performed no better than no CLAUDE.md at all
- On hard tasks, per-directory instruction files gave the agent targeted context, while the consolidated version introduced noise that caused regressions

The winning CLAUDE.md wasn't the longest or most detailed. It was the one that put the right context in front of the agent at the right time.

```bash
# Import SWE-bench tasks
pip install datasets
mdarena load-swebench lite --limit 50
mdarena run -c my_claude.md

# Or export your tasks as SWE-bench JSONL
mdarena export-swebench
```

Only benchmark repositories you trust. mdarena executes code from the repos it benchmarks (test commands run via `shell=True`, Claude Code runs with `--dangerously-skip-permissions`). Sandboxes are isolated temp directories under `/tmp`, but processes run as your user.

Benchmark integrity: because tasks come from historical PRs, the gold patch is in the repo's git history. Claude 4 Sonnet exploited this against SWE-bench by walking future commits via tags. mdarena prevents this with history-free checkouts: `git archive` exports a snapshot at `base_commit` into a fresh single-commit repo, so future commits don't exist in the object database at all. See `tests/test_isolated_checkout.py` for the integrity assertions.
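
To make the history-free checkout concrete, here is a minimal sketch of the technique under stated assumptions: a local clone at `repo_path`, and `git archive` writing an uncompressed tar to stdout. The function name and directory prefix are hypothetical; mdarena's real implementation and its assertions live in `tests/test_isolated_checkout.py`.

```python
import io
import subprocess
import tarfile
import tempfile

def history_free_checkout(repo_path: str, base_commit: str) -> str:
    """Sketch: export a snapshot at base_commit into a fresh single-commit repo."""
    workdir = tempfile.mkdtemp(prefix="mdarena-task-")  # hypothetical prefix
    # `git archive` serializes only the tree at base_commit: no refs, no
    # tags, and no objects from later commits can leak into the export.
    tar_bytes = subprocess.run(
        ["git", "-C", repo_path, "archive", base_commit],
        check=True, capture_output=True,
    ).stdout
    with tarfile.open(fileobj=io.BytesIO(tar_bytes)) as tar:
        tar.extractall(workdir)
    # Re-init so the agent sees a real git repo whose object database
    # holds exactly one commit; "future" commits simply do not exist here.
    subprocess.run(["git", "init", "-q"], cwd=workdir, check=True)
    subprocess.run(["git", "add", "-A"], cwd=workdir, check=True)
    subprocess.run(
        ["git", "-c", "user.name=bench", "-c", "user.email=bench@example.com",
         "commit", "-qm", "snapshot"],
        cwd=workdir, check=True,
    )
    return workdir
```

The key property is that the agent still works inside a normal git repo, but the object database was rebuilt from a snapshot, so no amount of `git log` or tag-walking can reach the gold patch.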

| Command | Description |
|---|---|
| `mdarena mine` | Mine merged PRs into a task set |
| `mdarena mine --detect-tests` | Mine with auto-detected test extraction |
| `mdarena run -c file.md` | Benchmark a single CLAUDE.md |
| `mdarena run -c a.md -c b.md` | Compare multiple files head-to-head |
| `mdarena run --no-run-tests` | Skip test execution, diff overlap only |
| `mdarena report` | Analyze results, show comparison |
| `mdarena load-swebench [dataset]` | Import SWE-bench tasks |
| `mdarena export-swebench` | Export tasks as SWE-bench JSONL |

To work on mdarena locally:

```bash
git clone https://github.com/HudsonGri/mdarena.git
cd mdarena
uv sync
uv run pytest
uv run ruff check src/
```

See ROADMAP.md. MIT license; see LICENSE.
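
As a footnote on `mdarena run --no-run-tests` from the table above: when grading falls back to diff overlap, the file-level half of the idea can be sketched as a Jaccard similarity over the file sets each patch touches. The helper names are hypothetical, hunk-level overlap is omitted, and this is not mdarena's actual scoring code.

```python
import re

def changed_files(diff_text: str) -> set[str]:
    # "+++ b/<path>" marks the post-image of each changed file in a
    # unified diff (deleted files show "+++ /dev/null" and are skipped).
    return {m.group(1) for m in re.finditer(r"^\+\+\+ b/(\S+)", diff_text, re.M)}

def file_overlap(agent_diff: str, gold_diff: str) -> float:
    """Jaccard similarity between the file sets the two patches touch."""
    a, g = changed_files(agent_diff), changed_files(gold_diff)
    return len(a & g) / len(a | g) if (a | g) else 0.0
```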

This analysis was written by the Genesis Park editorial team with the help of AI. The original post is available via the source link.
