Built a skill that benchmarks any skill

hackernews | March 12, 2026, 07:22 | 🔬 Research

#a/b testing #claude #claude code #review #skill validation #benchmarking #performance evaluation

Original source: hackernews · Summarized and analyzed by Genesis Park

Summary

The post introduces a Claude Code skill, invoked as /spec:skill-benchmark, that runs a controlled A/B evaluation to check whether another skill actually improves Claude Code's performance. It auto-generates easy, medium, and hard tasks plus a negative control, runs each one with and without the skill in isolated claude -p sessions, and grades the results with deterministic checks followed by an LLM judge. In the example run against a code-review skill, the report returns a USE verdict, with the skill averaging 88.42% against a 74.20% baseline.

Body

Benchmark any skill with controlled A/B evaluation

Run /spec:skill-benchmark to measure whether a skill actually improves Claude Code performance. Isolated sessions, layered grading, behavioral analysis, clear verdict.

What it measures

- Pass rate delta: aggregate score difference between with-skill and baseline
- 4 grading criteria: correctness (40%), completeness (25%), quality (20%), efficiency (15%)
- Behavioral signals: tool usage frequency, thrashing detection, error recovery rate
- Negative controls: out-of-domain tasks where the skill should not activate

How it works

- Analyzes the target skill and auto-generates balanced tasks (easy/medium/hard + negative control)
- Runs isolated claude -p sessions, with-skill vs baseline, in separate sandboxes
- Grades with deterministic checks (file exists, compiles, passes) then LLM-as-judge
- Analyzes transcripts for tool patterns, thrashing, and error recovery
- Produces verdict based on score delta: USE / NEUTRAL / DON'T USE

Output

A report.md with per-task breakdowns, comparative tables, behavioral analysis, token usage, and a clear verdict with actionable recommendations.

> Benchmark the code-review skill

I'll benchmark code-review with controlled A/B evaluation. Let me read the skill and set up the benchmark.

Read .claude/skills/skill-benchmark/SKILL.md
Read .claude/skills/skill-benchmark/references/CONFIG.md
Read .claude/skills/code-review/SKILL.md

Skill analyzed: code-review provides structured PR review with bug detection, security checks, and actionable feedback. Generating tasks targeting its domain.
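The four grading criteria listed under "What it measures" carry explicit weights, so one plausible way to fold the judge's per-criterion scores into a single number is a plain weighted sum. The sketch below assumes exactly that; the source does not show the tool's real formula, and the names WEIGHTS and weighted_score are hypothetical.

```python
# Weights copied from the "4 grading criteria" bullet in the description.
WEIGHTS = {
    "correctness": 0.40,
    "completeness": 0.25,
    "quality": 0.20,
    "efficiency": 0.15,
}


def weighted_score(scores: dict[str, float]) -> float:
    """Combine per-criterion judge scores (0-100) into one weighted score.

    Assumption: a simple weighted sum; the real tool may differ.
    """
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Baseline judge scores for task-01, as printed later in the transcript.
baseline = {"correctness": 72, "completeness": 65, "quality": 58, "efficiency": 85}
print(round(weighted_score(baseline), 1))  # 69.4
```

This reproduces the 69.4 shown for the task-01 baseline. For the with-skill row (95, 92, 94, 88) the same sum gives 93.0, while the transcript prints 93.1, so the actual tool may round or combine its grading layers slightly differently.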
Generating tasks: 4 positive + 1 negative control

Write results/code-review-20260312-091544/tasks/task-01-easy.md +34 lines
  task-01 [easy] Review a Python Flask endpoint with missing input validation
Write results/code-review-20260312-091544/tasks/task-02-easy.md +36 lines
  task-02 [easy] Review a TypeScript utility with off-by-one and type coercion bugs
Write results/code-review-20260312-091544/tasks/task-03-medium.md +48 lines
  task-03 [medium] Review a Go concurrent cache with race conditions and deadlock risk
Write results/code-review-20260312-091544/tasks/task-04-hard.md +56 lines
  task-04 [hard] Review a Rust async payment processor with error handling and retry logic
Write results/code-review-20260312-091544/tasks/task-05-control.md +30 lines
  task-05 [control] Write a CSS animation from scratch (negative control)

5 tasks generated with verification checks. Running isolated sessions: each task gets with-skill and baseline in separate sandboxes.

task-01 Flask endpoint review [easy]

with-skill
  Run claude -p --append-system-prompt "...call Skill tool..."
  turns: 6   tools: 4 (Skill, Read, Read, Write)   cost: $0.072   31.4s
  review.md: found SQL injection, missing auth check, no rate limiting

baseline
  Run claude -p --disallowedTools Skill
  turns: 3   tools: 2 (Read, Write)   cost: $0.041   14.8s
  review.md: found missing validation, missed SQL injection

checks (pass/fail marks not recoverable):
  with-skill  file_exists  has_findings  severity_rated  has_fix_suggestions
  baseline    file_exists  has_findings  severity_rated  has_fix_suggestions

judge (opus):
  with-skill  correctness 95  completeness 92  quality 94  efficiency 88  → 93.1
  baseline    correctness 72  completeness 65  quality 58  efficiency 85  → 69.4

350+ messages compacted

task-02 through task-05: 8 more sessions, deterministic checks, LLM grading, transcript analysis for each

Run python3 analyze_transcript.py results/code-review-20260312-091544/
Write results/code-review-20260312-091544/report.md +142 lines

────────────────────────────────────────────────────────────
Skill Benchmark Report: code-review
runner sonnet   judge opus   tasks 5   runs 1
────────────────────────────────────────────────────────────

Verdict: USE
Skill scores +14.2% higher than baseline on average.
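The verdict line maps the average score delta onto one of the three labels named in the description (USE / NEUTRAL / DON'T USE). The source does not state the cutoffs, so the sketch below uses a hypothetical symmetric threshold of 5 points; the function name verdict_from_delta is also an assumption.

```python
def verdict_from_delta(delta: float, threshold: float = 5.0) -> str:
    """Map an average score delta (with-skill minus baseline, in points) to a label.

    threshold is a hypothetical cutoff; the source does not publish one.
    """
    if delta >= threshold:
        return "USE"
    if delta <= -threshold:
        return "DON'T USE"
    return "NEUTRAL"


# Delta taken from the report headline above.
print(verdict_from_delta(14.2))  # USE
```

A symmetric band around zero keeps small, possibly noisy differences from flipping the recommendation. With only one run per task, as in this report, a wider band might be the safer choice.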
Summary

───────────────┬────────────┬────────────┬─────────
Metric         │ With Skill │ Baseline   │ Delta
───────────────┼────────────┼────────────┼─────────
Avg Score      │ 88.42%     │ 74.20%     │ +14.22%
Correctness    │ 91.00%     │ 76.25%     │ +14.75%
Completeness   │ 87.50%     │ 68.00%     │ +19.50%
Quality        │ 90.25%     │ 71.50%     │ +18.75%
Efficiency     │ 82.50%     │ 80.00%     │ +2.50%
───────────────┴────────────┴────────────┴─────────

Deterministic Checks
  with-skill  4/4 passed (100%)
  baseline    2/4 passed (50%)

Per-Task Breakdown

──────────────────────────────┬────────┬────────┬─────────────
Task                          │ Skill  │ Base   │ Winner
──────────────────────────────┼────────┼────────┼─────────────
Flask endpoint [easy]         │ 93.10% │ 69.40% │ Skill +23.7
TS utility bugs [easy]        │ 89.80% │ 78.20% │ Skill +11.6
Go concurrent cache [medium]  │ 86.50% │ 71.30% │ Skill +15.2
Rust payment proc [hard]      │ 84.30% │ 68.90% │ Skill +15.4
CSS animation [control]       │ 91.20% │ 90.80% │ Tie +0.4
──────────────────────────────┴────────┴────────┴─────────────

Where skill helps:
• Completeness +19.5%: catches more bugs, checks security, validates edge cases
• Quality +18.8%: structured findings with severity,
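The Winner column of the per-task breakdown can be reproduced from the two score columns. This is a minimal sketch, assuming a hypothetical 1-point tie margin (the source does not say how "Tie" is decided); TaskResult and winner are illustrative names, not part of the tool.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    name: str
    level: str    # "easy", "medium", "hard", or "control"
    skill: float  # with-skill score in percent
    base: float   # baseline score in percent

    @property
    def delta(self) -> float:
        # Rounded to one decimal, matching the report's Winner column.
        return round(self.skill - self.base, 1)


def winner(result: TaskResult, tie_margin: float = 1.0) -> str:
    """Label a task as Skill, Base, or Tie (tie_margin is a hypothetical cutoff)."""
    if abs(result.delta) < tie_margin:
        return "Tie"
    return "Skill" if result.delta > 0 else "Base"


# Scores copied from the Per-Task Breakdown table.
results = [
    TaskResult("Flask endpoint", "easy", 93.10, 69.40),
    TaskResult("TS utility bugs", "easy", 89.80, 78.20),
    TaskResult("Go concurrent cache", "medium", 86.50, 71.30),
    TaskResult("Rust payment proc", "hard", 84.30, 68.90),
    TaskResult("CSS animation", "control", 91.20, 90.80),
]

for r in results:
    print(f"{r.name:<20} [{r.level:<7}] {winner(r):<5} {r.delta:+.1f}")

# The negative control should show no meaningful difference.
control_ok = all(winner(r) == "Tie" for r in results if r.level == "control")
print("negative control clean:", control_ok)
```

On the negative control the margin is only 0.4 points, which is what one would hope to see if the skill stays out of the way on out-of-domain tasks.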

View original (hackernews)

This analysis was written by the Genesis Park editorial team with the help of AI. The original can be found via the source link.
