Benchmarking Claude Opus 4.6 Vulnerability Detection

hackernews | May 11, 2026, 20:58 | 📦 Open Source

#anthropic #claude #gpt-4 #opensource

Original source: hackernews · Summary and analysis by Genesis Park

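Summary

To measure how well Claude Opus 4.6 detects real-world C/C++ vulnerabilities, the experiments compared four strategies. Requiring rigorous justification, such as detailed execution traces and state proofs, raised pair-correct precision (P-C) from 13.6% to 20.3%. Adding a verification agent raised P-C to 23.3% and CVE recall to 28.9%, showing that structured reasoning and verification substantially improve the quality of vulnerability analysis.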

Body

This benchmark measures Claude Opus 4.6's ability to detect real-world C/C++ vulnerabilities across four prompting and agent strategies. We evaluate on the PrimeVul paired test set (435 vulnerability/fix pairs from open-source projects), measuring precision, recall, and CVE-correctness to understand how structured reasoning, justification depth, and verification agents affect detection quality.

Requiring the model to produce increasingly rigorous justifications (execution traces, state proofs) improves pair-correct precision (P-C) from 13.6% to 20.3%, with rigorous precision nearly doubling from 8.7% to 15.8%. Adding a verification agent pushes P-C to 23.3% and CVE recall to 28.9%.

Each experiment uses Claude Opus 4.6 as the analyzer and runs 3 times for consistency. All experiments share the same three-phase pipeline but differ in the structured output the model must produce.

| # | Experiment | P-C | P-C Rigorous | P-C Flexible | CVE Recall | Vuln Findings | Benign Findings | Benign-Only Findings |
|---|---|---|---|---|---|---|---|---|
| 1 | No Justification | 13.6% | 8.7% | 13.6% | 27.6% | 63.4% | 52.4% | 27.4% |
| 2 | Limited Justification | 19.3% | 14.5% | 17.7% | 25.5% | 52.2% | 36.6% | 15.6% |
| 3 | Extensive Justification | 20.3% | 15.8% | 17.7% | 28.5% | 54.6% | 37.7% | 18.6% |
| 4 | Verification Agent | 23.3% | 16.2% | 18.5% | 28.9% | 57.2% | 43.2% | 24.7% |

All values are medians across 3 runs. Reference baseline: GPT-4 CoT = 12.94% P-C.

Because each experiment runs 3 times, results are also reported per pair as "All 3" and "Any". "All 3" counts a pair only if the result held in every run; "Any" counts it if the result held in at least one. This captures how stable the results were across runs.
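The median, "All 3", and "Any" views of a metric can be sketched as below. This is a minimal illustration, not the repository's code: it assumes each run has been reduced to the set of pair IDs for which the metric held, and `stability`, `runs`, and the pair IDs are hypothetical names.

```python
from statistics import median


def stability(runs: list[set[str]], n_pairs: int) -> dict[str, float]:
    """Summarize one metric across repeated runs, as percentages of all pairs.

    runs    -- one set per run, holding the (hypothetical) IDs of the pairs
               for which the metric held in that run; must be non-empty
    n_pairs -- total number of pairs evaluated
    """
    held_in_all = set.intersection(*runs)  # pairs where the result held in every run
    held_in_any = set.union(*runs)         # pairs where it held in at least one run
    per_run = [100 * len(r) / n_pairs for r in runs]
    return {
        "median": median(per_run),
        "all": 100 * len(held_in_all) / n_pairs,
        "any": 100 * len(held_in_any) / n_pairs,
    }


# Toy data: 3 runs over 10 pairs (made-up IDs, not benchmark results).
runs = [{"p1", "p2"}, {"p1", "p3"}, {"p1", "p2", "p4"}]
summary = stability(runs, n_pairs=10)
print(summary)
```

A wide gap between the "all" and "any" values indicates that the pairs a run gets right vary considerably from run to run.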
| # | Experiment | CVE Recall All 3 | CVE Recall Any | P-C Rigorous All 3 | P-C Rigorous Any | P-C Flexible All 3 | P-C Flexible Any |
|---|---|---|---|---|---|---|---|
| 1 | No Justification | 22.0% | 33.1% | 5.9% | 11.1% | 7.6% | 19.6% |
| 2 | Limited Justification | 21.7% | 30.0% | 12.5% | 17.5% | 14.2% | 22.2% |
| 3 | Extensive Justification | 23.6% | 33.3% | 11.6% | 20.1% | 12.8% | 24.3% |
| 4 | Verification Agent | 19.4% | 37.6% | 9.9% | 23.4% | 10.6% | 27.2% |

Experiment 1 (No Justification): Simple vulnerability analysis. The model reports CWEs, code snippets, and descriptions with no structured reasoning required.

Experiment 2 (Limited Justification): Requires a `Justification` with an `UndesiredOperation` (code + CWEs) and a `step_by_step_execution` that tracks variable state through `ProgramStep`s. The model must demonstrate a concrete execution path from function entry to the undesired operation.

Experiment 3 (Extensive Justification): Full proof of reachability. The model must provide:
- `UndesiredOperation`: description, code, CWEs, impact, and the variable states required to trigger it
- `Justification`: initial variable state at function entry, then a trace of `DataTransformation` steps (`in_state -> out_state`) and `ConditionalStep` steps (prove each branch is taken given current state)

Experiment 4 (Verification Agent): Same structured reasoning as experiment 3, plus a Claude Sonnet 4.6 verifier agent that checks each finding before inclusion. The verifier validates: is the undesired operation real, is the initial state correct, do steps follow logically, are conditionals justified, and does the final state match the preconditions. Findings get up to 2 verification attempts; unverified findings are discarded.

Each experiment follows the same three-phase pipeline: `analyze.py -> diff_judge.py -> judge.py`
- Analyze (`analyze.py`): Claude Opus 4.6 analyzes each of the 870 functions independently, producing structured vulnerability findings.
- Diff Judge (`diff_judge.py`): For each commit pair (vulnerable + fixed), matches findings across versions and categorizes them as `vuln_only`, `benign_only`, or `shared`.
- Judge (`judge.py`): Evaluates `vuln_only` findings against ground-truth CVE data to determine correctness.

All metrics are computed over the 435 vulnerability/fix pairs. After analysis, the diff judge categorizes each finding as `vuln_only` (unique to vulnerable version), `benign_only` (unique to fixed version), or `shared` (present in both). The judge then evaluates whether each `vuln_only` finding correctly identifies the ground-truth CVE.

| Metric | Definition |
|---|---|
| P-C | % of pairs where the vulnerable side has at least one finding and the benign side has zero findings (no `benign_only`, no `shared`). Measures raw discrimination: can the model tell vulnerable code from fixed code? |
| P-C Rigorous | P-C with the additional requirement that all `vuln_only` findings are judged as related to the ground-truth CVE. The strictest metric: the model must flag only the real vulnerability and nothing else, with a clean benign side. |
| P-C Flexible | % of pairs where all `vuln_only` findings are CVE-correct (at least one exists) and there are no `benign_only` findings. `shared` findings are permitted; these represent underlying issues not addressed by the patch. Every benign-side finding must have a corresponding linked vulnerable-side finding. |
| CVE Recall | % of pairs where the vulnerable side has at least one finding judged as related to the ground-truth CVE, regardless of what appears on the benign side. Measures the model's ability to detect the actual vulnerability. |
| Vuln Findings | % of vulnerable functions that have at least one finding (any category). |
| Benign Findings | % of benign functions that have at least one finding (any category). |
| Benign-Only Findings | % of benign functions that have at least one finding not also found on the vulnerable side (i.e., a `benign_only` finding with no linked `shared` counterpart). |

The PrimeVul paired test set contains 435 pairs (870 functions) from real security fixes across open-source C/C++ projects including Linux, TensorFlow, ImageMagick, FFmpeg, OpenSSL, mruby, and others. Each pair consists of:
- A vulnerable function (before the fix, `target=0`)
- A benign function (after the fix, `target=1`)

Ground truth includes CVE ID, CWE classification, NVD URL, and commit message.

Repository layout:

    src/
      experiments/
        no-justification/           # Experiment 1
        limited-justification/      # Experiment 2
        extensive-justification/    # Experiment 3
        verification-agent/         # Experiment 4
      common/
        primevul.duckdb             # Dataset in DuckDB format
    data/
      experiments/
        */experiment.json           # Experiment metadata
        */runs/{1,2,3}/             # Per-run outputs (analysis, diffed, judged, stats)
        experiment_comparison.json  # Cross-experiment metrics comparison

Requires Python 3.12+ and uv.

    uv sync

Set your Anthropic API key:

    export ANTHROPIC_API_KEY=sk-...

Each experiment has a `run_experiment.sh` script:

    cd src/experiments/extensive-justification
    bash run_experiment.sh 1  # run number

Dependencies:
- pydantic-ai - Claude agent framework with structured outputs
- duckdb - Dataset storage and querying
- datasets - HuggingFace dataset loading
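The pair-level metric definitions (P-C, P-C Rigorous, P-C Flexible, CVE Recall) can be made concrete with a short sketch. It is an illustration under stated assumptions, not the repository's implementation: each pair is assumed to have been reduced to per-category finding counts by the diff judge and judge, only `vuln_only` findings are assumed to be judged for CVE-correctness (as the pipeline describes), and `PairResult` with its fields and the helper functions are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PairResult:
    """Hypothetical per-pair summary of diff-judge and judge output."""
    vuln_only: int          # findings unique to the vulnerable version
    vuln_only_correct: int  # vuln_only findings judged related to the ground-truth CVE
    benign_only: int        # findings unique to the fixed version
    shared: int             # findings present in both versions


def pc(p: PairResult) -> bool:
    # At least one vulnerable-side finding and a completely clean benign side.
    return p.vuln_only > 0 and p.benign_only == 0 and p.shared == 0


def pc_rigorous(p: PairResult) -> bool:
    # P-C, and every vuln_only finding is CVE-correct.
    return pc(p) and p.vuln_only_correct == p.vuln_only


def pc_flexible(p: PairResult) -> bool:
    # All vuln_only findings CVE-correct (at least one), no benign_only; shared allowed.
    return p.vuln_only > 0 and p.vuln_only_correct == p.vuln_only and p.benign_only == 0


def cve_recall(p: PairResult) -> bool:
    # At least one CVE-correct finding, regardless of the benign side.
    return p.vuln_only_correct > 0


def rate(pairs: list[PairResult], metric: Callable[[PairResult], bool]) -> float:
    """Percentage of pairs for which the metric holds."""
    return 100 * sum(metric(p) for p in pairs) / len(pairs)


# Toy pairs (made-up counts, not benchmark data).
pairs = [
    PairResult(vuln_only=1, vuln_only_correct=1, benign_only=0, shared=0),  # clean hit
    PairResult(vuln_only=2, vuln_only_correct=1, benign_only=0, shared=0),  # one extra wrong finding
    PairResult(vuln_only=1, vuln_only_correct=1, benign_only=0, shared=1),  # hit plus a shared issue
    PairResult(vuln_only=0, vuln_only_correct=0, benign_only=1, shared=0),  # benign-side noise only
]
results = {m.__name__: rate(pairs, m) for m in (pc, pc_rigorous, pc_flexible, cve_recall)}
print(results)
```

In this toy data the third pair counts toward P-C Flexible but not P-C, while the second counts toward P-C but not P-C Flexible. The two metrics are therefore not ordered, which is consistent with the results table, where P-C Flexible equals P-C in experiment 1 and falls below it in experiments 2 to 4.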

View original (hackernews)

This analysis was written by the Genesis Park editorial team with the help of AI. The original text is available via the source link.
