BenchJack – AI 에이전트 벤치마크를 위한 오픈 소스 해킹 가능성 스캐너
hackernews
|
|
📦 오픈소스
#anthropic
#benchmark
#claude
#openai
#review
원문 출처: hackernews · Genesis Park에서 요약 및 분석
요약
AI 에이전트 벤치마크의 보안 취약점을 자동으로 스캔하는 오픈소스 도구 BenchJack이 공개되었습니다. 이 도구는 정적 분석과 AI 기반 심층 검사를 통해 8가지 취약성 클래스를 식별하고 실제 공격 코드를 생성하며, 8개 주요 벤치마크에서 에이전트가 73~100%의 점수를 부정하게 얻을 수 있음을 밝혀냈습니다. BenchJack은 웹 대시보드를 통해 결과를 실시간으로 확인할 수 있으며, Docker 샌드박싱을 지원해 안전한 감사 환경을 제공합니다.
본문
Find out if your AI benchmark can be gamed — before your model does. BenchJack is a hackability scanner for AI agent benchmarks. It runs a multi-phase audit pipeline — static analysis tools plus AI-powered deep inspection via Claude Code or Codex — and streams results to a live web dashboard as they arrive. Point it at any benchmark repo. BenchJack will tell you whether an agent can cheat. Real-time dashboard showing a vulnerability scan of Terminal-Bench. Red/yellow indicators are vulnerability classes V1–V8. AI benchmarks are supposed to measure capability — but many can be gamed. Agents can read answer keys shipped with the test, hijack the evaluator process, exploit eval() on untrusted input, or fool LLM judges with prompt injection. When benchmarks are hackable, leaderboards become meaningless. For more on why this matters, see our blog post on trustworthy benchmarks. BenchJack automates the process of finding these weaknesses: - 8 vulnerability classes covering the most common benchmark exploits — from leaked answers (V2) to LLM judges without input sanitization (V4) to granting unnecessary permissions (V8) - Static + AI hybrid analysis — Semgrep, Bandit, and Hadolint catch surface-level issues; Claude Code or Codex handle the deep architectural reasoning - Proof-of-concept generation — doesn't just flag problems, generates working exploit code - Real-time streaming dashboard — watch the audit unfold live in your browser - Docker sandboxing (Work In Progress) — run analysis in isolated containers with dropped capabilities and read-only mounts - Claude Code skill — also ships as a standalone Claude Code skill in .claude/skills/benchjack/ , so you can run/benchjack directly inside Claude Code without the web UI or CLI wrapper We used BenchJack to audit 8 major AI agent benchmarks covering 4,458 tasks — and every single one was exploitable. Agents achieved 73–100% scores without doing any legitimate work. No solution code, minimal LLM calls, no actual reasoning. Details in our blog post. | Benchmark | Tasks | Exploit | Score | |---|---|---|---| | SWE-bench Verified | 500 | Pytest hook injection via conftest.py forces all tests to pass | 100% | | SWE-bench Pro | 731 | Same conftest.py hook + Django unittest.TestCase.run monkey-patch | 100% | | Terminal-Bench | 89 | Binary trojaning — replace /usr/bin/curl , fake uvx /pytest output | 100% | | WebArena | 812 | file:// URLs leak reference answers from task configs | ~100% | | FieldWorkArena | 890 | Non-functional validator — send {} , score full marks | 100% | | OSWorld | 369 | wget gold files from public HuggingFace URLs + eval() on grader | 73% | | GAIA | 165 | Public answer lookup + normalization collisions in string matching | ~98% | | CAR-bench | — | Hidden HTML instructions bias LLM judge; generic refusals skip grading | 100% | And there are more to come — see audits/ for community-contributed audit writeups, and audits/README.md for how to submit your own. # Install uv tool install . # Run — opens a browser dashboard at http://localhost:7832 benchjack That's it. Specify the name of the benchmark (or the path/URL) and start auditing. BenchJack finds and clones the repo, runs the full pipeline, and streams results to the dashboard. - Python 3.11+ - uv for package management - One AI backend (at least one): - Claude Code (recommended): npm i -g @anthropic-ai/claude-code - OpenAI Codex (WIP, high refusal rate) - Claude Code (recommended): - Docker (optional, for sandboxed execution) - Without Docker: install semgrep ,bandit , andhadolint for static analysis git clone https://github.com/benchjack/benchjack.git cd benchjack uv tool install . To also install the Python-based static analysis tools: uv pip install ".[tools]" After installing, make sure your AI backend is authenticated and your tools are available. Claude Code — Run claude once in your terminal and complete the login flow. BenchJack invokes claude --print , which requires an active session. If you prefer API-key auth, set ANTHROPIC_API_KEY in your environment instead. OpenAI Codex — Run codex once to authenticate. Codex uses its own OAuth session stored in ~/.codex/ . # Check that your chosen backend is on PATH which claude # or: which codex # Check static analysis tools (only needed without Docker) which semgrep && which bandit # Optional: check Docker (only needed with --sandbox) docker info BenchJack will error early if the selected backend is missing from PATH. Static analysis tools (semgrep , bandit , hadolint ) are only required when running without --sandbox — in sandbox mode they are built into the Docker image. benchjack # start the dashboard, configure from the UI benchjack --port 9000 # custom port (default: 7832) The dashboard lets you configure the backend, mode, sandbox, and PoC level — then start the audit with one click. For headless / scripted operation: benchjack --no-ui [OPTIONS] Options: --backend NAME AI backend: claude | codex | auto (default: claude) --model MODEL Model for AI analysis phases --poc-level LEVEL PoC generation: full | partial | skip (default: partial) --audit Audit mode (default) --hack-it Reward-hack mode --sandbox Run inside Docker sandbox --no-sandbox Run on host (default) # Basic audit benchjack ./my-benchmark --no-ui # Use a specific model benchjack ./my-benchmark --no-ui --model claude-sonnet-4-6 --poc-level partial # Reward-hack mode with Codex, sandboxed benchjack ./my-benchmark --no-ui --hack-it --backend codex --sandbox # Audit a remote repo benchjack https://github.com/org/benchmark --no-ui manual for a detailed guide on using the dashboard and the CLI. BenchJack runs a 6-phase pipeline. Each phase streams events to the dashboard (or CLI) in real time. | Phase | What it does | Engine | |---|---|---| | Setup | Clone or locate the benchmark repo | git | | Static Scan | Run Semgrep, Bandit, Hadolint, Docker Analyzer, Trust Mapper | Static tools | | Reconnaissance | Map evaluation architecture, entry points, trust boundaries | AI | | Vulnerability Scan | Check all 8 vulnerability classes (V1–V8) | AI | | PoC Construction | Generate proof-of-concept exploits | AI | | Report | Produce structured audit report with findings and severity | AI | | ID | Name | Example | |---|---|---| | V1 | No Isolation Between Agent and Evaluator | Agent writes to the same filesystem the evaluator reads from | | V2 | Answers Shipped With the Test | Ground-truth labels accessible at runtime | | V3 | Remote Code Execution on Untrusted Input | eval() / exec() called on agent output | | V4 | LLM Judges Without Input Sanitization | Prompt injection in model-graded evaluation | | V5 | Weak String Matching | Scoring with in or regex that accepts partial / wrong answers | | V6 | Evaluation Logic Gaps | Off-by-one errors, missing edge cases in scoring | | V7 | Trusting the Output of Untrusted Code | Agent-generated code runs with evaluator privileges | | V8 | Granting Unnecessary Permissions | Network access, filesystem write, sudo where not needed | BenchJack can run all analysis inside Docker containers for isolation: - Static tools run with --network=none ,--cap-drop=ALL , and the benchmark mounted read-only - AI backends run with network access (needed for API calls) but the benchmark is still read-only and host capabilities are dropped The sandbox image (benchjack-sandbox ) is built automatically on first use. Pass --no-sandbox to skip Docker and run directly on the host. benchjack.py CLI entry point server/ app.py FastAPI application ai_runner.py Claude Code / Codex CLI wrapper sandbox.py Docker sandbox management event_bus.py SSE pub-sub for real-time streaming pipeline/ audit.py Audit pipeline hack.py Reward-hack pipeline prompts.py AI prompt templates models.py Data models routes/ REST + SSE endpoints web/ index.html Dashboard style.css Styles app.js Frontend logic js/ JS modules .claude/skills/benchjack/ SKILL.md Claude Code skill definition (run /benchjack in Claude Code) tools/ Static analysis scripts & Semgrep rules audits/ Community-contributed audit writeups (one folder per benchmark) Dockerfile.sandbox Sandbox container image BenchJack is in early preview. Keep the following in mind: - Codex backend is experimental. Codex has a high refusal rate on security-related prompts, which causes many pipeline phases to produce incomplete results or fail silently. Claude Code is the recommended backend. - Docker sandbox is work-in-progress. Sandboxed execution ( --sandbox ) works for static analysis, but AI-backend containers may hit credential-forwarding edge cases on Linux hosts (macOS Keychain extraction is macOS-only). SetANTHROPIC_API_KEY explicitly when using sandbox mode on Linux. - No automated tests for PoC verification. Generated proof-of-concept exploits are not automatically validated against the target benchmark. The PoC phase may produce code that looks correct but fails at runtime. See CONTRIBUTING.md for how to help build verification oracles. - Sequential pipeline only. All phases run sequentially — there is no parallelism across vulnerability classes or tasks yet. - Rate limits. Long audits on large benchmarks can hit API rate limits. BenchJack detects rate-limit errors from Claude Code but does not retry automatically; you will need to re-run. - Single-user web UI. The dashboard does not support concurrent audit sessions. Starting a new audit will need to open a new window. We welcome contributions of all kinds — new vulnerability classes, better prompts, static analysis rules, benchmark adapters, UI improvements, and tests. Audited a benchmark with BenchJack? Share your findings in audits/ — see audits/README.md for the submission guide and audits/TEMPLATE.md for a ready-to-fill skeleton. See CONTRIBUTING.md for setup instructions and ideas on where to start. If you use BenchJack in your research, please cite: @software{benchjack2025, title = {BenchJack: AI Agent Benchmark Hackability Scanner}, author = {BenchJack Contributors}, year = {2025}, url = {https://github.com/benchjack/benchjack} } Apache 2.0 — see LICENSE for details.
Genesis Park 편집팀이 AI를 활용하여 작성한 분석입니다. 원문은 출처 링크를 통해 확인할 수 있습니다.
공유