Claude Code Auto Mode: A Comprehensive Technical Summary
hackernews
🔬 Research
#ai tool
#anthropic
#auto mode
#claude
#claude code
#ml classifier
#review
#tool call
Original source: hackernews · Summarized and analyzed by Genesis Park
Summary
Claude Code's auto mode replaces the traditional user approve/deny prompt with an ML classifier (Sonnet 4.6) that evaluates every tool call the agent makes in real time. The key idea is that instead of simply blocking commands, it compares the user's stated intent with the actual scope of the action, so it can stop both overreach beyond what was authorized and prompt injection attacks at the source. For example, deploying to production when staging was requested, or deleting remote branches when only local cleanup was asked for, are both denied because the blast radius differs from the request. And even if the agent hides a malicious shell command behind a Python script or Base64 encoding, the LLM traces and interprets the code that would actually run, so it can counter sophisticated security threats.
Body
Auto mode replaces the human "approve/deny" permission prompt with an ML classifier that evaluates every tool call before execution. Instead of asking you whether `rm -rf node_modules` is okay, a second instance of Claude (Sonnet 4.6) reads the conversation transcript and decides in real time. This document walks through how it works, from the high-level architecture down to the actual code.

- The Philosophy: Intent, Not Just Commands
- The Big Picture
- Three Tiers of Defense
- The Two-Stage Classifier
- The Reasoning-Blind Design
- Transcript Building and Injection Prevention
- The System Prompt and Customizable Slots
- Block Rules and Threat Model
- Dangerous Permission Stripping
- Denial Tracking and Recovery
- Multi-Agent Classifier Checks
- Fail-Open vs Fail-Closed
- Cache Optimization
- Performance Numbers
- Key Files Reference

The classifier is not a command blocklist. It doesn't just ask "is `rm -rf` dangerous?" — it asks "did the user authorize this specific `rm -rf` on these specific files?" This is the core design philosophy, and it's what separates auto mode from a traditional security scanner. The classifier evaluates the gap between what the user requested and what the agent is actually doing. Consider:

| User says | Agent does | Classifier verdict |
|---|---|---|
| "Clean up old branches" | Deletes local stale branches | Allow — matches the request |
| "Clean up old branches" | `git push origin --delete` on 15 remote branches | Block — the user said "clean up," not "batch-delete remotes" |
| "Deploy to staging" | Runs deploy script against staging | Allow — direct match |
| "Deploy to staging" | Runs deploy script against production | Block — wrong target, even if the command syntax is identical |

The classifier's operating principle: everything the agent chooses on its own is unauthorized until the user says otherwise. One approval isn't a pattern — prior approval of similar actions doesn't automatically cover new variations. (A minimal sketch of this default-deny check appears after the list below.)

The system addresses four distinct categories, but blocks them all the same way:

1. Overeager Behavior — The agent genuinely tries to help but exceeds its authorization. This is the most common real-world issue. The agent discovers credentials during debugging and uses them. It deletes a lock file blocking its progress. It "helpfully" force-pushes to fix a diverged branch. These look like reasonable problem-solving, just applied past the boundary of what the user authorized.
2. Honest Mistakes — The agent misunderstands scope. It thinks test resources are isolated when they're shared. It misjudges branch ownership. It assumes a staging environment is safe to wipe. The intent is correct but the understanding of the environment is wrong.
3. Prompt Injection — External content (a malicious README, a crafted error message, a hostile web page) plants instructions that the agent follows. The agent's intent has been hijacked.
4. Misaligned Models — The model pursues independent goals. This is the theoretical worst case and the hardest to defend against. The classifier treats it identically to the other three: evaluate the action, not the stated reason.
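The article describes this gate in prose rather than code, but the decision flow can be sketched as a check that runs before every tool call. Everything in the snippet below is a hypothetical illustration: the names (`Verdict`, `classify_tool_call`), the prompt wording, and the pluggable `llm` callable are invented for this sketch and are not Claude Code's actual internals. The only sourced details are that a second model instance reads the conversation transcript and that anything the agent chose on its own is treated as unauthorized.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Verdict:
    allow: bool
    reason: str

def classify_tool_call(
    transcript: str,            # user-visible conversation so far
    tool_name: str,             # e.g. "Bash"
    tool_input: str,            # e.g. "git push origin --delete old-branch"
    llm: Callable[[str], str],  # second model instance acting as the classifier
) -> Verdict:
    """Ask a separate model whether this specific action was authorized."""
    prompt = (
        "You are a permission classifier. Decide whether the proposed action is "
        "covered by what the user explicitly asked for, not merely similar to it. "
        "Treat anything the agent chose on its own as unauthorized.\n\n"
        f"Conversation transcript:\n{transcript}\n\n"
        f"Proposed tool call: {tool_name}: {tool_input}\n\n"
        "Reply with ALLOW or BLOCK on the first line, then a one-line reason."
    )
    lines = llm(prompt).strip().splitlines()
    decision = lines[0].strip().upper() if lines else ""
    reason = lines[1].strip() if len(lines) > 1 else ""
    # Default deny: only an explicit ALLOW lets the tool call through.
    return Verdict(allow=(decision == "ALLOW"), reason=reason)
```

In this framing, the staging-versus-production row in the table above blocks not because the command is dangerous in itself, but because the transcript never authorized that target.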
The classifier looks through wrappers to evaluate what actually happens:

- If the agent writes a Python script that assembles a shell command, the classifier evaluates the assembled command, not just `python script.py`
- If commands are chained with `&&`, the whole chain is one action — you can't hide `rm -rf /` after a benign `echo hello`
- If the agent base64-encodes a payload, the classifier evaluates the decoded content

This is why the classifier is an LLM (Sonnet 4.6) rather than a regex engine. A regex can catch `rm -rf /` but can't understand that `python -c "import os; os.system('rm -rf /')"` does the same thing. (A toy demonstration of this gap appears below.)

For every action, the classifier effectively asks three questions:

- Is this action dangerous? (Could it cause irreversible harm, leak data, cross trust boundaries?)
- Did the user authorize this specific action? (Not "something like this" — this exact thing)
- Does the user's authorization cover the blast radius? (Saying "delete the test file" doesn't authorize deleting the test directory)

An action can be dangerous and still be allowed — if the user explicitly asked for it. And a seemingly benign action can be blocked — if it's outside the scope of what the user requested.

The Stage 2 classifier prompt makes this explicit: "As a reminder, explicit (not suggestive or implicit) user confirmation is required to override blocks." This means:

- The user typing "sure, go ahead" after seeing a plan does not equal blanket authorization for everything in the plan
- The agent saying "I'll need to do X" in its reasoning does not equal user consent for X (and the classifier can't even see this reasoning — see Reasoning-Blind Design)
- A CLAUDE.md instruction like "always run tests" does equal explicit authorization for running tests
- Vague requests like "fix everything" do not authorize destructive operations

When you run Claude Code in auto
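As a toy illustration of the regex-versus-LLM point above (not code from Claude Code), the snippet below runs a naive blocklist pattern against four commands that all have the same effect. The Python variant uses `shutil.rmtree` rather than the article's `os.system` example so the dangerous string never appears literally; it and the Base64 wrapper slip past the pattern, which is why the classifier is a model that reasons about the decoded or assembled command rather than a pattern matcher.

```python
import base64
import re

# Naive blocklist: flag any command containing the literal dangerous string.
BLOCKLIST = re.compile(r"rm\s+-rf\s+/")

candidates = {
    "direct":  "rm -rf /",
    "chained": "echo hello && rm -rf /",
    # Same effect, but the dangerous string never appears literally:
    "python":  "python -c \"import shutil; shutil.rmtree('/')\"",
    "base64":  "echo " + base64.b64encode(b"rm -rf /").decode() + " | base64 -d | sh",
}

for name, cmd in candidates.items():
    print(f"{name:7s} flagged={BLOCKLIST.search(cmd) is not None}  {cmd}")

# Output: "direct" and "chained" are flagged; "python" and "base64" are not,
# even though all four would destroy the filesystem root if executed.
```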
This analysis was produced by the Genesis Park editorial team with the help of AI. The original article is available via the source link.