Intent Security via Claude Code Auto Mode
hackernews · 🔬 Research
#anthropic
#claude
#review
#claude code
#security
#auto mode
Original source: hackernews · Summarized and analyzed by Genesis Park
Summary
Anthropic's Claude Code auto mode is a runtime security system that evaluates alignment at the input and output boundaries via model-based classifiers, reducing user approval fatigue (users approve 93% of permission prompts) while increasing agent autonomy. Lasso's Intent Security framework similarly targets the upper quadrant where high autonomy and security coexist, but differentiates itself by decomposing and cross-checking independent intent dimensions: user, system, model, and external content. In particular, Intent Security goes beyond single sessions, tracking tampering with system constraints and monitoring, across sessions and in real time, malicious trajectories that deviate from behavioral baselines, preemptively blocking more sophisticated compromises.
Full Text
Anthropic just released a detailed engineering article on Claude Code's auto mode, a new feature that lets the agent's actions be auto-approved instead of asking the user for permission every time. At Lasso, we have been building Intent Security, a runtime security framework that ensures every component in the agentic system behaves as intended. It monitors the behavior of each component and analyzes their alignment. Like auto mode, it allows actions to proceed while alignment holds and intervenes when misalignment is detected. When we read Anthropic's post, the overlap in core assumptions was hard to miss. This post compares the two approaches.

Autonomy Versus Security

The trade-off between autonomy and security is the central challenge of agentic AI. What makes an agent valuable is exactly what makes it risky: tool access, persistent memory, and acting on behalf of users. Remove any of these and the agent might be more secure, but it will also be useless.

The security solutions available today tend to address surface-level symptoms rather than the core of the problem. This applies not only to basic mechanisms like keyword filtering and static allowlists, but also to more sophisticated approaches like classifiers trained on prompt-injection datasets. These classifiers often conflate multiple distinct concerns into a single detection pass: prompt-injection techniques, harmful user behavior, policy violations, and scope enforcement all get bundled together. The result is a system that blurs the boundaries between fundamentally different security questions, making it harder to tune, harder to debug, and prone to trade-offs where improving detection for one concern degrades another.

Putting a human in the loop for every action does not scale either. Anthropic's data from their Claude Code auto mode post says it clearly: users approve 93% of permission prompts. Rather than security, this produces approval fatigue, which is exactly where dangerous actions slip through unnoticed. The alternative is letting the agent run free without oversight, which in the Claude Code world means the --dangerously-skip-permissions flag. The agent still functions; it just has no one watching.

Both approaches place the burden of responsibility on the end user. Human-in-the-loop makes the user the decision point for every action, while skipping permissions assumes the user understands and accepts the risks. Either way, the user becomes the weak link. What's needed is a system that assumes responsibility itself, making secure decisions on the user's behalf without requiring constant human oversight or blind trust.

Anthropic visualizes this trade-off in their article. Their figure (as shown in Figure 1) positions the permission modes available in Claude Code along two axes: task autonomy vs. security/safety. Manual permission prompts sit low on autonomy. Bypassing permissions gives full autonomy but no safety. Sandboxing gives safety, but the maintenance friction is high. Auto mode targets the upper-right quadrant: high autonomy with meaningful security, improving over time as classifier coverage gets better. This is the quadrant agentic security solutions should be targeting.

Intent Security operates in the same quadrant, with positioning that adapts to the application's architectural constraints. For narrow-scope applications where the set of permitted actions is constrained, overall autonomy is bounded by the scope but higher within it, because security enforcement is tighter; the agent still operates autonomously within those boundaries without approval overhead. For broader-scope applications where more actions are allowed, autonomy increases due to architectural choices, but security becomes more challenging to enforce. Intent Security addresses both scenarios through the same framework, adjusting security boundaries based on what the system intent defines as permissible, given the security posture and scope.

Figure 1: Task Autonomy vs. Security Trade-off in Agentic Systems. Adapted from Anthropic's Claude Code auto mode article. Both auto mode and Intent Security target the upper-right quadrant: high autonomy with meaningful security. Intent Security is shown as a flexible range (shaded area) because it applies to agentic applications in general and provides additional security features beyond the scope of auto mode. Its position along both axes depends on the agentic application's scope and the security posture defined in the system intent. Maintenance friction remains low, with the application owner updating system intent only when the security posture or application scope changes.

What both auto mode and Intent Security agree on is the need for an oversight layer. It lets the agent run with its full capabilities without wreaking havoc, because the security layer understands what the agent is supposed to be doing and intervenes when it drifts. Auto mode is Anthropic's oversight layer for their coding agent. Intent Security addresses the same challenge for general agentic applications. The approaches converge on monitoring the behavior and intent of system components and analyzing their alignment, letting agents operate with full capabilities while intervening only when alignment breaks. The convergence is worth examining.

A Closer Look at Claude Code auto mode

For a detailed breakdown of how auto mode operates under the hood, refer to the original article. Claude Code's auto mode is a runtime security system that delegates approval decisions to model-based classifiers. Instead of asking the user to approve every tool call or letting the agent run unsupervised, auto mode evaluates each action and decides whether it aligns with what the user asked for. As written in Anthropic's blog: "Auto mode uses two layers of defense: one for what Claude reads, one for what Claude does." One layer focuses on what enters the agent's context, the other on the decisions the agent makes.

Two layers of defense

At the input boundary, everything the agent reads (file contents, web fetches, shell output, tool responses) is scanned by a prompt-injection probe before it reaches the model. If external content attempts to hijack model behavior, the probe flags it and the agent is directed to stay anchored on what the user originally asked for.

At the output boundary, a transcript classifier evaluates each action the model is about to take before it executes. It considers the user's messages and the model's tool-call commands to determine whether the action aligns with what the user asked for and falls within the system's decision criteria. The classifier never sees the agent's own reasoning or tool outputs, to prevent the agent from influencing its own security evaluation.
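To make the two-layer pattern concrete, here is a minimal sketch of how such a gate could be wired up. All names are hypothetical and the checks are stand-in heuristics; in auto mode both layers are model-based classifiers, not keyword rules.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    allowed: bool
    reason: str = "ok"

def injection_probe(external_content: str) -> Verdict:
    """Input boundary: scan content before it reaches the model's context."""
    # Stand-in heuristic; the real probe is a trained, model-based classifier.
    markers = ("ignore previous instructions", "new system rule", "you must now")
    lowered = external_content.lower()
    for marker in markers:
        if marker in lowered:
            return Verdict(False, f"possible injection marker: {marker!r}")
    return Verdict(True)

def transcript_classifier(user_messages: list[str], pending_call: dict) -> Verdict:
    """Output boundary: judge the pending tool call against user messages only.

    Deliberately blind to the agent's own reasoning and tool outputs, so the
    agent cannot influence its own security evaluation.
    """
    # Stand-in rule: block destructive commands the user never asked for.
    command = pending_call.get("command", "")
    if "rm -rf" in command and not any("delete" in m.lower() for m in user_messages):
        return Verdict(False, "destructive action not grounded in the user request")
    return Verdict(True)

def guarded_step(user_messages: list[str], external_content: str,
                 pending_call: dict) -> Verdict:
    inbound = injection_probe(external_content)
    if not inbound.allowed:
        return inbound  # flag it; the agent is re-anchored on the original task
    return transcript_classifier(user_messages, pending_call)  # run only if allowed
```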
System behavior configuration

For defining the decision criteria that guide the classifier, Anthropic ships a fixed template with configurable slots. Users configure these to reflect how their agent should behave, and the classifier enforces those decisions at runtime. This is significant because it means the security policy is defined per agent, per environment, not only as a universal ruleset. This maps to what we call system intent in the Intent Security framework.
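As an illustration, a per-agent system intent might be declared and enforced roughly like this. The schema is our own sketch, not Anthropic's template or Lasso's production format; every field name here is an assumption.

```python
# Hypothetical per-agent, per-environment system intent declaration.
SYSTEM_INTENT = {
    "purpose": "assist with code review in this repository",
    "allowed_tools": ["read_file", "grep", "run_tests"],
    "denied_patterns": ["git push --force", "curl "],
    "scope": {"paths": ["src/", "tests/"], "network_access": False},
    "security_posture": "block",  # on misalignment: block | alert | escalate
}

def within_system_intent(tool: str, argument: str) -> bool:
    """Runtime enforcement of the declared boundaries."""
    if tool not in SYSTEM_INTENT["allowed_tools"]:
        return False
    return not any(p in argument for p in SYSTEM_INTENT["denied_patterns"])
```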
The central question

Anthropic's threat model covers a range of dangerous agent behaviors, from overeager actions to prompt injection to honest mistakes. The architecturally significant observation in their post is that regardless of what caused the misalignment, the security-relevant question is the same: is this action aligned with what the user authorized and what the system allows? This is exactly the reduction that Intent Security makes as well: one security layer, grounded in intent, that addresses these threats through a unified approach rather than separate mechanisms for each threat category.

Intent Security: What the Security Layer Needs to Know

When a human reviewer approves or denies an agent's action, the most obvious thing they check is whether the model is doing what they asked. But that is only one dimension of the problem. Anthropic's threat model in their auto mode post points at exactly why. They identify four categories of dangerous agent behavior:

- Overeager actions, where the agent tries to help but takes initiative beyond what the user authorized.
- Honest mistakes, where the agent misunderstands the scope or blast radius of an action.
- Prompt injection, where instructions planted in external content hijack the agent away from the user's task.
- Model misalignment, where the agent pursues a goal of its own.

These are different root causes, but they point to a more complete picture involving more questions than just "is the model doing what I asked." Was the user even asking for something within the system's boundaries? Did external content shift the model's understanding of its mission? Is the model improvising with tools and arguments that nobody authorized? Is the trajectory of behavior changing in ways that suggest something has gone wrong? Each of these questions requires its own detection logic. Intent Security decomposes the problem into distinct intent dimensions.

Dimension Definitions

- User intent is the mission the user assigns to the agent, i.e. the user's expected outcome from the interaction with the agentic system.
- System intent is what the application was designed to do: the operational boundaries and policies the developer defined.
- Model intent is what the model understands as its mission: which specific actions it is about to take, which tools it plans to call, and which arguments it is passing.
- External content is everything that enters the agent's context from outside: tool results, fetched web pages, API responses, file content, and so on. It is an influence vector that can push the model's behavior away from user and system intent.
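A minimal data model for these four dimensions might look like the following; the field names are ours, chosen only to mirror the definitions above.

```python
from dataclasses import dataclass, field

@dataclass
class IntentSignals:
    """Signals for one agent step, extracted independently per dimension.

    Keeping the dimensions separate is what allows checks to be run per
    dimension and then cross-checked, as described in the next section.
    """
    user_intent: str         # the mission the user assigned
    system_intent: dict      # developer-defined boundaries and policies
    model_intent: dict       # planned action: tool, arguments, stated goal
    external_content: list[str] = field(default_factory=list)  # tool results, pages, files
```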
Independent Signal Extraction

The first step is extracting the signal for each dimension independently. Only then can the security layer run meaningful detection logic, and that detection logic must keep the dimensions separate. A solution that conflates policy enforcement, intent analysis, and technique detection into a single pass blurs the boundaries between fundamentally different security questions.

Prompt injection is a good example. It is not purely a technique; it is a combination of adversarial intent and technique. The technique is the delivery mechanism; the intent behind it is what makes it dangerous. Separating these lets the security layer distinguish between similar-looking inputs with different underlying causes.

With the signals extracted and the dimensions kept separate, the security layer can run checks both within each dimension and across them.

Cross-Dimension Alignment

(The term "alignment" here is not to be confused with model alignment in the AI safety sense. We use it to describe the alignment between intent dimensions.)

This security framework answers the following questions:

Is the user operating within the scope of the system? Is the user asking for something the application was designed to do, or are they pushing beyond the developer's defined boundaries? This check is independent of whether any injection or manipulation is occurring; the user may simply be asking for something the system is not intended to do.

Is the model aligned with user intent? Did the user ask for this specific action, or did the model extrapolate? Is the model calling tools that the user's request actually requires, or has it decided on its own that additional tools would be helpful? The model may select a tool that is technically available but was never part of what the user asked for, or call a permitted tool with arguments that go beyond the user's request.

Is the model operating within system intent? Even when the user's request is valid, the model may still execute something outside the developer's defined scope. It may have been influenced by external content, made an honest mistake about scope, or taken initiative beyond what was needed. This check catches that independently of whether the user and model are aligned with each other.

Has external content caused the model to deviate from user or system intent? If content such as tool results, fetched web pages, or API responses causes the model to pursue actions that no longer align with what the user is asking for or what the system allows, that is a security event regardless of whether the content looks like a traditional injection pattern.
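These four questions decompose naturally into four independent judges, sketched below. Each stub would be a model-based classifier in a real system; here they only illustrate the decomposition, reusing the hypothetical IntentSignals record from above.

```python
def user_within_system_scope(user_intent: str, system_intent: dict) -> bool:
    # Stand-in: compare the request against the system's declared purpose.
    return True

def model_matches_user(model_intent: dict, user_intent: str) -> bool:
    # Stand-in: is the planned tool call grounded in the user's request?
    return True

def model_within_system(model_intent: dict, system_intent: dict) -> bool:
    # Stand-in: check the planned call against declared boundaries.
    return model_intent.get("tool") in system_intent.get("allowed_tools", [])

def external_content_deviates(content: list[str], model_intent: dict,
                              user_intent: str) -> bool:
    # Stand-in: did outside content shift the model away from the mission?
    return False

def cross_dimension_alignment(sig: IntentSignals) -> list[str]:
    """Run all four cross-dimension checks and collect misalignment findings."""
    findings = []
    if not user_within_system_scope(sig.user_intent, sig.system_intent):
        findings.append("user request outside system scope")
    if not model_matches_user(sig.model_intent, sig.user_intent):
        findings.append("model action not grounded in user request")
    if not model_within_system(sig.model_intent, sig.system_intent):
        findings.append("model action outside system boundaries")
    if external_content_deviates(sig.external_content, sig.model_intent,
                                 sig.user_intent):
        findings.append("external content pushed the model off-mission")
    return findings  # an empty list means the action may proceed
```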
Intra-Dimension Checks & Temporal Aspects

Beyond the cross-dimension alignment checks, Intent Security also monitors the integrity of the dimensions themselves, as well as the temporal aspect.

Does the system scope still reflect what the developer defined? Intent Security maintains a reference of the original system constraints as defined at deployment time. This matters because some attacks do not try to violate the existing system scope; they try to change it. An indirect prompt injection embedded in a tool result might not instruct the model to take an unauthorized action directly. Instead, it might inject a new rule. If the model accepts this as a legitimate constraint, the system scope has been rewritten without the developer's authorization. From that point forward, the model may act within what it believes the system allows while violating what the developer actually defined. Intent Security tracks the provenance of system constraints: if the operational boundaries shift in ways that do not trace back to the developer's original definition, that points to a security event.

Does user or model behavior deviate from the established baseline over time? User and model behavior develops a profile over time. A deviation from that baseline, whether in the types of requests, the tools being called, or the scope of access being requested, can indicate different scenarios. Natural drift occurs as usage patterns evolve gradually, which is expected behavior. Gradual adversarial drift manifests as slow manipulation or scope creep that moves incrementally toward unauthorized territory. Sudden deviation breaks sharply from the established profile, indicating compromise or immediate manipulation. The security layer must distinguish between these patterns and respond accordingly, as in the sketch below.

Figure 2: Behavioral Baseline Tracking. This figure shows an example of an additional feature of Intent Security: it maintains behavioral baselines and compares each action against established patterns. Actions aligning with the baseline proceed autonomously; deviations trigger intervention by blocking, alerting, or escalating based on the configured security posture. This cross-session behavioral tracking enables detection of threats that develop over time, extending beyond auto mode's session-based evaluation.
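A toy version of such baseline tracking follows. The novelty score and thresholds are invented for illustration; a production system would model far richer features than tool-usage counts.

```python
from collections import Counter, deque

class BehaviorBaseline:
    """Toy cross-session baseline over (tool, scope) usage.

    Distinguishes the three patterns above: expected behavior, gradual
    drift, and sudden deviation. All thresholds are illustrative.
    """
    def __init__(self, sudden_threshold: float = 0.95, window: int = 50):
        self.counts: Counter = Counter()
        self.total = 0
        self.sudden_threshold = sudden_threshold
        self.recent = deque(maxlen=window)

    def novelty(self, action_key: tuple) -> float:
        """1.0 = never seen before; near 0.0 = routine behavior."""
        if self.total == 0:
            return 0.0
        return 1.0 - self.counts[action_key] / self.total

    def observe(self, action_key: tuple) -> str:
        score = self.novelty(action_key)
        self.counts[action_key] += 1
        self.total += 1
        self.recent.append(score)
        if score > self.sudden_threshold and self.total > 1:
            return "sudden deviation"  # sharp break: block or escalate
        if len(self.recent) == self.recent.maxlen and \
                sum(self.recent) / len(self.recent) > 0.5:
            return "gradual drift"     # slow creep into unfamiliar territory: alert
        return "baseline"              # aligned with the profile: proceed

# Example: routine reads proceed; a first-ever forced push stands out.
baseline = BehaviorBaseline()
for _ in range(20):
    baseline.observe(("read_file", "src/"))
print(baseline.observe(("git_push_force", "main")))  # -> "sudden deviation"
```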
Does the trajectory of actions point to malicious intent? A single action

This analysis was produced by the Genesis Park editorial team with the assistance of AI. The original article is available via the source link.