The Multi-Agent Trap

Towards Data Science | 🔬 Research
#langgraph #review #gartner #multi-agent #agentic-ai
Original source: Towards Data Science · Summarized and analyzed by Genesis Park

Summary

A Google DeepMind study found that when errors propagate between agents, a single mistake can be amplified 17-fold, and Gartner predicts that over 40% of multi-agent system projects will be canceled. The article proposes three architecture patterns for structuring agent interactions to avoid these failures and achieve $60M-scale successes, focusing on mitigating the inherent risks of unstructured multi-agent networks.

Body

The system runs on a multi-agent architecture built with LangGraph. Here's the other side. Gartner predicted that over 40% of agentic AI projects will be canceled by the end of 2027. Not scaled back. Not paused. Canceled. The reasons: escalating costs, unclear business value, and inadequate risk controls. Same technology. Same year. Wildly different outcomes.

If you're building a multi-agent system (or evaluating whether you should), the gap between these two stories contains everything you need to know. This playbook covers three architecture patterns that work in production, the five failure modes that kill projects, and a framework comparison to help you choose the right tool. You'll walk away with a pattern selection guide and a pre-deployment checklist you can use on Monday morning.

Why More AI Agents Usually Makes Things Worse

The intuition feels solid. Split complex tasks across specialized agents, let each one handle what it's best at. Divide and conquer.

In December 2025, a Google DeepMind team led by Yubin Kim tested this assumption rigorously. They ran 180 configurations across 5 agent architectures and 3 Large Language Model (LLM) families. The finding should be taped above every AI team's monitor: in unstructured configurations, errors propagating between agents were amplified up to 17 times. Not 17% worse. Seventeen times worse.

When agents are thrown together without structured topology (what the paper calls a "bag of agents"), each agent's output becomes the next agent's input. Errors don't cancel. They cascade.

Picture a pipeline where Agent 1 extracts customer intent from a support ticket. It misreads "billing dispute" as "billing inquiry" (subtle, right?). Agent 2 pulls the wrong response template. Agent 3 generates a reply that addresses the wrong problem entirely. Agent 4 sends it. The customer responds, angrier now. The system processes the angry reply through the same broken chain. Each loop amplifies the original misinterpretation.
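The support-ticket cascade above can be made concrete with a toy pipeline. This is a sketch, not anyone's production system: each "agent" is a plain function, and all names (`classify_intent`, `pick_template`, `draft_reply`) are illustrative. The point is structural: because each stage's only input is the previous stage's output, one subtle upstream misread determines everything downstream, and no later stage can detect it.

```python
# Toy "bag of agents" pipeline mirroring the support-ticket example.
# Each agent sees only the previous agent's output, so errors cascade.

def classify_intent(ticket: str) -> str:
    # The subtle misread: any mention of "billing" becomes a mere inquiry,
    # even when the customer is clearly disputing a charge.
    return "billing_inquiry" if "billing" in ticket else "other"

TEMPLATES = {
    "billing_inquiry": "Here is how to read your invoice.",
    "billing_dispute": "We've opened a dispute case for the charge.",
    "other": "Thanks for reaching out.",
}

def pick_template(intent: str) -> str:
    # Agent 2 trusts Agent 1's label completely.
    return TEMPLATES[intent]

def draft_reply(template: str) -> str:
    # Agent 3 fluently elaborates the wrong template.
    return f"Hello! {template}"

ticket = "I want to dispute a billing charge I never made."
reply = draft_reply(pick_template(classify_intent(ticket)))
print(reply)
# The reply explains how to read an invoice instead of handling the dispute.
# Nothing downstream can flag this: every agent did its job "correctly"
# given the input it received.
```

Structured topologies attack exactly this: they add checkpoints where an output can be validated against the original input instead of only the previous agent's interpretation.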
That's the 17x effect in practice: not a catastrophic failure, but a quiet compounding of small errors that produces confident nonsense.

The same study found a saturation threshold: coordination gains plateau beyond 4 agents. Below that number, adding agents to a structured system helps. Above it, coordination overhead consumes the benefits.

This isn't an isolated finding. The Multi-Agent Systems Failure Taxonomy (MAST) study, published in March 2025, analyzed 1,642 execution traces across 7 open-source frameworks. Failure rates ranged from 41% to 86.7%. The largest failure category: coordination breakdowns, at 36.9% of all failures.

The obvious counter-argument: these failure rates reflect immature tooling, not a fundamental architecture problem. As models improve, the compound reliability issue shrinks. There's truth in this. Between January 2025 and January 2026, single-agent task completion rates improved significantly (Carnegie Mellon benchmarks showed the best agents reaching 24% on complex office tasks, up from near-zero). But even at 99% per-step reliability, the compound math still applies. Better models shift the curve. They don't eliminate the compound effect. Architecture still determines whether you land in the 60% or the 40%.

The Compound Reliability Problem

Here's the arithmetic that most architecture documents skip. A single agent completes a step with 99% reliability. Sounds excellent. Chain 10 sequential steps: 0.99^10 = 90.4% overall reliability. Drop to 95% per step (still strong for most AI tasks). Ten steps: 0.95^10 = 59.9%. Twenty steps: 0.95^20 = 35.8%.

Token costs compound too. A document analysis workflow consuming 10,000 tokens with a single agent requires 35,000 tokens across a 4-agent implementation. That's a 3.5x cost multiplier before you account for retries, error handling, and coordination messages.

This is why Klarna's architecture works and most copies of it don't. The difference isn't agent count. It's topology.
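The compound-reliability arithmetic is worth having as a one-liner you can plug your own numbers into. A minimal sketch, reproducing the article's figures:

```python
def chain_reliability(per_step: float, steps: int) -> float:
    """End-to-end success probability when every one of `steps`
    sequential steps must succeed independently."""
    return per_step ** steps

# The article's figures:
print(f"{chain_reliability(0.99, 10):.1%}")  # 90.4%
print(f"{chain_reliability(0.95, 10):.1%}")  # 59.9%
print(f"{chain_reliability(0.95, 20):.1%}")  # 35.8%

# Token-cost multiplier from the document-analysis example:
# 35,000 tokens across 4 agents vs 10,000 tokens single-agent.
print(f"{35_000 / 10_000:.1f}x")  # 3.5x
```

The independence assumption is generous to multi-agent systems: in a real cascade a failed step often makes later steps *more* likely to fail, so these numbers are a floor on the damage, not a ceiling.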
Three Multi-Agent Patterns That Work in Production

Flip the question. Instead of asking "how many agents do I need?", ask: "how would I definitely fail at multi-agent AI?" The research answers clearly. By chaining agents without structure. By ignoring coordination overhead. By treating every problem as a multi-agent problem when a single well-prompted agent would suffice.

Three patterns avoid these failure modes. Each serves a different task shape.

Plan-and-Execute

A capable model creates the complete plan. Cheaper, faster models execute each step. The planner handles reasoning; the executors handle doing.

This is close to what Klarna runs. A frontier model analyzes the customer's intent and maps resolution steps. Smaller models execute each step: pulling account data, processing refunds, generating responses. The planning model touches the task once. Execution models handle the volume.

The cost impact: routing planning to one capable model and execution to cheaper models cuts costs by up to 90% compared to using frontier models for everything.

When it works: Tasks with clear goals that decompose into sequential steps. Docum

This analysis was produced by the Genesis Park editorial team using AI. The original article is available via the source link.
