Show HN: AI agents should browse your site, not call your API
hackernews
AI Models
#a2a
#agent
#protocol
#web
#web-protocol
Summary
As of 2026, website-embedded AI agents fall into four architectures: RAG bots, API-connected agents, code sandboxes, and DOM-native execution. The first three have structural limits — maintenance burden, security exposure, high latency — that make them a poor fit for consumer sites. DOM-native execution, by contrast, drives the live DOM directly inside the user's existing session, completing tasks quickly without documentation refreshes or API exposure. The author presents it as the only approach that reuses the web's existing security model and deploy cadence with no additional infrastructure cost.
Why it matters
Developer perspective
DOM-native execution requires no separate API development or server infrastructure: the agent drives the DOM directly inside the user's session, sharply reducing development burden while reusing the site's existing web security model as-is.
Researcher perspective
This is a notable case study in how AI agents interact with the web: it sidesteps the structural limits of documentation refresh and API exposure, and minimizes latency to deliver an experience close to human use of the site.
Business perspective
AI agents can be integrated into consumer sites immediately, without ongoing maintenance costs or infrastructure investment, and risk stays low because existing security and deploy cycles are preserved.
Article
The Four Architectures for Website AI Agents

A research thesis on where embedded agents are headed, what's structurally broken, and why DOM-native execution is the only architecture that composes with how the web actually ships.

rtrvr.ai | April 2026

Every production AI agent attached to a website in 2026 fits into one of four architectures. Three of them are structurally compromised — by staleness, by maintenance burden, or by security exposure. The argument is not that one implementation is better polished than another. The argument is that three of these architectures have ceilings that no amount of engineering can push through. RAG bots cannot act. API-tool agents cannot ship without a parallel maintenance org. Code-sandbox agents cannot run on a consumer website without exposing platform shape and paying sandbox cost per user per action.

Only DOM-native embedded execution grandfathers in the web's existing truth model: live HTML, user session, server-side IAM, existing deploy cadence. 2,000+ websites are already integrated with Rover. That number moves because the architecture moves — not because the pitch does.

Architecture 1: RAG Bots

Intercom Fin, Drift, every chatbot your company has shipped. Chunk your docs. Embed them. Store in a vector DB. User asks a question, similarity-search, stuff top-k chunks into an LLM prompt, stream a response. This is ~99% of customer-facing chatbots on the web today.

The ceiling

The one that matters most is the simplest: RAG bots can only talk. They cannot click a button. They cannot fill a form. They cannot finish a checkout or advance an onboarding flow. They are a search bar with a chat veneer.

Six compounding failure modes sit underneath:

- Temporal drift. Ship a feature Tuesday. Embeddings reflect Monday. The bot lies confidently about what your product does right now.
- Retrieval noise. At scale, thematic adjacency is not factual correctness. The bot grabs a chunk that sounds right and generates around it.
- Scope hallucination. Stripe's chatbot will help you with React. Your healthcare chatbot will debate philosophy. The base LLM has no native concept of "I am a payments site; decline off-topic."
- Chunking artifacts. No universal chunking strategy exists. It is domain-specific tuning you pay for forever.
- The evaluation gap. Dashboards say 87% positive. Users report the bot is an annoyance. Industry data: ~20% of users rate chatbot experiences as acceptable.
- The maintenance tax. You didn't deploy AI support. You deployed a second full-time job: re-embed, re-index, re-test, re-prompt.

RAG is a good primitive for knowledge retrieval. As a customer-facing website agent it has peaked. The data: 45% of users abandon chatbot interactions after three failed attempts; mobile rage-click rates are up 667% YoY after AI chatbot deployment. The ceiling is load-bearing, not a temporary engineering gap.

Architecture 2: API-Connected Agents

Google WebMCP, custom MCP integrations, Intercom Fin with actions. Expose your internal APIs as structured tools. Publish schemas. Let an agent call those tools. Google's WebMCP is the most aggressive version: let Chrome's agent talk to your site through exposed tool definitions.

The ceiling

Not broken in theory. Broken in operations. Three unavoidable costs:

Maintenance coupling. Modern product teams ship multiple times a day. Every meaningful API change requires a corresponding schema update, tool-description update, and re-test. You've created a second surface that must ship in lockstep with your product. Miss a sync and the agent silently breaks — and because agents are stochastic, you find out from a customer, not a test.

Security surface. Every exposed tool is an authorization question. Who can call it? Under what subscription tier? With what user context? With what rate limit? The agent now sits inside your IAM graph, and every tier change, every permission boundary, every row-level access rule must be re-encoded as agent-side policy.
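To make that coupling concrete, here is a minimal sketch of what one exposed tool might look like. The schema, tier check, and rate limit are all hypothetical — the point is that each field is a second surface that must change whenever the underlying product changes:

```typescript
// Hypothetical tool definition for an API-connected agent.
// Rename a parameter, change a tier, add a row-level rule on the
// backend, and this file must ship in lockstep — or the agent breaks.

type Tier = "free" | "pro" | "enterprise";

interface ToolDef<Args> {
  name: string;
  description: string;      // the model reads this; stale text means wrong calls
  minTier: Tier;            // IAM re-encoded as agent-side policy
  rateLimitPerMin: number;
  run: (args: Args, userId: string) => Promise<unknown>;
}

const upgradePlan: ToolDef<{ plan: "pro" | "enterprise" }> = {
  name: "upgrade_plan",
  description: "Upgrade the current user's subscription plan.",
  minTier: "pro",
  rateLimitPerMin: 3,
  run: async ({ plan }, userId) => {
    // a real server-side billing call would go here
    return { userId, plan, status: "pending" };
  },
};

// A guard the agent runtime must apply before every call —
// duplicating a check the backend already performs.
function canCall(tool: ToolDef<any>, userTier: Tier): boolean {
  const order: Tier[] = ["free", "pro", "enterprise"];
  return order.indexOf(userTier) >= order.indexOf(tool.minTier);
}
```

Note that `canCall` duplicates authorization logic the server already owns; keeping the two in sync forever is exactly the permanent cost described above.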
This is a permanent engineering org, not a one-time integration.

Client-side exposure or server-side cost. You pick your poison. Expose your API surface client-side and you've handed your platform shape to anyone who reads a network tab. Keep it server-side and you're spinning per-user sandboxes — which pushes you into Architecture 3.

WebMCP specifically: Google intermediates your users. You do the integration work. You maintain the tool schemas. And the user has the conversation with Chrome's agent, not with your site. Your checkout, your onboarding, your brand moment is mediated by a third party whose incentives are not aligned with yours. This is the 2010s SEO trade again, with higher stakes.

This architecture will exist for backend-to-backend agent coordination. As the front door between an AI agent and a consumer website, it imposes a cost structure — ship-in-lockstep + permanent IAM surface + exposure risk — that only a small minority of sites can afford.

Architecture 3: Code-Writing Sandbox Agents

Cloudflare Agent Lee (launched April 15, 2026), Codemode patterns. Agent Lee is Cloudflare's in-dashboard AI assistant. The mechanism: rather than presenting tool definitions directly to the model, Agent Lee uses Codemode to convert tools into a TypeScript API and asks the model to write code that calls it. Generated code runs in a sandboxed Durable Object that acts as a credentialed proxy — API keys never appear in the generated code, read ops proxy directly, write ops gate behind user approval. The numbers: ~18,000 daily users, ~250K tool calls/day across DNS, Workers, R2, Registrar, Cache, Tunnels, and more.

Credit where due — this is elegant. For Cloudflare's problem.

Why it doesn't generalize

Cloudflare's problem is a developer dashboard: one company, one platform, one permission graph. Every property that makes Agent Lee work depends on Cloudflare owning the entire stack.
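The credentialed-proxy pattern can be sketched roughly as follows. The names and types are invented for illustration — this is not Cloudflare's code — but the shape matches the description: tools surfaced as a typed TypeScript API, calls routed through a proxy that holds the credentials, writes gated behind an approval hook:

```typescript
// Rough sketch of the code-sandbox pattern (all names hypothetical).
// Generated code never sees credentials; it calls a typed facade,
// and the proxy decides which operations need user approval.

type Op = { kind: "read" | "write"; path: string; body?: unknown };

class CredentialedProxy {
  private auditLog: Op[] = [];
  constructor(private approve: (op: Op) => boolean) {}

  call(op: Op): { ok: boolean; op: Op } {
    this.auditLog.push(op);
    if (op.kind === "write" && !this.approve(op)) {
      return { ok: false, op }; // write gated behind user approval
    }
    // a real implementation would attach API keys here, server-side
    return { ok: true, op };
  }
}

// The typed API the model-generated code is allowed to import.
function makeDnsApi(proxy: CredentialedProxy) {
  return {
    listRecords: () => proxy.call({ kind: "read", path: "/dns/records" }),
    createRecord: (body: unknown) =>
      proxy.call({ kind: "write", path: "/dns/records", body }),
  };
}
```

Even in this toy form the consumer-site problem is visible: every write path routes through `approve`, which on a dashboard is a feature and on a checkout page is a modal.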
For a consumer website, the same architecture fails on four axes:

Sandbox latency is not consumer-grade. Spinning a sandboxed execution environment per request is fine for developers troubleshooting at 2am. It is not fine for a user trying to check out in under 10 seconds. The LLM-writes-code-then-executes loop adds seconds per action.

Approval gates break consumer flow. Every write operation requires a confirmation dialog. On a developer dashboard: a feature. On a checkout page: death. Consumers will not click through modals to complete a purchase they asked the agent to do.

Platform shape becomes a typed API exposed to the LLM. Maintenance debt and security debt at the same time. Every internal capability becomes a documented surface the model reasons about. You don't ship a feature; you ship a feature plus its type signature plus its sandbox permission profile.

Per-user server cost. Each interaction burns a sandboxed DO plus LLM tokens for code gen plus tool-call round-trips. Cloudflare's users pay $20–$20K/month — that's fine. For a consumer site where the visitor is a free browser user, unit economics are broken before you start.

Code-writing sandbox agents are the right architecture for developer platforms — high-consequence ops, developer users, approval gates as a feature. They are the wrong architecture for consumer websites.

The generative UI trap

Agent Lee and CopilotKit both push generative UI — the agent dynamically composes charts, tables, forms, and cards at runtime. CopilotKit built an entire protocol (AG-UI) around it, adopted by Google, LangChain, and AWS. Three reasons this is wrong for consumer websites:

- Token cost compounds per turn. Every UI-rendering interaction costs 2–5x the tokens of text, plus the rendering pipeline, plus re-renders on state change. On a paid dashboard: fine. On a consumer site where every visit is a cost center: not.
- It is a maintenance surface disguised as a feature. Good generative UI needs skill prompts, block libraries, brand-matched style tokens, responsive breakpoints, accessibility audits on LLM-composed output. Most deployments do none of these.
- Your website already has a UI. It is tested, designed, conversion-optimized, and A/B-validated. Letting an LLM improvise UI blocks on top of that — for one user, at checkout, at the moment of highest intent — undoes all of that work. Every user sees a slightly different, untested surface.

Rover takes the opposite position: don't replace your UI. Drive it. The site owner already decided what the right button, form, and flow are. The agent's job is to get the user there faster.

Architecture 4: DOM-Native Embedded Execution

One tag. The runtime reads the live DOM and accessibility tree, plans the next action, and executes directly in the user's browser — in the user's existing authenticated session, on the site's own domain. No screenshots. No VMs. No server-side sandbox. No exposed API schema.

Seven structural properties

These are not features. They are consequences of DOM-native execution that the other three architectures cannot replicate without rebuilding as Rover.

1. Zero documentation maintenance. The DOM is the source of truth. When your team ships a new button at 2pm, Rover sees it at 2:00:01pm. No re-embedding. No schema update. No chunk re-index.
2. Zero API exposure. No backend tools to spec. No TypeScript types to publish. No client-side API keys. The agent acts through the same interface a human does — the rendered page. Your platform shape stays private.
3. IAM grandfathered in. The user is already signed in. The agent acts inside that session. Whatever the user can see and do, the agent can. Whatever they cannot, it cannot. Your existing server-side IAM — every tier, every permission, every row-level rule — is the agent's authorization layer for free.
4. Zero deploy dependency. Ship your product at whatever cadence you want. Rover doesn't need a release.
You don't version your agent against your product. The agent reads whatever shipped.
5. Sub-second actions. No screenshot round-trip. No vision model inference. No sandbox spin-up. Rover identifies the target element by semantic identity (ARIA label, role, data attributes) and dispatches a native DOM event. Milliseconds, not seconds.
6. Bounded scope by construction. The agent can only do what the DOM exposes. It cannot answer React questions on your payments page — not because of a guardrail prompt, but because there is no React documentation in its input. Scope isn't policy; it's architecture.
7. Client-side execution. User data never leaves the browser except through the same network calls your site already makes. Privacy, compliance, and residency stories are inherited. No new data plane to audit.

Head-to-head

| | RAG bots | API-tool agents | Code sandboxes | Rover (DOM-native) |
|---|---|---|---|---|
| Can act on the page | No | Via exposed APIs | Via generated code | Yes (native DOM) |
| Maintenance per ship | High (re-embed) | High (schema sync) | Medium (type sync) | Zero |
| API exposure risk | None | High | Medium (platform shape to LLM) | None |
| IAM integration cost | N/A | Rebuild per tool | Rebuild in sandbox proxy | Free (inherits session) |
| Latency per action | ~1s (retrieval) | 0.5–2s (API call) | 2–8s (code gen + sandbox) | <200ms |
| Per-user cost | Moderate | Moderate | High (sandbox + tokens) | Low (client-side) |
| Scope guarantee | Prompt policy | Tool whitelist | Code classifier + gate | Architecture (DOM boundary) |
| Consumer-grade UX | Yes (but only talks) | No (auth flows, latency) | No (approval modals) | Yes |

Validation

- #1 on Halluminate WebBench at 81.39%, ahead of OpenAI Operator and Anthropic Computer Use
- 25,000+ users, 3M+ executed workflows
- 2,000+ websites integrated via the embed, Chrome extension, cloud API, and MCP server
- Open source under FSL-1.1-Apache-2.0 — site owners can read every line that runs in their users' browsers

What this unlocks

When the runtime is DOM-native, embedded, and session-native, intent-to-outcome distance collapses:

- Prompt-to-checkout. "Buy the Pro plan" — Rover navigates, selects, fills, submits. In the user's session, on the site's domain.
- Guided onboarding. "Show me how to set up my first campaign" — Rover clicks alongside the user, advancing the real product UI.
- Form completion. Multi-step forms become conversational. 40% less drop-off in production deployments.
- Cross-site handoffs. Workflows that span Rover-enabled sites with aggregated lineage.

Where this is going

The agent-web interaction layer is being built in public — one Agents Week at a time. Cloudflare is solving content, payment, and identity edges. Google is pushing A2A for agent-to-agent. Anthropic is driving MCP for agent-to-tool. All of these layers matter. None of them is the execution layer.

The execution layer — the one that turns "I want to do X on your site" into X, on your site, inside the user's session, at sub-second latency, without a parallel engineering org — is DOM-native and embedded. The other three architectures will exist. RAG will keep answering questions. API-tool agents will run backend-to-backend. Code sandboxes will own developer platforms. None of them will own the consumer website agent layer, because none of them composes with how the web actually works.
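As a closing illustration, the target-selection step the thesis attributes to DOM-native runtimes (match by role, accessible name, and data attributes rather than by pixels, then dispatch a native event) can be sketched against a minimal element interface. The interface and the scoring heuristic below are invented for illustration and are not Rover's actual code:

```typescript
// Illustrative sketch of semantic target selection (not Rover's
// implementation). Elements are matched by semantic identity —
// role, accessible name, data-* attributes — so the selection
// survives visual redesigns that leave semantics intact.

interface SemanticNode {
  role: string;                  // e.g. "button", "textbox"
  name: string;                  // accessible name (ARIA label / text content)
  data: Record<string, string>;  // data-* attributes
}

interface Intent {
  role: string;
  name: string;
}

// Score a candidate against the intent; higher is better.
function score(node: SemanticNode, intent: Intent): number {
  let s = 0;
  if (node.role === intent.role) s += 2;
  if (node.name.toLowerCase().includes(intent.name.toLowerCase())) s += 3;
  return s;
}

function pickTarget(nodes: SemanticNode[], intent: Intent): SemanticNode | null {
  let best: SemanticNode | null = null;
  let bestScore = 0; // zero-score candidates are never selected
  for (const n of nodes) {
    const s = score(n, intent);
    if (s > bestScore) {
      best = n;
      bestScore = s;
    }
  }
  // In a browser, the runtime would now dispatch a native DOM event
  // (e.g. a click) on the chosen element — no screenshots, no sandbox.
  return best;
}
```

Because matching keys on the accessibility tree rather than layout, the same lookup keeps working when the button moves or is restyled, which is the mechanism behind the "zero documentation maintenance" claim.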