I asked Claude how it wants to browse the web. It built LAD (LLM-as-DOM)

hackernews | April 15, 2026, 07:14 | 📦 Open source

#ai models #chatgpt #claude #llama

Original source: hackernews · Summarized and analyzed by Genesis Park

Summary

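LAD was built to address the excessive token usage LLMs incur when browsing the web. It compresses the DOM to cut token usage to one-sixtieth and handles 90% of login and form tasks with heuristics, without an LLM. It supports multiple browser engines, including Chromium, WebKit, and iOS remote control, and provides 29 semantic tools, improving cost efficiency and compatibility relative to Playwright. It can also attach to a real browser and test while keeping the authenticated session, and is presented as a more capable open-source alternative to existing paid tools.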

Body

Test your app 60x cheaper. lad compresses your DOM so Claude never parses HTML.

Your AI agent wastes 80% of tokens reading raw HTML. A login test costs ~15,000 tokens across 4 Playwright roundtrips, and most of that is parsing DOM, not thinking.

lad compresses your page to ~100-300 tokens and navigates using heuristics. No LLM needed for login, search, or form fill. Your orchestrator (Claude, GPT) gets structured results, never HTML.

```
Traditional: Claude → Playwright → 15KB HTML → Claude parses → click → repeat (×4)
lad:         Claude → lad_browse("test login") → { success: true, steps: 3 }
```

```bash
cargo install menot-you-mcp-lad

# See what lad "sees" on your app
lad --url "http://localhost:3000/login" --extract-only

# Test a login flow (heuristics only, no LLM needed)
lad --url "http://localhost:3000/login" \
  --goal "login as [email protected] with password secret123"

# Watch it work (opens browser window)
lad --url "http://localhost:3000/login" \
  --goal "login as [email protected] with password secret123" \
  --visible
```

| Mode | Flag | Use case |
|---|---|---|
| Headless | (default) | CI/CD pipelines, automated testing |
| Visible | `--visible` | Debugging, watching what the pilot does |

```
Your App (localhost)          lad                        Claude
     │                         │                            │
     │◄── navigate ────────────┤                            │
     ├── DOM ─────────────────►│                            │
     │                         ├─ compress (85x)            │
     │                         ├─ heuristics (310ns) ──┐    │
     │                         │   no LLM needed!      │    │
     │◄── type/click ──────────┤◄──────────────────────┘    │
     │                         │                            │
     │      ... repeat ...     │                            │
     │                         ├── {success, steps} ───────►│
     │                         │      (~300 tokens)         │
```

| Tier | Strategy | Speed | Cost | When |
|---|---|---|---|---|
| 0 | Playbook replay | instant | Free | Trained flows (login, checkout) |
| 1 | `@lad/hints` | instant | Free | `data-lad` developer annotations |
| 2 | Heuristics | 310ns | Free | Login, search, form fill: 90% of actions |
| 3 | Cheap LLM | 0.4s | Free (Ollama) | Ambiguous elements, unknown pages |
| 4 | Escalate | n/a | n/a | Screenshot sent to orchestrator |

Most dev testing never hits the LLM. Heuristics parse your goal, match form fields by name/type/label, find submit buttons, and detect success, all in nanoseconds.

lad is browser-agnostic. The pilot, heuristics, and LLM reasoning never touch browser APIs directly; they operate on a compressed `SemanticView`. The actual browser is a pluggable adapter.

| Engine | Flag | Runtime | Platforms |
|---|---|---|---|
| Chromium | `--engine chromium` (default) | Chrome/Chromium install | Linux, macOS, Windows |
| WebKit | `--engine webkit` | Native WKWebView | macOS (zero install) |
| Remote (iOS) | `LAD_WEBKIT_BRIDGE=lad-relay` | iPhone WKWebView | iOS 17+ (via Nott app) |

```bash
# Chromium (default)
lad --url "https://example.com" --extract-only

# WebKit (macOS, no Chrome needed)
lad --url "https://example.com" --engine webkit --extract-only
```

Skip the headless ghost. Point LAD at your actual running Chrome and drive it inside your real authenticated session: cookies, logins, extensions, VPN, everything. Zero setup beyond a debug flag:

```bash
# 1. Start Chrome with CDP enabled
google-chrome \
  --remote-debugging-port=9222 \
  --user-data-dir="$HOME/.cache/lad-chrome"

# 2. From any MCP client, attach:
# { "tool": "lad_session",
#   "arguments": { "action": "attach_cdp",
#                  "endpoint": "http://localhost:9222" } }
```

LAD adopts every open tab into its multi-tab map so `lad_tabs_list`, `lad_click`, `lad_type`, and every other tool operate on your real browser from the first call.
Detach anytime with `lad_session action=detach`; your Chrome keeps running. Loopback-only enforcement (`localhost` / `127.0.0.1` / `::1`) is mandatory: CDP is a full RCE vector over the wire. See `docs/attach-chrome.md` for the full walkthrough, threat model, and troubleshooting.

- Real rendering differences: Safari handles flexbox, , scroll, clipboard API differently. Testing only in Chromium misses ~20% of the web.
- Zero install on macOS: WebKit comes with the OS. No 500MB Chrome download.
- System proxy: WKWebView respects macOS proxy/VPN settings automatically.
- Your protocol: the WebKit adapter uses a simple stdin/stdout JSON protocol. Adding new engines (Firefox, Electron) means writing a ~300 line bridge app.

Pilot your iPhone's real Safari engine from your desktop. LAD sends commands, your phone executes them on WKWebView, you watch it happen live.

```bash
# 1. Start the relay (shows QR code in terminal)
LAD_WEBKIT_BRIDGE=lad-relay lad --url "https://example.com" --engine webkit

# 2. Open the Nott iOS app → Settings → Connect to LAD
# 3. Scan the QR code (or paste the ws:// URL)
# 4. Your iPhone is now a remote browser engine
```

Why Remote Control?
- Real Safari: test on actual iOS WKWebView, not emulated
- Device features: touch events, Safe Area, real viewport
- Token auth: one-time 6-digit PIN, secure even on public Wi-Fi
- Auto-reconnect: exponential backoff if connection drops
- Same API: all 29 LAD tools work identically over Remote Control

```
┌─────────────────────────────────────────────────────┐
│  lad (Rust)                                         │
│                                                     │
│  SemanticView ← a11y.rs (JS injection)              │
│       │                                             │
│  pilot.rs → heuristics → LLM → action               │
│       │                                             │
│  BrowserEngine trait ── PageHandle trait            │
│       │                                             │
│  ┌────┴────┐     ┌─────┴─────┐     ┌──────┴──────┐  │
│  │Chromium │     │  WebKit   │     │   Remote    │  │
│  │Adapter  │     │  Adapter  │     │   (Relay)   │  │
│  └────┬────┘     └─────┬─────┘     └──────┬──────┘  │
└───────┼────────────────┼──────────────────┼─────────┘
        │ CDP            │ stdin/stdout     │ stdin → WS
        ▼                ▼                  ▼
   ┌─────────┐    ┌──────────────┐   ┌──────────────┐
   │ Chrome  │    │ Swift macOS  │   │ iPhone Nott  │
   │ process │    │  WKWebView   │   │  WKWebView   │
   └─────────┘    └──────────────┘   └──────────────┘
```

The `PageHandle` trait has 9 methods. That's the entire browser API surface:

```rust
#[async_trait]
pub trait PageHandle: Send + Sync {
    async fn eval_js(&self, script: &str) -> Result;
    async fn navigate(&self, url: &str) -> Result;
    async fn wait_for_navigation(&self) -> Result;
    async fn url(&self) -> Result;
    async fn title(&self) -> Result;
    async fn screenshot_png(&self) -> Result>;
    async fn cookies(&self) -> Result>;
    async fn set_cookies(&self, cookies: &[CookieEntry]) -> Result;
    async fn enable_network_monitoring(&self) -> Result;
}
```

Everything in `a11y.rs` (DOM extraction), `pilot.rs` (decision loop), and all 11 heuristic modules operates on `SemanticView`; they have no idea which engine is running.
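
The adapter idea above can be illustrated with a small sketch. This is not lad's code: `PageLike`, `MockPage`, and `describe_page` are hypothetical names, the trait is a synchronous, cut-down stand-in for three of the nine `PageHandle` methods, and errors are plain strings for brevity. It only shows how a caller can stay unaware of the engine behind the trait.

```rust
// Simplified, synchronous sketch of engine-agnostic code.
// `PageLike`, `MockPage`, and `describe_page` are hypothetical names;
// the real `PageHandle` trait is async and has nine methods.

/// A cut-down stand-in for three of the `PageHandle` methods.
pub trait PageLike {
    fn navigate(&mut self, url: &str) -> Result<(), String>;
    fn url(&self) -> Result<String, String>;
    fn title(&self) -> Result<String, String>;
}

/// In-memory adapter that pretends to be a browser engine.
pub struct MockPage {
    current_url: String,
}

impl MockPage {
    pub fn new() -> Self {
        MockPage { current_url: String::from("about:blank") }
    }
}

impl PageLike for MockPage {
    fn navigate(&mut self, url: &str) -> Result<(), String> {
        if url.is_empty() {
            return Err(String::from("empty URL"));
        }
        self.current_url = url.to_string();
        Ok(())
    }

    fn url(&self) -> Result<String, String> {
        Ok(self.current_url.clone())
    }

    fn title(&self) -> Result<String, String> {
        // A real adapter would ask the engine; the mock derives a title from the URL.
        Ok(format!("Mock page for {}", self.current_url))
    }
}

/// Engine-agnostic caller: it only sees the trait, never a concrete adapter.
pub fn describe_page(page: &mut dyn PageLike, url: &str) -> Result<String, String> {
    page.navigate(url)?;
    Ok(format!("{} <{}>", page.title()?, page.url()?))
}
```

Calling `describe_page(&mut MockPage::new(), "http://localhost:3000/login")` exercises the same code path a Chromium or WebKit adapter would, which is the property that lets the heuristics be tested without a browser.
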

```bash
# Test your login
lad --url "http://localhost:3000/account/login" \
  --goal "login as [email protected] with password test123"

# Test search
lad --url "http://localhost:3000" \
  --goal "search for 'blue t-shirt'"

# Test checkout flow
lad --url "http://localhost:3000/cart" \
  --goal "fill shipping with name=John [email protected]"

# Extract product catalog structure
lad --url "http://localhost:3000/collections/all" --extract-only
```

```yaml
# GitHub Actions
- name: Smoke test login
  run: lad --url "http://localhost:3000/login" --goal "login as [email protected] with password ci_pass" --max-steps 5
```

```bash
# Same test, both engines: catch rendering differences
lad --url "https://myapp.com/login" --engine chromium --extract-only > chromium.json
lad --url "https://myapp.com/login" --engine webkit --extract-only > webkit.json
diff chromium.json webkit.json
```

```bash
lad --url "https://staging.myapp.com/login" \
  --goal "login as [email protected] with password staging123" \
  --backend zai --model glm-4.7  # cloud LLM for complex pages
```

`llm-as-dom-mcp` turns your browser into a tool that Claude can call directly. 29 semantic tools: full Playwright parity with 60x fewer tokens.

```bash
llm-as-dom-mcp  # starts MCP server (stdio)
```

| Tool | What it does |
|---|---|
| `lad_browse` | Navigate to a URL and accomplish a goal autonomously (login, fill form, click, search) |

| Tool | What it does |
|---|---|
| `lad_extract` | Extract structured page info: elements, text, page type. Never returns raw HTML. Supports `paginate_index` / `page_size` for large pages and `include_hidden=true` opt-in |
| `lad_snapshot` | Semantic snapshot of the current page: elements with IDs for `lad_click` / `lad_type`. Like Playwright's `browser_snapshot` but 10-60x fewer tokens. Same pagination + hidden-filter params as `lad_extract` |
| `lad_jq` | Run a jq query against the current page's SemanticView JSON. Pulls subsets (e.g. `.elements \| map(select(.role == "button")) \| .[].label`) instead of the full snapshot, for 10-30x token savings |
| `lad_screenshot` | Take a base64-encoded PNG screenshot of the active page |

| Tool | What it does |
|---|---|
| `lad_click` | Click an element by its ID from `lad_snapshot` |
| `lad_type` | Type text into an element by its ID from `lad_snapshot` |
| `lad_select` | Select a dropdown option by element ID; matches by visible label first, then value |
| `lad_fill_form` | Fill multiple form fields at once and optionally submit. Keys match by label/name/placeholder |
| `lad_press_key` | Press a keyboard key (Enter, Tab, Escape, etc.). Optionally focus an element first |
| `lad_hover` | Hover over an element; triggers dropdown menus, tooltips, hover states |
| `lad_upload` | Upload file(s) to a element (Chromium CDP) |
| `lad_scroll` | Scroll the page (down/up/bottom/top) or scroll to a specific element by ID |

| Tool | What it does |
|---|---|
| `lad_dialog` | Handle JavaScript dialogs (alert/confirm/prompt): accept, dismiss, or inspect history |

| Tool | What it does |
|---|---|
| `lad_wait` | Wait for a semantic condition to be true (blocks until satisfied or timeout) |
| `lad_watch` | Continuous page monitoring: start/stop polling, diff semantic views, cursor-based event retrieval |

| Tool | What it does |
|---|---|
| `lad_assert` | Check assertions on a URL: has login form, title contains X, has button Y |
| `lad_audit` | Audit page quality: a11y (alt text, labels), forms (autocomplete), links (void hrefs) |

| Tool | What it does |
|---|---|
| `lad_back` | Navigate back in browser history |

| Tool | What it does |
|---|---|
| `lad_eval` | Evaluate arbitrary JavaScript; escape hatch for when semantic tools can't handle a specific interaction |
| `lad_network` | Inspect network traffic. Includes timing data via Performance API. Note: status codes and byte counts are unavailable for cross-origin requests due to `performance.getEntries()` limitations. Future: CDP Network domain integration. Filter by type: auth, api, navigation, asset |
| `lad_locate` | Map a DOM element back to its source file (React dev source, `data-ds`, `data-lad` attributes) |

| Tool | What it does |
|---|---|
| `lad_clear` | Clear an input field (works with React/Vue controlled components) |

| Tool | What it does |
|---|---|
| `lad_tabs_list` | List every open tab with `tab_id`, title, url, and `is_active` flag. Opera Neon `list-tabs` shape |
| `lad_tabs_switch` | Set the active tab by `tab_id`. Every other tool defaults to the active tab when `tab_id` is omitted |
| `lad_tabs_close` | Close a single tab by `tab_id` (vs `lad_close`, which kills the whole browser) |

Every interaction tool (`lad_click`, `lad_type`, `lad_snapshot`, `lad_extract`, `lad_jq`, `lad_eval`, `lad_network`, ...) accepts an optional `tab_id` param that targets a specific tab; omit it to target the active tab.

| Tool | What it does |
|---|---|
| `lad_close` | Close the browser and release all resources (kills all tabs) |
| `lad_refresh` | Reload the current page |
| `lad_session` | View or reset session state, plus `action=attach_cdp endpoint=http://localhost:9222` to attach to your real Chrome (see `docs/attach-chrome.md`) and `action=detach` to release it |

Claude Desktop config

```json
{
  "mcpServers": {
    "lad": {
      "command": "llm-as-dom-mcp",
      "env": {
        "LAD_LLM_URL": "http://localhost:11434",
        "LAD_LLM_MODEL": "qwen2.5:7b",
        "LAD_ENGINE": "chromium"
      }
    }
  }
}
```

Set `LAD_ENGINE=webkit` for WebKit on macOS.

`lad_watch` enables continuous page monitoring: your agent can observe a page over time and react to changes without polling manually.

```
Agent                        lad_watch                            Page
│                                │                                  │
├─ start(url, interval_ms) ─────►│ begin polling loop               │
│                                ├── extract SemanticView ◄─────────┤
│                                ├── diff against previous          │
│                                ├── store in ring buffer (cap 1000)│
│                                ├── MCP resource notification ────►│ (push to client)
│                                │   ... repeat every tick ...      │
│                                │                                  │
├─ events(since_seq=42) ────────►│ cursor-based retrieval           │
│◄──── [events 43..N] ───────────┤                                  │
│                                │                                  │
├─ stop ────────────────────────►│ cleanly abort                    │
```

- Ring buffer stores up to 1,000 events with monotonic sequence numbers
- Semantic diffing via `observer.rs`: detects added/removed/changed elements, value changes, disabled state transitions
- MCP resource notifications pushed to client on each non-empty diff (`watch://url`)
- Cursor-based retrieval: `since_seq=N` returns only events newer than sequence N

lad matches Playwright's tool surface with fundamentally different economics:

| Dimension | lad | Playwright MCP |
|---|---|---|
| Tools | 29 | 21 |
| Tokens per login test | ~300 | ~18,000 |
| Cost ratio | 1x | 60x |
| Decision engine | Heuristics-first (70-90% no LLM) | None: LLM parses every page |
| Output format | Semantic JSON (never raw HTML) | Raw DOM snapshots |
| Browser engines | Chromium + WebKit + iPhone (Remote) | Chromium only |
| DOM traversal | Shadow DOM + same-origin iframes | Standard DOM |

The key architectural difference: Playwright gives the LLM a DOM and asks it to figure out what to do. lad compresses the DOM, runs heuristics, and only calls the LLM when genuinely ambiguous.

Opera shipped its MCP Connector for Opera Neon in March 2026, exposing browser control to external AI clients (Claude Code, ChatGPT, Lovable, n8n). It's gated behind a $19.90/month Opera Neon subscription and requires a running Opera Neon process alongside your normal browser. lad is the self-hosted, zero-subscription, open-source alternative, with 2× the tool surface and drop-in-compatible tab management.

| Dimension | lad | Opera Neon MCP Connector |
|---|---|---|
| Tools | 29 | 13 |
| Price | Free (AGPL-3.0) | $19.90 / month |
| Modes | Headless + Visible + CDP Attach | Attached only (must run Opera Neon) |
| Transport | stdio, SSE | HTTP + OAuth PKCE via mcp.neon.tech |
| Engines | Chromium, WebKit, Remote iOS | Opera Neon only |
| Authenticated session | CDP attach to your real Chrome | Opera Neon's own session |
| CI/headless friendly | Yes | No |
| Goal-based browse | `lad_browse` + pilot heuristics (70-90% no LLM) | Not exposed via MCP |
| jq over snapshot | `lad_jq` | `tab-content-jq-search-query` |
| Tab management | `lad_tabs_list` / `lad_tabs_switch` / `lad_tabs_close` | `list-tabs` / `switch-tab` / `close-tab` |
| Hidden-element filter | Default-on (closes Brave CVE class) | Defended in prompt assembly only |
| Assert / audit / network | `lad_assert`, `lad_audit`, `lad_network` | Not available |
| JS eval escape hatch | `lad_eval` | Not available |
| Sandbox scheme blocklist | `chrome://`, `opera://`, `about:`, `devtools:`, `view-source:`, `edge:`, `brave:`, `ws:`, `wss:`, `file:`, `javascript:`, `data:`, `blob:`, `vbscript:` | Partial (blocks reads on `chrome://`, permits navigation) |

Drop-in compatibility. lad's tab-management tools mirror Opera's shape exactly: agent prompts written against Opera MCP work against lad with a `lad_` prefix swap. The `adopt_existing_pages` flag on `lad_session attach_cdp` reproduces Opera Neon's "operate
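
The sandbox scheme blocklist row in the comparison table could be approximated along the following lines. This is a sketch under assumptions, not lad's implementation: the scheme list is copied from the table, while `BLOCKED_SCHEMES`, `scheme_of`, and `is_blocked` are hypothetical names, and the choice to refuse inputs with no parseable scheme is an illustrative default. A production check would more likely rely on a full URL parser.

```rust
// Hypothetical sketch of a scheme blocklist check. The list comes from the
// comparison table; the names and the "refuse when unparseable" default are
// assumptions made for illustration, not lad's actual API.

const BLOCKED_SCHEMES: [&str; 14] = [
    "chrome", "opera", "about", "devtools", "view-source", "edge", "brave",
    "ws", "wss", "file", "javascript", "data", "blob", "vbscript",
];

/// Returns the lowercased scheme of `url`, or `None` if it has no valid scheme.
pub fn scheme_of(url: &str) -> Option<String> {
    let (scheme, _rest) = url.trim().split_once(':')?;
    let mut chars = scheme.chars();
    // RFC 3986: a scheme starts with a letter, followed by letters,
    // digits, '+', '-', or '.'.
    let first = chars.next()?;
    if !first.is_ascii_alphabetic() {
        return None;
    }
    if !chars.all(|c| c.is_ascii_alphanumeric() || matches!(c, '+' | '-' | '.')) {
        return None;
    }
    Some(scheme.to_ascii_lowercase())
}

/// True when navigation to `url` should be refused.
pub fn is_blocked(url: &str) -> bool {
    match scheme_of(url) {
        Some(scheme) => BLOCKED_SCHEMES.contains(&scheme.as_str()),
        // No parseable scheme: this sketch refuses instead of guessing.
        None => true,
    }
}
```

With this sketch, `is_blocked("JavaScript:alert(1)")` and `is_blocked("file:///etc/passwd")` return `true`, while `is_blocked("https://example.com/login")` returns `false`. Matching on the lowercased scheme is what keeps mixed-case variants from slipping through.
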

View original (hackernews)

This analysis was written by the Genesis Park editorial team with the help of AI. The original can be checked via the source link.
