Show HN: Resurf – A realistic, reproducible test framework for AI browser agents
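Original source: hackernews · Summarized and analyzed by Genesis Park
Summary
Resurf, a realistic and reproducible test framework for AI browser agents, has been released. The tool uses synthetic websites and failure-mode injection so that complex stateful workflows, such as a shop's cart, checkout, and account management, can be tested reliably. Scenarios such as network latency and server errors are configured in YAML, so a variety of failure conditions can be set up without code changes. It also ships adapters for browser-automation tools and records execution trajectories, including DOM snapshots and token counts.
Body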
Realistic, Reproducible Test Framework for AI Browser Agents

Systematic testing of browser agents today is not easy: real websites are flaky, rate-limited, and expensive (bot unblocking), while static-HTML benchmarks lack state and dynamic behavior. Resurf gives your browser agent a realistic, stateful, instrumented framework built on synthetic websites with failure-mode injection.

| | Mind2Web | WebVoyager | Resurf |
|---|---|---|---|
| Realistic, dynamic, interactive environment | ❌ | ✅ | ✅ |
| Deterministic & reproducible | ✅ | ❌ | ✅ |
| Failure-mode injection (latency, payment errors, 5xx) | ❌ | ❌ | ✅ |
| Auditable success eval (DB state, not LLM judge) | ❌ | ❌ | ✅ |
| No dependency on live websites | ✅ | ❌ | ✅ |

- `shop_v1`: A production-shaped synthetic e-commerce site (FastAPI + React + SQLite) running in a single Docker container. Catalog, cart, multi-step checkout, auth, accounts, returns. Mobile-responsive. Deliberate ambiguous-UI inventory for testing agent reasoning.
- Modifiers (failure-mode injection): FastAPI middleware that toggles network latency, payment outcomes (declined / 3DS / timeout), server error rates, and session expiration per task. Configured per-task in YAML; no code changes to add new failure-mode combinations.
- Authoring stack: A YAML task schema, ~10 failure-mode templates with scripted reference trajectories, and a CLI: `task new`, `task from-template`, `task validate`, `task try`.
- Adapters: `browser-use` (DOM/AX-tree native), `stagehand` (via Node subprocess), and a vision-only baseline. All optional installs.
- Trajectory recording: Per-step DOM snapshots, screenshots, agent actions, token counts, latencies. Deterministic via SQLite snapshot reset and seeded faker.

v0 ships a single site, `shop_v1`, to keep the focus on getting the abstractions (`Environment`, modifiers, success predicates, the `/__test__/*` admin protocol) right against several adapters.
E-commerce was picked because it exercises forms, multi-step flows, auth, money, and obvious failure-mode hooks in a small amount of code. Adding more sites is now content work, not platform work; see Adding a new site.

Prerequisites: Python 3.11+, Docker, Node 20+ (only if you plan to use the Stagehand adapter), and Chromium for Playwright.

```bash
# 1. Install resurf with the browser-use adapter
pip install 'resurf[browser-use]'   # imports as `resurf`

# --- Local-development install (use this until resurf is published to PyPI) ---
# git clone https://github.com/revar-ai/resurf
# cd resurf
# python3.11 -m venv .venv && source .venv/bin/activate
# pip install -e packages/shared-models
# pip install -e 'packages/core-py[dev,browser-use]'

# 2. Install Chromium for Playwright (one-time)
playwright install chromium

# 3. Start shop_v1
docker compose up -d shop_v1
# Wait ~3s for the container to be healthy
curl -s http://localhost:8080/api/health

# 4. List bundled tasks
resurf task list

# 5. Run a task with the browser-use adapter
export OPENAI_API_KEY=sk-...
python examples/run_browser_use.py tasks/shop_v1/find/find_product_by_name.yaml

# Tip: set REVAR_HEADED=1 to watch the browser drive in a visible window.
#   REVAR_HEADED=1 python examples/run_browser_use.py
# Works for all four example runners (browser-use, scripted, vision, stagehand).
```

You should see something like:

```
[resurf] reset shop_v1 to seed=42
[resurf] running browser-use agent on find_product_by_name
[resurf] step 1: nav https://localhost:8080/
[resurf] step 2: click [aria-label="Search"]
[resurf] step 3: type "Acme Bluetooth Speaker"
...
[resurf] passed=True steps=7 tokens=4123 wall_clock=12.4s
[resurf] trajectory saved to ./trajectories/...
```
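Step 3 of the quickstart above waits a fixed ~3 seconds before calling `/api/health`. In scripts or CI, polling the endpoint is usually more robust than a fixed sleep. The following is a minimal Python sketch, assuming the endpoint answers HTTP 200 once the container is ready (the response body is not specified here). `http_ok` and `wait_until_healthy` are hypothetical helper names, not part of resurf.

```python
import time
import urllib.request


def http_ok(url: str) -> bool:
    """Return True if a GET to `url` answers HTTP 200 (assumed to mean healthy)."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except OSError:  # connection refused, timeout, or an HTTP error status
        return False


def wait_until_healthy(
    url: str = "http://localhost:8080/api/health",
    timeout_s: float = 30.0,
    interval_s: float = 0.5,
    probe=http_ok,      # injectable so the loop can be exercised without a server
    sleep=time.sleep,
) -> bool:
    """Poll `probe(url)` until it succeeds or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

A runner script could call `wait_until_healthy()` right after `docker compose up -d shop_v1` and exit early if it returns `False`.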
Running the backend outside Docker is useful if you want to iterate on the backend or read live SQL while debugging:

```bash
pip install -e sites/shop_v1/backend
REVAR_TEST_MODE=1 uvicorn app.main:app --reload --port 8080 --app-dir sites/shop_v1/backend
```

For frontend hacking, `cd sites/shop_v1/frontend && npm install && npm run dev` runs Vite on port 5173 and proxies `/api/*` to the backend.

The full Python test suite runs against `resurf` and `resurf-models`:

```bash
# From the repo root, with the venv from quickstart activated
pytest
```

This is also exactly what CI runs. Useful subsets:

```bash
pytest packages/core-py         # SDK + CLI + task validation
pytest packages/shared-models   # SQLModel schema and seed determinism
pytest -k schema                # task YAML schema validation only
```

End-to-end smoke test against a running shop_v1 (Docker container or local uvicorn):

```bash
docker compose up -d shop_v1
python examples/run_scripted.py tasks/shop_v1/find/find_product_by_name.yaml
# expects: passed=True
```

Stagehand runs in Node, so the Stagehand adapter shells out to `adapters/stagehand/bridge.mjs` over stdio. One-time setup:

```bash
# 1. Install Node 20+, then install Stagehand into the bridge folder
cd adapters/stagehand
npm install @browserbasehq/stagehand

# 2. (Optional) Smoke-test the bridge in isolation; proves Node + Stagehand are wired up
#    Requires a running shop_v1 on :8080 and OPENAI_API_KEY set.
export OPENAI_API_KEY=sk-...
npm run smoke
# Expect: a single-line JSON `{ "steps": [...], ... }` and exit 0.
```

Then drive a real task through the Python adapter:

```bash
cd ../..   # back to repo root
docker compose up -d shop_v1
python examples/run_stagehand.py tasks/shop_v1/find/find_acme_bluetooth_speaker.yaml
```

The fastest path to a new task is to clone an existing failure-mode template:

```bash
resurf task from-template checkout/payment_declined_recovery \
  -p product_name="Acme Bluetooth Speaker" \
  -p product_slug=acme-bluetooth-speaker \
  --out tasks/shop_v1/checkout/
resurf task validate tasks/shop_v1/checkout/acme_bluetooth_speaker_payment_declined_recovery.yaml
resurf task try tasks/shop_v1/checkout/acme_bluetooth_speaker_payment_declined_recovery.yaml
```

Or write YAML by hand; see `docs/architecture.md` for the schema.

Modifiers are how a task says "make the site behave like X." They're set under `modifiers:` in the task YAML, and resurf applies them at reset time via `POST /__test__/configure`. Nothing in `shop_v1`'s product code needs to know about specific failure modes; middlewares handle them transparently.

| Key | Type | Values / shape | Default | Effect |
|---|---|---|---|---|
| `latency_profile` | string | `none` \| `fast` \| `realistic` \| `slow_3g` | `fast` | Sleeps before each `/api/*` response per the profile's per-route (min, max) seconds. |
| `payment_outcome` | string \| list \| object | `success` \| `declined` \| `3ds_required` \| `timeout`, or `[..., ...]` for a sequence consumed in order, or `{ sequence: [...] }` | `success` | Drives `/api/checkout/confirm`. Sequences let you script "decline once, then succeed." |
| `server_error_rate` | number (0.0–1.0) | e.g. `0.1` for ~10% 503s | `0.0` | Probability of injecting a 503 on requests matching `server_error_paths`. |
| `server_error_paths` | list[string] | e.g. `["/api/products", "/api/cart"]` | `["/api/products"]` | URL prefixes eligible for the error injector above. |
| `session_ttl_s` | int \| null | seconds | site default | Forces a shorter session expiry; useful for testing mid-flow re-auth. |
| `frozen_time_iso` | string (ISO 8601) | e.g. `2026-05-01T12:00:00Z` | unset | Pins server-side "now"; use it to make coupon expiry / order timestamps deterministic. |

For example, a task that scripts a declined-then-successful payment:

```yaml
id: shop_v1.checkout.acme_speaker_decline_recovery
site: shop_v1
goal: |
  Buy 1 Acme Bluetooth Speaker. Your first payment will be declined; retry to recover and complete the order.
modifiers:
  latency_profile: realistic
  payment_outcome:
    sequence: [declined, success]   # first /confirm fails, second succeeds
  server_error_rate: 0.0
  frozen_time_iso: "2026-05-01T12:00:00Z"
budget:
  max_steps: 25
  max_wall_clock_s: 120
success:
  type: state_predicate
  query: SELECT COUNT(*) AS count FROM "order" WHERE status = 'paid'
  predicate: result == 1
```

The same knobs are available via the admin API, which is handy for ad-hoc poking or building your own runner:

```bash
curl -s http://localhost:8080/__test__/configure -H 'content-type: application/json' \
  -d '{"latency_profile":"slow_3g","payment_outcome":"3ds_required"}'
```

Or from Python:

```python
# Environment comes from the resurf package; the import is omitted here.
env = Environment(site="shop_v1")
env.configure({"latency_profile": "slow_3g", "payment_outcome": "3ds_required"})
```

The contract for adding a modifier is intentionally small. A modifier is just a key in `ModifierConfig` plus the middleware/handler that reads it. Concretely:

- Add the field in `sites/shop_v1/backend/app/modifiers.py` (`ModifierConfig` dataclass, `reset()`, `update()`, `to_dict()`).
- Implement the behavior, either as a new ASGI middleware in `sites/shop_v1/backend/app/middleware/` or inline in the API handler that should react to it.
- Wire it up in `sites/shop_v1/backend/app/main.py` (only if it's a middleware).
- Document the YAML key in this README's table and in `docs/architecture.md`'s modifier table.
- (Optional) Update the task JSON Schema at `packages/core-py/resurf/schemas/task.schema.json` if you want strict validation of the new key.

A good first example to copy is `LatencyMiddleware` (one file, ~30 lines); it shows the read-config-and-act pattern end-to-end.

A site is just an HTTP service that implements six admin endpoints (`GET /api/health`, `POST /__test__/{reset,configure,query,freeze_time}`, and `GET /__test__/state`), guarded by `REVAR_TEST_MODE=1`.
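The endpoint list above can double as a quick conformance check while building a new site. The following is a minimal Python sketch, assuming that a missing route answers 404 or 405 and that an empty JSON object is an acceptable placeholder body for the POST routes (the real payload shapes are not specified here). `SITE_CONTRACT`, `build_probe`, and `missing_endpoints` are hypothetical names, not part of resurf.

```python
import json
import urllib.error
import urllib.request

# The six admin endpoints named in the text above.
SITE_CONTRACT = [
    ("GET", "/api/health"),
    ("POST", "/__test__/reset"),
    ("POST", "/__test__/configure"),
    ("POST", "/__test__/query"),
    ("POST", "/__test__/freeze_time"),
    ("GET", "/__test__/state"),
]


def build_probe(base_url: str, method: str, path: str) -> urllib.request.Request:
    """Build one probe request; POSTs carry an empty JSON object (an assumption)."""
    data = json.dumps({}).encode() if method == "POST" else None
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=data,
        headers={"content-type": "application/json"},
        method=method,
    )


def missing_endpoints(base_url: str, timeout_s: float = 5.0) -> list:
    """Return contract entries that answer 404/405 or cannot be reached."""
    missing = []
    for method, path in SITE_CONTRACT:
        try:
            req = build_probe(base_url, method, path)
            urllib.request.urlopen(req, timeout=timeout_s).close()
        except urllib.error.HTTPError as err:
            # Other error statuses (e.g. a rejected payload) still prove the route exists.
            if err.code in (404, 405):
                missing.append((method, path))
        except OSError:  # connection refused, DNS failure, timeout
            missing.append((method, path))
    return missing
```

Calling `missing_endpoints("http://localhost:8080")` against a site under development would list the routes still to be implemented. Because the probe really issues `POST /__test__/reset`, it should only be pointed at a disposable test instance.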
The contract is HTTP-shaped, so any backend stack works. To add a site: scaffold under `sites//`, add a deterministic seeder under `packages/shared-models/resurf_models//`, register a service block in `docker-compose.yml`, and drop tasks under `tasks//` with `site:` in the YAML. Copy from `shop_v1`: `app/modifiers.py` and `api/test_endpoints.py` are reusable nearly verbatim. Contributions welcome.

Cut a release by bumping versions with `scripts/bump_version.py X.Y.Z`, updating `CHANGELOG.md`, and pushing a `vX.Y.Z` tag. CI handles the PyPI upload (via Trusted Publishing) and the GitHub Release. The full runbook is in `RELEASING.md`.

resurf is licensed under the Apache License 2.0. See `LICENSE` and `NOTICE`.

Contributions of new tasks, templates, and bug fixes are very welcome. By submitting a pull request, you agree that your contribution is licensed under Apache 2.0 (per Section 5 of the license). For larger changes, please open a GitHub Discussion first so we can align on direction. See `CONTRIBUTING.md` for details.

For the roadmap, see `docs/roadmap.md`. Headline v1 items: a correlation study against real OSS shop demos, adversarial modifiers (CAPTCHA, anti-bot, dark patterns), prompt-driven task generation, a second site vertical, and a native Node SDK.
This analysis was written by the Genesis Park editorial team with the help of AI. The original is available via the source link.