AI Browser Agent Leaderboard

hackernews | April 15, 2026 02:13 | 🔬 Research

#chatgpt #gemini #gpt-4 #openai #review #claude

Source: hackernews · Summarized and analyzed by Genesis Park

Summary

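As of February 2026, Surfer 2 from H Company holds the SOTA top spot on WebVoyager, the benchmark used to evaluate AI browser agents, with a score of 97.1%. OpenAI Operator (87%) and Google Project Mariner (83.5%) rank 6th and 8th respectively; both prioritize general-purpose capability over benchmark optimization and therefore score lower than specialized agents. Reliable comparisons require the full dataset and third-party verification, and in production environments cost and security features matter alongside the score.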

Body

What is the WebVoyager benchmark for AI browser agents?
WebVoyager is the standard benchmark for evaluating browser agents, introduced in the 2024 paper WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models. It consists of 643 tasks across 15 websites including Google, Amazon, GitHub, Reddit, and Wikipedia. Tasks cover form filling, navigation, search, and shopping. GPT-4V evaluates each task by analyzing the final page state. Scores represent the percentage of tasks completed successfully. As of February 2026, the highest score is 97.1%, held by Surfer 2 from H Company.

Can WebVoyager scores be compared across different agents?
Not always. Three factors affect comparability: dataset size (full 643 tasks vs filtered subsets), evaluator (GPT-4V vs custom methods), and verification (third-party vs self-reported). Filtered subsets typically produce higher scores. Click any leaderboard row to see its methodology. The most reliable comparisons use the full dataset, GPT-4V evaluation, and third-party verification.

What is Steel.dev?
Steel is browser infrastructure for AI agents: cloud browser sessions controlled through code. A session works like a fresh incognito window running in the cloud. Steel provides anti-bot capabilities including CAPTCHA solving, proxy rotation, and fingerprint management, plus observability features like live viewing and replay. Steel offers a REST API, Python SDK, and Node SDK for web scraping, form automation, and research agents. Learn more at steel.dev, the beginner's guide, or docs.

What is the best AI browser agent in 2026?
By WebVoyager score: Surfer 2 at 97.1% (H Company), Magnitude at 93.9%, AIME Browser-Use at 92.34%, Browserable at 90.4%, Browser Use at 89.1%, OpenAI Operator at 87%, Skyvern 2.0 at 85.85%, and Google Project Mariner at 83.5%. Surfer-H's 92.2% is a multi-attempt result (10 attempts), so it is not directly comparable to single-run percentages. Surfer 2 leads on accuracy. Browser Use and Skyvern are strong open-source options. Rankings update as new results are submitted.

What is the difference between OpenAI Operator, Browser Use, and other AI browser agents?
The main agents differ in accuracy and positioning. OpenAI Operator (87%, GPT-4o) is a consumer product in ChatGPT. Browser Use (89.1%) is open-source and supports multiple models. Surfer 2 (97.1%) leads with a proprietary enterprise model. Skyvern 2.0 (85.85%) is open-source with strong visual reasoning. Google Mariner (83.5%, Gemini) integrates with Chrome. For custom agents, Steel provides the browser infrastructure layer.

How do AI browser agents work?
Browser agents combine LLMs with browser automation to complete web tasks. A vision model sees the webpage via screenshots or the DOM. A reasoning model decides on actions such as clicking, typing, or scrolling. An execution layer drives the browser via Chrome DevTools Protocol or Playwright. A memory component tracks state across steps. Most agents run on cloud infrastructure like Steel for reliability and anti-bot handling.

What websites can AI browser agents navigate?
Agents can navigate any website. WebVoyager evaluates on 15 sites, including Amazon, eBay, Google, Google Maps, Wikipedia, Reddit, Twitter/X, GitHub, ArXiv, and Booking.com. Real-world challenges include CAPTCHAs, bot detection, dynamic content, auth flows, and rate limiting. Production agents use infrastructure like Steel for anti-bot measures and proxy rotation.

How is the WebVoyager score calculated?
Score = (tasks completed / total tasks) × 100. An agent scoring 97.1% completed roughly 624 of 643 tasks correctly. GPT-4V evaluates each task by analyzing the final page state to determine whether the goal was achieved: correct page reached, information displayed, forms filled accurately, and flows completed.

What does SOTA mean?
SOTA stands for State of the Art, the highest-performing result on a benchmark. On this leaderboard, the SOTA badge is awarded to the agent with the highest WebVoyager score and transfers automatically when a new high score is submitted. As of February 2026, the SOTA holder is Surfer 2 by H Company at 97.1%.

Are OpenAI Operator and Google Project Mariner on this leaderboard?
Yes. OpenAI Operator scores 87% (ranked 6th) and Google Project Mariner scores 83.5% (ranked 8th). Both are consumer products integrated into their ecosystems: Operator via ChatGPT, Mariner via Chrome. They score lower than specialized agents like Surfer 2 (97.1%) because they prioritize broad capability over benchmark optimization.

How do I build my own AI browser agent?
Three layers are needed. Browser infrastructure: Steel provides managed sessions, proxies, anti-bot handling, and replay. AI layer: a vision-capable model like GPT-4o, Claude, or Gemini with prompting for action selection. Orchestration: frameworks like Browser Use or Skyvern handle clicking, typing, and state tracking. See the production agents guide. Once your agent has a verifiable WebVoyager score, open a pull request on GitHub.

Is a higher WebVoyager score always better for production use?
Not necessarily. WebVoyager measures task completion on a fixed website set under controlled conditions. Production depends on factors the benchmark does not capture: latency, cost per task, CAPTCHA handling, anti-bot resilience, and generalization to new websites. An agent optimized for benchmark scores may overfit. Use the leaderboard as a directional signal and test on your actual target websites.

Why is WebVoyager used instead of other benchmarks?
WebVoyager is the most widely adopted public benchmark for browser agents, enabling cross-agent comparison. Other benchmarks exist, such as Mind2Web (2000+ tasks), OSWorld (desktop interaction), and WorkArena (enterprise apps), but they have seen less adoption. WebVoyager's real-world task design, consistent GPT-4V evaluation, and widespread usage make it the current standard.

What is Browser Use's WebVoyager benchmark score?
Browser Use scores 89.1% on WebVoyager, ranking 5th overall. It's an open-source framework supporting GPT-4, Claude, and other LLMs via API. Many teams pair Browser Use with Steel for production infrastructure and anti-bot handling.

What is OpenAI Operator's WebVoyager benchmark score?
OpenAI Operator scores 87% on WebVoyager, ranking 6th overall. Built into ChatGPT Pro, it uses GPT-4o for vision and reasoning. Operator requires no setup and handles web tasks like booking reservations and filling forms.

What is Skyvern's WebVoyager benchmark score?
Skyvern 2.0 scores 85.85% on WebVoyager, ranking 7th overall. It's an open-source agent emphasizing visual understanding for complex layouts. Skyvern works with any LLM backend and integrates with Steel for production infrastructure.

What is Google Project Mariner's WebVoyager benchmark score?
Google Project Mariner scores 83.5% on WebVoyager, ranking 8th overall. Built on Gemini and integrated with Chrome, it handles web navigation and form filling. It is currently in limited availability.

What is Surfer 2's WebVoyager benchmark score?
Surfer 2 by H Company holds the current SOTA at 97.1% on WebVoyager for pass@1 (as of February 2026); 100% is also reported at pass@10 on the same benchmark. The gap to the next agent at 93.9% makes it the clear leader for accuracy-focused use cases.

How often is the leaderboard updated?
The leaderboard updates as new benchmark results are published. New results appear weekly. If you know of a missing agent or score, pull requests and issues are welcome on GitHub.

How do I add my agent to the leaderboard?
Open a pull request on GitHub with your entry. You need a publicly verifiable WebVoyager score, a link to the source (paper or blog post), and a homepage or GitHub repo for your agent.
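
The score formula from "How is the WebVoyager score calculated?" can be sketched in a few lines of Python. This is an illustrative sketch only, not the leaderboard's actual tooling: the function names `webvoyager_score`, `tasks_for_score`, and `rank` are hypothetical, and the score table simply copies the single-run figures quoted in the FAQ above.

```python
TOTAL_TASKS = 643  # full WebVoyager task count, as stated in the FAQ


def webvoyager_score(completed: int, total: int = TOTAL_TASKS) -> float:
    """Score = (tasks completed / total tasks) x 100."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= completed <= total:
        raise ValueError("completed must be between 0 and total")
    return 100.0 * completed / total


def tasks_for_score(score_pct: float, total: int = TOTAL_TASKS) -> float:
    """Invert the formula: the (possibly fractional) task count behind a score."""
    return score_pct / 100.0 * total


# Single-run scores as quoted in the FAQ (hypothetical table for illustration).
SCORES = {
    "Surfer 2": 97.1,
    "Magnitude": 93.9,
    "AIME Browser-Use": 92.34,
    "Browserable": 90.4,
    "Browser Use": 89.1,
    "OpenAI Operator": 87.0,
    "Skyvern 2.0": 85.85,
    "Google Project Mariner": 83.5,
}


def rank(scores: dict[str, float]) -> list[tuple[int, str, float]]:
    """Return (position, agent, score) tuples, highest score first."""
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(pos, name, score) for pos, (name, score) in enumerate(ordered, start=1)]


if __name__ == "__main__":
    print(f"624/643 = {webvoyager_score(624):.2f}%")
    print(f"625/643 = {webvoyager_score(625):.2f}%")
    print(f"97.1% of 643 = {tasks_for_score(97.1):.2f} tasks")
    for pos, name, score in rank(SCORES):
        print(f"{pos}. {name}: {score}%")
```

Inverting the formula shows that 97.1% of 643 is not a whole number of tasks, so a count such as "624 of 643" is best read as approximate. A reported score that does not map to an integer count on the full set may also hint at a filtered subset or a different rounding convention, which is one reason to check each entry's methodology before comparing agents.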

View original (hackernews)

This analysis was written by the Genesis Park editorial team with the help of AI. The original can be found via the source link.
