Show HN: Trawl – Scrape any site with natural-language fields, not CSS selectors
hackernews | 💼 Business
#anthropic
#gemini
#llm
#tip
#trawl
#data-extraction
#web-scraping
#natural-language
Original source: hackernews · Summarized and analyzed by Genesis Park
Summary
This open-source tool, introduced as "Show HN: Trawl", tackles a classic web-scraping problem, CSS selectors breaking whenever a site is redesigned, with a hybrid LLM-plus-Go approach. The user describes the desired fields in natural language; Claude analyzes the HTML structure and derives an extraction strategy, and the subsequent large-scale collection then runs in high-performance Go with zero token cost, maximizing efficiency. The engineering is thorough: SPA and iframe support, extraction-health monitoring with self-healing, and output in multiple formats such as JSON and CSV.
Full text
Scrape structured data from any website using LLM-powered extraction. trawl lets you define what you want semantically — not with CSS selectors — and it figures out how to extract it. When a site redesigns, trawl re-derives the extraction strategy automatically. The LLM is called once per site structure, not once per page. Steady-state scraping is pure Go at full speed with zero API cost.

Install:

```sh
curl -fsSL https://raw.githubusercontent.com/akdavidsson/trawl/main/install.sh | sh
```

Or with Go:

```sh
go install github.com/akdavidsson/trawl@latest
```

Or build from source:

```sh
git clone https://github.com/akdavidsson/trawl
cd trawl
go build -o trawl .
```

Set an API key:

```sh
export GOOGLE_GEMINI_APIKEY=AIzaSy...
# Or, if using Anthropic:
export ANTHROPIC_API_KEY=sk-ant-...
```

Quick start:

```sh
# Extract product data as JSON
trawl "https://books.toscrape.com" --fields "title, price, rating, in_stock"

# Output as CSV
trawl "https://books.toscrape.com" --fields "title, price" --format csv

# Preview the extraction plan without extracting
trawl "https://books.toscrape.com" --fields "title, price" --plan
```

Usage: `trawl [url] [flags]`

```sh
# Simple field extraction
trawl "https://example.com/products" --fields "name, price, rating, url" --format json

# Use a YAML schema for precise control
trawl "https://example.com/products" --schema products.yaml --format csv

# Natural language query — trawl infers field names from the data
trawl "https://example.com/products" --query "extract all product listings with names, prices, and stock status"

# Target a specific section on a page with multiple data tables
trawl "https://openrouter.ai/rankings" --query "Market Share" --fields "rank, name, tokens" --js

# Save to a file
trawl "https://example.com/products" --fields "name, price" --output products.json

# Streaming JSONL output, pipe to jq
trawl "https://news.example.com" --fields "headline, date, author" --format jsonl | jq '.headline'

# Re-use a previously derived strategy (no LLM call)
trawl "https://example.com/products" --strategy cached-strategy.json --format csv

# Verbose output to see the full pipeline
trawl "https://example.com/products" --fields "name, price" -v

# JS-rendered pages (React, Next.js, Vue, Svelte, etc.)
trawl "https://example.com/spa" --fields "name, value" --js

# Iframe-embedded apps (e.g. HuggingFace Spaces) — extra wait for content to load
trawl "https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard" \
  --fields "rank, model, average" --js --wait 10s

# Custom headers and cookies
trawl "https://example.com/dashboard" --fields "metric, value" \
  --headers "Authorization: Bearer token123" \
  --cookie "session=abc123"
```

What to extract:

| Flag | Short | Default | Description |
|---|---|---|---|
| `--fields` | `-f` | | Comma-separated field names to extract |
| `--query` | `-q` | | Natural language description of what to extract |
| `--schema` | `-s` | | Path to YAML schema file |

`--query` is especially useful for pages with multiple data sections. The query text is matched against section headings and HTML IDs to prioritize the right data region, and is passed to the LLM so it can select the most relevant section. When used together with `--fields`, the query guides section selection while the fields define the output structure.
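The post never shows what a schema file like `products.yaml` contains, so the block below is only a hypothetical sketch of what `--schema` implies: semantically named fields with descriptions an LLM could match against the page. None of these keys are confirmed by the project.

```yaml
# Hypothetical schema for --schema; trawl's actual format is not shown
# in the post. The idea: name each field semantically so the LLM can
# map it onto the page structure.
fields:
  - name: name
    type: string
    description: Product display name
  - name: price
    type: number
    description: Current price, without currency symbol
  - name: rating
    type: number
    description: Star rating from 0 to 5
  - name: in_stock
    type: boolean
    description: Whether the product is currently available
```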
Output:

| Flag | Short | Default | Description |
|---|---|---|---|
| `--format` | | `json` | Output format: `json`, `jsonl`, `csv`, `parquet` |
| `--output` | `-o` | stdout | Write output to file instead of stdout |

Crawling:

| Flag | Short | Default | Description |
|---|---|---|---|
| `--max-pages` | `-n` | 1 | Maximum pages to crawl |
| `--paginate` | | | Auto-detect and follow pagination |
| `--concurrency` | `-c` | 10 | Number of concurrent workers |
| `--delay` | | 1s | Delay between requests to same domain |
| `--no-robots` | | | Ignore robots.txt (use responsibly) |
| `--js` | | | Enable headless browser for JS-rendered pages |
| `--wait` | | 0 | Extra time to wait after page load with `--js` (e.g. `5s`) |
| `--timeout` | | 30s | Per-request timeout |
| `--headers` | | | Custom headers (`"Key: Value"` format) |
| `--cookie` | | | Cookie string to include |

Strategy:

| Flag | Default | Description |
|---|---|---|
| `--strategy` | | Path to a cached extraction strategy JSON file |
| `--plan` | | Dry run: show the LLM-derived extraction plan, don't extract |
| `--no-cache` | | Don't cache or use cached strategies |
| `--no-heal` | | Disable self-healing (don't re-derive on failure) |

LLM:

| Flag | Default | Description |
|---|---|---|
| `--model` | | LLM model to use (default depends on API key) |
| `--no-llm` | | Disable LLM, use heuristic extraction only |

Misc:

| Flag | Short | Description |
|---|---|---|
| `--verbose` | `-v` | Verbose output (show strategy derivation, health stats) |
| `--help` | `-h` | Help |

How it works:

```
URL ──► Fetch ──► Detect Data Regions ──► LLM Strategy Derivation ──► Extraction Strategy
                                                                             │
                                                                             ▼
Output (JSON/CSV/JSONL) ◄──────────────── Apply Strategy via CSS Selectors (Go)
                                                                             │
                                                                      [Strategy fails?]
                                                                             │
                                                                   Re-derive from new HTML
```

- Fetch the target URL with configurable headers, cookies, and timeouts. With `--js`, uses a headless Chromium browser to render JavaScript. The browser automatically:
  - Waits for DOM stability — polls th…
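To make the pipeline above concrete, here is a minimal sketch in Go of the steady-state step: applying a previously derived strategy as plain CSS selectors, with no LLM call. The `Strategy` shape, the `cached-strategy.json` layout, and the use of goquery are all assumptions for illustration, not trawl's actual internals.

```go
// Minimal sketch of the steady-state path, assuming a cached strategy of
// the hypothetical form {"item": "<record selector>", "fields": {...}}.
// trawl's real strategy format and internals are not shown in the post.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"os"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

// Strategy is a hypothetical cached extraction strategy: one CSS selector
// for the repeating record container, one per output field inside it.
type Strategy struct {
	Item   string            `json:"item"`
	Fields map[string]string `json:"fields"`
}

func main() {
	raw, err := os.ReadFile("cached-strategy.json")
	if err != nil {
		log.Fatal(err)
	}
	var s Strategy
	if err := json.Unmarshal(raw, &s); err != nil {
		log.Fatal(err)
	}

	resp, err := http.Get("https://books.toscrape.com")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		log.Fatal(err)
	}

	// Steady state: pure selector matching, no LLM call, zero API cost.
	doc.Find(s.Item).Each(func(_ int, rec *goquery.Selection) {
		row := map[string]string{}
		for field, css := range s.Fields {
			row[field] = strings.TrimSpace(rec.Find(css).First().Text())
		}
		line, _ := json.Marshal(row)
		fmt.Println(string(line)) // JSONL: one record per line
	})
}
```

The design point is that the expensive LLM call happens only when a strategy is first derived, or re-derived after the health monitor flags a failure; the hot loop is ordinary selector matching.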
This analysis was written by the Genesis Park editorial team using AI. The original post is available via the source link.