Show HN: A robust LLM extractor for websites in TypeScript

hackernews | 📦 Open source
#ai #anthropic #gemini #llama #llm #openai #typescript #web-data-extraction #automation
Original source: hackernews · Summarized and analyzed by Genesis Park

Summary

Lightfeed Extractor is a TypeScript library that combines Playwright browser automation with LLMs to robustly extract structured data from websites. It navigates pages via natural-language prompts, converts the HTML into LLM-friendly markdown, and then extracts JSON data that precisely matches a given Zod schema. It also recovers malformed JSON output and handles deeply nested objects and arrays, providing the token efficiency needed for production data pipelines.
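The HTML-to-markdown conversion step mentioned above can be illustrated with a toy sketch. This is a hypothetical helper for intuition only, handling just headings and links; the library's actual converter is far more thorough (main-content extraction, URL cleaning, and so on):

```typescript
// Toy HTML -> markdown conversion, illustrating the idea only.
// Not @lightfeed/extractor's implementation: this sketch covers
// just <h1>/<h2> headings and <a> links, then strips other tags.
function toyHtmlToMarkdown(html: string): string {
  return html
    .replace(/<h1[^>]*>(.*?)<\/h1>/gi, "# $1\n")
    .replace(/<h2[^>]*>(.*?)<\/h2>/gi, "## $1\n")
    .replace(/<a[^>]*href="([^"]*)"[^>]*>(.*?)<\/a>/gi, "[$2]($1)")
    .replace(/<[^>]+>/g, "")     // drop remaining tags
    .replace(/\n{3,}/g, "\n\n")  // collapse runs of blank lines
    .trim();
}

console.log(toyHtmlToMarkdown(
  '<h1>Bread</h1><p>See <a href="/deals">deals</a>.</p>'
));
// -> "# Bread\nSee [deals](/deals)."
```

The point of the conversion is token efficiency: markdown carries the same text and links as the HTML with far less markup for the LLM to read.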

Body

Robust Web Data Extractor Using LLMs and Browser Automation

Lightfeed Extractor is a TypeScript library built for robust web data extraction using LLMs. Use natural language prompts to extract structured data from HTML, markdown, or plain text. Get complete, accurate results with great token efficiency — critical for production data pipelines.

- 🧹 LLM-ready Markdown - Convert HTML to LLM-ready markdown, with options to extract only main content and clean URLs by removing tracking parameters.
- ⚡️ LLM Extraction - Use LLMs in JSON mode to extract structured data according to an input Zod schema. Token usage limits and tracking included.
- 🛠️ JSON Recovery - Sanitize and recover failed JSON output. This makes complex schema extraction much more robust, especially with deeply nested objects and arrays.
- 🔗 URL Validation - Handle relative URLs, remove invalid ones, and repair markdown-escaped links.
- 🤖 Works with Playwright - Use Playwright to load pages, then extract structured data from the HTML content.
- 🧭 AI Browser Navigation - Pair with @lightfeed/browser-agent to navigate pages using natural language commands before extracting structured data.

Tip: Building retail competitor intelligence at scale? Go to lightfeed.ai - our full platform for tracking competitor pricing, sales, promotions, and SEO.

Install the extractor:

```shell
npm install @lightfeed/extractor
```

Then install the LLM provider you want to use:

```shell
# OpenAI
npm install @langchain/openai
# Google Gemini
npm install @langchain/google-genai
# Anthropic
npm install @langchain/anthropic
# Ollama (local models)
npm install @langchain/ollama
```

@langchain/core will be installed automatically as a peer dependency.

This example demonstrates extracting structured product data from an e-commerce website, using Playwright to load the page and the extractor to pull structured data.
```typescript
import { ChatGoogleGenerativeAI } from "@langchain/google-genai";
import { chromium } from "playwright";
import { extract, ContentFormat } from "@lightfeed/extractor";
import { z } from "zod";

// Define schema for product catalog extraction
const productCatalogSchema = z.object({
  products: z
    .array(
      z.object({
        name: z.string().describe("Product name or title"),
        brand: z.string().optional().describe("Brand name"),
        price: z.number().describe("Current price"),
        originalPrice: z.number().optional().describe("Original price if on sale"),
        rating: z.number().optional().describe("Product rating out of 5"),
        reviewCount: z.number().optional().describe("Number of reviews"),
        productUrl: z.string().url().describe("Link to product detail page"),
        imageUrl: z.string().url().optional().describe("Product image URL"),
      })
    )
    .describe("List of bread and bakery products"),
});

const browser = await chromium.launch();
const page = await browser.newPage();
const pageUrl =
  "https://www.walmart.ca/en/browse/grocery/bread-bakery/10019_6000194327359";
await page.goto(pageUrl);
try {
  await page.waitForLoadState("networkidle", { timeout: 10000 });
} catch {
  console.log("Network idle timeout, continuing...");
}
const html = await page.content();
await browser.close();

// Extract structured product data
const result = await extract({
  llm: new ChatGoogleGenerativeAI({
    apiKey: process.env.GOOGLE_API_KEY,
    model: "gemini-2.5-flash",
    temperature: 0,
  }),
  content: html,
  format: ContentFormat.HTML,
  sourceUrl: pageUrl,
  schema: productCatalogSchema,
  htmlExtractionOptions: {
    extractMainHtml: true,
    includeImages: true,
    cleanUrls: true,
  },
});

console.log("Found products:", result.data.products.length);
console.log(JSON.stringify(result.data, null, 2));

/* Expected output:
{
  "products": [
    {
      "name": "Dempster's® Signature The Classic Burger Buns, Pack of 8; 568 g",
      "brand": "Dempster's",
      "price": 3.98,
      "originalPrice": 4.57,
      "rating": 4.7376,
      "reviewCount": 141,
      "productUrl": "https://www.walmart.ca/en/ip/dempsters-signature-the-classic-burger-buns/6000188080451?classType=REGULAR&athbdg=L1300",
      "imageUrl": "https://i5.walmartimages.ca/images/Enlarge/725/979/6000196725979.jpg?odnHeight=580&odnWidth=580&odnBg=FFFFFF"
    },
    ... (more products)
  ]
}
*/
```

Tip: Run npm run test:browser to execute this example, or view the complete code in testBrowserExtraction.ts.

For pages that require interaction before extraction — searching, clicking through pagination, dismissing popups, etc. — you can pair this library with @lightfeed/browser-agent. The browser agent uses AI to navigate pages via natural language commands, and this library extracts structured data from the result.

Install both packages:

```shell
npm install @lightfeed/extractor @lightfeed/browser-agent
```

Then use the browser agent to navigate and the extractor to pull structured data:

```typescript
import { BrowserAgent } from "@lightfeed/browser-agent";
import { ChatGoogleGenerativeAI } from "@langchain/google-genai";
import { extract, ContentFormat } from "@lightfeed/extractor";
import { z } from "zod";

const schema = z.object({
  products: z.array(
    z.object({
      name: z.string(),
      price: z.number(),
      rating: z.number().optional(),
    })
  ),
});
// ...
```

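The URL Validation and cleanUrls behaviors described above (resolving relative links, dropping invalid ones, removing tracking parameters) can be approximated with the standard URL API. This is an illustrative helper under assumed tracking-parameter names, not the library's code:

```typescript
// Illustrative helper (not @lightfeed/extractor's implementation):
// resolve a possibly-relative href against the page URL and drop
// common tracking parameters such as utm_*, fbclid, and gclid.
const TRACKING_PARAM = /^(utm_|fbclid$|gclid$|ref$)/;

function cleanUrl(href: string, baseUrl: string): string | null {
  let url: URL;
  try {
    url = new URL(href, baseUrl);  // resolves relative hrefs
  } catch {
    return null;                   // invalid URL: drop it
  }
  for (const key of Array.from(url.searchParams.keys())) {
    if (TRACKING_PARAM.test(key)) url.searchParams.delete(key);
  }
  return url.toString();
}

console.log(cleanUrl("/en/ip/buns?utm_source=hn&classType=REGULAR",
                     "https://www.walmart.ca/en/browse/grocery"));
// -> https://www.walmart.ca/en/ip/buns?classType=REGULAR
```

Cleaning URLs this way also deduplicates links that differ only in tracking noise, which trims the markdown handed to the LLM and stabilizes the extracted productUrl fields.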
This analysis was written by the Genesis Park editorial team with the help of AI. The original article is available via the source link.
