Test cases improved the AI router's accuracy from 82% to 98%
📦 Open Source
#ai router
#anthropic
#evaluation
#review
#intent classification
#accuracy improvement
#test cases
Original source: hackernews · Summarized and analyzed by Genesis Park
Summary
The article describes how to test and improve an AI component that classifies the intent of a user prompt and routes it to the appropriate handling mode. An evaluation framework covering both regex patterns and an AI classifier validates the system, measuring accuracy with metrics stratified by difficulty and category. Through this iterative test-and-fix loop, accuracy improved from 82% to 98%.
Body
Evaluation framework for testing the intent classification router — the component that decides whether a user prompt should be handled as `chat`, `extract`, `research`, or `automate`.

```shell
cd evals
npm install

# Fast patterns only (instant, no API key needed)
npm run eval:fast

# AI classifier only (needs ANTHROPIC_API_KEY)
ANTHROPIC_API_KEY=your-key npm run eval:ai

# Full comparison: fast vs AI vs production pipeline (fast→AI fallback)
ANTHROPIC_API_KEY=your-key npm run eval:both

# Show all failures instead of first 10
npm run eval:verbose

# By difficulty
npx tsx run.ts --difficulty=hard --verbose

# By category
npx tsx run.ts --category=temporal-ambiguity

# Combined
npx tsx run.ts --classifier=fast --difficulty=easy --verbose

npx tsx run.ts --output=results/run-001.json
```

Results are saved as JSON with full per-case details, metrics, and confusion matrices. The `results/` directory is gitignored.

The eval dataset lives in `datasets/classification_v1.jsonl`. Each line is a JSON object:

```json
{
  "id": "eval-001",
  "input": "Write me a tweet about sustainable energy",
  "expected_mode": "chat",
  "acceptable_modes": ["chat"],
  "difficulty": "easy",
  "category": "content-generation",
  "ambiguity": "none",
  "rationale": "Content generation — user wants text output, not browser action",
  "tags": ["content", "social-media"]
}
```

| Field | Description |
|---|---|
| `expected_mode` | The ground truth classification |
| `acceptable_modes` | For ambiguous cases, multiple modes are acceptable |
| `difficulty` | `easy`, `medium`, `hard` |
| `ambiguity` | `none`, `low`, `high` — marks genuinely debatable cases |
| `rationale` | Human-written explanation of why this label is correct |

Open `classification_v1.jsonl`, review each case, and change `expected_mode` / `acceptable_modes` if you disagree with the label. Then re-run the eval to see how metrics change.
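The "fast patterns" path can be sketched roughly as below. The pattern rules here are purely illustrative assumptions (the article does not show the real ones); the point is the shape of the contract: a regex classifier that returns a mode, or `null` so the pipeline can fall back to the AI classifier.

```typescript
// Hypothetical fast-pattern classifier sketch. The regexes below are made up
// for illustration; they are NOT the project's actual rules.
type Mode = "chat" | "extract" | "research" | "automate";

// First matching pattern wins; order encodes priority.
const patterns: Array<[RegExp, Mode]> = [
  [/\b(write|draft|compose)\b/i, "chat"],
  [/\b(scrape|extract|pull)\b.*\b(from|off)\b/i, "extract"],
  [/\b(research|compare|investigate)\b/i, "research"],
  [/\b(click|fill|submit|log in)\b/i, "automate"],
];

// Returns null when no pattern matches, signalling "fall back to the AI classifier".
function classifyFast(prompt: string): Mode | null {
  for (const [re, mode] of patterns) {
    if (re.test(prompt)) return mode;
  }
  return null;
}

console.log(classifyFast("Write me a tweet about sustainable energy")); // "chat"
```

Because `null` is a first-class outcome, the Coverage metric described below falls out naturally: it is the fraction of cases where this function returned a non-null mode.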
- Accuracy (lenient) — predicted mode is in `acceptable_modes`
- Accuracy (strict) — predicted mode exactly matches `expected_mode`
- Macro F1 — mean of per-class F1 scores (headline metric)
- Coverage — % of cases where the classifier returned a prediction (vs `null`)
- Per-class precision/recall/F1 — identifies which modes are strongest/weakest
- Confusion matrix — shows exactly where misclassifications happen
- Stratified metrics — broken down by difficulty and category

- Run `npm run eval:fast` to test regex patterns (instant)
- Fix patterns in `src/services/router.ts`
- Mirror changes in `evals/src/classifiers/fast-patterns.ts`
- Re-run and check if metrics improved
- Run `npm run eval:both` to compare fast vs AI vs pipeline
- Add new test cases as you discover real-world failures
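The lenient/strict accuracy split and macro F1 from the list above can be sketched as follows. This is a minimal reimplementation for illustration, not the project's actual scoring code; only the `expected_mode` / `acceptable_modes` field names come from the dataset schema shown earlier.

```typescript
// Illustrative scoring sketch: lenient vs strict accuracy, and macro F1
// (unweighted mean of per-class F1, so rare modes count as much as common ones).
interface EvalCase {
  expected_mode: string;
  acceptable_modes: string[];
}

interface Scores {
  strict: number;  // predicted mode exactly matches expected_mode
  lenient: number; // predicted mode appears in acceptable_modes
  macroF1: number; // mean of per-class F1 scores
}

function score(cases: EvalCase[], predictions: Array<string | null>): Scores {
  const classes = [...new Set(cases.map((c) => c.expected_mode))];
  let strictHits = 0;
  let lenientHits = 0;
  for (let i = 0; i < cases.length; i++) {
    const pred = predictions[i];
    if (pred === cases[i].expected_mode) strictHits++;
    if (pred !== null && cases[i].acceptable_modes.includes(pred)) lenientHits++;
  }
  const f1s = classes.map((cls) => {
    const tp = cases.filter((c, i) => predictions[i] === cls && c.expected_mode === cls).length;
    const fp = cases.filter((c, i) => predictions[i] === cls && c.expected_mode !== cls).length;
    const fn = cases.filter((c, i) => predictions[i] !== cls && c.expected_mode === cls).length;
    return tp === 0 ? 0 : (2 * tp) / (2 * tp + fp + fn);
  });
  return {
    strict: strictHits / cases.length,
    lenient: lenientHits / cases.length,
    macroF1: f1s.reduce((a, b) => a + b, 0) / f1s.length,
  };
}
```

Note why lenient accuracy can exceed strict accuracy: a case labeled ambiguous may list several `acceptable_modes`, so a prediction that misses `expected_mode` can still count as lenient-correct.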
This analysis was written by the Genesis Park editorial team with the help of AI. The original article is available via the source link.