An open-source LLM-as-Judge evaluation suite with root cause analysis and failure mining
hackernews
📰 News
#ai reliability
#anthropic
#llm evaluation
#openai
#review
#root cause analysis
#failure mining
#open source
Original source: hackernews · Summarized and analyzed by Genesis Park
Summary
cane-eval is an open-source tool for evaluating AI agents with the LLM-as-Judge approach, letting you define detailed evaluation criteria and tests in YAML. It uses models such as Claude to score responses for accuracy and completeness, and runs root cause analysis (RCA) on failed cases to produce detailed feedback. It can also automatically mine evaluation results into training data for methods such as DPO, and offers one-liner integrations with major frameworks like LangChain and FastAPI, streamlining the AI development workflow.
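The DPO export mentioned above turns failures into preference-training data. The source does not show the actual export schema, so the record below is only a sketch of the standard DPO preference-pair shape (prompt / chosen / rejected), with hypothetical values:

```python
# Hypothetical shape of one mined preference record in the standard DPO
# format; field names and values are assumptions, not from the source.
mined_record = {
    "prompt": "What is the return policy?",
    # the expected answer from the test suite becomes the preferred completion
    "chosen": "30-day return policy for unused items with receipt",
    # the agent's failing response becomes the rejected completion (hypothetical)
    "rejected": "All sales are final; no returns are accepted.",
}

assert set(mined_record) == {"prompt", "chosen", "rejected"}
```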
Body
AI system reliability infrastructure. Evaluate any AI system's reliability across correctness, structure, and performance.

```
pip install cane-eval
```

Extensible reliability evaluation for AI systems: not just LLMs, but any AI agent, API, or pipeline. One tool, one score, one answer: would this break in production?

```
Support Agent                          28.4s
Overall: [=========----------] 47
1 passed  1 warned  3 failed  (5 total)
Pass rate: 20%
Latency: p50: 1.2s  p95: 8.4s  max: 12.1s
Schema: 3/5 valid (60%)
Reliability: [=======-----------] 52 (D)
```

```
export ANTHROPIC_API_KEY=sk-ant-...
cane-eval demo
```

1. Define tests (`tests.yaml`):

```yaml
name: Support Agent
criteria:
  - key: accuracy
    weight: 40
  - key: completeness
    weight: 30
  - key: hallucination
    weight: 30

# Optional: validate response structure
schema:
  type: object
  required: [answer, sources]
  properties:
    answer: { type: string }
    sources: { type: array }

# Optional: latency target for reliability scoring
latency_target_ms: 5000

# Optional: configure reliability weights
reliability:
  correctness_weight: 0.60
  structural_weight: 0.20
  performance_weight: 0.20

# Optional: parallel execution
concurrency: 5

tests:
  - question: What is the return policy?
    expected_answer: 30-day return policy for unused items with receipt
  - question: How do I reset my password?
    expected_answer: Go to Settings > Security > Reset Password
```

2. Run:

```
cane-eval run tests.yaml
```
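The optional `schema:` block above is a JSON Schema that each agent response is validated against. As a rough illustration of what that structural check does, here is a minimal stdlib-only sketch covering just `required` and basic `type` checks (cane-eval itself presumably uses a full JSON Schema validator):

```python
# Minimal structural check mirroring the optional `schema:` block in
# tests.yaml. This sketch handles only `required` fields and simple
# `type` matching; it is not a full JSON Schema implementation.

TYPE_MAP = {"string": str, "array": list, "object": dict}

def validate_response(response: dict, schema: dict) -> list:
    """Return a list of structural errors; an empty list means valid."""
    errors = []
    for field in schema.get("required", []):
        if field not in response:
            errors.append("missing required field: " + field)
    for field, spec in schema.get("properties", {}).items():
        if field in response and not isinstance(response[field], TYPE_MAP[spec["type"]]):
            errors.append(field + ": expected " + spec["type"])
    return errors

schema = {
    "type": "object",
    "required": ["answer", "sources"],
    "properties": {"answer": {"type": "string"}, "sources": {"type": "array"}},
}

# A well-formed agent response passes; a malformed one is flagged.
assert validate_response({"answer": "30-day returns", "sources": ["faq.md"]}, schema) == []
assert validate_response({"answer": 42}, schema) == [
    "missing required field: sources",
    "answer: expected string",
]
```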
Production checks:

```
# Parallel execution
cane-eval run tests.yaml -j 5

# Custom reliability weights (correctness:structural:performance)
cane-eval run tests.yaml --reliability-weights 60:20:20

# Validate responses against JSON schema
cane-eval run tests.yaml --schema schema.json --fail-on-schema

# Fail if p95 latency exceeds 10 seconds
cane-eval run tests.yaml --latency-p95 10000

# All together + mine failures into training data
cane-eval run tests.yaml -j 5 --schema schema.json --latency-p95 10000 --mine --export dpo
```

Every eval run produces an Agent Reliability Score (0-100) across three pillars:

| Pillar | What it measures | How |
|---|---|---|
| Correctness | Does the answer look good? | LLM judge (accuracy, completeness, hallucination) |
| Structural | Does the response match the expected format? | JSON schema validation |
| Performance | Is it fast enough for production? | p95 latency vs target |

Grades: A (90+) production-ready, B (75+) mostly reliable, C (60+) needs work, D (40+) significant gaps, F (<40).

```
Agent --> LLM Judge -----> Reliability Score (A-F)
  |            |                |
  |            |                +-- Schema Check
  |            |                +-- Latency Stats
  v            v                v
Training     Root Cause      Failure
Data         Analysis        Mining
(DPO/SFT/OpenAI)
```

License: MIT
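The grade table above can be read as a weighted sum of three 0-100 sub-scores. A minimal sketch of that combination, assuming the default 60:20:20 weights and the listed cutoffs (the exact rounding behavior inside cane-eval is an assumption):

```python
# Sketch of the three-pillar Agent Reliability Score: each pillar is
# assumed to be a 0-100 sub-score, combined with configurable weights
# (correctness:structural:performance), then mapped to a letter grade.

def reliability_score(correctness, structural, performance,
                      weights=(0.60, 0.20, 0.20)):
    """Return (score 0-100, grade A-F) using the cutoffs from the table."""
    score = round(correctness * weights[0]
                  + structural * weights[1]
                  + performance * weights[2])
    for cutoff, grade in ((90, "A"), (75, "B"), (60, "C"), (40, "D")):
        if score >= cutoff:
            return score, grade
    return score, "F"

# e.g. a weak judge score with 60% schema validity lands in the D band:
assert reliability_score(50, 60, 50) == (52, "D")
assert reliability_score(95, 100, 90) == (95, "A")
```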
This analysis was written by the Genesis Park editorial team with the help of AI. The original can be found via the source link.