LoCoMo AI benchmark: 6.4% of the golden answers are wrong, and the judge accepted 63% of deliberately wrong answers

hackernews | 📦 Open Source
#AI benchmark #GPT-4 #OpenAI #data integrity #LoCoMo #long context #ML/research #evaluation methodology
Original source: hackernews · Summarized and analyzed by Genesis Park

Summary

An independent audit of LoCoMo, a long-context modeling benchmark, and the EverMemOS evaluation framework found wrong golden answers in 6.4% of the dataset, inconsistent evaluation methodology across implementations, and understated token costs. The judge model proved notably lenient, accepting 62.81% of deliberately wrong answers as correct, and the published scores either exceed the ceiling that the corrupted golden answers make mathematically possible or can be matched by a simple change of answer prompt. A third-party reproduction attempt scored only 38.38%, far short of the claimed 92.32%, pointing to serious reliability flaws in the benchmark's results.

Body

Independent audit of the LoCoMo (Long-Context Modeling) benchmark and the EverMemOS evaluation framework. Findings cover ground truth errors in the dataset, evaluation methodology differences across implementations, token cost misrepresentation, judge leniency, and third-party reproducibility failures. Every claim links to a verifiable primary source.

| Finding | Detail | Source |
|---|---|---|
| Ground truth errors | 99 of 1,540 questions (6.4%) have wrong golden answers. Theoretical scoring ceiling is 93.57%. | AUDIT_REPORT.md |
| Total token cost | EverMemOS README claims 2,298 avg tokens per question. The paper's own Table 8 (arXiv:2601.02163v2) shows 6,669 with GPT-4.1-mini (2.9x higher; 6,045 with GPT-4o-mini). Real reduction vs. full-context is 67%, not 89%. | methodology/token_efficiency.md |
| Judge accepts wrong answers | 62.81% of intentionally wrong vague-but-topical answers accepted by the LLM judge. | ap-baseline/README.md |
| Scores exceed corrupted ceiling | EverMemOS single-hop (95.96%) and multi-hop (91.37%) exceed their category ceilings (95.72% and 90.07%), mathematically impossible without credit from wrong golden answers. Overall 92.32% is within 1.25 points of the 93.57% aggregate ceiling. | results-audit/RESULTS_AUDIT.md |
| Not apples-to-apples | EverMemOS uses 2-3 sequential LLM calls, a 729-token CoT prompt, and agentic retrieval. All other systems: 1 call, simple prompt, no overhead. All reported in the same "Avg. Tokens" column. | methodology/token_efficiency.md, methodology/prompts.md |
| Reproducibility failures | Third parties report 38.38% vs. claimed 92.32% (EverMemOS#73). Multiple Mem0 reproducibility issues open. | methodology/reproducibility.md |
| Full-context baseline exceeds EverMemOS | GPT-4.1-mini with answer_prompt_cot on full context scores 92.62%, exceeding EverMemOS (92.32%) and the claimed FC baseline (91.21%). The answer prompt, not the memory system, explains the score. | fc-baseline/README.md |
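Several headline figures in the table are pure arithmetic over the reported numbers and can be re-derived directly. Here is a minimal sanity-check sketch: the question counts and token figures come from the table above; the full-context token total is not given in this summary, so the 67%-vs-89% reduction claim is not recomputed here.

```python
# Sanity-check the derived figures in the findings table.

TOTAL_QUESTIONS = 1_540   # LoCoMo questions audited
WRONG_GOLDEN = 99         # score-corrupting golden-answer errors

# Theoretical ceiling: even a perfect system is marked wrong on every
# question whose golden answer is itself wrong.
ceiling = (TOTAL_QUESTIONS - WRONG_GOLDEN) / TOTAL_QUESTIONS
print(f"error rate:        {WRONG_GOLDEN / TOTAL_QUESTIONS:.1%}")   # 6.4%
print(f"scoring ceiling:   {ceiling:.2%}")                          # 93.57%

# Token cost: README claim vs. the paper's own Table 8 (GPT-4.1-mini).
CLAIMED_AVG_TOKENS = 2_298
TABLE8_AVG_TOKENS = 6_669
print(f"actual / claimed:  {TABLE8_AVG_TOKENS / CLAIMED_AVG_TOKENS:.1f}x")  # 2.9x

# Headroom between the reported overall score and the aggregate ceiling.
REPORTED_OVERALL = 92.32
print(f"margin to ceiling: {100 * ceiling - REPORTED_OVERALL:.2f} points")  # 1.25
```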
Repository layout:

```
locomo-audit/
├── data/
│   └── locomo10.json                # Original dataset (unmodified, SHA256-verified)
├── audit/
│   ├── conv_0.json ... conv_9.json  # Per-conversation audit packages
│   └── errors_conv_0.json ... errors_conv_9.json  # Errors found per conversation
├── results-audit/                   # Score impact analysis across 5 published systems
│   ├── RESULTS_AUDIT.md             # Adjusted scores, ceiling analysis, cross-check
│   ├── audit_results.py             # Audit script (LLM judge, ~1,485 calls)
│   └── download_results.py          # Fetches published eval_results from HuggingFace
├── ap-baseline/                     # Judge leniency stress test
│   ├── README.md                    # Strategies, results, 6x leniency finding
│   ├── score_ap.py                  # Scoring pipeline (same judge as original eval)
│   ├── v1/                          # Specific-but-wrong strategy (10.61%)
│   └── v2/                          # Vague-but-topical strategy (62.81%)
├── fc-baseline/                     # Independent full-context baseline (4 runs, 2 models x 2 prompts)
│   ├── README.md                    # Methodology, results, key finding (prompt explains gap)
│   ├── scripts/                     # fc_eval.py (~860 lines) and analyze_results.py
│   └── results/                     # eval_results.json for all 4 runs
├── methodology/                     # Evaluation methodology analysis
│   ├── README.md                    # Overview and key findings
│   ├── prompts.md                   # Answer prompts, judge prompt, context templates
│   ├── word_counts.md               # Answer length statistics and scoring correlation
│   ├── token_efficiency.md          # Token cost claims vs. paper's own data
│   ├── discrepancies.md             # Cross-repository model, prompt, scoring differences
│   ├── full_context_baseline.md     # Full-context baselines: 4 measured runs, prompt explains the gap
│   ├── image_questions.md           # Image-dependent questions and BLIP caption handling
│   ├── reproducibility.md           # Third-party reproducibility reports
│   └── scripts/                     # Analysis scripts (stdlib-only Python)
├── evaluation/
│   └── config/
│       └── prompts.yaml             # Judge prompts (from EverMemOS pipeline, SHA256-verified)
├── scripts/
│   └── verify_sha256.py             # Verify dataset integrity against known hashes
├── errors.json                      # Consolidated error report (all conversations)
├── AUDIT_REPORT.md                  # Ground truth audit: full findings and analysis
├── requirements.txt                 # Python dependencies (openai, pyyaml)
└── README.md
```

| File | Source | License | SHA256 |
|---|---|---|---|
| data/locomo10.json | snap-research/locomo | CC BY-NC 4.0 | 79fa87e9...ea698ff4 |
| evaluation/config/prompts.yaml | EverMind-AI/EverMemOS | Apache 2.0 | ba4f668e...ba498ee9 |

Both files are byte-for-byte matches with their official upstream sources (verified Feb 2026). Run `python scripts/verify_sha256.py` to confirm; a sketch of what that check amounts to appears below. See THIRD-PARTY-NOTICES.md for full license attribution.

This audit builds on errors first reported in snap-research/locomo#27 (29 errors). Our systematic audit found 156 total issues: 99 score-corrupting, 57 citation-only.

This work is licensed under CC BY-NC 4.0, the same license as the underlying LoCoMo dataset. The LoCoMo dataset was created by Maharana, A., Lee, D. H., Tulyakov, S., & Bansal, M. and is published by SNAP Research under CC BY-NC 4.0. The unmodified dataset is included in data/locomo10.json.
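For reference, the integrity check that `scripts/verify_sha256.py` performs amounts to hashing each pinned file and comparing against a known digest. Below is a minimal sketch that assumes nothing about the script's actual interface; the expected hashes are truncated in the provenance table, so the placeholders would need to be replaced with the full 64-character values from the repository.

```python
# Minimal sketch of a SHA256 integrity check, in the spirit of
# scripts/verify_sha256.py (whose actual contents may differ).
import hashlib
import sys

# Truncated placeholders from the provenance table above; substitute
# the full 64-hex-character digests shipped with the repository.
EXPECTED = {
    "data/locomo10.json": "79fa87e9...ea698ff4",
    "evaluation/config/prompts.yaml": "ba4f668e...ba498ee9",
}

def sha256_of(path: str) -> str:
    """Stream the file in 1 MiB chunks and return its hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

ok = True
for path, expected in EXPECTED.items():
    match = sha256_of(path) == expected
    ok = ok and match
    print(f"{'OK  ' if match else 'FAIL'} {path}")
sys.exit(0 if ok else 1)
```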

This analysis was written by the Genesis Park editorial team with the help of AI. The original article is available via the source link.
