Anthropic's argument for Mythos' SWE-bench improvement contains a fatal error
hackernews · 📰 News
#anthropic #claude #opensource
Source: hackernews · Summarized and analyzed by Genesis Park
Summary
To support its claim that Mythos' SWE-bench results are not the product of memorized training data, Anthropic presented a graph comparing pass rates after filtering down to only the problems its detector judged unlikely to be memorized. However, even when an imperfect memorization detector consistently judges the gains to be genuine skill rather than cheating, the possibility that the gains are in fact entirely explained by memorization cannot be ruled out. Even granting the improvements on internal benchmarks, there is a fatal flaw in using an imperfect detector's judgment alone to prove that memorization does not explain the performance gains.
Full text
Mythos' system card contains the following graph to support the argument that Mythos genuinely performs better on SWE-bench:

[Graph: SWE-bench pass rate for Mythos and Opus 4.6 as a function of the memorization-probability threshold used to filter solutions]

Anthropic and others are worried LLMs are memorizing SWE-bench, so they asked an LLM to estimate the probability that each solution is memorized. Next, they calculated the pass rate if they only included solutions the LLM judged to be memorized with less than 5% confidence, less than 10% confidence, and so on. Picking a point on the graph as an example: if they include ~400 out of 500 solutions because an LLM has judged them as memorized with a probability <= 60%, Mythos' success rate is ~92% while Opus 4.6's is ~82%.

This is a hard graph. Read the full caption [1] if you need to. Take your time. I stared at it for a long time.

After presenting this graph, they conclude:

"Our detectors are imperfect, but this result is robust to the choice of threshold and consistent with Claude Mythos Preview's gains on internal benchmarks not present in any training corpus. We conclude that memorization does not explain its SWE-bench improvements."

Gains on internal benchmarks count for something, but does an imperfect memorization detector's consistent judgment that Mythos has genuine gains on SWE-bench support their claim at all? It doesn't. It is perfectly possible for an imperfect cheating detector to consistently judge a model whose gains are entirely explained by cheating as making genuine gains. Here's a short Python program proving it.

Let's start by modeling an LLM like Opus 4.6 that gets the right answer 80% of the time and whose per-solution memorization estimate is equally likely to fall anywhere between 5% and 90%. These bounds are chosen because, according to the above graph, it looks like LLMs don't like giving probability estimates below 5% or above 90%. We'll do this by generating an array of pairs where the first item has an 80% chance of being 1, representing a successfully solved problem, and the second item is the probability the solution was memorized:

    import random

    opus_4_6 = [
        (1, random.uniform(.05, .9)) if random.random() < 0.8
        else (0, random.uniform(.05, .9))
        for _ in range(500)
    ]

Next, we model a cheating Mythos whose 10-point performance gain is, by hypothesis, entirely explained by cheating. For each problem opus_4_6 got wrong, we give mythos a 50% chance of getting the right answer, and when it does, we increment the probability of cheating by a random value between 10% and 65%, capped at 90%, since in Anthropic's data we see LLMs don't like to give estimates above ~90%. This is meant to model how Anthropic's cheating detection is imperfect. It also assumes that the probability of memorization is strictly higher for Mythos, which is reasonable given that it's a bigger model with more training data.

    mythos = [
        (1, min(.9, og[1] + random.uniform(.1, .65)))
        if random.random() < 0.5 and og[0] != 1 else og
        for og in opus_4_6
    ]

If you graph this data, you can get something that looks very similar to the graph above.

Obviously, a perfect cheating detector wouldn't have this problem, but Anthropic admits their detection isn't perfect. Until we quantify the degree of their detector's imperfection, citing it as evidence of Mythos' performance gains should carry zero weight.

Here's all the code if you want to try it yourself. h/t to Claude for figuring out the visualization code and a few other things. h/t to Anthropic for making Claude.

[1] The full caption is on page 186 of the system card.
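To make the threshold sweep concrete, here is a minimal, self-contained sketch of the whole simulation. The two list comprehensions are taken from the post; the pass_rate_at helper, the threshold loop, and the fixed seed are illustrative additions of ours (the original post links to its own full code, including the plotting), so treat this as a sketch of the argument rather than the author's exact script.

    # Self-contained sketch: a cheating Mythos still "wins" at every threshold.
    import random

    random.seed(0)  # reproducibility only; not part of the original post

    # (solved?, detector's estimated probability the solution is memorized)
    opus_4_6 = [
        (1, random.uniform(.05, .9)) if random.random() < 0.8
        else (0, random.uniform(.05, .9))
        for _ in range(500)
    ]

    # Cheating Mythos: on half of Opus 4.6's failures, flip the problem to
    # solved and bump the memorization estimate, capped at .9.
    mythos = [
        (1, min(.9, og[1] + random.uniform(.1, .65)))
        if random.random() < 0.5 and og[0] != 1 else og
        for og in opus_4_6
    ]

    def pass_rate_at(results, threshold):
        # Keep only solutions the detector judges memorized with
        # probability <= threshold (the x-axis of Anthropic's graph),
        # then compute the pass rate over what remains.
        kept = [solved for solved, p_mem in results if p_mem <= threshold]
        return sum(kept) / len(kept) if kept else float("nan")

    # Sweep thresholds the way the system-card graph does. Even though all
    # of mythos' extra wins are memorized, its filtered pass rate is higher
    # everywhere: at high thresholds the cheated wins count as passes, and
    # at low thresholds they drop out of mythos' denominator entirely while
    # the matching failures still drag opus_4_6 down.
    for pct in range(10, 95, 5):
        t = pct / 100
        print(f"threshold <= {t:.2f}: "
              f"opus_4_6 {pass_rate_at(opus_4_6, t):.3f}  "
              f"mythos {pass_rate_at(mythos, t):.3f}")

Running this prints a mythos pass rate above opus_4_6's at every threshold, i.e. exactly the "robust to the choice of threshold" pattern the system card cites, produced here by a model whose gains are 100% cheating.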