Thoughts on some moments in the Claude Mythos system card

hackernews | 🔬 Research
#anthropic #claude #gpt-5 #openai #review
Original source: hackernews · Summarized and analyzed by Genesis Park

Summary

An examination of the contamination analysis in Section 6.2 of the Claude Mythos system card turned up notable issues around the SWE-bench and CharXiv Reasoning benchmarks. In particular, Anthropic's choice to analyze a benchmark it had excluded from the report was flagged as questionable. The author hopes the findings reach a wider audience and asks for help clarifying the ambiguous parts.

Full text

This post is mostly a repetition of my older commentary, but with extra content. I feel that more people should know about the strange things in the Claude Mythos report (and maybe someone will help me clarify those moments). I will focus only on section 6.2 (contamination analysis) from the system card. Also, tl;dr: it's very flawed.

Let's start with the first strange moment:

"Below, we discuss three evaluations where the problem of contamination is particularly salient."

So, the three most important benchmarks in this context are:

- SWE-bench variants (okay)
- CharXiv Reasoning. Well, it's not that interesting, since the Mythos results are good but not that crazy.
- MMMU-Pro. Yeah, of all the remaining benchmarks, Anthropic chose the one which was omitted from the report.

So, there are lots of benchmarks where Mythos demonstrates extremely high results, but of course they weren't evaluated (except SWE-bench, but it's well-known for contamination problems and thus can't be ignored). It's a super interesting approach.

"We analyze SWE-bench Verified, Multilingual, and Pro to check for memorization"

Okay, let's read their analysis.

"OpenAI documented similar concerns for SWE-bench Verified."

OpenAI also said that only 83.6% of SWE-bench Verified can be solved, while Mythos scores 93.9%. But okay, let's do simple arithmetic. OpenAI claims GPT-5.2 solved 31 of the 82 problematic tasks (as I understand it, across all runs combined). In other words, GPT-5.2 memorized those 31 tasks. Since GPT-5.2 and Opus 4.5/4.6 have nearly identical scores, but the latter models work better on other coding benchmarks and in practice, I assume Opus 4.6 has fewer problems with memorization. So, let's say Opus 4.6 memorized 30 of the 82 problematic tasks.

Now, let's analyze Mythos. If you solve all valid tasks, your score will be 83.6%. So, Mythos solves at least 51-52 invalid tasks per run, and we can conclude that it memorized at least 52 problematic tasks. But some important notes:

- I assume that Mythos solved all valid tasks in all runs. While that isn't impossible, there is a significant chance this assumption is wrong, because SWE-bench Verified has really tough tasks even for public SoTA models. If this assumption is wrong, then the number of tasks memorized by Mythos would be even higher.
- We are actually interested in the number of invalid tasks that Mythos can solve at all. Unfortunately, I can't derive this number from the provided info. In general, it equals or exceeds the highest number of incorrect tasks the model can solve in a single run (52 in our case). So, I assume Mythos memorized exactly 52 invalid tasks, even though the true number is likely higher.
- OpenAI also says a model can occasionally solve an incorrect task it hasn't memorized. But I assume the probability of such an event is very low and doesn't exceed a few percent.

So, the Opus 4.6 memorization rate is 36.6% (30/82), while the Mythos memorization rate is at least 63.4% (52/82). Along with the huge jump in the benchmark score we also get a huge jump in the memorization rate (which is a sign the benchmark could be hugely gamed), and Anthropic doesn't comment on this at all. Totally not suspicious.
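To make the arithmetic above easy to check, here is a minimal sketch of the derivation. It assumes SWE-bench Verified has 500 tasks (the n=500 from the card's own figure caption), that 82 of them are the tasks OpenAI flagged as problematic, and that Mythos solved every valid task; everything except the reported 93.9%, 83.6%, 31-of-82 and the guessed 30-of-82 figures is derived, not taken from the report.

```python
# Back-of-the-envelope memorization arithmetic for SWE-bench Verified.
# Assumptions (mine, not the system card's): 500 tasks total, 82 flagged as
# problematic/invalid by OpenAI, and Mythos solves every valid task per run.

TOTAL_TASKS = 500
INVALID_TASKS = 82                          # flagged by OpenAI as unsolvable/problematic
VALID_TASKS = TOTAL_TASKS - INVALID_TASKS   # 418

max_valid_score = VALID_TASKS / TOTAL_TASKS  # 0.836 -> the 83.6% ceiling OpenAI cites
mythos_score = 0.939                         # headline Mythos score

# Tasks Mythos must be solving beyond the "solvable" ceiling.
mythos_solved = mythos_score * TOTAL_TASKS       # ~469.5 tasks per run
invalid_solved = mythos_solved - VALID_TASKS     # ~51.5 -> at least 51-52 invalid tasks

# Memorization rates over the 82 flagged tasks.
gpt52_memorized = 31    # reported by OpenAI for GPT-5.2 (all runs combined)
opus46_memorized = 30   # the author's guess for Opus 4.6
mythos_memorized = 52   # lower bound derived above

print(f"ceiling without memorization: {max_valid_score:.1%}")          # 83.6%
print(f"invalid tasks Mythos solves per run: ~{invalid_solved:.1f}")   # ~51.5
print(f"GPT-5.2 memorization rate: {gpt52_memorized / INVALID_TASKS:.1%}")    # 37.8%
print(f"Opus 4.6 memorization rate: {opus46_memorized / INVALID_TASKS:.1%}")  # 36.6%
print(f"Mythos memorization rate (lower bound): {mythos_memorized / INVALID_TASKS:.1%}")  # 63.4%
```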
So, we are left with a few options:

- Anthropic extremely gamed SWE-bench Verified
- OpenAI made mistakes in their analysis
- OpenAI lied (this variant would be funny though)
- I made mistakes in my analysis

Note: even if some of the last three options are true, that doesn't guarantee that Mythos's contamination problems aren't far more severe than those of previous models.

"To detect memorization, we use a Claude-based auditor that compares each model-generated patch against the gold patch and assigns a [0, 1] memorization probability. The auditor weighs concrete signals—verbatim code reproduction when alternative approaches exist, distinctive comment text matching ground truth, and more—and is instructed to discount overlap that any competent solver would produce given the problem constraints. A complementary rule-based check flags substantial verbatim comment overlap with the reference solution."

This description is extremely vague, so I can't say much. Nevertheless, it looks like this detector could be bypassed by generating synthetic data for each task in the benchmark (which is actually just another form of data contamination). Also, there could be problems with detecting "alternative approaches" if the auditor can't reliably solve the task itself, which could lead to underestimation of the actual probability.

Now, let's look at the graph: https://preview.redd.it/9uokakc567ug1.png?width=1326&format=png&auto=webp&s=f121cb557fdb97e6fac18f48ea63bc7cbe4d6edc

Ugh...:

- If we exclude all tasks whose probability of being memorized by Mythos is greater than 0.3-0.4, then the gap between Opus/Sonnet 4.6 and Mythos becomes significantly smaller.
- It looks like the memorization probability of a task subset doesn't correlate strongly with Mythos's pass rate on it, at least past the 0.3-0.4 threshold. That's very strange. Theoretically, tasks with a higher memorization probability should be easier to solve (but not too easy) and to memorize than ones with a lower probability, for all models. So, in general, pass rates on tasks with higher p should be higher than on tasks with lower p, and this should hold for every model. For some unclear reason, we observe the opposite trend with Opus/Sonnet 4.6. It isn't impossible, but it looks very strange.
- It looks like Mythos memorizes tasks from SWE-bench Pro better than tasks from SWE-bench Verified/Multilingual. That's strange, because the former is supposed to be "fresh" and kind of "contamination-proof" (you can read the Scale AI articles for more details), while the latter are known for their contamination problems. So, either the detector isn't very good or Anthropic decided to train directly on the test set.

Yeah, their approach looks super strange at this point.

"The figures above show pass rate as a function of filter strictness for Claude Mythos Preview, Claude Opus 4.6, and Claude Sonnet 4.6 on SWE-bench Verified (n=500), Multilingual (n=297), and Pro (n=731)."

The graph says all models were tested on 303 SWE-bench Multilingual tasks rather than 297.

"At threshold 1.0 (rightmost), all problems are retained and the curves match the headline scores in Table 6.3.A"

That isn't true, at least for SWE-bench Pro (Mythos: 77.8 in the table vs. roughly 85 on the graph).

"This is consistent with Claude Mythos Preview having memorized some of the more difficult flagged problems, which the baseline models did not independently solve."

Yet we have literally zero information about the number of memorized tasks, only a wonky graph.

"We conclude that memorization does not explain its SWE-bench improvements."
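For reference, here is a minimal sketch of how a "pass rate as a function of filter strictness" curve like the one in the figure can be computed, assuming you have the auditor's per-task memorization probabilities and per-task pass/fail results. The data below is invented purely for illustration; the card does not publish its plotting code.

```python
# Sketch of the "pass rate vs. filter strictness" curve described in the card:
# at threshold t, keep only tasks whose auditor memorization probability is <= t,
# and report the pass rate on the retained subset. At t = 1.0 every task is
# retained, so the rightmost point should equal the headline score by definition.
# All data here is made up for illustration.

import numpy as np

rng = np.random.default_rng(0)
n_tasks = 500
mem_prob = rng.random(n_tasks)        # auditor's [0, 1] memorization probability per task
passed = rng.random(n_tasks) < 0.9    # whether the model solved each task

def pass_rate_at_threshold(mem_prob, passed, threshold):
    """Pass rate over tasks whose memorization probability is <= threshold."""
    kept = mem_prob <= threshold
    if kept.sum() == 0:
        return float("nan")
    return passed[kept].mean()

for t in (0.2, 0.4, 0.6, 0.8, 1.0):
    kept_count = int((mem_prob <= t).sum())
    rate = pass_rate_at_threshold(mem_prob, passed, t)
    print(f"threshold {t:.1f}: kept {kept_count:3d} tasks, pass rate {rate:.1%}")
```

Under this construction the value at threshold 1.0 must match the headline score, which is exactly why the ~85% vs. 77.8% mismatch on SWE-bench Pro noted above is hard to explain away.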
So, the conclusion is "there could be gains that don't come from memorization (btw, we didn't provide any estimates of those gains), so it's okay." If there is a 30% improvement and 25 points of it come from memorization, then memorization technically "does not explain" the improvement, but there is obviously a big difference between "the actual improvement is 30%" and "the actual improvement is 5%".

The next part, about CharXiv Reasoning:

"To estimate the impact of contamination, we construct held-out variants of a subset of the benchmark in which we manually perturb each question or image and compare original versus remix accuracy."

Sometimes the absence of the text/image doesn't really impact model performance, even when both are required to do the task (example). So, without a description of those "perturbations", it's hard to say how effective this approach is.

"For instance, we ask the model to identify one chart label instead of another, or to identify the second-lowest rather than the second-highest series such that the correct answer changes while difficulty is approximately preserved."

Those perturbations sound like classical data augmentation techniques. So, if Anthropic (or any other lab) wanted very badly to improve Claude's performance on that benchmark, they could easily take questions from the original benchmark, perturb them in various ways (including the ways described in the report) and add them to the training set. Btw, if the "perturbations used to detect contamination" and the "perturbations used to train the model for this benchmark" heavily overlap, and there are lots of perturbed tasks in the training data, then the model will likely end up performing better on the "perturbed" set than on the "original" one (a toy sketch of this is at the end of the post). It's very sad that the report doesn't discuss this.

And MMMU-Pro:

"Given the difficulty of determining the impact of contamination, we choose to omit results for MMMU-Pro from this System Card."

Lol. I don't see where the report "determines the impact of contamination" for the previous two tasks. All I see is "it looks like there could be actual improvement, so everything is OK".

So, at this point I think it's reasonable to say that this part is low quality. Almost all benchmarks with huge leaps are ignored (yeah, it's more important to "analyze" the omitted bench), the experiments have obvious weak sides, and the conclusions are extremely shallow and/or misleading. It looks like Anthropic added this analysis only to say "See? We are responsible and honest and don't want to deceive you." Also, I think "Mythos is too dangerous to release" could partly come from the fact that the actual model is benchmaxxed to hell and is only incrementally better than Opus/Sonnet 4.6 on most tasks (even though it's likely there are tasks where Mythos is actually significantly better). Feel free to share your opinion.
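As a footnote to the CharXiv discussion above, here is a toy sketch of the kind of question perturbation the card describes (swapping "second-highest" for "second-lowest") and of why the same transformation doubles as a data-augmentation recipe. All names and data here are hypothetical; the card does not publish its perturbation code.

```python
# Toy illustration of the "remix" perturbation described for CharXiv Reasoning:
# flip the question so the correct answer changes while difficulty stays similar.
# The same transformation is a standard data-augmentation trick, so a lab that
# trained on perturbed copies of the benchmark could still look clean on an
# original-vs-remix accuracy comparison. Everything here is hypothetical.

def perturb_question(question: str) -> str:
    """Swap superlative direction, e.g. 'second-highest' -> 'second-lowest'."""
    if "second-highest" in question:
        return question.replace("second-highest", "second-lowest")
    if "second-lowest" in question:
        return question.replace("second-lowest", "second-highest")
    return question  # no applicable perturbation found

original = "Which series has the second-highest value in 2019?"
remixed = perturb_question(original)

print(original)  # answer: some series A
print(remixed)   # answer changes, difficulty roughly preserved

# If perturbed copies like `remixed` also ended up in the training set, both
# original and remix accuracy would be inflated and the original-vs-remix gap
# would shrink or even invert -- the failure mode the post worries about.
```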

This analysis was produced by the Genesis Park editorial team with the help of AI. The original text is available via the source link.
