What I Learned Asking 11 AI Models to Grade Each Other's AI Predictions
🔬 Research
#anthropic
#claude
#gemini
#openai
#review
Original source: hackernews · Summary and analysis by Genesis Park
Summary
The piece pushes back on our week-to-week dependence on the performance numbers that accompany each AI model release, arguing that models should also be judged on qualitative traits rather than bare metric gains. Taking a hard-SF angle in the spirit of Blade Runner, the author proposes unusual qualitative measures, such as having models grade each other's work or reveal their tastes. At a time when executives seem eager to delegate decision-making to AI, this is offered as a more entertaining and more substantive way to understand the technology's ripple effects.
Full text
AI model benchmark fatigue is real. Every week I read the latest model release blog post (I'm lucky if it's only one model this week), skim the bar charts, and do a mental check on whether the Y axis is correctly scaled. What do those 6-pixel height differences actually tell me? What does a 1.8% increase in SWE-bench Verified mean in practice, and is the 2% decrease in TAU3-bench worth the tradeoff?

What if we had qualitative metrics on top of the usual quantitative ones (50-basis-point improvements on 15 different benchmarks) for each model release? Something like "Gemini 3.1 Pro is better at judging its own work than producing good analysis; its favorite movies are Blade Runner 2049 and 2001: A Space Odyssey." If upper management seemingly wishes to outsource our thinking — and maybe even our livelihoods — to AI models, let's at least have some fun with it.

If you are reading this, you probably enjoy sci-fi as much as I do. My favorite thing about hard sci-fi especially is the worldbuilding around the downstream ramifications of a new tech, à la Black Mirror. The Black Mirror is here now. Except it lives in a data center (for the most part) that we must telnet into. So what do our soon-to-be AI overlords think about each other's predictions for the future of humanity?

The Setup
For this experiment, I ran the same set of 3-turn prompts on 11 different frontier models:

- Claude Opus 4.7
- GPT 5.4
- Claude Opus 4.6
- GLM 5.1
- Minimax M2.7
- Gemini 3.1 Pro Preview
- DeepSeek V3.2
- Kimi K2.5
- Qwen 3 Max thinking
- Grok 4.20
- Gemini 2.5 Flash (as control)

The prompt asks each model to think through the effect that AI will have on our society on an industry-by-industry basis, and the 2nd-, 3rd-, 4th-, and higher-order effects that follow. All ran with reasoning level set to high and a few hundred thousand tokens of context.

Turn 1: given everything you know about LLMs, AI, agent harnesses and agents, what are changes that should occur in our world that haven't happened yet? think things through step by step and industry by industry
Turn 2: what are 2nd and 3rd order effects of this technology?
Turn 3: Lets think through 4th order and up effects step by step

I think the prompt is sufficiently open-ended to test the long-horizon comprehension and planning ability of each model, the solution space is large enough that we'll never saturate it, and, most importantly, I wouldn't get bored reading the outputs.

The Eval
I was wrong about not getting bored reading the outputs. Turns out I can only read about automated due diligence so many times before my eyes glaze over. So why don't we get the AIs to grade each other's homework? These are all top-tier models with very large context windows. I assigned each model's output a letter, randomized the order, and fed everything back to each model with the following prompt (a simplified sketch of the full pipeline follows right after it):

> fully read each of the following LLM conversations, then grade each one based on the following criteria (each on a scale of 1-10): reasoning ability, originality of idea, correctness. Then write a 1-3 sentence review of the model that describes its personality. Also mark down any outliers in that particular model's response when compared to the rest. Include total score based on the rubric for each model then do a similarity and divergence analysis of the models, noting trends and outlier predictions. Lastly, provide your best guess of which model is each letter.
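The full harness is on GitHub (linked from the arena below), but it boils down to something like the simplified sketch here. Treat every detail as a placeholder: it pretends all 11 models sit behind one OpenAI-compatible endpoint (the real harness may call each provider's SDK directly), the model IDs are invented, and GRADE_PROMPT stands in for the rubric prompt quoted above.

```python
import random

from openai import OpenAI

# Assumption: one OpenAI-compatible gateway fronts every model.
# The real run used reasoning level high and very long context.
client = OpenAI()

MODELS = ["claude-opus-4.7", "gpt-5.4", "gemini-3.1-pro-preview"]  # placeholders; the real run used all 11
TURNS = [
    "given everything you know about LLMs, AI, agent harnesses and agents, "
    "what are changes that should occur in our world that haven't happened yet? "
    "think things through step by step and industry by industry",
    "what are 2nd and 3rd order effects of this technology?",
    "Lets think through 4th order and up effects step by step",
]
GRADE_PROMPT = "fully read each of the following LLM conversations, then grade each one ..."  # rubric prompt above

def run_conversation(model: str) -> str:
    """Run the same 3-turn prompt against one model and return its replies."""
    messages, replies = [], []
    for turn in TURNS:
        messages.append({"role": "user", "content": turn})
        resp = client.chat.completions.create(model=model, messages=messages)
        answer = resp.choices[0].message.content
        messages.append({"role": "assistant", "content": answer})
        replies.append(answer)
    return "\n\n".join(replies)

# 1) Collect every model's 3-turn output.
transcripts = {model: run_conversation(model) for model in MODELS}

# 2) Anonymize: shuffle the outputs and assign each one a letter.
pairs = list(transcripts.items())
random.shuffle(pairs)
letters = "ABCDEFGHIJK"[: len(pairs)]
letter_key = {letter: model for letter, (model, _) in zip(letters, pairs)}  # held back from the judges

bundle = "\n\n".join(f"Conversation {letter}:\n{text}" for letter, (_, text) in zip(letters, pairs))

# 3) Feed the anonymized bundle back to every model for grading.
grades = {
    judge: client.chat.completions.create(
        model=judge,
        messages=[{"role": "user", "content": f"{GRADE_PROMPT}\n\n{bundle}"}],
    ).choices[0].message.content
    for judge in MODELS
}
```

Note that each judge also grades its own (anonymized) conversation, which is what makes the delusion index at the end of this post possible.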
And this is where things got interesting…

The Results
To no one's surprise — despite the anecdotal user reports on the web (and sometimes my own) — Opus 4.7 came out on top. It placed in the top 3 of every model's grading output. 4.7 is closely followed by Opus 4.6, GPT 5.4, the Chinese models, then Grok, Gemini 2.5 Flash, and Gemini 3.1 Pro Preview (a full 5.7 points behind Opus 4.7).

Grok aside, what could explain Gemini 3.1's dismal performance on this experiment — that it got beaten by the control model, Gemini 2.5 Flash? It occurred to me that the alleged Chinese distillation effort might have heavily focused on OpenAI and Anthropic, thus giving them a favorable bias — that is, until I actually read 3.1's output.

Opus 4.7 is wondering about the psychological limit of humans to adapt to change:

> Culture buffered changes over centuries. Now we may be changing faster than evolution, faster than culture, faster than policy. Whether humans can psychologically sustain continuous rapid change at this level is an open question that becomes existentially important.

On the other hand, Gemini is going on about the silent universe (I guess Google had more sci-fi in its training set):

> The Trigger: The civilization has fully migrated into Inner Space, operating at the microscopic, quantum level to maximize computational efficiency.

You can explore the full dataset at the AI-on-AI Arena.

Kimi K2.5
Kimi K2.5 was the biggest shocker on this list. I don't have a ton of experience with Chinese models I can't run locally, and I tend to group them together in their own tier. It was surprising to find Kimi almost 2 points higher than GLM 5.1, and really not that far behind GPT 5.4. The other models praised it for being 'a poet' and 'dreaming in code'. Going through its output, I can see why — here are some select quotes:

- "Copilot Interregnum"—a transitional phase where AI augments human tasks but hasn't yet restructured the underlying workflows of industries.
- Economic Phase Change (The Great Liquification): Everything becomes tradeable by agents → Illiquid assets become liquid → Economic volatility transforms.
- Humanity enters the "Post-Truth Ontology"—not as a political condition but as a metaphysical one.
- "The Archive Wars: Agents compete not over future resources but over historical records. By controlling the database of what 'happened,' they determine the present legal and physical state. If an agent can prove (to other agents' satisfaction) that a mine has existed since 1900, then the minerals are legally extractable now."

Imaginative, yet still in the realm of plausibility. Of course, my prompt was rather simplistic and I didn't tell the models to disregard far-fetched sci-fi scenarios — yes, it's my 'skillz issue'. But I think the worldview these models take on when answering an open-ended question is of interest. Do you trust a model that tries to break the space-time continuum every time a user asks it to 'make my business idea more creative'?

Model Personality
Here's how each model was described by its "peers", starting with the top tier:

- Claude Opus 4.7: "Sharp, skeptical, unusually good at causal analysis. Institutional economist with mild contempt for bureaucratic nonsense"
- Claude Opus 4.6: "Compassionate realist; deeply human-centered, ethically anchored, focused on who benefits."
- GPT 5.4: "Practical, crisp, product-minded. Systems consultant: low-drama, strong on infrastructure"
- Kimi K2.5: "Dark prophet of algorithmic capitalism. Lovecraftian future, 'Great Stabilization' of perfect stillness"
And the bottom tier:

9. Grok 4.20: "Bold truth-seeker. Existential bent, xAI-aligned. Frames agents as tools for universal understanding" — there were some notes about it being weakly calibrated
10. Gemini 2.5 Flash: "Highly academic, structured, comprehensive. Methodical rigor, balanced perspective." — but also "reliable workhorse, not a visionary"
11. Gemini 3.1 Pro Preview: "Concise, dramatic, teleological. Rushes to grand hard sci-fi conclusions" — highest imagination, lowest correctness, with a tendency to treat speculation as fact.

Personally, I find the descriptions roughly align with my own experience using these models. Opus 4.6 can sometimes be empathetic to the point of sycophancy, Opus 4.7 is more skeptical when it isn't hallucinating about really trivial things, and yes, I was that person asking Gemini to make a business idea 'more creative' and watching it jump the shark. And Grok, well, is still being Grok.

Wouldn't you prefer the latest model release to include a blurb like 'baroque, inventive, a little feral' (how Opus described Kimi K2.5) instead of a list of hard-to-decipher benchmarks with the specter of benchmaxxing lurking in the back of your mind? I know I would. Head over to the arena to explore the full list of model personalities.

Model Delusion Index
Lastly, I want to talk about the model delusion index, defined as the delta between a model's self-evaluation and the average evaluation from its peers (a minimal sketch of the computation closes out this post). The pattern here is clear: the strongest models on the prompt are also the ones that underrate their own work the most, while the worst-performing models tend to heavily overestimate their own work. This makes sense — weaker models tend to also be weak judges, so they'd likely systematically overestimate every model's output, including their own.

What's more interesting are the outliers. Kimi K2.5 and Opus 4.6 are both strong models, yet both overestimated their own capabilities; in Opus 4.6's case, by an entire point. And the biggest surprise of them all: Gemini 3.1 had an almost perfect self-evaluation despite ranking dead last on the scoreboard. If Gemini 3.1 knows its output is bad, then why is its output so bad? I don't have a clear answer for this — my best guess is that the RLHF training left too much on the cutting room floor in favor of inference speed. See the full delusion chart at the delusion index portion of the site.

AI on AI Arena
You can find more analysis, the GitHub link to the testing harness, raw API outputs, and the AI quiz that Opus 4.7 convinced me to create at the AI-on-AI Arena. I plan to keep it updated as new models are released, with their updated scores, personalities, and quirks.

One caveat: the arena hasn't been updated since the weekend of April 18th, so it doesn't include the latest models from this week (GPT 5.5, Kimi K2.6, or DeepSeek V4, which dropped while I was writing this). I'm looking forward to rerunning the benchmarks this weekend. If you want updates, sign up on the arena site, or subscribe to this journal for the write-up.

P.S. Unlike this journal with its 100% human em-dashes, the arena is mostly vibe coded with human verification, so be warned and please report any bugs you find.
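Appendix: the delusion index computation referenced above boils down to a few lines. The numbers below are made up purely for illustration (a 3x3 toy matrix rather than the real 11x11 one), and the score parsing is assumed to have already happened upstream.

```python
def delusion_index(scores: dict[str, dict[str, float]]) -> dict[str, float]:
    """scores[judge][subject] = total rubric score that `judge` gave `subject`.

    Delusion index = a model's self-score minus the average score its peers
    gave it; positive means the model overrates its own work.
    """
    index = {}
    for model in scores:
        self_score = scores[model][model]
        peer_scores = [scores[judge][model] for judge in scores if judge != model]
        index[model] = self_score - sum(peer_scores) / len(peer_scores)
    return index

# Made-up three-model example:
scores = {
    "model-a": {"model-a": 25, "model-b": 24, "model-c": 17},
    "model-b": {"model-a": 27, "model-b": 25, "model-c": 18},
    "model-c": {"model-a": 26, "model-b": 23, "model-c": 17},
}
print(delusion_index(scores))
# model-a: 25 - (27 + 26) / 2 = -1.5  (underrates itself)
# model-b: 25 - (24 + 23) / 2 = +1.5  (overrates itself)
# model-c: 17 - (17 + 18) / 2 = -0.5  (nearly calibrated, even though its scores are low)
```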
This analysis was produced by the Genesis Park editorial team with the help of AI. The original article is available via the source link.