I underestimated AI capabilities (again)

Revisiting a prediction ten months early

On Jan 14th, I made predictions about AI progress in 2026. My forecasts for software engineering capabilities already feel much too conservative.

In my view, METR (where I now work) has some of the hardest and highest-quality software engineering and ML engineering benchmarks out there, and the most useful framework for making benchmark performance intuitive: we measure a task’s difficulty by the amount of time a human expert would take to complete it (called the “time horizon”).[1] (Toy sketches of the horizon estimate and the trend extrapolation below appear at the end of this section.)

When I made my forecasts last month, the model with the longest measured time horizon on METR’s suite of software engineering tasks was Claude Opus 4.5; it could succeed around half the time at software tasks that would take a human software engineer about five hours.[2]

Time horizons on software tasks had been doubling a little less than twice a year from 2019 through 2025, which would have implied the state-of-the-art 50% time horizon should be somewhat less than 20 hours by the end of 2026.[3] But there was ambiguity about whether the more recent doubling time was faster than the long-run trend, so I bumped that up to 24 hours for my median guess.[4] My 20th percentile was around 15 hours and my 80th percentile was around 40 hours.

Now, Opus 4.6 (released only 2.5 months after Opus 4.5) was estimated to have a 50% time horizon of ~12 hours.[5] I don’t take the specific number literally: there are many fewer very-long tasks than medium and short tasks, and the long tasks more often have guesstimated (rather than measured) human completion times, so time horizon estimates for the latest models are a lot noisier than they were in 2025. And the benchmark underlying the time horizon graph is nearly saturated, which causes the confidence intervals to blow up: the 95% CI is 5.3 hours to 66 hours.[6] It’s really hard to discriminate between different capability levels at the current range.

But at the end of the day, that dataset had 19 software engineering tasks estimated[7] to take humans longer than 8 hours, and Opus 4.6 was able to solve 14 of them at least some of the time (and it reliably nailed four of them).[8]

And beyond just this one task suite, we’ve seen examples of AI agents doing certain very well-specified software tasks, like writing a browser or a C compiler, or porting a giant game, that would take humans many weeks or months to do on their own. Not perfectly, but better than most people expected, and better than a naive reading of the agents’ measured time horizons would have suggested.

And this happened in February. It’s no longer very plausible that after ten whole months of additional progress at the recent blistering pace,[9] AI agents would still struggle half the time at 24-hour tasks. I wish them the best, but I think my colleagues on the capability evaluations team at METR might struggle to create new software tasks from a similar distribution capable of measuring AI agents’ true time horizons through the end of the year.

If we could measure this, I’d guess that by the end of the year, AI agents will have a time horizon of over 100 hours on the sorts of software tasks in METR’s suite (which are not highly precisely specified; on certain extremely well-specified software tasks like the examples above, agents seem to already have a time horizon of more than a hundred hours). And once you’re talking about multiple full-time-equivalent weeks of work, I wonder if the whole concept of “time horizon” starts to break down.
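For intuition, here’s a minimal sketch of how a 50% time horizon like the ones above can be estimated. It assumes the general shape of METR’s approach, fitting a logistic curve of success probability against the log of human completion time; the task results below are made up, and the real pipeline rests on human baseline measurements and many runs per model.

```python
import numpy as np
from scipy.optimize import minimize

# Made-up (human_minutes, succeeded) results for one model across a task suite.
human_minutes = np.array([2, 5, 15, 30, 60, 120, 240, 480, 960, 1920], dtype=float)
succeeded = np.array([1, 1, 1, 1, 1, 1, 0, 1, 0, 0], dtype=float)

def neg_log_likelihood(params):
    # Model success probability as logistic in log2(human time): p = sigmoid(a * log2(t) + b).
    a, b = params
    logits = a * np.log2(human_minutes) + b
    p = 1.0 / (1.0 + np.exp(-logits))
    eps = 1e-9  # guard against log(0)
    return -np.sum(succeeded * np.log(p + eps) + (1.0 - succeeded) * np.log(1.0 - p + eps))

fit = minimize(neg_log_likelihood, x0=[-1.0, 5.0])  # slope should come out negative
a, b = fit.x

# The 50% time horizon is where the fitted curve crosses p = 0.5, i.e. where the
# logit is zero: a * log2(t) + b = 0  =>  t = 2 ** (-b / a).
horizon_minutes = 2 ** (-b / a)
print(f"estimated 50% time horizon: {horizon_minutes / 60:.1f} hours")
```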
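And here’s the back-of-the-envelope trend extrapolation behind the forecast, as a sketch; the exact doubling rate is an assumption standing in for “a little less than twice a year.”

```python
# Numbers from the post; the doubling rate is an assumed stand-in for
# "a little less than twice a year".
base_horizon_hours = 5.0     # Opus 4.5's measured 50% horizon in mid-January
doublings_per_year = 1.9     # assumed long-run pace
years_remaining = 11.5 / 12  # mid-January through the end of 2026

projected = base_horizon_hours * 2 ** (doublings_per_year * years_remaining)
print(f"trend-implied end-of-2026 horizon: ~{projected:.0f} hours")
# ~18 hours, i.e. "somewhat less than 20 hours"; the median guess of 24 hours
# came from bumping this up to allow for a faster recent doubling time.
```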
It’s nearly impossible to subdivide a typical one-hour task (e.g., debugging one failing test) into smaller pieces that multiple people can work on in parallel. It wouldn’t go very well if you had to farm out writing this print statement or reading that error message or tweaking this line of code to different people: the right action to take next depends intimately on everything that came before it and the precise state of the code as a whole, and you have to hold the whole context in your mind as you take each action or the actions won’t cohere in the right way.

It’s somewhat easier to decompose an eight-hour task (e.g., writing a simple browser game) into smaller components, but those components are constantly bleeding into each other in ways that make clean handoffs hard. When you’re implementing the game logic, you realize it needs to know something about how the graphics are rendered. When you’re handling user input, you find yourself tweaking the game loop. The fastest way to do it is probably one person knocking it out in a day, making a hundred small decisions fluidly as they go.

But it’s actually pretty feasible to break down a month-long task into smaller pieces. In fact, you may start benefiting from some explicit decomposition: it might be helpful to write a design doc laying out how the pieces fit together, or to break the work into tickets so you don’t lose track of what’s done and what’s left. And while it might take one person working