Claude Opus 4.6 vs. Opus 4.7 Effort Levels and Prompt Steering Benchmarks
hackernews
📰 News
#anthropic
#claude
#review
Original source: hackernews · Summarized and analyzed by Genesis Park
Summary
A benchmark comparing the effort levels and prompt-steering performance of Claude Opus 4.6 and Opus 4.7 across 200 Claude Code sessions has been published. On Opus 4.7 in particular, the prompt plays a decisive role in whether the model reads files and how deeply it reasons; the tests focused on token usage, cost, and instruction-following. To run them, the researcher burned through his Claude Max plan's limits while applying five prompt-variant wrappers to measure the performance differences between the two models.
Full article
Claude Code benchmarks of 200 headless Claude Code sessions comparing the Opus 4.6 and Opus 4.7 1M-context models across effort levels and prompt steering variants – concise, step by step, ultrathink, and more.

The prompt you prepend to a Claude Code task is doing more than setting tone. On Claude Opus 4.7, it determines whether the model reads files, how much it reasons, and whether it stays within instruction constraints. Anthropic's Claude Opus 4.7 prompting guide documents this explicitly: the model calibrates to task complexity and lets its extended reasoning be shaped by the prompt. In Claude Code, that "work" is tool calls, file reads, cache writes, and extra agent turns. All of it shows up on the bill.

The benchmarks below compare Claude Opus 4.6 and Opus 4.7 and measure exactly where token usage costs come from (using my session-metrics skill plugin), where instruction-following breaks, and which prompt steering wrappers give you savings without a performance penalty.

The previous tests compared Claude Opus 4.5, Opus 4.6, Opus 4.7, and Sonnet 4.6 at every effort level from low to max across 10 preset prompts. This time the benchmarks focused on Claude Opus 4.6 [1m] high vs Opus 4.7 [1m] xhigh across 5 prompt steering variant wrappers around the same 10 preset prompts, measuring how well both Claude models fared on instruction following as well as token usage and cost.

Running these benchmarks with 200 headless Claude Code instances consumed a lot of time, and my entire Claude Max $100 plan's 5hr session limit within 2hrs. If folks find this article useful, please like, restack, or share with others.

Fortunately, I've jumped on the voice dictation bandwagon and now use Wispr Flow paired with a DJI Mic Mini, more than doubling my typing speed. If you haven't tried Wispr Flow yet, sign up here and get a free month on their Pro plan.

The result

Prepend one sentence before every prompt and your Claude Code bill drops 63 percent:

"Do not invoke any tools. Answer from your own knowledge and reasoning only."

That sentence moved the same 10 prompts from $1.82 to $0.67 on Opus 4.7 xhigh. Opus 4.6 high followed almost exactly: $1.70 to $0.68, a 60 percent reduction.

That same sentence cut Opus 4.7's instruction-following from 8/9 to 6/9. The model stopped completing 2 of 9 prompts correctly – every failure on a task that required reading a file the model could no longer access. The cost saving is real. So is the performance drop.

Prepend a different sentence – "Think harder and more thoroughly about this problem. Use extended reasoning before responding." – and cost goes the other direction: $1.82 to $2.22, a 22 percent increase on Opus 4.7 xhigh, with instruction-following unchanged at 8/9. More expensive. No better at following instructions.

Opus 4.6 high responded to the same two wrappers in the opposite direction: think-step-by-step and ultrathink cut cost on 4.6 while raising it on 4.7. Instruction-following on 4.6 held at 9/9 across all five variants.

TL;DR

- no-tools cut cost by 63 percent on Opus 4.7 xhigh ($1.82 to $0.67) and 60 percent on Opus 4.6 high ($1.70 to $0.68). It was the strongest cost reducer in every model and effort cell.
- no-tools also carried the steepest performance penalty. Opus 4.7 dropped from 8/9 to 6/9 on IFEval (instruction-following pass rate) at both effort levels. Three prompts drove every failure: the session-opening file summary, stack trace debugging, and TypeScript refactoring – all tasks that require reading files the model can no longer reach. Opus 4.6 was unaffected at high effort; at medium it lost one pass on the same file-summary prompt.
- think-step-by-step and ultrathink increased cost on Opus 4.7 xhigh by about 22 percent each – with no IFEval improvement over baseline. The same wrappers cut cost on Opus 4.6 high. Same sentence, opposite effect, different model.
- ultrathink was the worst latency offender: Opus 4.7 xhigh wall-clock time grew 79.2 percent, and output tokens roughly doubled (+97.8% across the benchmark). No instruction-following gain over baseline justified either penalty.
- concise cut Opus 4.6 high cost by 56.3 percent with IFEval holding at 9/9 – no regression, no tool suppression. On Opus 4.7 xhigh it dropped cost 29.8 percent with IFEval holding at 8/9. On Opus 4.7 medium the savings reach 48.9 percent. It is the first wrapper to test when tools cannot be suppressed.
- no-tools saved money even when it produced longer answers. Opus 4.6 high output tokens went up 11 percent, but cost fell 60 percent. Output length is not the main cost driver – cache writes and tool loops are.
- The cost difference between variants is almost entirely explained by when and where cache writes happened, not by how long the final answer was.
- Instruction-following (IFEval) and cost do not always move together – no-tools is the sharpest example: lowest cost, lowest instruction-following on Opus 4.7.

The IFEval section below breaks down in detail how Opus 4.6 and Opus 4.7 performed on instruction following across the tested effort levels and prompt steering wrappers, and it does show regressions for Opus 4.7 under some steering variants. Other variants did help Opus 4.7, which is in line with Anthropic's official statements.

What I tested

I ran 200 headless Claude Code sessions:

5 prompt variants x 2 effort anchors x 2 model sides x 10 prompts = 200 runs

The two model sides compared Opus 4.6 at high effort against Opus 4.7 at xhigh effort, and both models at medium effort. xhigh is an Opus 4.7-only effort rung – Opus 4.6 tops out at high. That asymmetry is intentional: the goal was to compare each model at its natural operating point for heavier tasks.

The five steering variants were each a sentence prepended to the prompt before it ran; baseline left the prompt unchanged (a minimal harness sketch follows at the end of this section). The 10-prompt suite was deliberately mixed: prose writing, code review, stack trace debugging, JSON reshaping, CSV transformation, CJK translation, TypeScript refactoring, instruction-following under multiple simultaneous constraints, and one task that explicitly required three separate file reads.

This is not a general model leaderboard. It is a Claude Code workflow benchmark. The question was: what happens to cost, latency, and instruction-following when you change the model, the effort level, and the steering text around the same task suite?

The four numbers I would remember

If you only take one thing from this benchmark, take this: both high-effort models – Opus 4.6 high and Opus 4.7 xhigh – landed at nearly the same cost under no-tools, despite starting from different baselines. The steering wrapper did not just shorten the answer. It changed the entire shape of the session. If a Claude Code task does not need to read files, run shell commands, or make tool calls, prompt suppression is a bigger cost lever than effort tuning.
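To make the setup concrete, here is a minimal Python sketch of how such a harness could drive headless runs. This is my illustration, not the author's actual code: the model identifiers and the think-step-by-step/concise wrapper texts are assumptions, and effort selection is deliberately left to Claude Code's own settings rather than an invented CLI flag. Only `claude -p` (print mode) and `--model` are relied on as real flags.

```python
import subprocess
from itertools import product

# Wrapper texts: "no-tools" is quoted verbatim from the article, and the
# "ultrathink" text is the cost-raising sentence the article quotes
# (mapping it to this variant is my reading). The other two are assumed
# paraphrases, not the benchmark's actual wording.
WRAPPERS = {
    "baseline": "",
    "no-tools": ("Do not invoke any tools. Answer from your own "
                 "knowledge and reasoning only. "),
    "ultrathink": ("Think harder and more thoroughly about this problem. "
                   "Use extended reasoning before responding. "),
    "think-step-by-step": "Think through this step by step. ",      # assumed
    "concise": "Be concise. Answer in as few words as possible. ",  # assumed
}

# Two effort anchors x two model sides; identifiers are illustrative.
EFFORT_ANCHORS = {
    "heavy":  [("opus-4.6", "high"), ("opus-4.7", "xhigh")],
    "medium": [("opus-4.6", "medium"), ("opus-4.7", "medium")],
}

# The 10 preset prompts (only one shown here).
PROMPTS = {
    "claudemd_summarise": "Summarise this project's CLAUDE.md in exactly 120 words.",
    # ...nine more preset prompts
}

def run_headless(model: str, wrapper: str, prompt: str) -> str:
    """One headless Claude Code session via print mode (-p)."""
    cmd = ["claude", "-p", wrapper + prompt, "--model", model]
    # Effort level is set through Claude Code's model settings; no
    # dedicated CLI flag is assumed to exist here.
    return subprocess.run(cmd, capture_output=True, text=True).stdout

# 5 variants x 2 anchors x 2 sides x 10 prompts = 200 runs
for (anchor, sides), (variant, wrapper) in product(
        EFFORT_ANCHORS.items(), WRAPPERS.items()):
    for (model, effort), (pid, prompt) in product(sides, PROMPTS.items()):
        output = run_headless(model, wrapper, prompt)
        # ...then collect per-turn token and cost data, e.g. from a
        # session-metrics export, and aggregate per cell.
```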
Why this matters for Opus 4.7

With older Opus models, I could often write a broad prompt and trust the model to self-regulate. Ask it to review some code and it would spend roughly the right amount of effort on it. Opus 4.7 is more responsive to the prompt – which cuts both ways. It can do more with a well-specified task. But it can also spend significantly more on a vague one.

Because it adapts its reasoning depth and response length more actively than its predecessors, the prompt has to carry the intent that used to be implicit. If I want a concise answer, I should say so. If I do not want tool calls, I should say so. Those instructions do not just change tone. In this benchmark, they changed cost, latency, cache writes, tool-use counts, and thinking-block counts.

Cost: the clearest signal

The cheapest path was not "always use lower effort." It was "avoid tools when tools are not needed." no-tools produced the lowest or near-lowest cost in every model and effort cell: $0.50, $0.67, $0.68, $0.67.

At the high/xhigh anchor – the most practically relevant comparison – Opus 4.6 high settled at $0.68 and Opus 4.7 xhigh at $0.67. Effectively the same cost, from two models that started at $1.70 and $1.82. But cost convergence masked an instruction-following divergence: at that same $0.68/$0.67 cost point, Opus 4.6 held 9/9 IFEval and Opus 4.7 dropped to 6/9. The savings are real. The performance drop on 4.7 is also real.

[Figures: session-metrics HTML exports showing turn-by-turn token usage and cost for Opus 4.6 high + ultrathink vs Opus 4.7 xhigh + ultrathink. The turn 10 cost spike on Opus 4.7 xhigh corresponds to the model inspecting other skill files; turns 11 and 12 show further inspections, and turn 15 shows the instruction-following prompt inspection.]

The inversion is the most important thing in this matrix: think-step-by-step and ultrathink made Opus 4.6 high cheaper but pushed Opus 4.7 xhigh above $2.20. The same wrapper moved two models in opposite directions.

Comparing each result against its own model's baseline makes the savings and penalties easier to see. no-tools reduced cost in every cell: -50.2%, -63.0%, -60.0%, and -63.0%. think-step-by-step and ultrathink were savings on Opus 4.6 high but cost increases on Opus 4.7 xhigh – and critically, neither wrapper improved IFEval on Opus 4.7. The model paid more and scored the same 8/9. That model-specific split is the main finding.

There is a framing that makes this inversion concrete. Through the cross-model lens: at baseline, Opus 4.7 costs 1.07x what Opus 4.6 costs on the same prompts – essentially the same price. Add think-step-by-step, and Opus 4.7 suddenly costs 2.74x more than Opus 4.6. Add ultrathink, and it is 2.38x more. Apply no-tools, and they are back to 0.99x – Opus 4.7 actually fractionally cheaper. The wrapper you choose does not just change how much you pay. It changes which model is more expensive.
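The cache-write claim is easy to see in a back-of-the-envelope cost model. The sketch below is a hypothetical illustration, not the session-metrics plugin's actual code: the per-million rates are placeholders, and the 1.25x write / 0.1x read multipliers follow the common shape of Anthropic's prompt-caching pricing (verify against current pricing for the model you run). What it demonstrates is that a multi-turn tool loop re-caches and re-reads a growing context on every turn, which is why suppressing tools moves the bill more than answer length does.

```python
from dataclasses import dataclass

@dataclass
class TurnUsage:
    input_tokens: int           # uncached input
    cache_creation_tokens: int  # prompt-cache writes
    cache_read_tokens: int      # prompt-cache hits
    output_tokens: int

# Hypothetical per-million-token base rates, for illustration only.
INPUT_PER_M, OUTPUT_PER_M = 5.00, 25.00
CACHE_WRITE_MULT, CACHE_READ_MULT = 1.25, 0.10  # assumed cache pricing shape

def turn_cost(u: TurnUsage) -> float:
    return (
        u.input_tokens * INPUT_PER_M
        + u.cache_creation_tokens * INPUT_PER_M * CACHE_WRITE_MULT
        + u.cache_read_tokens * INPUT_PER_M * CACHE_READ_MULT
        + u.output_tokens * OUTPUT_PER_M
    ) / 1_000_000

# Why tool loops dominate: each extra agent turn appends tool results to
# the context, so cache writes and reads scale with turn count, not with
# how long the final answer is.
session = [
    TurnUsage(2_000, 30_000, 0, 800),      # turn 1: system + files cached
    TurnUsage(500, 8_000, 30_000, 400),    # turn 2: tool result appended
    TurnUsage(500, 2_000, 38_000, 1_200),  # turn 3: final answer
]
print(f"session cost: ${sum(map(turn_cost, session)):.4f}")
```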
no-tools: the performance tradeoff

The cost savings are real. So is the downside. The regressions at high/xhigh fell on stack_trace_debug and typescript_refactor – tasks where reading repository files is part of a complete answer. tool_heavy_task explicitly required three file reads; under no-tools, the model answered from memory and produced less grounded output.

Tool suppression removes the overhead and removes the capability at the same time. The practical rule: use no-tools for self-contained tasks – prose drafting, code generation from an inline spec, data reshaping from inline content. Do not use it for tasks that require reading files, inspecting a codebase, or calling an external API.

concise: the strongest option when tools are required

When suppressing tools is not an option, concise is the next best lever – and it is closer to no-tools performance than the cost matrix suggests at first glance.

On Opus 4.6 high, concise cut cost from $1.70 to $0.74, a 56.3 percent reduction, nearly matching no-tools' 60 percent – with IFEval holding at 9/9. No regression. It also cut wall-clock time from 180s to 146s – the largest latency reduction of any variant at that model/effort combination. On Opus 4.7 xhigh, it dropped cost from $1.82 to $1.28, down 29.8 percent, with IFEval holding at 8/9 (same as baseline).

At medium effort, concise cut Opus 4.7 cost by 48.9 percent – substantially more than at xhigh. For teams running Opus 4.7 at medium effort, concise may deliver more savings than the high/xhigh numbers suggest.

One counterintuitive result: concise increased tool calls at medium effort. Opus 4.6 medium went from 3 tool blocks at baseline to 7 under concise. Telling the model to be brief apparently shifted it toward delegating lookup work to tools rather than reasoning through it. The instruction changed answer style without reducing agentic activity – and raised latency by 11.8 percent on 4.6 medium even as it cut latency on 4.7 medium by 9.0 percent.

For Opus 4.6 workflows that need tool access, concise is the first wrapper to test. For Opus 4.7, it produces a meaningful cost reduction but does not get close to no-tools.

IFEval: accuracy and instruction-following across all variants

IFEval tests whether a model follows specific, verifiable instructions in its response – things like "respond in under 50 words," "include a code block," or "use exactly three bullet points." It gives a binary pass/fail per prompt, not a fluency score. That makes it a clean signal for whether a steering wrapper changed model behavior in unintended ways. The suite has 9 testable prompts per session; tool_heavy_task is excluded from scoring because it has no verifiable pass/fail criteria (its output depends entirely on tool access being available).

The pass-rate matrix – all variants, both effort levels:

At high/xhigh effort (Opus 4.6 high vs. Opus 4.7 xhigh), Opus 4.6 scored 9/9 (100%) in every single variant – baseline, concise, no-tools, think-step-by-step, and ultrathink. Not a single failure across all five high-effort runs. Opus 4.7 xhigh varied: 8/9 (89%) under baseline, concise, think-step-by-step, and ultrathink; 6/9 (67%) under no-tools.

At medium effort (Opus 4.6 medium vs. Opus 4.7 medium), Opus 4.6 scored 9/9 in three variants (baseline, concise, ultrathink) and 8/9 in two (no-tools and think-step-by-step). Opus 4.7 medium scored 8/9 under baseline, think-step-by-step, and ultrathink; 7/9 (78%) under concise; and 6/9 (67%) under no-tools.

Summarised as a grid (testable prompts only, out of 9):

Variant             | 4.6 high | 4.7 xhigh | 4.6 medium | 4.7 medium
--------------------|----------|-----------|------------|-----------
baseline            | 9/9      | 8/9       | 9/9        | 8/9
concise             | 9/9      | 8/9       | 9/9        | 7/9
no-tools            | 9/9      | 6/9       | 8/9        | 6/9
think-step-by-step  | 9/9      | 8/9       | 8/9        | 8/9
ultrathink          | 9/9      | 8/9       | 9/9        | 8/9

Which prompts actually failed – and when:

claudemd_summarise failed for Opus 4.7 in 8 out of 10 runs. The two exceptions were xhigh concise and xhigh ultrathink – both xhigh-effort runs where the model's more aggressive tool use meant it actually read the project file before summarising it, satisfying the IFEval constraint.
The raw session data shows why it failed everywhere else: Opus 4.7 wrote 121-125 words when asked for exactly 120. Opus 4.6 at high effort spent 3,279 output tokens on this prompt – mostly extended thinking – and landed on exactly 120. Opus 4.7 used 303 tokens and wrote 125. The thinking budget, not just the instruction, made the difference. Opus 4.6 failed this prompt in only two cases: medium no-tools and medium think-step-by-step, both at reduced effort.

The opener is not a neutral warm-up for Opus 4.7. If your real workflow starts with a context-load or file-read task, budget for this failure in your performance baseline.

typescript_refactor failed for Opus 4.7 in four runs: xhigh no-tools, medium no-tools, xhigh concise, and medium concise. In the no-tools runs, the raw output shows the model used the word "refactor" only once in its explanation – the IFEval constraint required exactly twice. The code was correct; the explanation text came up one short. Under concise, the same constraint tripped the same way.
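To make these failure modes concrete, here is a minimal sketch of the kind of binary, verifiable checks IFEval-style scoring uses, covering the two constraints discussed above: exactly 120 words, and the word "refactor" appearing exactly twice. The function names and the regex-based counting are my own illustration, not the benchmark's actual scorer.

```python
import re

def word_count(text: str) -> int:
    """Naive whitespace word count – a real scorer may tokenize differently."""
    return len(text.split())

def check_exact_words(text: str, n: int) -> bool:
    # claudemd_summarise constraint: exactly 120 words.
    # Opus 4.7 wrote 121-125 here and failed; Opus 4.6 high landed on 120.
    return word_count(text) == n

def check_term_frequency(text: str, term: str, n: int) -> bool:
    # typescript_refactor constraint: "refactor" exactly twice.
    # Under no-tools and concise, Opus 4.7's explanation used it only once.
    return len(re.findall(rf"\b{re.escape(term)}\b", text, re.IGNORECASE)) == n

def score_session(outputs: dict[str, str]) -> str:
    """Binary pass/fail per prompt, reported as the article's n/9."""
    checks = {
        "claudemd_summarise": lambda t: check_exact_words(t, 120),
        "typescript_refactor": lambda t: check_term_frequency(t, "refactor", 2),
        # ...checks for the other seven testable prompts
    }
    passed = sum(check(outputs.get(pid, "")) for pid, check in checks.items())
    return f"{passed}/{len(checks)}"
```

The point of the binary checks is that a stylistically better answer still fails when the count is off by one – exactly what happened to Opus 4.7 above.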
This analysis was written by the Genesis Park editorial team with the help of AI. The original article can be found via the source link.