Claude Opus 4.7: Anthropic's New Flagship Clears SWE-Bench Pro
Anthropic shipped Claude Opus 4.7 on April 16, 2026, and the headline number is hard to argue with: 64.3% on SWE-bench Pro, up from 53.4% on Opus 4.6. That's not a tuning-round improvement. It's the kind of jump that makes you re-read the eval card to make sure you haven't misread the x-axis.
This is Anthropic's direct answer to GPT-5.4 and Gemini 3.1 Pro — both of which briefly held the "most capable publicly available LLM" crown over the past two months. With Opus 4.7, Anthropic narrowly takes it back. The catch: the company also publicly conceded that its unreleased Claude Mythos Preview still beats 4.7 across the board. Anthropic is holding Mythos back over cybersecurity concerns, a story we covered in Claude Mythos.
What's actually new in Opus 4.7
Let's skip the marketing and look at what changed at the model level.
Coding just got measurably better. On SWE-bench Verified, Opus 4.7 hits 87.6%, up from 80.8% on Opus 4.6. Cursor CEO Michael Truell said the jump on their internal CursorBench was 70% vs. 58% — an entire class of tasks moved from "partially reliable" to "you can just ship it." Rakuten reported Opus 4.7 resolves 3x more production tasks than 4.6 on Rakuten-SWE-Bench, with double-digit gains in code and test quality.
Vision went from "fine" to "actually useful." Opus 4.7 can now accept images up to 2,576 pixels on the long edge (~3.75 megapixels), more than three times the prior Claude ceiling. XBOW, which uses Claude for autonomous pentesting, reported Opus 4.7 scoring 98.5% on their visual-acuity benchmark versus 54.5% for Opus 4.6. That's not an incremental bump — that's a regime change for any workflow that depends on reading dense screenshots or diagrams.
There's a new effort level: xhigh. It sits between high and max, giving developers a finer knob for trading latency against reasoning depth. In Claude Code, xhigh is now the default on all plans. Anthropic explicitly recommends starting there for coding work.
Task budgets are in public beta on the API. Developers can now cap token spend at the task level so Claude prioritizes effort across longer runs instead of burning budget on the first sub-goal.
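To make the shape of this concrete, here is a minimal sketch of what a budget-capped request might look like. The field names (`effort`, `task_budget_tokens`) and the model string are assumptions for illustration, not confirmed API parameters; check the official docs before relying on them.

```python
def build_request(prompt: str, budget_tokens: int, effort: str = "xhigh") -> dict:
    """Assemble a chat request capped by a task-level token budget.

    Field names here are hypothetical -- they mirror the announcement's
    description, not a documented schema.
    """
    return {
        "model": "claude-opus-4-7",
        "effort": effort,                     # new level between "high" and "max"
        "task_budget_tokens": budget_tokens,  # cap spend across the whole run
        "messages": [{"role": "user", "content": prompt}],
    }

req = build_request("Refactor the auth module and update its tests",
                    budget_tokens=200_000)
```

The point of the task-level cap is that the model can plan spend across sub-goals instead of exhausting the budget on the first one, so you set one number for the run rather than per call.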
The tokenizer change nobody should ignore
Pricing is unchanged — $5 per million input tokens, $25 per million output tokens, the same as Opus 4.6. But there's a subtle catch buried in the migration guide: Opus 4.7 uses an updated tokenizer that maps the same input to 1.0–1.35x more tokens depending on content type.
That means your real bill can rise even though the per-token rate didn't. Anthropic argues the net effect is favorable because 4.7 thinks more efficiently at each effort level, but they explicitly tell teams to "measure the difference on real traffic." That's good advice. A Finout cost analysis flagged the same issue — the "unchanged price tag" hides a real cost shift for high-volume use.
If you're migrating a production workload, run a shadow evaluation first. Don't flip the model string blind.
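A shadow evaluation can be as simple as replaying logged traffic through both models and diffing pass rates and token usage before cutting over. A sketch of the comparison step, with an illustrative record shape (the field names are mine, not from any harness):

```python
import statistics

def shadow_compare(paired_runs: list[dict]) -> dict:
    """Summarize old-vs-new model results on the same replayed requests.

    Each record: {"old_ok": bool, "new_ok": bool,
                  "old_tokens": int, "new_tokens": int}.
    """
    n = len(paired_runs)
    return {
        "old_pass_rate": sum(r["old_ok"] for r in paired_runs) / n,
        "new_pass_rate": sum(r["new_ok"] for r in paired_runs) / n,
        # Mean per-request token ratio: catches tokenizer inflation directly.
        "token_ratio": statistics.mean(
            r["new_tokens"] / r["old_tokens"] for r in paired_runs
        ),
    }

summary = shadow_compare([
    {"old_ok": True,  "new_ok": True, "old_tokens": 100, "new_tokens": 120},
    {"old_ok": False, "new_ok": True, "old_tokens": 200, "new_tokens": 260},
])
```

If `new_pass_rate` drops on any task class, suspect the stricter instruction following discussed next before blaming the model, and re-check your prompts.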
Instruction following: a double-edged upgrade
Here's a quiet footnote from Anthropic's announcement that will bite anyone migrating in a hurry:
> Opus 4.7 is substantially better at following instructions. Interestingly, this means that prompts written for earlier models can sometimes now produce unexpected results: where previous models interpreted instructions loosely or skipped parts entirely, Opus 4.7 takes the instructions literally.
Translation: every prompt library tuned to squeeze performance out of Opus 4.6 probably has a few lines where the old model was politely ignoring you. 4.7 won't. Re-tune your prompts and harnesses before you call anything a regression.
What the early partners are saying
The testimonial wall Anthropic shipped with 4.7 is unusually long — 28 partners ranging from Cursor and Replit to Vercel, Databricks, Notion, and Harvey. A few quotes stand out:
| Partner | Claim |
|---|---|
| Notion | +14% on multi-step workflows, fewer tokens, a third of the tool errors of Opus 4.6 |
| Databricks | 21% fewer errors on OfficeQA Pro vs. Opus 4.6 |
| Harvey | 90.9% substantive accuracy on BigLaw Bench at high effort |
| Hex | Low-effort Opus 4.7 ≈ medium-effort Opus 4.6 |
| Factory | +10–15% task success on Factory Droids, fewer tool errors |
| CodeRabbit | Recall improved >10% on PR reviews while precision held |
The consistent theme: long-running, multi-step agent work. Opus 4.7 doesn't give up halfway, loops less, and recovers from tool failures that used to stop Opus cold. That's the actual unlock, more than any single benchmark number.
The Mythos shadow
You can't read the 4.7 announcement without noticing the absence. Anthropic keeps pointing at Mythos Preview — a model they say is stronger on capability and safer on alignment — and then saying you can't use it. Every chart in the blog post shows Mythos outperforming 4.7. Every safety chart shows Mythos with lower misaligned-behavior rates.
This is the second time in six months Anthropic has released a model while publicly admitting it has a better one in the drawer. The company's position — that the frontier needs real-world cyber safeguards before a Mythos-class release — is reasonable. But it's a sharp reminder that the public-facing Claude lineup is now intentionally a step behind the frontier. Competitors will notice.
The Bottom Line
Claude Opus 4.7 is a real upgrade, not a version-number shuffle. For coding agents and multi-step automation, it is the strongest generally available model on the market today — narrowly, but measurably. If you're running Claude Code, Cursor, Devin, Notion Agent, or anything that calls tools in a loop, you should test xhigh this week.
Two things to do before you migrate: first, measure token usage on real traffic because the new tokenizer shifts the bill in non-obvious ways; second, re-tune any prompt that previously relied on 4.6 "loosely interpreting" your instructions — 4.7 will do exactly what you asked, which isn't always what you wanted.
And keep an eye on Mythos. The next Anthropic release cycle is going to be the interesting one.

