Qwen3.7-Max: Alibaba's 35-Hour Agent Run Resets the Frontier
AI News 5 min read

Qwen3.7-Max: Alibaba's 35-Hour Agent Run Resets the Frontier

Sarah Chen
Sarah Chen
May 25, 2026

Alibaba quietly shipped a new frontier model last week, and the headline number is almost absurd. Qwen3.7-Max ran autonomously for 35 hours, fired 1,158 tool calls without human intervention, and delivered a 10× speedup over the standard Triton reference on a GPU kernel it had never seen during training. That isn't a demo polished for a launch tweet. It is the kind of long-horizon, multi-step autonomy that the entire industry has spent the last year pretending was just around the corner.

The model went live on May 20, 2026 with a 1M-token context window, native support for the Anthropic API protocol, and pricing that undercuts every comparable frontier model. The Qwen team didn't bury the lede in a blog post about "agent-readiness." They published the benchmarks, opened the API, and dared the rest of the industry to keep up.

The benchmark numbers that matter

Qwen3.7-Max is targeting one category specifically: agents that have to actually finish jobs. The benchmark suite reflects that.

Benchmark Qwen3.7-Max Closest competitor
Terminal-Bench 2.0-Terminus 69.7 DeepSeek-V4-Pro Max — 67.9
SWE-Bench Verified 80.4 Opus 4.6 Max — 80.8
SWE-Pro 60.6 K2.6 Thinking — 59.5
SWE-Multilingual 78.3
SciCode 53.5

Terminal-Bench 2.0-Terminus is the one to watch. It drops a model into a sandboxed terminal with 12 CPU cores, a 5-hour timeout, and tells it to act like a software engineer. There is no scaffolding to lean on, no human in the loop, no chain-of-thought formatting tricks. Just a shell. Qwen3.7-Max scores 69.7%, the highest of any model on that bench, and it beats Claude Opus 4.6 Max by 4.3 points.

SWE-Bench Verified shows the model finishing a hair behind Opus 4.6 Max — 80.4 vs. 80.8 — but SWE-Pro tells a different story. Qwen3.7-Max leads at 60.6, ahead of Moonshot's Kimi K2.6 Thinking and DeepSeek-V4-Pro Max. The Artificial Analysis Intelligence Index ranked it at 56.6 out of the gate, which slots it at #5 the week of launch and makes it the highest-ranked Chinese model the index has ever measured.

Pricing that breaks the leaderboard

Here's where it stops being a benchmark conversation and starts being a procurement conversation. Qwen3.7-Max is priced at $2.50 per million input tokens and $7.50 per million output tokens, with cached inputs dropping to $0.25/M — a 90% discount that turns repeated long-context calls into essentially a rounding error.

For comparison, Anthropic's Claude Opus 4.7 sits at roughly six times that input price for similar workloads. That gap matters less in single-shot use and matters enormously when you are running a 35-hour autonomous job with thousands of tool calls. A pricing decision at the API tier becomes a category-defining decision when an agent is the customer.

Six-times cheaper input pricing isn't a marketing line — it changes which agent architectures are economically viable.

The model exposes 65,536 output tokens per request and supports the full 1M-token context window across both input and output budgets. A million tokens, for those keeping score, is enough to hold a mid-sized codebase or a long stack of PDFs in a single call without any RAG plumbing.

The 35-hour run, decoded

The headline demo wasn't a chat session. Alibaba pointed Qwen3.7-Max at an unseen GPU and asked it to write the kernel software for it from scratch. The model worked for 35 wall-clock hours, made 1,158 tool calls, and produced code that ran 10× faster than the Triton baseline.

What that demonstrates isn't intelligence in the abstract. It's the absence of drift — the slow degradation in goal-state that kills most long-horizon agents around hour two or three. Most frontier models can run a coherent agent loop for 20-40 minutes before they start forgetting their own earlier decisions, contradicting themselves, or hallucinating tools that don't exist. Thirty-five hours with a coherent goal is a different class of capability.

It also exposes a quiet architectural shift: Qwen3.7-Max natively supports the Anthropic API protocol, which means tooling built for Claude Code drops in unchanged. You can point Claude Code at Qwen3.7-Max and the harness keeps working. That is a deliberate hostile-takeover-grade design choice — it makes switching costs near zero for the developer audience Anthropic has been carefully cultivating.

Where it falls short

Qwen3.7-Max is text-only. No image input, no audio, no native multimodality. For a model marketed at agents, that's a real gap — a meaningful chunk of agent work is screenshot-driven these days, and Qwen3.7-Max can't see the screen.

It's also a preview checkpoint. The current public version is Qwen3.7-Max-Preview, which means the API surface and pricing could shift before general availability. For production deployments where stability matters, that's an asterisk worth keeping.

And the weights aren't open. Qwen has built much of its developer goodwill on permissive releases — Qwen 3.6 Plus, the Coder series, the Moondream-class vision models — but Qwen3.7-Max is closed. The bet is that price-per-intelligence is low enough that no one will care. We'll find out.

The Bottom Line

Qwen3.7-Max is the first frontier model that was clearly designed for the agent era rather than retrofitted for it. A 1M-token window, native Anthropic-protocol compatibility, a benchmark sweep on agentic coding tests, and pricing that makes long-horizon runs economically viable — every one of those decisions points at the same use case. The 35-hour kernel run isn't a stunt. It is the new baseline that every other frontier lab now has to either match or explain away. If you're building anything that depends on an agent finishing the job, your evaluation list got one model longer last week.