AI News 6 min read intermediate

MiniMax M3: Open-Weight Frontier Coding Model With 1M Context

MiniMax M3 is an open-weight model pairing a 1M-token context and revived sparse attention with frontier coding benchmarks at more than 15× lower input cost than Claude Opus 4.7.

Sarah Chen

Jun 16, 2026


For years, the gap between Chinese open-weight models and Western flagships was "close enough on price, not quite there on quality." On June 1, 2026, MiniMax shipped a model that argues the quality gap has narrowed to a sliver, while the price gap now runs more than 15× in its favor.

What MiniMax Actually Shipped

MiniMax M3 is the Shanghai lab's new flagship, and the pitch is unusually concrete: it claims to be the first open-weight model to combine three frontier capabilities in a single architecture — frontier-level coding, a 1-million-token context window, and native multimodality (image and video understanding).

You can use M3 today through three channels: the standard API, MiniMax's monthly token plans, and MiniMax Code, the company's agent product. The catch worth flagging up front: at launch the model is hosted-only. MiniMax says the open weights and a full technical report will land on Hugging Face and GitHub within roughly ten days — so the "open-weight" label is a promise the company hasn't fully cashed yet.

The repo to watch is huggingface.co/MiniMaxAI; look for a model card titled MiniMax-M3.

The Real Story Is the Attention Architecture

The headline isn't the benchmarks — it's how M3 gets them. The model is built on MiniMax Sparse Attention (MSA), and the interesting part is that MiniMax abandoned sparse attention for its entire M2 generation before bringing it back here.

A quick refresher on why this matters:

Full attention lets every token in the context attend to every other token. It's accurate but expensive, and the cost scales quadratically as context grows.
Sparse attention skips most of those connections and computes attention only over a selected subset of earlier tokens, ideally the ones most relevant to the current query.
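To make the quadratic-versus-sparse contrast concrete, here is a small Python sketch that simply counts query-key scores. The fixed per-token budget of 4,096 attended tokens is an arbitrary illustrative assumption, not a figure MiniMax has published, and the count ignores whatever a real sparse scheme spends on choosing which tokens to attend to.

```python
def full_attention_pairs(n: int) -> int:
    """Query-key scores computed by causal full attention over n tokens.

    Token i (0-based) attends to itself and every earlier token: i + 1 scores.
    Summing 1 + 2 + ... + n gives n(n + 1) / 2, which grows quadratically.
    """
    return n * (n + 1) // 2


def sparse_attention_pairs(n: int, budget: int) -> int:
    """Scores computed when each token attends to at most `budget` past tokens.

    `budget` is a hypothetical fixed cap; real sparse schemes pick the
    attended tokens dynamically, and their selection step has its own cost.
    """
    if n <= budget:
        # Every token still fits under the cap, so nothing is skipped.
        return full_attention_pairs(n)
    # The first `budget` tokens see 1..budget tokens; every later one is capped.
    return full_attention_pairs(budget) + (n - budget) * budget


for n in (128_000, 1_000_000):
    full = full_attention_pairs(n)
    sparse = sparse_attention_pairs(n, budget=4_096)
    print(f"{n:>9,} tokens: full/sparse score ratio = {full / sparse:.1f}x")
```

With a fixed budget, doubling the context roughly doubles the sparse count but quadruples the full count, which is why the gap matters most at 1M tokens.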

M3's design uses a lightweight index branch to scan incoming tokens and select which blocks of past tokens deserve attention, then runs attention only on those blocks. Crucially, it does block selection on the real, uncompressed key-values rather than a compressed approximation — which is how MiniMax claims to keep precision while cutting cost.
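The select-then-attend flow can be sketched in plain Python for a single decoding step. This is a generic top-k block-sparse attention sketch under our own assumptions, not MiniMax's MSA: the "index branch" here is a hypothetical stand-in that scores every real cached key (no pooled or compressed block summaries) using only its first `index_dims` components, and `block_size`, `top_k`, and `index_dims` are illustrative parameters.

```python
import math

Vec = list[float]


def dot(a: Vec, b: Vec) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: Vec) -> Vec:
    m = max(scores)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q: Vec, keys: list[Vec], values: list[Vec]) -> Vec:
    """Scaled dot-product attention for one query over the given keys/values."""
    weights = softmax([dot(q, k) / math.sqrt(len(q)) for k in keys])
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


def block_sparse_attend(q: Vec, keys: list[Vec], values: list[Vec],
                        block_size: int, top_k: int,
                        index_dims: int) -> tuple[Vec, list[int]]:
    """One decode step of top-k block-sparse attention (illustrative, not MSA)."""
    n = len(keys)
    blocks = [range(s, min(s + block_size, n)) for s in range(0, n, block_size)]
    # Hypothetical cheap index branch: score each real cached key on a few dims.
    token_score = [dot(q[:index_dims], k[:index_dims]) for k in keys]
    # A block is as relevant as its best-scoring token.
    block_score = [max(token_score[i] for i in blk) for blk in blocks]
    ranked = sorted(range(len(blocks)), key=lambda b: block_score[b], reverse=True)
    chosen = sorted(ranked[:top_k])
    # Full-precision attention, restricted to the real keys/values in chosen blocks.
    idx = [i for b in chosen for i in blocks[b]]
    out = attend(q, [keys[i] for i in idx], [values[i] for i in idx])
    return out, chosen


# Toy cache: 16 tokens in 4 blocks; token 13 is the one that matches the query.
q = [1.0, 1.0, 1.0, 1.0]
keys = [[0.0] * 4 for _ in range(16)]
keys[13] = [5.0, 5.0, 5.0, 5.0]
values = [[float(i)] * 4 for i in range(16)]
out, chosen = block_sparse_attend(q, keys, values, block_size=4, top_k=1, index_dims=2)
print(chosen)  # [3]: only block 3 (tokens 12-15) is attended
```

The index pass still touches every cached token, but only cheaply; the expensive full-dimension attention runs on just `top_k * block_size` tokens. Setting `top_k` to at least the number of blocks makes every token eligible, so the function reduces to ordinary full attention, which is a handy sanity check.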

This is a public self-correction worth appreciating. In its own M2 engineering notes, the MiniMax team wrote that "the infrastructure for linear and sparse attention is much less mature" than full attention. A year later, they shipped production sparse attention with order-of-magnitude speedups. The bet paid off.

The efficiency numbers

The compute and speed comparisons are measured at 1M-token context against the prior generation; output speed and context window are absolute figures:

Metric	MiniMax M3
Per-token compute at 1M context	1/20 of prior generation
Prefill (processing input)	>9× faster
Decoding (generating output)	>15× faster
Output speed	~100 tokens/sec
Context window	Up to 1M tokens (512K guaranteed minimum)

The business angle here is sharper than "big context window." Most teams don't need to dump an entire monorepo into one prompt. What they need is long-context inference that's cheap enough to run in a loop — and cutting per-token compute to a twentieth is what makes long-running agents economical instead of a demo-only luxury.
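A back-of-the-envelope Python sketch shows why these speedups compound in an agent loop. It treats the table's ">9×" prefill and ">15×" decode figures as lower bounds, so dividing a prior-generation turn time by them gives an upper bound for M3. The 90 s / 30 s turn profile and the 30-turn loop are made-up numbers, and since the table's figures were measured at 1M-token context, the sketch only applies to turns of roughly that size.

```python
# Reported lower bounds from the efficiency table (measured at 1M-token context).
PREFILL_SPEEDUP = 9.0   # ">9x faster" prefill vs. the prior generation
DECODE_SPEEDUP = 15.0   # ">15x faster" decoding vs. the prior generation


def m3_turn_upper_bound(prior_prefill_s: float, prior_decode_s: float) -> float:
    """Upper bound on an M3 agent turn, given the prior generation's timings.

    The speedups are reported as lower bounds, so dividing by them
    yields an upper bound on M3's time for the same turn.
    """
    return prior_prefill_s / PREFILL_SPEEDUP + prior_decode_s / DECODE_SPEEDUP


# Hypothetical loop: 30 turns that each took 90 s prefill + 30 s decode before.
turns = 30
prior_min = turns * (90 + 30) / 60
m3_min = turns * m3_turn_upper_bound(90, 30) / 60
print(f"prior: {prior_min:.0f} min, M3: <= {m3_min:.0f} min")
```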

The Benchmarks — and the Asterisk

On coding and agentic tasks, M3's self-reported numbers put it in frontier company:

Benchmark	MiniMax M3	What it measures
SWE-Bench Pro	59.0%	Real-world software fixes
Terminal-Bench 2.1	66.0%	Command-line agent tasks
SWE-fficiency	34.8%	Efficient code changes
KernelBench Hard	28.8%	Low-level kernel optimization
MCP Atlas	74.2%	Tool use via MCP
BrowseComp	83.5	Web search and browsing

The standout claims: on SWE-Bench Pro, M3's 59.0% beats GPT-5.5 and Gemini 3.1 Pro and approaches Claude Opus 4.7. On BrowseComp, its 83.5 actually surpasses Opus 4.7's 79.3.

Here's the honest caveat, and MiniMax discloses it itself: several of these results come from runs on MiniMax's own infrastructure inside agent scaffolding (harnesses such as Claude Code, Mini-SWE-Agent, or Terminus). A model that looks strong on a curated leaderboard can behave very differently inside a messy internal repo. M3 also isn't yet on the neutral DeepSWE board for long-horizon software tasks.

Bottom line on benchmarks: strong enough to take seriously, not yet independent enough to bet production on. Pilot it; don't migrate blind.

Price Is the Loudest Argument

This is where M3 stops being interesting and starts being disruptive. API input pricing starts around $0.30 per million tokens, dropping to a blended ~$0.06 per million with cache optimization. Output runs roughly $1.20 per million during the launch promo.

For comparison, Claude Opus 4.7 runs about $5 per million input tokens and $25 per million output tokens. If M3's quality holds up under independent testing, that's more than 15× cheaper on input ($0.30 vs. $5) and about 20× cheaper on output at the promo rate ($1.20 vs. $25), and open-weight on top of it.
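Here's a quick Python calculator using the per-million prices quoted above. The dictionary keys are our own labels, not API model IDs; the "cached blend" row naively applies the ~$0.06 blended figure to all input; and the 40-turn session is a made-up workload. Real bills depend on cache hit rates and on whether the promo output price still applies.

```python
from decimal import Decimal

# USD per million tokens as quoted in this article (M3 output is the launch-promo rate).
# Keys are illustrative labels, not provider model identifiers.
PRICES = {
    "minimax-m3": {"input": Decimal("0.30"), "output": Decimal("1.20")},
    "minimax-m3-cached-blend": {"input": Decimal("0.06"), "output": Decimal("1.20")},
    "claude-opus-4.7": {"input": Decimal("5"), "output": Decimal("25")},
}
MILLION = Decimal(1_000_000)


def job_cost(model: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Dollar cost of a job at list per-million-token prices (Decimal avoids float drift)."""
    p = PRICES[model]
    return (p["input"] * input_tokens + p["output"] * output_tokens) / MILLION


# Hypothetical agent session: 40 turns, ~150K input and ~3K output tokens per turn.
input_tokens, output_tokens = 40 * 150_000, 40 * 3_000
for model in PRICES:
    print(f"{model:<24} ${job_cost(model, input_tokens, output_tokens):.2f}")
```

On that made-up session the uncached gap is closer to 17× than 15×, because the promo output price is discounted even more steeply than input.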

For heavier users, MiniMax sells monthly token plans: $20 Plus (~1.7B tokens), $50 Max (~5.1B tokens), and $120 Ultra (~9.8B tokens).
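To compare the plans with each other and with API pricing, the sketch below converts each one to an effective price per million tokens. It uses the approximate allowances quoted above and assumes you use the full allowance; it also assumes plan tokens are counted the same way as API tokens, which the announcement doesn't spell out.

```python
from decimal import Decimal

# Monthly plans as quoted: (price in USD, approximate tokens included).
PLANS = {
    "Plus": (Decimal("20"), 1_700_000_000),
    "Max": (Decimal("50"), 5_100_000_000),
    "Ultra": (Decimal("120"), 9_800_000_000),
}


def price_per_million(price: Decimal, tokens: int) -> Decimal:
    """Effective USD per million tokens if the whole allowance is consumed."""
    return price / (Decimal(tokens) / Decimal(1_000_000))


for name, (price, tokens) in PLANS.items():
    print(f"{name:<5} ${price_per_million(price, tokens):.4f} per million tokens")
```

On these approximate figures every plan lands around a cent per million tokens, and Max works out slightly cheaper per token than either Plus or Ultra.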

One Honest Limitation: "Open" Has Fine Print

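Don't confuse open-weight with open-source. Open-weight means you can download and run the model files. Open-source, strictly, also means the license lets anyone use, modify, and redistribute the model, including commercially, without restricting who may use it or for what purpose.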

MiniMax's earlier M2 shipped under a modified-MIT license, but the M2.7 license restricts commercial use without prior written authorization. If M3 follows that precedent, expect downloadable weights with a non-commercial default and enterprise licensing sold separately. Until the weights and technical report land, this is unsettled, and it matters a lot if you're planning to build a product on top.

The Bottom Line

If your workload genuinely needs 1M-token context (whole-codebase analysis, multi-document research agents, long-running session memory), MiniMax M3 is the model to test first, because it's the one engineered to make that context affordable. For standard coding pipelines, pilot it against your current model, but wait for independent benchmarks and the final license before committing.

The bigger signal is strategic: Chinese open-weight labs keep widening the cost advantage, and now they're doing it with architecture the closed labs haven't publicly matched. The premium on Western flagships gets harder to defend with every launch like this one.
