For the past year, Mistral's lineup looked like a French menu: a separate dish for chat, another for reasoning, a third for code. Medium 3.5 changes that. Released April 29, 2026, it's a dense 128-billion-parameter open-weight model that folds instruction-following, reasoning, and coding into a single set of weights — and it's the engine behind a new feature that may matter more than the benchmark numbers: remote coding agents in Vibe that run in the cloud while you go do something else.
This isn't another incremental "we beat the previous model on MMLU" release. It's Mistral consolidating its strategy around one merged flagship and shipping the first agent harness that genuinely lets you offload work — including opening pull requests on your behalf.
One model, three jobs
Mistral's previous generation forced a choice: pick Devstral 2 for code, Magistral for reasoning, or Medium 3 for chat. Each was good at its thing and mediocre at the others. The new Medium 3.5 is what Mistral calls a merged model — instruction-following, chain-of-thought reasoning, and coding all share the same parameters, with reasoning effort configurable per request.
That last detail matters. The same model can fire off a one-line answer to "what's the capital of Peru" without burning thinking tokens, then chew on a complex agentic refactor with the dial turned up. No router, no model swap, no separate billing line. Mistral also rebuilt the vision encoder from scratch to handle variable image sizes and aspect ratios, so the model isn't blind when an agent needs to read a screenshot or a chart.
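In API terms, that dial is just a request field. Mistral hasn't published the exact parameter in this announcement, so the model identifier and the `reasoning_effort` knob in this Python sketch are assumptions, not documented API:

```python
import os
import requests

API_URL = "https://api.mistral.ai/v1/chat/completions"
HEADERS = {"Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}"}

def ask(prompt: str, effort: str) -> str:
    # "mistral-medium-3.5" and "reasoning_effort" are assumed names;
    # check the model card for the real identifier and parameter.
    resp = requests.post(API_URL, headers=HEADERS, json={
        "model": "mistral-medium-3.5",
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": effort,  # hypothetical dial, e.g. "low" | "high"
    })
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# Cheap lookup: no thinking tokens burned.
print(ask("What's the capital of Peru?", effort="low"))
# Hard agentic task: turn the dial up.
print(ask("Plan a safe refactor of this module into three packages.", effort="high"))
```

The point is that the routing happens inside one request to one model, not across a zoo of variants.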
The numbers that matter
Here's where Medium 3.5 lands on the boards Mistral cares about most:
| Metric | Result | What it measures |
|---|---|---|
| SWE-Bench Verified | 77.6% | Real GitHub issue resolution |
| τ³-Telecom | 91.4 | Agentic tool-using tasks |
| Context window | 256k tokens | Long-document and codebase work |
The 77.6% on SWE-Bench Verified is the headline. For context, Gemini 3.1 Pro sits at 80.6% on the same benchmark — meaning Mistral's open-weight model is within striking distance of Google's closed flagship on the most-watched coding board. It clears Devstral 2 and beats Qwen3.5 397B A17B despite using a third of the total parameter budget.
Pricing through the API runs $1.50 per million input tokens and $7.50 per million output tokens, which makes it a credible default choice for high-volume agent runs where token spend stacks up fast.
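The arithmetic is worth doing once. Here's a back-of-the-envelope cost for a single long agent run at those list prices; the token counts are invented for illustration:

```python
# List prices for Medium 3.5 via the API.
INPUT_PER_M = 1.50    # dollars per million input tokens
OUTPUT_PER_M = 7.50   # dollars per million output tokens

# Assumed workload: an agent re-reads lots of context across many steps
# and emits comparatively few tokens of diffs and tool calls.
input_tokens = 40_000_000
output_tokens = 3_000_000

cost = (input_tokens / 1e6) * INPUT_PER_M + (output_tokens / 1e6) * OUTPUT_PER_M
print(f"${cost:,.2f}")  # 40 * $1.50 + 3 * $7.50 = $82.50
```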
Open weights, four GPUs
The model performs strongly in real-world use, with self-hosting possible on as few as four GPUs.
That self-hosting claim is the part open-source teams will care about most. 128B dense at 256k context running on four GPUs is the difference between "we should look at this on a benchmark page" and "we can actually deploy this in our VPC next quarter." Weights are on Hugging Face under a modified MIT license, and NVIDIA hosts the model on build.nvidia.com along with a containerized NIM microservice for teams that want managed inference without surrendering data to a third party.
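If you deploy with vLLM, tensor parallelism is the standard way to spread a dense model across four cards. A minimal sketch, assuming a Hugging Face repo name that may differ from the real one:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mistral-Medium-3.5",  # assumed repo name; check Hugging Face
    tensor_parallel_size=4,                # shard the 128B dense weights over 4 GPUs
    max_model_len=262144,                  # the full 256k context window
)

params = SamplingParams(max_tokens=512, temperature=0.2)
outputs = llm.generate(["Summarize the failing tests in this log: ..."], params)
print(outputs[0].outputs[0].text)
```

One caveat: fitting 128B dense weights plus a 256k KV cache on four cards likely implies FP8-class quantization or 80GB-plus GPUs, so check Mistral's deployment notes before budgeting hardware.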
Compared to last year's pattern of releasing capped or research-only weights, the modified MIT license is unusually permissive for a frontier-tier model — closer to genuine open source than the "open-ish" hedges that dominated 2025.
Vibe remote agents — the actual story
Forget the benchmark for a moment. The thing Mistral is really shipping is async cloud coding sessions that run in the background, in parallel, and notify you when they're done.
You start a remote agent from the Mistral Vibe CLI or from inside Le Chat, describe a task — "upgrade this monorepo to React 19, fix the failing tests, open a PR" — and the agent runs in an isolated sandbox while you close the laptop. When it's finished, it can:
- Open a pull request on GitHub with file diffs and a commit history
- File issues in Linear or Jira for follow-ups it discovered
- Surface incidents from Sentry that the change might have touched
- Notify the channel of record on Slack or Teams
Local CLI sessions can be teleported to the cloud mid-task — a small touch, but anyone who has watched a long-running agent get killed by a closing laptop lid will recognize the value. Each session sits in its own sandbox, so broad edits and dependency installs don't leak into other runs.
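Mistral hasn't documented the remote-session API in this announcement, so the endpoints, field names, and statuses below are invented purely to show the fire-and-forget shape of the workflow:

```python
import os
import time
import requests

# Everything here (the /v1/agents routes, fields, statuses) is hypothetical,
# sketched to show the shape of a remote Vibe session. See Mistral's docs
# for the real interface.
BASE = "https://api.mistral.ai/v1/agents"  # assumed endpoint
HEADERS = {"Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}"}

session = requests.post(f"{BASE}/sessions", headers=HEADERS, json={
    "repo": "github.com/acme/monorepo",
    "task": "Upgrade to React 19, fix the failing tests, open a PR",
}).json()

# The session runs in its own cloud sandbox; poll it (or wire a webhook)
# instead of keeping a terminal open.
while True:
    state = requests.get(f"{BASE}/sessions/{session['id']}", headers=HEADERS).json()
    if state["status"] in ("completed", "failed"):
        break
    time.sleep(60)

print(state.get("pull_request_url"))
```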
This is the bet: that the bottleneck on agent productivity is the human, not the model. If you can run ten agents in parallel and review the diffs at your own pace, you stop being the rate limiter on each keystroke.
Work mode: the assistant grows hands
The same harness powers a new Work mode in Le Chat, where the agent becomes the execution backend for the assistant itself. In Work mode, connectors are on by default rather than chosen manually, which means the agent can reach into mailboxes, calendars, documents, and external tools without you wiring each one up.
Practical examples Mistral cites:
- Catch up across email, messages, and calendar in a single run, then prepare for a meeting with attendee context and talking points.
- Dive into a topic across web, internal docs, and connected tools, then produce a structured brief you can edit before sending.
- Triage your inbox and draft replies; create issues in Jira from team discussions; send a summary to Slack.
Every action is visible — tool calls and reasoning are exposed — and Le Chat asks for explicit approval before sensitive actions like sending messages or modifying data. That permission boundary is the right call. The unspoken design decision in most agent products is "trust by default"; Work mode draws the line in a more sensible place.
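Mistral hasn't shown Work mode's internals, but the approval boundary it describes maps onto a small, familiar pattern. This sketch is illustrative only, not Le Chat's code:

```python
from dataclasses import dataclass
from typing import Callable

# Illustrative only: a minimal approval gate in the spirit of Work mode.
# Read-only tools run silently; anything that sends or mutates must be
# surfaced to the user and explicitly approved first.
SENSITIVE = {"send_email", "post_slack", "modify_document"}

@dataclass
class ToolCall:
    name: str
    args: dict

def run_tool(call: ToolCall, tools: dict[str, Callable],
             approve: Callable[[ToolCall], bool]) -> dict:
    if call.name in SENSITIVE and not approve(call):
        return {"status": "denied", "tool": call.name}
    return {"status": "ok", "result": tools[call.name](**call.args)}

tools = {
    "search_inbox": lambda query: f"3 threads match {query!r}",
    "send_email": lambda to, body: f"sent to {to}",
}

# Reading the inbox needs no prompt; sending mail is blocked until the
# user says yes (here the approve callback simply refuses).
print(run_tool(ToolCall("search_inbox", {"query": "Q3 budget"}), tools, lambda c: False))
print(run_tool(ToolCall("send_email", {"to": "pm@acme.com", "body": "draft"}), tools, lambda c: False))
```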
Where this leaves Mistral's competitors
Three things stand out from this release. First, the merged model strategy is a clean shot at OpenAI and Anthropic, both of which still maintain a confusing menu of model variants for different jobs. One model with a reasoning dial is easier to reason about, easier to bill for, and easier to deploy.
Second, the open-weights + cloud-agent combo gives Mistral a story neither closed lab can match: download the weights and run them locally, or use the hosted Vibe agents — same brain, different deployment. That optionality is rare at frontier scale.
Third, the 77.6% SWE-Bench score on a 128B dense model punctures the assumption that you need a 400B+ MoE behemoth to compete on agentic coding. It opens the door for self-hosted setups that were previously stuck reaching for the API.
The bottom line
Mistral Medium 3.5 isn't trying to win every benchmark — it's trying to be the model that an agent harness can actually rely on for an eight-hour cloud-coding session, and to be deployable on hardware a serious team already owns. 77.6% on SWE-Bench Verified, 256k context, four-GPU self-hosting, and a remote agent that opens its own PRs add up to a coherent product, not a list of model-card stats. If you've been waiting for the "open-weight frontier" pitch to start meaning something practical for production work, this is the release that delivers it.