Cohere just open-sourced Transcribe, a 2-billion-parameter speech recognition model that sits at the top of the HuggingFace Open ASR Leaderboard, released under an Apache 2.0 license. For developers who have been building on Whisper and tolerating its word error rates, this is the upgrade you did not know was coming.
What Cohere Transcribe Actually Is
Transcribe is a dedicated audio-in, text-out ASR model. It is not a general-purpose voice assistant or a multimodal foundation model moonlighting as a transcriber. Cohere built it from the ground up for one job: turning speech into accurate text, fast.
The model uses an encoder-decoder transformer with cross-attention, built around a Fast-Conformer encoder that holds more than 90% of the model's 2 billion parameters. The decoder is intentionally lightweight, under 10% of total parameters, which keeps autoregressive inference compute to a minimum. Translation: it generates text fast because the heavy lifting happens in parallel during encoding, not sequentially during decoding.
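To see why that split matters, here is a back-of-the-envelope cost model. The parameter figures are assumed round numbers consistent with the ">90% encoder, <10% decoder" description, not Cohere's published breakdown; the point is that the encoder's cost is paid once per clip while the decoder's is paid once per generated token.

```python
# Assumed split: round numbers matching ">90% encoder, <10% decoder" of ~2B.
ENC_PARAMS = 1.8e9
DEC_PARAMS = 0.2e9

def inference_flops(n_tokens: int) -> tuple[float, float]:
    """Rough FLOPs ~ 2 * params per forward pass (a common estimate).
    The encoder runs ONCE per clip; the decoder runs once PER token."""
    encoder_cost = 2 * ENC_PARAMS                 # one parallel pass
    decoder_cost = 2 * DEC_PARAMS * n_tokens      # n_tokens serial steps
    return encoder_cost, decoder_cost

enc_cost, dec_cost = inference_flops(n_tokens=100)
print(f"encoder, one parallel pass: {enc_cost:.2e} FLOPs")
print(f"decoder, 100 serial steps:  {dec_cost:.2e} FLOPs "
      f"({2 * DEC_PARAMS:.2e} per step)")
```

Each serial decoding step touches only the small decoder, roughly 9x cheaper than pushing a token through the full model would be, which is what keeps latency low.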
Training data: 500,000 hours of curated audio-transcript pairs, augmented with synthetic data including non-speech background noise at signal-to-noise ratios between 0 and 30 dB.
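Mixing noise into clean speech at a controlled signal-to-noise ratio is a standard augmentation recipe. A minimal NumPy sketch of the idea (generic, not Cohere's actual pipeline):

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the mixture has the requested SNR in dB."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # SNR(dB) = 10 * log10(speech_power / (scale^2 * noise_power))
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s tone @ 16 kHz
noise = rng.standard_normal(16000)
noisy = mix_at_snr(speech, noise, snr_db=10.0)  # anywhere in the 0-30 dB range
```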
Benchmark Performance
The numbers speak clearly (all figures are WER in percent, lower is better; LS = LibriSpeech):

| Model | Avg | AMI | Earnings22 | GigaSpeech | LS Clean | LS Other |
|---|---|---|---|---|---|---|
| Cohere Transcribe | 5.42 | 8.15 | 10.84 | 9.33 | 1.25 | 2.37 |
| Zoom Scribe v1 | 5.47 | 10.03 | 9.53 | 9.61 | 1.63 | 2.81 |
| IBM Granite 4.0 1B | 5.52 | 8.44 | 8.48 | 10.14 | 1.42 | 2.85 |
| OpenAI Whisper Large v3 | 7.44 | 15.95 | 11.29 | 10.02 | 2.01 | 3.91 |
An average word error rate of 5.42% puts it at #1 on the Open ASR Leaderboard, ahead of Zoom Scribe v1 (5.47%) and IBM Granite 4.0 1B (5.52%), and, most notably, about 27% below OpenAI Whisper Large v3 (7.44%) in relative WER.
On LibriSpeech Clean, the gold-standard benchmark for clear English speech, Transcribe hits a 1.25% WER. That is approaching human-level accuracy for studio-quality audio.
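For reference, WER, the metric behind all of these leaderboard numbers, is just word-level edit distance divided by the number of reference words. A minimal implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words (Levenshtein).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deletion / 6 words
```

Production evaluations normalize text first (casing, punctuation, number formats), which is why leaderboard numbers are only comparable under a shared normalizer.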
14 Languages, Not Just English
Transcribe supports 14 languages out of the box: English, German, French, Italian, Spanish, Portuguese, Greek, Dutch, Polish, Arabic, Vietnamese, Mandarin Chinese, Japanese, and Korean.
The language selection is pragmatic — these cover the vast majority of enterprise transcription demand. The model expects monolingual audio with a language tag, which means it is not designed for code-switched conversations (e.g., a speaker alternating between English and Spanish mid-sentence). For most production use cases, this is a non-issue.
Speed: 3x Faster Than Competitors
Raw accuracy is only half the story. Cohere reports a real-time factor (RTFx) up to 3x higher than similarly sized ASR models; in other words, it gets through the same audio in roughly a third of the time those models need.
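RTFx itself is a simple ratio: seconds of audio transcribed per second of wall-clock compute.

```python
def rtfx(audio_seconds: float, wall_seconds: float) -> float:
    """Real-time factor: seconds of audio transcribed per second of compute.
    RTFx > 1 means faster than real time; higher is better."""
    return audio_seconds / wall_seconds

# A 60-second clip transcribed in 0.5 s of wall-clock time:
print(rtfx(60.0, 0.5))  # 120.0
```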
The team also contributed optimizations to vLLM (via PR #38120) that enable:
- Variable-length audio input support
- Fine-grained concurrent encoder execution
- Packed representations for FlashAttention-based decoding
- Minimally-padded convolutional encoder batches
These optimizations deliver a further 2x throughput improvement for production inference, for roughly 6x the throughput of comparably sized models under optimal conditions.
How to Get Started
The model is available on HuggingFace right now:

```shell
pip install transformers torch
```

```python
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

model_id = "CohereLabs/cohere-transcribe-03-2026"
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)
```
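From there, inference presumably follows the standard Transformers speech-seq2seq flow (processor call, `generate`, `batch_decode`, as with Whisper); the helper below assumes that API, so check the model card for the exact signature before relying on it.

```python
def transcribe(model, processor, audio, sampling_rate: int = 16000) -> str:
    """Transcribe one clip, assuming the standard Transformers
    speech-seq2seq flow: processor -> generate -> batch_decode."""
    inputs = processor(audio, sampling_rate=sampling_rate, return_tensors="pt")
    generated_ids = model.generate(**inputs)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Usage with the `model` and `processor` loaded above, where `waveform`
# is a mono float array (16 kHz assumed here):
#   text = transcribe(model, processor, waveform)
```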
For production workloads, Cohere offers the model through their API for free (with rate limits) at dashboard.cohere.com, and through Model Vault for dedicated infrastructure with per-hour pricing.
There is also a HuggingFace Space for quick testing without writing any code: CohereLabs/cohere-transcribe-03-2026.
Known Limitations
Two caveats worth noting:
**No code-switching support.** The model expects monolingual audio. If your use case involves speakers switching between languages mid-utterance, you will need a different solution or a preprocessing pipeline that segments by language.

**Hallucination on non-speech sounds.** Like most ASR models, Transcribe can hallucinate text when processing silence, music, or ambient noise. Cohere recommends running a Voice Activity Detection (VAD) or noise gate as a preprocessing step to filter non-speech segments before feeding audio to the model.
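As a stand-in for a real VAD, even a crude energy-based noise gate catches the worst offenders (long stretches of silence). A sketch; production systems typically use a trained VAD such as Silero VAD or WebRTC VAD instead:

```python
import numpy as np

def energy_gate(audio: np.ndarray, sample_rate: int = 16000,
                frame_ms: int = 30, threshold_db: float = -40.0) -> np.ndarray:
    """Keep only frames whose RMS energy exceeds a dB threshold.
    Crude: drops quiet speech too; a trained VAD is far more robust."""
    frame_len = sample_rate * frame_ms // 1000
    kept = []
    for i in range(len(audio) // frame_len):
        frame = audio[i * frame_len:(i + 1) * frame_len]
        rms_db = 10 * np.log10(np.mean(frame ** 2) + 1e-12)
        if rms_db > threshold_db:
            kept.append(frame)
    return np.concatenate(kept) if kept else np.zeros(0, dtype=audio.dtype)
```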
Why This Matters
The ASR landscape has been dominated by OpenAI Whisper since its release, largely because nothing else combined open weights with competitive accuracy. Cohere Transcribe changes that equation: it is about 27% more accurate in relative WER, up to 3x faster, supports 14 languages, and ships under the same Apache 2.0 license that makes Whisper easy to deploy.
For teams building meeting transcription, call center analytics, podcast indexing, or accessibility tools, this is a drop-in upgrade. The architecture is efficient enough to run on consumer-grade hardware, and Cohere's vLLM optimizations mean you can serve it at scale without exotic infrastructure.
The Bottom Line
Cohere Transcribe is the new default recommendation for open-source speech recognition. It beats Whisper Large v3 on accuracy by a wide margin, runs significantly faster, and costs nothing to use under Apache 2.0. The only real gaps, code-switching and non-speech hallucination, are well-documented and manageable with standard preprocessing. If you are still building on Whisper, it is time to benchmark Transcribe against your workload.


