Cohere just open-sourced Transcribe, a 2-billion-parameter speech recognition model that sits at the top of the HuggingFace Open ASR Leaderboard, released under an Apache 2.0 license. For developers who have been building on Whisper and tolerating its word error rates, this is the upgrade you did not know was coming.
What Cohere Transcribe Actually Is
Transcribe is a dedicated audio-in, text-out ASR model. It is not a general-purpose voice assistant or a multimodal foundation model moonlighting as a transcriber. Cohere built it from the ground up for one job: turning speech into accurate text, fast.
The model uses an encoder-decoder transformer with cross-attention, built around a Fast-Conformer encoder that holds more than 90% of the model's 2 billion parameters. The decoder is intentionally lightweight, under 10% of total parameters, which keeps autoregressive inference compute to a minimum. Translation: it generates text fast because the heavy lifting happens in parallel during encoding, not sequentially during decoding.
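To see why that split matters, here is a back-of-the-envelope cost model. The parameter figures are assumed round numbers consistent with the ">90% encoder, <10% decoder" description, not Cohere's published breakdown; the point is that the encoder's cost is paid once per clip while the decoder's is paid once per generated token.

```python
# Assumed split: round numbers matching ">90% encoder, <10% decoder" of ~2B.
ENC_PARAMS = 1.8e9
DEC_PARAMS = 0.2e9

def inference_flops(n_tokens: int) -> tuple[float, float]:
    """Rough FLOPs ~ 2 * params per forward pass (a common estimate).
    The encoder runs ONCE per clip; the decoder runs once PER token."""
    encoder_cost = 2 * ENC_PARAMS                 # one parallel pass
    decoder_cost = 2 * DEC_PARAMS * n_tokens      # n_tokens serial steps
    return encoder_cost, decoder_cost

enc_cost, dec_cost = inference_flops(n_tokens=100)
print(f"encoder, one parallel pass: {enc_cost:.2e} FLOPs")
print(f"decoder, 100 serial steps:  {dec_cost:.2e} FLOPs "
      f"({2 * DEC_PARAMS:.2e} per step)")
```

Each serial decoding step touches only the small decoder, roughly 9x cheaper than pushing a token through the full model would be, which is what keeps latency low.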
Training data: 500,000 hours of curated audio-transcript pairs, augmented with synthetic data including non-speech background noise at signal-to-noise ratios between 0 and 30 dB.
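Mixing noise into clean speech at a controlled signal-to-noise ratio is a standard augmentation recipe. A minimal NumPy sketch of the idea (generic, not Cohere's actual pipeline):

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the mixture has the requested SNR in dB."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # SNR(dB) = 10 * log10(speech_power / (scale^2 * noise_power))
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s tone @ 16 kHz
noise = rng.standard_normal(16000)
noisy = mix_at_snr(speech, noise, snr_db=10.0)  # anywhere in the 0-30 dB range
```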
Benchmark Performance
The numbers speak clearly (all figures are WER in percent, lower is better; LS = LibriSpeech):

| Model | Avg | AMI | Earnings22 | GigaSpeech | LS Clean | LS Other |
|---|---|---|---|---|---|---|
| Cohere Transcribe | 5.42 | 8.15 | 10.84 | 9.33 | 1.25 | 2.37 |
| Zoom Scribe v1 | 5.47 | 10.03 | 9.53 | 9.61 | 1.63 | 2.81 |
| IBM Granite 4.0 1B | 5.52 | 8.44 | 8.48 | 10.14 | 1.42 | 2.85 |
| OpenAI Whisper Large v3 | 7.44 | 15.95 | 11.29 | 10.02 | 2.01 | 3.91 |
An average word error rate of 5.42% puts it at #1 on the Open ASR Leaderboard, ahead of Zoom Scribe v1 (5.47%) and IBM Granite 4.0 1B (5.52%), and, most notably, about 27% below OpenAI Whisper Large v3 (7.44%) in relative WER.
On LibriSpeech Clean, the gold-standard benchmark for clear English speech, Transcribe hits a 1.25% WER. That is approaching human-level accuracy for studio-quality audio.
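For reference, WER, the metric behind all of these leaderboard numbers, is just word-level edit distance divided by the number of reference words. A minimal implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words (Levenshtein).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deletion / 6 words
```

Production evaluations normalize text first (casing, punctuation, number formats), which is why leaderboard numbers are only comparable under a shared normalizer.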
14 Languages, Not Just English
Transcribe supports 14 languages out of the box: English, German, French, Italian, Spanish, Portuguese, Greek, Dutch, Polish, Arabic, Vietnamese, Mandarin Chinese, Japanese, and Korean.
The language selection is pragmatic — these cover the vast majority of enterprise transcription demand. The model expects monolingual audio with a language tag, which means it is not designed for code-switched conversations (e.g., a speaker alternating between English and Spanish mid-sentence). For most production use cases, this is a non-issue.
Speed: 3x Faster Than Competitors
Raw accuracy is only half the story. Cohere reports a real-time factor (RTFx) up to 3x higher than similarly sized ASR models; in other words, it gets through the same audio in roughly a third of the time those models need.
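RTFx itself is a simple ratio: seconds of audio transcribed per second of wall-clock compute.

```python
def rtfx(audio_seconds: float, wall_seconds: float) -> float:
    """Real-time factor: seconds of audio transcribed per second of compute.
    RTFx > 1 means faster than real time; higher is better."""
    return audio_seconds / wall_seconds

# A 60-second clip transcribed in 0.5 s of wall-clock time:
print(rtfx(60.0, 0.5))  # 120.0
```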
The team also contributed optimizations to vLLM (via PR #38120) that enable:
- Variable-length audio input support
- Fine-grained concurrent encoder execution
- Packed representations for FlashAttention-based decoding
- Minimally-padded convolutional encoder batches
These optimizations deliver a further 2x throughput improvement for production inference, for roughly 6x the throughput of comparably sized models under optimal conditions.
How to Get Started
The model is available on HuggingFace right now:

```shell
pip install transformers torch
```

```python
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

model_id = "CohereLabs/cohere-transcribe-03-2026"
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)
```
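From there, inference presumably follows the standard Transformers speech-seq2seq flow (processor call, `generate`, `batch_decode`, as with Whisper); the helper below assumes that API, so check the model card for the exact signature before relying on it.

```python
def transcribe(model, processor, audio, sampling_rate: int = 16000) -> str:
    """Transcribe one clip, assuming the standard Transformers
    speech-seq2seq flow: processor -> generate -> batch_decode."""
    inputs = processor(audio, sampling_rate=sampling_rate, return_tensors="pt")
    generated_ids = model.generate(**inputs)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Usage with the `model` and `processor` loaded above, where `waveform`
# is a mono float array (16 kHz assumed here):
#   text = transcribe(model, processor, waveform)
```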
For production workloads, Cohere offers the model through their API for free (with rate limits) at dashboard.cohere.com, and through Model Vault for dedicated infrastructure with per-hour pricing.
There is also a HuggingFace Space for quick testing without writing any code: CohereLabs/cohere-transcribe-03-2026.
Known Limitations
Two caveats worth noting:
**No code-switching support.** The model expects monolingual audio. If your use case involves speakers switching between languages mid-utterance, you will need a different solution or a preprocessing pipeline that segments by language.

**Hallucination on non-speech sounds.** Like most ASR models, Transcribe can hallucinate text when processing silence, music, or ambient noise. Cohere recommends running a Voice Activity Detection (VAD) or noise gate as a preprocessing step to filter non-speech segments before feeding audio to the model.
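As a stand-in for a real VAD, even a crude energy-based noise gate catches the worst offenders (long stretches of silence). A sketch; production systems typically use a trained VAD such as Silero VAD or WebRTC VAD instead:

```python
import numpy as np

def energy_gate(audio: np.ndarray, sample_rate: int = 16000,
                frame_ms: int = 30, threshold_db: float = -40.0) -> np.ndarray:
    """Keep only frames whose RMS energy exceeds a dB threshold.
    Crude: drops quiet speech too; a trained VAD is far more robust."""
    frame_len = sample_rate * frame_ms // 1000
    kept = []
    for i in range(len(audio) // frame_len):
        frame = audio[i * frame_len:(i + 1) * frame_len]
        rms_db = 10 * np.log10(np.mean(frame ** 2) + 1e-12)
        if rms_db > threshold_db:
            kept.append(frame)
    return np.concatenate(kept) if kept else np.zeros(0, dtype=audio.dtype)
```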
Why This Matters
The ASR landscape has been dominated by OpenAI Whisper since its release, largely because nothing else combined open weights with competitive accuracy. Cohere Transcribe changes that equation: it is about 27% more accurate in relative WER, up to 3x faster, supports 14 languages, and ships under the same Apache 2.0 license that makes Whisper easy to deploy.
For teams building meeting transcription, call center analytics, podcast indexing, or accessibility tools, this is a drop-in upgrade. The architecture is efficient enough to run on consumer-grade hardware, and Cohere's vLLM optimizations mean you can serve it at scale without exotic infrastructure.
The Bottom Line
Cohere Transcribe is the new default recommendation for open-source speech recognition. It beats Whisper Large v3 on accuracy by a wide margin, runs significantly faster, and costs nothing to use under Apache 2.0. The only real gaps, code-switching and non-speech hallucination, are well-documented and manageable with standard preprocessing. If you are still building on Whisper, it is time to benchmark Transcribe against your workload.


