Google just dropped Gemma 4, and the open-source AI landscape will never look the same. This family of four models — spanning pocket-sized edge devices to full-blown data-center workloads — doesn't just compete with proprietary giants. It embarrasses quite a few of them.
Four Models, One Mission
Gemma 4 ships in four variants, each targeting a different sweet spot:
| Model | Total Params | Active Params | Context Window | Best For |
|---|---|---|---|---|
| E2B | 5.1B | 2.3B | 128K tokens | Mobile, IoT, edge |
| E4B | 8B | 4.5B | 128K tokens | On-device assistants |
| 26B A4B (MoE) | 25.2B | 3.8B | 256K tokens | Efficient reasoning |
| 31B Dense | 30.7B | 30.7B | 256K tokens | Maximum capability |
The standout engineering trick is effective parameters: the E2B model contains 5.1 billion total parameters but activates only 2.3 billion during inference, delivering performance well above its weight class while sipping power on a Raspberry Pi.
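A quick back-of-the-envelope shows why that matters on-device (the 4-bit quantization here is an illustrative assumption, not a published spec):

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory needed to hold model weights
    (ignores KV cache, activations, and runtime overhead)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# 2.3B active params at a hypothetical 4-bit quantization:
print(round(weight_memory_gb(2.3, 4), 2))   # 1.15 (GB)
# versus keeping all 5.1B parameters hot:
print(round(weight_memory_gb(5.1, 4), 2))   # 2.55 (GB)
```

At roughly a gigabyte of resident weights, the active set fits alongside a mobile OS rather than fighting it for RAM.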
Benchmarks That Matter
The 31B flagship currently sits at #3 on the Arena AI text leaderboard with an Elo score of 1452 — up from Gemma 3's 1365. The 26B MoE variant claims the #6 spot while activating just 3.8 billion parameters per forward pass.
Here's how the family stacks up on key benchmarks:
| Benchmark | 31B | 26B A4B | E4B | E2B |
|---|---|---|---|---|
| MMLU Pro | 85.2% | 82.6% | 69.4% | 60.0% |
| AIME 2026 (math) | 89.2% | 88.3% | 42.5% | 37.5% |
| Codeforces Elo | 2150 | 1718 | 940 | 633 |
| GPQA Diamond | 84.3% | 82.3% | 58.6% | 43.4% |
That 89.2% on AIME 2026 from the 31B model is remarkable for an open-weight model. The 26B MoE hitting 88.3% on the same benchmark with a fraction of active compute is arguably even more impressive.
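Gemma 4's exact routing scheme isn't documented here, but a generic top-k mixture-of-experts sketch shows how a model can hold 25B parameters while computing with only a few billion per token. Expert count, gate logits, and the scalar "experts" below are toy assumptions, not the real architecture:

```python
import math

def top_k_route(gate_logits, k):
    """Keep the k highest-scoring experts and softmax-normalize their weights."""
    topk = sorted(range(len(gate_logits)), key=gate_logits.__getitem__, reverse=True)[:k]
    exps = [math.exp(gate_logits[i]) for i in topk]
    total = sum(exps)
    return {i: e / total for i, e in zip(topk, exps)}

def moe_forward(x, experts, gate_logits, k):
    """Only the k routed experts execute; the rest cost nothing this token."""
    weights = top_k_route(gate_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())

# Toy layer: 8 scalar "experts", only 2 run per input -- 25% of the FLOPs
experts = [lambda x, m=m: m * x for m in range(1, 9)]
out = moe_forward(2.0, experts, gate_logits=[0.1, 0.9, 0.2, 0.0, 0.4, 0.3, 0.2, 0.1], k=2)
```

The unrouted experts still occupy memory, which is why the 26B MoE saves compute per token rather than total footprint.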
Truly Multimodal — Including Audio on Edge
Every Gemma 4 model handles text and images natively. But the E2B and E4B edge models go further with native audio and video processing — meaning you can build a completely offline, multimodal assistant that runs on a phone with near-zero latency.
The architecture uses a hybrid attention mechanism that interleaves local sliding-window attention (512 tokens for smaller models, 1024 for larger ones) with full global attention. This keeps memory usage manageable during long-context inference while still capturing distant dependencies.
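A toy sketch of that interleaving, assuming (purely for illustration) that every fourth layer is global and using a tiny window in place of the real 512/1024 tokens:

```python
def attention_mask(seq_len, layer_idx, window=3, global_every=4):
    """True = query position q may attend to key position k. Every
    `global_every`-th layer is fully causal-global; the rest use a
    causal sliding window (toy version of the hybrid scheme)."""
    is_global = layer_idx % global_every == global_every - 1
    return [[k <= q and (is_global or q - k < window) for k in range(seq_len)]
            for q in range(seq_len)]

local_layer = attention_mask(6, layer_idx=0, window=3)   # sliding-window layer
global_layer = attention_mask(6, layer_idx=3)            # full-attention layer
```

Because the sliding-window layers only cache `window` keys per position, the KV cache stops growing linearly with context on most layers, which is what keeps 256K-token inference tractable.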
Agentic by Design
Gemma 4 introduces native function calling across all variants, making it straightforward to build autonomous agents that can navigate apps, call APIs, and chain tool use without external orchestration layers. Combined with configurable thinking modes — where you can toggle the model's internal reasoning on or off — developers get fine-grained control over the speed-accuracy tradeoff.
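The post doesn't show Gemma 4's actual function-calling schema, so here is a minimal, generic tool-use loop with a stub standing in for the model; the tool name, message format, and stub behavior are all illustrative assumptions:

```python
import json

# Hypothetical tool registry -- names and signatures are made up for this sketch.
TOOLS = {"get_weather": lambda city: {"city": city, "temp_c": 21}}

def run_agent(model, prompt):
    """Generic tool-use loop: call the model, execute any requested tool,
    feed the result back, and repeat until the model returns plain text."""
    messages = [{"role": "user", "content": prompt}]
    while True:
        reply = model(messages)
        call = reply.get("tool_call")
        if call is None:
            return reply["content"]
        result = TOOLS[call["name"]](**call["arguments"])
        messages.append({"role": "tool", "content": json.dumps(result)})

def stub_model(messages):
    """Stand-in model: requests one tool call, then answers from its result."""
    if messages[-1]["role"] == "tool":
        temp = json.loads(messages[-1]["content"])["temp_c"]
        return {"content": f"It is {temp}°C."}
    return {"tool_call": {"name": "get_weather", "arguments": {"city": "Zurich"}},
            "content": None}

print(run_agent(stub_model, "Weather in Zurich?"))  # It is 21°C.
```

With native function calling, the loop above is essentially all the orchestration an agent needs; the model itself decides when to emit a tool call versus a final answer.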
140 Languages, Consumer Hardware
The entire family supports over 140 languages with cultural context understanding, making it one of the most linguistically diverse open model families available. Model weights are downloadable from Hugging Face, Ollama, Kaggle, LM Studio, and Docker, with deployment support through JAX, Keras, Vertex AI, and Google Kubernetes Engine.
The 31B model runs on a single consumer GPU. The edge models fit comfortably on mobile devices — Google's own AI Edge Gallery app, which lets users run Gemma 4 locally on Android and iOS, recently broke into the App Store top 10.
The Bottom Line
Gemma 4 isn't just an incremental update. It's Google DeepMind's clearest statement yet that frontier-class reasoning belongs in the open. The 26B MoE model delivering near-31B performance with 3.8B active parameters makes the efficiency argument impossible to ignore, and native audio on edge models opens use cases that simply didn't exist before. And it all ships under the permissive Apache 2.0 license.


