Google Gemma 4: Four Open Models That Punch Above Their Weight

Open Source · 4 min read

Marcus Rivera
Apr 8, 2026

Google just dropped Gemma 4, and the open-source AI landscape will never look the same. This family of four models — spanning pocket-sized edge devices to full-blown data-center workloads — doesn't just compete with proprietary giants. It embarrasses quite a few of them.

Four Models, One Mission

Gemma 4 ships in four variants, each targeting a different sweet spot:

| Model | Total Params | Active Params | Context Window | Best For |
|---|---|---|---|---|
| E2B | 5.1B | 2.3B | 128K tokens | Mobile, IoT, edge |
| E4B | 8B | 4.5B | 128K tokens | On-device assistants |
| 26B A4B (MoE) | 25.2B | 3.8B | 256K tokens | Efficient reasoning |
| 31B Dense | 30.7B | 30.7B | 256K tokens | Maximum capability |

The standout engineering trick is the effective-parameter design. The E2B model contains 5.1 billion total parameters but activates only 2.3 billion during inference, delivering performance that punches well above its weight class while sipping power on a Raspberry Pi.
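The core idea behind activating only a fraction of stored parameters is mixture-of-experts routing: a small router picks which expert weight matrices each token runs through, so the rest sit idle. Here's a toy NumPy sketch of that mechanic — the dimensions, top-2 routing, and averaging are illustrative assumptions, not Gemma 4's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 64, 8, 2

# Toy MoE layer: 8 expert weight matrices plus a linear router.
# (Sizes are illustrative, not Gemma 4's real configuration.)
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
router = rng.standard_normal((d, n_experts))

def moe_layer(x):
    """Route each token to its top_k experts; only those weights run."""
    scores = x @ router                            # (tokens, n_experts)
    picked = np.argsort(-scores, axis=1)[:, :top_k]
    out = np.zeros_like(x)
    for t, choices in enumerate(picked):
        for e in choices:
            out[t] += (x[t] @ experts[e]) / top_k  # average chosen experts
    return out, picked

x = rng.standard_normal((4, d))
y, picked = moe_layer(x)

total_params = n_experts * d * d   # parameters stored on disk/VRAM
active_params = top_k * d * d      # parameters touched per token
print(f"total: {total_params:,}  active per token: {active_params:,}")
```

The gap between those two numbers is the whole trick: you pay memory for the full model but compute only for the active slice, which is how a 25.2B model can run a 3.8B-sized forward pass.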

Benchmarks That Matter

The 31B flagship currently sits at #3 on the Arena AI text leaderboard with an Elo score of 1452 — up from Gemma 3's 1365. The 26B MoE variant claims the #6 spot while activating just 3.8 billion parameters per forward pass.

Here's how the family stacks up on key benchmarks:

| Benchmark | 31B | 26B A4B | E4B | E2B |
|---|---|---|---|---|
| MMLU Pro | 85.2% | 82.6% | 69.4% | 60.0% |
| AIME 2026 (math) | 89.2% | 88.3% | 42.5% | 37.5% |
| Codeforces Elo | 2150 | 1718 | 940 | 633 |
| GPQA Diamond | 84.3% | 82.3% | 58.6% | 43.4% |

That 89.2% on AIME 2026 from the 31B model is remarkable for an open-weight model. The 26B MoE hitting 88.3% on the same benchmark with a fraction of active compute is arguably even more impressive.

Truly Multimodal — Including Audio on Edge

Every Gemma 4 model handles text and images natively. But the E2B and E4B edge models go further with native audio and video processing — meaning you can build a completely offline, multimodal assistant that runs on a phone with near-zero latency.

The architecture uses a hybrid attention mechanism that interleaves local sliding-window attention (512 tokens for smaller models, 1024 for larger ones) with full global attention. This keeps memory usage manageable during long-context inference while still capturing distant dependencies.
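The interleaving can be pictured as a stack of attention masks: most layers use a cheap causal sliding window, with a full causal layer inserted periodically. The sketch below assumes a 5-local-to-1-global pattern purely for illustration — the article doesn't state Gemma 4's actual ratio:

```python
import numpy as np

def local_mask(seq_len, window):
    """Causal sliding window: token i sees only the previous `window` tokens."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

def global_mask(seq_len):
    """Full causal attention: token i sees every token up to and including i."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return j <= i

def layer_masks(n_layers, seq_len, window, global_every=6):
    """Interleave local layers with a periodic global layer (ratio assumed)."""
    return [global_mask(seq_len) if (l + 1) % global_every == 0
            else local_mask(seq_len, window)
            for l in range(n_layers)]

masks = layer_masks(12, 1024, window=512)
local_cost = masks[0].sum()   # grows as O(seq_len * window)
global_cost = masks[5].sum()  # grows as O(seq_len^2)
print(local_cost, global_cost)
```

Because local layers scale linearly with sequence length while only the occasional global layer pays the quadratic cost, the KV-cache and attention budget stay manageable at 128K–256K tokens without giving up long-range connections entirely.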

Agentic by Design

Gemma 4 introduces native function calling across all variants, making it straightforward to build autonomous agents that can navigate apps, call APIs, and chain tool use without external orchestration layers. Combined with configurable thinking modes — where you can toggle the model's internal reasoning on or off — developers get fine-grained control over the speed-accuracy tradeoff.
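To make the agent loop concrete, here's a minimal sketch of the model-calls-tool-gets-result cycle that native function calling enables. Everything here is a stand-in: `call_gemma` is a stub, and the tool-call JSON shape, the `get_weather` tool, and its result are hypothetical, not Gemma 4's actual wire format:

```python
import json

# Hypothetical tool registry; get_weather and its payload are made up.
TOOLS = {
    "get_weather": lambda city: {"city": city, "temp_c": 21},
}

def call_gemma(messages):
    """Stub standing in for a real Gemma 4 inference call. A real model
    decides whether to answer directly or emit a structured tool call."""
    return {"tool_call": {"name": "get_weather",
                          "arguments": {"city": "Lisbon"}}}

def agent_step(messages):
    """One turn of the loop: run the model, execute any tool call it
    emits, and append the result so the next turn can build on it."""
    reply = call_gemma(messages)
    if "tool_call" in reply:
        call = reply["tool_call"]
        result = TOOLS[call["name"]](**call["arguments"])
        messages.append({"role": "tool",
                         "name": call["name"],
                         "content": json.dumps(result)})
    return messages

history = agent_step([{"role": "user", "content": "Weather in Lisbon?"}])
print(history[-1])
```

The point of "native" function calling is that the model itself emits the structured call — the loop above needs no separate orchestration framework to parse free-form text into tool invocations.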

140 Languages, Consumer Hardware

The entire family supports over 140 languages with cultural context understanding, making it one of the most linguistically diverse open model families available. Model weights are downloadable from Hugging Face, Ollama, Kaggle, LM Studio, and Docker, with deployment support through JAX, Keras, Vertex AI, and Google Kubernetes Engine.

The 31B model runs on a single consumer GPU. The edge models fit comfortably on mobile devices — Google's own AI Edge Gallery app, which lets users run Gemma 4 locally on Android and iOS, recently broke into the App Store top 10.

The Bottom Line

Gemma 4 isn't just an incremental update. It's Google DeepMind's clearest statement yet that frontier-class reasoning belongs in the open. The 26B MoE model delivering near-31B performance with 3.8B active parameters makes the efficiency argument impossible to ignore, and native audio on edge models opens use cases that simply didn't exist before. Under the Apache 2.0 license, there are zero strings attached.