Gemma 4 12B: Google's Encoder-Free Multimodal Laptop Model

Google released Gemma 4 12B on June 3, 2026, a multimodal open model with an encoder-free architecture that feeds vision and audio directly into the LLM backbone. It runs locally on 16GB of memory, approaches the 26B MoE on benchmarks, uses Multi-Token Prediction drafters for low latency, and ships under Apache 2.0 with broad tooling support.

Sarah Chen

Jun 9, 2026

Google has spent the better part of two years arguing that open models do not have to be big to be good. Gemma 4 12B, released on June 3, 2026, is the most pointed version of that argument yet: a multimodal model that ingests text, images, and audio, runs on a laptop with 16GB of memory, and throws out the encoder stack that nearly every other vision-language model still leans on.

It slots neatly into the lineup Google shipped in April. The edge-class E2B and E4B handle phones and IoT. The 26B Mixture-of-Experts and 31B Dense target workstations. Gemma 4 12B fills the gap in the middle — and it is the first mid-sized Gemma to accept native audio input.

The encoder-free bet

Here is the genuinely interesting part. Most multimodal models bolt a separate encoder onto the language model: a vision tower converts pixels into embeddings, an audio tower does the same for sound, and only then does the LLM see anything. Those encoders work, but they cost latency and memory.

Google trained Gemma 4 12B without them.

"Because these split encoders add latency and increase memory usage, we trained Gemma 4 12B with an encoder-free architecture to integrate audio and vision input directly." — Google DeepMind

For vision, Google replaced the encoder with a lightweight embedding module — a single matrix multiplication, positional embedding, and normalizations — and let the LLM backbone do the visual processing itself. For audio, the team went further and removed the encoder entirely, projecting the raw audio signal straight into the same dimensional space as text tokens.
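To make the vision half of that description concrete, here is a minimal NumPy sketch of an embedding module built from the pieces Google names: one matrix multiplication, a positional embedding, and normalizations. The patch size, model width, ordering of the steps, and the function names `layer_norm` and `embed_image` are all assumptions for illustration; Google has not published this code, and the real module may differ.

```python
import numpy as np


def layer_norm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Normalize each row to zero mean and unit variance (no learned scale)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def embed_image(image: np.ndarray, proj: np.ndarray,
                pos_emb: np.ndarray, patch: int = 16) -> np.ndarray:
    """Turn an (H, W, C) image into a sequence of token-sized vectors.

    Hypothetical layout: proj is (patch*patch*C, d_model) and
    pos_emb is (max_patches, d_model).
    """
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    # Cut the image into non-overlapping patches and flatten each one.
    patches = image[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * c)
    # The single matrix multiplication: raw pixels to model width.
    tokens = layer_norm(patches) @ proj
    # Add position information, then normalize before handing off to the LLM.
    tokens = tokens + pos_emb[:gh * gw]
    return layer_norm(tokens)


rng = np.random.default_rng(0)
image = rng.random((64, 48, 3))                  # toy 64x48 RGB image
proj = rng.standard_normal((16 * 16 * 3, 32))    # toy model width of 32
pos_emb = rng.standard_normal((100, 32))
tokens = embed_image(image, proj, pos_emb)
print(tokens.shape)  # (12, 32): a 4x3 grid of patches, one vector each
```

Everything after this step, in the design the article describes, is the language model's job: the vectors go into the same transformer stack as text tokens rather than through a separate vision tower.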

The practical payoff is a model that handles three modalities through one unified transformer instead of three stitched-together subsystems. Fewer moving parts, less memory, lower latency.

Performance per gigabyte

Gemma 4 12B is built to sit just below the 26B MoE on quality while using less than half the total memory footprint, according to Google. On standard benchmarks its scores approach the larger model's, which is the entire pitch: near-26B reasoning that fits in a 16GB laptop.
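A rough back-of-the-envelope calculation shows why precision matters for the 16GB claim. The sketch below assumes the "12B" in the name means roughly 12 billion weights and counts only the weights themselves; the KV cache, activations, and runtime overhead are ignored, and Google has not said here which quantization the 16GB figure relies on.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Memory needed for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumption: 12 billion parameters, taken from the model name.
N_PARAMS = 12e9

for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_memory_gb(N_PARAMS, bits):.0f} GB")
```

Under those assumptions, 16-bit weights alone come to about 24 GB, while 8-bit and 4-bit weights come to about 12 GB and 6 GB, so fitting in 16GB would imply some form of reduced precision.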

To keep responses snappy, the model ships with Multi-Token Prediction (MTP) drafters, a speculative-decoding technique in which a lightweight drafter proposes several tokens ahead and the main model verifies them, cutting latency. Combined with the encoder-free design, that makes it viable for genuinely local agentic work: not just chat, but tool-calling and multi-step tasks running entirely on your machine.
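For readers who have not seen speculative decoding before, here is a toy sketch of the generic draft-and-verify loop with greedy decoding. It is not Google's MTP implementation: the `target` and `draft` callables are hypothetical stand-ins for the full model and a drafter, and a real system would verify all drafted tokens in one batched forward pass rather than one call per token.

```python
from typing import Callable, List

# A "model" here is any function mapping a token history to the next token.
NextToken = Callable[[List[int]], int]


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: output matches plain greedy decoding
    with `target`, but the drafter gets to guess k tokens at a time."""
    out = list(prompt)
    goal = len(prompt) + max_new
    while len(out) < goal:
        # 1. Draft k tokens cheaply.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. Verify: keep drafted tokens while they match the target's choice.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # first mismatch: use the target's token
                break
        else:
            # Every draft was accepted, so the target adds one token of its own.
            accepted.append(target(out + accepted))
        out.extend(accepted)
    return out[:goal]


# Toy models: the drafter agrees with the target only after even tokens.
target = lambda ctx: (ctx[-1] * 3 + 1) % 7
draft = lambda ctx: (ctx[-1] * 3 + 1) % 7 if ctx[-1] % 2 == 0 else 0
print(speculative_decode(target, draft, [2], max_new=10))
```

The latency win comes from the accepted drafts: each one is a token the large model confirmed rather than generated in its own sequential step. When the drafter guesses badly, the loop still makes progress by one token per round, so the output never changes, only the speed.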

Spec	Gemma 4 12B
Modalities	Text, image, audio (input)
Local memory	Runs on 16GB RAM / VRAM
Architecture	Encoder-free unified transformer
Latency feature	Multi-Token Prediction drafters
License	Apache 2.0

Why "open" still matters here

Gemma 4 12B ships under a commercially permissive Apache 2.0 license — no monthly-active-user caps, no usage gates. That is the same license as the rest of the Gemma 4 family, and it is a deliberate contrast to the bespoke community licenses some rivals attach to their open weights.

The momentum behind that decision is real. Google says the Gemma 4 family has now crossed 150 million downloads, with developers building everything from wearable robotic arms to enterprise security tooling on top of it. An open license that lets you run a capable multimodal model offline, on hardware you already own, is the kind of thing that compounds.

Google also used the launch to release an official Gemma Skills Repository on GitHub — a library of skills aimed at helping agents build with Gemma models, a nod to how much of the ecosystem is now agent-shaped.

Getting it running

The model landed with broad day-one support, so there is no waiting on the toolchain to catch up. You can try it in a couple of clicks through LM Studio, Ollama, the Google AI Edge Gallery app, or the LiteRT-LM CLI. Weights — both pre-trained and instruction-tuned — are on Hugging Face and Kaggle.

For building, Gemma 4 12B works with Hugging Face Transformers, llama.cpp, MLX, SGLang, and vLLM, and fine-tunes through Unsloth. For production, Google points to Cloud Run, GKE, and its Model Garden. In other words: pick whatever you already use.

The Bottom Line

Gemma 4 12B is not chasing a leaderboard crown. It is chasing a different prize — the best multimodal reasoning you can run on the laptop already on your desk. By dropping the encoder stack, accepting audio natively, and staying under a 16GB memory ceiling, it makes a strong case that the most useful model is often the one you can run yourself, offline, for free. For developers who care about cost, privacy, and digital sovereignty, that is a more compelling story than another half-point on a benchmark. The encoder-free experiment is the part worth watching: if it holds up, expect the rest of the industry to follow Google out of the encoder business.
