Nemotron 3 Nano Omni: NVIDIA's 30B Open Model Sees and Hears

Marcus Rivera
Apr 29, 2026

For two years, every "multimodal" agent in production has been a Frankenstein. A speech-to-text model. A vision encoder. A language model. A glue layer that shuttles tokens between them, dropping context with every handoff and adding latency at every seam.

On April 28, 2026, NVIDIA released Nemotron 3 Nano Omni — and the seams are gone. One 30B-parameter model with 3B active parameters, hybrid mixture-of-experts, audio and vision encoders fused into the architecture itself. Open weights, plus the datasets and training techniques behind them.

NVIDIA claims 9x higher throughput than other open omni models at equivalent interactivity. The benchmarks back the claim.

What "Omni" Actually Means Here

Most "multimodal" models are vision + text. Nemotron 3 Nano Omni handles text, images, audio, video, documents, charts, and graphical interfaces as inputs and produces text as output — all in a single forward pass through one set of weights.

The architecture is a 30B-A3B hybrid MoE with three key additions:

  • Conv3D for video — the model treats time as a real dimension, not as stitched-together frames.
  • EVS (Efficient Video Sampling) — token-level compression that lets long video stay in context without blowing up the KV cache (a toy sketch of the idea follows this list).
  • 256K context window — long enough to ingest a feature-length screen recording or a multi-hour audio file.
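
The EVS bullet deserves a quick illustration. The general idea behind token-level video compression is to drop patch tokens that barely change from one frame to the next, so static regions stop consuming context. Below is a toy numpy sketch of that idea — an illustration, not NVIDIA's implementation, and the 0.9 similarity threshold is an arbitrary demo value.

import numpy as np

def compress_video_tokens(frames: np.ndarray, threshold: float = 0.9):
    """frames: (T, N, D) array — T frames, N patch tokens, D dims.
    Keep frame 0 whole, then keep only patches that changed."""
    T, N, _ = frames.shape
    kept = [(0, i) for i in range(N)]       # anchor frame kept in full
    prev = frames[0]
    for t in range(1, T):
        cur = frames[t]
        # cosine similarity between each patch token and its predecessor
        sim = (cur * prev).sum(-1) / (
            np.linalg.norm(cur, axis=-1) * np.linalg.norm(prev, axis=-1) + 1e-8
        )
        kept += [(t, i) for i in np.where(sim < threshold)[0]]  # changed patches only
        prev = cur
    return kept

rng = np.random.default_rng(0)
video = np.repeat(rng.normal(size=(1, 196, 64)), 8, axis=0)  # 8 near-identical frames
video[4, :10] += rng.normal(size=(10, 64))                   # motion in 10 patches of frame 4
print(len(compress_video_tokens(video)), "of", 8 * 196, "video tokens kept")

Static footage collapses to a fraction of its naive token count, which is exactly what keeps a long screen recording inside the 256K window.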

The single-pass design is the point. NVIDIA's pitch: "Functions as the eyes and ears in a system of agents," working alongside Nemotron 3 Super (high-frequency execution) or Ultra (planning), or any closed-weights orchestrator you prefer.

The 9x Throughput Claim, Examined

NVIDIA reports Nemotron 3 Nano Omni delivers up to 9x higher throughput than other open omni models at the same interactivity, and tops six leaderboards for document intelligence, video understanding, and audio understanding.

The throughput delta comes from architecture, not magic:

  • 3B active parameters per token — only 10% of the 30B total fires for any given inference step (rough arithmetic below).
  • Single-pass perception — no separate ASR pass, no separate vision pass, no orchestration overhead.
  • EVS compression — fewer video tokens to process means more frames per second of input at the same compute budget.
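
A back-of-envelope on the first bullet: per-token decode compute scales at roughly 2 FLOPs per active parameter, so 3B active parameters do about a tenth of the arithmetic of a 30B dense pass. The numbers below are illustrative, not NVIDIA's benchmark methodology — real throughput also hinges on memory bandwidth, batching, and KV-cache size, which is where EVS earns its keep.

# Per generated token, decode compute is roughly 2 FLOPs per active parameter.
active_params = 3e9     # Nemotron 3 Nano Omni: parameters active per token
dense_params = 30e9     # a dense model of the same total size, for comparison
ratio = (2 * dense_params) / (2 * active_params)
print(f"~{ratio:.0f}x less per-token compute than an equal-size dense pass")  # ~10x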

For real workloads, this is the difference between an agent that can watch a screen recording in real time and one that needs to batch-transcribe first.

Three Workloads This Unlocks

NVIDIA highlights three concrete agentic use cases — and unlike most launch-day claims, partners are already shipping integrations.

Computer use agents. H Company's latest computer-use agent uses Nemotron 3 Nano Omni at a native input resolution of 1920×1080 pixels. CEO Gautier Cloix:

"By building on Nemotron 3 Nano Omni, our agents can rapidly interpret full HD screen recordings — something that wasn't practical before. This isn't just a speed boost: It's a fundamental shift in how our agents perceive and interact with digital environments in real time."

In preliminary OSWorld benchmark results, the integration showed significant improvements navigating complex GUIs.

Document intelligence. The model interprets documents, charts, tables, screenshots, and mixed-media inputs in a single reasoning pass. For finance, legal, and compliance work — where the answer often depends on cross-referencing a chart against a footnote against a table — single-pass perception eliminates the context loss that bedevils stitched pipelines.

Audio-video understanding. Customer service review, meeting analysis, monitoring workflows. Instead of a separate transcription run feeding a separate vision run feeding a summarizer, Nemotron 3 Nano Omni keeps audio, video, and on-screen text in one reasoning stream.

Where to Run It

Three official deployment paths:

# 1. Hugging Face — full weights, your hardware
huggingface-cli download nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16

# 2. OpenRouter — managed, currently free tier
curl https://openrouter.ai/api/v1/chat/completions \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"nvidia/nemotron-3-nano-omni-30b-a3b-reasoning:free", ...}'

# 3. NVIDIA NIM microservice on build.nvidia.com
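
For the OpenRouter path, here's a minimal Python version of that curl — a sketch assuming the endpoint follows the standard OpenAI-compatible chat schema (image_url content parts are part of that schema; check the model card for the exact audio input format, which I won't guess at here). The image URLs are placeholders.

import os
from openai import OpenAI

# OpenRouter exposes an OpenAI-compatible API; model slug from the curl above.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

resp = client.chat.completions.create(
    model="nvidia/nemotron-3-nano-omni-30b-a3b-reasoning:free",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What changed between these two dashboards?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/before.png"}},  # placeholder
            {"type": "image_url", "image_url": {"url": "https://example.com/after.png"}},   # placeholder
        ],
    }],
)
print(resp.choices[0].message.content)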

The lightweight footprint is deliberate. NVIDIA designed the model to run consistently from NVIDIA Jetson edge devices through DGX Spark and DGX Station workstations to data center deployments — same weights, same behavior, different scale.

For customization, NVIDIA points to NeMo for fine-tuning, evaluation, and domain adaptation.

Adoption Out of the Gate

The launch-day partner list is unusually deep:

  • Already in production: Aible, Applied Scientific Intelligence (ASI), Eka Care, Foxconn, H Company, Palantir, Pyler.
  • Currently evaluating: Dell Technologies, Docusign, Infosys, K-Dense, Lila, Oracle, Zefr.

That's enterprise weight. Eka Care is using Nemotron 3 Nano Omni to build "agentic multimodal healthcare for India scale patient care" — the kind of workload where stitched pipelines fail because the cost of a missed audio cue or a misread chart is medical.

The Nemotron 3 family — Nano, Super, Ultra — has crossed 50 million downloads in the past year. Omni is the multimodal extension of a model line that's already proven it ships.

Where It Fits vs. Other Open Models

| Model | Modalities | Active / Total | Open Weights |
|---|---|---|---|
| Nemotron 3 Nano Omni | Text, image, audio, video, docs | 3B / 30B | Yes |
| Moondream 3 | Text, image | 2B (dense) | Yes |
| Qwen 3.6 VL | Text, image | (not disclosed) | Yes |
| Gemma 4 | Text, image | varies | Yes |
| GPT-5.5 | All modalities | — | No |

Nemotron 3 Nano Omni is the only open omni model that handles audio and video natively at this scale. Moondream 3 is excellent but vision-only. Qwen 3.6 VL doesn't handle audio. If your agent needs eyes and ears, the alternatives are closed weights.

What Doesn't Work Yet

Output is text-only — Nemotron 3 Nano Omni doesn't generate images, audio, or video. For end-to-end multimodal generation you still need a stack: Nano Omni for perception, then a generative model for output.
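
In practice that stack can be two calls: Nano Omni turns the input into text, and a separate generative model turns the text into media. A hedged sketch — the pairing with OpenAI's gpt-image-1 is an arbitrary choice for illustration, and any text-to-image or text-to-speech endpoint slots into the second step:

import os
from openai import OpenAI

perceive = OpenAI(base_url="https://openrouter.ai/api/v1",
                  api_key=os.environ["OPENROUTER_API_KEY"])
generate = OpenAI(api_key=os.environ["OPENAI_API_KEY"])  # arbitrary choice of output provider

# Step 1: perception — Nano Omni describes the multimodal input in text.
desc = perceive.chat.completions.create(
    model="nvidia/nemotron-3-nano-omni-30b-a3b-reasoning:free",
    messages=[{"role": "user", "content": [
        {"type": "text", "text": "Describe this chart precisely enough to redraw it."},
        {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},  # placeholder
    ]}],
).choices[0].message.content

# Step 2: generation — a separate model produces the media output.
image = generate.images.generate(model="gpt-image-1", prompt=desc)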

It's also a perception sub-agent in NVIDIA's intended design — not a planner. NVIDIA explicitly positions Nano Omni alongside Nemotron 3 Super or Ultra for execution and planning. Don't ask it to drive complex multi-step reasoning on its own.

The Bottom Line

Nemotron 3 Nano Omni is the first open multimodal model that meaningfully replaces a stitched perception pipeline with a single set of weights. The 9x throughput claim isn't a benchmark stunt — it falls out of running 3B active parameters in one pass instead of running three separate models in three passes.

If you're building agents that need to look at a screen, listen to a meeting, read a document, and answer in real time, your perception layer just got 30B parameters smaller and the seams disappeared. The ecosystem (Aible, Palantir, Foxconn, H Company, Eka Care) is already there.

The only frontier left for open multimodal: generation. NVIDIA didn't ship that this round. Someone will.