NVIDIA Nemotron 3 Super: The Hybrid Architecture That Rewrites the Agent Playbook

NVIDIA Nemotron 3 Super combines three neural-network architectures into one efficient open model for enterprise AI agents.

Sarah Chen
Mar 31, 2026

NVIDIA Nemotron 3 Super is a new kind of open model — one that blends three fundamentally different neural-network architectures into a single 120-billion-parameter system, yet activates only 12 billion parameters per forward pass. Announced at GTC 2026 on March 11, it's designed from the ground up for the agentic AI workloads that are rapidly becoming the industry's primary concern.

Three Architectures, One Model

Most large language models pick a lane: dense Transformer, state-space model, or mixture-of-experts. Nemotron 3 Super refuses to choose. It interleaves Mamba-2 state-space layers, periodic Transformer attention layers, and NVIDIA's proprietary LatentMoE routing into a single forward pass.

Here's the division of labor:

  • Mamba-2 layers handle the bulk of sequence processing in linear time — crucial when your context window stretches to 1 million tokens.
  • Transformer attention layers appear at key depths to provide the precise associative recall that state-space models sometimes lack.
  • LatentMoE compresses tokens from the full hidden dimension (4,096) into a latent space (1,024) before routing them to experts. That 4× reduction means the model can maintain 512 total experts with 22 active per token at the same compute cost a standard MoE would spend on far fewer.

The result: a model that processes long contexts efficiently and reasons precisely when it counts.
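As a rough illustration of why latent routing pays off, here is a toy per-token cost model using the figures above. The FFN expansion factor and the MAC accounting are assumptions made for illustration, not NVIDIA's published arithmetic:

```python
# Toy compute comparison: expert FFNs running at the full hidden
# dimension vs. in the 4x-smaller latent space. Illustrative only.
D_MODEL, D_LATENT = 4096, 1024   # full vs. latent hidden dimension
N_ACTIVE = 22                    # experts activated per token
FFN_MULT = 4                     # assumed expert FFN expansion factor

def expert_macs(dim: int, n_active: int, mult: int = FFN_MULT) -> int:
    """Per-token MACs for n_active expert FFNs (up- plus down-projection)."""
    per_expert = 2 * dim * (dim * mult)
    return n_active * per_expert

full_space = expert_macs(D_MODEL, N_ACTIVE)
latent_space = expert_macs(D_LATENT, N_ACTIVE)
print(full_space // latent_space)  # -> 16: same expert count, far less compute
```

Because both FFN matrices shrink with the routing dimension in this toy model, a 4× smaller latent space cuts per-expert compute by roughly 16×, which is the headroom that funds keeping 512 experts with 22 active.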

Multi-Token Prediction: Built-In Speculative Decoding

Nemotron 3 Super includes Multi-Token Prediction (MTP) layers that predict multiple future tokens simultaneously. In agentic workflows where structured outputs (JSON, function calls, code) are common, MTP delivers up to 3× faster generation without a separate speculative-decoding model.
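The mechanism is easiest to see as a draft-and-verify loop. The sketch below uses deterministic stand-in functions in place of real model heads, purely to show how accepted prefixes and corrections interact:

```python
# Toy draft-and-verify loop behind MTP-style speculative decoding:
# a cheap head proposes k future tokens, and the base model keeps the
# longest prefix it agrees with. These "models" are arithmetic
# stand-ins, not Nemotron.

def draft_tokens(context, k):
    # deliberately imperfect stand-in draft head (wraps at 5, not 7)
    return [(context[-1] + i + 1) % 5 for i in range(k)]

def base_next_token(context):
    # stand-in for one expensive base-model decoding step
    return (context[-1] + 1) % 7

def speculative_step(context, k=4):
    """Return the tokens accepted in one step (at least 1, at most k)."""
    draft = draft_tokens(context, k)
    accepted = []
    for tok in draft:
        expected = base_next_token(context + accepted)
        if tok != expected:
            accepted.append(expected)  # base model's correction
            return accepted
        accepted.append(tok)
    return accepted

print(speculative_step([3], k=4))  # -> [4, 5]: two tokens per base pass
```

Even when the draft diverges, each verification pass emits at least one correct token, so throughput can only improve; with structured output like JSON, drafts match far more often, which is where the "up to 3×" figure comes from.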

Trained in 4-Bit From Scratch

Most models train in higher precision and then get quantized down. NVIDIA took a different path: Nemotron 3 Super was pretrained from the first gradient update in NVFP4 — NVIDIA's 4-bit floating-point format. The model learned to be accurate within 4-bit arithmetic constraints rather than being compressed after the fact.
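To make the constraint concrete, here is a minimal fake-quantization sketch that snaps a block of weights onto an FP4 (E2M1) grid with a per-block scale. The grid magnitudes are standard E2M1 values, but the scaling rule is an illustrative assumption, not NVFP4's exact scheme:

```python
# Minimal "fake quantization" to an FP4 (E2M1) grid with a per-block
# scale, illustrating the arithmetic a natively 4-bit-trained model
# must stay accurate within. Scaling rule here is a simplification.

FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # E2M1 magnitudes

def quantize_fp4(values):
    """Scale a block so its max maps to 6.0, snap to the grid, rescale."""
    amax = max(abs(v) for v in values) or 1.0
    scale = amax / 6.0
    out = []
    for v in values:
        mag = min(FP4_GRID, key=lambda g: abs(g - abs(v) / scale))
        out.append(mag * scale if v >= 0 else -mag * scale)
    return out

weights = [0.11, -0.42, 0.03, 0.27]
print(quantize_fp4(weights))  # every value lands on a scaled 4-bit grid point
```

With only eight magnitudes per scale block, the rounding error is substantial; training in this format from the first gradient update lets the optimizer route around that error instead of absorbing it after the fact.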

On NVIDIA B200 (Blackwell) GPUs, this native NVFP4 training yields a 4× inference speedup compared to FP8 on the previous-generation H100. Combined with the sparse MoE architecture, overall throughput is more than 5× higher than the previous Nemotron Super.

Training at Scale

The numbers behind the training are substantial:

Metric               Value
Pretraining tokens   25 trillion (10T unique)
SFT samples          ~7 million (from 40M corpus)
RL environments      21 configurations
RL rollouts          ~1.2 million

NVIDIA used NeMo Gym and NeMo RL for the reinforcement-learning phase, training across 21 environment configurations to build the model's agentic reasoning capabilities.

Benchmark Performance

On PinchBench, an agentic reasoning evaluation, Nemotron 3 Super scores 85.6% across the full test suite — making it the best open model in its weight class for multi-step agent tasks. This matters because agentic benchmarks test exactly the kind of plan-execute-verify loops that enterprise deployments demand.

How to Run It

Nemotron 3 Super is available through multiple channels:

  • NVIDIA NIM container for optimized inference
  • Hugging Face for direct weight downloads
  • vLLM, SGLang, and TensorRT-LLM for self-hosted deployments
  • Cloud platforms including Perplexity, OpenRouter, and build.nvidia.com

# Pull via NIM (requires NGC authentication)
docker pull nvcr.io/nim/nvidia/nemotron-3-super-120b-a12b:latest
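Once a server is running (NIM and vLLM both expose an OpenAI-compatible API), a client call can be sketched with only the standard library. The port and model ID below are assumptions for illustration; check your server's /v1/models listing for the real identifier:

```python
# Build a chat-completions request for an OpenAI-compatible endpoint,
# the interface both NIM and vLLM serve. Port and model ID are assumed.
import json
import urllib.request

def build_chat_request(base_url, model, prompt):
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request(
    "http://localhost:8000",              # assumed local server address
    "nvidia/nemotron-3-super-120b-a12b",  # assumed model identifier
    "Summarize the latest deployment logs.",
)
# urllib.request.urlopen(req) would send it once the server is up.
print(req.full_url)
```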

The model ships under the NVIDIA Nemotron Open Model License, with full weights, datasets, and training recipes published openly.

Why This Matters for Enterprise AI

The enterprise AI conversation has shifted from "which model scores highest on MMLU" to "which model can reliably execute multi-step workflows at reasonable cost." Nemotron 3 Super targets that question directly:

12B active parameters means you can run a frontier-class reasoning model on infrastructure that would choke on a dense 120B model. The 1M context window means your agents don't lose track of the task halfway through.
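Those claims are easy to sanity-check with the common rule of thumb of roughly 2 FLOPs per parameter per decoded token. The sketch below is a back-of-envelope estimate, not a measured figure, and it ignores KV-cache and kernel overheads:

```python
# Back-of-envelope decode cost and weight-memory comparison using the
# ~2 FLOPs/parameter/token rule of thumb. Illustrative, not measured.

def decode_flops_per_token(active_params):
    return 2 * active_params

def weight_memory_gb(total_params, bits_per_param):
    return total_params * bits_per_param / 8 / 1e9

dense_120b = decode_flops_per_token(120e9)   # dense model: all params active
moe_12b = decode_flops_per_token(12e9)       # MoE: 12B active per token
print(dense_120b / moe_12b)                  # -> 10.0x less compute per token

# All 120B weights still need to be resident; 4-bit vs. 8-bit storage:
print(weight_memory_gb(120e9, 4), weight_memory_gb(120e9, 8))  # 60.0 120.0
```

The asymmetry is the key point: per-token compute scales with the 12B active parameters, while weight memory scales with the full 120B, which is exactly where the native 4-bit format pulls its weight.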

For teams building autonomous agents — customer-support systems, code-generation pipelines, document-analysis workflows — the combination of high throughput, long context, and strong agentic reasoning makes Nemotron 3 Super worth serious evaluation.

The Bottom Line

NVIDIA's Nemotron 3 Super isn't just another big model. It's a bet that the future of AI inference belongs to hybrid architectures that combine the best properties of Transformers, state-space models, and mixture-of-experts — all trained natively in low precision. With 5× throughput gains and top-tier agentic benchmarks, it's the most compelling open model for production agent deployments in 2026.