Tag

mixture-of-experts

9 articles

Mixture of Experts: How Sparse Models Beat Dense LLMs

Mixture of Experts (MoE) replaces a transformer's single feed-forward network with many smaller expert networks plus a learned router that sends each token to only its top-k experts (sparse activation). This decouples total parameters (which set memory) from active parameters (which set compute). Mixtral 8x7B has 46.7B total but 12.9B active via top-2 routing; DeepSeek-V3 has 671B total but 37B active (5.5%) using 256 routed experts plus one shared expert and top-8 routing. The design traces to Shazeer et al. (2017) and Google's Switch Transformer (2021, top-1 routing, 1.6T params). Trade-offs include memory footprint, load-balancing difficulty, training instability, communication overhead, and harder fine-tuning.

By Aisha Patel · 6 min · Jul 10, 2026
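
As a rough illustration of the top-k routing described in the summary above, here is a minimal pure-Python sketch. It is a toy under stated assumptions, not code from Mixtral, DeepSeek-V3 or any other model: the function names (`top_k_route`, `moe_layer`, `active_fraction`), the scalar "experts" and the router logits are all hypothetical. It uses a softmax router whose top-k weights are renormalized, which is one common variant. The only real numbers are the parameter counts quoted above.

```python
import math


def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    # Softmax over all experts (subtract the max for numerical stability).
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep only the k most probable experts (sparse activation).
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Renormalize so the kept weights sum to 1.
    kept = sum(probs[i] for i in top)
    return [(i, probs[i] / kept) for i in top]


def moe_layer(token, experts, router_logits, k):
    """Weighted sum of the outputs of the selected experts only."""
    out = 0.0
    for idx, weight in top_k_route(router_logits, k):
        out += weight * experts[idx](token)  # unselected experts never run
    return out


def active_fraction(active_params_b, total_params_b):
    """Share of parameters used per token (both arguments in billions)."""
    return active_params_b / total_params_b


# Hypothetical toy setup: four scalar "experts" and made-up router logits.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 2.0]

routes = top_k_route(logits, k=2)
output = moe_layer(1.0, experts, logits, k=2)

# Parameter counts as quoted in the summary above.
mixtral = active_fraction(12.9, 46.7)
deepseek_v3 = active_fraction(37, 671)
```

In this toy run only two of the four experts execute for the token, which is the point of the design: all experts must sit in memory, but compute scales with the active ones.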

AI News

GLM-5.2: Zhipu's Open-Weight Model Beats GPT-5.5 at 1/6 the Cost

Z.AI released GLM-5.2 on June 16, 2026: a 753B-parameter MoE model under an MIT license with a 1M-token context. It tops open-weight coding benchmarks, beating GPT-5.5 on SWE-Bench Pro, FrontierSWE and PostTrainBench at roughly one-sixth the cost.

By Sarah Chen · 5 min · Jun 26, 2026

AI News

MAI-Thinking-1: Microsoft's First In-House Reasoning Model

Microsoft unveiled MAI-Thinking-1 at Build 2026, its first reasoning model trained in-house without distillation. The 35B-active, ~1T-total MoE has a 256K context window, scores 97.0% on AIME 2025 and matches Claude Opus 4.6 on SWE-Bench Pro. It's in private preview on Microsoft Foundry.

By Sarah Chen · 5 min · Jun 23, 2026

AI News

Kimi K2.7-Code: A 30% Token Cut With a Benchmark Asterisk

Moonshot AI's Kimi K2.7-Code is an open-weights, OpenAI-compatible coding model (1T-param MoE, 32B active, 256K context) claiming a 30% cut in reasoning tokens and a narrow win over Claude Opus 4.8. But all published benchmarks are Moonshot's own proprietary suites, with no independent results yet, so the efficiency claims remain unverified.

By Sarah Chen · 5 min · Jun 14, 2026

Deep Dives

ZAYA1-8B: Zyphra's 760M-Active MoE Trained on AMD

ZAYA1-8B is Zyphra's MoE model that activates about 760M parameters per token and was trained on AMD hardware.

By Aisha Patel · 6 min · May 24, 2026

Open Source

Trinity-Large-Thinking: 400B U.S.-Made Open Reasoning Model

Trinity-Large-Thinking is Arcee AI's 400B-parameter, U.S.-made open-weights reasoning model, pitched as a cost-effective base for agent tuning.

By Aisha Patel · 7 min · Apr 30, 2026

Open Source

Moondream 3: The 9B Vision Model That Runs Like a 2B

Moondream 3 is a 9B-parameter vision model that runs at roughly the cost of a 2B model.

By Marcus Rivera · 4 min · Apr 1, 2026

AI News

NVIDIA Nemotron 3 Super: The Hybrid Architecture That Rewrites the Agent Playbook

NVIDIA's Nemotron 3 Super uses a hybrid architecture and is reported to deliver 5x throughput alongside top scores on agentic benchmarks.

By Sarah Chen · 4 min · Mar 31, 2026

Open Source

Mistral Small 4: One Open-Source Model Replaces Three Separate AI Products

Mistral Small 4 folds three separate AI products into a single open-source model.

By Marcus Rivera · 4 min · Mar 30, 2026