Qwen 3.5 Small: Alibaba's 9B Model That Beats GPT-OSS-120B

Alibaba's Qwen 3.5 Small series delivers natively multimodal AI models from 0.8B to 9B parameters under Apache 2.0, with the 9B scoring 81.7 on GPQA Diamond — beating OpenAI's gpt-oss-120B.

Sarah Chen
Mar 29, 2026

The Qwen 3.5 Small series — Alibaba's family of 0.8B to 9B parameter open-source models — just proved that you don't need a data center to run a genuinely capable multimodal AI. Released on March 2, 2026 under Apache 2.0, these compact models pack capabilities that would have required 10x the parameters just a year ago.

Here's why this matters for developers building on-device AI.

What Qwen 3.5 Small Actually Is

Qwen 3.5 Small is a family of four dense language models: 0.8B, 2B, 4B, and 9B parameters. They sit at the bottom of Alibaba's broader Qwen 3.5 lineup, which extends all the way up to a 397B-A17B Mixture-of-Experts flagship.

But "small" is misleading. These models are natively multimodal — they process text, images, and video within the same latent space from the earliest layers of training. No adapter bolted on after the fact. No separate vision encoder doing its own thing. It's a unified architecture from the ground up.

TL;DR: Four models, 0.8B–9B params, native text+image+video, 262K context, Apache 2.0. The 9B punches well above its weight class.

The Architecture That Makes It Work

The secret sauce is Gated DeltaNet hybrid attention — a 3:1 ratio of linear attention to full attention layers. This gives the models near-linear scaling for long sequences while keeping the quality benefits of full attention where it counts.
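As a rough illustration (the layer count and exact ordering below are assumptions for the sketch, not published internals), a 3:1 hybrid stack can be expressed as a repeating schedule of three linear-attention layers followed by one full-attention layer:

```python
def hybrid_schedule(num_layers: int, linear_per_full: int = 3) -> list[str]:
    """Build a repeating linear/full attention layer schedule.

    Assumes the 3:1 ratio means blocks of three linear-attention layers
    followed by one full-attention layer; Qwen 3.5's real interleaving
    may differ.
    """
    block = ["linear"] * linear_per_full + ["full"]
    return [block[i % len(block)] for i in range(num_layers)]

schedule = hybrid_schedule(32)
print(schedule[:8])  # ['linear', 'linear', 'linear', 'full', 'linear', 'linear', 'linear', 'full']
print(schedule.count("full"), "full-attention layers out of", len(schedule))
```

Only a quarter of the layers pay the quadratic attention cost, which is where the near-linear long-sequence scaling comes from.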

Key technical specs across the family:

  • Attention: Gated DeltaNet hybrid (3:1 linear-to-full)
  • Context length: 262,144 tokens native (1M extended on 9B)
  • Vocabulary: 248K tokens across 201 languages
  • Training data: trillions of multimodal tokens
  • Prediction: multi-token prediction
  • License: Apache 2.0

Multi-token prediction gives the model a head that drafts several future tokens per step, which a single verification pass then confirms, speculative-decoding style. On serving stacks that support it, this translates to meaningful inference speedups without changing output quality.
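A toy sketch of why drafting helps (this is the generic speculative-decoding acceptance idea, not Qwen's actual implementation): cheap draft tokens are checked against one full-model pass, and the longest agreeing prefix is accepted in a single step.

```python
from typing import Callable

def accept_draft(draft: list[int], verify: Callable[[list[int]], list[int]]) -> list[int]:
    """Accept the longest prefix of `draft` that the verifier agrees with.

    `verify` stands in for one full-model forward pass returning the token
    the base model would have produced at each drafted position.
    """
    verified = verify(draft)
    accepted = []
    for drafted, checked in zip(draft, verified):
        if drafted != checked:
            break  # first disagreement ends the accepted prefix
        accepted.append(drafted)
    return accepted

# Toy verifier: agrees with the draft everywhere except position 3.
reference = [10, 11, 12, 99, 14]
print(accept_draft([10, 11, 12, 13, 14], lambda d: reference))  # [10, 11, 12]
```

When the draft head is accurate, several tokens land per verification pass instead of one, which is the entire speedup.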

Benchmarks: The 9B Punches Way Above Its Weight

The headline number: Qwen3.5-9B scores 81.7 on GPQA Diamond, beating OpenAI's gpt-oss-120B at 80.1. That's a 9-billion-parameter model edging out one that's more than 13x its size on a graduate-level reasoning benchmark — a narrow but meaningful win.

More benchmark highlights for the 9B:

  • MMLU-Pro: 82.5 (outperforms prior Qwen3-30B, which is 3x larger)
  • LongBench v2: 55.2
  • Matches Qwen3-80B in several evaluation categories

The smaller models follow a similar pattern of overperforming their parameter class, though the 9B is the standout. Alibaba credits the hybrid attention architecture and the aggressive multimodal pre-training for these results.

Running It Locally

Getting Qwen 3.5 Small running is straightforward. All models are available on Hugging Face and ModelScope. For the 9B, you'll want a GPU with at least 16GB of VRAM for comfortable inference, though quantized builds can run on less.
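A back-of-the-envelope weight-memory estimate explains the 16GB guidance (these are approximations for the weights alone; KV cache and activations add overhead on top, which is why fp16 is tight on a 16GB card and quantized builds matter):

```python
def weight_gib(params_billion: float, bits_per_param: int) -> float:
    """Approximate GiB needed just to hold the model weights."""
    total_bytes = params_billion * 1e9 * bits_per_param / 8
    return total_bytes / 2**30

for bits in (16, 8, 4):
    print(f"9B @ {bits}-bit ~ {weight_gib(9, bits):.1f} GiB")
```

Roughly 16.8 GiB at 16-bit, 8.4 GiB at 8-bit, and 4.2 GiB at 4-bit, before any KV cache.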

Using vLLM:

pip install vllm
vllm serve Qwen/Qwen3.5-9B-Instruct --max-model-len 262144

Using SGLang:

pip install sglang
python -m sglang.launch_server --model Qwen/Qwen3.5-9B-Instruct
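Both servers expose an OpenAI-compatible chat endpoint, so any OpenAI-style client works by pointing at the local base URL (vLLM defaults to port 8000; SGLang to 30000 — check your launch output). A minimal stdlib-only sketch, with the request body split out so it can be inspected without a running server:

```python
import json
import urllib.request

def chat_payload(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(base_url: str, payload: dict) -> str:
    """POST to a local OpenAI-compatible server and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

payload = chat_payload("Qwen/Qwen3.5-9B-Instruct", "Summarize Gated DeltaNet in one line.")
# print(chat("http://localhost:8000", payload))  # uncomment with a server running
```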

The base (non-instruct) versions are also available for fine-tuning, which is where the Apache 2.0 license really shines — no revenue caps, no usage restrictions.

What's Actually New vs. Qwen 3

If you've been following Alibaba's model releases, you might wonder what changed. Three things stand out:

  1. Native multimodal from day one — Qwen 3 added vision capabilities through adapters. Qwen 3.5 trains vision and language together, achieving what Alibaba claims is "near-100% multimodal training efficiency compared to text-only training."

  2. Gated DeltaNet architecture — This replaces the standard transformer attention with a hybrid approach that's significantly more efficient at long contexts. The 262K native context window is a direct result.

  3. Reinforcement learning at scale — Alibaba trained these models using RL across "million-agent environments," which is a new approach for models this size.
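To make point 2 concrete, here's a rough KV-cache comparison at the 262K context. All dimensions are illustrative assumptions, not Qwen's published config; the key property is that linear-attention layers keep constant-size state instead of a per-token cache, so only the full-attention layers are counted:

```python
def kv_cache_gib(full_layers: int, seq_len: int, kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """KV cache for the full-attention layers only: K and V per token."""
    total = 2 * full_layers * seq_len * kv_heads * head_dim * bytes_per_value
    return total / 2**30

seq, layers = 262_144, 32
print(f"all-full ({layers} layers):        {kv_cache_gib(layers, seq):.0f} GiB")
print(f"3:1 hybrid ({layers // 4} full layers): {kv_cache_gib(layers // 4, seq):.0f} GiB")
```

At these assumed dimensions the hybrid cuts the long-context cache by 4x (32 GiB down to 8 GiB), which is the intuition behind shipping a 262K native window in a "small" model.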

What's Missing

Let's be real about the limitations:

  • Video understanding has constraints — While the 4B and 9B handle video natively, complex temporal reasoning across long videos remains inconsistent
  • The 0.8B and 2B are limited — Useful for classification and simple tasks, but don't expect strong reasoning from the smallest models
  • Multimodal generation is not included — These models understand images and video but don't generate them
  • Community ecosystem is still growing — The GitHub repo is young, with a fraction of the stars of Meta's Llama or Mistral's repos. Tooling and fine-tuning recipes are still catching up.

Why This Matters for On-Device AI

The real story isn't another benchmark leaderboard win. It's that genuinely useful multimodal AI now fits on a laptop GPU — or even a phone, for the smaller variants.

The 4B model can run on a modern smartphone's NPU. The 9B fits comfortably on a single consumer GPU. And because it's Apache 2.0, you can ship it in a product without worrying about licensing landmines.

This is the kind of release that moves the industry toward a world where AI isn't just an API call — it's a local capability embedded in the device you're holding.

The Bottom Line

Qwen 3.5 Small is the most capable small-model family available under a permissive license right now. The 9B variant, in particular, offers reasoning and multimodal capabilities that were flagship-model territory twelve months ago.

If you're building any kind of on-device AI — mobile apps, embedded systems, edge inference, private document processing — this should be at the top of your evaluation list. The combination of Apache 2.0 licensing, native multimodal support, and genuine benchmark performance makes it hard to ignore.

The big question is whether the community will rally around it the way they have around Llama and Mistral. The technical foundation is there. Now it needs the ecosystem to match.

Qwen 3.5 Alibaba Open Source AI On-Device AI Multimodal Small Language Models