Moondream 3: The 9B Vision Model That Runs Like a 2B

Moondream 3's MoE architecture delivers 9B-class vision understanding with 2B inference costs and state-of-the-art segmentation.

Marcus Rivera
Apr 1, 2026

Most vision-language models force you to choose: power or efficiency. Moondream 3 says you can have both. This mixture-of-experts model packs 9 billion total parameters but only activates 2 billion per token — delivering frontier-level visual reasoning at speeds comparable to models a fraction of its size.

Architecture That Matters

Moondream 3 is built using a technique called drop upcycling, starting from the dense Moondream 2 (2B) model and scaling it into a full MoE architecture. The result is a 24-layer model where the first 4 layers are dense and the remaining 20 use MoE feed-forward networks with 64 experts per layer, 8 activated per token.

Key architectural details:

Component          Specification
-----------------  --------------------------------------------------
Total Parameters   9B
Active Parameters  2B per token
Architecture       MoE with GeGLU FFNs
Experts            64 per MoE layer, 8 active
Context Window     32K tokens (up from 2K)
Vision Encoder     SigLIP-based with multi-crop channel concatenation
Tokenizer          Custom SuperBPE
Tensor Type        BF16

The 32K context window is a massive upgrade from Moondream 2's 2K limit, enabling the model to handle complex multi-turn conversations about images and process much longer structured outputs.

Four Core Skills

Moondream 3 ships with four built-in vision skills, each accessible through a clean Python API:

1. Query (Visual Question Answering) — Ask open-ended questions about images. The model runs in reasoning mode by default, where it thinks about the question before generating an answer, with each reasoning chunk grounded to specific image regions.

2. Caption — Generate image descriptions at three lengths (short, normal, long), all with streaming support.

3. Point Detection — Identify and locate objects with normalized coordinates. Feed it "person wearing a red shirt" and get back precise (x, y) points.

4. Object Detection — Full bounding-box detection with normalized coordinates and configurable maximum object counts.

All four skills support streaming output and custom inference settings (temperature, top_p, max_tokens).
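Since the point and detection skills return normalized coordinates in [0, 1], mapping them back to pixel space is a small post-processing step. Here is a minimal sketch; the list-of-dicts `{"x": ..., "y": ...}` shape is an assumption for illustration, not a documented output schema:

```python
def to_pixels(points, width, height):
    """Convert normalized (x, y) points in [0, 1] to integer pixel coordinates."""
    return [(round(p["x"] * width), round(p["y"] * height)) for p in points]

# Hypothetical point-detection output for a 1280x720 image.
detections = [{"x": 0.5, "y": 0.25}, {"x": 0.125, "y": 0.9}]
print(to_pixels(detections, 1280, 720))  # → [(640, 180), (160, 648)]
```

The same scaling applies to bounding boxes; multiply each normalized corner by the image dimensions.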

Getting Started

Moondream 3 runs via Hugging Face Transformers:

import torch
from transformers import AutoModelForCausalLM

moondream = AutoModelForCausalLM.from_pretrained(
    "moondream/moondream3-preview",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map={"":"cuda"},
)
moondream.compile()  # Critical for fast FlexAttention decoding

From there, querying an image is straightforward:

from PIL import Image

image = Image.open("photo.jpg")
result = moondream.query(image=image, question="What's in this image?")
print(result["answer"])

The model also supports image encoding caching — encode once, query multiple times — which is essential for batch processing workflows.
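The caching pattern amounts to memoizing the expensive encode step and reusing the result across queries. A sketch of the idea with a pluggable encoder; the `encode` callable below is a stand-in for the model's actual image-encoding call, which is an assumption here:

```python
class EncodingCache:
    """Memoize expensive image encodings so repeated queries reuse one encode."""

    def __init__(self, encode):
        self.encode = encode   # stand-in for the model's image-encoding call
        self.cache = {}
        self.misses = 0

    def get(self, key, image):
        if key not in self.cache:   # encode only the first time this key is seen
            self.cache[key] = self.encode(image)
            self.misses += 1
        return self.cache[key]

# Demo with a dummy encoder; in practice, plug in the real model call.
cache = EncodingCache(encode=lambda img: f"encoded:{img}")
for _ in range(3):
    enc = cache.get("photo.jpg", "photo.jpg")
print(cache.misses)  # → 1 (three queries, one encode)
```

Keying the cache by file path (or a content hash) is what makes batch workflows cheap: N questions about one image cost one encode plus N lightweight decodes.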

Segmentation Update: 40% Faster

A March 2026 segmentation update brought significant improvements across the board:

Benchmark      Previous  Current  Improvement
-------------  --------  -------  -----------
RefCOCO Val    81.8      83.2     +1.4
RefCOCO+ Val   74.7      79.1     +4.4
RefCOCOg Val   76.4      80.7     +4.3
RefCOCO-M      86.9      88.2     +1.3

All scores are mIoU (mean Intersection over Union).
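For readers unfamiliar with the metric: mIoU averages, over the evaluation set, the intersection-over-union between each predicted mask and its ground truth. A quick illustration with boolean masks in NumPy:

```python
import numpy as np

def iou(pred, gt):
    """Intersection over union of two boolean masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0

# Two 4x4 masks of 8 pixels each, overlapping on one row (4 pixels).
pred = np.zeros((4, 4), dtype=bool); pred[0:2, :] = True
gt = np.zeros((4, 4), dtype=bool);   gt[1:3, :] = True
print(iou(pred, gt))  # overlap 4, union 12 → 0.3333...
```

mIoU is simply this quantity averaged across all examples in the benchmark.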

The model produces native SVG masks (vectors, not bitmasks) and handles complex referring expressions like "the person touching the door." Compared to SAM 3, Moondream handles complex prompts natively without requiring additional reasoning models, and reportedly operates at a 5x lower cost.

Licensing: Read the Fine Print

Moondream 3 uses the Business Source License 1.1, which is not a traditional OSI-approved open-source license. Here is what you need to know:

Free to use (no agreement needed): internal company use including production, personal projects, research, benchmarks, fine-tunes, merges, quantizations, and products that do not compete with M87 Labs' paid offerings.

Requires a commercial agreement: selling hosted vision/AI APIs, managed hosting or MLaaS, paid SDKs bundling the model, or B2B computer vision APIs competing with M87 Labs.

For most developers building internal tools or products that use vision capabilities rather than selling vision as a service, the license is permissive enough.

The Ecosystem

Moondream has built meaningful traction: 9,000+ GitHub stars and adoption by over 10,000 developers according to the project's website. The model is available through multiple deployment paths — Moondream Cloud (with $5/month free credits, no card required), the local Photon inference engine for fast local processing, and community integrations with Ollama, LM Studio, and llama.cpp.

The Bottom Line

Moondream 3 is the most compelling small-footprint vision model available today. The MoE architecture means you get 9B worth of knowledge with 2B inference costs, the skill-based API is clean and practical, and the segmentation capabilities are genuinely state-of-the-art. The BSL license is worth reading carefully, but for the vast majority of use cases, this is a production-ready vision model that punches well above its weight.