Moondream 3: The 9B Vision Model That Runs Like a 2B

Moondream 3's MoE architecture delivers 9B-class vision understanding with 2B inference costs and state-of-the-art segmentation.

Marcus Rivera
Apr 1, 2026

Most vision-language models force you to choose: power or efficiency. Moondream 3 says you can have both. This mixture-of-experts model packs 9 billion total parameters but only activates 2 billion per token — delivering frontier-level visual reasoning at speeds comparable to models a fraction of its size.

Architecture That Matters

Moondream 3 is built using a technique called drop upcycling, starting from the dense Moondream 2 (2B) model and scaling it into a full MoE architecture. The result is a 24-layer model where the first 4 layers are dense and the remaining 20 use MoE feed-forward networks with 64 experts per layer, 8 activated per token.

Key architectural details:

Component          Specification
-----------------  --------------------------------------------------
Total Parameters   9B
Active Parameters  2B per token
Architecture       MoE with GeGLU FFNs
Experts            64 per MoE layer, 8 active
Context Window     32K tokens (up from 2K)
Vision Encoder     SigLIP-based with multi-crop channel concatenation
Tokenizer          Custom SuperBPE
Tensor Type        BF16

The 32K context window is a massive upgrade from Moondream 2's 2K limit, enabling the model to handle complex multi-turn conversations about images and process much longer structured outputs.

Four Core Skills

Moondream 3 ships with four built-in vision skills, each accessible through a clean Python API:

1. Query (Visual Question Answering) — Ask open-ended questions about images. The model runs in reasoning mode by default, where it thinks about the question before generating an answer, with each reasoning chunk grounded to specific image regions.

2. Caption — Generate image descriptions at three lengths (short, normal, long), all with streaming support.

3. Point Detection — Identify and locate objects with normalized coordinates. Feed it "person wearing a red shirt" and get back precise (x, y) points.

4. Object Detection — Full bounding-box detection with normalized coordinates and configurable maximum object counts.

All four skills support streaming output and custom inference settings (temperature, top_p, max_tokens).
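Since the point and detection skills return normalized coordinates in [0, 1], mapping them back to pixel space is a small post-processing step. Here is a minimal sketch; the list-of-dicts `{"x": ..., "y": ...}` shape is an assumption for illustration, not a documented output schema:

```python
def to_pixels(points, width, height):
    """Convert normalized (x, y) points in [0, 1] to integer pixel coordinates."""
    return [(round(p["x"] * width), round(p["y"] * height)) for p in points]

# Hypothetical point-detection output for a 1280x720 image.
detections = [{"x": 0.5, "y": 0.25}, {"x": 0.125, "y": 0.9}]
print(to_pixels(detections, 1280, 720))  # → [(640, 180), (160, 648)]
```

The same scaling applies to bounding boxes; multiply each normalized corner by the image dimensions.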

Getting Started

Moondream 3 runs via Hugging Face Transformers:

import torch
from transformers import AutoModelForCausalLM

moondream = AutoModelForCausalLM.from_pretrained(
    "moondream/moondream3-preview",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map={"":"cuda"},
)
moondream.compile()  # Critical for fast FlexAttention decoding

From there, querying an image is straightforward:

from PIL import Image

image = Image.open("photo.jpg")
result = moondream.query(image=image, question="What's in this image?")
print(result["answer"])

The model also supports image encoding caching — encode once, query multiple times — which is essential for batch processing workflows.
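The caching pattern amounts to memoizing the expensive encode step and reusing the result across queries. A sketch of the idea with a pluggable encoder; the `encode` callable below is a stand-in for the model's actual image-encoding call, which is an assumption here:

```python
class EncodingCache:
    """Memoize expensive image encodings so repeated queries reuse one encode."""

    def __init__(self, encode):
        self.encode = encode   # stand-in for the model's image-encoding call
        self.cache = {}
        self.misses = 0

    def get(self, key, image):
        if key not in self.cache:   # encode only the first time this key is seen
            self.cache[key] = self.encode(image)
            self.misses += 1
        return self.cache[key]

# Demo with a dummy encoder; in practice, plug in the real model call.
cache = EncodingCache(encode=lambda img: f"encoded:{img}")
for _ in range(3):
    enc = cache.get("photo.jpg", "photo.jpg")
print(cache.misses)  # → 1 (three queries, one encode)
```

Keying the cache by file path (or a content hash) is what makes batch workflows cheap: N questions about one image cost one encode plus N lightweight decodes.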

Segmentation Update: 40% Faster

A March 2026 segmentation update brought significant improvements across the board:

Benchmark      Previous  Current  Improvement
-------------  --------  -------  -----------
RefCOCO Val    81.8      83.2     +1.4
RefCOCO+ Val   74.7      79.1     +4.4
RefCOCOg Val   76.4      80.7     +4.3
RefCOCO-M      86.9      88.2     +1.3

All scores are mIoU (mean Intersection over Union).
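For readers unfamiliar with the metric: mIoU averages, over the evaluation set, the intersection-over-union between each predicted mask and its ground truth. A quick illustration with boolean masks in NumPy:

```python
import numpy as np

def iou(pred, gt):
    """Intersection over union of two boolean masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0

# Two 4x4 masks of 8 pixels each, overlapping on one row (4 pixels).
pred = np.zeros((4, 4), dtype=bool); pred[0:2, :] = True
gt = np.zeros((4, 4), dtype=bool);   gt[1:3, :] = True
print(iou(pred, gt))  # overlap 4, union 12 → 0.3333...
```

mIoU is simply this quantity averaged across all examples in the benchmark.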

The model produces native SVG masks (vectors, not bitmasks) and handles complex referring expressions like "the person touching the door." Compared to SAM 3, Moondream handles complex prompts natively without requiring additional reasoning models, and reportedly operates at a 5x lower cost.

Licensing: Read the Fine Print

Moondream 3 uses the Business Source License 1.1, which is not a traditional OSI-approved open-source license. Here is what you need to know:

Free to use (no agreement needed): internal company use including production, personal projects, research, benchmarks, fine-tunes, merges, quantizations, and products that do not compete with M87 Labs' paid offerings.

Requires a commercial agreement: selling hosted vision/AI APIs, managed hosting or MLaaS, paid SDKs bundling the model, or B2B computer vision APIs competing with M87 Labs.

For most developers building internal tools or products that use vision capabilities rather than selling vision as a service, the license is permissive enough.

The Ecosystem

Moondream has built meaningful traction: 9,000+ GitHub stars and adoption by over 10,000 developers according to the project's website. The model is available through multiple deployment paths — Moondream Cloud (with $5/month free credits, no card required), the local Photon inference engine for fast local processing, and community integrations with Ollama, LM Studio, and llama.cpp.

The Bottom Line

Moondream 3 is the most compelling small-footprint vision model available today. The MoE architecture means you get 9B worth of knowledge with 2B inference costs, the skill-based API is clean and practical, and the segmentation capabilities are genuinely state-of-the-art. The BSL license is worth reading carefully, but for the vast majority of use cases, this is a production-ready vision model that punches well above its weight.