Most vision-language models force you to choose: power or efficiency. Moondream 3 says you can have both. This mixture-of-experts model packs 9 billion total parameters but only activates 2 billion per token — delivering frontier-level visual reasoning at speeds comparable to models a fraction of its size.
Architecture That Matters
Moondream 3 is built using a technique called drop upcycling, starting from the dense Moondream 2 (2B) model and scaling it into a full MoE architecture. The result is a 24-layer model where the first 4 layers are dense and the remaining 20 use MoE feed-forward networks with 64 experts per layer, 8 activated per token.
Key architectural details:
| Component | Specification |
|---|---|
| Total Parameters | 9B |
| Active Parameters | 2B per token |
| Architecture | MoE with GeGLU FFNs |
| Experts | 64 per MoE layer, 8 active |
| Context Window | 32K tokens (up from 2K) |
| Vision Encoder | SigLIP-based with multi-crop channel concatenation |
| Tokenizer | Custom SuperBPE |
| Tensor Type | BF16 |
The 32K context window is a sixteen-fold jump from Moondream 2's 2K limit, letting the model handle long multi-turn conversations about images and emit much longer structured outputs.
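The table above also explains the headline efficiency claim: in each of the 20 MoE layers, only 8 of 64 expert FFNs fire per token, so only a small slice of the expert weights is ever active. A back-of-the-envelope sketch (the percentages are routing arithmetic from the table, not Moondream's published parameter breakdown):

```python
# Routing math behind "9B total, 2B active": per MoE layer,
# 8 of 64 expert FFNs run for any given token.
experts_total = 64
experts_active = 8

# Fraction of each MoE layer's expert parameters used per token.
active_fraction = experts_active / experts_total
print(f"{active_fraction:.1%} of expert FFN weights active per token")  # → 12.5%
```

The dense first four layers and shared components (attention, embeddings, vision encoder) always run, which is why the active count lands at roughly 2B rather than a strict 12.5% of 9B.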
Four Core Skills
Moondream 3 ships with four built-in vision skills, each accessible through a clean Python API:
1. Query (Visual Question Answering) — Ask open-ended questions about images. The model runs in reasoning mode by default, where it thinks about the question before generating an answer, with each reasoning chunk grounded to specific image regions.
2. Caption — Generate image descriptions at three lengths (short, normal, long), all with streaming support.
3. Point Detection — Identify and locate objects with normalized coordinates. Feed it "person wearing a red shirt" and get back precise (x, y) points.
4. Object Detection — Full bounding-box detection with normalized coordinates and configurable maximum object counts.
All four skills support streaming output and custom inference settings (temperature, top_p, max_tokens).
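Since the point and detection skills return normalized coordinates, mapping results back to pixel space is a small but easy-to-fumble step. A minimal sketch (the dict shapes are assumptions based on the normalized-coordinate description above, not a documented output schema):

```python
def to_pixels(point, width, height):
    """Scale a normalized (x, y) point in [0, 1] into pixel coordinates."""
    return (round(point["x"] * width), round(point["y"] * height))

def box_to_pixels(box, width, height):
    """Scale a normalized bounding box into pixel coordinates."""
    return (
        round(box["x_min"] * width),
        round(box["y_min"] * height),
        round(box["x_max"] * width),
        round(box["y_max"] * height),
    )

# A point at the horizontal center, a quarter of the way down a 1080p frame.
print(to_pixels({"x": 0.5, "y": 0.25}, 1920, 1080))  # → (960, 270)
```

Rounding (rather than truncating) keeps points centered on the pixels they actually refer to.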
Getting Started
Moondream 3 runs via Hugging Face Transformers:
```python
import torch
from transformers import AutoModelForCausalLM

moondream = AutoModelForCausalLM.from_pretrained(
    "moondream/moondream3-preview",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map={"": "cuda"},
)
moondream.compile()  # critical for fast FlexAttention decoding
```
From there, querying an image is straightforward:
```python
from PIL import Image

image = Image.open("photo.jpg")
result = moondream.query(image=image, question="What's in this image?")
print(result["answer"])
```
The model also supports image encoding caching — encode once, query multiple times — which is essential for batch processing workflows.
Segmentation Update: 40% Faster
A March 2026 segmentation update brought significant improvements across the board:
| Benchmark | Previous | Current | Improvement |
|---|---|---|---|
| RefCOCO Val | 81.8 | 83.2 | +1.4 |
| RefCOCO+ Val | 74.7 | 79.1 | +4.4 |
| RefCOCOg Val | 76.4 | 80.7 | +4.3 |
| RefCOCO-M | 86.9 | 88.2 | +1.3 |
All scores are mIoU (mean Intersection over Union).
The model produces native SVG masks (vectors, not bitmasks) and handles complex referring expressions like "the person touching the door." Compared to SAM 3, Moondream handles complex prompts natively without requiring additional reasoning models, and reportedly operates at a 5x lower cost.
Licensing: Read the Fine Print
Moondream 3 uses the Business Source License 1.1, which is not a traditional OSI-approved open-source license. Here is what you need to know:
Free to use (no agreement needed): internal company use including production, personal projects, research, benchmarks, fine-tunes, merges, quantizations, and products that do not compete with M87 Labs' paid offerings.
Requires a commercial agreement: selling hosted vision/AI APIs, managed hosting or MLaaS, paid SDKs bundling the model, or B2B computer vision APIs competing with M87 Labs.
For most developers building internal tools or products that use vision capabilities rather than selling vision as a service, the license is permissive enough.
The Ecosystem
Moondream has built meaningful traction: 9,000+ GitHub stars and adoption by over 10,000 developers according to the project's website. The model is available through multiple deployment paths — Moondream Cloud (with $5/month free credits, no card required), the local Photon inference engine for fast local processing, and community integrations with Ollama, LM Studio, and llama.cpp.
The Bottom Line
Moondream 3 is the most compelling small-footprint vision model available today. The MoE architecture means you get 9B worth of knowledge with 2B inference costs, the skill-based API is clean and practical, and the segmentation capabilities are genuinely state-of-the-art. The BSL license is worth reading carefully, but for the vast majority of use cases, this is a production-ready vision model that punches well above its weight.


