Google DeepMind dropped Vision Banana on April 22, 2026, and it just rewrote a rule the computer vision world has lived by for fifteen years: that you need a specialist model for every perception task. One generalist image generator now beats SAM 3 on segmentation, beats Depth Anything V3 on metric depth, and beats Lotus-2 on surface normals — without giving up its ability to generate images in the first place.
It's an audacious result, and the way they got there is even stranger than the benchmarks.
The trick: treat every vision task as image generation
Most perception models — segmentation, depth, normals, edges — bolt task-specific decoder heads onto a vision backbone. Each head is a separate piece of engineering with its own loss function, its own training data, its own evaluation harness. The Vision Banana paper, "Image Generators are Generalist Vision Learners" (arXiv:2604.20329), throws that whole approach out.
Instead, the team — led by Kaiming He and Saining Xie — took Nano Banana Pro (Google's frontier text-to-image model) and instruction-tuned it on a lightweight mix of vision task data. Every output, regardless of task, is rendered as an RGB image following a precise, invertible color scheme. A segmentation mask is just a colored picture. A depth map is a colored picture. Surface normals? Also a colored picture, where:
- Facing-left normals encode as pinkish-red.
- Facing-up normals encode as light green.
- Normals pointing toward the camera encode as light blue/purple.
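For intuition, here is a minimal sketch of one invertible mapping that produces those colors, with the x axis negated so left-facing normals land in the pinkish-red range. The paper's exact convention isn't public, so the axis signs here are an assumption.

```python
import numpy as np

def encode_normals(normals: np.ndarray) -> np.ndarray:
    """Map unit surface normals (H, W, 3) to an RGB image in [0, 255].

    Assumed convention: x points right, y points up, z points toward the
    camera. Each component in [-1, 1] maps linearly to [0, 1], with x
    negated so left-facing normals come out pinkish-red.
    """
    n = normals / np.linalg.norm(normals, axis=-1, keepdims=True)
    rgb = np.stack([(1.0 - n[..., 0]) / 2.0,   # facing left   -> red high
                    (1.0 + n[..., 1]) / 2.0,   # facing up     -> green high
                    (1.0 + n[..., 2]) / 2.0],  # facing camera -> blue high
                   axis=-1)
    return (rgb * 255.0).round().astype(np.uint8)

def decode_normals(rgb: np.ndarray) -> np.ndarray:
    """Invert encode_normals: RGB image back to unit normals."""
    c = rgb.astype(np.float32) / 255.0
    n = np.stack([1.0 - 2.0 * c[..., 0],
                  2.0 * c[..., 1] - 1.0,
                  2.0 * c[..., 2] - 1.0], axis=-1)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)
```

Under this convention a left-facing normal (-1, 0, 0) encodes to (255, 128, 128), an up-facing normal to (128, 255, 128), and a camera-facing normal to (128, 128, 255) — pinkish-red, light green, and light blue, matching the description above.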
At evaluation time, those colored pictures are decoded back into quantitative predictions. No new heads. No new losses. Just generation.
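For segmentation, that decoding step amounts to snapping each generated pixel to the nearest entry in a fixed class palette. A minimal sketch, using a few standard Cityscapes colors as a stand-in palette since the paper's own scheme isn't published:

```python
import numpy as np

# Stand-in class palette (Cityscapes-style colors); the paper defines its own
# precise, invertible scheme, so treat these values as placeholders.
PALETTE = np.array([
    [128,  64, 128],   # road
    [ 70,  70,  70],   # building
    [220,  20,  60],   # person
    [  0,   0, 142],   # car
], dtype=np.float32)

def decode_segmentation(generated_rgb: np.ndarray) -> np.ndarray:
    """Snap each generated pixel (H, W, 3) to its nearest palette color
    and return a (H, W) map of class indices."""
    pixels = generated_rgb.reshape(-1, 3).astype(np.float32)
    # Squared distance from every pixel to every palette entry: (N, K).
    dists = ((pixels[:, None, :] - PALETTE[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1).reshape(generated_rgb.shape[:2])
```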
The numbers that matter
The headline benchmarks are not close calls:
| Task | Benchmark / metric | Vision Banana | Prior SOTA |
|---|---|---|---|
| Semantic segmentation | Cityscapes mIoU | 0.699 | SAM 3 — 0.652 |
| Metric depth estimation | δ1 accuracy | 0.929 | Depth Anything V3 — 0.918 |
| Surface normal estimation | — | Beats | Lotus-2 |
A 4.7-point mIoU gain over SAM 3 is not noise. SAM 3 was Meta's heavily-tuned successor to the segmentation model that defined the category. Vision Banana — a general image generator with a vision instruction-tuning pass on top — walks past it.
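For context, mIoU averages per-class intersection-over-union between predicted and ground-truth label maps, so the 0.047 absolute gap in the table is the 4.7 points quoted here. A standard implementation, operating on the decoded class maps, looks like this:

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """Mean intersection-over-union for integer class maps of equal shape."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                      # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))
```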
The depth result is arguably stranger. No camera intrinsics or extrinsics are required at training or inference time. The model infers absolute scale "purely from visual cues and world knowledge embedded during pretraining." And the depth training data is entirely synthetic, rendered from simulation engines, with zero real-world depth. It still beats Depth Anything V3 on real-world δ1.
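For reference, δ1 is the standard depth-accuracy metric: the fraction of valid pixels whose predicted depth is within a factor of 1.25 of the ground truth, in either direction. A minimal sketch, assuming the decoded prediction and ground truth are metric depths in the same units:

```python
import numpy as np

def delta1(pred_depth: np.ndarray, gt_depth: np.ndarray) -> float:
    """Fraction of valid pixels with max(pred/gt, gt/pred) < 1.25."""
    valid = gt_depth > 0                   # ignore pixels without ground truth
    ratio = np.maximum(pred_depth[valid] / gt_depth[valid],
                       gt_depth[valid] / pred_depth[valid])
    return float((ratio < 1.25).mean())
```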
Why this changes the game for vision teams
The implication is not that SAM 3 is bad. It is that the inductive biases of image-generation pretraining transfer remarkably well to perception.
For years the working theory was that contrastive pretraining (CLIP) and self-supervised pretraining (DINO's self-distillation, MAE's masked image modeling) were the right priors for vision backbones. Generation was a different lane. Vision Banana suggests generation may actually be the better prior — because to generate a realistic image, the model has to internalize 3D geometry, lighting, occlusion, and material properties in ways that contrastive models never need to.
If that holds up under replication, every vision team running a stack of specialist heads on a frozen backbone now has a serious decision to make: do we keep maintaining four task-specific models, or do we move to one instruction-tuned generator?
What's still unclear
A few things the paper doesn't fully resolve:
- Inference cost. Generating a 1024×1024 RGB output for a depth map is more expensive than running a regression head on a feature map. The paper claims competitive throughput, but production deployments will need their own numbers.
- Long-tail behavior. SAM 3 was famous for working on objects it had never seen. The Vision Banana evaluations focus on common benchmarks (Cityscapes, NYU, ScanNet). Open-world segmentation comparisons are thinner.
- Open weights. The paper is published; the model is not. As of the April 25 MarkTechPost coverage, Vision Banana exists as a research artifact behind Google DeepMind's wall. Reproducing it requires Nano Banana Pro as a base, and that is closed.
The Bottom Line
Vision Banana is the most interesting computer vision result of 2026 so far — not because the benchmark numbers are huge (they're solid, not ridiculous), but because of what won. A generalist image generator, lightly fine-tuned, beat models that were purpose-built and tuned for a single task across years of work. If image generation is now the strongest pretraining signal for perception, the next-generation vision foundation model is going to look a lot less like a backbone with task-specific heads and a lot more like Stable Diffusion with an instruction-tuning pass on top.
Specialist vision models had a good run. The generalists are coming for that crown.


