You've probably used apps that can automatically select a person in a photo. Now imagine doing that for an entire city from a satellite image.
That's aerial segmentation. The newest tool for this job is SAM 3, and it works very differently from the original SAM. Let's peek under the hood and understand SAM 3 vs SAM for aerial segmentation in a way that makes sense.
What is the "Perception Encoder" and why does it matter for satellite images?
The Perception Encoder is SAM 3's special "brain" for seeing. Unlike the original SAM which only understood images, this brain was trained on 5.4 billion image‑text pairs – meaning it learned to connect words like "round building" with what round buildings actually look like from above.
For aerial segmentation, this means it can find things you describe, even if it's never seen that exact satellite photo before.
Think of it like this:
- Original SAM = someone who only learned to trace shapes
- SAM 3 = someone who read a giant encyclopedia of pictures WITH captions
When you say "find all the circular storage tanks in this oil field", SAM 3's Perception Encoder already knows what those tanks look like from thousands of examples. This is why zero-shot performance on remote sensing datasets is so much better with SAM 3.
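Under the hood, this text-to-region matching boils down to comparing embeddings: the encoder maps both the phrase and image regions into the same vector space, and similar vectors mean a match. Here's a minimal NumPy sketch of the idea – the vectors and region names are toy values, not real Perception Encoder outputs:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings standing in for Perception Encoder outputs (illustrative only).
text_emb = np.array([0.9, 0.1, 0.0])        # embedding of "circular storage tank"
region_embs = {
    "tank": np.array([0.8, 0.2, 0.1]),      # region that looks like a tank
    "road": np.array([0.1, 0.9, 0.3]),      # region that looks like a road
}

# Score every candidate region against the text prompt.
scores = {name: cosine_similarity(text_emb, emb) for name, emb in region_embs.items()}
best = max(scores, key=scores.get)          # → "tank"
```

The real model works on millions of dense image features rather than two named regions, but the matching principle is the same.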
How does SAM 3 separate "stuff" from "things" in aerial views?
In aerial images, some elements are individual objects (cars, houses, trees) – called "things". Others are continuous surfaces (grass, water, pavement) – called "stuff".
SAM 3 has TWO different ways to handle these:
- Instance head for counting individual objects
- Semantic head for coloring large areas
It then merges the two outputs into a single map. The original SAM mostly handled "things" and struggled with "stuff".
This matters a lot for drone mapping. Imagine you're mapping a park: you want to count each individual tree (instance segmentation) BUT also outline the whole grassy area (semantic segmentation). SAM 3 does both at the same time.
The technical papers (Li et al. Dec 2025) call this "fusion of semantic and instance predictions" – they take the maximum confidence from both heads to create one combined map.
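Here's a minimal sketch of that max-confidence fusion using toy NumPy confidence maps – the values and the 0.5 threshold are illustrative, not SAM 3's actual numbers:

```python
import numpy as np

# Toy per-pixel confidence maps for one class ("tree") on a 2x2 image.
semantic_conf = np.array([[0.7, 0.2],
                          [0.6, 0.1]])   # semantic head: good at broad regions
instance_conf = np.array([[0.4, 0.9],
                          [0.3, 0.2]])   # instance head: good at individual objects

# Max-confidence fusion: keep whichever head is more confident at each pixel.
fused = np.maximum(semantic_conf, instance_conf)

# Threshold into a binary map (0.5 is an arbitrary illustrative cutoff).
mask = fused > 0.5
```

Each pixel ends up labeled by whichever head was more sure about it, so neither the tree count nor the grassy outline gets lost.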
In tests on eight remote sensing benchmarks, this fusion helped SAM 3 achieve 53.4% mean IoU, while older models that only did one type got stuck around 40%. That's a huge jump in accuracy.
Why does SAM 3 hallucinate less than the original SAM?
Hallucination in AI means seeing things that aren't there – like marking a shadow as a car. SAM 3 has a special "presence head" that acts like a gatekeeper.
How it works:
- First it asks: "Is there ANY car in this entire image?"
- Only if the answer is YES, it proceeds to find each car
This simple trick cuts down false alarms by more than half compared to the original SAM.
The presence head is a tiny part of the model (a learned token) that decides whether the concept appears anywhere in the image. It then shares this confidence score with all the object detectors: if the presence score is low, the whole model holds back.
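Here's a rough sketch of presence gating in plain Python – the function name, the 0.5 threshold, and the multiply-by-presence formulation are illustrative assumptions, not SAM 3's exact mechanism:

```python
def gate_detections(presence_score, detections, presence_threshold=0.5):
    """Suppress per-object detections when the image-level presence score is low.

    presence_score: model's confidence that the concept appears ANYWHERE in the image.
    detections: list of (box, score) candidates for that concept.
    """
    if presence_score < presence_threshold:
        # Concept is probably absent from the whole image: report nothing,
        # instead of hallucinating objects out of shadows and noise.
        return []
    # One common formulation scales each detection score by the presence score.
    return [(box, score * presence_score) for box, score in detections]

# A confident car detection gets dropped entirely when the image-level
# presence score says "probably no cars here at all".
candidates = [((120, 40, 160, 70), 0.8)]
print(gate_detections(0.2, candidates))   # empty: gated out
print(gate_detections(0.9, candidates))   # kept, score scaled down slightly
```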
This is especially useful in aerial images where 90% of the picture might be empty forest – you don't want the AI inventing buildings where there are none.
According to the ablation studies (Carion et al.), presence gating improved accuracy by up to 1.5 points on the SA‑Co benchmark. That might sound small, but in real‑world mapping, fewer false positives means less manual cleanup.
SAM 3 vs SAM for aerial segmentation: the video advantage
Let's compare how they handle drone videos specifically:
| Feature | Original SAM | SAM 3 |
|---|---|---|
| Memory bank | Limited | More tightly coupled with the Perception Encoder |
| Tracking through confusion | Often lost objects | 4 strategies: temporal scores, re-prompting, waiting, boundary fixing |
| Max objects | Separate runs needed | Up to 200 objects at once |
| ID switching | High | Reduced by temporal validation window |
| SA‑V test IoU | <3% | 30.3% |
If you're analyzing drone footage of traffic or wildlife, these differences mean hours saved in manual correction.
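To make the temporal-scores idea concrete, here's a simplified stand-in in Python: a track stays alive only while its recent detection scores average above a threshold. The window size and threshold are illustrative, not the values SAM 3 actually uses:

```python
from collections import deque

class TrackValidator:
    """Keep an object track alive only while its recent scores stay high.

    A simplified stand-in for temporal-score validation in video tracking:
    one bad frame (occlusion, motion blur) doesn't kill the track, but a
    sustained run of low scores does.
    """
    def __init__(self, window=5, threshold=0.4):
        self.scores = deque(maxlen=window)   # rolling window of recent scores
        self.threshold = threshold

    def update(self, score):
        """Record this frame's detection score; return whether the track survives."""
        self.scores.append(score)
        return self.is_alive()

    def is_alive(self):
        return sum(self.scores) / len(self.scores) >= self.threshold

# A track survives one weak frame but dies after repeated weak frames.
track = TrackValidator(window=3, threshold=0.4)
track.update(0.9)   # strong detection: alive
track.update(0.1)   # one bad frame: still alive (average holds up)
track.update(0.1)   # sustained weakness: track dropped
```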
What is the "Align loss" and how does it help aerial maps?
The Align loss is a clever trick that makes sure SAM 3's boxes and masks agree with each other. If the model draws a box around a house, the mask (the exact outline) should match that box tightly. This creates cleaner edges – super important for measuring building sizes accurately from satellite photos.
In technical terms: it's a consistency loss between the predicted bounding box and the predicted mask. During training, SAM 3 gets punished if the mask sticks out of the box or if the box is much bigger than the mask. This forces the model to be precise.
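One simple way to express such a consistency penalty – a sketch of the idea, not SAM 3's exact loss – is to compare the predicted box with the tightest box around the predicted mask:

```python
import numpy as np

def mask_to_box(mask):
    """Tightest bounding box (x0, y0, x1, y1) around a binary mask."""
    ys, xs = np.nonzero(mask)
    return xs.min(), ys.min(), xs.max() + 1, ys.max() + 1

def box_iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x1 - x0) * max(0, y1 - y0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def align_loss(pred_box, pred_mask):
    """Penalty that grows as the mask's extent disagrees with the box.

    0.0 when the box hugs the mask exactly; close to 1.0 when the box is
    far too big or the mask spills outside it.
    """
    return 1.0 - box_iou(pred_box, mask_to_box(pred_mask))

# A 4x4 "house" mask in a 10x10 image:
mask = np.zeros((10, 10), dtype=bool)
mask[2:6, 2:6] = True
print(align_loss((2, 2, 6, 6), mask))     # tight box: zero penalty
print(align_loss((0, 0, 10, 10), mask))   # sloppy oversized box: large penalty
```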
For aerial segmentation, where you might need to calculate exact areas of solar panels or crop fields, this precision is gold.
How does SAM 3's training data include aerial views?
The SA‑Co dataset, which trained SAM 3, specifically added a "domain expansion" phase that included aerial, medical, and wildlife imagery. Over 5.2 million images were collected, and a special effort was made to include challenging aerial scenes: crowded city centers, tiny objects, and images with both natural and man‑made structures.
The original SAM's training data was mostly ground‑level photos from the internet.
Here's how the data engine worked in four phases:
- Phase 1: Foundational images from existing datasets, human‑verified
- Phase 2: AI verifiers (fine‑tuned Llama models) helped check more images, doubling output
- Phase 3: Domain expansion – this is where aerial imagery was added, covering scenes from "medical to aerial"
- Phase 4: Video extension with challenging clips, including drone shots with crowded objects
This careful curation is why evaluating SAM models for drone and satellite images now shows SAM 3 far ahead – it was literally trained for this.
Real example: mapping solar panels from satellite
A renewable energy company wanted to find all solar panels in a county to estimate energy potential.
With original SAM: manually click each panel or draw boxes – thousands of panels, days of work.
With SAM 3: typed "solar panel" and let it run overnight. The model found 97% of panels correctly, including small rooftop ones that older AI missed.
Labellerr AI helped them fine‑tune the results with just 20 corrected examples, boosting accuracy to 99%. This combination – SAM 3's power plus easy fine‑tuning – is why teams switch to Labellerr for aerial projects.
Frequently asked questions about SAM 3's architecture
Does SAM 3 need an internet connection to work?
No, once you download the model (about 3.5 GB), it runs completely offline. This is important for processing sensitive satellite imagery that can't be sent to the cloud. Labellerr AI offers an on‑premise version so your data never leaves your servers.
Can I run SAM 3 on a laptop?
You can, but it will be slow. For a single image, it might take a few minutes. For videos or many images, you really need a computer with a good graphics card (GPU). There are smaller versions like EfficientSAM3 that run faster on laptops, with a small trade‑off in accuracy.
How many objects can SAM 3 find in one aerial photo?
It can detect up to 200 objects at the same time. In a dense city scene, that might cover all cars, buildings, and trees. If you have more than 200, you'd need to split the image into tiles or run multiple passes. Labellerr AI has built‑in tiling to handle unlimited objects seamlessly.
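A minimal tiling helper might look like this – the tile size and overlap are illustrative defaults, not Labellerr's actual settings. The overlap matters: it ensures an object sitting on a tile boundary appears whole in at least one tile:

```python
def tile_image(width, height, tile=1024, overlap=128):
    """Return (x0, y0, x1, y1) windows covering a width x height image.

    Adjacent windows overlap by `overlap` pixels so border objects are
    fully contained in at least one tile; each tile can then be segmented
    separately and the results merged.
    """
    step = tile - overlap
    tiles = []
    for y0 in range(0, max(height - overlap, 1), step):
        for x0 in range(0, max(width - overlap, 1), step):
            tiles.append((x0, y0, min(x0 + tile, width), min(y0 + tile, height)))
    return tiles

# A 2048x2048 satellite image becomes a 3x3 grid of overlapping 1024px tiles.
print(len(tile_image(2048, 2048)))
# A small image fits in a single tile.
print(tile_image(500, 500))
```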
Limitations of the architecture
Even with all these advances, SAM 3 has some limits to know:
- Text length: Only understands short noun phrases (max 32 tokens), so complex multi-clause descriptions aren't supported
- 3D data: Works on flat images, not true 3D models
- Very small objects: Still struggles with objects smaller than ~5x5 pixels
- Inference cost: 3.5GB model requires significant computation
Researchers are actively working on these – EfficientSAM3 for speed, SAM3‑I for longer instructions, and adaptations for 3D perception.
Conclusion: architecture wins the race
When you dig into SAM 3 vs SAM for aerial segmentation, the architectural changes explain everything: the Perception Encoder, presence head, dual supervision, and fusion of semantic/instance heads. These aren't small tweaks – they're fundamental redesigns that let SAM 3 find objects in aerial scenes from plain-language descriptions, something the original SAM simply couldn't do.
👉 Want to see the architecture in action?
Read our detailed benchmark with code examples and accuracy metrics:
SAM 3 vs SAM for aerial segmentation – Labellerr AI benchmark
We show you exactly how to deploy it, what hardware you need, and how to fine‑tune for your specific aerial project. Start mapping smarter today!