What if a drone could learn to position itself precisely in a room using nothing but a camera and a neural network small enough to run on a toaster? That's essentially what we built.
I'm a PhD candidate at the National University of Science and Technology POLITEHNICA of Bucharest, and this project started as a question: can we make a cheap drone fly autonomously indoors without GPS, without expensive sensors, and without those black-and-white markers you see plastered on walls in every robotics lab?
The answer is yes: the resulting system was presented as an oral at ICCV 2025, one of the top computer vision conferences. Here's how it works, why it matters, and how you can try it yourself.
## The Problem
Most autonomous drone systems rely on at least one of these:
- GPS: useless indoors
- LiDAR or depth sensors: expensive and heavy for small drones
- Fiducial markers (e.g., ArUco, AprilTags): you have to physically place them in the environment
We wanted none of that. Just a regular camera, a regular laptop, and a Parrot Anafi 4k drone (or any cheap drone with a camera if you can control it).
The classical approach to this is called Image-Based Visual Servoing (IBVS): you detect features in the camera image, compare them to where they should be, and compute velocity commands to minimize the difference. The math is elegant but has a problem: numerical instabilities. In certain configurations the equations blow up, and the control law makes the drone do... unexpected things.
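The textbook IBVS control law can be sketched in a few lines of NumPy. This is the classical formulation, not the NSER reduction from the paper, and the feature points and depths below are hypothetical inputs for illustration. The pseudoinverse is exactly where the numerical trouble lives: a near-singular interaction matrix produces exploding velocity commands.

```python
import numpy as np

def interaction_matrix(x, y, Z):
    """Classic 2x6 interaction matrix for one point feature (x, y)
    in normalized image coordinates, at depth Z."""
    return np.array([
        [-1.0 / Z, 0.0,       x / Z, x * y,       -(1 + x**2),  y],
        [0.0,      -1.0 / Z,  y / Z, 1 + y**2,    -x * y,      -x],
    ])

def ibvs_velocity(features, desired, depths, gain=0.5):
    """Classic IBVS law: v = -gain * pinv(L) @ e, where e is the
    stacked feature error. Returns a 6-vector (vx, vy, vz, wx, wy, wz)."""
    L = np.vstack([interaction_matrix(x, y, Z)
                   for (x, y), Z in zip(features, depths)])
    e = (np.asarray(features) - np.asarray(desired)).ravel()
    return -gain * np.linalg.pinv(L) @ e
```

When the features already sit at their desired positions, the error is zero and the commanded velocity is zero; the instabilities appear when `L` loses rank (e.g., degenerate feature configurations), which is what the NSER reduction addresses.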
## The Solution: A Teacher That Teaches Itself Out of a Job
We built a two-part system:
### 1. The Teacher (NSER-IBVS)
First, we fixed the classical IBVS math. We reduced the equations to eliminate the numerical instabilities that burden traditional approaches. We called this NSER: Numerically Stable Efficient Reduced IBVS.
To detect the target (a toy car in our case), we use a pipeline of two small neural networks:
- YOLOv11 Nano (2.84M params) finds and segments the target vehicle
- A U-Net "Mask Splitter" (1.94M params) figures out which end is the front and which is the back
Why does front-vs-back matter? Because IBVS needs consistent keypoint ordering across frames. If the four corners of your bounding box randomly swap labels between frames, your control law goes haywire. The mask splitter solves this elegantly by splitting the segmentation mask into front and rear halves.
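Here is a minimal sketch of how split masks can break that symmetry. The helper names and the exact geometry are my own illustration, not the paper's implementation: the idea is to use the rear-to-front axis defined by the two mask centroids to assign each corner a fixed role, so the ordering cannot swap between frames.

```python
import numpy as np

def mask_centroid(mask):
    """Centroid (x, y) of a binary mask."""
    ys, xs = np.nonzero(mask)
    return np.array([xs.mean(), ys.mean()])

def order_corners(corners, front_mask, rear_mask):
    """Return the 4 box corners in a fixed order:
    [front-left, front-right, rear-right, rear-left].
    The rear->front axis from the split masks removes the ambiguity
    that makes raw bounding-box corners swap labels between frames."""
    corners = np.asarray(corners, dtype=float)
    axis = mask_centroid(front_mask) - mask_centroid(rear_mask)
    axis /= np.linalg.norm(axis)               # unit rear->front direction
    normal = np.array([-axis[1], axis[0]])     # perpendicular direction
    center = corners.mean(axis=0)
    front = sorted((p for p in corners if (p - center) @ axis > 0),
                   key=lambda p: (p - center) @ normal, reverse=True)
    rear = sorted((p for p in corners if (p - center) @ axis <= 0),
                  key=lambda p: (p - center) @ normal)
    return np.array([front[0], front[1], rear[0], rear[1]])
```

The useful property is permutation invariance: no matter what order the detector emits the corners in, the output ordering is the same, which is what the IBVS control law needs.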
*Figure: three corner-ordering strategies. Naive: bounding box corners lack orientation awareness. Analytical: recomputed corners, still ambiguous. Ours: mask splitter enables consistent ordering.*
This teacher pipeline works great, but it runs at 48 FPS. Not bad, but we can do better.
### 2. The Student (Tiny ConvNet)
Here's where it gets interesting. We used the teacher to generate training data: thousands of image-velocity pairs collected in a digital twin simulator (Parrot Sphinx + Unreal Engine 4). Then we trained a 1.7M parameter convolutional neural network to directly predict velocity commands from raw camera images.
No feature detection. No IBVS equations. No explicit geometric reasoning. The student just looks at an image and outputs: "move left at 0.3 m/s, rotate right at 0.1 rad/s."
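The distillation data loop can be sketched like this. `reset`, `step`, and `teacher` are stand-ins for the Sphinx simulator interface and the NSER-IBVS pipeline; the real system's interfaces will differ, but the shape of the loop is the same: roll the teacher out and record every (image, velocity) pair.

```python
import numpy as np

def collect_distillation_data(reset, step, teacher, n_frames):
    """Roll the teacher out in the simulator, recording the
    (image, velocity) pairs the student will be trained on."""
    images, velocities = [], []
    image = reset()                        # first camera frame
    for _ in range(n_frames):
        v = teacher(image)                 # teacher's 6-DoF velocity command
        images.append(image)
        velocities.append(v)
        image = step(v)                    # apply the command, render next frame
    return np.stack(images), np.stack(velocities)
```

The student is then trained with plain supervised regression on these pairs: images in, velocity commands out.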
The result?
| | Teacher (NSER-IBVS) | Student (ConvNet) |
|---|---|---|
| Inference Speed | 48 FPS | 540 FPS |
| Parameters | 4.78M (pipeline) | 1.7M |
| Mean Error (Sim) | 29.76 px | 14.26 px |
| Mean Error (Real) | 29.96 px | 33.33 px |
The student is 11x faster and, in simulation, actually more accurate than the teacher that trained it. In the real world the teacher still has an edge, because the student was trained mostly on simulated data with only limited real-world fine-tuning. With more extensive real-world training we expect the gap to close.
The best part? The teacher sometimes fails. From the front-center starting position, the teacher's pipeline chokes. The student? It handles it just fine. The student learned to generalize beyond its teacher.
## The Sim-to-Real Pipeline
*Figure: the real-world reference scene (left) and its simulator counterpart (right).*
Training on a real drone is painful: you crash, you wait for batteries to recharge, and you fight inconsistent lighting. So we built a digital twin:
- The simulation uses Parrot Sphinx for accurate drone physics
- Unreal Engine 4 renders a custom bunker environment with a car model that matches our real-world setup
- We defined 8 starting positions around the target and ran hundreds of automated flights
The simulated data distribution matched the real world closely enough that the student network transferred with minimal fine-tuning (just a few dozen real flights).
## Try It Yourself
The entire codebase is open source:
GitHub: SpaceTime-Vision-Robotics-Laboratory/nser-ibvs-drone
```shell
git clone --recursive https://github.com/SpaceTime-Vision-Robotics-Laboratory/nser-ibvs-drone.git
cd nser-ibvs-drone
python3 -m venv ./venv && source venv/bin/activate
pip install -r requirements.txt && pip install -e .
```
All pre-trained models are included, both for simulation and real-world deployment. You'll need a Parrot Anafi drone for real flights, or you can run everything in the Parrot Sphinx simulator (which requires Ubuntu).
We also released:
- Mask Splitter: the annotation tool and network for splitting segmentation masks
- Hugging Face Collection: models, datasets, simulator environment and demo spaces, all in one place
## Why This Matters Beyond Drones
The broader idea here, distilling a complex analytical system into a tiny neural network, applies well beyond drones:
- Industrial inspection robots that need to position themselves precisely using only a camera
- Warehouse drones navigating GPS-denied spaces
- Any edge device where you need real-time control but can't afford heavy computation
Our student network runs inference in 1.85 milliseconds. That's fast enough to run on a Raspberry Pi. The approach is also hardware-agnostic: the same pipeline could work with any drone or robot that has a camera.
## What's Next
We're currently working on:
- Continual learning: allowing drones to learn new objects on the fly without forgetting old ones (our ERF 2026 paper)
- Full autonomous navigation: combining visual servoing with semantic segmentation and depth estimation for obstacle avoidance and autonomous landing (preprint)
- Scaling to outdoor environments with more diverse targets
If you found this interesting, the full paper is here: ICCV 2025 Paper (PDF)
And the project page with all videos, results and additional details: spacetime-vision-robotics-laboratory.github.io/NSER-IBVS
Happy to answer questions in the comments. Especially if you're working on visual servoing, drone autonomy, or sim-to-real transfer.
Sebastian Mocanu: sebastianmocanu.com - GitHub - Google Scholar