What if a drone could learn to position itself precisely in a room using nothing but a camera and a neural network small enough to run on a toaster? That's essentially what we built.
I'm a PhD candidate at the National University of Science and Technology POLITEHNICA of Bucharest, and this project started as a question: can we make a cheap drone fly autonomously indoors without GPS, without expensive sensors, and without those black-and-white markers you see plastered on walls in every robotics lab?
The answer is yes: the resulting system was presented as an oral at ICCV 2025, one of the top computer vision conferences. Here's how it works, why it matters, and how you can try it yourself.
## The Problem
Most autonomous drone systems rely on at least one of these:
- GPS: useless indoors
- LiDAR or depth sensors: expensive and heavy for small drones
- Fiducial markers (e.g., ArUco, AprilTags): you have to physically place them in the environment
We wanted none of that. Just a regular camera, a regular laptop, and a Parrot Anafi 4k drone (or any cheap drone with a camera if you can control it).
The classical approach to this is called Image-Based Visual Servoing (IBVS): you detect features in the camera image, compare them to where they should be, and compute velocity commands to minimize the difference. The math is elegant but has a problem: numerical instabilities. In certain configurations the equations blow up, and the control law makes the drone do... unexpected things.
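The textbook IBVS control law can be sketched in a few lines of NumPy. This is the classical formulation, not the NSER reduction from the paper, and the feature points and depths below are hypothetical inputs for illustration. The pseudoinverse is exactly where the numerical trouble lives: a near-singular interaction matrix produces exploding velocity commands.

```python
import numpy as np

def interaction_matrix(x, y, Z):
    """Classic 2x6 interaction matrix for one point feature (x, y)
    in normalized image coordinates, at depth Z."""
    return np.array([
        [-1.0 / Z, 0.0,       x / Z, x * y,       -(1 + x**2),  y],
        [0.0,      -1.0 / Z,  y / Z, 1 + y**2,    -x * y,      -x],
    ])

def ibvs_velocity(features, desired, depths, gain=0.5):
    """Classic IBVS law: v = -gain * pinv(L) @ e, where e is the
    stacked feature error. Returns a 6-vector (vx, vy, vz, wx, wy, wz)."""
    L = np.vstack([interaction_matrix(x, y, Z)
                   for (x, y), Z in zip(features, depths)])
    e = (np.asarray(features) - np.asarray(desired)).ravel()
    return -gain * np.linalg.pinv(L) @ e
```

When the features already sit at their desired positions, the error is zero and the commanded velocity is zero; the instabilities appear when `L` loses rank (e.g., degenerate feature configurations), which is what the NSER reduction addresses.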
## The Solution: A Teacher That Teaches Itself Out of a Job
We built a two-part system:
### 1. The Teacher (NSER-IBVS)
First, we fixed the classical IBVS math. We reduced the equations to eliminate the numerical instabilities that burden traditional approaches. We called this NSER: Numerically Stable Efficient Reduced IBVS.
To detect the target (a toy car in our case), we use a pipeline of two small neural networks:
- YOLOv11 Nano (2.84M params) finds and segments the target vehicle
- A U-Net "Mask Splitter" (1.94M params) figures out which end is the front and which is the back
Why does front-vs-back matter? Because IBVS needs consistent keypoint ordering across frames. If the four corners of your bounding box randomly swap labels between frames, your control law goes haywire. The mask splitter solves this elegantly by splitting the segmentation mask into front and rear halves.
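Here is a minimal sketch of how split masks can break that symmetry. The helper names and the exact geometry are my own illustration, not the paper's implementation: the idea is to use the rear-to-front axis defined by the two mask centroids to assign each corner a fixed role, so the ordering cannot swap between frames.

```python
import numpy as np

def mask_centroid(mask):
    """Centroid (x, y) of a binary mask."""
    ys, xs = np.nonzero(mask)
    return np.array([xs.mean(), ys.mean()])

def order_corners(corners, front_mask, rear_mask):
    """Return the 4 box corners in a fixed order:
    [front-left, front-right, rear-right, rear-left].
    The rear->front axis from the split masks removes the ambiguity
    that makes raw bounding-box corners swap labels between frames."""
    corners = np.asarray(corners, dtype=float)
    axis = mask_centroid(front_mask) - mask_centroid(rear_mask)
    axis /= np.linalg.norm(axis)               # unit rear->front direction
    normal = np.array([-axis[1], axis[0]])     # perpendicular direction
    center = corners.mean(axis=0)
    front = sorted((p for p in corners if (p - center) @ axis > 0),
                   key=lambda p: (p - center) @ normal, reverse=True)
    rear = sorted((p for p in corners if (p - center) @ axis <= 0),
                  key=lambda p: (p - center) @ normal)
    return np.array([front[0], front[1], rear[0], rear[1]])
```

The useful property is permutation invariance: no matter what order the detector emits the corners in, the output ordering is the same, which is what the IBVS control law needs.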
*Figure: three corner-ordering strategies. Naive: bounding box corners lack orientation awareness. Analytical: recomputed corners, still ambiguous. Ours: mask splitter enables consistent ordering.*
This teacher pipeline works great, but it runs at 48 FPS. Not bad, but we can do better.
### 2. The Student (Tiny ConvNet)
Here's where it gets interesting. We used the teacher to generate training data: thousands of image-velocity pairs collected in a digital twin simulator (Parrot Sphinx + Unreal Engine 4). Then we trained a 1.7M parameter convolutional neural network to directly predict velocity commands from raw camera images.
No feature detection. No IBVS equations. No explicit geometric reasoning. The student just looks at an image and outputs: "move left at 0.3 m/s, rotate right at 0.1 rad/s."
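The distillation data loop can be sketched like this. `reset`, `step`, and `teacher` are stand-ins for the Sphinx simulator interface and the NSER-IBVS pipeline; the real system's interfaces will differ, but the shape of the loop is the same: roll the teacher out and record every (image, velocity) pair.

```python
import numpy as np

def collect_distillation_data(reset, step, teacher, n_frames):
    """Roll the teacher out in the simulator, recording the
    (image, velocity) pairs the student will be trained on."""
    images, velocities = [], []
    image = reset()                        # first camera frame
    for _ in range(n_frames):
        v = teacher(image)                 # teacher's 6-DoF velocity command
        images.append(image)
        velocities.append(v)
        image = step(v)                    # apply the command, render next frame
    return np.stack(images), np.stack(velocities)
```

The student is then trained with plain supervised regression on these pairs: images in, velocity commands out.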
The result?
| | Teacher (NSER-IBVS) | Student (ConvNet) |
|---|---|---|
| Inference Speed | 48 FPS | 540 FPS |
| Parameters | 4.78M (pipeline) | 1.7M |
| Mean Error (Sim) | 29.76 px | 14.26 px |
| Mean Error (Real) | 29.96 px | 33.33 px |
The student is 11x faster and, in simulation, actually more accurate than the teacher that trained it. In the real world the teacher still has an edge, because the student was trained mostly on simulated data with only limited real-world fine-tuning. With more extensive real-world training we expect the gap to close.
The best part? The teacher sometimes fails. From the front-center starting position, the teacher's pipeline chokes. The student? It handles it just fine. The student learned to generalize beyond its teacher.
## The Sim-to-Real Pipeline
*Figure: the real-world reference scene (left) and its simulator counterpart (right).*
Training on a real drone is painful: you crash, you wait for batteries to recharge, and you fight inconsistent lighting. So we built a digital twin:
- The simulation uses Parrot Sphinx for accurate drone physics
- Unreal Engine 4 renders a custom bunker environment with a car model that matches our real-world setup
- We defined 8 starting positions around the target and ran hundreds of automated flights
The simulated data distribution matched the real world closely enough that the student network transferred with minimal fine-tuning (just a few dozen real flights).
## Try It Yourself
The entire codebase is open source:
GitHub: SpaceTime-Vision-Robotics-Laboratory/nser-ibvs-drone
```shell
git clone --recursive https://github.com/SpaceTime-Vision-Robotics-Laboratory/nser-ibvs-drone.git
cd nser-ibvs-drone
python3 -m venv ./venv && source venv/bin/activate
pip install -r requirements.txt && pip install -e .
```
All pre-trained models are included, both for simulation and real-world deployment. You'll need a Parrot Anafi drone for real flights, or you can run everything in the Parrot Sphinx simulator (which requires Ubuntu).
We also released:
- Mask Splitter: the annotation tool and network for splitting segmentation masks
- Hugging Face Collection: models, datasets, simulator environment and demo spaces, all in one place
## Why This Matters Beyond Drones
The broader idea here, distilling a complex analytical system into a tiny neural network, applies well beyond drones:
- Industrial inspection robots that need to position themselves precisely using only a camera
- Warehouse drones navigating GPS-denied spaces
- Any edge device where you need real-time control but can't afford heavy computation
Our student network runs inference in 1.85 milliseconds. That's fast enough to run on a Raspberry Pi. The approach is also hardware-agnostic: the same pipeline could work with any drone or robot that has a camera.
## What's Next
We're currently working on:
- Continual learning: allowing drones to learn new objects on the fly without forgetting old ones (our ERF 2026 paper)
- Full autonomous navigation: combining visual servoing with semantic segmentation and depth estimation for obstacle avoidance and autonomous landing (preprint)
- Scaling to outdoor environments with more diverse targets
If you found this interesting, the full paper is here: ICCV 2025 Paper (PDF)
And the project page with all videos, results and additional details: spacetime-vision-robotics-laboratory.github.io/NSER-IBVS
Happy to answer questions in the comments. Especially if you're working on visual servoing, drone autonomy, or sim-to-real transfer.
Sebastian Mocanu: sebastianmocanu.com - GitHub - Google Scholar