Building a 33-module Python pipeline that takes raw GoPro/Insta360 recordings and turns them into SLAM-ready datasets with auto-labeling, depth estimation, and edge deployment. The unglamorous engineering that nobody writes about.
Three field sessions in Kerala with a GoPro Hero 11 on a rover chassis, an Insta360 X4 for 360-degree coverage, and an Android phone running Sensor Logger.
Result: 200GB of raw video, a growing spreadsheet tracking which files had GPS lock, which files had HyperSmooth accidentally enabled (destroying IMU correlation), and one SD card that corrupted mid-write.
I tried stitching together scripts. FFmpeg for frames, a GPMF parser for telemetry, a separate tool for calibration, manual CSV wrangling for synchronization. After the third time I forgot which script ran in which order, I stopped and asked myself what I actually needed.
The answer was not a better script. It was a system.
That system became Orvex, and over six months it grew into 33 core modules, a 28-panel PyQt6 desktop app, a FastAPI backend with 96+ endpoints, and a React frontend. All sharing identical business logic. Zero code duplication.
The Gap That Started Everything
The autonomous driving community has world-class open source perception tools. ORBSLAM3 for visual-inertial SLAM. VINS-Mono for state estimation. DepthAnything for monocular depth. SegFormer for semantic segmentation. YOLOv8 for detection.
But all of these tools expect their input in a specific format with specific conventions. Getting your own raw field recordings into that format is where weeks disappear.
What I needed, concretely:
- Audit 50+ GoPro files and flag stabilization artifacts automatically
- Extract IMU data at 200Hz and GPS from binary GPMF streams inside MP4 containers
- Synchronize three devices that have independent clocks
- Run a guided camera calibration workflow with validation gates
- Auto-label 10,000 frames with classes specific to Indian roads (autorickshaws, potholes, unmarked speed bumps)
- Train a detector, export to ONNX, convert to TensorRT for Jetson deployment
- Version datasets as they evolve over collection campaigns
No single tool covered this. Every tool covered one step. The pipeline between steps was my problem.
The One Rule That Saved the Project
Early on I made one architectural decision that paid off more than any other:
Every line of business logic lives in core/. No UI imports allowed.
```
core/              33 Python modules. No PyQt6. No FastAPI.
  |
  +-- desktop/     PyQt6 app (28 stacked widget panels)
  +-- web/
        +-- backend/     FastAPI (96+ endpoints)
        +-- frontend/    React + Vite (25 pages)
```
The desktop app and web app are both thin wrappers that call the same core/ functions. I fixed a GPS timestamp parsing bug once in core/extractor_gopro.py and both interfaces got the fix immediately.
When a colleague wanted to run extraction on a headless server, they imported core.extractor_gopro directly. No Qt dependency chain. No display server required.
The enforcement was simple: if the string `import PyQt6` or `import fastapi` appears in any file under `core/`, the code does not merge. No exceptions.
This also meant every core/ module was independently testable with pytest. No GUI mocking required.
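The check itself is only a few lines. Here is a standalone sketch of that merge gate (the function name is mine; in practice something like this runs as a pytest test or a pre-merge hook):

```python
import pathlib

# Strings that must never appear in core/ business logic
FORBIDDEN = ("import PyQt6", "import fastapi")

def find_violations(core_dir):
    """Scan every .py file under core_dir for UI-framework imports.
    Returns (path, line_number) pairs; an empty list means the gate passes."""
    hits = []
    for py in sorted(pathlib.Path(core_dir).rglob("*.py")):
        for lineno, line in enumerate(py.read_text().splitlines(), start=1):
            if any(token in line for token in FORBIDDEN):
                hits.append((str(py), lineno))
    return hits
```

A test then simply asserts `find_violations("core") == []`.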
The Hardest Lesson: Never Store Timestamps as Floats
If you remember one thing from this post, let it be this.
Store all timestamps as 64-bit integer nanoseconds. Not floats. Not milliseconds. Not ISO strings.
I lost three days to a synchronization bug before learning this. Here is the problem:
GoPro GPMF gives timestamps as microseconds. Android Sensor Logger gives seconds_elapsed as a float counting from app start. Insta360 encodes timestamps in a proprietary binary format that exiftool can extract.
When you cross-correlate 200Hz accelerometer signals between two devices to compute their clock offset, a rounding error of 0.001 seconds in a float shifts your alignment by an entire IMU sample. At 200Hz, one sample is 5 milliseconds. That error propagates through your entire SLAM pipeline.
The fix is absolute:
```python
# Correct. Every timestamp in the pipeline.
timestamp_ns: int   # nanoseconds since Unix epoch

# Wrong. Precision loss at high sample rates.
timestamp: float
```
This constraint is enforced at the Pydantic model level. The IMUSample model accepts timestamp_ns: int and nothing else. If you pass a float, validation fails immediately rather than producing subtle drift errors 3 modules downstream.
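The guard looks roughly like this, assuming Pydantic v2, where `StrictInt` refuses float input instead of silently coercing it (field names other than `timestamp_ns` are illustrative):

```python
from pydantic import BaseModel, StrictInt, ValidationError

class IMUSample(BaseModel):
    """One IMU reading. StrictInt rejects floats at the model boundary,
    so precision loss cannot enter the pipeline."""
    timestamp_ns: StrictInt  # nanoseconds since Unix epoch -- never a float
    ax: float
    ay: float
    az: float
```

Passing `timestamp_ns=1.5e18` raises a `ValidationError` immediately at construction time.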
Synchronizing Devices That Don't Share a Clock
This was the most interesting engineering challenge in the project.
Three devices. Three independent clocks. Three different timestamp formats. The goal: compute a nanosecond offset for each device so all data aligns to a common timeline.
Method 1: GPS Time Anchor
GoPro Hero 11 records GPS timestamps in UTC. If the Android phone also has GPS lock, you can compute the offset directly by comparing GPS time readings from the same moment. This is the most accurate method, typically within a few milliseconds.
The catch: both devices need solid GPS fix. Indoor recordings, dense urban canyons, and cloudy conditions can leave you without GPS on one or both devices.
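When both fixes exist, the method reduces to integer arithmetic on each device's clock bias relative to GPS time. A minimal sketch (the function name and calling convention are mine, not Orvex's):

```python
def gps_clock_offset_ns(local_ns_a, gps_utc_ns_a, local_ns_b, gps_utc_ns_b):
    """Offset to ADD to device B's local timestamps to land them on
    device A's timeline. Each argument pair is a local clock reading and
    the GPS/UTC time reported at that same instant, in int nanoseconds."""
    bias_a = local_ns_a - gps_utc_ns_a  # how far A's clock runs ahead of UTC
    bias_b = local_ns_b - gps_utc_ns_b
    return bias_a - bias_b
```

Because everything stays in integer nanoseconds, the subtraction is exact regardless of how large the epoch values are.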
Method 2: IMU Cross-Correlation
When GPS is unavailable, you can exploit the fact that all devices attached to the same rover experienced the same physical motion. The acceleration magnitude signal should be nearly identical across devices, just shifted in time.
```python
import numpy as np

# Resample both signals to 200 Hz, compute acceleration magnitude
mag_a = np.sqrt(ax_a**2 + ay_a**2 + az_a**2)
mag_b = np.sqrt(ax_b**2 + ay_b**2 + az_b**2)

# Mean-removed cross-correlation to find the lag
corr = np.correlate(mag_a - mag_a.mean(), mag_b - mag_b.mean(), mode='full')
lag_samples = np.argmax(corr) - (len(mag_b) - 1)
offset_ns = int(lag_samples / 200.0 * 1e9)  # samples -> nanoseconds at 200 Hz
```
If the computed lag exceeds 2 seconds, Orvex raises a warning. A lag that large usually means one device started recording before the other, and you need to trim the non-overlapping segment.
Method 3: Manual
Sometimes you just know. You clapped in front of all cameras at the start of the recording, and you can identify the spike in the accelerometer data manually. Orvex accepts manual offsets in milliseconds.
The 4GB Trap
GoPro cameras split continuous recordings into chapter files at approximately 4GB boundaries. A 20-minute drive produces GH010001.MP4, GH020001.MP4, GH030001.MP4.
The GPMF telemetry stream is split across these chapters. If you process each chapter independently, you get timestamp discontinuities at every boundary. Your SLAM system sees a time jump of several hundred milliseconds and either loses tracking or inserts a phantom loop closure.
Orvex detects chapter sequences automatically:
```python
_GOPRO_CHAPTER_RE = re.compile(r"G[HX](\d{2})(\d{4})\.MP4", re.IGNORECASE)
```
Files sharing the same last 4 digits (recording ID) are grouped and their GPMF streams concatenated with corrected timestamps. The user sees a single continuous recording. The chapter boundaries become invisible.
This is not an edge case. Any GoPro recording longer than about 12 minutes at 4K hits the 4GB split. If your pipeline does not handle this, it does not work on real data.
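The grouping that regex enables can be sketched in a few lines (the helper name is illustrative; the real module goes on to concatenate the GPMF streams with corrected timestamps):

```python
import re
from collections import defaultdict

# Chapter number is the first 2 digits, recording ID the last 4:
# GH020001.MP4 -> chapter 02 of recording 0001
_GOPRO_CHAPTER_RE = re.compile(r"G[HX](\d{2})(\d{4})\.MP4", re.IGNORECASE)

def group_chapters(filenames):
    """Group GoPro chapter files by recording ID, ordered by chapter number."""
    groups = defaultdict(list)
    for name in filenames:
        m = _GOPRO_CHAPTER_RE.fullmatch(name)
        if m:
            chapter, rec_id = int(m.group(1)), m.group(2)
            groups[rec_id].append((chapter, name))
    return {rid: [n for _, n in sorted(chs)] for rid, chs in groups.items()}
```

Feeding it an unordered directory listing yields one ordered chapter list per recording, ready for concatenation.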
Auto-Labeling for Roads That COCO Never Saw
Stock YOLOv8 trained on MS COCO gives you 80 classes. Indian roads have objects that COCO never included:
| Class | In COCO? | Why It Matters |
|---|---|---|
| autorickshaw | No | Most common vehicle in Kerala |
| pothole_region | No | Critical for path planning |
| speed_bump | No | Often unmarked, no paint, no signs |
| cow | Yes (class 19) | Regularly standing in traffic lanes |
| person | Yes (class 0) | Far higher density than Western datasets |
Orvex defines 13 rover-relevant classes and maps them to COCO class IDs where a mapping exists. For the three India-specific classes (autorickshaw, pothole, speed bump), a stock COCO model simply produces no detections. Those require a fine-tuned model.
The practical workflow:
- Run `yolov8n.pt` at 0.25 confidence on all frames (fast, catches roughly 70% of objects)
- Export to CVAT XML for human review
- Reviewers correct false positives and add the missing India-specific classes
- Use corrected annotations to fine-tune
- Repeat
One reviewer can process 500 auto-labeled frames in the time it takes to manually label 50 from scratch. The model does not need to be perfect. It needs to be faster than starting from zero.
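The export step of that loop can be sketched with the standard library alone. This is a hedged, minimal version: the helper is mine, and the attribute names follow the CVAT-for-images XML conventions as I understand them, not Orvex's exact writer:

```python
import xml.etree.ElementTree as ET

def detections_to_cvat_xml(frames):
    """frames: {image_name: [(label, x1, y1, x2, y2), ...]}
    Returns a CVAT-for-images style XML string ready for reviewer import."""
    root = ET.Element("annotations")
    for i, (name, boxes) in enumerate(sorted(frames.items())):
        img = ET.SubElement(root, "image", id=str(i), name=name)
        for label, x1, y1, x2, y2 in boxes:
            ET.SubElement(img, "box", label=label,
                          xtl=f"{x1:.2f}", ytl=f"{y1:.2f}",
                          xbr=f"{x2:.2f}", ybr=f"{y2:.2f}",
                          occluded="0")
    return ET.tostring(root, encoding="unicode")
```

Reviewers then open the file in CVAT, fix false positives, and add the India-specific classes the stock model cannot see.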
Camera Calibration: Four Steps, Three Weeks of Debugging
Camera-IMU calibration is a guided 4-step workflow in Orvex. Each step has requirements that are not obvious until you get them wrong.
Step 1: IMU Static Noise Characterization. Place the GoPro flat on a stable table and record for at least 4 hours. Allan deviation analysis extracts accelerometer noise density, accelerometer random walk, gyroscope noise density, and gyroscope random walk. A 30-minute recording gives you noise density but not random walk. VINS-Mono needs both parameters. There is no shortcut.
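The Allan deviation at a single cluster size fits in a few lines of NumPy. This is a non-overlapping sketch of the analysis, not Orvex's exact implementation; the real step sweeps log-spaced cluster sizes over hours of static data and reads noise density and random walk off the slopes of the resulting curve:

```python
import numpy as np

def allan_deviation(samples, m):
    """Non-overlapping Allan deviation at cluster size m (tau = m / sample_rate).
    samples: 1-D array of a single gyro or accel axis recorded while static."""
    n = len(samples) // m
    clusters = samples[: n * m].reshape(n, m).mean(axis=1)  # cluster averages
    # Allan variance is half the mean squared difference of adjacent clusters
    return float(np.sqrt(0.5 * np.mean(np.diff(clusters) ** 2)))
```

For white noise the curve falls as 1/sqrt(tau); the flattening and eventual rise at large tau is what reveals the random walk, which is why a 30-minute recording is not enough.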
Step 2: Camera Intrinsics. Standard chessboard calibration using OpenCV. The requirement that trips people up: you need a minimum of 15 detected poses, and they must cover the corners and edges of the frame, not just the center. Reprojection error must be below 0.5 pixels. Orvex rejects calibrations that fail this threshold.
Step 3: Camera-IMU Extrinsics. This invokes OpenImuCameraCalibrator as a subprocess. It runs for 20 to 40 minutes. Orvex streams the subprocess stdout to the log panel in real time so you can watch it converge. The output is a 4x4 transformation matrix relating the camera frame to the IMU frame.
Step 4: Validation. Automated checks verify that reprojection error is below 0.5 pixels, translation magnitude is below 10 centimeters, and rotation magnitude is below 0.5 degrees.
Each step saves results to a JSON file. If Step 2 results already exist when you start, Orvex skips directly to Step 3. You calibrate once per physical camera mount, not once per session.
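The Step 4 gate is straightforward to express in code. A sketch using the thresholds quoted above (the report keys are illustrative, not Orvex's actual JSON schema):

```python
def validate_extrinsics(report):
    """Apply the Step 4 acceptance thresholds to a calibration report dict.
    Returns (passed, per_check) so the UI can show exactly which gate failed."""
    checks = {
        "reprojection_px": report["reprojection_px"] < 0.5,   # pixels
        "translation_m":   report["translation_m"] < 0.10,    # 10 cm
        "rotation_deg":    report["rotation_deg"] < 0.5,      # degrees
    }
    return all(checks.values()), checks
```

Returning the per-check dict rather than a bare boolean is what makes the failure message actionable.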
ByteTrack in 839 Lines
I implemented multi-object tracking from scratch rather than pulling in a tracking library. The reasons were practical:
Every tracking library I evaluated either required detections in a specific format that did not match my pipeline, pulled in dozens of transitive dependencies, or could not export results in MOT Challenge CSV format for evaluation.
The from-scratch implementation uses three components:
- A Kalman filter with 8 states (position x/y, width/height, and their velocities) that predicts where each tracked object will appear in the next frame
- Two-stage IoU matching: high-confidence new detections are matched to active tracks first, then low-confidence detections are matched to recently lost tracks
- Hungarian assignment via `scipy.optimize.linear_sum_assignment` for globally optimal matching
The result is 839 lines of Python with no dependencies beyond NumPy and SciPy. It outputs track statistics, MOT Challenge formatted CSVs, and heatmap visualizations of track density.
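The matching core, components two and three above, can be sketched in a compact form. This is a minimal illustration of the idea, not the 839-line implementation:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou_matrix(tracks, dets):
    """Pairwise IoU between track boxes and detection boxes, [x1, y1, x2, y2]."""
    t = np.asarray(tracks, float)[:, None, :]   # (T, 1, 4)
    d = np.asarray(dets, float)[None, :, :]     # (1, D, 4)
    x1 = np.maximum(t[..., 0], d[..., 0]); y1 = np.maximum(t[..., 1], d[..., 1])
    x2 = np.minimum(t[..., 2], d[..., 2]); y2 = np.minimum(t[..., 3], d[..., 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_t = (t[..., 2] - t[..., 0]) * (t[..., 3] - t[..., 1])
    area_d = (d[..., 2] - d[..., 0]) * (d[..., 3] - d[..., 1])
    return inter / (area_t + area_d - inter)

def match(tracks, dets, iou_threshold=0.3):
    """Hungarian assignment on (1 - IoU) cost; pairs below threshold are dropped."""
    iou = iou_matrix(tracks, dets)
    rows, cols = linear_sum_assignment(1.0 - iou)
    return [(int(r), int(c)) for r, c in zip(rows, cols) if iou[r, c] >= iou_threshold]
```

In the full tracker this runs twice per frame: once for high-confidence detections against active tracks, then again for low-confidence detections against recently lost tracks.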
28 Panels, Zero Frozen Frames
Every long-running operation in the desktop app runs in a QThread worker. The base pattern:
```python
class BaseWorker(QThread):
    progress = pyqtSignal(int)            # 0 to 100
    status = pyqtSignal(str)              # "Processing frame 45/200..."
    result = pyqtSignal(object)           # Final payload on success
    error = pyqtSignal(str)               # Actionable error on failure
    timing = pyqtSignal(float, float)     # (elapsed_seconds, eta_seconds)
```
Every worker inherits from BaseWorker. The UI connects signals to update progress bars, status labels, and the collapsible log panel. The main thread never blocks. Model downloads, inference passes, subprocess calls to COLMAP or ORBSLAM3... all in workers.
The sidebar is organized as a sequential workflow. Not by technical category, but by what a beginner does first, then second, then third:
```
SETUP & IMPORT     (1-4)    Create session, audit files, import data
PROCESS & EXTRACT  (5-9)    Extract frames, calibrate, view telemetry
ANALYZE & ANNOTATE (10-19)  Auto-label, segment, depth, track, review
TRAIN & DEPLOY     (20-28)  Augment, train, export, version
```
A beginner follows top to bottom. An experienced user jumps directly to whatever they need.
Same Logic, Two Interfaces
The FastAPI backend wraps the same core/ functions in 23 route modules. Long operations return a task ID immediately:
```
POST /api/v1/autolabel/{session_id}
{"data": {"task_id": "abc-123"}, "error": null}
```
The client opens a WebSocket to /ws/tasks/abc-123 and receives progress updates:
```json
{"status": "running", "progress": 45, "total": 200, "message": "Annotating frames 41-48 / 200"}
```
The React frontend has 25 pages that mirror every desktop panel. The web version enables collaborative annotation workflows where multiple people can work on the same project.
No business logic was duplicated. The web routes call the same functions the desktop workers call.
What I Would Do Differently
Start with the data model. I wrote core/models.py before anything else. Every module knows exactly what shape of data it receives and what it returns. Pydantic v2 catches type mismatches at the function boundary, not three calls deep in a traceback. This was the single best decision.
Resist premature abstraction. The first version of the synchronizer tried to handle N arbitrary devices with a plugin system. The second version handles exactly three concrete methods with explicit code paths. Half the lines. Twice as debuggable.
Test with real recordings, not synthetic data. A CSV with perfectly uniform timestamps at exactly 200.000Hz will never expose the bug where real GoPro GPMF timestamps jitter by plus or minus 50 microseconds between samples. Every core module was tested against actual GoPro Hero 11 and Insta360 X4 files before being marked complete.
Make errors actionable. "Error occurred" is not an error message. "HyperSmooth is enabled in GH010001.MP4. Re-record with HyperSmooth OFF. Path: Settings > Stabilization > Off" is an error message. Every exception in Orvex tells the user what went wrong and what to do about it.
The Numbers
| Metric | Count |
|---|---|
| Core business logic modules | 33 |
| Desktop UI panels | 28 |
| Web API endpoints | 96+ |
| React frontend pages | 25 |
| Lines of Python and JavaScript | ~60,000 |
| AI models integrated | 7 (YOLOv8, SegFormer, DepthAnything, ByteTrack, COLMAP, ORBSLAM3, UFLD) |
| Export formats | EuRoC, ROS bag, HDF5, CVAT XML, YOLO txt |
| Test files | 32 |
Running It
```bash
git clone https://github.com/Sherin-SEF-AI/Orvex.git
cd Orvex
python3.11 -m venv venv && source venv/bin/activate
pip install -r requirements.txt
PYTHONPATH=. python -m desktop.main
```
Requirements: Python 3.11+, FFmpeg, exiftool. GPU recommended for AI features but not required.
Orvex is MIT licensed. If you are collecting multi-sensor data for robotics, autonomous navigation, road condition monitoring, or dataset research, particularly in environments where Western datasets and assumptions do not apply, take a look.
I am Sherin Joseph Roy. I build tools for autonomous systems operating on Indian roads: unstructured surfaces, mixed traffic, limited GPS coverage, and no lane markings. If you are working on similar problems, I would like to hear from you.