I Built a Database Engine in Rust for My Robot and Learned That SQLite Was the Wrong Battle

The robot started ignoring me on a Tuesday afternoon.

Not dramatically — no sparks, no screaming servos. It just... stopped responding to voice commands. The fix required rebooting the onboard computer, which meant walking across the shop floor, finding the reset button, and losing twenty minutes of calibration data I'd spent all morning collecting.

When I finally dug into the logs, I found the culprit: a corrupted SQLite database. The engine itself wasn't to blame; SQLite doesn't corrupt easily. The problem was that my robot's 15-second startup sequence ran 47 migration scripts from six different Python packages, all hitting the same 4MB database file on an SD card, all fighting over file locks.

I didn't need a better database. I needed a database that wasn't SQLite.


What I Got Wrong About "Embedded" Databases

My first instinct was to replace SQLite with something designed for embedded systems. I spent a week evaluating LMDB, RocksDB, and LevelDB. All of them are genuinely impressive pieces of engineering. None of them solved my problem.

Here's what I got wrong: I was thinking about storage as a durability problem — make writes survive crashes, make reads fast, make the whole thing resilient. That's the right framing for a server. It's the wrong framing for a robot.

A robot doesn't need a database that survives crashes. It needs a database that doesn't crash in the first place.

What a robot actually needs:

  • Write latency that doesn't spike — even 10ms write stalls cause visible servo jitter
  • Cross-modal queries — "find all camera frames where the left obstacle sensor triggered within the last 3 seconds, and give me the IMU readings at those timestamps"
  • Zero configuration — there's no startup script on a robot. The database just has to work when power comes on
  • Power-loss safe writes — not crash recovery, just: power cuts mid-write, robot reboots, state is consistent

These aren't the same problem. And almost no embedded database solves all four simultaneously.


The MMAP Trap

The most seductive optimization in embedded storage is memory-mapping the database file directly into your address space. Linux handles the page faults, your reads are essentially free, and the code looks elegant:

use std::fs::OpenOptions;
use memmap2::Mmap;

let file = OpenOptions::new()
    .read(true)
    .write(true)
    .open("robot.db")
    .unwrap();
// SAFETY: the file must not be truncated or resized while the mapping is live
let mmap = unsafe { Mmap::map(&file) }.unwrap();

This works beautifully on a server. On a robot? It's a latency landmine.

When your robot's sensor loop runs at 200Hz (every 5ms), a single page fault during a read stalls the entire loop. MMAP reads are fast on average. They're unpredictable in the worst case. And for a real-time control system, average means nothing.

I benchmarked four databases on a Raspberry Pi 4 running my robot's sensor fusion workload:

Database       Avg Read   P99 Read   Write Stall   Startup Time
SQLite (WAL)   0.4ms      12ms       23ms          140ms
LMDB           0.2ms      0.8ms      0ms           8ms
RocksDB        0.3ms      1.1ms      2ms           95ms
moteDB         0.15ms     0.4ms      0ms           3ms

The P99 read latency is the number that matters. SQLite's 12ms P99 is a silent killer — it doesn't show up in averages, it just occasionally makes your robot hesitate for a moment that feels like a glitch.
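To see why the average hides the stall, here is a minimal sketch of a P99 calculation over a latency trace. The `p99` helper is hypothetical (nearest-rank method), not part of any benchmark harness mentioned above:

```rust
// Hypothetical helper: nearest-rank P99 over a latency trace in milliseconds.
fn p99(samples: &[f64]) -> f64 {
    assert!(!samples.is_empty());
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).unwrap());
    // Nearest-rank method: rank = ceil(0.99 * n), converted to a zero-based index.
    let idx = ((0.99 * sorted.len() as f64).ceil() as usize).saturating_sub(1);
    sorted[idx]
}

fn main() {
    // 98 fast reads plus two 12ms stalls: the mean stays near 0.6ms,
    // but the P99 surfaces the stall that makes the robot hesitate.
    let trace = [vec![0.4; 98], vec![12.0; 2]].concat();
    let avg: f64 = trace.iter().sum::<f64>() / trace.len() as f64;
    println!("avg = {avg:.2}ms, p99 = {}ms", p99(&trace));
}
```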


Building Around the Access Pattern

The breakthrough came when I stopped trying to build a general-purpose database and started building around how a robot actually accesses data.

A robot's data access has a specific shape:

  1. Recent data is hot — the last 10 seconds of sensor readings are queried constantly
  2. Historical data is cold but needs to be queryable — "show me all manipulation attempts from yesterday"
  3. Structured queries need to cross modalities — "give me all frames where force > 2N and the gripper was closing"

The solution was a two-tier design that most people don't think about because it's not how servers work:

// Ring buffer for hot data — no fsync, no WAL, no locks
struct HotStore {
    buffer: RingBuffer<SensorReading, 2000>,  // ~10s at 200Hz
    index: BTreeMap<Timestamp, usize>,
}

// Append-only file for cold data — durable, queryable
struct ColdStore {
    file: BufWriter<File>,
    offset_index: BTreeMap<Timestamp, u64>,
}

The hot store never touches the filesystem during writes. Readings go into a lock-free ring buffer, and reads are direct memory accesses. Durability is deferred to the cold store: if power cuts, you lose at most the last 10 seconds of readings still sitting in the buffer, which for my robot is an acceptable tradeoff.
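A minimal sketch of the hot-store idea: a fixed-capacity ring that overwrites the oldest slot and hands the evicted reading back so the caller can flush it to cold storage. The names (`HotRing`, `push`) are illustrative, not moteDB's actual API:

```rust
// Illustrative fixed-capacity ring: overwrites the oldest entry when full
// and returns the evicted value so it can be appended to the cold store.
struct HotRing<T> {
    buf: Vec<T>,
    cap: usize,
    head: usize, // next slot to write
}

impl<T> HotRing<T> {
    fn new(cap: usize) -> Self {
        Self { buf: Vec::with_capacity(cap), cap, head: 0 }
    }

    // Push a reading; if the ring is full, return the overwritten (oldest)
    // reading so the caller can persist it before it disappears.
    fn push(&mut self, value: T) -> Option<T> {
        let evicted = if self.buf.len() < self.cap {
            self.buf.push(value);
            None
        } else {
            Some(std::mem::replace(&mut self.buf[self.head], value))
        };
        self.head = (self.head + 1) % self.cap;
        evicted
    }

    fn len(&self) -> usize {
        self.buf.len()
    }
}

fn main() {
    let mut ring = HotRing::new(2000); // ~10s at 200Hz
    let mut cold = Vec::new();
    for t in 0u64..2500 {
        if let Some(old) = ring.push(t) {
            cold.push(old); // copy-before-overwrite into cold storage
        }
    }
    println!("hot = {}, evicted to cold = {}", ring.len(), cold.len());
}
```

The same copy-before-overwrite hook is what makes the power-cut story simple: a reading is only ever dropped from RAM after it has been handed to the durable tier.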

The cold store is append-only. New readings get written to the end of a binary file. The file never gets overwritten or updated — only appended to. This makes fsync calls cheap: you're always writing to the end of the file, and the OS can batch them optimally.
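One way to frame an append-only log like this is length-prefixed records, sketched below against any `Write`/`Read` so it runs on a file or an in-memory buffer alike. The framing (`[len: u32 LE][payload]`) is illustrative, not moteDB's actual on-disk layout:

```rust
use std::io::{self, Read, Write};

// Illustrative record framing: [len: u32 little-endian][payload bytes].
fn append_record<W: Write>(out: &mut W, payload: &[u8]) -> io::Result<()> {
    out.write_all(&(payload.len() as u32).to_le_bytes())?;
    out.write_all(payload)
}

// Sequential replay: read records until a clean EOF.
fn read_all_records<R: Read>(input: &mut R) -> io::Result<Vec<Vec<u8>>> {
    let mut records = Vec::new();
    loop {
        let mut len_buf = [0u8; 4];
        match input.read_exact(&mut len_buf) {
            Ok(()) => {}
            Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => break, // clean end of log
            Err(e) => return Err(e),
        }
        let len = u32::from_le_bytes(len_buf) as usize;
        let mut payload = vec![0u8; len];
        // A torn write at the tail fails here; recovery can truncate it away.
        input.read_exact(&mut payload)?;
        records.push(payload);
    }
    Ok(records)
}

fn main() -> io::Result<()> {
    let mut log: Vec<u8> = Vec::new(); // stands in for a BufWriter<File>
    append_record(&mut log, b"force=2.1")?;
    append_record(&mut log, b"gripper=closing")?;
    let records = read_all_records(&mut log.as_slice())?;
    println!("replayed {} records", records.len());
    Ok(())
}
```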


The Cross-Modal Query Problem

This is where things got interesting. The query "find all camera frames where the force sensor exceeded 2N in the last 5 seconds" sounds simple. It's not.

The naive approach is to scan all readings and filter:

let cutoff = now - Duration::from_secs(5);
for reading in hot_store.iter() {
    if reading.timestamp > cutoff
        && reading.force > 2.0
        && reading.modality == Modality::Camera
    {
        results.push(reading);
    }
}

This works. It's also O(n) and blocks for 10ms+ on large result sets.

The better approach is to build a time-indexed data structure that lets you skip irrelevant data:

// Each modality maintains its own index keyed by timestamp
struct MultiModalIndex {
    force: BTreeMap<Timestamp, Offset>,
    camera: BTreeMap<Timestamp, Offset>,
    imu: BTreeMap<Timestamp, Offset>,
}

// Range query that jumps directly to relevant data
fn query(
    &self,
    time_range: Range<Timestamp>,
    modalities: &[Modality],
    condition: &dyn Fn(&SensorReading) -> bool,
) -> Vec<SensorReading> {
    let mut results = Vec::new();
    for modality in modalities {
        // O(log n) seek to the first indexed entry inside the window.
        // index_for() returns this modality's BTreeMap<Timestamp, Offset>.
        let Some((_, &start_offset)) =
            self.index_for(modality).range(time_range.clone()).next()
        else {
            continue; // no data for this modality in the window
        };

        // Sequential scan from there until we leave the window.
        // read_at() returns None once we run past the end of the store.
        let mut offset = start_offset;
        while let Some(reading) = self.read_at(offset) {
            if !time_range.contains(&reading.timestamp) {
                break;
            }
            if condition(&reading) {
                results.push(reading);
            }
            offset = self.next_offset(offset);
        }
    }
    results
}

The key insight: the BTreeMap index lets us find the start of the relevant range in O(log n), and then we read sequentially. We never touch data outside the query window.
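The seek-then-scan pattern above rests entirely on `BTreeMap::range` from the standard library, which can be seen in isolation with a minimal sketch (the 5ms spacing and 64-byte record size are made up for illustration):

```rust
use std::collections::BTreeMap;

fn main() {
    // Timestamp (ms) -> byte offset of the record in the cold store file.
    let mut index: BTreeMap<u64, u64> = BTreeMap::new();
    for i in 0..1_000u64 {
        index.insert(i * 5, i * 64); // one reading every 5ms, 64-byte records
    }

    // O(log n) seek: first indexed entry at or after t = 2000ms.
    // Everything before the window is never touched.
    let (&ts, &offset) = index.range(2_000u64..).next().unwrap();
    println!("scan starts at ts={ts}ms, offset={offset}");
}
```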


What I Got Right

Lock-free hot path. The sensor loop never blocks on writes. This single decision eliminated 80% of my latency spikes.

Append-only cold storage. The binary format is stable (typed header + variable payload), and the file is never modified after creation. I can replay the entire history by reading the file sequentially.

Typed accessors, not schema migrations. Instead of ALTER TABLE migrations, I version the binary format header:

#[repr(u8)]
enum FormatVersion {
    V1 = 1,  // [timestamp: u64][force: f32][camera: bool]
    V2 = 2,  // [timestamp: u64][force: f32][camera: bool][gyro: [f32; 3]]
}

New sensor types get their own format version. Old readers skip unknown fields. No migration scripts, no schema locks.
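A sketch of what "old readers skip unknown fields" can look like in practice, assuming each record carries a version byte and a payload length so a V1 reader can decode the fields it knows and still advance past a V2 record. The layout and names (`ReadingV1`, `decode_v1`) are hypothetical, not moteDB's documented format:

```rust
// Illustrative versioned record: [version: u8][payload_len: u8][fields...].
// V1 payload: [force: f32 LE]; V2 appends [gyro_z: f32 LE] after it.
struct ReadingV1 {
    force: f32,
}

// Decode the V1-known fields and return how far to advance,
// so unknown trailing fields from newer versions are skipped.
fn decode_v1(record: &[u8]) -> Option<(ReadingV1, usize)> {
    if record.len() < 6 {
        return None; // too short to hold header + force
    }
    let total = 2 + record[1] as usize; // full record length from the header
    let force = f32::from_le_bytes(record[2..6].try_into().ok()?);
    Some((ReadingV1 { force }, total))
}

fn main() {
    // A V2 record: version=2, payload_len=8, force=2.5, gyro_z=0.1.
    let mut rec = vec![2u8, 8u8];
    rec.extend_from_slice(&2.5f32.to_le_bytes());
    rec.extend_from_slice(&0.1f32.to_le_bytes());

    // A V1 reader still gets the force value and steps over the gyro field.
    let (reading, advance) = decode_v1(&rec).unwrap();
    println!("force = {}, next record at +{}", reading.force, advance);
}
```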

Power-cut safe by design. The hot store uses a write-ahead copy. Before overwriting a ring buffer slot, the old data is copied to the cold store. This adds ~0.1ms per write but means a power cut at any point leaves the database in a consistent state.


The Lesson Nobody Talks About

Here's what I didn't find in any database comparison article:

The best database for your robot isn't the one with the best benchmarks. It's the one that matches your failure mode.

SQLite's failure mode is "corruption under concurrent write pressure from multiple processes." That's not SQLite's fault — that's an architectural mismatch with how your system is designed.

The embedded databases that look impressive in benchmarks are often designed for a different failure mode: "crash on embedded hardware without proper shutdown." They optimize for crash recovery, which is a different problem from "writes should never block the control loop."

If you're building for robots, ask yourself: what does failure look like? Then choose the database that matches that failure mode — not the one with the best P99 latency on a benchmark designed for a server.

My robot doesn't crash anymore. The database never does anything interesting. That's exactly the point.


If you're working on robot memory systems and want to compare notes, I post updates on the moteDB project. The code is open source and the binary format is documented if you want to build custom readers.
