Manoj Pisini

5 Things That Broke My Real-Time Robotics Positioning System (And What I Actually Did About It)

I built a precise positioning network for robots in confined facilities. It works now. But getting there involved a clock sync bug I almost shipped, coordinates rendering in the wrong corner of the map, and a gRPC stream that looked perfectly fine right until it didn't.

This post is about those things — not to be self-deprecating, but because I genuinely wish someone had written this before I started. If you're touching real-time positioning, multi-robot coordination, or gRPC streams under any real load, some of this might save you a week. Or at least make you feel less alone when things break weirdly.


The System, Briefly

Robots in confined facilities — warehouses, industrial floors, underground tunnels. GPS doesn't work there. So you build something custom that tracks where every robot is, keeps them from running into each other, and does it accurately enough to actually be useful.

The stack:

  • C/C++ for hardware-level sensor interfaces
  • Rust for the positioning algorithm core (UWB trilateration, Kalman filtering)
  • Go for the gRPC coordination service
  • Rust + Tauri for the dashboard — Rust core, TypeScript/CSS frontend
  • SQL for telemetry history

One system, several languages — and Rust pulling double duty. Here's where it went sideways.


1. I Trusted the Sensor Timestamps

This one cost me four days. Four.

UWB sensors give you time-of-flight measurements, which you trilaterate into a position. Each sensor has its own internal clock. I assumed — and I cannot stress how reasonable this felt at the time — that if two sensors both reported a measurement at timestamp T, those measurements were from the same moment.

They weren't.

Every sensor's clock drifts on its own. At nanosecond-level time-of-flight, even nanosecond drift translates to centimeter-level positioning error — speed of light is 3×10⁸ m/s, so 1ns of timing error is ~30cm of ranging error. Two sensors whose clocks drift apart by even a few nanoseconds will produce inconsistent range measurements, and the trilateration compounds that into a bad position.
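
That back-of-envelope conversion is worth having as a runnable sketch (constant and function names here are illustrative, not from the actual codebase):

```rust
// Speed of light in vacuum, m/s.
const C_MPS: f64 = 299_792_458.0;

/// Ranging error in meters implied by a timing error in nanoseconds
/// for a time-of-flight measurement: d = c * t.
fn ranging_error_m(timing_error_ns: f64) -> f64 {
    C_MPS * timing_error_ns * 1e-9
}

fn main() {
    // 1 ns of clock error is roughly 0.3 m of ranging error.
    println!("1 ns -> {:.2} m", ranging_error_m(1.0));
    // A few ns of inter-sensor drift already dwarfs cm-level accuracy goals.
    println!("5 ns -> {:.2} m", ranging_error_m(5.0));
}
```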

The fix was a Two-Way Ranging clock sync protocol between anchor nodes before trusting any measurement pair. The Rust core now rejects measurement sets where the inter-sensor timestamp delta is above a configurable threshold — forces a re-sync instead of computing a bad position and quietly propagating it downstream.
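
The rejection rule itself is simple. A sketch of what that gate might look like — the names and the threshold value are illustrative, not the real implementation:

```rust
/// Maximum inter-sensor timestamp disagreement tolerated before
/// forcing a re-sync (illustrative value; tuned per deployment).
const MAX_DELTA_NS: u64 = 5;

/// Accept a measurement pair only if the two sensors' timestamps
/// agree to within the configured threshold. Anything wider gets
/// rejected instead of trilaterated into a bad position.
fn measurement_pair_usable(ts_a_ns: u64, ts_b_ns: u64) -> bool {
    ts_a_ns.abs_diff(ts_b_ns) <= MAX_DELTA_NS
}

fn main() {
    assert!(measurement_pair_usable(1_000_000, 1_000_003));  // 3 ns apart: usable
    assert!(!measurement_pair_usable(1_000_000, 1_000_020)); // 20 ns: force re-sync
}
```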

The lesson here sounds obvious in retrospect: in physical systems, "same timestamp" is something you verify, not assume. Clocks lie, especially cheap embedded ones.


2. gRPC Streaming Looked Fine Until Real Load

In testing — three simulated robot clients, controlled environment — the streaming was flawless. Then we moved to ten robots in an actual facility.

Some clients started receiving position updates in bursts. Like, nothing for 800ms, then 15 frames arriving at once. Average latency looked acceptable. Tail latency was quietly destroying the real-time guarantee.

What was happening: I was spawning a new goroutine per position update per connected client. At 30Hz updates × 10 robots × multiple clients, I was creating goroutines faster than Go's scheduler could drain them, and it was batching completions in a way that looked bursty from the outside.

I replaced the whole thing with a per-client buffered channel and a single dedicated goroutine per client that drains it at its own pace. The key part — updates get dropped (with a logged warning) if the channel is full. Intentional lossy delivery.

type ClientStream struct {
    stream  grpc.ServerStreamingServer[PositionUpdate]
    updates chan *PositionUpdate
    dropped atomic.Uint64
    done    chan struct{}
}

func NewClientStream(
    stream grpc.ServerStreamingServer[PositionUpdate],
    bufSize int,
) *ClientStream {
    cs := &ClientStream{
        stream:  stream,
        updates: make(chan *PositionUpdate, bufSize),
        done:    make(chan struct{}),
    }
    go cs.drain()
    return cs
}

func (cs *ClientStream) drain() {
    for {
        select {
        case u := <-cs.updates:
            if err := cs.stream.Send(u); err != nil {
                return // client gone, goroutine exits cleanly
            }
        case <-cs.done:
            return
        }
    }
}

// Publish is non-blocking. Dropped frames are counted atomically —
// no mutex, no contention. We only log on powers of two so a sustained
// backpressure event at 30Hz doesn't flood the log.
//
// n & (n-1) == 0 is true only when n has exactly one bit set,
// i.e. n is a power of two: fires at 1, 2, 4, 8, 16... drops.
func (cs *ClientStream) Publish(update *PositionUpdate) {
    select {
    case cs.updates <- update:
    default:
        n := cs.dropped.Add(1)
        if n&(n-1) == 0 {
            log.Warnf("stream backpressure: %d frames dropped (robot %s)",
                n, update.RobotID)
        }
    }
}

// Close signals the drain goroutine to stop. Call at most once —
// closing a channel twice panics. Wrap in sync.Once in production.
func (cs *ClientStream) Close() { close(cs.done) }

This felt wrong at first. Why would you ever intentionally drop data? But think about what the alternative is: a client receiving position data that's two seconds old, presented as if it's current. In a robot coordination system, stale position is worse than missing position. A robot that doesn't know where another robot is will stop and ask. A robot that has wrong data will keep moving.

Design for lossy delivery upfront. Don't let it surprise you.


3. The Coordinates Were Right. Just in the Wrong Universe.

The algorithm produces coordinates. But — which coordinate system?

Sensors are anchored at physical points in the facility. Robots have their own odometry frames. The dashboard wants a canonical 2D top-down map. I picked one system (the anchor frame) and told myself I'd sort out the transformations later.

Later arrived when a robot physically in the northwest corner was rendering in the southeast corner of the dashboard. The position was accurate — just in the completely wrong reference frame.

The transformation chain goes: sensor space → world metric space → map space → display pixels. Each step has rotation, scale, and origin offset. I had none of this documented. Some of it was hardcoded. One transformation was applied twice in two different places — which meant the error doubled and the result was, somehow, almost correct in certain configurations. That made it harder to find.

The actual fix was a Position<Frame> phantom type in Rust. Every position value carries its coordinate frame as a type tag, and conversions between frames are explicit function calls. You can't accidentally add a sensor-space position to a world-space position — it won't compile.

use std::marker::PhantomData;

// Zero-sized marker types — compile to nothing at runtime,
// exist only for the type checker.
// Derive Copy/Clone/Debug so Position<Frame>'s own derives can satisfy
// the implicit `Frame: Copy + Clone + Debug` bounds they introduce.
#[derive(Debug, Clone, Copy)] pub struct SensorAnchorFrame;
#[derive(Debug, Clone, Copy)] pub struct WorldMetricFrame;
#[derive(Debug, Clone, Copy)] pub struct MapDisplayFrame;

#[derive(Debug, Clone, Copy)]
pub struct Position<Frame> {
    pub x: f64,
    pub y: f64,
    _frame: PhantomData<Frame>,
}

impl<F> Position<F> {
    pub fn new(x: f64, y: f64) -> Self {
        Self { x, y, _frame: PhantomData }
    }
}

/// 2D affine transform:  p' = s · R(θ) · p + t
///
///        | cos θ  −sin θ |
/// R(θ) = |               |
///        | sin θ   cos θ |
#[derive(Debug, Clone, Copy)]
pub struct AffineTransform {
    pub scale:        f64,
    pub rotation_rad: f64,
    pub tx:           f64,
    pub ty:           f64,
}

impl AffineTransform {
    // f64::sin_cos computes sine and cosine in one call, which can be
    // cheaper than calling sin and cos separately on some targets.
    #[inline]
    fn apply(&self, x: f64, y: f64) -> (f64, f64) {
        let (sin_θ, cos_θ) = self.rotation_rad.sin_cos();
        let sx = self.scale * x;
        let sy = self.scale * y;
        (
            cos_θ * sx - sin_θ * sy + self.tx,
            sin_θ * sx + cos_θ * sy + self.ty,
        )
    }
}

impl Position<SensorAnchorFrame> {
    pub fn to_world(self, t: &AffineTransform) -> Position<WorldMetricFrame> {
        let (x, y) = t.apply(self.x, self.y);
        Position::new(x, y)
    }
}

impl Position<WorldMetricFrame> {
    pub fn to_map(self, t: &AffineTransform) -> Position<MapDisplayFrame> {
        let (x, y) = t.apply(self.x, self.y);
        Position::new(x, y)
    }

    // f64::hypot avoids overflow/underflow that naïve sqrt(dx²+dy²) hits
    // when coordinates are very large or very small.
    pub fn distance_to(&self, other: &Position<WorldMetricFrame>) -> f64 {
        (self.x - other.x).hypot(self.y - other.y)
    }
}

// These do not compile — and that's the entire point:
//
// distance_to doesn't exist on SensorAnchorFrame, so:
// let a: Position<SensorAnchorFrame> = Position::new(1.0, 2.0);
// let _ = a.distance_to(&a);  // ← E0599: no method named `distance_to`
//                             //   found for Position<SensorAnchorFrame>
//
// Passing the wrong frame to a typed function:
// fn needs_world(_: Position<WorldMetricFrame>) {}
// needs_world(a);             // ← E0308: mismatched types
//                             //   expected Position<WorldMetricFrame>
//                             //      found Position<SensorAnchorFrame>

This is one of those things where Rust's type system is genuinely worth the friction. Coordinate frame bugs are silent — they produce plausible-looking wrong answers. Making the confusion a compile error is worth whatever ceremony it costs.


4. SQL Was the Wrong Shape for One Query

Telemetry goes into PostgreSQL. Good for history, replays, analytics — all fine.

The problem was the one query I ran most: give me the current position of every active robot. Against a position history table with millions of rows, this was taking 40–90ms. I needed to ask it every 100ms per robot — at ten robots that's a hundred queries a second against a query that can take 90ms each. The math doesn't work.

I tried indexing, tried a materialized view with periodic refresh. Both helped, neither fixed it.

Eventually I just stopped fighting it and split the concern in two. PostgreSQL keeps the full history. A sync.RWMutex-backed in-memory store in the Go service holds only the current position per robot, updated on every incoming report. Coordination queries never touch the database — they hit the store. The database gets the writes asynchronously.

// PositionStore is a concurrent current-state store for robot positions.
// Stale updates are rejected using the sensor's monotonic nanosecond
// timestamp — not wall-clock time, which can jump backwards.
type PositionStore struct {
    mu        sync.RWMutex
    positions map[string]*TimestampedPosition
}

type TimestampedPosition struct {
    *PositionUpdate
    ArrivedAt time.Time
}

func NewPositionStore() *PositionStore {
    return &PositionStore{positions: make(map[string]*TimestampedPosition)}
}

// Update returns false and silently discards the frame if it's older
// than what's already stored — handles both network reordering and
// the clock-drift edge case from mistake #1 above.
func (ps *PositionStore) Update(id string, pos *PositionUpdate) bool {
    ps.mu.Lock()
    defer ps.mu.Unlock()

    if cur, ok := ps.positions[id]; ok {
        if pos.SensorTimestampNs <= cur.SensorTimestampNs {
            return false // stale or duplicate
        }
    }
    ps.positions[id] = &TimestampedPosition{
        PositionUpdate: pos,
        ArrivedAt:      time.Now(),
    }
    return true
}

// Snapshot returns a shallow copy of all current positions under a
// single read lock — one acquisition regardless of robot count.
func (ps *PositionStore) Snapshot() map[string]*TimestampedPosition {
    ps.mu.RLock()
    defer ps.mu.RUnlock()

    out := make(map[string]*TimestampedPosition, len(ps.positions))
    for id, pos := range ps.positions {
        out[id] = pos
    }
    return out
}

Why not sync.Map? It shines when keys are written once and read many times — caches, routing tables. Here every robot writes ~30 times per second and we snapshot constantly. Under that pattern, sync.Map's internal dirty-map promotion adds real overhead. I benchmarked both at 20 robots × 30Hz. The RWMutex version was ~40% faster in the update path.

The lesson I took from this: "put it in the database" and "use the right data structure for the access pattern" are not the same question. A relational database is great at history. A hash map is great at current state. They're not competing — they're complementary. You can have both.


5. I Underestimated What "Confined Facility" Actually Means for Signal

The whole system assumes UWB anchors have line-of-sight to the robot tags. In an open lab, mostly true. In an actual facility — steel shelving, concrete pillars, forklifts, other robots moving around — frequently not true at all.

Multipath interference (signal bouncing off surfaces before arriving) and NLOS conditions corrupt the time-of-flight measurements. The trilateration still produces a position. It's just wrong. Not wildly wrong — usually 30–80cm off — but enough to matter when robots are navigating near each other.

The worst part: nothing looked broken. The system was confidently computing wrong answers.

What I added: a sanity check in the Rust positioning core. If a new measurement implies the robot moved faster than its physical maximum speed, the measurement gets flagged as a likely NLOS artifact and down-weighted in the Kalman filter rather than accepted at face value. It's a heuristic, not a perfect fix — but it cut the transient error spikes down significantly without needing hardware changes.
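
A sketch of that plausibility gate — the shape of the idea, not the actual implementation; the max speed and the penalty curve are illustrative:

```rust
/// Down-weights measurements that imply physically impossible motion,
/// treating them as likely NLOS/multipath artifacts.
struct MeasurementGate {
    max_speed_mps: f64,
}

impl MeasurementGate {
    /// Returns a Kalman weight in (0, 1]: 1.0 for physically plausible
    /// motion, shrinking as the implied speed exceeds the maximum.
    fn weight(&self, prev: (f64, f64), next: (f64, f64), dt_s: f64) -> f64 {
        let dist = ((next.0 - prev.0).powi(2) + (next.1 - prev.1).powi(2)).sqrt();
        let implied_speed = dist / dt_s;
        if implied_speed <= self.max_speed_mps {
            1.0
        } else {
            // Inverse penalty: twice the max speed -> weight 0.5, and so on.
            self.max_speed_mps / implied_speed
        }
    }
}

fn main() {
    let gate = MeasurementGate { max_speed_mps: 2.0 };
    // 1 m in 1 s at a 2 m/s cap: plausible, full weight.
    println!("{}", gate.weight((0.0, 0.0), (1.0, 0.0), 1.0));
    // 4 m in 1 s: implies 4 m/s, twice the cap -> down-weighted to 0.5.
    println!("{}", gate.weight((0.0, 0.0), (4.0, 0.0), 1.0));
}
```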

The broader point: in physical systems, bad input doesn't give you obviously bad output. It gives you plausible-looking bad output. That's the dangerous kind. Validate measurements against physical constraints, not just data types.


Where It Ended Up

After all of the above: sub-5cm average positioning accuracy in clean environments, 30Hz per robot, 20 concurrent clients before the Go service needs scaling.

The dashboard is Tauri — Rust owns the backend, a TypeScript/CSS frontend handles everything visual. The position stream from Go is consumed by the Rust Tauri core via gRPC, validated and interpolated, then pushed to the frontend over Tauri's IPC channel using emit. No separate WebSocket server, no Express middleware, no extra process. The Rust layer does the heavy lifting; the web layer does what web is actually good at — layout, styling, smooth animation.

// Tauri v1. In Tauri v2, replace tauri::Window with tauri::WebviewWindow.
#[tauri::command]
async fn subscribe_positions(
    window: tauri::Window,
    state: tauri::State<'_, AppState>,
) -> Result<(), String> {
    let mut rx = state.position_tx.subscribe();

    tokio::spawn(async move {
        while let Ok(snapshot) = rx.recv().await {
            // Serialize once, emit to all JS listeners.
            // tauri::Window::emit is non-blocking — fire and forget.
            let _ = window.emit("position-snapshot", &snapshot);
        }
    });

    Ok(())
}
// In the TypeScript frontend — dead simple on this side
import { listen } from "@tauri-apps/api/event";
import type { PositionSnapshot } from "../bindings";

await listen<PositionSnapshot>("position-snapshot", ({ payload }) => {
    renderFrame(payload); // canvas or WebGL — your choice
});

The split is clean: anything that requires precision, timing, or direct access to the gRPC stream lives in Rust. Anything that requires a design system, CSS animations, or a component library lives in TypeScript. You don't have to choose one or the other.

Still rough edges. The clock sync breaks in certain anchor geometries. The frame-drop policy in the gRPC layer needs a smarter eviction strategy. 3D support in the frontend would mean dropping into a WebGL/Three.js render pass, which is workable but adds its own frame budget constraints.

But when it breaks, I can actually tell why it broke. That took longer to get right than the system itself.


What I assumed → What was actually true

  • Sensor timestamps are synchronized → Clocks drift; you have to measure and compensate
  • gRPC streaming scales by default → Backpressure is an explicit design decision
  • Coordinate frames are a naming convention → They're a type-system problem
  • One database handles everything → Current state and history need different shapes
  • UWB works cleanly indoors → Multipath and NLOS require input validation

If any of this sounds familiar, I'd genuinely like to hear what your version of it looked like. Drop a comment.


— Manoj
