I assumed the flicker was SAM3 being temperamental.
I watched masks snap on and off across consecutive frames, and my first instinct was to reach for the usual bag of excuses: “it’s a hard scene,” “video is noisy,” “the model needs more context.”
What actually broke was simpler and more embarrassing: I was letting the model’s outputs live in the wrong coordinate system, and then I tried to stabilize a tracker on top of that. Once I fixed session reuse and reprojection back into each frame’s original_sizes, the churn stopped looking like AI chaos and started looking like normal engineering again.
The key insight: streaming masks only look unstable when you keep moving the floor under them
In this pipeline, I’m doing per-frame segmentation in a video stream. That means every frame has two sizes that matter:
- the size I feed into the model (whatever preprocessing produces)
- the size I need to draw on screen (the frame’s original dimensions)
If those drift—even slightly—your tracker is trying to associate masks that are effectively being scaled/shifted between frames. That’s not “flicker,” it’s me changing the ruler every 33ms.
The fix wasn’t one thing; it was a set of decisions that all push in the same direction:
- create an inference session once and reuse it across frames (warm path)
- when I must restart, do it deliberately and recover state (warm→restart)
- always reproject model outputs into original_sizes before I stream them
- handle mixed-resolution frames explicitly (or you'll debug ghosts)
I’ll walk through the session lifecycle, the streaming output assumptions (including why pixel_values[0] is acceptable in this particular stream), and the diagnostic endpoint I added—/segment/debug-model—so I could root-cause flicker instead of arguing with vibes.
```mermaid
flowchart TD
    subgraph videoStream
        frameIn[Video frame] --> preprocess[Preprocess to pixelValues]
        preprocess --> infer[Run SAM3 inferenceSession]
        infer --> reproject[Reproject to originalSizes]
        reproject --> track[Associate masks across frames]
        track --> streamOut[Streaming outputs per frame]
    end
    debugModel[Segment debug model endpoint] -.-> infer
    debugModel -.-> reproject
```
That diagram is the mental model I wish I’d had earlier: the tracker can only be as stable as the coordinate system it’s fed.
How I run the inference session across frames (warm, and warm→restart)
On the GPU server side, I ended up treating the inference session like a long-lived object: create it, feed it frames, and only restart when something actually invalidates the state.
The commit history shows I added a dedicated model introspection endpoint: "debug: add /segment/debug-model introspection endpoint" (commit a1389d6). That endpoint exists because I needed visibility into what the server thought it had loaded and what configuration it was running under.
The pattern I landed on is “configured or no-op”: check prerequisites at module load, and if the session isn’t in a known-good state, refuse to proceed rather than producing garbage outputs. I used this same defensive shape across the SAM3 inference session, the RGB-X pipeline, and every GPU server endpoint in this codebase. Validate state before acting, never assume continuity.
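The "configured or no-op" shape is easy to sketch. This is a hypothetical stand-in (class and field names are mine, not the repo's actual session wrapper), but it shows the defensive posture: validate state, and refuse loudly instead of emitting garbage.

```python
class InferenceSession:
    """Hypothetical stand-in for a long-lived SAM3 session wrapper."""

    def __init__(self):
        self.model = None  # loaded lazily or at startup

    def is_ready(self):
        # Validate state before acting; never assume continuity.
        return self.model is not None

    def segment(self, frame):
        if not self.is_ready():
            # Refuse to proceed rather than produce garbage outputs.
            return {"error": "session not configured", "segments": []}
        return self.model(frame)


session = InferenceSession()
result = session.segment(frame=None)  # no model loaded -> explicit no-op
```

The point isn't the class; it's that an unconfigured session returns a structured refusal the caller can log, instead of a half-valid output the tracker will choke on three frames later.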
What surprised me when I started applying this mindset to streaming segmentation wasn’t the error handling—it was how often my “session is fine” assumption was wrong because inputs were drifting (resolution/orientation), not because the model had crashed.
Warm path vs warm→restart
I repeatedly hit “state mismatch” style problems elsewhere when the shape of data didn’t match what downstream code assumed. One concrete example is in the RGB-X pipeline: I had to fix output unwrapping because the pipeline returned a nested list.
From gpu-server/rgbx_endpoints.py (commit 0ca5907):
```python
# Pipeline returns nested list: result.images[0][0] is the PIL Image
if hasattr(result, "images") and len(result.images) > 0:
    img = result.images[0]
    # Unwrap nested list (RGB-X wraps each channel in an extra list)
    if isinstance(img, list) and len(img) > 0:
        img = img[0]
```
The non-obvious lesson for me: in streaming inference, “session stability” problems often show up as shape/format drift first. I stopped trusting any output container until I’d asserted what it actually was.
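My habit now is a tiny unwrap helper that asserts the container shape instead of silently indexing. This is a sketch, not the repo's actual code:

```python
def unwrap_single(container, what="output"):
    """Collapse nested single-element lists/tuples, failing loudly on
    anything that isn't exactly one element deep at each level."""
    while isinstance(container, (list, tuple)):
        if len(container) != 1:
            raise ValueError(
                f"{what}: expected exactly 1 element, got {len(container)}"
            )
        container = container[0]
    return container


# result.images[0][0]-style nesting collapses to the inner value
img = unwrap_single([["PIL-image"]], what="rgbx.images")
```

With this, a batch-size or wrapping change surfaces as an immediate, named error instead of a misaligned output two stages downstream.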
Reprojecting outputs into original_sizes for streaming
The clearest example of this same class of bug lives in the RGB-X endpoints: bounding boxes coming from a different resolution than the input image.
From gpu-server/rgbx_endpoints.py (commit 43a8bb0):
```python
# Scale bbox if it's from a different resolution than the input image
bbox_right = bx + bw
bbox_bottom = by + bh
if bbox_right > img_w or bbox_bottom > img_h:
    scale_x = img_w / max(bbox_right, 1)
    scale_y = img_h / max(bbox_bottom, 1)
```
That snippet is the same species of problem as SAM3 mask reprojection:
- you get coordinates (or masks) in one space
- you need them in another
- if you skip this, everything downstream looks “unstable”
In my SAM3 stream, the tracker was downstream. So the tracker got blamed. But the real issue was that I was feeding it masks that didn’t line up frame-to-frame because I wasn’t consistently mapping back to each frame’s original_sizes.
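The reprojection step itself is just a resize from model space into the frame's original size. Here's a minimal pure-Python nearest-neighbor sketch for binary masks (the real pipeline presumably uses tensor interpolation; this only illustrates the coordinate mapping):

```python
def reproject_mask(mask, original_size):
    """Nearest-neighbor resize of a binary mask (list of rows) from
    model space into the frame's original (height, width)."""
    src_h, src_w = len(mask), len(mask[0])
    dst_h, dst_w = original_size
    return [
        [mask[y * src_h // dst_h][x * src_w // dst_w] for x in range(dst_w)]
        for y in range(dst_h)
    ]


# A 2x2 model-space mask reprojected into a 4x4 original frame
small = [[0, 1],
         [1, 0]]
full = reproject_mask(small, (4, 4))
```

The crucial property: the mapping is driven by the destination (original) size, so every frame lands back in its own coordinate system no matter what the preprocessor did.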
One of the most practical guardrails I adopted was: every streaming frame output must carry both the model-space size and the original frame size, and I treat any mismatch as a first-class event (log it, debug it, potentially restart session).
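One way to sketch that guardrail: make both sizes part of the per-frame payload and check them before the tracker ever sees the masks. Field names here are my own, not the repo's schema:

```python
from dataclasses import dataclass


@dataclass
class FrameOutput:
    frame_id: int
    model_size: tuple     # (h, w) the model actually saw
    original_size: tuple  # (h, w) of the raw frame
    masks: list


def check_frame(out, expected_original_size):
    """Treat a size mismatch as a first-class event, not a tracker bug."""
    if out.original_size != expected_original_size:
        # log it, hit the debug endpoint, potentially restart the session
        return False
    return True


out = FrameOutput(frame_id=7, model_size=(1024, 1024),
                  original_size=(1080, 1920), masks=[])
assert check_frame(out, (1080, 1920))
assert not check_frame(out, (720, 1280))  # mixed-resolution frame detected
```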
How the tracker associates masks across frames (and why confidence thresholds mattered)
I had to tune SAM3 confidence thresholds in response to “0 segments” and label churn, and those changes are explicitly captured in the commit history:
- fix: lower SAM3 internal confidence_threshold from 0.5 to 0.25 (commit c8086b8)
- fix: lower SAM3 default confidence to 0.15, remove debug endpoint (commit f3bd706)
- debug: add SAM3 segment debug logging to diagnose 0 segments (commit 95cf2fd)
- debug: add /segment/debug-model introspection endpoint (commit a1389d6)
Those commits tell the story I recognize from operating this kind of stream: if your confidence gating is too aggressive, you don’t get “cleaner masks,” you get churn—objects disappear, reappear, and your association logic has nothing stable to latch onto.
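The effect is easy to demonstrate with a toy gate. The scores below are made up, but the shape of the failure is exactly what I saw: scores for the same physical object hover around the cutoff, so a strict gate makes it blink in and out.

```python
def gate(detections, threshold):
    """Keep only detections the model is confident enough about."""
    return [d for d in detections if d["score"] >= threshold]


# Two consecutive frames; the door's score straddles 0.5
frame_a = [{"id": "door", "score": 0.48}, {"id": "wall", "score": 0.61}]
frame_b = [{"id": "door", "score": 0.52}, {"id": "wall", "score": 0.58}]

strict_a, strict_b = gate(frame_a, 0.5), gate(frame_b, 0.5)
loose_a, loose_b = gate(frame_a, 0.25), gate(frame_b, 0.25)

# At 0.5 the door vanishes in frame A and reappears in frame B: churn.
assert [d["id"] for d in strict_a] == ["wall"]
assert [d["id"] for d in strict_b] == ["door", "wall"]
# At 0.25 both frames emit the same set, so association has something stable.
assert [d["id"] for d in loose_a] == [d["id"] for d in loose_b]
```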
What went wrong first (concretely)
I started with SAM3’s internal confidence_threshold at 0.5.
That was a mistake. It wasn’t “slightly too high”—it produced the failure mode captured directly in my own commit message: I was diagnosing 0 segments. That’s why I added debug logging (95cf2fd).
Lowering that internal threshold from 0.5 to 0.25 (commit c8086b8) was the first time the stream started behaving like a tracker problem instead of a blank-output problem. Then lowering the default confidence to 0.15 (commit f3bd706) reduced churn further.
I didn’t enjoy admitting this, but it’s the truth: the tracker can’t associate what the model refuses to emit.
Diagnostic endpoints: making flicker debuggable
When I’m dealing with flicker, I want to know three things immediately:
- what model/config is loaded?
- what thresholds are active?
- what does the server think the input/output shapes are?
That’s why I added /segment/debug-model (commit a1389d6): a simple GET that returns the loaded model name, active thresholds, and expected input/output shapes. Nothing clever—just enough state exposure that when something flickers, I can ask the server what it thinks is happening instead of guessing.
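The payload shape matters more than the web framework around it. A sketch of the response builder (the keys here are illustrative, not the repo's actual schema; the real endpoint reads live server state):

```python
def build_debug_model_payload(state):
    """What a /segment/debug-model-style endpoint might return: enough
    state that flicker debugging is facts, not vibes."""
    return {
        "model_name": state.get("model_name"),
        "confidence_threshold": state.get("confidence_threshold"),
        "internal_confidence_threshold": state.get("internal_confidence_threshold"),
        "expected_input_size": state.get("expected_input_size"),
        "last_output_shape": state.get("last_output_shape"),
    }


payload = build_debug_model_payload({
    "model_name": "sam3",
    "confidence_threshold": 0.15,
    "internal_confidence_threshold": 0.25,
    "expected_input_size": (1024, 1024),
    "last_output_shape": None,
})
```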
The surprising part for me wasn’t that I needed a debug endpoint—it was that once it existed, I stopped “tuning” blindly. I could correlate flicker with specific conditions: mixed-resolution inputs, orientation mismatches, or thresholds that were too strict.
Mixed-resolution frames: the silent killer
The RGB-X bbox scaling fix (43a8bb0) is a perfect example of how mixed-resolution data sneaks in: a bbox that made sense in one resolution becomes out-of-bounds in another.
In the iOS capture side, I also hit a closely related issue: orientation mismatches. In VideoTrackingManager.swift I explicitly rotate ARKit camera buffers because they’re always landscape-left.
From the diff (commit e68236a):
- “ARKit camera buffers are always landscape-left orientation.”
- “Rotate to portrait when device is upright so masks from the GPU server align with the on-screen display.”
That comment is exactly the kind of bug that masquerades as “model flicker” when it’s really “I rotated one side and not the other.”
This is the one place I’ll use a single analogy, because it’s how it felt to debug: tracking across frames with inconsistent reprojection is like trying to draw on tracing paper while someone keeps swapping the paper size when you blink. You’re not shaky—your reference frame is.
Why pixel_values[0] can be acceptable in a batch size 1 stream
This is a per-frame stream, not an offline batch job. Every time I accidentally treated a nested structure as flat (as in RGB-X’s result.images[0][0] case), I paid for it in runtime errors or misaligned outputs.
So my rule became: if I’m going to index [0], I only do it when I’ve asserted (via logging or debug endpoint output) that the stream is batch size 1 and the container shape is stable. That discipline came directly from the RGB-X unwrapping bug (0ca5907).
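A sketch of that rule: assert batch size 1 once, then index [0] without guilt. Here pixel_values is a plain nested list standing in for the real tensor:

```python
def first_and_only(batch, name="pixel_values"):
    """Index [0] only after proving the stream really is batch size 1."""
    if len(batch) != 1:
        raise AssertionError(
            f"{name}: expected batch size 1, got {len(batch)}"
        )
    return batch[0]


pixel_values = [[0.1, 0.2, 0.3]]  # one frame's preprocessed values
frame_tensor = first_and_only(pixel_values)
```

If an upstream change ever starts batching frames, this fails at the indexing site with a message that names the container, instead of producing subtly misaligned outputs.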
Strategies for graceful recovery when a session restarts
The broader pattern I used for recovery comes from the iOS→web pipeline: I upload segmentation results “fire-and-forget” from the device using a detached task.
From ios-lidar-app/SidingAILiDAR/ContentView.swift (commit d770ee8):
```swift
// Upload segmentations to Supabase (fire-and-forget)
let segmentsToUpload = segments
Task.detached {
    await SupabaseManager.shared.uploadSegmentations(
        projectId: projectId,
        captureId: captureId,
        segments: segmentsToUpload
    )
}
```
That’s the same recovery philosophy I applied on the GPU side: don’t block the live experience on perfect continuity. When a session restarts, I want the stream to keep moving, and I want enough persisted context (segments + measurements) that the rest of the system can stay coherent.
The thing that bit me early was assuming “restart is rare.” In reality, restarts happen—deploys, GPU hiccups, input drift—and if you don’t plan for them, you end up with a tracker that behaves like it has amnesia.
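The GPU-side recovery shape can be sketched as "segment, and on a session failure restart deliberately and keep streaming." All names here are mine, and the simulated hiccup is just for illustration:

```python
class RestartableSession:
    """Wraps a session factory so restarts are an event, not a crash."""

    def __init__(self, factory):
        self.factory = factory
        self.session = factory()
        self.restarts = 0

    def segment(self, frame):
        try:
            return self.session(frame)
        except RuntimeError:
            # Deliberate warm->restart: rebuild state, count the event,
            # and let the stream keep moving with a single retry.
            self.restarts += 1
            self.session = self.factory()
            return self.session(frame)


calls = {"n": 0}

def flaky_factory():
    def run(frame):
        calls["n"] += 1
        if calls["n"] == 2:
            raise RuntimeError("simulated GPU hiccup")
        return {"frame": frame, "segments": []}
    return run


s = RestartableSession(flaky_factory)
s.segment(1)           # warm path
result = s.segment(2)  # hiccup -> restart -> retry succeeds
```

The restart counter is the important part: it turns "the tracker has amnesia sometimes" into a metric you can graph against flicker reports.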
The part I didn’t expect: lowering confidence reduced churn more than any tracker tweak
I went into this thinking the tracker was the hero.
But the commit history tells the actual sequence: I first had to make SAM3 emit segments reliably (lower internal threshold 0.5 → 0.25), then I had to make defaults more permissive (0.15) to reduce label churn. Only after that did the rest of the pipeline—reprojection into original_sizes, session reuse, and association—start behaving predictably.
Once I stopped moving the coordinate system under the masks, “flicker” stopped being a mysterious model trait and turned back into what it always was: a bug I could point to, log, and kill.