
Daniel Romitelli

Posted on • Originally published at craftedbydaniel.com

How I Carve Objects Out of Depth Instead of Texture

A depth pipeline should behave like a carpenter reading a level, not a photographer admiring a picture. It thresholds, groups, checks for discontinuities, and validates whether the resulting surfaces can be trusted. That framing is the whole point of what I built: a segmentation path that still has something useful to say when the room is nearly dark and the RGB frame is useless.

The failure that started this pipeline was an image that looked worthless while the depth map still had structure. Once I saw that, segmentation stopped being a color problem and became a geometry problem: if the scene is dark enough, texture is the wrong witness, and the depth field is the one telling the truth.

The shape-first path

The route I care about starts in the web app, but the important work happens on the GPU server. The browser sends a depth payload to the API route, and that route forwards the request to the depth segmentation service. From there, the server turns a depth map into labeled regions by looking for geometric structure instead of visual texture.

flowchart TD
  rawDepth[Raw depth input] --> apiRoute[API route]
  apiRoute --> gpuServer[GPU server]
  gpuServer --> threshold[Thresholding]
  threshold --> components[Connected components]
  components --> discontinuities[Surface discontinuities]
  discontinuities --> holes[Hole handling]
  holes --> labels[Labeled regions]

That diagram is the whole idea in miniature. I am not asking the model to "understand" a wall the way a vision model reads paint or grain. I am asking it to find contiguous surfaces, split them where the depth jumps, and keep the result usable even when the RGB frame is nearly empty.
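The threshold-then-group step can be sketched in a few lines. This is a hypothetical helper, not the service's actual implementation: it treats the depth map as a dense grid and flood-fills 4-connected pixels whose depth step stays under a jump threshold, so discontinuities become region borders.

```typescript
// Illustrative sketch (assumed names and units): label contiguous depth
// regions, splitting wherever the depth value jumps between neighbors.
function labelDepthRegions(
  depth: number[][],
  jumpThreshold = 0.05
): number[][] {
  const h = depth.length;
  const w = depth[0].length;
  const labels: number[][] = depth.map((row) => row.map(() => 0));
  let nextLabel = 0;

  for (let y = 0; y < h; y++) {
    for (let x = 0; x < w; x++) {
      if (labels[y][x] !== 0) continue;
      nextLabel += 1;
      labels[y][x] = nextLabel;
      // Flood fill: a neighbor joins the region only when the depth step
      // is small, so surface discontinuities end the region.
      const stack: [number, number][] = [[y, x]];
      while (stack.length > 0) {
        const [cy, cx] = stack.pop()!;
        for (const [dy, dx] of [[-1, 0], [1, 0], [0, -1], [0, 1]]) {
          const ny = cy + dy;
          const nx = cx + dx;
          if (ny < 0 || ny >= h || nx < 0 || nx >= w) continue;
          if (labels[ny][nx] !== 0) continue;
          if (Math.abs(depth[ny][nx] - depth[cy][cx]) > jumpThreshold) continue;
          labels[ny][nx] = nextLabel;
          stack.push([ny, nx]);
        }
      }
    }
  }
  return labels;
}
```

Everything downstream (plane fitting, validation) operates on those labels rather than on pixels.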

The API route is intentionally thin. It exists to move the request into the GPU service and return the result back to the app without turning the web tier into an image-processing graveyard.

import { NextRequest, NextResponse } from 'next/server';

export async function POST(request: NextRequest) {
  try {
    const body = await request.json();

    // Resolve the GPU segmentation service; fall back to local dev.
    const gpuServerUrl = process.env.SAM3_SERVER_URL || 'http://localhost:8000';

    const response = await fetch(`${gpuServerUrl}/segment/depth`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(body),
    });

    if (!response.ok) {
      const error = await response.text();
      return NextResponse.json(
        { error: `GPU server error: ${error}` },
        { status: response.status }
      );
    }

    const data = await response.json();
    return NextResponse.json(data);
  } catch (error) {
    return NextResponse.json(
      { error: error instanceof Error ? error.message : 'Depth segmentation failed' },
      { status: 500 }
    );
  }
}

What I like about this route is how little personality it has. It does not try to interpret the scene, and it does not pretend to own the algorithm. It just forwards the request, preserves the server's response, and keeps the failure mode obvious when the backend complains.

Why this is not ordinary segmentation

Ordinary image segmentation can lean on texture, edges, contrast, and all the other visual cues that make a photo interesting. This pipeline is different. The depth segmentation path is built for total darkness scenarios, and the docstring says that directly: it uses LiDAR depth maps to detect walls, windows, doors, and trim via RANSAC plane fitting and connected component analysis, with no RGB required.

That distinction matters because the naive approach would be to treat every boundary in the image as a visual boundary. In a depth map, that is the wrong instinct. A glossy surface can look noisy in RGB and still be flat in depth. A dark room can be visually unhelpful and geometrically rich. So I built the pipeline around the shape signal: threshold the depth values, group connected regions, then test whether those regions behave like planes or like fragments of planes.

The geometric tests that earn this post its keep

This is where the pipeline stops being plumbing and starts being interesting. The codebase gives me a vocabulary for geometric tests: depth analysis, geometry detection, multi-reference validation, and auto-scale correction.

The depth analysis interface includes a perpendicularity check, a tilt angle, a perspective correction factor, average depth, depth variance, and a gradient direction. That tells me the pipeline is not just carving masks; it is checking whether the surface behaves like a stable reference plane. A flat region with low variance is one thing. A tilted or gradient-heavy region is another.
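The post doesn't show that interface, but its described fields would look roughly like this. The field names here are my assumptions mirroring the description, and the stats helper covers only the two simplest fields:

```typescript
// Hypothetical shape of the depth-analysis result described above;
// names are illustrative, not the codebase's actual interface.
interface DepthAnalysis {
  isPerpendicular: boolean;
  tiltAngle: number; // degrees off the camera plane (assumed unit)
  perspectiveCorrectionFactor: number;
  averageDepth: number;
  depthVariance: number;
  gradientDirection: number; // radians, direction of the depth gradient (assumed)
}

// Average and variance over a region's depth samples: low variance is the
// first hint that a region behaves like a stable reference plane.
function depthStats(values: number[]): {
  averageDepth: number;
  depthVariance: number;
} {
  const averageDepth = values.reduce((s, v) => s + v, 0) / values.length;
  const depthVariance =
    values.reduce((s, v) => s + (v - averageDepth) ** 2, 0) / values.length;
  return { averageDepth, depthVariance };
}
```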

The geometry detection layer goes one step further and classifies surfaces as flat, angled, or multi-plane. It tracks peak counts, detected peaks, and a confidence factor. That is the right shape of heuristic for adjacent planar regions: if the histogram of depth values suggests multiple peaks, I should not pretend the whole surface is one plane. I should split it, warn about it, or reduce confidence.

export interface GeometryAnalysis {
  /** Whether surface appears flat (single depth plane) */
  isFlatSurface: boolean;
  /** Number of detected depth peaks */
  peakCount: number;
  /** Complexity classification */
  complexity: 'flat' | 'angled' | 'multi-plane';
  /** Detected peak depths (normalized 0-1) */
  peaks: Peak[];
  /** Warning message for user */
  warning?: string;
  /** Confidence factor for calibration (0-1) */
  confidenceFactor: number;
}

I like this interface because it refuses to collapse geometry into a yes-or-no answer. A surface can be flat, angled, or multi-plane, and the rest of the pipeline needs that nuance if it is going to keep users out of trouble.

The non-obvious part is the confidence factor. That is the bridge between geometry and behavior: once a region starts looking like a bay window or a composite surface, I do not just label it and move on. I lower trust, surface a warning, and let the downstream calibration logic react accordingly.
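One plausible shape for that bridge, sketched under my own assumptions (the actual formula and penalty per extra plane are not in the post):

```typescript
// Assumed mapping from detected depth peaks to the confidenceFactor and
// warning fields of GeometryAnalysis; the penalty per extra plane is a
// guess, not the project's real calibration.
function confidenceFor(peakCount: number): {
  confidenceFactor: number;
  warning?: string;
} {
  if (peakCount <= 1) return { confidenceFactor: 1 };
  // Each extra depth plane erodes trust; never drop to zero outright.
  const confidenceFactor = Math.max(0.2, 1 - 0.25 * (peakCount - 1));
  return {
    confidenceFactor,
    warning: `Surface shows ${peakCount} depth planes; calibration may be less reliable`,
  };
}
```

The point is the shape of the behavior: confidence degrades gradually and the user sees why, instead of the pipeline silently committing to a bad plane.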

The multi-reference validator uses a simple but important rule: when both a door and a window are detected, it calculates scale from each and compares them. The expected behavior is explicit: if only one reference exists, use it directly; if both exist, compare them and warn if the ratio falls outside 0.85-1.15; prefer the door as the primary reference because it is larger and more reliable.

That is the kind of heuristic I trust in production. It is not magical, and it is not trying to be. It is a guardrail around geometry that keeps the system from confidently lying when the scene is awkward.
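A minimal TypeScript sketch of that rule, with assumed names (`resolveScale` is hypothetical; the 0.85-1.15 band and the door preference come straight from the behavior described above):

```typescript
interface ScaleResult {
  scale: number;
  source: 'door' | 'window';
  warning?: string;
}

// Sketch of the multi-reference rule: one reference is used directly;
// two references are compared, with the door preferred as primary.
function resolveScale(
  doorScale: number | null,
  windowScale: number | null
): ScaleResult | null {
  if (doorScale === null && windowScale === null) return null;
  if (doorScale === null) return { scale: windowScale!, source: 'window' };
  if (windowScale === null) return { scale: doorScale, source: 'door' };

  const ratio = doorScale / windowScale;
  const agrees = ratio >= 0.85 && ratio <= 1.15;
  return {
    scale: doorScale, // door preferred: larger and more reliable
    source: 'door',
    warning: agrees
      ? undefined
      : `Scale references disagree (ratio ${ratio.toFixed(2)}); check for distortion or misdetection`,
  };
}
```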

Adjacent planes: where the geometry gets annoying

A wall next to trim, a door next to a window, or a bay window with multiple surfaces can all look like one shape until depth exposes the seams. That is why the system includes both depth-variance analysis and peak-based geometry detection. A single plane should not produce multiple strong depth peaks. Multiple peaks are a hint that the surface should be split or downgraded in confidence.
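The peak hint can be made concrete with a histogram pass. This is an assumed approach, not the codebase's exact heuristic: bin the normalized depths, then count local maxima that clear a prominence floor.

```typescript
// Illustrative peak count over normalized (0-1) depth samples: a bin is a
// peak when it beats its neighbors and holds a minimum fraction of samples.
function countDepthPeaks(
  depths: number[],
  bins = 32,
  minFraction = 0.1
): number {
  const hist = new Array(bins).fill(0);
  for (const d of depths) {
    const i = Math.min(bins - 1, Math.max(0, Math.floor(d * bins)));
    hist[i] += 1;
  }
  const floor = depths.length * minFraction;
  let peaks = 0;
  for (let i = 0; i < bins; i++) {
    const left = i > 0 ? hist[i - 1] : 0;
    const right = i < bins - 1 ? hist[i + 1] : 0;
    if (hist[i] >= floor && hist[i] > left && hist[i] >= right) peaks += 1;
  }
  return peaks;
}
```

A wall at one depth and trim proud of it by a few centimeters shows up as two peaks, which is exactly the signal to split or downgrade.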

The multi-reference validator is the practical version of that same idea. It compares scale estimates from different reference objects and checks whether they agree. If they do, I trust the measurement more. If they do not, I treat that disagreement as a sign that the scene may contain perspective distortion, lens issues, or a misdetection.

That approach is deliberately conservative. It does not try to rescue every region with a heroic guess. It asks whether the geometry is consistent enough to deserve confidence, and it only promotes the result when the scene agrees with itself.

How I keep zero-light capture from failing closed

The strongest part of the system is not that it works in ideal conditions. It is that it still returns something useful when the RGB image is bad. The depth-only segmentation path exists for total darkness scenarios, and that is a different failure mode from ordinary photo-based segmentation. If I can still read the LiDAR depth map, I can still find structure.

That matters because a hard failure would be the wrong answer in the field. In a dark room, the useful behavior is not "give up because the image is ugly." The useful behavior is "extract the geometry that remains, label the regions that survive thresholding and connected-component splitting, and keep the output usable for calibration or measurement."
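The routing decision itself can be as small as a guard. This is an illustrative sketch under my own assumptions (the post doesn't describe how the dark-frame decision is made; the threshold is a placeholder):

```typescript
// Hypothetical guard: route to the depth-only path when the RGB frame is
// too dark to trust but a LiDAR depth map is available.
function shouldUseDepthOnly(
  meanLuminance: number, // 0-1, average brightness of the RGB frame
  hasDepth: boolean,
  darkThreshold = 0.08 // assumed cutoff, not a tuned value
): boolean {
  return hasDepth && meanLuminance < darkThreshold;
}
```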

Closing

What I ended up with is a depth pipeline that reads levels, not photographs. It thresholds, groups, checks for discontinuities, and validates whether the resulting surfaces can be trusted, which is exactly why it still has something useful to say when the room is nearly dark.


🎧 Listen to the audiobook — Spotify · Google Play · All platforms
🎬 Watch the visual overviews on YouTube
πŸ“– Read the full 13-part series with AI assistant
