DEV Community: Rahul Sangamker

Why I Built My Own Image Annotation Tool, Then Open-Sourced It

Rahul Sangamker — Fri, 12 Jun 2026 04:46:12 +0000

Five years of computer vision work taught me that the bottleneck is never the model. It's the labeled data. So I built Reticle: a local-first, keyboard-driven annotation tool, now open source.

The inspiration: annotation is where CV projects go to stall

Every computer vision project I've worked on followed the same arc. The model architecture discussion takes a week. The training pipeline takes another. And then the team spends two months labeling images, because somebody has to, and the tooling makes it miserable.

The existing options each failed me in a specific way. Cloud platforms are genuinely good, but they require uploading client imagery to someone else's servers, which is a non-starter the moment an NDA is involved. Per-seat pricing punishes you for adding short-term labelers. And almost all of them are mouse-driven: click the class dropdown, select, draw, click save, click next. Those clicks feel free until you multiply them by ten thousand images.

So in early 2025 I built my own tool for my own projects, shaped by exactly the workflows I kept hitting. In 2026 I cleaned it up and open-sourced it as Reticle.

Design decision 1: local-first, files as the database

Reticle runs as a FastAPI backend plus a React frontend, both on your machine. It points at a folder (or a CSV) on your disk, and it writes annotations back to your disk as JSON and YOLO OBB format. There is no database, no account, no telemetry, no upload.

That sounds like a limitation until you work with regulated or NDA-covered imagery, and then it is the entire feature. Your data never leaves your control because there is no mechanism by which it could. A side benefit: your dataset is portable. The annotations live next to the images as plain files, so version control, backup, and pipeline integration all work with tools you already have.
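To make "plain files" concrete, here is a minimal sketch of writing one oriented box as a label line. It assumes the Ultralytics-style YOLO OBB layout (class index followed by four corner points, normalized to 0..1). The function name and arguments are illustrative, not Reticle's actual code.

```javascript
// Sketch: serialize one oriented bounding box as a YOLO OBB label line.
// Assumed layout (Ultralytics convention): class x1 y1 x2 y2 x3 y3 x4 y4,
// with x divided by image width and y divided by image height.
function toYoloObbLine(classId, cornersPx, imgW, imgH) {
    if (cornersPx.length !== 4) throw new Error('expected 4 corner points');
    const coords = cornersPx.flatMap(([x, y]) => [x / imgW, y / imgH]);
    return [classId, ...coords.map(v => v.toFixed(6))].join(' ');
}

const line = toYoloObbLine(2, [[100, 50], [300, 50], [300, 150], [100, 150]], 1000, 500);
console.log(line);
// 2 0.100000 0.100000 0.300000 0.100000 0.300000 0.300000 0.100000 0.300000
```

One line per box in a text file next to the image is all it takes for diffing and version control to work.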

Design decision 2: the keyboard is the product

The interface is built around hotkeys: number keys select classes, navigation and saving never require the mouse, and auto-save fires as you move between images. The mouse draws boxes; everything else is typing.

The math is unglamorous but decisive. If switching class plus saving plus advancing costs four seconds by mouse and under one second by keyboard, then across a 10,000-image dataset the difference is over eight working hours. Annotation tools are throughput tools, and throughput lives in the input device.
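The arithmetic behind that claim, using the round numbers from the paragraph above (four seconds by mouse, one second as the keyboard's upper bound):

```javascript
// Hours saved when a per-image action gets faster, across a whole dataset.
function hoursSaved(imageCount, secondsBefore, secondsAfter) {
    return (imageCount * (secondsBefore - secondsAfter)) / 3600;
}

console.log(hoursSaved(10000, 4, 1).toFixed(1)); // 8.3
```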

Design decision 3: meet the ML pipeline where it is

Real datasets do not arrive as a folder of images. They arrive as a CSV with image paths, metadata columns, and sometimes pre-annotations from an earlier model run. Reticle accepts that CSV directly: it can read the image list from a column, display metadata columns in the footer while you label (useful when context like location or capture date affects the label), and load pre-annotations in YOLO OBB format as editable starting boxes.
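As a rough illustration of that input shape, here is a sketch that splits a simple CSV into image paths plus per-image metadata. The column name image_path and the no-quoted-fields simplification are assumptions made for the example; Reticle's real column handling may differ.

```javascript
// Sketch: split a simple CSV (no quoted fields) into image paths plus metadata.
// The column name 'image_path' is a hypothetical example, not Reticle's schema.
function parseImageCsv(text, imageColumn = 'image_path') {
    const [headerLine, ...rows] = text.trim().split(/\r?\n/);
    const headers = headerLine.split(',').map(h => h.trim());
    const imageIdx = headers.indexOf(imageColumn);
    if (imageIdx === -1) throw new Error(`missing column: ${imageColumn}`);

    return rows.filter(r => r.trim() !== '').map(row => {
        const cells = row.split(',').map(c => c.trim());
        const metadata = {};
        // Every non-image column becomes footer metadata for that image.
        headers.forEach((h, i) => {
            if (i !== imageIdx) metadata[h] = cells[i] ?? '';
        });
        return { image: cells[imageIdx], metadata };
    });
}
```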

That last part matters most. Reticle supports a pluggable inference endpoint (the repo ships an example server), so you can point it at your current model and pre-label every image. The human's job shifts from drawing boxes to correcting them, which is dramatically faster, and the corrections become exactly the training data your model needs most. That loop (label, train, pre-label, correct, retrain) is the practical engine of most successful CV projects I've seen.

Design decision 4: boxes and keypoints

Bounding boxes cover most tasks, but my own work kept needing precise points: attachment locations, reference marks, structural landmarks. So classes in Reticle can define named keypoints with their own colors, and the same interface handles both annotation types.

What it deliberately is not

Reticle is not a Labelbox competitor. There is no team management, no review queues, no consensus scoring. It is a single-operator tool that optimizes for one thing: a skilled person labeling fast on their own machine with their own data. Knowing what not to build is most of what kept it small enough to finish, and small enough that the code is readable in an afternoon.

Try it

The repo is MIT-licensed with install scripts for Windows, Mac, and Linux: github.com/rs-03/Reticle

If you label images for a living, I would genuinely like to hear which workflow it fails on. That feedback is the roadmap.

More of my work, including live in-browser ML demos: rs-03.github.io

From Keypoints to Measurements: Why Landmarks Alone Are Useless

Rahul Sangamker — Wed, 10 Jun 2026 21:49:57 +0000

Every hand-tracking demo shows you 21 dots. The interesting part is what nobody shows: turning dots into numbers someone can act on.

Dots are a capability, not a product

Run any modern hand-tracking model and you get 21 beautifully stable landmarks per hand at 30 FPS. Impressive, and by itself worthless. No client has ever paid for dots. They pay for measurements: is this clearance compliant, is this part aligned, did this patient's range of motion improve.

I learned this on utility infrastructure work, where the deliverable was never "we detected the wire". It was the attachment height of that wire, and whether it violates clearance rules. Keypoints were step one of three.

The demo: live metrics, not just a skeleton

The demo I built derives three measurements per hand, every frame:

const wrist = lm[0];
const palm = distance(wrist, lm[9]);          // scale reference

const pinch = distance(lm[4], lm[8]) / palm;  // thumb tip ↔ index tip

The crucial line is the scale reference. Pixel distances are meaningless; they change as you move toward the camera. Dividing by palm length (wrist to middle knuckle) gives a relative measurement that's stable under distance, and multiplying by the average adult palm length (~8.5 cm) converts it into an approximate real-world gap. The demo shows "≈ 3.2 cm" floating on the pinch line. In infrastructure work the same role is played by a known object dimension: a standard crossarm, a pole class height. Every measurement-from-pixels system needs its ruler.
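A self-contained sketch of that ruler step, filling in the distance helper the excerpt relies on. The landmark indices follow the excerpt above; the 8.5 cm constant is the population average quoted in the text, so the centimetre value is an approximation, not a calibration.

```javascript
// Euclidean distance between two landmarks ({x, y} in any consistent unit).
function distance(a, b) {
    return Math.hypot(a.x - b.x, a.y - b.y);
}

const AVG_PALM_CM = 8.5; // average adult palm length from the text; a rough ruler only

// Thumb-tip to index-tip gap, in palm lengths and in approximate centimetres.
function pinchGap(lm) {
    const palm = distance(lm[0], lm[9]);      // wrist to middle-finger knuckle
    if (palm === 0) return null;              // degenerate detection, nothing to measure
    const relative = distance(lm[4], lm[8]) / palm;
    return { relative, approxCm: relative * AVG_PALM_CM };
}
```

Because both distances scale together as the hand approaches the camera, the ratio stays put while the raw pixel gap does not.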

Finger counting is a geometric test (is each fingertip farther from the wrist than its middle joint?), and "hand openness" averages fingertip extension. Three lines of geometry each, but they convert a model output into a readout a human understands instantly.
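Here is one way those two readouts could look, as a hedged sketch. It uses the standard MediaPipe hand indexing (tips at 4, 8, 12, 16, 20; middle joints at 3, 6, 10, 14, 18). Defining openness as the mean fingertip-to-wrist distance in palm lengths is my assumption for the example; the demo's exact formula may differ.

```javascript
// Tip and middle-joint indices per finger, in MediaPipe's 21-landmark layout.
const FINGERS = [
    { tip: 4, mid: 3 },    // thumb
    { tip: 8, mid: 6 },    // index
    { tip: 12, mid: 10 },  // middle
    { tip: 16, mid: 14 },  // ring
    { tip: 20, mid: 18 },  // pinky
];

function distance(a, b) {
    return Math.hypot(a.x - b.x, a.y - b.y);
}

// A finger counts as extended when its tip is farther from the wrist than its middle joint.
function countExtendedFingers(lm) {
    const wrist = lm[0];
    return FINGERS.filter(f => distance(wrist, lm[f.tip]) > distance(wrist, lm[f.mid])).length;
}

// Assumed definition: average fingertip-to-wrist distance, measured in palm lengths.
function handOpenness(lm) {
    const palm = distance(lm[0], lm[9]);
    if (palm === 0) return 0;
    const total = FINGERS.reduce((sum, f) => sum + distance(lm[0], lm[f.tip]), 0);
    return total / FINGERS.length / palm;
}
```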

Honest layering

The landmarks come from MediaPipe's pretrained pipeline (palm detector → landmark regressor → gesture classifier, float16, WASM + GPU delegate). They're Google's models, credited on the page. The engineering I own is the integration (lazy loading, render loop, throttled UI) and the measurement layer on top. Knowing when a pretrained model suffices and when you need to fine-tune your own (as I did for utility keypoints, where off-the-shelf models had never seen a crossarm) is most of the senior judgment in applied CV.

The hard problems hiding behind the dots

The demo derives three measurements with three lines of geometry each. Production measurement systems earn their keep on the problems the demo deliberately sidesteps:

Metric scale from a single camera. Palm length is a convenient ruler, but it assumes an average hand. Real systems calibrate against a known object, use stereo or depth sensors, or exploit multi-frame geometry. Choosing the ruler is the core design decision in every measurement-from-pixels system.
Jitter becomes error bars. Landmarks vibrate frame to frame. Filtering trades latency for stability (the one-euro filter is the workhorse), and the residual jitter should propagate into the output as explicit uncertainty: not "3.2 cm" but "3.2 plus or minus 0.3 cm". People make different decisions when they can see the error bars.
When the pretrained model stops being enough. Hands are the best-served keypoint domain in the world. Utility crossarms are not. The judgment call between adapting a pretrained model, fine-tuning on custom annotations, and training from scratch is driven by how far your structures sit from the model's training distribution, and it is most of the senior work in applied CV.
Temporal measurements. Static distances are the beginning. Velocities, ranges of motion, and repetition quality come from trajectories over time, which is where measurement systems start replacing human assessment instead of assisting it.
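Since the one-euro filter gets named above, here is a compact sketch of it for a single scalar (for example one landmark coordinate, or the pinch distance itself), plus a helper that turns residual jitter into a "value plus or minus spread" readout. The default parameters are placeholders to tune per signal, and none of this is taken from the demo's source.

```javascript
// One-euro filter: a low-pass filter whose cutoff rises with speed, so slow
// movement is smoothed hard and fast movement stays responsive.
class OneEuroFilter {
    constructor({ minCutoff = 1.0, beta = 0.01, dCutoff = 1.0 } = {}) {
        this.minCutoff = minCutoff;  // Hz, smoothing applied when the signal is still
        this.beta = beta;            // how quickly the cutoff grows with speed
        this.dCutoff = dCutoff;      // Hz, cutoff for the derivative estimate
        this.xPrev = null;
        this.dxPrev = 0;
        this.tPrev = null;
    }

    // Smoothing factor of a first-order low-pass at the given cutoff and time step.
    static alpha(cutoffHz, dt) {
        const tau = 1 / (2 * Math.PI * cutoffHz);
        return 1 / (1 + tau / dt);
    }

    filter(x, tSeconds) {
        if (this.xPrev === null) {
            this.xPrev = x;
            this.tPrev = tSeconds;
            return x;
        }
        const dt = tSeconds - this.tPrev;
        if (dt <= 0) return this.xPrev;  // ignore out-of-order timestamps

        const dx = (x - this.xPrev) / dt;
        const aD = OneEuroFilter.alpha(this.dCutoff, dt);
        const dxHat = aD * dx + (1 - aD) * this.dxPrev;

        const cutoff = this.minCutoff + this.beta * Math.abs(dxHat);
        const a = OneEuroFilter.alpha(cutoff, dt);
        const xHat = a * x + (1 - a) * this.xPrev;

        this.xPrev = xHat;
        this.dxPrev = dxHat;
        this.tPrev = tSeconds;
        return xHat;
    }
}

// Mean and population standard deviation of a window of recent measurements.
function meanAndStd(values) {
    const mean = values.reduce((s, v) => s + v, 0) / values.length;
    const variance = values.reduce((s, v) => s + (v - mean) ** 2, 0) / values.length;
    return { mean, std: Math.sqrt(variance) };
}
```

Feeding the last second of filtered values through meanAndStd gives the two numbers an error-bar readout needs.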

The pattern to steal

Whatever your domain: landmark → scale reference → relative measurement → threshold → decision. That last hop, from number to decision, is where the business value lives. Models are increasingly commodities; measurement systems are not.

Try it (pinch slowly and watch the number): rs-03.github.io/demos
Source: github.com/rs-03/rs-03.github.io

I Put a Neural Network Inside My Portfolio: No TensorFlow, No Server, 145 KB

Rahul Sangamker — Wed, 10 Jun 2026 21:49:54 +0000

Training a network from scratch in raw NumPy, quantizing it to int8, and running it as ~80 lines of dependency-free JavaScript, with a parity test proving the browser matches Python to within about 1e-6.

Why bother? MNIST is a solved problem

Digit recognition is the "hello world" of ML, and that's exactly why I used it. The model isn't the point. The point is everything around the model, which happens to be the part that matters in production work too: training without a framework, compressing for deployment, running inference in a constrained environment, and proving the deployed system matches the trained one.

Training: just NumPy and math

The network is a 784→128→64→10 MLP: hand-written forward pass, backpropagation, and Adam optimizer. No autograd, no framework:

# backward pass, by hand
dz3 = (probs - y_batch) / batch_size
grads_w[2] = a2.T @ dz3
da2 = dz3 @ weights[2].T
dz2 = da2 * (z2 > 0)          # ReLU mask
grads_w[1] = a1.T @ dz2
...

One trick that matters for a drawing demo specifically: shift augmentation. MNIST digits are centered; humans draw wherever they like. Training on randomly translated copies makes the model tolerant of sloppy placement. Combined with MNIST-style preprocessing at inference (crop to bounding box, scale into a 20×20 box, center by center-of-mass), real-world doodles classify reliably. Final test accuracy: 98.2%.
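Both ideas come down to translating a small image by whole pixels. The sketch below is a JavaScript rendition (the training-side augmentation in the post is Python) and assumes a row-major 28×28 Float32Array with 0 as background. The function names are illustrative.

```javascript
// Translate a square image by whole pixels; anything shifted off the edge is dropped.
function shiftImage(img, dx, dy, size = 28) {
    const out = new Float32Array(size * size);
    for (let y = 0; y < size; y++) {
        for (let x = 0; x < size; x++) {
            const nx = x + dx, ny = y + dy;
            if (nx >= 0 && nx < size && ny >= 0 && ny < size) {
                out[ny * size + nx] = img[y * size + x];
            }
        }
    }
    return out;
}

// Augmentation: a random translation of up to maxShift pixels in each axis.
function randomShift(img, maxShift = 3, size = 28, rng = Math.random) {
    const offset = () => Math.floor(rng() * (2 * maxShift + 1)) - maxShift;
    return shiftImage(img, offset(), offset(), size);
}

// Inference-side centering: move the intensity center of mass to the image center.
function centerByMass(img, size = 28) {
    let mass = 0, sumX = 0, sumY = 0;
    for (let y = 0; y < size; y++) {
        for (let x = 0; x < size; x++) {
            const v = img[y * size + x];
            mass += v;
            sumX += v * x;
            sumY += v * y;
        }
    }
    if (mass === 0) return Float32Array.from(img);  // blank canvas, nothing to center
    const dx = Math.round((size - 1) / 2 - sumX / mass);
    const dy = Math.round((size - 1) / 2 - sumY / mass);
    return shiftImage(img, dx, dy, size);
}
```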

Compression: int8 in 15 lines

A float32 weight file would be ~430 KB. Symmetric int8 quantization cuts it ~4×:

scale = np.abs(w).max() / 127.0
q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)

One scale factor per layer, with the int8 weights stored as base64 in JSON (base64 adds about a third on top of the raw ~109 KB): 145 KB total, and quantized test accuracy is identical to float at 98.2%.
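For the browser side of that round trip, here is a hedged sketch: the same symmetric scheme in JavaScript, plus base64 packing and unpacking. The function names are mine, and the JSON layout the demo actually ships is not shown in the post.

```javascript
// Symmetric int8 quantization with one scale per tensor, mirroring the NumPy lines above.
// Note: Math.round rounds exact halves up, np.round rounds them to even, so ties can differ.
function quantize(weights) {
    let maxAbs = 0;
    for (const w of weights) maxAbs = Math.max(maxAbs, Math.abs(w));
    const scale = maxAbs / 127 || 1;  // fall back to 1 for an all-zero tensor
    const q = Int8Array.from(weights, w => Math.max(-127, Math.min(127, Math.round(w / scale))));
    return { q, scale };
}

// Dequantize once on load: one multiply per weight.
function dequantize(q, scale) {
    return Float32Array.from(q, v => v * scale);
}

// Pack int8 weights as base64 so they can live inside a JSON file.
function int8ToBase64(q) {
    let binary = '';
    for (const byte of new Uint8Array(q.buffer, q.byteOffset, q.byteLength)) {
        binary += String.fromCharCode(byte);
    }
    return btoa(binary);
}

// Reverse of int8ToBase64: reinterpret the decoded bytes as signed 8-bit values.
function base64ToInt8(b64) {
    const bytes = Uint8Array.from(atob(b64), ch => ch.charCodeAt(0));
    return new Int8Array(bytes.buffer);
}
```

The worst-case rounding error per weight is half a quantization step, which is why re-checking accuracy after this step is cheap insurance.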

Inference: ~80 lines of plain JavaScript

In the browser, the weights are dequantized once on load, and inference is three matrix-vector products with ReLU and a softmax. That is ~109K multiply-adds (784×128 + 128×64 + 64×10), a sub-millisecond job on any modern device. No TensorFlow.js (that runtime is megabytes; the entire model is 145 KB).
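A forward pass of that shape fits in a few functions. This sketch assumes weights stored row-major as [inDim][outDim] (the layout NumPy's x @ W implies) and a layers array of { W, b, inDim, outDim } objects. Both are assumptions for illustration, not the demo's actual data structure.

```javascript
// y = x·W + b, optionally followed by ReLU. W is row-major [inDim][outDim].
function dense(x, W, b, inDim, outDim, relu) {
    const y = new Float32Array(outDim);
    for (let o = 0; o < outDim; o++) {
        let sum = b[o];
        for (let i = 0; i < inDim; i++) sum += x[i] * W[i * outDim + o];
        y[o] = relu && sum < 0 ? 0 : sum;
    }
    return y;
}

// Numerically stable softmax: subtract the max logit before exponentiating.
function softmax(logits) {
    const max = Math.max(...logits);
    const exps = Array.from(logits, v => Math.exp(v - max));
    const total = exps.reduce((s, v) => s + v, 0);
    return exps.map(v => v / total);
}

// ReLU after every layer except the last, then softmax over the final logits.
function predict(x, layers) {
    let a = x;
    layers.forEach((layer, idx) => {
        const isHidden = idx < layers.length - 1;
        a = dense(a, layer.W, layer.b, layer.inDim, layer.outDim, isHidden);
    });
    return softmax(a);
}
```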

The part most deployments skip

Deployed-vs-trained drift is a real production failure mode, so the JS engine is tested against the Python model directly: ten fixture digits, expected probabilities exported from training, asserted in Node:

max prob diff vs Python: 1.14e-6
correct: 10/10
PARITY OK

If I change the inference code and break numerical equivalence, CI knows before a visitor does. That habit, verifying the deployment artifact and not just the training run, is worth more than another accuracy point.
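The check itself is small. Below is a sketch of what such a parity harness might look like; the fixture shape ({ input, expectedProbs, label }) and the 1e-5 default tolerance are assumptions made for the example, not the repo's actual test.

```javascript
// Largest element-wise absolute difference between two probability vectors.
function maxAbsDiff(a, b) {
    let worst = 0;
    for (let i = 0; i < a.length; i++) worst = Math.max(worst, Math.abs(a[i] - b[i]));
    return worst;
}

// Index of the largest entry.
function argmax(values) {
    let best = 0;
    for (let i = 1; i < values.length; i++) if (values[i] > values[best]) best = i;
    return best;
}

// Run every fixture through the deployed predictor and compare with the
// probabilities exported from training. Fixture shape is hypothetical.
function checkParity(fixtures, predictFn, tolerance = 1e-5) {
    let worst = 0, correct = 0;
    for (const f of fixtures) {
        const probs = predictFn(f.input);
        worst = Math.max(worst, maxAbsDiff(probs, f.expectedProbs));
        if (argmax(probs) === f.label) correct++;
    }
    return { worst, correct, total: fixtures.length, ok: worst <= tolerance };
}
```

In CI, a failing ok flag would exit non-zero so a numerically different engine never ships unnoticed.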

Where this goes beyond MNIST

The demo is intentionally the simplest possible instance, because the point is the deployment discipline around it, and that discipline scales to systems that matter:

Quantization with a budget. Post-training int8 cost nothing here because the network is over-provisioned for the task. Real systems choose between quantization-aware training, mixed precision for sensitive layers, and distillation into a smaller student. The decision process stays identical: measure, compress, re-verify against parity fixtures.
The browser is becoming a serious inference target. WASM SIMD and WebGPU put surprisingly large models within reach of a static page. The interesting product class is anything where privacy is the feature: medical screeners, document processing, anything users would refuse to upload.
Parity testing as a discipline. The 1e-6 check between Python and JavaScript is a miniature of a production problem: proving the deployed artifact matches the trained one across runtimes, hardware, and compiler versions. Most teams discover deployment drift in production; fixtures catch it in CI.
From digits to your domain. Swap the dataset and the same pipeline covers gesture commands, anomaly scoring over sensor windows, or keyword spotting: train anywhere, export weights, verify parity, run on-device. The 145 KB ceiling is a design constraint that forces good decisions.

Try it (draw badly, it copes): rs-03.github.io/demos
Source: github.com/rs-03/rs-03.github.io: training script, inference engine, and parity test.

Testing Camouflage Against the Real Adversary: an AI

Rahul Sangamker — Wed, 10 Jun 2026 21:49:52 +0000

Camouflage has always been graded by human eyes. But the thing hunting for you in 2026 is increasingly a detection model, so test against that.

The premise

Surveillance is automated now: drones, trail cameras, perimeter systems. Most of what "sees" runs an object-detection network. Which makes traditional camouflage evaluation (a person squinting at a photo) the wrong test. The right test is adversarial: run the actual detector against your concealment and measure what it finds.

That's the whole demo: upload a photo, and an object-detection model hunts for people in it at four simulated distances, producing a detection-range profile and a stealth score.

Simulating distance with pixels

You can't move the camera after the photo is taken, but you can simulate the dominant factor in long-range detection: pixels on target. A person at 50 m simply occupies far fewer pixels than at 5 m. So each analysis run downscales the image progressively and re-runs detection:

const DISTANCE_LEVELS = [
    { label: 'Close (~5m)',    scale: 1    },
    { label: 'Mid (~15m)',     scale: 0.45 },
    { label: 'Far (~30m)',     scale: 0.22 },
    { label: 'Very far (~50m)', scale: 0.12 },
];

for (const level of DISTANCE_LEVELS) {
    const scaled = drawScaled(image, level.scale);
    const detections = await model.detect(scaled, 10, 0.15);
    // best 'person' confidence at this simulated range
}
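For reference, the idealized relationship between range and pixels on target is simple: under a pinhole camera model, apparent size falls off as one over distance. The sketch below computes that baseline. It is not the demo's code, and the hand-picked scales in the table above fall off more gently than this ideal curve.

```javascript
// Idealized pinhole model: linear size in pixels is proportional to 1 / distance.
function scaleForDistance(distanceM, referenceM = 5) {
    return referenceM / distanceM;
}

for (const d of [5, 15, 30, 50]) {
    console.log(`${d} m -> scale ${scaleForDistance(d).toFixed(3)}`);
}
// 5 m -> scale 1.000
// 15 m -> scale 0.333
// 30 m -> scale 0.167
// 50 m -> scale 0.100
```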

The output reads like a range card: detected at 5 m with 96% confidence, 41% at 15 m, invisible beyond 30 m. A stealth score aggregates it: how poorly did the adversary see you, averaged across ranges?
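One plausible way to compute those two numbers is sketched below. It assumes detections shaped like COCO-SSD's output ({ class, score, bbox }) and defines the stealth score as 100 times the mean miss probability across ranges. That aggregate is my assumption; the demo's exact formula is not given in the post.

```javascript
// Highest 'person' score among detections shaped like COCO-SSD output.
// Returns 0 when no person was detected at this range.
function bestPersonConfidence(detections) {
    return detections
        .filter(d => d.class === 'person')
        .reduce((best, d) => Math.max(best, d.score), 0);
}

// Assumed aggregate: 100 * mean(1 - confidence) over the simulated ranges.
// 0 means seen with full confidence everywhere, 100 means never detected.
function stealthScore(confidences) {
    if (confidences.length === 0) return 0;
    const meanMiss = confidences.reduce((s, c) => s + (1 - c), 0) / confidences.length;
    return 100 * meanMiss;
}
```

With the range card above (0.96, 0.41, then nothing at the two far ranges) this definition gives a score of about 66.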

Honest about the model

The detector is COCO-SSD (a pretrained MobileNet-based model from the TensorFlow.js team) running entirely on-device. I didn't train it, and the demo says so on the page. The contribution here is the evaluation framework: using detectors as adversaries, simulating range, and turning subjective "good camo" into a measurable profile. The full version of this concept goes further: a multi-model ensemble (YOLO, Faster R-CNN, RetinaNet), lighting variation, and heatmaps showing which region of you gave you away.

From toy to instrument

A single detector at four synthetic distances is the honest minimum that demonstrates the idea. Turning it into a real evaluation instrument is mostly known engineering plus a few open questions:

Ensembles disagree, and that is the point. YOLO-family, R-CNN-family, and transformer-based detectors fail differently. Concealment that defeats one family often fails against another, so a credible score must aggregate across architectures, the way a security audit uses multiple scanners.
Explanation, not just detection. The useful output for a designer is which region gave you away. Saliency maps over detector activations turn pass/fail into actionable feedback: break up the shoulder line; the head silhouette is carrying the detection.
Condition sweeps. Honest evaluation varies illumination, weather, motion blur, and sensor type. Thermal is the hard one: visible-spectrum camouflage does nothing against IR, and simulating thermal signatures faithfully is an open problem.
The moving-target problem. Detectors improve every year, so concealment effectiveness is a date-stamped claim. A serious evaluation service would re-test against current models continuously, like dependency scanning for the physical world.

The research framing: this is adversarial robustness studied from the defender's side of the camera, and the literature on physical adversarial attacks maps onto it almost one-to-one.

Why this framing matters beyond camouflage

"Evaluate against the deployed adversary, not a human proxy" generalizes: testing ad creatives against content classifiers, validating anonymization against re-identification models, red-teaming computer vision systems before someone else does. Adversarial evaluation is a product category hiding inside a security habit.

Try it (images analyzed on-device, nothing uploaded): rs-03.github.io/demos
Source: github.com/rs-03/rs-03.github.io

Mirror Therapy Without the Mirror Box: Treating Phantom Limbs in a Browser Tab

Rahul Sangamker — Wed, 10 Jun 2026 21:49:49 +0000

A 1990s Nobel-adjacent therapy, a webcam, and 21 hand keypoints, recreating the mirror-box illusion for phantom limb pain, no hardware required.

A therapy built on an illusion

In the 1990s, neuroscientist V.S. Ramachandran discovered something remarkable: amputees suffering phantom limb pain often felt relief just by seeing their missing limb move again. His apparatus was almost comically simple: a box with a mirror. Put your intact hand in, look at its reflection where the missing hand would be, and move. The brain, watching the "missing" hand obey commands again, often dials the pain down.

The limitation was never the science. It was the box: a physical apparatus, used in clinics, hard to scale, impossible to measure.

Replacing glass with keypoints

A webcam plus real-time hand tracking can produce the same illusion with better properties:

webcam frame → hand landmark model (21 keypoints, on-device)
→ reflect: phantom[i] = { x: 1 − x, y, z }
→ render real hand (solid) + phantom twin (ghost) on canvas

The reflection is one line of math. Everything around it is what makes the illusion land:

const phantom = real.map(p => ({ x: 1 - p.x, y: p.y, z: p.z }));

The visual treatment matters more than I expected. The phantom hand is rendered as a ghostly cyan skeleton with a translucent palm fill, a "breathing" glow that pulses on a ~3 second cycle, and a fading afterimage trail of its last few frames. It reads as present but ethereal, which is exactly the perceptual story mirror therapy needs to tell. A dashed mirror plane down the center of the frame makes the reflection relationship legible at a glance.
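The pieces around the reflection are small too. Here is a sketch of the pulse and the afterimage buffer, with drawing left out. The alpha range, the trail length, and the linear fade are placeholder choices of mine, not values from the demo.

```javascript
// Reflect normalized landmarks across the vertical center line (x in 0..1).
function mirrorHand(real) {
    return real.map(p => ({ x: 1 - p.x, y: p.y, z: p.z }));
}

// Glow opacity that "breathes" sinusoidally between min and max once per period.
function breathingAlpha(tMs, periodMs = 3000, min = 0.35, max = 0.8) {
    const phase = (2 * Math.PI * tMs) / periodMs;
    return min + (max - min) * 0.5 * (1 + Math.sin(phase));
}

// Keeps the last few phantom poses so they can be drawn as a fading trail.
class AfterimageTrail {
    constructor(maxFrames = 5) {
        this.maxFrames = maxFrames;
        this.frames = [];
    }

    push(hand) {
        this.frames.push(hand);
        if (this.frames.length > this.maxFrames) this.frames.shift();
    }

    // Oldest pose first and faintest; the newest pose gets alpha 1.
    entries() {
        const n = this.frames.length;
        return this.frames.map((hand, i) => ({ hand, alpha: (i + 1) / n }));
    }
}
```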

The engineering details that matter

Tracking: MediaPipe HandLandmarker (Google's pretrained model, credit where due), running via WebAssembly with GPU delegate. ~30 FPS on a laptop.
Privacy by architecture: every frame is processed on-device. For a medical-adjacent application, "video never leaves your browser" isn't a feature, it's a requirement.
Lazy loading: the model only downloads when the user clicks "Start the Mirror", so visitors who don't engage pay zero bandwidth.
Lifecycle hygiene: camera tracks stopped and the model closed on unmount; nothing leaks.
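The teardown in that last bullet can be sketched as one function called on unmount. The argument names are illustrative; the only APIs assumed are MediaStream.getTracks(), track.stop(), and a close() method on the landmarker object.

```javascript
// Release everything the mirror holds: the render loop, the camera, and the model.
// `stream` is the MediaStream from getUserMedia; `landmarker` is any object with close().
function stopMirror({ stream, landmarker, rafId } = {}) {
    if (rafId != null && typeof cancelAnimationFrame === 'function') {
        cancelAnimationFrame(rafId);                         // stop the render loop first
    }
    if (stream) stream.getTracks().forEach(track => track.stop());  // camera light goes off
    if (landmarker) landmarker.close();                      // free model resources
}
```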

Why digitize a working therapy?

Because software adds what glass can't: guided exercise sequences, session tracking, range-of-motion measurement over time (the keypoints are already numbers), and remote delivery to patients who will never visit a clinic with a mirror box. The browser preview demonstrates the core interaction; a WebXR version with a fully rendered 3D limb is the natural next step.
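To show how little it takes to get from keypoints to a range-of-motion number, here is a sketch: the angle at a joint from three landmarks, and the spread of that angle over a session. Which three landmarks define a clinically meaningful joint angle is left open; that choice is exactly the kind of decision that needs clinical input.

```javascript
// Angle in degrees at joint b, formed by the segments b->a and b->c.
// Landmarks are {x, y, z}; a missing z is treated as 0.
function jointAngleDeg(a, b, c) {
    const u = [a.x - b.x, a.y - b.y, (a.z ?? 0) - (b.z ?? 0)];
    const v = [c.x - b.x, c.y - b.y, (c.z ?? 0) - (b.z ?? 0)];
    const dot = u[0] * v[0] + u[1] * v[1] + u[2] * v[2];
    const nu = Math.hypot(...u), nv = Math.hypot(...v);
    if (nu === 0 || nv === 0) return NaN;                 // coincident landmarks
    const cos = Math.min(1, Math.max(-1, dot / (nu * nv)));  // clamp rounding overshoot
    return (Math.acos(cos) * 180) / Math.PI;
}

// Range of motion for a session: spread between the smallest and largest angle seen.
function rangeOfMotion(anglesDeg) {
    return Math.max(...anglesDeg) - Math.min(...anglesDeg);
}
```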

The research questions worth chasing

The browser preview proves the illusion mechanism. The interesting work starts after that:

Embodiment dosage. The research suggests the sense of ownership over the virtual limb drives relief. What level of visual fidelity does that require? A glowing skeleton, a stylized hand, or a photorealistic limb with matched skin tone? That is a testable dose-response question, and a browser-based system can run it at a scale no clinic can.
Therapy that measures itself. A physical mirror box produces zero data. Hand tracking produces 21 trajectories per frame, so range of motion, movement smoothness, and session adherence can be measured and trended across weeks. A therapy that quantifies its own effect can prove, or disprove, that it works for a given patient.
Adaptive difficulty. Pain-gated progression: exercises that advance only when movement quality and self-reported pain allow, which is how good physical therapists already operate. Encoding that judgment is a product problem, not a research one.
Lower limbs. Hands are the easy case. Lower-limb amputations are more common than upper-limb ones, and full-body pose models make a seated mirror-leg version plausible with the same architecture.

None of this needs new ML. It needs careful product and clinical work on top of commodity tracking, which is exactly the engineering that turns a demo into a deployable therapy tool.

Try it (camera optional, on-device only): rs-03.github.io/demos
Source: github.com/rs-03/rs-03.github.io

A demonstration of the interaction concept, not a medical device and not medical advice.

Your Cough Has a Fingerprint: Hand-Rolling an FFT and MFCCs in JavaScript

Rahul Sangamker — Wed, 10 Jun 2026 21:49:05 +0000

I built a personal cough-health monitor that runs entirely in the browser: no ML framework, no server, no audio ever leaving your device. Here's how, down to the math.

The idea: deviation, not classification

Most "AI cough detection" projects train a classifier on a population dataset: thousands of strangers' coughs, labeled sick or healthy. That approach has a fundamental problem: your healthy cough might sound like someone else's sick one.

So I inverted it. Record your own healthy cough a few times to establish a personal acoustic baseline. Later, the system answers a much easier question: how different does your cough sound from your own baseline? No training data needed. No model. Just signal processing, which means it can run anywhere, instantly, privately.

The pipeline

Every cough becomes a 24-dimensional acoustic fingerprint:

microphone → trim to cough peak → frame (Hamming window)
→ FFT → mel filterbank (26 filters) → log → DCT
→ 12 MFCCs → mean + std across frames = fingerprint

This is the classic MFCC (mel-frequency cepstral coefficients) front-end used in speech recognition for decades, but implemented from scratch in ~200 lines of JavaScript, because shipping TensorFlow.js for what is fundamentally an FFT felt absurd.
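A few of those stages are short enough to show whole. The sketch below covers the Hamming window, the textbook mel-scale conversion, evenly spaced filter centres, and the mean-plus-std summary. The ordering of the fingerprint (all means, then all stds) is an assumption of mine, and none of this is copied from dsp.js.

```javascript
// Hamming window of length n: tapers frame edges to reduce spectral leakage.
function hammingWindow(n) {
    if (n < 2) return new Float64Array(n).fill(1);
    return Float64Array.from({ length: n }, (_, i) =>
        0.54 - 0.46 * Math.cos((2 * Math.PI * i) / (n - 1)));
}

// Common mel-scale formulas (the 2595 * log10(1 + f / 700) variant).
const hzToMel = hz => 2595 * Math.log10(1 + hz / 700);
const melToHz = mel => 700 * (10 ** (mel / 2595) - 1);

// Centre frequencies of `count` filters spaced evenly on the mel scale.
function melCenters(count, minHz, maxHz) {
    const lo = hzToMel(minHz), hi = hzToMel(maxHz);
    return Array.from({ length: count }, (_, i) =>
        melToHz(lo + ((i + 1) * (hi - lo)) / (count + 1)));
}

// Summarize per-frame coefficient vectors as [means..., stds...].
// With 12 coefficients per frame this yields the 24-dimensional fingerprint.
function fingerprint(frames) {
    const dims = frames[0].length;
    const mean = new Array(dims).fill(0);
    const variance = new Array(dims).fill(0);
    for (const f of frames) f.forEach((v, k) => { mean[k] += v / frames.length; });
    for (const f of frames) f.forEach((v, k) => { variance[k] += (v - mean[k]) ** 2 / frames.length; });
    return [...mean, ...variance.map(Math.sqrt)];
}
```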

The FFT, in 40 lines

The heart is an iterative radix-2 Cooley–Tukey FFT:

for (let len = 2; len <= n; len <<= 1) {
    const angle = (-2 * Math.PI) / len;
    const wRe = Math.cos(angle), wIm = Math.sin(angle);
    for (let i = 0; i < n; i += len) {
        let curRe = 1, curIm = 0;
        for (let j = 0; j < len / 2; j++) {
            // butterfly: combine even/odd halves
            ...
        }
    }
}
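For readers who want the elided parts, here is a complete textbook version of the same loop structure, with the bit-reversal pass and butterfly written out, plus the naive DFT used as a reference. It is a generic sketch of the algorithm, not a copy of the repo's dsp.js.

```javascript
// In-place iterative radix-2 Cooley-Tukey FFT on separate real/imaginary arrays.
function fft(re, im) {
    const n = re.length;
    if (n < 1 || (n & (n - 1)) !== 0) throw new Error('length must be a power of two');

    // Bit-reversal permutation so butterflies can run in place.
    for (let i = 1, j = 0; i < n; i++) {
        let bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) {
            [re[i], re[j]] = [re[j], re[i]];
            [im[i], im[j]] = [im[j], im[i]];
        }
    }

    for (let len = 2; len <= n; len <<= 1) {
        const angle = (-2 * Math.PI) / len;
        const wRe = Math.cos(angle), wIm = Math.sin(angle);
        for (let i = 0; i < n; i += len) {
            let curRe = 1, curIm = 0;
            for (let j = 0; j < len / 2; j++) {
                const a = i + j, b = i + j + len / 2;
                // butterfly: twiddle the odd half, then add to and subtract from the even half
                const tRe = re[b] * curRe - im[b] * curIm;
                const tIm = re[b] * curIm + im[b] * curRe;
                re[b] = re[a] - tRe;
                im[b] = im[a] - tIm;
                re[a] += tRe;
                im[a] += tIm;
                // advance the twiddle factor by one step around the unit circle
                const nextRe = curRe * wRe - curIm * wIm;
                curIm = curRe * wIm + curIm * wRe;
                curRe = nextRe;
            }
        }
    }
}

// Naive O(n^2) DFT of a real signal, used only as a correctness reference.
function naiveDft(signal) {
    const n = signal.length;
    const re = new Array(n).fill(0), im = new Array(n).fill(0);
    for (let k = 0; k < n; k++) {
        for (let t = 0; t < n; t++) {
            const angle = (-2 * Math.PI * k * t) / n;
            re[k] += signal[t] * Math.cos(angle);
            im[k] += signal[t] * Math.sin(angle);
        }
    }
    return { re, im };
}
```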

Mel filters then pool FFT bins the way human hearing does (finer resolution at low frequencies, coarser at high), and a DCT decorrelates the log energies into compact coefficients. Two coughs are compared by cosine similarity between their fingerprints.
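The last two stages can be sketched as follows: a direct DCT-II over the log filterbank energies, and cosine similarity against a baseline. Skipping coefficient 0 and averaging baseline fingerprints into one vector are assumptions for the example; the post does not say how the demo handles either.

```javascript
// Direct DCT-II of the log mel energies, returning numCoeffs coefficients.
// Skipping coefficient 0 (an assumption here) discards overall log energy:
// a constant offset across all filters lands entirely in coefficient 0.
function dctII(x, numCoeffs, skipFirst = true) {
    const N = x.length;
    const start = skipFirst ? 1 : 0;
    const out = [];
    for (let k = start; k < start + numCoeffs; k++) {
        let sum = 0;
        for (let n = 0; n < N; n++) sum += x[n] * Math.cos((Math.PI * k * (n + 0.5)) / N);
        out.push(sum);
    }
    return out;
}

// Cosine similarity: 1 for the same direction, 0 for orthogonal vectors.
function cosineSimilarity(a, b) {
    let dot = 0, na = 0, nb = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
    }
    if (na === 0 || nb === 0) return 0;
    return dot / Math.sqrt(na * nb);
}

// Deviation of a new cough from the mean of the baseline fingerprints.
function deviationFromBaseline(baselineFingerprints, sample) {
    const mean = sample.map((_, i) =>
        baselineFingerprints.reduce((s, fp) => s + fp[i], 0) / baselineFingerprints.length);
    return 1 - cosineSimilarity(mean, sample);
}
```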

Verify, don't vibe

Hand-rolled DSP is exactly the kind of code that looks right and is subtly wrong. So the whole pipeline is pure functions, unit-tested in Node before it ever touched a browser:

FFT output vs. a naive O(n²) DFT reference: max difference 1e-14
Identical signal vs. itself: similarity 1.000
Same synthetic "cough" with added noise and 30% volume change: 0.998 (volume invariance matters; you won't cough at calibrated loudness)
Spectrally different burst: ~0.0

That last pair is the whole product: robust to irrelevant variation, sensitive to spectral change.

The demo is the floor, not the ceiling

The live demo is deliberately minimal: three baseline coughs, one comparison, one verdict. That is the smallest version that proves the core mechanism honestly, and it runs in thirty seconds on any device. The concept underneath is a research program:

Adaptive baselines. A fixed baseline ages; respiratory health drifts with seasons, allergies, and habits. The next step is a slowly adapting baseline (exponentially weighted, with change-point detection) that distinguishes "your cough evolved gradually" from "your cough changed overnight", which is the clinically interesting event.
Trajectories, not snapshots. A single deviation score is weak evidence. A two-week trend of rising deviation is a different signal entirely, and it is the one worth showing a doctor.
Confound separation. The honest open problem: distinguishing illness-driven spectral change from microphone distance, background noise, and time of day. Promising attacks include recording-condition normalization, per-session calibration sounds, and paired healthy/sick data per person.
Beyond coughs. Personal-baseline deviation generalizes to any repeated personal sound: voice fatigue for call-center workers and singers, breathing during sleep, machine acoustics on a factory floor. The pattern (baseline, deviation, trend) is the product; the cough is one instance.
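The first two items above can be prototyped in a few lines. This is a sketch under loud assumptions: the smoothing factor and jump threshold are arbitrary placeholders, and the threshold gate is a crude stand-in for real change-point detection.

```javascript
// Exponentially weighted baseline: each new sample nudges the baseline by alpha.
function updateBaseline(baseline, sample, alpha = 0.05) {
    return baseline.map((b, i) => (1 - alpha) * b + alpha * sample[i]);
}

// Absorb gradual drift, but flag sudden jumps instead of averaging them away.
// jumpThreshold is a placeholder; real change-point detection is more involved.
function observe(state, sample, deviation, { alpha = 0.05, jumpThreshold = 0.2 } = {}) {
    if (deviation > jumpThreshold) {
        return { baseline: state.baseline, flagged: true };
    }
    return { baseline: updateBaseline(state.baseline, sample, alpha), flagged: false };
}

// Least-squares slope of deviation against time (for example, per day).
function trendSlope(days, deviations) {
    const n = days.length;
    const meanX = days.reduce((s, v) => s + v, 0) / n;
    const meanY = deviations.reduce((s, v) => s + v, 0) / n;
    let sxy = 0, sxx = 0;
    for (let i = 0; i < n; i++) {
        sxy += (days[i] - meanX) * (deviations[i] - meanY);
        sxx += (days[i] - meanX) ** 2;
    }
    return sxx === 0 ? 0 : sxy / sxx;
}
```

A positive slope sustained over two weeks is the trajectory signal; a flagged observation is the overnight change.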

There is also a real research gap here: population models dominate the audio-health literature largely because labeled population data exists. Personalized baseline approaches have almost no shared benchmarks. Building one, using anonymized fingerprints rather than raw audio, would be a genuine contribution.

The pattern generalizes

This pattern, personal baseline + deviation scoring instead of population classification, applies way beyond coughs: machine vibration monitoring, voice fatigue, equipment acoustics. It's cheaper than collecting a labeled dataset, inherently personalized, and privacy-preserving by construction.

Try it live (your audio never leaves the page): rs-03.github.io/demos
Source: github.com/rs-03/rs-03.github.io. See dsp.js and its parity test.

Not a medical device; a demonstration of the signal-processing concept.