Running Facial CV in the Browser: What Breaks When You Refuse to Send Pixels to a Server

#ai #machinelearning #webdev #showdev

Most try-on demos I've seen cheat. The phone uploads a frame, a GPU somewhere in us-east-1 runs the heavy model, and the result streams back. It looks magical in a controlled demo and falls apart the moment someone's on hotel wifi or actually reads your privacy policy.

I build Mirrrd, a try-on tool for makeup. Early on I set a constraint that turned out to be the whole engineering problem: no facial frames leave the device. Not "encrypted in transit," not "deleted after 24 hours." They never go anywhere. That sounds like a marketing line, but it forces a pile of real technical decisions, and I want to walk through the ones that actually hurt.

The naive version works for about a second

The obvious approach: grab the webcam with getUserMedia, run a face landmark model on each frame, composite the makeup, draw to a canvas. On my desktop with a discrete GPU this hits 60fps and feels great. Ship it.

Then you open it on a three-year-old Android phone. The frame rate drops to single digits, the fans would spin up if it had any, and the battery visibly drains. The desktop demo lied to you because desktops have thermal headroom that phones don't. On-device means on that device, and the device you have to design for is the median phone, not your dev machine.

Where the time actually goes

I assumed inference would be the bottleneck. It mostly isn't, once you pick the right model size. The hidden costs are:

Getting the frame off the camera and into a tensor. The ImageBitmap → tensor copy is not free, and doing it naively means a round trip through CPU memory every frame.
The composite pass. Blending product color onto skin while respecting the existing lighting is per-pixel work, and if you do it in JavaScript on a canvas you've already lost.
Garbage collection. Allocating a new tensor every frame and letting it get collected gives you periodic stalls that read as "jank" even when average FPS looks fine.
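
To make the jank point concrete, here is a minimal sketch of measuring frame pacing instead of average FPS. It assumes you collect the timestamps that requestAnimationFrame passes to its callback; the names summarizeFrames and FrameStats are illustrative, not from any library.

```typescript
// Summarize frame pacing from a list of frame timestamps in milliseconds,
// e.g. the values passed to requestAnimationFrame callbacks.
interface FrameStats {
  averageFps: number;   // frames per second over the whole window
  p95FrameMs: number;   // 95th-percentile gap between consecutive frames
  worstFrameMs: number; // single longest gap; a GC pause shows up here
}

function summarizeFrames(timestampsMs: number[]): FrameStats {
  if (timestampsMs.length < 2) {
    throw new Error("need at least two timestamps");
  }
  // Gap between each frame and the one before it.
  const gaps: number[] = [];
  for (let i = 1; i < timestampsMs.length; i++) {
    gaps.push(timestampsMs[i] - timestampsMs[i - 1]);
  }
  const totalMs = timestampsMs[timestampsMs.length - 1] - timestampsMs[0];
  const sorted = [...gaps].sort((a, b) => a - b);
  // Nearest-rank percentile: smallest gap that covers 95% of the samples.
  const rank = Math.ceil(0.95 * sorted.length) - 1;
  return {
    averageFps: (gaps.length / totalMs) * 1000,
    p95FrameMs: sorted[rank],
    worstFrameMs: sorted[sorted.length - 1],
  };
}
```

With 99 frames at 16 ms and one 116 ms stall, the average still comes out near 59 fps while the worst frame is 116 ms. A single stall per hundred frames can even hide below the 95th percentile, which is why the worst gap is worth tracking on its own.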

The fixes are unglamorous. Reuse tensors instead of allocating per frame. Move the composite into a WebGL fragment shader so it runs on the GPU where the frame already lives. Keep the landmark model and the rendering on the same backend so you're not shuffling data between CPU and GPU twice per frame.
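
A minimal sketch of the reuse idea, not Mirrrd's actual code. It shows the CPU-side version: one output buffer allocated up front and overwritten every frame, so the steady state creates no garbage. The class name FrameToTensor and the 0..1 normalization are assumptions for illustration; with a GPU backend the same principle applies to textures rather than typed arrays.

```typescript
// Convert RGBA camera pixels into a normalized RGB float buffer without
// allocating per frame. The output buffer is created once and reused.
class FrameToTensor {
  private readonly width: number;
  private readonly height: number;
  private readonly out: Float32Array;

  constructor(width: number, height: number) {
    this.width = width;
    this.height = height;
    // Three channels per pixel; alpha is dropped.
    this.out = new Float32Array(width * height * 3);
  }

  // rgba holds width * height * 4 bytes, the layout of ImageData.data.
  // Returns the same Float32Array instance on every call, so callers must
  // consume it before the next frame overwrites it.
  fill(rgba: Uint8ClampedArray): Float32Array {
    const pixels = this.width * this.height;
    if (rgba.length !== pixels * 4) {
      throw new Error(`expected ${pixels * 4} bytes, got ${rgba.length}`);
    }
    for (let p = 0, src = 0, dst = 0; p < pixels; p++, src += 4, dst += 3) {
      this.out[dst] = rgba[src] / 255;
      this.out[dst + 1] = rgba[src + 1] / 255;
      this.out[dst + 2] = rgba[src + 2] / 255;
    }
    return this.out;
  }
}
```

The tradeoff is aliasing: because every call returns the same buffer, anything that holds on to a previous frame's result will see it change underneath it.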

Model size is a product decision, not a config value

You can run a 200-landmark face mesh or a lighter model. The heavy one is more stable around the lips and eyes, which matters for makeup placement. The light one runs on phones that would otherwise be excluded entirely.

I went back and forth and landed on shipping the lighter mesh and spending the saved budget on temporal smoothing instead. A slightly less precise landmark that's stable across frames looks far better than a precise one that jitters. Users read jitter as "broken." They read a 2px placement error as nothing at all, because they've never seen the ground truth.
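
The simplest form of temporal smoothing is an exponential moving average per landmark, and the sketch below assumes that filter purely for illustration; it is not necessarily what Mirrrd ships. The names LandmarkSmoother and Landmark2D and the alpha parameter are hypothetical. A production filter would likely adapt to motion speed so fast head turns don't lag behind the face.

```typescript
type Landmark2D = { x: number; y: number };

// Exponential moving average over landmark positions.
// alpha near 1 follows the raw input closely (more jitter, less lag);
// alpha near 0 smooths heavily (less jitter, more lag).
class LandmarkSmoother {
  private readonly alpha: number;
  private state: Landmark2D[] | null = null;

  constructor(alpha: number) {
    if (!(alpha > 0 && alpha <= 1)) {
      throw new Error("alpha must be in (0, 1]");
    }
    this.alpha = alpha;
  }

  // Feed one frame of raw landmarks and get the smoothed positions back.
  // The returned array is the smoother's internal state and is updated in
  // place on the next call.
  update(raw: Landmark2D[]): Landmark2D[] {
    // First frame, or the landmark count changed: reset to the raw values.
    if (this.state === null || this.state.length !== raw.length) {
      this.state = raw.map((p) => ({ x: p.x, y: p.y }));
      return this.state;
    }
    for (let i = 0; i < raw.length; i++) {
      const s = this.state[i];
      // Move a fraction alpha of the way toward the new measurement.
      s.x += this.alpha * (raw[i].x - s.x);
      s.y += this.alpha * (raw[i].y - s.y);
    }
    return this.state;
  }
}
```

With alpha set to 0.5, a landmark that jumps from (0, 0) to (10, 4) is reported at (5, 2) on the first frame after the jump and (7.5, 3) on the next, so a one-frame detection glitch moves the overlay only half as far as the raw landmark would.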

This is the kind of tradeoff that doesn't show up in a benchmark table. Landmark accuracy is measurable; "looks broken to a human" is not, and the second one is what gets you uninstalled.

What on-device genuinely costs you

I'll be honest about the downsides, because the privacy framing makes it tempting to pretend there aren't any.

You ship the model to every user. That's a multi-megabyte download before anything works, and you eat it on first load. Caching helps on repeat visits but the first impression is slower than a server-side competitor whose model the user never downloads.

You can't quietly improve the model for everyone overnight. Server-side, you swap a checkpoint and every user gets the upgrade. On-device, the old version keeps running until the user reloads and pulls the new bundle. Your model fleet is whatever versions happen to be cached out there.

And you lose the data flywheel, on purpose. Companies that collect try-on frames can train on real usage. I can't, because I don't have the frames, and I decided that's the trade. It means I improve the model from public and synthetic data and from explicit, opt-in feedback, which is slower. I think it's the right call for a beauty product where the input is literally your face, but it is a real cost and anyone telling you on-device is strictly better is selling something.

The part I'm still not happy with

Lighting estimation from a single uncalibrated webcam is hard, and doing it without a server-side model that's seen millions of frames is harder. The current version is conservative: when it can't confidently estimate your lighting, it tells you the preview may be off rather than rendering a confident-looking result that's wrong. I'd rather under-promise on a frame than show someone a foundation shade that looks perfect on screen and wrong in the mirror, which is the exact failure that makes people return products.
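
As a rough illustration of "warn instead of guessing," here is a hypothetical confidence gate. It is not Mirrrd's estimator and it does not estimate lighting at all; it only flags frames where the face region is too dark or too clipped for any estimate to be trusted. The function name checkLighting and every threshold are assumptions picked for the example.

```typescript
interface LightingCheck {
  meanLuma: number;        // average luma over the region, 0..255
  clippedFraction: number; // share of pixels near pure black or pure white
  confident: boolean;      // false means: show the "preview may be off" notice
}

// rgba: pixels from the face region, 4 bytes per pixel (ImageData.data layout).
// Thresholds are illustrative defaults, not tuned values.
function checkLighting(
  rgba: Uint8ClampedArray,
  minMeanLuma: number = 40,
  maxClippedFraction: number = 0.2,
): LightingCheck {
  if (rgba.length === 0 || rgba.length % 4 !== 0) {
    throw new Error("expected a non-empty RGBA buffer");
  }
  const pixels = rgba.length / 4;
  let lumaSum = 0;
  let clipped = 0;
  for (let i = 0; i < rgba.length; i += 4) {
    // Rec. 601 luma weights for R, G, B.
    const luma = 0.299 * rgba[i] + 0.587 * rgba[i + 1] + 0.114 * rgba[i + 2];
    lumaSum += luma;
    // Pixels this close to the ends of the range carry little color signal.
    if (luma < 8 || luma > 247) clipped++;
  }
  const meanLuma = lumaSum / pixels;
  const clippedFraction = clipped / pixels;
  return {
    meanLuma,
    clippedFraction,
    confident: meanLuma >= minMeanLuma && clippedFraction <= maxClippedFraction,
  };
}
```

A gate like this only catches the obvious failures, such as a dim room or a blown-out window behind the user. It says nothing about color temperature, which is where a foundation shade actually goes wrong.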

That conservatism is a stopgap, not a solution. Real ambient light estimation on-device is still open enough that I won't pretend I've closed it.

If you want to see how the on-device version actually performs on your phone, it's at mirrrd.com. It's free during the beta, partly because I genuinely need to know which phones it falls over on, and the only way to find that out without collecting your camera data is to ask you.

Top comments (1)

Vic Chen • Jun 5

Good piece. The point that the expensive parts are often frame-to-tensor copies, compositing, and GC stalls rather than raw model inference feels very true for real-time browser ML. I also agree with choosing a lighter mesh plus temporal smoothing over a heavier but jittery model. Users forgive a tiny placement error much faster than visible instability. I’d love to see a follow-up with mid-range Android numbers for FPS, battery drain, and thermal throttling over a 3 to 5 minute session.