Zihang Dong 董子航

Posted on May 20

I Open-Sourced a Browser-Based AI Background Remover — Here's the Full Architecture

#webdev #javascript #ai #opensource

Most background removal tools work like this: upload your photo to a server, wait for an AI model to process it, download the result. Your image sits on someone else's infrastructure. You hope they delete it.

I built one that works differently. The AI model runs in your browser tab. Your image never leaves your device. And I just open-sourced the core logic — two files, zero dependencies beyond a CDN import.

Here's how it works under the hood.

The Pipeline

The full flow from "user drops an image" to "transparent PNG download" goes through five stages:

Upload → ONNX Model Load → WebAssembly Inference → Mask Generation → Canvas Compositing

Each stage runs entirely client-side. Let me walk through them.

Stage 1: Loading the AI Model in the Browser

The backbone is @imgly/background-removal, an open-source library that bundles an ONNX segmentation model with ONNX Runtime Web (WebAssembly backend).

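const LIB_CDN = 'https://cdn.jsdelivr.net/npm/@imgly/background-removal@1.5.5';

// Populated by loadLibrary(); used later by the inference step
let removeBackgroundFn;

async function loadLibrary() {
  // jsDelivr's /+esm endpoint serves the package as a browser-ready ES module
  const module = await import(LIB_CDN + '/+esm');
  removeBackgroundFn = module.removeBackground;
}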

The first call downloads ~40MB of model weights. That sounds heavy, but:

The browser caches the files after the first download
Subsequent uses load the weights from cache instead of the network
Processing itself never needs a server round-trip

This is the same trade-off FFmpeg.wasm makes — big initial download, but then your browser becomes a local processing powerhouse.
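Because the first load is expensive, it helps to make sure it only ever starts once, even if the user triggers it twice in quick succession. Here is a minimal sketch of one way to do that. `createLazyLoader` is a hypothetical helper, not part of the library or the open-sourced code; it simply memoizes whatever promise the import function returns.

```javascript
// Hypothetical helper: wraps an async import so it runs at most once.
// Concurrent callers all receive the same pending promise.
function createLazyLoader(importFn) {
  let pending = null;
  return function load() {
    if (pending === null) {
      // The Promise executor runs synchronously, so importFn starts right away;
      // a synchronous throw is turned into a rejection.
      pending = new Promise((resolve) => resolve(importFn())).catch((err) => {
        pending = null; // allow a retry after a failed download
        throw err;
      });
    }
    return pending;
  };
}

// Usage in the browser (assumes the LIB_CDN constant from the snippet above):
// const loadRemover = createLazyLoader(() => import(LIB_CDN + '/+esm'));
// const { removeBackground } = await loadRemover();
```

Resetting the cached promise on failure means a flaky network on the first attempt does not permanently break the tool.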

Stage 2: Running AI Inference Locally

Once the model is loaded, inference is straightforward:

const imageBlob = await new Promise(r => canvas.toBlob(r, 'image/png'));

const resultBlob = await removeBackgroundFn(imageBlob, {
  model: 'medium',
  output: { format: 'image/png' },
  progress: (key, current, total) => {
    // Update loading UI
  }
});

What's happening behind the scenes:

The library resizes your image to the model's input dimensions
Pixel data is converted to a tensor
ONNX Runtime Web runs the segmentation model via WebAssembly
The output tensor (a per-pixel foreground probability map) is converted back to an image with transparent background
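To make the tensor steps concrete, here is a rough sketch of the two conversions at either end of inference. This is not the library's actual code: the function names are made up for illustration, the resize step is omitted, and the simple divide-by-255 scaling is an assumption (real models usually apply their own mean and standard deviation normalization).

```javascript
// Convert interleaved RGBA bytes (the layout of ImageData.data) into a planar
// Float32Array in channel-first (CHW) order, scaled to [0, 1]. Alpha is dropped.
function rgbaToChwTensor(rgba, width, height) {
  const plane = width * height;
  const tensor = new Float32Array(3 * plane);
  for (let p = 0; p < plane; p++) {
    tensor[p] = rgba[p * 4] / 255;                 // R plane
    tensor[plane + p] = rgba[p * 4 + 1] / 255;     // G plane
    tensor[2 * plane + p] = rgba[p * 4 + 2] / 255; // B plane
  }
  return tensor;
}

// Write a per-pixel foreground probability map (values in [0, 1]) into the
// alpha channel of a copy of the original pixels.
function applyProbabilityAsAlpha(rgba, probabilities) {
  const out = new Uint8ClampedArray(rgba);
  for (let p = 0; p < probabilities.length; p++) {
    out[p * 4 + 3] = Math.round(probabilities[p] * 255);
  }
  return out;
}
```

Keeping the probability as a continuous alpha value, rather than thresholding it, is what lets soft edges survive into the final image.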

The medium model balances quality and speed. On a decent laptop, inference takes 2-5 seconds for a typical photo. On a phone, maybe 8-15 seconds. Acceptable for a free, private tool.

Stage 3: Building the Editable Mask

Here's where it gets interesting. The AI output isn't final — it's a starting point. I extract the alpha channel from the AI result and build an editable grayscale mask:

async function buildMaskFromResult() {
  const w = originalImage.naturalWidth;
  const h = originalImage.naturalHeight;

  // Draw AI result to a temporary canvas
  const resultCanvas = document.createElement('canvas');
  resultCanvas.width = w;
  resultCanvas.height = h;
  const rCtx = resultCanvas.getContext('2d');
  // Scale to w×h in case the result's dimensions differ from the original's
  rCtx.drawImage(resultImg, 0, 0, w, h);
  const resultData = rCtx.getImageData(0, 0, w, h);

  // Extract alpha channel → grayscale mask
  // White = foreground (keep), Black = background (remove)
  maskCanvas = document.createElement('canvas');
  maskCanvas.width = w;
  maskCanvas.height = h;
  maskCtx = maskCanvas.getContext('2d');
  const maskData = maskCtx.createImageData(w, h);

  for (let i = 0; i < resultData.data.length; i += 4) {
    const alpha = resultData.data[i + 3];
    maskData.data[i] = alpha;     // R
    maskData.data[i + 1] = alpha; // G
    maskData.data[i + 2] = alpha; // B
    maskData.data[i + 3] = 255;   // A (mask itself is always opaque)
  }
  maskCtx.putImageData(maskData, 0, 0);
}

Why a separate mask canvas?

Because users need to fix the AI's mistakes. Hair edges, transparent objects, similar-colored backgrounds — no AI gets these perfect 100% of the time. The mask canvas becomes a paintable surface.

Stage 4: Manual Refinement with Brush & Eraser

This is the feature that separates a toy demo from a usable tool. Users can:

Brush (paint white on mask) → restore foreground areas the AI removed
Eraser (paint black on mask) → remove background areas the AI missed

function paintOnMask(e) {
  const rect = editCanvas.getBoundingClientRect();
  const x = (e.clientX - rect.left) / rect.width * maskCanvas.width;
  const y = (e.clientY - rect.top) / rect.height * maskCanvas.height;

  const brushSize = parseInt(brushSizeEl.value, 10);
  const softness = parseInt(brushSoftEl.value, 10) / 100;

  maskCtx.lineCap = 'round';
  maskCtx.lineWidth = brushSize;

  // Softness = blur filter on the mask canvas context.
  // Reset to 'none' at zero softness so a blur from an earlier stroke doesn't persist.
  maskCtx.filter = softness > 0
    ? `blur(${Math.round(brushSize * softness * 0.3)}px)`
    : 'none';

  if (currentTool === 'brush') {
    maskCtx.globalCompositeOperation = 'lighter';
    maskCtx.strokeStyle = '#ffffff';
  } else {
    maskCtx.globalCompositeOperation = 'source-over';
    maskCtx.strokeStyle = '#000000';
  }

  maskCtx.beginPath();
  maskCtx.moveTo(lastX, lastY);
  maskCtx.lineTo(x, y);
  maskCtx.stroke();

  // Remember this point so the next segment starts where this one ended
  lastX = x;
  lastY = y;
}

Key details:

Coordinate mapping: The edit canvas is CSS-scaled to fit the viewport, but the mask operates at full image resolution. Every mouse position gets mapped from display coordinates to mask coordinates.
Edge softness: The stroke is drawn with the Canvas 2D context's filter property set to blur(), which creates feathered edges instead of hard cuts.
Undo stack: Each mousedown saves a full ImageData snapshot of the mask. Up to 20 undo levels.
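The undo stack described above can be sketched as a small capped stack. This is an illustrative version under stated assumptions, not the code from the repository: `createUndoStack` is a hypothetical name, and the limit of 20 mirrors the figure mentioned in the list.

```javascript
// Hypothetical capped undo stack: keeps at most `limit` snapshots and
// discards the oldest one when the cap is exceeded.
function createUndoStack(limit = 20) {
  const snapshots = [];
  return {
    push(snapshot) {
      snapshots.push(snapshot);
      if (snapshots.length > limit) snapshots.shift();
    },
    // Returns the most recent snapshot, or null when nothing is left to undo
    pop() {
      return snapshots.length > 0 ? snapshots.pop() : null;
    },
    get size() {
      return snapshots.length;
    },
  };
}

// Usage in the browser (maskCtx and maskCanvas as in the snippets above):
// const undo = createUndoStack(20);
// on mousedown: undo.push(maskCtx.getImageData(0, 0, maskCanvas.width, maskCanvas.height));
// on undo:      const prev = undo.pop(); if (prev) maskCtx.putImageData(prev, 0, 0);
```

One thing to keep in mind: at 4000×3000, each RGBA snapshot is about 48MB, so a full 20-level stack could approach 1GB. A lower limit may be worth considering for large images.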

The brush cursor is a position: fixed div that follows the mouse, sized to match the display-scaled brush diameter. The actual canvas cursor is set to none.
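The two directions of scaling (pointer position into mask space, brush size back into display space) can be written as pure helpers. The function names below are hypothetical and the sketch assumes the brush size is expressed in mask pixels, as the paint function above implies.

```javascript
// Map a pointer position in viewport coordinates to mask pixel coordinates.
// `rect` is the result of editCanvas.getBoundingClientRect().
function displayToMask(clientX, clientY, rect, maskWidth, maskHeight) {
  return {
    x: ((clientX - rect.left) / rect.width) * maskWidth,
    y: ((clientY - rect.top) / rect.height) * maskHeight,
  };
}

// Convert a brush diameter in mask pixels to the on-screen diameter
// that the cursor div should have.
function brushCursorDiameter(brushSize, rect, maskWidth) {
  return (brushSize * rect.width) / maskWidth;
}
```

For a 4000px-wide mask shown in a 400px-wide canvas, a 50px brush would get a 5px cursor.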

Stage 5: Compositing the Final Output

To generate the downloadable PNG, the mask is applied to the original image:

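function applyMaskToOriginal() {
  // Assumes the original, mask, and output canvases share the mask's dimensions
  const w = maskCanvas.width;
  const h = maskCanvas.height;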
  const origData = origCtx.getImageData(0, 0, w, h);
  const mData = maskCtx.getImageData(0, 0, w, h);
  const outData = oCtx.createImageData(w, h);

  for (let i = 0; i < origData.data.length; i += 4) {
    outData.data[i] = origData.data[i];       // R — original
    outData.data[i + 1] = origData.data[i + 1]; // G — original
    outData.data[i + 2] = origData.data[i + 2]; // B — original
    outData.data[i + 3] = mData.data[i];       // A — from mask R channel
  }

  oCtx.putImageData(outData, 0, 0);
  return outCanvas;
}

The mask's R channel (which equals G and B since it's grayscale) becomes the alpha channel of the output. White mask pixels → fully opaque. Black → fully transparent. Gray → semi-transparent (useful for hair and soft edges).

The Refine Mode Overlay

In refine mode, users see the original image with a semi-transparent red overlay on removed areas:

function renderMaskOverlay() {
  editCtx.drawImage(maskCanvas, 0, 0, dw, dh);
  const overlayData = editCtx.getImageData(0, 0, dw, dh);

  for (let i = 0; i < overlayData.data.length; i += 4) {
    const maskVal = overlayData.data[i];
    if (maskVal < 128) {
      // Removed area → semi-transparent red
      overlayData.data[i] = 220;     // R
      overlayData.data[i + 1] = 50;  // G
      overlayData.data[i + 2] = 50;  // B
      overlayData.data[i + 3] = 120; // A
    } else {
      // Kept area → fully transparent (show original underneath)
      overlayData.data[i + 3] = 0;
    }
  }
  editCtx.putImageData(overlayData, 0, 0);
}

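This gives immediate visual feedback: you can see exactly what the AI removed and paint corrections in real time. Note that putImageData replaces pixels instead of blending them, so the kept areas end up transparent on the edit canvas and the original image has to be rendered on a layer beneath it (for example, a second canvas or an img element).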

Performance Considerations

Memory: Three full-resolution canvases live in memory (original, mask, output). For a 4000×3000 photo, that's ~144MB of pixel data. Mobile devices with <4GB RAM may struggle.
Real-time rendering: Every brush stroke triggers renderPreview() via requestAnimationFrame. This redraws the preview canvas + overlay from the mask. On large images, there's a noticeable lag.
Touch support: Touch listeners are registered with passive: false so that preventDefault() can stop the page from scrolling while the user paints.
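Two of these points are easy to put into code. The sketch below shows the memory arithmetic behind the 144MB figure, plus one possible way to keep pointer events from queuing more than one redraw per frame. Both function names are hypothetical, and the coalescer is a suggestion rather than a description of how renderPreview() is currently scheduled.

```javascript
// Raw pixel memory for N full-resolution RGBA canvases (4 bytes per pixel)
function estimateCanvasBytes(width, height, canvasCount) {
  return width * height * 4 * canvasCount;
}

// Collapse many render requests into at most one render per frame.
// `schedule` defaults to requestAnimationFrame in the browser.
function createFrameCoalescer(render, schedule = (cb) => requestAnimationFrame(cb)) {
  let queued = false;
  return function requestRender() {
    if (queued) return; // a render is already pending for this frame
    queued = true;
    schedule(() => {
      queued = false;
      render();
    });
  };
}

// Usage in the browser (renderPreview as named in the article):
// const requestRender = createFrameCoalescer(renderPreview);
// editCanvas.addEventListener('mousemove', () => requestRender());
```

estimateCanvasBytes(4000, 3000, 3) gives 144,000,000 bytes, which matches the figure above. Pointer events can fire more often than the display refreshes, so coalescing avoids redundant full-canvas redraws.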

What I Stripped for the Open-Source Version

The production version on ToolKnit includes:

Daily usage limits (fair-use throttling)
Analytics tracking
Self-hosted model weights (faster loading from our CDN)
Sound effects on completion
Site navigation and SEO shell

The open-source version strips all of that down to two files:

index.html — standalone UI (~250 lines)
app.js — core logic (~380 lines)

You can clone it, run npx serve ., and have a working background remover in 30 seconds.

What's Next

Some ideas for anyone who wants to fork and extend:

Background replacement — solid color or custom image behind the subject
Batch processing — drop multiple images, process all sequentially
WebGPU acceleration — ONNX Runtime Web supports WebGPU; inference could be 3-5x faster
Edge feathering controls — post-process the mask with adjustable blur radius
Before/after slider — drag to compare original and result
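As a starting point for the first idea, here is a minimal sketch of solid-color background replacement operating directly on RGBA pixel data. `compositeOverColor` is a hypothetical helper, not something in the repository; it assumes straight (non-premultiplied) alpha, which is how ImageData stores pixels.

```javascript
// Blend each pixel over a solid background color using its alpha value:
// out = foreground * a + background * (1 - a). The result is fully opaque.
function compositeOverColor(rgba, [bgR, bgG, bgB]) {
  const out = new Uint8ClampedArray(rgba.length);
  for (let i = 0; i < rgba.length; i += 4) {
    const a = rgba[i + 3] / 255;
    out[i] = rgba[i] * a + bgR * (1 - a);
    out[i + 1] = rgba[i + 1] * a + bgG * (1 - a);
    out[i + 2] = rgba[i + 2] * a + bgB * (1 - a);
    out[i + 3] = 255;
  }
  return out;
}

// Usage in the browser (oCtx, w, h as in applyMaskToOriginal above):
// const cut = oCtx.getImageData(0, 0, w, h);
// const flat = compositeOverColor(cut.data, [255, 255, 255]);
// oCtx.putImageData(new ImageData(flat, w, h), 0, 0);
```

Semi-transparent mask values blend smoothly into the new color, so hair and soft edges keep their feathering.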