Google MediaPipe
Google MediaPipe is a suite of libraries and tools that make it very simple to drop ML into apps — supporting vision, text and audio tasks — without needing to be an ML expert or spin up expensive cloud infrastructure. It runs fast, on-device, and gives you tools to build interactive experiences that work in the real world.
The standout feature: on-device inference with support for Android, iOS, Python, and the web. That means no round-trips, no user data sent to servers, and minimal latency.
MediaPipe supports a wide range of use cases, such as LLM inference, object detection, gesture recognition, and much more. Demoing the entire API would be an epic in itself, so to keep things focused I'll demo Face Landmark detection. For more, see the docs.
Face Landmark
MediaPipe’s Face Landmarker lets you track 3D face landmarks and expressions in real time — whether from single frames or live video. You get blendshape scores for expressions, 3D points for facial geometry, and matrices for applying effects. Great for filters, avatars, or anything that takes facial input.
This demo uses the BlazeFace (short-range) model, which is optimized for selfie cameras.
At the time of writing, BlazeFace (short-range) is the only model available for this task; BlazeFace (full-range) and BlazeFace Sparse (full-range) are coming soon and may be worth checking out if the short-range model doesn't fit your use case.
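To give a sense of the API before the live-video demo below, here's a rough sketch of single-image detection with the @mediapipe/tasks-vision package. The CDN and model URLs mirror the ones used later in this post; everything else (the npm-style import, the img selector) is illustrative rather than taken from the demo.
// Sketch: detect landmarks in a single still image (IMAGE mode).
// Assumes an ES module context and `npm i @mediapipe/tasks-vision`;
// the demo below loads the same classes from a CDN instead.
import { FaceLandmarker, FilesetResolver } from "@mediapipe/tasks-vision";

const fileset = await FilesetResolver.forVisionTasks(
  "https://cdn.jsdelivr.net/npm/@mediapipe/tasks-vision@0.10.3/wasm"
);
const landmarker = await FaceLandmarker.createFromOptions(fileset, {
  baseOptions: {
    modelAssetPath:
      "https://storage.googleapis.com/mediapipe-models/face_landmarker/face_landmarker/float16/1/face_landmarker.task",
  },
  runningMode: "IMAGE",
});

const result = landmarker.detect(document.querySelector("img"));
console.log(result.faceLandmarks[0]); // array of normalized 3D points for the first face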
Demo
Try the working demo here:
👉 https://monkey-ears-filter.vercel.app
or run locally:
- Clone the repo: git clone git@github.com:kenzic/monkey-ears-filter.git && cd monkey-ears-filter
- Start the server: npm run start
- Then open http://127.0.0.1:3030/
In the monkey-ears-filter/public folder, the core logic is in filter.js.
What the code does
Imports & Constants
- Pulls in FaceLandmarker & FilesetResolver from MediaPipe’s vision tasks.
- Defines landmark indices for outer eyes and ears.
DOM References
- Gets video and canvas elements and 2D context, setting up an overlay on the webcam feed.
Ear Image Loader
- makeEar() returns a new transparent ear image on each call (sketched after this outline).
Webcam Setup (setupCamera)
- Requests user media, attaches it to the video element, and waits for metadata before proceeding (see the sketch after this outline).
Model Initialization (loadFaceLandmarker)
- Loads WASM runtime via FilesetResolver.forVisionTasks(…).
- Creates a FaceLandmarker in LIVE_STREAM mode: GPU delegate, up to 2 faces, outputs blendshapes.
Tilt Calculation (getRollAngle)
- Computes head tilt using atan2 of two eye landmarks.
Ear Overlay (drawEars)
- Positions, rotates, and draws mirrored ear images at detected ear landmark positions.
Render Loop (render)
- Matches canvas size to video, runs faceLandmarker.detectForVideo(…), overlays video + ear images, and loops via requestAnimationFrame.
Startup (main)
- On button click: hides UI, starts camera, loads model, and begins the render loop.
The main components of this app are loading the model, handling the overlay, and rendering it to the screen.
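Two of the helpers from the outline, makeEar() and setupCamera(), aren't covered in the walkthrough below. Here's a minimal sketch of what they typically look like; the actual implementations live in filter.js, and the ear image path is a placeholder.
// Sketch of the two helpers described above (placeholder asset path, not the repo's exact code).
function makeEar() {
  const img = new Image();
  img.src = "ear.png"; // transparent PNG of a monkey ear; cheap to recreate per frame once cached
  return img;
}

async function setupCamera() {
  const stream = await navigator.mediaDevices.getUserMedia({ video: true, audio: false });
  video.srcObject = stream;
  // Wait for the video dimensions to be known before sizing the canvas and starting detection.
  await new Promise((resolve) => {
    video.onloadedmetadata = () => {
      video.play();
      resolve();
    };
  });
}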
🧠 Model Initialization
This part initializes WebAssembly and the face landmarking model:
async function loadFaceLandmarker() {
  const filesetResolver = await FilesetResolver.forVisionTasks(
    "https://cdn.jsdelivr.net/npm/@mediapipe/tasks-vision@0.10.3/wasm"
  );
  faceLandmarker = await FaceLandmarker.createFromOptions(filesetResolver, {
    baseOptions: {
      modelAssetPath: "https://storage.googleapis.com/mediapipe-models/face_landmarker/face_landmarker/float16/1/face_landmarker.task",
      delegate: "GPU",
    },
    outputFaceBlendshapes: true,
    runningMode: "LIVE_STREAM",
    numFaces: 2,
  });
}
FilesetResolver.forVisionTasks(…) downloads MediaPipe’s WASM runtime optimized for your browser.
createFromOptions(…) sets:
- baseOptions.delegate: "GPU" → uses WebGL/WebGPU to accelerate inference.
- outputFaceBlendshapes: true → returns blendshape coefficients for expression data.
- runningMode: "LIVE_STREAM" → asynchronous video mode for real-time streams.
- numFaces: 2 → allows up to two faces to be tracked.
Result: faceLandmarker is a live-stream-ready tracker with GPU acceleration and blendshape output, capable of handling up to two faces.
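Because outputFaceBlendshapes is enabled, every result also carries per-face expression scores, even though the demo only uses the landmark positions. A quick sketch of reading them inside the render loop (the 0.5 threshold and the logged values are just illustrative):
// Inside render(), after detectForVideo(...): log the strongest blendshapes for the first face.
const face = res.faceBlendshapes?.[0];
if (face) {
  for (const { categoryName, score } of face.categories) {
    if (score > 0.5) {
      console.log(categoryName, score.toFixed(2)); // e.g. "mouthSmileLeft 0.83"
    }
  }
}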
🎧 Ear Overlay
function drawEars(landmarks) {
  const img = makeEar(); // fresh Image each frame
  const left = landmarks[LANDMARKS.LEFT_EAR];
  const right = landmarks[LANDMARKS.RIGHT_EAR];
  const angle = getRollAngle(landmarks); // gets head tilt
  const w = 100, h = 100;

  ctx.save();
  ctx.translate(left.x * canvas.width - 26, left.y * canvas.height - 10);
  ctx.rotate(angle);
  ctx.drawImage(img, -w/2, -h/2, w, h);
  ctx.restore();

  ctx.save();
  ctx.translate(right.x * canvas.width + 26, right.y * canvas.height - 10);
  ctx.rotate(angle);
  ctx.scale(-1, 1); // mirror image
  ctx.drawImage(img, -w/2, -h/2, w, h);
  ctx.restore();
}
- Positioning: landmarks are normalized ([0,1]), so we multiply by canvas dimensions.
- Tilt rotation: angle (via atan2) tilts the ears to match your head roll.
- Mirroring right ear: we invert with scale(-1,1) so ears attach with correct orientation.
function getRollAngle(landmarks) {
  const A = landmarks[LANDMARKS.RIGHT_EYE_OUTER];
  const B = landmarks[LANDMARKS.LEFT_EYE_OUTER];
  return Math.atan2(B.y - A.y, B.x - A.x);
}
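For example, with made-up normalized coordinates: if the right eye's outer corner sits at (0.35, 0.40) and the left eye's at (0.65, 0.45), the roll is atan2(0.05, 0.30) ≈ 0.165 radians (about 9.5°), and both ears get rotated by that amount so they stay glued to the head as it tilts.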
🎯 Render Loop
async function render() {
  canvas.width = video.videoWidth;
  canvas.height = video.videoHeight;

  const res = await faceLandmarker.detectForVideo(video, performance.now());

  ctx.clearRect(0, 0, canvas.width, canvas.height);
  ctx.drawImage(video, 0, 0);

  if (res.faceLandmarks?.length) {
    drawEars(res.faceLandmarks[0]);
  }

  requestAnimationFrame(render);
}
- Sizing: ensures canvas matches video.
- Inference: uses detectForVideo(), feeding current frame + timestamp for the live-stream model.
- Drawing: clears canvas, redraws video, and overlays ears if at least one face is detected.
- Looping: requestAnimationFrame(render) schedules the next frame, keeping the overlay running in real time (a frame-synced variation is sketched below).
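One optional refinement, not part of the demo: requestAnimationFrame can run faster than the camera delivers frames, so in browsers that support HTMLVideoElement.requestVideoFrameCallback you can run detection only when a new camera frame arrives. A sketch, reusing the demo's globals:
// Sketch: drive detection from camera frames where supported, otherwise fall back to render().
function renderOnVideoFrames() {
  if (!("requestVideoFrameCallback" in HTMLVideoElement.prototype)) {
    render(); // fall back to the requestAnimationFrame loop above
    return;
  }
  const step = async () => {
    canvas.width = video.videoWidth;
    canvas.height = video.videoHeight;
    const res = await faceLandmarker.detectForVideo(video, performance.now());
    ctx.clearRect(0, 0, canvas.width, canvas.height);
    ctx.drawImage(video, 0, 0);
    if (res.faceLandmarks?.length) drawEars(res.faceLandmarks[0]);
    video.requestVideoFrameCallback(step); // re-register for the next camera frame
  };
  video.requestVideoFrameCallback(step);
}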
Final Thoughts
MediaPipe doesn’t get much hype, but it’s one of the most practical tools for real-time ML on-device. No servers. No latency. No user data handoffs just to track a face. That’s a big deal for privacy, performance, and reliability.
You can still use whatever backend stack you like — but with MediaPipe, you don’t have to. This changes the equation: local-first ML is finally viable. This demo barely scratches the surface. MediaPipe supports gesture detection, object tracking, and even early LLM integration. If you’re building interfaces that respond to people in real time, this deserves a place in your toolbox.
To stay connected and share your journey, feel free to reach out through the following channels:
- 👨‍💼 LinkedIn: Join me for more insights into AI development and tech innovations.
- 🤖 JavaScript + AI: Join the JavaScript and AI group and share what you’re working on.
- 💻 GitHub: Explore my projects and contribute to ongoing work.
- 📚 Medium: Follow my articles for more in-depth discussions on the intersection of JavaScript and AI.