Demo:
Code: GitHub
A while back, I attended a creative coding jam, where I wanted to build something fun. Ever since college, I had wanted to build an app that uses gesture control to navigate PPT presentations (cuz we kept losing our pointers ;P). So I decided to build something along those lines.
To start, I knew I needed a desktop app to control a PC, and being familiar with Python and JS, the obvious options were PyQt or Electron. Next, after a little research, I found out about MediaPipe from Google:
an open-source framework for real-time multimedia tasks like hand tracking, gesture recognition, and pose estimation. It offers efficient, cross-platform machine learning solutions for developers.
I had seen many Python projects using computer vision for this kind of thing, but I had recently been playing with JS, so I thought it would be a fun challenge to do it in Electron. So far I had Electron for the app and MediaPipe for the gesture detection.
Next, I needed something to control the computer programmatically; that's when I found RobotJS and nut.js. I went with nut.js, as it had more documentation and I found it easy to use.
Now I had these tasks:
- Start app and keep it running in background
- Launch camera, get feed and detect gestures
- Map the gestures to actions that control the computer
1. Start app and keep it running in background
Start by installing dependencies and setting up the Electron app.
npm install @mediapipe/camera_utils @mediapipe/hands @mediapipe/tasks-vision @nut-tree-fork/nut-js @tensorflow-models/hand-pose-detection @tensorflow/tfjs electron
Electron has a simple way to run an app in the background. I just had to create a BrowserWindow in index.js and set the window to show: false. This background window loaded a background.html with the content below. Nothing fancy.
<video id="webcam" autoplay playsinline style="display: none;"></video>
<canvas id="output_canvas" style="display: none;"></canvas>
<div id="gesture_output" style="display: none;"></div>
<script src="gestureWorker.js"></script>
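For reference, here's a minimal sketch of what that index.js could look like; everything beyond show: false and loading background.html is an assumption on my part, not the exact config from the repo.

// index.js: minimal sketch of the hidden background window (assumed config)
const { app, BrowserWindow } = require("electron");

app.whenReady().then(() => {
  const win = new BrowserWindow({
    show: false, // keep the window hidden so the app runs in the background
    webPreferences: {
      // assumed: lets gestureWorker.js use require() for nut.js
      nodeIntegration: true,
      contextIsolation: false
    }
  });
  win.loadFile("background.html"); // loads the markup above
});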
2. Launch camera, get feed and detect gestures
The MediaPipe documentation is very clear on how to initialize the recognizer; it's pretty straightforward.
Source: gestureWorker.js
// assumed: gestureWorker.js is loaded as an ES module (or bundled), so it can import
import { FilesetResolver, GestureRecognizer } from "@mediapipe/tasks-vision";

async function initialize() {
  try {
    // load the WASM backend for the vision tasks
    const vision = await FilesetResolver.forVisionTasks(
      "https://cdn.jsdelivr.net/npm/@mediapipe/tasks-vision@0.10.3/wasm"
    );
    // create the recognizer from the pre-trained gesture model, running on the GPU
    gestureRecognizer = await GestureRecognizer.createFromOptions(vision, {
      baseOptions: {
        modelAssetPath: "https://storage.googleapis.com/mediapipe-models/gesture_recognizer/gesture_recognizer/float16/1/gesture_recognizer.task",
        delegate: "GPU"
      },
      runningMode: "VIDEO"
    });

    // Start webcam (videoWidthNumber / videoHeightNumber are defined elsewhere in gestureWorker.js)
    const constraints = {
      video: {
        width: videoWidthNumber,
        height: videoHeightNumber
      }
    };
    const stream = await navigator.mediaDevices.getUserMedia(constraints);
    video.srcObject = stream;
    webcamRunning = true;
    // kick off the prediction loop once the first frame is available
    video.addEventListener("loadeddata", predictWebcam);
  } catch (error) {
    console.error('Initialization error:', error);
    // retry initialization after 5 seconds
    setTimeout(initialize, 5000);
  }
}
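The predictWebcam handler wired up above isn't shown in this excerpt; here's a rough sketch of the loop it likely runs, assuming requestAnimationFrame and a hypothetical handleGesture helper that wraps the mapping from step 3.

// hypothetical sketch of the prediction loop; the real gestureWorker.js may differ
async function predictWebcam() {
  if (!webcamRunning) return; // stop looping if the webcam was turned off
  const results = gestureRecognizer.recognizeForVideo(video, Date.now());
  // results.gestures is empty when no hand is in frame
  if (results.gestures.length > 0) {
    await handleGesture(results.gestures[0][0].categoryName); // hypothetical helper, see step 3
  }
  requestAnimationFrame(predictWebcam); // process the next frame
}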
3. Map the gestures to actions that control the computer
Once I had the feed, all I had to do was:
Source: gestureWorker.js
const results = gestureRecognizer.recognizeForVideo(video, Date.now());
// optional chaining guards against frames where no hand is detected
const gesture = results.gestures[0]?.[0]?.categoryName;
MediaPipe has some predefined gestures, like Thumb_Up, Thumb_Down, and Open_Palm. I used them as below:
if (gesture === "Thumb_Up") {
  await mouse.scrollUp(10); // scroll the page up
} else if (gesture === "Thumb_Down") {
  await mouse.scrollDown(10); // scroll the page down
} else if (gesture === "Open_Palm") {
  // Option+Cmd+M: minimize windows (macOS shortcut)
  await keyboard.pressKey(Key.LeftAlt, Key.LeftCmd, Key.M);
  await keyboard.releaseKey(Key.LeftAlt, Key.LeftCmd, Key.M);
} else if (gesture === "Pointing_Up") {
  await mouse.rightClick(); // open the context menu
} else if (gesture === "Victory") {
  // Cmd+Tab: switch between apps (macOS shortcut)
  await keyboard.pressKey(Key.LeftCmd, Key.Tab);
  await keyboard.releaseKey(Key.LeftCmd, Key.Tab);
}
The mouse and keyboard objects come from the nut.js package.
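Pulling them in looks something like this (the delay tweak is an optional assumption, not from the original source):

// nut.js exposes ready-made singletons for input control
const { mouse, keyboard, Key } = require("@nut-tree-fork/nut-js");

// assumed tweak: small delay between keystrokes so shortcuts register reliably
keyboard.config.autoDelayMs = 100;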
And finally I had it working. Though there were many aaa, aahh, wutt moments, I learned a lot. As you can see in the demo, the last gesture is buggy, but it works 😉
The complete source is available on GitHub.
Learnings and Possibilities:
- Computer vision has become way more powerful and easier to use than it used to be.
- MediaPipe is super useful; you can use it to detect custom gestures. It even has things like DrawingUtils to leave a trail of the hand movements (see the sketch after this list). It was fun playing around with it. The possibilities are endless if you have a great idea.
- I thought this kind of app would require some platform-specific code, but to my surprise, all I wrote was JS.
- I was able to achieve this with just a webcam; with a dedicated camera or sensor, you could handle far more complex scenarios and use-cases.
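As an example of DrawingUtils, here's a rough sketch of how it could overlay the detected hand landmarks on the canvas from background.html; the drawHands function is hypothetical, not code from the repo.

// hypothetical overlay using MediaPipe's DrawingUtils
import { DrawingUtils, GestureRecognizer } from "@mediapipe/tasks-vision";

const canvas = document.getElementById("output_canvas");
const ctx = canvas.getContext("2d");
const drawingUtils = new DrawingUtils(ctx);

function drawHands(results) {
  ctx.clearRect(0, 0, canvas.width, canvas.height);
  for (const landmarks of results.landmarks) {
    // draw the skeleton between the 21 hand landmarks, then the points themselves
    drawingUtils.drawConnectors(landmarks, GestureRecognizer.HAND_CONNECTIONS);
    drawingUtils.drawLandmarks(landmarks);
  }
}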
This is my first article, so do let me know how you find it.