Anurag D

Gesture Control with ElectronJS, MediaPipe and Nut.js - Creative Coding fun

Demo:

Code: GitHub

A while back, I attended a creative coding jam and wanted to build something fun. Ever since college, I had wanted to build an app that uses gesture control to navigate PPT presentations (cuz we kept losing our pointers ;P). So I decided to build something along those lines.

So to start, I knew I needed a desktop app to control a PC, and being familiar with Python and JS, the obvious options were PyQt or Electron. Next, after researching a little, I found out about MediaPipe from Google: an open-source framework for real-time multimedia tasks like hand tracking, gesture recognition, and pose estimation. It offers efficient, cross-platform machine learning solutions for developers.

I had seen many Python projects using computer vision for this kind of thing, but I had recently been playing with JS, so I thought it would be a fun challenge to do it in Electron. So far I had Electron for the app and MediaPipe for the gesture detection.

Next, I needed something to control the computer programmatically, and that's when I found Robot.js and nut.js. I went with nut.js, as it had more documentation and I found it easier to use.

Now I had these tasks:

  • Start the app and keep it running in the background
  • Launch the camera, get the feed, and detect gestures
  • Map gestures to actions to control the computer

1. Start the app and keep it running in the background

Starting with installing dependencies and setting up the Electron app.

npm install @mediapipe/camera_utils @mediapipe/hands @mediapipe/tasks-vision @nut-tree-fork/nut-js @tensorflow-models/hand-pose-detection @tensorflow/tfjs electron


Electron has a simple way to run an app in the background. I just had to create a BrowserWindow in index.js and set the window to show: false. This background window loaded background.html with the content below. Nothing fancy.

<video id="webcam" autoplay playsinline style="display: none;"></video>
<canvas id="output_canvas" style="display: none;"></canvas>
<div id="gesture_output" style="display: none;"></div>
<script src="gestureWorker.js"></script>
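For reference, the main-process side is just a hidden window that loads this file. A minimal sketch of what index.js can look like, with my own variable names and webPreferences assumptions (the project's actual file may differ):

const { app, BrowserWindow } = require('electron');

let backgroundWindow;

function createBackgroundWindow() {
  // Hidden window: the app keeps running in the background while
  // gestureWorker.js handles the camera and gesture detection.
  backgroundWindow = new BrowserWindow({
    show: false,
    webPreferences: {
      // Assumed here so the renderer can require() nut.js directly
      nodeIntegration: true,
      contextIsolation: false
    }
  });
  backgroundWindow.loadFile('background.html');
}

app.whenReady().then(createBackgroundWindow);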

2. Launch the camera, get the feed, and detect gestures

The MediaPipe documentation is very clear on how to initialize the recognizer; it's pretty straightforward.
Source: gestureWorker.js

// Assumed context: FilesetResolver and GestureRecognizer come from
// @mediapipe/tasks-vision, `video` is the <video id="webcam"> element in
// background.html, and videoWidthNumber / videoHeightNumber are defined
// elsewhere in gestureWorker.js.
async function initialize() {
  try {
    const vision = await FilesetResolver.forVisionTasks(
      "https://cdn.jsdelivr.net/npm/@mediapipe/tasks-vision@0.10.3/wasm"
    );
    gestureRecognizer = await GestureRecognizer.createFromOptions(vision, {
      baseOptions: {
        modelAssetPath: "https://storage.googleapis.com/mediapipe-models/gesture_recognizer/gesture_recognizer/float16/1/gesture_recognizer.task",
        delegate: "GPU"
      },
      runningMode: "VIDEO"
    });

    // Start webcam
    const constraints = {
      video: {
        width: videoWidthNumber,
        height: videoHeightNumber
      }
    };

    const stream = await navigator.mediaDevices.getUserMedia(constraints);
    video.srcObject = stream;
    webcamRunning = true;
    video.addEventListener("loadeddata", predictWebcam);
  } catch (error) {
    console.error('Initialization error:', error);
    setTimeout(initialize, 5000);
  }
}

3. Map gestures to actions to control the computer

Once I had the feed, all I had to do was:
Source: gestureWorker.js

results = gestureRecognizer.recognizeForVideo(video, Date.now());
const gesture = results.gestures[0][0].categoryName;
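That call lives inside the predictWebcam callback registered on the loadeddata event earlier. Roughly, the loop looks like this. It is my own sketch: handleGesture is a hypothetical helper standing in for the if/else chain shown next, and the length check matters because results.gestures is empty when no hand is in frame:

async function predictWebcam() {
  if (!webcamRunning) return;

  const results = gestureRecognizer.recognizeForVideo(video, Date.now());

  // Only act when a hand (and therefore a gesture) was actually detected
  if (results.gestures.length > 0) {
    const gesture = results.gestures[0][0].categoryName;
    await handleGesture(gesture); // hypothetical helper: the mapping below
  }

  // Keep processing frames while the webcam is running
  window.requestAnimationFrame(predictWebcam);
}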

MediaPipe has some predefined gestures, like Thumb_Up, Thumb_Down, and Open_Palm. I used them as below:

if (gesture === "Thumb_Up") {
  await mouse.scrollUp(10); 
} else if (gesture === "Thumb_Down") {
  await mouse.scrollDown(10); 
} else if (gesture === "Open_Palm") {
  await keyboard.pressKey(Key.LeftAlt, Key.LeftCmd, Key.M);
  await keyboard.releaseKey(Key.LeftAlt, Key.LeftCmd, Key.M);
} else if (gesture === "Pointing_Up") {
  await mouse.rightClick();
} else if (gesture === "Victory") {
  await keyboard.pressKey(Key.LeftCmd, Key.Tab);
  await keyboard.releaseKey(Key.LeftCmd, Key.Tab);
}

The mouse and keyboard objects are available from the nut.js package.
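In gestureWorker.js they can be pulled in with a plain require, assuming node integration is enabled in the background window:

// Automation primitives from the nut.js fork installed earlier
const { mouse, keyboard, Key } = require("@nut-tree-fork/nut-js");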

And finally I had it working. Though there were many aaa, aahh, wutt moments, I learned a lot. As you can see in the demo, the last gesture is buggy, but it works 😉

The complete source is available on GitHub.


Learnings and Possibilities:

  1. Computer vision has become way more powerful and easier to use than it used to be.
  2. MediaPipe is super, super useful; you can even use it to detect custom gestures. It also has things like DrawingUtils to draw the hand landmarks or leave a trail of the hand movements (see the sketch after this list). It was fun playing around with it. The possibilities are endless if you have a great idea.
  3. I thought this kind of app would require some platform-specific code, but to my surprise, all I wrote was JS.
  4. I was able to achieve this with just a webcam; with a dedicated camera or sensor, you could handle far more complex scenarios and use cases.
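To illustrate point 2, here's roughly how DrawingUtils could overlay the detected hand landmarks on the output_canvas from background.html. This is only a sketch, not code from the project:

// DrawingUtils ships with @mediapipe/tasks-vision
const canvas = document.getElementById("output_canvas");
const ctx = canvas.getContext("2d");
const drawingUtils = new DrawingUtils(ctx);

function drawHands(results) {
  ctx.clearRect(0, 0, canvas.width, canvas.height);
  // results.landmarks holds one set of 21 hand landmarks per detected hand
  for (const landmarks of results.landmarks) {
    drawingUtils.drawConnectors(landmarks, GestureRecognizer.HAND_CONNECTIONS, {
      color: "#00FF00",
      lineWidth: 3
    });
    drawingUtils.drawLandmarks(landmarks, { color: "#FF0000", lineWidth: 1 });
  }
}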

This is my first article, so do let me know how you find it.
