Demo:
Code: GitHub
A while back, I attended a creative coding jam, where I wanted to build something fun. Ever since college, I had wanted to build an app that uses gesture control to navigate PPT presentations (cuz we kept losing our pointers ;P). So I decided to build something along those lines.
To start, I knew I needed a desktop app to control a PC, and being familiar with Python and JS, the obvious options were PyQt or Electron. Next, after a little research, I found out about MediaPipe from Google:
an open-source framework for real-time multimedia tasks like hand tracking, gesture recognition, and pose estimation. It offers efficient, cross-platform machine learning solutions for developers.
I had seen many Python projects using computer vision for this kind of thing, but I had recently been playing with JS, so I thought it would be a fun challenge to do it in Electron. So far I had Electron for the app and MediaPipe for the gesture detection.
Next, I needed something to control the computer programmatically; that's when I found RobotJS and nut.js. I went with nut.js, as it had more documentation and I found it easy to use.
Now I had these tasks:
- Start app and keep it running in background
- Launch camera, get feed and detect gestures
- Map the gestures to actions that control the computer
1. Start app and keep it running in background
Start by installing dependencies and setting up the Electron app.
npm install @mediapipe/camera_utils @mediapipe/hands @mediapipe/tasks-vision @nut-tree-fork/nut-js @tensorflow-models/hand-pose-detection @tensorflow/tfjs electron
Electron has a simple way to run an app in the background. I just had to create a BrowserWindow in index.js and set the window to show: false. This background window loaded a background.html with the content below. Nothing fancy.
<video id="webcam" autoplay playsinline style="display: none;"></video>
<canvas id="output_canvas" style="display: none;"></canvas>
<div id="gesture_output" style="display: none;"></div>
<script src="gestureWorker.js"></script>
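For reference, here's a minimal sketch of what that index.js could look like; everything beyond show: false and loading background.html is an assumption on my part, not the exact config from the repo.

// index.js: minimal sketch of the hidden background window (assumed config)
const { app, BrowserWindow } = require("electron");

app.whenReady().then(() => {
  const win = new BrowserWindow({
    show: false, // keep the window hidden so the app runs in the background
    webPreferences: {
      // assumed: lets gestureWorker.js use require() for nut.js
      nodeIntegration: true,
      contextIsolation: false
    }
  });
  win.loadFile("background.html"); // loads the markup above
});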
2. Launch camera, get feed and detect gestures
The MediaPipe documentation is very clear on how to initialize the recognizer; it's pretty straightforward.
Source: gestureWorker.js
// assumed: gestureWorker.js is loaded as an ES module (or bundled), so it can import
import { FilesetResolver, GestureRecognizer } from "@mediapipe/tasks-vision";

async function initialize() {
  try {
    // load the WASM backend for the vision tasks
    const vision = await FilesetResolver.forVisionTasks(
      "https://cdn.jsdelivr.net/npm/@mediapipe/tasks-vision@0.10.3/wasm"
    );
    // create the recognizer from the pre-trained gesture model, running on the GPU
    gestureRecognizer = await GestureRecognizer.createFromOptions(vision, {
      baseOptions: {
        modelAssetPath: "https://storage.googleapis.com/mediapipe-models/gesture_recognizer/gesture_recognizer/float16/1/gesture_recognizer.task",
        delegate: "GPU"
      },
      runningMode: "VIDEO"
    });

    // Start webcam (videoWidthNumber / videoHeightNumber are defined elsewhere in gestureWorker.js)
    const constraints = {
      video: {
        width: videoWidthNumber,
        height: videoHeightNumber
      }
    };
    const stream = await navigator.mediaDevices.getUserMedia(constraints);
    video.srcObject = stream;
    webcamRunning = true;
    // kick off the prediction loop once the first frame is available
    video.addEventListener("loadeddata", predictWebcam);
  } catch (error) {
    console.error('Initialization error:', error);
    // retry initialization after 5 seconds
    setTimeout(initialize, 5000);
  }
}
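The predictWebcam handler wired up above isn't shown in this excerpt; here's a rough sketch of the loop it likely runs, assuming requestAnimationFrame and a hypothetical handleGesture helper that wraps the mapping from step 3.

// hypothetical sketch of the prediction loop; the real gestureWorker.js may differ
async function predictWebcam() {
  if (!webcamRunning) return; // stop looping if the webcam was turned off
  const results = gestureRecognizer.recognizeForVideo(video, Date.now());
  // results.gestures is empty when no hand is in frame
  if (results.gestures.length > 0) {
    await handleGesture(results.gestures[0][0].categoryName); // hypothetical helper, see step 3
  }
  requestAnimationFrame(predictWebcam); // process the next frame
}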
3. Map the gestures to actions that control the computer
Once I had the feed, all I had to do was:
Source: gestureWorker.js
const results = gestureRecognizer.recognizeForVideo(video, Date.now());
// optional chaining guards against frames where no hand is detected
const gesture = results.gestures[0]?.[0]?.categoryName;
MediaPipe has some predefined gestures, like Thumb_Up, Thumb_Down, and Open_Palm. I used them as below:
if (gesture === "Thumb_Up") {
  await mouse.scrollUp(10); // scroll the page up
} else if (gesture === "Thumb_Down") {
  await mouse.scrollDown(10); // scroll the page down
} else if (gesture === "Open_Palm") {
  // Option+Cmd+M: minimize windows (macOS shortcut)
  await keyboard.pressKey(Key.LeftAlt, Key.LeftCmd, Key.M);
  await keyboard.releaseKey(Key.LeftAlt, Key.LeftCmd, Key.M);
} else if (gesture === "Pointing_Up") {
  await mouse.rightClick(); // open the context menu
} else if (gesture === "Victory") {
  // Cmd+Tab: switch between apps (macOS shortcut)
  await keyboard.pressKey(Key.LeftCmd, Key.Tab);
  await keyboard.releaseKey(Key.LeftCmd, Key.Tab);
}
The mouse and keyboard objects come from the nut.js package.
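Pulling them in looks something like this (the delay tweak is an optional assumption, not from the original source):

// nut.js exposes ready-made singletons for input control
const { mouse, keyboard, Key } = require("@nut-tree-fork/nut-js");

// assumed tweak: small delay between keystrokes so shortcuts register reliably
keyboard.config.autoDelayMs = 100;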
And finally I had it working. Though there were many aaa, aahh, wutt moments, I learned a lot. As you can see in the demo, the last gesture is buggy, but it works 😉
The complete source is available on GitHub.
Learnings and Possibilities:
- Computer vision has become way more powerful and easier to use than it used to be.
- MediaPipe is super useful; you can use it to detect custom gestures. It even has things like DrawingUtils to leave a trail of the hand movements (see the sketch after this list). It was fun playing around with it. The possibilities are endless if you have a great idea.
- I thought this kind of app would require some platform-specific code, but to my surprise, all I wrote was JS.
- I was able to achieve this with just a webcam; with a dedicated camera or sensor, you could handle far more complex scenarios and use-cases.
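As an example of DrawingUtils, here's a rough sketch of how it could overlay the detected hand landmarks on the canvas from background.html; the drawHands function is hypothetical, not code from the repo.

// hypothetical overlay using MediaPipe's DrawingUtils
import { DrawingUtils, GestureRecognizer } from "@mediapipe/tasks-vision";

const canvas = document.getElementById("output_canvas");
const ctx = canvas.getContext("2d");
const drawingUtils = new DrawingUtils(ctx);

function drawHands(results) {
  ctx.clearRect(0, 0, canvas.width, canvas.height);
  for (const landmarks of results.landmarks) {
    // draw the skeleton between the 21 hand landmarks, then the points themselves
    drawingUtils.drawConnectors(landmarks, GestureRecognizer.HAND_CONNECTIONS);
    drawingUtils.drawLandmarks(landmarks);
  }
}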
This is my first article, so do let me know how you find it.