Mithun Kamath
How did they do it | Control the lights in a room using your hand

A month ago, I came across this tweet from @devdevcharlie where they use hand gestures to control the lights in their room. Check it out:

Wasn't that cooooool?

So - how did they manage to do that? D-uh! It's in their tweet itself!! Great work Sherlock!

Ok ok. But how did they actually pull it off? What could their code look like? Here's my take on how they may have achieved it.

The Smart Bulb

Let's get this out of the way sooner rather than later. I can't make out many details of the smart bulb / light in play, but the exact device isn't central to this task; it only needs to turn on or off based on hand gestures. So I shall abstract it: let's assume it is a Light model with a state method to which you pass either "ON" or "OFF". Something like this:

// To turn the device on
Light.state("ON")

// To turn the device off
Light.state("OFF")

It could be any smart bulb, but at its most basic its interface would probably expose the above methods, and those are all we will make use of. We don't have to worry about the intricacies any further. In fact, for our implementation, we'll simply log the detected hand gesture and the resulting light state to the browser console.
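Since our implementation only logs, a minimal stand-in for the hypothetical Light model could look like this (the Light object and its return value are illustrative, not any real device's API):

```javascript
// A minimal stand-in for the hypothetical Light model described above:
// it logs the requested state to the console and returns it,
// which is all our implementation needs.
const Light = {
  state(value) {
    // value is expected to be "ON" or "OFF"
    console.log("Light is turned:", value);
    return value;
  },
};

Light.state("ON");  // logs: Light is turned: ON
Light.state("OFF"); // logs: Light is turned: OFF
```

Swapping this stub for a real smart bulb SDK later would only mean changing the body of state().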

The setup

There's a laptop (with a camera) in front of them - and that's what's capturing their pose, not the camera that recorded the scene we can see (hey - my dumb brain did not spot the laptop initially). So you would need a camera / webcam that you can stream yourself through. If you don't have a webcam but you possess an Android phone (and a USB cable), check out DroidCam, which lets you convert your phone into a webcam.

The code

index.html

We start off by creating a very basic HTML page. The code, with explanations, follows:

<!-- index.html -->

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>How did they do it? | @devdevcharlie edition</title>
</head>
<body>
  <video id="pose-off"></video>
  <script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs@3.11.0/dist/tf.min.js"></script>
  <script src="https://cdn.jsdelivr.net/npm/@tensorflow-models/pose-detection@0.0.6/dist/pose-detection.min.js"></script>
  <script src="/script.js"></script>
</body>
</html>
  • Here, we are creating an index.html file
  • In this file, we import the TensorFlow.js library (@tensorflow/tfjs). We also import the pose detection library built on top of TensorFlow.js (@tensorflow-models/pose-detection). This library requires the TensorFlow.js script, hence it is included after TensorFlow.js has loaded.
  • We have also included our own script.js file, which is where we shall write our script
  • Lastly, note the presence of the <video> tag. It has an id of pose-off (which we query later with the selector #pose-off). It is in this tag that we shall stream our video (and from which we shall analyse the hand gestures)

So far, so good.

We move on to the script.js implementation, which is where we shall have all our logic to control the lights.

script.js

In this file, we start off by defining a couple of functions, each of which does a dedicated task.

initVideo()

This function initializes the video tag, so that it plays the video from the camera attached to our computer. It goes something like this:

// script.js

async function initVideo() {
  // Step 1
  const video = document.querySelector("#pose-off");

  // Step 2
  video.width = 640;
  video.height = 480;

  // Step 3
  const mediaStream = await window.navigator.mediaDevices.getUserMedia({
    video: {
      width: 640,
      height: 480,
    },
  });

  // Step 4
  video.srcObject = mediaStream;

  // Step 5
  await new Promise((resolve) => {
    video.onloadedmetadata = () => {
      resolve();
    };
  });

  // Step 6
  video.play();

  // Step 7
  return video;
}

Each code statement has a step associated with it and the explanation of each step is below:

  1. We start off by selecting the video tag in the HTML defined earlier. We are querying by the id of the tag (#pose-off).
  2. We proceed to set the width and height of the video. In our example, we go with a dimension of 640x480, but you can choose one to your liking. Just remember - the value that you set is important. We shall see why further below.
  3. At this step, we are asking the user for permission to access their video stream. The browser should auto detect the camera set up and provide us access to it. We are using the most basic of configuration, where we are setting the video resolution to 640x480 - same as the dimension we set for the video tag in Step 2 above.
  4. Once we get permission to access the video stream, we set that as the source for our video HTML tag.
  5. We then wait until the video metadata loads.
  6. Once the video metadata loads, we begin to "play" the video. In our case, since our video source is the camera device, we should start seeing the video feed.
  7. Finally, we return the video object that we have initialised.

initPoseDetector()

This function sets up our "Pose" Detector. Pose here refers to our body pose / posture. Check out this diagram taken from the MoveNet documentation.

[Diagram: the 17 MoveNet keypoints marked on a human figure]

Each number represents a part of our body (eye - left/right, wrist - left/right, etc.). In the linked documentation, you can find the name corresponding to each number below the image itself. Reproducing it here for your convenience:

0: nose
1: left_eye
2: right_eye
3: left_ear
4: right_ear
5: left_shoulder
6: right_shoulder
7: left_elbow
8: right_elbow
9: left_wrist
10: right_wrist
11: left_hip
12: right_hip
13: left_knee
14: right_knee
15: left_ankle
16: right_ankle

Isn't that cool? We already have the means to identify the different parts of our body. We just need to make use of it. This is how:

// script.js

async function initPoseDetector() {
  // Step 1
  const model = window.poseDetection.SupportedModels.MoveNet;

  // Step 2
  const detector = await window.poseDetection.createDetector(model, {
    modelType: window.poseDetection.movenet.modelType.SINGLEPOSE_THUNDER,
  });

  // Step 3
  return detector;
}


Here's the explanation for each step in the code above:

  1. TensorFlow.js supports multiple "models" for pose detection. Think of models as libraries; there are three prominent ones - MoveNet, BlazePose and PoseNet. We are making use of the MoveNet model. This step configures the model that we will make use of.
  2. At this step, we are actually initialising our pose detector. We pass in the model that we'd like to use (MoveNet) along with the configuration for it. In this case, we specify that we'd like to use the SINGLEPOSE_THUNDER variant of the MoveNet model. There are two other variants we could have chosen, but we chose this one because, although it is slower, it is more accurate. Also, we only intend to detect a single person's pose. (There are models to detect the poses of multiple people at the same time.)

    Think of MoveNet as the Brand of car that you'd like to go with - Tesla. After selecting the Brand, you now need to select which (car) variant you'd like to go with - Tesla Model S, which in our case is the SINGLEPOSE_THUNDER variant of the MoveNet model.

  3. Lastly, we return the pose detector object that we have initialised.

analyzeHandGesture()

Alright. So far we have implemented a function that initialises the webcam-based video feed and another function that initialises the MoveNet tensorflow.js model. We now move on to the function that will use the MoveNet model to determine the hand gesture carried out in the video feed. Since this function works on the video feed and makes use of the MoveNet model, we need to pass both the video feed and the MoveNet detector as input:

// script.js

async function analyzeHandGesture(video, detector) {
  // Step 1
  const poses = await detector.estimatePoses(video, { flipHorizontal: true });

  // Step 2
  recognizeGesture(poses[0].keypoints.find((p) => p.name === "left_wrist"));

  // Step 3
  requestAnimationFrame(async () => {
    await analyzeHandGesture(video, detector);
  });
}
  1. A couple of things are happening in this step. We begin by calling the MoveNet model's estimatePoses() function, passing it the video feed. Further, I have set the flipHorizontal configuration option to flip the video feed, you guessed it, horizontally, because the video input from my ghetto camera feed (recollect that I am using an Android phone as a webcam) is mirrored. To correct it, I need to flip the feed horizontally.
    This function returns the poses identified in the video feed. The structure of the data is an array of objects, where each object has the following structure:

      {
        x: // x co-ordinate
        y: // y co-ordinate
        score: // confidence score - how confident
               // the model is about the detected
               // body part
        name: // name of the body part.
              // Ex. right_eye, left_wrist
      }
    

    Correction - this is the data structure of one keypoint, not one pose. The MoveNet model is capable of detecting multiple humans in a video, and for each person it creates a pose object that has a keypoints attribute, which is itself an array of objects. The structure above is that of a single keypoint object.

  2. In this step, we are trying to locate the keypoint for the left_wrist body part. Why just the left wrist? We'll find out in a second. After we extract that specific keypoint, we pass it to the recognizeGesture() function. This function identifies the hand gesture and decides the action to carry out based on it. We are yet to define this function - we will do so in the next step.

  3. Lastly, we use requestAnimationFrame() to call the analyzeHandGesture() function again - we essentially end up creating an infinite loop where the analyzeHandGesture() function is called repeatedly thereby analyzing our hand movement forever.
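To make the shape concrete, here is a hypothetical return value from estimatePoses() (all coordinate and score values are made up for illustration), along with the find() call that analyzeHandGesture() uses to pull out the left wrist:

```javascript
// Illustrative shape of the array returned by detector.estimatePoses():
// one object per detected person, each with a keypoints array.
// All numeric values here are invented for the example.
const poses = [
  {
    score: 0.91, // confidence for this detected person
    keypoints: [
      { x: 312.4, y: 120.7, score: 0.98, name: "nose" },
      { x: 420.3, y: 310.2, score: 0.95, name: "left_wrist" },
      { x: 210.8, y: 305.6, score: 0.94, name: "right_wrist" },
      // ...the remaining keypoints (17 in total per person)
    ],
  },
];

// Extracting the left wrist, exactly as analyzeHandGesture() does:
const leftWrist = poses[0].keypoints.find((p) => p.name === "left_wrist");
console.log(leftWrist.x); // 420.3
```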

recognizeGesture()

This function receives a keypoint object, with the x and y co-ordinates of a body part and it is expected to recognize the gesture made through that body part.

Bear in mind that detecting a complex movement like a thumbs up or a finger pointing in a direction or a "call me" finger combination requires setting up a neural network to accurately determine the hand pose. That is too cumbersome for our project here. We would like to keep it simple.

In the demonstration by @devdevcharlie we see her lifting her right hand up to turn on the lamp on the right side. And consequently bringing her right hand down to turn it off. Ditto with her left hand movements to control the lamp on the left side of her television.

For our replication, we'll recognize a really simple hand gesture - if our left wrist is on the LEFT SIDE of the video, we'll turn the lights ON. If our left wrist is on the RIGHT SIDE of the video, we'll turn the lights OFF. We will be dealing with just a single light source, unlike the demonstration, where there are two light sources.

So essentially, we are dividing our video area into two halves - since our video width is 640px (see the initVideo() function), 0px to 320px will be the LEFT side of the video, while 321px to 640px will be the RIGHT side.

Wrist Location

But hang on - our video feed is flipped. Which would mean that 321px to 640px is our LEFT side while 0px to 320px is our RIGHT side.

Let's translate that into code in our recognizeGesture() function:

// script.js

async function recognizeGesture(keypoint) {
  let status;

  if (keypoint.x > 320) {
    status = "ON";
  } else {
    status = "OFF";
  }

  console.log("Light is turned:", status);
}

If the x co-ordinate is greater than 320px, our wrist is on the LEFT side of the video, and thus we turn our light ON. Otherwise, we turn it OFF.

That was the penultimate function we implemented.

start()

This is the last function we'll implement. This brings it all together:

// script.js

async function start() {
  const video = await initVideo();

  const detector = await initPoseDetector();

  await analyzeHandGesture(video, detector);
}

// Don't forget to call the function
start();

We initialize the video and store the video object; we then initialize the MoveNet model and store the detector; and lastly, we analyze the hand gestures seen in the video.

The full source code for the script.js file looks like:

// script.js

async function recognizeGesture(keypoint) {
  let status;

  if (keypoint.x > 320) {
    status = "ON";
  } else {
    status = "OFF";
  }

  console.log("Light is turned:", status);
}

async function initVideo() {
  const video = document.querySelector("#pose-off");

  video.width = 640;
  video.height = 480;

  const mediaStream = await window.navigator.mediaDevices.getUserMedia({
    video: {
      width: 640,
      height: 480,
    },
  });

  video.srcObject = mediaStream;

  await new Promise((resolve) => {
    video.onloadedmetadata = () => {
      resolve();
    };
  });

  video.play();

  return video;
}

async function initPoseDetector() {
  const model = window.poseDetection.SupportedModels.MoveNet;

  const detector = await window.poseDetection.createDetector(model, {
    modelType: window.poseDetection.movenet.modelType.SINGLEPOSE_THUNDER,
  });

  return detector;
}

async function analyzeHandGesture(video, detector) {
  const poses = await detector.estimatePoses(video, { flipHorizontal: true });
  recognizeGesture(poses[0].keypoints.find((p) => p.name === "left_wrist"));

  requestAnimationFrame(async () => {
    await analyzeHandGesture(video, detector);
  });
}

async function start() {
  const video = await initVideo();

  const detector = await initPoseDetector();

  await analyzeHandGesture(video, detector);
}

start();

Finale

When we launch our app using a simple HTTP server, our demonstration looks something like this:

My Demo

Remember - our version detects the left wrist's location - and not the entire arm's movements.
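If you haven't served a static page before, here is one way to do it from the project directory, assuming you have either Node.js or Python 3 installed (the port numbers are arbitrary):

```shell
# Serve the directory containing index.html and script.js.
# Pick whichever runtime you have installed:
npx http-server -p 8080       # Node.js (fetches http-server on first run)
# or
python3 -m http.server 8000   # Python 3's built-in static file server
# Then open http://localhost:8080 (or :8000) in the browser
# and grant camera permission when prompted.
```

Note that getUserMedia() requires a secure context, so use localhost (or HTTPS) rather than opening the file directly.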
