In this blog post, we’ll show you how to build a web application that accesses your camera and speaks whenever you make a specific hand gesture. This is a simplified version of the Rock, Paper, Scissors, Lizard, Spock application, and you can try out the app here or deploy it yourself with the instructions below. After you launch the app in a desktop browser, click Start, allow access to your camera, and then make one of the hand gestures from the game created by Sam Kass and Karen Bryla. Make sure your volume is turned up: when the application recognizes a valid gesture, it will speak to you.
You can customize and run this application yourself by visiting this GitHub repository and following the directions shown. All you need is an Azure subscription, and because the application uses only free services, it won’t cost you anything to try it out.
Let’s dive into the various components of the application:
Speech. When the application detects a valid gesture, the speech is synthesized on demand with Cognitive Services Neural Text to Speech. Neural TTS can synthesize a humanlike voice in a variety of languages (with 15 more just added!) and speaking styles.
Vision. The hand gesture detection is driven by Custom Vision in Azure Cognitive Services. It’s based on the same vision model used by the Rock, Paper, Scissors, Lizard, Spock application, but here it runs locally in the browser, so no camera images are sent to the server.
Web Application. The application is built with Azure Static Web Apps, which means you can create your own website with a version of the application in just minutes – and for free!
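To make the hand-off between the Vision and Speech components more concrete, here is a minimal sketch of how a set of classifier scores could be turned into a phrase to speak. It is not the code from the repository: the `Prediction` shape, the `PHRASES` table, the function names, and the 0.8 threshold are all assumptions made for illustration.

```typescript
// Hypothetical shape of one classifier output: a gesture label and its score.
interface Prediction {
  label: string;
  probability: number;
}

// Illustrative phrases only; the app's real text lives in the repository.
const PHRASES: Record<string, string> = {
  rock: "Rock crushes scissors.",
  paper: "Paper covers rock.",
  scissors: "Scissors cuts paper.",
  lizard: "Lizard poisons Spock.",
  spock: "Spock smashes scissors.",
};

// Return the most probable prediction, or null if nothing clears the threshold.
function pickGesture(predictions: Prediction[], threshold = 0.8): Prediction | null {
  let best: Prediction | null = null;
  for (const p of predictions) {
    if (best === null || p.probability > best.probability) {
      best = p;
    }
  }
  return best !== null && best.probability >= threshold ? best : null;
}

// Map the winning gesture to the text that would be sent to text to speech.
// Returns null when no gesture is confident enough or the label is unknown.
function phraseFor(predictions: Prediction[], threshold = 0.8): string | null {
  const gesture = pickGesture(predictions, threshold);
  if (gesture === null) {
    return null;
  }
  const known = Object.prototype.hasOwnProperty.call(PHRASES, gesture.label);
  return known ? PHRASES[gesture.label] : null;
}
```

Requiring a minimum score before speaking is one simple way to keep the app quiet when no hand is in view; the right threshold would depend on the model you use.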
Because we’ve provided all of the code behind the application, it’s easy to customize and see the differences for yourself. As soon as you check in changes to your forked GitHub repository, Static Web Apps will automatically rebuild and deploy the application with your changes. Here are some things to try, and you can find detailed instructions in the repository.
- Change the words spoken for each hand signal by modifying the text.
- Try a different voice or language by changing the configured default.
- Try a different speaking style, like “newscast” or “empathetic”, with SSML.
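As a rough illustration of the last suggestion, the sketch below builds an SSML document that wraps the text in the `mstts:express-as` element used by Neural TTS for speaking styles. The helper names, the options object, and the default voice `en-US-AriaNeural` are assumptions for this example rather than the repository's code, and not every voice supports every style, so check the voice you choose.

```typescript
// Options for the hypothetical helper below; all fields are optional.
interface SpeechOptions {
  voice?: string; // neural voice name (assumed default: en-US-AriaNeural)
  lang?: string;  // language tag for the document (assumed default: en-US)
  style?: string; // speaking style, e.g. "newscast" or "empathetic"
}

// Escape the characters that are special in XML text and attribute values.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Build an SSML string; the style wrapper is added only when a style is given.
function buildSsml(text: string, options: SpeechOptions = {}): string {
  const voice = options.voice ?? "en-US-AriaNeural";
  const lang = options.lang ?? "en-US";
  const spoken = escapeXml(text);
  const body = options.style
    ? `<mstts:express-as style="${escapeXml(options.style)}">${spoken}</mstts:express-as>`
    : spoken;
  return (
    `<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" ` +
    `xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="${escapeXml(lang)}">` +
    `<voice name="${escapeXml(voice)}">${body}</voice>` +
    `</speak>`
  );
}
```

For example, `buildSsml("Paper covers rock.", { style: "empathetic" })` would produce a document that asks the service to read the phrase in the empathetic style, while omitting `style` falls back to the voice's default delivery.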
Customize what’s recognized by the camera. The GitHub repository includes only the exported rock-paper-scissors Custom Vision model, not the source data used to train it. You can train your own vision model with Custom Vision, export it for TensorFlow.js, and replace the provided model.
If you’d like to learn more about the technology used in this app, check out these Microsoft Learn modules on Static Web Apps, Custom Vision, and Text-to-Speech. If you have any feedback about the app itself, please open an issue in the GitHub repository, or reach out to either of us (David and Em) directly. This was a fun app to make, and we hope you have fun playing with it too!