Making Sense of the Senses: Our Top 5 Microsoft Azure Cognitive Services Combos!

#tutorial #productivity #webdev #beginners

As folks around the world make the switch to remote meetings and presentations, we’ve seen a lot of unique, creative, and hilarious ways people have made this format their own. From funny backgrounds and unique uses of “Python” to informal mental health check-ins, we’ve loved seeing how folks have made this new method of social interaction work well for them.

Take, for example, these fun Microsoft Teams backgrounds that raised our spirits on Twitter:

Here's one by @ytechie

And another one by @_sarahyo

Here on the Cloud Advocate Team at Microsoft, we’ve always been a remote-first team. With so many people in several time zones, on several continents, spanning so many teams, meeting our teammates and managers “IRL” after a year of video chats is not uncommon! Using the power of Microsoft Azure Cognitive Services, fellow advocate David Smith and I put together a list of our top 5 combinations to help make remote presentations and streams better. We hope these tools and methods make your online presentations easier and more engaging, and help get you a standing ovation from your remote teammates (disclaimer: you may want to make sure you aren’t wearing your PJ pants when standing to clap... 😉👖).

Rehearse and Memorize Your Presentations

Have a big presentation coming up that you want to have memorized? (Or perhaps you’re rehearsing for the big Spring musical next year?) With this combo, you can rehearse dialogue, listen to it on repeat, and cross-check your accuracy! With Speech to Text combined with Text to Speech (both part of Azure Cognitive Services), getting set for that big remote presentation is a breeze.

Our speech and text features are a great way to get started and familiarize yourself with Cognitive Services! Do you learn better by listening? By using Text to Speech capabilities, you can listen to a non-robotic voice read your written words aloud in no time.

You can also use Speech to rehearse memorized material! To get a text output of your spoken dialogue (and check your accuracy), our REST-based APIs and the Speech SDK are easy ways to get the job done.
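Once you have a transcript of what you actually said, the "cross-check your accuracy" step can be as simple as a word-by-word diff against your script. Here is a minimal sketch using only Python's standard library. It assumes you already have the recognized text as a string; the function names and sample sentences are made up for illustration and are not part of the Speech SDK.

```python
import difflib
import re


def words(text):
    """Lowercase the text and split it into words, ignoring punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def rehearsal_report(script, transcript):
    """Compare what you meant to say with what was recognized.

    Returns (accuracy, missed, added): accuracy is a 0-1 similarity ratio,
    missed lists script words that were not matched in the transcript, and
    added lists transcript words that were not matched in the script.
    """
    expected, spoken = words(script), words(transcript)
    matcher = difflib.SequenceMatcher(a=expected, b=spoken, autojunk=False)
    missed, added = [], []
    for op, a0, a1, b0, b1 in matcher.get_opcodes():
        if op in ("delete", "replace"):
            missed.extend(expected[a0:a1])
        if op in ("insert", "replace"):
            added.extend(spoken[b0:b1])
    return matcher.ratio(), missed, added


# Hypothetical script line and recognized transcript.
script = "Welcome everyone, today we are talking about Cognitive Services."
transcript = "welcome everyone today we're talking about cognitive services"
accuracy, missed, added = rehearsal_report(script, transcript)
print(f"accuracy={accuracy:.2f} missed={missed} added={added}")
```

Comparing words rather than characters keeps the report readable: you see which words you dropped or improvised instead of a wall of character-level differences.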

Here's a fun gif of me (Chloe) testing it out!

To get started, you'll need to create a Speech resource. This can be done by using the Azure Portal, the Azure CLI, or the Cloud Shell. Once your Speech Service resource is created, you'll have access to your API endpoint and subscription keys.

Never used Azure before? No problem: we have a Microsoft Learn module that will walk you through each step of getting started with our Speech features!
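With a key and region in hand, a Text to Speech call over REST is an HTTP POST with an SSML body. The sketch below only builds the request so you can inspect it before sending anything. The region, the placeholder key, the voice name, and the output format are assumptions to swap for your own values, and it is worth checking the header names against the current Speech REST documentation.

```python
import urllib.request
from xml.sax.saxutils import escape


def build_tts_request(text, region, key, voice="en-US-JennyNeural"):
    """Build (but do not send) a Text to Speech REST request."""
    # SSML wraps the text and names the voice; escape() protects &, < and >.
    ssml = (
        "<speak version='1.0' xml:lang='en-US'>"
        f"<voice name='{voice}'>{escape(text)}</voice>"
        "</speak>"
    )
    return urllib.request.Request(
        url=f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1",
        data=ssml.encode("utf-8"),
        method="POST",
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/ssml+xml",
            "X-Microsoft-OutputFormat": "riff-24khz-16bit-mono-pcm",
            "User-Agent": "rehearsal-helper",  # hypothetical app name
        },
    )


# "westus" and "YOUR_KEY" are placeholders for your resource's region and key.
request = build_tts_request("Hello & welcome!", "westus", "YOUR_KEY")
print(request.full_url)

# To actually synthesize audio, send the request and save the bytes:
# with urllib.request.urlopen(request) as response:
#     with open("line.wav", "wb") as audio_file:
#         audio_file.write(response.read())
```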

LEARN MORE: If you want to get really fancy, you can even create a custom voice! Using the power of Custom Voice (part of Azure Cognitive Services), you can upload data in our Custom Voice portal and train a voice that sounds just how you'd like.

Analyze Your Emotions

If you’re rehearsing a presentation, you might be interested to learn the story your face is telling along with your slides. Use Video Indexer (part of Azure Cognitive Services) to map your emotions over the course of the presentation, and then take a deeper dive by analyzing key frames from the video with Microsoft Cognitive Services Face service.

First, record your presentation as a video file in one of these formats. (One easy way to do this is to schedule a Teams video meeting with yourself and use the “record meeting” feature, and then download the recording from Microsoft Stream.)

Once you have your video file, upload it to Video Indexer, and then wait for the indexing to complete. (This could take minutes or hours, depending on the length of your video.)

Once the video is indexed, you can explore the information Video Indexer has identified: the people and objects in the video, the topics mentioned in the audio, and so on. Video Indexer will also give you a map of emotions throughout the video, and you can use the “Play previous” and “Play next” controls to see what you look like when you’re emoting!
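If you would rather see that emotion map as numbers, you can total up how long each emotion was detected. This sketch assumes you have already pulled a list of emotions out of the insights JSON, and that each entry has a "type" plus a list of "instances" with "start" and "end" timestamps in H:MM:SS form. That shape is an assumption for illustration, so check it against your own download before relying on it.

```python
from collections import defaultdict


def to_seconds(timestamp):
    """Convert an 'H:MM:SS.fff' timestamp string to seconds."""
    hours, minutes, seconds = timestamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def seconds_per_emotion(emotions):
    """Total the time attributed to each emotion type."""
    totals = defaultdict(float)
    for emotion in emotions:
        for instance in emotion["instances"]:
            duration = to_seconds(instance["end"]) - to_seconds(instance["start"])
            totals[emotion["type"]] += duration
    return dict(totals)


# Hypothetical sample data, shaped like the assumption described above.
sample = [
    {"type": "Joy", "instances": [{"start": "0:00:05", "end": "0:00:12.5"}]},
    {"type": "Sad", "instances": [{"start": "0:01:00", "end": "0:01:04"},
                                  {"start": "0:02:10", "end": "0:02:11"}]},
]
for emotion, total in sorted(seconds_per_emotion(sample).items()):
    print(f"{emotion}: {total:.1f}s")
```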

For a deeper dive, you can take a look at the high-resolution keyframes in the video. To get the images, download the “Artifacts (ZIP)” file from Video Indexer and expand it. You will also need to extract the contents of the “_KeyFrameThumbnail.zip” file inside it, which is where the keyframe images are stored.
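Unzipping a ZIP inside a ZIP by hand gets old quickly, so here is a small sketch that does both steps with Python's zipfile module. The only thing taken from the steps above is the “_KeyFrameThumbnail.zip” suffix. The function name, the "Artifacts.zip" file name, and the output folder are placeholders.

```python
import io
import zipfile
from pathlib import Path


def extract_keyframes(artifacts_zip, output_dir):
    """Unpack the keyframe images nested inside the artifacts ZIP.

    Returns the list of extracted file names.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(artifacts_zip) as outer:
        for name in outer.namelist():
            if not name.endswith("_KeyFrameThumbnail.zip"):
                continue
            # Read the inner archive into memory and open it as a ZIP too.
            with zipfile.ZipFile(io.BytesIO(outer.read(name))) as inner:
                inner.extractall(output_dir)
                extracted.extend(inner.namelist())
    return extracted


# Hypothetical file name; point this at wherever you saved the download.
if Path("Artifacts.zip").exists():
    print(extract_keyframes("Artifacts.zip", "keyframes"))
```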

You can use the Face service to analyze the images, but a quick and easy way is just to use the demo form on this page. Scroll down about halfway, click the “Perceived emotion recognition” tab, and upload one of the images from your keyframes folder:

As you can see, in this part of my presentation I was being aggressively Neutral, but try it with frames from your own presentation and see what emotions you were presenting to your audience.

LEARN MORE: Take this Microsoft Learn module to learn how to identify faces and expressions using Computer Vision, or check out this blog post on extracting high-resolution key frames with Video Indexer.

Moderate Your Stream/Comments

Do you stream your content online? Or, perhaps you're building an application that allows comments, but you don't want to be on-call to moderate comments 24/7? Guess what... we have a combo for that! Microsoft Azure Content Moderator provides machine-assisted content moderation for images, text, and video.

You can combine any of our Cognitive Services with Content Moderator to specify which words, images, and videos you'd like to censor. From optical character recognition (OCR) to specific words and profanities, there are so many ways to automate and combine our services to give you more time to code (and less time cringing at offensive messages).

Check out this great Microsoft Learn module on classifying and moderating text with Azure Content Moderator to get started!
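Once a text screening call tells you which terms it flagged, censoring a comment is mostly string bookkeeping. The sketch below masks flagged terms in the original comment. It assumes the screening result is a dictionary with a "Terms" list whose entries carry the flagged "Term" and its "OriginalIndex" position in the text, and that "Terms" may be empty or missing when nothing is flagged. Treat that shape and the sample result as assumptions to verify against a real response.

```python
def mask_flagged_terms(text, screen_result, mask_char="*"):
    """Replace each flagged term in text with mask characters."""
    characters = list(text)
    for hit in screen_result.get("Terms") or []:
        start = hit["OriginalIndex"]
        # Clamp the end so a bad index can never grow the comment.
        end = min(start + len(hit["Term"]), len(characters))
        characters[start:end] = mask_char * (end - start)
    return "".join(characters)


# Hypothetical comment and screening result, for illustration only.
comment = "This demo is crap, honestly."
result = {"Terms": [{"OriginalIndex": 13, "Term": "crap"}]}
masked = mask_flagged_terms(comment, result)
print(masked)
```

Masking in place (instead of deleting the term) keeps the comment the same length, so any other positions reported for the same text stay valid.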

Get a Confidence Boost from Cheerleading Toys

Need some encouragement, but lack an audience? With this combo, you can grab one of your three favorite stuffed toys, hold it up to the camera, and hear it give you a pep talk!

First, you’ll need to collect images of your cheerleader toys: about 50 of each should do it. This app from the Microsoft Azure Vision Workshop makes the process super easy: just follow the instructions in that repository and capture images for “rock”, “paper” and “scissors” for each of your toys, and also images for “none” when no toy is in the frame. (Optional: after you fork the repository you can make these changes to customize your toy names: change1 change2 change3.)

Next, you’ll use Custom Vision to train a model to recognize your cheerleaders. The process is explained in Steps 5-6 of the Azure Vision Workshop README.

Now you have an app that can recognize your cheerleaders! In the screenshot below, you can see that Doug the Drop Bear has been correctly identified.

Last, update the app to play a specific piece of audio when a cheerleader is recognized. We’ll leave this as an exercise for the reader, but to get you started here’s how to generate speech from text in Python.
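One possible starting point for that exercise is deciding which pep talk to play from a set of predictions. This sketch assumes each prediction is a dictionary with a "tagName" and a "probability", and that you kept the workshop's “rock”, “paper”, “scissors”, and “none” labels. The pep talk text, the 0.8 threshold, and the function name are invented for illustration.

```python
# Hypothetical pep talks, keyed by the label each toy was trained under.
PEP_TALKS = {
    "rock": "You are going to crush this presentation!",
    "paper": "Your slides look great. Go get them!",
    "scissors": "Sharp points, sharp delivery. You have got this!",
}


def pick_pep_talk(predictions, threshold=0.8):
    """Return the pep talk for the most likely toy, or None if unsure."""
    if not predictions:
        return None
    best = max(predictions, key=lambda p: p["probability"])
    # Stay quiet when no toy is in frame or the model is not confident.
    if best["tagName"] == "none" or best["probability"] < threshold:
        return None
    return PEP_TALKS.get(best["tagName"])


# Hypothetical predictions for a single camera frame.
frame_predictions = [
    {"tagName": "rock", "probability": 0.93},
    {"tagName": "paper", "probability": 0.03},
    {"tagName": "none", "probability": 0.04},
]
print(pick_pep_talk(frame_predictions))
```

The returned text is what you would hand to Text to Speech. Returning None for low-confidence frames keeps your cheerleaders from talking over you every time the camera flickers.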

Now, all you need to do when you need a confidence boost is to leave the app running, and hold up one of your cheerleaders to the camera!

LEARN MORE: Try these Microsoft Learn modules to get to know Custom Vision and to learn about using Speech to Text.

Take Notes (with no note taking!)

Ever wish you had a transcript of a meeting you were in? Perhaps you loved the way your classmate phrased a specific thing while chatting about your group project? Or maybe you don’t want to put the burden on a coworker to take notes? Using these two services combined, we can take notes and identify the speaker!

We've already touched on our speech APIs in this post, but let me introduce you to another one of my favorite new features (currently in preview): Speaker Recognition, part of Azure Cognitive Services!

With Speaker Recognition, we can identify an unknown speaker by comparing their voice against a selected group of speakers. This can be used as a means of verification or, in our case, to identify which individual is speaking in our meeting for note-taking purposes. Remember: you should ensure you have received the appropriate permissions from the users you select for speaker verification. In combination with Speech to Text, note taking becomes hands-free!

To get familiar with Speaker Recognition, we recommend starting with this Microsoft Learn module, which will walk you through creating a subscription, managing speaker verification profiles, and implementing speaker recognition on your audio. Then head on over to this Microsoft Learn module to familiarize yourself with our Speech to Text features.
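To picture how the two services fit together, imagine each chunk of meeting audio has already been transcribed and matched to a speaker profile. Turning that into notes is then a small formatting job. In this sketch the segment dictionaries, the profile IDs, and the name lookup are all hypothetical stand-ins for whatever your own code collects from the two services.

```python
from itertools import groupby


def format_notes(segments, profile_names):
    """Turn identified, transcribed segments into readable meeting notes.

    segments: list of dicts with 'start' (seconds), 'profile_id', 'text'.
    profile_names: maps speaker profile IDs to display names.
    """
    lines = []
    # Merge consecutive segments from the same speaker into one note line.
    for profile_id, group in groupby(segments, key=lambda s: s["profile_id"]):
        group = list(group)
        minutes, seconds = divmod(int(group[0]["start"]), 60)
        speaker = profile_names.get(profile_id, "Unknown speaker")
        text = " ".join(segment["text"] for segment in group)
        lines.append(f"[{minutes:02d}:{seconds:02d}] {speaker}: {text}")
    return "\n".join(lines)


# Hypothetical meeting data, for illustration only.
segments = [
    {"start": 0, "profile_id": "p1", "text": "Welcome to the sync."},
    {"start": 4, "profile_id": "p1", "text": "First up, the demo."},
    {"start": 65, "profile_id": "p2", "text": "The demo is ready."},
    {"start": 70, "profile_id": "p9", "text": "Sorry I'm late."},
]
names = {"p1": "Chloe", "p2": "David"}
notes = format_notes(segments, names)
print(notes)
```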

Happy WFHing!

Thanks for reading! We hope you have a happy #AIApril and can't wait to see what amazing combos you use to make your remote presentations easier and more exciting! Follow the #AIApril hashtag on Twitter, and stay tuned for a new post from Microsoft every day this month. There's so much more to share!