benank

Realtime Scoring and Feedback for Dance Moves Using TensorFlow.js

In pretty much every game that's ever existed, there's some way to keep track of how you're doing, usually in the form of a score value, such as how many points you have or how many enemies you've defeated.

In dance and rhythm games, such as Just Dance and Dance Dance Revolution, there are also scoring systems to give you realtime feedback on your performance. We'll dive into these soon to see why and how they work.

As it happens, I am working on a dance game as well! It's much more similar to Just Dance than DDR, but I think there are things that can be learned from both scoring systems. The goal here is to design a scoring system that feels rewarding while not rewarding the wrong moves. I want the player to feel good when they are doing well, and on the flip side, I want to give the player accurate feedback when they aren't doing well so they can improve.

Examining Existing Scoring Systems

Before trying to create a scoring system from scratch, let's take a look at some existing scoring systems for similar dance games to see how they did it.

Dance Dance Revolution

Dance Dance Revolution has been out for over 20 years, and its scoring system has been battle-tested throughout this time. If you're unfamiliar with DDR, it's a rhythm dance game where you have to step on arrows on a dance pad as they line up with the arrows on the screen to music:

(that's me playing above!)

It's a simple concept, and it makes the scoring system pretty easy to figure out! If you hit the note right on time, you get a perfect score. If not, you get less than perfect, depending on the timing.

Timing is a big concept in DDR. The scoring system uses the concept of timing windows: periods of time within which you earn certain judgements. The timing windows for DDR look like this:

  • Marvelous: +/- 16.7 ms (+/- One Frame)
  • Perfect: +/- 33 ms (+/- Two Frames)
  • Great: +/- 92 ms (+/- 5.5 Frames)
  • Good: +/- 142 ms (+/- 8.5 Frames)
  • Boo: +/- 225 ms (+/- 13.5 Frames)

You get the idea - it's tricky to get the best scores because you have to hit it perfectly in time with the music and on-screen visuals! But there's no ambiguity in scoring - you either hit it on time, or you didn't.

Your overall score at the end is calculated from these "judgement scores" (the names of each score). If you got all Marvelous, you'll get a perfect score; otherwise, it'll be a little lower. If you're interested in learning more, there's a fun documentary video about the best DDR player in the world here.

Just Dance

The scoring system in Just Dance is quite similar to what mine will be: every few seconds, you receive a "judgement score," or text that indicates how well you've been doing in the time since the last judgement.

It looks like this during gameplay:

screenshot of just dance gameplay

Each player is represented at the top and receives various judgements (from OK to MEGASTAR) based on how well they did in the last period. If they got the moves perfectly, they'll get a MEGASTAR; if they didn't do so well, they might get an OK or GOOD judgement.

Just Dance has been available on almost every platform imaginable over the last decade or so, and each platform detects your dance moves a little differently. For example, when playing on the Nintendo Switch, the gyroscope and accelerometer data from your controllers are used to estimate your arm positions. After that, I imagine the data is compared to the expected dance movements and you're given a score. Only your hand positions are taken into account with this approach.

This same approach is taken in the modern Just Dance games on PC, where the game connects to your smartphone to estimate your movements. Now it has only one hand's worth of data to score you with, which doesn't seem like dancing at all to me! You could sit in a chair and get perfect scores with this approach.

This isn't the whole story, though. Upon doing more research, I found this quote from an article from 2010:

"We have built in a certain amount of leniency into the rhythm and precision detection, but to say that it's smoke and mirrors simply isn't true. We have constructed symmetrical movements using natural body dynamics to identify how well someone is playing even with just one controller. We look at the Wii remote as an extension of the players' hands and body.

"At first we used the nunchuk as well, but found that the wire would hit you in the face and that you lost that sense of freedom you want when dancing. The system we use takes into account the way you move, dance and handle the Wii Remote to offer more precision in the detection frames. This is seamless for the player, but it takes a lot of work to enable the proper detection through a single input device."

The game features a DDR-esque scoring system, measuring the player's performance on a wide variety of factors. I ask Granger whether a better player will always beat a poorer player, or whether the machine can be duped by flailing arms and raw energy. "No way. If you pick up the moves and hit the beat, your performance will be scored accordingly. A better dancer will always beat a poorer player. If you want to be in time and accurate, you have a much higher chance if you follow both arm movements and keep the rhythm with both your lower and upper body. So essentially, you're better off doing what the dancer is doing... Otherwise you will find the transitions difficult to match."

What they did for the first Just Dance (and likely subsequent versions) sounds much more impressive than I've been giving them credit for! However, based on customer reviews, results from this "one-handed movement detection system" are mixed. There's only so much data you can extrapolate from a single controller.

In contrast to the single controller, there have also been versions over the years that use the Xbox Kinect (and Xbox One Kinect). These camera sensors can track full-body movements, though I'm unsure to what extent they were used in the scoring system. Reviews from players who used the Kinect suggest much better results than with a single controller.

Side note: I have always been impressed by the fidelity of the dance videos in Just Dance; I didn't think that look was achievable with just animated characters, and I was right! It's super cool to see how much effort they've put into producing their dance videos.

Once you've finished playing a song, you're greeted by a summary screen:

screenshot of just dance summary screen

It shows you how many of each judgement you received and an overall score based on those judgements. There are a few additional scoring attributes, but those are the main aspects.

sparky.tv

As it turns out, someone has already made what I'm making! It's called sparky.tv and it's very similar to what I am working on. It uses AI (or rather, ML) to analyze your movements in realtime and compare them to the video to give you scores.

I'm unsure if it uses the same TensorFlow model that I'm using or some custom in-house solution. One interesting note is that all the processing is done server-side and takes about half an hour to process what are called "game intervals", or areas of the game in which the player will be scored. My game's processing is all done client-side and usually takes the duration of the video or less, depending on configuration options.

The gameplay screen looks a little something like this:

screenshot of sparky.tv gameplay

As you do better or worse, the little yellow running dude above the progress bar in the top right moves forwards or backwards. Every few seconds, a judgement (the "excellent" text and blue glow on the sides) appears based on where the yellow dude is on the progress bar. This means the current judgement isn't based on your instantaneous performance, but on your aggregated performance over time. If you miss a move, your dude will fall back and cause your next judgement to be lower, but it won't be terribly low (unlike Just Dance or Dance Dance Revolution, where missing a move earns you a "miss" judgement outright).

This scoring system has its merits in an experimental AI game - the AI that analyzes you isn't perfect yet, so there are bound to be analysis mistakes. Instead of taking the score each frame, it aggregates it to give you a more consistent score that's still meaningful and mostly representative of how well you're doing.

After the game, it displays a summary screen of statistics showing how well you did, as seen below.

screenshot of sparky.tv summary screen

As seen in the screenshot, it gives you a grade based on your point score and also lists how many of each type of judgement you got. The summary pictured here shows a theoretical maximum score; that is, I used the video that you're supposed to dance to as my webcam feed, so the game was receiving the exact inputs it was outputting. This should produce the maximum possible score since the movements match exactly, but it did not result in a perfect score. (The video was also mirrored before being used as the webcam feed, since the game assumes you are mirroring the on-screen movements.)

I also found a bug where, if the game can't detect you in the frame, it keeps giving you the same score. So if you reach the maximum "excellent" score, you will keep getting excellents until the video finishes. It may be possible to get a higher score by exploiting this bug.

sparky.tv has a lot of potential and is still a very new platform. It looks like they started as a simple site to watch workout and dance videos on, but now they've expanded to allow users to play alongside the videos and get scored using their AI. My dance game is similar in some aspects, but will have enough differences to set it apart.

Designing a Scoring System

With all of these different scoring systems in mind, how do we design one based on similarity scores from analyzing TensorFlow.js pose data?

As mentioned in a previous blog post, I've already figured out how to compare two sets of keypoints given to us by TensorFlow. However, there's a lot more to it than that.

After a few weeks of trial and error, I've come up with a system that's working pretty well so far! Here's what it looks like.

1. Get the keypoints by analyzing the current frame of the player's webcam

This step is pretty simple: just feed the player's webcam into TensorFlow.js and use the MoveNet model to get the list of keypoints. One key thing to note is that the webcam feed is mirrored (flipped horizontally) so that your movements line up with the on-screen dancer's, the same way a mirror would.
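
Here's a minimal sketch of what this step could look like with the pose-detection package. The `webcamVideo` element and function name are my own placeholders, and the exact model configuration in my game may differ:

```javascript
import '@tensorflow/tfjs-backend-webgl';
import * as poseDetection from '@tensorflow-models/pose-detection';

// Create a MoveNet detector once at startup.
const detector = await poseDetection.createDetector(
  poseDetection.SupportedModels.MoveNet,
  { modelType: poseDetection.movenet.modelType.SINGLEPOSE_LIGHTNING }
);

// Each frame: estimate the player's pose from the webcam <video> element.
// flipHorizontal mirrors the keypoints so they line up with the dancer.
async function getPlayerKeypoints(webcamVideo) {
  const poses = await detector.estimatePoses(webcamVideo, {
    flipHorizontal: true,
  });
  return poses.length > 0 ? poses[0].keypoints : null;
}
```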

2. Compare the player's pose to the expected pose

Now that we have keypoint data for the player and dance video, we need to compare them to see how similar they are.

First, the player's current pose is compared against every dance video frame with keypoint data from the last half second. Using more than just the current frame's keypoints gives players some latency allowance when dancing; you will often lag behind the dance video by half a second or so, and this method accounts for that.

The comparison method for each set of keypoints is still the same method used in my earlier blog post where I split the keypoints into three distinct sets of points: head (face), torso (arms), and legs. These groups are compared separately and scores are given for each.
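
As a rough sketch, splitting MoveNet's 17 keypoints into those three groups might look like this (the exact group boundaries are my assumption based on the description above):

```javascript
// MoveNet keypoint names, grouped into the three scored regions.
const KEYPOINT_GROUPS = {
  head: ['nose', 'left_eye', 'right_eye', 'left_ear', 'right_ear'],
  torso: ['left_shoulder', 'right_shoulder', 'left_elbow', 'right_elbow',
          'left_wrist', 'right_wrist', 'left_hip', 'right_hip'],
  legs: ['left_knee', 'right_knee', 'left_ankle', 'right_ankle'],
};

// Turn a flat keypoint list into { head: [...], torso: [...], legs: [...] }.
function splitIntoGroups(keypoints) {
  const byName = Object.fromEntries(keypoints.map((kp) => [kp.name, kp]));
  const groups = {};
  for (const [group, names] of Object.entries(KEYPOINT_GROUPS)) {
    groups[group] = names.map((name) => byName[name]);
  }
  return groups;
}
```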

After these two steps, how to pick the "best" score is ambiguous: which score should we take from this list of scores from the last half second? Should it be the average, the highest, the lowest, or something else entirely?

I'm still experimenting with this step, but right now I'm using a method that removes all outliers (high and low) from the data set and then returns the highest non-outlier score for each group. This means it will take your highest score for head, torso, and legs within the last half second, even if they come from different frames. Removing outliers is important because some frames may have been analyzed poorly, and others may contain false positives (similarity scores that are too high).
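
Here's a sketch of that selection step, using a simple IQR-based outlier filter (the game's actual outlier test may differ, and the names are placeholders):

```javascript
// Drop scores outside 1.5 * IQR of the data set, then take the best remaining.
function bestNonOutlierScore(scores) {
  const sorted = [...scores].sort((a, b) => a - b);
  const q1 = sorted[Math.floor(sorted.length * 0.25)];
  const q3 = sorted[Math.floor(sorted.length * 0.75)];
  const iqr = q3 - q1;
  const inliers = sorted.filter(
    (s) => s >= q1 - 1.5 * iqr && s <= q3 + 1.5 * iqr
  );
  return Math.max(...inliers);
}

// Applied per group over the comparison scores from the last half second, e.g.:
// bestScores.torso = bestNonOutlierScore(lastHalfSecondFrames.map(f => f.torso));
```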

3. Scaling the similarity scores

After getting a similarity score for the current frame for each of the three body groups (head, torso, legs), it's time to apply some transformation and scaling functions to it so we can get an overall score.

There are three parts to this step: group weights, group score thresholds, and the scaling function.

Group weights: should each of the three groups hold the same weight in the overall score? I decided the head group should hold less weight, since it's not as important to exactly match the direction the dancer is facing; the torso and legs hold the rest of the weight.
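
As a sketch, the weighted combination could look like this; the exact weights below are placeholders, since I'm not pinning down specific numbers here:

```javascript
// Assumed example weights: head counts for little, torso and legs split the rest.
const GROUP_WEIGHTS = { head: 0.1, torso: 0.45, legs: 0.45 };

function combineGroupScores(groupScores) {
  return Object.entries(GROUP_WEIGHTS).reduce(
    (sum, [group, weight]) => sum + weight * groupScores[group],
    0
  );
}
```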

Group thresholds and scaling: based on testing, similarity scores under a certain amount don't hold much meaning. For example, a similarity score under 70% for the torso group means you weren't even close to the movement. Scaling this 70%-100% interval to 0%-100% gives much more resolution when looking at the score, leaving a much larger range of score values to work with.

Example: the similarity score for a movement is 90%, and the group's thresholds are a minimum of 70% and a maximum of 95%. The score scales to (90 - 70) / (95 - 70) = 80%.
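
In code, that scaling is just a clamped linear remap:

```javascript
// Remap [min, max] to [0, 1], clamping values outside the range.
function scaleScore(score, min, max) {
  return Math.min(Math.max((score - min) / (max - min), 0), 1);
}

scaleScore(0.9, 0.7, 0.95); // => 0.8, matching the example above
```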

After scaling based on minimum and maximum group thresholds, I also applied a function to adjust the score a little bit more, depending on the value.

graph of scoring curve

It's similar to a curve you'd see applied to grades or test scores in school, except that this curve punishes scores under 50% (lowering them) and rewards scores over 50% (raising them). It caps at 92%, so if you hit 92% similarity, it counts as a perfect score. There are many ways to shape a curve like this, and I'm still experimenting to see what feels best.
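
I won't reproduce my exact curve here, but one illustrative shape with those properties (shrink below 50%, boost above, cap at 92%) is a smoothstep-style ease:

```javascript
// An example curve, not the game's exact function.
function curveScore(score) {
  // Anything at or above 92% similarity counts as a perfect 100%.
  const x = Math.min(score / 0.92, 1);
  // Ease curve: values below 0.5 shrink, values above 0.5 grow.
  return x < 0.5 ? 2 * x * x : 1 - 2 * (1 - x) * (1 - x);
}
```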

4. Get the averaged score for this frame

Step 3 calculates the instantaneous frame score, but it can be wildly inconsistent when the quality of analysis from TensorFlow varies. To account for this, a frame's real score is based on the past second of scores recorded.

This step is quite generous in that it takes only the top 8 scores in the past second and removes anything more than 10% lower than the highest of those. For example, if the highest score in the list is 90%, any other scores under 81% are not used.

After getting the list of the highest scores in the last second, it takes the average of them to get the final overall score for the frame. This score is incredibly generous but helps to alleviate TensorFlow.js analysis problems on specific frames.
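
Here's a sketch of this step, assuming `recentScores` holds every frame score from the past second (the name is a placeholder):

```javascript
function averagedFrameScore(recentScores) {
  // Take the top 8 scores from the past second...
  const top = [...recentScores].sort((a, b) => b - a).slice(0, 8);
  // ...drop anything more than 10% below the best of them...
  const best = top[0];
  const kept = top.filter((s) => s >= best * 0.9);
  // ...and average what's left.
  return kept.reduce((sum, s) => sum + s, 0) / kept.length;
}
```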

5. Get the judgement for this frame

Based on the averaged score for this frame, a judgement is given. Judgement cutoffs are simply given by this table:

```javascript
// Minimum averaged frame score required to earn each judgement.
const JUDGEMENT_CUTOFFS = {
    [JUDGEMENTS.PERFECT]: 0.95,
    [JUDGEMENTS.MARVELOUS]: 0.90,
    [JUDGEMENTS.GREAT]: 0.80,
    [JUDGEMENTS.GOOD]: 0.70,
    [JUDGEMENTS.ALMOST]: 0.60,
    [JUDGEMENTS.MISS]: 0.0,
};
```
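
Looking up a judgement is then just a walk down that table. This sketch assumes the `JUDGEMENTS` values are strings, so the object keeps its declaration order (highest cutoff first):

```javascript
// Return the first judgement whose cutoff the score meets.
function getJudgement(score) {
  for (const [judgement, cutoff] of Object.entries(JUDGEMENT_CUTOFFS)) {
    if (score >= cutoff) return judgement;
  }
  return JUDGEMENTS.MISS;
}
```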

6. Get the real judgement score for this interval

And finally, the last step is to calculate and show the "real" judgement score for this interval. The interval is currently half a second, so this step runs every half second and shows your judgement for the previous half second.

This step is pretty simple: get all the judgements for each frame in the past interval, remove outliers (high and low), and then take the top judgement. This is also a generous method, but not as generous as some of the steps before.
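
A sketch of that interval step follows; the 10% trim used for the outlier cut is an assumption on my part, and it leans on the judgement point values defined in step 7 below:

```javascript
// Given every frame judgement from the past interval, trim outliers from both
// ends and return the best judgement that survives.
function intervalJudgement(frameJudgements) {
  const sorted = [...frameJudgements].sort(
    (a, b) => JUDGEMENT_SCORE_VALUES[b] - JUDGEMENT_SCORE_VALUES[a]
  );
  const trim = Math.floor(sorted.length * 0.1); // assumed 10% cut per end
  const trimmed = sorted.slice(trim, sorted.length - trim);
  return trimmed[0];
}
```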

7. Final Score

And once you've finished playing the song, the final score is given to you as a number between 0 and 100, with 100 being the maximum possible score.

The final score is calculated as your total judgement value divided by the maximum possible judgement value. Judgement value totals are calculated by counting up how many of each judgement you got per interval (see step 6) and adding the value of each judgement.

Judgement values look like this:

```javascript
// Point value awarded for each judgement earned per interval.
const JUDGEMENT_SCORE_VALUES = {
    [JUDGEMENTS.PERFECT]: 5,
    [JUDGEMENTS.MARVELOUS]: 4,
    [JUDGEMENTS.GREAT]: 3,
    [JUDGEMENTS.GOOD]: 2,
    [JUDGEMENTS.ALMOST]: 1,
    [JUDGEMENTS.MISS]: 0,
};
```

If you got 3 perfects, that's 15 points. If you got 3 perfects and one marvelous, that would be 19 points, and if the song had 4 judgements total in it, then you would have 19/20 points, or 95% for your final score.
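
Putting that together, the final score is a straightforward ratio. Here's a sketch; `judgementCounts` and `totalIntervals` are my own placeholder names:

```javascript
// judgementCounts maps each judgement to how many intervals earned it,
// e.g. { [JUDGEMENTS.PERFECT]: 3, [JUDGEMENTS.MARVELOUS]: 1 }.
function finalScore(judgementCounts, totalIntervals) {
  const earned = Object.entries(judgementCounts).reduce(
    (sum, [judgement, count]) => sum + JUDGEMENT_SCORE_VALUES[judgement] * count,
    0
  );
  const maxPossible = totalIntervals * JUDGEMENT_SCORE_VALUES[JUDGEMENTS.PERFECT];
  return (earned / maxPossible) * 100; // 19 / 20 => 95 for the example above
}
```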

My Scoring System In Action

What does all this look like in practice? I'll show you! This is what gameplay currently looks like in my game with all the scoring logic above. In this video, I'm feeding the dance video itself back into the game, so in theory it should receive a perfect score. But it doesn't, and that's okay! The TensorFlow model isn't perfect, but it works pretty well!

Surprisingly, I can sometimes beat the "theoretical maximum score" (from feeding the dance video back in) myself! And other times... not so much! 😅 Some songs are a little more difficult than others.

There's still a lot more work to do on this game, but it's starting to finally come together! It's playable now and I've been having a blast playing it myself and with friends.
