DEV Community


Using RunwayML To Create a Lip Sync Animation

music & code.

The goal is to create an open-source app or library that lets musicians speed up the process of creating visuals for their music.

Lip Sync

In parallel with my study of shader functions, I have been exploring ways to incorporate an animation of my face (or any character I wish to create) that will lip-sync to my song in an HTML/Canvas animation.

This was originally inspired by the output from the forced aligner I used (gentle), which included the time each word was spoken, as well as the duration of each phoneme in the word (phonemes are the fundamental units of a word's sound).

For example, gentle's result for the word "let" (the duration of the phoneme is shown in seconds):

      {
        "alignedWord": "let",
        "phones": [
          {
            "duration": 0.09,
            "phone": "l_B"
          },
          {
            "duration": 0.09,
            "phone": "eh_I"
          },
          {
            "duration": 0.04,
            "phone": "t_E"
          }
        ]
      }
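
Since each phone only carries a duration, the start and end time of every phoneme can be recovered by accumulating durations from the word's start time. Here is a minimal sketch of that idea. The `start` field name and its value are assumptions for illustration, and `phonemeTimeline` and `phonemeAt` are hypothetical helpers, not part of gentle.

```javascript
// One word entry shaped like the gentle output above.
// ASSUMPTION: `start` (word start time in seconds) is an assumed field
// name and the value 1.0 is made up for this example.
const word = {
  alignedWord: "let",
  start: 1.0,
  phones: [
    { duration: 0.09, phone: "l_B" },
    { duration: 0.09, phone: "eh_I" },
    { duration: 0.04, phone: "t_E" },
  ],
};

// Accumulate durations so each phoneme gets an absolute start and end time.
function phonemeTimeline(w) {
  let t = w.start;
  return w.phones.map(({ phone, duration }) => {
    const entry = { phone, start: t, end: t + duration };
    t += duration;
    return entry;
  });
}

// Return the phoneme being spoken at `time` (seconds), or null if none.
function phonemeAt(timeline, time) {
  const hit = timeline.find((p) => time >= p.start && time < p.end);
  return hit ? hit.phone : null;
}

const timeline = phonemeTimeline(word);
console.log(phonemeAt(timeline, 1.1)); // "eh_I"
```

With a timeline like this, each animation frame only needs the current playback time to know which mouth shape to show.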

My first plan was to map mouth shape coordinates to each phoneme when rendering the canvas at each frame of the animation. As a first attempt, I used the following image I found on the web, which shows the mouth shape corresponding to different letters:


I've tried to articulate my intention with comments throughout the code, but essentially, the master image (with all of the mouth shapes) is translated so that the mouth shape for each phoneme is shown as the corresponding word is displayed.
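
One way to sketch the master-image idea is to treat the image as a grid of equally sized cells and use the nine-argument form of `drawImage` to copy only the cell for the current phoneme. This may differ from how the original code translates the image. The grid dimensions, the `MOUTH_INDEX` mapping, and the helper names are all hypothetical.

```javascript
// ASSUMPTION: the master image is a grid of equal cells. These numbers
// are placeholders, not measurements of the actual image.
const SHEET = { cols: 2, cellWidth: 100, cellHeight: 80 };

// ASSUMPTION: hypothetical mapping from phoneme to cell index in the grid.
const MOUTH_INDEX = new Map([
  ["l", 0],
  ["eh", 1],
  ["t", 2],
  ["rest", 3],
]);

// Compute the source rectangle in the master image for a phone label
// such as "eh_I". The part after "_" is dropped before the lookup.
function mouthCell(phone) {
  const base = phone.split("_")[0];
  const index = MOUTH_INDEX.has(base)
    ? MOUTH_INDEX.get(base)
    : MOUTH_INDEX.get("rest"); // unknown phonemes fall back to a resting mouth
  return {
    sx: (index % SHEET.cols) * SHEET.cellWidth,
    sy: Math.floor(index / SHEET.cols) * SHEET.cellHeight,
    sw: SHEET.cellWidth,
    sh: SHEET.cellHeight,
  };
}

// Draw only the selected cell at (dx, dy) on a 2D canvas context.
function drawMouth(ctx, image, phone, dx, dy) {
  const { sx, sy, sw, sh } = mouthCell(phone);
  ctx.drawImage(image, sx, sy, sw, sh, dx, dy, sw, sh);
}
```

In a browser, `drawMouth(ctx, masterImage, "eh_I", 0, 0)` would be called once per frame with whichever phoneme is active at that time.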

I feel confident that this case study can be extended to a full song, with custom mouth shape coordinates (which will probably start out as drawings using vectr). This will likely be the next step I take to produce a full song's animation.

But before I proceed with that route, I wanted to try out something I came across a few days ago: RunwayML, which is software that provides a GUI to run different open-source ML models. RunwayML is explicitly marketed as software for creators. There's a free download and it's unbelievably easy to use, so if you are interested in using machine learning for creative endeavors, I highly recommend it.

Using RunwayML

Instead of using the image of mouth shapes, or drawing my own, I was happy to utilize the power of facial recognition to do that work for me.

I started by recording a short video of myself with my phone:

I then created a new workspace in RunwayML and added to it the Face Landmarks model, which is described by its author as follows:

A ResNet-32 network was trained from scratch on a dataset of about 3 million faces. This dataset is derived from a number of datasets. The face scrub dataset [2], the VGG dataset [1], and then a large number of images I personally scraped from the internet. I tried as best I could to clean up the combined dataset by removing labeling errors, which meant filtering out a lot of stuff from VGG. I did this by repeatedly training a face recognition model and then using graph clustering methods and a lot of manual review to clean up the dataset. In the end, about half the images are from VGG and face scrub. Also, the total number of individual identities in the dataset is 7485. I made sure to avoid overlap with identities in LFW so the LFW evaluation would be valid.

The model takes a video file as input and outputs the coordinates (in x,y pixels) for different recognized face features. The output format I chose was .JSON and the resulting data structure is:

    [
        {
            time: 0.01,
            landmarks: [
                {
                    bottom_lip: [[x0,y0], [x1,y1], ...],
                    chin: [[x0,y0], [x1,y1], ...],
                    left_eye: [[x0,y0], [x1,y1], ...],
                    ...
                }
            ]
        },
        ...
    ]

Each time value (based on the frame rate of the export, which in this case is 10 fps) has a corresponding set of landmarks (facial features). Each facial feature is assigned an array of [x, y] pixel coordinates.
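
To make the structure concrete, here is a small sketch of reading it. It assumes the export is an array of frames, that `landmarks` holds one object per detected face, and that frames are evenly spaced at the export frame rate. The sample values are made up, and `frameAt` and `featurePoints` are hypothetical helpers.

```javascript
// Sample data in the shape described above. The coordinates are made up.
const frames = [
  { time: 0.0, landmarks: [{ bottom_lip: [[10, 20], [15, 24]], chin: [[5, 40], [20, 42]] }] },
  { time: 0.1, landmarks: [{ bottom_lip: [[11, 21], [16, 25]], chin: [[5, 41], [20, 43]] }] },
];

// Pick the frame for a playback time in seconds.
// ASSUMPTION: frames are evenly spaced at `fps` frames per second.
// Times outside the recording are clamped to the first or last frame.
function frameAt(allFrames, seconds, fps = 10) {
  const index = Math.floor(seconds * fps);
  const clamped = Math.min(allFrames.length - 1, Math.max(0, index));
  return allFrames[clamped];
}

// Return the [x, y] points of one named feature, or [] if it is missing.
// ASSUMPTION: `face` indexes into `landmarks`, one entry per detected face.
function featurePoints(frame, name, face = 0) {
  const features = frame.landmarks[face];
  return features && features[name] ? features[name] : [];
}

console.log(featurePoints(frameAt(frames, 0.15), "bottom_lip"));
```

For lip sync, features such as `bottom_lip` are the ones of interest, so a helper like `featurePoints` keeps the drawing code from depending on the full set of landmarks.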

Here's the RunwayML interface during the export. The top panel shows the uploaded video, the bottom panel shows the export/preview of the model's output, and the side panel has model parameters:

I copied the .JSON output over to a pen, and built out a 10 fps animation using the face landmark coordinates:
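
A minimal sketch of such a playback loop could look like the following. It is not the code from the pen. It assumes each frame's `landmarks` is an array of per-face objects mapping feature names to point arrays, draws every feature as a simple polyline, and loops the recording. The function names are hypothetical, and `play` is only meant to run in a browser.

```javascript
const FPS = 10; // export frame rate mentioned in the post

// Map elapsed milliseconds to a frame index, looping at the end.
function frameIndex(elapsedMs, frameCount, fps = FPS) {
  const elapsed = Math.max(0, elapsedMs);
  return Math.floor((elapsed / 1000) * fps) % frameCount;
}

// Draw one feature as a connected line through its [x, y] points.
function drawFeature(ctx, points) {
  if (points.length === 0) return;
  ctx.beginPath();
  ctx.moveTo(points[0][0], points[0][1]);
  for (const [x, y] of points.slice(1)) ctx.lineTo(x, y);
  ctx.stroke();
}

// Clear the canvas and draw every feature of every face in the frame.
function drawFrame(ctx, frame, width, height) {
  ctx.clearRect(0, 0, width, height);
  for (const face of frame.landmarks) {
    for (const points of Object.values(face)) drawFeature(ctx, points);
  }
}

// Browser-only: redraw on every animation frame, but only advance the
// landmark data at FPS frames per second.
function play(canvas, frames) {
  const ctx = canvas.getContext("2d");
  const startedAt = performance.now();
  function tick(now) {
    const frame = frames[frameIndex(now - startedAt, frames.length)];
    drawFrame(ctx, frame, canvas.width, canvas.height);
    requestAnimationFrame(tick);
  }
  requestAnimationFrame(tick);
}
```

Deriving the frame index from elapsed time, rather than counting callbacks, keeps the animation at the export rate regardless of how often the browser repaints.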

Woo!! I think that's pretty awesome, given how smoothly the whole process went. Note that I did not adjust or study any of the model parameters, so I will explore that next.

A small note if you are new to RunwayML: make sure you download, install, and open Docker Desktop if you are running the model locally. RunwayML does give you credits to use a remote GPU to run the model, and I'll be using that this week to run a full video with a higher export frame rate.
