DEV Community


Creating Visuals for Music Using Speech Recognition, Javascript and ffmpeg: Version 0

Hello! This is my first blog post on DEV.

I make music and I code.

The Problem

Putting out music and garnering attention to it requires me to wear multiple hats for a variety of tasks: branding, social media marketing, beat production, songwriting, mastering audio, shooting and editing videos, designing graphics, the list goes on...

In order to create social media audiovisual content for my music, I generally follow this process:

  • 1) Make a beat in Garageband
  • 2) Write lyrics
  • 3) Practice the song
  • 4) Set up my DSLR camera
  • 5) Set up my microphone
  • 6) Film myself performing the song
  • 7) Import the video into Adobe Premiere
  • 8) Import the song audio into Adobe Premiere
  • 9) Align the audio with the video
  • 10) Add and align lyrics (text graphics) with the audio
  • 11) Add some effects to the video (I like this 80s look)
  • 12) Render the video (45 minutes to an hour)
  • 13) Export to .mp4 (another 30-40 minutes)
  • 14) Upload to YouTube (another 30-40 minutes)
  • 15) Upload to IGTV (another 30-40 minutes)

I want to increase the time I spend on steps 1 through 3 and decrease the time I spend on steps 4 through 15.


Last Sunday (07/07/2019) I was refactoring some of my code on a project from jQuery to Web APIs. One thing led to another, as tends to happen the longer I spend on MDN, and I came across the WebRTC (Web Real-Time Communication) standard and the YouTube LiveStream API documentation. That led me to Googling info about audio and video codecs, which finally led me to ffmpeg, open source software used for audio and video processing. Sweet--I could start something from there.

I had used this software sparingly in the past, so I spent a few days experimenting with different image-to-video conversions to learn the basics. Here I've used ffmpeg to build a sort-of timelapse of the BART (Bay Area Rapid Transit) train that passes nearby, from 338 images taken throughout the day:

This inspired and led me to the project I'm working on now.

The Project

I've called this project animatemusic at this GitHub repository. My goal is to create a toolchain to expedite the creation of visuals for my songs.

The Tech

  • Node.js
  • DOM Web API
  • JSZip
  • FileSaver
  • ffmpeg

How it Works Thus Far

The process is a bit choppy right now since I'm running the various responsibilities in series in a semi-manual fashion:

  • 1) Export my vocals from Garageband to a single .wav file
  • 2) Type the song lyrics into a .txt file
  • 3) Feed the song vocals and lyrics to a locally run CLI of gentle and receive a JSON file with the forced-alignment results
  • 4) Install and run my animatemusic repo locally
  • 5) Upload the JSON file (along with some other parameters) and receive a .zip folder with individual video frame .png files
  • 6) Use ffmpeg to stitch the images into a (lyric) video file
  • 7) Use ffmpeg to combine the song audio and the lyric video

Setting Up gentle

gentle is a forced-alignment tool that relies on kaldi, a speech recognition toolkit. Forced alignment involves matching a text transcript with the corresponding speech audio file.

The installation process for gentle was rocky, so the following tips and resources may be useful to you, should you choose to install it:

  • "Error finding kaldi files"
  • I added branch: "master" to the gentle .gitmodules file in order to capture some of the latest updates in kaldi which resolved some installation issues
  • Install gentle in a python virtual environment, since it expects python@2.7.x and the corresponding pip version
  • In gentle's bash script, comment out any of the brew install software names that you already have installed, since any brew warnings will prevent the script from proceeding to the critical next step

Generating the Forced-Alignment Results

Once you have gentle running, give yourself a pat on the back and then run the following in your terminal, now outside of the virtual environment which used python@2.7.x:

python3 align.py path/to/audio path/to/transcript -o path/to/output

The resulting file is in JSON format with the following structure:

  "transcript": string,
  "words": [
        "alignedWord": string,
        "case": string,
        "end": number,
        "endOffset": number,
        "phones": [
               "duration": number,
               "phone": string
        "start": number,
        "startOffset": number,
        "word": string
  • transcript
    • holds the full text of your transcript in a single string
  • words
    • holds word Objects in an array
  • alignedWord
    • is the word string that gentle recognized from the audio
  • case
    • is a status string with either "success" or "not-in-audio" values
  • end
    • is the time in seconds of when the word ends in the audio
  • endOffset
    • I'm not sure...TBD (comment if you know)
  • phones
    • holds an array of Objects, one per phoneme, each with the phone string and its duration
  • start
    • is the time in seconds of when the word starts in the audio
  • startOffset
    • I'm not sure...TBD (comment if you know)
  • word
    • is the word from the transcript to which the word in the audio file was force-aligned
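As a quick sketch of how this output can be consumed, here's how the successfully aligned words and their durations could be pulled out of the JSON (the function name and sample data are mine, for illustration):

```javascript
// Pull the successfully aligned words and their durations (in seconds)
// out of a gentle forced-alignment result.
function getAlignedDurations(result) {
  return result.words
    .filter((w) => w.case === "success")
    .map((w) => ({ word: w.word, duration: w.end - w.start }));
}

// Sample data shaped like gentle's output:
const alignment = {
  transcript: "hello world",
  words: [
    { word: "hello", case: "success", start: 0.5, end: 1.0 },
    { word: "world", case: "not-in-audio" },
  ],
};

console.log(getAlignedDurations(alignment));
// [ { word: 'hello', duration: 0.5 } ]
```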

Converting Forced-Alignment Results to Video Frames

If I can create an image for each video frame, I can render all of those image frames into a video using ffmpeg.

Right now, I have a single script block in my index.html which performs all of the logic around this process. Here's the minimal interface I've created thus far:

Here are the inputs to my script:

  • "video frame rate" and "full song length"
    • determine the total number of frames in the (eventual) video. Default values: 30 fps (frames per second) and 60 seconds, resulting in 1800 frames.
  • "words per frame" determine how many words will be displayed together on the canvas at any given time
    • right now my script is not optimal--if your cadence is fast, the time between words is short and this causes rounding errors and the script fails. This motivated the addition of this input.
  • "video width" and "video height"
    • set the size for the canvas element
  • "lyrics"
    • is the JSON output from gentle
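The frame math these inputs imply can be sketched like this (the helper names are mine, not from the repo):

```javascript
// Total number of frames in the eventual video, from the
// "video frame rate" and "full song length" inputs.
function totalFrames(fps, songLengthSeconds) {
  return Math.round(fps * songLengthSeconds);
}

// Number of frames a word (or word set) occupies, based on its
// start/end times from the forced-alignment output.
function framesForWord(word, fps) {
  return Math.round((word.end - word.start) * fps);
}

console.log(totalFrames(30, 60)); // 1800, matching the defaults above
console.log(framesForWord({ start: 1.0, end: 2.5 }, 30)); // 45
```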

The following scripts must be loaded first:

  • jszip.min.js
    • The wonderful JSZip client-side library which generates a zip file
  • FileSaver.js
    • The wonderful FileSaver client-side library which, among other functionality, exposes the saveAs variable to trigger a browser download of a file

The script I've written so far can be seen in the repo's index.html. It's still a work in progress, so please provide feedback. Here's how it works:

  • Upon uploading the transcript, the event handler handleFiles is called. handleFiles:
    • Parses the file into a regular JS object
    • Renders either a blank image (no lyrics being sung for that frame) or an image with the lyrics text (for frames where lyrics are being sung) onto the canvas element
    • Saves the canvas element first as a dataURL and then as a .png file object to the folder object which will eventually be zipped
    • Initiates the download of the zipped folder upon completion of all image renders
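To illustrate the blank-frame vs. lyric-frame decision, here is a minimal sketch; findWordAtTime is a hypothetical helper, not the repo's actual code:

```javascript
// Given a frame index and frame rate, find the aligned word (if any)
// being sung at that moment; returns null for a blank frame.
function findWordAtTime(words, frameIndex, fps) {
  const t = frameIndex / fps;
  return (
    words.find((w) => w.case === "success" && w.start <= t && t < w.end) ||
    null
  );
}

const words = [
  { word: "hello", case: "success", start: 0.5, end: 1.0 },
  { word: "world", case: "success", start: 1.0, end: 1.5 },
];

console.log(findWordAtTime(words, 20, 30).word); // "hello" (frame at ~0.67s)
console.log(findWordAtTime(words, 0, 30)); // null (no lyric yet, so a blank frame)
```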

A few helper functions to break up the responsibilities:

  • prepareWordData
    • takes the words Array from the transcript
    • extracts wordsPerFrame words at a time (default of 3 words)
    • creates an Array of new, reduced versions of the original word Objects, using the first word's start and the last word's end values for every set of words:

  {
    alignedWord: string,
    case: "success",
    end: number,   // the last word's `end` property
    start: number  // the first word's `start` property
  }

  • getWordDuration

    • takes a word object and returns the difference (in seconds) between the start and end values.
    • this "duration" is used to determine how many frames need to be rendered for each set of words
  • renderWordFrames

    • takes the word (empty string if no lyrics are spoken during those frames) and duration of the word
    • creates a new 2D context object
    • fills it with the words' text
    • gets the dataURL using the .toDataURL() method on the canvas element
    • saves it to the folder-object-to-be-zipped with filenames starting with 0.png
    • This filename convention was chosen since it's the default filename sequence that ffmpeg expects
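Here is a minimal sketch of the first two helpers described above, simplified from what's in index.html (the bodies are my own illustrative versions):

```javascript
// Group the aligned words into sets of wordsPerFrame, keeping the first
// word's start and the last word's end for each set.
function prepareWordData(words, wordsPerFrame = 3) {
  const sets = [];
  for (let i = 0; i < words.length; i += wordsPerFrame) {
    const group = words.slice(i, i + wordsPerFrame);
    sets.push({
      alignedWord: group.map((w) => w.word).join(" "),
      case: "success",
      start: group[0].start,
      end: group[group.length - 1].end,
    });
  }
  return sets;
}

// Duration (in seconds) of a word or word set.
function getWordDuration(word) {
  return word.end - word.start;
}

const lyrics = [
  { word: "la", start: 0.0, end: 0.5 },
  { word: "dee", start: 0.5, end: 1.0 },
  { word: "da", start: 1.0, end: 1.75 },
];

console.log(prepareWordData(lyrics));
// [ { alignedWord: 'la dee da', case: 'success', start: 0, end: 1.75 } ]
console.log(getWordDuration(prepareWordData(lyrics)[0])); // 1.75
```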

Generating the Video From Rendered Frames

Now that I have an image file for each frame of the video, I can use ffmpeg to stitch them together. I have found the following parameters to be successful:

ffmpeg -framerate 30 -i "%d.png" -s:v 640x480 -c:v libx264 -profile:v high -crf 20 -pix_fmt yuv420p path/to/output.mp4

  • -framerate 30 sets the video frame rate to 30 frames per second
  • -i "%d.png" matches the sequential filenames
  • -s:v sets the size of the video frame (corresponding to the canvas element size; in this example, 640x480)
  • -c:v specifies the video codec (I've used libx264 which is recommended by YouTube and Instagram)
  • -profile:v sets the quality of the video to high (haven't fully understood how it works yet)
  • crf is the "Constant Rate Factor" which I haven't fully understood, but it ranges from 0 (lossless) to 51 (lowest quality)
  • -pix_fmt sets the pixel format used, in this case, yuv420 which sets the ratio of pixels for luminance Y (or brightness), chrominance blue U and chrominance red V. I'm pretty rough on these concepts so please correct or enlighten if you are more experienced.

This command generates a video at the output path, stitching the images together at the given framerate.
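Since the rest of this toolchain is JavaScript, this step could eventually be driven from Node, e.g. by building the argument list and handing it to child_process.spawn. A sketch (the helper name and option object are mine):

```javascript
// Build the argument array for the ffmpeg command shown above, so it
// could be run from Node via child_process.spawn("ffmpeg", args).
function buildFfmpegArgs({ framerate, width, height, crf, output }) {
  return [
    "-framerate", String(framerate),
    "-i", "%d.png",
    "-s:v", `${width}x${height}`,
    "-c:v", "libx264",
    "-profile:v", "high",
    "-crf", String(crf),
    "-pix_fmt", "yuv420p",
    output,
  ];
}

const args = buildFfmpegArgs({
  framerate: 30,
  width: 640,
  height: 480,
  crf: 20,
  output: "path/to/output.mp4",
});

console.log(args.join(" "));
// -framerate 30 -i %d.png -s:v 640x480 -c:v libx264 -profile:v high -crf 20 -pix_fmt yuv420p path/to/output.mp4
```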

Adding the Song Audio

Now that I have the video for the lyrics, I can add the song audio (full song not just the vocals) using:

ffmpeg -i path/to/video -i path/to/audio -vcodec libx264 -acodec libmp3lame path/to/output.mp4

The first two input flags identify the video and audio files which will be streamed together using the video codec and audio codec specified.

The Result

Here's what I end up with!

It's pretty rough but the adrenaline rush was real when I saw it the first time.

Next Steps

I consider this a successful Proof-Of-Concept. Here are my next steps:

  • Over time, the lyrics fall out of sync with the audio, most likely because I rely on rounding the number of frames at 3 different places in the script

  • The manner in which the three words align with the vocals is suboptimal. I may consider increasing the number of words shown per set of frames

  • It's dull! The project is called animatemusic, and this video is lacking interesting animations. If you recall, the word objects contain an array of phonemes used to pronounce the word. Mixing this with anime.js, particularly their morphing animation, will lead to some interesting lip-sync animation attempts down the road

  • The process is fragmented. Generating the forced-alignment output, generating the video frame images and generating the final output video currently takes place in three separate manual steps. I would like to eventually integrate these different services

  • Integrations. The eventual goal is to connect this process with my YouTube and Instagram accounts so that I can upload to them upon completion using their APIs

  • Refactoring. There are a lot of improvements needed in my script, and I now feel confident enough to dive in and build this project out properly with tests


If you can help me improve my code, blog post, or my understanding of the context and concepts around anything you read above, please leave a comment below.

Follow Me


Thanks for reading!
