DEV Community



Creating Visuals for Music Using Speech Recognition, Javascript and ffmpeg: Version 1

This is my discussion of version 1 of a project I'm working on called animatemusic.

Click here to go to its GitHub repository.

My blog post on version 0 of this project can be found here.

milliseconds, not frames

In version 0 of this project, in my pursuit to render text on a canvas element (which would eventually become a video frame), I chose to design around the essential question:

which video frame is being rendered?

which was a reasonable question to ask, since I was rendering a finite, known number of frames (the framerate multiplied by the duration of the video).

However, it ended up having a not-so-reasonable solution, because it involved converting a float (start time in seconds) to an integer (frame number) for each set of words that was to be rendered. The repeated rounding accumulated, which resulted in the text lagging behind the vocal audio. I made a codepen to articulate the plight.
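To make the rounding problem concrete, here is a minimal sketch of how per-word rounding can pile up. It is an illustration under assumptions, not the version 0 code: the function names, the 24 fps framerate, and the ten 0.3125-second words are all made up for the example.

```javascript
// Hypothetical illustration of cumulative rounding drift; not the project's actual code.
const FPS = 24;

// Ten back-to-back words, each lasting 0.3125 s (7.5 frames at 24 fps).
const durations = Array(10).fill(0.3125);

// Frame-count approach: round each duration to whole frames, then accumulate.
function endFrameByAccumulation(durationsInSeconds, fps) {
  return durationsInSeconds.reduce(
    (frame, seconds) => frame + Math.round(seconds * fps),
    0
  );
}

// Time-based approach: accumulate seconds and convert to a frame only once.
function endFrameByTime(durationsInSeconds, fps) {
  const totalSeconds = durationsInSeconds.reduce((sum, s) => sum + s, 0);
  return Math.round(totalSeconds * fps);
}

const accumulated = endFrameByAccumulation(durations, FPS);
const timeBased = endFrameByTime(durations, FPS);

// Each word gains half a frame, so the text ends 5 frames late after 10 words.
console.log({ accumulated, timeBased, driftFrames: accumulated - timeBased });
```

The error per word is tiny, but it only ever goes in one direction here, so it grows with every word. Converting from time to frame once, as late as possible, keeps the error bounded to a single frame.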

For version 1, I've chosen to steer clear of this issue by designing around a new essential question:

at what time is the word rendered?

Examples using requestAnimationFrame

I found two resources that reassured me that my new essential question was worth pursuing:

I would eventually like to use a library such as anime.js or three.js, and their documentation and API also catered to a time-based animation approach.
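As a rough sketch of what a time-based approach looks like with requestAnimationFrame: the callback receives a timestamp in milliseconds, and the drawing code derives everything from the elapsed time rather than from a frame counter. The names `progressAt` and `animate` are hypothetical and are not taken from the resources above or from anime.js or three.js.

```javascript
// Fraction of the animation completed at time `nowMs`, clamped to [0, 1].
function progressAt(startMs, nowMs, durationMs) {
  return Math.min(Math.max((nowMs - startMs) / durationMs, 0), 1);
}

// Browser-only loop: requestAnimationFrame passes a timestamp in milliseconds.
function animate(durationMs, draw) {
  let startMs = null;
  function tick(nowMs) {
    if (startMs === null) startMs = nowMs; // first frame defines time zero
    const progress = progressAt(startMs, nowMs, durationMs);
    draw(progress);
    if (progress < 1) requestAnimationFrame(tick);
  }
  requestAnimationFrame(tick);
}

// Only start the loop where requestAnimationFrame exists (i.e. in a browser).
if (typeof requestAnimationFrame === "function") {
  animate(2000, (progress) => console.log(progress.toFixed(2)));
}
```

Because `draw` only sees the progress value, a dropped or delayed frame changes how often it is called but not where the animation is at a given moment.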

Refactor and generateFrameImages

I took this opportunity to refactor my original script and to add functions that render the text on the canvas whenever the current (elapsed) time of the video falls between a word's start and end times. I also put together a codesandbox and a sample transcript JSON for upload.
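Separately from the codesandbox, the time-window check can be sketched in a few lines. This is a simplified illustration, not the refactored script: the transcript shape (`word`, `start`, `end`, with times in seconds) and the helper names `wordsAt` and `frameTimes` are assumptions made for the example.

```javascript
// Assumed transcript shape: [{ word, start, end }] with times in seconds.
const words = [
  { word: "hello", start: 0.5, end: 0.75 },
  { word: "world", start: 1.0, end: 1.5 },
];

// Return the words whose [start, end) window contains the elapsed time.
function wordsAt(transcript, elapsed) {
  return transcript
    .filter(({ start, end }) => elapsed >= start && elapsed < end)
    .map(({ word }) => word);
}

// Elapsed time of every frame, derived from the frame index so that
// no rounding error carries over from one frame to the next.
function frameTimes(durationSeconds, fps) {
  const frameCount = Math.ceil(durationSeconds * fps);
  return Array.from({ length: frameCount }, (_, i) => i / fps);
}

// Walk a 2-second clip at 4 fps and list what would be drawn on each frame.
for (const elapsed of frameTimes(2, 4)) {
  console.log(elapsed.toFixed(2), wordsAt(words, elapsed).join(" "));
}
```

In the real project, the `console.log` call would be replaced by whatever draws the text on the canvas for that frame.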

The Result

I am trembling with excitement as I present to you the new and improved resulting video!! The lyrics sync with the audio so much better than in version 0! I don't see many issues with the text (although there are some blanks and one <unk>, which is gentle's version of undefined).

Thanks for reading! Please comment below if you have something you can share to help me improve this project. Going to try and sleep now amongst the adrenaline rush...



Top comments (3)

Jochem Stoel

This is fantastic. I will follow you right away. How doable do you think it is to sync it per syllable?

vbaknation

Thank you! Great question, and I had not thought about it from that perspective! My initial thoughts: instead of rendering the full word or set of full words on a frame, each syllable would be rendered separately using something like this in conjunction with the millisecond values that the forced alignment outputs. I don't know how exact an alignment you would get, since gentle outputs the milliseconds for phonemes while syllables can often contain multiple phonemes (e.g. pebble is two syllables but I think 4 phonemes). But you could take the start and end timing of each word, divide the duration by the number of syllables, and render each syllable at those increments. That would give an approximate alignment across the syllables of a single word while maintaining the overall timing of each word over the duration of the transcription. This assumes that the speaker gives equal time to each syllable, which may not be true. If you are working on something, feel free to share the link here!
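A minimal sketch of that even-split idea, assuming the syllables of the word are already known and each one gets an equal share of the word's duration. The helper name `syllableTimings` and the example times are hypothetical.

```javascript
// Hypothetical helper: split a word's time window evenly across its syllables.
// Assumes the syllables are already known and each takes equal time.
function syllableTimings(syllables, start, end) {
  const step = (end - start) / syllables.length;
  return syllables.map((syllable, i) => ({
    syllable,
    start: start + i * step,
    // Pin the last syllable to the word's own end time to avoid float drift.
    end: i === syllables.length - 1 ? end : start + (i + 1) * step,
  }));
}

// "pebble" spoken from 1.0 s to 1.5 s, split into two syllables.
console.log(syllableTimings(["peb", "ble"], 1.0, 1.5));
```

The word's own start and end times are preserved, so any error from the equal-time assumption stays inside that word and does not shift the words that follow.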

Jochem Stoel

I am a developer myself (duh :P) but I also do visualizations for music videos and like to play with audio responsive visuals. Like those equalizer looking visualizations you see on YouTube that react to the 'beat' of the music.
I have tried a few times in the past to use Google's transcription service or just plain TTML lyrics to visualize them on screen as they are being sung/spoken. These types of videos are generally called typography lyrics videos or something like that. This never really worked out the way I wanted it to, because Google's TTML files are not precise enough, and it is on my maybe-someday to-do list.
TTML (Timed Text Markup Language) is what you see under some YouTube videos. The subtitles are generated and one word is added as it is said. I have various half-finished applications for this that are all on my to-someday-do list.
One of which is a speedreader, an application that allows you to read text 5 times faster by using a central point of attention that continuously updates. Here is one example:
I think that, as you suggested, when you have the syllables of a word from some API or library and you decide the display time of each syllable based on the time a word is visible divided by the number of syllables it has, this is probably accurate enough, with maybe a few exceptions.
The problem is that you can't really do this with realtime speech recognition, because by the time a word is recognized you are already too late. The word has already been said. Using the continuous mode of Google transcription in the browser would seem like a possible solution, but it's not, because the suggestions it gets while the word is being spoken have very low confidence levels and generally do not make a lot of sense.
Although I am not really working on anything specific, I play around with the image2text and text2image machine learning models too and I dunno, your post just seemed relatable. :)