Discussion on: Creating Visuals for Music Using Speech Recognition, Javascript and ffmpeg: Version 1

VBAK (Author)

Thank you! Great question and I have not thought about it from that perspective!! My initial thoughts: instead of rendering the full word or set of full words on a frame, each syllable would be rendered separately using something like this, in conjunction with the millisecond values that the forced alignment outputs. I don't know how exact of an alignment you would get, since gentle outputs the milliseconds for phonemes while syllables can often contain multiple phonemes (e.g. pebble is two syllables but I think 4 phonemes). But you could take the duration between the start and end of each word, divide it by the number of syllables, and render each syllable at those increments to get an approximate alignment across the syllables of a single word while maintaining the overall timing of each word over the duration of the transcription. This assumes that the speaker gives equal time to each syllable, which may not be true. If you are working on something, feel free to share the link here!
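To make the even-split idea concrete, here is a rough sketch. The shape of the aligned word object and the millisecond units are assumptions, not gentle's exact schema, and the syllables themselves would have to come from somewhere else:

```javascript
// Split a word's aligned duration evenly across its syllables.
// Assumes the forced alignment gives you something like
// { word: "pebble", start: 12340, end: 12810 } (times in ms) and that the
// syllables come from some other source -- both are assumptions here.
function splitWordIntoSyllables(alignedWord, syllables) {
  const slice = (alignedWord.end - alignedWord.start) / syllables.length;

  return syllables.map((syllable, i) => ({
    text: syllable,
    start: alignedWord.start + i * slice,
    end: alignedWord.start + (i + 1) * slice,
  }));
}

// "pebble" rendered as two syllables over the word's aligned window:
splitWordIntoSyllables({ word: "pebble", start: 12340, end: 12810 }, ["peb", "ble"]);
// -> [ { text: "peb", start: 12340, end: 12575 },
//      { text: "ble", start: 12575, end: 12810 } ]
```

Each syllable then gets rendered on the frames that fall inside its slice, so the word as a whole still starts and ends exactly where the alignment says it should.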

Jochem Stoel

I am a developer myself (duh :P) but I also do visualizations for music videos and like to play with audio-responsive visuals, like those equalizer-looking visualizations you see on YouTube that react to the 'beat' of the music.
I have tried a few times in the past to use Google's transcription service or just plain TTML lyrics to visualize them on screen as they are being sung/spoken. These types of videos are generally called typography lyric videos or something like that. It never really worked out the way I wanted because Google's TTML files are not precise enough, so it is on my maybe-someday to-do list.
TTML (Timed Text Markup Language) is what you see under some YouTube videos: the subtitles are generated and one word is added as it is said. I have various half-finished applications for this that are all on my to-someday-do list.
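For what it's worth, pulling word timings out of timed text is the easy part. This is a rough sketch that assumes a simplified TTML file where every word is its own `<p>` with begin/end attributes; the generated caption files are messier than that:

```javascript
// Turn a (simplified) TTML document into { text, begin, end } word timings.
// Assumes every word sits in its own <p begin="..." end="..."> element with
// clock values like "00:00:01.200" -- generated caption files vary a lot.
function parseTtmlWords(ttmlString) {
  const doc = new DOMParser().parseFromString(ttmlString, "application/xml");

  const toSeconds = (clock) => {
    const [h, m, s] = clock.split(":").map(Number);
    return h * 3600 + m * 60 + s;
  };

  return Array.from(doc.getElementsByTagNameNS("*", "p")).map((p) => ({
    text: p.textContent.trim(),
    begin: toSeconds(p.getAttribute("begin")),
    end: toSeconds(p.getAttribute("end")),
  }));
}

parseTtmlWords(`
  <tt xmlns="http://www.w3.org/ns/ttml"><body><div>
    <p begin="00:00:01.200" end="00:00:01.650">never</p>
    <p begin="00:00:01.650" end="00:00:02.100">gonna</p>
  </div></body></tt>
`);
// -> [ { text: "never", begin: 1.2, end: 1.65 },
//      { text: "gonna", begin: 1.65, end: 2.1 } ]
```

The precision problem is in the timings themselves, not in reading them out, which is why this is still on the shelf.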
One of those half-finished applications is a speedreader: an application that lets you read text 5 times faster by using a central point of attention that continuously updates. Here is one example:
I think that, as you suggested, when you have the syllables of a word from some API or library and you base the display time of each syllable on the time the word is visible divided by its number of syllables, that is probably accurate enough, with maybe a few exceptions.
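If no syllable API or library is handy, a rough vowel-group heuristic plus the even split is probably enough for this. Just a sketch and only an approximation, not a real hyphenation library:

```javascript
// Rough syllable count: count groups of consecutive vowels, with a small
// correction for a trailing silent "e". Good enough for many English words
// ("pebble" -> 2, "typography" -> 4), but it is only a heuristic.
function countSyllables(word) {
  const w = word.toLowerCase().replace(/[^a-z]/g, "");
  if (!w) return 0;

  let count = (w.match(/[aeiouy]+/g) || []).length;
  if (w.endsWith("e") && !w.endsWith("le") && count > 1) count -= 1;

  return Math.max(count, 1);
}

// Display time per syllable for a word that stays on screen for 600 ms:
const perSyllableMs = 600 / countSyllables("typography"); // 150 ms
```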
The problem is that you can't really do this with realtime speech recognition, because by the time a word is said you are already too late: the word has already been said. Using the continuous mode of Google's transcription in the browser would seem like a possible solution, but it's not, because the guesses it produces while the word is being spoken have very low confidence levels and generally do not make a lot of sense.
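For reference, continuous mode in the browser looks roughly like this with the Web Speech API (Chrome backs it with Google's recognizer), and the interim results are exactly those low-confidence, mid-word guesses:

```javascript
// Continuous recognition in the browser. Interim results stream in while a
// word is still being spoken; their confidence is typically very low or 0,
// which is why they are useless for word-accurate timing.
const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;

const recognition = new SpeechRecognition();
recognition.continuous = true;
recognition.interimResults = true;
recognition.lang = "en-US";

recognition.onresult = (event) => {
  for (let i = event.resultIndex; i < event.results.length; i++) {
    const result = event.results[i];
    const { transcript, confidence } = result[0];
    console.log(result.isFinal ? "FINAL" : "interim", transcript, confidence);
  }
};

recognition.start();
```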
Although I am not really working on anything specific, I play around with the image2text and text2image machine learning models too, and I dunno, your post just seemed relatable. :)