PLsergent / OWOI_AudioToClip
Python module used for the school project OWOI (One Word One Image)
Installation
After cloning the repository, you can install the dependencies with the following command:

```shell
poetry install
```
Credentials
Please provide your credentials in the following environment variables:
```shell
export GOOGLE_APPLICATION_CREDENTIALS="path/to/credentials.json"
export GOOGLE_IMAGES_SEARCH_TOKEN="token"
export GOOGLE_SEARCH_ID="id"
```
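As a quick sanity check, you can verify from Python that all three variables are visible before running the module. This is an illustrative sketch, not part of the package; `missing_credentials` is a hypothetical helper:

```python
import os

# The three variables listed above.
REQUIRED_VARS = (
    "GOOGLE_APPLICATION_CREDENTIALS",
    "GOOGLE_IMAGES_SEARCH_TOKEN",
    "GOOGLE_SEARCH_ID",
)

def missing_credentials(env=os.environ):
    """Return the names of the required variables that are not set."""
    return [name for name in REQUIRED_VARS if name not in env]

print(missing_credentials())  # [] means everything is exported
```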
Classes
TranscriptFactory
This class is used to create a transcript from an audio file stored on Google Cloud Storage. It will create a list of words and a list of timestamps.
```python
from owoi_audio_to_clip.TranscriptFactory import TranscriptFactory

transcript_factory = TranscriptFactory(gcs_uri="gs://bucket/file.mp3")
```
Methods:
- `transcribe_audio_to_text() -> list[dict]`: transcribes the audio at `gcs_uri` and returns a list of dicts with the keys `"word"`, `"start_time"` and `"end_time"`
- `get_word_timestamps() -> list[dict]`: returns a list of dicts with the keys `"word"`, `"start_time"` and `"end_time"`
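To illustrate, here is the shape of the returned list, with made-up words and times (not real output):

```python
# Made-up example of the list[dict] shape described above.
word_timestamps = [
    {"word": "Sicko", "start_time": 12.3, "end_time": 12.7},
    {"word": "Mode", "start_time": 12.7, "end_time": 13.1},
]

# Each entry carries the three documented keys, and the per-word
# duration is simply end_time - start_time.
durations = [t["end_time"] - t["start_time"] for t in word_timestamps]
```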
A cool idea 💡
Everything started when I saw this video:
Sicko Mode but every word is a Google image
I just loved the idea of having an image for every word in the lyrics. A few years ago I started thinking about automating the process.
The project 📁
In my engineering school we had to do a project for the semester and so we decided to give it a try.
The idea was to build a website where people could upload songs of their choice and generate video clips based on this idea: each word becomes an image.
Here is the repo for the front end, but we won't dive into it in this article.
layfredrc / OneWord_OneImage
OneWordOneImage is a tool that allows users to create image-based clips synchronized to music lyrics.
Project OneWordOneImage (OWOI) | Equipe 7
Badr TADJER | Frédéric LAY | Pierre-Louis SERGENT | Leo TRAN | Meo BIENFAIT | Younes BOUCHAKOUR
Introduction
Follow this link to see the source code: GitHub
Execution
requirements
- Poetry
- Python ^3.10
- PostgreSQL
- Web Browser
setup

1. Configure and launch the `poetry` environment. Documentation: Poetry

```shell
# Launch poetry
$ poetry shell

# First installation
# Check the Python version
$ poetry env info

# First installation
# Change the Python version (if less than ^3.10)
$ poetry env use <path_python_^3.10>

# Install the libraries of the environment
$ poetry install
```
2. Install and build the `npm` libraries

```shell
# Install the npm libraries
# path: app/frontend
$ npm install --legacy-peer-deps

# Build the frontend
$ npm run build
```
3. Install Docker and setup the…
The tools 🔨
In order to automate the creation of clips we would need tools to:
- Recognize the lyrics of any song with a precise timestamp for each word
- Fetch the images from Google Images
- Concatenate the images into a video clip
Lyrics recognition
We looked at different tools; the obvious option was to use an API that returns the lyrics directly (for instance Musixmatch). But the issue was that we needed the timestamp of each word in order to match the pictures with the song.
That's why we decided to use Google Speech-to-Text.
This AI-powered API was not originally made to recognize singing voices, which will limit the range of usable songs later on. But with it we could get the lyrics with pretty good accuracy (again, depending on the music) as well as the timestamps.
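The setting that makes this work is word-level time offsets. As a sketch, this is roughly what the request body for the Speech-to-Text `longrunningrecognize` REST method looks like (field names from the public v1 API; the bucket URI and language are placeholders):

```python
def build_recognition_request(gcs_uri: str, language_code: str = "en-US") -> dict:
    """JSON body for a long-running recognition with per-word timestamps."""
    return {
        "config": {
            "languageCode": language_code,
            # This flag is what returns a (start, end) pair for every word.
            "enableWordTimeOffsets": True,
        },
        "audio": {"uri": gcs_uri},
    }

request = build_recognition_request("gs://bucket/file.mp3")
```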
Google images
To fetch images from Google Images we simply used the Python package Google-Images-Search.
Tricky parts for me:
- You'll need to create a search engine here: https://programmablesearchengine.google.com/
- The package documentation mentions a `project_cx`, which is actually your search engine id and looks like this: `234903464501239304239:ccxz234er`
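Putting the two together, here is a sketch of how the package gets wired up. `fetch_first_image` is a hypothetical helper name; it reads the env vars from the credentials section above, and the package import is deferred so the params builder works even without it installed:

```python
import os

def build_search_params(word: str) -> dict:
    """Search params for a single word (first jpg result only)."""
    return {
        "q": word,
        "num": 1,
        "fileType": "jpg",
        "safe": "medium",
        "imgColorType": "color",
    }

def fetch_first_image(word: str, download_dir: str = "/tmp/owoi") -> None:
    # Deferred import: pip install Google-Images-Search
    from google_images_search import GoogleImagesSearch

    gis = GoogleImagesSearch(
        os.environ["GOOGLE_IMAGES_SEARCH_TOKEN"],  # API key
        os.environ["GOOGLE_SEARCH_ID"],            # the "project_cx" / search engine id
    )
    gis.search(search_params=build_search_params(word))
    for image in gis.results():
        image.download(download_dir)
```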
Video clip editing
That's pretty much a no-brainer: we used the package MoviePy.
Even though you might struggle a bit to debug issues related to ImageMagick, the documentation is pretty well made.
This link saved me when MoviePy couldn't find ImageMagick: https://github.com/Zulko/moviepy/issues/693#issuecomment-355587113
Also, be careful when using subtitles: you'll need to have the selected font installed on your system.
The dev 🧑💻
Without going into the details, you'll find these two main classes:
- TranscriptFactory: used to transcribe the song into lyrics
- ClipMakerFactory: used to create the video clip
I eventually added a few useful functions to upload files to a GCP bucket, delete locally downloaded images and, more importantly, get an audio file from a YouTube link, which makes the module easier to use.
And finally here is the complete process to create a video:
- Use a YouTube link => extract the audio file and upload it to the bucket
- Extract the lyrics and timestamps with Google Speech-to-Text
- Iterate over the words to get images from Google Images
- Create an individual temporary image clip for each word
- Concatenate the clips and add the music
- Upload the video to the bucket
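The image-clip steps above can be sketched as below. `word_durations` and `make_clip` are hypothetical helpers, not the module's actual API; each word's image stays on screen until the next word starts, and moviepy is imported lazily so the duration logic runs on its own:

```python
def word_durations(timestamps: list[dict], audio_length: float) -> list[float]:
    """Screen time per word: from its start_time to the next word's
    start_time (the last word runs to the end of the audio)."""
    starts = [t["start_time"] for t in timestamps]
    ends = starts[1:] + [audio_length]
    return [end - start for start, end in zip(starts, ends)]

def make_clip(timestamps, image_paths, audio_path, out_path):
    # Deferred import: pip install moviepy (v1.x API)
    from moviepy.editor import AudioFileClip, ImageClip, concatenate_videoclips

    audio = AudioFileClip(audio_path)
    durations = word_durations(timestamps, audio.duration)
    # One temporary image clip per word, sized to its screen time.
    clips = [ImageClip(path).set_duration(d)
             for path, d in zip(image_paths, durations)]
    video = concatenate_videoclips(clips, method="compose")
    video.set_audio(audio).write_videofile(out_path, fps=24)
```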
Example of usage
Full code: example usage on GitHub
Result 📼
The results are pretty convincing and quite funny 😄.
It's sometimes hard to see why some images have been selected. It's important to understand that the program first looks for high-resolution images and takes the first result. Also, here are some params I used:
```python
_search_params = {
    "fileType": "jpg",
    "safe": "medium",
    "imgColorType": "color",
}
```
So those specific parameters sometimes produce results that are interesting, to say the least.
There are a lot of things to take into account in order not to hurt people. The `safe` parameter does a good job at keeping the images "friendly", but sometimes it's more delicate: a few times, for the word "down", the program would pick a picture of someone with Down syndrome, and for the word "Turkey" it would show the recent earthquake that happened in the country.
There are a lot of political and moral aspects to consider, and the program is not capable of taking a critical point of view on them.
Improvement 🚀
I'm pretty happy with the result, but it's not perfect by any means.
The AI speech-to-text struggles a lot with anything other than rap music, which makes sense considering how the artists sing. The first improvement would be to find a different way to get the lyrics and timestamps from the audio. With the recent progress in the AI world, I wouldn't be surprised if that became possible soon (or if someone has the solution in the comments lol).
The processing time is very long (about 4-5 minutes for a 20-30 second clip), and the longest part is the audio recognition.
And finally, as I said before, we would need a way to prevent any hurtful pictures from appearing in the video clips.