Part 5 - Video Compiler


In the previous parts, we generated and processed all the story content: text, images, and audio. In this part, we will use royalty-free background music from Pixabay and a Python library called moviepy to compile the final video.

High Level Overview

What we have now is a list of image files, each representing a page, and a list of audio files, each corresponding to a page. We want to go from that to a compiled video. To do so, we first have to picture the video in our head and then figure out how to implement it using moviepy.

[Figure: compiled video diagram]

As you may already know, a video is a sequence of frames stitched together, plus audio. So, what we need to do is set each page image as the current frame of the video and that page's audio as the current audio. Once the audio is done (the narrator finishes reading the sentence), we move on to the next frame and the next audio, and so on.

After we finish with all frames and their audio, we blend in the background music with the full audio.

Image Clips

moviepy offers a simple class called ImageClip. Given an image, we can create a video clip that shows this image for a given number of seconds. We want that duration to equal the audio length of the page. We can also add an AUDIO_GAP of 0.5 seconds so that there's a short pause between pages (you can play with that value to make the gap longer or shorter).

The code below shows an example of creating an image clip called page_clip, which is then added to a list of page_clips.

page_clip = mpy.ImageClip(page.page_filepath).set_duration( + self._AUDIO_GAP
We can then concatenate these image clips (page_clips) into a single clip using concatenate_videoclips with method="compose".

clip = mpy.concatenate_videoclips(page_clips, method="compose")
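As a rough illustration of the two steps above, here is a minimal sketch that assumes moviepy 1.x (where ImageClip has set_duration) and that the audio length of every page is already known in seconds. The names page_durations and build_video_track, and the module-level AUDIO_GAP constant, are made up for this example and are not part of the original class.

```python
from typing import List, Sequence

AUDIO_GAP = 0.5  # seconds of pause after each page (the value suggested above)


def page_durations(audio_lengths: Sequence[float], gap: float = AUDIO_GAP) -> List[float]:
    """Return how long each page image should stay on screen."""
    return [length + gap for length in audio_lengths]


def build_video_track(image_paths: Sequence[str], audio_lengths: Sequence[float]):
    """Turn page images into one silent video clip (assumes moviepy 1.x is installed)."""
    # Imported lazily so that page_durations() can be used without moviepy.
    import moviepy.editor as mpy

    page_clips = [
        mpy.ImageClip(path).set_duration(duration)
        for path, duration in zip(image_paths, page_durations(audio_lengths))
    ]
    return mpy.concatenate_videoclips(page_clips, method="compose")


print(page_durations([2.0, 3.5]))  # [2.5, 4.0]
```

Keeping the duration arithmetic in its own small function makes it easy to check the timing without rendering anything.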
Audio Clips

Similarly, we need to create a list of audio clips from our audio files. This time we don't need to set the duration; instead, we set when each audio clip should start. So, we start with a current_start pointer at 0.0 and move it forward as we go.

current_start = 0.0

# for each page
audio_clip = mpy.AudioFileClip(story.pages[i].audio.mp3_file).set_start(

# keep track of the current length
current_start += + self._AUDIO_GAP
We can then combine them using CompositeAudioClip and set the result as the audio of the concatenated clip from the previous section.

full_audio_clip = mpy.CompositeAudioClip(audio_clips)

# clip = mpy.concatenate_videoclips(page_clips, method="compose") = full_audio_clip
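The running current_start pointer can be sketched on its own, without touching any audio files. The snippet below is an illustrative sketch, assuming moviepy 1.x for the second function; start_times and build_narration_track are hypothetical names, and mp3_paths stands in for the per-page audio files.

```python
from typing import List, Sequence

AUDIO_GAP = 0.5  # seconds of pause after each page


def start_times(audio_lengths: Sequence[float], gap: float = AUDIO_GAP) -> List[float]:
    """Return the second at which each page's narration should begin."""
    starts = []
    current_start = 0.0
    for length in audio_lengths:
        starts.append(current_start)
        # The next page begins after this narration plus the pause.
        current_start += length + gap
    return starts


def build_narration_track(mp3_paths: Sequence[str]):
    """Place each narration file on a shared timeline (assumes moviepy 1.x)."""
    import moviepy.editor as mpy  # lazy import, only needed for real files

    clips = [mpy.AudioFileClip(path) for path in mp3_paths]
    starts = start_times([clip.duration for clip in clips])
    return mpy.CompositeAudioClip(
        [clip.set_start(start) for clip, start in zip(clips, starts)]
    )


print(start_times([2.0, 3.5, 1.0]))  # [0.0, 2.5, 6.5]
```

In moviepy 1.x, the resulting track can then be attached to the video with clip.set_audio(track).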
Background Music

I opted for a randomized approach that picks a random background music file from a predefined folder. The method below lists all music files under BACKGROUND_MUSIC_PATH and picks one of them at random.

import random
from os import listdir
from os.path import isfile, join
from typing import List

def _get_background_music_filename(self) -> str:
    """Returns: A path to an mp3 file to be used as background music."""
    # Keep only regular files with the expected music extension
    music_files: List[str] = [
        join(self._BACKGROUND_MUSIC_PATH, filename)
        for filename in listdir(self._BACKGROUND_MUSIC_PATH)
        if isfile(join(self._BACKGROUND_MUSIC_PATH, filename))
        and filename.endswith(self._BACKGROUND_MUSIC_EXT)
    ]
    return random.choice(music_files)
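One thing to watch for is that random.choice raises an IndexError when the folder contains no matching files. The standalone variant below is only a sketch of one way to handle that, using pathlib; the function name pick_background_music and its parameters (folder, ext, rng) are assumptions made for this example.

```python
import random
from pathlib import Path
from typing import Optional


def pick_background_music(
    folder: str, ext: str = ".mp3", rng: Optional[random.Random] = None
) -> str:
    """Return the path of a randomly chosen music file inside `folder`."""
    candidates = sorted(
        path
        for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() == ext
    )
    if not candidates:
        # Fail with a clear message instead of random.choice's IndexError.
        raise FileNotFoundError(f"No {ext} files found in {folder}")
    chooser = rng if rng is not None else random
    return str(chooser.choice(candidates))
```

Passing a seeded random.Random instance as rng makes the choice reproducible, which can help when regenerating the same video.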
To add that background music, we first need to read the music file as an AudioFileClip. Then, we reduce its volume and combine it with the existing audio.

# read background music file
background_music = mpy.AudioFileClip(background_music_filepath).fx(

# loop the music so that it covers the whole video
background_music = afx.audio_loop(
    background_music, duration=video_clip.duration
)
# combine it with existing audio = mpy.CompositeAudioClip([, background_music])
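To build some intuition for what lowering the volume, looping, and compositing do to the sound, here is a toy sketch that works on plain lists of samples instead of real audio. It does not use moviepy at all; loop_to_length, mix, and the music_volume value of 0.25 are illustrative assumptions, and the real library works on arrays of frames rather than Python lists.

```python
from itertools import cycle, islice
from typing import List, Sequence


def loop_to_length(samples: Sequence[float], length: int) -> List[float]:
    """Repeat `samples` from the start until exactly `length` samples exist."""
    if not samples:
        raise ValueError("cannot loop an empty track")
    return list(islice(cycle(samples), length))


def mix(
    narration: Sequence[float], music: Sequence[float], music_volume: float = 0.25
) -> List[float]:
    """Add quieter, looped music underneath the narration, sample by sample."""
    looped = loop_to_length(music, len(narration))
    return [voice + music_volume * note for voice, note in zip(narration, looped)]


narration = [0.0, 0.5, -0.5, 0.25, 0.0]
music = [1.0, -1.0]  # shorter than the narration, so it has to loop
print(mix(narration, music))  # [0.25, 0.25, -0.25, 0.0, 0.25]
```

The music is scaled down before it is added, which is why the narration stays clearly audible on top of it.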
Finally, we save the video by calling clip.write_videofile.

clip_filepath = os.path.join(workdir, self._FILENAME)
clip.write_videofile(clip_filepath, fps=self._FPS)
Bringing it all together

This is the cherry on the cake. Given a Story object, this method returns the filepath of the compiled video.

def generate_video(self, workdir: str, story: Story) -> str:
    """Create a video for the given story.

    Args:
        workdir: The root workdir for the story to generate video for
        story: The story object that contains all details about the story

    Returns: A local filepath for where the created video is stored
    """
    page_clips = []
    audio_clips = []
    current_start = 0.0

    # Build image and audio clips
    for i in range(len(story.pages)):
        page = story.pages[i]
        page_clip = mpy.ImageClip(page.page_filepath).set_duration(
   + self._AUDIO_GAP
        audio_clip = mpy.AudioFileClip(story.pages[i].audio.mp3_file).set_start(
        # keep track of the current length
        current_start += + self._AUDIO_GAP

    # Combine image and audio clips
    clip = mpy.concatenate_videoclips(page_clips, method="compose")
    full_audio_clip = mpy.CompositeAudioClip(audio_clips)
        os.path.join(workdir, f"final_audio.wav"), fps=44100
    ) = full_audio_clip

    # Add background music and save
    clip_filepath = os.path.join(workdir, self._FILENAME)
    clip.write_videofile(clip_filepath, fps=self._FPS)
    return clip_filepath
In this five-part series, we went step by step through building a fully fledged, automated AI visual story generator. We covered text and image generation, text processing, image processing, text-to-speech, and video processing.

Thanks for reading :)

