In this tutorial, we'll explore how to automate this process using Python to build a script that adds CSS-styled, animated captions directly onto a video file.
We will use pycaps, an open-source library I built to handle the complexities of video processing, transcription, and rendering.
By the end, you'll have a script that can add subtitles like these:
Let's dive into the code.
Getting Started
First, you'll need a video file. For this tutorial, let's assume we have a file named my_video.mp4. The only prerequisite is to have pycaps installed. You can find the installation instructions in the project's README.
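If you want to confirm the install worked before going further, here's a minimal sanity check (it only assumes the package is importable as pycaps, as used throughout this tutorial):

# quick check that pycaps can be imported
try:
    import pycaps  # noqa: F401
    print("pycaps is installed")
except ImportError:
    print("pycaps not found, see the project's README for install steps")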
Let's start with a minimal script to add styled subtitles. Create a Python file named add_subs.py:
from pycaps import *

# Let's define our styles as a multi-line string (we could also use a .css file)
my_css_styles = """
/* styles for each word */
.word {
    font-family: system-ui;
    text-transform: uppercase;
    font-weight: 700;
    font-size: 18px;
    color: white;
    padding: 2px 3px;
    text-shadow:
        -1px -1px 0 black,
        1px -1px 0 black,
        -1px 1px 0 black,
        1px 1px 0 black;
}

/* styles for each word being spoken */
.word-being-narrated {
    background-color: blue;
    border-radius: 5%;
}
"""
# This is the core of our script. It defines the rendering pipeline.
builder = CapsPipelineBuilder()

# We tell it which video to process.
builder.with_input_video("my_video.mp4")

# We configure the CSS styles to use.
builder.add_css_content(my_css_styles)

# We limit the length (in characters) of each segment.
builder.add_segment_splitter(LimitByCharsSplitter(min_limit=10, max_limit=15))

# Build the pipeline from our configuration and run it.
pipeline = builder.build()
pipeline.run()

print("Subtitles added!")
When you run python add_subs.py, the library will automatically:
- Extract the audio from my_video.mp4.
- Transcribe it using Whisper to get word-by-word timestamps.
- Split the transcription into shorter segments using LimitByCharsSplitter (see more details about the structure in the docs).
- Render the words onto the video using the style provided.
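A couple of practical notes: the first run may take a while, since Whisper models are typically downloaded on first use. And if the resulting segments feel too short or too long for your layout, the splitter limits are the knob to turn. A sketch with looser limits (the 5/25 values are purely illustrative):

builder.add_segment_splitter(LimitByCharsSplitter(min_limit=5, max_limit=25))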
Results:
This is a good start, and you can already customize your styles to achieve different looks.
Customizing Your Styles
You have access to several predefined CSS classes.
First, there are word and line, which let you style each word or entire lines throughout the video.
Then, there are state-based classes depending on the timing of narration:
- For words: word-not-narrated-yet, word-being-narrated, word-already-narrated
- For lines: line-not-narrated-yet, line-being-narrated, line-already-narrated
For example, if you want to hide words that haven’t been narrated yet, you can write:
.word-not-narrated-yet {
    display: none;
}
If you add this style to our previous example, the result would look like this:
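These state classes compose freely with regular CSS. For instance, a small sketch that keeps already-spoken words visible but fades them out (the class name comes from the list above; the styling itself is plain CSS):

/* dim words after they've been spoken */
.word-already-narrated {
    opacity: 0.5;
}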
Bringing Words to Life with Animations and Tags
Styling is one half of the equation; animation is the other. We can select specific elements to be styled or animated using the built-in Tagging System.
Tags are labels automatically applied to elements based on their position (e.g., first-word-in-line) or content. We can use these tags to, for example, apply a different animation to the first word of each line.
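To make that concrete, here is a purely illustrative sketch of what targeting the first word of each line could look like. The identifiers ElementType.WORD and BuiltinTag.FIRST_WORD_IN_LINE are my assumptions based on the tag name above, so check the docs for the real names:

# hypothetical: animate only the first word of each line
# (ElementType.WORD and BuiltinTag.FIRST_WORD_IN_LINE are assumed names)
builder.add_animation(
    animation=ZoomIn(),
    when=EventType.ON_NARRATION_STARTS,
    what=ElementType.WORD,
    tag_condition=TagConditionFactory.HAS(BuiltinTag.FIRST_WORD_IN_LINE)
)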
In this guide, we use tags to change the animation based on the structure. This same system can be used with semantic tags to change styles based on what a word means. That's the exact technique I used in my post about content-aware subtitles.
Let's modify our script to:
- Make the first segment slide in
- Make the other segments zoom in
from pycaps import *

my_css_styles = """
/* styles for each word */
.word {
    font-family: system-ui;
    text-transform: uppercase;
    font-weight: 700;
    font-size: 18px;
    color: white;
    padding: 2px 3px;
    text-shadow:
        -1px -1px 0 black,
        1px -1px 0 black,
        -1px 1px 0 black,
        1px 1px 0 black;
}

/* styles for each word being spoken */
.word-being-narrated {
    background-color: blue;
    border-radius: 5%;
}
"""
builder = CapsPipelineBuilder()
builder.with_input_video("my_video.mp4")
builder.add_css_content(my_css_styles)
builder.add_segment_splitter(LimitByCharsSplitter(min_limit=10, max_limit=15))

# Add animations: the first segment slides in, the rest zoom in
builder.add_animation(
    animation=SlideIn(direction="down"),
    when=EventType.ON_NARRATION_STARTS,
    what=ElementType.SEGMENT,
    tag_condition=TagConditionFactory.HAS(BuiltinTag.FIRST_SEGMENT_IN_DOCUMENT)
)
builder.add_animation(
    animation=ZoomIn(),
    when=EventType.ON_NARRATION_STARTS,
    what=ElementType.SEGMENT,
    tag_condition=TagConditionFactory.NOT(BuiltinTag.FIRST_SEGMENT_IN_DOCUMENT)
)

# Build the pipeline from our configuration and run it.
pipeline = builder.build()
pipeline.run()

print("Subtitles added!")
This script now produces a highly dynamic result. The conditional logic via tags allows for very granular control over the final look and feel.
Results:
Final Words
We've walked through how to programmatically add and customize animated subtitles using Python. We started with a simple script and progressively added styling and animations.
And if you're ready to take this to the next level, you can apply these same principles with more advanced tags. Check out the tutorial on How to Create Content-Aware Animated Subtitles to see how we use AI to make subtitles that react to the meaning of the spoken words.
The project is fully open-source and I welcome you to check it out on GitHub, try the online Web UI demo, and see what you can create.