Creating Automatic Subtitles for Videos with Python, Faster-Whisper, FFmpeg, Streamlit, and Pillow

Hi friends, today I bring you a tutorial on how to automatically generate subtitles for your videos with Python. I know many programs already do this, but I think it's important to learn to build this kind of thing almost from scratch: it boosts our creativity and knowledge and makes us think of a world of possibilities and applications.

To do this we'll use Python as the main language, together with the following libraries and tools:

Faster-Whisper (https://github.com/SYSTRAN/faster-whisper)
Streamlit (https://streamlit.io/)
Pillow (https://python-pillow.org/)
FFmpeg (https://ffmpeg.org/)

You can find the complete implementation on my GitHub account, in the following repository:

https://github.com/emmendoza2794/automatic-subs-GUI

This project is divided into 4 parts:

The graphical interface, built with Streamlit
The generation of the subtitles from the video's audio, done with Faster-Whisper
The creation of transparent-background images containing the subtitle text and the chosen settings, done with Pillow
The final composition of the original video with the subtitles at the correct times, done with FFmpeg

1. Graphical interface

For this project we'll use a simple graphical interface where we can configure the style and position of the subtitles, see a preview of how they will look, and view the final result.

It will look something like this:

Our Python code would be this:

import os
import shutil
import subprocess

import streamlit as st

if 'text_preview_path' not in st.session_state:
    st.session_state.text_preview_path = None

if 'final_video' not in st.session_state:
    st.session_state.final_video = None


def generate_video_subs():
    ...


def generate_text_preview():
    ...


st.set_page_config(
    page_title="Automatic Subs GUI",
    page_icon="👋",
    layout="wide",
    initial_sidebar_state="expanded",
)

st.markdown(
    unsafe_allow_html=True,
    body="""
    <style>
           img {
               max-height: 700px;         
           }
           .block-container {
                padding-top: 2rem;
           }
           video {
               max-height:700px;
           }
    </style>
    """
)

st.header('Automatic Subtitle Generator', divider='rainbow')

col1, col2, col3 = st.columns([0.25, 0.375, 0.375])

with col1:
    with st.container(border=True):
        st.subheader("Video file")

        st.file_uploader(
            label="Video file",
            key="video_file",
            label_visibility="collapsed",
            type=['mp4']
        )

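        if st.session_state.video_file is not None:
            # Make sure the working directory exists before saving the upload
            os.makedirs('temp', exist_ok=True)
            with open('temp/video.mp4', 'wb') as f:
                f.write(st.session_state.video_file.getvalue())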

    with st.container(border=True):
        st.subheader("Subtitle settings")

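        # Utils is a helper class from the project repository (its import is
        # not shown in this excerpt); list_fonts() supplies the font options.
        st.selectbox(
            label='Font',
            key="font",
            options=Utils().list_fonts()
        )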

        st.slider(
            label="Font size",
            min_value=30,
            max_value=90,
            value=70,
            key="font_size",
            step=1
        )

        col_color_1, col_color_2 = st.columns([0.5, 0.5])

        with col_color_1:
            st.color_picker(
                label="Text color",
                key="color_text",
                value="#ffffff"
            )

        with col_color_2:
            st.color_picker(
                label="Border color",
                key="color_border",
                value="#000"
            )

        st.radio(
            label="Text position",
            key="text_position",
            index=2,
            options=["Top", "Center", "Bottom"],
            horizontal=True,
        )

        st.radio(
            label="Subtitle length",
            key="subtitle_length",
            index=0,
            options=["Automatic", "Custom"],
            horizontal=True,
        )

        st.slider(
            label="Minimum characters",
            min_value=10,
            max_value=150,
            value=70,
            key="min_chars",
            step=1,
            disabled=st.session_state.subtitle_length == "Automatic"
        )

    st.button(
        label="Generate video",
        type="primary",
        on_click=generate_video_subs,
        use_container_width=True,
    )

with col2:
    with st.container(border=True):
        st.subheader("Subtitle preview")

        generate_text_preview()

        if st.session_state.text_preview_path is not None:
            st.image(st.session_state.text_preview_path)

with col3:
    with st.container(border=True):
        st.subheader("Final video")

        if st.session_state.final_video is not None:
            st.video(st.session_state.final_video)

            with open(st.session_state.final_video, "rb") as file:
                st.download_button(
                    type="primary",
                    label="Download video",
                    data=file,
                    file_name="final_video.mp4",
                    mime="video/mp4"
                )

2. Generating subtitles from the video's audio

This is where we use the Faster-Whisper library, a reimplementation of OpenAI's Whisper model using CTranslate2. According to the project, it is up to 4 times faster than openai/whisper at the same accuracy while using less memory.

Basically, we pass it the video's audio and it automatically recognizes the language and returns the text of the audio along with its timestamps on the video's timeline.

To do this we first need to extract the audio from the video, which is very simple with FFmpeg (this is not a Python library; you must have it installed on your PC: https://ffmpeg.org/download.html) and easy to call from Python.

Code to extract the audio from a video with Python and FFmpeg:

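import subprocess

# FFmpeg must be installed and available on the PATH
ffmpeg_command = [
    "ffmpeg",
    "-y",  # overwrite the output file if it already exists
    "-loglevel", "warning",
    "-i", "temp/video.mp4",
    "temp/audio.mp3",
]

subprocess.run(ffmpeg_command)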
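
As a slightly more defensive variant of the extraction step, here is a minimal sketch that checks FFmpeg is installed and raises an error if the command fails. The helper names build_audio_command and extract_audio are made up for this illustration and are not part of the repository; -vn is the standard FFmpeg option that drops the video stream.

```python
import shutil
import subprocess


def build_audio_command(video_path: str, audio_path: str) -> list:
    """Build the FFmpeg argument list (hypothetical helper, not from the repo)."""
    return [
        "ffmpeg",
        "-y",                    # overwrite the output if it already exists
        "-loglevel", "warning",
        "-i", video_path,
        "-vn",                   # ignore the video stream, keep only audio
        audio_path,
    ]


def extract_audio(video_path: str, audio_path: str) -> None:
    """Run FFmpeg and fail loudly if it is missing or returns an error."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg was not found on the PATH")
    # check=True raises CalledProcessError on a non-zero exit code
    subprocess.run(build_audio_command(video_path, audio_path), check=True)


# Example call (requires FFmpeg and an existing temp/video.mp4):
# extract_audio("temp/video.mp4", "temp/audio.mp3")
```

Keeping the command builder separate from the function that runs it makes the argument list easy to inspect or log before FFmpeg is ever invoked.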

Now that we have the video's audio, we can finally use Faster-Whisper.

IMPORTANT: all the code I'm showing here is part of the repository you can see on GitHub; you can review or clone it at any time to better understand how it works.

We do this with the following code:

from faster_whisper import WhisperModel
import streamlit as st


@st.cache_resource
def load_model():
    return WhisperModel(
        model_size_or_path="large-v3",
        device="cuda",
        compute_type="float16"
    )


class GenerateSubtitles:

    def __init__(self):
        self.model = None

    def generate_automatic(self):

        model = load_model()

        segments, info = model.transcribe(
            audio="temp/audio.mp3",
            beam_size=5,
        )

        data_segments = []

        for segment in segments:
            data_segments.append({
                'start': segment.start,
                'end': segment.end,
                'text': segment.text
            })

        return data_segments

    def generate_custom(self, len_segment: int):

        model = load_model()

        segments, info = model.transcribe(
            audio="temp/audio.mp3",
            beam_size=5,
            word_timestamps=True
        )

        data_words = []

        for segment in segments:
            for word in segment.words:
                data_words.append({
                    "start": word.start,
                    "end": word.end,
                    "word": word.word
                })

        text = ""
        pos_start = 0

        data_word_expanded = []

        for index, data in enumerate(data_words):

            if text == "":
                pos_start = data["start"]

            text += data["word"]

            if len(text) >= len_segment or index == len(data_words) - 1:
                data_word_expanded.append({
                    'start': pos_start,
                    'end': data["end"],
                    'text': text.strip()
                })

                text = ""

        return data_word_expanded

As you can see, we have 2 methods: one that returns the subtitles exactly as Faster-Whisper provides them, and another that takes the len_segment parameter so that each subtitle segment reaches a custom minimum length.
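
To see the grouping logic of generate_custom in isolation, here is a small sketch that runs the same loop on hand-written word data, without loading any model. The function name group_words and the sample words are invented for this illustration; the leading spaces in each word imitate how Faster-Whisper returns word text.

```python
def group_words(words: list, min_chars: int) -> list:
    """Join consecutive words until the text reaches min_chars characters."""
    segments = []
    text = ""
    start = 0.0

    for index, word in enumerate(words):
        if text == "":
            # A new segment starts at the first word it contains
            start = word["start"]

        text += word["word"]

        # Close the segment when it is long enough or the words run out
        if len(text) >= min_chars or index == len(words) - 1:
            segments.append({
                "start": start,
                "end": word["end"],
                "text": text.strip(),
            })
            text = ""

    return segments


# Made-up sample data, shaped like the data_words list in generate_custom
sample_words = [
    {"start": 0.0, "end": 0.4, "word": " Hello"},
    {"start": 0.4, "end": 0.9, "word": " world"},
    {"start": 1.0, "end": 1.3, "word": " this"},
    {"start": 1.3, "end": 1.5, "word": " is"},
    {"start": 1.5, "end": 2.0, "word": " Python"},
]

for segment in group_words(sample_words, min_chars=10):
    print(segment)
# {'start': 0.0, 'end': 0.9, 'text': 'Hello world'}
# {'start': 1.0, 'end': 2.0, 'text': 'this is Python'}
```

Note that the length check counts the leading spaces too, so the value acts as a minimum: a segment is only closed after the word that pushes it past the threshold.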

3. Creating the .png images with the subtitles and settings

To have more room for customization when creating the subtitles, I chose to temporarily create one image per subtitle, so I can use any font, change colors, sizes, etc. For this I used the Python library Pillow.

This is the code:

import os
import subprocess

from PIL import Image, ImageDraw, ImageFont


class GenerateImages:

    def __init__(self):
        self.draw = None

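    def _wrap_text(self, text, font, max_width):
        # Greedy word wrap: keep adding words while the line fits in max_width

        lines = []
        words = text.split(" ")
        temp_line = ""
        for word in words:
            # Measure the line as it will be drawn, including the space
            candidate = f"{temp_line} {word}" if temp_line else word
            if self.draw.textbbox((0, 0), candidate, font=font)[2] <= max_width:
                temp_line = candidate
            else:
                # Avoid emitting an empty line when a single word is too wide
                if temp_line:
                    lines.append(temp_line)
                temp_line = word
        lines.append(temp_line)

        return "\n".join(lines)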

    def dimensions_video(self):
        ffmpeg_command = [
            'ffmpeg',
            '-loglevel', 'warning',
            '-i', 'temp/video.mp4',
            '-ss', '00:00:3',
            '-vframes', '1',
            'temp/video_preview.jpg',
            '-y'
        ]

        subprocess.run(ffmpeg_command)

        img = Image.open("temp/video_preview.jpg")

        width, height = img.size

        return width, height

    def multi_line_img(
            self,
            text: str,
            font: str,
            font_size: int,
            text_color: str,
            border_color: str,
            text_position: str,
            name: str,
            width: int,
            height: int,
    ):

        image = Image.new("RGBA", (width, height), (255, 255, 255, 0))
        self.draw = ImageDraw.Draw(image)

        # Horizontal margin based on the width, matching preview_text
        margin = width * 0.05

        font = ImageFont.truetype(f"assets/fonts/{font}.ttf", font_size)

        lines_text = self._wrap_text(text, font, width - 2 * margin)

        text_box = self.draw.textbbox(
            xy=(0, 0),
            text=lines_text,
            font=font,
            spacing=int(font_size / 10),
            align="center",
            stroke_width=int(font_size / 15)

        )

        if text_position == "Top":
            position_x = height * 0.1

        if text_position == "Center":
            position_x = (height - (text_box[3] * 1.5)) // 2

        if text_position == "Bottom":
            position_x = (height * 0.8) - (text_box[3] // 2)

        self.draw.multiline_text(
            xy=((width-text_box[2]) // 2, position_x),
            text=lines_text,
            font=font,
            fill=text_color,
            spacing=int(font_size / 10),
            align="center",
            stroke_fill=border_color,
            stroke_width=int(font_size / 15)
        )

        name = f"temp/subs/{name}.png"

        image.save(name)

        return name

    def preview_text(
            self,
            uploaded_video: bool,
            text: str,
            font: str,
            font_size: int,
            text_color: str,
            border_color: str,
            text_position: str,
    ):

        if uploaded_video and os.path.exists('temp/video.mp4'):
            width, height = self.dimensions_video()
            image = Image.open("temp/video_preview.jpg")

        else:
            width = 1920
            height = 1080
            image = Image.new("RGB", (width, height), (234, 234, 234))

        margin = width * 0.05

        self.draw = ImageDraw.Draw(image)

        font = ImageFont.truetype(f"assets/fonts/{font}.ttf", font_size)

        lines_text = self._wrap_text(text, font, width - 2 * margin)

        text_box = self.draw.textbbox(
            xy=(0, 0),
            text=lines_text,
            font=font,
            spacing=int(font_size / 10),
            align="center",
            stroke_width=int(font_size / 15)

        )

        if text_position == "Top":
            position_x = height * 0.1

        if text_position == "Center":
            position_x = (height - (text_box[3] * 1.5)) // 2

        if text_position == "Bottom":
            position_x = (height * 0.8) - (text_box[3] // 2)

        self.draw.multiline_text(
            xy=((width - text_box[2]) // 2, position_x),
            text=lines_text,
            font=font,
            fill=text_color,
            spacing=int(font_size / 10),
            align="center",
            stroke_fill=border_color,
            stroke_width=int(font_size / 15)
        )

        image.save("temp/text_preview.jpg")

In the code above we create a preview of how the text would look on the video. To do that we get the size of the current video and work everything out based on that size. The multi_line_img function generates the temporary image for whichever subtitle we pass it.

4. Final composition of the original video with the subtitles

Now we have the subtitles with the times where they should appear and the .png images for those subtitles. All that's left is to overlay these images on the original video at the indicated times, which we do with the following code:
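
The vertical placement used by both multi_line_img and preview_text can be pulled out into a small function to make the arithmetic easier to follow. This is only a sketch: the name subtitle_y is invented here, the formulas are copied from the code above (where the value is stored in a variable called position_x even though it is a vertical coordinate), and the ValueError for unknown positions is an addition of this example.

```python
def subtitle_y(text_position: str, frame_height: int, text_height: int) -> float:
    """Return the top y coordinate of the text block for a given position."""
    if text_position == "Top":
        # 10% down from the top edge
        return frame_height * 0.1
    if text_position == "Center":
        # Roughly centered; the 1.5 factor shifts the block slightly upwards
        return (frame_height - text_height * 1.5) // 2
    if text_position == "Bottom":
        # Centered around the line at 80% of the frame height
        return frame_height * 0.8 - text_height // 2
    raise ValueError(f"Unknown text position: {text_position!r}")


# For a 1080-pixel-tall frame and a text block 100 pixels tall
for position in ("Top", "Center", "Bottom"):
    print(position, subtitle_y(position, frame_height=1080, text_height=100))
```

Raising an error for an unexpected value avoids the situation in the original methods where none of the three if branches runs and the coordinate is never assigned.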

import subprocess

LOGLEVEL = "warning"


class GenerateVideo:

    def generate(self, subtitles: list):

        video_final = [
            "ffmpeg",
            "-loglevel", LOGLEVEL,
            "-i", "temp/video.mp4",
            '-y'
        ]

        overlays = []

        for index, subtitle in enumerate(subtitles):
            if index == 0:
                overlays.append(
                    f"[0:v][1:v]"
                    f"overlay=0:0:enable='between(t,{subtitle['start']},{subtitle['end']})'"
                    f"[v1]"
                )
            else:
                overlays.append(
                    f"[v{index}][{index + 1}:v]"
                    f"overlay=0:0:enable='between(t,{subtitle['start']},{subtitle['end']})'"
                    f"[v{index + 1}]"
                )

            video_final.extend(["-i", subtitle["img_path"]])

        video_final += [
            "-filter_complex",
            ";".join(overlays),
            "-map", "0:a",
            '-c:a', 'copy',
            "-map",
            f"[v{len(subtitles)}]",
            "output/final_video.mp4",
            "-y"
        ]

        subprocess.run(video_final)

        return True

Here we rely entirely on FFmpeg, whose overlay filter superimposes each image during the time range we specify.

Conclusions

We know this is a simple implementation, but it's gratifying to see how many applications it has, even more so with the current AI boom. We could, for example, automate video creation, transcribe videos automatically, or edit videos in bulk. In short, the possibilities are endless.
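
To make the chained overlays easier to picture, the sketch below builds only the -filter_complex string for two made-up subtitles and prints it. The function name build_overlay_filter is invented for this example; the labels follow the same scheme as the generate method, where input 0 is the video and inputs 1, 2, ... are the subtitle images.

```python
def build_overlay_filter(subtitles: list) -> str:
    """Chain one overlay per subtitle: [0:v] -> [v1] -> [v2] -> ..."""
    overlays = []

    for index, subtitle in enumerate(subtitles):
        # The first overlay reads the original video; later ones read the
        # output label of the previous overlay.
        source = "[0:v]" if index == 0 else f"[v{index}]"
        overlays.append(
            f"{source}[{index + 1}:v]"
            f"overlay=0:0:enable='between(t,{subtitle['start']},{subtitle['end']})'"
            f"[v{index + 1}]"
        )

    return ";".join(overlays)


# Made-up timings for illustration
sample_subtitles = [
    {"start": 0.0, "end": 1.5},
    {"start": 1.5, "end": 3.0},
]

print(build_overlay_filter(sample_subtitles))
# [0:v][1:v]overlay=0:0:enable='between(t,0.0,1.5)'[v1];[v1][2:v]overlay=0:0:enable='between(t,1.5,3.0)'[v2]
```

Each image is overlaid on the whole frame at position 0:0, and the enable expression makes it visible only while the timestamp t is between the subtitle's start and end. The last label, [v2] here, is the one mapped to the output file.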