longtk26

Subtitle to Audio Tutorial

Hi everyone! In this tutorial, we will explore how to convert subtitles into audio using text-to-speech (TTS) technology. This process can be particularly useful for creating audiobooks, podcasts, or enhancing accessibility for video content.

Introduction

  • When working with applications that involve video content and translation, it is often necessary to translate an original subtitle file (e.g., SRT format) into another language.
  • To enhance user experience, we can convert these translated subtitles into audio using TTS technology.
  • This tutorial will guide you through the process of converting subtitles to audio, addressing common challenges and providing solutions.

Workflow Overview

  • Converting a subtitle file to audio is a job that requires several steps to complete, which is why we handle it asynchronously.
  • We chose Lambda functions to save cost and keep the implementation simple.
  • The workflow consists of the following steps:
    1. Choose a subtitle file from the current system: assume our system already shows a list of subtitle files the user can choose from.
    2. Choose voice settings: the user can pick voice settings such as a male or female voice.
    3. Request to convert the subtitle to audio: once the user has chosen a subtitle file and voice settings, we send a request to the backend to start the conversion process.
    4. Server requests the subtitle file from S3: in a real project we usually store files in S3, so our server first needs to get the subtitle file URL from S3.
    5. Server invokes the Lambda function: after getting the subtitle file URL, the server sends a request to the Lambda function to start converting the subtitle to audio.
    6. Lambda function processes the subtitle file: it downloads the subtitle file from S3, parses it to extract text and timing information, uses Google Text-to-Speech to convert the text to audio, then combines the audio segments according to the timing information and saves the final audio file to S3.
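The parsing step in point 6 needs no external library. Below is a minimal sketch of an SRT parser; the helper names (`parse_srt`, `to_ms`) are illustrative, not the exact implementation used in the project:

```python
import re

def to_ms(timestamp):
    """Convert an SRT timestamp 'HH:MM:SS,mmm' to milliseconds."""
    hms, ms = timestamp.split(",")
    h, m, s = hms.split(":")
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)

def parse_srt(srt_text):
    """Parse SRT content into a list of (start_ms, end_ms, text) entries."""
    timing = re.compile(r"(\d+:\d{2}:\d{2},\d{3})\s*-->\s*(\d+:\d{2}:\d{2},\d{3})")
    entries = []
    # SRT blocks are separated by blank lines: index, timing line, text lines
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.strip().splitlines()
        for i, line in enumerate(lines):
            m = timing.search(line)
            if m:
                start, end = (to_ms(t) for t in m.groups())
                text = " ".join(lines[i + 1:]).strip()
                entries.append((start, end, text))
                break
    return entries
```

The timing information returned here drives everything later in the pipeline: speech speed, silence padding, and the pre-calculated total PCM size.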

Problems with workflow

1. Limited execution time of lambda functions

  • AWS Lambda functions have a maximum execution time limit (currently 15 minutes). If the subtitle file is large or the conversion process takes too long, the lambda function may time out before completing the task.

2. Limited resources within lambda functions

  • Lambda functions have limited CPU and memory resources. If the TTS engine needs more resources, or the audio grows too large during processing, the conversion may fail with an out-of-memory error.

3. Synchronizing timing and speed between subtitles and audio

  • Ensuring that the audio segments align perfectly with the timing information in the subtitle file can be challenging. Any discrepancies can lead to audio that is out of sync with the original video content.

4. Popping noises between audio segments

  • When combining multiple audio segments generated from subtitles, popping noises may occur at the boundaries between segments.

Solutions

This section presents solutions to the problems above. With them, we successfully converted a subtitle file of more than 30 minutes into audio in just a few minutes, using minimal resources: the default Lambda settings (128 MB memory) and the 15-minute timeout.

Algorithm diagram

Link algorithm diagram: Image

The flowchart above illustrates the complete subtitle-to-audio conversion algorithm with optimized memory management and streaming architecture. Here's how it works:

1. Initialization Phase

  • The process begins by downloading and parsing the SRT subtitle file from the provided URL
  • It pre-calculates the total PCM (Pulse Code Modulation) audio size based on subtitle duration and timing information
  • An S3 Multipart Upload is initialized to enable streaming upload without holding the entire file in memory
  • A WAV header is created and added to the part buffer, initializing tracking variables (part_number=1, content_id index=0, current_part=b"")
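Since the total PCM size is pre-calculated, the 44-byte WAV header can be built up front with Python's built-in `struct` module. The sketch below assumes 16-bit mono PCM at 24 kHz (a common LINEAR16 output rate for Google Cloud TTS); adjust the parameters to match your voice settings:

```python
import struct

def wav_header(pcm_size, sample_rate=24000, channels=1, bits=16):
    """Build a standard 44-byte RIFF/WAV header for raw PCM data
    whose total size in bytes (pcm_size) is known in advance."""
    byte_rate = sample_rate * channels * bits // 8
    block_align = channels * bits // 8
    return (
        b"RIFF" + struct.pack("<I", 36 + pcm_size) + b"WAVE"
        + b"fmt " + struct.pack("<IHHIIHH", 16, 1, channels,
                                sample_rate, byte_rate, block_align, bits)
        + b"data" + struct.pack("<I", pcm_size)
    )
```

Writing the header first is what makes streaming possible: every byte after it is raw PCM, so parts can be uploaded as soon as they are generated.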

2. Batch Processing Loop

  • Instead of processing all subtitles at once, the algorithm processes them in batches of approximately 15 subtitles
  • For each batch, it retrieves the next set of subtitle entries to process
  • This batching approach prevents memory overflow by keeping only a small subset of subtitles in memory at any time
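The batching itself can be as simple as slicing the parsed subtitle list. A minimal illustrative generator (the batch size of 15 comes from the algorithm above):

```python
def batches(items, size=15):
    """Yield successive slices of `items`, `size` entries at a time."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

Because each batch is a small slice, only that slice's audio ever lives in memory at once.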

3. Audio Generation for Each Subtitle

  • For each subtitle in the current batch:
    • Calculate speech speed: Determines the appropriate speaking rate to fit the subtitle text within its designated time window
    • Generate audio using Google Cloud TTS: Converts the subtitle text to audio using the selected voice settings
    • Apply fade effects: Adds 250ms fade-in and fade-out to prevent popping noises between audio segments
    • Calculate and add silence padding: Adds appropriate silence gaps to ensure precise alignment with subtitle timing
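All three per-subtitle steps can be done on raw 16-bit PCM bytes with the standard library alone. The sketch below is illustrative: the 15 characters-per-second figure in `speaking_rate` is an assumed average speaking pace, and the 0.25–4.0 clamp matches the range Google Cloud TTS accepts for its speaking-rate parameter:

```python
import array

SAMPLE_RATE = 24000   # assumed TTS output rate, 16-bit mono PCM

def speaking_rate(text, slot_ms, chars_per_sec=15.0):
    """Estimate a speaking rate so the audio fits its subtitle slot.
    chars_per_sec is an assumed average pace, not a measured value."""
    natural_ms = len(text) / chars_per_sec * 1000
    return max(0.25, min(4.0, natural_ms / slot_ms))

def apply_fades(pcm_bytes, fade_ms=250):
    """Linearly ramp the first and last fade_ms of a 16-bit mono PCM
    segment to avoid pops at segment boundaries. Uses native byte
    order, which matches WAV's little-endian on Lambda's CPUs."""
    samples = array.array("h", pcm_bytes)
    n = min(int(SAMPLE_RATE * fade_ms / 1000), len(samples) // 2)
    for i in range(n):
        gain = i / n
        samples[i] = int(samples[i] * gain)            # fade-in
        samples[-1 - i] = int(samples[-1 - i] * gain)  # fade-out
    return samples.tobytes()

def silence(ms):
    """Raw 16-bit mono silence of the given duration, used as padding
    between segments to keep alignment with subtitle timings."""
    return b"\x00\x00" * int(SAMPLE_RATE * ms / 1000)
```

Since each segment starts and ends at zero amplitude after the fades, concatenating segment + silence + segment produces no discontinuity in the waveform.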

4. Streaming Upload Mechanism

  • As audio segments are generated, PCM data is appended to the part buffer
  • When the buffer reaches 5MB or when processing the last batch, the current part is uploaded to S3 immediately
  • After each upload, the buffer is cleared and the part_number is incremented
  • This streaming approach ensures that Lambda memory usage remains constant (around 5-10MB) regardless of the total audio file size
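The buffering logic can be separated from the S3 calls, which keeps it easy to test. In the sketch below, `upload_fn` stands in for a wrapper around boto3's `upload_part` call (with `create_multipart_upload` before the loop and `complete_multipart_upload` after the final flush); the class itself is illustrative, not the exact implementation:

```python
class PartBuffer:
    """Accumulate PCM bytes and flush them as S3 multipart parts.
    S3 requires every part except the last to be at least 5 MB."""
    MIN_PART = 5 * 1024 * 1024

    def __init__(self, upload_fn):
        self.upload_fn = upload_fn  # e.g. wraps s3.upload_part(...)
        self.buffer = b""           # current_part in the diagram
        self.part_number = 1

    def append(self, chunk):
        """Add generated PCM data; upload as soon as we hit 5 MB."""
        self.buffer += chunk
        if len(self.buffer) >= self.MIN_PART:
            self.flush()

    def flush(self):
        """Upload whatever is buffered (also called for the last,
        possibly smaller, part) and reset for the next part."""
        if self.buffer:
            self.upload_fn(self.part_number, self.buffer)
            self.buffer = b""
            self.part_number += 1
```

In the real Lambda it is worth wrapping the loop in a try/except that calls `abort_multipart_upload` on failure, so S3 does not keep storing orphaned parts.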

5. Finalization Phase

  • Once all subtitle batches have been processed and all parts uploaded, the algorithm completes the S3 Multipart Upload
  • The final audio file is assembled from all uploaded parts and becomes available on S3
  • Error handling paths are included to manage upload failures or conversion issues (shown by the red error node)

Key Flow Control

  • The algorithm uses two main decision loops:
    • "More batches to process?": Controls the outer loop for batch processing
    • "More subtitles in batch?": Controls the inner loop for processing individual subtitles within a batch
  • "Buffer size >= 5MB OR last batch?": Triggers S3 part uploads at optimal intervals

Key Optimizations

  1. Batch Processing: Subtitles are processed in batches (default: 15 subtitles per batch) instead of loading all at once. This prevents memory overflow in Lambda.

  2. S3 Multipart Upload:

    • Initialize multipart upload at the start
    • Upload parts incrementally when buffer reaches 5MB
    • Each part is uploaded immediately and buffer is cleared
    • Complete multipart upload at the end
    • This approach ensures Lambda never holds the entire audio file in memory
  3. Streaming Architecture: Audio data flows directly from TTS → batch buffer → S3 parts, with minimal memory footprint at any given time.

  4. Fade Effects: 250ms fade-in/out applied to each audio segment to eliminate popping noises between segments.

  5. Memory Management: After each part upload, the buffer is reset, keeping memory usage constant regardless of total audio length.

  6. Avoid FFmpeg and audio processing libraries:

  • We should avoid using FFmpeg (both the binary and Python wrappers like pydub, ffmpeg-python) in Lambda functions because they significantly increase deployment package size.
  • FFmpeg binaries are typically 50-100MB, and audio processing libraries add another 20-50MB, easily exceeding Lambda's package size limits.
  • Instead, we should work directly with WAV file format using Python's built-in struct module and pure Python code to manipulate raw PCM audio bytes.
  • This approach keeps the deployment package minimal (<15MB) while maintaining full control over audio processing operations like fade-in/fade-out, silence padding, and concatenation.

Conclusion

  • Thank you for following this tutorial on converting subtitles to audio using TTS technology.
  • I hope this helps you implement the feature efficiently in your own projects. Feel free to take these ideas and use any AI tool to generate the implementation code.
  • This blog is based on my experience in a real project where we needed to save costs with Lambda functions. If you have any good ideas or suggestions, please feel free to share them with me!
  • You could find more on my blogs: Website
