M. Abdullah Bin Aftab

Posted on Dec 6, 2024

Case Study: Creating an ETL Data Pipeline using AWS Services - Real-World Problem

#aws #dataengineering #data #cloudcomputing

Architecture

Overview

This ETL pipeline leverages several AWS services to fetch, process, and store YouTube videos with translated subtitles. The core components include:

AWS Lambda for processing video and audio
AWS Step Functions for orchestrating workflows
Amazon S3 buckets for storing raw and processed data
Amazon Transcribe for converting audio to text
Amazon Translate for translating the text into the desired languages

Pipeline Steps

Fetching Video from YouTube

The video is fetched from YouTube and sent to an initial Lambda function.
This function splits the video into separate audio and video files.
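The splitting step could be sketched as follows. This is only an illustration, not the author's implementation: it assumes the `ffmpeg` binary is available to the function (for example through a Lambda layer), and the helper name `build_split_commands` and the output file names are hypothetical.

```python
from pathlib import PurePosixPath


def build_split_commands(source: str, out_dir: str = "/tmp") -> dict:
    """Return ffmpeg argument lists that split `source` into a
    video-only file and an audio-only file.

    Assumption: ffmpeg is on PATH and /tmp is used as scratch space.
    """
    stem = PurePosixPath(source).stem
    video_out = f"{out_dir}/{stem}_video.mp4"
    audio_out = f"{out_dir}/{stem}_audio.wav"
    return {
        # -an drops the audio stream; the video stream is copied as-is.
        "video": ["ffmpeg", "-y", "-i", source, "-an", "-c:v", "copy", video_out],
        # -vn drops the video stream; the audio is written as a WAV file.
        "audio": ["ffmpeg", "-y", "-i", source, "-vn", audio_out],
    }
```

Each list can be passed to `subprocess.run(cmd, check=True)`. Building the commands separately from running them keeps the logic easy to test without ffmpeg installed.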

Step Functions Orchestration

AWS Step Functions orchestrates all of the following steps.
The workflow controls each step, passing data between services and managing the pipeline flow.
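To make the orchestration concrete, here is a minimal sketch of a state machine definition in Amazon States Language, built as a Python dictionary. The state names mirror the pipeline steps but are hypothetical, and modelling every step as a Lambda task in a simple linear chain is an assumption, not something the original design confirms.

```python
import json

# Hypothetical state names, in the order the steps are described.
STEPS = [
    "StoreRawVideoAndAudio",
    "MergeVideoAndAudio",
    "TranscribeAudio",
    "ProofreadText",
    "TranslateText",
    "GenerateSubtitles",
    "StoreProcessedVideo",
]


def build_definition(lambda_arns: dict) -> dict:
    """Chain the steps into a linear Amazon States Language definition."""
    states = {}
    for i, name in enumerate(STEPS):
        state = {"Type": "Task", "Resource": lambda_arns[name]}
        if i + 1 < len(STEPS):
            state["Next"] = STEPS[i + 1]  # output of this step feeds the next
        else:
            state["End"] = True  # last step terminates the execution
        states[name] = state
    return {
        "Comment": "Sketch of the subtitle pipeline",
        "StartAt": STEPS[0],
        "States": states,
    }


def to_json(definition: dict) -> str:
    """Serialize the definition for use when creating the state machine."""
    return json.dumps(definition, indent=2)
```

By default, the output of each task becomes the input of the next state, which is how data is passed along the chain.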

Storing Raw Video and Audio

Audio and video files are stored separately in the S3 Raw Video and Audio Bucket.
When the files arrive, the bucket triggers an email notification to confirm they were successfully stored.

Merging Video and Audio Files

This Lambda function combines the audio and video files into a single file.
Error Handling: If merging fails, the pipeline stores the failed files in an Error Bucket.
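A rough sketch of the merge step and its failure path is shown below. It assumes ffmpeg does the merging; the function names and the shape of the returned dictionary are hypothetical. The caller is expected to copy the listed files to the Error Bucket when the status is `failed`.

```python
import subprocess


def build_merge_command(video_path: str, audio_path: str, output_path: str) -> list:
    """ffmpeg arguments that mux one video stream and one audio stream."""
    return [
        "ffmpeg", "-y",
        "-i", video_path,
        "-i", audio_path,
        "-map", "0:v:0",   # first video stream of the first input
        "-map", "1:a:0",   # first audio stream of the second input
        "-c:v", "copy",    # keep the video stream untouched
        "-c:a", "aac",     # encode audio to AAC for the MP4 container
        output_path,
    ]


def merge_or_route_to_error(video_path, audio_path, output_path, run=subprocess.run):
    """Try to merge; on failure, report which files belong in the Error Bucket."""
    cmd = build_merge_command(video_path, audio_path, output_path)
    try:
        run(cmd, check=True, capture_output=True)
    except (subprocess.CalledProcessError, OSError) as exc:
        return {
            "status": "failed",
            "error_files": [video_path, audio_path],
            "reason": str(exc),
        }
    return {"status": "merged", "output": output_path}
```

The `run` parameter exists only so the function can be exercised with a stub instead of a real ffmpeg process.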

Transcribing Audio to Text

The merged video’s audio is sent to Amazon Transcribe, which converts the speech into text.
The resulting text is passed to the next Lambda function.
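As an illustration, the request for a transcription job might be assembled like this. The parameter names are those of the Transcribe `StartTranscriptionJob` API; the helper name, the default language, and the media format are assumptions made for the example.

```python
def build_transcription_job(
    job_name: str,
    media_uri: str,
    output_bucket: str,
    language_code: str = "en-US",   # assumed source language
    media_format: str = "mp4",      # assumed format of the merged file
) -> dict:
    """Keyword arguments for the Transcribe StartTranscriptionJob call."""
    return {
        "TranscriptionJobName": job_name,
        "Media": {"MediaFileUri": media_uri},
        "MediaFormat": media_format,
        "LanguageCode": language_code,
        "OutputBucketName": output_bucket,
    }
```

With boto3 this would typically be used as `boto3.client("transcribe").start_transcription_job(**params)`. The job runs asynchronously and writes its transcript as JSON to the output bucket, so the next step has to wait for the job to complete before reading the text.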

Proofreading Transcribed Text

This Lambda function checks the transcription for accuracy and readability.
If the text is poor quality, it can be flagged for manual review or re-transcription.
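One possible way to implement such a check is to use the per-word confidence scores in the Transcribe output JSON. The sketch below is an assumption about how proofreading could work, not a description of the actual function; the 0.85 threshold is arbitrary.

```python
def mean_confidence(transcript: dict) -> float:
    """Average confidence over spoken words in a Transcribe result.

    Punctuation items are skipped because they carry no meaningful score.
    """
    scores = [
        float(item["alternatives"][0]["confidence"])
        for item in transcript["results"]["items"]
        if item.get("type") == "pronunciation" and item.get("alternatives")
    ]
    return sum(scores) / len(scores) if scores else 0.0


def needs_review(transcript: dict, threshold: float = 0.85) -> bool:
    """Flag the transcript when average confidence is below the threshold."""
    return mean_confidence(transcript) < threshold
```

A transcript flagged this way could be routed to manual review or sent back for re-transcription.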

Translating Text into Target Language

The proofread text is sent to Amazon Translate to convert it into a chosen language (e.g., Arabic, Italian, or Spanish).
The translated text is then passed to the next Lambda function.
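Long transcripts may exceed the size limit of a single translation request, so the text usually has to be split first. The sketch below splits on sentence boundaries; the 10,000-byte limit is an assumption to be checked against the current service quotas, and the helper name is hypothetical.

```python
import re

MAX_BYTES = 10_000  # assumed per-request limit; verify against current quotas


def chunk_text(text: str, max_bytes: int = MAX_BYTES) -> list:
    """Group whole sentences into chunks of at most `max_bytes` UTF-8 bytes."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if len(candidate.encode("utf-8")) <= max_bytes:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single sentence longer than the limit is kept whole here;
            # a real implementation would need to split it further.
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be translated separately, for example with the boto3 call `translate_text(Text=chunk, SourceLanguageCode="en", TargetLanguageCode="es")`, and the results joined in order.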

Generating Subtitles and Merging Paragraphs

This function formats the translated text as subtitles and merges paragraphs if needed.
The final file is prepared for storage.
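The subtitle format is not specified above; assuming the common SubRip (SRT) format, the formatting step could look roughly like this. The cue tuples `(start_seconds, end_seconds, text)` are a hypothetical input shape.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02}:{minutes:02}:{secs:02},{millis:03}"


def to_srt(cues) -> str:
    """Render (start, end, text) cues as numbered SRT blocks."""
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}"
        )
    # SRT blocks are separated by a blank line.
    return "\n\n".join(blocks) + "\n"
```

For example, `to_srt([(0.0, 2.5, "Hola")])` yields a single cue running from `00:00:00,000` to `00:00:02,500`.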

Storing Processed Video with Subtitles

The fully processed video, with subtitles in the desired language, is stored in the S3 bucket for processed data.
This is the final output location, where the processed video can be accessed.

Error Handling

Error Bucket for Failed Merges: If audio and video files fail to merge, the Lambda Merge Function sends them to a designated Error Bucket.
Transcription and Proofreading Quality Check: Poor transcription quality detected by the Proofread Function can trigger a flag for manual review or re-transcription.

Notifications

Email Notifications for File Arrival in S3: After the raw video and audio files are stored in the S3 Raw Video and Audio Bucket, an email notification is sent to confirm successful storage.
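The delivery mechanism for the email is not described. One common setup, assumed here, is an S3 event notification that publishes to an Amazon SNS topic with an email subscription. The sketch below builds the bucket notification configuration for that case; the topic ARN, file suffixes, and helper name are hypothetical.

```python
def build_notification_config(topic_arn: str, suffixes=(".mp4", ".wav")) -> dict:
    """S3 notification configuration that publishes object-created
    events for the given file suffixes to an SNS topic."""
    return {
        "TopicConfigurations": [
            {
                "Id": f"raw-upload-{suffix.lstrip('.')}",
                "TopicArn": topic_arn,
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {
                        "FilterRules": [{"Name": "suffix", "Value": suffix}]
                    }
                },
            }
            for suffix in suffixes
        ]
    }
```

With boto3, this dictionary would be passed as `NotificationConfiguration` to `put_bucket_notification_configuration`. The SNS topic's access policy must also allow S3 to publish to it.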

Benefits of This Pipeline

Automation: Streamlines the entire video processing workflow without manual intervention.
Scalability: Lambda and Step Functions scale with demand, so the pipeline can process many videos without provisioning servers.
Language Support: Amazon Translate enables easy translation to multiple languages, broadening the video’s reach.
Error Management: A dedicated Error Bucket and quality flags ensure that failed merges and poor transcriptions are captured and handled.

DEV Community