
M. Abdullah Bin Aftab


Case Study: Creating an ETL Data Pipeline using AWS Services - Real-World Problem

Architecture

[Image: pipeline architecture diagram]

Overview

This ETL pipeline leverages several AWS services to fetch, process, and store YouTube videos with translated subtitles. The core components include:

  • AWS Lambda for processing video and audio
  • AWS Step Functions for orchestrating the workflow
  • Amazon S3 buckets for storing raw and processed data
  • Amazon Transcribe for converting audio to text
  • Amazon Translate for translating the text into the desired languages

Pipeline Steps

Fetching Video from YouTube

  • A video is fetched from YouTube and passed to an initial Lambda function.
  • This function splits the video into separate audio and video files.
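
The split step can be sketched with ffmpeg, assuming the binary is available to the Lambda function (for example, through a layer). The post does not say which tool performs the split, so ffmpeg, the helper name `build_split_commands`, and the output file names are all assumptions made for illustration.

```python
from pathlib import PurePosixPath


def build_split_commands(source, out_dir="/tmp"):
    """Return ffmpeg argument lists that split `source` into video-only and audio-only files.

    Hypothetical helper: output names are derived from the source file name.
    """
    stem = PurePosixPath(source).stem
    video_out = str(PurePosixPath(out_dir) / f"{stem}_video.mp4")
    audio_out = str(PurePosixPath(out_dir) / f"{stem}_audio.m4a")
    return {
        # -an drops the audio stream; -c:v copy keeps the video stream without re-encoding.
        "video": ["ffmpeg", "-y", "-i", source, "-an", "-c:v", "copy", video_out],
        # -vn drops the video stream; -c:a copy keeps the audio stream without re-encoding.
        # Copying into .m4a assumes the source audio is AAC.
        "audio": ["ffmpeg", "-y", "-i", source, "-vn", "-c:a", "copy", audio_out],
    }
```

Each list can be passed to `subprocess.run(cmd, check=True)`; `/tmp` is used because it is the writable directory inside a Lambda function.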

Step Functions Orchestration

  • AWS Step Functions orchestrates all of the steps that follow.
  • The workflow controls each step, passing data between services and managing the pipeline flow.
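
A minimal sketch of such a workflow, written as an Amazon States Language definition built in Python. The state names and Lambda ARNs are placeholders, not taken from the post; the actual state machine may be structured differently.

```python
import json

# Hypothetical state names and ARNs, listed in pipeline order.
ACCOUNT = "arn:aws:lambda:us-east-1:123456789012:function"
STEPS = [
    ("MergeMedia", f"{ACCOUNT}:merge-media"),
    ("TranscribeAudio", f"{ACCOUNT}:transcribe-audio"),
    ("ProofreadText", f"{ACCOUNT}:proofread-text"),
    ("TranslateText", f"{ACCOUNT}:translate-text"),
    ("GenerateSubtitles", f"{ACCOUNT}:generate-subtitles"),
]


def build_definition(steps):
    """Chain (state_name, lambda_arn) pairs into a linear state machine definition."""
    states = {}
    for index, (name, arn) in enumerate(steps):
        state = {"Type": "Task", "Resource": arn}
        if index + 1 < len(steps):
            # By default, the output of one Task state becomes the input of the next.
            state["Next"] = steps[index + 1][0]
        else:
            state["End"] = True
        states[name] = state
    return {
        "Comment": "Video subtitle pipeline (illustrative)",
        "StartAt": steps[0][0],
        "States": states,
    }


definition_json = json.dumps(build_definition(STEPS), indent=2)
```

The resulting JSON string is the kind of document that the Step Functions `create_state_machine` call accepts as its `definition` argument.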

Storing Raw Video and Audio

  • The audio and video files are stored as separate objects in the raw S3 bucket.
  • When the files arrive, a bucket event triggers an email notification confirming that they were stored successfully.

Merging Video and Audio Files

  • A Lambda function combines the audio and video files into a single file.
  • Error Handling: If merging fails, the pipeline stores the failed files in an Error Bucket.
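
The merge-or-quarantine behavior can be sketched as follows. The `merge` and `copy_to_error_bucket` callables are hypothetical stand-ins (for example, an ffmpeg wrapper and an S3 copy); the post does not show how either is implemented.

```python
def merge_or_quarantine(video_key, audio_key, merge, copy_to_error_bucket):
    """Try to merge two media files; on failure, send both inputs to the error bucket.

    `merge(video_key, audio_key)` returns the key of the merged file or raises.
    `copy_to_error_bucket(key)` copies one object into the error bucket.
    """
    try:
        merged_key = merge(video_key, audio_key)
    except Exception as exc:
        # Keep both inputs so the failure can be inspected and retried later.
        for key in (video_key, audio_key):
            copy_to_error_bucket(key)
        return {
            "status": "FAILED",
            "error": str(exc),
            "quarantined": [video_key, audio_key],
        }
    return {"status": "MERGED", "merged_key": merged_key}
```

Returning a status dictionary instead of re-raising lets the calling workflow decide what to do next; re-raising so the orchestrator sees the failure would be an equally reasonable design.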

Transcribing Audio to Text

  • The merged video’s audio is sent to Amazon Transcribe, which converts it to text.
  • The resulting text is passed to the next Lambda function.
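
A sketch of starting the transcription job, assuming the merged file is an MP4 in S3 and the spoken language is US English. The function name, job name, bucket names, and language are illustrative; the client is passed in so the function can be exercised without AWS credentials.

```python
def start_transcription(client, job_name, media_uri, output_bucket, language_code="en-US"):
    """Start an Amazon Transcribe job for a media file stored in S3.

    `client` is expected to behave like boto3.client("transcribe").
    """
    return client.start_transcription_job(
        TranscriptionJobName=job_name,        # must be unique within the account
        Media={"MediaFileUri": media_uri},    # e.g. s3://bucket/merged/video.mp4
        MediaFormat="mp4",
        LanguageCode=language_code,
        OutputBucketName=output_bucket,       # transcript JSON is written here
    )
```

The job runs asynchronously, so the next step has to wait for it to finish, for example by polling `get_transcription_job` until the status is `COMPLETED` or `FAILED`.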

Proofreading Transcribed Text

  • A Lambda function checks the transcription for accuracy and readability.
  • If the text is poor quality, it can be flagged for manual review or re-transcription.
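
The post does not say how quality is judged. One possible heuristic, sketched below, averages the per-word confidence scores that Amazon Transcribe includes in its JSON output and flags the transcript when the average falls below a threshold. The function name and the 0.85 threshold are assumptions.

```python
def needs_review(transcript_json, min_confidence=0.85):
    """Return True when the average word confidence is below `min_confidence`.

    Expects the JSON document produced by an Amazon Transcribe job, where each
    entry in results.items carries a type and a list of alternatives.
    """
    items = transcript_json["results"]["items"]
    scores = [
        float(item["alternatives"][0]["confidence"])
        for item in items
        if item["type"] == "pronunciation"  # skip punctuation entries
    ]
    if not scores:
        # Nothing was recognized, so a human should take a look.
        return True
    return sum(scores) / len(scores) < min_confidence
```

A flagged transcript could then be routed to manual review or sent back for re-transcription, as described above.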

Translating Text into Target Language

  • The proofread text is sent to Amazon Translate, which converts it into the chosen language (e.g., Arabic, Italian, or Spanish).
  • The translated text is then passed to the next Lambda function.
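
A sketch of the translation call. Amazon Translate limits the size of a single real-time request, so long transcripts are split on sentence boundaries first; the 9,000-byte chunk size is a conservative assumption and should be checked against the current service quota. The helper names are hypothetical, and the client is passed in so the code can be tested without AWS.

```python
import re


def chunk_text(text, max_bytes=9000):
    """Split text on sentence boundaries into chunks of at most `max_bytes` UTF-8 bytes.

    A single sentence longer than `max_bytes` is kept whole in its own chunk.
    """
    chunks, current = [], ""
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate.encode("utf-8")) > max_bytes:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def translate(client, text, target_language, source_language="en"):
    """Translate `text` chunk by chunk; `client` behaves like boto3.client("translate")."""
    translated = []
    for chunk in chunk_text(text):
        response = client.translate_text(
            Text=chunk,
            SourceLanguageCode=source_language,
            TargetLanguageCode=target_language,  # e.g. "ar", "it", "es"
        )
        translated.append(response["TranslatedText"])
    return " ".join(translated)
```

Splitting on sentence boundaries keeps each request self-contained, which tends to give the translation service more usable context than cutting at an arbitrary byte offset.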

Generating Subtitles and Merging Paragraphs

  • A Lambda function formats the translated text as subtitles and merges paragraphs if needed.
  • The final file is prepared for storage.
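
The post does not name a subtitle format. As an illustration, the sketch below writes SubRip (SRT) output from a list of `(start_seconds, end_seconds, text)` cues; the cue structure and function names are assumptions.

```python
def format_timestamp(seconds):
    """Convert seconds to the SRT timestamp form HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(cues):
    """Render (start_seconds, end_seconds, text) cues as an SRT document."""
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        # Each SRT block is: sequence number, time range, subtitle text.
        blocks.append(
            f"{index}\n{format_timestamp(start)} --> {format_timestamp(end)}\n{text}"
        )
    return "\n\n".join(blocks) + "\n"
```

For example, `to_srt([(0.0, 2.5, "Hola")])` produces a single block numbered 1 that spans `00:00:00,000 --> 00:00:02,500`.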

Storing Processed Video with Subtitles

  • The fully processed video, with subtitles in the desired language, is stored in the processed S3 bucket.
  • This bucket is the final output location, from which the processed video can be accessed.

Error Handling

  1. Error Bucket for Failed Merges: If audio and video files fail to merge, the Lambda Merge Function sends them to a designated Error Bucket.
  2. Transcription and Proofreading Quality Check: Poor transcription quality detected by the Proofread Function can trigger a flag for manual review or re-transcription.
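
Both cases can also be expressed at the workflow level. The fragment below is a sketch in Amazon States Language, built as a Python dictionary: a `Catch` on the merge task routes failures to an error-bucket step, and a `Choice` state branches on a review flag. The state names, the ARN, and the `$.needs_review` field are hypothetical, and the target states are omitted for brevity.

```python
import json

# Illustrative fragment only: the states named in "Next" are not defined here.
error_handling_states = {
    "MergeMedia": {
        "Type": "Task",
        "Resource": "arn:aws:lambda:us-east-1:123456789012:function:merge-media",
        "Catch": [
            {
                "ErrorEquals": ["States.ALL"],  # catch any error from the task
                "ResultPath": "$.error",        # keep the original input, attach the error
                "Next": "MoveToErrorBucket",
            }
        ],
        "Next": "TranscribeAudio",
    },
    "CheckTranscriptQuality": {
        "Type": "Choice",
        "Choices": [
            {
                "Variable": "$.needs_review",   # assumed to be set by the proofreading step
                "BooleanEquals": True,
                "Next": "FlagForManualReview",
            }
        ],
        "Default": "TranslateText",
    },
}

fragment_json = json.dumps(error_handling_states, indent=2)
```

Setting `ResultPath` to `$.error` preserves the file locations in the state input, so the error-bucket step still knows which objects to move.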

Notifications

Email Notifications for File Arrival in S3: After the raw video and audio files are stored in the S3 Raw Video and Audio Bucket, an email notification is sent to confirm successful storage.
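
The post does not describe how the email is produced. One common setup, assumed here, is an S3 event notification that publishes to an Amazon SNS topic with an email subscription. The sketch below builds that notification configuration; the topic ARN, key prefix, and function name are placeholders.

```python
def build_notification_config(topic_arn, prefix="raw/"):
    """Build an S3 notification configuration that publishes object-created events to SNS."""
    return {
        "TopicConfigurations": [
            {
                "TopicArn": topic_arn,
                "Events": ["s3:ObjectCreated:*"],
                # Only notify for objects written under the given key prefix.
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "prefix", "Value": prefix}]}
                },
            }
        ]
    }


config = build_notification_config(
    "arn:aws:sns:us-east-1:123456789012:raw-media-stored"  # hypothetical topic
)
```

With boto3, this dictionary would be passed as `NotificationConfiguration` to `put_bucket_notification_configuration` on an S3 client; the topic's access policy also has to allow S3 to publish to it.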


Benefits of This Pipeline

  1. Automation: Streamlines the entire video processing workflow without manual intervention.
  2. Scalability: Lambda and Step Functions scale on demand, so the pipeline can process many videos concurrently.
  3. Language Support: Amazon Translate enables easy translation to multiple languages, broadening the video’s reach.
  4. Error Management: A dedicated error bucket and review flags keep failed merges and poor transcriptions isolated and visible for follow-up.
