PDF to Multilingual Audio: Building a Serverless AI Pipeline on AWS

In this project, I built a fully serverless, event-driven AI pipeline on AWS that automatically converts a PDF document into translated speech audio.

Whenever a PDF is uploaded to an S3 bucket:

  • Text is extracted using Amazon Textract
  • The extracted text is translated using Amazon Translate
  • The translated text is converted into speech using Amazon Polly
  • The final audio file is stored back in S3

All of this happens automatically, without any manual trigger or server management.

PDF Upload (Amazon S3)
        ↓
AWS Lambda (Triggered by S3 event)
        ↓
Amazon Textract (OCR)
        ↓
Amazon Translate (Language Translation)
        ↓
Amazon Polly (Text to Speech)
        ↓
Audio Output Stored in S3


AWS Services Used

  • Amazon S3 – File storage and event trigger
  • AWS Lambda – Serverless compute
  • Amazon Textract – Extract text from PDFs
  • Amazon Translate – Translate extracted text
  • Amazon Polly – Convert text into speech
  • AWS IAM – Secure access control
  • Amazon CloudWatch – Logging and monitoring

Step-by-Step Project Flow

Step 1: Upload PDF to S3

  • A PDF file is uploaded to the input/ folder of the S3 bucket.
  • This upload event automatically triggers the Lambda function (a quick upload sketch is shown below).
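
For a quick test, you can drop a PDF into the input/ prefix with boto3. The bucket name is the one used in this project; sample.pdf is just a placeholder file name:

import boto3

s3 = boto3.client("s3")

# Uploading into input/ raises the S3 ObjectCreated event
# that invokes the Lambda function configured on the bucket.
s3.upload_file(
    Filename="sample.pdf",              # placeholder local file
    Bucket="pdf-translate-speech-ak",
    Key="input/sample.pdf"
)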

Step 2: Extract Text with Textract

  • Lambda starts an asynchronous Textract job to extract text from the uploaded PDF.
  • Textract reads the PDF directly from S3 and returns the extracted text.

The Lambda function logs the extracted text, which can be inspected in Amazon CloudWatch Logs (a small retrieval sketch is shown below).
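
If you prefer to pull those log lines programmatically instead of through the console, a minimal sketch looks like this. The function name pdf-translate-speech is an assumption; substitute your Lambda function's actual name:

import boto3

logs = boto3.client("logs")

# Lambda writes its logs to a log group named /aws/lambda/<function-name>.
# "pdf-translate-speech" is a placeholder; use your function's real name.
response = logs.filter_log_events(
    logGroupName="/aws/lambda/pdf-translate-speech",
    filterPattern="Textract"
)

for event in response["events"]:
    print(event["message"])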

Step 3: Translate the Extracted Text

  • The extracted English text is passed to Amazon Translate.
  • The text is translated into a target language (for example, Tamil), as sketched below.
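
As a standalone sanity check, Amazon Translate can be called directly with the same language codes used in this pipeline (en to ta); the sample sentence is just an illustration:

import boto3

translate = boto3.client("translate")

# Translate a short English sentence into Tamil ("ta").
result = translate.translate_text(
    Text="Hello, this is a test sentence.",
    SourceLanguageCode="en",
    TargetLanguageCode="ta"
)

print(result["TranslatedText"])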

Step 4: Convert Translated Text to Speech

  • The translated text is sent to Amazon Polly.
  • Polly generates a natural-sounding MP3 audio file using a language-appropriate voice (see the standalone sketch below).
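
The Polly call can also be tried on its own, writing the audio to a local file. The voice ID mirrors the one used in the Lambda code later in this post, and the sample text is only a placeholder:

import boto3

polly = boto3.client("polly")

# Synthesize a short sample text with the same voice the pipeline uses.
speech = polly.synthesize_speech(
    Text="Hello, this is a quick voice test.",  # placeholder sample text
    OutputFormat="mp3",
    VoiceId="Aditi"
)

# AudioStream is a streaming body; write it out to a local MP3 file.
with open("sample.mp3", "wb") as f:
    f.write(speech["AudioStream"].read())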

Step 5: Store the Audio Output

  • The generated MP3 file is saved in the output/ folder of the same S3 bucket.
  • The entire process completes automatically in a few seconds; the sketch below shows one way to fetch the result for listening.
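
To listen to the result without downloading it from the console, a presigned URL can be generated for the output object. The object key below is a placeholder, since the real key contains a UUID printed in the Lambda logs:

import boto3

s3 = boto3.client("s3")

# Generate a temporary link so the MP3 can be played in a browser.
# Replace the key with the actual object key from the Lambda logs.
url = s3.generate_presigned_url(
    "get_object",
    Params={
        "Bucket": "pdf-translate-speech-ak",
        "Key": "output/translated_audio_<uuid>.mp3"
    },
    ExpiresIn=3600  # link valid for one hour
)

print(url)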

Below are the Lambda function code and the IAM role permissions used for this pipeline.
lambda_function.py

import boto3
import time
import uuid
from urllib.parse import unquote_plus

textract = boto3.client('textract')
translate = boto3.client('translate')
polly = boto3.client('polly')
s3 = boto3.client('s3')

BUCKET_NAME = "pdf-translate-speech-ak"

def lambda_handler(event, context):

    # Read the bucket and object key from the S3 event record.
    # Object keys arrive URL-encoded, so decode them before use.
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = unquote_plus(event['Records'][0]['s3']['object']['key'])

    print(f"PDF uploaded: {key}")

    # Only process files uploaded under the input/ prefix
    if not key.startswith("input/"):
        return {"statusCode": 200, "message": "Not an input file"}

    # Start an asynchronous Textract job; Textract reads the PDF directly from S3
    response = textract.start_document_text_detection(
        DocumentLocation={
            'S3Object': {
                'Bucket': bucket,
                'Name': key
            }
        }
    )

    job_id = response['JobId']
    print(f"Textract Job ID: {job_id}")

    # Poll the job until it finishes, then collect the detected LINE blocks.
    # Results are paginated, so follow NextToken to read every page.
    extracted_text = ""
    while True:
        result = textract.get_document_text_detection(JobId=job_id)
        status = result['JobStatus']

        if status == "SUCCEEDED":
            while True:
                for block in result['Blocks']:
                    if block['BlockType'] == "LINE":
                        extracted_text += block['Text'] + " "
                next_token = result.get('NextToken')
                if not next_token:
                    break
                result = textract.get_document_text_detection(
                    JobId=job_id, NextToken=next_token
                )
            break
        elif status == "FAILED":
            raise Exception("Textract text detection failed")

        time.sleep(5)

    print("Text extraction completed")

    # Translate the extracted English text into Tamil.
    # Truncate as a safeguard against the TranslateText request size limit.
    translated = translate.translate_text(
        Text=extracted_text[:5000],
        SourceLanguageCode="en",
        TargetLanguageCode="ta"
    )

    translated_text = translated['TranslatedText']

    # Convert the translated text to speech.
    # Truncate to stay within Polly's SynthesizeSpeech character limit.
    speech = polly.synthesize_speech(
        Text=translated_text[:3000],
        OutputFormat="mp3",
        VoiceId="Aditi"
    )

    # Save the generated MP3 under the output/ prefix of the same bucket
    audio_key = f"output/translated_audio_{uuid.uuid4()}.mp3"

    s3.put_object(
        Bucket=bucket,
        Key=audio_key,
        Body=speech['AudioStream'].read(),
        ContentType="audio/mpeg"
    )

    print(f"Audio saved: {audio_key}")

    return {
        "statusCode": 200,
        "audio_file": audio_key
    }
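
The S3 trigger itself can be created from the console, or programmatically as in this sketch. The function ARN is a placeholder, and the function must also have a resource-based policy allowing s3.amazonaws.com to invoke it:

import boto3

s3 = boto3.client("s3")

# Invoke the Lambda function whenever a PDF lands under input/.
# The Lambda ARN below is a placeholder; use your function's ARN.
s3.put_bucket_notification_configuration(
    Bucket="pdf-translate-speech-ak",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": "arn:aws:lambda:REGION:ACCOUNT_ID:function:pdf-translate-speech",
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {
                        "FilterRules": [
                            {"Name": "prefix", "Value": "input/"},
                            {"Name": "suffix", "Value": ".pdf"}
                        ]
                    }
                }
            }
        ]
    }
)

One practical note: the default Lambda timeout of 3 seconds is too short for the Textract polling loop, so raise the function's timeout (for example, to a few minutes) in its configuration.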


Role and its Policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "textract:StartDocumentTextDetection",
        "textract:GetDocumentTextDetection"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "translate:TranslateText"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "polly:SynthesizeSpeech"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::pdf-translate-speech-ak/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:*"
      ],
      "Resource": "*"
    }
  ]
}
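
Alongside the permissions policy above, the role also needs a trust policy that lets the Lambda service assume it; the standard one looks like this:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "lambda.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}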


Output Translated Audio:

Connect With Me

👤 Akash S
☁️ AWS | Cloud | AI Projects
✍️ Writing about real-world cloud learning
