PDF to Multilingual Audio: Building a Serverless AI Pipeline on AWS

In this project, I built a fully serverless, event-driven AI pipeline on AWS that automatically converts a PDF document into translated speech audio.

Whenever a PDF is uploaded to an S3 bucket:

  • Text is extracted using Amazon Textract
  • The extracted text is translated using Amazon Translate
  • The translated text is converted into speech using Amazon Polly
  • The final audio file is stored back in S3

All of this happens automatically, without any manual trigger or server management.

PDF Upload (Amazon S3)
        ↓
AWS Lambda (Triggered by S3 event)
        ↓
Amazon Textract (OCR)
        ↓
Amazon Translate (Language Translation)
        ↓
Amazon Polly (Text to Speech)
        ↓
Audio Output Stored in S3


AWS Services Used

  • Amazon S3 – File storage and event trigger
  • AWS Lambda – Serverless compute
  • Amazon Textract – Extract text from PDFs
  • Amazon Translate – Translate extracted text
  • Amazon Polly – Convert text into speech
  • AWS IAM – Secure access control
  • Amazon CloudWatch – Logging and monitoring

Step-by-Step Project Flow

Step 1: Upload PDF to S3

  • A PDF file is uploaded to the input/ folder of the S3 bucket.
  • This upload event automatically triggers the Lambda function (a quick upload sketch is shown below).
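
For a quick test, you can drop a PDF into the input/ prefix with boto3. The bucket name is the one used in this project; sample.pdf is just a placeholder file name:

import boto3

s3 = boto3.client("s3")

# Uploading into input/ raises the S3 ObjectCreated event
# that invokes the Lambda function configured on the bucket.
s3.upload_file(
    Filename="sample.pdf",              # placeholder local file
    Bucket="pdf-translate-speech-ak",
    Key="input/sample.pdf"
)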

Step 2: Extract Text with Textract

  • Lambda starts an asynchronous Textract job to extract text from the uploaded PDF.
  • Textract reads the PDF directly from S3 and returns the extracted text.

The Lambda function logs the extracted text, which can be inspected in Amazon CloudWatch Logs (a small retrieval sketch is shown below).
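
If you prefer to pull those log lines programmatically instead of through the console, a minimal sketch looks like this. The function name pdf-translate-speech is an assumption; substitute your Lambda function's actual name:

import boto3

logs = boto3.client("logs")

# Lambda writes its logs to a log group named /aws/lambda/<function-name>.
# "pdf-translate-speech" is a placeholder; use your function's real name.
response = logs.filter_log_events(
    logGroupName="/aws/lambda/pdf-translate-speech",
    filterPattern="Textract"
)

for event in response["events"]:
    print(event["message"])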

Step 3: Translate the Extracted Text

  • The extracted English text is passed to Amazon Translate.
  • The text is translated into a target language (for example, Tamil), as sketched below.
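
As a standalone sanity check, Amazon Translate can be called directly with the same language codes used in this pipeline (en to ta); the sample sentence is just an illustration:

import boto3

translate = boto3.client("translate")

# Translate a short English sentence into Tamil ("ta").
result = translate.translate_text(
    Text="Hello, this is a test sentence.",
    SourceLanguageCode="en",
    TargetLanguageCode="ta"
)

print(result["TranslatedText"])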

Step 4: Convert Translated Text to Speech

  • The translated text is sent to Amazon Polly.
  • Polly generates a natural-sounding MP3 audio file using a language-appropriate voice (see the standalone sketch below).
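
The Polly call can also be tried on its own, writing the audio to a local file. The voice ID mirrors the one used in the Lambda code later in this post, and the sample text is only a placeholder:

import boto3

polly = boto3.client("polly")

# Synthesize a short sample text with the same voice the pipeline uses.
speech = polly.synthesize_speech(
    Text="Hello, this is a quick voice test.",  # placeholder sample text
    OutputFormat="mp3",
    VoiceId="Aditi"
)

# AudioStream is a streaming body; write it out to a local MP3 file.
with open("sample.mp3", "wb") as f:
    f.write(speech["AudioStream"].read())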

Step 5: Store the Audio Output

  • The generated MP3 file is saved in the output/ folder of the same S3 bucket.
  • The entire process completes automatically in a few seconds; the sketch below shows one way to fetch the result for listening.
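
To listen to the result without downloading it from the console, a presigned URL can be generated for the output object. The object key below is a placeholder, since the real key contains a UUID printed in the Lambda logs:

import boto3

s3 = boto3.client("s3")

# Generate a temporary link so the MP3 can be played in a browser.
# Replace the key with the actual object key from the Lambda logs.
url = s3.generate_presigned_url(
    "get_object",
    Params={
        "Bucket": "pdf-translate-speech-ak",
        "Key": "output/translated_audio_<uuid>.mp3"
    },
    ExpiresIn=3600  # link valid for one hour
)

print(url)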

Below are the Lambda function code and the IAM role permissions used for this pipeline.
lambda_function.py

import boto3
import time
import uuid
from urllib.parse import unquote_plus

textract = boto3.client('textract')
translate = boto3.client('translate')
polly = boto3.client('polly')
s3 = boto3.client('s3')

BUCKET_NAME = "pdf-translate-speech-ak"

def lambda_handler(event, context):

    # Read the bucket and object key from the S3 event record.
    # Object keys arrive URL-encoded, so decode them before use.
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = unquote_plus(event['Records'][0]['s3']['object']['key'])

    print(f"PDF uploaded: {key}")

    # Only process files uploaded under the input/ prefix
    if not key.startswith("input/"):
        return {"statusCode": 200, "message": "Not an input file"}

    # Start an asynchronous Textract job; Textract reads the PDF directly from S3
    response = textract.start_document_text_detection(
        DocumentLocation={
            'S3Object': {
                'Bucket': bucket,
                'Name': key
            }
        }
    )

    job_id = response['JobId']
    print(f"Textract Job ID: {job_id}")

    # Poll the job until it finishes, then collect the detected LINE blocks.
    # Results are paginated, so follow NextToken to read every page.
    extracted_text = ""
    while True:
        result = textract.get_document_text_detection(JobId=job_id)
        status = result['JobStatus']

        if status == "SUCCEEDED":
            while True:
                for block in result['Blocks']:
                    if block['BlockType'] == "LINE":
                        extracted_text += block['Text'] + " "
                next_token = result.get('NextToken')
                if not next_token:
                    break
                result = textract.get_document_text_detection(
                    JobId=job_id, NextToken=next_token
                )
            break
        elif status == "FAILED":
            raise Exception("Textract text detection failed")

        time.sleep(5)

    print("Text extraction completed")

    # Translate the extracted English text into Tamil.
    # Truncate as a safeguard against the TranslateText request size limit.
    translated = translate.translate_text(
        Text=extracted_text[:5000],
        SourceLanguageCode="en",
        TargetLanguageCode="ta"
    )

    translated_text = translated['TranslatedText']

    # Convert the translated text to speech.
    # Truncate to stay within Polly's SynthesizeSpeech character limit.
    speech = polly.synthesize_speech(
        Text=translated_text[:3000],
        OutputFormat="mp3",
        VoiceId="Aditi"
    )

    # Save the generated MP3 under the output/ prefix of the same bucket
    audio_key = f"output/translated_audio_{uuid.uuid4()}.mp3"

    s3.put_object(
        Bucket=bucket,
        Key=audio_key,
        Body=speech['AudioStream'].read(),
        ContentType="audio/mpeg"
    )

    print(f"Audio saved: {audio_key}")

    return {
        "statusCode": 200,
        "audio_file": audio_key
    }
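
The S3 trigger itself can be created from the console, or programmatically as in this sketch. The function ARN is a placeholder, and the function must also have a resource-based policy allowing s3.amazonaws.com to invoke it:

import boto3

s3 = boto3.client("s3")

# Invoke the Lambda function whenever a PDF lands under input/.
# The Lambda ARN below is a placeholder; use your function's ARN.
s3.put_bucket_notification_configuration(
    Bucket="pdf-translate-speech-ak",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": "arn:aws:lambda:REGION:ACCOUNT_ID:function:pdf-translate-speech",
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {
                        "FilterRules": [
                            {"Name": "prefix", "Value": "input/"},
                            {"Name": "suffix", "Value": ".pdf"}
                        ]
                    }
                }
            }
        ]
    }
)

One practical note: the default Lambda timeout of 3 seconds is too short for the Textract polling loop, so raise the function's timeout (for example, to a few minutes) in its configuration.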


Role and its Policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "textract:StartDocumentTextDetection",
        "textract:GetDocumentTextDetection"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "translate:TranslateText"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "polly:SynthesizeSpeech"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::pdf-translate-speech-ak/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:*"
      ],
      "Resource": "*"
    }
  ]
}
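
Alongside the permissions policy above, the role also needs a trust policy that lets the Lambda service assume it; the standard one looks like this:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "lambda.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}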


Output Translated Audio:

Connect With Me

👤 Akash S
☁️ AWS | Cloud | AI Projects
✍️ Writing about real-world cloud learning
