I've always been passionate about technology and how it can transform lives. Because I occasionally deal with minor visual challenges myself, I've become deeply interested in exploring ways technology can improve accessibility, particularly for people with greater visual difficulties.
For that reason, I recently embarked on a project to build an API that converts images into descriptive text (image to text) or audio (image to speech). Using AWS services such as Amazon Bedrock (with the Nova Lite model) and Amazon Polly, I built an open-source serverless API that generates image descriptions in multiple languages with optional audio output.
What Does This API Do?
The API accepts base64-encoded images and returns detailed descriptions in over 10 languages, with optional audio narration. It's designed for:
- Accessibility applications: Generate alt-text for images in multiple languages or integrate with screen readers
- Content management: Automatically describe uploaded images
- E-commerce: Create product descriptions from images
- Social media: Generate captions in different languages
Architecture Overview
The solution uses a serverless architecture built entirely on AWS.
Key AWS Services Used
1. Amazon Bedrock with Nova Lite Model
At the heart of the application is Amazon's Nova Lite model:
- Multimodal understanding: Processes both text prompts and images
- Cost efficiency: Optimized for high-volume applications
- Fast inference: Low latency responses
- Multi-language support: Native understanding of multiple languages
2. AWS Lambda
The serverless compute layer handles:
- Image processing: Base64 decoding and validation
- Bedrock integration: Model invocation and response handling
- Polly integration: Audio generation for accessibility
- Error handling: Comprehensive error responses
3. Amazon Polly
Provides text-to-speech capabilities with:
- Multiple voices per language: Natural-sounding speech
- SSML support: Enhanced audio control
- MP3 output: Compressed audio for web delivery
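Here's a rough sketch of what that Polly call can look like inside the Lambda function. The synthesize_description helper, its default voice, and the base64 return value are illustrative choices on my part, not necessarily how the repository implements it:

import base64

import boto3

polly_client = boto3.client("polly")

def synthesize_description(text, voice_id="Joanna"):
    # Convert the generated description into MP3 audio
    result = polly_client.synthesize_speech(
        Text=text,
        VoiceId=voice_id,
        OutputFormat="mp3",
    )
    # Encode the audio as base64 so it fits inside a JSON response
    return base64.b64encode(result["AudioStream"].read()).decode("utf-8")

Returning base64 keeps the API stateless and matches the audio field shown in the response examples below, at the cost of larger payloads.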
4. API Gateway
Creates a production-ready API with:
- REST endpoints: Clean API interface
- SSL termination: Secure HTTPS connections
- Request/response transformation: Clean JSON interfaces
Technical Implementation
Lambda Function Structure
The Python Lambda function is organized into clear components:
def lambda_handler(event, context):
    # Parse request and validate input
    # Invoke Bedrock Nova Lite model
    # Generate audio with Polly (if requested)
    # Return structured JSON response
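To make the flow concrete, here's a minimal sketch of what that handler can look like. It condenses both endpoints into a single function for brevity; describe_image is a hypothetical placeholder for the Bedrock call shown in the next section, and synthesize_description is the Polly helper sketched earlier:

import base64
import json

# describe_image and synthesize_description are hypothetical helpers standing in
# for the repository's Bedrock and Polly code (see the surrounding sections)

def lambda_handler(event, context):
    try:
        # Parse request and validate input
        body = json.loads(event.get("body") or "{}")
        image_bytes = base64.b64decode(body["image"])
        language = body.get("language", "en")

        # Invoke Bedrock Nova Lite model
        description = describe_image(image_bytes, language)
        result = {"description": description, "format": "text", "language": language}

        # Generate audio with Polly (if requested)
        if body.get("voice"):
            result["audio"] = synthesize_description(description, body["voice"])
            result["format"] = "audio"
            result["voice"] = body["voice"]

        # Return structured JSON response
        return {"statusCode": 200, "body": json.dumps(result)}
    except (KeyError, ValueError) as error:
        return {"statusCode": 400, "body": json.dumps({"error": str(error)})}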
Bedrock Integration
The Nova Lite model is invoked with carefully crafted prompts:
import base64
import json

import boto3

# Bedrock Runtime client used to invoke the Nova Lite model
bedrock_client = boto3.client("bedrock-runtime")

prompt = f"Describe this image in {language_name}. Be descriptive and detailed."

response = bedrock_client.invoke_model(
    modelId="amazon.nova-lite-v1:0",
    body=json.dumps({
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "image": {
                            "format": image_format,
                            "source": {
                                "bytes": base64.b64encode(image_bytes).decode('utf-8')
                            }
                        }
                    },
                    {
                        "text": prompt
                    }
                ]
            }
        ],
        "inferenceConfig": {
            "maxTokens": 300
        }
    })
)
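invoke_model returns the model output as a streaming body. A minimal sketch of pulling the description out, assuming the standard Nova messages response shape rather than the repository's exact parsing code:

# Parse the streaming body and extract the generated description
result = json.loads(response["body"].read())
description = result["output"]["message"]["content"][0]["text"]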
Deployment with AWS SAM
The project uses AWS SAM (Serverless Application Model) for infrastructure as code:
# template.yaml
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31

Resources:
  ImageDescriptionFunction:
    Type: AWS::Serverless::Function
    Properties:
      Runtime: python3.9
      Handler: app.lambda_handler
      Policies:
        - Statement:
            - Effect: Allow
              Action:
                - bedrock:InvokeModel
              Resource: '*'
            - Effect: Allow
              Action:
                - polly:SynthesizeSpeech
              Resource: '*'
Deployment Commands
sam build
sam deploy --guided
API Endpoints
Text Description Endpoint
POST /describe/text
{
  "image": "base64_encoded_image_data",
  "language": "en"
}
Response:
{
  "description": "A detailed description of the image in the requested language",
  "format": "text",
  "language": "en"
}
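As a usage example, a Python client could call the text endpoint like this. The API_URL value is a placeholder for the endpoint URL that SAM prints after deployment, and the requests library is a dependency of this example only, not of the API:

import base64

import requests

# Replace with the endpoint URL printed by `sam deploy`
API_URL = "https://<api-id>.execute-api.<region>.amazonaws.com/Prod"

with open("photo.jpg", "rb") as image_file:
    payload = {
        "image": base64.b64encode(image_file.read()).decode("utf-8"),
        "language": "es",
    }

response = requests.post(f"{API_URL}/describe/text", json=payload, timeout=30)
print(response.json()["description"])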
Audio Description Endpoint
POST /describe/audio
{
  "image": "base64_encoded_image_data",
  "language": "en",
  "voice": "Joanna"
}
Response:
{
  "description": "Text description",
  "audio": "base64_encoded_mp3_data",
  "format": "audio",
  "voice": "Joanna",
  "language": "en"
}
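On the client side, the base64 audio decodes straight to an MP3 file. A short sketch, reusing the hypothetical API_URL and the same base64 payload pattern as the text example:

import base64

import requests

payload = {
    "image": "...",  # base64-encoded image, truncated here for brevity
    "language": "en",
    "voice": "Joanna",
}
result = requests.post(f"{API_URL}/describe/audio", json=payload, timeout=60).json()

# Save the narration so it can be played back or attached to a page
with open("description.mp3", "wb") as audio_file:
    audio_file.write(base64.b64decode(result["audio"]))

print(result["description"])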
Getting Started
- Clone the repository
- Enable Nova Lite model access in the Amazon Bedrock console
- Deploy with SAM:
sam build && sam deploy --guided
- Test the endpoints with your images
Conclusion
This project demonstrates how modern AWS services can be combined to create powerful, cost-effective AI applications. The combination of Nova Lite's multimodal capabilities, Lambda's serverless compute, and Polly's text-to-speech creates a comprehensive solution for image accessibility.
The entire codebase is open-source and production-ready, making it easy for developers to deploy their own instance or extend the functionality for specific use cases.
Repository: https://github.com/mkreder/image-to-speech-api