I've always been passionate about technology and how it can transform lives. Because I occasionally deal with minor visual challenges myself, I've become deeply interested in exploring ways technology can improve accessibility, particularly for people with greater visual difficulties.
For that reason, I recently embarked on a project to build an API that converts images into descriptive text (image to text) or audio (image to speech). Using AWS services such as Amazon Bedrock (with the Nova Lite model) and Amazon Polly, I built an open-source serverless API that generates image descriptions in multiple languages with optional audio output.
What Does This API Do?
The API accepts base64-encoded images and returns detailed descriptions in over 10 languages, with optional audio narration. It's designed for:
- Accessibility applications: Generate alt-text for images in multiple languages or integrate with screen readers
- Content management: Automatically describe uploaded images
- E-commerce: Create product descriptions from images
- Social media: Generate captions in different languages
Architecture Overview
The solution uses a serverless architecture built entirely on AWS.
Key AWS Services Used
1. Amazon Bedrock with Nova Lite Model
At the heart of the application is Amazon's Nova Lite model:
- Multimodal understanding: Processes both text prompts and images
- Cost efficiency: Optimized for high-volume applications
- Fast inference: Low latency responses
- Multi-language support: Native understanding of multiple languages
2. AWS Lambda
The serverless compute layer handles:
- Image processing: Base64 decoding and validation
- Bedrock integration: Model invocation and response handling
- Polly integration: Audio generation for accessibility
- Error handling: Comprehensive error responses
3. Amazon Polly
Provides text-to-speech capabilities with:
- Multiple voices per language: Natural-sounding speech
- SSML support: Enhanced audio control
- MP3 output: Compressed audio for web delivery
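Here's a rough sketch of what that Polly call can look like inside the Lambda function. The synthesize_description helper, its default voice, and the base64 return value are illustrative choices on my part, not necessarily how the repository implements it:

import base64

import boto3

polly_client = boto3.client("polly")

def synthesize_description(text, voice_id="Joanna"):
    # Convert the generated description into MP3 audio
    result = polly_client.synthesize_speech(
        Text=text,
        VoiceId=voice_id,
        OutputFormat="mp3",
    )
    # Encode the audio as base64 so it fits inside a JSON response
    return base64.b64encode(result["AudioStream"].read()).decode("utf-8")

Returning base64 keeps the API stateless and matches the audio field shown in the response examples below, at the cost of larger payloads.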
4. API Gateway
Creates a production-ready API with:
- REST endpoints: Clean API interface
- SSL termination: Secure HTTPS connections
- Request/response transformation: Clean JSON interfaces
Technical Implementation
Lambda Function Structure
The Python Lambda function is organized into clear components:
def lambda_handler(event, context):
    # Parse request and validate input
    # Invoke Bedrock Nova Lite model
    # Generate audio with Polly (if requested)
    # Return structured JSON response
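To make the flow concrete, here's a minimal sketch of what that handler can look like. It condenses both endpoints into a single function for brevity; describe_image is a hypothetical placeholder for the Bedrock call shown in the next section, and synthesize_description is the Polly helper sketched earlier:

import base64
import json

# describe_image and synthesize_description are hypothetical helpers standing in
# for the repository's Bedrock and Polly code (see the surrounding sections)

def lambda_handler(event, context):
    try:
        # Parse request and validate input
        body = json.loads(event.get("body") or "{}")
        image_bytes = base64.b64decode(body["image"])
        language = body.get("language", "en")

        # Invoke Bedrock Nova Lite model
        description = describe_image(image_bytes, language)
        result = {"description": description, "format": "text", "language": language}

        # Generate audio with Polly (if requested)
        if body.get("voice"):
            result["audio"] = synthesize_description(description, body["voice"])
            result["format"] = "audio"
            result["voice"] = body["voice"]

        # Return structured JSON response
        return {"statusCode": 200, "body": json.dumps(result)}
    except (KeyError, ValueError) as error:
        return {"statusCode": 400, "body": json.dumps({"error": str(error)})}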
Bedrock Integration
The Nova Lite model is invoked with carefully crafted prompts:
import base64
import json

import boto3

# Bedrock Runtime client used to invoke the Nova Lite model
bedrock_client = boto3.client("bedrock-runtime")

prompt = f"Describe this image in {language_name}. Be descriptive and detailed."

response = bedrock_client.invoke_model(
    modelId="amazon.nova-lite-v1:0",
    body=json.dumps({
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "image": {
                            "format": image_format,
                            "source": {
                                "bytes": base64.b64encode(image_bytes).decode('utf-8')
                            }
                        }
                    },
                    {
                        "text": prompt
                    }
                ]
            }
        ],
        "inferenceConfig": {
            "maxTokens": 300
        }
    })
)
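invoke_model returns the model output as a streaming body. A minimal sketch of pulling the description out, assuming the standard Nova messages response shape rather than the repository's exact parsing code:

# Parse the streaming body and extract the generated description
result = json.loads(response["body"].read())
description = result["output"]["message"]["content"][0]["text"]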
Deployment with AWS SAM
The project uses AWS SAM (Serverless Application Model) for infrastructure as code:
# template.yaml
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31

Resources:
  ImageDescriptionFunction:
    Type: AWS::Serverless::Function
    Properties:
      Runtime: python3.9
      Handler: app.lambda_handler
      Policies:
        - Statement:
            - Effect: Allow
              Action:
                - bedrock:InvokeModel
              Resource: '*'
            - Effect: Allow
              Action:
                - polly:SynthesizeSpeech
              Resource: '*'
Deployment Commands
sam build
sam deploy --guided
API Endpoints
Text Description Endpoint
POST /describe/text
{
  "image": "base64_encoded_image_data",
  "language": "en"
}
Response:
{
  "description": "A detailed description of the image in the requested language",
  "format": "text",
  "language": "en"
}
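As a usage example, a Python client could call the text endpoint like this. The API_URL value is a placeholder for the endpoint URL that SAM prints after deployment, and the requests library is a dependency of this example only, not of the API:

import base64

import requests

# Replace with the endpoint URL printed by `sam deploy`
API_URL = "https://<api-id>.execute-api.<region>.amazonaws.com/Prod"

with open("photo.jpg", "rb") as image_file:
    payload = {
        "image": base64.b64encode(image_file.read()).decode("utf-8"),
        "language": "es",
    }

response = requests.post(f"{API_URL}/describe/text", json=payload, timeout=30)
print(response.json()["description"])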
Audio Description Endpoint
POST /describe/audio
{
  "image": "base64_encoded_image_data",
  "language": "en",
  "voice": "Joanna"
}
Response:
{
  "description": "Text description",
  "audio": "base64_encoded_mp3_data",
  "format": "audio",
  "voice": "Joanna",
  "language": "en"
}
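On the client side, the base64 audio decodes straight to an MP3 file. A short sketch, reusing the hypothetical API_URL and the same base64 payload pattern as the text example:

import base64

import requests

payload = {
    "image": "...",  # base64-encoded image, truncated here for brevity
    "language": "en",
    "voice": "Joanna",
}
result = requests.post(f"{API_URL}/describe/audio", json=payload, timeout=60).json()

# Save the narration so it can be played back or attached to a page
with open("description.mp3", "wb") as audio_file:
    audio_file.write(base64.b64decode(result["audio"]))

print(result["description"])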
Getting Started
- Clone the repository
- Enable Nova Lite model access in the Amazon Bedrock console
- Deploy with SAM:
sam build && sam deploy --guided
- Test the endpoints with your images
Conclusion
This project demonstrates how modern AWS services can be combined to create powerful, cost-effective AI applications. The combination of Nova Lite's multimodal capabilities, Lambda's serverless compute, and Polly's text-to-speech creates a comprehensive solution for image accessibility.
The entire codebase is open-source and production-ready, making it easy for developers to deploy their own instance or extend the functionality for specific use cases.
Repository: https://github.com/mkreder/image-to-speech-api