Introduction
So, you've got a massive archive of newspaper images sitting in an S3 bucket, and you need to extract the text, structure it into articles, and store the results for further analysis. Sounds like a headache, right? Well, AWS Bedrock (plus a couple of supporting services) makes it surprisingly manageable.
Here's how.
The Big Picture
You want to automate the processing of text that has already been extracted from newspaper images. Here are the AWS services you'll need:
- S3 (to store raw text and processed text)
- DynamoDB (to keep track of which files have been processed)
- AWS Bedrock (to clean up and format extracted text into structured news articles)
Think of it as a conveyor belt: raw text goes in, structured articles come out, and the system remembers what's already been processed so it doesn't do the same work twice.
Step 1: Setting Up Our Tools
Before we get into processing the files, we initialize some AWS clients and set up logging:
import boto3
import json
import logging
s3 = boto3.client('s3')
dynamodb = boto3.resource('dynamodb')
bedrock = boto3.client('bedrock-runtime')
- S3 reads and writes files in our bucket.
- DynamoDB stores and retrieves checkpoints (so we don't reprocess the same files).
- Bedrock is our AI workhorse, turning raw text into structured articles.
Logging is also set up to capture what’s happening:
logger = logging.getLogger()
logger.setLevel(logging.INFO)
Logging helps debug when things go wrong.
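One more setup note: the snippets in the rest of the post reference a couple of module-level constants that aren't defined above. Something like this is assumed (the bucket and table names below are placeholders for your own resources):
# Placeholders -- point these at your own resources
INPUT_BUCKET = "my-newspaper-archive"                        # bucket holding the extracted-text JSON files
CHECKPOINT_TABLE_NAME = "newspaper-processing-checkpoints"   # DynamoDB table used for checkpoints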
Step 2: Keeping Track of Processed Files
To avoid processing the same file multiple times, we store a “checkpoint” in DynamoDB. This is done through a CheckpointManager class:
class CheckpointManager:
    def __init__(self, table_name=CHECKPOINT_TABLE_NAME):
        # All progress markers live in a single DynamoDB table
        self.table = dynamodb.Table(table_name)
This class has two key methods:
get_last_processed_key() - Retrieves the last processed file so we can continue from there.
update_checkpoint() - Updates the checkpoint once a file is processed.
If something crashes mid-run, the script will resume from where it left off, instead of reprocessing everything from scratch. Neat, right?
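The method bodies aren't shown above, but continuing the class, a minimal sketch might look like this (the single-item layout keyed on a fixed id attribute is my assumption, not necessarily the author's schema):
    def get_last_processed_key(self):
        # Fetch the single checkpoint item; returns None on the very first run
        response = self.table.get_item(Key={'id': 'checkpoint'})
        return response.get('Item', {}).get('last_processed_key')

    def update_checkpoint(self, key):
        # Record the most recently processed S3 key
        self.table.put_item(Item={'id': 'checkpoint', 'last_processed_key': key})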
Step 3: Processing a Single File
The NewsProcessor class is where the magic happens. It does three things:
- Reads the extracted text from an S3 JSON file.
- Sends the text to AWS Bedrock for structuring.
- Saves the structured output back to S3.
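Put together, the class might be organized roughly like this. The helper method names are my own, used only to illustrate the flow; the real pieces are shown in the snippets below:
class NewsProcessor:
    def __init__(self, checkpoint_manager):
        self.checkpoint_manager = checkpoint_manager

    def process_single_file(self, key):
        # 1. Read the extracted text from S3 (hypothetical helper)
        extracted_text, source_image = self._read_input(key)
        # 2. Ask Bedrock to structure it into articles (hypothetical helper)
        processed_articles = self._structure_text(extracted_text)
        # 3. Write the structured output back to S3 (hypothetical helper)
        output_key = self._save_output(key, source_image, processed_articles)
        self.checkpoint_manager.update_checkpoint(key)
        return {'success': True, 'input_key': key, 'output_key': output_key}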
Here’s how we fetch the text from S3:
response = s3.get_object(Bucket=INPUT_BUCKET, Key=key)
input_data = json.loads(response['Body'].read().decode('utf-8'))
extracted_text = input_data.get('extracted_text', [])
source_image = input_data.get('source_image', '')
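The prompt variable used in the next snippet isn't shown above; something along these lines is a reasonable sketch (the exact wording and requested output format are up to you):
# Hypothetical prompt: ask the model for a JSON array of articles
prompt = (
    "Below is raw OCR text extracted from a scanned newspaper page. "
    "Identify the individual news articles and return them as a JSON array, "
    "where each article has 'headline' and 'body' fields.\n\n"
    + "\n".join(extracted_text)
)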
We then send the text to AWS Bedrock, using Claude (a large language model from Anthropic):
body = json.dumps({
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 4096,
    "messages": [
        {"role": "user", "content": prompt}
    ],
    "temperature": 0,
    "system": "You are a professional newspaper editor who excels at identifying and structuring news articles."
})

response = bedrock.invoke_model(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",
    body=body
)
AWS Bedrock takes the raw text and turns it into well-structured news articles. If the model returns malformed output (which LLMs sometimes do), we handle it gracefully, as sketched below.
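Parsing the response isn't shown above. With the Claude messages API on Bedrock, the body comes back as a streaming JSON payload, so a sketch like this works (the fallback behavior is my own choice):
# Decode the streaming response body and pull out the model's text
response_body = json.loads(response['body'].read())
model_output = response_body['content'][0]['text']

try:
    processed_articles = json.loads(model_output)
except json.JSONDecodeError:
    # The model occasionally wraps its JSON in extra prose; log it and move on
    logger.warning(f"Bedrock returned non-JSON output for {key}")
    processed_articles = []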
Step 4: Saving the Output
Once Bedrock does its thing, we save the structured articles back to S3:
s3.put_object(
    Bucket=INPUT_BUCKET,
    Key=output_key,
    Body=json.dumps({
        'metadata': {
            'source_file': key,
            'source_image': source_image,
            'articles_processed': len(processed_articles)
        },
        'articles': processed_articles
    }, indent=2),
    ContentType='application/json'
)
We also update the checkpoint in DynamoDB so we don’t process this file again.
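For completeness: output_key isn't defined in the snippets above, and the checkpoint update is a single call. Both lines below are my own sketch, not the author's exact code (in the real class, output_key would be set before the put_object call):
# Hypothetical naming convention: mirror the input file name under a "structured/" prefix
output_key = f"structured/{key.rsplit('/', 1)[-1]}"

# Mark this file as done so the next run skips it (assuming the class holds a CheckpointManager)
self.checkpoint_manager.update_checkpoint(key)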
Step 5: The Lambda Function
To keep costs down, the whole thing runs inside an AWS Lambda function, which means it needs to:
- Find new files to process – it lists JSON files in S3 and filters out ones that have already been processed (see the sketch after this list).
- Process them one by one – it loops through and calls process_single_file().
- Handle timeouts – if Lambda is running out of time, it stops cleanly before getting killed.
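The file-listing step isn't shown above; a sketch using the checkpoint as the starting point could look like this (the use of StartAfter and the .json filter are assumptions about the bucket layout):
checkpoint_manager = CheckpointManager()
last_key = checkpoint_manager.get_last_processed_key()

# Resume listing after the last processed key, if there is one
paginate_kwargs = {'Bucket': INPUT_BUCKET}
if last_key:
    paginate_kwargs['StartAfter'] = last_key

files_to_process = []
for page in s3.get_paginator('list_objects_v2').paginate(**paginate_kwargs):
    for obj in page.get('Contents', []):
        if obj['Key'].endswith('.json'):
            files_to_process.append(obj['Key'])

# Assuming NewsProcessor takes the checkpoint manager (see the sketch in Step 3)
processor = NewsProcessor(checkpoint_manager)
results = []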
The full loop looks like this:
for key in files_to_process:
    try:
        result = processor.process_single_file(key)
        results.append(result)

        # Stop with about a minute to spare rather than being killed mid-file
        if context.get_remaining_time_in_millis() < 60000:
            logger.info("Approaching Lambda timeout, stopping processing")
            break
    except Exception as e:
        logger.error(f"Error processing file {key}: {e}")
        results.append({
            'success': False,
            'input_key': key,
            'error': str(e)
        })
When it finishes processing, it saves a summary file in S3 with details of which files were processed successfully and which ones failed.
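Writing that summary can be as simple as another put_object call (the key naming scheme here is my own):
import time

summary_key = f"summaries/run-{int(time.time())}.json"   # hypothetical naming scheme
s3.put_object(
    Bucket=INPUT_BUCKET,
    Key=summary_key,
    Body=json.dumps({
        'files_attempted': len(results),
        'succeeded': sum(1 for r in results if r.get('success')),
        'failed': sum(1 for r in results if not r.get('success')),
        'results': results
    }, indent=2),
    ContentType='application/json'
)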
Wrapping It Up
In short, this pipeline takes the text extracted from thousands of newspaper images stored in an S3 bucket, organizes it into articles using AWS Bedrock, and stores the structured output back in S3.
If you’re dealing with large volumes of scanned newspapers, this kind of automation can save you weeks of manual work.