Introduction
So, you've got a massive archive of newspaper images sitting in an S3 bucket, and you need to extract the text, structure it into articles, and store the results for further analysis. Sounds like a headache, right? Well, AWS Bedrock (plus a couple of supporting services) makes it surprisingly manageable.
Here's how.
The Big Picture
You want to automate the processing of text that has already been extracted from newspaper images. Here are the AWS services you'll need:
- S3 (to store raw text and processed text)
- DynamoDB (to keep track of which files have been processed)
- AWS Bedrock (to clean up and format extracted text into structured news articles)
Think of it as a conveyor belt: raw text goes in, structured articles come out, and the system remembers what's already been processed so it doesn't do the same work twice.
Step 1: Setting Up Our Tools
Before we get into processing the files, we initialize some AWS clients and set up logging:
import boto3
import json
import logging
s3 = boto3.client('s3')
dynamodb = boto3.resource('dynamodb')
bedrock = boto3.client('bedrock-runtime')
- S3 reads and writes files in our bucket.
- DynamoDB stores and retrieves checkpoints (so we don't reprocess the same files).
- Bedrock is our AI workhorse, turning raw text into structured articles.
Logging is also set up to capture what’s happening:
logger = logging.getLogger()
logger.setLevel(logging.INFO)
Logging helps debug when things go wrong.
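One more setup note: the snippets in the rest of the post reference a couple of module-level constants that aren't defined above. Something like this is assumed (the bucket and table names below are placeholders for your own resources):
# Placeholders -- point these at your own resources
INPUT_BUCKET = "my-newspaper-archive"                        # bucket holding the extracted-text JSON files
CHECKPOINT_TABLE_NAME = "newspaper-processing-checkpoints"   # DynamoDB table used for checkpoints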
Step 2: Keeping Track of Processed Files
To avoid processing the same file multiple times, we store a “checkpoint” in DynamoDB. This is done through a CheckpointManager class:
class CheckpointManager:
    def __init__(self, table_name=CHECKPOINT_TABLE_NAME):
        # All progress markers live in a single DynamoDB table
        self.table = dynamodb.Table(table_name)
This class has two key methods:
get_last_processed_key() - Retrieves the last processed file so we can continue from there.
update_checkpoint() - Updates the checkpoint once a file is processed.
If something crashes mid-run, the script will resume from where it left off, instead of reprocessing everything from scratch. Neat, right?
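The method bodies aren't shown above, but continuing the class, a minimal sketch might look like this (the single-item layout keyed on a fixed id attribute is my assumption, not necessarily the author's schema):
    def get_last_processed_key(self):
        # Fetch the single checkpoint item; returns None on the very first run
        response = self.table.get_item(Key={'id': 'checkpoint'})
        return response.get('Item', {}).get('last_processed_key')

    def update_checkpoint(self, key):
        # Record the most recently processed S3 key
        self.table.put_item(Item={'id': 'checkpoint', 'last_processed_key': key})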
Step 3: Processing a Single File
The NewsProcessor class is where the magic happens. It does three things:
- Reads the extracted text from an S3 JSON file.
- Sends the text to AWS Bedrock for structuring.
- Saves the structured output back to S3.
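Put together, the class might be organized roughly like this. The helper method names are my own, used only to illustrate the flow; the real pieces are shown in the snippets below:
class NewsProcessor:
    def __init__(self, checkpoint_manager):
        self.checkpoint_manager = checkpoint_manager

    def process_single_file(self, key):
        # 1. Read the extracted text from S3 (hypothetical helper)
        extracted_text, source_image = self._read_input(key)
        # 2. Ask Bedrock to structure it into articles (hypothetical helper)
        processed_articles = self._structure_text(extracted_text)
        # 3. Write the structured output back to S3 (hypothetical helper)
        output_key = self._save_output(key, source_image, processed_articles)
        self.checkpoint_manager.update_checkpoint(key)
        return {'success': True, 'input_key': key, 'output_key': output_key}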
Here’s how we fetch the text from S3:
response = s3.get_object(Bucket=INPUT_BUCKET, Key=key)
input_data = json.loads(response['Body'].read().decode('utf-8'))
extracted_text = input_data.get('extracted_text', [])
source_image = input_data.get('source_image', '')
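The prompt variable used in the next snippet isn't shown above; something along these lines is a reasonable sketch (the exact wording and requested output format are up to you):
# Hypothetical prompt: ask the model for a JSON array of articles
prompt = (
    "Below is raw OCR text extracted from a scanned newspaper page. "
    "Identify the individual news articles and return them as a JSON array, "
    "where each article has 'headline' and 'body' fields.\n\n"
    + "\n".join(extracted_text)
)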
We then send the text to AWS Bedrock, using Claude (a large language model from Anthropic):
body = json.dumps({
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 4096,
    "messages": [
        {"role": "user", "content": prompt}
    ],
    "temperature": 0,
    "system": "You are a professional newspaper editor who excels at identifying and structuring news articles."
})

response = bedrock.invoke_model(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",
    body=body
)
AWS Bedrock takes the raw text and turns it into well-structured news articles. If the model returns malformed output (which LLMs sometimes do), we handle it gracefully, as sketched below.
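Parsing the response isn't shown above. With the Claude messages API on Bedrock, the body comes back as a streaming JSON payload, so a sketch like this works (the fallback behavior is my own choice):
# Decode the streaming response body and pull out the model's text
response_body = json.loads(response['body'].read())
model_output = response_body['content'][0]['text']

try:
    processed_articles = json.loads(model_output)
except json.JSONDecodeError:
    # The model occasionally wraps its JSON in extra prose; log it and move on
    logger.warning(f"Bedrock returned non-JSON output for {key}")
    processed_articles = []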
Step 4: Saving the Output
Once Bedrock does its thing, we save the structured articles back to S3:
s3.put_object(
    Bucket=INPUT_BUCKET,
    Key=output_key,
    Body=json.dumps({
        'metadata': {
            'source_file': key,
            'source_image': source_image,
            'articles_processed': len(processed_articles)
        },
        'articles': processed_articles
    }, indent=2),
    ContentType='application/json'
)
We also update the checkpoint in DynamoDB so we don’t process this file again.
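For completeness: output_key isn't defined in the snippets above, and the checkpoint update is a single call. Both lines below are my own sketch, not the author's exact code (in the real class, output_key would be set before the put_object call):
# Hypothetical naming convention: mirror the input file name under a "structured/" prefix
output_key = f"structured/{key.rsplit('/', 1)[-1]}"

# Mark this file as done so the next run skips it (assuming the class holds a CheckpointManager)
self.checkpoint_manager.update_checkpoint(key)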
Step 5: The Lambda Function
To keep costs down, the whole thing runs inside an AWS Lambda function, which means it needs to:
- Find new files to process – it lists JSON files in S3 and filters out ones that have already been processed (see the sketch after this list).
- Process them one by one – it loops through and calls process_single_file().
- Handle timeouts – if Lambda is running out of time, it stops cleanly before getting killed.
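The file-listing step isn't shown above; a sketch using the checkpoint as the starting point could look like this (the use of StartAfter and the .json filter are assumptions about the bucket layout):
checkpoint_manager = CheckpointManager()
last_key = checkpoint_manager.get_last_processed_key()

# Resume listing after the last processed key, if there is one
paginate_kwargs = {'Bucket': INPUT_BUCKET}
if last_key:
    paginate_kwargs['StartAfter'] = last_key

files_to_process = []
for page in s3.get_paginator('list_objects_v2').paginate(**paginate_kwargs):
    for obj in page.get('Contents', []):
        if obj['Key'].endswith('.json'):
            files_to_process.append(obj['Key'])

# Assuming NewsProcessor takes the checkpoint manager (see the sketch in Step 3)
processor = NewsProcessor(checkpoint_manager)
results = []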
The full loop looks like this:
for key in files_to_process:
    try:
        result = processor.process_single_file(key)
        results.append(result)

        # Stop with about a minute to spare rather than being killed mid-file
        if context.get_remaining_time_in_millis() < 60000:
            logger.info("Approaching Lambda timeout, stopping processing")
            break
    except Exception as e:
        logger.error(f"Error processing file {key}: {e}")
        results.append({
            'success': False,
            'input_key': key,
            'error': str(e)
        })
When it finishes processing, it saves a summary file in S3 with details of which files were processed successfully and which ones failed.
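Writing that summary can be as simple as another put_object call (the key naming scheme here is my own):
import time

summary_key = f"summaries/run-{int(time.time())}.json"   # hypothetical naming scheme
s3.put_object(
    Bucket=INPUT_BUCKET,
    Key=summary_key,
    Body=json.dumps({
        'files_attempted': len(results),
        'succeeded': sum(1 for r in results if r.get('success')),
        'failed': sum(1 for r in results if not r.get('success')),
        'results': results
    }, indent=2),
    ContentType='application/json'
)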
Wrapping It Up
In short, this pipeline takes the text extracted from thousands of newspaper images stored in an S3 bucket, organizes it into articles using AWS Bedrock, and stores the structured output back in S3.
If you’re dealing with large volumes of scanned newspapers, this kind of automation can save you weeks of manual work.