DEV Community

Ehi Enabs

Image Processing with AWS Textract (Extracting Text from Newspaper Images I)

Amazon Textract is a powerful AWS service that allows users to extract text, handwriting, and structured data from scanned documents, including newspaper images. This guide will walk you through setting up batch processing for extracting text from newspaper images stored in an Amazon S3 bucket using Textract.

Prerequisites

Before you begin, ensure you have the following:

An AWS account with appropriate permissions.

An S3 bucket containing newspaper images.

An IAM role with permissions for Amazon Textract, S3, and AWS Lambda (optional for automation).

The AWS CLI or an AWS SDK (Boto3 for Python) installed and configured with credentials.

Step 1: Upload Newspaper Images to S3

Navigate to the AWS S3 Console.

Create or select an existing bucket.

Upload the newspaper images you want to process.
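
If you would rather script the upload than use the console, the sketch below shows one way to do it with Boto3. The function name `upload_images`, the `prefix` parameter, and the list of accepted file extensions are assumptions made for illustration; only `upload_file` is a Boto3 call.

```python
from pathlib import Path

# File types accepted here are an assumption for illustration.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def upload_images(s3_client, bucket, folder, prefix=""):
    """Upload every image file in `folder` to `bucket`; return the S3 keys."""
    keys = []
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            key = f"{prefix}{path.name}"
            # upload_file(Filename, Bucket, Key) is the standard Boto3 S3 call.
            s3_client.upload_file(str(path), bucket, key)
            keys.append(key)
    return keys


# Usage (needs AWS credentials and the boto3 package):
#   import boto3
#   upload_images(boto3.client("s3"), "your-bucket", "./scans", prefix="newspapers/")
```

Passing the client in as an argument keeps the function easy to try against a stub before pointing it at a real bucket.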

Step 2: Create an IAM Role for Textract

Go to the AWS IAM Console.

Create a new role with the following permissions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "textract:StartDocumentTextDetection",
        "textract:GetDocumentTextDetection",
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "*"
    }
  ]
}

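Attach this policy to your IAM role and note the role's ARN. If you plan to pass the role to Textract through the notification channel in Step 3, the role must also trust textract.amazonaws.com and be allowed to publish (sns:Publish) to your SNS topic.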
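
A policy with "Resource": "*" grants S3 access to every bucket in the account. As a hedged alternative, the sketch below generates a policy document that keeps the same four actions but limits the S3 actions to one bucket. The function name `build_textract_policy` and the decision to scope the S3 statement are assumptions for illustration, not part of the original setup.

```python
import json


def build_textract_policy(bucket):
    """Return an IAM policy document (as a dict) for the Textract workflow."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # The two asynchronous text-detection actions used in this guide.
                "Effect": "Allow",
                "Action": [
                    "textract:StartDocumentTextDetection",
                    "textract:GetDocumentTextDetection",
                ],
                "Resource": "*",
            },
            {
                # Assumption: object access is limited to the one bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


print(json.dumps(build_textract_policy("your-bucket"), indent=2))
```

The printed JSON can be pasted into the IAM console's policy editor.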

Step 3: Start a Textract Batch Processing Job

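Using the AWS CLI, start an asynchronous text detection job. The --notification-channel option is optional; include it only if you have an SNS topic that should be notified when the job finishes: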

aws textract start-document-text-detection \
    --document-location "S3Object={Bucket=<your-bucket>,Name=<image-file>}" \
    --notification-channel "RoleArn=<your-iam-role-arn>,SNSTopicArn=<sns-topic-arn>"

Alternatively, using Boto3 in Python:

import boto3

def start_textract_job(bucket, document):
    textract = boto3.client('textract')
    response = textract.start_document_text_detection(
        DocumentLocation={
            'S3Object': {
                'Bucket': bucket,
                'Name': document
            }
        }
    )
    return response['JobId']

job_id = start_textract_job("your-bucket", "your-image.jpg")
print(f"Job started with ID: {job_id}")
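
The function above starts a single job. To process a whole bucket of newspaper images, one possible approach is to list the objects under a prefix and start one job per image, as sketched below. The names `start_jobs_for_prefix`, `IMAGE_SUFFIXES`, and `notification_channel` are hypothetical; the Boto3 calls used are `get_paginator("list_objects_v2")` and `start_document_text_detection`.

```python
IMAGE_SUFFIXES = (".jpg", ".jpeg", ".png", ".tif", ".tiff", ".pdf")


def start_jobs_for_prefix(s3_client, textract_client, bucket, prefix="",
                          notification_channel=None):
    """Start one text-detection job per image under `prefix`; return {key: JobId}."""
    job_ids = {}
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if not key.lower().endswith(IMAGE_SUFFIXES):
                continue  # skip folder markers and non-image objects
            kwargs = {
                "DocumentLocation": {"S3Object": {"Bucket": bucket, "Name": key}}
            }
            if notification_channel:
                # Expected shape: {"SNSTopicArn": "...", "RoleArn": "..."}
                kwargs["NotificationChannel"] = notification_channel
            response = textract_client.start_document_text_detection(**kwargs)
            job_ids[key] = response["JobId"]
    return job_ids


# Usage (needs AWS credentials and the boto3 package):
#   import boto3
#   jobs = start_jobs_for_prefix(boto3.client("s3"), boto3.client("textract"),
#                                "your-bucket", prefix="newspapers/")
```

Textract applies per-account rate limits, so a large bucket may need throttling or retries around the start call; that handling is left out of this sketch.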

Step 4: Retrieve the Extracted Text

Once the job is completed, retrieve the results:

aws textract get-document-text-detection --job-id <your-job-id>

Or, using Boto3:

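import time

import boto3

def get_textract_results(job_id, poll_seconds=5):
    textract = boto3.client('textract')
    # Poll until the job leaves the IN_PROGRESS state.
    while True:
        response = textract.get_document_text_detection(JobId=job_id)
        status = response['JobStatus']
        if status == 'SUCCEEDED':
            break
        if status != 'IN_PROGRESS':
            # FAILED or PARTIAL_SUCCESS: stop instead of looping forever.
            raise RuntimeError(f"Textract job ended with status {status}")
        time.sleep(poll_seconds)
    # Results are paginated; follow NextToken to collect every block.
    blocks = list(response['Blocks'])
    while 'NextToken' in response:
        response = textract.get_document_text_detection(
            JobId=job_id, NextToken=response['NextToken'])
        blocks.extend(response['Blocks'])
    return blocks

blocks = get_textract_results(job_id)
for block in blocks:
    if block['BlockType'] == 'LINE':
        print(block['Text'])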
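
Scanned newsprint is often faded or smudged, so it can help to drop lines that Textract is unsure about. The sketch below filters LINE blocks by their Confidence value, which Textract reports on a 0 to 100 scale. The function name `extract_lines`, the threshold of 80, and the sample blocks are assumptions for illustration; the sample is hand-written, not real Textract output.

```python
def extract_lines(blocks, min_confidence=0.0):
    """Return the text of LINE blocks whose confidence meets the threshold."""
    lines = []
    for block in blocks:
        if block.get("BlockType") != "LINE":
            continue  # PAGE and WORD blocks are skipped
        if block.get("Confidence", 0.0) >= min_confidence:
            lines.append(block["Text"])
    return lines


# Hand-written sample in the shape of Textract blocks (not real output).
sample_blocks = [
    {"BlockType": "PAGE"},
    {"BlockType": "LINE", "Text": "CITY COUNCIL APPROVES BUDGET", "Confidence": 99.1},
    {"BlockType": "WORD", "Text": "CITY", "Confidence": 99.5},
    {"BlockType": "LINE", "Text": "sm3dg3d ink", "Confidence": 41.7},
]

print("\n".join(extract_lines(sample_blocks, min_confidence=80.0)))
```

A suitable threshold depends on scan quality, so it is worth checking a few pages by eye before settling on a value.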

Step 5: Store and Process Extracted Text

Once you extract the text, you can:

Store it in an S3 bucket.

Process it with AWS Lambda and DynamoDB.

Perform text analysis using Amazon Comprehend.
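
As a rough sketch of the first and last options, the code below writes the extracted lines back to S3 as a text object and passes text to Amazon Comprehend for entity detection. The helper names (`output_key_for`, `store_text`, `detect_entities`) and the `extracted-text/` output prefix are hypothetical; the Boto3 calls used are `put_object` and Comprehend's `detect_entities`.

```python
import posixpath


def output_key_for(image_key, output_prefix="extracted-text/"):
    """Map an image key such as 'news/p1.jpg' to 'extracted-text/news/p1.txt'."""
    stem = posixpath.splitext(image_key)[0]
    return f"{output_prefix}{stem}.txt"


def store_text(s3_client, bucket, image_key, lines):
    """Write the extracted lines to S3 as a UTF-8 text object; return its key."""
    key = output_key_for(image_key)
    body = "\n".join(lines).encode("utf-8")
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=body,
        ContentType="text/plain; charset=utf-8",
    )
    return key


def detect_entities(comprehend_client, text, language_code="en"):
    """Return (entity text, entity type) pairs found by Amazon Comprehend."""
    response = comprehend_client.detect_entities(
        Text=text, LanguageCode=language_code
    )
    return [(entity["Text"], entity["Type"]) for entity in response["Entities"]]


# Usage (needs AWS credentials and the boto3 package):
#   import boto3
#   key = store_text(boto3.client("s3"), "your-bucket", "newspapers/page1.jpg", lines)
#   entities = detect_entities(boto3.client("comprehend"), "\n".join(lines))
```

Comprehend limits how much text a single request can carry, so a full newspaper page may need to be split into smaller pieces before analysis.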

Conclusion

Using Amazon Textract, you can efficiently extract text from newspaper images stored in S3 via batch processing. This enables large-scale document processing, automation, and text analytics in AWS.
