songthamtung

How to Use AWS Textract with S3

This article demonstrates how to use AWS Textract to extract text from scanned documents in an S3 bucket.

This goes beyond Amazon’s documentation, whose examples involve only a single image. This post includes a sample code snippet that uses Boto3, the AWS SDK for Python, to help you get started quickly.
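
Moving from one image to a whole bucket raises a practical question: which objects are worth sending to Textract at all? A bucket can also hold folder placeholders, text files, and other objects that are not scanned documents. The sketch below filters keys by file extension before any API call is made. The extension list and the function names are assumptions made for illustration, so check them against the formats your Textract operation accepts.

```python
from pathlib import PurePosixPath

# Assumed set of extensions for scanned documents; adjust it to match the
# formats accepted by the Textract operation you call.
SUPPORTED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".pdf", ".tif", ".tiff"}


def is_candidate_document(key):
    """Return True if the S3 key has an extension listed in SUPPORTED_EXTENSIONS."""
    return PurePosixPath(key).suffix.lower() in SUPPORTED_EXTENSIONS


def candidate_keys(keys):
    """Keep only the keys that look like documents, preserving their order."""
    return [key for key in keys if is_candidate_document(key)]
```

In the script later in this post, the same check could be applied to s3_file.key inside the loop, skipping any object that does not pass it.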

Definitions

  • Textract is a service that automatically extracts text and data from scanned documents.
  • Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance.
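
To make the Textract definition more concrete: a text detection response contains a list of blocks, each tagged with a BlockType such as PAGE, LINE, or WORD, and the LINE and WORD blocks carry the detected text and a confidence score. The sketch below pulls out only the LINE blocks. The sample data is hand-written for illustration and is not real Textract output, and the function name is hypothetical.

```python
def lines_with_confidence(blocks):
    """Return (text, confidence) pairs for every LINE block, in order."""
    return [
        (block["Text"], block["Confidence"])
        for block in blocks
        if block.get("BlockType") == "LINE"
    ]


# Hand-written sample shaped like a Textract 'Blocks' list (not real output).
sample_blocks = [
    {"BlockType": "PAGE"},
    {"BlockType": "LINE", "Text": "Order #1234", "Confidence": 99.1},
    {"BlockType": "WORD", "Text": "Order", "Confidence": 99.3},
    {"BlockType": "WORD", "Text": "#1234", "Confidence": 98.9},
]

for text, confidence in lines_with_confidence(sample_blocks):
    print(f"{text} ({confidence:.2f}%)")
```

Because the same text appears once in a LINE block and again in its WORD blocks, filtering on one block type avoids counting it twice.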

Code

#!/usr/bin/env python3
# Detects text in every document stored in an S3 bucket.
from time import sleep

import boto3
import pandas as pd

if __name__ == "__main__":
    bucket = 'your_bucket_name'
    # Placeholders only: never commit real keys. boto3 can also read
    # credentials from the environment or from an IAM role.
    ACCESS_KEY = 'your_access_key'
    SECRET_KEY = 'your_secret_key'

    client = boto3.client('textract',
                          region_name='your_region',
                          aws_access_key_id=ACCESS_KEY,
                          aws_secret_access_key=SECRET_KEY)
    s3 = boto3.resource('s3',
                        aws_access_key_id=ACCESS_KEY,
                        aws_secret_access_key=SECRET_KEY)
    your_bucket = s3.Bucket(bucket)
    extracted_data = []

    for s3_file in your_bucket.objects.all():
        print(s3_file)
        # Use Textract to process the S3 file
        response = client.detect_document_text(
            Document={'S3Object': {'Bucket': bucket, 'Name': s3_file.key}})
        blocks = response['Blocks']
        for block in blocks:
            # PAGE blocks carry no 'Text'; LINE and WORD blocks do
            if block['BlockType'] != 'PAGE':
                confidence = "{:.2f}".format(block['Confidence']) + "%"
                print('Detected: ' + block['Text'])
                print('Confidence: ' + confidence)
                # Example case where you want to extract words containing '#'
                if "#" in block['Text']:
                    for word in block['Text'].split():
                        if "#" in word:
                            extracted_data.append({"word": word, "file": s3_file.key, "confidence": confidence})
        # Sleep 2 seconds to prevent ProvisionedThroughputExceededException
        sleep(2)

    # Collect the matches, drop exact duplicates, and write them to a CSV file
    df = pd.DataFrame(extracted_data)
    df = df.drop_duplicates()
    df.to_csv('output.csv')
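
The script pauses for a fixed 2 seconds after each file to stay under Textract's throughput limits. One alternative, sketched here as an assumption and not as part of the original script, is to retry only when a throttling error actually occurs, waiting longer after each failure. The helper names are hypothetical. The sketch relies on boto3's ClientError exposing the AWS error code at response['Error']['Code'], and it duck-types that attribute so the block runs without boto3 installed.

```python
import time

# Error codes treated as "slow down and try again" in this sketch.
THROTTLE_CODES = {"ProvisionedThroughputExceededException", "ThrottlingException"}


def error_code(exc):
    """Return the AWS error code carried by a ClientError-like exception, or None."""
    response = getattr(exc, "response", None)
    if isinstance(response, dict):
        return response.get("Error", {}).get("Code")
    return None


def call_with_backoff(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying on throttling errors with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception as exc:
            is_last_attempt = attempt == max_attempts - 1
            # Re-raise anything that is not throttling, or when retries run out.
            if error_code(exc) not in THROTTLE_CODES or is_last_attempt:
                raise
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ...
```

In the script above, the Textract call could then be wrapped as call_with_backoff(lambda: client.detect_document_text(Document=...)), with the fixed sleep(2) removed or kept as a floor between files.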

Closing

Textract is an amazing OCR (optical character recognition) tool. It can save your team many hours of work by automating the tedious and error-prone task of manual data entry.

Thanks for reading! Originally posted on Hacker Noon.
