🧠 OCR Any PDF with LangChain, Docker, and AWS Using OCRmyPDF
Many PDFs are just images: scanned contracts, invoices, or reports. They're unreadable by machines and non-searchable by humans.
What if you could automate adding a searchable text layer and then run a language model like GPT to summarize, extract data, or answer questions from them?
Welcome to a powerful workflow using:
✅ OCRmyPDF
✅ LangChain
✅ Docker
✅ AWS (S3, Lambda/ECS)
📄 What Is OCRmyPDF?
ocrmypdf is a command-line tool that adds an OCR layer (invisible, searchable text) to scanned PDFs using Tesseract. It keeps the original visual layout intact while making the text machine-readable.

```bash
ocrmypdf input.pdf output.pdf
```

Use multiple languages:

```bash
ocrmypdf -l eng+fra input.pdf output.pdf
```
🧱 Architecture Overview

User Uploads PDF → S3 Bucket → Docker OCR Service → LangChain Processor → Response (Extracted Data / Summary / Q&A)
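The flow above can be sketched as plain Python, with the OCR and LLM stages stubbed out so the hand-offs are visible. Every function name here is illustrative, not a real API:

```python
# Illustrative sketch of the pipeline stages; the OCR and LLM calls are
# stubbed so only the flow itself is shown. All names are hypothetical.

def ocr_pdf(pdf_bytes: bytes) -> str:
    """Stub for the Docker OCR service (ocrmypdf + text extraction)."""
    return pdf_bytes.decode("utf-8", errors="ignore")  # pretend OCR output

def analyze(text: str, question: str) -> str:
    """Stub for the LangChain / LLM step."""
    return f"Answer to {question!r} based on {len(text)} chars of text"

def pipeline(pdf_bytes: bytes, question: str) -> str:
    text = ocr_pdf(pdf_bytes)       # S3 upload -> OCR service
    return analyze(text, question)  # OCR output -> LangChain processor

print(pipeline(b"scanned invoice text", "What is the total?"))
```

In the real system each arrow in the diagram crosses a service boundary (S3 event, container invocation, LLM API call), but the data flow is exactly this shape.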
🐳 Dockerizing the OCR Service
Here's how to containerize the OCR layer:
🧾 Dockerfile

```dockerfile
FROM python:3.11-slim

# Install Tesseract, Ghostscript, and language packs, then OCRmyPDF itself
RUN apt-get update && apt-get install -y --no-install-recommends \
        tesseract-ocr \
        ghostscript \
        libtesseract-dev \
        tesseract-ocr-eng \
        tesseract-ocr-fra \
    && pip install --no-cache-dir ocrmypdf \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY ocr_service.py .
ENTRYPOINT ["python", "ocr_service.py"]
```
🧾 ocr_service.py

```python
import sys

import ocrmypdf

# Usage: python ocr_service.py <input.pdf> <output.pdf>
input_path = sys.argv[1]
output_path = sys.argv[2]

# skip_text=True leaves pages that already contain a text layer untouched
ocrmypdf.ocr(input_path, output_path, language="eng+fra", skip_text=True)
```
Build and run locally:

```bash
docker build -t ocr-service .
docker run -v "$(pwd)":/data ocr-service /data/input.pdf /data/output.pdf
```
🧠 Using LangChain for Text Analysis
After OCR is done, you can feed the PDF into LangChain and perform Q&A, summarization, or structured data extraction.

```python
from langchain.document_loaders import PyPDFLoader
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI

# Assumes OPENAI_API_KEY is set in the environment
loader = PyPDFLoader("output.pdf")
pages = loader.load()

chain = load_qa_chain(OpenAI(), chain_type="stuff")
response = chain.run(input_documents=pages, question="What is the document about?")
print(response)
```
LangChain lets you chain OCR → LLM → Output via APIs or a UI.
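One caveat: the "stuff" chain pastes every page into a single prompt, so large PDFs can blow past the model's context window. Here's a framework-free sketch of the usual fix, splitting text into overlapping chunks (sizes are illustrative; LangChain's own text splitters do this more carefully):

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into fixed-size chunks with a small overlap, so a sentence
    cut at one boundary still appears intact at the start of the next chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

# Each chunk is then summarized or queried separately, e.g. with a
# "map_reduce" chain instead of "stuff".
parts = chunk_text("x" * 2500, chunk_size=1000, overlap=100)
print(len(parts))  # 3
```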
☁️ Deploying on AWS
Option 1: ECS Fargate
Push the Docker image to ECR.
Use Lambda to trigger a Fargate task on new S3 uploads.
Upload the OCR result back to S3.
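The trigger Lambda in this option just needs to call `ecs.run_task` with the uploaded object's location. A sketch of building those parameters, where the cluster, task definition, subnet, and container names are all placeholders for your own resources:

```python
def build_run_task_params(bucket: str, key: str) -> dict:
    """Build kwargs for boto3's ecs.run_task; all names/IDs are placeholders."""
    return {
        "cluster": "ocr-cluster",        # placeholder cluster name
        "taskDefinition": "ocr-task",    # placeholder task definition
        "launchType": "FARGATE",
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": ["subnet-0123456789abcdef0"],  # placeholder subnet
                "assignPublicIp": "ENABLED",
            }
        },
        # Pass the S3 location to the container as environment variables
        "overrides": {
            "containerOverrides": [
                {
                    "name": "ocr-service",  # must match the container name in the task def
                    "environment": [
                        {"name": "INPUT_BUCKET", "value": bucket},
                        {"name": "INPUT_KEY", "value": key},
                    ],
                }
            ]
        },
    }

# In the Lambda handler:
#   boto3.client("ecs").run_task(**build_run_task_params(bucket, key))
```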
Option 2: Lambda + S3
Use Lambda for lightweight OCR jobs (under the 15-minute runtime limit).
Sample Lambda Code:

```python
import subprocess

import boto3

def handler(event, context):
    s3 = boto3.client("s3")
    bucket = event["Records"][0]["s3"]["bucket"]["name"]
    key = event["Records"][0]["s3"]["object"]["key"]

    # /tmp is the only writable path inside Lambda
    input_path = f"/tmp/{key}"
    output_path = f"/tmp/ocr_{key}"

    s3.download_file(bucket, key, input_path)
    subprocess.run(
        ["ocrmypdf", "-l", "eng+fra", input_path, output_path],
        check=True,  # fail the invocation if OCR fails
    )
    s3.upload_file(output_path, "ocr-output-bucket", f"ocr_{key}")
```
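Two gotchas with a handler like this: S3 event keys arrive URL-encoded (spaces become `+`), and keys containing `/` would produce invalid `/tmp` paths. A small stdlib helper that normalizes both before building the local path:

```python
import os.path
from urllib.parse import unquote_plus

def safe_local_name(raw_key: str) -> str:
    """Decode an S3 event key and flatten it to a bare file name for /tmp."""
    decoded = unquote_plus(raw_key)   # "my+file.pdf" -> "my file.pdf"
    return os.path.basename(decoded)  # "uploads/my file.pdf" -> "my file.pdf"

print(safe_local_name("uploads/my+scanned+file.pdf"))  # my scanned file.pdf
```

Inside the handler you'd then use `f"/tmp/{safe_local_name(key)}"` instead of interpolating the raw key.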
🔐 Security & 💰 Cost Tips
Use IAM roles with least privilege.
Set lifecycle rules on S3 buckets to auto-delete temp files.
Use Lambda for lightweight OCR, ECS for heavier tasks.
Monitor LangChain + LLM token usage if using OpenAI.
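On that last tip, a common rule of thumb for English text is roughly four characters per token. A back-of-the-envelope estimator (the ratio is an approximation, not OpenAI's actual tokenizer; use tiktoken for exact counts):

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate based on a chars-per-token heuristic."""
    return int(len(text) / chars_per_token)

# A 50-page scanned contract at ~2,000 characters per page:
pages, chars_per_page = 50, 2000
print(estimate_tokens("x" * (pages * chars_per_page)))  # 25000
```

That's enough to sanity-check whether a document fits in one prompt or needs chunking before you spend real tokens on it.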
✅ Final Thoughts
You now have a production-ready OCR pipeline powered by:
🧠 ocrmypdf for PDF text layers
⚙️ Docker for repeatable environments
🤖 LangChain for LLM magic
☁️ AWS for scale
This setup lets you convert unsearchable PDFs into structured insights. Automate document workflows, extract legal data, read scanned invoices: the possibilities are huge.