🧠 OCR Any PDF with LangChain, Docker, and AWS Using OCRmyPDF
Many PDFs are just images — scanned contracts, invoices, or reports. They're unreadable by machines and non-searchable by humans.
What if you could automate adding a searchable text layer and then run a language model like GPT to summarize, extract data, or answer questions from them?
Welcome to a powerful workflow using:
✅ OCRmyPDF
✅ LangChain
✅ Docker
✅ AWS (S3, Lambda/ECS)
🔍 What Is OCRmyPDF?
`ocrmypdf` is a command-line tool that adds an OCR layer (an invisible, searchable text layer) to scanned PDFs using Tesseract. It keeps the original visual layout intact while making the text machine-readable.
```bash
ocrmypdf input.pdf output.pdf
```
Use multiple languages:
```bash
ocrmypdf -l eng+fra input.pdf output.pdf
```
🧱 Architecture Overview
User uploads PDF → S3 bucket → Docker OCR service → LangChain processor → Response (extracted data / summary / Q&A)
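Each stage hands objects to the next by S3 key, so it helps to fix the naming convention up front. A small sketch of such a helper (the `ocr/` prefix is just an illustrative convention, not something the tools require):

```python
def ocr_output_key(input_key):
    """Map an uploaded PDF's S3 key to the key its OCRed copy will use."""
    if not input_key.lower().endswith(".pdf"):
        raise ValueError(f"not a PDF key: {input_key}")
    filename = input_key.rsplit("/", 1)[-1]  # drop any folder prefix
    return f"ocr/{filename}"
```

Keeping the convention in one function means the upload trigger, the OCR container, and the LangChain consumer can't drift apart on naming.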
🐳 Dockerizing the OCR Service
Here's how to containerize the OCR layer:
🧾 Dockerfile
```dockerfile
FROM python:3.11-slim

# Tesseract engine, language packs, and Ghostscript (required by OCRmyPDF)
RUN apt-get update && apt-get install -y \
        tesseract-ocr \
        tesseract-ocr-eng \
        tesseract-ocr-fra \
        libtesseract-dev \
        ghostscript \
    && pip install ocrmypdf \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY ocr_service.py .
ENTRYPOINT ["python", "ocr_service.py"]
```
🧾 ocr_service.py
```python
import sys

import ocrmypdf

if len(sys.argv) != 3:
    sys.exit("usage: ocr_service.py <input.pdf> <output.pdf>")

input_path, output_path = sys.argv[1], sys.argv[2]

# skip_text=True leaves pages that already contain a text layer untouched
ocrmypdf.ocr(input_path, output_path, language="eng+fra", skip_text=True)
```
Build and run locally:
```bash
docker build -t ocr-service .
docker run -v "$(pwd)":/data ocr-service /data/input.pdf /data/output.pdf
```
🧠 Using LangChain for Text Analysis
After OCR is done, you can feed the PDF into LangChain and perform QA, summarization, or structured data extraction.
```python
from langchain.document_loaders import PyPDFLoader
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI

# Load the OCRed PDF; each page becomes a Document
loader = PyPDFLoader("output.pdf")
pages = loader.load()

# "stuff" packs all pages into a single prompt — fine for short documents
chain = load_qa_chain(OpenAI(), chain_type="stuff")
response = chain.run(input_documents=pages, question="What is the document about?")
print(response)
```
LangChain lets you chain OCR → LLM → Output via APIs or a UI.
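Before sending pages into a chain, it can pay to drop pages where OCR produced little or no text (blank or purely graphical pages), since they add token cost without signal. A minimal sketch — the helper name and threshold are mine, not part of LangChain:

```python
def pages_with_text(page_texts, min_chars=20):
    """Keep only pages whose extracted text is long enough to be meaningful."""
    return [t for t in page_texts if len(t.strip()) >= min_chars]

# With LangChain documents, you would filter on each page's `page_content`
# before passing the surviving pages to the QA chain.
```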
☁️ Deploying on AWS
Option 1: ECS Fargate
- Push the Docker image to ECR.
- Use a Lambda function to trigger a Fargate task on new S3 uploads.
- Upload the OCR result back to S3.
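Triggering the Fargate task from Lambda comes down to a single `run_task` call. The sketch below builds the parameters; the cluster, task definition, subnet IDs, and container name are placeholders you would replace with your own resources:

```python
def fargate_ocr_params(bucket, key, subnets):
    """Build ecs.run_task parameters; resource names below are placeholders."""
    return {
        "cluster": "ocr-cluster",
        "launchType": "FARGATE",
        "taskDefinition": "ocr-task",
        "networkConfiguration": {
            "awsvpcConfiguration": {"subnets": subnets, "assignPublicIp": "ENABLED"}
        },
        "overrides": {
            "containerOverrides": [{
                "name": "ocr-service",
                "environment": [
                    {"name": "INPUT_BUCKET", "value": bucket},
                    {"name": "INPUT_KEY", "value": key},
                ],
            }]
        },
    }

params = fargate_ocr_params("uploads-bucket", "scan.pdf", ["subnet-123"])

# Inside the triggering Lambda you would then call:
# import boto3
# boto3.client("ecs").run_task(**params)
```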
Option 2: Lambda + S3
- Use Lambda itself for lightweight OCR jobs (Lambda's maximum runtime is 15 minutes).
Sample Lambda Code:
```python
import os
import subprocess

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    # Keys may contain prefixes (e.g. "uploads/file.pdf"); /tmp paths must be flat
    filename = os.path.basename(key)
    input_path = f"/tmp/{filename}"
    output_path = f"/tmp/ocr_{filename}"

    s3.download_file(bucket, key, input_path)
    subprocess.run(
        ["ocrmypdf", "-l", "eng+fra", input_path, output_path],
        check=True,  # fail loudly if OCR errors
    )
    s3.upload_file(output_path, "ocr-output-bucket", f"ocr_{filename}")
```
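One subtlety worth knowing: S3 URL-encodes object keys in event payloads (spaces arrive as `+`), so decode keys before using them as paths. The event parsing can also live in a small helper that is easy to unit-test with a hand-built event:

```python
from urllib.parse import unquote_plus

def parse_s3_event(event):
    """Extract (bucket, decoded key) from a standard S3 put-event payload."""
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = unquote_plus(record["object"]["key"])  # undo S3's URL-encoding
    return bucket, key

# Hand-built sample event in the shape S3 notifications use
event = {"Records": [{"s3": {
    "bucket": {"name": "uploads-bucket"},
    "object": {"key": "my+scanned+contract.pdf"},
}}]}
```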
🔐 Security & 💰 Cost Tips
- Use IAM roles with least privilege.
- Set lifecycle rules on S3 buckets to auto-delete temp files.
- Use Lambda for lightweight OCR, ECS for heavier tasks.
- Monitor LangChain + LLM token usage if using OpenAI.
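The lifecycle tip above is a one-time boto3 call. The sketch below builds a rule that expires objects under a `tmp/` prefix after one day; the prefix and bucket name are placeholders for your own layout:

```python
def tmp_cleanup_rule(prefix="tmp/", days=1):
    """S3 lifecycle rule that expires temporary objects after `days` days."""
    return {
        "ID": f"expire-{prefix.rstrip('/')}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Expiration": {"Days": days},
    }

rule = tmp_cleanup_rule()

# One-time setup (assumes boto3 credentials; bucket name is a placeholder):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="ocr-output-bucket",
#     LifecycleConfiguration={"Rules": [rule]},
# )
```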
✅ Final Thoughts
You now have the building blocks for a production OCR pipeline powered by:
🧠 ocrmypdf for PDF text layers
⚙️ Docker for repeatable environments
🤖 LangChain for LLM magic
☁️ AWS for scale
This setup lets you convert unsearchable PDFs into structured insights. Automate document workflows, extract legal data, read scanned invoices — the possibilities are huge.