🧠 OCR Any PDF with LangChain, Docker, and AWS Using OCRmyPDF
Many PDFs are just images — scanned contracts, invoices, or reports. They're unreadable by machines and non-searchable by humans.
What if you could automate adding a searchable text layer and then run a language model like GPT to summarize, extract data, or answer questions from them?
Welcome to a powerful workflow using:
✅ OCRmyPDF
✅ LangChain
✅ Docker
✅ AWS (S3, Lambda/ECS)
🔍 What Is OCRmyPDF?
`ocrmypdf` is a command-line tool that adds an OCR layer (an invisible, searchable text layer) to scanned PDFs using Tesseract. It keeps the original visual layout intact while making the text machine-readable.
```bash
ocrmypdf input.pdf output.pdf
```
Use multiple languages:
```bash
ocrmypdf -l eng+fra input.pdf output.pdf
```
🧱 Architecture Overview
User uploads PDF → S3 bucket → Docker OCR service → LangChain processor → Response (extracted data / summary / Q&A)
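Each stage hands objects to the next by S3 key, so it helps to fix the naming convention up front. A small sketch of such a helper (the `ocr/` prefix is just an illustrative convention, not something the tools require):

```python
def ocr_output_key(input_key):
    """Map an uploaded PDF's S3 key to the key its OCRed copy will use."""
    if not input_key.lower().endswith(".pdf"):
        raise ValueError(f"not a PDF key: {input_key}")
    filename = input_key.rsplit("/", 1)[-1]  # drop any folder prefix
    return f"ocr/{filename}"
```

Keeping the convention in one function means the upload trigger, the OCR container, and the LangChain consumer can't drift apart on naming.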
🐳 Dockerizing the OCR Service
Here's how to containerize the OCR layer:
🧾 Dockerfile
```dockerfile
FROM python:3.11-slim

# Tesseract engine, language packs, and Ghostscript (required by OCRmyPDF)
RUN apt-get update && apt-get install -y \
        tesseract-ocr \
        tesseract-ocr-eng \
        tesseract-ocr-fra \
        libtesseract-dev \
        ghostscript \
    && pip install ocrmypdf \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY ocr_service.py .
ENTRYPOINT ["python", "ocr_service.py"]
```
🧾 ocr_service.py
```python
import sys

import ocrmypdf

if len(sys.argv) != 3:
    sys.exit("usage: ocr_service.py <input.pdf> <output.pdf>")

input_path, output_path = sys.argv[1], sys.argv[2]

# skip_text=True leaves pages that already contain a text layer untouched
ocrmypdf.ocr(input_path, output_path, language="eng+fra", skip_text=True)
```
Build and run locally:
```bash
docker build -t ocr-service .
docker run -v "$(pwd)":/data ocr-service /data/input.pdf /data/output.pdf
```
🧠 Using LangChain for Text Analysis
After OCR is done, you can feed the PDF into LangChain and perform QA, summarization, or structured data extraction.
```python
from langchain.document_loaders import PyPDFLoader
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI

# Load the OCRed PDF; each page becomes a Document
loader = PyPDFLoader("output.pdf")
pages = loader.load()

# "stuff" packs all pages into a single prompt — fine for short documents
chain = load_qa_chain(OpenAI(), chain_type="stuff")
response = chain.run(input_documents=pages, question="What is the document about?")
print(response)
```
LangChain lets you chain OCR → LLM → Output via APIs or a UI.
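Before sending pages into a chain, it can pay to drop pages where OCR produced little or no text (blank or purely graphical pages), since they add token cost without signal. A minimal sketch — the helper name and threshold are mine, not part of LangChain:

```python
def pages_with_text(page_texts, min_chars=20):
    """Keep only pages whose extracted text is long enough to be meaningful."""
    return [t for t in page_texts if len(t.strip()) >= min_chars]

# With LangChain documents, you would filter on each page's `page_content`
# before passing the surviving pages to the QA chain.
```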
☁️ Deploying on AWS
Option 1: ECS Fargate
- Push the Docker image to ECR.
- Use a Lambda function to trigger a Fargate task on new S3 uploads.
- Upload the OCR result back to S3.
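Triggering the Fargate task from Lambda comes down to a single `run_task` call. The sketch below builds the parameters; the cluster, task definition, subnet IDs, and container name are placeholders you would replace with your own resources:

```python
def fargate_ocr_params(bucket, key, subnets):
    """Build ecs.run_task parameters; resource names below are placeholders."""
    return {
        "cluster": "ocr-cluster",
        "launchType": "FARGATE",
        "taskDefinition": "ocr-task",
        "networkConfiguration": {
            "awsvpcConfiguration": {"subnets": subnets, "assignPublicIp": "ENABLED"}
        },
        "overrides": {
            "containerOverrides": [{
                "name": "ocr-service",
                "environment": [
                    {"name": "INPUT_BUCKET", "value": bucket},
                    {"name": "INPUT_KEY", "value": key},
                ],
            }]
        },
    }

params = fargate_ocr_params("uploads-bucket", "scan.pdf", ["subnet-123"])

# Inside the triggering Lambda you would then call:
# import boto3
# boto3.client("ecs").run_task(**params)
```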
Option 2: Lambda + S3
- Use Lambda itself for lightweight OCR jobs (Lambda's maximum runtime is 15 minutes).
Sample Lambda Code:
```python
import os
import subprocess

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    # Keys may contain prefixes (e.g. "uploads/file.pdf"); /tmp paths must be flat
    filename = os.path.basename(key)
    input_path = f"/tmp/{filename}"
    output_path = f"/tmp/ocr_{filename}"

    s3.download_file(bucket, key, input_path)
    subprocess.run(
        ["ocrmypdf", "-l", "eng+fra", input_path, output_path],
        check=True,  # fail loudly if OCR errors
    )
    s3.upload_file(output_path, "ocr-output-bucket", f"ocr_{filename}")
```
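One subtlety worth knowing: S3 URL-encodes object keys in event payloads (spaces arrive as `+`), so decode keys before using them as paths. The event parsing can also live in a small helper that is easy to unit-test with a hand-built event:

```python
from urllib.parse import unquote_plus

def parse_s3_event(event):
    """Extract (bucket, decoded key) from a standard S3 put-event payload."""
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = unquote_plus(record["object"]["key"])  # undo S3's URL-encoding
    return bucket, key

# Hand-built sample event in the shape S3 notifications use
event = {"Records": [{"s3": {
    "bucket": {"name": "uploads-bucket"},
    "object": {"key": "my+scanned+contract.pdf"},
}}]}
```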
🔐 Security & 💰 Cost Tips
- Use IAM roles with least privilege.
- Set lifecycle rules on S3 buckets to auto-delete temp files.
- Use Lambda for lightweight OCR, ECS for heavier tasks.
- Monitor LangChain + LLM token usage if using OpenAI.
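The lifecycle tip above is a one-time boto3 call. The sketch below builds a rule that expires objects under a `tmp/` prefix after one day; the prefix and bucket name are placeholders for your own layout:

```python
def tmp_cleanup_rule(prefix="tmp/", days=1):
    """S3 lifecycle rule that expires temporary objects after `days` days."""
    return {
        "ID": f"expire-{prefix.rstrip('/')}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Expiration": {"Days": days},
    }

rule = tmp_cleanup_rule()

# One-time setup (assumes boto3 credentials; bucket name is a placeholder):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="ocr-output-bucket",
#     LifecycleConfiguration={"Rules": [rule]},
# )
```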
✅ Final Thoughts
You now have the building blocks for a production OCR pipeline powered by:
🧠 ocrmypdf for PDF text layers
⚙️ Docker for repeatable environments
🤖 LangChain for LLM magic
☁️ AWS for scale
This setup lets you convert unsearchable PDFs into structured insights. Automate document workflows, extract legal data, read scanned invoices — the possibilities are huge.