DEV Community

Chandrani Mukherjee

From Scanned PDFs to Smart Docs: OCR with LangChain, Docker & AWS

🧠 OCR Any PDF with LangChain, Docker, and AWS Using OCRmyPDF

Many PDFs are just images: scanned contracts, invoices, or reports. Software cannot extract their text, and readers cannot search them.

What if you could automate adding a searchable text layer and then run a language model like GPT to summarize, extract data, or answer questions from them?

Welcome to a powerful workflow using:

✅ OCRmyPDF

✅ LangChain

✅ Docker

✅ AWS (S3, Lambda/ECS)


🔍 What Is OCRmyPDF?

ocrmypdf is a command-line tool that adds an OCR layer (invisible searchable text) to scanned PDFs using Tesseract. It keeps the original visual layout intact while making the text machine-readable.

ocrmypdf input.pdf output.pdf

Use multiple languages:

ocrmypdf -l eng+fra input.pdf output.pdf
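
When the CLI is driven from a script instead of typed by hand, it can help to build the argument list in one place. The following is a minimal sketch, not part of OCRmyPDF itself: `build_ocr_command` and `run_ocr` are hypothetical helper names, and only the `-l` flag and the positional input/output arguments shown above are used.

```python
import subprocess
from typing import Sequence


def build_ocr_command(input_pdf: str, output_pdf: str,
                      languages: Sequence[str] = ("eng",)) -> list[str]:
    """Return the argv list for an ocrmypdf call (hypothetical helper)."""
    if not languages:
        raise ValueError("at least one Tesseract language code is required")
    # Tesseract language codes are joined with '+', as in `-l eng+fra`.
    return ["ocrmypdf", "-l", "+".join(languages), input_pdf, output_pdf]


def run_ocr(input_pdf: str, output_pdf: str,
            languages: Sequence[str] = ("eng",)) -> None:
    """Run ocrmypdf and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(build_ocr_command(input_pdf, output_pdf, languages), check=True)


print(build_ocr_command("input.pdf", "output.pdf", ["eng", "fra"]))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.
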
🧱 Architecture Overview

User Uploads PDF → S3 Bucket → Docker OCR Service → LangChain Processor → Response (Extracted Data / Summary / Q&A)
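
The flow above can be sketched as two pluggable stages. This is only an illustration under stated assumptions: `PipelineResult`, `run_pipeline`, and the stand-in lambdas are hypothetical names, and the real stages would call the OCR container and a LangChain chain.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PipelineResult:
    source_key: str  # key of the uploaded scan
    ocr_key: str     # key of the searchable PDF produced by the OCR stage
    answer: str      # text returned by the LLM stage


def run_pipeline(source_key: str,
                 ocr_stage: Callable[[str], str],
                 llm_stage: Callable[[str], str]) -> PipelineResult:
    """Chain the stages from the diagram: upload key -> OCR -> LLM -> response."""
    ocr_key = ocr_stage(source_key)  # e.g. the Docker OCR service
    answer = llm_stage(ocr_key)      # e.g. a LangChain QA or summary chain
    return PipelineResult(source_key, ocr_key, answer)


# Stand-in stages so the flow can be exercised without AWS or an LLM.
result = run_pipeline(
    "uploads/invoice.pdf",
    ocr_stage=lambda key: "ocr/" + key.split("/")[-1],
    llm_stage=lambda key: f"summary of {key}",
)
print(result)
```

Keeping the stages as plain callables makes it possible to test the wiring locally before either service exists.
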
🐳 Dockerizing the OCR Service
Here's how to containerize the OCR layer:

🧾 Dockerfile

dockerfile

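FROM python:3.11-slim

# System dependencies: Tesseract with English and French data, plus Ghostscript.
# Removing the apt lists and skipping the pip cache keeps the image smaller.
RUN apt-get update && apt-get install -y \
    tesseract-ocr \
    ghostscript \
    libtesseract-dev \
    tesseract-ocr-eng \
    tesseract-ocr-fra \
    && rm -rf /var/lib/apt/lists/* \
    && pip install --no-cache-dir ocrmypdf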

WORKDIR /app
COPY ocr_service.py .

ENTRYPOINT ["python", "ocr_service.py"]

🧾 ocr_service.py

python

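import sys

import ocrmypdf

# Fail with a usage message instead of an IndexError when arguments are missing
if len(sys.argv) != 3:
    sys.exit("usage: ocr_service.py INPUT_PDF OUTPUT_PDF")

input_path, output_path = sys.argv[1], sys.argv[2]

# skip_text=True leaves pages that already contain text untouched
ocrmypdf.ocr(input_path, output_path, language=["eng", "fra"], skip_text=True)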

Build and run locally:

bash

docker build -t ocr-service .
docker run --rm -v "$(pwd)":/data ocr-service /data/input.pdf /data/output.pdf

🧠 Using LangChain for Text Analysis
After OCR is done, you can feed the PDF into LangChain and perform QA, summarization, or structured data extraction.

python

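# These import paths match older LangChain releases. Newer releases moved the
# loaders to `langchain_community.document_loaders` and the OpenAI wrapper to
# the separate `langchain_openai` package.
from langchain.document_loaders import PyPDFLoader
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI

# PyPDFLoader needs the `pypdf` package and returns one Document per page
loader = PyPDFLoader("output.pdf")
pages = loader.load()

# "stuff" places every page into a single prompt; OPENAI_API_KEY must be set
chain = load_qa_chain(OpenAI(), chain_type="stuff")
response = chain.run(input_documents=pages, question="What is the document about?")

print(response)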

LangChain lets you chain OCR → LLM → Output and expose the result through an API or a UI.
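
The "stuff" chain type places every loaded page into a single prompt, so a long scan can exceed the model's context window. One possible mitigation is to split the extracted text into smaller pieces first. Here is a minimal sketch in plain Python, assuming a hypothetical `split_text` helper and character-based sizes; LangChain ships its own text splitters, which are not shown here.

```python
def split_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters so that a sentence cut at a
    boundary still appears whole in one of the two neighbouring chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


print(split_text("abcdefghij", chunk_size=4, overlap=1))
```

Each chunk could then be sent to the model separately and the partial answers combined, at the cost of more LLM calls.
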

☁️ Deploying on AWS
Option 1: ECS Fargate
Push the Docker image to ECR.

Use Lambda to trigger a Fargate task on new S3 uploads.

Upload the OCR result back to S3.
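
The trigger step in Option 1 could look roughly like the sketch below. Everything named here is a placeholder assumption: the cluster, task definition, container name, subnet ID, and the `INPUT_BUCKET` / `INPUT_KEY` environment variables. It also assumes the container is extended to download its input from S3, since the `ocr_service.py` shown earlier only takes local paths.

```python
from urllib.parse import unquote_plus


def build_run_task_request(bucket: str, key: str) -> dict:
    """Build keyword arguments for the ECS RunTask call (placeholder names)."""
    return {
        "cluster": "ocr-cluster",            # hypothetical cluster name
        "taskDefinition": "ocr-task",        # hypothetical task definition
        "launchType": "FARGATE",
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": ["subnet-PLACEHOLDER"],
                "assignPublicIp": "ENABLED",
            }
        },
        "overrides": {
            "containerOverrides": [{
                "name": "ocr-service",       # hypothetical container name
                "environment": [
                    {"name": "INPUT_BUCKET", "value": bucket},
                    {"name": "INPUT_KEY", "value": key},
                ],
            }]
        },
    }


def handler(event, context):
    """Lambda entry point: start one Fargate task for the uploaded object."""
    import boto3  # imported here so the request builder can be tested without AWS

    record = event["Records"][0]["s3"]
    key = unquote_plus(record["object"]["key"])  # keys arrive URL-encoded
    request = build_run_task_request(record["bucket"]["name"], key)
    response = boto3.client("ecs").run_task(**request)
    return [task["taskArn"] for task in response["tasks"]]
```

Separating the request builder from the handler keeps the AWS call in one place and lets the payload be checked in a unit test.
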

Option 2: Lambda + S3
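Use Lambda for lightweight OCR jobs that finish within Lambda's 15-minute execution limit. The function needs ocrmypdf, Tesseract, and Ghostscript available at runtime, for example by deploying it as a container image.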

Sample Lambda Code:

python

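import os
import subprocess
from urllib.parse import unquote_plus

import boto3

s3 = boto3.client("s3")  # created once and reused across warm invocations

def handler(event, context):
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    # Object keys arrive URL-encoded in S3 event notifications
    key = unquote_plus(record["object"]["key"])

    # Use only the file name so keys containing "/" map to a valid /tmp path
    filename = os.path.basename(key)
    input_path = f"/tmp/{filename}"
    output_path = f"/tmp/ocr_{filename}"

    s3.download_file(bucket, key, input_path)
    # check=True raises if ocrmypdf fails, so a missing output is never uploaded
    subprocess.run(["ocrmypdf", "-l", "eng+fra", input_path, output_path], check=True)
    s3.upload_file(output_path, "ocr-output-bucket", f"ocr_{key}")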
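
Because Lambda is capped at 15 minutes, one way to combine both options is to route each upload by object size, which S3 includes in its event records. The sketch below is an assumption-heavy illustration: `choose_backend` is a hypothetical helper, and the 20 MB cut-off is an arbitrary starting value, not a measured one.

```python
from urllib.parse import unquote_plus

# Assumed cut-off, not a measured value: tune it from real OCR runtimes.
LAMBDA_MAX_BYTES = 20 * 1024 * 1024


def choose_backend(record: dict, max_lambda_bytes: int = LAMBDA_MAX_BYTES) -> dict:
    """Decide whether one S3 event record should be processed by Lambda or ECS."""
    s3_info = record["s3"]
    size = s3_info["object"].get("size")
    # Route to ECS when the size is unknown or above the cut-off.
    small = size is not None and size <= max_lambda_bytes
    return {
        "bucket": s3_info["bucket"]["name"],
        "key": unquote_plus(s3_info["object"]["key"]),  # keys arrive URL-encoded
        "backend": "lambda" if small else "ecs",
    }


sample = {"s3": {"bucket": {"name": "uploads"},
                 "object": {"key": "scans/my+invoice.pdf", "size": 1_500_000}}}
print(choose_backend(sample))
```

File size is only a rough proxy for OCR time; page count and image resolution matter as well, so the threshold is worth validating against real documents.
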

🔐 Security & 💰 Cost Tips
Use IAM roles with least privilege.

Set lifecycle rules on S3 buckets to auto-delete temp files.

Use Lambda for lightweight OCR, ECS for heavier tasks.

Monitor LangChain + LLM token usage if using OpenAI.
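
As an illustration of the lifecycle tip above, the rule can be expressed as a small configuration dictionary and applied with boto3. This is a sketch under assumptions: the `tmp/` prefix, the 7-day expiry, and the helper names are hypothetical choices, not values from the pipeline above.

```python
def build_lifecycle_config(prefix: str = "tmp/", days: int = 7) -> dict:
    """Build an S3 lifecycle configuration that expires objects under a prefix."""
    if days < 1:
        raise ValueError("days must be at least 1")
    return {
        "Rules": [{
            "ID": f"expire-{prefix.strip('/') or 'all'}-after-{days}d",
            "Filter": {"Prefix": prefix},
            "Status": "Enabled",
            "Expiration": {"Days": days},
        }]
    }


def apply_lifecycle(bucket: str, prefix: str = "tmp/", days: int = 7) -> None:
    """Apply the rule; note this replaces the bucket's existing lifecycle rules."""
    import boto3  # imported here so the config builder can be tested without AWS

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration=build_lifecycle_config(prefix, days),
    )


print(build_lifecycle_config("tmp/", 3))
```

Since the call overwrites any lifecycle configuration already on the bucket, existing rules would need to be merged into the `Rules` list first.
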

✅ Final Thoughts
You now have the core of an OCR pipeline powered by:

🧠 ocrmypdf for PDF text layers

⚙️ Docker for repeatable environments

🤖 LangChain for LLM magic

☁️ AWS for scale

This setup lets you convert unsearchable PDFs into structured insights. Automate document workflows, extract legal data, read scanned invoices: the possibilities are huge.
