Ashwin Singh
I Built an OCR Tool That Extracts 4.5x More Text Than Tesseract Alone


Last year I got frustrated.

I was trying to digitize a stack of old contracts for a family business. Scanned PDFs—hundreds of pages. I needed to find specific clauses across all of them, but Ctrl+F returned nothing because every page was just an image.

"No problem," I thought. "I'll just run them through Tesseract."

The output was... disappointing. Missed words everywhere. Garbled text from slightly tilted pages. Complete failures on low-res scans. I spent more time fixing OCR errors than I would have spent reading the documents manually.

So I built something better. It's called SearchablePDF.org, and it extracts up to 4.5x as much text as vanilla Tesseract. Here's how.

The Problem With Basic OCR

Tesseract is incredible technology. But feed it a real-world scanned document and you'll quickly discover its limitations:

```
Input:    Slightly tilted scan of a contract
Expected: "This Agreement shall terminate on December 31, 2024"
Actual:   "Th1s Agr33ment sha11 terminat3 0n Decemb3r 31, 2O24"
```

The issue isn't Tesseract—it's the input. OCR engines expect clean, properly oriented, high-contrast images. Real scans are:

  • Tilted by a few degrees
  • Rotated 90°, 180°, or 270°
  • Low resolution (faxes, old photocopies)
  • Covered with watermarks, stamps, or noise
  • Inconsistent in contrast and brightness

Garbage in, garbage out.

The Solution: Preprocessing Pipeline

The key insight was that OCR accuracy depends more on image quality than on the OCR engine itself. So I built a preprocessing pipeline that runs before Tesseract ever sees the image.

Step 1: Orientation Detection and Correction

Using OpenCV and some heuristics, the tool detects page orientation and rotates accordingly:

```python
import re
import pytesseract

# Simplified version of orientation detection
def detect_orientation(image):
    # Use Tesseract's OSD (Orientation and Script Detection)
    osd = pytesseract.image_to_osd(image)
    rotation = int(re.search(r'Rotate: (\d+)', osd).group(1))
    return rotation

def fix_orientation(image):
    rotation = detect_orientation(image)
    if rotation != 0:
        # rotate_image: helper that rotates by the given angle
        # (e.g. cv2.rotate / cv2.warpAffine under the hood)
        image = rotate_image(image, 360 - rotation)
    return image
```

This alone fixed about 15% of my failed extractions.
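The OSD result is just plain text, so the rotation value has to be parsed out of it. Here's that parsing step isolated, with a guard for the case where no `Rotate:` line comes back (the sample string mimics real OSD output; the fallback-to-zero behavior is my own choice, not Tesseract's):

```python
import re

def parse_osd_rotation(osd: str) -> int:
    """Extract the suggested rotation (degrees) from Tesseract OSD output."""
    match = re.search(r"Rotate: (\d+)", osd)
    # Fall back to 0 (no rotation) if OSD gave no usable answer
    return int(match.group(1)) if match else 0

sample_osd = (
    "Page number: 0\n"
    "Orientation in degrees: 270\n"
    "Rotate: 90\n"
    "Orientation confidence: 9.51\n"
)
```

Without the guard, a blank page (where OSD has nothing to detect) crashes the whole batch with an `AttributeError` on `.group(1)`.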

Step 2: Deskewing

Even a 2-3° tilt kills OCR accuracy. The deskew algorithm detects text line angles and straightens the image:

```python
import cv2
import numpy as np

def deskew(image):
    # Convert to grayscale and detect edges
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150, apertureSize=3)

    # Detect lines using the probabilistic Hough transform
    lines = cv2.HoughLinesP(edges, 1, np.pi/180, 100,
                            minLineLength=100, maxLineGap=10)
    if lines is None:
        return image  # nothing detected; leave the page alone

    # Median angle is robust to a few outlier lines (borders, stamps)
    angles = [np.arctan2(y2 - y1, x2 - x1) for x1, y1, x2, y2 in lines[:, 0]]
    median_angle = np.median(angles)

    # Rotate to cancel the detected tilt
    # (the sign depends on rotate_image's rotation convention)
    return rotate_image(image, np.degrees(median_angle))
```
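The angle-estimation step can be pulled out and tested without OpenCV at all: given line segments as `(x1, y1, x2, y2)` tuples, compute each segment's angle and take the median so a few outliers (page borders, stamps) don't dominate. A minimal stdlib sketch (the names and the 45° near-vertical filter are my additions):

```python
import math
from statistics import median

def estimate_skew_degrees(segments):
    """Median angle (degrees) of line segments from near-horizontal text lines."""
    angles = [math.degrees(math.atan2(y2 - y1, x2 - x1))
              for x1, y1, x2, y2 in segments]
    # Ignore near-vertical segments (page edges, table rules)
    text_angles = [a for a in angles if abs(a) < 45] or angles
    return median(text_angles)

# Two text lines tilted ~1.7 degrees, plus one vertical page edge
segments = [(0, 0, 100, 3), (0, 50, 100, 53), (10, 0, 10, 200)]
```

Using the median instead of the mean is what makes this robust: one vertical page edge at 90° would drag a mean badly off, but barely moves the median.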

Step 3: Resolution Enhancement

Many scans come in at 72-150 DPI. Tesseract works best at 300+ DPI. I use intelligent upscaling:

```python
import cv2

def enhance_resolution(image, target_dpi=300):
    # estimate_dpi: helper that guesses DPI from the page dimensions
    current_dpi = estimate_dpi(image)
    if current_dpi < target_dpi:
        scale_factor = target_dpi / current_dpi
        # Bicubic interpolation keeps letterforms smooth when upscaling
        return cv2.resize(image, None, fx=scale_factor, fy=scale_factor,
                          interpolation=cv2.INTER_CUBIC)
    return image
```
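`estimate_dpi` is a helper I left undefined above. One simple heuristic (an assumption for illustration, not the production logic) is to infer DPI from the pixel width, assuming a standard US-letter page width:

```python
def estimate_dpi(pixel_width: int, page_width_inches: float = 8.5) -> float:
    """Rough DPI guess: pixels across divided by the assumed physical width."""
    return pixel_width / page_width_inches

def scale_factor(pixel_width: int, target_dpi: int = 300) -> float:
    """Upscale factor needed to reach the target DPI (1.0 = no change)."""
    current = estimate_dpi(pixel_width)
    return max(1.0, target_dpi / current)
```

So an 850px-wide scan of a letter page reads as ~100 DPI and gets a 3x upscale; a 2550px scan is already at 300 DPI and passes through untouched.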

Step 4: Noise Removal and Contrast Enhancement

Watermarks, scanner artifacts, and faded text all interfere with recognition:

```python
import cv2

def clean_image(image):
    # Convert to grayscale
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # Denoise first, while the image is still grayscale
    # (denoising after binarization has little left to work with)
    gray = cv2.fastNlMeansDenoising(gray, h=10)

    # Adaptive thresholding handles uneven lighting and faded regions
    cleaned = cv2.adaptiveThreshold(gray, 255,
                                    cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                    cv2.THRESH_BINARY, 11, 2)

    return cleaned
```

The Secret Sauce: Invisible Text Layers

Here's where it gets interesting.

Most OCR tools give you extracted text—a plain .txt file or copied text. That's useful, but you lose all formatting, layout, and visual context.

I wanted something better: a PDF that looks exactly like the original but is fully searchable and selectable.

The solution is PDF text layers. You embed invisible, precisely-positioned text underneath the original image:

```python
from pypdf import PdfReader, PdfWriter
from reportlab.pdfgen import canvas  # used inside create_text_layer_pdf

def add_text_layer(original_pdf, ocr_data):
    # ocr_data contains text + bounding box coordinates from Tesseract

    # Create a new PDF holding only the (invisible) text layer
    text_pdf = create_text_layer_pdf(ocr_data)

    original = PdfReader(original_pdf)
    text_layer = PdfReader(text_pdf)
    writer = PdfWriter()

    # Overlay each text page onto its scanned page:
    # the image stays the visible background, the text sits on top unseen
    for page, text_page in zip(original.pages, text_layer.pages):
        page.merge_page(text_page)
        writer.add_page(page)

    return writer
```

The text is there—screen readers can read it, Ctrl+F finds it, you can copy it—but visually the PDF is unchanged.
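Under the hood, the invisibility comes from PDF text rendering mode 3 ("neither fill nor stroke"), defined in the PDF specification. In the page's content stream, each OCR'd word ends up looking roughly like this (coordinates and font name illustrative):

```
BT                          % begin text object
/F1 11 Tf                   % font and size matching the word's box
3 Tr                        % render mode 3: invisible text
72 700 Td                   % position in PDF points, origin bottom-left
(This Agreement shall) Tj   % the recognized text
ET                          % end text object
```

If you're generating the layer with reportlab, its text objects expose this as `setTextRenderMode(3)`; the text still participates in search, selection, and accessibility because it exists in the content stream, it just paints no glyphs.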

Tesseract LSTM Configuration

Tesseract's defaults leave accuracy on the table. Here's the configuration I landed on:

```python
import pytesseract
from pytesseract import Output

custom_config = r'--oem 1 --psm 3 -l eng'

# --oem 1: use the LSTM neural network engine only (most accurate)
# --psm 3: fully automatic page segmentation (works for most documents)
# -l eng:  language (35+ supported; combine with e.g. eng+spa)

text = pytesseract.image_to_string(processed_image, config=custom_config)

# For detailed position data (needed for text layer placement):
data = pytesseract.image_to_data(processed_image, config=custom_config,
                                 output_type=Output.DICT)
```

The image_to_data function returns bounding boxes for every word—essential for positioning the invisible text layer correctly.
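One subtlety worth calling out: Tesseract reports boxes in image pixels with the origin at the top-left, while PDF coordinates are in points (1/72 inch) with the origin at the bottom-left. A small conversion sketch (the helper name and the fixed-DPI assumption are mine):

```python
def px_to_pdf(left, top, height_px, page_height_px, dpi=300):
    """Convert a Tesseract box corner to PDF points (bottom-left origin)."""
    scale = 72.0 / dpi                # pixels -> points
    x = left * scale
    # Flip the y-axis: measure up from the bottom of the page,
    # landing at the bottom edge of the word's bounding box
    y = (page_height_px - top - height_px) * scale
    return x, y
```

Skip the y-flip and every word of invisible text lands mirrored down the page, so Ctrl+F "finds" text in the wrong place.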

Results

After implementing the full pipeline, I ran tests against the same 500-page document set:

| Method | Characters Extracted | Accuracy |
| --- | --- | --- |
| Raw Tesseract | 127,453 | ~71% |
| With preprocessing | 580,342 | ~94% |

That's roughly 4.5x as much text extracted, with dramatically fewer errors.

The biggest wins came from:

  1. Deskewing (fixed ~30% of errors)
  2. Resolution enhancement (fixed ~25% of errors)
  3. Proper LSTM configuration (fixed ~15% of errors)

The Production Version

I wrapped all of this into SearchablePDF.org—a web app where you can upload scanned PDFs and get back searchable versions.

Features:

  • Automatic preprocessing: All the cleanup happens without user configuration
  • 35+ languages: Including multi-language support for mixed documents
  • Page selection: Process only the pages you need (1-10, 25, 40-50)
  • Two OCR tiers: Standard (Tesseract LSTM) and Premium AI (99% accuracy for critical documents)
  • Privacy-first: Files auto-delete after 24 hours

The free tier gives you 25 pages to test. Paid credits start at $0.05/page and never expire.
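The page-selection syntax above ("1-10, 25, 40-50") is simple to support. A minimal parser sketch (my own helper for illustration, not the site's actual code; it doesn't validate against the document's page count):

```python
def parse_page_ranges(spec: str) -> list[int]:
    """Expand a spec like '1-10, 25, 40-50' into sorted page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = part.split("-")
            pages.update(range(int(start), int(end) + 1))
        elif part:
            pages.add(int(part))
    return sorted(pages)
```

Using a set means overlapping ranges like "1-10, 5-15" come out deduplicated for free.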

Tech Stack

For those curious:

  • Backend: FastAPI (Python)
  • OCR: Tesseract with pytesseract bindings
  • Image processing: OpenCV, Pillow
  • PDF manipulation: pypdf, reportlab
  • Frontend: Next.js
  • Infrastructure: Redis for job queuing

Try It / Break It

If you've got scanned PDFs that other tools have failed on, I'd genuinely like to know how SearchablePDF handles them. Edge cases help me improve the preprocessing pipeline.

searchablepdf.org

And if you want to build something similar yourself, the core techniques are all above. The preprocessing pipeline is where most of the magic happens—Tesseract does the heavy lifting once you feed it clean images.


Questions? Drop them in the comments. Happy to go deeper on any part of the implementation.
