Ashwin Singh
I Built an OCR Tool That Extracts 4.5x More Text Than Tesseract Alone


Last year I got frustrated.

I was trying to digitize a stack of old contracts for a family business. Scanned PDFs—hundreds of pages. I needed to find specific clauses across all of them, but Ctrl+F returned nothing because every page was just an image.

"No problem," I thought. "I'll just run them through Tesseract."

The output was... disappointing. Missed words everywhere. Garbled text from slightly tilted pages. Complete failures on low-res scans. I spent more time fixing OCR errors than I would have spent reading the documents manually.

So I built something better. It's called SearchablePDF.org, and it extracts up to 4.5x as much text as vanilla Tesseract. Here's how.

The Problem With Basic OCR

Tesseract is incredible technology. But feed it a real-world scanned document and you'll quickly discover its limitations:

```
Input:    Slightly tilted scan of a contract
Expected: "This Agreement shall terminate on December 31, 2024"
Actual:   "Th1s Agr33ment sha11 terminat3 0n Decemb3r 31, 2O24"
```

The issue isn't Tesseract—it's the input. OCR engines expect clean, properly oriented, high-contrast images. Real scans are:

  • Tilted by a few degrees
  • Rotated 90°, 180°, or 270°
  • Low resolution (faxes, old photocopies)
  • Covered with watermarks, stamps, or noise
  • Inconsistent in contrast and brightness

Garbage in, garbage out.

The Solution: Preprocessing Pipeline

The key insight was that OCR accuracy depends more on image quality than on the OCR engine itself. So I built a preprocessing pipeline that runs before Tesseract ever sees the image.

Step 1: Orientation Detection and Correction

Using OpenCV and some heuristics, the tool detects page orientation and rotates accordingly:

```python
import re
import pytesseract

# Simplified version of orientation detection
def detect_orientation(image):
    # Use Tesseract's OSD (Orientation and Script Detection)
    osd = pytesseract.image_to_osd(image)
    rotation = int(re.search(r'Rotate: (\d+)', osd).group(1))
    return rotation

def fix_orientation(image):
    rotation = detect_orientation(image)
    if rotation != 0:
        # rotate_image: helper that rotates by the given angle
        # (e.g. cv2.rotate / cv2.warpAffine under the hood)
        image = rotate_image(image, 360 - rotation)
    return image
```

This alone fixed about 15% of my failed extractions.
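The OSD result is just plain text, so the rotation value has to be parsed out of it. Here's that parsing step isolated, with a guard for the case where no `Rotate:` line comes back (the sample string mimics real OSD output; the fallback-to-zero behavior is my own choice, not Tesseract's):

```python
import re

def parse_osd_rotation(osd: str) -> int:
    """Extract the suggested rotation (degrees) from Tesseract OSD output."""
    match = re.search(r"Rotate: (\d+)", osd)
    # Fall back to 0 (no rotation) if OSD gave no usable answer
    return int(match.group(1)) if match else 0

sample_osd = (
    "Page number: 0\n"
    "Orientation in degrees: 270\n"
    "Rotate: 90\n"
    "Orientation confidence: 9.51\n"
)
```

Without the guard, a blank page (where OSD has nothing to detect) crashes the whole batch with an `AttributeError` on `.group(1)`.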

Step 2: Deskewing

Even a 2-3° tilt kills OCR accuracy. The deskew algorithm detects text line angles and straightens the image:

```python
import cv2
import numpy as np

def deskew(image):
    # Convert to grayscale and detect edges
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150, apertureSize=3)

    # Detect lines using the probabilistic Hough transform
    lines = cv2.HoughLinesP(edges, 1, np.pi/180, 100,
                            minLineLength=100, maxLineGap=10)
    if lines is None:
        return image  # nothing detected; leave the page alone

    # Median angle is robust to a few outlier lines (borders, stamps)
    angles = [np.arctan2(y2 - y1, x2 - x1) for x1, y1, x2, y2 in lines[:, 0]]
    median_angle = np.median(angles)

    # Rotate to cancel the detected tilt
    # (the sign depends on rotate_image's rotation convention)
    return rotate_image(image, np.degrees(median_angle))
```
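The angle-estimation step can be pulled out and tested without OpenCV at all: given line segments as `(x1, y1, x2, y2)` tuples, compute each segment's angle and take the median so a few outliers (page borders, stamps) don't dominate. A minimal stdlib sketch (the names and the 45° near-vertical filter are my additions):

```python
import math
from statistics import median

def estimate_skew_degrees(segments):
    """Median angle (degrees) of line segments from near-horizontal text lines."""
    angles = [math.degrees(math.atan2(y2 - y1, x2 - x1))
              for x1, y1, x2, y2 in segments]
    # Ignore near-vertical segments (page edges, table rules)
    text_angles = [a for a in angles if abs(a) < 45] or angles
    return median(text_angles)

# Two text lines tilted ~1.7 degrees, plus one vertical page edge
segments = [(0, 0, 100, 3), (0, 50, 100, 53), (10, 0, 10, 200)]
```

Using the median instead of the mean is what makes this robust: one vertical page edge at 90° would drag a mean badly off, but barely moves the median.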

Step 3: Resolution Enhancement

Many scans come in at 72-150 DPI. Tesseract works best at 300+ DPI. I use intelligent upscaling:

```python
import cv2

def enhance_resolution(image, target_dpi=300):
    # estimate_dpi: helper that guesses DPI from the page dimensions
    current_dpi = estimate_dpi(image)
    if current_dpi < target_dpi:
        scale_factor = target_dpi / current_dpi
        # Bicubic interpolation keeps letterforms smooth when upscaling
        return cv2.resize(image, None, fx=scale_factor, fy=scale_factor,
                          interpolation=cv2.INTER_CUBIC)
    return image
```
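`estimate_dpi` is a helper I left undefined above. One simple heuristic (an assumption for illustration, not the production logic) is to infer DPI from the pixel width, assuming a standard US-letter page width:

```python
def estimate_dpi(pixel_width: int, page_width_inches: float = 8.5) -> float:
    """Rough DPI guess: pixels across divided by the assumed physical width."""
    return pixel_width / page_width_inches

def scale_factor(pixel_width: int, target_dpi: int = 300) -> float:
    """Upscale factor needed to reach the target DPI (1.0 = no change)."""
    current = estimate_dpi(pixel_width)
    return max(1.0, target_dpi / current)
```

So an 850px-wide scan of a letter page reads as ~100 DPI and gets a 3x upscale; a 2550px scan is already at 300 DPI and passes through untouched.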

Step 4: Noise Removal and Contrast Enhancement

Watermarks, scanner artifacts, and faded text all interfere with recognition:

```python
import cv2

def clean_image(image):
    # Convert to grayscale
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # Denoise first, while the image is still grayscale
    # (denoising after binarization has little left to work with)
    gray = cv2.fastNlMeansDenoising(gray, h=10)

    # Adaptive thresholding handles uneven lighting and faded regions
    cleaned = cv2.adaptiveThreshold(gray, 255,
                                    cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                    cv2.THRESH_BINARY, 11, 2)

    return cleaned
```

The Secret Sauce: Invisible Text Layers

Here's where it gets interesting.

Most OCR tools give you extracted text—a plain .txt file or copied text. That's useful, but you lose all formatting, layout, and visual context.

I wanted something better: a PDF that looks exactly like the original but is fully searchable and selectable.

The solution is PDF text layers. You embed invisible, precisely-positioned text underneath the original image:

```python
from pypdf import PdfReader, PdfWriter
from reportlab.pdfgen import canvas  # used inside create_text_layer_pdf

def add_text_layer(original_pdf, ocr_data):
    # ocr_data contains text + bounding box coordinates from Tesseract

    # Create a new PDF holding only the (invisible) text layer
    text_pdf = create_text_layer_pdf(ocr_data)

    original = PdfReader(original_pdf)
    text_layer = PdfReader(text_pdf)
    writer = PdfWriter()

    # Overlay each text page onto its scanned page:
    # the image stays the visible background, the text sits on top unseen
    for page, text_page in zip(original.pages, text_layer.pages):
        page.merge_page(text_page)
        writer.add_page(page)

    return writer
```

The text is there—screen readers can read it, Ctrl+F finds it, you can copy it—but visually the PDF is unchanged.
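Under the hood, the invisibility comes from PDF text rendering mode 3 ("neither fill nor stroke"), defined in the PDF specification. In the page's content stream, each OCR'd word ends up looking roughly like this (coordinates and font name illustrative):

```
BT                          % begin text object
/F1 11 Tf                   % font and size matching the word's box
3 Tr                        % render mode 3: invisible text
72 700 Td                   % position in PDF points, origin bottom-left
(This Agreement shall) Tj   % the recognized text
ET                          % end text object
```

If you're generating the layer with reportlab, its text objects expose this as `setTextRenderMode(3)`; the text still participates in search, selection, and accessibility because it exists in the content stream, it just paints no glyphs.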

Tesseract LSTM Configuration

Tesseract's defaults leave accuracy on the table. Here's the configuration I landed on:

```python
import pytesseract
from pytesseract import Output

custom_config = r'--oem 1 --psm 3 -l eng'

# --oem 1: use the LSTM neural network engine only (most accurate)
# --psm 3: fully automatic page segmentation (works for most documents)
# -l eng:  language (35+ supported; combine with e.g. eng+spa)

text = pytesseract.image_to_string(processed_image, config=custom_config)

# For detailed position data (needed for text layer placement):
data = pytesseract.image_to_data(processed_image, config=custom_config,
                                 output_type=Output.DICT)
```

The image_to_data function returns bounding boxes for every word—essential for positioning the invisible text layer correctly.
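One subtlety worth calling out: Tesseract reports boxes in image pixels with the origin at the top-left, while PDF coordinates are in points (1/72 inch) with the origin at the bottom-left. A small conversion sketch (the helper name and the fixed-DPI assumption are mine):

```python
def px_to_pdf(left, top, height_px, page_height_px, dpi=300):
    """Convert a Tesseract box corner to PDF points (bottom-left origin)."""
    scale = 72.0 / dpi                # pixels -> points
    x = left * scale
    # Flip the y-axis: measure up from the bottom of the page,
    # landing at the bottom edge of the word's bounding box
    y = (page_height_px - top - height_px) * scale
    return x, y
```

Skip the y-flip and every word of invisible text lands mirrored down the page, so Ctrl+F "finds" text in the wrong place.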

Results

After implementing the full pipeline, I ran tests against the same 500-page document set:

| Method | Characters Extracted | Accuracy |
| --- | --- | --- |
| Raw Tesseract | 127,453 | ~71% |
| With preprocessing | 580,342 | ~94% |

That's roughly 4.5x as much text extracted, with dramatically fewer errors.

The biggest wins came from:

  1. Deskewing (fixed ~30% of errors)
  2. Resolution enhancement (fixed ~25% of errors)
  3. Proper LSTM configuration (fixed ~15% of errors)

The Production Version

I wrapped all of this into SearchablePDF.org—a web app where you can upload scanned PDFs and get back searchable versions.

Features:

  • Automatic preprocessing: All the cleanup happens without user configuration
  • 35+ languages: Including multi-language support for mixed documents
  • Page selection: Process only the pages you need (1-10, 25, 40-50)
  • Two OCR tiers: Standard (Tesseract LSTM) and Premium AI (99% accuracy for critical documents)
  • Privacy-first: Files auto-delete after 24 hours

The free tier gives you 25 pages to test. Paid credits start at $0.05/page and never expire.
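The page-selection syntax above ("1-10, 25, 40-50") is simple to support. A minimal parser sketch (my own helper for illustration, not the site's actual code; it doesn't validate against the document's page count):

```python
def parse_page_ranges(spec: str) -> list[int]:
    """Expand a spec like '1-10, 25, 40-50' into sorted page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = part.split("-")
            pages.update(range(int(start), int(end) + 1))
        elif part:
            pages.add(int(part))
    return sorted(pages)
```

Using a set means overlapping ranges like "1-10, 5-15" come out deduplicated for free.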

Tech Stack

For those curious:

  • Backend: FastAPI (Python)
  • OCR: Tesseract with pytesseract bindings
  • Image processing: OpenCV, Pillow
  • PDF manipulation: pypdf, reportlab
  • Frontend: Next.js
  • Infrastructure: Redis for job queuing

Try It / Break It

If you've got scanned PDFs that other tools have failed on, I'd genuinely like to know how SearchablePDF handles them. Edge cases help me improve the preprocessing pipeline.

searchablepdf.org

And if you want to build something similar yourself, the core techniques are all above. The preprocessing pipeline is where most of the magic happens—Tesseract does the heavy lifting once you feed it clean images.


Questions? Drop them in the comments. Happy to go deeper on any part of the implementation.
