I was trying to digitize a stack of old contracts for a family business. Scanned PDFs—hundreds of pages. I needed to find specific clauses across all of them, but Ctrl+F returned nothing because every page was just an image.
"No problem," I thought. "I'll just run them through Tesseract."
The output was... disappointing. Missed words everywhere. Garbled text from slightly tilted pages. Complete failures on low-res scans. I spent more time fixing OCR errors than I would have spent reading the documents manually.
So I built something better. It's called SearchablePDF.org, and it extracts up to 456% more text than vanilla Tesseract. Here's how.
The Problem With Basic OCR
Tesseract is incredible technology. But feed it a real-world scanned document and you'll quickly discover its limitations:
Input: Slightly tilted scan of a contract
Expected: "This Agreement shall terminate on December 31, 2024"
Actual: "Th1s Agr33ment sha11 terminat3 0n Decemb3r 31, 2O24"
The issue isn't Tesseract—it's the input. OCR engines expect clean, properly oriented, high-contrast images. Real scans are:
- Tilted by a few degrees
- Rotated 90°, 180°, or 270°
- Low resolution (faxes, old photocopies)
- Covered with watermarks, stamps, or noise
- Inconsistent in contrast and brightness
Garbage in, garbage out.
The Solution: Preprocessing Pipeline
The key insight was that OCR accuracy depends more on image quality than on the OCR engine itself. So I built a preprocessing pipeline that runs before Tesseract ever sees the image.
Step 1: Orientation Detection and Correction
Using OpenCV and some heuristics, the tool detects page orientation and rotates accordingly:
```python
import re

import pytesseract

# Simplified version of orientation detection
def detect_orientation(image):
    # Use Tesseract's OSD (Orientation and Script Detection)
    osd = pytesseract.image_to_osd(image)
    rotation = int(re.search(r'Rotate: (\d+)', osd).group(1))
    return rotation

def fix_orientation(image):
    rotation = detect_orientation(image)
    if rotation != 0:
        # OSD reports the clockwise rotation needed, so compensate
        image = rotate_image(image, 360 - rotation)
    return image
```
This alone fixed about 15% of my failed extractions.
Step 2: Deskewing
Even a 2-3° tilt kills OCR accuracy. The deskew algorithm detects text line angles and straightens the image:
```python
import cv2
import numpy as np

def deskew(image):
    # Convert to grayscale and detect edges
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150, apertureSize=3)

    # Detect lines using the Hough transform
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, 100,
                            minLineLength=100, maxLineGap=10)
    if lines is None:
        return image  # no lines detected; leave the page alone

    # Calculate the median text-line angle
    angles = [np.arctan2(y2 - y1, x2 - x1)
              for x1, y1, x2, y2 in lines[:, 0]]
    median_angle = np.median(angles)

    # Rotate by the opposite angle to straighten the text
    return rotate_image(image, -np.degrees(median_angle))
```
Step 3: Resolution Enhancement
Many scans come in at 72-150 DPI. Tesseract works best at 300+ DPI. I use intelligent upscaling:
```python
import cv2

def enhance_resolution(image, target_dpi=300):
    current_dpi = estimate_dpi(image)
    if current_dpi < target_dpi:
        scale_factor = target_dpi / current_dpi
        return cv2.resize(image, None, fx=scale_factor, fy=scale_factor,
                          interpolation=cv2.INTER_CUBIC)
    return image
```
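`estimate_dpi` is left undefined above. One pragmatic sketch, assuming the scan covers a full page of known physical width (8.5 inches for US Letter, a reasonable guess for contracts), is to infer DPI from pixel width; this helper and its default are my assumption, not the production implementation:

```python
def estimate_dpi(image, page_width_inches=8.5):
    """Rough DPI estimate from pixel width, assuming the scan spans a
    full page of known width (8.5 in for US Letter). Real scans may
    carry the true DPI in file metadata, which should win when present."""
    pixel_width = image.shape[1]
    return pixel_width / page_width_inches
```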
Step 4: Noise Removal and Contrast Enhancement
Watermarks, scanner artifacts, and faded text all interfere with recognition:
```python
import cv2

def clean_image(image):
    # Convert to grayscale
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # Denoise first, while the image is still grayscale
    gray = cv2.fastNlMeansDenoising(gray, h=10)

    # Apply adaptive thresholding to handle varying lighting
    return cv2.adaptiveThreshold(gray, 255,
                                 cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                 cv2.THRESH_BINARY, 11, 2)
```
The Secret Sauce: Invisible Text Layers
Here's where it gets interesting.
Most OCR tools give you extracted text—a plain .txt file or copied text. That's useful, but you lose all formatting, layout, and visual context.
I wanted something better: a PDF that looks exactly like the original but is fully searchable and selectable.
The solution is PDF text layers. You embed invisible, precisely-positioned text underneath the original image:
```python
from pypdf import PdfReader, PdfWriter

def add_text_layer(original_pdf, ocr_data):
    # ocr_data contains text + bounding box coordinates from Tesseract.
    # create_text_layer_pdf (built with reportlab) emits a PDF containing
    # only the invisible, positioned text.
    text_pdf = create_text_layer_pdf(ocr_data)

    # Merge: original image as background, text layer on top (but invisible)
    original = PdfReader(original_pdf)
    overlay = PdfReader(text_pdf)
    writer = PdfWriter()
    for page, text_page in zip(original.pages, overlay.pages):
        page.merge_page(text_page)
        writer.add_page(page)
    return writer
```
The text is there—screen readers can read it, Ctrl+F finds it, you can copy it—but visually the PDF is unchanged.
Tesseract LSTM Configuration
Vanilla Tesseract commands often miss the good stuff. Here's what I landed on:
```python
import pytesseract
from pytesseract import Output

custom_config = r'--oem 1 --psm 3 -l eng'
# --oem 1: Use LSTM neural network engine only (most accurate)
# --psm 3: Fully automatic page segmentation (works for most documents)
# -l eng: Language (supports 35+ languages, can combine: eng+spa)

text = pytesseract.image_to_string(processed_image, config=custom_config)

# For detailed position data (needed for text layer placement):
data = pytesseract.image_to_data(processed_image, config=custom_config,
                                 output_type=Output.DICT)
```
The image_to_data function returns bounding boxes for every word—essential for positioning the invisible text layer correctly.
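One detail worth knowing: `image_to_data` also returns a per-box `conf` field (-1 for structural, non-word boxes), which is useful for dropping low-confidence junk before building the text layer. A small filter sketch, with a threshold you would tune per document set (this helper and its default are illustrative):

```python
def filter_words(data, min_conf=40):
    """Keep only words Tesseract is reasonably confident about.
    `data` is the dict from image_to_data(..., output_type=Output.DICT);
    conf is per-box, reported as -1 for non-word structural boxes."""
    words = []
    for i, word in enumerate(data["text"]):
        conf = int(float(data["conf"][i]))
        if word.strip() and conf >= min_conf:
            words.append({
                "text": word,
                "left": data["left"][i],
                "top": data["top"][i],
                "width": data["width"][i],
                "height": data["height"][i],
            })
    return words
```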
Results
After implementing the full pipeline, I ran tests against the same 500-page document set:
| Method | Characters Extracted | Accuracy |
|---|---|---|
| Raw Tesseract | 127,453 | ~71% |
| With preprocessing | 580,342 | ~94% |
That's about 4.5× as much text extracted (the headline 456% figure), with dramatically fewer errors.
The biggest wins came from:
- Deskewing (fixed ~30% of errors)
- Resolution enhancement (fixed ~25% of errors)
- Proper LSTM configuration (fixed ~15% of errors)
The Production Version
I wrapped all of this into SearchablePDF.org—a web app where you can upload scanned PDFs and get back searchable versions.
Features:
- Automatic preprocessing: All the cleanup happens without user configuration
- 35+ languages: Including multi-language support for mixed documents
- Page selection: Process only the pages you need (1-10, 25, 40-50)
- Two OCR tiers: Standard (Tesseract LSTM) and Premium AI (99% accuracy for critical documents)
- Privacy-first: Files auto-delete after 24 hours
The free tier gives you 25 pages to test. Paid credits start at $0.05/page and never expire.
Tech Stack
For those curious:
- Backend: FastAPI (Python)
- OCR: Tesseract with pytesseract bindings
- Image processing: OpenCV, Pillow
- PDF manipulation: pypdf, reportlab
- Frontend: Next.js
- Infrastructure: Redis for job queuing
Try It / Break It
If you've got scanned PDFs that other tools have failed on, I'd genuinely like to know how SearchablePDF handles them. Edge cases help me improve the preprocessing pipeline.
And if you want to build something similar yourself, the core techniques are all above. The preprocessing pipeline is where most of the magic happens—Tesseract does the heavy lifting once you feed it clean images.
Questions? Drop them in the comments. Happy to go deeper on any part of the implementation.
