<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ashwin Singh</title>
    <description>The latest articles on DEV Community by Ashwin Singh (@ashwin_singh_304bc222ecbe).</description>
    <link>https://dev.to/ashwin_singh_304bc222ecbe</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3740317%2F578116a2-0917-44eb-9955-fbb5d92fb306.webp</url>
      <title>DEV Community: Ashwin Singh</title>
      <link>https://dev.to/ashwin_singh_304bc222ecbe</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ashwin_singh_304bc222ecbe"/>
    <language>en</language>
    <item>
      <title>How Parsifyx Processes 27 Document Formats Entirely in the Browser — No Server Required</title>
      <dc:creator>Ashwin Singh</dc:creator>
      <pubDate>Thu, 12 Feb 2026 10:56:20 +0000</pubDate>
      <link>https://dev.to/ashwin_singh_304bc222ecbe/how-parsifyx-processes-27-document-formats-entirely-in-the-browser-no-server-required-2ka9</link>
      <guid>https://dev.to/ashwin_singh_304bc222ecbe/how-parsifyx-processes-27-document-formats-entirely-in-the-browser-no-server-required-2ka9</guid>
      <description>&lt;p&gt;There's a class of web apps that looks simple on the surface but is doing something genuinely impressive under the hood. &lt;a href="https://parsifyx.com" rel="noopener noreferrer"&gt;Parsifyx&lt;/a&gt; is one of them.&lt;/p&gt;

&lt;p&gt;It's a document toolkit — PDF splitting, merging, conversion, compression, OCR, e-signing, form filling, ZIP handling — 27 tools total. Nothing revolutionary about the feature list. What's interesting is the architecture: &lt;strong&gt;every single operation runs client-side.&lt;/strong&gt; No file uploads. No server-side processing. No cloud functions. Your documents never leave the browser tab.&lt;/p&gt;

&lt;p&gt;As a developer, that immediately raised questions. How do you split a 200-page PDF in the browser without melting the tab? How do you run OCR without a backend? What does the conversion pipeline look like for &lt;code&gt;.docx&lt;/code&gt; → &lt;code&gt;.pdf&lt;/code&gt; when there's no LibreOffice instance to lean on?&lt;/p&gt;

&lt;p&gt;Let's break it down.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Stack: WebAssembly + JavaScript Libraries
&lt;/h2&gt;

&lt;p&gt;Parsifyx's architecture sits on top of a handful of battle-tested client-side libraries. Based on what's publicly inspectable in the browser:&lt;/p&gt;

&lt;h3&gt;
  
  
  PDF Manipulation — &lt;code&gt;pdf-lib&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/Marak/pdf-lib" rel="noopener noreferrer"&gt;pdf-lib&lt;/a&gt; is a pure JavaScript library for creating and modifying PDFs. No native dependencies, no server calls. It parses the PDF binary format directly in memory and exposes a clean API for operations like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Splitting by page ranges&lt;/li&gt;
&lt;li&gt;Merging multiple documents&lt;/li&gt;
&lt;li&gt;Removing, extracting, and reordering pages&lt;/li&gt;
&lt;li&gt;Rotating pages&lt;/li&gt;
&lt;li&gt;Editing metadata (title, author, keywords)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the backbone of most of Parsifyx's "Organize &amp;amp; Edit" tools. Because &lt;code&gt;pdf-lib&lt;/code&gt; operates on &lt;code&gt;Uint8Array&lt;/code&gt; buffers, the entire read → transform → export cycle stays in memory. The browser's &lt;code&gt;File&lt;/code&gt; API reads the input, &lt;code&gt;pdf-lib&lt;/code&gt; does the work, and a &lt;code&gt;Blob&lt;/code&gt; URL triggers the download. Zero network traffic.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Conceptual example: splitting a PDF with pdf-lib&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;PDFDocument&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;pdf-lib&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sourceBytes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arrayBuffer&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sourcePdf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;PDFDocument&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;sourceBytes&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;newPdf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;PDFDocument&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;newPdf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;copyPages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;sourcePdf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt; &lt;span class="c1"&gt;// copy first page&lt;/span&gt;
&lt;span class="nx"&gt;newPdf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addPage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;outputBytes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;newPdf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="nf"&gt;download&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;outputBytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;split-output.pdf&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
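&lt;p&gt;In a real split tool, the page indices handed to &lt;code&gt;copyPages&lt;/code&gt; come from a user-entered range string. A minimal sketch of that parsing step (a hypothetical helper, not Parsifyx's actual code):&lt;/p&gt;

```javascript
// Hypothetical helper: turn a 1-based range string like "1-3,7" into the
// 0-based page indices that pdf-lib's copyPages() expects.
// Ranges are clamped to the document's page count.
function parsePageRanges(input, pageCount) {
  const indices = [];
  for (const part of input.split(',')) {
    const bounds = part.trim().split('-').map(Number);
    const start = bounds[0];
    const end = Math.min(bounds[1] !== undefined ? bounds[1] : start, pageCount);
    // expand the run, shifting from 1-based page numbers to 0-based indices
    const run = Array.from({ length: end - start + 1 }, (_, k) => start + k - 1);
    indices.push(...run);
  }
  return indices;
}

// parsePageRanges('1-3,7', 10) → [0, 1, 2, 6]
```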



&lt;p&gt;No upload. No API key. No latency.&lt;/p&gt;

&lt;h3&gt;
  
  
  OCR — &lt;code&gt;Tesseract.js&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;This is where it gets more interesting. &lt;a href="https://github.com/naptha/tesseract.js" rel="noopener noreferrer"&gt;Tesseract.js&lt;/a&gt; is a WebAssembly port of Google's Tesseract OCR engine. It downloads trained language data (&lt;code&gt;.traineddata&lt;/code&gt; files) on first use, then runs the full recognition pipeline in a Web Worker.&lt;/p&gt;

&lt;p&gt;The architecture is smart: Tesseract.js spawns a worker thread so the main UI thread stays responsive while the WASM engine chews through pixel data. For Parsifyx's "Image to Text" and "Scan to Searchable PDF" tools, the flow looks roughly like:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;User drops in a scanned image or PDF&lt;/li&gt;
&lt;li&gt;If PDF, render pages to canvas using a PDF renderer (likely &lt;code&gt;pdf.js&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Pass the rasterized image data to the Tesseract.js worker&lt;/li&gt;
&lt;li&gt;Tesseract returns recognized text with bounding box coordinates&lt;/li&gt;
&lt;li&gt;For searchable PDFs: overlay an invisible text layer on top of the original scan&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That last step is the key UX win. The output PDF looks identical to the scan, but you can &lt;code&gt;Ctrl+F&lt;/code&gt; through it. All done locally.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;createWorker&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;tesseract.js&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;worker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;createWorker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;eng&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;worker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;recognize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;imageFile&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;worker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;terminate&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
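&lt;p&gt;The searchable-PDF step on top of this is mostly a coordinate problem: Tesseract reports word bounding boxes in image pixels with a top-left origin, while PDF user space uses points with a bottom-left origin. A sketch of that conversion (names are illustrative; &lt;code&gt;pxPerPoint&lt;/code&gt; is whatever scale the page was rasterized at):&lt;/p&gt;

```javascript
// Map a Tesseract word bbox ({x0, y0, x1, y1} in pixels, y growing downward)
// into PDF user space (points, y growing upward), so an invisible text run
// can be drawn exactly over the scanned word.
function bboxToPdfSpace(bbox, pageHeightPts, pxPerPoint) {
  return {
    x: bbox.x0 / pxPerPoint,
    // flip the y axis: PDF measures from the bottom of the page
    y: pageHeightPts - bbox.y1 / pxPerPoint,
    width: (bbox.x1 - bbox.x0) / pxPerPoint,
    height: (bbox.y1 - bbox.y0) / pxPerPoint,
  };
}
```

&lt;p&gt;Drawing each recognized word at these coordinates with an invisible rendering mode (or zero-opacity fill) is what makes the scan Ctrl+F-able.&lt;/p&gt;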



&lt;p&gt;The trade-off is the initial download of language data (~10-15MB for English). But once cached by the browser, subsequent runs are fast.&lt;/p&gt;

&lt;h3&gt;
  
  
  PDF Generation — &lt;code&gt;jsPDF&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;For conversion tools (Markdown → PDF, HTML → PDF, Image → PDF), Parsifyx likely uses &lt;a href="https://github.com/parallax/jsPDF" rel="noopener noreferrer"&gt;jsPDF&lt;/a&gt; or a combination of &lt;code&gt;jsPDF&lt;/code&gt; and &lt;code&gt;html2canvas&lt;/code&gt;. The pipeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;HTML/Markdown → PDF&lt;/strong&gt;: Parse the markup, render it to a virtual canvas or directly to jsPDF drawing commands, then serialize to PDF bytes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Image → PDF&lt;/strong&gt;: Read image dimensions, create a PDF page with matching dimensions, embed the image, export.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Office formats (Word, Excel, PowerPoint)&lt;/strong&gt;: This is trickier client-side. Libraries like &lt;a href="https://github.com/mwilliamson/mammoth.js" rel="noopener noreferrer"&gt;mammoth.js&lt;/a&gt; handle &lt;code&gt;.docx&lt;/code&gt; → HTML conversion, which can then be piped into the PDF generation step. For &lt;code&gt;.xlsx&lt;/code&gt;, &lt;a href="https://github.com/SheetJS/sheetjs" rel="noopener noreferrer"&gt;SheetJS&lt;/a&gt; parses the spreadsheet format. For &lt;code&gt;.pptx&lt;/code&gt;, similar XML-parsing approaches apply.&lt;/li&gt;
&lt;/ul&gt;
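&lt;p&gt;The Image → PDF case is easy to make concrete: pick a DPI, convert pixel dimensions into PDF points (72 points per inch), and that's your page size. A sketch of the arithmetic (illustrative, not Parsifyx's code):&lt;/p&gt;

```javascript
// Convert an image's pixel dimensions into a PDF page size in points
// (1 point = 1/72 inch), given the DPI the image should be laid out at.
function pageSizeForImage(widthPx, heightPx, dpi) {
  const POINTS_PER_INCH = 72;
  return {
    width: (widthPx / dpi) * POINTS_PER_INCH,
    height: (heightPx / dpi) * POINTS_PER_INCH,
  };
}

// A 1275x1650 px scan at 150 DPI maps to a 612x792 pt (US Letter) page.
```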

&lt;h3&gt;
  
  
  Compression
&lt;/h3&gt;

&lt;p&gt;PDF compression in the browser typically involves re-encoding embedded images at lower quality. A scanned document with uncompressed TIFF images inside the PDF can be dramatically reduced by re-encoding those images as compressed JPEG. Libraries can extract embedded image streams, re-compress them via the Canvas API's &lt;code&gt;toBlob()&lt;/code&gt; with a quality parameter, and re-embed them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Browser-native image recompression&lt;/span&gt;
&lt;span class="nx"&gt;canvas&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toBlob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;blob&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="cm"&gt;/* re-embed compressed image */&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;image/jpeg&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="mf"&gt;0.7&lt;/span&gt; &lt;span class="c1"&gt;// quality factor&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is why Parsifyx can shrink a 20MB scanned PDF down to 3MB without any server-side tooling.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Architecture Matters
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Privacy by construction, not by policy
&lt;/h3&gt;

&lt;p&gt;Most PDF tools publish privacy policies saying "we delete your files within 1 hour." That's a policy decision. It can be changed, breached, or circumvented. Parsifyx's approach is structurally private — there's no server endpoint to receive the file in the first place. You can verify this by opening DevTools → Network tab and watching for outbound requests during processing. There aren't any.&lt;/p&gt;

&lt;p&gt;This isn't just a nice-to-have. If you're handling HIPAA-covered documents, GDPR-sensitive data, legal contracts, or financial records, the difference between "we promise we delete it" and "it never left your machine" is the difference between compliance risk and no compliance risk.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Zero-latency processing
&lt;/h3&gt;

&lt;p&gt;Server-based PDF tools follow an &lt;code&gt;upload → queue → process → download&lt;/code&gt; cycle. Depending on file size and server load, that's anywhere from 5 to 30+ seconds. Client-side processing eliminates the upload and download legs entirely. For a 10MB PDF merge, the bottleneck is JavaScript execution speed, not network bandwidth. On a modern machine, that's sub-second.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Offline capability
&lt;/h3&gt;

&lt;p&gt;Once the page and its WASM/JS dependencies are cached, the tools work offline. This is a natural side effect of the architecture — if nothing requires a server, nothing breaks when the server is unreachable. For developers working on planes, in cafés with flaky WiFi, or in air-gapped environments, this is a real advantage.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. No infrastructure cost scaling
&lt;/h3&gt;

&lt;p&gt;This is the part that should interest anyone building SaaS tools. Traditional document processing services need to scale server capacity with user volume. More users = more CPU/RAM for PDF processing = higher cloud bills. When processing runs on the client, the "server" is every user's own machine. The infrastructure cost of serving 1,000 users and 100,000 users is essentially the same — you're just serving static assets.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limitations of the Client-Side Approach
&lt;/h2&gt;

&lt;p&gt;It's not all upside. There are real constraints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Memory limits&lt;/strong&gt;: Browsers have memory ceilings. Processing a 500-page, image-heavy PDF might hit those limits on low-RAM devices. Server-side tools can throw more hardware at the problem.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Format fidelity&lt;/strong&gt;: Server-side conversion tools like LibreOffice have decades of format-parsing logic. Client-side JS libraries are good but can struggle with complex &lt;code&gt;.docx&lt;/code&gt; layouts (nested tables, embedded OLE objects, exotic fonts).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Initial load&lt;/strong&gt;: WASM modules and language data for OCR add to the initial page weight. This is mitigated by lazy loading and caching, but the first run is heavier than subsequent ones.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No batch automation&lt;/strong&gt;: There's no API to call programmatically. If you need to convert 10,000 invoices, you need a server-side pipeline. Parsifyx is built for interactive, one-off document tasks.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Takeaways for Developers
&lt;/h2&gt;

&lt;p&gt;Parsifyx is a clean case study in what's possible with modern browser APIs. A few patterns worth noting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;WebAssembly for compute-heavy work&lt;/strong&gt;: OCR, compression, and PDF parsing are CPU-intensive. WASM makes them viable in the browser without the UX penalty of blocking the main thread.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Web Workers for responsiveness&lt;/strong&gt;: Offloading heavy processing to workers keeps the UI snappy. If your app does any non-trivial computation, workers aren't optional — they're essential.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The &lt;code&gt;File&lt;/code&gt; API + &lt;code&gt;Blob&lt;/code&gt; URLs for zero-upload workflows&lt;/strong&gt;: Reading files locally, processing them in memory, and triggering downloads via &lt;code&gt;Blob&lt;/code&gt; URLs is a powerful pattern that eliminates entire categories of privacy and infrastructure concerns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Privacy as architecture, not policy&lt;/strong&gt;: If your product handles sensitive data, consider whether the processing &lt;em&gt;needs&lt;/em&gt; to happen on your server. If it doesn't, moving it to the client is a stronger privacy guarantee than any policy you can write.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;If you work with documents — and if you're a developer, you do — bookmark &lt;a href="https://parsifyx.com" rel="noopener noreferrer"&gt;parsifyx.com&lt;/a&gt;. It's fast, it's free, there's no signup, and it respects your data by never touching it in the first place.&lt;/p&gt;

&lt;p&gt;Open DevTools while you use it. It's a good learning exercise.&lt;/p&gt;

</description>
      <category>webassembly</category>
      <category>nextjs</category>
      <category>webdev</category>
    </item>
    <item>
      <title>I Built an OCR Tool That Extracts 4.5x More Text Than Tesseract Alone</title>
      <dc:creator>Ashwin Singh</dc:creator>
      <pubDate>Thu, 29 Jan 2026 18:41:25 +0000</pubDate>
      <link>https://dev.to/ashwin_singh_304bc222ecbe/i-built-an-ocr-tool-that-extracts-45x-more-text-than-tesseract-alone-n1b</link>
      <guid>https://dev.to/ashwin_singh_304bc222ecbe/i-built-an-ocr-tool-that-extracts-45x-more-text-than-tesseract-alone-n1b</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmhn50taqnnmw9z4h5bgk.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmhn50taqnnmw9z4h5bgk.jpg" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;
Last year I got frustrated.&lt;/p&gt;

&lt;p&gt;I was trying to digitize a stack of old contracts for a family business. Scanned PDFs—hundreds of pages. I needed to find specific clauses across all of them, but Ctrl+F returned nothing because every page was just an image.&lt;/p&gt;

&lt;p&gt;"No problem," I thought. "I'll just run them through Tesseract."&lt;/p&gt;

&lt;p&gt;The output was... disappointing. Missed words everywhere. Garbled text from slightly tilted pages. Complete failures on low-res scans. I spent more time fixing OCR errors than I would have spent reading the documents manually.&lt;/p&gt;

&lt;p&gt;So I built something better. It's called &lt;a href="https://searchablepdf.org" rel="noopener noreferrer"&gt;SearchablePDF.org&lt;/a&gt;, and it extracts up to 456% more text than vanilla Tesseract. Here's how.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Problem With Basic OCR
&lt;/h2&gt;

&lt;p&gt;Tesseract is incredible technology. But feed it a real-world scanned document and you'll quickly discover its limitations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input: Slightly tilted scan of a contract
Expected: "This Agreement shall terminate on December 31, 2024"
Actual: "Th1s Agr33ment sha11 terminat3 0n Decemb3r 31, 2O24"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The issue isn't Tesseract—it's the input. OCR engines expect clean, properly oriented, high-contrast images. Real scans are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tilted by a few degrees&lt;/li&gt;
&lt;li&gt;Rotated 90°, 180°, or 270°&lt;/li&gt;
&lt;li&gt;Low resolution (faxes, old photocopies)&lt;/li&gt;
&lt;li&gt;Covered with watermarks, stamps, or noise&lt;/li&gt;
&lt;li&gt;Inconsistent in contrast and brightness&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Garbage in, garbage out.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution: Preprocessing Pipeline
&lt;/h2&gt;

&lt;p&gt;The key insight was that OCR accuracy depends more on image quality than on the OCR engine itself. So I built a preprocessing pipeline that runs before Tesseract ever sees the image.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Orientation Detection and Correction
&lt;/h3&gt;

&lt;p&gt;Using Tesseract's built-in orientation detection (OSD) plus some OpenCV helpers, the tool detects page orientation and rotates accordingly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Simplified version of orientation detection
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;detect_orientation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Use Tesseract's OSD (Orientation and Script Detection)
&lt;/span&gt;    &lt;span class="n"&gt;osd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pytesseract&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;image_to_osd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;rotation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Rotate: (\d+)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;osd&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;rotation&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fix_orientation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;rotation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;detect_orientation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;rotation&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;rotate_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;360&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;rotation&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This alone fixed about 15% of my failed extractions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Deskewing
&lt;/h3&gt;

&lt;p&gt;Even a 2-3° tilt kills OCR accuracy. The deskew algorithm detects text line angles and straightens the image:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;deskew&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Convert to grayscale and detect edges
&lt;/span&gt;    &lt;span class="n"&gt;gray&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cvtColor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;COLOR_BGR2GRAY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;edges&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Canny&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gray&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;150&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;apertureSize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Detect lines using Hough transform
&lt;/span&gt;    &lt;span class="n"&gt;lines&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;HoughLinesP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;edges&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pi&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;180&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                            &lt;span class="n"&gt;minLineLength&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;maxLineGap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Calculate median angle
&lt;/span&gt;    &lt;span class="n"&gt;angles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arctan2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y2&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;y1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x2&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;x1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;x2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y2&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
    &lt;span class="n"&gt;median_angle&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;median&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;angles&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Rotate to correct
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;rotate_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;degrees&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;median_angle&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Resolution Enhancement
&lt;/h3&gt;

&lt;p&gt;Many scans come in at 72-150 DPI. Tesseract works best at 300+ DPI. I use intelligent upscaling:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;enhance_resolution&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_dpi&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;current_dpi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;estimate_dpi&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current_dpi&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;target_dpi&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;scale_factor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;target_dpi&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;current_dpi&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fx&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;scale_factor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;scale_factor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                         &lt;span class="n"&gt;interpolation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INTER_CUBIC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
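&lt;p&gt;The &lt;code&gt;estimate_dpi&lt;/code&gt; helper above isn't shown. One crude but serviceable heuristic, sketched here with a simplified signature (it takes the pixel width directly rather than an image) and the assumption that the scan covers a full letter-width page:&lt;/p&gt;

```python
def estimate_dpi(image_width_px, page_width_in=8.5):
    # If the scan spans a full letter-width page (8.5 in),
    # DPI is just horizontal pixels divided by physical width.
    return image_width_px / page_width_in

# A 2550 px wide letter-size scan comes out to 300 DPI
```

&lt;p&gt;Real inputs vary, so treat this as a rough floor: when the source PDF carries page-size metadata, deriving DPI from it is more reliable than guessing the paper size.&lt;/p&gt;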



&lt;h3&gt;
  
  
  Step 4: Noise Removal and Contrast Enhancement
&lt;/h3&gt;

&lt;p&gt;Watermarks, scanner artifacts, and faded text all interfere with recognition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;clean_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Convert to grayscale
&lt;/span&gt;    &lt;span class="n"&gt;gray&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cvtColor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;COLOR_BGR2GRAY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Apply adaptive thresholding for varying lighting
&lt;/span&gt;    &lt;span class="n"&gt;cleaned&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;adaptiveThreshold&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gray&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;255&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                                     &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ADAPTIVE_THRESH_GAUSSIAN_C&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                     &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;THRESH_BINARY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Denoise
&lt;/span&gt;    &lt;span class="n"&gt;cleaned&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fastNlMeansDenoising&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cleaned&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cleaned&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
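&lt;p&gt;Each step feeds the next, so the glue code is a one-line loop. A minimal runner (assuming &lt;code&gt;deskew&lt;/code&gt; and the other step functions from earlier in the post):&lt;/p&gt;

```python
def preprocess(image, steps):
    # Apply each cleanup step in order, feeding its output to the next
    for step in steps:
        image = step(image)
    return image

# In the real pipeline, something like:
#   preprocess(img, [deskew, enhance_resolution, clean_image])
```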



&lt;h2&gt;
  
  
  The Secret Sauce: Invisible Text Layers
&lt;/h2&gt;

&lt;p&gt;Here's where it gets interesting.&lt;/p&gt;

&lt;p&gt;Most OCR tools give you extracted text—a plain &lt;code&gt;.txt&lt;/code&gt; file or copied text. That's useful, but you lose all formatting, layout, and visual context.&lt;/p&gt;

&lt;p&gt;I wanted something better: a PDF that looks exactly like the original but is fully searchable and selectable.&lt;/p&gt;

&lt;p&gt;The solution is PDF text layers. You embed invisible, precisely-positioned text underneath the original image:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pypdf&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PdfReader&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;PdfWriter&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;reportlab.pdfgen&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;canvas&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;reportlab.lib.pagesizes&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;letter&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add_text_layer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;original_pdf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ocr_data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# ocr_data contains text + bounding box coordinates from Tesseract
&lt;/span&gt;
    &lt;span class="c1"&gt;# Create a new PDF with just the text layer
&lt;/span&gt;    &lt;span class="n"&gt;text_pdf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_text_layer_pdf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ocr_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Merge: original image as background, text layer on top (but invisible)
&lt;/span&gt;    &lt;span class="n"&gt;merger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PdfMerger&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="c1"&gt;# ... merge logic
&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;searchable_pdf&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
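&lt;p&gt;The fiddliest part of the elided &lt;code&gt;create_text_layer_pdf&lt;/code&gt; is the coordinate math: Tesseract reports pixel boxes measured from the top-left of the image, while PDF places text in points (72 per inch) from the bottom-left of the page. A sketch of that conversion (the function name is mine, not from the original code):&lt;/p&gt;

```python
def tesseract_box_to_pdf(left, top, height, page_height_pts, dpi=300):
    # Tesseract boxes: pixels, origin at top-left.
    # PDF text: points, origin at bottom-left, placed at the baseline,
    # so flip the y-axis and drop down by the box height.
    scale = 72.0 / dpi
    x = left * scale
    y = page_height_pts - (top + height) * scale
    return x, y
```

&lt;p&gt;The invisibility itself comes from PDF text render mode 3; in reportlab that's &lt;code&gt;textobject.setTextRenderMode(3)&lt;/code&gt; before writing each word.&lt;/p&gt;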



&lt;p&gt;The text is there—screen readers can read it, Ctrl+F finds it, you can copy it—but visually the PDF is unchanged.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tesseract LSTM Configuration
&lt;/h2&gt;

&lt;p&gt;Tesseract's default settings leave a lot of accuracy on the table. Here's the configuration I landed on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;custom_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--oem 1 --psm 3 -l eng&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

&lt;span class="c1"&gt;# --oem 1: Use LSTM neural network engine only (most accurate)
# --psm 3: Fully automatic page segmentation (works for most documents)
# -l eng: Language (supports 35+ languages, can combine: eng+spa)
&lt;/span&gt;
&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pytesseract&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;image_to_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;processed_image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;custom_config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# For detailed position data (needed for text layer placement):
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pytesseract&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;image_to_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;processed_image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;custom_config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                                  &lt;span class="n"&gt;output_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DICT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;image_to_data&lt;/code&gt; function returns bounding boxes for every word—essential for positioning the invisible text layer correctly.&lt;/p&gt;
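&lt;p&gt;If you haven't used it: &lt;code&gt;image_to_data&lt;/code&gt; with &lt;code&gt;Output.DICT&lt;/code&gt; returns parallel lists, one entry per detected box, with &lt;code&gt;conf&lt;/code&gt; set to -1 for structural (non-word) rows. The sample values below are invented for illustration, but the keys match pytesseract's real output:&lt;/p&gt;

```python
# Illustrative slice of image_to_data(..., output_type=Output.DICT)
data = {
    "text":   ["Invoice", "", "#1234"],
    "conf":   [96, -1, 88],
    "left":   [40, 0, 150],
    "top":    [50, 0, 50],
    "width":  [90, 0, 60],
    "height": [18, 0, 18],
}

# Keep only real words with positive confidence before building the layer
words = [
    (data["text"][i], data["left"][i], data["top"][i], data["height"][i])
    for i in range(len(data["text"]))
    if data["text"][i].strip() and int(data["conf"][i]) > 0
]
# words -> [("Invoice", 40, 50, 18), ("#1234", 150, 50, 18)]
```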

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;After implementing the full pipeline, I ran tests against the same 500-page document set:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Characters Extracted&lt;/th&gt;
&lt;th&gt;Accuracy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Raw Tesseract&lt;/td&gt;
&lt;td&gt;127,453&lt;/td&gt;
&lt;td&gt;~71%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;With preprocessing&lt;/td&gt;
&lt;td&gt;580,342&lt;/td&gt;
&lt;td&gt;~94%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That's roughly 4.5× as much text extracted (about 355% more), with dramatically fewer errors.&lt;/p&gt;

&lt;p&gt;The biggest wins came from:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Deskewing (fixed ~30% of errors)&lt;/li&gt;
&lt;li&gt;Resolution enhancement (fixed ~25% of errors)&lt;/li&gt;
&lt;li&gt;Proper LSTM configuration (fixed ~15% of errors)&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Production Version
&lt;/h2&gt;

&lt;p&gt;I wrapped all of this into &lt;a href="https://searchablepdf.org" rel="noopener noreferrer"&gt;SearchablePDF.org&lt;/a&gt;—a web app where you can upload scanned PDFs and get back searchable versions.&lt;/p&gt;

&lt;p&gt;Features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Automatic preprocessing&lt;/strong&gt;: All the cleanup happens without user configuration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;35+ languages&lt;/strong&gt;: Including multi-language support for mixed documents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Page selection&lt;/strong&gt;: Process only the pages you need (&lt;code&gt;1-10, 25, 40-50&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Two OCR tiers&lt;/strong&gt;: Standard (Tesseract LSTM) and Premium AI (99% accuracy for critical documents)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Privacy-first&lt;/strong&gt;: Files auto-delete after 24 hours&lt;/li&gt;
&lt;/ul&gt;
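&lt;p&gt;The page-selection syntax is simple enough to parse in a few lines. SearchablePDF's actual parser isn't shown anywhere in this post, so this is a hypothetical implementation of the same &lt;code&gt;1-10, 25, 40-50&lt;/code&gt; grammar:&lt;/p&gt;

```python
def parse_page_ranges(spec):
    # "1-10, 25, 40-50" -> sorted, de-duplicated page numbers
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = part.split("-", 1)
            pages.update(range(int(start), int(end) + 1))
        elif part:
            pages.add(int(part))
    return sorted(pages)

# parse_page_ranges("1-3, 25, 2") -> [1, 2, 3, 25]
```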

&lt;p&gt;The free tier gives you 25 pages to test. Paid credits start at $0.05/page and never expire.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tech Stack
&lt;/h2&gt;

&lt;p&gt;For those curious:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Backend&lt;/strong&gt;: FastAPI (Python)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OCR&lt;/strong&gt;: Tesseract with pytesseract bindings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Image processing&lt;/strong&gt;: OpenCV, Pillow&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PDF manipulation&lt;/strong&gt;: pypdf, reportlab&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Frontend&lt;/strong&gt;: Next.js&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure&lt;/strong&gt;: Redis for job queuing&lt;/li&gt;
&lt;/ul&gt;
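&lt;p&gt;Redis's job here is decoupling: the web request enqueues an OCR job and returns immediately, and a worker processes it in the background. The actual Redis wiring isn't shown in this post, so here's the same producer/worker pattern sketched with Python's standard library in a single process:&lt;/p&gt;

```python
import queue
import threading

jobs = queue.Queue()
results = {}

def worker():
    # Pull jobs until a None sentinel arrives
    while True:
        job_id, pdf_path = jobs.get()
        if job_id is None:
            break
        # Stand-in for the real OCR pipeline
        results[job_id] = f"searchable:{pdf_path}"
        jobs.task_done()

t = threading.Thread(target=worker)
t.start()
jobs.put(("job-1", "scan.pdf"))
jobs.join()             # wait for the queue to drain
jobs.put((None, None))  # shut the worker down
t.join()
```

&lt;p&gt;With Redis the shape is identical, except the queue survives process restarts and the workers can live on different machines.&lt;/p&gt;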

&lt;h2&gt;
  
  
  Try It / Break It
&lt;/h2&gt;

&lt;p&gt;If you've got scanned PDFs that other tools have failed on, I'd genuinely like to know how SearchablePDF handles them. Edge cases help me improve the preprocessing pipeline.&lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://searchablepdf.org" rel="noopener noreferrer"&gt;searchablepdf.org&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And if you want to build something similar yourself, the core techniques are all above. The preprocessing pipeline is where most of the magic happens—Tesseract does the heavy lifting once you feed it clean images.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Questions?&lt;/strong&gt; Drop them in the comments. Happy to go deeper on any part of the implementation.&lt;/p&gt;

</description>
      <category>ocr</category>
      <category>pdf</category>
    </item>
  </channel>
</rss>
