<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ashwin Singh</title>
    <description>The latest articles on DEV Community by Ashwin Singh (@ashwin_singh_304bc222ecbe).</description>
    <link>https://dev.to/ashwin_singh_304bc222ecbe</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3740317%2F578116a2-0917-44eb-9955-fbb5d92fb306.webp</url>
      <title>DEV Community: Ashwin Singh</title>
      <link>https://dev.to/ashwin_singh_304bc222ecbe</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ashwin_singh_304bc222ecbe"/>
    <language>en</language>
    <item>
      <title>How Parsifyx Processes 27 Document Formats Entirely in the Browser — No Server Required</title>
      <dc:creator>Ashwin Singh</dc:creator>
      <pubDate>Thu, 12 Feb 2026 10:56:20 +0000</pubDate>
      <link>https://dev.to/ashwin_singh_304bc222ecbe/how-parsifyx-processes-27-document-formats-entirely-in-the-browser-no-server-required-2ka9</link>
      <guid>https://dev.to/ashwin_singh_304bc222ecbe/how-parsifyx-processes-27-document-formats-entirely-in-the-browser-no-server-required-2ka9</guid>
      <description>&lt;p&gt;There's a class of web apps that looks simple on the surface but is doing something genuinely impressive under the hood. &lt;a href="https://parsifyx.com" rel="noopener noreferrer"&gt;Parsifyx&lt;/a&gt; is one of them.&lt;/p&gt;

&lt;p&gt;It's a document toolkit — PDF splitting, merging, conversion, compression, OCR, e-signing, form filling, ZIP handling — 27 tools total. Nothing revolutionary about the feature list. What's interesting is the architecture: &lt;strong&gt;every single operation runs client-side.&lt;/strong&gt; No file uploads. No server-side processing. No cloud functions. Your documents never leave the browser tab.&lt;/p&gt;

&lt;p&gt;As a developer, that immediately raised questions. How do you split a 200-page PDF in the browser without melting the tab? How do you run OCR without a backend? What does the conversion pipeline look like for &lt;code&gt;.docx&lt;/code&gt; → &lt;code&gt;.pdf&lt;/code&gt; when there's no LibreOffice instance to lean on?&lt;/p&gt;

&lt;p&gt;Let's break it down.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Stack: WebAssembly + JavaScript Libraries
&lt;/h2&gt;

&lt;p&gt;Parsifyx's architecture sits on top of a handful of battle-tested client-side libraries. Based on what's publicly inspectable in the browser:&lt;/p&gt;

&lt;h3&gt;
  
  
  PDF Manipulation — &lt;code&gt;pdf-lib&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/Marak/pdf-lib" rel="noopener noreferrer"&gt;pdf-lib&lt;/a&gt; is a pure JavaScript library for creating and modifying PDFs. No native dependencies, no server calls. It parses the PDF binary format directly in memory and exposes a clean API for operations like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Splitting by page ranges&lt;/li&gt;
&lt;li&gt;Merging multiple documents&lt;/li&gt;
&lt;li&gt;Removing, extracting, and reordering pages&lt;/li&gt;
&lt;li&gt;Rotating pages&lt;/li&gt;
&lt;li&gt;Editing metadata (title, author, keywords)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the backbone of most of Parsifyx's "Organize &amp;amp; Edit" tools. Because &lt;code&gt;pdf-lib&lt;/code&gt; operates on &lt;code&gt;Uint8Array&lt;/code&gt; buffers, the entire read → transform → export cycle stays in memory. The browser's &lt;code&gt;File&lt;/code&gt; API reads the input, &lt;code&gt;pdf-lib&lt;/code&gt; does the work, and a &lt;code&gt;Blob&lt;/code&gt; URL triggers the download. Zero network traffic.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Conceptual example: splitting a PDF with pdf-lib&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;PDFDocument&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;pdf-lib&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sourceBytes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arrayBuffer&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sourcePdf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;PDFDocument&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;sourceBytes&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;newPdf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;PDFDocument&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;newPdf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;copyPages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;sourcePdf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt; &lt;span class="c1"&gt;// copy first page&lt;/span&gt;
&lt;span class="nx"&gt;newPdf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addPage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;outputBytes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;newPdf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="nf"&gt;download&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;outputBytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;split-output.pdf&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
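&lt;p&gt;In a real split tool, the page indices handed to &lt;code&gt;copyPages&lt;/code&gt; come from a user-entered range string. A minimal sketch of that parsing step (a hypothetical helper, not Parsifyx's actual code):&lt;/p&gt;

```javascript
// Hypothetical helper: turn a 1-based range string like "1-3,7" into the
// 0-based page indices that pdf-lib's copyPages() expects.
// Ranges are clamped to the document's page count.
function parsePageRanges(input, pageCount) {
  const indices = [];
  for (const part of input.split(',')) {
    const bounds = part.trim().split('-').map(Number);
    const start = bounds[0];
    const end = Math.min(bounds[1] !== undefined ? bounds[1] : start, pageCount);
    // expand the run, shifting from 1-based page numbers to 0-based indices
    const run = Array.from({ length: end - start + 1 }, (_, k) => start + k - 1);
    indices.push(...run);
  }
  return indices;
}

// parsePageRanges('1-3,7', 10) → [0, 1, 2, 6]
```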



&lt;p&gt;No upload. No API key. No latency.&lt;/p&gt;

&lt;h3&gt;
  
  
  OCR — &lt;code&gt;Tesseract.js&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;This is where it gets more interesting. &lt;a href="https://github.com/naptha/tesseract.js" rel="noopener noreferrer"&gt;Tesseract.js&lt;/a&gt; is a WebAssembly port of Google's Tesseract OCR engine. It downloads trained language data (&lt;code&gt;.traineddata&lt;/code&gt; files) on first use, then runs the full recognition pipeline in a Web Worker.&lt;/p&gt;

&lt;p&gt;The architecture is smart: Tesseract.js spawns a worker thread so the main UI thread stays responsive while the WASM engine chews through pixel data. For Parsifyx's "Image to Text" and "Scan to Searchable PDF" tools, the flow looks roughly like:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;User drops in a scanned image or PDF&lt;/li&gt;
&lt;li&gt;If PDF, render pages to canvas using a PDF renderer (likely &lt;code&gt;pdf.js&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Pass the rasterized image data to the Tesseract.js worker&lt;/li&gt;
&lt;li&gt;Tesseract returns recognized text with bounding box coordinates&lt;/li&gt;
&lt;li&gt;For searchable PDFs: overlay an invisible text layer on top of the original scan&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That last step is the key UX win. The output PDF looks identical to the scan, but you can &lt;code&gt;Ctrl+F&lt;/code&gt; through it. All done locally.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;createWorker&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;tesseract.js&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;worker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;createWorker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;eng&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;worker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;recognize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;imageFile&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;worker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;terminate&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
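&lt;p&gt;The searchable-PDF step on top of this is mostly a coordinate problem: Tesseract reports word bounding boxes in image pixels with a top-left origin, while PDF user space uses points with a bottom-left origin. A sketch of that conversion (names are illustrative; &lt;code&gt;pxPerPoint&lt;/code&gt; is whatever scale the page was rasterized at):&lt;/p&gt;

```javascript
// Map a Tesseract word bbox ({x0, y0, x1, y1} in pixels, y growing downward)
// into PDF user space (points, y growing upward), so an invisible text run
// can be drawn exactly over the scanned word.
function bboxToPdfSpace(bbox, pageHeightPts, pxPerPoint) {
  return {
    x: bbox.x0 / pxPerPoint,
    // flip the y axis: PDF measures from the bottom of the page
    y: pageHeightPts - bbox.y1 / pxPerPoint,
    width: (bbox.x1 - bbox.x0) / pxPerPoint,
    height: (bbox.y1 - bbox.y0) / pxPerPoint,
  };
}
```

&lt;p&gt;Drawing each recognized word at these coordinates with an invisible rendering mode (or zero-opacity fill) is what makes the scan Ctrl+F-able.&lt;/p&gt;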



&lt;p&gt;The trade-off is the initial download of language data (~10-15MB for English). But once cached by the browser, subsequent runs are fast.&lt;/p&gt;

&lt;h3&gt;
  
  
  PDF Generation — &lt;code&gt;jsPDF&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;For conversion tools (Markdown → PDF, HTML → PDF, Image → PDF), Parsifyx likely uses &lt;a href="https://github.com/parallax/jsPDF" rel="noopener noreferrer"&gt;jsPDF&lt;/a&gt; or a combination of &lt;code&gt;jsPDF&lt;/code&gt; and &lt;code&gt;html2canvas&lt;/code&gt;. The pipeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;HTML/Markdown → PDF&lt;/strong&gt;: Parse the markup, render it to a virtual canvas or directly to jsPDF drawing commands, then serialize to PDF bytes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Image → PDF&lt;/strong&gt;: Read image dimensions, create a PDF page with matching dimensions, embed the image, export.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Office formats (Word, Excel, PowerPoint)&lt;/strong&gt;: This is trickier client-side. Libraries like &lt;a href="https://github.com/mwilliamson/mammoth.js" rel="noopener noreferrer"&gt;mammoth.js&lt;/a&gt; handle &lt;code&gt;.docx&lt;/code&gt; → HTML conversion, which can then be piped into the PDF generation step. For &lt;code&gt;.xlsx&lt;/code&gt;, &lt;a href="https://github.com/SheetJS/sheetjs" rel="noopener noreferrer"&gt;SheetJS&lt;/a&gt; parses the spreadsheet format. For &lt;code&gt;.pptx&lt;/code&gt;, similar XML-parsing approaches apply.&lt;/li&gt;
&lt;/ul&gt;
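&lt;p&gt;The Image → PDF case is easy to make concrete: pick a DPI, convert pixel dimensions into PDF points (72 points per inch), and that's your page size. A sketch of the arithmetic (illustrative, not Parsifyx's code):&lt;/p&gt;

```javascript
// Convert an image's pixel dimensions into a PDF page size in points
// (1 point = 1/72 inch), given the DPI the image should be laid out at.
function pageSizeForImage(widthPx, heightPx, dpi) {
  const POINTS_PER_INCH = 72;
  return {
    width: (widthPx / dpi) * POINTS_PER_INCH,
    height: (heightPx / dpi) * POINTS_PER_INCH,
  };
}

// A 1275x1650 px scan at 150 DPI maps to a 612x792 pt (US Letter) page.
```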

&lt;h3&gt;
  
  
  Compression
&lt;/h3&gt;

&lt;p&gt;PDF compression in the browser typically involves re-encoding embedded images at lower quality. A scanned document with uncompressed TIFF images inside the PDF can be dramatically reduced by re-encoding those images as compressed JPEG. Libraries can extract embedded image streams, re-compress them via the Canvas API's &lt;code&gt;toBlob()&lt;/code&gt; with a quality parameter, and re-embed them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Browser-native image recompression&lt;/span&gt;
&lt;span class="nx"&gt;canvas&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toBlob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;blob&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="cm"&gt;/* re-embed compressed image */&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;image/jpeg&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="mf"&gt;0.7&lt;/span&gt; &lt;span class="c1"&gt;// quality factor&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is why Parsifyx can shrink a 20MB scanned PDF down to 3MB without any server-side tooling.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Architecture Matters
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Privacy by construction, not by policy
&lt;/h3&gt;

&lt;p&gt;Most PDF tools publish privacy policies saying "we delete your files within 1 hour." That's a policy decision. It can be changed, breached, or circumvented. Parsifyx's approach is structurally private — there's no server endpoint to receive the file in the first place. You can verify this by opening DevTools → Network tab and watching for outbound requests during processing. There aren't any.&lt;/p&gt;

&lt;p&gt;This isn't just a nice-to-have. If you're handling HIPAA-covered documents, GDPR-sensitive data, legal contracts, or financial records, the difference between "we promise we delete it" and "it never left your machine" is the difference between compliance risk and no compliance risk.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Zero-latency processing
&lt;/h3&gt;

&lt;p&gt;Server-based PDF tools follow an &lt;code&gt;upload → queue → process → download&lt;/code&gt; cycle. Depending on file size and server load, that's anywhere from 5 to 30+ seconds. Client-side processing eliminates the upload and download legs entirely. For a 10MB PDF merge, the bottleneck is JavaScript execution speed, not network bandwidth. On a modern machine, that's sub-second.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Offline capability
&lt;/h3&gt;

&lt;p&gt;Once the page and its WASM/JS dependencies are cached, the tools work offline. This is a natural side effect of the architecture — if nothing requires a server, nothing breaks when the server is unreachable. For developers working on planes, in cafés with flaky WiFi, or in air-gapped environments, this is a real advantage.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. No infrastructure cost scaling
&lt;/h3&gt;

&lt;p&gt;This is the part that should interest anyone building SaaS tools. Traditional document processing services need to scale server capacity with user volume. More users = more CPU/RAM for PDF processing = higher cloud bills. When processing runs on the client, the "server" is every user's own machine. The infrastructure cost of serving 1,000 users and 100,000 users is essentially the same — you're just serving static assets.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limitations of the Client-Side Approach
&lt;/h2&gt;

&lt;p&gt;It's not all upside. There are real constraints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Memory limits&lt;/strong&gt;: Browsers have memory ceilings. Processing a 500-page, image-heavy PDF might hit those limits on low-RAM devices. Server-side tools can throw more hardware at the problem.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Format fidelity&lt;/strong&gt;: Server-side conversion tools like LibreOffice have decades of format-parsing logic. Client-side JS libraries are good but can struggle with complex &lt;code&gt;.docx&lt;/code&gt; layouts (nested tables, embedded OLE objects, exotic fonts).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Initial load&lt;/strong&gt;: WASM modules and language data for OCR add to the initial page weight. This is mitigated by lazy loading and caching, but the first run is heavier than subsequent ones.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No batch automation&lt;/strong&gt;: There's no API to call programmatically. If you need to convert 10,000 invoices, you need a server-side pipeline. Parsifyx is built for interactive, one-off document tasks.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Takeaways for Developers
&lt;/h2&gt;

&lt;p&gt;Parsifyx is a clean case study in what's possible with modern browser APIs. A few patterns worth noting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;WebAssembly for compute-heavy work&lt;/strong&gt;: OCR, compression, and PDF parsing are CPU-intensive. WASM makes them viable in the browser without the UX penalty of blocking the main thread.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Web Workers for responsiveness&lt;/strong&gt;: Offloading heavy processing to workers keeps the UI snappy. If your app does any non-trivial computation, workers aren't optional — they're essential.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The &lt;code&gt;File&lt;/code&gt; API + &lt;code&gt;Blob&lt;/code&gt; URLs for zero-upload workflows&lt;/strong&gt;: Reading files locally, processing them in memory, and triggering downloads via &lt;code&gt;Blob&lt;/code&gt; URLs is a powerful pattern that eliminates entire categories of privacy and infrastructure concerns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Privacy as architecture, not policy&lt;/strong&gt;: If your product handles sensitive data, consider whether the processing &lt;em&gt;needs&lt;/em&gt; to happen on your server. If it doesn't, moving it to the client is a stronger privacy guarantee than any policy you can write.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;If you work with documents — and if you're a developer, you do — bookmark &lt;a href="https://parsifyx.com" rel="noopener noreferrer"&gt;parsifyx.com&lt;/a&gt;. It's fast, it's free, there's no signup, and it respects your data by never touching it in the first place.&lt;/p&gt;

&lt;p&gt;Open DevTools while you use it. It's a good learning exercise.&lt;/p&gt;

</description>
      <category>webassembly</category>
      <category>nextjs</category>
      <category>webdev</category>
    </item>
    <item>
      <title>I Built an OCR Tool That Extracts 4.5x More Text Than Tesseract Alone</title>
      <dc:creator>Ashwin Singh</dc:creator>
      <pubDate>Thu, 29 Jan 2026 18:41:25 +0000</pubDate>
      <link>https://dev.to/ashwin_singh_304bc222ecbe/i-built-an-ocr-tool-that-extracts-45x-more-text-than-tesseract-alone-n1b</link>
      <guid>https://dev.to/ashwin_singh_304bc222ecbe/i-built-an-ocr-tool-that-extracts-45x-more-text-than-tesseract-alone-n1b</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmhn50taqnnmw9z4h5bgk.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmhn50taqnnmw9z4h5bgk.jpg" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;
Last year I got frustrated.&lt;/p&gt;

&lt;p&gt;I was trying to digitize a stack of old contracts for a family business. Scanned PDFs—hundreds of pages. I needed to find specific clauses across all of them, but Ctrl+F returned nothing because every page was just an image.&lt;/p&gt;

&lt;p&gt;"No problem," I thought. "I'll just run them through Tesseract."&lt;/p&gt;

&lt;p&gt;The output was... disappointing. Missed words everywhere. Garbled text from slightly tilted pages. Complete failures on low-res scans. I spent more time fixing OCR errors than I would have spent reading the documents manually.&lt;/p&gt;

&lt;p&gt;So I built something better. It's called &lt;a href="https://searchablepdf.org" rel="noopener noreferrer"&gt;SearchablePDF.org&lt;/a&gt;, and it extracts up to 456% more text than vanilla Tesseract. Here's how.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Problem With Basic OCR
&lt;/h2&gt;

&lt;p&gt;Tesseract is incredible technology. But feed it a real-world scanned document and you'll quickly discover its limitations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input: Slightly tilted scan of a contract
Expected: "This Agreement shall terminate on December 31, 2024"
Actual: "Th1s Agr33ment sha11 terminat3 0n Decemb3r 31, 2O24"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The issue isn't Tesseract—it's the input. OCR engines expect clean, properly oriented, high-contrast images. Real scans are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tilted by a few degrees&lt;/li&gt;
&lt;li&gt;Rotated 90°, 180°, or 270°&lt;/li&gt;
&lt;li&gt;Low resolution (faxes, old photocopies)&lt;/li&gt;
&lt;li&gt;Covered with watermarks, stamps, or noise&lt;/li&gt;
&lt;li&gt;Inconsistent in contrast and brightness&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Garbage in, garbage out.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution: Preprocessing Pipeline
&lt;/h2&gt;

&lt;p&gt;The key insight was that OCR accuracy depends more on image quality than on the OCR engine itself. So I built a preprocessing pipeline that runs before Tesseract ever sees the image.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Orientation Detection and Correction
&lt;/h3&gt;

&lt;p&gt;Using Tesseract's built-in orientation detection (OSD) plus some OpenCV helpers, the tool detects page orientation and rotates accordingly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Simplified version of orientation detection
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;detect_orientation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Use Tesseract's OSD (Orientation and Script Detection)
&lt;/span&gt;    &lt;span class="n"&gt;osd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pytesseract&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;image_to_osd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;rotation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Rotate: (\d+)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;osd&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;rotation&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fix_orientation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;rotation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;detect_orientation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;rotation&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;rotate_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;360&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;rotation&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This alone fixed about 15% of my failed extractions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Deskewing
&lt;/h3&gt;

&lt;p&gt;Even a 2-3° tilt kills OCR accuracy. The deskew algorithm detects text line angles and straightens the image:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;deskew&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Convert to grayscale and detect edges
&lt;/span&gt;    &lt;span class="n"&gt;gray&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cvtColor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;COLOR_BGR2GRAY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;edges&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Canny&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gray&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;150&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;apertureSize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Detect lines using Hough transform
&lt;/span&gt;    &lt;span class="n"&gt;lines&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;HoughLinesP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;edges&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pi&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;180&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                            &lt;span class="n"&gt;minLineLength&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;maxLineGap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Calculate median angle
&lt;/span&gt;    &lt;span class="n"&gt;angles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arctan2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y2&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;y1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x2&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;x1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;x2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y2&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
    &lt;span class="n"&gt;median_angle&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;median&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;angles&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Rotate to correct
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;rotate_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;degrees&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;median_angle&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Resolution Enhancement
&lt;/h3&gt;

&lt;p&gt;Many scans come in at 72-150 DPI. Tesseract works best at 300+ DPI. I use intelligent upscaling:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;enhance_resolution&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_dpi&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;current_dpi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;estimate_dpi&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current_dpi&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;target_dpi&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;scale_factor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;target_dpi&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;current_dpi&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fx&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;scale_factor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;scale_factor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                         &lt;span class="n"&gt;interpolation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INTER_CUBIC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
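&lt;p&gt;The &lt;code&gt;estimate_dpi&lt;/code&gt; helper above isn't shown. One crude but serviceable heuristic, sketched here with a simplified signature (it takes the pixel width directly rather than an image) and the assumption that the scan covers a full letter-width page:&lt;/p&gt;

```python
def estimate_dpi(image_width_px, page_width_in=8.5):
    # If the scan spans a full letter-width page (8.5 in),
    # DPI is just horizontal pixels divided by physical width.
    return image_width_px / page_width_in

# A 2550 px wide letter-size scan comes out to 300 DPI
```

&lt;p&gt;Real inputs vary, so treat this as a rough floor: when the source PDF carries page-size metadata, deriving DPI from it is more reliable than guessing the paper size.&lt;/p&gt;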



&lt;h3&gt;
  
  
  Step 4: Noise Removal and Contrast Enhancement
&lt;/h3&gt;

&lt;p&gt;Watermarks, scanner artifacts, and faded text all interfere with recognition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;clean_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Convert to grayscale
&lt;/span&gt;    &lt;span class="n"&gt;gray&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cvtColor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;COLOR_BGR2GRAY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Apply adaptive thresholding for varying lighting
&lt;/span&gt;    &lt;span class="n"&gt;cleaned&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;adaptiveThreshold&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gray&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;255&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                                     &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ADAPTIVE_THRESH_GAUSSIAN_C&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                     &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;THRESH_BINARY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Denoise
&lt;/span&gt;    &lt;span class="n"&gt;cleaned&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fastNlMeansDenoising&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cleaned&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cleaned&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
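&lt;p&gt;Each step feeds the next, so the glue code is a one-line loop. A minimal runner (assuming &lt;code&gt;deskew&lt;/code&gt; and the other step functions from earlier in the post):&lt;/p&gt;

```python
def preprocess(image, steps):
    # Apply each cleanup step in order, feeding its output to the next
    for step in steps:
        image = step(image)
    return image

# In the real pipeline, something like:
#   preprocess(img, [deskew, enhance_resolution, clean_image])
```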



&lt;h2&gt;
  
  
  The Secret Sauce: Invisible Text Layers
&lt;/h2&gt;

&lt;p&gt;Here's where it gets interesting.&lt;/p&gt;

&lt;p&gt;Most OCR tools give you extracted text—a plain &lt;code&gt;.txt&lt;/code&gt; file or copied text. That's useful, but you lose all formatting, layout, and visual context.&lt;/p&gt;

&lt;p&gt;I wanted something better: a PDF that looks exactly like the original but is fully searchable and selectable.&lt;/p&gt;

&lt;p&gt;The solution is PDF text layers. You embed invisible, precisely-positioned text underneath the original image:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pypdf&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PdfReader&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;PdfWriter&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;reportlab.pdfgen&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;canvas&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;reportlab.lib.pagesizes&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;letter&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add_text_layer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;original_pdf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ocr_data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# ocr_data contains text + bounding box coordinates from Tesseract
&lt;/span&gt;
    &lt;span class="c1"&gt;# Create a new PDF with just the text layer
&lt;/span&gt;    &lt;span class="n"&gt;text_pdf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_text_layer_pdf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ocr_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Merge: original image as background, text layer on top (but invisible)
&lt;/span&gt;    &lt;span class="n"&gt;merger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PdfMerger&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="c1"&gt;# ... merge logic
&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;searchable_pdf&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
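&lt;p&gt;The fiddliest part of the elided &lt;code&gt;create_text_layer_pdf&lt;/code&gt; is the coordinate math: Tesseract reports pixel boxes measured from the top-left of the image, while PDF places text in points (72 per inch) from the bottom-left of the page. A sketch of that conversion (the function name is mine, not from the original code):&lt;/p&gt;

```python
def tesseract_box_to_pdf(left, top, height, page_height_pts, dpi=300):
    # Tesseract boxes: pixels, origin at top-left.
    # PDF text: points, origin at bottom-left, placed at the baseline,
    # so flip the y-axis and drop down by the box height.
    scale = 72.0 / dpi
    x = left * scale
    y = page_height_pts - (top + height) * scale
    return x, y
```

&lt;p&gt;The invisibility itself comes from PDF text render mode 3; in reportlab that's &lt;code&gt;textobject.setTextRenderMode(3)&lt;/code&gt; before writing each word.&lt;/p&gt;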



&lt;p&gt;The text is there—screen readers can read it, Ctrl+F finds it, you can copy it—but visually the PDF is unchanged.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tesseract LSTM Configuration
&lt;/h2&gt;

&lt;p&gt;Tesseract's default settings leave a lot of accuracy on the table. Here's the configuration I landed on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;custom_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--oem 1 --psm 3 -l eng&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

&lt;span class="c1"&gt;# --oem 1: Use LSTM neural network engine only (most accurate)
# --psm 3: Fully automatic page segmentation (works for most documents)
# -l eng: Language (supports 35+ languages, can combine: eng+spa)
&lt;/span&gt;
&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pytesseract&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;image_to_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;processed_image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;custom_config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# For detailed position data (needed for text layer placement):
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pytesseract&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;image_to_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;processed_image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;custom_config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                                  &lt;span class="n"&gt;output_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DICT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;image_to_data&lt;/code&gt; function returns bounding boxes for every word—essential for positioning the invisible text layer correctly.&lt;/p&gt;
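&lt;p&gt;If you haven't used it: &lt;code&gt;image_to_data&lt;/code&gt; with &lt;code&gt;Output.DICT&lt;/code&gt; returns parallel lists, one entry per detected box, with &lt;code&gt;conf&lt;/code&gt; set to -1 for structural (non-word) rows. The sample values below are invented for illustration, but the keys match pytesseract's real output:&lt;/p&gt;

```python
# Illustrative slice of image_to_data(..., output_type=Output.DICT)
data = {
    "text":   ["Invoice", "", "#1234"],
    "conf":   [96, -1, 88],
    "left":   [40, 0, 150],
    "top":    [50, 0, 50],
    "width":  [90, 0, 60],
    "height": [18, 0, 18],
}

# Keep only real words with positive confidence before building the layer
words = [
    (data["text"][i], data["left"][i], data["top"][i], data["height"][i])
    for i in range(len(data["text"]))
    if data["text"][i].strip() and int(data["conf"][i]) > 0
]
# words -> [("Invoice", 40, 50, 18), ("#1234", 150, 50, 18)]
```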

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;After implementing the full pipeline, I ran tests against the same 500-page document set:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Characters Extracted&lt;/th&gt;
&lt;th&gt;Accuracy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Raw Tesseract&lt;/td&gt;
&lt;td&gt;127,453&lt;/td&gt;
&lt;td&gt;~71%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;With preprocessing&lt;/td&gt;
&lt;td&gt;580,342&lt;/td&gt;
&lt;td&gt;~94%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That's roughly 4.5× as much text extracted (about 355% more), with dramatically fewer errors.&lt;/p&gt;

&lt;p&gt;The biggest wins came from:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Deskewing (fixed ~30% of errors)&lt;/li&gt;
&lt;li&gt;Resolution enhancement (fixed ~25% of errors)&lt;/li&gt;
&lt;li&gt;Proper LSTM configuration (fixed ~15% of errors)&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Production Version
&lt;/h2&gt;

&lt;p&gt;I wrapped all of this into &lt;a href="https://searchablepdf.org" rel="noopener noreferrer"&gt;SearchablePDF.org&lt;/a&gt;—a web app where you can upload scanned PDFs and get back searchable versions.&lt;/p&gt;

&lt;p&gt;Features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Automatic preprocessing&lt;/strong&gt;: All the cleanup happens without user configuration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;35+ languages&lt;/strong&gt;: Including multi-language support for mixed documents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Page selection&lt;/strong&gt;: Process only the pages you need (&lt;code&gt;1-10, 25, 40-50&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Two OCR tiers&lt;/strong&gt;: Standard (Tesseract LSTM) and Premium AI (99% accuracy for critical documents)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Privacy-first&lt;/strong&gt;: Files auto-delete after 24 hours&lt;/li&gt;
&lt;/ul&gt;
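&lt;p&gt;The page-selection syntax is simple enough to parse in a few lines. SearchablePDF's actual parser isn't shown anywhere in this post, so this is a hypothetical implementation of the same &lt;code&gt;1-10, 25, 40-50&lt;/code&gt; grammar:&lt;/p&gt;

```python
def parse_page_ranges(spec):
    # "1-10, 25, 40-50" -> sorted, de-duplicated page numbers
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = part.split("-", 1)
            pages.update(range(int(start), int(end) + 1))
        elif part:
            pages.add(int(part))
    return sorted(pages)

# parse_page_ranges("1-3, 25, 2") -> [1, 2, 3, 25]
```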

&lt;p&gt;The free tier gives you 25 pages to test. Paid credits start at $0.05/page and never expire.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tech Stack
&lt;/h2&gt;

&lt;p&gt;For those curious:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Backend&lt;/strong&gt;: FastAPI (Python)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OCR&lt;/strong&gt;: Tesseract with pytesseract bindings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Image processing&lt;/strong&gt;: OpenCV, Pillow&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PDF manipulation&lt;/strong&gt;: pypdf, reportlab&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Frontend&lt;/strong&gt;: Next.js&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure&lt;/strong&gt;: Redis for job queuing&lt;/li&gt;
&lt;/ul&gt;
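&lt;p&gt;Redis's job here is decoupling: the web request enqueues an OCR job and returns immediately, and a worker processes it in the background. The actual Redis wiring isn't shown in this post, so here's the same producer/worker pattern sketched with Python's standard library in a single process:&lt;/p&gt;

```python
import queue
import threading

jobs = queue.Queue()
results = {}

def worker():
    # Pull jobs until a None sentinel arrives
    while True:
        job_id, pdf_path = jobs.get()
        if job_id is None:
            break
        # Stand-in for the real OCR pipeline
        results[job_id] = f"searchable:{pdf_path}"
        jobs.task_done()

t = threading.Thread(target=worker)
t.start()
jobs.put(("job-1", "scan.pdf"))
jobs.join()             # wait for the queue to drain
jobs.put((None, None))  # shut the worker down
t.join()
```

&lt;p&gt;With Redis the shape is identical, except the queue survives process restarts and the workers can live on different machines.&lt;/p&gt;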

&lt;h2&gt;
  
  
  Try It / Break It
&lt;/h2&gt;

&lt;p&gt;If you've got scanned PDFs that other tools have failed on, I'd genuinely like to know how SearchablePDF handles them. Edge cases help me improve the preprocessing pipeline.&lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://searchablepdf.org" rel="noopener noreferrer"&gt;searchablepdf.org&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And if you want to build something similar yourself, the core techniques are all above. The preprocessing pipeline is where most of the magic happens—Tesseract does the heavy lifting once you feed it clean images.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Questions?&lt;/strong&gt; Drop them in the comments. Happy to go deeper on any part of the implementation.&lt;/p&gt;

</description>
      <category>ocr</category>
      <category>pdf</category>
    </item>
  </channel>
</rss>
