<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Luandro Vieira</title>
    <description>The latest articles on DEV Community by Luandro Vieira (@luandro).</description>
    <link>https://dev.to/luandro</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3119838%2Ff216a42e-5e0f-4f01-b18f-fe5b72a05d25.png</url>
      <title>DEV Community: Luandro Vieira</title>
      <link>https://dev.to/luandro</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/luandro"/>
    <language>en</language>
    <item>
      <title>Transform Scanned PDFs into Searchable Documents with PDF-OCR CLI</title>
      <dc:creator>Luandro Vieira</dc:creator>
      <pubDate>Sat, 03 May 2025 15:55:27 +0000</pubDate>
      <link>https://dev.to/luandro/transform-scanned-pdfs-into-searchable-documents-with-pdf-ocr-cli-2mgn</link>
      <guid>https://dev.to/luandro/transform-scanned-pdfs-into-searchable-documents-with-pdf-ocr-cli-2mgn</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsn8bpjchgq6qlvn3txqo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsn8bpjchgq6qlvn3txqo.png" alt="Image description" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;The Problem: Unsearchable PDFs&lt;/h2&gt;

&lt;p&gt;We've all been there. You receive an important document as a PDF, but when you try to search for specific text, nothing happens. That's because many PDFs, especially scanned documents, are essentially just images of text rather than actual text content.&lt;/p&gt;

&lt;p&gt;This creates several problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can't search for specific information&lt;/li&gt;
&lt;li&gt;You can't copy and paste text&lt;/li&gt;
&lt;li&gt;You can't rely on screen readers or other accessibility tools, because they have no text to read&lt;/li&gt;
&lt;li&gt;You can't easily extract or analyze the content&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Introducing PDF-OCR CLI&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2aph1nzhqotg8h5o9hn5.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2aph1nzhqotg8h5o9hn5.jpg" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To solve this problem, I created &lt;a href="https://github.com/luandro/pdf-ocr" rel="noopener noreferrer"&gt;PDF-OCR CLI&lt;/a&gt;, an open-source tool that transforms scanned PDFs into fully searchable documents. It's built with TypeScript and uses Mistral AI's OCR API for text extraction, with optional text verification by an LLM served through Together.ai.&lt;/p&gt;

&lt;h3&gt;How It Works&lt;/h3&gt;

&lt;p&gt;The tool follows a simple but powerful pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Takes your PDF as input&lt;/li&gt;
&lt;li&gt;Processes each page with Mistral's OCR API&lt;/li&gt;
&lt;li&gt;Optionally verifies and improves text quality with an LLM&lt;/li&gt;
&lt;li&gt;Reassembles everything into a searchable PDF&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;Getting Started in 2 Minutes&lt;/h2&gt;

&lt;h3&gt;Installation&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install globally&lt;/span&gt;
npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; pdf-ocr-cli

&lt;span class="c"&gt;# Or use without installing&lt;/span&gt;
npx pdf-ocr-cli &lt;span class="nt"&gt;--input&lt;/span&gt; input.pdf &lt;span class="nt"&gt;--output&lt;/span&gt; output.pdf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;Set Up API Keys&lt;/h3&gt;

&lt;p&gt;Create a &lt;code&gt;.env&lt;/code&gt; file in your working directory:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# The first command creates (or overwrites) .env; the second appends to it&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"MISTRAL_API_KEY=your_mistral_api_key_here"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; .env
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"TOGETHER_API_KEY=your_together_api_key_here"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; .env
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
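
&lt;p&gt;Before the first run, it can help to confirm that both keys actually landed in the file. The snippet below is a minimal sketch rather than part of PDF-OCR CLI: &lt;code&gt;has_key&lt;/code&gt; is a hypothetical helper name, and it only checks that a non-empty value follows each variable name, not that the key is valid.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Hypothetical helper (not part of PDF-OCR CLI): succeeds when the given
# variable has a non-empty value in the given env file.
has_key() {
  grep -qs "^$1=." "$2"
}

# Report the status of each key expected in .env
for key in MISTRAL_API_KEY TOGETHER_API_KEY; do
  if has_key "$key" .env; then
    echo "$key is set"
  else
    echo "$key is missing from .env"
  fi
done
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;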



&lt;h3&gt;Basic Usage&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Process a PDF file&lt;/span&gt;
pdf-ocr &lt;span class="nt"&gt;--input&lt;/span&gt; input.pdf &lt;span class="nt"&gt;--output&lt;/span&gt; output.pdf

&lt;span class="c"&gt;# With verification to improve OCR quality&lt;/span&gt;
pdf-ocr &lt;span class="nt"&gt;--input&lt;/span&gt; input.pdf &lt;span class="nt"&gt;--output&lt;/span&gt; output.pdf &lt;span class="nt"&gt;--verify&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
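
&lt;p&gt;Because the tool is a plain command, the same call can be wrapped in a shell loop to process a whole folder. This is a sketch under assumptions: the &lt;code&gt;scans&lt;/code&gt; and &lt;code&gt;searchable&lt;/code&gt; directory names and the &lt;code&gt;output_path&lt;/code&gt; helper are made up for illustration, and only the &lt;code&gt;--input&lt;/code&gt; and &lt;code&gt;--output&lt;/code&gt; flags shown above are used.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Hypothetical helper: maps scans/report.pdf to searchable/report.pdf
output_path() {
  echo "$2/$(basename "$1")"
}

# Assumed layout: scanned files in ./scans, results written to ./searchable
mkdir -p searchable
for pdf in scans/*.pdf; do
  # Skip the literal pattern when the directory has no PDFs
  [ -e "$pdf" ] || continue
  pdf-ocr --input "$pdf" --output "$(output_path "$pdf" searchable)"
done
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Writing to a separate directory keeps the original scans untouched, so a run can be repeated safely.&lt;/p&gt;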



&lt;h2&gt;Real-World Use Cases&lt;/h2&gt;

&lt;h3&gt;1. Digitizing Research Papers&lt;/h3&gt;

&lt;p&gt;As a developer who reads a lot of research papers, I often encounter PDFs that are scanned copies. With PDF-OCR CLI, I can quickly make these papers searchable, allowing me to find specific sections or references without scrolling through the entire document.&lt;/p&gt;

&lt;h3&gt;2. Processing Legal Documents&lt;/h3&gt;

&lt;p&gt;Legal documents often come as scanned PDFs. By making them searchable, lawyers and paralegals can quickly find relevant clauses or terms, saving hours of manual reading.&lt;/p&gt;

&lt;h3&gt;3. Archiving Historical Documents&lt;/h3&gt;

&lt;p&gt;Libraries and archives can use this tool to make historical documents more accessible and searchable, preserving knowledge while making it more usable.&lt;/p&gt;

&lt;h2&gt;Advanced Features&lt;/h2&gt;

&lt;h3&gt;Handling Large Documents&lt;/h3&gt;

&lt;p&gt;For large documents, you can control the processing with options like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Process 3 pages at a time&lt;/span&gt;
pdf-ocr &lt;span class="nt"&gt;--input&lt;/span&gt; input.pdf &lt;span class="nt"&gt;--output&lt;/span&gt; output.pdf &lt;span class="nt"&gt;--concurrency&lt;/span&gt; 3

&lt;span class="c"&gt;# Process only the first 10 pages&lt;/span&gt;
pdf-ocr &lt;span class="nt"&gt;--input&lt;/span&gt; input.pdf &lt;span class="nt"&gt;--output&lt;/span&gt; output.pdf &lt;span class="nt"&gt;--max-pages&lt;/span&gt; 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
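
&lt;p&gt;One way to use &lt;code&gt;--max-pages&lt;/code&gt; on a long scan is a quick preview pass: OCR only the first few pages, inspect the result, and then decide whether to run the full document. A minimal sketch, assuming an &lt;code&gt;input.pdf&lt;/code&gt; in the current directory; &lt;code&gt;preview_name&lt;/code&gt; is a hypothetical helper, not a feature of the tool.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Hypothetical helper: report.pdf becomes report.preview.pdf
preview_name() {
  echo "${1%.pdf}.preview.pdf"
}

input="input.pdf"
if [ -f "$input" ]; then
  # Preview pass: OCR only the first 5 pages into a separate file
  pdf-ocr --input "$input" --output "$(preview_name "$input")" --max-pages 5
fi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;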



&lt;h3&gt;Improving OCR Quality&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;--verify&lt;/code&gt; flag uses an LLM to check and improve the OCR results:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pdf-ocr &lt;span class="nt"&gt;--input&lt;/span&gt; input.pdf &lt;span class="nt"&gt;--output&lt;/span&gt; output.pdf &lt;span class="nt"&gt;--verify&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is particularly useful for documents with complex layouts, poor scan quality, or unusual fonts.&lt;/p&gt;

&lt;h2&gt;The Technical Details&lt;/h2&gt;

&lt;p&gt;PDF-OCR CLI is built with TypeScript and follows a modular architecture:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;PDF Splitter&lt;/strong&gt;: Divides PDFs into individual pages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OCR Module&lt;/strong&gt;: Extracts text using Mistral API&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Content Verification&lt;/strong&gt;: Improves text with LLM (optional)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Text-to-PDF Converter&lt;/strong&gt;: Converts text back to PDF&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PDF Merger&lt;/strong&gt;: Combines processed pages&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The tool is designed to be robust, with features like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Configurable retry mechanisms for API calls&lt;/li&gt;
&lt;li&gt;Adjustable concurrency for processing multiple pages&lt;/li&gt;
&lt;li&gt;Detailed logging options for troubleshooting&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Why I Built This&lt;/h2&gt;

&lt;p&gt;As a developer who works with a lot of documentation, I was frustrated by the limitations of scanned PDFs. Existing OCR solutions were either expensive, closed-source, or difficult to integrate into my workflow.&lt;/p&gt;

&lt;p&gt;I wanted a simple CLI tool that I could use in scripts or automation pipelines, and that leveraged the latest AI capabilities for high-quality text extraction. PDF-OCR CLI is the result of that need.&lt;/p&gt;

&lt;h2&gt;Open Source and Contributions&lt;/h2&gt;

&lt;p&gt;PDF-OCR CLI is open source under the ISC license. Contributions are welcome! Whether it's adding new features, improving documentation, or reporting bugs, every contribution helps make the tool better.&lt;/p&gt;

&lt;p&gt;Check out the &lt;a href="https://github.com/luandro/pdf-ocr" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt; to get started.&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;PDF-OCR CLI turns scanned documents into PDFs you can search, copy from, and feed to other tools, much like natively digital ones. Give it a try!&lt;/p&gt;




&lt;p&gt;Have you encountered problems with unsearchable PDFs? What solutions have you tried? Let me know in the comments!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>ocr</category>
      <category>pdf</category>
      <category>cli</category>
    </item>
  </channel>
</rss>
