<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: John</title>
    <description>The latest articles on DEV Community by John (@john_e62541d6d7f95ead0bcf).</description>
    <link>https://dev.to/john_e62541d6d7f95ead0bcf</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2951177%2Fe012d710-8894-4cfe-8be2-11b2b927942e.png</url>
      <title>DEV Community: John</title>
      <link>https://dev.to/john_e62541d6d7f95ead0bcf</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/john_e62541d6d7f95ead0bcf"/>
    <language>en</language>
    <item>
      <title>Deep Learning Meets OCR: My FastAPI-Powered Document Cleaning Tool</title>
      <dc:creator>John</dc:creator>
      <pubDate>Tue, 18 Mar 2025 15:33:17 +0000</pubDate>
      <link>https://dev.to/john_e62541d6d7f95ead0bcf/deep-learning-meets-ocr-my-fastapi-powered-document-cleaning-tool-2a3l</link>
      <guid>https://dev.to/john_e62541d6d7f95ead0bcf/deep-learning-meets-ocr-my-fastapi-powered-document-cleaning-tool-2a3l</guid>
      <description>&lt;p&gt;Document Cleaner API — Clean Up Scanned Docs with AI + FastAPI&lt;br&gt;
Hey folks! &lt;br&gt;
I recently wrapped up a project that combines deep learning, OCR, and FastAPI to make scanned documents more readable and searchable. Whether you're working with messy handwritten notes, low-contrast scans, or old documents, this tool helps clean them up and exports them as OCR-ready PDFs.&lt;br&gt;
I call it the “Document Cleaner API,” and it’s live on Google Cloud Run if you want to try it.&lt;br&gt;
🧠 What It Does&lt;br&gt;
The app takes scanned .jpg, .png, or zipped image files and:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cleans and denoises them using a pretrained deep learning model (DnCNN) and OpenCV image processing.&lt;/li&gt;
&lt;li&gt;Auto-tunes the model weights for best OCR clarity on batches: it tests 20% of the images or 10 images, whichever is smaller, to select the best weights for the batch, then tunes the OpenCV processing parameters on a per-image basis.&lt;/li&gt;
&lt;li&gt;Returns both cleaned PNGs and a PDF optimized for OCR.&lt;/li&gt;
&lt;li&gt;Works as both a CLI tool and a REST API.&lt;/li&gt;
&lt;li&gt;Designed for cloud deployment (GCP / Docker-ready).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tech Stack&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python 3.10&lt;/li&gt;
&lt;li&gt;&lt;a href="https://fastapi.tiangolo.com/" rel="noopener noreferrer"&gt;FastAPI&lt;/a&gt; for the web server&lt;/li&gt;
&lt;li&gt;PyTorch for deep learning&lt;/li&gt;
&lt;li&gt;OpenCV for image cleanup&lt;/li&gt;
&lt;li&gt;Tesseract OCR for text recognition&lt;/li&gt;
&lt;li&gt;Deployed via Google Cloud Run&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Try It Live&lt;/p&gt;

&lt;p&gt;The API is live and public on Cloud Run.&lt;br&gt;
&lt;a href="https://document-cleaning-cli-111-777-888-7777-934773375188.us-central1.run.app/" rel="noopener noreferrer"&gt;https://document-cleaning-cli-111-777-888-7777-934773375188.us-central1.run.app/&lt;/a&gt;&lt;br&gt;
You can test it by uploading a .png or .zip.&lt;br&gt;
Example: clean a single image from your terminal:&lt;br&gt;
&lt;code&gt;curl -X POST -F "file=@sample.png" \&lt;br&gt;
https://document-cleaning-cli-111-777-888-7777-934773375188.us-central1.run.app/process-document/&lt;/code&gt;&lt;br&gt;
Example: clean a ZIP of images from your terminal:&lt;br&gt;
&lt;code&gt;curl -X POST -F "file=@your_batch.zip" \&lt;br&gt;
https://document-cleaning-cli-111-777-888-7777-934773375188.us-central1.run.app/process-batch/ \&lt;br&gt;
--output cleaned_output.zip&lt;/code&gt;&lt;/p&gt;
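
&lt;p&gt;If your scans are sitting in a folder, here is a minimal Python sketch for packing them into a ZIP to send to the batch endpoint. The function name &lt;code&gt;build_batch_zip&lt;/code&gt; is made up for this example, and storing the images flat (no subfolders) and accepting &lt;code&gt;.jpeg&lt;/code&gt; alongside &lt;code&gt;.jpg&lt;/code&gt; are assumptions, not documented requirements of the API.&lt;/p&gt;

```python
import zipfile
from pathlib import Path

# The post lists .jpg and .png as accepted inputs; treating ".jpeg" as
# equivalent is an assumption made for this sketch.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def build_batch_zip(image_dir, zip_path):
    """Pack every image in image_dir into a flat ZIP; return the names added."""
    image_dir = Path(image_dir)
    added = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(image_dir.iterdir()):
            if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
                # Store only the file name so the archive has no nested folders
                # (a flat layout is assumed, not confirmed by the API docs).
                archive.write(path, arcname=path.name)
                added.append(path.name)
    return added


if __name__ == "__main__":
    # Hypothetical folder name; point this at your own scans.
    print(build_batch_zip("scans", "your_batch.zip"))
```

&lt;p&gt;The resulting &lt;code&gt;your_batch.zip&lt;/code&gt; can then be uploaded with the curl command above.&lt;/p&gt;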

&lt;p&gt;Auto-Tuning Per Batch&lt;br&gt;
When you upload a ZIP of images, the API:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Samples up to 20% of the images (max 10)&lt;/li&gt;
&lt;li&gt;Runs OCR tests using different model weights&lt;/li&gt;
&lt;li&gt;Picks the best-performing one&lt;/li&gt;
&lt;li&gt;Applies it to the entire batch for maximum clarity&lt;/li&gt;
&lt;/ol&gt;
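
&lt;p&gt;The steps above can be sketched roughly like this in Python. This is not the code from the repository: the function names, the &lt;code&gt;score_fn&lt;/code&gt; callback (a stand-in for whatever OCR-clarity score the API computes), rounding the 20% up, and always testing at least one image are all assumptions for illustration.&lt;/p&gt;

```python
import random


def sample_size(n_images, percent=20, cap=10):
    """How many images to test: 20% of the batch or 10, whichever is smaller."""
    if n_images == 0:
        return 0
    # Integer ceiling of percent% of the batch. Rounding up (and therefore
    # testing at least one image) is an assumption; the post does not say
    # how fractional counts are handled.
    rounded_up = (n_images * percent + 99) // 100
    return min(cap, rounded_up)


def pick_best_weight(images, weights, score_fn, seed=0):
    """Return the weight with the highest mean score on a random sample."""
    images = list(images)
    if not images:
        raise ValueError("need at least one image to tune on")
    sample = random.Random(seed).sample(images, sample_size(len(images)))

    def mean_score(weight):
        # score_fn(image, weight) is hypothetical: higher means clearer OCR.
        return sum(score_fn(image, weight) for image in sample) / len(sample)

    return max(weights, key=mean_score)


if __name__ == "__main__":
    # Toy scores keyed by made-up weight names, just to show the call shape.
    toy_scores = {"weights_a": 0.4, "weights_b": 0.7}
    pages = ["page_%d.png" % i for i in range(30)]
    print(sample_size(len(pages)))
    print(pick_best_weight(pages, list(toy_scores), lambda img, w: toy_scores[w]))
```

&lt;p&gt;Because only the sample is scored against every candidate, the cost of tuning stays bounded no matter how large the batch gets.&lt;/p&gt;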

&lt;p&gt;This helps maintain high quality while keeping runtime fast, which makes it a good fit for bulk jobs.&lt;/p&gt;

&lt;p&gt;Local Setup&lt;/p&gt;

&lt;p&gt;If you want to run it locally or tweak it, run:&lt;br&gt;
&lt;code&gt;git clone https://github.com/jcaperella29/Document_cleaning_CLI.git&lt;br&gt;
cd Document_cleaning_CLI&lt;br&gt;
pip install -r requirements.txt&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Make sure you also have Tesseract OCR installed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Linux: &lt;code&gt;sudo apt install tesseract-ocr&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;macOS: &lt;code&gt;brew install tesseract&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Windows: &lt;a href="https://github.com/UB-Mannheim/tesseract/wiki" rel="noopener noreferrer"&gt;Tesseract Download&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Ideas for Usage&lt;/p&gt;

&lt;p&gt;Whether you’re automating document workflows or just trying to make old PDFs legible again, here are a few ideas:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clean up scanned lab notebooks&lt;/li&gt;
&lt;li&gt;Prep historical documents for OCR archiving&lt;/li&gt;
&lt;li&gt;Make handwritten notes searchable&lt;/li&gt;
&lt;li&gt;Integrate into pipelines with Python, Bash, or Node.js&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example integrations are in the repository:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“curl” + shell script for batch runs&lt;/li&gt;
&lt;li&gt;Python “requests” snippet for automation&lt;/li&gt;
&lt;li&gt;Node.js + Axios setup for full-stack integration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Project Structure&lt;/p&gt;

&lt;p&gt;├── main.py           # FastAPI routes&lt;br&gt;
├── processor.py      # Image cleanup logic (DnCNN + OCR)&lt;br&gt;
├── model_weights/    # .mat weight files&lt;br&gt;
├── uploads/          # Temp folder for input&lt;br&gt;
├── processed/        # Output folder for cleaned files&lt;/p&gt;

&lt;p&gt;🙏 Feedback Welcome!&lt;/p&gt;

&lt;p&gt;If you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Have feature suggestions&lt;/li&gt;
&lt;li&gt;Want to try a custom model&lt;/li&gt;
&lt;li&gt;Need help deploying your own version&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Feel free to open an issue or drop a star ⭐ over at:&lt;/p&gt;

&lt;p&gt;GitHub Repo: &lt;a href="https://github.com/jcaperella29/Document_cleaning_CLI" rel="noopener noreferrer"&gt;jcaperella29/Document_cleaning_CLI&lt;/a&gt;&lt;br&gt;
Thanks for reading! Always happy to connect with fellow developers working on AI, bioinformatics, or productivity tools.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>fastapi</category>
      <category>python</category>
      <category>deeplearning</category>
    </item>
  </channel>
</rss>
