<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Rajesh Pethe</title>
    <description>The latest articles on DEV Community by Rajesh Pethe (@eklavvya).</description>
    <link>https://dev.to/eklavvya</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3214254%2F76f5dbaa-1577-46a4-94f8-0670d21d88ea.jpg</url>
      <title>DEV Community: Rajesh Pethe</title>
      <link>https://dev.to/eklavvya</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/eklavvya"/>
    <language>en</language>
    <item>
      <title>Building an Event-Driven OCR Service: Challenges and Solutions</title>
      <dc:creator>Rajesh Pethe</dc:creator>
      <pubDate>Wed, 10 Dec 2025 14:29:10 +0000</pubDate>
      <link>https://dev.to/eklavvya/building-an-event-driven-ocr-service-challenges-and-solutions-35c9</link>
      <guid>https://dev.to/eklavvya/building-an-event-driven-ocr-service-challenges-and-solutions-35c9</guid>
      <description>&lt;p&gt;Optical Character Recognition (OCR) is a powerful AI/ML technology that recognizes and extracts text from images and scanned documents. &lt;/p&gt;

&lt;p&gt;Creating a scalable, event-driven web OCR service comes with challenges. This write-up details the problems, lessons, and solutions uncovered while building an OCR service with FastAPI + Celery + Redis + PaddleOCR, intended for integration with Paperless-ngx, an open-source document management system.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Wanted to Build (and Why)
&lt;/h2&gt;

&lt;p&gt;Our objective was to build an &lt;strong&gt;event-driven service&lt;/strong&gt; that efficiently converts PDFs or images into searchable PDFs with a selectable and searchable text layer. The focus was on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Handling PDFs of arbitrary length and complexity.&lt;/li&gt;
&lt;li&gt;Delivering results asynchronously due to CPU-heavy OCR tasks.&lt;/li&gt;
&lt;li&gt;Creating outputs integrable with Paperless-ngx for document archiving and retrieval.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Why a Simple Script Is Not Good Enough
&lt;/h2&gt;

&lt;p&gt;OCR workloads demand significant compute power, especially on large or image-heavy PDFs. The process involves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OCR inference:&lt;/strong&gt; Detecting and recognizing text from images - the most CPU intensive part.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Collating results:&lt;/strong&gt; Combining recognized text from many pages.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embedding a text layer:&lt;/strong&gt; Creating PDFs with searchable text overlay, crucial for usability.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Making this scalable and responsive requires &lt;strong&gt;moving beyond a simple blocking script&lt;/strong&gt; into an asynchronous, event-driven architecture. Multiprocessing seems like a natural fit at first, but Celery and PaddleOCR take care of workload distribution and performance respectively, as you'll see below. Keep reading.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture (How All the Pieces Fit Together)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Flow:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Client uploads PDF → FastAPI returns task ID immediately&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;FastAPI enqueues task in Redis Broker&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Celery Workers pick up tasks, use PaddleOCR (cached per process)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Workers store searchable PDFs in File Storage&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Redis Backend tracks task status&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Client polls FastAPI → gets status + download link&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This architecture scales by adding more Celery Workers and handles OCR's CPU intensity through async processing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FastAPI ──► Redis Broker ──► Celery Workers ──► PaddleOCR ──► File Storage
    ▲        ▲ Result Backend     ▲ Cached models      ▲
    └────────┘                    └────────────────────┘

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  How Celery Actually Works in this Setup (And Surprises)
&lt;/h3&gt;

&lt;p&gt;Celery orchestrates asynchronous OCR processing - points 3 and 4 in the flow above - and here is where things get very interesting:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Orchestrate:

&lt;ol&gt;
&lt;li&gt;Takes in a PDF/image input file.&lt;/li&gt;
&lt;li&gt;Converts the PDF to a list of images (OCR needs images).&lt;/li&gt;
&lt;li&gt;Decides on the size of the task (files with &amp;gt; 5 pages get delegated to a chord).&lt;/li&gt;
&lt;li&gt;Calls &lt;code&gt;process_single_page&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Finally calls &lt;code&gt;assemble_final_pdf&lt;/code&gt;, which returns the results.&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;li&gt;Process single page:

&lt;ol&gt;
&lt;li&gt;Creates &lt;code&gt;ocr_engine = get_ocr_engine()&lt;/code&gt; and gets the OCR results.&lt;/li&gt;
&lt;li&gt;Creates a text file with the OCR'd text (we need the raw text as well).&lt;/li&gt;
&lt;li&gt;Creates a single-page PDF file with a selectable and searchable text layer.&lt;/li&gt;
&lt;li&gt;Returns the page index and file path.&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;li&gt;Final assembly:

&lt;ol&gt;
&lt;li&gt;Receives the paths of all single-page PDFs.&lt;/li&gt;
&lt;li&gt;Collates/merges them into one resulting PDF.&lt;/li&gt;
&lt;li&gt;Merges all text files into one.&lt;/li&gt;
&lt;li&gt;Cleans up temp PDF/text files.&lt;/li&gt;
&lt;li&gt;Returns the final PDF and text file URLs.&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;/ol&gt;

&lt;h3&gt;
  
  
  Visual Overview of the Celery Pipeline
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;             +------------------+
Upload PDF → |    FastAPI       |
             +------------------+
                       |
                       v
              [Redis Message Broker]
                       |
                       v
           +---------------------------+
           |    Orchestrator Task     |
           |  (orchestrate_pdf_ocr)   |
           +---------------------------+
                       |
     +-----------------+-----------------+
     |                 |                 |
     v                 v                 v
+-----------+   +-----------+    +---------------+
| Page 0    |   | Page 1    |    | Page N        |
| OCR Task  |   | OCR Task  |    | OCR Task      |
+-----------+   +-----------+    +---------------+
     \             |                /
      \            |               /
       +-----------+--------------+
                       |
                       v
        +-----------------------------------+
        |   assemble_final_pdf (Callback)   |
        +-----------------------------------+
                       |
                       v
      Searchable PDF + merged text file saved
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pitfall:&lt;/strong&gt; Some might argue - why not pass the OCR results on to the #3 "Final assembly" step above and do the final assembly of the PDF and text file there? I considered that and found that PaddleOCR results are big nested data structures containing &lt;code&gt;numpy.ndarray&lt;/code&gt; objects, which would need custom recursive serialization before Redis could store them.&lt;/p&gt;

&lt;p&gt;I briefly experimented with passing lightweight structured results (page text + bounding boxes), but even that ballooned in size on longer PDFs. I concluded serialization was a headache, and creating single-page PDFs appealed to me more for several reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tiny Payload Size:&lt;/strong&gt; Instead of serializing huge, complex nested lists of coordinates and text (which stresses Redis/Celery result backend), you just pass a tiny string: "/tmp/page_5_ocr.pdf".&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Solves Serialization:&lt;/strong&gt; The complex OCR data stays in memory, gets written to PDF immediately, and is discarded.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Retries/Checkpoints:&lt;/strong&gt; If the final assembly task fails, you still have the individual page PDFs on disk. You could technically inspect or re-assemble them manually, and retry only the page that failed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Assembly:&lt;/strong&gt; The final &lt;code&gt;assemble_final_pdf&lt;/code&gt; task becomes extremely cheap.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Redis:&lt;/strong&gt; No memory pressure on Redis&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
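&lt;p&gt;A quick, self-contained way to see the payload difference (the nested structure is a rough stand-in for real PaddleOCR output, not its actual schema):&lt;/p&gt;

```python
import json

# Rough stand-in for one page of raw PaddleOCR-style output: many detected
# boxes, each with four corner coordinates plus the text and a confidence.
fake_page_result = [
    {"box": [[x, 0.0], [x + 9.0, 0.0], [x + 9.0, 9.0], [x, 9.0]],
     "text": f"line {i}", "score": 0.99}
    for i, x in enumerate(range(0, 3000, 2))
]

heavy_payload = json.dumps(fake_page_result)       # what we avoided passing
light_payload = json.dumps("/tmp/page_5_ocr.pdf")  # what we pass instead

print(len(heavy_payload), "bytes vs", len(light_payload), "bytes")
```

&lt;p&gt;On a page with many detected text lines, the serialized result can easily reach hundreds of kilobytes, while the file path stays around 20 bytes.&lt;/p&gt;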

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cleanup:&lt;/strong&gt; Requires careful temp file management.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared Volume:&lt;/strong&gt; If you are on a cluster (K8s/multiple VMs), you need a shared volume.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  PaddleOCR: Model Caching and Threading
&lt;/h3&gt;

&lt;p&gt;⚠️ &lt;strong&gt;Lessons learnt:&lt;/strong&gt; PaddleOCR has a known issue with &lt;strong&gt;singleton&lt;/strong&gt; objects - initializing a PaddleOCR engine once and reusing it will almost certainly fail on subsequent OCR requests. The solution is to cache the models on disk and re-initialize the engine for every call - a slight overhead. &lt;em&gt;I lost quite a few hairs scratching my head over this 😉&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Caching models gives a big speedup&lt;/strong&gt;: rather than re-downloading the PaddleOCR models every time an engine is created, we keep them in an on-disk model cache so each process fetches them once and reuses them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model_cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PPDX_HOME&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/app/model-cache&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✨ Using model cache from: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model_cache&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_ocr_engine&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;PaddleOCR&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;PaddleOCR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;text_recognition_model_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model_cache&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/.paddlex/official_models/PP-OCRv5_server_rec&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;text_detection_model_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model_cache&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/.paddlex/official_models/PP-OCRv5_server_det&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;textline_orientation_model_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model_cache&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/.paddlex/official_models/PP-LCNet_x1_0_textline_ori&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;doc_orientation_classify_model_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model_cache&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/.paddlex/official_models/PP-LCNet_x1_0_doc_ori&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;doc_unwarping_model_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model_cache&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/.paddlex/official_models/UVDoc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;use_doc_unwarping&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;use_doc_orientation_classify&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;use_textline_orientation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;PaddleOCR itself is inherently &lt;strong&gt;multi-threaded&lt;/strong&gt; via its native C++ inference engine, relying on optimized libraries like MKL and oneDNN. These libraries internally run on multiple CPU threads and &lt;strong&gt;bypass Python's GIL&lt;/strong&gt;, enabling you to get the most out of the CPU cores available to you via the &lt;code&gt;cpu_threads&lt;/code&gt; option at initialization.&lt;/p&gt;

&lt;h3&gt;
  
  
  Redis: The Glue
&lt;/h3&gt;

&lt;p&gt;Redis acts as both:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;message broker&lt;/strong&gt; queuing tasks.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;result backend&lt;/strong&gt; tracking task status and storing outputs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This decouples FastAPI from OCR workers, enabling scalability and fault tolerance.&lt;/p&gt;

&lt;h3&gt;
  
  
  FastAPI: The Front Door to the OCR Service
&lt;/h3&gt;

&lt;p&gt;FastAPI:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Accepts PDF uploads and immediately returns a task ID.&lt;/li&gt;
&lt;li&gt;Provides endpoints to poll for task status and download results.&lt;/li&gt;
&lt;li&gt;Delegates heavy processing to the event-driven Celery workers.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What Didn’t Work (and Why)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;PaddleOCR singleton failing:&lt;/strong&gt; A known issue with PaddleOCR - it fails on subsequent OCR calls, most likely because it retains state from previous calls and needs a reset. And a reset costs almost as much as re-creating the object.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Serialization of large numpy structures:&lt;/strong&gt; Recursive serialization of nested &lt;code&gt;numpy&lt;/code&gt; data types was an option but seemed like too much of a headache.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Shared filesystem:&lt;/strong&gt; This is at the top of the to-do list, as it is necessary to make the service horizontally scalable.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What I Learned While Building This
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Keep the Celery payload small. I was surprised how easy it was to create multiple files and re-assemble them in a different Celery worker.&lt;/li&gt;
&lt;li&gt;PaddleOCR is good, but it has quirks. Don’t fight the library - work around it.&lt;/li&gt;
&lt;li&gt;Celery chords turned out to be the perfect fit for multi-page PDFs, but it took me a while to get the signatures right.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final notes / Next Steps
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Core Pipeline Improvements
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Retries &amp;amp; Error Handling&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- Per-page retries with backoff
- Custom exceptions for OCR failures
- Fail-fast if &amp;gt;N pages fail
- Cleanup orphan files
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;Task Timeouts&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- Timeout for each OCR task
- Timeout for orchestration/chord
- Deadline propagation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;Progress Reporting&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- Track completed_pages / total_pages
- Publish progress to Redis
- FastAPI poll endpoint or SSE/WebSocket
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;Distributed Pipeline (in progress)

&lt;ul&gt;
&lt;li&gt;Add shared volume or S3/MinIO&lt;/li&gt;
&lt;li&gt;Convert file paths to storage URIs&lt;/li&gt;
&lt;li&gt;Remove reliance on local disk per worker&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;Building this thing reminded me that OCR isn’t just text extraction - it’s a messy mix of CPU bottlenecks, weird library quirks, and architectural decisions that don’t show up in tutorials.&lt;/p&gt;

&lt;p&gt;Turns out, building a ‘simple OCR service’ is anything but simple — but now it’s fast, scalable, and plays nicely with Paperless-ngx.&lt;/p&gt;

</description>
      <category>eventdriven</category>
      <category>ocr</category>
      <category>microservices</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Building a Practical DevSecOps Pipeline: From Basic Security to Enterprise-Style Protection</title>
      <dc:creator>Rajesh Pethe</dc:creator>
      <pubDate>Wed, 24 Sep 2025 12:25:14 +0000</pubDate>
      <link>https://dev.to/eklavvya/building-a-practical-devsecops-pipeline-from-basic-security-to-enterprise-style-protection-2b9k</link>
      <guid>https://dev.to/eklavvya/building-a-practical-devsecops-pipeline-from-basic-security-to-enterprise-style-protection-2b9k</guid>
      <description>&lt;p&gt;Hello folks!&lt;/p&gt;

&lt;p&gt;So you've got your CI/CD pipeline running smoothly, but you're still missing security scanning for your codebase. I recently took a basic security workflow and enhanced it; this is my journey building an enterprise-style pipeline, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scanning for secrets with GitGuardian and TruffleHog&lt;/li&gt;
&lt;li&gt;Reporting vulnerabilities using Bandit, Semgrep and Safety&lt;/li&gt;
&lt;li&gt;Scanning vulnerabilities in Python dependencies using Snyk&lt;/li&gt;
&lt;li&gt;Licensing and Compliance scan using FOSSA&lt;/li&gt;
&lt;li&gt;Checkov IaC Security Scan to find vulnerabilities in Docker, Kubernetes and Terraform specs&lt;/li&gt;
&lt;li&gt;Container security scan using Trivy and Docker Scout&lt;/li&gt;
&lt;li&gt;Dynamic security testing for API endpoints&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let me show you exactly how I did it, and more importantly, &lt;strong&gt;why&lt;/strong&gt; each piece matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Basic but Not Enough
&lt;/h2&gt;

&lt;p&gt;Most of us start with something like this in our GitHub Actions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run Bandit (SAST)&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bandit -r .&lt;/span&gt;

&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Snyk Dependency Scan&lt;/span&gt;
  &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;snyk/actions/python@master&lt;/span&gt;

&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Trivy Container Scan&lt;/span&gt;
  &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aquasecurity/trivy-action@master&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This covers the basics - some static analysis, dependency scanning, and container security. But &lt;strong&gt;it's not enough&lt;/strong&gt; for real-world applications. You're missing secrets detection, infrastructure security, proper quality gates, and a bunch of other stuff that might bite you later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; The Snyk dependency scan can be slow if your codebase has a large dependency tree and you are using a free account.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Enhanced Security Workflow
&lt;/h2&gt;

&lt;p&gt;I've created the "enhanced security workflow" that covers pretty much every security scanning angle I could think of. Let's dive into each section and understand why each piece is crucial.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Security-First Permissions
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;permissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;contents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;read&lt;/span&gt;
  &lt;span class="na"&gt;security-events&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;write&lt;/span&gt;
  &lt;span class="na"&gt;actions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;read&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this matters:&lt;/strong&gt; By default, GitHub Actions gets too many permissions. This is the principle of least privilege in action - only grant what's absolutely necessary. The &lt;code&gt;security-events: write&lt;/code&gt; permission is what lets us upload SARIF reports to GitHub's security tab.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Secret Detection
&lt;/h3&gt;

&lt;p&gt;This is probably the most important addition. You might have seen developers accidentally commit API keys, database passwords, or AWS credentials.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;GitGuardian Security Scan&lt;/span&gt;
  &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;GitGuardian/ggshield/actions/secret@v1.25.0&lt;/span&gt;
  &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;GITHUB_TOKEN&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.GITHUB_TOKEN }}&lt;/span&gt;
    &lt;span class="na"&gt;GITGUARDIAN_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.GITGUARDIAN_API_KEY }}&lt;/span&gt;

&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TruffleHog OSS Secret Scanning&lt;/span&gt;
  &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;trufflesecurity/trufflehog@main&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./services/upload_service&lt;/span&gt;
    &lt;span class="na"&gt;base&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ github.event.repository.default_branch }}&lt;/span&gt;
    &lt;span class="na"&gt;head&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HEAD&lt;/span&gt;
    &lt;span class="na"&gt;extra_args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;--debug --only-verified&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What's happening here:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitGuardian&lt;/strong&gt; knows about 450 types of secrets and produces very few false positives.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TruffleHog&lt;/strong&gt; is a really good open-source alternative.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;--only-verified&lt;/code&gt; flag means it will only alert on secrets it can actually verify (for example, by testing whether an API key actually works).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Tip:&lt;/strong&gt; Run both! GitGuardian might catch something TruffleHog misses and vice versa.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Either or both should be in your pre-commit hook as well to catch secrets at the earliest possible stage.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Enhanced SAST (Static Application Security Testing)
&lt;/h3&gt;

&lt;p&gt;Instead of just running Bandit, we're going to add a couple more tools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Install Security Tools&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;pip install bandit[toml] safety semgrep&lt;/span&gt;

&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run Bandit SAST (Enhanced)&lt;/span&gt;
  &lt;span class="na"&gt;working-directory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./services/upload_service&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;bandit -r . -f json -o bandit-report.json || true&lt;/span&gt;
    &lt;span class="s"&gt;bandit -r . -f txt&lt;/span&gt;
  &lt;span class="na"&gt;continue-on-error&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Semgrep Security Analysis&lt;/span&gt;
  &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;semgrep/semgrep-action@v1&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;-&lt;/span&gt;
      &lt;span class="s"&gt;p/security-audit&lt;/span&gt;
      &lt;span class="s"&gt;p/python&lt;/span&gt;
      &lt;span class="s"&gt;p/owasp-top-ten&lt;/span&gt;
  &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;SEMGREP_APP_TOKEN&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.SEMGREP_APP_TOKEN }}&lt;/span&gt;

&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Safety Check (Python Dependencies)&lt;/span&gt;
  &lt;span class="na"&gt;working-directory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./services/upload_service&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;safety check --json --output safety-report.json || &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Breaking this down:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bandit&lt;/strong&gt; is still our Python-specific security scanner, but now we're saving reports in both JSON (for processing) and text (for human reading). &lt;strong&gt;Note:&lt;/strong&gt; this runs Bandit twice; adjust accordingly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semgrep&lt;/strong&gt; is a new SAST tool - it's got rules for OWASP Top 10, language-specific issues, and general security patterns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Safety&lt;/strong&gt; checks your Python dependencies against known vulnerability databases.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why multiple tools?&lt;/strong&gt; Each tool has its strengths. Bandit knows Python really well, Semgrep has broader coverage, and Safety focuses specifically on dependencies. It's like having multiple security experts scan your code. Some might call this overkill, but it gives us a chance to understand, evaluate, and choose the best combination.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Infrastructure as Code Security (Dockerfiles Matter Too)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Checkov IaC Security Scan&lt;/span&gt;
  &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bridgecrewio/checkov-action@master&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;directory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
    &lt;span class="na"&gt;framework&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dockerfile,kubernetes,terraform&lt;/span&gt;
    &lt;span class="na"&gt;output_format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sarif&lt;/span&gt;
    &lt;span class="na"&gt;output_file_path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;checkov-report.sarif&lt;/span&gt;
    &lt;span class="na"&gt;quiet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;soft_fail&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;This is a very important step!&lt;/strong&gt; Checkov scans your Dockerfiles, Kubernetes manifests, Terraform configs - basically any infrastructure-as-code you've got. It'll catch stuff like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Running containers as root (big no-no)&lt;/li&gt;
&lt;li&gt;Missing health checks in containers&lt;/li&gt;
&lt;li&gt;Missing security contexts in Kubernetes&lt;/li&gt;
&lt;li&gt;Overly permissive IAM policies in Terraform&lt;/li&gt;
&lt;li&gt;Secrets hardcoded in Docker files&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;soft_fail: true&lt;/code&gt; means it won't break your build, but it'll still report issues.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. The Quality Gate
&lt;/h3&gt;

&lt;p&gt;This step decides if the workflow fails or not. Instead of just running scans and hoping someone reads the reports, this implements automated decision-making:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_security_reports&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;critical_issues&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="n"&gt;high_issues&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

    &lt;span class="c1"&gt;# Check Bandit report
&lt;/span&gt;    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;services/upload_service/bandit-report.json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;bandit_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;bandit_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;results&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]):&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;issue_severity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;HIGH&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;high_issues&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
                &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;issue_severity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;MEDIUM&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;critical_issues&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;FileNotFoundError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bandit report not found&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Security gate logic
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;critical_issues&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;❌ SECURITY GATE FAILED: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;critical_issues&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; critical security issues found&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;high_issues&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;⚠️  WARNING: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;high_issues&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; high-severity issues found&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✅ SECURITY GATE PASSED: No critical security issues detected&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;This is where the magic happens!&lt;/strong&gt; The pipeline will actually fail if there are critical security issues. Fix it now or your deployment won't happen.&lt;/p&gt;

&lt;p&gt;You can customize these thresholds based on your risk tolerance. Maybe you allow 0 critical issues in production but 3 in development branches.&lt;/p&gt;
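To make that concrete, here's a minimal sketch of environment-driven thresholds. The variable names `MAX_CRITICAL_ISSUES` and `MAX_HIGH_ISSUES` are my own invention, not part of the workflow above:

```python
import os

def get_security_thresholds():
    """Read gate thresholds from the environment, with strict defaults.

    MAX_CRITICAL_ISSUES / MAX_HIGH_ISSUES are hypothetical variable
    names; adjust them to match your own workflow.
    """
    return {
        "max_critical": int(os.environ.get("MAX_CRITICAL_ISSUES", "0")),
        "max_high": int(os.environ.get("MAX_HIGH_ISSUES", "5")),
    }

def gate_passes(critical_issues, high_issues, thresholds):
    """Return True when the issue counts are within the allowed limits."""
    return (critical_issues <= thresholds["max_critical"]
            and high_issues <= thresholds["max_high"])
```

In GitHub Actions you could then set, say, `MAX_CRITICAL_ISSUES: "3"` in the `env:` block of development-branch jobs while production jobs keep the strict default of zero.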

&lt;h3&gt;
  
  
  6. Container Security
&lt;/h3&gt;

&lt;p&gt;We're not just scanning the final image anymore:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Trivy Filesystem Scan&lt;/span&gt;
  &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aquasecurity/trivy-action@master&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;scan-type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fs'&lt;/span&gt;
    &lt;span class="na"&gt;scan-ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;./services/upload_service'&lt;/span&gt;

&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Trivy Container Image Scan (Critical)&lt;/span&gt;
  &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aquasecurity/trivy-action@master&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image-ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;upload-service:latest'&lt;/span&gt;
    &lt;span class="na"&gt;exit-code&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1'&lt;/span&gt;
    &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CRITICAL,HIGH'&lt;/span&gt;

&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Docker Scout CVE Scan&lt;/span&gt;
  &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker/scout-action@v1&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cves&lt;/span&gt;
    &lt;span class="na"&gt;image-ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;upload-service:latest&lt;/span&gt;
    &lt;span class="na"&gt;only-severities&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical,high&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Three layers of container security:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Filesystem scan&lt;/strong&gt; - checks your source code and files&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Image scan&lt;/strong&gt; - scans the built Docker image for vulnerabilities
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker Scout&lt;/strong&gt; - Docker's own security scanning (different vulnerability database)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The &lt;code&gt;exit-code: '1'&lt;/code&gt; on the image scan means it'll fail the build if critical or high severity issues are found.&lt;/p&gt;
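If you want the same quality-gate treatment for Trivy findings, a hedged sketch that counts severities from a Trivy JSON report looks like this (it assumes the standard `--format json` layout with a top-level `Results` list; check your Trivy version's schema before relying on it):

```python
import json

def count_trivy_severities(report_text):
    """Count vulnerabilities by severity in a Trivy JSON report.

    Assumes the usual layout: a top-level `Results` list whose
    entries carry a `Vulnerabilities` list (which may be null).
    """
    report = json.loads(report_text)
    counts = {}
    for result in report.get("Results", []):
        for vuln in result.get("Vulnerabilities") or []:
            sev = vuln.get("Severity", "UNKNOWN")
            counts[sev] = counts.get(sev, 0) + 1
    return counts

# A tiny made-up report for illustration:
sample = '''{"Results": [{"Vulnerabilities": [
  {"VulnerabilityID": "CVE-2024-0001", "Severity": "CRITICAL"},
  {"VulnerabilityID": "CVE-2024-0002", "Severity": "HIGH"},
  {"VulnerabilityID": "CVE-2024-0003", "Severity": "HIGH"}]}]}'''
print(count_trivy_severities(sample))  # {'CRITICAL': 1, 'HIGH': 2}
```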

&lt;h3&gt;
  
  
  7. Dynamic Security Testing (The Runtime Check)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Start Application for DAST&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;docker-compose up -d&lt;/span&gt;
    &lt;span class="s"&gt;sleep 30&lt;/span&gt;
    &lt;span class="s"&gt;curl -f http://localhost:8000/health || exit 1&lt;/span&gt;

&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;OWASP ZAP API Scan&lt;/span&gt;
  &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;zaproxy/action-api-scan@v0.7.0&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8000'&lt;/span&gt;
    &lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;openapi'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;DAST (Dynamic Application Security Testing)&lt;/strong&gt; is where we actually run the application and poke at it to see if there are vulnerabilities that only show up at runtime. OWASP ZAP is like having a hacker test your API for common web vulnerabilities.&lt;/p&gt;
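To give a flavour of what only runtime testing can see, here's a tiny, hedged sketch of a response-header check; static scans never see HTTP headers, but DAST does. The header list is illustrative, not exhaustive:

```python
# Security headers we expect on API responses (illustrative subset).
EXPECTED_HEADERS = ["X-Content-Type-Options", "Content-Security-Policy"]

def missing_security_headers(headers):
    """Return the expected security headers absent from a response.

    `headers` is any mapping of response header names to values,
    e.g. `resp.headers` after fetching http://localhost:8000/health.
    Comparison is case-insensitive, as HTTP header names are.
    """
    present = {name.lower() for name in headers}
    return [h for h in EXPECTED_HEADERS if h.lower() not in present]
```

ZAP performs far deeper checks than this, of course, but the principle is the same: observe the running application's actual behaviour rather than its source.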

&lt;h2&gt;
  
  
  The Secret Sauce: SARIF Integration
&lt;/h2&gt;

&lt;p&gt;You'll notice we're outputting a lot of reports in SARIF format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Upload Trivy SARIF Reports&lt;/span&gt;
  &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;github/codeql-action/upload-sarif@v3&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;sarif_file&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;trivy-fs-results.sarif&lt;/span&gt;
      &lt;span class="s"&gt;trivy-image-results.sarif&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;SARIF (Static Analysis Results Interchange Format)&lt;/strong&gt; is a standard format that GitHub understands. When you upload SARIF files, all your security findings show up beautifully in GitHub's Security tab. No more digging through build logs!&lt;/p&gt;
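For the curious, the SARIF shape itself is quite small. Here's a hedged sketch of a minimal document (field names follow the SARIF 2.1.0 spec; the tool name and finding below are made up):

```python
import json

def make_sarif(tool_name, findings):
    """Build a minimal SARIF 2.1.0 document.

    `findings` is a list of (rule_id, message, path) tuples.
    """
    return {
        "version": "2.1.0",
        "$schema": "https://json.schemastore.org/sarif-2.1.0.json",
        "runs": [{
            "tool": {"driver": {"name": tool_name}},
            "results": [{
                "ruleId": rule_id,
                "message": {"text": message},
                "locations": [{"physicalLocation": {
                    "artifactLocation": {"uri": path}}}],
            } for rule_id, message, path in findings],
        }],
    }

doc = make_sarif("my-scanner",
                 [("B105", "Possible hardcoded password", "app/main.py")])
print(json.dumps(doc, indent=2)[:60])
```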

&lt;h2&gt;
  
  
  Common Gotchas and How to Avoid Them
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. False Positives
&lt;/h3&gt;

&lt;p&gt;Every security tool produces false positives. Here's how to handle them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start with &lt;code&gt;continue-on-error: true&lt;/code&gt; while you tune your thresholds&lt;/li&gt;
&lt;li&gt;Create suppression files for known false positives
&lt;/li&gt;
&lt;li&gt;Use multiple tools to cross-verify findings&lt;/li&gt;
&lt;/ul&gt;
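For Bandit specifically, inline suppression with `# nosec` plus a test ID is the documented mechanism once you've verified a finding is a false positive. A small illustrative example (the `random` usage here is just a stand-in):

```python
import random

# Bandit flags B311 (pseudo-random generators) on the next line even
# when the value is not security-sensitive; after review, an inline
# marker with the specific test ID silences just this one finding.
session_suffix = random.randint(1000, 9999)  # nosec B311

print(session_suffix)
```

Scoping the suppression to a single test ID (`B311`) rather than a bare `# nosec` keeps the rest of the line's findings visible.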

&lt;h3&gt;
  
  
  2. Secrets Management for Security Tools
&lt;/h3&gt;

&lt;p&gt;You'll need API tokens for most of these tools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SNYK_TOKEN
GITGUARDIAN_API_KEY  
SEMGREP_APP_TOKEN
FOSSA_API_KEY
SAFETY_API_KEY
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Store these as GitHub secrets, obviously. Most tools have free tiers that are perfect for getting started.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Performance Impact
&lt;/h3&gt;

&lt;p&gt;This workflow is comprehensive, but it's also slow (Snyk in particular is sluggish on a free account). Here's how to optimize:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run heavy scans only on main branch pushes and PRs&lt;/li&gt;
&lt;li&gt;Use caching for tool installations&lt;/li&gt;
&lt;li&gt;Run some scans in parallel when possible&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Developer Experience
&lt;/h3&gt;

&lt;p&gt;Nobody likes pipelines that break all the time. Make it developer-friendly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clear error messages in quality gates&lt;/li&gt;
&lt;li&gt;Easy-to-find security reports&lt;/li&gt;
&lt;li&gt;Documentation on how to fix common issues&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;p&gt;This setup covers most core security checks, and here are some steps to take it even further:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Auto-remediation&lt;/strong&gt;: Use Dependabot to automatically fix dependency vulnerabilities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security notifications&lt;/strong&gt;: Integrate with Slack/Teams for security alerts
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security metrics&lt;/strong&gt;: Track MTTR (Mean Time to Remediation) and security fixes&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;This pipeline isn't just about adding lots of tools - it's about enforcing policies. This workflow gives developers immediate feedback on security issues while maintaining development velocity.&lt;/p&gt;

&lt;p&gt;This is &lt;strong&gt;not perfect&lt;/strong&gt;, but significantly better security is totally achievable with the right tooling and processes. You can pick and choose, tune it for your specific needs, and gradually level up your security.&lt;/p&gt;

&lt;p&gt;The goal isn't to make development slower - it's to catch issues early when they're easy to fix, rather than in production when they're expensive and embarrassing.&lt;/p&gt;

&lt;p&gt;Happy coding, speedy deployments! 🚀&lt;/p&gt;




</description>
      <category>devops</category>
      <category>python</category>
      <category>containers</category>
      <category>cicd</category>
    </item>
    <item>
      <title>Enhanced Paperless-NGX with Paddle OCR + LLM Pipeline</title>
      <dc:creator>Rajesh Pethe</dc:creator>
      <pubDate>Sat, 12 Jul 2025 10:31:46 +0000</pubDate>
      <link>https://dev.to/eklavvya/enhanced-paperless-ngx-with-paddle-ocr-llm-pipeline-3mgp</link>
      <guid>https://dev.to/eklavvya/enhanced-paperless-ngx-with-paddle-ocr-llm-pipeline-3mgp</guid>
      <description>&lt;h2&gt;
  
  
  Building a Private AI Document Pipeline with Paperless, PaddleOCR, and LLMs
&lt;/h2&gt;

&lt;p&gt;So this past week I hacked together a little side project to smarten up my &lt;a href="https://github.com/paperless-ngx/paperless-ngx" rel="noopener noreferrer"&gt;Paperless-ngx&lt;/a&gt; setup — you know, that self-hosted document management system that eats PDFs and makes them searchable.&lt;/p&gt;

&lt;p&gt;Now, Paperless-ngx is solid, don’t get me wrong. But it uses Tesseract for OCR, and honestly... Tesseract is not optimal for anything that's not clean text.&lt;/p&gt;

&lt;p&gt;So this evolved out of a need to improve paperless-ngx's OCR capability and to properly classify documents and extract tags, titles and summaries for them.&lt;/p&gt;

&lt;p&gt;This blog is a walkthrough of what I built — what worked, what didn’t, and how it turned into a pretty neat little pipeline with its own microservices. Hope it gives you some ideas.&lt;/p&gt;




&lt;h2&gt;
  
  
  💡 What I Was Going For
&lt;/h2&gt;

&lt;p&gt;I wanted a system that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Uses &lt;strong&gt;PaddleOCR&lt;/strong&gt; instead of Tesseract for better OCR output&lt;/li&gt;
&lt;li&gt;Runs a &lt;strong&gt;local LLM&lt;/strong&gt; using &lt;a href="https://ollama.com" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt; to:

&lt;ul&gt;
&lt;li&gt;Suggest a smart document title&lt;/li&gt;
&lt;li&gt;Classify the document into a type (invoice, id, tax, etc.)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Pushes that back to Paperless so the doc is nicely searchable and tagged&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Everything stays local/private. No exposure to external LLMs. Just Python, containers, and Compose to stitch everything together.&lt;/p&gt;




&lt;h2&gt;
  
  
  🔧 How It All Works (Now)
&lt;/h2&gt;

&lt;p&gt;After a few iterations, I ended up with a clean microservice setup with each part doing its job:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;paperless-ngx&lt;/code&gt;: The main document management system (already amazing)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;ollama&lt;/code&gt;: Runs a local LLM like Mistral (or phi3 or any configurable in env), no cloud stuff&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;ocr-service&lt;/code&gt;: FastAPI service that runs PaddleOCR&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;pipeline&lt;/code&gt;: Python CLI that connects all the dots — downloads doc from Paperless, sends it to OCR and LLM, then updates Paperless with the smart results&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That way, each service does one thing and does it well, and each can be enhanced in isolation. They all run in Docker, talk to each other over the same network, and make one smooth, local AI document workflow.&lt;/p&gt;

&lt;p&gt;This leverages &lt;code&gt;paperless-ngx&lt;/code&gt;'s extensive features and augments it with better OCR and LLM classification capabilities.&lt;/p&gt;




&lt;h2&gt;
  
  
  🚦 The Flow Looks Like This:
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;📁 Paperless stores docs in its DB
⬇️
🤖 I run my pipeline CLI: `docker compose run pipeline 42`
⬇️
📥 pipeline downloads the doc via Paperless API (by ID)
⬇️
📤 Sends it to the OCR microservice over HTTP
⬇️
🧠 Gets back clean OCR’d text
⬇️
🧠 Sends text to Ollama (LLM) to generate:
    - title
    - document type
⬇️
🔁 Updates Paperless document via PATCH
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
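As a rough sketch of the last steps in that flow, here's how the pipeline might turn the LLM's reply into a PATCH body. The expected JSON reply shape is my own convention from the prompt, and mapping a predicted type label to a Paperless `document_type` ID is left to the caller:

```python
import json

def build_patch_payload(llm_output):
    """Turn the LLM's JSON reply into a Paperless-ngx PATCH body.

    Assumes the prompt asked the model to answer with
    {"title": ..., "document_type": ...}; degrades gracefully
    when the reply is not valid JSON.
    """
    try:
        data = json.loads(llm_output)
    except json.JSONDecodeError:
        return {}
    payload = {}
    if data.get("title"):
        payload["title"] = data["title"].strip()
    if data.get("document_type"):
        # Paperless expects a document_type *ID* in the PATCH;
        # resolving the predicted label to an ID happens elsewhere.
        payload["predicted_type"] = data["document_type"].strip()
    return payload

# With `requests`, the final update is then roughly:
#   requests.patch(f"{base}/api/documents/{doc_id}/",
#                  headers={"Authorization": f"Token {token}"},
#                  json=payload)
```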



&lt;h2&gt;
  
  
  🐳 Everything Runs in Docker
&lt;/h2&gt;

&lt;p&gt;Here's the final list of containers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;paperless → standard Paperless-ngx&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;redis → required by Paperless&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;ollama → runs local LLMs like Mistral&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;ocr-service → FastAPI + PaddleOCR&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;pipeline → command-line microservice that ties it all together&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🗺 Architecture Diagram
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                   ┌────────────────────────┐
                   │ 📄 Paperless-ngx (UI)  │
                   └────────────┬───────────┘
                                │
                    [User notes Document ID]
                                │
                   ┌────────────▼────────────┐
                   │   🐍 Pipeline Service    │
                   │ (LLM &amp;amp; Orchestration)   │
                   └────────────┬────────────┘
                                │
           ┌────────────────────┼────────────────────┐
           │                    │                    │
           ▼                    ▼                    ▼
  ┌────────────────┐   ┌────────────────┐   ┌─────────────────────┐
  │ Downloads PDF  │   │ Sends to OCR   │   │ Sends OCR text to   │
  │ via Paperless  │   │ microservice   │   │ Ollama LLM (Mistral)│
  └────────────────┘   └────────────────┘   └─────────────────────┘
                                                │
                     ◀────────────┬─────────────┘
                                  ▼
                       📝 Title + Type Prediction
                                  │
                       🔁 PATCH back to Paperless
                       (update metadata + text)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  😵 What Gave Me Trouble
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Paperless consumes files from consume/ automatically and moves them — I worked around that by operating only on document IDs via the API, since at the time I was more focused on adding the OCR/AI features. Automating that hand-off is high on my TODO list.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;PaddleOCR kept re-downloading models — I made sure the models are cached so they are fetched only once.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  📦 Folder Structure
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;├── docker-compose.yml
├── __init__.py
├── model-cache
├── ocr_service
│   ├── app
│   │   ├── __init__.py
│   │   ├── main.py
│   │   ├── ocr_config.yaml
│   │   └── ocr_engine.py
│   ├── Dockerfile
│   └── requirements.txt
├── paperless-data
│   ├── consume
│   ├── data
│   │   ├── db.sqlite3
│   │   ├── index
│   │   │   ├── _MAIN_63.toc
│   │   │   ├── MAIN_9v88o8vye3gbqub1.seg
│   │   │   ├── MAIN_ui8jrpftrvauh4n1.seg
│   │   │   ├── MAIN_wceagdbh71brm5wg.seg
│   │   │   └── MAIN_WRITELOCK
│   │   ├── log
│   │   │   └── celery.log.1
│   │   └── migration_lock
│   └── media
│       ├── documents
│       │   ├── archive
│       │   │   ├── 0000009.pdf
│       │   │   └── 0000016.pdf
│       │   ├── originals
│       │   │   ├── 0000009.jpg
│       │   │   └── 0000016.pdf
│       │   └── thumbnails
│       │       ├── 0000009.webp
│       │       └── 0000016.webp
│       └── media.lock
├── pipeline_service
│   ├── app
│   │   ├── api_client.py
│   │   ├── __init__.py
│   │   ├── llm_processor.py
│   │   ├── main.py
│   │   └── watcher.py
│   ├── Dockerfile
│   ├── __init__.py
│   ├── logger.py
│   ├── prompts
│   │   └── classify_title.txt
│   ├── requirements.txt
│   └── test.py
└── README.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  ✅ What's Next?
&lt;/h2&gt;

&lt;p&gt;I might:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Run the pipeline automatically when a new doc lands&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Add authentication for OCR and pipeline service (utilizing &lt;code&gt;paperless-ngx&lt;/code&gt;'s token auth?)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Improve performance of OCR service, perhaps using other language (Go, Rust)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Add document summarization via LLM&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Extract metadata like amount, date, sender&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Hook into Paperless tags and correspondents&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🏁 Wrapping Up
&lt;/h2&gt;

&lt;p&gt;If you need a document management system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;That ingests all kinds of docs: PDFs, images, etc.&lt;/li&gt;
&lt;li&gt;That reliably extracts text from varied kinds of docs.&lt;/li&gt;
&lt;li&gt;That classifies, tags and summarizes docs using an LLM.&lt;/li&gt;
&lt;li&gt;That keeps stuff private - no exposure to external LLMs.&lt;/li&gt;
&lt;li&gt;That comes loaded with &lt;code&gt;paperless-ngx&lt;/code&gt;'s features.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then this might be an appealing setup.&lt;/p&gt;

&lt;p&gt;It’s Python all the way down. No rocket science — just containers, OCR, and a private LLM.&lt;/p&gt;

&lt;p&gt;Questions or inputs are welcome.&lt;/p&gt;

</description>
      <category>python</category>
      <category>docker</category>
      <category>llm</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Flags API: Flagging Phishing Emails</title>
      <dc:creator>Rajesh Pethe</dc:creator>
      <pubDate>Sun, 08 Jun 2025 13:15:18 +0000</pubDate>
      <link>https://dev.to/eklavvya/flags-api-flagging-phishing-emails-3ckd</link>
      <guid>https://dev.to/eklavvya/flags-api-flagging-phishing-emails-3ckd</guid>
      <description>&lt;p&gt;This is a submission for the &lt;a href="https://dev.to/challenges/postmark"&gt;Postmark Challenge: Inbox Innovators&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;I built a &lt;strong&gt;developer-focused phishing detection microservice&lt;/strong&gt; that analyzes inbound emails (via Postmark) and scores them for potential phishing indicators. The solution combines &lt;strong&gt;classic heuristics&lt;/strong&gt; (e.g., mismatched links, suspicious reply-to addresses) with &lt;strong&gt;machine learning-based email intent classification&lt;/strong&gt; to provide explainable, interpretable results.&lt;/p&gt;

&lt;p&gt;The service is designed to be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Explainable&lt;/strong&gt; – Every phishing score is backed by specific, human-readable reasons.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extensible&lt;/strong&gt; – Built on FastAPI, with a modular architecture for adding more rules or ML models.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Postmark-ready&lt;/strong&gt; – Accepts Postmark’s inbound webhook payloads out-of-the-box.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;You can run the service locally using Docker or Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uvicorn app.api:app &lt;span class="nt"&gt;--reload&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then test it using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8000/postmark/webhook &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; @tests/sample_postmark_email.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The service will return a phishing verdict like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"verdict"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.75&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"reasons"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"Mismatch between link text and URL destination"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"Suspicious reply-to address"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"Detected urgent or manipulative language"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"intent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"threat"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"intent_confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.92&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No credentials are required for testing. Feel free to use the sample payloads in the repo.&lt;/p&gt;

&lt;h2&gt;
  
  
  Code Repository
&lt;/h2&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/dteklavya/mail-sentinel" rel="noopener noreferrer"&gt;https://github.com/dteklavya/mail-sentinel&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Built It
&lt;/h2&gt;

&lt;p&gt;This project was built using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python + FastAPI for the web API&lt;/li&gt;
&lt;li&gt;Pytest for test coverage&lt;/li&gt;
&lt;li&gt;Hugging Face Transformers to detect manipulative email intent&lt;/li&gt;
&lt;li&gt;Postmark Inbound Webhook to ingest real-world email data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The phishing detector combines rule-based checks like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mismatched anchor text and URLs&lt;/li&gt;
&lt;li&gt;Suspicious Reply-To headers&lt;/li&gt;
&lt;li&gt;Common urgent phrases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;with ML-based intent detection for “fear”, “threats”, and similar phishing tones.&lt;/p&gt;

&lt;p&gt;The design keeps the logic explainable and modular, making it ideal for dev-focused environments where transparency in email filtering is critical.&lt;/p&gt;
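The first heuristic in that list is simple enough to sketch. This assumes links have already been extracted from the email body as (anchor text, href) pairs; the real service's internals may differ:

```python
from urllib.parse import urlparse

def link_mismatch(anchor_text, href):
    """Flag links whose visible text looks like a URL but points
    at a different host than the actual destination."""
    text = anchor_text.strip().lower()
    if not text.startswith(("http://", "https://", "www.")):
        return False  # plain-text anchors ("click here") are not checked here
    if text.startswith("www."):
        text = "http://" + text  # give urlparse a scheme to work with
    shown_host = urlparse(text).hostname or ""
    real_host = urlparse(href.strip().lower()).hostname or ""
    return shown_host != real_host

print(link_mismatch("https://mybank.com", "http://evil.example/login"))  # True
print(link_mismatch("click here", "http://evil.example/login"))          # False
```

Each `True` result contributes to the score and produces a human-readable reason string, which is what keeps the verdict explainable.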

&lt;h2&gt;
  
  
  TODO / Limitations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The current ML model for email intent focuses on emotional tone (e.g., fear, threat) but doesn't fully capture all varieties of phishing tactics (like fake promotions, lotteries, or impersonated legal notices).&lt;/li&gt;
&lt;li&gt;Intent classification can be further refined by fine-tuning on email-specific datasets or integrating custom-trained classifiers for phishing intent.&lt;/li&gt;
&lt;li&gt;UI/visualization layer is not included — future plans include adding a simple dashboard or Postmark-friendly email header injection for visibility.&lt;/li&gt;
&lt;li&gt;Due to the short development window, this is an MVP — several enhancements (e.g., domain reputation checks, attachment analysis) are on the roadmap.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>devchallenge</category>
      <category>postmarkchallenge</category>
      <category>webdev</category>
      <category>api</category>
    </item>
    <item>
      <title>Brilliantly simple: The Linux File System</title>
      <dc:creator>Rajesh Pethe</dc:creator>
      <pubDate>Mon, 02 Jun 2025 12:49:04 +0000</pubDate>
      <link>https://dev.to/eklavvya/brilliantly-simple-the-linux-file-system-2190</link>
      <guid>https://dev.to/eklavvya/brilliantly-simple-the-linux-file-system-2190</guid>
      <description>&lt;p&gt;Note: This is a re-publish from the original post at &lt;a href="https://eklavvya.hashnode.dev/brilliantly-simple-the-linux-file-system" rel="noopener noreferrer"&gt;Hashnode&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The simple, effective design of Unix/Linux has inspired everyone who studies it. There are some gems in the Linux filesystem that have awed me, and here I share a few of them.&lt;/p&gt;

&lt;h3&gt;
  
  
  File Permissions
&lt;/h3&gt;

&lt;p&gt;Every file has nine permission bits, three each for:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Owner of the file - usually the user that created the file.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;User group that owns the file - usually the group the user belongs to.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;All others - rest of the world.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;And the three bits are &lt;code&gt;rwx&lt;/code&gt; for read, write and execute.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-rw-rw-r-- 1 user1 user1 0 Jun 24 10:13 temp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first character &lt;code&gt;-&lt;/code&gt; in the listing above indicates the type of file: &lt;code&gt;-&lt;/code&gt; means a regular file; other values include &lt;code&gt;d&lt;/code&gt; (directory), &lt;code&gt;c&lt;/code&gt; (character device), &lt;code&gt;b&lt;/code&gt; (block device) and so on.&lt;/p&gt;

&lt;p&gt;So the user &lt;code&gt;user1&lt;/code&gt; and its group have read and write permission on the file &lt;code&gt;temp&lt;/code&gt;, and all others have read-only permission. The default permission for a new file is decided by the &lt;code&gt;umask&lt;/code&gt; setting; we'll not go into the details at this point.&lt;/p&gt;

&lt;p&gt;What exactly do these &lt;code&gt;rwx&lt;/code&gt; bits mean?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;r&lt;/code&gt; - permission to read file contents.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;w&lt;/code&gt; - permission to change file contents.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;x&lt;/code&gt; - permission to execute the file as a program.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, the permissions on the &lt;code&gt;cat&lt;/code&gt; program:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nt"&gt;-rwxr-xr-x&lt;/span&gt; 1 root root 35288 Feb  8 09:16 /usr/bin/cat
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;/usr/bin/cat&lt;/code&gt; is owned by the &lt;code&gt;root&lt;/code&gt; user and group, but only the &lt;code&gt;root&lt;/code&gt; user can overwrite it. Users belonging to the &lt;code&gt;root&lt;/code&gt; group, and all &lt;code&gt;others&lt;/code&gt;, can execute the program.&lt;/p&gt;

&lt;p&gt;You may say &lt;code&gt;rwx&lt;/code&gt; means pretty much what it says; quite simply, yes. But the same permission bits take on subtly different meanings for other types of files.&lt;/p&gt;
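&lt;p&gt;If you ever want to decode these strings programmatically, Python's standard &lt;code&gt;stat&lt;/code&gt; module can render a raw mode word as the familiar listing string (a small side illustration; the examples below don't depend on it):&lt;/p&gt;

```python
import stat

# 0o100644 is a regular file (type bits 0o100000) with permission bits 644.
# stat.filemode() renders the mode word as the familiar `ls -l` string.
print(stat.filemode(0o100644))   # -rw-r--r--
print(stat.filemode(0o100755))   # -rwxr-xr-x
print(stat.filemode(0o040755))   # drwxr-xr-x  (0o040000 marks a directory)
```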

&lt;h3&gt;
  
  
  SETUID Bit on Files
&lt;/h3&gt;

&lt;p&gt;Users on Linux sometimes need to perform actions that require superuser permissions, for example changing passwords or switching users. There are specific commands for these; here are the permissions on a few of them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# ls -l /usr/bin/passwd /usr/bin/su /usr/bin/sudo&lt;/span&gt;
&lt;span class="nt"&gt;-rwsr-xr-x&lt;/span&gt; 1 root root 232416 Apr  3  2023 /usr/bin/sudo
&lt;span class="nt"&gt;-rwsr-xr-x&lt;/span&gt; 1 root root  59976 Feb  6 18:24 /usr/bin/passwd
&lt;span class="nt"&gt;-rwsr-xr-x&lt;/span&gt; 1 root root  55680 Apr  9 21:02 /usr/bin/su
&lt;span class="c"&gt;# &lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The owner permissions on these are &lt;code&gt;rws&lt;/code&gt; (the owner being root). The &lt;code&gt;s&lt;/code&gt; is the SETUID bit: whoever runs these programs, they execute &lt;strong&gt;as&lt;/strong&gt; the &lt;code&gt;root&lt;/code&gt; user and hence with superuser privileges.&lt;/p&gt;

&lt;p&gt;It is intriguing to consider: what if one of these special programs had &lt;strong&gt;write permission&lt;/strong&gt; for &lt;strong&gt;others&lt;/strong&gt;? That would allow &lt;strong&gt;any&lt;/strong&gt; user to overwrite the program with malicious code, and whenever that program was executed, it would run with superuser privileges.&lt;/p&gt;

&lt;p&gt;You can find all programs with the SETUID bit set using &lt;code&gt;find /usr/bin -perm /u+s&lt;/code&gt;.&lt;/p&gt;
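&lt;p&gt;The same scan can be sketched in Python with the standard &lt;code&gt;os&lt;/code&gt; and &lt;code&gt;stat&lt;/code&gt; modules (a minimal sketch; &lt;code&gt;setuid_files&lt;/code&gt; is my own illustrative name, not an existing tool):&lt;/p&gt;

```python
import os
import stat

def setuid_files(directory):
    """Mirror `find DIR -perm /u+s`: list files whose SETUID bit is set."""
    hits = []
    for entry in os.scandir(directory):
        if entry.is_file(follow_symlinks=False):
            mode = entry.stat(follow_symlinks=False).st_mode
            # In the rendered mode string, the owner-execute slot shows
            # 's' (or 'S' when execute is off) if SETUID is set.
            if stat.filemode(mode)[3] in "sS":
                hits.append(entry.path)
    return sorted(hits)

print(setuid_files("/usr/bin"))
```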

&lt;h3&gt;
  
  
  Directory Permissions
&lt;/h3&gt;

&lt;p&gt;A directory is just a special case of a file. Take a look at this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;rajesh .../blog &lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nt"&gt;-ld&lt;/span&gt; /usr/bin/
drwxr-xr-x 2 root root 86016 Jun 24 09:12 /usr/bin/
rajesh .../blog &lt;span class="err"&gt;$&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So the &lt;code&gt;/usr/bin&lt;/code&gt; directory is owned by the &lt;code&gt;root&lt;/code&gt; user. Note that the first character has changed to &lt;code&gt;d&lt;/code&gt; since this is a directory. Now let's look at what the remaining nine permission bits mean for a directory.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;r&lt;/code&gt; - the directory is readable: its contents can be listed using &lt;code&gt;ls&lt;/code&gt;, other commands, or system/function calls.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;w&lt;/code&gt; - files in the directory can be created, deleted, and renamed. Note that creating, deleting, or renaming a file DOES NOT depend on the file's permissions but on the directory's.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;x&lt;/code&gt; - execute a directory? No, a directory surely cannot be executed as a program :) Rather, it means you can access a file inside it if you know the complete path name, even without read permission on the directory. The following example should make it clear:&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;```bash
# mkdir /tmp/test
# ls -ld /tmp/test
drwxr-xr-x 2 root root 4096 Jun 24 11:34 /tmp/test
# chmod 751 /tmp/test/
# ls -ld /tmp/test
drwxr-x--x 2 root root 4096 Jun 24 11:34 /tmp/test
# 
# touch /tmp/test/temp
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The set of commands above makes a directory &lt;code&gt;/tmp/test&lt;/code&gt; and changes its permissions to &lt;code&gt;drwxr-x--x&lt;/code&gt;; note that others are given only execute permission, no read permission. Then, as a normal user:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;rajesh .../blog &lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nt"&gt;-ltr&lt;/span&gt; /tmp/test/
&lt;span class="nb"&gt;ls&lt;/span&gt;: cannot open directory &lt;span class="s1"&gt;'/tmp/test/'&lt;/span&gt;: Permission denied
rajesh .../blog &lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nt"&gt;-ltr&lt;/span&gt; /tmp/test/temp
&lt;span class="nt"&gt;-rw-r--r--&lt;/span&gt; 1 root root 0 Jun 24 11:35 /tmp/test/temp
rajesh .../blog &lt;span class="err"&gt;$&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On close inspection, you'll see that listing the directory contents is denied, but you can still list the file because you used its complete path name. That is the effect of the &lt;code&gt;x&lt;/code&gt; permission on the directory for &lt;code&gt;others&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sticky Bit Directory Permissions
&lt;/h2&gt;

&lt;p&gt;There is a special directory that all users need in order to store their temporary data in files and sub-directories, and it has special permissions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# ls -ld /tmp/&lt;/span&gt;
drwxrwxrwt 27 root root 20480 Jun 24 14:16 /tmp/
&lt;span class="c"&gt;# &lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Look at the last character in the permissions, &lt;code&gt;t&lt;/code&gt;: it is known as the sticky bit. Also notice that all users have all permissions on this directory, which would normally mean everyone can create, delete, and rename any file in it.&lt;/p&gt;

&lt;p&gt;But here is where the sticky bit comes into play. When the sticky bit is set on a directory, any user can delete or rename &lt;strong&gt;only&lt;/strong&gt; the files they own. Try deleting a file belonging to another user and you'll get a &lt;code&gt;permission denied&lt;/code&gt; error.&lt;/p&gt;
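&lt;p&gt;You can see the &lt;code&gt;t&lt;/code&gt; bit appear yourself by setting mode &lt;code&gt;1777&lt;/code&gt; on a scratch directory of your own; a quick Python sketch:&lt;/p&gt;

```python
import os
import stat
import tempfile

# The sticky bit is the 0o1000 bit in the mode word; combined with
# full rwx for everyone (0o777), the rendered mode matches /tmp's.
d = tempfile.mkdtemp()
os.chmod(d, 0o1777)
print(stat.filemode(os.stat(d).st_mode))   # drwxrwxrwt
```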

&lt;h2&gt;
  
  
  What's next?
&lt;/h2&gt;

&lt;p&gt;That summarises the traditional file permissions on Linux. There are other ways to handle more granular permissions.&lt;/p&gt;

&lt;h2&gt;
  
  
  File Attributes
&lt;/h2&gt;

&lt;p&gt;Can we have a file that cannot be modified, deleted, or renamed, even by &lt;code&gt;root&lt;/code&gt;? Yes, that's where &lt;code&gt;chattr&lt;/code&gt; comes in; it is part of the &lt;code&gt;e2fsprogs&lt;/code&gt; package on Linux.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;chattr&lt;/code&gt; command in Linux is used to &lt;strong&gt;change file attributes&lt;/strong&gt; on a Linux file system. These attributes go beyond traditional file permissions (read/write/execute) and offer more granular control over file behaviour.&lt;/p&gt;

&lt;h2&gt;
  
  
  Access Control List
&lt;/h2&gt;

&lt;p&gt;Can we give &lt;strong&gt;two different users&lt;/strong&gt; &lt;strong&gt;read&lt;/strong&gt; access to a file while no one else can access it? Or give a &lt;strong&gt;specific user&lt;/strong&gt; write access without changing group ownership? &lt;strong&gt;Access Control Lists (ACLs)&lt;/strong&gt; address these scenarios, with the &lt;code&gt;getfacl&lt;/code&gt; and &lt;code&gt;setfacl&lt;/code&gt; command-line tools providing granular file permissions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;That wraps up a few things that appealed to me as brilliantly simple ways to implement a flexible and robust filesystem.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
