<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Rajesh Pethe</title>
    <description>The latest articles on DEV Community by Rajesh Pethe (@eklavvya).</description>
    <link>https://dev.to/eklavvya</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3214254%2F76f5dbaa-1577-46a4-94f8-0670d21d88ea.jpg</url>
      <title>DEV Community: Rajesh Pethe</title>
      <link>https://dev.to/eklavvya</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/eklavvya"/>
    <language>en</language>
    <item>
      <title>Building an Event-Driven OCR Service: Challenges and Solutions</title>
      <dc:creator>Rajesh Pethe</dc:creator>
      <pubDate>Wed, 10 Dec 2025 14:29:10 +0000</pubDate>
      <link>https://dev.to/eklavvya/building-an-event-driven-ocr-service-challenges-and-solutions-35c9</link>
      <guid>https://dev.to/eklavvya/building-an-event-driven-ocr-service-challenges-and-solutions-35c9</guid>
      <description>&lt;p&gt;Optical Character Recognition (OCR) is a powerful AI/ML technology that recognizes and extracts text from images and scanned documents. &lt;/p&gt;

&lt;p&gt;Creating a scalable, event-driven web OCR service comes with challenges. This write-up details the problems, lessons, and solutions uncovered while building an OCR service with FastAPI + Celery + Redis + PaddleOCR, intended for integration with Paperless-ngx, an open-source document management system.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Wanted to Build (and Why)
&lt;/h2&gt;

&lt;p&gt;Our objective was to build an &lt;strong&gt;event-driven service&lt;/strong&gt; that efficiently converts PDFs or images into searchable PDFs with a selectable and searchable text layer. The focus was on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Handling PDFs of arbitrary length and complexity.&lt;/li&gt;
&lt;li&gt;Delivering results asynchronously due to CPU-heavy OCR tasks.&lt;/li&gt;
&lt;li&gt;Creating outputs integrable with Paperless-ngx for document archiving and retrieval.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Why a Simple Script Is Not Good Enough
&lt;/h2&gt;

&lt;p&gt;OCR workloads demand significant compute power, especially on large or image-heavy PDFs. The process involves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OCR inference:&lt;/strong&gt; Detecting and recognizing text from images - the most CPU intensive part.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Collating results:&lt;/strong&gt; Combining recognized text from many pages.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embedding a text layer:&lt;/strong&gt; Creating PDFs with searchable text overlay, crucial for usability.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Making this scalable and responsive requires &lt;strong&gt;moving beyond a simple blocking script&lt;/strong&gt; into an asynchronous, event-driven architecture. Multiprocessing seems like a natural fit at first, but Celery and PaddleOCR take care of workload distribution and performance respectively, as you'll see below. Keep reading.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture (How All the Pieces Fit Together)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Flow:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Client uploads PDF → FastAPI returns task ID immediately&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;FastAPI enqueues task in Redis Broker&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Celery Workers pick up tasks, use PaddleOCR (cached per process)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Workers store searchable PDFs in File Storage&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Redis Backend tracks task status&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Client polls FastAPI → gets status + download link&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This architecture scales by adding more Celery Workers and handles OCR's CPU intensity through async processing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FastAPI ──► Redis Broker ──► Celery Workers ──► PaddleOCR ──► File Storage
    ▲        ▲ Result Backend     ▲ Cached models      ▲
    └────────┘                    └────────────────────┘

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  How Celery Actually Works in this Setup (And Surprises)
&lt;/h3&gt;

&lt;p&gt;Celery orchestrates asynchronous OCR processing - points 3 and 4 in the flow above - and here is where things get very interesting:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Orchestrate:

&lt;ol&gt;
&lt;li&gt;Takes in a PDF/image input file.&lt;/li&gt;
&lt;li&gt;Converts the PDF to a list of images (OCR needs images).&lt;/li&gt;
&lt;li&gt;Decides on the size of the task (files with &amp;gt; 5 pages get delegated to a chord).&lt;/li&gt;
&lt;li&gt;Calls &lt;code&gt;process_single_page&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Finally calls &lt;code&gt;assemble_final_pdf&lt;/code&gt;, which returns the results.&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;li&gt;Process single page:

&lt;ol&gt;
&lt;li&gt;Creates &lt;code&gt;ocr_engine = get_ocr_engine()&lt;/code&gt; and gets the OCR results.&lt;/li&gt;
&lt;li&gt;Creates a text file with the OCR'd text (we need the raw text as well).&lt;/li&gt;
&lt;li&gt;Creates a single-page PDF file with a selectable and searchable text layer.&lt;/li&gt;
&lt;li&gt;Returns the page index and file path.&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;li&gt;Final assembly:

&lt;ol&gt;
&lt;li&gt;Receives the paths of all single-page PDFs.&lt;/li&gt;
&lt;li&gt;Collates/merges them into one resulting PDF.&lt;/li&gt;
&lt;li&gt;Merges all text files into one.&lt;/li&gt;
&lt;li&gt;Cleans up temp PDF/text files.&lt;/li&gt;
&lt;li&gt;Returns the final PDF and text file URLs.&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;/ol&gt;

&lt;h3&gt;
  
  
  Visual Overview of the Celery Pipeline
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;             +------------------+
Upload PDF → |    FastAPI       |
             +------------------+
                       |
                       v
              [Redis Message Broker]
                       |
                       v
           +---------------------------+
           |    Orchestrator Task     |
           |  (orchestrate_pdf_ocr)   |
           +---------------------------+
                       |
     +-----------------+-----------------+
     |                 |                 |
     v                 v                 v
+-----------+   +-----------+    +---------------+
| Page 0    |   | Page 1    |    | Page N        |
| OCR Task  |   | OCR Task  |    | OCR Task      |
+-----------+   +-----------+    +---------------+
     \             |                /
      \            |               /
       +-----------+--------------+
                       |
                       v
        +-----------------------------------+
        |   assemble_final_pdf (Callback)   |
        +-----------------------------------+
                       |
                       v
      Searchable PDF + merged text file saved
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pitfall:&lt;/strong&gt; Some might argue - why not pass the OCR results on to the #3 "Final assembly" step above and do the final assembly of the PDF and text file there? I considered that and found that PaddleOCR results are big nested data structures containing &lt;code&gt;numpy.ndarray&lt;/code&gt; objects, which would need custom recursive serialization before Redis could store them.&lt;/p&gt;

&lt;p&gt;I briefly experimented with passing lightweight structured results (page text + bounding boxes), but even that ballooned in size on longer PDFs. I concluded serialization was a headache, and creating single-page PDFs appealed to me more for several reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tiny Payload Size:&lt;/strong&gt; Instead of serializing huge, complex nested lists of coordinates and text (which stresses Redis/Celery result backend), you just pass a tiny string: "/tmp/page_5_ocr.pdf".&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Solves Serialization:&lt;/strong&gt; The complex OCR data stays in memory, gets written to PDF immediately, and is discarded.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Retries/Checkpoints:&lt;/strong&gt; If the final assembly task fails, you still have the individual page PDFs on disk. You could technically inspect or re-assemble them manually, and retry only the page that failed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Assembly:&lt;/strong&gt; The final &lt;code&gt;assemble_final_pdf&lt;/code&gt; task becomes extremely cheap.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Redis:&lt;/strong&gt; No memory pressure on Redis&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
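&lt;p&gt;A quick, self-contained way to see the payload difference (the nested structure is a rough stand-in for real PaddleOCR output, not its actual schema):&lt;/p&gt;

```python
import json

# Rough stand-in for one page of raw PaddleOCR-style output: many detected
# boxes, each with four corner coordinates plus the text and a confidence.
fake_page_result = [
    {"box": [[x, 0.0], [x + 9.0, 0.0], [x + 9.0, 9.0], [x, 9.0]],
     "text": f"line {i}", "score": 0.99}
    for i, x in enumerate(range(0, 3000, 2))
]

heavy_payload = json.dumps(fake_page_result)       # what we avoided passing
light_payload = json.dumps("/tmp/page_5_ocr.pdf")  # what we pass instead

print(len(heavy_payload), "bytes vs", len(light_payload), "bytes")
```

&lt;p&gt;On a page with many detected text lines, the serialized result can easily reach hundreds of kilobytes, while the file path stays around 20 bytes.&lt;/p&gt;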

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cleanup:&lt;/strong&gt; Requires careful temp file management.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared Volume:&lt;/strong&gt; If you are on a cluster (K8s/multiple VMs), you need a shared volume.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  PaddleOCR: Model Caching and Threading
&lt;/h3&gt;

&lt;p&gt;⚠️ &lt;strong&gt;Lessons learnt:&lt;/strong&gt; PaddleOCR has a known issue with &lt;strong&gt;singleton&lt;/strong&gt; objects - initializing a PaddleOCR engine once and reusing it will almost certainly fail on subsequent OCR requests. The solution is to cache the models on disk and re-initialize the engine for every call - a slight overhead. &lt;em&gt;I lost quite a few hairs scratching my head over this 😉&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Caching models gives a big speedup&lt;/strong&gt;: rather than re-downloading the PaddleOCR models every time an engine is created, we keep them in an on-disk model cache so each process fetches them once and reuses them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model_cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PPDX_HOME&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/app/model-cache&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✨ Using model cache from: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model_cache&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_ocr_engine&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;PaddleOCR&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;PaddleOCR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;text_recognition_model_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model_cache&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/.paddlex/official_models/PP-OCRv5_server_rec&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;text_detection_model_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model_cache&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/.paddlex/official_models/PP-OCRv5_server_det&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;textline_orientation_model_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model_cache&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/.paddlex/official_models/PP-LCNet_x1_0_textline_ori&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;doc_orientation_classify_model_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model_cache&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/.paddlex/official_models/PP-LCNet_x1_0_doc_ori&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;doc_unwarping_model_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model_cache&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/.paddlex/official_models/UVDoc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;use_doc_unwarping&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;use_doc_orientation_classify&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;use_textline_orientation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;PaddleOCR itself is inherently &lt;strong&gt;multi-threaded&lt;/strong&gt; via its native C++ inference engine, relying on optimized libraries like MKL and oneDNN. These libraries internally run on multiple CPU threads and &lt;strong&gt;bypass Python's GIL&lt;/strong&gt;, enabling you to get the most out of the CPU cores available to you via the &lt;code&gt;cpu_threads&lt;/code&gt; option at initialization.&lt;/p&gt;

&lt;h3&gt;
  
  
  Redis: The Glue
&lt;/h3&gt;

&lt;p&gt;Redis acts as both:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;message broker&lt;/strong&gt; queuing tasks.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;result backend&lt;/strong&gt; tracking task status and storing outputs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This decouples FastAPI from OCR workers, enabling scalability and fault tolerance.&lt;/p&gt;

&lt;h3&gt;
  
  
  FastAPI: The Front Door to the OCR Service
&lt;/h3&gt;

&lt;p&gt;FastAPI:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Accepts PDF uploads and immediately returns a task ID.&lt;/li&gt;
&lt;li&gt;Provides endpoints to poll for task status and download results.&lt;/li&gt;
&lt;li&gt;Delegates heavy processing to the event-driven Celery workers.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What Didn’t Work (and Why)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;PaddleOCR singleton failing:&lt;/strong&gt; A known issue with PaddleOCR - it fails on subsequent OCR calls, most likely because it retains state from previous calls and needs a reset. And a reset costs almost as much as re-creating the object.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Serialization of large numpy structures:&lt;/strong&gt; Recursive serialization of nested &lt;code&gt;numpy&lt;/code&gt; data types was an option but seemed like too much of a headache.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Shared filesystem:&lt;/strong&gt; This is at the top of the to-do list, as it is necessary to make the service horizontally scalable.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What I Learned While Building This
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Keep the Celery payload small. I was surprised how easy it was to create multiple files and re-assemble them in a different Celery worker.&lt;/li&gt;
&lt;li&gt;PaddleOCR is good, but it has quirks. Don’t fight the library - work around it.&lt;/li&gt;
&lt;li&gt;Celery chords turned out to be the perfect fit for multi-page PDFs, but it took me a while to get the signatures right.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final notes / Next Steps
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Core Pipeline Improvements
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Retries &amp;amp; Error Handling&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- Per-page retries with backoff
- Custom exceptions for OCR failures
- Fail-fast if &amp;gt;N pages fail
- Cleanup orphan files
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;Task Timeouts&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- Timeout for each OCR task
- Timeout for orchestration/chord
- Deadline propagation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;Progress Reporting&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- Track completed_pages / total_pages
- Publish progress to Redis
- FastAPI poll endpoint or SSE/WebSocket
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;Distributed Pipeline (in progress)

&lt;ul&gt;
&lt;li&gt;Add shared volume or S3/MinIO&lt;/li&gt;
&lt;li&gt;Convert file paths to storage URIs&lt;/li&gt;
&lt;li&gt;Remove reliance on local disk per worker&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;Building this thing reminded me that OCR isn’t just text extraction - it’s a messy mix of CPU bottlenecks, weird library quirks, and architectural decisions that don’t show up in tutorials.&lt;/p&gt;

&lt;p&gt;Turns out, building a ‘simple OCR service’ is anything but simple — but now it’s fast, scalable, and plays nicely with Paperless-ngx.&lt;/p&gt;

</description>
      <category>eventdriven</category>
      <category>ocr</category>
      <category>microservices</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Building a Practical DevSecOps Pipeline: From Basic Security to Enterprise-Style Protection</title>
      <dc:creator>Rajesh Pethe</dc:creator>
      <pubDate>Wed, 24 Sep 2025 12:25:14 +0000</pubDate>
      <link>https://dev.to/eklavvya/building-a-practical-devsecops-pipeline-from-basic-security-to-enterprise-style-protection-2b9k</link>
      <guid>https://dev.to/eklavvya/building-a-practical-devsecops-pipeline-from-basic-security-to-enterprise-style-protection-2b9k</guid>
      <description>&lt;p&gt;Hello folks!&lt;/p&gt;

&lt;p&gt;So you've got your CI/CD pipeline running smoothly, but you're still missing security scanning for your codebase. I recently took a basic security workflow and enhanced it; this is my journey building an enterprise-style pipeline, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scanning for secrets with GitGuardian and TruffleHog&lt;/li&gt;
&lt;li&gt;Reporting vulnerabilities using Bandit, Semgrep and Safety&lt;/li&gt;
&lt;li&gt;Scanning vulnerabilities in Python dependencies using Snyk&lt;/li&gt;
&lt;li&gt;Licensing and Compliance scan using FOSSA&lt;/li&gt;
&lt;li&gt;Checkov IaC Security Scan to find vulnerabilities in Docker, Kubernetes and Terraform specs&lt;/li&gt;
&lt;li&gt;Container security scan using Trivy and Docker Scout&lt;/li&gt;
&lt;li&gt;Dynamic security testing for API endpoints&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let me show you exactly how I did it, and more importantly, &lt;strong&gt;why&lt;/strong&gt; each piece matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Basic but Not Enough
&lt;/h2&gt;

&lt;p&gt;Most of us start with something like this in our GitHub Actions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run Bandit (SAST)&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bandit -r .&lt;/span&gt;

&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Snyk Dependency Scan&lt;/span&gt;
  &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;snyk/actions/python@master&lt;/span&gt;

&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Trivy Container Scan&lt;/span&gt;
  &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aquasecurity/trivy-action@master&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This covers the basics - some static analysis, dependency scanning, and container security. But &lt;strong&gt;it's not enough&lt;/strong&gt; for real-world applications. You're missing secrets detection, infrastructure security, proper quality gates, and a bunch of other stuff that might bite you later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; The Snyk dependency scan can be slow if your codebase has a large dependency tree and you are using a free account.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Enhanced Security Workflow
&lt;/h2&gt;

&lt;p&gt;I've created the "enhanced security workflow" that covers pretty much every security scanning angle I could think of. Let's dive into each section and understand why each piece is crucial.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Security-First Permissions
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;permissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;contents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;read&lt;/span&gt;
  &lt;span class="na"&gt;security-events&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;write&lt;/span&gt;
  &lt;span class="na"&gt;actions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;read&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this matters:&lt;/strong&gt; By default, GitHub Actions gets too many permissions. This is the principle of least privilege in action - only grant what's absolutely necessary. The &lt;code&gt;security-events: write&lt;/code&gt; permission is what lets us upload SARIF reports to GitHub's security tab.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Secret Detection
&lt;/h3&gt;

&lt;p&gt;This is probably the most important addition. You might have seen developers accidentally commit API keys, database passwords, or AWS credentials.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;GitGuardian Security Scan&lt;/span&gt;
  &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;GitGuardian/ggshield/actions/secret@v1.25.0&lt;/span&gt;
  &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;GITHUB_TOKEN&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.GITHUB_TOKEN }}&lt;/span&gt;
    &lt;span class="na"&gt;GITGUARDIAN_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.GITGUARDIAN_API_KEY }}&lt;/span&gt;

&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TruffleHog OSS Secret Scanning&lt;/span&gt;
  &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;trufflesecurity/trufflehog@main&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./services/upload_service&lt;/span&gt;
    &lt;span class="na"&gt;base&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ github.event.repository.default_branch }}&lt;/span&gt;
    &lt;span class="na"&gt;head&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HEAD&lt;/span&gt;
    &lt;span class="na"&gt;extra_args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;--debug --only-verified&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What's happening here:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitGuardian&lt;/strong&gt; knows about 450 types of secrets and produces very few false positives.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TruffleHog&lt;/strong&gt; is a really good open-source alternative.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;--only-verified&lt;/code&gt; flag means it will only alert on secrets it can actually verify (for example, by testing whether an API key actually works).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Tip:&lt;/strong&gt; Run both! GitGuardian might catch something TruffleHog misses and vice versa.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Either or both should be in your pre-commit hook as well to catch secrets at the earliest possible stage.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Enhanced SAST (Static Application Security Testing)
&lt;/h3&gt;

&lt;p&gt;Instead of just running Bandit, we're going to add a couple more tools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Install Security Tools&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;pip install bandit[toml] safety semgrep&lt;/span&gt;

&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run Bandit SAST (Enhanced)&lt;/span&gt;
  &lt;span class="na"&gt;working-directory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./services/upload_service&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;bandit -r . -f json -o bandit-report.json || true&lt;/span&gt;
    &lt;span class="s"&gt;bandit -r . -f txt&lt;/span&gt;
  &lt;span class="na"&gt;continue-on-error&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Semgrep Security Analysis&lt;/span&gt;
  &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;semgrep/semgrep-action@v1&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;-&lt;/span&gt;
      &lt;span class="s"&gt;p/security-audit&lt;/span&gt;
      &lt;span class="s"&gt;p/python&lt;/span&gt;
      &lt;span class="s"&gt;p/owasp-top-ten&lt;/span&gt;
  &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;SEMGREP_APP_TOKEN&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.SEMGREP_APP_TOKEN }}&lt;/span&gt;

&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Safety Check (Python Dependencies)&lt;/span&gt;
  &lt;span class="na"&gt;working-directory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./services/upload_service&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;safety check --json --output safety-report.json || &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Breaking this down:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bandit&lt;/strong&gt; is still our Python-specific security scanner, but now we're saving reports in both JSON (for processing) and text (for human reading). &lt;strong&gt;Note:&lt;/strong&gt; this runs Bandit twice; adjust accordingly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semgrep&lt;/strong&gt; is a new SAST tool - it's got rules for OWASP Top 10, language-specific issues, and general security patterns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Safety&lt;/strong&gt; checks your Python dependencies against known vulnerability databases.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why multiple tools?&lt;/strong&gt; Each tool has its strengths. Bandit knows Python really well, Semgrep has broader coverage, and Safety focuses specifically on dependencies. It's like having multiple security experts scan your code. Some might call this overkill, but it gives us a chance to understand, evaluate, and choose the best combination.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Infrastructure as Code Security (Dockerfiles Matter Too)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Checkov IaC Security Scan&lt;/span&gt;
  &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bridgecrewio/checkov-action@master&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;directory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
    &lt;span class="na"&gt;framework&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dockerfile,kubernetes,terraform&lt;/span&gt;
    &lt;span class="na"&gt;output_format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sarif&lt;/span&gt;
    &lt;span class="na"&gt;output_file_path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;checkov-report.sarif&lt;/span&gt;
    &lt;span class="na"&gt;quiet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;soft_fail&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;This is a very important step!&lt;/strong&gt; Checkov scans your Dockerfiles, Kubernetes manifests, Terraform configs - basically any infrastructure-as-code you've got. It'll catch stuff like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Running containers as root (big no-no)&lt;/li&gt;
&lt;li&gt;Missing health checks in containers&lt;/li&gt;
&lt;li&gt;Missing security contexts in Kubernetes&lt;/li&gt;
&lt;li&gt;Overly permissive IAM policies in Terraform&lt;/li&gt;
&lt;li&gt;Secrets hardcoded in Docker files&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;soft_fail: true&lt;/code&gt; means it won't break your build, but it'll still report issues.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. The Quality Gate
&lt;/h3&gt;

&lt;p&gt;This step decides if the workflow fails or not. Instead of just running scans and hoping someone reads the reports, this implements automated decision-making:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_security_reports&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;critical_issues&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="n"&gt;high_issues&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

    &lt;span class="c1"&gt;# Check Bandit report
&lt;/span&gt;    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;services/upload_service/bandit-report.json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;bandit_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;bandit_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;results&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]):&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;issue_severity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;HIGH&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;high_issues&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
                &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;issue_severity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;MEDIUM&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;critical_issues&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;FileNotFoundError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bandit report not found&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Security gate logic
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;critical_issues&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;❌ SECURITY GATE FAILED: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;critical_issues&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; critical security issues found&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;high_issues&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;⚠️  WARNING: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;high_issues&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; high-severity issues found&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✅ SECURITY GATE PASSED: No critical security issues detected&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;This is where the magic happens!&lt;/strong&gt; The pipeline will actually fail if there are critical security issues. Fix it now or your deployment won't happen.&lt;/p&gt;

&lt;p&gt;You can customize these thresholds based on your risk tolerance. Maybe you allow 0 critical issues in production but 3 in development branches.&lt;/p&gt;
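To make that concrete, here's a minimal sketch of environment-driven thresholds. The variable names `MAX_CRITICAL_ISSUES` and `MAX_HIGH_ISSUES` are my own invention, not part of the workflow above:

```python
import os

def get_security_thresholds():
    """Read gate thresholds from the environment, with strict defaults.

    MAX_CRITICAL_ISSUES / MAX_HIGH_ISSUES are hypothetical variable
    names; adjust them to match your own workflow.
    """
    return {
        "max_critical": int(os.environ.get("MAX_CRITICAL_ISSUES", "0")),
        "max_high": int(os.environ.get("MAX_HIGH_ISSUES", "5")),
    }

def gate_passes(critical_issues, high_issues, thresholds):
    """Return True when the issue counts are within the allowed limits."""
    return (critical_issues <= thresholds["max_critical"]
            and high_issues <= thresholds["max_high"])
```

In GitHub Actions you could then set, say, `MAX_CRITICAL_ISSUES: "3"` in the `env:` block of development-branch jobs while production jobs keep the strict default of zero.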

&lt;h3&gt;
  
  
  6. Container Security
&lt;/h3&gt;

&lt;p&gt;We're not just scanning the final image anymore:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Trivy Filesystem Scan&lt;/span&gt;
  &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aquasecurity/trivy-action@master&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;scan-type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fs'&lt;/span&gt;
    &lt;span class="na"&gt;scan-ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;./services/upload_service'&lt;/span&gt;

&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Trivy Container Image Scan (Critical)&lt;/span&gt;
  &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aquasecurity/trivy-action@master&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image-ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;upload-service:latest'&lt;/span&gt;
    &lt;span class="na"&gt;exit-code&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1'&lt;/span&gt;
    &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CRITICAL,HIGH'&lt;/span&gt;

&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Docker Scout CVE Scan&lt;/span&gt;
  &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker/scout-action@v1&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cves&lt;/span&gt;
    &lt;span class="na"&gt;image-ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;upload-service:latest&lt;/span&gt;
    &lt;span class="na"&gt;only-severities&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical,high&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Three layers of container security:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Filesystem scan&lt;/strong&gt; - checks your source code and files&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Image scan&lt;/strong&gt; - scans the built Docker image for vulnerabilities
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker Scout&lt;/strong&gt; - Docker's own security scanning (different vulnerability database)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The &lt;code&gt;exit-code: '1'&lt;/code&gt; on the image scan means it'll fail the build if critical or high severity issues are found.&lt;/p&gt;
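If you want the same quality-gate treatment for Trivy findings, a hedged sketch that counts severities from a Trivy JSON report looks like this (it assumes the standard `--format json` layout with a top-level `Results` list; check your Trivy version's schema before relying on it):

```python
import json

def count_trivy_severities(report_text):
    """Count vulnerabilities by severity in a Trivy JSON report.

    Assumes the usual layout: a top-level `Results` list whose
    entries carry a `Vulnerabilities` list (which may be null).
    """
    report = json.loads(report_text)
    counts = {}
    for result in report.get("Results", []):
        for vuln in result.get("Vulnerabilities") or []:
            sev = vuln.get("Severity", "UNKNOWN")
            counts[sev] = counts.get(sev, 0) + 1
    return counts

# A tiny made-up report for illustration:
sample = '''{"Results": [{"Vulnerabilities": [
  {"VulnerabilityID": "CVE-2024-0001", "Severity": "CRITICAL"},
  {"VulnerabilityID": "CVE-2024-0002", "Severity": "HIGH"},
  {"VulnerabilityID": "CVE-2024-0003", "Severity": "HIGH"}]}]}'''
print(count_trivy_severities(sample))  # {'CRITICAL': 1, 'HIGH': 2}
```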

&lt;h3&gt;
  
  
  7. Dynamic Security Testing (The Runtime Check)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Start Application for DAST&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;docker-compose up -d&lt;/span&gt;
    &lt;span class="s"&gt;sleep 30&lt;/span&gt;
    &lt;span class="s"&gt;curl -f http://localhost:8000/health || exit 1&lt;/span&gt;

&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;OWASP ZAP API Scan&lt;/span&gt;
  &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;zaproxy/action-api-scan@v0.7.0&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8000'&lt;/span&gt;
    &lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;openapi'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;DAST (Dynamic Application Security Testing)&lt;/strong&gt; is where we actually run the application and poke at it to see if there are vulnerabilities that only show up at runtime. OWASP ZAP is like having a hacker test your API for common web vulnerabilities.&lt;/p&gt;
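To give a flavour of what only runtime testing can see, here's a tiny, hedged sketch of a response-header check; static scans never see HTTP headers, but DAST does. The header list is illustrative, not exhaustive:

```python
# Security headers we expect on API responses (illustrative subset).
EXPECTED_HEADERS = ["X-Content-Type-Options", "Content-Security-Policy"]

def missing_security_headers(headers):
    """Return the expected security headers absent from a response.

    `headers` is any mapping of response header names to values,
    e.g. `resp.headers` after fetching http://localhost:8000/health.
    Comparison is case-insensitive, as HTTP header names are.
    """
    present = {name.lower() for name in headers}
    return [h for h in EXPECTED_HEADERS if h.lower() not in present]
```

ZAP performs far deeper checks than this, of course, but the principle is the same: observe the running application's actual behaviour rather than its source.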

&lt;h2&gt;
  
  
  The Secret Sauce: SARIF Integration
&lt;/h2&gt;

&lt;p&gt;You'll notice we're outputting a lot of reports in SARIF format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Upload Trivy SARIF Reports&lt;/span&gt;
  &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;github/codeql-action/upload-sarif@v3&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;sarif_file&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;trivy-fs-results.sarif&lt;/span&gt;
      &lt;span class="s"&gt;trivy-image-results.sarif&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;SARIF (Static Analysis Results Interchange Format)&lt;/strong&gt; is a standard format that GitHub understands. When you upload SARIF files, all your security findings show up beautifully in GitHub's Security tab. No more digging through build logs!&lt;/p&gt;
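For the curious, the SARIF shape itself is quite small. Here's a hedged sketch of a minimal document (field names follow the SARIF 2.1.0 spec; the tool name and finding below are made up):

```python
import json

def make_sarif(tool_name, findings):
    """Build a minimal SARIF 2.1.0 document.

    `findings` is a list of (rule_id, message, path) tuples.
    """
    return {
        "version": "2.1.0",
        "$schema": "https://json.schemastore.org/sarif-2.1.0.json",
        "runs": [{
            "tool": {"driver": {"name": tool_name}},
            "results": [{
                "ruleId": rule_id,
                "message": {"text": message},
                "locations": [{"physicalLocation": {
                    "artifactLocation": {"uri": path}}}],
            } for rule_id, message, path in findings],
        }],
    }

doc = make_sarif("my-scanner",
                 [("B105", "Possible hardcoded password", "app/main.py")])
print(json.dumps(doc, indent=2)[:60])
```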

&lt;h2&gt;
  
  
  Common Gotchas and How to Avoid Them
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. False Positives
&lt;/h3&gt;

&lt;p&gt;Every security tool produces false positives. Here's how to handle them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start with &lt;code&gt;continue-on-error: true&lt;/code&gt; while you tune your thresholds&lt;/li&gt;
&lt;li&gt;Create suppression files for known false positives
&lt;/li&gt;
&lt;li&gt;Use multiple tools to cross-verify findings&lt;/li&gt;
&lt;/ul&gt;
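For Bandit specifically, inline suppression with `# nosec` plus a test ID is the documented mechanism once you've verified a finding is a false positive. A small illustrative example (the `random` usage here is just a stand-in):

```python
import random

# Bandit flags B311 (pseudo-random generators) on the next line even
# when the value is not security-sensitive; after review, an inline
# marker with the specific test ID silences just this one finding.
session_suffix = random.randint(1000, 9999)  # nosec B311

print(session_suffix)
```

Scoping the suppression to a single test ID (`B311`) rather than a bare `# nosec` keeps the rest of the line's findings visible.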

&lt;h3&gt;
  
  
  2. Secrets Management for Security Tools
&lt;/h3&gt;

&lt;p&gt;You'll need API tokens for most of these tools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SNYK_TOKEN
GITGUARDIAN_API_KEY  
SEMGREP_APP_TOKEN
FOSSA_API_KEY
SAFETY_API_KEY
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Store these as GitHub secrets, obviously. Most tools have free tiers that are perfect for getting started.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Performance Impact
&lt;/h3&gt;

&lt;p&gt;This workflow is comprehensive, but it's also slow (Snyk in particular is sluggish on a free account). Here's how to optimize:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run heavy scans only on main branch pushes and PRs&lt;/li&gt;
&lt;li&gt;Use caching for tool installations&lt;/li&gt;
&lt;li&gt;Run some scans in parallel when possible&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Developer Experience
&lt;/h3&gt;

&lt;p&gt;Nobody likes pipelines that break all the time. Make it developer-friendly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clear error messages in quality gates&lt;/li&gt;
&lt;li&gt;Easy-to-find security reports&lt;/li&gt;
&lt;li&gt;Documentation on how to fix common issues&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;p&gt;This setup covers most core security checks, and here are some steps to take it even further:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Auto-remediation&lt;/strong&gt;: Use Dependabot to automatically fix dependency vulnerabilities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security notifications&lt;/strong&gt;: Integrate with Slack/Teams for security alerts
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security metrics&lt;/strong&gt;: Track MTTR (Mean Time to Remediation) and security fixes&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;This pipeline isn't just about adding lots of tools - it's about enforcing policies. This workflow gives developers immediate feedback on security issues while maintaining development velocity.&lt;/p&gt;

&lt;p&gt;This is &lt;strong&gt;not perfect&lt;/strong&gt;, but significantly better security is totally achievable with the right tooling and processes. You can pick and choose, tune it for your specific needs, and gradually level up your security.&lt;/p&gt;

&lt;p&gt;The goal isn't to make development slower - it's to catch issues early when they're easy to fix, rather than in production when they're expensive and embarrassing.&lt;/p&gt;

&lt;p&gt;Happy coding, speedy deployments! 🚀&lt;/p&gt;




</description>
      <category>devops</category>
      <category>python</category>
      <category>containers</category>
      <category>cicd</category>
    </item>
    <item>
      <title>Enhanced Paperless-NGX with Paddle OCR + LLM Pipeline</title>
      <dc:creator>Rajesh Pethe</dc:creator>
      <pubDate>Sat, 12 Jul 2025 10:31:46 +0000</pubDate>
      <link>https://dev.to/eklavvya/enhanced-paperless-ngx-with-paddle-ocr-llm-pipeline-3mgp</link>
      <guid>https://dev.to/eklavvya/enhanced-paperless-ngx-with-paddle-ocr-llm-pipeline-3mgp</guid>
      <description>&lt;h2&gt;
  
  
  Building a Private AI Document Pipeline with Paperless, PaddleOCR, and LLMs
&lt;/h2&gt;

&lt;p&gt;So this past week I hacked together a little side project to smarten up my &lt;a href="https://github.com/paperless-ngx/paperless-ngx" rel="noopener noreferrer"&gt;Paperless-ngx&lt;/a&gt; setup — you know, that self-hosted document management system that eats PDFs and makes them searchable.&lt;/p&gt;

&lt;p&gt;Now, Paperless-ngx is solid, don’t get me wrong. But it uses Tesseract for OCR, and honestly... Tesseract is not optimal for anything that's not clean text.&lt;/p&gt;

&lt;p&gt;So this evolved out of a need to improve paperless-ngx's OCR capability and to properly classify documents and extract tags, titles and summaries for them.&lt;/p&gt;

&lt;p&gt;This blog is a walkthrough of what I built — what worked, what didn’t, and how it turned into a pretty neat little pipeline with its own microservices. Hope it gives you some ideas.&lt;/p&gt;




&lt;h2&gt;
  
  
  💡 What I Was Going For
&lt;/h2&gt;

&lt;p&gt;I wanted a system that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Uses &lt;strong&gt;PaddleOCR&lt;/strong&gt; instead of Tesseract for better OCR output&lt;/li&gt;
&lt;li&gt;Runs a &lt;strong&gt;local LLM&lt;/strong&gt; using &lt;a href="https://ollama.com" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt; to:

&lt;ul&gt;
&lt;li&gt;Suggest a smart document title&lt;/li&gt;
&lt;li&gt;Classify the document into a type (invoice, id, tax, etc.)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Pushes that back to Paperless so the doc is nicely searchable and tagged&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Everything stays local/private. No exposure to external LLMs. Just Python, containers, and Compose to stitch everything together.&lt;/p&gt;




&lt;h2&gt;
  
  
  🔧 How It All Works (Now)
&lt;/h2&gt;

&lt;p&gt;After a few iterations, I ended up with a clean microservice setup with each part doing its job:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;paperless-ngx&lt;/code&gt;: The main document management system (already amazing)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;ollama&lt;/code&gt;: Runs a local LLM like Mistral (or phi3 or any configurable in env), no cloud stuff&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;ocr-service&lt;/code&gt;: FastAPI service that runs PaddleOCR&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;pipeline&lt;/code&gt;: Python CLI that connects all the dots — downloads doc from Paperless, sends it to OCR and LLM, then updates Paperless with the smart results&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That way, each service does one thing and does it well, and each can be enhanced in isolation. They all run in Docker, talk to each other over the same network, and make one smooth, local AI document workflow.&lt;/p&gt;

&lt;p&gt;This leverages &lt;code&gt;paperless-ngx&lt;/code&gt;'s extensive features and augments it with better OCR and LLM classification capabilities.&lt;/p&gt;




&lt;h2&gt;
  
  
  🚦 The Flow Looks Like This:
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;📁 Paperless stores docs in its DB
⬇️
🤖 I run my pipeline CLI: `docker compose run pipeline 42`
⬇️
📥 pipeline downloads the doc via Paperless API (by ID)
⬇️
📤 Sends it to the OCR microservice over HTTP
⬇️
🧠 Gets back clean OCR’d text
⬇️
🧠 Sends text to Ollama (LLM) to generate:
    - title
    - document type
⬇️
🔁 Updates Paperless document via PATCH
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
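As a rough sketch of the last steps in that flow, here's how the pipeline might turn the LLM's reply into a PATCH body. The expected JSON reply shape is my own convention from the prompt, and mapping a predicted type label to a Paperless `document_type` ID is left to the caller:

```python
import json

def build_patch_payload(llm_output):
    """Turn the LLM's JSON reply into a Paperless-ngx PATCH body.

    Assumes the prompt asked the model to answer with
    {"title": ..., "document_type": ...}; degrades gracefully
    when the reply is not valid JSON.
    """
    try:
        data = json.loads(llm_output)
    except json.JSONDecodeError:
        return {}
    payload = {}
    if data.get("title"):
        payload["title"] = data["title"].strip()
    if data.get("document_type"):
        # Paperless expects a document_type *ID* in the PATCH;
        # resolving the predicted label to an ID happens elsewhere.
        payload["predicted_type"] = data["document_type"].strip()
    return payload

# With `requests`, the final update is then roughly:
#   requests.patch(f"{base}/api/documents/{doc_id}/",
#                  headers={"Authorization": f"Token {token}"},
#                  json=payload)
```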



&lt;h2&gt;
  
  
  🐳 Everything Runs in Docker
&lt;/h2&gt;

&lt;p&gt;Here's the final list of containers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;paperless → standard Paperless-ngx&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;redis → required by Paperless&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;ollama → runs local LLMs like Mistral&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;ocr-service → FastAPI + PaddleOCR&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;pipeline → command-line microservice that ties it all together&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🗺 Architecture Diagram
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                   ┌────────────────────────┐
                   │ 📄 Paperless-ngx (UI)  │
                   └────────────┬───────────┘
                                │
                    [User notes Document ID]
                                │
                   ┌────────────▼────────────┐
                   │   🐍 Pipeline Service    │
                   │ (LLM &amp;amp; Orchestration)   │
                   └────────────┬────────────┘
                                │
           ┌────────────────────┼────────────────────┐
           │                    │                    │
           ▼                    ▼                    ▼
  ┌────────────────┐   ┌────────────────┐   ┌─────────────────────┐
  │ Downloads PDF  │   │ Sends to OCR   │   │ Sends OCR text to   │
  │ via Paperless  │   │ microservice   │   │ Ollama LLM (Mistral)│
  └────────────────┘   └────────────────┘   └─────────────────────┘
                                                │
                     ◀────────────┬─────────────┘
                                  ▼
                       📝 Title + Type Prediction
                                  │
                       🔁 PATCH back to Paperless
                       (update metadata + text)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  😵 What Gave Me Trouble
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Paperless consumes files from consume/ automatically and moves them — I worked around that by operating only on document IDs via the API, since at the time I was more focused on adding the OCR/AI features. Automating that hand-off is high on my TODO list.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;PaddleOCR kept re-downloading models — I made sure the models are cached so they are fetched only once.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  📦 Folder Structure
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;├── docker-compose.yml
├── __init__.py
├── model-cache
├── ocr_service
│   ├── app
│   │   ├── __init__.py
│   │   ├── main.py
│   │   ├── ocr_config.yaml
│   │   └── ocr_engine.py
│   ├── Dockerfile
│   └── requirements.txt
├── paperless-data
│   ├── consume
│   ├── data
│   │   ├── db.sqlite3
│   │   ├── index
│   │   │   ├── _MAIN_63.toc
│   │   │   ├── MAIN_9v88o8vye3gbqub1.seg
│   │   │   ├── MAIN_ui8jrpftrvauh4n1.seg
│   │   │   ├── MAIN_wceagdbh71brm5wg.seg
│   │   │   └── MAIN_WRITELOCK
│   │   ├── log
│   │   │   └── celery.log.1
│   │   └── migration_lock
│   └── media
│       ├── documents
│       │   ├── archive
│       │   │   ├── 0000009.pdf
│       │   │   └── 0000016.pdf
│       │   ├── originals
│       │   │   ├── 0000009.jpg
│       │   │   └── 0000016.pdf
│       │   └── thumbnails
│       │       ├── 0000009.webp
│       │       └── 0000016.webp
│       └── media.lock
├── pipeline_service
│   ├── app
│   │   ├── api_client.py
│   │   ├── __init__.py
│   │   ├── llm_processor.py
│   │   ├── main.py
│   │   └── watcher.py
│   ├── Dockerfile
│   ├── __init__.py
│   ├── logger.py
│   ├── prompts
│   │   └── classify_title.txt
│   ├── requirements.txt
│   └── test.py
└── README.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  ✅ What's Next?
&lt;/h2&gt;

&lt;p&gt;I might:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Run the pipeline automatically when a new doc lands&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Add authentication for OCR and pipeline service (utilizing &lt;code&gt;paperless-ngx&lt;/code&gt;'s token auth?)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Improve performance of OCR service, perhaps using other language (Go, Rust)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Add document summarization via LLM&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Extract metadata like amount, date, sender&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Hook into Paperless tags and correspondents&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🏁 Wrapping Up
&lt;/h2&gt;

&lt;p&gt;If you need a document management system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;That ingests all kinds of docs: PDFs, images, etc.&lt;/li&gt;
&lt;li&gt;That reliably extracts text from varied kinds of docs.&lt;/li&gt;
&lt;li&gt;That classifies, tags and summarizes docs using an LLM.&lt;/li&gt;
&lt;li&gt;That keeps stuff private - no exposure to external LLMs.&lt;/li&gt;
&lt;li&gt;That comes loaded with &lt;code&gt;paperless-ngx&lt;/code&gt;'s features.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then this might be an appealing setup.&lt;/p&gt;

&lt;p&gt;It’s Python all the way down. No rocket science — just containers, OCR, and a private LLM.&lt;/p&gt;

&lt;p&gt;Questions or inputs are welcome.&lt;/p&gt;

</description>
      <category>python</category>
      <category>docker</category>
      <category>llm</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Flags API: Flagging Phishing Emails</title>
      <dc:creator>Rajesh Pethe</dc:creator>
      <pubDate>Sun, 08 Jun 2025 13:15:18 +0000</pubDate>
      <link>https://dev.to/eklavvya/flags-api-flagging-phishing-emails-3ckd</link>
      <guid>https://dev.to/eklavvya/flags-api-flagging-phishing-emails-3ckd</guid>
      <description>&lt;p&gt;This is a submission for the &lt;a href="https://dev.to/challenges/postmark"&gt;Postmark Challenge: Inbox Innovators&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;I built a &lt;strong&gt;developer-focused phishing detection microservice&lt;/strong&gt; that analyzes inbound emails (via Postmark) and scores them for potential phishing indicators. The solution combines &lt;strong&gt;classic heuristics&lt;/strong&gt; (e.g., mismatched links, suspicious reply-to addresses) with &lt;strong&gt;machine learning-based email intent classification&lt;/strong&gt; to provide explainable, interpretable results.&lt;/p&gt;

&lt;p&gt;The service is designed to be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Explainable&lt;/strong&gt; – Every phishing score is backed by specific, human-readable reasons.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extensible&lt;/strong&gt; – Built on FastAPI, with a modular architecture for adding more rules or ML models.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Postmark-ready&lt;/strong&gt; – Accepts Postmark’s inbound webhook payloads out-of-the-box.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;You can run the service locally using Docker or Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uvicorn app.api:app &lt;span class="nt"&gt;--reload&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then test it using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8000/postmark/webhook &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; @tests/sample_postmark_email.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The service will return a phishing verdict like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"verdict"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.75&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"reasons"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"Mismatch between link text and URL destination"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"Suspicious reply-to address"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"Detected urgent or manipulative language"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"intent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"threat"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"intent_confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.92&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No credentials are required for testing. Feel free to use the sample payloads in the repo.&lt;/p&gt;

&lt;h2&gt;
  
  
  Code Repository
&lt;/h2&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/dteklavya/mail-sentinel" rel="noopener noreferrer"&gt;https://github.com/dteklavya/mail-sentinel&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Built It
&lt;/h2&gt;

&lt;p&gt;This project was built using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python + FastAPI for the web API&lt;/li&gt;
&lt;li&gt;Pytest for test coverage&lt;/li&gt;
&lt;li&gt;Hugging Face Transformers to detect manipulative email intent&lt;/li&gt;
&lt;li&gt;Postmark Inbound Webhook to ingest real-world email data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The phishing detector combines rule-based checks like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mismatched anchor text and URLs&lt;/li&gt;
&lt;li&gt;Suspicious Reply-To headers&lt;/li&gt;
&lt;li&gt;Common urgent phrases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;with ML-based intent detection for “fear”, “threats”, and similar phishing tones.&lt;/p&gt;

&lt;p&gt;The design keeps the logic explainable and modular, making it ideal for dev-focused environments where transparency in email filtering is critical.&lt;/p&gt;
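The first heuristic in that list is simple enough to sketch. This assumes links have already been extracted from the email body as (anchor text, href) pairs; the real service's internals may differ:

```python
from urllib.parse import urlparse

def link_mismatch(anchor_text, href):
    """Flag links whose visible text looks like a URL but points
    at a different host than the actual destination."""
    text = anchor_text.strip().lower()
    if not text.startswith(("http://", "https://", "www.")):
        return False  # plain-text anchors ("click here") are not checked here
    if text.startswith("www."):
        text = "http://" + text  # give urlparse a scheme to work with
    shown_host = urlparse(text).hostname or ""
    real_host = urlparse(href.strip().lower()).hostname or ""
    return shown_host != real_host

print(link_mismatch("https://mybank.com", "http://evil.example/login"))  # True
print(link_mismatch("click here", "http://evil.example/login"))          # False
```

Each `True` result contributes to the score and produces a human-readable reason string, which is what keeps the verdict explainable.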

&lt;h2&gt;
  
  
  TODO / Limitations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The current ML model for email intent focuses on emotional tone (e.g., fear, threat) but doesn't fully capture all varieties of phishing tactics (like fake promotions, lotteries, or impersonated legal notices).&lt;/li&gt;
&lt;li&gt;Intent classification can be further refined by fine-tuning on email-specific datasets or integrating custom-trained classifiers for phishing intent.&lt;/li&gt;
&lt;li&gt;UI/visualization layer is not included — future plans include adding a simple dashboard or Postmark-friendly email header injection for visibility.&lt;/li&gt;
&lt;li&gt;Due to the short development window, this is an MVP — several enhancements (e.g., domain reputation checks, attachment analysis) are on the roadmap.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>devchallenge</category>
      <category>postmarkchallenge</category>
      <category>webdev</category>
      <category>api</category>
    </item>
    <item>
      <title>Brilliantly simple: The Linux File System</title>
      <dc:creator>Rajesh Pethe</dc:creator>
      <pubDate>Mon, 02 Jun 2025 12:49:04 +0000</pubDate>
      <link>https://dev.to/eklavvya/brilliantly-simple-the-linux-file-system-2190</link>
      <guid>https://dev.to/eklavvya/brilliantly-simple-the-linux-file-system-2190</guid>
      <description>&lt;p&gt;Note: This is a re-publish from the original post at &lt;a href="https://eklavvya.hashnode.dev/brilliantly-simple-the-linux-file-system" rel="noopener noreferrer"&gt;Hashnode&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The simple, effective design of Unix/Linux has inspired everyone who studies it. There are some gems in the Linux filesystem that have awed me, and here I share a few of them.&lt;/p&gt;

&lt;h3&gt;
  
  
  File Permissions
&lt;/h3&gt;

&lt;p&gt;Every file has nine permission bits, three each for:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Owner of the file - usually the user that created the file.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;User group that owns the file - usually the group the user belongs to.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;All others - rest of the world.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;And the three bits are &lt;code&gt;rwx&lt;/code&gt; for read, write and execute.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-rw-rw-r-- 1 user1 user1 0 Jun 24 10:13 temp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first character &lt;code&gt;-&lt;/code&gt; in the listing above indicates the type of file: &lt;code&gt;-&lt;/code&gt; means a regular file; other values include &lt;code&gt;d&lt;/code&gt; (directory), &lt;code&gt;c&lt;/code&gt; (character device), &lt;code&gt;b&lt;/code&gt; (block device) and so on.&lt;/p&gt;

&lt;p&gt;So the user &lt;code&gt;user1&lt;/code&gt; and its group have read and write permission on the file &lt;code&gt;temp&lt;/code&gt;, and all others have read-only permission. The default permission for a new file is decided by the &lt;code&gt;umask&lt;/code&gt; setting; we'll not go into the details at this point.&lt;/p&gt;

&lt;p&gt;What exactly do these &lt;code&gt;rwx&lt;/code&gt; bits mean?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;r&lt;/code&gt; - permission to read file contents.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;w&lt;/code&gt; - permission to change file contents.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;x&lt;/code&gt; - permission to execute the file as a program.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, the permissions on the &lt;code&gt;cat&lt;/code&gt; program:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nt"&gt;-rwxr-xr-x&lt;/span&gt; 1 root root 35288 Feb  8 09:16 /usr/bin/cat
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;/usr/bin/cat&lt;/code&gt; is owned by the &lt;code&gt;root&lt;/code&gt; user and group, but only the &lt;code&gt;root&lt;/code&gt; user can overwrite it. Users belonging to the &lt;code&gt;root&lt;/code&gt; group, and all &lt;code&gt;others&lt;/code&gt;, can execute the program.&lt;/p&gt;

&lt;p&gt;You may say &lt;code&gt;rwx&lt;/code&gt; means pretty much what it says; quite simply, yes. But the same permission bits take on subtly different meanings for other types of files.&lt;/p&gt;
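&lt;p&gt;If you ever want to decode these strings programmatically, Python's standard &lt;code&gt;stat&lt;/code&gt; module can render a raw mode word as the familiar listing string (a small side illustration; the examples below don't depend on it):&lt;/p&gt;

```python
import stat

# 0o100644 is a regular file (type bits 0o100000) with permission bits 644.
# stat.filemode() renders the mode word as the familiar `ls -l` string.
print(stat.filemode(0o100644))   # -rw-r--r--
print(stat.filemode(0o100755))   # -rwxr-xr-x
print(stat.filemode(0o040755))   # drwxr-xr-x  (0o040000 marks a directory)
```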

&lt;h3&gt;
  
  
  SETUID Bit on Files
&lt;/h3&gt;

&lt;p&gt;Users on Linux sometimes need to perform actions that require superuser permissions, for example changing passwords or switching users. There are specific commands for these; here are the permissions on a few of them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# ls -l /usr/bin/passwd /usr/bin/su /usr/bin/sudo&lt;/span&gt;
&lt;span class="nt"&gt;-rwsr-xr-x&lt;/span&gt; 1 root root 232416 Apr  3  2023 /usr/bin/sudo
&lt;span class="nt"&gt;-rwsr-xr-x&lt;/span&gt; 1 root root  59976 Feb  6 18:24 /usr/bin/passwd
&lt;span class="nt"&gt;-rwsr-xr-x&lt;/span&gt; 1 root root  55680 Apr  9 21:02 /usr/bin/su
&lt;span class="c"&gt;# &lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The owner permissions on these are &lt;code&gt;rws&lt;/code&gt; (the owner being root). The &lt;code&gt;s&lt;/code&gt; is the SETUID bit: whoever runs these programs, they execute &lt;strong&gt;as&lt;/strong&gt; the &lt;code&gt;root&lt;/code&gt; user and hence with superuser privileges.&lt;/p&gt;

&lt;p&gt;It is intriguing to consider: what if one of these special programs had &lt;strong&gt;write permission&lt;/strong&gt; for &lt;strong&gt;others&lt;/strong&gt;? That would allow &lt;strong&gt;any&lt;/strong&gt; user to overwrite the program with malicious code, and whenever that program was executed, it would run with superuser privileges.&lt;/p&gt;

&lt;p&gt;You can find all programs with the SETUID bit set using &lt;code&gt;find /usr/bin -perm /u+s&lt;/code&gt;.&lt;/p&gt;
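&lt;p&gt;The same scan can be sketched in Python with the standard &lt;code&gt;os&lt;/code&gt; and &lt;code&gt;stat&lt;/code&gt; modules (a minimal sketch; &lt;code&gt;setuid_files&lt;/code&gt; is my own illustrative name, not an existing tool):&lt;/p&gt;

```python
import os
import stat

def setuid_files(directory):
    """Mirror `find DIR -perm /u+s`: list files whose SETUID bit is set."""
    hits = []
    for entry in os.scandir(directory):
        if entry.is_file(follow_symlinks=False):
            mode = entry.stat(follow_symlinks=False).st_mode
            # In the rendered mode string, the owner-execute slot shows
            # 's' (or 'S' when execute is off) if SETUID is set.
            if stat.filemode(mode)[3] in "sS":
                hits.append(entry.path)
    return sorted(hits)

print(setuid_files("/usr/bin"))
```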

&lt;h3&gt;
  
  
  Directory Permissions
&lt;/h3&gt;

&lt;p&gt;A directory is just a special case of a file. Take a look at this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;rajesh .../blog &lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nt"&gt;-ld&lt;/span&gt; /usr/bin/
drwxr-xr-x 2 root root 86016 Jun 24 09:12 /usr/bin/
rajesh .../blog &lt;span class="err"&gt;$&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So the &lt;code&gt;/usr/bin&lt;/code&gt; directory is owned by the &lt;code&gt;root&lt;/code&gt; user. Note that the first character has changed to &lt;code&gt;d&lt;/code&gt; since this is a directory. Now let's look at what the remaining nine permission bits mean for a directory.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;r&lt;/code&gt; - the directory is readable: its contents can be listed using &lt;code&gt;ls&lt;/code&gt;, other commands, or system/function calls.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;w&lt;/code&gt; - files in the directory can be created, deleted, and renamed. Note that creating, deleting, or renaming a file DOES NOT depend on the file's permissions but on the directory's.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;x&lt;/code&gt; - execute a directory? No, a directory surely cannot be executed as a program :) Rather, it means you can access a file inside it if you know the complete path name, even without read permission on the directory. The following example should make it clear:&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;```bash
# mkdir /tmp/test
# ls -ld /tmp/test
drwxr-xr-x 2 root root 4096 Jun 24 11:34 /tmp/test
# chmod 751 /tmp/test/
# ls -ld /tmp/test
drwxr-x--x 2 root root 4096 Jun 24 11:34 /tmp/test
# 
# touch /tmp/test/temp
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The set of commands above makes a directory &lt;code&gt;/tmp/test&lt;/code&gt; and changes its permissions to &lt;code&gt;drwxr-x--x&lt;/code&gt;; note that others are given only execute permission, no read permission. Then, as a normal user:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;rajesh .../blog &lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nt"&gt;-ltr&lt;/span&gt; /tmp/test/
&lt;span class="nb"&gt;ls&lt;/span&gt;: cannot open directory &lt;span class="s1"&gt;'/tmp/test/'&lt;/span&gt;: Permission denied
rajesh .../blog &lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nt"&gt;-ltr&lt;/span&gt; /tmp/test/temp
&lt;span class="nt"&gt;-rw-r--r--&lt;/span&gt; 1 root root 0 Jun 24 11:35 /tmp/test/temp
rajesh .../blog &lt;span class="err"&gt;$&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On close inspection, you'll see that listing the directory contents is denied, but you can still list the file because you used its complete path name. That is the effect of the &lt;code&gt;x&lt;/code&gt; permission on the directory for &lt;code&gt;others&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sticky Bit Directory Permissions
&lt;/h2&gt;

&lt;p&gt;There is a special directory that all users need in order to store their temporary data in files and sub-directories, and it has special permissions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# ls -ld /tmp/&lt;/span&gt;
drwxrwxrwt 27 root root 20480 Jun 24 14:16 /tmp/
&lt;span class="c"&gt;# &lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Look at the last character in the permissions, &lt;code&gt;t&lt;/code&gt;: it is known as the sticky bit. Also notice that all users have all permissions on this directory, which would normally mean everyone can create, delete, and rename any file in it.&lt;/p&gt;

&lt;p&gt;But here is where the sticky bit comes into play. When the sticky bit is set on a directory, any user can delete or rename &lt;strong&gt;only&lt;/strong&gt; the files they own. Try deleting a file belonging to another user and you'll get a &lt;code&gt;permission denied&lt;/code&gt; error.&lt;/p&gt;
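&lt;p&gt;You can see the &lt;code&gt;t&lt;/code&gt; bit appear yourself by setting mode &lt;code&gt;1777&lt;/code&gt; on a scratch directory of your own; a quick Python sketch:&lt;/p&gt;

```python
import os
import stat
import tempfile

# The sticky bit is the 0o1000 bit in the mode word; combined with
# full rwx for everyone (0o777), the rendered mode matches /tmp's.
d = tempfile.mkdtemp()
os.chmod(d, 0o1777)
print(stat.filemode(os.stat(d).st_mode))   # drwxrwxrwt
```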

&lt;h2&gt;
  
  
  What's next?
&lt;/h2&gt;

&lt;p&gt;That summarises the traditional file permissions on Linux. There are other ways to handle more granular permissions.&lt;/p&gt;

&lt;h2&gt;
  
  
  File Attributes
&lt;/h2&gt;

&lt;p&gt;Can we have a file that cannot be modified, deleted, or renamed, even by &lt;code&gt;root&lt;/code&gt;? Yes, that's where &lt;code&gt;chattr&lt;/code&gt; comes in; it is part of the &lt;code&gt;e2fsprogs&lt;/code&gt; package on Linux.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;chattr&lt;/code&gt; command in Linux is used to &lt;strong&gt;change file attributes&lt;/strong&gt; on a Linux file system. These attributes go beyond traditional file permissions (read/write/execute) and offer more granular control over file behaviour.&lt;/p&gt;

&lt;h2&gt;
  
  
  Access Control List
&lt;/h2&gt;

&lt;p&gt;Can we give &lt;strong&gt;two different users&lt;/strong&gt; &lt;strong&gt;read&lt;/strong&gt; access to a file while no one else can access it? Or give a &lt;strong&gt;specific user&lt;/strong&gt; write access without changing group ownership? &lt;strong&gt;Access Control Lists (ACLs)&lt;/strong&gt; address these scenarios, with the &lt;code&gt;getfacl&lt;/code&gt; and &lt;code&gt;setfacl&lt;/code&gt; command-line tools providing granular file permissions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;That wraps up a few things that appealed to me as brilliantly simple ways to implement a flexible and robust filesystem.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
