DEV Community

Michael Garcia

Posted on

FastAPI + OCR Pipeline: Should You Use BackgroundTasks or Celery? A Complete Guide


You've just deployed your document processing system, and everything works great—until your first user uploads a 50-page scanned document at 2 PM on a Tuesday. Their browser hangs. Your FastAPI server is locked up. And you're frantically googling whether you made a terrible architectural decision.

Welcome to the real-world challenge of building async document processing systems. I've been there, and I'm going to walk you through exactly how to avoid this situation—and when to know it's time to level up your infrastructure.

The Root Cause: Why This Matters

Here's the thing about OCR processing: it's computationally expensive. Whether you're using Tesseract, EasyOCR, or a more sophisticated solution, you're looking at CPU-intensive operations that can take anywhere from 2-30 seconds per document. In a traditional synchronous architecture, this would completely block your server, making it unable to handle other requests.

FastAPI gives us concurrency through async/await, but there's a critical distinction many developers miss: async doesn't mean parallel execution for CPU-bound tasks. OCR is CPU-bound, not I/O-bound. When your Python interpreter is crunching through image processing algorithms, async context switching won't help—your server thread is still occupied.
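To keep the event loop responsive despite CPU-bound work, the standard workaround is to push the heavy call into a process pool. A minimal stdlib sketch, where `fake_ocr` is a hypothetical stand-in for a real OCR call:

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor

def fake_ocr(page_count: int) -> str:
    """Hypothetical stand-in for a CPU-bound OCR call."""
    checksum = sum(i * i for i in range(100_000)) % 97  # burn CPU like real OCR would
    return f"extracted {page_count} pages (checksum {checksum})"

async def handle_upload(page_count: int) -> str:
    loop = asyncio.get_running_loop()
    # A separate process sidesteps the GIL, so the event loop keeps serving requests
    with ProcessPoolExecutor(max_workers=2) as pool:
        return await loop.run_in_executor(pool, fake_ocr, page_count)

if __name__ == "__main__":
    print(asyncio.run(handle_upload(50)))
```

This keeps the server responsive, but it shares the core limitation of BackgroundTasks: the work dies with the process.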

This is where the decision between BackgroundTasks and Celery becomes pivotal.

Understanding FastAPI BackgroundTasks

BackgroundTasks is FastAPI's built-in solution for fire-and-forget operations. It's elegant, requires zero external dependencies, and works like this:

from fastapi import FastAPI, UploadFile, File, BackgroundTasks
from fastapi.responses import JSONResponse
import uuid
import time
from pathlib import Path

app = FastAPI()

# Simple in-memory store for demo (use database in production)
job_status = {}

def perform_ocr(file_path: str, job_id: str):
    """Simulates OCR processing"""
    try:
        job_status[job_id] = {"status": "processing", "progress": 0}

        # Simulate OCR work
        time.sleep(5)  # Replace with actual OCR

        job_status[job_id] = {
            "status": "completed",
            "progress": 100,
            "result": "Extracted text from document"
        }

        # Cleanup
        Path(file_path).unlink(missing_ok=True)

    except Exception as e:
        job_status[job_id] = {
            "status": "failed",
            "error": str(e)
        }

@app.post("/upload/")
async def upload_document(background_tasks: BackgroundTasks,
                          file: UploadFile = File(...)):
    # Save uploaded file
    job_id = str(uuid.uuid4())
    file_path = f"/tmp/{job_id}_{file.filename}"

    with open(file_path, "wb") as f:
        contents = await file.read()
        f.write(contents)

    # Initialize job status
    job_status[job_id] = {"status": "pending"}

    # Queue background task
    background_tasks.add_task(perform_ocr, file_path, job_id)

    return {
        "job_id": job_id,
        "status": "queued",
        "message": "Document queued for processing"
    }

@app.get("/status/{job_id}")
async def get_status(job_id: str):
    if job_id not in job_status:
        return JSONResponse(status_code=404, content={"error": "Job not found"})

    return job_status[job_id]

Strengths:

  • Zero infrastructure overhead
  • Easy to implement and test
  • Great for prototypes and MVPs
  • No external dependency management

Critical Limitations:

  • Tasks run in the same process as your API server
  • No persistence—if the server restarts, tasks are lost
  • No retry logic
  • No task prioritization
  • Limited scalability
  • No distributed processing

For a prototype with light traffic, this works beautifully. But here's the catch: BackgroundTasks are tied to the lifespan of your application. If you restart the server while processing 10 documents, those jobs simply disappear.

When Celery + Redis Becomes Essential

Let me be direct: Celery is overkill for a research prototype, but it's necessary the moment you care about reliability.

Celery with Redis provides:

  • Persistent task queues (tasks survive restarts)
  • Automatic retries with exponential backoff
  • Task prioritization and routing
  • Distributed workers (scale horizontally)
  • Progress tracking with hooks
  • Dead letter queues for failed tasks

Here's a production-ready Celery setup:

from fastapi import FastAPI, UploadFile, File, HTTPException
from celery import Celery
import uuid
from pathlib import Path
import pytesseract
from PIL import Image
import redis

# Celery configuration
celery_app = Celery(
    "ocr_pipeline",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/1"
)

celery_app.conf.update(
    task_serializer='json',
    accept_content=['json'],
    result_serializer='json',
    timezone='UTC',
    enable_utc=True,
    task_track_started=True,
    task_acks_late=True,
    worker_prefetch_multiplier=1,  # Prefetch one message per worker (good for long tasks)
)

app = FastAPI()
redis_client = redis.Redis(host='localhost', port=6379, db=2)

# Define Celery task
@celery_app.task(bind=True, max_retries=3)
def process_document(self, file_path: str, job_id: str):
    """
    Process document with OCR
    bind=True allows access to self for retries
    max_retries=3 enables automatic retry on failure
    """
    try:
        # Update status
        redis_client.hset(f"job:{job_id}", 
                         mapping={"status": "processing", "progress": "0"})

        # Open image and perform OCR
        image = Image.open(file_path)
        extracted_text = pytesseract.image_to_string(image)

        # Store result
        redis_client.hset(f"job:{job_id}", 
                         mapping={
                             "status": "completed",
                             "progress": "100",
                             "result": extracted_text[:5000]  # Store first 5000 chars
                         })

        # Cleanup
        Path(file_path).unlink(missing_ok=True)

        return {"status": "success", "job_id": job_id}

    except Exception as exc:
        # Automatic retry with exponential backoff
        raise self.retry(exc=exc, countdown=2 ** self.request.retries)

@app.post("/upload/")
async def upload_document(file: UploadFile = File(...)):
    """Accept file upload and queue for processing"""
    job_id = str(uuid.uuid4())
    file_path = f"/tmp/{job_id}_{file.filename}"

    # Save file
    contents = await file.read()
    with open(file_path, "wb") as f:
        f.write(contents)

    # Initialize job in Redis (a production app would also record a created_at timestamp)
    redis_client.hset(f"job:{job_id}", mapping={"status": "pending"})

    # Queue task with priority
    task = process_document.apply_async(
        args=[file_path, job_id],
        priority=5,  # 0-9 with the Redis transport; ordering semantics depend on the broker
        expires=3600  # Task expires after 1 hour
    )

    return {
        "job_id": job_id,
        "task_id": task.id,
        "status": "queued"
    }

@app.get("/status/{job_id}")
async def get_status(job_id: str):
    """Get current job status"""
    job_data = redis_client.hgetall(f"job:{job_id}")

    if not job_data:
        raise HTTPException(status_code=404, detail="Job not found")

    return {k.decode(): v.decode() for k, v in job_data.items()}

@app.post("/cancel/{job_id}")
async def cancel_job(job_id: str):
    """Cancel a running job"""
    job_data = redis_client.hgetall(f"job:{job_id}")
    if not job_data:
        raise HTTPException(status_code=404, detail="Job not found")

    # Mark as cancelled; the worker should check this flag between processing steps
    redis_client.hset(f"job:{job_id}", "status", "cancelled")

    return {"status": "Job cancelled"}
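Running this setup takes three processes: Redis, a Celery worker, and the FastAPI app. A sketch, assuming the code above lives in a file named `main.py`:

```shell
# Start Redis (or point the broker URLs at an existing instance)
docker run -d -p 6379:6379 redis:7

# Start a Celery worker against the celery_app instance in main.py
celery -A main.celery_app worker --loglevel=info --concurrency=4

# Start the FastAPI app
uvicorn main:app --host 0.0.0.0 --port 8000
```

The worker and the web app are now independent: restart either one and queued jobs survive in Redis.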

The Decision Matrix

Criterion                 BackgroundTasks            Celery + Redis
Prototype/MVP             ✅ Excellent               ⚠️ Overkill
Task persistence          ❌ Lost on restart         ✅ Full persistence
Retries                   ❌ Manual implementation   ✅ Built-in
Scaling                   ❌ Single process          ✅ Horizontal workers
Monitoring                ❌ Limited                 ✅ Comprehensive
Setup time                5 minutes                  30-45 minutes
Operational complexity    Low                        Medium

Recommended OCR Stack

For your use case, I'd recommend:

  1. EasyOCR for general documents (good balance of accuracy and speed)
  2. Tesseract as a lightweight alternative (faster, slightly lower accuracy)
  3. PaddleOCR for non-English documents
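That selection logic can be wired into a small dispatcher. A sketch — the language-to-engine mapping below is an illustrative assumption, not a benchmark result:

```python
# Illustrative mapping only; tune it against your own documents
ENGINE_BY_LANG = {
    "en": "easyocr",
    "de": "easyocr",
    "zh": "paddleocr",
    "ja": "paddleocr",
}

def pick_engine(lang: str, lightweight: bool = False) -> str:
    """Choose an OCR engine per the recommendations above."""
    if lightweight:
        return "tesseract"  # faster, slightly lower accuracy
    # Default unmapped (mostly non-English) languages to PaddleOCR
    return ENGINE_BY_LANG.get(lang, "paddleocr")

print(pick_engine("en"), pick_engine("zh"), pick_engine("en", lightweight=True))
# → easyocr paddleocr tesseract
```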

Implement a human-in-the-loop correction pipeline:

@celery_app.task(bind=True)
def process_with_confidence_feedback(self, file_path: str, job_id: str):
    """
    Process with confidence scores for human review
    """
    import easyocr
    import json

    reader = easyocr.Reader(['en'], gpu=False)

    # readtext accepts a file path directly; no need to open the image first
    results = reader.readtext(file_path)

    extracted_data = {
        "high_confidence": [],
        "needs_review": [],
        "raw_results": []
    }

    for (bbox, text, confidence) in results:
        item = {"text": text, "confidence": float(confidence)}

        if confidence > 0.85:
            extracted_data["high_confidence"].append(item)
        else:
            extracted_data["needs_review"].append(item)

        extracted_data["raw_results"].append(item)

    redis_client.hset(f"job:{job_id}",
                     "result",
                     json.dumps(extracted_data))

    # Cleanup the temporary upload
    Path(file_path).unlink(missing_ok=True)

    return extracted_data

Common Pitfalls and Edge Cases

1. Memory Leaks in BackgroundTasks
BackgroundTasks keeps references to task arguments (including any large payloads you pass) alive until the task completes. Under sustained file processing, memory usage climbs. Solution: pass file paths instead of file contents, or move to Celery.

2. Lost Jobs on Server Restart
BackgroundTasks tasks are memory-only. Always use a persistent queue for production. Even a lightweight solution like RQ (Redis Queue) is better than nothing.

3. Timeout Issues
OCR can take unpredictably long for complex documents. Set realistic timeouts and implement graceful degradation.
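One dependency-free way to bound the wait in the BackgroundTasks setup is to run the OCR call through an executor with a deadline. A sketch, with `slow_ocr` as a hypothetical stand-in — note the pool still finishes the abandoned call in the background, since a running thread cannot be force-killed:

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def slow_ocr(path: str) -> str:
    """Hypothetical stand-in for an unpredictably slow OCR run."""
    time.sleep(1)
    return "full text"

def ocr_with_timeout(path: str, timeout: float) -> dict:
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(slow_ocr, path)
        try:
            return {"status": "completed", "text": future.result(timeout=timeout)}
        except FutureTimeout:
            # Graceful degradation: report the timeout instead of hanging the caller
            return {"status": "timed_out", "text": None}

print(ocr_with_timeout("doc.png", timeout=0.1)["status"])  # → timed_out
```

With Celery, prefer its built-in `time_limit`/`soft_time_limit` task options instead of rolling your own.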

4. File System Cleanup
Temporary files can accumulate if processing fails. Always use try/finally or context managers.
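A minimal pattern: wrap the processing in `try/finally` so the temp file is removed even when the OCR step raises (the `extracted …` string stands in for a real OCR result):

```python
import os
import tempfile
from pathlib import Path

def process_upload(data: bytes) -> str:
    fd, tmp_name = tempfile.mkstemp(suffix=".png")
    os.close(fd)
    path = Path(tmp_name)
    try:
        path.write_bytes(data)
        return f"extracted {len(data)} bytes"  # stands in for the actual OCR call
    finally:
        path.unlink(missing_ok=True)  # runs even if the OCR step raises

print(process_upload(b"abc"))  # → extracted 3 bytes
```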

5. Concurrent File Access
Multiple workers accessing the same file simultaneously causes corruption. Use unique file names and proper locking.

My Recommendation for Your Research Project

Start with BackgroundTasks, but architect for the switch: keep the OCR logic in a plain function with no FastAPI-specific code, so you can later register it as a Celery task without rewriting it. The moment you need persistence, retries, or horizontal scaling, migration is largely a matter of swapping `background_tasks.add_task(...)` for a Celery `.delay(...)` call.



