FastAPI + OCR Pipeline: Should You Use BackgroundTasks or Celery? A Complete Guide
You've just deployed your document processing system, and everything works great—until your first user uploads a 50-page scanned document at 2 PM on a Tuesday. Their browser hangs. Your FastAPI server is locked up. And you're frantically googling whether you made a terrible architectural decision.
Welcome to the real-world challenge of building async document processing systems. I've been there, and I'm going to walk you through exactly how to avoid this situation—and when to know it's time to level up your infrastructure.
The Root Cause: Why This Matters
Here's the thing about OCR processing: it's computationally expensive. Whether you're using Tesseract, EasyOCR, or a more sophisticated solution, you're looking at CPU-intensive operations that can take anywhere from 2-30 seconds per document. In a traditional synchronous architecture, this would completely block your server, making it unable to handle other requests.
FastAPI gives us concurrency through async/await, but there's a critical distinction many developers miss: async doesn't mean parallel execution for CPU-bound tasks. OCR is CPU-bound, not I/O-bound. When your Python interpreter is crunching through image processing algorithms, async context switching won't help—your server thread is still occupied.
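To make that concrete, here's a minimal, self-contained sketch (plain asyncio, no FastAPI) contrasting a handler that blocks the event loop with one that offloads the work. The `cpu_bound_work` function is just a stand-in for OCR:

```python
import asyncio

def cpu_bound_work(n: int) -> int:
    """Stand-in for OCR: a pure-Python loop that holds the GIL."""
    return sum(i * i for i in range(n))

async def handler_blocking(n: int) -> int:
    # Even inside `async def`, this call monopolizes the event loop:
    # no other coroutine can run until it returns.
    return cpu_bound_work(n)

async def handler_offloaded(n: int) -> int:
    # asyncio.to_thread (Python 3.9+) keeps the event loop responsive,
    # but the GIL still serializes pure-Python CPU work; real parallelism
    # needs separate processes — which is exactly what Celery workers are.
    return await asyncio.to_thread(cpu_bound_work, n)

result = asyncio.run(handler_offloaded(100_000))
```

Offloading to a thread unblocks the loop for its I/O-bound neighbors, but to use more than one CPU core you still need separate processes.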
This is where the decision between BackgroundTasks and Celery becomes pivotal.
Understanding FastAPI BackgroundTasks
BackgroundTasks is FastAPI's built-in solution for fire-and-forget operations. It's elegant, requires zero external dependencies, and works like this:
```python
from fastapi import FastAPI, UploadFile, File, BackgroundTasks, HTTPException
import uuid
import time
from pathlib import Path

app = FastAPI()

# Simple in-memory store for demo (use a database in production)
job_status = {}


def perform_ocr(file_path: str, job_id: str):
    """Simulates OCR processing."""
    try:
        job_status[job_id] = {"status": "processing", "progress": 0}
        # Simulate OCR work
        time.sleep(5)  # Replace with actual OCR
        job_status[job_id] = {
            "status": "completed",
            "progress": 100,
            "result": "Extracted text from document",
        }
    except Exception as e:
        job_status[job_id] = {"status": "failed", "error": str(e)}
    finally:
        # Cleanup runs whether the task succeeded or failed
        Path(file_path).unlink(missing_ok=True)


@app.post("/upload/")
async def upload_document(background_tasks: BackgroundTasks,
                          file: UploadFile = File(...)):
    # Save the uploaded file
    job_id = str(uuid.uuid4())
    file_path = f"/tmp/{job_id}_{file.filename}"
    contents = await file.read()
    with open(file_path, "wb") as f:
        f.write(contents)

    # Initialize job status
    job_status[job_id] = {"status": "pending"}

    # Queue the background task; it runs after the response is sent
    background_tasks.add_task(perform_ocr, file_path, job_id)

    return {
        "job_id": job_id,
        "status": "queued",
        "message": "Document queued for processing",
    }


@app.get("/status/{job_id}")
async def get_status(job_id: str):
    if job_id not in job_status:
        # Returning a (dict, 404) tuple doesn't set the status code
        # in FastAPI; raise HTTPException instead
        raise HTTPException(status_code=404, detail="Job not found")
    return job_status[job_id]
```
Strengths:
- Zero infrastructure overhead
- Easy to implement and test
- Great for prototypes and MVPs
- No external dependency management
Critical Limitations:
- Tasks run on the same process
- No persistence—if the server restarts, tasks are lost
- No retry logic
- No task prioritization
- Limited scalability
- No distributed processing
For a prototype with light traffic, this works beautifully. But here's the catch: BackgroundTasks are tied to the lifespan of your application. If you restart the server while processing 10 documents, those jobs simply disappear.
When Celery + Redis Becomes Essential
Let me be direct: Celery is overkill for a research prototype, but it's necessary the moment you care about reliability.
Celery with Redis provides:
- Persistent task queues (tasks survive restarts)
- Automatic retries with exponential backoff
- Task prioritization and routing
- Distributed workers (scale horizontally)
- Progress tracking with hooks
- Dead letter queues for failed tasks
Here's a production-oriented Celery setup:
```python
from fastapi import FastAPI, UploadFile, File, HTTPException
from celery import Celery
from datetime import datetime, timezone
import uuid
from pathlib import Path
import pytesseract
from PIL import Image
import redis

# Celery configuration
celery_app = Celery(
    "ocr_pipeline",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/1",
)

celery_app.conf.update(
    task_serializer="json",
    accept_content=["json"],
    result_serializer="json",
    timezone="UTC",
    enable_utc=True,
    task_track_started=True,
    task_acks_late=True,           # Re-deliver tasks if a worker dies mid-task
    worker_prefetch_multiplier=1,  # Each worker reserves one task at a time
)

app = FastAPI()
redis_client = redis.Redis(host="localhost", port=6379, db=2)


@celery_app.task(bind=True, max_retries=3)
def process_document(self, file_path: str, job_id: str):
    """
    Process a document with OCR.

    bind=True gives access to self for retries;
    max_retries=3 caps the automatic retries on failure.
    """
    try:
        # Update status
        redis_client.hset(f"job:{job_id}",
                          mapping={"status": "processing", "progress": "0"})

        # Open the image and perform OCR
        image = Image.open(file_path)
        extracted_text = pytesseract.image_to_string(image)

        # Store the result
        redis_client.hset(f"job:{job_id}",
                          mapping={
                              "status": "completed",
                              "progress": "100",
                              "result": extracted_text[:5000],  # First 5000 chars
                          })

        # Cleanup
        Path(file_path).unlink(missing_ok=True)
        return {"status": "success", "job_id": job_id}
    except Exception as exc:
        # Automatic retry with exponential backoff (1s, 2s, 4s);
        # raises MaxRetriesExceededError once retries are exhausted
        raise self.retry(exc=exc, countdown=2 ** self.request.retries)


@app.post("/upload/")
async def upload_document(file: UploadFile = File(...)):
    """Accept a file upload and queue it for processing."""
    job_id = str(uuid.uuid4())
    file_path = f"/tmp/{job_id}_{file.filename}"

    # Save the file
    contents = await file.read()
    with open(file_path, "wb") as f:
        f.write(contents)

    # Initialize the job in Redis
    redis_client.hset(f"job:{job_id}",
                      mapping={
                          "status": "pending",
                          "created_at": datetime.now(timezone.utc).isoformat(),
                      })

    # Queue the task with a priority
    task = process_document.apply_async(
        args=[file_path, job_id],
        priority=5,    # 0-9; requires a broker configured for priorities
        expires=3600,  # Task expires after 1 hour if not started
    )

    return {
        "job_id": job_id,
        "task_id": task.id,
        "status": "queued",
    }


@app.get("/status/{job_id}")
async def get_status(job_id: str):
    """Get the current job status."""
    job_data = redis_client.hgetall(f"job:{job_id}")
    if not job_data:
        raise HTTPException(status_code=404, detail="Job not found")
    return {k.decode(): v.decode() for k, v in job_data.items()}


@app.post("/cancel/{job_id}")
async def cancel_job(job_id: str):
    """Mark a job cancelled. Note: this only flags the job record; stopping
    a task that is already running requires its task_id and
    celery_app.control.revoke()."""
    job_data = redis_client.hgetall(f"job:{job_id}")
    if not job_data:
        raise HTTPException(status_code=404, detail="Job not found")
    redis_client.hset(f"job:{job_id}", "status", "cancelled")
    return {"status": "Job cancelled"}
```
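Running this stack locally takes three processes: Redis, at least one Celery worker, and the API itself. Assuming the code above lives in a module named `main` (the module name and Docker image tag here are illustrative):

```shell
# Start Redis (here via Docker; a local install works too)
docker run -d -p 6379:6379 redis:7

# Start a Celery worker; -A points at the module defining celery_app
celery -A main.celery_app worker --loglevel=info --concurrency=2

# Start the FastAPI app
uvicorn main:app --reload
```

The `--concurrency` flag controls how many OCR tasks run in parallel per worker host; for CPU-bound OCR, keep it at or below the core count.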
The Decision Matrix
| Criterion | BackgroundTasks | Celery + Redis |
|---|---|---|
| Prototype/MVP | ✅ Excellent | ⚠️ Overkill |
| Task Persistence | ❌ Lost on restart | ✅ Full persistence |
| Retries | ❌ Manual implementation | ✅ Built-in |
| Scaling | ❌ Single process | ✅ Horizontal (add workers) |
| Monitoring | ❌ Limited | ✅ Comprehensive |
| Setup Time | 5 minutes | 30-45 minutes |
| Operational Complexity | Low | Medium |
Recommended OCR Stack
For your use case, I'd recommend:
- EasyOCR for general documents (good balance of accuracy and speed)
- Tesseract as a lightweight alternative (faster, slightly lower accuracy)
- PaddleOCR for non-English documents
Implement a human-in-the-loop correction pipeline:
```python
import json

import easyocr

# celery_app and redis_client come from the setup above


@celery_app.task(bind=True)
def process_with_confidence_feedback(self, file_path: str, job_id: str):
    """Process with confidence scores so low-confidence text gets human review."""
    reader = easyocr.Reader(["en"], gpu=False)
    # readtext accepts a file path (or a numpy array / raw bytes)
    results = reader.readtext(file_path)

    extracted_data = {
        "high_confidence": [],
        "needs_review": [],
        "raw_results": [],
    }

    for (bbox, text, confidence) in results:
        item = {"text": text, "confidence": confidence}
        if confidence > 0.85:
            extracted_data["high_confidence"].append(item)
        else:
            extracted_data["needs_review"].append(item)
        extracted_data["raw_results"].append(item)

    redis_client.hset(f"job:{job_id}",
                      "result",
                      json.dumps(extracted_data))
    return extracted_data
```
Common Pitfalls and Edge Cases
1. Memory Leaks in BackgroundTasks
BackgroundTasks run inside your web server's process, so large objects referenced by queued tasks (file bytes, decoded images) stay in memory until each task finishes. Under sustained load this accumulates. Solution: move heavy work to Celery workers, or keep task arguments small (pass file paths, not file contents) and clean up explicitly.
2. Lost Jobs on Server Restart
Queued BackgroundTasks live only in memory. Always use a persistent queue in production. Even a lightweight option like RQ (Redis Queue) is better than nothing.
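To see why persistence matters, here's a deliberately minimal sketch of the principle (a file-backed job list, not a replacement for RQ or Celery): jobs written to durable storage can be re-enqueued after a restart, while an in-memory dict cannot. The `pending_jobs.json` path and helper names are illustrative:

```python
import json
from pathlib import Path

QUEUE_FILE = Path("pending_jobs.json")  # hypothetical location

def enqueue_job(job_id: str, file_path: str) -> None:
    """Record a job durably before starting work on it."""
    jobs = json.loads(QUEUE_FILE.read_text()) if QUEUE_FILE.exists() else []
    jobs.append({"job_id": job_id, "file_path": file_path})
    QUEUE_FILE.write_text(json.dumps(jobs))

def recover_jobs() -> list:
    """On startup, reload jobs that were queued before a crash or restart."""
    if not QUEUE_FILE.exists():
        return []
    return json.loads(QUEUE_FILE.read_text())

enqueue_job("job-1", "/tmp/job-1_scan.png")
pending = recover_jobs()
```

Real queue systems add atomic writes, locking, and acknowledgement on top of this idea; that's exactly the machinery you get for free from RQ or Celery.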
3. Timeout Issues
OCR can take unpredictably long for complex documents. Set realistic timeouts and implement graceful degradation.
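With Celery, the usual pattern is a soft time limit you catch for graceful degradation, backed by a hard limit that kills truly stuck tasks. A sketch, assuming a setup like the one above (this is a config fragment, not tested end-to-end; note that soft limits require the default prefork pool on Unix):

```python
from celery import Celery
from celery.exceptions import SoftTimeLimitExceeded

celery_app = Celery("ocr_pipeline", broker="redis://localhost:6379/0")

@celery_app.task(bind=True, soft_time_limit=60, time_limit=90)
def ocr_with_limits(self, file_path: str, job_id: str):
    try:
        ...  # OCR work goes here
    except SoftTimeLimitExceeded:
        # Raised inside the task at 60s: save any partial results, mark the
        # job "timed_out", clean up temp files. If the task is still alive
        # at 90s, the worker terminates it outright.
        ...
```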
4. File System Cleanup
Temporary files can accumulate if processing fails. Always use try/finally or context managers.
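The standard fix is a try/finally so the temp file disappears even when OCR raises. A minimal stdlib sketch (the simulated "extracted text" stands in for real OCR output):

```python
from pathlib import Path

def process_with_cleanup(file_path: str) -> str:
    """Run OCR-like work, removing the temp file no matter what."""
    path = Path(file_path)
    try:
        # Real OCR would happen here; simulate a result
        return "extracted text"
    finally:
        path.unlink(missing_ok=True)  # missing_ok needs Python 3.8+

demo = Path("demo_upload.png")
demo.write_bytes(b"fake image bytes")
text = process_with_cleanup(str(demo))
```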
5. Concurrent File Access
Multiple workers accessing the same file simultaneously causes corruption. Use unique file names and proper locking.
My Recommendation for Your Research Project
Start with BackgroundTasks, but architect for the migration: keep your OCR logic in a plain function that knows nothing about FastAPI or Celery. When reliability starts to matter, turning that function into a Celery task is a one-decorator change instead of a rewrite.
Recommended Resources
If you want to go deeper on the topics covered in this article:
- Designing APIs with Swagger and OpenAPI
- RESTful Web APIs (O'Reilly)
- Hands-On Machine Learning (O'Reilly)
Some links above are affiliate links — they help support this content at no extra cost to you.