Michael Fleck

Scaling PaddleOCR to Zero: A Multi-Cloud GPU Pipeline with KEDA

Running GPU-intensive tasks like OCR can get expensive quickly. If you leave a GPU server running 24/7, you pay for idle time. If you use a standard CPU, processing multi-page PDFs takes forever.

We built a document analysis API that solves this by splitting the workload across AWS and Azure, using a "scale-to-zero" architecture. Here is the technical breakdown.

The Architecture

The system follows an asynchronous worker pattern to ensure the API stays responsive even when processing 100-page documents.

1. The Entry Point (AWS)

We use AWS Lambda as our API gateway. It handles the "light" work:

Validation: Checking the file signature (magic bytes in the header) to verify that the upload is a genuine PDF or JPG.

Storage: Saving the raw file to Amazon S3.

State Management: Creating a job record in Amazon DynamoDB.

2. The Bridge (Azure Queue)

Once the file is safe in S3, the Lambda sends a Base64-encoded message to Azure Queue Storage. This acts as our buffer.
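Building that message is straightforward: the job metadata is serialized to JSON and Base64-encoded, since Azure Queue Storage conventionally carries Base64 text payloads. The field names below are illustrative; the send itself would go through the `azure-storage-queue` SDK's `QueueClient.send_message`.

```python
import base64
import json

def encode_queue_message(job_id: str, s3_key: str) -> str:
    """Build the Base64-encoded JSON payload placed on the Azure queue."""
    payload = json.dumps({"job_id": job_id, "s3_key": s3_key})
    return base64.b64encode(payload.encode("utf-8")).decode("ascii")
```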

3. The GPU Worker (Azure Container Apps)

This is where the heavy lifting happens. We use Azure Container Apps running on Consumption-GPU (NVIDIA T4) profiles.

Scale-to-Zero: Using KEDA (Kubernetes Event-driven Autoscaling), the GPU workers spin up only when there is a message in the queue. When the queue is empty, the replica count drops to 0 and we stop paying for the GPU.
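In Container Apps, the KEDA trigger is expressed as a custom scale rule of type azure-queue. A sketch of what such a rule looks like (queue name, secret name, and replica limits are illustrative):

```yaml
scale:
  minReplicas: 0        # scale to zero when the queue is empty
  maxReplicas: 5
  rules:
    - name: queue-scaler
      custom:
        type: azure-queue
        metadata:
          queueName: ocr-jobs
          queueLength: "1"   # one replica per pending message
        auth:
          - secretRef: queue-connection
            triggerParameter: connection
```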

The Engine: The worker runs PaddleOCR (PP-StructureV3). It handles layout analysis, identifying titles, text blocks, and tables.

PDF Rendering: Since PaddleOCR processes images, we use pypdfium2 to render PDF pages into high-res bitmaps before analysis.

```python
# A look at our page-by-page processing logic
import pypdfium2 as pdfium

pdf = pdfium.PdfDocument(pdf_path)
images_to_process = []
for i in range(start_idx, end_idx):
    page = pdf[i]
    bitmap = page.render(scale=2)  # 2x scale for better OCR accuracy
    images_to_process.append(bitmap.to_numpy())
```

4. Result Delivery

The worker saves two JSON files to Azure Blob Storage:

Raw: Every coordinate and confidence score from the engine.

Normalized: A cleaned version that maps text lines to specific layout regions.
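One way to produce that normalized view is to assign each OCR text line to the layout region whose bounding box contains the line's center. This is a minimal sketch; the field names and box format (`[x0, y0, x1, y1]`) are ours, not the engine's actual schema.

```python
def assign_region(line_box, regions):
    """Map an OCR text line to the label of the layout region containing its center."""
    cx = (line_box[0] + line_box[2]) / 2
    cy = (line_box[1] + line_box[3]) / 2
    for region in regions:
        x0, y0, x1, y1 = region["bbox"]
        if x0 <= cx <= x1 and y0 <= cy <= y1:
            return region["label"]
    return None  # line falls outside every detected region
```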

Finally, a webhook notifies the user that the data is ready for download.
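The webhook body can be as simple as a JSON document pointing at the two result blobs. The payload shape below is illustrative (the actual POST would go to the user's callback URL with a `Content-Type: application/json` header):

```python
import json

def build_webhook_payload(job_id: str, raw_url: str, normalized_url: str) -> bytes:
    """Serialize the completion notice we'd POST to the user's callback URL."""
    return json.dumps({
        "job_id": job_id,
        "status": "completed",
        "results": {"raw": raw_url, "normalized": normalized_url},
    }).encode("utf-8")
```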

Why Multi-Cloud?

We chose this setup to get the best of both worlds: AWS’s robust API management and DynamoDB, combined with Azure’s flexible GPU Container Apps and KEDA integration. This setup allows us to process complex documents in seconds while keeping infrastructure costs extremely low during quiet periods.

Skip the Infrastructure

Setting up GPU clusters and managing auto-scaling is a headache. Skip the infrastructure work and start using our ready-made API today. You can focus on your code while we handle the servers.
Try out our PaddleOCR API live in your browser at https://www.silverlining.cloud/products/ocr-api
