DEV Community: Shalini Srivastava

Build an AI Pipeline with FastAPI + Kafka + Workers

Shalini Srivastava — Tue, 16 Jun 2026 03:29:45 +0000

Most AI demos work perfectly on a laptop.

But production AI systems can become fragile when everything is handled inside one synchronous API call.

A user sends a request.

The API extracts text.

The API chunks the content.

The API generates embeddings.

The API stores data.

The API waits for everything to finish.

This may look simple in a demo, but it quickly becomes a problem in real systems.

The problem with one giant API call

In many AI applications, the API is expected to do too much.

For example, in a document processing or RAG pipeline, one request may trigger multiple heavy steps:

text extraction
chunking
embedding generation
indexing
summarization
database updates

If all of this happens inside one synchronous request, the API becomes slow and fragile.

If one downstream step fails, the complete request may fail.

If traffic increases suddenly, the API may become overloaded.

This is why event-driven architecture becomes useful for AI workloads.

A better approach: API + Kafka + workers

Instead of making the API do everything, we can split the workflow into smaller services.

The API accepts the request and publishes an event.

Background workers consume events and continue the processing asynchronously.

A simple flow looks like this:

User Request
   ↓
FastAPI
   ↓
Kafka / Redpanda Topic
   ↓
Python Worker
   ↓
Next Processing Stage
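
The flow above can be sketched with the standard library alone. This is a minimal sketch under stated assumptions, not the code from the demo: `queue.Queue` objects stand in for Kafka / Redpanda topics, `handle_upload` stands in for a FastAPI route handler, and the names `handle_upload`, `run_worker_once`, `documents_topic`, and `chunks_topic` are hypothetical.

```python
import json
import queue
import uuid


def handle_upload(text: str, documents_topic: queue.Queue) -> dict:
    """Stand-in for the API route: publish an event and return at once."""
    job_id = str(uuid.uuid4())
    documents_topic.put(json.dumps({"job_id": job_id, "text": text}))
    # No extraction, chunking, or embedding happens in the request path.
    return {"job_id": job_id, "status": "accepted"}


def run_worker_once(
    documents_topic: queue.Queue,
    chunks_topic: queue.Queue,
    chunk_size: int = 20,
) -> int:
    """Stand-in for a Python worker: consume one event, chunk its text,
    and publish each chunk for the next processing stage."""
    event = json.loads(documents_topic.get())  # blocks until an event exists
    text = event["text"]
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    for index, chunk in enumerate(chunks):
        chunks_topic.put(
            json.dumps({"job_id": event["job_id"], "index": index, "chunk": chunk})
        )
    return len(chunks)


if __name__ == "__main__":
    # Assumption: in-memory queues replace the real broker for illustration.
    documents_topic: queue.Queue = queue.Queue()
    chunks_topic: queue.Queue = queue.Queue()

    ack = handle_upload("Event-driven pipelines keep the API thin.", documents_topic)
    print(ack["status"])
    print(run_worker_once(documents_topic, chunks_topic), "chunks published")
```

In a real deployment the two functions would live in separate processes, with a Kafka client producing to and consuming from the topic, so the worker can be scaled or restarted without touching the API.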

In my practical demo, I am using:

FastAPI
Redpanda
Python workers
Docker Compose
Kafka-compatible messaging

Why Redpanda?

Redpanda is Kafka-compatible, which makes it useful for local demos and event-driven architecture experiments.

It allows us to work with Kafka-style topics, producers, and consumers while keeping the setup simple for development.

What this architecture gives us

This approach helps with:

decoupling services
handling bursty workloads
moving long-running tasks to background workers
improving scalability
isolating failures
building production-style AI pipelines

This pattern is especially useful for AI systems involving:

document processing
chunking
embeddings
RAG indexing
summarization
long-running background jobs

Key architecture idea

The API should not behave like a worker.

The API should accept the request, publish an event, and return quickly.

Workers should handle the heavy processing in the background.

That separation makes the system easier to scale, debug, and extend.

Video demo

I created a practical video where I build this Kafka-based AI pipeline step by step using FastAPI, Redpanda, Docker Compose, and Python workers.

Watch the video here:

https://youtu.be/c2ijN2KAWXw

Final thought

AI architecture is not only about calling an LLM.

The real challenge is designing the system around the AI workload.

For many production AI applications, especially those involving document processing, RAG, embeddings, or summarization, event-driven architecture can make the system much more resilient.

This is the kind of foundation we need before building more advanced AI pipelines.

The "Demo vs. Production" Trap: Building a Scalable Kafka Pipeline for LLMs

Shalini Srivastava — Sat, 13 Jun 2026 03:50:35 +0000

Why synchronous API wrappers break under bursty AI traffic, and how to fix it using an event-driven architecture with Apache Kafka.

Most AI tutorials you see online follow a simple, clean path:
User ➔ API ➔ LLM ➔ Response

It works perfectly in a local development environment. But if you try pushing that synchronous design straight into production under heavy, real-world traffic, things fall apart fast.

Forcing long-running tasks like text extraction, chunking, embedding generation, and multi-step LLM orchestration into a single blocking HTTP request is a recipe for timeouts, resource exhaustion, and cascading backend failures.

If your LLM provider introduces a 15-second latency spike or starts rate-limiting you, your entire worker thread pool sits blocked, holding memory while it waits for external network I/O to resolve. Upstream clients give up, and requests start dropping.
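
A rough back-of-the-envelope calculation shows why a latency spike hurts a blocking design so much. The pool size of 40 threads and the 0.5-second baseline latency are assumed numbers chosen for illustration; only the 15-second spike comes from the scenario above.

```python
def max_throughput(pool_size: int, latency_seconds: float) -> float:
    """Upper bound on requests per second when every request holds one
    thread for the full duration of the downstream call."""
    return pool_size / latency_seconds


if __name__ == "__main__":
    pool_size = 40          # assumed thread pool size
    normal_latency = 0.5    # assumed baseline LLM latency, in seconds
    spike_latency = 15.0    # latency spike from the scenario above

    print(f"normal: {max_throughput(pool_size, normal_latency):.1f} req/s")
    print(f"spike:  {max_throughput(pool_size, spike_latency):.1f} req/s")
```

Under these assumptions the ceiling drops from 80 to roughly 2.7 requests per second. Any traffic above that rate has to wait for a free thread or time out.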

Shifting to an Event-Driven AI Pipeline

To build enterprise-grade infrastructure that survives bursty workloads, you have to decouple the ingestion layer from your heavy processing services. This is where a durable event backbone like Apache Kafka becomes crucial.

By moving to an asynchronous architecture:

Immediate Ingestion: Your API layer instantly accepts the payload, publishes an event, and returns an acknowledgment to the user. No blocking.
Backpressure Buffer: Kafka acts as a shock absorber. If document extraction or vector database upserts slow down, events safely queue up in the log instead of crashing your servers.
Fault Isolation: If a downstream service fails, the data isn't lost. It stays in the log, within the topic's retention window, until the service recovers and resumes processing.
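
The buffering and fault-isolation behavior described above can be illustrated with a toy log. This is a sketch, not Kafka itself: the hypothetical `EventLog` class stands in for a single partition, and `committed_offset` stands in for a consumer group's committed offset. The idea it models is committing an offset only after an event has been processed successfully.

```python
from typing import Callable, List


class EventLog:
    """Append-only list standing in for a single Kafka partition."""

    def __init__(self) -> None:
        self._events: List[str] = []
        self.committed_offset = 0  # next offset the consumer will read

    def append(self, event: str) -> None:
        self._events.append(event)

    def lag(self) -> int:
        """Events written but not yet processed (the buffer)."""
        return len(self._events) - self.committed_offset

    def consume(self, handler: Callable[[str], None]) -> int:
        """Process pending events in order, committing after each success.
        Stops at the first failure so the failed event is retried later."""
        processed = 0
        while self.committed_offset < len(self._events):
            try:
                handler(self._events[self.committed_offset])
            except Exception:
                break  # offset not advanced: the event stays in the log
            self.committed_offset += 1
            processed += 1
        return processed


if __name__ == "__main__":
    log = EventLog()
    for name in ("doc-1", "doc-2", "doc-3"):
        log.append(name)

    def failing_handler(event: str) -> None:
        raise RuntimeError("vector database unavailable")

    print(log.consume(failing_handler), log.lag())     # while the service is down
    print(log.consume(lambda event: None), log.lag())  # after it recovers
```

While the handler fails, nothing is consumed and the lag stays at three. Once it recovers, the same three events are processed in order. Because an event can be retried after a partial failure, handlers in this style should be safe to run more than once.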

Full Architectural Breakdown & Walkthrough

I put together a complete video breakdown detailing the exact mechanics of these production bottlenecks, the failure dynamics of brittle retry chains, and how to implement this decoupling step-by-step.

Complete Video Breakdowns & Implementation

This is a growing weekly series where we transition from simple AI wrappers to robust, enterprise-grade backends. You can watch the full architectural breakdowns below:

Part 1: The "Demo vs. Production" Trap

We break down the 5 major bottlenecks that bring synchronous AI systems to their knees and why a distributed commit log is the right foundation.

Part 2: Designing Multi-Stage Pipelines & The Claim Check Pattern

We explore how to handle heavy 20MB+ files without choking Kafka, how to isolate faults, and how to scale the extraction and summarization consumer groups independently.
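
As a rough illustration of the Claim Check pattern mentioned above, the sketch below stores the heavy payload out of band and publishes only a small reference. It makes several assumptions: a dict stands in for object storage, a list stands in for a Kafka topic, and the function names and event fields (`claim_id`, `size_bytes`) are hypothetical rather than taken from the series.

```python
import json
import uuid
from typing import Dict, List


def publish_with_claim_check(
    payload: bytes, blob_store: Dict[str, bytes], topic: List[str]
) -> str:
    """Store the heavy payload out of band and publish only a reference."""
    claim_id = str(uuid.uuid4())
    blob_store[claim_id] = payload
    topic.append(json.dumps({"claim_id": claim_id, "size_bytes": len(payload)}))
    return claim_id


def consume_with_claim_check(
    blob_store: Dict[str, bytes], topic: List[str]
) -> bytes:
    """Read the next small event and redeem the claim for the full payload."""
    event = json.loads(topic.pop(0))
    return blob_store[event["claim_id"]]


if __name__ == "__main__":
    blob_store: Dict[str, bytes] = {}  # assumption: stands in for object storage
    topic: List[str] = []              # assumption: stands in for a Kafka topic

    document = b"x" * (20 * 1024 * 1024)  # a 20 MB file, as in the text
    publish_with_claim_check(document, blob_store, topic)

    print(len(topic[0]), "bytes on the topic")
    print(len(consume_with_claim_check(blob_store, topic)), "bytes recovered")
```

The message on the topic stays well under a kilobyte regardless of file size, so the broker only ever carries the pointer while the worker fetches the document itself from storage.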

I've also open-sourced the reference documents and architectural layouts for this series. You can grab the reference materials over on GitHub: AI Reference Documents & Code Repository.