Introduction
As Vision AI gains popularity, more organizations, from government to the private sector, are adopting it to solve real-world challenges: tracking assets, counting people, monitoring vehicles, and more. However, one of the biggest challenges we face while building these systems is scaling: processing massive volumes of data in real time while working within constrained computational resources.
Prerequisites
Before diving in, you should be familiar with:
- Kafka basics: Topics, partitions, producers, and consumers
- Python: Threading and async programming concepts
- Video processing: Basic understanding of frames and inference
- Docker: Running containerized services
The Problem: Bottlenecks in Traditional Approaches
Traditional media processing pipelines often rely on synchronous, monolithic architectures that are difficult to scale.
- Resource Contention: CPU/GPU resources are tied up during both extraction and inference
- No Fault Tolerance: If one process fails, the entire pipeline stops
- Inflexible Scaling: Can't independently scale frame extraction vs. inference workloads
Enter Kafka: Decoupling and Scaling
Kafka addresses these challenges by introducing an event-driven, decoupled architecture. Here's how:
- Asynchronous decoupling: frame extraction and inference can run independently.
- Horizontal scaling: extraction and inference workloads scale out separately.
- Fault tolerance: if one process fails, the other can continue running.
Independent Scaling
- High video input? Add more producers and extraction workers.
- Slow inference? Add more inference workers.
- Need real-time results? Scale inference workers; extraction continues unaffected.
Enough talk - let's roll up our sleeves and dive into the magic. 🎬
Clone the repository
git clone https://github.com/JayashTripathy/Distributed-Media-Inferencing-With-Kafka.git
cd Distributed-Media-Inferencing-With-Kafka
Start Docker Compose
docker-compose up -d
Install dependencies
pip install -r requirements.txt
In this blog I won't overwhelm you with big, cumbersome code blocks. I'll focus on the core concepts and how to use them in a practical way; you can always check out the full code in the repository.
Architecture Overview
Here's how everything fits together:
The Flow:
- Videos are stored in media/videos_to_process/
- Producers extract frames and send metadata to Kafka (3 partitions)
- Kafka distributes frame metadata across partitions
- Consumers (3 threads) pull metadata, load frames, and run inference
- Annotated frames are saved to media/annotated_frames/
So, where does it all begin? Media streams - the lifeblood of our pipeline. Whether it's a video file, a live camera feed, or streaming content, we need a way to ingest it. For this example we'll work with video files stored in the media/videos_to_process directory. You can add as many videos as you want to this directory.
media/videos_to_process/
├── video1.mp4 # Video 1
├── video2.mp4 # Video 2
├── video3.mp4 # Video 3
└── ...        # More videos
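Discovering the input files can be as simple as globbing that directory. Here's a minimal sketch (the `list_videos` helper and its extension list are illustrative, not the repository's actual code):

```python
from pathlib import Path

def list_videos(directory: str, extensions=(".mp4", ".avi", ".mov")) -> list[str]:
    """Return sorted paths of video files found directly in the directory."""
    base = Path(directory)
    return sorted(
        str(p) for p in base.iterdir()
        if p.is_file() and p.suffix.lower() in extensions
    )
```

Every file you drop into media/videos_to_process/ becomes another unit of work for the producers.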
Act 1: Building the Highway 🛣️ - Setting Up Kafka Topics
Before we start moving frames, we build the highway they will travel on.
python cli.py setup
Behind this command, we're creating a Kafka topic with 3 partitions:
from confluent_kafka.admin import AdminClient, NewTopic

def setup():
    admin = AdminClient({"bootstrap.servers": kafka_config.bootstrap_servers})
    # Partitions: 3 (enables parallel processing)
    # Replication: 1 (for local dev; production would use 3+)
    futures = admin.create_topics([NewTopic(kafka_config.frame_topic, 3, 1)])
    for topic, future in futures.items():
        future.result()  # raises if topic creation failed
Understanding Partitions: Think of a Kafka topic as a highway with multiple lanes. Each lane (partition) is an ordered, immutable sequence of messages. Multiple vehicles (consumers) can travel side-by-side, each in their own lane, without blocking each other.
Topic: "frame_topic"
├── Partition 0: [frame1, frame4, frame7, ...]
├── Partition 1: [frame2, frame5, frame8, ...]
└── Partition 2: [frame3, frame6, frame9, ...]
More lanes 🛣️ = More vehicles 🚗 = More throughput 🚀
We start with 3, but you can scale up as your traffic grows.
This topic is where frame metadata will flow, ready to be distributed to your inference workers.
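The round-robin layout above is what you get with unkeyed messages. If you instead keyed each message (say, by video name), Kafka would pin all of that video's frames to one lane, preserving per-video order. Here's a toy sketch of the idea; Kafka's real default partitioner uses murmur2 hashing, so `pick_partition` below is only a stand-in to illustrate the `hash % partitions` principle:

```python
def pick_partition(key: bytes, num_partitions: int) -> int:
    """Toy stand-in for Kafka's partitioner: the same key always
    maps to the same partition (Kafka uses murmur2(key) % n)."""
    return sum(key) % num_partitions

# All frames of one video share a key, so they stay ordered in one lane.
p1 = pick_partition(b"video1.mp4", 3)
p2 = pick_partition(b"video1.mp4", 3)
print(p1 == p2)  # True: stable mapping
```

Stable key-to-partition mapping is what lets you add consumers without losing ordering guarantees within a key.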
Act 2: The Frame Factory 🏭 - Producing Frames
The Journey Begins
For each video, the producer performs the following steps:
for video in videos_to_process:
    open_video(video)
    for frame in video:
        # Step 1: Extract frame
        image_data = extract_frame(video, frame)

        # Step 2: Store heavy frame locally (not in Kafka!)
        frame_path = f"media/raw_frames/{video_name}/{frame_no}.jpg"
        save_to_disk(image_data, frame_path)

        # Step 3: Send lightweight metadata to Kafka
        metadata = {
            "video_name": video_name,
            "frame_no": frame_no,
        }
        kafka_producer.send("frame_topic", json.dumps(metadata))
        # Frame continues its journey on the Kafka highway...
The producer does three things:
- Extracts frames at a configurable interval.
- Stores raw frames locally in the media/raw_frames/ directory.
- Publishes lightweight metadata to Kafka - just a JSON message containing the video name and frame number.
Note: In a production environment, you would typically use a more robust storage solution, such as S3 or a database.
Here's the clever part: we don't send the entire frame, just the metadata. This keeps the payload small and fast.
- Store frames on disk (they're heavy: JPEG images can be 100KB+ each)
- Send metadata through Kafka (lightweight: just a few bytes of JSON)
Why this matters: Kafka is optimized for small messages. Sending 100KB images would slow everything down and fill up your Kafka brokers quickly. By storing frames separately and only sending metadata, we keep Kafka fast and efficient.
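A quick back-of-the-envelope check makes the trade-off concrete. The field names mirror the metadata shown above; the 100KB frame size is the ballpark figure mentioned earlier:

```python
import json

# The lightweight metadata message versus a typical raw JPEG frame.
metadata = {"video_name": "video1.mp4", "frame_no": 42}
payload = json.dumps(metadata).encode("utf-8")

frame_size = 100 * 1024  # a ~100 KB JPEG frame

print(f"metadata: {len(payload)} bytes, frame: {frame_size} bytes")
print(f"the frame is ~{frame_size // len(payload)}x larger than the metadata")
```

Thousands of tiny JSON messages per second is exactly the workload Kafka brokers are built for; thousands of 100KB blobs is not.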
Now, let's launch a multi-threaded producer army that extracts frames from the videos and sends their metadata to Kafka using the following command.
python cli.py produce
Internally, the producer uses a ThreadPoolExecutor to process multiple videos simultaneously. Each video gets its own worker thread, so up to 10 videos can be processed concurrently.
def start(self, media_paths: list[str]):
    with ThreadPoolExecutor(max_workers=10) as executor:
        # Each video gets its own worker thread
        futures = [
            executor.submit(self.produce_media_frames, media_path)
            for media_path in media_paths
        ]
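You can try the same fan-out pattern in isolation with a stub worker. The paths and the worker body below are placeholders, not the repository's code; the point is just how `ThreadPoolExecutor` spreads the videos across threads and how `as_completed` collects results as they finish:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def produce_media_frames(media_path: str) -> str:
    # Placeholder for the real extract-and-publish work.
    return f"done: {media_path}"

media_paths = [f"media/videos_to_process/video{i}.mp4" for i in range(1, 4)]

with ThreadPoolExecutor(max_workers=10) as executor:
    futures = [executor.submit(produce_media_frames, p) for p in media_paths]
    # Results arrive in completion order, not submission order.
    results = [f.result() for f in as_completed(futures)]
```

Because `max_workers=10` exceeds the number of videos here, every video starts immediately; with more videos than workers, the extras simply queue up.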
Act 3: The Inference Squad 🤖 - Consuming and Processing Frames
Frames are queued in Kafka. Time for the inference workers to process them.
python cli.py consume
This launches a multi-threaded consumer army that pulls messages from Kafka and runs YOLO11 inference.
Here's what happens inside the consumer:
consumer_group.start(num_threads=3)
...
for each consumer_thread:
    # Load the YOLO11 model once per thread, not once per frame
    model = YOLO("yolo11n.pt")
    while True:
        # Step 1: Pull message from the Kafka highway
        message = kafka_consumer.poll()
        metadata = json.loads(message)

        # Step 2: Load the actual frame from disk
        frame_path = f"media/raw_frames/{metadata.video_name}/{metadata.frame_no}.jpg"
        frame_image = load_from_disk(frame_path)

        # Step 3: Run YOLO11 inference
        detections = model.predict(frame_image)

        # Step 4: Draw bounding boxes & annotations
        annotated_frame = draw_detections(detections)

        # Step 5: Save the annotated frame
        output_path = f"media/annotated_frames/{video_name}/{frame_no}.jpg"
        save_to_disk(annotated_frame, output_path)

        # Step 6: Commit to Kafka - "I'm done, next frame!"
        kafka_consumer.commit()
The Magic of Parallel Processing: Remember those 3 partitions we created in Act 1? Here's how they enable true parallelism:
Partition 0 → Consumer Thread 1 → GPU/CPU → Annotated Frames
Partition 1 → Consumer Thread 2 → GPU/CPU → Annotated Frames
Partition 2 → Consumer Thread 3 → GPU/CPU → Annotated Frames
Each thread processes frames from its assigned partition independently. This means:
- No waiting: Thread 1 doesn't block Thread 2 or Thread 3
- Full resource utilization: All CPU/GPU cores can be used simultaneously
- Linear scaling: Add more partitions + threads = more throughput
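The partition-per-thread layout can be simulated with plain queues, one per "partition", to see that the workers really do drain independently. This is a toy model of the assignment scheme, not Kafka itself:

```python
import queue
import threading

NUM_PARTITIONS = 3
partitions = [queue.Queue() for _ in range(NUM_PARTITIONS)]
processed = [[] for _ in range(NUM_PARTITIONS)]

# Round-robin 9 frame numbers across the 3 "partitions",
# mirroring how unkeyed messages are distributed.
for frame_no in range(9):
    partitions[frame_no % NUM_PARTITIONS].put(frame_no)

def worker(pid: int) -> None:
    # Each thread owns exactly one partition, so no locking is needed
    # between workers - the same property Kafka's assignment gives us.
    q = partitions[pid]
    while not q.empty():
        processed[pid].append(q.get())

threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_PARTITIONS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because no two workers ever touch the same queue, a slow frame in one partition never stalls the other two - exactly the "no waiting" property described above.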
The Power of Consumer Groups
Kafka's consumer groups automatically distribute work:
- Message distribution: Each consumer gets a fair share of the messages
- Load balancing: Even if some consumers are slower, the others pick up the slack
- Fault tolerance: If a consumer fails, the group automatically rebalances
Conclusion
After inference, annotated frames are saved to media/annotated_frames/, organized by video name. Each frame shows detected objects and their bounding boxes.
The Journey Complete:
Video → Frame Extraction → Kafka Topic → Multi-threaded Consumers → YOLO Inference → Annotated Frames
The beauty of this architecture is that everything runs asynchronously: producers keep feeding Kafka, Kafka keeps distributing work, and consumers keep processing - all independently, all in parallel.
Happy scaling! 🚀
