Distributed Media Inferencing with Kafka

Jayash Tripathy — Thu, 06 Nov 2025 20:02:15 +0000

Introduction

As Vision AI gains popularity, more organizations, from government to the private sector, are adopting it to solve real-world challenges: tracking assets, counting people, monitoring vehicles, and more. However, one of the biggest challenges in building these systems is scaling: processing massive volumes of data in real time within constrained computational resources.

Prerequisites

Before diving in, you should be familiar with:

Kafka basics: Topics, partitions, producers, and consumers
Python: Threading and async programming concepts
Video processing: Basic understanding of frames and inference
Docker: Running containerized services

The Problem: Bottlenecks in Traditional Approaches

Traditional media processing pipelines often rely on synchronous, monolithic architectures that are difficult to scale.

Resource Contention: CPU/GPU resources are tied up during both extraction and inference
No Fault Tolerance: If one process fails, the entire pipeline stops
Inflexible Scaling: Can't independently scale frame extraction vs. inference workloads

Enter Kafka: Decoupling and Scaling

Kafka addresses these challenges by introducing an event-driven, decoupled architecture. Here's how:

Asynchronous Decoupling: Frame extraction and inference can run independently.
Horizontal Scaling: Easily scale extraction and inference workloads independently.
Fault Tolerance: If one process fails, the other can continue running.
Independent Scaling:
- High video input? Add more producers and extraction workers.
- Slow inference? Add more inference workers.
- Need real-time results? Scale inference workers; extraction continues unaffected.
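
To make the decoupling idea concrete before touching Kafka, here is a minimal in-process analogy. It is not Kafka: a `queue.Queue` stands in for the topic, one function plays the extractor, and a few threads play the inference workers. Every name here (`extract`, `infer`, `run_pipeline`) is made up for illustration.

```python
import queue
import threading

frame_queue = queue.Queue()      # stands in for the Kafka topic
results = []                     # stands in for the annotated-frames folder
results_lock = threading.Lock()
STOP = None                      # sentinel that tells a worker to exit


def extract(num_frames: int, num_workers: int) -> None:
    """Producer side: enqueue lightweight frame metadata, then stop signals."""
    for frame_no in range(num_frames):
        frame_queue.put({"video_name": "video1", "frame_no": frame_no})
    for _ in range(num_workers):
        frame_queue.put(STOP)


def infer() -> None:
    """Consumer side: pull metadata and pretend to run inference on it."""
    while True:
        item = frame_queue.get()
        if item is STOP:
            break
        with results_lock:
            results.append(item["frame_no"])


def run_pipeline(num_frames: int = 20, num_workers: int = 2) -> list[int]:
    """Start the workers, feed them, and return the processed frame numbers."""
    results.clear()
    workers = [threading.Thread(target=infer) for _ in range(num_workers)]
    for worker in workers:
        worker.start()
    extract(num_frames, num_workers)
    for worker in workers:
        worker.join()
    return sorted(results)
```

Changing `num_workers` does not require touching `extract` at all, which is the property Kafka gives us across processes and machines.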

Enough talk - let's roll up our sleeves and dive into the Magic. 🎬

Clone the repository

git clone https://github.com/JayashTripathy/Distributed-Media-Inferencing-With-Kafka.git
cd Distributed-Media-Inferencing-With-Kafka

Start Docker Compose

docker-compose up -d

Install dependencies

pip install -r requirements.txt

In this blog I will not overwhelm you with big, cumbersome code blocks. I will focus on the core concepts and how to use them in practice. You can always check out the full code in the repository.

Architecture Overview

Here's how everything fits together:

The Flow:

Videos are stored in media/videos_to_process/
Producers extract frames and send metadata to Kafka (3 partitions)
Kafka distributes frame metadata across partitions
Consumers (3 threads) pull metadata, load frames, and run inference
Annotated frames are saved to media/annotated_frames/

So, where does it all begin? With media streams, the lifeblood of our pipeline. Whether it's a video file, a live camera feed, or streaming content, we need a way to ingest it. For this example we'll work with video files stored in the media/videos_to_process directory. You can add as many videos as you want to this directory.

media/videos_to_process/
├── video1.mp4 # Video 1
├── video2.mp4 # Video 2
├── video3.mp4 # Video 3
└── ... # More videos
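
Picking up whatever is in that directory could look roughly like the sketch below. The `find_videos` helper and the extension list are assumptions for illustration; the repository may discover files differently.

```python
from pathlib import Path

# Assumed set of extensions; adjust to whatever your decoder supports.
VIDEO_EXTENSIONS = {".mp4", ".avi", ".mov", ".mkv"}


def find_videos(directory: str) -> list[Path]:
    """Return video files in `directory`, sorted by name for stable ordering."""
    root = Path(directory)
    if not root.is_dir():
        return []
    return sorted(
        path for path in root.iterdir()
        if path.is_file() and path.suffix.lower() in VIDEO_EXTENSIONS
    )
```

Calling `find_videos("media/videos_to_process")` returns an empty list when the folder is missing, so the caller can decide whether that is an error.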

Act 1: Building the Highway 🛣️ - Setup Kafka topics

Before we start moving frames, we build the highway they will travel on.

python cli.py setup

Behind this command, we're creating a Kafka topic with 3 partitions:

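from confluent_kafka.admin import AdminClient, NewTopic

def setup():
    admin = AdminClient({"bootstrap.servers": kafka_config.bootstrap_servers})
    # Partitions: 3 (enables parallel processing)
    # Replication: 1 (for local dev; production would use 3+)
    futures = admin.create_topics([NewTopic(kafka_config.frame_topic, 3, 1)])
    # create_topics() is asynchronous; result() raises if creation failed
    for future in futures.values():
        future.result()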

Understanding Partitions: Think of a Kafka topic as a highway with multiple lanes. Each lane (partition) is an ordered, immutable sequence of messages. Multiple vehicles (consumers) can travel side-by-side, each in their own lane, without blocking each other.

Topic: "frame_topic"
├── Partition 0: [frame1, frame4, frame7, ...]
├── Partition 1: [frame2, frame5, frame8, ...]
└── Partition 2: [frame3, frame6, frame9, ...]

More lanes 🛣️ = More vehicles 🚗 = More throughput 🚀

We start with 3, but you can scale up as your traffic grows.

This topic is where frame metadata will flow, ready to be distributed to your inference workers.
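
The lane diagram above shows an idealized round-robin layout. Which partition a message really lands in depends on the client's partitioner and on whether the message has a key. The sketch below reproduces the diagram and contrasts it with key-based placement. Both functions are illustrative stand-ins: CRC32 is used only because it is deterministic, and it is not claimed to be the hash any particular Kafka client uses.

```python
import zlib

NUM_PARTITIONS = 3


def round_robin_partition(frame_no: int, num_partitions: int = NUM_PARTITIONS) -> int:
    """Idealized layout from the diagram: frame1 -> P0, frame2 -> P1, ..."""
    return (frame_no - 1) % num_partitions


def keyed_partition(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Same key -> same partition. CRC32 is a stand-in, not Kafka's exact hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def layout(frame_numbers, num_partitions: int = NUM_PARTITIONS) -> dict[int, list[int]]:
    """Group frame numbers by the partition the round-robin rule picks."""
    partitions = {p: [] for p in range(num_partitions)}
    for frame_no in frame_numbers:
        partitions[round_robin_partition(frame_no, num_partitions)].append(frame_no)
    return partitions
```

If you ever need all frames of one video to stay in order, keying messages by video name keeps them on a single partition, at the cost of a less even spread across lanes.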

Act 2: The Frame Factory 🔨 - Producing Frames

The Journey Begins

For each video, the producer performs the following steps:

for video in videos_to_process:
    open_video(video)
    video_name = get_name(video)  # e.g. "video1" for video1.mp4

    for frame_no, frame in enumerate(video):
        # Step 1: Extract frame
        image_data = extract_frame(video, frame)

        # Step 2: Store heavy frame locally (not in Kafka!)
        frame_path = f"media/raw_frames/{video_name}/{frame_no}.jpg"
        save_to_disk(image_data, frame_path)

        # Step 3: Send lightweight metadata to Kafka
        metadata = {
            "video_name": video_name,
            "frame_no": frame_no
        }
        kafka_producer.send("frame_topic", json.dumps(metadata))

        # Frame continues journey on Kafka highway...

The producer does three things:

Extracts frames at a configurable interval.
Stores raw frames locally in the media/raw_frames/ directory.
Publishes lightweight metadata to Kafka — just a JSON message containing the video name and frame number.
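
As a rough sketch of the first two steps, here is one way to decide which frames to keep and where each one goes on disk. The helper names and the `every_n` parameter are assumptions for illustration, not the repository's actual API.

```python
from pathlib import PurePosixPath


def sampled_frame_numbers(total_frames: int, every_n: int) -> list[int]:
    """Frame indices to keep when extracting one frame out of every `every_n`."""
    if every_n < 1:
        raise ValueError("every_n must be >= 1")
    return list(range(0, total_frames, every_n))


def raw_frame_path(video_name: str, frame_no: int) -> str:
    """Deterministic location of a raw frame, matching the layout in the text."""
    return str(PurePosixPath("media/raw_frames") / video_name / f"{frame_no}.jpg")
```

Because the path is derived purely from the video name and frame number, the consumer can rebuild it from the metadata alone, with no extra lookup.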

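Note: Local disk works here because producers and consumers run on the same machine. In a production environment, you would typically use shared storage, such as S3 or a database, so that workers on different machines can read the frames.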

Here's the clever part: we don't send the entire frame, just the metadata. This keeps the payload small and fast.

Store frames on disk (they're heavy—JPEG images can be 100KB+ each)
Send metadata through Kafka (lightweight—just a few bytes of JSON)

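Why this matters: Kafka works best with small messages. Pushing 100KB images through it would eat broker disk and network bandwidth quickly, and anything above the default message size limit of roughly 1 MB would be rejected unless brokers and clients are reconfigured. By storing frames separately and only sending metadata, we keep Kafka fast and efficient.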
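
A quick back-of-the-envelope check makes the difference tangible. The frame size uses the round 100KB figure from above, and the 10,000-frame workload is a hypothetical number chosen only for the arithmetic.

```python
import json

metadata = {"video_name": "video1", "frame_no": 1234}
message = json.dumps(metadata).encode("utf-8")

FRAME_SIZE_BYTES = 100 * 1024  # the "100KB+" JPEG from the text, as a round figure
FRAMES = 10_000                # hypothetical workload


def total_megabytes(bytes_per_message: int, count: int) -> float:
    """Total payload size in MB for `count` messages of the given size."""
    return bytes_per_message * count / (1024 * 1024)


metadata_mb = total_megabytes(len(message), FRAMES)
frames_mb = total_megabytes(FRAME_SIZE_BYTES, FRAMES)
print(f"one metadata message: {len(message)} bytes")
print(f"{FRAMES} metadata messages: {metadata_mb:.2f} MB")
print(f"{FRAMES} raw frames: {frames_mb:.2f} MB")
```

The metadata for the whole workload stays under a megabyte, while the raw frames would push close to a gigabyte through the brokers.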

Now, let's launch an army of multi-threaded producers that extracts frames from the videos and sends their metadata to Kafka using the following command.

python cli.py produce

Internally, the producer uses a ThreadPoolExecutor to process multiple videos simultaneously. Each video gets one worker thread, so with 5 videos all 5 are extracted concurrently. The pool is capped at 10 threads, so an 11th video would wait until a worker is free.

def start(self, media_paths: list[str]):
    with ThreadPoolExecutor(max_workers=10) as executor:
        # Each video gets its own worker thread
        futures = [
            executor.submit(self.produce_media_frames, media_path)
            for media_path in media_paths
        ]
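
One detail worth knowing: an exception inside a worker stays hidden until you call `result()` on its future. The sketch below shows one way to collect results and failures per video. The stub `produce_media_frames` and the standalone `start` function are assumptions for illustration and only mimic the shape of the snippet above.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def produce_media_frames(media_path: str) -> int:
    """Stub standing in for the real per-video work; returns a fake frame count."""
    if not media_path.endswith(".mp4"):
        raise ValueError(f"unsupported file: {media_path}")
    return len(media_path)  # placeholder result


def start(media_paths: list[str], max_workers: int = 10) -> tuple[dict[str, int], dict[str, str]]:
    """Run one task per video and report successes and failures separately."""
    done, failed = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(produce_media_frames, p): p for p in media_paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                done[path] = future.result()
            except Exception as exc:  # a failed video should not hide the others
                failed[path] = str(exc)
    return done, failed
```

With this shape, one corrupt file ends up in `failed` while the remaining videos still finish.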

Act 3: The Inference Squad 🤖 - Consuming and Processing Frames

Frames are queued in Kafka. Time for the inference workers to process them.

python cli.py consume

This launches a multi-threaded consumer army that pulls messages from Kafka and runs YOLO11 inference.

Here is what happens inside each consumer:

consumer_group.start(num_threads=3)
...
for each consumer_thread:
    # Load the model once per thread, not once per frame
    model = YOLO("yolo11n.pt")

    while True:
        # Step 1: Pull message from Kafka highway
        message = kafka_consumer.poll(timeout=1.0)
        if message is None:
            continue  # nothing arrived yet, poll again
        metadata = json.loads(message.value())
        video_name = metadata["video_name"]
        frame_no = metadata["frame_no"]

        # Step 2: Load actual frame from disk
        frame_path = f"media/raw_frames/{video_name}/{frame_no}.jpg"
        frame_image = load_from_disk(frame_path)

        # Step 3: Run YOLO11 inference
        detections = model.predict(frame_image)

        # Step 4: Draw bounding boxes & annotations
        annotated_frame = draw_detections(frame_image, detections)

        # Step 5: Save annotated frame
        output_path = f"media/annotated_frames/{video_name}/{frame_no}.jpg"
        save_to_disk(annotated_frame, output_path)

        # Step 6: Commit to Kafka - "I'm done, next frame!"
        kafka_consumer.commit()
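
Committing only after the annotated frame is saved gives at-least-once processing: if a worker dies between saving and committing, the same frame is handed out again after a restart. The simulation below illustrates that with an in-memory stand-in. `FakePartition` and `run_consumer` are invented for this sketch and involve no real Kafka client.

```python
class FakePartition:
    """In-memory stand-in for one Kafka partition plus a committed offset."""

    def __init__(self, messages):
        self.messages = list(messages)
        self.committed = 0  # offset of the next message to read after a restart


def run_consumer(partition, outputs, crash_at=None):
    """Process from the committed offset; optionally 'crash' before a commit."""
    offset = partition.committed
    while offset < len(partition.messages):
        frame_no = partition.messages[offset]
        outputs[frame_no] = outputs.get(frame_no, 0) + 1  # "save annotated frame"
        if crash_at == offset:
            return False  # died after saving, before committing
        offset += 1
        partition.committed = offset  # commit only after the work is done
    return True


partition = FakePartition([10, 20, 30, 40])
times_processed = {}
finished_first_run = run_consumer(partition, times_processed, crash_at=1)
finished_second_run = run_consumer(partition, times_processed)
```

Frame 20 is processed twice here. In our pipeline that is harmless, because the output path is derived from the video name and frame number, so the second run simply overwrites the same file.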

The Magic of Parallel Processing: Remember those 3 partitions we created in Act 1? Here's how they enable true parallelism:

Partition 0 → Consumer Thread 1 → GPU/CPU → Annotated Frames
Partition 1 → Consumer Thread 2 → GPU/CPU → Annotated Frames  
Partition 2 → Consumer Thread 3 → GPU/CPU → Annotated Frames

Each thread processes frames from its assigned partition independently. This means:

No waiting: Thread 1 doesn't block Thread 2 or Thread 3
Full resource utilization: All consumer threads can keep the CPU/GPU busy at the same time
Scaling: Adding partitions and threads together adds throughput, until the CPU/GPU itself becomes the bottleneck
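
One caveat: within a consumer group, a partition feeds at most one consumer, so threads beyond the partition count sit idle. The sketch below turns that into a rough estimate. It assumes frames are spread evenly and ignores GPU contention and disk I/O, and the function names are illustrative.

```python
import math


def effective_parallelism(num_partitions: int, num_threads: int) -> int:
    """Within one consumer group, a partition feeds at most one consumer."""
    return min(num_partitions, num_threads)


def estimated_seconds(num_frames: int, seconds_per_frame: float,
                      num_partitions: int, num_threads: int) -> float:
    """Idealized wall-clock time: even spread, no GPU contention, no I/O stalls."""
    workers = effective_parallelism(num_partitions, num_threads)
    return math.ceil(num_frames / workers) * seconds_per_frame
```

With 3 partitions, going from 3 threads to 8 changes nothing in this model. To use 8 threads you would also need at least 8 partitions.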

The Power of Consumer Groups

Kafka's consumer groups automatically distribute work:

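Message distribution: Each partition is assigned to exactly one consumer in the group, so every consumer gets its own share of the messages
Load balancing: Partitions are spread as evenly as possible across the consumers in the group
Fault tolerance: If a consumer fails, the group automatically rebalances and its partitions are handed to the remaining consumers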
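
To picture what a rebalance does, here is a simplified model. Kafka's real assignment strategies differ in their details, so treat `assign` and `rebalance_after_failure` as an illustration of the idea, not as the broker's algorithm.

```python
def assign(partitions: list[int], consumers: list[str]) -> dict[str, list[int]]:
    """Simplified round-robin assignment: each partition gets exactly one owner."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(sorted(partitions)):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


def rebalance_after_failure(partitions, consumers, failed):
    """Drop the failed consumer and redistribute every partition to survivors."""
    survivors = [c for c in consumers if c != failed]
    if not survivors:
        raise RuntimeError("no consumers left in the group")
    return assign(partitions, survivors)


partitions = [0, 1, 2]
consumers = ["thread-1", "thread-2", "thread-3"]
before = assign(partitions, consumers)
after = rebalance_after_failure(partitions, consumers, failed="thread-2")
```

Before the failure each thread owns one partition. Afterwards the two survivors cover all three, so no frames are stranded, although one thread now carries twice the load.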

Conclusion

After inference, annotated frames are saved to media/annotated_frames/, organized by video name. Each frame shows detected objects and their bounding boxes.

The Journey Complete:

Video → Frame Extraction → Kafka Topic → Multi-threaded Consumers → YOLO Inference → Annotated Frames

The beauty of this architecture is that everything runs asynchronously. Producers keep feeding Kafka, Kafka keeps distributing work, and consumers keep processing - all independently, all in parallel.

DEV Community: Jayash Tripathy