Scaling with Celery and Redis

Task queues and background workers are fundamental patterns in modern software architecture. If you're building an app that does anything heavier than a simple database query and you want it to scale, you need task queues.

I learned this while building something of my own.

The Task

So, I was working on something: a more personalized version of Filmot.

If you aren’t familiar, the idea is simple: I wanted to build a search engine that could index my personal YouTube playlists and let me instantly search through the transcripts (as well as titles and descriptions) of every video in those playlists.

The user experience was designed to be pretty straightforward:

  1. You log in using Google OAuth and grant read access to your YouTube account.
  2. You pick a playlist from your dashboard.
  3. You hit the "Start Indexing" button.

The Issue

Now, under the hood, the moment the indexing started, my backend initiated a synchronous loop. It attempted to process every single video in the playlist sequentially before returning any response to the user.

I was pretty happy because I could index my own playlists effortlessly. However, when I showed it to a friend, he tried a playlist with around 1,000 videos, and it failed. The screen froze, the spinner spun forever, and eventually the app crashed.

The bottleneck was simple math: if fetching and indexing one transcript takes roughly 1.5 seconds, doing that for 1,000 videos takes about 25 minutes, while the browser gives up after roughly 60 seconds. I knew the app was synchronous and understood the issues that came with that, but I expected it to fail somewhere else; failing during indexing kind of defeats the whole purpose of the application.

The Fix

The issue was obvious: indexing was taking too much time, and this was blocking my application from doing anything else. While looking for solutions, I came across a video by Sriniously on task queues, an insanely insightful video; I truly learned a lot. It broke down the producer-consumer model and emphasized why offloading task execution from user requests is essential for building scalable applications.

This prompted me to revamp the architecture entirely, shifting from a synchronous, blocking system to an asynchronous, non-blocking one.

I'll try to unpack task queues and background workers methodically, from what they are to how they work, and show how they made my app way more efficient.

Why Do We Need Task Queues and Background Workers?

When your app is under heavy load, a fully synchronous design quickly becomes a bottleneck (just like in my case). Long-running operations block the main thread, requests pile up behind each other, latency grows worse, and eventually the system starts dropping connections or even crashing. That's exactly what happened with the 1,000-video playlist: every transcript fetch happened one after another on the main thread, taking roughly 25 minutes, while the client timed out after just 60 seconds.

Task queues solve this by shifting work away from the request–response cycle and into a background process. Instead of the server waiting for every API call or slow operation to finish, the HTTP handler simply records the intent, places a job in a queue, returns an acknowledgment and hands control back immediately.

Meanwhile, separate worker processes consume tasks from the queue and execute them in parallel, or concurrently, depending on how you've set things up. This keeps the main server responsive because it's no longer forced to wait on slow operations or external APIs. Scaling becomes far more straightforward as well: whenever you need additional throughput, you simply add more workers without touching the web tier. And because each worker is isolated, a failure in one won't cascade through the entire system.

A particularly useful feature for me was the ability to rate-limit the queue, which helped avoid IP bans from YouTube.
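
Celery exposes this directly as a per-task option. A minimal sketch of the idea (the "10/m" value here is just an illustration, not my actual limit):

# Illustrative value: cap how fast workers pull transcript jobs off the queue.
# rate_limit="10/m" means each worker executes at most 10 of these tasks per minute.
@celery.task(bind=True, max_retries=3, rate_limit="10/m")
def process_video_task(self, video_data, index_name):
    ...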

The Components

At its core, a task queue is just a queue data structure to store work that should be done later instead of during the main request. You can think of it as a first-in-first-out list of jobs that won’t disappear if the web application restarts, usually backed by a message broker that guarantees the tasks aren’t lost. Around this simple idea, you get a whole ecosystem that lets different parts of this system communicate without blocking each other, almost like lightweight microservices passing messages around.
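
To make that concrete, here is a toy sketch of the same idea built directly on Redis list commands, which is roughly the kind of primitive Celery relies on under the hood (an illustration only, not how Celery actually formats its messages):

import json
import redis

r = redis.Redis()

# Producer: append a job to the queue (a Redis list)
r.lpush("video_jobs", json.dumps({"video_id": "abc123"}))

# Consumer: block until a job arrives, pop it, and do the work
_key, raw_job = r.brpop("video_jobs")
job = json.loads(raw_job)
print("processing", job["video_id"])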

A task is a single piece of work the system needs to do. In my app, that unit of work is process_video_task. This function takes one video ID, fetches its transcript through a proxy, and indexes the data into Elasticsearch. Each video is processed independently by a worker.

@celery.task(bind=True, max_retries=3)
def process_video_task(self, video_data, index_name):
    try:
        # Fetch this video's transcript (routed through a proxy)
        transcript = get_video_transcript(video_data['id'])

        # Index the metadata + transcript into Elasticsearch
        if index_video(index_name, video_data, transcript):
            return (video_data['id'], True)

        return (video_data['id'], False)

    except Exception as exc:
        # Let Celery re-queue the task (up to max_retries times) instead of
        # silently swallowing the error
        raise self.retry(exc=exc)

The producer is the part of the application that creates tasks and sends them to the queue. In my setup, the Flask endpoint is the producer. When a user starts indexing a playlist, the endpoint gathers all the needed info and sends the main job to Celery. This immediately pushes the job into the background without blocking the user.
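
A stripped-down sketch of what that producer endpoint looks like (route, payload, and arguments are simplified here; the real task also receives the playlist title and OAuth credentials):

from flask import Flask, jsonify

from tasks import index_playlist_task  # wherever the Celery tasks live

app = Flask(__name__)

@app.route("/playlists/<playlist_id>/index", methods=["POST"])
def start_indexing(playlist_id):
    # .delay() serializes the task and hands it to the broker, then returns
    # immediately -- the actual indexing happens later, in a worker process
    task = index_playlist_task.delay(playlist_id)
    return jsonify({"task_id": task.id}), 202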

The broker is the software that actually stores those tasks and moves them between producers and consumers. In practice, the broker exposes a set of primitives (append a message, pop a message, block-and-wait for a message, acknowledge a message, and so on), and the queue is the logical structure you build on top of those primitives.

I chose Redis not only as the message broker for the queue but also as the store for tracking task status, enabling real-time progress updates for the user.

# Storing the task ID in Redis so we can look it up later
task_id_key = f"{TASK_KEY_PREFIX}{playlist_id}"
redis_conn.set(task_id_key, task.id, ex=7200)
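
The other half of that idea is a polling endpoint: the frontend periodically asks how the job is doing, and the backend looks the task ID back up from Redis and checks its state. A rough sketch (the route name is illustrative):

from celery.result import AsyncResult

@app.route("/playlists/<playlist_id>/status")
def indexing_status(playlist_id):
    task_id = redis_conn.get(f"{TASK_KEY_PREFIX}{playlist_id}")
    if task_id is None:
        return jsonify({"state": "NOT_FOUND"}), 404

    # Ask the Celery result backend what happened to this task
    result = AsyncResult(task_id.decode(), app=celery)
    return jsonify({"state": result.state})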

The consumers, a.k.a. the workers, are the ones that actually do the tasks by taking them from the message broker. They run in separate processes; because the consumers run as background workers (sometimes even on a separate server), the main application stays fast and responsive.

In my application, I am using Celery to manage these consumers. Celery wraps the complex logic of spawning worker processes, managing memory, and acknowledging messages, so I don't have to write that infrastructure code from scratch.

How they all tie up together

The definition of a broker according to Wikipedia is "a person or entity that arranges transactions between a buyer and a seller." This is exactly the role Redis plays in this ecosystem.

Technically, Redis is an in-memory data structure store. Because it lives entirely in RAM, it is incredibly fast. However, in this specific architecture, it acts as a "dumb" middleman. It has no idea what it is storing and doesn't understand the business logic of my application. Its only job is to accept tasks from the producer, hold them safely in memory, and facilitate adding or removing them with microsecond latency.

On the other hand, Celery acts as the brain of the operation. It sits on top of Redis and provides the logic. Celery handles the serialization, manages the creation of worker processes, and monitors task health. If a worker fails, Celery is the one that detects it and executes the retry logic.
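
Wiring the two together takes only a few lines: Redis serves as both the broker (where pending tasks wait) and the result backend (where Celery records each task's state). A minimal sketch, assuming Redis is running locally on the default port:

from celery import Celery

celery = Celery(
    "indexer",                            # illustrative app name
    broker="redis://localhost:6379/0",    # the queue of pending tasks
    backend="redis://localhost:6379/1",   # task states and return values
)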

If I were building this in other languages, the specific tools would change, but the roles would remain exactly the same:

  • In Node.js: You might use BullMQ with Redis.
  • In Go: You might use Asynq with Redis.

My Approach

The architecture I implemented with the task queues and background workers is based on the Fan-Out Pattern (also known as the Orchestrator Pattern). Basically, there are two roles: a Manager who plans the work, and the Workers who do the work. When the user clicks "Index Playlist," the backend triggers just one single task: index_playlist_task. This task doesn't actually download any transcripts. Its job is to act as the foreman. It connects to the YouTube API, fetches the list of all videos in the playlist, and calculates exactly what needs to be done.

@celery.task(bind=True)
def index_playlist_task(self, playlist_id, playlist_title, credentials_dict, ...):

    # 1. The Manager fetches the "To-Do List" from YouTube
    videos = get_playlist_videos(playlist_id, credentials)

Once the manager has the list of videos (assume 500), it creates a "Signature" (a task blueprint) for every single video. It packages these hundreds of signatures into a Celery Group.

This is the critical moment where the interaction with Redis happens.

When the code executes job_group.apply_async(), Celery takes those 500 task signatures, serializes them (turns them into JSON messages), and blasts them into the Redis queue.

    tasks_to_run = []

    # 2. The Manager prepares orders for the workers
    for video in videos:
        # .s() creates a "signature" - a task message that is ready to be sent
        tasks_to_run.append(process_video_task.s(video, index_name))

    if tasks_to_run:
        # 3. THE FAN-OUT
        # 'group' bundles all these tasks together
        job_group = group(tasks_to_run)

        # 4. SEND TO REDIS
        # This single line pushes hundreds of messages into the Redis list instantly.
        result_group = job_group.apply_async()

At this specific millisecond, the Redis list (the queue) spikes. It goes from having 0 items to having 500 pending jobs.

Redis acts as the high-speed buffer here. It holds these 500 serialized messages safely in memory. It doesn't care that they are video processing tasks; it just knows it has 500 items that need to be popped off the list.

Now, my fleet of Celery workers (running on the server) wakes up. They see the queue in Redis is full and start "popping" tasks off the list as fast as they can. To make this truly efficient, I utilize Green Threads (via Gevent).
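
For completeness, the gevent pool and the concurrency level can be selected either on the worker command line (something like celery -A app.celery worker --pool=gevent --concurrency=50) or through configuration. A hedged sketch of the config route, with illustrative values:

# Illustrative values -- tune to how much parallel I/O YouTube will tolerate
celery.conf.update(
    worker_pool="gevent",       # green-thread pool: one process, many greenlets
    worker_concurrency=50,      # up to 50 transcript fetches in flight at once
)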

Each worker runs this isolated logic:

@celery.task(bind=True, max_retries=3)
def process_video_task(self, video_data, index_name):
    # The worker pops one message from Redis and processes just that ONE video
    transcript = get_video_transcript(video_data['id'])

    if index_video(index_name, video_data, transcript):
        return (video_data['id'], True)

By "fanning out" the workload and using Green Threads to handle the I/O waiting, I can index a massive playlist in a fraction of the time.

The Result

After revamping the architecture, the math completely changed.

Because I was now processing videos using multiple workers, I wasn't limited to the speed of a single thread. I configured my system to handle 50 tasks concurrently. Since downloading transcripts is mostly just waiting on the network, my server could easily juggle those 50 requests at almost the same time.

Now the math is:
(1,000 videos / 50 concurrent tasks) × 1.5 seconds ≈ 30 seconds
Compared to the old:
1,000 videos × 1.5 seconds/video = 1,500 seconds (≈ 25 minutes)

These numbers are just approximate figures based on average response times, but the speed did increase considerably. The application went from choking on a 1,000-video playlist to indexing a massive 3,000-video playlist effortlessly. Truly, task queues for the win.

TL;DR

The project is fully deployed on GCP and you can try the live demo here: https://yts-88.com.

A quick note on stability: YouTube blocks transcript requests coming from data center IPs (like GCP), flagging them as bots. To bypass this, I implemented a residential routing proxy using Webshare, which worked flawlessly. However, maintaining residential proxies became too expensive for a side project, so I am currently trying to find some cost-effective workarounds.
