<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ganesh Parella</title>
    <description>The latest articles on DEV Community by Ganesh Parella (@ganesh_parella).</description>
    <link>https://dev.to/ganesh_parella</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3771329%2F5c80f144-e42b-45f7-a665-5de4b340ed0f.png</url>
      <title>DEV Community: Ganesh Parella</title>
      <link>https://dev.to/ganesh_parella</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ganesh_parella"/>
    <language>en</language>
    <item>
      <title>I Broke My Own Workflow Engine at Scale — Here's How I Fixed It</title>
      <dc:creator>Ganesh Parella</dc:creator>
      <pubDate>Wed, 18 Mar 2026 13:26:22 +0000</pubDate>
      <link>https://dev.to/ganesh_parella/i-broke-my-own-workflow-engine-at-scale-heres-how-i-fixed-it-hows-the-title-4ml5</link>
      <guid>https://dev.to/ganesh_parella/i-broke-my-own-workflow-engine-at-scale-heres-how-i-fixed-it-hows-the-title-4ml5</guid>
      <description>&lt;p&gt;In my last post, I broke down how I built FlowForge, a fault-tolerant DAG workflow engine using ASP.NET Core, React, and MySQL. I explained how I solved complex branching and dependency execution using Kahn's Algorithm and a database-backed state machine.&lt;/p&gt;

&lt;p&gt;It works perfectly for hundreds of concurrent users. But as engineers, we always have to ask the dangerous question: what happens when we 1000x the load? Let's dive deep into the absolute limits of my current architecture, watch it break, and re-architect it to handle massive scale: 1,000,000 flow executions per second.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How the V1 Engine Works (And Why It Will Fail)&lt;/strong&gt;&lt;br&gt;
To understand how to scale the engine, you need to understand what it's currently doing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Storage &amp;amp; Parsing:&lt;/strong&gt;&lt;br&gt;
When a user builds a flow in the React frontend, we save the entire raw JSON (including UI viewports and node coordinates) into the definitionJson column of our Flow table. When execution starts, we parse this JSON into a backend-friendly ParsedFlow (a strict list of Nodes and Edges) and verify the graph contains no cycles using topological sorting.&lt;/p&gt;
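&lt;p&gt;A minimal sketch of that parse step (field names are illustrative, not the exact FlowForge schema): the editor JSON carries UI noise, and we keep only the ids, types, and edges the engine needs.&lt;/p&gt;

```python
import json

def parse_flow(definition_json):
    """Strip UI-only fields from the raw editor JSON, keeping nodes and edges."""
    raw = json.loads(definition_json)
    nodes = [{"id": n["id"], "type": n["type"], "data": n.get("data", {})}
             for n in raw["nodes"]]  # drop x/y coordinates, viewport, etc.
    edges = [(e["source"], e["target"]) for e in raw["edges"]]
    return {"nodes": nodes, "edges": edges}
```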

&lt;p&gt;&lt;strong&gt;The Execution Loop:&lt;/strong&gt;&lt;br&gt;
When a flow is triggered, we create a FlowInstance in the database to track the overall run. We then generate NodeInstances for every step, marking the initial trigger node as Ready and everything else as Pending.&lt;/p&gt;
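&lt;p&gt;Seeding that per-node state can be sketched like this (illustrative, not the actual FlowForge code):&lt;/p&gt;

```python
def create_node_instances(node_ids, trigger_id):
    """Seed per-node execution state: the trigger starts Ready, the rest Pending."""
    return {nid: ("Ready" if nid == trigger_id else "Pending") for nid in node_ids}
```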

&lt;p&gt;&lt;strong&gt;The Polling Mechanism:&lt;/strong&gt;&lt;br&gt;
A background service aggressively polls the MySQL database every 10 milliseconds, looking for nodes in the Ready state. When it finds one, it executes it, evaluates the downstream dependencies, and marks the next children as Ready.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Breaking Points at 1 Million RPS&lt;/strong&gt;&lt;br&gt;
If we force 1 million workflows per second through this V1 architecture, two things will immediately catch fire:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Breaking Point 1: ThreadPool Starvation (CPU Bottleneck)&lt;/strong&gt;&lt;br&gt;
ASP.NET Core is incredibly efficient with async/await, releasing threads back to the pool during I/O operations. However, parsing a massive workflow JSON 1 million times a second is a heavy, synchronous, CPU-bound operation. We will quickly exhaust the available worker threads, leading to severe latency spikes and HTTP 503 errors as the API Gateway drops incoming requests.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Breaking Point 2: The Database Meltdown&lt;/strong&gt;&lt;br&gt;
Polling a database every 10ms for 1,000,000 concurrent flows results in an astronomical amount of read queries. The MySQL CPU will hit 100%, disk I/O will bottleneck, and the database will crash before the web servers even break a sweat.&lt;/p&gt;
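&lt;p&gt;The back-of-the-envelope arithmetic makes this concrete:&lt;/p&gt;

```python
flows = 1_000_000      # concurrent flow instances
polls_per_flow = 100   # one poll every 10 ms = 100 polls per second
read_qps = flows * polls_per_flow
print(f"{read_qps:,} read queries per second")  # 100,000,000
```

&lt;p&gt;One hundred million reads per second, before a single node has even executed. No single MySQL instance survives that.&lt;/p&gt;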

&lt;p&gt;&lt;strong&gt;Re-Architecting for 1M Flows per Second&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxsjk8jktrsuyo2cearfg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxsjk8jktrsuyo2cearfg.png" alt="Flow-forge" width="516" height="844"&gt;&lt;/a&gt;&lt;br&gt;
To survive this scale, we have to fundamentally shift from a monolithic, polling-based architecture to a distributed, event-driven one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Decoupling with Message Queues&lt;/strong&gt;&lt;br&gt;
Instinctively, many developers think, "Just throw a Message Queue in front of it." But what actually goes into the queue?&lt;/p&gt;

&lt;p&gt;Instead of the API server trying to execute the flow, the API server's only job is to receive the HTTP request, validate it, and drop a StartFlowMessage (containing the FlowId and payload) into a high-throughput broker like Kafka or RabbitMQ. The API responds with a 202 Accepted immediately, freeing up the web thread in milliseconds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Scaling Horizontally with Stateless Workers&lt;/strong&gt;&lt;br&gt;
Now that execution is decoupled, we deploy dedicated, stateless Worker Services reading directly from the Kafka partitions. If the queue gets backed up, we simply spin up 50 more worker containers in Kubernetes to chew through the backlog.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Eradicating Database Polling (Event-Driven Execution)&lt;/strong&gt;&lt;br&gt;
We completely remove the 10ms SELECT loop. When a worker finishes executing "Node A", it doesn't wait for the database. Instead, it calculates what nodes are unblocked, updates their state in MySQL, and immediately pushes a new ExecuteNodeMessage into the queue for "Node B". The engine is now entirely event-driven.&lt;/p&gt;
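&lt;p&gt;The completion handler can be sketched like this (an in-memory stand-in with illustrative names): finishing a node publishes a message for every child whose dependencies are now all satisfied, so nothing ever polls.&lt;/p&gt;

```python
def on_node_completed(node_id, parents, completed, publish):
    """When `node_id` finishes, enqueue every child whose parents are all done.

    `parents` maps each node to the set of nodes it depends on;
    `completed` is the set of finished nodes; `publish` stands in for the broker.
    """
    completed.add(node_id)
    for child, deps in parents.items():
        if node_id in deps and deps.issubset(completed):
            publish({"type": "ExecuteNodeMessage", "nodeId": child})
```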

&lt;p&gt;&lt;strong&gt;4. Handling Concurrency and Idempotency&lt;/strong&gt;&lt;br&gt;
Here is the massive catch with horizontal scaling: What if two different workers pick up the same flow at the exact same time?&lt;/p&gt;

&lt;p&gt;To keep execution idempotent (each node must run at most once), we implement Optimistic Concurrency Control in our database. We add a Version (RowVersion) column to our NodeInstance table. When a worker tries to move a node from Ready to Running, it executes:&lt;br&gt;
&lt;code&gt;UPDATE NodeInstance SET Status = 'Running', Version = Version + 1 WHERE Id = X AND Version = Y;&lt;/code&gt;&lt;br&gt;
If another worker has already claimed the job, the row version will have changed, the UPDATE will affect 0 rows, and the losing worker knows to safely drop the duplicate task.&lt;/p&gt;
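&lt;p&gt;The same claim logic, sketched outside SQL with an in-memory row standing in for the NodeInstance record:&lt;/p&gt;

```python
def try_claim(row, expected_version):
    """Atomically move a node from Ready to Running if the version still matches.

    Mirrors: UPDATE ... SET Status = 'Running', Version = Version + 1
             WHERE Id = X AND Version = Y   (0 rows affected means we lost the race)
    """
    if row["version"] == expected_version and row["status"] == "Ready":
        row["status"] = "Running"
        row["version"] += 1
        return True   # this worker owns the node
    return False      # another worker already claimed it; drop the duplicate
```

&lt;p&gt;In EF Core the same pattern can be expressed with a rowversion concurrency token, where the lost race surfaces as a DbUpdateConcurrencyException instead of a zero row count.&lt;/p&gt;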

&lt;p&gt;&lt;strong&gt;5. Reducing Latency with Distributed Caching&lt;/strong&gt;&lt;br&gt;
Parsing the raw definitionJson from MySQL on every execution is too expensive. We introduce Redis to cache the highly requested, pre-parsed DAG structures. When a worker picks up a flow, it grabs the pre-compiled execution plan from Redis in sub-milliseconds, bypassing the JSON deserialization tax entirely.&lt;/p&gt;
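&lt;p&gt;The cache-aside pattern in miniature (a dict stands in for Redis, and parse_fn for the JSON-to-DAG compilation):&lt;/p&gt;

```python
def get_execution_plan(flow_id, cache, load_json, parse_fn):
    """Cache-aside: return the pre-parsed DAG, parsing only on a cache miss."""
    plan = cache.get(flow_id)
    if plan is None:
        plan = parse_fn(load_json(flow_id))  # pay the deserialization tax once
        cache[flow_id] = plan                # real Redis: SET with a TTL
    return plan
```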

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
Scaling a system is rarely about writing "faster code"; it is about removing bottlenecks. By shifting from database polling to an event-driven queue, offloading work to stateless consumers, utilizing distributed caching, and enforcing optimistic concurrency, FlowForge evolves from a reliable prototype into an enterprise-grade orchestration engine.&lt;br&gt;
(Missed the first part? Check out the original build: &lt;a href="https://dev.to/ganesh_parella/building-flowforge-architecting-a-dag-based-workflow-engine-in-net-5ad4"&gt;How I Built a Fault-Tolerant DAG Workflow Engine in ASP.NET Core&lt;/a&gt;)&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>dotnet</category>
      <category>performance</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>How I Built a Fault-Tolerant DAG Workflow Engine in ASP.NET Core</title>
      <dc:creator>Ganesh Parella</dc:creator>
      <pubDate>Sat, 14 Mar 2026 10:03:13 +0000</pubDate>
      <link>https://dev.to/ganesh_parella/building-flowforge-architecting-a-dag-based-workflow-engine-in-net-5ad4</link>
      <guid>https://dev.to/ganesh_parella/building-flowforge-architecting-a-dag-based-workflow-engine-in-net-5ad4</guid>
      <description>&lt;p&gt;I recently built &lt;strong&gt;FlowForge&lt;/strong&gt;, a visual workflow automation platform where users can connect multiple external applications and orchestrate complex flows simply by dragging and dropping nodes onto a canvas.&lt;/p&gt;

&lt;p&gt;At first glance, building a Zapier-like clone seemed straightforward: build a React frontend to draw the boxes, and a backend to execute them. But as I got deeper into the architecture, handling state, concurrency, and fault tolerance turned this into a massive distributed systems challenge.&lt;/p&gt;

&lt;p&gt;Without any filler, let's dive into the architecture of how I actually built a workflow automation platform from scratch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;System Requirements&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Functional Requirements:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Users can authenticate with multiple third-party applications (Google, Slack, etc.) via OAuth.&lt;/li&gt;
&lt;li&gt;Users can build and configure workflows using a drag-and-drop canvas.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Non-Functional Requirements:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pluggability: Adding a new integration must be as simple as adding a new file, without modifying the core engine.&lt;/li&gt;
&lt;li&gt;Fault Tolerance (Persistence): Even if the server crashes unexpectedly mid-execution, the workflow must be able to resume exactly where it left off.&lt;/li&gt;
&lt;li&gt;Concurrent Branching: The engine must support complex dependencies (e.g., Node B and Node C can run simultaneously, but both must wait for Node A to finish).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Core Entities&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Flow: The blueprint of the workflow.&lt;/li&gt;
&lt;li&gt;Connection: External OAuth credentials and integration metadata.&lt;/li&gt;
&lt;li&gt;FlowInstance: A single, unique execution run of a Flow.&lt;/li&gt;
&lt;li&gt;NodeInstance: The execution state of an individual step within that flow.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Tech Stack &amp;amp; High-Level Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9tsbrrs87g7qyt96g1pw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9tsbrrs87g7qyt96g1pw.png" alt="flow-forge"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Frontend: React.js&lt;/li&gt;
&lt;li&gt;Backend: ASP.NET Core Web API&lt;/li&gt;
&lt;li&gt;Database: MySQL&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Core Services:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Client:&lt;/strong&gt; The visual node-based editor for the user interface.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;API Gateway:&lt;/strong&gt; Handles routing and load distribution to the backend services.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Flow Service:&lt;/strong&gt; Manages CRUD operations for the workflows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Connection Service:&lt;/strong&gt; Securely handles external application connections and token management.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Flow Engine:&lt;/strong&gt; The core "brain" of the system that parses and executes the workflows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deep Dive 1: The Evolution of the Core Engine&lt;/strong&gt;&lt;br&gt;
Building the execution engine was the biggest architectural bottleneck. I actually had to rewrite this component three separate times to get it right.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;V1: Linear Execution (The Naive Approach)&lt;/strong&gt;&lt;br&gt;
Initially, I just converted the workflow into a List and executed the nodes in sequence. This failed immediately. Real workflows don't always run in a straight line, and sometimes a flow gets triggered by an event in the middle of a chain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;V2: Topological Sorting (Kahn's Algorithm)&lt;/strong&gt;&lt;br&gt;
To fix the branching issue, I modeled the workflow as a Directed Acyclic Graph (DAG). By using Kahn's Algorithm for topological sorting, I could preserve the strict execution order, ensure dependencies were met, and detect invalid cycles.&lt;/p&gt;
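&lt;p&gt;For reference, Kahn's algorithm in compact form (illustrative Python, not the actual DagConvertor): repeatedly consume nodes with no remaining dependencies; if any node is never consumed, the graph has a cycle.&lt;/p&gt;

```python
from collections import deque

def kahn_topological_sort(nodes, edges):
    """Return nodes in dependency order; raise if the graph has a cycle."""
    indegree = {n: 0 for n in nodes}
    children = {n: [] for n in nodes}
    for src, dst in edges:
        children[src].append(dst)
        indegree[dst] += 1
    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    if len(order) != len(nodes):
        raise ValueError("Cycle detected: workflow is not a valid DAG")
    return order
```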

&lt;p&gt;However, I hit a massive limitation: State Loss. Because Kahn's algorithm was running entirely in server memory, if the ASP.NET server restarted, the entire execution vanished like it never existed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;V3: The Persistent DAG Scheduler (The Final Form)&lt;/strong&gt;&lt;br&gt;
To achieve true fault tolerance, I moved the state machine out of memory and into the MySQL database. The engine now operates as a persistent scheduler. It continuously polls the database to find nodes where all parent dependencies are marked as Completed, marks them as Ready, and executes them.&lt;/p&gt;

&lt;p&gt;Here is the actual C# implementation of the persistent engine:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt; &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;FlowEngine&lt;/span&gt;
 &lt;span class="p"&gt;{&lt;/span&gt;
     &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;readonly&lt;/span&gt; &lt;span class="n"&gt;IServiceProvider&lt;/span&gt; &lt;span class="n"&gt;_serviceProvider&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
     &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;readonly&lt;/span&gt; &lt;span class="n"&gt;ILogger&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;FlowEngine&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;_logger&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
     &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;readonly&lt;/span&gt; &lt;span class="n"&gt;IEnumerable&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;INode&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;_availableNodes&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
     &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;readonly&lt;/span&gt; &lt;span class="n"&gt;DagConvertor&lt;/span&gt; &lt;span class="n"&gt;_dagConvertor&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
     &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;readonly&lt;/span&gt; &lt;span class="n"&gt;NodeExecutionRepository&lt;/span&gt; &lt;span class="n"&gt;_nodeExecutionRepository&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
     &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;readonly&lt;/span&gt; &lt;span class="n"&gt;IFlowInstanceRepository&lt;/span&gt; &lt;span class="n"&gt;_flowInstanceRepository&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

     &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="nf"&gt;FlowEngine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
         &lt;span class="n"&gt;IServiceProvider&lt;/span&gt; &lt;span class="n"&gt;serviceProvider&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="n"&gt;ILogger&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;FlowEngine&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="n"&gt;IEnumerable&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;INode&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;availableNodes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="n"&gt;DagConvertor&lt;/span&gt; &lt;span class="n"&gt;dagConvertor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="n"&gt;NodeExecutionRepository&lt;/span&gt; &lt;span class="n"&gt;nodeExecutionRepository&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="n"&gt;IFlowInstanceRepository&lt;/span&gt; &lt;span class="n"&gt;flowInstanceRepository&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
     &lt;span class="p"&gt;{&lt;/span&gt;
         &lt;span class="n"&gt;_serviceProvider&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;serviceProvider&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
         &lt;span class="n"&gt;_logger&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
         &lt;span class="n"&gt;_availableNodes&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;availableNodes&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
         &lt;span class="n"&gt;_dagConvertor&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dagConvertor&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
         &lt;span class="n"&gt;_nodeExecutionRepository&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nodeExecutionRepository&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
         &lt;span class="n"&gt;_flowInstanceRepository&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;flowInstanceRepository&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
     &lt;span class="p"&gt;}&lt;/span&gt;

     &lt;span class="c1"&gt;// Allows service to reuse DAG building logic&lt;/span&gt;
     &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="n"&gt;Dag&lt;/span&gt; &lt;span class="nf"&gt;BuildDag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ParsedFlow&lt;/span&gt; &lt;span class="n"&gt;parsedFlow&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
     &lt;span class="p"&gt;{&lt;/span&gt;
         &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;_dagConvertor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ParsedFlowToDag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parsedFlow&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
     &lt;span class="p"&gt;}&lt;/span&gt;

     &lt;span class="c1"&gt;// ============================================================&lt;/span&gt;
     &lt;span class="c1"&gt;// 🚀 Persistent DAG Scheduler&lt;/span&gt;
     &lt;span class="c1"&gt;// ============================================================&lt;/span&gt;

     &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="n"&gt;Task&lt;/span&gt; &lt;span class="nf"&gt;RunPersistentAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
         &lt;span class="n"&gt;Guid&lt;/span&gt; &lt;span class="n"&gt;flowInstanceId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="n"&gt;ParsedFlow&lt;/span&gt; &lt;span class="n"&gt;parsedFlow&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;clerkUserId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="n"&gt;Dictionary&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;object&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;?&lt;/span&gt; &lt;span class="n"&gt;initialPayload&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
     &lt;span class="p"&gt;{&lt;/span&gt;
         &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;initialPayload&lt;/span&gt; &lt;span class="p"&gt;??&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;Dictionary&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;object&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;();&lt;/span&gt;
         &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_dagConvertor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ParsedFlowToDag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parsedFlow&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

         &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
         &lt;span class="p"&gt;{&lt;/span&gt;
             &lt;span class="c1"&gt;// Fail fast&lt;/span&gt;
             &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;_nodeExecutionRepository&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AnyFailedAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;flowInstanceId&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
             &lt;span class="p"&gt;{&lt;/span&gt;
                 &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;_flowInstanceRepository&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;MarkFailedAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;flowInstanceId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                 &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
             &lt;span class="p"&gt;}&lt;/span&gt;

             &lt;span class="c1"&gt;// Complete if done&lt;/span&gt;
             &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;_nodeExecutionRepository&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AllCompletedAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;flowInstanceId&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
             &lt;span class="p"&gt;{&lt;/span&gt;
                 &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;_flowInstanceRepository&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;MarkCompletedAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;flowInstanceId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                 &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
             &lt;span class="p"&gt;}&lt;/span&gt;

             &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;nodeExecution&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt;
                 &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;_nodeExecutionRepository&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GetNextReadyNodeAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;flowInstanceId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

             &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nodeExecution&lt;/span&gt; &lt;span class="p"&gt;==&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
             &lt;span class="p"&gt;{&lt;/span&gt;
                 &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Delay&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;50&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                 &lt;span class="k"&gt;continue&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
             &lt;span class="p"&gt;}&lt;/span&gt;

             &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;_nodeExecutionRepository&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;MarkRunningAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nodeExecution&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

             &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;parsedNode&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt;
                 &lt;span class="n"&gt;parsedFlow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Nodes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;First&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Id&lt;/span&gt; &lt;span class="p"&gt;==&lt;/span&gt; &lt;span class="n"&gt;nodeExecution&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NodeId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

             &lt;span class="k"&gt;try&lt;/span&gt;
             &lt;span class="p"&gt;{&lt;/span&gt;
                 &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;ExecuteNodeAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                     &lt;span class="n"&gt;parsedNode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                     &lt;span class="n"&gt;parsedFlow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                     &lt;span class="n"&gt;clerkUserId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                     &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

                 &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;_nodeExecutionRepository&lt;/span&gt;
                     &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;MarkCompletedAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nodeExecution&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

                 &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="p"&gt;!=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                 &lt;span class="p"&gt;{&lt;/span&gt;
                     &lt;span class="k"&gt;foreach&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;kvp&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                         &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;kvp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kvp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Value&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                 &lt;span class="p"&gt;}&lt;/span&gt;

                 &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;UnlockChildrenAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                     &lt;span class="n"&gt;flowInstanceId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                     &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                     &lt;span class="n"&gt;parsedNode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
             &lt;span class="p"&gt;}&lt;/span&gt;
             &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Exception&lt;/span&gt; &lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
             &lt;span class="p"&gt;{&lt;/span&gt;
                 &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;_nodeExecutionRepository&lt;/span&gt;
                     &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;MarkFailedAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nodeExecution&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Message&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
             &lt;span class="p"&gt;}&lt;/span&gt;
         &lt;span class="p"&gt;}&lt;/span&gt;
     &lt;span class="p"&gt;}&lt;/span&gt;

     &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="n"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Dictionary&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;object&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;?&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;ExecuteNodeAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
         &lt;span class="n"&gt;ParsedNode&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;flowName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;clerkUserId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="n"&gt;Dictionary&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;object&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
     &lt;span class="p"&gt;{&lt;/span&gt;
         &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;executor&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt;
             &lt;span class="n"&gt;_availableNodes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;FirstOrDefault&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Type&lt;/span&gt; &lt;span class="p"&gt;==&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

         &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;executor&lt;/span&gt; &lt;span class="p"&gt;==&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
             &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;InvalidOperationException&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                 &lt;span class="s"&gt;$"Executor for node type '&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s"&gt;' not found."&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

         &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;FlowExecutionContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
             &lt;span class="n"&gt;clerkUserId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
             &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
             &lt;span class="nf"&gt;ConvertToJsonElement&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Data&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

         &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ExecuteAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_serviceProvider&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
     &lt;span class="p"&gt;}&lt;/span&gt;

     &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="n"&gt;Task&lt;/span&gt; &lt;span class="nf"&gt;UnlockChildrenAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
         &lt;span class="n"&gt;Guid&lt;/span&gt; &lt;span class="n"&gt;flowInstanceId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="n"&gt;Dag&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;completedNodeId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
     &lt;span class="p"&gt;{&lt;/span&gt;
         &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(!&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AdjList&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;TryGetValue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;completedNodeId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;out&lt;/span&gt; &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;children&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
             &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

         &lt;span class="k"&gt;foreach&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;child&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;children&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
         &lt;span class="p"&gt;{&lt;/span&gt;
             &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(!&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ReverseAdjList&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;TryGetValue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;child&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;out&lt;/span&gt; &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;parents&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
                 &lt;span class="k"&gt;continue&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

             &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;allParentsCompleted&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt;
                 &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;_nodeExecutionRepository&lt;/span&gt;
                     &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AreAllParentsCompletedAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;flowInstanceId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parents&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

             &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;allParentsCompleted&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
             &lt;span class="p"&gt;{&lt;/span&gt;
                 &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;_nodeExecutionRepository&lt;/span&gt;
                     &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;MarkReadyAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;flowInstanceId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;child&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
             &lt;span class="p"&gt;}&lt;/span&gt;
         &lt;span class="p"&gt;}&lt;/span&gt;
     &lt;span class="p"&gt;}&lt;/span&gt;

     &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="n"&gt;JsonElement&lt;/span&gt; &lt;span class="nf"&gt;ConvertToJsonElement&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;object&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
     &lt;span class="p"&gt;{&lt;/span&gt;
         &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt; &lt;span class="p"&gt;==&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
         &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;JsonSerializer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Serialize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
         &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;JsonSerializer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Deserialize&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;JsonElement&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
     &lt;span class="p"&gt;}&lt;/span&gt;
 &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Deep Dive 2: Building a Pluggable Architecture&lt;/strong&gt;&lt;br&gt;
A workflow engine is useless if it is hard to add new integrations. To make the system truly pluggable, I utilized the Factory Pattern and .NET Reflection.&lt;/p&gt;

&lt;p&gt;Instead of hardcoding a massive switch statement or storing the node executors in a static list, I made every node type (Action, Trigger, Conditional) inherit from an INode interface. At startup, the application scans the assembly to find all implementations and registers them dynamically in the Dependency Injection container.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt; &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ServiceCollectionExtenctions&lt;/span&gt;
 &lt;span class="p"&gt;{&lt;/span&gt;
     &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="n"&gt;IServiceCollection&lt;/span&gt; &lt;span class="nf"&gt;AddNodeServices&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt; &lt;span class="n"&gt;IServiceCollection&lt;/span&gt; &lt;span class="n"&gt;services&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
     &lt;span class="p"&gt;{&lt;/span&gt;
         &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;NodeType&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;typeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;INode&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
         &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;implementations&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AppDomain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CurrentDomain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GetAssemblies&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
             &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;SelectMany&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GetTypes&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
             &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;NodeType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;IsAssignableFrom&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;IsClass&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;IsAbstract&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
         &lt;span class="k"&gt;foreach&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;implementation&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;implementations&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
         &lt;span class="p"&gt;{&lt;/span&gt;
             &lt;span class="n"&gt;services&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddScoped&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;typeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;INode&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;implementation&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
         &lt;span class="p"&gt;}&lt;/span&gt;
         &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;services&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
     &lt;span class="p"&gt;}&lt;/span&gt;
 &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, adding a new Slack integration is as simple as creating a new class that implements INode. The engine handles the rest automatically.&lt;/p&gt;
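&lt;p&gt;The engine itself is C#, but purely as an illustration of the same discover-and-register pattern, here is a minimal Python sketch (the &lt;code&gt;Node&lt;/code&gt;, &lt;code&gt;SlackNode&lt;/code&gt;, and &lt;code&gt;EmailNode&lt;/code&gt; names are hypothetical):&lt;/p&gt;

```python
from abc import ABC, abstractmethod

class Node(ABC):
    """Python analogue of the engine's INode interface (names hypothetical)."""
    type = "base"

    @abstractmethod
    def execute(self, context):
        ...

class SlackNode(Node):
    type = "slack"
    def execute(self, context):
        return {"sent": True, "channel": context.get("channel")}

class EmailNode(Node):
    type = "email"
    def execute(self, context):
        return {"sent": True, "to": context.get("to")}

def discover_nodes():
    # Mirrors the reflection scan: find every concrete subclass of Node
    # and register one executor per node type.
    return {cls.type: cls() for cls in Node.__subclasses__()}

registry = discover_nodes()
```

&lt;p&gt;Dropping a new subclass into the codebase is all it takes; the registry picks it up on the next startup, which is exactly the property the C# assembly scan gives us.&lt;/p&gt;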

&lt;p&gt;&lt;strong&gt;Deep Dive 3: Ensuring Concurrent Execution&lt;/strong&gt;&lt;br&gt;
If you look at the engine logic, you will notice we retrieve the status of each node directly from the database.&lt;/p&gt;

&lt;p&gt;If a parent node completes its execution and unlocks two distinct child nodes, both of those children will independently transition to the Ready state. Because the engine processes ready nodes asynchronously, both child branches will execute concurrently without blocking each other.&lt;/p&gt;
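&lt;p&gt;That unlock step can be sketched as follows; this is a simplified, in-memory stand-in for the repository calls in &lt;code&gt;UnlockChildrenAsync&lt;/code&gt; above, with names chosen for illustration:&lt;/p&gt;

```python
def unlock_children(adj, reverse_adj, status, completed_node):
    """Mark each child of completed_node as 'ready' once every one of its
    parents is 'completed' (mirrors the engine's UnlockChildrenAsync)."""
    ready = []
    for child in adj.get(completed_node, []):
        parents = reverse_adj.get(child, [])
        if all(status.get(p) == "completed" for p in parents):
            status[child] = "ready"
            ready.append(child)
    return ready
```

&lt;p&gt;In a diamond DAG (A fans out to B and C, both feeding D), completing B alone leaves D locked; only when C also completes does D flip to ready, and both B and C branches run concurrently in the meantime.&lt;/p&gt;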

&lt;p&gt;&lt;strong&gt;Trade-Offs &amp;amp; Scaling Bottlenecks&lt;/strong&gt;&lt;br&gt;
Building this highlighted some clear scaling limits in my current architecture:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ThreadPool Exhaustion:&lt;/strong&gt; Right now, I am executing nodes asynchronously in memory, inside the web application's own process. If the platform scales to 100,000 concurrent users running workflows, we will exhaust the .NET thread pool. To solve this, I would need to decouple execution by introducing a Message Queue (like RabbitMQ) and dedicated background worker services.&lt;/p&gt;
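&lt;p&gt;A minimal sketch of that decoupling, using an in-process queue and worker threads as a stand-in for RabbitMQ consumers (purely illustrative, not the engine's actual code):&lt;/p&gt;

```python
import queue
import threading

jobs = queue.Queue()
results = []
results_lock = threading.Lock()

def worker():
    # A dedicated consumer drains the queue at its own pace instead of the
    # web request spawning one in-memory task per node execution.
    while True:
        node_id = jobs.get()
        if node_id is None:          # poison pill: shut this worker down
            jobs.task_done()
            return
        with results_lock:
            results.append(node_id)  # stand-in for executing the node
        jobs.task_done()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for i in range(100):                 # a burst of work lands in the queue
    jobs.put(i)
for _ in threads:
    jobs.put(None)
for t in threads:
    t.join()
```

&lt;p&gt;The key property: the burst never translates into 100 concurrent in-memory executions; the pool of workers sets the ceiling, and a real broker would additionally persist the backlog across restarts.&lt;/p&gt;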

&lt;p&gt;&lt;strong&gt;Database Write Limits:&lt;/strong&gt; Flow execution is incredibly write-heavy (updating status to Running, Completed, Failed every few milliseconds). MySQL is great, but at massive scale, it will hit write-throughput bottlenecks. Shifting the execution state storage to a Key-Value database (like DynamoDB or Redis) would be necessary for global scalability.&lt;/p&gt;
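&lt;p&gt;One way to relieve that write pressure is to buffer hot status updates in a key-value cache and flush them to the database in batches. A toy sketch of the idea (it only counts batched writes; a real write-behind layer also has to handle crash recovery):&lt;/p&gt;

```python
class WriteBehindStateStore:
    """Keep hot node-status writes in a key-value cache and flush them to
    the database in batches, cutting per-update DB round trips."""
    def __init__(self, flush_every=100):
        self.cache = {}
        self.dirty = set()
        self.flush_every = flush_every
        self.db_writes = 0

    def set_status(self, key, status):
        self.cache[key] = status
        self.dirty.add(key)
        if len(self.dirty) >= self.flush_every:
            self.flush()

    def flush(self):
        if self.dirty:
            self.db_writes += 1   # one batched write instead of many
            self.dirty.clear()
```

&lt;p&gt;A thousand per-node status transitions collapse into ten batched writes here; the trade-off is a small window of state that lives only in the cache, which is exactly why Redis persistence or DynamoDB becomes attractive at that tier.&lt;/p&gt;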

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
Building FlowForge forced me to move beyond basic CRUD applications and tackle real distributed system challenges. By leveraging topological sorting for execution order, a database-backed state machine for fault tolerance, and .NET reflection for pluggability, the engine is robust and highly extensible.&lt;/p&gt;

&lt;p&gt;You can explore the complete architecture and source code here: &lt;a href="https://github.com/Ganesh-parella/Flow-Forge" rel="noopener noreferrer"&gt;FlowForge on GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;What are your thoughts on using database polling for the DAG scheduler versus an event-driven message queue? Let me know in the comments!&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>automation</category>
      <category>dotnet</category>
      <category>showdev</category>
    </item>
    <item>
      <title>Designing Uber: Geospatial Indexing, WebSockets, and Distributed Locks</title>
      <dc:creator>Ganesh Parella</dc:creator>
      <pubDate>Fri, 13 Mar 2026 20:27:05 +0000</pubDate>
      <link>https://dev.to/ganesh_parella/designing-uber-geospatial-indexing-websockets-and-distributed-locks-4mhb</link>
      <guid>https://dev.to/ganesh_parella/designing-uber-geospatial-indexing-websockets-and-distributed-locks-4mhb</guid>
      <description>&lt;p&gt;Designing a platform like Uber might seem straightforward at first glance—just match a rider with a driver, right? But when you get into the details of real-time location tracking, geospatial querying, and concurrent bookings, it becomes an incredibly hard system to scale and maintain.&lt;/p&gt;

&lt;p&gt;Without any filler, let's dive into the architecture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;System Requirements&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Functional Requirements:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Users can input a source and destination to calculate a fare.&lt;/li&gt;
&lt;li&gt;Users can view nearby available drivers in real-time.&lt;/li&gt;
&lt;li&gt;Users can book a ride.&lt;/li&gt;
&lt;li&gt;Drivers can accept or reject ride requests.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Non-Functional Requirements:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Strict Consistency (for matching): Two drivers cannot accept the exact same ride.&lt;/li&gt;
&lt;li&gt;Low Latency: Ride matching must happen in &amp;lt; 1 minute.&lt;/li&gt;
&lt;li&gt;High Availability: Location tracking and routing must remain highly available.&lt;/li&gt;
&lt;li&gt;Scalability: Must support millions of concurrent users and high-frequency location updates.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Core Entities&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User &lt;/li&gt;
&lt;li&gt;Ride&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;High-Level Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F25sysi22l6i4kmi1ddaf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F25sysi22l6i4kmi1ddaf.png" alt="uber" width="800" height="523"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is a look at the core components of our system:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Load Balancer / API Gateway:&lt;/strong&gt; Distributes incoming traffic and routes requests to the appropriate backend microservices.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WebSocket Servers:&lt;/strong&gt; Plain request/response HTTP can't push data to clients, which rules it out for real-time tracking. We use WebSockets to maintain a persistent, bi-directional connection with the drivers so we can push ride requests to them instantly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Matching Service:&lt;/strong&gt; The core engine that runs the matching algorithm to pair a rider with the optimal nearby driver.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;External Map Provider (Google Maps/Mapbox API):&lt;/strong&gt; Used to calculate the optimal route, estimated time of arrival (ETA), and the trip fare.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Spatial Database (Redis / QuadTree):&lt;/strong&gt; A specialized data store designed to hold the real-time geographical coordinates (Latitude/Longitude) of all active drivers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Database (NoSQL):&lt;/strong&gt; We use a highly scalable Key-Value database (like DynamoDB) to store driver statuses and trip metadata, as this system is extremely write-heavy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deep Dive: The Core Engineering Challenges&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;1. Tracking Location: QuadTrees vs. Geohashing&lt;/strong&gt;&lt;br&gt;
To show users the cars moving on their screen, drivers ping our servers with their GPS coordinates every 4 seconds. A traditional SQL database cannot absorb that write volume while also answering radius queries quickly, so we need a Spatial Index.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Geohashing (Redis GEO):&lt;/strong&gt; This divides the map into a fixed grid of varying resolutions. It is incredibly fast for querying "find all drivers within a 3km radius" and is a standard choice for high-throughput, real-time location caching.&lt;/p&gt;
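&lt;p&gt;Conceptually, a radius query is just "distance from me to every driver, keep the close ones"; geohashing makes it fast by pruning the scan to nearby grid cells. A naive linear-scan sketch using the haversine distance (coordinates and driver IDs are illustrative):&lt;/p&gt;

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two GPS points, in kilometres.
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def drivers_within(drivers, lat, lon, radius_km):
    """Naive O(n) scan; Redis GEOSEARCH answers the same question against a
    geohash-indexed sorted set, touching only nearby grid cells."""
    return [d for d, (dlat, dlon) in drivers.items()
            if haversine_km(lat, lon, dlat, dlon) <= radius_km]
```

&lt;p&gt;The pure scan is fine for a few thousand drivers in one city; the geohash index is what keeps it fast when the fleet is in the millions.&lt;/p&gt;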

&lt;p&gt;&lt;strong&gt;QuadTrees:&lt;/strong&gt; An alternative tree data structure that dynamically subdivides map regions. It is excellent for unevenly distributed data (e.g., millions of drivers in a dense city center, but very few in a rural area).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Concurrency &amp;amp; Idempotency: The "Double Booking" Problem&lt;/strong&gt;&lt;br&gt;
What happens if the Matching Service sends a ride request to three nearby drivers, and two drivers hit "Accept" at the exact same millisecond?&lt;/p&gt;

&lt;p&gt;To prevent double-booking, we must ensure our system is strictly consistent and idempotent. We achieve this using Optimistic Concurrency Control or a Distributed Lock (like Redis Redlock) on the Database. When a driver accepts the ride, the database checks a version number or a lock status. The first request successfully updates the ride status to "Accepted" and assigns the driver ID. The second request is rejected, ensuring only one driver gets the trip.&lt;/p&gt;
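&lt;p&gt;A compare-and-set sketch of that accept path; here a threading lock stands in for the database's atomic conditional update, and the ride/driver IDs are illustrative:&lt;/p&gt;

```python
import threading

class RideStore:
    """Optimistic-concurrency sketch: an accept succeeds only if the ride's
    version is unchanged since the driver read it (compare-and-set)."""
    def __init__(self):
        self._lock = threading.Lock()   # stands in for the DB's atomic update
        self.rides = {"ride-1": {"status": "requested", "driver_id": None, "version": 0}}

    def accept(self, ride_id, driver_id, expected_version):
        with self._lock:
            ride = self.rides[ride_id]
            if ride["version"] != expected_version or ride["status"] != "requested":
                return False            # someone else won the race
            ride.update(status="accepted", driver_id=driver_id,
                        version=expected_version + 1)
            return True

store = RideStore()
first = store.accept("ride-1", "driver-A", 0)
second = store.accept("ride-1", "driver-B", 0)   # stale version: rejected
```

&lt;p&gt;Both drivers read version 0, but only the first conditional update commits; the loser gets a clean rejection it can surface as "ride no longer available".&lt;/p&gt;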

&lt;p&gt;&lt;strong&gt;3. Handling Massive Traffic Spikes&lt;/strong&gt;&lt;br&gt;
What if 50,000 people request a ride at the exact same moment after a major sports game ends? If these requests hit our Matching Server directly, it will crash.&lt;/p&gt;

&lt;p&gt;To make our system resilient to traffic spikes, we place a Message Queue (like Kafka) directly behind the API Gateway. When a user requests a ride, the request is instantly dropped into the queue. The user gets a "Finding your ride..." screen. Our Matching Servers then consume these messages at their maximum safe capacity, ensuring the system never gets overwhelmed and no ride requests are lost.&lt;/p&gt;
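&lt;p&gt;The buffering effect can be sketched as a burst that lands in a queue instantly and is drained at a fixed consumption rate (the numbers are illustrative):&lt;/p&gt;

```python
from collections import deque

def handle_spike(requests, max_per_tick):
    """Buffer a burst in a queue and let the matchers drain it at a safe,
    fixed rate instead of absorbing the whole spike at once."""
    q = deque(requests)          # the API gateway enqueues instantly
    ticks = 0
    matched = []
    while q:
        batch = [q.popleft() for _ in range(min(max_per_tick, len(q)))]
        matched.extend(batch)    # matching service consumes at capacity
        ticks += 1
    return matched, ticks
```

&lt;p&gt;A 50,000-request spike with a 5,000-requests-per-tick matcher takes 10 ticks to clear; every rider waits a little longer, but nothing crashes and nothing is lost.&lt;/p&gt;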

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
Ride-sharing architectures are a beautiful blend of heavy read-write throughput, complex geospatial mathematics, and strict transactional consistency. By leveraging WebSockets for real-time communication, spatial caching for location tracking, and Message Queues for peak load management, we can build a highly scalable platform.&lt;/p&gt;

&lt;p&gt;Would you prefer using Redis Geohashing or building a custom QuadTree service for location tracking? Let me know your thoughts in the comments!&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>backend</category>
      <category>distributedsystems</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>How to Design YouTube: CDNs, Transcoding, and the Hot Video Problem</title>
      <dc:creator>Ganesh Parella</dc:creator>
      <pubDate>Tue, 10 Mar 2026 12:16:07 +0000</pubDate>
      <link>https://dev.to/ganesh_parella/how-to-design-youtube-cdns-transcoding-and-the-hot-video-problem-cm0</link>
      <guid>https://dev.to/ganesh_parella/how-to-design-youtube-cdns-transcoding-and-the-hot-video-problem-cm0</guid>
      <description>&lt;p&gt;If you read my previous post about designing a News Feed system, you might be wondering: what makes a video streaming platform any different? While a news feed handles text and small image payloads, streaming 4K video globally is an entirely different beast.&lt;/p&gt;

&lt;p&gt;Without any delay, let's break down the system architecture in a structured manner.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;System Requirements&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Functional Requirements:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Users can upload and post videos.&lt;/li&gt;
&lt;li&gt;Users can watch/stream videos smoothly.&lt;/li&gt;
&lt;li&gt;Users can like and comment on videos.&lt;/li&gt;
&lt;li&gt;Users can subscribe to other creators.&lt;/li&gt;
&lt;li&gt;Videos must be available in multiple qualities (240p, 480p, 720p, 1080p, 4K) depending on the user's internet speed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Non-Functional Requirements:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Low Latency: Video playback should start in &amp;lt; 2 seconds.&lt;/li&gt;
&lt;li&gt;Highly Scalable: Must support up to a billion users.&lt;/li&gt;
&lt;li&gt;High Availability: The system must remain accessible (favoring availability over strict consistency).&lt;/li&gt;
&lt;li&gt;Eventual Consistency: It is perfectly fine if a user's subscriber count takes a few seconds to update globally.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Core Entities&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User&lt;/li&gt;
&lt;li&gt;Video&lt;/li&gt;
&lt;li&gt;Like&lt;/li&gt;
&lt;li&gt;Comment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Core API Endpoints&lt;/strong&gt;&lt;br&gt;
Keeping it RESTful, our core endpoints would look something like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;POST /v1/videos (Request upload URL and submit metadata)&lt;/li&gt;
&lt;li&gt;GET /v1/videos/{video_id} (Fetch video stream and metadata)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;High-Level Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuxjfcgq8maxfkdzmfb4t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuxjfcgq8maxfkdzmfb4t.png" alt="Video Streaming" width="800" height="462"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The diagram above illustrates the high-level architecture of our system:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Load Balancer / API Gateway:&lt;/strong&gt; Distributes incoming traffic evenly across our stateless backend servers to prevent any single point of failure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Blob Storage (Amazon S3):&lt;/strong&gt; We cannot store massive 10GB+ video files in a traditional SQL or Key-Value database. Instead, the actual video files are stored in object storage like S3.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DynamoDB (Metadata Store):&lt;/strong&gt; Since we only need to store the metadata of the video (Title, Uploader ID, S3 URL, Likes) and don't require strict ACID properties, a highly scalable Key-Value database like DynamoDB is the perfect fit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Transcoding Pipeline (Chunkers):&lt;/strong&gt; To stream video seamlessly, we don't just send one massive file. We pass the uploaded video to a background service that chunks it into 3-second segments and transcodes it into different resolutions (1080p, 720p, 240p).&lt;/p&gt;
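&lt;p&gt;A toy sketch of that fan-out: compute the 3-second segment boundaries for each target resolution. (Real pipelines hand the actual encoding to a tool like ffmpeg; this only models the segment plan.)&lt;/p&gt;

```python
def build_renditions(duration_s, segment_s=3, resolutions=("1080p", "720p", "240p")):
    """Plan the transcoding fan-out: one list of fixed-length (start, end)
    segments per target resolution."""
    starts = range(0, duration_s, segment_s)
    return {res: [(t, min(t + segment_s, duration_s)) for t in starts]
            for res in resolutions}
```

&lt;p&gt;Each (resolution, segment) pair becomes an independent unit of work, which is what lets a fleet of transcoding workers chew through a single upload in parallel.&lt;/p&gt;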

&lt;p&gt;&lt;strong&gt;Step-by-Step Data Flow&lt;/strong&gt;&lt;br&gt;
To really understand this architecture, let's walk through the exact lifecycle of the two most important actions in our system: uploading a video and watching a video.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The Write Path (Uploading a Video)&lt;/strong&gt;&lt;br&gt;
When a creator uploads a new video, here is exactly what happens behind the scenes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Request Permission:&lt;/strong&gt; The client app sends a request to our API Gateway to upload a video.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pre-signed URL Issued:&lt;/strong&gt; The API Server responds with a secure, temporary Pre-signed S3 URL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Direct Upload:&lt;/strong&gt; The client bypasses our servers and uploads the massive video file directly into our "Raw Videos" S3 bucket.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Event Triggered:&lt;/strong&gt; Once S3 finishes receiving the file, it fires an event directly into our Message Queue (e.g., Kafka or RabbitMQ).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transcoding Pipeline:&lt;/strong&gt; Our background Transcoding Workers pick up the event, pull the raw video from S3, and convert it into various resolutions (1080p, 720p, etc.) and chunks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Final Storage &amp;amp; DB Update:&lt;/strong&gt; The workers save the processed chunks into a "Transcoded Videos" S3 bucket and update DynamoDB with the final metadata (URLs, formats available, uploader ID).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. The Read Path (Streaming a Video)&lt;/strong&gt;&lt;br&gt;
When a user clicks on a thumbnail to watch a video, speed is everything:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fetch Metadata:&lt;/strong&gt; The client requests the video details from the API Server.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache Check:&lt;/strong&gt; The server checks Redis. If the video is popular, the metadata (title, S3/CDN URLs) is instantly returned. If it’s a cache miss, it fetches it from DynamoDB, updates Redis, and returns it to the client.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stream Request:&lt;/strong&gt; The client's video player uses the returned URL to request the actual video chunks from the closest CDN edge server.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Video Delivery:&lt;/strong&gt; If the CDN has the chunks (Cache Hit), the video plays instantly. If not (Cache Miss), the CDN fetches the chunks from our Transcoded S3 bucket, caches them locally for the next user, and streams them to the client.&lt;/li&gt;
&lt;/ul&gt;
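&lt;p&gt;The cache check in step 2 is the classic cache-aside pattern; here is a minimal sketch with a plain dict standing in for both Redis and DynamoDB:&lt;/p&gt;

```python
class MetadataCache:
    """Cache-aside read path: check the cache first, fall back to the
    database on a miss, then populate the cache for the next reader."""
    def __init__(self, db):
        self.db = db                      # stand-in for DynamoDB
        self.cache = {}                   # stand-in for Redis
        self.hits = self.misses = 0

    def get_video(self, video_id):
        if video_id in self.cache:
            self.hits += 1
            return self.cache[video_id]
        self.misses += 1
        meta = self.db[video_id]          # e.g. a DynamoDB GetItem
        self.cache[video_id] = meta
        return meta
```

&lt;p&gt;Only the first reader of a video pays the database round trip; every subsequent request for a popular video is served from memory, which is the whole defence against the hot-video stampede discussed below.&lt;/p&gt;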

&lt;p&gt;&lt;strong&gt;The Hard Parts: Trade-Offs &amp;amp; Bottlenecks&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;1. The Upload Bottleneck (Bypassing the API)&lt;/strong&gt;&lt;br&gt;
Many of you might be wondering: Why are we storing the video directly in S3 instead of sending it through our API Gateway?&lt;/p&gt;

&lt;p&gt;If millions of users try to upload 10GB video files directly through our backend API servers, the network I/O will immediately crash our system. Instead, we use Pre-signed URLs. The client asks our API for permission, the API grants a secure, temporary S3 URL, and the client uploads the heavy video chunks directly to S3, bypassing our servers entirely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Achieving &amp;lt; 2s Latency (The Power of the CDN)&lt;/strong&gt;&lt;br&gt;
You might instantly think that using a Redis cache is the perfect way to decrease video load times. But caching a 4K video in Redis isn't practical.&lt;/p&gt;

&lt;p&gt;To achieve zero-buffering streaming globally, we use a CDN (Content Delivery Network). The transcoded video chunks are copied to edge servers all around the world. If a user in India watches a video uploaded in the US, the CDN serves the video from a server right down the street from them, effectively eliminating latency.&lt;/p&gt;

&lt;p&gt;Coupled with Adaptive Bitrate Streaming (ABR), the video player automatically switches between quality chunks (e.g., dropping from 1080p to 480p) if the user's internet speed drops, ensuring the video never stops to buffer.&lt;/p&gt;
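&lt;p&gt;The player-side choice boils down to "highest rendition that fits within a safety fraction of measured bandwidth". A sketch (the bitrate table is illustrative, not YouTube's actual ladder):&lt;/p&gt;

```python
RENDITIONS = [("240p", 0.4), ("480p", 1.5), ("720p", 3.0), ("1080p", 6.0)]  # Mbps

def pick_rendition(measured_mbps, headroom=0.8):
    """Adaptive bitrate sketch: choose the best rendition whose bitrate fits
    within a safety fraction of the measured bandwidth."""
    budget = measured_mbps * headroom
    best = RENDITIONS[0][0]               # never go below the lowest quality
    for name, mbps in RENDITIONS:
        if mbps <= budget:
            best = name
    return best
```

&lt;p&gt;Because the video is already stored as independent per-quality chunks, the player can re-run this decision at every 3-second segment boundary and switch ladders mid-stream without a rebuffer.&lt;/p&gt;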

&lt;p&gt;&lt;strong&gt;3. Scaling to a Billion Users&lt;/strong&gt;&lt;br&gt;
Scaling this system horizontally is remarkably straightforward:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Amazon S3 provides virtually infinite storage capacity.&lt;/li&gt;
&lt;li&gt;Our backend API servers are stateless, meaning we can simply spin up more instances behind the Load Balancer as traffic increases.&lt;/li&gt;
&lt;li&gt;DynamoDB partitions data automatically, though we could implement consistent hashing if we needed to scale a custom database cluster.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. The "Hot Video" Problem&lt;/strong&gt;&lt;br&gt;
Imagine a massive creator uploads a video and a million users try to access it within 2 seconds.&lt;/p&gt;

&lt;p&gt;Our CDN will easily handle the load of serving the actual video file. But what about our DynamoDB instance? A million simultaneous reads for the video's metadata (Title, View Count, Likes) will cause database throttling. To solve this, we introduce Redis. We cache the metadata of highly popular videos in Redis with multiple read replicas, completely shielding our main database from the viral traffic spike.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. The "Long-Tail" Problem: CDN Cost vs. Performance&lt;/strong&gt;&lt;br&gt;
We established that pushing videos to a CDN provides a latency-free experience. But CDNs are incredibly expensive.&lt;/p&gt;

&lt;p&gt;YouTube has billions of videos, but 80% of the daily traffic comes from only 20% of the videos (the viral hits and new releases). The remaining 80% are "long-tail" videos—perhaps a tutorial uploaded 5 years ago that gets 2 views a month.&lt;/p&gt;

&lt;p&gt;The Trade-Off: Should we cache every single video in our CDN? No. Pushing dead, unwatched videos to expensive edge servers worldwide would bankrupt the company. Instead, we use an intelligent eviction policy. We aggressively cache the "hot" 20% of videos in the CDN. For the "long-tail" videos, we accept a slightly higher latency and stream them directly from our S3 storage, saving millions of dollars in infrastructure costs.&lt;/p&gt;
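&lt;p&gt;That eviction policy is essentially LRU at the edge. A minimal sketch where cold long-tail requests fall through to origin (capacity and video IDs are illustrative):&lt;/p&gt;

```python
from collections import OrderedDict

class EdgeCache:
    """LRU eviction sketch for a CDN edge server: keep only recently
    watched videos; cold long-tail requests fall through to origin (S3)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()
        self.origin_fetches = 0

    def get(self, video_id):
        if video_id in self.store:
            self.store.move_to_end(video_id)   # refresh recency
            return "edge-hit"
        self.origin_fetches += 1               # miss: pull from S3 origin
        self.store[video_id] = True
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)     # evict least recently used
        return "origin-fetch"
```

&lt;p&gt;Hot videos keep refreshing their recency and stay pinned at the edge; a 2-views-a-month tutorial gets evicted, pays the origin latency on its rare views, and costs nothing in between.&lt;/p&gt;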

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
Designing a video streaming platform is a masterclass in decoupling and asynchronous processing. By keeping heavy media out of our API servers, utilizing a background event-driven transcoding pipeline, and intelligently routing traffic between CDNs for hot videos and S3 for long-tail content, we can build a resilient system capable of entertaining a billion users without a single moment of buffering.&lt;/p&gt;

&lt;p&gt;If you were building this, what message queue would you choose for the transcoding pipeline? RabbitMQ, Kafka, or AWS SQS? Let me know your thoughts down in the comments!&lt;/p&gt;

</description>
      <category>distributedsystems</category>
      <category>systemdesign</category>
      <category>backend</category>
      <category>softwaredevelopment</category>
    </item>
    <item>
      <title>How to Design a News Feed: Caching, Queues, and Millions of Users</title>
      <dc:creator>Ganesh Parella</dc:creator>
      <pubDate>Sun, 08 Mar 2026 09:51:32 +0000</pubDate>
      <link>https://dev.to/ganesh_parella/how-to-design-a-news-feed-caching-queues-and-millions-of-users-gcl</link>
      <guid>https://dev.to/ganesh_parella/how-to-design-a-news-feed-caching-queues-and-millions-of-users-gcl</guid>
      <description>&lt;p&gt;Designing a News Feed system might seem a bit less complex at first glance compared to the Real-Time Chat application we designed before (if you haven't read that one yet, check it out on my profile!). But when you scale a feed to millions of users, things get incredibly interesting.&lt;/p&gt;

&lt;p&gt;Let's break down how platforms like Instagram or Twitter handle massive traffic without breaking a sweat.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;System Requirements&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Functional Requirements:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Users can publish posts.&lt;/li&gt;
&lt;li&gt;Users can view a news feed of posts from people they follow.&lt;/li&gt;
&lt;li&gt;Users can follow/unfollow others.&lt;/li&gt;
&lt;li&gt;Users can like and comment on posts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Non-Functional Requirements:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Highly Scalable: Must handle millions of active users.&lt;/li&gt;
&lt;li&gt;Low Latency: Feed generation should feel instant (&amp;lt; 100ms).&lt;/li&gt;
&lt;li&gt;High Availability: The system must stay up (we will favor availability over strict consistency).&lt;/li&gt;
&lt;li&gt;Eventual Consistency: It’s perfectly fine if a user sees a post a few seconds after it’s published.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Core Entities&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User: User profile and follower metadata.&lt;/li&gt;
&lt;li&gt;Post: The actual content.&lt;/li&gt;
&lt;li&gt;Like &amp;amp; Comment: Engagements tied to posts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;High-Level Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Funuj053ta3o1eujmgxxu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Funuj053ta3o1eujmgxxu.png" alt=" " width="800" height="363"&gt;&lt;/a&gt;&lt;br&gt;
Here is a quick look at the core components of our system:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Load Balancer:&lt;/strong&gt; Distributes incoming traffic across our API servers to prevent bottlenecks and ensure horizontal scalability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Redis (Cache):&lt;/strong&gt; The secret weapon for our &amp;lt; 100ms latency goal. It stores pre-computed user feeds so we don't have to hit the database every time a user opens the app.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Object Storage (S3):&lt;/strong&gt; Used to store heavy media files like images and videos, keeping our core databases lightweight.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Database (NoSQL):&lt;/strong&gt; Stores the metadata of the posts (text content, S3 URLs, timestamps) and user relationship graphs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Hard Parts: Solving the Core Challenges&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;1. The "Hot User" Problem:&lt;/strong&gt; Fan-out on Write vs. Fan-out on Read&lt;br&gt;
Getting a user's feed by querying the database on every single refresh is a recipe for high latency and system failure. We need to cache the feeds in Redis. But how do we get the data into the cache efficiently?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fan-out on Write (Push Model):&lt;/strong&gt; &lt;br&gt;
For a normal user with a few hundred followers, when they post, we immediately "push" that post into the Redis cache of all their followers. This is perfect because the number of writes is small, and it makes reading the feed instant.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fan-out on Read (Pull Model):&lt;/strong&gt;&lt;br&gt;
But what if an influencer with 50 million followers posts a picture? Pushing to 50 million caches instantly would absolutely crush our servers. To solve this, we use a hybrid approach. For massive influencers, we do not push the data. Instead, when a follower opens their app, the system dynamically pulls the influencer's recent posts and merges them into the user's feed at read-time.&lt;/p&gt;
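
&lt;p&gt;Here is a rough Python sketch of the hybrid fan-out. The follower limit is an assumed number, and plain dicts stand in for the Redis feed cache:&lt;/p&gt;

```python
FANOUT_FOLLOWER_LIMIT = 10_000   # assumption: above this, an author is an "influencer"

feeds = {}            # follower_id mapped to cached feed (stands in for Redis)
celebrity_posts = {}  # influencer_id mapped to recent posts, merged at read time

def publish_post(author_id, post_id, followers):
    if len(followers) > FANOUT_FOLLOWER_LIMIT:
        # Fan-out on read: record once; followers pull it when they open the app.
        celebrity_posts.setdefault(author_id, []).insert(0, post_id)
    else:
        # Fan-out on write: push the post into every follower's cached feed.
        for follower in followers:
            feeds.setdefault(follower, []).insert(0, post_id)

def read_feed(user_id, followed_influencers):
    merged = list(feeds.get(user_id, []))
    for influencer in followed_influencers:
        merged.extend(celebrity_posts.get(influencer, []))
    return merged
```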

&lt;p&gt;&lt;strong&gt;2. Handling Server Crashes &amp;amp; Heavy Uploads&lt;/strong&gt;&lt;br&gt;
What if a user uploads a massive video, and the server crashes before it reaches the database? Or what if the database goes down momentarily?&lt;/p&gt;

&lt;p&gt;To absorb these failures and ensure reliability, we introduce a Message Queue (like Kafka or RabbitMQ). When a user hits "post," the API drops the payload into the queue and instantly tells the user "Success!" Background workers then safely consume this queue at their own pace, saving the heavy media to S3 and the metadata to the DB. This decouples our architecture and makes data loss dramatically less likely.&lt;/p&gt;
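
&lt;p&gt;A minimal sketch of this decoupling, using Python's in-process queue as a stand-in for Kafka or RabbitMQ:&lt;/p&gt;

```python
import queue

post_queue = queue.Queue()   # stands in for Kafka / RabbitMQ
saved = []                   # stands in for the S3 + DB writes

def handle_post_request(payload):
    post_queue.put(payload)          # hand the payload off to the queue...
    return {"status": "Success!"}    # ...and acknowledge the user immediately

def worker():
    # In production this loop runs in background worker processes.
    while True:
        payload = post_queue.get()
        if payload is None:          # shutdown sentinel
            break
        saved.append(payload)        # the slow S3 / DB work happens here
        post_queue.task_done()
```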

&lt;p&gt;&lt;strong&gt;3. Preserving Computational Power: Pagination&lt;/strong&gt;&lt;br&gt;
We absolutely do not want to load a user's entire history of posts at once. It wastes computational power, eats up bandwidth, and spikes latency.&lt;/p&gt;

&lt;p&gt;To keep the system affordable and lightning-fast, we implement Pagination. By loading only 20 posts per page, we drastically reduce the load on our backend. For a news feed, using Cursor-based Pagination (using a timestamp or unique ID as a cursor rather than an offset) is the best approach, as it prevents duplicate posts from showing up if new items are added while the user is scrolling.&lt;/p&gt;
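
&lt;p&gt;A simplified sketch of cursor-based pagination, assuming a newest-first post list and the post id as the cursor:&lt;/p&gt;

```python
PAGE_SIZE = 20

def feed_page(posts, cursor=None):
    """`posts` is newest-first; `cursor` is the id of the last post the
    client already has (None for the first page)."""
    if cursor is not None:
        ids = [p["id"] for p in posts]
        start = ids.index(cursor) + 1   # resume right after the cursor post
    else:
        start = 0
    page = posts[start:start + PAGE_SIZE]
    next_cursor = page[-1]["id"] if page else None
    return page, next_cursor
```

&lt;p&gt;Because the cursor is a post id rather than a numeric offset, new posts arriving at the top of the feed don't shift the next page, so the user never sees duplicates while scrolling.&lt;/p&gt;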

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
Designing a news feed is all about managing trade-offs. By decoupling our storage, smartly handling influencers with a hybrid fan-out model, and relying heavily on caching and message queues, we can build a robust system that feels instantaneous for the end user.&lt;/p&gt;

&lt;p&gt;What are your thoughts on this architecture? Would you use a different approach for handling influencer posts? Let me know in the comments!&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>distributedsystems</category>
      <category>performance</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>How to design a Real-Time Chat Application</title>
      <dc:creator>Ganesh Parella</dc:creator>
      <pubDate>Wed, 04 Mar 2026 10:06:51 +0000</pubDate>
      <link>https://dev.to/ganesh_parella/how-to-design-a-real-time-chat-application-5go6</link>
      <guid>https://dev.to/ganesh_parella/how-to-design-a-real-time-chat-application-5go6</guid>
      <description>&lt;p&gt;&lt;strong&gt;Why Designing a Real-Time Chat Application Is Hard&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Designing a real-time chat application is significantly more complex than building systems like a URL shortener or a notification service.&lt;/p&gt;

&lt;p&gt;The main reasons are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-time bidirectional communication&lt;/li&gt;
&lt;li&gt;Handling millions of concurrent connections&lt;/li&gt;
&lt;li&gt;Ensuring low latency&lt;/li&gt;
&lt;li&gt;Managing message persistence and offline delivery&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Unlike simple request-response systems, chat applications require persistent connections and instant delivery at scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Functional Requirements&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1-to-1 messaging&lt;/li&gt;
&lt;li&gt;Group messaging&lt;/li&gt;
&lt;li&gt;Message persistence&lt;/li&gt;
&lt;li&gt;Offline message delivery (messages should be delivered when a user comes online)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Non-Functional Requirements&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scalable to millions of users&lt;/li&gt;
&lt;li&gt;Low latency (&amp;lt; 500 ms)&lt;/li&gt;
&lt;li&gt;Fault tolerant&lt;/li&gt;
&lt;li&gt;Highly available&lt;/li&gt;
&lt;li&gt;Durable storage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Choosing the Correct Communication Protocol&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Since our latency requirement is less than 500 ms, traditional short polling or long polling are not ideal because they introduce unnecessary delays and overhead.&lt;/p&gt;

&lt;p&gt;Server-Sent Events (SSE) are also not suitable because they support only one-way communication (server → client), whereas a chat system requires two-way communication.&lt;/p&gt;

&lt;p&gt;Therefore, we use WebSockets, which provide:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Persistent connections&lt;/li&gt;
&lt;li&gt;Bidirectional communication&lt;/li&gt;
&lt;li&gt;Low latency&lt;/li&gt;
&lt;li&gt;Reduced network overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Modern messaging platforms like WhatsApp use persistent connections to achieve real-time communication.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;High-Level Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5v0x8r6fld1o80bkjnay.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5v0x8r6fld1o80bkjnay.png" alt="chat System" width="800" height="308"&gt;&lt;/a&gt;&lt;br&gt;
Our system consists of the following components:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Client&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The client maintains a WebSocket connection with the server to send and receive messages.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Load Balancer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The load balancer distributes incoming WebSocket connections across multiple chat servers to ensure scalability and high availability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Chat Servers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Chat servers handle the core business logic:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Manage WebSocket connections&lt;/li&gt;
&lt;li&gt;Validate messages&lt;/li&gt;
&lt;li&gt;Store messages in the database&lt;/li&gt;
&lt;li&gt;Deliver messages to recipients&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. Redis&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Since the load balancer does not know which user is connected to which chat server, we store connection mappings in Redis.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;userId → serverId / connectionId&lt;/p&gt;

&lt;p&gt;This allows any server to determine whether a user is online and where to route the message.&lt;/p&gt;
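
&lt;p&gt;The connection registry can be sketched with a plain dict standing in for Redis:&lt;/p&gt;

```python
connections = {}   # stands in for Redis: userId mapped to (serverId, connectionId)

def register(user_id, server_id, connection_id):
    # Called when a WebSocket connection is established.
    connections[user_id] = (server_id, connection_id)

def unregister(user_id):
    # Called when the user disconnects.
    connections.pop(user_id, None)

def route(user_id):
    """Return where to deliver a message, or None if the user is offline."""
    return connections.get(user_id)
```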

&lt;p&gt;&lt;strong&gt;Database&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We use a scalable NoSQL database such as Amazon DynamoDB or any key-value store because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We require high write throughput&lt;/li&gt;
&lt;li&gt;We do not need strict ACID guarantees&lt;/li&gt;
&lt;li&gt;Horizontal scaling is easier&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;1-to-1 Message Flow&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The sender sends a message via WebSocket.&lt;/li&gt;
&lt;li&gt;The chat server validates and stores the message in the database (for persistence).&lt;/li&gt;
&lt;li&gt;The server checks Redis to determine whether the recipient is online.&lt;/li&gt;
&lt;li&gt;If the recipient is online:
The message is delivered immediately via WebSocket.&lt;/li&gt;
&lt;li&gt;If the recipient is offline:
The message remains stored in the database.&lt;/li&gt;
&lt;li&gt;It will be delivered when the user reconnects.&lt;/li&gt;
&lt;/ul&gt;
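
&lt;p&gt;The flow above, condensed into a sketch (lists and dicts stand in for the database and the WebSocket layer):&lt;/p&gt;

```python
db = []       # durable message store (stands in for the NoSQL database)
online = {}   # userId mapped to a delivery callback (stands in for a WebSocket)

def send_message(sender, recipient, text):
    msg = {"from": sender, "to": recipient, "text": text, "delivered": False}
    db.append(msg)                   # 1. persist first, so nothing is lost
    deliver = online.get(recipient)  # 2. check the connection registry
    if deliver:
        deliver(msg)                 # 3a. online: push over the socket now
        msg["delivered"] = True
    # 3b. offline: the message stays in the DB until the user reconnects

def on_reconnect(user_id, deliver):
    online[user_id] = deliver
    for msg in db:                   # flush any pending messages
        if msg["to"] == user_id and not msg["delivered"]:
            deliver(msg)
            msg["delivered"] = True
```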

&lt;p&gt;&lt;strong&gt;Group Chat Message Flow&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A user sends a message to a group.&lt;/li&gt;
&lt;li&gt;The message is stored in the database with the group ID.&lt;/li&gt;
&lt;li&gt;The server retrieves the list of group members.&lt;/li&gt;
&lt;li&gt;For each member:
Check Redis for their connection.&lt;/li&gt;
&lt;li&gt;If online → deliver via WebSocket.&lt;/li&gt;
&lt;li&gt;If offline → deliver when they reconnect.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Challenges&lt;/strong&gt;&lt;br&gt;
Designing the architecture is only the beginning. The real complexity lies in handling the following challenges at scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scaling Millions of WebSocket Connections&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Each active user maintains a persistent WebSocket connection with the server.&lt;/p&gt;

&lt;p&gt;Problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each connection consumes memory.&lt;/li&gt;
&lt;li&gt;A single server can handle only a limited number of concurrent connections.&lt;/li&gt;
&lt;li&gt;Sudden traffic spikes (e.g., during peak hours) can overwhelm servers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Solution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use horizontal scaling (multiple chat servers).&lt;/li&gt;
&lt;li&gt;Keep servers stateless.&lt;/li&gt;
&lt;li&gt;Store connection metadata in a centralized store like Redis.&lt;/li&gt;
&lt;li&gt;Use load balancers to distribute traffic evenly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This ensures we can scale to millions of concurrent users.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Fan-Out Problem in Group Chats&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When a user sends a message in a group with 10,000 members, the system must deliver that message to all members.&lt;/p&gt;

&lt;p&gt;This creates a massive delivery overhead.&lt;/p&gt;

&lt;p&gt;Two common approaches:&lt;/p&gt;

&lt;p&gt;Fan-out on Write&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When a message is sent, it is immediately distributed to all group members.&lt;/li&gt;
&lt;li&gt;Faster reads.&lt;/li&gt;
&lt;li&gt;Heavy write amplification.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Fan-out on Read&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Store one copy of the message.&lt;/li&gt;
&lt;li&gt;Deliver it only when users fetch or reconnect.&lt;/li&gt;
&lt;li&gt;Reduces write load but increases read complexity.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Large-scale systems like Slack often use optimized hybrid approaches depending on group size.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Message Ordering&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Messages must appear in the correct order for each conversation.&lt;/p&gt;

&lt;p&gt;Problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Messages may arrive out of order due to network delays.&lt;/li&gt;
&lt;li&gt;Multiple servers handling requests can cause race conditions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Solution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Assign a sequence number per conversation.&lt;/li&gt;
&lt;li&gt;Store timestamps.&lt;/li&gt;
&lt;li&gt;Let clients reorder messages based on sequence IDs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Maintaining ordering becomes especially challenging in distributed systems.&lt;/p&gt;
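
&lt;p&gt;A toy sketch of per-conversation sequence numbers. In a real deployment the counter would live in a central store rather than in process memory:&lt;/p&gt;

```python
import collections
import itertools

# One monotonically increasing counter per conversation.
sequencers = collections.defaultdict(itertools.count)

def stamp(conversation_id, message):
    """Server-side: assign the next sequence number for this conversation."""
    message["seq"] = next(sequencers[conversation_id])
    return message

def reorder(messages):
    """Client-side: sort by sequence id, regardless of arrival order."""
    return sorted(messages, key=lambda m: m["seq"])
```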

&lt;p&gt;&lt;strong&gt;Handling Offline Users&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Users may disconnect unexpectedly due to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Network issues&lt;/li&gt;
&lt;li&gt;App crashes&lt;/li&gt;
&lt;li&gt;Device shutdown&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The system must:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Store undelivered messages safely.&lt;/li&gt;
&lt;li&gt;Detect when the user reconnects.&lt;/li&gt;
&lt;li&gt;Deliver pending messages reliably.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This requires durable storage (e.g., NoSQL databases like Amazon DynamoDB).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Delivery Guarantees&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Should messages be delivered:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;At most once?&lt;/li&gt;
&lt;li&gt;At least once?&lt;/li&gt;
&lt;li&gt;Exactly once?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Exactly-once delivery is extremely hard in distributed systems.&lt;/p&gt;

&lt;p&gt;Most chat systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use at-least-once delivery.&lt;/li&gt;
&lt;li&gt;Assign unique message IDs.&lt;/li&gt;
&lt;li&gt;Let clients deduplicate messages if needed.&lt;/li&gt;
&lt;/ul&gt;
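
&lt;p&gt;Client-side deduplication under at-least-once delivery can be sketched as:&lt;/p&gt;

```python
seen_ids = set()
timeline = []

def on_receive(message):
    """At-least-once delivery means retries can redeliver the same message;
    the unique id makes the receive idempotent."""
    if message["id"] in seen_ids:
        return                      # duplicate redelivery: drop it silently
    seen_ids.add(message["id"])
    timeline.append(message)
```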

&lt;p&gt;&lt;strong&gt;Fault Tolerance&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What happens if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A chat server crashes?&lt;/li&gt;
&lt;li&gt;Redis goes down?&lt;/li&gt;
&lt;li&gt;A database node fails?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Solutions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Replicated databases.&lt;/li&gt;
&lt;li&gt;Redis clustering.&lt;/li&gt;
&lt;li&gt;Health checks and auto-restarts.&lt;/li&gt;
&lt;li&gt;Multi-availability zone deployments.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Large messaging systems like WhatsApp are designed with redundancy at every layer to avoid message loss.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Storage &amp;amp; Hot Partitions&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If many users are chatting in the same popular group, all writes may hit the same database partition.&lt;/p&gt;

&lt;p&gt;This creates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hot keys&lt;/li&gt;
&lt;li&gt;Increased latency&lt;/li&gt;
&lt;li&gt;Throttling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Solutions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Partition by conversation ID + time bucket.&lt;/li&gt;
&lt;li&gt;Use sharding strategies.&lt;/li&gt;
&lt;li&gt;Distribute load evenly across nodes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Designing a real-time chat application goes far beyond simply sending messages between users. It requires solving complex distributed systems problems such as scaling millions of persistent connections, ensuring low latency, handling offline users, maintaining message ordering, and guaranteeing fault tolerance.&lt;/p&gt;

&lt;p&gt;By using WebSockets for bidirectional communication, horizontally scalable chat servers, centralized connection mapping with Redis, and durable storage solutions like Amazon DynamoDB, we can build a system capable of supporting millions of users efficiently.&lt;/p&gt;

&lt;p&gt;The real challenge is not just building the architecture — it’s understanding the trade-offs between scalability, consistency, and reliability.&lt;/p&gt;

&lt;p&gt;A well-designed chat system is a practical example of how distributed systems principles are applied in real-world applications.&lt;/p&gt;

</description>
      <category>distributedsystems</category>
      <category>career</category>
      <category>systemdesign</category>
      <category>architecture</category>
    </item>
    <item>
      <title>How to design a Notification System?</title>
      <dc:creator>Ganesh Parella</dc:creator>
      <pubDate>Sat, 21 Feb 2026 19:59:32 +0000</pubDate>
      <link>https://dev.to/ganesh_parella/how-to-design-a-notification-system--28ah</link>
      <guid>https://dev.to/ganesh_parella/how-to-design-a-notification-system--28ah</guid>
      <description>&lt;p&gt;Imagine you’re building a social platform.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A user signs up.&lt;/li&gt;
&lt;li&gt;Someone likes a post.&lt;/li&gt;
&lt;li&gt;Someone comments.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these actions should trigger a notification.&lt;/p&gt;

&lt;p&gt;Sounds simple, right?&lt;/p&gt;

&lt;p&gt;But what happens when thousands of users trigger events at the same time?&lt;br&gt;
Let’s design it properly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Functional Requirements&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When an event is triggered → send a notification.&lt;/li&gt;
&lt;li&gt;If sending fails → retry.&lt;/li&gt;
&lt;li&gt;Support multiple channels (Email, Push, In-App).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Non-functional Requirements&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High availability&lt;/li&gt;
&lt;li&gt;Notifications should not be lost (persistence)&lt;/li&gt;
&lt;li&gt;Scalable under traffic spikes&lt;/li&gt;
&lt;li&gt;Pluggable architecture (easy to add new channels)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;High-Level Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fed8lua1j6h0llf2o835x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fed8lua1j6h0llf2o835x.png" alt="High-Level Architecture" width="800" height="245"&gt;&lt;/a&gt;&lt;br&gt;
The basic architecture looks straightforward. But many people ask:&lt;/p&gt;

&lt;p&gt;Why use a message queue instead of directly sending the request to the notification service?&lt;/p&gt;

&lt;p&gt;Let’s say we want to send a welcome email when a new user signs up.&lt;/p&gt;

&lt;p&gt;Most email service providers impose rate limits. Assume the limit is 30 requests per second.&lt;/p&gt;

&lt;p&gt;Now imagine 100 users click the sign-up button within one second.&lt;/p&gt;

&lt;p&gt;If we send requests directly to the email service:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;30 succeed&lt;/li&gt;
&lt;li&gt;70 fail&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s 70 lost users. Not acceptable.&lt;/p&gt;

&lt;p&gt;Instead, we push all events into a message queue and process them at a controlled rate. The queue acts as a buffer during traffic spikes. Workers dequeue messages when the service is available and send notifications gradually.&lt;/p&gt;

&lt;p&gt;This way, we don’t lose requests, and we stay within the provider’s rate limit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bottlenecks and Improvements&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. What if the notification provider is down?&lt;/strong&gt;&lt;br&gt;
Suppose the email service goes down and every request starts failing.&lt;/p&gt;

&lt;p&gt;If we retry infinitely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We waste CPU resources&lt;/li&gt;
&lt;li&gt;The queue keeps growing&lt;/li&gt;
&lt;li&gt;The system becomes unstable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To solve this, we use exponential backoff retries.&lt;/p&gt;

&lt;p&gt;Instead of retrying immediately, we wait longer between each attempt:&lt;br&gt;
1s → 2s → 4s → 8s → 16s …&lt;/p&gt;

&lt;p&gt;After a certain number of retries, we move the message to a Dead Letter Queue (DLQ) for later inspection.&lt;/p&gt;
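
&lt;p&gt;A sketch of the retry loop with exponential backoff and a DLQ. MAX_RETRIES and the injectable sleep function are illustration choices, not fixed values:&lt;/p&gt;

```python
import time

MAX_RETRIES = 5        # illustrative cap before giving up
dead_letter_queue = []

def send_with_backoff(message, send, sleep=time.sleep):
    """Retry with exponentially growing waits: 1s, 2s, 4s, 8s, 16s."""
    for attempt in range(MAX_RETRIES):
        try:
            return send(message)
        except Exception:
            sleep(2 ** attempt)     # back off before the next attempt
    # All retries failed: park the message in the DLQ for later inspection.
    dead_letter_queue.append(message)
```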

&lt;p&gt;&lt;strong&gt;2. Avoiding Notification Spam&lt;/strong&gt;&lt;br&gt;
Initially, we might send an email for every event.&lt;/p&gt;

&lt;p&gt;But that’s not ideal.&lt;/p&gt;

&lt;p&gt;If a user is actively using the app, sending an email for every like or comment would feel like spam.&lt;/p&gt;

&lt;p&gt;To handle this, we introduce a Notification Engine.&lt;/p&gt;

&lt;p&gt;All requests go through this engine, which decides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which channel to use (Email, Push, In-App)&lt;/li&gt;
&lt;li&gt;Whether the user has disabled certain notifications&lt;/li&gt;
&lt;li&gt;Whether the user is currently active in the app&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We store user preferences and last login time in a cache for quick access.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If the user is active → send only In-App notification&lt;/li&gt;
&lt;li&gt;If the user is offline → send Push&lt;/li&gt;
&lt;li&gt;If Push fails → fallback to Email&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This makes the system smarter and more user-friendly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Making It Pluggable&lt;/strong&gt;&lt;br&gt;
We don’t want to tightly couple our system to just Email or Push.&lt;/p&gt;

&lt;p&gt;Instead, we design it so that each notification channel implements a common interface.&lt;/p&gt;

&lt;p&gt;That way, if we want to add:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SMS&lt;/li&gt;
&lt;li&gt;WhatsApp&lt;/li&gt;
&lt;li&gt;Slack&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We can plug it in without rewriting the core logic.&lt;/p&gt;

&lt;p&gt;This keeps the system flexible and future-proof.&lt;/p&gt;
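
&lt;p&gt;One way to sketch that common interface (the channel names and classes here are illustrative):&lt;/p&gt;

```python
class Channel:
    """Common interface every notification channel implements."""
    def send(self, user, message):
        raise NotImplementedError

class EmailChannel(Channel):
    def send(self, user, message):
        return f"email to {user}: {message}"

class SmsChannel(Channel):
    def send(self, user, message):
        return f"sms to {user}: {message}"

# The core engine only knows the Channel interface, so adding WhatsApp or
# Slack is a new subclass plus one registry entry: no core changes needed.
channels = {"email": EmailChannel(), "sms": SmsChannel()}

def notify(channel_name, user, message):
    return channels[channel_name].send(user, message)
```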

&lt;p&gt;&lt;strong&gt;Final Thoughts&lt;/strong&gt;&lt;br&gt;
What started as “just send a notification” quickly becomes a distributed system problem.&lt;/p&gt;

&lt;p&gt;By introducing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A message queue for decoupling&lt;/li&gt;
&lt;li&gt;Worker-based async processing&lt;/li&gt;
&lt;li&gt;Exponential backoff retries&lt;/li&gt;
&lt;li&gt;Dead Letter Queues&lt;/li&gt;
&lt;li&gt;A centralized Notification Engine&lt;/li&gt;
&lt;li&gt;User preference caching&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We build a system that is scalable, resilient, and production-ready.&lt;/p&gt;

&lt;p&gt;Simple feature. Complex engineering.&lt;/p&gt;

&lt;p&gt;And that’s the fun part.&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>softwaredevelopment</category>
      <category>webdev</category>
      <category>learning</category>
    </item>
    <item>
      <title>Designing a Rate Limiter to prevent spamming</title>
      <dc:creator>Ganesh Parella</dc:creator>
      <pubDate>Wed, 18 Feb 2026 10:33:23 +0000</pubDate>
      <link>https://dev.to/ganesh_parella/designing-a-rate-limiter-to-prevent-spamming-32e1</link>
      <guid>https://dev.to/ganesh_parella/designing-a-rate-limiter-to-prevent-spamming-32e1</guid>
      <description>&lt;p&gt;Imagine you are building a social website or any large-scale system. Suddenly, a million requests flood your system from a single IP address. Your servers slow down or even crash.&lt;/p&gt;

&lt;p&gt;How do we prevent this?&lt;/p&gt;

&lt;p&gt;The answer is: Rate Limiting.&lt;/p&gt;

&lt;p&gt;In simple terms, a rate limiter restricts the number of requests a user (or IP address) can send within a given time window.&lt;/p&gt;

&lt;p&gt;Let’s design one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Functional Requirements&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Limit the number of requests per user ID or IP address&lt;/li&gt;
&lt;li&gt;Return an error (e.g., HTTP 429) when the limit is exceeded&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Non-Functional Requirements&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Low latency while checking the limit (e.g., &amp;lt;10ms)&lt;/li&gt;
&lt;li&gt;High availability (Availability &amp;gt; Consistency)&lt;/li&gt;
&lt;li&gt;Scalable for millions of users&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;System Function / Endpoint&lt;/strong&gt;&lt;br&gt;
boolean isAvailable(userId, request)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If true → forward request to backend&lt;/li&gt;
&lt;li&gt;If false → return 429 (Too Many Requests)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Choosing the Right Algorithm&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When we think about limiting requests over time, a natural idea is:&lt;/p&gt;

&lt;p&gt;Limit the number of requests in a fixed time window.&lt;/p&gt;

&lt;p&gt;But there’s a problem. Suppose we allow 100 requests per second. If a user sends:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;100 requests at the end of second 1&lt;/li&gt;
&lt;li&gt;100 requests at the beginning of second 2&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s 200 requests within ~1 second. This is known as the Fixed Window boundary problem. We don’t want that.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sliding Window&lt;/strong&gt;&lt;br&gt;
To solve this, we can use a sliding window approach. The idea: within any given time window, the number of requests must not exceed the limit. This is more accurate, but it requires storing the timestamps of requests. An implementation might use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sorted sets&lt;/li&gt;
&lt;li&gt;Heaps / priority queues&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, memory usage increases with traffic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token Bucket (Preferred Approach)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Think of tokens as balls in a bucket.&lt;/li&gt;
&lt;li&gt;Each request consumes one token.&lt;/li&gt;
&lt;li&gt;Tokens are refilled at a fixed rate.&lt;/li&gt;
&lt;li&gt;If no tokens are available → reject the request.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bucket size = 100&lt;/li&gt;
&lt;li&gt;Refill rate = 100 per minute&lt;/li&gt;
&lt;li&gt;If a user sends 100 requests instantly, they must wait until tokens are refilled.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Allows burst traffic (up to bucket capacity)&lt;/li&gt;
&lt;li&gt;Smooths traffic over time&lt;/li&gt;
&lt;li&gt;Flexible and production-friendly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Token Bucket is widely used in real systems.&lt;/p&gt;
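
&lt;p&gt;A minimal single-node Token Bucket sketch. A distributed version would keep this state in Redis and update it atomically, but the refill logic is the same:&lt;/p&gt;

```python
import time

class TokenBucket:
    def __init__(self, capacity, refill_per_second, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = capacity            # the bucket starts full
        self.clock = clock                # injectable clock, handy for testing
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Lazily refill based on elapsed time, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_per_second)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1              # this request consumes one token
            return True
        return False                      # no tokens left: respond with HTTP 429
```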

&lt;p&gt;&lt;strong&gt;High-Level Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcn0oiyftcqcfiupco4k4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcn0oiyftcqcfiupco4k4.png" alt="High-Level Architecture" width="800" height="239"&gt;&lt;/a&gt;&lt;br&gt;
In this design:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Rate Limiter logic is placed before the backend API.&lt;/li&gt;
&lt;li&gt;Load balancer distributes traffic across multiple app servers.&lt;/li&gt;
&lt;li&gt;A shared Redis store keeps token bucket state.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This ensures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Distributed rate limiting&lt;/li&gt;
&lt;li&gt;No single point of failure&lt;/li&gt;
&lt;li&gt;Low latency checks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Bottlenecks&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;1. Redis Bottleneck&lt;/strong&gt;&lt;br&gt;
If millions of users hit the system simultaneously, Redis may become the bottleneck.&lt;br&gt;
To scale:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use Redis clustering&lt;/li&gt;
&lt;li&gt;Shard keys across multiple Redis nodes&lt;/li&gt;
&lt;li&gt;Use consistent hashing for distribution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If each Redis instance stores the state for 100k users and we need to support 1 million users, we need around 10 Redis nodes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Concurrency Problem&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A user has only 1 token left&lt;/li&gt;
&lt;li&gt;Two requests hit Redis at the same time from different servers?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Redis solves this using atomic operations.&lt;/p&gt;

&lt;p&gt;Using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lua scripts&lt;/li&gt;
&lt;li&gt;Atomic commands like INCR&lt;/li&gt;
&lt;li&gt;Or transactions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This prevents race conditions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Latency Considerations&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To reduce latency:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Keep Redis close to application servers (same region)&lt;/li&gt;
&lt;li&gt;Use cluster topology&lt;/li&gt;
&lt;li&gt;Avoid cross-region calls for rate limit checks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Geographical distance directly impacts response time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final Thought&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Functional requirements define what the system does.&lt;/li&gt;
&lt;li&gt;Non-functional requirements define how well it performs at scale.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Rate limiting may look simple, but designing it correctly in distributed systems requires careful thought.&lt;/p&gt;

&lt;p&gt;See you in the next post 🚀&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>backend</category>
      <category>systemdesign</category>
      <category>tutorial</category>
    </item>
    <item>
<title>How to Design a Simple URL Shortener (TinyURL)</title>
      <dc:creator>Ganesh Parella</dc:creator>
      <pubDate>Sun, 15 Feb 2026 18:16:36 +0000</pubDate>
      <link>https://dev.to/ganesh_parella/how-to-design-a-simple-url-shortenertinyurl-3n2m</link>
      <guid>https://dev.to/ganesh_parella/how-to-design-a-simple-url-shortenertinyurl-3n2m</guid>
      <description>&lt;p&gt;TinyURL is often called the “Hello World” of system design because it has minimal requirements but forces us to think about scalability, caching, ID generation, and bottlenecks.&lt;/p&gt;

&lt;p&gt;Let’s design it step by step.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Functional Requirements:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Convert a Long URL → Short URL&lt;/li&gt;
&lt;li&gt;Redirect Short URL → Long URL&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Non-Functional Requirements:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High Availability&lt;/li&gt;
&lt;li&gt;Low Latency &lt;/li&gt;
&lt;li&gt;Scalable under heavy traffic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;API End-Points:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;POST /shorten → Accepts Long URL&lt;/li&gt;
&lt;li&gt;GET /{shortId} → Redirects to Long URL&lt;/li&gt;
&lt;/ul&gt;
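&lt;p&gt;A toy, single-node version of these two endpoints (an in-memory dict and a local counter stand in for the real database and distributed ID generator):&lt;/p&gt;

```python
import string

ALPHABET = string.ascii_lowercase + string.ascii_uppercase + string.digits  # 62 characters

class UrlShortener:
    """In-memory sketch: no persistence, no replication, one process."""

    def __init__(self):
        self.store = {}     # short_id -> long_url
        self.counter = 0    # stand-in for a distributed ID generator

    def shorten(self, long_url):
        """POST /shorten"""
        self.counter += 1
        short_id = self._encode(self.counter)
        self.store[short_id] = long_url
        return short_id

    def resolve(self, short_id):
        """GET /{shortId}: returns the redirect target, or None for a 404."""
        return self.store.get(short_id)

    def _encode(self, n):
        # Base62-encode the counter to keep the short ID compact
        digits = []
        while n:
            n, r = divmod(n, 62)
            digits.append(ALPHABET[r])
        return "".join(reversed(digits)) or ALPHABET[0]

svc = UrlShortener()
short_id = svc.shorten("https://example.com/a/very/long/path")
```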

&lt;p&gt;&lt;strong&gt;High-Level Design:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbyu0pf22akqz45e8hvi2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbyu0pf22akqz45e8hvi2.png" alt="User&amp;lt;br&amp;gt;
→ Load Balancer&amp;lt;br&amp;gt;
→ App Servers&amp;lt;br&amp;gt;
→ Cache&amp;lt;br&amp;gt;
→ Sharded Database" width="800" height="202"&gt;&lt;/a&gt;&lt;br&gt;
This is a simple and scalable design.&lt;/p&gt;

&lt;p&gt;Since we require low latency, we introduce a cache layer to store frequently accessed short URLs. Most read requests will be served directly from cache, reducing database load.&lt;/p&gt;

&lt;p&gt;To ensure high availability, we avoid single points of failure. App servers are scaled horizontally and placed behind a load balancer, which distributes incoming traffic evenly.&lt;/p&gt;

&lt;p&gt;Because our system only needs to store simple mappings:&lt;br&gt;
               short_url → long_url&lt;br&gt;
we can use either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A Key-Value database (natural fit for simple mapping)&lt;/li&gt;
&lt;li&gt;Or an SQL database if additional analytics or constraints are required&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This covers the basic design derived from requirements.&lt;/p&gt;

&lt;p&gt;But now comes the interesting part.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How Short Should the URL Be?&lt;/strong&gt;&lt;br&gt;
We want to convert long URLs into short ones. But how short should they be?&lt;/p&gt;

&lt;p&gt;Assume:&lt;/p&gt;

&lt;p&gt;K new URLs are generated every second.&lt;br&gt;
We store URLs for 10 years.&lt;br&gt;
Total URLs required:&lt;br&gt;
                 &lt;strong&gt;K*60*60*24*365*10&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If our short URL can use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;26 lowercase letters (a–z)&lt;/li&gt;
&lt;li&gt;26 uppercase letters (A–Z)&lt;/li&gt;
&lt;li&gt;10 digits (0–9)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That gives us 62 possible characters.&lt;br&gt;
To determine required length:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;62^n ≥ K × 60 × 60 × 24 × 365 × 10&lt;/strong&gt;&lt;br&gt;
Where n is the length of the short URL.&lt;br&gt;
If n = 7:&lt;br&gt;
62^7 ≈ 3.5 trillion combinations&lt;/p&gt;

&lt;p&gt;Which is sufficient for large-scale systems.&lt;/p&gt;
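&lt;p&gt;The length bound is easy to check numerically (the helper name below is made up for illustration):&lt;/p&gt;

```python
SECONDS_PER_YEAR = 60 * 60 * 24 * 365

def required_length(k_urls_per_second, years=10, alphabet_size=62):
    """Smallest n such that alphabet_size**n covers every URL created."""
    total = k_urls_per_second * SECONDS_PER_YEAR * years
    n = 1
    while total > alphabet_size ** n:
        n += 1
    return n

print(required_length(1000))   # 7: 62^7 (about 3.5 trillion) covers ~315 billion URLs
```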

&lt;p&gt;&lt;strong&gt;Bottlenecks&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Hot Key Problem (Read Bottleneck)&lt;/strong&gt;&lt;br&gt;
Suppose the application becomes popular and millions of users request the same short URL simultaneously.&lt;/p&gt;

&lt;p&gt;Where would the system collapse first?&lt;/p&gt;

&lt;p&gt;The cache.&lt;br&gt;
When many users access the same key, we face a hot key problem. Horizontal scaling alone does not solve this because the same key may map to the same cache node.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use cache replicas&lt;/li&gt;
&lt;li&gt;Introduce a CDN layer&lt;/li&gt;
&lt;li&gt;Distribute read load across multiple cache nodes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Write Bottleneck (Database)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now assume we receive a large number of write requests (URL creation). Writes typically go to the primary database node.&lt;br&gt;
Where will the bottleneck occur?&lt;br&gt;
The database.&lt;br&gt;
Since every new short URL requires a write operation, database throughput becomes the limiting factor.&lt;br&gt;
&lt;strong&gt;Solution:&lt;/strong&gt;&lt;br&gt;
Sharding the database.&lt;br&gt;
However, simple modulo-based sharding can cause problems when adding new shards because it requires massive data redistribution.&lt;/p&gt;

&lt;p&gt;A better approach is:&lt;br&gt;
Consistent hashing, which minimizes data movement when scaling.&lt;/p&gt;
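&lt;p&gt;A quick simulation shows the problem consistent hashing avoids: going from 4 to 5 shards with modulo hashing remaps about 80% of the keys, whereas a consistent-hash ring would remap only roughly 1/(N+1) of them (the key names and shard counts below are arbitrary):&lt;/p&gt;

```python
import hashlib

def shard_of(key, n_shards):
    # naive modulo-based sharding
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % n_shards

keys = [f"url:{i}" for i in range(10_000)]
before = [shard_of(k, 4) for k in keys]   # 4 shards
after = [shard_of(k, 5) for k in keys]    # add a 5th shard
moved = sum(1 for b, a in zip(before, after) if b != a)
print(f"{moved / len(keys):.0%} of keys changed shards")   # ~80%
```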

&lt;p&gt;&lt;strong&gt;ID Collision Problem&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Since app servers are horizontally scaled, two servers might generate the same short URL.&lt;/p&gt;

&lt;p&gt;How do we prevent collisions?&lt;/p&gt;

&lt;p&gt;Possible approaches:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Random Base62 generation + collision check&lt;/li&gt;
&lt;li&gt;Centralized ID generator&lt;/li&gt;
&lt;li&gt;Distributed ID service&lt;/li&gt;
&lt;li&gt;Using Redis atomic counter (e.g., INCR)&lt;/li&gt;
&lt;/ul&gt;
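&lt;p&gt;A sketch of the first approach, random Base62 generation plus a collision check (a Python set stands in for the database's unique-index lookup):&lt;/p&gt;

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits   # 62 characters

def new_short_id(existing, length=7):
    """Generate a random Base62 ID, retrying on the (rare) collision.

    'existing' stands in for a uniqueness check against storage,
    e.g. a unique index on the short_url column.
    """
    while True:
        candidate = "".join(secrets.choice(ALPHABET) for _ in range(length))
        if candidate not in existing:
            existing.add(candidate)
            return candidate

issued = set()
first = new_short_id(issued)
second = new_short_id(issued)
```

&lt;p&gt;With 62^7 possible IDs, collisions stay rare until billions of URLs have been issued, so the retry loop almost never fires; at higher write volumes, a counter-based scheme (such as Redis INCR) avoids the check entirely.&lt;/p&gt;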

&lt;p&gt;&lt;strong&gt;Final Thought&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;TinyURL may look simple, but it teaches us:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scalability&lt;/li&gt;
&lt;li&gt;Caching strategies&lt;/li&gt;
&lt;li&gt;Sharding techniques&lt;/li&gt;
&lt;li&gt;Bottleneck analysis&lt;/li&gt;
&lt;li&gt;ID generation trade-offs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s why it’s called the “Hello World” of System Design. Let's meet again with another interesting design.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>backend</category>
      <category>systemdesign</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>How to Choose a Database?</title>
      <dc:creator>Ganesh Parella</dc:creator>
      <pubDate>Fri, 13 Feb 2026 18:57:39 +0000</pubDate>
      <link>https://dev.to/ganesh_parella/how-to-choose-a-database-56g1</link>
      <guid>https://dev.to/ganesh_parella/how-to-choose-a-database-56g1</guid>
      <description>&lt;p&gt;Before choosing a database, we must understand the types of databases that exist.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Relational Databases (1970s)&lt;/strong&gt;&lt;br&gt;
In 1970, Edgar F. Codd proposed storing data in tables (relations) and manipulating them with mathematical principles (relational algebra).&lt;br&gt;
This led to the creation of relational databases.&lt;br&gt;
They offer the ACID properties:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Atomicity&lt;/li&gt;
&lt;li&gt;Consistency&lt;/li&gt;
&lt;li&gt;Isolation&lt;/li&gt;
&lt;li&gt;Durability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examples include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MySQL&lt;/li&gt;
&lt;li&gt;PostgreSQL&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Relational databases are powerful when data is structured and transactional consistency is critical.&lt;br&gt;
However, as systems scale and joins grow across millions of rows, performance tuning and horizontal scaling become challenging.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Key–Value Databases (2000s Scaling Era)&lt;/strong&gt;&lt;br&gt;
In the 2000s, companies like Amazon faced massive scalability challenges.&lt;br&gt;
Instead of complex relational joins, they proposed storing data as:&lt;br&gt;
Key → Value&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
UserID → List of Orders&lt;/p&gt;

&lt;p&gt;A popular example is: Redis&lt;/p&gt;

&lt;p&gt;Key–Value databases offer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extremely fast lookups&lt;/li&gt;
&lt;li&gt;Easy horizontal scaling&lt;/li&gt;
&lt;li&gt;Great performance for caching and session storage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, they are not ideal for handling complex relationships between data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Graph Databases&lt;/strong&gt;&lt;br&gt;
Graph databases store data as:&lt;br&gt;
Nodes&lt;br&gt;
Edges (relationships)&lt;br&gt;
Example: Neo4j&lt;/p&gt;

&lt;p&gt;They are extremely useful when relationships are first-class citizens, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Social networks&lt;/li&gt;
&lt;li&gt;Recommendation systems&lt;/li&gt;
&lt;li&gt;Fraud detection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Graph databases shine when traversing connected data efficiently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Document Databases&lt;/strong&gt;&lt;br&gt;
Document databases store data as JSON-like documents.&lt;/p&gt;

&lt;p&gt;Example: MongoDB&lt;/p&gt;

&lt;p&gt;Document databases offer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Easy horizontal scaling&lt;/li&gt;
&lt;li&gt;Flexible schema&lt;/li&gt;
&lt;li&gt;Better support for hierarchical data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While modern document databases support limited joins and aggregations, they are not as optimized for complex relational queries as relational databases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final Verdict&lt;/strong&gt;&lt;br&gt;
Don’t choose a database by default.&lt;br&gt;
Choose a database based on how you access and scale your data.&lt;/p&gt;

&lt;p&gt;If you need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Strong ACID guarantees → Relational Database&lt;/li&gt;
&lt;li&gt;Ultra-fast lookups → Key–Value Database&lt;/li&gt;
&lt;li&gt;Relationship-heavy queries → Graph Database&lt;/li&gt;
&lt;li&gt;Flexible schema with high scalability → Document Database&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Modern systems often use multiple databases together — a concept known as polyglot persistence.&lt;br&gt;
For example, Netflix uses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Relational databases for user data&lt;/li&gt;
&lt;li&gt;Key–Value stores for caching&lt;/li&gt;
&lt;li&gt;Graph databases for recommendations&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>architecture</category>
      <category>beginners</category>
      <category>database</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Why do we need Databases?</title>
      <dc:creator>Ganesh Parella</dc:creator>
      <pubDate>Fri, 13 Feb 2026 17:10:40 +0000</pubDate>
      <link>https://dev.to/ganesh_parella/why-do-we-need-databases-519n</link>
      <guid>https://dev.to/ganesh_parella/why-do-we-need-databases-519n</guid>
      <description>&lt;p&gt;Have you ever wondered why we need databases when we can store data directly on an SSD?&lt;/p&gt;

&lt;p&gt;I recently asked myself this question. After all, SSDs store data permanently. If we can read and write directly to disk, why don’t we use that for building web applications and systems?&lt;/p&gt;

&lt;p&gt;At first, I thought maybe it’s because most applications use in-memory computation — in simple terms, RAM. When an application crashes, all data stored in RAM is lost. This creates reliability issues.&lt;/p&gt;

&lt;p&gt;So then I wondered: if persistence is our goal, why not write everything directly to the SSD?&lt;/p&gt;

&lt;p&gt;But then I realized something important — an SSD only provides physical storage. It does not provide data management. SSDs do not provide indexing, concurrent updates, or crash recovery. Implementing these manually on top of raw file storage would be extremely complex.&lt;/p&gt;

&lt;p&gt;Databases abstract this complexity and handle data efficiently.&lt;/p&gt;

&lt;p&gt;Now that we understand why databases exist, the next question becomes — what type of database should we choose for different systems?&lt;/p&gt;

</description>
      <category>backend</category>
      <category>beginners</category>
      <category>computerscience</category>
      <category>database</category>
    </item>
  </channel>
</rss>
