<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ganesh Parella</title>
    <description>The latest articles on DEV Community by Ganesh Parella (@ganesh_parella).</description>
    <link>https://dev.to/ganesh_parella</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3771329%2F5c80f144-e42b-45f7-a665-5de4b340ed0f.png</url>
      <title>DEV Community: Ganesh Parella</title>
      <link>https://dev.to/ganesh_parella</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ganesh_parella"/>
    <language>en</language>
    <item>
      <title>I Broke My Own Workflow Engine at Scale — Here's How I Fixed It</title>
      <dc:creator>Ganesh Parella</dc:creator>
      <pubDate>Wed, 18 Mar 2026 13:26:22 +0000</pubDate>
      <link>https://dev.to/ganesh_parella/i-broke-my-own-workflow-engine-at-scale-heres-how-i-fixed-it-hows-the-title-4ml5</link>
      <guid>https://dev.to/ganesh_parella/i-broke-my-own-workflow-engine-at-scale-heres-how-i-fixed-it-hows-the-title-4ml5</guid>
      <description>&lt;p&gt;In my last post, I broke down how I built FlowForge, a fault-tolerant DAG workflow engine using ASP.NET Core, React, and MySQL. I explained how I solved complex branching and dependency execution using Kahn's Algorithm and a database-backed state machine.&lt;/p&gt;

&lt;p&gt;It works perfectly for hundreds of concurrent users. But as engineers, we always have to ask the dangerous question: what happens when we 1000x the load? Let's dive deep into the absolute limits of my current architecture, watch it break, and re-architect it to handle massive scale: 1,000,000 flow executions per second.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How the V1 Engine Works (And Why It Will Fail)&lt;/strong&gt;&lt;br&gt;
To understand how to scale the engine, you need to understand what it's currently doing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Storage &amp;amp; Parsing:&lt;/strong&gt;&lt;br&gt;
When a user builds a flow in the React frontend, we save the entire raw JSON (including UI viewports and node coordinates) into the definitionJson column of our Flow table. When execution starts, we parse this JSON into a backend-friendly ParsedFlow (a strict list of Nodes and Edges) and verify the graph contains no cycles using topological sorting.&lt;/p&gt;
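&lt;p&gt;A minimal sketch of that parse step (field names are illustrative, not the exact FlowForge schema): the editor JSON carries UI noise, and we keep only the ids, types, and edges the engine needs.&lt;/p&gt;

```python
import json

def parse_flow(definition_json):
    """Strip UI-only fields from the raw editor JSON, keeping nodes and edges."""
    raw = json.loads(definition_json)
    nodes = [{"id": n["id"], "type": n["type"], "data": n.get("data", {})}
             for n in raw["nodes"]]  # drop x/y coordinates, viewport, etc.
    edges = [(e["source"], e["target"]) for e in raw["edges"]]
    return {"nodes": nodes, "edges": edges}
```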

&lt;p&gt;&lt;strong&gt;The Execution Loop:&lt;/strong&gt;&lt;br&gt;
When a flow is triggered, we create a FlowInstance in the database to track the overall run. We then generate NodeInstances for every step, marking the initial trigger node as Ready and everything else as Pending.&lt;/p&gt;
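&lt;p&gt;Seeding that per-node state can be sketched like this (illustrative, not the actual FlowForge code):&lt;/p&gt;

```python
def create_node_instances(node_ids, trigger_id):
    """Seed per-node execution state: the trigger starts Ready, the rest Pending."""
    return {nid: ("Ready" if nid == trigger_id else "Pending") for nid in node_ids}
```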

&lt;p&gt;&lt;strong&gt;The Polling Mechanism:&lt;/strong&gt;&lt;br&gt;
A background service aggressively polls the MySQL database every 10 milliseconds, looking for nodes in the Ready state. When it finds one, it executes it, evaluates the downstream dependencies, and marks the next children as Ready.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Breaking Points at 1 Million RPS&lt;/strong&gt;&lt;br&gt;
If we force 1 million workflows per second through this V1 architecture, two things will immediately catch fire:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Breaking Point 1: ThreadPool Starvation (CPU Bottleneck)&lt;/strong&gt;&lt;br&gt;
ASP.NET Core is incredibly efficient with async/await, releasing threads back to the pool during I/O operations. However, parsing a massive workflow JSON 1 million times a second is a heavy, synchronous, CPU-bound operation. We will quickly exhaust the available worker threads, leading to severe latency spikes and HTTP 503 errors as the API Gateway drops incoming requests.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Breaking Point 2: The Database Meltdown&lt;/strong&gt;&lt;br&gt;
Polling a database every 10ms for 1,000,000 concurrent flows results in an astronomical amount of read queries. The MySQL CPU will hit 100%, disk I/O will bottleneck, and the database will crash before the web servers even break a sweat.&lt;/p&gt;
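&lt;p&gt;The back-of-the-envelope arithmetic makes this concrete:&lt;/p&gt;

```python
flows = 1_000_000      # concurrent flow instances
polls_per_flow = 100   # one poll every 10 ms = 100 polls per second
read_qps = flows * polls_per_flow
print(f"{read_qps:,} read queries per second")  # 100,000,000
```

&lt;p&gt;One hundred million reads per second, before a single node has even executed. No single MySQL instance survives that.&lt;/p&gt;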

&lt;p&gt;&lt;strong&gt;Re-Architecting for 1M Flows per Second&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxsjk8jktrsuyo2cearfg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxsjk8jktrsuyo2cearfg.png" alt="Flow-forge" width="516" height="844"&gt;&lt;/a&gt;&lt;br&gt;
To survive this scale, we have to fundamentally shift from a monolithic, polling-based architecture to a distributed, event-driven one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Decoupling with Message Queues&lt;/strong&gt;&lt;br&gt;
Instinctively, many developers think, "Just throw a Message Queue in front of it." But what actually goes into the queue?&lt;/p&gt;

&lt;p&gt;Instead of the API server trying to execute the flow, the API server's only job is to receive the HTTP request, validate it, and drop a StartFlowMessage (containing the FlowId and payload) into a high-throughput broker like Kafka or RabbitMQ. The API responds with a 202 Accepted immediately, freeing up the web thread in milliseconds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Scaling Horizontally with Stateless Workers&lt;/strong&gt;&lt;br&gt;
Now that execution is decoupled, we deploy dedicated, stateless Worker Services reading directly from the Kafka partitions. If the queue gets backed up, we simply spin up 50 more worker containers in Kubernetes to chew through the backlog.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Eradicating Database Polling (Event-Driven Execution)&lt;/strong&gt;&lt;br&gt;
We completely remove the 10ms SELECT loop. When a worker finishes executing "Node A", it doesn't wait for the database. Instead, it calculates what nodes are unblocked, updates their state in MySQL, and immediately pushes a new ExecuteNodeMessage into the queue for "Node B". The engine is now entirely event-driven.&lt;/p&gt;
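&lt;p&gt;The completion handler can be sketched like this (an in-memory stand-in with illustrative names): finishing a node publishes a message for every child whose dependencies are now all satisfied, so nothing ever polls.&lt;/p&gt;

```python
def on_node_completed(node_id, parents, completed, publish):
    """When `node_id` finishes, enqueue every child whose parents are all done.

    `parents` maps each node to the set of nodes it depends on;
    `completed` is the set of finished nodes; `publish` stands in for the broker.
    """
    completed.add(node_id)
    for child, deps in parents.items():
        if node_id in deps and deps.issubset(completed):
            publish({"type": "ExecuteNodeMessage", "nodeId": child})
```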

&lt;p&gt;&lt;strong&gt;4. Handling Concurrency and Idempotency&lt;/strong&gt;&lt;br&gt;
Here is the massive catch with horizontal scaling: What if two different workers pick up the same flow at the exact same time?&lt;/p&gt;

&lt;p&gt;To keep execution idempotent (each node must run at most once), we implement Optimistic Concurrency Control in our database. We add a Version (RowVersion) column to our NodeInstance table. When a worker tries to move a node from Ready to Running, it executes:&lt;br&gt;
&lt;code&gt;UPDATE NodeInstance SET Status = 'Running', Version = Version + 1 WHERE Id = X AND Version = Y;&lt;/code&gt;&lt;br&gt;
If another worker has already claimed the job, the row version will have changed, the UPDATE will affect 0 rows, and the losing worker knows to safely drop the duplicate task.&lt;/p&gt;
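&lt;p&gt;The same claim logic, sketched outside SQL with an in-memory row standing in for the NodeInstance record:&lt;/p&gt;

```python
def try_claim(row, expected_version):
    """Atomically move a node from Ready to Running if the version still matches.

    Mirrors: UPDATE ... SET Status = 'Running', Version = Version + 1
             WHERE Id = X AND Version = Y   (0 rows affected means we lost the race)
    """
    if row["version"] == expected_version and row["status"] == "Ready":
        row["status"] = "Running"
        row["version"] += 1
        return True   # this worker owns the node
    return False      # another worker already claimed it; drop the duplicate
```

&lt;p&gt;In EF Core the same pattern can be expressed with a rowversion concurrency token, where the lost race surfaces as a DbUpdateConcurrencyException instead of a zero row count.&lt;/p&gt;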

&lt;p&gt;&lt;strong&gt;5. Reducing Latency with Distributed Caching&lt;/strong&gt;&lt;br&gt;
Parsing the raw definitionJson from MySQL on every execution is too expensive. We introduce Redis to cache the highly requested, pre-parsed DAG structures. When a worker picks up a flow, it grabs the pre-compiled execution plan from Redis in sub-milliseconds, bypassing the JSON deserialization tax entirely.&lt;/p&gt;
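&lt;p&gt;The cache-aside pattern in miniature (a dict stands in for Redis, and parse_fn for the JSON-to-DAG compilation):&lt;/p&gt;

```python
def get_execution_plan(flow_id, cache, load_json, parse_fn):
    """Cache-aside: return the pre-parsed DAG, parsing only on a cache miss."""
    plan = cache.get(flow_id)
    if plan is None:
        plan = parse_fn(load_json(flow_id))  # pay the deserialization tax once
        cache[flow_id] = plan                # real Redis: SET with a TTL
    return plan
```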

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
Scaling a system is rarely about writing "faster code"; it is about removing bottlenecks. By shifting from database polling to an event-driven queue, offloading work to stateless consumers, utilizing distributed caching, and enforcing optimistic concurrency, FlowForge evolves from a reliable prototype into an enterprise-grade orchestration engine.&lt;br&gt;
(Missed the first part? Check out the original build: &lt;a href="https://dev.to/ganesh_parella/building-flowforge-architecting-a-dag-based-workflow-engine-in-net-5ad4"&gt;How I Built a Fault-Tolerant DAG Workflow Engine in ASP.NET Core&lt;/a&gt;)&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>dotnet</category>
      <category>performance</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>How I Built a Fault-Tolerant DAG Workflow Engine in ASP.NET Core</title>
      <dc:creator>Ganesh Parella</dc:creator>
      <pubDate>Sat, 14 Mar 2026 10:03:13 +0000</pubDate>
      <link>https://dev.to/ganesh_parella/building-flowforge-architecting-a-dag-based-workflow-engine-in-net-5ad4</link>
      <guid>https://dev.to/ganesh_parella/building-flowforge-architecting-a-dag-based-workflow-engine-in-net-5ad4</guid>
      <description>&lt;p&gt;I recently built &lt;strong&gt;FlowForge&lt;/strong&gt;, a visual workflow automation platform where users can connect multiple external applications and orchestrate complex flows simply by dragging and dropping nodes onto a canvas.&lt;/p&gt;

&lt;p&gt;At first glance, building a Zapier-like clone seemed straightforward: build a React frontend to draw the boxes, and a backend to execute them. But as I got deeper into the architecture, handling state, concurrency, and fault tolerance turned this into a massive distributed systems challenge.&lt;/p&gt;

&lt;p&gt;Without any filler, let's dive into the architecture of how I actually built a workflow automation platform from scratch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;System Requirements&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Functional Requirements:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Users can authenticate with multiple third-party applications (Google, Slack, etc.) via OAuth.&lt;/li&gt;
&lt;li&gt;Users can build and configure workflows using a drag-and-drop canvas.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Non-Functional Requirements:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pluggability: Adding a new integration must be as simple as adding a new file, without modifying the core engine.&lt;/li&gt;
&lt;li&gt;Fault Tolerance (Persistence): Even if the server crashes unexpectedly mid-execution, the workflow must be able to resume exactly where it left off.&lt;/li&gt;
&lt;li&gt;Concurrent Branching: The engine must support complex dependencies (e.g., Node B and Node C can run simultaneously, but both must wait for Node A to finish).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Core Entities&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Flow: The blueprint of the workflow.&lt;/li&gt;
&lt;li&gt;Connection: External OAuth credentials and integration metadata.&lt;/li&gt;
&lt;li&gt;FlowInstance: A single, unique execution run of a Flow.&lt;/li&gt;
&lt;li&gt;NodeInstance: The execution state of an individual step within that flow.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Tech Stack &amp;amp; High-Level Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9tsbrrs87g7qyt96g1pw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9tsbrrs87g7qyt96g1pw.png" alt="flow-forge"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Frontend: React.js&lt;/li&gt;
&lt;li&gt;Backend: ASP.NET Core Web API&lt;/li&gt;
&lt;li&gt;Database: MySQL&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Core Services:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Client:&lt;/strong&gt; The visual node-based editor for the user interface.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;API Gateway:&lt;/strong&gt; Handles routing and load distribution to the backend services.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Flow Service:&lt;/strong&gt; Manages CRUD operations for the workflows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Connection Service:&lt;/strong&gt; Securely handles external application connections and token management.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Flow Engine:&lt;/strong&gt; The core "brain" of the system that parses and executes the workflows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deep Dive 1: The Evolution of the Core Engine&lt;/strong&gt;&lt;br&gt;
Building the execution engine was the biggest architectural bottleneck. I actually had to rewrite this component three separate times to get it right.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;V1: Linear Execution (The Naive Approach)&lt;/strong&gt;&lt;br&gt;
Initially, I just converted the workflow into a List and executed the nodes in sequence. This failed immediately. Real workflows don't always run in a straight line, and sometimes a flow gets triggered by an event in the middle of a chain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;V2: Topological Sorting (Kahn's Algorithm)&lt;/strong&gt;&lt;br&gt;
To fix the branching issue, I modeled the workflow as a Directed Acyclic Graph (DAG). By using Kahn's Algorithm for topological sorting, I could preserve the strict execution order, ensure dependencies were met, and detect invalid cycles.&lt;/p&gt;
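&lt;p&gt;For reference, Kahn's algorithm in compact form (illustrative Python, not the actual DagConvertor): repeatedly consume nodes with no remaining dependencies; if any node is never consumed, the graph has a cycle.&lt;/p&gt;

```python
from collections import deque

def kahn_topological_sort(nodes, edges):
    """Return nodes in dependency order; raise if the graph has a cycle."""
    indegree = {n: 0 for n in nodes}
    children = {n: [] for n in nodes}
    for src, dst in edges:
        children[src].append(dst)
        indegree[dst] += 1
    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    if len(order) != len(nodes):
        raise ValueError("Cycle detected: workflow is not a valid DAG")
    return order
```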

&lt;p&gt;However, I hit a massive limitation: State Loss. Because Kahn's algorithm was running entirely in server memory, if the ASP.NET server restarted, the entire execution vanished like it never existed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;V3: The Persistent DAG Scheduler (The Final Form)&lt;/strong&gt;&lt;br&gt;
To achieve true fault tolerance, I moved the state machine out of memory and into the MySQL database. The engine now operates as a persistent scheduler. It continuously polls the database to find nodes where all parent dependencies are marked as Completed, marks them as Ready, and executes them.&lt;/p&gt;

&lt;p&gt;Here is the actual C# implementation of the persistent engine:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt; &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;FlowEngine&lt;/span&gt;
 &lt;span class="p"&gt;{&lt;/span&gt;
     &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;readonly&lt;/span&gt; &lt;span class="n"&gt;IServiceProvider&lt;/span&gt; &lt;span class="n"&gt;_serviceProvider&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
     &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;readonly&lt;/span&gt; &lt;span class="n"&gt;ILogger&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;FlowEngine&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;_logger&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
     &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;readonly&lt;/span&gt; &lt;span class="n"&gt;IEnumerable&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;INode&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;_availableNodes&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
     &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;readonly&lt;/span&gt; &lt;span class="n"&gt;DagConvertor&lt;/span&gt; &lt;span class="n"&gt;_dagConvertor&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
     &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;readonly&lt;/span&gt; &lt;span class="n"&gt;NodeExecutionRepository&lt;/span&gt; &lt;span class="n"&gt;_nodeExecutionRepository&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
     &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;readonly&lt;/span&gt; &lt;span class="n"&gt;IFlowInstanceRepository&lt;/span&gt; &lt;span class="n"&gt;_flowInstanceRepository&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

     &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="nf"&gt;FlowEngine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
         &lt;span class="n"&gt;IServiceProvider&lt;/span&gt; &lt;span class="n"&gt;serviceProvider&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="n"&gt;ILogger&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;FlowEngine&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="n"&gt;IEnumerable&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;INode&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;availableNodes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="n"&gt;DagConvertor&lt;/span&gt; &lt;span class="n"&gt;dagConvertor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="n"&gt;NodeExecutionRepository&lt;/span&gt; &lt;span class="n"&gt;nodeExecutionRepository&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="n"&gt;IFlowInstanceRepository&lt;/span&gt; &lt;span class="n"&gt;flowInstanceRepository&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
     &lt;span class="p"&gt;{&lt;/span&gt;
         &lt;span class="n"&gt;_serviceProvider&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;serviceProvider&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
         &lt;span class="n"&gt;_logger&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
         &lt;span class="n"&gt;_availableNodes&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;availableNodes&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
         &lt;span class="n"&gt;_dagConvertor&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dagConvertor&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
         &lt;span class="n"&gt;_nodeExecutionRepository&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nodeExecutionRepository&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
         &lt;span class="n"&gt;_flowInstanceRepository&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;flowInstanceRepository&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
     &lt;span class="p"&gt;}&lt;/span&gt;

     &lt;span class="c1"&gt;// Allows service to reuse DAG building logic&lt;/span&gt;
     &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="n"&gt;Dag&lt;/span&gt; &lt;span class="nf"&gt;BuildDag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ParsedFlow&lt;/span&gt; &lt;span class="n"&gt;parsedFlow&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
     &lt;span class="p"&gt;{&lt;/span&gt;
         &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;_dagConvertor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ParsedFlowToDag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parsedFlow&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
     &lt;span class="p"&gt;}&lt;/span&gt;

     &lt;span class="c1"&gt;// ============================================================&lt;/span&gt;
     &lt;span class="c1"&gt;// 🚀 Persistent DAG Scheduler&lt;/span&gt;
     &lt;span class="c1"&gt;// ============================================================&lt;/span&gt;

     &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="n"&gt;Task&lt;/span&gt; &lt;span class="nf"&gt;RunPersistentAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
         &lt;span class="n"&gt;Guid&lt;/span&gt; &lt;span class="n"&gt;flowInstanceId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="n"&gt;ParsedFlow&lt;/span&gt; &lt;span class="n"&gt;parsedFlow&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;clerkUserId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="n"&gt;Dictionary&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;object&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;?&lt;/span&gt; &lt;span class="n"&gt;initialPayload&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
     &lt;span class="p"&gt;{&lt;/span&gt;
         &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;initialPayload&lt;/span&gt; &lt;span class="p"&gt;??&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;Dictionary&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;object&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;();&lt;/span&gt;
         &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_dagConvertor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ParsedFlowToDag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parsedFlow&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

         &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
         &lt;span class="p"&gt;{&lt;/span&gt;
             &lt;span class="c1"&gt;// Fail fast&lt;/span&gt;
             &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;_nodeExecutionRepository&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AnyFailedAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;flowInstanceId&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
             &lt;span class="p"&gt;{&lt;/span&gt;
                 &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;_flowInstanceRepository&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;MarkFailedAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;flowInstanceId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                 &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
             &lt;span class="p"&gt;}&lt;/span&gt;

             &lt;span class="c1"&gt;// Complete if done&lt;/span&gt;
             &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;_nodeExecutionRepository&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AllCompletedAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;flowInstanceId&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
             &lt;span class="p"&gt;{&lt;/span&gt;
                 &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;_flowInstanceRepository&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;MarkCompletedAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;flowInstanceId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                 &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
             &lt;span class="p"&gt;}&lt;/span&gt;

             &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;nodeExecution&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt;
                 &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;_nodeExecutionRepository&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GetNextReadyNodeAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;flowInstanceId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

             &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nodeExecution&lt;/span&gt; &lt;span class="p"&gt;==&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
             &lt;span class="p"&gt;{&lt;/span&gt;
                 &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Delay&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;50&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                 &lt;span class="k"&gt;continue&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
             &lt;span class="p"&gt;}&lt;/span&gt;

             &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;_nodeExecutionRepository&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;MarkRunningAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nodeExecution&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

             &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;parsedNode&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt;
                 &lt;span class="n"&gt;parsedFlow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Nodes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;First&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Id&lt;/span&gt; &lt;span class="p"&gt;==&lt;/span&gt; &lt;span class="n"&gt;nodeExecution&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NodeId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

             &lt;span class="k"&gt;try&lt;/span&gt;
             &lt;span class="p"&gt;{&lt;/span&gt;
                 &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;ExecuteNodeAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                     &lt;span class="n"&gt;parsedNode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                     &lt;span class="n"&gt;parsedFlow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                     &lt;span class="n"&gt;clerkUserId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                     &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

                 &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;_nodeExecutionRepository&lt;/span&gt;
                     &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;MarkCompletedAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nodeExecution&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

                 &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="p"&gt;!=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                 &lt;span class="p"&gt;{&lt;/span&gt;
                     &lt;span class="k"&gt;foreach&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;kvp&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                         &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;kvp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kvp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Value&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                 &lt;span class="p"&gt;}&lt;/span&gt;

                 &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;UnlockChildrenAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                     &lt;span class="n"&gt;flowInstanceId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                     &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                     &lt;span class="n"&gt;parsedNode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
             &lt;span class="p"&gt;}&lt;/span&gt;
             &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Exception&lt;/span&gt; &lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
             &lt;span class="p"&gt;{&lt;/span&gt;
                 &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;_nodeExecutionRepository&lt;/span&gt;
                     &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;MarkFailedAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nodeExecution&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Message&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
             &lt;span class="p"&gt;}&lt;/span&gt;
         &lt;span class="p"&gt;}&lt;/span&gt;
     &lt;span class="p"&gt;}&lt;/span&gt;

     &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="n"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Dictionary&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;object&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;?&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;ExecuteNodeAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
         &lt;span class="n"&gt;ParsedNode&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;flowName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;clerkUserId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="n"&gt;Dictionary&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;object&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
     &lt;span class="p"&gt;{&lt;/span&gt;
         &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;executor&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt;
             &lt;span class="n"&gt;_availableNodes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;FirstOrDefault&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Type&lt;/span&gt; &lt;span class="p"&gt;==&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

         &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;executor&lt;/span&gt; &lt;span class="p"&gt;==&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
             &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;InvalidOperationException&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                 &lt;span class="s"&gt;$"Executor for node type '&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s"&gt;' not found."&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

         &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;FlowExecutionContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
             &lt;span class="n"&gt;clerkUserId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
             &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
             &lt;span class="nf"&gt;ConvertToJsonElement&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Data&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

         &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ExecuteAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_serviceProvider&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
     &lt;span class="p"&gt;}&lt;/span&gt;

     &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="n"&gt;Task&lt;/span&gt; &lt;span class="nf"&gt;UnlockChildrenAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
         &lt;span class="n"&gt;Guid&lt;/span&gt; &lt;span class="n"&gt;flowInstanceId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="n"&gt;Dag&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;completedNodeId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
     &lt;span class="p"&gt;{&lt;/span&gt;
         &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(!&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AdjList&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;TryGetValue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;completedNodeId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;out&lt;/span&gt; &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;children&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
             &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

         &lt;span class="k"&gt;foreach&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;child&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;children&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
         &lt;span class="p"&gt;{&lt;/span&gt;
             &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(!&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ReverseAdjList&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;TryGetValue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;child&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;out&lt;/span&gt; &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;parents&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
                 &lt;span class="k"&gt;continue&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

             &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;allParentsCompleted&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt;
                 &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;_nodeExecutionRepository&lt;/span&gt;
                     &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AreAllParentsCompletedAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;flowInstanceId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parents&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

             &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;allParentsCompleted&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
             &lt;span class="p"&gt;{&lt;/span&gt;
                 &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;_nodeExecutionRepository&lt;/span&gt;
                     &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;MarkReadyAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;flowInstanceId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;child&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
             &lt;span class="p"&gt;}&lt;/span&gt;
         &lt;span class="p"&gt;}&lt;/span&gt;
     &lt;span class="p"&gt;}&lt;/span&gt;

     &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="n"&gt;JsonElement&lt;/span&gt; &lt;span class="nf"&gt;ConvertToJsonElement&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;object&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
     &lt;span class="p"&gt;{&lt;/span&gt;
         &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt; &lt;span class="p"&gt;==&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
         &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;JsonSerializer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Serialize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
         &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;JsonSerializer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Deserialize&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;JsonElement&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
     &lt;span class="p"&gt;}&lt;/span&gt;
 &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Deep Dive 2: Building a Pluggable Architecture&lt;/strong&gt;&lt;br&gt;
A workflow engine is useless if it is hard to add new integrations. To make the system truly pluggable, I utilized the Factory Pattern and .NET Reflection.&lt;/p&gt;

&lt;p&gt;Instead of hardcoding a massive switch statement or storing the node executors in a static list, I made every node type (Action, Trigger, Conditional) inherit from an INode interface. At startup, the application scans the assembly to find all implementations and registers them dynamically in the Dependency Injection container.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt; &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ServiceCollectionExtenctions&lt;/span&gt;
 &lt;span class="p"&gt;{&lt;/span&gt;
     &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="n"&gt;IServiceCollection&lt;/span&gt; &lt;span class="nf"&gt;AddNodeServices&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt; &lt;span class="n"&gt;IServiceCollection&lt;/span&gt; &lt;span class="n"&gt;services&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
     &lt;span class="p"&gt;{&lt;/span&gt;
         &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;NodeType&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;typeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;INode&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
         &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;implementations&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AppDomain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CurrentDomain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GetAssemblies&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
             &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;SelectMany&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GetTypes&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
             &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;NodeType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;IsAssignableFrom&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;IsClass&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;IsAbstract&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
         &lt;span class="k"&gt;foreach&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;implementation&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;implementations&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
         &lt;span class="p"&gt;{&lt;/span&gt;
             &lt;span class="n"&gt;services&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddScoped&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;typeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;INode&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;implementation&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
         &lt;span class="p"&gt;}&lt;/span&gt;
         &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;services&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
     &lt;span class="p"&gt;}&lt;/span&gt;
 &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, adding a new Slack integration is as simple as creating a new class that implements INode. The engine handles the rest automatically.&lt;/p&gt;
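&lt;p&gt;The engine itself is C#, but purely as an illustration of the same discover-and-register pattern, here is a minimal Python sketch (the &lt;code&gt;Node&lt;/code&gt;, &lt;code&gt;SlackNode&lt;/code&gt;, and &lt;code&gt;EmailNode&lt;/code&gt; names are hypothetical):&lt;/p&gt;

```python
from abc import ABC, abstractmethod

class Node(ABC):
    """Python analogue of the engine's INode interface (names hypothetical)."""
    type = "base"

    @abstractmethod
    def execute(self, context):
        ...

class SlackNode(Node):
    type = "slack"
    def execute(self, context):
        return {"sent": True, "channel": context.get("channel")}

class EmailNode(Node):
    type = "email"
    def execute(self, context):
        return {"sent": True, "to": context.get("to")}

def discover_nodes():
    # Mirrors the reflection scan: find every concrete subclass of Node
    # and register one executor per node type.
    return {cls.type: cls() for cls in Node.__subclasses__()}

registry = discover_nodes()
```

&lt;p&gt;Dropping a new subclass into the codebase is all it takes; the registry picks it up on the next startup, which is exactly the property the C# assembly scan gives us.&lt;/p&gt;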

&lt;p&gt;&lt;strong&gt;Deep Dive 3: Ensuring Concurrent Execution&lt;/strong&gt;&lt;br&gt;
If you look at the engine logic, you will notice we retrieve the status of each node directly from the database.&lt;/p&gt;

&lt;p&gt;If a parent node completes its execution and unlocks two distinct child nodes, both of those children will independently transition to the Ready state. Because the engine processes ready nodes asynchronously, both child branches will execute concurrently without blocking each other.&lt;/p&gt;
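&lt;p&gt;That unlock step can be sketched as follows; this is a simplified, in-memory stand-in for the repository calls in &lt;code&gt;UnlockChildrenAsync&lt;/code&gt; above, with names chosen for illustration:&lt;/p&gt;

```python
def unlock_children(adj, reverse_adj, status, completed_node):
    """Mark each child of completed_node as 'ready' once every one of its
    parents is 'completed' (mirrors the engine's UnlockChildrenAsync)."""
    ready = []
    for child in adj.get(completed_node, []):
        parents = reverse_adj.get(child, [])
        if all(status.get(p) == "completed" for p in parents):
            status[child] = "ready"
            ready.append(child)
    return ready
```

&lt;p&gt;In a diamond DAG (A fans out to B and C, both feeding D), completing B alone leaves D locked; only when C also completes does D flip to ready, and both B and C branches run concurrently in the meantime.&lt;/p&gt;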

&lt;p&gt;&lt;strong&gt;Trade-Offs &amp;amp; Scaling Bottlenecks&lt;/strong&gt;&lt;br&gt;
Building this highlighted some clear scaling limits in my current architecture:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ThreadPool Exhaustion:&lt;/strong&gt; Right now, I am executing nodes asynchronously in memory, inside the web application's own process. If the platform scales to 100,000 concurrent users running workflows, we will exhaust the .NET thread pool. To solve this, I would need to decouple execution by introducing a Message Queue (like RabbitMQ) and dedicated background worker services.&lt;/p&gt;
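&lt;p&gt;A minimal sketch of that decoupling, using an in-process queue and worker threads as a stand-in for RabbitMQ consumers (purely illustrative, not the engine's actual code):&lt;/p&gt;

```python
import queue
import threading

jobs = queue.Queue()
results = []
results_lock = threading.Lock()

def worker():
    # A dedicated consumer drains the queue at its own pace instead of the
    # web request spawning one in-memory task per node execution.
    while True:
        node_id = jobs.get()
        if node_id is None:          # poison pill: shut this worker down
            jobs.task_done()
            return
        with results_lock:
            results.append(node_id)  # stand-in for executing the node
        jobs.task_done()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for i in range(100):                 # a burst of work lands in the queue
    jobs.put(i)
for _ in threads:
    jobs.put(None)
for t in threads:
    t.join()
```

&lt;p&gt;The key property: the burst never translates into 100 concurrent in-memory executions; the pool of workers sets the ceiling, and a real broker would additionally persist the backlog across restarts.&lt;/p&gt;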

&lt;p&gt;&lt;strong&gt;Database Write Limits:&lt;/strong&gt; Flow execution is incredibly write-heavy (updating status to Running, Completed, Failed every few milliseconds). MySQL is great, but at massive scale, it will hit write-throughput bottlenecks. Shifting the execution state storage to a Key-Value database (like DynamoDB or Redis) would be necessary for global scalability.&lt;/p&gt;
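&lt;p&gt;One way to relieve that write pressure is to buffer hot status updates in a key-value cache and flush them to the database in batches. A toy sketch of the idea (it only counts batched writes; a real write-behind layer also has to handle crash recovery):&lt;/p&gt;

```python
class WriteBehindStateStore:
    """Keep hot node-status writes in a key-value cache and flush them to
    the database in batches, cutting per-update DB round trips."""
    def __init__(self, flush_every=100):
        self.cache = {}
        self.dirty = set()
        self.flush_every = flush_every
        self.db_writes = 0

    def set_status(self, key, status):
        self.cache[key] = status
        self.dirty.add(key)
        if len(self.dirty) >= self.flush_every:
            self.flush()

    def flush(self):
        if self.dirty:
            self.db_writes += 1   # one batched write instead of many
            self.dirty.clear()
```

&lt;p&gt;A thousand per-node status transitions collapse into ten batched writes here; the trade-off is a small window of state that lives only in the cache, which is exactly why Redis persistence or DynamoDB becomes attractive at that tier.&lt;/p&gt;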

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
Building FlowForge forced me to move beyond basic CRUD applications and tackle real distributed system challenges. By leveraging topological sorting for execution order, a database-backed state machine for fault tolerance, and .NET reflection for pluggability, the engine is robust and highly extensible.&lt;/p&gt;

&lt;p&gt;You can explore the complete architecture and source code here: &lt;a href="https://github.com/Ganesh-parella/Flow-Forge" rel="noopener noreferrer"&gt;FlowForge on GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;What are your thoughts on using database polling for the DAG scheduler versus an event-driven message queue? Let me know in the comments!&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>automation</category>
      <category>dotnet</category>
      <category>showdev</category>
    </item>
    <item>
      <title>Designing Uber: Geospatial Indexing, WebSockets, and Distributed Locks</title>
      <dc:creator>Ganesh Parella</dc:creator>
      <pubDate>Fri, 13 Mar 2026 20:27:05 +0000</pubDate>
      <link>https://dev.to/ganesh_parella/designing-uber-geospatial-indexing-websockets-and-distributed-locks-4mhb</link>
      <guid>https://dev.to/ganesh_parella/designing-uber-geospatial-indexing-websockets-and-distributed-locks-4mhb</guid>
      <description>&lt;p&gt;Designing a platform like Uber might seem straightforward at first glance—just match a rider with a driver, right? But when you get into the details of real-time location tracking, geospatial querying, and concurrent bookings, it becomes an incredibly hard system to scale and maintain.&lt;/p&gt;

&lt;p&gt;Without any filler, let's dive into the architecture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;System Requirements&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Functional Requirements:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Users can input a source and destination to calculate a fare.&lt;/li&gt;
&lt;li&gt;Users can view nearby available drivers in real-time.&lt;/li&gt;
&lt;li&gt;Users can book a ride.&lt;/li&gt;
&lt;li&gt;Drivers can accept or reject ride requests.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Non-Functional Requirements:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Strict Consistency (for matching): Two drivers cannot accept the exact same ride.&lt;/li&gt;
&lt;li&gt;Low Latency: Ride matching must happen in &amp;lt; 1 minute.&lt;/li&gt;
&lt;li&gt;High Availability: Location tracking and routing must remain highly available.&lt;/li&gt;
&lt;li&gt;Scalability: Must support millions of concurrent users and high-frequency location updates.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Core Entities&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User &lt;/li&gt;
&lt;li&gt;Ride&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;High-Level Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F25sysi22l6i4kmi1ddaf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F25sysi22l6i4kmi1ddaf.png" alt="uber" width="800" height="523"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is a look at the core components of our system:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Load Balancer / API Gateway:&lt;/strong&gt; Distributes incoming traffic and routes requests to the appropriate backend microservices.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WebSocket Servers:&lt;/strong&gt; Plain request/response HTTP can't push data to clients, which rules it out for real-time tracking. We use WebSockets to maintain a persistent, bi-directional connection with the drivers so we can push ride requests to them instantly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Matching Service:&lt;/strong&gt; The core engine that runs the matching algorithm to pair a rider with the optimal nearby driver.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;External Map Provider (Google Maps/Mapbox API):&lt;/strong&gt; Used to calculate the optimal route, estimated time of arrival (ETA), and the trip fare.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Spatial Database (Redis / QuadTree):&lt;/strong&gt; A specialized data store designed to hold the real-time geographical coordinates (Latitude/Longitude) of all active drivers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Database (NoSQL):&lt;/strong&gt; We use a highly scalable Key-Value database (like DynamoDB) to store driver statuses and trip metadata, as this system is extremely write-heavy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deep Dive: The Core Engineering Challenges&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;1. Tracking Location: QuadTrees vs. Geohashing&lt;/strong&gt;&lt;br&gt;
To show users the cars moving on their screen, drivers ping our servers with their GPS coordinates every 4 seconds. A traditional SQL database cannot absorb that write volume while also answering radius queries quickly, so we need a Spatial Index.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Geohashing (Redis GEO):&lt;/strong&gt; This divides the map into a fixed grid of varying resolutions. It is incredibly fast for querying "find all drivers within a 3km radius" and is a standard choice for high-throughput, real-time location caching.&lt;/p&gt;
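&lt;p&gt;Conceptually, a radius query is just "distance from me to every driver, keep the close ones"; geohashing makes it fast by pruning the scan to nearby grid cells. A naive linear-scan sketch using the haversine distance (coordinates and driver IDs are illustrative):&lt;/p&gt;

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two GPS points, in kilometres.
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def drivers_within(drivers, lat, lon, radius_km):
    """Naive O(n) scan; Redis GEOSEARCH answers the same question against a
    geohash-indexed sorted set, touching only nearby grid cells."""
    return [d for d, (dlat, dlon) in drivers.items()
            if haversine_km(lat, lon, dlat, dlon) <= radius_km]
```

&lt;p&gt;The pure scan is fine for a few thousand drivers in one city; the geohash index is what keeps it fast when the fleet is in the millions.&lt;/p&gt;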

&lt;p&gt;&lt;strong&gt;QuadTrees:&lt;/strong&gt; An alternative tree data structure that dynamically subdivides map regions. It is excellent for unevenly distributed data (e.g., millions of drivers in a dense city center, but very few in a rural area).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Concurrency &amp;amp; Idempotency: The "Double Booking" Problem&lt;/strong&gt;&lt;br&gt;
What happens if the Matching Service sends a ride request to three nearby drivers, and two drivers hit "Accept" at the exact same millisecond?&lt;/p&gt;

&lt;p&gt;To prevent double-booking, we must ensure our system is strictly consistent and idempotent. We achieve this using Optimistic Concurrency Control or a Distributed Lock (like Redis Redlock) on the Database. When a driver accepts the ride, the database checks a version number or a lock status. The first request successfully updates the ride status to "Accepted" and assigns the driver ID. The second request is rejected, ensuring only one driver gets the trip.&lt;/p&gt;
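&lt;p&gt;A compare-and-set sketch of that accept path; here a threading lock stands in for the database's atomic conditional update, and the ride/driver IDs are illustrative:&lt;/p&gt;

```python
import threading

class RideStore:
    """Optimistic-concurrency sketch: an accept succeeds only if the ride's
    version is unchanged since the driver read it (compare-and-set)."""
    def __init__(self):
        self._lock = threading.Lock()   # stands in for the DB's atomic update
        self.rides = {"ride-1": {"status": "requested", "driver_id": None, "version": 0}}

    def accept(self, ride_id, driver_id, expected_version):
        with self._lock:
            ride = self.rides[ride_id]
            if ride["version"] != expected_version or ride["status"] != "requested":
                return False            # someone else won the race
            ride.update(status="accepted", driver_id=driver_id,
                        version=expected_version + 1)
            return True

store = RideStore()
first = store.accept("ride-1", "driver-A", 0)
second = store.accept("ride-1", "driver-B", 0)   # stale version: rejected
```

&lt;p&gt;Both drivers read version 0, but only the first conditional update commits; the loser gets a clean rejection it can surface as "ride no longer available".&lt;/p&gt;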

&lt;p&gt;&lt;strong&gt;3. Handling Massive Traffic Spikes&lt;/strong&gt;&lt;br&gt;
What if 50,000 people request a ride at the exact same moment after a major sports game ends? If these requests hit our Matching Server directly, it will crash.&lt;/p&gt;

&lt;p&gt;To make our system resilient to traffic spikes, we place a Message Queue (like Kafka) directly behind the API Gateway. When a user requests a ride, the request is instantly dropped into the queue. The user gets a "Finding your ride..." screen. Our Matching Servers then consume these messages at their maximum safe capacity, ensuring the system never gets overwhelmed and no ride requests are lost.&lt;/p&gt;
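&lt;p&gt;The buffering effect can be sketched as a burst that lands in a queue instantly and is drained at a fixed consumption rate (the numbers are illustrative):&lt;/p&gt;

```python
from collections import deque

def handle_spike(requests, max_per_tick):
    """Buffer a burst in a queue and let the matchers drain it at a safe,
    fixed rate instead of absorbing the whole spike at once."""
    q = deque(requests)          # the API gateway enqueues instantly
    ticks = 0
    matched = []
    while q:
        batch = [q.popleft() for _ in range(min(max_per_tick, len(q)))]
        matched.extend(batch)    # matching service consumes at capacity
        ticks += 1
    return matched, ticks
```

&lt;p&gt;A 50,000-request spike with a 5,000-requests-per-tick matcher takes 10 ticks to clear; every rider waits a little longer, but nothing crashes and nothing is lost.&lt;/p&gt;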

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
Ride-sharing architectures are a beautiful blend of heavy read-write throughput, complex geospatial mathematics, and strict transactional consistency. By leveraging WebSockets for real-time communication, spatial caching for location tracking, and Message Queues for peak load management, we can build a highly scalable platform.&lt;/p&gt;

&lt;p&gt;Would you prefer using Redis Geohashing or building a custom QuadTree service for location tracking? Let me know your thoughts in the comments!&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>backend</category>
      <category>distributedsystems</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>How to Design YouTube: CDNs, Transcoding, and the Hot Video Problem</title>
      <dc:creator>Ganesh Parella</dc:creator>
      <pubDate>Tue, 10 Mar 2026 12:16:07 +0000</pubDate>
      <link>https://dev.to/ganesh_parella/how-to-design-youtube-cdns-transcoding-and-the-hot-video-problem-cm0</link>
      <guid>https://dev.to/ganesh_parella/how-to-design-youtube-cdns-transcoding-and-the-hot-video-problem-cm0</guid>
      <description>&lt;p&gt;If you read my previous post about designing a News Feed system, you might be wondering: what makes a video streaming platform any different? While a news feed handles text and small image payloads, streaming 4K video globally is an entirely different beast.&lt;/p&gt;

&lt;p&gt;Without any delay, let's break down the system architecture in a structured manner.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;System Requirements&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Functional Requirements:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Users can upload and post videos.&lt;/li&gt;
&lt;li&gt;Users can watch/stream videos smoothly.&lt;/li&gt;
&lt;li&gt;Users can like and comment on videos.&lt;/li&gt;
&lt;li&gt;Users can subscribe to other creators.&lt;/li&gt;
&lt;li&gt;Videos must be available in multiple qualities (240p, 480p, 720p, 1080p, 4K) depending on the user's internet speed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Non-Functional Requirements:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Low Latency: Video playback should start in &amp;lt; 2 seconds.&lt;/li&gt;
&lt;li&gt;Highly Scalable: Must support up to a billion users.&lt;/li&gt;
&lt;li&gt;High Availability: The system must remain accessible (favoring availability over strict consistency).&lt;/li&gt;
&lt;li&gt;Eventual Consistency: It is perfectly fine if a user's subscriber count takes a few seconds to update globally.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Core Entities&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User&lt;/li&gt;
&lt;li&gt;Video&lt;/li&gt;
&lt;li&gt;Like&lt;/li&gt;
&lt;li&gt;Comment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Core API Endpoints&lt;/strong&gt;&lt;br&gt;
Keeping it RESTful, our core endpoints would look something like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;POST /v1/videos (Request upload URL and submit metadata)&lt;/li&gt;
&lt;li&gt;GET /v1/videos/{video_id} (Fetch video stream and metadata)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;High-Level Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuxjfcgq8maxfkdzmfb4t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuxjfcgq8maxfkdzmfb4t.png" alt="Video Streaming" width="800" height="462"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The diagram above illustrates the high-level architecture of our system:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Load Balancer / API Gateway:&lt;/strong&gt; Distributes incoming traffic evenly across our stateless backend servers to prevent any single point of failure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Blob Storage (Amazon S3):&lt;/strong&gt; We cannot store massive 10GB+ video files in a traditional SQL or Key-Value database. Instead, the actual video files are stored in object storage like S3.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DynamoDB (Metadata Store):&lt;/strong&gt; Since we only need to store the metadata of the video (Title, Uploader ID, S3 URL, Likes) and don't require strict ACID properties, a highly scalable Key-Value database like DynamoDB is the perfect fit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Transcoding Pipeline (Chunkers):&lt;/strong&gt; To stream video seamlessly, we don't just send one massive file. We pass the uploaded video to a background service that chunks it into 3-second segments and transcodes it into different resolutions (1080p, 720p, 240p).&lt;/p&gt;
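&lt;p&gt;A toy sketch of that fan-out: compute the 3-second segment boundaries for each target resolution. (Real pipelines hand the actual encoding to a tool like ffmpeg; this only models the segment plan.)&lt;/p&gt;

```python
def build_renditions(duration_s, segment_s=3, resolutions=("1080p", "720p", "240p")):
    """Plan the transcoding fan-out: one list of fixed-length (start, end)
    segments per target resolution."""
    starts = range(0, duration_s, segment_s)
    return {res: [(t, min(t + segment_s, duration_s)) for t in starts]
            for res in resolutions}
```

&lt;p&gt;Each (resolution, segment) pair becomes an independent unit of work, which is what lets a fleet of transcoding workers chew through a single upload in parallel.&lt;/p&gt;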

&lt;p&gt;&lt;strong&gt;Step-by-Step Data Flow&lt;/strong&gt;&lt;br&gt;
To really understand this architecture, let's walk through the exact lifecycle of the two most important actions in our system: uploading a video and watching a video.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The Write Path (Uploading a Video)&lt;/strong&gt;&lt;br&gt;
When a creator uploads a new video, here is exactly what happens behind the scenes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Request Permission:&lt;/strong&gt; The client app sends a request to our API Gateway to upload a video.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pre-signed URL Issued:&lt;/strong&gt; The API Server responds with a secure, temporary Pre-signed S3 URL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Direct Upload:&lt;/strong&gt; The client bypasses our servers and uploads the massive video file directly into our "Raw Videos" S3 bucket.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Event Triggered:&lt;/strong&gt; Once S3 finishes receiving the file, it fires an event directly into our Message Queue (e.g., Kafka or RabbitMQ).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transcoding Pipeline:&lt;/strong&gt; Our background Transcoding Workers pick up the event, pull the raw video from S3, and convert it into various resolutions (1080p, 720p, etc.) and chunks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Final Storage &amp;amp; DB Update:&lt;/strong&gt; The workers save the processed chunks into a "Transcoded Videos" S3 bucket and update DynamoDB with the final metadata (URLs, formats available, uploader ID).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. The Read Path (Streaming a Video)&lt;/strong&gt;&lt;br&gt;
When a user clicks on a thumbnail to watch a video, speed is everything:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fetch Metadata:&lt;/strong&gt; The client requests the video details from the API Server.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache Check:&lt;/strong&gt; The server checks Redis. If the video is popular, the metadata (title, S3/CDN URLs) is instantly returned. If it’s a cache miss, it fetches it from DynamoDB, updates Redis, and returns it to the client.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stream Request:&lt;/strong&gt; The client's video player uses the returned URL to request the actual video chunks from the closest CDN edge server.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Video Delivery:&lt;/strong&gt; If the CDN has the chunks (Cache Hit), the video plays instantly. If not (Cache Miss), the CDN fetches the chunks from our Transcoded S3 bucket, caches them locally for the next user, and streams them to the client.&lt;/li&gt;
&lt;/ul&gt;
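&lt;p&gt;The cache check in step 2 is the classic cache-aside pattern; here is a minimal sketch with a plain dict standing in for both Redis and DynamoDB:&lt;/p&gt;

```python
class MetadataCache:
    """Cache-aside read path: check the cache first, fall back to the
    database on a miss, then populate the cache for the next reader."""
    def __init__(self, db):
        self.db = db                      # stand-in for DynamoDB
        self.cache = {}                   # stand-in for Redis
        self.hits = self.misses = 0

    def get_video(self, video_id):
        if video_id in self.cache:
            self.hits += 1
            return self.cache[video_id]
        self.misses += 1
        meta = self.db[video_id]          # e.g. a DynamoDB GetItem
        self.cache[video_id] = meta
        return meta
```

&lt;p&gt;Only the first reader of a video pays the database round trip; every subsequent request for a popular video is served from memory, which is the whole defence against the hot-video stampede discussed below.&lt;/p&gt;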

&lt;p&gt;&lt;strong&gt;The Hard Parts: Trade-Offs &amp;amp; Bottlenecks&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;1. The Upload Bottleneck (Bypassing the API)&lt;/strong&gt;&lt;br&gt;
Many of you might be wondering: Why are we storing the video directly in S3 instead of sending it through our API Gateway?&lt;/p&gt;

&lt;p&gt;If millions of users try to upload 10GB video files directly through our backend API servers, the network I/O will immediately crash our system. Instead, we use Pre-signed URLs. The client asks our API for permission, the API grants a secure, temporary S3 URL, and the client uploads the heavy video chunks directly to S3, bypassing our servers entirely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Achieving &amp;lt; 2s Latency (The Power of the CDN)&lt;/strong&gt;&lt;br&gt;
You might instantly think that using a Redis cache is the perfect way to decrease video load times. But caching a 4K video in Redis isn't practical.&lt;/p&gt;

&lt;p&gt;To achieve zero-buffering streaming globally, we use a CDN (Content Delivery Network). The transcoded video chunks are copied to edge servers all around the world. If a user in India watches a video uploaded in the US, the CDN serves the video from a server right down the street from them, effectively eliminating latency.&lt;/p&gt;

&lt;p&gt;Coupled with Adaptive Bitrate Streaming (ABR), the video player automatically switches between quality chunks (e.g., dropping from 1080p to 480p) if the user's internet speed drops, ensuring the video never stops to buffer.&lt;/p&gt;
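&lt;p&gt;The player-side choice boils down to "highest rendition that fits within a safety fraction of measured bandwidth". A sketch (the bitrate table is illustrative, not YouTube's actual ladder):&lt;/p&gt;

```python
RENDITIONS = [("240p", 0.4), ("480p", 1.5), ("720p", 3.0), ("1080p", 6.0)]  # Mbps

def pick_rendition(measured_mbps, headroom=0.8):
    """Adaptive bitrate sketch: choose the best rendition whose bitrate fits
    within a safety fraction of the measured bandwidth."""
    budget = measured_mbps * headroom
    best = RENDITIONS[0][0]               # never go below the lowest quality
    for name, mbps in RENDITIONS:
        if mbps <= budget:
            best = name
    return best
```

&lt;p&gt;Because the video is already stored as independent per-quality chunks, the player can re-run this decision at every 3-second segment boundary and switch ladders mid-stream without a rebuffer.&lt;/p&gt;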

&lt;p&gt;&lt;strong&gt;3. Scaling to a Billion Users&lt;/strong&gt;&lt;br&gt;
Scaling this system horizontally is remarkably straightforward:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Amazon S3 provides virtually infinite storage capacity.&lt;/li&gt;
&lt;li&gt;Our backend API servers are stateless, meaning we can simply spin up more instances behind the Load Balancer as traffic increases.&lt;/li&gt;
&lt;li&gt;DynamoDB partitions data automatically, though we could implement consistent hashing if we needed to scale a custom database cluster.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. The "Hot Video" Problem&lt;/strong&gt;&lt;br&gt;
Imagine a massive creator uploads a video and a million users try to access it within 2 seconds.&lt;/p&gt;

&lt;p&gt;Our CDN will easily handle the load of serving the actual video file. But what about our DynamoDB instance? A million simultaneous reads for the video's metadata (Title, View Count, Likes) will cause database throttling. To solve this, we introduce Redis. We cache the metadata of highly popular videos in Redis with multiple read replicas, completely shielding our main database from the viral traffic spike.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. The "Long-Tail" Problem: CDN Cost vs. Performance&lt;/strong&gt;&lt;br&gt;
We established that pushing videos to a CDN provides a latency-free experience. But CDNs are incredibly expensive.&lt;/p&gt;

&lt;p&gt;YouTube has billions of videos, but 80% of the daily traffic comes from only 20% of the videos (the viral hits and new releases). The remaining 80% are "long-tail" videos—perhaps a tutorial uploaded 5 years ago that gets 2 views a month.&lt;/p&gt;

&lt;p&gt;The Trade-Off: Should we cache every single video in our CDN? No. Pushing dead, unwatched videos to expensive edge servers worldwide would bankrupt the company. Instead, we use an intelligent eviction policy. We aggressively cache the "hot" 20% of videos in the CDN. For the "long-tail" videos, we accept a slightly higher latency and stream them directly from our S3 storage, saving millions of dollars in infrastructure costs.&lt;/p&gt;
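&lt;p&gt;That eviction policy is essentially LRU at the edge. A minimal sketch where cold long-tail requests fall through to origin (capacity and video IDs are illustrative):&lt;/p&gt;

```python
from collections import OrderedDict

class EdgeCache:
    """LRU eviction sketch for a CDN edge server: keep only recently
    watched videos; cold long-tail requests fall through to origin (S3)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()
        self.origin_fetches = 0

    def get(self, video_id):
        if video_id in self.store:
            self.store.move_to_end(video_id)   # refresh recency
            return "edge-hit"
        self.origin_fetches += 1               # miss: pull from S3 origin
        self.store[video_id] = True
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)     # evict least recently used
        return "origin-fetch"
```

&lt;p&gt;Hot videos keep refreshing their recency and stay pinned at the edge; a 2-views-a-month tutorial gets evicted, pays the origin latency on its rare views, and costs nothing in between.&lt;/p&gt;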

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
Designing a video streaming platform is a masterclass in decoupling and asynchronous processing. By keeping heavy media out of our API servers, utilizing a background event-driven transcoding pipeline, and intelligently routing traffic between CDNs for hot videos and S3 for long-tail content, we can build a resilient system capable of entertaining a billion users without a single moment of buffering.&lt;/p&gt;

&lt;p&gt;If you were building this, what message queue would you choose for the transcoding pipeline? RabbitMQ, Kafka, or AWS SQS? Let me know your thoughts down in the comments!&lt;/p&gt;

</description>
      <category>distributedsystems</category>
      <category>systemdesign</category>
      <category>backend</category>
      <category>softwaredevelopment</category>
    </item>
    <item>
      <title>How to Design a News Feed: Caching, Queues, and Millions of Users</title>
      <dc:creator>Ganesh Parella</dc:creator>
      <pubDate>Sun, 08 Mar 2026 09:51:32 +0000</pubDate>
      <link>https://dev.to/ganesh_parella/how-to-design-a-news-feed-caching-queues-and-millions-of-users-gcl</link>
      <guid>https://dev.to/ganesh_parella/how-to-design-a-news-feed-caching-queues-and-millions-of-users-gcl</guid>
      <description>&lt;p&gt;Designing a News Feed system might seem a bit less complex at first glance compared to the Real-Time Chat application we designed before (if you haven't read that one yet, check it out on my profile!). But when you scale a feed to millions of users, things get incredibly interesting.&lt;/p&gt;

&lt;p&gt;Let's break down how platforms like Instagram or Twitter handle massive traffic without breaking a sweat.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;System Requirements&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Functional Requirements:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Users can publish posts.&lt;/li&gt;
&lt;li&gt;Users can view a news feed of posts from people they follow.&lt;/li&gt;
&lt;li&gt;Users can follow/unfollow others.&lt;/li&gt;
&lt;li&gt;Users can like and comment on posts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Non-Functional Requirements:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Highly Scalable: Must handle millions of active users.&lt;/li&gt;
&lt;li&gt;Low Latency: Feed generation should feel instant (&amp;lt; 100ms).&lt;/li&gt;
&lt;li&gt;High Availability: The system must stay up (we will favor availability over strict consistency).&lt;/li&gt;
&lt;li&gt;Eventual Consistency: It’s perfectly fine if a user sees a post a few seconds after it’s published.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Core Entities&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User: User profile and follower metadata.&lt;/li&gt;
&lt;li&gt;Post: The actual content.&lt;/li&gt;
&lt;li&gt;Like &amp;amp; Comment: Engagements tied to posts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;High-Level Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Funuj053ta3o1eujmgxxu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Funuj053ta3o1eujmgxxu.png" alt=" " width="800" height="363"&gt;&lt;/a&gt;&lt;br&gt;
Here is a quick look at the core components of our system:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Load Balancer:&lt;/strong&gt; Distributes incoming traffic across our API servers to prevent bottlenecks and ensure horizontal scalability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Redis (Cache):&lt;/strong&gt; The secret weapon for our &amp;lt; 100ms latency goal. It stores pre-computed user feeds so we don't have to hit the database every time a user opens the app.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Object Storage (S3):&lt;/strong&gt; Used to store heavy media files like images and videos, keeping our core databases lightweight.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Database (NoSQL):&lt;/strong&gt; Stores the metadata of the posts (text content, S3 URLs, timestamps) and user relationship graphs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Hard Parts: Solving the Core Challenges&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;1. The "Hot User" Problem:&lt;/strong&gt; Fan-out on Write vs. Fan-out on Read&lt;br&gt;
Getting a user's feed by querying the database on every single refresh is a recipe for high latency and system failure. We need to cache the feeds in Redis. But how do we get the data into the cache efficiently?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fan-out on Write (Push Model):&lt;/strong&gt; &lt;br&gt;
For a normal user with a few hundred followers, when they post, we immediately "push" that post into the Redis cache of all their followers. This is perfect because the number of writes is small, and it makes reading the feed instant.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fan-out on Read (Pull Model):&lt;/strong&gt;&lt;br&gt;
But what if an influencer with 50 million followers posts a picture? Pushing to 50 million caches instantly would absolutely crush our servers. To solve this, we use a hybrid approach. For massive influencers, we do not push the data. Instead, when a follower opens their app, the system dynamically pulls the influencer's recent posts and merges them into the user's feed at read-time.&lt;/p&gt;
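
&lt;p&gt;Here is a rough Python sketch of the hybrid fan-out. The follower limit is an assumed number, and plain dicts stand in for the Redis feed cache:&lt;/p&gt;

```python
FANOUT_FOLLOWER_LIMIT = 10_000   # assumption: above this, an author is an "influencer"

feeds = {}            # follower_id mapped to cached feed (stands in for Redis)
celebrity_posts = {}  # influencer_id mapped to recent posts, merged at read time

def publish_post(author_id, post_id, followers):
    if len(followers) > FANOUT_FOLLOWER_LIMIT:
        # Fan-out on read: record once; followers pull it when they open the app.
        celebrity_posts.setdefault(author_id, []).insert(0, post_id)
    else:
        # Fan-out on write: push the post into every follower's cached feed.
        for follower in followers:
            feeds.setdefault(follower, []).insert(0, post_id)

def read_feed(user_id, followed_influencers):
    merged = list(feeds.get(user_id, []))
    for influencer in followed_influencers:
        merged.extend(celebrity_posts.get(influencer, []))
    return merged
```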

&lt;p&gt;&lt;strong&gt;2. Handling Server Crashes &amp;amp; Heavy Uploads&lt;/strong&gt;&lt;br&gt;
What if a user uploads a massive video, and the server crashes before it reaches the database? Or what if the database goes down momentarily?&lt;/p&gt;

&lt;p&gt;To absorb these failures and ensure reliability, we introduce a Message Queue (like Kafka or RabbitMQ). When a user hits "post," the API drops the payload into the queue and instantly tells the user "Success!" Background workers then safely consume this queue at their own pace, saving the heavy media to S3 and the metadata to the DB. This decouples our architecture and makes data loss dramatically less likely.&lt;/p&gt;
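
&lt;p&gt;A minimal sketch of this decoupling, using Python's in-process queue as a stand-in for Kafka or RabbitMQ:&lt;/p&gt;

```python
import queue

post_queue = queue.Queue()   # stands in for Kafka / RabbitMQ
saved = []                   # stands in for the S3 + DB writes

def handle_post_request(payload):
    post_queue.put(payload)          # hand the payload off to the queue...
    return {"status": "Success!"}    # ...and acknowledge the user immediately

def worker():
    # In production this loop runs in background worker processes.
    while True:
        payload = post_queue.get()
        if payload is None:          # shutdown sentinel
            break
        saved.append(payload)        # the slow S3 / DB work happens here
        post_queue.task_done()
```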

&lt;p&gt;&lt;strong&gt;3. Preserving Computational Power: Pagination&lt;/strong&gt;&lt;br&gt;
We absolutely do not want to load a user's entire history of posts at once. It wastes computational power, eats up bandwidth, and spikes latency.&lt;/p&gt;

&lt;p&gt;To keep the system affordable and lightning-fast, we implement Pagination. By loading only 20 posts per page, we drastically reduce the load on our backend. For a news feed, using Cursor-based Pagination (using a timestamp or unique ID as a cursor rather than an offset) is the best approach, as it prevents duplicate posts from showing up if new items are added while the user is scrolling.&lt;/p&gt;
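
&lt;p&gt;A simplified sketch of cursor-based pagination, assuming a newest-first post list and the post id as the cursor:&lt;/p&gt;

```python
PAGE_SIZE = 20

def feed_page(posts, cursor=None):
    """`posts` is newest-first; `cursor` is the id of the last post the
    client already has (None for the first page)."""
    if cursor is not None:
        ids = [p["id"] for p in posts]
        start = ids.index(cursor) + 1   # resume right after the cursor post
    else:
        start = 0
    page = posts[start:start + PAGE_SIZE]
    next_cursor = page[-1]["id"] if page else None
    return page, next_cursor
```

&lt;p&gt;Because the cursor is a post id rather than a numeric offset, new posts arriving at the top of the feed don't shift the next page, so the user never sees duplicates while scrolling.&lt;/p&gt;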

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
Designing a news feed is all about managing trade-offs. By decoupling our storage, smartly handling influencers with a hybrid fan-out model, and relying heavily on caching and message queues, we can build a robust system that feels instantaneous for the end user.&lt;/p&gt;

&lt;p&gt;What are your thoughts on this architecture? Would you use a different approach for handling influencer posts? Let me know in the comments!&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>distributedsystems</category>
      <category>performance</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>How to design a Real-Time Chat Application</title>
      <dc:creator>Ganesh Parella</dc:creator>
      <pubDate>Wed, 04 Mar 2026 10:06:51 +0000</pubDate>
      <link>https://dev.to/ganesh_parella/how-to-design-a-real-time-chat-application-5go6</link>
      <guid>https://dev.to/ganesh_parella/how-to-design-a-real-time-chat-application-5go6</guid>
      <description>&lt;p&gt;&lt;strong&gt;Why Designing a Real-Time Chat Application Is Hard&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Designing a real-time chat application is significantly more complex than building systems like a URL shortener or a notification service.&lt;/p&gt;

&lt;p&gt;The main reasons are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-time bidirectional communication&lt;/li&gt;
&lt;li&gt;Handling millions of concurrent connections&lt;/li&gt;
&lt;li&gt;Ensuring low latency&lt;/li&gt;
&lt;li&gt;Managing message persistence and offline delivery&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Unlike simple request-response systems, chat applications require persistent connections and instant delivery at scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Functional Requirements&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1-to-1 messaging&lt;/li&gt;
&lt;li&gt;Group messaging&lt;/li&gt;
&lt;li&gt;Message persistence&lt;/li&gt;
&lt;li&gt;Offline message delivery (messages should be delivered when a user comes online)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Non-Functional Requirements&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scalable to millions of users&lt;/li&gt;
&lt;li&gt;Low latency (&amp;lt; 500 ms)&lt;/li&gt;
&lt;li&gt;Fault tolerant&lt;/li&gt;
&lt;li&gt;Highly available&lt;/li&gt;
&lt;li&gt;Durable storage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Choosing the Correct Communication Protocol&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Since our latency requirement is less than 500 ms, traditional short polling or long polling are not ideal because they introduce unnecessary delays and overhead.&lt;/p&gt;

&lt;p&gt;Server-Sent Events (SSE) are also not suitable because they support only one-way communication (server → client), whereas a chat system requires two-way communication.&lt;/p&gt;

&lt;p&gt;Therefore, we use WebSockets, which provide:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Persistent connections&lt;/li&gt;
&lt;li&gt;Bidirectional communication&lt;/li&gt;
&lt;li&gt;Low latency&lt;/li&gt;
&lt;li&gt;Reduced network overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Modern messaging platforms like WhatsApp use persistent connections to achieve real-time communication.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;High-Level Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5v0x8r6fld1o80bkjnay.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5v0x8r6fld1o80bkjnay.png" alt="chat System" width="800" height="308"&gt;&lt;/a&gt;&lt;br&gt;
Our system consists of the following components:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Client&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The client maintains a WebSocket connection with the server to send and receive messages.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Load Balancer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The load balancer distributes incoming WebSocket connections across multiple chat servers to ensure scalability and high availability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Chat Servers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Chat servers handle the core business logic:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Manage WebSocket connections&lt;/li&gt;
&lt;li&gt;Validate messages&lt;/li&gt;
&lt;li&gt;Store messages in the database&lt;/li&gt;
&lt;li&gt;Deliver messages to recipients&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. Redis&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Since the load balancer does not know which user is connected to which chat server, we store connection mappings in Redis.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;userId → serverId / connectionId&lt;/p&gt;

&lt;p&gt;This allows any server to determine whether a user is online and where to route the message.&lt;/p&gt;
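
&lt;p&gt;The connection registry can be sketched with a plain dict standing in for Redis:&lt;/p&gt;

```python
connections = {}   # stands in for Redis: userId mapped to (serverId, connectionId)

def register(user_id, server_id, connection_id):
    # Called when a WebSocket connection is established.
    connections[user_id] = (server_id, connection_id)

def unregister(user_id):
    # Called when the user disconnects.
    connections.pop(user_id, None)

def route(user_id):
    """Return where to deliver a message, or None if the user is offline."""
    return connections.get(user_id)
```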

&lt;p&gt;&lt;strong&gt;Database&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We use a scalable NoSQL database such as Amazon DynamoDB or any key-value store because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We require high write throughput&lt;/li&gt;
&lt;li&gt;We do not need strict ACID guarantees&lt;/li&gt;
&lt;li&gt;Horizontal scaling is easier&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;1-to-1 Message Flow&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The sender sends a message via WebSocket.&lt;/li&gt;
&lt;li&gt;The chat server validates and stores the message in the database (for persistence).&lt;/li&gt;
&lt;li&gt;The server checks Redis to determine whether the recipient is online.&lt;/li&gt;
&lt;li&gt;If the recipient is online:
The message is delivered immediately via WebSocket.&lt;/li&gt;
&lt;li&gt;If the recipient is offline:
The message remains stored in the database.&lt;/li&gt;
&lt;li&gt;It will be delivered when the user reconnects.&lt;/li&gt;
&lt;/ul&gt;
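
&lt;p&gt;The flow above, condensed into a sketch (lists and dicts stand in for the database and the WebSocket layer):&lt;/p&gt;

```python
db = []       # durable message store (stands in for the NoSQL database)
online = {}   # userId mapped to a delivery callback (stands in for a WebSocket)

def send_message(sender, recipient, text):
    msg = {"from": sender, "to": recipient, "text": text, "delivered": False}
    db.append(msg)                   # 1. persist first, so nothing is lost
    deliver = online.get(recipient)  # 2. check the connection registry
    if deliver:
        deliver(msg)                 # 3a. online: push over the socket now
        msg["delivered"] = True
    # 3b. offline: the message stays in the DB until the user reconnects

def on_reconnect(user_id, deliver):
    online[user_id] = deliver
    for msg in db:                   # flush any pending messages
        if msg["to"] == user_id and not msg["delivered"]:
            deliver(msg)
            msg["delivered"] = True
```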

&lt;p&gt;&lt;strong&gt;Group Chat Message Flow&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A user sends a message to a group.&lt;/li&gt;
&lt;li&gt;The message is stored in the database with the group ID.&lt;/li&gt;
&lt;li&gt;The server retrieves the list of group members.&lt;/li&gt;
&lt;li&gt;For each member:
Check Redis for their connection.&lt;/li&gt;
&lt;li&gt;If online → deliver via WebSocket.&lt;/li&gt;
&lt;li&gt;If offline → deliver when they reconnect.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Challenges&lt;/strong&gt;&lt;br&gt;
Designing the architecture is only the beginning. The real complexity lies in handling the following challenges at scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scaling Millions of WebSocket Connections&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Each active user maintains a persistent WebSocket connection with the server.&lt;/p&gt;

&lt;p&gt;Problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each connection consumes memory.&lt;/li&gt;
&lt;li&gt;A single server can handle only a limited number of concurrent connections.&lt;/li&gt;
&lt;li&gt;Sudden traffic spikes (e.g., during peak hours) can overwhelm servers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Solution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use horizontal scaling (multiple chat servers).&lt;/li&gt;
&lt;li&gt;Keep servers stateless.&lt;/li&gt;
&lt;li&gt;Store connection metadata in a centralized store like Redis.&lt;/li&gt;
&lt;li&gt;Use load balancers to distribute traffic evenly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This ensures we can scale to millions of concurrent users.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Fan-Out Problem in Group Chats&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When a user sends a message in a group with 10,000 members, the system must deliver that message to all members.&lt;/p&gt;

&lt;p&gt;This creates a massive delivery overhead.&lt;/p&gt;

&lt;p&gt;Two common approaches:&lt;/p&gt;

&lt;p&gt;Fan-out on Write&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When a message is sent, it is immediately distributed to all group members.&lt;/li&gt;
&lt;li&gt;Faster reads.&lt;/li&gt;
&lt;li&gt;Heavy write amplification.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Fan-out on Read&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Store one copy of the message.&lt;/li&gt;
&lt;li&gt;Deliver it only when users fetch or reconnect.&lt;/li&gt;
&lt;li&gt;Reduces write load but increases read complexity.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Large-scale systems like Slack often use optimized hybrid approaches depending on group size.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Message Ordering&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Messages must appear in the correct order for each conversation.&lt;/p&gt;

&lt;p&gt;Problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Messages may arrive out of order due to network delays.&lt;/li&gt;
&lt;li&gt;Multiple servers handling requests can cause race conditions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Solution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Assign a sequence number per conversation.&lt;/li&gt;
&lt;li&gt;Store timestamps.&lt;/li&gt;
&lt;li&gt;Let clients reorder messages based on sequence IDs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Maintaining ordering becomes especially challenging in distributed systems.&lt;/p&gt;
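
&lt;p&gt;A toy sketch of per-conversation sequence numbers. In a real deployment the counter would live in a central store rather than in process memory:&lt;/p&gt;

```python
import collections
import itertools

# One monotonically increasing counter per conversation.
sequencers = collections.defaultdict(itertools.count)

def stamp(conversation_id, message):
    """Server-side: assign the next sequence number for this conversation."""
    message["seq"] = next(sequencers[conversation_id])
    return message

def reorder(messages):
    """Client-side: sort by sequence id, regardless of arrival order."""
    return sorted(messages, key=lambda m: m["seq"])
```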

&lt;p&gt;&lt;strong&gt;Handling Offline Users&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Users may disconnect unexpectedly due to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Network issues&lt;/li&gt;
&lt;li&gt;App crashes&lt;/li&gt;
&lt;li&gt;Device shutdown&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The system must:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Store undelivered messages safely.&lt;/li&gt;
&lt;li&gt;Detect when the user reconnects.&lt;/li&gt;
&lt;li&gt;Deliver pending messages reliably.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This requires durable storage (e.g., NoSQL databases like Amazon DynamoDB).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Delivery Guarantees&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Should messages be delivered:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;At most once?&lt;/li&gt;
&lt;li&gt;At least once?&lt;/li&gt;
&lt;li&gt;Exactly once?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Exactly-once delivery is extremely hard in distributed systems.&lt;/p&gt;

&lt;p&gt;Most chat systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use at-least-once delivery.&lt;/li&gt;
&lt;li&gt;Assign unique message IDs.&lt;/li&gt;
&lt;li&gt;Let clients deduplicate messages if needed.&lt;/li&gt;
&lt;/ul&gt;
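
&lt;p&gt;Client-side deduplication under at-least-once delivery can be sketched as:&lt;/p&gt;

```python
seen_ids = set()
timeline = []

def on_receive(message):
    """At-least-once delivery means retries can redeliver the same message;
    the unique id makes the receive idempotent."""
    if message["id"] in seen_ids:
        return                      # duplicate redelivery: drop it silently
    seen_ids.add(message["id"])
    timeline.append(message)
```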

&lt;p&gt;&lt;strong&gt;Fault Tolerance&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What happens if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A chat server crashes?&lt;/li&gt;
&lt;li&gt;Redis goes down?&lt;/li&gt;
&lt;li&gt;A database node fails?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Solutions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Replicated databases.&lt;/li&gt;
&lt;li&gt;Redis clustering.&lt;/li&gt;
&lt;li&gt;Health checks and auto-restarts.&lt;/li&gt;
&lt;li&gt;Multi-availability zone deployments.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Large messaging systems like WhatsApp are designed with redundancy at every layer to avoid message loss.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Storage &amp;amp; Hot Partitions&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If many users are chatting in the same popular group, all writes may hit the same database partition.&lt;/p&gt;

&lt;p&gt;This creates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hot keys&lt;/li&gt;
&lt;li&gt;Increased latency&lt;/li&gt;
&lt;li&gt;Throttling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Solutions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Partition by conversation ID + time bucket.&lt;/li&gt;
&lt;li&gt;Use sharding strategies.&lt;/li&gt;
&lt;li&gt;Distribute load evenly across nodes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Designing a real-time chat application goes far beyond simply sending messages between users. It requires solving complex distributed systems problems such as scaling millions of persistent connections, ensuring low latency, handling offline users, maintaining message ordering, and guaranteeing fault tolerance.&lt;/p&gt;

&lt;p&gt;By using WebSockets for bidirectional communication, horizontally scalable chat servers, centralized connection mapping with Redis, and durable storage solutions like Amazon DynamoDB, we can build a system capable of supporting millions of users efficiently.&lt;/p&gt;

&lt;p&gt;The real challenge is not just building the architecture — it’s understanding the trade-offs between scalability, consistency, and reliability.&lt;/p&gt;

&lt;p&gt;A well-designed chat system is a practical example of how distributed systems principles are applied in real-world applications.&lt;/p&gt;

</description>
      <category>distributedsystems</category>
      <category>career</category>
      <category>systemdesign</category>
      <category>architecture</category>
    </item>
    <item>
      <title>How to design a Notification System?</title>
      <dc:creator>Ganesh Parella</dc:creator>
      <pubDate>Sat, 21 Feb 2026 19:59:32 +0000</pubDate>
      <link>https://dev.to/ganesh_parella/how-to-design-a-notification-system--28ah</link>
      <guid>https://dev.to/ganesh_parella/how-to-design-a-notification-system--28ah</guid>
      <description>&lt;p&gt;Imagine you’re building a social platform.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A user signs up.&lt;/li&gt;
&lt;li&gt;Someone likes a post.&lt;/li&gt;
&lt;li&gt;Someone comments.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these actions should trigger a notification.&lt;/p&gt;

&lt;p&gt;Sounds simple, right?&lt;/p&gt;

&lt;p&gt;But what happens when thousands of users trigger events at the same time?&lt;br&gt;
Let’s design it properly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Functional Requirements&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When an event is triggered → send a notification.&lt;/li&gt;
&lt;li&gt;If sending fails → retry.&lt;/li&gt;
&lt;li&gt;Support multiple channels (Email, Push, In-App).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Non-functional Requirements&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High availability&lt;/li&gt;
&lt;li&gt;Notifications should not be lost (persistence)&lt;/li&gt;
&lt;li&gt;Scalable under traffic spikes&lt;/li&gt;
&lt;li&gt;Pluggable architecture (easy to add new channels)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;High-Level Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fed8lua1j6h0llf2o835x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fed8lua1j6h0llf2o835x.png" alt="High-Level Architecture" width="800" height="245"&gt;&lt;/a&gt;&lt;br&gt;
The basic architecture looks straightforward. But many people ask:&lt;/p&gt;

&lt;p&gt;Why use a message queue instead of directly sending the request to the notification service?&lt;/p&gt;

&lt;p&gt;Let’s say we want to send a welcome email when a new user signs up.&lt;/p&gt;

&lt;p&gt;Most email service providers impose rate limits. Assume the limit is 30 requests per second.&lt;/p&gt;

&lt;p&gt;Now imagine 100 users click the sign-up button within one second.&lt;/p&gt;

&lt;p&gt;If we send requests directly to the email service:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;30 succeed&lt;/li&gt;
&lt;li&gt;70 fail&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s 70 lost users. Not acceptable.&lt;/p&gt;

&lt;p&gt;Instead, we push all events into a message queue and process them at a controlled rate. The queue acts as a buffer during traffic spikes. Workers dequeue messages when the service is available and send notifications gradually.&lt;/p&gt;

&lt;p&gt;This way, we don’t lose requests, and we stay within the provider’s rate limit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bottlenecks and Improvements&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. What if the notification provider is down?&lt;/strong&gt;&lt;br&gt;
Suppose the email service goes down and every request starts failing.&lt;/p&gt;

&lt;p&gt;If we retry infinitely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We waste CPU resources&lt;/li&gt;
&lt;li&gt;The queue keeps growing&lt;/li&gt;
&lt;li&gt;The system becomes unstable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To solve this, we use exponential backoff retries.&lt;/p&gt;

&lt;p&gt;Instead of retrying immediately, we wait longer between each attempt:&lt;br&gt;
1s → 2s → 4s → 8s → 16s …&lt;/p&gt;

&lt;p&gt;After a certain number of retries, we move the message to a Dead Letter Queue (DLQ) for later inspection.&lt;/p&gt;
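
&lt;p&gt;A sketch of the retry loop with exponential backoff and a DLQ. MAX_RETRIES and the injectable sleep function are illustration choices, not fixed values:&lt;/p&gt;

```python
import time

MAX_RETRIES = 5        # illustrative cap before giving up
dead_letter_queue = []

def send_with_backoff(message, send, sleep=time.sleep):
    """Retry with exponentially growing waits: 1s, 2s, 4s, 8s, 16s."""
    for attempt in range(MAX_RETRIES):
        try:
            return send(message)
        except Exception:
            sleep(2 ** attempt)     # back off before the next attempt
    # All retries failed: park the message in the DLQ for later inspection.
    dead_letter_queue.append(message)
```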

&lt;p&gt;&lt;strong&gt;2. Avoiding Notification Spam&lt;/strong&gt;&lt;br&gt;
Initially, we might send an email for every event.&lt;/p&gt;

&lt;p&gt;But that’s not ideal.&lt;/p&gt;

&lt;p&gt;If a user is actively using the app, sending an email for every like or comment would feel like spam.&lt;/p&gt;

&lt;p&gt;To handle this, we introduce a Notification Engine.&lt;/p&gt;

&lt;p&gt;All requests go through this engine, which decides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which channel to use (Email, Push, In-App)&lt;/li&gt;
&lt;li&gt;Whether the user has disabled certain notifications&lt;/li&gt;
&lt;li&gt;Whether the user is currently active in the app&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We store user preferences and last login time in a cache for quick access.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If the user is active → send only In-App notification&lt;/li&gt;
&lt;li&gt;If the user is offline → send Push&lt;/li&gt;
&lt;li&gt;If Push fails → fallback to Email&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This makes the system smarter and more user-friendly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Making It Pluggable&lt;/strong&gt;&lt;br&gt;
We don’t want to tightly couple our system to just Email or Push.&lt;/p&gt;

&lt;p&gt;Instead, we design it so that each notification channel implements a common interface.&lt;/p&gt;

&lt;p&gt;That way, if we want to add:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SMS&lt;/li&gt;
&lt;li&gt;WhatsApp&lt;/li&gt;
&lt;li&gt;Slack&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We can plug it in without rewriting the core logic.&lt;/p&gt;

&lt;p&gt;This keeps the system flexible and future-proof.&lt;/p&gt;
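
&lt;p&gt;One way to sketch that common interface (the channel names and classes here are illustrative):&lt;/p&gt;

```python
class Channel:
    """Common interface every notification channel implements."""
    def send(self, user, message):
        raise NotImplementedError

class EmailChannel(Channel):
    def send(self, user, message):
        return f"email to {user}: {message}"

class SmsChannel(Channel):
    def send(self, user, message):
        return f"sms to {user}: {message}"

# The core engine only knows the Channel interface, so adding WhatsApp or
# Slack is a new subclass plus one registry entry: no core changes needed.
channels = {"email": EmailChannel(), "sms": SmsChannel()}

def notify(channel_name, user, message):
    return channels[channel_name].send(user, message)
```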

&lt;p&gt;&lt;strong&gt;Final Thoughts&lt;/strong&gt;&lt;br&gt;
What started as “just send a notification” quickly becomes a distributed system problem.&lt;/p&gt;

&lt;p&gt;By introducing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A message queue for decoupling&lt;/li&gt;
&lt;li&gt;Worker-based async processing&lt;/li&gt;
&lt;li&gt;Exponential backoff retries&lt;/li&gt;
&lt;li&gt;Dead Letter Queues&lt;/li&gt;
&lt;li&gt;A centralized Notification Engine&lt;/li&gt;
&lt;li&gt;User preference caching&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We build a system that is scalable, resilient, and production-ready.&lt;/p&gt;

&lt;p&gt;Simple feature. Complex engineering.&lt;/p&gt;

&lt;p&gt;And that’s the fun part.&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>softwaredevelopment</category>
      <category>webdev</category>
      <category>learning</category>
    </item>
    <item>
      <title>Designing a Rate Limiter to prevent spamming</title>
      <dc:creator>Ganesh Parella</dc:creator>
      <pubDate>Wed, 18 Feb 2026 10:33:23 +0000</pubDate>
      <link>https://dev.to/ganesh_parella/designing-a-rate-limiter-to-prevent-spamming-32e1</link>
      <guid>https://dev.to/ganesh_parella/designing-a-rate-limiter-to-prevent-spamming-32e1</guid>
      <description>&lt;p&gt;Imagine you are building a social website or any large-scale system. Suddenly, a million requests flood your system from a single IP address. Your servers slow down or even crash.&lt;/p&gt;

&lt;p&gt;How do we prevent this?&lt;/p&gt;

&lt;p&gt;The answer is: Rate Limiting.&lt;/p&gt;

&lt;p&gt;In simple terms, a rate limiter restricts the number of requests a user (or IP address) can send within a given time window.&lt;/p&gt;

&lt;p&gt;Let’s design one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Functional Requirements&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Limit the number of requests per user ID or IP address&lt;/li&gt;
&lt;li&gt;Return an error (e.g., HTTP 429) when the limit is exceeded&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Non-Functional Requirements&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Low latency while checking the limit (e.g., &amp;lt;10ms)&lt;/li&gt;
&lt;li&gt;High availability (Availability &amp;gt; Consistency)&lt;/li&gt;
&lt;li&gt;Scalable for millions of users&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;System Function / Endpoint&lt;/strong&gt;&lt;br&gt;
boolean isAvailable(userId, request)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If true → forward request to backend&lt;/li&gt;
&lt;li&gt;If false → return 429 (Too Many Requests)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Choosing the Right Algorithm&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When we think about limiting requests over time, a natural idea is:&lt;/p&gt;

&lt;p&gt;Limit the number of requests in a fixed time window.&lt;/p&gt;

&lt;p&gt;But there’s a problem. Suppose we allow 100 requests per second. If a user sends:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;100 requests at the end of second 1&lt;/li&gt;
&lt;li&gt;100 requests at the beginning of second 2&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s 200 requests within ~1 second. This is known as the Fixed Window boundary problem. We don’t want that.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sliding Window&lt;/strong&gt;&lt;br&gt;
To solve this, we can use a sliding window approach. The idea: within any given time window, the number of requests must not exceed the limit. This is more accurate, but it requires storing the timestamps of requests. An implementation might use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sorted sets&lt;/li&gt;
&lt;li&gt;Heaps / priority queues&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, memory usage increases with traffic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token Bucket (Preferred Approach)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Think of tokens as balls in a bucket.&lt;/li&gt;
&lt;li&gt;Each request consumes one token.&lt;/li&gt;
&lt;li&gt;Tokens are refilled at a fixed rate.&lt;/li&gt;
&lt;li&gt;If no tokens are available → reject the request.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bucket size = 100&lt;/li&gt;
&lt;li&gt;Refill rate = 100 per minute&lt;/li&gt;
&lt;li&gt;If a user sends 100 requests instantly, they must wait until tokens are refilled.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Allows burst traffic (up to bucket capacity)&lt;/li&gt;
&lt;li&gt;Smooths traffic over time&lt;/li&gt;
&lt;li&gt;Flexible and production-friendly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Token Bucket is widely used in real systems.&lt;/p&gt;
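
&lt;p&gt;A minimal single-node Token Bucket sketch. A distributed version would keep this state in Redis and update it atomically, but the refill logic is the same:&lt;/p&gt;

```python
import time

class TokenBucket:
    def __init__(self, capacity, refill_per_second, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = capacity            # the bucket starts full
        self.clock = clock                # injectable clock, handy for testing
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Lazily refill based on elapsed time, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_per_second)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1              # this request consumes one token
            return True
        return False                      # no tokens left: respond with HTTP 429
```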

&lt;p&gt;&lt;strong&gt;High-Level Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcn0oiyftcqcfiupco4k4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcn0oiyftcqcfiupco4k4.png" alt="High-Level Architecture" width="800" height="239"&gt;&lt;/a&gt;&lt;br&gt;
In this design:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Rate Limiter logic is placed before the backend API.&lt;/li&gt;
&lt;li&gt;Load balancer distributes traffic across multiple app servers.&lt;/li&gt;
&lt;li&gt;A shared Redis store keeps token bucket state.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This ensures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Distributed rate limiting&lt;/li&gt;
&lt;li&gt;No single point of failure&lt;/li&gt;
&lt;li&gt;Low latency checks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Bottlenecks&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;1. Redis Bottleneck&lt;/strong&gt;&lt;br&gt;
If millions of users hit the system simultaneously, Redis may become the bottleneck.&lt;br&gt;
To scale:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use Redis clustering&lt;/li&gt;
&lt;li&gt;Shard keys across multiple Redis nodes&lt;/li&gt;
&lt;li&gt;Use consistent hashing for distribution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If each Redis instance stores the state for 100k users and we need to support 1 million users, we need around 10 Redis nodes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Concurrency Problem&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A user has only 1 token left&lt;/li&gt;
&lt;li&gt;Two requests hit Redis at the same time from different servers?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Redis solves this using atomic operations.&lt;/p&gt;

&lt;p&gt;Using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lua scripts&lt;/li&gt;
&lt;li&gt;Atomic commands like INCR&lt;/li&gt;
&lt;li&gt;Or transactions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This prevents race conditions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Latency Considerations&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To reduce latency:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Keep Redis close to application servers (same region)&lt;/li&gt;
&lt;li&gt;Use cluster topology&lt;/li&gt;
&lt;li&gt;Avoid cross-region calls for rate limit checks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Geographical distance directly impacts response time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final Thought&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Functional requirements define what the system does.&lt;/li&gt;
&lt;li&gt;Non-functional requirements define how well it performs at scale.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Rate limiting may look simple, but designing it correctly in distributed systems requires careful thought.&lt;/p&gt;

&lt;p&gt;See you in the next post 🚀&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>backend</category>
      <category>systemdesign</category>
      <category>tutorial</category>
    </item>
    <item>
<title>How to Design a Simple URL Shortener (TinyURL)</title>
      <dc:creator>Ganesh Parella</dc:creator>
      <pubDate>Sun, 15 Feb 2026 18:16:36 +0000</pubDate>
      <link>https://dev.to/ganesh_parella/how-to-design-a-simple-url-shortenertinyurl-3n2m</link>
      <guid>https://dev.to/ganesh_parella/how-to-design-a-simple-url-shortenertinyurl-3n2m</guid>
      <description>&lt;p&gt;TinyURL is often called the “Hello World” of system design because it has minimal requirements but forces us to think about scalability, caching, ID generation, and bottlenecks.&lt;/p&gt;

&lt;p&gt;Let’s design it step by step.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Functional Requirements:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Convert a Long URL → Short URL&lt;/li&gt;
&lt;li&gt;Redirect Short URL → Long URL&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Non-Functional Requirements:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High Availability&lt;/li&gt;
&lt;li&gt;Low Latency &lt;/li&gt;
&lt;li&gt;Scalable under heavy traffic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;API End-Points:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;POST /shorten → Accepts Long URL&lt;/li&gt;
&lt;li&gt;GET /{shortId} → Redirects to Long URL&lt;/li&gt;
&lt;/ul&gt;
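&lt;p&gt;A toy, single-node version of these two endpoints (an in-memory dict and a local counter stand in for the real database and distributed ID generator):&lt;/p&gt;

```python
import string

ALPHABET = string.ascii_lowercase + string.ascii_uppercase + string.digits  # 62 characters

class UrlShortener:
    """In-memory sketch: no persistence, no replication, one process."""

    def __init__(self):
        self.store = {}     # short_id -> long_url
        self.counter = 0    # stand-in for a distributed ID generator

    def shorten(self, long_url):
        """POST /shorten"""
        self.counter += 1
        short_id = self._encode(self.counter)
        self.store[short_id] = long_url
        return short_id

    def resolve(self, short_id):
        """GET /{shortId}: returns the redirect target, or None for a 404."""
        return self.store.get(short_id)

    def _encode(self, n):
        # Base62-encode the counter to keep the short ID compact
        digits = []
        while n:
            n, r = divmod(n, 62)
            digits.append(ALPHABET[r])
        return "".join(reversed(digits)) or ALPHABET[0]

svc = UrlShortener()
short_id = svc.shorten("https://example.com/a/very/long/path")
```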

&lt;p&gt;&lt;strong&gt;High-Level Design:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbyu0pf22akqz45e8hvi2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbyu0pf22akqz45e8hvi2.png" alt="User&amp;lt;br&amp;gt;
→ Load Balancer&amp;lt;br&amp;gt;
→ App Servers&amp;lt;br&amp;gt;
→ Cache&amp;lt;br&amp;gt;
→ Sharded Database" width="800" height="202"&gt;&lt;/a&gt;&lt;br&gt;
This is a simple and scalable design.&lt;/p&gt;

&lt;p&gt;Since we require low latency, we introduce a cache layer to store frequently accessed short URLs. Most read requests will be served directly from cache, reducing database load.&lt;/p&gt;

&lt;p&gt;To ensure high availability, we avoid single points of failure. App servers are scaled horizontally and placed behind a load balancer, which distributes incoming traffic evenly.&lt;/p&gt;

&lt;p&gt;Because our system only needs to store simple mappings:&lt;br&gt;
               short_url → long_url&lt;br&gt;
we can use either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A Key-Value database (natural fit for simple mapping)&lt;/li&gt;
&lt;li&gt;Or an SQL database if additional analytics or constraints are required&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This covers the basic design derived from requirements.&lt;/p&gt;

&lt;p&gt;But now comes the interesting part.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How Short Should the URL Be?&lt;/strong&gt;&lt;br&gt;
We want to convert long URLs into short ones. But how short should they be?&lt;/p&gt;

&lt;p&gt;Assume:&lt;/p&gt;

&lt;p&gt;K new URLs are generated every second.&lt;br&gt;
We store URLs for 10 years.&lt;br&gt;
Total URLs required:&lt;br&gt;
                 &lt;strong&gt;K*60*60*24*365*10&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If our short URL can use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;26 lowercase letters (a–z)&lt;/li&gt;
&lt;li&gt;26 uppercase letters (A–Z)&lt;/li&gt;
&lt;li&gt;10 digits (0–9)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That gives us 62 possible characters.&lt;br&gt;
To determine required length:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;62^n ≥ K × 60 × 60 × 24 × 365 × 10&lt;/strong&gt;&lt;br&gt;
Where n is the length of the short URL.&lt;br&gt;
If n = 7:&lt;br&gt;
62^7 ≈ 3.5 trillion combinations&lt;/p&gt;

&lt;p&gt;Which is sufficient for large-scale systems.&lt;/p&gt;
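&lt;p&gt;The length bound is easy to check numerically (the helper name below is made up for illustration):&lt;/p&gt;

```python
SECONDS_PER_YEAR = 60 * 60 * 24 * 365

def required_length(k_urls_per_second, years=10, alphabet_size=62):
    """Smallest n such that alphabet_size**n covers every URL created."""
    total = k_urls_per_second * SECONDS_PER_YEAR * years
    n = 1
    while total > alphabet_size ** n:
        n += 1
    return n

print(required_length(1000))   # 7: 62^7 (about 3.5 trillion) covers ~315 billion URLs
```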

&lt;p&gt;&lt;strong&gt;Bottlenecks&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Hot Key Problem (Read Bottleneck)&lt;/strong&gt;&lt;br&gt;
Suppose the application becomes popular and millions of users request the same short URL simultaneously.&lt;/p&gt;

&lt;p&gt;Where would the system collapse first?&lt;/p&gt;

&lt;p&gt;The cache.&lt;br&gt;
When many users access the same key, we face a hot key problem. Horizontal scaling alone does not solve this because the same key may map to the same cache node.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use cache replicas&lt;/li&gt;
&lt;li&gt;Introduce a CDN layer&lt;/li&gt;
&lt;li&gt;Distribute read load across multiple cache nodes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Write Bottleneck (Database)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now assume we receive a large number of write requests (URL creation). Writes typically go to the primary database node.&lt;br&gt;
Where will the bottleneck occur?&lt;br&gt;
The database.&lt;br&gt;
Since every new short URL requires a write operation, database throughput becomes the limiting factor.&lt;br&gt;
&lt;strong&gt;Solution:&lt;/strong&gt;&lt;br&gt;
Sharding the database.&lt;br&gt;
However, simple modulo-based sharding can cause problems when adding new shards because it requires massive data redistribution.&lt;/p&gt;

&lt;p&gt;A better approach is:&lt;br&gt;
Consistent hashing, which minimizes data movement when scaling.&lt;/p&gt;
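&lt;p&gt;A quick simulation shows the problem consistent hashing avoids: going from 4 to 5 shards with modulo hashing remaps about 80% of the keys, whereas a consistent-hash ring would remap only roughly 1/(N+1) of them (the key names and shard counts below are arbitrary):&lt;/p&gt;

```python
import hashlib

def shard_of(key, n_shards):
    # naive modulo-based sharding
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % n_shards

keys = [f"url:{i}" for i in range(10_000)]
before = [shard_of(k, 4) for k in keys]   # 4 shards
after = [shard_of(k, 5) for k in keys]    # add a 5th shard
moved = sum(1 for b, a in zip(before, after) if b != a)
print(f"{moved / len(keys):.0%} of keys changed shards")   # ~80%
```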

&lt;p&gt;&lt;strong&gt;ID Collision Problem&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Since app servers are horizontally scaled, two servers might generate the same short URL.&lt;/p&gt;

&lt;p&gt;How do we prevent collisions?&lt;/p&gt;

&lt;p&gt;Possible approaches:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Random Base62 generation + collision check&lt;/li&gt;
&lt;li&gt;Centralized ID generator&lt;/li&gt;
&lt;li&gt;Distributed ID service&lt;/li&gt;
&lt;li&gt;Using Redis atomic counter (e.g., INCR)&lt;/li&gt;
&lt;/ul&gt;
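&lt;p&gt;A sketch of the first approach, random Base62 generation plus a collision check (a Python set stands in for the database's unique-index lookup):&lt;/p&gt;

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits   # 62 characters

def new_short_id(existing, length=7):
    """Generate a random Base62 ID, retrying on the (rare) collision.

    'existing' stands in for a uniqueness check against storage,
    e.g. a unique index on the short_url column.
    """
    while True:
        candidate = "".join(secrets.choice(ALPHABET) for _ in range(length))
        if candidate not in existing:
            existing.add(candidate)
            return candidate

issued = set()
first = new_short_id(issued)
second = new_short_id(issued)
```

&lt;p&gt;With 62^7 possible IDs, collisions stay rare until billions of URLs have been issued, so the retry loop almost never fires; at higher write volumes, a counter-based scheme (such as Redis INCR) avoids the check entirely.&lt;/p&gt;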

&lt;p&gt;&lt;strong&gt;Final Thought&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;TinyURL may look simple, but it teaches us:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scalability&lt;/li&gt;
&lt;li&gt;Caching strategies&lt;/li&gt;
&lt;li&gt;Sharding techniques&lt;/li&gt;
&lt;li&gt;Bottleneck analysis&lt;/li&gt;
&lt;li&gt;ID generation trade-offs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s why it’s called the “Hello World” of System Design. Let's meet again with another interesting design.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>backend</category>
      <category>systemdesign</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>How to Choose a Database?</title>
      <dc:creator>Ganesh Parella</dc:creator>
      <pubDate>Fri, 13 Feb 2026 18:57:39 +0000</pubDate>
      <link>https://dev.to/ganesh_parella/how-to-choose-a-database-56g1</link>
      <guid>https://dev.to/ganesh_parella/how-to-choose-a-database-56g1</guid>
      <description>&lt;p&gt;Before choosing a database, we must understand the types of databases that exist.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Relational Databases (1970s)&lt;/strong&gt;&lt;br&gt;
In 1970, Edgar F. Codd proposed storing data in tables (relations) and manipulating them with mathematical principles (relational algebra).&lt;br&gt;
This led to the creation of relational databases.&lt;br&gt;
They offer the ACID properties:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Atomicity&lt;/li&gt;
&lt;li&gt;Consistency&lt;/li&gt;
&lt;li&gt;Isolation&lt;/li&gt;
&lt;li&gt;Durability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examples include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MySQL&lt;/li&gt;
&lt;li&gt;PostgreSQL&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Relational databases are powerful when data is structured and transactional consistency is critical.&lt;br&gt;
However, as systems scale and joins grow across millions of rows, performance tuning and horizontal scaling become challenging.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Key–Value Databases (2000s Scaling Era)&lt;/strong&gt;&lt;br&gt;
In the 2000s, companies like Amazon faced massive scalability challenges.&lt;br&gt;
Instead of complex relational joins, they proposed storing data as:&lt;br&gt;
Key → Value&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
UserID → List of Orders&lt;/p&gt;

&lt;p&gt;A popular example is: Redis&lt;/p&gt;

&lt;p&gt;Key–Value databases offer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extremely fast lookups&lt;/li&gt;
&lt;li&gt;Easy horizontal scaling&lt;/li&gt;
&lt;li&gt;Great performance for caching and session storage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, they are not ideal for handling complex relationships between data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Graph Databases&lt;/strong&gt;&lt;br&gt;
Graph databases store data as:&lt;br&gt;
Nodes&lt;br&gt;
Edges (relationships)&lt;br&gt;
Example: Neo4j&lt;/p&gt;

&lt;p&gt;They are extremely useful when relationships are first-class citizens, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Social networks&lt;/li&gt;
&lt;li&gt;Recommendation systems&lt;/li&gt;
&lt;li&gt;Fraud detection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Graph databases shine when traversing connected data efficiently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Document Databases&lt;/strong&gt;&lt;br&gt;
Document databases store data as JSON-like documents.&lt;/p&gt;

&lt;p&gt;Example: MongoDB&lt;/p&gt;

&lt;p&gt;Document databases offer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Easy horizontal scaling&lt;/li&gt;
&lt;li&gt;Flexible schema&lt;/li&gt;
&lt;li&gt;Better support for hierarchical data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While modern document databases support limited joins and aggregations, they are not as optimized for complex relational queries as relational databases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final Verdict&lt;/strong&gt;&lt;br&gt;
Don’t choose a database by default.&lt;br&gt;
Choose a database based on how you access and scale your data.&lt;/p&gt;

&lt;p&gt;If you need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Strong ACID guarantees → Relational Database&lt;/li&gt;
&lt;li&gt;Ultra-fast lookups → Key–Value Database&lt;/li&gt;
&lt;li&gt;Relationship-heavy queries → Graph Database&lt;/li&gt;
&lt;li&gt;Flexible schema with high scalability → Document Database&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Modern systems often use multiple databases together — a concept known as polyglot persistence.&lt;br&gt;
For example, Netflix uses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Relational databases for user data&lt;/li&gt;
&lt;li&gt;Key–Value stores for caching&lt;/li&gt;
&lt;li&gt;Graph databases for recommendations&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>architecture</category>
      <category>beginners</category>
      <category>database</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Why do we need Databases?</title>
      <dc:creator>Ganesh Parella</dc:creator>
      <pubDate>Fri, 13 Feb 2026 17:10:40 +0000</pubDate>
      <link>https://dev.to/ganesh_parella/why-do-we-need-databases-519n</link>
      <guid>https://dev.to/ganesh_parella/why-do-we-need-databases-519n</guid>
      <description>&lt;p&gt;Have you ever wondered why we need databases when we can store data directly on an SSD?&lt;/p&gt;

&lt;p&gt;I recently asked myself this question. After all, SSDs store data permanently. If we can read and write directly to disk, why don’t we use that for building web applications and systems?&lt;/p&gt;

&lt;p&gt;At first, I thought maybe it’s because most applications use in-memory computation — in simple terms, RAM. When an application crashes, all data stored in RAM is lost. This creates reliability issues.&lt;/p&gt;

&lt;p&gt;So then I wondered: if persistence is our goal, why not write everything directly to the SSD?&lt;/p&gt;

&lt;p&gt;But then I realized something important — an SSD only provides physical storage. It does not provide data management. SSDs do not provide indexing, concurrent updates, or crash recovery. Implementing these manually on top of raw file storage would be extremely complex.&lt;/p&gt;

&lt;p&gt;Databases abstract this complexity and handle data efficiently.&lt;/p&gt;

&lt;p&gt;Now that we understand why databases exist, the next question becomes — what type of database should we choose for different systems?&lt;/p&gt;

</description>
      <category>backend</category>
      <category>beginners</category>
      <category>computerscience</category>
      <category>database</category>
    </item>
  </channel>
</rss>
