<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: rishabh pahwa</title>
    <description>The latest articles on DEV Community by rishabh pahwa (@rishabh_pahwa_1a2b93e60b0).</description>
    <link>https://dev.to/rishabh_pahwa_1a2b93e60b0</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3923022%2F72c2c898-8a65-4376-847e-b979b04f6f40.png</url>
      <title>DEV Community: rishabh pahwa</title>
      <link>https://dev.to/rishabh_pahwa_1a2b93e60b0</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rishabh_pahwa_1a2b93e60b0"/>
    <language>en</language>
    <item>
      <title>Problem Framing</title>
      <dc:creator>rishabh pahwa</dc:creator>
      <pubDate>Wed, 27 May 2026 17:10:33 +0000</pubDate>
      <link>https://dev.to/rishabh_pahwa_1a2b93e60b0/problem-framing-4g91</link>
      <guid>https://dev.to/rishabh_pahwa_1a2b93e60b0/problem-framing-4g91</guid>
      <description>&lt;p&gt;Your transaction IDs are a critical database indexing strategy, not just a unique identifier. Generate them wrong, and your multi-tenant financial system will grind to a halt because you've inadvertently shattered data locality for common queries.&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem Framing
&lt;/h2&gt;

&lt;p&gt;Imagine running a payment processor handling millions of transactions daily across thousands of merchants. A fundamental, frequently executed query is "show me the last 100 transactions for merchant &lt;code&gt;ABC&lt;/code&gt;." If your &lt;code&gt;transaction_id&lt;/code&gt; is a Twitter Snowflake ID and serves as the primary key, your database will struggle.&lt;/p&gt;

&lt;p&gt;Here's why: Snowflake IDs are globally unique and generally time-ordered. When &lt;code&gt;merchant_ABC&lt;/code&gt; processes a transaction at 10:00:00.123, its &lt;code&gt;transaction_id&lt;/code&gt; will be numerically close to &lt;code&gt;merchant_XYZ&lt;/code&gt;'s transaction at 10:00:00.124. This means &lt;code&gt;merchant_ABC&lt;/code&gt;'s transactions from Monday will be physically interspersed with &lt;em&gt;all other merchants'&lt;/em&gt; transactions from Monday in your database's primary index.&lt;/p&gt;

&lt;p&gt;To satisfy the "last 100 transactions for merchant &lt;code&gt;ABC&lt;/code&gt;" query, the database engine can't efficiently read contiguous blocks of data. It must scan an index (potentially a secondary index on &lt;code&gt;(merchant_id, created_at)&lt;/code&gt;) to find &lt;code&gt;transaction_id&lt;/code&gt;s, then perform random lookups in the primary index. Each lookup for a scattered row is likely to miss the buffer cache, forcing the database to fetch another page from SSD (8KB by default in PostgreSQL, roughly 0.1-1ms per fetch). Instead of a few efficient reads that each cover many rows, you get roughly one random read per row, pushing query latency from under 50ms to hundreds of milliseconds or even seconds at scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core Concept: Snowflake IDs vs. Data Locality
&lt;/h2&gt;

&lt;p&gt;Twitter's Snowflake ID is a 64-bit integer designed for globally unique, distributed ID generation. It encodes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;64 bits total:
+----------------+---------------------+---------------------+--------------------+
| Unused (1 bit) | Timestamp (41 bits) | Worker ID (10 bits) | Sequence (12 bits) |
+----------------+---------------------+---------------------+--------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The timestamp component ensures IDs are roughly time-ordered, which is excellent for things like Twitter timelines where you want to fetch recent tweets quickly, regardless of the user who posted them. The worker ID allows multiple servers to generate IDs concurrently without collisions, and the sequence number handles bursts within a millisecond on a single worker.&lt;/p&gt;
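
&lt;p&gt;The bit layout above can be sketched in a few lines of Python. This is a minimal illustration, not Twitter's implementation: the epoch value and the helper names &lt;code&gt;make_snowflake&lt;/code&gt; and &lt;code&gt;split_snowflake&lt;/code&gt; are assumptions, and multiplication and &lt;code&gt;divmod&lt;/code&gt; stand in for the usual bit shifts and masks.&lt;/p&gt;

```python
EPOCH_MS = 1_600_000_000_000  # assumed custom epoch (illustrative only)
WORKER_SLOTS = 2 ** 10        # 10-bit worker ID
SEQUENCE_SLOTS = 2 ** 12      # 12-bit per-millisecond sequence


def make_snowflake(timestamp_ms: int, worker_id: int, sequence: int) -> int:
    """Pack timestamp, worker ID, and sequence into one integer."""
    if worker_id not in range(WORKER_SLOTS) or sequence not in range(SEQUENCE_SLOTS):
        raise ValueError("worker_id or sequence out of range")
    elapsed = timestamp_ms - EPOCH_MS
    return (elapsed * WORKER_SLOTS + worker_id) * SEQUENCE_SLOTS + sequence


def split_snowflake(snowflake_id: int) -> tuple[int, int, int]:
    """Recover (timestamp_ms, worker_id, sequence) from an ID."""
    rest, sequence = divmod(snowflake_id, SEQUENCE_SLOTS)
    elapsed, worker_id = divmod(rest, WORKER_SLOTS)
    return elapsed + EPOCH_MS, worker_id, sequence


# Two hypothetical merchants transacting one millisecond apart, on different workers.
id_abc = make_snowflake(1_700_000_000_123, worker_id=1, sequence=0)
id_xyz = make_snowflake(1_700_000_000_124, worker_id=7, sequence=0)
```

&lt;p&gt;Because the timestamp occupies the most significant bits, &lt;code&gt;id_xyz&lt;/code&gt; sorts directly after &lt;code&gt;id_abc&lt;/code&gt; regardless of which merchant or worker produced it.&lt;/p&gt;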

&lt;p&gt;For Twitter's use case, where global uniqueness and time-based sorting are paramount, Snowflake IDs are a brilliant fit. The system rarely needs to query "all tweets from user X" ordered chronologically; instead, it aggregates a user's timeline from various sources.&lt;/p&gt;

&lt;p&gt;However, in a multi-tenant financial system, the access patterns are fundamentally different:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Dominant Query Pattern:&lt;/strong&gt; Almost all critical queries are scoped by &lt;code&gt;tenant_id&lt;/code&gt; (e.g., &lt;code&gt;merchant_id&lt;/code&gt;, &lt;code&gt;customer_id&lt;/code&gt;). For example: "Get all transactions for &lt;code&gt;merchant_ABC&lt;/code&gt;," "Find a specific invoice for &lt;code&gt;customer_XYZ&lt;/code&gt;," "List recent withdrawals for &lt;code&gt;user_123&lt;/code&gt;."&lt;/li&gt;
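&lt;li&gt; &lt;strong&gt;B-Tree Indexing:&lt;/strong&gt; Modern relational databases (PostgreSQL, MySQL InnoDB) use B-tree indexes. In engines with a clustered index, such as InnoDB, the primary key dictates the physical storage order of rows on disk (or SSD), so a Snowflake ID primary key orders rows by that ID. PostgreSQL instead keeps rows in a heap in roughly insertion order, which for time-ordered IDs produces much the same interleaving of tenants.&lt;/li&gt;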
&lt;li&gt; &lt;strong&gt;Fragmentation:&lt;/strong&gt; Since a Snowflake ID's primary sorting component is time, &lt;code&gt;merchant_ABC&lt;/code&gt;'s transactions from &lt;code&gt;T1&lt;/code&gt; will be stored near &lt;code&gt;merchant_XYZ&lt;/code&gt;'s transactions from &lt;code&gt;T1+1ms&lt;/code&gt;. This means &lt;code&gt;merchant_ABC&lt;/code&gt;'s data is scattered across numerous disk pages.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Consider the physical layout difference:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Primary Key: Snowflake ID (Fragmented Data)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Disk Pages:
Page 1: [SnowflakeID_T1_W1_S1 (TenantA_Txn1)] [SnowflakeID_T1_W1_S2 (TenantB_Txn1)] ...
Page 2: [SnowflakeID_T1_W2_S1 (TenantC_Txn1)] [SnowflakeID_T1_W2_S2 (TenantA_Txn2)] ...
Page 3: [SnowflakeID_T2_W1_S1 (TenantB_Txn2)] [SnowflakeID_T2_W1_S2 (TenantD_Txn1)] ...

To query TenantA's transactions, the DB jumps between Page 1, Page 2, etc. --&amp;gt; Many random reads, low cache hit rate.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Composite Primary Key: (Tenant ID, Transaction Timestamp) (Co-located Data)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Disk Pages:
Page 1: [TenantA_Txn1_T1] [TenantA_Txn2_T1] [TenantA_Txn3_T2] [TenantA_Txn4_T2] ...
Page 2: [TenantB_Txn1_T1] [TenantB_Txn2_T1] [TenantB_Txn3_T2] [TenantB_Txn4_T2] ...
Page 3: [TenantC_Txn1_T1] [TenantC_Txn2_T1] [TenantC_Txn3_T2] [TenantC_Txn4_T2] ...

To query TenantA's transactions, the DB reads Page 1 sequentially --&amp;gt; Few sequential reads, high cache hit rate.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The difference is stark: reading a few contiguous pages is far cheaper than fetching hundreds of scattered ones. The database issues fewer I/O operations, storage read-ahead works in its favor, and each page pulled into the buffer cache carries many rows the query actually needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-world Application: Prioritizing Locality for Financial Systems
&lt;/h2&gt;

&lt;p&gt;For systems like payment processors (e.g., Stripe, Adyen) or ledger databases, data locality around the &lt;code&gt;tenant_id&lt;/code&gt; is paramount. They prioritize fast, reliable access to an individual merchant's or user's financial history.&lt;/p&gt;

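&lt;p&gt;A robust approach involves using a &lt;strong&gt;composite primary key&lt;/strong&gt; that starts with the &lt;code&gt;tenant_id&lt;/code&gt;. For example: &lt;code&gt;PRIMARY KEY (merchant_id, created_at_timestamp_ms)&lt;/code&gt;. Because two transactions for the same merchant can land in the same millisecond, in practice the key needs a trailing tie-breaker column (for instance the &lt;code&gt;transaction_id&lt;/code&gt;) to stay unique.&lt;/p&gt;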

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;How it works:&lt;/strong&gt; When you define &lt;code&gt;(merchant_id, created_at_timestamp_ms)&lt;/code&gt; as your primary key, the database physically stores all transactions for &lt;code&gt;merchant_A&lt;/code&gt; together, sorted by &lt;code&gt;created_at_timestamp_ms&lt;/code&gt;. After &lt;code&gt;merchant_A&lt;/code&gt;'s data, &lt;code&gt;merchant_B&lt;/code&gt;'s data follows, and so on.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Performance Impact:&lt;/strong&gt; When &lt;code&gt;merchant_A&lt;/code&gt; requests their last 100 transactions, the database performs a single, efficient index scan directly to &lt;code&gt;merchant_A&lt;/code&gt;'s section of the B-tree. It then reads a few contiguous disk pages to retrieve all 100 rows. This can reduce I/O operations from potentially hundreds of random page fetches (taking 50-100ms) down to 2-3 sequential page fetches (taking &amp;lt;1ms). This isn't just a small optimization; it's the difference between a usable system and one that collapses under load. This directly impacts P99 query latency, a critical metric for production financial systems.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Unique Identifier Trade-offs:&lt;/strong&gt; You can still generate a globally unique &lt;code&gt;transaction_id&lt;/code&gt; (perhaps even a Snowflake ID) if other parts of your system need it. However, it should not be the primary clustering key for your main transaction table. If a globally unique &lt;code&gt;transaction_id&lt;/code&gt; is required as &lt;em&gt;the&lt;/em&gt; primary key for external reasons, physically reorder the table on &lt;code&gt;(tenant_id, created_at)&lt;/code&gt; where your database supports it, for example with PostgreSQL's &lt;code&gt;CLUSTER&lt;/code&gt; command. Note that &lt;code&gt;CLUSTER&lt;/code&gt; is a one-time rewrite that takes an exclusive lock and is not maintained for later inserts, so it has to be re-run periodically. That is real operational overhead, but it yields similar read benefits.&lt;/li&gt;
&lt;/ul&gt;
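
&lt;p&gt;As a small runnable sketch of the composite-key idea, the snippet below uses SQLite through Python's &lt;code&gt;sqlite3&lt;/code&gt; module as a stand-in for a production database. SQLite's &lt;code&gt;WITHOUT ROWID&lt;/code&gt; option stores rows in the primary key's B-tree, which approximates a clustered table. The table layout, the &lt;code&gt;amount_cents&lt;/code&gt; column, the &lt;code&gt;transaction_id&lt;/code&gt; tie-breaker, and the &lt;code&gt;last_transactions&lt;/code&gt; helper are all assumptions made for illustration.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# WITHOUT ROWID makes SQLite store rows in the primary key's B-tree,
# so each merchant's rows sit together, ordered by time.
conn.execute(
    """
    CREATE TABLE transactions (
        merchant_id    TEXT    NOT NULL,
        created_at_ms  INTEGER NOT NULL,
        transaction_id INTEGER NOT NULL,  -- tie-breaker; keeps the key unique
        amount_cents   INTEGER NOT NULL,
        PRIMARY KEY (merchant_id, created_at_ms, transaction_id)
    ) WITHOUT ROWID
    """
)

# Interleaved sample data: two merchants transacting at the same timestamps.
rows = [
    ("merchant_ABC", 1000 + i, i, 500) for i in range(150)
] + [
    ("merchant_XYZ", 1000 + i, 10_000 + i, 700) for i in range(150)
]
conn.executemany("INSERT INTO transactions VALUES (?, ?, ?, ?)", rows)


def last_transactions(merchant_id: str, limit: int = 100) -> list[tuple]:
    """Return the newest rows for one merchant via a primary-key range scan."""
    cursor = conn.execute(
        """
        SELECT created_at_ms, transaction_id, amount_cents
        FROM transactions
        WHERE merchant_id = ?
        ORDER BY created_at_ms DESC, transaction_id DESC
        LIMIT ?
        """,
        (merchant_id, limit),
    )
    return cursor.fetchall()


recent = last_transactions("merchant_ABC")
```

&lt;p&gt;The query filters on the leading key column and orders by the remaining ones, so the engine can walk one contiguous key range instead of looking rows up one by one.&lt;/p&gt;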

&lt;h2&gt;
  
  
  Common Mistakes
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Blindly Applying "Cool" Tech:&lt;/strong&gt; Snowflake IDs are elegant, but they are a solution to a specific problem (distributed, globally unique, time-sortable IDs where global sorting is often the primary access pattern). Assuming it's universally "best practice" without understanding your specific query patterns is a critical mistake.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Ignoring Database Storage Engine Details:&lt;/strong&gt; Most engineers understand indexes, but fewer deeply grasp how B-trees physically store data and how that impacts page reads, buffer cache efficiency, and disk I/O. Your primary key isn't just a uniqueness constraint; it's a fundamental data clustering strategy.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Over-indexing to Compensate:&lt;/strong&gt; Creating a secondary index on &lt;code&gt;(tenant_id, created_at DESC)&lt;/code&gt; helps the database find relevant rows, but if the table is clustered by a Snowflake ID, the database still needs to perform a "double lookup"—scanning the secondary index, then randomly fetching rows from the primary table. This is less efficient than a primary key that inherently clusters the data.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Prioritizing Global Uniqueness Over Query Locality:&lt;/strong&gt; While global uniqueness for IDs is often important, it should not come at the cost of crippling your most common, performance-critical queries. Always design your primary key around your dominant read patterns first.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Interview Angle
&lt;/h2&gt;

&lt;p&gt;You're likely to encounter questions about distributed ID generation in system design interviews. When discussing a multi-tenant system, expect follow-ups that probe your understanding of data locality and database performance.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Question:&lt;/strong&gt; "You're designing a high-throughput payment processing system for multiple merchants. How would you generate transaction IDs, and what considerations would you make for querying transaction history for a specific merchant?"&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Strong Answer:&lt;/strong&gt; "I'd start by recognizing that for a multi-tenant financial system, the most common and critical queries will be scoped by &lt;code&gt;merchant_id&lt;/code&gt;. Therefore, optimizing for data locality around &lt;code&gt;merchant_id&lt;/code&gt; is paramount. Instead of a globally unique, time-ordered ID like Twitter's Snowflake as the primary key, I would advocate for a &lt;strong&gt;composite primary key&lt;/strong&gt; such as &lt;code&gt;(merchant_id, transaction_timestamp_ms)&lt;/code&gt;, with the transaction ID appended as a tie-breaker so the key stays unique. This ensures all transactions for a given merchant are physically co-located on disk, dramatically improving cache hit rates and reducing random I/O for &lt;code&gt;WHERE merchant_id = X ORDER BY transaction_timestamp_ms DESC&lt;/code&gt; queries. We could still generate a separate, globally unique &lt;code&gt;transaction_id&lt;/code&gt; (using UUIDs or even Snowflake-like IDs) for external system integration or specific global lookups, but it wouldn't be the clustering key of our main transaction table."&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Question:&lt;/strong&gt; "What specific performance metrics would you monitor to detect if your primary key strategy is leading to index fragmentation issues, and how would you mitigate them?"&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Strong Answer:&lt;/strong&gt; "I'd closely monitor several database metrics: average disk read latency, page fault rates, buffer cache hit ratio, and index scan efficiency. High values for latency and page faults, coupled with a low cache hit ratio, would strongly suggest data fragmentation. To mitigate, if my primary key wasn't tenant-aware, I'd first analyze query patterns to confirm the common access paths. Then, I'd consider refactoring the primary key to a composite &lt;code&gt;(tenant_id, timestamp)&lt;/code&gt; structure, or, if the existing primary key must be maintained, leverage database-specific features like PostgreSQL's &lt;code&gt;CLUSTER&lt;/code&gt; command to physically reorder the table data according to a more locality-friendly index. In MySQL InnoDB, where rows always follow the primary key, &lt;code&gt;OPTIMIZE TABLE&lt;/code&gt; would only rebuild the table in its existing key order, so there I'd have to change the primary key itself."&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Thinking through complex system design?&lt;br&gt;
Let's connect for a 1:1 on Topmate to discuss your challenges and level up your skills.&lt;/p&gt;




&lt;h2&gt;
  
  
  Want to Go Deeper?
&lt;/h2&gt;

&lt;p&gt;I do 1:1 sessions on system design, backend architecture, and interview prep.&lt;br&gt;
If you're preparing for a Staff/Senior role or cracking FAANG rounds — &lt;a href="https://topmate.io/rishabh_pahwa" rel="noopener noreferrer"&gt;book a session here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>databaseperformance</category>
      <category>systemdesign</category>
      <category>multitenant</category>
      <category>distributedids</category>
    </item>
    <item>
      <title>Why Your LLM Bot Forgets Everything</title>
      <dc:creator>rishabh pahwa</dc:creator>
      <pubDate>Fri, 22 May 2026 07:18:31 +0000</pubDate>
      <link>https://dev.to/rishabh_pahwa_1a2b93e60b0/why-your-llm-bot-forgets-everything-16p8</link>
      <guid>https://dev.to/rishabh_pahwa_1a2b93e60b0/why-your-llm-bot-forgets-everything-16p8</guid>
      <description>&lt;p&gt;Your decade-old "stateless microservice" mantra is failing your LLM-powered applications. Treating every LLM request as an independent, isolated transaction ignores the fundamental need for persistent, evolving context, leading to astronomically high costs and a broken user experience.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Your LLM Bot Forgets Everything
&lt;/h2&gt;

&lt;p&gt;Imagine you're building a customer support chatbot. A user asks: "My order #7890 is stuck, can you help?" Your API Gateway routes this to a stateless &lt;code&gt;llm-processor&lt;/code&gt; microservice. This service pulls the order details from a database, adds them to the prompt, sends it to GPT-4, and returns a polite "I'm looking into order #7890."&lt;/p&gt;

&lt;p&gt;The user then asks: "What's the estimated delivery date?"&lt;br&gt;
If your architecture is purely stateless, that second request hits a new &lt;code&gt;llm-processor&lt;/code&gt; instance, completely unaware of the previous interaction. It has no idea what "the estimated delivery date" refers to. It will likely respond with a generic "Please specify which order you're referring to," or worse, hallucinate.&lt;/p&gt;

&lt;p&gt;This isn't just annoying; it's slow, expensive, and wastes user patience. Every single turn of the conversation means:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Re-fetching context:&lt;/strong&gt; The system has to re-query databases for order #7890 details.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Re-prompting:&lt;/strong&gt; The LLM receives a prompt that likely needs to re-introduce previous context, consuming more tokens and increasing latency and cost.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;No conversational memory:&lt;/strong&gt; The user experience is disjointed and frustrating. Your bot acts like it has severe amnesia. This drives user churn faster than any bug.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  The Dedicated State Service: Your LLM's Memory Bank
&lt;/h2&gt;

&lt;p&gt;A new generation of LLM architectures moves away from purely stateless services for core interaction flows. Instead, they introduce a dedicated &lt;strong&gt;State Service&lt;/strong&gt;. This isn't just a database; it's an intelligent orchestrator of user-specific context, session history, and often, retrieved external information.&lt;/p&gt;

&lt;p&gt;The core idea is to establish a persistent &lt;em&gt;session context&lt;/em&gt; for each user interaction. When a user sends a query, the LLM Orchestrator service first retrieves relevant context from the State Service before composing the final prompt. After the LLM responds, the orchestrator updates the State Service with the latest turn, optionally summarizing or pruning older history.&lt;/p&gt;

&lt;p&gt;Here's how it generally flows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;USER
  |
  V
[API Gateway]
  |
  V
[LLM Orchestrator] --- (User ID) ---&amp;gt; [State Service]
  |                                     ^      |
  | (Get Context)                       |      | (Store/Update Context)
  +-------------------------------------+      |
  |                                            |
  V (Context + Current Prompt)                 V (Session History, RAG Data, Preferences)
[LLM Provider] (e.g., OpenAI, Anthropic, OSS LLM)
  |
  V (LLM Response)
[LLM Orchestrator]
  |
  V (User Response)
USER
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The State Service stores:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Conversation History:&lt;/strong&gt; The raw turns of the conversation, potentially summarized.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;User Preferences/Profile:&lt;/strong&gt; Specific settings, roles, or persona details.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Retrieval Augmented Generation (RAG) Data:&lt;/strong&gt; Documents, database records, or search results retrieved for the current session.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Intermediate Results:&lt;/strong&gt; Partially completed tasks, user intentions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By doing this, the LLM Orchestrator can construct a lean, targeted prompt for the LLM, reducing token counts by 50-80% on subsequent turns compared to rebuilding context from scratch. This directly translates to lower API costs and faster response times.&lt;/p&gt;
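
&lt;p&gt;The retrieve, compose, and update loop described above might look roughly like the following. This is a minimal in-memory sketch: &lt;code&gt;SessionState&lt;/code&gt;, &lt;code&gt;StateService&lt;/code&gt;, and &lt;code&gt;build_prompt&lt;/code&gt; are hypothetical names, and the "summary" here is plain concatenation of pruned turns rather than an LLM-generated digest.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class SessionState:
    """Hypothetical per-session record held by the State Service."""
    turns: list[tuple[str, str]] = field(default_factory=list)  # (role, text)
    entities: dict[str, str] = field(default_factory=dict)      # e.g. order_id
    summary: str = ""                                            # digest of pruned turns


class StateService:
    """In-memory stand-in; a real service would use a durable store."""

    def __init__(self, max_turns: int = 4) -> None:
        self._sessions: dict[str, SessionState] = {}
        self._max_turns = max_turns

    def get(self, session_id: str) -> SessionState:
        return self._sessions.setdefault(session_id, SessionState())

    def record_turn(self, session_id: str, role: str, text: str) -> None:
        state = self.get(session_id)
        state.turns.append((role, text))
        # Prune: fold the oldest turns into the summary once the window is full.
        while len(state.turns) > self._max_turns:
            old_role, old_text = state.turns.pop(0)
            state.summary = (state.summary + f" {old_role}: {old_text}").strip()


def build_prompt(state: SessionState, user_message: str) -> str:
    """Compose a prompt from stored context plus the new message."""
    lines = []
    if state.summary:
        lines.append(f"Summary so far: {state.summary}")
    if state.entities:
        facts = ", ".join(f"{k}={v}" for k, v in sorted(state.entities.items()))
        lines.append(f"Known facts: {facts}")
    lines.extend(f"{role}: {text}" for role, text in state.turns)
    lines.append(f"user: {user_message}")
    return "\n".join(lines)


service = StateService()
service.record_turn("s1", "user", "My order #7890 is stuck, can you help?")
service.get("s1").entities["order_id"] = "7890"
service.record_turn("s1", "assistant", "I'm looking into order #7890.")
prompt = build_prompt(service.get("s1"), "What's the estimated delivery date?")
```

&lt;p&gt;The follow-up question now reaches the model together with the stored order ID, so "the estimated delivery date" has something to refer to.&lt;/p&gt;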

&lt;h2&gt;
  
  
  How Companies Handle Stateful LLM Interactions at Scale
&lt;/h2&gt;

&lt;p&gt;Consider a platform like &lt;strong&gt;Intercom's Fin AI Bot&lt;/strong&gt; or &lt;strong&gt;Zendesk's AI Agent Assist&lt;/strong&gt;. These systems can't afford to rebuild context for every user interaction across millions of conversations. They leverage sophisticated state management.&lt;/p&gt;

&lt;p&gt;When a user initiates a chat, a unique &lt;code&gt;session_id&lt;/code&gt; is established. This &lt;code&gt;session_id&lt;/code&gt; becomes the key for retrieving and storing conversational state in a dedicated, low-latency data store. They might use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Redis Enterprise&lt;/strong&gt; for in-memory caching of active session data, providing sub-millisecond latency for context retrieval.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Amazon DynamoDB&lt;/strong&gt; or &lt;strong&gt;Cassandra&lt;/strong&gt; for more durable, sharded storage of full conversation histories, with an eviction policy for very old, inactive sessions.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Custom data structures&lt;/strong&gt; within the State Service that intelligently summarize older conversation turns using an LLM itself (e.g., "Summarize the conversation so far for the LLM") to keep the active prompt window small and token-efficient.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They don't just dump raw text. They might store structured JSON objects representing key-value pairs of extracted entities (e.g., &lt;code&gt;{"order_id": "7890", "issue": "delivery_delay"}&lt;/code&gt;) alongside the conversation history. This allows the orchestrator to quickly inject relevant, structured data into the prompt without re-parsing lengthy texts. This approach reduces the effective context window size passed to the LLM, directly saving compute and API costs, while maintaining a coherent conversation.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Most People Get Wrong
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Treating the State Service as just a Cache:&lt;/strong&gt; This isn't temporary, easily discardable data. It's critical, active conversational context. A simple LRU cache is insufficient because it doesn't account for persistence, intelligent summarization, or the active lifecycle of a conversation. State needs to be durable enough to survive orchestrator restarts and consistent across the turns of a multi-step operation.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Storing Too Much, Unstructured State:&lt;/strong&gt; Engineers often just dump the entire raw conversation history into the state store. This quickly bloats the context window, leading to higher token costs and slower inference times. The State Service needs logic for:

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Summarization:&lt;/strong&gt; Periodically summarizing older parts of the conversation.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Pruning:&lt;/strong&gt; Removing irrelevant or outdated information.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Structured Entity Extraction:&lt;/strong&gt; Converting free-form text into key-value pairs (e.g., extracting order IDs, dates, user names) to provide concise, direct context.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Lack of Distributed Coordination:&lt;/strong&gt; In a scaled-out system, multiple &lt;code&gt;LLM Orchestrator&lt;/code&gt; instances might try to read or update the same user's session state concurrently. Without proper distributed locks or optimistic concurrency controls, you can end up with race conditions, inconsistent state, or lost updates, making your bot "forget" recent turns.&lt;/li&gt;
&lt;/ol&gt;
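
&lt;p&gt;One way to guard against the lost-update problem in the last point is optimistic concurrency: every write states which version it read, and the store rejects it if the version has moved on. The sketch below is an in-memory illustration under that assumption; &lt;code&gt;VersionedStore&lt;/code&gt;, &lt;code&gt;VersionConflict&lt;/code&gt;, and &lt;code&gt;append_turn&lt;/code&gt; are hypothetical names, and a real deployment would rely on the conditional-write feature of its data store instead of a local lock.&lt;/p&gt;

```python
import threading


class VersionConflict(Exception):
    """Raised when another writer updated the session first."""


class VersionedStore:
    """Hypothetical in-memory store with compare-and-set semantics."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._data: dict[str, tuple[int, list[str]]] = {}

    def read(self, session_id: str) -> tuple[int, list[str]]:
        with self._lock:
            version, turns = self._data.get(session_id, (0, []))
            return version, list(turns)

    def write(self, session_id: str, expected_version: int, turns: list[str]) -> int:
        with self._lock:
            current_version, _ = self._data.get(session_id, (0, []))
            if current_version != expected_version:
                raise VersionConflict(session_id)
            self._data[session_id] = (current_version + 1, list(turns))
            return current_version + 1


def append_turn(store: VersionedStore, session_id: str, turn: str, retries: int = 3) -> int:
    """Read-modify-write with retry, so concurrent appends are not lost."""
    for _ in range(retries):
        version, turns = store.read(session_id)
        try:
            return store.write(session_id, version, turns + [turn])
        except VersionConflict:
            continue  # someone else won the race; re-read and try again
    raise VersionConflict(session_id)


store = VersionedStore()
append_turn(store, "s1", "user: hi")
stale_version, stale_turns = store.read("s1")
append_turn(store, "s1", "assistant: hello")  # another orchestrator writes first
try:
    store.write("s1", stale_version, stale_turns + ["user: lost?"])
    conflict_detected = False
except VersionConflict:
    conflict_detected = True
```

&lt;p&gt;The stale write is rejected instead of silently overwriting the assistant's turn, which is exactly the "forgotten" turn the race would otherwise cause.&lt;/p&gt;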

&lt;h2&gt;
  
  
  Interview Angle
&lt;/h2&gt;

&lt;p&gt;When designing LLM-powered systems, interviewers will challenge your understanding of state management beyond simple caching.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"How would you handle state for a million concurrent users in a personalized LLM assistant?"&lt;/strong&gt;&lt;br&gt;
A strong answer goes beyond "use Redis." You'd discuss sharding the state service by &lt;code&gt;user_id&lt;/code&gt; or &lt;code&gt;session_id&lt;/code&gt; to distribute load and improve retrieval latency. Mention replication for high availability and durability. Crucially, talk about &lt;strong&gt;intelligent state management&lt;/strong&gt;: implementing a policy for summarization and eviction (e.g., active sessions in-memory, older sessions in a persistent store like DynamoDB, with an LLM-powered summarizer pruning the context window dynamically). You'd discuss how to identify "inactive" sessions to move them to cheaper storage or expire them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"What are the trade-offs of storing full conversation history versus summarized history?"&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Full History:&lt;/strong&gt; Pros – complete context, no loss of nuance. Cons – high token cost, increased latency, storage bloat, hits LLM context window limits quickly. Good for debugging or very short, critical interactions.&lt;br&gt;
&lt;strong&gt;Summarized History:&lt;/strong&gt; Pros – significantly reduced token cost, faster inference, fits within smaller context windows. Cons – potential loss of nuance/detail, summarization itself consumes LLM tokens/compute, risk of "hallucinated summaries" if not carefully engineered. Good for long-running conversations where fine-grained detail isn't critical for every turn. The trade-off is often between token efficiency/latency and conversational coherence/accuracy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"How does Retrieval Augmented Generation (RAG) fit into this state management?"&lt;/strong&gt;&lt;br&gt;
RAG isn't just a one-off query. The &lt;em&gt;results&lt;/em&gt; of RAG (e.g., retrieved documents, database query outputs) become part of the session state. If a user asks about "order status" and your RAG system pulls order #7890's details, those details should be stored in the State Service. This ensures subsequent turns referencing "the order" can access those previously retrieved facts without hitting the RAG system again, further reducing latency and redundant work.&lt;/p&gt;

&lt;p&gt;Designing LLM applications successfully requires a fundamental shift from purely stateless paradigms to intelligent, distributed state management. Master this, and you'll build robust, cost-effective, and genuinely helpful AI experiences.&lt;/p&gt;





</description>
      <category>llmarchitecture</category>
      <category>systemdesign</category>
      <category>statemanagement</category>
      <category>microservices</category>
    </item>
    <item>
      <title>Problem Framing: The Cost of Naiveté</title>
      <dc:creator>rishabh pahwa</dc:creator>
      <pubDate>Tue, 19 May 2026 09:23:28 +0000</pubDate>
      <link>https://dev.to/rishabh_pahwa_1a2b93e60b0/problem-framing-the-cost-of-naivete-48dd</link>
      <guid>https://dev.to/rishabh_pahwa_1a2b93e60b0/problem-framing-the-cost-of-naivete-48dd</guid>
      <description>&lt;p&gt;Most rate limiters are designed to manage request volume, preventing system overload and abuse. But when you’re dealing with LLM API calls, a single request isn't just "one request"—it can be a $5 transaction or take 60 seconds to complete. Your standard distributed counter or token bucket approach will quickly burn through budgets and exhaust critical resources.&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem Framing: The Cost of Naiveté
&lt;/h2&gt;

&lt;p&gt;Imagine you're building an AI-powered assistant. Users interact with it, triggering calls to an expensive LLM API. A simple rate limit, say 10 requests per second per user, seems reasonable. Now, consider a user who sends one complex prompt that generates a 50,000-token response, costing $10 and taking 30 seconds. With a naive rate limit, this user still has 9 "requests" remaining for that second, which could be another 9 equally expensive calls, bringing the total to $100 and congesting your LLM gateway. Meanwhile, another user needing a quick, cheap 100-token summary might be blocked because the first user's long-running request is tying up the underlying LLM capacity. You're not just preventing DDoS; you're managing a financial burn rate and ensuring fair resource allocation for non-uniform work. The system fails when it treats a $0.001 request the same as a $10 request.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core Concept: Cost-Aware Rate Limiting
&lt;/h2&gt;

&lt;p&gt;Effective rate limiting for LLMs needs to go beyond simple request counts. It requires a &lt;em&gt;cost-aware&lt;/em&gt; or &lt;em&gt;resource-aware&lt;/em&gt; approach. Instead of merely counting requests, you assign a "weight" or "cost unit" to each potential API call. This cost can be an estimation of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Tokens:&lt;/strong&gt; Input + estimated output tokens.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Monetary Cost:&lt;/strong&gt; Based on provider pricing (e.g., $X per 1k tokens).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Processing Time:&lt;/strong&gt; Estimated latency for the specific model and prompt complexity.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your rate limiter then operates on these cost units. For example, a user might be allowed 100,000 cost units per minute, where a simple call consumes 100 units and a complex one consumes 10,000 units. A common pattern is to use a token bucket or leaky bucket, but instead of "tokens" representing requests, they represent these "cost units."&lt;/p&gt;

&lt;p&gt;Here's how a cost-aware rate limiter might integrate into your LLM service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+---------------------+        +---------------------+        +---------------------+
|  Incoming LLM Call  | ----&amp;gt;  |  Request Parser     | ----&amp;gt;  |  Policy Engine      |
| (user_id, model_id, |        | (Extracts prompt,   |        | (Defines cost rules:|
|     prompt)         |        |  params, headers)   |        |  e.g., model_A = $X/ |
+---------------------+        +---------------------+        |  token, user_tier_Y |
                                                               |  has budget $Z/min) |
                                                               +---------+---------+
                                                                         |
                                                                         V
                                                        +---------------------------+
                                                        |  Cost Estimator           |
                                                        | (Calculates estimated cost|
                                                        |  for this request based   |
                                                        |  on policy and input)     |
                                                        +---------+---------+
                                                                  |
                                                                  V
                                                        +---------------------------+
                                                        |  Rate Limiter Backend     |
                                                        | (e.g., Redis HSET user_id |
                                                        |  { 'cost_spent_min': X,   |
                                                        |    'req_count_min': Y,    |
                                                        |    'last_reset': TS })    |
                                                        |  Decision: ALLOW/DENY     |
                                                        +---------+---------+
                                                                  | (ALLOW)
                                                                  V
                                                        +---------------------+
                                                        |  LLM Service Proxy  |
                                                        | (Forwards request to|
                                                        |  LLM Provider)      |
                                                        +---------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a request arrives, the &lt;code&gt;Request Parser&lt;/code&gt; extracts relevant details. The &lt;code&gt;Policy Engine&lt;/code&gt; defines the rules (e.g., &lt;code&gt;gpt-4-turbo&lt;/code&gt; costs $10/1M input tokens, $30/1M output tokens; premium users get 5x standard budget). The &lt;code&gt;Cost Estimator&lt;/code&gt; then calculates the &lt;em&gt;estimated cost&lt;/em&gt; of the incoming request. This estimation considers factors like input token count, chosen model, and a heuristic for expected output tokens (e.g., average response length, or a configurable maximum).&lt;/p&gt;
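&lt;p&gt;The estimation step can be sketched in a few lines of Python. This is a minimal illustration, assuming a hypothetical in-code price table (&lt;code&gt;PRICES&lt;/code&gt;) and a caller-supplied guess for output tokens; the names are made up for this example and are not part of any provider SDK:&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    """Price per one million tokens, in dollars (illustrative values)."""
    input_per_million: float
    output_per_million: float


# Hypothetical policy table; the gpt-4-turbo figures are the ones quoted above.
PRICES = {"gpt-4-turbo": ModelPrice(10.0, 30.0)}


def estimate_cost(model_id: str, input_tokens: int, expected_output_tokens: int) -> float:
    """Return the estimated dollar cost of a single request."""
    price = PRICES[model_id]
    total = (input_tokens * price.input_per_million
             + expected_output_tokens * price.output_per_million)
    return total / 1_000_000
```

&lt;p&gt;With these numbers, a 1,000-token prompt with 500 expected output tokens is estimated at $0.025.&lt;/p&gt;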

&lt;p&gt;The &lt;code&gt;Rate Limiter Backend&lt;/code&gt; (often Redis for distributed counters) then checks if the user/tenant has enough "budget" (cost units) remaining within the defined time window. If allowed, the estimated cost is deducted, and the request is forwarded.&lt;/p&gt;
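&lt;p&gt;The check-and-deduct step might look like the fixed-window counter below. It is an in-memory stand-in that assumes a single process; the class name &lt;code&gt;CostBudget&lt;/code&gt; and its method are hypothetical, and a production version would need to run the same logic atomically in a shared store such as Redis:&lt;/p&gt;

```python
import time
from typing import Optional


class CostBudget:
    """Fixed-window cost budget for one user (in-memory illustration only)."""

    def __init__(self, budget_per_window: float, window_seconds: float = 60.0):
        self.budget = budget_per_window
        self.window = window_seconds
        self.spent = 0.0
        self.window_start: Optional[float] = None

    def try_spend(self, estimated_cost: float, now: Optional[float] = None) -> bool:
        """Deduct the estimate and return True if it fits the remaining budget."""
        now = time.monotonic() if now is None else now
        if self.window_start is None or now - self.window_start >= self.window:
            # A new window has begun: reset the counter.
            self.spent = 0.0
            self.window_start = now
        if self.spent + estimated_cost > self.budget:
            return False  # DENY: not enough budget left in this window
        self.spent += estimated_cost
        return True  # ALLOW: the estimate has been deducted
```

&lt;p&gt;A fixed window is the simplest choice; a sliding window or token bucket would smooth out bursts at window boundaries at the cost of more state.&lt;/p&gt;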

&lt;h2&gt;
  
  
  Real-World Application: OpenAI's Token-Based Limits
&lt;/h2&gt;

&lt;p&gt;OpenAI itself uses a form of cost-aware rate limiting. In addition to "Requests Per Minute" (RPM), they impose "Tokens Per Minute" (TPM) limits. For example, a &lt;code&gt;gpt-4&lt;/code&gt; model might have a limit of 10,000 RPM and 1,000,000 TPM. With those numbers, you could send 10,000 requests averaging 100 tokens each, or far fewer large requests (for instance, 100 requests of 10,000 tokens each), before hitting a limit.&lt;/p&gt;

&lt;p&gt;This combined limit forces developers to consider both the sheer volume and the computational/cost weight of their API calls. If you hit your TPM limit, even if you haven't hit your RPM limit, your requests are throttled. This effectively manages the load on their GPUs and the financial burden for users.&lt;/p&gt;

&lt;p&gt;Organizations building on top of LLMs, like &lt;strong&gt;Stripe&lt;/strong&gt; (for internal fraud detection using AI) or &lt;strong&gt;Uber&lt;/strong&gt; (for customer support summarization), would implement similar cost-aware strategies. They might allocate a specific budget to each internal team or external customer, measured in tokens or estimated dollars per hour/day. When a request comes in, it's checked against that team's remaining budget. If a request is estimated to cost $0.50 and the team only has $0.20 remaining for the hour, the request is denied or queued. Post-call, actual token usage and cost can be reconciled, and overages might incur penalties or stricter temporary limits.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Mistakes
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Treating all LLM requests equally:&lt;/strong&gt; The most fundamental mistake. A simple "hello world" prompt to a cheap model is not the same as a complex prompt engineering chain for code generation on an expensive model. Failing to differentiate leads to uneven resource consumption and inaccurate billing/budgeting.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Ignoring non-determinism in LLM responses:&lt;/strong&gt; LLM output length (and thus token count) is often non-deterministic. If you estimate cost solely on input tokens, you'll frequently under-allocate budget. Strong solutions pre-allocate based on a conservative estimate (e.g., input tokens + max expected output tokens or a high percentile of historical output), then reconcile the &lt;em&gt;actual&lt;/em&gt; cost after the LLM call. If the actual cost exceeds the pre-allocated budget, you might temporarily penalize the user or mark it as an overage.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Only applying limits at the service ingress:&lt;/strong&gt; If your rate limiter is only at the API Gateway, it might catch basic abuse. However, for LLM-specific limits, you often need context from the &lt;em&gt;request payload&lt;/em&gt; (e.g., the prompt length, specific model ID). This requires the rate limiter to be closer to the application logic, often implemented as a middleware or proxy &lt;em&gt;before&lt;/em&gt; the call leaves your infrastructure for the LLM provider.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Static pricing/cost models:&lt;/strong&gt; LLM costs and model capabilities evolve rapidly. Hardcoding cost units or assuming fixed pricing is brittle. Your &lt;code&gt;Policy Engine&lt;/code&gt; must be configurable, ideally pulling pricing and model details from a dynamic source or a regularly updated configuration.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Interview Angle
&lt;/h2&gt;

&lt;p&gt;Interviewers will test your understanding of these nuances:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;"How do you handle the non-deterministic nature of LLM output tokens when estimating cost for rate limiting?"&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Strong Answer:&lt;/strong&gt; "You can't get it perfectly upfront. I'd use a two-phase reserve-and-reconcile approach: first, estimate based on input tokens plus a generous, configurable &lt;code&gt;max_output_tokens&lt;/code&gt;, or a percentile from historical data for that &lt;code&gt;(user_id, model_id)&lt;/code&gt; pair. Deduct this estimated cost. After the LLM call returns, get the &lt;em&gt;actual&lt;/em&gt; token usage. If the actual is less than estimated, credit the difference back. If it's significantly more, log an overage, potentially apply a temporary stricter limit, or trigger an alert. This balances immediate enforcement with eventual consistency."&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;  &lt;strong&gt;"What if a user intentionally tries to exhaust their budget with short, cheap prompts but many of them, or a few very expensive ones?"&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Strong Answer:&lt;/strong&gt; "This is why you need multi-dimensional limits. We'd have limits on both 'cost units per minute' &lt;em&gt;and&lt;/em&gt; 'requests per minute.' The cost unit limit handles expensive calls, while the request limit prevents flooding with many cheap calls. For expensive prompts, you might also introduce a 'concurrent expensive requests' limit to prevent single users from monopolizing LLM capacity."&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;  &lt;strong&gt;"How would you store and manage these cost-aware rate limiting states in a distributed system?"&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Strong Answer:&lt;/strong&gt; "We'd use a distributed key-value store like Redis. For each &lt;code&gt;user_id&lt;/code&gt; (or &lt;code&gt;client_id&lt;/code&gt;, &lt;code&gt;tenant_id&lt;/code&gt;), we'd store a hash containing &lt;code&gt;current_cost_spent&lt;/code&gt;, &lt;code&gt;current_request_count&lt;/code&gt;, and &lt;code&gt;last_reset_timestamp&lt;/code&gt; for each time window (e.g., minute, hour). Because these are hash fields, we'd update them with &lt;code&gt;HINCRBY&lt;/code&gt; (or &lt;code&gt;HINCRBYFLOAT&lt;/code&gt; for fractional cost units) and use &lt;code&gt;EXPIRE&lt;/code&gt; on the key for the time window reset. The check-and-deduct step must be atomic, for example inside a Lua script, to prevent race conditions between concurrent requests."&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
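&lt;p&gt;The reserve-then-reconcile idea from the first answer can be sketched as follows. This is a simplified, single-process illustration; &lt;code&gt;BudgetLedger&lt;/code&gt;, &lt;code&gt;reserve&lt;/code&gt;, and &lt;code&gt;settle&lt;/code&gt; are hypothetical names, and windowing and request-count limits are left out to keep the example short:&lt;/p&gt;

```python
class BudgetLedger:
    """Tracks cost per user: reserve an estimate up front, settle afterwards."""

    def __init__(self, budget: float):
        self.budget = budget
        self.spent = {}      # user_id -> cost units currently charged
        self.overages = []   # (user_id, amount) pairs for alerting or penalties

    def reserve(self, user_id: str, estimated_cost: float) -> bool:
        """Charge the estimate before the LLM call; False means DENY."""
        current = self.spent.get(user_id, 0.0)
        if current + estimated_cost > self.budget:
            return False
        self.spent[user_id] = current + estimated_cost
        return True

    def settle(self, user_id: str, estimated_cost: float, actual_cost: float) -> None:
        """Replace the reservation with the actual cost once usage is known."""
        # A cheaper-than-estimated call credits the difference back;
        # a more expensive one charges the extra and records an overage.
        self.spent[user_id] += actual_cost - estimated_cost
        if actual_cost > estimated_cost:
            self.overages.append((user_id, actual_cost - estimated_cost))
```

&lt;p&gt;Settling is allowed to push a user past the budget; the overage list is where a stricter temporary limit or an alert would hook in.&lt;/p&gt;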





&lt;h2&gt;
  
  
  Want to Go Deeper?
&lt;/h2&gt;

&lt;p&gt;I do 1:1 sessions on system design, backend architecture, and interview prep.&lt;br&gt;
If you're preparing for a Staff/Senior role or cracking FAANG rounds — &lt;a href="https://topmate.io/rishabh_pahwa" rel="noopener noreferrer"&gt;book a session here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>llm</category>
      <category>ratelimiting</category>
      <category>backendengineering</category>
    </item>
    <item>
      <title>Why "No Rollback" Breaks Production</title>
      <dc:creator>rishabh pahwa</dc:creator>
      <pubDate>Fri, 15 May 2026 08:44:38 +0000</pubDate>
      <link>https://dev.to/rishabh_pahwa_1a2b93e60b0/why-no-rollback-breaks-production-23ea</link>
      <guid>https://dev.to/rishabh_pahwa_1a2b93e60b0/why-no-rollback-breaks-production-23ea</guid>
      <description>&lt;p&gt;Most data migration strategies focus on getting to the new state. But your actual success metric isn't "migration complete," it's "can we revert this change without data loss?" A robust rollback mechanism isn't a luxury; it's the only way to guarantee business continuity when migrations inevitably hit a snag.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why "No Rollback" Breaks Production
&lt;/h2&gt;

&lt;p&gt;Imagine your team deploys a new feature requiring a crucial schema change—say, adding a &lt;code&gt;user_preferences&lt;/code&gt; JSONB column with a &lt;code&gt;NOT NULL&lt;/code&gt; constraint. You run the migration, deploy the new application code, and for the first 10 minutes, everything looks green. Then, an edge case surfaces: existing users with implicit empty preference data (handled by old app logic) start seeing 500 errors because the new application expects a specific, non-null JSON structure. Revenue instantly drops by 15%, and PagerDuty is screaming.&lt;/p&gt;

&lt;p&gt;Without a safe rollback strategy, you're in a nightmare scenario:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Roll forward with a hotfix:&lt;/strong&gt; Rushing a fix under pressure is a recipe for more bugs, especially if the underlying data is already corrupted or partially transformed.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Restore from backup:&lt;/strong&gt; This means hours of downtime and the loss of everything written since the backup was taken. If the last backup is a few hours old, those hours of data are gone.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Manual data repair:&lt;/strong&gt; An error-prone, slow process for critical data, often involving direct database manipulation, leading to further inconsistency.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All options are unacceptable in a production system handling high traffic or sensitive data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Designing for Zero-Data-Loss Rollback: The Phased Migration
&lt;/h2&gt;

&lt;p&gt;The core idea for safe rollbacks is to ensure your &lt;em&gt;old&lt;/em&gt; system can continue to operate correctly throughout the migration, especially writing data, even as you transition to a new schema or database. This allows you to revert to the old application version without data loss if something breaks.&lt;/p&gt;

&lt;p&gt;This typically involves a phased approach often called "dual write" or "shadow write." The starting point is a single application version that reads from and writes to the old schema:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;           +--------------------+
           |                    |
           |   Application v1   |
           |  (Reads/Writes Old)|
           |                    |
           +----------+---------+
                      |
                      | Reads/Writes (Old Schema)
                      v
            +-------------------+
            |                   |
            |    Old Database   |
            |    (Old Schema)   |
            |                   |
            +-------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Phase 1: Dual Write Introduction (No Read Change)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Your new application version (v2) is deployed alongside v1. Critically, v2 &lt;em&gt;writes to both the old schema and the new schema&lt;/em&gt;. Both v1 and v2 continue to read from the old schema. This ensures the old path is always kept up-to-date and valid.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;           +--------------------+      +--------------------+
           |    Application v1  |      |    Application v2  |
           | (Reads/Writes Old) |      | (Writes Old &amp;amp; New) |
           |                    |      | (Reads Old)        |
           +----------+---------+      +--+------------+----+
                      |                   |            |
                      | Reads/Writes      | Reads/     | Writes
                      | (Old Schema)      | Writes     | (New Schema)
                      v                   | (Old)      v
            +-------------------+         |  +-------------------+
            |                   |         |  |                   |
            |    Old Database   |&amp;lt;--------+  |    New Database   |
            |    (Old Schema)   |            |    (New Schema)   |
            |                   |            |                   |
            +-------------------+            +-------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Phase 2: Backfill Historical Data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While dual writes ensure new data is captured in both places, existing historical data only lives in the old schema. An asynchronous job is run to backfill and transform this data from the old schema into the new schema. This must be idempotent and carefully handle concurrent writes from Phase 1.&lt;/p&gt;
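&lt;p&gt;A backfill with these properties might look like the sketch below. It assumes hypothetical tables &lt;code&gt;old_users(id, name)&lt;/code&gt; and &lt;code&gt;new_users(id, display_name)&lt;/code&gt; and uses SQLite from the Python standard library purely for illustration; the same idea applies to other databases through their own upsert syntax:&lt;/p&gt;

```python
import sqlite3


def backfill(conn: sqlite3.Connection, batch_size: int = 2) -> int:
    """Copy rows from old_users to new_users in id order; safe to re-run."""
    last_id, copied = 0, 0
    while True:
        # Keyset pagination: resume after the last id seen, one batch at a time.
        rows = conn.execute(
            "SELECT id, name FROM old_users WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return copied
        for row_id, name in rows:
            # INSERT OR IGNORE skips rows that a dual write already created,
            # so the backfill never overwrites fresher data.
            cur = conn.execute(
                "INSERT OR IGNORE INTO new_users (id, display_name) VALUES (?, ?)",
                (row_id, name),
            )
            copied += cur.rowcount
        conn.commit()
        last_id = rows[-1][0]
```

&lt;p&gt;Running it a second time copies nothing, which is what makes it safe to restart after a crash. Rows updated in the old schema after being copied still need a dual write or a later comparison pass to stay in sync.&lt;/p&gt;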

&lt;p&gt;&lt;strong&gt;Phase 3: Read Switchover (Still Dual Writing)&lt;/strong&gt;&lt;/p&gt;

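&lt;p&gt;Once the backfill is complete and verified, you update Application v2 to read primarily from the new schema. Application v1 continues to read and write to the old schema. Dual writes from v2 continue, so the old schema stays current and remains a valid rollback target. Writes that still arrive through v1 reach only the old schema, so keep the backfill or a sync job running until v1 is fully drained.&lt;/p&gt;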

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;           +--------------------+      +--------------------+
           |    Application v1  |      |    Application v2  |
           | (Reads/Writes Old) |      | (Writes Old &amp;amp; New) |
           |                    |      | (Reads New)        |
           +----------+---------+      +--+------------+----+
                      |                   |            |
                      | Reads/Writes      | Writes     | Reads/Writes
                      | (Old Schema)      | (Old)      | (New Schema)
                      v                   |            v
            +-------------------+         |  +-------------------+
            |                   |         |  |                   |
            |    Old Database   |&amp;lt;--------+  |    New Database   |
            |    (Old Schema)   |            |    (New Schema)   |
            |                   |            |                   |
            +-------------------+            +-------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Rollback Point:&lt;/strong&gt; If an issue arises at any point during Phases 1-3, you can instantly roll back from &lt;code&gt;Application v2&lt;/code&gt; to &lt;code&gt;Application v1&lt;/code&gt;. Since &lt;code&gt;Application v1&lt;/code&gt; was always writing to the old schema, and &lt;code&gt;Application v2&lt;/code&gt; was also writing to it, the critical data for your production system remains intact and consistent in the old schema. The new schema might contain inconsistent or orphaned data, but your core business operations are unaffected.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 4: Cutover and Cleanup&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once confidence is high (e.g., after weeks of monitoring with no issues), you can remove the dual writes from v2 and eventually deprecate/drop the old schema or database.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-world Application: Stripe's Data Migrations
&lt;/h2&gt;

&lt;p&gt;Stripe, processing billions of API calls daily, cannot afford data loss or significant downtime. Their approach to critical data migrations (e.g., changing how &lt;code&gt;PaymentIntent&lt;/code&gt; objects are stored, or migrating customer data between sharded databases) heavily relies on phased strategies for zero-downtime, zero-data-loss transitions.&lt;/p&gt;

&lt;p&gt;When migrating to new data models or infrastructure, Stripe often employs a variation of the dual-write pattern, sometimes extended with a "shadow-read" phase. For instance, if migrating a service to a new database or schema, they might:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Replicate data:&lt;/strong&gt; Stream existing data from the old system to the new, ensuring eventual consistency.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Dual-write:&lt;/strong&gt; All new writes go to &lt;em&gt;both&lt;/em&gt; the old and new systems. This is critical for rollback: the old system always has the latest state.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Shadow-read/Verify:&lt;/strong&gt; New application code starts reading from the new system but &lt;em&gt;compares the result with the old system&lt;/em&gt;. If there's a discrepancy, it logs an error but serves the response from the old system. This acts as a "dark launch" validation, catching data inconsistencies before they impact users.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Phased Read Cutover:&lt;/strong&gt; Once shadow-reads are validated (e.g., 99.999% consistency over days), reads are progressively switched to the new system, starting with a small percentage of traffic (canary deployment) and gradually increasing.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Remove Dual-write:&lt;/strong&gt; Once all traffic is routed to the new system and it's stable, the dual-write logic is removed.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Decommission:&lt;/strong&gt; The old system is eventually decommissioned.&lt;/li&gt;
&lt;/ol&gt;
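&lt;p&gt;The shadow-read step (item 3) can be sketched generically. This is not Stripe's code; &lt;code&gt;read_old&lt;/code&gt; and &lt;code&gt;read_new&lt;/code&gt; are hypothetical callables standing in for whatever data-access layer you have:&lt;/p&gt;

```python
import logging
from typing import Callable, Optional

log = logging.getLogger("shadow_read")


def shadow_read(key: str,
                read_old: Callable[[str], Optional[dict]],
                read_new: Callable[[str], Optional[dict]]) -> Optional[dict]:
    """Serve from the old store; compare against the new store on the side."""
    old_value = read_old(key)
    try:
        new_value = read_new(key)
        if new_value != old_value:
            log.error("shadow-read mismatch for %s: old=%r new=%r",
                      key, old_value, new_value)
    except Exception:
        # A failing new store must never affect the user-facing response.
        log.exception("shadow read failed for %s", key)
    return old_value
```

&lt;p&gt;The caller always gets the old system's answer, so a bug in the new path shows up only in logs and metrics, never in a user-visible response.&lt;/p&gt;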

&lt;p&gt;This process can take weeks or even months for critical systems, providing an extremely long window for verification and instant rollback at any stage before the old system is retired. The overhead of writing twice (or reading twice) is a recognized trade-off for business continuity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Mistakes Engineers Make
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Forgetting Data Integrity Constraints:&lt;/strong&gt; Focusing only on changing column types but neglecting the &lt;code&gt;NOT NULL&lt;/code&gt; constraints or unique indexes. If you add &lt;code&gt;NOT NULL&lt;/code&gt; to a column that has existing &lt;code&gt;NULL&lt;/code&gt; values, your migration will fail unless you've backfilled defaults &lt;em&gt;before&lt;/em&gt; applying the constraint. This seems basic, but it's a frequent cause of production failures.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Prematurely Dropping Old Data or Indices:&lt;/strong&gt; Convinced the migration is "done" after a few hours, engineers drop old columns, tables, or indices. If a hidden bug emerges days later, a rollback becomes a partial data restoration from backup (data loss) or a manual, complex data reconstruction task. Keep old structures around for &lt;em&gt;weeks&lt;/em&gt; or &lt;em&gt;months&lt;/em&gt; if possible, even if unused, until full confidence is achieved.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Inadequate Monitoring on the Old Path:&lt;/strong&gt; During dual-write, the focus often shifts entirely to the new path. If the old path's writes (which are critical for rollback) start failing due to unexpected application interactions or database load, and you don't monitor it, your safety net is silently compromised. Monitor both paths comprehensively, especially write success rates and latencies.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Interview Angle
&lt;/h2&gt;

&lt;p&gt;Interviewers love to probe into data migration because it exposes your understanding of trade-offs and production resilience.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Question:&lt;/strong&gt; "You need to add a new &lt;code&gt;status&lt;/code&gt; column (enum type) to a critical &lt;code&gt;orders&lt;/code&gt; table that processes thousands of transactions per second. Describe a zero-downtime, zero-data-loss migration strategy and how you'd handle a rollback."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strong Answer Breakdown:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Phase 1: Safe Schema Evolution.&lt;/strong&gt; Start by adding the new &lt;code&gt;status&lt;/code&gt; column as &lt;code&gt;NULLABLE&lt;/code&gt; and with no default. This ensures existing rows remain valid. Deploy this schema change &lt;em&gt;without&lt;/em&gt; application code changes.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Phase 2: Dual Write with Backfill.&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;  Deploy a new version of your application (v2) that, when writing or updating an order, writes &lt;em&gt;both&lt;/em&gt; the old status representation and the new &lt;code&gt;status&lt;/code&gt; column. For existing orders, backfill the &lt;code&gt;status&lt;/code&gt; column based on existing logic or a reasonable default value using an asynchronous, idempotent job.&lt;/li&gt;
&lt;li&gt;  Application v1 continues to operate as normal, reading/writing only the old columns.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Rollback Safety:&lt;/strong&gt; At this stage, if v2 has issues, you can roll back to v1. All critical data (including the old status representation) is preserved in the original format. The new &lt;code&gt;status&lt;/code&gt; column might become stale or inconsistent, but it doesn't impact v1.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;  &lt;strong&gt;Phase 3: Phased Read Switchover.&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;  Once backfill is complete and the dual-write period has passed without issues, deploy an updated v2 that reads the &lt;code&gt;status&lt;/code&gt; from the &lt;em&gt;new&lt;/em&gt; column first. If it's &lt;code&gt;NULL&lt;/code&gt; (indicating an un-migrated row or an old version), fall back to inferring status from the old logic. Continue dual-writing.&lt;/li&gt;
&lt;li&gt;  Use feature flags to gradually roll out this read change to a small percentage of users, carefully monitoring for errors and data discrepancies.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;  &lt;strong&gt;Phase 4: Enforce Constraint and Cleanup.&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;  Once confident, and after verifying that no rows still hold &lt;code&gt;NULL&lt;/code&gt;, add a &lt;code&gt;NOT NULL&lt;/code&gt; constraint to the &lt;code&gt;status&lt;/code&gt; column.&lt;/li&gt;
&lt;li&gt;  Finally, remove the old status logic and column, typically after a significant soak period (weeks).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;  &lt;strong&gt;Key Mitigations and Trade-offs:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Data Inconsistency:&lt;/strong&gt; Validate data written to the new column against the old representation with a periodic comparison job, and repair any mismatches it finds.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Performance Overhead:&lt;/strong&gt; Dual writes add latency and database load. Monitor this closely.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Complexity:&lt;/strong&gt; More application code paths, more deployment steps. Mitigate with automated testing and clear operational runbooks.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Rollback:&lt;/strong&gt; Emphasize that the existence of the old, valid data and the ability for the old application version to function means you can always revert to a known good state without data loss.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
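&lt;p&gt;The Phase 3 read path, "new column first, fall back to old logic," fits in a few lines. The legacy fields &lt;code&gt;paid_at&lt;/code&gt; and &lt;code&gt;shipped_at&lt;/code&gt; below are invented for illustration; substitute whatever your old status inference actually looks at:&lt;/p&gt;

```python
from typing import Optional


def infer_legacy_status(order: dict) -> str:
    """Hypothetical old logic: derive status from legacy timestamp fields."""
    if order.get("shipped_at") is not None:
        return "shipped"
    if order.get("paid_at") is not None:
        return "paid"
    return "pending"


def read_status(order: dict) -> str:
    """Prefer the new column; fall back to legacy inference when it is NULL."""
    status: Optional[str] = order.get("status")
    return status if status is not None else infer_legacy_status(order)
```

&lt;p&gt;Keeping the fallback in one function makes it easy to put behind a feature flag and to delete in Phase 4.&lt;/p&gt;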





&lt;h2&gt;
  
  
  Want to Go Deeper?
&lt;/h2&gt;

&lt;p&gt;I do 1:1 sessions on system design, backend architecture, and interview prep.&lt;br&gt;
If you're preparing for a Staff/Senior role or cracking FAANG rounds — &lt;a href="https://topmate.io/rishabh_pahwa" rel="noopener noreferrer"&gt;book a session here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>backendengineering</category>
      <category>systemdesign</category>
      <category>datamigration</category>
      <category>rollbackstrategy</category>
    </item>
    <item>
      <title>The Production Problem with Async Dual Writes</title>
      <dc:creator>rishabh pahwa</dc:creator>
      <pubDate>Wed, 13 May 2026 15:00:19 +0000</pubDate>
      <link>https://dev.to/rishabh_pahwa_1a2b93e60b0/the-production-problem-with-async-dual-writes-ao4</link>
      <guid>https://dev.to/rishabh_pahwa_1a2b93e60b0/the-production-problem-with-async-dual-writes-ao4</guid>
      <description>&lt;p&gt;Many "zero-downtime" data migration strategies involving dual writes promise seamless transitions, but often hide insidious data consistency traps. Without careful handling, you're not just moving data; you're silently corrupting or losing it, only to discover the issue months after cutover.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Production Problem with Async Dual Writes
&lt;/h2&gt;

&lt;p&gt;Imagine you're an engineer at a rapidly growing SaaS company. Your &lt;code&gt;users&lt;/code&gt; table needs to be sharded or migrated to a new database technology. To avoid downtime, you implement a dual-write strategy: all new writes go to both the old and new &lt;code&gt;users&lt;/code&gt; tables. Reads initially come from the old table, then eventually switch to the new one. This sounds solid.&lt;/p&gt;

&lt;p&gt;Now, picture this: A user updates their profile. Your application sends two write requests: one to &lt;code&gt;OldDB.users&lt;/code&gt; and one to &lt;code&gt;NewDB.users&lt;/code&gt;. The write to &lt;code&gt;OldDB&lt;/code&gt; succeeds. But the write to &lt;code&gt;NewDB&lt;/code&gt; fails due to a network timeout, a transient database hiccup, or a schema validation error specific to the new system. What does your application do? If it immediately returns success because the &lt;code&gt;OldDB&lt;/code&gt; write worked, you now have an inconsistency: the user's profile is updated in the old system but stale in the new. Over days or weeks, these small, non-atomic failures accumulate, leading to widespread data divergence. When you finally cut over to reading solely from &lt;code&gt;NewDB&lt;/code&gt;, users start seeing outdated profiles, missing orders, or incorrect balances. Your "zero-downtime" migration just became a "zero-consistency" disaster.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Expand-Contract Pattern and Dual Writes
&lt;/h2&gt;

&lt;p&gt;The Expand-Contract pattern is a common strategy for zero-downtime schema migrations. It involves four phases:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Expand&lt;/strong&gt;: Add the new schema alongside the old, then modify your application to keep reading from the old schema while writing to both the old and new schemas.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Migrate Data&lt;/strong&gt;: Backfill historical data from the old schema to the new.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Validate&lt;/strong&gt;: Continuously compare data between old and new.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Contract&lt;/strong&gt;: Switch reads to the new schema, then remove the old schema and dual-write logic.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here's how the dual-write phase typically works, and where consistency issues arise:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                  +-----------------------------------+
                  |            Application            |
                  |  (v1.1 - Dual-Write/Read Old)     |
                  +-----------------------------------+
                       |        ^         ^
                       | Write  | Read    | Write
                       v        |         |
      +---------------------+   |         |   +---------------------+
      | Old Database (v1.0) |&amp;lt;--+---------+--&amp;gt;| New Database (v1.1) |
      | (e.g., MySQL)       |                 | (e.g., PostgreSQL)  |
      +---------------------+                 +---------------------+
                                  ^
                                  | Backfill / Sync Job
                                  | (e.g., Debezium, custom scripts)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Reads&lt;/strong&gt;: Go to the &lt;code&gt;Old Database&lt;/code&gt; (or read from both and merge, with old as authoritative).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Writes&lt;/strong&gt;: Go to &lt;em&gt;both&lt;/em&gt; &lt;code&gt;Old Database&lt;/code&gt; and &lt;code&gt;New Database&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Backfill&lt;/strong&gt;: A separate job continuously copies existing data from &lt;code&gt;Old&lt;/code&gt; to &lt;code&gt;New&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The fundamental challenge is that writing to two separate databases (or even two different tables in the same database) is not an atomic operation. Without a distributed transaction across both write operations, there's always a window where one succeeds and the other fails, leading to divergence.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Stripe Maintains Sanity at Scale
&lt;/h2&gt;

&lt;p&gt;Stripe, processing billions in transactions, performs hundreds of schema changes monthly. Their approach to zero-downtime data migration heavily relies on dual writes but is backed by extensive reconciliation. When migrating critical financial data, they recognize that non-atomic dual writes are a reality.&lt;/p&gt;

&lt;p&gt;Instead of assuming perfect consistency, Stripe engineers build systems that detect and fix discrepancies. Their strategy often includes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Shadow Writes&lt;/strong&gt;: Before dual-writing, they might "shadow write" to the new schema. The new system receives a copy of write traffic, but these writes aren't considered authoritative and are often discarded. This allows testing the performance and correctness of the new schema under production load &lt;em&gt;without&lt;/em&gt; impacting the old system or risking data integrity.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Idempotency and Retries&lt;/strong&gt;: Application logic ensures that write operations are idempotent, meaning they can be safely retried. When a dual write occurs, if one database write fails, the application logs the failure and often retries later or enqueues it for asynchronous processing.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Continuous Reconciliation&lt;/strong&gt;: This is the most crucial part. After dual writes are enabled, Stripe runs continuous, automated reconciliation jobs. These jobs scan both the old and new databases, compare records based on a unique identifier, and identify discrepancies. If a difference is found (e.g., a record exists in &lt;code&gt;OldDB&lt;/code&gt; but not &lt;code&gt;NewDB&lt;/code&gt;, or attributes differ), the reconciliation job logs it, potentially attempts to fix it (e.g., by re-applying the change to &lt;code&gt;NewDB&lt;/code&gt;), or flags it for manual review. For example, a reconciliation job might compare 100 million &lt;code&gt;customer&lt;/code&gt; records daily, flagging any divergence beyond a 0.0001% threshold. This background process ensures eventual consistency and acts as a safety net against non-atomic dual-write failures.&lt;/li&gt;
&lt;/ol&gt;
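&lt;p&gt;The core of a reconciliation pass is a keyed comparison. The sketch below is a generic illustration, not Stripe's implementation; it assumes both sides have already been loaded into &lt;code&gt;{id: record}&lt;/code&gt; dictionaries, whereas a real job would stream and compare in chunks:&lt;/p&gt;

```python
def reconcile(old_rows: dict, new_rows: dict) -> dict:
    """Compare two {id: record} snapshots and classify the discrepancies."""
    missing = sorted(k for k in old_rows if k not in new_rows)
    extra = sorted(k for k in new_rows if k not in old_rows)
    mismatched = sorted(k for k in old_rows
                        if k in new_rows and old_rows[k] != new_rows[k])
    total = max(len(old_rows), 1)  # avoid dividing by zero on an empty table
    return {
        "missing_in_new": missing,
        "extra_in_new": extra,
        "mismatched": mismatched,
        "divergence_ratio": (len(missing) + len(extra) + len(mismatched)) / total,
    }
```

&lt;p&gt;The &lt;code&gt;divergence_ratio&lt;/code&gt; is the number you would compare against an alert threshold; the three lists tell a repair job which records to re-apply from the old system.&lt;/p&gt;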

&lt;p&gt;This rigorous validation and reconciliation process is what turns a risky dual-write strategy into a production-grade, zero-downtime migration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Mistakes When Implementing Dual Writes
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Assuming Atomicity Across Databases&lt;/strong&gt;: Many engineers treat a dual-write operation (e.g., &lt;code&gt;db1.save()&lt;/code&gt; and &lt;code&gt;db2.save()&lt;/code&gt;) as a single atomic unit. It's not. If your application code just calls two database clients, success from one and failure from the other leads to data divergence. You need explicit error handling, retries, and compensation logic, or rely on eventual consistency with strong reconciliation.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Inadequate Read Strategy During Transition&lt;/strong&gt;: During the dual-write phase, how do you read?

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Read-Old&lt;/strong&gt;: Reading only from the old system is safer for consistency &lt;em&gt;during&lt;/em&gt; the transition, but means data written to the new system isn't immediately visible, and requires a hard cutover for reads.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Read-New-Fallback-Old&lt;/strong&gt;: Reading from the new, falling back to old if not found, can lead to inconsistencies if the new system is incomplete or subtly different.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Read-Both-Merge&lt;/strong&gt;: Reading from both and merging requires complex conflict resolution and can be slow. Whichever option you pick, the usual mistake is failing to define the source of truth for reads at each stage.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Neglecting Reconciliation and Observability&lt;/strong&gt;: Simply setting up dual writes and a backfill job isn't enough. Without robust monitoring of dual-write success rates and per-write latency, and, critically, continuous data validation (reconciliation) between the old and new systems, you're flying blind. Any divergence stays silent until it surfaces as data loss. Many engineers skip this crucial, complex step, leading to post-cutover data integrity nightmares.&lt;/li&gt;
&lt;/ol&gt;
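
&lt;p&gt;To make the first mistake concrete, here is a minimal sketch of a dual write that does not pretend to be atomic. Everything in it is an assumption for illustration: &lt;code&gt;FlakyStore&lt;/code&gt; is an in-memory stand-in for a database client, the old store is treated as authoritative, and failed writes to the new store go into a simple repair queue.&lt;/p&gt;

```python
class FlakyStore:
    """In-memory stand-in for a database client (hypothetical)."""

    def __init__(self, fail_ids=()):
        self.rows = {}
        self.fail_ids = set(fail_ids)

    def upsert(self, record_id, record):
        if record_id in self.fail_ids:
            raise ConnectionError(f"write failed for {record_id}")
        # Upsert keyed by id, so replaying the same write is harmless (idempotent).
        self.rows[record_id] = record


def dual_write(old_store, new_store, record_id, record, repair_queue):
    """Write to the authoritative store first, then best-effort to the new one."""
    old_store.upsert(record_id, record)  # a failure here propagates to the caller
    try:
        new_store.upsert(record_id, record)
        return True
    except ConnectionError:
        # Do not fail the request; remember the gap so it can be replayed later.
        repair_queue.append(record_id)
        return False


def replay(old_store, new_store, repair_queue):
    """Re-apply queued writes from the authoritative store; return ids still failing."""
    still_failing = []
    for record_id in repair_queue:
        try:
            new_store.upsert(record_id, old_store.rows[record_id])
        except ConnectionError:
            still_failing.append(record_id)
    return still_failing


old_store, new_store, repair_queue = FlakyStore(), FlakyStore(fail_ids={"txn_2"}), []
first_ok = dual_write(old_store, new_store, "txn_1", {"amount": 100}, repair_queue)
second_ok = dual_write(old_store, new_store, "txn_2", {"amount": 250}, repair_queue)
pending = list(repair_queue)        # ids that diverged
new_store.fail_ids.clear()          # simulate the new database recovering
repair_queue = replay(old_store, new_store, repair_queue)
```

&lt;p&gt;The repair queue only covers failures the application saw. A crash between the two writes leaves no trace in it, which is why reconciliation is still required.&lt;/p&gt;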

&lt;h2&gt;
  
  
  Interview Angle: What Interviewers Ask
&lt;/h2&gt;

&lt;p&gt;Interviewers will probe your understanding beyond the basic concept. Expect questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;"How do you ensure data consistency during a dual-write phase if one database write succeeds and the other fails?"&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Strong Answer&lt;/strong&gt;: "Since distributed transactions are rarely feasible or desirable, I wouldn't assume atomicity. Instead, I'd make every write idempotent and route changes through a durable queue. The application would first publish the data change to a reliable queue (e.g., Kafka). A consumer would then attempt to write to both databases. If one write fails, the message is retried with backoff, which is safe because the writes are idempotent. If failures persist, the message lands in a dead-letter queue for manual intervention and triggers an alert. Even with retries, you still need a continuous, asynchronous reconciliation job that scans both databases for discrepancies and fixes them, ensuring eventual consistency. This shifts the complexity from transactional guarantees to robust error handling and eventual repair."&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;"When would you use a 'shadow write' versus a 'dual write'?"&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Strong Answer&lt;/strong&gt;: "Shadow writes are primarily for &lt;em&gt;testing&lt;/em&gt; the new system with production-like load and data, without letting it impact the live system. You write to both the old authoritative system and the new system, but the new system's writes are often ignored or merely logged for validation. This is low-risk. Dual writes, however, mean both systems are authoritative &lt;em&gt;for writes&lt;/em&gt; during a transitional period, with the intent to eventually cut over reads to the new system. It's a higher-risk strategy because data consistency is paramount. I'd use shadow writes for initial performance testing or schema validation of the new system, and dual writes when I'm confident in the new system's write path and am preparing for a full cutover, backed by strong reconciliation."&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
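
&lt;p&gt;The retry-then-dead-letter behaviour from the first answer can be sketched as follows. This is an illustration under stated assumptions, not a Kafka client: messages are a plain list, &lt;code&gt;write_both&lt;/code&gt; is a hypothetical idempotent function that writes to both databases, and the sleep function is injectable so the example runs instantly.&lt;/p&gt;

```python
import time


def consume(messages, write_both, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Apply each message with retries; return the messages that were dead-lettered."""
    dead_letters = []
    for message in messages:
        for attempt in range(1, max_attempts + 1):
            try:
                write_both(message)  # must be idempotent, since it may run more than once
                break
            except Exception:
                if attempt == max_attempts:
                    dead_letters.append(message)  # give up and park it for manual review
                else:
                    sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
    return dead_letters


attempts = {}
applied = []


def write_both(message):
    """Hypothetical writer: 'b' fails twice then succeeds, 'c' always fails."""
    attempts[message["id"]] = attempts.get(message["id"], 0) + 1
    if message["id"] == "b" and attempts["b"] in (1, 2):
        raise RuntimeError("transient failure")
    if message["id"] == "c":
        raise RuntimeError("permanent failure")
    applied.append(message["id"])


delays = []
dead = consume([{"id": "a"}, {"id": "b"}, {"id": "c"}], write_both, sleep=delays.append)
```

&lt;p&gt;Message &lt;code&gt;b&lt;/code&gt; recovers after two retries, while &lt;code&gt;c&lt;/code&gt; exhausts its attempts and ends up in the dead-letter list, where an alert or an operator picks it up.&lt;/p&gt;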

&lt;p&gt;Moving critical data without disruption is hard. Do it right, and your systems evolve gracefully. Cut corners, and you'll spend weeks on data recovery.&lt;/p&gt;








&lt;h2&gt;
  
  
  Want to Go Deeper?
&lt;/h2&gt;

&lt;p&gt;I do 1:1 sessions on system design, backend architecture, and interview prep.&lt;br&gt;
If you're preparing for a Staff/Senior role or cracking FAANG rounds — &lt;a href="https://topmate.io/rishabh_pahwa" rel="noopener noreferrer"&gt;book a session here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>databasemigration</category>
      <category>distributedsystems</category>
      <category>dataconsistency</category>
    </item>
    <item>
      <title>Your "Cache Invalidation is Hard" Answer Misses the Real Horror</title>
      <dc:creator>rishabh pahwa</dc:creator>
      <pubDate>Sun, 10 May 2026 08:42:41 +0000</pubDate>
      <link>https://dev.to/rishabh_pahwa_1a2b93e60b0/your-cache-invalidation-is-hard-answer-misses-the-real-horror-5em7</link>
      <guid>https://dev.to/rishabh_pahwa_1a2b93e60b0/your-cache-invalidation-is-hard-answer-misses-the-real-horror-5em7</guid>
      <description>&lt;h2&gt;
  
  
  Your "Cache Invalidation is Hard" Answer Misses the Real Horror
&lt;/h2&gt;

&lt;p&gt;Most engineers parrot "cache invalidation is hard" as a standard interview response, but few understand &lt;em&gt;why&lt;/em&gt; it's hard or the real-world horrors it introduces. It's not just about stale data; it's about financial losses, broken business logic, and cascading failures when eventual consistency hits critical paths.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Production Nightmare: Financial Impact of Stale Data
&lt;/h2&gt;

&lt;p&gt;Imagine a ride-sharing platform like Uber. A user updates their payment method because the old card expired. The update is written to the database successfully. However, because of an overly long cache TTL or a failed invalidation, the dispatch service still sees the &lt;em&gt;old&lt;/em&gt;, expired card for the next 5 minutes. The user tries to book a ride and it fails. They try again and it fails again. Frustrated, they switch to a competitor.&lt;/p&gt;

&lt;p&gt;This isn't just "stale data"; it's a direct loss of revenue, a degraded user experience, and a hit to brand loyalty. In banking, showing an incorrect account balance, even for seconds, can trigger compliance violations and massive reputational damage. In e-commerce, a product showing "in stock" when it's sold out leads to cancelled orders and angry customers. The problem isn't theoretical; it's financial and operational.&lt;/p&gt;

&lt;h2&gt;
  
  
  Beyond TTLs: Active Invalidation in Distributed Systems
&lt;/h2&gt;

&lt;p&gt;The naive approach to cache invalidation often relies on Time-To-Live (TTL) or a simple write-through/write-around policy. While these have their place, critical systems demand more robust strategies that aim for &lt;em&gt;stronger consistency&lt;/em&gt; than basic eventual consistency can provide, especially when data is updated from multiple sources.&lt;/p&gt;

&lt;p&gt;Consider an active invalidation strategy:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+------------+         +------------+         +------------+         +------------+
|    User    |         |  Frontend  |         |  Backend   |         |  Database  |
|(API Client)|         |  Service   |         |  Service   |         | (Postgres) |
+------------+         +------------+         +------------+         +------------+
       |                      |                      |                      |
       | 1. Update Profile    |                      |                      |
       +---------------------&amp;gt;|                      |                      |
       |                      | 2. Call Update API   |                      |
       |                      +---------------------&amp;gt;|                      |
       |                      |                      | 3. Update DB         |
       |                      |                      +---------------------&amp;gt;|
       |                      |                      | (DB transaction ACK) |
       |                      |                      |&amp;lt;---------------------+
                                                     |
                                                     | 4. Publish Invalidation Event to Message Bus (e.g., Kafka)
                                                     v
+------------+         +------------+         +------------+
|   Cache    |         | Invalidator|         |  Message   |
|  (Redis)   |         |  Service   |         |    Bus     |
+------------+         +------------+         +------------+
       |                      |                      |
       |                      | 5. Consume Invalidation Event
       |                      |&amp;lt;---------------------+
       | 6. Invalidate Key    |                      |
       |&amp;lt;---------------------+                      |
       | (Cache ACK)          |                      |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this flow, after the database is updated (step 3), an invalidation event is &lt;em&gt;published&lt;/em&gt; to a message bus (step 4). An &lt;code&gt;Invalidator Service&lt;/code&gt; &lt;em&gt;consumes&lt;/em&gt; this event (step 5) and then explicitly &lt;em&gt;deletes&lt;/em&gt; or &lt;em&gt;updates&lt;/em&gt; the corresponding key in the cache (step 6). This decouples the write path from cache invalidation, improving write latency, but introduces eventual consistency. The critical aspect is making this event propagation and consumption &lt;em&gt;reliable&lt;/em&gt; and &lt;em&gt;fast&lt;/em&gt;.&lt;/p&gt;
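
&lt;p&gt;Steps 3 to 6 can be sketched in memory to show where the stale window comes from. All names here are hypothetical: a &lt;code&gt;deque&lt;/code&gt; stands in for the message bus, plain dictionaries stand in for Postgres and Redis, and &lt;code&gt;ProfileService&lt;/code&gt; and &lt;code&gt;Invalidator&lt;/code&gt; are illustrative classes rather than real components.&lt;/p&gt;

```python
from collections import deque


class ProfileService:
    """Backend service: writes to the database, then publishes an invalidation event."""

    def __init__(self, database, bus):
        self.database = database
        self.bus = bus

    def update_profile(self, user_id, profile):
        self.database[user_id] = profile                                    # step 3
        self.bus.append({"type": "invalidate", "key": f"user:{user_id}"})   # step 4


class Invalidator:
    """Consumes invalidation events and deletes the matching cache keys."""

    def __init__(self, cache, bus):
        self.cache = cache
        self.bus = bus

    def drain(self):
        handled = 0
        while self.bus:
            event = self.bus.popleft()           # step 5
            self.cache.pop(event["key"], None)   # step 6; deleting a missing key is a no-op
            handled += 1
        return handled


def read_profile(user_id, cache, database):
    """Cache-aside read: serve from the cache, otherwise load from the database."""
    key = f"user:{user_id}"
    if key not in cache:
        cache[key] = database[user_id]
    return cache[key]


database = {42: {"card": "expired-1111"}}
cache, bus = {}, deque()
service = ProfileService(database, bus)
invalidator = Invalidator(cache, bus)

read_profile(42, cache, database)                    # warms the cache
service.update_profile(42, {"card": "valid-2222"})
stale = read_profile(42, cache, database)            # event not consumed yet: old card
handled = invalidator.drain()
fresh = read_profile(42, cache, database)            # reloaded from the database
```

&lt;p&gt;The read between the publish and the drain returns the expired card. That gap is the consistency window, and its size is set by how quickly the event is delivered and consumed.&lt;/p&gt;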

&lt;h2&gt;
  
  
  Meta's Approach to Consistent Caching at Scale
&lt;/h2&gt;

&lt;p&gt;At companies like Meta (Facebook), operating some of the world's largest caches, simple TTLs aren't enough. They can't afford to show stale profile data, friend lists, or post engagement for minutes. Their "Cache Made Consistent" initiatives aim to solve the very race conditions and inconsistencies that plague distributed caching.&lt;/p&gt;

&lt;p&gt;They've moved beyond basic invalidation to sophisticated systems that ensure stronger consistency guarantees. One approach involves using transaction logs (like binlogs in MySQL) from the database to drive invalidation. A service tails these logs, filters relevant updates, and publishes specific invalidation messages to a distributed system. Cache nodes then subscribe to these messages. This pushes the consistency window from minutes (TTL) down to milliseconds, closely following database writes.&lt;/p&gt;

&lt;p&gt;This system is built for extreme scale: potentially hundreds of thousands of updates per second across petabytes of data. It's not just about sending an &lt;code&gt;invalidate(key)&lt;/code&gt; command; it's about guaranteeing delivery, handling partial failures (what if a cache node is down?), and ensuring that &lt;em&gt;all&lt;/em&gt; relevant dependent caches (e.g., user profile, friend count, feed items) are consistently updated or invalidated.&lt;/p&gt;
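
&lt;p&gt;The log-tailing idea can be illustrated with a small sketch. It does not reflect Meta's actual system or the MySQL binlog format: the log entries are plain dictionaries, and &lt;code&gt;KEY_TEMPLATES&lt;/code&gt; and &lt;code&gt;invalidations_from_log&lt;/code&gt; are hypothetical names chosen for the example.&lt;/p&gt;

```python
# Hypothetical mapping from table name to a cache key template.
KEY_TEMPLATES = {
    "users": "user:{pk}",
    "friendships": "friend_list:{pk}",
}


def invalidations_from_log(log_entries, last_position=0):
    """Turn ordered change-log entries into invalidation messages.

    Returns (messages, new_position) so the tailer can resume after a restart.
    """
    messages = []
    position = last_position
    for entry in log_entries:
        if last_position >= entry["position"]:
            continue  # already processed before the restart
        position = entry["position"]
        template = KEY_TEMPLATES.get(entry["table"])
        if template is None:
            continue  # table is not cached, so filter the entry out
        messages.append({"key": template.format(pk=entry["pk"]), "position": position})
    return messages, position


log = [
    {"position": 1, "table": "users", "pk": 123, "op": "UPDATE"},
    {"position": 2, "table": "audit_events", "pk": 9, "op": "INSERT"},
    {"position": 3, "table": "friendships", "pk": 123, "op": "DELETE"},
]
messages, position = invalidations_from_log(log)
resumed, resumed_position = invalidations_from_log(log, last_position=position)
```

&lt;p&gt;Driving invalidation from the log rather than from application code means a write cannot be committed without eventually producing an invalidation, as long as the tailer keeps its position and resumes from it.&lt;/p&gt;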

&lt;h2&gt;
  
  
  Common Mistakes Engineers Make
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Over-relying on TTL for critical data:&lt;/strong&gt; While great for performance, a 5-minute TTL on a user's payment method or an item's stock count is a ticking time bomb. It trades consistency for speed in places where consistency is paramount. For high-stakes data, TTLs should be very short (seconds) and coupled with active invalidation, or the cache should be bypassed entirely for reads requiring strong consistency.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Ignoring cache dependency graphs:&lt;/strong&gt; Invalidating a single key like &lt;code&gt;user:123&lt;/code&gt; is often insufficient. What about other cached entities that &lt;em&gt;depend&lt;/em&gt; on &lt;code&gt;user:123&lt;/code&gt;'s data, such as &lt;code&gt;user_profile_page:123&lt;/code&gt; or &lt;code&gt;feed_for_user:123&lt;/code&gt;? If you don't invalidate the entire dependency graph, you'll still show stale data. Building and maintaining this graph is complex and often overlooked until production issues arise.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Not building resilient invalidation pipelines:&lt;/strong&gt; Active invalidation introduces its own distributed system problems. What happens if the message bus is down? What if an invalidation message is lost? What if a cache node fails to receive an invalidation? Without retries, dead-letter queues, and eventual reconciliation mechanisms, your cache will drift indefinitely. This is where "cache invalidation is hard" actually holds true: the hard part is building a &lt;em&gt;reliable&lt;/em&gt; invalidation mechanism.&lt;/li&gt;
&lt;/ol&gt;
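
&lt;p&gt;The dependency problem from the second mistake can be sketched as a graph walk. The graph below is hypothetical and hard-coded; the key &lt;code&gt;feed_preview:123&lt;/code&gt; is invented for the example, and in practice the graph would be derived from how cached values are built.&lt;/p&gt;

```python
from collections import deque

# Hypothetical dependency graph: key -> keys derived from it.
DEPENDENTS = {
    "user:123": ["user_profile_page:123", "feed_for_user:123"],
    "feed_for_user:123": ["feed_preview:123"],
}


def keys_to_invalidate(root, dependents):
    """Breadth-first walk returning the root plus every transitive dependent."""
    seen = [root]
    queue = deque([root])
    while queue:
        key = queue.popleft()
        for child in dependents.get(key, []):
            if child not in seen:  # guards against cycles and duplicates
                seen.append(child)
                queue.append(child)
    return seen


def invalidate(root, cache, dependents):
    """Delete the root and its dependents from the cache; return the keys removed."""
    removed = []
    for key in keys_to_invalidate(root, dependents):
        if cache.pop(key, None) is not None:
            removed.append(key)
    return removed


cache = {
    "user:123": "profile data",
    "user_profile_page:123": "rendered page",
    "feed_preview:123": "preview",
    "user:999": "someone else",
}
removed = invalidate("user:123", cache, DEPENDENTS)
```

&lt;p&gt;Note that &lt;code&gt;feed_preview:123&lt;/code&gt; is removed even though its parent &lt;code&gt;feed_for_user:123&lt;/code&gt; was never cached. The walk follows the graph, not the current contents of the cache.&lt;/p&gt;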

&lt;h2&gt;
  
  
  The Interview Angle: Beyond the Buzzwords
&lt;/h2&gt;

&lt;p&gt;When an interviewer asks about cache invalidation, they're looking for more than "it's hard, use TTL." They want to understand your appreciation for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Consistency models and trade-offs:&lt;/strong&gt; When would you tolerate eventual consistency? When do you need strong consistency, and how would you achieve it with a cache? (e.g., using a write-through cache with a transactional database, or bypassing the cache for critical reads).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Failure modes:&lt;/strong&gt; What happens if invalidation fails? How do you detect it? How do you recover? Strong answers discuss monitoring cache hit ratios, consistency checks between cache and DB, and fallback mechanisms like circuit breakers.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Complexity at scale:&lt;/strong&gt; How do you invalidate data across hundreds or thousands of cache nodes? How do you handle fan-out invalidation for dependent data? Think about event-driven architectures, distributed transactions (though rare for caches), and sophisticated messaging patterns.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For instance, if asked, "How would you design a caching system for a bank account balance?", a strong answer would emphasize &lt;em&gt;strong consistency&lt;/em&gt;. You might propose a very short TTL (e.g., 1 second) coupled with immediate, transactional invalidation for updates, or even suggest &lt;em&gt;not caching&lt;/em&gt; the balance at all for reads that require absolute accuracy, fetching directly from the database to avoid any risk of stale data. The cost of an inconsistent balance outweighs the latency benefit of a cache.&lt;/p&gt;
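
&lt;p&gt;A minimal sketch of that answer, under stated assumptions: &lt;code&gt;BalanceReader&lt;/code&gt; is a hypothetical class, a dictionary stands in for the database, and the clock is injectable so the TTL can be exercised without waiting. Invalidating inside &lt;code&gt;apply_debit&lt;/code&gt; only approximates "transactional" invalidation, since nothing here spans a real transaction.&lt;/p&gt;

```python
import time


class BalanceReader:
    """Short-TTL cache with write-path invalidation and a strong-read bypass."""

    def __init__(self, database, ttl_seconds=1.0, clock=time.monotonic):
        self.database = database
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._cache = {}  # account_id -> (balance, cached_at)

    def get(self, account_id, require_strong=False):
        if require_strong:
            return self.database[account_id]  # bypass the cache entirely
        entry = self._cache.get(account_id)
        now = self.clock()
        if entry is not None and self.ttl_seconds > now - entry[1]:
            return entry[0]  # still within the TTL
        balance = self.database[account_id]
        self._cache[account_id] = (balance, now)
        return balance

    def apply_debit(self, account_id, amount):
        self.database[account_id] -= amount
        self._cache.pop(account_id, None)  # invalidate as part of the write path


now = [0.0]
db = {"acct_1": 500}
reader = BalanceReader(db, ttl_seconds=1.0, clock=lambda: now[0])

reader.get("acct_1")                                  # caches 500
db["acct_1"] = 300                                    # write from another service, no invalidation
cached = reader.get("acct_1")                         # stale, but within the 1-second TTL
strong = reader.get("acct_1", require_strong=True)    # read straight from the database
now[0] = 1.5
expired = reader.get("acct_1")                        # TTL elapsed, so the value is reloaded
reader.apply_debit("acct_1", 100)
after_debit = reader.get("acct_1")                    # invalidated by the write path
```

&lt;p&gt;The write from "another service" is the case to raise in an interview: a short TTL bounds how long the stale value survives, but only the strong read avoids it altogether.&lt;/p&gt;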





&lt;h2&gt;
  
  
  Want to Go Deeper?
&lt;/h2&gt;

&lt;p&gt;I do 1:1 sessions on system design, backend architecture, and interview prep.&lt;br&gt;
If you're preparing for a Staff/Senior role or cracking FAANG rounds — &lt;a href="https://topmate.io/rishabh_pahwa" rel="noopener noreferrer"&gt;book a session here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>caching</category>
      <category>distributedsystems</category>
      <category>backendengineering</category>
    </item>
  </channel>
</rss>
