<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kaustubh Alandkar</title>
    <description>The latest articles on DEV Community by Kaustubh Alandkar (@kaustubhalandkar).</description>
    <link>https://dev.to/kaustubhalandkar</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1705656%2F2d3f3348-fe64-42a8-97ea-9435762ff4ed.png</url>
      <title>DEV Community: Kaustubh Alandkar</title>
      <link>https://dev.to/kaustubhalandkar</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kaustubhalandkar"/>
    <language>en</language>
    <item>
      <title>How I Built an MQTT Ingest Core with FastAPI</title>
      <dc:creator>Kaustubh Alandkar</dc:creator>
      <pubDate>Mon, 13 Apr 2026 16:38:01 +0000</pubDate>
      <link>https://dev.to/kaustubhalandkar/how-i-built-an-mqtt-ingest-core-with-fastapi-50dg</link>
      <guid>https://dev.to/kaustubhalandkar/how-i-built-an-mqtt-ingest-core-with-fastapi-50dg</guid>
      <description>&lt;p&gt;Most telemetry pipelines look clean in architecture diagrams.&lt;/p&gt;

&lt;p&gt;Devices publish data.&lt;br&gt;&lt;br&gt;
A collector receives it.&lt;br&gt;&lt;br&gt;
A backend stores it.&lt;br&gt;&lt;br&gt;
Dashboards read from it.&lt;/p&gt;

&lt;p&gt;Simple.&lt;/p&gt;

&lt;p&gt;Until you have to decide what to do with imperfect data.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;records arrive incomplete&lt;/li&gt;
&lt;li&gt;payloads vary by topic&lt;/li&gt;
&lt;li&gt;timestamps are missing or inconsistent&lt;/li&gt;
&lt;li&gt;some data is usable, but not trustworthy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At that point, the problem is no longer ingestion.&lt;/p&gt;

&lt;p&gt;It’s &lt;strong&gt;data correctness at the system boundary&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That’s what I built this project to handle.&lt;/p&gt;


&lt;h2&gt;
  
  
  What This Project Actually Is
&lt;/h2&gt;

&lt;p&gt;This service acts as a &lt;strong&gt;data-quality boundary&lt;/strong&gt; between raw MQTT ingestion and downstream systems.&lt;/p&gt;

&lt;p&gt;It does not subscribe to MQTT directly.&lt;/p&gt;

&lt;p&gt;Instead, it receives batched telemetry records and is responsible for turning them into &lt;strong&gt;structured, queryable system state&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Typical topics look like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;sites&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;inverters&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;strings&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;weather&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;grid&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For each record, it decides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is it structurally valid?&lt;/li&gt;
&lt;li&gt;What fields are missing?&lt;/li&gt;
&lt;li&gt;How should it be normalized?&lt;/li&gt;
&lt;li&gt;Should it be accepted or rejected?&lt;/li&gt;
&lt;li&gt;What aggregate state should be updated?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal is not just to store data.&lt;/p&gt;

&lt;p&gt;The goal is to ensure that &lt;strong&gt;everything stored is explicitly understood&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;At a high level, the flow looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MQTT Collector / Buffer
   ↓
FastAPI Ingest Core
   ↓
Validation
   ↓
Normalization
   ↓
MongoDB
   ├── normalized_records
   ├── invalid_records
   └── topic_aggregates
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This separation is intentional.&lt;/p&gt;

&lt;p&gt;The upstream system handles delivery reliability.&lt;/p&gt;

&lt;p&gt;This service handles data correctness.&lt;/p&gt;

&lt;p&gt;Mixing the two usually makes both harder to reason about.&lt;/p&gt;

&lt;p&gt;That boundary matters.&lt;/p&gt;

&lt;p&gt;Because raw device or telemetry data is rarely clean enough to use directly.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why I Didn't Want "Just a Raw Ingest Endpoint"
&lt;/h2&gt;

&lt;p&gt;A very common approach looks like this:&lt;/p&gt;

&lt;p&gt;POST /ingest&lt;br&gt;
→ accept payload&lt;br&gt;
→ store raw document&lt;br&gt;
→ defer cleanup downstream&lt;/p&gt;

&lt;p&gt;This works early on.&lt;/p&gt;

&lt;p&gt;But over time, it pushes data quality problems into every downstream system.&lt;/p&gt;

&lt;p&gt;Each consumer ends up re-implementing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;validation&lt;/li&gt;
&lt;li&gt;normalization&lt;/li&gt;
&lt;li&gt;fallback logic&lt;/li&gt;
&lt;li&gt;schema assumptions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That duplication is where systems start drifting apart.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the Service Does
&lt;/h2&gt;

&lt;p&gt;The FastAPI service exposes a few core endpoints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;POST /auth&lt;/li&gt;
&lt;li&gt;POST /ingest&lt;/li&gt;
&lt;li&gt;GET /topics/summary&lt;/li&gt;
&lt;li&gt;GET /invalid/recent&lt;/li&gt;
&lt;li&gt;GET /health&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Only &lt;code&gt;/health&lt;/code&gt; and &lt;code&gt;/auth&lt;/code&gt; are public. Everything else is JWT-protected.&lt;/p&gt;

&lt;p&gt;That was intentional.&lt;/p&gt;

&lt;p&gt;Once a service becomes the data-quality boundary, it should control:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;who can write
&lt;/li&gt;
&lt;li&gt;what gets accepted
&lt;/li&gt;
&lt;li&gt;how that data is shaped
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is about protecting the integrity of the data entering the system.&lt;/p&gt;




&lt;h2&gt;
  
  
  Design Decisions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Keeping Auth Out of the Ingest Path
&lt;/h3&gt;

&lt;p&gt;The first decision was simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Ingestion should be authenticated, but not complicated.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So I added a small &lt;code&gt;/auth&lt;/code&gt; endpoint that validates a configured client ID + secret and issues a JWT.&lt;/p&gt;

&lt;p&gt;That token is then required for ingestion and read endpoints.&lt;/p&gt;

&lt;p&gt;This gave me a few useful properties:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Collector services can authenticate once and reuse a token&lt;/li&gt;
&lt;li&gt;Ingest logic stays separate from credential validation&lt;/li&gt;
&lt;li&gt;Downstream protected endpoints remain simple&lt;/li&gt;
&lt;li&gt;The service can reject unauthenticated writes before touching data&lt;/li&gt;
&lt;/ul&gt;
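
&lt;p&gt;A minimal sketch of that issuance flow (illustrative only: the credential values, the HMAC-signed token format, and helper names like &lt;code&gt;issue_token&lt;/code&gt; are stand-ins, and a real deployment would use a proper JWT library):&lt;/p&gt;

```python
import base64, hashlib, hmac, json, time

# Hypothetical stand-ins for the configured client credentials.
CLIENT_ID = "collector-1"
CLIENT_SECRET = "change-me"
SIGNING_KEY = b"server-side-signing-key"

def issue_token(client_id, client_secret, ttl_seconds=3600):
    """Validate the configured client ID and secret, then mint a signed token."""
    ok_id = hmac.compare_digest(client_id, CLIENT_ID)
    ok_secret = hmac.compare_digest(client_secret, CLIENT_SECRET)
    if not (ok_id and ok_secret):
        return None
    claims = {"sub": client_id, "exp": time.time() + ttl_seconds}
    body = base64.urlsafe_b64encode(json.dumps(claims).encode())
    sig = hmac.new(SIGNING_KEY, body, hashlib.sha256).hexdigest()
    return body.decode() + "." + sig

def verify_token(token):
    """Check the signature; return the claims dict, or None if invalid."""
    body, _, sig = token.partition(".")
    expected = hmac.new(SIGNING_KEY, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None
    return json.loads(base64.urlsafe_b64decode(body.encode()))
```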

&lt;p&gt;This isn’t about building a full auth system.&lt;/p&gt;

&lt;p&gt;It’s about ensuring that &lt;strong&gt;ingestion is an explicit contract between services&lt;/strong&gt;, not an open write surface.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: This service is designed to sit behind a buffered MQTT ingestion layer. &lt;br&gt;
The buffer handles retries, batching, and offline durability, while this service &lt;br&gt;
focuses purely on validation, normalization, and storage.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You can think of it as the next step after this layer:&lt;br&gt;
&lt;a href="https://dev.to/kaustubhalandkar/designing-an-offline-resilient-mqtt-buffer-with-sqlite-dj4"&gt;https://dev.to/kaustubhalandkar/designing-an-offline-resilient-mqtt-buffer-with-sqlite-dj4&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  2. Strict Validation Without Throwing Away Useful Data
&lt;/h3&gt;

&lt;p&gt;This was probably the most important design decision in the whole project.&lt;/p&gt;

&lt;p&gt;A telemetry record can be "bad" in multiple ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Unknown topic&lt;/li&gt;
&lt;li&gt;Missing payload entirely&lt;/li&gt;
&lt;li&gt;Incomplete payload&lt;/li&gt;
&lt;li&gt;Missing timestamp&lt;/li&gt;
&lt;li&gt;Partially missing expected fields&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But not all of those failures should be treated the same.&lt;/p&gt;

&lt;p&gt;So I split the logic into two layers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hard validation failures&lt;/strong&gt; — make the record invalid and route it to &lt;code&gt;invalid_records&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Unsupported topic&lt;/li&gt;
&lt;li&gt;Missing payload&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Soft data quality gaps&lt;/strong&gt; — still allow the record to be stored as valid, but with visibility:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Missing expected payload keys&lt;/li&gt;
&lt;li&gt;Fallback timestamp usage&lt;/li&gt;
&lt;li&gt;Null-filled required fields&lt;/li&gt;
&lt;/ul&gt;
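
&lt;p&gt;A compact sketch of that two-layer split (the topic schemas and error codes here are illustrative, not the service's actual definitions):&lt;/p&gt;

```python
# Hypothetical per-topic schemas; the real service defines its own.
EXPECTED_FIELDS = {
    "inverters": ["device_id", "power_kw", "temp_c"],
    "weather": ["irradiance", "ambient_temp"],
}

def validate_record(record):
    """Return (errors, warnings): errors reject the record, warnings annotate it."""
    errors, warnings = [], []
    topic = record.get("topic")
    payload = record.get("payload")
    if topic not in EXPECTED_FIELDS:
        errors.append("UNSUPPORTED_TOPIC")        # hard failure
    if not payload:
        errors.append("MISSING_PAYLOAD")          # hard failure
    elif topic in EXPECTED_FIELDS:
        for field in EXPECTED_FIELDS[topic]:
            if payload.get(field) is None:
                warnings.append("MISSING_FIELD:" + field)  # soft gap
    if record.get("timestamp") is None:
        warnings.append("FALLBACK_TIMESTAMP")     # soft gap
    return errors, warnings
```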

&lt;p&gt;That distinction ended up being extremely useful.&lt;/p&gt;

&lt;p&gt;In real telemetry systems, partial data is often still operationally valuable. &lt;br&gt;
Throwing it away completely is often worse than storing it with clear quality metadata.&lt;/p&gt;

&lt;p&gt;The goal wasn’t strict correctness.&lt;/p&gt;

&lt;p&gt;It was &lt;strong&gt;controlled acceptance&lt;/strong&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. Normalizing Records Into a Consistent Shape
&lt;/h3&gt;

&lt;p&gt;One of the biggest problems with raw telemetry is that even "valid" data is often inconsistent.&lt;/p&gt;

&lt;p&gt;So I normalize each accepted record before storage. That includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Standardizing timestamps&lt;/li&gt;
&lt;li&gt;Filling missing required payload keys with &lt;code&gt;null&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Tracking missing fields explicitly&lt;/li&gt;
&lt;li&gt;Shaping records into a consistent internal model&lt;/li&gt;
&lt;/ul&gt;
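
&lt;p&gt;As a sketch, the normalization step might look like this (the internal model's exact shape and the &lt;code&gt;REQUIRED_KEYS&lt;/code&gt; schema are my stand-ins):&lt;/p&gt;

```python
from datetime import datetime, timezone

REQUIRED_KEYS = {"weather": ["irradiance", "ambient_temp"]}  # hypothetical schema

def normalize_record(topic, payload, received_at=None):
    """Shape a validated record into a consistent internal model."""
    received_at = received_at or datetime.now(timezone.utc)
    expected = REQUIRED_KEYS.get(topic, [])
    missing = [k for k in expected if payload.get(k) is None]
    # Fill missing required keys with None so every document has the same shape.
    shaped = {k: payload.get(k) for k in expected}
    shaped.update({k: v for k, v in payload.items() if k not in expected})
    return {
        "topic": topic,
        "payload": shaped,
        "missing_fields": missing,   # missingness is tracked, not hidden
        "received_at": received_at.isoformat(),
    }
```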

&lt;p&gt;This means downstream consumers don’t have to guess whether a field was absent, renamed, or formatted differently by source.&lt;/p&gt;

&lt;p&gt;The ingest core makes that decision once.&lt;/p&gt;

&lt;p&gt;This shifts complexity away from every consumer and into a single, well-defined boundary.&lt;/p&gt;

&lt;p&gt;Which is exactly where it belongs.&lt;/p&gt;




&lt;h3&gt;
  
  
  4. Timestamps Needed Clear Rules
&lt;/h3&gt;

&lt;p&gt;At first, timestamp handling felt like a detail. It wasn't.&lt;/p&gt;

&lt;p&gt;Records may arrive with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;timestamp&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;timestamp inside the payload (e.g. &lt;code&gt;payload.ts&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;missing timestamp entirely&lt;/li&gt;
&lt;li&gt;malformed values&lt;/li&gt;
&lt;li&gt;naive timestamps with no timezone&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So I made the normalization path explicit:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use &lt;code&gt;timestamp&lt;/code&gt; if valid&lt;/li&gt;
&lt;li&gt;Otherwise try a timestamp inside the payload (e.g. &lt;code&gt;payload.ts&lt;/code&gt;)
&lt;/li&gt;
&lt;li&gt;Otherwise fall back to &lt;code&gt;received_at&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Treat naive timestamps as UTC&lt;/li&gt;
&lt;/ol&gt;
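
&lt;p&gt;That fallback chain can be sketched directly (&lt;code&gt;payload.ts&lt;/code&gt; comes from the example above; the helper names are mine):&lt;/p&gt;

```python
from datetime import datetime, timezone

def parse_ts(value):
    """Parse an ISO-8601 string, treating naive values as UTC."""
    try:
        dt = datetime.fromisoformat(value)
    except (TypeError, ValueError):
        return None
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)   # rule 4: naive means UTC
    return dt

def resolve_event_time(record, received_at):
    """Apply the fallback chain: timestamp, then payload.ts, then received_at."""
    for candidate in (record.get("timestamp"), record.get("payload", {}).get("ts")):
        dt = parse_ts(candidate)
        if dt is not None:
            return dt
    return received_at                          # rule 3: last resort
```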

&lt;p&gt;Time flows into everything later — ordering, summaries, freshness, debugging.&lt;/p&gt;

&lt;p&gt;A bad timestamp policy quietly poisons all of it.&lt;/p&gt;

&lt;p&gt;So this rule now lives in one place instead of being reinterpreted across services.&lt;/p&gt;

&lt;p&gt;I treated timestamp normalization as a core responsibility.&lt;/p&gt;




&lt;h3&gt;
  
  
  5. Flexible Storage, Strict Ingest Rules
&lt;/h3&gt;

&lt;p&gt;This service stores three different kinds of data:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Collection&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;normalized_records&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Accepted records after validation + normalization&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;invalid_records&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Rejected records with explicit error codes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;topic_aggregates&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Per-topic summary state: count, missing field count, last event and received timestamps&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That combination made MongoDB a natural fit.&lt;/p&gt;

&lt;p&gt;The ingest rules are strict.&lt;/p&gt;

&lt;p&gt;But the shape of incoming data varies across topics.&lt;/p&gt;

&lt;p&gt;I wanted:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Flexible document storage&lt;/li&gt;
&lt;li&gt;Simple inserts&lt;/li&gt;
&lt;li&gt;Topic-based querying&lt;/li&gt;
&lt;li&gt;Aggregate updates without over-modeling too early&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So MongoDB became the persistence layer, while the FastAPI service became the place where structure is enforced.&lt;/p&gt;




&lt;h3&gt;
  
  
  6. Invalid Data Should Be Queryable, Not Just Logged
&lt;/h3&gt;

&lt;p&gt;A lot of ingestion systems do this:&lt;/p&gt;

&lt;p&gt;bad input → log error → drop it&lt;/p&gt;

&lt;p&gt;That sounds fine until someone asks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What's failing most often?&lt;/li&gt;
&lt;li&gt;Which topic is malformed?&lt;/li&gt;
&lt;li&gt;Are devices sending incomplete data?&lt;/li&gt;
&lt;li&gt;Are we rejecting too aggressively?&lt;/li&gt;
&lt;li&gt;Did a deployment break payload shape expectations?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If invalid records only exist in logs, answering those questions becomes annoying very quickly.&lt;/p&gt;

&lt;p&gt;So invalid records are stored intentionally with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Topic&lt;/li&gt;
&lt;li&gt;Payload&lt;/li&gt;
&lt;li&gt;Received time&lt;/li&gt;
&lt;li&gt;Error list&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That gives the system a &lt;strong&gt;memory of failure&lt;/strong&gt; instead of just a momentary complaint.&lt;/p&gt;

&lt;p&gt;And operationally, that's much more useful.&lt;/p&gt;

&lt;p&gt;Invalid data is not noise.&lt;/p&gt;

&lt;p&gt;It’s feedback from the system.&lt;/p&gt;




&lt;h3&gt;
  
  
  7. Raw Counts Weren’t Enough — So I Added Topic-Level Aggregates
&lt;/h3&gt;

&lt;p&gt;Raw ingestion tells you volume.&lt;/p&gt;

&lt;p&gt;It doesn’t tell you whether your system is healthy.&lt;/p&gt;

&lt;p&gt;So I built topic-level aggregate updates that track:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Record count per topic&lt;/li&gt;
&lt;li&gt;Missing field count&lt;/li&gt;
&lt;li&gt;Latest event time&lt;/li&gt;
&lt;li&gt;Latest received time&lt;/li&gt;
&lt;/ul&gt;
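
&lt;p&gt;With MongoDB, those four fields map onto a single upsert. Here is an illustrative sketch of the update document (the operators are standard MongoDB syntax; the field names are stand-ins):&lt;/p&gt;

```python
def build_aggregate_update(topic, missing_count, event_time, received_time):
    """Build the filter and update documents for a per-topic upsert."""
    filter_doc = {"topic": topic}
    update_doc = {
        "$inc": {
            "record_count": 1,
            "missing_field_count": missing_count,
        },
        # $max keeps only the most recent timestamps without a read-modify-write.
        "$max": {
            "last_event_at": event_time,
            "last_received_at": received_time,
        },
    }
    return filter_doc, update_doc

# In the real service this would be applied with something like:
#   collection.update_one(filter_doc, update_doc, upsert=True)
```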

&lt;p&gt;That gives the system a lightweight operational view without needing a full analytics layer yet.&lt;/p&gt;

&lt;p&gt;It's not a dashboard product. &lt;/p&gt;

&lt;p&gt;But it creates the kind of summary surface area you actually need once ingestion is running continuously.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where the Actual Friction Was
&lt;/h2&gt;

&lt;p&gt;The API layer itself was straightforward.&lt;/p&gt;

&lt;p&gt;The hard part was deciding how the system should behave under imperfect input.&lt;/p&gt;

&lt;p&gt;The code isn't huge. But the subtle parts were much more interesting than the "API" part.&lt;/p&gt;

&lt;h3&gt;
  
  
  Defining What "Valid Enough" Means Is Harder Than It Sounds
&lt;/h3&gt;

&lt;p&gt;One of the trickiest design questions was: &lt;em&gt;when should a record be rejected vs accepted with missing fields?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That's not a purely technical decision. It's a system behavior decision.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reject too aggressively&lt;/strong&gt; → you lose useful operational data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accept too loosely&lt;/strong&gt; → you pollute downstream trust&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So I ended up treating &lt;strong&gt;structural integrity as mandatory&lt;/strong&gt;, and &lt;strong&gt;field completeness as observable but tolerable&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That balance felt much more realistic than pretending telemetry is always complete.&lt;/p&gt;

&lt;h3&gt;
  
  
  Topic Schemas Are Useful — But They Create Ownership
&lt;/h3&gt;

&lt;p&gt;Each topic has its own expected payload shape:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Topic&lt;/th&gt;
&lt;th&gt;Scope&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;sites&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Plant-level metrics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;inverters&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Inverter telemetry&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;strings&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;String-level electrical details&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;weather&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Irradiance and atmospheric fields&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;grid&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Power and grid metrics&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That means the ingest core now owns a kind of schema contract. Which is good.&lt;/p&gt;

&lt;p&gt;But it also means adding a new topic is a &lt;strong&gt;deliberate system change&lt;/strong&gt;, not just "new data showing up."&lt;/p&gt;

&lt;p&gt;And honestly, I think that's the right trade-off.&lt;/p&gt;

&lt;h3&gt;
  
  
  "Store Everything" Sounds Simple Until Queryability Matters
&lt;/h3&gt;

&lt;p&gt;Raw ingestion and useful ingestion are not the same thing.&lt;/p&gt;

&lt;p&gt;If you just store incoming payloads blindly, you'll probably feel productive for a while. But later you'll want to ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which records were incomplete?&lt;/li&gt;
&lt;li&gt;Which topic is most degraded?&lt;/li&gt;
&lt;li&gt;What's the latest valid data per category?&lt;/li&gt;
&lt;li&gt;How many records are structurally invalid?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those questions only become answerable if the ingest layer stores &lt;strong&gt;intent&lt;/strong&gt;, not just bytes.&lt;/p&gt;

&lt;p&gt;That's why normalization, invalid storage, and aggregate tracking ended up mattering more than the endpoint itself.&lt;/p&gt;

&lt;h3&gt;
  
  
  Error Handling Is Part of the Data Model
&lt;/h3&gt;

&lt;p&gt;One thing this project reinforced for me:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Bad data is still data.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;An invalid telemetry record is not "nothing." It's a signal — of schema drift, upstream bugs, partial device failure, rollout mistakes, or simply incomplete operational conditions.&lt;/p&gt;

&lt;p&gt;That's why I increasingly think of error handling as part of the &lt;strong&gt;data model&lt;/strong&gt;, not just exception handling.&lt;/p&gt;

&lt;p&gt;Once I started thinking that way, the service design got a lot cleaner.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Shifted in How I Think
&lt;/h2&gt;

&lt;p&gt;Before building this, I thought of ingestion as a transport problem.&lt;/p&gt;

&lt;p&gt;Now I think of it as a &lt;strong&gt;trust boundary problem&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That boundary decides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What gets accepted&lt;/li&gt;
&lt;li&gt;What gets normalized&lt;/li&gt;
&lt;li&gt;What gets rejected&lt;/li&gt;
&lt;li&gt;What becomes queryable system state&lt;/li&gt;
&lt;li&gt;What quality guarantees downstream code can rely on&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's a much more important role than "just receive and store."&lt;/p&gt;

&lt;p&gt;One thing that became clear:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In practice, keeping partially correct data is often better than forcing everything to look clean.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Real systems don't always give you perfect input. Sometimes the right move is not to reject everything imperfect. Sometimes it's to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Accept what is structurally usable&lt;/li&gt;
&lt;li&gt;Preserve missingness explicitly&lt;/li&gt;
&lt;li&gt;Separate invalid data cleanly&lt;/li&gt;
&lt;li&gt;Make the trade-offs visible&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That felt like the right design for this kind of ingestion boundary.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Takeaway
&lt;/h2&gt;

&lt;p&gt;I didn’t build an ingestion API.&lt;/p&gt;

&lt;p&gt;I built a boundary that decides what data becomes part of the system.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A FastAPI service that turns raw telemetry into something downstream systems can reason about.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Not by overengineering it. Just by being very explicit about authentication, validation, normalization, invalid record handling, and operational visibility.&lt;/p&gt;

&lt;p&gt;And honestly, this is where systems tend to either stay clean…&lt;/p&gt;

&lt;p&gt;or become harder to reason about over time.&lt;/p&gt;




&lt;h2&gt;
  
  
  If You've Built Something Similar
&lt;/h2&gt;

&lt;p&gt;I'd genuinely be curious:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Do you reject incomplete telemetry, or store it with quality metadata?&lt;/li&gt;
&lt;li&gt;Do you treat invalid records as operational artifacts or just log noise?&lt;/li&gt;
&lt;li&gt;Where do you define your "data trust boundary"?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That seems to be where the real system design decisions start showing up.&lt;/p&gt;

</description>
      <category>fastapi</category>
      <category>systemdesign</category>
      <category>backend</category>
      <category>datavalidation</category>
    </item>
    <item>
      <title>Designing an Offline-Resilient MQTT Buffer with SQLite</title>
      <dc:creator>Kaustubh Alandkar</dc:creator>
      <pubDate>Mon, 06 Apr 2026 17:38:01 +0000</pubDate>
      <link>https://dev.to/kaustubhalandkar/designing-an-offline-resilient-mqtt-buffer-with-sqlite-dj4</link>
      <guid>https://dev.to/kaustubhalandkar/designing-an-offline-resilient-mqtt-buffer-with-sqlite-dj4</guid>
      <description>&lt;h3&gt;
  
  
  The most crucial question for a data collection service over the MQTT (Message Queuing Telemetry Transport) protocol
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;What happens when the downstream service disappears for 20 minutes?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That's the question I had when designing a lightweight client to collect data from devices that communicate over the MQTT protocol.&lt;/p&gt;

&lt;p&gt;At first I thought of forwarding every MQTT message directly to an HTTP API. But I had to account for unreliable networks, a problem common in distributed systems.&lt;/p&gt;

&lt;p&gt;The moment the downstream API becomes slow, times out, or has an auth issue, the whole ingestion path starts inheriting those failures.&lt;/p&gt;

&lt;p&gt;So I had to build a system methodically. &lt;/p&gt;

&lt;p&gt;The goal was simple: keep accepting data even when the rest of the pipeline is having problems.&lt;/p&gt;

&lt;p&gt;I ended up with a lightweight Python service that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;subscribes to MQTT topics&lt;/li&gt;
&lt;li&gt;keeps a local durable buffer in &lt;strong&gt;SQLite&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;forwards records downstream in batches&lt;/li&gt;
&lt;li&gt;retries when delivery fails&lt;/li&gt;
&lt;li&gt;survives process restarts without losing data not yet acknowledged by the downstream API&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Honestly, by the end of it I stopped thinking of this as an MQTT project. It became more of an exercise in &lt;em&gt;where reliability should live&lt;/em&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  How it's structured
&lt;/h2&gt;

&lt;p&gt;The service sits between an MQTT broker and a downstream HTTP ingest endpoint. A simple topology.&lt;/p&gt;

&lt;p&gt;Data flows like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MQTT Broker
   ↓
MQTT Client (on_message)
   ↓
latest_by_topic (in-memory dedup)
   ↓
Flush Worker (interval-based)
   ↓
SQLite — mqtt_buffer
   ↓
Sender Worker (batched + authenticated)
   ↓
Downstream HTTP
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It subscribes to topics like &lt;code&gt;sites&lt;/code&gt;, &lt;code&gt;inverters&lt;/code&gt;, &lt;code&gt;strings&lt;/code&gt;, &lt;code&gt;weather&lt;/code&gt;, and &lt;code&gt;grid&lt;/code&gt;. Under the hood:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;one Python process&lt;/li&gt;
&lt;li&gt;one MQTT client loop&lt;/li&gt;
&lt;li&gt;two worker threads&lt;/li&gt;
&lt;li&gt;one SQLite file as the durable queue&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The three-way split — receive, persist, deliver — is intentional. Each stage failing independently is the whole point.&lt;/p&gt;




&lt;h2&gt;
  
  
  The direct approach — and why I moved away from it
&lt;/h2&gt;

&lt;p&gt;The obvious implementation is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;on_message → send HTTP request immediately
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It works great in demos. Becomes uncomfortable in real environments almost immediately.&lt;/p&gt;

&lt;p&gt;Once you couple ingestion to delivery, your MQTT callback is now implicitly dependent on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;downstream API latency&lt;/li&gt;
&lt;li&gt;downstream availability&lt;/li&gt;
&lt;li&gt;auth/token health&lt;/li&gt;
&lt;li&gt;retry behavior&lt;/li&gt;
&lt;li&gt;network flakiness&lt;/li&gt;
&lt;li&gt;partial delivery handling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Six concerns in one function isn't simplicity; it's just hidden coupling.&lt;/p&gt;

&lt;p&gt;Receiving data and delivering data belong on different sides of a boundary. Mixing them is where fragility starts.&lt;/p&gt;




&lt;h2&gt;
  
  
  Design decisions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. SQLite as the durability boundary
&lt;/h3&gt;

&lt;p&gt;This is probably the most important decision in the whole project.&lt;/p&gt;

&lt;p&gt;I wanted "buffered" to mean &lt;em&gt;persisted and recoverable&lt;/em&gt; — not just storing data in-memory.&lt;/p&gt;

&lt;p&gt;So SQLite is not just storage here. It's the line between &lt;em&gt;received&lt;/em&gt; and &lt;em&gt;safe enough to retry&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Each row in &lt;code&gt;mqtt_buffer&lt;/code&gt; carries:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;column&lt;/th&gt;
&lt;th&gt;purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;ordering and dedup&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;topic&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;source topic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ts&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;timestamp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;payload&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;raw message content&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;qos&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;delivery quality level&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;retain&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;MQTT retain flag&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;attempts&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;delivery retry count&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;last_error&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;last failure reason&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
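
&lt;p&gt;In SQLite terms, that schema is roughly the following (the column names come from the table above; the types and defaults are my guesses):&lt;/p&gt;

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS mqtt_buffer (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,  -- ordering and dedup
    topic      TEXT NOT NULL,
    ts         TEXT NOT NULL,
    payload    TEXT NOT NULL,
    qos        INTEGER DEFAULT 0,
    retain     INTEGER DEFAULT 0,
    attempts   INTEGER DEFAULT 0,                  -- delivery retry count
    last_error TEXT
);
"""

def open_buffer(path):
    """Open (and if needed create) the durable buffer database."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```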

&lt;p&gt;Rows are only deleted after the downstream service acknowledges them. That single rule changes the failure model entirely. A failed send doesn't destroy the record — it just changes its lifecycle state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why SQLite and not Redis or Kafka?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Because this service lives at the edge. I needed something embedded, durable, operationally cheap, and easy to inspect over SSH. SQLite is all of those things.&lt;/p&gt;

&lt;p&gt;More importantly, it keeps the deployment footprint at:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python mqtt_to_sqlite.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I wanted to avoid the operational overhead of additional infrastructure components just to run this service.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. Keep the callback thin
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;on_message&lt;/code&gt; path does exactly four things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Parse the message&lt;/li&gt;
&lt;li&gt;Capture MQTT metadata and a timestamp&lt;/li&gt;
&lt;li&gt;Update in-memory state&lt;/li&gt;
&lt;li&gt;Exit&lt;/li&gt;
&lt;/ol&gt;
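
&lt;p&gt;In code, that whole path is only a few lines (a sketch of the shape rather than the project's exact implementation; in the real service this function would be registered as the paho-mqtt &lt;code&gt;on_message&lt;/code&gt; handler):&lt;/p&gt;

```python
import json, threading, time

latest_by_topic = {}                 # in-memory dedup state
state_lock = threading.Lock()

def handle_message(topic, payload_bytes):
    """Parse, stamp, store in memory, exit. Nothing else."""
    try:
        payload = json.loads(payload_bytes)
    except ValueError:
        # Keep unparseable payloads instead of dropping them silently.
        payload = {"raw": payload_bytes.decode("utf-8", "replace")}
    record = {"topic": topic, "ts": time.time(), "payload": payload}
    with state_lock:
        latest_by_topic[topic] = record   # no disk, no HTTP, no auth
```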

&lt;p&gt;No disk writes. No HTTP calls. No auth. No retries.&lt;/p&gt;

&lt;p&gt;That thinness matters more than it looks. Because once the callback starts doing real work, message receipt becomes coupled to delivery health. Which means if downstream is slow, the callback slows down. &lt;/p&gt;

&lt;p&gt;I wanted the opposite: the service should keep accepting MQTT messages even while delivery is completely broken. The callback being fast and simple is what makes that possible.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. Two workers, one job each
&lt;/h3&gt;

&lt;p&gt;Once I separated receipt from delivery, it became natural to split the background work into two threads:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Flush worker&lt;/strong&gt; — reads from the in-memory state and writes batches to SQLite at a configured interval.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sender worker&lt;/strong&gt; — checks downstream health, fetches and caches a JWT, reads rows from SQLite, POSTs batches, deletes acknowledged rows, and records failures with &lt;code&gt;attempts&lt;/code&gt; and &lt;code&gt;last_error&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The failure isolation this gives you is real:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;downstream offline → ingestion still works&lt;/li&gt;
&lt;li&gt;auth broken → buffering still works&lt;/li&gt;
&lt;li&gt;send rate slower than receive rate → SQLite absorbs the mismatch&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One loop trying to do all of that at once doesn't give you any of this.&lt;/p&gt;




&lt;h3&gt;
  
  
  4. Using SQLite as a durable queue
&lt;/h3&gt;

&lt;p&gt;I'm aware SQLite isn't a message broker.&lt;/p&gt;

&lt;p&gt;For this scope, what I needed was: append new records, read oldest rows in order, retry failures, delete only after acknowledgment. SQLite handles all of that just fine if you're disciplined about concurrency.&lt;/p&gt;

&lt;p&gt;I enabled:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;WAL&lt;/code&gt; mode — allows concurrent reads while writes are in progress&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;synchronous=FULL&lt;/code&gt; — no data loss on OS crash&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;busy_timeout&lt;/code&gt; — handles lock contention without erroring immediately&lt;/li&gt;
&lt;/ul&gt;
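
&lt;p&gt;Applied at connection time, that looks roughly like this (the PRAGMAs are standard SQLite; the helper is a stand-in):&lt;/p&gt;

```python
import sqlite3

def connect_durable(path):
    """Open the buffer DB with the concurrency settings described above."""
    conn = sqlite3.connect(path, check_same_thread=False)
    conn.execute("PRAGMA journal_mode=WAL")      # concurrent reads during writes
    conn.execute("PRAGMA synchronous=FULL")      # fsync on every commit
    conn.execute("PRAGMA busy_timeout=5000")     # wait up to 5s on lock contention
    return conn
```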

&lt;p&gt;Those settings matter. The process has one thread writing rows, another reading and deleting them, and shared mutable state between them. Without proper configuration, this is where you get bugs that only appear in production, under load, after three weeks.&lt;/p&gt;




&lt;h3&gt;
  
  
  5. Batched delivery
&lt;/h3&gt;

&lt;p&gt;The sender doesn't push records downstream one at a time on arrival. It reads from SQLite in batches, controlled by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;SEND_BATCH_SIZE&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;SEND_INTERVAL_SECONDS&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
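
&lt;p&gt;The sender's core loop can be sketched against the &lt;code&gt;mqtt_buffer&lt;/code&gt; table (the delivery callback and the default value are stand-ins):&lt;/p&gt;

```python
import sqlite3

SEND_BATCH_SIZE = 100   # illustrative default

def drain_once(conn, send, batch_size=SEND_BATCH_SIZE):
    """Read the oldest rows, attempt one batched send, delete on acknowledgment."""
    rows = conn.execute(
        "SELECT id, topic, payload FROM mqtt_buffer ORDER BY id LIMIT ?",
        (batch_size,),
    ).fetchall()
    if not rows:
        return 0
    ids = [r[0] for r in rows]
    marks = ",".join("?" for _ in ids)
    try:
        send([{"topic": r[1], "payload": r[2]} for r in rows])
    except Exception as exc:
        # Failed batch: keep the rows, record the failure for inspection.
        conn.execute(
            "UPDATE mqtt_buffer SET attempts = attempts + 1, last_error = ? "
            "WHERE id IN (" + marks + ")", [str(exc)] + ids)
        conn.commit()
        return 0
    # Delete only after the downstream acknowledged the batch.
    conn.execute("DELETE FROM mqtt_buffer WHERE id IN (" + marks + ")", ids)
    conn.commit()
    return len(rows)
```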

&lt;p&gt;This gives a few useful properties:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Controlled downstream pressure.&lt;/strong&gt; Every incoming MQTT message doesn't immediately become an outbound HTTP request.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cleaner retry behavior.&lt;/strong&gt; Failures happen at the batch level, not hidden inside a per-message request loop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Easier tuning.&lt;/strong&gt; If downstream can handle more throughput, I change two config values. No ingestion logic changes.&lt;/p&gt;

&lt;p&gt;This system is not ultra real-time. That's a deliberate tradeoff. I'd rather have a controlled delivery loop with predictable behavior than a fast path that breaks in subtle ways under load.&lt;/p&gt;




&lt;h3&gt;
  
  
  6. Auth belongs to delivery
&lt;/h3&gt;

&lt;p&gt;The downstream service requires JWT auth. The sender worker:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fetches a token from &lt;code&gt;AUTH_URL&lt;/code&gt; on startup&lt;/li&gt;
&lt;li&gt;caches it in memory&lt;/li&gt;
&lt;li&gt;attaches it to outbound requests&lt;/li&gt;
&lt;li&gt;invalidates the cache on &lt;code&gt;401&lt;/code&gt; and re-fetches&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This keeps auth scoped to delivery. If the auth service goes down, MQTT messages still get accepted and buffered. Delivery pauses. Ingestion doesn't.&lt;/p&gt;
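&lt;p&gt;A simplified sketch of that token lifecycle (the fetch function stands in for the real POST to &lt;code&gt;AUTH_URL&lt;/code&gt;; names are illustrative):&lt;/p&gt;

```python
import threading

class TokenCache:
    """Fetch-once JWT cache: fetch lazily, invalidate on 401, re-fetch."""
    def __init__(self, fetch_token):
        self._fetch = fetch_token   # stand-in for POST to AUTH_URL
        self._token = None
        self._lock = threading.Lock()

    def get(self):
        with self._lock:
            if self._token is None:
                self._token = self._fetch()
            return self._token

    def invalidate(self):
        with self._lock:
            self._token = None

def send_with_auth(cache, do_request):
    """Attach the cached token; on 401, invalidate and retry once."""
    status = do_request(cache.get())
    if status == 401:
        cache.invalidate()
        status = do_request(cache.get())
    return status
```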




&lt;h3&gt;
  
  
  7. Pre-send reachability check
&lt;/h3&gt;

&lt;p&gt;Before each send attempt, the sender does a lightweight reachability check — either against a configured health endpoint or a TCP probe fallback.&lt;/p&gt;

&lt;p&gt;This is less about sophistication and more about avoiding unnecessary work when downstream is unavailable.&lt;/p&gt;

&lt;p&gt;A quick check before each batch keeps retry behavior quieter and more predictable, which makes the system easier to reason about under failure.&lt;/p&gt;
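&lt;p&gt;The TCP-probe fallback can be sketched in a few lines (the health-endpoint variant would issue an HTTP GET instead; this version only asks whether anything is listening at all):&lt;/p&gt;

```python
import socket

def is_reachable(host, port, timeout=1.0):
    """Cheap pre-send probe: try to open a TCP connection.

    Returns True if something accepts the connection, False on
    refusal or timeout. A positive answer does not prove the
    service is healthy, only that a send attempt is worth making.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```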




&lt;h3&gt;
  
  
  8. At-least-once, stated clearly
&lt;/h3&gt;

&lt;p&gt;This service provides &lt;strong&gt;at-least-once delivery from the local buffer&lt;/strong&gt;. Not exactly-once.&lt;/p&gt;

&lt;p&gt;If a batch is processed downstream but the acknowledgment doesn't arrive cleanly, the sender will retry. That means downstream consumers need to be idempotent.&lt;/p&gt;

&lt;p&gt;That’s a reasonable contract for this layer.&lt;/p&gt;
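&lt;p&gt;On the downstream side, that contract can be honored with simple deduplication. This is an illustrative sketch, not the actual consumer; a real service would key on a stable record ID rather than a content hash:&lt;/p&gt;

```python
import hashlib
import json

class IdempotentConsumer:
    """Downstream side of an at-least-once contract: a redelivered
    batch must not be applied twice. Records are deduplicated here
    by a content hash of the JSON payload.
    """
    def __init__(self):
        self.applied = {}

    def handle(self, record):
        key = hashlib.sha256(
            json.dumps(record, sort_keys=True).encode()
        ).hexdigest()
        if key in self.applied:
            return False   # duplicate delivery: no-op
        self.applied[key] = record
        return True
```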




&lt;h3&gt;
  
  
  9. Operational behavior by design
&lt;/h3&gt;

&lt;p&gt;Even a small service needs to be operationally trustworthy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;logs go to stdout&lt;/li&gt;
&lt;li&gt;failure events (&lt;code&gt;Downstream send failed&lt;/code&gt;, &lt;code&gt;Unauthorized&lt;/code&gt;, &lt;code&gt;MQTT connect failed&lt;/code&gt;) are explicit and visible&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;SIGINT&lt;/code&gt; / &lt;code&gt;SIGTERM&lt;/code&gt; trigger graceful shutdown&lt;/li&gt;
&lt;li&gt;rows not acknowledged before shutdown are retried on next start&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last point matters. If the process stops mid-send, the next start simply retries any rows that were not acknowledged. That makes shutdown behavior predictable and recovery straightforward.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where the friction actually lives
&lt;/h2&gt;

&lt;p&gt;The architecture is not complex. But the subtle edges are very real.&lt;/p&gt;

&lt;h3&gt;
  
  
  What "buffered" actually means
&lt;/h3&gt;

&lt;p&gt;There's a meaningful difference between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;message received&lt;/strong&gt; — it's in memory somewhere&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;message durably buffered&lt;/strong&gt; — it will survive a restart&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One of the first design decisions was defining that boundary clearly: a record is only "safe" once it has been written to SQLite.&lt;/p&gt;




&lt;h3&gt;
  
  
  Threading discipline
&lt;/h3&gt;

&lt;p&gt;With a flush thread writing rows and a sender thread reading and deleting them, ownership of shared state matters. I used explicit locks around:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;database access&lt;/li&gt;
&lt;li&gt;in-memory state snapshots&lt;/li&gt;
&lt;li&gt;saved hashes for deduplication&lt;/li&gt;
&lt;li&gt;the token cache&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once two threads are interacting with the same rows and shared state, small mistakes can turn into failures that are difficult to reproduce and debug.&lt;/p&gt;
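&lt;p&gt;The discipline itself is not complicated. A sketch with one lock guarding the shared connection (names are illustrative; the point is that every touch of shared state goes through the same lock):&lt;/p&gt;

```python
import threading

db_lock = threading.Lock()  # one lock guards the connection and shared state

def flush_rows(conn, rows):
    """Flush thread: append records under the lock."""
    with db_lock:
        conn.executemany("INSERT INTO records (payload) VALUES (?)", rows)
        conn.commit()

def take_batch(conn, n):
    """Sender thread: read the oldest rows under the same lock."""
    with db_lock:
        return conn.execute(
            "SELECT id, payload FROM records ORDER BY id LIMIT ?", (n,)
        ).fetchall()
```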




&lt;h3&gt;
  
  
  Disk is part of the capacity model
&lt;/h3&gt;

&lt;p&gt;If downstream stays offline long enough, the SQLite file keeps growing. That’s not a bug — it’s the service doing exactly what it was designed to do.&lt;/p&gt;

&lt;p&gt;But it does mean disk is part of the capacity model now. &lt;/p&gt;

&lt;p&gt;In practice, that means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;monitor SQLite file size in production&lt;/li&gt;
&lt;li&gt;store the database on persistent writable storage (not an ephemeral container layer)&lt;/li&gt;
&lt;li&gt;alert if buffer growth rate becomes abnormal&lt;/li&gt;
&lt;li&gt;think about retention policy if the domain allows data expiry&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Choosing local disk over message loss only works well if the storage side of the system has been thought through too.&lt;/p&gt;




&lt;h3&gt;
  
  
  When retries get complicated
&lt;/h3&gt;

&lt;p&gt;Retries feel straightforward until the scenarios get real:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What if downstream partially processed the batch before timing out?&lt;/li&gt;
&lt;li&gt;What if auth failed after rows were already selected?&lt;/li&gt;
&lt;li&gt;What if the process restarted between "sent" and "acknowledged"?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is exactly why I didn't try to build stronger delivery guarantees than the system actually has. The sender deletes rows only after acknowledgment. That's the safest default. And it means downstream needs to be idempotent. Which is the right place to put that responsibility.&lt;/p&gt;




&lt;h3&gt;
  
  
  Auth failure is a system concern
&lt;/h3&gt;

&lt;p&gt;If downstream delivery requires a valid JWT, then auth availability is part of the delivery path.&lt;/p&gt;

&lt;p&gt;That leads to a few design questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Should auth failure stop ingestion? → No.&lt;/li&gt;
&lt;li&gt;Should the sender keep retrying with a known-bad token? → No.&lt;/li&gt;
&lt;li&gt;Should token fetch happen per request? → No.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s why the sender caches the token and invalidates it on &lt;code&gt;401&lt;/code&gt;. Auth failure stays scoped to delivery instead of propagating back into ingestion.&lt;/p&gt;




&lt;h3&gt;
  
  
  From local defaults to production
&lt;/h3&gt;

&lt;p&gt;The code works fine locally over plain HTTP and a local auth endpoint.&lt;/p&gt;

&lt;p&gt;That does not mean those defaults should carry into production unchanged.&lt;/p&gt;

&lt;p&gt;Before anything beyond local use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;AUTH_URL&lt;/code&gt; and &lt;code&gt;DOWNSTREAM_URL&lt;/code&gt; should use &lt;strong&gt;HTTPS&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;MQTT should use &lt;strong&gt;TLS&lt;/strong&gt; if the broker is outside a trusted network&lt;/li&gt;
&lt;li&gt;secrets should live in environment variables, not committed config files&lt;/li&gt;
&lt;li&gt;SQLite should live on persistent storage, not an ephemeral container layer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are not large design changes, but they are part of turning a working service into a production-safe one.&lt;/p&gt;




&lt;h2&gt;
  
  
  What shifted in how I think about this kind of work
&lt;/h2&gt;

&lt;p&gt;This project kept reinforcing the same idea:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Reliability is mostly about where you choose to put state.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For this deployment context — edge, single process, operationally minimal — SQLite was the right fit. But the broader lesson goes beyond the tool choice. Sometimes the more useful move is to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;accept the data&lt;/li&gt;
&lt;li&gt;persist it locally&lt;/li&gt;
&lt;li&gt;decouple delivery&lt;/li&gt;
&lt;li&gt;retry predictably&lt;/li&gt;
&lt;li&gt;make the trade-offs explicit&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That may not be a flashy architecture, but it is still architecture.&lt;/p&gt;

&lt;p&gt;I've also started caring more about the distinction between &lt;em&gt;works when healthy&lt;/em&gt; and &lt;em&gt;behaves predictably when unhealthy&lt;/em&gt;. The second one is harder, and usually matters more.&lt;/p&gt;

&lt;p&gt;Most painful bugs don't happen on the happy path. They happen when two otherwise normal systems fail in slightly different ways at the same time.&lt;/p&gt;

&lt;p&gt;That's the kind of behavior I keep trying to design for.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final thought
&lt;/h2&gt;

&lt;p&gt;If I had to compress this project into one sentence:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I didn't just build an MQTT subscriber. I built a small failure-tolerant delivery boundary.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Once one system is producing data and another is consuming it, the real question isn't "can they talk to each other?" It's "what happens when they can't?"&lt;/p&gt;

&lt;p&gt;For this project, the answer was: receive everything, persist locally, deliver in batches, retry predictably, and keep the system boring enough to debug under pressure.&lt;/p&gt;

&lt;p&gt;Not the only valid design, but for an edge ingestion component, it felt like the right one.&lt;/p&gt;




&lt;h2&gt;
  
  
  If you've built something similar
&lt;/h2&gt;

&lt;p&gt;I'm genuinely curious how others have handled this boundary — whether you buffered locally first or sent straight over HTTP, used Redis or Kafka instead of SQLite, treated delivery as best-effort or durable.&lt;/p&gt;

&lt;p&gt;And more specifically: &lt;strong&gt;where did you decide "safe" actually begins?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That’s usually where the real design trade-offs start to show.&lt;/p&gt;

</description>
      <category>mqtt</category>
      <category>python</category>
      <category>sqlite</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>How I Built a High-Throughput Transaction Processor with Kafka, Redis, PostgreSQL, and MongoDB</title>
      <dc:creator>Kaustubh Alandkar</dc:creator>
      <pubDate>Fri, 03 Apr 2026 16:03:34 +0000</pubDate>
      <link>https://dev.to/kaustubhalandkar/how-i-built-a-high-throughput-transaction-processor-with-kafka-redis-postgresql-and-mongodb-58gm</link>
      <guid>https://dev.to/kaustubhalandkar/how-i-built-a-high-throughput-transaction-processor-with-kafka-redis-postgresql-and-mongodb-58gm</guid>
      <description>&lt;p&gt;When I started building this project, I wanted to learn by building something similar to how backend systems in payment processing apps work.&lt;/p&gt;

&lt;p&gt;I wanted to build something that made me think carefully about &lt;strong&gt;throughput, ordering, idempotency, auditability, and failure boundaries&lt;/strong&gt; together.&lt;/p&gt;

&lt;p&gt;That led me to build &lt;strong&gt;HVTP (High Volume Transaction Processor)&lt;/strong&gt; — a portfolio-grade, event-driven pipeline that behaves more like a small transaction backend than a demo app.&lt;/p&gt;

&lt;p&gt;What made this project valuable for me wasn’t just wiring Kafka into a system.&lt;/p&gt;

&lt;p&gt;It was learning how to shape the system so the right work happened in the right place.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the project actually is
&lt;/h2&gt;

&lt;p&gt;At a practical level, HVTP is a &lt;strong&gt;signed transaction ingestion pipeline&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A merchant client sends a transaction request over HTTP. The system validates the request at ingress, accepts it quickly, and then hands it off for asynchronous processing.&lt;/p&gt;

&lt;p&gt;From there, the system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;validates and processes the transaction&lt;/li&gt;
&lt;li&gt;enforces idempotency&lt;/li&gt;
&lt;li&gt;persists ledger state&lt;/li&gt;
&lt;li&gt;stores immutable audit events&lt;/li&gt;
&lt;li&gt;exposes a status API&lt;/li&gt;
&lt;li&gt;supports reconciliation between stores&lt;/li&gt;
&lt;li&gt;emits terminal outcomes through downstream flows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The stack looks like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Kafka&lt;/strong&gt; for event flow&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Valkey (open-source Redis fork)&lt;/strong&gt; for idempotency and some read-path control&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PostgreSQL&lt;/strong&gt; for the ledger / queryable durable state&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MongoDB&lt;/strong&gt; for immutable audit events&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spring Boot&lt;/strong&gt; services split by responsibility&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;k6&lt;/strong&gt; for ingress load testing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This project is not about reproducing a regulated payments platform.&lt;/p&gt;

&lt;p&gt;It is about building a system shape where correctness, isolation of responsibility, and observable behavior matter.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why I kept the request path small
&lt;/h2&gt;

&lt;p&gt;One option was to do everything in the request path:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Receive HTTP request&lt;/li&gt;
&lt;li&gt;Validate everything in the same service&lt;/li&gt;
&lt;li&gt;Write directly to PostgreSQL&lt;/li&gt;
&lt;li&gt;Also write to MongoDB&lt;/li&gt;
&lt;li&gt;Return success&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That would have been simpler to build at first.&lt;/p&gt;

&lt;p&gt;But for this project, I wanted to separate request acceptance from downstream processing. I wanted the ingress layer to stay focused on validating, accepting, and handing work off quickly, instead of taking on ledger writes, audit writes, and every other downstream concern synchronously.&lt;/p&gt;

&lt;p&gt;That decision shaped the rest of the architecture.&lt;/p&gt;




&lt;h2&gt;
  
  
  The architecture I ended up with
&lt;/h2&gt;

&lt;p&gt;I split the write path into a small event-driven pipeline:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa3l4pw20dy60ojsz416w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa3l4pw20dy60ojsz416w.png" alt="Architecture Image" width="800" height="566"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That split gave each service one main responsibility:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;api-service&lt;/code&gt; → signed ingress + fast acceptance&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;processor-service&lt;/code&gt; → validation + idempotency&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ledger-writer-service&lt;/code&gt; → durable ledger persistence&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;audit-service&lt;/code&gt; → immutable audit history&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What I liked about this structure was that each boundary had a clear reason to exist.&lt;/p&gt;




&lt;h2&gt;
  
  
  Main architecture and design decisions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1) Keep the API fast
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;api-service&lt;/code&gt; accepts the request and returns &lt;code&gt;202 Accepted&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I did this to keep the HTTP layer an intake boundary rather than a full transaction processor.&lt;/p&gt;

&lt;p&gt;In HVTP, the ingress path is intentionally limited to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;validate request shape&lt;/li&gt;
&lt;li&gt;verify signature&lt;/li&gt;
&lt;li&gt;publish to Kafka&lt;/li&gt;
&lt;li&gt;return acceptance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That means the API is not waiting on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Idempotency checks&lt;/li&gt;
&lt;li&gt;PostgreSQL ledger persistence&lt;/li&gt;
&lt;li&gt;MongoDB audit writes&lt;/li&gt;
&lt;li&gt;downstream webhook behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This was one of the most important decisions in the project because it kept the front door responsive even when downstream work had different timing characteristics.&lt;/p&gt;
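&lt;p&gt;To make the shape of that ingress path concrete, here is a hedged Python sketch (the actual services are Spring Boot; the HMAC scheme, the field names, and the &lt;code&gt;publish&lt;/code&gt; hand-off are illustrative stand-ins, not HVTP's real contract):&lt;/p&gt;

```python
import hashlib
import hmac
import json

SHARED_SECRET = b"demo-secret"  # illustrative; real keys come from config

def verify_and_accept(body, signature, publish):
    """Ingress path only: validate shape, verify signature, hand off, ack.

    No ledger or audit writes happen here; `publish` stands in for
    the Kafka producer call.
    """
    expected = hmac.new(SHARED_SECRET, body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, signature):
        return 401
    try:
        request = json.loads(body)
    except ValueError:
        return 400
    if "accountId" not in request or "amount" not in request:
        return 400                 # request-shape validation
    publish(request)
    return 202                     # accepted, not completed
```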

&lt;h3&gt;
  
  
  2) Use Kafka for decoupling
&lt;/h3&gt;

&lt;p&gt;I used Kafka because I wanted request acceptance, transaction processing, ledger persistence, and audit persistence to move at different speeds without being tightly bound to one another.&lt;/p&gt;

&lt;p&gt;HVTP currently uses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;transaction_requests&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;transaction_log&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;dead-letter topics for failure paths&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That gave me a few concrete benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the API can accept requests without waiting for downstream writes&lt;/li&gt;
&lt;li&gt;the ledger writer and audit service can scale independently&lt;/li&gt;
&lt;li&gt;replay becomes possible&lt;/li&gt;
&lt;li&gt;failure handling becomes clearer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I also used &lt;code&gt;accountId&lt;/code&gt; as the key for the main topics.&lt;/p&gt;

&lt;p&gt;That was deliberate.&lt;/p&gt;

&lt;p&gt;For this project, the ordering boundary I cared about was not global ordering across every transaction.&lt;/p&gt;

&lt;p&gt;It was preserving ordering for transactions belonging to the same account.&lt;/p&gt;
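&lt;p&gt;The property this buys can be shown with a small sketch. Kafka's default partitioner hashes the key (murmur2) to pick a partition; any stable hash illustrates the same guarantee: one account maps to one partition, so per-partition ordering becomes per-account ordering.&lt;/p&gt;

```python
import hashlib

def partition_for(account_id, num_partitions):
    """Keyed partitioning: a stable hash of the key means every message
    for one account lands on the same partition. (Illustrative; Kafka's
    default partitioner uses murmur2, not MD5, but the property is the same.)
    """
    digest = hashlib.md5(account_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions
```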

&lt;h3&gt;
  
  
  3) Treat idempotency as a correctness concern
&lt;/h3&gt;

&lt;p&gt;The processor has to tolerate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;retries&lt;/li&gt;
&lt;li&gt;duplicate submissions&lt;/li&gt;
&lt;li&gt;consumer reprocessing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To make processing safe under all three, I used an idempotency key.&lt;/p&gt;

&lt;p&gt;Each request sent from the client includes an &lt;code&gt;Idempotency-Key&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Without it, processing the same request twice could result in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Duplicate ledger updates&lt;/li&gt;
&lt;li&gt;Duplicate audit events&lt;/li&gt;
&lt;li&gt;Inconsistent downstream outcomes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I used &lt;strong&gt;Valkey&lt;/strong&gt; (open-source Redis fork) to store and check this idempotency key in the processor service.&lt;/p&gt;

&lt;p&gt;One of the most useful mindset shifts from this project was moving from:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“How do I process this request?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;to:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“What must remain true even if this request appears more than once?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That question improved the architecture more than any individual framework decision.&lt;/p&gt;
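&lt;p&gt;The check itself is small. In Valkey it is essentially a single &lt;code&gt;SET key value NX EX ttl&lt;/code&gt;; the sketch below uses an in-memory stand-in so the shape is visible without a running server:&lt;/p&gt;

```python
import time

class IdempotencyStore:
    """In-memory stand-in for the Valkey check (SET key NX EX ttl).
    The first claim of a key wins; later claims within the TTL are duplicates.
    """
    def __init__(self):
        self._keys = {}

    def claim(self, key, ttl_seconds):
        now = time.monotonic()
        expires = self._keys.get(key)
        if expires is not None and expires > now:
            return False           # duplicate: already processed or in flight
        self._keys[key] = now + ttl_seconds
        return True

def process(store, request):
    """Processor-side guard: claim the key before doing any writes."""
    if not store.claim(request["idempotencyKey"], ttl_seconds=3600):
        return "DUPLICATE"
    # ... ledger write, audit event, etc. happen exactly once here ...
    return "PROCESSED"
```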

&lt;h3&gt;
  
  
  4) Let PostgreSQL and MongoDB do different jobs
&lt;/h3&gt;

&lt;p&gt;I used two stores intentionally because the write patterns and query needs are different.&lt;/p&gt;

&lt;h4&gt;
  
  
  PostgreSQL is the ledger
&lt;/h4&gt;

&lt;p&gt;PostgreSQL stores the durable transaction state that the system can query through the status path.&lt;/p&gt;

&lt;p&gt;It holds the queryable record of a transaction in a ledger-style structure, including fields like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;transaction_id&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;idempotency_key&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;merchant_id&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;account_id&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;amount&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;currency&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;type&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;status&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;processed_at&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the durable store for the transaction state I want to query directly.&lt;/p&gt;

&lt;h4&gt;
  
  
  MongoDB is the audit trail
&lt;/h4&gt;

&lt;p&gt;MongoDB stores immutable audit events, including values such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;transaction IDs&lt;/li&gt;
&lt;li&gt;merchant/account IDs&lt;/li&gt;
&lt;li&gt;correlation IDs&lt;/li&gt;
&lt;li&gt;statuses&lt;/li&gt;
&lt;li&gt;source topic&lt;/li&gt;
&lt;li&gt;timestamps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These stores answer different questions.&lt;/p&gt;

&lt;p&gt;The ledger answers:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“What is the durable transaction state?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The audit store answers:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“What happened around this transaction over time?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Separating those concerns made the model cleaner and easier to reason about.&lt;/p&gt;

&lt;h3&gt;
  
  
  5) Design for replay and reconciliation
&lt;/h3&gt;

&lt;p&gt;The ledger writer and audit service consume from the same event stream, but they write to different storage systems.&lt;/p&gt;

&lt;p&gt;That means there is always some possibility of drift, timing gaps, or mismatched writes across stores.&lt;/p&gt;

&lt;p&gt;So I added reconciliation support.&lt;/p&gt;

&lt;p&gt;The project includes a reconciliation model that compares recent ledger and audit state and records summary runs like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;audit count&lt;/li&gt;
&lt;li&gt;ledger count&lt;/li&gt;
&lt;li&gt;missing in ledger&lt;/li&gt;
&lt;li&gt;run status&lt;/li&gt;
&lt;li&gt;notes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I also wanted replay support to exist in the architecture before it became necessary.&lt;/p&gt;

&lt;p&gt;That decision made the system feel more operationally realistic.&lt;/p&gt;

&lt;p&gt;It shifted the design from “write to multiple places” toward “write, verify, and recover.”&lt;/p&gt;
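&lt;p&gt;At its core, a reconciliation run is a set difference over a window. A hedged sketch (the field names mirror the summary fields listed above; the real model reads recent rows from both stores rather than taking ID lists as input):&lt;/p&gt;

```python
def reconcile(audit_ids, ledger_ids):
    """Compare transaction IDs seen by the audit store and the ledger
    over the same window, and summarize the run.
    """
    missing_in_ledger = sorted(set(audit_ids) - set(ledger_ids))
    return {
        "audit_count": len(set(audit_ids)),
        "ledger_count": len(set(ledger_ids)),
        "missing_in_ledger": missing_in_ledger,
        "status": "OK" if not missing_in_ledger else "DRIFT",
    }
```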

&lt;h3&gt;
  
  
  6) Measure ingress behavior under overload
&lt;/h3&gt;

&lt;p&gt;I also ran k6 load tests against the signed transaction ingestion endpoint at multiple offered rates, including 50K RPS and 100K RPS.&lt;/p&gt;

&lt;p&gt;The purpose was not to describe the whole system as completing transactions at those rates end to end.&lt;/p&gt;

&lt;p&gt;The goal was more specific:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“How does the ingress layer behave when offered far more traffic than the machine can sustain?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That framing was important to me because it matched what I was actually measuring.&lt;/p&gt;

&lt;h4&gt;
  
  
  What the numbers showed
&lt;/h4&gt;

&lt;p&gt;In local testing on a single machine, the API maintained:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;0% HTTP failure rate&lt;/li&gt;
&lt;li&gt;100% &lt;code&gt;202 Accepted&lt;/code&gt; for completed HTTP requests&lt;/li&gt;
&lt;li&gt;accepted ingress throughput that leveled off around 3.1K–3.2K req/sec&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A few highlights:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;At 1K offered RPS, it handled 60,001 accepted requests in 60s&lt;/li&gt;
&lt;li&gt;At 50K offered RPS, accepted throughput peaked at about 3,172.5 req/sec&lt;/li&gt;
&lt;li&gt;At 100K offered RPS, it still completed 189,936 accepted requests in 60s&lt;/li&gt;
&lt;li&gt;P95/P99 latency increased under overload, but the HTTP layer remained responsive&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What I liked about that result was not the raw offered rate, but the saturation behavior.&lt;/p&gt;

&lt;p&gt;The ingress layer stayed usable, throughput leveled off in a predictable way, and latency rose before failure.&lt;/p&gt;

&lt;p&gt;That is a useful property in an asynchronous system.&lt;/p&gt;

&lt;h4&gt;
  
  
  The important caveat
&lt;/h4&gt;

&lt;p&gt;These are HTTP ingress acceptance results, not end-to-end transaction completion metrics.&lt;/p&gt;

&lt;p&gt;So the correct interpretation is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the API accepted the requests&lt;/li&gt;
&lt;li&gt;downstream completion happens asynchronously&lt;/li&gt;
&lt;li&gt;the numbers describe front-door behavior, not full workflow completion&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For this project, that was the honest and useful performance story to tell.&lt;/p&gt;




&lt;h2&gt;
  
  
  Real implementation friction and subtle problems
&lt;/h2&gt;

&lt;p&gt;The architecture diagram is the clean version.&lt;/p&gt;

&lt;p&gt;Implementation is where the edge cases become visible.&lt;/p&gt;

&lt;h3&gt;
  
  
  1) &lt;code&gt;202 Accepted&lt;/code&gt; creates a visibility obligation
&lt;/h3&gt;

&lt;p&gt;Returning &lt;code&gt;202 Accepted&lt;/code&gt; simplified the ingress path, but it also meant the system needed to answer follow-up questions such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;did it persist?&lt;/li&gt;
&lt;li&gt;did it fail?&lt;/li&gt;
&lt;li&gt;was it rejected?&lt;/li&gt;
&lt;li&gt;is it still in-flight?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is why HVTP includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a status endpoint&lt;/li&gt;
&lt;li&gt;correlation IDs&lt;/li&gt;
&lt;li&gt;downstream event flow for terminal outcomes and tracing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Moving work out of the synchronous path reduced coupling, but it also increased the need for visibility.&lt;/p&gt;
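&lt;p&gt;The visibility pattern itself is small: assign a correlation ID at ingress, record a pending state, and let the status endpoint answer the follow-up questions. An illustrative sketch (names and states are assumptions, not HVTP's actual API):&lt;/p&gt;

```python
import uuid

def accept_request(payload, registry):
    """Ingress: assign a correlation ID, record PENDING, return a 202-style ack."""
    correlation_id = str(uuid.uuid4())
    registry[correlation_id] = "PENDING"
    # ... publish (payload, correlation_id) to Kafka here ...
    return {"status": 202, "correlationId": correlation_id}

def get_status(correlation_id, registry):
    """Status endpoint: answer the follow-up question a 202 creates."""
    return registry.get(correlation_id, "UNKNOWN")
```

&lt;p&gt;Downstream consumers flip the recorded state to a terminal outcome once processing finishes.&lt;/p&gt;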

&lt;h3&gt;
  
  
  2) Ordering had to be defined carefully
&lt;/h3&gt;

&lt;p&gt;Early on, I had to be specific about what “ordering” meant in this system.&lt;/p&gt;

&lt;p&gt;For HVTP, global ordering across all transactions was not the target.&lt;/p&gt;

&lt;p&gt;Per-account ordering was the meaningful boundary.&lt;/p&gt;

&lt;p&gt;That is why Kafka messages are keyed by &lt;code&gt;accountId&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;It gives the ordering guarantee I actually needed without forcing all traffic through one serialized path.&lt;/p&gt;

&lt;h3&gt;
  
  
  3) Multi-store systems introduce operational edges
&lt;/h3&gt;

&lt;p&gt;Using PostgreSQL for ledger state and MongoDB for audit events was the right choice for this project.&lt;/p&gt;

&lt;p&gt;It also meant I had to care whether both stores continued to reflect the same logical transaction stream.&lt;/p&gt;

&lt;p&gt;That is why reconciliation became part of the design rather than an afterthought.&lt;/p&gt;

&lt;p&gt;There was also a useful implementation lesson here: the Mongo mapping used for reconciliation has to stay aligned with the collection the audit service is actually writing to.&lt;/p&gt;

&lt;p&gt;That kind of mismatch does not always fail loudly.&lt;/p&gt;

&lt;p&gt;It can quietly reduce trust in operational checks.&lt;/p&gt;

&lt;h3&gt;
  
  
  4) Performance framing matters
&lt;/h3&gt;

&lt;p&gt;Once I added the higher offered-RPS tests, I spent time thinking about how to describe the results precisely.&lt;/p&gt;

&lt;p&gt;The more useful framing was not a headline number.&lt;/p&gt;

&lt;p&gt;It was explaining what the tests actually demonstrated:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the ingress layer remains stable under overload&lt;/li&gt;
&lt;li&gt;throughput saturates at a predictable point&lt;/li&gt;
&lt;li&gt;latency rises as load increases&lt;/li&gt;
&lt;li&gt;the asynchronous boundary protects the front door on this hardware&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That framing is more useful because it stays aligned with what the measurements actually represent.&lt;/p&gt;




&lt;h2&gt;
  
  
  What changed in how I think
&lt;/h2&gt;

&lt;p&gt;Before building this, I mostly thought about high throughput as a performance problem.&lt;/p&gt;

&lt;p&gt;After building it, I think about it much more as a boundary design problem.&lt;/p&gt;

&lt;p&gt;The question that stayed with me was not:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“How fast can one service go?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It was:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Where should work happen, where should it not happen, and what must remain true when parts of the system are delayed, retried, duplicated, or partially broken?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That shift changed how I think about backend systems.&lt;/p&gt;

&lt;p&gt;A few things became much clearer to me:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Async systems need strong visibility&lt;/li&gt;
&lt;li&gt;Idempotency is part of the design, not just an implementation detail&lt;/li&gt;
&lt;li&gt;Storage choices should follow write semantics&lt;/li&gt;
&lt;li&gt;Graceful saturation is a useful success condition&lt;/li&gt;
&lt;li&gt;Good architecture is often about clean responsibility boundaries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One practical lesson from this project was that precision matters.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;202 Accepted&lt;/code&gt; should mean something specific.&lt;br&gt;
A benchmark should measure something specific.&lt;br&gt;
And each service should have a clearly defined responsibility.&lt;/p&gt;

&lt;p&gt;That mindset ended up being one of the most useful outcomes of the project.&lt;/p&gt;


&lt;h2&gt;
  
  
  Final takeaway
&lt;/h2&gt;

&lt;p&gt;If I had to compress the whole project into one sentence, it would be this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I built HVTP to practice designing a system that can accept load quickly while keeping correctness, separation of concerns, and recovery paths in view.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is what this project gave me.&lt;/p&gt;

&lt;p&gt;It helped me think more clearly about how to keep the front door fast, how to handle duplicates intentionally, how to separate durable state from audit history, and how to design for verification instead of assuming everything will always stay aligned.&lt;/p&gt;

&lt;p&gt;For me, that was far more valuable than just assembling a stack.&lt;/p&gt;


&lt;h2&gt;
  
  
  Final thoughts
&lt;/h2&gt;

&lt;p&gt;If you’ve built something in this space, I’d be genuinely interested in how you approached trade-offs around:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;202 Accepted&lt;/code&gt; vs synchronous confirmation&lt;/li&gt;
&lt;li&gt;Redis idempotency boundaries&lt;/li&gt;
&lt;li&gt;ledger vs audit store separation&lt;/li&gt;
&lt;li&gt;what you consider a useful throughput benchmark&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those design choices ended up being the most interesting part of the project for me.&lt;/p&gt;



&lt;p&gt;If you want to explore the implementation, docs, and load tests, the full repo is here:&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/kaustubh-26" rel="noopener noreferrer"&gt;
        kaustubh-26
      &lt;/a&gt; / &lt;a href="https://github.com/kaustubh-26/high-volume-transaction-processor" rel="noopener noreferrer"&gt;
        high-volume-transaction-processor
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Event-driven transaction processor with signed ingress, Kafka workflows, ledger persistence, audit storage, and webhook notifications
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;High Volume Transaction Processor&lt;/h1&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;High Volume Transaction Processor&lt;/strong&gt; — &lt;em&gt;An event-driven transaction processor&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;a rel="noopener noreferrer" href="https://github.com/kaustubh-26/high-volume-transaction-processor/actions/workflows/ci.yml/badge.svg"&gt;&lt;img src="https://github.com/kaustubh-26/high-volume-transaction-processor/actions/workflows/ci.yml/badge.svg" alt="CI Status"&gt;&lt;/a&gt;
&lt;a href="https://sonarcloud.io/dashboard?id=kaustubh-26_high-volume-transaction-processor" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/e81dd4c4ebd488f3735c9d8ffdf7d93bc10d57893c69154b3627b48cc2452ea7/68747470733a2f2f736f6e6172636c6f75642e696f2f6170692f70726f6a6563745f6261646765732f6d6561737572653f70726f6a6563743d6b617573747562682d32365f686967682d766f6c756d652d7472616e73616374696f6e2d70726f636573736f72266d65747269633d616c6572745f737461747573" alt="Quality Gate Status"&gt;&lt;/a&gt;
&lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/80c3ad9383db58bb4c6ac36bcebc2434ed4cac4692f4ba7af9d169aaa3ee5b42/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f6c6963656e73652f6b617573747562682d32362f686967682d766f6c756d652d7472616e73616374696f6e2d70726f636573736f72"&gt;&lt;img src="https://camo.githubusercontent.com/80c3ad9383db58bb4c6ac36bcebc2434ed4cac4692f4ba7af9d169aaa3ee5b42/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f6c6963656e73652f6b617573747562682d32362f686967682d766f6c756d652d7472616e73616374696f6e2d70726f636573736f72" alt="License"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;A production-style, event-driven transaction pipeline showcasing signed API ingestion, asynchronous Kafka processing, Redis idempotency, PostgreSQL ledger writes, and MongoDB audit persistence.&lt;/p&gt;
&lt;p&gt;The repository is structured like a small payment platform:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;signed transaction ingestion over HTTP&lt;/li&gt;
&lt;li&gt;asynchronous processing over Kafka&lt;/li&gt;
&lt;li&gt;Redis-backed idempotency protection&lt;/li&gt;
&lt;li&gt;ledger persistence in PostgreSQL&lt;/li&gt;
&lt;li&gt;immutable audit persistence in MongoDB&lt;/li&gt;
&lt;li&gt;dead-letter topics for failed records&lt;/li&gt;
&lt;li&gt;webhook notifications for transaction state changes&lt;/li&gt;
&lt;li&gt;Actuator and Prometheus endpoints on every service&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;What This Project Demonstrates&lt;/h2&gt;
&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;Event-driven microservices with clearly separated write responsibilities&lt;/li&gt;
&lt;li&gt;Per-account ordering by using &lt;code&gt;accountId&lt;/code&gt; as the Kafka message key&lt;/li&gt;
&lt;li&gt;Idempotency enforcement in the processor with Redis TTL-backed keys&lt;/li&gt;
&lt;li&gt;PostgreSQL as the ledger source of truth for persisted transactions&lt;/li&gt;
&lt;li&gt;MongoDB as an append-only audit store&lt;/li&gt;
&lt;li&gt;Reconciliation between the audit store and the ledger&lt;/li&gt;
&lt;li&gt;Replay support for rebuilding ledger state from &lt;code&gt;transaction_log&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Signed ingress requests and API-key-protected status…&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/kaustubh-26/high-volume-transaction-processor" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;





</description>
      <category>architecture</category>
      <category>backend</category>
      <category>kafka</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>How I Designed a Real-Time Dashboard Using Kafka, Socket.IO, and a BFF</title>
      <dc:creator>Kaustubh Alandkar</dc:creator>
      <pubDate>Mon, 30 Mar 2026 15:06:12 +0000</pubDate>
      <link>https://dev.to/kaustubhalandkar/how-i-designed-a-real-time-dashboard-using-kafka-socketio-and-a-bff-4b8m</link>
      <guid>https://dev.to/kaustubhalandkar/how-i-designed-a-real-time-dashboard-using-kafka-socketio-and-a-bff-4b8m</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;A practical breakdown of the architecture decisions, trade-offs, and frontend/backend boundaries behind Flux — an event-driven real-time dashboard platform I built.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;When I started building &lt;strong&gt;Flux&lt;/strong&gt;, I wanted something that &lt;strong&gt;actually felt like a real-time system&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not a frontend that keeps polling every few seconds.&lt;/li&gt;
&lt;li&gt;Not a UI that directly calls five different APIs.&lt;/li&gt;
&lt;li&gt;Not a project where everything works only when the happy path works.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I wanted something closer to how production systems are usually designed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;multiple data domains
&lt;/li&gt;
&lt;li&gt;asynchronous communication
&lt;/li&gt;
&lt;li&gt;real-time delivery
&lt;/li&gt;
&lt;li&gt;graceful degradation
&lt;/li&gt;
&lt;li&gt;and a frontend that stays simple even when the backend gets more complex
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So this post is a breakdown of &lt;strong&gt;how I designed the architecture for Flux&lt;/strong&gt; — and more importantly, &lt;strong&gt;why I made the decisions I made&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Flux actually is
&lt;/h2&gt;

&lt;p&gt;Flux is a &lt;strong&gt;real-time dashboard&lt;/strong&gt; that streams and displays:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Weather&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;News&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stocks&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Crypto&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At first glance, it looks like a frontend-heavy project.&lt;/p&gt;

&lt;p&gt;But the interesting part is actually the &lt;strong&gt;backend architecture&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Because the real problem wasn’t:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;"How do I render cards on a dashboard?"&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The real problem was:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;"How can I design a system that cleanly ingests, processes, and streams multiple real‑time data feeds to clients?"&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That changed everything.&lt;/p&gt;




&lt;h2&gt;
  
  
  The first architecture I didn’t want
&lt;/h2&gt;

&lt;p&gt;The most obvious way to build this would have been something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Frontend
 ├── calls weather API
 ├── calls news API
 ├── calls stocks API
 └── calls crypto API
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This works for a small demo.&lt;/p&gt;

&lt;p&gt;But I expected that approach to become painful pretty quickly.&lt;/p&gt;
&lt;h3&gt;
  
  
  Why I avoided that approach
&lt;/h3&gt;

&lt;p&gt;Because the frontend would slowly become responsible for things it should &lt;strong&gt;not&lt;/strong&gt; own:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;request orchestration
&lt;/li&gt;
&lt;li&gt;retries
&lt;/li&gt;
&lt;li&gt;service-specific logic
&lt;/li&gt;
&lt;li&gt;failure handling
&lt;/li&gt;
&lt;li&gt;data normalization
&lt;/li&gt;
&lt;li&gt;caching decisions
&lt;/li&gt;
&lt;li&gt;reconnect behavior
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s how “simple dashboards” become messy.&lt;/p&gt;

&lt;p&gt;And honestly, this was one of the main design principles I kept repeating to myself while building this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;I wanted the frontend to stay thin and focused.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Meaning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;render data
&lt;/li&gt;
&lt;li&gt;send user intent
&lt;/li&gt;
&lt;li&gt;keep a real-time connection alive
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s it.&lt;/p&gt;


&lt;h2&gt;
  
  
  The architecture I settled on
&lt;/h2&gt;

&lt;p&gt;I ended up designing Flux around &lt;strong&gt;3 main layers&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Frontend (Web UI)
 │
 ▼
BFF (Socket.IO + Cache + Kafka coordination)
 │
 ▼
Domain Services (Kafka-based)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This became the &lt;strong&gt;core mental model&lt;/strong&gt; of the whole project.&lt;/p&gt;

&lt;p&gt;Each layer has &lt;strong&gt;one job&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;And once I locked that in, the rest of the system became much easier to reason about.&lt;/p&gt;


&lt;h2&gt;
  
  
  1) Frontend: thin, reactive, and intentionally limited
&lt;/h2&gt;

&lt;p&gt;The frontend in Flux is intentionally &lt;strong&gt;thin and focused&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That was a &lt;strong&gt;design choice&lt;/strong&gt;, not a shortcut.&lt;/p&gt;

&lt;p&gt;Its job is to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;open a persistent &lt;strong&gt;Socket.IO&lt;/strong&gt; connection
&lt;/li&gt;
&lt;li&gt;send user context (like location)
&lt;/li&gt;
&lt;li&gt;subscribe to real-time updates
&lt;/li&gt;
&lt;li&gt;render whatever arrives
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Its job is &lt;strong&gt;not&lt;/strong&gt; to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;talk to Kafka
&lt;/li&gt;
&lt;li&gt;call backend services directly
&lt;/li&gt;
&lt;li&gt;aggregate data
&lt;/li&gt;
&lt;li&gt;implement retry policies
&lt;/li&gt;
&lt;li&gt;decide caching behavior
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That separation made the frontend much cleaner.&lt;/p&gt;
&lt;h3&gt;
  
  
  Why this mattered
&lt;/h3&gt;

&lt;p&gt;Because when frontend code starts knowing too much about backend infrastructure, everything becomes harder:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;harder to debug
&lt;/li&gt;
&lt;li&gt;harder to test
&lt;/li&gt;
&lt;li&gt;harder to scale
&lt;/li&gt;
&lt;li&gt;harder to change later
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So in Flux, the frontend only talks to &lt;strong&gt;one thing&lt;/strong&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;the BFF&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That one decision removed a lot of future complexity.&lt;/p&gt;


&lt;h2&gt;
  
  
  2) Why I used a BFF instead of exposing services directly
&lt;/h2&gt;

&lt;p&gt;This was probably the &lt;strong&gt;most important architecture decision&lt;/strong&gt; in the whole project.&lt;/p&gt;

&lt;p&gt;I introduced a &lt;strong&gt;Backend-for-Frontend (BFF)&lt;/strong&gt; layer between the UI and the backend services.&lt;/p&gt;
&lt;h3&gt;
  
  
  What the BFF does
&lt;/h3&gt;

&lt;p&gt;The BFF is responsible for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;maintaining client &lt;strong&gt;Socket.IO&lt;/strong&gt; connections
&lt;/li&gt;
&lt;li&gt;receiving events from backend services
&lt;/li&gt;
&lt;li&gt;hydrating reconnecting clients quickly
&lt;/li&gt;
&lt;li&gt;deciding what data to fan out to which users
&lt;/li&gt;
&lt;li&gt;acting as the &lt;strong&gt;real-time gateway&lt;/strong&gt; of the system
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So instead of this:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Frontend → many services
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;I made it:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Frontend → BFF → services
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;On paper that sounds small.&lt;/p&gt;

&lt;p&gt;In practice, it changed a lot.&lt;/p&gt;
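&lt;p&gt;That single boundary is easy to sketch. Here’s a minimal, illustrative version of the fan-out side; the event shape and the &lt;code&gt;domain:update&lt;/code&gt; event name are my assumptions for this sketch, not the actual Flux contract:&lt;/p&gt;

```javascript
// Minimal sketch of the BFF boundary (illustrative, not the real Flux code):
// domain services push events in on one side, and the BFF is the only thing
// that talks to clients on the other.
function makeBff(io) {
  return {
    // Called for every event the BFF receives from a backend service.
    handleDomainEvent(event) {
      // Clients subscribed to this domain get exactly one kind of message.
      io.to(event.domain).emit(event.domain + ':update', event.payload);
    },
  };
}
```

&lt;p&gt;The point is the shape: the frontend never sees Kafka or the services, only this one emit surface.&lt;/p&gt;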
&lt;h3&gt;
  
  
  Why I liked this model
&lt;/h3&gt;

&lt;p&gt;Because the frontend now has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;one connection model&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;one integration boundary&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;one real-time contract&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And the backend can evolve without breaking the UI every time.&lt;/p&gt;

&lt;p&gt;That gave me a much better separation between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;presentation concerns&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;system concerns&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Which is exactly what I wanted from the start.&lt;/p&gt;


&lt;h2&gt;
  
  
  3) Why I used Kafka in the middle
&lt;/h2&gt;

&lt;p&gt;Once I knew I wanted multiple real-time domains, I also knew I didn’t want everything &lt;strong&gt;tightly coupled&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If weather updates, crypto updates, stock updates, and news updates all depend directly on each other — or on one giant central service — that becomes painful fast.&lt;/p&gt;

&lt;p&gt;So I used &lt;strong&gt;Kafka&lt;/strong&gt; as the backbone.&lt;/p&gt;
&lt;h3&gt;
  
  
  What Kafka gave me
&lt;/h3&gt;

&lt;p&gt;Kafka helped me design the system around &lt;strong&gt;events&lt;/strong&gt;, not direct service-to-service coupling.&lt;/p&gt;

&lt;p&gt;That gave me a few nice properties:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;services can evolve independently
&lt;/li&gt;
&lt;li&gt;producers and consumers don’t need to know too much about each other
&lt;/li&gt;
&lt;li&gt;scaling one domain doesn’t force scaling everything
&lt;/li&gt;
&lt;li&gt;the architecture feels much closer to real production systems
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That was important to me.&lt;/p&gt;

&lt;p&gt;Because I didn’t want Flux to be a project optimized more for presentation than for system trade-offs.&lt;/p&gt;

&lt;p&gt;I wanted it to feel like something that was designed with &lt;strong&gt;actual backend trade-offs&lt;/strong&gt; in mind.&lt;/p&gt;
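&lt;p&gt;To make the decoupling concrete, here’s a rough sketch of routing domain topics through one dispatch table. The topic names are assumptions for illustration, not the real Flux topics:&lt;/p&gt;

```javascript
// Sketch of the loose coupling Kafka enables: each domain owns a topic, and
// the BFF consumes them all through one dispatch table.
const topicHandlers = {
  'weather.updates': (msg) => ({ domain: 'weather', payload: msg }),
  'news.updates': (msg) => ({ domain: 'news', payload: msg }),
  'stocks.updates': (msg) => ({ domain: 'stocks', payload: msg }),
  'crypto.updates': (msg) => ({ domain: 'crypto', payload: msg }),
};

function dispatch(topic, message) {
  const handler = topicHandlers[topic];
  // An unknown topic is ignored rather than treated as fatal, so adding a
  // new domain service never breaks existing consumers.
  if (!handler) return null;
  return handler(message);
}
```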


&lt;h2&gt;
  
  
  4) Why I chose Socket.IO for real-time delivery
&lt;/h2&gt;

&lt;p&gt;For the client-facing real-time layer, I chose &lt;strong&gt;Socket.IO&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;And yes — I know raw WebSockets are the lower-level answer on paper.&lt;/p&gt;

&lt;p&gt;But for this project, I cared more about &lt;strong&gt;reliability&lt;/strong&gt; and &lt;strong&gt;developer ergonomics&lt;/strong&gt; than sounding low-level.&lt;/p&gt;
&lt;h3&gt;
  
  
  Why Socket.IO made sense here
&lt;/h3&gt;

&lt;p&gt;It gave me:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;automatic reconnection
&lt;/li&gt;
&lt;li&gt;fallback transport support
&lt;/li&gt;
&lt;li&gt;room-based fan-out
&lt;/li&gt;
&lt;li&gt;simpler event semantics
&lt;/li&gt;
&lt;li&gt;less boilerplate for real-time client communication
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That mattered because Flux is not just:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;“send one stream to one client”&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It’s a &lt;strong&gt;multi-stream dashboard&lt;/strong&gt; with different categories of data and different update patterns.&lt;/p&gt;

&lt;p&gt;So having a stable, practical abstraction here was worth it.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Sometimes “more production-realistic” is not about choosing the lowest-level primitive.&lt;br&gt;&lt;br&gt;
Sometimes it’s about choosing the thing you can operate more reliably.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  5) The problem I ran into: reconnects and hydration
&lt;/h2&gt;

&lt;p&gt;This is where the architecture got more interesting.&lt;/p&gt;

&lt;p&gt;Real-time apps are not just about &lt;strong&gt;live streaming&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;They’re also about:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;“What happens when a user reconnects?”&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That question forced me to think beyond just pushing events.&lt;/p&gt;

&lt;p&gt;Because if a user refreshes the page or reconnects after a network blip, I don’t want them staring at an empty dashboard waiting for all streams to naturally update again.&lt;/p&gt;

&lt;p&gt;That creates a poor reconnect experience.&lt;/p&gt;

&lt;p&gt;So I split the system mentally into &lt;strong&gt;two kinds of data&lt;/strong&gt;:&lt;/p&gt;


&lt;h3&gt;
  
  
  A) Snapshot data
&lt;/h3&gt;

&lt;p&gt;Data that should be shown &lt;strong&gt;immediately on reconnect&lt;/strong&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Examples:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;top news
&lt;/li&gt;
&lt;li&gt;weather snapshot
&lt;/li&gt;
&lt;li&gt;top crypto coins
&lt;/li&gt;
&lt;li&gt;stock summaries
&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;
  
  
  B) Stream data
&lt;/h3&gt;

&lt;p&gt;Data that should continue flowing &lt;strong&gt;live&lt;/strong&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Examples:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;ticker updates
&lt;/li&gt;
&lt;li&gt;incremental changes
&lt;/li&gt;
&lt;li&gt;fast-moving live events
&lt;/li&gt;
&lt;/ul&gt;



&lt;p&gt;That separation ended up being very useful.&lt;/p&gt;

&lt;p&gt;Because it let me design &lt;strong&gt;hydration&lt;/strong&gt; and &lt;strong&gt;streaming&lt;/strong&gt; differently instead of pretending all real-time data behaves the same way.&lt;/p&gt;

&lt;p&gt;And honestly, that was one of the cleanest architecture decisions in the project.&lt;/p&gt;
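&lt;p&gt;In code, the split can be as simple as two state transitions. This is an illustrative sketch, with field names made up for the example:&lt;/p&gt;

```javascript
// Snapshot-vs-stream split: a snapshot replaces everything and is what a
// reconnecting client gets first; stream events only append afterwards.
function applySnapshot(state, snapshot) {
  return { snapshot: snapshot, events: [] };
}

function applyStreamEvent(state, event) {
  // Stream data never blanks the dashboard; it builds on the snapshot.
  return { snapshot: state.snapshot, events: state.events.concat([event]) };
}
```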


&lt;h2&gt;
  
  
  6) Why I added cache, but made it optional
&lt;/h2&gt;

&lt;p&gt;Once I started thinking about reconnect hydration, cache became the obvious next step.&lt;/p&gt;

&lt;p&gt;But I also didn’t want to build a system that completely dies if cache is unavailable.&lt;/p&gt;

&lt;p&gt;So I used &lt;strong&gt;Valkey&lt;/strong&gt; (an open-source fork of Redis) as an &lt;strong&gt;optional accelerator&lt;/strong&gt;, not as a hard dependency.&lt;/p&gt;

&lt;p&gt;That distinction mattered a lot.&lt;/p&gt;
&lt;h3&gt;
  
  
  Why “optional cache” was important
&lt;/h3&gt;

&lt;p&gt;Cache is amazing for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fast reconnect hydration
&lt;/li&gt;
&lt;li&gt;reducing repeated work
&lt;/li&gt;
&lt;li&gt;serving recent snapshot data quickly
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But I didn’t want Flux to become:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;“works only if every dependency is healthy”&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So I designed it with this mindset:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;if cache is available → great, faster experience&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;if cache is unavailable → system should still keep working&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s a small detail, but it changes how resilient the system feels.&lt;/p&gt;

&lt;p&gt;And personally, I’ve started appreciating this design style a lot more:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Acceleration should not become fragility.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
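&lt;p&gt;Here’s roughly what that mindset looks like in code. This is a simplified, synchronous sketch where the cache client stands in for the real (async) Valkey client:&lt;/p&gt;

```javascript
// "Acceleration should not become fragility" as code: try the cache first,
// but treat a cache failure exactly like a cache miss.
function hydrateSnapshot(cache, key, fetchFresh) {
  try {
    const cached = cache.get(key);
    if (cached) return { data: cached, source: 'cache' };
  } catch (err) {
    // Cache being down must not take the feature down with it.
  }
  return { data: fetchFresh(), source: 'origin' };
}
```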


&lt;h2&gt;
  
  
  7) One of the subtle problems: selective fan-out
&lt;/h2&gt;

&lt;p&gt;Once I had a BFF pushing real-time data, I ran into another question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;“Should every client receive every event?”&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In practice, the answer is &lt;strong&gt;no&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That would:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;waste bandwidth
&lt;/li&gt;
&lt;li&gt;add unnecessary frontend filtering
&lt;/li&gt;
&lt;li&gt;generally be inefficient&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So I used &lt;strong&gt;Socket.IO rooms&lt;/strong&gt; to scope event delivery.&lt;/p&gt;

&lt;p&gt;That meant I could think in terms like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;weather by city
&lt;/li&gt;
&lt;li&gt;global news stream
&lt;/li&gt;
&lt;li&gt;crypto stream
&lt;/li&gt;
&lt;li&gt;stock stream
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This helped keep the fan-out more intentional instead of just:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;“broadcast everything and let the client figure it out”&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That’s one of those things that sounds small when you say it in one sentence, but it makes the architecture much cleaner.&lt;/p&gt;
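&lt;p&gt;The scoping itself can be sketched as a small mapping from a client subscription to rooms. The naming scheme below is an assumption for illustration:&lt;/p&gt;

```javascript
// Room-scoped delivery: the client declares interest, and the BFF joins it
// only to matching rooms instead of broadcasting everything to everyone.
function roomsForSubscription(sub) {
  const rooms = [];
  if (sub.city) rooms.push('weather:' + sub.city.toLowerCase());
  if (sub.news) rooms.push('news:global');
  if (sub.stocks) rooms.push('stocks:stream');
  if (sub.crypto) rooms.push('crypto:stream');
  return rooms;
}
```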


&lt;h2&gt;
  
  
  8) One frontend decision I’m glad I didn’t stay stubborn about
&lt;/h2&gt;

&lt;p&gt;Initially, I was trying to keep the frontend very clean:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;hooks handle data and subscriptions
&lt;/li&gt;
&lt;li&gt;components just render UI
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And for the most part, that worked well.&lt;/p&gt;

&lt;p&gt;Each domain had its own hook:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;weather
&lt;/li&gt;
&lt;li&gt;news
&lt;/li&gt;
&lt;li&gt;stocks
&lt;/li&gt;
&lt;li&gt;crypto
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So the general rule was simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Hooks deal with real-time logic. Components stay simple.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That kept things pretty clean.&lt;/p&gt;


&lt;h3&gt;
  
  
  But then I ran into a small UX problem
&lt;/h3&gt;

&lt;p&gt;In the &lt;strong&gt;Crypto&lt;/strong&gt; card, I had two tabs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Top Movers&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Live Ticker&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And every time I switched between them, I didn’t like what I was seeing.&lt;/p&gt;

&lt;p&gt;Some data in the Crypto card would reload again, and state didn’t feel stable across tab switches.&lt;/p&gt;

&lt;p&gt;The UI didn’t feel as smooth as I wanted.&lt;/p&gt;

&lt;p&gt;Nothing was actually broken.&lt;/p&gt;

&lt;p&gt;It just didn’t feel good.&lt;/p&gt;

&lt;p&gt;And I’ve started trusting that feeling more while building projects.&lt;/p&gt;

&lt;p&gt;Because a lot of times, architecture looks clean in code but feels annoying in the actual product.&lt;/p&gt;


&lt;h3&gt;
  
  
  So this is where I bent my own rule a bit
&lt;/h3&gt;

&lt;p&gt;Instead of forcing everything through local hook state, I used &lt;strong&gt;Redux selectively&lt;/strong&gt; for the crypto section.&lt;/p&gt;

&lt;p&gt;Not across the whole app.&lt;/p&gt;

&lt;p&gt;Just where it actually helped.&lt;/p&gt;

&lt;p&gt;Mainly to keep things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ticker data
&lt;/li&gt;
&lt;li&gt;top coins data
&lt;/li&gt;
&lt;li&gt;price-related state
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;stable across tab switches.&lt;/p&gt;

&lt;p&gt;The pattern that felt right was:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;hooks handle subscriptions and incoming socket events
&lt;/li&gt;
&lt;li&gt;Redux keeps shared UI state stable where needed
&lt;/li&gt;
&lt;li&gt;components just read and render
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That ended up feeling much better.&lt;/p&gt;
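&lt;p&gt;As a sketch, the crypto slice can be a single plain reducer. The action names and state shape here are illustrative, not the actual Flux store:&lt;/p&gt;

```javascript
// "Redux selectively": crypto state lives in one plain reducer, so switching
// tabs only flips a UI flag and never resets the data behind it.
const initialCryptoState = { ticker: {}, topMovers: [], activeTab: 'movers' };

function cryptoReducer(state, action) {
  switch (action.type) {
    case 'crypto/tickerUpdate':
      return { ...state, ticker: { ...state.ticker, ...action.payload } };
    case 'crypto/setTopMovers':
      return { ...state, topMovers: action.payload };
    case 'crypto/switchTab':
      // Only the tab flag changes; ticker and movers survive the switch.
      return { ...state, activeTab: action.payload };
    default:
      return state;
  }
}
```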
&lt;h3&gt;
  
  
  Why I’m glad I did this
&lt;/h3&gt;

&lt;p&gt;Because this was one of those cases where:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;being too “architecturally pure” would have made the user experience worse.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And honestly, I’d rather have a slightly more practical architecture than a “perfectly clean” one that adds unnecessary friction to the user experience.&lt;/p&gt;

&lt;p&gt;That small change made the Crypto card feel way smoother.&lt;/p&gt;

&lt;p&gt;And I think that’s a useful reminder in general:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Sometimes the right architecture decision is just the one that makes the product feel better.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  9) Failure isolation was not an afterthought
&lt;/h2&gt;

&lt;p&gt;One thing I really wanted in Flux was this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;If one stream fails, the dashboard should still feel alive.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So I designed the UI and backend with &lt;strong&gt;failure isolation&lt;/strong&gt; in mind.&lt;/p&gt;

&lt;p&gt;Meaning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;if weather is delayed → crypto still updates
&lt;/li&gt;
&lt;li&gt;if news fails → stocks still render
&lt;/li&gt;
&lt;li&gt;if one service lags → the whole app shouldn’t feel dead
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That sounds like a UX decision, but it’s actually an architecture decision too.&lt;/p&gt;

&lt;p&gt;Because if your system shape forces everything to depend on everything else, then partial failure becomes full failure.&lt;/p&gt;

&lt;p&gt;And I wanted to avoid that.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;A real-time system should degrade, not collapse.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
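&lt;p&gt;As a sketch, that isolation can be expressed as a per-domain update that never wipes the other slices. The shape below is illustrative:&lt;/p&gt;

```javascript
// Failure isolation: each domain updates its own slice of dashboard state,
// and a failure only marks that slice stale instead of wiping the view.
function applyDomainResult(dashboard, domain, result) {
  const previous = dashboard[domain] || { data: null, stale: false };
  const slice = result.ok
    ? { data: result.data, stale: false }
    : { data: previous.data, stale: true }; // keep the last good data
  return { ...dashboard, [domain]: slice };
}
```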


&lt;h2&gt;
  
  
  10) I intentionally did not chase “perfect distributed systems purity”
&lt;/h2&gt;

&lt;p&gt;This was a very conscious choice.&lt;/p&gt;

&lt;p&gt;Because once you start building event-driven systems, it’s very easy to go down the rabbit hole of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;exactly-once everything
&lt;/li&gt;
&lt;li&gt;over-engineered delivery guarantees
&lt;/li&gt;
&lt;li&gt;too many abstractions too early
&lt;/li&gt;
&lt;li&gt;adding architectural complexity that doesn’t meaningfully improve the system&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I tried hard not to do that.&lt;/p&gt;

&lt;p&gt;So Flux is opinionated in a practical way.&lt;/p&gt;

&lt;p&gt;I optimized for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;clarity
&lt;/li&gt;
&lt;li&gt;resilience
&lt;/li&gt;
&lt;li&gt;clean boundaries
&lt;/li&gt;
&lt;li&gt;realistic trade-offs
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not &lt;strong&gt;unnecessary complexity&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That was important to me because I wanted this project to reflect how I actually think as an engineer:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;I like systems that are thoughtful, not just complicated.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  Final architecture summary
&lt;/h2&gt;

&lt;p&gt;If I had to describe Flux’s architecture in one sentence, I’d say:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;It’s a real-time dashboard designed like a small event-driven platform, not like a frontend project with extra backend code attached.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That’s the difference I cared about.&lt;/p&gt;
&lt;h3&gt;
  
  
  The main ideas behind the design were:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;keep the frontend thin
&lt;/li&gt;
&lt;li&gt;centralize real-time delivery in a BFF
&lt;/li&gt;
&lt;li&gt;use Kafka for loose coupling
&lt;/li&gt;
&lt;li&gt;use Socket.IO for practical real-time delivery
&lt;/li&gt;
&lt;li&gt;separate snapshot hydration from live streams
&lt;/li&gt;
&lt;li&gt;use cache as an accelerator, not a crutch
&lt;/li&gt;
&lt;li&gt;isolate failures so the whole app doesn’t feel broken
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And honestly, designing those boundaries was way more interesting than building the UI itself.&lt;/p&gt;


&lt;h2&gt;
  
  
  What I learned from building this
&lt;/h2&gt;

&lt;p&gt;If I had to compress the biggest lesson into one line:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Most architecture problems become easier once each layer has one clear responsibility.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A lot of messy systems are messy because &lt;strong&gt;responsibilities are blurry&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Flux became much easier to build once I stopped asking:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;“How do I make everything talk to everything?”&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;…and started asking:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;“What is each layer allowed to know?”&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That single shift made the architecture cleaner.&lt;/p&gt;


&lt;h2&gt;
  
  
  If you’re building a real-time dashboard too
&lt;/h2&gt;

&lt;p&gt;A few things I’d strongly recommend:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Don’t let the frontend talk to too many things directly.
&lt;/li&gt;
&lt;li&gt;Decide early what is &lt;strong&gt;snapshot data&lt;/strong&gt; vs &lt;strong&gt;stream data&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Design for reconnects, not just the initial page load.&lt;/li&gt;
&lt;li&gt;Think about partial failure early.&lt;/li&gt;
&lt;li&gt;Keep boundaries boring and explicit.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That alone will save you a lot of pain later.&lt;/p&gt;


&lt;h2&gt;
  
  
  Final thoughts
&lt;/h2&gt;

&lt;p&gt;If you’ve built something similar — or if you would have designed this differently — I’d genuinely love to hear your approach.&lt;/p&gt;

&lt;p&gt;I always find architecture discussions more useful when they’re about trade-offs, not just best practices.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disclosure&lt;/strong&gt;: This post was proofread with AI assistance. The core ideas, architecture decisions, and final content are my own.&lt;/p&gt;




&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/kaustubh-26" rel="noopener noreferrer"&gt;
        kaustubh-26
      &lt;/a&gt; / &lt;a href="https://github.com/kaustubh-26/flux-platform" rel="noopener noreferrer"&gt;
        flux-platform
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      An event-driven real-time data platform
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Flux Platform&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Flux&lt;/strong&gt; — &lt;em&gt;An event-driven real-time data platform&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a rel="noopener noreferrer" href="https://github.com/kaustubh-26/flux-platform/actions/workflows/ci.yml/badge.svg"&gt;&lt;img src="https://github.com/kaustubh-26/flux-platform/actions/workflows/ci.yml/badge.svg" alt="Build"&gt;&lt;/a&gt;
&lt;a rel="noopener noreferrer" href="https://github.com/kaustubh-26/flux-platform/actions/workflows/tests.yml/badge.svg"&gt;&lt;img src="https://github.com/kaustubh-26/flux-platform/actions/workflows/tests.yml/badge.svg" alt="Tests"&gt;&lt;/a&gt;
&lt;a rel="noopener noreferrer" href="https://github.com/kaustubh-26/flux-platform/actions/workflows/deploy.yml/badge.svg"&gt;&lt;img src="https://github.com/kaustubh-26/flux-platform/actions/workflows/deploy.yml/badge.svg" alt="Deploy"&gt;&lt;/a&gt;
&lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/18748f0cf63c6d7ac501797ffb84fe43d21e10cd03ad5e030c06afef51ff906f/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f6c6963656e73652f6b617573747562682d32362f666c75782d706c6174666f726d"&gt;&lt;img src="https://camo.githubusercontent.com/18748f0cf63c6d7ac501797ffb84fe43d21e10cd03ad5e030c06afef51ff906f/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f6c6963656e73652f6b617573747562682d32362f666c75782d706c6174666f726d" alt="License"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Live Demo:&lt;/strong&gt; &lt;a href="https://flux.kaustubhalandkar.workers.dev" rel="nofollow noopener noreferrer"&gt;https://flux.kaustubhalandkar.workers.dev&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;A &lt;strong&gt;production-style, event-driven real-time data platform&lt;/strong&gt; showcasing modern &lt;strong&gt;Kafka-based streaming&lt;/strong&gt;, &lt;strong&gt;WebSocket fan-out&lt;/strong&gt;, and a clean &lt;strong&gt;Backend-for-Frontend (BFF)&lt;/strong&gt; architecture.&lt;/p&gt;
&lt;p&gt;This repository is intentionally built as a &lt;strong&gt;portfolio-grade, open-source system&lt;/strong&gt; that mirrors how real-world, &lt;strong&gt;streaming data platforms&lt;/strong&gt; are designed, operated, and documented.&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;🖥️ Live Dashboard Preview&lt;/h2&gt;
&lt;/div&gt;

&lt;p&gt;
  &lt;a rel="noopener noreferrer" href="https://github.com/kaustubh-26/flux-platform/docs/screenshots/dashboard-desktop.png"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fkaustubh-26%2Fflux-platform%2FHEAD%2Fdocs%2Fscreenshots%2Fdashboard-desktop.png" width="900"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;
  &lt;a rel="noopener noreferrer" href="https://github.com/kaustubh-26/flux-platform/docs/screenshots/dashboard-crypto.png"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fkaustubh-26%2Fflux-platform%2FHEAD%2Fdocs%2Fscreenshots%2Fdashboard-crypto.png" width="800"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;📱 Mobile View&lt;/h3&gt;

&lt;/div&gt;

&lt;p&gt;
  &lt;a rel="noopener noreferrer" href="https://github.com/kaustubh-26/flux-platform/docs/screenshots/mobile-dashboard.png"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fkaustubh-26%2Fflux-platform%2FHEAD%2Fdocs%2Fscreenshots%2Fmobile-dashboard.png" width="350"&gt;&lt;/a&gt;
&lt;/p&gt;




&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;✨ What This Project Demonstrates&lt;/h2&gt;

&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;Event-driven microservices using &lt;strong&gt;Kafka&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;A resilient &lt;strong&gt;BFF layer&lt;/strong&gt; for real-time fan-out&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Socket.IO&lt;/strong&gt;–based client streaming&lt;/li&gt;
&lt;li&gt;Cache-accelerated hydration with graceful degradation&lt;/li&gt;
&lt;li&gt;Idempotency &amp;amp; deduplication patterns&lt;/li&gt;
&lt;li&gt;Structured logging &amp;amp; observability&lt;/li&gt;
&lt;li&gt;Self-healing Kafka connectivity&lt;/li&gt;
&lt;li&gt;Clean, scalable monorepo organization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The project emphasizes real-world system design concerns such as &lt;strong&gt;event-driven communication, failure handling, and scalability&lt;/strong&gt;.&lt;/p&gt;




&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;🧠 High-Level Architecture&lt;/h2&gt;

&lt;/div&gt;

&lt;div class="snippet-clipboard-content notranslate position-relative overflow-auto"&gt;
&lt;pre class="notranslate"&gt;&lt;code&gt;┌──────────────────────┐
│        Client        │
│  Browser / Mobile    │
└─────────▲────────────┘
          │ Socket.IO (real-time)
┌─────────┴──────────────┐
│   Backend-for-Frontend │
│          (BFF)         │
│  - Socket.IO Server    │
│  - Kafka Producer      │&lt;/code&gt;&lt;/pre&gt;…&lt;/div&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/kaustubh-26/flux-platform" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;






</description>
      <category>architecture</category>
      <category>backend</category>
      <category>webdev</category>
      <category>kafka</category>
    </item>
  </channel>
</rss>
