<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kaustubh Alandkar</title>
    <description>The latest articles on DEV Community by Kaustubh Alandkar (@kaustubhalandkar).</description>
    <link>https://dev.to/kaustubhalandkar</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1705656%2F2d3f3348-fe64-42a8-97ea-9435762ff4ed.png</url>
      <title>DEV Community: Kaustubh Alandkar</title>
      <link>https://dev.to/kaustubhalandkar</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kaustubhalandkar"/>
    <language>en</language>
    <item>
      <title>How I Built an MQTT Ingest Core with FastAPI</title>
      <dc:creator>Kaustubh Alandkar</dc:creator>
      <pubDate>Mon, 13 Apr 2026 16:38:01 +0000</pubDate>
      <link>https://dev.to/kaustubhalandkar/how-i-built-an-mqtt-ingest-core-with-fastapi-50dg</link>
      <guid>https://dev.to/kaustubhalandkar/how-i-built-an-mqtt-ingest-core-with-fastapi-50dg</guid>
      <description>&lt;p&gt;Most telemetry pipelines look clean in architecture diagrams.&lt;/p&gt;

&lt;p&gt;Devices publish data.&lt;br&gt;&lt;br&gt;
A collector receives it.&lt;br&gt;&lt;br&gt;
A backend stores it.&lt;br&gt;&lt;br&gt;
Dashboards read from it.&lt;/p&gt;

&lt;p&gt;Simple.&lt;/p&gt;

&lt;p&gt;Until you have to decide what to do with imperfect data.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;records arrive incomplete&lt;/li&gt;
&lt;li&gt;payloads vary by topic&lt;/li&gt;
&lt;li&gt;timestamps are missing or inconsistent&lt;/li&gt;
&lt;li&gt;some data is usable, but not trustworthy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At that point, the problem is no longer ingestion.&lt;/p&gt;

&lt;p&gt;It’s &lt;strong&gt;data correctness at the system boundary&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That’s what I built this project to handle.&lt;/p&gt;


&lt;h2&gt;
  
  
  What This Project Actually Is
&lt;/h2&gt;

&lt;p&gt;This service acts as a &lt;strong&gt;data-quality boundary&lt;/strong&gt; between raw MQTT ingestion and downstream systems.&lt;/p&gt;

&lt;p&gt;It does not subscribe to MQTT directly.&lt;/p&gt;

&lt;p&gt;Instead, it receives batched telemetry records and is responsible for turning them into &lt;strong&gt;structured, queryable system state&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Typical topics look like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;sites&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;inverters&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;strings&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;weather&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;grid&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For each record, it decides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is it structurally valid?&lt;/li&gt;
&lt;li&gt;What fields are missing?&lt;/li&gt;
&lt;li&gt;How should it be normalized?&lt;/li&gt;
&lt;li&gt;Should it be accepted or rejected?&lt;/li&gt;
&lt;li&gt;What aggregate state should be updated?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal is not just to store data.&lt;/p&gt;

&lt;p&gt;The goal is to ensure that &lt;strong&gt;everything stored is explicitly understood&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;At a high level, the flow looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MQTT Collector / Buffer
   ↓
FastAPI Ingest Core
   ↓
Validation
   ↓
Normalization
   ↓
MongoDB
   ├── normalized_records
   ├── invalid_records
   └── topic_aggregates
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This separation is intentional.&lt;/p&gt;

&lt;p&gt;The upstream system handles delivery reliability.&lt;/p&gt;

&lt;p&gt;This service handles data correctness.&lt;/p&gt;

&lt;p&gt;Mixing the two usually makes both harder to reason about.&lt;/p&gt;

&lt;p&gt;That boundary matters.&lt;/p&gt;

&lt;p&gt;Because raw device or telemetry data is rarely clean enough to use directly.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why I Didn't Want "Just a Raw Ingest Endpoint"
&lt;/h2&gt;

&lt;p&gt;A very common approach looks like this:&lt;/p&gt;

&lt;p&gt;POST /ingest&lt;br&gt;
→ accept payload&lt;br&gt;
→ store raw document&lt;br&gt;
→ defer cleanup downstream&lt;/p&gt;

&lt;p&gt;This works early on.&lt;/p&gt;

&lt;p&gt;But over time, it pushes data quality problems into every downstream system.&lt;/p&gt;

&lt;p&gt;Each consumer ends up re-implementing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;validation&lt;/li&gt;
&lt;li&gt;normalization&lt;/li&gt;
&lt;li&gt;fallback logic&lt;/li&gt;
&lt;li&gt;schema assumptions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That duplication is where systems start drifting apart.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the Service Does
&lt;/h2&gt;

&lt;p&gt;The FastAPI service exposes a few core endpoints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;POST /auth&lt;/li&gt;
&lt;li&gt;POST /ingest&lt;/li&gt;
&lt;li&gt;GET /topics/summary&lt;/li&gt;
&lt;li&gt;GET /invalid/recent&lt;/li&gt;
&lt;li&gt;GET /health&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Only &lt;code&gt;/health&lt;/code&gt; and &lt;code&gt;/auth&lt;/code&gt; are public. Everything else is JWT-protected.&lt;/p&gt;

&lt;p&gt;That was intentional.&lt;/p&gt;

&lt;p&gt;Once a service becomes the data-quality boundary, it should control:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;who can write
&lt;/li&gt;
&lt;li&gt;what gets accepted
&lt;/li&gt;
&lt;li&gt;how that data is shaped
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is about protecting the integrity of the data entering the system.&lt;/p&gt;




&lt;h2&gt;
  
  
  Design Decisions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Keeping Auth Out of the Ingest Path
&lt;/h3&gt;

&lt;p&gt;The first decision was simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Ingestion should be authenticated, but not complicated.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So I added a small &lt;code&gt;/auth&lt;/code&gt; endpoint that validates a configured client ID + secret and issues a JWT.&lt;/p&gt;

&lt;p&gt;That token is then required for ingestion and read endpoints.&lt;/p&gt;

&lt;p&gt;This gave me a few useful properties:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Collector services can authenticate once and reuse a token&lt;/li&gt;
&lt;li&gt;Ingest logic stays separate from credential validation&lt;/li&gt;
&lt;li&gt;Downstream protected endpoints remain simple&lt;/li&gt;
&lt;li&gt;The service can reject unauthenticated writes before touching data&lt;/li&gt;
&lt;/ul&gt;
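
&lt;p&gt;A minimal sketch of that issuance flow (illustrative only: the credential values, the HMAC-signed token format, and helper names like &lt;code&gt;issue_token&lt;/code&gt; are stand-ins, and a real deployment would use a proper JWT library):&lt;/p&gt;

```python
import base64, hashlib, hmac, json, time

# Hypothetical stand-ins for the configured client credentials.
CLIENT_ID = "collector-1"
CLIENT_SECRET = "change-me"
SIGNING_KEY = b"server-side-signing-key"

def issue_token(client_id, client_secret, ttl_seconds=3600):
    """Validate the configured client ID and secret, then mint a signed token."""
    ok_id = hmac.compare_digest(client_id, CLIENT_ID)
    ok_secret = hmac.compare_digest(client_secret, CLIENT_SECRET)
    if not (ok_id and ok_secret):
        return None
    claims = {"sub": client_id, "exp": time.time() + ttl_seconds}
    body = base64.urlsafe_b64encode(json.dumps(claims).encode())
    sig = hmac.new(SIGNING_KEY, body, hashlib.sha256).hexdigest()
    return body.decode() + "." + sig

def verify_token(token):
    """Check the signature; return the claims dict, or None if invalid."""
    body, _, sig = token.partition(".")
    expected = hmac.new(SIGNING_KEY, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None
    return json.loads(base64.urlsafe_b64decode(body.encode()))
```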

&lt;p&gt;This isn’t about building a full auth system.&lt;/p&gt;

&lt;p&gt;It’s about ensuring that &lt;strong&gt;ingestion is an explicit contract between services&lt;/strong&gt;, not an open write surface.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: This service is designed to sit behind a buffered MQTT ingestion layer. &lt;br&gt;
The buffer handles retries, batching, and offline durability, while this service &lt;br&gt;
focuses purely on validation, normalization, and storage.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You can think of it as the next step after this layer:&lt;br&gt;
&lt;a href="https://dev.to/kaustubhalandkar/designing-an-offline-resilient-mqtt-buffer-with-sqlite-dj4"&gt;https://dev.to/kaustubhalandkar/designing-an-offline-resilient-mqtt-buffer-with-sqlite-dj4&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  2. Strict Validation Without Throwing Away Useful Data
&lt;/h3&gt;

&lt;p&gt;This was probably the most important design decision in the whole project.&lt;/p&gt;

&lt;p&gt;A telemetry record can be "bad" in multiple ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Unknown topic&lt;/li&gt;
&lt;li&gt;Missing payload entirely&lt;/li&gt;
&lt;li&gt;Incomplete payload&lt;/li&gt;
&lt;li&gt;Missing timestamp&lt;/li&gt;
&lt;li&gt;Partially missing expected fields&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But not all of those failures should be treated the same.&lt;/p&gt;

&lt;p&gt;So I split the logic into two layers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hard validation failures&lt;/strong&gt; — make the record invalid and route it to &lt;code&gt;invalid_records&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Unsupported topic&lt;/li&gt;
&lt;li&gt;Missing payload&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Soft data quality gaps&lt;/strong&gt; — still allow the record to be stored as valid, but with visibility:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Missing expected payload keys&lt;/li&gt;
&lt;li&gt;Fallback timestamp usage&lt;/li&gt;
&lt;li&gt;Null-filled required fields&lt;/li&gt;
&lt;/ul&gt;
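
&lt;p&gt;A compact sketch of that two-layer split (the topic schemas and error codes here are illustrative, not the service's actual definitions):&lt;/p&gt;

```python
# Hypothetical per-topic schemas; the real service defines its own.
EXPECTED_FIELDS = {
    "inverters": ["device_id", "power_kw", "temp_c"],
    "weather": ["irradiance", "ambient_temp"],
}

def validate_record(record):
    """Return (errors, warnings): errors reject the record, warnings annotate it."""
    errors, warnings = [], []
    topic = record.get("topic")
    payload = record.get("payload")
    if topic not in EXPECTED_FIELDS:
        errors.append("UNSUPPORTED_TOPIC")        # hard failure
    if not payload:
        errors.append("MISSING_PAYLOAD")          # hard failure
    elif topic in EXPECTED_FIELDS:
        for field in EXPECTED_FIELDS[topic]:
            if payload.get(field) is None:
                warnings.append("MISSING_FIELD:" + field)  # soft gap
    if record.get("timestamp") is None:
        warnings.append("FALLBACK_TIMESTAMP")     # soft gap
    return errors, warnings
```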

&lt;p&gt;That distinction ended up being extremely useful.&lt;/p&gt;

&lt;p&gt;In real telemetry systems, partial data is often still operationally valuable. &lt;br&gt;
Throwing it away completely is often worse than storing it with clear quality metadata.&lt;/p&gt;

&lt;p&gt;The goal wasn’t strict correctness.&lt;/p&gt;

&lt;p&gt;It was &lt;strong&gt;controlled acceptance&lt;/strong&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. Normalizing Records Into a Consistent Shape
&lt;/h3&gt;

&lt;p&gt;One of the biggest problems with raw telemetry is that even "valid" data is often inconsistent.&lt;/p&gt;

&lt;p&gt;So I normalize each accepted record before storage. That includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Standardizing timestamps&lt;/li&gt;
&lt;li&gt;Filling missing required payload keys with &lt;code&gt;null&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Tracking missing fields explicitly&lt;/li&gt;
&lt;li&gt;Shaping records into a consistent internal model&lt;/li&gt;
&lt;/ul&gt;
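
&lt;p&gt;As a sketch, the normalization step might look like this (the internal model's exact shape and the &lt;code&gt;REQUIRED_KEYS&lt;/code&gt; schema are my stand-ins):&lt;/p&gt;

```python
from datetime import datetime, timezone

REQUIRED_KEYS = {"weather": ["irradiance", "ambient_temp"]}  # hypothetical schema

def normalize_record(topic, payload, received_at=None):
    """Shape a validated record into a consistent internal model."""
    received_at = received_at or datetime.now(timezone.utc)
    expected = REQUIRED_KEYS.get(topic, [])
    missing = [k for k in expected if payload.get(k) is None]
    # Fill missing required keys with None so every document has the same shape.
    shaped = {k: payload.get(k) for k in expected}
    shaped.update({k: v for k, v in payload.items() if k not in expected})
    return {
        "topic": topic,
        "payload": shaped,
        "missing_fields": missing,   # missingness is tracked, not hidden
        "received_at": received_at.isoformat(),
    }
```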

&lt;p&gt;This means downstream consumers don’t have to guess whether a field was absent, renamed, or formatted differently by source.&lt;/p&gt;

&lt;p&gt;The ingest core makes that decision once.&lt;/p&gt;

&lt;p&gt;This shifts complexity away from every consumer and into a single, well-defined boundary.&lt;/p&gt;

&lt;p&gt;Which is exactly where it belongs.&lt;/p&gt;




&lt;h3&gt;
  
  
  4. Timestamps Needed Clear Rules
&lt;/h3&gt;

&lt;p&gt;At first, timestamp handling felt like a detail. It wasn't.&lt;/p&gt;

&lt;p&gt;Records may arrive with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;timestamp&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;timestamp inside the payload (e.g. &lt;code&gt;payload.ts&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;missing timestamp entirely&lt;/li&gt;
&lt;li&gt;malformed values&lt;/li&gt;
&lt;li&gt;naive timestamps with no timezone&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So I made the normalization path explicit:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use &lt;code&gt;timestamp&lt;/code&gt; if valid&lt;/li&gt;
&lt;li&gt;Otherwise try a timestamp inside the payload (e.g. &lt;code&gt;payload.ts&lt;/code&gt;)
&lt;/li&gt;
&lt;li&gt;Otherwise fall back to &lt;code&gt;received_at&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Treat naive timestamps as UTC&lt;/li&gt;
&lt;/ol&gt;
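
&lt;p&gt;That fallback chain can be sketched directly (&lt;code&gt;payload.ts&lt;/code&gt; comes from the example above; the helper names are mine):&lt;/p&gt;

```python
from datetime import datetime, timezone

def parse_ts(value):
    """Parse an ISO-8601 string, treating naive values as UTC."""
    try:
        dt = datetime.fromisoformat(value)
    except (TypeError, ValueError):
        return None
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)   # rule 4: naive means UTC
    return dt

def resolve_event_time(record, received_at):
    """Apply the fallback chain: timestamp, then payload.ts, then received_at."""
    for candidate in (record.get("timestamp"), record.get("payload", {}).get("ts")):
        dt = parse_ts(candidate)
        if dt is not None:
            return dt
    return received_at                          # rule 3: last resort
```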

&lt;p&gt;Time flows into everything later — ordering, summaries, freshness, debugging.&lt;/p&gt;

&lt;p&gt;A bad timestamp policy quietly poisons all of it.&lt;/p&gt;

&lt;p&gt;So this rule now lives in one place instead of being reinterpreted across services.&lt;/p&gt;

&lt;p&gt;I treated timestamp normalization as a core responsibility.&lt;/p&gt;




&lt;h3&gt;
  
  
  5. Flexible Storage, Strict Ingest Rules
&lt;/h3&gt;

&lt;p&gt;This service stores three different kinds of data:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Collection&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;normalized_records&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Accepted records after validation + normalization&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;invalid_records&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Rejected records with explicit error codes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;topic_aggregates&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Per-topic summary state: count, missing field count, last event and received timestamps&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That combination made MongoDB a natural fit.&lt;/p&gt;

&lt;p&gt;The ingest rules are strict.&lt;/p&gt;

&lt;p&gt;But the shape of incoming data varies across topics.&lt;/p&gt;

&lt;p&gt;I wanted:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Flexible document storage&lt;/li&gt;
&lt;li&gt;Simple inserts&lt;/li&gt;
&lt;li&gt;Topic-based querying&lt;/li&gt;
&lt;li&gt;Aggregate updates without over-modeling too early&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So MongoDB became the persistence layer, while the FastAPI service became the place where structure is enforced.&lt;/p&gt;




&lt;h3&gt;
  
  
  6. Invalid Data Should Be Queryable, Not Just Logged
&lt;/h3&gt;

&lt;p&gt;A lot of ingestion systems do this:&lt;/p&gt;

&lt;p&gt;bad input → log error → drop it&lt;/p&gt;

&lt;p&gt;That sounds fine until someone asks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What's failing most often?&lt;/li&gt;
&lt;li&gt;Which topic is malformed?&lt;/li&gt;
&lt;li&gt;Are devices sending incomplete data?&lt;/li&gt;
&lt;li&gt;Are we rejecting too aggressively?&lt;/li&gt;
&lt;li&gt;Did a deployment break payload shape expectations?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If invalid records only exist in logs, answering those questions becomes annoying very quickly.&lt;/p&gt;

&lt;p&gt;So invalid records are stored intentionally with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Topic&lt;/li&gt;
&lt;li&gt;Payload&lt;/li&gt;
&lt;li&gt;Received time&lt;/li&gt;
&lt;li&gt;Error list&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That gives the system a &lt;strong&gt;memory of failure&lt;/strong&gt; instead of just a momentary complaint.&lt;/p&gt;

&lt;p&gt;And operationally, that's much more useful.&lt;/p&gt;

&lt;p&gt;Invalid data is not noise.&lt;/p&gt;

&lt;p&gt;It’s feedback from the system.&lt;/p&gt;




&lt;h3&gt;
  
  
  7. Raw Counts Weren’t Enough — So I Added Topic-Level Aggregates
&lt;/h3&gt;

&lt;p&gt;Raw ingestion tells you volume.&lt;/p&gt;

&lt;p&gt;It doesn’t tell you whether your system is healthy.&lt;/p&gt;

&lt;p&gt;So I built topic-level aggregate updates that track:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Record count per topic&lt;/li&gt;
&lt;li&gt;Missing field count&lt;/li&gt;
&lt;li&gt;Latest event time&lt;/li&gt;
&lt;li&gt;Latest received time&lt;/li&gt;
&lt;/ul&gt;
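
&lt;p&gt;With MongoDB, those four fields map onto a single upsert. Here is an illustrative sketch of the update document (the operators are standard MongoDB syntax; the field names are stand-ins):&lt;/p&gt;

```python
def build_aggregate_update(topic, missing_count, event_time, received_time):
    """Build the filter and update documents for a per-topic upsert."""
    filter_doc = {"topic": topic}
    update_doc = {
        "$inc": {
            "record_count": 1,
            "missing_field_count": missing_count,
        },
        # $max keeps only the most recent timestamps without a read-modify-write.
        "$max": {
            "last_event_at": event_time,
            "last_received_at": received_time,
        },
    }
    return filter_doc, update_doc

# In the real service this would be applied with something like:
#   collection.update_one(filter_doc, update_doc, upsert=True)
```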

&lt;p&gt;That gives the system a lightweight operational view without needing a full analytics layer yet.&lt;/p&gt;

&lt;p&gt;It's not a dashboard product. &lt;/p&gt;

&lt;p&gt;But it creates the kind of summary surface area you actually need once ingestion is running continuously.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where the Actual Friction Was
&lt;/h2&gt;

&lt;p&gt;The API layer itself was straightforward.&lt;/p&gt;

&lt;p&gt;The hard part was deciding how the system should behave under imperfect input.&lt;/p&gt;

&lt;p&gt;The code isn't huge. But the subtle parts were much more interesting than the "API" part.&lt;/p&gt;

&lt;h3&gt;
  
  
  Defining What "Valid Enough" Means Is Harder Than It Sounds
&lt;/h3&gt;

&lt;p&gt;One of the trickiest design questions was: &lt;em&gt;when should a record be rejected vs accepted with missing fields?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That's not a purely technical decision. It's a system behavior decision.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reject too aggressively&lt;/strong&gt; → you lose useful operational data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accept too loosely&lt;/strong&gt; → you pollute downstream trust&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So I ended up treating &lt;strong&gt;structural integrity as mandatory&lt;/strong&gt;, and &lt;strong&gt;field completeness as observable but tolerable&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That balance felt much more realistic than pretending telemetry is always complete.&lt;/p&gt;

&lt;h3&gt;
  
  
  Topic Schemas Are Useful — But They Create Ownership
&lt;/h3&gt;

&lt;p&gt;Each topic has its own expected payload shape:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Topic&lt;/th&gt;
&lt;th&gt;Scope&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;sites&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Plant-level metrics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;inverters&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Inverter telemetry&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;strings&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;String-level electrical details&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;weather&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Irradiance and atmospheric fields&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;grid&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Power and grid metrics&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That means the ingest core now owns a kind of schema contract. Which is good.&lt;/p&gt;

&lt;p&gt;But it also means adding a new topic is a &lt;strong&gt;deliberate system change&lt;/strong&gt;, not just "new data showing up."&lt;/p&gt;

&lt;p&gt;And honestly, I think that's the right trade-off.&lt;/p&gt;

&lt;h3&gt;
  
  
  "Store Everything" Sounds Simple Until Queryability Matters
&lt;/h3&gt;

&lt;p&gt;Raw ingestion and useful ingestion are not the same thing.&lt;/p&gt;

&lt;p&gt;If you just store incoming payloads blindly, you'll probably feel productive for a while. But later you'll want to ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which records were incomplete?&lt;/li&gt;
&lt;li&gt;Which topic is most degraded?&lt;/li&gt;
&lt;li&gt;What's the latest valid data per category?&lt;/li&gt;
&lt;li&gt;How many records are structurally invalid?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those questions only become answerable if the ingest layer stores &lt;strong&gt;intent&lt;/strong&gt;, not just bytes.&lt;/p&gt;

&lt;p&gt;That's why normalization, invalid storage, and aggregate tracking ended up mattering more than the endpoint itself.&lt;/p&gt;

&lt;h3&gt;
  
  
  Error Handling Is Part of the Data Model
&lt;/h3&gt;

&lt;p&gt;One thing this project reinforced for me:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Bad data is still data.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;An invalid telemetry record is not "nothing." It's a signal — of schema drift, upstream bugs, partial device failure, rollout mistakes, or simply incomplete operational conditions.&lt;/p&gt;

&lt;p&gt;That's why I increasingly think of error handling as part of the &lt;strong&gt;data model&lt;/strong&gt;, not just exception handling.&lt;/p&gt;

&lt;p&gt;Once I started thinking that way, the service design got a lot cleaner.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Shifted in How I Think
&lt;/h2&gt;

&lt;p&gt;Before building this, I thought of ingestion as a transport problem.&lt;/p&gt;

&lt;p&gt;Now I think of it as a &lt;strong&gt;trust boundary problem&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That boundary decides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What gets accepted&lt;/li&gt;
&lt;li&gt;What gets normalized&lt;/li&gt;
&lt;li&gt;What gets rejected&lt;/li&gt;
&lt;li&gt;What becomes queryable system state&lt;/li&gt;
&lt;li&gt;What quality guarantees downstream code can rely on&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's a much more important role than "just receive and store."&lt;/p&gt;

&lt;p&gt;One thing that became clear:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In practice, keeping partially correct data is often better than forcing everything to look clean.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Real systems don't always give you perfect input. Sometimes the right move is not to reject everything imperfect. Sometimes it's to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Accept what is structurally usable&lt;/li&gt;
&lt;li&gt;Preserve missingness explicitly&lt;/li&gt;
&lt;li&gt;Separate invalid data cleanly&lt;/li&gt;
&lt;li&gt;Make the trade-offs visible&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That felt like the right design for this kind of ingestion boundary.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Takeaway
&lt;/h2&gt;

&lt;p&gt;I didn’t build an ingestion API.&lt;/p&gt;

&lt;p&gt;I built a boundary that decides what data becomes part of the system.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A FastAPI service that turns raw telemetry into something downstream systems can reason about.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Not by overengineering it. Just by being very explicit about authentication, validation, normalization, invalid record handling, and operational visibility.&lt;/p&gt;

&lt;p&gt;And honestly, this is where systems tend to either stay clean…&lt;/p&gt;

&lt;p&gt;or become harder to reason about over time.&lt;/p&gt;




&lt;h2&gt;
  
  
  If You've Built Something Similar
&lt;/h2&gt;

&lt;p&gt;I'd genuinely be curious:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Do you reject incomplete telemetry, or store it with quality metadata?&lt;/li&gt;
&lt;li&gt;Do you treat invalid records as operational artifacts or just log noise?&lt;/li&gt;
&lt;li&gt;Where do you define your "data trust boundary"?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That seems to be where the real system design decisions start showing up.&lt;/p&gt;

</description>
      <category>fastapi</category>
      <category>systemdesign</category>
      <category>backend</category>
      <category>datavalidation</category>
    </item>
    <item>
      <title>Designing an Offline-Resilient MQTT Buffer with SQLite</title>
      <dc:creator>Kaustubh Alandkar</dc:creator>
      <pubDate>Mon, 06 Apr 2026 17:38:01 +0000</pubDate>
      <link>https://dev.to/kaustubhalandkar/designing-an-offline-resilient-mqtt-buffer-with-sqlite-dj4</link>
      <guid>https://dev.to/kaustubhalandkar/designing-an-offline-resilient-mqtt-buffer-with-sqlite-dj4</guid>
      <description>&lt;h3&gt;
  
  
  The most crucial question for a data collection service over the MQTT (Message Queuing Telemetry Transport) protocol
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;What happens when the downstream service disappears for 20 minutes?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That's the question I had when designing a lightweight client to collect data from devices that communicate over the MQTT protocol.&lt;/p&gt;

&lt;p&gt;At first I thought of forwarding every MQTT message directly to an HTTP API. But I had to account for unreliable networks, a problem common in distributed systems.&lt;/p&gt;

&lt;p&gt;The moment the downstream API becomes slow, times out, or has an auth issue, the whole ingestion path starts inheriting those failures.&lt;/p&gt;

&lt;p&gt;So I had to build a system methodically. &lt;/p&gt;

&lt;p&gt;The goal was simple: keep accepting data even when the rest of the pipeline is having problems.&lt;/p&gt;

&lt;p&gt;I ended up with a lightweight Python service that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;subscribes to MQTT topics&lt;/li&gt;
&lt;li&gt;keeps a local durable buffer in &lt;strong&gt;SQLite&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;forwards records downstream in batches&lt;/li&gt;
&lt;li&gt;retries when delivery fails&lt;/li&gt;
&lt;li&gt;survives process restarts without losing data not yet acknowledged by the downstream API&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Honestly, by the end of it I stopped thinking of this as an MQTT project. It became more of an exercise in &lt;em&gt;where reliability should live&lt;/em&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  How it's structured
&lt;/h2&gt;

&lt;p&gt;The service sits between an MQTT broker and a downstream HTTP ingest endpoint. A simple topology.&lt;/p&gt;

&lt;p&gt;Data flows like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MQTT Broker
   ↓
MQTT Client (on_message)
   ↓
latest_by_topic (in-memory dedup)
   ↓
Flush Worker (interval-based)
   ↓
SQLite — mqtt_buffer
   ↓
Sender Worker (batched + authenticated)
   ↓
Downstream HTTP
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It subscribes to topics like &lt;code&gt;sites&lt;/code&gt;, &lt;code&gt;inverters&lt;/code&gt;, &lt;code&gt;strings&lt;/code&gt;, &lt;code&gt;weather&lt;/code&gt;, and &lt;code&gt;grid&lt;/code&gt;. Under the hood:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;one Python process&lt;/li&gt;
&lt;li&gt;one MQTT client loop&lt;/li&gt;
&lt;li&gt;two worker threads&lt;/li&gt;
&lt;li&gt;one SQLite file as the durable queue&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The three-way split — receive, persist, deliver — is intentional. Each stage failing independently is the whole point.&lt;/p&gt;




&lt;h2&gt;
  
  
  The direct approach — and why I moved away from it
&lt;/h2&gt;

&lt;p&gt;The obvious implementation is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;on_message → send HTTP request immediately
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It works great in demos. Becomes uncomfortable in real environments almost immediately.&lt;/p&gt;

&lt;p&gt;Once you couple ingestion to delivery, your MQTT callback is now implicitly dependent on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;downstream API latency&lt;/li&gt;
&lt;li&gt;downstream availability&lt;/li&gt;
&lt;li&gt;auth/token health&lt;/li&gt;
&lt;li&gt;retry behavior&lt;/li&gt;
&lt;li&gt;network flakiness&lt;/li&gt;
&lt;li&gt;partial delivery handling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Six concerns in one function isn't simplicity; it's just hidden coupling.&lt;/p&gt;

&lt;p&gt;Receiving data and delivering data belong on different sides of a boundary. Mixing them is where fragility starts.&lt;/p&gt;




&lt;h2&gt;
  
  
  Design decisions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. SQLite as the durability boundary
&lt;/h3&gt;

&lt;p&gt;This is probably the most important decision in the whole project.&lt;/p&gt;

&lt;p&gt;I wanted "buffered" to mean &lt;em&gt;persisted and recoverable&lt;/em&gt; — not just storing data in-memory.&lt;/p&gt;

&lt;p&gt;So SQLite is not just storage here. It's the line between &lt;em&gt;received&lt;/em&gt; and &lt;em&gt;safe enough to retry&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Each row in &lt;code&gt;mqtt_buffer&lt;/code&gt; carries:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;column&lt;/th&gt;
&lt;th&gt;purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;ordering and dedup&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;topic&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;source topic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ts&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;timestamp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;payload&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;raw message content&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;qos&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;delivery quality level&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;retain&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;MQTT retain flag&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;attempts&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;delivery retry count&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;last_error&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;last failure reason&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
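
&lt;p&gt;In SQLite terms, that schema is roughly the following (the column names come from the table above; the types and defaults are my guesses):&lt;/p&gt;

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS mqtt_buffer (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,  -- ordering and dedup
    topic      TEXT NOT NULL,
    ts         TEXT NOT NULL,
    payload    TEXT NOT NULL,
    qos        INTEGER DEFAULT 0,
    retain     INTEGER DEFAULT 0,
    attempts   INTEGER DEFAULT 0,                  -- delivery retry count
    last_error TEXT
);
"""

def open_buffer(path):
    """Open (and if needed create) the durable buffer database."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```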

&lt;p&gt;Rows are only deleted after the downstream service acknowledges them. That single rule changes the failure model entirely. A failed send doesn't destroy the record — it just changes its lifecycle state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why SQLite and not Redis or Kafka?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Because this service lives at the edge. I needed something embedded, durable, operationally cheap, and easy to inspect over SSH. SQLite is all of those things.&lt;/p&gt;

&lt;p&gt;More importantly, it keeps the deployment footprint at:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python mqtt_to_sqlite.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I wanted to avoid the operational overhead of additional infrastructure components just to run this service.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. Keep the callback thin
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;on_message&lt;/code&gt; path does exactly four things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Parse the message&lt;/li&gt;
&lt;li&gt;Capture MQTT metadata and a timestamp&lt;/li&gt;
&lt;li&gt;Update in-memory state&lt;/li&gt;
&lt;li&gt;Exit&lt;/li&gt;
&lt;/ol&gt;
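
&lt;p&gt;In code, that whole path is only a few lines (a sketch of the shape rather than the project's exact implementation; in the real service this function would be registered as the paho-mqtt &lt;code&gt;on_message&lt;/code&gt; handler):&lt;/p&gt;

```python
import json, threading, time

latest_by_topic = {}                 # in-memory dedup state
state_lock = threading.Lock()

def handle_message(topic, payload_bytes):
    """Parse, stamp, store in memory, exit. Nothing else."""
    try:
        payload = json.loads(payload_bytes)
    except ValueError:
        # Keep unparseable payloads instead of dropping them silently.
        payload = {"raw": payload_bytes.decode("utf-8", "replace")}
    record = {"topic": topic, "ts": time.time(), "payload": payload}
    with state_lock:
        latest_by_topic[topic] = record   # no disk, no HTTP, no auth
```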

&lt;p&gt;No disk writes. No HTTP calls. No auth. No retries.&lt;/p&gt;

&lt;p&gt;That thinness matters more than it looks. Because once the callback starts doing real work, message receipt becomes coupled to delivery health. Which means if downstream is slow, the callback slows down. &lt;/p&gt;

&lt;p&gt;I wanted the opposite: the service should keep accepting MQTT messages even while delivery is completely broken. The callback being fast and simple is what makes that possible.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. Two workers, one job each
&lt;/h3&gt;

&lt;p&gt;Once I separated receipt from delivery, it became natural to split the background work into two threads:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Flush worker&lt;/strong&gt; — reads from the in-memory state and writes batches to SQLite at a configured interval.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sender worker&lt;/strong&gt; — checks downstream health, fetches and caches a JWT, reads rows from SQLite, POSTs batches, deletes acknowledged rows, and records failures with &lt;code&gt;attempts&lt;/code&gt; and &lt;code&gt;last_error&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The failure isolation this gives you is real:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;downstream offline → ingestion still works&lt;/li&gt;
&lt;li&gt;auth broken → buffering still works&lt;/li&gt;
&lt;li&gt;send rate slower than receive rate → SQLite absorbs the mismatch&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One loop trying to do all of that at once doesn't give you any of this.&lt;/p&gt;




&lt;h3&gt;
  
  
  4. Using SQLite as a durable queue
&lt;/h3&gt;

&lt;p&gt;I'm aware SQLite isn't a message broker.&lt;/p&gt;

&lt;p&gt;For this scope, what I needed was: append new records, read oldest rows in order, retry failures, delete only after acknowledgment. SQLite handles all of that just fine if you're disciplined about concurrency.&lt;/p&gt;

&lt;p&gt;I enabled:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;WAL&lt;/code&gt; mode — allows concurrent reads while writes are in progress&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;synchronous=FULL&lt;/code&gt; — no data loss on OS crash&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;busy_timeout&lt;/code&gt; — handles lock contention without erroring immediately&lt;/li&gt;
&lt;/ul&gt;
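
&lt;p&gt;Applied at connection time, that looks roughly like this (the PRAGMAs are standard SQLite; the helper is a stand-in):&lt;/p&gt;

```python
import sqlite3

def connect_durable(path):
    """Open the buffer DB with the concurrency settings described above."""
    conn = sqlite3.connect(path, check_same_thread=False)
    conn.execute("PRAGMA journal_mode=WAL")      # concurrent reads during writes
    conn.execute("PRAGMA synchronous=FULL")      # fsync on every commit
    conn.execute("PRAGMA busy_timeout=5000")     # wait up to 5s on lock contention
    return conn
```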

&lt;p&gt;Those settings matter. The process has one thread writing rows, another reading and deleting them, and shared mutable state between them. Without proper configuration, this is where you get bugs that only appear in production, under load, after three weeks.&lt;/p&gt;




&lt;h3&gt;
  
  
  5. Batched delivery
&lt;/h3&gt;

&lt;p&gt;The sender doesn't push records downstream one at a time on arrival. It reads from SQLite in batches, controlled by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;SEND_BATCH_SIZE&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;SEND_INTERVAL_SECONDS&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
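
&lt;p&gt;The sender's core loop can be sketched against the &lt;code&gt;mqtt_buffer&lt;/code&gt; table (the delivery callback and the default value are stand-ins):&lt;/p&gt;

```python
import sqlite3

SEND_BATCH_SIZE = 100   # illustrative default

def drain_once(conn, send, batch_size=SEND_BATCH_SIZE):
    """Read the oldest rows, attempt one batched send, delete on acknowledgment."""
    rows = conn.execute(
        "SELECT id, topic, payload FROM mqtt_buffer ORDER BY id LIMIT ?",
        (batch_size,),
    ).fetchall()
    if not rows:
        return 0
    ids = [r[0] for r in rows]
    marks = ",".join("?" for _ in ids)
    try:
        send([{"topic": r[1], "payload": r[2]} for r in rows])
    except Exception as exc:
        # Failed batch: keep the rows, record the failure for inspection.
        conn.execute(
            "UPDATE mqtt_buffer SET attempts = attempts + 1, last_error = ? "
            "WHERE id IN (" + marks + ")", [str(exc)] + ids)
        conn.commit()
        return 0
    # Delete only after the downstream acknowledged the batch.
    conn.execute("DELETE FROM mqtt_buffer WHERE id IN (" + marks + ")", ids)
    conn.commit()
    return len(rows)
```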

&lt;p&gt;This gives a few useful properties:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Controlled downstream pressure.&lt;/strong&gt; Every incoming MQTT message doesn't immediately become an outbound HTTP request.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cleaner retry behavior.&lt;/strong&gt; Failures happen at the batch level, not hidden inside a per-message request loop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Easier tuning.&lt;/strong&gt; If downstream can handle more throughput, I change two config values. No ingestion logic changes.&lt;/p&gt;

&lt;p&gt;This system is not ultra real-time. That's a deliberate tradeoff. I'd rather have a controlled delivery loop with predictable behavior than a fast path that breaks in subtle ways under load.&lt;/p&gt;




&lt;h3&gt;
  
  
  6. Auth belongs to delivery
&lt;/h3&gt;

&lt;p&gt;The downstream service requires JWT auth. The sender worker:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fetches a token from &lt;code&gt;AUTH_URL&lt;/code&gt; on startup&lt;/li&gt;
&lt;li&gt;caches it in memory&lt;/li&gt;
&lt;li&gt;attaches it to outbound requests&lt;/li&gt;
&lt;li&gt;invalidates the cache on &lt;code&gt;401&lt;/code&gt; and re-fetches&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This keeps auth scoped to delivery. If the auth service goes down, MQTT messages still get accepted and buffered. Delivery pauses. Ingestion doesn't.&lt;/p&gt;
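&lt;p&gt;A simplified sketch of that token lifecycle (the fetch function stands in for the real POST to &lt;code&gt;AUTH_URL&lt;/code&gt;; names are illustrative):&lt;/p&gt;

```python
import threading

class TokenCache:
    """Fetch-once JWT cache: fetch lazily, invalidate on 401, re-fetch."""
    def __init__(self, fetch_token):
        self._fetch = fetch_token   # stand-in for POST to AUTH_URL
        self._token = None
        self._lock = threading.Lock()

    def get(self):
        with self._lock:
            if self._token is None:
                self._token = self._fetch()
            return self._token

    def invalidate(self):
        with self._lock:
            self._token = None

def send_with_auth(cache, do_request):
    """Attach the cached token; on 401, invalidate and retry once."""
    status = do_request(cache.get())
    if status == 401:
        cache.invalidate()
        status = do_request(cache.get())
    return status
```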




&lt;h3&gt;
  
  
  7. Pre-send reachability check
&lt;/h3&gt;

&lt;p&gt;Before each send attempt, the sender does a lightweight reachability check — either against a configured health endpoint or a TCP probe fallback.&lt;/p&gt;

&lt;p&gt;This is less about sophistication and more about avoiding unnecessary work when downstream is unavailable.&lt;/p&gt;

&lt;p&gt;A quick check before each batch keeps retry behavior quieter and more predictable, which makes the system easier to reason about under failure.&lt;/p&gt;
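&lt;p&gt;The TCP-probe fallback can be sketched in a few lines (the health-endpoint variant would issue an HTTP GET instead; this version only asks whether anything is listening at all):&lt;/p&gt;

```python
import socket

def is_reachable(host, port, timeout=1.0):
    """Cheap pre-send probe: try to open a TCP connection.

    Returns True if something accepts the connection, False on
    refusal or timeout. A positive answer does not prove the
    service is healthy, only that a send attempt is worth making.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```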




&lt;h3&gt;
  
  
  8. At-least-once, stated clearly
&lt;/h3&gt;

&lt;p&gt;This service provides &lt;strong&gt;at-least-once delivery from the local buffer&lt;/strong&gt;. Not exactly-once.&lt;/p&gt;

&lt;p&gt;If a batch is processed downstream but the acknowledgment doesn't arrive cleanly, the sender will retry. That means downstream consumers need to be idempotent.&lt;/p&gt;

&lt;p&gt;That’s a reasonable contract for this layer.&lt;/p&gt;
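&lt;p&gt;On the downstream side, that contract can be honored with simple deduplication. This is an illustrative sketch, not the actual consumer; a real service would key on a stable record ID rather than a content hash:&lt;/p&gt;

```python
import hashlib
import json

class IdempotentConsumer:
    """Downstream side of an at-least-once contract: a redelivered
    batch must not be applied twice. Records are deduplicated here
    by a content hash of the JSON payload.
    """
    def __init__(self):
        self.applied = {}

    def handle(self, record):
        key = hashlib.sha256(
            json.dumps(record, sort_keys=True).encode()
        ).hexdigest()
        if key in self.applied:
            return False   # duplicate delivery: no-op
        self.applied[key] = record
        return True
```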




&lt;h3&gt;
  
  
  9. Operational behavior by design
&lt;/h3&gt;

&lt;p&gt;Even a small service needs to be operationally trustworthy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;logs go to stdout&lt;/li&gt;
&lt;li&gt;failure events (&lt;code&gt;Downstream send failed&lt;/code&gt;, &lt;code&gt;Unauthorized&lt;/code&gt;, &lt;code&gt;MQTT connect failed&lt;/code&gt;) are explicit and visible&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;SIGINT&lt;/code&gt; / &lt;code&gt;SIGTERM&lt;/code&gt; trigger graceful shutdown&lt;/li&gt;
&lt;li&gt;rows not acknowledged before shutdown are retried on next start&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last point matters. If the process stops mid-send, the next start simply retries any rows that were not acknowledged. That makes shutdown behavior predictable and recovery straightforward.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where the friction actually lives
&lt;/h2&gt;

&lt;p&gt;The architecture is not complex. But the subtle edges are very real.&lt;/p&gt;

&lt;h3&gt;
  
  
  What "buffered" actually means
&lt;/h3&gt;

&lt;p&gt;There's a meaningful difference between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;message received&lt;/strong&gt; — it's in memory somewhere&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;message durably buffered&lt;/strong&gt; — it will survive a restart&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One of the first design decisions was defining that boundary clearly: a record is only "safe" once it has been written to SQLite.&lt;/p&gt;




&lt;h3&gt;
  
  
  Threading discipline
&lt;/h3&gt;

&lt;p&gt;With a flush thread writing rows and a sender thread reading and deleting them, ownership of shared state matters. I used explicit locks around:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;database access&lt;/li&gt;
&lt;li&gt;in-memory state snapshots&lt;/li&gt;
&lt;li&gt;saved hashes for deduplication&lt;/li&gt;
&lt;li&gt;the token cache&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once two threads are interacting with the same rows and shared state, small mistakes can turn into failures that are difficult to reproduce and debug.&lt;/p&gt;
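&lt;p&gt;The discipline itself is not complicated. A sketch with one lock guarding the shared connection (names are illustrative; the point is that every touch of shared state goes through the same lock):&lt;/p&gt;

```python
import threading

db_lock = threading.Lock()  # one lock guards the connection and shared state

def flush_rows(conn, rows):
    """Flush thread: append records under the lock."""
    with db_lock:
        conn.executemany("INSERT INTO records (payload) VALUES (?)", rows)
        conn.commit()

def take_batch(conn, n):
    """Sender thread: read the oldest rows under the same lock."""
    with db_lock:
        return conn.execute(
            "SELECT id, payload FROM records ORDER BY id LIMIT ?", (n,)
        ).fetchall()
```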




&lt;h3&gt;
  
  
  Disk is part of the capacity model
&lt;/h3&gt;

&lt;p&gt;If downstream stays offline long enough, the SQLite file keeps growing. That’s not a bug — it’s the service doing exactly what it was designed to do.&lt;/p&gt;

&lt;p&gt;But it does mean disk is part of the capacity model now. &lt;/p&gt;

&lt;p&gt;In practice, that means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;monitor SQLite file size in production&lt;/li&gt;
&lt;li&gt;store the database on persistent writable storage (not an ephemeral container layer)&lt;/li&gt;
&lt;li&gt;alert if buffer growth rate becomes abnormal&lt;/li&gt;
&lt;li&gt;think about retention policy if the domain allows data expiry&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Choosing local disk over message loss only works well if the storage side of the system has been thought through too.&lt;/p&gt;




&lt;h3&gt;
  
  
  When retries get complicated
&lt;/h3&gt;

&lt;p&gt;Retries feel straightforward until the scenarios get real:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What if downstream partially processed the batch before timing out?&lt;/li&gt;
&lt;li&gt;What if auth failed after rows were already selected?&lt;/li&gt;
&lt;li&gt;What if the process restarted between "sent" and "acknowledged"?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is exactly why I didn't try to build stronger delivery guarantees than the system actually has. The sender deletes rows only after acknowledgment. That's the safest default. And it means downstream needs to be idempotent. Which is the right place to put that responsibility.&lt;/p&gt;




&lt;h3&gt;
  
  
  Auth failure is a system concern
&lt;/h3&gt;

&lt;p&gt;If downstream delivery requires a valid JWT, then auth availability is part of the delivery path.&lt;/p&gt;

&lt;p&gt;That leads to a few design questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Should auth failure stop ingestion? → No.&lt;/li&gt;
&lt;li&gt;Should the sender keep retrying with a known-bad token? → No.&lt;/li&gt;
&lt;li&gt;Should token fetch happen per request? → No.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s why the sender caches the token and invalidates it on &lt;code&gt;401&lt;/code&gt;. Auth failure stays scoped to delivery instead of propagating back into ingestion.&lt;/p&gt;




&lt;h3&gt;
  
  
  From local defaults to production
&lt;/h3&gt;

&lt;p&gt;The code works fine locally over plain HTTP and a local auth endpoint.&lt;/p&gt;

&lt;p&gt;That does not mean those defaults should carry into production unchanged.&lt;/p&gt;

&lt;p&gt;Before anything beyond local use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;AUTH_URL&lt;/code&gt; and &lt;code&gt;DOWNSTREAM_URL&lt;/code&gt; should use &lt;strong&gt;HTTPS&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;MQTT should use &lt;strong&gt;TLS&lt;/strong&gt; if the broker is outside a trusted network&lt;/li&gt;
&lt;li&gt;secrets should live in environment variables, not committed config files&lt;/li&gt;
&lt;li&gt;SQLite should live on persistent storage, not an ephemeral container layer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are not large design changes, but they are part of turning a working service into a production-safe one.&lt;/p&gt;




&lt;h2&gt;
  
  
  What shifted in how I think about this kind of work
&lt;/h2&gt;

&lt;p&gt;This project kept reinforcing the same idea:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Reliability is mostly about where you choose to put state.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For this deployment context — edge, single process, operationally minimal — SQLite was the right fit. But the broader lesson goes beyond the tool choice. Sometimes the more useful move is to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;accept the data&lt;/li&gt;
&lt;li&gt;persist it locally&lt;/li&gt;
&lt;li&gt;decouple delivery&lt;/li&gt;
&lt;li&gt;retry predictably&lt;/li&gt;
&lt;li&gt;make the trade-offs explicit&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That may not be a flashy architecture, but it is still architecture.&lt;/p&gt;

&lt;p&gt;I've also started caring more about the distinction between &lt;em&gt;works when healthy&lt;/em&gt; and &lt;em&gt;behaves predictably when unhealthy&lt;/em&gt;. The second one is harder, and usually matters more.&lt;/p&gt;

&lt;p&gt;Most painful bugs don't happen on the happy path. They happen when two otherwise normal systems fail in slightly different ways at the same time.&lt;/p&gt;

&lt;p&gt;That's the kind of behavior I keep trying to design for.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final thought
&lt;/h2&gt;

&lt;p&gt;If I had to compress this project into one sentence:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I didn't just build an MQTT subscriber. I built a small failure-tolerant delivery boundary.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Once one system is producing data and another is consuming it, the real question isn't "can they talk to each other?" It's "what happens when they can't?"&lt;/p&gt;

&lt;p&gt;For this project, the answer was: receive everything, persist locally, deliver in batches, retry predictably, and keep the system boring enough to debug under pressure.&lt;/p&gt;

&lt;p&gt;Not the only valid design, but for an edge ingestion component, it felt like the right one.&lt;/p&gt;




&lt;h2&gt;
  
  
  If you've built something similar
&lt;/h2&gt;

&lt;p&gt;I'm genuinely curious how others have handled this boundary — whether you buffered locally first or sent straight over HTTP, used Redis or Kafka instead of SQLite, treated delivery as best-effort or durable.&lt;/p&gt;

&lt;p&gt;And more specifically: &lt;strong&gt;where did you decide "safe" actually begins?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That’s usually where the real design trade-offs start to show.&lt;/p&gt;

</description>
      <category>mqtt</category>
      <category>python</category>
      <category>sqlite</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>How I Built a High-Throughput Transaction Processor with Kafka, Redis, PostgreSQL, and MongoDB</title>
      <dc:creator>Kaustubh Alandkar</dc:creator>
      <pubDate>Fri, 03 Apr 2026 16:03:34 +0000</pubDate>
      <link>https://dev.to/kaustubhalandkar/how-i-built-a-high-throughput-transaction-processor-with-kafka-redis-postgresql-and-mongodb-58gm</link>
      <guid>https://dev.to/kaustubhalandkar/how-i-built-a-high-throughput-transaction-processor-with-kafka-redis-postgresql-and-mongodb-58gm</guid>
      <description>&lt;p&gt;When I started building this project, I wanted to learn by building something similar to how backend systems in payment processing apps work.&lt;/p&gt;

&lt;p&gt;I wanted to build something that made me think carefully about &lt;strong&gt;throughput, ordering, idempotency, auditability, and failure boundaries&lt;/strong&gt; together.&lt;/p&gt;

&lt;p&gt;That led me to build &lt;strong&gt;HVTP (High Volume Transaction Processor)&lt;/strong&gt; — a portfolio-grade, event-driven pipeline that behaves more like a small transaction backend than a demo app.&lt;/p&gt;

&lt;p&gt;What made this project valuable for me wasn’t just wiring Kafka into a system.&lt;/p&gt;

&lt;p&gt;It was learning how to shape the system so the right work happened in the right place.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the project actually is
&lt;/h2&gt;

&lt;p&gt;At a practical level, HVTP is a &lt;strong&gt;signed transaction ingestion pipeline&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A merchant client sends a transaction request over HTTP. The system validates the request at ingress, accepts it quickly, and then hands it off for asynchronous processing.&lt;/p&gt;

&lt;p&gt;From there, the system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;validates and processes the transaction&lt;/li&gt;
&lt;li&gt;enforces idempotency&lt;/li&gt;
&lt;li&gt;persists ledger state&lt;/li&gt;
&lt;li&gt;stores immutable audit events&lt;/li&gt;
&lt;li&gt;exposes a status API&lt;/li&gt;
&lt;li&gt;supports reconciliation between stores&lt;/li&gt;
&lt;li&gt;emits terminal outcomes through downstream flows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The stack looks like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Kafka&lt;/strong&gt; for event flow&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Valkey (open-source Redis fork)&lt;/strong&gt; for idempotency and some read-path control&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PostgreSQL&lt;/strong&gt; for the ledger / queryable durable state&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MongoDB&lt;/strong&gt; for immutable audit events&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spring Boot&lt;/strong&gt; services split by responsibility&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;k6&lt;/strong&gt; for ingress load testing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This project is not about reproducing a regulated payments platform.&lt;/p&gt;

&lt;p&gt;It is about building a system shape where correctness, isolation of responsibility, and observable behavior matter.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why I kept the request path small
&lt;/h2&gt;

&lt;p&gt;One option was to do everything in the request path:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Receive HTTP request&lt;/li&gt;
&lt;li&gt;Validate everything in the same service&lt;/li&gt;
&lt;li&gt;Write directly to PostgreSQL&lt;/li&gt;
&lt;li&gt;Also write to MongoDB&lt;/li&gt;
&lt;li&gt;Return success&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That would have been simpler to build at first.&lt;/p&gt;

&lt;p&gt;But for this project, I wanted to separate request acceptance from downstream processing. I wanted the ingress layer to stay focused on validating, accepting, and handing work off quickly, instead of taking on ledger writes, audit writes, and every other downstream concern synchronously.&lt;/p&gt;

&lt;p&gt;That decision shaped the rest of the architecture.&lt;/p&gt;




&lt;h2&gt;
  
  
  The architecture I ended up with
&lt;/h2&gt;

&lt;p&gt;I split the write path into a small event-driven pipeline:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa3l4pw20dy60ojsz416w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa3l4pw20dy60ojsz416w.png" alt="Architecture Image" width="800" height="566"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That split gave each service one main responsibility:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;api-service&lt;/code&gt; → signed ingress + fast acceptance&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;processor-service&lt;/code&gt; → validation + idempotency&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ledger-writer-service&lt;/code&gt; → durable ledger persistence&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;audit-service&lt;/code&gt; → immutable audit history&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What I liked about this structure was that each boundary had a clear reason to exist.&lt;/p&gt;




&lt;h2&gt;
  
  
  Main architecture and design decisions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1) Keep the API fast
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;api-service&lt;/code&gt; accepts the request and returns &lt;code&gt;202 Accepted&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I did this to keep the HTTP layer an intake boundary rather than a full transaction processor.&lt;/p&gt;

&lt;p&gt;In HVTP, the ingress path is intentionally limited to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;validate request shape&lt;/li&gt;
&lt;li&gt;verify signature&lt;/li&gt;
&lt;li&gt;publish to Kafka&lt;/li&gt;
&lt;li&gt;return acceptance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That means the API is not waiting on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Idempotency checks&lt;/li&gt;
&lt;li&gt;PostgreSQL ledger persistence&lt;/li&gt;
&lt;li&gt;MongoDB audit writes&lt;/li&gt;
&lt;li&gt;downstream webhook behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This was one of the most important decisions in the project because it kept the front door responsive even when downstream work had different timing characteristics.&lt;/p&gt;
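&lt;p&gt;To make the shape of that ingress path concrete, here is a hedged Python sketch (the actual services are Spring Boot; the HMAC scheme, the field names, and the &lt;code&gt;publish&lt;/code&gt; hand-off are illustrative stand-ins, not HVTP's real contract):&lt;/p&gt;

```python
import hashlib
import hmac
import json

SHARED_SECRET = b"demo-secret"  # illustrative; real keys come from config

def verify_and_accept(body, signature, publish):
    """Ingress path only: validate shape, verify signature, hand off, ack.

    No ledger or audit writes happen here; `publish` stands in for
    the Kafka producer call.
    """
    expected = hmac.new(SHARED_SECRET, body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, signature):
        return 401
    try:
        request = json.loads(body)
    except ValueError:
        return 400
    if "accountId" not in request or "amount" not in request:
        return 400                 # request-shape validation
    publish(request)
    return 202                     # accepted, not completed
```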

&lt;h3&gt;
  
  
  2) Use Kafka for decoupling
&lt;/h3&gt;

&lt;p&gt;I used Kafka because I wanted request acceptance, transaction processing, ledger persistence, and audit persistence to move at different speeds without being tightly bound to one another.&lt;/p&gt;

&lt;p&gt;HVTP currently uses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;transaction_requests&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;transaction_log&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;dead-letter topics for failure paths&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That gave me a few concrete benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the API can accept requests without waiting for downstream writes&lt;/li&gt;
&lt;li&gt;the ledger writer and audit service can scale independently&lt;/li&gt;
&lt;li&gt;replay becomes possible&lt;/li&gt;
&lt;li&gt;failure handling becomes clearer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I also used &lt;code&gt;accountId&lt;/code&gt; as the key for the main topics.&lt;/p&gt;

&lt;p&gt;That was deliberate.&lt;/p&gt;

&lt;p&gt;For this project, the ordering boundary I cared about was not global ordering across every transaction.&lt;/p&gt;

&lt;p&gt;It was preserving ordering for transactions belonging to the same account.&lt;/p&gt;
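&lt;p&gt;The property this buys can be shown with a small sketch. Kafka's default partitioner hashes the key (murmur2) to pick a partition; any stable hash illustrates the same guarantee: one account maps to one partition, so per-partition ordering becomes per-account ordering.&lt;/p&gt;

```python
import hashlib

def partition_for(account_id, num_partitions):
    """Keyed partitioning: a stable hash of the key means every message
    for one account lands on the same partition. (Illustrative; Kafka's
    default partitioner uses murmur2, not MD5, but the property is the same.)
    """
    digest = hashlib.md5(account_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions
```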

&lt;h3&gt;
  
  
  3) Treat idempotency as a correctness concern
&lt;/h3&gt;

&lt;p&gt;The processor has to tolerate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;retries&lt;/li&gt;
&lt;li&gt;duplicate submissions&lt;/li&gt;
&lt;li&gt;consumer reprocessing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To make processing safe under all three, I used an idempotency key.&lt;/p&gt;

&lt;p&gt;Each request sent from the client includes an &lt;code&gt;Idempotency-Key&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Without it, processing the same request twice could result in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Duplicate ledger updates&lt;/li&gt;
&lt;li&gt;Duplicate audit events&lt;/li&gt;
&lt;li&gt;Inconsistent downstream outcomes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I used &lt;strong&gt;Valkey&lt;/strong&gt; (open-source Redis fork) to store and check this idempotency key in the processor service.&lt;/p&gt;

&lt;p&gt;One of the most useful mindset shifts from this project was moving from:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“How do I process this request?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;to:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“What must remain true even if this request appears more than once?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That question improved the architecture more than any individual framework decision.&lt;/p&gt;
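&lt;p&gt;The check itself is small. In Valkey it is essentially a single &lt;code&gt;SET key value NX EX ttl&lt;/code&gt;; the sketch below uses an in-memory stand-in so the shape is visible without a running server:&lt;/p&gt;

```python
import time

class IdempotencyStore:
    """In-memory stand-in for the Valkey check (SET key NX EX ttl).
    The first claim of a key wins; later claims within the TTL are duplicates.
    """
    def __init__(self):
        self._keys = {}

    def claim(self, key, ttl_seconds):
        now = time.monotonic()
        expires = self._keys.get(key)
        if expires is not None and expires > now:
            return False           # duplicate: already processed or in flight
        self._keys[key] = now + ttl_seconds
        return True

def process(store, request):
    """Processor-side guard: claim the key before doing any writes."""
    if not store.claim(request["idempotencyKey"], ttl_seconds=3600):
        return "DUPLICATE"
    # ... ledger write, audit event, etc. happen exactly once here ...
    return "PROCESSED"
```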

&lt;h3&gt;
  
  
  4) Let PostgreSQL and MongoDB do different jobs
&lt;/h3&gt;

&lt;p&gt;I used two stores intentionally because the write patterns and query needs are different.&lt;/p&gt;

&lt;h4&gt;
  
  
  PostgreSQL is the ledger
&lt;/h4&gt;

&lt;p&gt;PostgreSQL stores the durable transaction state that the system can query through the status path.&lt;/p&gt;

&lt;p&gt;It holds the queryable record of a transaction in a ledger-style structure, including fields like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;transaction_id&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;idempotency_key&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;merchant_id&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;account_id&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;amount&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;currency&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;type&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;status&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;processed_at&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the durable store for the transaction state I want to query directly.&lt;/p&gt;

&lt;h4&gt;
  
  
  MongoDB is the audit trail
&lt;/h4&gt;

&lt;p&gt;MongoDB stores immutable audit events, including values such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;transaction IDs&lt;/li&gt;
&lt;li&gt;merchant/account IDs&lt;/li&gt;
&lt;li&gt;correlation IDs&lt;/li&gt;
&lt;li&gt;statuses&lt;/li&gt;
&lt;li&gt;source topic&lt;/li&gt;
&lt;li&gt;timestamps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These stores answer different questions.&lt;/p&gt;

&lt;p&gt;The ledger answers:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“What is the durable transaction state?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The audit store answers:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“What happened around this transaction over time?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Separating those concerns made the model cleaner and easier to reason about.&lt;/p&gt;

&lt;h3&gt;
  
  
  5) Design for replay and reconciliation
&lt;/h3&gt;

&lt;p&gt;The ledger writer and audit service consume from the same event stream, but they write to different storage systems.&lt;/p&gt;

&lt;p&gt;That means there is always some possibility of drift, timing gaps, or mismatched writes across stores.&lt;/p&gt;

&lt;p&gt;So I added reconciliation support.&lt;/p&gt;

&lt;p&gt;The project includes a reconciliation model that compares recent ledger and audit state and records summary runs like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;audit count&lt;/li&gt;
&lt;li&gt;ledger count&lt;/li&gt;
&lt;li&gt;missing in ledger&lt;/li&gt;
&lt;li&gt;run status&lt;/li&gt;
&lt;li&gt;notes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I also wanted replay support to exist in the architecture before it became necessary.&lt;/p&gt;

&lt;p&gt;That decision made the system feel more operationally realistic.&lt;/p&gt;

&lt;p&gt;It shifted the design from “write to multiple places” toward “write, verify, and recover.”&lt;/p&gt;
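&lt;p&gt;At its core, a reconciliation run is a set difference over a window. A hedged sketch (the field names mirror the summary fields listed above; the real model reads recent rows from both stores rather than taking ID lists as input):&lt;/p&gt;

```python
def reconcile(audit_ids, ledger_ids):
    """Compare transaction IDs seen by the audit store and the ledger
    over the same window, and summarize the run.
    """
    missing_in_ledger = sorted(set(audit_ids) - set(ledger_ids))
    return {
        "audit_count": len(set(audit_ids)),
        "ledger_count": len(set(ledger_ids)),
        "missing_in_ledger": missing_in_ledger,
        "status": "OK" if not missing_in_ledger else "DRIFT",
    }
```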

&lt;h3&gt;
  
  
  6) Measure ingress behavior under overload
&lt;/h3&gt;

&lt;p&gt;I also ran k6 load tests against the signed transaction ingestion endpoint at multiple offered rates, including 50K RPS and 100K RPS.&lt;/p&gt;

&lt;p&gt;The purpose was not to describe the whole system as completing transactions at those rates end to end.&lt;/p&gt;

&lt;p&gt;The goal was more specific:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“How does the ingress layer behave when offered far more traffic than the machine can sustain?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That framing was important to me because it matched what I was actually measuring.&lt;/p&gt;

&lt;h4&gt;
  
  
  What the numbers showed
&lt;/h4&gt;

&lt;p&gt;In local testing on a single machine, the API maintained:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;0% HTTP failure rate&lt;/li&gt;
&lt;li&gt;100% &lt;code&gt;202 Accepted&lt;/code&gt; for completed HTTP requests&lt;/li&gt;
&lt;li&gt;accepted ingress throughput that leveled off around 3.1K–3.2K req/sec&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A few highlights:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;At 1K offered RPS, it handled 60,001 accepted requests in 60s&lt;/li&gt;
&lt;li&gt;At 50K offered RPS, accepted throughput peaked at about 3,172.5 req/sec&lt;/li&gt;
&lt;li&gt;At 100K offered RPS, it still completed 189,936 accepted requests in 60s&lt;/li&gt;
&lt;li&gt;P95/P99 latency increased under overload, but the HTTP layer remained responsive&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What I liked about that result was not the raw offered rate, but the saturation behavior.&lt;/p&gt;

&lt;p&gt;The ingress layer stayed usable, throughput leveled off in a predictable way, and latency rose before failure.&lt;/p&gt;

&lt;p&gt;That is a useful property in an asynchronous system.&lt;/p&gt;

&lt;h4&gt;
  
  
  The important caveat
&lt;/h4&gt;

&lt;p&gt;These are HTTP ingress acceptance results, not end-to-end transaction completion metrics.&lt;/p&gt;

&lt;p&gt;So the correct interpretation is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the API accepted the requests&lt;/li&gt;
&lt;li&gt;downstream completion happens asynchronously&lt;/li&gt;
&lt;li&gt;the numbers describe front-door behavior, not full workflow completion&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For this project, that was the honest and useful performance story to tell.&lt;/p&gt;




&lt;h2&gt;
  
  
  Real implementation friction and subtle problems
&lt;/h2&gt;

&lt;p&gt;The architecture diagram is the clean version.&lt;/p&gt;

&lt;p&gt;Implementation is where the edge cases become visible.&lt;/p&gt;

&lt;h3&gt;
  
  
  1) &lt;code&gt;202 Accepted&lt;/code&gt; creates a visibility obligation
&lt;/h3&gt;

&lt;p&gt;Returning &lt;code&gt;202 Accepted&lt;/code&gt; simplified the ingress path, but it also meant the system needed to answer follow-up questions such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;did it persist?&lt;/li&gt;
&lt;li&gt;did it fail?&lt;/li&gt;
&lt;li&gt;was it rejected?&lt;/li&gt;
&lt;li&gt;is it still in-flight?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is why HVTP includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a status endpoint&lt;/li&gt;
&lt;li&gt;correlation IDs&lt;/li&gt;
&lt;li&gt;downstream event flow for terminal outcomes and tracing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Moving work out of the synchronous path reduced coupling, but it also increased the need for visibility.&lt;/p&gt;
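&lt;p&gt;The visibility pattern itself is small: assign a correlation ID at ingress, record a pending state, and let the status endpoint answer the follow-up questions. An illustrative sketch (names and states are assumptions, not HVTP's actual API):&lt;/p&gt;

```python
import uuid

def accept_request(payload, registry):
    """Ingress: assign a correlation ID, record PENDING, return a 202-style ack."""
    correlation_id = str(uuid.uuid4())
    registry[correlation_id] = "PENDING"
    # ... publish (payload, correlation_id) to Kafka here ...
    return {"status": 202, "correlationId": correlation_id}

def get_status(correlation_id, registry):
    """Status endpoint: answer the follow-up question a 202 creates."""
    return registry.get(correlation_id, "UNKNOWN")
```

&lt;p&gt;Downstream consumers flip the recorded state to a terminal outcome once processing finishes.&lt;/p&gt;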

&lt;h3&gt;
  
  
  2) Ordering had to be defined carefully
&lt;/h3&gt;

&lt;p&gt;Early on, I had to be specific about what “ordering” meant in this system.&lt;/p&gt;

&lt;p&gt;For HVTP, global ordering across all transactions was not the target.&lt;/p&gt;

&lt;p&gt;Per-account ordering was the meaningful boundary.&lt;/p&gt;

&lt;p&gt;That is why Kafka messages are keyed by &lt;code&gt;accountId&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;It gives the ordering guarantee I actually needed without forcing all traffic through one serialized path.&lt;/p&gt;

&lt;h3&gt;
  
  
  3) Multi-store systems introduce operational edges
&lt;/h3&gt;

&lt;p&gt;Using PostgreSQL for ledger state and MongoDB for audit events was the right choice for this project.&lt;/p&gt;

&lt;p&gt;It also meant I had to care whether both stores continued to reflect the same logical transaction stream.&lt;/p&gt;

&lt;p&gt;That is why reconciliation became part of the design rather than an afterthought.&lt;/p&gt;

&lt;p&gt;There was also a useful implementation lesson here: the Mongo mapping used for reconciliation has to stay aligned with the collection the audit service is actually writing to.&lt;/p&gt;

&lt;p&gt;That kind of mismatch does not always fail loudly.&lt;/p&gt;

&lt;p&gt;It can quietly reduce trust in operational checks.&lt;/p&gt;

&lt;h3&gt;
  
  
  4) Performance framing matters
&lt;/h3&gt;

&lt;p&gt;Once I added the higher offered-RPS tests, I spent time thinking about how to describe the results precisely.&lt;/p&gt;

&lt;p&gt;The more useful framing was not a headline number.&lt;/p&gt;

&lt;p&gt;It was explaining what the tests actually demonstrated:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the ingress layer remains stable under overload&lt;/li&gt;
&lt;li&gt;throughput saturates at a predictable point&lt;/li&gt;
&lt;li&gt;latency rises as load increases&lt;/li&gt;
&lt;li&gt;the asynchronous boundary protects the front door on this hardware&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That framing is more useful because it stays aligned with what the measurements actually represent.&lt;/p&gt;




&lt;h2&gt;
  
  
  What changed in how I think
&lt;/h2&gt;

&lt;p&gt;Before building this, I mostly thought about high throughput as a performance problem.&lt;/p&gt;

&lt;p&gt;After building it, I think about it much more as a boundary design problem.&lt;/p&gt;

&lt;p&gt;The question that stayed with me was not:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“How fast can one service go?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It was:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Where should work happen, where should it not happen, and what must remain true when parts of the system are delayed, retried, duplicated, or partially broken?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That shift changed how I think about backend systems.&lt;/p&gt;

&lt;p&gt;A few things became much clearer to me:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Async systems need strong visibility&lt;/li&gt;
&lt;li&gt;Idempotency is part of the design, not just an implementation detail&lt;/li&gt;
&lt;li&gt;Storage choices should follow write semantics&lt;/li&gt;
&lt;li&gt;Graceful saturation is a useful success condition&lt;/li&gt;
&lt;li&gt;Good architecture is often about clean responsibility boundaries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One practical lesson from this project was that precision matters.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;202 Accepted&lt;/code&gt; should mean something specific.&lt;br&gt;
A benchmark should measure something specific.&lt;br&gt;
And each service should have a clearly defined responsibility.&lt;/p&gt;

&lt;p&gt;That mindset ended up being one of the most useful outcomes of the project.&lt;/p&gt;


&lt;h2&gt;
  
  
  Final takeaway
&lt;/h2&gt;

&lt;p&gt;If I had to compress the whole project into one sentence, it would be this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I built HVTP to practice designing a system that can accept load quickly while keeping correctness, separation of concerns, and recovery paths in view.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is what this project gave me.&lt;/p&gt;

&lt;p&gt;It helped me think more clearly about how to keep the front door fast, how to handle duplicates intentionally, how to separate durable state from audit history, and how to design for verification instead of assuming everything will always stay aligned.&lt;/p&gt;

&lt;p&gt;For me, that was far more valuable than just assembling a stack.&lt;/p&gt;


&lt;h2&gt;
  
  
  Final thoughts
&lt;/h2&gt;

&lt;p&gt;If you’ve built something in this space, I’d be genuinely interested in how you approached trade-offs around:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;202 Accepted&lt;/code&gt; vs synchronous confirmation&lt;/li&gt;
&lt;li&gt;Redis idempotency boundaries&lt;/li&gt;
&lt;li&gt;ledger vs audit store separation&lt;/li&gt;
&lt;li&gt;what you consider a useful throughput benchmark&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those design choices ended up being the most interesting part of the project for me.&lt;/p&gt;



&lt;p&gt;If you want to explore the implementation, docs, and load tests, the full repo is here:&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/kaustubh-26" rel="noopener noreferrer"&gt;
        kaustubh-26
      &lt;/a&gt; / &lt;a href="https://github.com/kaustubh-26/high-volume-transaction-processor" rel="noopener noreferrer"&gt;
        high-volume-transaction-processor
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Event-driven transaction processor with signed ingress, Kafka workflows, ledger persistence, audit storage, and webhook notifications
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;High Volume Transaction Processor&lt;/h1&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;High Volume Transaction Processor&lt;/strong&gt; — &lt;em&gt;An event-driven transaction processor&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;a rel="noopener noreferrer" href="https://github.com/kaustubh-26/high-volume-transaction-processor/actions/workflows/ci.yml/badge.svg"&gt;&lt;img src="https://github.com/kaustubh-26/high-volume-transaction-processor/actions/workflows/ci.yml/badge.svg" alt="CI Status"&gt;&lt;/a&gt;
&lt;a href="https://sonarcloud.io/dashboard?id=kaustubh-26_high-volume-transaction-processor" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/e81dd4c4ebd488f3735c9d8ffdf7d93bc10d57893c69154b3627b48cc2452ea7/68747470733a2f2f736f6e6172636c6f75642e696f2f6170692f70726f6a6563745f6261646765732f6d6561737572653f70726f6a6563743d6b617573747562682d32365f686967682d766f6c756d652d7472616e73616374696f6e2d70726f636573736f72266d65747269633d616c6572745f737461747573" alt="Quality Gate Status"&gt;&lt;/a&gt;
&lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/80c3ad9383db58bb4c6ac36bcebc2434ed4cac4692f4ba7af9d169aaa3ee5b42/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f6c6963656e73652f6b617573747562682d32362f686967682d766f6c756d652d7472616e73616374696f6e2d70726f636573736f72"&gt;&lt;img src="https://camo.githubusercontent.com/80c3ad9383db58bb4c6ac36bcebc2434ed4cac4692f4ba7af9d169aaa3ee5b42/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f6c6963656e73652f6b617573747562682d32362f686967682d766f6c756d652d7472616e73616374696f6e2d70726f636573736f72" alt="License"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;A production-style, event-driven transaction pipeline showcasing signed API ingestion, asynchronous Kafka processing, Redis idempotency, PostgreSQL ledger writes, and MongoDB audit persistence.&lt;/p&gt;
&lt;p&gt;The repository is structured like a small payment platform:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;signed transaction ingestion over HTTP&lt;/li&gt;
&lt;li&gt;asynchronous processing over Kafka&lt;/li&gt;
&lt;li&gt;Redis-backed idempotency protection&lt;/li&gt;
&lt;li&gt;ledger persistence in PostgreSQL&lt;/li&gt;
&lt;li&gt;immutable audit persistence in MongoDB&lt;/li&gt;
&lt;li&gt;dead-letter topics for failed records&lt;/li&gt;
&lt;li&gt;webhook notifications for transaction state changes&lt;/li&gt;
&lt;li&gt;Actuator and Prometheus endpoints on every service&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;What This Project Demonstrates&lt;/h2&gt;
&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;Event-driven microservices with clearly separated write responsibilities&lt;/li&gt;
&lt;li&gt;Per-account ordering by using &lt;code&gt;accountId&lt;/code&gt; as the Kafka message key&lt;/li&gt;
&lt;li&gt;Idempotency enforcement in the processor with Redis TTL-backed keys&lt;/li&gt;
&lt;li&gt;PostgreSQL as the ledger source of truth for persisted transactions&lt;/li&gt;
&lt;li&gt;MongoDB as an append-only audit store&lt;/li&gt;
&lt;li&gt;Reconciliation between the audit store and the ledger&lt;/li&gt;
&lt;li&gt;Replay support for rebuilding ledger state from &lt;code&gt;transaction_log&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Signed ingress requests and API-key-protected status…&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/kaustubh-26/high-volume-transaction-processor" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;





</description>
      <category>architecture</category>
      <category>backend</category>
      <category>kafka</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>How I Designed a Real-Time Dashboard Using Kafka, Socket.IO, and a BFF</title>
      <dc:creator>Kaustubh Alandkar</dc:creator>
      <pubDate>Mon, 30 Mar 2026 15:06:12 +0000</pubDate>
      <link>https://dev.to/kaustubhalandkar/how-i-designed-a-real-time-dashboard-using-kafka-socketio-and-a-bff-4b8m</link>
      <guid>https://dev.to/kaustubhalandkar/how-i-designed-a-real-time-dashboard-using-kafka-socketio-and-a-bff-4b8m</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;A practical breakdown of the architecture decisions, trade-offs, and frontend/backend boundaries behind Flux — an event-driven real-time dashboard platform I built.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;When I started building &lt;strong&gt;Flux&lt;/strong&gt;, I wanted something that &lt;strong&gt;actually felt like a real-time system&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not a frontend that keeps polling every few seconds.&lt;/li&gt;
&lt;li&gt;Not a UI that directly calls five different APIs.&lt;/li&gt;
&lt;li&gt;Not a project where everything works only when the happy path works.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I wanted something closer to how production systems are usually designed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;multiple data domains
&lt;/li&gt;
&lt;li&gt;asynchronous communication
&lt;/li&gt;
&lt;li&gt;real-time delivery
&lt;/li&gt;
&lt;li&gt;graceful degradation
&lt;/li&gt;
&lt;li&gt;and a frontend that stays simple even when the backend gets more complex
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So this post is a breakdown of &lt;strong&gt;how I designed the architecture for Flux&lt;/strong&gt; — and more importantly, &lt;strong&gt;why I made the decisions I made&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Flux actually is
&lt;/h2&gt;

&lt;p&gt;Flux is a &lt;strong&gt;real-time dashboard&lt;/strong&gt; that streams and displays:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Weather&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;News&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stocks&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Crypto&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At first glance, it looks like a frontend-heavy project.&lt;/p&gt;

&lt;p&gt;But the interesting part is actually the &lt;strong&gt;backend architecture&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Because the real problem wasn’t:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;"How do I render cards on a dashboard?"&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The real problem was:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;"How can I design a system that cleanly ingests, processes, and streams multiple real‑time data feeds to clients?"&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That changed everything.&lt;/p&gt;




&lt;h2&gt;
  
  
  The first architecture I didn’t want
&lt;/h2&gt;

&lt;p&gt;The most obvious way to build this would have been something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Frontend
 ├── calls weather API
 ├── calls news API
 ├── calls stocks API
 └── calls crypto API
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This works for a small demo.&lt;/p&gt;

&lt;p&gt;But I expected that approach to become painful pretty quickly.&lt;/p&gt;
&lt;h3&gt;
  
  
  Why I avoided that approach
&lt;/h3&gt;

&lt;p&gt;Because the frontend would slowly become responsible for things it should &lt;strong&gt;not&lt;/strong&gt; own:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;request orchestration
&lt;/li&gt;
&lt;li&gt;retries
&lt;/li&gt;
&lt;li&gt;service-specific logic
&lt;/li&gt;
&lt;li&gt;failure handling
&lt;/li&gt;
&lt;li&gt;data normalization
&lt;/li&gt;
&lt;li&gt;caching decisions
&lt;/li&gt;
&lt;li&gt;reconnect behavior
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s how “simple dashboards” become messy.&lt;/p&gt;

&lt;p&gt;And honestly, this was one of the main design principles I kept repeating to myself while building this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;I wanted the frontend to stay thin and focused.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Meaning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;render data
&lt;/li&gt;
&lt;li&gt;send user intent
&lt;/li&gt;
&lt;li&gt;keep a real-time connection alive
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s it.&lt;/p&gt;


&lt;h2&gt;
  
  
  The architecture I settled on
&lt;/h2&gt;

&lt;p&gt;I ended up designing Flux around &lt;strong&gt;3 main layers&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Frontend (Web UI)
 │
 ▼
BFF (Socket.IO + Cache + Kafka coordination)
 │
 ▼
Domain Services (Kafka-based)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This became the &lt;strong&gt;core mental model&lt;/strong&gt; of the whole project.&lt;/p&gt;

&lt;p&gt;Each layer has &lt;strong&gt;one job&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;And once I locked that in, the rest of the system became much easier to reason about.&lt;/p&gt;


&lt;h2&gt;
  
  
  1) Frontend: thin, reactive, and intentionally limited
&lt;/h2&gt;

&lt;p&gt;The frontend in Flux is intentionally &lt;strong&gt;thin and focused&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That was a &lt;strong&gt;design choice&lt;/strong&gt;, not a shortcut.&lt;/p&gt;

&lt;p&gt;Its job is to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;open a persistent &lt;strong&gt;Socket.IO&lt;/strong&gt; connection
&lt;/li&gt;
&lt;li&gt;send user context (like location)
&lt;/li&gt;
&lt;li&gt;subscribe to real-time updates
&lt;/li&gt;
&lt;li&gt;render whatever arrives
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Its job is &lt;strong&gt;not&lt;/strong&gt; to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;talk to Kafka
&lt;/li&gt;
&lt;li&gt;call backend services directly
&lt;/li&gt;
&lt;li&gt;aggregate data
&lt;/li&gt;
&lt;li&gt;implement retry policies
&lt;/li&gt;
&lt;li&gt;decide caching behavior
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That separation made the frontend much cleaner.&lt;/p&gt;
&lt;h3&gt;
  
  
  Why this mattered
&lt;/h3&gt;

&lt;p&gt;Because when frontend code starts knowing too much about backend infrastructure, everything becomes harder:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;harder to debug
&lt;/li&gt;
&lt;li&gt;harder to test
&lt;/li&gt;
&lt;li&gt;harder to scale
&lt;/li&gt;
&lt;li&gt;harder to change later
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So in Flux, the frontend only talks to &lt;strong&gt;one thing&lt;/strong&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;the BFF&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That one decision removed a lot of future complexity.&lt;/p&gt;


&lt;h2&gt;
  
  
  2) Why I used a BFF instead of exposing services directly
&lt;/h2&gt;

&lt;p&gt;This was probably the &lt;strong&gt;most important architecture decision&lt;/strong&gt; in the whole project.&lt;/p&gt;

&lt;p&gt;I introduced a &lt;strong&gt;Backend-for-Frontend (BFF)&lt;/strong&gt; layer between the UI and the backend services.&lt;/p&gt;
&lt;h3&gt;
  
  
  What the BFF does
&lt;/h3&gt;

&lt;p&gt;The BFF is responsible for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;maintaining client &lt;strong&gt;Socket.IO&lt;/strong&gt; connections
&lt;/li&gt;
&lt;li&gt;receiving events from backend services
&lt;/li&gt;
&lt;li&gt;hydrating reconnecting clients quickly
&lt;/li&gt;
&lt;li&gt;deciding what data to fan out to which users
&lt;/li&gt;
&lt;li&gt;acting as the &lt;strong&gt;real-time gateway&lt;/strong&gt; of the system
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So instead of this:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Frontend → many services
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;I made it:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Frontend → BFF → services
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;On paper that sounds small.&lt;/p&gt;

&lt;p&gt;In practice, it changed a lot.&lt;/p&gt;
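&lt;p&gt;That single boundary is easy to sketch. Here’s a minimal, illustrative version of the fan-out side; the event shape and the &lt;code&gt;domain:update&lt;/code&gt; event name are my assumptions for this sketch, not the actual Flux contract:&lt;/p&gt;

```javascript
// Minimal sketch of the BFF boundary (illustrative, not the real Flux code):
// domain services push events in on one side, and the BFF is the only thing
// that talks to clients on the other.
function makeBff(io) {
  return {
    // Called for every event the BFF receives from a backend service.
    handleDomainEvent(event) {
      // Clients subscribed to this domain get exactly one kind of message.
      io.to(event.domain).emit(event.domain + ':update', event.payload);
    },
  };
}
```

&lt;p&gt;The point is the shape: the frontend never sees Kafka or the services, only this one emit surface.&lt;/p&gt;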
&lt;h3&gt;
  
  
  Why I liked this model
&lt;/h3&gt;

&lt;p&gt;Because the frontend now has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;one connection model&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;one integration boundary&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;one real-time contract&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And the backend can evolve without breaking the UI every time.&lt;/p&gt;

&lt;p&gt;That gave me a much better separation between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;presentation concerns&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;system concerns&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Which is exactly what I wanted from the start.&lt;/p&gt;


&lt;h2&gt;
  
  
  3) Why I used Kafka in the middle
&lt;/h2&gt;

&lt;p&gt;Once I knew I wanted multiple real-time domains, I also knew I didn’t want everything &lt;strong&gt;tightly coupled&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If weather updates, crypto updates, stock updates, and news updates all depend directly on each other — or on one giant central service — that becomes painful fast.&lt;/p&gt;

&lt;p&gt;So I used &lt;strong&gt;Kafka&lt;/strong&gt; as the backbone.&lt;/p&gt;
&lt;h3&gt;
  
  
  What Kafka gave me
&lt;/h3&gt;

&lt;p&gt;Kafka helped me design the system around &lt;strong&gt;events&lt;/strong&gt;, not direct service-to-service coupling.&lt;/p&gt;

&lt;p&gt;That gave me a few nice properties:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;services can evolve independently
&lt;/li&gt;
&lt;li&gt;producers and consumers don’t need to know too much about each other
&lt;/li&gt;
&lt;li&gt;scaling one domain doesn’t force scaling everything
&lt;/li&gt;
&lt;li&gt;the architecture feels much closer to real production systems
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That was important to me.&lt;/p&gt;

&lt;p&gt;Because I didn’t want Flux to be a project optimized more for presentation than for system trade-offs.&lt;/p&gt;

&lt;p&gt;I wanted it to feel like something that was designed with &lt;strong&gt;actual backend trade-offs&lt;/strong&gt; in mind.&lt;/p&gt;
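&lt;p&gt;To make the decoupling concrete, here’s a rough sketch of routing domain topics through one dispatch table. The topic names are assumptions for illustration, not the real Flux topics:&lt;/p&gt;

```javascript
// Sketch of the loose coupling Kafka enables: each domain owns a topic, and
// the BFF consumes them all through one dispatch table.
const topicHandlers = {
  'weather.updates': (msg) => ({ domain: 'weather', payload: msg }),
  'news.updates': (msg) => ({ domain: 'news', payload: msg }),
  'stocks.updates': (msg) => ({ domain: 'stocks', payload: msg }),
  'crypto.updates': (msg) => ({ domain: 'crypto', payload: msg }),
};

function dispatch(topic, message) {
  const handler = topicHandlers[topic];
  // An unknown topic is ignored rather than treated as fatal, so adding a
  // new domain service never breaks existing consumers.
  if (!handler) return null;
  return handler(message);
}
```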


&lt;h2&gt;
  
  
  4) Why I chose Socket.IO for real-time delivery
&lt;/h2&gt;

&lt;p&gt;For the client-facing real-time layer, I chose &lt;strong&gt;Socket.IO&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;And yes — I know raw WebSockets are the lower-level answer on paper.&lt;/p&gt;

&lt;p&gt;But for this project, I cared more about &lt;strong&gt;reliability&lt;/strong&gt; and &lt;strong&gt;developer ergonomics&lt;/strong&gt; than sounding low-level.&lt;/p&gt;
&lt;h3&gt;
  
  
  Why Socket.IO made sense here
&lt;/h3&gt;

&lt;p&gt;It gave me:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;automatic reconnection
&lt;/li&gt;
&lt;li&gt;fallback transport support
&lt;/li&gt;
&lt;li&gt;room-based fan-out
&lt;/li&gt;
&lt;li&gt;simpler event semantics
&lt;/li&gt;
&lt;li&gt;less boilerplate for real-time client communication
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That mattered because Flux is not just:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;“send one stream to one client”&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It’s a &lt;strong&gt;multi-stream dashboard&lt;/strong&gt; with different categories of data and different update patterns.&lt;/p&gt;

&lt;p&gt;So having a stable, practical abstraction here was worth it.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Sometimes “more production-realistic” is not about choosing the lowest-level primitive.&lt;br&gt;&lt;br&gt;
Sometimes it’s about choosing the thing you can operate more reliably.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  5) The problem I ran into: reconnects and hydration
&lt;/h2&gt;

&lt;p&gt;This is where the architecture got more interesting.&lt;/p&gt;

&lt;p&gt;Real-time apps are not just about &lt;strong&gt;live streaming&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;They’re also about:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;“What happens when a user reconnects?”&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That question forced me to think beyond just pushing events.&lt;/p&gt;

&lt;p&gt;Because if a user refreshes the page or reconnects after a network blip, I don’t want them staring at an empty dashboard waiting for all streams to naturally update again.&lt;/p&gt;

&lt;p&gt;That creates a poor reconnect experience.&lt;/p&gt;

&lt;p&gt;So I split the system mentally into &lt;strong&gt;two kinds of data&lt;/strong&gt;:&lt;/p&gt;


&lt;h3&gt;
  
  
  A) Snapshot data
&lt;/h3&gt;

&lt;p&gt;Data that should be shown &lt;strong&gt;immediately on reconnect&lt;/strong&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Examples:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;top news
&lt;/li&gt;
&lt;li&gt;weather snapshot
&lt;/li&gt;
&lt;li&gt;top crypto coins
&lt;/li&gt;
&lt;li&gt;stock summaries
&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;
  
  
  B) Stream data
&lt;/h3&gt;

&lt;p&gt;Data that should continue flowing &lt;strong&gt;live&lt;/strong&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Examples:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;ticker updates
&lt;/li&gt;
&lt;li&gt;incremental changes
&lt;/li&gt;
&lt;li&gt;fast-moving live events
&lt;/li&gt;
&lt;/ul&gt;



&lt;p&gt;That separation ended up being very useful.&lt;/p&gt;

&lt;p&gt;Because it let me design &lt;strong&gt;hydration&lt;/strong&gt; and &lt;strong&gt;streaming&lt;/strong&gt; differently instead of pretending all real-time data behaves the same way.&lt;/p&gt;

&lt;p&gt;And honestly, that was one of the cleanest architecture decisions in the project.&lt;/p&gt;
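&lt;p&gt;In code, the split can be as simple as two state transitions. This is an illustrative sketch, with field names made up for the example:&lt;/p&gt;

```javascript
// Snapshot-vs-stream split: a snapshot replaces everything and is what a
// reconnecting client gets first; stream events only append afterwards.
function applySnapshot(state, snapshot) {
  return { snapshot: snapshot, events: [] };
}

function applyStreamEvent(state, event) {
  // Stream data never blanks the dashboard; it builds on the snapshot.
  return { snapshot: state.snapshot, events: state.events.concat([event]) };
}
```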


&lt;h2&gt;
  
  
  6) Why I added cache, but made it optional
&lt;/h2&gt;

&lt;p&gt;Once I started thinking about reconnect hydration, cache became the obvious next step.&lt;/p&gt;

&lt;p&gt;But I also didn’t want to build a system that completely dies if cache is unavailable.&lt;/p&gt;

&lt;p&gt;So I used &lt;strong&gt;Valkey&lt;/strong&gt; (an open-source fork of Redis) as an &lt;strong&gt;optional accelerator&lt;/strong&gt;, not as a hard dependency.&lt;/p&gt;

&lt;p&gt;That distinction mattered a lot.&lt;/p&gt;
&lt;h3&gt;
  
  
  Why “optional cache” was important
&lt;/h3&gt;

&lt;p&gt;Cache is amazing for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fast reconnect hydration
&lt;/li&gt;
&lt;li&gt;reducing repeated work
&lt;/li&gt;
&lt;li&gt;serving recent snapshot data quickly
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But I didn’t want Flux to become:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;“works only if every dependency is healthy”&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So I designed it with this mindset:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;if cache is available → great, faster experience&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;if cache is unavailable → system should still keep working&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s a small detail, but it changes how resilient the system feels.&lt;/p&gt;

&lt;p&gt;And personally, I’ve started appreciating this design style a lot more:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Acceleration should not become fragility.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
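&lt;p&gt;Here’s roughly what that mindset looks like in code. This is a simplified, synchronous sketch where the cache client stands in for the real (async) Valkey client:&lt;/p&gt;

```javascript
// "Acceleration should not become fragility" as code: try the cache first,
// but treat a cache failure exactly like a cache miss.
function hydrateSnapshot(cache, key, fetchFresh) {
  try {
    const cached = cache.get(key);
    if (cached) return { data: cached, source: 'cache' };
  } catch (err) {
    // Cache being down must not take the feature down with it.
  }
  return { data: fetchFresh(), source: 'origin' };
}
```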


&lt;h2&gt;
  
  
  7) One of the subtle problems: selective fan-out
&lt;/h2&gt;

&lt;p&gt;Once I had a BFF pushing real-time data, I ran into another question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;“Should every client receive every event?”&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In practice, the answer is &lt;strong&gt;no&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That would:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;waste bandwidth
&lt;/li&gt;
&lt;li&gt;add unnecessary frontend filtering
&lt;/li&gt;
&lt;li&gt;generally be inefficient&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So I used &lt;strong&gt;Socket.IO rooms&lt;/strong&gt; to scope event delivery.&lt;/p&gt;

&lt;p&gt;That meant I could think in terms like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;weather by city
&lt;/li&gt;
&lt;li&gt;global news stream
&lt;/li&gt;
&lt;li&gt;crypto stream
&lt;/li&gt;
&lt;li&gt;stock stream
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This helped keep the fan-out more intentional instead of just:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;“broadcast everything and let the client figure it out”&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That’s one of those things that sounds small when you say it in one sentence, but it makes the architecture much cleaner.&lt;/p&gt;
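&lt;p&gt;The scoping itself can be sketched as a small mapping from a client subscription to rooms. The naming scheme below is an assumption for illustration:&lt;/p&gt;

```javascript
// Room-scoped delivery: the client declares interest, and the BFF joins it
// only to matching rooms instead of broadcasting everything to everyone.
function roomsForSubscription(sub) {
  const rooms = [];
  if (sub.city) rooms.push('weather:' + sub.city.toLowerCase());
  if (sub.news) rooms.push('news:global');
  if (sub.stocks) rooms.push('stocks:stream');
  if (sub.crypto) rooms.push('crypto:stream');
  return rooms;
}
```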


&lt;h2&gt;
  
  
  8) One frontend decision I’m glad I didn’t stay stubborn about
&lt;/h2&gt;

&lt;p&gt;Initially, I was trying to keep the frontend very clean:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;hooks handle data and subscriptions
&lt;/li&gt;
&lt;li&gt;components just render UI
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And for the most part, that worked well.&lt;/p&gt;

&lt;p&gt;Each domain had its own hook:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;weather
&lt;/li&gt;
&lt;li&gt;news
&lt;/li&gt;
&lt;li&gt;stocks
&lt;/li&gt;
&lt;li&gt;crypto
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So the general rule was simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Hooks deal with real-time logic. Components stay simple.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That kept things pretty clean.&lt;/p&gt;


&lt;h3&gt;
  
  
  But then I ran into a small UX problem
&lt;/h3&gt;

&lt;p&gt;In the &lt;strong&gt;Crypto&lt;/strong&gt; card, I had two tabs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Top Movers&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Live Ticker&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And every time I switched between them, I didn’t like what I was seeing.&lt;/p&gt;

&lt;p&gt;Some data in the Crypto card would reload again, and state didn’t feel stable across tab switches.&lt;/p&gt;

&lt;p&gt;The UI didn’t feel as smooth as I wanted.&lt;/p&gt;

&lt;p&gt;Nothing was actually broken.&lt;/p&gt;

&lt;p&gt;It just didn’t feel good.&lt;/p&gt;

&lt;p&gt;And I’ve started trusting that feeling more while building projects.&lt;/p&gt;

&lt;p&gt;Because a lot of times, architecture looks clean in code but feels annoying in the actual product.&lt;/p&gt;


&lt;h3&gt;
  
  
  So this is where I bent my own rule a bit
&lt;/h3&gt;

&lt;p&gt;Instead of forcing everything through local hook state, I used &lt;strong&gt;Redux selectively&lt;/strong&gt; for the crypto section.&lt;/p&gt;

&lt;p&gt;Not across the whole app.&lt;/p&gt;

&lt;p&gt;Just where it actually helped.&lt;/p&gt;

&lt;p&gt;Mainly to keep things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ticker data
&lt;/li&gt;
&lt;li&gt;top coins data
&lt;/li&gt;
&lt;li&gt;price-related state
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;stable across tab switches.&lt;/p&gt;

&lt;p&gt;The pattern that felt right was:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;hooks handle subscriptions and incoming socket events
&lt;/li&gt;
&lt;li&gt;Redux keeps shared UI state stable where needed
&lt;/li&gt;
&lt;li&gt;components just read and render
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That ended up feeling much better.&lt;/p&gt;
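&lt;p&gt;As a sketch, the crypto slice can be a single plain reducer. The action names and state shape here are illustrative, not the actual Flux store:&lt;/p&gt;

```javascript
// "Redux selectively": crypto state lives in one plain reducer, so switching
// tabs only flips a UI flag and never resets the data behind it.
const initialCryptoState = { ticker: {}, topMovers: [], activeTab: 'movers' };

function cryptoReducer(state, action) {
  switch (action.type) {
    case 'crypto/tickerUpdate':
      return { ...state, ticker: { ...state.ticker, ...action.payload } };
    case 'crypto/setTopMovers':
      return { ...state, topMovers: action.payload };
    case 'crypto/switchTab':
      // Only the tab flag changes; ticker and movers survive the switch.
      return { ...state, activeTab: action.payload };
    default:
      return state;
  }
}
```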
&lt;h3&gt;
  
  
  Why I’m glad I did this
&lt;/h3&gt;

&lt;p&gt;Because this was one of those cases where:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;being too “architecturally pure” would have made the user experience worse.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And honestly, I’d rather have a slightly more practical architecture than a “perfectly clean” one that adds unnecessary friction to the user experience.&lt;/p&gt;

&lt;p&gt;That small change made the Crypto card feel way smoother.&lt;/p&gt;

&lt;p&gt;And I think that’s a useful reminder in general:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Sometimes the right architecture decision is just the one that makes the product feel better.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  9) Failure isolation was not an afterthought
&lt;/h2&gt;

&lt;p&gt;One thing I really wanted in Flux was this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;If one stream fails, the dashboard should still feel alive.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So I designed the UI and backend with &lt;strong&gt;failure isolation&lt;/strong&gt; in mind.&lt;/p&gt;

&lt;p&gt;Meaning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;if weather is delayed → crypto still updates
&lt;/li&gt;
&lt;li&gt;if news fails → stocks still render
&lt;/li&gt;
&lt;li&gt;if one service lags → the whole app shouldn’t feel dead
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That sounds like a UX decision, but it’s actually an architecture decision too.&lt;/p&gt;

&lt;p&gt;Because if your system shape forces everything to depend on everything else, then partial failure becomes full failure.&lt;/p&gt;

&lt;p&gt;And I wanted to avoid that.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;A real-time system should degrade, not collapse.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
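&lt;p&gt;As a sketch, that isolation can be expressed as a per-domain update that never wipes the other slices. The shape below is illustrative:&lt;/p&gt;

```javascript
// Failure isolation: each domain updates its own slice of dashboard state,
// and a failure only marks that slice stale instead of wiping the view.
function applyDomainResult(dashboard, domain, result) {
  const previous = dashboard[domain] || { data: null, stale: false };
  const slice = result.ok
    ? { data: result.data, stale: false }
    : { data: previous.data, stale: true }; // keep the last good data
  return { ...dashboard, [domain]: slice };
}
```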


&lt;h2&gt;
  
  
  10) I intentionally did not chase “perfect distributed systems purity”
&lt;/h2&gt;

&lt;p&gt;This was a very conscious choice.&lt;/p&gt;

&lt;p&gt;Because once you start building event-driven systems, it’s very easy to go down the rabbit hole of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;exactly-once everything
&lt;/li&gt;
&lt;li&gt;over-engineered delivery guarantees
&lt;/li&gt;
&lt;li&gt;too many abstractions too early
&lt;/li&gt;
&lt;li&gt;adding architectural complexity that doesn’t meaningfully improve the system&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I tried hard not to do that.&lt;/p&gt;

&lt;p&gt;So Flux is opinionated in a practical way.&lt;/p&gt;

&lt;p&gt;I optimized for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;clarity
&lt;/li&gt;
&lt;li&gt;resilience
&lt;/li&gt;
&lt;li&gt;clean boundaries
&lt;/li&gt;
&lt;li&gt;realistic trade-offs
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not &lt;strong&gt;unnecessary complexity&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That was important to me because I wanted this project to reflect how I actually think as an engineer:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;I like systems that are thoughtful, not just complicated.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  Final architecture summary
&lt;/h2&gt;

&lt;p&gt;If I had to describe Flux’s architecture in one sentence, I’d say:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;It’s a real-time dashboard designed like a small event-driven platform, not like a frontend project with extra backend code attached.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That’s the difference I cared about.&lt;/p&gt;
&lt;h3&gt;
  
  
  The main ideas behind the design were:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;keep the frontend thin
&lt;/li&gt;
&lt;li&gt;centralize real-time delivery in a BFF
&lt;/li&gt;
&lt;li&gt;use Kafka for loose coupling
&lt;/li&gt;
&lt;li&gt;use Socket.IO for practical real-time delivery
&lt;/li&gt;
&lt;li&gt;separate snapshot hydration from live streams
&lt;/li&gt;
&lt;li&gt;use cache as an accelerator, not a crutch
&lt;/li&gt;
&lt;li&gt;isolate failures so the whole app doesn’t feel broken
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And honestly, designing those boundaries was way more interesting than building the UI itself.&lt;/p&gt;


&lt;h2&gt;
  
  
  What I learned from building this
&lt;/h2&gt;

&lt;p&gt;If I had to compress the biggest lesson into one line:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Most architecture problems become easier once each layer has one clear responsibility.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A lot of messy systems are messy because &lt;strong&gt;responsibilities are blurry&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Flux became much easier to build once I stopped asking:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;“How do I make everything talk to everything?”&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;…and started asking:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;“What is each layer allowed to know?”&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That single shift made the architecture cleaner.&lt;/p&gt;


&lt;h2&gt;
  
  
  If you’re building a real-time dashboard too
&lt;/h2&gt;

&lt;p&gt;A few things I’d strongly recommend:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Don’t let the frontend talk to too many things directly.
&lt;/li&gt;
&lt;li&gt;Decide early what is &lt;strong&gt;snapshot data&lt;/strong&gt; vs &lt;strong&gt;stream data&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Design for reconnects, not just the initial page load.&lt;/li&gt;
&lt;li&gt;Think about partial failure early.&lt;/li&gt;
&lt;li&gt;Keep boundaries boring and explicit.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That alone will save you a lot of pain later.&lt;/p&gt;


&lt;h2&gt;
  
  
  Final thoughts
&lt;/h2&gt;

&lt;p&gt;If you’ve built something similar — or if you would have designed this differently — I’d genuinely love to hear your approach.&lt;/p&gt;

&lt;p&gt;I always find architecture discussions more useful when they’re about trade-offs, not just best practices.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disclosure&lt;/strong&gt;: This post was proofread with AI assistance. The core ideas, architecture decisions, and final content are my own.&lt;/p&gt;




&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/kaustubh-26" rel="noopener noreferrer"&gt;
        kaustubh-26
      &lt;/a&gt; / &lt;a href="https://github.com/kaustubh-26/flux-platform" rel="noopener noreferrer"&gt;
        flux-platform
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      An event-driven real-time data platform
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Flux Platform&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Flux&lt;/strong&gt; — &lt;em&gt;An event-driven real-time data platform&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a rel="noopener noreferrer" href="https://github.com/kaustubh-26/flux-platform/actions/workflows/ci.yml/badge.svg"&gt;&lt;img src="https://github.com/kaustubh-26/flux-platform/actions/workflows/ci.yml/badge.svg" alt="Build"&gt;&lt;/a&gt;
&lt;a rel="noopener noreferrer" href="https://github.com/kaustubh-26/flux-platform/actions/workflows/tests.yml/badge.svg"&gt;&lt;img src="https://github.com/kaustubh-26/flux-platform/actions/workflows/tests.yml/badge.svg" alt="Tests"&gt;&lt;/a&gt;
&lt;a rel="noopener noreferrer" href="https://github.com/kaustubh-26/flux-platform/actions/workflows/deploy.yml/badge.svg"&gt;&lt;img src="https://github.com/kaustubh-26/flux-platform/actions/workflows/deploy.yml/badge.svg" alt="Deploy"&gt;&lt;/a&gt;
&lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/18748f0cf63c6d7ac501797ffb84fe43d21e10cd03ad5e030c06afef51ff906f/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f6c6963656e73652f6b617573747562682d32362f666c75782d706c6174666f726d"&gt;&lt;img src="https://camo.githubusercontent.com/18748f0cf63c6d7ac501797ffb84fe43d21e10cd03ad5e030c06afef51ff906f/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f6c6963656e73652f6b617573747562682d32362f666c75782d706c6174666f726d" alt="License"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Live Demo:&lt;/strong&gt; &lt;a href="https://flux.kaustubhalandkar.workers.dev" rel="nofollow noopener noreferrer"&gt;https://flux.kaustubhalandkar.workers.dev&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;A &lt;strong&gt;production-style, event-driven real-time data platform&lt;/strong&gt; showcasing modern &lt;strong&gt;Kafka-based streaming&lt;/strong&gt;, &lt;strong&gt;WebSocket fan-out&lt;/strong&gt;, and a clean &lt;strong&gt;Backend-for-Frontend (BFF)&lt;/strong&gt; architecture.&lt;/p&gt;
&lt;p&gt;This repository is intentionally built as a &lt;strong&gt;portfolio-grade, open-source system&lt;/strong&gt; that mirrors how real-world, &lt;strong&gt;streaming data platforms&lt;/strong&gt; are designed, operated, and documented.&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;🖥️ Live Dashboard Preview&lt;/h2&gt;
&lt;/div&gt;

&lt;p&gt;
  &lt;a rel="noopener noreferrer" href="https://github.com/kaustubh-26/flux-platform/docs/screenshots/dashboard-desktop.png"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fkaustubh-26%2Fflux-platform%2FHEAD%2Fdocs%2Fscreenshots%2Fdashboard-desktop.png" width="900"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;
  &lt;a rel="noopener noreferrer" href="https://github.com/kaustubh-26/flux-platform/docs/screenshots/dashboard-crypto.png"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fkaustubh-26%2Fflux-platform%2FHEAD%2Fdocs%2Fscreenshots%2Fdashboard-crypto.png" width="800"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;📱 Mobile View&lt;/h3&gt;

&lt;/div&gt;

&lt;p&gt;
  &lt;a rel="noopener noreferrer" href="https://github.com/kaustubh-26/flux-platform/docs/screenshots/mobile-dashboard.png"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fkaustubh-26%2Fflux-platform%2FHEAD%2Fdocs%2Fscreenshots%2Fmobile-dashboard.png" width="350"&gt;&lt;/a&gt;
&lt;/p&gt;




&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;✨ What This Project Demonstrates&lt;/h2&gt;

&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;Event-driven microservices using &lt;strong&gt;Kafka&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;A resilient &lt;strong&gt;BFF layer&lt;/strong&gt; for real-time fan-out&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Socket.IO&lt;/strong&gt;–based client streaming&lt;/li&gt;
&lt;li&gt;Cache-accelerated hydration with graceful degradation&lt;/li&gt;
&lt;li&gt;Idempotency &amp;amp; deduplication patterns&lt;/li&gt;
&lt;li&gt;Structured logging &amp;amp; observability&lt;/li&gt;
&lt;li&gt;Self-healing Kafka connectivity&lt;/li&gt;
&lt;li&gt;Clean, scalable monorepo organization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The project emphasizes real-world system design concerns such as &lt;strong&gt;event-driven communication, failure handling, and scalability&lt;/strong&gt;.&lt;/p&gt;




&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;🧠 High-Level Architecture&lt;/h2&gt;

&lt;/div&gt;

&lt;div class="snippet-clipboard-content notranslate position-relative overflow-auto"&gt;
&lt;pre class="notranslate"&gt;&lt;code&gt;┌──────────────────────┐
│        Client        │
│  Browser / Mobile    │
└─────────▲────────────┘
          │ Socket.IO (real-time)
┌─────────┴──────────────┐
│   Backend-for-Frontend │
│          (BFF)         │
│  - Socket.IO Server    │
│  - Kafka Producer      │&lt;/code&gt;&lt;/pre&gt;…&lt;/div&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/kaustubh-26/flux-platform" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;






</description>
      <category>architecture</category>
      <category>backend</category>
      <category>webdev</category>
      <category>kafka</category>
    </item>
  </channel>
</rss>
