
Kaustubh Alandkar


Designing an Offline-Resilient MQTT Buffer with SQLite

The most crucial question for a data collection service built on the MQTT (Message Queuing Telemetry Transport) protocol is this:

what happens when the downstream service disappears for 20 minutes?

That's the question I had when designing a lightweight client to collect data from devices that communicate over MQTT.

At first I thought of forwarding every MQTT message directly to an HTTP API. But I had to account for unreliable networks, a problem common in distributed systems.

The moment the downstream API becomes slow, times out, or has an auth issue, the whole ingestion path starts inheriting those failures.

So I had to build a system methodically.

The goal was simple: keep accepting data even when the rest of the pipeline is having problems.

I ended up with a lightweight Python service that:

  • subscribes to MQTT topics
  • keeps a local durable buffer in SQLite
  • forwards records downstream in batches
  • retries when delivery fails
  • survives process restarts without losing data that the downstream API has not yet acknowledged

Honestly, by the end of it I stopped thinking of this as an MQTT project. It became more of an exercise in where reliability should live.


How it's structured

The service sits between an MQTT broker and a downstream HTTP ingest endpoint. A simple topology.

Data flows like this:

MQTT Broker
   ↓
MQTT Client (on_message)
   ↓
latest_by_topic (in-memory dedup)
   ↓
Flush Worker (interval-based)
   ↓
SQLite — mqtt_buffer
   ↓
Sender Worker (batched + authenticated)
   ↓
Downstream HTTP

It subscribes to topics like sites, inverters, strings, weather, and grid. Under the hood:

  • one Python process
  • one MQTT client loop
  • two worker threads
  • one SQLite file as the durable queue

The three-way split of receive, persist, deliver is intentional: the whole point is that each stage can fail independently.


The direct approach — and why I moved away from it

The obvious implementation is:

on_message → send HTTP request immediately

It works great in demos. It becomes uncomfortable in real environments almost immediately.

Once you couple ingestion to delivery, your MQTT callback is now implicitly dependent on:

  • downstream API latency
  • downstream availability
  • auth/token health
  • retry behavior
  • network flakiness
  • partial delivery handling

Six concerns in one function isn't simplicity; it's just hidden coupling.

Receiving data and delivering data belong on different sides of a boundary. Mixing them is where fragility starts.


Design decisions

1. SQLite as the durability boundary

This is probably the most important decision in the whole project.

I wanted "buffered" to mean persisted and recoverable, not just held in memory.

So SQLite is not just storage here. It's the line between received and safe enough to retry.

Each row in mqtt_buffer carries:

| column | purpose |
| --- | --- |
| id | ordering and dedup |
| topic | source topic |
| ts | timestamp |
| payload | raw message content |
| qos | MQTT QoS (quality of service) level |
| retain | MQTT retain flag |
| attempts | delivery retry count |
| last_error | last failure reason |

Rows are only deleted after the downstream service acknowledges them. That single rule changes the failure model entirely. A failed send doesn't destroy the record — it just changes its lifecycle state.
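A minimal sketch of that table, assuming SQLite column types the article doesn't spell out:

```python
import sqlite3

# Sketch of the mqtt_buffer table described above; the exact types
# and defaults here are assumptions, not the project's actual DDL.
SCHEMA = """
CREATE TABLE IF NOT EXISTS mqtt_buffer (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,  -- ordering and dedup
    topic      TEXT NOT NULL,                      -- source topic
    ts         REAL NOT NULL,                      -- timestamp
    payload    BLOB NOT NULL,                      -- raw message content
    qos        INTEGER NOT NULL DEFAULT 0,         -- MQTT QoS level
    retain     INTEGER NOT NULL DEFAULT 0,         -- MQTT retain flag
    attempts   INTEGER NOT NULL DEFAULT 0,         -- delivery retry count
    last_error TEXT                                -- last failure reason
)
"""

conn = sqlite3.connect("buffer.db")
conn.execute(SCHEMA)
conn.commit()
```

The AUTOINCREMENT id gives the sender a stable oldest-first ordering, and attempts / last_error make a stuck row inspectable over SSH with nothing but the sqlite3 CLI.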

Why SQLite and not Redis or Kafka?

Because this service lives at the edge. I needed something embedded, durable, operationally cheap, and easy to inspect over SSH. SQLite is all of those things.

More importantly, it keeps the deployment footprint at:

python mqtt_to_sqlite.py

I wanted to avoid the operational overhead of additional infrastructure components just to run this service.


2. Keep the callback thin

The on_message path does exactly four things:

  1. Parse the message
  2. Capture MQTT metadata and a timestamp
  3. Update in-memory state
  4. Exit

No disk writes. No HTTP calls. No auth. No retries.

That thinness matters more than it seems. Once the callback starts doing real work, message receipt becomes coupled to delivery health: if downstream is slow, the callback slows down.

I wanted the opposite: the service should keep accepting MQTT messages even while delivery is completely broken. The callback being fast and simple is what makes that possible.
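A sketch of what such a thin callback can look like, assuming a paho-mqtt-style message object (topic, payload, qos, retain attributes); the names here are illustrative, not the project's actual code:

```python
import time
import threading

# In-memory "latest value per topic" state. It is guarded by a lock
# because the flush worker snapshots it from another thread.
latest_by_topic = {}
state_lock = threading.Lock()

def on_message(client, userdata, msg):
    """Thin MQTT callback: capture and exit.

    No disk writes, no HTTP, no auth, no retries -- just record the
    message and its metadata, then return control to the client loop.
    """
    record = {
        "ts": time.time(),
        "payload": msg.payload,
        "qos": msg.qos,
        "retain": msg.retain,
    }
    with state_lock:
        latest_by_topic[msg.topic] = record
```

Everything slow or failure-prone happens in the workers, so the MQTT client loop never blocks on downstream health.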


3. Two workers, one job each

Once I separated receipt from delivery, it became natural to split the background work into two threads:

Flush worker — reads from the in-memory state and writes batches to SQLite at a configured interval.

Sender worker — checks downstream health, fetches and caches a JWT, reads rows from SQLite, POSTs batches, deletes acknowledged rows, and records failures with attempts and last_error.

The failure isolation this gives you is real:

  • downstream offline → ingestion still works
  • auth broken → buffering still works
  • send rate slower than receive rate → SQLite absorbs the mismatch

One loop trying to do all of that at once doesn't give you any of this.
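The two-worker shape can be sketched like this; flush_once and send_once stand in for the real flush-to-SQLite and send-batch logic, and the interval defaults are assumptions:

```python
import threading

stop_event = threading.Event()

def flush_worker(flush_once, interval=2.0):
    """Snapshot in-memory state and write a batch to SQLite each interval."""
    while not stop_event.is_set():
        flush_once()
        stop_event.wait(interval)  # wakes early on shutdown

def sender_worker(send_once, interval=5.0):
    """Read rows from SQLite, deliver a batch, delete acknowledged rows."""
    while not stop_event.is_set():
        send_once()
        stop_event.wait(interval)

def start_workers(flush_once, send_once):
    # Each worker runs in its own daemon thread; one shared Event stops both.
    threads = [
        threading.Thread(target=flush_worker, args=(flush_once,), daemon=True),
        threading.Thread(target=sender_worker, args=(send_once,), daemon=True),
    ]
    for t in threads:
        t.start()
    return threads
```

Because the only thing the two loops share is the SQLite file, a stalled sender never blocks the flush path.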


4. Using SQLite as a durable queue

I'm aware SQLite isn't a message broker.

For this scope, what I needed was: append new records, read oldest rows in order, retry failures, delete only after acknowledgment. SQLite handles all of that just fine if you're disciplined about concurrency.

I enabled:

  • WAL mode — allows concurrent reads while writes are in progress
  • synchronous=FULL — no data loss on OS crash
  • busy_timeout — handles lock contention without erroring immediately

Those settings matter. The process has one thread writing rows, another reading and deleting them, and shared mutable state between them. Without proper configuration, this is where you get bugs that only appear in production, under load, after three weeks.
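Wiring those three settings up takes a few lines; the 5-second busy timeout here is an assumed value, not necessarily the one the service uses:

```python
import sqlite3

def open_buffer_db(path):
    """Open the buffer database with the settings described above."""
    conn = sqlite3.connect(path, check_same_thread=False)
    conn.execute("PRAGMA journal_mode=WAL")    # concurrent reads during writes
    conn.execute("PRAGMA synchronous=FULL")    # fsync on commit; survives OS crash
    conn.execute("PRAGMA busy_timeout=5000")   # wait up to 5s on lock contention
    return conn
```

Note that journal_mode=WAL is persistent (it sticks to the database file), while synchronous and busy_timeout are per-connection and must be set on every open.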


5. Batched delivery

The sender doesn't push records downstream one at a time on arrival. It reads from SQLite in batches, controlled by:

  • SEND_BATCH_SIZE
  • SEND_INTERVAL_SECONDS

This gives a few useful properties:

Controlled downstream pressure. Every incoming MQTT message doesn't immediately become an outbound HTTP request.

Cleaner retry behavior. Failures happen at the batch level, not hidden inside a per-message request loop.

Easier tuning. If downstream can handle more throughput, I change two config values. No ingestion logic changes.

This system is not ultra real-time. That's a deliberate tradeoff. I'd rather have a controlled delivery loop with predictable behavior than a fast path that breaks in subtle ways under load.
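The core of that delivery loop can be sketched as follows; post_batch stands in for the real authenticated HTTP POST, and the batch size is an assumed default:

```python
import sqlite3

SEND_BATCH_SIZE = 100  # assumed default; the real value is configurable

def send_one_batch(conn, post_batch):
    """Read the oldest rows, deliver them, and delete only on success.

    `post_batch` is an injected delivery function; it returns True only
    when downstream acknowledged the whole batch.
    """
    rows = conn.execute(
        "SELECT id, topic, ts, payload FROM mqtt_buffer ORDER BY id LIMIT ?",
        (SEND_BATCH_SIZE,),
    ).fetchall()
    if not rows:
        return 0
    batch = [{"id": r[0], "topic": r[1], "ts": r[2], "payload": r[3]} for r in rows]
    if post_batch(batch):
        # Acknowledged: only now do the rows leave the durable buffer.
        conn.executemany("DELETE FROM mqtt_buffer WHERE id = ?", [(r[0],) for r in rows])
        conn.commit()
        return len(rows)
    # Failed: record the attempt and leave the rows for the next cycle.
    conn.executemany(
        "UPDATE mqtt_buffer SET attempts = attempts + 1, last_error = ? WHERE id = ?",
        [("send failed", r[0]) for r in rows],
    )
    conn.commit()
    return 0
```

Failure and success are both batch-level state transitions on the same rows, which is what makes the retry behavior easy to observe.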


6. Auth belongs to delivery

The downstream service requires JWT auth. The sender worker:

  • fetches a token from AUTH_URL on startup
  • caches it in memory
  • attaches it to outbound requests
  • invalidates the cache on 401 and re-fetches

This keeps auth scoped to delivery. If the auth service goes down, MQTT messages still get accepted and buffered. Delivery pauses. Ingestion doesn't.
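The cache-and-invalidate pattern is small; in this sketch the token fetch is injected (in the real service it would be a request to AUTH_URL), and the class name is illustrative:

```python
import threading

class TokenCache:
    """Cache a JWT in memory; invalidate on 401 and re-fetch lazily."""

    def __init__(self, fetch_token):
        self._fetch = fetch_token  # e.g. a function that POSTs to AUTH_URL
        self._token = None
        self._lock = threading.Lock()

    def get(self):
        # Fetch only when there is no cached token.
        with self._lock:
            if self._token is None:
                self._token = self._fetch()
            return self._token

    def invalidate(self):
        # Called when an outbound request comes back 401.
        with self._lock:
            self._token = None
```

The lock matters because only the sender thread uses the token today, but nothing in the design guarantees that stays true.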


7. Pre-send reachability check

Before each send attempt, the sender does a lightweight reachability check — either against a configured health endpoint or a TCP probe fallback.

This is less about sophistication and more about avoiding unnecessary work when downstream is unavailable.

A quick check before each batch keeps retry behavior quieter and more predictable, which makes the system easier to reason about under failure.
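The TCP-probe fallback is a few lines of stdlib; this is a sketch of the idea, not the project's exact check:

```python
import socket
from urllib.parse import urlparse

def downstream_reachable(url, timeout=2.0):
    """Cheap pre-send check: can we open a TCP socket to the downstream host?

    This proves reachability, not application health -- a configured
    health endpoint is the better signal when one exists.
    """
    parsed = urlparse(url)
    host = parsed.hostname
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

When this returns False, the sender can skip the batch entirely instead of burning a full HTTP timeout per cycle.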


8. At-least-once, stated clearly

This service provides at-least-once delivery from the local buffer. Not exactly-once.

If a batch is processed downstream but the acknowledgement doesn't arrive cleanly, the sender will retry. That means downstream consumers need to be idempotent.

That’s a reasonable contract for this layer.
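On the consumer side, idempotency can be as simple as deduplicating on the producer's record id. This is a downstream-side sketch, not part of the buffer service, and the table name is hypothetical:

```python
import sqlite3

# Downstream-side sketch: at-least-once delivery means a retried batch
# may repeat records, so the consumer dedups on the producer's record id.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ingested (record_id INTEGER PRIMARY KEY, topic TEXT, payload TEXT)"
)

def ingest_idempotent(records):
    # INSERT OR IGNORE makes replaying the same batch a no-op.
    conn.executemany(
        "INSERT OR IGNORE INTO ingested (record_id, topic, payload) VALUES (?, ?, ?)",
        [(r["id"], r["topic"], r["payload"]) for r in records],
    )
    conn.commit()
```

Any equivalent mechanism works (a unique constraint, an upsert, a seen-ids set); the point is that retries must be harmless by construction.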


9. Operational behavior by design

Even a small service needs to be operationally trustworthy:

  • logs go to stdout
  • failure events (Downstream send failed, Unauthorized, MQTT connect failed) are explicit and visible
  • SIGINT / SIGTERM trigger graceful shutdown
  • rows not acknowledged before shutdown are retried on next start

That last point matters. If the process stops mid-send, the next start simply retries any rows that were not acknowledged. That makes shutdown behavior predictable and recovery straightforward.
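The graceful-shutdown wiring is mostly a shared Event; a sketch of the signal handling, with names chosen for illustration:

```python
import signal
import threading

stop_event = threading.Event()

def _handle_shutdown(signum, frame):
    # SIGINT / SIGTERM: ask the worker loops to stop. Rows not yet
    # acknowledged stay in SQLite and are retried on the next start.
    stop_event.set()

signal.signal(signal.SIGINT, _handle_shutdown)
signal.signal(signal.SIGTERM, _handle_shutdown)
```

Because the delete-after-acknowledgment rule does the heavy lifting, shutdown doesn't need to flush anything in a hurry; it only needs to stop cleanly.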


Where the friction actually lives

The architecture is not complex. But the subtle edges are very real.

What "buffered" actually means

There's a meaningful difference between:

  • message received — it's in memory somewhere
  • message durably buffered — it will survive a restart

One of the first design decisions was defining that boundary clearly: a record is only "safe" once it has been written to SQLite.


Threading discipline

With a flush thread writing rows and a sender thread reading and deleting them, ownership of shared state matters. I used explicit locks around:

  • database access
  • in-memory state snapshots
  • saved hashes for deduplication
  • the token cache

Once two threads are interacting with the same rows and shared state, small mistakes can turn into failures that are difficult to reproduce and debug.


Disk is part of the capacity model

If downstream stays offline long enough, the SQLite file keeps growing. That’s not a bug — it’s the service doing exactly what it was designed to do.

But it does mean disk is part of the capacity model now.

In practice, that means:

  • monitor SQLite file size in production
  • store the database on persistent writable storage (not an ephemeral container layer)
  • alert if buffer growth rate becomes abnormal
  • think about retention policy if the domain allows data expiry

Choosing local disk over message loss only works well if the storage side of the system has been thought through too.
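A capacity check along these lines is enough to feed a monitor or alert; the path and threshold here are assumed values:

```python
import os

BUFFER_DB_PATH = "buffer.db"            # assumed path
MAX_BUFFER_BYTES = 500 * 1024 * 1024    # assumed 500 MB alert threshold

def buffer_size_ok(path=BUFFER_DB_PATH, limit=MAX_BUFFER_BYTES):
    """Is the SQLite file (plus its WAL sidecar) under the size limit?"""
    size = 0
    for suffix in ("", "-wal"):
        try:
            size += os.path.getsize(path + suffix)
        except OSError:
            pass  # file may not exist yet
    return size < limit
```

In WAL mode the -wal file can grow independently of the main database, so measuring only the .db file understates actual disk use.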


When retries get complicated

Retries feel straightforward until the scenarios get real:

  • What if downstream partially processed the batch before timing out?
  • What if auth failed after rows were already selected?
  • What if the process restarted between "sent" and "acknowledged"?

This is exactly why I didn't try to build stronger delivery guarantees than the system actually has. The sender deletes rows only after acknowledgment. That's the safest default. And it means downstream needs to be idempotent. Which is the right place to put that responsibility.


Auth failure is a system concern

If downstream delivery requires a valid JWT, then auth availability is part of the delivery path.

That leads to a few design questions:

  • Should auth failure stop ingestion? → No.
  • Should the sender keep retrying with a known-bad token? → No.
  • Should token fetch happen per request? → No.

That’s why the sender caches the token and invalidates it on 401. Auth failure stays scoped to delivery instead of propagating back into ingestion.


From local defaults to production

The code works fine locally over plain HTTP and a local auth endpoint.

That does not mean those defaults should carry into production unchanged.

Before anything beyond local use:

  • AUTH_URL and DOWNSTREAM_URL should use HTTPS
  • MQTT should use TLS if the broker is outside a trusted network
  • secrets should live in environment variables, not committed config files
  • SQLite should live on persistent storage, not an ephemeral container layer

These are not large design changes, but they are part of turning a working service into a production-safe one.


What shifted in how I think about this kind of work

This project kept reinforcing the same idea:

Reliability is mostly about where you choose to put state.

For this deployment context — edge, single process, operationally minimal — SQLite was the right fit. But the broader lesson goes beyond the tool choice. Sometimes the more useful move is to:

  • accept the data
  • persist it locally
  • decouple delivery
  • retry predictably
  • make the trade-offs explicit

That may not be a flashy architecture, but it is still architecture.

I've also started caring more about the distinction between works when healthy and behaves predictably when unhealthy. The second one is harder, and usually matters more.

Most painful bugs don't happen on the happy path. They happen when two otherwise normal systems fail in slightly different ways at the same time.

That's the kind of behavior I keep trying to design for.


Final thought

If I had to compress this project into one sentence:

I didn't just build an MQTT subscriber. I built a small failure-tolerant delivery boundary.

Once one system is producing data and another is consuming it, the real question isn't "can they talk to each other?" It's "what happens when they can't?"

For this project, the answer was: receive everything, persist locally, deliver in batches, retry predictably, and keep the system boring enough to debug under pressure.

Not the only valid design, but for an edge ingestion component, it felt like the right one.


If you've built something similar

I'm genuinely curious how others have handled this boundary — whether you buffered locally first or sent directly to HTTP, used Redis or Kafka instead of SQLite, treated delivery as best-effort or durable.

And more specifically: where did you decide "safe" actually begins?

That’s usually where the real design trade-offs start to show.
