How to design a fault-tolerant IoT data pipeline for asset tracking at scale

#iot #backend #kafka #architecture

A pipeline that is designed to work with 100 devices is not a pipeline that can work with 10,000 devices. Furthermore, a pipeline that works on Tuesday morning is not a pipeline that works in a case of network disruption, database failover, or MQTT broker crash. Fault tolerance should be taken into consideration when developing the architecture. Here’s how.

Failure modes to take into account while designing fault tolerant infrastructure

Loss of device connection
Devices lose network connections because of network disruptions. The data has to be buffered and then replayed upon connection restoration.

Crash of a broker
Restarting an MQTT broker during processing. Persistent sessions and Quality-of-Service levels 1 & 2 provide the solution to this issue.

Lagging consumer
Processing is delayed by various reasons. In the absence of a backpressure mechanism, the pipeline will stop working due to increasing pressure.

Database failure
The failover process occurs while writing events. Absence of a retry mechanism will cause event loss forever.

The fault-tolerant pipeline architecture

Layer 1
Device-side local buffer

Tags cache events when disconnected. Events cached will be replayed sequentially prior to new events being sent. No data loss occurs when there is no network connectivity.

Layer 2
MQTT with persistent sessions & QoS = 1

Persistent sessions preserve broker session state upon disconnection. QoS 1 provides at-least-once guarantee, meaning broker retries till an acknowledgement from the device is received. No message drops happen silently.

Layer 3
Kafka with replication & DLQ

Event writes into Kafka with replication factor three. If any event fails to be processed, the event is sent to the DLQ - no data gets lost here.

Layer 4
Idempotent consumers with retry

Idempotent consumers handle events exactly once with help of deduplication keys. If writing to database fails, exponential backoff is used to retry. Nothing gets silently dropped or duplicated.

Layer 5
TimescaleDB with read replicas

All writes happen to the primary DB. Reads and dashboard are served by read replicas. The primary DB failing does not affect the running dashboard due to read replica.

Step 1 – Buffering on Device Side

Buffer the events locally in the device (Raspberry Pi, or embedded MCU), in case of broker unavailability, and send them back to the broker on reconnect:

# Python device client with local SQLite buffer
import sqlite3, json, paho.mqtt.client as mqtt

db = sqlite3.connect('buffer.db')
db.execute('CREATE TABLE IF NOT EXISTS queue (id INTEGER PRIMARY KEY, payload TEXT, ts REAL)')

def buffer_event(payload):
  db.execute('INSERT INTO queue (payload, ts) VALUES (?, ?)', [json.dumps(payload), time()])
  db.commit()

def flush_buffer(client):
  rows = db.execute('SELECT id, payload FROM queue ORDER BY ts ASC').fetchall()
  for row_id, payload in rows:
    result = client.publish('assets/+/telemetry', payload, qos=1)
    if result.rc == mqtt.MQTT_ERR_SUCCESS:
      db.execute('DELETE FROM queue WHERE id = ?', [row_id])
  db.commit()

Step 2 – MQTT Persistent Sessions

Use the clean_session=False broker setting, making sure that the broker buffers unacknowledged messages and sends them back on reconnect:

# Persistent session — broker retains state across disconnects
client = mqtt.Client(
  client_id='asset-tag-7821',
  clean_session=False  # key setting — retain session on reconnect
)
client.reconnect_delay_set(min_delay=1, max_delay=60)

def on_connect(client, userdata, flags, rc):
  if flags['session present']:
    print('Session resumed — broker has queued messages')
  flush_buffer(client)  # replay local buffer on reconnect

Step 3 – Kafka DLQ

Any event that cannot be processed after N attempts lands up in a Kafka DLQ instead of getting dropped altogether. The operations team can then examine, fix, and replay them without losing any data:

const MAX_RETRIES = 3

async function processWithRetry(event, producer) {
  let attempts = 0

  while (attempts < MAX_RETRIES) {
    try {
      await processEvent(event)
      return  // success — exit loop
    } catch (err) {
      attempts++
      const delay = Math.pow(2, attempts) * 1000  // exponential backoff
      await sleep(delay)
    }
  }

  // All retries failed — send to DLQ
  await producer.send({
    topic: 'iot-events-dlq',
    messages: [{ value: JSON.stringify({ event, failedAt: new Date() }) }]
  })
}

Step 4 – Idempotent Writes & Dedupe
Your QoS1 messages will lead to duplicate messages. Therefore, your write calls need to be idempotent; i.e., the same message processed twice needs to give you the same output, no duplicates:

-- Idempotent insert using ON CONFLICT DO NOTHING
-- event_id is a hash of (asset_id + timestamp)
INSERT INTO asset_locations
  (event_id, asset_id, lat, lng, temp_c, time)
VALUES
  ($1, $2, $3, $4, $5, $6)
ON CONFLICT (event_id) DO NOTHING;

-- Generate event_id in application layer
-- const eventId = crypto.createHash('sha256')
-- .update(`${assetId}:${timestamp}`).digest('hex')

Tip for deduplication window: When you have assets that update every few seconds and produce events continuously, cache latest event_ids in Redis with 60 sec TTL to avoid database look-up for deduplication on each insert operation. You can use Redis SET NX command to do sub-millisecond deduplication checks.

Production Fault Tolerance Checklist

✓ Device local buffer: SQLite or Flash-based storage; will replay events when reconnecting in order.
✓ MQTT sessions with clean_session=False, and minimum QoS 1.
✓ Kafka replication factor = 3. Survives single broker failure with zero messages lost.
✓ Dead Letter Queue. Failed events are captured, inspectable and re-playable.
✓ Exponential back-off during retries to avoid thundering herd effect.
✓ Idempotent database writes ON CONFLICT DO NOTHING. Deduplicate by event_id
✓ Read Replicas for dashboards. Write and read paths don’t interfere with each other
✓ Health Checks and alerts: Consumer lag, DLQ depth, and broker health are always monitored.

The most frequent mistake: Implementing retry mechanism without implementing idempotent writes. Retry mechanism without idempotence turns transient error into duplicate data - a much more difficult problem to solve than the one you've faced before.

Stack recommendations

Mosquitto / HiveMQ
Kafka (RF=3)
KafkaJS / confluent-kafka
Redis (for deduplication)
TimescaleDB & replica
Prometheus & Grafana

Three key metrics to watch for in production: Kafka consumer lag (higher lag means a bottleneck in consumption), DLQ queue size (all messages in here require immediate attention), and MQTT reconnection rates (high numbers mean device connectivity problems). You should set alerts on all three.

The IoT data pipeline of AssetTrackPro is based on this exact fail-proof architecture — device-side buffering, Kafka ingestion, idempotent writes, and DLQ monitoring. Find out more →

Working on your IoT pipeline? Get the reliability and scalability of AssetTrackPro's infrastructure that takes care of everything under the hood, allowing you to focus on product development.

Learn more about AssetTrackPro ↗