applekoiot

Posted on Jun 10 • Originally published at blog.appleko.io

Surviving the Dead Zone: Keeping a Cold-Chain Temperature Record Whole Offline

#iot #embedded #architecture #dataengineering

Why is a cold-chain record only as good as its worst gap?

Because a cold-chain temperature record is judged at one moment only: after the trip, when a shipment is questioned, a batch is held, or an auditor asks what happened. At that point the live dashboard is no longer the evidence of record; what matters is whether the stored history is complete, time-true, and honest about its own uncertainty. The hard part of building a wireless temperature logger is not reading a thermistor every few minutes — it is capturing locally, detecting gaps, preserving provenance, and making any missing or doubtful data explicit, even when the network is gone for hours.

And on many cold-chain lanes, the network is gone for material stretches of the route.

Why is the connectivity dead zone the default, not the exception?

Because refrigerated freight spends much of its life in RF-hostile places: steel shipping containers, ocean legs with no cellular coverage, rural corridors, the metal-and-moisture interior of a cold room. A design that assumes a live uplink will silently drop exactly the readings taken in those stretches — which are often the very periods a reviewer later cares about, even when they are not the most thermally risky. The first architectural commitment is therefore that the device must never depend on connectivity to capture data. Connectivity is for delivery, not capture.

How should a logger capture data when it is offline?

Locally, and unconditionally. The device writes every sample to non-volatile memory on a fixed cadence, independent of whether any phone or gateway is in range. The raw record can be compact:

#include <stdint.h>

// Packed layout uses a GCC/Clang attribute; other compilers need their own equivalent.
typedef struct {
    uint32_t ts;        // seconds since epoch, from the RTC
    int16_t  temp_cC;   // centi-degC  (-1850 = -18.50 C)
    uint8_t  flags;     // bit0 alarm, bit1 light/exposure, bit2 backfilled
} __attribute__((packed)) sample_t;   // 7-byte raw payload

Those 7 bytes are the payload only. In practice each record also carries a sequence number and a schema version, and the storage layer adds a CRC, atomic power-fail-safe commits, and flash wear-leveling and erase-block overhead. At a 5-minute interval, 30,000 samples come to roughly 210 KB of raw payload (30,000 × 7 bytes) and about 104 days of history (30,000 × 5 minutes), comfortably longer than many planned shipments. When the log is held as a ring buffer of that size, a shorter trip is not truncated by normal wraparound, provided the backlog uploads before old data is overwritten.
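As a rough sketch of that bookkeeping, the block below wraps the 7-byte payload in a hypothetical on-flash record and reproduces the retention arithmetic. The field widths, the names `stored_record_t`, `seal_record`, `record_is_intact`, and `retention_days`, and the choice of CRC-16/CCITT-FALSE are all assumptions made for illustration, not the layout of any particular logger.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Raw payload as defined in the article: 7 bytes when packed. */
typedef struct {
    uint32_t ts;
    int16_t  temp_cC;
    uint8_t  flags;
} __attribute__((packed)) sample_t;

/* Hypothetical on-flash record: payload plus bookkeeping fields. */
typedef struct {
    uint32_t seq;      /* monotonic per-device sample counter */
    uint8_t  schema;   /* record layout version */
    sample_t sample;
    uint16_t crc;      /* CRC over every byte before this field */
} __attribute__((packed)) stored_record_t;

static_assert(sizeof(sample_t) == 7, "raw payload should stay 7 bytes");
static_assert(sizeof(stored_record_t) == 14, "assumed layout is 14 bytes");

/* CRC-16/CCITT-FALSE: polynomial 0x1021, initial value 0xFFFF, no reflection. */
static uint16_t crc16_ccitt(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFF;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)(data[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000)
                crc = (uint16_t)((crc << 1) ^ 0x1021);
            else
                crc = (uint16_t)(crc << 1);
        }
    }
    return crc;
}

/* Fill in the CRC once the other fields are final. */
static void seal_record(stored_record_t *r)
{
    r->crc = crc16_ccitt((const uint8_t *)r, offsetof(stored_record_t, crc));
}

/* Nonzero if the stored CRC still matches the record contents. */
static int record_is_intact(const stored_record_t *r)
{
    return r->crc == crc16_ccitt((const uint8_t *)r, offsetof(stored_record_t, crc));
}

/* Days of history that `capacity` samples cover at a fixed interval. */
static double retention_days(uint32_t capacity, uint32_t interval_s)
{
    return (double)capacity * (double)interval_s / 86400.0;
}
```

With this assumed 14-byte layout, 30,000 records take 420,000 bytes before any wear-leveling or erase-block overhead, which is why the 210 KB payload figure is a floor rather than a flash budget.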

Two details matter. First, the sampling clock runs off the device's own RTC, not off connectivity events; a missed upload must never become a missed sample. Second, the flags byte carries an exposure bit driven by an onboard light sensor — useful as a signal that a sealed carton met daylight in transit, though it is an exposure indicator rather than tamper-proof evidence (opaque packaging, darkness, or a device buried under the payload can all mask an opening).
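The first of those details, capture driven by the RTC and delivery driven by the link, might look something like the sketch below. `logger_t`, `logger_tick`, and `logger_backlog` are hypothetical names, and the counters stand in for the real sensor read, flash commit, and acknowledged upload.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SAMPLE_INTERVAL_S 300u   /* 5-minute cadence, as in the article */

typedef struct {
    uint32_t next_due;   /* RTC time at which the next sample is due */
    uint32_t captured;   /* samples committed to local storage */
    uint32_t uploaded;   /* samples the server has acknowledged */
} logger_t;

/* Called from the main loop with the current RTC time and link state.
 * Assumes it runs more often than once per sample interval. */
static void logger_tick(logger_t *lg, uint32_t rtc_now, bool link_up)
{
    assert(lg != NULL);

    /* Capture path: depends only on the RTC, never on the link. */
    if ((int32_t)(rtc_now - lg->next_due) >= 0) {
        lg->captured++;                     /* stand-in for sensor read + flash commit */
        lg->next_due += SAMPLE_INTERVAL_S;  /* hold the fixed cadence */
    }

    /* Delivery path: drains the backlog only when a link exists. */
    if (link_up)
        lg->uploaded = lg->captured;        /* stand-in for an acknowledged upload */
}

/* Samples captured but not yet acknowledged. */
static uint32_t logger_backlog(const logger_t *lg)
{
    return lg->captured - lg->uploaded;
}
```

An hour with no link leaves twelve samples in the backlog and none missing; the first tick with a link drains them.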

Why is store-and-forward a sync problem, not a stream?

Because once a gateway or phone reappears and the backlog uploads, data can arrive late and out of order — which turns the server side into a reconciliation problem. A few rules keep the record trustworthy:

Idempotency on a stable key. Key each record by device identity plus a per-sample counter, not by timestamp, and add a persistent boot/session epoch so a counter that restarts after a reset stays unambiguous. Timestamps drift and can reset, so two distinct samples could collide on (device_id, ts); a per-device sequence is what makes re-delivery (common when a connection drops mid-upload) a safe no-op.
Capture order from the counter, wall-clock from the RTC. Use the sequence number for the order samples were taken, and a corrected, quality-flagged RTC timestamp for when they were taken. Wall-clock time shouldn't be the sole sequencing authority.
Gap vs. lag, decided by sequence. A hole isn't 'missing' just because rows are absent. If the server lacks sequences 1000–1040 while the device reports its current counter at 1040 and still holds those records, that's lag. If the device later reports counter 1100 with on-device retention now starting at 1051, and the server still has nothing below 1051, then 1000–1050 are a confirmed gap: those records were overwritten before they could upload. Telling the two apart needs the device's sequence and retention state, not just timestamp holes.
Provenance. Backfilled records stay flagged as backfilled, and events worth their own records — boot, brownout, RTC correction, config or threshold change, memory wrap, upload acknowledgement — sit alongside the samples so the timeline explains itself.
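The idempotency rule can be sketched as an ingest function that treats (device, boot epoch, sequence) as the identity of a row. Everything here is illustrative: `record_key_t`, `store_t`, and `ingest` are hypothetical names, and the fixed-size array with a linear scan stands in for whatever unique index a real database would provide.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t device_id;
    uint16_t boot_epoch;   /* persistent, bumped on every reset */
    uint32_t seq;          /* per-sample counter within a boot epoch */
} record_key_t;

typedef struct {
    record_key_t key;
    uint32_t     ts;       /* RTC timestamp: data, not identity */
    int16_t      temp_cC;
} row_t;

#define STORE_CAP 64
typedef struct {
    row_t  rows[STORE_CAP];
    size_t count;
} store_t;

typedef enum { INGEST_STORED, INGEST_DUPLICATE, INGEST_FULL } ingest_result_t;

static bool key_equal(record_key_t a, record_key_t b)
{
    return a.device_id == b.device_id &&
           a.boot_epoch == b.boot_epoch &&
           a.seq == b.seq;
}

/* Store a row unless its key is already present; re-delivery is a no-op. */
static ingest_result_t ingest(store_t *s, row_t row)
{
    assert(s != NULL);
    for (size_t i = 0; i < s->count; i++) {
        if (key_equal(s->rows[i].key, row.key))
            return INGEST_DUPLICATE;
    }
    if (s->count == STORE_CAP)
        return INGEST_FULL;
    s->rows[s->count++] = row;
    return INGEST_STORED;
}
```

Because the timestamp is not part of the key, a retried upload of the same sample is dropped, while a sample taken after a reset with the same counter value and even the same (reset) timestamp is kept as a distinct row.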

Get these wrong and the symptoms are subtle: phantom gaps, double-counted excursions, or a timeline that looks clean because a loss was quietly papered over.
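The gap-versus-lag rule is small enough to write down directly. This is a minimal sketch under simplifying assumptions: a single boot epoch, no counter wraparound, and a device that reports both its current counter and the oldest sequence it still retains. `device_state_t` and `classify_missing` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { RANGE_LAG, RANGE_GAP, RANGE_MIXED } range_status_t;

/* What the device reports about itself on sync (assumed fields). */
typedef struct {
    uint32_t oldest_retained;  /* lowest sequence still held on the device */
    uint32_t current;          /* most recent sequence assigned */
} device_state_t;

/* Classify the inclusive range [first, last] of sequences the server lacks. */
static range_status_t classify_missing(device_state_t dev,
                                       uint32_t first, uint32_t last)
{
    assert(first <= last && last <= dev.current);

    if (last < dev.oldest_retained)
        return RANGE_GAP;    /* every record was overwritten before upload */
    if (first >= dev.oldest_retained)
        return RANGE_LAG;    /* every record is still waiting on the device */
    return RANGE_MIXED;      /* older part lost, newer part still pending */
}
```

Fed the numbers from the example above, a device at counter 1040 that still holds 1000 onward yields lag for 1000–1040, and a device at counter 1100 with retention starting at 1051 yields a confirmed gap for 1000–1050.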

What makes timestamp integrity the quietly hard part?

A temperature value is only meaningful if its 'when' can be trusted, and time is where low-cost loggers tend to fail: RTCs drift, and a battery dip can reset the clock. If a device timestamps a hundred days of samples against a clock that silently jumped, the record is precise and wrong — the worst combination in an audit. The mitigations are well understood but easy to skip:

Discipline the RTC against an authenticated, server-authoritative time source on sync — not just whatever a phone reports — and record the correction's source and magnitude instead of silently rewriting history.
Keep the monotonic sample counter alongside wall-clock time, so capture order survives a clock reset.
Flag any detected time discontinuity in the record itself, so a reviewer sees it rather than inheriting a hidden error.
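One way to sketch these mitigations: compare the timestamp step between consecutive samples with the step the counter implies, flag the mismatch on the record, and log each RTC correction as its own event. The flag bit `FLAG_TIME_SUSPECT`, the tolerance parameter, and the names `stamped_t`, `flag_discontinuities`, `rtc_correction_t`, and `correct_rtc` are assumptions for illustration; samples are assumed to be in ascending sequence order at a fixed interval.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FLAG_TIME_SUSPECT 0x08u  /* hypothetical bit 3; the article assigns bits 0-2 only */

typedef struct {
    uint32_t seq;    /* monotonic sample counter */
    uint32_t ts;     /* RTC seconds at capture */
    uint8_t  flags;
} stamped_t;

/* Flag each sample whose timestamp step disagrees with the counter step
 * by more than tol_s seconds. Returns the number of samples flagged. */
static size_t flag_discontinuities(stamped_t *s, size_t n,
                                   uint32_t interval_s, uint32_t tol_s)
{
    size_t flagged = 0;
    for (size_t i = 1; i < n; i++) {
        int64_t expected = (int64_t)(s[i].seq - s[i - 1].seq) * (int64_t)interval_s;
        int64_t actual   = (int64_t)s[i].ts - (int64_t)s[i - 1].ts;
        int64_t err      = actual > expected ? actual - expected : expected - actual;
        if (err > (int64_t)tol_s) {
            s[i].flags |= FLAG_TIME_SUSPECT;
            flagged++;
        }
    }
    return flagged;
}

typedef enum { TIME_SRC_SERVER, TIME_SRC_PHONE } time_source_t;

/* Event record for a clock correction: what changed, by how much, and why. */
typedef struct {
    uint32_t      at_seq;    /* sample counter when the correction happened */
    int64_t       delta_s;   /* reference time minus device time */
    time_source_t source;
} rtc_correction_t;

/* Set the RTC to the reference time and return an event describing the change. */
static rtc_correction_t correct_rtc(uint32_t *rtc, uint32_t reference,
                                    uint32_t at_seq, time_source_t source)
{
    assert(rtc != NULL);
    rtc_correction_t ev = { at_seq, (int64_t)reference - (int64_t)*rtc, source };
    *rtc = reference;
    return ev;
}
```

The stored timestamps are left as captured. The flag and the correction event give the server enough to reconstruct corrected times while the original values stay visible to a reviewer.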

Completeness answers 'did anything go missing?' Timestamp integrity answers 'can the timing be trusted?' A defensible record needs both.

What makes a temperature record defensible?

A record is defensible when a complete, time-true history is backed by integrity it can prove: calibration traceability tied to a device identity, an append-only audit log of any change, and tamper-evident, signed exports. Access control alone won't stop a privileged operator, a buggy migration, or a compromised service from altering data. The export (a clean PDF or CSV with events flagged) is the artifact most reviewers actually open, but it should trace back to append-only, write-once-read-many (WORM) style raw records and carry generation metadata: device identity, calibration reference, and a hash or signature. In regulated pharmaceutical distribution, the EU Good Distribution Practice (GDP) guidelines expect calibrated temperature monitoring, documented investigation of excursions, and retained records. (US FSMA 204, often cited in this context, is a food traceability record-keeping rule rather than a temperature mandate, with compliance pushed to July 20, 2028; it is relevant to which records must exist, not to how a logger is built.)
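The tamper-evident part can be illustrated with a hash chain, where each link covers the previous link plus the current row, so editing one row invalidates every link from that point on. The sketch below shows only the chaining structure. It uses FNV-1a, which is not a cryptographic hash, purely to stay self-contained; a real export would presumably use SHA-256 or similar and sign the final link. `export_row_t`, `link_hash`, `build_chain`, and `verify_chain` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t seq;
    uint32_t ts;
    int16_t  temp_cC;
    uint8_t  flags;
} export_row_t;

#define FNV_OFFSET_BASIS 0xcbf29ce484222325ULL
#define FNV_PRIME        0x100000001b3ULL

/* FNV-1a (64-bit), continued from state h. Placeholder only: NOT cryptographic. */
static uint64_t fnv1a64(uint64_t h, const void *data, size_t len)
{
    const uint8_t *p = data;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= FNV_PRIME;
    }
    return h;
}

/* One link: hash of the previous link followed by the row's fields.
 * Fields are hashed one by one so struct padding never enters the hash. */
static uint64_t link_hash(uint64_t prev, const export_row_t *r)
{
    uint64_t h = fnv1a64(FNV_OFFSET_BASIS, &prev, sizeof prev);
    h = fnv1a64(h, &r->seq, sizeof r->seq);
    h = fnv1a64(h, &r->ts, sizeof r->ts);
    h = fnv1a64(h, &r->temp_cC, sizeof r->temp_cC);
    h = fnv1a64(h, &r->flags, sizeof r->flags);
    return h;
}

/* Compute one link per row; links[n - 1] is the value to publish or sign. */
static void build_chain(const export_row_t *rows, size_t n, uint64_t *links)
{
    uint64_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        links[i] = link_hash(prev, &rows[i]);
        prev = links[i];
    }
}

/* Returns the index of the first row that fails verification, or n if all pass. */
static size_t verify_chain(const export_row_t *rows, size_t n, const uint64_t *links)
{
    uint64_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        if (link_hash(prev, &rows[i]) != links[i])
            return i;
        prev = links[i];
    }
    return n;
}
```

A chain on its own only detects edits made by someone who cannot recompute the links, which is why the final link needs to be signed or anchored somewhere the editor cannot reach.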

Does Bluetooth 6.0 change any of this?

Not really. Bluetooth 6.0's headline feature, Channel Sounding, is about precise distance measurement and locating assets — not temperature accuracy or data completeness. Any practical gain for a logger comes from the actual controller, firmware, gateway scan policy, and power budget, not from the spec number on the box. A newer radio is welcome; it changes nothing about the buffering and reconciliation work above.

What should this change about how you build?

Treat the dashboard as a view, not the source of truth. The product is the retained, reconcilable record — one that captures offline, survives a clock reset, and can show exactly where it's certain and where it isn't. Build for the dead zone, because the dead zone is where the dispute lives.

Note: this article was drafted with AI assistance and reviewed for technical accuracy before publishing.
