This bug has a name. It's called a race condition, and if you've built
anything on top of MQTT, Zigbee, Z-Wave, or any event-driven IoT
protocol, you've hit it.
Here's how it happens.
Your device drops connection. It immediately reconnects. Two events fire:
status: offline and status: online. They travel over the network and
arrive in the wrong order. Your system processes online first, then
offline. Your monitoring dashboard marks the device dead. The device
is sitting right there, fully operational, blinking at you.
You've just been lied to by your own infrastructure.
Why the obvious fix doesn't work
The first thing everyone tries is a timestamp check. "I'll just use
whichever event has the later timestamp." Reasonable. Except:
- Firmware clocks drift. A device that's been running for 3 weeks may be 40+ seconds off from server time.
- Some firmware doesn't include reliable timestamps at all.
- Battery-powered devices often have uninitialized clocks on first boot.
So now you're trusting timestamps that lie. You've replaced one bug
with a subtler one.
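To see the failure concretely, here's a minimal sketch of the "just use the later timestamp" fix applied to a device whose clock has drifted 40 seconds behind server time (the timestamps and drift value are illustrative):

```python
from datetime import datetime, timezone, timedelta

# A device whose firmware clock drifted 40 seconds behind server time.
DRIFT = timedelta(seconds=40)

real_disconnect = datetime(2026, 3, 1, 14, 32, 30, tzinfo=timezone.utc)
events = [
    # Reconnect event, stamped by the (accurate) gateway clock.
    {"status": "online", "ts": datetime(2026, 3, 1, 14, 32, 5, tzinfo=timezone.utc)},
    # The later disconnect, stamped by the drifted device clock.
    {"status": "offline", "ts": real_disconnect - DRIFT},  # stamped 14:31:50 -- "earlier"!
]

# "Just use whichever event has the later timestamp" picks the stale event.
latest = max(events, key=lambda e: e["ts"])
assert latest["status"] == "online"  # wrong: the device actually went offline last
```

The disconnect genuinely happened last, but its drifted timestamp sorts it first, so the timestamp check commits the stale state.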
The second thing people try is a debounce delay. "I'll wait 500ms before
committing any state change." This works until your legitimate disconnect
events start arriving late and your automations stop firing. You've
traded false positives for false negatives.
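The debounce workaround looks roughly like this (a minimal sketch; the 500ms hold comes from the paragraph above, everything else is illustrative):

```python
DEBOUNCE_MS = 500  # hold every state change this long before committing

# device_id -> (pending status, time at which it may be committed)
pending: dict[str, tuple[str, float]] = {}

def on_event(device_id: str, status: str, now_ms: float) -> None:
    # Park the change; a contradicting event within the window overwrites it.
    pending[device_id] = (status, now_ms + DEBOUNCE_MS)

def flush(now_ms: float) -> dict[str, str]:
    # Commit only the changes whose hold has expired.
    committed = {}
    for dev, (status, due) in list(pending.items()):
        if now_ms >= due:
            committed[dev] = status
            del pending[dev]
    return committed
```

The cost is visible in `flush`: every legitimate disconnect also commits at least 500ms late, which is exactly the false-negative trade described above.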
What actually works
You need to stop trusting any single signal in isolation. The right
approach is multi-signal arbitration — weighing several inputs together
before committing to a state change:
1. Event arrival time — when did this event actually reach your
server, regardless of what the device timestamp says?
2. Signal strength at time of event — a disconnect from a device
with -55 dBm RSSI is suspicious. A disconnect from a device at -91 dBm
is expected.
3. Sequence continuity — did we see a clean sequence of events, or
did something arrive out of order?
4. Plausibility window — if an offline event and an online event arrive within
300ms of each other, the offline event is almost certainly a ghost.

When you weigh all four together, the noise collapses. The race condition
stops mattering because no single late-arriving event can flip your state.
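The four signals above can be combined in a short scoring function. This is a minimal sketch of the idea — the weights, the zero threshold, and the field names are all illustrative, not SignalCend's actual values:

```python
from dataclasses import dataclass

PLAUSIBILITY_WINDOW_MS = 300  # window from the list above

@dataclass
class Event:
    device_id: str
    status: str      # "online" or "offline"
    arrival_ms: int  # server-side arrival time, not the device timestamp
    rssi_dbm: int    # signal strength reported alongside the event
    seq: int         # per-device sequence counter

def arbitrate(prev: Event, curr: Event) -> str:
    """Weigh all four signals; return the state to commit."""
    score = 0.0
    # 1. Event arrival time: server-side ordering, device clock ignored.
    if curr.arrival_ms >= prev.arrival_ms:
        score += 0.25
    # 2. Signal strength: a disconnect from a strong link is suspicious.
    if curr.status == "offline" and curr.rssi_dbm > -70:
        score -= 0.30
    # 3. Sequence continuity: a clean increment earns trust.
    if curr.seq == prev.seq + 1:
        score += 0.25
    # 4. Plausibility window: opposite states within 300ms look like a ghost.
    if (curr.status != prev.status
            and curr.arrival_ms - prev.arrival_ms < PLAUSIBILITY_WINDOW_MS):
        score -= 0.30
    # Commit the new state only if the combined evidence clears the bar.
    return curr.status if score > 0 else prev.status
```

No single signal can flip the result on its own: a late ghost disconnect loses on signal strength and the plausibility window even when it wins on arrival order.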
A concrete example
Say you receive these two events within 200ms of each other:
{ "device_id": "sensor_007", "status": "offline", "timestamp": "2026-03-01T14:32:01Z" }
{ "device_id": "sensor_007", "status": "online", "timestamp": "2026-03-01T14:32:00Z" }
The offline event arrived second but has a later timestamp. A naive
system commits offline. A multi-signal arbitration engine looks at
both events, notices the 200ms arrival gap, checks that signal strength
was strong at the time of the disconnect, confirms sequence continuity
was maintained, and resolves to online — because that's what actually
happened.
How I handle this now
I got tired of rebuilding this logic on every project, so I extracted
it into a standalone API called SignalCend. You POST your raw device
payload to /v1/sign then /v1/resolve, and get back a single
authoritative state with a confidence score and a full arbitration trace
showing exactly why it resolved the way it did.
The demo works right now with no signup — grab the demo key from
signalcend.com and fire a request. The
whole round trip runs in under 50ms.
```shell
# Sign your payload
SIG=$(curl -s -X POST https://api.signalcend.com/v1/sign \
  -H "Content-Type: application/json" \
  -d '{"api_key":"demo","payload":{"device_id":"sensor_007","status":"online","signal_strength":-58}}' \
  | python3 -c "import sys,json; print(json.load(sys.stdin)['signature'])")

# Resolve the state
curl -X POST https://api.signalcend.com/v1/resolve \
  -H "Content-Type: application/json" \
  -H "X-Signature: $SIG" \
  -d '{"api_key":"demo","payload":{"device_id":"sensor_007","status":"online","signal_strength":-58}}'
```
The response includes recommended_action: "ACT" when confidence is
high, "CONFIRM" when it's moderate, and "LOG_ONLY" when there's too
much noise to act on. Your integration branches on a named signal instead
of hardcoding thresholds against a raw float.
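A handler branching on that signal might look like this. Only recommended_action and its three values come from the description above; the resolution's state field and the return strings are placeholders:

```python
def handle(resolution: dict) -> str:
    """Branch on the named signal instead of thresholding a raw float."""
    action = resolution["recommended_action"]
    if action == "ACT":
        return f"committed {resolution['state']}"  # high confidence: act now
    if action == "CONFIRM":
        return "recheck scheduled"                 # moderate: verify before acting
    return "logged"                                # LOG_ONLY: too noisy to act on
```

If the API later tightens what counts as high confidence, this code doesn't change — the threshold lives server-side, behind the named signal.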
The floor matters
One design decision worth noting: the confidence score never goes below
0.50. Even under maximum uncertainty, the API returns the best available
answer rather than an error. A result is always more useful than silence
— especially at 2am when something is on fire.
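The floor amounts to a clamp on the raw score — a two-line sketch, with the constant name assumed:

```python
CONFIDENCE_FLOOR = 0.50  # documented floor: never return less than this

def clamp_confidence(raw: float) -> float:
    # A best-effort answer always beats an error, so clamp instead of failing.
    return max(CONFIDENCE_FLOOR, min(1.0, raw))
```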
If you've hit this race condition in your own projects, I'd genuinely
like to hear how you handled it. The arbitration logic has a lot of edge
cases and I'm still finding new ones.