DEV Community

Tyler
Device State Is Not What Your Devices Report. It Is What Your Infrastructure Decides. Most Infrastructure Was Never Designed to Make That Decision.

A precise examination of the epistemological gap at the center of IoT state management — how distributed systems create the conditions for confident wrongness at scale, and what a rigorous decision function looks like when it is finally applied to the problem.

In epistemology — the branch of philosophy concerned with the nature and limits of knowledge — there is a distinction between justified belief and true belief.

A justified belief is one that follows rationally from the available evidence. A true belief is one that corresponds to reality. In normal circumstances, justified beliefs and true beliefs overlap significantly. In distributed systems under network stress, they diverge in ways that are systematic, predictable, and expensive.

Your IoT monitoring stack holds justified beliefs about device state. It receives events, processes them according to its logic, and arrives at conclusions that are internally coherent. In 34% of offline classifications, those conclusions do not correspond to physical reality. The belief is justified. The belief is wrong. And the system has no mechanism to know the difference.

This is not a software bug. It is an epistemological gap built into the architecture of every event-driven IoT system ever deployed. Understanding it precisely is the first step toward building infrastructure that can close it.


How a distributed network creates confident wrongness

The mechanism is worth tracing carefully because it is counterintuitive on first encounter.

A field device — a Siemens PLC on a production line, a Particle Boron cellular sensor on a remote asset, a Dexcom CGM transmitting patient glucose readings — generates state events as a function of its physical condition. When it drops connectivity, it generates a disconnect event. When it reconnects, it generates a reconnect event. The events are accurate at the moment of generation. The device knows its state. The events correctly represent that state.

Both events enter the network. The disconnect event, generated at T+0, and the reconnect event, generated at T+340ms, travel toward the broker through paths that the network selects independently. Network routing is not aware of the temporal relationship between these two packets. It routes them based on congestion, available paths, and QoS handling at each hop.

The reconnect event arrives at the broker first.

The broker, operating correctly, delivers it to all subscribers. The historian logs online. The monitoring system registers online. The automation layer continues normally.

340 milliseconds later, the disconnect event arrives.

The broker, operating correctly, delivers it to all subscribers. The historian logs offline. The monitoring system fires an alert. The automation layer triggers its offline response.

Every system in this chain made a justified decision based on available evidence. Every system was wrong about physical reality. The device reconnected before the disconnect event was processed. It has been continuously online since T+340ms. None of the systems involved had any mechanism to know this.

The wrongness was not caused by a failure. It was caused by a success — the network successfully delivered both events, in arrival order, to all subscribers. The system worked exactly as designed. The result was incorrect.
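The failure can be reproduced in a few lines. The sketch below is a minimal illustration with a hypothetical event structure of my own; it shows that a consumer applying events in arrival order concludes offline, while ordering the same events by generation time recovers the true state.

```python
from dataclasses import dataclass

@dataclass
class StateEvent:
    device_id: str
    state: str         # "online" or "offline"
    generated_ms: int  # device-side generation timestamp

# Events as generated: disconnect at T+0, reconnect at T+340ms.
disconnect = StateEvent("sensor-7", "offline", generated_ms=0)
reconnect = StateEvent("sensor-7", "online", generated_ms=340)

# The network delivers the reconnect first; the broker forwards in arrival order.
arrival_order = [reconnect, disconnect]

def last_writer_wins(events):
    """Naive consumer: the most recently *arrived* event defines state."""
    state = None
    for ev in events:
        state = ev.state
    return state

def generation_order_state(events):
    """What the same consumer would conclude by ordering on generation time."""
    return max(events, key=lambda ev: ev.generated_ms).state

print(last_writer_wins(arrival_order))        # "offline" -- justified, wrong
print(generation_order_state(arrival_order))  # "online"  -- matches reality
```

Every step in `last_writer_wins` is individually correct; the wrongness lives entirely in the unstated assumption that arrival order equals generation order.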


The scale at which justified wrongness accumulates

Medical facilities will employ approximately 7.4 million Internet of Things devices by 2026. The IoT device management market is projected to grow from $2.8 billion in 2023 to $45 billion by 2033 at a 32% annual growth rate.

Those numbers frame the scale at which justified wrongness accumulates across the industry.

In manufacturing, IoT revenue reached $490 billion in 2025, driven substantially by real-time monitoring and predictive maintenance applications — applications that make automated decisions based on device state. Each decision made on a false negative — a device showing offline when online — or a false positive — a device showing online when genuinely offline — carries a cost that the industry has categorized as a normal operational expense rather than a solvable engineering problem.

The normalization of this expense is itself worth examining. A 23% false positive alert rate in a mature engineering field would typically trigger a root cause analysis and a remediation project. In IoT monitoring, it has been accepted as a property of the architecture rather than a failure mode to be engineered out. The acceptance makes sense historically — the tools to engineer it out did not exist in a form that could be applied generically across diverse deployment environments.

They exist now.


The epistemological requirements of a correct decision function

What would a system need to know to correctly resolve the device state in the scenario described above?

It would need to know the temporal relationship between the disconnect event and the reconnect event — not the arrival time relationship, which the broker provides, but the generation time relationship, which requires evaluating the device timestamps against a trusted time reference.

It would need to know whether the device timestamp is itself trustworthy — whether the device clock was synchronized recently enough to be used as a primary ordering signal, or whether clock drift has accumulated to the point where arrival sequencing is more reliable than device-reported timestamps.

It would need to know whether the signal environment that carried the disconnect event was sufficiently clean to treat its reported state as reliable, or whether RF degradation at the time of transmission elevated the probability that the event represents a transmission artifact rather than a genuine state change.

It would need to know the sequence context — whether the sequence numbers on these events are consistent with their reported order, or whether a causal inversion has occurred that indicates the disconnect event was generated before the reconnect event but arrived after it.

It would need to know the reconnect window context — whether the temporal proximity of the disconnect event to the current server time places it within the window where a late-arriving disconnect is more probable than a genuine new outage.

Each of these requirements corresponds to a specific signal available in the event payload and the event metadata. None of them individually produces a definitive answer. Together, weighted correctly against each other, they produce a verdict that is significantly more likely to correspond to physical reality than arrival order alone.
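As a rough illustration of how those five signals could combine, here is a weighted-vote sketch. The field names, weights, and thresholds are illustrative assumptions of mine, not the parameters of the published arbitration model.

```python
def arbitrate(disconnect, reconnect, ctx):
    """Weighted vote over the five signal classes described above.
    Each positive contribution favours the verdict that the late-arriving
    disconnect is stale and the device is in fact online."""
    score = 0.0
    # 1 & 2. Generation-time ordering, counted only when the device clock
    # was synchronized recently enough to trust (threshold is illustrative).
    if ctx["clock_sync_age_s"] < 300:
        if disconnect["generated_ms"] < reconnect["generated_ms"]:
            score += 0.35
    # 3. Signal environment: a degraded RF link at transmission time raises
    # the odds that the disconnect is a transmission artifact.
    if ctx["rssi_dbm"] < -100:
        score += 0.15
    # 4. Sequence context: a causal inversion (the earlier sequence number
    # arrived later) directly indicates reordering in transit.
    if disconnect["seq"] < reconnect["seq"]:
        score += 0.30
    # 5. Reconnect window: a disconnect arriving shortly after a reconnect
    # is more likely stale than a genuine new outage.
    if disconnect["arrived_ms"] - reconnect["arrived_ms"] < 2000:
        score += 0.20
    verdict = "online" if score >= 0.5 else "offline"
    return verdict, round(score, 2)
```

Fed the scenario from earlier — disconnect generated first, lower sequence number, arriving after the reconnect over a degraded link — every signal votes the same way and the function returns online, the state that matches physical reality.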

This is the five-step multi-signal arbitration model. It is not theoretical. It has been validated against 1.3 million real device state resolution events across production deployments and published for peer review: https://doi.org/10.5281/zenodo.19025514


The distributed systems principle this violates — and why the violation matters

There is a principle in distributed systems architecture called the principle of explicit assumptions — the requirement that any assumption a system makes about the reliability or ordering of its inputs should be explicitly stated in the system's design rather than implicitly embedded in its behavior.

Event-driven IoT architectures violate this principle systematically with respect to device state. The implicit assumption — that arrival order corresponds to generation order — is never stated in the system design. It is never flagged in the monitoring system's documentation. It is never surfaced in the historian's audit log. It is simply assumed, at every layer of the stack, by every consumer of device state events.

Out-of-order MQTT messages should be expected, and solutions should be designed with this principle in mind, according to AWS's own engineering guidance. The strategies recommended — sequence numbers, timestamp filtering — are necessary but not sufficient. They address the ordering problem within a single device's event stream. They do not address the arbitration problem that arises when the signals available about device state are themselves conflicting or degraded.
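The sequence-number strategy can be sketched as a per-device filter (the field names here are my own illustration, not a specific library's API):

```python
class DeviceStream:
    """Per-device ordering filter: discard any event whose sequence number
    is not newer than the last event applied. This implements the
    sequence-number strategy for a single device's stream only; it does
    not arbitrate between conflicting or degraded signals."""

    def __init__(self):
        self.last_seq = -1
        self.state = None

    def apply(self, event):
        """Return True if the event was applied, False if discarded as stale."""
        if event["seq"] <= self.last_seq:
            return False
        self.last_seq = event["seq"]
        self.state = event["state"]
        return True
```

Run against the reconnect-first scenario, the filter correctly discards the late disconnect — but only because the sequence numbers arrive intact. It has no answer when sequence context conflicts with timestamps or signal quality, which is exactly the arbitration gap.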

The arbitration problem requires a decision function. The decision function requires explicit outputs that tell the downstream system not just what the verdict is but how much evidence supported it and what action the evidence quality warrants.

This is the engineering gap that justified wrongness has been filling for two decades. Not because the industry lacked the intelligence to identify it. Because the tools to address it generically — across diverse deployment environments, protocols, hardware types, and signal conditions — did not exist until recently.


The practical architecture of explicit confidence

When a device state arbitration layer is placed between the broker and the downstream consumers — historian, monitoring system, automation layer — the output changes from a status string to a structured verdict.

The verdict contains the authoritative state after full multi-signal evaluation. It contains a confidence score between 0.20 and 1.0 reflecting the integrity of the signal environment. It contains a recommended action — ACT, CONFIRM, or LOG_ONLY — that encodes the confidence tier in terms the downstream system can branch on without implementing its own threshold logic. It contains the complete arbitration trace: which signals were evaluated, which degradation conditions were detected, which conflicts were resolved and how.

The minimum confidence floor of 0.20 is a deliberate design choice. It means the system never returns silence — not for corrupted payloads, not for vendor-opaque field names, not for RF readings 30 dBm below the documented critical threshold, not for sensor readings from environments that should not physically be possible. It always returns the best available answer and tells the consumer exactly how much to trust it.
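In code, a verdict of this shape might look like the following dataclass. The field names and the ACT/CONFIRM/LOG_ONLY routing follow the description above, but the exact schema is my assumption, not a published interface.

```python
from dataclasses import dataclass, field

@dataclass
class Verdict:
    """Structured verdict an arbitration layer returns to consumers."""
    state: str         # authoritative state after multi-signal evaluation
    confidence: float  # clamped to the 0.20 .. 1.0 range below
    action: str        # "ACT" | "CONFIRM" | "LOG_ONLY"
    trace: list = field(default_factory=list)  # signals, degradations, conflicts

    def __post_init__(self):
        # The 0.20 floor: the layer never returns silence, only a
        # best-available answer with an explicit trust level.
        self.confidence = min(1.0, max(0.20, self.confidence))

def route(verdict):
    """Downstream branch on the recommended action, with no local thresholds."""
    if verdict.action == "ACT":
        return "apply state change"
    if verdict.action == "CONFIRM":
        return "request confirmation probe"
    return "log only"
```

The point of the `action` field is visible in `route`: the consumer branches on the arbitration layer's recommendation without reimplementing its own confidence thresholds.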

This is what explicit assumptions look like in practice. The arbitration layer makes its assumptions explicit in every response. The downstream system knows whether it is acting on physical certainty or probabilistic best-effort. The audit trail is in the response, not reconstructed after the fact.

The epistemological gap that produces justified wrongness at scale is not closed by better hardware. It is not closed by faster networks. It is not closed by more sophisticated brokers. It is closed by inserting a decision function that treats device state as what it actually is: a verdict rendered by infrastructure from imperfect evidence, carrying an explicit account of the evidence quality that produced it.

With 820,000 IoT attacks per day in 2025 and a device landscape growing toward 40 billion units by 2034, the cost of implicit full confidence in device state — whether from adversarial manipulation or simple network ordering — is not a future risk. It is a present operational reality.

The arbitration model that addresses it: https://doi.org/10.5281/zenodo.19025514

SignalCend: signalcend.com
