<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jon Zuanich</title>
    <description>The latest articles on DEV Community by Jon Zuanich (@jon_zuanich).</description>
    <link>https://dev.to/jon_zuanich</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3789961%2Fa26d41e9-e74e-4c88-a3c2-0c13a0b8f917.png</url>
      <title>DEV Community: Jon Zuanich</title>
      <link>https://dev.to/jon_zuanich</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jon_zuanich"/>
    <language>en</language>
    <item>
      <title>Your Perimeter Is Already Gone — Edge Security Isn't a Checkbox</title>
      <dc:creator>Jon Zuanich</dc:creator>
      <pubDate>Thu, 16 Apr 2026 15:39:00 +0000</pubDate>
      <link>https://dev.to/jon_zuanich/your-perimeter-is-already-gone-edge-security-isnt-a-checkbox-pok</link>
      <guid>https://dev.to/jon_zuanich/your-perimeter-is-already-gone-edge-security-isnt-a-checkbox-pok</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;Edge devices live outside your control plane, in physically accessible environments, often running default credentials. Treating that as an afterthought has a predictable outcome.&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;There's a mental model that dominated enterprise security thinking for decades: draw a perimeter around your systems, trust everything inside it, and defend the boundary.&lt;/p&gt;

&lt;p&gt;That model was already struggling in the cloud era. At the edge, it doesn't apply at all because your "perimeter" is now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A drilling rig in a remote field
&lt;/li&gt;
&lt;li&gt;A charging station in a concrete parking garage
&lt;/li&gt;
&lt;li&gt;A sensor package on a factory floor accessible to any maintenance technician
&lt;/li&gt;
&lt;li&gt;A gateway installed in an industrial cabinet that ships via a third-party supply chain before it ever reaches your operations team&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;The edge doesn't have a perimeter. It has exposure.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The threat model most architects skip&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;When security comes up in edge architecture conversations, the instinct is to reach for encryption. TLS everywhere. Certificates rotated regularly. Done.&lt;/p&gt;

&lt;p&gt;Encryption is necessary, but it addresses only one part of the problem. The &lt;a href="https://wiki.owasp.org/index.php/OWASP_Internet_of_Things_Project#tab=IoT_Top_10" rel="noopener noreferrer"&gt;OWASP IoT Top 10&lt;/a&gt; and real-world incident data consistently point to a broader set of failure modes that encryption alone doesn't solve:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Credential compromise.&lt;/strong&gt; Edge devices frequently ship with default or hardcoded credentials. According to &lt;a href="https://www.sentinelone.com/cybersecurity-101/data-and-ai/iot-security-risks/" rel="noopener noreferrer"&gt;SentinelOne's IoT security risk analysis&lt;/a&gt;, default credentials remain one of the top attack vectors precisely because they're predictable and widely documented in manufacturer manuals. Even when credentials are changed, they're often shared across devices, rarely rotated, and stored in ways that don't survive physical access.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tampered data injection.&lt;/strong&gt; A compromised edge device doesn't have to announce itself. It can sit in your topology for weeks or months, injecting subtly malformed data — readings that are plausible enough to pass through your pipelines and influence decisions in core systems. This is especially dangerous in domains like energy management, predictive maintenance, and industrial process control, where bad telemetry drives bad actions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lateral movement.&lt;/strong&gt; This is the one that keeps security architects up at night. An attacker who compromises one edge device has a foothold. If that device's credentials or network access are broadly scoped &lt;em&gt;(if it can reach subjects or channels it has no business touching)&lt;/em&gt;, the blast radius extends far beyond the device itself. &lt;a href="https://www.bitsight.com/blog/iot-device-security-risks-in-your-supply-chain" rel="noopener noreferrer"&gt;Bitsight's research&lt;/a&gt; on ICS/OT exposure shows that critical infrastructure systems are routinely left accessible with minimal segmentation, and that a single entry point can ripple into core systems fast.&lt;/p&gt;

&lt;p&gt;The pattern across all three: the breach doesn't originate inside your perimeter. It originates at the edge, and then it walks in.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why the old model breaks here specifically&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In a data center, the security assumption is: everything on the network is (relatively) trusted, and you protect the boundary aggressively. That works when you control the physical environment, the hardware lifecycle, and the access to every node.&lt;/p&gt;

&lt;p&gt;At the edge, you control none of those things reliably. Devices are in warehouses, on vehicles, in the field, in customer facilities. Firmware gets updated over the air, or sometimes not at all. Hardware gets swapped by contractors who have no security training. &lt;a href="https://www.vectra.ai/topics/iot-security" rel="noopener noreferrer"&gt;According to Vectra AI's IoT security data&lt;/a&gt;, supply chain compromise is now one of the dominant attack vectors, with incidents like BadBox 2.0 pre-installing malware on more than 10 million devices before they ever reached an operational environment.&lt;/p&gt;

&lt;p&gt;The environment is adversarial by nature, not by exception. And that demands a fundamentally different security design: not perimeter-based, but realm-based.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Separate realms, constrained paths&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This is the architectural shift that actually moves the needle — and it's one of the core arguments in Synadia's &lt;a href="https://www.synadia.com/resources/living-on-the-edge" rel="noopener noreferrer"&gt;Living on the Edge white paper&lt;/a&gt;: treat edge and core as &lt;em&gt;separate security realms&lt;/em&gt; connected by deliberately constrained paths, not by open network access that happens to be encrypted.&lt;/p&gt;

&lt;p&gt;What that looks like in practice:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scoped credentials.&lt;/strong&gt; Each edge device gets credentials that authorize only what that device legitimately needs to publish and subscribe to, and nothing more. A temperature sensor has no business reaching a command channel. A gateway serving one site shouldn't be able to reach subjects for another. If a credential is compromised, the blast radius is bounded to what that credential could do, not to everything on the network.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Subject-level boundary constraints.&lt;/strong&gt; In an event-driven architecture built on &lt;a href="https://nats.io/" rel="noopener noreferrer"&gt;NATS&lt;/a&gt;, the paths that cross from edge to core aren't open by default; they're explicitly defined. You configure which subjects are local to the edge leaf node, which are permitted to cross the boundary, and which are strictly core-only. A compromised edge node can't suddenly start publishing to a core command channel; the topology simply doesn't permit it. Synadia's &lt;a href="https://www.synadia.com/blog/decentralized-security-webinar" rel="noopener noreferrer"&gt;decentralized security model&lt;/a&gt; extends this further: credentials are cryptographically scoped, not centrally issued, which means there's no single credential store to compromise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Encrypted boundary links.&lt;/strong&gt; Traffic crossing from edge to core should be encrypted in transit (this is the part most teams already do). But encrypting the link doesn't constrain what traverses it; that's what subject scoping is for.&lt;/p&gt;

&lt;p&gt;These aren't compensating controls layered on top of a permissive architecture. They &lt;em&gt;are&lt;/em&gt; the architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What this looks like when you get it wrong&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Consider the failure mode that played out for decades in OT environments: IT teams would extend their networks into industrial control systems without redesigning the security model. The logic was "we already have VPNs and firewalls." The result was that a single phishing email or a compromised contractor credential could traverse from the enterprise network into systems controlling physical processes such as gas flow, water pressure, and power distribution.&lt;/p&gt;

&lt;p&gt;The same failure mode is replicating itself in modern edge deployments, just faster and at larger scale.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Edge AI inference nodes
&lt;/li&gt;
&lt;li&gt;EV charging infrastructure
&lt;/li&gt;
&lt;li&gt;Factory sensor networks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are all being connected to core systems with the "we have TLS" assumption standing in for a real security architecture.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The question to ask isn't "is the connection encrypted?" It's "what can this device actually reach, and what happens if it's compromised?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The previous post in this series covered &lt;a href="https://www.synadia.com/blog/nats-edge-event-architecture-1-edge-isnt-a-place-but-an-operating-reality" rel="noopener noreferrer"&gt;why "just retry" logic fails when connectivity is intermittent&lt;/a&gt;. Security has a similar anti-pattern: "just encrypt" fails when the threat model includes physical access, credential compromise, and lateral movement. Both retry logic and perimeter encryption are correct answers to the wrong problems.&lt;/p&gt;

&lt;p&gt;In edge-to-core systems, the right security architecture is one where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each device operates with the minimum credential scope it needs
&lt;/li&gt;
&lt;li&gt;Subjects that cross realm boundaries are explicitly allowed, not implicitly open
&lt;/li&gt;
&lt;li&gt;A compromised edge node cannot become a lateral movement vector into core systems
&lt;/li&gt;
&lt;li&gt;Security isn't implemented as a layer on top of the architecture — it's built into the topology&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The good news is that modern eventing platforms designed for edge-to-core scenarios (like &lt;a href="https://nats.io/" rel="noopener noreferrer"&gt;NATS&lt;/a&gt;, which supports &lt;a href="https://docs.nats.io/nats-concepts/security" rel="noopener noreferrer"&gt;decentralized JWT-based credentials&lt;/a&gt; and fine-grained subject scoping natively) make these constraints composable and operationally manageable. Synadia's &lt;a href="https://www.synadia.com/platform" rel="noopener noreferrer"&gt;platform layer&lt;/a&gt; adds the control plane for managing these policies across environments at scale.&lt;/p&gt;
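&lt;p&gt;As a sketch of what scoped credentials can look like in server configuration (the account, user, and subject names here are illustrative, not taken from NATS documentation or the white paper), a NATS-style per-user permissions block might read:&lt;/p&gt;

```conf
# Hedged sketch: one edge account, one user scoped to exactly the
# subjects that device legitimately needs. Names are illustrative.
accounts {
  EDGE_SITE_A {
    users = [
      {
        user: temp_sensor_01
        password: $SENSOR_PW
        permissions: {
          publish:   { allow: ["site-a.telemetry.temp.>"] }
          subscribe: { allow: ["site-a.config.temp_sensor_01"] }
        }
      }
    ]
  }
}
```

&lt;p&gt;In the decentralized model the article describes, the same scoping is typically carried in signed user JWT claims rather than static server config, which is what removes the single credential store.&lt;/p&gt;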

&lt;p&gt;The hard part, as always, isn't the technology. It's accepting that edge security isn't a feature you add at the end of the architecture review. It's a design constraint you start with.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This post is part of a series exploring architecture patterns for resilient edge-to-core systems, based on Synadia's white paper &lt;a href="https://www.synadia.com/resources/living-on-the-edge" rel="noopener noreferrer"&gt;Living on the Edge: Eventing for a New Dimension&lt;/a&gt;. If you're just joining, the first post covers &lt;a href="https://www.synadia.com/blog/nats-edge-event-architecture-1-edge-isnt-a-place-but-an-operating-reality" rel="noopener noreferrer"&gt;why edge is an operating reality, not a geography&lt;/a&gt;, and the second covers &lt;a href="https://www.synadia.com/blog/nats-edge-event-architecture-2-retry-will-fail-your-edge-system" rel="noopener noreferrer"&gt;why "just retry" is the wrong mental model for intermittent connectivity&lt;/a&gt;. Find the &lt;a href="https://www.synadia.com/blog/series/nats-edge-eventing-architecture" rel="noopener noreferrer"&gt;full series here&lt;/a&gt;.&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Next up: why flow control isn't a performance optimization — it's an architecture decision, and building it as an afterthought costs more than you think.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; Edge Computing · Distributed Systems · IoT Security · Zero Trust · Software Architecture · Microservices&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This blog post was originally published at &lt;a href="https://www.synadia.com/blog/nats-edge-event-architecture-3-your-perimeter-is-already-gone" rel="noopener noreferrer"&gt;Synadia.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>mqtt</category>
      <category>eventdriven</category>
      <category>pubsub</category>
      <category>security</category>
    </item>
    <item>
      <title>Why "Just Retry" Will Kill Your Edge System</title>
      <dc:creator>Jon Zuanich</dc:creator>
      <pubDate>Thu, 02 Apr 2026 13:28:00 +0000</pubDate>
      <link>https://dev.to/jon_zuanich/why-just-retry-will-kill-your-edge-system-24c6</link>
      <guid>https://dev.to/jon_zuanich/why-just-retry-will-kill-your-edge-system-24c6</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;The assumption baked into most distributed systems advice doesn't hold at the edge — and the cost of finding out the hard way is high.&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;There's a piece of conventional wisdom that gets passed around in backend engineering circles like it's settled science:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If a request fails, retry it.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It's good advice — for systems where the failure mode is "the service was temporarily unavailable." It's dangerous advice for systems where the failure mode is "the network doesn't exist right now and won't for… unknown hours."&lt;/p&gt;

&lt;p&gt;The latter is the reality for edge systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  The retry model is built on a hidden assumption
&lt;/h3&gt;

&lt;p&gt;When you write retry logic, you're implicitly betting that the failure is temporary &lt;em&gt;(and not recurrent or even chronic)&lt;/em&gt; and that the infrastructure on the other side is still there, waiting.&lt;/p&gt;

&lt;p&gt;In a data center, that’s a fair bet. Services restart, load balancers reroute, and the upstream is almost always reachable within seconds.&lt;/p&gt;

&lt;p&gt;At the edge, the betting odds are against you. Think of drilling rigs in remote fields, EV charging stations in concrete parking garages with spotty wireless service, factory gateways behind industrial firewalls, vehicles roaming in and out of coverage. The upstream isn't "temporarily slow." It's genuinely gone, and gone for an indeterminate amount of time.&lt;/p&gt;

&lt;p&gt;So what happens?&lt;/p&gt;

&lt;p&gt;Your edge process retries, and retries, and retries. It hammers at a connection that isn't there. It burns CPU. It piles up in-memory state. And when the connection finally comes back — whether that's minutes or hours later — it doesn't trickle data back in. It storms. Every backed-up message hitting the core at once, overwhelming consumers who had no idea the edge was disconnected at all.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Retry storms aren't a theoretical risk. They're the predictable outcome of applying online-system thinking to offline-tolerant problems.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Disconnection isn't an “edge case” at the edge&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This is the reframe that changes everything.&lt;/p&gt;

&lt;p&gt;In traditional distributed systems design, disconnection is an exception. You build for the happy path and handle failure gracefully. At the edge, disconnection is a &lt;em&gt;first-class operating condition&lt;/em&gt;. It's not an exception; intermittent connectivity must be designed for from day one.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://www.synadia.com/resources/living-on-the-edge" rel="noopener noreferrer"&gt;Living on the Edge white paper from Synadia&lt;/a&gt; puts it plainly: edge connections are intermittent by nature; devices roam, networks drop, power cycles. And once you are operating at any meaningful scale, "the cost of 'just retry' becomes painfully real."&lt;/p&gt;

&lt;p&gt;The question isn't &lt;em&gt;if&lt;/em&gt; your edge nodes will disconnect; it's &lt;em&gt;what your system does while they're gone&lt;/em&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Store-and-forward: designing for the gap&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The pattern that actually works here is store-and-forward. It's conceptually simple, even if it requires deliberate architectural choices.&lt;/p&gt;

&lt;p&gt;Instead of your edge process trying to push data upstream in real time, it writes to a local durable store first. Events accumulate locally, whether connectivity is up or down. When the link comes back, the store forwards upstream in an orderly, controlled way. This avoids a storm and potential data loss, and it removes brittle retry logic that has to know the difference between "upstream is slow" and "upstream doesn't exist."&lt;/p&gt;

&lt;p&gt;The four steps look like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Collect events locally, always
&lt;/li&gt;
&lt;li&gt;Forward upstream when connected
&lt;/li&gt;
&lt;li&gt;Continue collecting while disconnected
&lt;/li&gt;
&lt;li&gt;Catch up at a controlled rate when connectivity returns&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That fourth step is where most implementations trip up. "Catch up" can't mean "send everything immediately." It has to mean "resume forwarding at a rate the core can absorb." This is where flow control and pull-based consumption models become critical, but that's a topic for another post.&lt;/p&gt;
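&lt;p&gt;The four steps above can be sketched in a few dozen lines. This is an illustrative Python outline using a local SQLite outbox; the class and method names are hypothetical, and a real deployment would use something like JetStream rather than hand-rolling this:&lt;/p&gt;

```python
import sqlite3

# Illustrative store-and-forward outline: write locally first, forward
# in bounded batches, delete only after delivery is confirmed.

class StoreAndForward:
    def __init__(self, db_path=":memory:", max_batch=10):
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS outbox ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT)"
        )
        self.max_batch = max_batch  # step 4: bound the catch-up rate

    def record(self, payload):
        # Steps 1 and 3: always write locally, connected or not.
        self.db.execute("INSERT INTO outbox (payload) VALUES (?)", (payload,))
        self.db.commit()

    def drain_once(self, send):
        # Steps 2 and 4: forward at most max_batch events per cycle.
        rows = self.db.execute(
            "SELECT id, payload FROM outbox ORDER BY id LIMIT ?",
            (self.max_batch,),
        ).fetchall()
        sent = 0
        for row_id, payload in rows:
            try:
                send(payload)  # raises ConnectionError if the link is down
            except ConnectionError:
                break          # events stay durable in the outbox
            self.db.execute("DELETE FROM outbox WHERE id = ?", (row_id,))
            self.db.commit()
            sent += 1
        return sent
```

&lt;p&gt;Calling drain_once on a timer is what gives you the controlled catch-up: after a long outage, the backlog drains at max_batch events per cycle instead of storming the core all at once.&lt;/p&gt;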

&lt;h3&gt;
  
  
  &lt;strong&gt;What this looks like in practice&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Consider what &lt;a href="https://www.synadia.com/customer-stories/rivitt" rel="noopener noreferrer"&gt;Rivitt&lt;/a&gt; is doing in oil and gas. They're capturing machine data from drilling rigs and field devices under genuinely harsh conditions: remote locations, intermittent connectivity, high-volume telemetry that cannot be lost. The retry approach fails immediately in that environment. Store-and-forward, built on &lt;a href="https://nats.io/" rel="noopener noreferrer"&gt;NATS&lt;/a&gt; with &lt;a href="https://docs.nats.io/nats-concepts/jetstream" rel="noopener noreferrer"&gt;JetStream&lt;/a&gt;, is what makes continuous data capture possible even when the network isn't.&lt;/p&gt;

&lt;p&gt;Or &lt;a href="https://www.synadia.com/customer-stories/powerflex" rel="noopener noreferrer"&gt;PowerFlex&lt;/a&gt;, managing EV charging stations, battery storage, and solar arrays at the edge. Their system uses JetStream to buffer data locally and sync with the cloud even through intermittent connectivity. The charging stations don't stop working when the link drops, and they don't retry themselves into a storm when it comes back. They keep operating, and the data is there, in order, when the connection restores.&lt;/p&gt;

&lt;p&gt;These are examples of what thoughtful edge architecture looks like when you accept disconnection as a design constraint rather than an afterthought.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The subtler cost you're probably not counting&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Beyond the retry storm risk, there's an even quieter cost to getting this wrong: losing fidelity.&lt;/p&gt;

&lt;p&gt;When retry logic fails, you don't always know exactly why. A message can get dropped because the retry window was exhausted, because an in-memory buffer overflowed, or because the process restarted mid-flight. The upstream system sees a gap and often has no way to distinguish "that event didn't happen" from "that event happened and we lost it."&lt;/p&gt;

&lt;p&gt;In some domains, that's tolerable. In others, like energy systems, manufacturing floors, and predictive maintenance, that gap is the signal. That gap is the anomaly. Missing it isn't just a data quality problem; it's a decision quality problem.&lt;/p&gt;

&lt;p&gt;Store-and-forward is a completeness guarantee. Events get written locally before anything else happens. If the process crashes, they survive. If the network drops, they survive. If the upstream is overwhelmed and you have to throttle, they survive. The chronological record is intact.&lt;/p&gt;

&lt;p&gt;That integrity is worth more than most people realize until the moment they need it.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The design principle underneath all of this&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;What store-and-forward really represents is a refusal to couple the availability of your edge system to the availability of your core.&lt;/p&gt;

&lt;p&gt;Retry logic creates tight coupling. Your edge process either succeeds in reaching the core, or it spins, and eventually the spinning shows up as degraded behavior, lost data, or both. Store-and-forward decouples them. The edge keeps working regardless of what's happening upstream. The core catches up when the connection allows.&lt;/p&gt;

&lt;p&gt;That decoupling is the foundational principle of resilient edge architecture — and it's the thread running through everything in Synadia's &lt;a href="https://www.synadia.com/resources/living-on-the-edge" rel="noopener noreferrer"&gt;Living on the Edge architecture guide&lt;/a&gt;: treating edge and core as separate operational realms connected by &lt;em&gt;controlled paths&lt;/em&gt;, not assumed ones.&lt;/p&gt;

&lt;p&gt;Once you internalize that framing, "just retry" starts to look less like a safety net and more like wishful thinking in disguise.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What to do instead&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;If you're building or rethinking an edge-to-core system, a few concrete questions worth asking:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does your edge process write locally before it tries to send?&lt;/strong&gt; If the answer is no, you have data loss risk: pushing directly to an upstream queue or API without local durability means any crash or outage can drop messages with no record that they ever existed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do you know what your edge nodes are doing when disconnected?&lt;/strong&gt; If the answer is "retrying," you have retry storm risk every time connectivity restores.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does your catch-up behavior respect core capacity?&lt;/strong&gt; Fast producers resuming after a long disconnect can look like a denial-of-service attack to downstream consumers. Pull-based consumption models exist precisely to prevent this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is disconnection in your test suite?&lt;/strong&gt; If your integration tests only run against an always-on network, you're testing the happy path and shipping the edge case.&lt;/p&gt;
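&lt;p&gt;Putting disconnection in the test suite doesn't require real hardware. A fake uplink that can be toggled offline is enough; here is a self-contained Python sketch in which all the names are illustrative:&lt;/p&gt;

```python
# Self-contained disconnection test sketch: a fake uplink that can be
# toggled offline, and a trivial local buffer that drains through it.

class FlakyUplink:
    def __init__(self):
        self.online = True
        self.delivered = []

    def send(self, msg):
        if not self.online:
            raise ConnectionError("link down")
        self.delivered.append(msg)

def drain(buffer, uplink):
    # Forward in order; on failure, leave the rest buffered.
    while buffer:
        try:
            uplink.send(buffer[0])
        except ConnectionError:
            return
        buffer.pop(0)

def test_no_loss_across_disconnect():
    uplink, buffer = FlakyUplink(), []
    buffer.extend(["e1", "e2"])
    uplink.online = False          # simulate the outage
    drain(buffer, uplink)
    assert buffer == ["e1", "e2"]  # nothing lost, nothing sent
    buffer.append("e3")            # edge keeps producing while offline
    uplink.online = True           # connectivity restores
    drain(buffer, uplink)
    assert uplink.delivered == ["e1", "e2", "e3"]
    assert buffer == []

test_no_loss_across_disconnect()
```

&lt;p&gt;If a test like this has never been written for your system, the offline path has never been exercised before production.&lt;/p&gt;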

&lt;p&gt;The good news: these are solvable problems. The platforms and patterns exist. The harder part is accepting that the mental model that works for cloud-native services doesn't automatically transfer to distributed systems where connectivity is optional.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This post is part of a series exploring the architecture patterns behind resilient edge-to-core systems, based on Synadia's white paper &lt;a href="https://www.synadia.com/resources/living-on-the-edge" rel="noopener noreferrer"&gt;Living on the Edge: Eventing for a New Dimension&lt;/a&gt;. If you're designing for industrial IoT, connected fleets, energy systems, or any environment where "the network will always be there" is not a safe assumption — &lt;a href="https://www.synadia.com/resources/living-on-the-edge" rel="noopener noreferrer"&gt;the full guide is worth your time&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Next up: why edge security isn't a checkbox — and what happens when you treat it like one&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Previous:&lt;/em&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://www.synadia.com/blog/nats-edge-event-architecture-1-edge-isnt-a-place-but-an-operating-reality" rel="noopener noreferrer"&gt;&lt;em&gt;The Edge Isn't a Place — It's an Operating Reality&lt;/em&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; Edge Computing · Distributed Systems · Streaming · IoT · Software Architecture&lt;/p&gt;




</description>
      <category>architecture</category>
      <category>cloud</category>
      <category>microservices</category>
      <category>software</category>
    </item>
    <item>
      <title>The Edge Isn't a Place — It's an Operating Reality</title>
      <dc:creator>Jon Zuanich</dc:creator>
      <pubDate>Thu, 26 Mar 2026 20:57:47 +0000</pubDate>
      <link>https://dev.to/jon_zuanich/the-edge-isnt-a-place-its-an-operating-reality-12og</link>
      <guid>https://dev.to/jon_zuanich/the-edge-isnt-a-place-its-an-operating-reality-12og</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;The mental model you start with determines every architecture decision that follows. Most teams are starting with the wrong one.&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Edge computing used to mean "some devices send telemetry to the cloud."&lt;/p&gt;

&lt;p&gt;That era is over.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This is a re-post of Bruno Baloi's blog &lt;a href="https://www.synadia.com/blog/nats-edge-event-architecture-1-edge-isnt-a-place-but-an-operating-reality" rel="noopener noreferrer"&gt;Part 1: The Edge Isn't a Place - It's an Operating Reality&lt;/a&gt; on &lt;a href="https://www.synadia.com/" rel="noopener noreferrer"&gt;Synadia.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Today's edge is a full operational domain where the physical world meets software systems: machines, vehicles, gateways, sensors, factories, field deployments. And once you move compute and messaging into that world, the rules change fast. Connections drop. Environments get hostile. Data gets generated faster than it can be forwarded. And the assumptions baked into a decade of cloud-native architecture patterns start failing in ways that are hard to diagnose because they fail quietly.&lt;/p&gt;

&lt;p&gt;The first and most important shift is not a technical one; it's a conceptual one.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The edge is not a far-away part of your system. It's a different operating dimension entirely.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why the geography framing gets you into trouble&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;When engineers hear "edge computing," the mental image is usually spatial: devices on the left, cloud on the right, data flowing between them. The edge is just the far end of the pipeline.&lt;/p&gt;

&lt;p&gt;That framing seems harmless until you start making architecture decisions based on it. If edge is just far-away infrastructure, you design for distance — low latency, efficient serialization, maybe some compression. You optimize the happy path.&lt;/p&gt;

&lt;p&gt;What you don't design for is disconnection as a first-class operating condition. Or physical exposure as a security assumption. Or the possibility that the "far end" of your system is running on hardware installed by a third-party contractor in an industrial cabinet that nobody has touched in eighteen months.&lt;/p&gt;

&lt;p&gt;Those aren't edge cases at the edge. They're normal operating conditions. And the gap between "optimized for distance" and "designed for that reality" is where most edge architecture problems live.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The four constraints that don't go away&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If you're building edge-to-core systems, you're going to run into the same four problems regardless of industry, scale, or stack. Synadia's &lt;a href="https://www.synadia.com/resources/living-on-the-edge" rel="noopener noreferrer"&gt;Living on the Edge white paper&lt;/a&gt; names them clearly, and they're worth exploring:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Connectivity.&lt;/strong&gt; Edge links are intermittent by nature — not by failure. Devices roam in and out of coverage. Field gateways lose upstream access during maintenance windows. Vehicles move through dead zones. The question isn't whether your edge nodes will disconnect; it's whether your architecture treats that as an exception to handle or a condition to design for.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security.&lt;/strong&gt; Edge devices are physically exposed in ways that data center hardware never is. They're accessible to anyone with physical proximity: maintenance crews, contractors, hostile actors with a USB drive and ten minutes to spare. Credentials get copied. Firmware gets tampered with. Unlike a compromised cloud instance that stays logically contained, a compromised edge device has a direct path toward your core systems if you haven't designed the boundary carefully.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Distribution.&lt;/strong&gt; Edge environments generate data at rates and granularities that cores can't absorb naively. A manufacturing floor streaming sensor data from hundreds of machines isn't a throughput problem — it's a routing and filtering problem. The right data needs to reach the right destination at the right rate, which means the system has to be opinionated about what crosses the boundary and at what volume.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observability.&lt;/strong&gt; You need a real-time view across devices &lt;em&gt;and&lt;/em&gt; the infrastructure connecting them to core — not just health checks, but end-to-end event traceability. Without it, you're making operational decisions based on incomplete signals, and at the edge, incomplete signals tend to mean delayed incident detection and incorrect root cause analysis.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;None of these four constraints are solvable by optimizing the happy path. They require deliberate design choices that change the shape of the architecture.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The reframe that unlocks the right patterns&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Once you stop thinking of edge as geography and start thinking of it as a separate operational realm, the right patterns become obvious — or at least, the wrong patterns become obviously wrong.&lt;/p&gt;

&lt;p&gt;The core insight, articulated in the &lt;a href="https://www.synadia.com/resources/living-on-the-edge" rel="noopener noreferrer"&gt;Living on the Edge architecture guide&lt;/a&gt;, is this: treat edge and core as distinct realms connected by deliberately controlled paths. Not a single distributed system. Not an extended network. Two separate operating environments with an explicit, managed boundary between them.&lt;/p&gt;

&lt;p&gt;That separation shows up in three practical ways:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Asynchronous communication&lt;/strong&gt; instead of synchronous request-reply. HTTP works when the upstream is available and fast. At the edge, you can't assume either. Asynchronous, event-driven communication means edge systems continue operating regardless of upstream availability — they produce events locally and let the transport layer handle delivery when conditions allow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Store-and-forward&lt;/strong&gt; instead of retry logic. When a link drops, edge systems shouldn't be hammering a dead connection. They should be writing to a local durable store and resuming forwarding when connectivity restores — at a controlled rate that doesn't overwhelm core consumers. (The second post in this series covers why this distinction matters more than most architects realize.)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security realm constraints&lt;/strong&gt; instead of shared credentials and open subjects. The boundary between edge and core should be explicit about what's permitted to cross it. Not "encrypt everything and trust the network" — that's perimeter thinking applied to a perimeter-free environment. The constraint lives in the topology, not just the transport.&lt;/p&gt;
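&lt;p&gt;&lt;em&gt;One way to make "explicit about what's permitted to cross" concrete is an allow-list of subjects at the boundary, enforced regardless of transport encryption. The subjects and policy below are hypothetical, and the pattern syntax is Python's &lt;code&gt;fnmatch&lt;/code&gt; for illustration — NATS subject wildcards have their own token-based semantics.&lt;/em&gt;&lt;/p&gt;

```python
import fnmatch

# Hypothetical policy: only telemetry may cross from edge to core, and
# only setpoint commands may cross the other way. Everything else stays
# in its realm, no matter how well-encrypted the link is.
EDGE_TO_CORE_ALLOWED = ["telemetry.*.readings", "telemetry.*.alerts"]
CORE_TO_EDGE_ALLOWED = ["commands.*.setpoint"]

def may_cross(subject, allowed):
    """The boundary lives in the topology: a subject either matches an
    allowed pattern or it never leaves its realm."""
    return any(fnmatch.fnmatch(subject, pat) for pat in allowed)

may_cross("telemetry.rig42.readings", EDGE_TO_CORE_ALLOWED)   # True
may_cross("internal.rig42.debug", EDGE_TO_CORE_ALLOWED)       # False
```

&lt;p&gt;A compromised edge node under this model can only speak on the subjects its realm permits — the blast radius is the allow-list, not the network.&lt;/p&gt;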

&lt;h2&gt;
  
  
  &lt;strong&gt;This is where eventing becomes the backbone&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If those three patterns share a common thread, it's that they all require an eventing layer — not just messaging, but a platform that can handle asynchronous delivery, local durability, stream replication, and subject-level routing as composable primitives rather than as separate infrastructure.&lt;/p&gt;

&lt;p&gt;Edge-to-core isn't a "connect things to the cloud" problem. It's a problem of how signals, commands, and durable streams move safely across unreliable, adversarial terrain — in both directions, at scale, with full traceability.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://nats.io" rel="noopener noreferrer"&gt;NATS&lt;/a&gt; and the &lt;a href="https://www.synadia.com/platform" rel="noopener noreferrer"&gt;Synadia Platform&lt;/a&gt; are purpose-built for exactly this architecture: leaf node topologies that treat edge clusters as first-class entities, JetStream for local durability and controlled forwarding, decentralized security for tightly scoped credentials, and end-to-end observability across the full mesh.&lt;/p&gt;
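&lt;p&gt;&lt;em&gt;For a sense of how little ceremony this takes, here is the general shape of a &lt;code&gt;nats-server&lt;/code&gt; configuration for an edge node connecting upstream as a leaf node, with JetStream enabled for local durability. Hostnames, ports, and file paths are placeholders.&lt;/em&gt;&lt;/p&gt;

```
# nats-server config on the edge node (illustrative values)
leafnodes {
  remotes = [
    {
      url: "tls://core.example.com:7422"
      credentials: "/etc/nats/edge-site-1.creds"
    }
  ]
}

jetstream {
  store_dir: "/var/lib/nats"
}
```

&lt;p&gt;The edge cluster runs as its own system with its own credentials file; the leaf node remote is the one deliberately controlled path back to core.&lt;/p&gt;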

&lt;h2&gt;
  
  
  &lt;strong&gt;Questions worth asking before your next architecture review&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Do you have a written definition of what "the edge" means in your system?&lt;/strong&gt; If the answer is "the devices on the other side of the network," the framing is still geographic. Try: "a separate operational realm with different connectivity, security, and observability characteristics than our core."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does your architecture document cover the disconnected case explicitly?&lt;/strong&gt; If the connectivity section assumes the link is up, it's documenting the happy path. The disconnected case is where edge architectures succeed or fail.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Are your edge and core security models the same?&lt;/strong&gt; If your edge nodes use the same credentials, the same access scope, and the same network trust assumptions as your core services — you haven't built a boundary. You've built a flat network with devices in inconvenient locations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do you know what your edge nodes are doing right now?&lt;/strong&gt; Not in aggregate. Individually. If the answer is "we'd have to look at logs," your observability model was designed for a data center, not an operational edge.&lt;/p&gt;

&lt;p&gt;Getting these right doesn't require exotic technology. It requires accepting that the edge is a different operational dimension — and designing for it from the start, not retrofitting resilience onto a system that assumed it wouldn't need any.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This post is the first in a series exploring architecture patterns for resilient edge-to-core systems, based on Synadia's white paper &lt;a href="https://www.synadia.com/resources/living-on-the-edge" rel="noopener noreferrer"&gt;Living on the Edge: Eventing for a New Dimension&lt;/a&gt;.&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Next up: why "just retry" is the wrong mental model for intermittent connectivity — and what happens when you find out the hard way.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; Edge Computing · Distributed Systems · IoT · Software Architecture · Streaming · Microservices&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>distributedsystems</category>
      <category>microservices</category>
      <category>iot</category>
    </item>
  </channel>
</rss>
