Jon Zuanich

Why “Just Retry” Will Kill Your Edge System

The assumption baked into most distributed systems advice doesn’t hold at the edge — and the cost of finding out the hard way is high.

There’s a piece of conventional wisdom that gets passed around in backend engineering circles like it’s settled science:

If a request fails, retry it.

It’s good advice — for systems where the failure mode is “the service was temporarily unavailable.” It’s dangerous advice for systems where the failure mode is “the network doesn’t exist right now and won’t for… unknown hours.”

The latter is the reality for edge systems.

The retry model is built on a hidden assumption

When you write retry logic, you’re implicitly betting that the failure is temporary (and not recurrent or even chronic) and that the infrastructure on the other side is still there, waiting.

In a data center, that’s a fair bet. Services restart, load balancers reroute, and the upstream is almost always reachable within seconds.

At the edge, the betting odds are against you. Think of drilling rigs in remote fields, EV charging stations in concrete parking garages with spotty wireless service, factory gateways behind industrial firewalls, vehicles roaming in and out of coverage. The upstream isn’t “temporarily slow.” It’s genuinely gone, often for an indeterminate amount of time.

So what happens?

Your edge process retries, and retries, and retries. It hammers at a connection that isn’t there. It burns CPU. It piles up in-memory state. And when the connection finally comes back — whether that’s minutes or hours later — it doesn’t trickle data back in. It storms. Every backed-up message hitting the core at once, overwhelming consumers who had no idea the edge was disconnected at all.

Retry storms aren’t a theoretical risk. They’re the predictable outcome of applying online-system thinking to offline-tolerant problems.
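The anti-pattern is easy to sketch. In this illustrative Python snippet (all names invented for the example, not any real client API), messages pile up in memory while the link is down, then flush in one burst the moment it returns:

```python
import collections

class NaiveRetrySender:
    """Anti-pattern sketch: buffer in memory, flush everything on reconnect."""

    def __init__(self, send_fn, connected_fn):
        self.send_fn = send_fn            # pushes one message upstream
        self.connected_fn = connected_fn  # returns True when the link is up
        self.backlog = collections.deque()

    def publish(self, msg):
        if self.connected_fn():
            self.send_fn(msg)
        else:
            self.backlog.append(msg)      # unbounded in-memory growth

    def on_reconnect(self):
        # The storm: every backed-up message hits the core at once,
        # with no pacing and no awareness of downstream capacity.
        while self.backlog:
            self.send_fn(self.backlog.popleft())
```

A few hours of disconnection at even modest message rates turns `on_reconnect` into exactly the burst described above.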

Disconnection isn’t an edge case at the edge

This is the reframe that changes everything.

In traditional distributed systems design, disconnection is an exception. You build for the happy path and handle failure gracefully. At the edge, disconnection is a first-class operating condition. It’s not an exception; intermittent connectivity should be designed for from day one.

The Living on the Edge white paper from Synadia puts it plainly. Edge connections are intermittent by nature: devices roam, networks drop, power cycles. And once you are operating at any meaningful scale, “the cost of ‘just retry’ becomes painfully real.”

The question isn’t if your edge nodes will disconnect; it’s what your system does while they’re gone.

Store-and-forward: designing for the gap

The pattern that actually works here is store-and-forward. It’s conceptually simple, even if it requires deliberate architectural choices.

Instead of your edge process trying to push data upstream in real time, it writes to a local durable store first. Events accumulate locally, whether connectivity is up or down. When the link comes back, the store forwards upstream in an orderly, controlled way. This avoids both the storm and potential data loss, and it removes brittle retry logic that has to know the difference between “upstream is slow” and “upstream doesn’t exist.”

The four steps look like this:

  1. Collect events locally, always
  2. Forward upstream when connected
  3. Continue collecting while disconnected
  4. Catch up at a controlled rate when connectivity returns

That fourth step is where most implementations trip up. “Catch up” can’t mean “send everything immediately.” It has to mean “resume forwarding at a rate the core can absorb.” This is where flow control and pull-based consumption models become critical, but that’s a topic for another post.
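One way to make “resume at a rate the core can absorb” concrete is to drain the backlog in bounded batches with a pause between them. This is a hedged sketch; the batch size and interval are illustrative knobs, not values from the post:

```python
import time

def catch_up(backlog, send_fn, batch_size=50, interval_s=0.1):
    """Drain a backlog in bounded batches instead of one burst."""
    sent = 0
    while sent < len(backlog):
        for event in backlog[sent:sent + batch_size]:
            send_fn(event)          # the core sees at most batch_size at once
        sent = min(sent + batch_size, len(backlog))
        if sent < len(backlog):
            time.sleep(interval_s)  # give downstream consumers room to absorb
    return sent
```

The pacing decision lives at the edge here; a pull-based consumer moves that decision to the core, which is usually the better place for it.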

What this looks like in practice

Consider what Rivitt is doing in oil and gas. They’re capturing machine data from drilling rigs and field devices under genuinely harsh conditions: remote locations, intermittent connectivity, high-volume telemetry that cannot be lost. The retry approach fails immediately in that environment. Store-and-forward, built on NATS with JetStream, is what makes continuous data capture possible even when the network isn’t.

Or PowerFlex, managing EV charging stations, battery storage, and solar arrays at the edge. Their system uses JetStream to buffer data locally and sync with the cloud even through intermittent connectivity. The charging stations don’t stop working when the link drops, and they don’t retry themselves into a storm when it comes back. They keep operating, and the data is there, in order, when the connection restores.

These are examples of what thoughtful edge architecture looks like when you accept disconnection as a design constraint rather than an afterthought.

The subtler cost you’re probably not counting

Beyond the retry storm risk, there’s an even quieter cost to getting this wrong: losing fidelity.

When retry logic fails, you don’t always know why, or how to fix it. A message can get dropped because the retry window was exhausted, the in-memory buffer overflowed, or the process restarted mid-flight. The upstream system sees a gap and often has no way to distinguish “that event didn’t happen” from “that event happened and we lost it.”

In some domains, that’s tolerable. In others, like energy systems, manufacturing floors, and predictive maintenance, that gap is the signal. That gap is the anomaly. Missing it isn’t just a data quality problem; it’s a decision quality problem.

Store-and-forward is a completeness guarantee. Events get written locally before anything else happens. If the process crashes, they survive. If the network drops, they survive. If the upstream is overwhelmed and you have to throttle, they survive. The chronological record is intact.

That integrity is worth more than most people realize until the moment they need it.

The design principle underneath all of this

What store-and-forward really represents is a refusal to couple the availability of your edge system to the availability of your core.

Retry logic creates tight coupling. Your edge process either succeeds in reaching the core or it spins, and eventually the spinning shows up as degraded behavior, lost data, or both. Store-and-forward decouples them. The edge keeps working regardless of what’s happening upstream. The core catches up when the connection allows.

That decoupling is the foundational principle of resilient edge architecture — and it’s the thread running through everything in the Living on the Edge architecture guide: treating edge and core as separate operational realms connected by controlled paths, not assumed ones.

Once you internalize that framing, “just retry” starts to look less like a safety net and more like wishful thinking in disguise.

What to do instead

If you’re building or rethinking an edge-to-core system, here are a few concrete questions worth asking:

Does your edge process write locally before it tries to send? If you’re pushing directly to an upstream queue or API without local durability, you have data loss risk.

Do you know what your edge nodes are doing when disconnected? If the answer is “retrying,” you have retry storm risk every time connectivity restores.

Does your catch-up behavior respect core capacity? Fast producers resuming after a long disconnect can look like a denial-of-service attack to downstream consumers. Pull-based consumption models exist precisely to prevent this.
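The shape of a pull-based model is worth seeing, even in miniature. In this invented sketch (not the JetStream API), the consumer asks for a bounded batch when it has capacity, so a fast producer can never overrun it:

```python
import collections

class PullQueue:
    """Consumer-driven delivery: the consumer sets the pace, not the producer."""

    def __init__(self):
        self._q = collections.deque()

    def publish(self, msg):
        self._q.append(msg)        # producer can run far ahead safely

    def fetch(self, max_msgs):
        # The consumer never receives more than it asked for.
        batch = []
        while self._q and len(batch) < max_msgs:
            batch.append(self._q.popleft())
        return batch
```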

Is disconnection in your test suite? If your integration tests only run against an always-on network, you’re testing the happy path and shipping the edge case.
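A disconnection test doesn’t need real hardware. Here is a self-contained sketch in which `FakeLink` and `EdgeBuffer` are invented stand-ins for your transport and edge component; the point is driving a down/up cycle and asserting that nothing is lost and nothing storms:

```python
class FakeLink:
    """Toggleable stand-in for network connectivity."""

    def __init__(self):
        self.up = False

class EdgeBuffer:
    """Stand-in edge component: local durability first, drain when connected."""

    def __init__(self, link):
        self.link = link
        self.local = []     # the local durable store
        self.upstream = []  # what the core has actually received

    def publish(self, event):
        self.local.append(event)   # write locally before any send attempt
        self.drain()

    def drain(self, batch=10):
        while self.link.up and self.local:
            self.upstream.extend(self.local[:batch])
            del self.local[:batch]

def test_disconnect_cycle():
    link = FakeLink()
    edge = EdgeBuffer(link)
    for i in range(25):
        edge.publish(i)            # published while disconnected
    assert edge.upstream == [] and edge.local == list(range(25))
    link.up = True
    edge.drain()
    assert edge.upstream == list(range(25)) and edge.local == []
```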

The good news — these are solvable problems. The platforms and patterns exist. The harder part is accepting that the mental model that works for cloud-native services doesn’t automatically transfer to distributed systems where connectivity is optional.

This post is part of a series exploring the architecture patterns behind resilient edge-to-core systems, based on Synadia's white paper Living on the Edge: Eventing for a New Dimension. If you’re designing for industrial IoT, connected fleets, energy systems, or any environment where “the network will always be there” is not a safe assumption — the full guide is worth your time.

Next up: why edge security isn’t a checkbox — and what happens when you treat it like one.
