Aidarbek

What actually happens to your data when an edge node crashes?

In edge systems, failure is not an exception; it’s the normal case.

Power loss. Process crashes. Disk issues.

Yet most pipelines are designed as if systems are stable.

The real question is:

what actually happens to your data when a node crashes mid-write?


The reality in many edge setups

In typical edge / IIoT pipelines, data is buffered locally:

  • MQTT brokers
  • local files
  • in-memory queues

When a crash happens, you usually end up with one of three outcomes:

  • partial writes
  • lost buffered data
  • manual recovery

Sometimes this is acceptable.

Sometimes it goes completely unnoticed until something breaks downstream.
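
To make "partial writes" and "lost buffered data" concrete, here is a minimal sketch of a local file buffer that can at least detect a torn tail on replay. It is an illustration, not the setup tested in this post: the record framing (length plus CRC32) and the names `append_record` and `replay` are assumptions chosen for the example.

```python
import os
import struct
import zlib

# Hypothetical framing for this sketch: 4-byte payload length + 4-byte CRC32.
HEADER = struct.Struct("<II")


def append_record(f, payload: bytes) -> None:
    """Append one framed record and ask the OS to persist it."""
    f.write(HEADER.pack(len(payload), zlib.crc32(payload)) + payload)
    f.flush()             # push Python's userspace buffer to the OS
    os.fsync(f.fileno())  # ask the OS to flush the file to storage


def replay(path: str) -> list[bytes]:
    """Return every intact record, stopping at the first torn or corrupt one."""
    records = []
    with open(path, "rb") as f:
        while True:
            header = f.read(HEADER.size)
            if len(header) < HEADER.size:
                break  # clean end of file, or a header cut off mid-write
            length, crc = HEADER.unpack(header)
            payload = f.read(length)
            if len(payload) < length or zlib.crc32(payload) != crc:
                break  # partial or corrupted payload: ignore the tail
            records.append(payload)
    return records
```

Without the `flush` and `fsync` calls, a record can sit in a buffer and vanish on a crash even though the write call returned. Without the checksum, a half-written record can be replayed as if it were valid. Whether `fsync` alone is enough to survive a power loss also depends on the filesystem and the storage hardware, which is exactly the kind of guarantee that needs testing rather than assuming.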


What I tested instead

Instead of focusing on throughput or benchmarks, I focused on failure:

  • SIGKILL scenarios (no graceful shutdown)
  • container restarts
  • disk replay after crash
  • offset correctness under recovery

I validated this behavior using Jepsen.

Result: 45/45 mixed-fault tests passed.

Not just happy-path performance, but behavior under real failure conditions.
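
For readers who want to try the SIGKILL scenario on a small scale, the sketch below shows the general shape of such a check. It is not the Jepsen test described above and is far less thorough. It assumes a POSIX system, and the writer script, the newline-delimited format, and the name `crash_and_check` are all made up for this example.

```python
import os
import signal
import subprocess
import sys
import tempfile

# Child process: append numbered lines, fsync each one, then acknowledge on stdout.
WRITER = r"""
import os, sys
with open(sys.argv[1], "ab") as f:
    n = 0
    while True:
        f.write(b"%d\n" % n)
        f.flush()
        os.fsync(f.fileno())
        sys.stdout.write("%d\n" % n)  # ack only after fsync returns
        sys.stdout.flush()
        n += 1
"""


def crash_and_check(acks_before_kill: int = 50) -> tuple[int, int]:
    """SIGKILL the writer, then return (last acked offset, intact records on disk)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "buffer.log")
        proc = subprocess.Popen(
            [sys.executable, "-c", WRITER, path],
            stdout=subprocess.PIPE, text=True,
        )
        last_acked = -1
        for _ in range(acks_before_kill):
            last_acked = int(proc.stdout.readline())
        proc.send_signal(signal.SIGKILL)  # no graceful shutdown, no cleanup handlers
        proc.wait()
        proc.stdout.close()

        with open(path, "rb") as f:
            data = f.read()
        # Only newline-terminated lines count; a trailing fragment is a torn write.
        complete = data.split(b"\n")[:-1]
        # Offsets must be contiguous from 0: no gaps, no reordering.
        assert [int(x) for x in complete] == list(range(len(complete)))
        return last_acked, len(complete)


if __name__ == "__main__":
    acked, on_disk = crash_and_check()
    # Every acknowledged offset must have survived the crash.
    assert on_disk >= acked + 1
    print(f"last acked offset: {acked}, intact records on disk: {on_disk}")
```

The property being checked is that every acknowledged offset is still readable after the kill. Note that SIGKILL only exercises a process crash: the OS keeps running, so data already handed to the kernel survives. Power loss and disk faults need a different kind of fault injection.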


What surprised me

The surprising part wasn’t the system behavior.

It was how unclear the guarantees are in many real-world setups.

In many cases:

  • durability is assumed, not verified
  • recovery is “best effort”
  • correctness depends on implementation details

And most teams don’t test failure scenarios explicitly.


The real question isn’t technical

It’s this:

how often does this actually matter in practice?

  • Do teams really lose data during crashes?
  • Or is it “good enough” most of the time?
  • Where does it become unacceptable?

For example:

  • industrial monitoring
  • financial events
  • critical telemetry

I’m trying to understand this

I’m currently exploring this space by building and testing failure scenarios, but more importantly, I’m trying to map real-world experience.

If you’re working on edge / IIoT systems:

  • Have you seen data loss after crashes?
  • How do you recover today?
  • Is this a real pain or an acceptable trade-off?

I’m not trying to sell anything, just trying to understand where this actually matters.