
Aafaq Zahid

Posted on • Originally published at aafaqzahid.substack.com

When TCP failed for IoT, I wrote a new protocol

We were losing time.

Not seconds. Not packets. Minutes. Sometimes 20 minutes of data. Sometimes 40. Vanishing completely, with no way to recover it.

We were building an IoT telemetry framework for a fleet of vehicles. Cars moving through cities, highways, tunnels, rural areas — transmitting a continuous stream of sensor data back to our servers. The kind of system where every data point matters. Where gaps in the record are not acceptable.

The culprit, after thorough debugging, was TCP.

Not a bug in our code. Not a misconfiguration. TCP itself — the protocol that powers the entire internet — was simply not designed for what we were asking it to do.

1. The problem with TCP at the edge

TCP was built for reliable, persistent connections. When a connection drops, TCP doesn’t degrade gracefully: the connection fails outright. And when that happens on a constrained IoT device, the in-memory send buffer is lost with it. Every packet that was queued and waiting to be sent: gone. Unrecoverable.

In urban areas with consistent coverage, this wasn’t a problem. But IoT devices don't operate in ideal conditions. They go through dead zones. They operate in basements, tunnels, remote areas — anywhere connectivity is intermittent or completely absent for extended periods.

Every time a device lost signal and its TCP connection failed, we lost everything buffered since the last successful transmission. That’s how you lose 40 minutes of data. Not one catastrophic failure, but dozens of small ones, each silently swallowing its buffer.

The standard engineering answer would have been to patch around it. Add retry logic. Increase buffer sizes. Hope for better coverage.

I went a level deeper.

2. Designing from first principles

The question I asked was: what would a transport protocol look like if it were designed specifically for intermittent connectivity on constrained hardware?

At the time, Google was developing QUIC — a UDP-based protocol designed to address TCP’s limitations. It was still in early development. I was arriving at similar problems from a completely different direction, on completely different hardware, for a completely different use case.

The principles I arrived at independently:

One session per device — like QUIC

Rather than re-establishing connections on every reconnect, the protocol maintains a single persistent logical session per device. The session survives network interruptions. The transport layer going down does not mean the application layer loses state.
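As a rough illustration of the idea (not the original implementation), a server can key its session state on a stable device identifier instead of on the network address a packet arrived from. The names `Session`, `SessionTable`, and `Touch`, and the `LastAcked` field, are hypothetical.

```go
package main

import "fmt"

// Session is hypothetical per-device state that must outlive any single
// network path. Here it only tracks the highest sequence number persisted.
type Session struct {
	DeviceID   string
	RemoteAddr string // last address the device was seen at; may change
	LastAcked  uint32
}

// SessionTable maps a stable device ID to its session. Keying on the device
// ID rather than the IP:port pair is what lets state survive a reconnect.
type SessionTable struct {
	sessions map[string]*Session
}

func NewSessionTable() *SessionTable {
	return &SessionTable{sessions: make(map[string]*Session)}
}

// Touch returns the existing session for deviceID, updating only its
// address, or creates a fresh session the first time the device is seen.
func (t *SessionTable) Touch(deviceID, remoteAddr string) *Session {
	s, ok := t.sessions[deviceID]
	if !ok {
		s = &Session{DeviceID: deviceID}
		t.sessions[deviceID] = s
	}
	s.RemoteAddr = remoteAddr
	return s
}

func main() {
	table := NewSessionTable()
	s := table.Touch("vehicle-17", "10.0.0.5:40000")
	s.LastAcked = 120

	// The device leaves a tunnel and reappears from a different address.
	s = table.Touch("vehicle-17", "10.9.3.2:51512")
	fmt.Println(s.LastAcked, s.RemoteAddr)
}
```

The second `Touch` returns the same session, so the acknowledgment state carries over even though the address changed.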

The queue lives on hardware, not in the transport layer

This was the core insight. TCP’s vulnerability is that its buffer lives in the transport layer: when the connection dies, the buffer goes with it. In RUDP, the protocol described here, every outgoing message is written to a persistent queue in device hardware memory before transmission. The queue is the source of truth. The network is just the delivery mechanism.

If the network goes down for 4 hours, the queue holds. When connectivity returns, transmission resumes from exactly where it left off. Nothing is lost because nothing was ever only in memory.

Packet numbering with NACK/SYN acknowledgment

Every packet carries a sequence number. The server tracks receipt and sends two types of signals back to the client:

SYN — “I have received and persisted everything up to packet N.” The client receives this and safely removes all packets up to N from its hardware queue, freeing space.

NACK — “I received packets 1–10 and 15, but packets 11–14 are missing.” The client retransmits only the missing range.

This two-signal system gives you guaranteed delivery without unbounded queue growth. The device knows exactly what has been safely received and can reclaim hardware memory accordingly.
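The two signals can be sketched as follows. This is an illustration under assumptions, not the original code: `Receiver`, `Sender`, and `Range` are hypothetical names, the client queue is modelled as an in-memory map for brevity, and sequence-number wraparound is not handled.

```go
package main

import (
	"fmt"
	"sort"
)

// Range is an inclusive run of sequence numbers.
type Range struct{ From, To uint32 }

// Receiver is a hypothetical server-side tracker for one device session.
type Receiver struct {
	contiguous uint32          // every packet 1..contiguous has been persisted
	ahead      map[uint32]bool // packets received beyond a gap
}

func NewReceiver() *Receiver { return &Receiver{ahead: make(map[uint32]bool)} }

// Receive records seq and advances the contiguous mark over any packets
// that were previously held back by a gap.
func (r *Receiver) Receive(seq uint32) {
	if seq <= r.contiguous {
		return // duplicate of something already persisted
	}
	r.ahead[seq] = true
	for r.ahead[r.contiguous+1] {
		r.contiguous++
		delete(r.ahead, r.contiguous)
	}
}

// Syn returns N for the "everything up to N is persisted" signal.
func (r *Receiver) Syn() uint32 { return r.contiguous }

// Nack returns the missing ranges between the contiguous mark and the
// highest packet seen so far.
func (r *Receiver) Nack() []Range {
	seen := make([]uint32, 0, len(r.ahead))
	for s := range r.ahead {
		seen = append(seen, s)
	}
	sort.Slice(seen, func(i, j int) bool { return seen[i] < seen[j] })
	var gaps []Range
	next := r.contiguous + 1
	for _, s := range seen {
		if s > next {
			gaps = append(gaps, Range{next, s - 1})
		}
		next = s + 1
	}
	return gaps
}

// Sender is a hypothetical client-side view of the unacknowledged queue.
type Sender struct {
	queue map[uint32][]byte
}

// OnSyn drops everything up to and including n, reclaiming queue space.
func (s *Sender) OnSyn(n uint32) {
	for seq := range s.queue {
		if seq <= n {
			delete(s.queue, seq)
		}
	}
}

// OnNack returns the queued sequence numbers to retransmit for the gaps.
func (s *Sender) OnNack(gaps []Range) []uint32 {
	var resend []uint32
	for _, g := range gaps {
		for seq := g.From; seq <= g.To; seq++ {
			if _, ok := s.queue[seq]; ok {
				resend = append(resend, seq)
			}
		}
	}
	return resend
}

func main() {
	rx := NewReceiver()
	for seq := uint32(1); seq <= 10; seq++ {
		rx.Receive(seq)
	}
	rx.Receive(15)
	fmt.Println(rx.Syn(), rx.Nack()) // prints: 10 [{11 14}]
}
```

The demo reproduces the example from the text: packets 1 to 10 and 15 have arrived, so the SYN value is 10 and the NACK covers 11 to 14.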

3. BitPacking for communication efficiency

Every byte matters on constrained hardware transmitting continuously. The protocol uses BitPacking, encoding data at the bit level rather than the byte level, to minimise payload size before transmission. Less data on the wire means less to retransmit, lower latency, and lower power consumption.
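To make bit-level encoding concrete, here is a small sketch of a bit writer and reader. The types `BitWriter` and `BitReader` and the field layout (speed, gear, ignition flag, fuel level) are invented for illustration and are not taken from the original protocol.

```go
package main

import "fmt"

// BitWriter appends values using only as many bits as each field needs.
type BitWriter struct {
	buf  []byte
	nbit uint // number of bits written so far
}

// Write appends the low `width` bits of v, most significant bit first.
func (w *BitWriter) Write(v uint32, width uint) {
	for i := int(width) - 1; i >= 0; i-- {
		if w.nbit%8 == 0 {
			w.buf = append(w.buf, 0) // start a new byte
		}
		if v>>uint(i)&1 == 1 {
			w.buf[len(w.buf)-1] |= 1 << (7 - w.nbit%8)
		}
		w.nbit++
	}
}

// Bytes returns the packed payload; unused trailing bits are zero.
func (w *BitWriter) Bytes() []byte { return w.buf }

// BitReader reads fields back in the order and widths they were written.
// It panics if asked to read past the end of the buffer.
type BitReader struct {
	buf []byte
	pos uint
}

// Read consumes `width` bits and returns them as an unsigned value.
func (r *BitReader) Read(width uint) uint32 {
	var v uint32
	for i := uint(0); i < width; i++ {
		bit := r.buf[r.pos/8] >> (7 - r.pos%8) & 1
		v = v<<1 | uint32(bit)
		r.pos++
	}
	return v
}

func main() {
	var w BitWriter
	w.Write(52, 8) // speed in km/h, range 0-255
	w.Write(3, 3)  // gear, range 0-7
	w.Write(1, 1)  // ignition flag
	w.Write(87, 7) // fuel level in percent, range 0-100

	fmt.Println(len(w.Bytes())) // prints: 3
}
```

The four fields need 19 bits, so they fit in 3 bytes, where storing each field in its own byte would take 4. Both sides must agree on the field order and widths, since the packed payload carries no field boundaries.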

Building it
The protocol was designed and implemented in Go. The full implementation — client-side queue management, packet numbering, NACK/SYN handling, BitPacking, and server-side receipt tracking — was built and then tested and hardened across the full fleet over two to three months.

I built it alone.

The result
Data loss, once as much as 40 minutes at a stretch: zero.

The system became faster than it had been before — not just more reliable. BitPacking reduced payload sizes enough that transmission throughput improved even as reliability guarantees increased.

IoT devices. Continuous operation. Hostile network conditions. Zero data loss.

What this taught me
The lesson is not specific to IoT or transport protocols. It generalises.

Never trust the layer beneath you. Design for failure as the default, not the exception. If your system’s correctness depends on a layer it doesn’t control behaving correctly, your system is fragile. Build the guarantee into your own layer.

This principle has shaped how I think about every system since — from transport protocols to AI infrastructure, from IoT telemetry to protocol-level policy enforcement. The question is always the same: what happens when the thing beneath me fails, and have I designed for that case explicitly?

TCP wasn’t wrong. It was designed for a different world. The right response wasn’t to blame TCP — it was to understand the problem deeply enough to design something better suited to the actual conditions.

That’s the only kind of engineering that lasts.

—————————————————————————————————————-
Aafaq Zahid is a software architect and founding engineer specialising in protocol-level systems, custom transport protocols, and AI infrastructure. He has founded and led engineering at multiple technology companies. This protocol was designed and built in 2017.

LinkedIn: linkedin.com/in/aafaqzahid
