deep dive into network protocols

#architecture #programming

Most people never think about network protocols, and that is exactly how it should be. Protocols are the invisible agreements that let billions of devices, built by competing companies and running incompatible software, cooperate well enough to load a web page. They only become visible when they fail, and when they fail at scale, the failure is spectacular.
This article walks through the four protocols that carry almost everything you do online:
IP and TCP: which move and order the bytes;
BGP: which decides which path those bytes take across the planet;
DNS: which turns human-friendly names into machine addresses;
HTTP/HTTPS: the language the web itself speaks.
Then it tells the story of October 4, 2021, when a single backbone configuration change cascaded through these protocols and took Facebook, Instagram, and WhatsApp off the internet for roughly six hours. It is a case study that shows, better than any diagram, how tightly these layers depend on one another.

A protocol is just an agreement
A protocol is a shared set of rules for how two parties exchange information: what a message looks like, what order things happen in, and what to do when something goes wrong. Network protocols are exactly that, formalized down to the bit.
The internet is built as a stack of these agreements, each handling one job and trusting the layer below it to handle its own. The mental model worth holding onto is this: lower layers move raw data between machines, and higher layers add meaning.
The beauty of this design is separation of concerns: your browser speaking HTTP does not need to know whether your packets travel over fiber, copper, or radio. But the cost of that design is dependency. Each layer assumes the one beneath it is working. When a low layer breaks, everything above it breaks too, and the symptoms often appear at the top, far from the real cause. Hold that thought; it is the whole point of the case study.
IP and TCP: moving the bytes
At the foundation sits the Internet Protocol (IP). IP's only job is addressing and delivery: every device gets an IP address, and IP defines how to package data into packets and send them toward a destination address. Crucially, IP makes no promises. Packets can arrive out of order, be duplicated, or vanish entirely. IP is a postcard system: drop it in the mail and hope for the best.
That unreliability is intentional, because it keeps IP simple and fast. The job of making it reliable is handed up to the Transmission Control Protocol (TCP). TCP sits on top of IP and adds the guarantees applications actually want: it numbers each chunk of data so the receiver can reassemble it in order, it asks for retransmission of anything that goes missing, and it controls the rate of sending so a fast server does not overwhelm a slow connection.
TCP also opens a connection before sending real data, through a three-step handshake (often summarized as SYN, SYN-ACK, ACK). This is the digital equivalent of "Hello? Hi, can you hear me? Yes, go ahead." Only after both sides confirm they can reach each other does the real conversation begin. Together, IP and TCP give the rest of the stack a clean abstraction: a reliable, ordered stream of bytes between any two addresses on Earth. Everything else is built on that promise.
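To make the reordering and retransmission idea concrete, here is a minimal Python sketch of in-order reassembly. The `Segment` class and `reassemble` function are hypothetical names chosen for illustration, and the model is deliberately simplified: sequence numbers start at 0 (real TCP picks a random initial value) and segments never partially overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    seq: int     # byte offset of the first byte carried by this segment
    data: bytes


def reassemble(segments):
    """Return (in-order bytes starting at offset 0, next expected offset)."""
    buffered = {}
    for seg in segments:
        if seg.data:
            # Keep the first copy of each offset; duplicates are ignored.
            buffered.setdefault(seg.seq, seg.data)

    stream = bytearray()
    next_seq = 0
    # Deliver data only while there is no gap in the byte stream.
    while next_seq in buffered:
        chunk = buffered[next_seq]
        stream += chunk
        next_seq += len(chunk)
    return bytes(stream), next_seq


# Segments arrive out of order, one is duplicated, and the last one is lost.
arrived = [Segment(5, b" worl"), Segment(0, b"hello"), Segment(5, b" worl")]
partial, next_expected = reassemble(arrived)

# After the missing segment is retransmitted, the stream completes.
complete, _ = reassemble(arrived + [Segment(10, b"d!")])
```

The second return value plays the role of TCP's acknowledgment number: it tells the sender which byte the receiver is still waiting for, which is how the sender learns what to retransmit.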
BGP: the map of the internet
Knowing a destination's IP address tells you who you want to reach, but not how to get there. The internet is not one network; it is tens of thousands of independent networks called Autonomous Systems (AS) operated by ISPs, cloud providers, universities, and large companies like Meta. The Border Gateway Protocol (BGP) is how these networks tell each other which addresses they can deliver traffic to.
Think of BGP as the postal routing system between countries. Each network "advertises" the blocks of IP addresses (called prefixes, written like 69.171.250.0/24) that live inside it, effectively announcing, "If you have traffic for these addresses, send it my way and I'll get it there." Routers across the world collect these advertisements and assemble them into a constantly updated map of which paths lead where. When you advertise a prefix, the world learns how to reach you. When you withdraw it, the world forgets you exist: your addresses simply fall off the map, and traffic for them has nowhere to go.
BGP is built on trust and operates at a scale where small mistakes propagate globally in seconds. A network that withdraws its routes, whether by accident or because of a hijack, can make itself unreachable almost instantly. Remember this property: BGP is the layer where you can erase yourself from the internet.
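The advertise and withdraw behavior can be sketched with a toy routing table. This is not how a real BGP speaker works (there is no path selection, no peering, no policy); it only shows the effect on reachability. The `RoutingTable` class is hypothetical, and `AS64500` is an AS number reserved for documentation, not the real origin of this prefix.

```python
import ipaddress


class RoutingTable:
    """Toy map from advertised prefixes to the network that announced them."""

    def __init__(self):
        self._routes = {}  # ip_network -> origin label

    def advertise(self, prefix, origin):
        self._routes[ipaddress.ip_network(prefix)] = origin

    def withdraw(self, prefix):
        self._routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        """Return the origin of the most specific matching prefix, or None."""
        addr = ipaddress.ip_address(address)
        matches = [net for net in self._routes if addr in net]
        if not matches:
            return None  # no route: traffic for this address has nowhere to go
        best = max(matches, key=lambda net: net.prefixlen)
        return self._routes[best]


table = RoutingTable()
table.advertise("69.171.250.0/24", "AS64500")
before = table.lookup("69.171.250.35")   # a route exists

table.withdraw("69.171.250.0/24")
after = table.lookup("69.171.250.35")    # the address has fallen off the map
```

Nothing about the destination machine changed between the two lookups. Only the map did, which is why a withdrawal can make a perfectly healthy server unreachable.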
DNS: the phone book
Humans remember names; machines route to numbers. The Domain Name System (DNS) is the translation layer between the two. When you type instagram.com, your device asks a DNS resolver, "What IP address does this name point to?" The resolver walks a hierarchy of servers (root servers, then the servers for .com, then the servers that are authoritative for instagram.com) until it gets an answer, then hands the address back so your browser can actually connect.
DNS is so fundamental that it is easy to forget how much depends on it. Almost nothing on the modern internet is reached by raw IP address; nearly every connection begins with a name lookup. This makes DNS a frequent suspect when things break. But DNS is often the messenger, not the murderer. As the case study will show, DNS can fail loudly because of a problem that originated two layers below it.
One more detail matters for the story: large providers run their own authoritative DNS servers, and those servers are reachable only if the rest of the world has a working route to them. DNS depends on BGP. If the route to your DNS servers disappears, your names stop resolving, not because DNS is broken, but because no one can reach the machines that answer DNS questions.
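The walk down the hierarchy, and its dependence on reachability, can be simulated in a few lines. Everything here is made up for illustration: the server names, the delegation table, and the address (taken from a range reserved for documentation). A real resolver speaks the DNS wire protocol and caches answers; this sketch only models the chain of referrals.

```python
# Hypothetical delegation data: each server either refers us onward or answers.
SERVERS = {
    "root-server": {"referrals": {"com": "com-server"}},
    "com-server": {"referrals": {"instagram.com": "auth-server"}},
    "auth-server": {"answers": {"instagram.com": "192.0.2.10"}},
}


class Unreachable(Exception):
    """Raised when there is no route to a server we need to ask."""


def find_referral(referrals, name):
    """Return the server for the longest delegated suffix of `name`, if any."""
    labels = name.split(".")
    for i in range(len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in referrals:
            return referrals[suffix]
    return None


def resolve(name, reachable):
    """Walk root -> TLD -> authoritative; return (address, servers asked)."""
    server, path = "root-server", []
    while True:
        if server not in reachable:
            raise Unreachable(server)
        path.append(server)
        zone = SERVERS[server]
        if name in zone.get("answers", {}):
            return zone["answers"][name], path
        server = find_referral(zone.get("referrals", {}), name)
        if server is None:
            raise LookupError(f"no delegation for {name}")


address, path = resolve("instagram.com", reachable=set(SERVERS))
```

Calling `resolve("instagram.com", reachable={"root-server", "com-server"})` raises `Unreachable("auth-server")`: the authoritative data is intact, but the lookup fails because nothing can reach the server that holds it.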
HTTP and HTTPS: the language of the web
Once a name has been resolved to an address and a TCP connection has been opened to it, the two machines finally speak the language of the web: the HyperText Transfer Protocol (HTTP). HTTP is a request-and-response protocol. Your browser sends a request ("GET me the home page") and the server sends back a response containing the page, an image, or data, along with a status code (the familiar 200 OK, 404 Not Found, and so on).
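Because HTTP/1.1 is a text protocol, the request and the status line are easy to show directly. The sketch below builds a minimal GET request and parses the first line of a response. The helper names `build_request` and `parse_status_line` are hypothetical, and the parser assumes a well-formed status line that includes a reason phrase.

```python
def build_request(host, path="/"):
    """Return the bytes of a minimal HTTP/1.1 GET request."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",         # required in HTTP/1.1
        "Connection: close",
        "",                      # blank line ends the header section
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def parse_status_line(raw_response):
    """Split the first response line into (version, status code, reason)."""
    status_line = raw_response.split(b"\r\n", 1)[0].decode("ascii")
    version, code, reason = status_line.split(" ", 2)
    return version, int(code), reason


request = build_request("example.com")
status = parse_status_line(b"HTTP/1.1 404 Not Found\r\nContent-Length: 0\r\n\r\n")
```

Here `request` is the exact byte sequence a client would write to the TCP connection, and `status` holds the version, numeric code, and reason phrase that the browser uses to decide what to do next.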
HTTP by itself is plain text, which means anyone between you and the server could read or tamper with it. HTTPS fixes this by wrapping HTTP inside an encrypted TLS (Transport Layer Security) connection. Before any web data flows, the client and server perform a TLS handshake: they verify the server's identity using a digital certificate and agree on encryption keys. Only then does the actual HTTP exchange happen, now scrambled so that intermediaries see only gibberish. This is the difference between the padlock in your address bar and its absence, and it is why HTTPS is now the default for essentially the entire web.
HTTP/HTTPS is the layer users actually experience. But notice how much had to happen first: a name resolved (DNS), a route found (BGP), a reliable connection opened (TCP/IP), and only at the very end does the page load. Every web request is a quiet relay race through all four protocols.
How it all fits together: one click, four protocols
Putting it together, here is what happens in the second after you tap a link to example.com:

DNS turns example.com into an IP address like 93.184.216.34. To do that, your resolver must be able to reach the authoritative DNS servers, which depends on routing.
BGP is the reason your resolver knows a path to those servers, and the reason your packets later know a path to the destination address.
TCP/IP opens a reliable connection to that address through the three-way handshake and guarantees the bytes arrive in order.
HTTPS runs a TLS handshake over that connection, then sends the HTTP request and receives the page.

It works flawlessly billions of times a day, which is why no one thinks about it. But because each step assumes the previous one succeeded, a failure low in the chain can masquerade as a failure high in the chain. That is precisely what happened to one of the largest companies on Earth.
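A toy model makes the masquerade visible. The sketch below lists the steps a client performs in order and, for each, the pieces it relies on; given a set of broken pieces, it reports the first step the user would see fail. The step names, the layer labels, and the `first_visible_failure` function are all illustrative assumptions, not a real diagnostic tool.

```python
# Steps in the order the client performs them, with what each one relies on.
CLICK_STEPS = [
    ("DNS lookup", {"routing", "dns"}),
    ("TCP connection", {"routing", "tcp"}),
    ("HTTPS exchange", {"tcp", "tls", "http"}),
]


def first_visible_failure(broken):
    """Return the first step that fails, or None if the page loads."""
    for step, needs in CLICK_STEPS:
        if needs & broken:   # any required piece is broken
            return step
    return None


all_good = first_visible_failure(set())
routing_down = first_visible_failure({"routing"})
tls_down = first_visible_failure({"tls"})
```

With routing broken, the reported failure is "DNS lookup", even though DNS itself is fine. The user-visible symptom names the first step that needed the broken piece, not the piece that actually broke.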
Case study: the day Facebook erased itself from the map
What happened
On October 4, 2021, beginning at roughly 15:40 UTC, Facebook and every major service it owned (Instagram, WhatsApp, Messenger, and Oculus) became globally unreachable for approximately six to seven hours. The disruption was not a slowdown or a partial degradation; the services simply could not be found. Even "Log in with Facebook" buttons on unrelated third-party sites stopped working. By some estimates the outage cost the company around $100 million in lost revenue, alongside a sharp hit to its stock.
The symptom everyone first noticed was DNS: names like facebook.com stopped resolving.
The root cause: a BGP withdrawal triggered from below
According to Meta's own postmortem, the trouble began during routine maintenance on the company's backbone, the internal high-capacity network that connects its data centers to one another and to the outside world. A command intended to assess spare backbone capacity was issued with a flaw, and a configuration change disconnected Meta's data centers from the backbone.
Here is where the protocols interlock in a way that turned a mistake into a catastrophe. Meta's DNS servers were configured with a sensible-sounding safety feature: each DNS server continuously checks whether it can reach the data centers behind it, and if it cannot, it assumes the network around it is unhealthy and withdraws its own BGP route advertisements. The logic is "if I can't do my job correctly, stop sending traffic to me." Under normal conditions this is a good failsafe: it routes around a single sick server.
But on October 4, the backbone removal meant that every DNS server simultaneously concluded it was unhealthy. So they all did exactly what they were told to do: they withdrew their BGP advertisements at the same time. In an instant, the routes to Meta's DNS servers vanished from the global routing table. The DNS servers themselves were still running and perfectly capable of answering, but with their routes withdrawn, no one on the internet had any path to reach them.
That is the whole chain in one sentence: a backbone misconfiguration (internal network) made the DNS servers declare themselves unhealthy (DNS layer), which made them withdraw their routes (BGP layer), which erased Facebook's addresses from the internet's map. The DNS failures everyone saw were the top of the stack reporting a wound inflicted near the bottom.
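The failsafe logic described above can be sketched as a per-server rule: advertise your prefix only while your own health check passes. This is a simplified model, not Meta's implementation. The fleet, the server names, and the `advertised_prefixes` function are hypothetical, and the prefixes come from ranges reserved for documentation.

```python
# Hypothetical fleet: DNS server name -> prefix it advertises via BGP.
DNS_FLEET = {
    "dns-a": "192.0.2.0/24",
    "dns-b": "198.51.100.0/24",
    "dns-c": "203.0.113.0/24",
}


def advertised_prefixes(fleet, can_reach_datacenters):
    """Apply the per-server rule: advertise only while the health check passes."""
    return {
        prefix
        for name, prefix in fleet.items()
        if can_reach_datacenters(name)
    }


# Normal failure: one server is sick, the rest keep serving.
one_sick = advertised_prefixes(DNS_FLEET, lambda name: name != "dns-b")

# Backbone failure: every server fails the same check at the same moment.
backbone_down = advertised_prefixes(DNS_FLEET, lambda name: False)
```

In the first case two prefixes remain advertised and traffic routes around the sick server. In the second the set is empty: each server made a locally correct decision, and together they removed every route to the fleet.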
Why it took six hours to fix
A change that propagates in seconds should be reversible in seconds. It was not, and the reasons are a lesson in how dependencies cut both ways.
First, with the routes withdrawn, engineers could not reach the affected systems remotely through their normal tools, because those tools relied on the very network that was down. Second, many of Meta's internal systems for diagnosis and access depended on DNS, which was now broken, so the tools needed to fix the problem were themselves casualties of the problem. Third, when engineers tried to fix the routers physically, they faced obstacles at the data centers, where access control and badge systems that would normally be trivial were entangled with the same failed infrastructure. The company that had erased itself from the internet had also, inadvertently, locked itself out of its own buildings.
Recovery required getting people physically to the hardware to restore the backbone connection. Once that happened, BGP advertisements came back, routes reappeared in the global table, DNS began resolving again, and the application layer recovered over the following hour. Service was broadly restored by around 22:50 UTC.
The cascade, in order
To make the dependency chain explicit, here is the sequence the case lays bare:

Backbone maintenance error disconnects data centers from the internal network.
DNS servers' health checks fail because they can no longer reach those data centers.
DNS servers withdraw their BGP routes as designed, believing they are protecting the network.
BGP routes disappear globally, so the internet loses all paths to Facebook's addresses.
DNS resolution fails worldwide (the visible symptom) because the DNS servers are now unreachable.
Recovery stalls because the tools and access methods needed to fix it depended on the broken layers.

Each step happens at a different layer, and each step was working correctly by its own rules. There was no bug in BGP or DNS. The disaster lived in the coupling between them: a failsafe at one layer became an amplifier at another.
What the outage teaches about protocols
The technical lessons engineers drew from the incident generalize well beyond Meta. Failsafes that act in isolation can fail in unison: a rule that safely removes one unhealthy server will remove all of them when the unhealthiness is global, so safety mechanisms need to consider correlated failure. Out-of-band access matters: the systems you use to fix an outage must not depend on the systems that are down, or you lose your hands and eyes at the worst possible moment. And tight coupling between layers is a hidden risk: DNS automatically reacting to network health, and BGP automatically reacting to DNS, created a chain reaction no single team had fully modeled.
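One way to make a failsafe aware of correlated failure is to cap how much of the fleet it is allowed to remove at once. The sketch below is a hypothetical guard, not a description of what Meta changed after the incident; the function name, the 50% threshold, and the server names are assumptions chosen for illustration.

```python
def servers_to_withdraw(health, max_fraction=0.5):
    """Return unhealthy servers to withdraw, unless too many fail at once.

    `health` maps server name -> True if its health check passed.
    If more than `max_fraction` of the fleet looks unhealthy, the failure is
    probably in something shared, so withdraw nothing and leave the decision
    to a human.
    """
    unhealthy = [name for name, ok in health.items() if not ok]
    if len(unhealthy) > max_fraction * len(health):
        return []
    return unhealthy


isolated = servers_to_withdraw({"dns-a": True, "dns-b": False, "dns-c": True})
correlated = servers_to_withdraw({"dns-a": False, "dns-b": False, "dns-c": False})
```

With one unhealthy server out of three, the guard withdraws just that server. With all three unhealthy, it withdraws none, trading the risk of serving from a degraded fleet for the certainty of not disappearing entirely. Whether that trade is right depends on the system, which is why the threshold is a parameter.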
But the deeper lesson is the one this article opened with. The internet works because each protocol trusts the layer beneath it and handles one job well. That same trust is what lets a low-level mistake ripple upward until a configuration command on an internal backbone manifests, to three billion users, as "Instagram won't load." Understanding the protocols individually is useful. Understanding how they depend on one another is what separates a tidy diagram from an accurate picture of how the internet actually behaves, and breaks.
Takeaways

Protocols are layered agreements, each solving one problem and relying on the layer below. IP/TCP move bytes, BGP finds paths, DNS resolves names, HTTP/HTTPS carries meaning.
Symptoms appear at the top; causes often live at the bottom. A DNS failure is frequently a routing or network failure wearing a disguise.
BGP is where you can erase yourself. Withdrawing route advertisements removes you from the internet's map almost instantly and globally.
Coupling is the real danger. The 2021 Meta outage involved no broken protocol; it was the interaction between correct behaviors at different layers that caused the cascade.
Recovery tooling must be independent of the thing it recovers. When your diagnostic and access systems share the failure, the outage extends itself.
