Solved: Anyone not using hub and spoke?

#devops #programming #tutorial #cloud

🚀 Executive Summary

TL;DR: Traditional hub-and-spoke cloud networking often creates bottlenecks and increases the blast radius of misconfigurations as organizations scale. The article explores alternatives like direct VPC peering for specific high-bandwidth needs, modern hub-and-spoke with AWS Transit Gateway for scalable solutions, and multi-hub or Cloud WAN for global enterprises to overcome these limitations.

🎯 Key Takeaways

Classic hub-and-spoke architectures can become central bottlenecks, increase the blast radius of misconfigurations, and introduce bureaucratic delays for network changes.
Direct VPC peering is a quick, low-latency solution for a small number of VPCs (2-4) with high-traffic requirements, but its overuse leads to unmanageable ‘spaghetti networking’.
AWS Transit Gateway (TGW) offers a modern, scalable hub-and-spoke solution: it removes the single-appliance bottleneck, provides granular routing through separate route tables, and handles spoke-to-spoke communication natively.
For large, multi-region enterprises, advanced topologies like multi-hub TGW deployments or AWS Cloud WAN are necessary to create a global backbone and segment networks effectively.

Hub-and-spoke isn’t the only answer for cloud networking. We’ll explore why teams stray from this classic model and look at practical alternatives like direct peering and multi-hub designs for when things get complicated.

So, You’re Thinking of Ditching Hub-and-Spoke? A Reality Check.

I still remember the 2 AM page. It was one of those cryptic “database unreachable” alerts from the prod-analytics cluster. The on-call engineer, a sharp but still junior guy, was completely stumped. The app servers were up, the database prod-db-01 was healthy, but nothing could connect. After 45 minutes of frantic digging, we found it: someone on the central platform team had pushed a “minor” firewall rule update to the hub VPC. It was supposed to lock down a test environment, but a typo in the CIDR range blackholed all traffic from the analytics spoke to the shared services spoke where the DB lived. An entire production system was down because of a single, seemingly unrelated change. That night, I understood why every engineer eventually asks, “Do we *really* need this hub-and-spoke thing?”

The “Why”: What’s Wrong With The Textbook Answer?

Let’s be clear: hub-and-spoke is the default for a reason. It centralizes security, simplifies DNS, and gives you a single place to manage egress and shared services. When you’re small, it’s perfect. But as you grow, this beautiful, clean diagram starts to develop some nasty habits:

The Central Bottleneck: Every single packet that needs to go from one spoke to another has to travel into the hub and back out. This can introduce latency and, more importantly, create a massive traffic chokepoint.
The Blast Radius of Doom: As in my war story, a single misconfiguration in the hub VPC (a bad route, a fat-fingered firewall rule) can take down every single application connected to it.
The Bureaucracy Chokehold: The “hub” is often owned by a central “Platform” or “Network” team. Need to open a port between your app and a new data service? Get in line and fill out a ticket. Innovation slows to a crawl.

So when a developer comes to you saying “I just need my two VPCs to talk,” they’re not being difficult. They’re trying to escape the chokehold. The good news is, you have options.

Solution 1: The Quick Fix – Direct VPC Peering

This is the “I need it working yesterday” solution. A team has prod-webapp-vpc and prod-ml-training-vpc, and they need to share a high-bandwidth connection without going through the central hub. Instead of waiting for the network team, you just connect them directly.

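It’s simple, fast, and effective for a point-to-point problem. One VPC requests the peering connection and the owner of the other VPC accepts it. Then you update the route tables in each VPC to point to the other’s CIDR block via the peering connection. Two caveats: the VPCs’ CIDR blocks can’t overlap, and peering isn’t transitive, so a VPC can only reach the VPCs it is directly peered with.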
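Here’s a minimal Terraform sketch of that peering setup, assuming both VPCs live in the same account and region (the only case where `auto_accept` works). All variable names are hypothetical placeholders, not part of any real module.

```hcl
# Hypothetical inputs; wire these to your own VPCs and route tables.
variable "webapp_vpc_id" { type = string }
variable "ml_vpc_id" { type = string }
variable "webapp_cidr" { type = string }
variable "ml_cidr" { type = string }
variable "webapp_route_table_id" { type = string }
variable "ml_route_table_id" { type = string }

# The peering connection itself: webapp is the requester, ml-training the accepter.
resource "aws_vpc_peering_connection" "webapp_to_ml" {
  vpc_id      = var.webapp_vpc_id
  peer_vpc_id = var.ml_vpc_id
  auto_accept = true # only valid when both VPCs are in the same account and region

  tags = {
    Name = "peer-prod-webapp-to-prod-ml-training"
  }
}

# Each side needs a route to the other side's CIDR via the peering connection.
resource "aws_route" "webapp_to_ml" {
  route_table_id            = var.webapp_route_table_id
  destination_cidr_block    = var.ml_cidr
  vpc_peering_connection_id = aws_vpc_peering_connection.webapp_to_ml.id
}

resource "aws_route" "ml_to_webapp" {
  route_table_id            = var.ml_route_table_id
  destination_cidr_block    = var.webapp_cidr
  vpc_peering_connection_id = aws_vpc_peering_connection.webapp_to_ml.id
}
```

Routes alone don’t open traffic: the security groups on both sides still have to allow it. For cross-account or cross-region peering, you would drop `auto_accept` and accept the connection from the other side instead.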

Warning: This is the path to “spaghetti networking.” If you do this for two or three VPCs, it’s manageable. If you start peering every VPC with every other VPC, you create an unmanageable mesh of connections with no central control. Use this surgically, not as a default strategy.
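To put a number on the spaghetti: a full mesh of n VPCs needs one peering connection per pair of VPCs, which grows quadratically.

```latex
\[
\text{peerings}(n) = \binom{n}{2} = \frac{n(n-1)}{2},
\qquad
\text{peerings}(4) = 6, \quad
\text{peerings}(10) = 45, \quad
\text{peerings}(20) = 190
\]
```

Each of those connections also needs a route on both sides, so the route tables grow just as fast.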

When to use it:

For a small number of VPCs (think 2-4) that have a very high-traffic, low-latency requirement between them and don’t need access to many other shared services.

Solution 2: The Permanent Fix – The Modern Hub & Spoke with Transit Gateway

Most of the problems people have with “hub-and-spoke” are actually problems with the *old* way of doing it—using a regular VPC with a software firewall/router as the hub. The modern solution is a dedicated managed service like AWS Transit Gateway (TGW).

A TGW isn’t a VPC; it’s a regional network router. You attach all your VPCs (spokes) to it. It solves the classic problems:

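It’s not a bottleneck: It’s a managed service that AWS scales for you, so throughput is bounded by per-attachment quotas rather than by the size of a single router instance in the hub.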
Granular Routing: You can create separate TGW route tables. This means the dev spokes can all talk to each other in their own little sandbox, but they can’t touch the prod spokes. This contains the blast radius.
East-West Traffic: It handles spoke-to-spoke communication natively without traffic having to “hairpin” through a single EC2 instance in the hub.

Here’s a taste of how simple a Terraform config for this can be:

# main.tf - Assuming the TGW already exists and the aws.us-east-1 provider alias is configured

resource "aws_ec2_transit_gateway_vpc_attachment" "webapp_attachment" {
  provider = aws.us-east-1

  transit_gateway_id = var.transit_gateway_id
  vpc_id             = var.prod_webapp_vpc_id
  subnet_ids         = var.prod_webapp_private_subnets # one subnet per Availability Zone

  tags = {
    Name = "tgw-attach-prod-webapp"
  }
}

# You also need to add a route in your VPC's route table
resource "aws_route" "to_shared_services" {
  provider = aws.us-east-1

  route_table_id         = var.prod_webapp_private_route_table_id
  destination_cidr_block = "10.10.0.0/16" # CIDR of the shared services VPC
  transit_gateway_id     = var.transit_gateway_id

  # The route can only be created once the VPC is attached to the TGW
  depends_on = [aws_ec2_transit_gateway_vpc_attachment.webapp_attachment]
}
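The “granular routing” point deserves its own sketch. Here is one way the dev sandbox could look, assuming the dev attachments were created with `transit_gateway_default_route_table_association` and `transit_gateway_default_route_table_propagation` set to `false`, so they don’t land in the TGW’s default route table. The `dev_attachment_ids` variable is a hypothetical map of spoke name to attachment ID.

```hcl
variable "transit_gateway_id" { type = string }

# Hypothetical input: spoke name => TGW attachment ID for every dev VPC.
variable "dev_attachment_ids" { type = map(string) }

# A dedicated TGW route table that only the dev spokes use.
resource "aws_ec2_transit_gateway_route_table" "dev" {
  transit_gateway_id = var.transit_gateway_id

  tags = {
    Name = "tgw-rt-dev"
  }
}

# Association: traffic arriving from a dev attachment is routed using this table.
resource "aws_ec2_transit_gateway_route_table_association" "dev" {
  for_each = var.dev_attachment_ids

  transit_gateway_attachment_id  = each.value
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.dev.id
}

# Propagation: each dev attachment advertises its VPC CIDR into this table.
resource "aws_ec2_transit_gateway_route_table_propagation" "dev" {
  for_each = var.dev_attachment_ids

  transit_gateway_attachment_id  = each.value
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.dev.id
}
```

Because only dev attachments propagate into this table, dev traffic has no route to prod CIDRs. A prod route table built the same way keeps the two environments apart, and a bad change to one table stays inside that environment.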

Solution 3: The ‘Nuclear’ Option – A Multi-Hub or Full Mesh Reality

Sometimes, even a single TGW isn’t enough. A TGW is a regional resource, so VPCs can only attach to a TGW in their own region. If you’re a global company, one hub in Virginia can’t serve your Sydney VPCs, and you wouldn’t want Sydney-to-Sydney traffic leaving the region anyway. This is where you graduate to more advanced topologies.

Multi-Hub (Region-to-Region): You deploy a Transit Gateway in each major region (e.g., us-east-1, eu-west-1, ap-southeast-2). Then, you peer these TGWs together. This creates a global backbone for your company. Traffic stays within a region for local communication but can traverse the peering connection for cross-region needs efficiently.
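A rough Terraform sketch of peering two regional TGWs, assuming both live in the same account. The provider aliases (`use1`, `apse2`) and all variables are hypothetical. TGW peering attachments don’t propagate routes, so the cross-region routes are added statically.

```hcl
provider "aws" {
  alias  = "use1"
  region = "us-east-1"
}

provider "aws" {
  alias  = "apse2"
  region = "ap-southeast-2"
}

# Hypothetical inputs: existing TGWs, their route tables, and a summary CIDR per region.
variable "use1_tgw_id" { type = string }
variable "apse2_tgw_id" { type = string }
variable "use1_tgw_route_table_id" { type = string }
variable "apse2_tgw_route_table_id" { type = string }
variable "use1_cidr" { type = string }
variable "apse2_cidr" { type = string }

# Request the peering from us-east-1 ...
resource "aws_ec2_transit_gateway_peering_attachment" "use1_to_apse2" {
  provider = aws.use1

  transit_gateway_id      = var.use1_tgw_id
  peer_transit_gateway_id = var.apse2_tgw_id
  peer_region             = "ap-southeast-2"

  tags = {
    Name = "tgw-peer-use1-apse2"
  }
}

# ... and accept it in ap-southeast-2.
resource "aws_ec2_transit_gateway_peering_attachment_accepter" "apse2" {
  provider = aws.apse2

  transit_gateway_attachment_id = aws_ec2_transit_gateway_peering_attachment.use1_to_apse2.id
}

# Static routes in each region's TGW route table pointing at the peering attachment.
resource "aws_ec2_transit_gateway_route" "use1_to_apse2" {
  provider = aws.use1

  destination_cidr_block         = var.apse2_cidr
  transit_gateway_route_table_id = var.use1_tgw_route_table_id
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_peering_attachment.use1_to_apse2.id

  depends_on = [aws_ec2_transit_gateway_peering_attachment_accepter.apse2]
}

resource "aws_ec2_transit_gateway_route" "apse2_to_use1" {
  provider = aws.apse2

  destination_cidr_block         = var.use1_cidr
  transit_gateway_route_table_id = var.apse2_tgw_route_table_id
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_peering_attachment_accepter.apse2.id
}
```

This works best when each region owns one summarizable CIDR range, so a single static route per remote region is enough. With three or more regions you repeat the pattern per pair, which is about the point where Cloud WAN starts to look attractive.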

Cloud WAN / Full Mesh: For the top 1% of complexity, services like AWS Cloud WAN build a core network for you and allow you to define segments. You can create a “prod” segment and a “dev” segment globally. All VPCs attached to the “prod” segment can talk to each other (a logical mesh), but they are isolated from “dev”. This is powerful but requires a dedicated networking team to manage policy, routing, and costs.
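To give a feel for what “defining segments” means, here is a sketch of a Cloud WAN policy using the AWS provider’s policy document data source. The ASN range, edge locations, and the `segment` tag key are illustrative assumptions, and the block is written from memory of the provider’s schema, so check it against the current provider documentation before relying on it.

```hcl
data "aws_networkmanager_core_network_policy_document" "segments" {
  core_network_configuration {
    asn_ranges = ["64512-64555"] # illustrative private ASN range

    edge_locations {
      location = "us-east-1"
    }

    edge_locations {
      location = "ap-southeast-2"
    }
  }

  # Two global segments. Segments don't exchange routes unless you
  # explicitly share them, which is what keeps prod and dev apart.
  segments {
    name = "prod"
  }

  segments {
    name = "dev"
  }

  # Map attachments to segments by tag: segment=prod goes to "prod".
  attachment_policies {
    rule_number     = 100
    condition_logic = "or"

    conditions {
      type     = "tag-value"
      operator = "equals"
      key      = "segment"
      value    = "prod"
    }

    action {
      association_method = "constant"
      segment            = "prod"
    }
  }

  # Same rule for dev attachments.
  attachment_policies {
    rule_number     = 200
    condition_logic = "or"

    conditions {
      type     = "tag-value"
      operator = "equals"
      key      = "segment"
      value    = "dev"
    }

    action {
      association_method = "constant"
      segment            = "dev"
    }
  }
}

# The rendered JSON is what you would attach to the core network.
output "core_network_policy_json" {
  value = data.aws_networkmanager_core_network_policy_document.segments.json
}
```

The policy is the whole design here: once it is attached to a core network, tagging a VPC attachment decides which segment it joins, and no per-VPC route tables are involved.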

Which path is right? A quick comparison:

Approach	Complexity	Cost	Best For
Direct Peering	Low	Low (Data transfer costs only)	Quickly linking 2-4 specific VPCs.
Modern Hub & Spoke (TGW)	Medium	Medium (hourly per-attachment fee + per-GB data processing fee)	Most startups and mid-size companies. The 90% solution.
Multi-Hub / Cloud WAN	High	High	Large, multi-region, global enterprises.

At the end of the day, there’s no “one true way.” The textbook hub-and-spoke model is a starting point, not a destination. Don’t be afraid to deviate when reality calls for it. Just make sure you’re choosing a path to solve a real problem, not just creating a more complicated one for your future self to untangle at 2 AM.