Most hybrid-cloud networking discussions eventually converge on the same set of enterprise-heavy solutions:
- Managed VPN gateways
- MPLS connectivity
- Transit Gateway
- BGP-based active-active architectures
In practice, many engineering teams are solving a much narrower — but operationally critical — problem:
Public cloud applications need reliable, secure access to private on-premises systems without introducing unnecessary networking complexity.
In our case, the requirement initially looked deceptively simple.
We had:
- Public-facing APIs running on AWS
- Internal APIs and databases hosted on-premises
- Strict requirements around resilience and observability
- A need to avoid expensive enterprise networking stacks
As traffic increased and failover testing became more realistic, several problems started emerging:
- Tunnels reported healthy while application traffic silently failed
- Intermittent packet forwarding issues became difficult to diagnose
- Failover behavior became inconsistent
- Route ambiguity started appearing during dual-path testing
- Monitoring produced false-positive health signals
The problem slowly stopped being: "How do we create secure connectivity?" and became: "How do we build predictable, observable, and operationally recoverable hybrid-cloud networking?"
That distinction ended up shaping the entire architecture.
Instead of optimizing for networking sophistication, the final system intentionally optimized for:
- Deterministic failover
- Operational simplicity
- Infrastructure-driven recovery
- Deep packet-forwarding visibility
- Cloud-agnostic portability
The result was a lightweight hybrid-cloud architecture built using WireGuard, Linux-native routing, AWS Route Table failover, Prometheus observability, and active-standby tunnel paths.
The Initial Assumption: One Tunnel Is Enough
The earliest implementation used a single WireGuard tunnel between AWS and the on-premises environment.
At small scale, this worked well — latency stayed low, operational overhead was minimal, and the architecture remained easy to reason about.
However, failover testing immediately exposed an operational weakness: the tunnel itself became a single recovery dependency.
Any degradation introduced:
- Backend API timeouts
- Intermittent request failures
- Delayed recovery
- Manual operational intervention
At this stage, the tunnel was technically functional, but operationally fragile.
Why Traditional Enterprise Networking Was Avoided
The obvious next step was evaluating managed VPN services, Transit Gateway, BGP-based routing, and active-active tunnel fabrics. On paper, these architectures looked attractive. Operationally, they introduced several concerns.
| Concern | Operational Impact |
|---|---|
| Managed VPN pricing | Difficult to justify at mid-scale |
| Dynamic route convergence | Harder to predict during failures |
| Vendor abstractions | Reduced debugging visibility |
| Active-active routing | Increased troubleshooting complexity |
| BGP failover behavior | Operationally non-deterministic |
One operational realization became increasingly clear: faster troubleshooting was consistently more valuable than maximizing tunnel utilization efficiency. At mid-scale, operational determinism mattered more than networking elegance.
Early Active-Active Experiments Created More Problems Than They Solved
The next iteration experimented with dual active tunnels. The idea initially seemed straightforward: two active paths, traffic balancing, higher availability, improved utilization.
Operationally, this introduced several difficult behaviors. During partial degradation testing, we observed:
- Asymmetric return traffic
- Intermittent packet forwarding failures
- Route ambiguity inside Linux routing tables
- Inconsistent failover timing
- Hard-to-debug intermittent API failures
One particularly difficult issue appeared when outbound and return traffic took different paths during transient failures. From the application perspective, some requests succeeded while others silently timed out — even though tunnel interfaces still appeared healthy.
The architecture technically achieved redundancy. Operationally, it reduced predictability.
Eventually, the team realized: deterministic failover behavior was more operationally valuable than active-active utilization efficiency. That became the turning point in the architecture.
Moving Toward Deterministic Routing
Instead of active-active routing, the architecture evolved toward isolated tunnel paths, active-standby failover, infrastructure-driven recovery, and strict subnet separation.
Traffic would always prefer a primary path and shift only during validated degradation. This dramatically simplified troubleshooting, observability, routing behavior, and operational recovery.
The system intentionally sacrificed maximum tunnel utilization and dynamic traffic balancing in exchange for deterministic failover, operational transparency, and predictable recovery.
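One way to picture "shift only during validated degradation" is a small failure-streak check: a single failed probe does not move traffic, but several in a row do. The sketch below is illustrative only. The function names and the `FAIL_THRESHOLD` value are assumptions, not details of the original system.

```shell
#!/bin/bash
# Hypothetical helper: decide which path should carry traffic.
# FAIL_THRESHOLD is an assumed value, not taken from the original system.
FAIL_THRESHOLD=3

# update_streak <previous_streak> <probe_result: ok|fail>
# Prints the new count of consecutive failed probes.
update_streak() {
    local streak=$1 result=$2
    if [ "$result" = "ok" ]; then
        echo 0
    else
        echo $((streak + 1))
    fi
}

# select_path <streak>
# Prints "primary" until the failure streak reaches the threshold.
select_path() {
    local streak=$1
    if [ "$streak" -ge "$FAIL_THRESHOLD" ]; then
        echo "standby"
    else
        echo "primary"
    fi
}

# Example: replay a sequence of probe results and show the chosen path.
streak=0
for result in ok fail fail ok fail fail fail; do
    streak=$(update_streak "$streak" "$result")
    echo "$result -> $(select_path "$streak")"
done
```

Because a successful probe resets the streak, brief packet loss does not trigger a route change. Only sustained failure does, which keeps failover timing easy to reason about.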
Final Hybrid-Cloud Architecture
The final design used two completely isolated WireGuard paths, dedicated subnet ownership, AWS Route Table failover, and infrastructure-level recovery automation.
Why Two Completely Separate Tunnel Paths Worked Better
Operationally, shared tunnel systems introduced several recurring problems: overlapping route ownership, shared interface state, route recursion issues, asymmetric forwarding, and failover race conditions.
Instead, each tunnel path operated independently with isolated interfaces, isolated route ownership, isolated health validation, and isolated failover behavior. Failures became easier to isolate, observe, and recover from.
The secondary path remained continuously available but did not actively serve traffic until route failover occurred — a lightweight HA model without requiring BGP convergence, mesh synchronization, or overlay routing complexity.
Why Subnet Isolation Became Critical
Each tunnel path owned an isolated subnet slice.
| Environment | Interface | Tunnel IP | Role |
|---|---|---|---|
| On-Premises | wg0 / wg1 | 10.100.0.2 / 10.200.0.2 | Core Plane |
| AWS Primary | wg0 | 10.100.0.1 | ACTIVE |
| AWS Secondary | wg1 | 10.200.0.1 | STANDBY |
One operational lesson became very clear: simpler routing topologies fail in more understandable ways.
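As a rough illustration of per-path subnet ownership, the sketch below renders a `wg-quick` style configuration for each AWS gateway node. The tunnel IPs come from the table above. Everything else is an assumption made for the example: the `/30` mask, the ports, the `onprem.example.net` endpoint, the `192.168.10.0/24` backend range, and the key placeholders.

```shell
#!/bin/bash
# render_wg_conf <tunnel_address_cidr> <listen_port> <peer_tunnel_ip> <peer_endpoint> <backend_cidr>
# Prints a wg-quick style config. Key values are placeholders to be filled in.
render_wg_conf() {
    local address=$1 port=$2 peer_ip=$3 endpoint=$4 backend_cidr=$5
    cat <<EOF
[Interface]
Address = ${address}
ListenPort = ${port}
PrivateKey = <private-key>

[Peer]
PublicKey = <peer-public-key>
Endpoint = ${endpoint}
AllowedIPs = ${peer_ip}/32, ${backend_cidr}
PersistentKeepalive = 25
EOF
}

# AWS primary node (wg0): owns 10.100.0.0/30 (assumed mask).
render_wg_conf 10.100.0.1/30 51820 10.100.0.2 onprem.example.net:51820 192.168.10.0/24

# AWS standby node (wg1): owns 10.200.0.0/30 (assumed mask).
render_wg_conf 10.200.0.1/30 51821 10.200.0.2 onprem.example.net:51821 192.168.10.0/24
```

Each config is meant for a separate gateway node, so neither node holds routes for the other path. On the on-premises host, where both interfaces coexist, overlapping `AllowedIPs` would likely need explicit route handling that this sketch does not cover.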
Separating VPN Edges from Backend Systems
Separating VPN edge gateways from backend APIs, databases, and public ingress layers reduced routing recursion, asymmetric local forwarding, and interface overlap, and it limited the blast radius of a gateway failure. It also improved maintenance flexibility, backend isolation, and the ability to scale each layer independently.
The Biggest Monitoring Mistake We Initially Made
Early monitoring focused primarily on interface UP state and WireGuard handshake timestamps.
During testing, we observed tunnels remaining "healthy" from WireGuard's perspective while real application traffic silently failed. In several cases, handshake timestamps continued updating and interfaces remained UP — but packet forwarding had already degraded.
One particularly difficult debugging session involved intermittent API failures where ICMP occasionally worked, application traffic intermittently timed out, and tunnel interfaces remained healthy. The root issue turned out to be partial packet-forwarding degradation rather than tunnel failure itself.
That incident fundamentally changed the monitoring strategy.
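Since ICMP occasionally succeeded while application traffic timed out, a single ping is a weak signal for this kind of partial degradation. One possible approach, sketched below with hypothetical function names, is to run a probe several times and require a minimum number of successes. The probe command is passed in as an argument, so it can be a ping or an application-level request.

```shell
#!/bin/bash
# count_successes <attempts> <command...>
# Runs the probe command <attempts> times and prints how many succeeded.
count_successes() {
    local attempts=$1 ok=0 i
    shift
    for ((i = 0; i < attempts; i++)); do
        if "$@" >/dev/null 2>&1; then
            ok=$((ok + 1))
        fi
    done
    echo "$ok"
}

# forwarding_healthy <attempts> <min_successes> <command...>
# Returns 0 only when at least <min_successes> probes succeed.
forwarding_healthy() {
    local attempts=$1 min_ok=$2
    shift 2
    [ "$(count_successes "$attempts" "$@")" -ge "$min_ok" ]
}
```

A call such as `forwarding_healthy 5 5 curl -fsS --max-time 2 http://10.100.0.1:8080/healthz` would treat even one timed-out request as degradation. The port and the `/healthz` path are placeholders, not endpoints from the original system.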
Evolving Toward Data-Plane Validation
The monitoring model evolved into three distinct validation layers.
| Validation Layer | Purpose |
|---|---|
| Interface Validation | Verify interface availability |
| Handshake Validation | Verify a recent successful handshake with the peer |
| Data Plane Validation | Verify actual packet forwarding |
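
Exporting each layer as its own gauge makes it visible which layer failed, rather than only that something failed. The following is a minimal sketch in the Prometheus text format. The metric names and the function signature are assumptions for illustration and do not come from the original system.

```shell
#!/bin/bash
# emit_layer_metrics <interface> <iface_up 0|1> <handshake_age_s> <max_age_s> <probe_ok 0|1>
# Prints one gauge per validation layer in Prometheus text format.
emit_layer_metrics() {
    local iface=$1 iface_up=$2 age=$3 max_age=$4 probe_ok=$5
    local handshake_fresh=0
    # The handshake counts as fresh only when its age is within the threshold.
    if [ "$age" -ge 0 ] && [ "$age" -le "$max_age" ]; then
        handshake_fresh=1
    fi
    cat <<EOF
# TYPE wireguard_interface_up gauge
wireguard_interface_up{interface="${iface}"} ${iface_up}
# TYPE wireguard_handshake_fresh gauge
wireguard_handshake_fresh{interface="${iface}"} ${handshake_fresh}
# TYPE wireguard_dataplane_ok gauge
wireguard_dataplane_ok{interface="${iface}"} ${probe_ok}
EOF
}

# Example: interface up, handshake 42 s old, but the forwarding probe failed.
emit_layer_metrics wg0 1 42 180 0
```

In the example call, the first two layers report healthy while the data-plane gauge reports 0, which is exactly the failure mode that interface and handshake checks alone would miss.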
Production Health Validation Script
Tunnel health metrics were exported using a lightweight Linux validation script integrated into the Node Exporter textfile collector.
#!/bin/bash
# Export WireGuard tunnel health for the Node Exporter textfile collector.
METRIC_FILE="/var/lib/node_exporter/textfile_collector/wg.prom"
WG_INTERFACE="wg0"
WG_PEER_IP="10.100.0.1"
THRESHOLD_SECONDS=180

# Write the metric to a temp file and rename it, so Node Exporter
# never reads a partially written file. Then exit.
report() {
    echo "wireguard_wg0_healthy $1" > "$METRIC_FILE.$$"
    mv "$METRIC_FILE.$$" "$METRIC_FILE"
    exit 0
}

# Layer 1: the interface must exist.
wg show "$WG_INTERFACE" >/dev/null 2>&1 || report 0

# Layer 2: the most recent handshake must be fresh.
# If several peers are configured, the newest handshake is used.
NOW=$(date +%s)
LAST=$(wg show "$WG_INTERFACE" latest-handshakes | awk '$2 > max {max = $2} END {print max + 0}')
[ "$LAST" -gt 0 ] || report 0
[ $((NOW - LAST)) -le "$THRESHOLD_SECONDS" ] || report 0

# Layer 3: packets must actually traverse the tunnel.
ping -c 2 -W 2 "$WG_PEER_IP" >/dev/null 2>&1 || report 0

report 1
The monitoring flow validated interface state, handshake freshness, and actual packet forwarding.
That final packet-forwarding validation ended up becoming the most operationally valuable signal in the system.
Prometheus Observability Pipeline
graph LR
Script[Health Validation Script] -->|Writes Metrics| File["wg.prom"]
File -->|Read by Textfile Collector| NodeExporter[Node Exporter]
NodeExporter -->|Exposes Metrics| Prometheus[Prometheus]
Prometheus --> Alertmanager[Alertmanager]
Alertmanager --> Automation[Failover Automation]
Automation --> AWSRT[AWS Route Tables]
Infrastructure-Driven Failover
Instead of rebuilding tunnels or restarting applications during failures, failover occurred entirely through AWS Route Table updates.
When the primary tunnel degraded:
- Prometheus detected failure
- Alertmanager triggered automation
- Route targets were updated
- Traffic shifted to the standby WireGuard node
During testing, primary degradation was typically detected within seconds, route updates remained operationally predictable, and recovery was noticeably more deterministic than in the earlier active-active experiments.
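The route-target update in the steps above can be sketched with the AWS CLI's `aws ec2 replace-route` command. This is an illustration under stated assumptions, not the original automation. The route table ID, network interface IDs, on-premises CIDR, and the `DRY_RUN` switch are all placeholders introduced for the example.

```shell
#!/bin/bash
# All identifiers below are placeholders, not values from the original system.
ROUTE_TABLE_ID="rtb-0123456789abcdef0"
ONPREM_CIDR="192.168.10.0/24"
PRIMARY_ENI="eni-0aaaaaaaaaaaaaaaa"
STANDBY_ENI="eni-0bbbbbbbbbbbbbbbb"

# point_route_at <eni>
# Repoints the on-premises route at the given gateway network interface.
# With DRY_RUN=1 (the default here) the command is printed, not executed.
point_route_at() {
    local eni=$1
    local cmd=(aws ec2 replace-route
        --route-table-id "$ROUTE_TABLE_ID"
        --destination-cidr-block "$ONPREM_CIDR"
        --network-interface-id "$eni")
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "${cmd[*]}"
    else
        "${cmd[@]}"
    fi
}

# Failover: send on-premises traffic through the standby WireGuard node.
point_route_at "$STANDBY_ENI"
```

Failing back would be the same call with `$PRIMARY_ENI`. The sketch assumes the gateway instances have source/destination checking disabled, which EC2 requires for instances that forward traffic on behalf of others.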
One operational pattern became consistently clear: the less failover logic applications contained, the more predictable recovery became.
Why This Architecture Worked Operationally
Operational Simplicity
The final design intentionally avoided BGP convergence, overlay routing complexity, active-active synchronization, and enterprise VPN orchestration. Everything relied on Linux routing, lightweight EC2 instances, WireGuard, and route-table automation — which significantly reduced operational overhead.
Better Failure Isolation
Separating tunnel paths, subnet ownership, route ownership, and backend systems made failures significantly easier to isolate, debug, and recover from.
Observability Became More Valuable Than Redundancy Alone
Redundant tunnels without packet-forwarding visibility still create operational risk. The architecture became reliable not simply because of redundancy, but because degradation became observable.
Cloud-Agnostic Portability
Because the architecture depended primarily on WireGuard, Linux-native routing, Prometheus, and infrastructure automation, the same design could be replicated across AWS, Azure, Google Cloud, private cloud environments, and bare-metal infrastructure with minimal architectural changes.
Final Thoughts
This architecture reinforced several operational principles repeatedly throughout testing.
First: Simpler networking topologies are easier to recover from during real incidents.
Second: Deterministic failover is often operationally more valuable than maximizing tunnel utilization.
And finally: Observability matters more than redundancy alone.
Using WireGuard, Linux-native routing, Prometheus observability, and AWS Route Table automation, it was possible to build resilient hybrid-cloud connectivity, predictable failover behavior, observable tunnel infrastructure, and cloud-agnostic deployment patterns — without introducing unnecessary enterprise networking complexity.
Originally published on the GeekyAnts Blog.