
Seven Labs for Seven Labs

Originally published at Medium

How I Built Apex VPN: Infrastructure & Architecture Breakdown

A technical deep-dive into building a cross-platform VPN with 500+ nodes, AES-256 encryption, and sub-20ms latency across 20+ countries.


When the client came to us with the Apex VPN brief, the requirements were deceptively simple: build a fast, private, and scalable VPN optimised for gamers and streamers. What followed was one of the more technically demanding infrastructure projects I’ve shipped — and one of the most instructive.

This post breaks down how I designed and built it, the decisions that shaped the architecture, and what I’d do differently.

The Requirements That Shaped Everything

Before I wrote a single line of code, the client’s priorities were clear:

  • Latency above all: gamers tolerate a lot, but not lag. Sub-20ms in key regions was a hard requirement.
  • Cross-platform: iOS, Android, Web, and Chrome Extension. One backend, four clients.
  • Privacy-first: AES-256 encryption, zero-logs policy, RAM-only servers. No exceptions.
  • Scale: the architecture had to support hundreds of nodes without becoming a maintenance nightmare.

These four constraints defined every infrastructure decision that followed.

The Stack

Here’s what the final system runs on:

  • Infrastructure: DigitalOcean + Vultr (multi-cloud for redundancy and regional coverage)
  • Automation: Ansible (server provisioning and configuration management)
  • Containerisation: Docker
  • Reverse Proxy: Nginx
  • CI/CD: GitHub Actions
  • Frontend: React.js + Next.js
  • Backend: Node.js
  • DNS & DDoS Protection: Cloudflare
  • OS: Linux (Ubuntu 22.04 LTS on all nodes)

Architecture Overview

The system is built around three layers:

1. The Node Layer

500+ VPN servers deployed across 20+ countries. Each node is provisioned identically using Ansible playbooks — no manual SSH, no configuration drift. A new node goes from blank VPS to production-ready in under 8 minutes.

Each server runs:

  • A hardened VPN daemon (WireGuard-based for performance, with OpenVPN fallback)
  • Nginx as a reverse proxy handling TLS termination
  • Docker containers for the management agent
  • Automated health reporting to the central control plane
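
The health reporting in the list above could be modelled roughly as follows. This is a minimal sketch, not the project's actual agent protocol: the `HealthReport` fields, the 60-second staleness window, and the 0.9 CPU threshold are all assumptions chosen for illustration.

```typescript
// Hypothetical payload a node's management agent might send to the control plane.
// Field names and thresholds are illustrative assumptions, not the real schema.
interface HealthReport {
  nodeId: string;
  region: string;
  reportedAtMs: number; // Unix epoch, milliseconds
  activeTunnels: number;
  cpuLoad: number; // 0..1
}

type NodeStatus = "healthy" | "degraded" | "stale";

// Classify a node from its most recent report.
// "stale" takes priority: an old report says nothing about the node's current state.
function classifyNode(
  report: HealthReport,
  nowMs: number,
  maxAgeMs: number = 60000,
  maxCpuLoad: number = 0.9
): NodeStatus {
  if (nowMs - report.reportedAtMs > maxAgeMs) return "stale";
  if (report.cpuLoad > maxCpuLoad) return "degraded";
  return "healthy";
}
```

Treating a missing report as its own state, rather than as "unhealthy", lets the control plane distinguish a node that is struggling from one that has simply gone silent.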

RAM-only configuration means no data is written to disk. On reboot, the server is clean.

2. The Control Plane

A centralised backend that handles:

  • Node registration and health monitoring
  • User authentication and session management
  • Server selection logic (latency-based routing)
  • Key exchange and certificate rotation
  • Usage metrics (aggregated only — no per-user logs)

The control plane runs on a hardened AWS instance with private VPC networking, IAM-restricted access, and automated certificate rotation every 30 days.
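
As a rough illustration of the latency-based server selection listed above, the sketch below picks the healthy node with the lowest measured round-trip time, preferring nodes that still have capacity. The `CandidateNode` shape, the `selectNode` name, and the 0.85 load ceiling are assumptions; the post does not describe the real selection algorithm.

```typescript
// Hypothetical view of a node as seen by the selection logic.
interface CandidateNode {
  id: string;
  healthy: boolean;
  rttMs: number; // round-trip time measured from the client, in milliseconds
  load: number; // 0..1, fraction of capacity in use
}

// Choose the lowest-latency healthy node, preferring nodes under the load ceiling.
// Falls back to any healthy node if every healthy node is above the ceiling.
function selectNode(
  nodes: CandidateNode[],
  maxLoad: number = 0.85
): CandidateNode | undefined {
  const healthy = nodes.filter((n) => n.healthy);
  const withHeadroom = healthy.filter((n) => n.load <= maxLoad);
  const pool = withHeadroom.length > 0 ? withHeadroom : healthy;

  let best: CandidateNode | undefined;
  for (const n of pool) {
    if (best === undefined || n.rttMs < best.rttMs) best = n;
  }
  return best; // undefined when no healthy node exists
}
```

Filtering on load before comparing latency avoids sending every nearby user to the same saturated node, which would otherwise erase the latency advantage.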

3. The Client Layer

Four clients share one backend API. The web app and Chrome extension are Next.js-based. The mobile apps (iOS and Android) connect to the same REST API with platform-native VPN profile management.

The biggest engineering challenge here was handling VPN profile installation across platforms — each OS has its own way of managing VPN configurations, and abstracting this cleanly required careful API design.
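
One way to keep such an API platform-agnostic is to return a neutral tunnel description and let each client translate it into whatever its OS expects. The sketch below assumes a hypothetical `TunnelProfile` type and renders it as a WireGuard configuration file in the standard `[Interface]` / `[Peer]` layout; mobile clients would map the same fields onto their native VPN APIs instead. None of these names come from the actual Apex VPN API.

```typescript
// Platform-neutral tunnel description the API could return to every client.
// All field names are hypothetical.
interface TunnelProfile {
  clientPrivateKey: string; // base64; ideally generated on the device itself
  clientAddress: string; // e.g. "10.8.0.2/32"
  dnsServers: string[];
  serverPublicKey: string; // base64
  serverEndpoint: string; // "host:port"
  keepaliveSeconds?: number;
}

// Render the profile as a wg-quick style configuration file.
function renderWireGuardConfig(p: TunnelProfile): string {
  const lines: string[] = [
    "[Interface]",
    `PrivateKey = ${p.clientPrivateKey}`,
    `Address = ${p.clientAddress}`,
  ];
  if (p.dnsServers.length > 0) {
    lines.push(`DNS = ${p.dnsServers.join(", ")}`);
  }
  lines.push(
    "",
    "[Peer]",
    `PublicKey = ${p.serverPublicKey}`,
    `Endpoint = ${p.serverEndpoint}`,
    "AllowedIPs = 0.0.0.0/0, ::/0" // route all IPv4 and IPv6 traffic through the tunnel
  );
  if (p.keepaliveSeconds !== undefined) {
    lines.push(`PersistentKeepalive = ${p.keepaliveSeconds}`);
  }
  return lines.join("\n") + "\n";
}
```

Keeping the rendering step on the client side means the backend never needs to know which platform it is talking to, only which tunnel parameters apply.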

The Latency Problem

Early testing showed average latency of 40–60ms in key gaming regions (Southeast Asia, Western Europe, East Coast US). The target was sub-20ms.

Three changes got us there:

1. Protocol selection. Switching the primary protocol from OpenVPN (TCP) to WireGuard significantly reduced handshake overhead. WireGuard runs over UDP, and its small codebase and modern cryptography (ChaCha20, Poly1305) are built for performance.

2. Node placement. We audited latency data from 10,000 real user sessions and repositioned 40% of nodes to better match actual traffic patterns. Singapore, Frankfurt, and Dallas ended up needing more capacity than the original plan assumed.

3. Cloudflare routing. Routing all client-to-node traffic through Cloudflare Anycast dramatically reduced hop count for users far from a node. This alone shaved 8–12ms off average latency in South Asia and Africa.
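
A latency audit like the one described under node placement might start by grouping session samples by region and flagging regions whose latency percentile misses the target. This is a sketch under assumptions: the `SessionSample` shape, the nearest-rank percentile method, and the choice of percentile are illustrative, not the analysis that was actually run.

```typescript
// Hypothetical latency sample taken from one user session.
interface SessionSample {
  region: string;
  rttMs: number;
}

// Nearest-rank percentile over an ascending-sorted array.
function percentile(sortedAsc: number[], p: number): number {
  if (sortedAsc.length === 0) throw new Error("percentile of empty sample set");
  const rank = Math.max(1, Math.ceil((p / 100) * sortedAsc.length));
  return sortedAsc[rank - 1];
}

// Return the regions whose p-th percentile latency exceeds the target,
// mapped to that percentile value in milliseconds.
function regionsOverTarget(
  samples: SessionSample[],
  targetMs: number = 20,
  p: number = 95
): Record<string, number> {
  const byRegion: Record<string, number[]> = {};
  for (const s of samples) {
    if (!byRegion[s.region]) byRegion[s.region] = [];
    byRegion[s.region].push(s.rttMs);
  }

  const over: Record<string, number> = {};
  for (const region of Object.keys(byRegion)) {
    const sorted = byRegion[region].slice().sort((a, b) => a - b);
    const value = percentile(sorted, p);
    if (value > targetMs) over[region] = value;
  }
  return over;
}
```

Using a percentile rather than the mean keeps a handful of very slow sessions from hiding, or exaggerating, how a region performs for most users.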

Automation with Ansible

With 500+ nodes, manual management is off the table. Every server operation — provisioning, patching, config updates, certificate rotation — runs through Ansible playbooks.

The playbook structure:

playbooks/
  provision.yml # Fresh node setup
  harden.yml # Security baseline
  deploy.yml # VPN daemon + management agent
  rotate-certs.yml # Certificate rotation
  health-check.yml # Node validation

Any engineer on the team can run ansible-playbook provision.yml -e "host=new-node-ip" and have a production node live in minutes. This was critical for scaling and for disaster recovery: if a node goes down, a replacement can be live within minutes.

Security Hardening

Every node goes through the harden.yml playbook before going live. Key measures:

  • SSH key-only authentication (password auth disabled)
  • Fail2ban for brute force protection
  • UFW firewall with a default-deny policy
  • Unattended security upgrades enabled
  • Root login disabled
  • Non-standard SSH port
  • Automatic certificate rotation via the control plane

The zero-logs policy is enforced architecturally, not just by policy. The VPN daemon is configured to write no connection logs, and because nodes run RAM-only, nothing persists across a reboot or power loss. Even if a node is physically seized, there is no disk state to recover.

CI/CD Pipeline

Deployments across 500+ nodes could be catastrophic if something breaks. The pipeline is built around staged rollouts:

  1. Build: Docker image built and pushed to private registry
  2. Test: Automated smoke tests against a staging node cluster
  3. Canary: Deploy to 5% of nodes, monitor error rates for 15 minutes
  4. Progressive rollout: 25% → 50% → 100% with automated health checks at each stage
  5. Rollback trigger: if error rate exceeds 2% at any stage, automatic rollback

This meant we could push updates to the entire fleet with confidence, and a failed deployment never reached beyond the 5% canary stage.
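
The staged rollout logic can be sketched as a loop over fleet fractions that stops at the first stage whose error rate exceeds the threshold. This is a simplified, synchronous illustration: `runRollout`, `observeErrorRate`, and `RolloutOutcome` are hypothetical names, and a real pipeline would deploy, wait out the monitoring window, and trigger the actual rollback at each step.

```typescript
// Fractions of the fleet targeted at each stage: canary, then progressive rollout.
const ROLLOUT_STAGES: number[] = [0.05, 0.25, 0.5, 1.0];

interface RolloutOutcome {
  completed: boolean;
  rolledBack: boolean;
  failedAt: number | null; // fraction at which the threshold was exceeded
  lastHealthyFraction: number; // largest fraction that passed its health check
}

// observeErrorRate is a stand-in for "deploy to this fraction of nodes,
// monitor, and report the observed error rate" (0..1).
function runRollout(
  observeErrorRate: (fraction: number) => number,
  stages: number[] = ROLLOUT_STAGES,
  maxErrorRate: number = 0.02
): RolloutOutcome {
  let lastHealthyFraction = 0;
  for (const fraction of stages) {
    const errorRate = observeErrorRate(fraction);
    if (errorRate > maxErrorRate) {
      // Threshold exceeded: stop here and signal a rollback.
      return { completed: false, rolledBack: true, failedAt: fraction, lastHealthyFraction };
    }
    lastHealthyFraction = fraction;
  }
  return { completed: true, rolledBack: false, failedAt: null, lastHealthyFraction };
}
```

For example, `runRollout((f) => (f >= 0.25 ? 0.05 : 0))` stops at the 25% stage and reports 0.05 as the last healthy fraction, so a bad build never reaches the rest of the fleet.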

What I’d Do Differently

Multi-region control plane from day one. The single control plane became a bottleneck during a DDoS event in month two. A geographically distributed control plane with active-active failover would have handled it cleanly. It’s on the roadmap now.

Observability earlier. We added Grafana dashboards mid-project. Next time, monitoring comes before the first node goes live — not after you’re wondering why latency spiked in Tokyo at 3am.

Mobile app architecture. The iOS and Android clients started as close ports of each other and gradually diverged. A shared React Native core would have saved significant time.

The Result

Apex VPN launched with:

  • 500+ nodes across 20+ countries
  • Average latency under 20ms in target regions
  • Zero production incidents in the first 90 days
  • Cross-platform clients on iOS, Android, Web, and Chrome

The client now runs a live subscription product serving users globally. The infrastructure handles traffic spikes without manual intervention, and new nodes can be provisioned in under 10 minutes.

If you’re building something similar — or if you have an infrastructure problem that needs solving — I’m available for new engagements.

📅 Book a call: calendly.com/sevenlabsolutions/30min

🌐 Website: sevenlabs.site

💻 GitHub: github.com/SevenLabSolutions

🔗 LinkedIn: linkedin.com/company/115781914

Seven Labs — AI Systems Engineer · Full Stack Developer · Infrastructure Specialist Founder, Seven Labs
