DEV Community

beefed.ai

Posted on • Originally published at beefed.ai

Onboarding Playbook: Connect New VPCs and Data Centers Fast

  • [Lock the Identity and Dependencies First — Preflight Checklist]
  • [Ship Network-as-Code: Templates, Modules, and CI/CD for Safe Provisioning]
  • [Prove Connectivity: Validation Tests and Security Gates That Prevent Surprises]
  • [Operational Handoff, SLAs, and Fast Rollbacks for Risk-Free Onboarding]
  • [A 30-Minute Execution Runbook: Step-by-Step Onboarding Protocol]

Most onboarding failures trace back to three avoidable sins: missing identity bindings, manual route/ACL edits, and no automated validation. Treat connectivity as a deployable product with code, tests, and documented escape hatches and you turn one-off frictions into repeatable workflows.

You have tight calendars, multiple accounts, and different toolchains for each cloud. Symptoms you already know: last-minute firewall openings, DNS that only resolves for one team, CIDR overlaps found after peering, or a half-day waiting period for a Direct Connect ticket. The result is blocked application releases, insecure temporary rules, and an exhausted on-call rotation that reverts changes in the wrong order.

Lock the Identity and Dependencies First — Preflight Checklist

Every successful connection starts with identity and a deterministic inventory.

  • Identity integrations first: Ensure the consuming account has a role/trust path to the platform account and confirm SSO/OIDC or SAML federation is in place for the team and for automation service principals. Follow your IdP trust model and map assume-role or service-principal flows to artifacts in the Network-as-Code (NaC) templates. No identity, no automated attachment.
  • IPAM and CIDR gating: Verify the target VPC/VNet CIDR against your central IPAM record to prevent overlaps and to assign a clear route tag and owner. Include cidr_block and owner as mandatory inputs in the module interface.
  • DNS readiness: Confirm that the zone delegation exists and that resolvers (e.g., central forwarders, Route 53 private hosted zones) have the required conditional forwarders or private zones configured so cross-VPC name resolution works immediately after routes are present. Cross-cloud DNS patterns are part of the onboarding contract.
  • Transport decision & capacity: Choose one of site-to-site VPN, Direct Connect/ExpressRoute/Partner Interconnect, or partner SD‑WAN based on throughput and SLA targets; record the required ASN, BGP prefixes, and VLAN/port requirements before provisioning. Use the following short comparison table.
| Connection type | Best for | Latency / Throughput | Typical provisioning time |
| --- | --- | --- | --- |
| Site-to-site VPN | Short-term links, backups, smaller bandwidth | Higher latency; up to a few Gbps with accelerated options | Minutes–hours. Software config is quick; external IP changes may be required. |
| Direct Connect / ExpressRoute / Interconnect | Predictable high-throughput, low-latency production | Lowest latency; large throughput (10–100 Gbps options) | Days–weeks (circuit provisioning and colo) |
| Partner SD‑WAN / Carrier | Branch or multi-cloud integration managed by a partner | Depends on partner; often high reliability | Hours–days (partner onboarding) |
  • Quota and limits check: Ensure the target account/region has available VPC/VNet, TGW/Virtual WAN, and route quota. Validate service limits via the provider API before applying.
  • Audit & logging targets: Confirm that flow logs, VPC/NSG logs, and network monitoring (NetFlow/CloudWatch/Log Analytics) are pre-authorized and have a destination. The onboarding ticket must include the logging bucket / workspace and retention policy.
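
The CIDR gate in the checklist above can be sketched as a small preflight function. This is a minimal illustration, not a real IPAM client: the `ipam_records` dictionary and the team names are hypothetical stand-ins for whatever your IPAM export looks like, and only Python's standard `ipaddress` module is used.

```python
import ipaddress

def find_overlaps(candidate: str, allocated: dict[str, str]) -> list[tuple[str, str]]:
    """Return (cidr, owner) pairs from the IPAM records that overlap candidate."""
    new_net = ipaddress.ip_network(candidate, strict=True)
    conflicts = []
    for cidr, owner in allocated.items():
        existing = ipaddress.ip_network(cidr, strict=True)
        # Only compare networks of the same address family (IPv4 vs IPv6).
        if existing.version == new_net.version and existing.overlaps(new_net):
            conflicts.append((cidr, owner))
    return conflicts

# Hypothetical IPAM export: allocated CIDR -> owning team.
ipam_records = {
    "10.0.0.0/16": "payments",
    "10.1.0.0/16": "search",
    "10.2.0.0/20": "platform",
}

print(find_overlaps("10.1.128.0/17", ipam_records))  # [('10.1.0.0/16', 'search')]
print(find_overlaps("10.3.0.0/16", ipam_records))    # []
```

A non-empty result would fail the preflight and name the owner to talk to, which is cheaper than discovering the overlap after peering.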

Important: Never open broad ingress/egress rules as a shortcut. Define minimal allowed ports and source CIDRs in the onboarding module, and use temporary ephemeral rules only when guarded by a short TTL and automated cleanup.
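
One way to make the "short TTL and automated cleanup" guard concrete is to record a creation time and TTL for every temporary rule and let a scheduled job select the expired ones. The sketch below covers only the selection logic; the `TemporaryRule` type and the rule IDs are hypothetical, and the actual revocation call depends on your cloud provider.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass(frozen=True)
class TemporaryRule:
    rule_id: str          # hypothetical identifier of the firewall rule
    created_at: datetime  # when the exception was opened (UTC)
    ttl: timedelta        # how long the exception may live

def expired_rules(rules: list[TemporaryRule], now: datetime) -> list[str]:
    """Return the IDs of temporary rules whose TTL has elapsed at `now`."""
    return [r.rule_id for r in rules if now >= r.created_at + r.ttl]

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
rules = [
    TemporaryRule("sgr-temp-1", now - timedelta(hours=3), timedelta(hours=2)),
    TemporaryRule("sgr-temp-2", now - timedelta(minutes=30), timedelta(hours=2)),
]

print(expired_rules(rules, now))  # ['sgr-temp-1']
```

Passing `now` in explicitly keeps the function deterministic and easy to test; the cleanup job would call it with the current UTC time and revoke whatever comes back.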

Ship Network-as-Code: Templates, Modules, and CI/CD for Safe Provisioning

Make the connection repeatable by turning it into code and packaging it as a composable module.

  • Module design patterns
    • Keep a single-purpose vpc_onboarding module that expects vpc_id, owner, desired_prefixes, and the transit hub ID (transit_gateway_id in the AWS example in this section). The module performs attachment, route association, route propagation configuration, and optional DNS registration.
    • Use small, versioned modules (semantic versioning) stored in a central registry so application teams pull tested artifacts, not ad-hoc snippets.
  • State and locking
    • Use a remote state backend with locking and versioning (Terraform Cloud, or S3 with either a DynamoDB lock table or, on Terraform 1.10 and later, the S3 backend's native lockfile) to avoid concurrent edits and to retain history for rollbacks.
  • Policy as code
    • Gate terraform apply with policy checks (tflint, tfsec, terrascan, or OPA/Sentinel) to enforce CIDR non-overlap, required tags, and allowed ports. Integrate policy checks into the PR pipeline.
  • CI/CD workflow
    • Enforce pull-request driven changes: plan runs on PR, apply is only allowed on main with an approved PR and a documented reviewer list. Use GitHub Actions, Atlantis, Spacelift, or Terraform Cloud for orchestrated runs. The CI job should:
      • terraform fmt and validate
      • Static checks (tflint, tfsec)
      • terraform plan with plan output stored and attached to PR
      • Automated pre-merge tests (see next section)
      • Human approval for apply on main branch
    • Example minimal GitHub Actions plan job:
name: Terraform Plan
on: [pull_request]
jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v2
        with:
          terraform_version: 1.6.0
      - run: terraform init -input=false
      - run: terraform fmt -check
      - run: terraform validate
      - run: terraform plan -input=false -out=tfplan
      - run: terraform show -json tfplan > tfplan.json
  • Example vpc_onboarding module (Terraform HCL)
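variable "vpc_id" { type = string }
variable "transit_gateway_id" { type = string }
variable "owner" { type = string }
# Subnets (one per Availability Zone) that host the attachment's network interfaces.
variable "subnet_ids" { type = list(string) }
# Transit gateway route table that this attachment is associated with.
variable "route_table_id" { type = string }

resource "aws_ec2_transit_gateway_vpc_attachment" "attach" {
  vpc_id             = var.vpc_id
  transit_gateway_id = var.transit_gateway_id
  subnet_ids         = var.subnet_ids
  tags               = { Owner = var.owner }
}

# Assumes default route table association is disabled on the transit gateway;
# otherwise this explicit association conflicts with the automatic one.
resource "aws_ec2_transit_gateway_route_table_association" "assoc" {
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_vpc_attachment.attach.id
  transit_gateway_route_table_id = var.route_table_id
}

output "attachment_id" { value = aws_ec2_transit_gateway_vpc_attachment.attach.id }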
  • Module consumers: Keep application-level config thin — pass only vpc_id, owner, and intent variables; the module enforces naming, security rules, and telemetry.
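
The policy gate described in this list can also be approximated with a short script over the plan JSON that `terraform show -json` writes. The sketch below is an illustration rather than a replacement for tfsec or OPA: the required tag set, the list of sensitive ports, and the sample plan are assumptions, and only the `resource_changes`, `type`, `address`, and `change.after` fields of the plan format are read.

```python
import json

REQUIRED_TAGS = {"Owner"}       # assumed tagging policy
TAGGED_TYPES = {"aws_ec2_transit_gateway_vpc_attachment"}
SENSITIVE_PORTS = {22, 3389}    # assumed set of sensitive ports

def plan_violations(plan: dict) -> list[str]:
    """Scan `terraform show -json` output for missing tags and open ingress."""
    found = []
    for rc in plan.get("resource_changes", []):
        after = rc.get("change", {}).get("after")
        if after is None:  # resource is being deleted; nothing to check
            continue
        if rc["type"] in TAGGED_TYPES:
            missing = REQUIRED_TAGS - set(after.get("tags") or {})
            if missing:
                found.append(f"{rc['address']}: missing tags {sorted(missing)}")
        if rc["type"] == "aws_security_group_rule" and after.get("type") == "ingress":
            lo, hi = after.get("from_port") or 0, after.get("to_port") or 0
            world_open = "0.0.0.0/0" in (after.get("cidr_blocks") or [])
            if world_open and any(lo <= p <= hi for p in SENSITIVE_PORTS):
                found.append(f"{rc['address']}: 0.0.0.0/0 ingress on ports {lo}-{hi}")
    return found

# Hand-written sample in the shape of a plan file; not real plan output.
sample_plan = json.loads("""
{"resource_changes": [
  {"address": "module.onboard.aws_ec2_transit_gateway_vpc_attachment.attach",
   "type": "aws_ec2_transit_gateway_vpc_attachment",
   "change": {"actions": ["create"], "after": {"tags": null}}},
  {"address": "aws_security_group_rule.ssh",
   "type": "aws_security_group_rule",
   "change": {"actions": ["create"],
              "after": {"type": "ingress", "from_port": 22, "to_port": 22,
                        "cidr_blocks": ["0.0.0.0/0"]}}}
]}
""")

for line in plan_violations(sample_plan):
    print(line)
```

In a pipeline the script would load `tfplan.json` instead of the inline sample and exit non-zero when the list is not empty, so the PR fails before any human review. Values that are only known after apply do not appear under `after`, so a check like this can miss them.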

Adopt automated testing of the IaC itself: unit-style linters and integration tests. Use Terratest for real-world integration tests that create temporary resources, run connectivity checks, and tear down.

Prove Connectivity: Validation Tests and Security Gates That Prevent Surprises

Testing must be part of the pipeline and the runtime checks must be automated.

  • Test categories
    • Static IaC checks: terraform validate, tflint, tfsec — fail PRs with policy violations.
    • Pre-apply simulation: Use static analyzers and vendor docs to verify BGP and route intents where possible.
    • Integration tests: Deploy ephemeral test artifacts (a small VM or container in each side and a health endpoint) and run network tests across the newly created attachment.
    • Behavioral tests: BGP adjacency, route propagation, and path validation using a control-plane-aware tool (for complex routing, use Batfish for configuration analysis).
  • Connectivity tests to automate
    • BGP adjacency check: confirm the expected neighbor is in the Established state and the expected prefixes are present.
    • Route table checks: ensure that the transit route table has propagated prefixes and that no overlapping or blackholed routes exist.
    • Application-level health: curl -sSf http://10.0.0.10/healthz or a simple TCP connect to the required port.
    • Throughput and path MTU: iperf3 for throughput and tracepath/mturoute for MTU checks.
  • Sample Terratest pattern (Go)
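package test

import (
  "net/http"
  "testing"
  "time"

  "github.com/gruntwork-io/terratest/modules/terraform"
  "github.com/stretchr/testify/assert"
  "github.com/stretchr/testify/require"
)

func TestOnboarding(t *testing.T) {
  t.Parallel()
  opts := &terraform.Options{TerraformDir: "../examples/vpc-onboarding"}
  // Tear down the temporary resources even if an assertion fails.
  defer terraform.Destroy(t, opts)
  terraform.InitAndApply(t, opts)

  // The test runner must have a network path to the health endpoint.
  client := &http.Client{Timeout: 10 * time.Second}
  resp, err := client.Get("http://10.0.0.10/healthz")
  // require stops the test on error, so a nil resp is never dereferenced.
  require.NoError(t, err)
  defer resp.Body.Close()
  assert.Equal(t, http.StatusOK, resp.StatusCode)
}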
  • Automated security validation
    • Verify security groups/network security rules are minimal and that no rule allows ingress from 0.0.0.0/0 to sensitive ports.
    • Run policy-as-code checks in CI and, after apply, run a runtime check that inspects cloud-native firewall state via API to confirm policy matches expected outputs.
  • Contrarian insight: Tests that run only after humans declare “ready” find problems too late. Shift left: run lightweight network verifications in PRs (simulated where possible) and full integration tests on a merge to a staging branch.
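
The route table checks listed above lend themselves to a small audit function. The sketch assumes route records shaped like the `Routes` entries returned by the AWS transit gateway route search (`DestinationCidrBlock` and `State`), assumes IPv4 prefixes only, and uses hand-written sample data rather than real API output.

```python
import ipaddress

def audit_routes(routes: list[dict], expected_prefixes: list[str]) -> dict:
    """Report missing, blackholed, and overlapping prefixes in a route table."""
    active = {r["DestinationCidrBlock"] for r in routes if r["State"] == "active"}
    blackholed = sorted(r["DestinationCidrBlock"] for r in routes
                        if r["State"] == "blackhole")
    missing = sorted(set(expected_prefixes) - active)
    nets = sorted(ipaddress.ip_network(cidr) for cidr in active)
    # Compare every pair once; the lists are small enough for a quadratic scan.
    overlapping = [(str(a), str(b))
                   for i, a in enumerate(nets)
                   for b in nets[i + 1:]
                   if a.overlaps(b)]
    return {"missing": missing, "blackholed": blackholed, "overlapping": overlapping}

# Hand-written sample records; a real run would pass the API response's routes.
routes = [
    {"DestinationCidrBlock": "10.0.0.0/16", "State": "active"},
    {"DestinationCidrBlock": "10.0.4.0/24", "State": "active"},
    {"DestinationCidrBlock": "10.9.0.0/16", "State": "blackhole"},
]

report = audit_routes(routes, expected_prefixes=["10.0.0.0/16", "10.1.0.0/16"])
print(report)
```

An overlap is not always an error, since a more specific prefix can be intentional, so a pipeline might fail on `missing` and `blackholed` and only flag `overlapping` for review.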

Operational Handoff, SLAs, and Fast Rollbacks for Risk-Free Onboarding

Onboarding ends when operations can support, measure, and recover the new connection.

  • Handoff artifacts
    • Documented runbook with owner, contact list, and a sequence diagram showing traffic flows and fallback paths.
    • Dashboard widgets: BGP status, transit hub throughput, per-attachment packets dropped, and DNS resolution success rate.
    • A single-page runbook that contains the terraform commit SHA and the exact module version used.
  • SLA & SLO mapping
    • Define SLOs for connectivity availability (e.g., 99.9% for production transit) and an error budget for onboarding-related incidents, then publish the on-call responsibilities and escalation steps. Use SLO design techniques from SRE practice to convert operational targets into measurable SLIs and SLOs.
  • Owner / Responsibility / SLA table
| Owner | Responsibility | SLA / Target |
| --- | --- | --- |
| Network Platform | Transit fabric, module maintenance, global routes | 99.95% backbone availability |
| App Team | VPC readiness, test artifacts | Connection-ready within target window |
| SRE/On-call | Monitoring, escalation | MTTR for connectivity incidents ≤ 60 minutes |
  • Rollback playbook (fast, deterministic)
    1. Identify the failing artifact (attachment ID, route table change, or security rule commit).
    2. Isolate traffic: disable propagation or disassociate the offending route table to stop further impact.
    3. Revert IaC to the last known good commit and apply in platform orchestration (this reverts route/attachment state). Example: merge the previous tag/commit and trigger terraform apply from CI.
    4. If immediate detachment is required, delete the attachment through the cloud API. AWS has no detach call for transit gateway VPC attachments, so the attachment is deleted and later recreated from code (example AWS CLI):
      • aws ec2 describe-transit-gateway-attachments --filters Name=resource-type,Values=vpc
      • aws ec2 delete-transit-gateway-vpc-attachment --transit-gateway-attachment-id tgw-attach-xxxx
    5. Validate traffic did not leak and then return to a controlled re-apply once corrections exist.
  • Role of incident post-mortems
    • Run a blameless post-incident review that includes the IaC diff, the test failures (if any), and the time-to-restore with concrete actions: tighten tests, adjust policies, or harden rollbacks.
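
To relate the availability targets and the MTTR target in this section, it helps to turn an SLO into minutes of allowed downtime. The arithmetic below assumes a 30-day window, which is a common convention but not something the targets above specify.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability SLO over a window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_percent / 100)

for slo in (99.9, 99.95):
    print(f"{slo}% over 30 days -> {error_budget_minutes(slo):.1f} minutes")
# 99.9% over 30 days -> 43.2 minutes
# 99.95% over 30 days -> 21.6 minutes
```

Under that assumption, a single connectivity incident that takes the full 60-minute MTTR target would exceed a 99.9% monthly budget on its own, which is a useful argument for keeping the rollback path fast and rehearsed.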

A 30-Minute Execution Runbook: Step-by-Step Onboarding Protocol

This protocol is the distilled, executable sequence you run when a team requests VPC/VNet onboarding. Times are realistic per-step estimates once your modules and pipelines exist; steps 1 to 4 fit in roughly 30 minutes, and the monitoring window in step 5 follows them.

  1. Preflight (0–10 minutes)
    • Confirm vpc_id, owner, and CIDR in IPAM.
    • Confirm identity: role/account trust exists and a platform service principal is available.
    • Verify quotas and logging destinations exist.
  2. Provision via NaC (5–12 minutes)
    • Create a PR referencing the vpc_onboarding module with required variables.
    • CI runs terraform plan, tflint, tfsec. Wait for green.
    • Merge PR to main after required approvals.
  3. Apply and immediate smoke tests (5–8 minutes)
    • CI apply creates the TGW/VWAN attachment and updates route tables.
    • Run automated integration checks:
      • aws ec2 describe-transit-gateway-attachments --filters Name=resource-id,Values=<vpc-id>
      • Run curl to internal health endpoint and iperf3 throughput test to a staged tester host.
  4. Final validation and handoff (2–5 minutes)
    • Confirm logs appear in central analytics and DNS resolution passes.
    • Update the runbook with the final attachment ID, commit SHA, and timestamps.
  5. Post-onboard monitoring window (15–60 minutes)
    • Keep an elevated watch for the whole window for packet loss, BGP flaps, or rejected flows.
    • If an unrecoverable issue occurs, follow the Fast Rollback playbook above.
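
The "simple TCP connect" smoke test from step 3 can be sketched with the standard library alone. The function name and timeout are illustrative choices; the demo runs against a throwaway local listener so it works anywhere, whereas a real check would target the application's host and port across the new attachment.

```python
import socket

def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False

# Demo against a temporary listener on the loopback interface.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))   # port 0 lets the OS pick a free port
listener.listen(1)
demo_port = listener.getsockname()[1]

print(tcp_reachable("127.0.0.1", demo_port))  # listener is up
listener.close()
print(tcp_reachable("127.0.0.1", demo_port))  # listener is gone
```

A connect check only proves that routing and firewall rules allow the handshake; the curl health check in step 3 is still needed to show that the application behind the port answers correctly.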

Sample quick iperf3 run (server in VPC B, client in a test container in VPC A):

# server (VPC B)
iperf3 -s -D

# client (VPC A)
iperf3 -c 10.10.0.5 -t 30 -J > iperf-result.json
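
The JSON that the client run writes can be turned into a pass/fail gate. The sketch below assumes a TCP test, where iperf3 reports receiver throughput under `end.sum_received.bits_per_second` and puts a top-level `error` key in the output when the run fails; the 1 Gbps target and the trimmed sample result are made up for illustration.

```python
import json

def throughput_gbps(result: dict) -> float:
    """Receiver-side throughput of an iperf3 TCP run, in Gbit/s."""
    if "error" in result:
        raise RuntimeError(f"iperf3 run failed: {result['error']}")
    return result["end"]["sum_received"]["bits_per_second"] / 1e9

def meets_target(result: dict, min_gbps: float) -> bool:
    """True when measured throughput reaches the agreed minimum."""
    return throughput_gbps(result) >= min_gbps

# Trimmed, hand-written stand-in for the contents of iperf-result.json.
sample_result = json.loads("""
{"end": {"sum_sent": {"bits_per_second": 4250000000.0},
         "sum_received": {"bits_per_second": 4200000000.0}}}
""")

print(throughput_gbps(sample_result))                # 4.2
print(meets_target(sample_result, min_gbps=1.0))     # True
```

In the runbook this would load the real file with `json.load` and compare against the throughput recorded in the onboarding ticket, failing the smoke test when the link underperforms.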

Operational tip: Version your onboarding modules and lock the exact module SHA in the onboarding PR so the handoff includes the exact code that was applied.

