Onboarding Playbook: Connect New VPCs and Data Centers Fast

[Lock the Identity and Dependencies First — Preflight Checklist]
[Ship Network-as-Code: Templates, Modules, and CI/CD for Safe Provisioning]
[Prove Connectivity: Validation Tests and Security Gates That Prevent Surprises]
[Operational Handoff, SLAs, and Fast Rollbacks for Risk-Free Onboarding]
[A 30-Minute Execution Runbook: Step-by-Step Onboarding Protocol]

Most onboarding failures trace back to three avoidable sins: missing identity bindings, manual route/ACL edits, and no automated validation. Treat connectivity as a deployable product with code, tests, and documented escape hatches, and you turn one-off frictions into repeatable workflows.

You have tight calendars, multiple accounts, and different toolchains for each cloud. Symptoms you already know: last-minute firewall openings, DNS that only resolves for one team, CIDR overlaps found after peering, or a half-day waiting period for a Direct Connect ticket. The result is blocked application releases, insecure temporary rules, and an exhausted on-call rotation that reverts changes in the wrong order.

Lock the Identity and Dependencies First — Preflight Checklist

Every successful connection starts with identity and a deterministic inventory.

Identity integrations first: Ensure the consuming account has a role/trust path to the platform account and confirm SSO/OIDC or SAML federation is in place for the team and for automation service principals. Follow your IdP trust model and map assume-role or service-principal flows to artifacts in the NaC templates. No identity, no automated attachment.
IPAM and CIDR gating: Verify the target VPC/VNet CIDR against your central IPAM record to prevent overlaps and to assign a clear route tag and owner. Include cidr_block and owner as mandatory inputs in the module interface.
DNS readiness: Confirm that the zone delegation exists and that resolvers (e.g., central forwarders, Route 53 private hosted zones) have the required conditional forwarders or private zones configured so cross-VPC name resolution works immediately after routes are present. Cross-cloud DNS patterns are part of the onboarding contract.
Transport decision & capacity: Choose one of site-to-site VPN, Direct Connect/ExpressRoute/Partner Interconnect, or partner SD‑WAN based on throughput and SLA targets; record the required ASN, BGP prefixes, and VLAN/port requirements before provisioning. Use the following short comparison table.

Connection type	Best for	Latency / Throughput	Typical provisioning time
Site-to-site VPN	Short-term, backups, smaller bandwidth	Higher latency, up to a few Gbps with accelerated options	Minutes–hours. Software config is quick; external IP changes may be required.
Direct Connect / ExpressRoute / Interconnect	Predictable high-throughput, low-latency production	Lowest latency, large throughput (10–100 Gbps options)	Days–weeks (circuit provisioning & colo)
Partner SD‑WAN / Carrier	Branch or multi-cloud integration managed by partner	Depends on partner; often high reliability	Hours–days (partner onboarding)

Quota and limits check: Ensure the target account/region has available VPC/VNet, TGW/Virtual WAN, and route quota. Validate service limits via the provider API before applying.
Audit & logging targets: Confirm that flow logs, VPC/NSG logs, and network monitoring (NetFlow/CloudWatch/Log Analytics) are pre-authorized and have a destination. The onboarding ticket must include the logging bucket / workspace and retention policy.
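The CIDR gate in the checklist above can be sketched in a few lines. This is a minimal illustration, assuming the IPAM record is available as a plain list of allocated CIDR strings; the `find_overlaps` helper and the sample allocations are hypothetical, and a real check would query your IPAM API instead.

```python
import ipaddress

def find_overlaps(candidate, allocated):
    """Return the allocated CIDRs that overlap the candidate CIDR."""
    # strict=True rejects inputs with host bits set, e.g. "10.1.1.0/16"
    new_net = ipaddress.ip_network(candidate, strict=True)
    return [
        cidr for cidr in allocated
        if ipaddress.ip_network(cidr).overlaps(new_net)
    ]

# Hypothetical IPAM export (assumption, not real data)
allocated = ["10.0.0.0/16", "10.1.0.0/16", "172.16.0.0/20"]

print(find_overlaps("10.1.128.0/17", allocated))  # overlaps an existing block
print(find_overlaps("10.2.0.0/16", allocated))    # free range, empty list
```

Running a check like this in the PR pipeline, before any peering or attachment exists, is what moves overlap discovery from "after peering" to "before merge".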

Important: Never open broad ingress/egress rules as a shortcut. Define minimal allowed ports and source CIDRs in the onboarding module, and use temporary ephemeral rules only when guarded by a short TTL and automated cleanup.
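One way to make "short TTL and automated cleanup" concrete is a scheduled job that lists which temporary rules are due for removal. The sketch below is an assumption-heavy illustration: the `rule_id`, `temporary`, and `expires_at` fields are a hypothetical record format (for example, derived from tags on the rule), not a cloud provider schema.

```python
from datetime import datetime, timezone

def rules_to_revoke(rules, now):
    """Return IDs of temporary rules whose TTL has passed or was never set."""
    revoke = []
    for rule in rules:
        if not rule.get("temporary"):
            continue  # permanent rules are governed by the module, not by TTL
        expires_at = rule.get("expires_at")
        # A temporary rule without an expiry is treated as already expired
        if expires_at is None or datetime.fromisoformat(expires_at) <= now:
            revoke.append(rule["rule_id"])
    return revoke

# Hypothetical inventory of rules (assumption)
rules = [
    {"rule_id": "rule-a", "temporary": True, "expires_at": "2024-01-01T11:00:00+00:00"},
    {"rule_id": "rule-b", "temporary": True, "expires_at": "2024-01-01T13:00:00+00:00"},
    {"rule_id": "rule-c", "temporary": True, "expires_at": None},
    {"rule_id": "rule-d", "temporary": False},
]

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(rules_to_revoke(rules, now))
```

Treating a missing expiry as expired is a deliberate design choice here: it keeps an unguarded "temporary" rule from quietly becoming permanent.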

Ship Network-as-Code: Templates, Modules, and CI/CD for Safe Provisioning

Make the connection repeatable by turning it into code and packaging it as a composable module.

Module design patterns
- Keep a single-purpose vpc_onboarding module that expects vpc_id, owner, desired_prefixes, and a transit hub ID (transit_gateway_id on AWS). The module performs attachment, route association, route propagation configuration, and optional DNS registration.
- Use small, versioned modules (semantic versioning) stored in a central registry so application teams pull tested artifacts, not ad-hoc snippets.
State and locking
- Use a remote state backend with locking and versioning (for example Terraform Cloud, or an S3 bucket with versioning enabled plus native S3 lockfile or DynamoDB locking) to avoid concurrent edits and to retain history for rollbacks.
Policy as code
- Gate terraform apply with policy checks (tflint, tfsec, terrascan, or OPA/Sentinel) to enforce CIDR non-overlap, required tags, and allowed ports. Integrate policy checks into the PR pipeline.
CI/CD workflow
- Enforce pull-request driven changes: plan runs on PR, apply is only allowed on main with an approved PR and a documented reviewer list. Use GitHub Actions, Atlantis, Spacelift, or Terraform Cloud for orchestrated runs. The CI job should run:
  - terraform fmt and validate
  - Static checks (tflint, tfsec)
  - terraform plan with plan output stored and attached to PR
  - Automated pre-merge tests (see next section)
  - Human approval for apply on main branch
- Example minimal GitHub Actions plan job:

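name: Terraform Plan
on: [pull_request]
jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Cloud credentials (for example via OIDC) are assumed to be configured before init.
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.6.0
          # Disable the wrapper so redirected JSON output is not polluted.
          terraform_wrapper: false
      - run: terraform init -input=false
      # Fail the PR on unformatted or invalid configuration.
      - run: terraform fmt -check
      - run: terraform validate
      # Write the plan to a file so it can be rendered and reviewed.
      - run: terraform plan -input=false -out=tfplan
      # Export a JSON copy for policy checks and PR review.
      - run: terraform show -json tfplan > tfplan.json
      - uses: actions/upload-artifact@v4
        with:
          name: tfplan
          path: tfplan.json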
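
The tfplan.json file produced by the job above is a convenient input for lightweight policy checks. The following is a minimal sketch, not a replacement for OPA/Sentinel: it assumes the plan JSON's `resource_changes` layout, a required `Owner` tag, and security group rules that expose `type` and `cidr_blocks` attributes. The `plan_violations` helper and the inline sample plan are hypothetical.

```python
import json

REQUIRED_TAGS = {"Owner"}  # assumption: mirrors the Owner tag set by the module

def plan_violations(plan):
    """Return human-readable policy violations found in a Terraform plan dict."""
    violations = []
    for rc in plan.get("resource_changes", []):
        change = rc.get("change", {})
        if change.get("actions") == ["delete"]:
            continue  # nothing to validate on pure deletions
        after = change.get("after") or {}
        # Required tags, checked only on resources that expose a tags attribute
        if "tags" in after:
            missing = REQUIRED_TAGS - set(after["tags"] or {})
            if missing:
                violations.append(f"{rc['address']}: missing tags {sorted(missing)}")
        # World-open ingress on standalone security group rules
        if after.get("type") == "ingress" and "0.0.0.0/0" in (after.get("cidr_blocks") or []):
            violations.append(f"{rc['address']}: ingress open to 0.0.0.0/0")
    return violations

# Inline sample shaped like `terraform show -json` output; in CI you would
# load the real file with json.load(open("tfplan.json")).
sample_plan = json.loads("""
{"resource_changes": [
  {"address": "aws_ec2_transit_gateway_vpc_attachment.attach",
   "change": {"actions": ["create"], "after": {"tags": {"Owner": "team-a"}}}},
  {"address": "aws_security_group_rule.temp",
   "change": {"actions": ["create"],
              "after": {"type": "ingress", "cidr_blocks": ["0.0.0.0/0"]}}},
  {"address": "aws_vpc.legacy",
   "change": {"actions": ["create"], "after": {"tags": null}}}
]}
""")

for violation in plan_violations(sample_plan):
    print(violation)
```

A CI step could exit non-zero whenever the returned list is non-empty, which fails the PR before any apply is possible.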

Example vpc_onboarding module (Terraform HCL)

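variable "vpc_id" { type = string }
variable "transit_gateway_id" { type = string }
variable "owner" { type = string }
# Declared here because the resources below reference them.
variable "subnet_ids" { type = list(string) }
variable "route_table_id" { type = string }

resource "aws_ec2_transit_gateway_vpc_attachment" "attach" {
  vpc_id             = var.vpc_id
  transit_gateway_id = var.transit_gateway_id
  subnet_ids         = var.subnet_ids
  # Assumption: the hub uses a dedicated route table and the attachment is created
  # in the hub's account, so default association is disabled to avoid a conflict
  # with the explicit association below.
  transit_gateway_default_route_table_association = false
  tags = { Owner = var.owner }
}

resource "aws_ec2_transit_gateway_route_table_association" "assoc" {
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_vpc_attachment.attach.id
  transit_gateway_route_table_id = var.route_table_id
}

output "attachment_id" { value = aws_ec2_transit_gateway_vpc_attachment.attach.id }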

Module consumers: Keep application-level config thin — pass only vpc_id, owner, and intent variables; the module enforces naming, security rules, and telemetry.

Adopt automated testing of the IaC itself: unit-style linters and integration tests. Use Terratest for real-world integration tests that create temporary resources, run connectivity checks, and tear down.

Prove Connectivity: Validation Tests and Security Gates That Prevent Surprises

Testing must be part of the pipeline and the runtime checks must be automated.

Test categories
- Static IaC checks: terraform validate, tflint, tfsec — fail PRs with policy violations.
- Pre-apply simulation: Use static analyzers and vendor docs to verify BGP and route intents where possible.
- Integration tests: Deploy ephemeral test artifacts (a small VM or container in each side and a health endpoint) and run network tests across the newly created attachment.
- Behavioral tests: BGP adjacency, route propagation, and path validation using a control-plane-aware tool (for complex routing, use Batfish for configuration analysis).
Connectivity tests to automate
- BGP adjacency check: confirm the expected neighbor in the Established state and the expected prefixes are present.
- Route table checks: ensure that the transit route table has propagated prefixes and that no overlapping or blackholed routes exist.
- Application-level health: curl -sSf http://10.0.0.10/healthz or a simple TCP connect to the required port.
- Throughput and path MTU: iperf3 for throughput and tracepath/mturoute for MTU checks.
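The route table check in the list above can be automated against the output of a route search. This is a sketch under assumptions: the route records follow the `DestinationCidrBlock` and `State` fields returned by `aws ec2 search-transit-gateway-routes`, the sample data is invented, and overlaps are reported for review because a more-specific route can be intentional.

```python
import ipaddress

def route_problems(routes, expected_prefixes):
    """Report missing, blackholed, and overlapping routes in a transit route table."""
    problems = []
    active = {r["DestinationCidrBlock"] for r in routes if r.get("State") == "active"}
    for prefix in expected_prefixes:
        if prefix not in active:
            problems.append(f"missing active route for {prefix}")
    for route in routes:
        if route.get("State") == "blackhole":
            problems.append(f"blackhole route {route['DestinationCidrBlock']}")
    # Pairwise overlap check across active prefixes
    nets = sorted(ipaddress.ip_network(p) for p in active)
    for i, first in enumerate(nets):
        for second in nets[i + 1:]:
            if first.overlaps(second):
                problems.append(f"overlapping routes {first} and {second}")
    return problems

# Hypothetical route search result (assumption)
routes = [
    {"DestinationCidrBlock": "10.0.0.0/16", "State": "active", "Type": "propagated"},
    {"DestinationCidrBlock": "10.1.0.0/16", "State": "blackhole", "Type": "static"},
    {"DestinationCidrBlock": "10.0.1.0/24", "State": "active", "Type": "static"},
]

for problem in route_problems(routes, ["10.0.0.0/16", "10.2.0.0/16"]):
    print(problem)
```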
Sample Terratest pattern (Go)

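package test

import (
  "net/http"
  "testing"
  "time"

  "github.com/gruntwork-io/terratest/modules/terraform"
  "github.com/stretchr/testify/assert"
  "github.com/stretchr/testify/require"
)

func TestOnboarding(t *testing.T) {
  t.Parallel()
  opts := &terraform.Options{TerraformDir: "../examples/vpc-onboarding"}
  // Always tear down the temporary resources, even if an assertion fails.
  defer terraform.Destroy(t, opts)
  terraform.InitAndApply(t, opts)

  // The health endpoint is reachable only if the test runner has a network path
  // into the onboarded VPC (for example, a runner hosted in a connected VPC).
  client := &http.Client{Timeout: 10 * time.Second}
  resp, err := client.Get("http://10.0.0.10/healthz")
  // require stops the test on error, so resp is never dereferenced while nil.
  require.NoError(t, err)
  defer resp.Body.Close()
  assert.Equal(t, http.StatusOK, resp.StatusCode)
}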

Automated security validation
- Verify security groups/network security rules are minimal and that no rule allows ingress from 0.0.0.0/0 to sensitive ports.
- Run policy-as-code checks in CI and, after apply, run a runtime check that inspects cloud-native firewall state via API to confirm policy matches expected outputs.
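The post-apply runtime check can be sketched as a function over the firewall state returned by the cloud API. The example below assumes records shaped like the `SecurityGroups` list from `aws ec2 describe-security-groups`; the list of sensitive ports and the sample groups are assumptions chosen for illustration.

```python
SENSITIVE_PORTS = {22, 3389, 5432}  # assumption: adjust to your own policy

def open_sensitive_ingress(security_groups, sensitive_ports=SENSITIVE_PORTS):
    """Return (group_id, port) pairs where a sensitive port is open to the world."""
    findings = []
    for sg in security_groups:
        for perm in sg.get("IpPermissions", []):
            world = any(r.get("CidrIp") == "0.0.0.0/0" for r in perm.get("IpRanges", []))
            world = world or any(
                r.get("CidrIpv6") == "::/0" for r in perm.get("Ipv6Ranges", [])
            )
            if not world:
                continue
            protocol = perm.get("IpProtocol")
            if protocol == "-1":
                exposed = sorted(sensitive_ports)  # all protocols, all ports
            elif protocol in ("tcp", "udp"):
                low, high = perm.get("FromPort"), perm.get("ToPort")
                if low is None or high is None:
                    continue
                exposed = sorted(p for p in sensitive_ports if low <= p <= high)
            else:
                continue  # e.g. ICMP, where the port fields carry type/code
            findings.extend((sg["GroupId"], port) for port in exposed)
    return findings

# Hypothetical API response fragment (assumption)
groups = [
    {"GroupId": "sg-1", "IpPermissions": [
        {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}]},
    {"GroupId": "sg-2", "IpPermissions": [
        {"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}]},
    {"GroupId": "sg-3", "IpPermissions": [
        {"IpProtocol": "-1", "IpRanges": [{"CidrIp": "10.0.0.0/8"}]}]},
]

print(open_sensitive_ingress(groups))
```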
Contrarian insight: Tests that run only after humans declare “ready” find problems too late. Shift left: run lightweight network verifications in PRs (simulated where possible) and full integration tests on a merge to a staging branch.

Operational Handoff, SLAs, and Fast Rollbacks for Risk-Free Onboarding

Onboarding ends when operations can support, measure, and recover the new connection.

Handoff artifacts
- Documented runbook with owner, contact list, and a sequence diagram showing traffic flows and fallback paths.
- Dashboard widgets: BGP status, transit hub throughput, per-attachment packets dropped, and DNS resolution success rate.
- A single-page runbook that contains the terraform commit SHA and the exact module version used.
SLA & SLO mapping
- Define SLOs for connectivity availability (e.g., 99.9% for production transit) and an error budget for onboarding-related incidents, then publish the on-call responsibilities and escalation steps. Use SLO design techniques from SRE practice to convert operational targets into measurable SLIs and SLOs.
Owner / Responsibility / SLA table

Owner	Responsibility	SLA / Target
Network Platform	Transit fabric, module maintenance, global routes	99.95% backbone availability
App Team	VPC readiness, test artifacts	Connection-ready within target window
SRE/On-call	Monitoring, escalation	MTTR for connectivity incidents ≤ 60 minutes
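
To see what the availability targets above imply, it helps to convert them into an error budget. The arithmetic below is a small sketch that assumes a 30-day measurement window; the window length and the `error_budget_minutes` helper are assumptions, not part of any SLO standard.

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Minutes of allowed unavailability for an availability SLO over a window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_percent) / 100

for slo in (99.9, 99.95):
    budget = error_budget_minutes(slo)
    print(f"{slo}% over 30 days allows about {budget:.1f} minutes of downtime")
```

Under these assumptions, 99.9% leaves roughly 43.2 minutes per 30 days and 99.95% leaves roughly 21.6 minutes, so a single incident that runs to the 60-minute MTTR target would exhaust either budget on its own. That gap is worth resolving when you publish the table.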

Rollback playbook (fast, deterministic)
1. Identify the failing artifact (attachment ID, route table change, or security rule commit).
2. Isolate traffic: disable propagation or disassociate the offending route table to stop further impact.
3. Revert IaC to the last known good commit and apply in platform orchestration (this reverts route/attachment state). Example: open a revert PR back to the previous tag/commit, merge it, and trigger terraform apply from CI.
4. If immediate detachment is required, use the cloud API to remove the attachment. On AWS this is a delete operation, so reconcile Terraform state afterwards (example AWS CLI):
  - aws ec2 describe-transit-gateway-attachments --filters Name=resource-type,Values=vpc
  - aws ec2 delete-transit-gateway-vpc-attachment --transit-gateway-attachment-id tgw-attach-xxxx
5. Validate traffic did not leak and then return to a controlled re-apply once corrections exist.
Role of incident post-mortems
- Run a blameless post-incident review that includes the IaC diff, the test failures (if any), and the time-to-restore with concrete actions: tighten tests, adjust policies, or harden rollbacks.

A 30-Minute Execution Runbook: Step-by-Step Onboarding Protocol

This protocol is the distilled, executable sequence you run when a team requests VPC/VNet onboarding. Times are realistic estimates once your modules and pipelines exist.

Preflight (0–10 minutes)
- Confirm vpc_id, owner, and CIDR in IPAM.
- Confirm identity: role/account trust exists and a platform service principal is available.
- Verify quotas and logging destinations exist.
Provision via NaC (5–12 minutes)
- Create a PR referencing the vpc_onboarding module with required variables.
- CI runs terraform plan, tflint, tfsec. Wait for green.
- Merge PR to main after required approvals.
Apply and immediate smoke tests (5–8 minutes)
- CI apply creates the TGW/VWAN attachment and updates route tables.
- Run automated integration checks:
  - aws ec2 describe-transit-gateway-attachments --filters Name=resource-id,Values=<vpc-id>
  - Run curl to internal health endpoint and iperf3 throughput test to a staged tester host.
Final validation and handoff (2–5 minutes)
- Confirm logs appear in central analytics and DNS resolution passes.
- Update the runbook with the final attachment ID, commit SHA, and timestamps.
Post-onboard monitoring window (15–60 minutes)
- Keep an elevated watch for the full monitoring window for packet loss, BGP flaps, or rejected flows.
- If an unrecoverable issue occurs, follow the Fast Rollback playbook above.
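
The attachment check in the smoke-test step can be turned into a pass/fail gate by parsing the CLI's JSON output. This sketch assumes the `TransitGatewayAttachments` layout returned by `aws ec2 describe-transit-gateway-attachments`; the `attachment_status` helper, its status labels, and the sample IDs are hypothetical.

```python
def attachment_status(describe_output, vpc_id):
    """Classify the transit attachment for a VPC as ready, not-ready, missing, or duplicate."""
    matches = [
        a for a in describe_output.get("TransitGatewayAttachments", [])
        if a.get("ResourceType") == "vpc" and a.get("ResourceId") == vpc_id
    ]
    # Ignore attachments that are already gone or on their way out
    live = [a for a in matches if a.get("State") not in ("deleted", "deleting")]
    if not live:
        return ("missing", None)
    if len(live) > 1:
        return ("duplicate", [a["TransitGatewayAttachmentId"] for a in live])
    attachment = live[0]
    status = "ready" if attachment.get("State") == "available" else "not-ready"
    return (status, attachment["TransitGatewayAttachmentId"])

# Hypothetical CLI output parsed with json.loads (assumption)
output = {"TransitGatewayAttachments": [
    {"TransitGatewayAttachmentId": "tgw-attach-old", "ResourceType": "vpc",
     "ResourceId": "vpc-123", "State": "deleted"},
    {"TransitGatewayAttachmentId": "tgw-attach-new", "ResourceType": "vpc",
     "ResourceId": "vpc-123", "State": "available"},
]}

print(attachment_status(output, "vpc-123"))
print(attachment_status(output, "vpc-999"))
```

The returned attachment ID is the same value the handoff step records in the runbook.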

Sample quick iperf3 run (server in VPC B, client in a test container in VPC A):

# server (VPC B)
iperf3 -s -D

# client (VPC A)
iperf3 -c 10.10.0.5 -t 30 -J > iperf-result.json
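
The JSON written by the client run above can gate the pipeline on a minimum throughput. This is a rough sketch assuming a TCP test, where iperf3 reports receiver throughput under `end.sum_received.bits_per_second` and failures under a top-level `error` key; the 1 Gbps threshold, the `throughput_ok` helper, and the sample result are assumptions.

```python
def throughput_ok(result, min_gbps):
    """Return (passed, detail) for a parsed iperf3 -J result."""
    if "error" in result:
        return (False, result["error"])
    received_bps = result["end"]["sum_received"]["bits_per_second"]
    gbps = received_bps / 1e9
    return (gbps >= min_gbps, f"{gbps:.2f} Gbps received")

# Trimmed, hypothetical result; in CI you would load iperf-result.json
# with json.load instead.
sample = {"end": {
    "sum_sent": {"bits_per_second": 4.3e9},
    "sum_received": {"bits_per_second": 4.2e9},
}}

print(throughput_ok(sample, min_gbps=1.0))
print(throughput_ok({"error": "unable to connect to server"}, min_gbps=1.0))
```

Checking the receiver-side figure is the more conservative choice, since it reflects what actually arrived across the attachment.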

Operational tip: Version your onboarding modules and lock the exact module SHA in the onboarding PR so the handoff includes the exact code that was applied.
