- [Lock the Identity and Dependencies First — Preflight Checklist]
- [Ship Network-as-Code: Templates, Modules, and CI/CD for Safe Provisioning]
- [Prove Connectivity: Validation Tests and Security Gates That Prevent Surprises]
- [Operational Handoff, SLAs, and Fast Rollbacks for Risk-Free Onboarding]
- [A 30-Minute Execution Runbook: Step-by-Step Onboarding Protocol]
Most onboarding failures trace back to three avoidable sins: missing identity bindings, manual route/ACL edits, and no automated validation. Treat connectivity as a deployable product with code, tests, and documented escape hatches and you turn one-off frictions into repeatable workflows.
You have tight calendars, multiple accounts, and different toolchains for each cloud. Symptoms you already know: last-minute firewall openings, DNS that only resolves for one team, CIDR overlaps found after peering, or a half-day waiting period for a Direct Connect ticket. The result is blocked application releases, insecure temporary rules, and an exhausted on-call rotation that reverts changes in the wrong order.
Lock the Identity and Dependencies First — Preflight Checklist
Every successful connection starts with identity and a deterministic inventory.
- Identity integrations first: Ensure the consuming account has a role/trust path to the platform account and confirm SSO/OIDC or SAML federation is in place for the team and for automation service principals. Follow your IdP trust model and map `assume-role` or `service-principal` flows to artifacts in the NaC templates. No identity, no automated attachment.
- IPAM and CIDR gating: Verify the target VPC/VNet CIDR against your central IPAM record to prevent overlaps and to assign a clear route tag and owner. Include `cidr_block` and `owner` as mandatory inputs in the module interface.
- DNS readiness: Confirm that the zone delegation exists and that resolvers (e.g., central forwarders, Route 53 private hosted zones) have the required conditional forwarders or private zones configured so cross-VPC name resolution works immediately after routes are present. Cross-cloud DNS patterns are part of the onboarding contract.
- Transport decision & capacity: Choose one of `site-to-site VPN`, `Direct Connect/ExpressRoute/Partner Interconnect`, or partner SD‑WAN based on throughput and SLA targets; record the required ASN, BGP prefixes, and VLAN/port requirements before provisioning. Use the following short comparison table.
| Connection type | Best for | Latency / Throughput | Typical provisioning time |
|---|---|---|---|
| Site-to-site VPN | Short-term, backups, smaller bandwidth | Higher latency, up to a few Gbps with accelerated options | Minutes–hours. Software config is quick; external IP changes may be required. |
| Direct Connect / ExpressRoute / Interconnect | Predictable high-throughput, low-latency production | Lowest latency, large throughput (10–100Gbps options) | Days–weeks (circuit provisioning & colo) |
| Partner SD‑WAN / Carrier | Branch or multi-cloud integration managed by partner | Depends on partner; often high reliability | Hours–days (partner onboarding) |
- Quota and limits check: Ensure the target account/region has available VPC/VNet, TGW/Virtual WAN, and route quota. Validate service limits via the provider API before applying.
- Audit & logging targets: Confirm that flow logs, VPC/NSG logs, and network monitoring (NetFlow/CloudWatch/Log Analytics) are pre-authorized and have a destination. The onboarding ticket must include the logging bucket / workspace and retention policy.
Important: Never open broad ingress/egress rules as a shortcut. Define minimal allowed ports and source CIDRs in the onboarding module, and use temporary rules only when guarded by a short TTL and automated cleanup.
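The CIDR gating step in the checklist above can be automated before any attachment is attempted. The following is a minimal sketch, assuming the IPAM records can be exported as a list of CIDR strings; the function name `firstOverlap` and the sample prefixes are hypothetical and only illustrate the check.

```go
package main

import (
	"fmt"
	"net/netip"
)

// firstOverlap returns the first allocated prefix that overlaps the candidate.
// ok is false when the candidate does not collide with anything in allocated.
// allocated stands in for an export from the central IPAM (an assumption).
func firstOverlap(candidate string, allocated []string) (hit string, ok bool, err error) {
	c, err := netip.ParsePrefix(candidate)
	if err != nil {
		return "", false, fmt.Errorf("candidate %q: %w", candidate, err)
	}
	for _, a := range allocated {
		p, err := netip.ParsePrefix(a)
		if err != nil {
			return "", false, fmt.Errorf("allocated %q: %w", a, err)
		}
		if c.Overlaps(p) {
			return a, true, nil
		}
	}
	return "", false, nil
}

func main() {
	// Hypothetical IPAM records and onboarding requests.
	allocated := []string{"10.0.0.0/16", "10.20.0.0/16"}
	for _, cidr := range []string{"10.0.128.0/20", "10.30.0.0/16"} {
		hit, ok, err := firstOverlap(cidr, allocated)
		switch {
		case err != nil:
			fmt.Println("error:", err)
		case ok:
			fmt.Printf("%s overlaps %s: reject the request\n", cidr, hit)
		default:
			fmt.Printf("%s is free\n", cidr)
		}
	}
}
```

Running a check like this in the PR pipeline, rather than after peering, is what turns an overlap from an incident into a failed build.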
Ship Network-as-Code: Templates, Modules, and CI/CD for Safe Provisioning
Make the connection repeatable by turning it into code and packaging it as a composable module.
- Module design patterns
  - Keep a single-purpose `vpc_onboarding` module that expects `vpc_id`, `owner_tag`, `desired_prefixes`, and `transit_hub_id`. The module performs attachment, route association, route propagation configuration, and optional DNS registration.
  - Use small, versioned modules (semantic versioning) stored in a central registry so application teams pull tested artifacts, not ad-hoc snippets.
- State and locking
  - Use a remote state backend with locking and versioning (Terraform Cloud, S3 with native S3 locking or a remote backend) to avoid concurrent edits and to retain history for rollbacks.
- Policy as code
  - Gate `terraform apply` with policy checks (`tflint`, `tfsec`, `terrascan`, or OPA/Sentinel) to enforce CIDR non-overlap, required tags, and allowed ports. Integrate policy checks into the PR pipeline.
- CI/CD workflow
  - Enforce pull-request driven changes: `plan` runs on PR, `apply` is only allowed on `main` with an approved PR and a documented reviewer list. Use GitHub Actions, Atlantis, Spacelift, or Terraform Cloud for orchestrated runs. The CI job should:
    - `terraform fmt` and `validate`
    - Static checks (`tflint`, `tfsec`)
    - `terraform plan` with plan output stored and attached to PR
    - Automated pre-merge tests (see next section)
    - Human approval for `apply` on main branch
  - Example minimal GitHub Actions plan job:
```yaml
name: Terraform Plan
on: [pull_request]
jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v2
        with:
          terraform_version: 1.6.0
      - run: terraform init -input=false
      # Fail the PR on unformatted or invalid configuration
      - run: terraform fmt -check
      - run: terraform validate
      # Save the plan, then export it as JSON for policy checks
      - run: terraform plan -out=tfplan
      - run: terraform show -json tfplan > tfplan.json
```
- Example `vpc_onboarding` module (Terraform HCL)

```hcl
variable "vpc_id" { type = string }
variable "transit_gateway_id" { type = string }
variable "owner" { type = string }
# Declared because the resources below reference them
variable "subnet_ids" { type = list(string) }
variable "route_table_id" { type = string }

# Attach the VPC to the transit gateway
resource "aws_ec2_transit_gateway_vpc_attachment" "attach" {
  vpc_id             = var.vpc_id
  transit_gateway_id = var.transit_gateway_id
  subnet_ids         = var.subnet_ids
  tags               = { Owner = var.owner }
}

# Associate the attachment with the chosen transit gateway route table
resource "aws_ec2_transit_gateway_route_table_association" "assoc" {
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_vpc_attachment.attach.id
  transit_gateway_route_table_id = var.route_table_id
}

output "attachment_id" { value = aws_ec2_transit_gateway_vpc_attachment.attach.id }
```
- Module consumers: Keep application-level config thin by passing only `vpc_id`, `owner`, and `intent` variables; the module enforces naming, security rules, and telemetry.
Adopt automated testing of the IaC itself: unit-style linters and integration tests. Use Terratest for real-world integration tests that create temporary resources, run connectivity checks, and tear down.
Prove Connectivity: Validation Tests and Security Gates That Prevent Surprises
Testing must be part of the pipeline and the runtime checks must be automated.
- Test categories
  - Static IaC checks: `terraform validate`, `tflint`, `tfsec`; fail PRs with policy violations.
  - Pre-apply simulation: Use static analyzers and vendor docs to verify BGP and route intents where possible.
  - Integration tests: Deploy ephemeral test artifacts (a small VM or container on each side and a health endpoint) and run network tests across the newly created attachment.
  - Behavioral tests: BGP adjacency, route propagation, and path validation using a control-plane-aware tool (for complex routing, use Batfish for configuration analysis).
- Connectivity tests to automate
  - BGP adjacency check: confirm the expected neighbor is in the `Established` state and the expected prefixes are present.
  - Route table checks: ensure that the transit route table has propagated prefixes and that no overlapping or blackholed routes exist.
  - Application-level health: `curl -sSf http://10.0.0.10/healthz` or a simple TCP connect to the required port.
  - Throughput and path MTU: `iperf3` for throughput and `tracepath`/`mturoute` for MTU checks.
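- Sample Terratest pattern (Go)

```go
package test

import (
	"net/http"
	"testing"
	"time"

	"github.com/gruntwork-io/terratest/modules/terraform"
	"github.com/stretchr/testify/require"
)

func TestOnboarding(t *testing.T) {
	t.Parallel()

	opts := &terraform.Options{TerraformDir: "../examples/vpc-onboarding"}
	// Always tear down the temporary resources, even when a check fails.
	defer terraform.Destroy(t, opts)
	terraform.InitAndApply(t, opts)

	// The test runner needs a network path to the health endpoint.
	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Get("http://10.0.0.10/healthz")
	// require stops the test on error, so resp is never used when nil.
	require.NoError(t, err)
	defer resp.Body.Close()
	require.Equal(t, http.StatusOK, resp.StatusCode)
}
```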
- Automated security validation
  - Verify that security groups and network security rules are minimal and that no rule allows 0.0.0.0/0 ingress to sensitive ports.
  - Run policy-as-code checks in CI and, after apply, run a runtime check that inspects cloud-native firewall state via API to confirm that the deployed rules match the expected policy.
- Contrarian insight: Tests that run only after humans declare “ready” find problems too late. Shift left: run lightweight network verifications in PRs (simulated where possible) and full integration tests on a merge to a staging branch.
Operational Handoff, SLAs, and Fast Rollbacks for Risk-Free Onboarding
Onboarding ends when operations can support, measure, and recover the new connection.
- Handoff artifacts
  - Documented runbook with owner, contact list, and a sequence diagram showing traffic flows and fallback paths.
  - Dashboard widgets: BGP status, transit hub throughput, per-attachment packets dropped, and DNS resolution success rate.
  - A single-page runbook that contains the `terraform` commit SHA and the exact module version used.
- SLA & SLO mapping
  - Define SLOs for connectivity availability (e.g., 99.9% for production transit) and an error budget for onboarding-related incidents, then publish the on-call responsibilities and escalation steps. Use SLO design techniques from SRE practice to convert operational targets into measurable SLIs and SLOs.
- Owner / Responsibility / SLA table
| Owner | Responsibility | SLA / Target |
|---|---|---|
| Network Platform | Transit fabric, module maintenance, global routes | 99.95% backbone availability |
| App Team | VPC readiness, test artifacts | Connection-ready within target window |
| SRE/On-call | Monitoring, escalation | MTTR for connectivity incidents ≤ 60 minutes |
- Rollback playbook (fast, deterministic)
  - Identify the failing artifact (attachment ID, route table change, or security rule commit).
  - Isolate traffic: disable propagation or disassociate the offending route table to stop further impact.
  - Revert IaC to the last known good commit and apply in platform orchestration (this reverts route/attachment state). Example: merge the previous tag/commit and trigger `terraform apply` from CI.
  - If immediate detachment is required, use the cloud API to remove the attachment (example AWS CLI):

    ```shell
    # List VPC attachments to find the ID of the one to remove
    aws ec2 describe-transit-gateway-attachments --filters Name=resource-type,Values=vpc
    # The AWS CLI removes a VPC attachment with delete-transit-gateway-vpc-attachment
    aws ec2 delete-transit-gateway-vpc-attachment --transit-gateway-attachment-id tgw-attach-xxxx
    ```

  - Validate that traffic did not leak, then return to a controlled re-apply once corrections exist.
- Role of incident post-mortems
  - Run a blameless post-incident review that includes the IaC diff, the test failures (if any), and the time-to-restore, and ends with concrete actions: tighten tests, adjust policies, or harden rollbacks.
A 30-Minute Execution Runbook: Step-by-Step Onboarding Protocol
This protocol is the distilled, executable sequence you run when a team requests VPC/VNet onboarding. Times are realistic estimates once your modules and pipelines exist.
- Preflight (0–10 minutes)
  - Confirm `vpc_id`, owner, and `CIDR` in IPAM.
  - Confirm identity: role/account trust exists and a platform service principal is available.
  - Verify quotas and logging destinations exist.
- Provision via NaC (5–12 minutes)
  - Create a PR referencing the `vpc_onboarding` module with required variables.
  - CI runs `terraform plan`, `tflint`, `tfsec`. Wait for green.
  - Merge PR to `main` after required approvals.
- Apply and immediate smoke tests (5–8 minutes)
  - CI `apply` creates the TGW/VWAN attachment and updates route tables.
  - Run automated integration checks:
    - `aws ec2 describe-transit-gateway-attachments --filters Name=resource-id,Values=<vpc-id>`
    - Run `curl` against the internal health endpoint and an `iperf3` throughput test against a staged tester host.
- Final validation and handoff (2–5 minutes)
  - Confirm logs appear in central analytics and DNS resolution passes.
  - Update the runbook with the final attachment ID, commit SHA, and timestamps.
- Post-onboard monitoring window (15–60 minutes)
  - Keep an elevated watch throughout this window for packet loss, BGP flaps, or rejected flows.
  - If an unrecoverable issue occurs, follow the rollback playbook above.
Sample quick iperf3 client run (from a test container in VPC A to server in VPC B):

```shell
# server (VPC B)
iperf3 -s -D
# client (VPC A)
iperf3 -c 10.10.0.5 -t 30 -J > iperf-result.json
```
Operational tip: Version your onboarding modules and lock the exact module SHA in the onboarding PR so the handoff includes the exact code that was applied.
Sources:
What is AWS Transit Gateway for Amazon VPC? - Official AWS documentation describing Transit Gateway concepts, attachments, routing, and encryption control used to justify a hub-and-spoke transit model.
Azure Virtual WAN Overview - Microsoft Learn overview of Virtual WAN hub-and-spoke architecture, site-to-site VPN, and ExpressRoute integration relevant to global transit fabrics.
Cloud Interconnect overview — Google Cloud - Google Cloud documentation explaining Dedicated/Partner/Interconnect options and when to use direct interconnects for predictable bandwidth.
Terraform | HashiCorp Developer - Official Terraform documentation and best practices for module design, backends, and workflows referenced for the NaC and state management guidance.
Terratest documentation - Terratest docs showing patterns for integration tests of infrastructure code and examples for Terraform test harnesses.
SP 800-207, Zero Trust Architecture — NIST - NIST guidance on zero trust principles and identity-first security used to support identity integration and zero-trust posture recommendations.
Batfish — An open source network configuration analysis tool - Batfish project site and documentation describing configuration analysis and pre-deployment validation workflows for routing/ACL correctness.
iPerf3 — network bandwidth measurement tool - iPerf3 project and user docs referenced for active throughput and MTU testing examples.
Google SRE — Service Level Objectives - SRE guidance on SLIs/SLOs used to design operational SLAs and error-budget thinking for connectivity services.
setup-terraform GitHub Action / Terraform CI patterns - Examples and marketplace actions for running Terraform in GitHub Actions used in the CI/CD pipeline examples.