Abraham Naiborhu

Posted on May 20

Terraforming a Production-Lite GCP Web Platform: MIG, Cloud NAT, Load Balancer, and Private Backends

#terraform #googlecloud #devops #infrastructure

Hi! After building my first Terraform artifact, The GCP Foundation Lite, I wanted to move one layer higher.

I wanted to answer a more complex question:

Can I provision infrastructure for an actual web platform?

So I built this project, "Production-Lite GCP Web Platform", with Terraform.

This project provisions:

custom VPC network
app subnet
reserved database subnet
map-based firewall rules
Cloud NAT
custom service account
regional Managed Instance Group
instance template
startup script
HTTP health check
backend service
external HTTP load balancer
remote Terraform state
reusable Terraform modules

The goal was not to build a full enterprise platform, but a small, production-shaped infrastructure pattern that can run an application.

Before I go further, the code is in my GitHub repository: terraform-gcp-production-lite-web-platform

Why I Built This

In my first artifact, I focused on the foundation layer.

That project included:

remote Terraform state
versioned GCS state bucket
custom VPC
role-based subnets
firewall rules
service accounts
IAM bindings
reusable modules

That was useful because it helped me understand how to create the base layer of a Google Cloud environment.

But a foundation alone does not run an application.

So for this second artifact, I wanted to build something closer to a real web platform.

The objective was to move from this:

Terraform creates the foundation.

To this:

Terraform provisions infrastructure that can serve application traffic.

What This Project Builds

This project creates:

custom VPC network
application subnet
reserved database subnet
firewall rules
Cloud Router
Cloud NAT
custom application service account
instance template
regional Managed Instance Group
HTTP health check
backend service
external HTTP load balancer
global forwarding rule
startup script
simple web application endpoint
remote state in GCS

The application VMs are private.

They do not have external IP addresses.

Users access the application through the external HTTP load balancer.

Outbound internet access from the private VMs is handled through Cloud NAT.

High-Level Architecture

The high-level architecture is:

User
  ↓
External HTTP Load Balancer
  ↓
Backend Service
  ↓
Regional Managed Instance Group
  ↓
Private Application VM
  ↓
Application Endpoint

For outbound access:

Private Application VM
  ↓
Cloud NAT
  ↓
Internet

The important point is this:

Inbound traffic enters through the load balancer.
Outbound traffic leaves through Cloud NAT.
The backend VM does not need a public IP address.

Architecture Diagram

User / Browser
      |
      v
External HTTP Load Balancer
      |
      v
Target HTTP Proxy
      |
      v
URL Map
      |
      v
Backend Service
      |
      v
Regional Managed Instance Group
      |
      v
Private App VM(s)
      |
      v
Application running from startup script

Private App VM(s)
      |
      v
Cloud NAT
      |
      v
Outbound Internet

What Production-Lite Means

For this project, production-lite means the infrastructure follows production-style patterns without trying to become a full enterprise platform.

This project includes:

private backend instances
load balancer entry point
health checks
Managed Instance Group
Cloud NAT
service account separation
firewall rules
remote state
Terraform modules

But this version does not include:

HTTPS
custom domain
Cloud Armor
Cloud SQL
Secret Manager
CI/CD
multi-region deployment
blue-green deployment
Kubernetes

Those features are important, but I intentionally deferred them.

The goal of v1.0 is to keep the scope focused:

Can I provision a production-shaped HTTP web platform with private backends?

Why Private Backend Instances Matter

One of the most important design choices in this project is that the backend VMs do not have external IP addresses.

In the instance template, the network interface does not include an access_config block.

Conceptually, the pattern is:

network_interface {
  subnetwork = var.subnetwork_self_link

  # No access_config block.
  # This intentionally creates VMs without external IP addresses.
}

This means the VM is not directly exposed to the internet.

Instead, application traffic must go through the load balancer.

That is a better pattern than exposing a VM directly with a public IP address.

A direct public VM might be fine for a quick test, but for an application platform, I want a cleaner entry point:

Internet
  -> Load Balancer
  -> Backend Service
  -> Private VM

Why Cloud NAT Is Needed

Because the backend VMs are private, they do not have direct outbound internet access through an external IP.

However, the VM still needs outbound access for operational tasks such as:

running apt-get update
installing packages
downloading dependencies
calling external services
bootstrapping the application during startup

This is where Cloud NAT is useful.

Cloud NAT allows private resources to initiate outbound internet connections without assigning public IP addresses to those resources.

In this project, Cloud NAT is created using:

Cloud Router
Cloud NAT gateway
selected subnet configuration

The important improvement over my earlier lab is that I no longer NAT every subnet.

For this artifact, the app subnet receives NAT.

The reserved database subnet does not receive NAT by default.

This is the intended posture:

app subnet -> Cloud NAT enabled
db subnet  -> Cloud NAT not enabled by default

That separation is small, but architecturally meaningful.

Why I Used a Managed Instance Group

A standalone VM is simpler.

But a standalone VM does not really show platform thinking.

For this artifact, I used a regional Managed Instance Group.

The MIG uses:

instance template
target size
named port
autohealing policy
health check

The difference is important.

A standalone VM says:

I created a server.

A Managed Instance Group says:

I defined how application instances should be created, managed, replaced, and checked.

That is a stronger infrastructure pattern.

Why I Used an External HTTP Load Balancer

The external HTTP load balancer acts as the public entry point.

The load balancer connects to the backend service, and the backend service connects to the Managed Instance Group.

The request flow is:

User
  -> Global forwarding rule
  -> Target HTTP proxy
  -> URL map
  -> Backend service
  -> Managed Instance Group
  -> Application VM

This is more realistic than opening port 80 directly on a public VM.

It also allows me to test important infrastructure concepts:

backend services
health checks
named ports
firewall rules for health check probes
load-balanced application access

Why Health Checks Matter

The load balancer needs to know whether the backend instances are healthy.

For that, I use an HTTP health check.

The application exposes:

GET /
GET /healthz
GET /metadata

The /healthz endpoint is used by the health check.

This is better than using / as the health check path because / is usually a user-facing route, while /healthz is explicitly meant for machine health checking.

The expected response is simple:

ok

A simple health endpoint is enough for this version.

The goal is not to build a complex application.

The goal is to prove that the infrastructure can route traffic to a healthy backend.
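
A minimal sketch of an HTTP health check that probes /healthz is shown here. The resource name, the interval and threshold values, and the `app_port` default are assumptions for illustration, not values taken from the repository.

```hcl
# Sketch only: names and timing values below are illustrative assumptions.
variable "environment" {
  type    = string
  default = "dev"
}

variable "app_port" {
  type    = number
  default = 80 # assumed to match the port allowed by the health check firewall rule
}

resource "google_compute_health_check" "http" {
  name                = "${var.environment}-web-hc" # hypothetical name
  check_interval_sec  = 10
  timeout_sec         = 5 # must not exceed check_interval_sec
  healthy_threshold   = 2
  unhealthy_threshold = 3

  http_health_check {
    port         = var.app_port
    request_path = "/healthz" # dedicated machine-facing path, not "/"
  }
}
```

The same health check can be referenced by both the MIG autohealing policy and the backend service, which keeps "healthy" defined in one place.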

Application Endpoints

The sample application exposes three endpoints.

| Endpoint | Purpose |
| --- | --- |
| `/` | Root endpoint |
| `/healthz` | Health check endpoint |
| `/metadata` | Instance information endpoint |

The root endpoint returns:

Hi from Terraform GCP Production-Lite Platform

The health endpoint returns:

ok

The metadata endpoint returns information such as:

{
  "service": "terraform-gcp-production-lite-web-platform",
  "environment": "dev",
  "version": "1.0.0",
  "hostname": "dev-web-mig-xxxx"
}

The application is intentionally small.

Terraform and infrastructure design are the focus.

Repository Structure

The repository structure is:

terraform-gcp-production-lite-web-platform/
├── README.md
├── .gitignore
├── versions.tf
├── providers.tf
├── backend.tf.example
├── main.tf
├── variables.tf
├── outputs.tf
├── terraform.tfvars.example
├── locals.tf
├── docs/
│   ├── architecture.md
│   ├── deployment-runbook.md
│   ├── operations-runbook.md
│   ├── verification.md
│   ├── design-decisions.md
│   └── version-roadmap.md
├── scripts/
│   └── startup.sh
└── modules/
    ├── network/
    │   ├── main.tf
    │   ├── variables.tf
    │   └── outputs.tf
    ├── iam/
    │   ├── main.tf
    │   ├── variables.tf
    │   └── outputs.tf
    ├── nat/
    │   ├── main.tf
    │   ├── variables.tf
    │   └── outputs.tf
    ├── compute/
    │   ├── main.tf
    │   ├── variables.tf
    │   └── outputs.tf
    └── load-balancer/
        ├── main.tf
        ├── variables.tf
        └── outputs.tf

There are five main modules:

network
iam
nat
compute
load-balancer

Each module has a specific responsibility.

Module Responsibilities

| Module | Responsibility |
| --- | --- |
| `network` | VPC, subnets, firewall rules |
| `iam` | Service accounts and IAM role bindings |
| `nat` | Cloud Router and Cloud NAT |
| `compute` | Instance template and Managed Instance Group |
| `load-balancer` | HTTP load balancer resources |

The root module connects them together.

The root module should answer:

What components make up this platform?
How are those components connected?

The child modules should answer:

How is each infrastructure area implemented?

This separation makes the repository easier to read.

Root Module Composition

The root main.tf composes the platform.

At a high level, the order is:

network
iam
nat
health check
compute
load balancer
IAP IAM bindings

This order makes sense because:

Compute needs network and IAM.
NAT needs network.
The load balancer needs the instance group.
IAP access needs IAM and firewall rules.

The root module should not contain all low-level resources.

It should orchestrate modules.
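
As a rough sketch of that orchestration, the root main.tf could wire the modules together like this. Every module output name used here (`network_self_link`, `subnet_self_links`, `service_account_emails`, `instance_group`) and the `google_compute_health_check.http` resource are hypothetical; the real module interfaces may differ. Terraform derives the actual creation order from these references, not from the order of the blocks in the file.

```hcl
# Sketch only: output names and most input names are assumptions.
module "network" {
  source         = "./modules/network"
  environment    = var.environment
  region         = var.region
  subnets        = var.subnets
  firewall_rules = var.firewall_rules
}

module "iam" {
  source           = "./modules/iam"
  project_id       = var.project_id
  service_accounts = var.service_accounts
}

module "nat" {
  source            = "./modules/nat"
  environment       = var.environment
  region            = var.region
  network_self_link = module.network.network_self_link # hypothetical output
  subnet_self_links = module.network.subnet_self_links # hypothetical output
  nat_subnet_keys   = var.nat_subnet_keys
}

module "compute" {
  source                 = "./modules/compute"
  environment            = var.environment
  region                 = var.region
  subnetwork_self_link   = module.network.subnet_self_links["app"]
  service_account_email  = module.iam.service_account_emails["app"] # hypothetical output
  health_check_self_link = google_compute_health_check.http.self_link
  startup_script         = file("${path.root}/scripts/startup.sh")
  # Other inputs (machine type, target size, MIG name, app port) omitted here.
}

module "load_balancer" {
  source                 = "./modules/load-balancer"
  environment            = var.environment
  backend_instance_group = module.compute.instance_group # hypothetical output
  health_check_self_link = google_compute_health_check.http.self_link
  # Other inputs (such as the load balancer name) omitted here.
}
```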

Network Module

The network module creates:

VPC
subnets
firewall rules

The VPC is created as a custom mode VPC:

resource "google_compute_network" "this" {
  name                    = local.final_network_name
  auto_create_subnetworks = false
  routing_mode            = "REGIONAL"
}

I use:

auto_create_subnetworks = false

because I want explicit control over subnet ranges.

This is cleaner than relying on automatically created subnets.

Map-Based Subnets

Instead of hardcoding each subnet as a separate resource, I define subnets as a map.

Example:

subnets = {
  app = {
    cidr_range            = "10.80.1.0/24"
    private_google_access = true
    role                  = "application"
  }

  db = {
    cidr_range            = "10.80.2.0/24"
    private_google_access = true
    role                  = "database-reserved"
  }
}

The network module then creates subnets using for_each.

Conceptually:

resource "google_compute_subnetwork" "subnets" {
  for_each = var.subnets

  name                     = "${var.environment}-${each.key}-subnet"
  region                   = coalesce(each.value.region, var.region)
  network                  = google_compute_network.this.id
  ip_cidr_range            = each.value.cidr_range
  private_ip_google_access = each.value.private_google_access
}

This is more flexible than writing:

resource "google_compute_subnetwork" "app" {
  ...
}

resource "google_compute_subnetwork" "db" {
  ...
}

If I want to add another subnet later, I can add another map entry.

For example:

cache = {
  cidr_range = "10.80.3.0/24"
  role       = "cache"
}

The module does not need to change.
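
For the shorter `cache` entry to be accepted, the variable type has to treat some attributes as optional. One way to declare that, assuming Terraform 1.3 or later (where optional object attributes with defaults are available), is sketched here. The default of `true` for `private_google_access` is my assumption.

```hcl
# Sketch only: the exact type used in the module is an assumption.
variable "subnets" {
  description = "Subnets to create, keyed by a short name such as app or db."

  type = map(object({
    cidr_range            = string
    role                  = string
    region                = optional(string)     # null falls back to var.region via coalesce()
    private_google_access = optional(bool, true) # assumed default
  }))
}
```

With this type, `coalesce(each.value.region, var.region)` resolves to `var.region` whenever an entry leaves `region` unset.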

Map-Based Firewall Rules

The network module also creates firewall rules from a map.

Example:

firewall_rules = {
  allow-lb-health-check = {
    description   = "Allow Google Cloud load balancer health checks and proxy traffic."
    source_ranges = ["35.191.0.0/16", "130.211.0.0/22"]
    target_tags   = ["web-backend"]

    allow = [
      {
        protocol = "tcp"
        ports    = ["80"]
      }
    ]
  }

  allow-iap-ssh = {
    description   = "Allow SSH to private backend instances through IAP."
    source_ranges = ["35.235.240.0/20"]
    target_tags   = ["web-backend"]

    allow = [
      {
        protocol = "tcp"
        ports    = ["22"]
      }
    ]
  }
}

The module creates firewall rules using for_each and dynamic allow blocks.

Conceptually:

resource "google_compute_firewall" "ingress_rules" {
  for_each = var.firewall_rules

  name          = "${var.environment}-${each.key}"
  network       = google_compute_network.this.name
  description   = each.value.description
  direction     = "INGRESS"
  source_ranges = each.value.source_ranges
  target_tags   = each.value.target_tags

  dynamic "allow" {
    for_each = each.value.allow

    content {
      protocol = allow.value.protocol
      ports    = allow.value.ports
    }
  }
}

I prefer this pattern because traffic policy becomes data-driven.

The module does not need to know every possible firewall rule.

It only needs to know how to create firewall rules from structured input.

Firewall Rules Used

For this version, I use three main firewall rules:

| Rule | Purpose |
| --- | --- |
| `allow-lb-health-check` | Allows Google load balancer and health check traffic to backend VMs |
| `allow-iap-ssh` | Allows SSH through IAP TCP forwarding |
| `allow-internal` | Allows internal traffic inside the platform CIDR |

The important security decision is that I do not open SSH to:

0.0.0.0/0

Instead, SSH access is designed around IAP.
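
That decision can also be enforced in the module input itself. The sketch below declares a type for the firewall rules map and adds a validation that rejects any rule sourced from 0.0.0.0/0. Both the type shape and the validation are assumptions for illustration, not necessarily what the repository does.

```hcl
# Sketch only: the type shape and the validation rule are assumptions.
variable "firewall_rules" {
  description = "Ingress firewall rules, keyed by rule name."

  type = map(object({
    description   = string
    source_ranges = list(string)
    target_tags   = optional(list(string))

    allow = list(object({
      protocol = string
      ports    = optional(list(string)) # null for protocols without ports, such as icmp
    }))
  }))

  validation {
    # Fail at plan time if any rule is open to the whole internet.
    condition = alltrue([
      for rule in values(var.firewall_rules) :
      !contains(rule.source_ranges, "0.0.0.0/0")
    ])
    error_message = "Firewall rules must not use 0.0.0.0/0 as a source range."
  }
}
```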

IAM Module

The IAM module creates service accounts.

For this artifact, the main service account is the application VM service account.

Example input:

service_accounts = {
  app = {
    account_id    = "dev-prod-lite-app-sa"
    display_name  = "Production Lite App Service Account"
    description   = "Service account used by private application VM instances."
    project_roles = [
      "roles/logging.logWriter",
      "roles/monitoring.metricWriter"
    ]
  }
}

The service account is attached to the application instances.

This is cleaner than using the default Compute Engine service account.

It also makes the identity of the workload explicit.
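
A possible implementation of this module input is sketched below: one service account per map entry, plus one project-level IAM binding per listed role. The local name `role_bindings` and the variable shapes are hypothetical.

```hcl
# Sketch only: variable shapes and local names are assumptions.
variable "project_id" {
  type = string
}

variable "service_accounts" {
  type = map(object({
    account_id    = string
    display_name  = string
    description   = string
    project_roles = list(string)
  }))
}

resource "google_service_account" "this" {
  for_each = var.service_accounts

  project      = var.project_id
  account_id   = each.value.account_id
  display_name = each.value.display_name
  description  = each.value.description
}

locals {
  # Flatten { app = { project_roles = [...] } } into one entry per (account, role) pair.
  role_bindings = merge([
    for key, sa in var.service_accounts : {
      for role in sa.project_roles :
      "${key}-${role}" => { sa_key = key, role = role }
    }
  ]...)
}

resource "google_project_iam_member" "this" {
  for_each = local.role_bindings

  project = var.project_id
  role    = each.value.role
  member  = "serviceAccount:${google_service_account.this[each.value.sa_key].email}"
}
```

Using `google_project_iam_member` keeps the bindings additive, so roles granted to other principals in the project are left alone.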

NAT Module

The NAT module creates:

Cloud Router
Cloud NAT

In my earlier lab, Cloud NAT was applied to all subnets.

For this artifact, I wanted a more intentional design.

So NAT is applied only to selected subnet keys.

Example:

nat_subnet_keys = ["app"]

That means:

app subnet gets outbound internet through Cloud NAT
db subnet does not get NAT by default

This is a small design improvement, but it shows better network intent.

The app tier needs outbound access for package installation and application operations.

The reserved database tier should be more restricted.
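
Selecting subnets by key could be implemented roughly as follows. The variable names `network_self_link` and `subnet_self_links` and the resource names are assumptions; the `LIST_OF_SUBNETWORKS` mode with one `subnetwork` block per selected subnet is how the Google provider expresses per-subnet NAT.

```hcl
# Sketch only: variable and resource names are assumptions.
variable "environment" { type = string }
variable "region" { type = string }
variable "network_self_link" { type = string }

variable "subnet_self_links" {
  description = "Map of subnet key to subnetwork self link."
  type        = map(string)
}

variable "nat_subnet_keys" {
  description = "Subnet keys that should receive Cloud NAT."
  type        = list(string)
  default     = ["app"]
}

resource "google_compute_router" "this" {
  name    = "${var.environment}-nat-router"
  region  = var.region
  network = var.network_self_link
}

resource "google_compute_router_nat" "this" {
  name                   = "${var.environment}-nat" # hypothetical name
  router                 = google_compute_router.this.name
  region                 = var.region
  nat_ip_allocate_option = "AUTO_ONLY"

  # Only the subnets listed below are translated; everything else stays without NAT.
  source_subnetwork_ip_ranges_to_nat = "LIST_OF_SUBNETWORKS"

  dynamic "subnetwork" {
    for_each = toset(var.nat_subnet_keys)

    content {
      name                    = var.subnet_self_links[subnetwork.value]
      source_ip_ranges_to_nat = ["ALL_IP_RANGES"]
    }
  }
}
```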

Compute Module

The compute module creates:

instance template
regional Managed Instance Group
named port
autohealing policy

The instance template defines:

machine type
boot disk
network interface
startup script
service account
network tags

The important part is the network interface.

network_interface {
  subnetwork = var.subnetwork_self_link

  # No access_config block.
}

This keeps the VM private.
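
Put together with the other template settings listed above, a fuller sketch of the instance template might look like this. The variable names other than `subnetwork_self_link`, `environment`, and `mig_name` are hypothetical, and the `cloud-platform` scope is an assumption (access is then limited by the service account's IAM roles).

```hcl
# Sketch only: several variable names and the scope choice are assumptions.
resource "google_compute_instance_template" "this" {
  name_prefix  = "${var.environment}-${var.mig_name}-"
  machine_type = var.machine_type # hypothetical variable
  tags         = ["web-backend"]  # must match the firewall rule target tags

  disk {
    source_image = var.source_image # hypothetical variable
    boot         = true
    auto_delete  = true
  }

  network_interface {
    subnetwork = var.subnetwork_self_link

    # No access_config block, so instances get no external IP.
  }

  metadata_startup_script = var.startup_script # hypothetical variable

  service_account {
    email  = var.service_account_email # hypothetical variable
    scopes = ["cloud-platform"]
  }

  lifecycle {
    # Templates are immutable; create the replacement before removing the old one.
    create_before_destroy = true
  }
}
```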

The MIG then uses the instance template:

resource "google_compute_region_instance_group_manager" "this" {
  name               = "${var.environment}-${var.mig_name}"
  region             = var.region
  base_instance_name = "${var.environment}-${var.mig_name}"
  target_size        = var.target_size

  version {
    instance_template = google_compute_instance_template.this.self_link
  }

  named_port {
    name = "http"
    port = var.app_port
  }

  auto_healing_policies {
    health_check      = var.health_check_self_link
    initial_delay_sec = 120
  }
}

The named port is important because the backend service uses it to send traffic to the correct backend port.

Load Balancer Module

The load balancer module creates:

global IP address
backend service
URL map
target HTTP proxy
global forwarding rule

The backend service connects the load balancer to the MIG.

Conceptually:

resource "google_compute_backend_service" "this" {
  name                  = "${var.environment}-${var.lb_name}-backend"
  protocol              = "HTTP"
  port_name             = "http"
  load_balancing_scheme = "EXTERNAL_MANAGED"
  timeout_sec           = 30

  health_checks = [
    var.health_check_self_link
  ]

  backend {
    group           = var.backend_instance_group
    balancing_mode  = "UTILIZATION"
    capacity_scaler = 1.0
  }
}

The load balancer then exposes the service through a global forwarding rule.

For v1.0, I use HTTP on port 80.

HTTPS will come later in v1.1.
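
The remaining front-end resources could be sketched like this. The resource names are hypothetical, and the block assumes the `google_compute_backend_service.this` resource shown above lives in the same module.

```hcl
# Sketch only: resource names are assumptions.
resource "google_compute_global_address" "this" {
  name = "${var.environment}-${var.lb_name}-ip"
}

resource "google_compute_url_map" "this" {
  name            = "${var.environment}-${var.lb_name}-url-map"
  default_service = google_compute_backend_service.this.id # every path goes to the one backend
}

resource "google_compute_target_http_proxy" "this" {
  name    = "${var.environment}-${var.lb_name}-http-proxy"
  url_map = google_compute_url_map.this.id
}

resource "google_compute_global_forwarding_rule" "this" {
  name                  = "${var.environment}-${var.lb_name}-http"
  ip_address            = google_compute_global_address.this.address
  port_range            = "80"
  target                = google_compute_target_http_proxy.this.id
  load_balancing_scheme = "EXTERNAL_MANAGED" # must match the backend service scheme
}
```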

Startup Script

The startup script bootstraps the application when the VM starts.

The script installs dependencies, creates the application files, and starts the service.

A simplified version of the application behavior is:

GET /         -> returns a simple message
GET /healthz  -> returns ok
GET /metadata -> returns hostname and version information

The important operational improvement is that the application runs as a systemd service.

That is better than running a background process with &, because systemd tracks the process, can restart it on failure, and captures its output in the journal.

With systemd, I can check:

sudo systemctl status prod-lite-app

And view logs:

sudo journalctl -u prod-lite-app --no-pager -n 50

Remote State

Like the first artifact, this project uses a GCS backend for Terraform state.

Example backend:

terraform {
  backend "gcs" {
    bucket = "YOUR_TERRAFORM_STATE_BUCKET"
    prefix = "terraform-gcp-production-lite-web-platform/v1"
  }
}

The state path becomes something like:

gs://YOUR_TERRAFORM_STATE_BUCKET/terraform-gcp-production-lite-web-platform/v1/default.tfstate

Using remote state is important because this project is no longer a tiny one-file local experiment.

It has multiple modules and multiple cloud resources.

Remote state gives the project a more realistic workflow.

Git Safety

I do not commit real .tfvars files.

The repository includes:

terraform.tfvars.example

But ignores:

terraform.tfvars

The .gitignore includes:

.terraform/
*.tfstate
*.tfstate.*
*.tfvars
*.tfplan
crash.log
.DS_Store

This avoids committing local values, real project IDs, or state files.

Running the Project

The execution flow is:

1. Configure backend
2. Configure terraform.tfvars
3. Run terraform fmt
4. Run terraform init
5. Run terraform validate
6. Run terraform plan
7. Run terraform apply
8. Verify the load balancer and backend health

Configure Backend

cp backend.tf.example backend.tf

Then edit the bucket name.

Configure Variables

cp terraform.tfvars.example terraform.tfvars

Then edit:

project_id      = "your-gcp-project-id"
admin_principal = "user:your-email@example.com"

Run Terraform

terraform fmt -recursive
terraform init
terraform validate
terraform plan
terraform apply

Expected Output

After deployment, Terraform should output values such as:

network_name
subnets
firewall_rules
cloud_nat_name
cloud_router_name
health_check_name
mig_name
mig_instance_group
load_balancer_ip
load_balancer_url
curl_test_command
curl_health_check_command
platform_summary

The most important output is:

load_balancer_url

That URL is used to test the application.
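
A few of these outputs could be defined along these lines. The module output `ip_address` is a hypothetical name; the output names themselves come from the list above.

```hcl
# Sketch only: module.load_balancer.ip_address is an assumed output name.
output "load_balancer_ip" {
  description = "Global IP address of the external HTTP load balancer."
  value       = module.load_balancer.ip_address
}

output "load_balancer_url" {
  description = "Base URL for testing the application."
  value       = "http://${module.load_balancer.ip_address}"
}

output "curl_health_check_command" {
  description = "Ready-to-paste command for checking the health endpoint."
  value       = "curl -i http://${module.load_balancer.ip_address}/healthz"
}
```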

Verification

After deployment, I can test the root endpoint:

curl -i http://LOAD_BALANCER_IP

Expected response:

Hi from Terraform GCP Production-Lite Platform

Test the health endpoint:

curl -i http://LOAD_BALANCER_IP/healthz

Expected response:

ok

Test the metadata endpoint:

curl -i http://LOAD_BALANCER_IP/metadata

Expected response:

{
  "service": "terraform-gcp-production-lite-web-platform",
  "environment": "dev",
  "version": "1.0.0",
  "hostname": "dev-web-mig-xxxx"
}

Verifying the Infrastructure

Verify the VPC:

gcloud compute networks list

Verify the subnets:

gcloud compute networks subnets list

Verify firewall rules:

gcloud compute firewall-rules list

Verify Cloud NAT:

gcloud compute routers nats list \
  --router=dev-nat-router \
  --region=asia-southeast2

Verify the Managed Instance Group:

gcloud compute instance-groups managed list

Verify backend health:

gcloud compute backend-services get-health BACKEND_SERVICE_NAME --global

Verify the application VM does not have an external IP:

gcloud compute instances list

This is one of the most important checks: the EXTERNAL_IP column should be empty for the application instances.

The application should be reachable through the load balancer, not directly through a public VM IP.

Accessing the Private VM Through IAP

Since the VM has no external IP, I cannot SSH to it directly from the internet.

Instead, I use IAP TCP forwarding.

Example:

gcloud compute ssh INSTANCE_NAME \
  --zone=asia-southeast2-a \
  --tunnel-through-iap

This requires:

IAM permission for IAP tunnel access
OS Login role
firewall rule allowing IAP source range
correct target tag on the VM

The firewall source range for IAP TCP forwarding is:

35.235.240.0/20

This is better than opening SSH to the public internet.
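
The IAM side of those requirements could be granted roughly as follows, using the `admin_principal` variable from terraform.tfvars. The module output `service_account_names` is hypothetical, and granting the roles at project level is an assumption made to keep the sketch short; they can be scoped more narrowly.

```hcl
# Sketch only: the module output name and the project-level scope are assumptions.

# Allows the admin to open IAP TCP tunnels to instances in the project.
resource "google_project_iam_member" "iap_tunnel" {
  project = var.project_id
  role    = "roles/iap.tunnelResourceAccessor"
  member  = var.admin_principal
}

# Allows the admin to log in through OS Login as a non-root user.
resource "google_project_iam_member" "os_login" {
  project = var.project_id
  role    = "roles/compute.osLogin"
  member  = var.admin_principal
}

# Allows the admin to connect to VMs that run as the application service account.
resource "google_service_account_iam_member" "app_sa_user" {
  # Expects the fully qualified name: projects/PROJECT/serviceAccounts/EMAIL
  service_account_id = module.iam.service_account_names["app"] # hypothetical output
  role               = "roles/iam.serviceAccountUser"
  member             = var.admin_principal
}
```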

Troubleshooting Notes

Load Balancer Returns 502

Possible causes:

application failed to start
health check path is wrong
firewall rule does not allow health check probes
target tag mismatch
named port mismatch
backend service points to the wrong instance group

Useful commands:

gcloud compute backend-services get-health BACKEND_SERVICE_NAME --global

sudo systemctl status prod-lite-app

sudo journalctl -u prod-lite-app --no-pager -n 100

VM Cannot Install Packages

Possible cause:

Cloud NAT is not configured correctly.

Useful check:

curl https://example.com

Run that from inside the VM.

IAP SSH Fails

Possible causes:

missing IAP role
missing OS Login role
missing service account user role
missing IAP firewall rule
wrong VM network tag

Check firewall rules:

gcloud compute firewall-rules list --filter="name~iap"

Important Design Decisions

1. Use private backend VMs

The backend application instances do not have external IP addresses.

This reduces direct exposure and forces traffic through the load balancer.

2. Use Cloud NAT

The private VMs still need outbound access.

Cloud NAT allows that without assigning public IPs to the VMs.

3. Use map-based subnet definitions

Subnets are defined through a map so the network module stays reusable.

Adding another subnet should not require another resource block.

4. Use map-based firewall rules

Firewall rules are also defined through a map.

This keeps ingress policy data-driven and easier to extend.

5. Use a Managed Instance Group

The application tier is managed as a group based on an instance template.

This is more production-shaped than a standalone VM.

6. Use HTTP only in v1.0

HTTPS is intentionally deferred to v1.1.

The v1.0 goal is to prove:

private backends
Cloud NAT
MIG
health checks
HTTP load balancing
firewall rules
remote state

7. Reserve a DB subnet without provisioning a database

The DB subnet demonstrates tiered network design.

A database is not provisioned in v1.0 because this artifact focuses on the web platform infrastructure.

What I Learned

This artifact helped me understand that application infrastructure is not only about creating a VM.

A proper platform needs several parts to work together:

networking
identity
firewall rules
NAT
compute lifecycle
health checks
load balancing
state management
documentation

The most interesting part for me was seeing how small configuration details affect the whole platform.

For example:

if the firewall source range is wrong, the load balancer cannot reach the backend
if the health check path is wrong, the backend becomes unhealthy
if Cloud NAT is missing, private VMs may fail during startup
if the MIG named port does not match the backend service, traffic may not route correctly
if SSH is opened to 0.0.0.0/0, the design becomes weaker

This project made me appreciate that infrastructure is a system.

Each component has to be designed with the others in mind.

What I Intentionally Did Not Add

I intentionally did not add HTTPS in this version.

I also did not add:

Cloud Armor
Cloud SQL
Secret Manager
CI/CD
custom domain
blue-green deployment
autoscaling policy
multi-region deployment

Those are useful, but I want this artifact to stay focused.

The purpose of v1.0 is to build the core web platform first.

Next Step

The next version will be:

v1.1 — HTTPS and Custom Domain

That version should add:

Google-managed SSL certificate
custom domain
HTTPS target proxy
global forwarding rule on port 443
optional HTTP-to-HTTPS redirect

After that, I want to continue with:

v1.2 — Security Hardening
v2.0 — Terraform CI/CD with GitHub Actions and Workload Identity Federation
v2.1 — Drift, Import, and State Recovery
v3.0 — Database and Secrets

References

Terraform GCS backend:
https://developer.hashicorp.com/terraform/language/backend/gcs

Google Cloud — Store Terraform state in Cloud Storage:
https://cloud.google.com/docs/terraform/resource-management/store-state

Google Cloud — External Application Load Balancer overview:
https://cloud.google.com/load-balancing/docs/https

Google Cloud — Terraform examples for external Application Load Balancers:
https://cloud.google.com/load-balancing/docs/https/ext-http-lb-tf-module-examples

Google Cloud — Cloud NAT overview:
https://cloud.google.com/nat/docs/overview

Google Cloud — IAP TCP forwarding:
https://cloud.google.com/iap/docs/using-tcp-forwarding