Hi! After building my first Terraform artifact, the GCP Foundation Lite, I wanted to move one layer higher.
I wanted to answer a more complex question:
Can I provision infrastructure for an actual web platform?
Thus, I created this project titled "Production-Lite GCP Web Platform" using Terraform.
This project provisions:
- custom VPC network
- app subnet
- reserved database subnet
- map-based firewall rules
- Cloud NAT
- custom service account
- regional Managed Instance Group
- instance template
- startup script
- HTTP health check
- backend service
- external HTTP load balancer
- remote Terraform state
- reusable Terraform modules
The goal was not to build a full enterprise platform, but a small, production-shaped infrastructure pattern that can run an application.
Before I go further, do check out my GitHub repository: terraform-gcp-production-lite-web-platform
Why I Built This
In my first artifact, I focused on the foundation layer.
That project included:
- remote Terraform state
- versioned GCS state bucket
- custom VPC
- role-based subnets
- firewall rules
- service accounts
- IAM bindings
- reusable modules
That was useful because it helped me understand how to create the base layer of a Google Cloud environment.
But a foundation alone does not run an application.
So for this second artifact, I wanted to build something closer to a real web platform.
The objective was to move from this:
Terraform creates the foundation.
To this:
Terraform provisions infrastructure that can serve application traffic.
What This Project Builds
This project creates:
- custom VPC network
- application subnet
- reserved database subnet
- firewall rules
- Cloud Router
- Cloud NAT
- custom application service account
- instance template
- regional Managed Instance Group
- HTTP health check
- backend service
- external HTTP load balancer
- global forwarding rule
- startup script
- simple web application endpoint
- remote state in GCS
The application VMs are private.
They do not have external IP addresses.
Users access the application through the external HTTP load balancer.
Outbound internet access from the private VMs is handled through Cloud NAT.
High-Level Architecture
The high-level architecture is:
User
↓
External HTTP Load Balancer
↓
Backend Service
↓
Regional Managed Instance Group
↓
Private Application VM
↓
Application Endpoint
For outbound access:
Private Application VM
↓
Cloud NAT
↓
Internet
The important point is this:
Inbound traffic enters through the load balancer.
Outbound traffic leaves through Cloud NAT.
The backend VM does not need a public IP address.
Architecture Diagram
User / Browser
|
v
External HTTP Load Balancer
|
v
Target HTTP Proxy
|
v
URL Map
|
v
Backend Service
|
v
Regional Managed Instance Group
|
v
Private App VM(s)
|
v
Application running from startup script
For outbound access:
Private App VM(s)
|
v
Cloud NAT
|
v
Outbound Internet
What Production-Lite Means
For this project, production-lite means the infrastructure follows production-style patterns without trying to become a full enterprise platform.
This project includes:
- private backend instances
- load balancer entry point
- health checks
- Managed Instance Group
- Cloud NAT
- service account separation
- firewall rules
- remote state
- Terraform modules
But this version does not include:
- HTTPS
- custom domain
- Cloud Armor
- Cloud SQL
- Secret Manager
- CI/CD
- multi-region deployment
- blue-green deployment
- Kubernetes
Those features are important, but I intentionally deferred them.
The goal of v1.0 is to keep the scope focused:
Can I provision a production-shaped HTTP web platform with private backends?
Why Private Backend Instances Matter
One of the most important design choices in this project is that the backend VMs do not have external IP addresses.
In the instance template, the network interface does not include an access_config block.
Conceptually, the pattern is:
network_interface {
  subnetwork = var.subnetwork_self_link
  # No access_config block.
  # This intentionally creates VMs without external IP addresses.
}
This means the VM is not directly exposed to the internet.
Instead, application traffic must go through the load balancer.
That is a better pattern than exposing a VM directly with a public IP address.
A direct public VM might be fine for a quick test, but for an application platform, I want a cleaner entry point:
Internet
-> Load Balancer
-> Backend Service
-> Private VM
Why Cloud NAT Is Needed
Because the backend VMs are private, they do not have direct outbound internet access through an external IP.
However, the VM still needs outbound access for operational tasks such as:
- running apt-get update
- installing packages
- downloading dependencies
- calling external services
- bootstrapping the application during startup
This is where Cloud NAT is useful.
Cloud NAT allows private resources to initiate outbound internet connections without assigning public IP addresses to those resources.
In this project, Cloud NAT is created using:
- Cloud Router
- Cloud NAT gateway
- selected subnet configuration
The important improvement from my earlier lab is that I do not need to NAT every subnet.
For this artifact, the app subnet receives NAT.
The reserved database subnet does not receive NAT by default.
This is the intended posture:
app subnet -> Cloud NAT enabled
db subnet -> Cloud NAT not enabled by default
That separation is small, but architecturally meaningful.
Why I Used a Managed Instance Group
A standalone VM is simpler.
But a standalone VM does not really show platform thinking.
For this artifact, I used a regional Managed Instance Group.
The MIG uses:
- instance template
- target size
- named port
- autohealing policy
- health check
The difference is important.
A standalone VM says:
I created a server.
A Managed Instance Group says:
I defined how application instances should be created, managed, replaced, and checked.
That is a stronger infrastructure pattern.
Why I Used an External HTTP Load Balancer
The external HTTP load balancer acts as the public entry point.
The load balancer connects to the backend service, and the backend service connects to the Managed Instance Group.
The request flow is:
User
-> Global forwarding rule
-> Target HTTP proxy
-> URL map
-> Backend service
-> Managed Instance Group
-> Application VM
This is more realistic than opening port 80 directly on a public VM.
It also allows me to test important infrastructure concepts:
- backend services
- health checks
- named ports
- firewall rules for health check probes
- load-balanced application access
Why Health Checks Matter
The load balancer needs to know whether the backend instances are healthy.
For that, I use an HTTP health check.
The application exposes:
GET /
GET /healthz
GET /metadata
The /healthz endpoint is used by the health check.
This is better than using / as the health check path because / is usually a user-facing route, while /healthz is explicitly meant for machine health checking.
The expected response is simple:
ok
A simple health endpoint is enough for this version.
The goal is not to build a complex application.
The goal is to prove that the infrastructure can route traffic to a healthy backend.
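A minimal sketch of what such an HTTP health check could look like in Terraform. The resource name, intervals, and thresholds are illustrative assumptions, not values taken from the repository; only the /healthz path and the use of an application port variable come from the text.

```hcl
# Sketch only: name and timing values are assumptions.
resource "google_compute_health_check" "http" {
  name                = "${var.environment}-web-http-hc" # hypothetical name
  check_interval_sec  = 10
  timeout_sec         = 5
  healthy_threshold   = 2
  unhealthy_threshold = 3

  http_health_check {
    port         = var.app_port # same port the MIG exposes as its named port
    request_path = "/healthz"   # machine-facing path, not the user-facing "/"
  }
}
```

The same health check can be referenced by both the backend service and the MIG autohealing policy, so a failing /healthz affects routing and instance replacement together.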
Application Endpoints
The sample application exposes three endpoints.
| Endpoint | Purpose |
|---|---|
| / | Root endpoint |
| /healthz | Health check endpoint |
| /metadata | Instance information endpoint |
The root endpoint returns:
Hi from Terraform GCP Production-Lite Platform
The health endpoint returns:
ok
The metadata endpoint returns information such as:
{
  "service": "terraform-gcp-production-lite-web-platform",
  "environment": "dev",
  "version": "1.0.0",
  "hostname": "dev-web-mig-xxxx"
}
The application is intentionally small.
Terraform and infrastructure design are the focus.
Repository Structure
The repository structure is:
terraform-gcp-production-lite-web-platform/
├── README.md
├── .gitignore
├── versions.tf
├── providers.tf
├── backend.tf.example
├── main.tf
├── variables.tf
├── outputs.tf
├── terraform.tfvars.example
├── locals.tf
├── docs/
│   ├── architecture.md
│   ├── deployment-runbook.md
│   ├── operations-runbook.md
│   ├── verification.md
│   ├── design-decisions.md
│   └── version-roadmap.md
├── scripts/
│   └── startup.sh
└── modules/
    ├── network/
    │   ├── main.tf
    │   ├── variables.tf
    │   └── outputs.tf
    ├── iam/
    │   ├── main.tf
    │   ├── variables.tf
    │   └── outputs.tf
    ├── nat/
    │   ├── main.tf
    │   ├── variables.tf
    │   └── outputs.tf
    ├── compute/
    │   ├── main.tf
    │   ├── variables.tf
    │   └── outputs.tf
    └── load-balancer/
        ├── main.tf
        ├── variables.tf
        └── outputs.tf
There are five main modules:
network
iam
nat
compute
load-balancer
Each module has a specific responsibility.
Module Responsibilities
| Module | Responsibility |
|---|---|
| network | VPC, subnets, firewall rules |
| iam | Service accounts and IAM role bindings |
| nat | Cloud Router and Cloud NAT |
| compute | Instance template and Managed Instance Group |
| load-balancer | HTTP load balancer resources |
The root module connects them together.
The root module should answer:
What components make up this platform?
How are those components connected?
The child modules should answer:
How is each infrastructure area implemented?
This separation makes the repository easier to read.
Root Module Composition
The root main.tf composes the platform.
At a high level, the order is:
network
iam
nat
health check
compute
load balancer
IAP IAM bindings
This order makes sense because:
Compute needs network and IAM.
NAT needs network.
The load balancer needs the instance group.
IAP access needs IAM and firewall rules.
The root module should not contain all low-level resources.
It should orchestrate modules.
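A rough sketch of how the root main.tf might wire the modules together. Module input and output names such as network_self_link, subnet_self_links, service_account_emails, and instance_group are assumptions made for illustration, and the health check is assumed to be declared in the root module as google_compute_health_check.http.

```hcl
# Sketch only: input/output names are hypothetical.
module "network" {
  source         = "./modules/network"
  environment    = var.environment
  region         = var.region
  subnets        = var.subnets
  firewall_rules = var.firewall_rules
}

module "iam" {
  source           = "./modules/iam"
  project_id       = var.project_id
  service_accounts = var.service_accounts
}

module "nat" {
  source            = "./modules/nat"
  environment       = var.environment
  region            = var.region
  network_self_link = module.network.network_self_link # assumed output
  subnet_self_links = module.network.subnet_self_links # assumed map keyed by subnet key
  nat_subnet_keys   = var.nat_subnet_keys
}

module "compute" {
  source                 = "./modules/compute"
  environment            = var.environment
  region                 = var.region
  mig_name               = var.mig_name
  target_size            = var.target_size
  app_port               = var.app_port
  subnetwork_self_link   = module.network.subnet_self_links["app"]
  service_account_email  = module.iam.service_account_emails["app"] # assumed output
  health_check_self_link = google_compute_health_check.http.self_link
}

module "load_balancer" {
  source                 = "./modules/load-balancer"
  environment            = var.environment
  lb_name                = var.lb_name
  backend_instance_group = module.compute.instance_group # assumed output
  health_check_self_link = google_compute_health_check.http.self_link
}
```

Because each module consumes outputs of another, Terraform derives the creation order from these references; the listed order is a reading aid rather than something that has to be enforced with depends_on.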
Network Module
The network module creates:
- VPC
- subnets
- firewall rules
The VPC is created as a custom mode VPC:
resource "google_compute_network" "this" {
  name                    = local.final_network_name
  auto_create_subnetworks = false
  routing_mode            = "REGIONAL"
}
I use:
auto_create_subnetworks = false
because I want explicit control over subnet ranges.
This is cleaner than relying on automatically created subnets.
Map-Based Subnets
Instead of hardcoding each subnet as a separate resource, I define subnets as a map.
Example:
subnets = {
  app = {
    cidr_range            = "10.80.1.0/24"
    private_google_access = true
    role                  = "application"
  }
  db = {
    cidr_range            = "10.80.2.0/24"
    private_google_access = true
    role                  = "database-reserved"
  }
}
The network module then creates subnets using for_each.
Conceptually:
resource "google_compute_subnetwork" "subnets" {
  for_each                 = var.subnets
  name                     = "${var.environment}-${each.key}-subnet"
  region                   = coalesce(each.value.region, var.region)
  network                  = google_compute_network.this.id
  ip_cidr_range            = each.value.cidr_range
  private_ip_google_access = each.value.private_google_access
}
This is more flexible than writing:
resource "google_compute_subnetwork" "app" {
  ...
}
resource "google_compute_subnetwork" "db" {
  ...
}
If I want to add another subnet later, I can add another map entry.
For example:
cache = {
  cidr_range = "10.80.3.0/24"
  role       = "cache"
}
The module does not need to change.
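For a map entry like cache to work without region or private_google_access, the module's input variable needs to treat those attributes as optional. The following is one possible type definition, assuming Terraform 1.3 or later for optional() with defaults; the default of true for private_google_access is an assumption, not something confirmed by the repository.

```hcl
# Sketch only: the real variable definition in the module may differ.
variable "subnets" {
  description = "Subnets to create, keyed by a short name such as app or db."
  type = map(object({
    cidr_range            = string
    role                  = string
    region                = optional(string)     # null falls back to var.region via coalesce()
    private_google_access = optional(bool, true) # assumed default
  }))
}
```

With this shape, coalesce(each.value.region, var.region) resolves to the module-level region whenever an entry leaves region unset.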
Map-Based Firewall Rules
The network module also creates firewall rules from a map.
Example:
firewall_rules = {
  allow-lb-health-check = {
    description   = "Allow Google Cloud load balancer health checks and proxy traffic."
    source_ranges = ["35.191.0.0/16", "130.211.0.0/22"]
    target_tags   = ["web-backend"]
    allow = [
      {
        protocol = "tcp"
        ports    = ["80"]
      }
    ]
  }
  allow-iap-ssh = {
    description   = "Allow SSH to private backend instances through IAP."
    source_ranges = ["35.235.240.0/20"]
    target_tags   = ["web-backend"]
    allow = [
      {
        protocol = "tcp"
        ports    = ["22"]
      }
    ]
  }
}
The module creates firewall rules using for_each and dynamic allow blocks.
Conceptually:
resource "google_compute_firewall" "ingress_rules" {
  for_each      = var.firewall_rules
  name          = "${var.environment}-${each.key}"
  network       = google_compute_network.this.name
  description   = each.value.description
  direction     = "INGRESS"
  source_ranges = each.value.source_ranges
  target_tags   = each.value.target_tags
  dynamic "allow" {
    for_each = each.value.allow
    content {
      protocol = allow.value.protocol
      ports    = allow.value.ports
    }
  }
}
I prefer this pattern because traffic policy becomes data-driven.
The module does not need to know every possible firewall rule.
It only needs to know how to create firewall rules from structured input.
Firewall Rules Used
For this version, I use three main firewall rules:
| Rule | Purpose |
|---|---|
| allow-lb-health-check | Allows Google load balancer and health check traffic to backend VMs |
| allow-iap-ssh | Allows SSH through IAP TCP forwarding |
| allow-internal | Allows internal traffic inside the platform CIDR |
The important security decision is that I do not open SSH to:
0.0.0.0/0
Instead, SSH access is designed around IAP.
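The allow-internal rule could be expressed as one more entry in the same firewall_rules map. This is a sketch under assumptions: the 10.80.0.0/16 range is a guess at a platform CIDR that covers the 10.80.1.0/24 and 10.80.2.0/24 subnets, and the target tag simply reuses web-backend.

```hcl
# Sketch only: CIDR and tag are assumptions.
firewall_rules = {
  allow-internal = {
    description   = "Allow internal traffic inside the platform CIDR."
    source_ranges = ["10.80.0.0/16"] # assumed platform CIDR
    target_tags   = ["web-backend"]
    allow = [
      { protocol = "tcp", ports = ["0-65535"] },
      { protocol = "udp", ports = ["0-65535"] },
    ]
  }
}
```

Every source range here is either a Google-owned range or a private range, so no rule exposes a port to 0.0.0.0/0.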
IAM Module
The IAM module creates service accounts.
For this artifact, the main service account is the application VM service account.
Example input:
service_accounts = {
  app = {
    account_id   = "dev-prod-lite-app-sa"
    display_name = "Production Lite App Service Account"
    description  = "Service account used by private application VM instances."
    project_roles = [
      "roles/logging.logWriter",
      "roles/monitoring.metricWriter"
    ]
  }
}
The service account is attached to the application instances.
This is cleaner than using the default Compute Engine service account.
It also makes the identity of the workload explicit.
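One way the IAM module could turn that input into resources is to create one service account per map entry and one project-level binding per (account, role) pair. This is a sketch, not the module's actual code; var.project_id and the local name sa_role_pairs are assumptions.

```hcl
# Sketch only: flattens each account's project_roles into individual bindings.
locals {
  sa_role_pairs = flatten([
    for key, sa in var.service_accounts : [
      for role in sa.project_roles : {
        key  = key
        role = role
      }
    ]
  ])
}

resource "google_service_account" "this" {
  for_each     = var.service_accounts
  project      = var.project_id
  account_id   = each.value.account_id
  display_name = each.value.display_name
  description  = each.value.description
}

resource "google_project_iam_member" "roles" {
  # Key each binding by "<account key>-<role>" so for_each has unique keys.
  for_each = { for p in local.sa_role_pairs : "${p.key}-${p.role}" => p }
  project  = var.project_id
  role     = each.value.role
  member   = "serviceAccount:${google_service_account.this[each.value.key].email}"
}
```

Using google_project_iam_member keeps the bindings additive, so roles granted to other principals in the project are left untouched.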
NAT Module
The NAT module creates:
- Cloud Router
- Cloud NAT
In my earlier lab, Cloud NAT was applied to all subnets.
For this artifact, I wanted a more intentional design.
So NAT is applied only to selected subnet keys.
Example:
nat_subnet_keys = ["app"]
That means:
app subnet gets outbound internet through Cloud NAT
db subnet does not get NAT by default
This is a small design improvement, but it shows better network intent.
The app tier needs outbound access for package installation and application operations.
The reserved database tier should be more restricted.
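A sketch of how subnet-scoped NAT can be expressed, assuming the module receives a list of subnet keys plus a map of subnet self links keyed the same way. The variable names network_self_link and subnet_self_links and the resource names are assumptions.

```hcl
# Sketch only: variable names are hypothetical.
resource "google_compute_router" "this" {
  name    = "${var.environment}-nat-router"
  region  = var.region
  network = var.network_self_link
}

resource "google_compute_router_nat" "this" {
  name                   = "${var.environment}-nat"
  router                 = google_compute_router.this.name
  region                 = var.region
  nat_ip_allocate_option = "AUTO_ONLY"

  # Only the listed subnets are NATed, instead of ALL_SUBNETWORKS_ALL_IP_RANGES.
  source_subnetwork_ip_ranges_to_nat = "LIST_OF_SUBNETWORKS"

  dynamic "subnetwork" {
    for_each = toset(var.nat_subnet_keys)
    content {
      name                    = var.subnet_self_links[subnetwork.value]
      source_ip_ranges_to_nat = ["ALL_IP_RANGES"]
    }
  }
}
```

With nat_subnet_keys = ["app"], only the app subnet gets a subnetwork block, so the db subnet has no outbound path through this gateway.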
Compute Module
The compute module creates:
- instance template
- regional Managed Instance Group
- named port
- autohealing policy
The instance template defines:
- machine type
- boot disk
- network interface
- startup script
- service account
- network tags
The important part is the network interface.
network_interface {
  subnetwork = var.subnetwork_self_link
  # No access_config block.
}
This keeps the VM private.
The MIG then uses the instance template:
resource "google_compute_region_instance_group_manager" "this" {
  name               = "${var.environment}-${var.mig_name}"
  region             = var.region
  base_instance_name = "${var.environment}-${var.mig_name}"
  target_size        = var.target_size
  version {
    instance_template = google_compute_instance_template.this.self_link
  }
  named_port {
    name = "http"
    port = var.app_port
  }
  auto_healing_policies {
    health_check      = var.health_check_self_link
    initial_delay_sec = 120
  }
}
The named port is important because the backend service uses it to send traffic to the correct backend port.
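For context, the instance template that the MIG references might look roughly like this. The machine type variable, the Debian image, the web-backend tag placement, the service account variable, and the startup script path are assumptions chosen to match the rest of the article, not confirmed module code.

```hcl
# Sketch only: image, variables, and script path are assumptions.
resource "google_compute_instance_template" "this" {
  name_prefix  = "${var.environment}-${var.mig_name}-"
  machine_type = var.machine_type
  tags         = ["web-backend"] # must match the firewall rules' target_tags

  disk {
    source_image = "debian-cloud/debian-12" # assumed image family
    boot         = true
    auto_delete  = true
  }

  network_interface {
    subnetwork = var.subnetwork_self_link
    # No access_config block, so instances get no external IP.
  }

  service_account {
    email  = var.service_account_email
    scopes = ["cloud-platform"]
  }

  metadata_startup_script = file("${path.root}/scripts/startup.sh")

  lifecycle {
    # Lets Terraform create the replacement template before the MIG drops the old one.
    create_before_destroy = true
  }
}
```

Using name_prefix together with create_before_destroy avoids name collisions when a template change forces a new resource.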
Load Balancer Module
The load balancer module creates:
- global IP address
- backend service
- URL map
- target HTTP proxy
- global forwarding rule
The backend service connects the load balancer to the MIG.
Conceptually:
resource "google_compute_backend_service" "this" {
  name                  = "${var.environment}-${var.lb_name}-backend"
  protocol              = "HTTP"
  port_name             = "http"
  load_balancing_scheme = "EXTERNAL_MANAGED"
  timeout_sec           = 30
  health_checks = [
    var.health_check_self_link
  ]
  backend {
    group           = var.backend_instance_group
    balancing_mode  = "UTILIZATION"
    capacity_scaler = 1.0
  }
}
The load balancer then exposes the service through a global forwarding rule.
For v1.0, I use HTTP on port 80.
HTTPS will come later in v1.1.
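The remaining frontend pieces (global IP address, URL map, target HTTP proxy, global forwarding rule) could be chained as in the sketch below. Resource names are illustrative assumptions; the forwarding rule uses the same EXTERNAL_MANAGED scheme as the backend service.

```hcl
# Sketch only: resource names are hypothetical.
resource "google_compute_global_address" "this" {
  name = "${var.environment}-${var.lb_name}-ip"
}

resource "google_compute_url_map" "this" {
  name            = "${var.environment}-${var.lb_name}-url-map"
  default_service = google_compute_backend_service.this.id # send all paths to the one backend
}

resource "google_compute_target_http_proxy" "this" {
  name    = "${var.environment}-${var.lb_name}-http-proxy"
  url_map = google_compute_url_map.this.id
}

resource "google_compute_global_forwarding_rule" "http" {
  name                  = "${var.environment}-${var.lb_name}-http"
  load_balancing_scheme = "EXTERNAL_MANAGED" # must match the backend service
  ip_protocol           = "TCP"
  port_range            = "80"
  ip_address            = google_compute_global_address.this.id
  target                = google_compute_target_http_proxy.this.id
}
```

Reserving the address as its own resource keeps the public IP stable if the forwarding rule is later recreated or an HTTPS rule is added alongside it.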
Startup Script
The startup script bootstraps the application when the VM starts.
The script installs dependencies, creates the application files, and starts the service.
A simplified version of the application behavior is:
GET / -> returns a simple message
GET /healthz -> returns ok
GET /metadata -> returns hostname and version information
The important operational improvement is that the application should run as a systemd service.
That is better than running a background process with &.
With systemd, I can check:
sudo systemctl status prod-lite-app
And view logs:
sudo journalctl -u prod-lite-app --no-pager -n 50
Remote State
Like the first artifact, this project uses a GCS backend for Terraform state.
Example backend:
terraform {
  backend "gcs" {
    bucket = "YOUR_TERRAFORM_STATE_BUCKET"
    prefix = "terraform-gcp-production-lite-web-platform/v1"
  }
}
The state path becomes something like:
gs://YOUR_TERRAFORM_STATE_BUCKET/terraform-gcp-production-lite-web-platform/v1/default.tfstate
Using remote state is important because this project is no longer a tiny one-file local experiment.
It has multiple modules and multiple cloud resources.
Remote state gives the project a more realistic workflow.
Git Safety
I do not commit real .tfvars files.
The repository includes:
terraform.tfvars.example
But ignores:
terraform.tfvars
The .gitignore includes:
.terraform/
*.tfstate
*.tfstate.*
*.tfvars
*.tfplan
crash.log
.DS_Store
This avoids committing local values, real project IDs, or state files.
Running the Project
The execution flow is:
1. Configure backend
2. Configure terraform.tfvars
3. Run terraform fmt
4. Run terraform init
5. Run terraform validate
6. Run terraform plan
7. Run terraform apply
8. Verify the load balancer and backend health
Configure Backend
cp backend.tf.example backend.tf
Then edit the bucket name.
Configure Variables
cp terraform.tfvars.example terraform.tfvars
Then edit:
project_id = "your-gcp-project-id"
admin_principal = "user:your-email@example.com"
Run Terraform
terraform fmt -recursive
terraform init
terraform validate
terraform plan
terraform apply
Expected Output
After deployment, Terraform should output values such as:
network_name
subnets
firewall_rules
cloud_nat_name
cloud_router_name
health_check_name
mig_name
mig_instance_group
load_balancer_ip
load_balancer_url
curl_test_command
curl_health_check_command
platform_summary
The most important output is:
load_balancer_url
That URL is used to test the application.
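A sketch of how a few of these root outputs might be defined, assuming the load-balancer module exposes the reserved address as an output named ip_address (a hypothetical name).

```hcl
# Sketch only: module.load_balancer.ip_address is an assumed output.
output "load_balancer_ip" {
  description = "Public IP address of the external HTTP load balancer."
  value       = module.load_balancer.ip_address
}

output "load_balancer_url" {
  description = "Base URL for testing the application."
  value       = "http://${module.load_balancer.ip_address}"
}

output "curl_health_check_command" {
  description = "Ready-to-run command for checking the health endpoint."
  value       = "curl -i http://${module.load_balancer.ip_address}/healthz"
}
```

After an apply, terraform output -raw load_balancer_url prints the bare URL, which is convenient for piping into curl.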
Verification
After deployment, I can test the root endpoint:
curl -i http://LOAD_BALANCER_IP
Expected response:
Hi from Terraform GCP Production-Lite Platform
Test the health endpoint:
curl -i http://LOAD_BALANCER_IP/healthz
Expected response:
ok
Test the metadata endpoint:
curl -i http://LOAD_BALANCER_IP/metadata
Expected response:
{
  "service": "terraform-gcp-production-lite-web-platform",
  "environment": "dev",
  "version": "1.0.0",
  "hostname": "dev-web-mig-xxxx"
}
Verifying the Infrastructure
Verify the VPC:
gcloud compute networks list
Verify the subnets:
gcloud compute networks subnets list
Verify firewall rules:
gcloud compute firewall-rules list
Verify Cloud NAT:
gcloud compute routers nats list \
--router=dev-nat-router \
--region=asia-southeast2
Verify the Managed Instance Group:
gcloud compute instance-groups managed list
Verify backend health:
gcloud compute backend-services get-health BACKEND_SERVICE_NAME --global
Verify the application VM does not have an external IP:
gcloud compute instances list
This is one of the most important checks.
The application should be reachable through the load balancer, not directly through a public VM IP.
Accessing the Private VM Through IAP
Since the VM has no external IP, direct SSH is not available.
Instead, I use IAP TCP forwarding.
Example:
gcloud compute ssh INSTANCE_NAME \
--zone=asia-southeast2-a \
--tunnel-through-iap
This requires:
- IAM permission for IAP tunnel access
- OS Login role
- firewall rule allowing IAP source range
- correct target tag on the VM
The firewall source range for IAP TCP forwarding is:
35.235.240.0/20
This is better than opening SSH to the public internet.
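The IAM side of IAP access could be granted with bindings like the following. This is a sketch that assumes var.admin_principal holds a member string such as user:your-email@example.com and that var.app_service_account_name (a hypothetical variable) holds the full resource name of the application service account. The service account user binding reflects the role listed in the troubleshooting notes.

```hcl
# Sketch only: variable names are assumptions; project-level scope is a simplification.
resource "google_project_iam_member" "iap_tunnel" {
  project = var.project_id
  role    = "roles/iap.tunnelResourceAccessor"
  member  = var.admin_principal
}

resource "google_project_iam_member" "os_login" {
  project = var.project_id
  role    = "roles/compute.osLogin"
  member  = var.admin_principal
}

resource "google_service_account_iam_member" "sa_user" {
  # Expects the form projects/PROJECT_ID/serviceAccounts/SA_EMAIL.
  service_account_id = var.app_service_account_name
  role               = "roles/iam.serviceAccountUser"
  member             = var.admin_principal
}
```

Granting these at project level is the simplest option for a dev environment; narrower scopes are possible but outside what this article covers.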
Troubleshooting Notes
Load Balancer Returns 502
Possible causes:
- application failed to start
- health check path is wrong
- firewall rule does not allow health check probes
- target tag mismatch
- named port mismatch
- backend service points to the wrong instance group
Useful commands:
gcloud compute backend-services get-health BACKEND_SERVICE_NAME --global
sudo systemctl status prod-lite-app
sudo journalctl -u prod-lite-app --no-pager -n 100
VM Cannot Install Packages
Possible cause:
Cloud NAT is not configured correctly.
Useful check:
curl https://example.com
Run that from inside the VM.
IAP SSH Fails
Possible causes:
- missing IAP role
- missing OS Login role
- missing service account user role
- missing IAP firewall rule
- wrong VM network tag
Check firewall rules:
gcloud compute firewall-rules list --filter="name~iap"
Important Design Decisions
1. Use private backend VMs
The backend application instances do not have external IP addresses.
This reduces direct exposure and forces traffic through the load balancer.
2. Use Cloud NAT
The private VMs still need outbound access.
Cloud NAT allows that without assigning public IPs to the VMs.
3. Use map-based subnet definitions
Subnets are defined through a map so the network module stays reusable.
Adding another subnet should not require another resource block.
4. Use map-based firewall rules
Firewall rules are also defined through a map.
This keeps ingress policy data-driven and easier to extend.
5. Use a Managed Instance Group
The application tier is managed as a group based on an instance template.
This is more production-shaped than a standalone VM.
6. Use HTTP only in v1.0
HTTPS is intentionally deferred to v1.1.
The v1.0 goal is to prove:
- private backends
- Cloud NAT
- MIG
- health checks
- HTTP load balancing
- firewall rules
- remote state
7. Reserve a DB subnet without provisioning a database
The DB subnet demonstrates tiered network design.
A database is not provisioned in v1.0 because this artifact focuses on the web platform infrastructure.
What I Learned
This artifact helped me understand that application infrastructure is not only about creating a VM.
A proper platform needs several parts to work together:
- networking
- identity
- firewall rules
- NAT
- compute lifecycle
- health checks
- load balancing
- state management
- documentation
The most interesting part for me was seeing how small configuration details affect the whole platform.
For example:
- if the firewall source range is wrong, the load balancer cannot reach the backend
- if the health check path is wrong, the backend becomes unhealthy
- if Cloud NAT is missing, private VMs may fail during startup
- if the MIG named port does not match the backend service, traffic may not route correctly
- if SSH is opened to 0.0.0.0/0, the design becomes weaker
This project made me appreciate that infrastructure is a system.
Each component has to be designed with the others in mind.
What I Intentionally Did Not Add
I intentionally did not add HTTPS in this version.
I also did not add:
- Cloud Armor
- Cloud SQL
- Secret Manager
- CI/CD
- custom domain
- blue-green deployment
- autoscaling policy
- multi-region deployment
Those are useful, but I want this artifact to stay focused.
The purpose of v1.0 is to build the core web platform first.
Next Step
The next version will be:
v1.1 — HTTPS and Custom Domain
That version should add:
- Google-managed SSL certificate
- custom domain
- HTTPS target proxy
- global forwarding rule on port 443
- optional HTTP-to-HTTPS redirect
After that, I want to continue with:
v1.2 — Security Hardening
v2.0 — Terraform CI/CD with GitHub Actions and Workload Identity Federation
v2.1 — Drift, Import, and State Recovery
v3.0 — Database and Secrets
References
Terraform GCS backend:
https://developer.hashicorp.com/terraform/language/backend/gcs
Google Cloud — Store Terraform state in Cloud Storage:
https://cloud.google.com/docs/terraform/resource-management/store-state
Google Cloud — External Application Load Balancer overview:
https://cloud.google.com/load-balancing/docs/https
Google Cloud — Terraform examples for external Application Load Balancers:
https://cloud.google.com/load-balancing/docs/https/ext-http-lb-tf-module-examples
Google Cloud — Cloud NAT overview:
https://cloud.google.com/nat/docs/overview
Google Cloud — IAP TCP forwarding:
https://cloud.google.com/iap/docs/using-tcp-forwarding