Build Production-Ready GCP Infrastructure from Scratch: A Complete Console Guide
A 4-Part Series for Complete Beginners
Table of Contents
- Part 1: Foundation - Project Setup, VPC & Networking
- Part 2: Security Services - Secrets, Bastion & IAM
- Part 3: Database & Compute Resources
- Part 4: Observability & Load Balancer ← You are here
Part 4: Observability & Load Balancer
Overview
In this final part, you'll complete your infrastructure with observability and external access. We'll create Prometheus for metrics, Loki for logs, Grafana for dashboards, and an Application Load Balancer for external traffic.
What you'll build:
- Prometheus VM for metrics collection (7-day retention)
- Loki VM for log aggregation with Grafana
- External Application Load Balancer with SSL
- End-to-end health checks and monitoring
Estimated time: 45-60 minutes
Estimated cost: ~$64/month
Final cumulative cost: ~$301/month
Prerequisites
Before continuing, ensure you've completed Parts 1-3:
- [ ] VPC and 5 subnets exist (including the private-obs subnet)
- [ ] Cloud SQL with private IP is running
- [ ] Backend MIG has 2+ healthy VMs
- [ ] Cache VM with Redis/PgBouncer is running
- [ ] Firewall rules allow health check IPs (35.191.0.0/16, 130.211.0.0/22)
If you missed Parts 1-3: Start with Part 1: Foundation →
Step 1: Create Static Internal IPs for Observability
What are Static Internal IPs?
Static IPs ensure observability VMs have predictable IP addresses. This makes configuration easier (no need to update configs if VMs are recreated).
Prometheus Static IP
Navigation Path
- Navigate to VPC networks → Internal IP addresses
- Click "Reserve static internal IP address"
IP Configuration
| Field | Value | Notes |
|---|---|---|
| Name | dev-prometheus-ip | Descriptive |
| Network | dev-network | Our VPC |
| Subnetwork | private-obs | Observability subnet |
| IP address | 10.0.5.10 | Manual assignment |
Click "Reserve".
Loki Static IP
Repeat the process:
| Field | Value |
|---|---|
| Name | dev-loki-ip |
| Network | dev-network |
| Subnetwork | private-obs |
| IP address | 10.0.5.11 |
Verify IPs
You should see 2 reserved IPs:
| Name | IP Address | Subnetwork |
|---|---|---|
| dev-prometheus-ip | 10.0.5.10 | private-obs |
| dev-loki-ip | 10.0.5.11 | private-obs |
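If you prefer the command line, the same reservations can be sketched with gcloud. This is a minimal sketch, assuming the region (europe-west1) and subnet (private-obs) used in this guide; the function name reserve_internal_ip is a hypothetical helper, not part of gcloud.

```shell
# Sketch: reserve a static internal IP in the observability subnet.
# REGION and SUBNET mirror the values used in this guide; adjust if yours differ.
REGION="europe-west1"
SUBNET="private-obs"

# reserve_internal_ip NAME IP (hypothetical helper)
reserve_internal_ip() {
  local name="$1" ip="$2"
  # Passing --subnet makes this an internal (not external) address reservation
  gcloud compute addresses create "$name" \
    --region "$REGION" \
    --subnet "$SUBNET" \
    --addresses "$ip"
}

# Usage (from Cloud Shell or any machine with gcloud configured):
#   reserve_internal_ip dev-prometheus-ip 10.0.5.10
#   reserve_internal_ip dev-loki-ip 10.0.5.11
#   gcloud compute addresses list --filter="name~^dev-"
```

The final list command should show both addresses with status RESERVED until a VM is attached to them.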
Step 2: Create Prometheus VM
What is Prometheus?
Prometheus is a metrics collection and storage system:
- Scrapes metrics from Node Exporter (every VM)
- Stores 7 days of metrics data
- Provides query API for Grafana
Why self-hosted: Full control, no vendor lock-in, predictable costs.
Navigation Path
- Navigate to Compute Engine → VM instances
- Click "Create instance"
VM Configuration
Basic Settings
| Field | Value | Notes |
|---|---|---|
| Name | dev-prometheus | Descriptive |
| Region | europe-west1 | Same as VPC |
| Zone | europe-west1-b | Zone b |
Machine Type
| Field | Value | Notes |
|---|---|---|
| Machine type | e2-medium | 2 vCPU, 4GB RAM |
Boot Disk
| Field | Value |
|---|---|
| OS | Ubuntu 22.04 LTS Minimal |
| Disk type | pd-balanced |
| Size | 50 GB |
Why 50GB: 7 days of metrics at 15s interval requires ~30-40GB. 50GB provides headroom.
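To sanity-check the disk size for your own workload, the Prometheus documentation gives a rough formula: disk space ≈ retention time in seconds × ingested samples per second × bytes per sample, with roughly 1-2 bytes per sample. The sketch below applies that formula with integer arithmetic. The function name estimate_tsdb_mb and the series counts in the usage lines are assumptions for illustration, not measurements from this setup.

```shell
# estimate_tsdb_mb SERIES SCRAPE_INTERVAL_S RETENTION_DAYS [BYTES_PER_SAMPLE]
# Prints an approximate TSDB size in MB (hypothetical helper).
estimate_tsdb_mb() {
  local series="$1" interval="$2" days="$3" bytes="${4:-2}"
  local retention_s=$(( days * 86400 ))
  # samples/second = series / interval; multiply before dividing to keep
  # integer arithmetic accurate
  echo $(( series * retention_s * bytes / interval / 1000000 ))
}

# Usage:
#   estimate_tsdb_mb 100000 15 7   # 100k active series -> 8064 (about 8 GB)
#   estimate_tsdb_mb 500000 15 7   # 500k active series -> 40320 (about 40 GB)
```

Under these assumptions, the 30-40GB figure corresponds to roughly 400,000-500,000 active series. A handful of Node Exporter targets usually produces far fewer, so the 50GB disk leaves generous headroom.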
Network Interface
| Field | Value | Notes |
|---|---|---|
| Network | dev-network | Our VPC |
| Subnetwork | private-obs | Observability subnet |
| Primary internal IP | Static → dev-prometheus-ip | Use static IP |
| External IPv4 address | None | No public IP needed |
Service Account
| Field | Value |
|---|---|
| Service account | observability-dev-sa |
Metadata - Startup Script
Key: startup-script
Value:
#!/bin/bash
# Prometheus Startup Script
set -e
echo "=== Prometheus Startup Script Begin $(date) ==="
# Install Docker
curl -fsSL https://get.docker.com -o get-docker.sh
sh get-docker.sh
# Install Node Exporter for self-monitoring
NODE_EXPORTER_VERSION="1.6.1"
echo "Installing Node Exporter ${NODE_EXPORTER_VERSION}..."
useradd --no-create-home --shell /bin/false node_exporter || true
wget -q "https://github.com/prometheus/node_exporter/releases/download/v${NODE_EXPORTER_VERSION}/node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz" -O /tmp/node_exporter.tar.gz
tar xzf /tmp/node_exporter.tar.gz -C /tmp
cp /tmp/node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64/node_exporter /usr/local/bin/
chown node_exporter:node_exporter /usr/local/bin/node_exporter
chmod +x /usr/local/bin/node_exporter
cat > /etc/systemd/system/node_exporter.service <<'EOF'
[Unit]
Description=Prometheus Node Exporter
After=network.target
[Service]
Type=simple
User=node_exporter
ExecStart=/usr/local/bin/node_exporter --web.listen-address=:9100
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable node_exporter
systemctl start node_exporter
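# Create Prometheus configuration
# Note: retention is not a prometheus.yml setting; it is passed below as a
# command-line flag (--storage.tsdb.retention.time).
cat > /opt/prometheus.yml <<'EOF'
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  # Node Exporter runs on the host, not inside the container, so target the VM's static IP
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['10.0.5.10:9100']
  # Backend VMs (update IPs after MIG creation)
  - job_name: 'backend'
    static_configs:
      - targets: ['10.0.2.2:9100', '10.0.2.3:9100']
        labels:
          tier: backend
  # Cache VM
  - job_name: 'cache'
    static_configs:
      - targets: ['10.0.4.2:9100']
        labels:
          tier: cache
EOF
# Startup scripts run on every boot; remove any previous container first
docker rm -f prometheus 2>/dev/null || true
# Run Prometheus with 7-day retention; the named volume keeps data across container recreation
docker run -d \
  --name prometheus \
  --restart unless-stopped \
  -p 9090:9090 \
  -v /opt/prometheus.yml:/etc/prometheus/prometheus.yml \
  -v prometheus-data:/prometheus \
  prom/prometheus:latest \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/prometheus \
  --storage.tsdb.retention.time=7d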
echo "=== Prometheus Startup Script Complete $(date) ==="
echo "Prometheus running on port 9090"
echo "Node Exporter running on port 9100"
Create the VM
Click "Create".
VM Creation Time: 2-3 minutes
Verify Prometheus
You should see:
- Name: dev-prometheus
- Status: Running
- Internal IP: 10.0.5.10
- External IP: None
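A VM status of Running only means the machine booted; the startup script may still be installing Docker. A minimal sketch for waiting until Prometheus actually answers on its health endpoint, meant to be run from the bastion. The function name wait_for_http and its defaults (30 attempts, 10 seconds apart) are assumptions for illustration.

```shell
# wait_for_http URL [ATTEMPTS] [DELAY_SECONDS] (hypothetical helper)
# Polls URL until it returns a successful HTTP status or attempts run out.
wait_for_http() {
  local url="$1" attempts="${2:-30}" delay="${3:-10}" i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors (4xx/5xx)
    if curl -fsS --max-time 5 "$url" >/dev/null 2>&1; then
      echo "ready: $url"
      return 0
    fi
    sleep "$delay"
    i=$(( i + 1 ))
  done
  echo "timed out: $url" >&2
  return 1
}

# Usage (from the bastion):
#   wait_for_http http://10.0.5.10:9090/-/healthy
```

If it times out, SSH to the VM and read the startup script output with sudo journalctl -u google-startup-scripts.service.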
Step 3: Create Loki VM (with Grafana)
What is Loki and Grafana?
- Loki: Log aggregation system (like Prometheus, but for logs)
- Grafana: Visualization dashboard for metrics and logs
Why combined: Cost optimization. Single VM runs both services (~$23/month).
Navigation Path
- Navigate to Compute Engine → VM instances
- Click "Create instance"
VM Configuration
Basic Settings
| Field | Value | Notes |
|---|---|---|
| Name | dev-loki | Descriptive |
| Region | europe-west1 | Same as VPC |
| Zone | europe-west1-b | Zone b |
Machine Type
| Field | Value | Notes |
|---|---|---|
| Machine type | e2-medium | 2 vCPU, 4GB RAM |
Boot Disk
| Field | Value |
|---|---|
| OS | Ubuntu 22.04 LTS Minimal |
| Disk type | pd-balanced |
| Size | 50 GB |
Network Interface
| Field | Value | Notes |
|---|---|---|
| Network | dev-network | Our VPC |
| Subnetwork | private-obs | Observability subnet |
| Primary internal IP | Static → dev-loki-ip | Use static IP |
| External IPv4 address | Ephemeral | Enable for Grafana access |
Why public IP: Grafana needs to be accessible from your browser. In production, use IAP instead of public IP.
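As a sketch of the IAP alternative mentioned above, Grafana's port can be forwarded to your workstation with an IAP TCP tunnel, so the VM needs no external IP. This assumes a firewall rule that allows the IAP range 35.235.240.0/20 to reach tcp:3000 on the VM and that your account has the IAP-secured Tunnel User role; the function name grafana_tunnel is hypothetical.

```shell
ZONE="europe-west1-b"

# grafana_tunnel [VM_NAME] [LOCAL_PORT] (hypothetical helper)
# Forwards localhost:LOCAL_PORT to port 3000 on the VM through IAP.
grafana_tunnel() {
  local vm="${1:-dev-loki}" local_port="${2:-3000}"
  gcloud compute start-iap-tunnel "$vm" 3000 \
    --local-host-port="localhost:${local_port}" \
    --zone "$ZONE"
}

# Usage (from your local machine, keeps running until interrupted):
#   grafana_tunnel
#   then open http://localhost:3000 in your browser
```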
Service Account
| Field | Value |
|---|---|
| Service account | observability-dev-sa |
Metadata - Startup Script
Key: startup-script
Value:
#!/bin/bash
# Loki and Grafana Startup Script
set -e
echo "=== Loki/Grafana Startup Script Begin $(date) ==="
# Install Docker
curl -fsSL https://get.docker.com -o get-docker.sh
sh get-docker.sh
# Install Docker Compose
apt-get update
apt-get install -y docker-compose
# Create docker-compose.yml
cat > /opt/docker-compose.yml <<'EOF'
version: '3.8'
services:
  loki:
    # Pinned to a 2.x release: the config below uses fields (shared_store,
    # enforce_metric_name) that Loki 3.x rejects
    image: grafana/loki:2.9.4
    ports:
      - "3100:3100"
    volumes:
      - /opt/loki-config.yml:/etc/loki/local-config.yaml
      - loki-data:/loki
    command: -config.file=/etc/loki/local-config.yaml
    restart: unless-stopped
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin123
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_SERVER_ROOT_URL=http://localhost:3000
    volumes:
      - grafana-storage:/var/lib/grafana
      - /opt/grafana-provisioning:/etc/grafana/provisioning
    restart: unless-stopped
volumes:
  loki-data:
  grafana-storage:
EOF
# Create Loki config
cat > /opt/loki-config.yml <<'EOF'
auth_enabled: false
server:
  http_listen_port: 3100
ingester:
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
    final_sleep: 0s
  chunk_idle_period: 1h
  max_chunk_age: 1h
  chunk_target_size: 1048576
  chunk_retain_period: 30s
  # Keep the write-ahead log on the mounted data volume
  wal:
    dir: /loki/wal
limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h
schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h
storage_config:
  boltdb_shipper:
    active_index_directory: /loki/index
    cache_location: /loki/cache
    shared_store: filesystem
  filesystem:
    directory: /loki/chunks
chunk_store_config:
  max_look_back_period: 168h
table_manager:
  retention_deletes_enabled: false
  retention_period: 0s
EOF
# Create Grafana provisioning
mkdir -p /opt/grafana-provisioning/datasources
cat > /opt/grafana-provisioning/datasources/loki.yml <<'EOF'
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    isDefault: false
    editable: false
EOF
cat > /opt/grafana-provisioning/datasources/prometheus.yml <<'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://10.0.5.10:9090
    isDefault: true
    editable: false
EOF
# Start services
cd /opt
docker-compose up -d
echo "=== Loki/Grafana Startup Script Complete $(date) ==="
echo "Loki running on port 3100"
echo "Grafana running on port 3000 (http://EXTERNAL_IP:3000)"
echo "Login: admin / admin123"
Create the VM
Click "Create".
VM Creation Time: 2-3 minutes
Verify Loki VM
You should see:
- Name: dev-loki
- Status: Running
- Internal IP: 10.0.5.11
- External IP: (Assigned IP)
Copy the external IP - we'll need it for Grafana access.
Step 4: Access Grafana Dashboard
Get Loki VM External IP
- Navigate to Compute Engine → VM instances
- Find the dev-loki VM
- Copy the External IP address
Access Grafana
Open your browser and navigate to:
http://[LOKI_EXTERNAL_IP]:3000
Grafana Login
| Field | Value |
|---|---|
| Email or username | admin |
| Password | admin123 |
Click "Log in".
Verify Datasources
- Click "Configuration" (gear icon) → Data sources
- Verify Prometheus shows "Healthy" (green checkmark)
- Verify Loki shows "Healthy"
Troubleshooting: If datasources are unhealthy:
- Check that the Loki and Grafana containers are running: docker ps on the Loki VM
- Verify network connectivity to Prometheus: curl http://10.0.5.10:9090/-/healthy
- Check container logs from /opt: docker-compose logs loki or docker-compose logs grafana
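The checks above can also be scripted against the services' own status endpoints. This is a minimal sketch, assuming it runs on the Loki VM: Grafana's /api/health reports whether Grafana and its database are up (it does not test the datasources), and Loki's /ready returns "ready" once it can serve requests. The function names are hypothetical.

```shell
# grafana_db_ok BASE_URL (hypothetical helper)
# Succeeds when Grafana's health endpoint reports "database": "ok".
grafana_db_ok() {
  local base_url="$1"
  curl -fsS --max-time 5 "${base_url}/api/health" \
    | grep -Eq '"database": *"ok"'
}

# loki_ready BASE_URL (hypothetical helper)
# Succeeds when Loki's readiness endpoint answers "ready".
loki_ready() {
  local base_url="$1"
  curl -fsS --max-time 5 "${base_url}/ready" | grep -q '^ready'
}

# Usage (on the Loki VM):
#   grafana_db_ok http://localhost:3000 && echo "Grafana is up"
#   loki_ready http://localhost:3100 && echo "Loki is ready"
```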
Step 5: Update Prometheus Targets
What are Prometheus Targets?
Prometheus "scrapes" metrics from targets. We need to add our backend and cache VMs.
Update Prometheus Configuration
- SSH to Prometheus VM via bastion:
# From your local machine
gcloud compute ssh dev-bastion --tunnel-through-iap
# From bastion, SSH to Prometheus
ssh 10.0.5.10
# Edit Prometheus config
sudo nano /opt/prometheus.yml
- Update the targets with actual backend VM IPs:
scrape_configs:
  - job_name: 'backend'
    static_configs:
      - targets: ['10.0.2.2:9100', '10.0.2.3:9100'] # Update with actual IPs
        labels:
          tier: backend
  - job_name: 'cache'
    static_configs:
      - targets: ['10.0.4.2:9100'] # Update with actual IP
        labels:
          tier: cache
- Restart Prometheus:
docker restart prometheus
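Finding the actual IPs and validating the edited file can be scripted. The sketch below assumes your MIG instances have names starting with dev-backend (adjust the prefix if yours differ) and uses promtool, which ships inside the prom/prometheus image, to check the config before restarting. All three function names are hypothetical helpers.

```shell
# list_internal_ips NAME_PREFIX (hypothetical helper)
# Prints one internal IP per line. Run where gcloud is authenticated.
list_internal_ips() {
  gcloud compute instances list \
    --filter="name~^$1" \
    --format="value(networkInterfaces[0].networkIP)"
}

# format_targets [PORT] (hypothetical helper)
# Reads IPs on stdin and prints a YAML targets list for prometheus.yml.
format_targets() {
  local port="${1:-9100}" out="" ip
  while read -r ip; do
    [ -n "$ip" ] || continue
    out="${out:+$out, }'${ip}:${port}'"
  done
  echo "[${out}]"
}

# restart_prometheus_if_valid (hypothetical helper)
# Run on the Prometheus VM: restart only if the mounted config parses.
restart_prometheus_if_valid() {
  docker exec prometheus promtool check config /etc/prometheus/prometheus.yml \
    && docker restart prometheus
}

# Usage:
#   list_internal_ips dev-backend | format_targets
#   restart_prometheus_if_valid
```

Validating first matters because a container restarted with a broken config will keep crash-looping under the unless-stopped restart policy.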
Verify Targets
- Navigate to http://[LOKI_EXTERNAL_IP]:3000
- Go to Explore → Select Prometheus datasource
- Query: up{job="backend"}
- You should see one series per backend VM with value 1 (a value of 0 means the scrape is failing)
Step 6: Create External Application Load Balancer
What is an ALB?
Application Load Balancer (ALB):
- Distributes traffic across backend VMs
- Provides single public IP for external access
- Handles SSL termination
- Health checks for backend availability
Step 6a: Create Health Check
Navigation Path
- Navigate to Compute Engine → Health checks
- Click "Create health check"
Health Check Configuration
| Field | Value | Notes |
|---|---|---|
| Name | dev-lb-health-check | Descriptive |
| Protocol | HTTP | HTTP health check |
| Port | 3000 | NestJS app port |
| Request path | /api/health | App must implement this |
| Check interval | 20 seconds | How often to check |
| Timeout | 5 seconds | Response timeout |
| Healthy threshold | 2 | 2 successes = healthy |
| Unhealthy threshold | 5 | 5 failures = unhealthy |
Click "Create".
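For reference, the same health check can be sketched with gcloud. The values mirror the table above; whether you pass --global or --region depends on the scope of the load balancer you create, so that is left as an argument. The function name create_lb_health_check is hypothetical.

```shell
# create_lb_health_check [EXTRA_GCLOUD_ARGS...] (hypothetical helper)
# Extra arguments (for example --global or --region europe-west1) are passed through.
create_lb_health_check() {
  gcloud compute health-checks create http dev-lb-health-check \
    --port 3000 \
    --request-path /api/health \
    --check-interval 20s \
    --timeout 5s \
    --healthy-threshold 2 \
    --unhealthy-threshold 5 \
    "$@"
}

# Usage:
#   create_lb_health_check --global
#   create_lb_health_check --region europe-west1
```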
Step 6b: Add Named Port to MIG
What is a Named Port?
Named ports link a name (like "http") to a port number (3000). The load balancer uses named ports for routing.
Navigation Path
- Navigate to Compute Engine → Instance groups
- Click on dev-backend-mig
- Click "Edit group"
Edit MIG
Scroll to "Named ports" section:
Click "Add item":
| Field | Value |
|---|---|
| Name | http |
| Port | 3000 |
Click "Save".
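The named port can also be set from the command line. A minimal sketch, assuming dev-backend-mig is a regional MIG in europe-west1 (use --zone instead for a zonal MIG). Note that set-named-ports replaces the whole list of named ports rather than appending to it. The function names are hypothetical.

```shell
# set_backend_named_port (hypothetical helper)
# Maps the name "http" to port 3000 on the backend MIG.
set_backend_named_port() {
  gcloud compute instance-groups managed set-named-ports dev-backend-mig \
    --named-ports http:3000 \
    --region europe-west1
}

# show_backend_named_ports (hypothetical helper)
# Lists the named ports currently set on the MIG.
show_backend_named_ports() {
  gcloud compute instance-groups managed get-named-ports dev-backend-mig \
    --region europe-west1
}

# Usage:
#   set_backend_named_port
#   show_backend_named_ports
```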
Step 6c: Create Load Balancer
Navigation Path
- Navigate to Network Services → Load balancing
- Click "Create load balancer"
Select LB Type
Click "Application Load Balancer (HTTP/S)" → Click "Configure".
Basic Configuration
| Field | Value | Notes |
|---|---|---|
| Name | dev-lb | Environment-prefixed |
| Region | europe-west1 | Same as VPC |
| Network | dev-network | Our VPC |
Backend Configuration
Backend type: Instance group
Backend configuration:
| Field | Value | Notes |
|---|---|---|
| Region | europe-west1 | Same region |
| Backend | dev-backend-mig | Our MIG |
| Balancing mode | Rate | Rate-based balancing |
| Maximum RPS | 100 | Per instance |
Health check:
| Field | Value |
|---|---|
| Health check | dev-lb-health-check |
Session affinity:
| Field | Value | Notes |
|---|---|---|
| Session affinity | None | Stateless app |
Advanced settings:
| Field | Value | Notes |
|---|---|---|
| Timeout | 30 seconds | Connection timeout |
| Enable Cloud CDN | Off | Not needed for now |
Click "Done".
Frontend Configuration
Protocol: HTTPS (we'll add HTTP redirect)
HTTPS Frontend:
Click "Add frontend IP and port":
| Field | Value | Notes |
|---|---|---|
| Protocol | HTTPS | Secure traffic |
| IP address | Reserve new static IP | Create new IP |
| IP address name | dev-lb-ip | Descriptive |
| Port | 443 | HTTPS port |
Certificate:
| Field | Value | Notes |
|---|---|---|
| Certificate | Create a new certificate | For SSL |
Create Certificate:
| Field | Value | Notes |
|---|---|---|
| Name | dev-lb-cert | Descriptive |
| Type | Google-managed certificate | Auto-renew |
Domains:
| Field | Value |
|---|---|
| Domains | (Skip for now) |
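Note: A Google-managed certificate requires at least one domain that resolves to the load balancer IP, so it cannot be created with the Domains field empty. For testing without a domain, create only the HTTP frontend and leave the HTTPS redirect disabled. For production, add your domain and create the Google-managed certificate.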
Click "Create".
HTTP Frontend (Redirect):
Click "Add frontend IP and port":
| Field | Value |
|---|---|
| Protocol | HTTP |
| IP address | dev-lb-ip |
| Port | 80 |
| Enable redirect to HTTPS | ✓ Enable |
Routing Rules
Host rules: Leave default (all hosts)
Path matcher: Leave default (all paths)
Backend service: Select the backend created above
Click "Done".
Create the Load Balancer
Review configuration and click "Create load balancer".
Creation Time: 5-10 minutes
Verify Load Balancer
You should see:
- Name: dev-lb
- Status: (Checkmark) - Active
- IP address: (Reserved IP)
- Backends: dev-backend-mig (healthy)
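Backend health can also be read from the command line with gcloud compute backend-services get-health. In this sketch the backend service name is an argument, because the console names it during load balancer creation. The helper names are hypothetical, and the count relies on the default YAML output containing one healthState line per instance.

```shell
# backend_health SERVICE [EXTRA_GCLOUD_ARGS...] (hypothetical helper)
# Pass --global or --region europe-west1 to match your load balancer's scope.
backend_health() {
  local service="$1"; shift
  gcloud compute backend-services get-health "$service" "$@"
}

# count_healthy SERVICE [EXTRA_GCLOUD_ARGS...] (hypothetical helper)
# Prints the number of instances reported as HEALTHY.
count_healthy() {
  backend_health "$@" | grep -c 'healthState: HEALTHY'
}

# Usage:
#   backend_health [BACKEND_SERVICE_NAME] --region europe-west1
#   count_healthy [BACKEND_SERVICE_NAME] --region europe-west1
```

It can take a few minutes after creation before instances move from UNHEALTHY to HEALTHY.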
Step 7: Test End-to-End Connectivity
Test 1: Load Balancer Health Check
# Get LB IP
gcloud compute forwarding-rules list --filter="name=dev-lb*"
# Test health endpoint
curl http://[LB_IP]/api/health
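Expected: HTTP 200 with response like {"status":"ok"}. If the HTTP-to-HTTPS redirect is enabled, the request returns a 301 redirect instead; in that case test over HTTPS using your domain.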
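To make the result explicit, a small sketch that prints only the status code and interprets it. The function names http_status and check_health are hypothetical helpers.

```shell
# http_status URL (hypothetical helper)
# Prints only the HTTP status code ("000" if the connection failed).
http_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$@"
}

# check_health URL (hypothetical helper)
# Reports OK, REDIRECT or FAIL based on the status code.
check_health() {
  local url="$1" code
  code="$(http_status "$url")"
  case "$code" in
    200) echo "OK $url" ;;
    301|302|308) echo "REDIRECT ($code) $url: the HTTP frontend redirects to HTTPS" ;;
    *) echo "FAIL ($code) $url"; return 1 ;;
  esac
}

# Usage:
#   check_health http://[LB_IP]/api/health
```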
Test 2: Full Request Flow
# Test through load balancer
curl http://[LB_IP]/api/test
Expected: Response from backend application
Test 3: Prometheus Metrics
- Open Grafana: http://[LOKI_EXTERNAL_IP]:3000
- Go to Explore → Prometheus
- Query: rate(http_requests_total[5m])
- You should see request metrics
Test 4: Loki Logs
- Open Grafana: http://[LOKI_EXTERNAL_IP]:3000
- Go to Explore → Loki
- Query: {job="nestjs"}
- You should see application logs
Part 4 Verification Checklist
Final Verification
- [ ] Prometheus VM running at 10.0.5.10:9090
- [ ] Loki VM running with external IP assigned
- [ ] Grafana accessible at http://[LOKI_EXTERNAL_IP]:3000
- [ ] Prometheus datasource shows "Healthy"
- [ ] Loki datasource shows "Healthy"
- [ ] Prometheus scraping backend and cache VMs
- [ ] Load balancer has reserved static IP
- [ ] Backend service shows healthy instances
- [ ] curl http://[LB_IP]/api/health returns 200
- [ ] Full request flow works (LB → MIG → App)
Cost Summary - Part 4
| Component | Monthly Cost | Notes |
|---|---|---|
| Prometheus VM | ~$23 | e2-medium, 50GB disk |
| Loki VM | ~$23 | e2-medium, 50GB disk |
| Load Balancer | ~$18 | ALB forwarding rules |
| Total Part 4 | ~$64 | |
Final Infrastructure Cost
| Component | Monthly Cost |
|---|---|
| Part 1 (VPC, NAT, Firewall) | ~$42 |
| Part 2 (Bastion) | ~$11 |
| Part 3 (Cloud SQL, MIG, Cache) | ~$184 |
| Part 4 (Observability, LB) | ~$64 |
| Total | ~$301/month |
Cost Optimization Tips:
- Use Spot (preemptible) VMs for non-critical workloads (save up to ~80%, depending on machine type and region)
- Reduce Cloud SQL tier to db-g1-small for dev (save ~$70/month)
- Scale MIG to 0 during off-hours (save ~$46/month)
- Disable Flow Logs on non-critical subnets (save ~$5-10/month)
Comprehensive Troubleshooting
Issue: Grafana Cannot Connect to Datasources
Symptom: Datasource shows "Could not connect"
Diagnosis:
- Check Prometheus is running
- Verify network connectivity
- Check firewall rules
Solution:
# From Loki VM
docker ps # Check containers running
# Test Prometheus
curl http://10.0.5.10:9090/-/healthy
# Test Loki
curl http://localhost:3100/ready
Issue: Load Balancer Shows 0/0 Healthy
Symptom: No healthy backend instances
Diagnosis:
- Check health check configuration
- Verify app is running on port 3000
- Check firewall for health check IPs
Solution:
# From backend VM
curl localhost:3000/api/health
ss -tlnp | grep 3000
Common fixes:
- Confirm the health check unhealthy threshold is 5 (the value set in Step 6a)
- Increase the initial delay for MIG autohealing to 300 seconds
- Verify firewall rule allows 35.191.0.0/16 and 130.211.0.0/22
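To check the firewall side quickly, the sketch below lists the rules on the VPC with their source ranges and allowed ports, so you can confirm that a rule covers both health check ranges on tcp:3000. It assumes the network is named dev-network as in this guide; the function name is hypothetical.

```shell
# list_network_firewall_rules [NETWORK] (hypothetical helper)
# Shows name, direction, source ranges and allowed protocols/ports per rule.
list_network_firewall_rules() {
  local network="${1:-dev-network}"
  gcloud compute firewall-rules list \
    --filter="network~${network}" \
    --format="table(name,direction,sourceRanges.list():label=SRC_RANGES,allowed[].map().firewall_rule().list():label=ALLOW)"
}

# Usage:
#   list_network_firewall_rules
#   list_network_firewall_rules | grep -E '35\.191\.0\.0/16|130\.211\.0\.0/22'
```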
Issue: High CPU on Backend VMs
Symptom: VMs constantly at 90%+ CPU
Diagnosis:
- Check application logs
- Verify autoscaling thresholds
- Profile application performance
Solution:
# Check CPU usage
top -bn1 | head -20
# Check Node.js process
pm2 monit
# Check autoscaler status
gcloud compute instance-groups managed describe dev-backend-mig \
--region europe-west1
Fixes:
- Increase machine type (e2-medium → e2-highcpu-4)
- Optimize application code
- Increase max replicas to 6 or 8
Issue: Secrets Not Accessible from VMs
Symptom: "Permission denied" accessing secrets
Diagnosis:
- Verify service account has Secret Accessor role
- Check secret exists and has versions
Solution:
# From VM with correct SA
gcloud secrets versions list db-credentials-dev
# Add IAM role if missing
gcloud secrets add-iam-policy-binding db-credentials-dev \
--member='serviceAccount:backend-dev-sa@PROJECT_ID.iam.gserviceaccount.com' \
--role='roles/secretmanager.secretAccessor'
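After adding the binding, you can confirm access from the VM without printing the secret itself. A minimal sketch, assuming the secret name db-credentials-dev from this guide; the function name can_read_secret is hypothetical.

```shell
# can_read_secret SECRET_NAME (hypothetical helper)
# Tries to read the latest version and discards the value so it never
# appears in the terminal or in logs.
can_read_secret() {
  local secret="$1"
  if gcloud secrets versions access latest --secret="$secret" >/dev/null 2>&1; then
    echo "readable: $secret"
  else
    echo "NOT readable: $secret" >&2
    return 1
  fi
}

# Usage (on the backend VM):
#   can_read_secret db-credentials-dev
```

IAM changes can take a short while to propagate, so retry after a minute before concluding that the binding is wrong.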
Next Steps Beyond This Guide
1. Domain & SSL
- Purchase domain from registrar
- Configure DNS to point to LB IP
- Update Load Balancer certificate with your domain
- Enable Google-managed certificate
2. CI/CD Pipeline
- Setup GitHub Actions
- Auto-deploy on push
- Run tests in container
- Automated rollback on failure
3. Monitoring Enhancements
- Add alert rules in Prometheus
- Configure PagerDuty/Slack integration
- Create custom Grafana dashboards
- Set up uptime monitoring
4. Security Hardening
- Enable VPC Service Controls
- Configure Organization Policies
- Setup Security Command Center
- Implement workload identity
5. Multi-Environment
- Create staging environment
- Use Shared VPC
- Implement environment isolation
- Setup service directory
End-to-End Verification Test
Test 1: Full Request Flow
# 1. Get Load Balancer IP
gcloud compute forwarding-rules list --filter="name=dev-lb*"
# 2. Send test request
curl http://[LB_IP]/api/health
# Expected: {"status":"ok","timestamp":"..."}
Test 2: Database Connectivity
# From backend VM (via bastion)
gcloud compute ssh dev-bastion --tunnel-through-iap
ssh 10.0.2.2 # Backend VM IP
# Test Cloud SQL connection
psql -h 10.100.0.2 -U backend-dev-sa -d appdb
Test 3: Cache Connectivity
# From backend VM
redis-cli -h 10.0.4.2 PING
# Expected: PONG
# Test PgBouncer
psql -h 10.0.4.2 -p 6432 -U app_admin -d appdb
Test 4: Monitoring Pipeline
- Generate traffic: ab -n 1000 http://[LB_IP]/api/health
- Open Grafana
- Check dashboards for metrics
- Verify Loki has logs
Test 5: Disaster Recovery
# Simulate instance failure
gcloud compute instances delete [ONE_BACKEND_INSTANCE] --zone [INSTANCE_ZONE] --quiet
# Verify:
# - MIG auto-heals (new instance appears)
# - Load balancer continues serving
# - No data loss
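To observe the recovery rather than refreshing the console, the sketch below blocks until the MIG reports a stable state and then lists its instances. It assumes dev-backend-mig is a regional MIG in europe-west1; the function names and the 600-second default timeout are assumptions for illustration.

```shell
# wait_for_mig_stable [TIMEOUT_SECONDS] (hypothetical helper)
# Returns once the MIG has no pending create, delete or recreate actions.
wait_for_mig_stable() {
  gcloud compute instance-groups managed wait-until dev-backend-mig \
    --stable \
    --region europe-west1 \
    --timeout "${1:-600}"
}

# list_mig_instances (hypothetical helper)
# Shows each instance with its status and current action.
list_mig_instances() {
  gcloud compute instance-groups managed list-instances dev-backend-mig \
    --region europe-west1
}

# Usage (after deleting an instance):
#   wait_for_mig_stable && list_mig_instances
```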
Congratulations!
You've built a complete, production-ready GCP infrastructure:
✅ Network: VPC with 5 subnets, Cloud NAT, firewall rules
✅ Security: Secret Manager, bastion with IAP, OS Login
✅ Data: Cloud SQL PostgreSQL with regional HA
✅ Compute: Managed Instance Group with autoscaling
✅ Cache: Redis + PgBouncer
✅ Observability: Prometheus, Loki, Grafana
✅ Access: External Application Load Balancer
Your infrastructure is ready for production deployment!
Series Complete! You now have a fully functional, production-ready GCP infrastructure. From here, you can deploy your NestJS application and scale as needed.










