Shaikh Al Amin

Posted on Feb 4

Build Production-Ready GCP Infrastructure from Scratch Part 04

#beginners #googlecloud #monitoring #tutorial

Build Production-Ready GCP Infrastructure from Scratch: A Complete Console Guide

A 4-Part Series for Complete Beginners

Part 4: Observability & Load Balancer

Overview

In this final part, you'll complete your infrastructure with observability and external access: Prometheus for metrics, Loki for logs, Grafana for dashboards, and an external Application Load Balancer for inbound traffic.

What you'll build:

Prometheus VM for metrics collection (7-day retention)
Loki VM for log aggregation with Grafana
External Application Load Balancer with SSL
End-to-end health checks and monitoring

Estimated time: 45-60 minutes

Estimated cost: ~$64/month

Final cumulative cost: ~$301/month

Prerequisites

Before continuing, ensure you've completed Parts 1-3:

[ ] VPC and 5 subnets exist (including private-obs subnet)
[ ] Cloud SQL with private IP is running
[ ] Backend MIG has 2+ healthy VMs
[ ] Cache VM with Redis/PgBouncer is running
[ ] Firewall rules allow health check IPs (35.191.0.0/16, 130.211.0.0/22)

If you missed Parts 1-3: Start with Part 1: Foundation →

Step 1: Create Static Internal IPs for Observability

What are Static Internal IPs?

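A reserved static internal IP keeps an observability VM's address the same even if the VM is recreated. Anything that references that address (for example, the Grafana datasource URL http://10.0.5.10:9090 used later in this part) keeps working without config changes.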

Prometheus Static IP

Navigation Path

Navigate to VPC networks → Internal IP addresses
Click "Reserve static internal IP address"

IP Configuration

Field	Value	Notes
Name	`dev-prometheus-ip`	Descriptive
Network	`dev-network`	Our VPC
Subnetwork	`private-obs`	Observability subnet
IP address	`10.0.5.10`	Manual assignment

Click "Reserve".

Loki Static IP

Repeat the process:

Field	Value
Name	`dev-loki-ip`
Network	`dev-network`
Subnetwork	`private-obs`
IP address	`10.0.5.11`

Verify IPs

You should see 2 reserved IPs:

Name	IP Address	Subnetwork
dev-prometheus-ip	10.0.5.10	private-obs
dev-loki-ip	10.0.5.11	private-obs
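Before reserving an address manually, it can help to confirm that it falls inside the subnet's range. The sketch below is a small POSIX shell check; it assumes the private-obs subnet is 10.0.5.0/24, which is inferred from the addresses above rather than stated in this part, and the function names are illustrative.

```shell
#!/bin/sh
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  old_ifs=$IFS
  IFS=.
  set -- $1            # split the address into its four octets
  IFS=$old_ifs
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Return 0 if address $1 lies inside CIDR block $2 (e.g. 10.0.5.0/24).
in_cidr() {
  addr=$(ip_to_int "$1")
  base=$(ip_to_int "${2%/*}")
  bits=${2#*/}
  mask=$(( (0xFFFFFFFF << (32 - $bits)) & 0xFFFFFFFF ))
  [ $(( $addr & $mask )) -eq $(( $base & $mask )) ]
}

# Assumed subnet range for private-obs (not confirmed in this part).
in_cidr 10.0.5.10 10.0.5.0/24 && echo "10.0.5.10 is inside 10.0.5.0/24"
in_cidr 10.0.5.11 10.0.5.0/24 && echo "10.0.5.11 is inside 10.0.5.0/24"
```

Note that GCP reserves the first two and last two addresses of a subnet's primary range, so in a /24 the addresses ending in .0, .1, .254 and .255 cannot be assigned; .10 and .11 are safely clear of those.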

Step 2: Create Prometheus VM

What is Prometheus?

Prometheus is a metrics collection and storage system:

Scrapes metrics from Node Exporter (every VM)
Stores 7 days of metrics data
Provides query API for Grafana

Why self-hosted: Full control, no vendor lock-in, predictable costs.

Navigation Path

Navigate to Compute Engine → VM instances
Click "Create instance"

VM Configuration

Basic Settings

Field	Value	Notes
Name	`dev-prometheus`	Descriptive
Region	`europe-west1`	Same as VPC
Zone	`europe-west1-b`	Zone b

Machine Type

Field	Value	Notes
Machine type	`e2-medium`	2 vCPU, 4GB RAM

Boot Disk

Field	Value
OS	`Ubuntu 22.04 LTS Minimal`
Disk type	`pd-balanced`
Size	`50 GB`

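Why 50GB: We budget ~30-40GB for 7 days of metrics at a 15s scrape interval. Actual usage scales with the number of active time series, so treat this as an upper bound; 50GB provides headroom.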
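To sanity-check the disk size for your own workload, the Prometheus storage documentation gives a rule of thumb: needed disk space = retention time in seconds x ingested samples per second x bytes per sample, with roughly 1-2 bytes per sample. The sketch below applies that formula; the series count is a hypothetical figure, not a measurement from this setup.

```shell
#!/bin/sh
# Estimate Prometheus TSDB disk usage in MiB:
#   disk = retention_seconds * (series / scrape_interval) * bytes_per_sample
# Arguments: active series, scrape interval (s), retention (days), bytes per sample.
estimate_tsdb_mib() {
  series=$1; interval=$2; days=$3; bytes_per_sample=$4
  retention_seconds=$(( $days * 86400 ))
  total_bytes=$(( $retention_seconds * $series * $bytes_per_sample / $interval ))
  echo $(( $total_bytes / 1048576 ))
}

# Hypothetical load: 5 targets x ~1,000 series each, 15s interval, 7 days, 2 bytes/sample.
estimate_tsdb_mib 5000 15 7 2   # prints 384
```

Under these assumptions the estimate is well below 50GB. The result grows linearly with the number of series, so rerun it with your real series count (visible on the Prometheus status page) before resizing the disk.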

Network Interface

Field	Value	Notes
Network	`dev-network`	Our VPC
Subnetwork	`private-obs`	Observability subnet
Primary internal IP	`Static` → `dev-prometheus-ip`	Use static IP
External IPv4 address	None	No public IP needed

Service Account

Field	Value
Service account	`observability-dev-sa`

Metadata - Startup Script

Key: startup-script

Value:

#!/bin/bash
# Prometheus Startup Script

set -e

echo "=== Prometheus Startup Script Begin $(date) ==="

# Install Docker
curl -fsSL https://get.docker.com -o get-docker.sh
sh get-docker.sh

# Install Node Exporter for self-monitoring
NODE_EXPORTER_VERSION="1.6.1"

echo "Installing Node Exporter ${NODE_EXPORTER_VERSION}..."

useradd --no-create-home --shell /bin/false node_exporter || true

wget -q "https://github.com/prometheus/node_exporter/releases/download/v${NODE_EXPORTER_VERSION}/node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz" -O /tmp/node_exporter.tar.gz
tar xzf /tmp/node_exporter.tar.gz -C /tmp
cp /tmp/node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64/node_exporter /usr/local/bin/
chown node_exporter:node_exporter /usr/local/bin/node_exporter
chmod +x /usr/local/bin/node_exporter

cat > /etc/systemd/system/node_exporter.service <<'EOF'
[Unit]
Description=Prometheus Node Exporter
After=network.target

[Service]
Type=simple
User=node_exporter
ExecStart=/usr/local/bin/node_exporter --web.listen-address=:9100
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable node_exporter
systemctl start node_exporter

# Create Prometheus configuration
cat > /opt/prometheus.yml <<'EOF'
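global:
  scrape_interval: 15s
  # Retention is not a valid prometheus.yml key here; it is set with the
  # --storage.tsdb.retention.time flag on the docker run command below.

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      # Node Exporter runs on the host, not inside the Prometheus container,
      # so scrape it through the VM's static IP instead of localhost.
      - targets: ['10.0.5.10:9100']

  # Backend VMs (update IPs after MIG creation)
  - job_name: 'backend'
    static_configs:
      - targets: ['10.0.2.2:9100', '10.0.2.3:9100']
        labels:
          tier: backend

  # Cache VM
  - job_name: 'cache'
    static_configs:
      - targets: ['10.0.4.2:9100']
        labels:
          tier: cache
EOF

# Run Prometheus with 7-day retention. Passing any flag replaces the image's
# default arguments, so the config file and storage path are restated here.
docker run -d \
  --name prometheus \
  --restart unless-stopped \
  -p 9090:9090 \
  -v /opt/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus:latest \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/prometheus \
  --storage.tsdb.retention.time=7d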

echo "=== Prometheus Startup Script Complete $(date) ==="
echo "Prometheus running on port 9090"
echo "Node Exporter running on port 9100"

Create the VM

Click "Create".

VM Creation Time: 2-3 minutes

Verify Prometheus

You should see:

Name: dev-prometheus
Status: Running
Internal IP: 10.0.5.10
External IP: None
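A "Running" status only means the VM booted; the startup script still needs a minute or two to install Docker and start the container. One way to wait for that is a small retry helper, sketched below in POSIX shell. The helper name and its argument order are illustrative choices, not part of any tool.

```shell
#!/bin/sh
# retry ATTEMPTS DELAY_SECONDS COMMAND...
# Runs COMMAND until it succeeds; returns 0 on success, 1 if every attempt fails.
retry() {
  attempts=$1; delay=$2; shift 2
  n=1
  while :; do
    if "$@"; then return 0; fi
    if [ "$n" -ge "$attempts" ]; then return 1; fi
    n=$(( $n + 1 ))
    sleep "$delay"
  done
}
```

From the bastion (or any VM that can reach the private-obs subnet), you could then run retry 30 10 curl -fsS http://10.0.5.10:9090/-/healthy to poll the Prometheus health endpoint for up to about five minutes.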

Step 3: Create Loki VM (with Grafana)

What is Loki and Grafana?

Loki: Log aggregation system (like Prometheus, but for logs)
Grafana: Visualization dashboard for metrics and logs

Why combined: Cost optimization. Single VM runs both services (~$23/month).

Navigation Path

Navigate to Compute Engine → VM instances
Click "Create instance"

VM Configuration

Basic Settings

Field	Value	Notes
Name	`dev-loki`	Descriptive
Region	`europe-west1`	Same as VPC
Zone	`europe-west1-b`	Zone b

Machine Type

Field	Value	Notes
Machine type	`e2-medium`	2 vCPU, 4GB RAM

Boot Disk

Field	Value
OS	`Ubuntu 22.04 LTS Minimal`
Disk type	`pd-balanced`
Size	`50 GB`

Network Interface

Field	Value	Notes
Network	`dev-network`	Our VPC
Subnetwork	`private-obs`	Observability subnet
Primary internal IP	`Static` → `dev-loki-ip`	Use static IP
External IPv4 address	`Ephemeral`	Enable for Grafana access

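Why public IP: Grafana needs to be reachable from your browser. This also requires a firewall rule allowing TCP port 3000 to this VM, ideally restricted to your own IP address, and you should change the default admin password set in the startup script before exposing it. In production, use IAP instead of a public IP.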

Service Account

Field	Value
Service account	`observability-dev-sa`

Metadata - Startup Script

Key: startup-script

Value:

#!/bin/bash
# Loki and Grafana Startup Script

set -e

echo "=== Loki/Grafana Startup Script Begin $(date) ==="

# Install Docker
curl -fsSL https://get.docker.com -o get-docker.sh
sh get-docker.sh

# Install Docker Compose
apt-get update
apt-get install -y docker-compose

# Create docker-compose.yml
cat > /opt/docker-compose.yml <<'EOF'
version: '3.8'

services:
  loki:
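    # The config below uses options removed in Loki 3.0 (for example
    # shared_store and enforce_metric_name), so pin a 2.9.x image instead of latest.
    image: grafana/loki:2.9.4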
    ports:
      - "3100:3100"
    volumes:
      - /opt/loki-config.yml:/etc/loki/local-config.yaml
      - loki-data:/loki
    command: -config.file=/etc/loki/local-config.yaml
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin123
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_SERVER_ROOT_URL=http://localhost:3000
    volumes:
      - grafana-storage:/var/lib/grafana
      - /opt/grafana-provisioning:/etc/grafana/provisioning
    restart: unless-stopped

volumes:
  loki-data:
  grafana-storage:
EOF

# Create Loki config
cat > /opt/loki-config.yml <<'EOF'
auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
    final_sleep: 0s
  chunk_idle_period: 1h
  max_chunk_age: 1h
  chunk_target_size: 1048576
  chunk_retain_period: 30s

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

storage_config:
  boltdb_shipper:
    active_index_directory: /loki/index
    cache_location: /loki/cache
    shared_store: filesystem
  filesystem:
    directory: /loki/chunks

chunk_store_config:
  max_look_back_period: 168h

table_manager:
  retention_deletes_enabled: false
  retention_period: 0s
EOF

# Create Grafana provisioning
mkdir -p /opt/grafana-provisioning/datasources

cat > /opt/grafana-provisioning/datasources/loki.yml <<'EOF'
apiVersion: 1

datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    isDefault: false
    editable: false
EOF

cat > /opt/grafana-provisioning/datasources/prometheus.yml <<'EOF'
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://10.0.5.10:9090
    isDefault: true
    editable: false
EOF

# Start services
cd /opt
docker-compose up -d

echo "=== Loki/Grafana Startup Script Complete $(date) ==="
echo "Loki running on port 3100"
echo "Grafana running on port 3000 (http://EXTERNAL_IP:3000)"
echo "Login: admin / admin123"

Create the VM

Click "Create".

VM Creation Time: 2-3 minutes

Verify Loki VM

You should see:

Name: dev-loki
Status: Running
Internal IP: 10.0.5.11
External IP: (Assigned IP)

Copy the external IP - we'll need it for Grafana access.

Step 4: Access Grafana Dashboard

Get Loki VM External IP

Navigate to Compute Engine → VM instances
Find dev-loki VM
Copy the External IP address

Access Grafana

Open your browser and navigate to:

http://[LOKI_EXTERNAL_IP]:3000

Grafana Login

Field	Value
Email or username	`admin`
Password	`admin123`

Click "Log in".

Verify Datasources

Click "Configuration" (gear icon) → Data sources (in recent Grafana versions: Connections → Data sources)
Verify Prometheus shows "Healthy" (green checkmark)
Verify Loki shows "Healthy"

Troubleshooting: If datasources are unhealthy:

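Check the containers are running: docker ps on the Loki VM (Loki and Grafana) and on the Prometheus VM (Prometheus)

Verify network connectivity from the Loki VM: curl http://10.0.5.10:9090/-/healthy

Check container logs: cd /opt && docker-compose logs loki (or docker-compose logs grafana); Compose prefixes container names, so plain docker logs loki will not find them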

Step 5: Update Prometheus Targets

What are Prometheus Targets?

Prometheus "scrapes" metrics from targets. We need to add our backend and cache VMs.

Update Prometheus Configuration

SSH to Prometheus VM via bastion:

# From your local machine
gcloud compute ssh dev-bastion --tunnel-through-iap

# From bastion, SSH to Prometheus
ssh 10.0.5.10

# Edit Prometheus config
sudo nano /opt/prometheus.yml

Update the targets with actual backend VM IPs:

scrape_configs:
  - job_name: 'backend'
    static_configs:
      - targets: ['10.0.2.2:9100', '10.0.2.3:9100']  # Update with actual IPs
        labels:
          tier: backend

  - job_name: 'cache'
    static_configs:
      - targets: ['10.0.4.2:9100']  # Update with actual IP
        labels:
          tier: cache

Restart Prometheus:

docker restart prometheus
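MIG instances get new IPs when they are recreated, so hand-editing the targets line is easy to get wrong. The sketch below is a small helper that renders the line from a list of IPs; the function name is illustrative, and the indentation matches the config shown above.

```shell
#!/bin/sh
# Print a Prometheus static_configs targets line for the given port and IPs.
# Usage: targets_line PORT IP [IP...]
targets_line() {
  port=$1; shift
  list=""
  for ip in "$@"; do
    # Append each quoted ip:port pair, comma-separated after the first one.
    list="${list:+$list, }'${ip}:${port}'"
  done
  printf "      - targets: [%s]\n" "$list"
}

targets_line 9100 10.0.2.2 10.0.2.3
```

The current backend IPs can be listed with gcloud compute instances list and a format such as value(networkInterfaces[0].networkIP), then passed to the helper as arguments.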

Verify Targets

Navigate to http://[LOKI_EXTERNAL_IP]:3000
Go to Explore → Select Prometheus datasource
Query: up{job="backend"}
You should see one series per backend VM; a value of 1 means the last scrape succeeded, 0 means it failed
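The same check can be done without Grafana by calling the Prometheus HTTP API (/api/v1/query) and counting how many series report 1. The sketch below does this with grep on the documented response shape; the sample response is illustrative, not captured from a real server, and a JSON tool such as jq would be more robust if it is installed.

```shell
#!/bin/sh
# Count series whose sample value is "1" in a Prometheus instant-query response.
# Reads the JSON on stdin and relies on the "value":[<timestamp>,"<sample>"] shape.
count_up() {
  grep -o '"value":\[[0-9.]*,"1"]' | wc -l | tr -d ' '
}

# Illustrative response for the query up{job="backend"}: one target up, one down.
sample='{"status":"success","data":{"resultType":"vector","result":[{"metric":{"__name__":"up","instance":"10.0.2.2:9100","job":"backend"},"value":[1700000000.123,"1"]},{"metric":{"__name__":"up","instance":"10.0.2.3:9100","job":"backend"},"value":[1700000000.123,"0"]}]}}'

printf '%s\n' "$sample" | count_up   # prints 1
```

On the Prometheus VM, the live equivalent would pipe curl -fsS 'http://localhost:9090/api/v1/query?query=up' into count_up.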

Step 6: Create External Application Load Balancer

What is an ALB?

Application Load Balancer (ALB):

Distributes traffic across backend VMs
Provides single public IP for external access
Handles SSL termination
Health checks for backend availability

Step 6a: Create Health Check

Navigation Path

Navigate to Compute Engine → Health checks
Click "Create health check"

Health Check Configuration

Field	Value	Notes
Name	`dev-lb-health-check`	Descriptive
Protocol	`HTTP`	HTTP health check
Port	`3000`	NestJS app port
Request path	`/api/health`	App must implement this
Check interval	`20` seconds	How often to check
Timeout	`5` seconds	Response timeout
Healthy threshold	`2`	2 successes = healthy
Unhealthy threshold	`5`	5 failures = unhealthy

Click "Create".
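These settings determine how quickly the load balancer reacts to a failing or recovering VM. As a rough approximation, the time to change state is the check interval multiplied by the relevant threshold; the sketch below works that out for the values in the table. Real timings can differ slightly because probes come from several locations.

```shell
#!/bin/sh
# Approximate seconds for a backend to change health state:
# consecutive probe results required x interval between probes.
transition_seconds() {
  interval=$1; threshold=$2
  echo $(( $interval * $threshold ))
}

transition_seconds 20 5   # time to be marked unhealthy: prints 100
transition_seconds 20 2   # time to be marked healthy:   prints 40
```

With a 20-second interval, a failed VM keeps receiving traffic for up to about 100 seconds. Lowering the interval or the unhealthy threshold shortens that window at the cost of more probe traffic and more false positives.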

Step 6b: Add Named Port to MIG

What is a Named Port?

Named ports link a name (like "http") to a port number (3000). The load balancer uses named ports for routing.

Navigation Path

Navigate to Compute Engine → Instance groups
Click on dev-backend-mig
Click "Edit group"

Edit MIG

Scroll to "Named ports" section:

Click "Add item":

Field	Value
Name	`http`
Port	`3000`

Click "Save".
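If you prefer the command line for this step, the named port can also be set with gcloud. The sketch below only prints the command for review rather than running it; it assumes dev-backend-mig is a regional MIG in europe-west1, as the describe command in the troubleshooting section suggests, and the helper name is illustrative.

```shell
#!/bin/sh
# Build (but do not run) the gcloud command that sets a named port on a regional MIG.
# Usage: named_port_cmd GROUP REGION NAME PORT
named_port_cmd() {
  printf 'gcloud compute instance-groups set-named-ports %s --named-ports=%s:%s --region=%s\n' \
    "$1" "$3" "$4" "$2"
}

named_port_cmd dev-backend-mig europe-west1 http 3000
```

Copy the printed command into your terminal once the group name and region match your setup.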

Step 6c: Create Load Balancer

Navigation Path

Navigate to Network Services → Load balancing
Click "Create load balancer"

Select LB Type

Click "Application Load Balancer (HTTP/S)" → Click "Configure".

Basic Configuration

Field	Value	Notes
Name	`dev-lb`	Environment-prefixed
Region	`europe-west1`	Same as VPC
Network	`dev-network`	Our VPC

Backend Configuration

Backend type: Instance group

Backend configuration:

Field	Value	Notes
Region	`europe-west1`	Same region
Backend	`dev-backend-mig`	Our MIG
Balancing mode	`Rate`	Rate-based balancing
Maximum RPS	`100`	Per instance

Health check:

Field	Value
Health check	`dev-lb-health-check`

Session affinity:

Field	Value	Notes
Session affinity	`None`	Stateless app

Advanced settings:

Field	Value	Notes
Timeout	`30` seconds	Backend response timeout
Enable Cloud CDN	Off	Not needed for now

Click "Done".

Frontend Configuration

Protocol: HTTPS (we'll add HTTP redirect)

HTTPS Frontend:

Click "Add frontend IP and port":

Field	Value	Notes
Protocol	`HTTPS`	Secure traffic
IP address	`Reserve new static IP`	Create new IP
IP address name	`dev-lb-ip`	Descriptive
Port	`443`	HTTPS port

Certificate:

Field	Value	Notes
Certificate	`Create a new certificate`	For SSL

Create Certificate:

Field	Value	Notes
Name	`dev-lb-cert`	Descriptive
Type	`Google-managed certificate`	Auto-renew

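Domains:

Field	Value
Domains	Your domain (placeholder example: `app.example.com`)

Note: A Google-managed certificate requires at least one domain whose DNS A record points to the load balancer IP, so this field cannot be left empty. If you don't have a domain yet, skip the HTTPS frontend and certificate for now and create only an HTTP frontend on port 80 without the HTTPS redirect; the curl tests in Step 7 assume plain HTTP. For production, add your domain and create the Google-managed certificate.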

Click "Create".

HTTP Frontend (Redirect):

Click "Add frontend IP and port":

Field	Value
Protocol	`HTTP`
IP address	`dev-lb-ip`
Port	`80`
Enable redirect to HTTPS	✓ Enable

Routing Rules

Host rules: Leave default (all hosts)

Path matcher: Leave default (all paths)

Backend service: Select the backend created above

Click "Done".

Create the Load Balancer

Review configuration and click "Create load balancer".

Creation Time: 5-10 minutes

Verify Load Balancer

You should see:

Name: dev-lb
Status: (Checkmark) - Active
IP address: (Reserved IP)
Backends: dev-backend-mig (healthy)

Step 7: Test End-to-End Connectivity

Test 1: Load Balancer Health Check

# Get LB IP
gcloud compute forwarding-rules list --filter="name~^dev-lb"

# Test health endpoint
curl http://[LB_IP]/api/health

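Expected: HTTP 200 with a response like {"status":"ok"}. If you enabled the HTTP-to-HTTPS redirect, plain HTTP returns a 301 redirect instead; test with https://[YOUR_DOMAIN]/api/health once the certificate is active.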
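For scripted checks, the response body can be tested rather than read by eye. The sketch below is a minimal POSIX shell check that assumes the health endpoint returns JSON with a "status" field, as in the example response above; the function name is illustrative.

```shell
#!/bin/sh
# Return 0 if a JSON health response body contains "status":"ok".
# Whitespace around the colon is tolerated; anything else counts as unhealthy.
is_status_ok() {
  printf '%s' "$1" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"ok"'
}

is_status_ok '{"status":"ok"}' && echo "healthy"
```

Against the live load balancer this would be called as is_status_ok "$(curl -fsS http://[LB_IP]/api/health)", with [LB_IP] replaced by the reserved address.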

Test 2: Full Request Flow

# Test through load balancer
curl http://[LB_IP]/api/test

Expected: Response from backend application

Test 3: Prometheus Metrics

Open Grafana: http://[LOKI_EXTERNAL_IP]:3000
Go to Explore → Prometheus
Query: rate(http_requests_total[5m])
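You should see request metrics. This assumes your application exports http_requests_total and that Prometheus scrapes the application port; the config in this part only scrapes Node Exporter (port 9100), so add an application scrape job first if the query returns nothing.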

Test 4: Loki Logs

Open Grafana: http://[LOKI_EXTERNAL_IP]:3000
Go to Explore → Loki
Query: {job="nestjs"}
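You should see application logs. This assumes a log shipper (such as Promtail) on the backend VMs pushes logs to Loki at http://10.0.5.11:3100 with the label job="nestjs"; this part of the guide does not install one.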

Part 4 Verification Checklist

Final Verification

[ ] Prometheus VM running at 10.0.5.10:9090
[ ] Loki VM running with external IP assigned
[ ] Grafana accessible at http://[LOKI_EXTERNAL_IP]:3000
[ ] Prometheus datasource shows "Healthy"
[ ] Loki datasource shows "Healthy"
[ ] Prometheus scraping backend and cache VMs
[ ] Load balancer has reserved static IP
[ ] Backend service shows healthy instances
[ ] curl http://[LB_IP]/api/health returns 200
[ ] Full request flow works (LB → MIG → App)

Cost Summary - Part 4

Component	Monthly Cost	Notes
Prometheus VM	~$23	e2-medium, 50GB disk
Loki VM	~$23	e2-medium, 50GB disk
Load Balancer	~$18	ALB forwarding rules
Total Part 4	~$64	~$64/month

Final Infrastructure Cost

Component	Monthly Cost
Part 1 (VPC, NAT, Firewall)	~$42
Part 2 (Bastion)	~$11
Part 3 (Cloud SQL, MIG, Cache)	~$184
Part 4 (Observability, LB)	~$64
Total	~$301/month
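The totals in both tables can be re-derived, which is handy when you swap a component and want an updated figure. The sketch below sums the whole-dollar estimates from the tables above; the function name is illustrative.

```shell
#!/bin/sh
# Sum whole-dollar monthly costs passed as arguments.
sum_costs() {
  total=0
  for cost in "$@"; do
    total=$(( $total + $cost ))
  done
  echo "$total"
}

sum_costs 23 23 18        # Part 4 (Prometheus, Loki, LB): prints 64
sum_costs 42 11 184 64    # Parts 1-4:                     prints 301
```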

Cost Optimization Tips:

Use Spot (formerly preemptible) VMs for non-critical workloads (roughly 60-91% cheaper than on-demand for those VMs)

Reduce Cloud SQL tier to db-g1-small for dev (save ~$70/month)

Scale MIG to 0 during off-hours (save ~$46/month)

Disable Flow Logs on non-critical subnets (save ~$5-10/month)

Comprehensive Troubleshooting

Issue: Grafana Cannot Connect to Datasources

Symptom: Datasource shows "Could not connect"

Diagnosis:

Check Prometheus is running
Verify network connectivity
Check firewall rules

Solution:

# From Loki VM
docker ps  # Check containers running

# Test Prometheus
curl http://10.0.5.10:9090/-/healthy

# Test Loki
curl http://localhost:3100/ready

Issue: Load Balancer Shows 0/0 Healthy

Symptom: No healthy backend instances

Diagnosis:

Check health check configuration
Verify app is running on port 3000
Check firewall for health check IPs

Solution:

# From backend VM
curl localhost:3000/api/health
netstat -tlnp | grep 3000

Common fixes:

Confirm the health check's unhealthy threshold is 5, as set in Step 6a
Increase the MIG autohealing initial delay to 300 seconds
Verify firewall rule allows 35.191.0.0/16 and 130.211.0.0/22

Issue: High CPU on Backend VMs

Symptom: VMs constantly at 90%+ CPU

Diagnosis:

Check application logs
Verify autoscaling thresholds
Profile application performance

Solution:

# Check CPU usage
top -bn1 | head -20

# Check Node.js process
pm2 monit

# Check autoscaler status
gcloud compute instance-groups managed describe dev-backend-mig \
  --region europe-west1

Fixes:

Increase machine type (e2-medium → e2-highcpu-4)
Optimize application code
Increase max replicas to 6 or 8

Issue: Secrets Not Accessible from VMs

Symptom: "Permission denied" accessing secrets

Diagnosis:

Verify service account has Secret Accessor role
Check secret exists and has versions

Solution:

# From VM with correct SA
gcloud secrets versions list db-credentials-dev

# Add IAM role if missing
gcloud secrets add-iam-policy-binding db-credentials-dev \
  --member='serviceAccount:backend-dev-sa@PROJECT_ID.iam.gserviceaccount.com' \
  --role='roles/secretmanager.secretAccessor'

Next Steps Beyond This Guide

1. Domain & SSL

Purchase domain from registrar
Configure DNS to point to LB IP
Update Load Balancer certificate with your domain
Enable Google-managed certificate

2. CI/CD Pipeline

Setup GitHub Actions
Auto-deploy on push
Run tests in container
Automated rollback on failure

3. Monitoring Enhancements

Add alert rules in Prometheus
Configure PagerDuty/Slack integration
Create custom Grafana dashboards
Set up uptime monitoring

4. Security Hardening

Enable VPC Service Controls
Configure Organization Policies
Setup Security Command Center
Implement workload identity

5. Multi-Environment

Create staging environment
Use Shared VPC
Implement environment isolation
Setup service directory

End-to-End Verification Test

Test 1: Full Request Flow

# 1. Get Load Balancer IP
gcloud compute forwarding-rules list --filter="name~^dev-lb"

# 2. Send test request
curl http://[LB_IP]/api/health

# Expected: {"status":"ok","timestamp":"..."}

Test 2: Database Connectivity

# From backend VM (via bastion)
gcloud compute ssh dev-bastion --tunnel-through-iap
ssh 10.0.2.2  # Backend VM IP

# Test Cloud SQL connection
psql -h 10.100.0.2 -U backend-dev-sa -d appdb

Test 3: Cache Connectivity

# From backend VM
redis-cli -h 10.0.4.2 PING
# Expected: PONG

# Test PgBouncer
psql -h 10.0.4.2 -p 6432 -U app_admin -d appdb

Test 4: Monitoring Pipeline

Generate traffic: ab -n 1000 http://[LB_IP]/api/health
Open Grafana
Check dashboards for metrics
Verify Loki has logs

Test 5: Disaster Recovery

# Simulate instance failure
gcloud compute instances delete [ONE_BACKEND_INSTANCE] --zone [INSTANCE_ZONE] --quiet

# Verify:
# - MIG auto-heals (new instance appears)
# - Load balancer continues serving
# - No data loss

Congratulations!

You've built a complete, production-ready GCP infrastructure:

✅ Network: VPC with 5 subnets, Cloud NAT, firewall rules
✅ Security: Secret Manager, bastion with IAP, OS Login
✅ Data: Cloud SQL PostgreSQL with regional HA
✅ Compute: Managed Instance Group with autoscaling
✅ Cache: Redis + PgBouncer
✅ Observability: Prometheus, Loki, Grafana
✅ Access: External Application Load Balancer

Your infrastructure is ready for production deployment!

Series Complete! You now have a fully functional, production-ready GCP infrastructure. From here, you can deploy your NestJS application and scale as needed.
