Shaikh Al Amin

Build Production-Ready GCP Infrastructure from Scratch Part 04

Build Production-Ready GCP Infrastructure from Scratch: A Complete Console Guide

A 4-Part Series for Complete Beginners



Part 4: Observability & Load Balancer

Overview

In this final part, you'll complete your infrastructure with observability and external access. We'll set up Prometheus for metrics, Loki for logs, Grafana for dashboards, and an Application Load Balancer for external traffic.

What you'll build:

  • Prometheus VM for metrics collection (7-day retention)
  • Loki VM for log aggregation with Grafana
  • External Application Load Balancer with SSL
  • End-to-end health checks and monitoring

Estimated time: 45-60 minutes

Estimated cost: ~$64/month

Final cumulative cost: ~$301/month


Prerequisites

Before continuing, ensure you've completed Parts 1-3:

  • [ ] VPC and 5 subnets exist (including private-obs subnet)
  • [ ] Cloud SQL with private IP is running
  • [ ] Backend MIG has 2+ healthy VMs
  • [ ] Cache VM with Redis/PgBouncer is running
  • [ ] Firewall rules allow health check IPs (35.191.0.0/16, 130.211.0.0/22)

If you missed Parts 1-3: Start with Part 1: Foundation →


Step 1: Create Static Internal IPs for Observability

What are Static Internal IPs?

Static IPs ensure observability VMs have predictable IP addresses. This makes configuration easier (no need to update configs if VMs are recreated).

Prometheus Static IP

Navigation Path

  1. Navigate to VPC networks → Internal IP addresses
  2. Click "Reserve static internal IP address"

IP Configuration

Field | Value | Notes
Name | dev-prometheus-ip | Descriptive
Network | dev-network | Our VPC
Subnetwork | private-obs | Observability subnet
IP address | 10.0.5.10 | Manual assignment

Screenshot: Reserve Internal IP

Click "Reserve".

Loki Static IP

Repeat the process:

Field | Value
Name | dev-loki-ip
Network | dev-network
Subnetwork | private-obs
IP address | 10.0.5.11

Verify IPs

You should see 2 reserved IPs:

Name | IP Address | Subnetwork
dev-prometheus-ip | 10.0.5.10 | private-obs
dev-loki-ip | 10.0.5.11 | private-obs
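
If you prefer to script these reservations instead of clicking through the console, the sketch below prints the equivalent gcloud compute addresses create commands for review; it does not run them. The region europe-west1 is taken from the VM settings later in this part, and reserve_cmd is a hypothetical helper name used only for illustration.

```shell
#!/bin/sh
# Sketch: print (not execute) the gcloud commands that reserve the two
# static internal IPs from the tables above.
# Assumption: the private-obs subnet lives in europe-west1.
REGION="europe-west1"
SUBNET="private-obs"

# reserve_cmd NAME IP -> prints one gcloud command line (hypothetical helper)
reserve_cmd() {
  printf 'gcloud compute addresses create %s --region=%s --subnet=%s --addresses=%s\n' \
    "$1" "$REGION" "$SUBNET" "$2"
}

reserve_cmd dev-prometheus-ip 10.0.5.10
reserve_cmd dev-loki-ip 10.0.5.11
```

Review the printed lines, then paste them into a shell where gcloud is authenticated against your project.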

Step 2: Create Prometheus VM

What is Prometheus?

Prometheus is a metrics collection and storage system:

  • Scrapes metrics from Node Exporter (every VM)
  • Stores 7 days of metrics data
  • Provides query API for Grafana

Why self-hosted: Full control, no vendor lock-in, predictable costs.

Navigation Path

  1. Navigate to Compute Engine → VM instances
  2. Click "Create instance"

VM Configuration

Basic Settings

Field | Value | Notes
Name | dev-prometheus | Descriptive
Region | europe-west1 | Same as VPC
Zone | europe-west1-b | Zone b

Machine Type

Field | Value | Notes
Machine type | e2-medium | 2 vCPU, 4GB RAM

Boot Disk

Field | Value
OS | Ubuntu 22.04 LTS Minimal
Disk type | pd-balanced
Size | 50 GB

Why 50GB: The estimate for 7 days of metrics at a 15s scrape interval is ~30-40GB, although actual usage scales with the number of active time series. 50GB provides headroom.
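
As a rough cross-check on disk sizing, the Prometheus documentation estimates storage as retention time × ingested samples per second × bytes per sample, with about 1-2 bytes per sample. The sketch below applies that formula in shell arithmetic. The series counts are illustrative assumptions, not measurements from this setup, and estimate_bytes is a hypothetical helper name.

```shell
#!/bin/sh
# estimate_bytes DAYS SCRAPE_INTERVAL_S ACTIVE_SERIES BYTES_PER_SAMPLE
# needed bytes = retention seconds * (series / interval) * bytes per sample
estimate_bytes() {
  echo $(( $1 * 86400 * $3 / $2 * $4 ))
}

# Assumed workloads at 7 days, 15s interval, 2 bytes per sample (upper bound)
estimate_bytes 7 15 1000 2     # assumed ~1,000 active series
estimate_bytes 7 15 400000 2   # assumed ~400,000 active series
```

The first call prints 80640000 (about 80 MB) and the second prints 32256000000 (about 32 GB), so the 30-40GB figure corresponds to several hundred thousand active series. A handful of Node Exporter targets should need far less.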

Network Interface

Field | Value | Notes
Network | dev-network | Our VPC
Subnetwork | private-obs | Observability subnet
Primary internal IP | Static: dev-prometheus-ip | Use static IP
External IPv4 address | None | No public IP needed

Screenshot: Prometheus Network

Service Account

Field | Value
Service account | observability-dev-sa

Metadata - Startup Script

Key: startup-script

Value:

#!/bin/bash
# Prometheus Startup Script

set -e

echo "=== Prometheus Startup Script Begin $(date) ==="

# Install Docker
curl -fsSL https://get.docker.com -o get-docker.sh
sh get-docker.sh

# Install Node Exporter for self-monitoring
NODE_EXPORTER_VERSION="1.6.1"

echo "Installing Node Exporter ${NODE_EXPORTER_VERSION}..."

useradd --no-create-home --shell /bin/false node_exporter || true

wget -q "https://github.com/prometheus/node_exporter/releases/download/v${NODE_EXPORTER_VERSION}/node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz" -O /tmp/node_exporter.tar.gz
tar xzf /tmp/node_exporter.tar.gz -C /tmp
cp /tmp/node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64/node_exporter /usr/local/bin/
chown node_exporter:node_exporter /usr/local/bin/node_exporter
chmod +x /usr/local/bin/node_exporter

cat > /etc/systemd/system/node_exporter.service <<'EOF'
[Unit]
Description=Prometheus Node Exporter
After=network.target

[Service]
Type=simple
User=node_exporter
ExecStart=/usr/local/bin/node_exporter --web.listen-address=:9100
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable node_exporter
systemctl start node_exporter

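# Create Prometheus configuration
# Note: retention is not a prometheus.yml setting. It is passed to
# Prometheus as a command-line flag in the docker run command below.
cat > /opt/prometheus.yml <<'EOF'
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']

  # Backend VMs (update IPs after MIG creation)
  - job_name: 'backend'
    static_configs:
      - targets: ['10.0.2.2:9100', '10.0.2.3:9100']
        labels:
          tier: backend

  # Cache VM
  - job_name: 'cache'
    static_configs:
      - targets: ['10.0.4.2:9100']
        labels:
          tier: cache
EOF

# Run Prometheus with 7-day retention.
# Passing flags replaces the image's default arguments, so the config file
# and storage path are restated explicitly.
docker run -d \
  --name prometheus \
  --restart unless-stopped \
  -p 9090:9090 \
  -v /opt/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus:latest \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/prometheus \
  --storage.tsdb.retention.time=7d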

echo "=== Prometheus Startup Script Complete $(date) ==="
echo "Prometheus running on port 9090"
echo "Node Exporter running on port 9100"

Create the VM

Click "Create".

VM Creation Time: 2-3 minutes

Verify Prometheus

You should see:

  • Name: dev-prometheus
  • Status: Running
  • Internal IP: 10.0.5.10
  • External IP: None

Step 3: Create Loki VM (with Grafana)

What is Loki and Grafana?

  • Loki: Log aggregation system (like Prometheus, but for logs)
  • Grafana: Visualization dashboard for metrics and logs

Why combined: Cost optimization. A single VM runs both services (~$23/month).

Navigation Path

  1. Navigate to Compute Engine → VM instances
  2. Click "Create instance"

VM Configuration

Basic Settings

Field | Value | Notes
Name | dev-loki | Descriptive
Region | europe-west1 | Same as VPC
Zone | europe-west1-b | Zone b

Machine Type

Field | Value | Notes
Machine type | e2-medium | 2 vCPU, 4GB RAM

Boot Disk

Field | Value
OS | Ubuntu 22.04 LTS Minimal
Disk type | pd-balanced
Size | 50 GB

Network Interface

Field | Value | Notes
Network | dev-network | Our VPC
Subnetwork | private-obs | Observability subnet
Primary internal IP | Static: dev-loki-ip | Use static IP
External IPv4 address | Ephemeral | Enable for Grafana access

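Why public IP: Grafana needs to be accessible from your browser, which also requires a firewall rule allowing TCP port 3000 to this VM (ideally from your IP address only). In production, use IAP instead of a public IP.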

Screenshot: Loki Network

Service Account

Field | Value
Service account | observability-dev-sa

Metadata - Startup Script

Key: startup-script

Value:

#!/bin/bash
# Loki and Grafana Startup Script

set -e

echo "=== Loki/Grafana Startup Script Begin $(date) ==="

# Install Docker
curl -fsSL https://get.docker.com -o get-docker.sh
sh get-docker.sh

# Install Docker Compose
apt-get update
apt-get install -y docker-compose

# Create docker-compose.yml
cat > /opt/docker-compose.yml <<'EOF'
version: '3.8'

services:
  loki:
    image: grafana/loki:2.9.4  # pinned: the config below uses options removed in Loki 3.x
    ports:
      - "3100:3100"
    volumes:
      - /opt/loki-config.yml:/etc/loki/local-config.yaml
      - loki-data:/loki
    command: -config.file=/etc/loki/local-config.yaml
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin123
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_SERVER_ROOT_URL=http://localhost:3000
    volumes:
      - grafana-storage:/var/lib/grafana
      - /opt/grafana-provisioning:/etc/grafana/provisioning
    restart: unless-stopped

volumes:
  loki-data:
  grafana-storage:
EOF

# Create Loki config
cat > /opt/loki-config.yml <<'EOF'
auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
    final_sleep: 0s
  chunk_idle_period: 1h
  max_chunk_age: 1h
  chunk_target_size: 1048576
  chunk_retain_period: 30s

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

storage_config:
  boltdb_shipper:
    active_index_directory: /loki/index
    cache_location: /loki/cache
    shared_store: filesystem
  filesystem:
    directory: /loki/chunks

chunk_store_config:
  max_look_back_period: 168h

table_manager:
  retention_deletes_enabled: false
  retention_period: 0s
EOF

# Create Grafana provisioning
mkdir -p /opt/grafana-provisioning/datasources

cat > /opt/grafana-provisioning/datasources/loki.yml <<'EOF'
apiVersion: 1

datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    isDefault: false
    editable: false
EOF

cat > /opt/grafana-provisioning/datasources/prometheus.yml <<'EOF'
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://10.0.5.10:9090
    isDefault: true
    editable: false
EOF

# Start services
cd /opt
docker-compose up -d

echo "=== Loki/Grafana Startup Script Complete $(date) ==="
echo "Loki running on port 3100"
echo "Grafana running on port 3000 (http://EXTERNAL_IP:3000)"
echo "Login: admin / admin123"
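
The startup script above hard-codes the Grafana admin password as admin123, which is risky on a VM with a public IP. One possible alternative is to generate a random password and substitute it into the compose file. The sketch below shows only the generation step. gen_password is a hypothetical helper, and it assumes the base64 utility from coreutils is available.

```shell
#!/bin/sh
# gen_password: 18 random bytes, base64-encoded -> 24 characters
gen_password() {
  head -c 18 /dev/urandom | base64
}

GRAFANA_PASSWORD="$(gen_password)"
echo "Generated a ${#GRAFANA_PASSWORD}-character Grafana admin password"

# In the startup script, this value could replace admin123, for example:
#   - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
# (use an unquoted heredoc, <<EOF, so the shell expands the variable)
```

Grafana applies this variable only when it first initializes its database, so store the generated value somewhere safe (for example in Secret Manager) rather than regenerating it on every boot.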

Create the VM

Click "Create".

VM Creation Time: 2-3 minutes

Verify Loki VM

You should see:

  • Name: dev-loki
  • Status: Running
  • Internal IP: 10.0.5.11
  • External IP: (Assigned IP)

Copy the external IP - we'll need it for Grafana access.


Step 4: Access Grafana Dashboard

Get Loki VM External IP

  1. Navigate to Compute Engine → VM instances
  2. Find dev-loki VM
  3. Copy the External IP address

Access Grafana

Open your browser and navigate to:

http://[LOKI_EXTERNAL_IP]:3000

Screenshot: Grafana Login

Grafana Login

Field | Value
Email or username | admin
Password | admin123

Click "Log in".

Verify Datasources

  1. Click "Connections" → Data sources (in older Grafana versions: "Configuration" (gear icon) → Data sources)
  2. Open Prometheus, click "Test", and verify it reports a healthy connection (green checkmark)
  3. Repeat the test for Loki

Screenshot: Grafana Datasources

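Troubleshooting: If datasources are unhealthy:

  1. Check the containers are running: sudo docker ps on the Loki VM (Loki and Grafana) and on the Prometheus VM (Prometheus)
  2. Verify network connectivity from the Loki VM: curl http://10.0.5.10:9090/-/healthy
  3. Check Docker logs on the Loki VM: cd /opt && sudo docker-compose logs loki (or grafana)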

Step 5: Update Prometheus Targets

What are Prometheus Targets?

Prometheus "scrapes" metrics from targets. We need to add our backend and cache VMs.

Update Prometheus Configuration

  1. SSH to Prometheus VM via bastion:

# From your local machine
gcloud compute ssh dev-bastion --tunnel-through-iap

# From bastion, SSH to Prometheus
ssh 10.0.5.10

# Edit Prometheus config
sudo nano /opt/prometheus.yml

  2. Update the targets with actual backend VM IPs:

scrape_configs:
  - job_name: 'backend'
    static_configs:
      - targets: ['10.0.2.2:9100', '10.0.2.3:9100']  # Update with actual IPs
        labels:
          tier: backend

  - job_name: 'cache'
    static_configs:
      - targets: ['10.0.4.2:9100']  # Update with actual IP
        labels:
          tier: cache

  3. Restart Prometheus:

sudo docker restart prometheus
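
Because MIG instances may receive new internal IPs when they are recreated, hand-editing the target list gets tedious. As a minimal sketch, the helper below turns a list of IPs (one per line) into the targets array used in prometheus.yml. The ips_to_targets name and the name~^dev-backend filter in the usage comment are assumptions for illustration; adjust the filter to your MIG's instance name prefix.

```shell
#!/bin/sh
# ips_to_targets PORT: read IPs on stdin, print a Prometheus targets list
ips_to_targets() {
  awk -v port="$1" '
    BEGIN { printf "[" }
    NF    { printf "%s\047%s:%s\047", (n++ ? ", " : ""), $1, port }  # \047 is the quote character
    END   { print "]" }
  '
}

# Example with the placeholder IPs from this guide
printf '10.0.2.2\n10.0.2.3\n' | ips_to_targets 9100

# With gcloud, the input could come from (assumed name filter):
#   gcloud compute instances list --filter="name~^dev-backend" \
#     --format="value(networkInterfaces[0].networkIP)" | ips_to_targets 9100
```

The example prints ['10.0.2.2:9100', '10.0.2.3:9100'], which can be pasted into the backend job's targets line before restarting Prometheus.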

Verify Targets

  1. Navigate to http://[LOKI_EXTERNAL_IP]:3000
  2. Go to Explore → Select Prometheus datasource
  3. Query: up{job="backend"}
  4. You should see metrics from backend VMs

Step 6: Create External Application Load Balancer

What is an ALB?

Application Load Balancer (ALB):

  • Distributes traffic across backend VMs
  • Provides single public IP for external access
  • Handles SSL termination
  • Health checks for backend availability

Step 6a: Create Health Check

Navigation Path

  1. Navigate to Compute Engine → Health checks
  2. Click "Create health check"

Health Check Configuration

Field | Value | Notes
Name | dev-lb-health-check | Descriptive
Protocol | HTTP | HTTP health check
Port | 3000 | NestJS app port
Request path | /api/health | App must implement this
Check interval | 20 seconds | How often to check
Timeout | 5 seconds | Response timeout
Healthy threshold | 2 | 2 successes = healthy
Unhealthy threshold | 5 | 5 failures = unhealthy

Screenshot: LB Health Check

Click "Create".
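
These settings determine how quickly the load balancer reacts to a failing or recovering backend. As a rough estimate (ignoring the 5-second probe timeout and the fact that Google runs several probers in parallel), the time to change a backend's state is about the check interval multiplied by the threshold:

```shell
#!/bin/sh
# Values from the health check table above
INTERVAL=20            # seconds between probes
HEALTHY_THRESHOLD=2    # consecutive successes required
UNHEALTHY_THRESHOLD=5  # consecutive failures required

# Approximate seconds before the load balancer changes a backend's state
echo "unhealthy after ~$(( INTERVAL * UNHEALTHY_THRESHOLD ))s of failed probes"
echo "healthy after ~$(( INTERVAL * HEALTHY_THRESHOLD ))s of successful probes"
```

With these values a failed instance keeps receiving traffic for roughly 100 seconds, while a recovered one rejoins after roughly 40 seconds. Lower the interval or the unhealthy threshold if that detection window is too long for your application.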

Step 6b: Add Named Port to MIG

What is a Named Port?

Named ports link a name (like "http") to a port number (3000). The load balancer uses named ports for routing.

Navigation Path

  1. Navigate to Compute Engine → Instance groups
  2. Click on dev-backend-mig
  3. Click "Edit group"

Edit MIG

Scroll to "Named ports" section:

Click "Add item":

Field | Value
Name | http
Port | 3000

Screenshot: MIG Named Port

Click "Save".

Step 6c: Create Load Balancer

Navigation Path

  1. Navigate to Network Services → Load balancing
  2. Click "Create load balancer"

Select LB Type

Click "Application Load Balancer (HTTP/S)" → Click "Configure".

Screenshot: LB Type Selection

Basic Configuration

Field | Value | Notes
Name | dev-lb | Environment-prefixed
Region | europe-west1 | Same as VPC
Network | dev-network | Our VPC

Backend Configuration

Backend type: Instance group

Backend configuration:

Field | Value | Notes
Region | europe-west1 | Same region
Backend | dev-backend-mig | Our MIG
Balancing mode | Rate | Rate-based balancing
Maximum RPS | 100 | Per instance

Health check:

Field | Value
Health check | dev-lb-health-check

Session affinity:

Field | Value | Notes
Session affinity | None | Stateless app

Screenshot: LB Backend

Advanced settings:

Field | Value | Notes
Timeout | 30 seconds | Connection timeout
Enable Cloud CDN | Off | Not needed for now

Click "Done".

Frontend Configuration

Protocol: HTTPS (we'll add HTTP redirect)

HTTPS Frontend:

Click "Add frontend IP and port":

Field | Value | Notes
Protocol | HTTPS | Secure traffic
IP address | Reserve new static IP | Create new IP
IP address name | dev-lb-ip | Descriptive
Port | 443 | HTTPS port

Certificate:

Field | Value | Notes
Certificate | Create a new certificate | For SSL

Create Certificate:

Field | Value | Notes
Name | dev-lb-cert | Descriptive
Type | Google-managed certificate | Auto-renew

Domains:

Field | Value
Domains | (Skip for now)

Note: For testing, you can use HTTP only (skip the certificate). A Google-managed certificate requires a domain, so for production, add your domain and create a Google-managed certificate.

Click "Create".

HTTP Frontend (Redirect):

Click "Add frontend IP and port":

Field | Value
Protocol | HTTP
IP address | dev-lb-ip
Port | 80
Enable redirect to HTTPS | Enable

Screenshot: LB Frontend

Routing Rules

Host rules: Leave default (all hosts)

Path matcher: Leave default (all paths)

Backend service: Select the backend created above

Click "Done".

Create the Load Balancer

Review configuration and click "Create load balancer".

Creation Time: 5-10 minutes

Verify Load Balancer

You should see:

  • Name: dev-lb
  • Status: (Checkmark) - Active
  • IP address: (Reserved IP)
  • Backends: dev-backend-mig (healthy)

Step 7: Test End-to-End Connectivity

Test 1: Load Balancer Health Check

# Get LB IP
gcloud compute forwarding-rules list --filter="name=dev-lb*"

# Test health endpoint
curl http://[LB_IP]/api/health

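Expected: HTTP 200 with a response like {"status":"ok"}. If you enabled the HTTP-to-HTTPS redirect on the frontend, this request returns a redirect (for example 301) instead; in that case, test over HTTPS with your domain.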
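
When the response is not a 200, the status code usually narrows down the cause. The sketch below is one way to interpret it. The classify_status and check_lb helpers are hypothetical, the cause descriptions are common interpretations rather than guarantees, and 203.0.113.10 is a documentation-only example address.

```shell
#!/bin/sh
# classify_status CODE: map an HTTP status to a likely cause (hypothetical helper)
classify_status() {
  case "$1" in
    200)             echo "ok: backend answered the health endpoint" ;;
    301|302|307|308) echo "redirect: HTTP-to-HTTPS redirect is active, retry over https" ;;
    404)             echo "not found: load balancer works, but the app has no such route" ;;
    502|503)         echo "backend error: likely no healthy instances behind the load balancer" ;;
    000)             echo "no response: check the IP, firewall rules, and LB provisioning" ;;
    *)               echo "unexpected status $1" ;;
  esac
}

# check_lb LB_IP: fetch only the status code with curl, then classify it
check_lb() {
  code="$(curl -s -o /dev/null -m 10 -w '%{http_code}' "http://$1/api/health")"
  classify_status "$code"
}

classify_status 200
# Usage against a real load balancer: check_lb 203.0.113.10
```

curl reports 000 when it cannot connect at all, which is typical during the first minutes after the load balancer is created.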

Test 2: Full Request Flow

# Test through load balancer
curl http://[LB_IP]/api/test

Expected: Response from backend application

Test 3: Prometheus Metrics

  1. Open Grafana: http://[LOKI_EXTERNAL_IP]:3000
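  2. Go to Explore → Prometheus
  3. Query: rate(node_cpu_seconds_total[5m])
  4. You should see CPU metrics from the scraped VMs. Application metrics such as http_requests_total appear only if the app exposes a metrics endpoint and Prometheus is configured to scrape it.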

Test 4: Loki Logs

  1. Open Grafana: http://[LOKI_EXTERNAL_IP]:3000
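  2. Go to Explore → Loki
  3. Query: {job="nestjs"}
  4. You should see application logs, provided a log shipper (for example Promtail) on the backend VMs pushes logs to Loki at 10.0.5.11:3100 with this job label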
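
If no logs appear, it helps to confirm that Loki itself accepts and returns data before debugging the log shipper. The sketch below builds a body for Loki's HTTP push endpoint (/loki/api/v1/push). The loki_payload helper and the smoke-test job label are hypothetical names chosen for this example.

```shell
#!/bin/sh
# loki_payload JOB TS_NS MESSAGE: build a Loki push API body (hypothetical helper)
# No JSON escaping is done, so keep the message free of quotes and backslashes.
loki_payload() {
  printf '{"streams":[{"stream":{"job":"%s"},"values":[["%s","%s"]]}]}\n' "$1" "$2" "$3"
}

# Loki expects the timestamp in nanoseconds since the epoch, as a string
ts_ns="$(date +%s)000000000"
payload="$(loki_payload smoke-test "$ts_ns" "hello from the Loki VM")"
echo "$payload"

# To send it (run on the Loki VM, where port 3100 is published):
#   curl -s -H "Content-Type: application/json" \
#     -X POST http://localhost:3100/loki/api/v1/push --data-raw "$payload"
# Then query {job="smoke-test"} in Grafana Explore.
```

If the smoke-test line shows up in Grafana but {job="nestjs"} stays empty, the problem is on the shipping side rather than in Loki or the datasource.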

Part 4 Verification Checklist

Final Verification

  • [ ] Prometheus VM running at 10.0.5.10:9090
  • [ ] Loki VM running with external IP assigned
  • [ ] Grafana accessible at http://[LOKI_EXTERNAL_IP]:3000
  • [ ] Prometheus datasource shows "Healthy"
  • [ ] Loki datasource shows "Healthy"
  • [ ] Prometheus scraping backend and cache VMs
  • [ ] Load balancer has reserved static IP
  • [ ] Backend service shows healthy instances
  • [ ] curl http://[LB_IP]/api/health returns 200
  • [ ] Full request flow works (LB → MIG → App)

Screenshot: Completed Part 4


Cost Summary - Part 4

Component | Monthly Cost | Notes
Prometheus VM | ~$23 | e2-medium, 50GB disk
Loki VM | ~$23 | e2-medium, 50GB disk
Load Balancer | ~$18 | ALB forwarding rules
Total Part 4 | ~$64 | ~$64/month

Final Infrastructure Cost

Component | Monthly Cost
Part 1 (VPC, NAT, Firewall) | ~$42
Part 2 (Bastion) | ~$11
Part 3 (Cloud SQL, MIG, Cache) | ~$184
Part 4 (Observability, LB) | ~$64
Total | ~$301/month

Cost Optimization Tips:

  1. Use preemptible (Spot) VMs for non-critical workloads (save up to ~80%)
  2. Reduce Cloud SQL tier to db-g1-small for dev (save ~$70/month)
  3. Scale MIG to 0 during off-hours (save ~$46/month)
  4. Disable Flow Logs on non-critical subnets (save ~$5-10/month)
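
The totals and savings above can be checked with a few lines of shell arithmetic. All figures are the estimates from this guide, not live pricing.

```shell
#!/bin/sh
# Monthly estimates (USD) from the cost tables in this guide
part4=$(( 23 + 23 + 18 ))            # Prometheus VM + Loki VM + load balancer
total=$(( 42 + 11 + 184 + part4 ))   # Parts 1-3 plus Part 4
echo "Part 4: ~\$${part4}/month; all parts: ~\$${total}/month"

# Tips 2 and 3 (smaller Cloud SQL tier, MIG scaled to 0 off-hours)
optimized=$(( total - 70 - 46 ))
echo "With tips 2 and 3 applied: ~\$${optimized}/month"
```

This prints ~$64 for Part 4, ~$301 for the full stack, and ~$185 with the two largest savings applied.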

Comprehensive Troubleshooting

Issue: Grafana Cannot Connect to Datasources

Symptom: Datasource shows "Could not connect"

Diagnosis:

  1. Check Prometheus is running
  2. Verify network connectivity
  3. Check firewall rules

Solution:

# From Loki VM
docker ps  # Check containers running

# Test Prometheus
curl http://10.0.5.10:9090/-/healthy

# Test Loki
curl http://localhost:3100/ready

Issue: Load Balancer Shows 0/0 Healthy

Symptom: No healthy backend instances

Diagnosis:

  1. Check health check configuration
  2. Verify app is running on port 3000
  3. Check firewall for health check IPs

Solution:

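# From backend VM
# Use the same path as the load balancer health check
curl localhost:3000/api/health
# Confirm something is listening on port 3000
sudo ss -tlnp | grep 3000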

Common fixes:

  • Confirm the health check unhealthy threshold is at least 5 (the value from Step 6a)
  • Increase initial delay for MIG autohealing to 300 seconds
  • Verify firewall rule allows 35.191.0.0/16 and 130.211.0.0/22

Issue: High CPU on Backend VMs

Symptom: VMs constantly at 90%+ CPU

Diagnosis:

  1. Check application logs
  2. Verify autoscaling thresholds
  3. Profile application performance

Solution:

# Check CPU usage
top -bn1 | head -20

# Check Node.js process
pm2 monit

# Check autoscaler status
gcloud compute instance-groups managed describe dev-backend-mig \
  --region europe-west1

Fixes:

  • Increase machine type (e2-medium → e2-highcpu-4)
  • Optimize application code
  • Increase max replicas to 6 or 8

Issue: Secrets Not Accessible from VMs

Symptom: "Permission denied" accessing secrets

Diagnosis:

  1. Verify service account has Secret Accessor role
  2. Check secret exists and has versions

Solution:

# From VM with correct SA
gcloud secrets versions list db-credentials-dev

# Add IAM role if missing
gcloud secrets add-iam-policy-binding db-credentials-dev \
  --member='serviceAccount:backend-dev-sa@PROJECT_ID.iam.gserviceaccount.com' \
  --role='roles/secretmanager.secretAccessor'

Next Steps Beyond This Guide

1. Domain & SSL

  1. Purchase domain from registrar
  2. Configure DNS to point to LB IP
  3. Update Load Balancer certificate with your domain
  4. Enable Google-managed certificate

2. CI/CD Pipeline

  • Setup GitHub Actions
  • Auto-deploy on push
  • Run tests in container
  • Automated rollback on failure

3. Monitoring Enhancements

  • Add alert rules in Prometheus
  • Configure PagerDuty/Slack integration
  • Create custom Grafana dashboards
  • Set up uptime monitoring

4. Security Hardening

  • Enable VPC Service Controls
  • Configure Organization Policies
  • Setup Security Command Center
  • Implement workload identity

5. Multi-Environment

  • Create staging environment
  • Use Shared VPC
  • Implement environment isolation
  • Setup service directory

End-to-End Verification Test

Test 1: Full Request Flow

# 1. Get Load Balancer IP
gcloud compute forwarding-rules list --filter="name=dev-lb*"

# 2. Send test request
curl http://[LB_IP]/api/health

# Expected: {"status":"ok","timestamp":"..."}

Test 2: Database Connectivity

# From backend VM (via bastion)
gcloud compute ssh dev-bastion --tunnel-through-iap
ssh 10.0.2.2  # Backend VM IP

# Test Cloud SQL connection
psql -h 10.100.0.2 -U backend-dev-sa -d appdb

Test 3: Cache Connectivity

# From backend VM
redis-cli -h 10.0.4.2 PING
# Expected: PONG

# Test PgBouncer
psql -h 10.0.4.2 -p 6432 -U app_admin -d appdb

Test 4: Monitoring Pipeline

  1. Generate traffic: ab -n 1000 http://[LB_IP]/api/health
  2. Open Grafana
  3. Check dashboards for metrics
  4. Verify Loki has logs

Test 5: Disaster Recovery

# Simulate instance failure
gcloud compute instances delete [ONE_BACKEND_INSTANCE] --zone=[INSTANCE_ZONE] --quiet

# Verify:
# - MIG auto-heals (new instance appears)
# - Load balancer continues serving
# - No data loss

Congratulations!

You've built a complete, production-ready GCP infrastructure:

Network: VPC with 5 subnets, Cloud NAT, firewall rules
Security: Secret Manager, bastion with IAP, OS Login
Data: Cloud SQL PostgreSQL with regional HA
Compute: Managed Instance Group with autoscaling
Cache: Redis + PgBouncer
Observability: Prometheus, Loki, Grafana
Access: External Application Load Balancer

Your infrastructure is ready for production deployment!



Series Complete! You now have a fully functional, production-ready GCP infrastructure. From here, you can deploy your NestJS application and scale as needed.
