The Cloud SQL Bill That Taught Me Everything About Over-Provisioning
My database was running at 0.8% CPU utilisation.
I discovered this three months after going live, while investigating why our GCP bill seemed higher than expected for our traffic volume. The number was so low I thought there was an error in Cloud Monitoring. There wasn't.
I'd been paying for a machine that could handle roughly 100x more load than we were actually putting on it. Classic over-provisioning, but seeing it in real numbers was genuinely embarrassing.
Here's everything I learned about right-sizing Cloud SQL instances, with the specific metrics and commands that will save you from making the same mistakes.
The Real Cost of "Playing It Safe"
When you're spinning up your first production Cloud SQL instance, the console gives you a dropdown of machine types: db-f1-micro, db-n1-standard-1, db-n1-standard-2, and so on. The descriptions are helpful but vague: "1 vCPU, 3.75GB memory" tells you the specs, not whether you need them.
I picked db-n1-standard-2 because it seemed reasonable for a production database. Not too small, not excessive. The middle option. That decision was based on absolutely no data.
The problem with "reasonable" is that it's usually wrong. Either you're under-provisioned and your app breaks, or you're over-provisioned and you're burning money. In my case, it was the latter.
What the Metrics Actually Tell You
The key insight is that Cloud Monitoring shows you exactly what your database is doing. You just have to know where to look.
CPU Utilisation
This is the most important metric for right-sizing your instance.
Where to find it: Cloud Console → SQL → your instance → Monitoring tab → CPU utilization
What to look for:
- Average utilisation over the past 30 days
- P95 and P99 peaks (the levels your usage stays below 95% and 99% of the time)
- Time of day patterns
How to interpret it:
- Under 20% average: you can probably downgrade
- 20-50%: you're sized appropriately
- 50-80%: keep an eye on growth trends
- Over 80% sustained: consider upgrading
My average was 0.8%. My P99 was around 3%. I could have run the same workload on a db-f1-micro instance and saved roughly 70% on compute costs.
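If you'd rather pull these numbers from a terminal than click through the console, the same time series is available from the Cloud Monitoring API. Here's a minimal sketch, assuming GNU date and a placeholder project ID; swap the metric type for database/memory/utilization to get the memory figures the same way:
# Sketch: 30 days of hourly-averaged CPU utilisation for every Cloud SQL
# instance in the project, as raw JSON (values are fractions between 0 and 1)
PROJECT_ID="YOUR_PROJECT_ID"                             # assumption: your project ID
START=$(date -u -d '30 days ago' +%Y-%m-%dT%H:%M:%SZ)    # GNU date syntax
END=$(date -u +%Y-%m-%dT%H:%M:%SZ)
curl -s -G \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://monitoring.googleapis.com/v3/projects/${PROJECT_ID}/timeSeries" \
  --data-urlencode 'filter=metric.type="cloudsql.googleapis.com/database/cpu/utilization"' \
  --data-urlencode "interval.startTime=${START}" \
  --data-urlencode "interval.endTime=${END}" \
  --data-urlencode 'aggregation.alignmentPeriod=3600s' \
  --data-urlencode 'aggregation.perSeriesAligner=ALIGN_MEAN'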
Memory Utilisation
Where to find it: Same monitoring tab → Memory utilization
What matters: You want to see consistent memory usage without swap. If memory utilisation is consistently above 90% or you're seeing any swap usage, that's a performance problem waiting to happen.
What I found: Memory usage was sitting around 15% with zero swap. Another sign I was massively over-provisioned.
Connection Count
Where to find it: Monitoring tab → Database connections
What to look for: Peak active connections compared to your instance's connection limit.
Connection limits by instance:
- db-f1-micro: 25 connections
- db-n1-standard-1: 100 connections
- db-n1-standard-2: 200 connections
My peak connections were hitting around 11. Even a db-f1-micro would have been comfortable.
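To see your own numbers straight from the database rather than the monitoring graph, pg_stat_activity has them. A minimal sketch, assuming a PostgreSQL instance reachable on localhost via the Cloud SQL Auth Proxy (host, user and database name are placeholders):
# Active connections vs. the configured limit
psql "host=127.0.0.1 user=postgres dbname=your_database_name" -c \
  "SELECT count(*) AS active_connections,
          current_setting('max_connections') AS max_allowed
     FROM pg_stat_activity;"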
The Commands That Actually Matter
Once you know your utilisation is low, here are the specific commands to check what you're currently running and how to change it.
Check Your Current Instance Configuration
gcloud sql instances describe YOUR_INSTANCE_NAME --format="table(
name,
settings.tier,
settings.dataDiskSizeGb,
settings.availabilityType,
settings.backupConfiguration.enabled
)"
This gives you a clean summary of what you're paying for:
- settings.tier: your machine type (the expensive part)
- settings.dataDiskSizeGb: disk size
- settings.availabilityType: whether HA is enabled
- settings.backupConfiguration.enabled: backup settings
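If you run more than one instance (production, staging, dev), the list command gives you the same summary for all of them in one go; a quick sketch:
# One row per instance: the fields that drive most of the bill
gcloud sql instances list --format="table(
name,
settings.tier,
settings.dataDiskSizeGb,
settings.availabilityType
)"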
Downgrade Your Instance Tier
If your CPU utilisation is consistently low, this is the biggest cost saving:
gcloud sql instances patch YOUR_INSTANCE_NAME --tier=db-f1-micro
Important: This will restart your instance. Plan for a few minutes of downtime.
Machine type costs (rough monthly estimates for PostgreSQL in us-central1):
- db-f1-micro: ~$7/month
- db-n1-standard-1: ~$25/month
- db-n1-standard-2: ~$50/month
- db-n1-standard-4: ~$100/month
Moving from standard-2 to f1-micro saves around $43/month per instance. That adds up fast if you're running multiple environments.
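If you are running multiple environments, the downgrade is easy to script. A hedged sketch with made-up instance names; --quiet skips the confirmation prompt, so double-check the list before running it:
# Downgrade the non-production instances in one pass (each one restarts)
for instance in myapp-staging myapp-dev; do
  gcloud sql instances patch "$instance" --tier=db-f1-micro --quiet
done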
Turn Off High Availability (Where Appropriate)
High Availability runs a standby replica in a different zone, roughly doubling your instance cost. You want this in production. You probably don't need it in staging or development.
Check if HA is enabled:
gcloud sql instances describe YOUR_INSTANCE_NAME --format="value(settings.availabilityType)"
Turn it off:
gcloud sql instances patch YOUR_INSTANCE_NAME --availability-type=ZONAL
Turn it back on:
gcloud sql instances patch YOUR_INSTANCE_NAME --availability-type=REGIONAL
This change also requires a restart, so plan accordingly.
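To audit this across a whole project, filter the instance list for anything still running REGIONAL and ask whether each hit really needs the standby; a small sketch:
# Instances currently paying for the HA standby replica
gcloud sql instances list \
  --filter="settings.availabilityType=REGIONAL" \
  --format="table(name, settings.tier)"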
The Storage Problem You Can't Fix Easily
Here's the frustrating part: Cloud SQL storage auto-increases but never auto-decreases. If your data grows to 50GB and then you delete 40GB, you're still paying for 50GB forever.
I had 100GB provisioned and was using 240MB. That's 0.24% utilisation. Storage isn't the most expensive part of Cloud SQL, but it's still $10/month I didn't need to spend.
Check your actual storage usage:
gcloud sql instances describe YOUR_INSTANCE_NAME --format="value(settings.dataDiskSizeGb)"
Then connect to your database and check actual usage:
-- For PostgreSQL
SELECT pg_size_pretty(pg_database_size('your_database_name'));
-- For MySQL
SELECT
  table_schema AS "Database",
  ROUND(SUM(data_length + index_length) / 1024 / 1024, 2) AS "Size (MB)"
FROM information_schema.tables
GROUP BY table_schema;
The only fix for oversized storage: export your data, delete the instance, and recreate it with a smaller disk. This is disruptive enough that you probably won't do it unless the over-provisioning is severe.
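If the waste is bad enough to justify it, the migration looks roughly like the sketch below. The bucket, database name, engine version and region are all assumptions, the Cloud SQL service account needs write access to the bucket, and a deleted instance's name stays reserved for a while, so the replacement gets a new name:
# 1. Export the data to Cloud Storage (a .gz suffix compresses the dump)
gcloud sql export sql YOUR_INSTANCE_NAME gs://your-backup-bucket/dump.sql.gz \
  --database=your_database_name
# 2. Delete the oversized instance
gcloud sql instances delete YOUR_INSTANCE_NAME
# 3. Recreate with a right-sized disk, under a new name
gcloud sql instances create YOUR_NEW_INSTANCE_NAME \
  --database-version=POSTGRES_15 \
  --tier=db-f1-micro \
  --region=us-central1 \
  --storage-size=10GB
# 4. Recreate the database and import the dump
gcloud sql databases create your_database_name --instance=YOUR_NEW_INSTANCE_NAME
gcloud sql import sql YOUR_NEW_INSTANCE_NAME gs://your-backup-bucket/dump.sql.gz \
  --database=your_database_name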
The lesson: Size your initial disk conservatively. 10GB is the minimum and sufficient for most applications starting out. You can always increase it later without downtime.
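Growing the disk later is a single command (and storage only ever goes up); a sketch, assuming patch is given the new total size:
# Increase provisioned storage in place; no restart required
gcloud sql instances patch YOUR_INSTANCE_NAME --storage-size=20GB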
Backup Configuration That Actually Makes Sense
Cloud SQL defaults to 7 days of automated backup retention. For production, that makes sense. For staging environments that get refreshed weekly, you're paying to store backups of data you'd never restore.
Check your backup settings:
gcloud sql instances describe YOUR_INSTANCE_NAME --format="table(
settings.backupConfiguration.enabled,
settings.backupConfiguration.backupRetentionSettings.retainedBackups,
settings.backupConfiguration.pointInTimeRecoveryEnabled
)"
Reduce backup retention for non-critical instances:
gcloud sql instances patch YOUR_INSTANCE_NAME --retained-backups-count=3
Turn off point-in-time recovery (PITR) for non-critical instances:
gcloud sql instances patch YOUR_INSTANCE_NAME --no-enable-point-in-time-recovery
PITR keeps transaction logs to allow recovery to any specific timestamp. It's useful for production but adds storage costs and complexity for environments where you'd just restore from the most recent daily backup anyway.
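As with the tier change, this is easy to standardise across non-production environments with a small loop; a sketch with made-up instance names, using the same flags as above:
# Lighter backup settings for environments you'd never point-in-time restore
for instance in myapp-staging myapp-dev; do
  gcloud sql instances patch "$instance" \
    --retained-backups-count=3 \
    --no-enable-point-in-time-recovery \
    --quiet
done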
The Monitoring Dashboard You Should Actually Use
Instead of checking individual metrics manually, set up a custom dashboard that shows everything relevant at once.
Create the dashboard from the config file below:
gcloud monitoring dashboards create --config-from-file=dashboard-config.yaml
Dashboard configuration (dashboard-config.yaml):
displayName: "Cloud SQL Cost Optimization"
mosaicLayout:
  columns: 12
  tiles:
    - xPos: 0
      yPos: 0
      width: 6
      height: 4
      widget:
        title: "CPU Utilization"
        xyChart:
          dataSets:
            - timeSeriesQuery:
                timeSeriesFilter:
                  filter: 'metric.type="cloudsql.googleapis.com/database/cpu/utilization" resource.type="cloudsql_database"'
    - xPos: 6
      yPos: 0
      width: 6
      height: 4
      widget:
        title: "Memory Utilization"
        xyChart:
          dataSets:
            - timeSeriesQuery:
                timeSeriesFilter:
                  filter: 'metric.type="cloudsql.googleapis.com/database/memory/utilization" resource.type="cloudsql_database"'
    - xPos: 0
      yPos: 4
      width: 6
      height: 4
      widget:
        title: "Active Connections"
        xyChart:
          dataSets:
            - timeSeriesQuery:
                timeSeriesFilter:
                  filter: 'metric.type="cloudsql.googleapis.com/database/postgresql/num_backends" resource.type="cloudsql_database"'
This gives you a single view of the three metrics that matter most for cost optimization.
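Once created, you can confirm the dashboard exists (and grab its resource name for later edits) from the CLI; a quick sketch:
# List dashboards in the project to confirm the new one landed
gcloud monitoring dashboards list --format="table(displayName, name)"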
What I Wish I'd Known Before Clicking "Create"
The real lesson here isn't about any specific setting. It's about the mindset.
Start smaller than you think you need. Scaling up is a one-line command and a few minutes of downtime. Scaling down requires migration and planning.
Use actual data, not gut feel. Cloud Monitoring exists for a reason. If you don't have usage patterns yet, start with the smallest instance that can handle your expected load and scale up based on real metrics.
Environment-specific configuration matters. Production and staging have different availability requirements, different backup needs, and different cost tolerances. Configure them differently.
GCP defaults optimize for reliability, not cost. That's the right choice for a platform, but it means you need to actively optimize for your actual usage patterns.
The Bottom Line
My 0.8% CPU utilisation was embarrassing, but it taught me more about cloud cost optimization than months of reading best practices guides. The specific numbers forced me to understand what each metric actually means and how it translates to real money.
If you're setting up Cloud SQL for the first time, open the monitoring dashboard before you pick your instance tier. The metrics will tell you what you actually need, not what feels reasonable.
And if you're already running Cloud SQL instances, spend ten minutes checking your utilisation numbers. You might be surprised at what you find.