varun varde

Posted on May 8

FinOps for DevOps Engineers: The Complete Cloud Cost Optimisation Playbook

#devops #cloud #playbook #aws

Cloud bills rarely explode because of one catastrophic decision. They grow incrementally. Quietly. A forgotten load balancer here. Overprovisioned Kubernetes nodes there. NAT Gateway traffic multiplying invisibly in the background like fiscal mold behind drywall.

Most organisations approach FinOps as a finance exercise. That is a strategic mistake.

The engineers provisioning infrastructure are the same engineers best positioned to optimise it. DevOps teams control autoscaling, storage policies, networking topology, observability retention, and workload scheduling. They are not adjacent to cloud cost optimisation. They are the operational epicentre of it.

This playbook focuses on practical FinOps implementation for DevOps and platform engineers. Not abstract governance theory. Actual engineering patterns that reduce spend without degrading reliability.

The optimisation path is organised by return on investment. Start with visibility. Then tackle compute, storage, networking, Kubernetes, and finally governance automation.

Part 1: Visibility First — Tagging Standards and Cost Attribution

You cannot optimise what you cannot attribute.

Most cloud environments fail at cost management because nobody knows which team owns what.

The Minimum Viable Tagging Standard

Every resource should contain

tags = {
  team         = "platform-engineering"
  environment  = "production"
  application  = "checkout-api"
  cost-centre  = "ENG-042"
  owner        = "payments-team"
}
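A tagging standard only helps if something checks it. The sketch below is one possible compliance check, assuming resource tags have already been exported into plain dictionaries. The function names and the sample inventory are illustrative, not part of any AWS API.

```python
# Required keys taken from the tagging standard above.
REQUIRED_TAGS = {"team", "environment", "application", "cost-centre", "owner"}


def missing_tags(resource_tags, required=REQUIRED_TAGS):
    """Return the required tag keys that are absent or empty, sorted."""
    present = {key for key, value in resource_tags.items() if str(value).strip()}
    return sorted(set(required) - present)


def untagged_report(resources):
    """Map resource id -> missing tag keys, for non-compliant resources only."""
    report = {}
    for resource_id, tags in resources.items():
        gaps = missing_tags(tags)
        if gaps:
            report[resource_id] = gaps
    return report


# Hypothetical inventory: one compliant instance, one poorly tagged volume.
inventory = {
    "i-0abc": {
        "team": "platform-engineering",
        "environment": "production",
        "application": "checkout-api",
        "cost-centre": "ENG-042",
        "owner": "payments-team",
    },
    "vol-0def": {"team": "platform-engineering", "owner": ""},
}
print(untagged_report(inventory))
```

Treating an empty value as missing is a deliberate choice here: a tag with no value cannot be attributed to anyone.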

Why Tags Matter

Without tags

Cloud bill = giant undifferentiated blob

With tags

Cloud bill = attributable operational data

This changes engineering behaviour immediately.

AWS Cost Allocation Tags

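Tags are not included in billing data until you activate them. First list the tag keys AWS has discovered, along with their status

aws ce list-cost-allocation-tags

Then activate the keys you want to report on. Activation can take up to 24 hours to take effect.

Billing Console → Cost Allocation Tags → Activate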

Cost Dashboard Strategy

Build dashboards around:

Cost by team
Cost by environment
Cost by service
Week-over-week growth
Top anomalous resources
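Week-over-week growth is the least obvious of these views to compute, so here is a minimal sketch. It assumes you already have weekly cost totals per team as dictionaries; the team names and figures are made up for illustration.

```python
def week_over_week_growth(previous, current):
    """Return {team: percent change} between two weekly cost totals.

    Teams with no spend in the previous week map to None, because
    percentage growth from zero is undefined.
    """
    growth = {}
    for team in sorted(set(previous) | set(current)):
        before = previous.get(team, 0.0)
        after = current.get(team, 0.0)
        if before == 0:
            growth[team] = None
        else:
            growth[team] = round((after - before) / before * 100, 1)
    return growth


def fastest_growing(growth, limit=3):
    """Teams ordered by percent growth, largest first, skipping undefined."""
    ranked = [(team, pct) for team, pct in growth.items() if pct is not None]
    return sorted(ranked, key=lambda item: item[1], reverse=True)[:limit]


last_week = {"payments": 1000.0, "search": 400.0}
this_week = {"payments": 1250.0, "search": 380.0, "ml": 90.0}
print(fastest_growing(week_over_week_growth(last_week, this_week)))
```

New teams show up as None rather than as infinite growth, which keeps them visible on a dashboard without distorting the ranking.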

Part 2: Compute Optimisation — Rightsizing, Spot, Graviton

Compute is usually the largest controllable expense category.

And most environments are dramatically oversized.

Rightsizing EC2 Instances

Example

m5.4xlarge
Average CPU: 9%

This is not infrastructure. It is financial leakage.

Identify Idle Instances

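Using CloudWatch, pull the CPUUtilization metric for each instance over at least two weeks and compare the average against the peak. Instances that stay low on both are rightsizing candidates.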
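Once CPU datapoints have been pulled from CloudWatch, flagging candidates is a small filtering job. This sketch assumes the datapoints are already collected into a dictionary of percentages per instance. The 10% average and 40% peak thresholds are illustrative assumptions, not AWS guidance.

```python
from statistics import mean


def find_idle_instances(cpu_by_instance, avg_threshold=10.0, peak_threshold=40.0):
    """Return instance ids whose CPU stays low on both average and peak.

    cpu_by_instance maps instance id -> list of CPU utilisation percentages.
    Instances with no datapoints are skipped rather than guessed at.
    """
    idle = []
    for instance_id, samples in cpu_by_instance.items():
        if not samples:
            continue
        if mean(samples) < avg_threshold and max(samples) < peak_threshold:
            idle.append(instance_id)
    return sorted(idle)


# Hypothetical samples: i-b has a low average but a real peak, so it is kept.
samples = {
    "i-a": [8, 9, 10],
    "i-b": [5, 6, 95],
    "i-c": [],
    "i-d": [50, 60],
}
print(find_idle_instances(samples))
```

Checking the peak as well as the average avoids downsizing an instance that idles most of the day but genuinely needs its capacity during a nightly batch run.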

Spot Instances

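Spot pricing can cut compute costs by up to 90% against On-Demand rates, with 70–90% commonly quoted. The trade-off is that AWS can reclaim the capacity with a two-minute warning.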

Perfect for:

CI runners
Batch jobs
Non-critical workloads
Kubernetes worker nodes

Terraform Spot Example

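resource "aws_instance" "spot_worker" {
  # Placeholder variable: supply an arm64 AMI, since m7g is a Graviton (Arm) type
  ami           = var.arm64_ami_id
  instance_type = "m7g.large"

  # Request Spot capacity instead of On-Demand
  instance_market_options {
    market_type = "spot"
  }
}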

AWS Graviton Migration

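Graviton (Arm-based) instances are priced below comparable x86 instances, and AWS advertises up to 40% better price performance. Savings of 20–40% are realistic, but benchmark your own workload before committing.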

Migration Candidate Checklist

Best workloads:

Stateless APIs
Containers
Node.js
Go
Java 17+

Kubernetes Node Group Example

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

# Placeholder cluster name and region
metadata:
  name: example-cluster
  region: us-east-1

managedNodeGroups:
  - name: graviton-workers
    instanceType: m7g.large

Part 3: Storage Optimisation — S3 Tiers, EBS, Lifecycle Policies

Storage inefficiency compounds silently over years.

S3 Lifecycle Policies

The fastest storage win in AWS.

Terraform Lifecycle Policy

resource "aws_s3_bucket_lifecycle_configuration" "cost_optimised" {
  bucket = aws_s3_bucket.data.id

  rule {
    id     = "archive-old-data"
    status = "Enabled"

    # An empty filter applies the rule to every object in the bucket
    filter {}

    transition {
      days          = 30
      storage_class = "INTELLIGENT_TIERING"
    }

    transition {
      days          = 90
      storage_class = "GLACIER_IR"
    }

    transition {
      days          = 365
      storage_class = "DEEP_ARCHIVE"
    }
  }
}
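To sanity-check a policy like this before applying it, it helps to see where an object of a given age would end up. The sketch below mirrors the 30, 90, and 365 day thresholds from the rule and assumes objects start in STANDARD. It models only the thresholds, not S3's actual transition timing, which runs asynchronously.

```python
# (minimum age in days, storage class), mirroring the lifecycle rule above.
TRANSITIONS = [
    (30, "INTELLIGENT_TIERING"),
    (90, "GLACIER_IR"),
    (365, "DEEP_ARCHIVE"),
]


def storage_class_for_age(age_days, transitions=TRANSITIONS, default="STANDARD"):
    """Return the storage class an object of this age should have reached."""
    # Check the oldest threshold first so the deepest matching tier wins.
    for min_age, storage_class in sorted(transitions, reverse=True):
        if age_days >= min_age:
            return storage_class
    return default


for age in (10, 45, 120, 400):
    print(age, storage_class_for_age(age))
```

Running a bucket inventory through a function like this gives a rough picture of how much data each tier will hold once the policy has caught up.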

EBS Optimisation

Common waste patterns:

Detached volumes
Oversized gp3 disks
Unused snapshots

Find Unattached Volumes

aws ec2 describe-volumes \
  --filters Name=status,Values=available

Delete whatever this returns once you have confirmed nothing depends on it, unless compliance requires otherwise.
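Before deleting anything, it is worth knowing how much storage is actually at stake. This sketch summarises the JSON that the describe-volumes command prints, relying on the Volumes, VolumeId, Size, and State fields of that output. The sample payload is invented for illustration.

```python
import json


def summarise_unattached(describe_volumes_json):
    """Return (sorted volume ids, total GiB) for volumes in the 'available' state."""
    payload = json.loads(describe_volumes_json)
    unattached = [
        volume for volume in payload.get("Volumes", [])
        if volume.get("State") == "available"
    ]
    ids = sorted(volume["VolumeId"] for volume in unattached)
    total_gib = sum(volume.get("Size", 0) for volume in unattached)
    return ids, total_gib


# Hypothetical CLI output, trimmed to the fields used above.
sample = json.dumps({
    "Volumes": [
        {"VolumeId": "vol-0b", "Size": 500, "State": "available"},
        {"VolumeId": "vol-0a", "Size": 100, "State": "available"},
        {"VolumeId": "vol-0c", "Size": 50, "State": "in-use"},
    ]
})
print(summarise_unattached(sample))
```

The state check is repeated in code even though the CLI filter already applies it, so the function stays correct if someone feeds it unfiltered output.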

Part 4: Networking Cost Reduction — NAT Gateway, VPC Endpoints, Data Transfer

Networking costs surprise almost everyone.

Especially NAT Gateways.

NAT Gateway Optimisation

NAT Gateway charges include:

Hourly fee
Per-GB transfer fee

Large clusters can spend thousands monthly on NAT traffic alone.
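The two charges combine into a simple formula: gateways × hourly rate × hours, plus GB processed × per-GB rate. The sketch below works through it. The default rates of 0.045 USD are illustrative assumptions, so check current pricing for your Region before relying on the output.

```python
def nat_gateway_monthly_cost(gateways, gb_processed, hourly_rate=0.045,
                             per_gb_rate=0.045, hours=730):
    """Estimate monthly NAT Gateway cost in USD.

    hourly_rate and per_gb_rate are assumed example prices, not quoted ones.
    hours defaults to 730, a common approximation of one month.
    """
    hourly_charges = gateways * hourly_rate * hours
    processing_charges = gb_processed * per_gb_rate
    return round(hourly_charges + processing_charges, 2)


# Three gateways (one per AZ) pushing 50 TB a month through NAT.
print(nat_gateway_monthly_cost(gateways=3, gb_processed=50_000))
```

In this example the data processing charge dwarfs the hourly fee, which is why routing traffic around the gateway usually matters more than reducing the number of gateways.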

Replace NAT Traffic with VPC Endpoints

Example

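resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.main.id
  service_name      = "com.amazonaws.us-east-1.s3"
  vpc_endpoint_type = "Gateway"
  # Placeholder: list the route tables of the private subnets that reach S3
  route_table_ids   = [aws_route_table.private.id]
}

Gateway endpoints for S3 carry no hourly or per-GB charge, so same-Region S3 traffic routed this way stops incurring NAT data processing fees. Without the route table association, traffic keeps flowing through the NAT Gateway.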

Reduce Cross-AZ Traffic

Hidden cost source:

Service A → AZ-1
Service B → AZ-2

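Every cross-AZ request incurs a per-GB transfer charge, billed in each direction.

Kubernetes Affinity Rules

podAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
  - topologyKey: topology.kubernetes.io/zone
    labelSelector: {matchLabels: {app: service-b}}  # placeholder label

Keep chatty services co-located. The fragment sits under the pod spec's affinity field and schedules a pod into the same zone as the pods it talks to. Note that topologySpreadConstraints on the same key does the opposite: it spreads replicas across zones for availability. Apply co-location selectively, and prefer the softer preferredDuringSchedulingIgnoredDuringExecution form where a hard requirement could block scheduling.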

Part 5: Database Cost Optimisation — RDS Rightsizing, Aurora Serverless, Read Replica Pruning

Databases are expensive because teams fear touching them.

Reasonably so.

RDS Rightsizing

Monitor:

CPU
Connections
IOPS
Memory pressure

Example Downsize

db.r6g.4xlarge → db.r6g.xlarge

Often invisible to applications.

Massively visible to finance.

Aurora Serverless v2

Ideal for:

Variable workloads
Internal APIs
Intermittent services

Terraform Example

# Belongs inside an aws_rds_cluster resource
serverlessv2_scaling_configuration {
  min_capacity = 0.5   # Aurora capacity units (ACUs)
  max_capacity = 8
}

Read Replica Cleanup

Common anti-pattern

Temporary read replica
→ never removed
→ costs persist forever

Audit quarterly.
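A quarterly audit can start from a simple age check. The sketch below assumes the database inventory has been loaded into dictionaries with hypothetical id, replica_of, and created keys. The 90-day cut-off simply matches the quarterly cadence and is not an AWS recommendation.

```python
from datetime import datetime, timedelta, timezone


def stale_replicas(instances, now, max_age_days=90):
    """Return ids of read replicas older than max_age_days, oldest first.

    Each instance is a dict with 'id', 'replica_of' (None for primaries),
    and 'created' (a timezone-aware datetime).
    """
    cutoff = now - timedelta(days=max_age_days)
    stale = [
        inst for inst in instances
        if inst.get("replica_of") and inst["created"] < cutoff
    ]
    return [inst["id"] for inst in sorted(stale, key=lambda inst: inst["created"])]


utc = timezone.utc
fleet = [
    {"id": "orders", "replica_of": None, "created": datetime(2022, 3, 1, tzinfo=utc)},
    {"id": "orders-ro-1", "replica_of": "orders", "created": datetime(2024, 1, 1, tzinfo=utc)},
    {"id": "orders-ro-2", "replica_of": "orders", "created": datetime(2024, 5, 1, tzinfo=utc)},
]
print(stale_replicas(fleet, now=datetime(2024, 6, 1, tzinfo=utc)))
```

Age alone does not prove a replica is unused, so treat the result as a review list and check connection counts before removing anything.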

Part 6: Reserved Instances & Savings Plans — When to Buy and How Much

Savings Plans are powerful when used correctly.

Dangerous when guessed incorrectly.

Recommended Strategy

Start conservative.

Target:

60–70% baseline utilisation coverage

Never 100%.
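One way to turn that target into a number is to find the spend level you almost never drop below, then commit to a fraction of it. The sketch below assumes hourly compute spend history is available as a list. Using the 10th percentile as the baseline and 0.65 as the coverage factor are illustrative choices, and the calculation ignores the Savings Plan discount itself.

```python
import math


def baseline_hourly_spend(hourly_spend, percentile=10):
    """Nearest-rank percentile of hourly spend, used as the steady baseline."""
    if not hourly_spend:
        raise ValueError("need at least one hourly data point")
    ordered = sorted(hourly_spend)
    rank = max(1, math.ceil(len(ordered) * percentile / 100))
    return ordered[rank - 1]


def recommended_commitment(hourly_spend, coverage=0.65, percentile=10):
    """Hourly commitment covering `coverage` of the baseline spend."""
    return round(baseline_hourly_spend(hourly_spend, percentile) * coverage, 2)


# Hypothetical USD-per-hour samples, including one traffic spike.
history = [12.0, 10.0, 15.0, 11.0, 30.0, 10.5, 14.0, 13.0, 10.2, 18.0]
print(baseline_hourly_spend(history), recommended_commitment(history))
```

Anchoring on a low percentile rather than the average keeps spikes like the 30.0 sample from inflating the commitment.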

Compute Savings Plans

Best default option.

Flexible across:

Instance families
Regions
Compute types

AWS Recommendation API

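aws ce get-savings-plans-purchase-recommendation --savings-plans-type COMPUTE_SP \
  --term-in-years ONE_YEAR --payment-option NO_UPFRONT --lookback-period-in-days SIXTY_DAYS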

Use actual usage history.

Not optimism.

Part 7: Kubernetes Cost Optimisation — Bin Packing, Cluster Autoscaler, Spot Node Groups

Kubernetes amplifies both efficiency and waste.

Bin Packing

Underutilised nodes are financial dead weight.
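Bin packing is easiest to understand with a toy model. The sketch below uses first-fit decreasing, a classic packing heuristic, to place pod CPU requests onto identical nodes. It is not how the Kubernetes scheduler works: it ignores memory, system reservations, and affinity, and the request sizes are made up.

```python
def first_fit_decreasing(pod_cpu_requests, node_cpu):
    """Pack pod CPU requests (millicores) onto identical nodes.

    Returns a list of nodes, each a list of the requests placed on it.
    """
    nodes = []
    for request in sorted(pod_cpu_requests, reverse=True):
        if request > node_cpu:
            raise ValueError(f"request {request}m exceeds node capacity {node_cpu}m")
        for node in nodes:
            # Place the pod on the first node with enough room left.
            if sum(node) + request <= node_cpu:
                node.append(request)
                break
        else:
            # No existing node fits, so open a new one.
            nodes.append([request])
    return nodes


packed = first_fit_decreasing([2000, 500, 1500, 1000, 500, 2500], node_cpu=4000)
print(len(packed), packed)
```

Here 8000m of requests fit exactly onto two 4000m nodes. Inflate any single request and a third node appears, which is how oversized requests translate directly into extra nodes on the bill.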

Resource Requests Matter

Bad:

requests:
  cpu: "4"

Actual usage:

300m
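The gap between request and usage can be quantified, and a more defensible request derived from observed usage. This sketch assumes usage samples in millicores are already available from your metrics system. Sizing to the 95th percentile plus 20% headroom is an illustrative policy, not a Kubernetes default.

```python
import math


def request_utilisation(requested_millicores, used_millicores):
    """Percentage of the requested CPU that is actually used."""
    return round(used_millicores / requested_millicores * 100, 1)


def suggested_request(usage_samples, headroom_percent=20, percentile=95):
    """Nearest-rank percentile of usage plus headroom, in whole millicores."""
    if not usage_samples:
        raise ValueError("need at least one usage sample")
    ordered = sorted(usage_samples)
    rank = max(1, math.ceil(len(ordered) * percentile / 100))
    return math.ceil(ordered[rank - 1] * (100 + headroom_percent) / 100)


# The example above: 4 CPUs requested, roughly 300m used.
print(request_utilisation(4000, 300))

usage = [250, 300, 280, 320, 310, 290, 305, 295, 300, 400]
print(suggested_request(usage))
```

A request of 4 CPUs against 300m of usage means over 90% of the reserved capacity is paid for and never touched.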

Goldilocks Recommendation Tool

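helm repo add fairwinds-stable https://charts.fairwinds.com/stable
helm install goldilocks fairwinds-stable/goldilocks --namespace goldilocks --create-namespace

Goldilocks relies on the Vertical Pod Autoscaler to generate its recommendations and only reports on namespaces labelled goldilocks.fairwinds.com/enabled=true. It suggests request sizing. It does not apply it.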

Cluster Autoscaler

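--scale-down-utilization-threshold=0.5
--balance-similar-node-groups=true

The first flag drives scale-down: a node whose requested resources stay below the threshold for the unneeded time (10 minutes by default) is drained and removed. The value shown is the default. The second flag keeps similar node groups evenly sized, which matters when you run one group per zone.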

Spot Node Groups

Example

capacityType: SPOT

Excellent for:

Stateless apps
Batch workers
CI runners

Part 8: Monitoring Cost Creep — Alerting on Unexpected Spend Increases

Cost optimisation without monitoring regresses rapidly.

Budget Alerts

AWS Example

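aws budgets create-budget --account-id 123456789012 \
  --budget file://budget.json --notifications-with-subscribers file://notifications.json

The account ID and file names are placeholders. budget.json defines the limit and period, and notifications.json defines the thresholds and recipients.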

Prometheus Cost Alert

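groups:
- name: cloud_cost_alerts
  rules:
  # cloud_cost_total is assumed to be a cumulative spend counter from your own cost exporter
  - alert: DailySpendSpike
    expr: increase(cloud_cost_total[24h]) > 1000
    labels:
      severity: warning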
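A fixed threshold works for a stable bill but needs retuning as spend grows. One alternative is to compare each day against its own recent history. The sketch below flags a day that exceeds the trailing mean by three standard deviations. The three-sigma cut-off and the sample figures are illustrative assumptions.

```python
from statistics import mean, stdev


def is_spend_anomaly(history, today, sigmas=3.0):
    """True if today's spend exceeds the trailing mean by `sigmas` std devs."""
    if len(history) < 2:
        raise ValueError("need at least two days of history")
    threshold = mean(history) + sigmas * stdev(history)
    return today > threshold


# Hypothetical daily spend in USD for the past week.
last_week = [100, 102, 98, 101, 99, 100, 100]
print(is_spend_anomaly(last_week, today=150))
print(is_spend_anomaly(last_week, today=103))
```

Because the threshold moves with the data, a team spending ten times as much gets an alert calibrated to its own baseline rather than to a number chosen once and forgotten.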

Slack Notification Example

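import os
import requests

# Incoming-webhook URL read from the environment (variable name is a placeholder)
webhook_url = os.environ["SLACK_WEBHOOK_URL"]

response = requests.post(
    webhook_url,
    json={"text": "Cloud spend increased unexpectedly"},
    timeout=10,
)
response.raise_for_status()  # fail loudly if Slack rejects the message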

Immediate visibility changes behaviour.

Part 9: The Monthly Cost Review Checklist

The best FinOps teams operationalise review cadence.

Monthly Checklist

Compute

Idle instances removed
Rightsizing opportunities reviewed
Spot coverage audited

Storage

Snapshot retention reviewed
Glacier transitions verified
Orphaned volumes deleted

Kubernetes

Node utilisation checked
Resource requests audited
Cluster Autoscaler effectiveness reviewed

Networking

NAT Gateway spend analysed
Cross-AZ and cross-region traffic reviewed

Databases

Read replicas validated
Aurora scaling reviewed

Governance

Untagged resources identified
Budget alerts tested

Appendix: Azure and GCP Equivalents

Compute

FinOps is not about making infrastructure cheap.

It is about making infrastructure intentional.

The most effective DevOps teams treat cloud cost as an engineering metric alongside latency, reliability, and deployment frequency.

Because every oversized node, forgotten snapshot, or unnecessary NAT transfer represents engineering inefficiency expressed financially.

The progression usually looks like this:

Visibility → Attribution → Accountability → Optimisation

Without visibility, optimisation is guesswork.

Without attribution, accountability disappears.

Without accountability, cloud spend becomes entropy.

But when engineers own both infrastructure reliability and infrastructure economics, something powerful happens:

Systems become leaner.
Architectures become cleaner.
And cloud bills stop being monthly surprises.
