varun varde


FinOps for DevOps Engineers: The Complete Cloud Cost Optimisation Playbook

Cloud bills rarely explode because of one catastrophic decision. They grow incrementally. Quietly. A forgotten load balancer here. Overprovisioned Kubernetes nodes there. NAT Gateway traffic multiplying invisibly in the background like fiscal mould behind drywall.

Most organisations approach FinOps as a finance exercise. That is a strategic mistake.

The engineers provisioning infrastructure are the same engineers best positioned to optimise it. DevOps teams control autoscaling, storage policies, networking topology, observability retention, and workload scheduling. They are not adjacent to cloud cost optimisation. They are the operational epicentre of it.

This playbook focuses on practical FinOps implementation for DevOps and platform engineers. Not abstract governance theory. Actual engineering patterns that reduce spend without degrading reliability.

The optimisation path is organised by return on investment. Start with visibility. Then tackle compute, storage, networking, databases, commitment discounts, and Kubernetes, and finish with cost monitoring and a monthly review cadence.

Part 1: Visibility First — Tagging Standards and Cost Attribution

You cannot optimise what you cannot attribute.

Most cloud environments fail at cost management because nobody knows which team owns what.

The Minimum Viable Tagging Standard

Every resource should carry at least these tags:

tags = {
  team         = "platform-engineering"
  environment  = "production"
  application  = "checkout-api"
  cost-centre  = "ENG-042"
  owner        = "payments-team"
}
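A tagging standard only holds if something checks it. The sketch below shows one way to flag resources that miss the required keys. The function name and the idea of feeding it a plain dict of tags are assumptions for illustration, not part of any AWS API.

```python
# Required keys taken from the tagging standard above.
REQUIRED_TAGS = ("team", "environment", "application", "cost-centre", "owner")


def missing_tags(tags: dict) -> list:
    """Return the required tag keys that are absent or blank, in a stable order."""
    return [key for key in REQUIRED_TAGS if not (tags.get(key) or "").strip()]


if __name__ == "__main__":
    # Hypothetical resource that only carries two of the five tags.
    resource_tags = {"team": "platform-engineering", "environment": "production"}
    print(missing_tags(resource_tags))
```

A check like this could run in CI against a Terraform plan or on a schedule against a resource inventory. Where it runs matters less than running it consistently.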

Why Tags Matter

Without tags

Cloud bill = giant undifferentiated blob

With tags

Cloud bill = attributable operational data

This changes engineering behaviour immediately.

AWS Cost Allocation Tags

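Tags only show up in billing data after they are activated as cost allocation tags. First list the tag keys AWS has discovered, along with their current status

aws ce list-cost-allocation-tags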

Then activate them in the console (or with aws ce update-cost-allocation-tags-status)

Billing Console → Cost Allocation Tags → Activate

Cost Dashboard Strategy

Build dashboards around:

  • Cost by team

  • Cost by environment

  • Cost by service

  • Week-over-week growth

  • Top anomalous resources
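Two of these dashboard views reduce to very small calculations. The sketch below assumes billing line items have already been exported into dicts with a `cost` and a `tags` field. That shape is hypothetical, so adapt it to whatever your cost export actually produces.

```python
from collections import defaultdict


def cost_by_team(line_items):
    """Sum cost per team; items without a team tag land in 'untagged'."""
    totals = defaultdict(float)
    for item in line_items:
        team = item.get("tags", {}).get("team") or "untagged"
        totals[team] += item["cost"]
    return dict(totals)


def week_over_week_growth(previous_week, current_week):
    """Return fractional growth, or None when the previous week was zero."""
    if previous_week == 0:
        return None
    return (current_week - previous_week) / previous_week
```

The size of the `untagged` bucket is a useful metric in its own right. If it is large, attribution is the first problem to solve.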

Part 2: Compute Optimisation — Rightsizing, Spot, Graviton

Compute is usually the largest controllable expense category.

And most environments are dramatically oversized.

Rightsizing EC2 Instances

Example

m5.4xlarge
Average CPU: 9%

This is not infrastructure. It is financial leakage.

Identify Idle Instances

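Use CloudWatch CPUUtilization metrics over a representative window (for example, two weeks) to find instances that never leave single-digit CPU.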
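One way to turn CloudWatch data into an idle flag is sketched below. It assumes the datapoints were already fetched (for instance with boto3's CloudWatch `get_metric_statistics` for the `AWS/EC2` `CPUUtilization` metric) and arrive as dicts with an `Average` key. The 10% threshold and 24-sample minimum are illustrative assumptions, not recommendations.

```python
def is_idle(datapoints, cpu_threshold=10.0, min_samples=24):
    """Flag an instance as idle when every sampled average CPU is below the threshold.

    `datapoints` mirrors the shape CloudWatch returns: dicts with an "Average" key.
    Returns False when there is too little history to judge.
    """
    if len(datapoints) < min_samples:
        return False
    return all(point["Average"] < cpu_threshold for point in datapoints)
```

CPU alone can mislead for memory-bound or network-bound workloads, so treat a positive result as a prompt to investigate rather than a signal to terminate.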

Spot Instances

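Spot capacity is typically 70–90% cheaper than On-Demand, and AWS quotes savings of up to 90%. The trade-off: instances can be reclaimed with a two-minute warning, so workloads must tolerate interruption.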

Perfect for:

  • CI runners

  • Batch jobs

  • Non-critical workloads

  • Kubernetes worker nodes

Terraform Spot Example

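resource "aws_instance" "spot_worker" {
  ami           = var.arm64_ami_id # assumed variable; m7g needs an arm64 AMI
  instance_type = "m7g.large"

  # Request Spot capacity instead of On-Demand
  instance_market_options {
    market_type = "spot"
  }
}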

AWS Graviton Migration

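Graviton (arm64) instances are usually priced below comparable x86 instances, and AWS quotes up to 40% better price-performance. Actual savings depend on the workload, so benchmark before migrating.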

Migration Candidate Checklist

Best workloads:

  • Stateless APIs
  • Containers
  • Node.js
  • Go
  • Java 17+

Kubernetes Node Group Example

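apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: cost-optimised-cluster # placeholder
  region: us-east-1            # placeholder

managedNodeGroups:
  - name: graviton-workers
    instanceType: m7g.large # arm64; images must be arm64 or multi-arch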

Part 3: Storage Optimisation — S3 Tiers, EBS, Lifecycle Policies

Storage inefficiency compounds silently over years.

S3 Lifecycle Policies

The fastest storage win in AWS.

Terraform Lifecycle Policy

resource "aws_s3_bucket_lifecycle_configuration" "cost_optimised" {
  bucket = aws_s3_bucket.data.id

  rule {
    id     = "archive-old-data"
    status = "Enabled"

    # Empty filter applies the rule to every object in the bucket
    filter {}

    transition {
      days          = 30
      storage_class = "INTELLIGENT_TIERING"
    }

    transition {
      days          = 90
      storage_class = "GLACIER_IR"
    }

    transition {
      days          = 365
      storage_class = "DEEP_ARCHIVE"
    }
  }
}
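To sanity-check a lifecycle rule before applying it, it can help to model where an object of a given age ends up. The sketch below copies the thresholds from the rule above. The function name is hypothetical, and real transitions are applied asynchronously by S3, so the result is approximate at the day boundaries.

```python
# Transition schedule copied from the lifecycle rule above: (minimum age in days, class).
TRANSITIONS = [(30, "INTELLIGENT_TIERING"), (90, "GLACIER_IR"), (365, "DEEP_ARCHIVE")]


def storage_class_for_age(age_days, initial_class="STANDARD"):
    """Return the storage class an object of this age should have reached."""
    current = initial_class
    for min_age, storage_class in TRANSITIONS:
        if age_days >= min_age:
            current = storage_class
    return current
```

Running this over an inventory of object ages gives a rough picture of how much data each tier will hold once the policy has settled.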

EBS Optimisation

Common waste patterns:

  • Detached volumes
  • Oversized gp3 disks
  • Unused snapshots

Find Unattached Volumes

aws ec2 describe-volumes \
  --filters Name=status,Values=available

Snapshot anything you might still need, then delete unattached volumes, unless compliance requires otherwise.
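Putting a price on orphaned volumes tends to speed up the cleanup. The sketch below parses the JSON that `aws ec2 describe-volumes` prints and estimates the monthly cost. The default of 0.08 USD per GB-month is an assumed illustrative rate, so substitute the price for your region and volume type.

```python
import json


def unattached_volume_cost(describe_volumes_json, price_per_gb_month=0.08):
    """Return (volume_ids, estimated_monthly_cost) for volumes in the 'available' state.

    `price_per_gb_month` is an assumption; real prices vary by region and volume type.
    """
    volumes = json.loads(describe_volumes_json)["Volumes"]
    unattached = [v for v in volumes if v["State"] == "available"]
    total_gb = sum(v["Size"] for v in unattached)
    return [v["VolumeId"] for v in unattached], total_gb * price_per_gb_month
```

The output of the CLI command above can be passed straight in as a string. The state check is repeated inside the function so it also works on unfiltered output.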

Part 4: Networking Cost Reduction — NAT Gateway, VPC Endpoints, Data Transfer

Networking costs surprise almost everyone.

Especially NAT Gateways.

NAT Gateway Optimisation

NAT Gateway charges include:

  • Hourly fee
  • Per-GB transfer fee

Large clusters can spend thousands monthly on NAT traffic alone.
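The arithmetic behind that claim is simple enough to sketch. The rates below (0.045 USD per gateway-hour and 0.045 USD per GB processed) are assumptions for illustration. Check current pricing for your region before relying on the numbers.

```python
HOURS_PER_MONTH = 730  # average hours in a month


def nat_gateway_monthly_cost(gb_processed, gateways=1, hourly_rate=0.045, per_gb_rate=0.045):
    """Estimate monthly NAT Gateway spend: fixed hourly fee plus per-GB processing.

    Both rates are illustrative assumptions, not quoted prices.
    """
    hourly = gateways * hourly_rate * HOURS_PER_MONTH
    processing = gb_processed * per_gb_rate
    return hourly + processing
```

With these assumed rates, three gateways processing 50 TB a month come to roughly 2,350 USD, nearly all of it from the per-GB fee. That is why routing traffic around the NAT matters more than trimming the gateway count.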

Replace NAT Traffic with VPC Endpoints

Example

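resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.main.id
  service_name      = "com.amazonaws.us-east-1.s3"
  vpc_endpoint_type = "Gateway"

  # Assumed route table name; without an association, traffic still flows via NAT
  route_table_ids = [aws_route_table.private.id]
}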

This removes NAT data processing charges for S3 traffic within the same Region, and gateway endpoints for S3 carry no additional charge.

Reduce Cross-AZ Traffic

Hidden cost source:

Service A → AZ-1
Service B → AZ-2

Every request that crosses an AZ boundary is billed per GB, in both directions.
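A rough estimate shows how quickly this adds up for chatty services. The sketch below assumes a rate of 0.01 USD per GB in each direction and treats `avg_payload_kb` as the combined request and response size. Both are simplifying assumptions.

```python
def cross_az_monthly_cost(requests_per_second, avg_payload_kb, rate_per_gb_each_way=0.01):
    """Estimate monthly cross-AZ transfer cost for a steady request rate.

    The per-GB rate is an assumed illustrative figure, charged on both sides.
    """
    seconds_per_month = 730 * 3600
    gb = requests_per_second * avg_payload_kb * seconds_per_month / (1024 * 1024)
    return gb * rate_per_gb_each_way * 2  # billed on the sending and the receiving side
```

Under these assumptions, 1,000 requests per second at 10 KB each costs about 500 USD a month for a single service pair.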

Kubernetes Affinity Rules

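podAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
  - topologyKey: topology.kubernetes.io/zone
    labelSelector: {matchLabels: {app: service-b}}

Keep chatty services co-located. The fragment sits under the pod spec's affinity field, and app: service-b is a placeholder label. Note that topologySpreadConstraints on the same key does the opposite: it spreads pods across zones, which helps availability but not transfer cost. Co-location concentrates failure risk in one zone, so reserve it for service pairs where the traffic volume justifies the trade-off.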

Part 5: Database Cost Optimisation — RDS Rightsizing, Aurora Serverless, Read Replica Pruning

Databases are expensive because teams fear touching them.

Reasonably so.

RDS Rightsizing

Monitor:

  • CPU
  • Connections
  • IOPS
  • Memory pressure

Example Downsize

db.r6g.4xlarge → db.r6g.xlarge

Often invisible to applications.

Massively visible to finance.
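Before attempting a downsize like this, it is worth checking whether observed peaks would fit on the target. The sketch below is a deliberately crude model: it scales CPU linearly with vCPU count and uses an assumed 70% headroom ceiling. It ignores IOPS and connection limits, which need their own checks.

```python
def fits_smaller_instance(peak_cpu_percent, peak_memory_used_gib,
                          current_vcpus, target_vcpus, target_memory_gib,
                          headroom=0.7):
    """True when observed peaks would stay under `headroom` of the target's capacity.

    Assumes CPU demand scales linearly with vCPU count, which is a simplification.
    """
    projected_cpu = peak_cpu_percent * current_vcpus / target_vcpus
    cpu_ok = projected_cpu <= headroom * 100
    memory_ok = peak_memory_used_gib <= headroom * target_memory_gib
    return cpu_ok and memory_ok
```

For the example above, db.r6g.4xlarge has 16 vCPUs and db.r6g.xlarge has 4 vCPUs with 32 GiB of memory, so a database peaking at 12% CPU and 18 GiB used would pass this check.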

Aurora Serverless v2

Ideal for:

  • Variable workloads
  • Internal APIs
  • Intermittent services

Terraform Example

serverlessv2_scaling_configuration {
  min_capacity = 0.5
  max_capacity = 8
}

Read Replica Cleanup

Common anti-pattern

Temporary read replica
→ never removed
→ costs persist forever

Audit quarterly.

Part 6: Reserved Instances & Savings Plans — When to Buy and How Much

Savings Plans are powerful when used correctly.

Dangerous when guessed incorrectly.

Recommended Strategy

Start conservative.

Target:

60–70% baseline utilisation coverage

Never 100%.
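One conservative way to size a commitment is to find a low-percentile hourly spend and commit to a fraction of that. The sketch below uses on-demand hourly spend as a proxy for the commitment amount, which is a simplification because Savings Plans are measured at discounted rates. The percentile and coverage defaults are assumptions.

```python
def recommended_commitment(hourly_spend, coverage=0.65, baseline_percentile=0.10):
    """Return a commitment in USD/hour: `coverage` of a low-percentile hourly spend.

    Uses a simple nearest-rank percentile on the sorted samples.
    """
    if not hourly_spend:
        raise ValueError("need at least one hourly sample")
    ordered = sorted(hourly_spend)
    index = int(baseline_percentile * (len(ordered) - 1))
    baseline = ordered[index]
    return round(baseline * coverage, 2)
```

Anchoring on a low percentile rather than the average keeps the commitment below the quiet hours, so it stays fully used even when traffic dips.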

Compute Savings Plans

Best default option.

Flexible across:

  • Instance families
  • Regions
  • Compute types

AWS Recommendation API

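aws ce get-savings-plans-purchase-recommendation --savings-plans-type COMPUTE_SP \
  --term-in-years ONE_YEAR --payment-option NO_UPFRONT --lookback-period-in-days THIRTY_DAYS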

Use actual usage history.

Not optimism.

Part 7: Kubernetes Cost Optimisation — Bin Packing, Cluster Autoscaler, Spot Node Groups

Kubernetes amplifies both efficiency and waste.

Bin Packing

Underutilised nodes are financial dead weight.

Resource Requests Matter

Bad:

requests:
  cpu: "4"

Actual usage:

300m
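The gap between request and usage is easy to quantify once both values are in the same unit. The sketch below converts Kubernetes CPU quantities to millicores and computes the ratio. The function names are hypothetical, and it handles only plain cores and the `m` suffix.

```python
def cpu_millicores(quantity: str) -> int:
    """Convert a Kubernetes CPU quantity such as "4", "0.5" or "300m" to millicores."""
    quantity = quantity.strip()
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def request_utilisation(requested: str, used: str) -> float:
    """Fraction of the CPU request that is actually used."""
    return cpu_millicores(used) / cpu_millicores(requested)
```

For the example above, 300m against a request of 4 cores is 7.5% utilisation. The scheduler still reserves the remaining 92.5% on the node, where no other pod can claim it.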

Goldilocks Recommendation Tool

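helm repo add fairwinds-stable https://charts.fairwinds.com/stable
helm install goldilocks fairwinds-stable/goldilocks --namespace goldilocks --create-namespace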

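Goldilocks relies on the Vertical Pod Autoscaler in recommendation mode and suggests request sizing for workloads in namespaces labelled goldilocks.fairwinds.com/enabled=true.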

Cluster Autoscaler

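--balance-similar-node-groups=true
--scale-down-utilization-threshold=0.5

The autoscaler removes nodes whose utilisation stays below the scale-down threshold once their pods can be rescheduled elsewhere. Balancing similar node groups keeps equivalent groups evenly sized across zones.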

Spot Node Groups

Example

capacityType: SPOT

Excellent for:

  • Stateless apps
  • Batch workers
  • CI runners

Part 8: Monitoring Cost Creep — Alerting on Unexpected Spend Increases

Cost optimisation without monitoring regresses rapidly.

Budget Alerts

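AWS example (the account ID and JSON file names are placeholders)

aws budgets create-budget --account-id 111122223333 \
  --budget file://budget.json --notifications-with-subscribers file://notifications.json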

Prometheus Cost Alert

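groups:
- name: cloud_cost_alerts
  rules:
  # cloud_cost_total is assumed to be a USD counter published by a cost exporter
  - alert: DailySpendSpike
    expr: increase(cloud_cost_total[24h]) > 1000
    labels:
      severity: warning
    annotations:
      summary: "Cloud spend rose by more than $1000 in 24 hours"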

Slack Notification Example

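import os

import requests

# SLACK_WEBHOOK_URL is an assumed environment variable holding the incoming-webhook URL
webhook_url = os.environ["SLACK_WEBHOOK_URL"]

response = requests.post(
    webhook_url,
    json={"text": "Cloud spend increased unexpectedly"},
    timeout=10,
)
response.raise_for_status()  # fail loudly if Slack rejects the message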

Immediate visibility changes behaviour.
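A fixed dollar threshold misses slow creep on small accounts and fires constantly on large ones. One alternative is to compare the latest day against recent history, as in the sketch below. The three-sigma rule and seven-day minimum are assumptions, and the function name is hypothetical.

```python
from statistics import mean, stdev


def is_spend_spike(daily_costs, sigma=3.0, min_history=7):
    """Return True when the latest day exceeds mean + sigma * stdev of the preceding days.

    `daily_costs` is ordered oldest to newest; the last element is the day under test.
    """
    if len(daily_costs) < min_history + 1:
        return False  # not enough history to establish a baseline
    *history, latest = daily_costs
    threshold = mean(history) + sigma * stdev(history)
    return latest > threshold
```

The boolean result could gate a notification like the Slack call above, so a message is sent only when spend departs from its own recent pattern.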

Part 9: The Monthly Cost Review Checklist

The best FinOps teams operationalise review cadence.

Monthly Checklist
Compute

  • Idle instances removed
  • Rightsizing opportunities reviewed
  • Spot coverage audited

Storage

  • Snapshot retention reviewed
  • Glacier transitions verified
  • Orphaned volumes deleted

Kubernetes

  • Node utilisation checked
  • Resource requests audited
  • Cluster Autoscaler effectiveness reviewed

Networking

  • NAT Gateway spend analysed
  • Cross-AZ and cross-region traffic reviewed

Databases

  • Read replicas validated
  • Aurora scaling reviewed

Governance

  • Untagged resources identified
  • Budget alerts tested

Appendix: Azure and GCP Equivalents

Compute

FinOps is not about making infrastructure cheap.

It is about making infrastructure intentional.

The most effective DevOps teams treat cloud cost as an engineering metric alongside latency, reliability, and deployment frequency.

Because every oversized node, forgotten snapshot, or unnecessary NAT transfer represents engineering inefficiency expressed financially.

The progression usually looks like this:

Visibility → Attribution → Accountability → Optimisation

Without visibility, optimisation is guesswork.

Without attribution, accountability disappears.

Without accountability, cloud spend becomes entropy.

But when engineers own both infrastructure reliability and infrastructure economics, something powerful happens:

Systems become leaner.
Architectures become cleaner.
And cloud bills stop being monthly surprises.
