Cloud bills rarely explode because of one catastrophic decision. They grow incrementally. Quietly. A forgotten load balancer here. Overprovisioned Kubernetes nodes there. NAT Gateway traffic multiplying invisibly in the background like fiscal mold behind drywall.
Most organisations approach FinOps as a finance exercise. That is a strategic mistake.
The engineers provisioning infrastructure are the same engineers best positioned to optimise it. DevOps teams control autoscaling, storage policies, networking topology, observability retention, and workload scheduling. They are not adjacent to cloud cost optimisation. They are the operational epicentre of it.
This playbook focuses on practical FinOps implementation for DevOps and platform engineers. Not abstract governance theory. Actual engineering patterns that reduce spend without degrading reliability.
The optimisation path is organised roughly by return on investment. Start with visibility. Then tackle compute, storage, networking, databases, commitment discounts, and Kubernetes, and finish with spend monitoring and a monthly review cadence.
Part 1: Visibility First — Tagging Standards and Cost Attribution
You cannot optimise what you cannot attribute.
Most cloud environments fail at cost management because nobody knows which team owns what.
The Minimum Viable Tagging Standard
Every resource should carry at least:
tags = {
  team        = "platform-engineering"
  environment = "production"
  application = "checkout-api"
  cost-centre = "ENG-042"
  owner       = "payments-team"
}
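A tagging standard only helps if something checks it. The sketch below is one way to do that, assuming a hypothetical inventory shaped as a dictionary of resource IDs to tag maps; the function names and the sample data are illustrative, not part of any AWS API.

```python
# Required keys taken from the minimum viable tagging standard above.
REQUIRED_TAGS = ("team", "environment", "application", "cost-centre", "owner")


def missing_tags(tags, required=REQUIRED_TAGS):
    """Return the required tag keys that are absent or blank, in order."""
    return [key for key in required if not str(tags.get(key, "")).strip()]


def untagged_resources(resources):
    """Map resource id -> missing tag keys for resources that fail the standard.

    `resources` is a hypothetical inventory: {resource_id: {tag_key: tag_value}}.
    """
    report = {}
    for resource_id, tags in resources.items():
        gaps = missing_tags(tags)
        if gaps:
            report[resource_id] = gaps
    return report


# Illustrative data only.
inventory = {
    "i-0abc": {"team": "platform-engineering", "environment": "production",
               "application": "checkout-api", "cost-centre": "ENG-042",
               "owner": "payments-team"},
    "vol-0def": {"team": "platform-engineering", "owner": ""},
}
print(untagged_resources(inventory))
# {'vol-0def': ['environment', 'application', 'cost-centre', 'owner']}
```

A blank value counts as missing here, on the assumption that an empty owner tag is no more useful for attribution than an absent one.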
Why Tags Matter
Without tags
Cloud bill = giant undifferentiated blob
With tags
Cloud bill = attributable operational data
This changes engineering behaviour immediately.
AWS Cost Allocation Tags
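User-defined tags only appear in cost reports after they are activated as cost allocation tags. First list the tag keys AWS has discovered:
aws ce list-cost-allocation-tags
Then activate the ones you need:
Billing Console → Cost Allocation Tags → Activate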
Cost Dashboard Strategy
Build dashboards around:
Cost by team
Cost by environment
Cost by service
Week-over-week growth
Top anomalous resources
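The week-over-week view is simple arithmetic once costs are grouped by tag. The following sketch assumes two hypothetical snapshots of weekly cost per team (for example, exported from a cost report); the figures are invented for illustration.

```python
def week_over_week_growth(previous, current):
    """Return {team: percent change} between two weekly cost snapshots.

    Teams with no spend in the previous week map to None, because a
    percentage change from zero is undefined.
    """
    growth = {}
    for team in sorted(set(previous) | set(current)):
        before = previous.get(team, 0.0)
        after = current.get(team, 0.0)
        if before == 0:
            growth[team] = None
        else:
            growth[team] = round((after - before) / before * 100, 1)
    return growth


# Illustrative figures only.
last_week = {"platform-engineering": 1200.0, "payments-team": 800.0}
this_week = {"platform-engineering": 1500.0, "payments-team": 760.0, "data-team": 90.0}
print(week_over_week_growth(last_week, this_week))
```

Reporting new spenders as None rather than as an infinite percentage keeps them visible on the dashboard without distorting the ranking.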
Part 2: Compute Optimisation — Rightsizing, Spot, Graviton
Compute is usually the largest controllable expense category.
And most environments are dramatically oversized.
Rightsizing EC2 Instances
Example
m5.4xlarge
Average CPU: 9%
This is not infrastructure. It is financial leakage.
Identify Idle Instances
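Use the CloudWatch CPUUtilization metric (namespace AWS/EC2), averaged over at least a couple of weeks, to find instances that never get busy.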
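As a minimal sketch of the idle check, the function below takes datapoints shaped like the "Datapoints" list that CloudWatch GetMetricStatistics returns when the Average and Maximum statistics are requested. The function name and both thresholds are assumptions chosen for illustration, not AWS guidance.

```python
def is_idle(datapoints, avg_threshold=10.0, peak_threshold=40.0):
    """Flag an instance as idle when both its mean and its peak CPU stay low.

    Each datapoint is a dict with "Average" and "Maximum" percentages.
    Checking the peak as well avoids flagging bursty workloads that look
    quiet on average.
    """
    if not datapoints:
        return False  # no data is not evidence of idleness
    mean_cpu = sum(p["Average"] for p in datapoints) / len(datapoints)
    peak_cpu = max(p["Maximum"] for p in datapoints)
    return mean_cpu < avg_threshold and peak_cpu < peak_threshold


# Illustrative datapoints only.
quiet = [{"Average": 8.0, "Maximum": 15.0}, {"Average": 10.0, "Maximum": 22.0}]
bursty = [{"Average": 9.0, "Maximum": 95.0}]
print(is_idle(quiet), is_idle(bursty))  # True False
```

CPU alone can mislead for memory-bound or I/O-bound services, so treat a positive result as a prompt to investigate rather than a decision to terminate.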
Spot Instances
Spot pricing can reduce costs by up to 90% compared with On-Demand, in exchange for the risk of interruption at two minutes' notice.
Perfect for:
CI runners
Batch jobs
Non-critical workloads
Kubernetes worker nodes
Terraform Spot Example
resource "aws_instance" "spot_worker" {
  ami           = var.arm64_ami_id # hypothetical variable; m7g needs an arm64 AMI
  instance_type = "m7g.large"

  instance_market_options {
    market_type = "spot"
  }
}
AWS Graviton Migration
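Graviton instances routinely reduce compute costs by 20–40% compared with equivalent x86 instances, provided the workload and its dependencies build and run on arm64.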
Migration Candidate Checklist
Best workloads:
- Stateless APIs
- Containers
- Node.js
- Go
- Java 17+
Kubernetes Node Group Example
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: cost-optimised   # hypothetical cluster name
  region: us-east-1
managedNodeGroups:
  - name: graviton-workers
    instanceType: m7g.large
Part 3: Storage Optimisation — S3 Tiers, EBS, Lifecycle Policies
Storage inefficiency compounds silently over years.
S3 Lifecycle Policies
The fastest storage win in AWS.
Terraform Lifecycle Policy
resource "aws_s3_bucket_lifecycle_configuration" "cost_optimised" {
  bucket = aws_s3_bucket.data.id

  rule {
    id     = "archive-old-data"
    status = "Enabled"

    # An empty filter applies the rule to every object in the bucket.
    filter {}

    transition {
      days          = 30
      storage_class = "INTELLIGENT_TIERING"
    }
    transition {
      days          = 90
      storage_class = "GLACIER_IR"
    }
    transition {
      days          = 365
      storage_class = "DEEP_ARCHIVE"
    }
  }
}
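To sanity-check a policy like this before applying it, it can help to model which storage class an object of a given age should be in. The sketch below mirrors the three transitions above; the function name is hypothetical, and it ignores the fact that S3 evaluates lifecycle rules on its own schedule rather than at the exact day boundary.

```python
# (minimum age in days, storage class), mirroring the three transitions above.
TRANSITIONS = [(30, "INTELLIGENT_TIERING"), (90, "GLACIER_IR"), (365, "DEEP_ARCHIVE")]


def storage_class_for_age(age_days, transitions=TRANSITIONS, default="STANDARD"):
    """Return the storage class an object of `age_days` should have reached."""
    current = default
    for min_age, storage_class in sorted(transitions):
        if age_days >= min_age:
            current = storage_class
    return current


for age in (10, 30, 120, 400):
    print(age, storage_class_for_age(age))
```

Running bucket inventory ages through a function like this gives a rough picture of how much data would land in each tier.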
EBS Optimisation
Common waste patterns:
- Detached volumes
- Oversized gp3 disks
- Unused snapshots
Find Unattached Volumes
aws ec2 describe-volumes \
  --filters Name=status,Values=available
Delete what this returns, unless compliance requires otherwise.
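The JSON from describe-volumes is easy to summarise before anyone deletes anything. This sketch assumes the command was run with JSON output and relies on the "Volumes", "VolumeId", "Size" and "State" fields of that response; the function name and the sample payload are illustrative.

```python
import json


def summarise_unattached(describe_volumes_json):
    """Return (volume_ids, total_gib) for volumes in the 'available' state."""
    volumes = json.loads(describe_volumes_json).get("Volumes", [])
    unattached = [v for v in volumes if v.get("State") == "available"]
    ids = [v["VolumeId"] for v in unattached]
    total_gib = sum(v.get("Size", 0) for v in unattached)
    return ids, total_gib


# Illustrative payload only; real output contains many more fields.
sample = json.dumps({"Volumes": [
    {"VolumeId": "vol-0aaa", "Size": 500, "State": "available"},
    {"VolumeId": "vol-0bbb", "Size": 100, "State": "in-use"},
    {"VolumeId": "vol-0ccc", "Size": 250, "State": "available"},
]})
print(summarise_unattached(sample))  # (['vol-0aaa', 'vol-0ccc'], 750)
```

The state is re-checked in code so the summary stays correct even if the CLI filter is omitted.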
Part 4: Networking Cost Reduction — NAT Gateway, VPC Endpoints, Data Transfer
Networking costs surprise almost everyone.
Especially NAT Gateways.
NAT Gateway Optimisation
NAT Gateway charges include:
- Hourly fee
- Per-GB transfer fee
Large clusters can spend thousands monthly on NAT traffic alone.
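The two charges combine into a simple formula: gateways × hours × hourly rate, plus gigabytes processed × per-GB rate. The sketch below makes that explicit. The rates passed in the example are placeholders, not quoted prices; check the current pricing for your region before relying on the result.

```python
def nat_gateway_monthly_cost(gb_processed, hourly_rate, per_gb_rate,
                             gateways=1, hours=730):
    """Estimate monthly NAT Gateway cost from its two billing dimensions.

    `hours` defaults to 730, a common approximation of one month.
    """
    fixed = gateways * hours * hourly_rate
    variable = gb_processed * per_gb_rate
    return round(fixed + variable, 2)


# Placeholder rates and traffic volume, for illustration only.
estimate = nat_gateway_monthly_cost(
    gb_processed=50_000, hourly_rate=0.045, per_gb_rate=0.045, gateways=3
)
print(estimate)
```

With numbers like these, the per-GB component dwarfs the hourly fee, which is why reducing traffic through the gateway usually matters more than reducing the number of gateways.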
Replace NAT Traffic with VPC Endpoints
Example
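resource "aws_vpc_endpoint" "s3" {
  vpc_id          = aws_vpc.main.id
  service_name    = "com.amazonaws.us-east-1.s3"
  route_table_ids = [aws_route_table.private.id] # hypothetical route table; unassociated endpoints carry no traffic
}
This removes NAT data processing charges for S3 traffic from subnets that use the associated route tables. Gateway endpoints for S3 and DynamoDB have no hourly or per-GB charge of their own.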
Reduce Cross-AZ Traffic
Hidden cost source:
Service A → AZ-1
Service B → AZ-2
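Every request between them incurs per-GB cross-AZ transfer charges, billed in both directions.
Kubernetes Affinity Rules
podAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    - topologyKey: topology.kubernetes.io/zone
      labelSelector: {matchLabels: {app: service-b}}
Keep chatty services co-located. This fragment belongs under a pod's affinity field, and the app: service-b label is a placeholder. Note that topologySpreadConstraints on the same key does the opposite: it spreads pods across zones for resilience. A required rule trades availability for locality, so use preferredDuringSchedulingIgnoredDuringExecution where zone-failure tolerance matters more than transfer cost.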
Part 5: Database Cost Optimisation — RDS Rightsizing, Aurora Serverless, Read Replica Pruning
Databases are expensive because teams fear touching them.
Reasonably so.
RDS Rightsizing
Monitor:
- CPU
- Connections
- IOPS
- Memory pressure
Example Downsize
db.r6g.4xlarge → db.r6g.xlarge
Often invisible to applications.
Massively visible to finance.
Aurora Serverless v2
Ideal for:
- Variable workloads
- Internal APIs
- Intermittent services
Terraform Example
# Nested inside an aws_rds_cluster resource; capacities are in Aurora capacity units (ACUs).
serverlessv2_scaling_configuration {
  min_capacity = 0.5
  max_capacity = 8
}
Read Replica Cleanup
Common anti-pattern
Temporary read replica
→ never removed
→ costs persist forever
Audit quarterly.
Part 6: Reserved Instances & Savings Plans — When to Buy and How Much
Savings Plans are powerful when used correctly.
Dangerous when guessed incorrectly.
Recommended Strategy
Start conservative.
Target:
60–70% baseline utilisation coverage
Never 100%.
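One way to turn the 60–70% guideline into a number is to take the spend that was always present during the lookback window and commit to a fraction of it. The sketch below does that under a stated assumption: "baseline" is interpreted as the lowest hourly on-demand spend observed. The function name and sample data are hypothetical.

```python
def recommended_commitment(hourly_spend, coverage=0.65):
    """Suggest an hourly commitment as a fraction of the observed baseline.

    The baseline is the lowest hourly spend in the window, i.e. the load
    that was always present. `coverage` defaults to the midpoint of the
    60-70% range.
    """
    if not hourly_spend:
        raise ValueError("need at least one hour of usage history")
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    baseline = min(hourly_spend)
    return round(baseline * coverage, 2)


# Illustrative hourly on-demand spend in dollars.
history = [12.0, 14.5, 20.0, 11.0, 13.0]
print(recommended_commitment(history))
```

Using the minimum is deliberately conservative; a low percentile would be a reasonable alternative if the window contains outages or one-off dips.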
Compute Savings Plans
Best default option.
Flexible across:
- Instance families
- Regions
- Compute types
AWS Recommendation API
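aws ce get-savings-plans-purchase-recommendation --savings-plans-type COMPUTE_SP --term-in-years ONE_YEAR \
  --payment-option NO_UPFRONT --lookback-period-in-days THIRTY_DAYS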
Use actual usage history.
Not optimism.
Part 7: Kubernetes Cost Optimisation — Bin Packing, Cluster Autoscaler, Spot Node Groups
Kubernetes amplifies both efficiency and waste.
Bin Packing
Underutilised nodes are financial dead weight.
Resource Requests Matter
Bad:
requests:
cpu: "4"
Actual usage:
300m
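The gap between request and usage is easier to argue about as a percentage. This sketch converts Kubernetes CPU quantities to millicores and computes how much of the request is actually used; the function names are illustrative, and it handles only plain core counts and the "m" suffix.

```python
def cpu_millicores(quantity):
    """Convert a Kubernetes CPU quantity such as "4", "0.5" or "300m" to millicores."""
    quantity = quantity.strip()
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(round(float(quantity) * 1000))


def request_utilisation(requested, used):
    """Return actual usage as a percentage of the CPU request."""
    return round(cpu_millicores(used) / cpu_millicores(requested) * 100, 1)


print(request_utilisation("4", "300m"))  # 7.5
```

In the example above, the pod uses 7.5% of what it reserves, and the scheduler treats the remaining 92.5% as unavailable to every other workload on the node.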
Goldilocks Recommendation Tool
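helm repo add fairwinds-stable https://charts.fairwinds.com/stable
helm install goldilocks fairwinds-stable/goldilocks --namespace goldilocks --create-namespace
Goldilocks is installed with Helm; kubectl has no install subcommand. It uses Vertical Pod Autoscaler recommendations to suggest request sizing for namespaces labelled goldilocks.fairwinds.com/enabled=true.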
Cluster Autoscaler
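--balance-similar-node-groups=true
--scale-down-utilization-threshold=0.5
Cluster Autoscaler removes nodes whose requested resources fall below the scale-down threshold, once their pods can be rescheduled elsewhere. The first flag only keeps similar node groups evenly sized.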
Spot Node Groups
Example
capacityType: SPOT
Excellent for:
- Stateless apps
- Batch workers
- CI runners
Part 8: Monitoring Cost Creep — Alerting on Unexpected Spend Increases
Cost optimisation without monitoring regresses rapidly.
Budget Alerts
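AWS Example (the account ID and JSON file names are placeholders)
aws budgets create-budget --account-id 123456789012 \
  --budget file://budget.json --notifications-with-subscribers file://notifications.json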
Prometheus Cost Alert
groups:
  - name: cloud_cost_alerts
    rules:
      - alert: MonthlySpendSpike
        expr: increase(cloud_cost_total[24h]) > 1000
Slack Notification Example
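import os
import requests
# Webhook URL comes from the environment; the variable name is an assumption
webhook_url = os.environ["SLACK_WEBHOOK_URL"]
response = requests.post(
    webhook_url,
    json={"text": "Cloud spend increased unexpectedly"},
    timeout=10,
)
response.raise_for_status()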
Immediate visibility changes behaviour.
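Deciding when to send that notification is the harder part. As a rough sketch, the function below compares the latest day's spend against a trailing average and returns a message only when the ratio is exceeded. The window, the ratio, and the function name are all assumptions to tune per environment.

```python
from statistics import mean


def spend_spike(daily_costs, window=7, ratio=1.5):
    """Return an alert message if the latest day exceeds `ratio` times the
    mean of the preceding `window` days, otherwise None."""
    if len(daily_costs) < window + 1:
        return None  # not enough history to judge
    *history, latest = daily_costs[-(window + 1):]
    baseline = mean(history)
    if baseline > 0 and latest > ratio * baseline:
        return (f"Cloud spend increased unexpectedly: ${latest:,.0f} today "
                f"vs ${baseline:,.0f} trailing average")
    return None


# Illustrative daily totals in dollars.
steady = [100, 110, 95, 105, 100, 90, 100, 104]
spiking = [100, 110, 95, 105, 100, 90, 100, 240]
print(spend_spike(steady))
print(spend_spike(spiking))
```

The returned string can be passed straight into the "text" field of the webhook payload, so the alert carries the numbers that triggered it.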
Part 9: The Monthly Cost Review Checklist
The best FinOps teams operationalise review cadence.
Monthly Checklist
Compute
- Idle instances removed
- Rightsizing opportunities reviewed
- Spot coverage audited
Storage
- Snapshot retention reviewed
- Glacier transitions verified
- Orphaned volumes deleted
Kubernetes
- Node utilisation checked
- Resource requests audited
- Cluster Autoscaler effectiveness reviewed
Networking
- NAT Gateway spend analysed
- Cross-AZ and cross-region traffic reviewed
Databases
- Read replicas validated
- Aurora scaling reviewed
Governance
- Untagged resources identified
- Budget alerts tested
Appendix: Azure and GCP Equivalents
Compute
FinOps is not about making infrastructure cheap.
It is about making infrastructure intentional.
The most effective DevOps teams treat cloud cost as an engineering metric alongside latency, reliability, and deployment frequency.
Because every oversized node, forgotten snapshot, or unnecessary NAT transfer represents engineering inefficiency expressed financially.
The progression usually looks like this:
Visibility → Attribution → Accountability → Optimisation
Without visibility, optimisation is guesswork.
Without attribution, accountability disappears.
Without accountability, cloud spend becomes entropy.
But when engineers own both infrastructure reliability and infrastructure economics, something powerful happens:
Systems become leaner.
Architectures become cleaner.
And cloud bills stop being monthly surprises.



