Abhishek Vasisht

Cost-Aware Platform Engineering: Implementing FinOps in AWS

How we transformed our AWS spending from reactive firefighting to proactive cost optimization, reducing waste by 30% while scaling infrastructure 3x - A Platform Engineering Manager's Journey


The Wake-Up Call

It was a Monday morning when our VP of Engineering & IT walked into my office with a printout of our AWS bill. Three months of data told a concerning story:

  • Month 1: $70K
  • Month 2: $81K (+15.6%)
  • Month 3: $100K (+23.3%)

"At this rate," he said, "we're on track for significant annual growth. What's the plan?"

I didn't have a good answer. Like many platform teams, we had focused on velocity and reliability, treating cost as an afterthought. Our developers could spin up resources freely, our monitoring focused on uptime, not spend, and our architecture decisions rarely considered the price tag.

That conversation sparked a transformation in how we approach platform engineering. This is the story of how we built FinOps practices into our AWS platform, making cost awareness a first-class concern without sacrificing developer velocity.

The Platform Engineering Challenge

As Platform Engineering Managers, we face a unique challenge: enabling developer productivity while maintaining operational excellence AND cost efficiency. Traditional approaches force us to choose:

Option A: Cost Control Through Restriction

  • Lock down permissions
  • Manual approval for resources
  • Developers wait days for infrastructure
  • Innovation slows to a crawl

Option B: Developer Freedom Without Guardrails

  • Self-service everything
  • No cost visibility
  • Surprise bills at month-end
  • VP of Engineering & IT unhappy, platform team blamed

Option C: Cost-Aware Platform Engineering

  • Self-service with intelligent guardrails
  • Real-time cost visibility
  • Automated optimization
  • Developers empowered, finance happy

This blog post shows you how to implement Option C.

Our FinOps Journey: The Three Phases

Phase 1: Visibility          Phase 2: Optimization       Phase 3: Culture
┌─────────────────┐         ┌─────────────────┐         ┌─────────────────┐
│ • Cost tracking │         │ • Rightsizing   │         │ • Cost reviews  │
│ • Tagging       │    →    │ • Reserved Inst │    →    │ • Team KPIs     │
│ • Dashboards    │         │ • Automation    │         │ • Best practices│
│ • Alerts        │         │ • Governance    │         │ • Training      │
└─────────────────┘         └─────────────────┘         └─────────────────┘
   Weeks 1-4                   Weeks 5-12                  Ongoing

Let's dive into each phase with real implementation insights.


Phase 1: Visibility - You Can't Optimize What You Can't See

The Problem We Discovered

When we analyzed our costs, we found:

  • 89.7% of cost increase came from a single production account
  • EC2 compute grew 62.3% in one month ($12,390 increase)
  • New infrastructure appeared without platform team awareness
  • No one knew which team or application was responsible

Solution 1: Automated Cost Reporting

We built a serverless billing bot that sends daily reports to Microsoft Teams:

Architecture:

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   EventBridge   │───▶│  Lambda Function │───▶│ Microsoft Teams │
│   (Daily 9AM)   │    │  (Cost Analysis) │    │  (Adaptive Card)│
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                │
                                ▼
                       ┌──────────────────┐
                       │ Cost Explorer API│
                       │ Organizations API│
                       └──────────────────┘

Key Features:

  • Daily cost summaries with account names (not just IDs)
  • Week-over-week and month-over-month comparisons
  • Top 5 services and accounts by spend
  • Projected monthly costs based on current burn rate
  • Automatic alerts for >15% daily changes
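
The projection and alert logic behind these features is small. Here is a minimal sketch, assuming the daily totals have already been fetched from Cost Explorer. The names `CostSummary` and `summarize_costs` are illustrative and not taken from the bot itself:

```python
from dataclasses import dataclass


@dataclass
class CostSummary:
    yesterday: float
    daily_change_pct: float
    month_to_date: float
    projected_monthly: float
    alert: bool


def summarize_costs(daily_totals, days_in_month, alert_threshold_pct=15.0):
    """Summarize month-to-date spend from a list of daily totals (oldest first)."""
    if len(daily_totals) < 2:
        raise ValueError("need at least two days of data")
    yesterday, day_before = daily_totals[-1], daily_totals[-2]
    # Percentage change versus the previous day (0 if the previous day cost nothing).
    change_pct = (yesterday - day_before) / day_before * 100 if day_before else 0.0
    month_to_date = sum(daily_totals)
    # Burn-rate projection: average daily spend so far, extended to the full month.
    projected = month_to_date / len(daily_totals) * days_in_month
    return CostSummary(
        yesterday=yesterday,
        daily_change_pct=round(change_pct, 1),
        month_to_date=round(month_to_date, 2),
        projected_monthly=round(projected, 2),
        alert=abs(change_pct) > alert_threshold_pct,
    )
```

The alert uses the absolute change, so a sudden drop (for example, a service that stopped running) is flagged as well as a spike.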

Sample Report:

💰 AWS Daily Billing Report

💵 Yesterday's Total: $3,245.67
📈 Daily Change: 📉 -$234.12 (-6.7%)
📊 Weekly Change: 📈 +$456.23 (+16.4%)
📅 Month-to-Date: $84,387.45
🎯 Projected Monthly: $103,482.70

🏢 Top Contributing Accounts
Production Primary: $2,234.56 (68.8%)
Production Secondary: $623.45 (19.2%)
Development: $387.66 (12.0%)

🔧 Top Services by Cost
EC2 - Compute: $1,845.23
RDS: $567.89
S3: $234.56

Implementation:

We used AWS SAM (Serverless Application Model) for deployment:

  • Lambda Function fetches cost data from Cost Explorer and Organizations APIs
  • EventBridge triggers daily at 9 AM
  • Microsoft Teams integration via Adaptive Cards

Benefits:

  • Serverless (no infrastructure to manage)
  • Cost-effective (~$0.03/month, plus $0.01 per Cost Explorer API request)
  • Reliable daily execution
  • Easy to maintain and update

📖 For complete implementation: Building an Automated AWS Billing Report System

Solution 2: Comprehensive Tagging Strategy

We implemented a mandatory tagging policy across all AWS resources:

Required Tags:

CostCenter: "engineering" | "product" | "data" | "infrastructure"
Environment: "prod" | "staging" | "dev" | "sandbox"
Team: "platform" | "backend" | "frontend" | "data-science"
Application: "api-gateway" | "user-service" | "analytics"
Owner: "email@company.com"
Project: "project-name"

Enforcement with AWS Organizations:

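Use Service Control Policies (SCPs) to deny resource creation without required tags. Condition keys listed under a single operator are ANDed together, so each required tag gets its own statement; a single combined statement would only deny requests that are missing every tag:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyCreateWithoutCostCenterTag",
      "Effect": "Deny",
      "Action": ["ec2:RunInstances", "rds:CreateDBInstance"],
      "Resource": ["arn:aws:ec2:*:*:instance/*", "arn:aws:rds:*:*:db:*"],
      "Condition": {"Null": {"aws:RequestTag/CostCenter": "true"}}
    },
    {
      "Sid": "DenyCreateWithoutEnvironmentTag",
      "Effect": "Deny",
      "Action": ["ec2:RunInstances", "rds:CreateDBInstance"],
      "Resource": ["arn:aws:ec2:*:*:instance/*", "arn:aws:rds:*:*:db:*"],
      "Condition": {"Null": {"aws:RequestTag/Environment": "true"}}
    },
    {
      "Sid": "DenyCreateWithoutTeamTag",
      "Effect": "Deny",
      "Action": ["ec2:RunInstances", "rds:CreateDBInstance"],
      "Resource": ["arn:aws:ec2:*:*:instance/*", "arn:aws:rds:*:*:db:*"],
      "Condition": {"Null": {"aws:RequestTag/Team": "true"}}
    }
  ]
}

The Resource list is scoped to instances and DB instances because RunInstances also touches resources such as subnets and AMIs that never carry request tags. Bucket creation needs separate handling: S3 has traditionally applied tags after the bucket exists, so check whether s3:CreateBucket supports aws:RequestTag in your environment before adding it to a policy like this.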

Enforcement Strategy:

  • Use Infrastructure as Code (Terraform/CloudFormation) with required tag variables
  • Implement pre-commit hooks to validate tags
  • Create reusable modules with tags baked in
  • Set up automated tag compliance scanning
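
A pre-commit hook or compliance scanner can share one validation function. This is a minimal sketch that assumes tags arrive as a plain dictionary. The allowed values mirror the required-tag list above, while `validate_tags` itself and the email pattern are illustrative choices, not part of our tooling as described:

```python
import re

REQUIRED_TAGS = ("CostCenter", "Environment", "Team", "Application", "Owner", "Project")

# Enumerated values from the required-tag list; Application and Project are
# treated as free-form here (an assumption for this sketch).
ALLOWED_VALUES = {
    "CostCenter": {"engineering", "product", "data", "infrastructure"},
    "Environment": {"prod", "staging", "dev", "sandbox"},
    "Team": {"platform", "backend", "frontend", "data-science"},
}

# Deliberately loose email check: something@something.tld
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_tags(tags):
    """Return a list of human-readable problems; an empty list means compliant."""
    problems = []
    for key in REQUIRED_TAGS:
        value = tags.get(key, "").strip()
        if not value:
            problems.append(f"missing tag: {key}")
        elif key in ALLOWED_VALUES and value not in ALLOWED_VALUES[key]:
            problems.append(f"invalid value for {key}: {value!r}")
        elif key == "Owner" and not EMAIL_PATTERN.match(value):
            problems.append(f"Owner must be an email address: {value!r}")
    return problems
```

Returning every problem at once, rather than failing on the first, lets a developer fix all tags in a single pass.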

Solution 3: Real-Time Cost Dashboards

We created CloudWatch dashboards showing cost trends in near real time (AWS publishes billing metrics several times a day, not continuously):

Dashboard Components:

  1. Daily Spend Trend (last 30 days)
  2. Cost by Account (pie chart)
  3. Cost by Service (bar chart)
  4. Top 10 Resources by cost
  5. Budget vs Actual (gauge)
  6. Anomaly Detection alerts

Implementation Options:

  • AWS CloudWatch Dashboards (native; the first three dashboards are free)
  • Grafana with CloudWatch data source (more customizable)
  • QuickSight for executive reporting (business intelligence)

Result: Platform team and developers can now see cost impact within hours, not weeks.

Solution 4: Intelligent Budget Alerts

We set up multi-level budget alerts with different thresholds:

Budget Structure:

Organization Budget: $X/month (total)
├── Production Account: $Y/month (largest allocation)
│   ├── Alert at 50% → Platform Team
│   ├── Alert at 75% → Engineering Manager
│   └── Alert at 90% → VP of Engineering & IT + Head of Software Development
├── Development Account: $Z/month (medium allocation)
│   ├── Alert at 80% → Platform Team
│   └── Alert at 100% → Engineering Manager
└── Sandbox Account: $W/month (smallest allocation)
    └── Alert at 100% → Platform Team

Implementation Approach:

  • Use AWS Budgets for threshold-based alerts
  • Configure multiple notification levels (50%, 75%, 90%, 100%)
  • Set up both ACTUAL and FORECASTED alert types
  • Route notifications to appropriate teams via email or SNS

Alert Strategy:

  • 50% threshold: Early warning to platform team
  • 75% threshold: Escalate to engineering manager
  • 90% threshold: Executive notification (VP of Engineering & IT / Head of Software Development)
  • 100% threshold: Immediate action required
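
AWS Budgets evaluates the thresholds itself, but the escalation ladder is easy to reason about as code. This is a sketch of the routing logic only; the recipient labels are placeholders for whatever SNS topics or distribution lists you use:

```python
# Thresholds follow the alert strategy above; recipient names are hypothetical.
ESCALATION_LEVELS = [
    (50, "platform-team"),
    (75, "engineering-manager"),
    (90, "executives"),
    (100, "immediate-action"),
]


def alerts_to_send(spend, budget):
    """Return every escalation level whose threshold the current spend has crossed."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    used_pct = spend / budget * 100
    return [name for threshold, name in ESCALATION_LEVELS if used_pct >= threshold]
```

Each higher level is added to the lower ones instead of replacing them, so the platform team stays in the loop when an alert reaches leadership.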

Phase 1 Results:

  • ✅ Daily cost visibility for all stakeholders
  • ✅ Account-level cost attribution
  • ✅ Service-level spend tracking
  • ✅ Proactive alerts before budget overruns
  • ✅ Historical trend analysis

Time to Implement: 2 weeks

Cost: ~$5/month (AWS Budgets + Lambda)


Phase 2: Optimization - From Visibility to Action

With visibility in place, we discovered several optimization opportunities:

Discovery 1: EC2 Instance Rightsizing

The Problem:

  • 40% of EC2 instances were oversized
  • Average CPU utilization: 15-25%
  • Average memory utilization: 30-40%
  • Estimated waste: $15,000/month

The Solution:

We used AWS Compute Optimizer to identify rightsizing opportunities:

Rightsizing Strategy:

  1. Immediate wins (>50% savings): Resize during next maintenance window
  2. Medium opportunities (20-50% savings): Schedule for quarterly optimization
  3. Small optimizations (<20% savings): Evaluate during annual review
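
The three tiers can be applied mechanically once each instance has a current and a recommended monthly cost (for example, taken from Compute Optimizer findings). A minimal sketch, with `rightsizing_priority` as an illustrative name:

```python
def rightsizing_priority(current_monthly_cost, recommended_monthly_cost):
    """Map the projected savings percentage to one of the three tiers above."""
    if current_monthly_cost <= 0:
        raise ValueError("current cost must be positive")
    savings_pct = (current_monthly_cost - recommended_monthly_cost) / current_monthly_cost * 100
    if savings_pct > 50:
        return "next-maintenance-window"
    if savings_pct >= 20:
        return "quarterly-optimization"
    return "annual-review"
```

Savings of exactly 20% or 50% fall into the middle tier here, matching the "20-50%" band in the list.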

Key Metrics to Monitor:

  • CPU utilization (target: 40-70%)
  • Memory utilization (target: 50-80%)
  • Network throughput
  • Disk I/O patterns

Results:

  • Resized 45 instances in production
  • Monthly savings: $12,400
  • Performance impact: None (monitored for 30 days)
  • ROI: Immediate

Discovery 2: Reserved Instances & Savings Plans

The Analysis:

After 3 months of data, we identified stable workloads:

  • Production databases: 24/7 uptime, predictable load
  • Core API services: Consistent baseline capacity
  • Monitoring infrastructure: Always-on requirements

The Strategy:

Workload Type          | Commitment Strategy        | Savings
-----------------------|----------------------------|----------
Production RDS         | 3-year Reserved Instance   | 63%
Core EC2 (baseline)    | 1-year Compute Savings Plan| 42%
Variable EC2 (burst)   | On-Demand                  | 0%
Development (9-5)      | Instance Scheduler         | 65%

Purchase Decision Matrix:

Utilization Rate | Recommendation           | Commitment
-----------------|--------------------------|------------
> 90%            | 3-year RI (All Upfront)  | Maximum savings
75-90%           | 1-year RI (Partial)      | Balanced
50-75%           | Compute Savings Plan     | Flexible
< 50%            | On-Demand                | No commitment

Analysis Approach:

  • Review 90 days of usage patterns
  • Identify stable workloads (>75% utilization)
  • Calculate ROI for different commitment levels
  • Start conservative with 1-year commitments
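
The purchase decision matrix translates directly into a lookup. In this sketch the handling of the exact boundaries (75% and 90% go to the 1-year RI, 50% to the Savings Plan) is an assumption, since the matrix leaves them ambiguous:

```python
def recommend_commitment(utilization_pct):
    """Return the commitment type suggested by the decision matrix."""
    if not 0 <= utilization_pct <= 100:
        raise ValueError("utilization must be between 0 and 100")
    if utilization_pct > 90:
        return "3-year RI (All Upfront)"
    if utilization_pct >= 75:
        return "1-year RI (Partial)"
    if utilization_pct >= 50:
        return "Compute Savings Plan"
    return "On-Demand"
```

Running this over 90 days of per-workload utilization gives a first-pass shortlist; the ROI calculation still decides what is actually purchased.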

Results:

  • Purchased $45,000 in Reserved Instances
  • Annual savings: $18,900 (42% discount)
  • Payback period: 2.4 years
  • Risk mitigation: Started with 1-year commitments

Discovery 3: Automated Resource Cleanup

The Problem:

We found significant waste from forgotten resources:

  • 23 stopped EC2 instances (still paying for EBS volumes)
  • 15 unattached EBS volumes
  • 8 old snapshots (>180 days)
  • 12 unused Elastic IPs
  • Estimated waste: $3,200/month

The Solution:

Automated cleanup with AWS Lambda and EventBridge:

Cleanup Policy:

  1. Day 0: Resource becomes idle
  2. Day 7: First notification to owner
  3. Day 10: Second notification with deletion warning
  4. Day 14: Automatic deletion (unless KeepAlive tag present)

Implementation Approach:

  • Lambda function scans for idle resources daily
  • SNS notifications to resource owners
  • Grace period with KeepAlive tag option
  • CloudWatch logs for audit trail
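
The decision the daily scan makes for each idle resource can be sketched as a single function. This assumes the scan runs exactly once per day and that idle days are tracked elsewhere; `cleanup_action` and the returned labels are illustrative:

```python
def cleanup_action(idle_days, tags):
    """Decide what the daily scan should do with one idle resource."""
    # A KeepAlive tag opts the resource out of the whole process.
    if "KeepAlive" in tags:
        return "skip"
    if idle_days >= 14:
        return "delete"
    if idle_days == 10:
        return "warn"      # second notification, with deletion warning
    if idle_days == 7:
        return "notify"    # first notification to the owner
    return "none"
```

Notifications use exact day matches so owners get two messages rather than one per day; deletion uses `>=` so a missed run does not leave resources behind.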

Results:

  • Cleaned up 45 unused resources in first month
  • Monthly savings: $3,200
  • Zero complaints (14-day grace period worked well)
  • Developers became more conscious of resource lifecycle

Discovery 4: Development Environment Scheduling

The Problem:

Development and staging environments ran 24/7 but were only used 9 AM - 6 PM weekdays:

  • 168 hours/week available
  • 45 hours/week actually used (27% utilization)
  • Waste: $8,500/month

The Solution:

AWS Instance Scheduler with custom schedules:

Schedule Definitions:

dev-hours:        Mon-Fri 8 AM - 7 PM
staging-hours:    Mon-Fri 7 AM - 8 PM
always-on:        24x7 (production only)

Implementation:

  • Tag-based scheduling (Schedule=dev-hours)
  • Automatic start before work hours
  • Automatic stop after hours
  • Override capability for special cases
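
To make the schedule definitions concrete, here is a small sketch of the start/stop decision and the resulting weekly hours. It illustrates the logic only and is not how AWS Instance Scheduler is implemented; times are assumed to be local, and the function names are hypothetical:

```python
from datetime import datetime

# (first hour running, first hour stopped) on weekdays, from the definitions above.
SCHEDULES = {
    "dev-hours": (8, 19),
    "staging-hours": (7, 20),
}


def should_be_running(schedule, now):
    """Return True if an instance tagged with this schedule should be up at `now`."""
    if schedule == "always-on":
        return True
    start_hour, stop_hour = SCHEDULES[schedule]
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    return is_weekday and start_hour <= now.hour < stop_hour


def weekly_running_hours(schedule):
    """Hours per week an instance on this schedule is running."""
    if schedule == "always-on":
        return 24 * 7
    start_hour, stop_hour = SCHEDULES[schedule]
    return (stop_hour - start_hour) * 5
```

With `dev-hours`, an instance runs 55 of 168 hours per week, roughly two-thirds fewer instance-hours than running 24/7.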

Results:

  • Scheduled 85 development instances
  • Scheduled 32 staging instances
  • Monthly savings: $8,500
  • Developer feedback: Positive (instances auto-start before work hours)
  • Unexpected benefit: Pushed developers toward IaC (instances are stopped and started daily, so anything not captured in code or on persistent volumes is lost)

Discovery 5: S3 Storage Optimization

The Problem:

S3 costs grew 45% over 3 months with no clear ownership:

  • 2.3 TB in Standard storage
  • 890 GB of data >90 days old
  • 450 GB of incomplete multipart uploads
  • Monthly cost: $5,200

The Solution:

Intelligent tiering and lifecycle policies:

Storage Class Decision Tree:

Access Pattern                    | Storage Class        | Cost/GB/Month
----------------------------------|----------------------|---------------
Frequent access (>1/month)        | Standard             | $0.023
Infrequent access (>1/quarter)    | Standard-IA          | $0.0125
Rare access (>1/year)             | Glacier IR           | $0.004
Archive (rarely accessed)         | Glacier Deep Archive | $0.00099
Unknown pattern                   | Intelligent-Tiering  | $0.023-0.00099

Lifecycle Policy Strategy:

  • Move to IA after 30 days
  • Move to Glacier IR after 90 days
  • Move to Deep Archive after 180 days
  • Delete incomplete multipart uploads after 7 days
  • Delete old versions after 90 days
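
Expressed as an S3 lifecycle configuration, the strategy above looks roughly like this. The sketch only builds the configuration dictionary; the rule ID and the `build_lifecycle_configuration` helper are illustrative names:

```python
def build_lifecycle_configuration(prefix=""):
    """Build a lifecycle configuration implementing the strategy above."""
    return {
        "Rules": [
            {
                "ID": "tier-and-expire",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},  # empty prefix applies to the whole bucket
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER_IR"},
                    {"Days": 180, "StorageClass": "DEEP_ARCHIVE"},
                ],
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
                "NoncurrentVersionExpiration": {"NoncurrentDays": 90},
            }
        ]
    }
```

The result can be passed as `LifecycleConfiguration` to boto3's `put_bucket_lifecycle_configuration`. Scoping by prefix lets different data types in the same bucket follow different rules.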

Results:

  • Moved 890 GB to Glacier Instant Retrieval
  • Cleaned up 450 GB of incomplete uploads
  • Enabled Intelligent-Tiering on 15 buckets
  • Monthly savings: $1,850
  • Storage costs reduced by 35%

Phase 2 Summary:

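Optimization        | Monthly Savings   | Implementation Time
--------------------|-------------------|--------------------
EC2 Rightsizing     | $12,400           | 2 weeks
Reserved Instances  | $18,900 (annual)  | 1 week
Resource Cleanup    | $3,200            | 1 week
Dev Scheduling      | $8,500            | 1 week
S3 Optimization     | $1,850            | 1 week
Total               | $27,525/month     | 6 weeks

The total counts Reserved Instances at $1,575/month ($18,900 / 12).

Annual Impact: $330,300 in savings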


Phase 3: Culture - Making FinOps Everyone's Responsibility

Technology alone doesn't create lasting change. We needed to shift the culture.

Initiative 1: Cost-Aware Development Guidelines

We created platform engineering standards that developers follow:

The FinOps Developer Checklist:

Before Deploying to Production:

Resource Sizing

  • Right-sized instances based on actual load testing
  • Configured auto-scaling with appropriate min/max
  • Reviewed CloudWatch metrics for 2+ weeks in staging

Cost Optimization

  • Enabled S3 lifecycle policies for data storage
  • Configured RDS automated backups with retention limits
  • Used appropriate EBS volume types (gp3 vs gp2 vs io1)
  • Implemented caching where applicable (ElastiCache, CloudFront)

Tagging & Governance

  • All resources tagged with: CostCenter, Team, Application, Environment
  • Budget alerts configured for the application
  • Cost dashboard created in CloudWatch

Monitoring

  • Cost anomaly detection enabled
  • Utilization metrics tracked
  • Cleanup automation configured for temporary resources

Infrastructure as Code Guardrails:

We built cost-awareness into our IaC modules:

  • Instance type validation (prevent oversized instances in dev)
  • Automatic scheduling tags for non-prod environments
  • Cost estimation outputs in Terraform plans
  • Budget threshold checks before deployment

Initiative 2: Monthly Cost Review Meetings

We established a monthly FinOps review with all engineering teams:

Meeting Structure (60 minutes):

  1. Cost Overview (10 min)

    • Total spend vs budget
    • Month-over-month comparison
    • Top 5 cost drivers
  2. Team Deep Dives (30 min)

    • Each team presents their top 3 services
    • Explains any significant changes
    • Shares optimization wins
  3. Optimization Opportunities (15 min)

    • Platform team presents recommendations
    • Discussion of implementation plans
    • Assignment of action items
  4. Best Practices Sharing (5 min)

    • Highlight cost-saving innovations
    • Recognize teams with best improvements

Sample Report Output:

📊 FinOps Monthly Review

💰 Cost Summary
Total Spend: $87K (Budget: $95K)
vs Last Month: -$13K (-13%) ✅
vs Last Year: +$23K (+36%)

🏆 Team Performance
┌──────────────┬──────────┬──────────┬─────────┐
│ Team         │ Spend    │ Change   │ Status  │
├──────────────┼──────────┼──────────┼─────────┤
│ Platform     │ $32K     │ -15% ✅   │ On Track│
│ Backend      │ $28K     │ -8% ✅    │ On Track│
│ Data Science │ $19K     │ +5%      │ Watch   │
│ Frontend     │ $8K      │ -2% ✅    │ On Track│
└──────────────┴──────────┴──────────┴─────────┘

🎯 Optimization Wins This Month
1. Platform Team: Rightsized 12 RDS instances → $2.4K/mo savings
2. Backend Team: Implemented ElastiCache → $1.8K/mo savings
3. Data Science: Moved to Spot instances → $3.2K/mo savings

📈 Recommendations
1. Backend Team: 8 EC2 instances eligible for Reserved Instances
   Potential savings: $4.2K/month
2. Data Science: S3 buckets with old data (>180 days)
   Potential savings: $890/month

Initiative 3: Cost Attribution & Team Accountability

We made costs transparent at the team level:

Team Cost Dashboard Features:

  • Real-time spend by team (using CostCenter tag)
  • Budget vs actual with visual indicators
  • Top services by cost for each team
  • Trend analysis (daily, weekly, monthly)
  • Comparison with other teams (anonymized)

Slack Integration for Real-Time Alerts:

Teams receive daily cost summaries in their Slack channels:

  • Yesterday's spend with daily change
  • Week-over-week comparison
  • Month-to-date vs budget
  • Top 3 services by cost
  • Automatic alerts for >20% daily increases

Benefits:

  • Teams own their costs
  • Real-time feedback loop
  • Friendly competition between teams
  • Early detection of cost spikes

Initiative 4: FinOps Training & Enablement

We created a comprehensive training program for all engineers:

FinOps Training Curriculum:

Week 1: Fundamentals

  • Understanding AWS pricing models
  • Reading and interpreting AWS bills
  • Cost allocation tags and their importance
  • Introduction to Cost Explorer

Week 2: Optimization Techniques

  • EC2 instance selection and rightsizing
  • Reserved Instances vs Savings Plans
  • S3 storage classes and lifecycle policies
  • Database optimization (RDS, DynamoDB)

Week 3: Platform Tools

  • Using our cost dashboards
  • Setting up budget alerts
  • Automated cleanup tools
  • Cost estimation in Terraform

Week 4: Best Practices

  • Architecture for cost efficiency
  • Serverless vs containers vs VMs
  • Monitoring and alerting
  • Case studies from our teams

FinOps Champion Certification:

We created an internal certification program:

Requirements:

  1. Complete 4-week training program
  2. Achieve 30% cost reduction in your team's AWS spend
  3. Present optimization case study to engineering team
  4. Mentor 2 other engineers on FinOps practices

Benefits:

  • Recognition in company all-hands
  • Professional development budget
  • Priority for AWS certification training
  • FinOps Champion badge

Initiative 5: Cost-Aware Architecture Reviews

We integrated cost considerations into our architecture review process:

Architecture Review Checklist (Cost Section):

Estimated Costs

  • Monthly cost estimate provided (with calculations)
  • Cost comparison with alternative approaches
  • Breakdown by service (compute, storage, data transfer)

Scalability & Cost

  • Cost scaling analyzed (linear, exponential, logarithmic)
  • Auto-scaling configured with cost limits
  • Peak load costs estimated and budgeted

Optimization Strategy

  • Reserved capacity opportunities identified
  • Spot instances considered for appropriate workloads
  • Caching strategy to reduce compute/database costs
  • Data transfer costs minimized (same region, VPC endpoints)

Monitoring & Alerts

  • Cost anomaly detection configured
  • Budget alerts set at 50%, 75%, 90%
  • Cost dashboard created for the service
  • Runbook for cost spike investigation

Alternatives Considered

  • Serverless vs container vs VM comparison
  • Managed service vs self-hosted cost analysis
  • Multi-region vs single-region cost implications

Real Example from Our Reviews:

Architecture Review: New Analytics Pipeline
==========================================

Proposed Architecture:
- 5x m5.2xlarge EC2 instances (24/7)
- 2TB S3 Standard storage
- RDS PostgreSQL db.r5.2xlarge
Estimated Monthly Cost: $4,850

Alternative Architecture (Platform Team Recommendation):
- AWS Glue for ETL (serverless)
- S3 Intelligent-Tiering (2TB)
- Aurora Serverless v2 (auto-scaling)
Estimated Monthly Cost: $1,240

Decision: Approved alternative architecture
Savings: $3,610/month ($43,320/year)

The Results: 6 Months Later

After implementing our FinOps platform engineering practices, here's what we achieved:

Financial Impact

Month     | Actual Spend | Without FinOps | Savings  | Cumulative
----------|--------------|----------------|----------|------------
Month 1   | $70K         | $70K           | $0       | $0
Month 2   | $81K         | $81K           | $0       | $0
Month 3   | $100K        | $100K          | $0       | $0
Month 4   | $87K         | $120K          | $33K     | $33K
Month 5   | $85K         | $144K          | $59K     | $92K
Month 6   | $82K         | $173K          | $91K     | $183K

Total Savings (3 months post-implementation): $183K
Projected Annual Savings: ~$730K
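
The "Without FinOps" column is consistent with spend compounding at roughly 20% per month from the $100K peak. That rate is inferred from the table rather than stated; under that assumption, the savings figures can be reproduced with a short sketch (`project_savings` is an illustrative name):

```python
def project_savings(baseline, actual_spend, monthly_growth=0.20):
    """Compare actual spend with a compounding no-action baseline (values in $K).

    Returns one (projected, saved, cumulative) tuple per month.
    """
    rows = []
    cumulative = 0
    projected = baseline
    for actual in actual_spend:
        projected *= 1 + monthly_growth  # what the bill would have grown to
        saved = round(projected) - actual
        cumulative += saved
        rows.append((round(projected), saved, cumulative))
    return rows


# Months 4-6 from the table: actual spend of $87K, $85K, $82K.
rows = project_savings(100, [87, 85, 82])
```

The counterfactual baseline drives the headline number, so it is worth stating the growth assumption explicitly when presenting savings like these.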

Key Metrics

Metric                 | Before FinOps   | After FinOps | Improvement
-----------------------|-----------------|--------------|------------
Monthly AWS Spend      | $100K           | $82K         | -18%
Cost per Developer     | Higher          | Lower        | -18%
Wasted Resources       | ~35%            | ~8%          | -77%
Budget Overruns        | 3/month         | 0/month      | -100%
Cost Visibility        | Leadership only | All teams    | +100%
Time to Detect Issues  | 30 days         | <24 hours    | -97%
Developer Satisfaction | 7.2/10          | 8.9/10       | +24%

Operational Improvements

Before FinOps:

  • ❌ Monthly surprise bills
  • ❌ No cost attribution
  • ❌ Reactive firefighting
  • ❌ Developers unaware of costs
  • ❌ Manual cost analysis (8 hours/month)
  • ❌ No optimization process

After FinOps:

  • ✅ Predictable spending
  • ✅ Team-level cost visibility
  • ✅ Proactive optimization
  • ✅ Cost-aware development culture
  • ✅ Automated reporting (<1 hour/month)
  • ✅ Continuous optimization

Cultural Transformation

Developer Feedback:

"I used to just pick the biggest instance type to be safe. Now I actually think about what I need and use the cost estimator. Turns out t3.medium works fine for most of our services."

— Backend Developer

"The daily Slack updates make me aware of our team's spending. When I see a spike, I investigate immediately instead of waiting for the monthly bill."

— Team Lead

"The FinOps training changed how I design systems. I now consider cost as a first-class requirement, not an afterthought."

— Senior Engineer

Leadership Feedback:

"We went from reactive cost management to proactive optimization. The platform team's FinOps implementation has been transformational."

— Head of Software Development

"The visibility and predictability we now have makes financial planning so much easier. And the savings speak for themselves."

— VP of Engineering & IT


Lessons Learned: What Worked and What Didn't

What Worked Well ✅

1. Start with Visibility, Not Restrictions

We didn't lock down permissions or block developers. We gave them visibility first, and behavior changed naturally.

2. Automate Everything

Manual cost analysis is unsustainable. Our automated daily reports, cleanup scripts, and scheduling saved hundreds of hours.

3. Make It Easy to Do the Right Thing

Our IaC modules with cost guardrails made it easier to be cost-efficient than wasteful.

4. Celebrate Wins Publicly

Recognizing teams that achieved cost savings created positive peer pressure and friendly competition.

5. Integrate with Existing Workflows

Slack notifications, Grafana dashboards, and architecture reviews fit into existing processes rather than creating new ones.

What Didn't Work ❌

1. Initial Tagging Enforcement Was Too Strict

Our first attempt blocked all resource creation without perfect tags. This frustrated developers and slowed velocity. We relaxed the policy to enforce only the required tags.

2. Cost Alerts Were Too Noisy

Early alerts fired for every 5% change. Teams ignored them. We adjusted to 15% for daily, 25% for weekly.

3. One-Size-Fits-All Policies

Applying the same lifecycle policies to all S3 buckets caused issues. We learned to categorize by data type first.

4. Assuming Everyone Understands AWS Pricing

Many developers didn't know the difference between Reserved Instances and Savings Plans. Training was essential.

5. Focusing Only on Big Wins

We initially ignored small optimizations (<$100/month). But 50 small wins can add up to as much as $5,000/month in savings.

Unexpected Benefits 🎁

1. Better Architecture Decisions

Cost awareness led to better designs: more caching, better auto-scaling, appropriate service selection.

2. Improved Resource Hygiene

Automated cleanup forced teams to use Infrastructure as Code and properly manage resource lifecycles.

3. Faster Incident Response

Cost anomaly detection caught several production issues before they became major incidents.

4. Stronger Team Collaboration

Monthly cost reviews brought teams together to share learnings and best practices.

5. Career Development

Engineers who became FinOps champions gained valuable skills and visibility in the organization.


Getting Started: Your FinOps Implementation Roadmap

Based on our experience, here's a practical 90-day plan to implement FinOps in your organization:

Days 1-30: Foundation & Visibility

Week 1: Assessment

  • [ ] Analyze last 3 months of AWS bills
  • [ ] Identify top 10 cost drivers
  • [ ] Map costs to teams/applications (best effort)
  • [ ] Document current state and pain points

Week 2: Quick Wins

  • [ ] Set up AWS Budgets with alerts
  • [ ] Deploy automated billing report
  • [ ] Create basic CloudWatch cost dashboard
  • [ ] Identify and clean up obvious waste

Week 3: Tagging Strategy

  • [ ] Define required tags
  • [ ] Create tagging policy document
  • [ ] Tag existing critical resources
  • [ ] Implement tag enforcement for new resources

Week 4: Team Enablement

  • [ ] Present findings to engineering teams
  • [ ] Share cost dashboards and reports
  • [ ] Conduct initial FinOps training session
  • [ ] Establish monthly cost review meeting

Expected Results: 5-10% cost reduction, full visibility into spending


Days 31-60: Optimization & Automation

Week 5: EC2 Optimization

  • [ ] Enable AWS Compute Optimizer
  • [ ] Analyze rightsizing recommendations
  • [ ] Implement instance scheduler for dev/staging
  • [ ] Rightsize top 10 oversized instances

Week 6: Storage Optimization

  • [ ] Audit S3 buckets and implement lifecycle policies
  • [ ] Review EBS volumes and snapshots
  • [ ] Enable S3 Intelligent-Tiering where appropriate
  • [ ] Clean up old snapshots and AMIs

Week 7: Reserved Capacity

  • [ ] Analyze usage patterns for stable workloads
  • [ ] Calculate ROI for Reserved Instances
  • [ ] Purchase initial RIs (start conservative)
  • [ ] Document RI management process

Week 8: Automation

  • [ ] Deploy automated resource cleanup Lambda
  • [ ] Set up cost anomaly detection
  • [ ] Create IaC modules with cost guardrails
  • [ ] Implement automated cost reporting

Expected Results: 15-25% cost reduction, automated optimization processes


Days 61-90: Culture & Governance

Week 9: Architecture Integration

  • [ ] Add cost section to architecture review template
  • [ ] Create cost estimation tools
  • [ ] Document cost-aware design patterns
  • [ ] Review upcoming projects for cost optimization

Week 10: Team Accountability

  • [ ] Implement team-level cost dashboards
  • [ ] Set up Slack/Teams cost notifications
  • [ ] Create team cost budgets
  • [ ] Establish cost KPIs for teams

Week 11: Training & Certification

  • [ ] Develop comprehensive FinOps training program
  • [ ] Train team leads and senior engineers
  • [ ] Create internal FinOps champion program
  • [ ] Document best practices and runbooks

Week 12: Continuous Improvement

  • [ ] Conduct first monthly FinOps review
  • [ ] Gather feedback and iterate
  • [ ] Plan next quarter's optimization initiatives
  • [ ] Celebrate and communicate wins

Expected Results: 25-35% cost reduction, sustainable FinOps culture


Essential Tools & Resources

AWS Native Tools (free or with a free tier):

  • AWS Cost Explorer
  • AWS Budgets
  • AWS Compute Optimizer
  • AWS Cost Anomaly Detection
  • AWS Trusted Advisor

Open Source Tools:

  • Cloud Custodian (policy as code)
  • Komiser (cloud asset dashboard)
  • Infracost (Terraform cost estimation)
  • CloudQuery (cloud asset inventory)

Recommended Reading:

  • "Cloud FinOps" by J.R. Storment and Mike Fuller
  • AWS Well-Architected Framework - Cost Optimization Pillar
  • FinOps Foundation resources (finops.org)

Common Pitfalls and How to Avoid Them

Pitfall 1: Analysis Paralysis

Problem: Spending months analyzing costs without taking action.

Solution: Start with quick wins in week 1:

  • Clean up stopped instances
  • Delete unattached volumes
  • Set up basic budgets
  • Deploy automated reporting

Impact: 5-10% savings in first week builds momentum.


Pitfall 2: Over-Optimization

Problem: Spending $1000 in engineering time to save $50/month.

Solution: Use the 10x rule:

  • Only optimize if annual savings > 10x implementation cost
  • Example: If optimization takes 8 hours ($800), annual savings should be >$8,000
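
The 10x rule is a one-line check. In this sketch the default $100/hour rate is inferred from the "8 hours ($800)" example above; substitute your own loaded engineering cost:

```python
def worth_optimizing(monthly_savings, engineering_hours, hourly_rate=100.0, multiple=10):
    """Apply the 10x rule: annual savings must exceed `multiple` times the cost."""
    implementation_cost = engineering_hours * hourly_rate
    annual_savings = monthly_savings * 12
    return annual_savings > multiple * implementation_cost
```

For the over-optimization example in the problem statement, 10 hours of work to save $50/month yields $600 a year against a $10,000 bar, so the function returns `False`.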

Pitfall 3: Ignoring Developer Experience

Problem: Cost controls that slow down development velocity.

Solution: Make cost-efficient choices the easy choice:

  • Provide IaC modules with sensible defaults
  • Automate optimization (don't require manual work)
  • Give visibility, not restrictions

Pitfall 4: Lack of Executive Support

Problem: FinOps treated as "IT's problem" without leadership buy-in.

Solution: Speak the language of business:

  • Show ROI in dollars, not percentages
  • Connect cost savings to business outcomes
  • Present at executive meetings with clear metrics

Example Pitch:

"Our FinOps initiative delivered significant cost savings in the first quarter. That's equivalent to hiring additional engineers or funding new product features. With continued optimization, we project substantial annual savings that directly impact our bottom line."


Pitfall 5: Set-and-Forget Mentality

Problem: Implementing FinOps once and assuming it's done.

Solution: FinOps is continuous:

  • Monthly cost reviews
  • Quarterly optimization sprints
  • Annual strategy refresh
  • Ongoing training and enablement

Measuring Success: Key FinOps Metrics

Track these metrics to measure your FinOps maturity:

Financial Metrics

  • Total Cloud Spend (monthly trend, YoY growth)
  • Cost per customer/transaction
  • Cost per developer
  • Infrastructure cost as % of revenue
  • Waste metrics (unused, idle, oversized resources)
  • Monthly savings from optimizations
  • ROI of FinOps program

Operational Metrics

  • % of resources with complete tags
  • % of costs attributed to teams
  • Time to detect cost anomalies
  • Budget compliance rate
  • Policy violation rate
  • % of resources managed by IaC
  • Automated optimization actions/month

Cultural Metrics

  • % of engineers trained in FinOps
  • Cost dashboard active users
  • Cost review meeting attendance
  • Cost optimization ideas submitted
  • FinOps champions certified
  • Developer satisfaction score

FinOps Maturity Model

Level 1: Reactive (Crawl)
├─ Manual cost analysis
├─ No tagging strategy
├─ Surprise bills common
└─ No optimization process

Level 2: Proactive (Walk)
├─ Automated reporting
├─ Basic tagging in place
├─ Budget alerts configured
└─ Ad-hoc optimizations

Level 3: Optimized (Run)
├─ Real-time visibility
├─ Comprehensive tagging
├─ Predictable spending
├─ Continuous optimization
└─ Cost-aware culture

Level 4: Advanced (Fly)
├─ Predictive analytics
├─ Multi-cloud optimization
├─ FinOps as code
├─ Cost innovation
└─ Industry-leading efficiency

Our Journey:

  • Month 1: Level 1 (Reactive)
  • Month 3: Level 2 (Proactive)
  • Month 6: Level 3 (Optimized)
  • Target: Level 4 (Advanced)

Real-World Case Studies from Our Teams

Case Study 1: Backend Team - API Service Optimization

Challenge:

The Backend team's API service costs grew 85% in 2 months with no corresponding traffic increase.

Investigation:

  • 12 m5.xlarge instances running 24/7
  • Average CPU utilization: 18%
  • Peak CPU utilization: 45%
  • Traffic pattern: 9 AM - 6 PM weekdays

Solution:

  1. Rightsized to m5.large (50% cost reduction)
  2. Implemented auto-scaling (3-8 instances based on load)
  3. Configured instance scheduler for non-peak hours
  4. Added ElastiCache to reduce database load
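
A quick sanity check before rightsizing is to project utilization onto the smaller instance. The sketch below assumes load scales linearly with vCPU count and holds the number of instances constant. Both are simplifications, and the function name is hypothetical. An m5.xlarge has 4 vCPUs and an m5.large has 2.

```python
def projected_cpu(current_pct, current_vcpus, target_vcpus):
    """Estimate CPU utilization after a resize, assuming load scales linearly with vCPUs."""
    return current_pct * current_vcpus / target_vcpus


# Utilization figures from the investigation above: 18% average, 45% peak.
avg_after = projected_cpu(18, current_vcpus=4, target_vcpus=2)
peak_after = projected_cpu(45, current_vcpus=4, target_vcpus=2)
print(f"Projected average: {avg_after:.0f}%, projected peak: {peak_after:.0f}%")
```

Under these assumptions the average lands around 36% and the peak around 90%. That leaves little headroom on a fixed-size fleet, which is consistent with pairing the resize with auto-scaling rather than relying on it alone.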

Results:

  • Monthly cost: $15,170 → $5,240 (65% reduction)
  • Annual savings: $119,160
  • Performance: Improved (better caching)
  • Implementation time: 1 week

Key Lesson: "We were over-provisioning for peak load that rarely happened. Auto-scaling gave us better performance at 1/3 the cost."
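
The results figures follow from simple arithmetic, which is worth scripting so every case study is reported the same way. This is a small sketch with a hypothetical helper name.

```python
def savings_summary(monthly_before, monthly_after):
    """Summarize monthly savings, annualized savings, and percent reduction."""
    monthly_savings = monthly_before - monthly_after
    return {
        "monthly_savings": monthly_savings,
        "annual_savings": monthly_savings * 12,
        "reduction_pct": round(100 * monthly_savings / monthly_before),
    }


# Case Study 1 figures: $15,170 -> $5,240 per month.
summary = savings_summary(15_170, 5_240)
print(summary)
```

With the Case Study 1 inputs this gives $9,930 per month, $119,160 per year, and a 65% reduction, matching the results above.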


Case Study 2: Data Science Team - ML Training Optimization

Challenge:

The Data Science team was spending $12,000/month on GPU instances for model training, while those instances sat idle about 60% of the time.

Investigation:

  • 4x p3.2xlarge instances (24/7)
  • Training jobs: 2-4 hours each
  • Jobs run: 3-4 times per day
  • Idle time: 14-16 hours/day

Solution:

  1. Migrated to SageMaker Training Jobs (pay per use)
  2. Used Spot instances for training (70% discount)
  3. Implemented training job scheduler
  4. Optimized model code (reduced training time 30%)
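
To see why these steps compound, here is a rough, idealized estimate. It is a sketch under stated assumptions, not a pricing model. It assumes cost is proportional to busy hours, applies the spot discount and the runtime reduction multiplicatively, and ignores storage, data transfer, spot interruptions, and any price difference between services. The function name is hypothetical.

```python
def pay_per_use_estimate(always_on_monthly, busy_fraction, spot_discount, runtime_reduction):
    """Idealized monthly cost when paying only for busy hours at discounted spot prices."""
    busy_cost = always_on_monthly * busy_fraction      # stop paying for idle hours
    after_spot = busy_cost * (1 - spot_discount)       # apply the spot discount
    return after_spot * (1 - runtime_reduction)        # shorter jobs after code optimization


# Inputs taken from this case study: $12,000/month, 60% idle, 70% discount, 30% faster training.
estimate = pay_per_use_estimate(
    always_on_monthly=12_000,
    busy_fraction=0.40,
    spot_discount=0.70,
    runtime_reduction=0.30,
)
print(f"Idealized estimate: ${estimate:,.0f}/month")
```

The idealized figure comes out near $1,008/month. Real bills include the costs this sketch leaves out, so treat the estimate as a lower bound rather than a forecast.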

Results:

  • Monthly cost: $12,000 → $3,200 (73% reduction)
  • Annual savings: $105,600
  • Training time: Reduced by 30%
  • Implementation time: 2 weeks

Key Lesson: "Serverless ML training with spot instances was a game-changer. We only pay when we're actually training."


Case Study 3: Platform Team - Monitoring Infrastructure

Challenge:

Our monitoring infrastructure (Prometheus, Grafana, ELK) cost $6,500/month and was growing 15% monthly.

Investigation:

  • EC2 instances: $3,200/month
  • EBS volumes: $1,800/month
  • Data transfer: $1,500/month

Solution:

  1. Migrated to AWS Managed Services:
    • Amazon Managed Prometheus
    • Amazon Managed Grafana
    • Amazon OpenSearch Service
  2. Implemented log filtering (reduced volume 60%)
  3. Configured log retention policies (30 days hot, 90 days cold)

Results:

  • Monthly cost: $6,500 → $2,800 (57% reduction)
  • Annual savings: $44,400
  • Operational overhead: Reduced 80%
  • Reliability: Improved (managed services)
  • Implementation time: 3 weeks

Key Lesson: "Managed services cost more per unit but eliminated operational overhead and actually saved money overall."


Case Study 4: Frontend Team - CDN and Storage

Challenge:

The Frontend team's S3 and CloudFront costs were growing 40% monthly due to increased traffic.

Investigation:

  • S3 storage: 800 GB (all Standard class)
  • CloudFront data transfer: 15 TB/month
  • Cache hit ratio: 45% (should be >80%)
  • Image optimization: None

Solution:

  1. Implemented image optimization (WebP format, compression)
  2. Improved CloudFront caching (increased TTL)
  3. Moved old assets to S3 Intelligent-Tiering
  4. Enabled CloudFront compression

Results:

  • S3 costs: $184 → $98 (47% reduction)
  • CloudFront costs: $1,275 → $510 (60% reduction)
  • Cache hit ratio: 45% → 87%
  • Page load time: Improved 35%
  • Implementation time: 1 week

Key Lesson: "Optimizing for performance also optimized for cost. Better caching reduced both latency and data transfer costs."
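
One way to quantify the caching improvement is to look at cache misses, which are the requests that still reach the origin. The sketch below assumes total request volume stays constant, and the function name is hypothetical.

```python
def origin_fetch_reduction(hit_before, hit_after):
    """Percent drop in requests forwarded to the origin when the cache hit ratio improves."""
    miss_before = 1 - hit_before
    miss_after = 1 - hit_after
    return 100 * (miss_before - miss_after) / miss_before


# Cache hit ratio from this case study: 45% -> 87%.
reduction = origin_fetch_reduction(0.45, 0.87)
print(f"Origin fetches reduced by about {reduction:.0f}%")
```

Going from 45% to 87% means misses fall from 55% to 13% of requests, roughly a 76% drop in origin fetches. That lowers origin request load and latency. The bytes served to viewers depend more on the image optimization and compression steps.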


The Platform Engineering Perspective

As a Platform Engineering Manager, implementing FinOps taught me several crucial lessons:

1. Platform Teams Are Cost Enablers, Not Cost Police

Our role isn't to restrict developers—it's to enable them to make cost-effective choices easily:

  • Provide tools: Cost dashboards, estimation tools, IaC modules
  • Create guardrails: Sensible defaults, automated cleanup, budget alerts
  • Enable self-service: Developers can provision resources without approval
  • Offer guidance: Training, documentation, architecture reviews

2. Cost Optimization Is a Product Feature

Treat FinOps like any other platform capability:

  • User research: Understand developer pain points
  • Iterative development: Start small, gather feedback, improve
  • Measure success: Track adoption, savings, satisfaction
  • Continuous improvement: Regular updates and enhancements

3. Culture Change Takes Time

Technical implementation is fast (weeks). Cultural transformation is slow (months):

  • Month 1: Resistance ("This is finance's job, not mine")
  • Month 2: Curiosity ("Interesting, but not a priority")
  • Month 3: Engagement ("Let me try this optimization")
  • Month 6: Ownership ("We saved $5K this month!")

Be patient and celebrate small wins.

4. Executive Support Is Critical

FinOps succeeds when leadership:

  • Allocates time: Engineers need time for optimization work
  • Recognizes efforts: Public acknowledgment of cost savings
  • Provides resources: Budget for tools and training
  • Sets expectations: Cost efficiency as a performance metric

5. Start Small, Think Big

Our implementation roadmap:

Week 1: Quick wins (cleanup, budgets)
  ↓
Month 1: Visibility (dashboards, reporting)
  ↓
Month 3: Optimization (rightsizing, scheduling)
  ↓
Month 6: Culture (training, accountability)
  ↓
Year 1: Maturity (predictive analytics, innovation)

Don't try to do everything at once. Build momentum with early wins.


Conclusion: The Future of Cost-Aware Platform Engineering

Six months ago, we faced a crisis: unsustainable AWS cost growth threatening our business. Today, we have a mature FinOps practice that has:

  • Achieved significant cost savings through systematic optimization
  • Reduced waste by 30% across our infrastructure while scaling it 3x
  • Improved visibility from leadership-only to all teams
  • Transformed culture from cost-ignorant to cost-aware
  • Enhanced developer experience with better tools and processes

But more importantly, we've fundamentally changed how we think about platform engineering.

The New Platform Engineering Paradigm

Traditional Platform Engineering:

Focus: Reliability + Velocity
Metrics: Uptime, deployment frequency, MTTR
Cost: Afterthought, handled by finance

Cost-Aware Platform Engineering:

Focus: Reliability + Velocity + Efficiency
Metrics: Uptime, deployment frequency, MTTR, cost per transaction
Cost: First-class requirement, owned by engineering

Key Principles We Live By

  1. Cost is a feature, not a constraint

    • Efficient systems are better systems
    • Cost optimization drives architectural improvements
    • Savings fund innovation
  2. Visibility drives behavior

    • Developers can't optimize what they can't see
    • Real-time feedback creates accountability
    • Transparency builds trust
  3. Automation scales culture

    • Manual processes don't scale
    • Automated optimization is sustainable
    • Tools enable best practices
  4. Continuous improvement is the goal

    • FinOps is never "done"
    • Always room for optimization
    • Learn, measure, improve, repeat

What's Next for Us

Our FinOps roadmap for the coming months:

Short-term (Next Quarter):

  • Implement predictive cost modeling with ML
  • Expand to multi-cloud cost optimization
  • Launch advanced FinOps certification program
  • Build cost optimization into CI/CD pipelines

Medium-term (6-12 Months):

  • Achieve advanced FinOps maturity
  • Further reduce waste through automation
  • Implement carbon-aware computing
  • Share learnings at industry conferences

Call to Action

If you're a Platform Engineering Manager facing similar challenges:

Start today:

  1. Analyze your last 3 months of AWS bills
  2. Identify your top 5 cost drivers
  3. Set up basic budget alerts
  4. Deploy automated cost reporting
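
For step 3, a budget alert can be defined in a few lines. The sketch below only builds the request payload, so it runs without AWS credentials. The field names follow the AWS Budgets CreateBudget API as I understand it, so verify them against the current documentation. The account ID, budget name, limit, and email address are all placeholders.

```python
def build_budget_request(account_id, name, monthly_limit_usd, alert_pct, email):
    """Build keyword arguments for an AWS Budgets create_budget call (payload only)."""
    return {
        "AccountId": account_id,
        "Budget": {
            "BudgetName": name,
            "BudgetLimit": {"Amount": str(monthly_limit_usd), "Unit": "USD"},
            "TimeUnit": "MONTHLY",
            "BudgetType": "COST",
        },
        "NotificationsWithSubscribers": [
            {
                "Notification": {
                    "NotificationType": "ACTUAL",
                    "ComparisonOperator": "GREATER_THAN",
                    "Threshold": float(alert_pct),
                    "ThresholdType": "PERCENTAGE",
                },
                "Subscribers": [{"SubscriptionType": "EMAIL", "Address": email}],
            }
        ],
    }


# Placeholder values for illustration only.
request = build_budget_request(
    "123456789012", "platform-monthly", 5000, 80, "finops@example.com"
)
print(request["Budget"]["BudgetName"], request["Budget"]["BudgetLimit"])

# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("budgets").create_budget(**request)
```

Keeping the payload in a plain function makes it easy to review in a pull request and to generate one budget per team from a list.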

This week:

  1. Clean up obvious waste (stopped instances, unattached volumes)
  2. Implement a tagging strategy
  3. Create a cost dashboard
  4. Schedule your first FinOps team meeting
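
For step 1, unattached EBS volumes are usually the easiest waste to find. The sketch below filters volume records shaped like the entries in an EC2 DescribeVolumes response, where an unattached volume is in the `available` state. The sample data and function name are hypothetical. Fetching the real list, for example with boto3's `describe_volumes`, is left out so the example runs offline.

```python
def unattached_volumes(volumes):
    """Return (volume_id, size_gib) pairs for volumes not attached to any instance."""
    return [
        (volume["VolumeId"], volume["Size"])
        for volume in volumes
        if volume.get("State") == "available" and not volume.get("Attachments")
    ]


# Hypothetical records in the shape of DescribeVolumes output.
sample_volumes = [
    {"VolumeId": "vol-aaa", "Size": 100, "State": "in-use",
     "Attachments": [{"InstanceId": "i-123"}]},
    {"VolumeId": "vol-bbb", "Size": 500, "State": "available", "Attachments": []},
    {"VolumeId": "vol-ccc", "Size": 50, "State": "available", "Attachments": []},
]

candidates = unattached_volumes(sample_volumes)
total_gib = sum(size for _, size in candidates)
print(f"{len(candidates)} unattached volumes, {total_gib} GiB total")
```

Report candidates first and delete later. A snapshot before deletion is a cheap safety net if a volume turns out to matter.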

This month:

  1. Rightsize oversized instances
  2. Implement instance scheduling for dev/staging
  3. Set up automated cleanup
  4. Conduct FinOps training for your team
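
The scheduling decision in step 2 comes down to one predicate. Here is a minimal sketch, assuming a 9 AM to 6 PM weekday window (borrowed from the traffic pattern in Case Study 1) and an environment label of `production` for always-on workloads. Both are assumptions to adapt, and the function names are hypothetical.

```python
from datetime import datetime


def should_be_running(now, environment, start_hour=9, end_hour=18):
    """Keep production always on; run other environments on weekdays within working hours."""
    if environment == "production":
        return True
    is_weekday = now.weekday() < 5  # Monday is 0, Friday is 4
    return is_weekday and start_hour <= now.hour < end_hour


def hours_saved_pct(start_hour=9, end_hour=18, days_per_week=5):
    """Percent of weekly instance-hours avoided by the schedule."""
    running_hours = (end_hour - start_hour) * days_per_week
    return 100 * (1 - running_hours / (24 * 7))


print(should_be_running(datetime(2026, 1, 5, 10, 0), "dev"))  # a Monday morning
print(should_be_running(datetime(2026, 1, 3, 10, 0), "dev"))  # a Saturday morning
print(f"Instance-hours avoided: {hours_saved_pct():.0f}%")
```

A 45-hour week out of 168 hours avoids about 73% of instance-hours for scheduled environments. That is why dev and staging are typically the first candidates.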

This quarter:

  1. Evaluate Reserved Instances and Savings Plans
  2. Optimize storage with lifecycle policies
  3. Integrate cost into architecture reviews
  4. Establish monthly cost review meetings
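
When evaluating commitments in step 1, a useful first filter is the break-even utilization. A commitment bills for every hour at a discounted rate, while on-demand bills only for the hours used, so the commitment wins only when the workload runs for more than (1 - discount) of the time. The sketch below uses a hypothetical 30% discount. Real discounts vary by term, payment option, and instance family.

```python
def commitment_break_even(discount):
    """Minimum fraction of hours a workload must run for a commitment to beat on-demand."""
    return 1 - discount


def cheaper_option(utilization, discount):
    """Compare on-demand (pay for used hours) against a commitment (pay for all hours, discounted)."""
    return "commit" if utilization > commitment_break_even(discount) else "on-demand"


# Hypothetical 30% discount: break-even is at 70% utilization.
print(cheaper_option(utilization=0.90, discount=0.30))
print(cheaper_option(utilization=0.40, discount=0.30))
```

This is also why the order of work matters: rightsize and schedule first, then commit to whatever steady baseline remains.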

Final Thoughts

FinOps isn't about spending less—it's about spending smart. It's about building a culture where every engineer understands the cost impact of their decisions and has the tools to make efficient choices.

The journey from reactive cost management to proactive optimization wasn't just about saving money. It was about building a better platform, creating better systems, and empowering better engineers.

The best time to start your FinOps journey was yesterday. The second best time is today.


Resources and Next Steps

Recommended Resources

Books:

  • "Cloud FinOps" by J.R. Storment and Mike Fuller
  • "The DevOps Handbook" by Gene Kim et al.
  • "AWS Well-Architected Framework: Cost Optimization Pillar" (AWS whitepaper)

Certifications:

  • FinOps Certified Practitioner
  • AWS Certified Cloud Practitioner
  • AWS Certified Solutions Architect

Related Blog Posts

📖 Building an Automated AWS Billing Report System with SAM and Microsoft Teams


Have you implemented FinOps in your organization? What challenges did you face? What worked well? Share your experiences in the comments below!

Tags: #AWS #FinOps #PlatformEngineering #CostOptimization #CloudCosts #DevOps #SRE #CloudFinance #InfrastructureAsCode #Terraform #CloudArchitecture


About the Author

As a Platform Engineering Manager, I lead a team responsible for building and maintaining cloud infrastructure that powers our products. Our mission is to enable developers with reliable, scalable, and cost-efficient platforms. This blog shares our real-world journey implementing FinOps practices in AWS.

Published: January 2026

Reading Time: 30 minutes

Difficulty: Intermediate to Advanced
