Companies move to AWS expecting three things: scalability, flexibility, and lower costs. The first two usually hold up. The third? That's where I've seen things go sideways — repeatedly.
I've spent a good chunk of my career working across cloud environments, helping teams untangle billing nightmares and architect systems that don't hemorrhage money. And the uncomfortable truth I keep running into is simple:
AWS isn't expensive. Running AWS without paying attention is.
Let me walk through what I actually see happening out there, and what I think more teams should be doing differently.
The promises vs. the reality
The pitch is always the same. Pay only for what you use. No upfront hardware investment. Scaling that just works.
All technically true. But there's a massive asterisk nobody talks about during the sales cycle — the cloud rewards discipline and absolutely punishes negligence. On-prem, your spending was capped by physical hardware. You bought a rack, you used a rack. With AWS, there's no ceiling, and that elasticity can work against you if nobody's watching.
I had a client a couple of years ago who migrated about 200 workloads into EC2 over six months. Their on-prem costs were roughly $85k/month. Nine months into AWS, they were at $140k and climbing. Not because AWS was more expensive — because they'd replicated every bad habit from their data center and added new ones on top.
Where the money actually goes
The lift-and-shift trap
This is the most common one. A team takes their on-prem VMs — say, a 16-vCPU server running an internal app — and stands up an identical EC2 instance. No benchmarking, no utilization checks, nothing. Just "make it the same size so nothing breaks."
The problem is that most of those servers were oversized to begin with. I've pulled CloudWatch metrics on freshly migrated instances and found utilization hovering around 12–18%. That's a machine doing almost nothing, billed around the clock. Multiply that across a fleet and you've got a serious problem.
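Pulling those utilization numbers yourself is straightforward. Here's a rough sketch of the kind of check I mean, using boto3 and CloudWatch; the instance ID, region, and the 20% "oversized" threshold are all placeholder assumptions you'd tune for your own fleet:

```python
# Sketch: average an instance's daily CPU over two weeks and flag it if
# it looks oversized. Instance ID / region / threshold are placeholders.
from datetime import datetime, timedelta, timezone


def avg_cpu_percent(instance_id, days=14, region="us-east-1"):
    """Mean of daily-average CPUUtilization datapoints for one instance."""
    import boto3  # imported here so the pure helper below needs no AWS creds
    cw = boto3.client("cloudwatch", region_name=region)
    now = datetime.now(timezone.utc)
    resp = cw.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=now - timedelta(days=days),
        EndTime=now,
        Period=86400,          # one datapoint per day
        Statistics=["Average"],
    )
    points = [p["Average"] for p in resp["Datapoints"]]
    return sum(points) / len(points) if points else None


def looks_oversized(avg_cpu, threshold=20.0):
    """Pure helper: flag anything averaging below the threshold."""
    return avg_cpu is not None and avg_cpu < threshold
```

Run that across a fleet and sort descending by waste, and you have a rightsizing worklist in an afternoon.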
Dev and staging environments that never sleep
This one drives me a little crazy because it's so fixable.
Production needs to run 24/7 — obviously. But I routinely find dev, staging, QA, and sandbox environments humming along at 2 AM on a Sunday. Nobody's using them. Nobody even remembers they're on. But they're racking up charges because no single person is responsible for shutting them down.
I worked with one mid-sized engineering org — maybe 8 teams — where non-production environments accounted for nearly 40% of their monthly compute bill. On infrastructure that was idle roughly 65% of the time.
The on-demand default
AWS gives you several ways to pay less: Savings Plans, Reserved Instances, Spot. But a surprising number of companies run everything on-demand because it's the path of least resistance. Nobody has to forecast. Nobody has to commit. Engineering doesn't have to talk to finance.
The tradeoff is you're paying full sticker price for workloads that have been running predictably for months or even years. I've seen teams save 30–40% just by committing to a one-year Savings Plan on compute they were clearly going to keep running anyway. It's not a risky bet — it's common sense that nobody gets around to acting on.
Storage: the quiet budget killer
Storage costs don't spike. They creep. That's what makes them dangerous.
S3 buckets with no lifecycle policies, so data just accumulates forever. EBS snapshots from instances that were terminated a year ago. Detached volumes sitting there, still on the meter. Log files nobody's looked at since 2022, stored in standard-tier S3 instead of Glacier or just deleted outright.
Individually, none of these line items look alarming. But I've audited accounts where storage waste was quietly eating $8–15k a month. It adds up faster than people expect.
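Detached volumes are the easiest of these to hunt down programmatically. A minimal sketch, assuming a gp3 rate of roughly $0.08/GB-month (check current pricing for your region — this number is an assumption, not gospel):

```python
# Sketch: find detached ("available") EBS volumes and estimate the monthly
# cost of leaving them on the meter. The per-GB rate is an assumed gp3
# us-east-1 figure; substitute your region's actual pricing.
GP3_PRICE_PER_GB_MONTH = 0.08  # assumption -- verify against current pricing


def monthly_waste(volume_sizes_gb, price=GP3_PRICE_PER_GB_MONTH):
    """Pure helper: estimated monthly cost of a list of volume sizes (GB)."""
    return round(sum(volume_sizes_gb) * price, 2)


def detached_volume_sizes(region="us-east-1"):
    """List the sizes of all volumes not attached to any instance."""
    import boto3  # imported here so monthly_waste() works without AWS creds
    ec2 = boto3.client("ec2", region_name=region)
    sizes = []
    for page in ec2.get_paginator("describe_volumes").paginate(
        Filters=[{"Name": "status", "Values": ["available"]}]
    ):
        sizes.extend(v["Size"] for v in page["Volumes"])
    return sizes
```

Pair this with an S3 lifecycle policy review and a snapshot cleanup pass and you've covered most of the quiet storage leaks in one sweep.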
Kubernetes cost sprawl
EKS is a whole separate conversation, honestly.
The pattern I see most often: developers set CPU and memory requests conservatively high because they'd rather have headroom than get paged at 3 AM. Fair enough — I get it. But the cluster autoscaler provisions nodes based on those requests, not actual usage. So you end up with a bunch of nodes running at 25% utilization, fully billed, because the resource requests told Kubernetes it needed all that capacity.
Tuning pod requests and limits, setting up HPA properly, right-sizing node groups — this stuff isn't glamorous, but it can cut EKS costs dramatically. I've seen 30–50% reductions from a focused two-week tuning effort.
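The arithmetic behind that gap is worth seeing once. This is a back-of-the-envelope sketch — the pod counts, request sizes, and node capacity below are illustrative, not from any real cluster:

```python
# Why inflated requests inflate node count: the cluster autoscaler sizes
# the fleet from *requests*, not live usage. Numbers are made up.
import math


def nodes_needed(pod_request_mcpu, pod_count, node_mcpu=4000):
    """Nodes required to satisfy the pods' CPU requests (millicores)."""
    return math.ceil(pod_request_mcpu * pod_count / node_mcpu)


# 60 pods that each *request* 1000m of CPU but *use* ~250m on average:
provisioned = nodes_needed(1000, 60)  # what the autoscaler pays for: 15
sufficient = nodes_needed(250, 60)    # what real usage would fit in:   4
```

Same workload, nearly 4x the nodes, purely because of what the requests told the scheduler. That's the gap request tuning closes.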
Why this keeps happening
Here's the thing — it's almost never a skills problem. It's an incentives problem.
Engineers are measured on uptime and reliability, not cost. If my bonus depended on keeping an SLA, I'd overprovision too. That's rational behavior given the incentives. The issue is that nobody's creating a counterbalancing reason to optimize spend.
Finance teams, meanwhile, are used to predictable costs — hardware depreciation, annual license renewals. AWS billing is dynamic, granular, and frankly kind of confusing if you're not technical. So finance sees the bill, maybe flags it when it spikes, but doesn't have the context to ask the right questions.
What bridges that gap is a FinOps practice — finance, engineering, and leadership sharing ownership of cloud spend. But most companies I've worked with either don't have one or have a version that's really just one person generating Cost Explorer reports that nobody reads.
What actually works
No magic bullets here, but these consistently move the needle.
Bake cost into architecture decisions
Not as an afterthought. Before you deploy, estimate your utilization, pick the right instance family, define your scaling thresholds, and figure out your pricing model. If your architecture review only covers security and availability, you're leaving money on the table.
Enforce tagging — I mean it
If you can't attribute a cost to a team, a project, and an environment, you can't control it. I'm a big fan of making deployment pipelines reject resources that don't carry mandatory tags. Sounds heavy-handed, but it works. Once teams see their name next to a dollar amount, behavior changes fast.
A basic setup might look like:
```yaml
# Example: enforcing tags in a CloudFormation resource
Resources:
  MyInstance:
    Type: AWS::EC2::Instance
    Properties:
      InstanceType: t3.medium
      Tags:
        - Key: Project
          Value: !Ref ProjectName
        - Key: Environment
          Value: !Ref EnvType
        - Key: Owner
          Value: !Ref TeamOwner
        - Key: CostCenter
          Value: !Ref CostCenterCode
```
Automate environment scheduling
Probably the single highest-ROI change most companies can make. A Lambda function that shuts down dev and staging at 7 PM and spins them back up at 7 AM — that's a 50% compute reduction on those environments right there.
```python
# Simplified Lambda to stop instances by tag
import boto3

def lambda_handler(event, context):
    ec2 = boto3.client('ec2')
    response = ec2.describe_instances(
        Filters=[
            {'Name': 'tag:Environment', 'Values': ['dev', 'staging']},
            {'Name': 'instance-state-name', 'Values': ['running']}
        ]
    )
    instance_ids = []
    for reservation in response['Reservations']:
        for instance in reservation['Instances']:
            instance_ids.append(instance['InstanceId'])
    if instance_ids:
        ec2.stop_instances(InstanceIds=instance_ids)
        print(f"Stopped {len(instance_ids)} instances")
```
You can get fancier with CI/CD-triggered provisioning, but even a simple scheduler pays for itself immediately.
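The missing piece is the trigger. One way to wire it up — and this is a sketch, with placeholder rule names and function ARNs — is a pair of EventBridge cron rules pointing at the stop and start Lambdas. Remember that EventBridge cron runs in UTC, and the Lambda also needs a resource policy allowing EventBridge to invoke it:

```python
# Sketch: schedule stop/start Lambdas with EventBridge cron rules.
# Rule names and ARNs below are placeholders; cron hours are UTC.


def weekday_cron(hour_utc):
    """Build an EventBridge cron expression for weekdays at a given UTC hour."""
    return f"cron(0 {hour_utc} ? * MON-FRI *)"


def schedule(rule_name, cron_expr, lambda_arn, region="us-east-1"):
    import boto3  # imported here so weekday_cron() works without AWS creds
    events = boto3.client("events", region_name=region)
    events.put_rule(Name=rule_name, ScheduleExpression=cron_expr,
                    State="ENABLED")
    events.put_targets(Rule=rule_name,
                       Targets=[{"Id": "1", "Arn": lambda_arn}])


# schedule("stop-nonprod",  weekday_cron(23), STOP_LAMBDA_ARN)   # 7 PM ET
# schedule("start-nonprod", weekday_cron(11), START_LAMBDA_ARN)  # 7 AM ET
```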
Stop defaulting to on-demand
Review your usage quarterly. Anything that's been running steadily for 3+ months is a candidate for Savings Plans or RIs. Spot is great for fault-tolerant batch jobs and test workloads. This isn't complicated — it just requires someone to actually own the decision.
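Finding those commitment candidates can itself be scripted. A rough sketch using the Cost Explorer API — the 15% "steady" threshold is my own rule of thumb, not an AWS recommendation:

```python
# Sketch: pull monthly EC2 compute spend from Cost Explorer and flag it as
# a Savings Plan candidate if it's been steady. Threshold is an assumption.


def is_steady(monthly_costs, max_swing=0.15):
    """Pure helper: True if every month is within max_swing of the mean."""
    if len(monthly_costs) < 3:
        return False  # not enough history to justify a commitment
    mean = sum(monthly_costs) / len(monthly_costs)
    return all(abs(c - mean) / mean <= max_swing for c in monthly_costs)


def monthly_ec2_costs(start, end):
    """Monthly unblended EC2 compute cost between two 'YYYY-MM-DD' dates."""
    import boto3  # imported here so is_steady() works without AWS creds
    ce = boto3.client("ce")
    resp = ce.get_cost_and_usage(
        TimePeriod={"Start": start, "End": end},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        Filter={"Dimensions": {
            "Key": "SERVICE",
            "Values": ["Amazon Elastic Compute Cloud - Compute"],
        }},
    )
    return [float(r["Total"]["UnblendedCost"]["Amount"])
            for r in resp["ResultsByTime"]]
```

If `is_steady` comes back true for three-plus months of history, that spend was going to happen anyway — commit to it and pocket the discount.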
Rightsize continuously
Not once. Not annually. Set a recurring calendar reminder — monthly or quarterly — to review EC2, RDS, and EKS utilization. Workloads change. Traffic patterns shift. An instance that was right-sized six months ago might be oversized now.
AWS Compute Optimizer can point you in the right direction, but someone still has to act on the recommendations. Tooling without follow-through is just dashboard decoration.
Tie cost to business outcomes
"Our AWS bill is $200k/month" is not a useful data point on its own.
"We spend $0.12 per active user" or "$0.003 per transaction" — that's something you can reason about, benchmark, and improve. When cost is framed as a business metric rather than an infrastructure line item, the conversations get a lot more productive.
Where does your team fall?
I think about cloud cost management as a rough progression:
Cloud chaos — No tagging, no budgets, no ownership. The bill is a surprise every month.
Reactive control — Someone set up budget alerts. There's a cleanup sprint every quarter when the CFO starts asking questions.
Proactive optimization — Automated scheduling, a real Savings Plans strategy, regular utilization reviews built into team rhythms.
Cost-aware culture — Teams discuss spend in sprint reviews. Architects weigh pricing alongside performance. Engineering and finance share a vocabulary.
Most companies I encounter are somewhere between the first two. And honestly, getting from chaos to proactive is where the biggest savings happen — usually without a massive organizational overhaul.
TL;DR
AWS is a phenomenal platform. But the flexibility that makes it powerful is the same flexibility that makes it expensive if you're not intentional.
The companies I've seen manage this well all share a few traits: they treat cost as a design constraint rather than an afterthought, they create accountability through tagging and visibility, and they build lightweight processes that keep optimization from being a heroic one-off effort.
None of this is rocket science. Most of it isn't even hard. It just requires someone to care enough to make it a priority — and to keep caring about it month after month.
What's your experience? Have you found specific tools or practices that helped rein in cloud spend? I'd love to hear what's worked (or spectacularly hasn't) for your team.