Trigops for Trigops

Posted on May 28

The ECS Task That Billed You All Night (And How to Think About Fixing It)

#aws #finops #cloudcost #devops

Your ECS service spun up four tasks at 9 AM for a load test. By 11 AM the test was done, the results were in Slack, and everybody moved on. It's now 2 AM. Those four tasks are still running.

Nothing is serving traffic. Nobody is waiting on them. The service health check is green because the tasks haven't crashed. They're just sitting there, healthy and idle, billing you $0.04048 per vCPU-hour and $0.004445 per GB-hour (Fargate on-demand rates in us-east-1).
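To put a number on that, here is a minimal sketch of the arithmetic using the two rates quoted above. The task size (1 vCPU, 2 GB) is an assumption for illustration, since the scenario doesn't specify one; the 15 hours is simply 11 AM to 2 AM.

```python
# Per-hour rates quoted in the paragraph above.
VCPU_HOUR = 0.04048
GB_HOUR = 0.004445


def idle_cost(tasks: int, vcpu: float, memory_gb: float, hours: float) -> float:
    """Cost in USD of `tasks` identical tasks left running for `hours`."""
    per_task_hour = vcpu * VCPU_HOUR + memory_gb * GB_HOUR
    return tasks * per_task_hour * hours


# Assumed task size: 1 vCPU and 2 GB of memory (not stated in the scenario).
# Idle window: 11 AM to 2 AM, i.e. 15 hours.
overnight = idle_cost(tasks=4, vcpu=1, memory_gb=2, hours=15)
print(f"${overnight:.2f}")
```

A few dollars for one night is easy to ignore, which is exactly why it keeps happening.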

This is not a rare edge case. It's the default behavior of most dev and staging environments, because the failure mode is invisible. The instance isn't broken. The service isn't alerting. The cost just... accumulates, quietly, until someone pulls the AWS bill at the end of the month and wonders where the extra $400 went.

This article is about why that happens structurally — what ECS, EC2, and ASG are actually doing while your engineers sleep — and what a better model looks like.

Why the problem is harder than it looks

The naive fix is a schedule: shut things down at 7 PM, bring them back at 8 AM. This is what most teams reach for first, and it solves about 60% of the problem while introducing a different one.

Schedules are an approximation of human behavior, not a model of it. Your engineers are not fungi; they don't follow a predictable diurnal rhythm. Someone is debugging a prod issue at 10 PM. Someone else is finishing a migration at 6 AM before a release. A team in a different timezone starts at noon your time.

The moment you codify a schedule, you start getting exceptions. The 7 PM shutdown becomes 8 PM because someone complained. Then 9 PM. Then you add a "keep-alive" flag for certain instances. Then the schedule starts drifting per-team, per-environment, per-resource, and you've replaced a billing problem with a configuration management problem.

The other failure mode is subtler: the schedule doesn't know what the resource is doing at shutdown time. Your EC2 dev box gets stopped mid-compile. Your staging RDS instance gets stopped while a migration is running. Your ECS task for a long-running job gets killed two minutes before it would have finished. These aren't theoretical. They happen, and they erode trust in the automation fast enough that teams start disabling it.

What actually drives idle time

To fix the problem properly, you have to model it correctly. Idle time in dev/staging environments comes from a few distinct sources:

Work-session gaps. Engineers don't work in continuous eight-hour blocks. There are lunch breaks, meetings, context switches, and the natural rhythm of focused work followed by review and coordination. A dev EC2 box might be actively used for three or four focused hours in a given workday, with the rest of the time sitting warm but unused.

End-of-day orphans. The most common case: work ends, the engineer closes their laptop, and whatever AWS resources they were using keep running. No shutdown signal was sent. Nothing crashed. The resources just... persist.

Test and load artifacts. ECS tasks, RDS read replicas, and EC2 instances provisioned for a specific test run and not cleaned up afterward. The test finished, the PR merged, and nobody thought to tear down the environment.

Over-provisioned ASGs. Scale-out events during business hours don't automatically reverse when load drops. If your desired capacity isn't decremented or your scale-in policy is conservative (common, to avoid flapping), the extra instances stay in the group above baseline, billing you for capacity you're not using.

None of these are pathological. They're all rational behavior by engineers doing their jobs — and they add up fast at AWS prices.

The ASG case is worth examining in detail

Auto Scaling Groups get special attention because they're where the invisible-cost problem is most structurally embedded.

Here's a typical sequence:

09:00  ASG desired capacity: 2 (baseline)
13:00  Load test begins, scale-out triggered
13:15  ASG desired capacity: 6 (scale-out complete)
14:47  Load test ends
15:30  p99 latency back to baseline
17:00  Engineers sign off
23:00  ASG desired capacity: 6 (still)

Why didn't it scale back in? Several reasons:

Scale-in policies typically have a cooldown (300 seconds by default) and require sustained low CPU, not just low CPU right now. A brief spike at 5:50 PM restarted the clock.
The target tracking policy is based on average CPU across the group. Four idle instances average out against two slightly-busy ones and the metric never triggers.
Nobody set a scheduled scale-in for end-of-business because "we'll handle that in the next sprint."
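The first reason is easier to see with a toy model. This is not how Auto Scaling evaluates alarms internally; it is a simplified sketch that assumes scale-in fires only after a fixed number of consecutive low-CPU samples. The threshold, the window length, and the sample data are all made up for illustration.

```python
def scale_in_fires(cpu_samples: list[float], threshold: float, required: int) -> bool:
    """Return True if `required` consecutive samples fall below `threshold`."""
    streak = 0
    for cpu in cpu_samples:
        if cpu < threshold:
            streak += 1
            if streak >= required:
                return True
        else:
            streak = 0  # a single spike restarts the count
    return False


# Hypothetical one-sample-per-minute series:
# 14 quiet minutes, one brief spike, then 14 more quiet minutes.
quiet = [3.0] * 14
with_spike = quiet + [55.0] + quiet

print(scale_in_fires(with_spike, threshold=10.0, required=15))   # False
print(scale_in_fires([3.0] * 15, threshold=10.0, required=15))   # True
```

Twenty-eight of twenty-nine minutes were idle in the first series, and the group still never scales in.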

The result is that you're paying for four extra instances from 2:47 PM until someone manually adjusts desired capacity the next morning — roughly 16 hours. At m5.large on-demand pricing in us-east-1, that's about $0.096/hour per instance, so 4 instances × 16 hours × $0.096 = roughly $6.14 for a single day. Multiply by the number of load-test days per month, and by the number of ASGs in your non-production account, and you're talking meaningful money.
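The same arithmetic as a short sketch, so you can plug in your own numbers. The rate, instance count, and hours come from the paragraph above; the monthly multipliers are assumptions chosen only to show how the figure scales.

```python
HOURLY_RATE = 0.096    # m5.large on-demand, us-east-1, as quoted above
EXTRA_INSTANCES = 4    # desired capacity 6 minus baseline 2
IDLE_HOURS = 16

per_incident = EXTRA_INSTANCES * IDLE_HOURS * HOURLY_RATE
print(f"per incident: ${per_incident:.2f}")

# Assumed multipliers, purely for illustration.
load_test_days_per_month = 8
asg_count = 5
per_month = per_incident * load_test_days_per_month * asg_count
print(f"per month:    ${per_month:.2f}")
```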

The fix that almost works: set a scheduled action to reset desired capacity to 2 at 8 PM. This is better than nothing. It still leaves idle instances running until 8 PM no matter when the test ended (more than five hours for the 2:47 PM finish above), and it breaks the engineer who is legitimately using those instances at 8:05 PM.
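For reference, the scheduled action described above could be created with boto3's `put_scheduled_update_group_action`. This is a sketch, not a recommendation: the ASG name, action name, and time zone are hypothetical, and the request builder is kept separate from the AWS call so it can be inspected without credentials.

```python
def scale_in_action_params(asg_name: str, baseline: int, hour: int, tz: str) -> dict:
    """Build the request for a daily scheduled reset to baseline capacity."""
    return {
        "AutoScalingGroupName": asg_name,
        "ScheduledActionName": "nightly-reset-to-baseline",  # hypothetical name
        "Recurrence": f"0 {hour} * * *",  # cron format: every day at hour:00
        "TimeZone": tz,                   # without this, the cron is read as UTC
        "DesiredCapacity": baseline,
    }


def apply_scale_in_action(params: dict) -> None:
    """Send the request to AWS. Requires boto3 and valid credentials."""
    import boto3  # imported here so the builder above runs without AWS access

    boto3.client("autoscaling").put_scheduled_update_group_action(**params)


# Hypothetical ASG name and time zone.
params = scale_in_action_params(
    "staging-web-asg", baseline=2, hour=20, tz="America/New_York"
)
print(params["Recurrence"])
```

Note what the request cannot express: nothing in it knows whether anyone is still using the instances at 8 PM.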

What "activity-driven" actually means

The phrase "activity-driven" is worth defining precisely, because it gets used loosely.

Activity-driven resource management means the pause/resume decision is based on detected real user presence and work-tool focus — what the engineer's machine is actively doing — rather than a clock or a server-side metric.

This is different from:

Schedule-based: "Pause at 7 PM, resume at 8 AM" — no signal from the human.
Metric-based: "Pause when CPU < 5% for 30 minutes" — signal from the server, not the user. An idle EC2 box running a cron job looks identical to one nobody has touched.
Manual: "Click the stop button when you're done" — relies on engineers remembering, which they don't.

An activity-driven system detects that the engineer's machine has been idle or context-switched away from work tools for a meaningful interval, and uses that as the signal to pause the associated cloud resources. When they return — open their IDE, reconnect a terminal, push a commit — the resources resume automatically before they need them.

The key property is that the signal comes from the human side of the equation, not the infrastructure side. This matters because:

An RDS instance can be completely idle (zero connections, zero queries) while the engineer is mid-thought and about to run a migration. Metric-based systems see idle and pause; the engineer gets a connection timeout.
An EC2 box can show 90% CPU because it's running a background compile while the engineer is in a three-hour meeting. Metric-based systems see a busy box and keep it running; activity-driven systems can pause it when it's clear the engineer won't need output from that compile for hours.
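The two cases above can be sketched as a pair of decision functions. Everything here is hypothetical: the `Snapshot` fields, the thresholds, and the idea that user idle time arrives as a single number are simplifications for illustration, not a description of any particular product's signals.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    cpu_percent: float                # server-side metric
    minutes_since_user_active: float  # client-side signal (hypothetical)


def metric_based_pause(s: Snapshot, cpu_threshold: float = 5.0) -> bool:
    """Pause when the server looks idle."""
    return s.cpu_percent < cpu_threshold


def activity_driven_pause(s: Snapshot, idle_minutes: float = 30.0) -> bool:
    """Pause when the human has been away long enough."""
    return s.minutes_since_user_active >= idle_minutes


# Case 1: idle RDS instance, engineer about to run a migration.
rds_mid_thought = Snapshot(cpu_percent=0.0, minutes_since_user_active=2)
# Case 2: busy EC2 box, engineer two hours into a meeting.
ec2_unattended = Snapshot(cpu_percent=90.0, minutes_since_user_active=120)

for name, snap in [("rds", rds_mid_thought), ("ec2", ec2_unattended)]:
    print(name, metric_based_pause(snap), activity_driven_pause(snap))
```

The two policies disagree on both resources, which is the point: they are answering different questions.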

The RDS wrinkle

RDS makes this harder because the resource has no meaningful idle signal at the instance level. An EC2 instance at 0% CPU is almost certainly idle. An RDS instance at 0% CPU might have an engineer staring at a half-written query in their SQL client, about to hit execute.

The standard AWS tooling here is Instance Scheduler on AWS, which lets you define stop/start windows per tag. It's schedule-based, and it has the same failure mode described above, except that with RDS the stakes are higher because RDS startup time is non-trivial. A stopped Multi-AZ RDS instance can take several minutes to become available once it is started again. If someone's stop window is wrong and the database stops mid-migration, you're looking at a real incident, not just a mild inconvenience.

The billing structure also differs from EC2. When an RDS instance is stopped, you stop paying for instance hours, but you keep paying for storage (the gp2/gp3 volume), backup storage, and any snapshot overhead. The Multi-AZ standby you provisioned to give staging prod-like parity? That's billed at the same rate as the primary. Stopping the instance stops both, but the storage charges continue.

For most non-production RDS use cases, the right architecture is a stopped instance that restarts on demand, not a running instance that someone remembers to stop. The challenge is the "on demand" part — the startup latency makes manual workflows painful, which is why most teams don't do it.
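The "on demand" part can be sketched with three real boto3 RDS calls: `describe_db_instances`, `start_db_instance`, and the `db_instance_available` waiter. The function name and the instance identifier in the usage comment are hypothetical, and the sketch deliberately ignores edge cases such as an instance that is still in the `stopping` state.

```python
def ensure_db_running(rds, db_id: str) -> str:
    """Start the instance if it is stopped, then block until it is available.

    `rds` is a boto3 RDS client (or anything with the same three methods).
    Returns the status observed before any action was taken.
    """
    described = rds.describe_db_instances(DBInstanceIdentifier=db_id)
    status = described["DBInstances"][0]["DBInstanceStatus"]

    if status == "stopped":
        rds.start_db_instance(DBInstanceIdentifier=db_id)

    if status != "available":
        # Polls until the instance reports "available"; this is where
        # the several minutes of startup latency are spent.
        rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier=db_id)

    return status


# Usage (needs boto3 and credentials; identifier is hypothetical):
#   import boto3
#   ensure_db_running(boto3.client("rds"), "staging-migrations-db")
```

The code is short; the wait is not. Whatever calls this has to run before the engineer needs the database, or they sit through the startup themselves.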

What good looks like in practice

A useful mental model: treat non-production cloud resources the way you treat your local dev environment. Your laptop doesn't run your IDE at full capacity all night. It sleeps. It wakes up fast when you need it. The resource is available when you're working and not burning compute when you're not.

Applied to AWS:

An EC2 dev box stops within a few minutes of the engineer closing their last work session, and starts automatically when they open their IDE or terminal the next morning.
An ECS staging service scales to zero when no engineers are actively working, and scales up before they connect — not after they notice a 502.
An RDS instance used only for migration testing stops at end-of-day and restarts when someone connects.
An ASG resets to baseline capacity when work-session signals across the team have been idle for a threshold interval, not on a fixed schedule.
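The ASG case above can be sketched as a small decision function. The names, the two-hour threshold, and the shape of the activity data (one "last active" timestamp per engineer) are assumptions for illustration; how those timestamps are collected is out of scope here.

```python
from datetime import datetime, timedelta


def team_idle(last_active: dict[str, datetime], now: datetime, threshold: timedelta) -> bool:
    """True when every engineer's most recent activity is older than `threshold`."""
    return all(now - seen >= threshold for seen in last_active.values())


def target_capacity(current: int, baseline: int, idle: bool) -> int:
    """Drop to baseline only when the team is idle; never scale up from here."""
    return min(current, baseline) if idle else current


# Hypothetical activity data for a two-person team.
now = datetime(2024, 5, 28, 23, 0)
last_active = {
    "ana": datetime(2024, 5, 28, 17, 5),
    "raj": datetime(2024, 5, 28, 18, 40),
}

idle = team_idle(last_active, now, threshold=timedelta(hours=2))
print(idle, target_capacity(current=6, baseline=2, idle=idle))
```

If one engineer had been active at 10:30 PM, `team_idle` would return False and the group would stay at 6, which is the behavior a fixed 8 PM schedule cannot give you.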

None of this requires perfect engineering. It requires that the pause/resume signal be grounded in actual human activity rather than a wall-clock approximation of it.

The multi-account visibility problem

One thing that makes this worse at scale: non-production AWS resources often live in separate accounts from production, and visibility into those accounts is fragmented.

Your staging account, dev sandbox, and QA environment might all be separate AWS accounts — the right security posture — but the cost data from all three flows through AWS Cost Explorer at the payer account level, tagged inconsistently, and nobody has a clear picture of what's actually running and idle at any given moment.

Teams that are serious about this problem centralize the visibility first. Not just cost reports, but live resource state: what's running, who provisioned it, when it was last used, and what it's costing per hour. Without that view, you can't make good decisions about what to pause.
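As a sketch of what that live view might hold, here is one possible record shape and a query over it. The field names, account labels, resource IDs, and the placeholder hourly cost on the RDS row are hypothetical; in practice the records would be populated from each account's APIs rather than typed in by hand.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ResourceState:
    account: str        # which AWS account it lives in
    resource_id: str
    owner: str          # who provisioned it
    state: str          # e.g. "running" or "stopped"
    last_used: datetime
    hourly_cost: float  # USD


def idle_and_running(resources, now: datetime, idle_after: timedelta):
    """Running resources unused for at least `idle_after`, most expensive first."""
    idle = [
        r for r in resources
        if r.state == "running" and now - r.last_used >= idle_after
    ]
    return sorted(idle, key=lambda r: r.hourly_cost, reverse=True)


# Hypothetical inventory spanning three accounts.
now = datetime(2024, 5, 28, 23, 0)
inventory = [
    ResourceState("dev", "i-aaa", "ana", "running", datetime(2024, 5, 28, 17, 0), 0.096),
    ResourceState("staging", "db-bbb", "raj", "running", datetime(2024, 5, 28, 22, 45), 0.50),
    ResourceState("qa", "i-ccc", "lee", "stopped", datetime(2024, 5, 27, 9, 0), 0.192),
    ResourceState("qa", "i-ddd", "lee", "running", datetime(2024, 5, 28, 12, 0), 0.192),
]

for r in idle_and_running(inventory, now, idle_after=timedelta(hours=2)):
    print(r.account, r.resource_id, r.owner, r.hourly_cost)
```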

The tooling for this has improved significantly, but the gap between "I can see the cost" and "I can act on it automatically" is still wide for most teams. That gap is where a large portion of the waste lives.

A note on what not to do

Two approaches that seem reasonable and usually backfire:

Aggressive auto-termination. Setting TTLs on dev resources and terminating them after X hours of inactivity is blunt and creates anxiety. Engineers start hoarding long-lived resources and tagging things to prevent deletion, which is the opposite of what you want. Pause, not terminate, is the right default for dev resources; the state is preserved and startup is fast.

Tag-and-ask workflows. Sending Slack messages like "your EC2 instance has been idle for 2 hours, should we stop it?" pushes friction back onto the engineer at exactly the moment they've context-switched away. Response rates on these are low, and the instances that matter most (the ones where someone is mid-task) are the ones most likely to get stopped by an inattentive yes-click.

The goal is zero-friction automation that the engineer can trust, not a different version of manual.

If this matches a problem you're dealing with in your non-production environments, Trigops is built around exactly this model — activity-driven pause and resume for EC2, RDS, ECS, and ASG, across multiple accounts. You can connect a cloud account and see what's actually running at trigops.com.
