Originally published on graycloudarch.com.
The morning after go-live, the first thing I looked at was CPU. One of the two delivery services was sitting at 99.8% average utilization across 9 tasks. P50 latency: 1,010ms.
We'd launched deliberately without autoscaling. The plan was to observe real traffic patterns before configuring a scaling policy; you can't tune a policy for a workload you haven't seen yet. What we didn't know was that the workload would reveal something about the task itself before we'd had a chance to watch it for a week.
Thirty-six hours after go-live, we'd shipped right-sizing changes, a working autoscaling configuration, and a new observability source for ALB-layer signals. All of it came directly from what the first day of production data said. Here's how we read it.
What 99.8% CPU means at 0.5 vCPU
The service was allocated 512 ECS CPU units per task — half a vCPU. CloudWatch was telling us the tasks were spending essentially all of their scheduled CPU time working.
The first instinct in this situation is to add tasks. Scale out horizontally. But adding more 0.5 vCPU containers when each one is already saturated doesn't change the constraint. In ECS, the scheduler distributes tasks across hosts, but the per-task CPU ceiling is set in the task definition. More tasks at ceiling is not materially different from fewer tasks at ceiling — you're distributing the same undersized unit more widely.
The signal wasn't about count. It was about the unit itself.
At 99.8% utilization, any burst in per-request processing time — a downstream API call that's slow, a cache miss, a spike in concurrent requests — queues. The task has no headroom to absorb it. That's where the 1,010ms p50 comes from: not that individual requests are slow, but that tasks are scheduled tightly enough that requests wait before they even start processing.
Right-sizing the task before configuring the autoscaler
We doubled the CPU allocation: 512 → 1,024 units. The rationale is mechanical once you see it: you can't configure a useful CPU-based autoscaling policy on a task that's already running at ceiling. If 100% CPU is the baseline, the autoscaler has nothing to respond to — it would scale out immediately on creation and never scale in.
Target tracking at 70% CPU requires headroom. A 1 vCPU task running the same workload that previously pinned a 0.5 vCPU task will land around 50% utilization — below the target, room to absorb variance before triggering a scale-out, and enough signal for scale-in to be meaningful rather than noise.
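The arithmetic behind that claim is worth sanity-checking explicitly. A quick back-of-envelope sketch, using only the numbers above:

```python
# Back-of-envelope check on the resize, using the numbers from above.
old_cpu_units = 512        # 0.5 vCPU, observed pinned at ~99.8%
new_cpu_units = 1024       # 1 vCPU after the resize
observed_util = 0.998

# Same absolute CPU demand spread over twice the allocation.
expected_util = observed_util * old_cpu_units / new_cpu_units
print(f"expected post-resize utilization: {expected_util:.0%}")  # ~50%

# Distance to the 70% target-tracking threshold before a scale-out fires.
print(f"headroom to 70% target: {0.70 - expected_util:.0%}")     # ~20 points
```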
The second service had a different profile: 12 tasks, 1 vCPU each, hitting 92% at peak. Not saturated the same way, but thin on headroom. We went to 2 vCPU there.
Two other services in the platform had the opposite problem: more memory allocated than they'd ever used. Those went the other direction, with overprovisioned memory cut back based on observed peaks. The same 24-hour data window showed both problems at once.
Sequencing matters: right-size the task before you configure the autoscaler. Otherwise you're teaching a scaling policy to respond to a signal that's already maxed out, and the first thing it will do is scale out, adding more copies of a task that's still undersized.
Why we chose CPU tracking instead of request count
The obvious autoscaling metric for an HTTP service is ALBRequestCountPerTarget. The ALB knows the request rate per target group; scaling on that metric tracks load linearly and is highly predictable.
We couldn't use it.
The platform uses a cross-account Lambda to register ECS tasks with ALB target groups at boot. Because of how the registration bridge works, the ECS service resource is provisioned with target_group_arn = null — the target group lives in a different account, and the service module doesn't know its ARN. ALBRequestCountPerTarget requires the target group ARN to be known to the Application Auto Scaling policy. Without it, there's no way to wire the metric across accounts without building additional dependency plumbing.
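For contrast, here's roughly what the request-count policy would have required (a sketch with hypothetical service, load balancer, and target group names). The predefined metric needs a ResourceLabel built from the ALB and target group ARN suffixes, which is exactly the value the service module never has:

```python
import boto3

aas = boto3.client("application-autoscaling")

# Sketch only: hypothetical cluster, service, ALB, and target group names.
aas.put_scaling_policy(
    PolicyName="delivery-requests-per-target",
    ServiceNamespace="ecs",
    ResourceId="service/delivery-cluster/delivery-service",
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 1000.0,  # requests per target, illustrative number
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ALBRequestCountPerTarget",
            # Format: app/<lb-name>/<lb-id>/targetgroup/<tg-name>/<tg-id>
            # This is the piece that requires knowing the target group ARN.
            "ResourceLabel": "app/delivery-alb/0123456789abcdef"
                             "/targetgroup/delivery-tg/fedcba9876543210",
        },
    },
)
```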
CPU target tracking at 70% was the correct second choice. For a CPU-bound workload — which 99.8% utilization confirms this is — CPU is a meaningful proxy for load. The metric was there, it was clean, and the task was now sized to make it useful.
One thing worth noting: the cross-account registration bridge was the right architectural decision for the problem it solved. But it created a constraint three layers away in a scaling configuration we hadn't designed yet. Architecture decisions compound downstream. The fix here was straightforward; I've seen the same pattern take longer to untangle when the constraint wasn't recognized.
The observability gap app logs can't fill
Application logs were already flowing to BetterStack from both services. We had route-level latency, HTTP status codes, request counts, error breakdowns — everything that happens inside a container.
What the logs couldn't tell us was what happens above them. The ALB generates its own error signals: HTTPCode_ELB_5XX_Count for errors the load balancer generates before a request reaches a container, RejectedConnectionCount for connections refused at the ALB layer when backend capacity is exhausted, ActiveConnectionCount as a proxy for in-flight load per target group. None of this appears in application logs. If the ALB had been dropping connections during the 99.8% CPU period, we would have had no signal in our observability platform.
CloudWatch had the data. The gap was getting it into the same place as everything else.
A Lambda in the infrastructure account (where the ALB lives) runs on a 60-second schedule, calls GetMetricData, and ships structured JSON to BetterStack. One EventBridge rule, no ECS changes, effectively zero cost (one CloudWatch API call per minute, well within Lambda's free tier). The metrics land alongside the application data and show the ALB layer that the app logs are blind to.
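A minimal sketch of that Lambda, assuming hypothetical environment variables for the ALB dimension and the BetterStack ingest endpoint and token, and a simplified payload shape; the real handler differs in the details:

```python
import json
import os
import urllib.request
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical environment variables; the real values are account-specific.
ALB_DIMENSION = os.environ["ALB_DIMENSION"]   # e.g. "app/delivery-alb/0123456789abcdef"
INGEST_URL = os.environ["BETTERSTACK_INGEST_URL"]
INGEST_TOKEN = os.environ["BETTERSTACK_TOKEN"]

METRICS = ["HTTPCode_ELB_5XX_Count", "RejectedConnectionCount", "ActiveConnectionCount"]


def handler(event, context):
    end = datetime.now(timezone.utc)
    start = end - timedelta(minutes=2)

    # One GetMetricData call per invocation: the last couple of minutes of
    # ALB-layer signals, summed per minute.
    result = cloudwatch.get_metric_data(
        MetricDataQueries=[
            {
                "Id": f"m{i}",
                "MetricStat": {
                    "Metric": {
                        "Namespace": "AWS/ApplicationELB",
                        "MetricName": name,
                        "Dimensions": [{"Name": "LoadBalancer", "Value": ALB_DIMENSION}],
                    },
                    "Period": 60,
                    "Stat": "Sum",
                },
            }
            for i, name in enumerate(METRICS)
        ],
        StartTime=start,
        EndTime=end,
    )

    # Most recent datapoint per metric (results default to newest-first).
    id_to_name = {f"m{i}": name for i, name in enumerate(METRICS)}
    values = {
        id_to_name[r["Id"]]: (r["Values"][0] if r["Values"] else 0)
        for r in result["MetricDataResults"]
    }

    # Assumed payload shape: structured JSON with a timestamp field, posted
    # to the BetterStack ingest endpoint with a bearer token.
    payload = {"dt": end.isoformat(), "source": "alb-metrics", **values}
    req = urllib.request.Request(
        INGEST_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {INGEST_TOKEN}",
        },
    )
    urllib.request.urlopen(req, timeout=5)
```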
The design decision here was Lambda over an ECS sidecar. A sidecar would have run per-service, per-task, 24 hours a day, and required task definition changes across the platform. A single Lambda running once per minute in the account that owns the ALB costs nothing and touches no ECS configuration.
Autoscaling parameters worth explaining
For the higher-load service: min=9, max=20, CPU target=70%, scale-out cooldown=60s, scale-in cooldown=300s.
Setting min_capacity to 9 — the current running task count — was deliberate. We'd just established that 9 tasks was a functional floor for this workload at current traffic levels. An autoscaler configured with min=2 or min=4 would have attempted to scale in on the first quiet period, bringing the service back to a state we knew was already under-provisioned. Anchoring the floor to the observed stable-state count prevents that while we accumulate enough autoscaling history to set a meaningful long-term floor.
The asymmetric cooldowns — 60 seconds for scale-out, 5 minutes for scale-in — reflect the cost asymmetry of being wrong in each direction. Scaling out too slowly during a load spike means requests queue. Scaling in too aggressively during a brief quiet period means tasks are killed and restarted unnecessarily. The 5-minute scale-in cooldown is conservative; we'll revisit it once we have a week of data showing where the service naturally stabilizes.
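Expressed through the Application Auto Scaling API, the whole configuration is two calls. A sketch with hypothetical cluster and service names, using the parameters above:

```python
import boto3

aas = boto3.client("application-autoscaling")

# Hypothetical cluster/service names; the parameters are the ones discussed above.
resource_id = "service/delivery-cluster/delivery-service"

# Floor anchored to the observed stable-state task count, ceiling with room to grow.
aas.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=9,
    MaxCapacity=20,
)

# CPU target tracking at 70%, asymmetric cooldowns: fast out, slow in.
aas.put_scaling_policy(
    PolicyName="delivery-cpu-target-tracking",
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization",
        },
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 300,
    },
)
```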
What 24 hours of data drove
We launched expecting to spend the first week observing. What the data delivered instead was a complete picture of three distinct problems: a task sizing issue that was causing queuing, a scaling policy that needed the right foundation before it could be configured, and an observability gap for a class of signals that app logs fundamentally can't surface.
All three were solved from the same 24-hour data window. The pre-launch load testing hadn't revealed any of them — synthetic traffic and production ad-bidding traffic have different CPU profiles, and you don't know which until the real thing runs.
The thing I'd change if running this again: put a structured post-launch data review into the go-live plan, not the next morning's to-do list. Not a formal incident review — a deliberate hour with CloudWatch after the first day's traffic has run through. The data is there. The question is whether you've planned to look at it.
If you're planning a production go-live and want a structured approach to post-launch data review and stabilization — or you're staring at a service running at ceiling with no autoscaling — get in touch. This is the kind of platform work I do regularly, and the pattern here applies well beyond ad delivery.