Most Traffic Spikes Are Predictable. So Why Are We Still Panic-Scaling?

#automation #aws #devops #infrastructure

The usual playbook when a big event is coming: someone sends a Slack message three hours before launch asking, "Did we scale up?" A senior engineer logs into the AWS console, eyeballs the current desired count, multiplies by something, and manually bumps the number. Then forgets to roll it back.

That's the part nobody talks about. The spike passes, the instance count stays at 3x, and you're burning money for two days because everyone assumed someone else would fix it.

We ran into this enough times that we built a proper way to handle it.

What it does

Event Readiness turns a planned traffic spike into a structured scaling plan. You define the event window, set an expected load multiplier, and attach the autoscaling groups you want to scale.

ZopNight handles the rest: it pre-scales before the event starts, holds capacity during it, and rolls everything back when it ends. No manual intervention. No forgetting to scale down.

How it works

You create a plan against your existing autoscaler policies:

Pick the event window (start, end, timezone)
Set a load multiplier (e.g., 3x for a campaign expecting 3x traffic)
Attach target policies: AWS ASG, GCP MIG, or Azure VMSS
Set a pre-scale buffer (we default to 30 minutes before event start)
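
To make those four inputs concrete, here is a minimal Python sketch of what a plan could look like as data. The `EventPlan` class, its field names, and the target identifiers are hypothetical and only illustrate the shape; they are not ZopNight's actual schema. The one piece of logic shown is deriving the pre-scale time from the event start and the 30-minute default buffer.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class EventPlan:
    """Hypothetical plan shape; names are illustrative, not ZopNight's schema."""
    name: str
    start: datetime      # timezone-aware event start
    end: datetime        # timezone-aware event end
    multiplier: float    # e.g. 3.0 for 3x expected traffic
    targets: list[str] = field(default_factory=list)  # ASG / MIG / VMSS identifiers
    pre_scale_buffer: timedelta = timedelta(minutes=30)

    def __post_init__(self) -> None:
        # Naive datetimes are rejected so the window is never ambiguous.
        if self.start.tzinfo is None or self.end.tzinfo is None:
            raise ValueError("start and end must be timezone-aware")
        if self.end <= self.start:
            raise ValueError("event must end after it starts")
        if self.multiplier <= 0:
            raise ValueError("multiplier must be positive")

    def scale_up_at(self) -> datetime:
        """When pre-scaling should begin, normalized to UTC."""
        return (self.start - self.pre_scale_buffer).astimezone(timezone.utc)

    def scale_down_at(self) -> datetime:
        """When rollback should begin, normalized to UTC."""
        return self.end.astimezone(timezone.utc)


# Example: an 8-hour event starting at 20:00 in a UTC+05:30 timezone.
ist = timezone(timedelta(hours=5, minutes=30))
plan = EventPlan(
    name="spring-sale",
    start=datetime(2025, 3, 1, 20, 0, tzinfo=ist),
    end=datetime(2025, 3, 2, 4, 0, tzinfo=ist),
    multiplier=3.0,
    targets=["asg/web", "asg/api"],
)
print(plan.scale_up_at())  # 2025-03-01 14:00:00+00:00
```

Normalizing both timestamps to UTC keeps the executor's schedule unambiguous even when the event window is entered in a local timezone.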

ZopNight calculates the scaled min/max/desired for each target before you commit. If CPU metrics aren't available, it tells you it's estimating and why.
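
As a rough illustration of that calculation, the sketch below applies a multiplier to a group's min, max, and desired counts. The rounding rule (multiply, then round up) and the `GroupSize` and `scale_group` names are assumptions made for this example; the post does not spell out the exact formula, and any metric-based adjustment is left out.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupSize:
    min_size: int
    max_size: int
    desired: int


def _scaled(count: int, multiplier: float) -> int:
    # Round away float noise first (10 * 1.1 == 11.000000000000002),
    # then round up so capacity is never under-provisioned.
    return math.ceil(round(count * multiplier, 9))


def scale_group(current: GroupSize, multiplier: float) -> GroupSize:
    """Assumed rule: scale every bound, keeping min <= desired <= max."""
    if multiplier < 1:
        raise ValueError("a pre-scale multiplier below 1 would shrink the group")
    new_desired = _scaled(current.desired, multiplier)
    new_min = min(_scaled(current.min_size, multiplier), new_desired)
    new_max = max(_scaled(current.max_size, multiplier), new_desired)
    return GroupSize(new_min, new_max, new_desired)


print(scale_group(GroupSize(min_size=2, max_size=10, desired=4), 3.0))
# GroupSize(min_size=6, max_size=30, desired=12)
```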

Before saving, a preview step shows you exactly what will happen to every target: its current size, its scaled size, and the timestamps at which the executor will fire. No surprises.

Once scheduled, the plan moves through a clear state machine:

draft → scheduled → scaling_up → active → scaling_down → completed

If something fails, it lands in failed, and you can retry from draft.
Cancelling from any active state rolls back whatever was already scaled.
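
One way to picture these rules is as a transition table. The Python sketch below is an assumption-laden illustration, not ZopNight's implementation: it assumes only the three in-flight states can fail, and that "active state" for cancellation means `scaling_up`, `active`, or `scaling_down`.

```python
from enum import Enum


class PlanState(str, Enum):
    DRAFT = "draft"
    SCHEDULED = "scheduled"
    SCALING_UP = "scaling_up"
    ACTIVE = "active"
    SCALING_DOWN = "scaling_down"
    COMPLETED = "completed"
    FAILED = "failed"


S = PlanState

# Happy path plus the failure and retry edges described in the text.
ALLOWED = {
    S.DRAFT: {S.SCHEDULED},
    S.SCHEDULED: {S.SCALING_UP},
    S.SCALING_UP: {S.ACTIVE, S.FAILED},
    S.ACTIVE: {S.SCALING_DOWN, S.FAILED},
    S.SCALING_DOWN: {S.COMPLETED, S.FAILED},
    S.COMPLETED: set(),
    S.FAILED: {S.DRAFT},  # retry starts again from draft
}

# Assumption: these are the states in which capacity may already be changed.
ROLLBACK_ON_CANCEL = {S.SCALING_UP, S.ACTIVE, S.SCALING_DOWN}


def advance(current: PlanState, target: PlanState) -> PlanState:
    """Return the new state, or raise if the transition is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target


def cancel_requires_rollback(state: PlanState) -> bool:
    return state in ROLLBACK_ON_CANCEL


state = advance(S.DRAFT, S.SCHEDULED)
state = advance(state, S.SCALING_UP)
print(state.value, cancel_requires_rollback(state))  # scaling_up True
```

Keeping the allowed transitions in a single table makes illegal jumps (say, `draft` straight to `active`) fail loudly instead of silently corrupting a plan.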

The cost estimate

One thing we added that most teams don't have: upfront cost visibility. Before you schedule the plan, you can see the estimated extra cost per target, per hour, in dollars. Not after the event. Before it.

For a plan running 8 hours at 3x capacity across two ASGs, that number is usually a lot smaller than the cost of the event going down.
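
The arithmetic behind such an estimate can be sketched in a few lines of Python. Everything numeric here is made up for illustration: the instance counts and the per-instance hourly prices are hypothetical, and the formula (extra instances times hourly price times duration) is an assumption about how an estimate like this could be computed, not a description of ZopNight's pricing model.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class TargetCost:
    name: str
    current_desired: int
    scaled_desired: int
    hourly_price: Decimal  # hypothetical USD price per instance-hour

    def extra_per_hour(self) -> Decimal:
        # Only the instances added by the plan count as extra cost.
        return (self.scaled_desired - self.current_desired) * self.hourly_price


def extra_cost(targets: list[TargetCost], hours: Decimal) -> Decimal:
    per_hour = sum((t.extra_per_hour() for t in targets), Decimal("0"))
    return per_hour * hours


# Two ASGs at 3x for 8 hours, with made-up sizes and prices.
targets = [
    TargetCost("asg/web", current_desired=4, scaled_desired=12, hourly_price=Decimal("0.10")),
    TargetCost("asg/api", current_desired=2, scaled_desired=6, hourly_price=Decimal("0.20")),
]
for t in targets:
    print(t.name, t.extra_per_hour())  # 0.80 per hour for each target
total = extra_cost(targets, Decimal("8"))
print(total)  # 12.80
```

`Decimal` is used instead of `float` so that dollar amounts do not pick up binary rounding noise.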

How does your team handle planned traffic spikes right now? Manual scaling, scripts, or something else?