From auto-recommendation to one-click cloud remediation, the workflow most tools skip

#devops #aws #cloud #kubernetes

Every cloud cost tool I have ever opened shows the same big number near the top of the dashboard. You could save 487,000 dollars a year. Sometimes it is bigger. The number is real in the sense that the math checks out. The number is also a lie in the sense that almost none of it ever happens.

The recommended savings number on your dashboard is not the realised savings number on your bill. The gap between those two is where most FinOps tools quietly fail their users, and it is not a small gap. At most teams I have talked to, somewhere between 80 and 95 per cent of the recommended savings never shows up on the bill.

The ticket nobody picks up

Walk through what the dashboard asks you to do when it says "stop this idle EC2 instance and save 312 dollars a month". The path looks like this.

An engineer reads the recommendation.
Files a ticket because they do not own the resource.
Waits for the team that does.
That team schedules the work into a sprint.
Someone eventually logs into the AWS console.
Finds the resource. Confirms it is actually the right one.
Runs the stop action.
Verifies nothing downstream broke.
Updates the ticket.

Nine steps, three humans, two weeks of calendar time for a single 312 dollar recommendation. Multiply that by a few hundred recommendations a month, and the math becomes obvious. Nobody works through that queue. The recommendations pile up. The dashboard keeps showing the same big number. The bill keeps not going down.

The recommendations were never the problem. The execution layer was.

Recommendation is not remediation

Every cloud cost tool on the market does auto-recommendation well. It scans your account, finds the idle instances, the over-provisioned databases, and the orphaned storage, and surfaces them on a dashboard. Some tools are very good at this part. The recommendations are usually right.

What almost no tool does well is auto-remediation. Recommendation tells you what to do. Remediation actually does it. The first is a report. The second is a button that, when clicked, performs the cloud action, verifies it landed, and writes an audit log.

Most teams have spent the last five years drowning in recommendations they never executed on. The dashboards got more sophisticated. The list of suggested actions got longer. The realised savings number on the bill barely moved.

The reason most cost tools stop at recommendation rather than remediation is that recommendation is safe to ship, and remediation is not. A number on a dashboard cannot break production. A cloud API call that stops a resource absolutely can. So the industry settled on a comfortable middle ground. Tell the user what to do. Let them deal with the consequences.

The obvious fix, and why it is harder than it looks

The obvious fix is a button on the recommendation that just does the thing. Click, instance stopped, ticket avoided. People have tried this. The reason it has not been a default feature in cost tools is that cloud actions are not safe to fire blindly, and the failure modes are bad enough that one wrong action poisons trust in the whole tool for a year.

The interesting engineering problem is not calling the stop API. That part takes an afternoon. The interesting part is everything around it.

I have been watching a team build this, and the workflow they ended up with is a useful artefact even if you never use their tool. The shape of it is what matters.

What sits behind one-click remediation

Five steps, and skipping any of them is how you get a 3 am incident.

1. Precondition check. Before doing anything, ask the cloud what state the resource is actually in right now. If somebody on the team manually stopped it an hour ago, the workflow stops here and reports already done. This single check is the difference between automation people trust and automation people turn off.

2. Optional approval. Some actions need a human gate. Production-tier rules, destructive operations, anything where the cost of a wrong call is worse than the cost of a slow call. The approval queues with full context: resource, savings, who initiated it, and what rule fired. An admin clicks approve or reject. Cheap actions skip this entirely.

3. Execute. Call the cloud API. Stop the EC2 instance, pause the Synapse pool, scale the Lambda to zero. This is the boring part.

4. Validate. This is the part most tools get wrong. A 200 response from the cloud API does not mean the resource is actually stopped. The validate step polls the cloud state until it confirms the action genuinely landed. If the API said yes but the resource is still running, the workflow flags it as a system error instead of silently lying.

5. Audit. Every step, input, and result is written to a dedicated audit table. Six months from now, when someone asks who stopped the prod-adjacent Synapse pool on March 12, the answer is one query away.
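
The five steps above can be sketched in a few dozen lines. This is a minimal illustration, not the team's implementation: `FakeCloud`, `Remediation`, `remediate`, the state names, and the resource IDs are all hypothetical, the in-memory `audit` list stands in for a dedicated audit table, and the fake client only mimics the common behaviour where a stop call returns success while the resource is still shutting down. Against real AWS, the two client methods would wrap SDK calls such as boto3's `describe_instances` and `stop_instances`.

```python
import time
from dataclasses import dataclass, field


class FakeCloud:
    """In-memory stand-in for a cloud API (hypothetical, for illustration only)."""

    def __init__(self, states, polls_until_stopped=2):
        self.states = dict(states)  # resource_id -> "running" | "stopping" | "stopped"
        self._polls_left = {}
        self._polls_until_stopped = polls_until_stopped

    def get_state(self, resource_id):
        # A "stopping" resource only becomes "stopped" after a few polls.
        if self.states[resource_id] == "stopping":
            self._polls_left[resource_id] = self._polls_left.get(resource_id, 0) - 1
            if self._polls_left[resource_id] <= 0:
                self.states[resource_id] = "stopped"
        return self.states[resource_id]

    def stop(self, resource_id):
        # Returns success immediately, before the stop has actually completed.
        self.states[resource_id] = "stopping"
        self._polls_left[resource_id] = self._polls_until_stopped
        return 200


@dataclass
class Remediation:
    resource_id: str
    needs_approval: bool = False
    audit: list = field(default_factory=list)  # stand-in for the audit table


def remediate(cloud, job, approve=None, poll_interval=0.0, max_polls=5):
    def log(step, result):
        # 5. Audit: every step and its result is recorded.
        job.audit.append((step, result))
        return result

    # 1. Precondition: ask the cloud what state the resource is in right now.
    if cloud.get_state(job.resource_id) == "stopped":
        return log("precondition", "already_done")
    log("precondition", "ok")

    # 2. Optional approval: a missing approver or a rejection both stop here.
    if job.needs_approval:
        if approve is None or not approve(job):
            return log("approval", "rejected")
        log("approval", "approved")

    # 3. Execute: the boring part.
    log("execute", cloud.stop(job.resource_id))

    # 4. Validate: poll the real state instead of trusting the API response.
    for _ in range(max_polls):
        if cloud.get_state(job.resource_id) == "stopped":
            return log("validate", "confirmed")
        time.sleep(poll_interval)
    return log("validate", "system_error")
```

Calling `remediate(FakeCloud({"i-1": "running"}), Remediation("i-1"))` walks through precondition, execute, and validate. A resource that never leaves the stopping state ends in `system_error` rather than a silent success, which is the behaviour step 4 describes.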

The other thing worth stealing from this design is how errors get categorised. Three buckets.

User action. Permission denied, quota hit. Shows the fix with a console link.
Transient. 429s, 5xx. Gets a retry button.
System. The cloud actually broke, or the API is unsupported. Gets a diagnostic and a support contact.

The category drives the UI. Retry only shows up where retry actually makes sense. This sounds small. It is the difference between an automation surface that engineers learn to trust and one they learn to ignore.
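
One way to express the three buckets in code is a small classifier whose output selects the UI treatment. This is a sketch under assumptions: the error-code strings, the `categorise` function, and the UI action names are hypothetical labels, not any provider's real error codes, and a real mapping would be built per cloud API.

```python
from enum import Enum


class ErrorCategory(Enum):
    USER_ACTION = "user_action"
    TRANSIENT = "transient"
    SYSTEM = "system"


# Hypothetical error labels that mean "the user has to fix something first".
USER_ACTION_CODES = frozenset({"PermissionDenied", "QuotaExceeded"})

# The category drives the UI: only TRANSIENT ever gets a retry button.
UI_FOR_CATEGORY = {
    ErrorCategory.USER_ACTION: "show_fix_with_console_link",
    ErrorCategory.TRANSIENT: "show_retry_button",
    ErrorCategory.SYSTEM: "show_diagnostic_and_support_contact",
}


def categorise(http_status: int, error_code: str = "") -> ErrorCategory:
    # Permission and quota problems need a human to change something.
    if error_code in USER_ACTION_CODES or http_status == 403:
        return ErrorCategory.USER_ACTION
    # Throttling (429) and 5xx responses are worth retrying.
    if http_status == 429 or 500 <= http_status <= 599:
        return ErrorCategory.TRANSIENT
    # Anything unrecognised falls into the system bucket, which has no retry.
    return ErrorCategory.SYSTEM
```

Defaulting unknown errors to the system bucket is a deliberate choice in this sketch: an error nobody has classified yet should not get a retry button by accident.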

The certification gate

One choice in this design that surprised me. Not every recommendation rule gets a Remediate button. The team ships the button only on rules that have been certified end-to-end on real cloud accounts. The certified set started at 20 rules covering stop, scale-to-zero, and pause actions across AWS, GCP, and Azure. The other rules render the recommendation card without the button.

The temptation when shipping automation is to ship it everywhere on day one. The discipline is to ship it only where you have proven the workflow handles the edge cases. A fake remediation that returns success but did not actually do anything is worse than no remediation at all, because it convinces the team that the savings are realised when they are not.
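
A certification gate like this can be as simple as a set lookup at render time. The sketch below assumes hypothetical rule IDs and a hypothetical `render_card` function. The point is only that the button is derived from the certified set, not from the rule definition itself.

```python
# Hypothetical IDs for rules that have been certified end-to-end.
CERTIFIED_RULES = frozenset({"aws-ec2-idle-stop", "azure-synapse-idle-pause"})


def render_card(rule_id: str, resource_id: str, monthly_savings: float) -> dict:
    # Every rule renders a recommendation card; only certified rules get a button.
    return {
        "rule_id": rule_id,
        "resource_id": resource_id,
        "monthly_savings": monthly_savings,
        "show_remediate_button": rule_id in CERTIFIED_RULES,
    }
```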

The databases rule

One more thing worth calling out. Customer data resources (RDS, Aurora, Cloud SQL, ElastiCache, the entire Postgres and MySQL family on every cloud) are excluded from any automated action. Not as a toggle. As an allowlist enforced at the code level that simply does not contain them, so they cannot be passed to the executor, regardless of what the rule says. The kind of safety rail you only see in tools built by people who have personally been responsible for a database outage and refuse to be again.
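
A code-level allowlist of this kind might look like the sketch below. The resource type names, `NotAutomatable`, and `guarded_execute` are hypothetical. What matters is that the check lives in the executor itself, so no rule configuration can route a database to it.

```python
# Hypothetical allowlist: only these resource types may ever reach the executor.
# Database types are absent by construction, not switched off by a setting.
AUTOMATABLE_TYPES = frozenset({"ec2_instance", "synapse_sql_pool"})


class NotAutomatable(Exception):
    """Raised when a resource type is not on the allowlist."""


def guarded_execute(resource_type: str, resource_id: str, action):
    # The guard runs before the action, so a refused resource is never touched.
    if resource_type not in AUTOMATABLE_TYPES:
        raise NotAutomatable(f"{resource_type} is excluded from automated actions")
    return action(resource_id)
```

Because the list names what is allowed, not what is forbidden, a new database service added by a cloud provider is excluded by default until someone deliberately adds it.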

What this changes

The recommended savings number on your dashboard becomes a number you can actually realise. The two-week ticket-and-console hop becomes one click. The audit log behind every action means you can show the CFO not just what you saved, but who saved it, when, and how.

The interesting thing is that none of this is technically novel. Precondition, approval, execute, validate, audit. Five steps you would design on a whiteboard in an hour. The reason it matters is that almost no cost tool actually does it, and the ones that try usually skip the validation step and pretend the API response is the truth.

If you are evaluating cloud cost tools, the question to ask is not what the recommended savings number is. The question is, what happens after I click the button, and how do you know the resource is actually stopped?

Realised savings are the only number that ends up on your bill.