Recently, I opened our AWS bill and found thousands of dollars in AWS Lambda charges. I was sure we did not have that much traffic, so I started digging through the bill, and that is when I found the real reason: Lambda SnapStart.
Lambda SnapStart is a great feature. For latency-sensitive applications, especially APIs, it can cut cold start duration from seconds to under a second. When you publish a function version with SnapStart enabled, Lambda initializes the function, takes a snapshot of the initialized execution environment, and caches it. On a cold start, Lambda restores the execution environment from that snapshot and resumes from that point to serve traffic.

But there was one key detail we had missed. Every time we publish a Lambda version with SnapStart enabled, a new snapshot is created and cached, and for our Python functions we pay for both caching and restoring it. AWS documents that caching is billed for as long as the version remains active, with a minimum of 3 hours. The pricing page example for US East (N. Virginia) shows cache at $0.0000015046 per GB-second and restore at $0.0001397998 per GB restored.
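To put those prices in perspective, here is a small sketch of the arithmetic. It assumes the billed snapshot size equals the function's configured memory in GB and uses the US East (N. Virginia) example prices quoted above; the function names and the 200-version scenario are illustrative, not taken from our bill.

```python
# US East (N. Virginia) example prices from the AWS Lambda pricing page.
# Assumption: billed GB = the function's configured memory. Check current
# pricing for your Region before relying on these numbers.
CACHE_PRICE_PER_GB_SECOND = 0.0000015046
RESTORE_PRICE_PER_GB = 0.0001397998
MIN_CACHE_SECONDS = 3 * 60 * 60  # caching is billed for at least 3 hours


def snapshot_cache_cost(memory_gb: float, active_seconds: float) -> float:
    """Cache charge for one SnapStart version kept active for `active_seconds`."""
    billable_seconds = max(active_seconds, MIN_CACHE_SECONDS)
    return memory_gb * billable_seconds * CACHE_PRICE_PER_GB_SECOND


def restore_cost(memory_gb: float, restores: int) -> float:
    """Restore charge: each restore is billed per GB restored."""
    return memory_gb * restores * RESTORE_PRICE_PER_GB


if __name__ == "__main__":
    thirty_days = 30 * 24 * 60 * 60
    per_version = snapshot_cache_cost(1.0, thirty_days)
    print(f"One idle 1 GB version for 30 days: ${per_version:.2f}")
    # Hypothetical scenario: 200 forgotten 1 GB versions.
    print(f"200 idle 1 GB versions for 30 days: ${200 * per_version:.2f}")
```

At these prices, a single 1 GB version that sits idle for 30 days costs roughly $3.90 in cache charges alone, even with zero traffic, which is why a few hundred forgotten versions add up quickly.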
In our case, SnapStart itself was not the problem. Unused versions were.
We had enabled SnapStart for many Python APIs. Over time, repeated deployments created more and more published versions with SnapStart enabled. The older versions were not referenced anywhere and were not receiving traffic, but they were still sitting there and continuing to accumulate snapshot cache charges. AWS documents a 14-day inactive cleanup behavior for Java SnapStart versions, but that is not something we could rely on for our Python functions. That was the real bill shock.
So I built a cleanup workflow for it. Not a blind cleanup script. A controlled workflow with human approval. And this is where AWS Lambda Durable Functions turned out to be a perfect fit.
Why I used a Durable Function
At first, this looked like a simple cleanup script. List the old versions, check if they are unused, and delete them.
But I did not want a script that blindly deletes Lambda versions. Even if a version looks old, it could still be referenced by an alias or still be important for rollback. So I wanted this to be a workflow, not just a script.
That is why I used AWS Lambda Durable Functions.
With a durable orchestrator, I could break the process into clear steps: scan the SnapStart-enabled versions, filter the old and unused ones, generate a report, wait for approval, and only then delete them. The best part is that the approval step is part of the workflow itself. The function can pause, wait for a human decision, and then continue from the same point.
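At a high level, the workflow looks like this: scan SnapStart versions → filter old, unused candidates → build a report → wait for approval → delete on approval, or exit on rejection.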
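The step order can be illustrated with a plain Python generator. This is only an in-process analogy, not the Durable Functions SDK: a generator pauses in memory, whereas a durable function checkpoints its progress so it can wait for a callback without a process running. The names `cleanup_workflow` and `run_with_decision` are hypothetical.

```python
from typing import Callable, Generator, List


def cleanup_workflow(
    scan: Callable[[], List[str]],
    is_candidate: Callable[[str], bool],
    delete: Callable[[str], None],
) -> Generator[dict, bool, List[str]]:
    """Scan -> filter -> report -> wait for approval -> delete.

    Yields the report once and expects the approval decision (True/False)
    to be sent back in; returns the list of deleted versions.
    """
    versions = scan()                                      # 1. scan
    candidates = [v for v in versions if is_candidate(v)]  # 2. filter
    if not candidates:
        return []                                          # nothing to approve
    report = {"candidates": candidates}                    # 3. report
    approved = yield report                                # 4. pause here
    if not approved:
        return []                                          # rejected: delete nothing
    for version in candidates:                             # 5. delete
        delete(version)
    return candidates


def run_with_decision(workflow: Generator[dict, bool, List[str]], decision: bool) -> List[str]:
    """Drive the workflow, answering its single approval request with `decision`."""
    try:
        report = next(workflow)  # runs until the approval pause
    except StopIteration as finished:
        return finished.value    # finished without needing approval
    print("Approval requested for:", report["candidates"])
    try:
        workflow.send(decision)  # resume from the same point
    except StopIteration as finished:
        return finished.value
    raise RuntimeError("workflow asked for approval more than once")
```

The key property is the same one the durable orchestrator provides: deletion code is only reachable after the pause returns a positive decision.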
How the workflow works
The workflow starts by listing all published Lambda versions with SnapStart enabled. Then it filters out versions that are newer than 14 days and also keeps the latest few versions as a safety measure. After that, it checks whether those versions are still being used by looking at invocation metrics and alias references. A version is considered a candidate only if it is old enough, not among the latest kept versions, not protected by an alias, and not showing recent usage.
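The scanning and alias checks map onto standard Lambda API calls. The sketch below assumes a boto3 Lambda client (for example `boto3.client("lambda")`) is passed in, which also makes it easy to test with a fake. It treats a version as SnapStart-enabled when its `SnapStart.OptimizationStatus` is `On`, and counts weighted routing targets as alias references. The invocation-metrics check is deliberately left out here, since the exact CloudWatch query is an implementation choice not described in the post.

```python
from typing import List, Set


def snapstart_versions(lambda_client, function_name: str) -> List[dict]:
    """Published versions of `function_name` whose SnapStart optimization is On."""
    found = []
    paginator = lambda_client.get_paginator("list_versions_by_function")
    for page in paginator.paginate(FunctionName=function_name):
        for config in page["Versions"]:
            if config["Version"] == "$LATEST":
                continue  # only published versions carry a cached snapshot
            snapstart = config.get("SnapStart", {})
            if snapstart.get("OptimizationStatus") == "On":
                found.append(config)
    return found


def alias_protected_versions(lambda_client, function_name: str) -> Set[str]:
    """Every version an alias points to, including weighted routing targets."""
    protected: Set[str] = set()
    paginator = lambda_client.get_paginator("list_aliases")
    for page in paginator.paginate(FunctionName=function_name):
        for alias in page["Aliases"]:
            protected.add(alias["FunctionVersion"])
            routing = alias.get("RoutingConfig", {})
            protected.update(routing.get("AdditionalVersionWeights", {}).keys())
    return protected
```

Checking `AdditionalVersionWeights` matters during canary or weighted deployments: a version can be receiving a slice of alias traffic without being the alias's primary `FunctionVersion`.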
A version that passes all of these checks goes on the candidate list; a version that fails even one of them is left alone.
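The candidate rule itself is pure logic, so it can be sketched without any AWS calls. In this sketch, `VersionInfo`, `keep_latest=3`, and the use of `LastModified` as the version's age are assumptions for illustration; the post only says "the latest few versions" are kept. `recent_invocations` stands in for whatever invocation count the metrics check produces.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional, Set


@dataclass
class VersionInfo:
    version: str             # published version number, e.g. "12"
    last_modified: datetime  # timezone-aware timestamp of the version
    recent_invocations: int  # invocations seen in the lookback window


def parse_last_modified(value: str) -> datetime:
    """Parse Lambda's LastModified format, e.g. '2024-05-01T10:00:00.000+0000'."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%f%z")


def cleanup_candidates(
    versions: List[VersionInfo],
    alias_protected: Set[str],
    now: Optional[datetime] = None,
    min_age_days: int = 14,
    keep_latest: int = 3,
) -> List[str]:
    """Return versions that pass every safety check, oldest first."""
    now = now or datetime.now(timezone.utc)
    # Sort numerically so that version "10" counts as newer than "9".
    newest_first = sorted(versions, key=lambda v: int(v.version), reverse=True)
    kept = {v.version for v in newest_first[:keep_latest]}
    cutoff = now - timedelta(days=min_age_days)

    candidates = []
    for v in newest_first:
        if v.version in kept:             # among the latest N: keep for rollback
            continue
        if v.last_modified > cutoff:      # younger than min_age_days
            continue
        if v.version in alias_protected:  # still referenced by an alias
            continue
        if v.recent_invocations > 0:      # still receiving traffic
            continue
        candidates.append(v.version)
    return sorted(candidates, key=int)
```

Sorting numerically rather than as strings avoids a subtle bug where `"9"` would sort above `"10"` and the wrong versions would be kept.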
Once the list is ready, the workflow builds a report and sends an approval email. At this point, the durable function pauses and waits. If approved, it resumes and deletes the versions. If rejected, it simply exits without deleting anything. That is what made Durable Functions such a good fit for this use case.
It gave me a clean human-in-the-loop workflow instead of a risky cleanup script.
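The post does not describe the report format, so the following is only a hypothetical shape: a small JSON document listing the candidate versions with a rough monthly cache estimate, which gives the approver something concrete to weigh. It assumes the US East (N. Virginia) example cache price and treats 1024 MB as 1 GB.

```python
import json
from typing import List

CACHE_PRICE_PER_GB_SECOND = 0.0000015046  # US East (N. Virginia) example price
SECONDS_PER_30_DAYS = 30 * 24 * 60 * 60


def build_report(function_name: str, memory_mb: int, candidates: List[str]) -> str:
    """Hypothetical approval report: candidates plus an estimated 30-day cache cost."""
    memory_gb = memory_mb / 1024
    monthly = len(candidates) * memory_gb * SECONDS_PER_30_DAYS * CACHE_PRICE_PER_GB_SECOND
    report = {
        "function": function_name,
        "memory_mb": memory_mb,
        "candidate_versions": candidates,
        "estimated_monthly_cache_cost_usd": round(monthly, 2),
    }
    return json.dumps(report, indent=2)
```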
Approval flow
The approval flow is simple.
Once the candidate list is ready, the workflow uploads the report and sends an email with Approve and Reject links. Then the durable function pauses and waits for a callback.
One detail I liked here is that the link itself does not directly take action.
When the approver opens the link, they first see a confirmation page. The actual approve or reject action happens only after the approver clicks Confirm. This matters because email scanners and safe-link checkers often open links automatically, and I did not want them to accidentally approve a cleanup request.
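The core idea is that a GET request, which is what link scanners send, never changes anything, while the decision is recorded only on a POST from the confirmation form. Here is a framework-free sketch of that handler logic; `handle_approval_link` and `send_callback` are hypothetical names, and `send_callback` stands in for whatever actually delivers the decision back to the waiting durable workflow.

```python
import html
from typing import Callable, Tuple


def handle_approval_link(
    method: str,
    token: str,
    decision: str,
    send_callback: Callable[[str, bool], None],
) -> Tuple[int, str]:
    """Return (status, html). GET only renders a form; POST performs the action."""
    if decision not in ("approve", "reject"):
        return 400, "<p>Unknown decision.</p>"
    if method == "GET":
        # Safe for link scanners: rendering this page changes nothing.
        return 200, (
            f"<p>Confirm: {decision} this cleanup request?</p>"
            '<form method="post">'
            f'<input type="hidden" name="token" value="{html.escape(token)}">'
            f'<input type="hidden" name="decision" value="{decision}">'
            '<button type="submit">Confirm</button></form>'
        )
    if method == "POST":
        # Only an explicit form submission reaches the workflow.
        send_callback(token, decision == "approve")
        return 200, f"<p>Recorded: {decision}.</p>"
    return 405, "<p>Method not allowed.</p>"
```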
Once the approver confirms, the callback is sent back to the durable workflow. If approved, the orchestrator resumes and deletes the selected versions. If rejected, it exits safely without deleting anything.
That made the whole process much safer and easier to trust.
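The deletion step itself uses the Lambda `DeleteFunction` API with a `Qualifier`, which removes a single published version. The sketch below assumes a boto3 Lambda client is passed in and adds one extra precaution that the post does not mention: it re-reads alias targets right before deleting, because an alias could have moved to one of these versions while the workflow was waiting for approval. `fetch_alias_protected` is a hypothetical callable.

```python
from typing import Callable, List, Set


def delete_approved_versions(
    lambda_client,
    function_name: str,
    approved_versions: List[str],
    fetch_alias_protected: Callable[[], Set[str]],
) -> List[str]:
    """Delete approved versions, skipping any that became alias targets."""
    # Re-read alias targets now, not when the report was built.
    protected = fetch_alias_protected()
    deleted = []
    for version in approved_versions:
        if version == "$LATEST" or version in protected:
            continue  # never delete the unpublished version or an alias target
        lambda_client.delete_function(FunctionName=function_name, Qualifier=version)
        deleted.append(version)
    return deleted
```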
Final thoughts
SnapStart is a really useful feature, especially for Python APIs where cold starts matter.
But like many good AWS features, it needs the right cleanup around it. In our case, the problem was not SnapStart itself. It was the old published versions that kept piling up and quietly adding to the bill.
That is why I built this workflow.
Not just to delete unused versions, but to do it in a safe and controlled way. Durable Functions gave me exactly that: a proper workflow with scanning, reporting, human approval, and deletion only after confirmation.
For me, this was the real value of the solution.
It solved the cost problem, and at the same time showed how useful Durable Functions can be when we need automation with a human in the loop.
If you are using SnapStart for Python, it is worth checking how many old versions are still sitting there. You may find that the real cost is not in your traffic, but in the versions you forgot to clean up.
If you are interested in trying this approach or exploring the implementation in more detail, you can find the complete source code in my GitHub repository.
GitHub Repo: https://github.com/jayaganeshk/aws-lambda-snapstart-version-cleaner


