I used to think “done” meant: PR merged, CI green, deployed to staging. That’s it. Ship the feature, move on, next ticket.
Then I shipped a feature that was done… and immediately turned into a slow-burn production mess.
Not because the code was “bad” (okay, a bit maybe), but because the definition was bad.
No dashboard. No kill switch. No clear rollback. No runbook. No clue what was happening when users reported “it’s broken” except a vague spike in errors and a Slack thread that started with:
“Anyone knows what changed?”
That’s when I realized something brutal:
Most Definitions of Done are just “we stopped working on it.”
Not “it’s safe.” Not “it’s operable.” Not “it’s maintainable.”
So here’s what I changed. These are the 5 things that finally made “done” mean what it sounds like:
shipped, safe, and survivable.
The problem was never speed
It was fragility disguised as velocity
A weak DoD feels productive because it optimizes for one visible milestone: merge.
But the cost doesn’t disappear. It’s just deferred into:
- on-call chaos
- hotfix releases
- “why is this slow?” investigations
- support tickets you can’t reproduce
- teammates afraid to touch the code
It’s like building a bridge and calling it “done” because the concrete dried—without checking if trucks can cross it.
So I reframed DoD into one sentence:
A feature is done when it can survive production without its author babysitting it.
1) Done means Observable
If it breaks, can I know what broke in under 5 minutes?
When a user says “it’s not working,” the worst response is “I can’t see anything.”
That’s not a monitoring issue. That’s a definition issue.
So now, “done” includes:
- structured logs (not vibes)
- request IDs / correlation IDs
- key metrics: success rate, error rate, latency (p50/p95)
- a simple dashboard that answers: is it healthy?
- alerts only when the feature is truly critical
The shift:
I stopped treating observability as “nice to have later.”
I treat it as part of the feature—like the UI or database schema.
Because shipping without telemetry is like deploying without SSH access.
2) Done means Reversible
Can I disable it instantly without redeploying and praying?
This is the “I learned it the hard way” one.
At some point, every feature will misbehave:
- edge case you didn’t predict
- downstream service failing
- traffic spike
- data inconsistency
- unexpected user behavior
3) Done means Safe by Default
If someone tries to misuse it, does it fail safely?
Many incidents aren’t “bugs.” They’re missing guardrails.
So “done” includes:
- proper authorization (not just authentication)
- validation for user inputs (especially file uploads)
- rate limiting / abuse controls if the endpoint is “fun”
- secure defaults (deny > allow)
- secrets not logged, not leaked, not shipped to clients
This one matters because production isn’t just for users.
Production is also:
- bots (AI crawls, those shit)
- weird clients
- accidental spam
- malicious traffic
Done means it survives real users and real attackers.
4) Done means Correct Under Real Conditions
Not “works on my machine.” Works in the world.
Staging is polite. Production is chaos.
Production has:
- concurrency
- retries
- timeouts
- partial failures
- unexpected payload sizes
- real traffic patterns
- real data inconsistencies
So “done” includes:
- negative tests + edge case tests
- idempotency where duplicates can happen (payments, messaging, webhook handlers)
- bounded retries + timeouts (no infinite retry loops)
- defined behavior when dependencies fail (fallbacks, graceful degradation)
- load assumptions written somewhere (“expected max payload, expected latency”)
The shift:
I stopped thinking testing is about correctness.
Testing is about survivability under pressure.
5) Done means Explainable to the Next Person
If I disappear for a week, can someone operate it without summoning me?
This is the part devs underestimate because it doesn’t feel technical.
But it’s the difference between a team that ships and a team that waits.
So “done” includes:
- a short runbook: how to debug, where to look, common failures
- updated API docs/contracts
- “how to test locally” steps (even minimal)
- clear ownership: who maintains this, which service, which repo
- release notes (short) if user behavior changes
What a “real” Definition of Done looks like
Here’s the compact version I wish I had earlier:
A feature is “done” when it is:
- Observable (logs/metrics/traces + dashboard)
- Reversible (feature flag + rollback plan)
- Safe (authZ + validation + rate limiting if needed)
- Correct in production reality (idempotency + timeouts + failure handling)
- Explainable (runbook + docs + ownership)
The merged code is not done.
It’s just… merged.
The pushback: “This slows us down”
Yeah. It slows down this PR.
But it speeds up the next 20 releases.
Because the alternative is paying the same cost later, but with interest:
- production incidents
- emergency fixes
- broken trust
- morale drain
- tech debt that turns into fear
DoD is not a bureaucracy.
It’s how you buy back your future time.
Closing: the best Definition of Done is boring
And the best part? I could finally ship with confidence—not because I became smarter, but because I stopped calling things done before they were operable.
The goal isn’t perfection.
The goal is to be able to sleep.
Thank you for reading this article, hope it’s helpful 📖. See you in the next article 🙌



Top comments (0)