Prologue: A Seemingly Normal Afternoon
It was a Friday, 4:30 PM. My team was about to deploy an update for the order-service
– one of the most critical microservices in our order processing pipeline.
Everything looked smooth. Tests passed. CI/CD was all green. I confidently hit the Deploy button to production.
“Just a small rollout… what could go wrong?”
Five minutes later, Slack lit up. Channels like #alert, #ops, and #order-system turned red with pings.
Grafana showed a strange spike: the failure rate of orders shot up.
Log entries appeared, and they weren’t friendly:
java.net.SocketException: Connection reset
org.apache.kafka.common.errors.TimeoutException
Connection refused: no further information
I froze. Within minutes, nearly 500 orders vanished without a trace. Each one was abruptly halted—as if someone pressed “pause” then hit “delete.”
Investigation: Something Wasn't Right
We jumped into a quick incident meeting.
No bugs in the code.
No Kafka issues.
No database outages.
But one thing was consistent: all failed orders happened during the new deployment.
Then someone from the team asked:
“Did anyone set up graceful shutdown for this service?”
I went silent. It all started to make sense.
The old pod had just received requests when Kubernetes sent it a SIGTERM.
But we hadn’t configured Spring Boot for graceful shutdown.
So the pod was killed—instantly and brutally. Kafka didn’t get a chance to send messages. Database transactions were left hanging. Half-processed data disappeared.
Aftermath: Production Fell Apart Because of One Missing Config
Who would’ve thought a single missing line could cause so much damage?
500 lost orders that all had to be manually recovered, one by one.
We did 4 hours of overtime, tracing logs from Kafka to reconstruct requests.
An apology email went out to customers—along with compensation vouchers.
At that point, all I could think was: “I wish I’d known this earlier.”
The Realization: How a Service Dies Is Just as Important as How It Starts
That incident pushed me to dig into graceful shutdown—a concept I had only glossed over before.
Lesson #1: Enable shutdown with empathy
server:
  shutdown: graceful

spring:
  lifecycle:
    timeout-per-shutdown-phase: 30s
This makes Spring wait for in-flight requests to finish before shutting down.
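If you want to see this behavior for yourself before trusting it in production, a throwaway endpoint is enough. This is just a local test sketch, not part of the real order-service; the class name, path, and delay are made up:

import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

// Illustrative only: a deliberately slow endpoint for testing graceful shutdown locally.
@RestController
class SlowController {

    @GetMapping("/slow")
    String slow() throws InterruptedException {
        Thread.sleep(10_000); // simulate an in-flight request that is still being processed
        return "done";
    }
}

Hit the endpoint, send the process a SIGTERM while the request is still running, and the application should hold off on exiting until the response is returned or the 30-second phase timeout runs out.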
Lesson #2: Say goodbye to Kafka properly
@PreDestroy
public void cleanUp() {
    kafkaProducer.flush();
    kafkaProducer.close(Duration.ofSeconds(10));
    log.info("Kafka producer closed.");
}
If you don’t close your producer correctly, you’re basically throwing messages into the void.
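For context, here is one way that snippet might sit inside a full component. This is a sketch, assuming the plain Apache Kafka client and a producer bean configured elsewhere; the class and field names are illustrative, not our actual code:

import java.time.Duration;
import jakarta.annotation.PreDestroy; // javax.annotation.PreDestroy on older Spring Boot versions
import org.apache.kafka.clients.producer.KafkaProducer;
import org.springframework.stereotype.Component;

// Sketch: a wrapper component that owns the producer and says goodbye properly.
// Assumes a KafkaProducer<String, String> bean is defined somewhere in the config.
@Component
class OrderEventPublisher {

    private final KafkaProducer<String, String> kafkaProducer;

    OrderEventPublisher(KafkaProducer<String, String> kafkaProducer) {
        this.kafkaProducer = kafkaProducer;
    }

    @PreDestroy
    public void cleanUp() {
        kafkaProducer.flush();                       // push any buffered records out first
        kafkaProducer.close(Duration.ofSeconds(10)); // then wait up to 10s for the broker to ack
    }
}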
Lesson #3: Don’t forget your thread pools
@Bean
public Executor taskExecutor() {
    ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
    executor.setWaitForTasksToCompleteOnShutdown(true);
    executor.setAwaitTerminationSeconds(30);
    return executor;
}
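These two settings tell Spring to stop accepting new work on shutdown but let queued and running tasks finish, waiting up to 30 seconds before the pool is destroyed. As a hedged sketch (class and method names are made up), this is the kind of @Async work that would otherwise be interrupted halfway:

import org.springframework.scheduling.annotation.Async;
import org.springframework.stereotype.Service;

// Illustrative only: background work that benefits from the executor settings above.
@Service
class OrderAuditService {

    @Async // picked up by the taskExecutor bean above; requires @EnableAsync on a config class
    public void recordAuditTrail(String orderId) {
        // e.g. persist an audit record or notify a downstream system
    }
}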
Lesson #4: Readiness probes are your safety net
@EventListener
public void onAppShutdown(ContextClosedEvent event) {
    isReady.set(false); // readiness = false => K8s stops sending new traffic
}
If a pod is in the middle of dying and still receiving traffic, it’s like asking a patient on life support to keep working.
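The snippet above references an isReady flag without showing where it lives. One way to wire it up is a small component that flips an AtomicBoolean and exposes it through an endpoint your Kubernetes readinessProbe points at; this is a sketch, and the class name and path are illustrative:

import java.util.concurrent.atomic.AtomicBoolean;
import org.springframework.context.event.ContextClosedEvent;
import org.springframework.context.event.EventListener;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

// Sketch: a hand-rolled readiness endpoint that goes "not ready" as soon as shutdown starts.
@RestController
class ReadinessController {

    private final AtomicBoolean isReady = new AtomicBoolean(true);

    @GetMapping("/readyz")
    ResponseEntity<String> readiness() {
        return isReady.get()
                ? ResponseEntity.ok("READY")
                : ResponseEntity.status(HttpStatus.SERVICE_UNAVAILABLE).body("SHUTTING_DOWN");
    }

    @EventListener
    public void onAppShutdown(ContextClosedEvent event) {
        isReady.set(false); // readiness = false => K8s stops sending new traffic
    }
}

Recent Spring Boot versions (2.3+) can also expose this through Actuator's readiness health group, if you prefer not to roll your own endpoint.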
Conclusion
That incident was a painful but valuable lesson. It taught me that a system shouldn’t just be designed to run well—it must also be designed to shut down safely.
In a microservices world, where everything is interconnected in real-time, a single service dying unexpectedly can cause a domino effect—disrupting data, user experience, and system reputation.
Key Takeaways:
- Graceful shutdown is not optional – it's essential. Especially for services dealing with requests, Kafka, RabbitMQ, databases, or external APIs.
- Always configure server.shutdown: graceful and set an appropriate timeout-per-shutdown-phase.
- Ensure all critical resources are properly released:
  - Kafka producers
  - Thread pools
  - DB connections
  - External clients
- Use readiness probes to signal Kubernetes to stop sending new traffic during shutdown.
- Test shutdown scenarios in staging – not just startup ones.
- And finally: avoid Friday deployments if you can.
Systems may fail—but people deserve their weekends.
Writing clean code is one thing.
Running a system responsibly and safely is another—and it’s often the part that’s overlooked.
I hope this story saves you from facing a black Friday like I did.