Prologue: A Seemingly Normal Afternoon
It was a Friday, 4:30 PM. My team was about to deploy an update for the order-service
– one of the most critical microservices in our order processing pipeline.
Everything looked smooth. Tests passed. CI/CD was all green. I confidently hit the Deploy button to production.
“Just a small rollout… what could go wrong?”
Five minutes later, Slack lit up. Channels like #alert, #ops, and #order-system turned red with pings.
Grafana showed a strange spike: the failure rate of orders shot up.
Log entries appeared, and they weren’t friendly:
java.net.SocketException: Connection reset
org.apache.kafka.common.errors.TimeoutException
Connection refused: no further information
I froze. Within minutes, nearly 500 orders vanished without a trace. Each one was abruptly halted—as if someone pressed “pause” then hit “delete.”
Investigation: Something Wasn't Right
We jumped into a quick incident meeting.
No bugs in the code.
No Kafka issues.
No database outages.
But one thing was consistent: all failed orders happened during the new deployment.
Then someone from the team asked:
“Did anyone set up graceful shutdown for this service?”
I went silent. It all started to make sense.
The old pod had just received requests when Kubernetes sent it a SIGTERM.
But we hadn’t configured Spring Boot for graceful shutdown.
So the pod was killed—instantly and brutally. Kafka didn’t get a chance to send messages. Database transactions were left hanging. Half-processed data disappeared.
Aftermath: Production Fell Apart Because of One Missing Config
Who would’ve thought a single missing line could cause so much damage?
500 lost orders that all had to be manually recovered, one by one.
We did 4 hours of overtime, tracing logs from Kafka to reconstruct requests.
An apology email went out to customers—along with compensation vouchers.
At that point, all I could think was: “I wish I’d known this earlier.”
The Realization: How a Service Dies Is Just as Important as How It Starts
That incident pushed me to dig into graceful shutdown—a concept I had only glossed over before.
Lesson #1: Enable shutdown with empathy
server:
  shutdown: graceful

spring:
  lifecycle:
    timeout-per-shutdown-phase: 30s
This makes Spring wait for in-flight requests to finish before shutting down.
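If you want to see this behavior for yourself before trusting it in production, a throwaway endpoint is enough. This is just a local test sketch, not part of the real order-service; the class name, path, and delay are made up:

import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

// Illustrative only: a deliberately slow endpoint for testing graceful shutdown locally.
@RestController
class SlowController {

    @GetMapping("/slow")
    String slow() throws InterruptedException {
        Thread.sleep(10_000); // simulate an in-flight request that is still being processed
        return "done";
    }
}

Hit the endpoint, send the process a SIGTERM while the request is still running, and the application should hold off on exiting until the response is returned or the 30-second phase timeout runs out.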
Lesson #2: Say goodbye to Kafka properly
@PreDestroy
public void cleanUp() {
    kafkaProducer.flush();
    kafkaProducer.close(Duration.ofSeconds(10));
    log.info("Kafka producer closed.");
}
If you don’t close your producer correctly, you’re basically throwing messages into the void.
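For context, here is one way that snippet might sit inside a full component. This is a sketch, assuming the plain Apache Kafka client and a producer bean configured elsewhere; the class and field names are illustrative, not our actual code:

import java.time.Duration;
import jakarta.annotation.PreDestroy; // javax.annotation.PreDestroy on older Spring Boot versions
import org.apache.kafka.clients.producer.KafkaProducer;
import org.springframework.stereotype.Component;

// Sketch: a wrapper component that owns the producer and says goodbye properly.
// Assumes a KafkaProducer<String, String> bean is defined somewhere in the config.
@Component
class OrderEventPublisher {

    private final KafkaProducer<String, String> kafkaProducer;

    OrderEventPublisher(KafkaProducer<String, String> kafkaProducer) {
        this.kafkaProducer = kafkaProducer;
    }

    @PreDestroy
    public void cleanUp() {
        kafkaProducer.flush();                       // push any buffered records out first
        kafkaProducer.close(Duration.ofSeconds(10)); // then wait up to 10s for the broker to ack
    }
}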
Lesson #3: Don’t forget your thread pools
@Bean
public Executor taskExecutor() {
    ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
    executor.setWaitForTasksToCompleteOnShutdown(true);
    executor.setAwaitTerminationSeconds(30);
    return executor;
}
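These two settings tell Spring to stop accepting new work on shutdown but let queued and running tasks finish, waiting up to 30 seconds before the pool is destroyed. As a hedged sketch (class and method names are made up), this is the kind of @Async work that would otherwise be interrupted halfway:

import org.springframework.scheduling.annotation.Async;
import org.springframework.stereotype.Service;

// Illustrative only: background work that benefits from the executor settings above.
@Service
class OrderAuditService {

    @Async // picked up by the taskExecutor bean above; requires @EnableAsync on a config class
    public void recordAuditTrail(String orderId) {
        // e.g. persist an audit record or notify a downstream system
    }
}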
Lesson #4: Readiness probes are your safety net
@EventListener
public void onAppShutdown(ContextClosedEvent event) {
    isReady.set(false); // readiness = false => K8s stops sending new traffic
}
If a pod is in the middle of dying and still receiving traffic, it’s like asking a patient on life support to keep working.
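The snippet above references an isReady flag without showing where it lives. One way to wire it up is a small component that flips an AtomicBoolean and exposes it through an endpoint your Kubernetes readinessProbe points at; this is a sketch, and the class name and path are illustrative:

import java.util.concurrent.atomic.AtomicBoolean;
import org.springframework.context.event.ContextClosedEvent;
import org.springframework.context.event.EventListener;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

// Sketch: a hand-rolled readiness endpoint that goes "not ready" as soon as shutdown starts.
@RestController
class ReadinessController {

    private final AtomicBoolean isReady = new AtomicBoolean(true);

    @GetMapping("/readyz")
    ResponseEntity<String> readiness() {
        return isReady.get()
                ? ResponseEntity.ok("READY")
                : ResponseEntity.status(HttpStatus.SERVICE_UNAVAILABLE).body("SHUTTING_DOWN");
    }

    @EventListener
    public void onAppShutdown(ContextClosedEvent event) {
        isReady.set(false); // readiness = false => K8s stops sending new traffic
    }
}

Recent Spring Boot versions (2.3+) can also expose this through Actuator's readiness health group, if you prefer not to roll your own endpoint.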
Conclusion
That incident was a painful but valuable lesson. It taught me that a system shouldn’t just be designed to run well—it must also be designed to shut down safely.
In a microservices world, where everything is interconnected in real-time, a single service dying unexpectedly can cause a domino effect—disrupting data, user experience, and system reputation.
Key Takeaways:
- Graceful shutdown is not optional – it's essential. Especially for services dealing with requests, Kafka, RabbitMQ, databases, or external APIs.
- Always configure server.shutdown: graceful and set an appropriate timeout-per-shutdown-phase.
- Ensure all critical resources are properly released:
  - Kafka producers
  - Thread pools
  - DB connections
  - External clients
- Use readiness probes to signal Kubernetes to stop sending new traffic during shutdown.
- Test shutdown scenarios in staging – not just startup ones.
- And finally: avoid Friday deployments if you can.
Systems may fail—but people deserve their weekends.
Writing clean code is one thing.
Running a system responsibly and safely is another—and it’s often the part that’s overlooked.
I hope this story saves you from facing a black Friday like I did.