<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: lowkey dev</title>
    <description>The latest articles on DEV Community by lowkey dev (@lowkey_dev_591).</description>
    <link>https://dev.to/lowkey_dev_591</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3236539%2F4508123a-eb14-4b6f-8c87-3edb9cf352a1.jpg</url>
      <title>DEV Community: lowkey dev</title>
      <link>https://dev.to/lowkey_dev_591</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/lowkey_dev_591"/>
    <language>en</language>
    <item>
      <title>How I Saved My System Through Peak Season</title>
      <dc:creator>lowkey dev</dc:creator>
      <pubDate>Sun, 21 Sep 2025 04:02:46 +0000</pubDate>
      <link>https://dev.to/lowkey_dev_591/how-i-saved-my-system-through-peak-season-3m79</link>
      <guid>https://dev.to/lowkey_dev_591/how-i-saved-my-system-through-peak-season-3m79</guid>
      <description>&lt;h2&gt;
  
  
  Introduction: Peak Season and the Challenge Ahead
&lt;/h2&gt;

&lt;p&gt;The travel season was here, and the atmosphere at our company was hotter than the sun outside. Our system—the heartbeat of all operations—was about to face &lt;strong&gt;peak traffic&lt;/strong&gt; 8–10 times higher than usual. I opened my laptop and accessed the dashboard like a normal user, but immediately felt the pressure: everything was slow and laggy, each click sent a flurry of requests that were hard to control.&lt;/p&gt;

&lt;p&gt;Every analytics table, every chart was a potential “CPU and memory bomb.” The &lt;strong&gt;server was under stress&lt;/strong&gt;, and an OOM (Out of Memory) crash was almost guaranteed if traffic kept spiking. This marked the start of my journey to save the system, where every decision would directly affect the user experience.&lt;/p&gt;




&lt;h2&gt;
  
  
  Investigating the Frontend: The Tip of the Iceberg
&lt;/h2&gt;

&lt;p&gt;Opening the browser dev tools (F12), I saw hundreds of requests continuously hitting endpoints, many fetching entire customer, transaction, and payment tables. The dashboard tried to compute everything in real-time, and CPU and memory jumped with every click.&lt;/p&gt;

&lt;p&gt;I applied &lt;strong&gt;lazy loading&lt;/strong&gt; for non-critical data, cached some tables temporarily in localStorage, and sacrificed a little smoothness in UX. Instantly, the dashboard became more responsive and the backend felt lighter. But I knew this was just the tip of the iceberg—the real danger was lurking deeper.&lt;/p&gt;




&lt;h2&gt;
  
  
  Investigating the Backend: Where the Pressure Truly Lies
&lt;/h2&gt;

&lt;p&gt;The frontend was only the tip of the iceberg. I opened server logs, enabled APM, and tracked &lt;strong&gt;slow queries&lt;/strong&gt; and profiling metrics. Many endpoints computed analytics in real-time on massive tables. &lt;strong&gt;Read-heavy queries&lt;/strong&gt; were unoptimized, fetching all data on each dashboard load and sending CPU and memory into overdrive.&lt;/p&gt;

&lt;p&gt;I tried &lt;strong&gt;precomputing&lt;/strong&gt; heavy metrics and storing them in Redis. Initially, data was a few minutes behind real-time, making me anxious, but the dashboard ran smoothly and the backend stabilized. A clear &lt;strong&gt;trade-off&lt;/strong&gt;: sacrificing some accuracy to save the system. Redis hit rates increased, and I felt both relief and tension.&lt;/p&gt;
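&lt;p&gt;A minimal sketch of the precompute-and-cache idea, using an in-process dictionary in place of Redis; the names and the 5-minute TTL are illustrative, not the production code:&lt;/p&gt;

```python
import time

class MetricCache:
    """Tiny stand-in for Redis: caches precomputed metrics with a TTL."""
    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (value, computed_at)

    def get_or_compute(self, key, compute):
        entry = self.store.get(key)
        now = time.time()
        if entry is None or (now - entry[1]) > self.ttl:
            value = compute()              # the heavy query runs only on miss or expiry
            self.store[key] = (value, now)
            return value
        return entry[0]                    # cache hit: possibly a few minutes stale

cache = MetricCache(ttl_seconds=300)
revenue = cache.get_or_compute("daily_revenue", lambda: sum([120, 80, 40]))
print(revenue)  # 240 -- repeat calls within the TTL return the cached value
```

&lt;p&gt;The TTL is exactly the trade-off described above: a longer TTL means a lighter backend but staler charts.&lt;/p&gt;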




&lt;h2&gt;
  
  
  CQRS and Read-Heavy Queries: A Long-Term Solution
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;read-heavy queries&lt;/strong&gt; continued to stress the server. I tried scaling MySQL, adding replicas, increasing RAM—but memory spikes still occurred. I decided to implement &lt;strong&gt;CQRS&lt;/strong&gt;, separating write and read operations, using OpenSearch to serve read-heavy queries.&lt;/p&gt;

&lt;p&gt;Data synchronization was complex and the logic intricate, but the dashboard finally responded fast and reliably. Complexity increased—more services in the codebase, listeners syncing data, added monitoring for OpenSearch, Redis, and MySQL. Yet the heavy analytics tables now ran smoothly, and CPU and memory no longer jumped wildly.&lt;/p&gt;




&lt;h2&gt;
  
  
  Precomputing the Dashboard: Sacrificing Realtime
&lt;/h2&gt;

&lt;p&gt;The most critical analytics tables, if computed in real-time, would leave the &lt;strong&gt;server under stress&lt;/strong&gt; and prone to crashing. I precomputed results and stored them in Redis. When &lt;strong&gt;peak traffic&lt;/strong&gt; hit, the dashboard ran smoothly, though data was no longer fully real-time. I remember clicking through the dashboard and seeing charts lag by a few minutes—a &lt;strong&gt;trade-off&lt;/strong&gt; worth accepting to keep the system alive.&lt;/p&gt;

&lt;p&gt;Exports and dashboard queries now returned data from Redis lightning-fast; CPU dropped from 95% to 60%, and memory stabilized.&lt;/p&gt;




&lt;h2&gt;
  
  
  Cache Promise, Request Coalescing, and Pre-Warming
&lt;/h2&gt;

&lt;p&gt;Before peak traffic, many concurrent requests hitting the same data made Redis and the database shaky. I implemented &lt;strong&gt;Cache Promise&lt;/strong&gt; and &lt;strong&gt;request coalescing&lt;/strong&gt;, merging multiple requests so that only one query actually hit the database. The code became more complex, but the backend stood firm—I felt like we had weathered a storm.&lt;/p&gt;
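&lt;p&gt;Request coalescing can be sketched in a few lines with asyncio: concurrent requests for the same key share one in-flight task, so only the first actually hits the database. The &lt;code&gt;Coalescer&lt;/code&gt; name and the timings are illustrative:&lt;/p&gt;

```python
import asyncio

class Coalescer:
    """Merge concurrent requests for the same key into a single backend call."""
    def __init__(self):
        self.inflight = {}  # key -> asyncio.Task of the one real query

    async def fetch(self, key, loader):
        task = self.inflight.get(key)
        if task is None:
            task = asyncio.ensure_future(loader(key))
            self.inflight[key] = task
            # forget the task once it resolves so later requests re-query
            task.add_done_callback(lambda _: self.inflight.pop(key, None))
        return await task

calls = 0

async def load_from_db(key):
    global calls
    calls += 1                    # count how often the "database" is hit
    await asyncio.sleep(0.01)     # simulate a slow query
    return f"rows-for-{key}"

async def main():
    c = Coalescer()
    # ten concurrent dashboard requests for the same data
    return await asyncio.gather(*[c.fetch("dashboard", load_from_db) for _ in range(10)])

results = asyncio.run(main())
print(calls)  # 1 -- only one query reached the database
```

&lt;p&gt;This is the same idea as a “cache promise”: the shared value is the pending task itself, not the finished result.&lt;/p&gt;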

&lt;p&gt;I also scheduled &lt;strong&gt;pre-warming cache jobs&lt;/strong&gt;. The server absorbed a light load during off-peak hours, but when traffic peaked, data was ready. The dashboard stayed smooth, and the backend calmly handled 8–10x traffic without faltering.&lt;/p&gt;




&lt;h2&gt;
  
  
  Request Prioritization and Selective Querying
&lt;/h2&gt;

&lt;p&gt;Some Excel exports or analytics requests used to slow down critical operations. I implemented &lt;strong&gt;bulkhead&lt;/strong&gt; and &lt;strong&gt;request prioritization&lt;/strong&gt;, ensuring critical requests were processed first. Some analytics exports were slower, but the system remained responsive.&lt;/p&gt;

&lt;p&gt;To avoid OOM, I queried only necessary fields and processed large exports in batches. Real-time data integrity was partially sacrificed, but the server survived, the dashboard remained smooth, and the feeling of victory ran through the system.&lt;/p&gt;
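&lt;p&gt;Batch processing like this can be sketched as a generator that pages through the data, so at most one batch is ever in memory; the fake table below stands in for the real query:&lt;/p&gt;

```python
def export_in_batches(fetch_batch, batch_size=1000):
    """Stream a large export batch by batch instead of loading every row at once."""
    offset = 0
    while True:
        rows = fetch_batch(offset, batch_size)  # e.g. SELECT ... LIMIT ? OFFSET ?
        if not rows:
            break
        yield from rows                         # the caller writes rows out as they arrive
        offset += batch_size

# A fake 2,500-row table standing in for the real query:
table = list(range(2500))
fetch = lambda offset, limit: table[offset:offset + limit]
exported = list(export_in_batches(fetch, batch_size=1000))
print(len(exported))  # 2500, fetched in batches of at most 1000 rows
```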




&lt;h2&gt;
  
  
  Monitoring and Alerting: Better Safe Than Sorry
&lt;/h2&gt;

&lt;p&gt;During preparation, I set up continuous &lt;strong&gt;monitoring&lt;/strong&gt;: CPU, memory, Redis hits, OpenSearch query latency, successful and failed request counts. I configured &lt;strong&gt;alerts&lt;/strong&gt; for threshold breaches, so we received warnings before the system truly failed.&lt;/p&gt;
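&lt;p&gt;The threshold checks boil down to something like the sketch below; the metric names and limits are made up for illustration, and a real setup would feed these rules into the monitoring stack rather than hand-rolled code:&lt;/p&gt;

```python
# Illustrative thresholds; real values and the alert channel are deployment-specific.
THRESHOLDS = {"cpu_percent": 80, "memory_percent": 85, "redis_miss_rate": 0.4}

def check_thresholds(metrics):
    """Return a human-readable alert for every breached threshold."""
    alerts = []
    for name, limit in THRESHOLDS.items():
        value = metrics.get(name, 0)
        if value > limit:
            alerts.append(f"{name}={value} exceeds {limit}")
    return alerts

print(check_thresholds({"cpu_percent": 95, "memory_percent": 60, "redis_miss_rate": 0.1}))
# ['cpu_percent=95 exceeds 80']
```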

&lt;p&gt;This way, I didn’t wait for the server to crash to know something was wrong—memory spikes or slow queries were reported immediately, allowing timely intervention.&lt;/p&gt;




&lt;h2&gt;
  
  
  Chaos Testing and Load Testing
&lt;/h2&gt;

&lt;p&gt;Before the peak season, my team ran &lt;strong&gt;load tests&lt;/strong&gt; simulating peak traffic and performed &lt;strong&gt;chaos testing&lt;/strong&gt;, intentionally breaking some services. Through these tests we learned a lot: redundant caches, request queues stacking up, potential deadlocks in the OpenSearch sync listeners. These exercises helped us prepare rollback plans, increase replicas, and adjust batch sizes.&lt;/p&gt;




&lt;h2&gt;
  
  
  Rollout &amp;amp; Hotfix During Peak Hours
&lt;/h2&gt;

&lt;p&gt;One night, during the traffic peak, a minor bug in the precomputed dashboard caused data to lag more than usual. I had to apply a &lt;strong&gt;hotfix&lt;/strong&gt; directly in production, deploying carefully step by step while monitoring Redis and OpenSearch. It was tense and stressful, but once everything stabilized, it felt like we had truly survived a data storm.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion: Lessons Learned
&lt;/h2&gt;

&lt;p&gt;After surviving the peak traffic, the dashboard ran smoothly, the backend was stable, and users were unaffected. Reflecting on the experience, I realized that preparation is everything: setting up monitoring, alerts, load testing, chaos testing, and pre-warming caches beforehand can make the difference between success and disaster.&lt;/p&gt;

&lt;p&gt;Equally important is finding the root cause of issues. It’s easy to patch symptoms, but unless you understand the underlying problems—whether it’s read-heavy queries, unoptimized endpoints, or poorly synchronized data—the system will eventually break under stress.&lt;/p&gt;

&lt;p&gt;Finally, there’s no perfect solution. Every choice involves trade-offs: sacrificing some UX smoothness, accepting minor delays in real-time data, increasing system complexity. Recognizing these trade-offs and planning for them ahead of time is the key to keeping a system alive during high-pressure peak traffic seasons.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>backend</category>
      <category>performance</category>
    </item>
    <item>
      <title>Understanding the Saga Pattern in 5 Minutes</title>
      <dc:creator>lowkey dev</dc:creator>
      <pubDate>Tue, 02 Sep 2025 12:04:03 +0000</pubDate>
      <link>https://dev.to/lowkey_dev_591/understanding-the-saga-pattern-in-5-minutes-3kea</link>
      <guid>https://dev.to/lowkey_dev_591/understanding-the-saga-pattern-in-5-minutes-3kea</guid>
      <description>&lt;p&gt;If you are new to &lt;strong&gt;microservices&lt;/strong&gt;, you’ve probably heard of the &lt;strong&gt;Saga Pattern&lt;/strong&gt; – a &lt;strong&gt;design pattern for managing distributed transactions in microservices&lt;/strong&gt;. It helps services coordinate smoothly, maintain data consistency, and achieve &lt;strong&gt;eventual consistency&lt;/strong&gt; even when a service fails. This article will help you &lt;strong&gt;quickly understand the Saga Pattern&lt;/strong&gt;, with clear examples and fundamental technical concepts.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;1. Context and Problem&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In &lt;strong&gt;traditional (monolithic) systems&lt;/strong&gt;, you can use a &lt;strong&gt;transaction&lt;/strong&gt; to ensure data consistency:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If all steps succeed → &lt;strong&gt;commit&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;If any step fails → &lt;strong&gt;rollback&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example of order processing in a monolithic system:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Deduct customer payment&lt;/li&gt;
&lt;li&gt;Deduct product inventory&lt;/li&gt;
&lt;li&gt;Send confirmation email&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All steps are in &lt;strong&gt;one transaction&lt;/strong&gt;, so if any step fails → rollback everything, keeping the data &lt;strong&gt;consistent&lt;/strong&gt;.&lt;/p&gt;
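&lt;p&gt;The commit-or-rollback behaviour is easy to see with SQLite as a stand-in for any transactional database: if any step raises, everything inside the transaction is undone:&lt;/p&gt;

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
con.execute("INSERT INTO accounts VALUES ('customer', 100), ('shop', 0)")
con.commit()

try:
    with con:  # one transaction: commit on success, rollback on any exception
        con.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'customer'")
        raise RuntimeError("inventory step failed")  # simulate step 2 failing
except RuntimeError:
    pass

balance = con.execute("SELECT balance FROM accounts WHERE name = 'customer'").fetchone()[0]
print(balance)  # 100 -- the payment deduction was rolled back automatically
```

&lt;p&gt;This is exactly what a single database gives you for free, and what microservices lose once each step lives in its own database.&lt;/p&gt;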

&lt;p&gt;However, in &lt;strong&gt;microservices&lt;/strong&gt;, each step is usually managed by a &lt;strong&gt;separate service&lt;/strong&gt; with its &lt;strong&gt;own database&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Payment Service:&lt;/strong&gt; deduct money&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inventory Service:&lt;/strong&gt; deduct stock&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Notification Service:&lt;/strong&gt; send email&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If a step fails, previous steps may have already &lt;strong&gt;committed&lt;/strong&gt;, leading to &lt;strong&gt;data inconsistency&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Example: the customer is charged, but the &lt;strong&gt;product is out of stock&lt;/strong&gt;, or the confirmation email was not sent.&lt;/p&gt;

&lt;p&gt;This is the &lt;strong&gt;problem the Saga Pattern solves&lt;/strong&gt;: helping services in microservices &lt;strong&gt;coordinate smoothly&lt;/strong&gt; and &lt;strong&gt;keep data consistent even in case of failures&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;2. What is the Saga Pattern?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;Saga Pattern&lt;/strong&gt; is a &lt;strong&gt;design pattern for managing distributed transactions in microservices&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instead of using a &lt;strong&gt;traditional transaction&lt;/strong&gt; (rollback everything if one step fails), &lt;strong&gt;each service manages its own transaction&lt;/strong&gt;, and if a subsequent step fails, the system performs &lt;strong&gt;compensation&lt;/strong&gt; for previous steps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example of an online order:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Payment Service:&lt;/strong&gt; deducts money → success&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inventory Service:&lt;/strong&gt; deducts stock → fails (out of stock)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Notification Service:&lt;/strong&gt; sends email → not executed&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Without Saga Pattern:&lt;/strong&gt; Payment Service already charged the customer → the customer loses money but gets no product&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;With Saga Pattern:&lt;/strong&gt; Inventory Service fails → Payment Service &lt;strong&gt;refunds&lt;/strong&gt;, email not sent → avoids confusion&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Core idea: Each step &lt;strong&gt;takes responsibility&lt;/strong&gt; and has a &lt;strong&gt;compensation mechanism&lt;/strong&gt;, allowing steps in a &lt;strong&gt;distributed transaction&lt;/strong&gt; to coordinate without breaking the system.&lt;/p&gt;
&lt;/blockquote&gt;
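&lt;p&gt;The core idea can be sketched as a small saga runner that records each completed step with its compensation and unwinds them in reverse on failure; all names here are illustrative:&lt;/p&gt;

```python
def run_saga(steps):
    """Run saga steps in order; on failure, run compensations in reverse."""
    done = []
    for name, action, compensate in steps:
        try:
            action()
            done.append((name, compensate))
        except RuntimeError:
            for _, undo in reversed(done):
                undo()                      # compensate already-committed steps
            return f"failed at {name}, compensated: {[n for n, _ in reversed(done)]}"
    return "completed"

log = []

def fail_out_of_stock():
    raise RuntimeError("out of stock")

steps = [
    ("payment",   lambda: log.append("charged"), lambda: log.append("refunded")),
    ("inventory", fail_out_of_stock,             lambda: None),
    ("email",     lambda: log.append("emailed"), lambda: None),
]
result = run_saga(steps)
print(result)  # failed at inventory, compensated: ['payment']
print(log)     # ['charged', 'refunded'] -- the email step never ran
```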




&lt;h2&gt;
  
  
  &lt;strong&gt;3. Two Approaches to Implement Saga Pattern&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3.1 Event-Driven Saga&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Each step &lt;strong&gt;emits an event&lt;/strong&gt; on success or failure&lt;/li&gt;
&lt;li&gt;The next step &lt;strong&gt;listens to events&lt;/strong&gt; to decide whether to execute or compensate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Payment Service deducts money → emits event &lt;code&gt;"PaymentSuccess"&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Inventory Service listens → deducts stock&lt;/li&gt;
&lt;li&gt;If Inventory Service fails → emits event &lt;code&gt;"InventoryFailed"&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Payment Service listens → performs &lt;strong&gt;refund&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Advantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No central orchestrator needed; services &lt;strong&gt;coordinate flexibly on their own&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Easy to scale when adding new services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Disadvantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hard to track the overall transaction state&lt;/li&gt;
&lt;li&gt;Susceptible to &lt;strong&gt;duplicate or delayed events&lt;/strong&gt;, requiring &lt;strong&gt;idempotency&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;3.2 Orchestration Saga&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;central orchestrator&lt;/strong&gt; coordinates all steps&lt;/li&gt;
&lt;li&gt;If a step fails, the orchestrator commands &lt;strong&gt;rollback of previous steps&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Orchestrator commands Payment Service → success&lt;/li&gt;
&lt;li&gt;Orchestrator commands Inventory Service → fails&lt;/li&gt;
&lt;li&gt;Orchestrator commands Payment Service &lt;strong&gt;refund&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Notification Service does not send email&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Advantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Easy to manage complex processes, centralized state control&lt;/li&gt;
&lt;li&gt;Easier to track and reduce risk of &lt;strong&gt;duplicate or missing events&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Disadvantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Orchestrator becomes a &lt;strong&gt;single point of failure&lt;/strong&gt;; if it fails or lags → affects the entire transaction&lt;/li&gt;
&lt;li&gt;Adds a central component → increases deployment complexity&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;4. Illustrative Example: Online Order&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Assume a 3-step order process:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Payment Service:&lt;/strong&gt; deduct customer money → success&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inventory Service:&lt;/strong&gt; deduct stock → fails (out of stock)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Notification Service:&lt;/strong&gt; send email → not executed&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Without Saga Pattern:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Payment Service already charged → customer loses money but gets no product&lt;/li&gt;
&lt;li&gt;Inventory Service fails → data inconsistency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;With Saga Pattern (Event-Driven or Orchestration):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Inventory Service fails → Payment Service &lt;strong&gt;refunds&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Email not sent → avoids confusion&lt;/li&gt;
&lt;li&gt;Process remains &lt;strong&gt;consistent&lt;/strong&gt;, ensuring good customer experience&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Saga Pattern allows &lt;strong&gt;each step in a distributed transaction to be independent&lt;/strong&gt; while still coordinating effectively, ensuring &lt;strong&gt;data consistency and good user experience&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;5. Key Technical Terms&lt;/strong&gt;
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Transaction:&lt;/strong&gt; A sequence of operations on data that ensures &lt;strong&gt;ACID&lt;/strong&gt; (Atomicity, Consistency, Isolation, Durability).&lt;br&gt;
&lt;em&gt;Example:&lt;/em&gt; Transferring money from account A to B; if deducting A succeeds but adding B fails → rollback.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Distributed Transaction:&lt;/strong&gt; A transaction spanning multiple services or separate databases, requiring &lt;strong&gt;compensation or eventual consistency&lt;/strong&gt;.&lt;br&gt;
&lt;em&gt;Example:&lt;/em&gt; Online order: Payment Service deducts money, Inventory Service deducts stock, Notification Service sends email.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Saga Pattern:&lt;/strong&gt; Design pattern managing &lt;strong&gt;distributed transactions&lt;/strong&gt; by performing &lt;strong&gt;compensation&lt;/strong&gt; if a subsequent step fails.&lt;br&gt;
&lt;em&gt;Example:&lt;/em&gt; Inventory Service reports out of stock → Payment Service refunds.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Compensation:&lt;/strong&gt; Undo a committed step if another step fails.&lt;br&gt;
&lt;em&gt;Example:&lt;/em&gt; Payment Service deducted money but Inventory Service fails → Payment Service refunds.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Event:&lt;/strong&gt; Asynchronous message between services indicating transaction status.&lt;br&gt;
&lt;em&gt;Example:&lt;/em&gt; Payment Service sends &lt;code&gt;"PaymentSuccess"&lt;/code&gt;, Inventory Service listens and deducts stock.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Orchestrator:&lt;/strong&gt; Central component in &lt;strong&gt;Orchestration Saga&lt;/strong&gt; coordinating steps and rollbacks.&lt;br&gt;
&lt;em&gt;Example:&lt;/em&gt; Orchestrator commands Payment → Inventory → rollback if necessary.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Partial Failure:&lt;/strong&gt; One step in a distributed transaction fails while others have committed.&lt;br&gt;
&lt;em&gt;Example:&lt;/em&gt; Payment Service succeeds, but Inventory Service reports out of stock.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Data always satisfies business rules after a transaction.&lt;br&gt;
&lt;em&gt;Example:&lt;/em&gt; After ordering, total money deducted = total order price, stock decreases correctly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Eventual Consistency:&lt;/strong&gt; The system will become consistent over time, not immediately.&lt;br&gt;
&lt;em&gt;Example:&lt;/em&gt; Payment Service commits first, Inventory Service commits later, overall state eventually correct.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Idempotency:&lt;/strong&gt; Performing an operation multiple times &lt;strong&gt;does not corrupt data&lt;/strong&gt;, preventing duplicate events.&lt;br&gt;
&lt;em&gt;Example:&lt;/em&gt; &lt;code&gt;"PaymentSuccess"&lt;/code&gt; event sent twice → Payment Service only deducts once.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Orchestration Saga:&lt;/strong&gt; Saga Pattern implemented with a &lt;strong&gt;central orchestrator&lt;/strong&gt; coordinating steps.&lt;br&gt;
&lt;em&gt;Example:&lt;/em&gt; Orchestrator commands Payment → Inventory → Notification; rollback if Inventory fails.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Event-Driven Saga:&lt;/strong&gt; Saga Pattern implemented with &lt;strong&gt;each service managing its own transaction&lt;/strong&gt;, emitting/listening to events without a central coordinator.&lt;br&gt;
&lt;em&gt;Example:&lt;/em&gt; Payment sends &lt;code&gt;"PaymentSuccess"&lt;/code&gt; → Inventory deducts stock → Inventory sends &lt;code&gt;"InventoryFailed"&lt;/code&gt; → Payment refunds.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
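&lt;p&gt;Of these terms, idempotency is the one you need from day one of an Event-Driven Saga. A minimal sketch, deduplicating by event id (a real consumer would persist the processed ids, not keep them in a set):&lt;/p&gt;

```python
processed = set()  # in production this would be a durable store, not an in-memory set

def handle_payment_success(event):
    """Process each event at most once, keyed by its id (idempotent consumer)."""
    if event["id"] in processed:
        return "skipped duplicate"
    processed.add(event["id"])
    return f"deducted stock for order {event['order_id']}"

evt = {"id": "evt-42", "order_id": "o-1"}
print(handle_payment_success(evt))  # deducted stock for order o-1
print(handle_payment_success(evt))  # skipped duplicate -- same event delivered twice
```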




&lt;h2&gt;
  
  
  &lt;strong&gt;6. Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The Saga Pattern is &lt;strong&gt;a design pattern for managing distributed transactions in microservices&lt;/strong&gt;, helping:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each service &lt;strong&gt;manages its own transaction&lt;/strong&gt; and can perform &lt;strong&gt;compensation&lt;/strong&gt; in case of failure&lt;/li&gt;
&lt;li&gt;Services remain &lt;strong&gt;independent but coordinated&lt;/strong&gt;, ensuring &lt;strong&gt;overall process stability&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Reduces risk while preserving &lt;strong&gt;data consistency&lt;/strong&gt;, a &lt;strong&gt;good user experience&lt;/strong&gt;, and &lt;strong&gt;continuous operation&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Saga Pattern is an &lt;strong&gt;essential design pattern&lt;/strong&gt; that makes complex systems &lt;strong&gt;efficient, reliable, and easier to manage&lt;/strong&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Saga Pattern: When Theory Collides with Reality</title>
      <dc:creator>lowkey dev</dc:creator>
      <pubDate>Tue, 02 Sep 2025 09:27:41 +0000</pubDate>
      <link>https://dev.to/lowkey_dev_591/saga-pattern-when-theory-collides-with-reality-4enj</link>
      <guid>https://dev.to/lowkey_dev_591/saga-pattern-when-theory-collides-with-reality-4enj</guid>
      <description>&lt;p&gt;You start your computer, open your IDE, ready to implement the order flow in your microservices. In your mind, you still have a clear picture of what you read about the &lt;strong&gt;Saga Pattern&lt;/strong&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Oh, easy. Each service handles its own transaction, if it fails, just rollback using a compensate. Eventual consistency? No problem, Saga’s got it covered.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Sounds neat, sounds simple… but when you actually start coding, you realize nothing is that smooth.&lt;/p&gt;

&lt;p&gt;You imagine the ideal flow in your head:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Order Service&lt;/strong&gt; creates an order.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Payment Service&lt;/strong&gt; deducts money.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inventory Service&lt;/strong&gt; reduces stock.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shipping Service&lt;/strong&gt; creates a shipment.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In the books, if any step fails → compensate → everything returns to the original state, and the system is perfect. In your mind, it’s &lt;strong&gt;a smooth dance&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;But in reality… it’s a &lt;strong&gt;completely different dance&lt;/strong&gt;. A network timeout, a duplicate event, or an imperfect compensate, and the dance quickly becomes… &lt;strong&gt;an operational nightmare&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1afg6ma2ntfq94goa7eu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1afg6ma2ntfq94goa7eu.png" alt="image.png" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Partial Failure – The First Shock
&lt;/h2&gt;

&lt;p&gt;You imagine: &lt;strong&gt;Payment Service successfully deducts money&lt;/strong&gt;, but &lt;strong&gt;Order Service hasn’t received the event&lt;/strong&gt; due to a network timeout.&lt;/p&gt;

&lt;p&gt;Result? &lt;strong&gt;The customer lost money, but the order hasn’t been created.&lt;/strong&gt; You try retrying, but it gets worse: duplicate events → money deducted twice, wrong stock reduction, double shipment.&lt;/p&gt;

&lt;p&gt;Partial failure and duplicate events are &lt;strong&gt;not exceptions&lt;/strong&gt;, they are the reality in microservices.&lt;/p&gt;

&lt;p&gt;You realize: if partial failures are already complex, can &lt;strong&gt;rollback and compensate&lt;/strong&gt; really save the day?&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Compensate – When Rollback Is Never Perfect
&lt;/h2&gt;

&lt;p&gt;Books teach: rollback is just calling a compensate function → everything returns to the original state.&lt;/p&gt;

&lt;p&gt;Reality:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Email already sent&lt;/strong&gt; → can’t undo.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shipment label created&lt;/strong&gt; → can’t reverse.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Third-party booking&lt;/strong&gt; → rollback almost impossible.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example: a service sends a payment confirmation SMS. If the transaction fails, you can’t “take back” the SMS. A compensate can only &lt;strong&gt;make up for it with another action&lt;/strong&gt;, like sending a cancellation notice or issuing a credit.&lt;/p&gt;

&lt;p&gt;Saga is &lt;strong&gt;not magic&lt;/strong&gt;. Compensate is only &lt;strong&gt;approximate&lt;/strong&gt;, sometimes requiring &lt;strong&gt;manual intervention&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;But the story doesn’t stop there. If states aren’t synchronized, what does the customer see? This is when &lt;strong&gt;Eventual Consistency&lt;/strong&gt; comes into play.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Eventual Consistency – The Inevitable Trade-Off
&lt;/h2&gt;

&lt;p&gt;Data will eventually be consistent, but customers might see: “Processing…” while &lt;strong&gt;money is deducted, but order isn’t created&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;You realize:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;UX must hide temporary states.&lt;/li&gt;
&lt;li&gt;The system needs monitoring, retries, reconciliation.&lt;/li&gt;
&lt;li&gt;Alerts must be clear.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Eventual consistency &lt;strong&gt;isn’t free&lt;/strong&gt;. It requires accepting &lt;strong&gt;temporary risk&lt;/strong&gt;. Otherwise, you’ll face &lt;strong&gt;a flood of support tickets&lt;/strong&gt; from customers.&lt;/p&gt;

&lt;p&gt;While working out the UX, a question arises: &lt;strong&gt;should the flow be managed by a central “director,” or should the services handle it themselves?&lt;/strong&gt; This is where &lt;strong&gt;Orchestration vs. Choreography&lt;/strong&gt; comes in.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Orchestration or Choreography – A Painful Choice
&lt;/h2&gt;

&lt;p&gt;You must choose:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Criteria&lt;/th&gt;
&lt;th&gt;Orchestration&lt;/th&gt;
&lt;th&gt;Choreography&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Debug &amp;amp; Monitoring&lt;/td&gt;
&lt;td&gt;Easy to track Saga states&lt;/td&gt;
&lt;td&gt;Hard to debug, needs detailed logging&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Single Point of Failure&lt;/td&gt;
&lt;td&gt;Has orchestrator&lt;/td&gt;
&lt;td&gt;No SPoF, distributed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Duplicate Event&lt;/td&gt;
&lt;td&gt;Easy to control&lt;/td&gt;
&lt;td&gt;Likely, requires idempotency &amp;amp; retry queue&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Flexibility&lt;/td&gt;
&lt;td&gt;Fixed flow, less flexible&lt;/td&gt;
&lt;td&gt;Flexible when adding/removing services&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deployment &amp;amp; Scaling&lt;/td&gt;
&lt;td&gt;Orchestrator requires special scaling&lt;/td&gt;
&lt;td&gt;Each service can scale independently&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Example: you want to add a service to send promotional vouchers after order completion.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Orchestration: update orchestrator flow, easy to control.&lt;/li&gt;
&lt;li&gt;Choreography: add a listener for the event, but must ensure idempotency and retry queue; errors arise if events are delayed or duplicated.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You realize: &lt;strong&gt;there is no perfect choice&lt;/strong&gt;. Easier debugging, or no single point of failure? Temporary inconsistency, or strict consistency? Saga isn’t just a technique – it’s a &lt;strong&gt;constant trade-off&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;And when you consider it, a red warning flashes: &lt;strong&gt;Saga won’t always save the day&lt;/strong&gt;, especially in systems requiring &lt;strong&gt;strong consistency&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Saga Is Not a Solution for Every Case
&lt;/h2&gt;

&lt;p&gt;Imagine: a &lt;strong&gt;bank&lt;/strong&gt;, transferring money between two accounts. You decide to use Saga: deduct money from A, add to B, log the transaction.&lt;/p&gt;

&lt;p&gt;At first, you are confident: any step fails → compensate → all good.&lt;/p&gt;

&lt;p&gt;Then disaster strikes. Payment Service deducted the money, but Ledger Service hasn’t received the event. Customers panic, support is busy. Compensate? Doesn’t help. Only &lt;strong&gt;manual intervention&lt;/strong&gt; can save it.&lt;/p&gt;

&lt;p&gt;Now you understand: &lt;strong&gt;Saga is not suitable for banking transactions&lt;/strong&gt;. A safer solution: &lt;strong&gt;2-Phase Commit (2PC)&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;2PC ensures strong consistency&lt;/strong&gt;: commit synchronously, fail → rollback immediately.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avoids dangerous partial failures&lt;/strong&gt;: customers don’t see temporary wrong balances.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Absolute integrity&lt;/strong&gt;: critical transactions are always correct.&lt;/li&gt;
&lt;/ul&gt;
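&lt;p&gt;As a toy sketch, 2PC is a coordinator that collects a vote from every participant before anything commits; this deliberately ignores the hard parts (coordinator crashes, participants blocked holding locks), which is exactly the availability price 2PC pays for strong consistency:&lt;/p&gt;

```python
class Participant:
    """One resource in a two-phase commit; `can_commit` models its vote."""
    def __init__(self, name, can_commit=True):
        self.name, self.can_commit, self.state = name, can_commit, "pending"

    def prepare(self):
        return self.can_commit          # phase 1: vote yes/no, hold locks

    def commit(self):
        self.state = "committed"        # phase 2a: everyone voted yes

    def rollback(self):
        self.state = "rolled_back"      # phase 2b: at least one voted no

def two_phase_commit(participants):
    if all(p.prepare() for p in participants):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.rollback()
    return "rolled_back"

accounts = [Participant("account_a"), Participant("account_b", can_commit=False)]
print(two_phase_commit(accounts))   # rolled_back
print([p.state for p in accounts])  # ['rolled_back', 'rolled_back'] -- no partial failure
```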

&lt;p&gt;Lesson: choose the wrong tool, and microservices can turn into &lt;strong&gt;an operational nightmare&lt;/strong&gt;, even if you just wanted “to apply a cool technique.”&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Real Lessons from Applying Saga
&lt;/h2&gt;

&lt;p&gt;After all the shocks from &lt;strong&gt;partial failures, approximate compensation, duplicate events&lt;/strong&gt;, and choosing a deployment model, you begin to draw some “painful” lessons.&lt;/p&gt;

&lt;p&gt;You recall the first time you deployed Saga: events arrived late, compensations fired in the wrong order, customers kept calling support. Only then did you understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Uncontrolled retries = disaster. Idempotency is mandatory.&lt;/li&gt;
&lt;li&gt;Compensation can’t save everything. It only reduces risk; sometimes manual intervention is still needed.&lt;/li&gt;
&lt;li&gt;Customers will see temporarily inconsistent states, so the UX must be clever, alerts clear, and reconciliation always ready.&lt;/li&gt;
&lt;li&gt;No deployment model is a perfect choice. Orchestration is easier to debug but adds a SPoF; Choreography is decentralized but hard to trace. Choose per flow, not on a whim.&lt;/li&gt;
&lt;li&gt;Saga is not for every system. If the business requires strong consistency – e.g., banking – 2PC or other synchronous transactions are safer.&lt;/li&gt;
&lt;/ul&gt;
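&lt;p&gt;The “idempotency is mandatory” point can be sketched as a consumer that remembers processed event IDs, so a duplicate delivery becomes a no-op. In production the set of seen IDs would live in a durable store (DB, Redis); the in-memory version below is purely illustrative:&lt;/p&gt;

```java
import java.util.HashSet;
import java.util.Set;

// Idempotent event handler sketch: remember processed event IDs so that
// redelivered events do not re-run side effects. The in-memory Set is for
// illustration; a real system persists it durably.
public class IdempotentConsumer {
    private final Set<String> processed = new HashSet<>();

    // Returns true if the event was handled, false if it was a duplicate.
    public boolean handle(String eventId, Runnable action) {
        if (!processed.add(eventId)) {
            return false;  // already seen: skip, don't repeat side effects
        }
        action.run();
        return true;
    }
}
```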

&lt;p&gt;Looking back, you realize: Saga isn’t magic; it’s a &lt;strong&gt;sophisticated tool&lt;/strong&gt;. Applied correctly → it reduces risk and increases flexibility. Applied wrongly → an operational nightmare.&lt;/p&gt;

&lt;p&gt;Most importantly: &lt;strong&gt;don’t use it because it’s “cool,” use it because it truly fits your business needs.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Saga Pattern is a &lt;strong&gt;powerful tool&lt;/strong&gt; for &lt;strong&gt;complex distributed transactions&lt;/strong&gt;, but &lt;strong&gt;not a solution for every problem&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Key takeaways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Understand &lt;strong&gt;trade-offs and edge cases&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Prepare &lt;strong&gt;monitoring, alerting, retry, reconciliation&lt;/strong&gt;, and even &lt;strong&gt;manual intervention&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Choose between &lt;strong&gt;Orchestration and Choreography&lt;/strong&gt; based on flow, debugging, SPoF.&lt;/li&gt;
&lt;li&gt;Evaluate &lt;strong&gt;system specifics before deploying Saga&lt;/strong&gt;, avoiding environments needing strong consistency, where &lt;strong&gt;2PC or synchronous transactions&lt;/strong&gt; are safer.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After reading this, you’ll ask yourself:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Does this business really need Saga, or am I just adding complexity for myself?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Understanding this, you can implement Saga &lt;strong&gt;safely, flexibly, effectively&lt;/strong&gt;, instead of getting caught in &lt;strong&gt;an entirely avoidable operational nightmare&lt;/strong&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Be careful with retries — don't DDoS your own system</title>
      <dc:creator>lowkey dev</dc:creator>
      <pubDate>Sun, 22 Jun 2025 16:13:46 +0000</pubDate>
      <link>https://dev.to/lowkey_dev_591/be-careful-with-retries-dont-ddos-your-own-system-i6a</link>
      <guid>https://dev.to/lowkey_dev_591/be-careful-with-retries-dont-ddos-your-own-system-i6a</guid>
      <description>&lt;p&gt;&lt;strong&gt;Retry isn't bad. But used incorrectly, you could unknowingly become a "DDoS hacker"... of your own system.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Retry — the mechanism of repeating a request upon failure — is a crucial part of distributed system design. When one API call to another service fails due to network errors, timeouts, or temporary issues, retries are often configured to increase the chance of success.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;From a supporting mechanism, retry can easily turn into the culprit of a domino failure effect if left uncontrolled.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  1. When Retry Is a Double-Edged Sword
&lt;/h2&gt;

&lt;p&gt;Imagine a simple scenario:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Service A calls Service B.&lt;/li&gt;
&lt;li&gt;Service B is under heavy load and returns a 503 (Service Unavailable).&lt;/li&gt;
&lt;li&gt;Service A retries 3 times, with a 100ms delay between each attempt.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now suppose 1000 requests hit Service A at the same time:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each request makes 4 calls to Service B (1 original + 3 retries).&lt;/li&gt;
&lt;li&gt;Total: &lt;strong&gt;1000 × 4 = 4000 requests&lt;/strong&gt; to Service B.&lt;/li&gt;
&lt;li&gt;While Service B is already overloaded, these retries &lt;strong&gt;choke it completely&lt;/strong&gt;, leading to &lt;strong&gt;cascading failure&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Uncontrolled retries = shooting yourself in the foot.&lt;/p&gt;
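&lt;p&gt;The arithmetic above generalizes: total downstream load is the number of concurrent requests times (1 + retries per request). A trivial sketch (the class name is illustrative):&lt;/p&gt;

```java
// Retry amplification: each failing request costs 1 original call + N retries.
public class RetryMath {
    public static int totalCalls(int concurrentRequests, int retriesPerRequest) {
        return concurrentRequests * (1 + retriesPerRequest);
    }

    public static void main(String[] args) {
        // 1000 concurrent requests, 3 retries each -> 4000 downstream calls.
        System.out.println(totalCalls(1000, 3));
    }
}
```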

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxzuhtzervu0nu04v0gmg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxzuhtzervu0nu04v0gmg.png" alt="image.png" width="278" height="181"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Dangerous Retry Patterns
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Retry without delay&lt;/strong&gt;&lt;br&gt;
→ Causes request storms when errors occur.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Simultaneous retries from multiple instances&lt;/strong&gt;&lt;br&gt;
→ Multiple services retrying at once → sudden traffic spikes → downstream crashes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Infinite retries&lt;/strong&gt;&lt;br&gt;
→ Can cause memory leaks, jammed queues, and unstoppable request storms.&lt;/p&gt;




&lt;h2&gt;
  
  
  3.5 When to Retry and When Not To
&lt;/h2&gt;

&lt;p&gt;Not every error should be retried.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retry if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Temporary issues: timeouts, connection resets&lt;/li&gt;
&lt;li&gt;System errors: HTTP 5xx like 500, 502, 503, 504&lt;/li&gt;
&lt;li&gt;Downstream service is restarting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Do NOT retry if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Client errors: 400, 401, 403, 404&lt;/li&gt;
&lt;li&gt;Business logic errors: user not found, insufficient funds, validation failed&lt;/li&gt;
&lt;li&gt;422 – Unprocessable Entity&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;✅ &lt;strong&gt;Only retry if the error is recoverable.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
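&lt;p&gt;These rules boil down to a tiny classifier: 5xx responses are usually transient and worth retrying, while 4xx responses will fail identically on every attempt. A minimal sketch (class and method names are illustrative):&lt;/p&gt;

```java
// Sketch of a retry decision following the rules above:
// 5xx -> retry (server-side, often transient); 4xx -> don't retry
// (client/business errors repeat identically on every attempt).
public class RetryPolicy {
    public static boolean isRetryable(int httpStatus) {
        return httpStatus >= 500 && httpStatus <= 599;
    }
}
```

&lt;p&gt;In practice you would combine this with exception types (timeouts, connection resets) for errors that never produce a status code at all.&lt;/p&gt;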




&lt;h2&gt;
  
  
  3.6 How to Retry the Right Way
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Limit retry attempts&lt;/strong&gt;&lt;br&gt;
Never retry infinitely. Use a max of 2–3 tries depending on the context.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use delay and jitter&lt;/strong&gt;&lt;br&gt;
Add delays between retries (exponential or linear), with jitter to avoid synchronized spikes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Only retry idempotent actions&lt;/strong&gt;&lt;br&gt;
E.g., GET and PUT are safer than POST — avoid duplicate orders or repeated payments.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use a circuit breaker&lt;/strong&gt;&lt;br&gt;
Temporarily cut off retries when the downstream service keeps failing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Deferred Retry – Smart retries using jobs&lt;/strong&gt;&lt;br&gt;
Instead of retrying immediately, queue the task or store it in a DB, and process later via background jobs. Helps avoid additional load during a system failure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Log everything&lt;/strong&gt;&lt;br&gt;
Record the error reason, retry count, and retry time for easier debugging and alerting.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
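&lt;p&gt;The delay-and-jitter advice is often implemented as “full jitter”: pick a random delay between 0 and an exponentially growing, capped maximum, so many clients never wake up in lockstep. A minimal sketch (base and cap values are illustrative):&lt;/p&gt;

```java
import java.util.Random;

// Exponential backoff with "full jitter": the delay for a given attempt is
// a random value in [0, min(cap, base * 2^attempt)). Values are illustrative.
public class Backoff {
    public static long delayMillis(int attempt, long baseMillis, long capMillis, Random rnd) {
        // Cap the exponent to avoid overflow on large attempt counts.
        long exp = Math.min(capMillis, baseMillis * (1L << Math.min(attempt, 20)));
        // Full jitter: spread retries uniformly so clients don't synchronize.
        return (long) (rnd.nextDouble() * exp);
    }
}
```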




&lt;h2&gt;
  
  
  3.7 How Do You Know When It's Safe to Retry?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use circuit breakers&lt;/strong&gt;&lt;br&gt;
Stop retrying temporarily when a service fails repeatedly. Probe recovery through the half-open state before fully closing the circuit.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Monitor health checks and metrics&lt;/strong&gt;&lt;br&gt;
Check &lt;code&gt;/health&lt;/code&gt; endpoints or tools like Prometheus and Grafana to see if services have recovered.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Respect the &lt;code&gt;Retry-After&lt;/code&gt; header&lt;/strong&gt;&lt;br&gt;
Some APIs return this to indicate the recommended wait time before retrying.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Rate-limit retries&lt;/strong&gt;&lt;br&gt;
Avoid flooding the service again after it starts recovering.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
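&lt;p&gt;The circuit-breaker state machine referenced above (closed → open → half-open) fits in a few dozen lines. This is a toy sketch, not a substitute for a library like Resilience4j; the threshold, cool-down, and names are invented for illustration:&lt;/p&gt;

```java
// Minimal circuit-breaker sketch: CLOSED -> OPEN after N consecutive
// failures, OPEN -> HALF_OPEN after a cool-down, HALF_OPEN -> CLOSED on a
// successful probe. Thresholds and names are illustrative.
public class CircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private State state = State.CLOSED;
    private int failures = 0;
    private long openedAt = 0;
    private final int threshold;
    private final long coolDownMillis;

    public CircuitBreaker(int threshold, long coolDownMillis) {
        this.threshold = threshold;
        this.coolDownMillis = coolDownMillis;
    }

    public synchronized boolean allowRequest(long nowMillis) {
        if (state == State.OPEN && nowMillis - openedAt >= coolDownMillis) {
            state = State.HALF_OPEN;  // cool-down elapsed: let one probe through
        }
        return state != State.OPEN;
    }

    public synchronized void recordSuccess() {
        failures = 0;
        state = State.CLOSED;
    }

    public synchronized void recordFailure(long nowMillis) {
        failures++;
        if (failures >= threshold || state == State.HALF_OPEN) {
            state = State.OPEN;  // trip (or re-trip after a failed probe)
            openedAt = nowMillis;
        }
    }
}
```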




&lt;h2&gt;
  
  
  4. Tools for Effective Retry Implementation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Java / Spring Ecosystem:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Spring Retry&lt;/strong&gt;&lt;br&gt;
Supports &lt;code&gt;@Retryable&lt;/code&gt;, configurable delays, backoff, and fallback with &lt;code&gt;@Recover&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Resilience4j&lt;/strong&gt;&lt;br&gt;
Combines retry, circuit breaker, rate limiter, and bulkhead into one library. Works well with Spring Boot and Micrometer.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Kafka Retry Topic&lt;/strong&gt;&lt;br&gt;
Separate retry topics with delay, avoids blocking the main consumer. Combine with dead-letter topics for reliability.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Quartz / Spring Task&lt;/strong&gt;&lt;br&gt;
Schedule deferred retries using background jobs.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Other Languages / Platforms:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Python&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;tenacity&lt;/code&gt;: powerful retry decorator&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;celery&lt;/code&gt;: built-in retry policy for async tasks&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Node.js&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;retry&lt;/code&gt;, &lt;code&gt;bull&lt;/code&gt;, &lt;code&gt;agenda&lt;/code&gt;: retry support with timing and retry limits&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Go&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;go-retryablehttp&lt;/code&gt;, &lt;code&gt;backoff&lt;/code&gt;: lightweight and effective&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cloud-native:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;AWS&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SQS + Lambda + DLQ&lt;/li&gt;
&lt;li&gt;Step Functions with retry/catch blocks&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;GCP&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cloud Tasks, Pub/Sub retry + DLQ&lt;/li&gt;
&lt;li&gt;Workflows with built-in retry logic&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Azure&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Service Bus with configurable retry policy&lt;/li&gt;
&lt;li&gt;Azure Durable Functions with built-in retry&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;




&lt;h2&gt;
  
  
  5. Real Case: Saving the System During Peak Load with Strategic Retry
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Context:&lt;/strong&gt;&lt;br&gt;
At year-end, the system was under heavy traffic due to a promotional campaign. A payment processing service got overloaded, frequently timing out. Meanwhile, a batch job was firing thousands of requests per minute, with 5 retries per request, no delay, no jitter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt;&lt;br&gt;
Massive retry storm completely choked the payment service → triggered cascading failures in related systems → 15 minutes of downtime during peak hours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduced retries to 2&lt;/li&gt;
&lt;li&gt;Added exponential backoff and jitter&lt;/li&gt;
&lt;li&gt;Applied circuit breaker on the job&lt;/li&gt;
&lt;li&gt;Moved retries to a queue and processed via background jobs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Outcome:&lt;/strong&gt;&lt;br&gt;
System stabilized in under 10 minutes. Retries no longer overwhelmed the backend.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Retry isn’t about “hammering through” — it’s about &lt;strong&gt;helping the system recover gracefully&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  6. Conclusion
&lt;/h2&gt;

&lt;p&gt;Retry is a powerful tool when used correctly. But if applied without control, it can bring down your system faster than the original error.&lt;/p&gt;

&lt;p&gt;Keep in mind:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Retry only for temporary, recoverable errors&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Always limit retries, add delay + jitter, and use circuit breakers&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Effective retry isn’t about "how many times you call back", but "knowing when to stop and wait"&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Retry is medicine — used wisely, it heals. Used wrong, it poisons your system.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
    </item>
    <item>
      <title>Hundreds of orders vanished in just 3 minutes – all because of one forgotten config line</title>
      <dc:creator>lowkey dev</dc:creator>
      <pubDate>Wed, 18 Jun 2025 15:18:11 +0000</pubDate>
      <link>https://dev.to/lowkey_dev_591/500-orders-lost-in-just-3-minutes-all-because-of-one-forgotten-config-line-4o9d</link>
      <guid>https://dev.to/lowkey_dev_591/500-orders-lost-in-just-3-minutes-all-because-of-one-forgotten-config-line-4o9d</guid>
      <description>&lt;p&gt;&lt;strong&gt;Prologue: A Seemingly Normal Afternoon&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It was a Friday, 4:30 PM. My team was about to deploy an update for the &lt;code&gt;order-service&lt;/code&gt; – one of the most critical microservices in our order processing pipeline.&lt;/p&gt;

&lt;p&gt;Everything looked smooth. Tests passed. CI/CD was all green. I confidently hit the Deploy button to production.&lt;/p&gt;

&lt;p&gt;“Just a small rollout… what could go wrong?”&lt;/p&gt;

&lt;p&gt;Five minutes later, Slack lit up. Channels like &lt;code&gt;#alert&lt;/code&gt;, &lt;code&gt;#ops&lt;/code&gt;, and &lt;code&gt;#order-system&lt;/code&gt; turned red with pings.&lt;br&gt;
Grafana showed a strange spike: the failure rate of orders shot up.&lt;br&gt;
Log entries appeared, and they weren’t friendly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;java.net.SocketException: Connection reset
org.apache.kafka.common.errors.TimeoutException
Connection refused: no further information
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I froze. Within minutes, nearly 500 orders vanished without a trace. Each one was abruptly halted—as if someone pressed “pause” then hit “delete.”&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flysyv9gkjnp90x9qam4u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flysyv9gkjnp90x9qam4u.png" alt="Image description" width="800" height="1075"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Investigation: Something Wasn't Right&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We jumped into a quick incident meeting.&lt;br&gt;
No bugs in the code.&lt;br&gt;
No Kafka issues.&lt;br&gt;
No database outages.&lt;br&gt;
But one thing was consistent: all failed orders happened during the new deployment.&lt;/p&gt;

&lt;p&gt;Then someone from the team asked:&lt;/p&gt;

&lt;p&gt;“Did anyone set up graceful shutdown for this service?”&lt;/p&gt;

&lt;p&gt;I went silent. It all started to make sense.&lt;/p&gt;

&lt;p&gt;The old pod had just received requests when Kubernetes sent it a &lt;code&gt;SIGTERM&lt;/code&gt;.&lt;br&gt;
But we hadn’t configured Spring Boot for graceful shutdown.&lt;br&gt;
So the pod was killed—instantly and brutally. Kafka didn’t get a chance to send messages. Database transactions were left hanging. Half-processed data disappeared.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Aftermath: Production Fell Apart Because of One Missing Config&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Who would’ve thought a single missing line could cause so much damage?&lt;/p&gt;

&lt;p&gt;Nearly 500 lost orders, all of which had to be recovered manually, one by one.&lt;br&gt;
We spent 4 hours of overtime tracing Kafka logs to reconstruct the requests.&lt;br&gt;
An apology email went out to customers—along with compensation vouchers.&lt;/p&gt;

&lt;p&gt;At that point, all I could think was: “I wish I’d known this earlier.”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Realization: How a Service Dies Is Just as Important as How It Starts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That incident pushed me to dig into &lt;strong&gt;graceful shutdown&lt;/strong&gt;—a concept I had only glanced over before.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson #1: Enable shutdown with empathy&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;server&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;shutdown&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;graceful&lt;/span&gt;
&lt;span class="na"&gt;spring&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;lifecycle&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;timeout-per-shutdown-phase&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30s&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This makes Spring wait for in-flight requests to finish before shutting down.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson #2: Say goodbye to Kafka properly&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@PreDestroy&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;cleanUp&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;kafkaProducer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;flush&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="n"&gt;kafkaProducer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;close&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Duration&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ofSeconds&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
    &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;info&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Kafka producer closed."&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you don’t close your producer correctly, you’re basically throwing messages into the void.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson #3: Don’t forget your thread pools&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Bean&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;Executor&lt;/span&gt; &lt;span class="nf"&gt;taskExecutor&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;ThreadPoolTaskExecutor&lt;/span&gt; &lt;span class="n"&gt;executor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ThreadPoolTaskExecutor&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setWaitForTasksToCompleteOnShutdown&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setAwaitTerminationSeconds&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
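&lt;p&gt;The same idea applies outside Spring with a plain &lt;code&gt;ExecutorService&lt;/code&gt;: stop accepting new work, wait a bounded time for in-flight tasks, then force-stop whatever remains. A sketch under that assumption (the timeout value is illustrative):&lt;/p&gt;

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

// Plain-Java equivalent of the Spring config above: shutdown() stops new
// submissions but lets in-flight tasks finish; shutdownNow() interrupts
// anything still running after the deadline. Timeout is illustrative.
public class GracefulStop {
    public static boolean stop(ExecutorService pool, long timeoutSeconds) {
        pool.shutdown();  // reject new tasks, keep running the queued ones
        try {
            if (!pool.awaitTermination(timeoutSeconds, TimeUnit.SECONDS)) {
                pool.shutdownNow();  // deadline passed: interrupt stragglers
                return false;
            }
            return true;
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```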



&lt;p&gt;&lt;strong&gt;Lesson #4: Readiness probes are your safety net&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@EventListener&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;onAppShutdown&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ContextClosedEvent&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;isReady&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;set&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// readiness = false =&amp;gt; K8s stops sending new traffic&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If a pod is in the middle of dying and still receiving traffic, it’s like asking a patient on life support to keep working.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That incident was a painful but valuable lesson. It taught me that a system shouldn’t just be designed to run well—it must also be designed to &lt;strong&gt;shut down safely&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In a microservices world, where everything is interconnected in real-time, a single service dying unexpectedly can cause a &lt;strong&gt;domino effect&lt;/strong&gt;—disrupting data, user experience, and system reputation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Takeaways:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Graceful shutdown is not optional – it's essential.&lt;/strong&gt;&lt;br&gt;
Especially for services dealing with requests, Kafka, RabbitMQ, databases, or external APIs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Always configure&lt;/strong&gt; &lt;code&gt;server.shutdown: graceful&lt;/code&gt; &lt;strong&gt;and set an appropriate&lt;/strong&gt; &lt;code&gt;timeout-per-shutdown-phase&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ensure all critical resources are properly released:&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Kafka producers&lt;/li&gt;
&lt;li&gt;Thread pools&lt;/li&gt;
&lt;li&gt;DB connections&lt;/li&gt;
&lt;li&gt;External clients&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="4"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use readiness probes&lt;/strong&gt; to signal Kubernetes to stop sending new traffic during shutdown.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Test shutdown scenarios in staging – not just startup ones.&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;And finally: avoid Friday deployments if you can.&lt;/strong&gt;&lt;br&gt;
Systems may fail—but people deserve their weekends.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Writing clean code is one thing.&lt;br&gt;
&lt;strong&gt;Running a system responsibly and safely is another—and it’s often the part that’s overlooked.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I hope this story saves you from facing a black Friday like I did.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>AI won’t take your job – but your outdated thinking might!</title>
      <dc:creator>lowkey dev</dc:creator>
      <pubDate>Sun, 08 Jun 2025 17:11:02 +0000</pubDate>
      <link>https://dev.to/lowkey_dev_591/ai-wont-take-your-job-but-your-complacent-laziness-might-39d6</link>
      <guid>https://dev.to/lowkey_dev_591/ai-wont-take-your-job-but-your-complacent-laziness-might-39d6</guid>
      <description>&lt;p&gt;These days, coders are busy debating whether AI is a “work savior” or a “nightmare that causes unemployment.”&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“AI writes code so fast, I’m about to lose my job!”&lt;br&gt;
“Developers nowadays have all become vibe coders.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Calm down, skilled coders!&lt;/p&gt;

&lt;p&gt;AI &lt;strong&gt;does not take your job away&lt;/strong&gt; — it only does well at repetitive, mechanical tasks. Honestly, AI won’t make you unemployed; rather, not knowing how to leverage AI to innovate and grow your work is what will leave you behind.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6wbb4pjyw7dhi0uw24pd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6wbb4pjyw7dhi0uw24pd.png" alt="Image description" width="800" height="438"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  1️⃣ Software engineers are not typists, but software developers
&lt;/h2&gt;

&lt;p&gt;Many still think: “I’m a dev = I know how to code = I’m safe.”&lt;/p&gt;

&lt;p&gt;Then AI shows up and hits hard with the harsh truth: you’re just a typist, while AI types faster, with fewer bugs, and never takes lunch breaks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Software engineers are not typists — they are software developers, meaning they design and build solutions, not copy-paste on command.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A coder asks: &lt;strong&gt;“Give me the specs, I’ll just type.”&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;A software engineer asks: &lt;strong&gt;“Who is this feature for? Should it be prioritized over others? Is the data flowing correctly? Can the API scale?”&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you only know how to type code according to specs without understanding why you’re typing it, AI will replace that “typing” part for you.&lt;/p&gt;




&lt;h2&gt;
  
  
  2️⃣ AI generates code fast, but can’t replace you if you… know what you’re doing
&lt;/h2&gt;

&lt;p&gt;I’ve tried many AI tools: ChatGPT, Copilot, Claude, Gemini… and here’s the truth I found:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;AI works like a skilled chef, but it doesn’t know what the customer really wants. If you don’t understand the menu, no matter how fast the kitchen is, the dish will be a messy, tasteless plate.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It helps me write API controllers in 30 seconds but can’t help me decide:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How to expose APIs in a proper RESTful way?&lt;/li&gt;
&lt;li&gt;Should authentication be bypassed?&lt;/li&gt;
&lt;li&gt;What data does the mobile app actually need?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AI is a tool, but you still have to lead and decide. Otherwise, no matter how fast it is, the result will just be “a messy table.”&lt;/p&gt;




&lt;h2&gt;
  
  
  3️⃣ Learn deeply so you don’t get “trapped” by AI, learn broadly so life doesn’t “hit” you
&lt;/h2&gt;

&lt;p&gt;Common mindset:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“I know enough Java, the rest can be handled by AI.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Then when CI/CD breaks, JSON is malformed, or the app lags, you only know how to shout:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“AI, save me!” 😭&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Don’t think AI means you can stop learning — if you do, you’re pushing yourself down the path of unemployment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Learn broadly so you’re not “technologically blind” when collaborating with teammates
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;With Product Owners — to know what they really want (not 47 unnecessary APIs).&lt;/li&gt;
&lt;li&gt;With UX — so you don’t build an app that looks good on desktop but breaks on phones.&lt;/li&gt;
&lt;li&gt;With DevOps — so you don’t panic when deploying.&lt;/li&gt;
&lt;li&gt;With Data teams — so you understand data pipelines aren’t a joke.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Learn deeply so you don’t get fooled by AI “nonsense”
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;AI generates impressive code, but bugs still appear with clockwork regularity.&lt;/li&gt;
&lt;li&gt;Without deep understanding, you’ll be misled by a bot that has never shipped a real product.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  4️⃣ AI is a learning weapon, not a reason to stop learning
&lt;/h2&gt;

&lt;p&gt;I use AI every day to learn faster:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;YAML is hard to remember? AI suggests.&lt;/li&gt;
&lt;li&gt;Dockerfile is confusing? AI fixes it.&lt;/li&gt;
&lt;li&gt;Bash script sounds like Klingon? AI translates it to human.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;I don’t learn less, I learn faster.&lt;/strong&gt; Then I use AI as a weapon.&lt;/p&gt;

&lt;p&gt;From a backend dev who used to spam &lt;code&gt;System.out.println&lt;/code&gt;, now I:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Understand how CI/CD works.&lt;/li&gt;
&lt;li&gt;Know what ETL means in data pipelines.&lt;/li&gt;
&lt;li&gt;Read and understand UX/UI well enough to avoid shipping clunky, awkward apps.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not because I’m smarter, but because I learned how to leverage AI.&lt;/p&gt;




&lt;h2&gt;
  
  
  5️⃣ Losing your job is not because of AI — it’s because of stubbornness
&lt;/h2&gt;

&lt;p&gt;Frankly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI is not the enemy, but stubbornness and refusal to change make you miss opportunities.&lt;/li&gt;
&lt;li&gt;Those who refuse to learn how to use AI will be rated low by their bosses because they no longer meet job requirements in this new era.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On the other hand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;People who know how to use AI as a powerful tool work more efficiently, faster, and extend their influence.&lt;/li&gt;
&lt;li&gt;Those who cling to old mindsets and refuse to learn AI will quickly be left behind and easily lose their jobs.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  6️⃣ The AI era is the era of “adaptable devs,” not “complaining devs”
&lt;/h2&gt;

&lt;p&gt;Ask yourself:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Am I using AI to &lt;em&gt;speed up&lt;/em&gt; or just &lt;em&gt;sit and fear being replaced&lt;/em&gt;?&lt;/li&gt;
&lt;li&gt;Am I learning something new beyond my old, rusty stack?&lt;/li&gt;
&lt;li&gt;Do I understand who the product I build serves?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If yes — &lt;strong&gt;AI is your ally&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If no — &lt;strong&gt;AI is a mirror showing you’re... outdated.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  ✅ Conclusion: AI doesn’t take your job — but those who know how to use AI will take your job
&lt;/h2&gt;

&lt;p&gt;We are not just typists — we are solution designers, system integrators, and people who understand users.&lt;/p&gt;

&lt;p&gt;In the AI era:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Know deeply&lt;/strong&gt; to tell if AI code is good or nonsense.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Know broadly&lt;/strong&gt; to connect teams, understand products, and support teammates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Know how to use AI&lt;/strong&gt; as an assistant — not an “online teacher” you rely on every second.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;AI isn’t scary — stubbornness is what makes you lag behind.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So:&lt;/p&gt;

&lt;p&gt;👉 Don’t fear AI — turn AI into a powerful ally for creativity and growth.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>I don’t like microservices, and here’s why</title>
      <dc:creator>lowkey dev</dc:creator>
      <pubDate>Sun, 01 Jun 2025 14:16:20 +0000</pubDate>
      <link>https://dev.to/lowkey_dev_591/i-dont-like-microservices-and-heres-why-2mja</link>
      <guid>https://dev.to/lowkey_dev_591/i-dont-like-microservices-and-heres-why-2mja</guid>
      <description>&lt;p&gt;Hello everyone! I’m Hung Pham, a backend developer who used to think that microservices were the standard for every system—until I actually deployed and maintained one myself. After many nights struggling with dozens of logs from 4–5 different services, I realized one thing: microservices aren’t the right solution for every system. Why? Let me share my story in detail, hoping it will give you a more realistic perspective on microservices.&lt;/p&gt;




&lt;h3&gt;
  
  
  1. In the beginning, microservices seemed like the “holy grail”
&lt;/h3&gt;

&lt;p&gt;When I first discovered microservices, it felt incredibly &lt;em&gt;exciting&lt;/em&gt;. The hype, the case studies from big players like Netflix, Amazon, Uber, and Google made me almost obsessed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“Break your app into small parts that can be developed and deployed independently, as you wish!”&lt;/li&gt;
&lt;li&gt;“Scale each service separately—no need to scale the whole bloated app!”&lt;/li&gt;
&lt;li&gt;“If you’re not doing microservices, you’re falling behind—it’s the future of software development!”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I dove straight into Docker, Kubernetes, service mesh, CI/CD automation, API Gateway… The list of things to learn was endless—longer than the actual project deadlines! I thought I was opening a brand-new chapter for my backend career.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F01632eejl322bqt2xixn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F01632eejl322bqt2xixn.png" alt="Image description" width="800" height="562"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  2. But reality was not as dreamy
&lt;/h3&gt;

&lt;p&gt;When I finally “microfied” some of my team’s projects—our small team had only 4–5 devs—I realized that microservices are &lt;em&gt;not&lt;/em&gt; just about splitting up code. They’re a complex web that nearly drove me crazy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Network latency and timeouts:&lt;/strong&gt; If a single service is slow, the whole system can fall like a line of dominoes. One seemingly simple request might pass through a dozen services—timeouts and partial failures became routine.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complicated deployment management:&lt;/strong&gt; Each service had its own CI/CD pipeline, its own configs, its own versioning. Deployments weren’t just a click anymore—they turned into a campaign.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data consistency headaches:&lt;/strong&gt; No more simple transactions. Now we had to think about eventual consistency, complex patterns like Saga, Orchestrator (Camunda, Temporal…)—just hearing those words made me want to give up.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Debug logs were a nightmare:&lt;/strong&gt; When production issues hit, I had to dig through logs from multiple services, tracing requests across systems. I felt like Sherlock Holmes stumbling in the dark!&lt;/li&gt;
&lt;/ul&gt;
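&lt;p&gt;To make the data-consistency pain concrete: the Saga pattern mentioned above replaces one ACID transaction with a chain of local steps, each paired with a compensating action. A minimal sketch (plain Python with hypothetical booking steps, not any real framework API):&lt;/p&gt;

```python
def run_saga(steps):
    """steps: list of (action, compensate) callables.
    Runs each action in order; if one raises, runs the compensations
    of the already-completed steps in reverse order and returns False."""
    completed = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            # Roll back what already succeeded, newest first.
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensate)
    return True

# Toy booking flow: the payment step fails, so the seat reservation
# is compensated (cancelled) instead of leaving half-finished state.
log = []

def reserve_seat():
    log.append("seat reserved")

def cancel_seat():
    log.append("seat cancelled")

def charge_card():
    raise RuntimeError("payment service timed out")

ok = run_saga([(reserve_seat, cancel_seat), (charge_card, lambda: None)])
print(ok, log)  # False ['seat reserved', 'seat cancelled']
```

&lt;p&gt;Even this toy version hints at the real cost: every step now needs a hand-written “undo,” and the orchestration logic lives in your own code rather than in a database transaction.&lt;/p&gt;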




&lt;h3&gt;
  
  
  3. The things microservices “stole” from our small team
&lt;/h3&gt;

&lt;p&gt;For our small team, I realized microservices were stealing many valuable things from us:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Focus:&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;Monolith: One repo, one codebase, everyone working together—easy to communicate, easy to grasp the big picture.&lt;/li&gt;
&lt;li&gt;Microservices: Everyone camping in their own service, talking only via APIs, creating silos in the team.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Initial development speed:&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;Monolith: Deploy once, roll back once, small changes went live quickly.&lt;/li&gt;
&lt;li&gt;Microservices: Deployments scattered across services, config tweaking everywhere, rollbacks were trickier and took much longer.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The joy of releasing features:&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;Monolith: Release a feature immediately, get instant feedback from users.&lt;/li&gt;
&lt;li&gt;Microservices: Release in pieces, carefully coordinate to avoid breaking APIs—stressful and slow.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;




&lt;h3&gt;
  
  
  4. But microservices aren’t the villain
&lt;/h3&gt;

&lt;p&gt;I’m not denying that microservices have some &lt;em&gt;real&lt;/em&gt; strengths:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Independent scaling:&lt;/strong&gt; Hot services can be scaled separately, saving resources.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Empowered teams:&lt;/strong&gt; Teams can work independently on their services, reducing dependencies and speeding up long-term development.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Easy to evolve and replace:&lt;/strong&gt; Updating a single part doesn’t require messing with a huge monolith.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  5. So when should you actually use microservices?
&lt;/h3&gt;

&lt;p&gt;I think microservices only truly shine when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your project has a &lt;strong&gt;large backend team&lt;/strong&gt; (10+ devs), so you can split by domain.&lt;/li&gt;
&lt;li&gt;Your infrastructure is &lt;strong&gt;mature enough&lt;/strong&gt; (CI/CD automation, great observability—logging, tracing, metrics…), so deploying doesn’t feel like rocket science.&lt;/li&gt;
&lt;li&gt;Your application has clear, separate domains—like payments, user management, logistics, each operating almost independently.&lt;/li&gt;
&lt;li&gt;Your traffic is huge, and you really need to &lt;strong&gt;scale specific components&lt;/strong&gt; to save costs and boost performance.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  6. And when you probably shouldn’t “play fancy”
&lt;/h3&gt;

&lt;p&gt;If you’re in one of these situations, think twice before jumping into microservices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Small team (3–5 devs), drowning in backlog with tons of features to build.&lt;/li&gt;
&lt;li&gt;Simple application with just a few main modules, no real need for complex scaling.&lt;/li&gt;
&lt;li&gt;No experience with CI/CD or DevOps—microservices will force you to learn DevOps first.&lt;/li&gt;
&lt;li&gt;Tight deadlines (like 1 month to launch) rather than a year for sustainable growth.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  7. Conclusion: I don’t hate microservices—I just don’t like meaningless “hype-following”
&lt;/h3&gt;

&lt;p&gt;Microservices aren’t evil. They’re not “automatically great,” either. I just don’t like when small teams chase trends blindly and burden themselves with unnecessary complexity.&lt;/p&gt;

&lt;p&gt;For me, the most important thing is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Understand your actual problem.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Choose the architecture that fits your team size, your app’s nature, and the real complexity you need.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In short:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Small team, few features, tight deadlines: &lt;strong&gt;Monolith is king.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Big team, complex domains, heavy traffic: &lt;strong&gt;Microservices is the savior.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  8. And what about you?
&lt;/h3&gt;

&lt;p&gt;Have you had a wildly successful microservices experience? Or a complete disaster? I’d love to hear your stories—so we can learn from each other and avoid repeating the same mistakes I did.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
