By: João Vitor Nascimento De Mendonça
Originally published in Engineering Weekly / Tech Blog
- The Scenario: The Chaos of Unmanaged Scale

In modern architectures, using Apache Kafka or RabbitMQ solves decoupling issues but creates a new challenge: throughput disparity.
I recently observed a scenario where a producer was injecting 50k msgs/s while the consumer, limited by a third-party API, could only process 10k msgs/s. The result? Dropped metrics, heap memory exhaustion, and cascading latency across the entire system.
- Backpressure and Concurrency Control

To solve this, simply "scaling the pod" isn't enough. I implemented semaphore-based concurrency control. In Go, for instance, a buffered channel serves as a semaphore that limits the number of active workers:
```go
// Example of a concurrency limiter for DB protection
var semaphore = make(chan struct{}, 50) // limit to 50 active workers

func processEvent(event Event) {
	semaphore <- struct{}{}        // acquire a slot (blocks when all 50 are taken)
	defer func() { <-semaphore }() // release the slot when done

	// Processing logic and DB persistence
	db.Save(event)
}
```
Additionally, we integrated a Circuit Breaker (using Resilience4j/Hystrix). If database response times climb above a 500ms threshold, the circuit opens and queue consumption halts immediately. This prevents the application from crashing while attempting to process requests it cannot currently deliver.
- Infrastructure Tuning: Optimizing the Garbage Collector (GC)

Latency wasn't caused solely by I/O; millisecond "Stop-the-World" pauses from the Garbage Collector were stalling processing.
We migrated from traditional x86 instances to AWS Graviton (ARM64) and fine-tuned the ZGC (on Java 21+). Our goal was to maintain pauses below 1ms, even with large heaps.
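An illustrative starting point for generational ZGC on Java 21+ might look like the following. The heap sizes here are placeholders, not the values used in the article, and should be tuned to the workload:

```shell
# Enable generational ZGC (Java 21+) with a fixed heap to avoid resize pauses.
# SoftMaxHeapSize gives ZGC headroom to keep pause times low under pressure.
java -XX:+UseZGC -XX:+ZGenerational \
     -Xms16g -Xmx16g \
     -XX:SoftMaxHeapSize=14g \
     -jar consumer.jar
```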
The Result: An 85% reduction in GC pauses, stabilizing throughput during high-traffic peaks.
- Resilience with Dead Letter Queues (DLQ)

Errors are inevitable. Our strategy involved implementing Exponential Backoff. If a message fails, it doesn't block the main queue; instead, it is routed to a Retry Topic with increasing delays (1s, 10s, 1min). Once retries are exhausted, the message lands in a DLQ (Dead Letter Queue) for manual inspection.
Field Note: Never allow infinite retries without backoff. Doing so is essentially a self-inflicted Denial of Service (DoS) attack against your own database.