Pavel Sanikovich

Redpanda in Production: 3 Traps I Fell Into (and How to Avoid Them)

"Redpanda looked like the holy grail — Kafka-compatible, lightning fast, no ZooKeeper, no JVM.
Then I ran it in production."


⚙️ The Setup

Our team runs a high-load B2B marketplace built around event-driven Go microservices.
For years, we lived with Kafka’s quirks — and its ZooKeeper demons.
Redpanda promised a clean escape: a drop-in replacement, blazing fast, ops-friendly, and written in C++.

So, we made the jump.

And while Redpanda delivered on performance, it also delivered a few… surprises.
Here are three traps I fell into when running Redpanda in production — and how you can avoid them.


🧩 Trap #1: Default Memory Settings Are Not Your Friend

"Redpanda tries to be smart about memory. Until it isn’t."

Out of the box, Redpanda auto-tunes memory usage based on your system.
That sounds nice, until your production box starts swapping and your broker gets killed by the OOM killer.

In our early tests, Redpanda consumed up to 80% of available RAM, pushing other processes (like monitoring agents and log collectors) to starvation.

Fix: Pin your memory limits explicitly.

# cap Redpanda's memory explicitly at startup instead of letting it auto-size, and lock it
rpk redpanda start --memory 4G --lock-memory

Lesson learned:
Auto-tuning is fine for laptops. Not for clusters.
Always define available memory explicitly and lock it — otherwise Redpanda will take everything it can.


🧬 Trap #2: Schema Evolution Is Still DIY

"Kafka-compatible doesn’t mean Kafka-ecosystem-compatible."

We used to rely on Confluent Schema Registry for Avro schemas and smooth evolution of message formats.
Turns out, our Redpanda deployment didn't give us the registry workflow we were used to, so we had to bring our own. (Newer Redpanda releases bundle a Confluent-compatible schema registry API, but schema evolution is still on you.)

You have two options:

  1. Self-host Karapace, the open-source Schema Registry alternative.
  2. Use Redpanda Console, which is great for UI inspection but limited for schema enforcement.

We ended up hosting Karapace next to Redpanda and managing schema versions manually.

Here’s an example Go event we had to version ourselves:

type OrderEvent struct {
  ID      string `json:"id"`
  Status  string `json:"status"`
  Version int    `json:"version"`
}

No magic migrations. No hidden helpers.
Just version your payloads and keep producers + consumers in sync.
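
For illustration, here's a minimal sketch of what that version handling can look like on the consumer side (handleV1 and handleV2 are hypothetical handlers standing in for your own logic):

import (
    "encoding/json"
    "fmt"
)

// handleMessage decodes just enough to learn the payload version,
// then routes the event to the matching handler.
func handleMessage(raw []byte) error {
    var evt OrderEvent
    if err := json.Unmarshal(raw, &evt); err != nil {
        return err
    }
    switch evt.Version {
    case 1:
        return handleV1(evt)
    case 2:
        return handleV2(evt)
    default:
        // Unknown version: fail loudly instead of guessing.
        return fmt.Errorf("unsupported OrderEvent version: %d", evt.Version)
    }
}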

Lesson learned:
Redpanda nails the broker layer, but schema management remains your job.


🪞 Trap #3: Monitoring ≠ Kafka Monitoring

"Our dashboards were green. Our cluster wasn’t."

We reused our existing Prometheus + Grafana dashboards from Kafka, expecting everything to “just work.”
Spoiler: metric names differ.
And some metrics, like under_replicated_partitions, don't exist under the same name at all.

The result? We thought our system was fine until one broker hit 95% disk usage and silently throttled producers.

Fix:

  • Explore what the broker actually exports by scraping the admin API's /metrics and /public_metrics endpoints (see the sketch below).
  • Import the official Redpanda Grafana dashboards and generate a matching scrape config (rpk generate grafana-dashboard, rpk generate prometheus-config).
  • Set alerts on disk usage, controller health, and latency spikes.

Redpanda has excellent observability — if you wire it up correctly.
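
If you want to see what a broker actually exports before building dashboards, a rough sketch like this does the job (it assumes the admin API is reachable on its default port 9644 and simply filters for disk-related series):

package main

import (
    "bufio"
    "fmt"
    "net/http"
    "strings"
)

func main() {
    // Redpanda serves Prometheus-format metrics from the admin API port.
    resp, err := http.Get("http://localhost:9644/public_metrics")
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    // Print only the disk-related series so you know what to alert on.
    sc := bufio.NewScanner(resp.Body)
    for sc.Scan() {
        if line := sc.Text(); strings.Contains(line, "disk") {
            fmt.Println(line)
        }
    }
}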


⚖️ Bonus Trap: Compatibility ≠ Behavior Parity

Kafka clients work, yes — but subtle differences appear under load.

For example, acks=all in Kafka ensures durability across replicas.
In Redpanda, you might hit timeouts under bursty load unless you adjust raft_heartbeat_interval_ms.
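
On the client side, make sure you're actually asking for full acknowledgement rather than trusting defaults. A sketch of the producer config with segmentio/kafka-go (the same client as below; the timeout value is illustrative):

writer := &kafka.Writer{
    Addr:         kafka.TCP("redpanda:9092"),
    Topic:        "orders",
    RequiredAcks: kafka.RequireAll, // wait for all replicas, i.e. acks=all
    WriteTimeout: 10 * time.Second, // leave headroom for bursty batches
}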

We also noticed consumer group rebalancing behaving slightly differently.
Our fix was simple but crucial: stop relying on client defaults and pin the partitioning and assignment strategies explicitly in our Go clients (segmentio/kafka-go). On the producer side, that means setting the Balancer:

writer := &kafka.Writer{
    Addr:     kafka.TCP("redpanda:9092"),
    Topic:    "orders",
    Balancer: &kafka.Hash{},
}
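
And on the consumer side, the matching knob in kafka-go is GroupBalancers on the reader. A minimal sketch (the group ID is a placeholder):

reader := kafka.NewReader(kafka.ReaderConfig{
    Brokers: []string{"redpanda:9092"},
    GroupID: "orders-consumers",
    Topic:   "orders",
    GroupBalancers: []kafka.GroupBalancer{
        kafka.RangeGroupBalancer{}, // pin the assignment strategy instead of trusting defaults
    },
})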

Lesson learned:
Compatibility ≠ identical semantics.
Test consumer lag and rebalance timing under realistic traffic before trusting production.


🏁 Conclusion

Redpanda remains my top choice for Kafka-compatible workloads.
It’s insanely fast, ops-friendly, and spares you from JVM nightmares.
But like any sharp tool — it can cut you if you treat it like Kafka 1-to-1.

Remember:

  • Tune memory.
  • Bring your own schema registry.
  • Redo your monitoring.
  • Validate client behavior.

Do that — and you’ll sleep well, even under a million messages per second.


Have you deployed Redpanda in production?
What traps did you fall into?
Let’s compare scars in the comments 👇
