Dmitriy Shmatov

Posted on Jun 1 • Originally published at blog.shmatov.dev

Kafka without ZooKeeper: My Strimzi HA Playbook on K8s

#kafka #strimzi #kubernetes #kraft

I've been running Strimzi Kafka in production at scale for the past few years - multi-cloud, multi-zone, mixed broker sizes, the usual. The first year I spent more time firefighting Kafka than building anything on top of it. The next two I spent slowly stripping operational pain out of the setup until it stopped paging me.

This is the configuration I landed on, why each part is shaped the way it is, and the production failure modes that drove each decision. No theory, no marketing, no Hello-World defaults - only the cluster I actually run.

The problem

A production Kafka cluster on Kubernetes has to survive at least four things at once:

A single broker dying mid-write.
An entire AZ disappearing for an hour.
A node-pool resize that doubles the broker count without manual rebalancing.
A "we need 200 more topics by Friday" request from the data team.

The default Strimzi Kafka resource doesn't handle any of these gracefully. The defaults assume a single zone, one node pool, no rack awareness, no rebalance automation, and a topic count in the tens managed as Kubernetes resources. The moment you cross into real production, every one of those assumptions breaks.

What didn't work

Things I tried first and ripped out:

One big broker node pool spanning all zones. Strimzi rolls pods within a pool in sequence; with one pool you can land two replicas of the same partition in the same zone after a roll, and you only notice when a zone fails.
Soft anti-affinity (preferredDuringSchedulingIgnoredDuringExecution). It works until the cluster is busy, then the scheduler decides "good enough" means "two brokers on one node". Under heavy load both brokers OOM together. Hard affinity or nothing.
Manual rebalancing after every scale-out. Drafting a KafkaRebalance, approving it, watching the dashboards - fine for one cluster, untenable across a fleet.
Reattaching old PVCs to recover a dead broker. Looks clever, fails reliably. The recovery path that tests cleanly is "spin up a fresh broker, let Cruise Control replicate".
Default Kafka Exporter probes. kubelet kills the exporter before it finishes warming up on a busy cluster, dashboards go dark, on-call panics.

Every section below exists because of one of those failures.

The shape of the cluster

A single Strimzi Kafka CR, KRaft enabled, backed by multiple KafkaNodePool resources - one pool per AZ, split by role:

Controllers (KRaft quorum): small, an odd number, one pool per zone.
Brokers: workhorses, one pool per zone, SSD-backed PVCs, larger CPU and memory budgets.

# charts/kafka/templates/kafkaNodepool.yaml (excerpt)
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
  name: ab12c-us-east-2a-b   # <short-hash>-<zone>-b (broker)
  labels:
    strimzi.io/cluster: my-cluster
spec:
  roles: [broker]
  replicas: 6
  storage:
    type: persistent-claim
    class: gp3-kafka
    size: 600Gi
    deleteClaim: true

With replicas: 6 per zone across 3 AZs, that's 18 brokers total - each backed by a 600Gi gp3 PVC with 3000 IOPS and 125 MB/s throughput provisioned. Brokers get 8–12Gi memory requests and 3–6 CPU depending on the workload profile.
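For reference, a StorageClass that yields those volume numbers could look like the sketch below. It assumes the AWS EBS CSI driver; the binding mode and the expansion flag are illustrative assumptions, not settings taken from the chart.

```yaml
# Hypothetical StorageClass behind the gp3-kafka class name used above.
# Assumes the AWS EBS CSI driver is installed in the cluster.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-kafka
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "3000"          # gp3 baseline IOPS
  throughput: "125"     # MiB/s, gp3 baseline throughput
volumeBindingMode: WaitForFirstConsumer   # assumption: create the volume in the zone where the pod lands
allowVolumeExpansion: true                # assumption: lets the PVC be grown later
```

WaitForFirstConsumer matters in a multi-zone setup: EBS volumes are zonal, so the volume has to be created after the scheduler has picked a node in the pool's zone.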

The short hash prefixing every pool name is there for a single reason: pool names hit the 63-char DNS label limit faster than you'd think once you concatenate cluster name + zone + role. A 5-char SHA prefix keeps every name stable, deterministic, and DNS-safe.
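Controller pools follow the same pattern as the broker pool. A minimal sketch for one zone, where the -c suffix, the replica count, and the volume size are assumptions chosen for illustration:

```yaml
# Sketch of the controller pool for the same zone.
# replicas and size are illustrative assumptions, not the production values.
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
  name: ab12c-us-east-2a-c   # <short-hash>-<zone>-c (controller); suffix is hypothetical
  labels:
    strimzi.io/cluster: my-cluster
spec:
  roles: [controller]
  replicas: 1                # 1 per zone x 3 zones = a 3-node KRaft quorum
  storage:
    type: persistent-claim
    class: gp3-kafka
    size: 20Gi               # metadata log only, so far smaller than a broker volume
    deleteClaim: true
```

With one controller per zone, losing a zone still leaves two of three voters, which is enough for the quorum to keep electing a leader.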

KRaft, finally

metadata:
  annotations:
    strimzi.io/node-pools: "enabled"
    strimzi.io/kraft: "enabled"
spec:
  kafka:
    version: "4.1.0"
    metadataVersion: "4.1-IV1"

No ZooKeeper, no separate Helm release, no fourth stateful system in the namespace. The controllers form the metadata quorum directly. Kafka 4.x dropped ZooKeeper from the codebase entirely - it's not just deprecated, it's gone. Fewer pods, fewer failure modes, a noticeably faster control plane on partition creation. If you're starting fresh, there's nothing to migrate.

Rack awareness in two places that must agree

Multi-zone Kafka is two settings that have to match. Either both are right or neither is.

On the Kafka spec - what Kafka itself sees:

spec:
  kafka:
    rack:
      topologyKey: "topology.kubernetes.io/zone"

Each broker now advertises its zone as its rack ID. Kafka's replica placement starts considering zones when picking followers for a partition. Pair this with replica.selector.class: org.apache.kafka.common.replica.RackAwareReplicaSelector in the broker config - it makes consumers prefer reading from followers in the same zone, cutting cross-AZ data transfer costs significantly.
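Put together, the Kafka-side half looks roughly like this. The consumer-side client.rack setting in the trailing comment is not part of the Kafka CR; it is shown because follower fetching only kicks in when the consumer declares its own rack.

```yaml
spec:
  kafka:
    rack:
      topologyKey: "topology.kubernetes.io/zone"
    config:
      # Lets the partition leader redirect a consumer to an in-sync replica in the consumer's rack.
      replica.selector.class: org.apache.kafka.common.replica.RackAwareReplicaSelector
# Consumer side (client properties, set by the application, not by Strimzi):
#   client.rack=us-east-2a   # must equal the zone label value the brokers advertise as their rack
```

A consumer that leaves client.rack unset keeps fetching from the leader, so the cross-AZ savings depend on every client team setting it.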

On the node pool - where the pod actually lands:

nodeAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    nodeSelectorTerms:
      - matchExpressions:
          - key: "topology.kubernetes.io/zone"
            operator: In
            values: ["us-east-2a"]
          - key: "purpose"
            operator: In
            values: ["kafka"]
podAntiAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
          - key: "strimzi.io/broker-role"
            operator: In
            values: ["true"]
      topologyKey: "kubernetes.io/hostname"

The purpose=kafka label on the node, plus a role=kafka:NoSchedule taint, gives me dedicated nodes for Kafka. No noisy neighbour is going to evict a broker because a CronJob spiked a sidecar's CPU.
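A taint only helps if the broker pods tolerate it. A sketch of where the toleration sits in the KafkaNodePool, next to the affinity rules, assuming the taint is exactly role=kafka:NoSchedule as described:

```yaml
# KafkaNodePool excerpt: pod-level scheduling settings live under spec.template.pod.
spec:
  template:
    pod:
      # affinity:
      #   nodeAffinity and podAntiAffinity from the excerpt above go here
      tolerations:
        - key: "role"
          operator: "Equal"
          value: "kafka"
          effect: "NoSchedule"
```

The label pulls brokers onto the dedicated nodes; the taint keeps everything else off them. One without the other leaves a gap.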

Tip - dev vs prod from the same chart. Production runs hard affinity; dev needs to fit on a shared 3-node pool. I gate scheduling strictness behind an enforceSchedulingConstraints flag that flips every required… into preferred…. Same chart, different cluster shape, no copy-paste drift across environments.
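One way such a toggle could be templated is sketched below. Only the enforceSchedulingConstraints flag name comes from the chart described here; the template layout is an assumption. Note that the preferred form is not a plain rename: each term gains a weight and is wrapped in podAffinityTerm.

```yaml
# Helm template sketch (hypothetical layout) for the anti-affinity half of the toggle.
podAntiAffinity:
{{- if .Values.enforceSchedulingConstraints }}
  requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
          - key: "strimzi.io/broker-role"
            operator: In
            values: ["true"]
      topologyKey: "kubernetes.io/hostname"
{{- else }}
  preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100                       # highest preference; still only a preference
      podAffinityTerm:
        labelSelector:
          matchExpressions:
            - key: "strimzi.io/broker-role"
              operator: In
              values: ["true"]
        topologyKey: "kubernetes.io/hostname"
{{- end }}
```

The nodeAffinity block needs the same treatment, where the preferred variant wraps each term in a weight plus preference pair.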

JVM tuning that's worth typing out

Most "deploy Kafka on K8s" tutorials stop at replicas: 3 and call it production. The JVM flags are where p99 latency actually lives.

jvmOptions:
  -XX:
    UnlockExperimentalVMOptions: "true"
    UseG1GC: "true"
    MaxGCPauseMillis: "20"
    G1HeapRegionSize: "16M"
    G1NewSizePercent: "25"
    G1MaxNewSizePercent: "30"
    InitiatingHeapOccupancyPercent: "35"
    ParallelGCThreads: "10"
    ConcGCThreads: "3"
    AlwaysPreTouch: "true"
    UseStringDeduplication: "true"
    ExitOnOutOfMemoryError: "true"
    PerfDisableSharedMem: "true"
    ExplicitGCInvokesConcurrent: "true"
    MetaspaceSize: "96m"
    MinMetaspaceFreeRatio: "50"
    MaxMetaspaceFreeRatio: "80"

Two flags that are routinely underestimated:

AlwaysPreTouch: true - the JVM touches every heap page on startup, so the broker takes longer to come up but never pays the first-touch page-fault cost under load. On producers with strict latency SLOs, this is the difference between a clean p99 and an ugly one.
ExitOnOutOfMemoryError: true - if a broker OOMs, kill it loudly. Don't sit in a half-alive state confusing Strimzi about whether the pod is healthy. Let the operator restart it cleanly.

Trade-off. MaxGCPauseMillis: 20 is aggressive. With 8–12Gi container memory, Strimzi gives brokers roughly 5–8Gi effective heap - at that range, 20ms works well. On smaller heaps with bursty traffic, the G1 collector starts thrashing - frequent young collections, rising CPU, no real pause-time win. If you see that in gc.log, raise the target to 50–100ms before you reach for more heap. Tuning the target is cheaper than tuning the workload.

The metaspace flags (MetaspaceSize: 96m, MinMetaspaceFreeRatio: 50, MaxMetaspaceFreeRatio: 80) prevent the JVM from constantly resizing the metaspace on startup - relevant when you have hundreds of topics and Kafka loads many classes at boot.

Cruise Control: hand it the keys

The single largest quality-of-life jump in this whole setup is letting Cruise Control drive rebalances.

Strimzi already wires Cruise Control into the Kafka CR. The step beyond that is auto-rebalance templates:

cruiseControl:
  autoRebalance:
    - mode: add-brokers
      template:
        name: my-cluster-add-brokers-rbt
    - mode: remove-brokers
      template:
        name: my-cluster-remove-brokers-rbt
  brokerCapacity:
    cpu: "3"
    inboundNetwork: "125MiB/s"
    outboundNetwork: "125MiB/s"

The brokerCapacity.cpu should match your broker's CPU request - Cruise Control uses it to model whether a broker is overloaded. If you set it to "1" but your brokers have 3–6 cores, the optimiser thinks every broker is perpetually over-capacity and thrashes.

A KafkaRebalance template is a KafkaRebalance resource annotated with strimzi.io/rebalance-template: "true". When a broker pool grows, Strimzi clones the template into a real rebalance and runs it to completion without anyone drafting or approving it, so the new brokers take on their share of partitions and there are no under-replicated alerts to chase. Scale-down works the same way in reverse - brokers don't leave until their partitions have drained somewhere else.

Three rebalance flavours

I keep three KafkaRebalance resources in the chart, each with a clear purpose:

*-deep-rebalance-config - mode: full, concurrentPartitionMovementsPerBroker: 10, concurrentLeaderMovements: 10. Sits there as a standing proposal waiting for approval. When the cluster drifts (uneven disks, hot brokers after a partition surge), kubectl annotate kafkarebalance ... strimzi.io/rebalance=approve and Cruise Control fixes the world. If the proposal has gone stale, annotate with strimzi.io/rebalance=refresh first to regenerate it.
*-add-brokers-rbt - mode: add-brokers, skipHardGoalCheck: true, same concurrency settings. The template Strimzi clones when a node pool grows.
*-remove-brokers-rbt - mode: remove-brokers, same skipHardGoalCheck: true.

All three carry the full 14-goal optimisation list (from RackAwareGoal through CpuUsageDistributionGoal). The concurrency knobs (concurrentPartitionMovementsPerBroker: 10) control how many partitions move simultaneously per broker - 10 is aggressive enough to finish a rebalance in reasonable time but not so aggressive that broker I/O flatlines during the move.

skipHardGoalCheck on the templates is intentional. When a brand-new broker joins, capacity goals are mathematically unsatisfiable for a moment - the new broker has zero replicas and looks "under-utilised". You don't want the rebalance to refuse to start. The optimisation goals still apply; you're telling Cruise Control to do its best, not to insist on perfection.
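A minimal sketch of the add-brokers template follows. The goals list is abbreviated to the capacity goals named in this post, and the sketch follows the form of the Strimzi documentation's template example, which leaves mode and brokers out because the operator fills them in from the autoRebalance entry that references the template.

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaRebalance
metadata:
  name: my-cluster-add-brokers-rbt
  annotations:
    strimzi.io/rebalance-template: "true"   # marks this as a template, not a live rebalance
spec:
  # mode and brokers are intentionally absent in this sketch (see the note above).
  goals:                                    # abbreviated; the production list is longer
    - RackAwareGoal
    - ReplicaCapacityGoal
    - DiskCapacityGoal
    - NetworkInboundCapacityGoal
    - NetworkOutboundCapacityGoal
    - CpuCapacityGoal
  skipHardGoalCheck: true                   # don't refuse to start while the new broker is empty
  concurrentPartitionMovementsPerBroker: 10
  concurrentLeaderMovements: 10
```

The remove-brokers template is the same resource under a different name, referenced from the remove-brokers entry in autoRebalance.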

Goals in the right order

The order of goals, default.goals and hard.goals is the optimiser's priority list, not decoration. Mine:

hard.goals:
  - RackAwareGoal               # never put two replicas in the same zone
  - ReplicaCapacityGoal
  - DiskCapacityGoal
  - NetworkInboundCapacityGoal
  - NetworkOutboundCapacityGoal
  - CpuCapacityGoal
default.goals:
  - RackAwareGoal
  - MinTopicLeadersPerBrokerGoal   # leaders not bunched on one broker
  - ReplicaCapacityGoal
  - DiskCapacityGoal
  - NetworkInboundCapacityGoal
  - NetworkOutboundCapacityGoal
  - CpuCapacityGoal
  - LeaderReplicaDistributionGoal

The full optimisation goals list in production is much longer (20 goals including RackAwareDistributionGoal, PotentialNwOutGoal, and the KafkaAssigner goals), but what matters day-to-day is hard.goals (never violate) and default.goals (what Cruise Control actually optimises towards). Keep those two tight and let the full goals list be a superset for manual rebalances.

RackAwareGoal is always a hard goal. If Cruise Control is ever forced to choose between balance and zone safety, zone safety wins. I also enable the concurrency adjusters so Cruise Control throttles itself when broker load climbs:

config:
  max.active.user.tasks: "30"
  anomaly.detection.interval.ms: "900000"             # 15m
  disk.failure.detection.interval.ms: "300000"        # 5m
  metric.anomaly.detection.interval.ms: "1800000"     # 30m
  topic.anomaly.detection.interval.ms: "3600000"      # 60m
  concurrency.adjuster.leadership.enabled: "true"
  concurrency.adjuster.inter.broker.replica.enabled: "true"
  concurrency.adjuster.min.isr.check.enabled: "true"

max.active.user.tasks: 30 looks high (default is 5), but in a fleet with frequent scaling events across multiple clusters, you need headroom for concurrent proposals. The concurrency adjusters are the real safety net - they throttle replica and leadership movements dynamically based on broker load, so you don't need to manually lower parallelism during peak hours.

Net effect: scaling brokers from 18 → 21 is a replicas: 6 → 7 change in the per-AZ config (the module multiplies by zone count - 6 per AZ × 3 AZs = 18 total, 7 per AZ × 3 AZs = 21 total). Apply, wait for the auto-rebalance to finish, verify no under-replicated partitions. No on-call activity.

HA across datacenters: what multi-zone actually buys you

A topic where teams overestimate what they have. Let me be specific about each failure mode:

One broker dies. N-1 in-sync replicas pick up. With RF=3 and min.insync.replicas=2, no producer notices.

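One zone disappears for an hour. With three zones, six brokers per zone, RF=3 and min.insync.replicas=2, rack-aware placement puts one replica of every partition in each zone, so losing a zone still leaves two in-sync replicas in the surviving ones. Producers keep writing with acks=all, as long as the remaining twelve brokers have the headroom to absorb the load. This is the failure mode I design for - by far the most valuable property the cluster has.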

Two zones disappear at once. Below min.insync.replicas. Writes start failing for any topic with acks=all. Reads of already-committed data still work. This is correct behaviour - durability over availability - but the producers need to handle it (retries with backoff, DLQs, idempotency keys, the usual hygiene). Test this path on staging; don't discover it in prod.

The whole region dies. A single multi-zone cluster does not save you. You need a second cluster in a second region, MirrorMaker 2 replicating the topics that matter, and a runbook for failover where consumer offset translation is the part that bites everyone. In practice I run MM2 with ~20 replicas, 40+ connector tasks, and IdentityReplicationPolicy so topic names stay the same on both sides. Treat MM2 as a second source of truth, not as backup. Make consumers idempotent.
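For orientation, a one-directional MM2 setup along those lines might be declared like the sketch below. It uses the v1beta2 KafkaMirrorMaker2 schema; the resource name, bootstrap addresses, and topic pattern are hypothetical, and TLS and authentication are left out.

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaMirrorMaker2
metadata:
  name: my-mm2                      # hypothetical name
spec:
  version: 4.1.0
  replicas: 20
  connectCluster: "target"          # MM2 runs next to the cluster it writes to
  clusters:
    - alias: "source"
      bootstrapServers: source-kafka-bootstrap.example.internal:9092   # hypothetical address
    - alias: "target"
      bootstrapServers: target-kafka-bootstrap.example.internal:9092   # hypothetical address
  mirrors:
    - sourceCluster: "source"
      targetCluster: "target"
      topicsPattern: "orders\\..*"  # hypothetical: mirror only the topics that matter
      groupsPattern: ".*"
      sourceConnector:
        tasksMax: 40
        config:
          replication.factor: 3
          # Keep topic names identical on both sides instead of prefixing them with the source alias.
          replication.policy.class: org.apache.kafka.connect.mirror.IdentityReplicationPolicy
      checkpointConnector:
        config:
          replication.policy.class: org.apache.kafka.connect.mirror.IdentityReplicationPolicy
          sync.group.offsets.enabled: "true"   # assumption: write translated offsets to the target
```

With IdentityReplicationPolicy there is no alias prefix for MM2 to recognise its own output, so mirroring the same topic in both directions creates a loop. Keep each topic flowing one way.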

One cluster per pipeline stage

A pattern I've come to rely on the more event-driven the platform gets: stop trying to fit every service onto one giant Kafka cluster. Split the pipeline into stages, and give each stage its own cluster.

Three concrete reasons this pays off:

Smaller blast radius. A misbehaving producer that hammers kafka-ingest with 10× traffic can't starve the brokers serving the analytics stream. The failure stays inside one stage; the rest of the pipeline keeps moving on whatever it had buffered.
Per-stage tuning. Ingest clusters are write-heavy, short retention, many partitions, lots of network bandwidth. Analytics clusters are read-heavy, long retention, fewer but larger partitions, optimised for sequential scans. Trying to find one set of broker sizes, JVM flags, and topic defaults that serves both leaves you with a cluster that's mediocre at everything. With separate clusters I tune each one for the workload it actually runs.
Independent scaling and upgrades. Cruise Control rebalances on kafka-ingest don't move analytics partitions around. Kafka version upgrades roll one cluster at a time. Maintenance windows shrink to the services that talk to that stage instead of every consumer in the company.

The boundary between stages is also a natural place to enforce contracts. Each kafka-* cluster gets its own Schema Registry, its own ACLs, its own retention policy. Teams downstream consume only the topics on their input cluster - you stop seeing accidental coupling where a service quietly subscribes to a raw upstream topic it had no business touching.

Trade-offs, honestly:

More clusters to operate. The Strimzi setup here is designed precisely to make that painless - a Kafka CR, a few KafkaNodePool resources, Cruise Control automating the rest - but it's still N control planes instead of one.
Cross-stage observability needs work. End-to-end tracing across multiple clusters means correlation IDs in headers from day one, and a Grafana board that joins lag metrics across producers and consumers in different namespaces.
This is not the same as MirrorMaker 2. MM2 copies the same logical stream into another cluster for DR or geo-locality. Pipeline-stage splitting is a different stream per cluster, each transformed by the service in front of it.

My rule of thumb: the moment a topic has more than three meaningfully different consumers, or sits between two services that scale at very different rates, that's the seam where the next cluster wants to live. Cheap to set up with this chart, expensive to retrofit once you have downtime budgets.

Quiet contributors to HA

A few details that don't look like HA but absolutely are:

deleteClaim: true on broker storage. Counter-intuitive: yes, removing a node pool drops the PVC. But the recovery path I trust is "fresh broker, Cruise Control re-replicates". Reattaching old disks looks faster, fails in subtle ways (stale meta.properties, mismatched broker IDs, half-written segments), and is never the path that's been rehearsed.
Internal load balancer per broker. Each broker pod gets its own perPodService with an internal NLB annotation, plus a bootstrap LB annotated for external-dns. Clients outside the cluster but inside the VPC get stable DNS without going through ingress. No L7 between producer and broker, ever - Kafka is a TCP protocol and pretending otherwise breaks it.
A topic-operator queue sized for your actual topic count. With a few thousand KafkaTopic resources, the default Strimzi reconciliation queue silently chokes - topics take 10+ minutes to apply, and you spend an afternoon wondering why. One environment variable fixes it:

entityOperator:
  topicOperator: {}
  template:
    topicOperatorContainer:
      env:
        - name: STRIMZI_MAX_QUEUE_SIZE
          value: "16384"

That one line resolved a real "topics take forever to reconcile" incident for me. Default is 1024 - fine for a demo, wrong for production.
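The per-broker internal load balancer item above maps onto a loadbalancer listener plus Service templates in the Kafka CR. A sketch, assuming AWS and the legacy in-tree NLB annotations; the listener name, port, and DNS hostname are hypothetical:

```yaml
spec:
  kafka:
    listeners:
      - name: external
        port: 9094
        type: loadbalancer      # one Service per broker plus one bootstrap Service
        tls: true
    template:
      perPodService:
        metadata:
          annotations:
            # assumption: AWS, internal NLB via the legacy in-tree annotations
            service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
            service.beta.kubernetes.io/aws-load-balancer-internal: "true"
      externalBootstrapService:
        metadata:
          annotations:
            service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
            service.beta.kubernetes.io/aws-load-balancer-internal: "true"
            external-dns.alpha.kubernetes.io/hostname: kafka.example.internal   # hypothetical hostname
```

Other clouds, and the AWS Load Balancer Controller, use different annotation keys, so treat the annotation lines as placeholders for whatever your platform expects.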

Topics as code

Every topic is a KafkaTopic CR reconciled by Strimzi's Topic Operator. A small Helm chart loops over a list and renders one per topic:

# charts/kafka-topics/templates/kafkaTopic.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: my-events           # DNS-safe, normalised
  labels:
    strimzi.io/cluster: my-cluster
spec:
  topicName: my.events.v1   # real Kafka topic name - dots and underscores allowed
  partitions: 12
  replicas: 3
  config:
    retention.ms: "604800000"   # 7 days (cluster default is 1 day - override per topic for streams that need history)
    cleanup.policy: "delete"
    min.insync.replicas: "2"

Production lessons that cost real time:

metadata.name ≠ Kafka topic name. Kubernetes resource names must be lowercase and can't contain underscores; Kafka topic names can use both, plus dots. Always set spec.topicName to the real name and normalise metadata.name.
Partitions are a one-way street. You can grow them, never shrink. Be generous, not silly - every partition has a cost in controller metadata, replica churn, and consumer fan-out.
replicas per topic should match your zone count for anything that matters. RF=3 across 3 zones for durable streams. RF=2 only for ephemeral data you can drop on the floor.
min.insync.replicas: 2 at the topic level for durable streams. Don't rely on cluster-wide defaults - they drift.
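The loop-and-normalise idea from the lessons above can be sketched as a Helm template. The values layout (topics, clusterName) is hypothetical, and the normalisation rule shown here (lowercase, dots and underscores to dashes) is one possible choice rather than the exact one used in the chart.

```yaml
# charts/kafka-topics/templates/kafkaTopic.yaml (sketch)
# Assumed values.yaml shape:
#   clusterName: my-cluster
#   topics:
#     - name: my.events.v1
#       partitions: 12
#       replicas: 3
#       config: { retention.ms: "604800000", min.insync.replicas: "2" }
{{- range .Values.topics }}
---
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  # Kafka name -> Kubernetes-safe name: lowercase, dots and underscores become dashes
  name: {{ .name | lower | replace "." "-" | replace "_" "-" }}
  labels:
    strimzi.io/cluster: {{ $.Values.clusterName }}
spec:
  topicName: {{ .name | quote }}   # the real Kafka topic name, untouched
  partitions: {{ .partitions }}
  replicas: {{ .replicas | default 3 }}
  {{- with .config }}
  config:
    {{- toYaml . | nindent 4 }}
  {{- end }}
{{- end }}
```

Normalisation can collide: my.events and my_events would both render as my-events. A uniqueness check in CI on the rendered names is cheap insurance.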

Exporter probes you'll wish you had

Kafka Exporter is slow to start. On a busy cluster the first scrape can take a minute. The default liveness probe kills it mid-warmup, kubelet restarts the container, the dashboards stay blank, and you spend an hour blaming Prometheus.

kafkaExporter:
  livenessProbe:
    initialDelaySeconds: 60
    timeoutSeconds: 15
    periodSeconds: 30
  readinessProbe:
    initialDelaySeconds: 60
    timeoutSeconds: 15
    periodSeconds: 30

Not glamorous. Saves a lot of pages.

What I'd do differently next time

An honest look at things I'd change if I were starting over today:

Move to KRaft on day one. I migrated from ZK-backed Strimzi and the cutover was non-trivial (Kafka 3.7 → 4.1 with KRaft migration in between). Starting on KRaft would have saved a month.
Pin metadataVersion separately from version. Quick to forget that the metadata version follows its own upgrade dance and lags the Kafka version. Today I keep both explicit in values and bump them in separate PRs.
Use Cruise Control's autoRebalance from the beginning. I shipped the templates as a follow-up after a painful manual scale-out. Should have been in the first iteration.
Bigger SSDs from day zero. Growing PVCs under a live broker is possible when the StorageClass allows volume expansion, but rarely pleasant. Overprovision broker storage by 50%; you'll thank yourself.
A dashboard that shows the rebalance state, not only broker load. Knowing whether Cruise Control is currently moving 4,000 partitions matters more than knowing one broker is at 85% CPU.

The compressed checklist

If you skim everything above and remember only this:

One node pool per zone, per role. Hard nodeAffinity, hard podAntiAffinity on hostname.
KRaft over ZooKeeper. Kafka 4.x dropped ZK entirely - no new ZK clusters, period.
Rack awareness in two places - Kafka rack.topologyKey + nodepool nodeAffinity. Both or neither.
Cruise Control drives rebalances. Templates for add/remove, plus a mode: full plan you approve manually for drift.
RackAwareGoal is a hard goal. Always.
JVM: G1GC, MaxGCPauseMillis: 20, AlwaysPreTouch, ExitOnOutOfMemoryError.
Per-broker internal LB. No L7 in front of Kafka.
STRIMZI_MAX_QUEUE_SIZE: 16384 the moment topic count crosses a thousand.
Slow exporter probes. 60s initial delay minimum.
Multi-region = MirrorMaker 2. Multi-zone is HA, not DR.
One Kafka cluster per pipeline stage. Smaller blast radius, per-stage tuning, independent upgrades.

The point of Strimzi is to make Kafka feel like another workload on the cluster. Get the scheduling right, hand the rebalances to Cruise Control, and most weeks you'll forget Kafka is in the namespace at all. Which, for a streaming platform at this scale, is the highest praise I can give.
