Overview
This guide walks through deploying a production-grade Apache Kafka cluster on Kubernetes using KRaft mode (no ZooKeeper), SASL/SCRAM-SHA-512 authentication, and a 3-node StatefulSet for high availability. It covers every manifest file required, the reasoning behind each configuration decision, the SCRAM credential bootstrap process, common pitfalls encountered in practice, and the steps needed to take the cluster from running to production-ready.
Prerequisites
Tools
- kubectl configured against your target cluster
- A Kubernetes cluster with at least 3 nodes (one per Kafka pod) with sufficient resources
- Persistent volume provisioner available (e.g. local-path, Longhorn, Ceph, AWS EBS)
- keytool (part of the JDK) if you plan to add TLS later
Cluster Resources
Each Kafka pod in this guide requests 500m CPU and 1Gi RAM, with limits of 2 CPU and 4Gi RAM. For production you should size these based on your throughput requirements. A minimum of 10Gi persistent storage per broker is configured.
Kubernetes Version
Kubernetes 1.21 or later is recommended. StatefulSets, headless Services, and ConfigMaps used in this guide are stable APIs available in all modern versions.
Architecture
The cluster consists of three pods (kafka-0, kafka-1, kafka-2) each running in combined broker+controller mode using KRaft. This means each pod participates in the Raft quorum for metadata management as well as serving producer and consumer traffic.
┌─────────────────────────────────────────┐
│ Kafka Namespace │
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ kafka-0 │ │ kafka-1 │ │ kafka-2 │ │
│ │ broker │ │ broker │ │ broker │ │
│ │ + ctrl │ │ + ctrl │ │ + ctrl │ │
│ └────┬────┘ └────┬────┘ └────┬────┘ │
│ │ │ │ │
│ port 9092 (SASL_PLAINTEXT — broker) │
│ port 9093 (PLAINTEXT — controller) │
│ │
│ ┌─────────────────────────────────┐ │
│ │ kafka-headless (ClusterIP: │ │
│ │ None) — DNS per pod │ │
│ └─────────────────────────────────┘ │
└─────────────────────────────────────────┘
KRaft requires a majority quorum (2 out of 3) to elect a leader and commit metadata. The cluster can tolerate the loss of one node.
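That arithmetic generalizes: a quorum of N voters needs floor(N/2) + 1 votes for a majority. The shell sketch below (a hypothetical helper, not part of Kafka's tooling) makes the tolerance calculation concrete:

```shell
# Sketch: failures tolerated by a KRaft quorum of N voters.
# A majority is floor(N/2) + 1, so the quorum survives N - majority losses.
quorum_tolerance() {
  n=$1
  majority=$(( n / 2 + 1 ))
  echo $(( n - majority ))
}

quorum_tolerance 3   # prints 1: this guide's 3-node cluster survives one lost node
quorum_tolerance 5   # prints 2: a 5-voter quorum survives two
```

Note that even numbers of voters add no tolerance over the next-lower odd number (4 voters still only tolerate 1 failure), which is why 3 or 5 controllers are the usual choices.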
File Structure
You need five manifest files plus one temporary modification during the SCRAM bootstrap process:
- namespace.yaml — Kafka namespace
- kafka-svc.yaml — Headless service for pod DNS
- kafka-jaas.yaml — JAAS config for inter-broker SCRAM auth
- kafka-sasl.yaml — Secret holding admin credentials for client apps
- kafka-stateful-set.yaml — Brokers, init containers, volumes, config
Step 1 — Namespace
# namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
name: kafka
Apply first so all subsequent resources land in the correct namespace:
kubectl apply -f namespace.yaml
Step 2 — Headless Service
# kafka-svc.yaml
apiVersion: v1
kind: Service
metadata:
name: kafka-headless
namespace: kafka
spec:
clusterIP: None
selector:
app: kafka
ports:
- name: internal
port: 9092
- name: controller
port: 9093
A headless service (clusterIP: None) causes Kubernetes DNS to create per-pod A records:
kafka-0.kafka-headless.kafka.svc.cluster.local
kafka-1.kafka-headless.kafka.svc.cluster.local
kafka-2.kafka-headless.kafka.svc.cluster.local
These stable DNS names are what the KRaft quorum voters list and the advertised listeners are built from. They survive pod restarts and rescheduling because they are tied to the StatefulSet ordinal, not the pod IP.
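As a sanity sketch outside the cluster, the names follow the fixed pattern pod.service.namespace.svc.cluster.local and can be generated purely from the StatefulSet ordinals (the names below assume this guide's resources):

```shell
# Sketch: the per-pod DNS names a 3-replica StatefulSet yields, built from
# <statefulset>-<ordinal>.<headless-service>.<namespace>.svc.cluster.local
STS=kafka; SVC=kafka-headless; NS=kafka; REPLICAS=3

NAMES=$(for i in $(seq 0 $(( REPLICAS - 1 ))); do
  echo "${STS}-${i}.${SVC}.${NS}.svc.cluster.local"
done)
echo "$NAMES"
```

Any config that needs the full broker list (quorum voters, client bootstrap servers) can be derived the same way, which keeps the names consistent across manifests.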
Step 3 — JAAS Configuration
# kafka-jaas.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: kafka-jaas
namespace: kafka
data:
jaas.conf: |
KafkaServer {
org.apache.kafka.common.security.scram.ScramLoginModule required
username="admin"
password="supersecret";
};
This file configures the Java Authentication and Authorization Service (JAAS) for the Kafka broker process itself. The username and password here are the inter-broker credentials — what each broker uses when authenticating to other brokers on the INTERNAL listener.
Important: JAAS alone does not create the SCRAM user in Kafka's metadata store. That is a separate bootstrap step performed after the cluster is running (see Step 7). The JAAS file tells the broker process what credentials to use, but those credentials must also exist in Kafka's internal metadata before inter-broker authentication will succeed.
Step 4 — SASL Secret
# kafka-sasl.yaml
apiVersion: v1
kind: Secret
metadata:
name: kafka-sasl
namespace: kafka
type: Opaque
stringData:
username: admin
password: supersecret
This secret can be mounted into client pods or referenced by applications that need to connect to Kafka. It is separate from the JAAS ConfigMap so that client credentials can be managed independently of the broker configuration.
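As an illustration, a client pod could render its client.properties from credentials injected out of this secret. The env var names and file path below are hypothetical choices, not anything Kafka or Kubernetes mandates; you would wire them up with secretKeyRef entries in the client pod spec:

```shell
# Sketch: render a SASL client config from credentials a pod could pull
# out of the kafka-sasl secret. KAFKA_USERNAME / KAFKA_PASSWORD are
# hypothetical env var names mapped from the secret's keys.
KAFKA_USERNAME=admin
KAFKA_PASSWORD=supersecret

cat > /tmp/client.properties <<EOF
security.protocol=SASL_PLAINTEXT
sasl.mechanism=SCRAM-SHA-512
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username="${KAFKA_USERNAME}" password="${KAFKA_PASSWORD}";
EOF

cat /tmp/client.properties
```

Rendering the file at startup (rather than baking credentials into an image) means a password rotation only requires updating the secret and restarting the client.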
Step 5 — StatefulSet
This is the core manifest. It contains two init containers and one main broker container.
Why Two Init Containers?
init-node-id runs a tiny busybox container to extract the pod ordinal (0, 1, or 2) from the hostname and writes it to a shared emptyDir volume. This is necessary because Kubernetes environment variable substitution does not support string manipulation, so you cannot derive the integer 0 from the pod name kafka-0 using env vars alone.
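The extraction logic itself can be exercised outside the cluster; the hostnames below are sample inputs standing in for the live pod hostname:

```shell
# Sketch: the ordinal extraction the init container performs, run against
# sample hostnames instead of the real pod's hostname.
extract_ordinal() {
  echo "$1" | awk -F'-' '{print $NF}'
}

extract_ordinal kafka-0              # prints 0
extract_ordinal kafka-2              # prints 2
# awk keeps only the last '-'-separated field, so extra hyphens are harmless:
extract_ordinal my-kafka-cluster-1   # prints 1
```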
format-storage runs the actual Kafka image to format the KRaft log directory using kafka-storage.sh. It only runs if /data/meta.properties does not already exist, making it safe on pod restarts. The temporary kraft.properties file used during formatting includes a dummy plaintext broker listener — this is required because Kafka's config validator refuses to proceed if the only listener defined is a controller listener. These values are only used during formatting and are not used by the running broker.
Why /etc/kafka/docker/run Instead of kafka-server-start.sh?
The official apache/kafka Docker image ships an entrypoint at /etc/kafka/docker/run that translates KAFKA_* environment variables into server.properties entries before starting the broker. Calling kafka-server-start.sh directly bypasses this translation entirely — env vars like KAFKA_LOG_DIRS are ignored and the broker falls back to compiled-in defaults, causing it to look for data in the wrong directory.
Why CLUSTER_ID as an Env Var?
If CLUSTER_ID is not set, the official entrypoint generates a new random cluster ID on every pod start and attempts to re-format storage. This fails because the existing meta.properties already has a different ID written by the init container. Always set CLUSTER_ID to match the value used during the format step.
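The failure mode is easy to reason about in isolation: the entrypoint effectively compares the env var against the ID recorded on disk. The sketch below simulates that comparison with a locally written meta.properties (path and values are illustrative):

```shell
# Sketch: reproduce the entrypoint's cluster-ID comparison locally.
CLUSTER_ID="q1Sh-9_ISia_zwGINzRvyQ"   # what the StatefulSet env var carries

# Simulate what the format step wrote into the data directory:
cat > /tmp/meta.properties <<EOF
version=1
cluster.id=q1Sh-9_ISia_zwGINzRvyQ
node.id=0
EOF

WRITTEN=$(grep '^cluster.id=' /tmp/meta.properties | cut -d= -f2)
if [ "$WRITTEN" = "$CLUSTER_ID" ]; then
  echo "cluster IDs match; broker would start"
else
  echo "MISMATCH: env=$CLUSTER_ID meta.properties=$WRITTEN"
fi
```

If you hit the mismatch in a real cluster, check the pod's actual meta.properties the same way: grep cluster.id from the file under the data volume and compare it to the env var.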
Generating Your Own Cluster ID
The cluster ID q1Sh-9_ISia_zwGINzRvyQ used in this guide was generated with:
/opt/kafka/bin/kafka-storage.sh random-uuid
Generate your own unique ID for each cluster you deploy. Do not reuse IDs across clusters.
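kafka-storage.sh random-uuid is the canonical generator. If you only need an ID of the same shape for scripting or validation without a Kafka image handy, a KRaft cluster ID is 16 random bytes rendered as 22 characters of URL-safe base64 without padding; the sketch below approximates that, but prefer the Kafka tool for real deployments:

```shell
# Sketch: produce a 22-character URL-safe base64 ID with the same shape as
# kafka-storage.sh random-uuid output. Use the Kafka tool for real clusters.
ID=$(head -c 16 /dev/urandom | base64 | tr '+/' '-_' | tr -d '=')
echo "$ID"
```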
# kafka-stateful-set.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: kafka
namespace: kafka
spec:
serviceName: kafka-headless
replicas: 3
selector:
matchLabels:
app: kafka
template:
metadata:
labels:
app: kafka
spec:
securityContext:
fsGroup: 1001
volumes:
- name: kafka-config
emptyDir: {}
- name: kafka-jaas
configMap:
name: kafka-jaas
initContainers:
- name: init-node-id
image: busybox:1.36
command:
- sh
- -c
- |
ORDINAL=$(hostname | awk -F'-' '{print $NF}')
echo "$ORDINAL" > /config/node-id
volumeMounts:
- name: kafka-config
mountPath: /config
- name: format-storage
image: apache/kafka:4.2.0
command:
- sh
- -c
- |
NODE_ID=$(cat /config/node-id)
if [ ! -f "/data/meta.properties" ]; then
echo "Formatting storage for node $NODE_ID..."
echo "node.id=$NODE_ID" > /tmp/kraft.properties
echo "process.roles=broker,controller" >> /tmp/kraft.properties
echo "controller.quorum.voters=0@kafka-0.kafka-headless.kafka.svc.cluster.local:9093,1@kafka-1.kafka-headless.kafka.svc.cluster.local:9093,2@kafka-2.kafka-headless.kafka.svc.cluster.local:9093" >> /tmp/kraft.properties
echo "listeners=PLAINTEXT://:9092,CONTROLLER://:9093" >> /tmp/kraft.properties
echo "advertised.listeners=PLAINTEXT://localhost:9092" >> /tmp/kraft.properties
echo "listener.security.protocol.map=PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT" >> /tmp/kraft.properties
echo "inter.broker.listener.name=PLAINTEXT" >> /tmp/kraft.properties
echo "controller.listener.names=CONTROLLER" >> /tmp/kraft.properties
echo "log.dirs=/data" >> /tmp/kraft.properties
/opt/kafka/bin/kafka-storage.sh format \
--ignore-formatted \
--cluster-id q1Sh-9_ISia_zwGINzRvyQ \
--config /tmp/kraft.properties
else
echo "Already formatted, skipping."
fi
volumeMounts:
- name: kafka-data
mountPath: /data
- name: kafka-config
mountPath: /config
containers:
- name: kafka
image: apache/kafka:4.2.0
command:
- sh
- -c
- |
export KAFKA_NODE_ID=$(cat /config/node-id)
exec /etc/kafka/docker/run
ports:
- containerPort: 9092
- containerPort: 9093
env:
- name: CLUSTER_ID
value: "q1Sh-9_ISia_zwGINzRvyQ"
- name: KAFKA_PROCESS_ROLES
value: "broker,controller"
- name: KAFKA_CONTROLLER_LISTENER_NAMES
value: "CONTROLLER"
- name: KAFKA_CONTROLLER_QUORUM_VOTERS
value: "0@kafka-0.kafka-headless.kafka.svc.cluster.local:9093,1@kafka-1.kafka-headless.kafka.svc.cluster.local:9093,2@kafka-2.kafka-headless.kafka.svc.cluster.local:9093"
- name: KAFKA_LISTENERS
value: "INTERNAL://:9092,CONTROLLER://:9093"
- name: KAFKA_LISTENER_SECURITY_PROTOCOL_MAP
value: "INTERNAL:SASL_PLAINTEXT,CONTROLLER:PLAINTEXT"
- name: KAFKA_INTER_BROKER_LISTENER_NAME
value: "INTERNAL"
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: KAFKA_ADVERTISED_LISTENERS
value: "INTERNAL://$(POD_NAME).kafka-headless.kafka.svc.cluster.local:9092"
- name: KAFKA_SASL_ENABLED_MECHANISMS
value: SCRAM-SHA-512
- name: KAFKA_SASL_MECHANISM_INTER_BROKER_PROTOCOL
value: SCRAM-SHA-512
- name: KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR
value: "3"
- name: KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR
value: "3"
- name: KAFKA_TRANSACTION_STATE_LOG_MIN_ISR
value: "2"
- name: KAFKA_MIN_INSYNC_REPLICAS
value: "2"
- name: KAFKA_LOG_DIRS
value: /data
- name: KAFKA_OPTS
value: "-Djava.security.auth.login.config=/opt/kafka/config/jaas/jaas.conf"
volumeMounts:
- name: kafka-data
mountPath: /data
- name: kafka-config
mountPath: /config
- name: kafka-jaas
mountPath: /opt/kafka/config/jaas
resources:
requests:
cpu: 500m
memory: 1Gi
limits:
cpu: "2"
memory: 4Gi
volumeClaimTemplates:
- metadata:
name: kafka-data
spec:
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 10Gi
Step 6 — Initial Deployment
Apply all manifests:
kubectl apply -f namespace.yaml
kubectl apply -f kafka-svc.yaml
kubectl apply -f kafka-jaas.yaml
kubectl apply -f kafka-sasl.yaml
kubectl apply -f kafka-stateful-set.yaml
Watch the pods come up:
kubectl get pods -n kafka -w
All three pods should reach Running status within a minute or two. Check kafka-0 logs to confirm a clean startup:
kubectl logs -n kafka kafka-0
A healthy startup ends with:
[KafkaRaftServer nodeId=0] Kafka Server started
Step 7 — Bootstrap SCRAM Credentials
This is the most involved post-deployment step, and also the most commonly misunderstood. Here is why it requires a special process.
The Chicken-and-Egg Problem
The broker's INTERNAL listener (port 9092) requires SASL/SCRAM-SHA-512 authentication. The kafka-configs.sh admin tool needs to connect to a broker to write SCRAM credentials into Kafka's metadata. But to connect to the broker, you need valid SCRAM credentials — which don't exist yet because you haven't written them.
There are two apparent escape hatches that do not actually work:
- Using the JAAS file credentials directly against port 9092 — the JAAS file tells the broker process what credentials to use for its own inter-broker connections, but those credentials must also exist in Kafka's metadata store before the SCRAM handshake will succeed. The JAAS file does not pre-populate metadata.
- Using --bootstrap-controller against port 9093 — Kafka 4.x does not support the alterUserScramCredentials admin API via the controller quorum endpoint. It must go through a broker.
The solution is to temporarily open an unauthenticated PLAINTEXT listener on a separate port (9094), use it to write the credentials, then close it again.
7a — Add the Temporary Listener to the Service
Edit kafka-svc.yaml to add port 9094:
# kafka-svc.yaml (temporary — port 9094 will be removed after bootstrap)
apiVersion: v1
kind: Service
metadata:
name: kafka-headless
namespace: kafka
spec:
clusterIP: None
selector:
app: kafka
ports:
- name: internal
port: 9092
- name: controller
port: 9093
- name: plaintext-bootstrap
port: 9094
7b — Add the Temporary Listener to the StatefulSet
Edit kafka-stateful-set.yaml and make these changes to the broker container:
Add containerPort: 9094 to the ports list.
Change KAFKA_LISTENERS to:
- name: KAFKA_LISTENERS
value: "INTERNAL://:9092,CONTROLLER://:9093,PLAINTEXT://:9094"
Change KAFKA_LISTENER_SECURITY_PROTOCOL_MAP to:
- name: KAFKA_LISTENER_SECURITY_PROTOCOL_MAP
value: "INTERNAL:SASL_PLAINTEXT,CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT"
Change KAFKA_ADVERTISED_LISTENERS to:
- name: KAFKA_ADVERTISED_LISTENERS
value: "INTERNAL://$(POD_NAME).kafka-headless.kafka.svc.cluster.local:9092,PLAINTEXT://$(POD_NAME).kafka-headless.kafka.svc.cluster.local:9094"
7c — Apply and Wait for Rollout
kubectl apply -f kafka-svc.yaml
kubectl apply -f kafka-stateful-set.yaml
kubectl rollout status statefulset/kafka -n kafka
7d — Register the SCRAM Credentials
kubectl exec -n kafka kafka-0 -- \
/opt/kafka/bin/kafka-configs.sh \
--bootstrap-server kafka-0.kafka-headless.kafka.svc.cluster.local:9094 \
--alter \
--add-config 'SCRAM-SHA-512=[password=supersecret]' \
--entity-type users \
--entity-name admin
Expected output:
Completed updating config for user admin.
If you need additional users (application service accounts), register them now while port 9094 is still open:
kubectl exec -n kafka kafka-0 -- \
/opt/kafka/bin/kafka-configs.sh \
--bootstrap-server kafka-0.kafka-headless.kafka.svc.cluster.local:9094 \
--alter \
--add-config 'SCRAM-SHA-512=[password=apppassword]' \
--entity-type users \
--entity-name myapp
7e — Remove the Temporary Listener
Revert kafka-svc.yaml to remove port 9094:
# kafka-svc.yaml (final)
apiVersion: v1
kind: Service
metadata:
name: kafka-headless
namespace: kafka
spec:
clusterIP: None
selector:
app: kafka
ports:
- name: internal
port: 9092
- name: controller
port: 9093
Revert the StatefulSet changes — remove the PLAINTEXT entries from KAFKA_LISTENERS, KAFKA_LISTENER_SECURITY_PROTOCOL_MAP, and KAFKA_ADVERTISED_LISTENERS, and remove the containerPort: 9094 line.
Apply and wait:
kubectl apply -f kafka-svc.yaml
kubectl apply -f kafka-stateful-set.yaml
kubectl rollout status statefulset/kafka -n kafka
Step 8 — Verify the Cluster
Check the KRaft quorum status:
kubectl exec -n kafka kafka-0 -- \
/opt/kafka/bin/kafka-metadata-quorum.sh \
--bootstrap-controller kafka-0.kafka-headless.kafka.svc.cluster.local:9093 \
describe --status
A healthy cluster shows:
LeaderId: 0
LeaderEpoch: 1
HighWatermark: 303
MaxFollowerLag: 0
MaxFollowerLagTimeMs: 0
CurrentVoters: [{"id": 0, ...}, {"id": 1, ...}, {"id": 2, ...}]
CurrentObservers: []
MaxFollowerLag: 0 confirms all three nodes are fully in sync. Three voters and no observers confirm that all nodes are full participants in the quorum.
Verify SASL authentication works on port 9092:
kubectl exec -n kafka kafka-0 -- bash -c '
cat > /tmp/client.properties << EOF
security.protocol=SASL_PLAINTEXT
sasl.mechanism=SCRAM-SHA-512
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username="admin" password="supersecret";
EOF
/opt/kafka/bin/kafka-topics.sh \
--bootstrap-server kafka-0.kafka-headless.kafka.svc.cluster.local:9092 \
--command-config /tmp/client.properties \
--list
'
Create a test topic and produce/consume a message:
kubectl exec -n kafka kafka-0 -- bash -c '
cat > /tmp/client.properties << EOF
security.protocol=SASL_PLAINTEXT
sasl.mechanism=SCRAM-SHA-512
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username="admin" password="supersecret";
EOF
/opt/kafka/bin/kafka-topics.sh \
--bootstrap-server kafka-0.kafka-headless.kafka.svc.cluster.local:9092 \
--command-config /tmp/client.properties \
--create --topic test --partitions 3 --replication-factor 3
echo "hello kafka" | /opt/kafka/bin/kafka-console-producer.sh \
--bootstrap-server kafka-0.kafka-headless.kafka.svc.cluster.local:9092 \
--producer.config /tmp/client.properties \
--topic test
/opt/kafka/bin/kafka-console-consumer.sh \
--bootstrap-server kafka-0.kafka-headless.kafka.svc.cluster.local:9092 \
--consumer.config /tmp/client.properties \
--topic test --from-beginning --max-messages 1
'
Production Hardening Checklist
The cluster at this point is functional and secured with SASL. The following additional steps are needed before handling real production traffic.
Add TLS Encryption
The current setup uses SASL_PLAINTEXT, meaning credentials and data are transmitted unencrypted within the cluster. For any environment where network traffic could be intercepted, upgrade to SASL_SSL.
Generate a keystore and truststore using keytool. The Subject Alternative Names (SANs) must cover all broker hostnames — this is the most common mistake when setting up TLS for Kafka on Kubernetes.
PASS=yourpassword
SAN="dns:kafka-0.kafka-headless.kafka.svc.cluster.local,\
dns:kafka-1.kafka-headless.kafka.svc.cluster.local,\
dns:kafka-2.kafka-headless.kafka.svc.cluster.local,\
dns:kafka-headless.kafka.svc.cluster.local"
# Generate CA
keytool -genkeypair -alias ca -keyalg RSA -keysize 2048 \
-dname "CN=kafka-ca" -validity 3650 \
-keystore ca.jks -storepass $PASS -storetype JKS
keytool -exportcert -alias ca -keystore ca.jks \
-storepass $PASS -file ca.crt
# Generate broker keypair and CSR
keytool -genkeypair -alias kafka -keyalg RSA -keysize 2048 \
-dname "CN=kafka" -validity 3650 \
-keystore keystore.jks -storepass $PASS -storetype JKS
keytool -certreq -alias kafka -keystore keystore.jks \
-storepass $PASS -file kafka.csr
# Sign the CSR with the CA, including all broker SANs
keytool -gencert -alias ca -keystore ca.jks -storepass $PASS \
-infile kafka.csr -outfile kafka.crt -validity 3650 \
-ext "SAN=$SAN"
# Import CA + signed cert into the keystore
keytool -importcert -alias ca -file ca.crt \
-keystore keystore.jks -storepass $PASS -noprompt
keytool -importcert -alias kafka -file kafka.crt \
-keystore keystore.jks -storepass $PASS -noprompt
# Build truststore containing only the CA
keytool -importcert -alias ca -file ca.crt \
-keystore truststore.jks -storepass $PASS -noprompt
# Store in Kubernetes
kubectl create secret generic kafka-tls -n kafka \
--from-file=keystore.jks=keystore.jks \
--from-file=truststore.jks=truststore.jks \
--from-literal=password=$PASS \
--dry-run=client -o yaml | kubectl apply -f -
Then update the StatefulSet to mount the secret and change the listener protocols:
- INTERNAL:SASL_PLAINTEXT → INTERNAL:SASL_SSL
- CONTROLLER:PLAINTEXT → CONTROLLER:SSL
- Add KAFKA_SSL_KEYSTORE_LOCATION, KAFKA_SSL_TRUSTSTORE_LOCATION, KAFKA_SSL_KEYSTORE_PASSWORD, and KAFKA_SSL_TRUSTSTORE_PASSWORD env vars
- Mount the kafka-tls secret as a volume at /tls
Change Default Credentials
The admin/supersecret credentials used in this guide are for demonstration only. Before going to production:
- Update kafka-jaas.yaml with a strong randomly generated password
- Update kafka-sasl.yaml with the same password
- Re-register the SCRAM credentials using the bootstrap process in Step 7
- Rotate any client configurations that reference the old credentials
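One way to produce such a password (any CSPRNG-backed generator works; this one reads /dev/urandom):

```shell
# Sketch: 24 random bytes -> a 32-character URL-safe password for SCRAM users.
NEW_PASSWORD=$(head -c 24 /dev/urandom | base64 | tr '+/' '-_')
echo "$NEW_PASSWORD"
```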
Network Policies
Add a NetworkPolicy to restrict which pods can reach Kafka:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: kafka-network-policy
namespace: kafka
spec:
podSelector:
matchLabels:
app: kafka
policyTypes:
- Ingress
- Egress
ingress:
- from:
- podSelector:
matchLabels:
app: kafka
ports:
- port: 9092
- port: 9093
- from:
- namespaceSelector: {}
ports:
- port: 9092
egress:
- to:
- podSelector:
matchLabels:
app: kafka
Pod Anti-Affinity
Prevent Kubernetes from scheduling multiple Kafka pods on the same node:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- kafka
topologyKey: kubernetes.io/hostname
Add this under spec.template.spec in the StatefulSet.
Pod Disruption Budget
Ensure Kubernetes never takes down more than one Kafka pod at a time during voluntary disruptions such as node drains or rolling updates:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: kafka-pdb
namespace: kafka
spec:
minAvailable: 2
selector:
matchLabels:
app: kafka
Liveness and Readiness Probes
Add probes so Kubernetes can detect and restart unhealthy pods:
livenessProbe:
tcpSocket:
port: 9092
initialDelaySeconds: 30
periodSeconds: 10
failureThreshold: 6
readinessProbe:
tcpSocket:
port: 9092
initialDelaySeconds: 20
periodSeconds: 5
failureThreshold: 3
Resource Tuning
| Workload | CPU Request | Memory Request | Storage |
|---|---|---|---|
| Development | 250m | 512Mi | 5Gi |
| Low traffic | 500m | 1Gi | 20Gi |
| Medium traffic | 1 | 2Gi | 50Gi |
| High traffic | 2–4 | 4–8Gi | 100Gi+ |
JVM Heap Tuning
Set the JVM heap via the KAFKA_HEAP_OPTS env var. A common rule of thumb is 50% of the container memory limit, not exceeding 6Gi:
- name: KAFKA_HEAP_OPTS
value: "-Xms2g -Xmx2g"
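The rule of thumb can be made concrete with a small helper (hypothetical, purely to show the arithmetic):

```shell
# Sketch: heap = min(container memory limit / 2, 6), in Gi.
heap_gi() {
  half=$(( $1 / 2 ))
  if [ "$half" -gt 6 ]; then half=6; fi
  echo "$half"
}

heap_gi 4    # prints 2: matches the -Xms2g -Xmx2g example for this guide's 4Gi limit
heap_gi 16   # prints 6: capped, since larger heaps mostly lengthen GC pauses
```

Kafka relies heavily on the OS page cache for log segments, so memory beyond the heap is not wasted; it is used by the kernel to cache reads and writes.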
Monitoring
The official apache/kafka image exposes JMX metrics. Enable them with:
- name: KAFKA_JMX_PORT
value: "9999"
- name: KAFKA_JMX_HOSTNAME
value: "0.0.0.0"
Key metrics to monitor:
- kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions — should always be 0
- kafka.controller:type=KafkaController,name=ActiveControllerCount — should be 1 across the cluster
- kafka.network:type=RequestMetrics,name=TotalTimeMs — producer/consumer latency
- JVM GC pause times — long pauses indicate heap needs tuning
Common Troubleshooting
Pod stuck in Init state
Check the init container logs:
kubectl logs -n kafka kafka-0 -c format-storage
kubectl logs -n kafka kafka-0 -c init-node-id
No readable meta.properties error
The broker cannot find formatted storage. Ensure you are using /etc/kafka/docker/run as the entrypoint, not kafka-server-start.sh. The latter ignores KAFKA_* env vars and the broker looks in the wrong directory.
Invalid cluster.id error
The CLUSTER_ID env var does not match what was written to meta.properties during the format step. Delete the PVCs to force a reformat:
kubectl delete pvc -n kafka -l app=kafka
There must be at least one broker advertised listener during format
The temp kraft.properties used in the format-storage init container must include a non-controller listener. Ensure listeners, advertised.listeners, listener.security.protocol.map, and inter.broker.listener.name are all present in the temp properties file.
SASL authentication timeout on port 9092
The SCRAM credentials have not been registered. The JAAS file alone is not enough — run the bootstrap process in Step 7 to write the credentials into Kafka's metadata store.
UnsupportedEndpointTypeException when using --bootstrap-controller
The alterUserScramCredentials admin API is not supported via the controller quorum endpoint in Kafka 4.x. You must connect through a broker (port 9092). Use the temporary PLAINTEXT listener on port 9094 as described in Step 7.
Port 9094 connection refused
The headless service does not expose port 9094 by default. You must explicitly add it to kafka-svc.yaml during the bootstrap process (Step 7a) and remove it afterwards.
Brokers cannot reach each other
Verify the headless service exists and pod DNS resolves from within a pod:
kubectl exec -n kafka kafka-0 -- \
nslookup kafka-1.kafka-headless.kafka.svc.cluster.local
MaxFollowerLag is non-zero
One broker is behind. Check its logs for errors. Common causes are disk I/O pressure, a recent pod restart, or network instability.
Upgrading Kafka
To upgrade to a new Kafka version, update the image field in the StatefulSet. Kubernetes performs a rolling restart one pod at a time. Since KRaft requires a majority, the cluster stays available as long as at least 2 of 3 pods are running.
kubectl set image statefulset/kafka \
kafka=apache/kafka:4.3.0 \
-n kafka
kubectl rollout status statefulset/kafka -n kafka
Summary
| Step | What it does |
|---|---|
| namespace.yaml | Isolates all Kafka resources |
| kafka-svc.yaml | Headless service for stable pod DNS |
| kafka-jaas.yaml | JAAS config for inter-broker SCRAM auth |
| kafka-sasl.yaml | Secret for client applications |
| kafka-stateful-set.yaml | Brokers, init containers, storage, config |
| Step 7 — SCRAM bootstrap | Temporarily open port 9094, write credentials to metadata, close port 9094 |
The cluster is production-ready when TLS is enabled, credentials are rotated to strong values, pod anti-affinity and a PodDisruptionBudget are in place, resource limits are tuned for your workload, and monitoring is active.