Jayprakash

Posted on • Originally published at kafkaguard.com

Why 80% of Kafka Clusters Would Fail a SOC 2 Audit Tomorrow

The Uncomfortable Number

We aggregated findings from 50 production Kafka cluster scans. 80% of them had at least one finding that would fail a SOC 2 Type II audit on the spot. Not "needs improvement." Not "compensating control accepted." Fail.

The findings are not exotic. They're not edge cases. They're the same handful of mistakes, repeated across teams, frameworks, and managed-Kafka providers. This post breaks down the most common ones, what SOC 2 control they map to, and what to change.

If you're preparing for a SOC 2 audit with Kafka in scope — or you suspect an upcoming auditor question — read on. If you'd rather just scan your cluster, grab the binary and run it. Either path works.


What "in scope" actually means

Before we get to findings, the question every team gets wrong: is Kafka in your SOC 2 scope?

If your Kafka clusters touch any of the following, the answer is yes:

  • Customer data or PII
  • Payment information
  • Healthcare data
  • Authentication tokens or session data
  • Financial transactions
  • Audit logs from other in-scope systems

Most teams realise this in the auditor's second meeting, not the planning phase. By then, the surface area is larger than they remembered.


The 7 Findings That Keep Showing Up

We found 80% of clusters failing on at least one of these seven controls. Frequencies are aggregated across all 50 scans.

1. Inter-broker traffic in plaintext (73%)

SOC 2 control: CC6.7 — Restrict transmission of information.

If security.inter.broker.protocol is PLAINTEXT, replication, controller messages, and ISR state move between brokers unencrypted. Anyone with network access between brokers can read every message replicating across the cluster.

Fix:

```properties
security.inter.broker.protocol=SSL
ssl.keystore.location=/etc/kafka/ssl/server.keystore.jks
ssl.keystore.password=<keystore-password>
ssl.truststore.location=/etc/kafka/ssl/server.truststore.jks
ssl.truststore.password=<truststore-password>
```

This is a rolling-restart change. Plan it, but don't ship without it.
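
Once the rolling restart completes, verify the listener is actually negotiating TLS. A quick spot check with openssl, assuming a broker reachable at kafka-broker-1:9093 (hostname and port are placeholders):

```bash
# A TLS listener prints a negotiated protocol and cipher;
# a plaintext listener fails the handshake instead.
openssl s_client -connect kafka-broker-1:9093 </dev/null 2>/dev/null \
  | grep -E 'Protocol|Cipher'
```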

2. Wildcard ACLs on production topics (73%)

SOC 2 control: CC6.1 — Logical access controls.

ACLs granted to User:* or Group:*. Every authenticated principal can read or write the topic. The reason it persists: someone needed quick access during an incident, granted a wildcard, and never came back. The auditor will read the ACL list line by line; they will find this.

Fix: Replace wildcards with explicit principals or a dedicated group. If your ACL surface is complex, declare ACLs in IaC and audit drift on every deploy.
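
What the swap looks like with the stock ACL tooling; the broker address, topic, and principal names below are illustrative:

```bash
# Find the wildcard grants before the auditor does.
kafka-acls.sh --bootstrap-server kafka-broker-1:9093 --list | grep -F 'User:*'

# Remove the wildcard, then grant the explicit principal that actually needs access.
kafka-acls.sh --bootstrap-server kafka-broker-1:9093 --remove --force \
  --allow-principal 'User:*' --operation All --topic payments
kafka-acls.sh --bootstrap-server kafka-broker-1:9093 --add \
  --allow-principal User:billing-service --operation Read --topic payments
```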

3. No client authentication (52%)

SOC 2 control: CC6.1.

A PLAINTEXT:// listener bound to anything other than localhost. Even when the cluster is "internal-only," internal increasingly means every workload in the VPC, every compromised pod, every leaked service-account token. Network position is not an authentication mechanism in 2026.

Fix: SASL/SCRAM-SHA-512 at minimum, mTLS for stronger identity guarantees:

```properties
listeners=SASL_SSL://0.0.0.0:9093
sasl.enabled.mechanisms=SCRAM-SHA-512
```
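
Clients then need SCRAM credentials to present. Creating a user with the stock tooling, which works over --bootstrap-server on Kafka 2.7+ (the user name and password below are placeholders):

```bash
# Registers SCRAM-SHA-512 credentials for one service principal.
kafka-configs.sh --bootstrap-server kafka-broker-1:9093 \
  --alter --add-config 'SCRAM-SHA-512=[password=change-me]' \
  --entity-type users --entity-name order-service
```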

4. auto.create.topics.enable=true (44%)

SOC 2 control: CC8.1 — Authorized configuration changes.

Topics created on demand have no naming convention, no retention policy, no ACL, no documentation, no change-management trail. That's an unauthorized configuration change in CC8.1 terms.

Fix: Set it to false. Once. Topic creation routes through your provisioning pipeline like every other infrastructure change.
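
With auto-creation off, every topic becomes an explicit, reviewable change. For example (names and numbers are illustrative):

```bash
kafka-topics.sh --bootstrap-server kafka-broker-1:9093 --create \
  --topic orders.v1 --partitions 12 --replication-factor 3 \
  --config retention.ms=604800000   # 7 days, stated rather than inherited
```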

5. Outdated Kafka with unpatched CVEs (32%)

SOC 2 control: CC7.1 — Vulnerability detection.

We routinely find clusters on 2.6, 2.8, 3.1. CVE-2023-25194 (RCE in Kafka Connect via attacker-supplied SASL JAAS JNDI configuration), CVE-2024-27309 (incorrect ACL enforcement during ZooKeeper-to-KRaft migration), and others have public exploits and were disclosed long enough ago that they should have been patched.

Fix: Run kafka-broker-api-versions.sh and compare what the brokers report against the current release. Plan the rolling upgrade. If you're stuck on an old version because of a downstream client that won't upgrade, document the compensating control — and make sure it's actually compensating.
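
A minimal check, assuming a reachable broker (address is a placeholder); the API versions each broker reports map to a release and show how far behind the cluster is:

```bash
kafka-broker-api-versions.sh --bootstrap-server kafka-broker-1:9093
```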

6. JMX exposed without authentication (28%)

SOC 2 control: CC6.6 — Logical access from outside boundaries.

Default port 9999. No auth. No TLS. Anyone on the broker network can read JMX MBeans (broker internals, topic metadata, consumer group state). Some jmxremote configurations also allow code execution via deserialization.

Fix (enable auth and TLS; the credential file paths below are illustrative):

```properties
com.sun.management.jmxremote.authenticate=true
com.sun.management.jmxremote.ssl=true
com.sun.management.jmxremote.password.file=/etc/kafka/jmxremote.password
com.sun.management.jmxremote.access.file=/etc/kafka/jmxremote.access
```

Or, if you don't need JMX externally, drop it entirely.
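
Either way, verify from a host that should not have access (hostname is a placeholder):

```bash
# If this connects on an unauthenticated setup, anyone on the network can too.
nc -zv kafka-broker-1 9999
```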

7. No audit log retention (24%)

SOC 2 control: CC4.1 + CC7.2 — Ongoing monitoring.

Audit logs enabled but with no retention policy, or no audit logs at all. The auditor will ask: "Who consumed from this topic in Q1?" If the answer is "we don't know," it's a finding. If the answer is "we know, but the logs got rotated three weeks ago," it's also a finding.

Fix:

```properties
# KRaft-mode clusters use org.apache.kafka.metadata.authorizer.StandardAuthorizer instead.
authorizer.class.name=kafka.security.authorizer.AclAuthorizer
log4j.logger.kafka.authorizer.logger=INFO, authorizerAppender
```

Plus an explicit retention policy on the audit log appender. 90 days minimum for SOC 2 Type II; 365 days if you want headroom.
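
One way to make the retention explicit is a dated rolling appender in the broker's log4j.properties, mirroring the authorizerAppender stanza Kafka ships with. A sketch with placeholder paths; note that rotation alone doesn't delete old files, so pair it with a logrotate rule or cleanup job that enforces the 90-day floor:

```bash
# Append a daily-rolling appender for authorizer logs (paths are placeholders).
cat >> /etc/kafka/log4j.properties <<'EOF'
log4j.appender.authorizerAppender=org.apache.log4j.DailyRollingFileAppender
log4j.appender.authorizerAppender.DatePattern='.'yyyy-MM-dd
log4j.appender.authorizerAppender.File=/var/log/kafka/kafka-authorizer.log
log4j.appender.authorizerAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.authorizerAppender.layout.ConversionPattern=[%d] %p %m (%c)%n
EOF
```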


What to do this week

If three of these felt familiar, here are three options, in order of effort:

  1. Run a scan — KafkaGuard is open-source and a single binary. Download it from github.com/KafkaGuard/kafkaguard-releases (Linux, macOS, Docker), point it at a broker, and get a findings list in 90 seconds.

  2. Pick the highest-blast-radius finding above and fix it this week. Inter-broker TLS and wildcard ACLs are usually the worst.

  3. Schedule the audit prep meeting with your security team using this list as the agenda.

You don't need to fix all seven before the auditor arrives. You need to be able to answer "yes, we know about it, here's the remediation timeline, here's the compensating control." The cluster doesn't have to be perfect. The story does.


Methodology

The 50 scans referenced are a mix of self-hosted Apache Kafka, Confluent Platform, Amazon MSK, Aiven, and Redpanda — anonymized data from KafkaGuard runs across our user base over Q1–Q2 2026. We did not include scans against test clusters or scans run with --policy baseline-dev. Findings are deduplicated per cluster.


Originally published at kafkaguard.com. KafkaGuard is an open-source Kafka security and compliance scanner — 55 controls across PCI-DSS, SOC 2, ISO 27001. Single binary. Runs in 90 seconds.
