DEV Community: Conduktor

Every Application Could Delete Our Schemas.

Stéphane Derosiaux — Mon, 13 Jul 2026 20:49:55 +0000

Quick question about your Kafka setup: right now, which applications are allowed to change or delete which schemas?

Sometimes, it's everyone (because the Schema Registry (SR) is not protected). Sometimes, there's an nginx or Traefik in front of the SR that blocks a few routes (like global compatibility), and the team calls it done. Sometimes, basic authentication has been added, at least. Rarely is it about authorization.

Authorization is about "what are you allowed to change/break?"

Let me show you where the gap actually is, because "just put a proxy in front of it" is not enough.

Authentication isn't authorization

A schema registry exists to solve a coordination problem. Producers and consumers ship on their own schedules, so they need a shared, versioned contract for the shape of each message. Producers register a schema, consumers resolve it by ID, and the registry enforces compatibility as things evolve so one side's change doesn't blow up the other.

That makes it a very high-value target. Everything producing or consuming structured data depends on it. So "who can change it" should be a first-class question, and in the free Confluent Schema Registry, the answer is "anyone."

The default listener is plain HTTP:

# Schema Registry default — plaintext, every interface
listeners=http://0.0.0.0:8081

Per-subject access control isn't in the free version. It's a commercial add-on:

"Until either ACLs or Role-Based Access Control is also enabled for Schema Registry, any user can create, alter, and delete Schema Registry subjects."

Create, alter, and delete are exactly the operations to lock down, and the control you actually want is per-subject: the payments team can write payments-* schemas and nothing else.

It's not just Confluent

The free registry is the most obvious offender, but it's not alone. Every option handles per-subject authorization differently, and most of it is off by default:

Approach	Per-subject authz	Default	Uses your Kafka identity
Confluent SR (free)	none	open, plain HTTP	no
Apicurio	registry-wide roles only	off	no
Karapace	yes, regex ACL per subject	off	no
Confluent RBAC / plugin	yes	off, commercial	Confluent's plane
Generic HTTP proxy	coarse, path-prefix only	you build it	no — matches the URL

Per-subject control exists in a couple of places, but it's either a hand-edited auth file (Karapace) or a paid platform tier (Confluent RBAC). None of the free or open-source options tie the rule to the Kafka identities you already manage, or give you an audit trail that lines up with Kafka.

Why a generic HTTP proxy is a bad solution

Since the registry speaks HTTP, the natural thinking is to set up an nginx or an API gateway and only allow the routes you want. For basic guardrails on a single registry, that's cheap and reasonable. Subject names even show up in the path, so you can gate writes by prefix:

# Allow writes only to payments-* subjects
location ~ ^/subjects/payments-.*/versions {
    proxy_pass http://schema-registry:8081;
}

Looks fine. Then you need to maintain the nginx file and handle the edge cases, and you can only reason about URLs: not schemas, not identities. Here's the one that always gets people:

GET /schemas/ids/42

That fetches a schema by numeric ID. There's no subject in the path to match on, so your careful payments-* rule doesn't apply — and that call can return any schema, including another team's.
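
To make the gap concrete, here is a minimal Python sketch of what a path-only rule can see. The regex mirrors the nginx location above; the `subject_from_path` helper and the sample subject name are illustrative assumptions, not part of any real proxy.

```python
import re

# Mirrors the nginx rule above: the subject is only visible when it is in the path.
SUBJECT_PATH = re.compile(r"^/subjects/(?P<subject>[^/]+)(/|$)")

def subject_from_path(path: str):
    """Return the subject named in the URL path, or None if there is none."""
    match = SUBJECT_PATH.match(path)
    return match.group("subject") if match else None

for path in ("/subjects/payments-orders-value/versions", "/schemas/ids/42"):
    print(path, "->", subject_from_path(path))
```

The second path yields `None`: a URL-based rule has nothing to compare against `payments-*`, so it has to either allow every ID lookup or block them all.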

A generic proxy has other blind spots too:

It authorizes the network path, not your Kafka principal, so it's only as strong as the firewall around port 8081.
It assumes subjects are named after topics. Switch to RecordNameStrategy and the subject is a record name, so your prefix rules stop lining up.
It can't tell a deliberate registration from a client silently auto-registering on first produce:

# One flag away from clients registering schemas you never reviewed
auto.register.schemas=true
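
The naming assumption from the list above is easy to see once the three standard subject name strategies are written out. This sketch only reproduces their naming conventions in plain Python; the topic and record names are made up for illustration.

```python
def topic_name_strategy(topic: str, record: str, is_key: bool = False) -> str:
    # TopicNameStrategy: <topic>-key or <topic>-value
    return f"{topic}-{'key' if is_key else 'value'}"

def record_name_strategy(topic: str, record: str, is_key: bool = False) -> str:
    # RecordNameStrategy: the fully qualified record name, topic not included
    return record

def topic_record_name_strategy(topic: str, record: str, is_key: bool = False) -> str:
    # TopicRecordNameStrategy: <topic>-<fully qualified record name>
    return f"{topic}-{record}"

topic, record = "payments-orders", "com.example.billing.OrderPlaced"  # hypothetical names
for strategy in (topic_name_strategy, record_name_strategy, topic_record_name_strategy):
    subject = strategy(topic, record)
    matches = subject.startswith("payments-")
    print(f"{strategy.__name__}: {subject} (matches payments-*: {matches})")
```

With RecordNameStrategy the subject no longer starts with the topic name, so a `payments-*` path rule silently stops covering it.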

Going down this path creates an abomination you'll need to maintain forever, with your own model of subjects and owners inside your proxy. That's a lot of infrastructure to own just to answer "who can touch payments schemas."

What is "closed by default"?

Per-subject authorization should be:

By subject, prefix, and wildcard, not by URL path that happens to contain a subject sometimes.
Tied to the Kafka identities you already use, the same principals you manage for topic ACLs, not a second, parallel access system.
Logged end to end, every operation and every denial, in a trail that lines up with the rest of your Kafka audit.
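
As a rough sketch of those three properties together (subject patterns, Kafka principals, a full audit trail), something like the following would do. The rule table, principal names, and function names are hypothetical, and this is not how any particular product implements it.

```python
from fnmatch import fnmatchcase

# Hypothetical rule table: (Kafka principal, subject pattern, allowed operations)
RULES = [
    ("User:payments-svc", "payments-*", {"read", "write"}),
    ("User:analytics-svc", "payments-*", {"read"}),
]

audit_log = []  # every decision is recorded, allowed or denied

def is_allowed(principal: str, operation: str, subject: str) -> bool:
    """Closed by default: a request passes only if some rule explicitly allows it."""
    allowed = any(
        principal == rule_principal
        and fnmatchcase(subject, pattern)
        and operation in operations
        for rule_principal, pattern, operations in RULES
    )
    audit_log.append((principal, operation, subject, "ALLOW" if allowed else "DENY"))
    return allowed

print(is_allowed("User:payments-svc", "write", "payments-orders-value"))   # True
print(is_allowed("User:analytics-svc", "write", "payments-orders-value"))  # False
```

The decision is keyed on the subject and the principal, never on the URL, and a denial leaves the same kind of audit entry as an approval.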

The way to get all three without a platform license or bespoke plumbing is a schema-aware proxy, something that understands subjects and requests rather than URLs. That's the category Conduktor's Schema Registry Proxy sits in: it fronts the registry you already run, checks read and write permissions per subject, logs every call, and doesn't need any producer or consumer changes. You just point schema.registry.url at the proxy and keep your current registry, Confluent or open source:

# Clients don't change — they just talk to the proxy
schema.registry.url=http://schema-registry-proxy:8081

Now you have a protected schema registry, reads and writes, that can even be linked to a self-service framework where ownership is a first-class citizen.

Every Kafka Cluster Eventually Hits This Networking Problem

Stéphane Derosiaux — Mon, 06 Jul 2026 17:42:50 +0000

If you've ever tried to connect a Kafka client that lives in a different VPC than the cluster, you've probably hit this issue where there's a route between the two networks, telnet works, and yet the client still can't consume a single record. It feels like a networking bug. It isn't.

The issue is that reaching Kafka across VPCs is an addressing problem, not a networking one. Once that clicks, it makes sense why peering, transit gateways, and per-broker load balancers don't work.

What your client actually does when it connects

Most services are easy to reach across a networking boundary. Put a load balancer in front, give it one address, point clients at it. Done.

Kafka doesn't work like that. A client connects in two stages:

Bootstrap. The client talks to any broker and asks one question: who's in this cluster?
Direct connection. The broker answers with metadata — a list of every broker, each named by its own advertised.listeners address. The client then opens direct connections to specific brokers: the leader for each partition it reads or writes (or the nearest replica, if you're using follower fetching, KIP-392, Kafka 2.4+).

That second stage is the one to remember. One shared address in front of the cluster does not work, because the client has to resolve and route to each broker's advertised address, exactly as the cluster hands it back.

So if a broker advertises broker-1.cluster.internal:9092, a private name that only resolves inside the cluster's VPC, a remote client connects to the bootstrap fine, then fails when it tries to connect directly to one of the brokers:

docker exec kafka-consumer-a kcat -b kafka:9092 -L -m 5
# -> Failed to resolve 'kafka:9092'

A network path between two VPCs is not the same as Kafka reachability. The client still has to resolve and route to every broker's advertised address from where it sits.
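
Here is a toy simulation of the two stages, just to show where things break. The per-network DNS view and the bootstrap hostname are invented for the example, and no real Kafka client is involved.

```python
# Names each network can resolve (hypothetical DNS view per VPC)
RESOLVABLE = {
    "cluster-vpc": {"bootstrap.kafka.example", "broker-1.cluster.internal", "broker-2.cluster.internal"},
    "client-vpc": {"bootstrap.kafka.example"},  # only the bootstrap name is published here
}

# What the cluster returns in its metadata response: the advertised addresses
ADVERTISED = ["broker-1.cluster.internal:9092", "broker-2.cluster.internal:9092"]

def unreachable_brokers(network: str, bootstrap: str) -> list:
    """Return the advertised brokers a client in this network cannot resolve."""
    names = RESOLVABLE[network]
    # Stage 1: bootstrap
    if bootstrap.split(":")[0] not in names:
        raise ConnectionError(f"cannot resolve bootstrap {bootstrap}")
    # Stage 2: the client must reach every advertised address by itself
    return [addr for addr in ADVERTISED if addr.split(":")[0] not in names]

print(unreachable_brokers("cluster-vpc", "bootstrap.kafka.example:9092"))  # []
print(unreachable_brokers("client-vpc", "bootstrap.kafka.example:9092"))   # both brokers
```

The remote client passes stage 1 and still cannot produce or consume, because every address it was handed in stage 2 is private to the cluster's network.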

"Then just make the broker advertise something reachable". Sure, brokers support multiple listeners, one internal and one external. But every listener is another port to open on every broker, and managed Kafka (MSK, Confluent Cloud, Aiven) won't let you touch them anyway. Listeners don't scale to a dozen independent networks across accounts and clouds. Which is exactly the situation you're in.

Why peering collapses the moment you have more than two networks

VPC peering is genuinely great for connecting two VPCs. Past two, here be dragons:

It's point-to-point and non-transitive. Peering A and B each to the cluster does not let A talk to B. Every new client network needs its own peering straight into the cluster's VPC.
Your cluster's VPC becomes an accidental hub. Ten client networks means ten peerings, ten route-table entries, and ten security-group conversations on a VPC that was never designed to be a hub.
CIDRs have to be coordinated. Peered VPCs can't overlap ranges, so every team negotiates address space.
Managed Kafka caps it. Providers limit how many peerings or PrivateLink attachments you get, and none of them let you rewrite broker listeners.
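
The CIDR coordination point is the easiest one to check mechanically. A small sketch with Python's standard `ipaddress` module, using a made-up address plan:

```python
import ipaddress
from itertools import combinations

# Hypothetical address plan: the cluster VPC plus three client VPCs
vpcs = {
    "cluster": "10.0.0.0/16",
    "team-a": "10.1.0.0/16",
    "team-b": "10.0.128.0/17",  # sits inside the cluster range
    "team-c": "10.1.0.0/16",    # same range as team-a
}

def overlapping_pairs(cidrs: dict) -> list:
    """List every pair of VPCs whose address ranges overlap."""
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in cidrs.items()}
    return [
        (a, b)
        for a, b in combinations(networks, 2)
        if networks[a].overlaps(networks[b])
    ]

print(overlapping_pairs(vpcs))
# [('cluster', 'team-b'), ('team-a', 'team-c')]
```

Every pair in that output is an address-space negotiation somebody has to have before the next peering goes in.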

The recurring pain is one line: every new client forces another change onto the cluster's VPC, the one piece of infra you least want to keep editing.

The fix: let a Kafka-aware proxy rewrite the addresses

The solution: put something in the connection path that actually understands the Kafka protocol instead of just shuffling packets, and have it rewrite the broker addresses in the metadata response before the client ever sees them.

That's what Conduktor Gateway does. A broker advertises kafka-internal:9092; the Gateway rewrites that to an address the client can reach. Because clients connect to specific brokers (the partition leaders), the Gateway maps each broker to its own port so traffic still routes deterministically to the right one behind it.
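
To picture the rewrite, here is a simplified model in Python. The "base port plus offset" scheme and the broker hostnames are assumptions for illustration only; the Gateway's real port allocation may differ.

```python
# Brokers as advertised by the cluster: broker id -> (host, port). Hypothetical names.
advertised = {
    1: ("kafka-internal-1", 9092),
    2: ("kafka-internal-2", 9092),
    3: ("kafka-internal-3", 9092),
}

GATEWAY_HOST = "conduktor-gateway.hub"
BASE_PORT = 9092  # assumption: one gateway port per broker, base + offset

def rewrite_metadata(brokers: dict) -> dict:
    """Map each broker to its own port on the gateway host, in broker-id order."""
    return {
        broker_id: (GATEWAY_HOST, BASE_PORT + offset)
        for offset, broker_id in enumerate(sorted(brokers))
    }

def upstream_for(port: int, brokers: dict) -> tuple:
    """Reverse lookup: which real broker does a given gateway port forward to?"""
    broker_id = sorted(brokers)[port - BASE_PORT]
    return brokers[broker_id]

print(rewrite_metadata(advertised))
print(upstream_for(9094, advertised))  # ('kafka-internal-3', 9092)
```

The client only ever sees the gateway host, yet a connection to a given port still lands on one specific broker, which is what partition leadership requires.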

From the client's side, the only change is a single line:

# before: pointing straight at a broker it can't actually reach
bootstrap.servers=broker-1.cluster.internal:9092

# after: pointing at the proxy, which hands back addresses it can
bootstrap.servers=conduktor-gateway.hub:9092

Credentials don't change. The client presents the same SASL credentials it already has, the Gateway forwards them to the broker, and Kafka ACLs (or Confluent RBAC) still decide what it's allowed to do. The proxy is stateless, it stores no credentials and adds no new auth model.

The nice architectural payoff: the Gateway lives in its own hub VPC. Peerings or PrivateLink attachments land on the hub, never directly between a client and the cluster. New networks attach to the hub; the cluster keeps a single attachment no matter how many clients reach in, and its listeners are never touched.

The Gateway isn't a substitute for a route: you still need connectivity between the hub and each VPC. What it removes is the per-client peering into the cluster, the broker reconfiguration, and the requirement that every advertised address be reachable from every client.

"But what about a transit gateway? Or an LB per broker?"

Won't work.

A cloud transit gateway is one hub many VPCs attach to for any-to-any IP connectivity. But it operates at L3 so it moves IP packets with zero awareness of Kafka. It doesn't change what brokers advertise, so you're back to sharing private hosted zones across accounts. Overlapping CIDRs still break it without NAT, it's confined to one cloud, and on managed Kafka you still can't touch the listeners.

A load balancer per broker can't be round-robined, because clients connect to specific leaders. So it's one NLB (or target group, or PrivateLink endpoint) per broker, plus advertised.listeners reconfigured: N pieces of plumbing to keep in sync, each with its own port, DNS record, and TLS SAN. Scale or replace a broker and the mapping churns, forcing a rolling restart. And on managed Kafka you usually can't set advertised listeners at all — so it's off the table before you begin.

Approach	Works at	Fixes the advertised-address problem?	Managed Kafka?	Cross-cloud?
VPC peering	L3	No	Capped by limits	No
Transit gateway	L3	No	No	On-prem yes, other cloud no
NLB per broker	L4	Only if you reconfigure listeners	Usually not	Partial
Kafka-aware proxy	Kafka L7	Yes, automatically	Yes	Yes

Reproduce the whole thing on one machine

The free Gateway Community quickstart stands up the exact scenario: a Kafka cluster in a private Docker network the clients can't reach, two consumers in two separate "VPC" networks, and the Gateway as the only container joined to all three.

bash <(curl -fsSL https://releases.conduktor.io/gateway-community-quickstart)

Consumer A can't even resolve the broker directly — different network:

docker exec kafka-consumer-a kcat -b kafka:9092 -L -m 5
# -> Failed to resolve 'kafka:9092'

Same consumer, same SASL credentials, one different bootstrap address — now it reads:

docker exec kafka-consumer-a kcat -b conduktor-gateway:9092 -t customers -C -e -c 3 \
  -s value=avro -r http://karapace:8081 \
  -X security.protocol=SASL_PLAINTEXT -X sasl.mechanism=PLAIN \
  -X sasl.username=consumer-a -X sasl.password=consumer-a-secret
# -> readable records

And the log line that proves why:

docker logs conduktor-gateway 2>&1 | grep "Rewriting METADATA"
# kafka:9092 -> conduktor-gateway:9092

That's the whole trick. The broker advertises an address the client can't reach; the proxy rewrites it to one it can.

Wrapping up

If cross-VPC Kafka has been a recurring issue on your platform team, try rethinking it: you don't have a networking problem, you have an addressing problem, and a Kafka-aware proxy is the one thing that fixes the address without you touching the cluster.

If you later want to encrypt fields, enforce schemas, or isolate tenants on that same path, this same proxy will do it too.

If you want to poke at it, the Gateway Community quickstart is free and runs on one machine. Curious how you're solving this today, peering everything everywhere, transit gateways, something else? Let me know.

We Measured Kafka Usage. The Results Surprised Us.

Stéphane Derosiaux — Mon, 29 Jun 2026 14:32:43 +0000

I sit in a lot of Kafka reviews. Vendors, instances, replication, tiered storage, advanced stuff like fetch-from-follower, networking, partitions, best practices etc. Most discussions are driven by tech only, instead of looking at the big picture: how this beautiful infra is being used.

Unpopular opinion: most of your Kafka cost is not due to infrastructure, it's due to a usage problem.

Where the cost actually comes from

Vendor calculators are hard to compare because they bake in so many assumptions: replication multipliers, disk class, compression ratio, tiered storage (billed at the replicated rate or the actual S3 rate). The price you see is almost never what you pay.

RF=3 multiplies the per-GB price by 3 everywhere. And tiered storage is often still billed at the replicated rate even though only one copy lives in S3. You're paying the RF=3 rate for data Kafka no longer replicates.
Cross-VPC, in-region traffic between your account and the vendor's lands on your cloud bill, roughly 1c/GB each way depending on the path.
Without fetch-from-follower, most consumer fetches cross AZ boundaries. With leaders and consumers spread evenly over three AZs, ~2/3 of consumer reads go cross-AZ: a partition's leader lives in one AZ, so consumers in the other two have to reach across.
Compression is often just... off.

With zstd at sane batch sizes, JSON-ish logs and metrics commonly compress 8–10x:

compression.type=zstd
batch.size=65536          # 64KB
linger.ms=20

Going from 5x to 10x halves your stored bytes and halves the replication bytes flowing inside the cluster. You pay for that traffic three times over at RF=3, so the ratio matters.
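
The arithmetic is short enough to write down. A sketch with an arbitrary 1,000 GB of uncompressed data (the volume is illustrative, the formula is just division and multiplication):

```python
def cluster_bytes(raw_gb: float, compression_ratio: float, replication_factor: int = 3) -> float:
    """GB stored across the cluster for a given volume of uncompressed data."""
    return raw_gb / compression_ratio * replication_factor

raw = 1000.0  # GB of uncompressed data, illustrative
at_5x = cluster_bytes(raw, 5)    # 600.0 GB across three replicas
at_10x = cluster_bytes(raw, 10)  # 300.0 GB across three replicas

print(at_5x, at_10x)
```

Whatever the compression ratio saves on one copy, RF=3 saves it three times.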

And fetch-from-follower, available since Kafka 2.4, is a broker + consumer config away. Same-AZ traffic inside your VPC is free on AWS, so no cross-AZ tax:

# broker
replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector
broker.rack=us-east-1a

# consumer — must match the broker's rack value
client.rack=us-east-1a
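
The ~2/3 cross-AZ figure quoted earlier falls out of simple counting, assuming leaders and consumers are spread evenly and consumers always read from the leader. A minimal sketch:

```python
from fractions import Fraction

def cross_az_read_fraction(az_count: int) -> Fraction:
    """Share of consumer fetches that cross an AZ boundary when leaders and
    consumers are spread evenly across AZs and consumers read from the leader."""
    # A consumer shares an AZ with the leader with probability 1 / az_count.
    return 1 - Fraction(1, az_count)

print(cross_az_read_fraction(3))  # 2/3
```

With `client.rack` set and a same-AZ replica available, that fraction can drop toward zero for consumer reads, which is the whole point of the config above.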

Do all of it: fetch-from-follower, tiered storage, compression enforcement, partition right-sizing, BYOC to apply your existing cloud discount, single-AZ topics where you can tolerate it. But notice that it's still infrastructure tuning. Let's go up.

Cost is a stack, not a line item

When you tune anything in Kafka, you think in layers, bottom-up: hardware, JVM, broker config, producer/consumer tuning, topic design, application code. Same for cost:

Cloud infrastructure: instance types, AZ placement, networking, BYOC negotiation. At big contract sizes, negotiated networking discounts can hit 90%, but only if traffic flows through your account.
Broker & protocol tuning: compression, retention, RF, fetch-from-follower, tiered storage, partition count. Easy, they're config changes.
Architecture: diskless topics, Iceberg topics, single-AZ topics, proxies between clients and brokers, virtual clusters for multi-tenancy and non-prod consolidation.
Usage: fan-out, governance, discovery, self-service.

Often the payoff goes up as you go higher (more system thinking). Everyone's comfortable arguing about GP2 vs GP3 (volume types on AWS). Almost nobody asks "why are 40% of these partitions doing nothing?"

Speaking of which: most clusters carry 40–70% partition waste, did you know that? On managed Kafka that's per-partition-hour billing. On self-managed, you hit the ~4,000–6,000 partition-replicas-per-broker ceiling (RF=3 turns 100k partitions into 300k replicas to host and track). KRaft raises the ceiling but it doesn't make the waste free.
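
To put numbers on that ceiling, here is the replica math as a quick sketch. The per-broker limits are the range quoted above; the function names are mine.

```python
import math

def replicas(partitions: int, replication_factor: int = 3) -> int:
    """Total partition replicas the cluster has to host and track."""
    return partitions * replication_factor

def min_brokers(partitions: int, replicas_per_broker: int, replication_factor: int = 3) -> int:
    """Smallest broker count that keeps every broker under the replica ceiling."""
    return math.ceil(replicas(partitions, replication_factor) / replicas_per_broker)

print(replicas(100_000))            # 300000
print(min_brokers(100_000, 4_000))  # 75
print(min_brokers(100_000, 6_000))  # 50
```

If 40% of those partitions are idle, a good share of that broker count exists only to carry waste.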

Fan-out is the whole point of Kafka

Kafka exists so that one byte written can be read by N independent consumers, decoupled in time, with zero coordination back to the producer. That's the log abstraction's reason to live.

Do you measure your average fan-out? If it's 1, you probably shouldn't be running Kafka at all, you're paying for a distributed log to do a point-to-point queue's job. LinkedIn famously ran at ~5.4: the same bytes, written once, read by 5.4 independent teams.

Cluster cost stays flat while consumers grow, so cost per business outcome decreases the more we consume existing topics:

cost_per_use_case = cluster_cost / fan_out

fan-out 1  ->  $X      (one team carries the whole bill)
fan-out 3  ->  $X / 3
fan-out 5  ->  $X / 5  (same hardware, five outcomes)
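
If you want a first number, one rough way to estimate average fan-out is consumer bytes read divided by producer bytes written over the same window. That ratio is my approximation, not an official metric, and the byte counts must exclude inter-broker replication traffic.

```python
def average_fan_out(bytes_in: float, bytes_out: float) -> float:
    """Approximate fan-out: consumer bytes read per producer byte written.
    Both counters must exclude inter-broker replication traffic."""
    if bytes_in <= 0:
        raise ValueError("bytes_in must be positive")
    return bytes_out / bytes_in

def cost_per_use_case(cluster_cost: float, fan_out: float) -> float:
    """The formula above: the same cluster bill split across independent readers."""
    return cluster_cost / fan_out

print(average_fan_out(bytes_in=100, bytes_out=540))  # 5.4
print(cost_per_use_case(90_000, 3))                  # 30000.0
```

The dollar figure is illustrative; the point is that the denominator is something you can measure today.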

"Is our Kafka usage growing?" is the wrong question. More business use-cases reading existing data is the best money you'll ever spend. Duplicated topics created because nobody could find the existing one are pure waste: more storage, more replication, more pipelines, all because discovery and ownership are missing.

The same goes for partitions: people over-provision because nobody knows how to size them, and you can't reduce partition count after the fact (breaks key ordering). The only way to surface that waste is chargeback at the team-and-topic level. You can't optimize what you can't attribute.

"A third of our traffic, we know what it has to do with, but we don't know exactly what they're doing."

That's the usage layer leaking. It costs money, and nobody can fix it because nobody knows how, knows where to look, or owns it. It's not an infra problem; it's governance, discovery, and self-service.

Cost optimization is everybody's concern and nobody's objective. Teams over-provision because "what if we need it later" and "what if it breaks when we touch it" are rational fears. "It's expensive" is not a business case. What works is showing the waste, the annual dollar number, and the effort to reclaim it, with a name next to it.

2026: Where to spend your effort

Most deployments I see have way more headroom in the usage layer than the infra layer: topics nobody reads, partitions nobody needs, teams who'd benefit from streaming but find it too painful to onboard.

There's a funny industry reflex here too. We chase the next architectural shiny thing, diskless, Iceberg topics, single-AZ, before we've answered the boring questions: who's using this, for what, and why aren't more teams using it?

My actual recommendation:

Do the infrastructure pass once. Instance types, AZ placement, BYOC.
Do the config pass once. Compression, retention, partition right-sizing, fetch-from-follower, tiered storage.
Spend the rest of the year on the usage layer. Fan-out, ownership, discovery, chargeback, self-service.

Steps 1 and 2 are a sprint. Step 3 is the marathon.

If you want to see where your usage layer is leaking, Conduktor's field engineering team does a free Kafka cost analysis: they'll map cost back to teams and topics and show you where the payoff sits. And if you just want to keep reading, Why Kafka Costs Keep Rising and the partition waste deep-dive are good next stops.

What's your average fan-out? If you don't know it off the top of your head, that's probably where I'd start.

We Built a Kafka Proxy. Here's Everything It Ended Up Doing.

Stéphane Derosiaux — Mon, 15 Jun 2026 13:43:15 +0000

At Current 2026, I realized that nobody knows exactly what a Kafka proxy can do.

Most engineers and architects think it's just some kind of reverse proxy for Kafka (think nginx) that does routing and bridges a legacy or non-native client to the cluster.

That's not it. It's barely the start of it.

Encryption

For instance, an engineer at a UK building society had a hard requirement: encrypt personally identifiable fields before they ever hit Kafka: emails, national insurance numbers, that kind of data.

His team built encryption into the application layer. Every producer that touched PII got encryption code. Every consumer got decryption code. Key handling, rotation, etc. to manage across services.

Something like this:

// In every producer that touches PII...
public ProducerRecord<String, Customer> encrypt(Customer c) {
    c.setEmail(crypto.encrypt(c.getEmail(), keyRef("pii-key")));
    c.setSsn(crypto.encrypt(c.getSsn(), keyRef("pii-key")));
    return new ProducerRecord<>("customers", c.getId(), c);
}

// And the mirror image in every consumer...
public Customer decrypt(Customer c) {
    c.setEmail(crypto.decrypt(c.getEmail()));
    c.setSsn(crypto.decrypt(c.getSsn()));
    return c;
}

Multiply that by all the microservices you now have to update and maintain (across languages, versions, KMS access, etc.). That's quite expensive, both to implement and to maintain.

He didn't know a Kafka proxy could have done the whole thing at the record level, outside the apps. When we chatted about it, he realized he might have made a mistake.

What a Kafka proxy does

A Kafka proxy sits between your clients and your brokers and speaks the Kafka protocol. Clients connect to it exactly like they'd connect to a broker. No SDK, no app changes. It works for Kafka clients, Kafka Connect, Kafka Streams, Flink, Spark, etc. It's fully transparent to them.

That makes it a natural place to put policy that doesn't belong inside your application and doesn't belong inside the cluster either.

Encryption is the obvious one. Instead of touching dozens of applications, you declare the rule once. With Conduktor Gateway it's what we call an interceptor: a small piece of config applied to traffic matching a topic pattern. Roughly:

{
  "kind": "Interceptor",
  "apiVersion": "gateway/v2",
  "metadata": { "name": "encrypt-customer-pii" },
  "spec": {
    "pluginClass": "io.conduktor.gateway.interceptor.EncryptionPlugin",
    "priority": 100,
    "config": {
      "topic": "customers.*",
      "kmsConfig": { "kms": "VAULT", "vault": { "uri": "https://vault:8200" } },
      "recordValue": {
        "fields": [
          { "fieldName": "email", "algorithm": "AES256_GCM", "keySecretId": "pii-key" },
          { "fieldName": "ssn",   "algorithm": "AES256_GCM", "keySecretId": "pii-key" }
        ]
      }
    }
  }
}

The proxy encrypts the fields on the way in, and authorized consumers get them decrypted on the way out; everyone else gets ciphertext. The application code shrinks back to just... sending a record, not dealing with KMS and secrets:

// Same producer, after.
producer.send(new ProducerRecord<>("customers", c.getId(), c));

Because the rule lives in one declarative place (the proxy), you can do things that are painful at the app layer.

For instance, crypto-shredding for GDPR. Delete the key, and every message encrypted with it becomes unreadable, instantly, across all your retention. You don't go hunting through topics for one person's data. You revoke a key. Done.

Masking, validation, isolation: same one place

Once the proxy is in the Kafka path, the same pattern opens a lot of doors:

Field-level masking: show j***@example.com to one team, the real value to another, same topic.
Schema and payload validation: reject malformed records at the edge instead of poisoning a downstream consumer.
Topic aliasing for migration: point clients at a stable name while you move the real topic between clusters.
Virtual clusters: carve one physical cluster into isolated tenants without standing up new infrastructure.
Audit and policy enforcement: log and gate access without patching the broker or the client.
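
Field-level masking is the simplest of these to picture. A small sketch of the idea in plain Python; the team names and the masking rule are illustrative, and in practice this would be declared as proxy config rather than written as code.

```python
def mask_email(email: str) -> str:
    """Keep the first character of the local part and the full domain."""
    local, sep, domain = email.partition("@")
    if not sep or not local:
        return "***"  # not an email-shaped value: hide it entirely
    return f"{local[0]}***@{domain}"

def view_for(team: str, record: dict, unmasked_teams: set) -> dict:
    """Return the record as a given team would see it. Same topic, different view."""
    if team in unmasked_teams:
        return dict(record)
    return {**record, "email": mask_email(record["email"])}

record = {"id": "c-1", "email": "jane@example.com"}
print(view_for("support", record, unmasked_teams={"fraud"}))  # email: j***@example.com
print(view_for("fraud", record, unmasked_teams={"fraud"}))    # email: jane@example.com
```

The stored record is untouched; only what each reader gets back changes.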

None of that touches application code or broker config. It's policy, declared once, enforced in the path.

Why this connects to cost and self-service

The other big topic at the conference was cost. In April, Conduktor published a field guide on where Kafka costs hide.

Then the self-service conversation: teams want developers to create topics and request access autonomously, but a simple topics-on-GitOps solution is just not enough, because self-service without guardrails easily turns Kafka into a mess:

"If somebody goes onto the tool and adds in something ridiculous, like a thousand partitions, we need someone to have eyes on that. That's something we've learned we can't let go of."

Look at what encryption-at-the-proxy, cost guardrails, and self-service approval gates have in common. They're all policy that belongs between your developers and your brokers, not baked into either one. Push it into the app and you copy-paste it dozens of times. Push it into the cluster and you can't change it without a migration. Put it in the layer in between, declare it once, and you can actually govern it.

To build AI agents on top of streaming, the plumbing must come first: ownership, schema discipline, key custody, data quality before the data even enters Kafka, etc. A proxy that enforces structure and policy is a big chunk of that plumbing.

So where does your policy live?

People think a proxy just passes packets. What it really is: a safekeeper that holds the policy.

If you've got encryption, masking, validation, or multi-tenant isolation scattered across your services right now, it's worth asking whether any of it should be living one layer down instead.

Want to go deeper? The Gateway overview walks through the interceptor model, and the original Current 2026 write-up has the rest of what we heard on the floor.

I Gave Claude Admin Access to My Kafka Cluster

Stéphane Derosiaux — Mon, 08 Jun 2026 13:34:35 +0000

Your AI coding assistant can explain consumer groups, rebalancing, and exactly-once semantics. Ask it to actually set up a Kafka platform with governance, though, and it won't be able to do that on its own.

Between hallucinations, misunderstandings, production impact (I actually saw Claude mess up a rolling upgrade of Kafka brokers), and the lack of knowledge of the products your Kafka infra relies on, there's a lot working against it.

The models, besides their training, have zero context about your infra. They've never seen your cluster, don't know your policies (technical, governance), and often have no way to check anything against your actual environment.

You can give it the missing context using Conduktor.

The thing that was missing

There is an open-source Conduktor skill you install into your AI assistant. It works with Claude Code, Cursor, VS Code Copilot, Gemini CLI:

npx skills add conduktor/skills

This teaches the agent the whole platform and how to run processes against it: Console, Gateway, and the CLI, so it can be efficient and not hallucinate.

After the install, the agent discovers your environment (Kafka clusters, Schema Registry, policies, etc.), asks questions based on what it finds, generates configs with real values and best practices, and runs everything with dry-run validation before it touches anything.

The CLI is really its "hands", and it goes deeper than MCP alone. The skill is the playbook where all the experience and practices from years of usage are written down. This makes a big difference versus "generate some YAML and cross your fingers".

Starting from absolutely nothing

You can start from scratch with just Docker running and nothing else. No Kafka, no Conduktor, no config. When I just ask this (with the Conduktor skill set up):

install Conduktor and set it up so I can login

It checked my environment, asked what I was trying to do, wrote a docker-compose.yml, spun up the containers, hit one error along the way, self-corrected, and handed me a working platform, Kafka & Console perfectly configured.

I could ask the same but on my production Kubernetes. It would follow best practices too, use Helm, discover my environment, etc., and in minutes everything would be wired perfectly, with policies already in place.

This is much more powerful than a "human" quickstart, as the range of applications it covers is just wider and more production-ready already. The agent knows the Kafka domain, and with the skill it knows Conduktor, so the combination of both makes it ask me the right questions.

Governance, without becoming a Kafka lawyer

Running Kafka isn't the hard part anymore. Making it safe for a team to share is the hard part: naming conventions, ownership boundaries, policies. This is what prevents a Kafka cluster from turning into a wasteland of test-topic-final-v2.

The beautiful thing is to be able to ask large prompts like this now:

set up governance for two teams, Payments and Analytics, with topic policies and cross-team permissions

It worked in stages and figured out the dependency ordering itself. When the API rejected something, it read the rejection, restructured the YAML, and retried, with minimal hand-holding from me (just asking what policies I want based on what's possible). It ended up creating the following:

TopicPolicy objects: locking down naming per team, enforcing safe defaults (retention, replication, required labels) across every topic.
Application objects with non-overlapping resource boundaries to define ownership of resources and teams.
Topics with descriptions and labels in the catalog.
Cross-team permission giving Analytics read access to payments.orders.*.

This is federated ownership in practice: the platform team sets the boundaries, developers move freely inside them. Normally that knowledge takes months to accumulate and lives in spreadsheets or Jira tickets. Here it lives in a skill file that every agent on the team can read.

Now flip to the developer side

Once those guardrails exist, a developer on the Payments team installs the same skill and never has to know any of it happened. No ApplicationInstance, no TopicPolicy, no YAML. They just talk.

"What topics do we have?"

The agent runs conduktor get Topic and shows the catalog — descriptions, owners, labels, visibility.

"I need a topic for my service."

The agent checks their ApplicationInstance, reads the policy constraints (naming prefix payments.*, retention one-to-seven days, a required data-criticality label), asks what the topic is for, generates compliant YAML, dry-runs it, and applies:

Topic/payments.fulfillment.shipped: Created
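
To give a feel for what "compliant by default" checks, here is the same set of constraints as a plain Python function. This is only a sketch of the logic; it is not the TopicPolicy format and not what the agent or the platform actually runs.

```python
DAY_MS = 24 * 60 * 60 * 1000

def policy_violations(name: str, retention_ms: int, labels: dict) -> list:
    """Check a topic request against the constraints described above."""
    violations = []
    if not name.startswith("payments."):
        violations.append("name must start with 'payments.'")
    if not DAY_MS <= retention_ms <= 7 * DAY_MS:
        violations.append("retention must be between one and seven days")
    if "data-criticality" not in labels:
        violations.append("label 'data-criticality' is required")
    return violations

print(policy_violations("payments.fulfillment.shipped", 3 * DAY_MS, {"data-criticality": "high"}))
# []
print(policy_violations("test-topic-final-v2", 30 * DAY_MS, {}))
# three violations
```

The developer never sees this list unless something is wrong, which is the experience you want from a guardrail.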

The developer just got a topic that's compliant by default. Without the skill, that's most likely a Jira ticket, plus asking the platform team what the right shape is and what to put in it.

"How do I produce to my topic?"

It reads the cluster config, grabs the real bootstrap server, and hands back working code:

from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:19092",
})

producer.produce(
    "payments.fulfillment.shipped",
    key="ord-123",
    value='{"orderId": "ord-123", "status": "shipped"}',
)
producer.flush()

Copy, paste, run.

"I need to read the Analytics team's clickstream."

The agent finds that analytics.clickstream.pageviews belongs to the Analytics team, then writes a read-only permission scoped to exactly that topic, at both the Kafka and Console layers. The developer doesn't know what an ACL is or what patternType: LITERAL means. They asked in English and got access.

What I actually take away from this

This walkthrough only touched governance and onboarding. The skill also covers Gateway (Kafka proxy) encryption, data quality rules, Terraform export, and CI/CD scaffolding.

Try it, it's one command:

npx skills add conduktor/skills

It's open source, so if you hit a workflow it handles badly, open a PR. And if you're new to Conduktor, the Community Edition is free and self-hosted, and the skill will do the install for you.

This post was adapted from the original on the Conduktor blog.

Most Kafka Cost Calculators Miss This

Stéphane Derosiaux — Mon, 25 May 2026 15:19:36 +0000

Which side are you on: "This is just what Kafka costs at scale" or "We should switch to a cheaper Kafka provider"?

At Conduktor, our field team works inside Kafka environments that have been running for a long time. We see this: most Kafka teams are overpaying by 25 to 40 percent. Not because anyone did anything wrong, but because of how Kafka got built up over time.

The cost drivers of Kafka are weirdly context-dependent: the infrastructure and the provider are a tiny part of the full picture.

How it's being used is the real question.

Five bad patterns eating budget

Below is what we see. The same patterns show up everywhere, and they are the first things we work on with our customers.

1. Partition overprovisioning

"How many partitions?" is the most common question with Kafka. Last week someone told me their org just defaults to "64". I was shocked. Not only may providers price per partition, but from a Kafka point of view every partition also costs metadata, open file handles, and so on.

Partition count depends on expected throughput and concurrency (consumer parallelism). If a 64-partition topic is sitting in a cluster with barely any traffic, you're just losing money on all sides. Multiply by dozens or hundreds of topics at scale.
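To make the throughput-and-concurrency point concrete, here is a minimal sketch of a common rule of thumb: take the larger of "partitions needed to absorb the write rate" and "partitions needed for consumers to keep up", with a small floor. The function name, the per-partition throughput figures, and the floor of 3 are illustrative assumptions; measure your own numbers before relying on it.

```python
import math

def suggest_partitions(target_mb_s: float,
                       producer_mb_s_per_partition: float,
                       consumer_mb_s_per_partition: float,
                       min_partitions: int = 3) -> int:
    """Rule-of-thumb partition count from throughput, never below a small floor.

    All throughput figures are assumptions you must measure yourself.
    """
    for_producers = math.ceil(target_mb_s / producer_mb_s_per_partition)
    for_consumers = math.ceil(target_mb_s / consumer_mb_s_per_partition)
    return max(for_producers, for_consumers, min_partitions)

# A quiet topic: 2 MB/s target, 10 MB/s per partition on the write side,
# 5 MB/s per consumer on the read side.
print(suggest_partitions(2, 10, 5))    # 3
# A busy topic: 120 MB/s target with the same per-partition numbers.
print(suggest_partitions(120, 10, 5))  # 24
```

With these made-up figures, the quiet topic lands on the floor, nowhere near 64.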

2. Retention that makes no sense

Long retention on topics that nobody reads past the last few hours. Do you need replay? The default retention is 7 days, but it's often applied uniformly, when some topics only need a couple of hours and others genuinely need weeks.
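A rough back-of-the-envelope sketch shows why uniform retention matters. It assumes a steady ingress rate, no compression, and a replication factor of 3, and it ignores segment rollover; the function name and all figures are hypothetical.

```python
def retained_gb(ingress_mb_s: float, retention_hours: float,
                replication_factor: int = 3) -> float:
    """Approximate steady-state disk usage in GB for a time-retained topic."""
    seconds = retention_hours * 3600
    # MB written during the retention window, times replicas, converted to GB.
    return ingress_mb_s * seconds * replication_factor / 1000

week = retained_gb(1.0, 7 * 24)  # 1 MB/s kept for 7 days
short = retained_gb(1.0, 6)      # same topic kept for 6 hours
print(f"7 days: {week:.0f} GB, 6 hours: {short:.0f} GB")
# 7 days: 1814 GB, 6 hours: 65 GB
```

Under these assumptions, the same 1 MB/s topic holds roughly 28 times more data at 7 days than at 6 hours.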

Tip: with compacted topics and/or Kafka Streams (changelog topics etc.), data is stored indefinitely, which can cause security and regulatory issues.

3. Let's spin up another cluster

One-cluster-per-team was a reasonable isolation strategy a long time ago. We have seen this multiple times: more than 500 clusters, with tons of mirroring to share data. That's throwing money down the drain.

You're paying for underutilized clusters instead of consolidating onto fewer well-managed ones.

4. Zombie topics

Topics created for experiments, migrations, or one-off tests that were never cleaned up. It's a simple thing, but it costs a lot of money because no one is looking. Every one of them is replicated and has retention costs. We've seen enterprises with hundreds of zombie topics who were surprised when we showed them.
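One way to surface candidates is to flag topics with neither produce nor consume activity for some idle window. The sketch below assumes you already have last-activity timestamps per topic from your monitoring; the inventory shape, the topic names, and the 30-day window are all hypothetical.

```python
from datetime import datetime, timedelta, timezone

def find_zombies(topics, now, idle_days=30):
    """Return names of topics with no produce AND no consume activity within idle_days."""
    cutoff = now - timedelta(days=idle_days)
    return [
        t["name"]
        for t in topics
        if t["last_produce"] < cutoff and t["last_consume"] < cutoff
    ]

now = datetime(2026, 5, 25, tzinfo=timezone.utc)
# Hypothetical inventory; in practice these timestamps come from your monitoring.
inventory = [
    {"name": "payments.orders",
     "last_produce": now - timedelta(hours=1),
     "last_consume": now - timedelta(minutes=5)},
    {"name": "tmp.migration-test",
     "last_produce": now - timedelta(days=200),
     "last_consume": now - timedelta(days=190)},
    {"name": "analytics.archive",
     "last_produce": now - timedelta(days=90),
     "last_consume": now - timedelta(days=2)},
]
print(find_zombies(inventory, now))  # ['tmp.migration-test']
```

A topic that is still being read (like the archive above) is deliberately not flagged: someone depends on it even if nothing new is written.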

5. Runaway egress

We had a customer where egress was running 30x higher than ingress on a single topic because of a misconfigured consumer. Buggy consumers, unnecessary fan-out, and chatty clients create traffic patterns that are invisible without dedicated infra monitoring. Egress is rarely free.
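A simple check for this pattern is to compare bytes out to bytes in per topic and flag outliers. This is a sketch only: the byte counters are assumed to come from whatever metrics you already collect, and the threshold of 5 is an arbitrary starting point, since legitimate fan-out to several consumer groups also raises the ratio.

```python
def runaway_egress(stats, max_ratio=5.0):
    """Return (topic, egress/ingress ratio) pairs above max_ratio, worst first.

    stats maps topic name -> (bytes_in, bytes_out) over the same time window.
    """
    flagged = []
    for topic, (bytes_in, bytes_out) in stats.items():
        if bytes_in == 0:
            continue  # no ingress to compare against; review these separately
        ratio = bytes_out / bytes_in
        if ratio > max_ratio:
            flagged.append((topic, round(ratio, 1)))
    return sorted(flagged, key=lambda item: item[1], reverse=True)

# Hypothetical per-topic counters for one hour.
stats = {
    "payments.orders": (100_000, 300_000),                   # ratio 3
    "analytics.clickstream.pageviews": (50_000, 1_500_000),  # ratio 30
    "tmp.idle": (0, 0),
}
print(runaway_egress(stats))  # [('analytics.clickstream.pageviews', 30.0)]
```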

How to deal with it

Pick your starting point based on where the waste is concentrated.

Stop the bleeding: better defaults

Low-coordination work that pays off over time. It's better to grant exceptions than to live with wrong defaults you can't roll back.

Set sensible low partition defaults (3) and short retention (1 day). Increase only when necessary.
Enforce client-side compression. (Conduktor Gateway)
Require ownership metadata at topic creation. (Conduktor)
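The defaults in this list can be expressed as a small pre-creation check. This is a generic sketch, not how Conduktor implements its policies: the spec shape (`partitions`, `configs`, `labels`) and the `owner` label name are hypothetical, while `retention.ms` is the standard Kafka topic config.

```python
DAY_MS = 24 * 60 * 60 * 1000

def check_topic(spec, max_partitions=3, max_retention_ms=DAY_MS):
    """Return a list of human-readable violations; an empty list means compliant."""
    problems = []
    if spec.get("partitions", 0) > max_partitions:
        problems.append(f"partitions {spec['partitions']} > {max_partitions}")
    retention = spec.get("configs", {}).get("retention.ms")
    if retention is None:
        problems.append("retention.ms not set explicitly")
    else:
        value = int(retention)
        # A negative retention.ms means "keep forever" in Kafka, so flag it too.
        if value < 0 or value > max_retention_ms:
            problems.append(f"retention.ms {value} exceeds {max_retention_ms}")
    if not spec.get("labels", {}).get("owner"):
        problems.append("missing owner label")
    return problems

good = {"partitions": 3,
        "configs": {"retention.ms": "86400000"},
        "labels": {"owner": "payments"}}
bad = {"partitions": 64,
       "configs": {"retention.ms": "604800000"},
       "labels": {}}
print(check_topic(good))  # []
print(check_topic(bad))
```

Exceptions would be handled by passing higher limits for specific topics rather than by loosening the defaults.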

This won't reduce your bill right away, but it will prevent it from getting worse.

Trim the fat: optimize what's running

Tune retention where it's drifted, analyze consumer patterns.
Retire topics with no active producers or consumers.
Right-size partition counts (this is the hard one, since it means recreating topics and coordinating with every producer and consumer).
Consolidate Kafka clusters, introduce multi-tenancy. (Conduktor)

This work easily moves the infrastructure bill: we saw reductions of $500k just from doing this.

Now, keep it clean, be disciplined

After a cleanup, the same "drift" will start again.

To stay on course, you need full visibility into what your Kafka ecosystem contains and what it costs (chargeback is powerful for this), clear ownership so every topic and cluster has a team accountable for it, and a regular review cadence to catch drift before it becomes permanent. Not heavyweight governance. Just enough discipline that the cleanup doesn't have to be repeated every year.
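As an illustration of chargeback, the simplest model splits the monthly bill in proportion to each team's share of stored or transferred data. The team names, the usage figures, and the choice of GB as the allocation key are assumptions; real chargeback models often weight partitions, storage, and egress separately.

```python
def chargeback(monthly_cost, gb_by_team):
    """Split monthly_cost across teams in proportion to their usage in GB."""
    total = sum(gb_by_team.values())
    if total == 0:
        return {team: 0.0 for team in gb_by_team}
    return {team: round(monthly_cost * gb / total, 2)
            for team, gb in gb_by_team.items()}

# Hypothetical usage; "unowned" is the bucket that clear ownership should shrink.
usage = {"payments": 600, "analytics": 300, "unowned": 100}
print(chargeback(10_000, usage))
# {'payments': 6000.0, 'analytics': 3000.0, 'unowned': 1000.0}
```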

Where to start

The diagnostic question is simple: which of these patterns are present in your environment, and what are they costing you?

The original deep-dive goes further into the four layers of Kafka cost (infrastructure, ecosystem tooling, vendor/licensing, and operational) and includes a framework for sequencing the work.

If you want to look at your own estate, Conduktor's field team does a free cost analysis where they walk through your environment with you and give you concrete numbers.