<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Stéphane Derosiaux</title>
    <description>The latest articles on DEV Community by Stéphane Derosiaux (@sderosiaux).</description>
    <link>https://dev.to/sderosiaux</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1137771%2F8ec57b19-4d29-4f61-9e8c-e5d27f821c04.jpg</url>
      <title>DEV Community: Stéphane Derosiaux</title>
      <link>https://dev.to/sderosiaux</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sderosiaux"/>
    <language>en</language>
    <item>
      <title>I let an AI agent set up my entire Kafka platform. Here's what actually happened.</title>
      <dc:creator>Stéphane Derosiaux</dc:creator>
      <pubDate>Mon, 08 Jun 2026 13:34:35 +0000</pubDate>
      <link>https://dev.to/conduktor/i-let-an-ai-agent-set-up-my-entire-kafka-platform-heres-what-actually-happened-220m</link>
      <guid>https://dev.to/conduktor/i-let-an-ai-agent-set-up-my-entire-kafka-platform-heres-what-actually-happened-220m</guid>
      <description>&lt;p&gt;Your AI coding assistant can explain consumer groups, rebalancing, and exactly-once semantics. Ask it to actually &lt;em&gt;set up&lt;/em&gt; a Kafka platform with governance, though, and it won't be able to do that on its own.&lt;/p&gt;

&lt;p&gt;Between hallucinations, misunderstandings, production impact (I have actually seen Claude mess up a rolling upgrade of Kafka brokers), and a lack of knowledge of the products your Kafka infra relies on, there's a lot working against it.&lt;/p&gt;

&lt;p&gt;Beyond what they learned in training, the models have zero context about your infra. They've never seen your cluster, don't know your policies (technical or governance), and often have no way to check anything against your actual environment.&lt;/p&gt;

&lt;p&gt;You can give it the missing context using Conduktor.&lt;/p&gt;

&lt;h2&gt;
  
  
  The thing that was missing
&lt;/h2&gt;

&lt;p&gt;There is an open-source &lt;a href="https://github.com/conduktor/skills" rel="noopener noreferrer"&gt;Conduktor skill&lt;/a&gt; you install into your AI assistant. It works with Claude Code, Cursor, VS Code Copilot, Gemini CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx skills add conduktor/skills
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This teaches the agent the whole platform (Console, Gateway, and the CLI) and how to run processes against it, so it can be efficient and not hallucinate.&lt;/p&gt;

&lt;p&gt;After the install, the agent discovers your environment (Kafka clusters, Schema Registry, policies, etc.), asks questions based on what it finds, generates configs with &lt;em&gt;real&lt;/em&gt; values and best practices, and runs everything with dry-run validation before it touches anything.&lt;/p&gt;

&lt;p&gt;The CLI is really its "hands", and it goes deeper than MCP alone. The skill is the playbook, where the experience and practices from years of usage are written down. This makes a big difference versus "generate some YAML and cross your fingers".&lt;/p&gt;

&lt;h2&gt;
  
  
  Starting from absolutely nothing
&lt;/h2&gt;

&lt;p&gt;You can start from scratch with just Docker running and nothing else. No Kafka, no Conduktor, no config. With the Conduktor skill installed, I asked just this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;install Conduktor and set it up so I can login&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It checked my environment, asked what I was trying to do, wrote a &lt;code&gt;docker-compose.yml&lt;/code&gt;, spun up the containers, hit one error along the way, self-corrected, and handed me a working platform, Kafka &amp;amp; Console perfectly configured.&lt;/p&gt;

&lt;p&gt;I could ask the same thing for my production Kubernetes cluster. It would follow best practices too, use Helm, discover my environment, etc., and in minutes everything would be wired perfectly, with policies already in place.&lt;/p&gt;

&lt;p&gt;This is much more powerful than a "human" quickstart, as the range of applications it covers is just wider and more production-ready already. The agent knows the Kafka domain, and with the skill it knows Conduktor, so the combination of both makes it ask me the right questions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Governance, without becoming a Kafka lawyer
&lt;/h2&gt;

&lt;p&gt;Running Kafka isn't the hard part anymore. Making it &lt;em&gt;safe for a team to share&lt;/em&gt; is the hard part: naming conventions, ownership boundaries, policies. This is what prevents a Kafka cluster from turning into a wasteland of &lt;code&gt;test-topic-final-v2&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The beautiful thing is that you can now hand it broad prompts like this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;set up governance for two teams, Payments and Analytics, with topic policies and cross-team permissions&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It worked in stages and figured out the dependency ordering itself. When the API rejected something, it read the rejection, restructured the YAML, and retried, with minimal hand-holding from me (just asking what policies I want based on what's possible). It ended up creating the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;TopicPolicy&lt;/code&gt; objects: locking down naming per team, enforcing safe defaults (retention, replication, required labels) across every topic. &lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Application&lt;/code&gt; objects with non-overlapping resource boundaries to define ownership of resources and teams.&lt;/li&gt;
&lt;li&gt;Topics with descriptions and labels in the catalog.&lt;/li&gt;
&lt;li&gt;Cross-team permission giving Analytics read access to &lt;code&gt;payments.orders.*&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is federated ownership in practice: the platform team sets the boundaries, developers move freely inside them. Normally that knowledge takes months to accumulate and lives in spreadsheets or Jira tickets. Here it lives in a skill file that every agent on the team can read.&lt;/p&gt;

&lt;h2&gt;
  
  
  Now flip to the developer side
&lt;/h2&gt;

&lt;p&gt;Once those guardrails exist, a developer on the Payments team installs the &lt;em&gt;same skill&lt;/em&gt; and never has to know any of it happened. No &lt;code&gt;ApplicationInstance&lt;/code&gt;, no &lt;code&gt;TopicPolicy&lt;/code&gt;, no YAML. They just talk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"What topics do we have?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The agent runs &lt;code&gt;conduktor get Topic&lt;/code&gt; and shows the catalog — descriptions, owners, labels, visibility. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"I need a topic for my service."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The agent checks their &lt;code&gt;ApplicationInstance&lt;/code&gt;, reads the policy constraints (naming prefix &lt;code&gt;payments.*&lt;/code&gt;, retention one-to-seven days, a required &lt;code&gt;data-criticality&lt;/code&gt; label), asks what the topic is for, generates compliant YAML, dry-runs it, and applies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Topic/payments.fulfillment.shipped: Created
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The developer just got a topic that's compliant by default. Without the skill, that's most likely a Jira ticket, plus a round trip with the platform team to ask what the right shape is and which values to use.&lt;/p&gt;
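&lt;p&gt;To make the policy check described above concrete, here is a minimal sketch of what "compliant by default" means for the three constraints mentioned (naming prefix, retention window, required label). The &lt;code&gt;POLICY&lt;/code&gt; dictionary, its field names, and &lt;code&gt;check_topic&lt;/code&gt; are hypothetical and only illustrate the idea; they are not Conduktor's &lt;code&gt;TopicPolicy&lt;/code&gt; schema or API, and retention is assumed to be expressed in whole days.&lt;/p&gt;

```python
import fnmatch

# Hypothetical policy mirroring the constraints described in the text.
# Field names are illustrative, not Conduktor's TopicPolicy schema.
POLICY = {
    "name_pattern": "payments.*",
    "retention_days_min": 1,
    "retention_days_max": 7,
    "required_labels": ["data-criticality"],
}


def check_topic(name, retention_days, labels, policy=POLICY):
    """Return a list of human-readable violations (empty means compliant)."""
    violations = []
    # Naming: the topic must sit inside the team's prefix.
    if not fnmatch.fnmatchcase(name, policy["name_pattern"]):
        violations.append(f"name must match {policy['name_pattern']}")
    # Retention: whole days, inclusive of both bounds.
    low, high = policy["retention_days_min"], policy["retention_days_max"]
    if retention_days not in range(low, high + 1):
        violations.append(f"retention must be between {low} and {high} days")
    # Labels: every required label must be present.
    for label in policy["required_labels"]:
        if label not in labels:
            violations.append(f"missing required label: {label}")
    return violations


print(check_topic("payments.fulfillment.shipped", 3, {"data-criticality": "high"}))
print(check_topic("orders.shipped", 30, {}))
```

&lt;p&gt;The first call returns an empty list, the second reports all three violations. In the real workflow this validation happens server-side during the dry run, so the agent reads the rejection rather than re-implementing the rules.&lt;/p&gt;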

&lt;p&gt;&lt;strong&gt;"How do I produce to my topic?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It reads the cluster config, grabs the real bootstrap server, and hands back working code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;confluent_kafka&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Producer&lt;/span&gt;

&lt;span class="n"&gt;producer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Producer&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bootstrap.servers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;localhost:19092&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="n"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;produce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payments.fulfillment.shipped&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ord-123&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;orderId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ord-123&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;shipped&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;flush&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Copy, paste, run.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"I need to read the Analytics team's clickstream."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The agent finds that &lt;code&gt;analytics.clickstream.pageviews&lt;/code&gt; belongs to the Analytics team, then writes a read-only permission scoped to exactly that topic, at both the Kafka and Console layers. The developer doesn't know what an ACL is or what &lt;code&gt;patternType: LITERAL&lt;/code&gt; means. They asked in English and got access. &lt;/p&gt;

&lt;h2&gt;
  
  
  What I actually take away from this
&lt;/h2&gt;

&lt;p&gt;This walkthrough only touched governance and onboarding. The skill also covers Gateway (Kafka proxy) encryption, data quality rules, Terraform export, and CI/CD scaffolding.&lt;/p&gt;

&lt;p&gt;Try it, it's one command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx skills add conduktor/skills
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's &lt;a href="https://github.com/conduktor/skills" rel="noopener noreferrer"&gt;open source&lt;/a&gt;, so if you hit a workflow it handles badly, open a PR. And if you're new to Conduktor, the &lt;a href="https://www.conduktor.io/community" rel="noopener noreferrer"&gt;Community Edition&lt;/a&gt; is free and self-hosted; the skill will do the install for you.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This post was adapted from the &lt;a href="https://www.conduktor.io/blog/set-up-a-kafka-platform-with-an-ai-agent" rel="noopener noreferrer"&gt;original on the Conduktor blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>dataengineering</category>
      <category>devops</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>5 Things Kafka Practitioners Actually Said at Current 2026</title>
      <dc:creator>Stéphane Derosiaux</dc:creator>
      <pubDate>Mon, 01 Jun 2026 13:23:50 +0000</pubDate>
      <link>https://dev.to/sderosiaux/5-things-kafka-practitioners-actually-said-at-current-2026-2l01</link>
      <guid>https://dev.to/sderosiaux/5-things-kafka-practitioners-actually-said-at-current-2026-2l01</guid>
      <description>&lt;p&gt;We were at Current 2026 in London two weeks ago. Keynotes about agentic AI and streaming agents but conferences conversations were about something else entirely.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. The IBM-Confluent acquisition?
&lt;/h2&gt;

&lt;p&gt;Six months after IBM acquired Confluent for $11B, I expected hot takes. What I got was mostly indifference.&lt;/p&gt;

&lt;p&gt;A data engineering lead at a major US bank called himself a "doomer": Kafka and Cassandra are his two favorite open-source projects, and both now sit under IBM. A staff engineer at a streaming vendor was more optimistic, speculating about Confluent's AI team merging with Watson.&lt;/p&gt;

&lt;p&gt;The majority were more like "I try to concentrate on the tech side." "I don't know enough to comment." Moving on.&lt;/p&gt;

&lt;p&gt;The acquisition was in the air, but practitioners had more pressing problems to talk about.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Kafka costs are the third certainty (after death and taxes)
&lt;/h2&gt;

&lt;p&gt;Every cost conversation we had mapped to the same things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Partition overprovisioning&lt;/strong&gt;: inherited from not knowing better, or from templates tuned for a different workload&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retention mismatches&lt;/strong&gt;: defaults that outlive the use case they were set for&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cluster proliferation&lt;/strong&gt;: one-off clusters that never got consolidated&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Orphan and duplicate topics&lt;/strong&gt;: experiments or dead projects that never got cleaned up&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inefficient client patterns&lt;/strong&gt;: batching, compression, and serialization left on defaults + &lt;a href="https://www.conduktor.io/blog/librdkafka-vs-java-client" rel="noopener noreferrer"&gt;JVM vs librdkafka differences&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Static capacity&lt;/strong&gt;: paying for peak when the load is 10% of that most of the time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some examples of scale where they had these problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;six years of unmanaged topics inherited during a Confluent Cloud migration&lt;/li&gt;
&lt;li&gt;one team had twenty MSK clusters running across multiple regions&lt;/li&gt;
&lt;li&gt;MSK, Confluent, and Redpanda all running in parallel and totally separate, with the team looking for a single control plane&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A data engineer at a German energy consultancy put it well:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Kafka resource costs should not be taken for granted. Maybe I need sub-second latency, or maybe daily is enough. These are considerations you must make up front, before it's too late."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The solution is to change how you &lt;em&gt;use&lt;/em&gt; Kafka: guardrails at topic creation, retention defaults that reflect reality, and ownership at the source. We typically see 25-40% of infrastructure costs come back this way.&lt;/p&gt;

&lt;p&gt;We wrote up the full analysis in &lt;a href="https://www.conduktor.io/blog/your-platform-team-cant-fix-kafka-costs-alone" rel="noopener noreferrer"&gt;Your Platform Team Can't Fix Kafka Costs Alone&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Everyone wants self-service. Nobody has the operating model for it.
&lt;/h2&gt;

&lt;p&gt;Self-service provisioning is a common topic now. Developers want to create topics and request permissions without filing a Jira ticket (many still do). Platform teams want to provide it so they stop being a bottleneck. The &lt;em&gt;how&lt;/em&gt; is the question. Building it in-house is time-consuming given all the variations to cover, and a vendor solution has to be flexible enough to adopt.&lt;/p&gt;

&lt;p&gt;A senior engineer at a Danish financial firm told us he was too busy managing tickets to build the system that would have made the tickets unnecessary. An architect at a UK telecom had no self-service at all, just Jira. A senior developer at a UK bank said her biggest frustration was waiting on the ticketing system.&lt;/p&gt;

&lt;p&gt;This is no joke. Self-service &lt;em&gt;resource provisioning&lt;/em&gt; is a capability and &lt;strong&gt;Federated ownership&lt;/strong&gt; is what produces it.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The platform team sets standards and guardrails.&lt;/li&gt;
&lt;li&gt;Domain teams operate within them autonomously.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The teams that have this working got the operating model right before building the tooling. You need to slow down to accelerate better. Everyone else stopped at "we can create topics, we have GitOps". They think this is enough and are missing the whole point. This is like 5% of a real solution.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The key thing for us is approval gates. Guardrails. If somebody adds a thousand partitions, we need someone to have eyes on that." — Lead streaming data engineer at a major African bank&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Self-service without federated ownership becomes either a free-for-all nobody can govern, or a tightly controlled environment nobody uses.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Nobody knows what a Kafka proxy actually does
&lt;/h2&gt;

&lt;p&gt;This was the surprise. Kafka proxies came up in more conversations than any other architectural topic, but most people still think a proxy does one thing: route traffic for legacy clients.&lt;/p&gt;

&lt;p&gt;An engineer at a European ISP runs one in production. It handles message routing. We asked about encryption, masking, or transformation. "We see a proxy as more of a routing tool."&lt;/p&gt;

&lt;p&gt;Here's the range of what serious Kafka proxies like Conduktor Gateway can do:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Routing&lt;/strong&gt;: bridging legacy or non-native clients (most common)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DR and migration&lt;/strong&gt;: failover and provider switching&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security&lt;/strong&gt;: encryption and masking at the message level&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full policy enforcement&lt;/strong&gt;: aliasing, schema validation, virtual clusters, field-level encryption, audit, and access control&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A VP at a global bank walked us through his checklist for a Kafka gateway: topic aliasing for migration, payload validation, virtual clusters for multi-tenant isolation, field-level encryption, data masking, and audit. It's a bank requirements doc.&lt;/p&gt;

&lt;p&gt;Another engineer at a UK building society had built application-layer encryption for PII. He didn't know a proxy could do it at the message level, saving hours of per-application work. Plenty of teams are paying the same tax right now by doing it client-side, and it's a pain to manage at scale.&lt;/p&gt;

&lt;p&gt;A proxy is where policy and control live when they don't belong in the cluster or the application. &lt;a href="https://www.conduktor.io/gateway/" rel="noopener noreferrer"&gt;Here's what a Kafka proxy actually does&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. AI on Kafka is an ownership problem, not a streaming problem
&lt;/h2&gt;

&lt;p&gt;The keynotes kept talking about agentic AI as the bottleneck. Kafka developers disagree:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Spinning up infrastructure and writing code is rarely the bottleneck in a full end-to-end business solution." — Staff engineer at a UK retailer&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A senior engineer at a major social platform building on-call AI agents put it directly: "Building the AI agent isn't the difficult thing. The data needs to be in the right structure."&lt;/p&gt;

&lt;p&gt;A software engineer at a UK challenger bank has been feeding an AI agent context from ~40 microservices. "You have to give it the entire Java classes of all those microservices. That's usually when it starts hallucinating."&lt;/p&gt;

&lt;p&gt;The hallucination is a legibility problem. The bottleneck is data quality, which sits downstream of governance, which sits downstream of your operating model. Every prerequisite practitioners named (ownership, schema discipline, topic visibility, federated governance) is foundational plumbing, not AI tooling.&lt;/p&gt;

&lt;p&gt;The teams that will run AI on Kafka in 2027 are the teams getting their ownership and governance right in 2026.&lt;/p&gt;




&lt;p&gt;All these discussions come from people running Kafka in production at banks, telcos, retailers, and energy companies. Costs are too high and too complex to understand, self-service needs an operating model, proxies are underused, and AI isn't a shortcut past the governance work you haven't done yet.&lt;/p&gt;

&lt;p&gt;If any of these hit home, the &lt;a href="https://www.conduktor.io/blog/" rel="noopener noreferrer"&gt;Conduktor blog&lt;/a&gt; goes deeper on each one.&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>datastreaming</category>
      <category>architecture</category>
      <category>devops</category>
    </item>
    <item>
      <title>How to analyze the cost of Kafka?</title>
      <dc:creator>Stéphane Derosiaux</dc:creator>
      <pubDate>Mon, 25 May 2026 15:19:36 +0000</pubDate>
      <link>https://dev.to/conduktor/how-to-analyze-the-cost-of-kafka-2a4b</link>
      <guid>https://dev.to/conduktor/how-to-analyze-the-cost-of-kafka-2a4b</guid>
      <description>&lt;p&gt;Which side are you on: "This is just what Kafka costs at scale" or "We should switch to a cheaper Kafka provider"?&lt;/p&gt;

&lt;p&gt;At &lt;a href="https://conduktor.io" rel="noopener noreferrer"&gt;Conduktor&lt;/a&gt;, our field team works inside Kafka environments that have been running for a long time. We see this: most Kafka teams are overpaying by 25 to 40 percent. Not because anyone did anything wrong, but because of how Kafka got built up over time.&lt;/p&gt;

&lt;p&gt;The cost drivers of Kafka are weirdly context-dependent: the infrastructure and the provider are a tiny part of the full picture. &lt;/p&gt;

&lt;p&gt;The "how" it's being used is the real question.&lt;/p&gt;




&lt;h2&gt;
  
  
  Five bad patterns eating budget
&lt;/h2&gt;

&lt;p&gt;Below is what we see. The same patterns show up everywhere, and they are the first things we work on with our customers.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Partition overprovisioning
&lt;/h3&gt;

&lt;p&gt;"How many partitions?" is the most common question with Kafka. Last week someone told me their org just defaults to "64". I was shocked. Not only may providers price per partition, but from Kafka's point of view every partition also costs metadata, open file handles, and so on.&lt;/p&gt;

&lt;p&gt;The right partition count depends on the expected throughput and concurrency (consumer parallelism). If a 64-partition topic is sitting in a cluster with barely any traffic, you're just losing money on all sides. Multiply by dozens or hundreds of topics at scale.&lt;/p&gt;
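&lt;p&gt;As a rough illustration of sizing from throughput and consumer parallelism, here is a sketch of a common rule of thumb: take the larger of what producers and consumers need per partition, and never go below the number of consumers you want running in parallel. The function name and the per-partition throughput figures are assumptions for the example; real values have to be measured on your own workload.&lt;/p&gt;

```python
import math


def suggest_partitions(target_mb_s, producer_mb_s_per_partition,
                       consumer_mb_s_per_partition, min_consumers=1):
    """Rule-of-thumb partition count from throughput and consumer parallelism."""
    # Partitions needed so producers can sustain the target throughput.
    for_producers = math.ceil(target_mb_s / producer_mb_s_per_partition)
    # Partitions needed so consumers can keep up with it.
    for_consumers = math.ceil(target_mb_s / consumer_mb_s_per_partition)
    # A consumer group cannot use more members than there are partitions.
    return max(for_producers, for_consumers, min_consumers, 1)


# Hypothetical topic: 20 MB/s expected, 10 MB/s per partition on the produce
# side, 5 MB/s per partition on the consume side, 3 consumers wanted.
print(suggest_partitions(20, 10, 5, min_consumers=3))
```

&lt;p&gt;With these made-up numbers the suggestion is 4 partitions, a long way from a blanket default of 64.&lt;/p&gt;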

&lt;h3&gt;
  
  
  2. Retention that makes no sense
&lt;/h3&gt;

&lt;p&gt;Long retention on topics that nobody reads past the last few hours. Do you need replay? Default is 7-day retention, but it's often applied uniformly, when some topics only need a couple of hours and others genuinely need weeks.&lt;/p&gt;

&lt;p&gt;Tip: with compacted topics and/or Kafka Streams (changelog topics, etc.), data can be stored indefinitely, which can cause security and regulatory issues.&lt;/p&gt;
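&lt;p&gt;To see why a uniform retention default matters, a back-of-the-envelope estimate helps. The sketch below assumes time-based retention with the delete cleanup policy, a steady ingress rate, and a replication factor of 3, and it ignores compression and index overhead. The function name and the numbers are illustrative only.&lt;/p&gt;

```python
def retained_gb(ingress_mb_s, retention_days, replication_factor=3):
    """Approximate steady-state disk footprint of a topic, in GB."""
    seconds = retention_days * 24 * 60 * 60
    # Every retained byte is stored once per replica.
    return ingress_mb_s * seconds * replication_factor / 1000


# Hypothetical topic receiving 1 MB/s.
week = retained_gb(1, 7)
day = retained_gb(1, 1)
print(f"7 days: {week:.0f} GB, 1 day: {day:.0f} GB")
```

&lt;p&gt;Under these assumptions, the same topic holds roughly seven times more data at 7 days than at 1 day, which is the gap you pay for on every topic that nobody reads past the last few hours.&lt;/p&gt;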

&lt;h3&gt;
  
  
  3. Let's spin up another cluster
&lt;/h3&gt;

&lt;p&gt;One-cluster-per-team was a reasonable isolation strategy a long time ago. We have seen this multiple times: more than 500 clusters, with tons of mirroring to share data. That is money down the drain.&lt;/p&gt;

&lt;p&gt;You're paying for underutilized clusters instead of consolidating onto fewer well-managed ones.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Zombie topics
&lt;/h3&gt;

&lt;p&gt;Topics created for experiments, migrations, or one-off tests that were never cleaned up. It's a simple thing, but it costs a lot of money because no one is looking. Every one of them is replicated and has retention costs. We've seen enterprises with hundreds of zombie topics that were very surprised when we showed them.&lt;/p&gt;
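&lt;p&gt;Spotting zombie topics is mostly a matter of looking. As a minimal sketch, assume you already have a per-topic activity snapshot (last produce time and number of active consumer groups) exported from your monitoring. The record shape, the 30-day idle window, and &lt;code&gt;find_zombies&lt;/code&gt; are hypothetical choices for the example.&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone


def find_zombies(topics, now, idle_days=30):
    """Return names of topics with no consumer groups and no recent writes."""
    cutoff = now - timedelta(days=idle_days)
    zombies = []
    for topic in topics:
        last = topic["last_produced"]
        # Never written to, or last write is at or before the cutoff.
        idle = last is None or cutoff >= last
        if idle and topic["consumer_groups"] == 0:
            zombies.append(topic["name"])
    return zombies


# Hypothetical snapshot; in practice this comes from your metrics.
now = datetime(2026, 5, 25, tzinfo=timezone.utc)
snapshot = [
    {"name": "payments.orders", "last_produced": now - timedelta(hours=1), "consumer_groups": 4},
    {"name": "test-topic-final-v2", "last_produced": now - timedelta(days=200), "consumer_groups": 0},
    {"name": "migration-tmp", "last_produced": None, "consumer_groups": 0},
]
print(find_zombies(snapshot, now))
```

&lt;p&gt;A list like this is a starting point for a conversation with the owning team, not an automatic delete list.&lt;/p&gt;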

&lt;h3&gt;
  
  
  5. Runaway egress
&lt;/h3&gt;

&lt;p&gt;We had a customer where egress was running 30x higher than ingress on a single topic because of a misconfigured consumer. Buggy consumers, unnecessary fan-out, and chatty clients create traffic patterns that are invisible without dedicated infra monitoring. Egress is rarely free.&lt;/p&gt;
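&lt;p&gt;A simple way to surface this kind of pattern is to compare bytes out to bytes in per topic. The sketch below assumes you can export both counters per topic over the same time window. The input shape, the threshold of 10, and &lt;code&gt;flag_runaway_egress&lt;/code&gt; are assumptions for illustration, and a high ratio can be legitimate when a topic has many consumer groups by design.&lt;/p&gt;

```python
def flag_runaway_egress(topic_bytes, max_ratio=10.0):
    """Return (topic, ratio) pairs whose egress/ingress ratio reaches max_ratio."""
    flagged = []
    for name, (bytes_in, bytes_out) in topic_bytes.items():
        if bytes_in == 0:
            continue  # no ingress in the window: ratio undefined, review separately
        ratio = bytes_out / bytes_in
        if ratio >= max_ratio:
            flagged.append((name, ratio))
    # Worst offenders first.
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical counters: (bytes in, bytes out) over the same window.
window = {
    "payments.orders": (100, 3000),
    "analytics.clickstream.pageviews": (100, 200),
    "migration-tmp": (0, 50),
}
print(flag_runaway_egress(window))
```

&lt;p&gt;Here only &lt;code&gt;payments.orders&lt;/code&gt; is flagged, with a ratio of 30, matching the kind of imbalance described above.&lt;/p&gt;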




&lt;h2&gt;
  
  
  How to deal with it
&lt;/h2&gt;

&lt;p&gt;Pick your starting point based on where the waste is concentrated.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stop the bleeding: better defaults
&lt;/h3&gt;

&lt;p&gt;Low-coordination work that pays off over time. It's better to grant exceptions than to live with wrong defaults you can't roll back.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Set sensible low partition defaults (3) and short retention (1 day). Increase only if necessary.&lt;/li&gt;
&lt;li&gt;Enforce client-side compression. (Conduktor Gateway)&lt;/li&gt;
&lt;li&gt;Require ownership metadata at topic creation. (Conduktor)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This won't reduce your bill right away, but it will prevent it from getting worse.&lt;/p&gt;

&lt;h3&gt;
  
  
  Trim the fat: optimize what's running
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Tune retention where it's drifted, analyze consumer patterns.&lt;/li&gt;
&lt;li&gt;Retire topics with no active producers or consumers.&lt;/li&gt;
&lt;li&gt;Right-size partition counts (this is the hard one, since it means recreating topics and coordinating with every producer and consumer).&lt;/li&gt;
&lt;li&gt;Consolidate Kafka clusters, introduce multi-tenancy (Conduktor)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This work easily moves the infrastructure bill: we have seen reductions of $500k from this alone.&lt;/p&gt;




&lt;h2&gt;
  
  
  Now, keep it clean, be disciplined
&lt;/h2&gt;

&lt;p&gt;After a cleanup, the same "drift" will start creeping in again.&lt;/p&gt;

&lt;p&gt;To stay on course, you need full visibility into what your Kafka ecosystem contains and what it costs (&lt;a href="https://conduktor.io/blog/chargeback-attribute-map-kafka-costs-to-your-business" rel="noopener noreferrer"&gt;chargeback&lt;/a&gt; is powerful for this), clear ownership so every topic and cluster has a team accountable for it, and a regular review cadence to catch drift before it becomes permanent. Not heavyweight governance. Just enough discipline that the cleanup doesn't have to be repeated every year.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where to start
&lt;/h2&gt;

&lt;p&gt;The diagnostic question is simple: which of these patterns are present in your environment, and what are they costing you?&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://conduktor.io/blog/a-better-conversation-about-kafka-costs" rel="noopener noreferrer"&gt;original deep-dive&lt;/a&gt; goes further into the four layers of Kafka cost (infrastructure, ecosystem tooling, vendor/licensing, and operational) and includes a framework for sequencing the work.&lt;/p&gt;

&lt;p&gt;If you want to look at your own estate, Conduktor's field team does a &lt;a href="https://conduktor.io/contact/demo" rel="noopener noreferrer"&gt;free cost analysis&lt;/a&gt; where they walk through your environment with you and give you concrete numbers.&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>datastreaming</category>
      <category>devops</category>
      <category>architecture</category>
    </item>
    <item>
      <title>1500 Lines of Markdown vs 15000 Lines of Python.</title>
      <dc:creator>Stéphane Derosiaux</dc:creator>
      <pubDate>Wed, 31 Dec 2025 13:25:39 +0000</pubDate>
      <link>https://dev.to/sderosiaux/1500-lines-of-markdown-vs-15000-lines-of-python-5bac</link>
      <guid>https://dev.to/sderosiaux/1500-lines-of-markdown-vs-15000-lines-of-python-5bac</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;The best orchestrator might be the one you don't have to build. Claude Code is already an orchestrator. Stop building infrastructure around it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw9zuwaf1mbyfhu1hiy2t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw9zuwaf1mbyfhu1hiy2t.png" alt="orchestration overhead" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I spent a weekend studying &lt;a href="https://github.com/assafelovic/gpt-researcher" rel="noopener noreferrer"&gt;GPT-Researcher&lt;/a&gt;, an open-source project with 24,000+ GitHub stars. It builds an autonomous research agent that generates comprehensive reports with citations. The architecture is elegant: multiple specialized agents coordinate through LangGraph, parallel execution speeds up research, and quality gates ensure reliable output.&lt;/p&gt;

&lt;p&gt;It uses LLM calls to decide which agent to run. It uses LLM calls to generate sub-queries. It uses LLM calls to select tools. It uses LLM calls to coordinate parallel work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We're wrapping LLMs in infrastructure to teach them orchestration... when they can already orchestrate.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The orchestration overhead
&lt;/h2&gt;

&lt;p&gt;Consider a typical agent workflow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LLM API call to analyze the task&lt;/li&gt;
&lt;li&gt;LLM API call to plan the approach&lt;/li&gt;
&lt;li&gt;LLM API call to select tools&lt;/li&gt;
&lt;li&gt;LLM API call to execute (finally, the actual work)&lt;/li&gt;
&lt;li&gt;LLM API call to verify results&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Four out of five calls are orchestration overhead. The LLM produces structured outputs (JSON) that code interprets to decide... what to ask the LLM next and which context to pass it.&lt;/p&gt;

&lt;p&gt;This pattern is everywhere. LangChain and LangGraph provide graph-based workflows. AutoGen from Microsoft enables multi-agent conversations. CrewAI offers role-based agent coordination, used by Oracle, PwC, and NVIDIA. Each framework solves real problems: managing state, coordinating agents, handling failures.&lt;/p&gt;

&lt;p&gt;But they all share the same assumption: the LLM needs infrastructure to orchestrate.&lt;/p&gt;

&lt;h2&gt;
  
  
  What GPT-Researcher does well
&lt;/h2&gt;

&lt;p&gt;Credit where it's due: GPT-Researcher is excellent. According to its maintainers, it outperforms Perplexity, OpenAI's research tools, and other systems in benchmarks on citation quality, report quality, and information coverage.&lt;/p&gt;

&lt;p&gt;The architecture is sophisticated:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-agent roles&lt;/strong&gt;: Chief Editor orchestrates the process. Researchers investigate subtopics. Editors plan structure. Reviewers validate quality. Revisers incorporate feedback. Writers compile reports.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parallel execution&lt;/strong&gt;: Research happens concurrently across subtopics. Multiple retrievers (Tavily, Google, Bing) run in parallel. Web scraping is asynchronous.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality gates&lt;/strong&gt;: Review cycles catch errors. Revision loops improve output.&lt;/li&gt;
&lt;/ul&gt;
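&lt;p&gt;The parallel-execution idea can be sketched in a few lines of Python with &lt;code&gt;asyncio.gather&lt;/code&gt;. This is an illustration under assumptions, not GPT-Researcher's actual code: the &lt;code&gt;research&lt;/code&gt; coroutine is a hypothetical stub where a real researcher would search and scrape.&lt;/p&gt;

```python
import asyncio


async def research(subtopic: str) -> str:
    """Hypothetical researcher: a real one would search and scrape the web."""
    await asyncio.sleep(0.01)  # simulate I/O latency
    return f"findings on {subtopic}"


async def research_all(subtopics: list[str]) -> list[str]:
    """Run one researcher per subtopic concurrently; results keep input order."""
    return await asyncio.gather(*(research(s) for s in subtopics))


findings = asyncio.run(research_all(["retention", "compaction", "quotas"]))
print(findings)
```

&lt;p&gt;Because the researchers wait on I/O at the same time, the total wall-clock time is close to that of the slowest subtopic, not the sum of all of them.&lt;/p&gt;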

&lt;h2&gt;
  
  
  The architecture tax
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1cjj81ph62n01di3jllw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1cjj81ph62n01di3jllw.png" alt="all the steps from query to report" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's what the orchestration layer requires:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# agent_creator.py - LLM decides which agent to use
response = await llm.call(
    "Analyze this query and return JSON with agent type..."
)
agent_type = parse_json(response)  # error handling, retries

# query_processing.py - LLM generates sub-queries
response = await llm.call(
    "Generate search queries for this task..."
)
queries = parse_list(response)  # more parsing, more error handling

# tool_selector.py - LLM selects MCP tools
response = await llm.call(
    "Select relevant tools from this list..."
)
tools = parse_tool_selection(response)  # yet more parsing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each step requires prompt engineering, output parsing, error handling, and retry logic. The orchestration layer is substantial.&lt;/p&gt;
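&lt;p&gt;As a rough illustration of that parsing and retry burden, here is a self-contained Python sketch. &lt;code&gt;FlakyLLM&lt;/code&gt; and &lt;code&gt;generate_queries&lt;/code&gt; are hypothetical names invented for this example. The stub returns malformed text on the first call so the retry path is exercised.&lt;/p&gt;

```python
import json


class FlakyLLM:
    """Hypothetical stand-in for an LLM client: malformed first, valid JSON after."""

    def __init__(self) -> None:
        self.calls = 0

    def call(self, prompt: str) -> str:
        self.calls += 1
        if self.calls == 1:
            return "Sure! Here are your queries: kafka, acl"  # not JSON
        return '["kafka acl best practices", "kafka quota tuning"]'


def generate_queries(llm: FlakyLLM, task: str, max_attempts: int = 3) -> list[str]:
    """Ask for a JSON list of strings, retrying when the reply does not parse."""
    prompt = f"Return ONLY a JSON list of search queries for: {task}"
    for _ in range(max_attempts):
        raw = llm.call(prompt)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed output: ask again
        if isinstance(parsed, list) and all(isinstance(q, str) for q in parsed):
            return parsed
    raise ValueError(f"no valid query list after {max_attempts} attempts")


llm = FlakyLLM()
queries = generate_queries(llm, "Kafka governance")
print(queries, llm.calls)
```

&lt;p&gt;This is the logic for a single step. A full pipeline repeats some variation of it for agent selection, sub-query generation, and tool selection.&lt;/p&gt;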

&lt;h2&gt;
  
  
  Claude Code's native capabilities
&lt;/h2&gt;

&lt;p&gt;Here's what Claude Code provides out of the box:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0w6lqo727qrqtmpclc6m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0w6lqo727qrqtmpclc6m.png" alt="claude code capabilities" width="649" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Claude Code doesn't need an LLM call to decide what tools to use. It IS the LLM. It reasons about the task and uses tools directly, in the same context, without round-trips.&lt;/p&gt;

&lt;p&gt;When you ask Claude Code to research a topic:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It analyzes the query (no separate LLM call)&lt;/li&gt;
&lt;li&gt;It generates sub-queries (no separate LLM call)&lt;/li&gt;
&lt;li&gt;It executes parallel searches (native Task agents)&lt;/li&gt;
&lt;li&gt;It synthesizes results (no separate LLM call)&lt;/li&gt;
&lt;li&gt;It writes the report (native Write tool)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What used to require infrastructure now requires prompts.&lt;/p&gt;

&lt;h2&gt;
  
  
  The rewrite
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxt21recd4tgksz1xae2r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxt21recd4tgksz1xae2r.png" alt="how orchestrate with Claude Code" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I created Claude Researcher to test this. It's not a Python package. It's four commands and one skill file.&lt;/p&gt;

&lt;p&gt;Commands:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd9ghovcytzh34g4ejgun.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd9ghovcytzh34g4ejgun.png" alt="the 4 claude code commands" width="400" height="200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;/research-team&lt;/code&gt; command implements the full multi-agent pattern:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Claude Code (Chief Editor)
      │
      ├── [PARALLEL] Research Agent 1 → findings
      ├── [PARALLEL] Research Agent 2 → findings
      └── [PARALLEL] Research Agent 3 → findings
              ↓
         Draft Report
              ↓
         Reviewer Agent → feedback
              ↓
         Reviser Agent → improved draft
              ↓ (repeat until quality gate passes)
         Final Report
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Quality gates are defined in the command file. Review cycles repeat until scores meet thresholds. These are the multi-agent patterns from GPT-Researcher, expressed as instructions rather than code.&lt;/p&gt;
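&lt;p&gt;For comparison, this is roughly the loop those instructions describe if you wrote it as code. It is a sketch under assumptions: the 0.8 threshold, the round limit, and the toy reviewer and reviser are all made up for illustration and are not taken from the command file.&lt;/p&gt;

```python
from typing import Callable


def review_loop(
    draft: str,
    review: Callable[[str], float],
    revise: Callable[[str], str],
    threshold: float = 0.8,  # assumed value, for illustration only
    max_rounds: int = 3,  # assumed safety limit against endless revision
) -> tuple[str, int]:
    """Revise the draft until its score meets the threshold or rounds run out."""
    rounds = 0
    while rounds != max_rounds and threshold > review(draft):
        draft = revise(draft)
        rounds += 1
    return draft, rounds


def toy_review(draft: str) -> float:
    """Toy scorer: each revision marker adds 0.5 to the score, capped at 1.0."""
    return min(1.0, 0.5 * draft.count("+"))


def toy_revise(draft: str) -> str:
    """Toy reviser: appends a marker instead of rewriting the text."""
    return draft + "+"


final, rounds = review_loop("draft", toy_review, toy_revise)
print(final, rounds)
```

&lt;p&gt;In the command-file version, the reviewer, the reviser, and the stopping rule are described in prose, and Claude Code carries out the loop itself.&lt;/p&gt;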

&lt;p&gt;&lt;strong&gt;The difference in sub-query generation&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPT-Researcher:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;prompt = f"""Write {max_iterations} google search queries...
You must respond with a list of strings in the following format: [{example}].
The response should contain ONLY the list."""

response = await llm.call(prompt)
queries = json.loads(response)  # parsing, error handling, retries
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Claude Researcher:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Generate 5 search queries to research: "{query}"
- Each query should explore a different angle
- Include queries for recent information when relevant
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude Code executes the queries directly. No parsing layer. No error handling for malformed output. The LLM produces the queries and uses them in the same context.&lt;/p&gt;

&lt;h2&gt;
  
  
  When orchestration frameworks make sense
&lt;/h2&gt;

&lt;p&gt;This isn't a claim that frameworks are useless. They solve real problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Production systems with strict SLAs. When you need guaranteed response formats, retry logic, circuit breakers, and observability, frameworks provide battle-tested infrastructure. Claude Code is conversational, not transactional.&lt;/li&gt;
&lt;li&gt;Non-Claude environments. If you're building on GPT-4, Gemini, or open-source models, Claude Code isn't available. Frameworks provide the coordination layer those environments lack.&lt;/li&gt;
&lt;li&gt;Complex state machines. Research is relatively linear: gather, synthesize, write. Workflows with branching logic, human-in-the-loop steps, or long-running state benefit from explicit orchestration.&lt;/li&gt;
&lt;li&gt;Team standardization. Frameworks enforce patterns. When multiple developers build agents, shared infrastructure ensures consistency. Markdown commands are flexible but less structured.&lt;/li&gt;
&lt;li&gt;Audit requirements. Enterprise deployments often need detailed logs of every decision. Frameworks with explicit orchestration make this easier than conversational interfaces.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The question isn't "frameworks vs no frameworks". It's "do you need the framework for THIS task?"&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it yourself
&lt;/h2&gt;

&lt;p&gt;Clone the repo:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/sderosiaux/claude-researcher
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Copy the commands and the skill to your Claude Code config:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cp commands/_*.md ~/.claude/commands/
cp skills/researcher.md ~/.claude/skills/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run a research task:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/research "Impact of AI agents on software development" --depth=deep

/research-team "Comparison of vector databases" --quality=high

/lookup "What is Claude Opus 4.5 context window?"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No pip install. No API keys beyond what Claude Code already uses. No configuration files.&lt;/p&gt;

&lt;h2&gt;
  
  
  Claude Code IS the orchestrator, and a really good one
&lt;/h2&gt;

&lt;p&gt;There's a pattern in software engineering: we build abstractions to solve problems, then build abstractions to manage our abstractions. Each layer adds capability but also complexity, configuration, and cognitive load.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What if the base layer already does what I need?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Claude Code is an LLM with native tool access, parallel execution, and context management. GPT-Researcher is infrastructure that makes LLMs do those things. For research tasks, the native capabilities are sufficient.&lt;/p&gt;

&lt;p&gt;The best orchestrator might be the one you don't have to build.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
