DEV Community: Stéphane Derosiaux

Chrome Shrunk a 30,000 Token Web Page Into 300 Tokens.

Stéphane Derosiaux — Tue, 21 Jul 2026 18:52:07 +0000

Ever handed an LLM a full web page and watched the token count climb?

A single product listing is 20-30K tokens of <div> soup before the model finds what it needs: wrapper divs, CSS classes, SVGs, JSON blobs, and so on. The agent needs maybe 300 tokens of that (the items and their prices). When you feed an LLM the raw DOM, you pay for it to clean up the markup.

This is why chrome-agent exists. It's one 3 MB binary that drives Chrome over the DevTools Protocol (CDP) and hands your agent the page in a shape a language model can cheaply read and act on. No MCP server, no Node runtime.

For instance, ask an agent to find standing desks under $400 on a store and return the top five with prices. A boring job. Use a default agentic workflow with a basic "WebSearch" tool and you're cooked.

Let me show you why chrome-agent is the right tool for this job.

Land on the page and look

chrome-agent goto "https://store.example.com/search?q=standing+desk" --inspect

goto navigates. --inspect returns the useful content (no div soup), so the agent gets the minimum it needs in one round trip. What comes back is not HTML. It's the accessibility tree (the same roles-and-labels view a screen reader gets) flattened to compact text:

uid=n1   heading "Standing Desks (48 results)"
uid=n14  searchbox "Search products"
uid=n22  button "Apply filters"
uid=n41  link "Uplift V2 Standing Desk"
uid=n47  link "Flexispot E7"
uid=n52  link "Vari Electric"
...

Notice what's gone: no divs, no class names, no base64 <img>. Useless nodes are stripped, roughly a 70% cut against the raw accessibility tree (which is itself roughly a 99% cut over raw HTML).

chrome-agent also has magic commands, such as inspect, that deterministically extract what matters on the page, based on browsers' Reader modes. It can return around 50 tokens, against the 20-30K of raw HTML the same page would otherwise cost the model. That's insane!

A page becomes something the model can read in tokens it can afford, and act on by pointing at a uid.
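To make the flattening concrete, here is a minimal Python sketch of the idea. It is not chrome-agent's implementation: the node shape (uid, role, name, children) and the list of roles treated as noise are assumptions made for illustration.

```python
# Hypothetical node shape: {"uid": str, "role": str, "name": str, "children": [...]}
# The roles treated as noise are an assumption, not chrome-agent's real list.
NOISE_ROLES = {"generic", "none", "presentation"}


def flatten(node, out=None):
    """Depth-first walk that keeps only named, meaningful nodes as one-line text."""
    if out is None:
        out = []
    role = node.get("role", "generic")
    name = node.get("name", "")
    if role not in NOISE_ROLES and name:
        out.append(f'uid={node["uid"]} {role} "{name}"')
    for child in node.get("children", []):
        flatten(child, out)
    return out


tree = {
    "uid": "n0", "role": "generic", "name": "", "children": [
        {"uid": "n1", "role": "heading", "name": "Standing Desks (48 results)"},
        {"uid": "n9", "role": "generic", "name": "", "children": [
            {"uid": "n14", "role": "searchbox", "name": "Search products"},
            {"uid": "n22", "role": "button", "name": "Apply filters"},
        ]},
    ],
}

lines = flatten(tree)
print("\n".join(lines))
```

The wrapper nodes (n0, n9) disappear and only the three nodes a model can act on survive, each on one line.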

Why UID and not CSS selector? Because CSS selectors suck and cost way more tokens.

Why uids instead of CSS selectors?

Because the agent needs a stable handle it can point at without inventing CSS. Those n41, n47 values are Chrome's own internal IDs. They're stable across repeated inspects on the same page, so the agent can inspect, reason, then act on n47 a few commands later and still hit the same element.

chrome-agent has multiple targeting modes that the agent picks per situation:

uid (from inspect) => the default, cheap, specific
--selector "css" => when you already know the DOM shape
--xy 100,200 => canvas, maps, anything with no real DOM node

In our example, the agent doesn't need every link, it needs the price control. So it narrows:

chrome-agent inspect --filter "textbox,button" --urls

--filter keeps only the roles you name (with aliases, so textbox also pulls in searchbox and combobox). Now the tree is a handful of lines:

uid=n14  searchbox "Search products"
uid=n62  textbox "Max price"
uid=n22  button "Apply filters"

There's n62. The agent saw it, so it can point at it.
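Role filtering with aliases can be sketched in a few lines of Python. The alias table below is an assumption built from the single example above (textbox also matching searchbox and combobox); the real tool may define more aliases or different ones.

```python
# Alias table is an assumption based on the one example in the text.
ROLE_ALIASES = {"textbox": {"textbox", "searchbox", "combobox"}}


def filter_roles(nodes, wanted):
    """Keep nodes whose role is in the wanted list, after expanding aliases."""
    allowed = set()
    for role in wanted:
        allowed |= ROLE_ALIASES.get(role, {role})
    return [n for n in nodes if n["role"] in allowed]


nodes = [
    {"uid": "n1", "role": "heading", "name": "Standing Desks (48 results)"},
    {"uid": "n14", "role": "searchbox", "name": "Search products"},
    {"uid": "n62", "role": "textbox", "name": "Max price"},
    {"uid": "n22", "role": "button", "name": "Apply filters"},
    {"uid": "n41", "role": "link", "name": "Uplift V2 Standing Desk"},
]

kept = filter_roles(nodes, ["textbox", "button"])
print([n["uid"] for n in kept])  # ['n14', 'n62', 'n22']
```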

Act, then wait for the grid to settle

chrome-agent fill "400" --uid n62
chrome-agent click n22

fill types into the max-price field, click applies the filter. The full action set is there when the task needs it: click, fill, dblclick, select (matches by option value then visible text), check/uncheck (idempotent, so re-running a step never toggles you into the wrong state), upload, drag, and press for raw keys.

Running a SPA, where there's no real navigation and the DOM keeps being replaced? That's fine too. Clicking the filter fires an XHR and re-renders the grid client-side. chrome-agent waits on the network, so there's no need for arbitrary sleeps:

chrome-agent wait network-idle --idle-ms 500

This tracks in-flight requests and returns once the network has been quiet for 500 ms (usually plenty for the JavaScript calls to finish).
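The "quiet for N milliseconds" rule is easy to get subtly wrong, so here is a small Python sketch of the bookkeeping. The class and method names are hypothetical, and time is passed in explicitly so the logic stays deterministic; a real implementation would feed it from CDP network events.

```python
class NetworkIdleTracker:
    """Idle means: nothing in flight, and it has stayed that way for idle_ms."""

    def __init__(self, idle_ms, now_ms=0):
        self.idle_ms = idle_ms
        self.in_flight = 0
        self.quiet_since = now_ms  # last moment the in-flight count hit zero

    def request_started(self):
        self.in_flight += 1

    def request_finished(self, now_ms):
        self.in_flight = max(0, self.in_flight - 1)
        if self.in_flight == 0:
            self.quiet_since = now_ms

    def is_idle(self, now_ms):
        return self.in_flight == 0 and now_ms - self.quiet_since >= self.idle_ms


tracker = NetworkIdleTracker(idle_ms=500)
tracker.request_started()
tracker.request_finished(now_ms=120)
print(tracker.is_idle(now_ms=400))  # False: quiet for only 280 ms
print(tracker.is_idle(now_ms=620))  # True: quiet for 500 ms
```

The quiet window restarts whenever the last pending request completes, which is what separates this from a fixed sleep.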

Getting the data out is a decision, not a command

The filtered grid is on screen and the agent needs the top five as structured data. There's a hierarchy, and the skill is matching the rung to the shape of the data, not reaching for the same command every time:

read => one article or long body of prose. Injects Mozilla's Readability, turns ~15K of HTML into ~500 clean tokens. Wrong tool here (this is a grid, not an essay).
extract => repeating records: product grids, search results, feeds, tables. MDR/DEPTA-style heuristics (sibling structural similarity, text-to-link ratio, hidden-element exclusion) find the repeating region on their own. This is our rung.
text --selector "main" => scoped visible copy when you just want words from one region.
eval "..." => a single computed value the heuristics would miss.
network => when the grid was painted from a JSON API, skip the DOM and read the response that fed it.

For this task, extract:

chrome-agent extract --limit 5 --json

{
  "ok": true,
  "records": [
    {"title": "Branch Duo",   "price": "$349", "url": "/p/branch-duo"},
    {"title": "Flexispot E7", "price": "$389", "url": "/p/flexispot-e7"},
    {"title": "Vari Electric", "price": "$395", "url": "/p/vari-electric"}
  ]
}

If the store lazy-loads rows on scroll, extract --scroll drives the page down and watches a MutationObserver until new content lands.

Sometimes the cheapest rung isn't the page at all. That grid was populated by a JSON call. The agent can grab it directly:

chrome-agent network --filter "/api/search" --body --limit 1

Same prices, no scraping.

When it breaks, the error tells the agent what to try

Agents fail constantly. What decides whether they recover is what a failure hands back. In --json mode an error exits 1 but still prints a parseable object on stdout, and when there's an obvious next step it comes back with a hint:

{"ok": false, "error": "No element with uid=n47", "hint": "Page may have changed. Re-run inspect to get current uids."}

The agent reads ok:false, reads the hint, re-inspects, retries. Self-healing without you hand-writing recovery logic for every command.
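On the agent side, that loop can be as small as the Python sketch below. Only the JSON shape (ok, error, hint) comes from the example above; the `run` and `reinspect` callables are hypothetical stand-ins for however your agent invokes the CLI and picks fresh uids.

```python
import json


def run_with_recovery(run, command, reinspect, max_retries=1):
    """Run a command; on ok:false with a hint, re-inspect and retry.

    `run` takes an argv list and returns the JSON text printed on stdout.
    `reinspect` takes the failed argv and the failure object and returns
    a corrected argv. Both are hypothetical hooks, not part of the CLI.
    """
    result = json.loads(run(command))
    attempts = 0
    while not result.get("ok") and result.get("hint") and attempts < max_retries:
        command = reinspect(command, result)
        result = json.loads(run(command))
        attempts += 1
    return result


def fake_run(argv):
    # Stand-in for the real binary: n47 is stale, anything else succeeds.
    if "n47" in argv:
        return json.dumps({
            "ok": False,
            "error": "No element with uid=n47",
            "hint": "Page may have changed. Re-run inspect to get current uids.",
        })
    return json.dumps({"ok": True})


def fake_reinspect(argv, failure):
    # Pretend a fresh inspect showed the element now lives at n48.
    return [arg.replace("n47", "n48") for arg in argv]


print(run_with_recovery(fake_run, ["click", "n47"], fake_reinspect))
```

Capping retries matters: a hint is a suggestion, not a guarantee, and an agent that retries forever burns the tokens it just saved.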

One connection instead of a hundred spawns

Running the binary once per command pays process startup and a fresh Chrome connection every time. For a multi-step task that adds up. So there's pipe mode: one persistent connection, JSON commands in on stdin, JSON results out on stdout.

chrome-agent pipe <<'EOF'
{"cmd":"goto","url":"https://store.example.com/search?q=standing+desk","inspect":true}
{"cmd":"fill","selector":"input[name=maxPrice]","value":"400"}
{"cmd":"click","selector":"button[type=submit]"}
{"cmd":"wait","what":"network-idle","idle_ms":500}
{"cmd":"extract","limit":5}
EOF

Playwright already exists. When do you reach for this?

Different use-cases:

Playwright / Puppeteer => deterministic end-to-end tests: assertions, fixtures, waiting on specific selectors, visual diffing. A full, mature automation API for humans writing test suites.
chrome-agent => an LLM agent that has to read and act on pages it's never seen, in tokens it can afford, and recover from failure on its own.

Different jobs. If you're writing an assertion-heavy CI suite, use Playwright. If you do web research and your agents need to browse the web, they are burning their context window (parsing HTML soup) and you are paying for it. Don't.

Even more magic!

chrome-agent supports even more things an LLM agent needs:

it can drag elements
it can download elements
it can handle iframes
it can use --stealth to apply patches that bypass Cloudflare and Turnstile. It is not guaranteed against DataDome or Kasada though; they fingerprint the bundled Chromium and ship detection updates constantly.
it can access sites you're logged in to by reusing your real user cookies with --copy-cookies.

Try it

Go with the skill to install and forget:

npx skills add sderosiaux/chrome-agent
# or
cargo install chrome-agent

Kafka Doesn't Know Who Owns Your Topics.

Stéphane Derosiaux — Mon, 20 Jul 2026 18:23:21 +0000

Someone on your team asked "which Terraform provider do we use for Kafka?". Your answer is probably somewhere between Mongey/kafka and your vendor's provider (Aiven, MSK, or Confluent).

Good, but the question can actually be broader. There are three separate layers hiding in there:

1. Creating the cluster (brokers, VPC, MSK Serverless)
2. Creating the topics (partitions, retention, ACLs)
3. Governing the platform (who owns what, which rules apply, what a team can ship without asking)

Both 1 and 2 have had providers for years. Layer 3 is just getting started with solutions like Conduktor. Let me show you why you should care and why it exists now.

The Kafka protocol has no idea who owns what

No provider that speaks the Kafka protocol can express ownership, because the protocol has no concept of it.

A topic in the Kafka protocol is a name, a partition count, a replication factor, and a config map. That's it. There is nowhere to put "the payments team owns this", "this contains PII," or "C1 criticality". Not because providers are lazy but because there's no field.

So what happens instead? You build a central Terraform repo. The platform team owns it. Every topic request becomes a PR into a codebase the requester has never seen. And then:

"They'll go in and have to make a PR to that repo. It's not super intuitive on how to do that. So they'll end up pinging our team anyway." — platform engineer at a consumer-lending company

Self-service that routes through a human is just a ticket queue with extra YAML.

Two other things that bite

Your CI becomes the most privileged Kafka client in the company. A provider that dials brokers needs admin credentials and a network route to every broker of every cluster, with TLS/SASL/IAM wired up in the runner. Every new cluster is a new networking problem and a new credentials problem in your pipeline.

Terraform will happily delete your data. AWS's own MSK Terraform tutorial walks you through changing a topic from 50 partitions to 10. Kafka can't shrink partitions. So Terraform destroys and recreates the topic, data included. The plan says:

Plan: 1 to add, 1 to change, 1 to destroy.

Which is technically accurate and completely fails to communicate "you are about to lose weeks of events" and make your consumers very unhappy (never mind the business impact).
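One way to make that risk loud is to scan the machine-readable plan before applying. The Python sketch below reads the structure produced by `terraform show -json <planfile>`, where each entry in `resource_changes` carries a `change.actions` list and a replace appears as both "delete" and "create". The resource type and addresses are placeholders; use whatever your provider calls its topic resource.

```python
import json


def destructive_changes(plan, resource_types):
    """Return addresses of planned changes that delete a watched resource."""
    hits = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if rc.get("type") in resource_types and "delete" in actions:
            hits.append(rc["address"])
    return hits


# Trimmed-down plan document with placeholder addresses.
plan = json.loads("""
{"resource_changes": [
  {"address": "kafka_topic.orders", "type": "kafka_topic",
   "change": {"actions": ["delete", "create"]}},
  {"address": "kafka_topic.audit", "type": "kafka_topic",
   "change": {"actions": ["update"]}}
]}
""")

print(destructive_changes(plan, {"kafka_topic"}))  # ['kafka_topic.orders']
```

Failing the pipeline when this list is non-empty turns "1 to destroy" into a message a human has to acknowledge.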

Notice the trend AWS itself is following: the newer aws_msk_topic resource takes a cluster_arn, not bootstrap servers. Don't dial brokers from CI. Talk to a control plane over HTTPS.

What a control-plane provider looks like

I've been using the Conduktor provider. It doesn't create clusters, it registers ones that already exist and governs what runs on them. MSK, Confluent, Aiven, Redpanda, self-managed, Gateway virtual clusters: same model regardless.

Registering an existing MSK cluster:

resource "conduktor_console_kafka_cluster_v2" "msk" {
  name = "payments-msk"
  spec = {
    display_name      = "Payments (MSK, eu-west-1)"
    bootstrap_servers = "b-1.xxxxx.kafka.eu-west-1.amazonaws.com:9198"
    properties = {
      "sasl.jaas.config"  = "software.amazon.msk.auth.iam.IAMLoginModule required awsRoleArn='arn:aws:iam::123456789123:role/MSK-role';"
      "security.protocol" = "SASL_SSL"
      "sasl.mechanism"    = "AWS_MSK_IAM"
    }
  }
}

A Gateway virtual cluster registers exactly the same way, so your non-prod isolation is governed by the same code path as prod:

resource "conduktor_console_kafka_cluster_v2" "payments_vc" {
  name = "payments-vc"
  spec = {
    display_name      = "Payments (virtual cluster)"
    bootstrap_servers = "gateway.internal:6969"
    properties = {
      "security.protocol" = "SASL_SSL"
      "sasl.mechanism"    = "PLAIN"
    }
    kafka_flavor = {
      gateway = {
        url             = "https://gateway.internal:8888"
        user            = "admin"
        password        = var.gateway_admin_password
        virtual_cluster = "payments"
      }
    }
  }
}

The topic carries its own metadata

Same topic you'd declare anywhere, except now the fields that matter to humans are first-class:

resource "conduktor_console_topic_v2" "orders" {
  name    = "click.orders.events"
  cluster = conduktor_console_kafka_cluster_v2.msk.name

  labels = {
    domain      = "orders"
    criticality = "C1"
  }
  description = "Order lifecycle events, one message per state change."

  spec = {
    partitions         = 6
    replication_factor = 3
    configs = {
      "cleanup.policy" = "delete"
      "retention.ms"   = "604800000"
    }
  }

  lifecycle {
    prevent_destroy = true
  }
}

That prevent_destroy is not optional, see the partition-shrink story above.

Policies are Terraform resources

The rules a topic must satisfy are themselves Terraform resources:

resource "conduktor_console_topic_policy_v1" "orders_rules" {
  name = "orders-topic-rules"
  spec = {
    policies = {
      "metadata.name" = {
        match = { pattern = "^click\\.(?<event>[a-z0-9-]+)\\.(avro|json)$" }
      }
      "metadata.labels.criticality" = {
        one_of = { values = ["C0", "C1", "C2"] }
      }
      "spec.configs.retention.ms" = {
        range = { min = 3600000, max = 604800000 }
      }
    }
  }
}

Now compare this to the usual approach: an OPA or Sentinel check on the Terraform plan. That check only fires for topics created through that repo. Anyone who opens the UI, runs a CLI, or curls the API bypasses it entirely.

A control-plane policy object is evaluated on every creation path, because it lives where the creation is admitted, not where one particular pipeline runs.

Attach the policy to an application, hand the team an API key scoped to their own application instance instead of an admin key, and the loop closes: the dev applies Terraform in their own repo, with their own credentials, and gets

Error: topic name "orders-events" does not match ^click\.(?<event>[a-z0-9-]+)\.(avro|json)$

...which they can fix in thirty seconds without pinging anyone. That's the difference between self-service and a ticket with YAML.
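To see what those three rules amount to, here is a rough Python sketch that evaluates them against a topic definition. It is an illustration of the rule semantics, not how Conduktor evaluates policies. Python's `re` module spells named groups `(?P<name>...)`, so the pattern is adapted, and treating a missing retention.ms as a violation is an assumption.

```python
import re

# Same pattern as the policy, with Python's named-group syntax.
NAME_PATTERN = re.compile(r"^click\.(?P<event>[a-z0-9-]+)\.(avro|json)$")
CRITICALITY = {"C0", "C1", "C2"}
RETENTION_MS = (3_600_000, 604_800_000)


def violations(topic):
    """Return human-readable rule violations for one topic dict."""
    errors = []
    if not NAME_PATTERN.match(topic["name"]):
        errors.append(f'topic name "{topic["name"]}" does not match {NAME_PATTERN.pattern}')
    if topic["labels"].get("criticality") not in CRITICALITY:
        errors.append("labels.criticality must be one of C0, C1, C2")
    # Assumption: a topic without retention.ms fails the range rule.
    retention = int(topic["configs"].get("retention.ms", -1))
    if not RETENTION_MS[0] <= retention <= RETENTION_MS[1]:
        errors.append("retention.ms is outside the allowed range")
    return errors


good = {"name": "click.orders.avro", "labels": {"criticality": "C1"},
        "configs": {"retention.ms": "604800000"}}
bad = {"name": "orders-events", "labels": {"criticality": "C1"},
       "configs": {"retention.ms": "604800000"}}

print(violations(good))  # []
print(violations(bad))
```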

Where this stops

To clarify, it's not one provider to rule them all:

It does not create clusters. No brokers, no VPC, no MSK Serverless. Keep hashicorp/aws or confluentinc/confluent for the infra layer. The pairing is: cloud provider creates the cluster, control-plane provider governs what runs on it.
It requires Conduktor Console. It's the IaC surface of a platform, not a standalone Kafka client.

The takeaway

Ask yourself:

Who owns click.orders.events?
What rules must it respect?
What can the orders team create tomorrow without opening a ticket?

If the answer is "our wiki", that's the layer worth moving into code.

Browse the provider docs on the Terraform Registry if you want to poke at it.

Every Application Could Delete Our Schemas.

Stéphane Derosiaux — Mon, 13 Jul 2026 20:49:55 +0000

Quick question about your Kafka setup: right now, which applications are allowed to change or delete which schemas?

Sometimes it's everyone, because the Schema Registry (SR) is not protected. Sometimes teams put nginx or Traefik in front of the SR to block a few routes (like global compatibility) and call it done. Sometimes they add basic authentication, at least. Rarely does anyone address authorization.

Authorization is about "what are you allowed to change/break?"

Let me show you where the gap actually is, because "just put a proxy in front of it" is not enough.

Authentication isn't authorization

A schema registry exists to solve a coordination problem. Producers and consumers ship on their own schedules, so they need a shared, versioned contract for the shape of each message. Producers register a schema, consumers resolve it by ID, and the registry enforces compatibility as things evolve so one side's change doesn't blow up the other.

That makes it a very high-value target. Everything producing or consuming structured data depends on it. So "who can change it" should be a first-class question. In the free Confluent Schema Registry, the answer is "anyone."

The default listener is plain HTTP:

# Schema Registry default — plaintext, every interface
listeners=http://0.0.0.0:8081

Per-subject access control isn't in the free version. It's a commercial add-on:

"Until either ACLs or Role-Based Access Control is also enabled for Schema Registry, any user can create, alter, and delete Schema Registry subjects."

Create, alter, and delete are exactly the operations you want to restrict, and the control you actually want is per-subject: the payments team can write `payments-*` schemas and nothing else.

It's not just Confluent

The free registry is the most obvious offender, but it's not alone. Every option handles per-subject authorization differently, and most of it is off by default:

Approach	Per-subject authz	Default	Uses your Kafka identity
Confluent SR (free)	none	open, plain HTTP	no
Apicurio	registry-wide roles only	off	no
Karapace	yes, regex ACL per subject	off	no
Confluent RBAC / plugin	yes	off, commercial	Confluent's plane
Generic HTTP proxy	coarse, path-prefix only	you build it	no — matches the URL

Per-subject control exists in a couple of places, but it's either a hand-edited auth file (Karapace) or a paid platform tier (Confluent RBAC). None of the free or open-source options tie the rule to the Kafka identities you already manage, or give you an audit trail that lines up with Kafka.

Why a generic HTTP proxy is a bad solution

Since the registry speaks HTTP, the natural thinking is to set up an nginx or an API gateway and only allow the routes you want. For basic guardrails on a single registry, that's cheap and reasonable. Subject names even show up in the path, so you can gate writes by prefix:

# Allow writes only to payments-* subjects
location ~ ^/subjects/payments-.*/versions {
    proxy_pass http://schema-registry:8081;
}

Looks fine. Then you need to maintain the nginx file, meet the edge cases, and you can only reason about URLs: not schemas, not identities. Here's the one that always gets people:

GET /schemas/ids/42

That fetches a schema by numeric ID. There's no subject in the path to match on, so your careful payments-* rule doesn't apply — and that call can return any schema, including another team's.

It authorizes the network path, not your Kafka principal: so it's only as strong as the firewall around port 8081.
It assumes subjects are named after topics. Switch to RecordNameStrategy and the subject is a record name, so your prefix rules stop lining up.
It can't tell a deliberate registration from a client silently auto-registering on first produce:

# One flag away from clients registering schemas you never reviewed
auto.register.schemas=true

Going down this path creates an abomination you'll maintain forever, with your own model of subjects and owners living inside your proxy. That's a lot of infrastructure to own just to answer "who can touch payments schemas."
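The blind spot is easy to reproduce without nginx. This Python sketch applies the same idea as the location rule above (find the subject in the URL, then decide) and shows the request where there is simply nothing to decide on. The subject names are made up for the example.

```python
import re

# Mirrors the path-based approach: the subject is whatever follows /subjects/.
SUBJECT_IN_PATH = re.compile(r"^/subjects/([^/]+)")


def subject_of(path):
    """Return the subject a path names, or None when the URL carries none."""
    match = SUBJECT_IN_PATH.match(path)
    return match.group(1) if match else None


for path in [
    "/subjects/payments-orders-value/versions",
    "/subjects/billing-invoices-value/versions",
    "/schemas/ids/42",
]:
    print(path, "->", subject_of(path))
```

The last line prints None: a URL-level rule has no subject to check, so it either blocks every lookup by ID or lets all of them through.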

What is "closed by default"?

Per-subject authorization should be:

By subject, prefix, and wildcard, not by URL path that happens to contain a subject sometimes.
Tied to the Kafka identities you already use, the same principals you manage for topic ACLs, not a second, parallel access system.
Logged end to end, every operation and every denial, in a trail that lines up with the rest of your Kafka audit.
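Those three properties fit in a very small model. The Python sketch below is a toy, with a hypothetical rule table keyed by Kafka principal, wildcard matching on subjects, and an audit list that records denials as well as approvals. It illustrates the shape of the check, not any product's implementation.

```python
from fnmatch import fnmatchcase

# Hypothetical rule table: Kafka principal -> [(subject pattern, operations)].
RULES = {
    "User:payments-svc": [("payments-*", {"read", "write"})],
    "User:analytics": [("*", {"read"})],
}
AUDIT = []  # every decision is recorded, allowed or denied


def authorize(principal, subject, operation):
    """Closed by default: allow only when a rule for this principal matches."""
    allowed = any(
        fnmatchcase(subject, pattern) and operation in operations
        for pattern, operations in RULES.get(principal, [])
    )
    AUDIT.append((principal, subject, operation, "ALLOW" if allowed else "DENY"))
    return allowed


print(authorize("User:payments-svc", "payments-orders-value", "write"))   # True
print(authorize("User:payments-svc", "billing-invoices-value", "write"))  # False
print(authorize("User:unknown", "payments-orders-value", "read"))         # False
print(len(AUDIT))                                                         # 3
```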

The way to get all three without a platform license or bespoke plumbing is a schema-aware proxy, something that understands subjects and requests rather than URLs. That's the category Conduktor's Schema Registry Proxy sits in: it fronts the registry you already run, checks read and write permissions per subject, logs every call, and doesn't need any producer or consumer changes. You just point schema.registry.url at the proxy and keep your current registry, Confluent or open source:

# Clients don't change — they just talk to the proxy
schema.registry.url=http://schema-registry-proxy:8081

Now you have a protected schema registry, reads and writes, that can even be linked to a self-service framework where ownership is a first-class citizen.

Every Kafka Cluster Eventually Hits This Networking Problem

Stéphane Derosiaux — Mon, 06 Jul 2026 17:42:50 +0000

If you've ever tried to connect a Kafka client that lives in a different VPC than the cluster, you've probably hit this issue where there's a route between the two networks, telnet works, and yet the client still can't consume a single record. It feels like a networking bug. It isn't.

The issue is that reaching Kafka across VPCs is an addressing problem, not a networking one. Once that clicks, it becomes clear why peering, transit gateways, and per-broker load balancers fall short.

What your client actually does when it connects

Most services are easy to reach across a networking boundary. Put a load balancer in front, give it one address, point clients at it. Done.

Kafka doesn't work like that. A client connects in two stages:

Bootstrap. The client talks to any broker and asks one question: who's in this cluster?
Direct connection. The broker answers with metadata — a list of every broker, each named by its own advertised.listeners address. The client then opens direct connections to specific brokers: the leader for each partition it reads or writes (or the nearest replica, if you're using follower fetching, KIP-392, Kafka 2.4+).

That second stage is the most important to remember. A single shared address in front of the cluster does not work, because the client has to resolve and route to each broker's advertised address, exactly as the cluster hands it back.

So if a broker advertises broker-1.cluster.internal:9092, a private name that only resolves inside the cluster's VPC, a remote client connects to the bootstrap fine, then fails when it tries to connect directly to one of the brokers:

docker exec kafka-consumer-a kcat -b kafka:9092 -L -m 5
# -> Failed to resolve 'kafka:9092'

A network path between two VPCs is not the same as Kafka reachability. The client still has to resolve and route to every broker's advertised address from where it sits.

"Then just make the broker advertise something reachable". Sure, brokers support multiple listeners, one internal and one external. But every listener is another port to open on every broker, and managed Kafka (MSK, Confluent Cloud, Aiven) won't let you touch them anyway. Listeners don't scale to a dozen independent networks across accounts and clouds. Which is exactly the situation you're in.

Why peering collapses the moment you have more than two networks

VPC peering is genuinely great for connecting two VPCs. Past two, here be dragons:

It's point-to-point and non-transitive. Peering A and B each to the cluster does not let A talk to B. Every new client network needs its own peering straight into the cluster's VPC.
Your cluster's VPC becomes an accidental hub. Ten client networks means ten peerings, ten route-table entries, and ten security-group conversations on a VPC that was never designed to be a hub.
CIDRs have to be coordinated. Peered VPCs can't overlap ranges, so every team negotiates address space.
Managed Kafka caps it. Providers limit how many peerings or PrivateLink attachments you get, and none of them let you rewrite broker listeners.

The recurring pain is one line: every new client forces another change onto the cluster's VPC, the one piece of infra you least want to keep editing.

The fix: let a Kafka-aware proxy rewrite the addresses

The solution: put something in the connection path that actually understands the Kafka protocol instead of just shuffling packets, and have it rewrite the broker addresses in the metadata response before the client ever sees them.

That's what Conduktor Gateway does. A broker advertises kafka-internal:9092; the Gateway rewrites that to an address the client can reach. Because clients connect to specific brokers (the partition leaders), the Gateway maps each broker to its own port so traffic still routes deterministically to the right one behind it.
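The rewrite itself is conceptually a lookup table. Here is a minimal Python sketch of the idea, assuming a simple port scheme (base port plus the broker's position in sorted order) and made-up broker names. It is not the Gateway's actual mapping logic.

```python
def rewrite_metadata(brokers, gateway_host, base_port):
    """Give each broker its own port on the gateway host.

    `brokers` maps broker id -> advertised "host:port". Returns what the
    client will see, plus the port -> real broker table the proxy needs
    to forward each connection to the right place.
    """
    client_view = {}
    routes = {}
    for offset, broker_id in enumerate(sorted(brokers)):
        port = base_port + offset  # assumed port scheme, for illustration
        client_view[broker_id] = f"{gateway_host}:{port}"
        routes[port] = brokers[broker_id]
    return client_view, routes


brokers = {
    1: "broker-1.cluster.internal:9092",
    2: "broker-2.cluster.internal:9092",
    3: "broker-3.cluster.internal:9092",
}
client_view, routes = rewrite_metadata(brokers, "conduktor-gateway.hub", 9092)
print(client_view[2])  # conduktor-gateway.hub:9093
print(routes[9093])    # broker-2.cluster.internal:9092
```

One port per broker keeps the mapping one-to-one, so a client asking for the leader of a partition still lands on that leader.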

From the client's side, the only change is a single line:

# before: pointing straight at a broker it can't actually reach
bootstrap.servers=broker-1.cluster.internal:9092

# after: pointing at the proxy, which hands back addresses it can
bootstrap.servers=conduktor-gateway.hub:9092

Credentials don't change. The client presents the same SASL credentials it already has, the Gateway forwards them to the broker, and Kafka ACLs (or Confluent RBAC) still decide what it's allowed to do. The proxy is stateless, it stores no credentials and adds no new auth model.

The nice architectural payoff: the Gateway lives in its own hub VPC. Peerings or PrivateLink attachments land on the hub, never directly between a client and the cluster. New networks attach to the hub; the cluster keeps a single attachment no matter how many clients reach in, and its listeners are never touched.

The Gateway isn't a substitute for a route: you still need connectivity between the hub and each VPC. What it removes is the per-client peering into the cluster, the broker reconfiguration, and the requirement that every advertised address be reachable from every client.

"But what about a transit gateway? Or an LB per broker?"

Won't work.

A cloud transit gateway is one hub many VPCs attach to for any-to-any IP connectivity. But it operates at L3 so it moves IP packets with zero awareness of Kafka. It doesn't change what brokers advertise, so you're back to sharing private hosted zones across accounts. Overlapping CIDRs still break it without NAT, it's confined to one cloud, and on managed Kafka you still can't touch the listeners.

A load balancer per broker can't be round-robined, because clients connect to specific leaders. So it's one NLB (or target group, or PrivateLink endpoint) per broker, plus advertised.listeners reconfigured: N pieces of plumbing to keep in sync, each with its own port, DNS record, and TLS SAN. Scale or replace a broker and the mapping churns, forcing a rolling restart. And on managed Kafka you usually can't set advertised listeners at all — so it's off the table before you begin.

Approach	Works at	Fixes the advertised-address problem?	Managed Kafka?	Cross-cloud?
VPC peering	L3	No	Capped by limits	No
Transit gateway	L3	No	No	On-prem yes, other cloud no
NLB per broker	L4	Only if you reconfigure listeners	Usually not	Partial
Kafka-aware proxy	Kafka L7	Yes, automatically	Yes	Yes

Reproduce the whole thing on one machine

The free Gateway Community quickstart stands up the exact scenario: a Kafka cluster in a private Docker network the clients can't reach, two consumers in two separate "VPC" networks, and the Gateway as the only container joined to all three.

bash <(curl -fsSL https://releases.conduktor.io/gateway-community-quickstart)

Consumer A can't even resolve the broker directly — different network:

docker exec kafka-consumer-a kcat -b kafka:9092 -L -m 5
# -> Failed to resolve 'kafka:9092'

Same consumer, same SASL credentials, one different bootstrap address — now it reads:

docker exec kafka-consumer-a kcat -b conduktor-gateway:9092 -t customers -C -e -c 3 \
  -s value=avro -r http://karapace:8081 \
  -X security.protocol=SASL_PLAINTEXT -X sasl.mechanism=PLAIN \
  -X sasl.username=consumer-a -X sasl.password=consumer-a-secret
# -> readable records

And the log line that proves why:

docker logs conduktor-gateway 2>&1 | grep "Rewriting METADATA"
# kafka:9092 -> conduktor-gateway:9092

That's the whole trick. The broker advertises an address the client can't reach; the proxy rewrites it to one it can.

Wrapping up

If cross-VPC Kafka has been a recurring issue on your platform team, try rethinking it: you don't have a networking problem, you have an addressing problem, and a Kafka-aware proxy is the one thing that fixes the address without you touching the cluster.

If you later want to encrypt fields, enforce schemas, or isolate tenants on that same path, this same proxy will do it too.

If you want to poke at it, the Gateway Community quickstart is free and runs on one machine. Curious how you're solving this today, peering everything everywhere, transit gateways, something else? Let me know.

We Measured Kafka Usage. The Results Surprised Us.

Stéphane Derosiaux — Mon, 29 Jun 2026 14:32:43 +0000

I sit in a lot of Kafka reviews: vendors, instances, replication, tiered storage, advanced features like fetch-from-follower, networking, partitions, best practices, and so on. Most discussions are driven by tech alone instead of looking at the big picture: how this beautiful infra is actually being used.

Unpopular opinion: most of your Kafka cost comes not from infrastructure but from how it's used.

Where the cost actually comes from

Vendor calculators are hard to compare because of so many assumptions. Replication multipliers, disk class, compression ratio, tiered storage (billed at the replicated rate or the actual S3 rate). The price you see is almost never what you pay.

RF=3 multiplies the per-GB price by 3 everywhere. And tiered storage is often still billed at the replicated rate even though only one copy lives in S3. You're paying the RF=3 rate for data Kafka no longer replicates.
Cross-VPC, in-region traffic between your account and the vendor's lands on your cloud bill, roughly 1c/GB each way depending on the path.
Without fetch-from-follower, most consumer fetches cross AZ boundaries. With three balanced AZs, ~2/3 of consumer reads go cross-AZ, because each partition's leader lives in one AZ and consumers in the other two have to reach across to it.
Compression is often just... off.
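The 2/3 figure is just (N - 1) / N for N balanced AZs. A quick Python sketch, assuming leaders and consumers are spread evenly and every fetch goes to the leader; the per-GB rate and monthly volume are placeholder inputs, so plug in your own.

```python
def cross_az_fraction(num_azs):
    """Share of consumer fetches that leave the consumer's own AZ."""
    return (num_azs - 1) / num_azs


def monthly_cross_az_cost(read_gb, num_azs, usd_per_gb):
    """usd_per_gb is whatever your cloud charges; nothing is assumed here."""
    return read_gb * cross_az_fraction(num_azs) * usd_per_gb


print(round(cross_az_fraction(3), 3))                     # 0.667
print(round(monthly_cross_az_cost(90_000, 3, 0.02), 2))   # placeholder rate and volume
```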

With zstd at sane batch sizes, JSON-ish logs and metrics commonly compress 8–10x:

compression.type=zstd
batch.size=65536          # 64KB
linger.ms=20

Going from 5x to 10x halves your stored bytes and halves the replication bytes flowing inside the cluster. You pay for that traffic three times over at RF=3, so the ratio matters.
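The arithmetic behind that claim, as a tiny Python sketch. The 1,000 GB raw volume is an arbitrary example figure.

```python
def stored_gb(raw_gb, compression_ratio, replication_factor):
    """Bytes on disk across all replicas for a given raw volume."""
    return raw_gb / compression_ratio * replication_factor


print(stored_gb(1000, 5, 3))   # 600.0
print(stored_gb(1000, 10, 3))  # 300.0
```

Doubling the compression ratio halves the footprint, and the replication factor scales whatever is left.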

And fetch-from-follower, available since Kafka 2.4, is a broker + consumer config away. Same-AZ traffic inside your VPC is free on AWS, so no cross-AZ tax:

# broker
replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector
broker.rack=us-east-1a

# consumer — must match the broker's rack value
client.rack=us-east-1a

Do all of it: fetch-from-follower, tiered storage, compression enforcement, partition right-sizing, BYOC to apply your existing cloud discount, single-AZ topics where you can tolerate it. But notice that it's still infrastructure tuning. Let's go up.

Cost is a stack, not a line item

When you tune anything in Kafka, you think in layers, bottom-up: hardware, JVM, broker config, producer/consumer tuning, topic design, application code. Same for cost:

Cloud infrastructure: instance types, AZ placement, networking, BYOC negotiation. At big contract sizes, negotiated networking discounts can hit 90%, but only if traffic flows through your account.
Broker & protocol tuning: compression, retention, RF, fetch-from-follower, tiered storage, partition count. Easy, they're config changes.
Architecture: diskless topics, Iceberg topics, single-AZ topics, proxies between clients and brokers, virtual clusters for multi-tenancy and non-prod consolidation.
Usage: fan-out, governance, discovery, self-service.

Payoff often goes up as you go higher (more systems thinking). Everyone's comfortable arguing about GP2 vs GP3 (volume types on AWS). Almost nobody asks "why are 40% of these partitions doing nothing?"

Speaking of which: most clusters carry 40–70% partition waste, did you know that? On managed Kafka that's per-partition-hour billing. On self-managed, you hit the ~4,000–6,000 partition-replicas-per-broker ceiling (RF=3 turns 100k partitions into 300k replicas to host and track). KRaft raises the ceiling but it doesn't make the waste free.
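To put numbers on that, here is a small Python sketch using the figures above (100k partitions, RF=3, a 4,000 to 6,000 replica ceiling per broker). The 40% waste scenario takes the low end of the quoted range; the broker counts are a lower bound from replica count alone, not a sizing recommendation.

```python
import math


def replicas(partitions, replication_factor):
    return partitions * replication_factor


def min_brokers(partitions, replication_factor, replicas_per_broker):
    """Fewest brokers that keep each one under the replica ceiling."""
    return math.ceil(replicas(partitions, replication_factor) / replicas_per_broker)


print(replicas(100_000, 3))           # 300000
print(min_brokers(100_000, 3, 4000))  # 75
print(min_brokers(100_000, 3, 6000))  # 50

# Same cluster with 40% of partitions removed as waste.
print(min_brokers(60_000, 3, 4000))   # 45
```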

Fan-out is the whole point of Kafka

Kafka exists so that one byte written can be read by N independent consumers, decoupled in time, with zero coordination back to the producer. That's the log abstraction's reason to live.

Do you measure your average fan-out? If it's 1, you probably shouldn't be running Kafka at all: you're paying for a distributed log to do a point-to-point queue's job. LinkedIn famously ran at ~5.4: the same bytes, written once, read by 5.4 independent teams.

Cluster cost stays flat while consumers grow, so the cost per business outcome drops the more existing topics get consumed:

cost_per_use_case = cluster_cost / fan_out

fan-out 1  ->  $X      (one team carries the whole bill)
fan-out 3  ->  $X / 3
fan-out 5  ->  $X / 5  (same hardware, five outcomes)
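One way to put a number on this, as a rough sketch: treat fan-out as bytes read divided by bytes written, which weights busy topics more than idle ones. That definition, the function names, and the sample figures are assumptions for illustration; count consumer groups however your monitoring exposes them.

```python
def average_fan_out(topics: dict[str, tuple[int, int]]) -> float:
    """Byte-weighted fan-out: bytes read by consumers / bytes written by producers.

    topics maps topic name -> (bytes_in, number_of_consumer_groups).
    """
    bytes_in = sum(b for b, _ in topics.values())
    bytes_out = sum(b * groups for b, groups in topics.values())
    return bytes_out / bytes_in if bytes_in else 0.0

def cost_per_use_case(cluster_cost: float, fan_out: float) -> float:
    """cost_per_use_case = cluster_cost / fan_out, as in the formula above."""
    if fan_out <= 0:
        raise ValueError("nothing is being consumed: fan-out is zero")
    return cluster_cost / fan_out

# Hypothetical cluster: one well-shared topic, one lightly shared, one unread.
topics = {
    "orders": (500, 5),
    "clicks": (250, 2),
    "tmp":    (250, 0),
}
fan_out = average_fan_out(topics)
print(fan_out)                             # 3.0
print(cost_per_use_case(90_000, fan_out))  # 30000.0
```

The unread topic drags the average down, which is the point: it costs storage and replication and serves no use case.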

"Is our Kafka usage growing?" is the wrong question. More business use-cases reading existing data is the best money you'll ever spend. Duplicated topics, created because nobody could find the existing one, are pure waste: more storage, more replication, more pipelines, all because discovery and ownership are missing.

The same goes for partitions: people over-provision because nobody knows how to size them, and you can't reduce partition count after the fact (breaks key ordering). The only way to surface that waste is chargeback at the team-and-topic level. You can't optimize what you can't attribute.
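A chargeback can start very small. The sketch below splits a monthly cluster bill across teams in proportion to the partitions their topics hold. The allocation key (partition count), the field names, and the numbers are assumptions for illustration; bytes stored or bytes transferred are equally valid keys.

```python
from collections import defaultdict

def chargeback(cluster_cost: float, topics: list[dict]) -> dict[str, float]:
    """Split cluster_cost across teams in proportion to partitions owned."""
    total = sum(t["partitions"] for t in topics)
    if total == 0:
        return {}
    bill: dict[str, float] = defaultdict(float)
    for t in topics:
        owner = t.get("team") or "unowned"  # topics without an owner still cost money
        bill[owner] += cluster_cost * t["partitions"] / total
    return dict(bill)

topics = [
    {"name": "payments.orders",       "team": "payments",  "partitions": 12},
    {"name": "analytics.clickstream", "team": "analytics", "partitions": 48},
    {"name": "tmp.migration-test",    "team": None,        "partitions": 60},
]
print(chargeback(12_000, topics))
# {'payments': 1200.0, 'analytics': 4800.0, 'unowned': 6000.0}
```

The "unowned" line is usually the interesting one: it is cost nobody can be asked about.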

"A third of our traffic, we know what it has to do with, but we don't know exactly what they're doing."

That's the usage layer leaking. It costs money, and nobody can fix it because nobody knows how, knows where to look, or owns it. It's not an infra problem; it's governance, discovery, and self-service.

Cost optimization is everybody's concern and nobody's objective. Teams over-provision because "what if we need it later?" and "what if it breaks when we touch it?" are rational fears. "It's expensive" is not a business case. What works is showing the waste, the annual dollar number, and the effort to reclaim it, with a name next to it.

2026: Where to spend your effort

Most deployments I see have way more headroom in the usage layer than the infra layer: topics nobody reads, partitions nobody needs, teams who'd benefit from streaming but find it too painful to onboard.

There's a funny industry reflex here too. We chase the next architectural shiny thing, diskless, Iceberg topics, single-AZ, before we've answered the boring questions: who's using this, for what, and why aren't more teams using it?

My actual recommendation:

Do the infrastructure pass once. Instance types, AZ placement, BYOC.
Do the config pass once. Compression, retention, partition right-sizing, fetch-from-follower, tiered storage.
Spend the rest of the year on the usage layer. Fan-out, ownership, discovery, chargeback, self-service.

Steps 1 and 2 are a sprint. Step 3 is the marathon.

If you want to see where your usage layer is leaking, Conduktor's field engineering team does a free Kafka cost analysis: they'll map cost back to teams and topics and show you where the payoff sits. And if you just want to keep reading, Why Kafka Costs Keep Rising and the partition waste deep-dive are good next stops.

What's your average fan-out? If you don't know it off the top of your head, that's probably where I'd start.

We Built a Kafka Proxy. Here's Everything It Ended Up Doing.

Stéphane Derosiaux — Mon, 15 Jun 2026 13:43:15 +0000

At Current 2026, I realized that nobody knows exactly what a Kafka proxy can do.

Most engineers and architects think it's just a reverse proxy for Kafka (think nginx) that does routing and bridges legacy or non-native clients to the cluster.

That's not it. It's barely the start of it.

Encryption

For instance, an engineer at a UK building society had a hard requirement: encrypt personally identifiable fields (emails, national insurance numbers, that kind of data) before they ever hit Kafka.

His team built encryption into the application layer. Every producer that touched PII got encryption code. Every consumer got decryption code. Key handling, rotation, and so on had to be managed across services.

Something like this:

// In every producer that touches PII...
public ProducerRecord<String, Customer> encrypt(Customer c) {
    c.setEmail(crypto.encrypt(c.getEmail(), keyRef("pii-key")));
    c.setSsn(crypto.encrypt(c.getSsn(), keyRef("pii-key")));
    return new ProducerRecord<>("customers", c.getId(), c);
}

// And the mirror image in every consumer...
public Customer decrypt(Customer c) {
    c.setEmail(crypto.decrypt(c.getEmail()));
    c.setSsn(crypto.decrypt(c.getSsn()));
    return c;
}

Multiply that by every microservice that now has to be updated and maintained (across languages, versions, KMS access, etc.). That's expensive, both to implement and to maintain.

He didn't know a Kafka proxy could have done the whole thing at the record level, outside the apps. When we chatted about it, he realized he might have made a mistake.

What a Kafka proxy does

A Kafka proxy sits between your clients and your brokers and speaks the Kafka protocol. Clients connect to it exactly like they'd connect to a broker. No SDK, no app changes. It works for Kafka clients, Kafka Connect, Kafka Streams, Flink, Spark, etc. It's fully transparent to them.

That makes it a natural place to put policy that doesn't belong inside your application and doesn't belong inside the cluster either.

Encryption is the obvious one. Instead of touching dozens of applications, you declare the rule once. With Conduktor Gateway it's what we call an interceptor: a small piece of config applied to traffic matching a topic pattern. Roughly:

{
  "kind": "Interceptor",
  "apiVersion": "gateway/v2",
  "metadata": { "name": "encrypt-customer-pii" },
  "spec": {
    "pluginClass": "io.conduktor.gateway.interceptor.EncryptionPlugin",
    "priority": 100,
    "config": {
      "topic": "customers.*",
      "kmsConfig": { "kms": "VAULT", "vault": { "uri": "https://vault:8200" } },
      "recordValue": {
        "fields": [
          { "fieldName": "email", "algorithm": "AES256_GCM", "keySecretId": "pii-key" },
          { "fieldName": "ssn",   "algorithm": "AES256_GCM", "keySecretId": "pii-key" }
        ]
      }
    }
  }
}

The proxy encrypts the fields on the way in. Authorized consumers get them decrypted on the way out; everyone else gets ciphertext. The application code shrinks back to just... sending a record, with no KMS or secrets to deal with:

// Same producer, after.
producer.send(new ProducerRecord<>("customers", c.getId(), c));

Because the rule lives in one declarative place (the proxy), you can do things that are painful at the app layer.

For instance, crypto-shredding for GDPR. Delete the key, and every message encrypted with it becomes unreadable, instantly, across all your retention. You don't go hunting through topics for one person's data. You revoke a key. Done.

Masking, validation, isolation: same one place

Once the proxy is in the Kafka path, the same pattern opens a lot of doors:

Field-level masking: show j***@example.com to one team, the real value to another, same topic.
Schema and payload validation: reject malformed records at the edge instead of poisoning a downstream consumer.
Topic aliasing for migration: point clients at a stable name while you move the real topic between clusters.
Virtual clusters: carve one physical cluster into isolated tenants without standing up new infrastructure.
Audit and policy enforcement: log and gate access without patching the broker or the client.

None of that touches application code or broker config. It's policy, declared once, enforced in the path.
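To make the masking item concrete, here is what such a rule does to a record, written as plain Python. It is only an illustration of the behavior (same stored record, two views depending on who reads it), not Conduktor Gateway's implementation or configuration; both function names are hypothetical.

```python
def mask_email(email: str) -> str:
    """Keep the first character of the local part and the full domain."""
    local, sep, domain = email.partition("@")
    if not sep or not local:
        return "***"  # not an email-shaped value: mask everything
    return f"{local[0]}***@{domain}"

def view_record(record: dict, can_see_pii: bool) -> dict:
    """Return the record as a given reader would see it."""
    if can_see_pii:
        return dict(record)
    masked = dict(record)
    masked["email"] = mask_email(record["email"])
    return masked

record = {"id": "c-42", "email": "jane@example.com"}
print(view_record(record, can_see_pii=False))  # {'id': 'c-42', 'email': 'j***@example.com'}
print(view_record(record, can_see_pii=True))   # {'id': 'c-42', 'email': 'jane@example.com'}
```

The stored record never changes. Only the view does, which is why the rule fits in the path rather than in each consumer.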

Why this connects to cost and self-service

The other big topic at the conference was cost. In April, Conduktor published a field guide on where Kafka costs hide.

Then came the self-service conversation: teams want developers to create topics and request access autonomously, but a simple topics-in-GitOps setup is not enough, because self-service without guardrails easily turns Kafka into a mess:

"If somebody goes onto the tool and adds in something ridiculous, like a thousand partitions, we need someone to have eyes on that. That's something we've learned we can't let go of."

Look at what encryption-at-the-proxy, cost guardrails, and self-service approval gates have in common. They're all policy that belongs between your developers and your brokers, not baked into either one. Push it into the app and you copy-paste it dozens of times. Push it into the cluster and you can't change it without a migration. Put it in the layer in between, declare it once, and you can actually govern it.

To build AI agents on top of streaming, the plumbing must come first: ownership, schema discipline, key custody, data quality before the data even enters Kafka, etc. A proxy that enforces structure and policy is a big chunk of that plumbing.

So where does your policy live?

People think a proxy just passes packets. What it really does is act as a safekeeper that holds the policy.

If you've got encryption, masking, validation, or multi-tenant isolation scattered across your services right now, it's worth asking whether any of it should be living one layer down instead.

Want to go deeper? The Gateway overview walks through the interceptor model, and the original Current 2026 write-up has the rest of what we heard on the floor.

I Gave Claude Admin Access to My Kafka Cluster

Stéphane Derosiaux — Mon, 08 Jun 2026 13:34:35 +0000

Your AI coding assistant can explain consumer groups, rebalancing, and exactly-once semantics. Ask it to actually set up a Kafka platform with governance, though, and it won't be able to do that on its own.

Between hallucinations, misunderstandings, production impact (I actually saw Claude mess up a rolling upgrade of Kafka brokers), and no knowledge of the products your Kafka infra relies on, there's a lot working against it.

The models, besides their training, have zero context about your infra. They've never seen your cluster, don't know your policies (technical, governance), and often have no way to check anything against your actual environment.

You can give it the missing context using Conduktor.

The thing that was missing

There is an open-source Conduktor skill you install into your AI assistant. It works with Claude Code, Cursor, VS Code Copilot, Gemini CLI:

npx skills add conduktor/skills

This teaches the agent the whole platform (Console, Gateway, and the CLI) and how to run processes against it, so it can be efficient and not hallucinate.

After the install, the agent discovers your environment (Kafka clusters, Schema Registry, policies, etc.), asks questions based on what it finds, generates configs with real values and best practices, and runs everything with dry-run validation before it touches anything.

The CLI is really its "hands", and it goes deeper than MCP alone. The skill is the playbook where years of experience and practices are written down. This makes a big difference versus "generate some YAML and cross your fingers".

Starting from absolutely nothing

You can start from scratch with just Docker running and nothing else. No Kafka, no Conduktor, no config. I asked just this (with the Conduktor skill set up):

install Conduktor and set it up so I can login

It checked my environment, asked what I was trying to do, wrote a docker-compose.yml, spun up the containers, hit one error along the way, self-corrected, and handed me a working platform, Kafka & Console perfectly configured.

I could ask the same but on my production Kubernetes. It would follow best practices too, use Helm, discover my environment, etc., and in minutes everything would be wired perfectly, with policies already in place.

This is much more powerful than a "human" quickstart: it covers a wider range of setups and is closer to production-ready from the start. The agent knows the Kafka domain, and with the skill it knows Conduktor, so the combination makes it ask me the right questions.

Governance, without becoming a Kafka lawyer

Running Kafka isn't the hard part anymore. Making it safe for a team to share is the hard part: naming conventions, ownership boundaries, policies. This is what prevents a Kafka cluster from turning into a wasteland of test-topic-final-v2.

The beautiful thing is that you can now give it large prompts like this:

set up governance for two teams, Payments and Analytics, with topic policies and cross-team permissions

It worked in stages and figured out the dependency ordering itself. When the API rejected something, it read the rejection, restructured the YAML, and retried, with minimal hand-holding from me (just asking what policies I want based on what's possible). It ended up creating the following:

TopicPolicy objects: locking down naming per team, enforcing safe defaults (retention, replication, required labels) across every topic.
Application objects with non-overlapping resource boundaries to define ownership of resources and teams.
Topics with descriptions and labels in the catalog.
Cross-team permission giving Analytics read access to payments.orders.*.

This is federated ownership in practice: the platform team sets the boundaries, developers move freely inside them. Normally that knowledge takes months to accumulate and lives in spreadsheets or Jira tickets. Here it lives in a skill file that every agent on the team can read.

Now flip to the developer side

Once those guardrails exist, a developer on the Payments team installs the same skill and never has to know any of it happened. No ApplicationInstance, no TopicPolicy, no YAML. They just talk.

"What topics do we have?"

The agent runs conduktor get Topic and shows the catalog — descriptions, owners, labels, visibility.

"I need a topic for my service."

The agent checks their ApplicationInstance, reads the policy constraints (naming prefix payments.*, retention one-to-seven days, a required data-criticality label), asks what the topic is for, generates compliant YAML, dry-runs it, and applies:

Topic/payments.fulfillment.shipped: Created

The developer just got a topic that's compliant by default. Without the skill, that's most likely a Jira ticket, plus a round trip with the platform team to ask what the right shape is and what to put in it.
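The three constraints named above (a payments.* naming prefix, one-to-seven-day retention, a required data-criticality label) boil down to a check like the one below. This is a plain-Python sketch of the rule, not the TopicPolicy format or how Conduktor evaluates it; `TopicRequest` and `policy_violations` are hypothetical names.

```python
from dataclasses import dataclass, field

DAY_MS = 24 * 60 * 60 * 1000

@dataclass
class TopicRequest:
    name: str
    retention_ms: int
    labels: dict[str, str] = field(default_factory=dict)

def policy_violations(req: TopicRequest) -> list[str]:
    """Return every rule the request breaks; an empty list means compliant."""
    violations = []
    if not req.name.startswith("payments."):
        violations.append("name must start with 'payments.'")
    if not DAY_MS <= req.retention_ms <= 7 * DAY_MS:
        violations.append("retention must be between 1 and 7 days")
    if "data-criticality" not in req.labels:
        violations.append("missing required label 'data-criticality'")
    return violations

ok = TopicRequest("payments.fulfillment.shipped", 3 * DAY_MS, {"data-criticality": "high"})
bad = TopicRequest("test-topic", 30 * DAY_MS)
print(policy_violations(ok))        # []
print(len(policy_violations(bad)))  # 3
```

Returning all violations at once, rather than failing on the first, is what lets an agent (or a human) fix the request in one pass.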

"How do I produce to my topic?"

It reads the cluster config, grabs the real bootstrap server, and hands back working code:

from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:19092",
})

producer.produce(
    "payments.fulfillment.shipped",
    key="ord-123",
    value='{"orderId": "ord-123", "status": "shipped"}',
)
producer.flush()

Copy, paste, run.

"I need to read the Analytics team's clickstream."

The agent finds that analytics.clickstream.pageviews belongs to the Analytics team, then writes a read-only permission scoped to exactly that topic, at both the Kafka and Console layers. The developer doesn't know what an ACL is or what patternType: LITERAL means. They asked in English and got access.

What I actually take away from this

This walkthrough only touched governance and onboarding. The skill also covers Gateway (Kafka proxy) encryption, data quality rules, Terraform export, and CI/CD scaffolding.

Try it, it's one command:

npx skills add conduktor/skills

It's open source, so if you hit a workflow it handles badly, open a PR. And if you're new to Conduktor, the Community Edition is free and self-hosted, and the skill will do the install for you.

This post was adapted from the original on the Conduktor blog.

What Kafka Engineers Actually Care About in 2026

Stéphane Derosiaux — Mon, 01 Jun 2026 13:23:50 +0000

We were at Current 2026 in London two weeks ago. The keynotes were about agentic AI and streaming agents, but the conference conversations were about something else entirely.

1. The IBM-Confluent acquisition?

Six months after IBM acquired Confluent for $11B, I expected hot takes. What I got was mostly indifference.

A data engineering lead at a major US bank called himself a "doomer": Kafka and Cassandra are his two favorite open-source projects, and both now sit under IBM. A staff engineer at a streaming vendor was more optimistic, speculating about Confluent's AI team merging with Watson.

The majority were more like "I try to concentrate on the tech side." "I don't know enough to comment." Moving on.

The acquisition was in the air, but practitioners had more pressing problems to talk about.

2. Kafka costs are the third certainty (after death and taxes)

Every cost conversation we had mapped to the same things:

Partition overprovisioning: inherited from guesswork or from templates tuned for a different workload
Retention mismatches: defaults that outlive the use case they were set for
Cluster proliferation: one-off clusters that never got consolidated
Orphan and duplicate topics: experiments or dead projects that never got cleaned up
Inefficient client patterns: batching, compression, and serialization left on defaults + JVM vs librdkafka differences
Static capacity: paying for peak when the load is 10% of that most of the time

Some examples of scale where they had these problems:

six years of unmanaged topics inherited during a Confluent Cloud migration
one team running twenty MSK clusters across multiple regions
MSK, Confluent, and Redpanda all running in parallel as totally separate estates, with the team looking for a single control plane

A data engineer at a German energy consultancy put it well:

"Kafka resource costs should not be taken for granted. Maybe I need sub-second latency, or maybe daily is enough. These are considerations you must make up front, before it's too late."

The solution is to change how Kafka is used: guardrails at topic creation, retention defaults that reflect reality, and ownership at the source. We typically see 25-40% of infrastructure costs come back this way.

We wrote up the full analysis in Your Platform Team Can't Fix Kafka Costs Alone.

3. Everyone wants self-service. Nobody has the operating model for it.

Self-service provisioning is a common topic now. Developers want to create topics and request permissions without filing a Jira ticket (many still do). Platform teams want to provide it so they stop being a bottleneck. The question is how. Building it in-house is time-consuming given all the variations, and a vendor solution has to be flexible enough to adopt.

A senior engineer at a Danish financial firm told us he was too busy managing tickets to build the system that would have made the tickets unnecessary. An architect at a UK telecom had no self-service at all, just Jira. A senior developer at a UK bank said her biggest frustration was waiting on the ticketing system.

This is no joke. Self-service resource provisioning is a capability, and federated ownership is what produces it.

The platform team sets standards and guardrails.
Domain teams operate within them autonomously.

The teams that have this working got the operating model right before building the tooling. You need to slow down to accelerate better. Everyone else stopped at "we can create topics, we have GitOps". They think this is enough and miss the whole point: it's maybe 5% of a real solution.

"The key thing for us is approval gates. Guardrails. If somebody adds a thousand partitions, we need someone to have eyes on that." — Lead streaming data engineer at a major African bank

Self-service without federated ownership becomes either a free-for-all nobody can govern, or a tightly controlled environment nobody uses.

4. Nobody knows what a Kafka proxy actually does

This was the surprise. Kafka proxies came up in more conversations than any other architectural topic, but most people still think a proxy does one thing: route traffic for legacy clients.

An engineer at a European ISP runs one in production. It handles message routing. We asked about encryption, masking, or transformation. "We see a proxy as more of a routing tool."

Here's the range of what serious Kafka proxies like Conduktor Gateway can do:

Routing: bridging legacy or non-native clients (most common)
DR and migration: failover and provider switching
Security: encryption and masking at the message level
Full policy enforcement: aliasing, schema validation, virtual clusters, field-level encryption, audit, and access control

A VP at a global bank walked us through his checklist for a Kafka gateway: topic aliasing for migration, payload validation, virtual clusters for multi-tenant isolation, field-level encryption, data masking, and audit. It's a bank requirements doc.

Another engineer at a UK building society had built application-layer encryption for PII. He didn't know a proxy could do it at the message level, saving hours of per-application work. Plenty of teams are paying the same tax right now: they do it client-side, and it's a pain to manage at scale.

A proxy is where policy and control live when they don't belong in the cluster or the application. Here's what a Kafka proxy actually does.

5. AI on Kafka is an ownership problem, not a streaming problem

The keynotes kept talking about agentic AI as the bottleneck. Kafka developers disagree:

"Spinning up infrastructure and writing code is rarely the bottleneck in a full end-to-end business solution." — Staff engineer at a UK retailer

A senior engineer at a major social platform building on-call AI agents put it directly: "Building the AI agent isn't the difficult thing. The data needs to be in the right structure."

A software engineer at a UK challenger bank has been feeding an AI agent context from ~40 microservices. "You have to give it the entire Java classes of all those microservices. That's usually when it starts hallucinating."

The hallucination is a legibility problem. The bottleneck is data quality, which sits downstream of governance, which sits downstream of your operating model. Every prerequisite practitioners named (ownership, schema discipline, topic visibility, federated governance) is foundational plumbing, not AI tooling.

The teams that will run AI on Kafka in 2027 are the teams getting their ownership and governance right in 2026.

All these discussions came from people running Kafka in production at banks, telcos, retailers, and energy companies. Costs are too high and too complex to understand, self-service needs an operating model, proxies are underused, and AI isn't a shortcut past the governance work you haven't done yet.

If any of these hit home, the Conduktor blog goes deeper on each one.

Most Kafka Cost Calculators Miss This

Stéphane Derosiaux — Mon, 25 May 2026 15:19:36 +0000

Which side are you on: "This is just what Kafka costs at scale" or "We should switch to a cheaper Kafka provider"?

At Conduktor, our field team works inside Kafka environments that have been running for a long time. We see this: most Kafka teams are overpaying by 25 to 40 percent. Not because anyone did anything wrong, but because of how Kafka got built up over time.

The cost drivers of Kafka are weirdly context-dependent: the infrastructure and the provider are a tiny part of the full picture.

The "how" it's being used is the real question.

Five bad patterns eating budget

Below is what we see. The same patterns show up everywhere, and they are the first things we work on with our customers.

1. Partition overprovisioning

"How many partitions?" is the most common Kafka question. Last week someone told me their org just defaults to "64". I was shocked. Not only may providers price per partition, but from Kafka's point of view every partition also costs metadata, open file handles, and so on.

Partition count depends on expected throughput and concurrency (consumer parallelism). If a 64-partition topic sits in a cluster with barely any traffic, you're losing money on all sides. Multiply by dozens or hundreds of topics at scale.
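A common rule of thumb turns that into arithmetic: take the larger of what producers and consumers need to reach the target throughput, and never go below the number of consumers you want running in parallel. The sketch below assumes you have measured per-partition throughput on your own hardware; the function name and the figures are illustrative only.

```python
import math

def partitions_needed(target_mb_s: float,
                      producer_mb_s_per_partition: float,
                      consumer_mb_s_per_partition: float,
                      max_consumers: int = 1) -> int:
    """Smallest partition count that meets throughput and parallelism needs."""
    by_produce = math.ceil(target_mb_s / producer_mb_s_per_partition)
    by_consume = math.ceil(target_mb_s / consumer_mb_s_per_partition)
    return max(by_produce, by_consume, max_consumers, 1)

# 30 MB/s target, 10 MB/s per partition on the produce side,
# 5 MB/s per partition on the consume side, 4 consumers in the group.
print(partitions_needed(30, 10, 5, max_consumers=4))  # 6
```

Six partitions, not 64. Add headroom for growth if you like, but start from a number you can justify.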

2. Retention that makes no sense

Long retention on topics that nobody reads past the last few hours. Do you need replay? The default retention is 7 days, and it's often applied uniformly, when some topics only need a couple of hours and others genuinely need weeks.

Tip: with compacted topics and/or Kafka Streams (changelog topics, etc.), data can be stored indefinitely, which can cause security and regulatory issues.

3. Let's spin up another cluster

One-cluster-per-team was a reasonable isolation strategy a long time ago. We've seen the result multiple times: more than 500 clusters, with tons of mirroring to share data. That's money down the drain.

You're paying for underutilized clusters instead of consolidating onto fewer well-managed ones.

4. Zombie topics

Topics created for experiments, migrations, or one-off tests that were never cleaned up. It's a simple thing, but it costs a lot of money because no one is looking. Every one of them is replicated and has retention costs. We've seen enterprises with hundreds of zombie topics who were surprised when we showed them.
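Spotting candidates is mostly a filter: no recent writes and no consumer groups. The sketch below works on a hypothetical inventory (the field names and the 30-day idle window are assumptions); in practice the inputs would come from your metrics and consumer-group listings.

```python
from datetime import datetime, timedelta, timezone

def find_zombie_topics(topics, now, idle_after=timedelta(days=30)):
    """Return names of topics with no recent writes and no consumer groups."""
    zombies = []
    for topic in topics:
        last_write = topic["last_produced_at"]  # None means never written to
        idle = last_write is None or (now - last_write) > idle_after
        if idle and topic["consumer_groups"] == 0:
            zombies.append(topic["name"])
    return zombies

now = datetime(2026, 5, 25, tzinfo=timezone.utc)
topics = [
    {"name": "payments.orders", "last_produced_at": now - timedelta(minutes=5), "consumer_groups": 4},
    {"name": "migration-test-2023", "last_produced_at": now - timedelta(days=400), "consumer_groups": 0},
    {"name": "poc.never-used", "last_produced_at": None, "consumer_groups": 0},
]
print(find_zombie_topics(topics, now))  # ['migration-test-2023', 'poc.never-used']
```

Treat the output as a review list, not a delete list: a quiet topic may still be a monthly batch feed.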

5. Runaway egress

We had a customer where egress was running 30x higher than ingress on a single topic because of a misconfigured consumer. Buggy consumers, unnecessary fan-out, and chatty clients create traffic patterns that are invisible without dedicated infra monitoring. Egress is rarely free.
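A cheap first alarm for this pattern is the egress-to-ingress ratio per topic. The sketch below flags topics whose ratio exceeds a threshold; the threshold of 10, the function name, and the sample byte counts are assumptions, and a legitimately high fan-out topic will need its own limit.

```python
def runaway_egress(topics: dict[str, tuple[float, float]],
                   max_ratio: float = 10.0) -> list[tuple[str, float]]:
    """Return (topic, egress/ingress ratio) pairs above max_ratio, worst first.

    topics maps topic name -> (bytes_in, bytes_out) over the same time window.
    """
    flagged = []
    for name, (bytes_in, bytes_out) in topics.items():
        if bytes_in <= 0:
            continue  # no ingress in the window: ratio is undefined, review separately
        ratio = bytes_out / bytes_in
        if ratio > max_ratio:
            flagged.append((name, ratio))
    return sorted(flagged, key=lambda item: item[1], reverse=True)

topics = {
    "orders": (100.0, 300.0),  # ratio 3: three healthy readers
    "audit":  (10.0, 300.0),   # ratio 30: something is re-reading
}
print(runaway_egress(topics))  # [('audit', 30.0)]
```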

How to deal with it

Pick your starting point based on where the waste is concentrated.

Stop the bleeding: better defaults

Low-coordination work that pays off over time. It's better to grant exceptions than to live with wrong defaults you can't roll back.

Set sensible low partition defaults (3) and short retention (1 day). Increase only when necessary.
Enforce client-side compression. (Conduktor Gateway)
Require ownership metadata at topic creation. (Conduktor)

This won't reduce your bill right away, but it will prevent it from getting worse.

Trim the fat: optimize what's running

Tune retention where it's drifted, analyze consumer patterns.
Retire topics with no active producers or consumers.
Right-size partition counts (this is the hard one, since it means recreating topics and coordinating with every producer and consumer).
Consolidate Kafka clusters, introduce multi-tenancy (Conduktor)

This work moves the infrastructure bill: we've seen reductions of $500k from this alone.

Now, keep it clean, be disciplined

After a cleanup, the same drift starts again.

To stay on course, you need full visibility into what your Kafka ecosystem contains and what it costs (chargeback is powerful for this), clear ownership so every topic and cluster has a team accountable for it, and a regular review cadence to catch drift before it becomes permanent. Not heavyweight governance. Just enough discipline that the cleanup doesn't have to be repeated every year.

Where to start

The diagnostic question is simple: which of these patterns are present in your environment, and what are they costing you?

The original deep-dive goes further into the four layers of Kafka cost (infrastructure, ecosystem tooling, vendor/licensing, and operational) and includes a framework for sequencing the work.

If you want to look at your own estate, Conduktor's field team does a free cost analysis where they walk through your environment with you and give you concrete numbers.

I Rebuilt GPT-Researcher in Claude Code. Here's What I Learned.

Stéphane Derosiaux — Wed, 31 Dec 2025 13:25:39 +0000

The best orchestrator might be the one you don't have to build. Claude Code is already an orchestrator. Stop building infrastructure around it.

I spent a weekend studying GPT-Researcher, an open-source project with 24,000+ GitHub stars. It builds an autonomous research agent that generates comprehensive reports with citations. The architecture is elegant: multiple specialized agents coordinate through LangGraph, parallel execution speeds up research, and quality gates ensure reliable output.

It uses LLM calls to decide which agent to run. It uses LLM calls to generate sub-queries. It uses LLM calls to select tools. It uses LLM calls to coordinate parallel work.

We're wrapping LLMs in infrastructure to teach them orchestration... when they can already orchestrate.

The orchestration overhead

Consider a typical agent workflow:

LLM API call to analyze the task
LLM API call to plan the approach
LLM API call to select tools
LLM API call to execute (finally, the actual work)
LLM API call to verify results

Four out of five calls are orchestration overhead. The LLM produces structured output (JSON) that code interprets to decide... what to ask the LLM next and which context to pass it.

This pattern is everywhere. LangChain and LangGraph provide graph-based workflows. AutoGen from Microsoft enables multi-agent conversations. CrewAI offers role-based agent coordination, used by Oracle, PwC, and NVIDIA. Each framework solves real problems: managing state, coordinating agents, handling failures.

But they all share the same assumption: the LLM needs infrastructure to orchestrate.

What GPT-Researcher does well

Credit where it's due: GPT-Researcher is excellent. According to its maintainers, it outperforms Perplexity, OpenAI's research tools, and other systems in benchmarks on citation quality, report quality, and information coverage.

The architecture is sophisticated:

Multi-agent roles: Chief Editor orchestrates the process. Researchers investigate subtopics. Editors plan structure. Reviewers validate quality. Revisers incorporate feedback. Writers compile reports.
Parallel execution: Research happens concurrently across subtopics. Multiple retrievers (Tavily, Google, Bing) run in parallel. Web scraping is asynchronous.
Quality gates: Review cycles catch errors. Revision loops improve output.

The architecture tax

Here's what the orchestration layer requires:

# agent_creator.py - LLM decides which agent to use
response = await llm.call(
    "Analyze this query and return JSON with agent type..."
)
agent_type = parse_json(response)  # error handling, retries

# query_processing.py - LLM generates sub-queries
response = await llm.call(
    "Generate search queries for this task..."
)
queries = parse_list(response)  # more parsing, more error handling

# tool_selector.py - LLM selects MCP tools
response = await llm.call(
    "Select relevant tools from this list..."
)
tools = parse_tool_selection(response)  # yet more parsing

Each step requires prompt engineering, output parsing, error handling, and retry logic. The orchestration layer is substantial.

Claude Code's native capabilities

Here's what Claude Code provides out of the box:

Claude Code doesn't need an LLM call to decide what tools to use. It IS the LLM. It reasons about the task and uses tools directly, in the same context, without round-trips.

When you ask Claude Code to research a topic:

It analyzes the query (no separate LLM call)
It generates sub-queries (no separate LLM call)
It executes parallel searches (native Task agents)
It synthesizes results (no separate LLM call)
It writes the report (native Write tool)

What required infrastructure now requires prompts.

The rewrite

I created Claude Researcher to test this. It's not a Python package. It's four commands and one skill file.

Commands:

The /research-team command implements the full multi-agent pattern:

Claude Code (Chief Editor)
      │
      ├── [PARALLEL] Research Agent 1 → findings
      ├── [PARALLEL] Research Agent 2 → findings
      └── [PARALLEL] Research Agent 3 → findings
              ↓
         Draft Report
              ↓
         Reviewer Agent → feedback
              ↓
         Reviser Agent → improved draft
              ↓ (repeat until quality gate passes)
         Final Report

Quality gates are defined in the command file. Review cycles repeat until scores meet thresholds. The multi-agent patterns from GPT-Researcher, expressed as instructions rather than code.

The difference in sub-query generation:

GPT-Researcher:

prompt = f"""Write {max_iterations} google search queries...
You must respond with a list of strings in the following format: [{example}].
The response should contain ONLY the list."""

response = await llm.call(prompt)
queries = json.loads(response)  # parsing, error handling, retries

Claude Researcher:

Generate 5 search queries to research: "{query}"
- Each query should explore a different angle
- Include queries for recent information when relevant

Claude Code executes the queries directly. No parsing layer. No error handling for malformed output. The LLM produces the queries and uses them in the same context.
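The parsing layer that disappears is typically something like the following. This is a hedged sketch, not GPT-Researcher's implementation: `extract_queries` is a hypothetical helper that tolerates a model wrapping its list in extra prose.

```python
import json


def extract_queries(raw):
    """Pull a JSON list of strings out of a reply that may contain extra text."""
    start, end = raw.find("["), raw.rfind("]")
    if start == -1 or end < start:
        raise ValueError("no JSON list found in reply")
    # Raises json.JSONDecodeError (a ValueError) if the slice is not valid JSON.
    queries = json.loads(raw[start:end + 1])
    if not all(isinstance(q, str) for q in queries):
        raise ValueError("every query must be a string")
    return queries
```

For example, `extract_queries('Sure!\n["a", "b"]\nHope that helps.')` returns `["a", "b"]`, while a reply with no list raises `ValueError` and sends the caller into its retry path.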

When orchestration frameworks make sense

This isn't a claim that frameworks are useless. They solve real problems:

Production systems with strict SLAs. When you need guaranteed response formats, retry logic, circuit breakers, and observability, frameworks provide battle-tested infrastructure. Claude Code is conversational, not transactional.
Non-Claude environments. If you're building on GPT-4, Gemini, or open-source models, Claude Code isn't available. Frameworks provide the coordination layer those environments lack.
Complex state machines. Research is relatively linear: gather, synthesize, write. Workflows with branching logic, human-in-the-loop steps, or long-running state benefit from explicit orchestration.
Team standardization. Frameworks enforce patterns. When multiple developers build agents, shared infrastructure ensures consistency. Markdown commands are flexible but less structured.
Audit requirements. Enterprise deployments often need detailed logs of every decision. Frameworks with explicit orchestration make this easier than conversational interfaces.

The question isn't "frameworks vs no frameworks". It's "do you need the framework for THIS task?"

Try it yourself

Clone the repo:

git clone https://github.com/sderosiaux/claude-researcher

Copy to your Claude Code config (run from inside the cloned claude-researcher directory):

cp commands/_*.md ~/.claude/commands/
cp skills/researcher.md ~/.claude/skills/

Run a research task:

/research "Impact of AI agents on software development" --depth=deep

/research-team "Comparison of vector databases" --quality=high

/lookup "What is the Claude Opus 4.5 context window?"

No pip install. No API keys beyond what Claude Code already uses. No configuration files.

Claude Code IS the Orchestrator, a really good one

There's a pattern in software engineering: we build abstractions to solve problems, then build abstractions to manage our abstractions. Each layer adds capability but also complexity, configuration, and cognitive load.

What if the base layer already does what I need?

Claude Code is an LLM with native tool access, parallel execution, and context management. GPT-Researcher is infrastructure that makes LLMs do those things. For research tasks, the native capabilities are sufficient.

The best orchestrator might be the one you don't have to build.