It's 2:47 AM. A fraud detection agent wakes up, polls the transactions REST endpoint, sees nothing unusual, and goes back to sleep for 5 seconds. At 2:47:01, a card is swiped in Berlin. At 2:47:03, a contactless tap lands in London. At 2:47:05, a high-value online purchase clears from a residential proxy in Singapore. The agent's next poll fires at 2:47:06. By then the pattern is already three transactions deep, the money is gone, and the agent sees only the final state: "account balance lower than expected." The fraud chain happened in the gaps between polls.
This is the failure mode that finally made me stop defending request/response as the default integration style for AI agents. It happened the same week Kai Waehner published three back-to-back pieces on agentic AI integration, Apache Flink CDC 3.6.0 shipped with sub-second binlog capture, and DBConvert Streams 2.0 removed Kafka from the CDC path entirely. The assumption that held from 2015 to 2025, that change data capture requires a broker, is quietly dying. And when it dies, the architecture under AI agents inverts.
The 5-Minute Skim
What changed this week: Direct CDC shipped in multiple products. Flink CDC 3.6.0 reads MySQL binlog and PostgreSQL WAL directly with sub-second latency and YAML-declarative pipelines. DBConvert Streams 2.0 ships PostgreSQL WAL CDC with zero Kafka in the path. Kai Waehner's trinity piece frames event-driven integration as the connective tissue between process intelligence and agentic AI.
Default recommendation: If you're building an agent that makes more than 5 decisions per second against mutable data, default to a streaming substrate (materialized views + CDC), not REST polling. Use REST for drill-down enrichment, not for primary state.
Where it breaks: Multi-consumer federations with 10+ downstream systems, long-retention event archives, cross-org event sharing — Kafka still wins. Direct CDC is a single-pipeline optimization.
Key trade-off: You're trading Kafka's pluggability and retention for one less hop and one less operational surface. For agent-centric, latency-critical, budget-constrained systems, that's the right trade. For enterprise event backbones, it isn't.
Why this week?
Three signals collided. First, Kai Waehner's "Trinity of Modern Data Architecture" (April 1) argues that agentic AI without event-driven integration is just a chatbot with API access — it can't perceive the world continuously. Second, his "MCP vs REST vs Kafka" piece (April 10) reframes the integration debate: these aren't alternatives, they're layers. Third, his CEP piece (April 14) draws the line between pattern matching (Flink) and inference (agents) — and it turns out most people are using the wrong tool on both sides of that line.
Underneath all three, the plumbing got better. Flink CDC 3.6.0 landed March 30. DBConvert 2.0 landed in April. The "Streaming SQL in 2026" Medium piece declared RisingWave and Materialize production-ready for the materialized-view-as-agent-context pattern. The week you could defend "Kafka in the middle of every pipeline" as the default architecture ended somewhere between these releases.
Why does request/response fail for agents?
Three reasons, each with specifics.
Staleness between polls. A REST endpoint returns a snapshot. If your agent polls every 5 seconds, every decision is made against state that is, on average, 2.5 seconds old. For a chatbot recommending a restaurant, that's fine. For a fraud agent watching a card-present sequence, it's the difference between blocking a transaction and refunding one. The fraud chain above happens entirely inside a single poll interval.
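A toy model makes the gap concrete. Everything below is illustrative (timestamps, city names, the poll interval), not real API behavior:

```python
# Toy model of the gap: three events land inside one poll interval.
# A poller sees only the final snapshot; a subscriber sees the ordered sequence.
poll_interval = 5.0
events = [  # (seconds after the last poll, city) -- illustrative timestamps
    (1.0, "Berlin"),
    (3.0, "London"),
    (5.0, "Singapore"),
]

# Average staleness of any pull-based decision: half the poll interval.
avg_staleness = poll_interval / 2  # 2.5 seconds

# Pull: the snapshot at the next poll collapses the sequence to final state.
snapshot = {"last_city": events[-1][1], "events_visible": 1}

# Push: the subscriber observes each event in order and can match the chain.
chain = [city for _, city in events]
chain_detected = len(set(chain)) >= 3

print(avg_staleness, chain_detected, snapshot["last_city"])
```

The point is structural: the snapshot destroys exactly the information (ordering, cardinality, timing) that the fraud pattern is made of.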
Poll load scales with agents, not with events. If 100 agents each poll every 5 seconds, you generate 20 requests per second against your transactions service — whether or not anything is happening. Most of those requests return "nothing new." This is the worst of both worlds: load when idle, and still latency when busy. Event-driven flips it: zero load when idle, immediate wake-up when an event arrives.
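The arithmetic, sketched with illustrative numbers (agent count, poll interval, and event rate are assumptions, not measurements):

```python
# Load comparison: polling scales with agent count, events scale with activity.
agents = 100
poll_interval_s = 5.0
events_per_second = 0.2  # illustrative: one real event every 5 seconds

polling_rps = agents / poll_interval_s  # requests hit the service regardless
event_driven_rps = events_per_second    # delivery scales with actual events

# Fraction of polls that return "nothing new": at most one poll per interval
# per real event carries information; the rest are wasted.
wasted_fraction = 1 - (events_per_second * poll_interval_s) / agents

print(polling_rps, event_driven_rps, wasted_fraction)
```

At these numbers, 99% of the poll traffic carries no information, and the remaining 1% is still stale on arrival.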
No event history, no pattern detection. A poll gives you the current state. It does not give you the sequence that led to the state. Agents that reason about behavior — fraud chains, user intent, supply chain disruption — need the ordered event stream, not the final snapshot. Request/response discards the sequence by construction.
Kai Waehner's argument in the MCP piece is that these aren't opinions; they're structural properties of the integration style. You can work around them (longer-lived websockets, SSE, webhooks), but at that point you've built a worse Kafka.
Visual architecture: what does the new stack look like?
The pre-2026 stack had three hops between the database and the agent. The 2026 stack has two.
The database is the source of truth. A direct CDC reader tails the write-ahead log. The streaming layer either maintains a materialized view (for query-style access) or runs a CEP pattern (for sequence detection). The agent subscribes to view updates or pattern hits, then uses MCP-exposed tools for drill-down. Kafka is optional, not required.
Kafka vs REST vs MCP: what's the hierarchy?
Here's the frame that clicked for me this week. These three are not competitors. They're layers in a stack, each solving a different problem.
MCP is the tool discovery layer. It tells an agent what it can do — what APIs exist, what schemas they take, what side effects they cause. MCP is static metadata plus an invocation protocol. It does not solve "when should I act."
Kafka (or any event log) is the event sourcing layer. It tells an agent what happened, in order, with replay. This is where continuous perception lives. Without an event log — or a direct-CDC equivalent — an agent is blind between invocations.
CEP / Flink is the pattern match layer. It tells an agent when something interesting just happened — a known sequence, a windowed aggregation, a join across streams. CEP is declarative, deterministic, and fast. It's the scalpel between the firehose and the LLM.
REST is the drill-down layer. It answers agent questions like "what are the last 30 days of charges for this specific account?" once the agent has decided it needs to look. REST is pull-based and stateless, which is exactly what drill-down needs.
The mistake is treating them as alternatives. REST-only agents are blind. Kafka-only agents have no pattern detection. CEP-only pipelines can't reason about ambiguous cases. MCP-only stacks have no perception loop. The production pattern is all four, layered: MCP exposes tools, Kafka (or direct CDC) delivers events, CEP filters for known patterns, the agent handles the ambiguous cases, and REST handles drill-down.
What is the CDC simplification revolution?
Here are the numbers that moved this week.
Traditional Debezium path: database → Debezium source connector (running on Kafka Connect) → Kafka topic → sink connector → downstream processor. Three network hops, three operational surfaces, typical end-to-end latency 100-500ms under load, with tail latencies into seconds during rebalances.
Direct CDC path: database → WAL/binlog reader → processor. One network hop, one operational surface, sub-second end-to-end (often under 200ms), no rebalance tail.
The vendors shipping this pattern now:
- RisingWave — PostgreSQL-wire-compatible streaming database. Connects directly to Postgres logical replication or MySQL binlog, maintains materialized views, serves SQL queries. No Kafka required for single-pipeline workloads.
- DBConvert Streams 2.0 (April 2026) — PostgreSQL WAL CDC with direct sinks. Explicit positioning as "Kafka-free CDC."
- Flink CDC 3.6.0 (March 30, 2026) — sub-second binlog capture, YAML pipeline definitions, direct sinks to Paimon, Iceberg, Doris, StarRocks.
- Materialize — incremental view maintenance over Postgres CDC.
The architecture changed from three-hop (DB → Debezium → Kafka → processor) to two-hop (DB → CDC reader → processor). You lose Kafka's multi-consumer fan-out. You gain a simpler operational story and a latency budget that fits agent decision loops.
When does this matter? When the agent's decision latency is dominated by the integration path, not the inference. If your LLM call takes 800ms, shaving 300ms off CDC doesn't move the needle. If your agent uses a small local model and the bottleneck is how fresh the state is, cutting a 300ms broker hop can halve the decision loop.
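A quick way to sanity-check this for your own system: compute the integration path's share of the end-to-end decision loop. The numbers below are illustrative:

```python
# Share of the decision loop spent in the integration path.
# If this share is small, cutting broker hops is premature optimization.
def integration_share(integration_ms: float, inference_ms: float) -> float:
    return integration_ms / (integration_ms + inference_ms)

# LLM-bound agent: 800 ms inference, 400 ms integration path.
llm_bound = integration_share(400, 800)       # ~0.33: hops barely matter

# Small-local-model agent: 50 ms inference, 400 ms integration path.
freshness_bound = integration_share(400, 50)  # ~0.89: the hop IS the latency

print(round(llm_bound, 2), round(freshness_bound, 2))
```

Rule of thumb: optimize whichever term dominates the sum, and measure before ripping anything out.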
When does CEP win and when does it fail?
Complex Event Processing is the layer most teams skip and then regret. Kai's CEP piece this week draws clean lines.
CEP wins for known sequences. Fraud chains like the Berlin-London-Singapore one above are textbook CEP: three events, temporal ordering, geographic constraint, cardinality threshold. Flink's MATCH_RECOGNIZE clause expresses this in ten lines of SQL and executes in milliseconds. Asking an LLM to watch a stream for this pattern is a waste of tokens and a latency disaster.
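Here is a minimal Python sketch of that sequence match. In production this belongs in Flink SQL's MATCH_RECOGNIZE; the function name, window, and thresholds below are illustrative, and the nested loop does not reflect Flink's incremental execution model:

```python
from datetime import datetime, timedelta

def match_fraud_chain(txns, window=timedelta(seconds=10), min_cities=3):
    """Match events in >= min_cities distinct cities inside a time window.

    Illustrative sketch of what a MATCH_RECOGNIZE pattern expresses,
    not how a streaming engine executes it.
    """
    txns = sorted(txns, key=lambda t: t["ts"])
    for i in range(len(txns)):
        cities, start = set(), txns[i]["ts"]
        for t in txns[i:]:
            if t["ts"] - start > window:
                break
            cities.add(t["city"])
            if len(cities) >= min_cities:
                return True
    return False

base = datetime(2026, 4, 14, 2, 47, 1)
chain = [
    {"ts": base, "city": "Berlin"},
    {"ts": base + timedelta(seconds=2), "city": "London"},
    {"ts": base + timedelta(seconds=4), "city": "Singapore"},
]
print(match_fraud_chain(chain))  # True: 3 cities within the window
```

The deterministic version is auditable: you can show a regulator exactly which rule fired and why, which no LLM transcript gives you.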
CEP wins for predictive maintenance. "Temperature over 80°C for 3 consecutive readings, followed by vibration spike within 60 seconds" — a Flink pattern, not a prompt. Deterministic, auditable, and cheap.
CEP wins for supply chain and e-commerce behavior. "Cart abandonment after coupon view without checkout within 10 minutes" — pattern match territory.
CEP fails for undefined patterns. If you can't write the pattern in SQL, CEP can't match it. Novel fraud modes, emergent user behaviors, anything that requires "this feels off" judgment — that's agent territory.
CEP fails for simple windowed aggregations. If all you need is "count per minute per user," use a streaming SQL TUMBLE window. CEP is overkill.
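For contrast, here is what a TUMBLE window computes, mimicked in a few lines of Python (window size and event shape are illustrative; a streaming engine does this incrementally over unbounded input):

```python
from collections import defaultdict

def tumble_count(events, window_s=60):
    """Count events per (window_start, user) -- the TUMBLE-window equivalent.

    events: iterable of (epoch_seconds, user_id). Illustrative sketch only.
    """
    counts = defaultdict(int)
    for ts, user in events:
        window_start = int(ts // window_s) * window_s
        counts[(window_start, user)] += 1
    return dict(counts)

events = [(0, "a"), (30, "a"), (61, "a"), (62, "b")]
print(tumble_count(events))
# {(0, 'a'): 2, (60, 'a'): 1, (60, 'b'): 1}
```

If your "pattern" reduces to a group-by over a time bucket like this, reaching for CEP state machines only adds operational weight.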
CEP fails for multi-day, high-cardinality lookback. CEP holds state per pattern match attempt. Trying to match "any anomaly across 100M users over 30 days" blows up memory. Use a feature store and batch scoring instead.
The pattern that works in production: CEP for known patterns at millisecond latency, agent inference for the ambiguous residual. The CEP layer handles 95% of cases cheaply; the agent handles the 5% that needs reasoning.
Trade-offs: Kafka vs direct CDC, streaming vs polling, CEP vs agent
This is the debate, not a table.
Kafka still wins when you have multi-consumer federations. If ten downstream systems each need the order events — analytics, fraud, CRM, warehouse sync, audit, search indexing, ML features, notifications, billing, reporting — Kafka's fan-out is the right answer. Direct CDC means each consumer opens its own replication slot against the database, which Postgres will not love. Kafka also wins when you need long retention (weeks or months of replayable history), when you need cross-system event archives for compliance, and when your ops team already runs it well. Do not rip out Kafka to save one hop if Kafka is doing five other jobs.
Direct CDC wins when you have a single-pipeline agent-centric architecture. Greenfield project, one primary database, one or two consumers, sub-second latency critical, budget-constrained. The operational surface drops from "Kafka cluster + Connect workers + schema registry + Debezium" to "a reader process." The latency drops by 100-300ms. The monthly bill drops by a meaningful chunk.
Request/response wins for low-frequency, drill-down access. An agent that needs "give me the full profile for user 12345" uses REST via MCP. That's the right tool. Streaming is overkill when the access pattern is ad-hoc and infrequent.
Streaming wins above the 5-decisions-per-second threshold. This is the rough break-even I've seen in practice. Below that, REST polling's overhead is tolerable. Above it, the poll load and staleness start dominating the architecture. At 50 decisions per second, streaming is not optional.
CEP wins when the pattern is known, the latency budget is tight, and the cardinality is high. Fraud rules, SLA breaches, threshold-and-sequence alerts. Declarative, auditable, fast.
Agent inference wins when the pattern is undefined, the reasoning is multi-step, or the flexibility matters more than latency. Novel fraud, customer intent, incident triage. Slower (hundreds of ms to seconds), more expensive per decision, but handles cases CEP can't express.
The production architecture layers both: CEP filters the stream for known patterns, the agent handles the residual.
What are the implementation patterns and anti-patterns?
Pattern: materialized view as agent context. The agent doesn't query the operational database directly. It queries a materialized view in a Postgres-wire-compatible streaming database (RisingWave, Materialize). The view is kept fresh by direct CDC. The agent gets point-in-time consistency and sub-second freshness without loading the primary.
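An in-memory toy of the pattern, assuming CDC-style row events (the class name and event shape are made up for illustration; RisingWave and Materialize maintain views incrementally and at scale, this toy just applies changes as they arrive):

```python
# Toy incremental view: a change feed keeps the view fresh,
# and the agent reads the view instead of the primary database.
class BalanceView:
    def __init__(self):
        self._balances = {}  # account_id -> current balance

    def apply(self, change):
        # change mirrors a CDC row event: {"op": ..., "account": ..., "amount": ...}
        if change["op"] in ("insert", "update"):
            self._balances[change["account"]] = change["amount"]
        elif change["op"] == "delete":
            self._balances.pop(change["account"], None)

    def get(self, account):
        # The agent's read path: fresh state, zero load on the primary.
        return self._balances.get(account)

view = BalanceView()
for change in [
    {"op": "insert", "account": "acct-1", "amount": 500},
    {"op": "update", "account": "acct-1", "amount": 120},
]:
    view.apply(change)

print(view.get("acct-1"))  # 120: the agent sees the latest state
```

The real systems add the hard parts this toy skips: exactly-once application, consistency across joins, and recovery from a WAL position.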
Pattern: CEP filter, agent decider. The Flink job runs the known patterns and emits "suspicious event" signals. The agent subscribes to the suspicious-event topic (or materialized view of suspicious events) and does the deeper reasoning. Cheap filtering, expensive reasoning only where needed.
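A minimal sketch of the two stages, with made-up rule thresholds and a stand-in for the LLM call:

```python
# Two-stage sketch: a cheap deterministic filter handles the known pattern,
# and only the residual reaches the (expensive) agent. All names illustrative.
def cep_filter(event):
    # Stage 1: known rule -- high-value events from flagged regions.
    return event["amount"] > 1000 and event["region"] in {"proxy", "unknown"}

def agent_decide(event):
    # Stage 2 stand-in for an LLM call: invoked only on suspicious events.
    return "escalate"

stream = [
    {"id": 1, "amount": 50, "region": "home"},
    {"id": 2, "amount": 4000, "region": "proxy"},
]
decisions = {e["id"]: agent_decide(e) for e in stream if cep_filter(e)}
print(decisions)  # the agent saw 1 of 2 events
```

The cost model is the design: per-event filter cost is microseconds, per-event agent cost is hundreds of milliseconds and real money, so the filter's selectivity directly sets your bill.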
Pattern: agent feedback loop. The agent's decisions (blocked, approved, escalated) become events themselves, fed back into the stream. Over time, the streaming layer can learn which patterns the agent blocks versus approves, and promote high-confidence patterns back into CEP rules. This is how you migrate decisions from "expensive LLM call" to "cheap pattern match" as you learn.
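A toy version of the promotion logic, assuming each agent decision is logged as an event (pattern keys and thresholds are illustrative):

```python
from collections import Counter

# Rule promotion sketch: when the agent blocks a pattern consistently enough,
# promote that pattern from an LLM decision to a cheap CEP rule.
decision_events = [
    {"pattern": "3-city-chain", "decision": "block"},
    {"pattern": "3-city-chain", "decision": "block"},
    {"pattern": "3-city-chain", "decision": "block"},
    {"pattern": "big-ticket", "decision": "approve"},
]

def promotable(events, min_blocks=3, min_block_rate=0.9):
    blocks = Counter(e["pattern"] for e in events if e["decision"] == "block")
    totals = Counter(e["pattern"] for e in events)
    return [p for p, n in blocks.items()
            if n >= min_blocks and n / totals[p] >= min_block_rate]

print(promotable(decision_events))  # this pattern moves from LLM to CEP
```

In practice you'd also want a human sign-off gate before a promoted rule starts blocking autonomously, but the mechanism is this simple: decisions as events, counters over the stream, threshold-triggered promotion.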
Anti-pattern: polling for agent context. If you find yourself tuning poll intervals to balance staleness against load, you're solving the wrong problem. Switch substrates.
Anti-pattern: LLM as pattern matcher. Asking GPT-class models to watch a Kafka topic for "sequences of three transactions in different cities" is burning tokens to do what MATCH_RECOGNIZE does in microseconds. Save the LLM for ambiguity.
Anti-pattern: Kafka because Kafka. If you have one producer and one consumer and sub-second requirements, a direct CDC pipeline is simpler and faster. Don't add a broker out of habit.
Anti-pattern: direct CDC at enterprise scale without planning replication slots. Postgres caps concurrent replication slots (the max_replication_slots setting), and every active slot holds back WAL cleanup until its consumer catches up. If twelve teams each want their own slot, you need a fan-out layer — which is exactly what Kafka is for. Know your scale before you rip out the broker.
Actionable takeaways
- Audit your agents' integration style this week. Count how many poll REST on a timer. For each, ask: would this agent detect a multi-step sequence that spans the poll interval? If no, flag it for streaming migration.
- Pilot direct CDC on one greenfield pipeline. Pick the lowest-risk new agent workload, put RisingWave or Flink CDC 3.6 in the path, skip Kafka. Measure end-to-end latency and compare to your Debezium baseline.
- Map your integration stack to the MCP/Kafka/CEP/REST layering. If any layer is missing or doubled-up, that's technical debt. Most teams are missing the CEP layer and double-using REST.
- Write three CEP patterns before your next agent project. Fraud sequence, SLA breach, user behavior funnel. If you can express them in Flink SQL, CEP handles them. Everything that doesn't fit becomes agent scope.
- Build the feedback loop. Every agent decision should be an event on the stream. Without this, you can't migrate decisions from LLM to CEP as confidence grows, and your agent costs don't come down.