Pravin Khandke

Posted on May 25 • Edited on Jun 2

Messaging in the Age of AI

#ai #eventdriven #kafka #agents

Messaging infrastructure has been boring for a decade. Queues, topics, exchanges: the primitives settled. Then AI agents arrived, and suddenly the assumptions that made messaging boring stopped holding. Messages are no longer just data. They are context. An agent will read your message, reason over it, call tools because of it, and generate responses whose token count you cannot predict at enqueue time. The transport layer that worked fine for deterministic services needs to be rethought: not replaced, but adapted.

This article is not about which message broker to pick. It is about what changes when the producer and consumer are both potentially non-deterministic reasoning systems, and what patterns actually hold up in production. The examples use Spring Boot and Apache Kafka because that is a stack I have seen work at scale, but the patterns apply across stacks.

1. Why AI Changes Messaging

Traditional messaging carries structured, bounded payloads. An order-placed event has a known shape: order ID, customer ID, line items, total. A payment-confirmed event carries a transaction reference. These messages are small (hundreds of bytes), predictable in volume, and idempotent by design: reprocess the same order event, get the same result.

AI-originated messages break all three assumptions. A single agent-to-agent message can carry a 100K-token context window, effectively a small novel's worth of reasoning state. Volume is bursty in ways that do not correlate with user activity: a multi-agent consensus round can generate 50 internal messages for a single user request. And idempotency is no longer free, because the same logical input can produce different reasoning paths on each retry.

Messaging for AI systems shifts from "deliver this payload reliably" to "manage reasoning context at scale." Reliability still matters (it matters more), but it is joined by concerns that traditional messaging never had to address: token budgets, model latency variance, and reasoning trace integrity.

Drawn as a diagram, each arrow in the traditional model is a bounded, schema-validated message. In the AI model, the arrow from Planner to Executor carries an entire reasoning state, and that arrow has a dollar cost measured in tokens. The messaging layer needs to know that.

2. New Workloads Created by Agents

Agents generate traffic patterns that look nothing like what your messaging infrastructure was designed for. It is worth cataloguing the new workloads explicitly, because each one stresses a different part of the system.

Planning outputs. Before an agent acts, it thinks, and the thinking produces structured output. A planner agent emits a plan object (goal, sub-goals, constraints, assigned agents) that downstream agents consume. These messages are medium-sized (2-8K tokens) and are the highest-leverage messages in the system: get the plan wrong, and everything downstream wastes tokens.

Tool-call results. When an agent invokes a tool (a database query, an API call, a code execution), the result enters the messaging fabric as a first-class message. These are unpredictable in size (a SQL query can return one row or a million) and must be chunked, summarized, or rejected before they blow out a context window.
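The chunk-or-reject decision for tool results can be sketched as a small gate in front of the producer. This is a minimal illustration, not code from the companion project: the class name ToolResultGate, the two limits, and the four-characters-per-token estimate are all assumptions, and a real system would use the model's own tokenizer.

```java
/** Hypothetical size gate for tool-call results; names and limits are illustrative. */
final class ToolResultGate {
    enum Action { PASS, CHUNK, REJECT }

    // Crude stand-in for a real tokenizer: roughly four characters per token.
    static int estimateTokens(String text) {
        return (text.length() + 3) / 4;
    }

    /** PASS small results, CHUNK medium ones, REJECT anything above the hard cap. */
    static Action decide(String toolResult, int passLimitTokens, int rejectLimitTokens) {
        int tokens = estimateTokens(toolResult);
        if (tokens <= passLimitTokens) return Action.PASS;
        if (tokens <= rejectLimitTokens) return Action.CHUNK;
        return Action.REJECT;
    }

    public static void main(String[] args) {
        String oneRow = "id=42,status=SHIPPED";
        String manyRows = "id=42,status=SHIPPED\n".repeat(10_000);
        System.out.println(decide(oneRow, 2_000, 50_000));   // a single-row result
        System.out.println(decide(manyRows, 2_000, 50_000)); // a ten-thousand-row result
    }
}
```

A summarization step would slot in as a fourth action; it is left out here because it needs a model call.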

Chain-of-thought traces. Some architectures persist the agent's reasoning trace as it streams, not just for debugging, but as context shared with other agents. A reasoning trace is verbose by design. Storing and forwarding it as a message requires treating it as a structured artifact, not a log line.

Multi-agent broadcast and consensus. Agents often need to reach agreement — which plan to execute, whether a tool call result is valid, whether a response meets policy. These consensus rounds generate fan-out message bursts: one agent publishes a proposal, N agents respond with votes or critiques. The messaging layer sees N+1 messages where a traditional system would see one.

In practice, this means your messaging system needs to handle message sizes spanning five orders of magnitude (bytes to megabytes), traffic bursts that do not follow any daily or weekly pattern, and consumers that may take seconds or minutes to process a single message — and retry it aggressively if they are unsure of the result.

3. Messaging Architecture Patterns That Actually Work

After observing agent systems in production across several teams, I have seen a set of patterns crystallize. These are not speculative. They are what teams end up building after the first production incident.

Pattern 1: The Message Envelope

Every message in an AI system must carry metadata beyond a correlation ID. The envelope should include the token count of the payload, the model that generated it, the trace ID, the sender type (human, agent, tool), and an idempotency key if the sender is an agent. The consumer uses this metadata to make routing, quota, and deduplication decisions without parsing the payload body.

The companion project implements this as a Java record — see code/src/main/java/com/messaging/relay/model/MessageEnvelope.java:

public record MessageEnvelope<T>(
    String messageId,
    String traceId,
    String parentMessageId,
    SenderType senderType,
    T payload,
    int tokenCount,          // pre-enqueue estimate
    String modelId,
    Instant timestamp,
    String idempotencyKey,   // required for agent traffic
    Map<String, String> metadata
) { }

Pattern 2: Separate Traffic Lanes

Human-to-agent, agent-to-agent, and agent-to-tool traffic have different latency tolerances, token profiles, and failure modes. Placing them on separate Kafka topics lets you apply different retention policies, compaction strategies, and consumer group scaling independently. An observability agent can consume from all three topics without competing with operational consumers.
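As a rough sketch of lane separation, the mapping from sender type to topic can be a single exhaustive switch. Only the topic name agent.messages appears in this article; human.messages and tool.messages are assumed names, and the Lane record is hypothetical. The retention values are the ones the companion project is described as using.

```java
import java.time.Duration;

/** Hypothetical lane topology; "human.messages" and "tool.messages" are assumed topic names. */
final class LaneTopology {
    enum SenderType { HUMAN, AGENT, TOOL }

    record Lane(String topic, Duration retention, boolean compacted) { }

    /** One lane per sender type; the switch is exhaustive, so a new type forces a decision. */
    static Lane laneFor(SenderType senderType) {
        return switch (senderType) {
            case HUMAN -> new Lane("human.messages", Duration.ofDays(7), false);
            case AGENT -> new Lane("agent.messages", Duration.ofDays(30), false);
            case TOOL  -> new Lane("tool.messages", Duration.ofDays(3), true);
        };
    }

    public static void main(String[] args) {
        for (SenderType type : SenderType.values()) {
            Lane lane = laneFor(type);
            // retention.ms is the Kafka topic-level setting this value would feed.
            System.out.println(type + " -> " + lane.topic()
                + " retention.ms=" + lane.retention().toMillis()
                + " compacted=" + lane.compacted());
        }
    }
}
```

Keeping the mapping in one place means the producer never chooses a topic by string concatenation, which is where lane leaks tend to start.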

Pattern 3: Idempotency Keys for Agent Traffic

Agents retry. It is inherent to their design: when a reasoning step produces low confidence, the agent re-executes. Without idempotency keys at the messaging layer, every retry becomes a new transaction, duplicating work and inflating costs. The pattern is straightforward: the producer sets a key derived from the logical operation (e.g., plan-{conversationId}-{stepNumber}), and the consumer deduplicates within a configurable window. Kafka's log compaction can assist here by eventually collapsing records that share a key, but application-layer dedup is more reliable for agent workloads. An agent retry is a new record produced by re-executed reasoning, which falls outside what Kafka's idempotent producer and exactly-once semantics cover.
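Consumer-side deduplication within a window can be sketched in a few lines. This is an in-memory, single-instance illustration under assumptions of my own: the class name IdempotencyWindow is hypothetical, the clock is passed in to keep the example deterministic, and a multi-instance consumer group would need a shared store instead of a HashMap.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical application-layer deduplicator keyed by idempotency key. */
final class IdempotencyWindow {
    private final Duration window;
    private final Map<String, Instant> firstSeen = new HashMap<>();

    IdempotencyWindow(Duration window) {
        this.window = window;
    }

    /** Returns true if the key is new within the window (process it), false for a duplicate. */
    synchronized boolean firstTime(String idempotencyKey, Instant now) {
        // Evict expired keys so memory stays bounded by the traffic of one window.
        // This full scan is fine for a sketch but too slow for a hot path.
        firstSeen.values().removeIf(seenAt -> !seenAt.plus(window).isAfter(now));
        return firstSeen.putIfAbsent(idempotencyKey, now) == null;
    }

    /** Key derived from the logical operation, following the plan-{conversationId}-{stepNumber} shape. */
    static String planKey(String conversationId, int stepNumber) {
        return "plan-" + conversationId + "-" + stepNumber;
    }

    public static void main(String[] args) {
        IdempotencyWindow dedup = new IdempotencyWindow(Duration.ofMinutes(10));
        Instant t0 = Instant.parse("2026-01-01T00:00:00Z");
        String key = planKey("conv-42", 3);
        System.out.println(dedup.firstTime(key, t0));                  // first attempt
        System.out.println(dedup.firstTime(key, t0.plusSeconds(30)));  // retry inside the window
        System.out.println(dedup.firstTime(key, t0.plusSeconds(900))); // after the window expired
    }
}
```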

Pattern 4: Chunked Context Delivery

Do not send a 100K-token context window as a single Kafka message. Break it into chunks — summary, relevant history, tool outputs, reasoning state — each with its own envelope metadata. The consumer can then decide which chunks to load into the model's context window based on relevance, recency, and token budget. This turns context assembly from a producer-side guess into a consumer-side decision.
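The consumer-side decision can be sketched as a greedy selection under a token budget. Everything here is an assumption for illustration: the Chunk record, the relevance score (which the consumer would have to compute somehow), and the choice of a greedy strategy over anything smarter.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical consumer-side context assembly; the relevance scoring is assumed to exist elsewhere. */
final class ContextAssembler {
    record Chunk(String kind, int tokenCount, double relevance) { }

    /** Greedy selection: walk chunks in descending relevance and keep each one that still fits. */
    static List<Chunk> select(List<Chunk> chunks, int tokenBudget) {
        List<Chunk> byRelevance = new ArrayList<>(chunks);
        byRelevance.sort(Comparator.comparingDouble(Chunk::relevance).reversed());
        List<Chunk> chosen = new ArrayList<>();
        int used = 0;
        for (Chunk chunk : byRelevance) {
            if (used + chunk.tokenCount() <= tokenBudget) {
                chosen.add(chunk);
                used += chunk.tokenCount();
            }
        }
        return chosen;
    }

    public static void main(String[] args) {
        List<Chunk> available = List.of(
            new Chunk("summary", 1_000, 0.9),
            new Chunk("history", 40_000, 0.5),
            new Chunk("tool-outputs", 30_000, 0.7),
            new Chunk("reasoning-state", 20_000, 0.6));
        // With a 60K budget the large, least relevant history chunk does not fit.
        for (Chunk chunk : select(available, 60_000)) {
            System.out.println(chunk.kind() + " (" + chunk.tokenCount() + " tokens)");
        }
    }
}
```

Because each chunk carries its own token count in the envelope, this selection runs without deserializing any payload.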

The companion project's ContextChunker (see code/src/main/java/com/messaging/relay/chunking/ContextChunker.java) splits content by a configurable maxChunkTokens threshold. The KafkaConfig (code/src/main/java/com/messaging/relay/config/KafkaConfig.java) defines the four-topic topology with per-lane retention policies — 7 days for human traffic, 30 days for agent traffic (audit trail), 3 days with compaction for tool calls, and 90 days for the dead letter topic.
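For readers without the companion code at hand, splitting by a maxChunkTokens threshold might look like the sketch below. It is not the companion project's ContextChunker: the class name ChunkerSketch is hypothetical, and it counts whitespace-separated words as tokens, which is only a rough proxy for a real tokenizer.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical chunker; counts whitespace-separated words as a rough proxy for tokens. */
final class ChunkerSketch {

    /** Splits content into chunks of at most maxChunkTokens words, preserving word order. */
    static List<String> split(String content, int maxChunkTokens) {
        if (maxChunkTokens <= 0) {
            throw new IllegalArgumentException("maxChunkTokens must be positive");
        }
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int tokens = 0;
        for (String word : content.trim().split("\\s+")) {
            if (word.isEmpty()) continue;          // empty or blank input yields no chunks
            if (tokens == maxChunkTokens) {        // current chunk is full, start a new one
                chunks.add(current.toString());
                current.setLength(0);
                tokens = 0;
            }
            if (tokens > 0) current.append(' ');
            current.append(word);
            tokens++;
        }
        if (tokens > 0) chunks.add(current.toString());
        return chunks;
    }

    public static void main(String[] args) {
        System.out.println(split("plan step one then step two then done", 3));
    }
}
```

Each resulting chunk would then be wrapped in its own envelope, sharing the traceId and parentMessageId of the original message.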

4. Token Limits, Rate Limits, and Quota Management

Rate limiting by request count made sense when every request cost roughly the same. An AI system can receive two messages that are both "one request" — one costs $0.002 and the other costs $0.30. The remedy is token-aware rate limiting.

The mechanism is simple: before enqueuing a message to Kafka, count its tokens using the same tokenizer the model will use. Apply rate limits in tokens-per-minute, not requests-per-minute. Partition the quota, for example 70% reserved for human-originated traffic (which must be responsive) and 30% for agent-to-agent traffic (which can be delayed or degraded). When the quota for a partition is exhausted, apply backpressure: signal to the producer that it should slow down, batch, or degrade to a cheaper model.

The companion project implements this in TokenAwareRateLimiter (see code/src/main/java/com/messaging/relay/ratelimit/TokenAwareRateLimiter.java):

public RateLimitDecision check(String serializedPayload, MessageEnvelope<?> envelope) {
    int tokenCount = countTokens(serializedPayload);
    SenderType senderType = envelope.senderType();
    boolean allowed = quotaManager.tryConsume(senderType, tokenCount);
    if (allowed) return RateLimitDecision.allowed(tokenCount);
    return RateLimitDecision.denied(
        senderType.name() + " quota exhausted. " + backpressureHint(senderType), tokenCount);
}

The QuotaManager maintains per-lane sliding windows resetting each minute, with configurable limits — defaulting to 600K tokens/min for human traffic, 200K for agents, and 100K for tool calls.
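A per-lane token quota can be sketched as follows. This is not the companion project's QuotaManager: the class name LaneQuota is hypothetical, the time is passed in for determinism, and it uses a fixed one-minute window that clears on the minute boundary, which is simpler than a true sliding window and allows a short burst across the boundary.

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical per-lane token quota using a fixed one-minute window. */
final class LaneQuota {
    enum SenderType { HUMAN, AGENT, TOOL }

    private final Map<SenderType, Integer> limitPerMinute = new EnumMap<>(SenderType.class);
    private final Map<SenderType, Integer> usedThisMinute = new EnumMap<>(SenderType.class);
    private long currentMinute = -1;

    LaneQuota(int humanLimit, int agentLimit, int toolLimit) {
        limitPerMinute.put(SenderType.HUMAN, humanLimit);
        limitPerMinute.put(SenderType.AGENT, agentLimit);
        limitPerMinute.put(SenderType.TOOL, toolLimit);
    }

    /** Consumes tokens from the lane's budget, or returns false without consuming if it would overflow. */
    synchronized boolean tryConsume(SenderType lane, int tokens, long nowMillis) {
        long minute = nowMillis / 60_000;
        if (minute != currentMinute) {   // new minute: every lane starts from zero again
            usedThisMinute.clear();
            currentMinute = minute;
        }
        int used = usedThisMinute.getOrDefault(lane, 0);
        if (used + tokens > limitPerMinute.get(lane)) return false;
        usedThisMinute.put(lane, used + tokens);
        return true;
    }

    public static void main(String[] args) {
        // Limits mirror the defaults quoted above: 600K human, 200K agent, 100K tool.
        LaneQuota quota = new LaneQuota(600_000, 200_000, 100_000);
        System.out.println(quota.tryConsume(SenderType.AGENT, 150_000, 0));      // fits
        System.out.println(quota.tryConsume(SenderType.AGENT, 100_000, 1_000));  // would exceed 200K
        System.out.println(quota.tryConsume(SenderType.HUMAN, 100_000, 1_000));  // separate lane
        System.out.println(quota.tryConsume(SenderType.AGENT, 100_000, 60_000)); // next minute
    }
}
```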

Rate limiting in AI systems is not just about protecting infrastructure. It is about cost control. A runaway agent loop that retries 50 times before converging should not generate a surprise $15 charge (50 retries at roughly $0.30 each). The messaging layer is the correct place to enforce this, because it sits between the agent's impulse to retry and the model provider's metering endpoint.

5. Observability, Auditing, and Operational Safety

Observability for AI messaging is not an extension of APM. APM tells you whether a topic is backed up. AI messaging observability tells you whether the messages flowing through it are producing correct, safe, and cost-effective outcomes. Those are different questions that require different instrumentation.

What to Log per Message

Every message passing through the system should carry a structured log entry — not as an afterthought, but as a first-class part of the messaging pipeline. The minimum fields: traceId, senderType, tokenCount, modelId, latencyMs, retryCount, idempotencyKey, and blockedCheck (whether a safety guardrail intercepted the message). These fields let you reconstruct any interaction from raw logs — what was sent, by whom, at what cost, with what result.

The companion project's ObservabilityFilter (see code/src/main/java/com/messaging/relay/observability/ObservabilityFilter.java) logs a structured JSON event per consumed message:

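public void logConsumption(MessageEnvelope<?> envelope, String topic, long offset) {
    // LinkedHashMap keeps the field order stable in the emitted JSON.
    Map<String, Object> event = new LinkedHashMap<>();
    event.put("trace_id", envelope.traceId());
    event.put("sender_type", envelope.senderType().name());
    event.put("token_count", envelope.tokenCount());
    event.put("model_id", envelope.modelId());
    event.put("idempotency_key", envelope.idempotencyKey());
    event.put("topic", topic);
    event.put("offset", offset);
    try {
        obsLog.info(objectMapper.writeValueAsString(event));
    } catch (JsonProcessingException e) {
        // writeValueAsString throws a checked exception (com.fasterxml.jackson.core);
        // a logging failure should never fail the consumer.
        obsLog.warn("Could not serialize observability event", e);
    }
}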

A separate passesSafetyCheck method runs before consumer processing, blocking messages flagged in metadata. In production, extend this with PII detection and content policy evaluation.

Message Lineage

A single user request can spawn a tree of agent messages: planner to executor, executor to tool, tool result back to executor, executor to critic, critic back to planner. If you cannot trace that tree, you cannot debug it. The trace ID is the spine of lineage — but it is not enough. Each agent should also record parentMessageId so you can reconstruct the tree topology. In practice, this means the message envelope (Pattern 1) carries a parentMessageId field, and the observability consumer builds the tree from the event stream.
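Rebuilding the tree from parentMessageId is a small grouping exercise. The sketch below is illustrative only: the Event record and the label field are hypothetical stand-ins for whatever the observability consumer reads from the event stream, and it assumes message IDs are unique and the parent links contain no cycles.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical lineage reconstruction from (messageId, parentMessageId) pairs. */
final class LineageTree {
    record Event(String messageId, String parentMessageId, String label) { }

    private static final String ROOT = "<root>"; // sentinel key for events with no parent

    /** Groups events under their parent's ID, preserving arrival order within each group. */
    static Map<String, List<Event>> childrenByParent(List<Event> events) {
        Map<String, List<Event>> children = new LinkedHashMap<>();
        for (Event event : events) {
            String parent = event.parentMessageId() == null ? ROOT : event.parentMessageId();
            children.computeIfAbsent(parent, k -> new ArrayList<>()).add(event);
        }
        return children;
    }

    /** Renders the tree as indented text, one label per line. */
    static String render(List<Event> events) {
        StringBuilder out = new StringBuilder();
        appendSubtree(childrenByParent(events), ROOT, 0, out);
        return out.toString();
    }

    private static void appendSubtree(Map<String, List<Event>> children,
                                      String parentId, int depth, StringBuilder out) {
        for (Event event : children.getOrDefault(parentId, List.of())) {
            out.append("  ".repeat(depth)).append(event.label()).append('\n');
            appendSubtree(children, event.messageId(), depth + 1, out);
        }
    }

    public static void main(String[] args) {
        System.out.print(render(List.of(
            new Event("m1", null, "planner"),
            new Event("m2", "m1", "executor"),
            new Event("m3", "m2", "tool result"),
            new Event("m4", "m2", "critic"))));
    }
}
```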

Safety Guardrails at the Messaging Layer

Content policy enforcement, PII scrubbing, and tool-call authorization should not live solely in the agent logic. They should be applied at the messaging boundary — before a message reaches a consumer. A lightweight filter consuming from each topic can validate, block, or redact messages based on policy. The filter is not a model; it is a deterministic rules engine plus (optionally) a small classifier for ambiguous cases. When a message is blocked, the producer receives a structured rejection reason, not silence.
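A deterministic filter of this kind can be quite small. The sketch below is an illustration under loud assumptions: the class name GuardrailFilter, the tool allow-list, the rejection code, and the deliberately simplistic email pattern are all made up for the example, and real PII detection needs far more than one regular expression.

```java
import java.util.Set;
import java.util.regex.Pattern;

/** Hypothetical boundary filter: blocks unauthorized tool calls and redacts email addresses. */
final class GuardrailFilter {
    /** allowed=false carries a structured rejection reason; allowed=true carries the (redacted) payload. */
    record Verdict(boolean allowed, String payload, String rejectionReason) { }

    // Illustrative rules only: a simplistic email pattern and an assumed allow-list.
    private static final Pattern EMAIL = Pattern.compile("[\\w.+-]+@[\\w-]+(\\.[\\w-]+)+");
    private static final Set<String> ALLOWED_TOOLS = Set.of("search", "sql.readonly");

    /** requestedTool may be null when the message is not a tool call. */
    static Verdict check(String payload, String requestedTool) {
        if (requestedTool != null && !ALLOWED_TOOLS.contains(requestedTool)) {
            return new Verdict(false, null, "TOOL_NOT_AUTHORIZED: " + requestedTool);
        }
        String redacted = EMAIL.matcher(payload).replaceAll("[REDACTED_EMAIL]");
        return new Verdict(true, redacted, null);
    }

    public static void main(String[] args) {
        System.out.println(check("Refund requested by bob@example.com", null));
        System.out.println(check("rm -rf /tmp/cache", "shell.exec"));
    }
}
```

Returning a Verdict rather than throwing keeps the rejection reason available to send back to the producer.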

6. Real-World Use Cases and Anti-Patterns

Use Case: Customer Support Triage

A customer sends a message. A triage agent classifies it — billing, technical, account — and routes it to the correct specialist agent. The triage agent publishes to agent.messages with senderType=agent and a classification envelope. The specialist agent consumes, drafts a response, and routes it to a human for approval. The human sees the draft, the classification confidence, and the reasoning trace. The messaging layer carries all three.

Use Case: Code Review Pipeline

A PR is opened. A review agent comments on the diff. The comment is published to agent.messages. A human reviewer sees the agent's comment alongside the diff. The human can accept, reject, or modify the comment. The final review is a merge of agent suggestions and human judgment, with every message in the chain auditable. The messaging layer provides the timeline.

Anti-Pattern: The "Autonomous Everything" Trap

The most common failure mode I have seen is giving agents unbounded autonomy over messaging. The agent decides whom to message, what to say, and how often — with no human-in-the-loop validation. Inevitably, the agent finds an edge case, enters a reasoning loop, and floods the messaging layer with repetitive, costly messages. The fix is straightforward: cap agent-originated messages per conversation, require human approval above a cost or sensitivity threshold, and alert when an agent exceeds its lane quota.
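The cap-and-approve fix might be sketched like this. The class name AutonomyGuard and both thresholds are hypothetical, and the token count stands in for cost; a sensitivity check would be another branch in the same method.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical guard that caps agent messages per conversation and escalates expensive ones. */
final class AutonomyGuard {
    enum Decision { SEND, NEEDS_HUMAN_APPROVAL, BLOCKED_CAP_REACHED }

    private final int maxAgentMessagesPerConversation;
    private final int approvalThresholdTokens;
    private final Map<String, Integer> sentByConversation = new HashMap<>();

    AutonomyGuard(int maxAgentMessagesPerConversation, int approvalThresholdTokens) {
        this.maxAgentMessagesPerConversation = maxAgentMessagesPerConversation;
        this.approvalThresholdTokens = approvalThresholdTokens;
    }

    /** Only messages that are actually sent count against the conversation's cap. */
    synchronized Decision evaluate(String conversationId, int tokenCount) {
        int sent = sentByConversation.getOrDefault(conversationId, 0);
        if (sent >= maxAgentMessagesPerConversation) return Decision.BLOCKED_CAP_REACHED;
        if (tokenCount > approvalThresholdTokens) return Decision.NEEDS_HUMAN_APPROVAL;
        sentByConversation.put(conversationId, sent + 1);
        return Decision.SEND;
    }

    public static void main(String[] args) {
        AutonomyGuard guard = new AutonomyGuard(2, 10_000);
        System.out.println(guard.evaluate("conv-1", 500));    // cheap message, under the cap
        System.out.println(guard.evaluate("conv-1", 50_000)); // above the approval threshold
        System.out.println(guard.evaluate("conv-1", 500));    // second sent message
        System.out.println(guard.evaluate("conv-1", 500));    // cap of two reached
    }
}
```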

Anti-Pattern: Prompt Chains as Messaging Protocol

The 2026 equivalent of connecting microservices with SSH tunnels. Teams string together LLM calls with raw prompt templates, passing unstructured text between agents. There is no schema, no versioning, no retry contract, no observability hook. When it breaks (and it always breaks), debugging means reading raw prompt logs and guessing which template produced which output. Use a proper message envelope and a proper transport. Kafka adds maybe 50ms of latency and saves hours of debugging.

Do: Structured Messaging	Don't: Prompt-Chain Spaghetti
Schema-validated envelopes	Raw prompt strings as message format
Versioned message types	No versioning — template changes break downstream silently
Idempotency keys on every agent message	No retry contract — agents retry, prompts drift
Trace context propagated end-to-end	No observability — debugging = grep + guesswork
Token count in every envelope	Token consumption unknown until the bill arrives

7. What to Avoid: Hype, Autonomy Theater, and Brittle Prompt Chains

The AI industry has a hype problem, and messaging architecture is not immune. Three flavors of nonsense are particularly common, and it is worth naming them so you can recognize them in a meeting.

Autonomy theater. Dashboards that show agents "autonomously" handling customer interactions while three human operators shadow-monitor every message. The messaging layer is configured to route everything to agents, but the agents' confidence is low on 80% of requests, so humans silently handle those via a side channel. The dashboard reports 95% autonomous resolution. The messaging logs tell a different story. Build the dashboard from the message logs, not from the demo script.

Prompt-chain spaghetti. Mentioned above, but worth calling out as its own category. The problem is not that prompt chains exist — they will always exist as a prototyping tool. The problem is promoting a prototype to production without replacing the prompt-chain transport with a proper messaging layer. It is the architectural equivalent of deploying a bash script as a production service and being surprised when it breaks at 3 AM.

The AGI bait-and-switch. "Our messaging architecture is designed for AGI-scale agent collaboration." No, it is not. AGI does not exist, and designing for it today means optimizing for constraints nobody has measured. Design for the workloads you actually have: LLMs with context windows, token budgets, and human-in-the-loop validation. When the technology changes, the messaging layer will adapt — because it is built on Kafka, not on a proprietary agent framework.

The best messaging architecture for AI systems today is boring. Kafka topics with clear schemas. Structured envelopes with metadata. Token-aware rate limiting. Trace-level observability. These are not exotic technologies. They are the same patterns that made microservices manageable, applied with slight adaptation to a new kind of producer and consumer. The teams that succeed will be the ones that resist the urge to build an "AI-native messaging platform" and instead build a solid messaging platform that happens to carry AI traffic.

Companion project: A runnable Spring Boot + Kafka messaging relay implementing the patterns described here — message envelopes, lane-separated topics, token-aware rate limiting, idempotency keys, and structured observability logging. Available in the code/ directory alongside this article.

Top comments (1)

Robinson • May 27

The idempotency and schema validation are good points. Enforcing strict message structure at the broker level is especially needed when considering that agents could be producing messages. The token-aware rate limiting argument is interesting, but I think it's more accurately traffic shaping than cost control because if every valid message eventually gets consumed and processed by an LLM, the total token cost is the same regardless of the rate they flow through the broker, if I understood correctly. The real cost levers are upstream in prompt and context window design, and downstream in how each consuming agent manages its own retry limits and token budget.
Good read overall though, it's a useful way to think about how agent behavior stresses assumptions that messaging infrastructure has taken for granted for years.