DEV Community: Pravin Khandke

Messaging in the Age of AI

Pravin Khandke — Mon, 25 May 2026 16:46:55 +0000

Messaging infrastructure has been boring for a decade. Queues, topics, exchanges: the primitives settled. Then AI agents arrived, and suddenly the assumptions that made messaging boring stopped holding. Messages are no longer just data. They are context. An agent will read your message, reason over it, call tools because of it, and generate responses whose token count you cannot predict at enqueue time. The transport layer that worked fine for deterministic services needs to be rethought: not replaced, but adapted.

This article is not about which message broker to pick. It is about what changes when the producer and consumer are both potentially non-deterministic reasoning systems, and what patterns actually hold up in production. The examples use Spring Boot and Apache Kafka because that is a stack I have seen work at scale, but the patterns apply across stacks.

1. Why AI Changes Messaging

Traditional messaging carries structured, bounded payloads. An order-placed event has a known shape: order ID, customer ID, line items, total. A payment-confirmed event carries a transaction reference. These messages are small (hundreds of bytes), predictable in volume, and idempotent by design: reprocess the same order event, get the same result.

AI-originated messages break all three assumptions. A single agent-to-agent message can carry a 100K-token context window, effectively a small novel's worth of reasoning state. Volume is bursty in ways that do not correlate with user activity: a multi-agent consensus round can generate 50 internal messages for a single user request. And idempotency is no longer free, because the same logical input can produce different reasoning paths on each retry.

Messaging for AI systems therefore shifts from "deliver this payload reliably" to "manage reasoning context at scale." Reliability still matters (it matters more), but it is joined by concerns that traditional messaging never had to address: token budgets, model latency variance, and reasoning trace integrity.

In the traditional model, each arrow is a bounded, schema-validated message. In the AI model, the arrow from Planner to Executor carries an entire reasoning state, and that arrow has a dollar cost measured in tokens. The messaging layer needs to know that.

2. New Workloads Created by Agents

Agents generate traffic patterns that look nothing like what your messaging infrastructure was designed for. It is worth cataloguing the new workloads explicitly, because each one stresses a different part of the system.

Planning outputs. Before an agent acts, it thinks, and the thinking produces structured output. A planner agent emits a plan object (goal, sub-goals, constraints, assigned agents) that downstream agents consume. These messages are medium-sized (2-8K tokens) and are the highest-leverage messages in the system: get the plan wrong, and everything downstream wastes tokens.

Tool-call results. When an agent invokes a tool (a database query, an API call, a code execution), the result enters the messaging fabric as a first-class message. These are unpredictable in size (a SQL query can return one row or a million) and must be chunked, summarized, or rejected before they blow out a context window.

Chain-of-thought traces. Some architectures persist the agent's reasoning trace as it streams, not just for debugging, but as context shared with other agents. A reasoning trace is verbose by design. Storing and forwarding it as a message requires treating it as a structured artifact, not a log line.

Multi-agent broadcast and consensus. Agents often need to reach agreement — which plan to execute, whether a tool call result is valid, whether a response meets policy. These consensus rounds generate fan-out message bursts: one agent publishes a proposal, N agents respond with votes or critiques. The messaging layer sees N+1 messages where a traditional system would see one.

In practice, this means your messaging system needs to handle message sizes spanning five orders of magnitude (bytes to megabytes), traffic bursts that do not follow any daily or weekly pattern, and consumers that may take seconds or minutes to process a single message — and retry it aggressively if they are unsure of the result.

3. Messaging Architecture Patterns That Actually Work

After observing agent systems in production across several teams, a set of patterns has crystallized. These are not speculative. They are what teams end up building after the first production incident.

Pattern 1: The Message Envelope

Every message in an AI system must carry metadata beyond a correlation ID. The envelope should include the token count of the payload, the model that generated it, the trace ID, the sender type (human, agent, tool), and an idempotency key if the sender is an agent. The consumer uses this metadata to make routing, quota, and deduplication decisions without parsing the payload body.

The companion project implements this as a Java record — see code/src/main/java/com/messaging/relay/model/MessageEnvelope.java:

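import java.time.Instant;
import java.util.Map;

// SenderType is the companion project's enum (human, agent, tool).
public record MessageEnvelope<T>(
    String messageId,
    String traceId,          // shared by all messages spawned from one request
    String parentMessageId,  // links a message to the one that caused it
    SenderType senderType,
    T payload,
    int tokenCount,          // pre-enqueue estimate
    String modelId,
    Instant timestamp,
    String idempotencyKey,   // required for agent traffic
    Map<String, String> metadata
) { }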

Pattern 2: Separate Traffic Lanes

Human-to-agent, agent-to-agent, and agent-to-tool traffic have different latency tolerances, token profiles, and failure modes. Placing them on separate Kafka topics lets you apply different retention policies, compaction strategies, and consumer group scaling independently. An observability agent can consume from all three topics without competing with operational consumers.
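
As a minimal sketch of lane routing (in JavaScript to keep it short; the companion project itself is Java), the producer can pick a topic from envelope metadata alone. Only agent.messages is named in this article. The other topic names, the dead-letter name, and the topicFor helper are hypothetical.

```javascript
// Hypothetical lane map. Only "agent.messages" appears in the article;
// the remaining topic names are placeholders.
const LANE_TOPICS = new Map([
  ["HUMAN", "human.messages"],
  ["AGENT", "agent.messages"],
  ["TOOL", "tool.messages"],
]);
const DEAD_LETTER_TOPIC = "dead.letter";

// Choose the topic from the envelope's senderType without parsing the payload.
// Unknown sender types go to the dead letter topic instead of a live lane.
function topicFor(envelope) {
  return LANE_TOPICS.get(envelope.senderType) ?? DEAD_LETTER_TOPIC;
}
```

Routing unknown sender types to a dead letter topic is one possible choice; rejecting them at the producer would work equally well.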

Pattern 3: Idempotency Keys for Agent Traffic

Agents retry. It is inherent to their design — when a reasoning step produces low confidence, the agent re-executes. Without idempotency keys at the messaging layer, every retry becomes a new transaction, duplicating work and inflating costs. The pattern is straightforward: the producer sets a key derived from the logical operation (e.g., plan-{conversationId}-{stepNumber}), and the consumer deduplicates within a configurable window. Kafka's log compaction can assist here, but application-layer dedup is more reliable for agent workloads because the retry semantics are not strictly exactly-once in the Kafka sense.
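
The consumer-side half of this pattern can be sketched as an in-memory window (JavaScript for brevity). IdempotencyWindow, planKey, and the injectable clock are hypothetical names, and a real deployment would likely need a shared store so that every consumer instance sees the same keys.

```javascript
// Consumer-side deduplication within a configurable time window.
class IdempotencyWindow {
  constructor(windowMs, now = () => Date.now()) {
    this.windowMs = windowMs;
    this.now = now;
    this.seen = new Map(); // idempotencyKey -> first-seen time in ms
  }

  // True the first time a key is seen inside the window, false for duplicates.
  firstSeen(key) {
    const t = this.now();
    // Evict expired keys. A Map iterates in insertion order, so the oldest
    // entries come first and the loop can stop at the first live one.
    for (const [k, ts] of this.seen) {
      if (t - ts < this.windowMs) break;
      this.seen.delete(k);
    }
    if (this.seen.has(key)) return false;
    this.seen.set(key, t);
    return true;
  }
}

// Key derived from the logical operation: plan-{conversationId}-{stepNumber}.
function planKey(conversationId, stepNumber) {
  return `plan-${conversationId}-${stepNumber}`;
}
```

A consumer would call firstSeen(envelope.idempotencyKey) before doing any work and skip the message when it returns false.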

Pattern 4: Chunked Context Delivery

Do not send a 100K-token context window as a single Kafka message. Break it into chunks — summary, relevant history, tool outputs, reasoning state — each with its own envelope metadata. The consumer can then decide which chunks to load into the model's context window based on relevance, recency, and token budget. This turns context assembly from a producer-side guess into a consumer-side decision.

The companion project's ContextChunker (see code/src/main/java/com/messaging/relay/chunking/ContextChunker.java) splits content by a configurable maxChunkTokens threshold. The KafkaConfig (code/src/main/java/com/messaging/relay/config/KafkaConfig.java) defines the four-topic topology with per-lane retention policies — 7 days for human traffic, 30 days for agent traffic (audit trail), 3 days with compaction for tool calls, and 90 days for the dead letter topic.
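
A rough sketch of the idea, in JavaScript rather than the companion project's Java: split by a maxChunkTokens threshold on the producer side, then let the consumer pick chunks against its own budget. The whitespace word count stands in for a real tokenizer, and chunkContext and selectWithinBudget are hypothetical names, not the ContextChunker API.

```javascript
// Rough token estimate: whitespace-separated words. This is an assumption;
// a real system should use the target model's tokenizer.
function estimateTokens(text) {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Producer side: split text into chunks of at most maxChunkTokens estimated tokens.
function chunkContext(text, maxChunkTokens, kind = "history") {
  if (!Number.isInteger(maxChunkTokens) || maxChunkTokens <= 0) {
    throw new RangeError("maxChunkTokens must be a positive integer");
  }
  const words = text.trim().split(/\s+/).filter(Boolean);
  const chunks = [];
  for (let i = 0; i < words.length; i += maxChunkTokens) {
    const body = words.slice(i, i + maxChunkTokens).join(" ");
    chunks.push({ kind, index: chunks.length, tokenCount: estimateTokens(body), body });
  }
  return chunks;
}

// Consumer side: walk chunks in the order given (assumed to be priority order)
// and keep each one that still fits in the remaining token budget.
function selectWithinBudget(chunks, tokenBudget) {
  const selected = [];
  let used = 0;
  for (const chunk of chunks) {
    if (used + chunk.tokenCount > tokenBudget) continue;
    selected.push(chunk);
    used += chunk.tokenCount;
  }
  return selected;
}
```

The greedy selection is deliberately simple. Ranking chunks by relevance or recency before calling selectWithinBudget is where the consumer-side decision actually happens.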

4. Token Limits, Rate Limits, and Quota Management

Rate limiting by request count made sense when every request cost roughly the same. An AI system can receive two messages that are both "one request" — one costs $0.002 and the other costs $0.30. The remedy is token-aware rate limiting.

The mechanism is simple: before enqueuing a message to Kafka, count its tokens using the same tokenizer the model will use. Apply rate limits in tokens-per-minute, not requests-per-minute. Split the quota between lanes: for example, 70% reserved for human-originated traffic (which must be responsive) and 30% for agent-to-agent traffic (which can be delayed or degraded). When the quota for a lane is exhausted, apply backpressure: signal to the producer that it should slow down, batch, or degrade to a cheaper model.

The companion project implements this in TokenAwareRateLimiter (see code/src/main/java/com/messaging/relay/ratelimit/TokenAwareRateLimiter.java):

public RateLimitDecision check(String serializedPayload, MessageEnvelope<?> envelope) {
    int tokenCount = countTokens(serializedPayload);
    SenderType senderType = envelope.senderType();
    boolean allowed = quotaManager.tryConsume(senderType, tokenCount);
    if (allowed) return RateLimitDecision.allowed(tokenCount);
    return RateLimitDecision.denied(
        senderType.name() + " quota exhausted. " + backpressureHint(senderType), tokenCount);
}

The QuotaManager maintains per-lane sliding windows resetting each minute, with configurable limits — defaulting to 600K tokens/min for human traffic, 200K for agents, and 100K for tool calls.
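
To make the per-lane accounting concrete, here is a small JavaScript sketch. It is not the companion project's QuotaManager: LaneQuota is a hypothetical name, and it uses a fixed one-minute window, which is simpler than a true sliding window. The limits are the defaults quoted above.

```javascript
// Default per-minute token limits per lane, taken from the figures in the text.
const DEFAULT_LIMITS = { HUMAN: 600_000, AGENT: 200_000, TOOL: 100_000 };
const WINDOW_MS = 60_000;

class LaneQuota {
  constructor(limits = DEFAULT_LIMITS, now = () => Date.now()) {
    this.limits = limits;
    this.now = now;
    this.windows = new Map(); // lane -> { start, used }
  }

  // Returns true and records the usage if the lane still has room this minute.
  tryConsume(lane, tokens) {
    if (!Object.hasOwn(this.limits, lane)) {
      throw new Error(`unknown lane: ${lane}`);
    }
    const t = this.now();
    let w = this.windows.get(lane);
    // Start a fresh window when none exists or the current one has expired.
    if (!w || t - w.start >= WINDOW_MS) {
      w = { start: t, used: 0 };
      this.windows.set(lane, w);
    }
    if (w.used + tokens > this.limits[lane]) return false;
    w.used += tokens;
    return true;
  }
}
```

A denied call leaves the counter untouched, so a smaller message can still pass in the same window.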

The key consideration here is that rate limiting in AI systems is not just about protecting infrastructure. It is about cost control. A runaway agent loop that retries 50 times before converging should not generate a surprise $15 charge. The messaging layer is the correct place to enforce this, because it sits between the agent's impulse to retry and the model provider's metering endpoint.

5. Observability, Auditing, and Operational Safety

Observability for AI messaging is not an extension of APM. APM tells you whether a topic is backed up. AI messaging observability tells you whether the messages flowing through it are producing correct, safe, and cost-effective outcomes. Those are different questions that require different instrumentation.

What to Log per Message

Every message passing through the system should carry a structured log entry — not as an afterthought, but as a first-class part of the messaging pipeline. The minimum fields: traceId, senderType, tokenCount, modelId, latencyMs, retryCount, idempotencyKey, and blockedCheck (whether a safety guardrail intercepted the message). These fields let you reconstruct any interaction from raw logs — what was sent, by whom, at what cost, with what result.

The companion project's ObservabilityFilter (see code/src/main/java/com/messaging/relay/observability/ObservabilityFilter.java) logs a structured JSON event per consumed message:

public void logConsumption(MessageEnvelope<?> envelope, String topic, long offset) {
    Map<String, Object> event = new LinkedHashMap<>();
    event.put("trace_id", envelope.traceId());
    event.put("sender_type", envelope.senderType().name());
    event.put("token_count", envelope.tokenCount());
    event.put("model_id", envelope.modelId());
    event.put("idempotency_key", envelope.idempotencyKey());
    event.put("topic", topic);
    event.put("offset", offset);
    try {
        obsLog.info(objectMapper.writeValueAsString(event));
    } catch (JsonProcessingException e) {
        // A serialization failure must not break message consumption.
        obsLog.warn("Failed to serialize observability event: " + e.getMessage());
    }
}

A separate passesSafetyCheck method runs before consumer processing, blocking messages flagged in metadata. In production, extend this with PII detection and content policy evaluation.

Message Lineage

A single user request can spawn a tree of agent messages: planner to executor, executor to tool, tool result back to executor, executor to critic, critic back to planner. If you cannot trace that tree, you cannot debug it. The trace ID is the spine of lineage — but it is not enough. Each agent should also record parentMessageId so you can reconstruct the tree topology. In practice, this means the message envelope (Pattern 1) carries a parentMessageId field, and the observability consumer builds the tree from the event stream.
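
The tree reconstruction can be sketched in a few lines of JavaScript. The function name buildLineage is hypothetical. The input is assumed to be the flat list of envelopes for one trace, each carrying messageId and parentMessageId as in Pattern 1.

```javascript
// Rebuild the message tree for one trace from a flat list of envelopes.
function buildLineage(envelopes) {
  const nodes = new Map();
  for (const e of envelopes) {
    nodes.set(e.messageId, { messageId: e.messageId, children: [] });
  }

  const roots = [];
  for (const e of envelopes) {
    const node = nodes.get(e.messageId);
    const parent = e.parentMessageId ? nodes.get(e.parentMessageId) : undefined;
    // A message with no parent, an unknown parent, or itself as parent
    // is treated as a root so nothing silently disappears from the tree.
    if (parent && parent !== node) {
      parent.children.push(node);
    } else {
      roots.push(node);
    }
  }
  return roots;
}
```

Treating orphans as roots is a design choice: a message whose parent was dropped or not yet consumed still shows up, which is usually what you want when debugging.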

Safety Guardrails at the Messaging Layer

Content policy enforcement, PII scrubbing, and tool-call authorization should not live solely in the agent logic. They should be applied at the messaging boundary — before a message reaches a consumer. A lightweight filter consuming from each topic can validate, block, or redact messages based on policy. The filter is not a model; it is a deterministic rules engine plus (optionally) a small classifier for ambiguous cases. When a message is blocked, the producer receives a structured rejection reason, not silence.
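
A deterministic filter of this kind might look like the following JavaScript sketch. The rule IDs, the email pattern, the tool allow-list, and the metadata.tool field are all illustrative assumptions, not part of the companion project.

```javascript
// Illustrative allow-list of tools an agent may call.
const ALLOWED_TOOLS = new Set(["search", "sql.read"]);

// Each rule is a pure predicate over the envelope, so results are reproducible.
const RULES = [
  {
    id: "pii.email",
    reason: "payload contains an email address",
    test: (env) => /[\w.+-]+@[\w-]+\.[\w.-]+/.test(String(env.payload)),
  },
  {
    id: "tool.unauthorized",
    reason: "tool is not on the allow-list",
    test: (env) =>
      env.senderType === "AGENT" &&
      env.metadata?.tool !== undefined &&
      !ALLOWED_TOOLS.has(env.metadata.tool),
  },
];

// Returns a structured decision; a blocked message always carries a reason.
function evaluate(envelope) {
  for (const rule of RULES) {
    if (rule.test(envelope)) {
      return {
        allowed: false,
        ruleId: rule.id,
        reason: rule.reason,
        traceId: envelope.traceId,
      };
    }
  }
  return { allowed: true, traceId: envelope.traceId };
}
```

The rejection object is what the producer would receive instead of silence; an optional classifier for ambiguous cases would sit behind these rules, not replace them.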

6. Real-World Use Cases and Anti-Patterns

Use Case: Customer Support Triage

A customer sends a message. A triage agent classifies it — billing, technical, account — and routes it to the correct specialist agent. The triage agent publishes to agent.messages with senderType=agent and a classification envelope. The specialist agent consumes, drafts a response, and routes it to a human for approval. The human sees the draft, the classification confidence, and the reasoning trace. The messaging layer carries all three.

Use Case: Code Review Pipeline

A PR is opened. A review agent comments on the diff. The comment is published to agent.messages. A human reviewer sees the agent's comment alongside the diff. The human can accept, reject, or modify the comment. The final review is a merge of agent suggestions and human judgment, with every message in the chain auditable. The messaging layer provides the timeline.

Anti-Pattern: The "Autonomous Everything" Trap

The most common failure mode I have seen is giving agents unbounded autonomy over messaging. The agent decides whom to message, what to say, and how often — with no human-in-the-loop validation. Inevitably, the agent finds an edge case, enters a reasoning loop, and floods the messaging layer with repetitive, costly messages. The fix is straightforward: cap agent-originated messages per conversation, require human approval above a cost or sensitivity threshold, and alert when an agent exceeds its lane quota.
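
A minimal sketch of the first two controls, in JavaScript. createAgentGovernor, its option names, and the decision strings are hypothetical, and the cost estimate is assumed to come from the envelope's token count.

```javascript
// Cap agent-originated messages per conversation and route expensive
// messages to a human instead of sending them automatically.
function createAgentGovernor({ maxMessagesPerConversation, approvalCostUsd }) {
  const counts = new Map(); // conversationId -> messages sent so far

  return function decide(conversationId, estimatedCostUsd) {
    const sent = counts.get(conversationId) ?? 0;
    if (sent >= maxMessagesPerConversation) return "REJECT_CAP_REACHED";
    if (estimatedCostUsd > approvalCostUsd) return "NEEDS_HUMAN_APPROVAL";
    counts.set(conversationId, sent + 1);
    return "SEND";
  };
}
```

Messages waiting for approval do not count against the cap in this sketch; whether they should is a policy decision.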

Anti-Pattern: Prompt Chains as Messaging Protocol

This is the 2026 equivalent of connecting microservices with SSH tunnels. Teams string together LLM calls with raw prompt templates, passing unstructured text between agents. There is no schema, no versioning, no retry contract, no observability hook. When it breaks (and it always breaks), debugging means reading raw prompt logs and guessing which template produced which output. Use a proper message envelope and a proper transport. Kafka adds maybe 50ms of latency and saves hours of debugging.

Do: Structured Messaging	Don't: Prompt-Chain Spaghetti
Schema-validated envelopes	Raw prompt strings as message format
Versioned message types	No versioning — template changes break downstream silently
Idempotency keys on every agent message	No retry contract — agents retry, prompts drift
Trace context propagated end-to-end	No observability — debugging = grep + guesswork
Token count in every envelope	Token consumption unknown until the bill arrives

7. What to Avoid: Hype, Autonomy Theater, and Brittle Prompt Chains

The AI industry has a hype problem, and messaging architecture is not immune. Three flavors of nonsense are particularly common, and it is worth naming them so you can recognize them in a meeting.

Autonomy theater. Dashboards that show agents "autonomously" handling customer interactions while three human operators shadow-monitor every message. The messaging layer is configured to route everything to agents, but the agents' confidence is low on 80% of requests, so humans silently handle those via a side channel. The dashboard reports 95% autonomous resolution. The messaging logs tell a different story. Build the dashboard from the message logs, not from the demo script.

Prompt-chain spaghetti. Mentioned above, but worth calling out as its own category. The problem is not that prompt chains exist — they will always exist as a prototyping tool. The problem is promoting a prototype to production without replacing the prompt-chain transport with a proper messaging layer. It is the architectural equivalent of deploying a bash script as a production service and being surprised when it breaks at 3 AM.

The AGI bait-and-switch. "Our messaging architecture is designed for AGI-scale agent collaboration." No, it is not. AGI does not exist, and designing for it today means optimizing for constraints nobody has measured. Design for the workloads you actually have: LLMs with context windows, token budgets, and human-in-the-loop validation. When the technology changes, the messaging layer will adapt — because it is built on Kafka, not on a proprietary agent framework.

The key consideration here is that the best messaging architecture for AI systems today is boring. Kafka topics with clear schemas. Structured envelopes with metadata. Token-aware rate limiting. Trace-level observability. These are not exotic technologies. They are the same patterns that made microservices manageable, applied with slight adaptation to a new kind of producer and consumer. The teams that succeed will be the ones that resist the urge to build an "AI-native messaging platform" and instead build a solid messaging platform that happens to carry AI traffic.

Companion project: A runnable Spring Boot + Kafka messaging relay implementing the patterns described here — message envelopes, lane-separated topics, token-aware rate limiting, idempotency keys, and structured observability logging. Available in the code/ directory alongside this article.

Feature Flags That Actually Ship: Lessons From the Trenches

Pravin Khandke — Sun, 03 May 2026 22:22:26 +0000

It was 2:47 AM when the alerts started. A seemingly straightforward database migration had triggered a cascading failure across three downstream services, and our payment processing pipeline was dropping roughly 12% of transactions. The on-call engineer didn't need to wake anyone, locate a rollback script, or wait for a CI pipeline to churn through another deploy. She opened the LaunchDarkly dashboard, toggled one kill switch, and the system reverted to the stable path within seconds. The migration was still there, still deployed — just no longer live.

That moment crystallized something I'd been learning across two and a half decades of building software: separating deployment from release isn't a nice-to-have. It's the difference between a system you trust and one you fear touching on a Friday afternoon.

This article captures what I've learned using feature flags in production — the patterns that held up under pressure, the mistakes I've watched teams repeat (and made myself), and the practical steps you can take whether you're evaluating LaunchDarkly or already deep into your feature flag journey. I'm publishing this here first because the developer community gives the most honest feedback, and I'd rather refine these ideas with you before they land on LeadDev and DZone.

The Patterns That Actually Matter

When you first start with feature flags, everything looks like a toggle. The key consideration here is understanding that not all flags serve the same purpose, and conflating them creates the very fragility you're trying to avoid.

Release Flags

These gate unfinished features. They're temporary by design: the flag exists while the feature stabilizes, then gets removed. The mistake I see most often is teams treating release flags as permanent configuration knobs. When a flag has been at 100% for three months, nobody remembers which code path is the "real" one, and your test matrix silently doubles.

In practice, this means setting a removal date the moment you create the flag. Our team attaches an expiration tag to every release flag and runs a weekly script that surfaces anything past its removal window. We borrowed from the FlagShark playbook here: flags older than 90 days that aren't operational kill switches get an automatic ticket filed.
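
The weekly check can be as small as the sketch below. The inventory shape ({ key, createdAt }) and the findStaleFlags name are assumptions for illustration. Kill switches are recognized by the ops_ prefix used in this article's flag keys.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Return flags older than maxAgeDays that are not operational kill switches.
function findStaleFlags(flags, now = new Date(), maxAgeDays = 90) {
  return flags.filter((flag) => {
    const isKillSwitch = flag.key.startsWith("ops_");
    const ageDays = (now - new Date(flag.createdAt)) / DAY_MS;
    return !isKillSwitch && ageDays > maxAgeDays;
  });
}
```

Each returned flag would then get a ticket filed against its owner.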

Centralize your flag keys in a single file. It gives you a one-glance inventory and prevents the typo-driven debugging sessions that scattered string literals create:

// code/src/flags.js — single source of truth for all flag keys
// See companion project: code/src/flags.js

const FLAGS = {
  // Kill switch: wraps the payment provider integration.
  // Defaults to FALSE (safe path) if SDK is unreachable.
  PAYMENT_PROVIDER_KILL_SWITCH: "ops_payments_new_provider",

  // Release flag: gates the new checkout UI.
  // Temporary — remove after 100% rollout + 14 days stable.
  NEW_CHECKOUT_UI: "release_checkout_redesigned_ui",

  // Experiment flag: percentage rollout of recommendation engine.
  RECOMMENDATION_ENGINE: "experiment_recommendations_v2",

  // Permission flag: enterprise-only feature.
  ENTERPRISE_ANALYTICS: "permission_enterprise_analytics",
};

The naming convention follows a pattern: {type}_{team/domain}_{feature}_{detail}. This tells you at a glance what a flag does, who owns it, and when it should be removed. Release flags should be short-lived. Ops flags (kill switches) should be reviewed annually. Experiment flags expire when the experiment ends.
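
A convention like this is easy to enforce mechanically, for instance in a lint step. The sketch below is one possible check. The pattern and the parseFlagKey helper are assumptions, with the trailing detail segment treated as optional because some of the keys above have only three parts.

```javascript
// type, then domain, then one or more feature/detail segments, all lowercase.
const FLAG_KEY_PATTERN =
  /^(release|ops|experiment|permission)_[a-z0-9]+(?:_[a-z0-9]+)+$/;

// Returns the parsed parts of a valid key, or null if the key breaks the convention.
function parseFlagKey(key) {
  if (!FLAG_KEY_PATTERN.test(key)) return null;
  const [type, domain, ...rest] = key.split("_");
  return { type, domain, feature: rest.join("_") };
}
```

Running parseFlagKey over every value in the flags file at build time would catch a misnamed flag before it reaches the dashboard.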

Here's the LaunchDarkly client initialization: a singleton that streams flag rules and caches them locally so evaluations work even during network interruptions:

// code/src/launchdarkly.js: LD client singleton
// See companion project: code/src/launchdarkly.js

const LaunchDarkly = require("@launchdarkly/node-server-sdk");

let ldClient; // cached so every caller shares one streaming connection

async function initLaunchDarkly(sdkKey) {
  if (ldClient) return ldClient;
  ldClient = LaunchDarkly.init(sdkKey);

  try {
    // timeout is in seconds
    await ldClient.waitForInitialization({ timeout: 5 });
    console.log("[LaunchDarkly] Client initialized successfully");
  } catch (err) {
    console.warn(
      "[LaunchDarkly] Initialization failed or timed out; evaluations will return defaults:",
      err.message
    );
  }

  return ldClient;
}

module.exports = { initLaunchDarkly };

Kill Switches

A kill switch is a different animal entirely. It's not about shipping features; it's about operational safety. Every integration point with an external system, every experimental code path, every performance-sensitive refactor gets wrapped in one.

The pattern that saved us at 2:47 AM looked like this:

// code/src/server.js — Kill Switch pattern
// See companion project: code/src/server.js, GET /api/payment/status

app.get("/api/payment/status", async (req, res) => {
  const context = { kind: "user", key: req.query.user || req.ip };

  // Default: false = use safe fallback path.
  // If LaunchDarkly is unreachable, the SDK returns the default.
  const useNewProvider = await client.boolVariation(
    FLAGS.PAYMENT_PROVIDER_KILL_SWITCH,
    context,
    false   // <-- THE CRITICAL DEFAULT: safe path
  );

  if (useNewProvider) {
    return res.json({ provider: "new-payment-provider", status: "ok" });
  }

  // Safe fallback: the existing, battle-tested provider.
  res.json({ provider: "existing-payment-provider", status: "ok" });
});

The critical design requirement: the fallback path must be the one that works. If your kill switch guards a new payment provider integration, the fallback routes through the existing, battle-tested provider. If the flag evaluation itself fails due to a network issue, LaunchDarkly's SDK returns the default value you specify — which should always trigger the safe path.

Percentage Rollouts

Deterministic hashing based on a stable user attribute means the same user sees the same experience across sessions. This matters more than you'd think — users notice inconsistency, and your metrics become meaningless if a single user bounces between variants.
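
To show why hashing gives stable assignment, here is a small JavaScript sketch. It illustrates the general technique only and is not LaunchDarkly's actual bucketing algorithm. bucketFor and inRollout are hypothetical names.

```javascript
const nodeCrypto = require("node:crypto");

// Map (flagKey, userKey) to a stable bucket in the range [0, 100).
// Including the flag key keeps a user's bucket independent across flags.
function bucketFor(flagKey, userKey) {
  const digest = nodeCrypto
    .createHash("sha256")
    .update(`${flagKey}:${userKey}`)
    .digest();
  // Interpret the first 4 bytes as an unsigned integer and scale it.
  return (digest.readUInt32BE(0) / 2 ** 32) * 100;
}

// A user is in the rollout when their bucket falls below the percentage.
function inRollout(flagKey, userKey, percentage) {
  return bucketFor(flagKey, userKey) < percentage;
}
```

Because the bucket never changes for a given user and flag, raising the percentage only ever adds users: anyone included at 5% is still included at 25%.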

Our rollout cadence settled into a rhythm: internal team for one day, 1% of external users for a day, then 5%, 25%, and full release if all guardrails stay green. At each stage, we watch application error rates, API latency, and business metrics. LaunchDarkly's Guarded Releases can automate the pause-or-rollback decision if a threshold breaches, which removes the 3 AM judgment call from the equation.

// code/src/server.js — Percentage rollout with string variation
// See companion project: code/src/server.js, GET /api/recommendations

app.get("/api/recommendations", async (req, res) => {
  const context = { kind: "user", key: req.query.user || "anonymous" };

  // stringVariation for multi-variant experiments.
  // Deterministic hashing on user key ensures the same user
  // consistently sees the same variant.
  const variant = await client.stringVariation(
    FLAGS.RECOMMENDATION_ENGINE,
    context,
    "v1"   // default: existing recommendation engine
  );

  if (variant === "v2") {
    return res.json({
      engine: "collaborative-filtering-v2",
      recommendations: ["Item-A", "Item-B", "Item-C"],
    });
  }

  res.json({
    engine: "popularity-based-v1",
    recommendations: ["Item-X", "Item-Y", "Item-Z"],
  });
});

And here's user targeting in action — enterprise features gated by a custom attribute:

// code/src/server.js — Targeting with custom attributes
// See companion project: code/src/server.js, GET /api/analytics/dashboard

app.get("/api/analytics/dashboard", async (req, res) => {
  const context = {
    kind: "user",
    key: req.query.user || "anonymous",
    plan: req.query.plan || "free",  // custom attribute for targeting rules
  };

  const canAccess = await client.boolVariation(
    FLAGS.ENTERPRISE_ANALYTICS,
    context,
    false
  );

  if (!canAccess) {
    return res.status(403).json({
      error: "Enterprise analytics require the Enterprise plan.",
    });
  }

  res.json({
    dashboard: "advanced-analytics",
    metrics: ["revenue-per-user", "churn-prediction", "cohort-retention"],
  });
});

All the code above comes from the companion project — a fully runnable Express app in code/src/server.js. Clone it, set your SDK key, and you'll see every pattern respond to flag toggles in real time without a server restart.

The Questions Your Team Will Ask (And How to Answer Them)

When you introduce feature flags at scale, you'll hear the same objections. I've had these conversations enough times to recognize the patterns.

"Doesn't this just create more code to maintain?"

Yes, if you treat flags as permanent. The entire discipline of flag lifecycle management exists because flags without expiration dates become technical debt with a feature flag logo. The countermeasure is mechanical, not cultural: automation that flags stale toggles, creates cleanup tasks, and blocks new flags when the ratio of creation to removal tips past 2:1.

We enforce a simple rule: every flag has an owner, an expiration date, and a ticket filed at creation time for its eventual removal. When a release flag hits 100% rollout for two weeks, the cleanup PR gets auto-generated. This isn't optional; it's how you prevent the flag graveyard.

"What if the flag service goes down?"

LaunchDarkly SDKs maintain a streaming connection and cache flag rules locally. If the connection drops, evaluations continue against the cached ruleset. The boolVariation call includes a default value parameter precisely for this scenario — and every code path I write defaults to the safe, existing behavior.

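In the 2:47 AM scenario, the kill switch worked because the toggle reached every running SDK over its streaming connection within seconds. The local cache covers the other failure mode: had LaunchDarkly been unreachable, evaluations would have continued against the last ruleset the SDK received, not failed outright. The cache cannot deliver a change made while the connection is down, which is one more reason the coded default must always be the safe path.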

"Can't we just build this ourselves?"

Technically, yes. I've seen teams build internal feature flag systems. I've also seen those same teams spend sprint after sprint maintaining edge-case evaluation logic, building dashboards, and debugging deterministic hashing when they could have been building their actual product. The key consideration here isn't whether you can build it — it's whether maintaining a feature flag platform is where your team's time creates the most value.

Where We Go From Here

If you're starting with feature flags, begin with one operational kill switch on a high-risk integration. Get comfortable with the pattern, build the muscle memory for flag cleanup, then expand to release flags and progressive rollouts. The most successful adoptions I've seen started small and grew organically, rather than attempting a company-wide flag-everything initiative overnight.

For deeper dives, the LaunchDarkly documentation on guarded rollouts and kill switch flags is excellent. The FlagShark best practices guide informed much of our internal naming and lifecycle discipline. And if you want to understand why stale flags genuinely keep me up at night, read about the $460M Knight Capital incident — a stark reminder that unreachable code paths aren't harmless.

The original version of this article, along with a companion project demonstrating every pattern discussed here, lives on this blog. I'll be expanding it based on your questions and feedback before it goes to LeadDev and DZone — so if something here sparks a thought or a disagreement, I'd genuinely like to hear it in the comments.

Key Takeaways

Separate deployment from release. A deployed change that isn't live yet is a safety net. A deployed change that's fully live with no way to turn it off is a liability.

Treat flag cleanup as a first-class engineering practice. Naming conventions, expiration dates, and automated removal aren't overhead — they're what keep your codebase comprehensible six months from now.

Default to safety. Every flag evaluation should fall back to the known-good path. The time to verify your kill switch works isn't during an incident at 2:47 AM.

Start small, automate early, and build the habits before you build the flag count. The teams I've watched succeed with feature flags aren't the ones with the most sophisticated tooling — they're the ones with the most disciplined lifecycle management.