AI agents are pretty good at deciding what should happen next, given a well-defined business workflow. In the case of a customer support agent, for example, they can read a conversation, apply a policy, and return a response like "approve the refund" or "escalate this case." That part is exciting, and it is usually what gets demoed first. But the hard part starts right after the decision is made.
In a real system, a decision only matters if the rest of the platform can trust it. If an agent decides a customer should get a refund, that decision still has to turn into real work across the rest of the application. The support case needs to be updated, and billing needs to issue the refund. The customer may need an email, and the CRM probably needs the updated status, too.
If the app updates the case and then crashes before billing gets the event, you now have a case that says "refund approved" and a customer who never actually got refunded. That is the kind of bug that makes a system feel flaky even when the model made the right call. But the worst part is the damage to customer experience. I would be really mad at the company if this happened to me.
For scenarios like this, the Transactional Outbox pattern exists. Instead of treating "update the case" and "tell the rest of the system" as two separate operations, we commit them together and let the rest of the platform react asynchronously afterward. This pattern became fairly famous in the context of microservices, as they often need a reliable way to hand off tasks. I think the pattern is also useful for agents, because the fundamental problem is the same.
In this post, I will discuss the Transactional Outbox pattern in the context of agents and offer an opinionated view of why I believe it is a best practice for agentic applications. I will frame the discussion around a single question: once an agent makes a business decision, how can you ensure the rest of the system can rely on it?
The problem is the handoff
Developers often stress about designing systems that can survive unimaginable production incidents. In reality, you don't need a major outage or an exotic distributed-systems failure to see things go wrong. Sometimes all it takes is one service doing two related things in two separate steps: first it updates the business state, then it publishes an event for downstream systems.
That looks harmless until something fails in between. If the state update succeeds and the event publish does not, the source of truth has moved forward, but the rest of the workflow has not.
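To make that failure window concrete, here is a minimal, self-contained simulation. The `caseStore`, `eventBus`, and crash point are all hypothetical stand-ins, not part of any real implementation: the state update commits, the process "crashes" before the publish, and the two sides disagree.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public final class DualWriteSimulation {

    // Hypothetical stand-ins for the real datastore and message broker.
    static final Map<String, String> caseStore = new HashMap<>();
    static final List<String> eventBus = new ArrayList<>();

    static void approveRefund(String caseId, boolean crashBeforePublish) {
        caseStore.put(caseId, "refund_approved");    // step 1: state update commits
        if (crashBeforePublish) {
            return;                                  // simulated crash: publish never happens
        }
        eventBus.add("RefundApproved:" + caseId);    // step 2: event publish
    }

    public static void main(String[] args) {
        approveRefund("case-123", true);
        // The source of truth moved forward...
        System.out.println(caseStore.get("case-123"));  // refund_approved
        // ...but downstream systems never heard about it.
        System.out.println(eventBus.size());            // 0
    }
}
```

The record says "refund approved", the bus is empty, and nothing in the system knows anything is wrong.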
Here is that failure in one picture:
What makes this annoying is that nothing looks obviously broken at first. During an incident investigation, if someone checks the case record, it looks correct. The problem only shows up later when billing never acts, the customer complains, or support has to manually reconcile what happened. That is why I think this is not an AI problem. It is a handoff problem.
Motivation for the Transactional Outbox pattern
The Transactional Outbox pattern exists because "save state, then publish the event" is fragile by design. The pattern gives you a cleaner contract: when business state changes, the application also writes an outbox event in the same atomic operation.
That one change removes the worst failure mode. You no longer end up in the state where the case changed, but the event silently disappeared. It also keeps the request path honest. The service does not have to directly coordinate billing, notifications, CRM sync, and everything else just to be correct.
Instead, the request path only needs to guarantee one thing: the decision and the outbox event are committed together. Once that happens, everything else becomes recoverable instead of fragile.
That is why this pattern fits agentic systems so well. Agents make decisions that trigger follow-up work, but those decisions need a durable "this happened" moment before the rest of the system can safely react.
Why is "Just Retry the Publish" not enough?
Whenever I bring up the Transactional Outbox pattern with developers, I often hear the same question: "Why not just retry if the publish fails?" I think that reaction comes up because implementing the pattern correctly requires certain design decisions and technology choices, so the instinct is to look for a simpler alternative.
For this reason, I like to stress the following: retries look reasonable until you examine where the failure occurs. Retries only help if the application still knows it has something to retry. If the process crashes after the state update but before the event is durably recorded anywhere, there is nothing left to retry.
That is the key difference between retries and an outbox. Retries help you deliver an event that already exists, while the outbox ensures the event exists in the first place. Once you look at it that way, the pattern feels less like a ceremony and more like basic design principles. If the business state changes, the system needs a durable record of the event that describes that change.
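The distinction is easy to show in a few lines. In this toy sketch (all names are hypothetical), the event is appended to a durable in-memory outbox alongside the state change, and a relay retries a flaky publisher until delivery succeeds. Because the event exists, the retry loop always has something to work with:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public final class OutboxRelaySketch {

    // Durable record of events still awaiting delivery (stand-in for the real outbox).
    static final Deque<String> outbox = new ArrayDeque<>();
    static final List<String> delivered = new ArrayList<>();
    static int publishAttempts = 0;

    // Flaky publisher: fails on the first attempt, succeeds afterward.
    static boolean tryPublish(String event) {
        publishAttempts++;
        if (publishAttempts == 1) {
            return false;  // simulated broker hiccup
        }
        delivered.add(event);
        return true;
    }

    // The relay only forgets an event once it has actually been delivered.
    static void relay() {
        while (!outbox.isEmpty()) {
            String event = outbox.peekFirst();
            if (tryPublish(event)) {
                outbox.removeFirst();
            }
        }
    }

    public static void main(String[] args) {
        outbox.addLast("RefundApproved:case-123");  // committed alongside the state change
        relay();
        System.out.println(delivered);  // [RefundApproved:case-123]
    }
}
```

The retry logic is the easy part; the outbox is what guarantees there is an event to retry in the first place.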
Redis Streams is great for this pattern
Redis Streams are a good fit for this kind of outbox because they already behave like the commit log we want. You can append events to them, consume them in order, track what is pending, and let different consumer groups process the same stream independently. That matters because the outbox is not really a queue in the narrow sense. It is a commit log for business events.
Admittedly, the Transactional Outbox pattern is often implemented with Apache Kafka and change-data-capture tools such as Debezium. That is where the pattern became best known. I have helped many developers implement it with Kafka, and it works well. But having spent years with Kafka, I can also say that the implementation effort sometimes exceeds the problem it was meant to solve: you end up spending more time operating Kafka than addressing the actual business problem.
Redis Streams, on the other hand, make that pretty natural. A single event can be appended once and then processed independently by several downstream concerns. The other reason Streams fit well is that they sit comfortably inside Redis: if your support case state also lives in Redis, the state change and the outbox append can share one commit boundary.
That part is important. The pattern is strongest when the business state and the outbox live in the same datastore, because that gives you a single atomic write instead of a dual-write problem wearing different clothes. With Kafka, you would need to handle two different distributed systems: the commit log itself and the data store where the update must occur.
Diving deep into the architecture
For this example, the support case state and the outbox both live in Redis. The current case state is stored in a hash, and the outbox is stored in a Redis Stream.
A case key might look like support:{tenant-acme}:case:case-123, while the outbox stream might be support:{tenant-acme}:outbox. The hash tags here are important because you must be intentional about where the data lives in Redis. During development you may work with a single server, which is effectively a single shard, so the data naturally lives in one place. In production, however, you may run a clustered Redis deployment with multiple shards.
The shared hash tag keeps both keys in the same slot in clustered Redis, which is what lets them participate in the same transaction. That gives us a clean split of responsibilities: the case record tells us what is true now, and the outbox stream tells us what happened and what the rest of the platform still needs to process. In other words, a careless key-naming scheme can quietly break the whole design: if the case key and the outbox key land in different slots, the atomic transaction is no longer possible.
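You can verify the slotting yourself. Redis Cluster hashes only the text inside `{...}` (when a non-empty hash tag is present) using CRC16-XMODEM, modulo 16384. Here is a small sketch of that computation; the `HashSlotCheck` class name is my own, but the algorithm is the one documented in the Redis Cluster specification:

```java
import java.nio.charset.StandardCharsets;

public final class HashSlotCheck {

    // CRC16-XMODEM (polynomial 0x1021, initial value 0), as used by Redis Cluster.
    static int crc16(byte[] bytes) {
        int crc = 0;
        for (byte b : bytes) {
            crc ^= (b & 0xFF) << 8;
            for (int i = 0; i < 8; i++) {
                crc = ((crc & 0x8000) != 0) ? ((crc << 1) ^ 0x1021) : (crc << 1);
                crc &= 0xFFFF;
            }
        }
        return crc;
    }

    // If the key contains a non-empty {tag}, only the tag is hashed.
    static int slot(String key) {
        int open = key.indexOf('{');
        if (open >= 0) {
            int close = key.indexOf('}', open + 1);
            if (close > open + 1) {
                key = key.substring(open + 1, close);
            }
        }
        return crc16(key.getBytes(StandardCharsets.UTF_8)) % 16384;
    }

    public static void main(String[] args) {
        int caseSlot = slot("support:{tenant-acme}:case:case-123");
        int outboxSlot = slot("support:{tenant-acme}:outbox");
        // Both keys hash the same tag, so they always share a slot.
        System.out.println(caseSlot == outboxSlot);  // true
    }
}
```

Because both keys carry the tag tenant-acme, they map to the same slot no matter how many shards the cluster has.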
From there, downstream concerns consume the stream through their own consumer groups. Billing can issue the refund, notifications can contact the customer, and CRM sync can update external systems, all without forcing the support service to orchestrate them directly in the request path.
That flow looks like this:
The thing I like most about the Transactional Outbox pattern is that it keeps responsibilities clear. The support service is responsible for making the decision durable, and the rest of the platform is responsible for responding to it.
Trade-offs that are interesting to consider
The basic implementation of the pattern is simple. The design choices around it are where things get interesting. One of the first questions you must ask is where your source of truth lives. If the support case is also in Redis, the case update and the outbox append can share one transaction. If the case lives somewhere else and Redis only holds the stream, you are back in dual-write territory.
Another big choice is partitioning. It is tempting to imagine a single global outbox stream for the whole application, but that often becomes awkward in a clustered Redis setup. A per-tenant stream is often a better balance. It keeps related events together, provides useful ordering, and avoids making every transactional write depend on a single global key. It also makes querying and data retrieval a bit easier during investigation scenarios.
Consumer isolation is another trade-off worth saying out loud. One consumer group per downstream concern is a very nice model operationally, because billing, notifications, and CRM sync can all move at their own pace. The flip side is that you now own several background workflows, each with its own lag, retries, health, and recovery behavior to think about. This is where the world of microservices crosses paths with agentic systems again: an agent is not only code and resources; it also brings operational complexity that someone must own.
Retention matters too. An outbox is a log, and logs grow. If you trim too aggressively, you lose the replay window and the investigation history. If you never trim at all, the stream just keeps growing and eventually becomes an operational problem in its own right. Deciding how large the stream is allowed to grow must be a discussion that takes place before the app even goes to production. Not an afterthought.
Durability is another place where the architecture gets real fast. If the outbox carries important business decisions like refunds, escalations, or account changes, Redis is no longer "just a cache" in this design. It is part of the system's correctness model. You must treat Redis as a single source of truth, and as such, think carefully about how to handle details like replication, failover, and geographic disasters.
Finally, there is idempotency. The outbox makes the handoff reliable, but it does not magically give downstream effects exactly-once semantics in the business sense. If a worker crashes after reading but before acknowledging, another worker may retry the same event later. That means the side effect needs to be safe to run more than once. The usual instinct is to write the worker as a function that hooks into the stream, pulls the latest records, and processes them as if the data were simply mutable. It is not: each stream entry is an immutable fact, and the worker must tolerate seeing the same fact twice.
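Here is a minimal sketch of that idea, deduplicating by event ID. The class and method names are my own, and the in-memory set is a simplification: in production the processed-ID set would itself live in Redis (for example via `SET ... NX` with a TTL), but the structure is the same.

```java
import java.util.HashSet;
import java.util.Set;

public final class IdempotentRefundWorker {

    private final Set<String> processedEventIds = new HashSet<>();
    private int refundsIssued = 0;

    // Safe to call more than once with the same event: the side effect runs once.
    public void handle(String eventId, String refundId) {
        if (!processedEventIds.add(eventId)) {
            return;  // already processed; the event is an immutable, replayable fact
        }
        issueRefund(refundId);
    }

    private void issueRefund(String refundId) {
        refundsIssued++;  // stand-in for the real billing call
    }

    public int refundsIssued() {
        return refundsIssued;
    }

    public static void main(String[] args) {
        IdempotentRefundWorker worker = new IdempotentRefundWorker();
        worker.handle("evt-1", "refund-9");
        worker.handle("evt-1", "refund-9");  // redelivery after a crash-before-ack
        System.out.println(worker.refundsIssued());  // 1
    }
}
```

This is also why the outbox event in this post carries an event_id field: it doubles as the idempotency key for every downstream consumer.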
Okay, let's see some code
This post is not meant to be a complete implementation reference, but I know that, as a developer, looking at code makes the ideas concrete. I will keep the example intentionally small so the design principles stand out; I'm sure your coding agent can help with the final production code. I will use Java because it comes naturally to me, but feel free to ask your coding agent to translate it into another language.
Let's start with a small runtime helper class that instantiates Jedis, a Redis client for Java:
import redis.clients.jedis.HostAndPort;
import redis.clients.jedis.UnifiedJedis;

public final class RuntimeSupport {

    // Reads connection details from the environment, defaulting to a local Redis.
    public UnifiedJedis createJedisFromEnv() {
        String redisHost = System.getenv().getOrDefault("REDIS_HOST", "localhost");
        int redisPort = Integer.parseInt(System.getenv().getOrDefault("REDIS_PORT", "6379"));
        return new UnifiedJedis(new HostAndPort(redisHost, redisPort));
    }
}
Next, let's take a look at how keys and group naming are handled in a small constants class instead of scattering strings through the code.
public final class SupportConstants {

    public static final String STREAM_GROUP_START_ID = "0-0";
    public static final String BILLING_GROUP_NAME = "billing-cg";
    public static final String NOTIFICATIONS_GROUP_NAME = "notifications-cg";
    public static final String CRM_SYNC_GROUP_NAME = "crm-sync-cg";

    private SupportConstants() {}
}
For the Redis keys themselves, a small helper record keeps the slotting decision obvious:
public record SupportKeys(String caseKey, String outboxKey) {

    public static SupportKeys forCase(String tenantId, String caseId) {
        String hashTag = "{" + tenantId + "}";
        return new SupportKeys(
                "support:" + hashTag + ":case:" + caseId,
                "support:" + hashTag + ":outbox"
        );
    }
}
The core write path is where the architectural idea becomes real. When the support service accepts the agent's decision, it updates the case state and appends a RefundApproved event in a single Redis transaction.
import redis.clients.jedis.AbstractTransaction;
import redis.clients.jedis.Response;
import redis.clients.jedis.StreamEntryID;
import redis.clients.jedis.UnifiedJedis;

import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.UUID;

public final class RefundApprovalService {

    private final UnifiedJedis jedis;

    public RefundApprovalService(UnifiedJedis jedis) {
        this.jedis = Objects.requireNonNull(jedis, "jedis must not be null");
    }

    public RefundCommitted approveRefund(RefundDecision decision) {
        SupportKeys keys = SupportKeys.forCase(decision.tenantId(), decision.caseId());

        Map<String, String> caseFields = new LinkedHashMap<>();
        caseFields.put("case_id", decision.caseId());
        caseFields.put("customer_id", decision.customerId());
        caseFields.put("refund_id", decision.refundId());
        caseFields.put("status", "refund_approved");
        caseFields.put("decision_source", "support-agent");
        caseFields.put("updated_at", decision.decidedAt().toString());

        Map<String, String> outboxFields = new LinkedHashMap<>();
        outboxFields.put("event_id", decision.eventId());
        outboxFields.put("event_type", "RefundApproved");
        outboxFields.put("case_id", decision.caseId());
        outboxFields.put("customer_id", decision.customerId());
        outboxFields.put("refund_id", decision.refundId());
        outboxFields.put("decision_source", "support-agent");
        outboxFields.put("occurred_at", decision.decidedAt().toString());

        // The case update and the outbox append share one atomic MULTI/EXEC block.
        try (AbstractTransaction redisTx = jedis.multi()) {
            redisTx.hset(keys.caseKey(), caseFields);
            Response<StreamEntryID> streamEntryId =
                    redisTx.xadd(keys.outboxKey(), StreamEntryID.NEW_ENTRY, outboxFields);
            List<Object> execResults = redisTx.exec();
            if (execResults == null) {
                throw new IllegalStateException("Refund approval transaction aborted");
            }
            return new RefundCommitted(
                    decision.caseId(),
                    decision.eventId(),
                    streamEntryId.get().toString()
            );
        }
    }

    public record RefundDecision(
            String tenantId,
            String caseId,
            String customerId,
            String refundId,
            String eventId,
            Instant decidedAt
    ) {
        public static RefundDecision create(
                String tenantId,
                String caseId,
                String customerId,
                String refundId
        ) {
            return new RefundDecision(
                    tenantId,
                    caseId,
                    customerId,
                    refundId,
                    UUID.randomUUID().toString(),
                    Instant.now()
            );
        }
    }

    public record RefundCommitted(
            String caseId,
            String eventId,
            String streamEntryId
    ) {}
}
This one method is the whole architectural point made concrete. If the transaction does not complete, neither the case update nor the outbox event exists. If it does complete, both exist. That is the durability boundary that the rest of the workflow can rely on.
Here is the same moment as a diagram:
On the consumer side, the worker will act on the message written to the stream.
import redis.clients.jedis.StreamEntryID;
import redis.clients.jedis.UnifiedJedis;
import redis.clients.jedis.exceptions.JedisDataException;
import redis.clients.jedis.params.XReadGroupParams;
import redis.clients.jedis.resps.StreamEntry;

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Objects;

import static redis.clients.jedis.StreamEntryID.XREADGROUP_UNDELIVERED_ENTRY;

public final class BillingConsumer {

    private static final StreamEntryID PENDING_ID =
            new StreamEntryID(SupportConstants.STREAM_GROUP_START_ID);
    private static final StreamEntryID NEW_ENTRY_ID = XREADGROUP_UNDELIVERED_ENTRY;

    private final UnifiedJedis jedis;
    private final BillingGateway billingGateway;
    private final String consumerName;

    public BillingConsumer(
            UnifiedJedis jedis,
            BillingGateway billingGateway,
            String consumerName
    ) {
        this.jedis = Objects.requireNonNull(jedis, "jedis must not be null");
        this.billingGateway = Objects.requireNonNull(billingGateway, "billingGateway must not be null");
        this.consumerName = Objects.requireNonNull(consumerName, "consumerName must not be null");
    }

    public void run(String tenantId) throws InterruptedException {
        // Only the outbox key matters here; the caseId argument does not affect it.
        String outboxKey = SupportKeys.forCase(tenantId, "unused").outboxKey();
        createConsumerGroup(outboxKey);
        while (!Thread.currentThread().isInterrupted()) {
            // First drain entries delivered to this consumer but never acknowledged.
            List<StreamMessage> pendingEntries = readGroup(outboxKey, PENDING_ID, 10);
            if (!pendingEntries.isEmpty()) {
                processEntries(outboxKey, pendingEntries);
                continue;
            }
            List<StreamMessage> newEntries = readGroup(outboxKey, NEW_ENTRY_ID, 10);
            if (!newEntries.isEmpty()) {
                processEntries(outboxKey, newEntries);
            } else {
                Thread.sleep(200L);
            }
        }
    }

    private void createConsumerGroup(String outboxKey) {
        try {
            jedis.xgroupCreate(
                    outboxKey,
                    SupportConstants.BILLING_GROUP_NAME,
                    new StreamEntryID(SupportConstants.STREAM_GROUP_START_ID),
                    true
            );
        } catch (JedisDataException e) {
            // BUSYGROUP means the group already exists, which is fine on restart.
            if (!e.getMessage().contains("BUSYGROUP")) {
                throw e;
            }
        }
    }

    private List<StreamMessage> readGroup(String outboxKey, StreamEntryID streamEntryID, int count) {
        XReadGroupParams params = XReadGroupParams.xReadGroupParams().count(count);
        List<Map.Entry<String, List<StreamEntry>>> rawEntries = jedis.xreadGroup(
                SupportConstants.BILLING_GROUP_NAME,
                consumerName,
                params,
                Map.of(outboxKey, streamEntryID)
        );
        return parseEntries(rawEntries);
    }

    private void processEntries(String outboxKey, List<StreamMessage> entries) {
        for (StreamMessage entry : entries) {
            if (!"RefundApproved".equals(entry.fields().get("event_type"))) {
                // Not a billing concern; acknowledge and move on.
                jedis.xack(
                        outboxKey,
                        SupportConstants.BILLING_GROUP_NAME,
                        new StreamEntryID(entry.id())
                );
                continue;
            }
            billingGateway.issueRefund(
                    entry.fields().get("refund_id"),
                    entry.fields().get("customer_id"),
                    entry.fields().get("event_id")
            );
            jedis.xack(
                    outboxKey,
                    SupportConstants.BILLING_GROUP_NAME,
                    new StreamEntryID(entry.id())
            );
        }
    }

    private static List<StreamMessage> parseEntries(List<Map.Entry<String, List<StreamEntry>>> rawEntries) {
        if (rawEntries == null || rawEntries.isEmpty()) {
            return Collections.emptyList();
        }
        List<StreamMessage> entries = new ArrayList<>();
        for (Map.Entry<String, List<StreamEntry>> streamData : rawEntries) {
            for (StreamEntry streamEntry : streamData.getValue()) {
                entries.add(new StreamMessage(
                        streamEntry.getID().toString(),
                        streamEntry.getFields()
                ));
            }
        }
        return entries;
    }

    private record StreamMessage(String id, Map<String, String> fields) {}

    public interface BillingGateway {
        void issueRefund(String refundId, String customerId, String idempotencyKey);
    }
}
Closing
What I like most about the Transactional Outbox pattern is that it respects the actual shape of agentic systems. Agents are good at deciding what should happen next given a flow, but the platform is still responsible for turning that decision into a durable state and letting the rest of the workflow react safely. The pattern gives you a clean handoff for that.
Redis Streams make it practical when your application state and the outbox both live in Redis. That does not make the design free of trade-offs: you still need to think about partitioning, retention, durability, lag, and idempotency. What it does remove is the dual-write problem, giving you a system where an agent's decision becomes a durable fact before the rest of the platform starts depending on it.
Applied well, the Transactional Outbox pattern can be the difference between an agent that looks clever in a demo and a system you can actually trust.