Pedro Santos

Posted on May 27

RAG That Improves Over Time

#agents #ai #devops #rag

RAG That Improves Over Time: The Flywheel Effect

In the previous post, I showed how similarity search transforms failure diagnoses. The agent finds past incidents with similar patterns and uses them to make better recommendations. But that was a snapshot. The real power of this setup shows up over time.

Every new saga event makes the system smarter. This post covers the flywheel: how three agents share the same vector store, how diagnoses improve as data accumulates, and the practical limits I've hit.

The Flywheel

Three agents use the same pgvector embedding store, each in a different way.

The OperationsAgent is a writer and a reader. It vectorizes every saga event (writes). When a failure occurs, it searches for similar past incidents (reads). Each failure it diagnoses today becomes context for diagnosing future failures.

The SagaComposerAgent is a reader. It searches for historical failure patterns per customer profile. If new:high-value customers have a 30% fraud block rate, the composer adds a Fraud Validation step to their saga plan. This search uses the same embeddings the OperationsAgent wrote.

The DataAnalystAgent doesn't use RAG directly (it uses MCP tools instead). But it feeds the SagaComposerAgent with current metrics. The composer combines MCP metrics with RAG patterns to decide the optimal saga order.

Here's the cycle:

Saga event → OperationsAgent vectorizes it
    ↓
OperationsAgent searches similar past events → better diagnoses
    ↓
SagaComposerAgent searches failure patterns → smarter saga plans
    ↓
Orchestrator uses smarter plans → fewer failures
    ↓
Fewer failures → different patterns in the vector store
    ↓
(repeat)

As the vector store grows, the quality of both diagnoses and saga plans improves.
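To make the write-then-read loop concrete, here is a minimal in-memory sketch in plain Java. It is not the project's code: the real store is pgvector behind LangChain4j, and `SharedStoreSketch`, `write`, and `findSimilar` are names invented for illustration. The three-number vectors stand in for real embeddings, and the score is raw cosine similarity, which may differ from how your store reports scores.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical in-memory stand-in for the shared vector store (the real one is pgvector).
class SharedStoreSketch {
    record Entry(String text, double[] vector) {}
    record Match(String text, double score) {}

    private final List<Entry> entries = new ArrayList<>();

    // Writer side: what the OperationsAgent does for every saga event.
    void write(String text, double[] vector) {
        entries.add(new Entry(text, vector));
    }

    // Reader side: any agent can query the same entries.
    List<Match> findSimilar(double[] query, int maxResults, double minScore) {
        return entries.stream()
            .map(e -> new Match(e.text(), cosine(query, e.vector())))
            .filter(m -> m.score() >= minScore)
            .sorted(Comparator.comparingDouble(Match::score).reversed())
            .limit(maxResults)
            .toList();
    }

    // Plain cosine similarity between two vectors of equal length.
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        var store = new SharedStoreSketch();
        double[] query = {0.9, 0.1, 0.0}; // stand-in embedding of a new payment failure

        // Empty store: the reader finds nothing to match against.
        System.out.println(store.findSimilar(query, 3, 0.75).size());

        store.write("payment failed: insufficient funds", new double[] {1.0, 0.0, 0.0});
        store.write("inventory reservation timed out", new double[] {0.0, 0.0, 1.0});

        // After the writes, the payment incident is close enough to be returned.
        System.out.println(store.findSimilar(query, 3, 0.75));
    }
}
```

The point of the sketch is only that the reader's results depend entirely on what the writer has stored so far, which is the mechanism behind the flywheel.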

How Diagnoses Improve

I tracked the OperationsAgent's output over three phases.

Week 1 (< 50 events). Most failures get "No similar incidents found." The agent produces generic diagnoses based only on the current event. Still useful, but not much better than a well-structured log.

Week 2 (50-200 events). RAG starts finding matches. The agent identifies that payment failures for new customers cluster around evening hours. It recommends time-based credit limits. The SagaComposerAgent picks up on this pattern and moves Fraud Validation earlier in the plan for new:high-value profiles.

Month 2 (500+ events). The agent distinguishes between different failure subtypes within the same service. "Payment failed due to insufficient funds" gets matched with similar cases and the agent notes "73% of these orders retry successfully within 2 hours." Meanwhile, "payment blocked by fraud" gets matched with fraud cases and the agent notes "92% of blocks for this profile are false positives during daytime."

The improvements are not magic. They come from having more data points for vector search. With 10 similar incidents, the agent sees a pattern. With 100, it sees the exceptions.

How the SagaComposerAgent Uses RAG

The composer searches for failure patterns per profile:

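private String findHistoricalPatterns(String profileKey) {
    // Embed a natural-language query describing the profile
    var embedding = embeddingModel.embed(
        "saga failure patterns for profile: " + profileKey).content();

    // Up to 5 matches, with a looser threshold than the OperationsAgent uses
    var results = embeddingStore.search(
        EmbeddingSearchRequest.builder()
            .queryEmbedding(embedding)
            .maxResults(5)
            .minScore(0.70)
            .build());

    // Fallback: tell the LLM there is no history instead of failing
    if (results.matches().isEmpty()) {
        return "No historical patterns found for this profile yet.";
    }

    // Join the matched event texts into one context block for the prompt
    return results.matches().stream()
        .map(m -> m.embedded().text())
        .collect(Collectors.joining("\n---\n"));
}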

Notice the lower minScore compared to the OperationsAgent (0.70 vs 0.75). The composer benefits from broader context. It doesn't need exact matches. It needs patterns: "new customers tend to fail at payment" is enough to justify reordering the saga.
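How much a 0.05 difference matters depends on how the store turns similarity into a score. As far as I understand it, LangChain4j reports a relevance score in [0, 1] derived from cosine similarity as (cosine + 1) / 2. Treat that mapping as an assumption and check it against your store. Under that assumption, this sketch counts how many candidates pass each threshold. The class name and the sample cosine values are made up.

```java
import java.util.List;

class MinScoreSketch {
    // Assumed mapping from cosine similarity [-1, 1] to a relevance score [0, 1].
    static double relevance(double cosine) {
        return (cosine + 1.0) / 2.0;
    }

    // Counts how many candidates would survive a given minScore.
    static long passing(List<Double> cosines, double minScore) {
        return cosines.stream()
            .filter(c -> relevance(c) >= minScore)
            .count();
    }

    public static void main(String[] args) {
        List<Double> cosines = List.of(0.62, 0.55, 0.45, 0.30); // made-up values
        System.out.println("composer   (0.70): " + passing(cosines, 0.70));
        System.out.println("operations (0.75): " + passing(cosines, 0.75));
    }
}
```

If the mapping holds, a minScore of 0.70 corresponds to a cosine similarity of 0.40 and 0.75 to 0.50, so both thresholds are fairly permissive and the composer's is the more permissive of the two.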

The results get combined with real-time metrics from the DataAnalystAgent:

String metrics     = queryMetrics(dataAnalystAgent);
String stockAlerts = queryStockAlerts(dataAnalystAgent);
String ragContext  = findHistoricalPatterns(profile);

String prompt = """
    ORDER PROFILE: %s

    CURRENT SYSTEM METRICS (via MCP):
    %s

    CRITICAL STOCK ALERTS (via MCP):
    %s

    HISTORICAL FAILURE PATTERNS FOR THIS PROFILE (RAG):
    %s

    Compose the optimal saga plan for this profile.
    """.formatted(profile, metrics, stockAlerts, ragContext);

MCP provides the current state. RAG provides the historical patterns. The LLM weighs both to decide the saga order.

Practical Limits

After running this for a while, I hit some practical issues.

Vector store growth. Each event adds a row with a 768-dimensional vector. At 1000 events/day, that's about 10MB/day in pgvector. Without an index, pgvector answers each search with an exact scan over every row, which PostgreSQL handles fine up to hundreds of thousands of rows. Beyond that, you'd want to add an approximate index (pgvector supports IVFFlat and HNSW) or archive old events.
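A quick back-of-the-envelope check on that growth, assuming pgvector's documented storage cost of 4 bytes per dimension plus 8 bytes per vector. The numbers cover the vector column only. The stored text, metadata, row overhead, and any index come on top, so the real per-day figure is higher. `VectorStorageEstimate` is a made-up name.

```java
class VectorStorageEstimate {
    // pgvector stores single-precision floats: 4 bytes per dimension plus 8 bytes per vector.
    static long bytesPerVector(int dimensions) {
        return 4L * dimensions + 8;
    }

    // Raw vector bytes written per day, ignoring text, metadata, and indexes.
    static long vectorBytesPerDay(int dimensions, long eventsPerDay) {
        return bytesPerVector(dimensions) * eventsPerDay;
    }

    public static void main(String[] args) {
        long perVector = bytesPerVector(768);
        long perDay = vectorBytesPerDay(768, 1_000);
        System.out.printf("per vector: %d bytes%n", perVector);
        System.out.printf("per day:    %.2f MB (vectors only)%n", perDay / 1_000_000.0);
        System.out.printf("per year:   %.2f GB (vectors only)%n", perDay * 365 / 1_000_000_000.0);
    }
}
```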

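Embedding drift. The nomic-embed-text model is static, so the vectors themselves never drift. What changes is the failure patterns. A fix in the payment service might eliminate a whole category of failures, yet the old vectors for those failures stay in the store and keep matching. I haven't implemented cleanup yet, but the plan is to add a TTL (delete embeddings older than 90 days) or re-embed periodically. Re-embedding only helps if the embedding model or the event text format changes. With the same model and text it reproduces the same vectors, so the TTL is what actually removes stale patterns.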
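The TTL idea is simple enough to sketch. This is an in-memory illustration, not something I have running: `EmbeddingTtlSketch`, `StoredEvent`, and `pruneOlderThan` are hypothetical names, and it assumes each stored event carries a creation timestamp.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

class EmbeddingTtlSketch {
    // Hypothetical stored row: the event text plus when it was embedded.
    record StoredEvent(String text, Instant createdAt) {}

    // Removes events older than the TTL and returns how many were dropped.
    static int pruneOlderThan(List<StoredEvent> events, Duration ttl, Instant now) {
        Instant cutoff = now.minus(ttl);
        int before = events.size();
        events.removeIf(e -> e.createdAt().isBefore(cutoff));
        return before - events.size();
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2025-06-01T00:00:00Z");
        List<StoredEvent> events = new ArrayList<>(List.of(
            new StoredEvent("payment timeout (fixed since)", now.minus(Duration.ofDays(120))),
            new StoredEvent("fraud block, new:high-value", now.minus(Duration.ofDays(10)))));

        int removed = pruneOlderThan(events, Duration.ofDays(90), now);
        System.out.println("removed " + removed + ", kept " + events.size());
    }
}
```

Against pgvector the equivalent would presumably be a scheduled DELETE filtered on a creation timestamp, assuming the table or its metadata stores one.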

Prompt size. Each similar incident adds 200-500 tokens to the prompt. With 5 matches, that's up to 2500 tokens of RAG context. Add the current failure, the system prompt, and the formatting instructions, and you're at 4000+ tokens of input. This fits comfortably in Gemini's 1M token context, but it still adds to the cost of every call. I cap at 3 matches for the OperationsAgent and 5 for the SagaComposerAgent.
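A cap on match count can be combined with a rough token budget. The sketch below is one way to do that, not what the project does today: the 4-characters-per-token estimate is a crude assumption rather than a real tokenizer, and `RagBudgetSketch` and `capContext` are illustrative names.

```java
import java.util.ArrayList;
import java.util.List;

class RagBudgetSketch {
    // Crude heuristic (an assumption, not a tokenizer): about 4 characters per token.
    static int estimateTokens(String text) {
        return (text.length() + 3) / 4;
    }

    // Keeps matches in ranked order until the match cap or the token budget is reached.
    static List<String> capContext(List<String> rankedMatches, int maxMatches, int tokenBudget) {
        List<String> kept = new ArrayList<>();
        int used = 0;
        for (String match : rankedMatches) {
            int cost = estimateTokens(match);
            if (kept.size() == maxMatches || used + cost > tokenBudget) {
                break;
            }
            kept.add(match);
            used += cost;
        }
        return kept;
    }

    public static void main(String[] args) {
        // Four fake incidents of roughly 500, 300, 400, and 200 tokens.
        List<String> matches = List.of(
            "a".repeat(2000), "b".repeat(1200), "c".repeat(1600), "d".repeat(800));

        // Cap of 3 matches with a generous budget: the match cap is what stops it.
        System.out.println(capContext(matches, 3, 1500).size());
        // Same cap with a tight budget: the token budget stops it first.
        System.out.println(capContext(matches, 3, 700).size());
    }
}
```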

False patterns. With small datasets (< 50 events), the vector search sometimes finds "matches" that aren't real patterns. A payment failure for a VIP customer matches a payment failure for a new customer just because both have similar error messages. The diagnosis suggests profile-specific fixes when the issue is actually global. Higher minScore thresholds mitigate this, but don't eliminate it.

The Additive Principle

Every AI component in this system has a fallback. If the vector store is empty, the OperationsAgent still produces a diagnosis based on the current event alone. If the SagaComposerAgent can't find patterns, it falls back to the default saga order. If Ollama is down, events still flow through the saga normally. They just don't get vectorized.

The AI layer is an improvement, not a dependency. The system worked before the agents existed. It works better with them.

if (results.matches().isEmpty())
    return "No similar incidents found in history.";

This one check is the entire fallback strategy for RAG. No similar incidents? Tell the LLM that. It still produces useful output, just less targeted.
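The Ollama case follows the same idea: treat vectorization as best-effort. Here is a minimal sketch under the assumption that the embedding call surfaces an outage as a runtime exception. `AdditiveFallbackSketch`, `tryEmbed`, and `handleEvent` are hypothetical names, and the lambdas stand in for the real embedding client.

```java
import java.util.Optional;
import java.util.function.Function;

class AdditiveFallbackSketch {
    // Tries to embed; on any runtime failure the event simply goes unvectorized.
    static Optional<float[]> tryEmbed(Function<String, float[]> embedder, String eventText) {
        try {
            return Optional.ofNullable(embedder.apply(eventText));
        } catch (RuntimeException e) {
            return Optional.empty();
        }
    }

    // The saga event is always processed; only the vectorization part is optional.
    static String handleEvent(Function<String, float[]> embedder, String eventText) {
        boolean vectorized = tryEmbed(embedder, eventText).isPresent();
        return vectorized ? "processed+vectorized" : "processed";
    }

    public static void main(String[] args) {
        Function<String, float[]> healthy = text -> new float[] {0.1f, 0.2f};
        Function<String, float[]> down = text -> {
            throw new IllegalStateException("connection refused");
        };

        System.out.println(handleEvent(healthy, "PAYMENT_FAILED"));
        System.out.println(handleEvent(down, "PAYMENT_FAILED"));
    }
}
```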

Measuring the Impact

How do you know if RAG is actually helping? I track three things.

RAG hit rate. Percentage of diagnoses that include similar incidents. Started at 0% (empty store), now consistently above 80%. If this drops, either the failure patterns changed or the minScore threshold is too high.

Diagnosis specificity. Are the recommendations generic ("review your payment limits") or specific ("adjust the R$500 threshold for new:high-value customers during evening hours")? This is subjective but noticeable. After 200+ events, the diagnoses are consistently specific.

Saga plan effectiveness. After the SagaComposerAgent started reordering steps, the overall failure rate per saga dropped. Fraud checks moved earlier for high-risk profiles. Inventory checks moved before payment when stock was low. Fewer wasted payment calls.
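Of the three, the hit rate is the easiest to compute mechanically. A minimal sketch, assuming you log how many similar incidents each diagnosis retrieved. `RagHitRateSketch` is an illustrative name and the counts are made up.

```java
import java.util.List;

class RagHitRateSketch {
    // One entry per diagnosis: how many similar incidents the search returned.
    static double hitRate(List<Integer> matchCounts) {
        if (matchCounts.isEmpty()) {
            return 0.0; // empty history: nothing diagnosed yet
        }
        long hits = matchCounts.stream().filter(n -> n > 0).count();
        return 100.0 * hits / matchCounts.size();
    }

    public static void main(String[] args) {
        List<Integer> lastTen = List.of(3, 0, 2, 3, 1, 3, 0, 3, 2, 3); // made-up counts
        System.out.printf("RAG hit rate: %.0f%%%n", hitRate(lastTen));
    }
}
```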

Wrapping Up

RAG in this project isn't about answering questions from documents. It's about building institutional memory into a distributed system. Every failure teaches the system something. Every diagnosis makes the next one better.

The stack is simple: Ollama for embeddings, pgvector for storage, LangChain4j for glue. No exotic infrastructure. No external services. Runs on your laptop.

The repo: github.com/pedrop3/saga-orchestration

Top comments (1)

RAGPrep • Jun 1

The "improves over time" framing flips the usual RAG conversation in a useful way. Most teams accept that RAG quality degrades over time as documents drift, chunks go stale, and the corpus accumulates noise. Building deliberately for improvement instead of just managing decay is the right inversion.

The hard part with feedback loops is the same hard part with every RAG system: the signal you're learning from has to be cleaner than the noise it carries. A few practical considerations from running similar patterns:

User feedback is biased by what was retrieved. If your retriever consistently pulls weaker chunks for a class of queries, users will rate those answers down, but you can't tell whether they're rating the chunk quality, the generation quality, or the lack of an answer that wasn't in the corpus at all. Cleanest fix: separate the feedback signal into "the answer was wrong" vs "the answer wasn't in the documents." Two different problems, two different fixes.

Chunk-level feedback beats answer-level feedback. If you can trace which specific chunks contributed to a flagged answer, the feedback becomes actionable at the data layer. "This chunk contributed to 12 downvoted answers in the last week" is a chunk-quality signal you can act on. "This answer was bad" leaves you guessing whether to fix retrieval, generation, or the underlying data.

Improvement compounds only if you prune. Adding new high-quality chunks helps. Removing the bad ones helps more. Most teams only do the first, which means the corpus grows but the signal-to-noise ratio stays flat or degrades. Periodic chunk audits (running quality scores over the existing vector DB and removing or re-embedding the worst performers) are the unglamorous step that makes the feedback loop actually move quality forward.

Worth noting: I'm writing a piece this week on the inverse failure, when RAG quality silently degrades because no one's watching the data layer. Different angle on the same underlying truth: the corpus is alive whether you treat it that way or not.