Similarity Search for Failure Diagnosis
In the previous post, I showed how every saga event gets vectorized into pgvector. Now let's use that data. When a saga fails, the OperationsAgent searches for similar past incidents and uses them to diagnose the current failure.
The Search
The search takes the current failure's text representation, embeds it, and finds the closest matches in pgvector:
private String findSimilarIncidents(String historyText) {
    var queryEmbedding = embeddingModel.embed(historyText).content();
    var results = embeddingStore.search(
            EmbeddingSearchRequest.builder()
                    .queryEmbedding(queryEmbedding)
                    .maxResults(3)
                    .minScore(0.75)
                    .build());
    if (results.matches().isEmpty())
        return "No similar incidents found in history.";
    return results.matches().stream()
            .map(m -> "--- Similar incident (score=" +
                    String.format("%.2f", m.score()) + ") ---\n"
                    + m.embedded().text())
            .collect(Collectors.joining("\n\n"));
}
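To make the selection logic concrete, here is a minimal plain-Java sketch of what the search request asks the store to do: drop matches below the minimum score, order the rest best-first, and keep at most `maxResults`. `ScoredIncident` and `IncidentSearchSketch` are hypothetical names for illustration, not LangChain4j types, and I'm assuming the score cutoff is inclusive. It needs Java 16+ for records.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Hypothetical stand-in for a scored match; not a LangChain4j type.
record ScoredIncident(double score, String text) {}

final class IncidentSearchSketch {

    // Keep matches scoring at least minScore (assumed inclusive),
    // best first, capped at maxResults.
    static List<ScoredIncident> topMatches(List<ScoredIncident> candidates,
                                           int maxResults, double minScore) {
        return candidates.stream()
                .filter(c -> c.score() >= minScore)
                .sorted(Comparator.comparingDouble(ScoredIncident::score).reversed())
                .limit(maxResults)
                .collect(Collectors.toList());
    }

    // Same layout as findSimilarIncidents: one labelled block per match.
    // Locale.ROOT keeps the decimal separator a dot regardless of the JVM locale.
    static String format(List<ScoredIncident> matches) {
        if (matches.isEmpty()) {
            return "No similar incidents found in history.";
        }
        return matches.stream()
                .map(m -> "--- Similar incident (score="
                        + String.format(Locale.ROOT, "%.2f", m.score())
                        + ") ---\n" + m.text())
                .collect(Collectors.joining("\n\n"));
    }
}
```

In the real code the filtering and ordering happen inside the embedding store, so this is only a model of the behaviour, not a replacement for it.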
Two parameters control the quality of results.
maxResults(3) limits the context size. I tested with 1, 3, 5, and 10. Three gives the best results for diagnosis. With 1, the LLM doesn't have enough context to spot patterns. With 5+, the prompt gets long and the LLM starts summarizing instead of analyzing.
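minScore(0.75) filters out weak matches. The score here is a relevance score between 0 and 1: raw cosine similarity ranges from -1 to 1, and LangChain4j's pgvector store rescales it into that range. Here's what I found through testing: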
| Score Range | What It Means | How I Treat It |
|---|---|---|
| 0.90+ | Almost identical failure | Very strong match, same root cause |
| 0.80-0.89 | Similar failure pattern | Good match, likely related |
| 0.75-0.79 | Loosely related | Worth including as context |
| Below 0.75 | Different type of failure | Noise, exclude |
0.75 was the sweet spot for my data. Lower than that and the matches are unrelated failures that confuse the diagnosis.
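These numbers are easier to interpret once you know how the score relates to raw cosine similarity. The sketch below assumes the score is LangChain4j's relevance score, which as far as I can tell is a linear rescale of cosine similarity from [-1, 1] to [0, 1]. The class and method names are hypothetical, and it's worth confirming the mapping against the version you run.

```java
final class RelevanceScoreSketch {

    // Plain cosine similarity between two equal-length vectors; result is in [-1, 1].
    static double cosineSimilarity(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Assumed mapping: linear rescale from [-1, 1] to [0, 1].
    static double toRelevanceScore(double cosine) {
        return (cosine + 1.0) / 2.0;
    }

    // Inverse of the assumed mapping, handy for reading a threshold as a raw cosine.
    static double toCosine(double relevanceScore) {
        return 2.0 * relevanceScore - 1.0;
    }
}
```

Under that assumption, a minScore of 0.75 corresponds to a raw cosine similarity of 0.5, which is worth keeping in mind when comparing thresholds with tools that report cosine similarity or cosine distance directly.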
Building the Diagnosis Prompt
The search results get injected into a prompt alongside the current failure:
private void diagnose(Event event, String historyText) {
    String ragContext = findSimilarIncidents(historyText);
    // totalAmount comes from the order data; how it is obtained
    // is omitted from this excerpt.
    String prompt = """
            SAGA FAILED, DIAGNOSE
            OrderId: %s | TransactionId: %s
            Final status: %s | Total amount: R$ %.2f
            SAGA HISTORY:
            %s
            SIMILAR INCIDENTS (RAG):
            %s
            """.formatted(
            event.getOrderId(),
            event.getTransactionId(),
            event.getStatus(),
            totalAmount,
            historyText,
            ragContext);
    String diagnosis = operationsAgent.analyze(prompt);
}
The prompt has two sections the LLM can work with. SAGA HISTORY is the current failure. SIMILAR INCIDENTS is the RAG context. The LLM compares them and identifies patterns.
The Agent That Uses RAG
The OperationsAgent doesn't use MCP tools. It doesn't need to query any service. All the data arrives pre-assembled in the prompt. Its only job is to reason:
public interface OperationsAgent {
    @SystemMessage("""
            You are a failure diagnosis specialist for distributed sagas.
            You receive the full history of a FAIL saga and similar past incidents.
            Required format:
            ROOT CAUSE: <service and reason>
            AFFECTED SERVICES: <list>
            FINANCIAL IMPACT: <based on totalAmount>
            HISTORICAL PATTERN: <if RAG found similar cases>
            RECOMMENDATION: <corrective action>
            Rules:
            1. Only use the provided context, never invent data.
            2. If no similar incidents found, say so.
            3. Be concise, consumed by a monitoring system.
            """)
    String analyze(@UserMessage String context);
}
The structured output format matters. The diagnosis goes into a database table and is exposed through a REST endpoint. Consistent labels make it parseable.
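To show what "parseable" can mean in practice, here is a minimal sketch that splits a diagnosis into its five labelled fields. `DiagnosisParserSketch` is a hypothetical helper and not part of the repo; it assumes each label starts a line and that wrapped lines belong to the most recent label.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DiagnosisParserSketch {

    // Labels taken from the system prompt's required format.
    private static final List<String> LABELS = List.of(
            "ROOT CAUSE", "AFFECTED SERVICES", "FINANCIAL IMPACT",
            "HISTORICAL PATTERN", "RECOMMENDATION");

    // Returns label -> value in the order the labels appear in the text.
    static Map<String, String> parse(String diagnosis) {
        Map<String, String> fields = new LinkedHashMap<>();
        String current = null;
        for (String line : diagnosis.split("\\R")) {
            String trimmed = line.strip();
            String label = matchLabel(trimmed);
            if (label != null) {
                current = label;
                // Skip the label and its colon; keep the rest as the value.
                fields.put(label, trimmed.substring(label.length() + 1).strip());
            } else if (current != null && !trimmed.isEmpty()) {
                // Continuation line: append to the field currently being read.
                fields.merge(current, trimmed,
                        (a, b) -> a.isEmpty() ? b : a + " " + b);
            }
        }
        return fields;
    }

    private static String matchLabel(String line) {
        for (String label : LABELS) {
            if (line.startsWith(label + ":")) {
                return label;
            }
        }
        return null;
    }
}
```

Only the first colon after a known label is treated as a separator, so a value such as "PAYMENT_SERVICE: New customer limit exceeded" stays intact. An LLM can still drift from the format, so a missing label in the returned map is a cheap signal that a diagnosis needs a second look.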
Real Diagnosis Examples
Here's what the agent produces with and without RAG context.
Without RAG (First Failure of This Type)
ROOT CAUSE: PAYMENT_SERVICE: New customer limit exceeded (R$450.00 > R$500.00)
AFFECTED SERVICES: PAYMENT_SERVICE, PRODUCT_VALIDATION_SERVICE (rolled back)
FINANCIAL IMPACT: R$450.00: payment was never processed, no financial loss
HISTORICAL PATTERN: No similar incidents found in history.
RECOMMENDATION: Review credit limit policy for new customers or implement
progressive limits based on order history.
Useful but generic. The agent can only analyze what's in front of it.
With RAG (After 50+ Similar Failures)
ROOT CAUSE: PAYMENT_SERVICE: New customer limit exceeded (R$450.00 > R$500.00)
AFFECTED SERVICES: PAYMENT_SERVICE, PRODUCT_VALIDATION_SERVICE (rolled back)
FINANCIAL IMPACT: R$450.00: payment was never processed, no financial loss
HISTORICAL PATTERN: Found 3 similar incidents (scores: 0.92, 0.88, 0.84).
All involve new customers with orders between R$300-R$500. 78% occur during
evening hours (18h-22h). Profile "new:high-value" accounts for 90% of these
failures.
RECOMMENDATION: Adjust credit limit threshold for "new:high-value" profile
from R$500 to R$600, or implement a stepped verification instead of hard block.
Evening orders from new customers should trigger additional verification
rather than outright rejection.
The RAG context transforms the diagnosis. The agent identifies the pattern (evening hours, specific amount range, specific profile) and gives a targeted recommendation.
Storing the Diagnosis
Every diagnosis gets saved to a relational table:
diagnosticRepository.save(SagaDiagnostic.builder()
        .orderId(event.getOrderId())
        .transactionId(event.getTransactionId())
        .diagnosis(diagnosis)
        .createdAt(LocalDateTime.now())
        .build());
The table is queryable via a REST endpoint:
@GetMapping("/diagnostics")
public List<SagaDiagnostic> getAllDiagnostics() {
    return diagnosticRepository.findAllByOrderByCreatedAtDesc();
}
This gives the operations team a feed of auto-generated failure analyses. They don't need to look at logs or manually correlate events across services.
Tuning the Similarity Threshold
The minScore threshold is the most important tuning parameter. Too high and you miss relevant context. Too low and you pollute the prompt with noise.
I started at 0.80 and found that the agent was often getting "No similar incidents found" even when related failures existed. Dropping to 0.75 fixed this. The matches at 0.75-0.79 are loose but still relevant enough to improve the diagnosis.
If your data is very homogeneous (similar events with small variations), you might need a higher threshold like 0.85. If your data is diverse (different failure types, different services), 0.70 might work better.
There's no universal answer. Run 10-20 failures through the system, check the matches manually, and adjust until the RAG context is consistently helpful.
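The manual check can be made a little more systematic. A minimal sketch, assuming you have gone through the matches for a batch of failures and marked each one as relevant or not: for each candidate threshold, count how many relevant matches survive and how much noise gets through. All names here are hypothetical and the scores in `main` are made up for illustration, not measurements from my system.

```java
import java.util.List;

// A match you reviewed by hand: its score and whether it was actually relevant.
record LabelledMatch(double score, boolean relevant) {}

// What a given threshold lets through.
record ThresholdStats(double threshold, int relevantKept, int noiseKept) {}

final class ThresholdSweepSketch {

    // Counts relevant and irrelevant matches at or above the threshold.
    static ThresholdStats evaluate(List<LabelledMatch> matches, double threshold) {
        int relevant = 0;
        int noise = 0;
        for (LabelledMatch m : matches) {
            if (m.score() >= threshold) {
                if (m.relevant()) {
                    relevant++;
                } else {
                    noise++;
                }
            }
        }
        return new ThresholdStats(threshold, relevant, noise);
    }

    public static void main(String[] args) {
        // Made-up review data, for illustration only.
        List<LabelledMatch> reviewed = List.of(
                new LabelledMatch(0.92, true),
                new LabelledMatch(0.84, true),
                new LabelledMatch(0.78, false),
                new LabelledMatch(0.77, true),
                new LabelledMatch(0.73, false),
                new LabelledMatch(0.71, false));
        for (double t : new double[] {0.70, 0.75, 0.80, 0.85}) {
            System.out.println(evaluate(reviewed, t));
        }
    }
}
```

The threshold to pick is the one that keeps most relevant matches while letting in as little noise as you can tolerate in the prompt.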
What's Next
The system works for point-in-time diagnosis. But the real value shows up over time. In the next post, I'll cover the flywheel effect: how each new saga event improves future diagnoses, how the SagaComposerAgent uses the same vector store to optimize saga execution order, and what happens when you have thousands of historical incidents.
The repo: github.com/pedrop3/saga-orchestration