Slava

Posted on Jun 30 • Originally published at Medium

Predicting agent failures from trajectory shape: trajectory-pattern retrieval outperforms basic RAG

#rag #llm #agents #llmops

A trajectory-pattern retrieval engine reaches AUC 0.71 (95% CI [0.61, 0.78]) for per-step failure prediction on held-out coding-agent trajectories - and, notably, eval > tune. A tuned text-embedding (cosine-KNN) baseline over the same data lands at chance: every configuration and aggregation we tried has a confidence interval that includes 0.5.

Growing databases of agent traces that companies collect (or should collect, if they haven't started yet) present an opportunity to analyse or monitor agents, since their efficiency, safety and security shape net economic impact. For practical purposes, different layers occupy different points on the cost (operational or inference), latency and accuracy spectrum. The LLM-as-a-judge approach clearly occupies the expensive, slow, high-accuracy slot. Tools that rely on statistical instruments or deterministic signals, on the other hand, are expected to provide a fast and cheap gateway to LLM-based evaluation.

This leads us to a question: what tools are currently available to extract signals about an agent's trajectory, given a database of related runs? We tested basic RAG and an alternative retrieval method on agent failure prediction: take a snapshot of a trajectory at some turn, find its N closest neighbouring snapshots (excluding the same trajectory), and use the neighbours' failure rate to make a prediction.

Failure prediction was chosen with a few points in mind:

it is an objective, not opinion-based, characteristic of an agent trajectory, and doesn't require an LLM judge
it is a global feature, which makes for a hard baseline (local properties are expected to be easier)
it has practical value for analytics, token efficiency or runtime monitoring

The benchmark

Dataset
275 trajectories from SWE-rebench-OpenHands-Trajectories, filtered to the tobymao/sqlglot repository (178 failures, 97 successes). At the preprocessing stage, trajectories were filtered so that each one comes from a unique pull request. Any same-PR pair carries near-identical semantic content and introduces data leakage; leaving same-PR samples in the dataset bumps eval AUC up by ~20–30 points.
The split is built by randomly ordering instance-id-sorted trajectories, then proportionally interleaving by status - a 100-trajectory tune slice and a 175-trajectory held-out eval slice, both landing at ~65% failure rate (the population mean). The split is used in hyperparameter selection, not in retrieval: eval queries see the full corpus. Reported confidence intervals are per-trajectory bootstrap, 2000 draws.

Metric: per-step stratified AUC from step 50
The primary metric is per-step stratified weighted ROC AUC from step 50 onwards - the trajectory-step analogue of the well-established time-stratified evaluation used in survival analysis and dynamic prediction (e.g. time-dependent ROC), where each subject is scored at matched times to remove confounding from horizon length.

The final score is a weighted mean of per-step AUCs. For every stepS ≥ 50:

Take the trajectories still alive at S (those whose last snapshot is at step ≥ S); call this count n_S.
Compute a score for each of them on its [0…S] snapshot slice.
Compute ROC AUC across that active set - call it AUC_S.

The per-step AUCs are then combined into a single weighted average.

Step 50 was chosen because in the studied dataset ~95% of trajectories are still alive at step 50. Weighting is done so that the early steps - where most runs are still alive - carry the most weight.

This was chosen over a single global AUC for a reason: failed trajectories are systematically longer than successful ones. This property seems highly regular in the agentic domain, rather than an anomaly of this particular dataset. A global AUC therefore leaks length into the score: any retriever that relies solely on final trajectory length gets around 0.68 AUC (global), while at step stratified setup should converge to 0.5. Global AUC also has little practical value: the usefulness of post-run classification is unclear, whereas the per-step stratified version emulates real-time monitoring.

Participants

Trajectory-pattern retrieval KNN
Basic RAG (cosine-KNN)

What trajectory-pattern retrieval is: use an embedder model to embed each message into 1024 dimensions, cluster all messages as actions and observations, build a compressed vocabulary of typed action-observation tokens (~50–60 tokens), then represent each run as a sequence of these tokens. More specifics are described below.

What basic RAG is precisely: each prefix is reduced to one vector (swept over 1024 or 2048 dimensions) by averaging its last N messages (N swept across the interval {1…20}, plus an all-messages special case); retrieval is top-k (k over {1…50}) nearest neighbours from distinct trajectories by cosine similarity; the score is the failure rate of the retrieved set. It scores exactly the way a RAG-style failure monitor would.
For both participants, the Qwen3 8B embedder mode is used. Trajectory-pattern retrieval skips the initial task message, while for basic RAG both options were tested: including or excluding the initial message.

Aggregations
At each step we summarise a trajectory's accumulated fail-similarity into a single score, and we report the cumulative mean (cummean) over the 175 eval trajectories, since cummean proved the most noise-robust and therefore the most useful given the dataset limitations.
A valid objection is that basic RAG might fare better under an aggregation other than our headline cummean, so we additionally tune and evaluate it under running-max (cummax) and peak-running-mean (cummeanmax). For each aggregation we run the full query/retrieval hyperparameter search independently and keep the configuration that maximises that aggregation's tune-set AUC, then score it on the held-out eval set - so each aggregation is reported under its own best-tuned configuration. Best aggregation for basic RAG turned out to be a cummax, while other two show inverted AUC, so we will report cummax for basic RAG further.

Results

Every basic RAG interval includes 0.5. On this corpus, text-embedding cosine retrieval has no usable signal for predicting the outcome - even when allowed to pick its best aggregation. The trajectory-pattern retrieval cummean interval has a lower bound of 0.621, clearly above chance, and the fact that eval and tune are close (with eval slightly higher) shows no sign of an overfit.

How trajectory-pattern retrieval works

The representation is built bottom-up:
Embed and cluster every observation and action message in the corpus. Each message ends up labelled by its cluster (observations get classes o1, o2, …; actions get a1, a2, …). The initial task prompt is dropped - it can't be cleanly typed, and keeping it degrades downstream metrics.
At each agent step, form the pair (action class, observation class).
Cluster those pairs - concatenating each pair's centroid embeddings - into a compact act_obs token vocabulary (~50–60 tokens in our setup).
A trajectory becomes the resulting sequence of act_obs tokens, one per step.

Retrieval
Three-stage retrieval is used:

anchors placed at evenly spaced centre steps along the path → a per-window MinHash LSH band prefilter (a cheap candidate shortlist)
a dense Jaccard rerank over each candidate's neighbourhood → a cross-anchor aggregate
an aggregative shift-aware Levenshtein rerank: it slides across a trajectory with a window W, finds the best-matching alignment position in the candidate trajectory within W/2, applies a shift penalty and produces a score for each element (point); these scores are then averaged.

Here we note that this rerank technique was not an obvious choice, and more basic sequence-similarity algorithms were tested, such as whole-sequence Levenshtein, whole-sequence Levenshtein with a substitution matrix, Smith-Waterman, Needleman-Wunsch, DTW (Dynamic Time Warping) and some others. All of them performed near the basic-RAG level, which suggests a valuable interpretation of agent-trace structure that we may cover later.

Related code and benchmark are open-sourced
The codebase to run this benchmark was open-sourced as a Python package named Episodiq. Numbers should reproduce; if they don't, that's a bug we want to know about.
The same representation supports several practical features, which were also implemented:

Human-readable logs at ~98% lower annotation cost, LLM-free at runtime. Annotate one representative per cluster and reuse it for every trajectory; new messages are tagged at write time by KNN against their slice - instant labels, zero LLM calls in the hot path. The gap widens with corpus size: at 1,000 trajectories it reaches 99% (~165×), because annotation cost scales with the vocabulary, not the corpus.
A per-snapshot fail-similarity score, as benchmarked above.
Loop detection - surfaces an agent stuck repeating a type of step, useful for token-waste monitoring even when the run isn't broken.
An action-predictability signal - flags steps where the next action is highly constrained versus steps where the agent could plausibly branch many ways.

Results are preliminary - a single repository, 175 held-out trajectories, wide bootstrap intervals. The gap should be confirmed on larger corpora, though assembling a large enough public trajectory dataset is itself an open problem.

Top comments (1)

Max Quimby • Jul 2

The "text-embedding KNN lands at chance while trajectory-pattern retrieval hits 0.71" result is the interesting part — it's evidence that failure is a structural property of the trajectory (retry loops, tool-call thrash, turn count blowing up) rather than anything semantic in the text. That matches what we see monitoring long-horizon agents: the shape of the run tells you it's going sideways well before the content does. Two things I'd love to see next. First, external validity — filtering to a single repo (sqlglot) controls leakage nicely, but does the pattern engine transfer to a different repo or a different agent scaffold, or is it partly learning sqlglot-specific failure signatures? Second, the practical framing: at AUC 0.71 this feels like a cheap first-stage gate that decides when to spend an LLM-judge call, not a standalone verdict. Have you looked at it as a router in front of the expensive judge rather than as a classifier on its own? Also curious what "eval > tune" is telling you — over-tuning collapsing a weak-but-real signal into noise?