One Hermes tool call took 4.8 seconds. The others averaged 900ms. trace-anomaly found it.


This is a submission for the Hermes Agent Challenge.

I was profiling a Hermes research agent last week. The run had 47 steps. My aggregate stats from trace-stats showed a p95 latency of 1.68 seconds — not terrible, but higher than I expected. The mean was 900ms. Something was pulling the tail up.

I could have sorted the steps by duration_ms and looked at the top of the list. That works. But I wanted something that would tell me which events were statistically anomalous: not just "the slowest ones" but "the ones that are outliers relative to the distribution." The distinction matters when most events cluster around 900ms and one event takes 4.8 seconds.

That's trace-anomaly.

One command

python3 -m trace_anomaly run.jsonl duration_ms

field:        duration_ms
events:       47
iqr:          250.0  (q1=650.0, q3=900.0)
fences:       [275.0, 1275.0]
anomalies:    2

   #  event index         value   dir     score  name/kind
   1           34        4800.0  high     14.10  tool_call
   2           21        1890.0  high      2.46  tool_call

Event #34 took 4.8 seconds. It's 14 IQR units above the upper fence. Event #21 took 1.89 seconds — less extreme but still an outlier. Both were tool_call steps. I looked at the event payloads and found that both were calling an external search API that has rate limiting. The 4.8s step was a retry after a 429.

How IQR detection works

No ML, no trained model. Just statistics:

Compute Q1 (25th percentile) and Q3 (75th percentile) of the distribution
IQR = Q3 - Q1
Lower fence = Q1 - 1.5 × IQR
Upper fence = Q3 + 1.5 × IQR
Flag events outside [lower_fence, upper_fence]
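
The steps above can be sketched with nothing but the standard library. This is an illustration, not trace-anomaly's actual source: the function names are hypothetical, the durations are made up, and the quartile convention (the "inclusive" method of statistics.quantiles) is an assumption, since the tool's own percentile method isn't stated here and other conventions give slightly different quartiles on small samples.

```python
from statistics import quantiles


def tukey_fences(values, k=1.5):
    """Return (q1, q3, iqr, lower_fence, upper_fence) for a list of numbers."""
    # Assumption: "inclusive" quartiles; the real tool may use another convention.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1, q3, iqr, q1 - k * iqr, q3 + k * iqr


def outliers(values, k=1.5):
    """Return (index, value) pairs that fall outside the fences."""
    _q1, _q3, _iqr, lower, upper = tukey_fences(values, k)
    return [(i, v) for i, v in enumerate(values) if v < lower or v > upper]


# Made-up durations in milliseconds, for illustration only.
durations = [700, 650, 900, 800, 850, 4800, 750]

print(tukey_fences(durations))  # (725.0, 875.0, 150.0, 500.0, 1100.0)
print(outliers(durations))      # [(5, 4800)]
```

Indices are positions in the original, unsorted list, so a flagged value can be traced back to its event.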

The 1.5 multiplier is Tukey's inner fence — the standard choice for "mild outliers." Use k=3.0 for "extreme outliers only":

report = detect_anomalies(events, "duration_ms", k=3.0)

The score is how far the value sits beyond the nearest fence, measured in IQR units. For event #34 that is (4800 - 1275) / 250 = 14.1. A score of 14 is very extreme. A score of 2.46 is mild. You can sort or filter by score:

for a in report.anomalies:
    if a.score > 5.0:
        print(f"Severe anomaly at event #{a.index}: {a.value:.0f}ms")
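
To make the score column concrete, here is a small sketch that reproduces the two scores from the sample output. It assumes the score is the distance past the nearest fence divided by the IQR, which matches the printed numbers but is inferred from them rather than taken from the library's source. The function name fence_score and the zero-IQR guard are hypothetical.

```python
def fence_score(value, lower_fence, upper_fence, iqr):
    """Distance past the nearest fence in IQR units (0.0 if inside the fences)."""
    if iqr == 0:
        return 0.0  # assumption: with no spread, nothing is scored
    if value > upper_fence:
        return (value - upper_fence) / iqr
    if value < lower_fence:
        return (lower_fence - value) / iqr
    return 0.0


# Fences and IQR copied from the sample output earlier in the post.
lower, upper, iqr = 275.0, 1275.0, 250.0

for value in (4800.0, 1890.0, 900.0):
    print(value, round(fence_score(value, lower, upper, iqr), 2))
# 4800.0 14.1
# 1890.0 2.46
# 900.0 0.0
```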

Python API

from trace_anomaly import detect_anomalies, load_jsonl

events = load_jsonl("run.jsonl")
report = detect_anomalies(events, "duration_ms")

print(f"IQR: {report.iqr:.0f}ms")
print(f"Fences: [{report.lower_fence:.0f}, {report.upper_fence:.0f}]")
print(f"Anomalies: {len(report.anomalies)}")

for a in report.anomalies:
    # a.event is the full original event dict
    name = a.event.get("name") or a.event.get("kind", "")
    print(f"  {name}: {a.value:.0f}ms (score={a.score:.2f})")

Works with any numeric field

The algorithm is field-agnostic. Apply it to cost if you want to find the unexpectedly expensive steps:

python3 -m trace_anomaly run.jsonl cost_usd

Or tokens if you want to find the steps that consumed far more context than expected:

python3 -m trace_anomaly run.jsonl tokens_in

Combine with trace-filter to scope the analysis

Find anomalous tool calls only, not session lifecycle events:

from trace_filter import filter_trace, load_jsonl as filter_load, kind_is
from trace_anomaly import detect_anomalies

events = filter_load("merged.jsonl")
tool_calls = filter_trace(events, predicate=kind_is("tool_call"))
report = detect_anomalies(tool_calls, "duration_ms")

This gives you a tighter IQR because you're comparing tool calls against other tool calls, not against fast session_open events that would pull the distribution down.
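
A toy example shows the effect of scoping. The durations below are made up, and the quartiles use the standard library's "inclusive" method as an assumption; trace-anomaly's own convention may differ. The helper name upper_fence is hypothetical.

```python
from statistics import quantiles


def upper_fence(values, k=1.5):
    """Tukey upper fence using stdlib quartiles (inclusive method, an assumption)."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    return q3 + k * (q3 - q1)


# Made-up durations in milliseconds, for illustration only.
tool_calls = [800, 850, 900, 950, 1000, 1900]
session_events = [5, 5, 10, 10]

scoped = upper_fence(tool_calls)
mixed = upper_fence(tool_calls + session_events)

print(f"tool calls only: upper fence {scoped}, 1900 flagged: {1900 > scoped}")
print(f"mixed kinds:     upper fence {mixed}, 1900 flagged: {1900 > mixed}")
```

With these numbers the 1900ms call is flagged against an upper fence of 1175.0 when only tool calls are considered. Once the fast session events are mixed in, Q1 drops, the IQR widens, the fence moves to 2328.75, and the same call is no longer flagged.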

What it does not do

It doesn't predict which events will be anomalous in the future. It doesn't train a model. It has no configuration beyond k. On a 47-step trace it runs in under a second, with zero setup.

If all events have the same value (IQR = 0), it returns a report with an empty anomaly list. You can check report.iqr == 0 if you want to handle that case explicitly.

Technical notes

19 tests. Zero runtime dependencies. Python 3.10+. The test suite covers the IQR computation on a known dataset, the effect of the k parameter on the fences, non-detection when all values are identical, high and low outliers, anomaly score ordering, index correctness, and the handling of non-numeric values (booleans, non-numeric strings, missing fields).
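
For readers wondering what the non-numeric cases look like in practice, here is one way such values could be filtered out before computing quartiles. It is a sketch under assumptions: it skips booleans, strings, and missing fields, whereas what trace-anomaly actually does with them isn't described here. The function name numeric_values and the sample events are hypothetical.

```python
import json
from numbers import Real


def numeric_values(events, field):
    """Yield (index, value) for events whose field holds a real number."""
    for index, event in enumerate(events):
        value = event.get(field)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, Real):
            continue
        yield index, float(value)


# Made-up JSONL lines covering the cases listed above.
lines = [
    '{"kind": "tool_call", "duration_ms": 900}',
    '{"kind": "tool_call", "duration_ms": "slow"}',
    '{"kind": "session_open"}',
    '{"kind": "tool_call", "duration_ms": true}',
    '{"kind": "tool_call", "duration_ms": 1890.5}',
]
events = [json.loads(line) for line in lines]

print(list(numeric_values(events, "duration_ms")))
# [(0, 900.0), (4, 1890.5)]
```

Keeping the original index alongside each value means a flagged value still maps back to the right event after the unusable ones are dropped.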

Repo: https://github.com/MukundaKatta/trace-anomaly