p95 latency was 1.68 seconds. I didn't know until I built trace-stats.

#devchallenge #hermesagentchallenge #agents #observability

Hermes Agent Challenge Submission: Write About Hermes Agent

This is a submission for the Hermes Agent Challenge.

After I got my Hermes agent running reliably, the next question was: how fast is it, really? Not "the last run took 3 seconds" fast — I mean the distribution. What's the median step latency? What's the p95? Is there a long tail?

Means lie. A step that takes 200ms most of the time but 4 seconds occasionally has a mean of maybe 400ms. That 400ms number is not actionable. The p95 is.
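To see the gap with actual numbers, here is a small sketch. The data is made up for illustration (94 steps at 200ms and 6 steps at 4 seconds) and is not taken from a real run.

```python
import statistics

# Hypothetical latencies in ms: mostly fast, with an occasional slow step.
durations = [200] * 94 + [4000] * 6

mean_ms = statistics.mean(durations)      # 428: matches no real step
median_ms = statistics.median(durations)  # 200: the typical step
p95_ms = sorted(durations)[95 - 1]        # 95th of 100 sorted values: 4000

print(f"mean={mean_ms} p50={median_ms} p95={p95_ms}")
```

The mean lands between the two clusters and describes neither. The median and the p95 each describe a step that actually happened.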

I had duration_ms on every step in my JSONL logs. I did not have a fast way to compute percentiles on it. I wrote a loop with statistics.median. Then I wanted p95. Then p99. Then I wanted the same breakdown for cost_usd and tokens_in. That's trace-stats.
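A loop of that kind can be sketched with the standard library alone. This is an illustration, not the original script: the median_field name and the sample events are hypothetical, and it only assumes one JSON object per line.

```python
import json
import statistics

def median_field(lines, field="duration_ms"):
    """Median of `field` over JSONL lines, skipping blanks and events without it."""
    values = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        event = json.loads(line)
        if field in event:
            values.append(event[field])
    return statistics.median(values) if values else None

# Inline sample standing in for a real log file (made-up events).
sample = [
    '{"kind": "tool_call", "duration_ms": 120}',
    '{"kind": "llm_call", "duration_ms": 850}',
    '{"kind": "note"}',
    '{"kind": "llm_call", "duration_ms": 2100}',
]
print(median_field(sample))
```

With a real file, an open file object can be passed in place of the list, since iterating a file yields its lines. The loop works for one statistic on one field. Every extra percentile or field means another copy of it.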

One command

python3 -m trace_stats run.jsonl duration_ms cost_usd

duration_ms:
  n=47 total=42300.0
  mean=900.0 stddev=340.2
  min=120.0 p50=850.0 p90=1450.0 p95=1680.0 p99=1950.0 max=2100.0

cost_usd:
  n=35 total=0.034200
  mean=0.000977 stddev=0.000280
  min=0.000100 p50=0.000900 p90=0.001400 p95=0.001600 p99=0.001900 max=0.002100

The p95 latency was 1.68 seconds. The max was 2.1 seconds. That's the tail I needed to know about.

Python API

from trace_stats import field_stats, load_jsonl

events = load_jsonl("run.jsonl")
s = field_stats(events, "duration_ms")

print(f"p50: {s.p50:.0f}ms")
print(f"p95: {s.p95:.0f}ms")
print(f"p99: {s.p99:.0f}ms")
print(f"mean: {s.mean:.0f}ms  stddev: {s.stddev:.0f}ms")

Multiple fields, one pass

from trace_stats import multi_field_stats

stats = multi_field_stats(events, ["duration_ms", "cost_usd", "tokens_in"])
for field, s in stats.items():
    if s.count > 0:
        print(f"{field}: p95={s.p95:.3g} n={s.count}")

multi_field_stats loads all events into memory once, then scans that list once per field. For a small field list it is not significantly slower than a single field_stats call, and it avoids re-reading the file for each field.
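The "load once, scan per field" idea can be sketched as below. collect_fields and read_events are hypothetical names for illustration and say nothing about how trace-stats is implemented internally. The point of the sketch is that materializing the events first lets a lazy reader be consumed exactly once.

```python
def collect_fields(events, fields):
    """Return {field: [values]} for each requested field (hypothetical helper)."""
    events = list(events)  # materialize once, so a generator is consumed a single time
    return {
        field: [e[field] for e in events if field in e]
        for field in fields
    }

def read_events():
    # Stand-in for a lazy JSONL reader; the events are made up.
    yield {"duration_ms": 120, "cost_usd": 0.0001}
    yield {"duration_ms": 850}
    yield {"duration_ms": 2100, "cost_usd": 0.0021, "tokens_in": 512}

columns = collect_fields(read_events(), ["duration_ms", "cost_usd", "tokens_in"])
print({field: len(values) for field, values in columns.items()})
```

Under this sketch, events that lack a field are simply skipped for that field, which is one way the per-field n can differ within the same run.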

Works with filtered subsets

The real power is combining it with trace-filter to ask questions like "what's the latency distribution for just tool calls?":

from trace_filter import filter_trace, load_jsonl as filter_load, kind_is
from trace_stats import field_stats

events = filter_load("merged.jsonl")
tool_calls = filter_trace(events, predicate=kind_is("tool_call"))
s = field_stats(tool_calls, "duration_ms")
print(f"Tool call p95: {s.p95:.0f}ms")

Or comparing distributions across lanes:

from trace_filter import filter_trace

# Reuses `events` and `field_stats` from the previous snippet.
for lane_name in ["supervisor", "worker1", "worker2"]:
    lane_events = filter_trace(events, lane=lane_name)
    s = field_stats(lane_events, "duration_ms")
    print(f"{lane_name}: p95={s.p95:.0f}ms n={s.count}")

This is how I found that worker2 had a p95 latency twice that of worker1. Both used the same model and the same tools. The difference turned out to be that worker2's system prompt was much longer.

The percentile algorithm

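I use the nearest-rank method: for a given percentile p over n sorted values, the 1-based rank is ceil(p/100 * n), and the result is the value at that rank. Every answer is a value that actually appears in the data, with no interpolation, which keeps small datasets exact and the test suite predictable.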
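As a minimal sketch of the nearest-rank rule described above (the nearest_rank function here is illustrative and is not copied from the trace-stats source):

```python
import math

def nearest_rank(values, p):
    """Nearest-rank percentile: the value at 1-based rank ceil(p/100 * n)."""
    if not values:
        raise ValueError("no values")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    # Multiply before dividing to limit floating-point error in the rank.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]

data = [15, 20, 35, 40, 50]
print(nearest_rank(data, 30))  # rank ceil(1.5) = 2 -> 20
print(nearest_rank(data, 50))  # rank ceil(2.5) = 3 -> 35
print(nearest_rank(data, 90))  # rank ceil(4.5) = 5 -> 50
```

Because the result is always an element of the input, a test can assert an exact value instead of comparing floats within a tolerance.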

For standard deviation, I use the sample standard deviation (ddof=1, dividing by n-1). That is the right choice for agent traces: the logged events are a sample used to estimate the spread across all possible runs, not the entire population of interest.
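The difference between the two conventions is easy to check by hand. The sketch below uses a made-up dataset and compares both formulas against Python's statistics module, where stdev divides by n-1 and pstdev divides by n.

```python
import math
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample; the mean is 5

mean = sum(values) / len(values)
squared_dev = sum((v - mean) ** 2 for v in values)  # 32.0

sample_sd = math.sqrt(squared_dev / (len(values) - 1))  # divide by n-1
population_sd = math.sqrt(squared_dev / len(values))    # divide by n

# The standard library implements both conventions.
assert math.isclose(sample_sd, statistics.stdev(values))
assert math.isclose(population_sd, statistics.pstdev(values))

print(f"sample={sample_sd:.3f} population={population_sd:.3f}")
# sample=2.138 population=2.000
```

The gap shrinks as n grows, so the choice matters most on short runs. The sample formula also needs at least two values, since it divides by n-1.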

Technical notes

19 tests. Zero runtime dependencies. Python 3.10+. The test suite covers the percentile algorithm on specific datasets with known answers, the standard deviation formula, the guard that keeps bools from being counted as numbers, numeric values stored as strings, missing-field handling, and the multi_field_stats path.
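The bool guard exists because in Python bool is a subclass of int, so True passes an isinstance(value, int) check and would count as 1. A value-coercion step covering these cases might look like the sketch below. as_number is a hypothetical helper, and parsing numeric strings (instead of rejecting them) is an assumption, not a statement about what trace-stats does.

```python
def as_number(value):
    """Return value as a float, or None if it should not count as numeric."""
    if isinstance(value, bool):
        return None  # bool is a subclass of int; exclude it explicitly
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        try:
            return float(value.strip())  # assumption: numeric strings are parsed
        except ValueError:
            return None
    return None  # None (missing field), lists, dicts, and so on

event = {"duration_ms": "850", "cached": True}
print(as_number(event.get("duration_ms")))  # 850.0
print(as_number(event.get("cached")))       # None
print(as_number(event.get("cost_usd")))     # None (field is missing)
```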

Repo: https://github.com/MukundaKatta/trace-stats