Akash Cp

Posted on May 19

A cost curve an SRE will actually read

#devops #agents #monitoring #python

I'm going to argue that the most important chart in an agent
cockpit isn't accuracy, latency, or token count. It's a layered
line chart with two series and a shaded band. Here's why I built
one, why I picked Altair over Streamlit's default, and the
invariants the chart is bound to under the hood.

The argument

Operations teams ship dashboards that look like Datadog overview
pages: forty tiny tiles, each with a sparkline. The information
density is impressive and the actionable signal is roughly zero.

When I started instrumenting an alert triage co-pilot, I noticed
the same drift in my own cockpit. I had counters for memory hits,
fingerprint match scores, route distributions across cheap vs
strong models, escalation reasons, and live-vs-deterministic
flags. Eight metrics. Useful for debugging. Useless for explaining
the system to anyone else.

The question I actually wanted the cockpit to answer was simpler:

If I run this batch of alerts through OpenRecall instead of
calling the strong model on every one, what does it cost me, and
how is that changing as memory accumulates?

Two numbers, one trend. That's a chart.

The shape

The chart is a single Altair render with three layers:

Actual cost per alert in blue — what cascadeflow's routing actually charged for each alert in submission order.
Strong-model-only baseline in red — what the same alert would have cost if every step had used the strong model with no routing.
Savings band in green — a shaded area between the two lines.

The band is the point of the chart. The wider it gets as the batch
progresses, the more memory is paying off.

Here's the actual render:

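# app.py: queue tab cost curve
import altair as alt
import pandas as pd
import streamlit as st

# CostCurveTracker comes from the project's own module.
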
def render_cost_curve(tracker: CostCurveTracker) -> None:
    series = tracker.series()
    if len(series) < 2:
        return  # one point renders empty; bail

    df = pd.DataFrame(
        [
            {
                "alert_index": p.alert_index,
                "cost_usd": p.cost_usd,
                "baseline_cost_usd": p.baseline_cost_usd,
            }
            for p in series
        ]
    )

    lines = (
        alt.Chart(df)
        .transform_fold(
            ["cost_usd", "baseline_cost_usd"],
            as_=["series", "value"],
        )
        .mark_line()
        .encode(
            x=alt.X("alert_index:Q", title="Alert"),
            y=alt.Y("value:Q", title="USD per alert"),
            color=alt.Color(
                "series:N",
                scale=alt.Scale(
                    domain=["cost_usd", "baseline_cost_usd"],
                    range=["#0369a1", "#b91c1c"],
                ),
            ),
        )
    )

    band = (
        alt.Chart(df)
        .mark_area(opacity=0.18, color="#15803d")
        .encode(
            x="alert_index:Q",
            y=alt.Y("cost_usd:Q"),
            y2=alt.Y2("baseline_cost_usd:Q"),
        )
    )

    st.altair_chart(band + lines, use_container_width=True)

I tried st.line_chart first. It has no way to shade the area between
two series, the series colors were harder to pin down than with an
explicit Altair scale, and its legend can't say what the green area
means. Altair gives me one chart with three layers and explicit color
semantics. The investment in the extra fifteen lines of code paid for
itself the first time someone looked at the cockpit and said "oh, the
green is what we saved."
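
The transform_fold call in the render is what lets one mark_line draw both series. As a rough picture of that reshape, here is a pure-Python sketch. The helper name `fold` and the sample prices are made up for illustration; Altair does this inside the chart spec, not with this function.

```python
def fold(rows, fields, as_=("key", "value")):
    """Wide-to-long reshape: one output row per (input row, folded field)."""
    key_name, value_name = as_
    out = []
    for row in rows:
        for name in fields:
            # The original columns are kept; the key/value pair is added.
            out.append({**row, key_name: name, value_name: row[name]})
    return out


# Illustrative per-alert prices, not measurements from the demo.
rows = [
    {"alert_index": 0, "cost_usd": 0.0003, "baseline_cost_usd": 0.0004},
    {"alert_index": 1, "cost_usd": 0.0, "baseline_cost_usd": 0.0004},
]
long_rows = fold(rows, ["cost_usd", "baseline_cost_usd"], as_=("series", "value"))
for r in long_rows:
    print(r["alert_index"], r["series"], r["value"])
```

Each alert becomes two rows, one per series, which is why the color encoding can key off the "series" column while the band layer keeps reading the original wide columns.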

The invariants

A chart that lies is worse than no chart. I bound the cost curve
to two correctness properties verified by Hypothesis on every CI
run.

Cumulative cost is monotonic non-decreasing. Adding a point
never reduces the total. This catches sign errors and accidental
double-subtraction:

# tests/property/test_cost_curve.py
from hypothesis import given, settings

@given(points=cost_curve_point_strategy_list())
@settings(deadline=None, max_examples=100)
def test_cumulative_cost_monotonic(points: list[CostCurvePoint]) -> None:
    """Feature: openrecall, Property 8: cumulative cost is monotonic."""
    tracker = CostCurveTracker()
    for p in points:
        tracker.record(cost_usd=p.cost_usd, baseline_cost_usd=p.baseline_cost_usd)
    cumulative = [s.cumulative_cost_usd for s in tracker.series()]
    for prev, curr in zip(cumulative, cumulative[1:]):
        assert curr >= prev

The savings band is monotonic non-decreasing. The difference
between cumulative baseline and cumulative actual cost never narrows,
which holds exactly when no single alert costs more than its
baseline. This is the property that gives the chart its meaning. If
savings shrink mid-batch, something is wrong with the routing logic,
not with the chart:

@given(points=cost_curve_point_strategy_list())
@settings(deadline=None, max_examples=100)
def test_savings_band_monotonic(points: list[CostCurvePoint]) -> None:
    """Feature: openrecall, Property 14: savings band is monotonic."""
    tracker = CostCurveTracker()
    for p in points:
        tracker.record(cost_usd=p.cost_usd, baseline_cost_usd=p.baseline_cost_usd)
    series = tracker.series()
    savings = [s.cumulative_baseline_cost_usd - s.cumulative_cost_usd for s in series]
    for prev, curr in zip(savings, savings[1:]):
        assert curr >= prev
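
Both cumulative properties come down to a check on each individual point. A short derivation, using notation introduced here rather than taken from the repo: $c_i$ is the actual cost of alert $i$, $b_i$ is its strong-only baseline, $C_n$ is cumulative cost, and $S_n$ is cumulative savings.

```latex
\begin{align*}
C_n &= \sum_{i=1}^{n} c_i
  &&\Rightarrow\quad C_n - C_{n-1} = c_n,
  &&\text{so } C_n \ge C_{n-1} \iff c_n \ge 0, \\
S_n &= \sum_{i=1}^{n} (b_i - c_i)
  &&\Rightarrow\quad S_n - S_{n-1} = b_n - c_n,
  &&\text{so } S_n \ge S_{n-1} \iff c_n \le b_n.
\end{align*}
```

So the tests can only pass for arbitrary inputs if every recorded point satisfies $0 \le c_n \le b_n$, whether the generator or the tracker enforces it.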

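These two properties are why I trust the chart. The plotted lines
are per-alert values; the invariants guard the running totals the
tracker keeps behind them. Cumulative actual cost and cumulative
baseline cost never decrease, and the cumulative savings between
them never shrink. If a deploy ever changes that, the property
suite fails before the deploy lands.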
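
For readers who want to try the invariants without cloning the repo, here is a minimal sketch of a tracker with the same surface the tests use (record, series, and the cumulative fields). The internals are my assumption, not the repo's implementation. In particular, the 0-based alert_index and the choice to raise on a point that would break an invariant are illustrative.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CostCurvePoint:
    alert_index: int
    cost_usd: float
    baseline_cost_usd: float
    cumulative_cost_usd: float
    cumulative_baseline_cost_usd: float


@dataclass
class CostCurveTracker:
    _points: list[CostCurvePoint] = field(default_factory=list)

    def record(self, cost_usd: float, baseline_cost_usd: float) -> None:
        # Assumed policy: refuse any point that would break an invariant.
        if cost_usd < 0.0:
            raise ValueError("cost_usd must be non-negative")
        if baseline_cost_usd < cost_usd:
            raise ValueError("baseline_cost_usd must be >= cost_usd")
        last = self._points[-1] if self._points else None
        prev_cost = last.cumulative_cost_usd if last else 0.0
        prev_base = last.cumulative_baseline_cost_usd if last else 0.0
        self._points.append(
            CostCurvePoint(
                alert_index=len(self._points),  # 0-based submission order (assumed)
                cost_usd=cost_usd,
                baseline_cost_usd=baseline_cost_usd,
                cumulative_cost_usd=prev_cost + cost_usd,
                cumulative_baseline_cost_usd=prev_base + baseline_cost_usd,
            )
        )

    def series(self) -> list[CostCurvePoint]:
        # Return a copy so callers cannot mutate recorded history.
        return list(self._points)


tracker = CostCurveTracker()
tracker.record(cost_usd=0.0003, baseline_cost_usd=0.0004)  # routed alert (made-up price)
tracker.record(cost_usd=0.0, baseline_cost_usd=0.0004)     # memory bypass
for p in tracker.series():
    print(p.alert_index, p.cumulative_cost_usd, p.cumulative_baseline_cost_usd)
```

Rejecting bad points at record time is only one option. Accepting them and letting the property suite fail is equally defensible if you would rather see the regression than hide it.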

The bypass case

The most surprising point on the chart is the cheapest one.

When the triage engine decides an alert can be auto-resolved from
memory — same fingerprint, consistent prior decisions, low-risk
class — I emit a synthetic routing step and record it like this:

# incident_agent/workflow.py — bypass branch
if triage_engine.is_bypass_eligible(triage, fingerprint):
    audit.record_step(
        RouteTrace(
            model="memory-bypass",
            route="MemoryBypass",
            cost_usd=0.0,
            baseline_cost_usd=baseline_for_strong_rca,
            live_call=False,
            confidence=triage.triage_confidence,
        ),
        decision_basis="memory consistent enough to bypass strong RCA",
    )
    # Record the same per-alert baseline as the trace step above.
    cost_tracker.record(cost_usd=0.0, baseline_cost_usd=baseline_for_strong_rca)
    return AnalysisResult.from_bypass(triage, fingerprint)

Two things happen here that matter for the chart.

The actual cost is exactly zero. Not "small." Not "the cheap
model price." Zero. That's the point of the bypass — no LLM call
fires, so no LLM cost is charged.

The baseline cost is preserved. I don't zero it out. The
baseline is what it would have cost to run the full strong-model
RCA path on this alert; that's still a meaningful number even
though we didn't pay it. Preserving it is what makes the savings
band meaningful — without the baseline, the green area would
collapse on every bypass and the chart would tell the wrong story.

This is encoded as another property:

# tests/property/test_workflow.py
from hypothesis import given, settings

@given(state=bypass_eligible_state_strategy())
@settings(deadline=None, max_examples=100)
def test_bypass_records_zero_cost_with_baseline_preserved(
    state: BypassEligibleState,
) -> None:
    """Feature: openrecall, Property 9: bypass cost=0, baseline preserved."""
    result = workflow.analyze(state.raw_alert)
    bypass_steps = [s for s in result.route_trace if s.model == "memory-bypass"]
    assert len(bypass_steps) == 1
    assert bypass_steps[0].cost_usd == 0.0
    assert bypass_steps[0].baseline_cost_usd > 0.0

The first time I ran the demo and saw the actual line dip to zero
on a bypassed alert while the baseline line stayed at the
strong-model price, I knew the chart was doing the right job. You
can see the savings event happen on a single point.
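
To see why preserving the baseline matters, compare the band width at a bypassed point under both choices. This is a toy illustration with a made-up strong-model price; `band_widths` is a hypothetical helper, not something from the repo.

```python
def band_widths(points):
    """Per-alert savings: baseline minus actual cost, one width per point."""
    return [baseline - cost for cost, baseline in points]


STRONG = 0.0004  # illustrative strong-model price per alert, not a measured value

routed = (0.0003, STRONG)          # cheap-model route: small but real cost
bypass_preserved = (0.0, STRONG)   # bypass with the baseline kept
bypass_zeroed = (0.0, 0.0)         # bypass with the baseline zeroed out

widths_preserved = band_widths([routed, bypass_preserved])
widths_zeroed = band_widths([routed, bypass_zeroed])

print("preserved:", widths_preserved)
print("zeroed:   ", widths_zeroed)
```

With the baseline kept, the bypassed point is the widest part of the band. With it zeroed, the band collapses to nothing at exactly the point where the system saved the most.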

The numbers from a real batch

Concrete result from the packaged 100-alert demo:

100 alerts processed
Auto-decided by memory: depends on prior memory; on a sparse bank, 0; after 20 retains, climbs into the 30s
Escalated to strong model: only 53 of 100, even on a sparse bank, because the cheap model handles normalization and fingerprint extraction for the other 47 without needing the strong model at all
Total cost: $0.0268
Baseline (strong-only): $0.0384
Savings: $0.0116 — 30.2% saved

The 30% comes from cascadeflow routing, not from memory bypass:
the bypass count is zero on first run because the bank is sparse.
That's also a result, and it's the result the chart shows: even
without memory, runtime intelligence alone shaved roughly 30% off
the cost. Add memory, and the slope changes.
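
The savings figures follow directly from the two totals reported above. A quick check of the arithmetic, where the variable names are mine:

```python
total_cost_usd = 0.0268      # reported total for the 100-alert demo
baseline_cost_usd = 0.0384   # reported strong-only baseline

savings_usd = baseline_cost_usd - total_cost_usd
savings_pct = 100 * savings_usd / baseline_cost_usd

print(f"savings: ${savings_usd:.4f} ({savings_pct:.1f}%)")
# prints: savings: $0.0116 (30.2%)
```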

The cascadeflow docs cover the routing and provider configuration
directly; the GitHub repo is where the Groq adapter lives. I
configured it once and the per-step cost numbers showed up in the
trace.

The lesson

A chart that an SRE will actually read has three properties:

One question, two numbers. The question is "is this thing paying for itself," the two numbers are actual and baseline.
The delta is the headline. The savings band is the only visual element on the chart that grows with the batch. Everything else is reference geometry.
The numbers are bound to invariants. Monotonic cumulative cost, monotonic savings, zero-cost-with-preserved-baseline on bypass. Three properties, all checked by Hypothesis, all blocking on CI.

Eight tiles with sparklines tell you the system is working. One
chart with a green band tells you the system is paying off. Pick
the second one.

The repo lives at
https://github.com/Dawn-Fighter/openrecall.
If you're putting any kind of routing or
caching layer into your own agent, build the savings band first.
The rest of the cockpit gets easier when you can point at one
chart and say "this is what the layer is for."

Memory and runtime intelligence are the two halves; the Hindsight
docs and the agent memory overview from Vectorize cover the storage
side cleanly if you want to wire your own.

DEV Community