Ravi Patel

Posted on Jun 4 • Originally published at ssimplifi.com

What was that request, exactly? Observability for the AI proxy layer

#ai #api #observability #developertools

When you run AI in production for more than a week, three questions start to dominate everything else, and none of them are about model quality.

"What did request X actually do?" Which model picked it up. Which mode. Did it hit the cache. How long did it take. How many tokens. How much it cost. The bill at the end of the month is an aggregate; the question is about a specific call from a specific user that did something unexpected.

"Which features in my app are eating cost?" Cost-per-month is one number. Cost-per-feature is a different conversation — the one where you decide whether to keep an experimental flow live or replace it with something cheaper. Without per-feature attribution, you're flying blind.

"Is the proxy actually fast enough?" The headline latency figure is usually an average, which hides the tail. What matters is p95 and p99: the slowest 1-in-20 and 1-in-100 calls, which determine whether your users wait or move on.

The previous Prism dashboard answered one cost question well: how much did caching save me? It did not answer the three above. As of today, v1.3 does.

Request explorer

The first tab on the new /dashboard/usage page. Every call you've made through Prism within your tier's date range, in a paginated table:

Filters: project, model, provider, mode, cache status, success/error, request tag.
Cursor pagination — no expensive COUNT(*), just "load more" forever.
Date range tier-clamped: Free sees 24 hours, Paid 7 days, Pro 30 days, Team 90 days.
Click any row to expand a side drawer: full request metadata, latency, cost, cache details, the request_tags you sent, and the feedback (if any) on that specific response.
CSV export of any filtered view (Paid and above). The export streams to the browser in chunks and is capped at 10,000 rows per call.
A "Sorted slowest first" mode for outlier hunting.

The design choice worth noting: we didn't add a model selector that joins request metadata to model metadata at query time. Every column the table renders is already on the usage_logs row, so the listing query is index-only at our scale. p95 query latency stays under 100ms even at 30 days of history on the Pro tier.
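For readers who have not built cursor pagination before, here is a minimal sketch of the general technique (often called keyset pagination), using an in-memory SQLite table. The table layout, the column names other than usage_logs, the index name, and the fetch_page helper are all assumptions made for illustration; this is not Prism's actual schema or query.

```python
import sqlite3

# Hypothetical, simplified stand-in for the usage_logs table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE usage_logs (id INTEGER PRIMARY KEY, created_at TEXT, model TEXT)"
)
conn.execute(
    "CREATE INDEX usage_logs_created_idx ON usage_logs (created_at DESC, id DESC)"
)
conn.executemany(
    "INSERT INTO usage_logs (id, created_at, model) VALUES (?, ?, ?)",
    [(i, f"2025-06-01T00:00:{i:02d}Z", "model-a") for i in range(1, 8)],
)


def fetch_page(conn, limit, cursor=None):
    """Return (rows, next_cursor), newest first.

    cursor is the (created_at, id) pair of the last row already shown.
    Fetching limit + 1 rows tells us whether to offer "load more"
    without ever running COUNT(*).
    """
    if cursor is None:
        sql = (
            "SELECT created_at, id, model FROM usage_logs "
            "ORDER BY created_at DESC, id DESC LIMIT ?"
        )
        params = (limit + 1,)
    else:
        sql = (
            "SELECT created_at, id, model FROM usage_logs "
            "WHERE (created_at, id) < (?, ?) "
            "ORDER BY created_at DESC, id DESC LIMIT ?"
        )
        params = (cursor[0], cursor[1], limit + 1)
    rows = conn.execute(sql, params).fetchall()
    has_more = len(rows) > limit
    rows = rows[:limit]
    next_cursor = (rows[-1][0], rows[-1][1]) if has_more else None
    return rows, next_cursor


page, cursor = fetch_page(conn, 3)
while True:
    print([row[1] for row in page])
    if cursor is None:
        break
    page, cursor = fetch_page(conn, 3, cursor)
```

Because each page starts from the last row seen rather than from an offset, the cost of a page does not grow with how deep the user has scrolled.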

By feature — cost attribution

If you've been tagging requests with the new X-Prism-Tags header, the second tab is for you:

curl https://api.ssimplifi.com/v1/chat/completions \
  -H "X-Prism-Tags: feature=onboarding,team=growth" \
  ...

The dashboard reads tag values out of the request_tags jsonb column and gives you a per-feature breakdown — cost, request count, savings, p50 latency. Cost-ordered bars next to the table so you can see at a glance which feature is the biggest line item.
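To make the attribution step concrete, here is a rough sketch of parsing a tag header and rolling cost up per feature. It assumes the header is a comma-separated list of key=value pairs, as in the curl example; the function names, the skip-malformed-pairs behaviour, and the cost figures are invented for illustration and say nothing about how Prism does this server-side.

```python
from collections import defaultdict


def parse_prism_tags(header_value):
    """Parse 'k1=v1,k2=v2' into a dict. Malformed pairs are skipped (an assumption)."""
    tags = {}
    for pair in header_value.split(","):
        key, sep, value = pair.partition("=")
        key, value = key.strip(), value.strip()
        if sep and key and value:
            tags[key] = value
    return tags


def cost_by_tag(rows, tag_key):
    """Aggregate request count and cost per value of tag_key, most expensive first.

    rows is an iterable of (tags_dict, cost_usd) pairs.
    """
    totals = defaultdict(lambda: {"requests": 0, "cost_usd": 0.0})
    for tags, cost_usd in rows:
        value = tags.get(tag_key)
        if value is None:
            continue  # untagged requests are left out of the breakdown
        totals[value]["requests"] += 1
        totals[value]["cost_usd"] += cost_usd
    return sorted(totals.items(), key=lambda kv: kv[1]["cost_usd"], reverse=True)


# Made-up sample data: three requests with illustrative costs.
sample_rows = [
    (parse_prism_tags("feature=onboarding,team=growth"), 0.25),
    (parse_prism_tags("feature=search"), 0.5),
    (parse_prism_tags("feature=onboarding"), 0.5),
]

for feature, stats in cost_by_tag(sample_rows, "feature"):
    print(feature, stats["requests"], stats["cost_usd"])
```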

Click any row in the breakdown and you jump to the Request explorer tab pre-filtered to that tag value. Drill down to the individual requests that contributed to the cost; spot patterns; export to CSV.

This tab unlocks on Pro and Team. Capture is free across every tier — the Free and Paid tiers can keep sending tags and they accumulate; subscribe later to see the breakdown.

Latency — p50, p95, p99

The third tab is what your platform team is going to ask about during a SOC 2 conversation. Per-provider summary cards at the top: p50 / p95 / p99 for the current window, request count, error count, error rate. A red "slow" badge appears when p95 crosses 2 seconds — adjust the threshold mentally based on your use case; we expose it because outlier surfacing matters more than averages.
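As an illustration of what sits behind a summary card, the sketch below computes interpolated percentiles and the "slow" flag in plain Python. The interpolation follows the percentile_cont definition (linear interpolation between neighbouring samples); the provider_summary function, its field names, and the assumption that failed requests are included in the latency list are hypothetical.

```python
SLOW_P95_MS = 2000  # the 2-second badge threshold mentioned above


def percentile_cont(values, fraction):
    """Linearly interpolated percentile, as percentile_cont defines it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    position = fraction * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)


def provider_summary(latencies_ms, error_count):
    """Build one summary card. Assumes latencies_ms covers every request, errors included."""
    request_count = len(latencies_ms)
    p95 = percentile_cont(latencies_ms, 0.95)
    return {
        "p50": percentile_cont(latencies_ms, 0.50),
        "p95": p95,
        "p99": percentile_cont(latencies_ms, 0.99),
        "request_count": request_count,
        "error_count": error_count,
        "error_rate": error_count / request_count,
        "slow": p95 > SLOW_P95_MS,
    }


# Made-up window: 18 fast calls and two slow ones.
card = provider_summary([100] * 18 + [2500, 3000], error_count=1)
print(card)
```

In this made-up window the median is a comfortable 100 ms while p95 sits above 2.5 seconds, which is exactly the case an average would hide.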

Below the cards: an SVG time-series chart. Three lines, one per percentile, over the window. Hour buckets for 24h windows, day buckets for longer. The chart is pure SVG, no chart library, so it stays under 5KB on the wire.
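The bucketing rule can be sketched in a few lines. The function names and the sample data below are assumptions, and for brevity the sketch reports only the median per bucket; the p95 and p99 lines would be computed per bucket in the same way.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone
from statistics import median


def bucket_start(ts, window):
    """Truncate a timestamp to the hour for windows up to 24h, otherwise to the day."""
    if window <= timedelta(hours=24):
        return ts.replace(minute=0, second=0, microsecond=0)
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)


def median_by_bucket(samples, window):
    """Group (timestamp, latency_ms) samples into buckets and return (bucket, p50) pairs in time order."""
    buckets = defaultdict(list)
    for ts, latency_ms in samples:
        buckets[bucket_start(ts, window)].append(latency_ms)
    return [(start, median(values)) for start, values in sorted(buckets.items())]


# Made-up samples spread over two hours of one day.
samples = [
    (datetime(2025, 6, 4, 13, 5, tzinfo=timezone.utc), 120),
    (datetime(2025, 6, 4, 13, 45, tzinfo=timezone.utc), 180),
    (datetime(2025, 6, 4, 14, 10, tzinfo=timezone.utc), 300),
]

print(median_by_bucket(samples, timedelta(hours=24)))  # two hourly points
print(median_by_bucket(samples, timedelta(days=7)))    # one daily point
```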

On Pro and Team, you can switch the grouping from provider to model or mode. On Free and Paid, grouping is by provider only and p99 is hidden. The reason p99 is gated is statistical: percentile_cont over a small sample is noisy. Below a few hundred requests per provider per day, p99 is almost meaningless. Pro tier customers tend to have enough traffic to make p99 honest.
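A quick way to see the small-sample problem: with 50 requests, an interpolated p99 falls between the two slowest samples, so a single outlier drags it. The sketch below uses Python's statistics.quantiles with method="inclusive", which interpolates the same way percentile_cont does; the latency values are made up.

```python
from statistics import quantiles


def p99(latencies_ms):
    """Interpolated 99th percentile (the 99th of 100 cut points)."""
    return quantiles(latencies_ms, n=100, method="inclusive")[98]


# One 5-second outlier among otherwise identical 100 ms requests.
small_sample = [100.0] * 49 + [5000.0]   # 50 requests
large_sample = [100.0] * 999 + [5000.0]  # 1,000 requests

print(p99(small_sample))  # pulled far above 100 ms by the one outlier
print(p99(large_sample))  # unaffected by the one outlier
```

The same single slow call moves p99 by seconds in the small window and not at all in the large one, which is why the number only becomes trustworthy with enough traffic.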

Feedback

The fourth tab is the one that turns Prism from a metering layer into a feedback loop.

Every response from /v1/chat/completions now carries an X-Prism-Feedback-Id header — a UUID. Your client code keeps that header, surfaces a thumbs-up/down to your user, and when they click, you POST it back:

curl https://api.ssimplifi.com/v1/feedback \
  -H "Authorization: Bearer YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "feedback_id": "<uuid from X-Prism-Feedback-Id>",
    "thumbs": -1,
    "comment": "Wrong product name in the recommendation",
    "tag": "factual-error"
  }'
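On the client side, the round trip might look roughly like the sketch below. It only builds the request object and does not send anything. The endpoint, header name, and payload fields come from the example above; the helper names, the use of +1 for a thumbs-up, the JSON content type, and the choice to omit empty optional fields are assumptions.

```python
import json
import urllib.request

FEEDBACK_URL = "https://api.ssimplifi.com/v1/feedback"


def extract_feedback_id(response_headers):
    """Find X-Prism-Feedback-Id in a mapping of response headers, ignoring case."""
    for name, value in response_headers.items():
        if name.lower() == "x-prism-feedback-id":
            return value
    return None


def build_feedback_request(api_key, feedback_id, thumbs, comment=None, tag=None):
    """Build (but do not send) the POST for one piece of feedback.

    thumbs is -1 for a thumbs-down as in the example; +1 for thumbs-up is assumed.
    """
    payload = {"feedback_id": feedback_id, "thumbs": thumbs}
    if comment is not None:
        payload["comment"] = comment
    if tag is not None:
        payload["tag"] = tag
    return urllib.request.Request(
        FEEDBACK_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Headers as they might arrive on a chat completion response (placeholder id value).
headers = {"content-type": "application/json", "x-prism-feedback-id": "example-id"}
request = build_feedback_request(
    "YOUR_KEY", extract_feedback_id(headers), thumbs=-1, tag="factual-error"
)
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would send it; keeping construction separate makes the payload easy to unit test.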

The dashboard aggregates: total feedback count, thumbs split, rating histogram, recent comments with the originating request log. You can correlate quality with model, with mode, with cache hit. You can see which features get thumbs-down disproportionately and decide whether to switch their default model or add a prompt guardrail.

UPSERT semantics: send a thumb first, a comment seconds later when the user types one — both stick. Unknown feedback_id still returns 200 — we capture customer intent even when the request log got pruned.
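The "both stick" behaviour can be sketched with a generic upsert that only overwrites the fields present in the new call. This uses SQLite for a self-contained demo; the table layout, the upsert_feedback helper, and the placeholder id are assumptions, not Prism's migration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical feedback table. There is deliberately no foreign key to the
# request log, so feedback for an unknown or pruned request is still stored.
conn.execute(
    "CREATE TABLE feedback ("
    "feedback_id TEXT PRIMARY KEY, thumbs INTEGER, comment TEXT, tag TEXT)"
)


def upsert_feedback(conn, feedback_id, thumbs=None, comment=None, tag=None):
    """Insert feedback, or merge into the existing row, keeping fields not sent this time."""
    conn.execute(
        """
        INSERT INTO feedback (feedback_id, thumbs, comment, tag)
        VALUES (?, ?, ?, ?)
        ON CONFLICT (feedback_id) DO UPDATE SET
            thumbs = COALESCE(excluded.thumbs, feedback.thumbs),
            comment = COALESCE(excluded.comment, feedback.comment),
            tag = COALESCE(excluded.tag, feedback.tag)
        """,
        (feedback_id, thumbs, comment, tag),
    )
    conn.commit()


# The user clicks thumbs-down, then types a comment a few seconds later.
upsert_feedback(conn, "example-id", thumbs=-1)
upsert_feedback(conn, "example-id", comment="Wrong product name in the recommendation")

print(conn.execute("SELECT * FROM feedback").fetchall())
```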

This is the layer that powers everything in v1.4. Policy rules ("never use Opus for classification", enforced by blocking the request) and budget caps ("alert me when monthly cost exceeds $X") sit on top of an observability data plane. v1.3 ships the data plane.

What's underneath

The schema change is one new database migration (014): a feedback table, a feedback_id column on usage_logs, and four indexes covering the common access patterns. Two more migrations (015 and 016) add SQL functions for the aggregations: usage_tags_summary, usage_tags_timeseries, usage_latency_summary, usage_latency_timeseries. Postgres does the percentile_cont so we don't ship raw rows to Python.

The frontend is a single page with four tabs and an SVG line chart that's 80 lines of code. No chart library, no virtualization library, no state machine library. The whole observability surface adds 12 KB to the dashboard bundle.

If you want to see it on your own data: head to /dashboard/usage. Start tagging your requests with X-Prism-Tags and posting feedback on responses you care about. The data will accumulate, and the next time you ask yourself "what did request X actually do?" — you'll have the answer.

DEV Community