tl;dr: pip install seismograph-probe gets you a Python probe that detects silent LLM API drift using CUSUM change-point detection, with privacy-preserving signal aggregation. 103 tests passing. Dashboard live on localhost. Open source.
Three weeks ago I asked a question I couldn't answer:
"Did GPT-4 just change underneath me, or is it my prompt?"
No latency spike. No downtime. Just subtly different outputs from the same prompts, same parameters, same everything. I spent days debugging something that wasn't my fault.
So I built a detector.
Today I'm shipping it publicly.
What's actually working right now
This isn't a concept post. Here's what's live:
The probe SDK: on PyPI today
pip install seismograph-probe
from probe.sdk import ProbeSDK
sdk = ProbeSDK(provider="openai", model="gpt-4-turbo")
result = sdk.run_canary_suite()
print(result.drift_score)  # 0.0 stable → 1.0 significant shift
The probe runs ≤200 canary prompts daily at temperature=0. These are semantically stable tasks: deterministic questions, structured reasoning, format-adherence checks. The goal is a reliable behavioral baseline, not a capability benchmark.
Privacy boundary: raw prompts and model outputs never leave your machine. The probe extracts SHA-256 feature hashes, distributional stats, and aggregates with differential-privacy (DP) noise added. Nothing else is transmitted.
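To make the privacy boundary concrete, here is a minimal sketch of what the extraction step could look like. It is not the SDK's actual code: the function names, the `MAX_LEN` clipping bound, the choice of output length as the statistic, and the Laplace mechanism are all assumptions made for illustration.

```python
import hashlib
import random

MAX_LEN = 4096  # hypothetical clipping bound; it caps each output's influence


def feature_hash(text: str) -> str:
    """SHA-256 hex digest of a feature, so the raw text is never transmitted."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def summarize(outputs, epsilon=1.0, seed=None):
    """Build a payload of hashes and noised stats only (assumed shape)."""
    if not outputs:
        raise ValueError("need at least one output")
    rng = random.Random(seed)
    clipped = [min(len(o), MAX_LEN) for o in outputs]
    mean_len = sum(clipped) / len(clipped)
    # Replacing one output moves the clipped mean by at most MAX_LEN / n.
    sensitivity = MAX_LEN / len(clipped)
    return {
        "n": len(outputs),
        "feature_hashes": [feature_hash(o) for o in outputs],
        "noisy_mean_length": mean_len + laplace_noise(sensitivity / epsilon, rng),
    }
```

Note that in this sketch only the aggregate is noised. The hashes are deterministic, so they reveal whether two outputs are identical but not what they say.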
CUSUM change-point detection: running
The correlation engine uses CUSUM (Cumulative Sum), a sequential statistical test that accumulates small deviations from the baseline. That makes it sensitive to gradual drift, not just single-day threshold crossings.
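For readers who haven't met CUSUM before, a minimal one-sided version fits in a few lines. This is a textbook sketch, not the detector shipped in the repo: the class name, the `slack` and `threshold` parameters, and the sample numbers are assumptions.

```python
class CusumDetector:
    """One-sided (upward) CUSUM: S_t = max(0, S_{t-1} + x_t - target_mean - slack)."""

    def __init__(self, target_mean: float, slack: float, threshold: float):
        self.target_mean = target_mean  # expected value while the model is stable
        self.slack = slack              # deviations smaller than this are ignored
        self.threshold = threshold      # alarm once the cumulative sum exceeds this
        self.statistic = 0.0

    def update(self, x: float) -> bool:
        """Feed one observation; return True if the alarm threshold is crossed."""
        self.statistic = max(0.0, self.statistic + (x - self.target_mean - self.slack))
        return self.statistic > self.threshold


def first_alarm(series, target_mean, slack, threshold):
    """Return the index of the first alarm, or None if the series never alarms."""
    detector = CusumDetector(target_mean, slack, threshold)
    for i, x in enumerate(series):
        if detector.update(x):
            return i
    return None


# Ten stable days at 0.10, then a small sustained shift to 0.25.
daily_scores = [0.10] * 10 + [0.25] * 10
alarm_day = first_alarm(daily_scores, target_mean=0.10, slack=0.05, threshold=0.45)
```

In this toy series no single day looks alarming on its own, since 0.25 would sit below a naive per-day threshold of, say, 0.3. The cumulative sum still crosses the alarm level after a handful of elevated days, which is the property that matters for slow drift.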
When I backtest against a known LLM behavioral shift event (Aug–Sep 2025):
Day 0: CUSUM statistic: 0.12 (stable baseline)
Day 11: First elevation detected
Day 19: Alert threshold crossed, SEISMOGRAPH fires
Day 57: Public postmortem published
38-day lead time between the day-19 alert and the day-57 postmortem. That's the number I keep coming back to.
Ingestion gateway: deployed
FastAPI gateway with:
Ed25519-signed batch verification (unsigned batches rejected atomically)
Pydantic v2 schema validation
SQLAlchemy ORM + SQLite (ClickHouse migration planned for Phase 2)
Bearer token auth on audit export endpoint
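The ordering matters here: verify first, validate second, store last, and write nothing unless every step passes. The sketch below illustrates that flow with the standard library only. It deliberately swaps in stand-ins: an injected `verify` callable instead of real Ed25519 verification, a hand-rolled field check instead of Pydantic, and `sqlite3` instead of SQLAlchemy. The table layout and field names are hypothetical.

```python
import json
import sqlite3
from typing import Callable

REQUIRED_FIELDS = {"model", "feature_hash", "drift_score"}  # hypothetical schema


class BatchRejected(Exception):
    """Raised when a batch fails signature or schema checks."""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS signals "
        "(model TEXT, feature_hash TEXT, drift_score REAL)"
    )
    return conn


def ingest_batch(
    conn: sqlite3.Connection,
    payload: bytes,
    signature: bytes,
    verify: Callable[[bytes, bytes], bool],  # stand-in for Ed25519 verification
) -> int:
    # 1. Signature check happens before the payload is even parsed.
    if not signature or not verify(payload, signature):
        raise BatchRejected("missing or invalid signature")
    # 2. Schema validation: every record must match, or the whole batch is refused.
    try:
        records = json.loads(payload)
    except json.JSONDecodeError as exc:
        raise BatchRejected("payload is not valid JSON") from exc
    if not isinstance(records, list) or not records:
        raise BatchRejected("payload must be a non-empty list")
    for record in records:
        if not isinstance(record, dict) or set(record) != REQUIRED_FIELDS:
            raise BatchRejected("record does not match schema")
    # 3. Store: one transaction, committed on success and rolled back on error.
    with conn:
        conn.executemany(
            "INSERT INTO signals (model, feature_hash, drift_score) "
            "VALUES (:model, :feature_hash, :drift_score)",
            records,
        )
    return len(records)
```

Because validation finishes before the transaction opens, a batch with one bad record leaves the table untouched rather than half-written.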
Public dashboard: live at localhost, hosted version coming
Dark-mode model weather dashboard. Polls /v1/weather every 60 seconds. Shows per-model drift status across your fleet.
GET /v1/weather
→ [{ "model": "gpt-4-turbo", "status": "STABLE", ... },
   { "model": "claude-3-5-sonnet", "status": "STABLE", ... }]
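If you'd rather script against the endpoint than watch the dashboard, a small poller along these lines should work. It is a sketch under assumptions: only the `model` and `status` fields and the `STABLE` value appear above, the base URL is the local default mentioned in this post, and any non-`STABLE` status name used when calling it is hypothetical.

```python
import json
import urllib.request


def fetch_weather(base_url: str = "http://localhost:8000", timeout: float = 10.0) -> list:
    """GET /v1/weather and decode the JSON list of per-model entries."""
    with urllib.request.urlopen(f"{base_url}/v1/weather", timeout=timeout) as resp:
        return json.load(resp)


def unstable_models(weather: list) -> list:
    """Return the names of models whose status is anything other than STABLE."""
    return [entry["model"] for entry in weather if entry.get("status") != "STABLE"]
```

Calling `unstable_models(fetch_weather())` on a schedule (the dashboard uses 60 seconds) gives a list you can wire into whatever alerting you already have.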
Test suite: 103/103 passing
Not "it works on my machine." 103 tests across probe SDK, storage layer, gateway, CUSUM detector, privacy boundary, and auth. Zero ruff violations across 22 Python files.
Provider ToS compliance: checked
Before adding any provider to the canary suite, I verify that probing it doesn't violate their Terms of Service. Done for: OpenAI, Anthropic, Google Gemini, Mistral, and Cohere. Documented in docs/PROVIDER_TOS_CHECKS.md.
What's NOT done yet (being honest)
No hosted gateway yet. The gateway runs locally. Public ingestion endpoint is Phase 1.
No Bayesian online detector yet. CUSUM is running. BayesianOnlineDetector.update() is deferred; it's on the backlog.
No federation yet. Right now it's single-org. The cross-observer agreement scoring that makes it genuinely valuable is Phase 2.
No cloud dashboard. localhost:8000 only for now.
This is Phase 0: I'm proving the detection logic works before scaling it.
The architecture in one diagram
Your app
   │ (gen_ai.* OTel spans)
   ▼
ProbeSDK
   │ SHA-256 hashes + DP-noised stats only
   │ Ed25519-signed batch
   ▼
Ingestion Gateway (FastAPI)
   │ signature check → schema validation → store
   ▼
SQLite / ClickHouse
   │
   ▼
CUSUM Detector ──► DriftAlert
   │
   ▼
/v1/weather dashboard
OTel-native throughout. If you're already emitting gen_ai.* spans, the adapter plugs straight in.
Why this matters (and why it has to be federated)
A single organization's drift signal is almost useless. Your outputs change because your users change. Your prompts change. Your context windows change.
But if 15 independent organizations running the same canary suite all see correlated semantic drift on the same day, that's a model change. That's the signal you can act on.
Single-org signal = private fleet data (yours only).
Multi-org correlated signal = public drift alert.
That's the design. Federation is Phase 2. The local probe is shippable today.
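Since federation isn't built yet, the following is purely a thought sketch of how cross-observer agreement might be scored: count how many distinct organizations raised a drift alert on the same day, and only promote the day to a public alert when enough of them agree. The function names, the `(org_id, day)` input shape, and the 0.6 cutoff are all hypothetical.

```python
from collections import defaultdict


def agreement_by_day(alerts, total_orgs: int) -> dict:
    """Map each day to the fraction of participating orgs that alerted on it.

    `alerts` is an iterable of (org_id, day) pairs; duplicates are counted once.
    """
    if total_orgs <= 0:
        raise ValueError("total_orgs must be positive")
    orgs_by_day = defaultdict(set)
    for org_id, day in alerts:
        orgs_by_day[day].add(org_id)
    return {day: len(orgs) / total_orgs for day, orgs in orgs_by_day.items()}


def public_alert_days(alerts, total_orgs: int, min_agreement: float = 0.6) -> list:
    """Days on which enough independent observers agree to call it a model change."""
    scores = agreement_by_day(alerts, total_orgs)
    return sorted(day for day, score in scores.items() if score >= min_agreement)
```

A single org alerting alone scores low and stays private; the same alert echoed by most of the fleet crosses the cutoff.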
Try it / follow along
GitHub: github.com/Tania-coder/SEISMOGRAPH
PyPI: pypi.org/project/seismograph-probe
If you've been burned by a silent model change, I want to hear about it. Open an issue, or find me on Twitter @tatyanti.
The probe is Apache 2.0. The gateway will be too.
Tatiana Radchenko · AI Infrastructure · Aarhus, Denmark
Building in public. Phase 0 of 3.