Vignesh Reddy

Posted on Jun 13

Rate limiting, email alerts, health checks, and Grafana — what we shipped to make Ajah production-ready

#api #devops #llm #monitoring

When we launched Ajah two weeks ago,
261 developers cloned it in the first week.

The product worked. But it wasn't
production-ready for enterprise teams.

Today that changes.

Here's exactly what we shipped and why
each piece matters for teams running
LLMs in production.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
RATE LIMITING PER FEATURE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

The problem: a single misconfigured
agent or a traffic spike on one feature
can exhaust your entire API budget before
anyone notices.

The fix: per-feature rate limiting using
a Redis sliding window counter.

Configure requests per minute from the
Settings page; no code changes needed.
When a feature exceeds its limit, the
gateway returns HTTP 429 before the
request ever reaches your LLM provider:

{
  "error": "rate limit exceeded",
  "feature": "chat",
  "limit": 60,
  "reset_in_seconds": 34
}

Response headers include X-RateLimit-Limit
and X-RateLimit-Reset so clients can back
off. The check costs one Redis INCR call
per request, which adds sub-millisecond
overhead.
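
To make the idea concrete, here is a
minimal sketch of a sliding window counter
in Go. It is not Ajah's actual code: an
in-memory map stands in for Redis so the
example runs without dependencies, and
Limiter, NewLimiter, Allow, and key are
hypothetical names. It blends the previous
and current fixed-window counts to
approximate a sliding window.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Limiter approximates a sliding window by blending two fixed-window
// counters per feature. The map is an in-memory stand-in for Redis keys.
type Limiter struct {
	mu        sync.Mutex
	limit     int            // allowed requests per window
	windowSec int64          // window length in seconds (60 for per-minute limits)
	counts    map[string]int // "feature:windowIndex" -> requests seen
}

func NewLimiter(limit int, windowSec int64) *Limiter {
	return &Limiter{limit: limit, windowSec: windowSec, counts: map[string]int{}}
}

// key names one counter, the way a Redis key might be named.
func key(feature string, window int64) string {
	return fmt.Sprintf("%s:%d", feature, window)
}

// Allow reports whether a request for feature at nowUnix fits the limit.
// On rejection, resetInSeconds is the time until the current fixed window
// rolls over, a simple choice for this sketch.
func (l *Limiter) Allow(feature string, nowUnix int64) (allowed bool, resetInSeconds int64) {
	l.mu.Lock()
	defer l.mu.Unlock()

	idx := nowUnix / l.windowSec
	elapsed := nowUnix % l.windowSec

	// Weight the previous window by how much of it still overlaps the
	// sliding window that ends at nowUnix.
	overlap := float64(l.windowSec-elapsed) / float64(l.windowSec)
	estimate := float64(l.counts[key(feature, idx-1)])*overlap +
		float64(l.counts[key(feature, idx)])
	if estimate >= float64(l.limit) {
		return false, l.windowSec - elapsed
	}

	l.counts[key(feature, idx)]++         // with Redis: INCR on the current key
	delete(l.counts, key(feature, idx-2)) // stand-in for Redis key expiry
	return true, 0
}

func main() {
	lim := NewLimiter(60, 60) // 60 requests per minute
	now := time.Now().Unix()
	for i := 1; i <= 61; i++ {
		if ok, reset := lim.Allow("chat", now); !ok {
			fmt.Printf("request %d rejected: HTTP 429, reset_in_seconds=%d\n", i, reset)
		}
	}
}
```

Each feature gets its own counters, so a
spike on "chat" never consumes the budget
of another feature.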

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
EMAIL ALERTS VIA SMTP
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

The problem: Slack webhooks reach
developers. They don't reach compliance
teams, finance teams, or anyone who
needs an audit trail.

The fix: SMTP email alerts alongside
existing Slack webhooks.

Configure once via the Settings API:

POST /settings
{
  "smtp_config": {
    "host": "smtp.gmail.com",
    "port": 587,
    "username": "alerts@yourcompany.com",
    "password": "your-app-password",
    "from": "alerts@yourcompany.com"
  }
}
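
If you would rather script this than
send the request by hand, a Go sketch
could look like the following. The
payload shape comes from the example
above; the AJAH_BASE_URL and
SMTP_APP_PASSWORD environment variables
and the function names are assumptions,
and any authentication your deployment
requires is not shown.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"time"
)

// SMTPConfig mirrors the smtp_config fields shown in the post.
type SMTPConfig struct {
	Host     string `json:"host"`
	Port     int    `json:"port"`
	Username string `json:"username"`
	Password string `json:"password"`
	From     string `json:"from"`
}

type settingsRequest struct {
	SMTP SMTPConfig `json:"smtp_config"`
}

// buildSettingsBody encodes the request body for POST /settings.
func buildSettingsBody(cfg SMTPConfig) ([]byte, error) {
	return json.Marshal(settingsRequest{SMTP: cfg})
}

// postSettings sends the config and treats any non-2xx status as an error.
func postSettings(client *http.Client, baseURL string, cfg SMTPConfig) error {
	body, err := buildSettingsBody(cfg)
	if err != nil {
		return err
	}
	resp, err := client.Post(baseURL+"/settings", "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		return fmt.Errorf("settings update failed: %s", resp.Status)
	}
	return nil
}

func main() {
	cfg := SMTPConfig{
		Host:     "smtp.gmail.com",
		Port:     587,
		Username: "alerts@yourcompany.com",
		Password: os.Getenv("SMTP_APP_PASSWORD"), // hypothetical variable; keeps the secret out of source
		From:     "alerts@yourcompany.com",
	}
	baseURL := os.Getenv("AJAH_BASE_URL") // hypothetical variable, e.g. your gateway's address
	if baseURL == "" {
		fmt.Println("AJAH_BASE_URL is not set; nothing sent")
		return
	}
	client := &http.Client{Timeout: 10 * time.Second}
	if err := postSettings(client, baseURL, cfg); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("SMTP settings saved")
}
```

Reading the app password from the
environment keeps it out of scripts and
shell history.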

Then set alert_email_to per feature.
Cost spikes and risk flags trigger an
email automatically, with subject lines
like:

[Ajah Alert] Cost spike — feature: chat
[Ajah Alert] Risk flag — feature: support-bot

Alerts are sent from fire-and-forget
goroutines, so email delivery adds no
latency to the request hot path.
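
The fire-and-forget pattern can be
sketched as follows. This is an
illustration, not Ajah's source:
Alerter, Notify, smtpDeliver, and the
other names are hypothetical, and the
delivery function is injectable so the
demo runs without an SMTP server. The
real sender uses net/smtp from the Go
standard library.

```go
package main

import (
	"fmt"
	"log"
	"mime"
	"net"
	"net/smtp"
	"strconv"
	"sync"
)

// Alerter sends alerts in the background so callers never wait on SMTP.
type Alerter struct {
	deliver func(to, subject, body string) error
	wg      sync.WaitGroup // lets a shutdown hook drain in-flight alerts
}

// alertSubject follows the subject format shown in the post.
func alertSubject(kind, feature string) string {
	return fmt.Sprintf("[Ajah Alert] %s — feature: %s", kind, feature)
}

// Notify starts delivery in a goroutine and returns immediately.
// Failures are logged, not returned: that is the fire-and-forget trade-off.
func (a *Alerter) Notify(to, kind, feature, body string) {
	a.wg.Add(1)
	go func() {
		defer a.wg.Done()
		if err := a.deliver(to, alertSubject(kind, feature), body); err != nil {
			log.Printf("alert email to %s failed: %v", to, err)
		}
	}()
}

// Wait blocks until every alert started so far has finished.
func (a *Alerter) Wait() { a.wg.Wait() }

// buildMessage assembles a plain-text email; the subject is MIME-encoded
// because it contains a non-ASCII dash.
func buildMessage(from, to, subject, body string) []byte {
	return []byte("From: " + from + "\r\n" +
		"To: " + to + "\r\n" +
		"Subject: " + mime.QEncoding.Encode("utf-8", subject) + "\r\n" +
		"MIME-Version: 1.0\r\n" +
		"Content-Type: text/plain; charset=utf-8\r\n" +
		"\r\n" + body + "\r\n")
}

// smtpDeliver returns a delivery function backed by smtp.SendMail.
func smtpDeliver(host string, port int, username, password, from string) func(to, subject, body string) error {
	addr := net.JoinHostPort(host, strconv.Itoa(port))
	auth := smtp.PlainAuth("", username, password, host)
	return func(to, subject, body string) error {
		return smtp.SendMail(addr, auth, from, []string{to}, buildMessage(from, to, subject, body))
	}
}

func main() {
	// Demo delivery that prints instead of sending; swap in smtpDeliver(...)
	// with real credentials to send actual email.
	a := &Alerter{deliver: func(to, subject, body string) error {
		fmt.Printf("to=%s subject=%q\n", to, subject)
		return nil
	}}
	a.Notify("finance@yourcompany.com", "Cost spike", "chat", "Hourly cost exceeded the threshold.")
	a.Wait() // only needed at shutdown; request handlers never call it
}
```

The trade-off is that a failed send only
shows up in logs, which is usually
acceptable for alerts but worth knowing.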

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
PER-DEPENDENCY HEALTH CHECKS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

The problem: {"status":"ok"} is useless
when your load balancer needs to know
which specific dependency is down at 2am.

The fix: /health now pings Redis,
PostgreSQL, and ClickHouse individually
with a 3-second timeout per dependency:

{
  "status": "ok",
  "version": "0.1.0",
  "dependencies": {
    "redis": {"status": "ok"},
    "postgres": {"status": "ok"},
    "clickhouse": {"status": "ok"}
  }
}

If any dependency is down, the endpoint
returns HTTP 503 with the specific error:

{
  "status": "degraded",
  "dependencies": {
    "redis": {
      "status": "down",
      "error": "dial tcp: connection refused"
    }
  }
}

Your monitoring system, load balancer,
and on-call engineer know exactly what
to fix.
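
A handler along these lines could produce
the responses above. Treat it as a sketch
under assumptions rather than Ajah's
implementation: Checker, runChecks,
checkNow, and healthHandler are
hypothetical names, and stub checkers
replace real Redis, PostgreSQL, and
ClickHouse pings so the example runs on
its own. Unlike the shortened example
above, this sketch lists every dependency
in the degraded response.

```go
package main

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// Checker pings one dependency and must return once ctx is done.
type Checker func(ctx context.Context) error

type depStatus struct {
	Status string `json:"status"`
	Error  string `json:"error,omitempty"`
}

type report struct {
	Status       string               `json:"status"`
	Dependencies map[string]depStatus `json:"dependencies"`
}

// runChecks pings every dependency concurrently, each under its own
// timeout, and returns the report plus the HTTP status code to send.
func runChecks(parent context.Context, deps map[string]Checker, timeout time.Duration) (report, int) {
	type result struct {
		name string
		err  error
	}
	results := make(chan result, len(deps))
	for name, check := range deps {
		go func(name string, check Checker) {
			ctx, cancel := context.WithTimeout(parent, timeout)
			defer cancel()
			results <- result{name, check(ctx)}
		}(name, check)
	}

	rep := report{Status: "ok", Dependencies: make(map[string]depStatus, len(deps))}
	code := http.StatusOK
	for range deps {
		r := <-results
		if r.err != nil {
			rep.Status, code = "degraded", http.StatusServiceUnavailable
			rep.Dependencies[r.name] = depStatus{Status: "down", Error: r.err.Error()}
		} else {
			rep.Dependencies[r.name] = depStatus{Status: "ok"}
		}
	}
	return rep, code
}

// checkNow runs the checks outside of a request, e.g. in a demo or a test.
func checkNow(deps map[string]Checker, timeout time.Duration) (report, int) {
	return runChecks(context.Background(), deps, timeout)
}

// healthHandler serves /health with the 3-second per-dependency timeout.
func healthHandler(deps map[string]Checker) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		rep, code := runChecks(r.Context(), deps, 3*time.Second)
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(code)
		_ = json.NewEncoder(w).Encode(rep)
	}
}

// stub returns a Checker that succeeds when errMsg is empty.
func stub(errMsg string) Checker {
	return func(context.Context) error {
		if errMsg == "" {
			return nil
		}
		return errors.New(errMsg)
	}
}

// hang blocks until its context expires, like a dependency that never answers.
func hang(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

func main() {
	deps := map[string]Checker{
		"redis":      stub("dial tcp: connection refused"),
		"postgres":   stub(""),
		"clickhouse": stub(""),
	}
	rec := httptest.NewRecorder()
	healthHandler(deps)(rec, httptest.NewRequest(http.MethodGet, "/health", nil))
	fmt.Println(rec.Code)
	fmt.Print(rec.Body.String())
}
```

Running the checks concurrently keeps the
worst-case response time near one timeout
instead of the sum of all three.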

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
GRAFANA DASHBOARD
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

The problem: we shipped 10 Prometheus
metrics two weeks ago. Nobody wants
to build 18 Grafana panels from scratch.

The fix: docs/grafana-dashboard.json
— one import, production dashboard.

18 panels across 5 sections, including:

Traffic
→ Requests per second by feature
→ Requests per second by provider

Latency
→ LLM p50 and p95 by provider
→ Scorer p50 and p95

Cost
→ Cost per hour by feature (USD)
→ Cost per hour by model (USD)

Quality and Safety
→ Hallucination risk gauges by feature
→ Claim density risk by feature
→ Narrative drift risk by feature

Warnings and PII
→ Warning rate by risk level
→ PII detection rate by feature

Import the JSON, point at your Prometheus
datasource, and you have a complete
LLM observability dashboard in under
60 seconds.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Ajah is open source, self-hosted,
MIT licensed.

No data leaves your server.
No vendor lock-in.
No acquisition risk.

→ github.com/VigneshReddy-afk/ajah
→ useajah.com

#buildinpublic #llm #opensource #devtools
