Mervin

Posted on May 21 • Originally published at alert.melyx.id

Gemma 4 as an On-Call SRE: Turning Alert Spam into One Reasoned Incident

#devchallenge #gemmachallenge #gemma


This is a submission for the Gemma 4 Challenge: Write About Gemma 4

description: "I rebuilt my alert pipeline around Gemma 4 26B MoE. It now groups cascading alerts into a single incident and writes the root cause for me. Architecture, demo, and why MoE — not the dense 31B — was the right tool."


I had a working alerts service — Postgres, BullMQ, rules engine, Telegram bot. Classic stuff. It also produced the classic problem: 4 separate notifications for what was obviously one incident, no causal narrative, no fix suggestion. So I bolted a single component onto it: a Gemma 4 26B MoE "SRE brain" that reads correlated events and writes the postmortem before I finish my coffee.

Demo. https://github.com/melyx-id/alert-service
Repo (single self-contained NestJS service): /opt/alert-service on the host.

The intentional pick: google/gemma-4-26B-A4B-it (26B MoE, 4B active) — not the dense 31B. Reasoning below.

The problem I actually had

Last week our api-gateway hiccupped after a deploy. Telegram fired:

🚨 Deploy #441 promoted to production
🔴 Redis connection timeout spike (p99 4.2s)
🚨 5xx error rate surged 340% (12% of traffic)
🔴 Checkout latency p95 = 8.7s

Four pings. Four pages. I had to assemble the story myself: the deploy exhausted the Redis connection pool, which caused the 5xx errors, which broke checkout. Obvious in retrospect. At 2am, under on-call cognitive load, not so obvious.

That's the gap I wanted Gemma to close.

Architecture

                    Webhooks / app events
                            │
                            ▼
                  ┌─────────────────────┐
                  │ /events/incident    │ (Fastify + NestJS)
                  └────────┬────────────┘
                           │
                  ┌────────▼────────┐
                  │  AlertsService  │  dedup → Postgres
                  └────────┬────────┘
                           │
                  ┌────────▼────────────────┐
                  │  IncidentsService       │
                  │  • signature(service)   │
                  │  • find OPEN incident   │
                  │    in last 10 min       │
                  │  • attach alert         │
                  └────────┬────────────────┘
                           │
                           ▼
                  ┌─────────────────────────┐
                  │  AnalysisService        │
                  │  HF Inference Router →  │
                  │  gemma-4-26B-A4B-it     │
                  │  (system prompt: SRE)   │
                  └────────┬────────────────┘
                           │
       ┌───────────────────┼────────────────────┐
       ▼                   ▼                    ▼
  Telegram (HTML +    Dashboard (Alpine,    Postgres
   inline buttons:    polls /incidents)     (timeline, aiFixes,
   ACK / RESOLVE /                          aiConfidence,
   Retry AI)                                aiRootCause)

Key idea: an Incident is the unit, not an Alert. An incident gathers all alerts from the same service within a 10-minute window. Every new alert in the window re-invokes Gemma with the full chronological stack — so the analysis improves with context instead of repeating with each ping.
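The grouping step above can be sketched in a few lines. This is a minimal in-memory illustration, not the service's actual code: the `AlertEvent` and `Incident` shapes, the `groupAlert` name, the id format, and the choice to measure the 10-minute window from the incident's most recent alert are all assumptions.

```typescript
// Hypothetical shapes; the real service persists these in Postgres via Prisma.
interface AlertEvent {
  service: string;
  title: string;
  receivedAt: number; // epoch milliseconds
}

interface Incident {
  id: string;
  signature: string;
  status: "OPEN" | "RESOLVED";
  lastEventAt: number;
  alerts: AlertEvent[];
}

const WINDOW_MS = 10 * 60 * 1000; // the 10-minute grouping window

// Assumed signature: one incident stream per service name.
function signature(alert: AlertEvent): string {
  return alert.service.trim().toLowerCase();
}

// Attach the alert to an OPEN incident with the same signature if one saw an
// event within the window; otherwise open a new incident.
function groupAlert(incidents: Incident[], alert: AlertEvent): Incident {
  const sig = signature(alert);
  const open = incidents.find(
    (i) =>
      i.status === "OPEN" &&
      i.signature === sig &&
      alert.receivedAt - i.lastEventAt <= WINDOW_MS,
  );
  if (open) {
    open.alerts.push(alert); // the full stack is what gets sent to the model
    open.lastEventAt = alert.receivedAt;
    return open;
  }
  const created: Incident = {
    id: `INC-${incidents.length + 1}`, // placeholder id scheme
    signature: sig,
    status: "OPEN",
    lastEventAt: alert.receivedAt,
    alerts: [alert],
  };
  incidents.push(created);
  return created;
}
```

After each call, the returned incident's `alerts` array is the chronological stack that would be handed to the analysis step.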

Why Gemma 4 26B MoE (and not the dense 31B)

The challenge specifically asks why each model is the right tool. Here's my honest answer:

Property	What incident analysis needs	26B MoE	31B Dense
Workload shape	Bursty (idle → 3-6 events in 60s → idle)	✅ sparse activation = lower TTFT per call	dense is always-on cost
Reasoning depth	Multi-step causal chain (deploy → pool → 5xx → checkout)	✅ MoE benchmarks competitive with 31B on reasoning	slightly better, marginal
Long context	Up to 128K — we send growing event timelines	✅ both fine	✅ both fine
Cost per analysis	Want sub-cent	✅ 4B active params → cheaper inference	higher
Latency budget	<10s per call (on-call patience)	✅ ~4–7s observed	~6–9s observed

For an incident-analyst workload (short, bursty, but reasoning-heavy), MoE was the right tool. I kept the 31B Dense wired in as an automatic fallback for when the MoE provider returns HTTP 429 (rate limited). Both go through the Hugging Face Inference Router using the same OpenAI-compatible interface (/v1/chat/completions), which made the fallback a one-line config swap.

// src/modules/analysis/analysis.service.ts
this.model = config.get('GEMMA_MODEL') || 'google/gemma-4-26B-A4B-it'
this.fallbackModel = config.get('GEMMA_FALLBACK_MODEL') || 'google/gemma-4-31B-it'
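To make the fallback concrete, here is a rough sketch of how a 429 on the primary model could trigger one retry on the fallback. The names `analyzeWithFallback`, `ChatCall`, `HttpStatusError`, and `makeHttpCall` are hypothetical, and the response shape is assumed to follow the OpenAI chat-completions format; the real AnalysisService may be structured differently.

```typescript
type ChatMessage = { role: "system" | "user"; content: string };

// Error that carries the HTTP status so rate limits can be told apart.
class HttpStatusError extends Error {
  readonly status: number;
  constructor(status: number, message: string) {
    super(message);
    this.status = status;
  }
}

// Any function that sends messages to a model and returns the reply text.
type ChatCall = (model: string, messages: ChatMessage[]) => Promise<string>;

function isRateLimited(err: unknown): boolean {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { status?: unknown }).status === 429
  );
}

// Try the primary model; on HTTP 429, retry once with the fallback model.
// Any other error is rethrown unchanged.
async function analyzeWithFallback(
  call: ChatCall,
  primary: string,
  fallback: string,
  messages: ChatMessage[],
): Promise<{ model: string; text: string }> {
  try {
    return { model: primary, text: await call(primary, messages) };
  } catch (err) {
    if (!isRateLimited(err)) throw err;
    return { model: fallback, text: await call(fallback, messages) };
  }
}

// Hypothetical ChatCall over an OpenAI-compatible endpoint.
// baseUrl and token are placeholders supplied by the caller.
function makeHttpCall(baseUrl: string, token: string): ChatCall {
  return async (model, messages) => {
    const res = await fetch(`${baseUrl}/v1/chat/completions`, {
      method: "POST",
      headers: {
        Authorization: `Bearer ${token}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ model, messages }),
    });
    if (!res.ok) throw new HttpStatusError(res.status, `HTTP ${res.status}`);
    const data = (await res.json()) as {
      choices: { message: { content: string } }[];
    };
    return data.choices[0]?.message.content ?? "";
  };
}
```

Passing the transport in as `call` keeps the retry logic testable without network access; the returned `model` field is what would end up next to the confidence value in the incident record.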

A subtle gotcha worth flagging: HF Router model IDs are case-sensitive. gemma-4-26b-a4b-it returns 400 model_not_found. gemma-4-26B-A4B-it works. Lost 30 minutes to that.

The system prompt that actually mattered

The interesting part of the build wasn't the plumbing — it was getting Gemma to reason rather than summarize. My first prompt produced confident-sounding restatements of the input ("Redis is timing out and 5xx errors are happening"). Useless.

What worked was framing it as a senior person with a strong opinion about what qualifies as a root cause:

You are a senior Site Reliability Engineer with 10+ years on-call experience...

Rules:
- Prefer concrete causal chains over vague language
  ("connection pool exhaustion after deploy #441" beats "service degradation")
- If a deploy event is present, evaluate whether it is the likely trigger
- Severity: CRITICAL = revenue path or full outage
- Confidence: be honest. 0.5 means "plausible but unverified".
  Above 0.85 only when the causal chain is clear.

Two design choices behind this:

Force a causal chain, not a summary. Without this, Gemma reflexively rewrites symptoms.
Confidence as a contract. When I tell it "0.5 = plausible but unverified", it actually rates itself lower on weak signal. In the Redis cascade demo, the first alert (the deploy event alone) returned 50% confidence. By the third alert it hit 95%, because the causal chain had become visible. The model is policing its own certainty.
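A confidence contract is only useful if the reply is validated before it drives severity or UI. The post does not show the response format, so the sketch below assumes the model is asked to answer with a JSON object; the field names (`rootCause`, `severity`, `confidence`, `fixes`) and the `parseAnalysis` helper are hypothetical.

```typescript
// Severity levels visible in the demo; the real service may define more.
type Severity = "LOW" | "HIGH" | "CRITICAL";
const SEVERITIES: readonly Severity[] = ["LOW", "HIGH", "CRITICAL"];

interface Analysis {
  rootCause: string;
  severity: Severity;
  confidence: number; // clamped to the range 0..1
  fixes: string[];
}

// Extract the outermost {...} from the reply (models often wrap JSON in prose
// or code fences), validate the fields, and clamp confidence into [0, 1].
// Returns null when the reply cannot be trusted.
function parseAnalysis(raw: string): Analysis | null {
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end <= start) return null;

  let data: unknown;
  try {
    data = JSON.parse(raw.slice(start, end + 1));
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;

  const obj = data as Record<string, unknown>;
  const rootCause = obj["rootCause"];
  const confidence = obj["confidence"];
  const rawFixes = obj["fixes"];
  const severity = SEVERITIES.find((s) => s === obj["severity"]);

  if (typeof rootCause !== "string" || rootCause.length === 0) return null;
  if (typeof confidence !== "number" || !Number.isFinite(confidence)) return null;
  if (severity === undefined) return null;

  const fixes = Array.isArray(rawFixes)
    ? rawFixes.filter((f): f is string => typeof f === "string")
    : [];

  return {
    rootCause,
    severity,
    confidence: Math.min(1, Math.max(0, confidence)),
    fixes,
  };
}
```

A null result is a natural trigger for a retry rather than storing a half-parsed analysis.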

What the demo actually looks like

$ npm run demo:redis

=== Demo scenario: Redis timeout cascade after deploy #441 ===

  [1/4] (4528ms) LOW      Deploy #441 promoted to production
     ↳ grouped into INC-260520-001 (conf 50%, google/gemma-4-26B-A4B-it)
  [2/4] (7616ms) HIGH     Redis connection timeout spike (p99 4.2s)
     ↳ grouped into INC-260520-001 (conf 90%, google/gemma-4-26B-A4B-it)
  [3/4] (5634ms) CRITICAL 5xx error rate surged 340% (12% of traffic)
     ↳ grouped into INC-260520-001 (conf 95%, google/gemma-4-26B-A4B-it)
  [4/4] (4407ms) HIGH     Checkout latency p95 = 8.7s
     ↳ grouped into INC-260520-001 (conf 95%, google/gemma-4-26B-A4B-it)

Gemma 4 final analysis:
  root cause : Deploy #441 introduced a regression causing Redis connection
               pool exhaustion, leading to request queuing and 5xx errors
               on the checkout path.
  impact     : Users are experiencing high latency and a 12% failure rate
               during the checkout process, directly impacting revenue.
  severity   : CRITICAL  (auto-escalated from initial LOW)
  confidence : 95%
  fixes:
    - Roll back api-gateway to the previous stable version (v2.3.3)
    - Increase Redis connection pool size as a temporary mitigation
      if rollback is delayed
    - Investigate commit a1b2c3d for unclosed Redis connections
      or inefficient session lookups

Note: severity was auto-escalated. The first event was tagged LOW (a deploy isn't itself a problem). Gemma rewrote the incident's severity to CRITICAL after seeing the cascading impact — exactly what a human SRE would do.

Before/after view on the dashboard makes this concrete:

Before (raw alerts pane): 4 separate-looking entries, no narrative, on-call paged 4 times.
After (Gemma pane): 1 grouped incident, root cause + impact + fixes + 95% confidence, on-call paged once with all context inline.

Same data. Different outcome.

Things I deliberately did NOT do (yet)

Multi-agent reasoning (DB-specialist, network-specialist, summarizer). LangGraph would slot in cleanly, but for the use case — small bursts, single service per incident — one well-prompted call beats four coordinated ones in latency. Multi-agent is on the roadmap once I'm grouping across services.
Vector search for similar past incidents. pgvector is already running on the host; the hook is in IncidentsService.groupAndAnalyze. Will add when there are >50 historical incidents to retrieve from.
Local Ollama. Tempting for privacy, but my VPS is 4GB RAM and runs ~15 other services. The HF Router gives me the same Gemma 4 weights without evicting half my fleet. If you're on dedicated hardware, swap the endpoint — the prompt and grouping logic don't change.

Production-y bits that came along for the ride

Dedupe + retry. Cache key = sha1(title:source), 2-min TTL. Stops a runaway cron from re-analyzing the same payload 60x.
Telegram inline keyboard: ACK / RESOLVE / Retry AI / open dashboard. The Retry AI button is my favorite — it re-invokes Gemma with the current event stack. Cheap second opinion when the first reasoning felt off.
Severity escalation. The incident's stored severity is max(human-rule severity, AI severity). AI can upgrade LOW→CRITICAL but cannot downgrade a CRITICAL classification, by design.
Confidence as UI signal. The dashboard shows conf 95% next to every root cause. Below 70% the UI hints "consider re-analysis or wait for more events."
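Three of the rules above are small enough to sketch directly: the sha1(title:source) dedupe key with a 2-minute TTL, the max() severity rule, and the 70% confidence hint. The class and function names here are hypothetical, and the in-memory Map stands in for whatever cache the service actually uses.

```typescript
import { createHash } from "node:crypto";

const DEDUPE_TTL_MS = 2 * 60 * 1000; // 2-minute TTL

// Cache key = sha1(title:source), hex-encoded.
function dedupeKey(title: string, source: string): string {
  return createHash("sha1").update(`${title}:${source}`).digest("hex");
}

class DedupeCache {
  private readonly expiries = new Map<string, number>(); // key -> expiry time

  // Returns true if the same payload was already seen within the TTL.
  // A duplicate hit does not extend the TTL.
  isDuplicate(title: string, source: string, now: number = Date.now()): boolean {
    const key = dedupeKey(title, source);
    const expiresAt = this.expiries.get(key);
    if (expiresAt !== undefined && expiresAt > now) return true;
    this.expiries.set(key, now + DEDUPE_TTL_MS);
    return false;
  }
}

// Levels visible in the demo; the real service may define more.
const SEVERITY_RANK = { LOW: 0, HIGH: 1, CRITICAL: 2 } as const;
type SeverityLevel = keyof typeof SEVERITY_RANK;

// Stored severity = max(rule severity, AI severity): the AI can upgrade a
// classification but never downgrade it.
function storedSeverity(rule: SeverityLevel, ai: SeverityLevel): SeverityLevel {
  return SEVERITY_RANK[ai] > SEVERITY_RANK[rule] ? ai : rule;
}

// Below 70% confidence, return the dashboard hint; otherwise null.
function confidenceHint(confidence: number): string | null {
  return confidence < 0.7
    ? "consider re-analysis or wait for more events"
    : null;
}
```

Because `storedSeverity` only ever moves upward, a noisy low-confidence analysis cannot silence an incident that a rule already marked CRITICAL.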

Stack summary

NestJS 11 (Fastify) — existing service, ~30 LOC of wiring to add the Gemma layer
Prisma + Postgres — 1 new model (Incident), 3 new columns on Alert
HuggingFace Inference Router — google/gemma-4-26B-A4B-it primary, gemma-4-31B-it fallback
Alpine.js + Tailwind CDN — single-file dashboard, polls /incidents every 5s
Telegram bot — HTML messages with inline keyboard, HMAC-signed callbacks
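For the HMAC-signed callbacks, one possible approach is to append a truncated HMAC tag to the button's callback data and verify it when the button press comes back. This is a sketch under assumptions: the `action:incidentId:tag` layout, the 16-character tag length, and the function names are illustrative, not the service's actual scheme. Telegram caps callback_data at 64 bytes, which is why the tag is truncated.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

const TAG_LENGTH = 16; // hex characters kept from the HMAC; assumed value

function tagFor(secret: string, payload: string): string {
  return createHmac("sha256", secret)
    .update(payload)
    .digest("hex")
    .slice(0, TAG_LENGTH);
}

// Build callback data like "ACK:INC-260520-001:<tag>".
// Assumes neither action nor incidentId contains a colon.
function signCallback(secret: string, action: string, incidentId: string): string {
  const payload = `${action}:${incidentId}`;
  return `${payload}:${tagFor(secret, payload)}`;
}

// Return the parsed action and incident id if the tag matches, else null.
function verifyCallback(
  secret: string,
  data: string,
): { action: string; incidentId: string } | null {
  const parts = data.split(":");
  if (parts.length !== 3) return null;
  const [action, incidentId, received] = parts as [string, string, string];

  const expected = Buffer.from(tagFor(secret, `${action}:${incidentId}`));
  const actual = Buffer.from(received);
  // timingSafeEqual throws on length mismatch, so check lengths first.
  if (actual.length !== expected.length) return null;
  if (!timingSafeEqual(actual, expected)) return null;
  return { action, incidentId };
}
```

With an id like INC-260520-001, the signed string comes to 35 bytes, comfortably under the 64-byte limit. Truncating the tag trades some forgery resistance for space, so the length is worth revisiting if ids get longer or shorter.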

Single npm run demo:redis reproduces the entire flow from cold start.

What surprised me

I expected Gemma to be good at the language: paraphrasing logs, polishing summaries. What I didn't expect was how reliably it upgrades severity. The first event in my demo (a deploy) is mundane. The model only marks the incident CRITICAL once it has the second and third alerts to connect the chain. That's not pattern matching; that's reasoning over a sequence. It's the behavior I'd want from a junior SRE in their third month.

The other surprise: confidence actually moves. Most LLM "confidence" outputs are 0.9 forever. Telling Gemma in the system prompt that 0.5 is honest got me back a useful spread of values that I can now drive UI on.

Try it

If it's empty when you look, the demo data may have expired — you're welcome to mentally substitute "redis cascade after deploy #441, severity CRITICAL, 95% confidence, with fixes." Or watch the next real incident roll through, which is the whole point.
