This is a submission for the Gemma 4 Challenge: Write About Gemma 4
description: "I rebuilt my alert pipeline around Gemma 4 26B MoE. It now groups cascading alerts into a single incident and writes the root cause for me. Architecture, demo, and why MoE, not the dense 31B, was the right tool."
tags: gemmachallenge, gemma, ai, devops
I had a working alerts service: Postgres, BullMQ, rules engine, Telegram bot. Classic stuff. It also produced the classic problem: 4 separate notifications for what was obviously one incident, no causal narrative, no fix suggestion. So I bolted a single component onto it: a Gemma 4 26B MoE "SRE brain" that reads correlated events and writes the postmortem before I finish my coffee.
Demo. https://github.com/melyx-id/alert-service
Repo (single self-contained NestJS service): /opt/alert-service on the host.
The intentional pick: google/gemma-4-26B-A4B-it (26B MoE, 4B active), not the dense 31B. Reasoning below.
The problem I actually had
Last week our api-gateway hiccupped after a deploy. Telegram fired:
Deploy #441 promoted to production
Redis connection timeout spike (p99 4.2s)
5xx error rate surged 340% (12% of traffic)
Checkout latency p95 = 8.7s
Four pings. Four pages. I had to assemble the story myself: the deploy caused the redis pool to exhaust, which caused 5xx, which broke checkout. Obvious in retrospect. Cognitive load at 2am, not so obvious.
That's the gap I wanted Gemma to close.
Architecture
Webhooks / app events
         │
         ▼
┌─────────────────────┐
│  /events/incident   │  (Fastify + NestJS)
└────────┬────────────┘
         │
┌────────▼────────┐
│  AlertsService  │  dedup → Postgres
└────────┬────────┘
         │
┌────────▼────────────────┐
│  IncidentsService       │
│  • signature(service)   │
│  • find OPEN incident   │
│    in last 10 min       │
│  • attach alert         │
└────────┬────────────────┘
         │
         ▼
┌─────────────────────────┐
│  AnalysisService        │
│  HF Inference Router →  │
│  gemma-4-26B-A4B-it     │
│  (system prompt: SRE)   │
└────────┬────────────────┘
         │
   ┌─────┴─────────────┬────────────────────┐
   ▼                   ▼                    ▼
Telegram (HTML +    Dashboard (Alpine,   Postgres
inline buttons:     polls /incidents)    (timeline, aiFixes,
ACK / RESOLVE /                          aiConfidence,
Retry AI)                                aiRootCause)
Key idea: an Incident is the unit, not an Alert. An incident gathers all alerts from the same service within a 10-minute window. Every new alert in the window re-invokes Gemma with the full chronological stack, so the analysis improves with context instead of repeating with each ping.
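To make the grouping rule concrete, here is a minimal in-memory sketch. It is not the repo's code: the `Alert` and `Incident` shapes, the `groupAlert` helper, and the choice to measure the 10-minute window from when the incident was opened are all assumptions for illustration. The real service persists incidents through Prisma.

```typescript
// Hypothetical shapes for illustration; the real Prisma models may differ.
interface Alert {
  service: string;
  title: string;
  receivedAt: number; // epoch milliseconds
}

interface Incident {
  id: number;
  signature: string;
  status: "OPEN" | "RESOLVED";
  openedAt: number; // epoch milliseconds
  alerts: Alert[];
}

const WINDOW_MS = 10 * 60 * 1000; // the 10-minute grouping window

// Assumption: the signature is derived from the service name alone.
function signature(alert: Alert): string {
  return alert.service.trim().toLowerCase();
}

// Attach the alert to an OPEN incident with the same signature that was
// opened within the window, or open a new incident if none matches.
function groupAlert(incidents: Incident[], alert: Alert): Incident {
  const sig = signature(alert);
  const existing = incidents.find(
    (i) =>
      i.status === "OPEN" &&
      i.signature === sig &&
      alert.receivedAt - i.openedAt <= WINDOW_MS
  );
  if (existing) {
    existing.alerts.push(alert);
    // Keep the stack chronological so the model always sees an ordered timeline.
    existing.alerts.sort((a, b) => a.receivedAt - b.receivedAt);
    return existing;
  }
  const created: Incident = {
    id: incidents.length + 1,
    signature: sig,
    status: "OPEN",
    openedAt: alert.receivedAt,
    alerts: [alert],
  };
  incidents.push(created);
  return created;
}
```

Each call returns the incident whose full `alerts` array would then be handed to the analysis step, which is what lets later analyses see more context than earlier ones.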
Why Gemma 4 26B MoE (and not the dense 31B)
The challenge specifically asks why each model is the right tool. Here's my honest answer:
| Property | What incident analysis needs | 26B MoE | 31B Dense |
|---|---|---|---|
| Workload shape | Bursty (idle → 3-6 events in 60s → idle) | ✓ sparse activation = lower TTFT per call | dense is always-on cost |
| Reasoning depth | Multi-step causal chain (deploy → pool → 5xx → checkout) | ✓ MoE benchmarks competitive with 31B on reasoning | slightly better, marginal |
| Long context | Up to 128K; we send growing event timelines | ✓ both fine | ✓ both fine |
| Cost per analysis | Want sub-cent | ✓ 4B active params → cheaper inference | higher |
| Latency budget | <10s per call (on-call patience) | ✓ ~4–7s observed | ~6–9s observed |
For an incident analyst workload (short, bursty, but reasoning-heavy), MoE was the right tool. I kept the 31B Dense wired in as an automatic fallback for when the MoE provider returns HTTP 429. Both go through the HuggingFace Inference Router using the same OpenAI-compatible interface (/v1/chat/completions), which made the fallback a one-line config swap.
// src/modules/analysis/analysis.service.ts
this.model = config.get('GEMMA_MODEL') || 'google/gemma-4-26B-A4B-it'
this.fallbackModel = config.get('GEMMA_FALLBACK_MODEL') || 'google/gemma-4-31B-it'
A subtle gotcha worth flagging: HF Router model IDs are case-sensitive. gemma-4-26b-a4b-it returns 400 model_not_found. gemma-4-26B-A4B-it works. Lost 30 minutes to that.
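The fallback behavior can be sketched as a small wrapper. This is an illustration under assumptions, not the service's actual code: `ChatCall` is a hypothetical injected transport (in the real service it would wrap a POST to the router's /v1/chat/completions endpoint), and `analyzeWithFallback` is a made-up name.

```typescript
// Hypothetical transport: given a model ID and a prompt, resolve with the
// HTTP status and the response text. Injecting it keeps the sketch testable
// without touching the network.
type ChatCall = (
  model: string,
  prompt: string
) => Promise<{ status: number; text: string }>;

interface Analysis {
  model: string; // which model actually produced the answer
  text: string;
}

// Try the primary model first; fall back once, and only on HTTP 429.
async function analyzeWithFallback(
  call: ChatCall,
  primary: string,
  fallback: string,
  prompt: string
): Promise<Analysis> {
  const first = await call(primary, prompt);
  if (first.status === 200) {
    return { model: primary, text: first.text };
  }
  if (first.status !== 429) {
    // Other errors (such as the 400 from a wrong-case model ID) should
    // surface rather than be hidden behind the fallback.
    throw new Error(`primary model failed with HTTP ${first.status}`);
  }
  const second = await call(fallback, prompt);
  if (second.status !== 200) {
    throw new Error(`fallback model failed with HTTP ${second.status}`);
  }
  return { model: fallback, text: second.text };
}
```

Returning the model ID alongside the text is what would let the demo output print which model answered each call.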
The system prompt that actually mattered
The interesting part of the build wasn't the plumbing; it was getting Gemma to reason rather than summarize. My first prompt produced confident-sounding restatements of the input ("Redis is timing out and 5xx errors are happening"). Useless.
What worked was framing it as a senior person with a strong opinion about what qualifies as a root cause:
You are a senior Site Reliability Engineer with 10+ years on-call experience...
Rules:
- Prefer concrete causal chains over vague language
("connection pool exhaustion after deploy #441" beats "service degradation")
- If a deploy event is present, evaluate whether it is the likely trigger
- Severity: CRITICAL = revenue path or full outage
- Confidence: be honest. 0.5 means "plausible but unverified".
Above 0.85 only when the causal chain is clear.
Two design choices behind this:
- Force a causal chain, not a summary. Without this, Gemma reflexively rewrites symptoms.
- Confidence as a contract. When I tell it "0.5 = plausible but unverified", it actually self-rates lower on weak signal. With the Redis cascade demo, the first alert (deploy event alone) returned confidence 50%. By the third alert it hit 95%, because the causal chain became visible. The model is policing its own certainty.
What the demo actually looks like
$ npm run demo:redis
=== Demo scenario: Redis timeout cascade after deploy #441 ===
[1/4] (4528ms) LOW Deploy #441 promoted to production
  ↳ grouped into INC-260520-001 (conf 50%, google/gemma-4-26B-A4B-it)
[2/4] (7616ms) HIGH Redis connection timeout spike (p99 4.2s)
  ↳ grouped into INC-260520-001 (conf 90%, google/gemma-4-26B-A4B-it)
[3/4] (5634ms) CRITICAL 5xx error rate surged 340% (12% of traffic)
  ↳ grouped into INC-260520-001 (conf 95%, google/gemma-4-26B-A4B-it)
[4/4] (4407ms) HIGH Checkout latency p95 = 8.7s
  ↳ grouped into INC-260520-001 (conf 95%, google/gemma-4-26B-A4B-it)
Gemma 4 final analysis:
root cause : Deploy #441 introduced a regression causing Redis connection
pool exhaustion, leading to request queuing and 5xx errors
on the checkout path.
impact : Users are experiencing high latency and a 12% failure rate
during the checkout process, directly impacting revenue.
severity : CRITICAL (auto-escalated from initial LOW)
confidence : 95%
fixes:
- Roll back api-gateway to the previous stable version (v2.3.3)
- Increase Redis connection pool size as a temporary mitigation
if rollback is delayed
- Investigate commit a1b2c3d for unclosed Redis connections
or inefficient session lookups
Note: severity was auto-escalated. The first event was tagged LOW (a deploy isn't itself a problem). Gemma rewrote the incident's severity to CRITICAL after seeing the cascading impact, exactly what a human SRE would do.
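The escalation follows the rule described later in this post: the stored severity is the maximum of the rule-based severity and the AI severity, so the model can raise it but never lower it. A minimal sketch, assuming only the three levels that appear in the demo output (the real enum may have more) and a hypothetical `escalate` helper:

```typescript
// Ordered from least to most severe. Only the levels visible in the demo
// output are listed here; the real service may define additional ones.
const SEVERITY_ORDER = ["LOW", "HIGH", "CRITICAL"] as const;
type Severity = (typeof SEVERITY_ORDER)[number];

// Stored severity = max(rule-based severity, AI severity).
// The AI can upgrade a classification but cannot downgrade it.
function escalate(ruleSeverity: Severity, aiSeverity: Severity): Severity {
  const rank = (s: Severity): number => SEVERITY_ORDER.indexOf(s);
  return rank(aiSeverity) > rank(ruleSeverity) ? aiSeverity : ruleSeverity;
}
```

With this rule, a deploy tagged LOW by the rules engine becomes CRITICAL once the model sees the cascade, while a CRITICAL set by a human rule stays CRITICAL even if the model rates it lower.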
Before/after view on the dashboard makes this concrete:
- Before (raw alerts pane): 4 separate-looking entries, no narrative, on-call paged 4 times.
- After (Gemma pane): 1 grouped incident, root cause + impact + fixes + 95% confidence, on-call paged once with all context inline.
Same data. Different outcome.
Things I deliberately did NOT do (yet)
- Multi-agent reasoning (DB-specialist, network-specialist, summarizer). LangGraph would slot in cleanly, but for the use case (small bursts, single service per incident) one well-prompted call beats four coordinated ones in latency. Multi-agent is on the roadmap once I'm grouping across services.
- Vector search for similar past incidents. pgvector is already running on the host; the hook is in IncidentsService.groupAndAnalyze. Will add when there are >50 historical incidents to retrieve from.
- Local Ollama. Tempting for privacy, but my VPS is 4GB RAM and runs ~15 other services. The HF Router gives me the same Gemma 4 weights without evicting half my fleet. If you're on dedicated hardware, swap the endpoint; the prompt and grouping logic don't change.
Production-y bits that came along for the ride
- Dedupe + retry. Cache key = sha1(title:source), 2-min TTL. Stops a runaway cron from re-analyzing the same payload 60x.
- Telegram inline keyboard: ACK / RESOLVE / Retry AI / open dashboard. The Retry AI button is my favorite: it re-invokes Gemma with the current event stack. Cheap second opinion when the first reasoning felt off.
- Severity escalation. The incident's stored severity is max(human-rule severity, AI severity). AI can upgrade LOW→CRITICAL but cannot downgrade a CRITICAL classification, by design.
- Confidence as UI signal. The dashboard shows conf 95% next to every root cause. Below 70% the UI hints "consider re-analysis or wait for more events."
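The dedupe rule from the list above can be sketched with Node's built-in crypto module. The `DedupeCache` class and its in-memory `Map` are assumptions for illustration; the actual service may keep this cache elsewhere (for example in Redis, given that BullMQ is already in the stack).

```typescript
import { createHash } from "node:crypto";

const TTL_MS = 2 * 60 * 1000; // 2-minute TTL

// Cache key = sha1("title:source"), hex-encoded.
function dedupeKey(title: string, source: string): string {
  return createHash("sha1").update(`${title}:${source}`).digest("hex");
}

// Hypothetical in-memory cache; expired entries are simply overwritten.
class DedupeCache {
  private readonly seen = new Map<string, number>(); // key -> expiry time (ms)
  private readonly ttlMs: number;

  constructor(ttlMs: number = TTL_MS) {
    this.ttlMs = ttlMs;
  }

  // Returns true when the payload should be analyzed (and records it),
  // false when an identical payload was already seen within the TTL.
  shouldProcess(title: string, source: string, now: number = Date.now()): boolean {
    const key = dedupeKey(title, source);
    const expiresAt = this.seen.get(key);
    if (expiresAt !== undefined && expiresAt > now) {
      return false;
    }
    this.seen.set(key, now + this.ttlMs);
    return true;
  }
}
```

A cron firing the same payload every second would get one `true` followed by `false` until the two minutes elapse, so the model is invoked once per window instead of 60 times.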
Stack summary
- NestJS 11 (Fastify): existing service, ~30 LOC of wiring to add the Gemma layer
- Prisma + Postgres: 1 new model (Incident), 3 new columns on Alert
- HuggingFace Inference Router: google/gemma-4-26B-A4B-it primary, gemma-4-31B-it fallback
- Alpine.js + Tailwind CDN: single-file dashboard, polls /incidents every 5s
- Telegram bot: HTML messages with inline keyboard, HMAC-signed callbacks
Single npm run demo:redis reproduces the entire flow from cold start.
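For readers curious what HMAC-signed callbacks might look like, here is one possible sketch. The payload layout (`ACTION:incidentId:mac`), the 16-hex-character truncation, and both function names are assumptions, not the repo's actual scheme. The truncation is there because Telegram limits callback_data to 64 bytes.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

const MAC_HEX_LENGTH = 16; // truncated tag, so the payload fits in 64 bytes

function mac(payload: string, secret: string): string {
  return createHmac("sha256", secret)
    .update(payload)
    .digest("hex")
    .slice(0, MAC_HEX_LENGTH);
}

// Build callback data such as "ACK:INC-260520-001:<mac>".
// Assumes neither the action nor the incident ID contains a colon.
function signCallback(action: string, incidentId: string, secret: string): string {
  const payload = `${action}:${incidentId}`;
  return `${payload}:${mac(payload, secret)}`;
}

// Returns the parsed action and incident ID, or null if the tag does not match.
function verifyCallback(
  data: string,
  secret: string
): { action: string; incidentId: string } | null {
  const parts = data.split(":");
  if (parts.length !== 3) return null;
  const [action, incidentId, received] = parts;
  const expected = mac(`${action}:${incidentId}`, secret);
  const a = Buffer.from(received);
  const b = Buffer.from(expected);
  // timingSafeEqual throws on unequal lengths, so check the length first.
  if (a.length !== b.length || !timingSafeEqual(a, b)) return null;
  return { action, incidentId };
}
```

Truncating the tag trades some forgery resistance for space. A longer tag, or a server-side lookup keyed by a short random token, would be a stricter alternative.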
What surprised me
I expected Gemma to be good at the language: paraphrasing logs, polishing summaries. What I didn't expect was how reliably it upgrades severity. The first event in my demo (a deploy) is mundane. The model only paints it as CRITICAL once it has the second and third alerts to connect the chain. That's not pattern matching, that's reasoning over a sequence. It's the behavior I'd want from a junior SRE on their third month.
The other surprise: confidence actually moves. Most LLM "confidence" outputs are 0.9 forever. Telling Gemma in the system prompt that 0.5 is honest got me back a useful spread of values that I can now drive UI on.
Try it
If it's empty when you look, the demo data may have expired; you're welcome to mentally substitute "redis cascade after deploy #441, severity CRITICAL, 95% confidence, with fixes." Or watch the next real incident roll through, which is the whole point.
