LLMs amplify whatever architecture you bring them. Including none.

The ordinary failure mode I keep seeing in "LLM-assisted infrastructure" pet projects is the one a home-lab Zabbix operator sketched recently: the alert that arrives on the way home from work, on a phone, declaring that a port speed on a switch in the lab has changed and that this is very, very important. Zabbix is doing exactly what it was configured to do. The configuration is the problem. Tuning the trigger thresholds by hand is the kind of work that never gets prioritised on a Saturday, and so the operator does what an increasing number of people in this position do: wonders whether to put an LLM in front of the alert pipeline and let it decide.

The naive version of that wondering — "I'll just hand the model my alerts and ask it to be smart about them" — produces predictable outcomes. The operator I'm reading walked through them up front, in a register I'd characterise as politely brutal about the limits of unstructured prompting. The takeaway, before they wrote a line of code, was that LLMs in this kind of pipeline don't replace engineers and don't hallucinate a coherent system into existence either. They amplify whatever architectural rigour you bring to the prompt — including the absence of any.

Let me unpack what that means in practice.

The two-camps fallacy

The first thing the operator's notes get out of the way is the framing-level error that swallows most of these conversations. There are two adjacent positions you've heard:

"Architects, programmers, and SREs aren't needed; the model will do all of it."
"LLMs hallucinate constantly; they can't be trusted to ship anything serious."

Both are true at exactly the boundary they describe and false everywhere else. The first fails the moment you ask the model to make a judgement that depends on context the prompt didn't carry: which alerts are noise on this particular cluster, what the on-call rotation looks like, how much tolerance the company has for false positives at 3am. The second fails the moment you give the model a well-specified subroutine to expand into code. The interesting truth is in the middle: on top of a well-specified architecture, an LLM is approximately a deterministic expansion engine; in the absence of one, it is approximately a fluent confabulation engine. The same model behaves like two different tools depending on what you put in front of it.

The home-lab operator's working framing is the one I'd keep: the model is not the design system. You are the design system, and the model is the implementation accelerator that runs on whatever quality of design you produce. If the design is mush, the implementation is mush. If the design is a cleanly bounded set of named components and contracts, the implementation tracks the design closely. The operator's first two weeks on this Zabbix project were spent writing the design, not writing code. The decision to spend those two weeks is the one that determines whether the rest of the project produces an "AIOps pipeline" or an architectural debt pile.

What perimeter means, specifically

The first concrete output of the operator's design work is a perimeter list: what the system does and, equally important, what it explicitly does not do. This is the architectural discipline most "smart alerting" projects skip and then regret. Without a written perimeter the LLM will, helpfully, expand into anything adjacent that the prompt suggests it might want.

For this project the perimeter looks roughly like:

Inside the perimeter: receive Zabbix webhooks; normalise events; store them durably enough to survive worker crashes; enrich via the Zabbix API; suppress low-importance noise based on policy and LLM triage; correlate events within a configurable window; deliver to Matrix and email; keep an audit trail.

Outside the perimeter: incident management (that's a separate system); auto-execution of recommended remediation commands (no model-issued kubectl, ever); training a custom model (out of scope for a pet project); a generalised root-cause-analysis engine (a hard problem; bounded RCA only).

That second list is where the work is. "We are explicitly not auto-executing remediation" is not a default; it's a position you have to take and then defend in the prompt, the prompt template, and the deployment. Without that statement the LLM will produce examples that include subprocess.run(...) in the recommendation pipeline because (of course it will) that's what so many of the Stack Overflow answers in its training data look like. With that statement, the model writes a system that produces advice and stops short of acting on it.

The same logic applies to "we are not building incident management." Zabbix-event-triage and ServiceNow-style incident-tracking look adjacent enough that an LLM, if invited, will conflate them. The perimeter is what keeps the project a project rather than a creeping reimagining of an entire ITSM stack.

Severity is a trust dial

Zabbix's documented trigger severity levels are Not classified, Information, Warning, Average, High, Disaster. They are also a perfectly serviceable trust dial for "how much LLM judgement is allowed in the loop on this event."

The operator's policy is the one I'd recommend reading as a baseline rather than a finished position:

Severity	Decision authority	LLM role	Required audit fields
Disaster, High	Human; LLM is not relied on for suppression	Optional context (recent flaps, related events, last successful change). LLM may enrich; LLM does not decide to suppress.	event_id, raw payload, enrichment results, who got paged, ack time
Average	Human, but operator pre-configures whether to use LLM triage at the policy level	If enabled, LLM may recommend "suppress as flap" or "deliver"; recommendation is logged regardless of decision	event_id, payload, LLM verdict, LLM confidence, policy version, final action
Warning, Information	LLM triage with policy override	LLM may suppress as flap, deduplicate against recent events, or escalate based on correlation	event_id, payload, LLM verdict, suppression reason, audit timestamps

The shape of this table is the load-bearing thing. A flat policy ("LLM triages everything" / "LLM triages nothing") is what you end up with when nobody designs for severity-stratified trust. The dial works because each rung gives the LLM exactly as much authority as the consequences of being wrong allow. A suppressed Disaster event is the kind of mistake that ends in a postmortem with a CTO present; a Warning event that gets incorrectly suppressed costs approximately nothing to recover from once the next event in the correlation window arrives. The dial reflects that asymmetry directly.

A second observation that falls out of the table: the audit fields shift toward the model's own outputs as authority shifts to the model. That's not paranoia. It's the prerequisite for ever debugging the system. When the model suppresses a Warning and the operator later discovers the event mattered, the only path to a fix is the logged LLM verdict and suppression reason, read against the policy version in force (and, ideally, the confidence score the Average row already records), because that's what tells you whether the rule was wrong, the model was wrong, or the prompt was wrong.
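
One way to make the dial concrete is a small lookup from severity to authority. The sketch below is mine, not the operator's: the Authority names, the decide function, and the choice to treat any unlisted severity (including Not classified, which the table leaves out) as human-only are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Authority(Enum):
    HUMAN_ONLY = "human_only"      # LLM may enrich, never suppresses
    HUMAN_WITH_ADVICE = "advice"   # LLM recommends, a human decides
    LLM_TRIAGE = "llm_triage"      # LLM may suppress, policy can override


# Hypothetical mapping of Zabbix severity names onto the three rows of the table.
SEVERITY_AUTHORITY = {
    "Disaster": Authority.HUMAN_ONLY,
    "High": Authority.HUMAN_ONLY,
    "Average": Authority.HUMAN_WITH_ADVICE,
    "Warning": Authority.LLM_TRIAGE,
    "Information": Authority.LLM_TRIAGE,
}


@dataclass(frozen=True)
class Decision:
    action: str                    # "deliver" or "suppress"
    decided_by: str                # "policy" or "llm"
    llm_verdict: Optional[str]     # kept for the audit log even when not followed


def decide(severity: str, llm_verdict: Optional[str],
           llm_triage_enabled: bool = True) -> Decision:
    # Assumption: severities missing from the mapping get the safest rung.
    authority = SEVERITY_AUTHORITY.get(severity, Authority.HUMAN_ONLY)
    may_suppress = authority is Authority.LLM_TRIAGE and llm_triage_enabled
    if may_suppress and llm_verdict == "suppress":
        return Decision("suppress", "llm", llm_verdict)
    # Everything else is delivered; the verdict is still recorded.
    return Decision("deliver", "policy", llm_verdict)
```

With this shape, decide("Disaster", "suppress") still delivers, and the ignored verdict travels with the decision so it can be written to the audit trail.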

Diagnose, don't act

The single design constraint I'd argue does the most work in this kind of project is also one of the simplest: the LLM's recommendations are bounded to diagnosis commands, not change commands. It can suggest zabbix_get, tail, journalctl, curl -I, iostat, top. It can't suggest systemctl restart, kill, anything with --force, anything that mutates state. The list of allowed verbs is short and explicit, and it lives in the prompt.

Why this matters: an LLM that tells the operator what to look at is leaning into the model's actual capability — pattern-matching from a large corpus of similar incidents to plausible diagnostic next-steps. An LLM that tells the operator what to fix is leaning into a capability the model doesn't reliably have, on infrastructure the model doesn't see. The first is leverage; the second is liability dressed as automation. Constraining the verb space is how you get the leverage without the liability.

This also pre-empts the most expensive failure mode of LLM-assisted ops: the recommendation-followed-without-verification. If the operator is staring at an alert at 2am and the model recommends journalctl -u nginx --since "10m ago", copy-paste is fine — running it is read-only. If the model recommends systemctl restart nginx, copy-paste means the operator just restarted production at 2am because a model in their lab said to. The verb-space constraint enforces the right ergonomics by construction.
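
A prompt-level verb list is easier to trust when the layer also checks the model's output against it. What follows is a minimal sketch of such a check, assuming recommendations arrive as single shell command strings; is_diagnostic and the two constants are hypothetical names, and the rules are deliberately crude (for example, curl flags other than -I / --head are not inspected, and journalctl's own housekeeping options are not filtered). It is a filter on what gets shown to the operator, not a sandbox.

```python
import shlex

# Read-only diagnostic binaries named in the text; treating this set as the
# complete allow-list is an assumption.
ALLOWED_BINARIES = {"zabbix_get", "tail", "journalctl", "curl", "iostat", "top"}
# Characters that would let a "diagnostic" command chain or redirect into a change.
SHELL_METACHARACTERS = set(";|&><`$")


def is_diagnostic(command: str) -> bool:
    """Return True only if the command passes the allow-list checks below."""
    if any(ch in SHELL_METACHARACTERS for ch in command):
        return False
    try:
        argv = shlex.split(command)
    except ValueError:             # unbalanced quotes and similar
        return False
    if not argv or argv[0] not in ALLOWED_BINARIES:
        return False
    if any(arg == "--force" or arg.startswith("--force=") for arg in argv[1:]):
        return False
    # curl is accepted only when restricted to a HEAD request.
    if argv[0] == "curl" and not ({"-I", "--head"} & set(argv[1:])):
        return False
    return True
```

The 2am example from above passes: is_diagnostic('journalctl -u nginx --since "10m ago"') is True, while is_diagnostic("systemctl restart nginx") is False, so the second recommendation would never reach the notification.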

State the system has to keep

What surprises operators new to this design is how much state a "lightweight LLM layer" actually needs in front of it. The model is mostly stateless per request; the layer around it is not. The audit-log motivation gets you most of the way there, but the deduplication, flap detection, and correlation requirements add the rest:

Recent-event memory for the configurable correlation window (typically minutes), so the layer can recognise a Warning on host-A as the symptom of a High on its upstream router from 90 seconds earlier.
Deduplication state: an alert that fires every 30 seconds for an hour should produce one notification with a "still firing" suffix, not 120.
Flap detection: an alert that goes ok→problem→ok→problem twelve times in five minutes is not the same alert pattern as one that fires once and stays asserted; the layer needs to suppress the noise and surface the flap-as-symptom.
Recovery state: an alert that fires and resolves itself before the LLM finishes thinking should produce a resolved notification with the diagnostic context, not a firing notification that gets contradicted ten seconds later.

All of this is unsexy infrastructure that the LLM does not solve. The LLM acts on top of a stateful pipeline that already deduplicates, correlates, and tracks recovery. Without that pipeline, the LLM is asked to do all of those tasks per request, and it does them inconsistently because it has no memory between requests. Most of the engineering in a project like this is in the layer the LLM sits on, not the LLM itself.
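
To give a sense of how little of this involves the model, here is a minimal in-memory sketch of the deduplication and flap-detection pieces. EventState, its parameters, and the three return labels are hypothetical; the threshold of six transitions in five minutes is an arbitrary illustration, and a real layer would persist this state so it survives worker crashes.

```python
from collections import defaultdict, deque


class EventState:
    """Per-trigger state kept in memory; illustration only."""

    def __init__(self, flap_window_s: float = 300.0, flap_threshold: int = 6):
        self.flap_window_s = flap_window_s
        self.flap_threshold = flap_threshold     # status changes within the window
        self._status = {}                        # key -> last seen status
        self._transitions = defaultdict(deque)   # key -> timestamps of status changes

    def observe(self, key: str, status: str, now: float) -> str:
        """Classify an incoming event as 'notify', 'duplicate', or 'flapping'."""
        previous = self._status.get(key)
        self._status[key] = status
        if previous == status:
            return "duplicate"                   # same status again: no new notification
        changes = self._transitions[key]
        changes.append(now)
        while changes and now - changes[0] > self.flap_window_s:
            changes.popleft()                    # forget changes outside the window
        if len(changes) >= self.flap_threshold:
            return "flapping"
        return "notify"
```

An alert that re-fires every 30 seconds is classified as a duplicate after its first event, and one that bounces between problem and ok is reported as flapping once it crosses the threshold. None of that requires a model call.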

What the design produces, and what it produces well

The architecture sketch the operator settled on, before any model selection or implementation, is easiest to show in a Willison-favourite shape, as the request envelope itself:

# Zabbix webhook → normalised internal envelope (what the layer expects)
{
    "event_id": "zbx-24917341",
    "received_at": "2026-05-05T14:33:08Z",
    "host": "router-edge-01",
    "trigger": "Interface ge-0/0/3: link speed changed",
    "severity": "Warning",                # Disaster|High|Average|Warning|Information
    "payload": {...},                     # raw zabbix output
    "enrichment": {                       # filled by the API-fetcher worker
        "host_tags": [...],
        "recent_events_60s": [...],
        "last_change_to_trigger_5d": "...",
    },
    "policy": {                           # from the operator's config
        "llm_triage_allowed": True,
        "auto_suppress_allowed": True,
        "delivery_channels": ["matrix:#noc"],
    },
    "audit": {
        "policy_version": "v0.4",
        "received_by_worker": "ingest-2",
    },
}

The point of the envelope shape is that everything the LLM needs in order to make a good decision is already present in the structured fields, and everything the operator needs in order to debug the LLM's decisions afterwards is in audit. The model sees a normalised view; the policy decides what authority the model has on this severity; the audit log keeps every decision. The LLM doesn't have to know about Zabbix-the-product — it sees only the envelope. That's how you keep the LLM swappable later.
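
Swappability can be sketched as a plain function boundary. In the example below, model_view, triage, always_deliver, and the choice of which envelope fields the model may see are all assumptions of mine; the only thing taken from the envelope above is its field names.

```python
from typing import Any, Callable, Dict

Envelope = Dict[str, Any]
# Any callable with this signature can be plugged in: a local model,
# a managed API client, or a stub.
Triager = Callable[[Envelope], Dict[str, Any]]

# Assumption: the model sees the normalised fields, not the raw payload or policy.
MODEL_VISIBLE_FIELDS = ("event_id", "host", "trigger", "severity", "enrichment")


def model_view(envelope: Envelope) -> Envelope:
    """Project the envelope onto the fields the model is allowed to see."""
    return {k: envelope[k] for k in MODEL_VISIBLE_FIELDS if k in envelope}


def triage(envelope: Envelope, triager: Triager) -> Envelope:
    """Run a triager and record its verdict under audit, leaving the input untouched."""
    verdict = triager(model_view(envelope))
    audit = {**envelope.get("audit", {}), "llm_verdict": verdict}
    return {**envelope, "audit": audit}


def always_deliver(view: Envelope) -> Dict[str, Any]:
    """Stub triager standing in for a real model call."""
    return {"action": "deliver", "confidence": 1.0}
```

Replacing always_deliver with a function that calls a real model changes nothing else in the pipeline, which is the property the envelope is meant to buy.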

What this is actually a guide for

The reason I find this kind of write-up instructive is not the Zabbix specifics; many ops teams don't run Zabbix in their day job. The transferable lesson is the evaluation pattern. If you're a team considering whether to introduce LLM-assisted alerting (or, by extension, LLM-assisted code review, LLM-assisted ticket triage, LLM-assisted anything in operations), the question is not "is the model good enough yet?" The question is "are we good enough at writing, in advance, the architecture the model needs in order to be evaluated against it?"

The operator's spec-first approach is the answer to that: two weeks of perimeter, severity policy, audit-field design, and a precise envelope shape, all before any model is picked. With that work in hand, the model selection becomes a real comparison: how does Llama-3.1-8B-Instruct on Ollama do on this pipeline versus a managed API call to Claude or GPT, on the same envelope, with the same allowed-verb list, under the same audit constraints? Without that work, the comparison is "which model produces the most plausible-sounding free-text triage output," which is a question that has no operationally useful answer.
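
Once the envelope is fixed, the comparison can be as plain as running each candidate over the same operator-labelled events and counting the two kinds of mistake separately. The harness below is a hypothetical sketch: evaluate, the label values "suppress" and "deliver", and the counter names are my assumptions, and the triagers it scores would wrap whichever models are under comparison.

```python
from typing import Any, Callable, Dict, Iterable, Tuple

Envelope = Dict[str, Any]
Triager = Callable[[Envelope], str]   # returns "suppress" or "deliver"


def evaluate(triager: Triager,
             labelled: Iterable[Tuple[Envelope, str]]) -> Dict[str, int]:
    """Count agreement with operator labels, keeping the two error types apart."""
    counts = {"correct": 0, "missed_important": 0, "unneeded_noise": 0}
    for envelope, expected in labelled:
        got = triager(envelope)
        if got == expected:
            counts["correct"] += 1
        elif got == "suppress":
            counts["missed_important"] += 1   # suppressed what the operator wanted delivered
        else:
            counts["unneeded_noise"] += 1     # delivered what the operator would have dropped
    return counts
```

Keeping missed_important apart from unneeded_noise matters because, per the severity table, the two errors do not cost the same; a single accuracy number would hide exactly the asymmetry the policy was designed around.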

The framing the operator's piece converges on, and the one I'd take from it, is that "AIOps with LLMs" is not a category of system you build — it's a category of system you evaluate. The architectural discipline is what makes the evaluation meaningful. Without it, there's no system to evaluate; just a free-text generator with infrastructure access.

What I'd take from it

The cleanest framing I keep coming back to is one the operator's own piece supplies almost in passing: the LLM works in a defined problem space, not in your hopes about what it should do. The work of giving the model a defined problem space — perimeter, severity-stratified trust, allowed verbs, audit fields, envelope schema — looks like documentation overhead the first time you do it. The second time you do it, it looks like the only part of the project that didn't need to be redone.

The project the operator is building is small and the stakes are low; nothing in their home lab is going to pager-duty the company at 3am. The lesson, however, generalises in exactly the direction it always has: when the consequences scale, only the projects whose architectural discipline scaled with them stay legible. That's true of LLM-assisted infrastructure because it's true of infrastructure in general. The LLM doesn't change the rule. It just makes the absence of the rule cheaper to ignore in the prototype phase, and more expensive in production.