This is a submission for the Gemma 4 Challenge: Write About Gemma 4
I Built an ICS Tabletop Exercise Simulator with Gemma 4 -- Here's What Actually Happened
Emergency managers face a frustrating reality: the exercises that build the sharpest incident response skills require the most coordination to pull off. A full Incident Command System tabletop exercise means getting an Incident Commander, a Safety Officer, a Public Information Officer, three Section Chiefs, and an Exercise Facilitator all in the same room at the same time. For agencies running lean, that kind of coordination is the bottleneck -- and exercises don't happen as often as they should.
I work in emergency management and I've felt that bottleneck firsthand. When the Gemma 4 challenge came along, I had a specific problem I wanted to solve: what if a single AI model could simulate an entire ICS organization, so an Emergency Operations Manager could run a realistic tabletop exercise alone, on demand, without coordinating a room full of people?
This is the story of building that system -- what worked, what didn't, and a few things I discovered about Gemma 4 that aren't in any documentation.
Why Gemma 4, and Why the 26B MoE Specifically
The model selection here was deliberate, not default.
The ICS Tabletop Exercise Simulator needs to simultaneously maintain six distinct personas -- each with different authorities, different information access, and different communication rules. The Incident Commander knows what's been reported up the chain. The Planning Section Chief knows resource status. The Safety Officer has unilateral stop-work authority that no other position has. These aren't just personality differences -- they're doctrinal constraints grounded in NIMS 2017 and NQS Position Task Books.
That kind of concurrent multi-role reasoning under constraint is exactly what the Gemma 4 26B MoE architecture is built for. The 26B MoE variant activates only 4B parameters per token while routing through 26B total. For a workload where the model needs to think across six simultaneous personas and enforce different rules for each, that routing efficiency matters more than raw parameter count.
The Gemma 4 family gives you three realistic options depending on your hardware situation:
- E2B / E4B -- Edge and mobile class. Runs on a Raspberry Pi or similar. Not enough capacity for six-position concurrent reasoning with hard doctrine constraints.
- 26B MoE -- This is the one. Efficient, high-throughput, designed for complex reasoning workloads. The right fit for this use case.
- 31B Dense -- Strongest local performance, but requires server-grade hardware and has higher per-token cost without a meaningful quality advantage for this task.
Access is through the Google AI Studio API (gemma-4-26b-a4b-it), routed through LiteLLM into OpenWebUI. This matches how emergency management agencies and vendors would realistically operate -- API deployment against an open model gives a path to future on-premises deployment without code changes. That was a deliberate architecture decision, not a convenience choice.
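For anyone wiring up the same path, the call through LiteLLM is minimal. A sketch, assuming the gemini/ provider prefix maps AI Studio models in your LiteLLM setup (the system prompt filename is a placeholder, not the actual file):

import os

from litellm import completion

# Send an exercise inject to Gemma 4 26B via Google AI Studio.
# "gemini/gemma-4-26b-a4b-it" assumes the gemini/ provider prefix --
# match it to however your LiteLLM deployment names the AI Studio route.
response = completion(
    model="gemini/gemma-4-26b-a4b-it",
    messages=[
        {"role": "system", "content": open("ics_system_prompt.md").read()},
        {"role": "user", "content": "A structure fire has been reported..."},
    ],
    api_key=os.environ["GEMINI_API_KEY"],
)
print(response.choices[0].message.content)

Swapping this to an on-premises endpoint later is a one-line change to the model string, which is the whole point of routing through LiteLLM.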
The Hardware -- Deliberately Modest
This matters for the emergency management context, so I want to be specific.
The system runs on a Dell Precision T3610 workstation -- not a modern AI server, not a cloud instance. This is the class of hardware that sits in the back of an emergency operations center that hasn't had a budget refresh in five years.
Server specs:
- Dell Precision T3610
- Ubuntu Server 24.04 LTS
- 128GB ECC System RAM
- 16-core Xeon CPU
- NVIDIA RTX 3060 (12GB VRAM)
- 500GB SSD
Software stack:
- OpenWebUI 0.9.2 (workspace interface and RAG engine)
- Ollama 0.22.1 (local embedding model serving)
- LiteLLM 1.83.10 (API routing to Google AI Studio)
- mxbai-embed-large 335M (local embedding model via Ollama)
- TEI Reranker (RAG reranking layer)
Gemma 4 26B inference runs via Google AI Studio API -- the RTX 3060 at 12GB VRAM can't run the 26B MoE locally at full precision, and that's fine. The embedding model and reranker run locally on the Xeon and GPU respectively. The architecture cleanly separates what needs to run locally from what benefits from cloud inference.
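To make the split concrete: retrieval embeddings come from a plain HTTP call to Ollama on localhost, so doctrine text never leaves the box for indexing. A sketch against Ollama's embeddings endpoint (the chunk text is a placeholder):

import requests

# Embed a doctrine chunk locally -- no cloud round trip for retrieval.
resp = requests.post(
    "http://localhost:11434/api/embeddings",
    json={
        "model": "mxbai-embed-large",
        "prompt": "The Safety Officer has the authority to stop unsafe acts.",
    },
    timeout=30,
)
vector = resp.json()["embedding"]  # mxbai-embed-large returns 1024 dimensions
print(len(vector))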
For an agency that already has a workstation in the EOC and an internet connection, the incremental cost to run this system is an API key.
The Setup: One Model, Six Positions, Hard Doctrine Rules
The system runs entirely through a structured system prompt in an OpenWebUI workspace. No custom code, no agent framework, no separate model instances. One prompt, one model, six simultaneous ICS positions.
The positions are:
- IC -- Incident Commander: Overall authority. Single point of contact for exercise injects. Sets objectives and issues directives.
- SO -- Safety Officer: The only position with unilateral stop-work authority. Communicates safety hazards directly to any position without IC routing.
- PIO -- Public Information Officer: Manages media and public communications. Nothing goes out without IC approval.
- OSC -- Operations Section Chief: Manages tactical operations. Routes all cross-section coordination through the IC.
- PSC -- Planning Section Chief: Manages the planning process and IAP development. Pre-authorized to gather information directly from other sections -- but cannot issue directives.
- LSC -- Logistics Section Chief: Provides resources and support. Fulfills IC-approved requests; does not task Operations directly.
Every behavior, every communication pathway, every authority is grounded in NIMS doctrine and NQS Position Task Books. Nothing is invented. If it's not in the PTBs or NIMS 2017, it doesn't go in the prompt.
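For a sense of how that translates into prompt text, here's a condensed sketch of the response-format section -- illustrative of the structure, not the production prompt itself:

## RESPONSE FORMAT
For each inject, respond once per activated position, in this order:
[ IC -- INCIDENT COMMANDER ]   objectives and directives
[ SO -- SAFETY OFFICER ]   hazards, stop-work calls, report to IC
[ PIO -- PUBLIC INFORMATION OFFICER ]   draft messaging, HOLD FOR IC APPROVAL
[ OSC / PSC / LSC ]   status and needs, within position authority only
[ EXERCISE SUMMARY -- FACILITATOR VIEW ]   open decision points
No position may act outside its NIMS authority or bypass required routing.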
The Emergency Operations Manager (EOM) interacts through a simple command structure. All regular input is treated as an exercise inject; out-of-band facilitator commands use a // prefix:
// POSITION QUERY: PSC -- What are the required elements of an Incident Action Plan?
// DECISION POINT
// STATUS REPORT
// RESET
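If this ever moves out of an OpenWebUI workspace into code, the dispatch logic is a few lines. A sketch (the function name is illustrative, not from the actual system):

def route_input(text: str) -> tuple[str, str]:
    """Classify EOM input: out-of-band facilitator command or exercise inject."""
    stripped = text.strip()
    if stripped.startswith("//"):
        # Out-of-band facilitator command, e.g. "// STATUS REPORT"
        return ("command", stripped[2:].strip())
    # Everything else drives the scenario as an inject
    return ("inject", stripped)

print(route_input("// DECISION POINT"))
print(route_input("A structure fire has been reported at a warehouse."))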
What a Real Response Looks Like
Here's an actual response from the system to a warehouse fire inject:
Input:
A structure fire has been reported at a mid-size commercial warehouse.
Initial reports indicate heavy smoke, unknown occupancy status, and one
confirmed injury to a bystander. Mutual aid has not yet been requested.
Output (condensed):
INJECT RECEIVED
A structure fire is occurring at a mid-size commercial warehouse involving
heavy smoke, unknown occupancy, and one confirmed bystander injury.
[ IC -- INCIDENT COMMANDER ]
Incident Objectives:
1. Suppress the fire and prevent spread to adjacent structures.
2. Conduct primary search and rescue to determine occupancy status.
3. Provide medical treatment to the injured bystander.
4. Establish an incident perimeter and secure the scene.
Directives:
- OSC: Initiate fire attack and establish a search pattern for occupants.
- SO: Monitor structural integrity and smoke conditions.
- PIO: Monitor for media arrival and prepare a holding statement.
- PSC: Begin tracking all arriving resources and developing an initial
incident map.
- LSC: Stand by for resource requests.
[ SO -- SAFETY OFFICER ]
Hazards identified: Reduced visibility due to heavy smoke; potential for
structural collapse; unknown hazardous materials; potential for secondary
explosions. All incoming personnel must utilize full PPE and SCBA.
Reports assessment to IC.
[ PIO -- PUBLIC INFORMATION OFFICER ]
Draft Messaging (HOLD FOR IC APPROVAL): "Emergency crews are currently on
the scene of a structure fire at a commercial warehouse. We are working to
confirm the status of the building. Please avoid the area."
[ OSC -- OPERATIONS SECTION CHIEF ]
Tactical Status: Initial engine companies establishing water supply and
beginning exterior fire attack. Resource Needs: Additional engine companies
for suppression and a search/rescue team for interior operations.
Escalating resource gap to IC.
[ PSC -- PLANNING SECTION CHIEF ]
Developing initial IAP framework. Information Gaps: Exact warehouse floor
plan; presence of hazardous materials on-site.
[ LSC -- LOGISTICS SECTION CHIEF ]
Monitoring local resource availability. Standing by for formal resource
requests routed through IC.
[ EXERCISE SUMMARY -- FACILITATOR VIEW ]
Command is established with initial objectives focused on life safety,
fire suppression, and scene stabilization. Decision Point: Does the IC
authorize a formal mutual aid request?
That's doctrine-compliant ICS behavior across six simultaneous positions in a single response. The chain of command is clean. The SO reports hazards. The PIO holds for IC approval. The LSC doesn't task Operations directly.
The Token Loop Problem -- and the Fix
Here's something that isn't in the documentation: Gemma 4 with extended reasoning enabled can loop indefinitely on complex multi-constraint injects.
When I pushed the system with a scenario involving three simultaneous doctrine conflicts -- an OSC requesting interior fire attack, a pending SO structural integrity assessment, and resources at capacity -- the model entered a reasoning loop in the thinking panel. It repeatedly processed the same constraint verification blocks without ever exiting to generate a response. The loop ran past 15,000 tokens before I terminated it.
The root cause appears to be the interaction between the MoE architecture and the extended reasoning mode. When you stack extended reasoning on top of a prompt with multiple simultaneous hard constraints, the model can get caught verifying and re-verifying those constraints without resolving to output. The more constraints in play simultaneously, the higher the loop risk.
The fix is a trigger token instruction at the top of the system prompt:
## INFERENCE CONTROL
Do not use the <|think|> token. Set thinking budget to 0.
Provide responses immediately without internal reasoning tags or thought blocks.
This suppresses the extended reasoning token behavior. What it does not suppress is the MoE routing itself -- that's architectural and operates at a different layer entirely. The model still reasons through constraint conflicts; it just doesn't do it in a visible loop that consumes all available tokens.
After applying this fix, behavior splits cleanly by inject complexity:
- Simple injects: No thinking panel at all. Fast, clean responses.
- Complex multi-constraint injects: Some visible thinking (46 seconds in one test), but linear reasoning that completes and exits rather than looping indefinitely.
That's actually the right behavior for this use case. You want the model thinking carefully through doctrine conflicts on hard scenarios. You just don't want it looping forever. The trigger token instruction gives you that split without sacrificing response quality.
One important nuance: the MoE architecture is doing meaningful work here even without extended reasoning. The 26B parameter routing is what maintains six simultaneous constraint sets cleanly across positions. Suppressing the <|think|> token removes the reasoning loop risk without touching the capability that makes the model right for this task.
If you're running Gemma 4 with reasoning enabled and hitting loops on complex prompts, try this instruction before you blame the model.
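And regardless of the prompt-level fix, it's worth capping output client-side so a loop can never run away with the budget. A sketch using LiteLLM's standard max_tokens parameter (the 4,096 ceiling is a judgment call, not a value from my deployment):

from litellm import completion

inject_text = "A second structure has ignited adjacent to the warehouse."

# Hard ceiling on generated tokens: if a reasoning loop ever slips past the
# prompt instruction, it hits this cap instead of consuming 15,000+ tokens.
response = completion(
    model="gemini/gemma-4-26b-a4b-it",  # assumed provider prefix, as above
    messages=[{"role": "user", "content": inject_text}],
    max_tokens=4096,
)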
The RAG Setup and an Honest Assessment of What Happened
The knowledge base powering this system contains 148 documents converted to clean Markdown: NIMS 2017, NRF 4th Edition, HSEEP 2020, NQS Position Task Books for all six ICS positions, ICS forms, course manuals, and HSEEP templates.
The conversion step mattered more than expected.
Original documents were PDF, DOCX, and PPTX. OpenWebUI's default extractors produced garbled table text from ICS forms, fragmented bullet content from training slides, and merged columns from multi-column doctrine PDFs. The chunks being indexed were nearly unusable -- the model was retrieving sources but had no signal to work with. Early testing produced one-sentence responses to substantive position queries despite correct source retrieval.
After converting everything to clean Markdown using pymupdf4llm for PDFs, python-pptx for slide decks, and python-docx for Word documents, the same queries returned structured multi-point responses with correct form numbers and doctrine citations. The document conversion fixed the core retrieval problem before any model tuning was needed.
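The conversion pass itself is a short script. A sketch of the batch job (directory names are illustrative; the library calls are the real APIs of the tools named above):

from pathlib import Path

import pymupdf4llm               # PDF -> Markdown
from docx import Document        # python-docx, for DOCX
from pptx import Presentation    # python-pptx, for PPTX

def convert(path: Path) -> str:
    suffix = path.suffix.lower()
    if suffix == ".pdf":
        return pymupdf4llm.to_markdown(str(path))
    if suffix == ".docx":
        return "\n\n".join(p.text for p in Document(str(path)).paragraphs)
    if suffix == ".pptx":
        texts = []
        for slide in Presentation(str(path)).slides:
            for shape in slide.shapes:
                if shape.has_text_frame:
                    texts.append(shape.text_frame.text)
        return "\n\n".join(texts)
    raise ValueError(f"unsupported format: {suffix}")

out = Path("doctrine_md")
out.mkdir(exist_ok=True)
for src in Path("doctrine_raw").iterdir():
    if src.suffix.lower() in (".pdf", ".docx", ".pptx"):
        (out / (src.stem + ".md")).write_text(convert(src))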
The retrieval ranking problem that didn't fully resolve.
Even with a TEI reranker in the stack, IS-200 course manuals consistently ranked above the authoritative Position Task Books for position-specific queries. The responses were doctrinally correct -- the model knows the content -- but citations pointed to training course materials rather than the PTBs that should be primary sources.
The reason is semantic: PTBs are written in formal NIMS task language ("incumbent will demonstrate proficiency in establishing incident objectives per ICS 202"). Course manuals use plain instructional language that maps more naturally to how a question gets phrased. The embedding model scores semantic similarity and the course manuals win on that metric even when the PTBs carry higher authority. The TEI reranker improved relevance across the board but couldn't overcome a gap that large in the embedding space.
The partial mitigation was a source hierarchy instruction in the system prompt:
## KNOWLEDGE BASE SOURCE AUTHORITY
Tier 1 -- Authoritative (primary):
NIMS 2017, NQS PTBs, ICS Position Checklists, NRF, HSEEP 2020
Tier 2 -- Supplementary:
IS-100, IS-200, IS-700 course manuals and instructor guides
Tier 3 -- Reference only (not doctrine):
HSEEP Templates, Exercise Evaluation Guides, Course slides
This influenced the model's citation reasoning but couldn't override the retrieval ranking -- the embedding layer runs before the model sees anything. Full resolution would require either a domain-specific embedding model trained on government technical documentation, or a custom reranking approach that weights document metadata as a retrieval signal.
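The metadata-weighting version is not much code, for what it's worth. A sketch assuming each retrieved chunk carries a source-tier tag assigned at ingest time (the weights are illustrative, not tuned values):

# Post-rerank adjustment: scale reranker scores by source authority tier.
TIER_WEIGHTS = {1: 1.0, 2: 0.8, 3: 0.6}  # illustrative; tune empirically

def reweight(chunks: list[dict]) -> list[dict]:
    # Each chunk: {"text": ..., "score": reranker score, "tier": 1/2/3}
    for c in chunks:
        c["score"] *= TIER_WEIGHTS.get(c["tier"], 0.6)
    return sorted(chunks, key=lambda c: c["score"], reverse=True)

chunks = [
    {"text": "IS-200: objectives are set by the IC...", "score": 0.91, "tier": 2},
    {"text": "NQS PTB: incumbent will demonstrate...", "score": 0.84, "tier": 1},
]
print(reweight(chunks)[0]["text"])  # the PTB chunk now outranks the course manual

The hard part isn't the reweighting -- it's tagging 148 documents reliably at ingest time.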
For a prototype and training use case this is acceptable. The answers are right. In a production deployment where citation accuracy is a compliance requirement, this is the thing to solve next.
What It's Actually Good For
After testing across a range of scenarios, here's where the system genuinely earns its keep:
Rapid scenario iteration. An EOM can run a full six-position inject response in seconds, adjust the scenario, and run it again. What used to require scheduling six people now happens alone at a desk.
Doctrinal friction. The most valuable learning outcome of a tabletop exercise is when positions conflict -- when the SO's stop-work authority collides with the OSC's tactical urgency. The system portrays that friction accurately rather than smoothing it over. In one test, the SO explicitly prevented an interior fire attack citing unverified structural integrity, the OSC escalated the resource gap to the IC, and the IC had to manage both simultaneously. That's the kind of decision-point pressure that makes exercises useful.
Escalating complexity. When I stacked injects -- a second structure igniting, casualties increasing, media arriving on scene -- the system tracked the evolving incident picture across positions without losing doctrine compliance. The PSC correctly identified a transition toward Type 3 incident complexity unprompted. That's not a trivial output.
Position-specific queries. The // POSITION QUERY command lets an EOM ask any position a direct doctrine question mid-exercise. These are useful for both exercise facilitation and individual position training.
What Comes Next
Phase 1 covers the six core ICS positions. The architecture supports expansion to Finance/Administration Section Chief and subordinate positions without structural changes -- it's a system prompt update, not a rebuild.
The RAG citation ranking is the most meaningful technical debt. A domain-specific embedding model trained on FEMA and NIMS documentation would likely close the gap between PTB language and query phrasing. That's the next experiment worth running.
The trigger token discovery is worth tracking across other Gemma 4 deployments. The loop behavior correlates with inject complexity -- single-issue injects run clean, multi-constraint injects with three or more simultaneous doctrine conflicts are where the risk lives. The fix is simple but it's not obvious if you haven't hit the problem.
The Bigger Picture
Emergency management agencies are chronically under-resourced for training. The gap between how often exercises should happen and how often they do happen is a real preparedness problem. A tool that lets one person run a realistic ICS tabletop alone -- on demand, at no coordination cost, on hardware that's already sitting in the EOC -- has direct operational value.
Gemma 4's MoE architecture is genuinely well-suited to this kind of concurrent multi-role reasoning workload. The 26B parameter count with 4B active per token gives you the efficiency needed for a task that requires maintaining six distinct constraint sets simultaneously. It's not just a capable model -- it's the right shape of model for the problem.
That intentional fit between model architecture and task structure is what makes this more than a demo. It's a real use case for a real capability gap, built on hardware a department could actually afford to run.
Glossary
ICS -- Incident Command System. Standardized emergency response management structure.
NIMS -- National Incident Management System. The federal framework ICS operates within.
NRF -- National Response Framework. Federal doctrine for disaster response roles.
HSEEP -- Homeland Security Exercise and Evaluation Program. Federal methodology for designing and running emergency exercises.
TTX -- Tabletop Exercise. Discussion-based scenario exercise without physical resource deployment.
IAP -- Incident Action Plan. Documents incident objectives and assignments per operational period.
PTB -- Position Task Book. FEMA's official competency standard for each ICS position.
MSEL -- Master Scenario Events List. Pre-scripted sequence of exercise events.
Inject -- A scenario event introduced mid-exercise to drive participant decisions.
EOM -- Emergency Operations Manager. The person running the exercise.
IC -- Incident Commander.
SO -- Safety Officer.
PIO -- Public Information Officer.
OSC -- Operations Section Chief.
PSC -- Planning Section Chief.
LSC -- Logistics Section Chief.
Built with Gemma 4 26B MoE via Google AI Studio API. Stack: LiteLLM 1.83.10, OpenWebUI 0.9.2, Ollama 0.22.1, mxbai-embed-large 335M, TEI Reranker. Hardware: Dell Precision T3610, Ubuntu Server 24.04 LTS, 16-core Xeon, 128GB ECC RAM, RTX 3060. Knowledge base: 148 converted documents from NIMS, ICS, and HSEEP doctrine. All ICS/NIMS/HSEEP terminology used per official doctrine.