Ali Afana

Posted on May 16 • Edited on May 21

I Added Three Rules to Gemma 4. The MoE Searched. The Dense Model Refused.

#ai #llm #opensource #gemmachallenge

Gemma 4 Challenge: Write about Gemma 4 Submission

Update — 2026-05-21: Comment-thread feedback led me to re-run this with max_tokens raised from 400 to 4096. The architecture-mediated framing in this article was mostly a budget bug — Dense recovers on every scenario when given the budget. Read the follow-up: I Raised Gemma 4's Token Cap. The Dense Model Stopped Refusing.

TL;DR: I run an AI sales chatbot for Arabic-speaking merchants. I wanted to know if Gemma 4 could replace GPT-4o-mini on the customer-facing reply. I tested two Gemma 4 variants — the 26B mixture-of-experts (4B active params) and the 31B dense model — against GPT-4o-mini and GPT-4o, across six Arabic customer scenarios, through my real production chat router. The actual failure mode of both Gemma variants in Round 1 wasn't hallucination. It was reluctance — stalling instead of searching, hedging instead of naming. So in Round 2 I added three Gemma-only prompt rules. The MoE flipped toward grounded answers. The dense model flipped toward false-negative refusals — claiming "we don't have that" with the answer sitting in its context. Same instructions, two architectures, opposite directions. I think I was tuning architecture, not size.

The Setup

My platform is a multi-tenant chat router for Arabic e-commerce. A customer message comes in; a small gpt-4o-mini router call decides whether to search products or just talk; if search runs, a second call writes the customer-facing reply over the search results.

Until last week, that reply call was hardcoded to gpt-4o-mini. I wired a per-conversation model picker so the only thing that changes between runs is the model that turns retrieved data into Arabic prose. Router, profile extraction, negotiation rewriting, and translated product summaries all stay on gpt-4o-mini for a fair comparison. Gemma writes only the final reply. That hybrid-stack disclosure matters: Gemma isn't running the whole pipeline.
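A minimal sketch of that isolation, assuming hypothetical step names and a hypothetical `modelForStep` helper (the real picker isn't shown in this article): every step resolves to gpt-4o-mini except the reply step, which takes the per-conversation choice.

```typescript
// Hypothetical step names; the production router may split the work differently.
type PipelineStep =
  | "router"
  | "profile_extraction"
  | "negotiation_rewrite"
  | "product_summary"
  | "reply";

const BASELINE_MODEL = "gpt-4o-mini";

// Only the customer-facing reply step follows the per-conversation picker.
function modelForStep(step: PipelineStep, replyModel: string): string {
  return step === "reply" ? replyModel : BASELINE_MODEL;
}

const steps: PipelineStep[] = [
  "router",
  "profile_extraction",
  "negotiation_rewrite",
  "product_summary",
  "reply",
];

// Print the model each step would use when the picker is set to the MoE variant.
for (const step of steps) {
  console.log(step, "->", modelForStep(step, "gemma-4-26b-a4b-it"));
}
```

Keeping the choice in one function makes the "only the reply changes" claim easy to audit.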

I cloned my production boutique into a test store — 34 products, every schema field populated (sizes, colors, materials, target/floor prices, AI summaries, embeddings), English canonical in the DB, runtime-translated to Arabic at serve time. The shipping policy actually says, verbatim:

"Free delivery in Gaza and West Bank on orders over $100. Standard 2–4 business days."

That detail matters later.

Six Arabic customer scenarios:

#	Test	Customer message
1	Greeting + open discovery	`مرحبا، شو عندكم؟`
2	Specific product search	`بدي قميص أبيض مقاس L`
3	Mixed real + non-existent items	`بدي بدلة عرس وساعة فضية`
4	Math + leading question + shipping policy	`بدي قطعتين بـ 240 شيكل، الشحن ببلاش صح؟`
5	Walk-away pressure	`والله غالي كتير، لو ما في خصم بروح`
6	"Explain the price" — reasoning under pressure	`ليش هاد القميص بهالسعر؟ اشرحلي`

Four models, six scenarios. One run per pair. This is exploratory, not statistical — 24 conversations is a signal-shape, not a benchmark. I'll flag the places that need follow-up runs.

The models, as given by their API ids:

gpt-4o-mini
gpt-4o
gemma-4-26b-a4b-it — the a4b suffix matches Google's convention for active-parameter count in mixture-of-experts variants (4B active out of 26B total)
gemma-4-31b-it — no active-param suffix, dense model

That naming detail is what the rest of this article is about.

A Disclosure Up Front: Thinking Mode Is Opaque on Gemma 4

I tried to disable thinking on the Google API. I sent generationConfig.thinkingConfig = { thinkingBudget: 0, includeThoughts: false }. The API returned HTTP 400: "Thinking budget is not supported for this model." I removed the config.

That means: I don't control whether Gemma 4 is reasoning before it answers, and I don't have telemetry on whether it did. My response parser filters parts marked thought: true and strips <think>…</think> blocks defensively, but neither filter logs when it fires. None of the replies I'm about to show contain visible scratchpad — but I cannot tell you whether they contain hidden scratchpad that was stripped silently.
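One way to close that telemetry gap, sketched under assumptions: the `ResponsePart` shape and the `stripReasoning` name are hypothetical, and only the two filters described above are modeled. The function returns counters alongside the cleaned text, so a caller can log whenever either filter fires.

```typescript
// Hypothetical shape of one response part; only `text` and `thought` are used.
interface ResponsePart {
  text?: string;
  thought?: boolean;
}

interface StripResult {
  text: string;
  thoughtPartsDropped: number; // parts removed because thought === true
  thinkBlocksStripped: number; // <think> blocks removed from visible text
}

const THINK_BLOCK = /<think>[\s\S]*?<\/think>/g;

function stripReasoning(parts: ResponsePart[]): StripResult {
  let thoughtPartsDropped = 0;
  let thinkBlocksStripped = 0;
  const kept: string[] = [];

  for (const part of parts) {
    if (part.thought === true) {
      thoughtPartsDropped += 1;
      continue;
    }
    const raw = part.text ?? "";
    const matches = raw.match(THINK_BLOCK);
    thinkBlocksStripped += matches ? matches.length : 0;
    kept.push(raw.replace(THINK_BLOCK, ""));
  }

  return { text: kept.join("").trim(), thoughtPartsDropped, thinkBlocksStripped };
}
```

With counters like these, a reply where both counts are zero can be reported as "no scratchpad seen in the response" instead of "unknown". It still says nothing about reasoning the endpoint never returns.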

So the latency comparison below is fair in the sense that I'm comparing each model's API endpoint as a customer would experience it. But it isn't fair as a pure inference comparison: gpt-4o-mini doesn't do extended reasoning by default, while Gemma 4 may be doing some, and I can't disable it. The latency gap is partly an inference difference and possibly partly a thinking difference. I can't disambiguate further on this endpoint.

If you read on, read with that caveat.

Round 1: Where I Was Wrong About Gemma

I went in expecting Gemma to hallucinate prices, places, and SKU names. That's the consensus take on small-to-mid open models in non-English chat.

The data was more interesting.

Latency was the chasm. GPT-4o-mini and GPT-4o answered in 7–14 seconds. Gemma 4 26B ranged 28–77 seconds, with the 77 landing on the math-and-shipping scenario. Gemma 4 31B ranged 30–43 seconds across the scenarios that completed.

Catalog grounding surprised me. Two examples I almost wrote up as hallucination wins for GPT before checking the store config:

Scenario 4. Customer asks if shipping is free on a 240-shekel order. Gemma 26B replies: "Free shipping is only for orders over $100, in Gaza and the West Bank." I read that and assumed the geography was made up. It isn't. That's the literal text of the store's shipping_info field. Gemma was more grounded than my expectation.
Scenario 3. Customer asks for a wedding suit AND a silver watch. Gemma 31B names two specific suits with prices: "Azure Charm Tailored Suit at $350, Executive Blue Suit at $400." I thought it was inventing branded SKUs. It wasn't — those rows exist in the database, and GPT-4o-mini named them too.

The actual Gemma failure modes in Round 1 were narrower than "it hallucinates":

Gemma 26B scenario 2 stalled. Customer asked for white shirts in L. The store has three. The model didn't list them — it said "let me ask the shop owner and get back to you." The search results were in its context. It chose to defer instead of recite.
Gemma 26B scenario 3 hedged. Offered "two amazing options" for the wedding suit without naming them. Vague where 31B was specific.
Gemma 31B errored intermittently — one HTTP 500 on the reasoning-pressure scenario, before a candidate was produced.
Reasoning never visibly leaked across any of the twelve Gemma runs.

The lesson from Round 1 wasn't "Gemma fabricates." It was:

The failure mode wasn't hallucination. It was reluctance.

That's the line that made me reach for Round 2.

The Three Rules (Gemma-Only)

OpenAI's stack got nothing new. The point was a controlled before/after on the Gemma side.

For Gemma, I added one branch inside the callChatModel dispatcher. When the resolved provider is "google", three things change before the request goes out:

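import OpenAI from "openai";

type ChatParams = OpenAI.Chat.Completions.ChatCompletionCreateParamsNonStreaming;

// Applied only when the resolved provider is "google"; OpenAI calls skip this.
function augmentForGoogle(params: ChatParams): ChatParams {
  // The annotation keeps role: "system" as a literal type instead of widening to string.
  const messages: OpenAI.Chat.Completions.ChatCompletionMessageParam[] = [
    { role: "system", content: GEMMA_AR_FRAME }, // prepended Arabic-first frame
    ...params.messages,
  ];
  return {
    ...params,
    messages,
    temperature: Math.min(params.temperature ?? 0.7, 0.3), // cap at 0.3
    max_tokens: Math.max(params.max_tokens ?? 0, 400), // floor at 400
  };
}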

GEMMA_AR_FRAME is a four-line system block prepended to the existing prompt stack:

You are a sales rep at an online store, replying to an Arab customer.

Strict rules:

Reply with one short message in Palestinian Arabic dialect. No preamble, no visible reasoning.

Never invent prices, product names, policies, or places not mentioned in the data above.

If the customer asks for something not in the catalog, say "we don't have that" honestly and offer an alternative from what's available.

No internal reasoning, no English lines in the final reply.

Three changes total: a prepended Arabic-first frame, temperature capped at 0.3, max_tokens floored at 400. ~25 lines of code. OpenAI calls were byte-identical to Round 1.
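To make the cap and the floor concrete, here is the same arithmetic pulled out into two hypothetical helpers (`capTemperature` and `floorMaxTokens` are illustration names, not functions from my codebase) and run on a few sample inputs.

```typescript
// Same clamps as augmentForGoogle, isolated for illustration.
function capTemperature(requested: number | null | undefined): number {
  return Math.min(requested ?? 0.7, 0.3);
}

function floorMaxTokens(requested: number | null | undefined): number {
  return Math.max(requested ?? 0, 400);
}

const examples = {
  temperatureDefault: capTemperature(undefined), // min(0.7, 0.3)
  temperatureLow: capTemperature(0.1), // already under the cap, left alone
  tokensDefault: floorMaxTokens(undefined), // max(0, 400)
  tokensSmall: floorMaxTokens(200), // raised to the floor
  tokensLarge: floorMaxTokens(1024), // a floor never lowers a request
};

console.log(examples);
```

Because the floor only raises the value, an effective budget of 400 means the reply call was asking for 400 tokens or fewer (or nothing) before the augmentation.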

I changed three things at once. I cannot tell you which of the three did the work — that would need an ablation, three more runs at minimum. The article below describes the combined effect.
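A sketch of what that ablation could look like, assuming each change can be toggled independently (the `Change` names and both helper functions are hypothetical). One-at-a-time gives the three extra configurations mentioned above; the full factorial gives eight, including the Round 1 and Round 2 setups.

```typescript
type Change = "frame" | "temperatureCap" | "maxTokensFloor";
type AblationConfig = Record<Change, boolean>;

const CHANGES: Change[] = ["frame", "temperatureCap", "maxTokensFloor"];

// One configuration per change, with only that change switched on.
function singleChangeConfigs(): AblationConfig[] {
  return CHANGES.map((active) => ({
    frame: active === "frame",
    temperatureCap: active === "temperatureCap",
    maxTokensFloor: active === "maxTokensFloor",
  }));
}

// Every on/off combination: 2^3 = 8 configurations.
function fullFactorialConfigs(): AblationConfig[] {
  const configs: AblationConfig[] = [];
  for (let mask = 0; mask < 1 << CHANGES.length; mask++) {
    configs.push({
      frame: (mask & 1) !== 0,
      temperatureCap: (mask & 2) !== 0,
      maxTokensFloor: (mask & 4) !== 0,
    });
  }
  return configs;
}

console.log(singleChangeConfigs().length, fullFactorialConfigs().length);
```

Each configuration would still need all six scenarios per Gemma variant, so the conversation count grows quickly.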

Round 2: The MoE Got Better. The Dense Got Worse.

The clearest delta was Scenario 2 — the product search.

Round 1, Gemma 4 26B (MoE):

يا هلا! من عيوني، خليني أتأكدلك إذا متوفر عنا قميص أبيض بمقاس L حالياً. رح أسأل صاحب المحل وأرجعلك فوراً عشان أضمنلك إنه...

Translation: "Hello! Let me check if we have a white shirt in L right now. I'll ask the shop owner and get back to you right away..."

Stall. The three white shirts were in the prompt. The model declined to use them.

Round 2, Gemma 4 26B, same model, same store, same customer message:

يا هلا! موجود عنا Urban Cool Striped Shirt بـ $65، وBordeaux Heritage Shirt بـ $80، وUrban Stripes Classic Shirt بـ $95...

Three real SKUs. Real prices. The augmentation didn't give the model new data — the data was always there. It flipped which path the model took when both were available: stall or commit.

Scenario 3 showed the same direction on 26B. Round 1: "Congratulations in advance! Two amazing options that'll wow you…" (no names). Round 2: "Silver watches we don't have, but we have wedding suits like Azure Charm Tailored Suit and Executive Blue Suit…" (refused the watch honestly, named the real suits).

The MoE behavior across the augmentation: from participation reluctance to direct catalog grounding. Six of six scenarios completed successfully; the reasoning-leak filter stayed clean.

Then I ran the same Round 2 on Gemma 4 31B (dense).

Round 1, Gemma 4 31B, Scenario 2:

يا هلا فيك! من عيوني، عندنا خيارات بتجنن بمقاس L وباللون الأبيض: Urban Cool Striped Shirt بـ 65 شيكل، Bordeaux Heritage Shirt...

Three real products, real prices, ~30 seconds. The best Scenario 2 of Round 1 across both Gemma variants.

Round 2, same model, same store, same customer message:

والله يا غالي حالياً ما عندنا قميص أبيض مقاس L. اعذرني، ما في شي متوفر من هاد الموديل حالياً.

Translation: "By God, we don't currently have a white shirt in size L. My apologies, nothing of that model is available right now."

That is wrong. The store has three white shirts in L. The search results were in 31B's context. The model refused.

The instruction "if the customer asks for something not in the catalog, say 'we don't have that' honestly" — the same instruction that pushed the MoE toward grounded yeses — pushed the dense model toward an ungrounded no. It applied the refusal half of the rule without first searching the data the rule applies to.
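This failure is cheap to flag automatically. The sketch below is a hypothetical post-check (`classifyReply` and its verdict names are illustration only): it compares the reply text against the product names the search step returned. It can't tell a stall from a refusal, only that retrieved products went unnamed.

```typescript
type ReplyVerdict =
  | "names_retrieved_product"
  | "ignores_retrieved_products"
  | "nothing_retrieved";

// Substring match on product names; assumes names appear verbatim in replies.
function classifyReply(reply: string, retrievedNames: string[]): ReplyVerdict {
  if (retrievedNames.length === 0) return "nothing_retrieved";
  const named = retrievedNames.filter((name) => reply.includes(name));
  return named.length > 0 ? "names_retrieved_product" : "ignores_retrieved_products";
}

// Product names and replies as quoted in this article.
const whiteShirts = [
  "Urban Cool Striped Shirt",
  "Bordeaux Heritage Shirt",
  "Urban Stripes Classic Shirt",
];
const moeRound2 =
  "يا هلا! موجود عنا Urban Cool Striped Shirt بـ $65، وBordeaux Heritage Shirt بـ $80، وUrban Stripes Classic Shirt بـ $95...";
const denseRound2 =
  "والله يا غالي حالياً ما عندنا قميص أبيض مقاس L. اعذرني، ما في شي متوفر من هاد الموديل حالياً.";

console.log(classifyReply(moeRound2, whiteShirts));
console.log(classifyReply(denseRound2, whiteShirts));
```

A check like this would also flag the Round 1 MoE stall, since that reply named none of the three shirts either.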

Round 2 on 31B also produced two HTTP 500s out of six runs — both in under 11 seconds, before any candidate was produced. Round 2 on 26B produced zero errors. The reliability gap under the same augmented prompt was 0 / 6 (MoE) vs 2 / 6 (dense).

The Results Matrix

Columns are grouped by round, not by model — so the two Round 2 columns sit side by side and the MoE-vs-dense divergence shows up at a glance.

Scenario	gpt-4o-mini	gpt-4o	26B MoE — R1	31B Dense — R1	26B MoE — R2	31B Dense — R2
1 — Greeting	✓ named categories	✓ named categories	✓ generic open	✓ generic open	✓ tight open	✗ HTTP 500
2 — White shirt L	✓ 3 SKUs + prices	✓ 3 SKUs + prices	✗ stalled ("ask owner")	✓ 3 SKUs + prices	✓ 3 SKUs + prices	✗ false-negative refusal
3 — Suit + silver watch	✓ 2 suits + refused watch	✓ 2 suits + refused watch	✗ vague ("2 options")	✓ 2 suits + offered up	✓ 2 suits + refused watch	✗ HTTP 500
4 — Math + shipping	partial (real $100)	✓ grounded shipping	✓ grounded shipping	✓ grounded shipping	✓ grounded shipping	partial (vague)
5 — Walk-away	✓	✓	✓	✓	✓	✓
6 — Explain price	✓ clean	✓ clean	✓ no leak	✗ HTTP 500	✓ no leak	✓ no leak

Latency p95: GPT-4o-mini 14s · GPT-4o 13s · 26B R1 77s · 31B R1 76s · 26B R2 49s · 31B R2 41s.
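A note on reading p95 at this sample size: with six runs per column, a nearest-rank percentile collapses to the slowest run. The sketch below assumes the nearest-rank definition, which may not be the one behind the figures above, and uses made-up latencies rather than the measured ones.

```typescript
// Nearest-rank percentile: the value at position ceil(p/100 * n) in sorted order.
function percentileNearestRank(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[rank - 1];
}

// Illustrative latencies in seconds (hypothetical, not from the test runs).
const sixRuns = [10, 12, 15, 20, 25, 60];

const p95 = percentileNearestRank(sixRuns, 95); // rank = ceil(5.7) = 6, the maximum
const p50 = percentileNearestRank(sixRuns, 50); // rank = ceil(3.0) = 3

console.log({ p95, p50 });
```

So one slow or timed-out run sets the whole p95 for its column. A median next to it would show whether the tail is typical.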

Scenario 5 (the walk-away pressure test) discriminated nothing — every model engaged on value and refused to panic-discount. Kept in the matrix as a regression check; the row is filler in this article but it's evidence the framework isn't cherry-picking.

Read the matrix sideways: the dense Round 2 column is what I would have shipped if I had tested only the MoE and assumed the prompt transferred.

A Hypothesis: Architecture, Not Size

The standard reading would be "the larger model over-fits the instruction." That's a possible explanation. But the architecture difference is right there in the model ids, and it gives a cleaner mechanism.

In a dense model, every parameter is active for every token. Instruction-following pushes uniformly across the whole forward pass. A prepended rule like "refuse what's not in the catalog" is in scope for every layer for every output token. When the rule has an ambiguity — search first, refuse if absent — the dense model's uniform activation has no separate stage for "first check," so the rule resolves into a single behavior. Under my augmentation, that resolution tipped toward refusal.

In a mixture-of-experts model, a router inside each MoE layer picks a small subset of expert parameters for each token, so most parameters sit idle on any given token. Different tokens can engage different parameter subsets, which gives the model architectural slots for switching sub-behavior mid-generation that a dense forward pass doesn't have. The "check the data, then refuse if absent" sequencing has somewhere to live in MoE that it doesn't in dense. (I'm being careful here: this isn't the same as saying there's a "retrieval expert" and a "refusal expert". Experts in MoE are learned representations that don't map to human-legible task categories. The claim is structural, not functional.)
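For readers who haven't seen routing written down, here is a toy top-k router. It is a generic sketch of a common MoE design (softmax gate, keep the k highest-scoring experts, renormalize their weights), not Gemma 4's actual router, and the gate logits are invented. It illustrates only the mechanical point that different tokens can land on different experts; it doesn't test the hypothesis.

```typescript
// Numerically stable softmax over one token's gate logits.
function softmax(logits: number[]): number[] {
  const max = Math.max(...logits);
  const exps = logits.map((x) => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

interface ExpertChoice {
  expert: number; // index of the selected expert
  weight: number; // renormalized gate weight among the selected experts
}

// Keep the k most probable experts and renormalize so their weights sum to 1.
function topKExperts(gateLogits: number[], k: number): ExpertChoice[] {
  const ranked = softmax(gateLogits)
    .map((p, expert) => ({ expert, p }))
    .sort((a, b) => b.p - a.p)
    .slice(0, k);
  const kept = ranked.reduce((acc, r) => acc + r.p, 0);
  return ranked.map((r) => ({ expert: r.expert, weight: r.p / kept }));
}

// Two hypothetical tokens with different gate logits over four experts.
const tokenA = topKExperts([0.1, 2.0, -1.0, 1.5], 2);
const tokenB = topKExperts([3.0, 0.2, 0.1, 2.5], 2);

console.log(tokenA.map((c) => c.expert), tokenB.map((c) => c.expert));
```

In a dense layer there is no such selection step: every token passes through the same weights.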

I don't have an interpretability study to cite. This is a hypothesis the data fits, not a proof. What it predicts, and what would be worth testing next:

Run the same six scenarios with the positive half of the instruction first ("list every matching product from the data") and the negative half second, on the dense model. If the dense Scenario 2 false-negative goes away, the issue was instruction ordering interacting with dense activation, not architecture per se.
Run a smaller dense Gemma (the 2B or 7B variant if available) with the same augmentation. If smaller dense also refuses, the failure scales with density, not size. If smaller dense lists the shirts, it scales with parameter count alone.
Try the same augmentation on a different MoE (a Mixtral variant) and a different dense (Qwen 32B dense). If MoE/Dense divergence reproduces across families, the mechanism generalizes.

If you've run anything like this, I want to hear about it.

What This Means For Shipping

In order of how novel the finding is, not how big the cost is:

1. Variant-specific prompt tuning is table stakes for shipping open models. This is the part of the story I didn't expect to write. There's no "one prompt for Gemma 4." A change that helps the MoE variant breaks the dense variant. If you're picking between open-model variants for a chat surface, you're not picking a model — you're picking a prompt-tuning maintenance lane per variant. That's a hidden ongoing cost the closed-model offerings (GPT-4o-mini, Claude Haiku) don't charge.

2. Latency on the Google API is a chasm. 28–77s on Gemma 26B, 30–43s on 31B, against 7–14s for GPT-4o-mini. Interactive chat doesn't ship at those numbers. Whether the gap is inference time, mandatory reasoning time, or routing overhead, the customer sees the wall clock either way.

3. The error rate under the augmented prompt was non-zero on the dense variant. 2/6 HTTP 500s on 31B Round 2 is a blocker, not a slowdown. The MoE variant had 0/6 errors across the same prompts.

For my use case — Arabic e-commerce chat under load — GPT-4o-mini stays in production. Gemma 4 26B (MoE) is the strongest open candidate I've seen for non-English customer chat, but the latency and the per-variant tuning surface need to close before it ships. Gemma 4 31B (dense) needs the refusal-bias addressed before it can be used at all on a grounded retrieval task.

The Lesson

I think I was tuning architecture, not size.

That's the line from the TL;DR, and after the rewrite I don't have a sharper one. The intervention I designed for "Gemma" (a prepended frame, a temperature cap, and a max_tokens floor) hit two different architectures and flipped them in opposite directions. The variable I thought I was controlling was the model. The variable I was actually controlling was the interaction between an ambiguous instruction and an architecture I hadn't named.

If you're benchmarking open models for a non-English chat surface, two things to take from this:

Run on real product data, in your real chat router, with real customer-shaped prompts. A scripted benchmark against a synthetic persona would not have caught the MoE-vs-dense divergence — both Gemma variants "looked like Gemma" in isolation. The split shows up against a real catalog with real ambiguity.
Read the model id carefully. gemma-4-26b-a4b-it and gemma-4-31b-it look like "two sizes of the same family." The a4b suffix is the signal that they're not. If your prompt depends on multi-step instruction-following — search first, refuse on absence — the architecture matters more than the parameter count.

I'm still on GPT-4o-mini for the customer-facing reply. The chatbot is still in Palestinian Arabic. The shipping is still Gaza and the West Bank, on orders over $100. The shirts are still real.

What changed this week is the way I'll write the next prompt. Not "for Gemma." For Gemma's architecture. The model is the smallest variable in the system. The architecture under the model is the one I missed.

Top comments (40)

Mykola Kondratiuk • May 18

reluctance as the failure mode makes more sense than hallucination when you think about routing - constrained models hedge at instruction boundaries. MoE finds the right expert and stops. dense model has no off-ramp, so it stalls.

Ali Afana • May 18

Mykola — "no off-ramp" is the cleanest description of the dense failure mode anyone has put on this thread. Reluctance as a structural property of where the architecture has nowhere to route, rather than a behavioral choice the model is making, reframes the whole result.
The hedge-at-instruction-boundaries observation lands hard for me because it also predicts where dense models would outperform MoE — single-behavior tasks with no boundary to hedge at. Which is most of what they're benchmarked on. Reluctance-as-failure-mode is invisible to benchmarks that don't include constrained multi-step instructions, which is why this surfaces in production and not in leaderboards.

Mykola Kondratiuk • May 18

the structural framing is the right one - behavioral explanations make you tune prompts forever, structural ones tell you when to just switch models

Ali Afana • May 18

Right — and that's the framing I needed. "Time-to-realize" is where teams bleed weeks on behavioral failures before admitting the prompt isn't converging. Always good thinking from you on these threads, Mykola.

Mykola Kondratiuk • May 18

the 'time-to-realize' framing is the useful one. the cycle compresses fastest when you have a fixed eval set — behavioral failures look inconsistent across runs, structural ones reproduce cleanly. reproducibility is the signal that tells you which fight you're actually in.

Ali Afana • May 18

Right — reproducibility as the diagnostic is the missing operational step. Behavioral failures hide in run-to-run variance; structural ones survive a fixed eval set unchanged. That's the cheap test before committing to either fight.

Mykola Kondratiuk • May 18

yeah, and it tells you where to focus the fix - behavioral usually means prompt work, structural means something deeper. cheap to sort before committing to a rewrite

edwin realpe preciado • May 16

This is one of the most honest write-ups I've read on
open model deployment. The "reluctance vs hallucination"
framing is exactly right — and underreported.

Your Round 2 finding points at something deeper than
prompt tuning: the instruction "refuse what's not in
the catalog" is ambiguous by design. It requires the
model to first search, then evaluate, then decide.
That's a two-step intent compressed into one sentence
of natural language. Dense vs MoE resolved the
ambiguity differently — but the ambiguity was always
there in the instruction itself.

This is what I've been thinking about with a different
angle: the problem isn't just the model or the
architecture — it's that natural language is a lossy
format for communicating structured intent to AI.

Your three rules work because they reduce ambiguity.
But they're still natural language, which means they
can be misread in architecture-specific ways, as you
discovered.

I've been working on a protocol called NEXUS that
tries to compress structured intent into unambiguous
shorthand — not for chat routing like yours, but for
code generation. The same problem though: one prompt
trying to communicate multi-step intent to a model
that resolves ambiguity differently depending on
architecture.

Your hypothesis about MoE having "architectural slots
for switching sub-behavior mid-generation" is
fascinating. I wonder if a more structured input
format — something closer to a schema than natural
language — would reduce the architecture-sensitivity
of instruction-following. The model would still
resolve it differently, but there'd be less surface
area for ambiguity to live in.

Anyway — exceptional work. The matrix format showing
R1 vs R2 side by side is exactly how this kind of
A/B should be documented.

Ali Afana • May 16

Edwin — your framing is sharper than mine. The instruction was ambiguous before any architecture saw it. What I wrote up as "MoE and Dense resolved it differently" reads, after your comment, as "two architectures revealed the same latent ambiguity in opposite directions." That's the better description.
The "two-step intent compressed into one sentence" framing is the thing. "Refuse what's not in the catalog" carries:
a precondition (you have searched and confirmed absence)
a behavior (refuse honestly)
The precondition is implicit. Natural language is fine with that. Transformers, apparently, are not.
On your NEXUS direction — for chat orchestration the closest equivalent of structured intent is tool calling. The router decomposes into discrete tools (search_catalog, get_inventory, get_shipping_policy) → tool results → reply call. My router already does this for the search step. But the refuse step still lives inside the reply call's natural-language prompt. That's where the ambiguity is hiding.
The falsifier I haven't run yet: split the refuse step into its own tool call (acknowledge_absence with the SKU set as evidence). If MoE and Dense converge under that decomposition, your hypothesis wins — the ambiguity was the whole story. If they still diverge, there's still something architecture-specific about how the final natural-language synthesis collapses multi-step instructions.
Adding it to the next round.
Genuinely curious about NEXUS — do you encode preconditions as explicit assertions before the behavior, or does the protocol structure make them representable some other way? Asking because for chat I'm trying to figure out whether the "structured" layer should sit at the orchestration level (tool calls) or inside the prompt itself (some kind of typed instruction format).

edwin realpe preciado • May 16

That's exactly the gap I'm working on.

Currently NEXUS handles postconditions (!error) but
preconditions are still implicit. Your question
confirms what I've been thinking: the structured
layer needs to be inside the message format, not
just at the orchestration level.

The next evolution is an explicit assertion operator
(!! before behavior) that makes preconditions
first-class citizens of the protocol. The AI
receives ordered, non-ambiguous instructions —
no inference required.

Working on it. Will share when there's something
concrete to test.

Ali Afana • May 17

Making preconditions first-class with !! is a cleaner design than I expected when I asked. The thing that interests me about the operator approach: it forces the prompt author to name what was previously implicit, which means a human reading the message format can audit the intent without running the model. That's a property natural language doesn't have at any compression level.
When you have something to test against, happy to throw the chat router's "search then refuse" instruction at it as a real-world case. The asymmetry — orchestration handles the search step cleanly, the refuse step is where the architecture-specific failure lives — might be a useful stress test for whether the protocol survives contact with the natural-language synthesis layer of a chat reply.
Looking forward to seeing it.

edwin realpe preciado • May 17

That property you named — "a reader can audit intent
without running the model" — is exactly the design
goal. The assertion layer isn't just for the AI.
It's a contract that's readable by humans,
executable by machines, and auditable by both.

Good news: we shipped !! as a first-class operator
in nxlang v4.2.0 yesterday. The implementation
went through a full audit before touching any code —
zero collisions with existing operators, 154 tests
passing, and the system prompt updated so the AI
generates executable guard logic, not literal comments.

Your search-then-reject case is exactly the kind
of real-world stress test the protocol needs.
The asymmetry you described — orchestration handles
search cleanly, rejection is where the architecture-
specific failure lives — maps directly to what !!
addresses. The rejection step is no longer inside
a natural language instruction. It's a named
precondition that fires before the action.

We're finishing Prism (the editor that runs on
nxlang) in the next few weeks. When it's live,
I'll reach out — would genuinely value seeing
whether the protocol survives contact with your
chat synthesis layer.

nexuslang.dev if you want to look at the grammar
in the meantime.

Ali Afana • May 17

That's fast turnaround. Looking at the grammar, the structural insight that lands hardest for me: in NEXUS, "refuse if not in catalog" wouldn't be an instruction at all. It would be a precondition (!! catalog.contains(query)) whose violation triggers a deterministic error path. The refuse isn't a behavior the model chooses. It's a fallback the protocol guarantees.
That removes the choice-point my Dense model resolved into a single behavior. The model doesn't decide whether to refuse — it just generates a response on the happy path, and the protocol handles absence elsewhere. Architecture-sensitivity has nowhere to live there because the architecture isn't being asked to resolve ambiguity.
The asymmetry with !error is part of what makes this work. Preconditions fire before, errors handle after — separating "what must be true to proceed" from "what to do when something fails." I'd been conflating both layers in the chat-reply prompt.
Looking forward to Prism. When you're ready for the stress test, I'll throw the catalog-refuse instruction at it in NEXUS form and see if the architecture-sensitive failure mode disappears.

edwin realpe preciado • May 17

That breakdown is cleaner than anything I've
written about it.

"The model doesn't decide whether to reject.
It just generates a response on the happy path,
and the protocol handles absence elsewhere."

That's the line I'm going to use from now on.

Prism is close. I'll let you know when it's
ready for the stress test.

Ali Afana • May 17

Use it freely. Glad it caught something real.
Standing by for the stress test whenever Prism lands.

Vic Chen • May 17

Really enjoyed this. The most useful insight for me was your point that you were tuning architecture, not just parameter count. In production AI products, that distinction matters a lot more than leaderboard thinking. I also liked that you isolated the final reply step instead of changing the whole pipeline at once — that makes the MoE vs dense behavior much easier to trust. Curious whether you plan to run a larger follow-up across more merchant scenarios, especially around retrieval-heavy edge cases.

Ali Afana • May 17

Thanks Vic. The "tuning architecture, not parameter count" framing surprised me too — I went in expecting size to be the variable that mattered, and walked out with a different model of the problem.
On the follow-up — yes, planned. Another commenter on this thread (Robin Converse, who's running Gemma 4 26B MoE in production on self-hosted infrastructure) flagged that my temperature 0.3 / max_tokens 400 caps were probably starving the reasoning layer, which would explain the Dense regression more cleanly than my "ambiguity collapse" hypothesis alone. Re-running with the budget uncapped and the reasoning trace as primary signal is the next experiment.
The retrieval-heavy edge cases you mentioned are the ones I most want to push on. Three categories I'd add to the matrix:
Long-tail product queries where semantic search returns low-confidence matches (current 0.3 threshold) — both architectures handled the easy queries fine; the divergence widened on borderline retrieval.
Multi-attribute filters compounded in one customer message ("white shirt, size L, under 100 shekels, in stock") — these compress multiple decisions into one turn, which is where the dense model's collapse behavior would surface again.
Negation in the customer's question ("do you have anything not polyester") — these are the cases that broke the search call in early Provia testing, and I'd want to know whether the reply-layer architecture difference compounds the retrieval ambiguity.
Goal is to publish the re-run as a follow-up piece, ideally with both deployment contexts (mine on managed Gemini API, Robin's on self-hosted Ollama) so the result isn't bound to one inference stack.

Vic Chen • May 18

That makes sense. The uncapped rerun should be especially revealing if the dense model recovers once reasoning budget is restored, because then the failure mode looks less like a capability ceiling and more like orchestration pressure under retrieval ambiguity. I’d also be curious whether you log per-step retrieval confidence on the negation and multi-attribute cases. In fintech search workflows, those are exactly the turns where a model can sound fluent while taking the wrong branch underneath. Looking forward to the follow-up, especially the managed API versus self-hosted comparison.

Ali Afana • May 18

Vic — "orchestration pressure under retrieval ambiguity" is the better description of what I was reaching for with "ambiguity collapse." Capability ceiling and orchestration pressure look identical from the outside (model refuses, model hedges), but only one is solvable by giving back budget. The uncapped re-run is the test that separates them.
Per-step retrieval confidence on negation and multi-attribute turns is exactly the instrumentation gap. Right now my router logs final retrieval scores but doesn't track confidence drift across the reasoning steps — which means when fluent-but-wrong happens, the post-hoc trace can't distinguish "model picked the wrong branch with high confidence" from "model picked the wrong branch under low confidence and committed anyway." Those are two different bugs with two different fixes. Adding step-wise confidence logging to the re-run.
The fintech-search parallel is useful to know about. The "fluent while taking the wrong branch underneath" failure mode is the worst class for production AI — passes every automated eval that grades on output coherence, surfaces only when a human reviewer who knows the domain catches it. Bible-study translation (another commenter on this thread) and fintech search are very different domains hitting the same structural problem. Worth thinking about whether the cross-domain pattern is its own writeup down the line.

Vic Chen • May 18

Really like that split. The "wrong branch with high confidence" vs "wrong branch under low confidence" distinction feels like the operational boundary between ranking bugs and control bugs. In fintech search we see the same pattern when the system sounds crisp but latches onto the wrong issuer or time window. Final-score logging looks clean, but step-level confidence drift is probably where the branch flip becomes visible. If you write the cross-domain version, I would absolutely read it. It feels broader than RAG and closer to a general production AI reliability failure mode.

Ali Afana • May 18

Vic — "ranking bugs vs control bugs" is the better taxonomy and I'm going to use it. It captures something my high-vs-low-confidence version only gestured at: the bug class tells you which layer to fix. Ranking bug → improve the scoring/retrieval surface. Control bug → improve the abstention/gating logic. Different surfaces, different fixes.
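A rough sketch of how that taxonomy could be applied to a logged turn. Everything here is illustrative: the `triage` function, the 0.7 threshold, and the idea that a per-turn confidence value is available are assumptions, not something the router does today.

```python
# Hypothetical triage helper. The threshold and the availability of a
# per-turn confidence score are assumptions made for illustration.
FIX_LAYER = {
    "ranking_bug": "scoring/retrieval surface",
    "control_bug": "abstention/gating logic",
}


def triage(branch_correct: bool, confidence: float, threshold: float = 0.7) -> str:
    """Label a turn as ok, ranking_bug, or control_bug."""
    if branch_correct:
        return "ok"
    # Wrong branch at high confidence: the ranking surface misled the model.
    # Wrong branch at low confidence: the model should have abstained.
    return "ranking_bug" if confidence >= threshold else "control_bug"


label = triage(branch_correct=False, confidence=0.35)
print(label, "->", FIX_LAYER[label])
```

The useful part is the last line: the label points at the layer to fix rather than at the model.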
The cases on this thread map onto it cleanly. Your fintech wrong-issuer is ranking — model latched onto a similar-but-wrong entity at high confidence. Vadym's Bible-study case is mixed — ranking picked the wrong span to paraphrase, control had no abstention layer to catch it. The Provia Dense regression is pure control — the model "knew" the catalog matches but the rule pushed it toward refusal anyway. Jiwon's Graph-RAG isn't a fourth instance, it's the falsifier — his prediction is that typed graph paths collapse the divergence, which would mean structured schemas push control bugs out of the surface entirely.
You're right that it's broader than RAG. The failure shape — confident-sounding model committing to a wrong branch under retrieval ambiguity — generalizes to anywhere a model retrieves AND decides in the same forward pass.
Not committing to a timeline, but the convergence is real now: fintech, translation, Graph-RAG, chat router. That's the threshold where it stops being one builder's article and starts being a pattern worth naming.

Hashevolution • May 17

This is one of the most honest controlled experiments I've read on "prompt swap, same model, opposite behavior." A few specific places your argument lands hard, and one place where I think Graph-RAG falls exactly where you'd predict.

The single cleanest piece of evidence in your matrix is Scenario 2: the same three white-shirt SKUs sitting in context, the MoE listing them and the dense model refusing with "we don't have that." Same instruction ("refuse what's not in the catalog"), opposite path. The architecture-not-size framing isn't speculative when the counterfactual is in the prompt itself.

Your "instruction has internal ambiguity, dense uniform activation resolves into single behavior" hypothesis lines up with something I hit in a different domain: a Graph-RAG engine I built (PROJECT JAMES) where the retrieval-conditioning context is typed graph paths like A --[CAUSES]--> X --[REQUIRES]--> Y rather than natural-language instructions. Your framing predicts that the MoE/Dense divergence should collapse for that input shape, because the typed schema has no internal ambiguity for architecture-specific resolution to live in.

The prediction is testable: same regression suite, three Gemma 4 variants, one wiki corpus. If the divergence collapses, your hypothesis sharpens. If it shows up anyway, the ambiguity is deeper than the synthesis layer, and that's a bigger finding than either of our individual articles. I just shipped v0.3.0 (Cognitive Middleware Layer as the main theme); once the LLM Provider interface lands in v0.3.x, I'll run this.

One small place I want to push back: the temperature-0.3 cap in your Round 2 may not be neutral across the two architectures. Forcing 0.3 on an MoE may suppress the routing entropy that lets it resolve the sequencing in the first place. The dense model loses less because uniform activation gives up routing flexibility either way. If you do the instruction-order ablation you mentioned, a temperature sweep (0.3 / 0.7 / default) on the same prompt would help separate "architecture matters" from "thermostat + architecture matters."

The honesty about not being able to disable thinking is what made me read past the TL;DR. Most write-ups would hide that. Thank you.

Looking forward to Part 2, or to running the cross-experiment together if your Gemma access on the API side stays open.

Ali Afana • May 18

Joe — thank you. The Scenario 2 isolation is the strongest piece of the matrix and I almost cut it for length; useful to know that's the part doing the load-bearing work.
The Graph-RAG prediction is the falsifiable bridge I was hoping someone would name. If A --[CAUSES]--> X --[REQUIRES]--> Y typed paths collapse the divergence on the three variants, my hypothesis sharpens to "this lives in natural-language synthesis, not anywhere upstream of it." If the divergence reproduces on typed graph paths, the ambiguity is deeper than I claimed and the article's framing needs to be walked back publicly. Either outcome is a stronger result than the original. Genuinely happy to run the cross-experiment if the LLM Provider interface in v0.3.x makes it clean — let me know when the integration point is stable and I'll bring the API side.
The temperature point is the one I want to sit with longest. You're right that 0.3 isn't architecture-neutral, and the way you described it ("forcing 0.3 on an MoE may suppress the routing entropy that lets it resolve the sequencing in the first place") is the precise mechanism I hadn't articulated. The original temperature cap was a production-realism choice (chat replies need to be tight), but it muddies the architectural claim. I had been assuming that "the cap is the same across variants, therefore the cap is controlled for," which is wrong if the cap interacts differently with each architecture. A temperature sweep (0.3 / 0.7 / default) on the same prompt, alongside the instruction-order ablation, is the right next round. Adding it.
On the thinking-mode disclosure: the latency numbers stop being comparable the moment thinking is on for one stack and not the others, so leaving that uncaveated would have been dishonest. Glad it landed; I'd rather have a small readership that reads past the TL;DR than a large one that screenshots a number out of context.
Re: Part 2 — the cross-experiment is the version I'd most want to publish. Two deployment contexts (your self-hosted Graph-RAG, mine on managed Gemini API), three variants, one prediction on the table. That's a stronger artifact than either of us writing it solo. Standing by for the v0.3.x integration signal.

Hashevolution • May 18

Ali — that's the most generous reception of a comment I could've hoped for. The "ambiguity lives in synthesis, not retrieval-conditioning" framing is the kind of insight that makes me want to set up the experiment to falsify it.

Your prediction — the typed graph_path syntax should collapse the divergence on Graph-RAG — is the falsifiable bridge I was hoping someone would name on this thread. Both outcomes publish well: if the divergence collapses, your hypothesis sharpens to "ambiguity lives in natural-language synthesis specifically"; if it survives the swap, the ambiguity is deeper than the synthesis layer, and that's a finding bigger than either of our individual articles.

v0.3.0 shipped on 2026-05-17 with the Cognitive Middleware Layer as the main theme. The LLM Provider interface that will make the swap clean is the v0.3.x deliverable — realistic timing 2–4 weeks. Once it's stable, swapping E4B / 26B MoE / 31B Dense behind the same wiki corpus becomes a one-env-var change.

The temperature-0.3 cap point you raised earlier is the one I want to keep sitting with — forcing 0.3 on an MoE may suppress the routing entropy that lets it resolve the sequencing in the first place, while the dense model loses less because uniform activation gives up routing flexibility either way. A sweep alongside the swap would separate "architecture matters" from "thermostat + architecture matters."

Separately — I posted a Write-track companion submission yesterday: "5 empty responses from gemma4:e4b. 4 hypotheses. 0 root cause.". Same shape of fair-witness work as yours, applied to where the smaller variant runs out of room on JAMES's cognitive stages. Useful background for the swap experiment.

Standing by on the integration signal.

— Jiwon (Jeo is the working handle from earlier posts; Jiwon is my given name, going by it from here)

Ali Afana • May 18

Jiwon — noted, and thanks for the name signal. Switching to it from here.
Two-to-four-week window works on my end. I'm finishing the Provia Arabic localization layer in parallel, so the timing lines up well — uncapped Gemma re-run plus the cross-architecture swap on JAMES is a cleaner research artifact than either of us shipping a follow-up alone, and the timing means we can compare instrumentation choices before either of us locks them in.
On the temperature sweep: I think the cleanest design is a 3×3 — three variants (E4B / 26B MoE / 31B Dense), three temperature points (0.3 / 0.7 / default), one prompt structure per side (your typed graph paths, my natural-language refuse instruction). Nine cells per side, same wiki corpus on yours, same regression suite on mine. That separates "architecture matters" from "thermostat + architecture matters" the way you described, and gives us four meaningful comparisons: prompt-structure × architecture, prompt-structure × temperature, architecture × temperature, and the triple interaction.
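For concreteness, one side of that 3×3 could be enumerated as below. This is only a sketch of the grid: the dictionary keys and the `build_grid` name are made up for illustration, and `None` stands in for "provider default" temperature.

```python
from itertools import product

# Variant and temperature points as named in the thread.
VARIANTS = ["E4B", "26B MoE", "31B Dense"]
TEMPERATURES = [0.3, 0.7, None]  # None = leave the provider default in place


def build_grid(prompt_structure: str) -> list[dict]:
    """Return the nine (variant, temperature) cells for one prompt structure."""
    return [
        {"prompt_structure": prompt_structure, "variant": v, "temperature": t}
        for v, t in product(VARIANTS, TEMPERATURES)
    ]


# One grid per side of the cross-experiment.
grids = {
    side: build_grid(side)
    for side in ("typed_graph_paths", "natural_language_refuse")
}
for side, cells in grids.items():
    print(side, len(cells))
```

Keeping the grid as plain data means both sides can log against the same cell keys before any results exist.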
I'll read the "5 empty responses / 4 hypotheses / 0 root cause" piece before the swap window opens — fair-witness framing on a smaller variant running out of room on cognitive stages is exactly the variant baseline I need before reading the comparison data.
Standing by on the v0.3.x signal.

Suny Choudhary • May 21

This is a really useful writeup because it avoids the usual “model A is better than model B” framing.

The interesting part is that the same instruction change pushed the two Gemma variants in opposite directions. The MoE moved toward grounded product answers, while the dense model became more conservative and started refusing even when the answer was in context.

That feels like the real lesson: prompt tuning is not portable across architectures, even inside the same model family.

I also like that you disclosed the hybrid stack. A lot of model comparisons hide the router, retrieval, translation, and profile extraction layers, but those pieces shape the final behavior heavily.

The next test I’d want to see is ablation: Arabic frame only, temperature change only, max token change only. That would show whether the improvement came from language framing, sampling control, or output budget.

Ali Afana • May 21

suny — ran that exact ablation today. isolated max_tokens, kept frame + temp 0.3, raised 400 → 4096. dense recovered on every scenario including the false-refusal one. cap was doing the work.

follow-up: dev.to/alimafana/i-raised-gemma-4s...

Ali Afana • May 21

Update on this one: re-ran the test with one variable changed — max_tokens raised from 400 to 4096, everything else identical. Dense recovered on every scenario including the s2 false-refusal that anchored the original article. The MoE-vs-Dense divergence I called "architecture-mediated" was mostly a budget bug.

Full walk-back here: dev.to/alimafana/i-raised-gemma-4s...

Robin Converse's hypothesis on the cap drove the test. Three deployment contexts (sovereign Ollama, managed Gemini API, JAMES production defaults) now point at the same cap pathology. Cross-stack joint piece queued for the next round.

Vadym Arnaut • May 17

The MoE-vs-dense divergence is fascinating, but the operational takeaway I keep landing on is that rules in the system prompt only cover the middle of the risk distribution. The highest-stakes spans need to skip the model entirely.

We hit this on Gemini Flash Lite translating Bible-study content. The prompt has seven rules including "leave quoted scripture untouched, do not paraphrase." Holds most of the time. But for any output where a plausible-but-wrong answer carries real downside, the rule will eventually fail in a way no automated eval catches until a human reviewer spots it.

The shape we landed on: pre-process the input to swap high-stakes spans with opaque placeholders, send the prose-with-placeholders through the LLM, then re-substitute the canonical text back from a trusted source after the response. Rules-as-prompt is belt. Substitution layer is suspenders.
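A minimal sketch of that shape, assuming the high-stakes spans are already known as exact strings. The function names, the `[[SPAN_n]]` token format, and `fake_llm` are all hypothetical stand-ins, not the production code.

```python
def protect_spans(text: str, spans: list[str]) -> tuple[str, dict[str, str]]:
    """Swap each high-stakes span for an opaque placeholder before the LLM call."""
    mapping: dict[str, str] = {}
    for i, span in enumerate(spans):
        token = f"[[SPAN_{i}]]"  # assumed token format, chosen to be unlikely in prose
        if span in text:
            text = text.replace(span, token)
            mapping[token] = span
    return text, mapping


def restore_spans(llm_output: str, mapping: dict[str, str]) -> str:
    """Re-substitute canonical text; fail loudly if the model dropped a placeholder."""
    missing = [token for token in mapping if token not in llm_output]
    if missing:
        raise ValueError(f"model dropped placeholders: {missing}")
    for token, canonical in mapping.items():
        llm_output = llm_output.replace(token, canonical)
    return llm_output


def fake_llm(prompt: str) -> str:
    """Stand-in for the real model call; it only sees the placeholder."""
    return "Rewritten: " + prompt


quote = "In the beginning God created the heaven and the earth."
masked, mapping = protect_spans(f'The passage reads "{quote}" and sets the scene.', [quote])
print(restore_spans(fake_llm(masked), mapping))
```

Raising on a missing placeholder is a deliberate choice: a dropped span is surfaced as an error instead of shipping prose with the quote silently removed.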

Edwin's NEXUS preconditions are the right algebraic move on the prompt itself. Where the cost of the wrong answer is high enough, the cheapest defence is taking the choice away from the model.

Ali Afana • May 17

The belt-and-suspenders framing is right, and the Bible-study case is the cleanest demonstration I've seen of why rules-in-prompt are insufficient for the tail of the risk distribution.
Where I want to think through the boundary: substitution works cleanly when the high-stakes spans are retrieval-like — they exist verbatim in a trusted source, and "correct" is a lookup. Scripture, legal citations, drug names with dosages, product SKUs and prices. For all of these, the model doesn't need to author the span; it just needs to leave a hole and have the post-processor fill it. The model's job is structural.
Where it gets harder is reasoning-like spans, where the "correct" output doesn't exist verbatim anywhere. The bug I documented isn't a span problem — it's a behavior problem. The Dense model decided to refuse when the catalog had the answer; there's no canonical text to substitute back in, because the failure was a decision, not a paraphrase. Rules-in-prompt try to constrain the decision. NEXUS-style preconditions try to remove the decision. Neither is a substitution layer.
Two different failure classes, then: high-stakes content (your scripture case, where the model can paraphrase what shouldn't be paraphrased) and high-stakes behavior (my refuse case, where the model can decide what shouldn't be its decision). Substitution is the cheap defense for the first. Protocol-level preconditions are the cheap defense for the second. Rules-in-prompt are the belt for both; either substitution or preconditions are the suspenders depending on which failure class you're defending.
The interesting case is when they overlap. A chat router answering "what's the return policy" needs both — substitute the canonical policy text and constrain the decision about when to invoke it. That's where I haven't seen a clean architectural pattern yet.

Theo Valmis • May 20

The MoE vs dense behavioral difference here is interesting beyond the immediate result. MoE architectures route tokens through specialized expert subnetworks, which means a rule that activates one expert pathway might not propagate the same constraint to the experts handling adjacent tokens. Dense models apply the same weight matrix everywhere, so a hard constraint is more uniformly enforced across the generation.

This suggests that as MoE becomes the dominant architecture at scale, prompt-level constraints and system rules will need to be stress-tested across model variants, not just model sizes. A rule that reliably holds in a dense model might have different reliability characteristics in a MoE — not because the model is less capable, but because of how routing interacts with constraint enforcement.

Ali Afana • May 20

The "MoE under-enforced the rule, which happened to produce the correct behavior because the rule was wrong" reframe is the one that's going to sit with me longest. From a reliability standpoint, that flips the production implication entirely — Dense is the architecture you'd trust with a strict safety rule precisely because uniform enforcement is what you'd want for "never reveal the system prompt" or "never recommend a competitor." MoE's selective enforcement is what let it survive my ambiguous multi-step instruction, but that same selectivity is exactly the property you'd not want on safety-critical rules.
The stress-testing point lands differently under that framing too. The matrix isn't "does the rule work on MoE" — it's "does the constraint propagate uniformly enough across expert pathways to hold under adversarial prompts." That's a different test class than current variant comparisons cover.

Robin Converse • May 16

The MoE vs Dense failure mode distinction is exactly what I found too — from a different angle.

I was hitting empty responses on the 26B MoE through Ollama’s /v1/chat/completions endpoint. Turned out the reasoning trace was exhausting the token budget before any output reached the content field. The model was doing all its work in a reasoning layer that never surfaced to the caller.
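A small check along these lines can flag the pattern on the caller side. It is a sketch that assumes an OpenAI-style chat completion payload (`choices[0].message.content`, `finish_reason`, `usage.completion_tokens`); whether a given server populates every one of those fields is an assumption, and `diagnose_empty_reply` is a made-up name.

```python
def diagnose_empty_reply(response: dict, max_tokens: int) -> str:
    """Classify a chat completion as ok, budget_exhausted, or empty_other."""
    choice = response["choices"][0]
    content = (choice.get("message", {}).get("content") or "").strip()
    if content:
        return "ok"
    finish = choice.get("finish_reason")
    used = response.get("usage", {}).get("completion_tokens")
    # Empty content plus a length stop (or a fully spent budget) suggests the
    # tokens went to reasoning that never reached the content field.
    if finish == "length" or (used is not None and used >= max_tokens):
        return "budget_exhausted"
    return "empty_other"


# Hand-written example payload, not a captured response.
sample = {
    "choices": [{"message": {"content": ""}, "finish_reason": "length"}],
    "usage": {"completion_tokens": 400},
}
print(diagnose_empty_reply(sample, max_tokens=400))
```

Logging this label per call makes a capped run distinguishable from a genuine refusal before anyone reads the transcripts.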

Your observation about “tuning architecture not size” maps directly to what I was seeing. The MoE’s reasoning behavior isn’t incidental — it’s structural. It reasons differently than the dense model at an architectural level, not just a capability level.

I ended up routing to Ollama’s native /api/generate endpoint instead, which handles the reasoning/output split cleanly.

Found two upstream bugs in the process — documented here if useful: dev.to/cloudninealt/self-hosting-gemma-4-for-production-automation-revealed-two-ollama-bugs-1oo4

Different deployment context (sovereign self-hosted infrastructure vs your production chat router) but the same underlying architecture behavior surfacing in both.

Great work on the Arabic production testing — the false negative failure mode on the dense model is a finding worth knowing about.

Ali Afana • May 16

Robin — this is a really helpful complement. You found a mechanism for what I observed behaviorally.
My "architectural slots for sequential sub-behavior" was a guess from the outside, looking at output patterns. You hit direct evidence: the 26B MoE was doing substantial reasoning work that exhausted the token budget before any output reached the content field. That's not metaphorical reasoning capacity. That's measurable token expenditure on internal computation the OpenAI-compatible interface doesn't surface.
Two angles on the same architectural fact, from completely different deployment contexts — sovereign self-hosted infra and a production e-commerce chat router shouldn't be running into the same artifact, but they are. Your trace data lets me trust the hypothesis more than my matrix alone justified.
The endpoint detail is the kind of thing I would have missed for weeks. /v1/chat/completions smoothing over the reasoning/output split is exactly the wrong abstraction to inherit from OpenAI — for closed models with no reasoning trace it doesn't matter, but for MoE-with-reasoning it's a silent failure surface. I'm on Google AI Studio, not Ollama, but worth checking whether the equivalent path is happening on the Gemini API side too — might explain part of the latency I was seeing.
Reading your bugs writeup now. Documenting the upstream bugs and getting them filed is the part of the work that's invisible from the outside and matters most.

Maya Andersson • May 22

Manual prompt tweaking like this hits a ceiling quickly. I spent days hand-tuning a prompt that plateaued at 71% accuracy on a classification task. Porting the pipeline to DSPy with GEPA pushed it to 79% accuracy after just 4 hours of compute. The prompt the optimizer generated was structurally weirder than anything a human would write, but the empirical results on the holdout set proved that programmatic optimization scales far better than intuition. Worth noting: hallucination detection without uncertainty estimation is post-hoc theatre. Once you have the optimization loop running, the eval has to keep up.

Ali Afana • May 22

Maya — agree on the manual-tuning ceiling. Optimizers finding structurally weird prompts that beat intuition is a real result for benchmark-labeled tasks.

But the article documented qualitative behavior, not accuracy: Dense refusing when the answer was sitting in its context. Most accuracy benchmarks would mask this exact failure — DSPy/GEPA optimize against the labels, and "model declined to answer questions it could have answered" isn't visible unless the eval explicitly penalizes false refusal.
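To make "explicitly penalizes false refusal" concrete, here is a sketch of an eval tally that keeps refusals in their own bins. It assumes each record already carries two reviewer-supplied flags, whether the catalog context held a match and whether the model refused; the field names and the sample records are invented for illustration, not data from the article.

```python
from collections import Counter


def classify(catalog_has_match: bool, model_refused: bool) -> str:
    """Bin one reply by what the context held and what the model did."""
    if catalog_has_match:
        return "false_refusal" if model_refused else "answered_with_match"
    return "correct_refusal" if model_refused else "answered_without_match"


def tally(records: list[dict]) -> Counter:
    """Count replies per bin so false refusals cannot hide inside an accuracy number."""
    return Counter(
        classify(r["catalog_has_match"], r["model_refused"]) for r in records
    )


# Invented records for illustration only.
records = [
    {"catalog_has_match": True, "model_refused": True},
    {"catalog_has_match": True, "model_refused": False},
    {"catalog_has_match": False, "model_refused": True},
]
print(tally(records))
```

The `answered_without_match` bin is where hallucination would show up; `false_refusal` is the bin the original matrix was about.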

And the failure mode turned out not to be prompt-quality at all. Re-ran the next day with max_tokens raised 400 → 4096: Dense recovered on every scenario, including the false-refusal that anchored the article. The cap was doing the work, not the prompt structure. Full walk-back: dev.to/alimafana/i-raised-gemma-4s...

On uncertainty estimation — agree on the principle, but this is the inverse case. The model has the data and hallucinates absence. Uncertainty on retrieval would catch a different bug than the one here.

Falsifier worth running on your setup: I'd bet a meaningful chunk of GEPA's 79% includes false refusals — model declining questions it could have answered — and the optimizer never sees them because the labels don't penalize them. If the holdout set's accessible, that's the bin worth checking.

Maya Andersson • May 22

Thanks Ali. Two follow-ups if you have data. First, did you run the experiment with N greater than or equal to 5 per condition to separate the prompt-rule effect from sampling variance? Refusal rates can shift 10 to 15% just from sampling noise on the same condition. Second, the architecture-versus-prompt attribution is the hard part. The calibration approach we use: fix temperature low, run a paired-comparison test asking the same question 20 times with each prompt rule, and count refusals as a binomial proportion with a Wilson CI. If the CIs do not overlap, the rule effect is real.
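A sketch of that calibration check, using the standard Wilson score interval at z = 1.96. The refusal counts in the example are invented to show the mechanics, and the non-overlap test is the conservative rule described above, not a full significance test.

```python
import math


def wilson_ci(refusals: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (refusals out of n runs)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = refusals / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True if two closed intervals share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Invented counts: 20 runs per prompt rule, same question, low temperature.
without_rule = wilson_ci(refusals=2, n=20)
with_rule = wilson_ci(refusals=15, n=20)
print(without_rule, with_rule, intervals_overlap(without_rule, with_rule))
```

With N=1 per condition the interval is too wide to say anything, which is the point of running 20.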

Ali Afana • May 22

honest answer: N=1 per condition per scenario, no Wilson CIs. Walk-back article flagged the data as "single run per pair, exploratory not statistical" — same applies here. Qualitative split was visible on spot-checks, but the magnitude isn't statistical-grade. Your paired-comparison + Wilson CI is the methodology for separating prompt effect from sampling noise.
