Update — 2026-05-21: Comment-thread feedback led me to re-run this with max_tokens raised from 400 to 4096. The architecture-mediated framing in t...
reluctance as the failure mode makes more sense than hallucination when you think about routing - constrained models hedge at instruction boundaries. MoE finds the right expert and stops. dense model has no off-ramp, so it stalls.
Mykola — "no off-ramp" is the cleanest description of the dense failure mode anyone has put on this thread. Reluctance as a structural property of where the architecture has nowhere to route, rather than a behavioral choice the model is making, reframes the whole result.
The hedge-at-instruction-boundaries observation lands hard for me because it also predicts where dense models would outperform MoE — single-behavior tasks with no boundary to hedge at. Which is most of what they're benchmarked on. Reluctance-as-failure-mode is invisible to benchmarks that don't include constrained multi-step instructions, which is why this surfaces in production and not in leaderboards.
the structural framing is the right one - behavioral explanations make you tune prompts forever, structural ones tell you when to just switch models
Right — and that's the framing I needed. "Time-to-realize" is where teams bleed weeks on behavioral failures before admitting the prompt isn't converging. Always good thinking from you on these threads, Mykola.
the 'time-to-realize' framing is the useful one. the cycle compresses fastest when you have a fixed eval set — behavioral failures look inconsistent across runs, structural ones reproduce cleanly. reproducibility is the signal that tells you which fight you're actually in.
Right — reproducibility as the diagnostic is the missing operational step. Behavioral failures hide in run-to-run variance; structural ones survive a fixed eval set unchanged. That's the cheap test before committing to either fight.
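A minimal sketch of that cheap test, assuming a fixed eval set and a caller-supplied failure check. The names `reproducibility_report`, `run_model`, and `is_failure` are hypothetical, and the default of five runs is an illustrative choice, not something measured on this thread.

```python
from typing import Callable, Dict, List


def reproducibility_report(
    eval_set: List[str],
    run_model: Callable[[str], str],
    is_failure: Callable[[str], bool],
    runs: int = 5,
) -> Dict[str, str]:
    """Label each prompt by how consistently it fails across repeated runs."""
    report: Dict[str, str] = {}
    for prompt in eval_set:
        # Re-run the same prompt and count how many runs fail the check.
        failures = sum(1 for _ in range(runs) if is_failure(run_model(prompt)))
        if failures == 0:
            report[prompt] = "passes"
        elif failures == runs:
            # Fails every time: candidate structural failure.
            report[prompt] = "reproducible failure"
        else:
            # Fails only sometimes: candidate behavioral failure.
            report[prompt] = "intermittent failure"
    return report
```

A prompt labelled "reproducible failure" is only a candidate for the structural bucket; the label narrows where to look, it does not prove the cause.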
This is one of the most honest write-ups I've read on
open model deployment. The "reluctance vs hallucination"
framing is exactly right — and underreported.
Your Round 2 finding points at something deeper than
prompt tuning: the instruction "refuse what's not in
the catalog" is ambiguous by design. It requires the
model to first search, then evaluate, then decide.
That's a two-step intent compressed into one sentence
of natural language. Dense vs MoE resolved the
ambiguity differently — but the ambiguity was always
there in the instruction itself.
This is what I've been thinking about with a different
angle: the problem isn't just the model or the
architecture — it's that natural language is a lossy
format for communicating structured intent to AI.
Your three rules work because they reduce ambiguity.
But they're still natural language, which means they
can be misread in architecture-specific ways, as you
discovered.
I've been working on a protocol called NEXUS that
tries to compress structured intent into unambiguous
shorthand — not for chat routing like yours, but for
code generation. The same problem though: one prompt
trying to communicate multi-step intent to a model
that resolves ambiguity differently depending on
architecture.
Your hypothesis about MoE having "architectural slots
for switching sub-behavior mid-generation" is
fascinating. I wonder if a more structured input
format — something closer to a schema than natural
language — would reduce the architecture-sensitivity
of instruction-following. The model would still
resolve it differently, but there'd be less surface
area for ambiguity to live in.
Anyway — exceptional work. The matrix format showing
R1 vs R2 side by side is exactly how this kind of
A/B should be documented.
Edwin — your framing is sharper than mine. The instruction was ambiguous before any architecture saw it. What I wrote up as "MoE and Dense resolved it differently" reads, after your comment, as "two architectures revealed the same latent ambiguity in opposite directions." That's the better description.
The "two-step intent compressed into one sentence" framing is the thing. "Refuse what's not in the catalog" carries:
a precondition (you have searched and confirmed absence)
a behavior (refuse honestly)
The precondition is implicit. Natural language is fine with that. Transformers, apparently, are not.
On your NEXUS direction — for chat orchestration the closest equivalent of structured intent is tool calling. The router decomposes into discrete tools (search_catalog, get_inventory, get_shipping_policy) → tool results → reply call. My router already does this for the search step. But the refuse step still lives inside the reply call's natural-language prompt. That's where the ambiguity is hiding.
The falsifier I haven't run yet: split the refuse step into its own tool call (acknowledge_absence with the SKU set as evidence). If MoE and Dense converge under that decomposition, your hypothesis wins — the ambiguity was the whole story. If they still diverge, there's still something architecture-specific about how the final natural-language synthesis collapses multi-step instructions.
Adding it to the next round.
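A rough sketch of that decomposition, under stated assumptions: `acknowledge_absence` is the tool named above, but its payload shape, the `ToolResult` record, and the `{"matches", "considered"}` return shape of `search_catalog` are all hypothetical. The point is only that the orchestrator, not the reply prompt, decides whether the absence step fires.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ToolResult:
    tool: str
    payload: Dict[str, object]


def acknowledge_absence(query: str, considered_skus: List[str]) -> ToolResult:
    """Hypothetical tool: records that search ran and nothing matched."""
    return ToolResult(
        "acknowledge_absence", {"query": query, "evidence": considered_skus}
    )


def run_turn(
    query: str,
    search_catalog: Callable[[str], Dict[str, List[str]]],
    reply_model: Callable[[List[ToolResult]], str],
) -> str:
    """Build the tool trace in code, then hand it to the reply call."""
    found = search_catalog(query)
    trace = [ToolResult("search_catalog", {"query": query, "matches": found["matches"]})]
    if not found["matches"]:
        # The precondition "searched and confirmed absence" is now explicit.
        trace.append(acknowledge_absence(query, found["considered"]))
    return reply_model(trace)
```

With this shape the reply call never has to infer whether a search happened; it only has to phrase whatever the last tool result says.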
Genuinely curious about NEXUS — do you encode preconditions as explicit assertions before the behavior, or does the protocol structure make them representable some other way? Asking because for chat I'm trying to figure out whether the "structured" layer should sit at the orchestration level (tool calls) or inside the prompt itself (some kind of typed instruction format).
That's exactly the gap I'm working on.
Currently NEXUS handles postconditions (!error) but
preconditions are still implicit. Your question
confirms what I've been thinking: the structured
layer needs to be inside the message format, not
just at the orchestration level.
The next evolution is an explicit assertion operator
(!! before behavior) that makes preconditions
first-class citizens of the protocol. The AI
receives ordered, non-ambiguous instructions —
no inference required.
Working on it. Will share when there's something
concrete to test.
Making preconditions first-class with !! is a cleaner design than I expected when I asked. The thing that interests me about the operator approach: it forces the prompt author to name what was previously implicit, which means a human reading the message format can audit the intent without running the model. That's a property natural language doesn't have at any compression level.
When you have something to test against, happy to throw the chat router's "search then refuse" instruction at it as a real-world case. The asymmetry — orchestration handles the search step cleanly, the refuse step is where the architecture-specific failure lives — might be a useful stress test for whether the protocol survives contact with the natural-language synthesis layer of a chat reply.
Looking forward to seeing it.
That property you named — "a reader can audit intent
without running the model" — is exactly the design
goal. The assertion layer isn't just for the AI.
It's a contract that's readable by humans,
executable by machines, and auditable by both.
Good news: we shipped !! as a first-class operator
in nxlang v4.2.0 yesterday. The implementation
went through a full audit before touching any code —
zero collisions with existing operators, 154 tests
passing, and the system prompt updated so the AI
generates executable guard logic, not literal comments.
Your search-then-reject case is exactly the kind
of real-world stress test the protocol needs.
The asymmetry you described — orchestration handles
search cleanly, rejection is where the architecture-
specific failure lives — maps directly to what !!
addresses. The rejection step is no longer inside
a natural language instruction. It's a named
precondition that fires before the action.
We're finishing Prism (the editor that runs on
nxlang) in the next few weeks. When it's live,
I'll reach out — would genuinely value seeing
whether the protocol survives contact with your
chat synthesis layer.
nexuslang.dev if you want to look at the grammar
in the meantime.
That's fast turnaround. Looking at the grammar, the structural insight that lands hardest for me: in NEXUS, "refuse if not in catalog" wouldn't be an instruction at all. It would be a precondition (!! catalog.contains(query)) whose violation triggers a deterministic error path. The refuse isn't a behavior the model chooses. It's a fallback the protocol guarantees.
That removes the choice-point my Dense model resolved into a single behavior. The model doesn't decide whether to refuse — it just generates a response on the happy path, and the protocol handles absence elsewhere. Architecture-sensitivity has nowhere to live there because the architecture isn't being asked to resolve ambiguity.
The asymmetry with !error is part of what makes this work. Preconditions fire before, errors handle after — separating "what must be true to proceed" from "what to do when something fails." I'd been conflating both layers in the chat-reply prompt.
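To make the before/after split concrete, here is a plain-Python analogy. It is not NEXUS syntax and makes no claim about how nxlang implements `!!` or `!error`; the `requires` decorator, `PreconditionFailed`, and the toy `CATALOG` are all hypothetical stand-ins.

```python
import functools


class PreconditionFailed(Exception):
    """Raised before the guarded function runs."""


def requires(check, message):
    """Run `check` on the arguments first; only call the function if it holds."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if not check(*args, **kwargs):
                raise PreconditionFailed(message)
            return fn(*args, **kwargs)
        return wrapper
    return decorator


CATALOG = {"white shirt": ["SKU-1", "SKU-2", "SKU-3"]}  # toy data


@requires(lambda query: query in CATALOG, "query not in catalog")
def reply(query):
    # Stand-in for the model call: it only ever sees the happy path.
    return f"We have {len(CATALOG[query])} options for {query}."


def handle(query):
    try:
        return reply(query)
    except PreconditionFailed:
        # Deterministic fallback: the refusal is not the model's decision.
        return "Sorry, we don't carry that."
```

Here `handle("white shirt")` goes through `reply`, while `handle("blue hat")` never reaches it, which is the property being described: the guarded step has no refuse-or-answer choice left to resolve.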
Looking forward to Prism. When you're ready for the stress test, I'll throw the catalog-refuse instruction at it in NEXUS form and see if the architecture-sensitive failure mode disappears.
Really enjoyed this. The most useful insight for me was your point that you were tuning architecture, not just parameter count. In production AI products, that distinction matters a lot more than leaderboard thinking. I also liked that you isolated the final reply step instead of changing the whole pipeline at once — that makes the MoE vs dense behavior much easier to trust. Curious whether you plan to run a larger follow-up across more merchant scenarios, especially around retrieval-heavy edge cases.
Thanks Vic. The "tuning architecture, not parameter count" framing surprised me too — I went in expecting size to be the variable that mattered, and walked out with a different model of the problem.
On the follow-up — yes, planned. Another commenter on this thread (Robin Converse, who's running Gemma 4 26B MoE in production on self-hosted infrastructure) flagged that my temperature 0.3 / max_tokens 400 caps were probably starving the reasoning layer, which would explain the Dense regression more cleanly than my "ambiguity collapse" hypothesis alone. Re-running with the budget uncapped and the reasoning trace as primary signal is the next experiment.
The retrieval-heavy edge cases you mentioned are the ones I most want to push on. Three categories I'd add to the matrix:
Long-tail product queries where semantic search returns low-confidence matches (current 0.3 threshold) — both architectures handled the easy queries fine; the divergence widened on borderline retrieval.
Multi-attribute filters compounded in one customer message ("white shirt, size L, under 100 shekels, in stock") — these compress multiple decisions into one turn, which is where the dense model's collapse behavior would surface again.
Negation in the customer's question ("do you have anything not polyester") — these are the cases that broke the search call in early Provia testing, and I'd want to know whether the reply-layer architecture difference compounds the retrieval ambiguity.
Goal is to publish the re-run as a follow-up piece, ideally with both deployment contexts (mine on managed Gemini API, Robin's on self-hosted Ollama) so the result isn't bound to one inference stack.
That makes sense. The uncapped rerun should be especially revealing if the dense model recovers once reasoning budget is restored, because then the failure mode looks less like a capability ceiling and more like orchestration pressure under retrieval ambiguity. I’d also be curious whether you log per-step retrieval confidence on the negation and multi-attribute cases. In fintech search workflows, those are exactly the turns where a model can sound fluent while taking the wrong branch underneath. Looking forward to the follow-up, especially the managed API versus self-hosted comparison.
Vic — "orchestration pressure under retrieval ambiguity" is the better description of what I was reaching for with "ambiguity collapse." Capability ceiling and orchestration pressure look identical from the outside (model refuses, model hedges), but only one is solvable by giving back budget. The uncapped re-run is the test that separates them.
Per-step retrieval confidence on negation and multi-attribute turns is exactly the instrumentation gap. Right now my router logs final retrieval scores but doesn't track confidence drift across the reasoning steps — which means when fluent-but-wrong happens, the post-hoc trace can't distinguish "model picked the wrong branch with high confidence" from "model picked the wrong branch under low confidence and committed anyway." Those are two different bugs with two different fixes. Adding step-wise confidence logging to the re-run.
The fintech-search parallel is useful to know about. The "fluent while taking the wrong branch underneath" failure mode is the worst class for production AI — passes every automated eval that grades on output coherence, surfaces only when a human reviewer who knows the domain catches it. Bible-study translation (another commenter on this thread) and fintech search are very different domains hitting the same structural problem. Worth thinking about whether the cross-domain pattern is its own writeup down the line.
Really like that split. The "wrong branch with high confidence" vs "wrong branch under low confidence" distinction feels like the operational boundary between ranking bugs and control bugs. In fintech search we see the same pattern when the system sounds crisp but latches onto the wrong issuer or time window. Final-score logging looks clean, but step-level confidence drift is probably where the branch flip becomes visible. If you write the cross-domain version, I would absolutely read it. It feels broader than RAG and closer to a general production AI reliability failure mode.
Vic — "ranking bugs vs control bugs" is the better taxonomy and I'm going to use it. It captures something my high-vs-low-confidence version only gestured at: the bug class tells you which layer to fix. Ranking bug → improve the scoring/retrieval surface. Control bug → improve the abstention/gating logic. Different surfaces, different fixes.
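One way the step-wise logging could feed that taxonomy, as a sketch only: the `Step` record, the `classify_wrong_branch` name, and the 0.5 threshold are hypothetical, and the input is assumed to be the trace of a turn a reviewer has already marked as wrong.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    name: str
    confidence: float  # logged confidence for this step, in [0, 1]


def classify_wrong_branch(steps: List[Step], threshold: float = 0.5) -> str:
    """Suggest which layer to inspect for a turn already known to be wrong."""
    if not steps:
        raise ValueError("need at least one logged step")
    lowest = min(step.confidence for step in steps)
    if lowest >= threshold:
        # Confident at every step and still wrong: look at scoring/retrieval.
        return "ranking"
    # Confidence dipped somewhere and the system committed anyway:
    # look at abstention/gating.
    return "control"
```

The threshold would need calibrating per stack; the useful part is that final-score logging alone cannot produce the `lowest` value this depends on.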
The cases on this thread map onto it cleanly. Your fintech wrong-issuer is ranking — model latched onto a similar-but-wrong entity at high confidence. Vadym's Bible-study case is mixed — ranking picked the wrong span to paraphrase, control had no abstention layer to catch it. The Provia Dense regression is pure control — the model "knew" the catalog matches but the rule pushed it toward refusal anyway. Jiwon's Graph-RAG isn't a fourth instance, it's the falsifier — his prediction is that typed graph paths collapse the divergence, which would mean structured schemas push control bugs out of the surface entirely.
You're right that it's broader than RAG. The failure shape — confident-sounding model committing to a wrong branch under retrieval ambiguity — generalizes to anywhere a model retrieves AND decides in the same forward pass.
Not committing to a timeline, but the convergence is real now: fintech, translation, Graph-RAG, chat router. That's the threshold where it stops being one builder's article and starts being a pattern worth naming.
This is one of the most honest controlled experiments I've read on
"prompt swap, same model, opposite behavior." A few specific places
your argument lands hard, and one place where I think Graph-RAG falls
exactly where you'd predict.
The single cleanest piece of evidence in your matrix is Scenario 2
— the same three white-shirt SKUs sitting in context, the MoE listing
them and the dense model refusing with "we don't have that." Same
instruction ("refuse what's not in the catalog"), opposite path. The
architecture-not-size framing isn't speculative when the
counterfactual is in the prompt itself.
Your "instruction has internal ambiguity, dense uniform activation
resolves into single behavior" hypothesis lines up with something I
hit in a different domain: a Graph-RAG engine I built
(PROJECT JAMES)
where the retrieval-conditioning context is typed graph paths like
A --[CAUSES]--> X --[REQUIRES]--> Y rather than natural-language instructions. Your framing predicts that the MoE/Dense divergence
should collapse for that input shape — because the typed schema has
no internal ambiguity for architecture-specific resolution to live in.
The prediction is testable: same regression suite, three Gemma 4
variants, one wiki corpus. If the divergence collapses, your
hypothesis sharpens. If it shows up anyway, the ambiguity is deeper
than the synthesis layer — and that's a bigger finding than either of
our individual articles. I just shipped v0.3.0 (Cognitive Middleware
Layer as the main theme); once the LLM Provider interface lands in
v0.3.x, I'll run this.
One small place I want to push back: the temperature-0.3 cap in
your Round 2 may not be neutral across the two architectures. Forcing
0.3 on an MoE may suppress the routing entropy that lets it resolve
the sequencing in the first place. The dense model loses less because
uniform activation gives up routing flexibility either way. If you do
the instruction-order ablation you mentioned, a temperature sweep
(0.3 / 0.7 / default) on the same prompt would help separate
"architecture matters" from "thermostat + architecture matters."
The honesty about not being able to disable thinking is what made me
read past the TL;DR. Most write-ups would hide that. Thank you.
Looking forward to Part 2 — or to running the cross-experiment
together if your Gemma access on the API side stays open.
Jeo, thank you. The Scenario 2 isolation is the strongest piece of the matrix and I almost cut it for length; useful to know that's the part doing the load-bearing work.
The Graph-RAG prediction is the falsifiable bridge I was hoping someone would name. If A --[CAUSES]--> X --[REQUIRES]--> Y typed paths collapse the divergence on the three variants, my hypothesis sharpens to "this lives in natural-language synthesis, not anywhere upstream of it." If the divergence reproduces on typed graph paths, the ambiguity is deeper than I claimed and the article's framing needs to be walked back publicly. Either outcome is a stronger result than the original. Genuinely happy to run the cross-experiment if the LLM Provider interface in v0.3.x makes it clean — let me know when the integration point is stable and I'll bring the API side.
The temperature point is the one I want to sit with longest. You're right that 0.3 isn't architecture-neutral, and the way you described it ("forcing 0.3 on an MoE may suppress the routing entropy that lets it resolve the sequencing in the first place") is the precise mechanism I hadn't articulated. The original temperature cap was a production-realism choice (chat replies need to be tight) but it muddies the architectural claim. I had been reasoning that "the cap is the same across variants, therefore the cap is controlled for," which is wrong if the cap interacts differently with each architecture. A temperature sweep (0.3 / 0.7 / default) on the same prompt, alongside the instruction-order ablation, is the right next round. Adding it.
On the thinking-mode disclosure: the latency numbers stop being comparable the moment thinking is on for one stack and not the others, so leaving that uncaveated would have been dishonest. Glad it landed; I'd rather have a small readership that reads past the TL;DR than a large one that screenshots a number out of context.
Re: Part 2 — the cross-experiment is the version I'd most want to publish. Two deployment contexts (your self-hosted Graph-RAG, mine on managed Gemini API), three variants, one prediction on the table. That's a stronger artifact than either of us writing it solo. Standing by for the v0.3.x integration signal.
Ali — that's the most generous reception of a comment I could've hoped for. The "ambiguity lives in synthesis, not retrieval-conditioning" framing is the kind of insight that makes me want to set up the experiment to falsify it.
Your prediction (the typed graph_path syntax should collapse the divergence on Graph-RAG) is the falsifiable bridge I was hoping someone would name on this thread. Both outcomes publish well: if the divergence collapses, your hypothesis sharpens to "ambiguity lives in natural-language synthesis specifically"; if it survives the swap, the ambiguity is deeper than the synthesis layer, and that's a finding bigger than either of our individual articles.
v0.3.0 shipped on 2026-05-17 with the Cognitive Middleware Layer as the main theme. The LLM Provider interface that will make the swap clean is the v0.3.x deliverable; realistic timing is 2–4 weeks. Once it's stable, swapping E4B / 26B MoE / 31B Dense behind the same wiki corpus becomes a one-env-var change.
The temperature-0.3 cap point you raised earlier is the one I want to keep sitting with — forcing 0.3 on an MoE may suppress the routing entropy that lets it resolve the sequencing in the first place, while the dense model loses less because uniform activation gives up routing flexibility either way. A sweep alongside the swap would separate "architecture matters" from "thermostat + architecture matters."
Separately — I posted a Write-track companion submission yesterday: "5 empty responses from gemma4:e4b. 4 hypotheses. 0 root cause.". Same shape of fair-witness work as yours, applied to where the smaller variant runs out of room on JAMES's cognitive stages. Useful background for the swap experiment.
Standing by on the integration signal.
— Jiwon (Jeo is the working handle from earlier posts; Jiwon is my given name, going by it from here)
Jiwon — noted, and thanks for the name signal. Switching to it from here.
Two-to-four-week window works on my end. I'm finishing the Provia Arabic localization layer in parallel, so the timing lines up well — uncapped Gemma re-run plus the cross-architecture swap on JAMES is a cleaner research artifact than either of us shipping a follow-up alone, and the timing means we can compare instrumentation choices before either of us locks them in.
On the temperature sweep: I think the cleanest design is a 3×3 — three variants (E4B / 26B MoE / 31B Dense), three temperature points (0.3 / 0.7 / default), one prompt structure per side (your typed graph paths, my natural-language refuse instruction). Nine cells per side, same wiki corpus on yours, same regression suite on mine. That separates "architecture matters" from "thermostat + architecture matters" the way you described, and gives us four meaningful comparisons: prompt-structure × architecture, prompt-structure × temperature, architecture × temperature, and the triple interaction.
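The grid is small enough to enumerate up front, which makes it easy to confirm both sides are running the same cells. A sketch, assuming `None` stands for "provider default" temperature and using hypothetical labels for the two prompt structures:

```python
from itertools import product

VARIANTS = ["E4B", "26B MoE", "31B Dense"]
TEMPERATURES = [0.3, 0.7, None]  # None = leave the provider default in place
PROMPT_STRUCTURES = ["typed_graph_paths", "natural_language_refuse"]


def build_grid():
    """Return one config dict per cell: 9 cells per prompt structure, 18 total."""
    return [
        {"prompt_structure": p, "variant": v, "temperature": t}
        for p, v, t in product(PROMPT_STRUCTURES, VARIANTS, TEMPERATURES)
    ]
```

Logging the cell dict next to every response keeps the later interaction analysis a matter of grouping rows rather than reconstructing which run was which.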
I'll read the "5 empty responses / 4 hypotheses / 0 root cause" piece before the swap window opens — fair-witness framing on a smaller variant running out of room on cognitive stages is exactly the variant baseline I need before reading the comparison data.
Standing by on the v0.3.x signal.
This is a really useful writeup because it avoids the usual “model A is better than model B” framing.
The interesting part is that the same instruction change pushed the two Gemma variants in opposite directions. The MoE moved toward grounded product answers, while the dense model became more conservative and started refusing even when the answer was in context.
That feels like the real lesson: prompt tuning is not portable across architectures, even inside the same model family.
I also like that you disclosed the hybrid stack. A lot of model comparisons hide the router, retrieval, translation, and profile extraction layers, but those pieces shape the final behavior heavily.
The next test I’d want to see is ablation: Arabic frame only, temperature change only, max token change only. That would show whether the improvement came from language framing, sampling control, or output budget.
suny — ran that exact ablation today. isolated max_tokens, kept frame + temp 0.3, raised 400 → 4096. dense recovered on every scenario including the false-refusal one. cap was doing the work.
follow-up: dev.to/alimafana/i-raised-gemma-4s...
Update on this one: re-ran the test with one variable changed, max_tokens raised from 400 to 4096, everything else identical. Dense recovered on every scenario including the s2 false-refusal that anchored the original article. The MoE-vs-Dense divergence I called "architecture-mediated" was mostly a budget bug.
Full walk-back here: dev.to/alimafana/i-raised-gemma-4s...
Robin Converse's hypothesis on the cap drove the test. Three deployment contexts (sovereign Ollama, managed Gemini API, JAMES production defaults) now point at the same cap pathology. Cross-stack joint piece queued for the next round.
The MoE-vs-dense divergence is fascinating, but the operational takeaway I keep landing on is that rules in the system prompt only cover the
middle of the risk distribution. The highest-stakes spans need to skip the model entirely.
We hit this on Gemini Flash Lite translating Bible-study content. The prompt has seven rules including "leave quoted scripture untouched, do
not paraphrase." Holds most of the time. But for any output where a plausible-but-wrong answer carries real downside, the rule will
eventually fail in a way no automated eval catches until a human reviewer spots it.
The shape we landed on: pre-process the input to swap high-stakes spans with opaque placeholders, send the prose-with-placeholders through
the LLM, then re-substitute the canonical text back from a trusted source after the response. Rules-as-prompt is belt. Substitution layer is
suspenders.
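A minimal sketch of that substitution layer, assuming the high-stakes spans are already known as exact strings from the trusted source. The `[[SPAN_n]]` token format and the `protect` / `restore` names are hypothetical, and overlapping spans are not handled.

```python
from typing import Dict, List, Tuple


def protect(text: str, spans: List[str]) -> Tuple[str, Dict[str, str]]:
    """Swap each known high-stakes span for an opaque placeholder."""
    mapping: Dict[str, str] = {}
    for i, span in enumerate(spans):
        token = f"[[SPAN_{i}]]"
        if span in text:
            text = text.replace(span, token)
            mapping[token] = span  # canonical text, never shown to the model
    return text, mapping


def restore(output: str, mapping: Dict[str, str]) -> str:
    """Re-insert canonical text; fail loudly if the model lost a placeholder."""
    missing = [token for token in mapping if token not in output]
    if missing:
        raise ValueError(f"model output dropped placeholders: {missing}")
    for token, canonical in mapping.items():
        output = output.replace(token, canonical)
    return output
```

The `ValueError` matters as much as the substitution: a dropped or mangled placeholder becomes a hard failure in code instead of a silent paraphrase that waits for a human reviewer.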
Edwin's NEXUS preconditions are the right algebraic move on the prompt itself. Where the cost of the wrong answer is high enough, the
cheapest defence is taking the choice away from the model.
The belt-and-suspenders framing is right, and the Bible-study case is the cleanest demonstration I've seen of why rules-in-prompt are insufficient for the tail of the risk distribution.
Where I want to think through the boundary: substitution works cleanly when the high-stakes spans are retrieval-like — they exist verbatim in a trusted source, and "correct" is a lookup. Scripture, legal citations, drug names with dosages, product SKUs and prices. For all of these, the model doesn't need to author the span; it just needs to leave a hole and have the post-processor fill it. The model's job is structural.
Where it gets harder is reasoning-like spans, where the "correct" output doesn't exist verbatim anywhere. The bug I documented isn't a span problem — it's a behavior problem. The Dense model decided to refuse when the catalog had the answer; there's no canonical text to substitute back in, because the failure was a decision, not a paraphrase. Rules-in-prompt try to constrain the decision. NEXUS-style preconditions try to remove the decision. Neither is a substitution layer.
Two different failure classes, then: high-stakes content (your scripture case, where the model can paraphrase what shouldn't be paraphrased) and high-stakes behavior (my refuse case, where the model can decide what shouldn't be its decision). Substitution is the cheap defense for the first. Protocol-level preconditions are the cheap defense for the second. Rules-in-prompt are the belt for both; either substitution or preconditions are the suspenders depending on which failure class you're defending.
The interesting case is when they overlap. A chat router answering "what's the return policy" needs both — substitute the canonical policy text and constrain the decision about when to invoke it. That's where I haven't seen a clean architectural pattern yet.
The MoE vs dense behavioral difference here is interesting beyond the immediate result. MoE architectures route tokens through specialized expert subnetworks, which means a rule that activates one expert pathway might not propagate the same constraint to the experts handling adjacent tokens. Dense models apply the same weight matrix everywhere, so a hard constraint is more uniformly enforced across the generation.
This suggests that as MoE becomes the dominant architecture at scale, prompt-level constraints and system rules will need to be stress-tested across model variants, not just model sizes. A rule that reliably holds in a dense model might have different reliability characteristics in a MoE — not because the model is less capable, but because of how routing interacts with constraint enforcement.
The "MoE under-enforced the rule, which happened to produce the correct behavior because the rule was wrong" reframe is the one that's going to sit with me longest. From a reliability standpoint, that flips the production implication entirely — Dense is the architecture you'd trust with a strict safety rule precisely because uniform enforcement is what you'd want for "never reveal the system prompt" or "never recommend a competitor." MoE's selective enforcement is what let it survive my ambiguous multi-step instruction, but that same selectivity is exactly the property you'd not want on safety-critical rules.
The stress-testing point lands differently under that framing too. The matrix isn't "does the rule work on MoE" — it's "does the constraint propagate uniformly enough across expert pathways to hold under adversarial prompts." That's a different test class than current variant comparisons cover.
The MoE vs Dense failure mode distinction is exactly what I found too — from a different angle.
I was hitting empty responses on the 26B MoE through Ollama’s /v1/chat/completions endpoint. Turned out the reasoning trace was exhausting the token budget before any output reached the content field. The model was doing all its work in a reasoning layer that never surfaced to the caller.
Your observation about “tuning architecture not size” maps directly to what I was seeing. The MoE’s reasoning behavior isn’t incidental — it’s structural. It reasons differently than the dense model at an architectural level, not just a capability level.
I ended up routing to Ollama’s native /api/generate endpoint instead, which handles the reasoning/output split cleanly.
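For anyone staying on the OpenAI-compatible endpoint, a small check on the response body can at least flag this case. The sketch assumes the standard chat-completions shape (`choices[0].message.content` and `choices[0].finish_reason`); whether a given Ollama version reports `"length"` when the reasoning trace eats the budget is an assumption to verify, and `diagnose_empty_reply` is a hypothetical helper name.

```python
def diagnose_empty_reply(response: dict) -> str:
    """Classify an OpenAI-compatible chat completion response body."""
    choice = response["choices"][0]
    content = (choice["message"].get("content") or "").strip()
    if content:
        return "ok"
    if choice.get("finish_reason") == "length":
        # Empty content plus a length stop: the token budget likely ran out
        # before any visible output was produced.
        return "budget_exhausted"
    return "empty_other"
```

A `budget_exhausted` result is a cue to raise `max_tokens` or switch endpoints before touching the prompt.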
Found two upstream bugs in the process — documented here if useful: dev.to/cloudninealt/self-hosting-gemma-4-for-production-automation-revealed-two-ollama-bugs-1oo4
Different deployment context (sovereign self-hosted infrastructure vs your production chat router) but the same underlying architecture behavior surfacing in both.
Great work on the Arabic production testing — the false negative failure mode on the dense model is a finding worth knowing about.
Robin — this is a really helpful complement. You found a mechanism for what I observed behaviorally.
My "architectural slots for sequential sub-behavior" was a guess from the outside, looking at output patterns. You hit direct evidence: the 26B MoE was doing substantial reasoning work that exhausted the token budget before any output reached the content field. That's not metaphorical reasoning capacity. That's measurable token expenditure on internal computation the OpenAI-compatible interface doesn't surface.
Two angles on the same architectural fact, from completely different deployment contexts — sovereign self-hosted infra and a production e-commerce chat router shouldn't be running into the same artifact, but they are. Your trace data lets me trust the hypothesis more than my matrix alone justified.
The endpoint detail is the kind of thing I would have missed for weeks. /v1/chat/completions smoothing over the reasoning/output split is exactly the wrong abstraction to inherit from OpenAI — for closed models with no reasoning trace it doesn't matter, but for MoE-with-reasoning it's a silent failure surface. I'm on Google AI Studio, not Ollama, but worth checking whether the equivalent path is happening on the Gemini API side too — might explain part of the latency I was seeing.
Reading your bugs writeup now. Documenting the upstream and getting them filed is the part of the work that's invisible from the outside and matters most.
Manual prompt tweaking like this hits a ceiling quickly. I spent days hand-tuning a prompt that plateaued at 71% accuracy on a classification task. Porting the pipeline to DSPy with GEPA pushed it to 79% accuracy after just 4 hours of compute. The prompt the optimizer generated was structurally weirder than anything a human would write, but the empirical results on the holdout set proved that programmatic optimization scales far better than intuition. Worth noting: hallucination detection without uncertainty estimation is post-hoc theatre. Once you have the optimization loop running, the eval has to keep up.
Maya — agree on the manual-tuning ceiling. Optimizers finding structurally weird prompts that beat intuition is a real result for benchmark-labeled tasks.
But the article documented qualitative behavior, not accuracy: Dense refusing when the answer was sitting in its context. Most accuracy benchmarks would mask this exact failure — DSPy/GEPA optimize against the labels, and "model declined to answer questions it could have answered" isn't visible unless the eval explicitly penalizes false refusal.
And the failure mode turned out not to be prompt-quality at all. Re-ran the next day with max_tokens raised 400 → 4096: Dense recovered on every scenario, including the false-refusal that anchored the article. The cap was doing the work, not the prompt structure. Full walk-back: dev.to/alimafana/i-raised-gemma-4s...
On uncertainty estimation — agree on the principle, but this is the inverse case. The model has the data and hallucinates absence. Uncertainty on retrieval would catch a different bug than the one here.
Falsifier worth running on your setup: I'd bet a meaningful chunk of GEPA's 79% includes false refusals — model declining questions it could have answered — and the optimizer never sees them because the labels don't penalize them. If the holdout set's accessible, that's the bin worth checking.
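Checking that bin only needs two extra fields per holdout record. A sketch, assuming each record carries hypothetical boolean fields `answerable`, `refused`, and `correct`; nothing here is part of DSPy or GEPA.

```python
from typing import Dict, List


def refusal_metrics(records: List[Dict[str, bool]]) -> Dict[str, float]:
    """Report accuracy alongside the false-refusal rate on answerable items."""
    if not records:
        raise ValueError("records must not be empty")
    answerable = [r for r in records if r["answerable"]]
    accuracy = sum(r["correct"] for r in records) / len(records)
    # A false refusal is a refusal on an item the model could have answered.
    false_refusal_rate = (
        sum(r["refused"] for r in answerable) / len(answerable)
        if answerable
        else 0.0
    )
    return {"accuracy": accuracy, "false_refusal_rate": false_refusal_rate}
```

Reporting the two numbers side by side keeps a refusal from being counted as just another wrong label.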
Thanks Ali. Two follow-ups if you have data. First, did you run the experiment with N greater than or equal to 5 per condition to separate the prompt-rule effect from sampling variance? Refusal rates can shift 10 to 15% just from sampling noise on the same condition. Second, the architecture-versus-prompt attribution is the hard part. The calibration approach we use: fix temperature low, run a paired-comparison test asking the same question 20 times with each prompt rule, and count refusals as a binomial proportion with a Wilson CI. If the CIs do not overlap, the rule effect is real.
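For reference, the Wilson interval is short enough to compute without a stats library. This is a sketch of the standard formula with z = 1.96 for a 95% interval; the function names are hypothetical and the non-overlap check mirrors the rule of thumb described above rather than a formal test.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (e.g. refusals / trials)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def intervals_overlap(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
    """True if the two closed intervals share any point."""
    return a[0] <= b[1] and b[0] <= a[1]
```

Usage would look like `intervals_overlap(wilson_interval(2, 20), wilson_interval(12, 20))`, comparing refusal counts out of 20 trials under each prompt rule.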
honest answer: N=1 per condition per scenario, no Wilson CIs. Walk-back article flagged the data as "single run per pair, exploratory not statistical" — same applies here. Qualitative split was visible on spot-checks, but the magnitude isn't statistical-grade. Your paired-comparison + Wilson CI is the methodology for separating prompt effect from sampling noise.
Ali, appreciate the honesty. That kind of qualitative-split write-up is genuinely useful even when not statistically powered. If you ever want to run a quick statistical pass, what has worked for us on prompt-rule experiments with low budget: 5 paired trials per condition, McNemar's test for refusal-versus-non-refusal flips, and a paired bootstrap on the continuous outcomes. Costs about $3 in API calls and gets you a defensible p-value. Clean implementation in statsmodels (statsmodels.stats.contingency_tables.mcnemar). Happy to share the runner if useful.
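For small samples the exact (binomial) form of McNemar's test fits in a few lines of standard-library Python. This is a sketch, not the runner mentioned above: `b` and `c` are the two discordant counts (pairs that refused under rule A only, and under rule B only), and `mcnemar_exact` is a hypothetical name.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts."""
    n = b + c
    if n == 0:
        return 1.0  # no flips in either direction, nothing to test
    k = min(b, c)
    # Under the null, each discordant pair flips either way with probability 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

One thing the exact form makes visible: with five discordant pairs all flipping the same way, `mcnemar_exact(5, 0)` is 0.0625, so five pairs per condition sits right at the edge of what the test can resolve and more pairs may be needed for a conventional 0.05 cutoff.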