This is a submission for the Gemma 4 Challenge: Build with Gemma 4
What I Built
On May 8, Cloudflare posted a deprecation notice.
@cf/...
The max_tokens gotcha has the same shape as a num_ctx default I tripped over in my own Ollama pipeline: /api/generate silently caps context at 2048 tokens unless you set num_ctx explicitly, regardless of the model's actual window. Different parameter, identical failure mode: the model "produces a worse output" when in reality it's been given a fraction of the input.
Your consistency-over-speed observation also tracks at much smaller scale. I ran Gemma 4 e2b on-device against a truncated meeting transcript and it pushed back: it flagged the input as incomplete instead of confidently summarizing the trailing fragment. Qwen 2.5 3B in the same condition just summarized the trailing Q&A and called it the meeting's headlines. Sounds like a Gemma 4 family trait, not just MoE.
Curious whether you ever saw the MoE flag a reflection query as malformed when retrieved chunks were too thin or contradictory — or was the consistency you measured purely latency/completion?
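For anyone hitting the same num_ctx default, here is a minimal sketch of a request body that sets it explicitly. The `options.num_ctx` field is part of Ollama's /api/generate request; the model tag, the prompt, and the 8192 value are placeholders, not what either pipeline in this thread actually uses.

```python
import json


def build_generate_payload(model: str, prompt: str, num_ctx: int) -> dict:
    """Build a JSON body for Ollama's POST /api/generate with an explicit context size."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,                  # ask for one JSON response instead of a stream
        "options": {"num_ctx": num_ctx},  # without this, the server default applies
    }


# Placeholder model tag and prompt, for illustration only.
payload = build_generate_payload(
    model="your-model-tag",
    prompt="Summarize this meeting transcript: ...",
    num_ctx=8192,
)
print(json.dumps(payload, indent=2))
```

The body would then be POSTed to the local Ollama server (by default http://localhost:11434/api/generate). Building the payload in one place makes it harder to forget the option on a single call path.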
The num_ctx parallel is the right frame — different parameter, same failure signature: the model produces something coherent, the degradation stays invisible until you've seen full output. That's exactly why max_tokens: 512 was hard to catch. No error. Just a shorter string.
On the reflection query question: the MoE never refused or flagged anything explicitly. Thin chunks produced thin reflections — shorter, more generic, still grammatically valid. The guard is upstream: the engine drops chunks below 0.45 cosine similarity and bails early if too few qualify. Model behavior stays consistent; the filtering is retrieval-layer, not model-layer.
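A rough sketch of that upstream guard, assuming chunks arrive as (text, similarity) pairs. The 0.45 floor is the number quoted above; `MIN_CHUNKS` and the function name are hypothetical, since the comment doesn't say how many chunks count as "too few".

```python
from typing import List, Optional, Tuple

SIMILARITY_FLOOR = 0.45  # cosine-similarity cutoff quoted in the comment
MIN_CHUNKS = 3           # assumption: the actual minimum isn't stated in the thread


def select_chunks(
    scored: List[Tuple[str, float]],
    floor: float = SIMILARITY_FLOOR,
    min_chunks: int = MIN_CHUNKS,
) -> Optional[List[str]]:
    """Drop chunks scoring below the floor; return None to signal an early bail-out."""
    kept = [text for text, score in scored if score >= floor]
    if len(kept) < min_chunks:
        return None  # too few qualifying chunks: skip the model call entirely
    return kept


strong = [("chunk a", 0.91), ("chunk b", 0.62), ("chunk c", 0.45)]
thin = [("chunk a", 0.91), ("chunk b", 0.44), ("chunk c", 0.30)]
print(select_chunks(strong))
print(select_chunks(thin))
```

Returning None rather than an empty list keeps "nothing qualified" distinct from "the caller passed no chunks", so the bail-out stays a retrieval-layer decision and the model never sees a thin context.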
Your e2b observation is the interesting one. Flagging incomplete input rather than summarizing confidently suggests either a training difference or the on-device context limit surfacing as something that looks like refusal. Was it consistent across sessions, or did it vary?
Honest answer: 6-8 runs, all hedging some variant of "give me the relevant transcript" — which on rereading sounds like the LLM equivalent of "this isn't what you sent me to a meeting for, please try again." Same vibe, different wording every time. Never ran formal N-trial variance because I was building, not publishing a paper, so 1 in 20 might confidently summarize the fragment and I'd never know.
Your (a) and (b): my data doesn't distinguish them. Clean ablation: feed e2b a 1500-token self-contained paragraph at num_ctx=2048. Still hedges → refusing short inputs. Summarizes happily → recognizing incompleteness. ~20 minutes of work. I haven't done it. Recording the idea here so my future self bumps into it during a coffee break.
One weak nudge toward (a): the refusal language was specifically "a mix of unrelated topics", which is a content critique, not "this is too short." A length heuristic wouldn't talk about topical coherence. But arguing from one output is exactly the variance question you just asked, so I'm calibrated about my own uncalibrated claim here.
The "mix of unrelated topics" detail is the signal. A length heuristic would produce "please provide more context", a process critique. "Mix of unrelated topics" is a content critique. The model evaluated what was there and described it. That's semantic evaluation, not a truncation fallback, which pushes toward (a) more than you're giving yourself credit for.
One confound in the ablation: a self-contained paragraph strips out what makes a meeting transcript behave like a meeting transcript — speaker labels, topic jumps, mid-utterance cuts. If e2b hedges on clean prose too, you've learned about short-text behavior, not whether the refusal is tracking length or incoherence artifacts specifically. Truncated paragraph from the same session would be the cleaner control.
Fair point on the process-vs-content distinction: "please provide more context" would have been the length-heuristic tell, and "mix of unrelated topics" is a content claim. You're right that I was giving the model less credit than the output earns.
Your truncated-paragraph-from-the-same-session control is also clearly the cleaner experiment. The clean-prose version was conflating length AND prosodic style; yours isolates length while holding transcript incoherence constant.
Refined matrix I'd actually run now:
1. full session (~5K tok, low cohesion) -> ground truth
2. paragraph from same session, untouched (~600 tok) -> length-only
3. paragraph from same session, cut mid-sentence -> length + truncation
4. unrelated clean prose paragraph (~600 tok) -> prose-style control
If e2b refuses on row 2 but accepts on row 4, the refusal is tracking something about the transcript distribution itself — discontinuity density, speaker-label noise — not length or training. More interesting than what I half-claimed in the article either way.
Will run this on the e2b box this week and post deltas back here.
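A hedged sketch of how that matrix could be scored over repeated trials, so a 1-in-20 outlier shows up as a rate instead of being missed. Everything here is hypothetical scaffolding: `fake_model` stands in for the real e2b call, `looks_like_hedge` is a crude keyword check, and the 0.5 cutoff is arbitrary.

```python
from typing import Callable, Dict


def hedge_rates(
    conditions: Dict[str, str],
    ask_model: Callable[[str], str],
    is_hedge: Callable[[str], bool],
    trials: int = 20,
) -> Dict[str, float]:
    """Run every condition `trials` times and return the fraction of hedging replies."""
    rates = {}
    for name, text in conditions.items():
        hedges = sum(is_hedge(ask_model(text)) for _ in range(trials))
        rates[name] = hedges / trials
    return rates


def tracks_transcript_distribution(rates: Dict[str, float], cutoff: float = 0.5) -> bool:
    """Decision rule from the comment: hedges on row 2 but not on row 4."""
    return rates["row2"] >= cutoff and rates["row4"] < cutoff


def fake_model(text: str) -> str:
    # Stand-in for the real model call: hedges whenever it sees a speaker label.
    if "Speaker" in text:
        return "This looks like a mix of unrelated topics."
    return "Summary: ..."


def looks_like_hedge(reply: str) -> bool:
    # Crude keyword heuristic; real replies vary in wording and need a better check.
    return "unrelated topics" in reply.lower()


conditions = {
    "row1": "Speaker A: ... (full session)",
    "row2": "Speaker B: ... (untouched paragraph)",
    "row3": "Speaker B: ... (paragraph cut mid-sen",
    "row4": "Plain prose paragraph about something else.",
}
rates = hedge_rates(conditions, fake_model, looks_like_hedge, trials=5)
print(rates, tracks_transcript_distribution(rates))
```

With a real model behind `ask_model`, the per-row rates are the deltas worth posting; the boolean is just the row 2 versus row 4 reading spelled out.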
Row 3 is the one I'd watch most closely. Rows 2 and 4 test transcript distribution versus prose style, which is useful. But row 3 isolates syntactic incompleteness: a paragraph cut mid-sentence is a different kind of broken than a mid-session paragraph, which is semantically incoherent but syntactically whole. If the model responds the same way to rows 2 and 3, the signal is probably "this input is damaged" as a class. If it responds differently, it's distinguishing between syntactic and semantic damage, which would be a more specific learned behavior than the heuristic framing suggests.
"Discontinuity density" is the right term for what row 2 actually contains. Meetings have high discontinuity structurally — topic jumps, speaker switches, dangling references — so a mid-session extract feels incomplete even when every sentence is grammatically complete....
The math on this is brutal, and it perfectly highlights the hidden traps of relying on third-party managed AI primitives.
Your breakdown of why the recommended Kimi K2.6 upgrade completely blows up a low-cost, high-volume ingest architecture, forcing $4.00/M output tokens onto a reflection layer that processes 100k+ documents, is a massive reality check. Switching to Gemma 4 MoE (@cf/google/gemma-4-26b-a4b-it) to keep the pipeline entirely edge-native and within the free tier is incredibly clever. The warning about how its constraint-analysis behavior literally regurgitates rules as bullet points if your system prompt is too verbose is an invaluable catch for anyone else facing this exact 22-day deprecation clock. 👍
The system prompt behavior is the one most people will only hit after the fact - verbose prompts feel safe until the model starts treating them as rules to enumerate...
Keeping system prompts lean is quickly becoming a core senior dev skill. 👍
Daniel, this is an excellent breakdown of a major operational risk in cloud environments.
Here is why I find it so good:
The sudden deprecation of production models and the resulting cost spikes highlight exactly why vendor lock-in and cost predictability are critical architectural concerns. Your pivot to bypass the token cost trap is a pragmatic solution, and one I like, to an issue many enterprise teams face when third-party platforms force their hand.
Strong engineering decision.
It is really well done! Made me think very hard about why I didn't write this article and you did :)
The last line got me. Write it. The Hermes challenge is still open, and so is the GitHub Finish-Up-A-Thon ($3k prize pool, closes June 7). That frustration is exactly the kind of thing that gets reads.