DEV Community

Cloudflare Deprecated My Production Model. The Recommended Upgrade Costs $4/M Tokens. Gemma 4 MoE Doesn't.

Daniel Nwaneri on May 19, 2026

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

What I Built

On May 8, Cloudflare posted a deprecation notice. @cf/...

thehwang • May 19

The max_tokens gotcha hits the same shape as a num_ctx default I tripped over in my own Ollama pipeline — /api/generate silently caps context at 2048 tokens unless you set num_ctx explicitly, regardless of the model's actual window. Different parameter, identical failure mode: the model "produces a worse output" when in reality it's been given a fraction of the input.
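A minimal sketch of the explicit-`num_ctx` fix described above, assuming a local Ollama server on its default port. The `options.num_ctx` field in the `/api/generate` body follows the Ollama REST API; the 8192 value, the helper names, and any model tag you pass in are illustrative assumptions, not taken from either pipeline.

```python
import json
import urllib.request

# Assumption: Ollama running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_payload(model: str, prompt: str, num_ctx: int = 8192) -> dict:
    """Build an /api/generate body that sets the context window explicitly."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for a single JSON object instead of a stream
        "options": {"num_ctx": num_ctx},  # without this, the server default applies
    }


def generate(model: str, prompt: str, num_ctx: int = 8192) -> str:
    """Send the request and return the generated text."""
    payload = build_generate_payload(model, prompt, num_ctx)
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

Keeping the payload builder separate from the network call makes it easy to assert in a test that `num_ctx` is actually being sent.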

Your consistency-over-speed observation also tracks at much smaller scale. I ran Gemma 4 e2b on-device against a truncated meeting transcript and it pushed back — flagged the input as incomplete instead of confidently summarizing the trailing fragment. Qwen 2.5 3B in the same condition just summarized the trailing Q&A and called it the meeting's headlines. Sounds like a Gemma 4 family trait, not just MoE.

Curious whether you ever saw the MoE flag a reflection query as malformed when retrieved chunks were too thin or contradictory — or was the consistency you measured purely latency/completion?

Daniel Nwaneri • May 19

The num_ctx parallel is the right frame: different parameter, same failure signature. The model produces something coherent, and the degradation stays invisible until you've seen the full output. That's exactly why max_tokens: 512 was hard to catch. No error. Just a shorter string.

On the reflection query question: the MoE never refused or flagged anything explicitly. Thin chunks produced thin reflections — shorter, more generic, still grammatically valid. The guard is upstream: the engine drops chunks below 0.45 cosine similarity and bails early if too few qualify. Model behavior stays consistent; the filtering is retrieval-layer, not model-layer.
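The upstream guard could look roughly like the sketch below. Only the 0.45 cosine-similarity floor comes from the description above; the `Chunk` type, the `MIN_CHUNKS` value, and the function name are hypothetical stand-ins, since the thread does not say how many chunks count as "too few".

```python
from dataclasses import dataclass
from typing import List, Optional

SIMILARITY_FLOOR = 0.45  # from the description above
MIN_CHUNKS = 3  # hypothetical: the real minimum is not stated in the thread


@dataclass
class Chunk:
    text: str
    similarity: float  # cosine similarity to the query, computed upstream


def select_chunks(
    chunks: List[Chunk],
    floor: float = SIMILARITY_FLOOR,
    min_chunks: int = MIN_CHUNKS,
) -> Optional[List[Chunk]]:
    """Drop weak chunks; return None to signal 'bail before calling the model'."""
    kept = [chunk for chunk in chunks if chunk.similarity >= floor]
    if len(kept) < min_chunks:
        return None
    # Strongest matches first, so the prompt leads with the best evidence.
    return sorted(kept, key=lambda chunk: chunk.similarity, reverse=True)
```

Returning `None` rather than an empty list keeps the "not enough evidence" case distinct from a successful retrieval, so the caller can skip the model call entirely.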

Your e2b observation is the interesting one. Flagging incomplete input rather than summarizing confidently suggests either a training difference or the on-device context limit surfacing as something that looks like refusal. Was it consistent across sessions, or did it vary?

thehwang • May 19

Honest answer: 6-8 runs, all hedging some variant of "give me the relevant transcript" — which on rereading sounds like the LLM equivalent of "this isn't what you sent me to a meeting for, please try again." Same vibe, different wording every time. Never ran formal N-trial variance because I was building, not publishing a paper, so 1 in 20 might confidently summarize the fragment and I'd never know.
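The "1 in 20 and I'd never know" worry can be put in numbers with a short sketch, assuming each run is an independent trial with a fixed chance of a confident summary (an idealization, not something the runs above establish).

```python
def prob_all_hedge(confident_rate: float, runs: int) -> float:
    """Chance that every run hedges if each run confidently summarizes with confident_rate."""
    return (1.0 - confident_rate) ** runs


def upper_bound_zero_events(runs: int, confidence: float = 0.95) -> float:
    """Largest per-run rate still consistent with seeing zero events in `runs` trials."""
    return 1.0 - (1.0 - confidence) ** (1.0 / runs)


if __name__ == "__main__":
    # With a true 1-in-20 rate, 8 straight hedges is still the likely outcome.
    print(f"P(8/8 hedge | rate=0.05) = {prob_all_hedge(0.05, 8):.3f}")
    # Zero confident summaries in 8 runs only bounds the rate loosely.
    print(f"95% upper bound after 8 clean runs = {upper_bound_zero_events(8):.3f}")
```

Under those assumptions, eight hedges in a row happen about 66% of the time even at a 1-in-20 confident-summary rate, and eight clean runs only rule out rates above roughly 31%.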

Your (a) and (b): my data doesn't distinguish them. Clean ablation: feed e2b a 1500-token self-contained paragraph at num_ctx=2048. Still hedges → refusing short inputs. Summarizes happily → recognizing incompleteness. ~20 minutes of work. I haven't done it. Recording the idea here so my future self bumps into it during a coffee break.

One weak nudge toward (a): the refusal language was specifically "a mix of unrelated topics", a content critique, not "this is too short." A length heuristic wouldn't talk about topical coherence. But arguing from one output is exactly the variance question you just asked, so I'm calibrated about my own uncalibrated claim here.

Daniel Nwaneri • May 20

The "mix of unrelated topics" detail is the signal. A length heuristic would produce "please provide more context", a process critique. "Mix of unrelated topics" is a content critique. The model evaluated what was there and described it. That's semantic evaluation, not a truncation fallback, which pushes toward (a) more than you're giving yourself credit for.

One confound in the ablation: a self-contained paragraph strips out what makes a meeting transcript behave like a meeting transcript (speaker labels, topic jumps, mid-utterance cuts). If e2b hedges on clean prose too, you've learned about short-text behavior, not whether the refusal is tracking length or incoherence artifacts specifically. A truncated paragraph from the same session would be the cleaner control.

thehwang • May 20

Fair point on the process-vs-content distinction: "please provide more context" would have been the length-heuristic tell, and "mix of unrelated topics" is a content claim. You're right that I was giving the model less credit than the output earns.

Your truncated-paragraph-from-the-same-session control is also clearly the cleaner experiment. The clean-prose version was conflating length AND prose style; yours isolates length while holding transcript incoherence constant.

Refined matrix I'd actually run now:

Row 1: full session (~5K tok, low cohesion) -> ground truth
Row 2: paragraph from same session, untouched (~600 tok) -> length-only
Row 3: paragraph from same session, cut mid-sentence -> length + truncation
Row 4: unrelated clean prose paragraph (~600 tok) -> prose-style control

If e2b refuses on row 2 but accepts on row 4, the refusal is tracking something about the transcript distribution itself — discontinuity density, speaker-label noise — not length or training. More interesting than what I half-claimed in the article either way.
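One way to make the four-row matrix repeatable is a small harness like the sketch below. Everything in it is an assumption for illustration: `summarize` stands in for whatever function calls e2b, the hedge markers are loosely based on the wording reported earlier in the thread, and substring matching is a crude proxy for reading the outputs by hand.

```python
from typing import Callable, Dict

# Hypothetical marker phrases, loosely based on the refusals described in the thread.
HEDGE_MARKERS = ("unrelated topics", "relevant transcript", "incomplete")


def cut_mid_sentence(paragraph: str) -> str:
    """Drop the second half of the final sentence so the text ends mid-thought."""
    body = paragraph.rstrip().rstrip(".!?")
    last_break = max(body.rfind(". "), body.rfind("? "), body.rfind("! "))
    start = last_break + 2 if last_break != -1 else 0
    midpoint = start + (len(body) - start) // 2
    return body[:midpoint].rstrip()


def build_matrix(full_session: str, paragraph: str, control_prose: str) -> Dict[str, str]:
    """Assemble the four conditions; rows 2 and 3 share the same source paragraph."""
    return {
        "row1_full_session": full_session,
        "row2_paragraph_untouched": paragraph,
        "row3_paragraph_cut": cut_mid_sentence(paragraph),
        "row4_clean_prose": control_prose,
    }


def looks_like_hedge(output: str) -> bool:
    """Crude check: does the model output contain any hedge marker?"""
    lowered = output.lower()
    return any(marker in lowered for marker in HEDGE_MARKERS)


def run_matrix(inputs: Dict[str, str], summarize: Callable[[str], str]) -> Dict[str, bool]:
    """Run every condition once and record whether the output reads as a hedge."""
    return {row: looks_like_hedge(summarize(text)) for row, text in inputs.items()}
```

Calling `run_matrix` several times per row, rather than once, would also give a rough view of run-to-run variance.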

Will run this on the e2b box this week and post deltas back here.

Daniel Nwaneri • May 20 • Edited

Row 3 is the one I'd watch most closely. Rows 2 and 4 test transcript distribution versus prose style, which is useful. But row 3 isolates syntactic incompleteness: cut mid-sentence is a different kind of broken than a mid-session paragraph, which is semantically incoherent but syntactically whole. If the model responds the same way to rows 2 and 3, the signal is probably "this input is damaged" as a class. If it responds differently, it's distinguishing between syntactic and semantic damage, which would be a more specific learned behavior than the heuristic framing suggests.

"Discontinuity density" is the right term for what row 2 actually contains. Meetings have high discontinuity structurally — topic jumps, speaker switches, dangling references — so a mid-session extract feels incomplete even when every sentence is grammatically complete....

Syed Ahmer Shah • May 20

The math on this is brutal, and it perfectly highlights the hidden traps of relying on third-party managed AI primitives.

Your breakdown of why the recommended Kimi K2.6 upgrade completely blows up a low-cost, high-volume ingest architecture—forcing $4.00/M output tokens onto a reflection layer that processes 100k+ documents—is a massive reality check. Switching to Gemma 4 MoE (@cf/google/gemma-4-26b-a4b-it) to keep the pipeline entirely edge-native and within the free tier is incredibly clever. The warning about how its constraint-analysis behavior literally regurgitates rules as bullet points if your system prompt is too verbose is an invaluable catch for anyone else facing this exact 22-day deprecation clock. 👍
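For a sense of scale, the arithmetic behind that cost concern can be sketched as follows. The $4.00/M output price and the 100k document count are the figures quoted above; the 512 output tokens per document is purely an assumption for illustration, since the comments do not state the real per-document output size.

```python
def output_cost_usd(documents: int, output_tokens_per_doc: int, usd_per_million_tokens: float) -> float:
    """Total output-token cost for one pass over the corpus."""
    total_tokens = documents * output_tokens_per_doc
    return total_tokens / 1_000_000 * usd_per_million_tokens


if __name__ == "__main__":
    # Assumption: 512 output tokens per document; price and volume from the comment above.
    cost = output_cost_usd(documents=100_000, output_tokens_per_doc=512, usd_per_million_tokens=4.00)
    print(f"One reflection pass: ${cost:,.2f}")
```

Under those assumed numbers a single pass costs about $204.80 in output tokens alone, and the total scales linearly with both document count and output length.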

Daniel Nwaneri • May 20

The system prompt behavior is the one most people will only hit after the fact - verbose prompts feel safe until the model starts treating them as rules to enumerate...

Syed Ahmer Shah • May 20

Keeping system prompts lean is quickly becoming a core senior dev skill. 👍

Ali-Funk • May 26

Daniel, this is an excellent breakdown of a major operational risk in cloud environments.
Here is why I find it so good:
The sudden deprecation of production models and the resulting cost spikes highlight exactly why vendor lock-in and cost predictability are critical architectural concerns. Your pivot to bypass the token cost trap is a pragmatic solution, which I like, to an issue many enterprise teams face when third-party platforms force their hand.
Strong engineering decision.
It is really well done! Made me think very hard about why I didn't write this article and you did :)

Daniel Nwaneri • May 26

The last line got me. Write it. The Hermes challenge is still open, and so is the GitHub Finish-Up-A-Thon ($3k prize pool, closes June 7). That frustration is exactly the kind of thing that gets reads.