This is a submission for the Gemma 4 Challenge: Build with Gemma 4
What I Built
On May 8, Cloudflare posted a deprecation notice.
@cf/moonshotai/kimi-k2.5, the model synthesising knowledge across 45,000 of my saved tweets, was going away on May 30.
I had a live production system, a daily cron, and 100,000+ indexed documents depending on that model. I had 22 days.
Cloudflare's recommended replacement: @cf/google/gemma-4-26b-a4b-it.
So I migrated and benchmarked every step. Here's what I found, what broke, and why Gemma 4 MoE was the right call even after a better Kimi arrived.
bookmark-cli is a personal knowledge engine I built after getting frustrated with X's native search. It syncs my bookmarks and likes into local SQLite, then pushes everything into a Cloudflare Worker for semantic retrieval.
The numbers:
- 45,053 tweets (11,835 bookmarks + 33,218 likes)
- 7,155 photo tweets enriched by Llama 4 Scout vision descriptions
- 100,302 total documents in the vector index
- Daily cron syncing new content automatically
- $5/month total running cost
The architecture: bookmark-cli calls vectorize-mcp-worker, which runs hybrid BM25 + vector search, cross-encoder reranking, and a knowledge reflection layer that synthesises connections across documents.
One question worth answering upfront: if the data is from 2023, what good is it?
This isn't a news feed — it's a thinking tool. When you liked a tweet about RAG failure modes two years ago, you were signalling "this matters to me." The reflection engine connects that to four other things you saved that week across different topics and surfaces the thread you didn't consciously notice. The index only contains what you chose to save. No engagement algorithm, no ads, no recency bias — just your own curation, made searchable and cross-referenced. Google searches the internet. This searches your mind.
A reflection the engine generated from tweets I saved about AI and work — none of which said this:
"Non-technical users are increasingly using AI agents to 'vibe-code' large amounts of software without manual code review or verification. This reliance on generated outputs often involves a level of blind trust that bypasses the rigorous research and scrutiny essential to traditional programming. Although this method can appear highly productive, the lack of technical expertise makes debugging these systems exceptionally difficult and prone to subtle, painful failures."
That's three fragments from different weeks, connected by the model into one coherent insight. The technical details are below.
Demo
Live dashboard: vectorize-mcp-worker.fpl-test.workers.dev/dashboard
Code
- vectorize-mcp-worker: github.com/dannwaneri/vectorize-mcp-worker
- bookmark-cli: github.com/dannwaneri/bookmark-cli
- Benchmark endpoint: POST /benchmark with Authorization: Bearer header
How I Used Gemma 4
Why Gemma 4 MoE specifically
Three Gemma 4 variants exist on Workers AI. I needed to pick one.
| Model | Parameters | Best for |
|---|---|---|
| gemma-4-e4b-it | 4B total (dense) | Local / memory-constrained |
| gemma-4-27b-it | 27B dense | Max quality, more compute |
| gemma-4-26b-a4b-it | 26B total, 4B active (MoE) | Edge inference, reasoning depth |
The reflection layer does multi-document synthesis — it reads 5 related chunks and produces a structured 3-sentence insight. That's not a summarisation task, it's a reasoning task. The 4B dense model would have been too shallow. The 27B dense would have been too slow at the edge.
4B active parameters per forward pass. 26B total. At the edge, you need the first number. For multi-document synthesis, you need the second. The MoE architecture is the only way to have both.
The entire pipeline — embed, retrieve, rerank, reflect — runs inside one Cloudflare Worker. Gemma 4 MoE is a native Workers AI binding. No external API call. No data leaving the edge.
The migration
The codebase already had a REFLECTION_MODEL env var. The model registry needed one addition:
export const REFLECTION_MODELS = {
  // Replacement model recommended by Cloudflare.
  'gemma-4': {
    id: '@cf/google/gemma-4-26b-a4b-it' as const,
    label: 'Gemma 4 26B MoE (4B active)',
    note: 'Recommended. 4B active params via MoE — edge-native, no external hop.',
  },
  // Previous model, scheduled for removal.
  'kimi-k2.5': {
    id: '@cf/moonshotai/kimi-k2.5' as const,
    label: 'Kimi K2.5',
    note: 'Deprecated May 30 2026.',
  },
};
Then:
wrangler secret put REFLECTION_MODEL
# enter: gemma-4
wrangler deploy
That was the migration. The reflection engine reads env.REFLECTION_MODEL dynamically. Nothing else changed.
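A minimal sketch of what that dynamic lookup could look like, assuming the registry shape shown above. The `Env` interface, `resolveReflectionModel`, and the fallback to `gemma-4` for unset or unknown values are illustrative assumptions, not the worker's actual code.

```typescript
// Registry mirrors the article's REFLECTION_MODELS (notes omitted for brevity).
const REFLECTION_MODELS = {
  'gemma-4': { id: '@cf/google/gemma-4-26b-a4b-it', label: 'Gemma 4 26B MoE (4B active)' },
  'kimi-k2.5': { id: '@cf/moonshotai/kimi-k2.5', label: 'Kimi K2.5' },
} as const;

type ReflectionModelKey = keyof typeof REFLECTION_MODELS;

// Hypothetical env shape: only the one variable this sketch needs.
interface Env {
  REFLECTION_MODEL?: string;
}

// Assumed default when the variable is unset or names an unknown model.
const DEFAULT_MODEL: ReflectionModelKey = 'gemma-4';

// Own-property check so inherited names like "toString" are rejected.
function isModelKey(value: string): value is ReflectionModelKey {
  return Object.prototype.hasOwnProperty.call(REFLECTION_MODELS, value);
}

// Hypothetical helper: map env.REFLECTION_MODEL to a Workers AI model id.
function resolveReflectionModel(env: Env): string {
  const requested = (env.REFLECTION_MODEL ?? '').trim();
  const key = isModelKey(requested) ? requested : DEFAULT_MODEL;
  return REFLECTION_MODELS[key].id;
}
```

Falling back to a known key rather than throwing is a design choice for a cron-driven pipeline: a typo in the secret degrades to the default model instead of stopping the daily sync.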
Three gotchas worth knowing
1. max_tokens. Gemma 4 is a thinking model. It writes a full reasoning chain before producing output. With max_tokens: 180 set for the old model, Gemma 4 was spending all its tokens on internal reasoning and returning empty content. Bumping to max_tokens: 2048 fixed it.
2. Response extraction. For thinking models, use choices[0].message.content — not .reasoning and not .response. The reasoning field is the internal chain of thought, not the answer.
3. Prompt format. Verbose rule-lists trigger Gemma 4's constraint-analysis behaviour — it restates your rules as bullet points instead of following them. Keep prompts simple and end with a direct action cue:
Read the new source and related sources below, then write 3 plain prose sentences
that synthesise them into a knowledge base entry. No bullets. No analysis. No preamble.
Just 3 sentences.
New: "..."
Related: ...
Write the 3-sentence synthesis now:
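The three gotchas can be sketched together. This is an illustration under assumptions: the response shape is inferred from the `choices[0].message.content` path mentioned above, the `messages` / `max_tokens` request body is assumed to be what the model binding accepts, and the helper names and the numbered "Related:" formatting are hypothetical.

```typescript
// Response shape assumed from the article's `choices[0].message.content` path.
interface ChatResponse {
  choices?: Array<{ message?: { content?: string | null; reasoning?: string | null } }>;
}

// Gotcha 1: budget for the reasoning chain AND the answer.
const MAX_TOKENS = 2048;

// Gotcha 3: short prompt that ends with a direct action cue.
function buildReflectionPrompt(newSource: string, related: string[]): string {
  // Numbering the related sources is an assumption; the article elides this part.
  const relatedText = related.map((r, i) => `(${i + 1}) ${r}`).join(' ');
  return [
    'Read the new source and related sources below, then write 3 plain prose sentences',
    'that synthesise them into a knowledge base entry. No bullets. No analysis. No preamble.',
    'Just 3 sentences.',
    `New: "${newSource}"`,
    `Related: ${relatedText}`,
    'Write the 3-sentence synthesis now:',
  ].join('\n');
}

// Hypothetical helper: request body for a chat-style call.
function buildRequest(newSource: string, related: string[]) {
  return {
    messages: [{ role: 'user', content: buildReflectionPrompt(newSource, related) }],
    max_tokens: MAX_TOKENS,
  };
}

// Gotcha 2: read the answer from message.content only, never from reasoning.
function extractReflection(res: ChatResponse): string | null {
  const content = res.choices?.[0]?.message?.content;
  if (typeof content !== 'string') return null;
  const text = content.trim();
  // Empty content likely means the token budget went to the reasoning chain.
  return text.length > 0 ? text : null;
}
```

Returning `null` for empty content, instead of falling back to the reasoning field, keeps a truncated response from being stored in the index as if it were a reflection.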
The benchmark
I built a /benchmark endpoint that runs both models in parallel against the same query, logs latency and response to D1, and returns side-by-side results.
POST /benchmark
{ "query": "What are the common failure modes of RAG systems?" }
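A rough sketch of the parallel run, assuming each model is wrapped in a function that takes the query and resolves to text. `ModelRunner`, `timed`, and `runBenchmark` are hypothetical names, and the D1 logging step is left out.

```typescript
// Hypothetical runner: takes a query, resolves to the model's text output.
type ModelRunner = (query: string) => Promise<string>;

interface BenchmarkRow {
  model: string;
  ok: boolean;
  latencyMs: number;
  response: string | null;
  error: string | null;
}

// Time one model call; a thrown error becomes a failed row instead of rejecting.
async function timed(model: string, run: ModelRunner, query: string): Promise<BenchmarkRow> {
  const start = Date.now();
  try {
    const response = await run(query);
    return { model, ok: true, latencyMs: Date.now() - start, response, error: null };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { model, ok: false, latencyMs: Date.now() - start, response: null, error: message };
  }
}

// Start every model at once so each sees the same query under the same conditions.
function runBenchmark(query: string, runners: Record<string, ModelRunner>): Promise<BenchmarkRow[]> {
  const names = Object.keys(runners);
  return Promise.all(names.map((name) => timed(name, runners[name], query)));
}
```

Catching inside `timed` means one model failing does not discard the other model's result, which is what lets a FAILED cell sit next to a valid latency in the same row.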
Results from D1 (9 real queries):
| Query | Gemma 4 MoE | Kimi K2.5 |
|---|---|---|
| RAG failure modes | 12.9s | 12.4s |
| Embedding model selection | 9.9s | 90.7s ⚠️ |
| BM25 vs vector search | 19.6s | 7.3s |
| Reducing hallucination | 19.0s | 6.9s |
| Chunking strategies | 9.3s | 9.0s |
| Edge AI model selection | 11.8s | 8.3s |
| MoE efficiency at scale | 16.5s | 8.0s |
| Cloudflare Workers AI | 22.8s | FAILED |
| KB maintenance | 10.1s | 5.5s |
Kimi K2.5 was faster on 7 of 9 queries. But it produced a 90-second response on one query and failed outright on another — within a single benchmark run. A model that's faster on average but unreliable under load isn't a production model.
Gemma 4 MoE was consistent. Every query returned. Every response was coherent. Latency was predictable.
Beyond the latency numbers, the Kimi K2.5 reflections in the index all started with "Here are the 3 sentences:" — the model was leaking the instruction prefix into every stored reflection. Gemma 4 produces clean prose output with the right prompt.
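The headline claims can be re-derived from the nine rows in the table above. A small sketch, where `null` stands for the failed Kimi run and the variable and helper names are illustrative:

```typescript
// Latencies in seconds, copied from the table above; null marks the FAILED run.
const results: Array<{ query: string; gemma: number; kimi: number | null }> = [
  { query: 'RAG failure modes', gemma: 12.9, kimi: 12.4 },
  { query: 'Embedding model selection', gemma: 9.9, kimi: 90.7 },
  { query: 'BM25 vs vector search', gemma: 19.6, kimi: 7.3 },
  { query: 'Reducing hallucination', gemma: 19.0, kimi: 6.9 },
  { query: 'Chunking strategies', gemma: 9.3, kimi: 9.0 },
  { query: 'Edge AI model selection', gemma: 11.8, kimi: 8.3 },
  { query: 'MoE efficiency at scale', gemma: 16.5, kimi: 8.0 },
  { query: 'Cloudflare Workers AI', gemma: 22.8, kimi: null },
  { query: 'KB maintenance', gemma: 10.1, kimi: 5.5 },
];

// Median of a list of numbers (average of the two middle values when even).
function median(values: number[]): number {
  const sorted = values.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

const gemma = results.map((r) => r.gemma);
// Only completed Kimi runs count toward its latency statistics.
const kimi = results.map((r) => r.kimi).filter((v): v is number => v !== null);

const kimiWins = results.filter((r) => r.kimi !== null && r.kimi < r.gemma).length;
const kimiFailures = results.length - kimi.length;

console.log({
  kimiWins,
  kimiFailures,
  gemmaMedian: median(gemma),
  kimiMedian: median(kimi),
  gemmaWorst: Math.max(...gemma),
  kimiWorst: Math.max(...kimi),
});
```

On this data Kimi wins 7 of 9 queries with a median of about 8.15s against 12.9s for Gemma, but its worst completed run is 90.7s against 22.8s, and it has one failure where Gemma has none. That spread is the consistency argument in numbers.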
What's live now
GET /stats → models.reflection: "@cf/google/gemma-4-26b-a4b-it"
The live dashboard is at vectorize-mcp-worker.fpl-test.workers.dev/dashboard — open it and the active reflection model is listed in the stats panel. Gemma 4 MoE, running in production.
1,525 reflections generated since the migration. The cron added more this morning. Verify live: /public-stats — no API key needed.
A second reflection, this one on AI and management:
"AI is increasing individual contributor leverage and is frequently marketed as a labor replacement, driving companies to prioritize cost-cutting and individual productivity. This trend often places pressure on managers to perform individual contributor roles, potentially devaluing the necessity of human oversight and organizational management. Relying on these technologies also introduces risks involving accountability for failures, misunderstandings of AI's true capabilities, and the loss of human-centric benefits like upskilling."
This came from unrelated tweets saved across different weeks, connected by the engine into a single coherent insight, stored back into the index so it surfaces when I search anything adjacent to AI, management, or developer tooling. That's the reflection layer working as intended.
Full pipeline:
bookmark-cli → vectorize-mcp-worker
embed (BGE Small) →
retrieve (Vectorize + BM25) →
rerank (BGE cross-encoder) →
reflect (Gemma 4 MoE) ← NEW
Everything stays inside one Cloudflare Worker. No external hop for the reasoning layer.
Gemma 4 MoE isn't here because of a challenge. It's here because Cloudflare deprecated the model it replaced and this was the right call. It will still be running after June 4.
The verdict
Gemma 4 MoE is not faster than Kimi K2.5 was on average. If raw speed were the only metric, and if Kimi K2.5 were staying around, I'd have a harder decision.
But it isn't staying around.
Cloudflare has since released Kimi K2.6 — 1T parameters, 262k context window, reasoning, vision, tool calling. It's impressive. It's also $0.95/M input tokens and $4.00/M output tokens. The reflection layer synthesises on every ingest. At that pricing, running it across a 100k-document backlog would end the $5/month cost story in a single batch. Gemma 4 MoE, as a native Workers AI model, stays within the free tier. The upgrade path wasn't really an upgrade for this use case.
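A back-of-envelope version of that cost claim. The prices are the ones quoted above; the per-reflection token counts are assumptions chosen for illustration, not measurements from the pipeline, and the sketch assumes reasoning tokens are billed as output.

```typescript
// Prices quoted in the article for Kimi K2.6, in USD per million tokens.
const INPUT_USD_PER_M = 0.95;
const OUTPUT_USD_PER_M = 4.0;

// ASSUMPTIONS (not measured): average tokens per reflection call.
const ASSUMED_INPUT_TOKENS = 1500;  // prompt plus 5 related chunks
const ASSUMED_OUTPUT_TOKENS = 1000; // reasoning chain plus 3-sentence answer

// Total cost of running one reflection per document across a backlog.
function backlogCostUsd(docs: number, inputTokens: number, outputTokens: number): number {
  const inputCost = ((docs * inputTokens) / 1e6) * INPUT_USD_PER_M;
  const outputCost = ((docs * outputTokens) / 1e6) * OUTPUT_USD_PER_M;
  return inputCost + outputCost;
}

const cost = backlogCostUsd(100000, ASSUMED_INPUT_TOKENS, ASSUMED_OUTPUT_TOKENS);
console.log(`~$${cost.toFixed(2)} for a 100k-document backlog`);
```

Under those assumptions the backlog comes to roughly $540, on the order of a hundred times the $5 monthly budget. Halving both token estimates still leaves it far above that ceiling.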
And for a reflection layer specifically — where the task is multi-document synthesis, where you need reasoning depth more than raw throughput, and where you want the entire pipeline to stay edge-native — Gemma 4 MoE is the right model. The MoE architecture is why. 4B active parameters gives you the inference speed you need at the edge. 26B total parameters gives you the knowledge depth the task requires.
At $4/M output tokens, the upgrade wasn't an upgrade. Gemma 4 MoE still is. The daily cron doesn't know it's in a challenge. It ran this morning.
What's next for Gemma 4 MoE in this pipeline
The reflection layer is one use. The code already has a second.
Every 3 new ingests, the pipeline runs a consolidation pass — Gemma 4 MoE reads the 10 most recent reflections and merges them into a single doc_type='summary': dominant theme, two or three specific non-obvious facts, and the most persistent open question across all the reflections. The summary lands in Vectorize and surfaces in search exactly like a reflection does. Reflections capture individual connections. Summaries capture patterns across connections. Both are Gemma 4 MoE, both are edge-native, both add to the index without touching the $5/month cost ceiling.
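The consolidation cadence can be sketched as two small helpers. The article does not say how the worker counts ingests or stores reflections, so the `Reflection` shape, the running counter, and both function names are assumptions.

```typescript
// Hypothetical record shape for a stored reflection.
interface Reflection {
  id: string;
  text: string;
  createdAt: number; // epoch milliseconds
}

const CONSOLIDATE_EVERY = 3;  // "every 3 new ingests"
const REFLECTION_WINDOW = 10; // "the 10 most recent reflections"

// True on the 3rd, 6th, 9th... ingest. Assumes a running ingest counter exists.
function shouldConsolidate(ingestCount: number): boolean {
  return ingestCount > 0 && ingestCount % CONSOLIDATE_EVERY === 0;
}

// Newest-first slice that would be handed to the summary prompt.
function latestReflections(all: Reflection[], limit: number = REFLECTION_WINDOW): Reflection[] {
  return all.slice().sort((a, b) => b.createdAt - a.createdAt).slice(0, limit);
}
```

Copying the array before sorting leaves the caller's list untouched, which matters if the same list is reused elsewhere in the ingest path.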
That's the current state. Four extensions are already scoped:
Query-time answer synthesis. Right now the pipeline retrieves chunks and returns them. The next layer uses Gemma 4 MoE to read the top 5 retrieved chunks and produce a direct answer — not a list of results, an actual response grounded in what you saved. The retrieval already works. The synthesis step is the same task the reflection layer already does, with a different prompt.
Routing upgrade. The V4 intelligent router currently runs on Llama 3.2 3B — fast classification into six query routes (SQL, BM25, vector, graph, etc.). Moving that to Gemma 4 MoE's thinking mode means the router can reason about ambiguous queries instead of classifying them. A question like "what did I save about RAG that I disagreed with?" hits multiple routes simultaneously. A 3B classifier guesses. A 26B MoE reasons.
Gap detection. The reflection engine already identifies gaps — questions the combined knowledge doesn't answer. A weekly pass that reads all gap annotations across the index and surfaces the three most persistent unanswered questions would make the tool actively useful for research, not just reactive to search queries. One scheduled cron, one Gemma 4 MoE call per week, zero additional cost in the free tier.
Personal preference reranker. The index contains 100k+ documents across AI, politics, sports, and everything else saved since 2016. Every bookmark and like is a signal: this person found this worth keeping. The longer-term path is fine-tuning a small cross-encoder on that signal — not domain expertise, but preference prediction. A model trained on "did this person save this or not" beats every general reranker at one narrow task: knowing what you care about. It slots into the existing pipeline as a final reranking layer after BGE, before the reflection pass. The training data is a decade of curation. The narrow task is yours alone.
The reflection layer was the migration. These four are the reason it stays.
Top comments (18)
The max_tokens gotcha hits the same shape as a num_ctx default I tripped over in my own Ollama pipeline: /api/generate silently caps context at 2048 tokens unless you set num_ctx explicitly, regardless of the model's actual window. Different parameter, identical failure mode: the model "produces a worse output" when in reality it's been given a fraction of the input.
Your consistency-over-speed observation also tracks at much smaller scale. I ran Gemma 4 e2b on-device against a truncated meeting transcript and it pushed back: it flagged the input as incomplete instead of confidently summarizing the trailing fragment. Qwen 2.5 3B in the same condition just summarized the trailing Q&A and called it the meeting's headlines. Sounds like a Gemma 4 family trait, not just MoE.
Curious whether you ever saw the MoE flag a reflection query as malformed when retrieved chunks were too thin or contradictory — or was the consistency you measured purely latency/completion?
The num_ctx parallel is the right frame — different parameter, same failure signature: the model produces something coherent, the degradation stays invisible until you've seen full output. That's exactly why max_tokens: 512 was hard to catch. No error. Just a shorter string.
On the reflection query question: the MoE never refused or flagged anything explicitly. Thin chunks produced thin reflections — shorter, more generic, still grammatically valid. The guard is upstream: the engine drops chunks below 0.45 cosine similarity and bails early if too few qualify. Model behavior stays consistent; the filtering is retrieval-layer, not model-layer.
Your e2b observation is the interesting one. Flagging incomplete input rather than summarizing confidently suggests either a training difference or the on-device context limit surfacing as something that looks like refusal. Was it consistent across sessions, or did it vary?
Honest answer: 6-8 runs, all hedging some variant of "give me the relevant transcript" — which on rereading sounds like the LLM equivalent of "this isn't what you sent me to a meeting for, please try again." Same vibe, different wording every time. Never ran formal N-trial variance because I was building, not publishing a paper, so 1 in 20 might confidently summarize the fragment and I'd never know.
Your (a) and (b) — my data doesn't distinguish them. Clean ablation: feed e2b a 1500-token self-contained paragraph at num_ctx=2048. still hedges → refusing short inputs. Summarizes happily → recognizing incompleteness. ~20 minutes of work. I haven't done it. Recording the idea here so my future self bumps into it during a coffee break.
One weak nudge toward (a): the refusal language was specifically "a mix of unrelated topics", a content critique, not "this is too short." A length heuristic wouldn't talk about topical coherence. But arguing from one output is exactly the variance question you just asked, so I'm calibrated about my own uncalibrated claim here.
The "mix of unrelated topics" detail is the signal. A length heuristic would produce "please provide more context", a process critique. "Mix of unrelated topics" is a content critique. The model evaluated what was there and described it. That's semantic evaluation, not a truncation fallback, which pushes toward (a) more than you're giving yourself credit for.
One confound in the ablation: a self-contained paragraph strips out what makes a meeting transcript behave like a meeting transcript — speaker labels, topic jumps, mid-utterance cuts. If e2b hedges on clean prose too, you've learned about short-text behavior, not whether the refusal is tracking length or incoherence artifacts specifically. Truncated paragraph from the same session would be the cleaner control.
Fair point on the process-vs-content distinction: "please provide more context" would have been the length-heuristic tell, and "mix of unrelated topics" is a content claim. You're right that I was giving the model less credit than the output earns.
Your truncated-paragraph-from-the-same-session control is also clearly the cleaner experiment. The clean-prose version was conflating length AND prosodic style; yours isolates length while holding transcript incoherence constant.
Refined matrix I'd actually run now:
1. full session (~5K tok, low cohesion) -> ground truth
2. paragraph from same session, untouched (~600 tok) -> length-only
3. paragraph from same session, cut mid-sentence -> length + truncation
4. unrelated clean prose paragraph (~600 tok) -> prose-style control
If e2b refuses on row 2 but accepts on row 4, the refusal is tracking something about the transcript distribution itself — discontinuity density, speaker-label noise — not length or training. More interesting than what I half-claimed in the article either way.
Will run this on the e2b box this week and post deltas back here.
Row 3 is the one I'd watch most closely. Rows 2 and 4 test transcript distribution versus prose style, which is useful. But row 3 isolates syntactic incompleteness: cut mid-sentence is a different kind of broken than a mid-session paragraph, which is semantically incoherent but syntactically whole. If the model responds the same way to rows 2 and 3, the signal is probably "this input is damaged" as a class. If it responds differently, it's distinguishing between syntactic and semantic damage, which would be a more specific learned behavior than the heuristic framing suggests.
"Discontinuity density" is the right term for what row 2 actually contains. Meetings have high discontinuity structurally — topic jumps, speaker switches, dangling references — so a mid-session extract feels incomplete even when every sentence is grammatically complete....
Row 3 as the load-bearing test, agreed — the syntactic-vs-semantic damage split is the cleaner cut than what I had. And "discontinuity density" being the right term for what row 2 already contains clarifies what we're actually probing: a paragraph from mid-session isn't just shorter than the full transcript, it's shorter AND denser in dangling references per token than either the full session or clean prose.
One layer I'd want to add on top of the 2-vs-3 response difference: the lexical shape of the hedge itself. If row 2 produces "this seems to be a mid-stream excerpt, do you have the full meeting" and row 3 produces "the text appears to be cut off mid-sentence, please provide the rest" — that's two distinct learned diagnostics, not one. Same wording across both collapses them into your "damaged input as a class" reading.
Discontinuity-density-as-a-feature also predicts a 5th row I hadn't considered: clean prose paragraph cut mid-sentence. If e2b flags row 3 but accepts that, it's tracking truncation signature, not underlying coherence; if it flags both, truncation signature is doing the work regardless of distribution. Either way it falsifies one of the two remaining hypotheses for negligible runtime cost.
Will run the 4-row matrix first and only add row 5 if 2-vs-3 is ambiguous. Deltas back either way.
Ran it. 15 runs, temperature=0.0, Gemma 4 E2B on a 16 GB M-series Mac.
Rows 2, 3, 4 + a row 6 I added (tail-of-session with no sub-section opening, structurally close to what num_ctx=2048 truncation produces) all came back clean at num_ctx=32768: no hedge of any kind, 3/3 each. Row 4 cheerfully summarized the Antikythera mechanism using the meeting-summary template. So H1, H2, H3 refuted, plus the H4 I added to salvage my framing got refuted too.
Then I re-ran row 1 at num_ctx=2048 to control for configuration. Three identical runs of a multi-pass output: templated SUMMARY: / ACTION ITEMS: block (mostly hallucinated), then "Note: The provided transcript does not contain the information listed in the summary or action items above", then a more conservative retry hedged with "implied" / "inferred" / "not fully detailed". Same structure every time.
Net: the hedge appears configuration-deterministic on num_ctx=2048 specifically, not the general semantic-input-quality signal I claimed. Your "mix of unrelated topics is a content claim, not length" point still holds — just only inside that specific configuration.
Full write-up + the multi-pass trace: dev.to/thehwang/gemma-4-wrote-thre.... Harness in benchmarks/calibration-ablation/ of the Scripta repo — README documents H1–H4 and the diagnostic matrix.
Thanks for pushing on this. Got me to a sharper claim than I'd have arrived at alone.
The multi-pass structure is the finding worth isolating. Rows at num_ctx=2048 didn't produce a refusal — they produced a summary, then a self-correction noting the summary wasn't grounded in the input, then a hedged retry. That's not "refuse incomplete input." That's complete, audit, flag discrepancy — three moves inside one inference pass. Different behavior from what we were diagnosing, and closer to what the original consistency observation in the article was actually tracking.
Configuration-determinism is the clean result. The content-claim point holding only inside num_ctx=2048 is a sharper boundary condition than the original claim had. Write-up is bookmarked...
"Complete, audit, flag discrepancy" is a better description than I used. Opens a question I hadn't pinned down: is the audit happening as a separate post-generation move — the model evaluating its own output against the input — or is the same underlying uncertainty just surfacing in later tokens, after the early "SUMMARY:" template committed the prefix? Different mechanisms, similar surface trace.
Cleanest way to disambiguate: at num_ctx=32768, prompt the model with "summarize this, then critique whether your summary is grounded in the input." If the audit step is gated by context-budget pressure, it won't fire under that prompt either. If it's normally just suppressed by the requested output template, it'll surface immediately. Worth a second ablation; will write it up if I run it.
The committed-prefix explanation fits autoregressive generation better: "SUMMARY:" constrains what early tokens can express, and the uncertainty surfaces once the template obligation clears, not as a deliberate second pass...
Conceded — committed-prefix is the more parsimonious story, and "audit" was reaching for a more impressive mechanism than the data needs. The uncertainty was always there; "SUMMARY:" just gated where it could surface.
The disambiguation gets cheaper under your framing. At num_ctx=2048, prompt with "respond with ONLY the SUMMARY: block, then STOP", or hard-cap num_predict right after ACTION ITEMS. Committed-prefix predicts the disclaimer vanishes entirely (no room to surface). An actual audit would fight the cap.
What it leaves open: what specifically generates the underlying uncertainty at 2048 but not 32K. That's the KV-cache-instrumentation question vericum flagged, probably the actually productive next direction.
The cap test is clean but asymmetric - absence confirms nothing. What separates the two mechanisms more sharply is disclaimer frequency across a token sweep, not just presence or absence at one ceiling. Committed-prefix predicts flat randomness: the disclaimer appears or doesn't based on sampling variance, uncorrelated with budget. An audit predicts rising frequency as tokens increase — more room, more surface for uncertainty to emerge. Five or six num_predict values would separate them faster than KV instrumentation. The wording is a faster shortcut still: same phrase at 32K and 2048 suggests gating; different phrasing suggests different generation paths entirely...
The math on this is brutal, and it perfectly highlights the hidden traps of relying on third-party managed AI primitives.
Your breakdown of why the recommended Kimi K2.6 upgrade completely blows up a low-cost, high-volume ingest architecture, forcing $4.00/M output tokens onto a reflection layer that processes 100k+ documents, is a massive reality check. Switching to Gemma 4 MoE (@cf/google/gemma-4-26b-a4b-it) to keep the pipeline entirely edge-native and within the free tier is incredibly clever. The warning about how its constraint-analysis behavior literally regurgitates rules as bullet points if your system prompt is too verbose is an invaluable catch for anyone else facing this exact 22-day deprecation clock. 👍
The system prompt behavior is the one most people will only hit after the fact - verbose prompts feel safe until the model starts treating them as rules to enumerate...
Keeping system prompts lean is quickly becoming a core senior dev skill. 👍
Daniel, this is an excellent breakdown of a major operational risk in cloud environments.
Here is why I find it so good:
The sudden deprecation of production models and the resulting cost spikes highlight exactly why vendor lock-in and cost predictability are critical architectural concerns. Your pivot to bypass the token cost trap is a pragmatic solution, which I like, to an issue many enterprise teams face when third-party platforms force their hand.
Strong engineering decision.
It is really well done! Made me think very hard about why I didn't write this article and you did :)
The last line got me. Write it. The Hermes challenge is still open, and so is the GitHub Finish-Up-A-Thon ($3k prize pool, closes June 7). That frustration is exactly the kind of thing that gets reads.