DEV Community

Daniel Nwaneri


Cloudflare Deprecated My Production Model. The Recommended Upgrade Costs $4/M Tokens. Gemma 4 MoE Doesn't.


This is a submission for the Gemma 4 Challenge: Build with Gemma 4

What I Built

On May 8, Cloudflare posted a deprecation notice.

@cf/moonshotai/kimi-k2.5, the model synthesising knowledge across 45,000 of my saved tweets, was going away on May 30.

I had a live production system, a daily cron, and 100,000+ indexed documents depending on that model. I had 22 days.

Cloudflare's recommended replacement: @cf/google/gemma-4-26b-a4b-it.

So I migrated and benchmarked every step. Here's what I found, what broke, and why Gemma 4 MoE was the right call even after a better Kimi arrived.

bookmark-cli is a personal knowledge engine I built after getting frustrated with X's native search. It syncs my bookmarks and likes into local SQLite, then pushes everything into a Cloudflare Worker for semantic retrieval.

The numbers:

  • 45,053 tweets (11,835 bookmarks + 33,218 likes)
  • 7,155 photo tweets enriched by Llama 4 Scout vision descriptions
  • 100,302 total documents in the vector index
  • Daily cron syncing new content automatically
  • $5/month total running cost

The architecture: bookmark-cli calls vectorize-mcp-worker, which runs hybrid BM25 + vector search, cross-encoder reranking, and a knowledge reflection layer that synthesises connections across documents.

One question worth answering upfront: if the data is from 2023, what good is it?

This isn't a news feed — it's a thinking tool. When you liked a tweet about RAG failure modes two years ago, you were signalling "this matters to me." The reflection engine connects that to four other things you saved that week across different topics and surfaces the thread you didn't consciously notice. The index only contains what you chose to save. No engagement algorithm, no ads, no recency bias — just your own curation, made searchable and cross-referenced. Google searches the internet. This searches your mind.

A reflection the engine generated from tweets I saved about AI and work — none of which said this:

"Non-technical users are increasingly using AI agents to 'vibe-code' large amounts of software without manual code review or verification. This reliance on generated outputs often involves a level of blind trust that bypasses the rigorous research and scrutiny essential to traditional programming. Although this method can appear highly productive, the lack of technical expertise makes debugging these systems exceptionally difficult and prone to subtle, painful failures."

That's three fragments from different weeks, connected by the model into one coherent insight. The technical details are below.


Demo

Live dashboard: vectorize-mcp-worker.fpl-test.workers.dev/dashboard


Code


How I Used Gemma 4

Why Gemma 4 MoE specifically

Three Gemma 4 variants exist on Workers AI. I needed to pick one.

| Model | Parameters | Best for |
| --- | --- | --- |
| gemma-4-e4b-it | 4B total (dense) | Local / memory-constrained |
| gemma-4-27b-it | 27B dense | Max quality, more compute |
| gemma-4-26b-a4b-it | 26B total, 4B active (MoE) | Edge inference, reasoning depth |

The reflection layer does multi-document synthesis — it reads 5 related chunks and produces a structured 3-sentence insight. That's not a summarisation task, it's a reasoning task. The 4B dense model would have been too shallow. The 27B dense would have been too slow at the edge.

4B active parameters per forward pass. 26B total. At the edge, you need the first number. For multi-document synthesis, you need the second. The MoE architecture is the only way to have both.

The entire pipeline — embed, retrieve, rerank, reflect — runs inside one Cloudflare Worker. Gemma 4 MoE is a native Workers AI binding. No external API call. No data leaving the edge.

The migration

The codebase already had a REFLECTION_MODEL env var. The model registry needed one addition:

// Registry of reflection models, keyed by the value of REFLECTION_MODEL.
export const REFLECTION_MODELS = {
  'gemma-4': {
    id: '@cf/google/gemma-4-26b-a4b-it' as const,
    label: 'Gemma 4 26B MoE (4B active)',
    note: 'Recommended. 4B active params via MoE — edge-native, no external hop.',
  },
  'kimi-k2.5': {
    id: '@cf/moonshotai/kimi-k2.5' as const,
    label: 'Kimi K2.5',
    note: 'Deprecated May 30 2026.',
  },
};

Then:

wrangler secret put REFLECTION_MODEL
# enter: gemma-4

wrangler deploy

That was the migration. The reflection engine reads env.REFLECTION_MODEL dynamically. Nothing else changed.
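Here is a minimal sketch of what a dynamic lookup like that can look like. The `Env` shape, the `resolveReflectionModel` helper, and the fallback to `gemma-4` when the variable is unset or unknown are all assumptions for illustration, not the worker's actual code.

```typescript
// Trimmed-down copy of the registry: only the IDs matter for resolution.
const REFLECTION_MODELS = {
  "gemma-4": { id: "@cf/google/gemma-4-26b-a4b-it" },
  "kimi-k2.5": { id: "@cf/moonshotai/kimi-k2.5" },
} as const;

type ReflectionModelKey = keyof typeof REFLECTION_MODELS;

// Assumed shape of the Worker environment; only this one variable is used here.
interface Env {
  REFLECTION_MODEL?: string;
}

// Assumption: fall back to Gemma 4 when the variable is missing or unrecognised.
const DEFAULT_MODEL: ReflectionModelKey = "gemma-4";

// Type guard: is this string one of the registry keys?
function isModelKey(value: string): value is ReflectionModelKey {
  return Object.prototype.hasOwnProperty.call(REFLECTION_MODELS, value);
}

// Resolve the Workers AI model ID from the environment at request time.
function resolveReflectionModel(env: Env): string {
  const key = (env.REFLECTION_MODEL ?? "").trim();
  return REFLECTION_MODELS[isModelKey(key) ? key : DEFAULT_MODEL].id;
}
```

Resolving at request time rather than at module load is what lets a single `wrangler secret put` switch models without touching code.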

Three gotchas worth knowing

1. max_tokens. Gemma 4 is a thinking model. It writes a full reasoning chain before producing output. With max_tokens: 180 set for the old model, Gemma 4 was spending all its tokens on internal reasoning and returning empty content. Bumping to max_tokens: 2048 fixed it.

2. Response extraction. For thinking models, use choices[0].message.content — not .reasoning and not .response. The reasoning field is the internal chain of thought, not the answer.

3. Prompt format. Verbose rule-lists trigger Gemma 4's constraint-analysis behaviour — it restates your rules as bullet points instead of following them. Keep prompts simple and end with a direct action cue:

Read the new source and related sources below, then write 3 plain prose sentences
that synthesise them into a knowledge base entry. No bullets. No analysis. No preamble.
Just 3 sentences.

New: "..."
Related: ...

Write the 3-sentence synthesis now:
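The first two gotchas can be sketched in a few lines. This assumes the response follows the chat-completions shape named above (`choices[0].message.content` with a separate `reasoning` field). The `ChatResponse` interface, `buildReflectionRequest`, and `extractAnswer` are hypothetical names, and the request shape (`messages` plus `max_tokens`) is my assumption about what gets passed to the model.

```typescript
// Assumed response shape, based only on the fields named in the gotchas above.
interface ChatResponse {
  choices?: Array<{
    message?: { content?: string | null; reasoning?: string | null };
  }>;
}

// Large enough for the reasoning chain plus the final 3 sentences.
const REFLECTION_MAX_TOKENS = 2048;

// Build the model input; the exact request shape is an assumption.
function buildReflectionRequest(prompt: string) {
  return {
    messages: [{ role: "user" as const, content: prompt }],
    max_tokens: REFLECTION_MAX_TOKENS,
  };
}

// Return the final answer, or null when the model produced reasoning only
// (the symptom of a max_tokens budget that is too small).
function extractAnswer(res: ChatResponse): string | null {
  const content = res.choices?.[0]?.message?.content;
  if (typeof content !== "string") return null;
  const trimmed = content.trim();
  return trimmed.length > 0 ? trimmed : null;
}
```

Returning `null` for empty content, instead of storing an empty string, makes the token-budget failure visible at the call site.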

The benchmark

I built a /benchmark endpoint that runs both models in parallel against the same query, logs latency and response to D1, and returns side-by-side results.

POST /benchmark
{ "query": "What are the common failure modes of RAG systems?" }
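A rough sketch of the parallel-run part of an endpoint like this, with the D1 logging left out. `ModelCall`, `timed`, and `runBenchmark` are illustrative names. In the real worker each `ModelCall` would wrap a Workers AI call, which is assumed here rather than shown.

```typescript
// A model is anything that turns a query into text; the real calls are assumed.
type ModelCall = (query: string) => Promise<string>;

interface BenchResult {
  model: string;
  ok: boolean;
  latencyMs: number;
  output: string | null;
  error: string | null;
}

// Time one model call and capture a failure instead of throwing,
// so one model failing does not hide the other model's result.
async function timed(model: string, call: ModelCall, query: string): Promise<BenchResult> {
  const start = Date.now();
  try {
    const output = await call(query);
    return { model, ok: true, latencyMs: Date.now() - start, output, error: null };
  } catch (err) {
    const error = err instanceof Error ? err.message : String(err);
    return { model, ok: false, latencyMs: Date.now() - start, output: null, error };
  }
}

// Run every model against the same query concurrently.
async function runBenchmark(
  query: string,
  models: Record<string, ModelCall>,
): Promise<BenchResult[]> {
  return Promise.all(
    Object.entries(models).map(([name, call]) => timed(name, call, query)),
  );
}
```

Because `timed` never rejects, a failed model shows up as a row with `ok: false` rather than aborting the whole comparison, which is how a FAILED entry can sit next to a valid latency in the results.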

Results from D1 (9 real queries):

| Query | Gemma 4 MoE | Kimi K2.5 |
| --- | --- | --- |
| RAG failure modes | 12.9s | 12.4s |
| Embedding model selection | 9.9s | 90.7s ⚠️ |
| BM25 vs vector search | 19.6s | 7.3s |
| Reducing hallucination | 19.0s | 6.9s |
| Chunking strategies | 9.3s | 9.0s |
| Edge AI model selection | 11.8s | 8.3s |
| MoE efficiency at scale | 16.5s | 8.0s |
| Cloudflare Workers AI | 22.8s | FAILED |
| KB maintenance | 10.1s | 5.5s |

Kimi K2.5 was faster on 7 of 9 queries. But within a single benchmark run it took 90.7 seconds on one query and failed outright on another. A model that's faster on most queries but can stall or fail like that isn't a production model.

Gemma 4 MoE was consistent. Every query returned, every response was coherent, and latency stayed between 9.3s and 22.8s.
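To put numbers on "consistent", here is a small sketch that summarises the latencies from the table. The `summarise` helper is illustrative and is not part of the /benchmark endpoint.

```typescript
// Latencies in seconds, copied from the benchmark table; null marks a failed query.
const gemma: Array<number | null> = [12.9, 9.9, 19.6, 19.0, 9.3, 11.8, 16.5, 22.8, 10.1];
const kimi: Array<number | null> = [12.4, 90.7, 7.3, 6.9, 9.0, 8.3, 8.0, null, 5.5];

interface LatencySummary {
  completed: number;
  failed: number;
  median: number;
  max: number;
}

// Summarise completed queries only; failures are counted separately.
function summarise(samples: Array<number | null>): LatencySummary {
  const ok = samples.filter((s): s is number => s !== null).sort((a, b) => a - b);
  if (ok.length === 0) throw new Error("no completed queries to summarise");
  const mid = Math.floor(ok.length / 2);
  const median = ok.length % 2 === 1 ? ok[mid] : (ok[mid - 1] + ok[mid]) / 2;
  return {
    completed: ok.length,
    failed: samples.length - ok.length,
    median,
    max: ok[ok.length - 1],
  };
}

console.log("Gemma 4 MoE", summarise(gemma));
console.log("Kimi K2.5", summarise(kimi));
```

On these nine queries Gemma 4 MoE completes all 9 with a median of 12.9s and a worst case of 22.8s. Kimi K2.5 completes 8 with a median of about 8.15s and a worst case of 90.7s. The median favours Kimi; the tail and the failure are the problem.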

Beyond the latency numbers, the Kimi K2.5 reflections in the index all started with "Here are the 3 sentences:" — the model was leaking the instruction prefix into every stored reflection. Gemma 4 produces clean prose output with the right prompt.

What's live now

GET /stats → models.reflection: "@cf/google/gemma-4-26b-a4b-it"

The live dashboard is at vectorize-mcp-worker.fpl-test.workers.dev/dashboard — open it and the active reflection model is listed in the stats panel. Gemma 4 MoE, running in production.

1,525 reflections generated since the migration. The cron added more this morning. Verify live: /public-stats — no API key needed.

A second reflection, this one on AI and management:

"AI is increasing individual contributor leverage and is frequently marketed as a labor replacement, driving companies to prioritize cost-cutting and individual productivity. This trend often places pressure on managers to perform individual contributor roles, potentially devaluing the necessity of human oversight and organizational management. Relying on these technologies also introduces risks involving accountability for failures, misunderstandings of AI's true capabilities, and the loss of human-centric benefits like upskilling."

This came from unrelated tweets saved across different weeks, connected by the engine into a single coherent insight, stored back into the index so it surfaces when I search anything adjacent to AI, management, or developer tooling. That's the reflection layer working as intended.

Full pipeline:

bookmark-cli → vectorize-mcp-worker
  embed (BGE Small) →
  retrieve (Vectorize + BM25) →
  rerank (BGE cross-encoder) →
  reflect (Gemma 4 MoE) ← NEW

Everything stays inside one Cloudflare Worker. No external hop for the reasoning layer.

Gemma 4 MoE isn't here because of a challenge. It's here because Cloudflare deprecated the model it replaced and this was the right call. It will still be running after June 4.

The verdict

On most queries, Gemma 4 MoE is slower than Kimi K2.5 was. If raw speed were the only metric, and if Kimi K2.5 were staying around, I'd have a harder decision.

But it isn't staying around.

Cloudflare has since released Kimi K2.6 — 1T parameters, 262k context window, reasoning, vision, tool calling. It's impressive. It's also $0.95/M input tokens and $4.00/M output tokens. The reflection layer synthesises on every ingest. At that pricing, running it across a 100k-document backlog would end the $5/month cost story in a single batch. Gemma 4 MoE, as a native Workers AI model, stays within the free tier. The upgrade path wasn't really an upgrade for this use case.
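A back-of-the-envelope version of that cost claim. The per-million-token prices are the ones quoted above. The token counts per document are assumptions for illustration, not measurements from the pipeline, and `backlogCostUsd` is a hypothetical helper.

```typescript
interface Pricing {
  inputPerMTok: number;  // USD per million input tokens
  outputPerMTok: number; // USD per million output tokens
}

// Kimi K2.6 prices as quoted in the text.
const KIMI_K26: Pricing = { inputPerMTok: 0.95, outputPerMTok: 4.0 };

// Estimated USD cost of one synthesis call per document across a backlog.
function backlogCostUsd(
  docs: number,
  inputTokensPerDoc: number,
  outputTokensPerDoc: number,
  pricing: Pricing,
): number {
  const inputCost = ((docs * inputTokensPerDoc) / 1e6) * pricing.inputPerMTok;
  const outputCost = ((docs * outputTokensPerDoc) / 1e6) * pricing.outputPerMTok;
  return inputCost + outputCost;
}

// Assumptions: ~1,000 input tokens (5 chunks plus prompt) and ~500 output
// tokens (reasoning plus 3 sentences) per document, over 100,000 documents.
const estimate = backlogCostUsd(100000, 1000, 500, KIMI_K26);
console.log(`Estimated backlog cost: $${estimate.toFixed(2)}`);
```

Under those assumed token counts the backlog comes to roughly $295: about $95 of input and $200 of output. Different assumptions move the number, but it is hard to pick realistic ones that land anywhere near $5.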

And for a reflection layer specifically — where the task is multi-document synthesis, where you need reasoning depth more than raw throughput, and where you want the entire pipeline to stay edge-native — Gemma 4 MoE is the right model. The MoE architecture is why. 4B active parameters gives you the inference speed you need at the edge. 26B total parameters gives you the knowledge depth the task requires.

At $4/M output tokens, the upgrade wasn't an upgrade. Gemma 4 MoE still is. The daily cron doesn't know it's in a challenge. It ran this morning.


What's next for Gemma 4 MoE in this pipeline

The reflection layer is one use. The code already has a second.

Every 3 new ingests, the pipeline runs a consolidation pass — Gemma 4 MoE reads the 10 most recent reflections and merges them into a single doc_type='summary': dominant theme, two or three specific non-obvious facts, and the most persistent open question across all the reflections. The summary lands in Vectorize and surfaces in search exactly like a reflection does. Reflections capture individual connections. Summaries capture patterns across connections. Both are Gemma 4 MoE, both are edge-native, both add to the index without touching the $5/month cost ceiling.
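The trigger and selection logic for a pass like this is small enough to sketch. The `Reflection` type, the helper names, and the prompt wording are illustrative assumptions. Only the cadence (every 3 ingests), the batch size (10 reflections), and the three things the summary should contain come from the description above.

```typescript
// Hypothetical record shape for a stored reflection.
interface Reflection {
  id: string;
  text: string;
  createdAt: number; // epoch milliseconds
}

const CONSOLIDATE_EVERY = 3;     // ingests between consolidation passes
const REFLECTIONS_PER_PASS = 10; // most recent reflections merged per pass

// True when the ingest that was just counted should trigger a pass.
function shouldConsolidate(ingestCount: number): boolean {
  return ingestCount > 0 && ingestCount % CONSOLIDATE_EVERY === 0;
}

// Pick the most recent reflections, newest first, without mutating the input.
function selectRecent(reflections: Reflection[], limit = REFLECTIONS_PER_PASS): Reflection[] {
  return reflections.slice().sort((a, b) => b.createdAt - a.createdAt).slice(0, limit);
}

// Build a short prompt that ends with a direct action cue.
function buildSummaryPrompt(reflections: Reflection[]): string {
  const body = reflections.map((r, i) => `${i + 1}. ${r.text}`).join("\n");
  return [
    "Read the reflections below, then write one summary covering the dominant theme,",
    "two or three specific non-obvious facts, and the most persistent open question.",
    "",
    body,
    "",
    "Write the summary now:",
  ].join("\n");
}
```

The prompt keeps to the same short, action-cued format that worked for the reflection prompt, since a verbose rule-list would risk the same constraint-restating behaviour.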

That's the current state. Four extensions are already scoped:

Query-time answer synthesis. Right now the pipeline retrieves chunks and returns them. The next layer uses Gemma 4 MoE to read the top 5 retrieved chunks and produce a direct answer — not a list of results, an actual response grounded in what you saved. The retrieval already works. The synthesis step is the same task the reflection layer already does, with a different prompt.

Routing upgrade. The V4 intelligent router currently runs on Llama 3.2 3B — fast classification into six query routes (SQL, BM25, vector, graph, etc.). Moving that to Gemma 4 MoE's thinking mode means the router can reason about ambiguous queries instead of classifying them. A question like "what did I save about RAG that I disagreed with?" hits multiple routes simultaneously. A 3B classifier guesses. A 26B MoE reasons.

Gap detection. The reflection engine already identifies gaps — questions the combined knowledge doesn't answer. A weekly pass that reads all gap annotations across the index and surfaces the three most persistent unanswered questions would make the tool actively useful for research, not just reactive to search queries. One scheduled cron, one Gemma 4 MoE call per week, zero additional cost in the free tier.

Personal preference reranker. The index contains 100k+ documents across AI, politics, sports, and everything else saved since 2016. Every bookmark and like is a signal: this person found this worth keeping. The longer-term path is fine-tuning a small cross-encoder on that signal — not domain expertise, but preference prediction. A model trained on "did this person save this or not" beats every general reranker at one narrow task: knowing what you care about. It slots into the existing pipeline as a final reranking layer after BGE, before the reflection pass. The training data is a decade of curation. The narrow task is yours alone.
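Of these, the gap detection pass is the easiest to prototype. Here is a sketch of the "three most persistent questions" step, assuming gap annotations are available as plain strings and that persistence can be approximated by how often the same question recurs. Exact-match counting is a stand-in: the scoped version would have Gemma 4 MoE read the annotations, and `topGaps` is a hypothetical helper.

```typescript
interface GapCount {
  question: string;
  count: number;
}

// Count how often each gap question appears and return the most frequent ones.
// Matching is case-insensitive on trimmed text; ties are broken alphabetically.
function topGaps(gaps: string[], limit = 3): GapCount[] {
  const counts = new Map<string, number>();
  for (const raw of gaps) {
    const key = raw.trim().toLowerCase();
    if (key.length === 0) continue; // skip blank annotations
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return Array.from(counts.entries())
    .map(([question, count]) => ({ question, count }))
    .sort((a, b) => b.count - a.count || a.question.localeCompare(b.question))
    .slice(0, limit);
}
```

A frequency count like this could also serve as a cheap pre-filter, so the weekly model call only sees the candidates that recur.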

The reflection layer was the migration. These four are the reason it stays.
