Daniel Nwaneri

Posted on Jun 15

My Bookmark Engine Returned Chunks. I Added One Endpoint to Make It Answer.

#ai #rag #cloudflare #gemma

Search returns things you have to read. An answer engine reads them for you.

I built a search engine on top of 50k saved tweets. Ask it something and it returns the five most relevant chunks — found through hybrid retrieval (BM25 keyword search plus vector search) and reranked by a cross-encoder. A Gemma 4 MoE layer already runs in the background too, writing its own reflections on how saved documents connect to each other. You get the chunks back, ranked. Then you read them and synthesise.

That last step bothered me. The model already synthesises when generating reflections. The retrieval already works. The only missing piece was wiring them together at query time.

So I added POST /search?mode=answer.

What it does

Same retrieval pipeline. Top 5 chunks, reranked. Then instead of returning them raw, Gemma 4 MoE reads them and produces a direct answer grounded in what you saved.

const prompt =
  `Answer the question below using only the sources provided. ` +
  `If the sources don't contain the answer, say so directly.\n\n` +
  `Question: "${query}"\n\n` +
  `Sources:\n${context}\n\n` +
  `Write a direct answer in 2–4 sentences. No preamble. No bullets.\n` +
  `Answer:`;
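The `context` variable isn't shown here. A minimal sketch of how it might be assembled from the reranked chunks, assuming a hypothetical `Chunk` shape and numbered sources (both are illustrative, not the engine's actual code):

```typescript
// Hypothetical chunk shape; the real index's fields are not shown in the post.
interface Chunk {
  id: string;
  text: string;
  score: number;
}

// Number each source so an answer can be traced back to a specific chunk.
function buildContext(chunks: Chunk[], limit = 5): string {
  return chunks
    .slice(0, limit)
    .map((chunk, i) => `[${i + 1}] ${chunk.text.trim()}`)
    .join("\n\n");
}

// Same prompt text as above, wrapped in a function.
function buildPrompt(query: string, chunks: Chunk[]): string {
  const context = buildContext(chunks);
  return (
    `Answer the question below using only the sources provided. ` +
    `If the sources don't contain the answer, say so directly.\n\n` +
    `Question: "${query}"\n\n` +
    `Sources:\n${context}\n\n` +
    `Write a direct answer in 2–4 sentences. No preamble. No bullets.\n` +
    `Answer:`
  );
}
```

Numbering the sources is a design choice rather than a requirement: it makes it easier to match a sentence in the answer to the chunk it came from.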

With max_tokens: 512 the endpoint returned an empty answer. Gemma 4 is a thinking model: it spends the token budget on internal reasoning before producing visible output, so a small budget can run out before the answer starts. Raising max_tokens to 2048 fixed it. The reflection engine hit the same wall and got the same fix.
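One way to guard against the empty-answer failure is to treat an empty string as "budget exhausted" and retry with a larger budget. The sketch below assumes a hypothetical `Generate` function that wraps the model call; it is not the Workers AI API itself, and the fallback budgets are illustrative:

```typescript
// Hypothetical wrapper around the model call: prompt in, answer text out.
type Generate = (prompt: string, maxTokens: number) => Promise<string>;

// Try each token budget in order and return the first non-empty answer.
async function answerWithBudget(
  generate: Generate,
  prompt: string,
  budgets: number[] = [2048, 4096],
): Promise<{ answer: string; maxTokens: number } | null> {
  for (const maxTokens of budgets) {
    const answer = (await generate(prompt, maxTokens)).trim();
    if (answer.length > 0) {
      return { answer, maxTokens };
    }
  }
  // Every budget produced an empty answer.
  return null;
}
```

Returning the budget that worked makes it easy to log how often the larger fallback is needed.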

What came back

Three queries against the live index.

"What do people say about consistency and showing up every day?"

"Consistency wins. Some people say all the right things but never quite measure up to their words, while others do everything right without saying a word."

Two tweets, cleanly synthesised. Nothing hallucinated from outside the index.

"What do people say about money and wealth building?"

"Discussions include frustrations over the devaluation of the Naira, declining purchasing power, and debates regarding financial expertise and social media income. Some suggest only a certain regular amount is needed to sustain a lifestyle, as anything beyond that is superfluous. Regarding wealth building, it is suggested that one should provide value and be the supply in a supply and demand relationship."

This one surfaced a reflection document alongside raw tweets: an entry the engine had already generated from CBEX-related tweets and stored back into the index. The answer pulled from both layers: raw saved content and a previously generated insight. That's the system compounding on itself.

"What are the best ways to learn programming?"

"Search Google, find a video, watch while coding, repeat."

Thin. The retrieval matched surface-level tweets: "Did you learn programming all by yourself?", "Tips on learning how to code 🧵". Not substantive content. The model answered honestly from what it got. The answer is technically grounded. It's just not useful.

The honest read

It works well on topics with substantive saved content. It returns thin answers on surface-level matches, like the programming question above. The synthesis did its job. The index just doesn't have the depth yet.

The retrieval scores on most queries are low (0.006–0.013 range). That's down to the embedding model the index was built on: bge-small, 384 dimensions, the old default, built for speed over precision. Embeddings are how the engine turns text into numbers it can compare. More dimensions mean more room to capture shades of meaning. The index can't switch models without re-ingesting all 50k tweets, because vectors from two different models aren't comparable. When I eventually migrate to qwen3-0.6b (1024 dimensions), retrieval precision should improve first, and answer quality should follow from that.
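To make "numbers it can compare" concrete, here is a minimal cosine-similarity sketch. It is a generic illustration, not the engine's scoring code, and the scores it produces are not the retrieval scores quoted above. The dimension check shows why a 384-dimension index can't be queried with 1024-dimension vectors:

```typescript
// Cosine similarity between two embedding vectors of the same dimension.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    // Vectors from different embedding models cannot be compared.
    throw new Error(`dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction, so treat it as unrelated.
  if (normA === 0 || normB === 0) {
    return 0;
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Matching dimensions are necessary but not sufficient: two models that both emit 384-dimension vectors still place text in different spaces, so a model change means re-embedding everything.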

For now: the endpoint works. Strong on topics the index has depth on, honest about topics it doesn't.

What it isn't

It's not a black box. The sources come back with every answer: verify the answer, check the scores, read the original chunks. And it's not search over the open web. Every answer traces back to something you chose to save. The prompt gives the model nothing outside the index to answer from, and tells it to say so when the sources fall short. The grounding is structural, not just instructed.

What's next

Retrieval quality is the ceiling on answer quality right now. The next piece is gap detection: a weekly pass that surfaces the three most persistent unanswered questions in the index, showing where the index has depth and where it doesn't. This endpoint makes those gaps visible in real time, one query at a time. Gap detection will map them systematically, every week.
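Gap detection isn't built yet, so the following is only a sketch of one way it could work. It assumes a hypothetical `QueryLog` record per query and defines "persistent" as "unanswered in the most distinct weeks". Both the log shape and that definition are assumptions:

```typescript
// Hypothetical log entry: one row per query sent to the answer endpoint.
interface QueryLog {
  query: string;
  week: string; // any stable week label, e.g. "W24" (format is an assumption)
  answered: boolean; // false when the model said the sources had no answer
}

// Rank unanswered questions by how many distinct weeks they recur in.
function persistentGaps(
  logs: QueryLog[],
  top = 3,
): { query: string; weeks: number }[] {
  const weeksByQuery = new Map<string, Set<string>>();
  for (const log of logs) {
    if (log.answered) continue;
    const key = log.query.trim().toLowerCase();
    const weeks = weeksByQuery.get(key) ?? new Set<string>();
    weeks.add(log.week);
    weeksByQuery.set(key, weeks);
  }
  return Array.from(weeksByQuery, ([query, weeks]) => ({ query, weeks: weeks.size }))
    .sort((a, b) => b.weeks - a.weeks || a.query.localeCompare(b.query))
    .slice(0, top);
}
```

Counting distinct weeks rather than raw query volume keeps one afternoon of repeated retries from outranking a question that keeps coming back.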

The endpoint is live. Query it:

POST /search?mode=answer
{ "query": "your question here" }

Source chunks come back alongside the answer. The model used to synthesise it: @cf/google/gemma-4-26b-a4b-it. Same Worker, same $5/month.
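A client call might be put together like this. The `AnswerResponse` field names, the `Content-Type` header, and the `baseUrl` placeholder are assumptions, since the post doesn't show the response body or the deployed URL:

```typescript
// Hypothetical response shape: sources come back with the answer,
// but the actual field names are not shown in the post.
interface AnswerResponse {
  answer: string;
  sources: { text: string; score: number }[];
}

interface AnswerRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// baseUrl is a placeholder for wherever the Worker is deployed.
function buildAnswerRequest(baseUrl: string, query: string): AnswerRequest {
  return {
    url: `${baseUrl.replace(/\/+$/, "")}/search?mode=answer`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ query }),
    },
  };
}

// Keep only the sources at or above a score threshold before reading them.
function sourcesAbove(
  res: AnswerResponse,
  minScore: number,
): { text: string; score: number }[] {
  return res.sources.filter((source) => source.score >= minScore);
}
```

The returned `url` and `init` can be passed straight to `fetch(url, init)`, and the parsed JSON can then be checked against the sources before the answer is trusted.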

The index has 50k saved tweets going back to 2016. What you get back is bounded by that. Google searches the internet. This searches what you decided was worth keeping.

Top comments (2)

Alex Shev • Jun 15

This is the important step most RAG projects skip. Returning chunks is search; answering from those chunks is a product decision with a different failure mode.

The endpoint boundary is useful because it gives you a place to enforce citations, confidence, and "I do not know" behavior instead of hiding all of that inside the retrieval layer.

Daniel Nwaneri • Jun 15

Right now that boundary only enforces the "I don't know" case. Confidence scoring on what it does answer isn't built yet and that's probably the next thing to put there.