Kantemir Satibalov

Posted on Jun 30 • Edited on Jul 10

I stopped trusting generic LLMs for horticulture — so I built a grounded assistant on ~500 scientific articles

#ai #rag #llm #go

Last year I kept seeing the same pattern in agtech and “AI assistant” demos: a chatbot wrapped around a generic model, a handful of PDFs, and a disclaimer nobody reads.

I'm a developer, not an agronomist. But I'm working on two related projects: a grounded RAG platform (grounded-llm, private repo) and its first production-shaped domain pack, a horticulture assistant built on articles from the Russian journal Plodovodstvo i vinogradstvo Yuga Rossii (apple, pear, plum; roughly 500 source articles, not five blog posts).

I didn't want another “ChatGPT for gardeners.”
I wanted answers that behave like someone who actually read the literature — and admits when the literature doesn't cover the question.

That gap turned into months of engineering. I'm sharing the story in public; the full corpus and codebase stay private.

What broke first: “sounds right” ≠ “is right”
Early experiments failed in boring, repeatable ways.

1. Domain language doesn't match generic retrieval
Russian horticulture is full of synonyms and notation variants: rootstock labels, disease names, regional cultivars. A user writes марссониоз (Marssonina blotch); the literature may use Marssonina, abbreviations, or OCR-noisy spellings. Naive retrieval misses the passage, and the model fills the gap confidently.
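
One way to narrow that gap is to expand the query with known variants before it reaches the retriever. The sketch below is only an illustration of the idea: the `GLOSSARY` entries, `normalize`, and `expand_query` are hypothetical names, not code from the private repo.

```python
import re
import unicodedata

# Hypothetical glossary: canonical term -> variants seen in queries or in the corpus.
# The entries are illustrative, not taken from the real domain pack.
GLOSSARY = {
    "marssonina": ["марссониоз", "marssonina", "marssonina blotch"],
    "m9": ["m9", "м9", "m-9", "м-9", "m 9"],
}


def normalize(text: str) -> str:
    """Apply NFKC, lowercase, and collapse whitespace so variants compare equal."""
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"\s+", " ", text).strip()


# Reverse index: every normalized variant points to its canonical key.
VARIANT_TO_CANONICAL = {
    normalize(variant): canonical
    for canonical, variants in GLOSSARY.items()
    for variant in variants
}


def expand_query(query: str) -> list[str]:
    """Return the normalized query plus every glossary variant of terms found in it."""
    q = normalize(query)
    expansions = [q]
    for variant, canonical in VARIANT_TO_CANONICAL.items():
        # Require non-word characters on both sides so "m9" does not match inside "mm9x".
        if re.search(rf"(?<!\w){re.escape(variant)}(?!\w)", q):
            for other in GLOSSARY[canonical]:
                other = normalize(other)
                if other not in expansions:
                    expansions.append(other)
    return expansions
```

Calling `expand_query("Как лечить марссониоз?")` returns the original query followed by the Latin-script variants, so a keyword index has a chance to hit articles that only use the Latin name.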

2. Scientific text isn't FAQ-shaped
Articles contain experiment sections, tables, and “brief for the grower” blocks. One chunk size for everything → right article, wrong paragraph → fluent wrong answer.
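
A minimal sketch of section-aware chunking, assuming the articles have already been split into labeled sections. The section labels, the word budgets in `MAX_WORDS`, and the function names are assumptions made for illustration; the post does not describe the real chunker.

```python
from dataclasses import dataclass

# Hypothetical per-section word budgets; real values would come from retrieval evals.
MAX_WORDS = {"experiment": 180, "grower_brief": 80, "default": 120}


@dataclass
class Chunk:
    article_id: str
    section: str
    text: str


def chunk_section(article_id: str, section: str, text: str) -> list[Chunk]:
    """Split one section on paragraph boundaries, using that section's word budget.

    A single paragraph longer than the budget is kept whole as its own chunk.
    """
    budget = MAX_WORDS.get(section, MAX_WORDS["default"])
    chunks, current, count = [], [], 0
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        words = len(para.split())
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and count + words > budget:
            chunks.append(Chunk(article_id, section, "\n\n".join(current)))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append(Chunk(article_id, section, "\n\n".join(current)))
    return chunks


def chunk_table(article_id: str, rows: list[str]) -> Chunk:
    """Keep a table as a single chunk so header and values are never separated."""
    return Chunk(article_id, "table", "\n".join(rows))
```

The point of the per-section budget is that a short grower brief and a long experiment description end up in chunks of different sizes, and each chunk keeps its section label for later filtering.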

3. Generation is the wrong place to fix retrieval
If the right passage never reaches the prompt, no system prompt saves you. I separated concerns early:

- Python service → retrieval only (/rag/context)
- Go server → sessions, LLM calls, answer cleanup, guardrails

Not because microservices are fashionable, but because I needed to change and measure retrieval without redeploying the whole product.
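
To make the boundary concrete, here is a rough sketch of what a retrieval-only handler for `/rag/context` could look like. Only the path comes from the post; the JSON field names (`question`, `top_k`, `passages`), the toy `PASSAGES` index, and the word-overlap scoring are placeholders I made up for the example.

```python
import json

# Toy in-memory index standing in for the real retriever (hypothetical data).
PASSAGES = [
    {"chunk_id": "a12#3", "text": "rootstocks and spacing on terraced slopes"},
    {"chunk_id": "b07#1", "text": "pear training systems in the foothill zone"},
]


def retrieve(question: str, top_k: int) -> list[dict]:
    """Score passages by shared lowercase words; a placeholder for real retrieval."""
    q_words = set(question.lower().split())
    scored = []
    for passage in PASSAGES:
        overlap = len(q_words & set(passage["text"].lower().split()))
        if overlap:
            scored.append({**passage, "score": overlap})
    scored.sort(key=lambda p: p["score"], reverse=True)
    return scored[:top_k]


def handle_rag_context(body: bytes) -> tuple[int, bytes]:
    """Handle a POST /rag/context body: return passages only, never generated text."""
    try:
        payload = json.loads(body)
        question = str(payload["question"])
        top_k = int(payload.get("top_k", 5))
    except (ValueError, KeyError, TypeError, AttributeError):
        error = {"error": "expected a JSON object with a 'question' field"}
        return 400, json.dumps(error).encode()
    passages = retrieve(question, top_k)
    return 200, json.dumps({"passages": passages}, ensure_ascii=False).encode()
```

Because the response carries only chunk ids, text, and scores, the retrieval side can be swapped or re-tuned and measured on its own, while the caller decides what to do with the passages.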

What I built: two layers, one product
| Layer | What it is |
|-------|------------|
| Platform core (grounded-llm) | Auth, Postgres sessions, orchestration |
| Domain pack (horticulture) | Corpus, crop config, prompts, eval baselines |

There's also a non-agricultural sandbox (demo_hr) — HR policy docs, same pipeline — to show the platform isn't hard-coded to apple diseases.
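
As a sketch of how the split between core and pack might be expressed in code: the platform only sees a `DomainPack` value, and everything domain-specific lives inside it. The field names, paths, and prompts below are hypothetical; only the pack names and the crop list come from this post.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DomainPack:
    """Everything the platform core needs to know about one domain."""
    name: str
    corpus_dir: str                 # where the indexed source texts live
    system_prompt: str              # domain-specific instructions for the LLM
    eval_questions_path: str        # fixed question suite used as a baseline
    crops: tuple[str, ...] = ()     # crop config; empty for non-agricultural packs
    glossary: dict[str, list[str]] = field(default_factory=dict)


# Hypothetical registry; paths and prompts are placeholders.
PACKS = {
    "horticulture": DomainPack(
        name="horticulture",
        corpus_dir="packs/horticulture/corpus",
        system_prompt="Answer only from the provided journal excerpts.",
        eval_questions_path="packs/horticulture/eval.json",
        crops=("apple", "pear", "plum"),
    ),
    "demo_hr": DomainPack(
        name="demo_hr",
        corpus_dir="packs/demo_hr/corpus",
        system_prompt="Answer only from the provided HR policy excerpts.",
        eval_questions_path="packs/demo_hr/eval.json",
    ),
}


def load_pack(name: str) -> DomainPack:
    """Look up a pack by name; the core never branches on the domain itself."""
    try:
        return PACKS[name]
    except KeyError:
        raise ValueError(f"unknown domain pack: {name!r}") from None
```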

The horticulture pack indexes roughly 14,500 text chunks from the journal corpus. At this scale, “vector search only” and “we'll fix it in the prompt” stop being credible.
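
For readers who haven't seen hybrid retrieval before, one common way to combine a vector ranking with a keyword ranking is reciprocal rank fusion. This is a generic sketch of that technique, not a description of how my pipeline fuses results; the chunk ids in the usage note are made up.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of ids into one list.

    Each id earns 1 / (k + rank) from every list it appears in (rank starts at 1),
    and ids are returned by descending total score, ties broken by id.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

With a vector ranking `["c1", "c2", "c3"]` and a keyword ranking `["c4", "c2", "c1"]`, chunks found by both lists rise to the top, and `c4`, which only the keyword search found, still makes it into the fused list instead of being lost.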

I'm not open-sourcing the full article texts (rights + focus). I am sharing architecture lessons, failure modes, and metrics — and offering controlled demos when it's worth someone's time.

One question that kept me honest
Which rootstocks and training systems show up in slope / terrace planting research for our region?

Generic LLMs invent varieties and numbers.
A grounded system either retrieves relevant experimental context (rootstocks, spacing, relief, regional trials) or refuses to answer.
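
In code, "retrieve or refuse" can be as simple as a gate that runs before any LLM call. The sketch below assumes each passage carries a relevance score in [0, 1]; the `Passage` type, the threshold values, and the refusal text are placeholders, not the production guardrail.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Passage:
    chunk_id: str
    text: str
    score: float  # retrieval or reranker score, assumed to lie in [0, 1]


REFUSAL = "The indexed literature does not cover this question."


def answer_or_refuse(
    question: str,
    passages: list[Passage],
    generate: Callable[[str, list[Passage]], str],
    min_score: float = 0.5,   # hypothetical threshold, to be tuned against an eval set
    min_passages: int = 1,
) -> str:
    """Call the generator only when enough passages clear the score threshold."""
    strong = [p for p in passages if p.score >= min_score]
    if len(strong) < min_passages:
        # Refuse without spending a single generated token.
        return REFUSAL
    return generate(question, strong)
```

The generator is passed in as a function, so the gate can be tested with a stub and the weak passages never reach the prompt.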

That requirement ruled out most tutorial RAG stacks I'd seen. It also ruled out marketing “photo → disease” as the hero feature before a model is actually trained on disease imagery. Vision is on the roadmap; text grounded in papers is what's production-shaped today.

What I deliberately didn't optimize for (yet):

- Multi-tenant SaaS billing
- Viral B2C Telegram growth
- Claiming diagnosis-grade vision from an ImageNet backbone

I optimized for:
1. Retrieval you can regression-test
2. Answers you can gate before users see them
3. A platform you can re-pack for another vertical in days, not months
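
To give "gate the answer" some shape, here is one narrow check of the kind I mean: every number in a draft answer must also appear in the retrieved sources. It is an illustrative sketch with hypothetical function names, and a real gate would need more than a numeric check.

```python
import re

# Integers and decimals, with either a dot or a comma as the decimal separator.
NUMBER = re.compile(r"\d+(?:[.,]\d+)?")


def extract_numbers(text: str) -> set[str]:
    """Return numeric tokens with decimal commas normalized to dots."""
    return {match.replace(",", ".") for match in NUMBER.findall(text)}


def unsupported_numbers(answer: str, sources: list[str]) -> set[str]:
    """Numbers that occur in the answer but in none of the source passages."""
    supported: set[str] = set()
    for source in sources:
        supported |= extract_numbers(source)
    return extract_numbers(answer) - supported


def gate_answer(answer: str, sources: list[str]) -> tuple[bool, set[str]]:
    """Pass the answer only if every number in it is backed by a source."""
    missing = unsupported_numbers(answer, sources)
    return (not missing, missing)
```

If the sources say trees were planted at "4,5 x 1,5 m" (an invented example), an answer quoting "4.5 x 1.5 m" passes, while one quoting "3.5 x 1.5 m" is blocked and the unsupported "3.5" is reported.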

What's next (Part 2)
Part 1 was the why.

Part 2 is the decision that changed everything: I don't trust the pipeline until a fixed suite of domain questions passes retrieval — today 68 questions across apple, pear, plum, and the HR sandbox — before we pay for a single generated token.
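
A minimal sketch of what such a retrieval gate can look like, assuming each question is labeled with the chunk ids that would count as a correct hit. `EvalCase`, `run_retrieval_eval`, and the pass criterion (any expected chunk in the top k) are my simplifications for this post, not the exact harness.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    question: str
    expected_chunk_ids: set[str]  # any one of these in the top k counts as a pass


def run_retrieval_eval(
    cases: list[EvalCase],
    retrieve: Callable[[str, int], list[str]],  # returns ranked chunk ids
    k: int = 5,
) -> list[EvalCase]:
    """Run every case through the retriever and return the cases that failed."""
    failures = []
    for case in cases:
        top = set(retrieve(case.question, k)[:k])
        if not (top & case.expected_chunk_ids):
            failures.append(case)
    return failures


def suite_passes(cases: list[EvalCase], retrieve: Callable[[str, int], list[str]], k: int = 5) -> bool:
    """True only when no case fails; meant to run before any LLM call is wired in."""
    return not run_retrieval_eval(cases, retrieve, k)
```

Because the retriever is just a function from question to ranked ids, the same suite can be rerun after every change to chunking, search, or the glossary.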

Spoiler: getting there wasn't “use a bigger embedding model.” It was unglamorous engineering: chunking, hybrid search, reranking, glossary expansion. I'll unpack one layer per post.

If this resonates
I'm building in public through writing, not through dumping the entire corpus on GitHub.

- Follow on Dev.to for Part 2
- Comment if you've hit similar RAG failure modes in regulated or scientific domains
- Reach out (GitHub / email in bio) for a short demo: HR sandbox or limited horticulture preview

Disclaimer: assistant output is informational; field decisions require local experts and compliant product labels.

Top comments (1)

Kantemir Satibalov • Jun 30

Thanks for reading Part 1.

Quick roadmap for the series:
• Part 2 (~2 weeks): 68-question retrieval eval — pass/fail before any LLM call
• Part 3: hybrid search (why vector-only wasn't enough)
• Later: Go orchestration + answer verification

If you've built RAG in a scientific or regulated domain — what failed first for you: synonyms/OCR, chunking, or eval discipline?

Happy to do a short demo (HR sandbox or limited horticulture preview) for anyone seriously exploring this — DM or GitHub.