Last year I kept seeing the same pattern in agtech and “AI assistant” demos: a chatbot wrapped around a generic model, a handful of PDFs, and a disclaimer nobody reads.
I'm a developer, not an agronomist. But I'm working on two related projects: a grounded RAG platform (grounded-llm, private repo) and its first production-shaped domain pack, a horticulture assistant built on roughly 500 source articles from the Russian journal Plodovodstvo i vinogradstvo Yuga Rossii (apple, pear, plum), not five blog posts.
I didn't want another “ChatGPT for gardeners.”
I wanted answers that behave like someone who actually read the literature — and admits when the literature doesn't cover the question.
That gap turned into months of engineering. I'm sharing the story in public; the full corpus and codebase stay private.
What broke first: “sounds right” ≠ “is right”
Early experiments failed in boring, repeatable ways.
1. Domain language doesn't match generic retrieval
Russian horticulture is full of synonyms and notation variants: rootstock labels, disease names, regional cultivars. A user writes марссониоз; the literature may use Marssonina, abbreviations, or OCR-noisy spellings. Naive retrieval misses; the model fills the gap confidently.
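One way to narrow that gap is to expand the query with known spelling variants before it reaches retrieval. The sketch below is a minimal illustration, not the project's implementation: the `GLOSSARY` contents, the OCR-noisy spelling, and the function names are all assumptions made up for the example.

```python
import re

# Illustrative glossary only: canonical key -> spelling variants (all lowercase).
# The OCR-noisy spelling and the rootstock entry are invented examples, not corpus data.
GLOSSARY = {
    "marssonina": ["марссониоз", "marssonina", "marsonina"],
    "m9": ["m9", "m-9", "м9", "м-9"],
}


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so variants compare consistently."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def contains_term(haystack: str, term: str) -> bool:
    """Match the term only when it is not embedded in a longer word."""
    return re.search(rf"(?<!\w){re.escape(term)}(?!\w)", haystack) is not None


def expand_query(query: str) -> list[str]:
    """Return every known variant of each glossary term found in the query."""
    normalized = normalize(query)
    expanded: list[str] = []
    for variants in GLOSSARY.values():
        if any(contains_term(normalized, v) for v in variants):
            expanded.extend(v for v in variants if v not in expanded)
    return expanded
```

The returned variants can be appended to the keyword side of a search, so a query written with марссониоз also looks for passages that only say Marssonina.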
2. Scientific text isn't FAQ-shaped
Articles contain experiment sections, tables, and “brief for the grower” blocks. One chunk size for everything → right article, wrong paragraph → fluent wrong answer.
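A section-aware chunker is one way to avoid the "right article, wrong paragraph" failure. The following is a hedged sketch that assumes each article has already been split into sections labelled by kind; the kinds, the word budgets, and the `Chunk` type are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical word budgets per section kind; None means "never split".
WORD_BUDGET = {"experiment": 120, "grower_brief": 60, "table": None}
DEFAULT_BUDGET = 100


@dataclass
class Chunk:
    article_id: str
    section_kind: str
    text: str


def chunk_section(article_id: str, section_kind: str, text: str) -> list[Chunk]:
    """Pack whole paragraphs into chunks sized for the section kind."""
    budget = WORD_BUDGET.get(section_kind, DEFAULT_BUDGET)
    if budget is None:
        # Tables lose meaning when cut, so they stay in one piece.
        return [Chunk(article_id, section_kind, text.strip())]

    chunks: list[Chunk] = []
    pending: list[str] = []
    pending_words = 0
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        n_words = len(paragraph.split())
        # Flush before the budget is exceeded; paragraphs are never split.
        if pending and pending_words + n_words > budget:
            chunks.append(Chunk(article_id, section_kind, "\n\n".join(pending)))
            pending, pending_words = [], 0
        pending.append(paragraph)
        pending_words += n_words
    if pending:
        chunks.append(Chunk(article_id, section_kind, "\n\n".join(pending)))
    return chunks
```

Keeping the section kind on every chunk also lets retrieval filter or boost by kind later, for example preferring grower briefs for practical questions.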
3. Generation is the wrong place to fix retrieval
If the right passage never reaches the prompt, no system prompt saves you. I separated concerns early:
- Python service → retrieval only (/rag/context)
- Go server → sessions, LLM calls, answer cleanup, guardrails
Not because microservices are fashionable — because I needed to change and measure retrieval without redeploying the whole product.
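To make the boundary concrete, here is a rough sketch of what a retrieval-only response body could look like. Only the /rag/context path comes from the project; the payload fields, the `Passage` type, and the `search` callable are assumptions for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable


@dataclass
class Passage:
    chunk_id: str
    text: str
    score: float  # assumed: higher means more relevant


def rag_context(
    query: str,
    search: Callable[[str, int], list[Passage]],
    top_k: int = 5,
) -> str:
    """Build a hypothetical /rag/context body: ranked passages, no generation."""
    passages = search(query, top_k)
    ranked = sorted(passages, key=lambda p: p.score, reverse=True)[:top_k]
    payload = {"query": query, "passages": [asdict(p) for p in ranked]}
    # ensure_ascii=False keeps Cyrillic text readable in logs and fixtures.
    return json.dumps(payload, ensure_ascii=False)
```

Because the function never calls an LLM, its output can be snapshotted and compared between retrieval changes without touching the orchestration side.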
What I built: two layers, one product
| Layer | What it is |
|-------|------------|
| Platform core (grounded-llm) | Auth, Postgres sessions, orchestration |
| Domain pack (horticulture) | Corpus, crop config, prompts, eval baselines |
There's also a non-agricultural sandbox (demo_hr) — HR policy docs, same pipeline — to show the platform isn't hard-coded to apple diseases.
The horticulture pack indexes roughly 14,500 text chunks from the journal corpus. At this scale, “vector search only” and “we'll fix it in the prompt” stop being credible.
I'm not open-sourcing the full article texts (rights + focus). I am sharing architecture lessons, failure modes, and metrics — and offering controlled demos when it's worth someone's time.
One question that kept me honest
Which rootstocks and training systems show up in slope / terrace planting research for our region?
Generic LLMs invent varieties and numbers.
A grounded system should either retrieve relevant experimental context (rootstocks, spacing, relief, regional trials) or refuse to answer.
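The retrieve-or-refuse rule can be sketched as a small gate that runs before any generation. This is an assumption-laden illustration: the thresholds, the score scale, and the `Retrieved` and `Decision` types are hypothetical, and real cut-offs would have to come from evaluation.

```python
from dataclasses import dataclass

# Hypothetical thresholds; scores are assumed to lie in [0, 1], higher is better.
MIN_SCORE = 0.5
MIN_SUPPORTING = 2


@dataclass
class Retrieved:
    chunk_id: str
    score: float


@dataclass
class Decision:
    action: str  # "answer" or "refuse"
    supporting: list[str]


def decide(passages: list[Retrieved]) -> Decision:
    """Allow generation only when enough passages clear the relevance bar."""
    supporting = [p.chunk_id for p in passages if p.score >= MIN_SCORE]
    if len(supporting) < MIN_SUPPORTING:
        return Decision("refuse", [])
    return Decision("answer", supporting)
```

On a refusal the orchestrator can return a fixed "the literature I have doesn't cover this" message instead of calling the model at all.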
That requirement ruled out most tutorial RAG stacks I'd seen. It also ruled out marketing “photo → disease” as the hero feature before a model has actually been trained on disease imagery. Vision is on the roadmap; text grounded in papers is what's production-shaped today.
What I deliberately didn't optimize for (yet):
- Multi-tenant SaaS billing
- Viral B2C Telegram growth
- Claiming diagnosis-grade vision from an ImageNet backbone
I optimized for:
1. Retrieval you can regression-test
2. Answers you can gate before users see them
3. A platform you can re-pack for another vertical in days, not months
What's next (Part 2)
Part 1 was the why.
Part 2 is the decision that changed everything: I don't trust the pipeline until a fixed suite of domain questions passes retrieval — today 68 questions across apple, pear, plum, and the HR sandbox — before we pay for a single generated token.
Spoiler: getting there wasn't “use a bigger embedding model.” It was unglamorous engineering: chunking, hybrid search, reranking, and glossary expansion. I'll unpack one layer per post.
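As a preview of the idea, a retrieval-only regression suite can be as small as the sketch below. It assumes each question is paired with chunk IDs that a correct retrieval must surface; the `EvalCase` type, the `retrieve` callable, and the pass criterion (at least one expected chunk in the top k) are illustrative assumptions, not the project's actual suite.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    question: str
    expected_chunk_ids: set[str]  # any one of these in the top k counts as a pass


def run_retrieval_suite(
    cases: list[EvalCase],
    retrieve: Callable[[str, int], list[str]],
    top_k: int = 5,
) -> list[str]:
    """Return the questions whose top_k results contain no expected chunk."""
    failures: list[str] = []
    for case in cases:
        retrieved = set(retrieve(case.question, top_k))
        if not retrieved & case.expected_chunk_ids:
            failures.append(case.question)
    return failures
```

An empty list means the suite passed; anything else blocks the pipeline before a generated token is paid for.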
If this resonates
I'm building in public through writing, not through dumping the entire corpus on GitHub.
- Follow on Dev.to for Part 2
- Comment if you've hit similar RAG failure modes in regulated or scientific domains
- Reach out (GitHub / email in bio) for a short demo: HR sandbox or limited horticulture preview
Disclaimer: assistant output is informational; field decisions require local experts and compliant product labels.
Top comments (1)
Thanks for reading Part 1.
Quick roadmap for the series:
• Part 2 (~2 weeks): 68-question retrieval eval — pass/fail before any LLM call
• Part 3: hybrid search (why vector-only wasn't enough)
• Later: Go orchestration + answer verification
If you've built RAG in a scientific or regulated domain — what failed first for you: synonyms/OCR, chunking, or eval discipline?
Happy to do a short demo (HR sandbox or limited horticulture preview) for anyone seriously exploring this — DM or GitHub.