DEV Community: Kantemir Satibalov

Building an open standard for grounded document assistants

Kantemir Satibalov — Wed, 15 Jul 2026 12:48:38 +0000

Last week I published how I made ~500 horticulture papers queryable without hallucination (passion project on DEV). That vertical worked. This post is about what came next: extracting the repeatable parts into an open platform — and publishing a spec other teams can conform to.

The problem nobody demos on Twitter

Enterprise teams don't need another ChatGPT wrapper. They need assistants that:

Answer only from internal documents (policies, handbooks, runbooks).
Cite sources — filename + chunk — in every response.
Refuse when retrieval cannot support an answer.
Run on infrastructure they control (Docker, K8s, private cloud).
Ship with measurable quality — not "it looked fine in a notebook once."

I learned this building for scientific PDFs. The interesting work was never the chat bubble. It was making archives answerable without lying — and proving retrieval quality before burning LLM tokens.

That discipline became Grounded LLM.

What I'm trying to standardize (and what I'm not)

I'm not building Dify, LangGraph, or a visual agent constructor.

I'm standardizing one narrow class of systems:

Document-grounded assistants — internal Q&A with citations, numeric verification, and regression-tested retrieval.

The positioning line:

Open standard for document-grounded assistants with citations, numeric verify, and measurable retrieval quality — deployable on your infrastructure.

Non-goals (by design):

Arbitrary tool/agent graphs
General chat without a knowledge base
Cloud-only lock-in
Feature parity with Glean or Microsoft Copilot SaaS

We compete on trust + reproducible quality + conformance — not feature count.

Five pillars of the standard

Pillar	What it means	Today in the repo
1. Spec & conformance	Published rules + tests anyone can run	Grounded Spec v1, `python -m conformance`
2. Quality science	Numbers, not demos	89 retrieval cases, retrieval gate in CI, adversarial pack
3. Reference deploy	Reproducible install	Docker, Helm, Terraform (AWS/GCP/Azure)
4. Template marketplace	Growth without forking core	HR, IT Support, Legal FAQ packs + registry
5. Governance	Standard outlives one author	RFC process, RFC-0001 Grounded-compatible

Horizon 1 success metric: any engineer runs conformance on a fresh deploy in under 15 minutes.

What the platform is today (not a slide deck)

Grounded LLM v0.1.0 is the first tagged release — reference implementation of Grounded Spec v1 with hybrid retrieval (BM25 + dense + RRF), pgvector/Chroma/Qdrant backends, 89 retrieval eval cases in CI, and published GHCR images (ghcr.io/kantik001/grounded-llm-*:0.1.0). Landing page: https://kantik001.github.io/grounded-llm/

After 11 delivery phases merged to main, this is a working reference implementation, not a manifesto.

Architecture

Clients (Web / API / Telegram / embed widget)
        ↓
Go server — auth, sessions, LLM, verify, admin, quotas, OIDC/RBAC
        ↓ POST /rag/context
Python RAG — hybrid BM25 + dense + RRF; Chroma / Qdrant / pgvector
        ↓
data/{tenant}/{domain}/  +  Postgres (sessions, audit; pgvector optional)

Split on purpose: Go owns trust boundaries and orchestration; Python owns retrieval only.

What ships out of the box

Capability	Why it matters
Citations in every answer	Audit trail for HR/legal
Numeric verify layer	Dosages, vacation days, SLA numbers must match retrieved context
Retrieval eval gate in CI	Catches silent RAG regressions on every PR
Multi-tenant API	`X-Tenant-ID`, API keys, OpenAPI v1
Enterprise admin	RBAC, OIDC SSO, audit log, async reindex, analytics
Template packs	`python scripts/init_pack.py install hr`
Ingest connectors	SharePoint, Google Drive, Confluence
Conformance CLI	Offline spec check + live HTTP check against any deployment
Embeddable widget	Intranet embed, not only Telegram

Quick start

git clone https://github.com/kantik001/grounded-llm.git
cd grounded-llm
pip install -r conformance/requirements.txt
cp .env.example .env
docker compose up -d --build
python -m conformance spec          # offline OpenAPI contract
python -m conformance check --url http://localhost:8080

If your product is Grounded-compatible, these tests should pass without forking my codebase.

From one vertical to a platform (the story arc)

	Horticulture proof	Grounded LLM platform
Repo	grounded_horticulture_en	grounded-llm
Domain	Apple rootstocks, disease IDs	Any internal documents
Retrieval	Tuned on ~500 papers, eval 68/68	Hybrid BM25 + dense + RRF, 89 eval cases, CI gate
Deliverable	Vertical proof + passion story	v0.1.0 spec + conformance + packs + deploy

The horticulture project answered: "Can we make scientific PDFs queryable safely?"

Grounded LLM answers: "Can we ship the next assistant in days without rebuilding auth, verify, eval, and deploy?"

Media: what to show

Screenshot 1 — Chat with citations

Screenshot 2 — Conformance CLI

📸 Terminal recording:

python -m conformance spec
python -m conformance check --url http://localhost:8080

Screenshot 3 — CI retrieval gate

Screenshot 4 — Template packs

Screenshot 5 — Admin panel

Where this sits in the industry (including Google)

Big tech is solving adjacent problems:

Product / area	Focus	Grounded LLM difference
NotebookLM	Research / consumer grounding on uploads	We target enterprise on-prem, API contract, CI gates
Vertex AI Search	Managed cloud retrieval	We target self-hosted, MIT core, conformance badge
Gemini + Workspace	SaaS copilot inside Google	We target any LLM endpoint, any infra

I'm not competing with Google on consumer UX. I'm saying: when procurement asks "is your internal assistant grounded and testable?" — there should be a published spec and CLI answer, not a vendor slide.

If you work on enterprise RAG, OSS standards, or ML platform conformance — I'd genuinely value feedback on RFC-0001.

Call to action

Try v0.1.0: Release notes · docker compose up or pull GHCR :0.1.0
Star / watch the repo if enterprise grounding interests you: github.com/kantik001/grounded-llm
Run conformance on your deploy and open an issue if something should be in Spec v2
Contribute an eval case when you fix a retrieval bug — see GOOD_FIRST_ISSUES.md
Building in horticulture or another vertical? The passion repo is still the deep retrieval story; this platform is the generalization

Summary

Question	Answer
What is it?	Open platform + Spec v1 for cited, verified document assistants
What is it not?	Agent builder, ChatGPT clone, Glean competitor
Why now?	Vertical proof worked; standard + conformance is the multiplier
What's next?	External conformance adopters, Spec v2 feedback, RFP-grade positioning

I started with my father's papers. I want to end with a checkable standard any team can implement — and prove — on their own infrastructure.

My father wrote the papers — I built a RAG assistant so growers can query them safely

Kantemir Satibalov — Fri, 10 Jul 2026 13:02:56 +0000

This is a submission for Weekend Challenge: Passion Edition

What I Built

Gardener's Assistant — a grounded RAG chat for horticulture.

My father spent decades as a plant breeder and researcher (Doctor of Agricultural Sciences, North Caucasus mountain horticulture institute). His articles — and his colleagues' — live in journal PDFs, not in anything a generic LLM can cite reliably. I built an assistant so growers and agronomists can ask questions and get answers grounded in those papers, with numbers verified before they reach the user.

The intended goal: make scientific horticulture queryable without hallucination. If retrieval cannot support an answer, the bot refuses. It does not invent rootstock codes, spray rates, or cultivar names.

What it does today:

Text chat over ~500 scientific articles (apple, pear, plum) via hybrid retrieval
Telegram Mini App + browser client (Docker, one compose up)
68-question retrieval regression suite — the gate I run before trusting any LLM output
Numeric verifier in Go — dosages in the answer must appear in retrieved context

Demo

Gardener's Assistant chat: three horticulture questions in Russian receive grounded answers from scientific articles via RAG, with streaming response in a Telegram-style UI:

Admin panel: login, upload horticulture articles into the RAG corpus, trigger reindex, and review 👍/👎 answer ratings — filter by likes or dislikes, see totals, and inspect each rated Q&A with crop, timestamp, and session/message ID to trace who gave feedback:

GIF: see above — Russian UI and Russian source articles (working demo today). English corpus and UI copy are being rolled out next.

Run locally:

git clone https://github.com/kantik001/grounded_horticulture_en
cd grounded_horticulture_en
cp .env.example .env   # LLM_API_KEY required
docker compose up -d --build

→ Chat: http://localhost/

→ Admin (article upload): http://localhost/admin.html

Three questions from the recording (Russian — matches the indexed corpus):

Какие признаки парши на листьях яблони? — disease + glossary expansion
Как густота посадки влияет на Айдаред на подвое СК 4? — exact identifier (BM25)
Какие подвои подходят для интенсивного сада на склоне? — semantic + lexical blend

English equivalents (for the upcoming EN launch — same retrieval paths, translated articles):

What are signs of apple scab on leaves?
How does planting density affect Aidared on SK 4 rootstock?
What rootstocks work for intensive orchards on slopes?

The public repo ships demo articles only (EN samples for quick start); the full journal corpus stays local for licensing. The pipeline, eval harness, and Docker stack are all there.

Code

kantik001 / grounded_horticulture_en

Grounded RAG horticulture assistant (English public portfolio, demo data only).

🍏 grounded-horticulture — horticulture assistant

Grounded RAG for horticulture: answers grounded in scientific articles with fact checking, not LLM hallucinations. Telegram Mini App and browser chat with API key.

Demo

Chat: question → RAG answer	Admin: articles and 👍/👎

▶ Full chat recording (MP4) · ▶ Full admin recording (MP4)

What it is

An assistant for gardeners and agronomists: text → hybrid search over articles → LLM answer with verification of numbers and dosages; photo → CV + recommendation (beta, no production weights in this repo).

Component	Role
Go (`server/`)	Auth, Postgres sessions, RAG+LLM orchestration, verify, rate limit, `/metrics`
Python (`api/`, `rag/`)	Hybrid retrieval (Chroma + BM25 + reranker), CV `/classify`
Web (`webapp/`)	Chat, article upload admin, nginx in Docker

Access: Telegram initData or browser X-API-Key (see .env.example).

Public repository: git contains demo data only (data/demo_hr/, data/apple/sample_*.txt). Full article…

Key paths:

Path	What
`rag/` + `api/`	Hybrid retrieval (Chroma, BM25, RRF, reranker)
`server/`	Go orchestration, SSE chat, numeric verifier
`eval/*.jsonl`	68 retrieval regression questions
`scripts/run_rag_eval.py`	One-command eval runner
`webapp/`	Browser chat UI

How I Built It

Passion → engineering constraint

The personal motivation came first: my father's horticulture papers should be queryable, not buried in PDF archives. The engineering rule followed: I don't trust the LLM until retrieval is measurable. Before tuning models or UI, I wrote a 68-question JSONL eval suite — rootstock codes, diseases, out-of-scope refusals — and made it a regression gate.

Why hybrid retrieval (not "better embeddings")

Pure vector search understood topic but missed tokens that matter — cultivar names, rootstock codes like SK 4, OCR-noisy spellings.

Example eval question: "How does planting density affect Aidared on SK 4 rootstock?"

Vector-only returned paragraphs about rootstocks but not the Aidared token.

Fix: per-crop hybrid pipeline:

query → glossary expansion (domain synonyms)
     → Chroma (multilingual-e5-small) top-16
     → BM25 top-16
     → RRF merge
     → conditional cross-encoder rerank (rootstock / disease / variety only)
     → diversify (≤2 chunks per article) → top-8 to the LLM

Result: 68/68 on the retrieval suite.

Re-verify anytime:

python scripts/run_rag_eval.py --suite all --in-process --fast

Decisions worth calling out:

RRF over score normalization — BM25 and cosine similarities live on different scales; ranks merge cleanly.
Category-gated reranker — the cross-encoder helps dense technical questions but costs CPU; "when should I water?" skips it.
Eval retrieval separately from generation — no LLM tokens, ~20s locally, catches regressions before users do.

Go + Python split

Python (rag/, api/): embeddings, Chroma, BM25, reranker, POST /rag/context
Go (server/): auth (Telegram + API key), Postgres sessions, SSE streaming, numeric verifier (numbers in the answer must appear in retrieved context)

If retrieval is weak, Go short-circuits before paying for generation.
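
The numeric check described above can be sketched in a few lines. The project's verifier is written in Go and its code is not shown here, so this is a Python illustration under my own assumptions: the regex, the comma-to-dot normalisation, and the function names are all hypothetical.

```python
import re

# Assumption: a "number" is an integer or a decimal with "." or "," as separator.
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)?")


def extract_numbers(text: str) -> set[str]:
    """Return numeric tokens in text, with decimal commas normalised to dots."""
    return {m.group(0).replace(",", ".") for m in NUMBER_RE.finditer(text)}


def unsupported_numbers(answer: str, context: str) -> set[str]:
    """Numbers that appear in the answer but nowhere in the retrieved context."""
    return extract_numbers(answer) - extract_numbers(context)


def verify_answer(answer: str, context: str) -> bool:
    """True when every number in the answer also occurs in the context."""
    return not unsupported_numbers(answer, context)


context = "Apply 2,5 kg/ha; repeat the treatment after 14 days."
print(verify_answer("Use 2.5 kg/ha, repeat after 14 days.", context))
print(unsupported_numbers("Use 3 kg/ha.", context))
```

A set difference is deliberately strict: a number that is correct but absent from the retrieved context still fails, which is the behaviour the post asks for.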

What broke (and what saved me)

Change	Symptom	Eval caught it?
Chunking split tables from headers	Apple pass_rate −7 points	Yes, reverted in minutes
BM25 not rebuilt after corpus update	Exact-code questions failed	Yes
Glossary entry too aggressive	MRR dropped, pass_rate unchanged	Yes

Passion projects still need discipline. The interesting work wasn't the chat bubble — it was making scientific archives answerable without lying.

What I learned

Passion without measurement ships fairy tales. A 20-second eval run changed how I work more than any embedding upgrade.
Hybrid search (BM25 + vectors + RRF) beat "just use a better model" for scientific text with rare codes.
The hard problem was never the chat UI — it was making PDF archives answerable without lying.

Disclaimer: assistant output is informational; field decisions require local experts and compliant product labels. CV classification is beta.

Vector search kept missing rootstock codes, so I went hybrid

Kantemir Satibalov — Fri, 10 Jul 2026 07:48:43 +0000

Grounded RAG in production (8 Part Series)

I stopped trusting generic LLMs for horticulture — so I built a grounded assistant on ~500 scientific articles
68 questions before a single token: eval-first RAG
Vector search kept missing rootstock codes, so I went hybrid — you are here
Scientific articles aren't FAQ-shaped: chunking a 500-article corpus — coming soon
Gate the answer, not just the retrieval: verifying LLM output against sources — coming soon
Go for the product, Python for the models: anatomy of a two-service RAG — coming soon
Hardening a side project like it's production (and the outage that caused) — coming soon
One RAG platform, swappable domains: what 500 articles taught me about product shape — coming soon

When the eval suite first ran against pure vector search, the failures clustered in one place: exact identifiers. Rootstock codes like "SK-4", cultivar names, dosage lines. Embeddings are great at "this paragraph is about slope planting" and terrible at "this exact token matters more than the topic."

A typical miss: vector search returned paragraphs about rootstocks, but not the cultivar token the eval expected.

The fix wasn't a bigger model. It was accepting that I needed two retrievers with opposite failure modes, and a principled way to merge them.

The pipeline

query
  └─ glossary expansion (synonyms appended)
       ├─ Chroma vector search (multilingual-e5-small), top-16, filtered by crop
       └─ BM25 per-crop index, top-16
             └─ RRF merge → candidates
                   └─ cross-encoder rerank (bge-reranker-base) — only for some categories
                         └─ per-article diversification → top-8 fragments

Every stage exists because a specific eval failure demanded it. Layer by layer:

Embeddings: intfloat/multilingual-e5-small. Multilingual because the corpus is Russian and queries can be either language. One non-obvious gotcha: e5 models require query: / passage: prefixes at query and index time. Without them similarity scores quietly degrade — no error, just worse ranking. I subclassed the embedding wrapper so the prefixes can't be forgotten.
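
One way to make the prefixes impossible to forget is to hide the raw encoder behind a wrapper that only exposes prefixed methods. This is a minimal sketch, not the repo's actual subclass: the class name, method names, and the `encode` callable are assumptions, and in practice `encode` would be whatever your embedding library provides.

```python
from typing import Callable, Sequence

# Assumption: the encoder takes a list of strings and returns one vector per string.
Encoder = Callable[[list[str]], list[list[float]]]


class E5PrefixedEmbedder:
    """Wraps an encoder so callers can never embed text without an e5 prefix."""

    QUERY_PREFIX = "query: "
    PASSAGE_PREFIX = "passage: "

    def __init__(self, encode: Encoder) -> None:
        self._encode = encode

    def embed_query(self, text: str) -> list[float]:
        """Embed a single search query with the query prefix."""
        return self._encode([self.QUERY_PREFIX + text])[0]

    def embed_passages(self, texts: Sequence[str]) -> list[list[float]]:
        """Embed chunks at index time with the passage prefix."""
        return self._encode([self.PASSAGE_PREFIX + t for t in texts])
```

Because the unprefixed encoder is private, the only way to skip a prefix is to bypass the wrapper on purpose.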

BM25, one index per crop. Classic lexical scoring over tokenized chunks. This is what catches "SK-4" — the exact token is either in the chunk or it isn't. The indexes are built at reindex time and persisted alongside the vector store, so a container restart doesn't silently drop the lexical half (an early bug the eval caught: exact-code questions failing while everything else passed).
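
To show why lexical scoring catches exact codes, here is a small self-contained Okapi BM25 index. It is a sketch under assumptions: the tokenizer (which keeps hyphenated codes like "sk-4" as one token), the `k1` and `b` defaults, and the class name are mine, and persistence is left out. The repo's real index may be built quite differently.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split, keeping hyphenated codes such as 'sk-4' whole."""
    return re.findall(r"\w+(?:-\w+)*", text.lower())


class BM25Index:
    """In-memory Okapi BM25 over a mapping of chunk id to chunk text."""

    def __init__(self, chunks: dict[str, str], k1: float = 1.5, b: float = 0.75) -> None:
        self.k1, self.b = k1, b
        self.term_freqs = {cid: Counter(tokenize(t)) for cid, t in chunks.items()}
        self.lengths = {cid: sum(tf.values()) for cid, tf in self.term_freqs.items()}
        self.avg_len = sum(self.lengths.values()) / max(len(chunks), 1)
        # Document frequency: in how many chunks each term occurs.
        self.doc_freq = Counter(term for tf in self.term_freqs.values() for term in tf)

    def _idf(self, term: str) -> float:
        n, df = len(self.term_freqs), self.doc_freq.get(term, 0)
        return math.log(1 + (n - df + 0.5) / (df + 0.5))

    def search(self, query: str, top_k: int = 16) -> list[str]:
        """Return up to top_k chunk ids, best first; chunks with no match are dropped."""
        terms = tokenize(query)
        scores: dict[str, float] = {}
        for cid, tf in self.term_freqs.items():
            score = 0.0
            for term in terms:
                f = tf.get(term, 0)
                if f == 0:
                    continue
                norm = f + self.k1 * (1 - self.b + self.b * self.lengths[cid] / self.avg_len)
                score += self._idf(term) * f * (self.k1 + 1) / norm
            if score > 0:
                scores[cid] = score
        return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

A chunk that lacks the query token scores zero and never enters the list, which is exactly the "either in the chunk or it isn't" property described above.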

RRF merge. Reciprocal Rank Fusion is embarrassingly simple: each list contributes 1 / (k + rank) per document, with rank starting at 1; sum the contributions and sort (k=60):

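def rrf_merge(rankings, k=60):
    """Fuse ranked lists of chunk ids with Reciprocal Rank Fusion."""
    scores = {}
    for ranking in rankings:
        # enumerate() is 0-based, so rank + 1 is the 1-based rank.
        for rank, chunk_id in enumerate(ranking):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    # Highest fused score first.
    return [cid for cid, _ in sorted(scores.items(), key=lambda x: -x[1])]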

I tried score normalization schemes first. RRF won because it needs no calibration between BM25 scores and cosine similarities — it only trusts ranks — and it's ten lines you can hold in your head.
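
To make the arithmetic concrete, here is the same function run on two made-up rankings. The chunk ids are hypothetical and the lists are much shorter than the real top-16.

```python
def rrf_merge(rankings, k=60):
    """Reciprocal Rank Fusion over ranked lists of chunk ids."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return [cid for cid, _ in sorted(scores.items(), key=lambda x: -x[1])]


vector_hits = ["c1", "c2", "c3"]  # hypothetical dense results, best first
bm25_hits = ["c4", "c1"]          # hypothetical lexical results, best first

# c1: 1/61 + 1/62, c4: 1/61, c2: 1/62, c3: 1/63
print(rrf_merge([vector_hits, bm25_hits]))  # ['c1', 'c4', 'c2', 'c3']
```

The chunk both retrievers agree on (c1) ends up first, and the raw BM25 and cosine scores never enter the calculation.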

Conditional reranking. A cross-encoder (BAAI/bge-reranker-base) reads the query and each candidate together and re-scores the top candidates. It measurably improves ranking for dense technical questions — and costs real CPU time. So it's category-gated: the question classifier (rule-based, config-driven) tags questions as rootstock, disease, variety, fertilizer, relief, or general, and only the complex categories pay the reranker tax. "When should I water?" doesn't need a cross-encoder.
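
A category gate like this can be sketched with a handful of regex rules. The patterns below are hypothetical stand-ins for the real config, which is not shown in this post; only the idea (classify first, rerank for rootstock, disease, and variety questions) comes from the pipeline described here.

```python
import re

# Hypothetical keyword rules; the real classifier is config-driven.
CATEGORY_RULES = {
    "rootstock": [r"\brootstocks?\b", r"\bsk[- ]?\d+\b"],
    "disease": [r"\bscab\b", r"\bblotch\b", r"\bmildew\b"],
    "variety": [r"\bcultivars?\b", r"\bvariet(?:y|ies)\b"],
}

# Categories assumed to be worth the cross-encoder's CPU cost.
RERANK_CATEGORIES = {"rootstock", "disease", "variety"}


def classify(question: str) -> str:
    """Return the first category whose pattern matches, else 'general'."""
    q = question.lower()
    for category, patterns in CATEGORY_RULES.items():
        if any(re.search(p, q) for p in patterns):
            return category
    return "general"


def should_rerank(question: str) -> bool:
    """Gate the cross-encoder on the question category."""
    return classify(question) in RERANK_CATEGORIES
```

Keeping the rules in data rather than code means a new category is a config change, which fits the "rule-based, config-driven" description.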

Glossary expansion. A curated JSON of domain synonyms; if the query contains a known term, its synonyms are appended to the search string. This is where user vocabulary meets literature vocabulary — the colloquial disease name pulls in Marssonina spellings the embeddings alone ranked too low. Curated beats automatic here: the glossary is small, reviewable, and each entry exists because a real query missed.
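
A minimal sketch of the expansion step, assuming the glossary is a mapping from a term to its synonyms. The two entries and the function name are illustrative only; they are not taken from the project's curated JSON.

```python
import re

# Hypothetical entries for illustration; the real glossary is a curated JSON file.
GLOSSARY = {
    "scab": ["venturia inaequalis"],
    "rootstock": ["understock"],
}


def expand_query(query: str, glossary: dict[str, list[str]]) -> str:
    """Append synonyms for every glossary term found as a whole word in the query."""
    lowered = query.lower()
    extras: list[str] = []
    for term, synonyms in glossary.items():
        if re.search(rf"\b{re.escape(term.lower())}\b", lowered):
            for syn in synonyms:
                # Skip synonyms already present in the query or already added.
                if syn.lower() not in lowered and syn not in extras:
                    extras.append(syn)
    return " ".join([query, *extras])


print(expand_query("What are signs of scab?", GLOSSARY))
```

The expansion is deterministic, so an overly aggressive entry shows up as a reproducible ranking change in the eval instead of intermittent noise.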

Diversification. Top-ranked chunks tend to come from the same article. The last step caps fragments per source before returning the top-8, so the LLM sees multiple studies instead of one article shredded eight ways.
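
The cap can be sketched as a single pass over the ranked candidates. The cap of 2 chunks per article and the top-8 cutoff follow the pipeline described in this series; the tuple shape and function name are assumptions.

```python
def diversify(
    ranked: list[tuple[str, str]],
    max_per_article: int = 2,
    top_k: int = 8,
) -> list[str]:
    """Keep ranking order but allow at most max_per_article chunks per article.

    ranked holds (chunk_id, article_id) pairs, best first.
    """
    per_article: dict[str, int] = {}
    selected: list[str] = []
    for chunk_id, article_id in ranked:
        if per_article.get(article_id, 0) >= max_per_article:
            continue  # this article already has its share of the context
        per_article[article_id] = per_article.get(article_id, 0) + 1
        selected.append(chunk_id)
        if len(selected) == top_k:
            break
    return selected


ranked = [("a1#0", "a1"), ("a1#1", "a1"), ("a1#2", "a1"), ("a2#0", "a2")]
print(diversify(ranked))  # ['a1#0', 'a1#1', 'a2#0']
```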

What I didn't do

Fine-tune embeddings. Tempting, but the eval said ranking was mostly fine once lexical search covered the identifier cases. Not worth the MLOps overhead yet.
LLM query rewriting. Adds latency and a failure mode to the cheap half of the system. The glossary covers the actual observed misses deterministically.
A vector DB migration. Chroma with a persistent local directory is unglamorous and entirely sufficient at ~14.5k chunks. The interesting problems were above the storage layer.

Results

With the full stack, the 68-question suite passes at 100% retrieval, and — the part I care about more — hit_rate@3 stays high enough that the LLM's context isn't padded with near-misses (apple suite: MRR 0.938, hit@3 0.953). With --fast (reranker off) it still passes 68/68 today, which tells me the reranker is currently insurance rather than load-bearing. I keep it because insurance is what you want the week a new corpus batch lands.

68 questions before a single token: eval-first RAG

Kantemir Satibalov — Mon, 06 Jul 2026 08:06:01 +0000

Grounded RAG in production (8 Part Series)

I stopped trusting generic LLMs for horticulture — so I built a grounded assistant on ~500 scientific articles
68 questions before a single token: eval-first RAG — you are here
Vector search kept missing rootstock codes, so I went hybrid
Scientific articles aren't FAQ-shaped: chunking a 500-article corpus — coming soon
Gate the answer, not just the retrieval: verifying LLM output against sources — coming soon
Go for the product, Python for the models: anatomy of a two-service RAG — coming soon
Hardening a side project like it's production (and the outage that caused) — coming soon
One RAG platform, swappable domains: what 500 articles taught me about product shape — coming soon

In Part 1 I promised the decision that changed everything. Here it is:

I don't touch the LLM until a fixed suite of domain questions passes retrieval.

Not "the demo looked good." Not "I asked it five things and it answered." A versioned file of questions with expected evidence, run as a regression suite — the same way you'd treat unit tests. Today that suite is 68 questions: 45 apple, 8 pear, 10 plum, and 5 for an HR-policy sandbox that proves the pipeline isn't hard-coded to orchards.

Why retrieval, not answers

A RAG pipeline fails in two places: the right passage never reaches the prompt, or the model mangles a passage that did. The first failure is cheap to detect and free to test — no API key, no tokens, no flaky LLM in the loop. So the default eval mode stops at retrieval:

{
  "crop_id": "apple",
  "question": "What are signs of scab?",
  "expect_contains": ["scab", "spot"],
  "expect_context": true,
  "expect_out_of_scope": false,
  "category": "disease"
}

The runner sends the question to the retrieval service (POST /rag/context) and checks that every expect_contains substring appears in the combined retrieved context. A few details that turned out to matter more than I expected:

expect_contains_any for synonyms. The literature writes Marssonina; users write the colloquial disease name. Either counts as evidence.
Light stemming. rootstock should match rootstocks. Without it you either overfit the expected strings to one article's phrasing or get false failures.
expect_out_of_scope: true questions. A question about, say, car maintenance must return weak or no context. This catches the embarrassing failure mode where vector search happily returns "closest" chunks for any string whatsoever.
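
The three rules above fit in one small checker. This is a sketch, not the repo's runner: the field names come from the JSON example, but the stemming rule is a crude stand-in of my own, and "weak or no context" is simplified here to "no context at all".

```python
import re


def light_stem(token: str) -> str:
    """Crude plural stripping: 'rootstocks' -> 'rootstock', 'grass' unchanged."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def normalise(text: str) -> str:
    """Lowercase, split into words, and stem each word."""
    return " ".join(light_stem(t) for t in re.findall(r"\w+", text.lower()))


def check_case(case: dict, context: str) -> bool:
    """Return True when the retrieved context satisfies the eval case."""
    ctx = normalise(context)
    if case.get("expect_out_of_scope"):
        # Simplification: an out-of-scope question passes only on empty context.
        return ctx == ""
    required = all(normalise(s) in ctx for s in case.get("expect_contains", []))
    alternatives = case.get("expect_contains_any")
    any_ok = any(normalise(s) in ctx for s in alternatives) if alternatives else True
    return required and any_ok
```

Both the expected strings and the context go through the same normalisation, so the match does not depend on which side happens to use the plural.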

The metrics that survived

pass_rate alone saturates: once you hit 68/68, it can't tell you whether a refactor made ranking worse as long as the evidence still sneaks into position 8. So the runner also reports ranking metrics per suite:

MRR (mean reciprocal rank) of the first relevant fragment,
hit_rate@1 / @3 / @5 — did relevant evidence appear in the top-1/3/5 fragments.

A fragment counts as relevant when it contains at least one expected substring. That's a single-relevant proxy — the baselines don't carry ground-truth chunk ids — and I'm fine with it: it's cheap, stable, and moves in the right direction when I break something.
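
Under that single-relevant proxy, both metrics reduce to the rank of the first fragment that contains an expected substring. A minimal sketch follows; the function names and the dictionary keys are assumptions, not the runner's actual output format.

```python
from typing import Optional, Sequence


def first_relevant_rank(fragments: Sequence[str], expected: Sequence[str]) -> Optional[int]:
    """1-based rank of the first fragment containing any expected substring."""
    for rank, fragment in enumerate(fragments, start=1):
        text = fragment.lower()
        if any(s.lower() in text for s in expected):
            return rank
    return None  # no relevant fragment was retrieved


def ranking_metrics(ranks: Sequence[Optional[int]], ks: Sequence[int] = (1, 3, 5)) -> dict:
    """MRR and hit_rate@k over per-question ranks; a miss (None) contributes 0."""
    if not ranks:
        raise ValueError("need at least one question")
    n = len(ranks)
    metrics = {"mrr": sum(1.0 / r for r in ranks if r is not None) / n}
    for k in ks:
        metrics[f"hit_rate@{k}"] = sum(1 for r in ranks if r is not None and r <= k) / n
    return metrics


# Four questions: evidence at rank 1, rank 2, missing, rank 4.
print(ranking_metrics([1, 2, None, 4]))
# {'mrr': 0.4375, 'hit_rate@1': 0.25, 'hit_rate@3': 0.5, 'hit_rate@5': 0.75}
```

This is why MRR can drop while pass_rate stays flat: evidence sliding from rank 1 to rank 4 still passes, but its reciprocal rank falls from 1.0 to 0.25.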

There's also a --full mode that does call the LLM and reports two more numbers: verify_pass_rate (do the numbers in the answer actually appear in the retrieved context; more on that verifier in Part 5) and answer_contains_rate (does the answer mention the expected terms). Out-of-scope questions skip the LLM entirely, mirroring the production short-circuit: if retrieval finds nothing, we refuse before paying for tokens.

Making it cheap enough to actually run

An eval nobody runs is documentation. The full HTTP run over 68 questions takes about 4 minutes; that was too slow for "run after every change," so the runner grew flags:

Flag	Effect
`--in-process`	Import the retrieval module directly, skip HTTP
`--fast`	Disable the cross-encoder reranker (~15× faster; still 68/68 on the current set)
`--workers 2`	Parallel requests against one retrieval worker

The ~20-second --in-process --fast combination is what I run reflexively. The full run with reranking is for before releases and after reindexing. In CI, unit tests run on every PR and the complete eval is a manual GitHub Actions workflow — model downloads are too heavy to justify on every push.

Results land in eval/results/<timestamp>_<suite>.json, so "did Tuesday's chunking change hurt pear questions?" is a diff, not an argument.

What the suite caught (a sample)

A chunking change that split experiment tables from their headers — apple pass_rate dropped 7 points, nothing else moved. Reverted in minutes.
BM25 index not rebuilt after a corpus update — vector search masked it for common questions, but exact-code questions (rootstock "SK-4") failed instantly.
A glossary entry that expanded a term too aggressively and pushed the right article out of the top-5 for two questions: visible as an MRR drop with pass_rate unchanged.

None of these would have been caught by "chat with the bot for a while." All of them would have shipped.

What I'd tell you to steal

Write the eval file before tuning retrieval. Even 20 questions change how you work.
Test retrieval separately from generation. It's the cheap 80%.
Add out-of-scope questions early. Refusing well is a feature.
Make the fast path under 30 seconds, or you'll stop running it.

Part 3 is the payoff: what it actually took to get those 68 questions passing — hybrid search, RRF, and why "just use a better embedding model" wasn't the answer.

Disclaimer: assistant output is informational; field decisions require local experts and compliant product labels.

I stopped trusting generic LLMs for horticulture — so I built a grounded assistant on ~500 scientific articles

Kantemir Satibalov — Tue, 30 Jun 2026 13:23:58 +0000

Grounded RAG in production (8 Part Series)

I stopped trusting generic LLMs for horticulture — so I built a grounded assistant on ~500 scientific articles - you are here
68 questions before a single token: eval-first RAG
Vector search kept missing rootstock codes, so I went hybrid
Scientific articles aren't FAQ-shaped: chunking a 500-article corpus — coming soon
Gate the answer, not just the retrieval: verifying LLM output against sources — coming soon
Go for the product, Python for the models: anatomy of a two-service RAG — coming soon
Hardening a side project like it's production (and the outage that caused) — coming soon
One RAG platform, swappable domains: what 500 articles taught me about product shape — coming soon

Last year I kept seeing the same pattern in agtech and “AI assistant” demos: a chatbot wrapped around a generic model, a handful of PDFs, and a disclaimer nobody reads.

I'm a developer, not an agronomist. But I'm working on two related projects: a grounded RAG platform (grounded-llm, private repo) and its first production-shaped domain pack, a horticulture assistant built on articles from the Russian journal Plodovodstvo i vinogradstvo Yuga Rossii (apple, pear, plum; roughly 500 source articles, not five blog posts).

I didn't want another “ChatGPT for gardeners.”
I wanted answers that behave like someone who actually read the literature — and admits when the literature doesn't cover the question.

That gap turned into months of engineering. I'm sharing the story in public; the full corpus and codebase stay private.

What broke first: “sounds right” ≠ “is right”

Early experiments failed in boring, repeatable ways.

1. Domain language doesn't match generic retrieval

Russian horticulture is full of synonyms and notation variants: rootstock labels, disease names, regional cultivars. A user writes марссониоз; the literature may use Marssonina, abbreviations, or OCR-noisy spellings. Naive retrieval misses; the model fills the gap confidently.

2. Scientific text isn't FAQ-shaped

Articles contain experiment sections, tables, and “brief for the grower” blocks. One chunk size for everything → right article, wrong paragraph → fluent wrong answer.

3. Generation is the wrong place to fix retrieval

If the right passage never reaches the prompt, no system prompt saves you. I separated concerns early:

Python service → retrieval only (/rag/context)
Go server → sessions, LLM calls, answer cleanup, guardrails

Not because microservices are fashionable, but because I needed to change and measure retrieval without redeploying the whole product.

What I built: two layers, one product

| Layer | What it is |
|-------|------------|
| Platform core (grounded-llm) | Auth, Postgres sessions, orchestration |
| Domain pack (horticulture) | Corpus, crop config, prompts, eval baselines |

There's also a non-agricultural sandbox (demo_hr) — HR policy docs, same pipeline — to show the platform isn't hard-coded to apple diseases.

The horticulture pack indexes roughly 14,500 text chunks from the journal corpus. At this scale, “vector search only” and “we'll fix it in the prompt” stop being credible.

I'm not open-sourcing the full article texts (rights + focus). I am sharing architecture lessons, failure modes, and metrics — and offering controlled demos when it's worth someone's time.

One question that kept me honest

Which rootstocks and training systems show up in slope / terrace planting research for our region?

Generic LLMs invent varieties and numbers.
A grounded system either retrieves relevant experimental context — rootstocks, spacing, relief, regional trials — or should refuse to answer.

That requirement ruled out most tutorial RAG stacks I'd seen. It also ruled out marketing “photo → disease” as the hero feature before a model is actually trained on disease imagery. Vision is on the roadmap; text grounded in papers is what's production-shaped today.

What I deliberately didn't optimize for (yet):

Multi-tenant SaaS billing
Viral B2C Telegram growth
Claiming diagnosis-grade vision from an ImageNet backbone

I optimized for:

1. Retrieval you can regression-test
2. Answers you can gate before users see them
3. A platform you can re-pack for another vertical in days, not months

What's next (Part 2)

Part 1 was the why.

Part 2 is the decision that changed everything: I don't trust the pipeline until a fixed suite of domain questions passes retrieval — today 68 questions across apple, pear, plum, and the HR sandbox — before we pay for a single generated token.

Spoiler: getting there wasn't “use a bigger embedding model.” It was unglamorous engineering (chunking, hybrid search, reranking, glossary expansion), and I'll unpack one layer per post.

If this resonates

I'm building in public through writing, not through dumping the entire corpus on GitHub.

Follow on Dev.to for Part 2
Comment if you've hit similar RAG failure modes in regulated or scientific domains
Reach out (GitHub / email in bio) for a short demo: HR sandbox or limited horticulture preview

Disclaimer: assistant output is informational; field decisions require local experts and compliant product labels.