Sujithkrishnan.p.k

Posted on May 29

I built a docs Q&A engine that returns null instead of hallucinating

#programming #showdev

Every "docs chatbot" today routes user questions through OpenAI. For
open-source maintainers, privacy-conscious teams, and air-gapped
environments, that's either too expensive or unacceptable. So I built
one that doesn't.

The product

Knowledge Base API is a
small FastAPI service that answers questions over a folder of markdown
files using BM25 + POS-aware lemmatization + WordNet synonym
expansion. No models. No API keys. No data leaving the box.

Live demo against FastAPI + Pydantic + Starlette docs
(2,869 sections, 265 files).

The unusual constraint

The single hardest behaviour to enforce was making the API return
null instead of inventing an answer when nothing in the corpus is
a real fit.

curl -X POST https://kb-api-q30f.onrender.com/ask \
  -H "Content-Type: application/json" \
  -d '{"question":"what is quantum chromodynamics"}'

{
  "answer": null,
  "section": null,
  "source": null,
  "confidence": 0.0,
  "message": "I don't have enough information to answer that."
}

Most retrieval systems silently return the least-bad section. The
trade-off — sometimes refusing to answer — is the whole point.

Three things that took longer than I expected

1. Identifier-aware tokenization

The default NLTK tokenizer keeps response_model,
OAuth2PasswordBearer, and Cross-Origin as single opaque tokens.
That means a query for "what is response_model" never matches because
the document body has response_model underscored and the lemmatized
query doesn't.

Solution: split on _, -, and CamelCase boundaries before
lemmatization, and keep BOTH the full identifier and its pieces in the
indexed token stream.

split_identifier("OAuth2PasswordBearer")
# -> ["OAuth2PasswordBearer", "OAuth2", "Password", "Bearer"]

split_identifier("Cross-Origin")
# -> ["Cross-Origin", "Cross", "Origin"]

Going from 50% to 90% accuracy on identifier-heavy queries was almost
entirely this fix.

2. Query-side acronym expansion (without polluting the index)

If you expand CORS to cross origin resource sharing at index time,
every BM25 IDF calculation breaks — terms appear artificially often,
document lengths inflate, scoring degrades.

The right move is query-side only:

_ACRONYMS = {
    "cors": "cross origin resource sharing",
    "jwt":  "json web token",
    "api":  "application programming interface",
    "csrf": "cross site request forgery",
    "xss":  "cross site scripting",
    "orm":  "object relational mapping",
    # ...
}

When the query contains an acronym, append the expansion tokens to
the query. The index stays pure.

3. The heading/filename/path boost stack

Pure BM25 over docs returns weird results because:

"What is X" should rank a section literally titled "X" higher than a section that mentions X 12 times in the body
Files at reference/foo.md are canonical definitions; tutorials are examples
Heading matches mean a lot more than body matches

So the score gets four passes:
raw_bm25_score(query)
× HEADING_BOOST_FACTOR if heading-query overlap ≥ 50%

1.0 if heading EXACTLY matches query subject
× FILENAME_BOOST_FACTOR if filename overlaps query
× REFERENCE_PATH_BOOST if path is under reference/

And below a hard threshold, the result is rejected entirely:

if not scores.size or scores.max() < CONFIDENCE_THRESHOLD:
    return _no_match()

That last line is the difference between "honestly returns null"
and "silently returns the least-bad section."

Day-1 update: typo tolerance

A few hours after launching on Reddit, a commenter asked: "what
about searching 'cross origin' for CORS, or what about typos like
'rsponse_model'?"

The first case worked fine — BM25 finds the CORS docs because the
body contains "Cross-Origin Resource Sharing" verbatim. But typos?
Total miss. "rsponse_model" returned a wrong answer at 0.34
confidence — confidently wrong, above the threshold, no warning to
the user.

That's the worst possible failure mode for a "honest null" product:
the no-fabrication promise breaks for typo'd in-corpus queries,
which is arguably the more common failure mode than out-of-corpus
queries.

Fix shipped same day: a BK-tree (Burkhard-Keller tree) over the
indexed vocabulary at index time, with query-time nearest-neighbour
lookup using length-tuned edit distance:

def fuzzy_candidates(tree, token):
    if len(token) <= 8:
        max_dist = 1   # short words: ambiguous beyond one edit
    else:
        max_dist = 2   # OAuth2PasswordBearer can tolerate more slop
    return [w for w, d in tree.search(token, max_dist) if d > 0]

When fuzzy correction fires, the confidence is capped at 0.6 and the
response includes a "verify the source" message so the caller knows
the answer came from a corrected query, not an exact match.

Plus a guard against fuzzy-correcting nonsense queries: if 3+ user
tokens are unrecognized, return null. "Quantum chromodynamics
neutrino flux" against FastAPI docs correctly stays null even though
fuzzy lookup could find nearest-neighbour matches for each individual
word.

What works well after the fixes

Query	Result	Notes
`what is response_model`	`response_model Priority`	1.0 confidence
`how do I add CORS`	`CORS (Cross-Origin Resource Sharing)`	1.0 confidence
`what is OAuth2PasswordBearer`	`FastAPI's OAuth2PasswordBearer`	1.0 confidence
`what is APIRouter`	`APIRouter class` (in reference/apirouter.md)	1.0 confidence
`what is rsponse_model` (typo)	`response_model Priority`	0.6 confidence + warning
`how do I add corss` (typo)	`CORS preflight requests`	0.46 confidence + warning
`what is quantum chromodynamics`	`null`	honest refusal

What it doesn't do (being honest about scope)

No synthesis. The answer field is the matching section's body verbatim, not a paraphrase. If you want a summary, use a different tool.
No follow-up. Each query is stateless.
No multi-language. English-only NLTK stack.
No conceptual cross-document linking. It's keyword retrieval. Two documents about "California" — one about OpenAI's office and one about almond farming — won't be linked. For that you need embeddings + entity profiles. This product is intentionally not that.
Not a Perplexity replacement. If you ask open-domain questions outside your corpus, you'll get null. That's the feature.

When this is the right tool

OSS docs Q&A — your community can query your docs without you paying per-question LLM costs
Internal team wikis that legally can't go to OpenAI
Air-gapped environments (finance, healthcare, defense)
Personal knowledge management — Obsidian or Logseq vault, offline
CI quality gate for docs — fail a PR if it removes content that used to be answerable (this is what the GitHub Action does)

The stack

Layer	Choice	Why
Web	FastAPI + Uvicorn	Async, typed, batteries-included
Ranking	rank-bm25	Reference Okapi BM25 implementation
NLP	NLTK	WordNet, Penn Treebank tagger, stopwords — boring and reliable
Fuzzy	Custom BK-tree	~150 lines, no dependency
Parser	markdown-it-py	Handles fenced code blocks correctly
File watch	watchdog	Cross-platform file events

Total app code: ~700 lines. Image size: ~250 MB. RAM at runtime:
~40 MB. Indexes 1,800 markdown sections in well under a second.

The repo

github.com/teamerisingstars/KB-API

Live demo: kb-api-q30f.onrender.com

If you've built something similar or have thoughts on the BM25
tuning, the fuzzy correction, or the boost stack, I'd genuinely like
to hear what would change. Drop a comment or open an issue.

Top comments (1)

Some comments may only be visible to logged-in visitors. Sign in to view all comments.