DEV Community: Virginia Nyambura Mwega

ktx Docs Review: Excellent for AI Agents, One Onboarding Gap

Virginia Nyambura Mwega — Wed, 08 Jul 2026 16:29:42 +0000

A few days ago I announced a personal challenge: reviewing AI startup documentation in public — one thing done well, one place onboarding could be smoother, one rewritten example. This is review #1.

The product: ktx by Kaelio — an open-source, self-improving context layer for data agents. It teaches coding agents like Claude Code, Codex, and Cursor how to query your warehouse accurately, using approved metric definitions and business context stored as reviewable YAML and Markdown in git. You install it with npm install -g @kaelio/ktx (Node.js 22+) and run one guided ktx setup.

I tested the docs rather than just reading them. Everything below was verified live against docs.kaelio.com on July 8, 2026.

What Kaelio does really well: docs written for agents, not just humans

This is the part I want other AI startups to copy.

Most documentation assumes a human reading a browser tab. ktx treats an AI coding agent as a first-class reader:

A dedicated AI Resources route with an Agent Quickstart, Agent Instructions, and Prompt Recipes.
/llms.txt (a curated index of high-value pages) and /llms-full.txt (the full corpus) so an assistant can discover the right pages before diving in.
Per-page Markdown on demand: curl -H "Accept: text/markdown" returns clean source — frontmatter stripped, code blocks preserved, tables preserved.
Missing pages return a plain-text 404 instead of silently falling back to rendered HTML. Small detail, big kindness — it stops an agent from confidently parsing a garbage page.
Copy as Markdown, View MD, and Copy MDX actions sit right on each rendered page.
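From the agent's side, the Accept-header and plain-text 404 behaviours listed above could be consumed with something like the sketch below. It is an illustration, not ktx tooling: the function names are hypothetical, and it assumes the Markdown response is served with a `text/markdown` content type.

```python
import urllib.error
import urllib.request
from typing import Optional


def is_usable_markdown(status: int, content_type: str) -> bool:
    """Accept only a 200 whose content type is Markdown (assumed: text/markdown)."""
    return status == 200 and content_type.lower().startswith("text/markdown")


def fetch_markdown(url: str) -> Optional[str]:
    """Request a docs page as Markdown; return None rather than parse anything else."""
    request = urllib.request.Request(url, headers={"Accept": "text/markdown"})
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            content_type = response.headers.get_content_type()
            if not is_usable_markdown(response.status, content_type):
                return None  # e.g. an HTML fallback page
            return response.read().decode("utf-8")
    except urllib.error.HTTPError:
        return None  # a 404 means "no such page", so there is nothing to parse
```

The point of the guard is the one made above: an agent that gets anything other than Markdown should stop, not parse it.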

One of my standing review questions is literally: is the documentation structured so both developers and AI agents can use it effectively? ktx is the clearest "yes" I've come across. If you're building a developer product in 2026, this is the bar.

The human path is strong too: a single guided ktx setup wizard with clearly numbered steps, a live demo warehouse with paste-ready credentials, and a Common issues table that maps symptom → fix.

Where onboarding could be smoother: the MCP step the happy path skips

Here's the one seam.

The quickstart walks you through a clean sequence: install → ktx setup → ktx status (verify) → Connect a coding agent. That final section covers installing project-local agent rules with ktx setup --agents.

What it doesn't mention is that, before your agent client can actually reach ktx, you may need to start the local MCP daemon. That step lives in the README and FAQ — not in the quickstart:

The README notes: if ktx status prints ktx mcp start --project-dir ..., run it before opening your agent client.
The FAQ explains there's no hosted service — the local MCP daemon runs on demand via ktx mcp start when an agent client needs it.

So the failure mode is: a new developer follows the quickstart end-to-end, opens Claude Code or Cursor, and the agent silently can't see ktx — with no breadcrumb in the quickstart pointing at the fix. Making it slightly stickier, the ktx status example output in the Verify section shows a fully-ready project but doesn't include the ktx mcp start line the README says status can emit — so even the example doesn't prepare you for it.

To be fair: ktx is under active development, and this is one missing signpost on an otherwise excellent path. But it sits at the last mile of onboarding — exactly where a new developer's first impression is decided.

The rewritten section

Here's how I'd close the gap — a version of "Connect a coding agent" written in ktx's own voice, folding in the missing step:

Connect a coding agent

The setup wizard installs project-local agent rules in its last step. To install or change targets later:

ktx setup --agents

Claude Code and Codex also support global installs with --global. Agent rules point at the ktx CLI path that created them, so agents don't need a separate ktx binary on PATH. If the CLI path changes, rerun ktx setup --agents.

Start the MCP server if your status asks for it. Some setups serve agents through an on-demand local MCP daemon. If ktx status prints a line like:

ktx mcp start --project-dir /home/user/analytics

run that command before opening your agent client. Otherwise the agent connects to nothing and ktx tools appear empty. You can confirm with ktx status — Agent integration should read ready.

One short subsection, one status breadcrumb. That's the whole fix — and it's the difference between a silent dead-end and a thirty-second recovery.
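That breadcrumb is also easy to check mechanically. The sketch below scans captured `ktx status` text for a `ktx mcp start` line and returns it. It assumes only what the README describes (that status may print such a command); the helper name and the sample text are hypothetical, and real status output may be formatted differently.

```python
from typing import Optional


def find_mcp_start_command(status_output: str) -> Optional[str]:
    """Return the first `ktx mcp start ...` command found in status text, if any."""
    for line in status_output.splitlines():
        stripped = line.strip()
        start = stripped.find("ktx mcp start")
        if start != -1:
            return stripped[start:]
    return None


# Hypothetical sample text; real `ktx status` output may look different.
sample = (
    "Agent integration: not ready\n"
    "  ktx mcp start --project-dir /home/user/analytics\n"
)
command = find_mcp_start_command(sample)
if command:
    print(f"Run this before opening your agent client: {command}")
```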

The pattern worth noticing

The irony here is instructive: ktx has some of the best agent-facing documentation I've reviewed, and its one gap is on the human path — a step the team likely stopped seeing because they already know it's there. That's the most common failure I find across audits: docs that are perfectly correct for someone who already understands the product, which is exactly the one audience that doesn't need them.

If you write docs, the cheapest test available is still the best one: hand your quickstart to someone (or some agent) with zero context, and watch where they stall.

Next review

If you're building an AI product with public documentation, I'd love to include it — drop a link in the comments or reach out. Same format every time: one thing done well, one gap, one rewrite.

I'm Virginia Mwega — a full-stack & AI engineer specializing in documentation engineering and developer experience. Portfolio: virginia-mwega.vercel.app · Writing: virginiamwega-com.vercel.app · Connect: https://www.linkedin.com/in/virginia-mwega-196309313/

I Reviewed 10 AI Startup Documentation Sites. Here Are the 7 Mistakes I Kept Seeing.

Virginia Nyambura Mwega — Mon, 06 Jul 2026 18:12:05 +0000

Documentation is often the first product a developer experiences.

Before they see your architecture, your engineering culture, or your code quality, they interact with your documentation.

If that experience is confusing, incomplete, or frustrating, many developers won't make it to their first successful API request.

Over the past few weeks, I've been reviewing documentation from AI startups to understand what makes onboarding smooth—and where teams unintentionally create friction.

While every company is different, the same patterns kept appearing.

1. Quickstarts assume too much

Many Quickstarts jump straight into code without explaining prerequisites.

Developers are expected to know:

Where to get an API key
Which SDK to install
Required environment variables
Authentication steps

A Quickstart should help someone go from zero to a successful request with as little guesswork as possible.

2. Error messages aren't documented

Developers don't judge documentation by how it works when everything goes right.

They judge it by how quickly it helps them recover when something goes wrong.

Instead of only listing error codes, explain:

Why the error happens
Common causes
How to fix it
What to try next

Good troubleshooting documentation builds confidence.

3. Examples are incomplete

Too many examples leave out important details.

Developers shouldn't have to infer:

Authentication headers
Environment variables
Request payloads
Expected responses

Examples should be copy, paste, run, and understand.

4. There's no clear learning path

Documentation often feels like a collection of pages instead of a guided journey.

A better structure might look like this:

Quickstart
Core Concepts
Tutorials
API Reference
Advanced Guides
Troubleshooting

When developers always know what to read next, they make progress faster.

5. Documentation isn't written for AI-assisted development

Today, developers increasingly rely on AI coding assistants.

That means documentation should also be easy for AI tools to interpret.

This includes:

Consistent headings
Clear terminology
Structured examples
Explicit parameter descriptions
Predictable page organization

Well-structured documentation helps both humans and AI systems retrieve accurate information.

6. Missing "next steps"

A successful API call shouldn't be the end of the journey.

Guide developers toward meaningful progress:

Build a chatbot
Upload files
Authenticate users
Stream responses
Explore advanced features

Momentum matters.

7. Documentation is treated as an afterthought

The strongest engineering teams treat documentation as part of the product—not something that's written after the code ships.

Documentation improves:

Developer experience
Product adoption
Support efficiency
Customer success
Developer trust

It's not just a support resource.

It's a growth asset.

My Challenge

I'm starting a public challenge where I review AI startup documentation and share practical improvements.

Each review includes:

One thing the team did well
One improvement opportunity
A rewritten example or suggestion

The goal isn't to criticize.

It's to learn, contribute, and help create better developer experiences.

If you're building an AI product with public documentation, I'd love to review it.

I'm always looking for examples of thoughtful documentation—and opportunities to make good docs even better.

Thanks for reading!

The Documentation System Every Startup Should Have

Virginia Nyambura Mwega — Wed, 01 Jul 2026 16:43:37 +0000

Key Takeaways

Most documentation advice is written by people whose job is only documentation. This isn't that. This is the system I run as a solo founder shipping production AI, where docs that rot don't get caught by a docs team — they get caught by me, at the worst possible time.
A documentation system isn't a wiki. It's four separable things (Diátaxis: tutorial, how-to, reference, explanation) plus a mechanism that keeps them from drifting out of sync with the code.
The differentiator between docs that survive and docs that rot isn't writing quality. It's whether documentation is coupled to the same PR that changes the behavior, and whether it fails loudly when it goes stale.
Reference docs should be generated from typed code wherever possible, so the one category most likely to drift can't.

Why you should ignore most documentation advice (including some of mine)

Nearly every "how to do documentation" post is written by a technical writer — someone whose entire role is documentation, on a team where documentation is a budgeted function. That advice is often good, but it quietly assumes a luxury most early startups don't have: a person whose job is to notice when the docs stopped being true.

I don't have that person. I'm a solo founder building a production AI system (FamNest, a multi-agent wellness coach). When my documentation drifts, no docs team catches it. I catch it — usually at 11pm, when I'm trying to remember how my own auth boundary works and the note I wrote three months ago is now confidently wrong.

So the system I'm about to describe isn't optimized for a documentation department. It's optimized for the constraint most startups actually have: nobody's full-time job is keeping the docs honest, so the system itself has to do it. That constraint changes the answer, and it's why this isn't a generic "write good docs" post.

The four things a documentation system actually is

The single most useful mental model I've found is Diátaxis — the framework created by Daniele Procida and now used by Cloudflare, Gatsby, and a chunk of the Python ecosystem. Its core claim is deceptively simple: there aren't many kinds of documentation, there are exactly four, they serve different needs, and mixing them is the most common cause of docs that feel wrong even when every individual sentence is correct.

Tutorials — learning-oriented. A beginner, on rails, following exact steps to a known destination. Your quickstart is a tutorial. Its job is confidence, not completeness.
How-to guides — task-oriented. An already-competent user trying to accomplish a specific real-world goal. "How to configure X." Not teaching — helping someone who already knows the basics get a thing done.
Reference — information-oriented. The dry, factual, complete description of the machinery. Your API reference, your schema, your config options. Consulted, not read.
Explanation — understanding-oriented. The "why" behind a design decision, read away from the keyboard. Why the architecture is shaped this way. This is the category startups skip and later regret skipping, because it's the institutional memory of why.

The reason this matters practically: the most common documentation failure isn't a missing page, it's a tutorial that derailed into reference — the getting-started guide that stops to enumerate every flag, until a beginner who wanted to feel capable instead feels buried. Once you can name the four types, you can see the mixing, and most "our docs are confusing" problems turn out to be mixing problems, not writing problems.

The part nobody frames as engineering: keeping it from rotting

Here's the thing the framework alone won't save you from, and where the solo-builder constraint actually produces a better answer than the well-resourced one.

Documentation doesn't fail at the moment of writing. It fails silently, later, every time the code changes and the sentence describing it doesn't. A page that's 80% accurate is arguably worse than a missing page — a missing page sends you to read the source; a subtly-wrong page confidently teaches you the wrong thing and costs you an hour before you stop trusting it.

The teams that beat this don't beat it with discipline. Discipline doesn't scale and it definitely doesn't survive a solo founder's worst week. They beat it structurally, by borrowing the property that makes tests valuable: failing loudly instead of silently. Concretely, the system I run has three rules:

1. Docs live in the same PR as the behavior they describe. If a change to an endpoint doesn't touch the docs, that's a visible gap in the diff, not an invisible one discovered months later. This is the single highest-leverage practice and it costs nothing but a PR-template checkbox.

2. Reference is generated, not hand-written, wherever the code is typed. Reference is the category most prone to drift — it's the most detailed and the most tightly coupled to code. So it's the category you should hand-write the least. My Next.js API routes are typed; the OpenAPI reference is generated from those types, which means the one kind of documentation most likely to lie can't, because it has no independent existence to drift away from. (I've got a full walkthrough of the endpoint-to-OpenAPI generation as its own post — linked at the end.)

3. Some docs should be executable, so staleness breaks a build. A code example that's just text will rot. A code example that actually runs in CI fails the day it stops being true. The closer your documentation sits to something that executes, the less it can quietly lie to you. My agent-boundary contracts are validated with Zod, which means the "shape" documentation and the runtime enforcement are the same artifact — the docs can't disagree with the code because they are the code.
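As a rough illustration of rule 1, a CI step could flag any change that touches API code without touching docs. This is a sketch under assumptions, not the setup described above: the `app/api/` and `docs/` prefixes and the function name are hypothetical.

```python
from typing import Iterable

# Hypothetical repository layout; adjust the prefixes to your own project.
BEHAVIOR_PREFIXES = ("app/api/",)
DOCS_PREFIXES = ("docs/",)


def docs_missing(changed_paths: Iterable[str]) -> bool:
    """Return True when a change touches API code but no docs.

    >>> docs_missing(["app/api/users/route.ts"])
    True
    >>> docs_missing(["app/api/users/route.ts", "docs/users.md"])
    False
    >>> docs_missing(["README.md"])
    False
    """
    paths = list(changed_paths)
    touches_behavior = any(p.startswith(BEHAVIOR_PREFIXES) for p in paths)
    touches_docs = any(p.startswith(DOCS_PREFIXES) for p in paths)
    return touches_behavior and not touches_docs
```

In CI, the input could be the output of `git diff --name-only` against the base branch. The docstring examples also run under `python -m doctest`, which is a small instance of rule 3: if the function's behaviour changes, the examples fail instead of quietly going stale.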

The startup-specific version of this

If you're pre-first-hire, here's the minimum viable documentation system, in priority order — not "write all four types of everything," which is how docs projects die, but the smallest thing that compounds:

One tutorial — a quickstart that actually works if a stranger copy-pastes it. Test it by having someone who isn't you run it. If they get stuck, that's a bug, not their fault.
Generated reference — wire your typed API to auto-generated reference so the most drift-prone category maintains itself.
A running "why" doc — one explanation file where you write down why you made the load-bearing decisions, as you make them. This is the cheapest thing on the list and the one you'll be most grateful for in a year, because it's the context you can't reconstruct later.
How-to guides on demand — write one every time you answer the same question twice. Let real friction, not a content plan, decide what gets written.

Notice what's not on the list: exhaustive coverage. A documentation system isn't measured by how many pages it has. It's measured by whether the pages it has stay true, and whether the structure tells a reader which kind of page they're looking at.

Why I'm the one writing this

Because I live on the wrong side of the constraint that makes this advice real. I don't document my systems because a style guide told me to — I document them because I'm the one who has to reload the entire context of my own architecture at unpredictable moments, alone, and the difference between a system I documented well and one I didn't is whether that reload takes five minutes or two hours.

That's also, not coincidentally, exactly the position a new engineer joining your startup is in. Every argument I just made for why I document my own system for my own future self is the same argument for why your docs determine how fast your next hire becomes useful. The solo constraint just makes the cost legible sooner, because the person paying it is me, tonight.

A documentation system every startup should have, in one sentence: four clearly separated types, reference generated from typed code, docs coupled to the PRs that change behavior, and the "why" written down as you go. Everything else is a variation on those four moves.

I build FamNest and write about the engineering underneath it. The specific pieces of the system above have their own deep-dives — endpoint-to-OpenAPI generation, validating agent boundaries with Zod, and treating Supabase RLS as living auth documentation — linked as I publish them in this series.

Evaluating Agents With an LLM-as-Judge Harness (Without Kidding Yourself About It)

Virginia Nyambura Mwega — Wed, 01 Jul 2026 15:54:18 +0000

Key Takeaways

You can't unit-test a coach agent the way you test a pure function — the output is non-deterministic and "good" is a judgment call, not an assertion.
An LLM-as-judge harness lets you grade a whole test set automatically against a rubric, which is the only way solo-scale eval stays sustainable.
But the judge is itself a fallible model. If you don't design around its known biases — position, verbosity, self-preference, and quiet drift when the judge model updates — you build a green dashboard that means nothing.
The mitigations that actually work are mechanical, not prompt-magic: shuffle order on every pairwise call, pin the judge version, keep a small human-labelled anchor set, and re-check the judge against it.

The problem I actually had

FamNest's coach agent generates responses to parents — check-ins, encouragement, the occasional gentle redirect. I have a growing pile of these interactions, and every time I change a prompt, swap a model, or adjust the pipeline, I need to know one thing: did I just make it better or worse?

For normal code, that's what tests are for. I change something, the suite runs, red or green, done. But there's no assertEqual for "was this an empathetic, useful response to a tired parent." The output changes every run even at temperature zero-ish, and the quality bar is a human judgment, not a fixed string. Two responses can be worded completely differently and both be good. One can match my "expected output" word for word and still be worse than a version that didn't.

So the honest options were: read every response by hand every time I change something (does not scale past about week two), or build a harness where a model grades the outputs against a rubric. I built the harness. Then I spent an uncomfortable amount of time learning all the ways a harness like that can lie to you.

What the harness actually is

At its simplest, it's a loop:

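def evaluate(test_cases, coach_agent, judge):
    """Run each test case through the coach, then have the judge grade the response."""
    results = []
    for case in test_cases:
        # Generate a fresh response; the wording can differ from run to run.
        response = coach_agent.generate(case.input, case.context)
        # COACH_RUBRIC holds the scored dimensions described below (defined elsewhere).
        verdict = judge.score(
            rubric=COACH_RUBRIC,
            user_message=case.input,
            response=response,
        )
        # Keep the reasoning next to the score so a drop can be diagnosed later.
        results.append({
            "case_id": case.id,
            "score": verdict.score,
            "reasoning": verdict.reasoning,
        })
    return results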

COACH_RUBRIC is the part that matters. It's not "rate this 1–10" — that produces mush. It's specific, scored dimensions: does the response acknowledge the actual thing the user said (not a generic version of it)? Does it avoid giving medical advice? Is it the right length for the moment, or is it a wall of text at someone who's exhausted? Each dimension gets a small integer and a one-line justification, and the harness keeps the justification, not just the number — because when the score drops, the reasoning is what tells me whether the agent regressed or the judge just had an opinion.
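One way to picture a verdict built from scored dimensions is the sketch below. The dimension names, the 0 to 2 scale, and the class names are assumptions made for illustration, not the actual FamNest rubric.

```python
from dataclasses import dataclass
from typing import List

MAX_PER_DIMENSION = 2  # assumed scale: 0 = fails, 1 = partial, 2 = meets the bar


@dataclass
class DimensionScore:
    name: str           # e.g. "acknowledges_user"
    score: int          # small integer on the assumed 0-2 scale
    justification: str  # one line explaining the score


@dataclass
class Verdict:
    dimensions: List[DimensionScore]

    @property
    def score(self) -> float:
        """Sum of dimension scores, normalised to the range 0-1."""
        if not self.dimensions:
            raise ValueError("a verdict needs at least one dimension")
        total = sum(d.score for d in self.dimensions)
        return total / (MAX_PER_DIMENSION * len(self.dimensions))

    @property
    def reasoning(self) -> str:
        """Every justification, kept alongside the number."""
        return "; ".join(
            f"{d.name}={d.score}: {d.justification}" for d in self.dimensions
        )


verdict = Verdict([
    DimensionScore("acknowledges_user", 2, "names the missed nap specifically"),
    DimensionScore("no_medical_advice", 2, "none given"),
    DimensionScore("length_fits", 1, "slightly long for the moment"),
])
```

Because `reasoning` travels with `score`, a drop on one dimension can be read directly instead of inferred from a single number.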

That last distinction is the whole game.

The part where I stopped trusting the judge

Here's the failure mode that made me rebuild the whole thing. You score helpfulness at 0.91 all quarter. Then the judge model ships a minor version bump. The mean shifts a few points, the distribution narrows, and your CI gate keeps passing — so you don't look. Weeks later the agent does something genuinely bad and the eval never flagged it, because the judge changed underneath you and the number stopped meaning what it meant the day you set the threshold.

The research here is not subtle, and it's worth internalizing before you trust a single green checkmark. A 2026 RAND study that stress-tested judges across multiple benchmarks concluded that no judge was uniformly reliable, and frontier models exceeded 50% error rates on hard bias benchmarks. Consistency broke on inputs as trivial as formatting changes and paraphrasing. Separately, the classic MT-Bench work found that in pairwise comparisons, the answer in the first slot wins something like 10–15 points more often purely because it's first — position bias, nothing to do with quality.

(Worth noting the field isn't static: some 2026 reproductions find position bias has shrunk to near-negligible on current-gen models under a clean pairwise rubric, while verbosity bias stays small. Which is exactly the point — the biases move as the models move, so you measure them yourself rather than trusting a blog post from last year, including this one.)

The named biases I actually design around:

Position bias — in any A-vs-B comparison, slot order can decide the winner. Mitigation: run every pairwise comparison twice with the order flipped, and only count it if the verdict is stable across both.
Verbosity bias — longer answers tend to score higher even at matched quality. Mitigation: put length appropriateness in the rubric as an explicit dimension so the judge is scoring it on purpose instead of rewarding it by accident.
Self-preference — a judge from the same model family as the candidate tends to over-score it. Mitigation: don't let the judge be the same model as the agent it's grading. (In my case the coach runs on one provider; I judge with a different family entirely.)
Calibration drift — the silent one above. Mitigation below, because it's the most important.
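The position-bias mitigation can be sketched as follows. The judge interface here (a callable that returns "first" or "second") is a hypothetical simplification, and the two toy judges exist only to show the behaviour.

```python
from typing import Callable, Optional

# A judge takes (prompt, first_answer, second_answer) and returns "first" or "second".
PairwiseJudge = Callable[[str, str, str], str]


def stable_preference(judge: PairwiseJudge, prompt: str, a: str, b: str) -> Optional[str]:
    """Return "a" or "b" only if the verdict survives swapping the slots."""
    forward = judge(prompt, a, b)   # a sits in the first slot
    backward = judge(prompt, b, a)  # b sits in the first slot
    winner_forward = "a" if forward == "first" else "b"
    winner_backward = "b" if backward == "first" else "a"
    if winner_forward == winner_backward:
        return winner_forward
    return None  # the slot, not the content, decided: discard this comparison


# Toy judges, for illustration only.
def always_first(prompt: str, first: str, second: str) -> str:
    return "first"  # pure position bias


def prefers_shorter(prompt: str, first: str, second: str) -> str:
    return "first" if len(first) <= len(second) else "second"
```

With `always_first`, the two runs disagree and the comparison is dropped. With `prefers_shorter`, the same answer wins in both orders, so the verdict counts.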

The anchor set is the thing that keeps you honest

The single highest-leverage piece of the harness isn't the judge prompt. It's a small set — a few dozen cases — that I labelled by hand, carefully, once. Good responses, bad responses, and the genuinely ambiguous ones. That's my ground truth.

Every time I run the harness, it grades the anchor set too. If the judge's scores on those known cases still line up with my human labels, I trust its scores on the rest of the run. If the judge drifts on the anchor set — because the model updated, because I tweaked the rubric, because Mercury is in retrograde — I find out immediately, on cases where I already know the right answer, instead of finding out in production on a case where I don't.
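A minimal version of that anchor check might look like the sketch below. The function names, the exact-match default, and the 0.9 threshold are assumptions; the right tolerance depends on the rubric's scale.

```python
from typing import Dict


def anchor_agreement(human: Dict[str, int], judge: Dict[str, int], tolerance: int = 0) -> float:
    """Fraction of anchor cases where the judge is within `tolerance` of the human label."""
    if not human:
        raise ValueError("anchor set is empty")
    agreed = sum(
        1
        for case_id, label in human.items()
        if case_id in judge and abs(judge[case_id] - label) <= tolerance
    )
    return agreed / len(human)


def judge_is_trustworthy(human: Dict[str, int], judge: Dict[str, int],
                         min_agreement: float = 0.9) -> bool:
    """Gate the rest of the run on the judge still matching the hand labels."""
    return anchor_agreement(human, judge) >= min_agreement


# Hypothetical labels on a 0-2 scale.
human_labels = {"a1": 2, "a2": 0, "a3": 1, "a4": 2}
judge_scores = {"a1": 2, "a2": 0, "a3": 2, "a4": 2}
```

Here the judge disagrees on one of four anchors, so agreement is 0.75 and the run would be flagged before any of its other scores are trusted.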

This is the same instinct as the deterministic crisis floor I wrote about earlier in this series: the most consequential check should be the one that's simplest and least dependent on a model behaving. For safety, that's regex. For evaluation, it's a few dozen examples I graded with my own eyes and refuse to let a model overrule silently.

What I'd tell someone starting this

Build the harness — reading every output by hand does not scale, and an LLM judge genuinely does correlate with human preference well enough to be useful. But treat the judge as a component that can fail, not an oracle. Pin its version so it doesn't change without you deciding. Shuffle order on comparisons. Keep the reasoning, not just the score. And keep a small hand-labelled anchor set that you re-check every single run, because a green eval dashboard that you never validate is worse than no dashboard — it's the confidence of measurement without the substance, and that's exactly the kind of thing that ships a broken agent with a clean conscience.

The harness didn't remove my judgment from the loop. It moved my judgment to where it's cheap and permanent — a small set of examples I curate once — instead of where it's expensive and forgettable, which is re-reading a hundred responses every time I touch a prompt.

Part of an ongoing series documenting FamNest's architecture. Earlier posts cover the deterministic crisis floor and the multi-agent coach pipeline. Next: how we test a non-deterministic system end to end.

The doc was the spec: building a safety layer for an AI app that talks to tired parents

Virginia Nyambura Mwega — Tue, 30 Jun 2026 10:23:52 +0000

"What happened when I stopped designing my AI coach for flexibility and started designing it to be predictable enough to write down."

Last month I deleted three features from my AI app. Nobody asked me to.
I removed them because I couldn't explain them clearly. And I've started treating that as a hard signal: if I can't describe how a feature behaves in plain language a tired parent would understand, it isn't finished — it's a liability.
This post is about the engineering that came out of that rule. It's specific, it has code, and it's the part of building an AI product that nobody puts in the demo.
The context
I'm building FamNest, an AI wellness tool for parents. The user I design for is the one refreshing a feeding tracker at 3am, googling "is this normal" with one hand while holding a sleeping baby with the other.
That user changes your engineering priorities. Latency matters less than predictability. "Clever" is a risk, not a feature. And the moment your AI says something subtly wrong to someone in that state, you've lost them — and possibly done real harm.
So the question I kept returning to wasn't "what can the model do?" It was "what is the model never allowed to improvise?"
Problem 1: a coach with too many answers
In the first version, my AI coach could respond to the same question a dozen slightly different ways. It felt smart. Flexible, even.
It was actually a nightmare. When I sat down to write the documentation, I couldn't promise a parent what the system would do. And if I can't promise it, why should they trust it?
The fix was to put a second model in front of the user — a safety reviewer — and give it exactly three outcomes. Not a confidence score. Not a freeform critique. Three discrete verdicts.
```typescript
type Verdict = "ok" | "revise" | "crisis";

interface ReviewResult {
  verdict: Verdict;
  reason: string; // for logging, never shown to the user
  revisedDraft?: string; // only present when verdict === "revise"
}
```
The coach generates a draft. The reviewer judges it. The whole contract fits in four lines, and that's the point: I can document it in one sentence. The reviewer either approves the draft, rewrites it, or escalates to a crisis response.

```typescript
async function generateReply(userMessage: string): Promise<string> {
  const draft = await coach.respond(userMessage);
  const review = await safetyReviewer.review(userMessage, draft);

  switch (review.verdict) {
    case "ok":
      return draft;
    case "revise":
      return review.revisedDraft ?? draft;
    case "crisis":
      return CRISIS_RESPONSE; // see below: this is not generated
  }
}
```
Problem 2: the crisis floor
Here's the decision I'm most sure about. When a message hints at a crisis, the system does not generate a response. It returns fixed, human-reviewed text. Every time. Byte for byte.
```typescript
// This string is reviewed by a human, version-controlled,
// and documented word-for-word in the user-facing docs.
const CRISIS_RESPONSE = `
It sounds like you're going through something really hard right now. You don't have to handle this alone. If you're in immediate danger, please contact your local emergency number...
`.trim();
```
This is the opposite of how we usually think about generative AI. The whole appeal of an LLM is that it composes something new. But the situation where a parent most needs the response to be right is exactly the situation where I least want the model to be creative.
I call this the crisis floor: a deterministic baseline that the system can never generate its way below. The model can make the experience better above the floor. It is never allowed to touch what happens at it.
A subtle but important detail: the floor is checked before the clever path, not after.
```typescript
async function handleMessage(userMessage: string): Promise<string> {
  // Deterministic guardrail runs first, independent of the LLM.
  if (crisisFloor.matches(userMessage)) {
    return CRISIS_RESPONSE;
  }
  return generateReply(userMessage);
}
```
If the generative pipeline is down, malfunctioning, or hallucinating, the floor still holds. It doesn't depend on the part of the system most likely to fail.
Problem 3: what happens when the provider falls over
LLM APIs go down. Rate limits hit. A region has a bad day. For most apps that's an annoyance. For an app a parent leans on at 3am, a blank screen is a broken promise.
So every external dependency has a known fallback, and "the model is unavailable" is a documented state, not an exception that bubbles up to a stack trace.
```typescript
async function coachRespond(message: string): Promise<string> {
  try {
    return await llm.complete(buildPrompt(message));
  } catch (err) {
    logger.warn("LLM provider unavailable, degrading gracefully", { err });
    // A safe, generic, pre-written reply. Not an error.
    return GRACEFUL_FALLBACK;
  }
}
```
The user gets a calm, honest message instead of a spinner that never resolves. Graceful degradation isn't a nice-to-have here. It's part of the trust contract.
The thing I actually learned
The pattern underneath all three: documentation wasn't something I wrote after the engineering. It was the engineering.
The doc became the spec. When a behavior couldn't be written down cleanly — three verdicts, fixed crisis text, a named fallback state — that was the signal the design was wrong, not the writing.
It turns out "predictable enough to document" is a great design constraint. It pushes you toward small, discrete contracts and away from the kind of open-ended cleverness that demos well and ships badly.
I think a lot of AI products are shipping behavior they can't fully explain and quietly calling that confusion "intelligence." For the users I build for, clarity is the feature.
If you can't document it, you don't understand it yet.

I write about building and documenting production AI systems. If you've shipped a guardrail or fallback you're proud of, or one that bit you, I'd genuinely like to hear about it in the comments.

Writing API docs an AI agent can actually consume

Virginia Nyambura Mwega — Mon, 29 Jun 2026 15:01:58 +0000

Your docs are written for a human who can guess. The agent calling your API can't.

I found the gap the embarrassing way: one of my own agents couldn't call one of my own APIs.

FamNest runs a small agent graph — a router hands a parent's message to a retriever, the retriever grounds an answer in a vetted corpus, a coach agent (Groq, Llama 3.3 70B) drafts a reply, and a safety-reviewer agent signs off before anything reaches a human. The agents call internal endpoints the same way a third-party integrator would. And one afternoon the coach kept constructing malformed calls to the retrieval endpoint — wrong field name, missing a required filter, occasionally inventing a parameter that never existed.

The endpoint wasn't broken. The docs were. They were written for a human who could fill in the blanks, and the agent had no blanks to fill — only the tokens I gave it.

That's the whole lesson, and it's worth more than a trend.

The trend everyone's shipping — and where it stops

If you've touched developer tooling in 2026 you've watched llms.txt go from a September-2024 proposal to a routine piece of infrastructure. It's a Markdown file at your domain root that points AI systems at the content that matters, with a one-line summary of each link. Mintlify, Fern, and GitBook ship one-click toggles for it. IDE agents — Cursor, Windsurf, Claude Code, Copilot — fetch it when you point them at a docs site, then pull only the linked pages they need before writing code. LangChain even shipped an MCP server (mcpdoc) that hands those files to host apps as a fetch_docs tool.

People are calling this the Business-to-Agent web, and the framing is right: just as you once needed a site humans could navigate, you now need surfaces agents can route on. Ship the llms.txt. It's a half-day of work.

But notice what it actually solves: discovery. It answers "which page matters." It says nothing about the harder question that broke my coach agent:

Once the agent has found your endpoint, can it call it correctly on the first try — with no human in the loop to recover when your prose is ambiguous?

That's not a discovery problem. That's a contract problem. And it's where most docs quietly fail.

An agent is a different kind of reader

A human reading your docs brings a lifetime of priors. They infer that userId is probably a UUID. They notice the example uses snake_case and adjust. They hit a 400, shrug, read the error, and try again. If they're really stuck they ask a teammate. Human docs can be good enough because the human closes the gap.

An agent closes nothing. It has your tokens and a probability distribution. It pattern-matches structure: if your example shows one field, it produces one field; if you describe an error in a sentence, it treats the sentence as flavor, not as a branch it has to handle. Ambiguity doesn't make an agent cautious — it makes it confident and wrong.

So the doc stops being documentation and becomes the interface itself. Everything the agent will ever know about your endpoint is in the text. If a fact isn't on the page, it doesn't exist.

That reframes what a good endpoint doc has to contain. Here are the five things mine were missing.

A typed schema, not a prose description

Prose says: "Send the user's question and an optional list of topic tags."

A schema says exactly what's allowed, and an agent can pattern-match it without guessing:

```ts
import { z } from "zod";

// retrieve: request
const RetrieveRequest = z.object({
  query: z.string().min(1).max(2000),
  topics: z
    .array(z.enum(["sleep", "feeding", "behavior", "self_care"]))
    .max(4)
    .default([]),
  topK: z.number().int().min(1).max(10).default(5),
});
```

The difference is the enum, the bounds, the default. "Optional list of topic tags" let my coach invent "toddler_tantrums". z.enum([...]) spells out the valid set, so an invented tag fails validation instead of slipping through. Publish the schema, not a paragraph about the schema.

Exhaustive examples — including the unhappy paths

Agents copy examples. Whatever you show is what you'll get back. If your only example is the happy path, the happy path is the only thing the model knows how to produce.

So I document the empty result and the rejected request as first-class examples, not footnotes:

```jsonc
// 200: results found
{ "matches": [{ "id": "c_18", "score": 0.82, "text": "..." }], "truncated": false }

// 200: valid query, nothing relevant (NOT an error)
{ "matches": [], "truncated": false }

// 422: query failed validation
{ "error": "validation_error", "field": "topics", "detail": "unknown topic 'toddler_tantrums'" }
```

The middle case is the one humans leave out and agents desperately need. "No matches" is a normal outcome, not a failure — and if you don't say so, the agent will treat an empty array as a bug and retry forever.
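A minimal sketch of how a caller might branch on those three documented shapes. The type names and the `interpretRetrieve` helper are hypothetical, introduced only for illustration; the field names come from the examples above.

```typescript
// Hypothetical types modeled on the three documented responses.
type Match = { id: string; score: number; text: string };
type RetrieveOk = { matches: Match[]; truncated: boolean };
type RetrieveError = { error: string; field?: string; detail?: string };

type Outcome =
  | { kind: "grounded"; matches: Match[] }
  | { kind: "no_context" } // valid query, nothing relevant: a normal outcome
  | { kind: "fix_input"; field: string; detail: string };

function interpretRetrieve(
  status: number,
  body: RetrieveOk | RetrieveError
): Outcome {
  if (status === 200 && "matches" in body) {
    // An empty array is a documented success, so it never triggers a retry.
    return body.matches.length > 0
      ? { kind: "grounded", matches: body.matches }
      : { kind: "no_context" };
  }
  if (status === 422 && "error" in body) {
    return {
      kind: "fix_input",
      field: body.field ?? "unknown",
      detail: body.detail ?? "",
    };
  }
  // Anything the doc does not describe is surfaced rather than guessed at.
  throw new Error(`undocumented response: ${status}`);
}
```

Because the empty case has its own branch, "nothing relevant" becomes a value the caller handles instead of a condition it retries.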

An error taxonomy with recovery semantics

Most docs describe errors. Agents need to be told what to do about them. "Returns 429 when rate-limited" is a description. An agent needs a decision.

So I ship a table where every row ends in an action:

| Code | `error` | Cause | What the caller should do |
| --- | --- | --- | --- |
| 422 | `validation_error` | Bad input | Fix the field named in `detail`; do not retry unchanged |
| 429 | `rate_limited` | Too many calls | Back off using `Retry-After`; retry the same body |
| 503 | `model_unavailable` | Upstream LLM down | Fall back to cached/deterministic path; do not retry tightly |
| 409 | `idempotency_conflict` | Key reused, different body | Stop; surface to a human |

A human reads that table for reference. An agent reads it as a control-flow graph. The "do not retry" cells are the ones that stop a confused agent from hammering your endpoint at 3am.
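Read as control flow, the table could translate into something like the sketch below. The `Action` type and `nextAction` function are hypothetical names. Two choices here are assumptions rather than part of the documented contract: waiting one second when `Retry-After` is missing, and escalating on any undocumented status.

```typescript
// Each documented row becomes exactly one branch with one action.
type Action =
  | { do: "fix_and_resend"; hint: string }
  | { do: "retry_same_body"; afterMs: number }
  | { do: "use_fallback" }
  | { do: "escalate_to_human" };

function nextAction(
  status: number,
  retryAfterSeconds?: number,
  detail: string = ""
): Action {
  switch (status) {
    case 422:
      // Bad input: resend only after changing the field named in `detail`.
      return { do: "fix_and_resend", hint: detail };
    case 429:
      // Assumption: wait 1 second if the Retry-After header is absent.
      return { do: "retry_same_body", afterMs: (retryAfterSeconds ?? 1) * 1000 };
    case 503:
      // Upstream model is down: take the cached/deterministic path.
      return { do: "use_fallback" };
    case 409:
      return { do: "escalate_to_human" };
    default:
      // Assumption: anything undocumented stops the loop instead of retrying.
      return { do: "escalate_to_human" };
  }
}
```

There is no generic "retry" default: an unrecognized status stops the loop rather than feeding it.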

An explicit determinism / idempotency contract

The single most useful sentence I added to any endpoint doc was: "Is it safe to retry this?"

Agents retry. Networks are flaky, and a retried call that isn't idempotent is how you double-charge a card or send two replies to one anxious parent. For anything with a side effect, I now state the contract in the doc itself:

Idempotency: required for POST /coach/reply and all payment routes. Send an `Idempotency-Key` header (UUID). Replays with the same key + same body return the original result. Same key + different body → 409. Retrieval (GET /retrieve) is side-effect-free and safe to retry freely.

That paragraph is the difference between a retry loop that heals and one that does damage. It's also the kind of thing humans infer and agents simply won't — there is no prior that tells a model your payment webhook is replay-safe. You have to say it.
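To make the replay rules concrete, here is a small in-memory sketch of the server side of that contract. It is illustrative only: `IdempotencyStore` is a hypothetical name, a real service would persist keys with an expiry, and `JSON.stringify` stands in for a proper body hash (it is sensitive to key order).

```typescript
type Result = { status: number; body: unknown };
type Stored = { bodyHash: string; result: Result };

class IdempotencyStore {
  private seen = new Map<string, Stored>();

  // Runs `run` at most once per key; replays return the stored result.
  handle(key: string, body: unknown, run: () => Result): Result {
    const bodyHash = JSON.stringify(body); // stand-in for a real hash
    const prior = this.seen.get(key);
    if (prior !== undefined) {
      return prior.bodyHash === bodyHash
        ? prior.result // same key + same body: original result, no side effect
        : { status: 409, body: { error: "idempotency_conflict" } };
    }
    const result = run();
    this.seen.set(key, { bodyHash, result });
    return result;
  }
}
```

A retried call with the same key and body never reaches `run` a second time, which is the property that makes a retry loop safe around a side effect.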

Auth and limits as data, not folklore

"Authenticated requests only, please don't spam it" is not a contract. Scopes, the exact header, the rate limit, and the window belong in the doc as structured fields the agent can read and self-regulate against:

```plaintext
Auth: Bearer token in Authorization. Scope coach:read for /retrieve.
Limits: 60 req/min/token. On exceed → 429 + Retry-After (seconds).
```

Now the agent can pace itself instead of discovering your limit by tripping it.
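As a rough illustration of self-regulation, a caller could turn the published limit into a sliding-window pacer like the one below. The `Pacer` class is a hypothetical helper, and the clock is passed in as a number so the logic stays deterministic and testable.

```typescript
class Pacer {
  private sent: number[] = [];
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Milliseconds to wait before the next call fits inside the window.
  delayFor(now: number): number {
    // Drop timestamps that have already left the window.
    this.sent = this.sent.filter((t) => now - t < this.windowMs);
    if (this.sent.length < this.limit) return 0;
    const oldest = this.sent[0] ?? now;
    return oldest + this.windowMs - now;
  }

  // Call this each time a request is actually sent.
  record(now: number): void {
    this.sent.push(now);
  }
}

// The documented limit: 60 requests per minute per token.
const retrievePacer = new Pacer(60, 60_000);
```

The pacer only needs the two numbers from the doc, which is the point of publishing them as data.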

Keep it honest: one source of truth

All of this rots the moment your docs and your code disagree — and an agent can't smell a stale doc the way a human can. So the contract has to be generated, not hand-maintained.

My chain is boring on purpose: the typed Next.js handler validates with the Zod schema, the schema generates the OpenAPI spec, and my llms.txt links to the generated reference. The schema is the only thing I edit. The doc can't drift, because the doc is downstream of the thing that's actually true.

```plaintext
Zod schema ──► request validation (runtime)
    │
    └────────► OpenAPI spec ──► /llms.txt entry ──► agent reads it
```
If the schema changes, every artifact downstream changes with it. The doc lies only if the code lies.

The test that actually proves it

Here's the check I run before I trust an endpoint doc: give a fresh model only the doc — no codebase, no context — and ask it to (a) construct a valid call and (b) handle a seeded error. If it can't, the gap is in the doc, not the model. I keep these as tiny snapshot tests next to the endpoint, so a doc regression fails CI like any other bug.

When my coach agent broke, this test would have caught it in seconds. The retrieval doc, fed to a cold model, produced exactly the malformed call I saw in production — because the ambiguity was right there on the page.
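A sketch of what such a check might look like, under heavy assumptions. The model is injected as a plain synchronous function (`ModelFn`, hypothetical) so the harness can run against a stub; a real model call would be asynchronous. The validator is hand-written to mirror the published request schema rather than reusing Zod, only to keep the example dependency-free.

```typescript
// Hypothetical: any function that takes a prompt and returns model text.
type ModelFn = (prompt: string) => string;

const TOPICS: string[] = ["sleep", "feeding", "behavior", "self_care"];

// Mirrors the documented request schema (query, topics, topK).
function isValidRetrieveBody(body: unknown): boolean {
  if (typeof body !== "object" || body === null) return false;
  const b = body as Record<string, unknown>;
  const query = b["query"];
  const topics = b["topics"];
  const topK = b["topK"];
  if (typeof query !== "string" || query.length < 1 || query.length > 2000) {
    return false;
  }
  if (topics !== undefined) {
    if (!Array.isArray(topics) || topics.length > 4) return false;
    if (!topics.every((t) => typeof t === "string" && TOPICS.includes(t))) {
      return false;
    }
  }
  if (topK !== undefined) {
    if (typeof topK !== "number" || !Number.isInteger(topK)) return false;
    if (topK < 1 || topK > 10) return false;
  }
  return true;
}

// Gives the model only the doc and checks the call it constructs.
function docProducesValidCall(doc: string, model: ModelFn): boolean {
  const raw = model(
    "Using only the documentation below, reply with the JSON body " +
      "for one valid retrieve call.\n\n" + doc
  );
  try {
    return isValidRetrieveBody(JSON.parse(raw));
  } catch {
    return false; // non-JSON output counts as a failed call
  }
}
```

Run against a real model, a `false` here points at the doc first: the model saw nothing else.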

This isn't new. It's just a new reader.

None of this is novel discipline. "Unambiguous, complete, verifiable" is the spine of ISO/IEC/IEEE 29148, the requirements standard I write everything against. What's changed is that the consumer of your interface is no longer guaranteed to be a person who can paper over a vague spec. Half your integrators in 2026 are agents, and they read your docs literally, exhaustively, and without charity.

Ship the llms.txt so agents can find you. But the thing that makes them succeed once they arrive is older and less glamorous: a contract precise enough that it can't be misread. The agentic web doesn't need prettier docs. It needs docs that can't be guessed wrong.

I build FamNest, an AI wellness coach for busy parents, and write about production reliability and safety for multi-agent systems. If you're documenting an agent-callable API and want a second pair of eyes on the contract, my notes are open.

Your WHERE clause is not a security boundary (multi-tenant RAG with pgvector + RLS)

Virginia Nyambura Mwega — Sun, 28 Jun 2026 19:01:51 +0000

TL;DR: app-layer filtering is a single point of failure. Push tenant isolation into Postgres with RLS — and watch out for the security definer trap in your vector-search function.

My app is an AI wellness coach for parents. Every user's data is about the most private thing they have: how they're actually coping. Their check-ins, their bad nights, the things they'd never say out loud. The whole product runs on retrieval — when someone talks to the coach, the system pulls their relevant history out of a vector store and grounds the response in it.

Which means the single scariest bug I can imagine isn't a crash. It's user A asking a question and the retrieval quietly returning a snippet of user B's private history. No error. No stack trace. Just one person's worst night surfacing in another person's conversation.

In a multi-tenant app, that bug is one forgotten line of code away at all times. Here's how I make sure it can't happen — and the part of it that no tutorial warns you about.

The obvious fix is a single point of failure

The instinctive way to keep tenants apart is to filter in your query:

```sql
select * from embeddings
where user_id = $1          -- the current user's id
order by embedding <=> $2   -- the query embedding
limit 5;
```

This works. It also relies on me, a tired human, remembering to write where user_id = ... on every single query that ever touches that table, forever, across every feature, including the ones I haven't built yet.

That's not a security boundary. That's a promise. And the failure mode of a promise is that the day you forget it (or a new query path skips it, or a refactor drops it), there is nothing underneath to catch you. The app returns the wrong tenant's data and looks completely healthy doing it. That's exactly the shape of bug I caught in my own audit once. I didn't want to rely on never making it again.

Isolation belongs in the database, not the application

The fix is to move the boundary down a layer, into Postgres itself, using Row Level Security. RLS lets the database enforce which rows a user is even allowed to see, regardless of what the query asks for.

```sql
alter table embeddings enable row level security;

create policy "Users read their own embeddings"
on embeddings for select
using (auth.uid() = user_id);
```

Now the rule isn't "please remember to filter." The rule is: this user physically cannot select another user's rows, because the database won't return them. A query that forgets the filter still comes back isolated, because the isolation isn't in the query anymore — it's in the table.

This is defense in depth, the same principle security people have leaned on for decades. The app-layer filter is still there as the first line. RLS is the backstop that makes a mistake in that first line survivable instead of catastrophic. One layer can fail without the whole guarantee failing.

The pgvector trap nobody mentions

Here's where it gets interesting, and where I'd put real money that most "build RAG on Supabase" tutorials are quietly broken.

Vector similarity search is usually wrapped in a SQL function — a match_documents-style RPC so you can call it cleanly from your app and keep the ANN index happy:

```sql
create function match_user_docs(query_embedding vector(1536), match_count int)
returns setof embeddings
language sql
as $$
  select *
  from embeddings
  order by embedding <=> query_embedding
  limit match_count;
$$;
```

The footgun is the function's security mode. If you mark a function security definer — and a lot of copy-pasted vector-search examples do, to smooth over permissions — it runs with the definer's privileges and bypasses the caller's RLS entirely. You carefully set up Row Level Security on the table, then call it through a function that turns that protection off, and you'd never know: the function returns results, the app works in the demo, and every tenant's vectors are quietly reachable through that one call.

The fix is boring and important: keep the search function security invoker so the caller's RLS still applies, or — if it genuinely has to be security definer — filter by auth.uid() inside the function and pin the search_path. The point is to never let the convenience wrapper become the hole in the wall you just built.

One more wrinkle: filtering and approximate search fight a little

There's a subtle performance interaction worth knowing. pgvector's index (HNSW or IVFFlat) does approximate nearest-neighbor search — it returns roughly the closest vectors, fast. Add RLS on top, and the isolation filter trims that candidate set down to the current tenant's rows.

If you ask the index for the global top 5 and then isolation removes the ones that aren't yours, you can end up with fewer than 5 results — or, in a busy table, none. The pattern is to over-fetch: ask the index for more candidates than you need, so that after isolation you still have enough to ground a good answer. It's a small thing that only shows up under real multi-tenant load, which is exactly why it's worth saying out loud.
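The effect is easy to reproduce with a toy model of post-filtering. The sketch below is a simulation, not pgvector itself: `searchWithPostFilter` and the `Candidate` type are hypothetical, and `candidatePool` stands in for however many rows the approximate index hands back before the tenant filter runs (for HNSW in pgvector, that pool is governed by settings such as `hnsw.ef_search`).

```typescript
type Candidate = { id: string; userId: string; distance: number };

// Models post-filtering: the index yields its nearest `candidatePool` rows
// across ALL tenants, then the tenant filter and the final limit apply.
function searchWithPostFilter(
  rows: Candidate[],
  userId: string,
  k: number,
  candidatePool: number
): Candidate[] {
  const nearest = [...rows]
    .sort((a, b) => a.distance - b.distance)
    .slice(0, candidatePool);
  return nearest.filter((r) => r.userId === userId).slice(0, k);
}

// Tenant B owns the five globally closest vectors; tenant A's come after.
const demoRows: Candidate[] = [
  { id: "b1", userId: "B", distance: 0.1 },
  { id: "b2", userId: "B", distance: 0.2 },
  { id: "b3", userId: "B", distance: 0.3 },
  { id: "b4", userId: "B", distance: 0.4 },
  { id: "b5", userId: "B", distance: 0.5 },
  { id: "a1", userId: "A", distance: 0.6 },
  { id: "a2", userId: "A", distance: 0.7 },
  { id: "a3", userId: "A", distance: 0.8 },
];

// A pool of 5 leaves tenant A with nothing; a pool of 8 recovers its top 2.
const starved = searchWithPostFilter(demoRows, "A", 2, 5);
const overFetched = searchWithPostFilter(demoRows, "A", 2, 8);
```

The isolation is identical in both calls; only the size of the pool before filtering changes, which is why the symptom looks like a relevance problem rather than a security one.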

The takeaway

The model gets all the attention, but the part of an AI app that has to be certain is rarely the model. Here, it's the data boundary. And a boundary you enforce in application code is only as strong as your memory on your worst day.

So I push it down to where it can't be forgotten. The app filters because it should. The database isolates because it must. One forgotten where clause should be a non-event, not a breach — and the only way to guarantee that is to stop trusting the query and start trusting the table.