DEV Community: Incultnito LLC

Four pgvector patterns that kept our RAG SaaS on one Postgres

pengspirit — Fri, 12 Jun 2026 08:17:45 +0000

Most RAG tutorials stop at embedding <=> query. They show you the operator, return five rows, and call it retrieval. Then you ship it, a second customer signs up, and you discover the four things the tutorial skipped: indexing on a column that's half-NULL, the distance-vs-similarity sign flip, the dimension lock-in, and the function that quietly bypasses your tenant isolation.

I run a Discord-native Company Brain. Teams /save docs, links, and PDFs; /ask returns a grounded, cited answer. The whole vector store is one Supabase Postgres with pgvector — no Pinecone, no second system to bill and reconcile. Here are four patterns that made that survive contact with real workspaces.

## The problem: a vector column is not a vector store

A vector(1536) column gives you storage and a distance operator. It does not give you fast search, correct ranking, dimension discipline, or multi-tenant safety. Those are four separate decisions, and getting any one wrong shows up as a production bug, not a compile error.

Our artifacts table holds every chunk a workspace has ingested. The relevant columns:


  CREATE TABLE artifacts (
    id           uuid    PRIMARY KEY DEFAULT gen_random_uuid(),
    workspace_id uuid    NOT NULL REFERENCES workspaces(id) ON DELETE CASCADE,
    content      text    NOT NULL,
    content_hash text    NOT NULL,        -- sha256, short-circuits re-embedding
    metadata     jsonb   NOT NULL DEFAULT '{}'::jsonb,
    -- 1536 dims = OpenAI text-embedding-3-small.
    embedding    vector(1536),            -- NULLABLE on purpose. See pattern 1.
    created_at   timestamptz NOT NULL DEFAULT now(),
    -- source_type and external_id are among the columns omitted from this excerpt.
    UNIQUE (workspace_id, source_type, external_id)
  );

  Note that embedding is nullable. Artifacts arrive un-embedded — the web service writes the row instantly, a worker embeds it async on a */15 cron. That single nullable column drives the first pattern.
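
The content_hash column hints at a cheap guard the worker can run before it calls the embedding API. A minimal sketch, assuming Node's built-in crypto module; `contentHash`, `needsEmbedding`, and the `StoredArtifact` shape are hypothetical names for illustration, not the production worker:

```typescript
import { createHash } from "node:crypto";

// Hypothetical row shape: only the fields this check needs.
interface StoredArtifact {
  content_hash: string;
  embedding: number[] | null;
}

// sha256 hex digest of the chunk text, matching the content_hash column comment.
function contentHash(content: string): string {
  return createHash("sha256").update(content, "utf8").digest("hex");
}

// Re-embed only when the row is new, still un-embedded, or its text changed.
function needsEmbedding(existing: StoredArtifact | null, content: string): boolean {
  if (existing === null) return true;
  if (existing.embedding === null) return true;
  return existing.content_hash !== contentHash(content);
}
```

With a check like this, re-saving an unchanged document costs one hash instead of one embedding call.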

## Pattern 1: Index only the rows that have a vector

  The naive HNSW index covers the whole column. But half our rows are NULL at any given moment during backfill, and building HNSW graph edges for NULL rows is wasted work and wasted index size.

  The fix is a partial index with a WHERE predicate:

  -- Partial HNSW index: only index rows that actually have an embedding.
  -- Keeps the index small during async backfill and skips HNSW build cost
  -- on NULL rows entirely.
  CREATE INDEX artifacts_embedding_hnsw_idx
    ON artifacts
    USING hnsw (embedding vector_cosine_ops)
    WHERE embedding IS NOT NULL;

  Two choices worth defending:

  - HNSW over IVFFlat. IVFFlat needs training data to build its lists — you have to populate the table first, then build the index, and rebuild as the distribution shifts. HNSW builds incrementally as rows arrive. For a product where every workspace starts at zero artifacts and grows continuously, "no training step, no rebuild" wins. We left m and ef_construction at pgvector defaults and wrote a note to tune them once we have real latency data — premature index tuning is just a guess with extra steps.
  - vector_cosine_ops, not the default. The operator class in the index must match the distance operator your query uses. Index on vector_cosine_ops, query with <=> (cosine distance). Mismatch them and Postgres silently does a sequential scan — correct answers, terrible latency, no error to tell you why.
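
Because a mismatch fails silently, it can help to keep the operator-class pairing in one place, for example in a migration lint. A small sketch: the three pairings are pgvector's own, while `indexCanServe` is a hypothetical helper:

```typescript
// pgvector operator classes and the distance operator each one serves.
const OPERATOR_FOR_OPCLASS = {
  vector_l2_ops: "<->", // Euclidean (L2) distance
  vector_ip_ops: "<#>", // negative inner product
  vector_cosine_ops: "<=>", // cosine distance
} as const;

type OpClass = keyof typeof OPERATOR_FOR_OPCLASS;

// True when an ORDER BY using `queryOperator` can be served by an index
// built with `indexOpClass`.
function indexCanServe(indexOpClass: OpClass, queryOperator: string): boolean {
  return OPERATOR_FOR_OPCLASS[indexOpClass] === queryOperator;
}
```

Running EXPLAIN on the real query remains the authoritative check; this only catches the obvious mismatch early.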

## Pattern 2: The sign flip (distance is not similarity)

pgvector's <=> returns cosine distance: 0 is identical, 2 is opposite. Humans, dashboards, and threshold configs think in similarity: 1 is identical, 0 is unrelated. The conversion is similarity = 1 - distance, and you have to apply it consistently in three places or your ranking inverts.
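
To make the sign flip concrete, here is a plain TypeScript sketch of the same arithmetic outside the database. The function names are illustrative. Only the definition of cosine distance and the 1 - distance conversion come from the text:

```typescript
// Cosine distance as pgvector's <=> defines it: 1 minus cosine similarity.
function cosineDistance(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// The conversion used for filtering and output.
function toSimilarity(distance: number): number {
  return 1 - distance;
}

// Rank the way the index does: ascending distance, converting only for display.
function rank(query: number[], docs: Record<string, number[]>) {
  return Object.entries(docs)
    .map(([name, vec]) => ({ name, distance: cosineDistance(query, vec) }))
    .sort((x, y) => x.distance - y.distance)
    .map((r) => ({ name: r.name, similarity: toSimilarity(r.distance) }));
}
```

Sorting ascending by distance and descending by similarity give the same order. The difference is only which expression the index can serve.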

  Here's the actual retrieval RPC. Watch where <=> appears raw (ordering) versus converted (filtering and output):

  CREATE OR REPLACE FUNCTION match_artifacts(
    p_workspace_id  uuid,
    query_embedding vector(1536),
    match_count     int   DEFAULT 5,
    min_similarity  float DEFAULT 0.15
  )
  RETURNS TABLE (id uuid, content text, similarity float)
  LANGUAGE sql
  SECURITY INVOKER                      -- critical. See pattern 4.
  AS $$
    SELECT
      a.id,
      a.content,
      1 - (a.embedding <=> query_embedding) AS similarity   -- distance -> similarity
    FROM artifacts a
    WHERE a.workspace_id = p_workspace_id
      AND a.embedding IS NOT NULL
      AND 1 - (a.embedding <=> query_embedding) >= min_similarity  -- filter in similarity space
    ORDER BY a.embedding <=> query_embedding                       -- order in DISTANCE space (ASC)
    LIMIT match_count;
  $$;

  The ORDER BY stays in distance space and sorts ascending — smallest distance first — because that's the direction the HNSW index understands. Flip it to ORDER BY similarity DESC and you get the same logical result but you've handed the planner an expression it can't satisfy from the index, so it sorts in memory after a scan. Order by the raw operator; convert only for the human-facing columns.

  Our retrieval defaults — match_count = 5, min_similarity = 0.15 — came out of tuning against our own corpus, not a paper. Higher k bloats the model's context window without lifting answer quality; a lower threshold lets junk through and the model starts hedging. They're defaults, not laws: the RPC takes both as parameters so we can override per workspace.

## Pattern 3: Dimensions are a one-way door, so plan the migration before you need it

vector(1536) is a hard constraint: 1536 is the output dimension of OpenAI's text-embedding-3-small. If you swap to a model with a different dimension count, the column type no longer fits. Even at the same dimension count, vectors from different models are not comparable, so every existing embedding is garbage against the new query vectors.

  We evaluated text-embedding-3-large (3072-dim) in week two. The numbers:

| Knob | -small (chosen) | -large |
|---|---|---|
| Dimensions | 1536 | 3072 |
| Top-5 recall (our eval) | baseline | ~3 pts higher |
| Cost per token | 1× | 6× |
| pgvector storage per row | 1× | 2× |
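
The storage row can be checked with arithmetic. pgvector documents a `vector` value as 4 bytes per dimension plus 8 bytes of overhead. The sketch below applies that formula and ignores row and index overhead. `vectorBytes` is an illustrative name:

```typescript
// pgvector stores a `vector` as 4 bytes per dimension plus an 8-byte header.
function vectorBytes(dimensions: number): number {
  return 4 * dimensions + 8;
}

const smallBytes = vectorBytes(1536); // text-embedding-3-small
const largeBytes = vectorBytes(3072); // text-embedding-3-large
const storageRatio = largeBytes / smallBytes; // just under 2
```

One more thing worth verifying before any move to 3072 dimensions: pgvector's HNSW and IVFFlat indexes on the `vector` type are documented as supporting up to 2,000 dimensions, so a 3072-dim column would likely need `halfvec` or a different indexing plan.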

  Three points of recall for six times the cost and double the storage did not clear the bar at our scale. Tuning min_similarity lifted precision more cheaply than the extra dimensions did. But the real lesson is the migration rule we wrote down so a future me doesn't fight the column type at 2am:

> When we change embedding models, the new vector goes in a new column (embedding_v2 vector(3072)), backfilled and dual-read behind a flag. Never an in-place ALTER of the existing column.

Adding a column lets old and new embeddings coexist while you backfill millions of rows and verify recall didn't regress. Altering the column in place takes an ACCESS EXCLUSIVE lock on the whole table and, once committed, gives you no rollback. Pick the boring migration.
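
On the application side, "dual-read behind a flag" might look like the sketch below. Everything here is an assumption for illustration: the flag, the backfill counters, and the `EmbeddingTarget` shape are hypothetical, and only the column names and dimensions come from the rule above:

```typescript
// Hypothetical description of one embedding column and the model that fills it.
interface EmbeddingTarget {
  column: "embedding" | "embedding_v2";
  model: string;
  dimensions: number;
}

const V1: EmbeddingTarget = { column: "embedding", model: "text-embedding-3-small", dimensions: 1536 };
const V2: EmbeddingTarget = { column: "embedding_v2", model: "text-embedding-3-large", dimensions: 3072 };

// Reads switch to v2 only when the flag is on AND the backfill has finished.
function pickReadTarget(useV2: boolean, backfilledRows: number, totalRows: number): EmbeddingTarget {
  const backfillComplete = totalRows > 0 && backfilledRows >= totalRows;
  return useV2 && backfillComplete ? V2 : V1;
}

// The query vector must come from the same model as the column it is compared to.
function assertQueryMatches(target: EmbeddingTarget, queryVector: number[]): void {
  if (queryVector.length !== target.dimensions) {
    throw new Error(`expected ${target.dimensions} dims for ${target.column}, got ${queryVector.length}`);
  }
}
```

Turning the flag off sends reads back to the old column immediately, which is the rollback an in-place ALTER cannot offer.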

## Pattern 4: The function that bypasses your tenancy (SECURITY INVOKER, always)

This one nearly made me quit for the day. Our entire multi-tenant model is Row Level Security keyed on workspace_id: a policy on artifacts means a query physically cannot return another tenant's rows. Airtight, except for a function declared SECURITY DEFINER. That function runs with its owner's privileges, and when the owner is the table owner or another role that bypasses RLS (the usual setup when the same role creates both tables and functions), the policies are skipped entirely.

  A vector-search RPC is exactly the kind of function people reflexively mark SECURITY DEFINER (it's calling into internals, feels like it should be privileged). Do that, and match_artifacts happily returns chunks across workspace boundaries even though RLS is enabled on the table. The leak doesn't throw — it just quietly serves the wrong tenant's data.

  Two defenses, both in the RPC above:

  1. SECURITY INVOKER — the function runs as the caller, so RLS policies apply inside it exactly as they would on a direct query.
  2. An explicit WHERE a.workspace_id = p_workspace_id predicate — belt and suspenders. RLS is the wall; the predicate is the lock. If a future migration ever fumbles a policy, the predicate still scopes the result.

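And because the only caller is the worker (holding a service-role key on a trusted server), we revoke the function from public roles entirely. Keep in mind that Supabase's service role itself bypasses RLS, so for that caller the explicit workspace_id predicate is what actually scopes the result: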

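  -- Only the service-role worker needs this. Anon/authenticated never call it.
  -- Functions are executable by PUBLIC by default, so revoke that grant too.
  REVOKE EXECUTE ON FUNCTION match_artifacts FROM PUBLIC, anon, authenticated;
  GRANT EXECUTE ON FUNCTION match_artifacts TO service_role;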

  The TypeScript side stays boring, which is the point — all the safety lives in the database:

  const { data: matches } = await supabase.rpc("match_artifacts", {
    p_workspace_id: workspaceId,     // scoped by the caller, enforced by RLS + predicate
    query_embedding: queryVector,    // 1536-dim, same model as ingest
    match_count: 5,
    min_similarity: 0.15,
  });

  Write the cross-tenant leak test before the retrieval feature, not after. I wrote it after, which is how I learned the difference between DEFINER and INVOKER the expensive way.
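
The shape of that leak test is simple enough to sketch. The version below uses in-memory stand-ins so it runs anywhere. A real test would seed two workspaces in a test database and call the RPC as each one. All names here are hypothetical:

```typescript
interface SeededRow {
  id: string;
  workspaceId: string;
  content: string;
}

const seededRows: SeededRow[] = [
  { id: "a1", workspaceId: "ws-a", content: "pricing for workspace A" },
  { id: "b1", workspaceId: "ws-b", content: "pricing for workspace B" },
];

// Stand-in for a correctly scoped search: explicit workspace predicate.
function searchScoped(workspaceId: string): SeededRow[] {
  return seededRows.filter((r) => r.workspaceId === workspaceId);
}

// Deliberately broken variant, mimicking a function that ignores tenancy.
function searchUnscoped(_workspaceId: string): SeededRow[] {
  return seededRows;
}

// Returns the ids of any rows that belong to a workspace other than the caller's.
function findLeaks(callerWorkspaceId: string, results: SeededRow[]): string[] {
  return results.filter((r) => r.workspaceId !== callerWorkspaceId).map((r) => r.id);
}
```

The assertion that matters is that `findLeaks` comes back empty for every caller, including callers whose query text matches another tenant's content almost exactly.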

## Takeaways

  - Partial-index your vector column when embeddings arrive async — don't pay HNSW cost on NULL rows.
  - HNSW when rows stream in continuously (no training step); match the operator class to your distance operator or you'll silently seq-scan.
  - Convert distance to similarity only for filtering and output — keep ORDER BY in raw distance space so the index does the sorting.
  - Dimensions are immutable: new model means new column, dual-read, backfill — never in-place ALTER.
  - SECURITY INVOKER plus an explicit tenant predicate. A DEFINER vector RPC is a cross-tenant leak with a clean stack trace.

  Keeping embeddings inside the same Postgres that enforces RLS is what makes one operator (me) able to run multi-tenant RAG without a second system to secure. That's the bet behind Acortia (https://acortia.com) — the brain lives where the tenancy is already enforced.

How I built a RAG-grounded Discord brain in 5 weeks (solo, ESL, no funding)

pengspirit — Wed, 03 Jun 2026 06:19:18 +0000

Day 14. The fourth time.

A user in our Discord asked, for the fourth time that week, the same question. Same wording, almost. The first three answers were buried somewhere in a thread, a pinned message, and a Notion page nobody bookmarked. A mod typed it out again. I watched it happen, opened Cursor, and started typing.

That's the moment Acortia became a product instead of a side note.

I'm Peng. Solo founder. Non-native English speaker. ESL teacher in Taipei by day, building backend software at night and on weekends. No funding. No team. No accelerator yet — YC F26 application is in. Five weeks ago I committed to building Acortia: a Discord-native Company Brain that answers /ask <q> with a grounded, cited answer pulled from whatever the server has /saved. $99/month. Mid-June launch.

This is the build log. Real numbers, real bugs, real tradeoffs. No hype.

The problem, stated honestly

Discord communities accumulate institutional knowledge the way a cluttered desk accumulates receipts: faster than anyone can file it. Threads scroll past. Pinned messages cap at 50. Search is keyword-based and stops at the channel boundary. New members ask questions that were answered six months ago in a thread that's now archived.

The cost isn't dramatic — it's grinding. Mods burn out re-answering. Founders re-explain pricing. Engineers re-link the same architecture diagram. Knowledge exists; it just isn't retrievable.

I looked at the existing options. Notion + Discord bots: too much manual upkeep. Generic AI chatbots: hallucinate confidently with no source. Custom in-house RAG: out of reach for the average community. The gap was a thin, opinionated tool that lived where the conversation already happened.

The shape of the fix

Acortia is three slash commands and a cron job.

- /save <url>: ingest a doc, a thread, a webpage, a PDF. Worker chunks it, embeds it, stores it.
- /ask <q>: retrieve top-k chunks via cosine similarity, ground a model response in them, return the answer with inline citations to the source artifacts.
- /sources: list what the server has ingested. Audit trail.

Install: OAuth the bot, click through to api.acortia.com/install, claim the workspace via magic-link email. Thirty seconds end-to-end if the operator already has Discord admin.

That's the whole product surface. Everything else is plumbing.

Architecture, in four layers

Discord is the surface. Three slash commands registered globally, one OAuth flow, webhook-style interaction endpoints handled by the Render web service.

Supabase is the brain. Seven tables. Postgres with the pgvector extension. Row Level Security keyed to workspace_id. A single SQL RPC, match_artifacts, does the vector search. RLS means a misrouted query physically cannot return another workspace's data — the database itself enforces tenancy.

Render is the muscle. A web service handles interactive Discord requests with a < 3s deadline. A worker process handles the slow path: fetch URL, extract text (PDF connector for application/pdf, readability-style extractor for HTML), chunk, embed, write. A */15 cron sweeps queued ingest jobs and re-runs anything that timed out.

Stripe is the till. Checkout session for the $99/mo plan, webhook handler with idempotency (every event ID is upserted into stripe_events_seen before any side effect runs), portal link for self-serve management. Promo codes managed in the Stripe dashboard.

Here's the SQL signature of the only RPC the app calls for retrieval. Stylized — the live function has more telemetry, but this is the shape:

-- match_artifacts: cosine similarity search scoped by workspace
create or replace function match_artifacts(
  query_embedding vector(1536),
  workspace_id_input uuid,
  match_count int default 5,
  min_similarity float default 0.15
)
returns table (
  artifact_id uuid,
  chunk_id uuid,
  content text,
  source_url text,
  similarity float
)
language sql stable
as $$
  select
    a.id as artifact_id,
    c.id as chunk_id,
    c.content,
    a.source_url,
    1 - (c.embedding <=> query_embedding) as similarity
  from chunks c
  join artifacts a on a.id = c.artifact_id
  where a.workspace_id = workspace_id_input
    and 1 - (c.embedding <=> query_embedding) >= min_similarity
  order by c.embedding <=> query_embedding
  limit match_count;
$$;

Two numbers in there worth naming: match_count = 5 and min_similarity = 0.15. I tuned both empirically against my own corpus. Higher k bloats the context window without lifting answer quality; lower threshold lets junk through and the model hedges. Lower k makes confident answers brittle when the corpus is sparse. These are the knobs you'll want to revisit per-customer in v2.

A slash command, end to end

Here's /ask, sanitized and stylized. The real handler has more error wrapping and a deferred-response pattern for Discord's 3-second deadline, but the spine looks like this:

// apps/web/src/routes/interactions/ask.ts (illustrative)
import { embed } from "../../lib/embed";
import { supabase } from "../../lib/supabase";
import { groundAnswer } from "../../lib/llm";

export async function handleAsk(interaction: DiscordInteraction) {
  const question = interaction.data.options[0].value as string;
  const workspaceId = await resolveWorkspace(interaction.guild_id);

  const queryEmbedding = await embed(question);

  const { data: matches, error } = await supabase.rpc("match_artifacts", {
    query_embedding: queryEmbedding,
    workspace_id_input: workspaceId,
    match_count: 5,
    min_similarity: 0.15,
  });

  if (error) throw error;
  if (!matches?.length) {
    return reply(interaction, "No grounded sources found. Try `/save` first.");
  }

  const answer = await groundAnswer(question, matches);
  await logQuery(workspaceId, question, matches, answer); // queries.metadata

  return reply(interaction, formatWithCitations(answer, matches));
}

The logQuery call writes to queries.metadata — a JSON column that captures which artifacts were retrieved, the similarity scores, latency, and the model used. Telemetry isn't an afterthought; it's the only way to tell, six weeks in, whether the threshold of 0.15 is still right for a given customer.
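
As an illustration of what such a payload could contain, here is a sketch of a builder for it. The field names and the `RetrievedMatch` shape are assumptions. The post only says the column captures retrieved artifacts, similarity scores, latency, and the model:

```typescript
// Hypothetical subset of a match row: just what telemetry needs.
interface RetrievedMatch {
  artifact_id: string;
  similarity: number;
}

interface QueryMetadata {
  artifact_ids: string[];
  similarities: number[];
  top_similarity: number | null; // null when nothing cleared the threshold
  latency_ms: number;
  model: string;
  min_similarity: number; // the threshold in force for this query
}

function buildQueryMetadata(
  matches: RetrievedMatch[],
  latencyMs: number,
  model: string,
  minSimilarity: number
): QueryMetadata {
  const similarities = matches.map((m) => m.similarity);
  return {
    artifact_ids: matches.map((m) => m.artifact_id),
    similarities,
    top_similarity: similarities.length > 0 ? Math.max(...similarities) : null,
    latency_ms: latencyMs,
    model,
    min_similarity: minSimilarity,
  };
}
```

Recording the threshold next to the scores is what makes a later question like "how many answers sat within 0.05 of the cutoff?" answerable from the table alone.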

Three decisions I'd defend at a YC interview

1. pgvector over Pinecone

Pinecone is excellent. It's also a second system to bill, monitor, and reconcile RLS against. Acortia's whole tenancy model is workspace_id on every table. If embeddings live in a separate vector DB, I have to re-implement multi-tenant isolation there and trust two systems instead of one.

pgvector keeps embeddings inside the same Postgres that enforces RLS. The retrieval call is a single RPC. Cost at MVP scale: included in Supabase free tier. The day I outgrow it, the migration to a dedicated vector DB is a few hours, not a rewrite.

2. Magic-link claim over OAuth-only

Discord OAuth tells me who installed the bot. It does not tell me which email owns the workspace for billing. I needed a second factor: a magic link sent to the operator's email so the Stripe Checkout, the invoice, and the workspace ownership all land on the same identity.

The decision inside that decision was implicit-flow vs PKCE for the magic-link callback. I went with implicit. PKCE is more secure on paper, but it requires client-side code verifier storage, which on Discord's embedded browser context is fragile. Implicit + short-lived (10 min) one-time codes + server-side verification gave me a flow that worked first try on iOS Discord, Android Discord, and desktop. The tradeoff: implicit is theoretically replayable in the 10-minute window. Mitigation: one-time-use enforced server-side, codes invalidated on first verification.
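
The mitigation (short-lived, one-time codes invalidated on first verification) can be sketched with an in-memory store. This is an illustration under stated assumptions, not the production flow: `ClaimCodeStore` is hypothetical, the clock is passed in so the logic is testable, and a real implementation would do the check and invalidation as one atomic database update:

```typescript
const CODE_TTL_MS = 10 * 60 * 1000; // 10-minute window from the post

interface PendingCode {
  workspaceId: string;
  expiresAt: number;
  used: boolean;
}

class ClaimCodeStore {
  private codes = new Map<string, PendingCode>();

  issue(code: string, workspaceId: string, now: number): void {
    this.codes.set(code, { workspaceId, expiresAt: now + CODE_TTL_MS, used: false });
  }

  // Returns the workspace id on the first valid use, otherwise null.
  redeem(code: string, now: number): string | null {
    const pending = this.codes.get(code);
    if (!pending || pending.used || now > pending.expiresAt) return null;
    pending.used = true; // invalidate on first verification
    return pending.workspaceId;
  }
}
```

A replayed code inside the 10-minute window returns null because the first redemption already marked it used.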

I'll revisit PKCE in v2 when I have time to test the embedded-browser edge cases properly.

3. Render over Vercel

Vercel is faster to ship for stateless routes. Acortia is not stateless. The ingest pipeline can run longer than typical serverless function timeouts, PDFs in particular. I needed a long-running worker process and a cron. Render gives me both with one config file and one bill. Web + worker + cron on Render hobby tier costs less than a sandwich per month at MVP scale.

The day I need autoscale across regions, I'll consider Fly. Not before.

What broke: the workspace claim race

Day 20. A test user installed Acortia in two Discord servers using the same email, within about ninety seconds of each other. Both installs triggered a workspace-claim flow. Both wrote to the workspaces table. The second write silently overwrote the first install's billing pointer. The user ended up with one Stripe customer and two Discord servers, but only one of the servers was correctly linked.

The bug had two causes braided together. The naive implementation was:

// Buggy original — two installs collide
const existing = await supabase
  .from("workspaces")
  .select("id")
  .eq("guild_id", guildId)
  .maybeSingle();

if (existing.data) {
  await supabase.from("workspaces").update({ ... }).eq("id", existing.data.id);
} else {
  await supabase.from("workspaces").insert({ ... });
}

Classic check-then-act. Two concurrent claims both saw existing.data === null, both ran insert, the unique constraint caught one and the other won the race. The losing install thought it succeeded because the response came from a different row.

The fix was atomic upsert plus moving email collection to claim time, not install time:

// Day-20 fix — atomic, idempotent
const { data, error } = await supabase
  .from("workspaces")
  .upsert(
    {
      guild_id: guildId,
      claim_email: null, // email collected later via magic link
      claim_token: generateToken(),
      claim_expires_at: new Date(Date.now() + 10 * 60 * 1000),
    },
    { onConflict: "guild_id", ignoreDuplicates: false }
  )
  .select()
  .single();

if (error) throw error; // surface failures instead of assuming the claim landed

The atomic upsert means the database decides the winner. The deferred email means the second install doesn't even try to write the email column until the magic link is verified, which by then has a unique session token to disambiguate. I also added a trigger to fail-loud if claim_email ever gets overwritten on a row that already has one — defense in depth.
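
The difference between the two approaches is easier to see with the interleaving written out. The sketch below is a synchronous in-memory model, not Supabase code: a Map stands in for the workspaces table, and the "concurrent" steps are ordered by hand:

```typescript
type WorkspaceRow = { guildId: string; claimToken: string };

// Check-then-act with the bad interleaving spelled out: both installs read
// before either writes, so both decide to insert.
function checkThenActInterleaved(guildId: string): { insertsAttempted: number } {
  const table = new Map<string, WorkspaceRow>();
  const aSeesExisting = table.has(guildId); // install A checks
  const bSeesExisting = table.has(guildId); // install B checks before A writes
  let insertsAttempted = 0;
  if (!aSeesExisting) {
    insertsAttempted++;
    table.set(guildId, { guildId, claimToken: "A" });
  }
  if (!bSeesExisting) {
    insertsAttempted++; // in a real table this hits the unique constraint or clobbers A
  }
  return { insertsAttempted };
}

// Atomic upsert: the check and the write are one operation, so nothing can
// slip in between them.
function upsertWorkspace(table: Map<string, WorkspaceRow>, row: WorkspaceRow): "inserted" | "updated" {
  const outcome = table.has(row.guildId) ? "updated" : "inserted";
  table.set(row.guildId, row);
  return outcome;
}
```

In Postgres the upsert's atomicity comes from INSERT ... ON CONFLICT. The Map version only mirrors the outcome.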

Stripe webhooks got the same treatment because they always should:

// Webhook idempotency: claim the event ID atomically before any side effect.
// Assumes a unique constraint on stripe_events_seen.event_id.
const { error: claimError } = await supabase
  .from("stripe_events_seen")
  .insert({ event_id: event.id });

if (claimError?.code === "23505") {
  // Unique violation: this event was already recorded, so acknowledge and stop.
  return new Response("ok", { status: 200 });
}
if (claimError) throw claimError;

await handleStripeEvent(event); // runs at most once per event ID

Idempotent webhooks are non-negotiable. Stripe will retry. You will get duplicates. Plan for it on Day 1, not Day 30.
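
One way to keep the webhook path free of its own check-then-act race is to record the event ID first and treat a uniqueness conflict as "already handled". A small in-memory model of that ordering follows. `SeenEvents` and `processOnce` are hypothetical names, and the Set stands in for a table with a unique constraint on the event ID:

```typescript
class SeenEvents {
  private ids = new Set<string>();

  // Mirrors an INSERT under a unique constraint: true only for the call
  // that records the id first.
  claim(eventId: string): boolean {
    if (this.ids.has(eventId)) return false;
    this.ids.add(eventId);
    return true;
  }
}

function processOnce(seen: SeenEvents, eventId: string, handler: () => void): "handled" | "duplicate" {
  if (!seen.claim(eventId)) return "duplicate";
  handler();
  return "handled";
}
```

The tradeoff of claiming first is that a handler crash after the claim leaves the event marked as seen, so retries are skipped. If that matters, store a status alongside the ID and let retries resume unfinished events.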

What I didn't ship

Three things were on the board and got cut. Each cut was deliberate.

Slack adapter. I scaffolded a platform-adapter abstraction on Day 8 — the idea was that /save and /ask would be platform-agnostic and Slack would be a second surface. The scaffolding is in the repo. I did not build the Slack OAuth flow, slash command registration, or interaction handler. Reason: Slack outreach pre-launch was zero signal. Discord operators were actively asking for the tool. Building Slack would have cost a week and shipped a feature for a customer I didn't have. Parked until live revenue justifies it.

Notion connector. Considered. Killed. The use case I imagined — pull Notion pages as artifacts — is well-served by users copy-pasting URLs into /save. The MCP route through Claude Desktop is enough for the operator's personal workflow. A first-party Notion connector adds OAuth, page-permission edge cases, and a separate sync cron. Not worth the complexity at MVP.

Pipedream MCP custom server. I spent a few hours wiring Pipedream as a generic connector tier. Backend was healthy, auth worked, but the abstraction was leaking into the slash-command UX. I cut it and routed power-user workflows through Claude Desktop's MCP instead. Acortia stays focused. Operators who want orchestration use Claude Desktop and call Acortia as a tool.

What I'd do differently

Telemetry first. I added queries.metadata on Day 6, which was correct, but I didn't build a dashboard around it until Week 4. For the first three weeks I was debugging retrieval quality by reading raw Postgres rows. A 30-minute Metabase dashboard would have saved hours of squinting. If you're building RAG: instrument retrieval before you instrument anything else. You can't tune what you can't see.

Try it

Mid-June 2026 launch. Soft-live now for beta operators.

Install: api.acortia.com/install
Domain: acortia.com

Promo for readers of this post: BETA-FREE-30D — 100% off the first month, 10 redemptions, expires 2026-06-30 23:59 UTC. After that the price is $99/month flat. No per-seat. No usage tier. One Discord server, one bill.

If you operate a Discord community, run a developer relations team, or moderate a paid creator server: this was built for you. If you don't, the architecture above is open notes — steal whatever's useful.

Footer: the founder context

I'm in Taipei. I teach English to fund this build. I am not a native English speaker and I rewrite half of what I publish three times before it reads cleanly. Every line of Acortia was written between lesson plans and weekend mornings. No team. No accelerator yet. No outside capital.

What I'm proving with this build: a solo non-US founder can ship a credible B2B SaaS product end-to-end — auth, billing, RAG, multi-tenant data isolation, idempotent webhooks, a real cron pipeline — in five weeks of nights-and-weekends time, on a stack that costs less than a streaming subscription to run.

If that's interesting to you, the install link is above. If you want to talk shop, I'm on Discord and X under the same handle.

Brief. Concept. Preview. Ship.

6 of 6 official MCP servers cluster at 56–60/100 on schema-description density

pengspirit — Wed, 27 May 2026 07:10:39 +0000

After ten days of running the v1.1.0 publishability rubric against every MCP server I can find on npm under the official @modelcontextprotocol scope, the cluster pattern is now hard to ignore.

6 of 6 official Anthropic-shipped MCP servers score 56–60/100 on the v1.1.0 publishability composite. The cap that fires is the same axis every time: description-five-axis.

| Server | Composite | Protocol | Edge cases | Publish | Per-tool axis avg | Cap |
|---|---:|---:|---:|---:|---:|---|
| server-sequential-thinking | 60 | 100 | 100 | 20 | n/a (single tool) | description-five-axis |
| server-memory | 60 | 100 | 85 | 50 | 1.00 / 5 | description-five-axis |
| server-everything | 60 | 100 | 94 | 20 | 0.55 / 5 | description-five-axis |
| server-filesystem | 60 | 100 | 57 | 50 | 0.88 / 5 | description-five-axis |
| server-github (legacy) | 60 | 100 | 26 | 50 | 0.44 / 5 | description-five-axis |
| server-puppeteer (deprecated) | 56 | 100 | 50 | 20 | 0.17 / 5 | description-five-axis |

Every protocol score is 100. The wire format is right on every server. The 40-point gap is entirely how the schemas read.

## What "0.17 / 5" looks like in practice

Take Puppeteer's puppeteer_navigate. The full schema description is:

> Navigate to a URL.

Score that against the 5 axes:

- Purpose: "navigate to a URL" ✓ (1 axis)
- Mutation signal: does it read or write? Silent. ✗
- Side-effects: network call, can hit any URL, executes JS, arbitrary cookie state. High-blast. Silent. ✗
- Invariants: does it close existing tabs? Open a new one? Same tab? Silent. ✗
- Examples: none. ✗

1 / 5. The other six Puppeteer tools score the same way. Average 0.17.

A planner LLM that has to decide whether to call puppeteer_navigate from a tool list of 7 has nothing to pattern-match on. It cannot tell the difference between puppeteer_navigate (mutates browser state, can hit any URL) and puppeteer_screenshot (read-only, current page only) from the schema alone — they read identically.

## Why this matters more than it looks

The reference servers are calibration anchors. When a server author opens the docs to figure out "what does a good MCP server look like", they read these. When an LLM coding agent autocompletes a new MCP server skeleton, it pattern-matches on these. When the spec doc shows "here's how to write a tool", it links to these.

If the bar Anthropic ships at is 56–60/100, that's the bar most third-party servers will start from too — and probably stay at, because there's no public benchmark telling them they're under it.

That's the v1.1.0 thesis: surface the bar so authors can decide where they want to land. mcp-probe score is one command.

```bash
npx -y @incultnitollc/mcp-probe score "" --full
```

  The 5-axis breakdown tells you exactly which axis is empty on which tool. Per-tool axis avg below 3.0/5 fires the ≤60 publishability cap. Fix two axes per tool (mutation signal + one concrete example is usually fastest) and the cap lifts.
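
Read literally, the cap rule is a threshold on the per-tool axis average. The sketch below is one reading of that sentence, not mcp-probe's source. The function names are hypothetical and the real scorer may combine axes differently:

```typescript
const AXIS_AVG_THRESHOLD = 3.0; // per-tool axis average, out of 5
const CAPPED_COMPOSITE = 60; // ceiling applied when the cap fires

// Mean of per-tool axis scores (each between 0 and 5).
function axisAverage(perToolScores: number[]): number {
  if (perToolScores.length === 0) return 0;
  return perToolScores.reduce((sum, s) => sum + s, 0) / perToolScores.length;
}

// Below the threshold the composite cannot exceed the cap. Otherwise it passes through.
function applyDescriptionCap(rawComposite: number, axisAvg: number): number {
  return axisAvg < AXIS_AVG_THRESHOLD ? Math.min(rawComposite, CAPPED_COMPOSITE) : rawComposite;
}
```

Under this reading, a server with perfect protocol and edge-case scores still lands at 60 until its descriptions clear the threshold.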

  ## Methodology

  - v1.1.0 spec: <https://github.com/Incultnitollc/mcp-probe/blob/main/docs/specs/publishability-score-v1.1.0.md>
  - Calibration drift notes: <https://github.com/Incultnitollc/mcp-probe/blob/main/docs/specs/publishability-score-v1.1.0-amendments.md>
  - 6-server summary (canonical): <https://github.com/Incultnitollc/mcp-probe/blob/main/docs/publishability-scorecards/SUMMARY.md>
  - Individual server scorecards: under `docs/publishability-scorecards/` in the same repo

  ## Caveat — install-time security is a different lane

  `mcp-probe` is pre-publish quality (server authors, before they ship). For install-time security (server installers, before they connect a third-party server), see [`@stephenywilson/mcp-doctor`](https://www.npmjs.com/package/@stephenywilson/mcp-doctor). Different audience, different lane, complementary tool.

Tool descriptions are load-bearing too: the anti-purpose pattern in MCP

pengspirit — Thu, 07 May 2026 14:33:09 +0000

A few days ago I posted Schema descriptions are load-bearing: why missing parameter descriptions break MCP clients. The argument: every parameter without a description is a load-bearing element silently absent from the schema, and agents fail in ways that look like model problems but are actually contract problems.

The post got a comment from @mickyarun that's worth its own essay:

The "load-bearing" framing is the right shape — the same observation applies one level up at the tool level. Most MCP catalogues we've audited had perfectly described parameters but no description of when not to call this tool, which is the bit that actually decides whether an agent reaches for the right surface. The half-hour we spent adding "anti-purpose" descriptions to about a dozen of our internal tools cut the wrong-tool-selected rate roughly in half. Arguably the parameter case in this post is just the most visible instance of a broader rule: every field of every schema an agent reads is doing structural work whether you specified it or not.

He's right, and the pattern deserves a name. Call it the anti-purpose pattern: every tool description should specify not just what the tool is for, but what it is not for.

HOW vs WHETHER

Parameter descriptions answer HOW to call a tool — what types, what shape, what valid values.

Tool descriptions answer WHETHER to call a tool — does this surface match the user's intent at all.

Both are schema. Both are load-bearing. The first is usually under-specified. The second is almost always under-specified.

Why "Searches the web" fails

Most MCP tool descriptions read like marketing copy:

"Searches the web for information"
"Retrieves data from the database"
"Sends an email"

This is fine in isolation. It collapses the moment an agent has three search tools, two database tools, and four messaging tools loaded at once — which is the actual production scenario.

The agent has to disambiguate. The schema gave it nothing to disambiguate with. So it picks the first plausible match, or the one with the cleanest parameter list, or the one whose name lexically matches the user's phrasing. None of these correlate with correctness.

The anti-purpose pattern

The fix is mechanical:

Before: "Searches the web for information"

After:  "Searches the public web for current events,
         news, and recently published content.
         Do not use for: code lookup (use code_search),
         internal documentation (use docs_search),
         or queries answerable from training data."

Three changes:

1. Specific scope: "public web" not "the web", "current events" not "information"
2. Disambiguation pointers: names the sibling tools the agent might confuse this with
3. Explicit exclusions: the "do not use for" clause
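
In a server's source, the "after" description is just a longer string on the tool definition. A sketch, assuming the standard MCP tool shape (name, description, inputSchema). The tool name `web_search` and its single `query` parameter are made up for illustration:

```typescript
// Minimal local type mirroring an MCP tool definition.
interface ToolDefinition {
  name: string;
  description: string;
  inputSchema: {
    type: "object";
    properties: Record<string, { type: string; description: string }>;
    required: string[];
  };
}

const webSearchTool: ToolDefinition = {
  name: "web_search",
  description: [
    "Searches the public web for current events, news, and recently published content.",
    "Do not use for: code lookup (use code_search),",
    "internal documentation (use docs_search),",
    "or queries answerable from training data.",
  ].join(" "),
  inputSchema: {
    type: "object",
    properties: {
      query: { type: "string", description: "Search terms for the public web, in natural language." },
    },
    required: ["query"],
  },
};
```

Splitting the description into an array keeps the scope, exclusions, and sibling pointers on separate lines in code review while still shipping one string.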

@mickyarun reports roughly 50% fewer wrong-tool-selection errors after adding clauses like this to about a dozen internal tools. That's a half-hour edit producing a measurable behavior shift, with no model change and no prompt-engineering tax on the consumer side.

Why tool authors skip this

Two reasons, both fixable:

1. The author knows what the tool is for, so the description is implicit. Authors write descriptions that document the tool's positive purpose because that's what they were thinking about while writing it. The negative purpose (what they consciously decided this tool would not do) never makes it onto the page.
2. MCP examples don't model it. Look at any MCP server template or quickstart and tool descriptions are one-line declaratives. There's no canonical example that says "here's what a production tool description looks like with anti-purpose."

The first is fixed by a checklist. The second is fixed by people writing posts like this one.

Concrete checklist

When writing or auditing a tool description, the description should answer:

- Scope: What specifically does this operate on? ("public web", "this user's calendar", "Postgres tables in the analytics schema")
- Trigger: What user intent should select this tool?
- Anti-trigger: What user intent looks similar but should select a different tool?
- Sibling pointer: Which neighboring tools are the most likely confusion sources, and what should send the agent there instead?

If you have more than one tool in your MCP server, all four are load-bearing. Skipping any of them outsources the disambiguation to whatever the model happens to guess.
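
Two of the four checklist items can be approximated mechanically. The sketch below is a rough heuristic of my own for illustration, not mcp-probe's scoring. It looks for an exclusion phrase and for mentions of sibling tool names. Scope and trigger quality still need a human read:

```typescript
interface DescriptionAudit {
  hasAntiPurpose: boolean; // some "do not use for" style clause is present
  siblingPointers: string[]; // sibling tool names mentioned in the description
}

function auditDescription(description: string, siblingToolNames: string[]): DescriptionAudit {
  const hasAntiPurpose = /\b(do not use|don't use|not for)\b/i.test(description);
  const siblingPointers = siblingToolNames.filter((name) => description.includes(name));
  return { hasAntiPurpose, siblingPointers };
}
```

Run against the before and after descriptions from earlier, the first fails both checks and the second passes both.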

Coming to mcp-probe

This is the next axis I'm adding to mcp-probe. Parameter-description coverage is already scored. Tool-description quality — including a heuristic for anti-purpose clauses — belongs in the same scorecard.

Thanks to @mickyarun for the comment that pulled the framing one level up. Schema descriptions are load-bearing. So is every other field of the contract an agent is asked to read.