DEV Community: pengspirit

Your OTP regex assumes six digits. Supabase magic links don't.

pengspirit — Wed, 01 Jul 2026 09:38:23 +0000

Sign-in worked flawlessly in dev. Then a real user pasted a real code and got "invalid format" — before the code ever reached Supabase. The credential was fine. My regex was wrong. Here's the one-line assumption that broke auth for every human who wasn't me.

I run a Discord-native Company Brain. Teams /save docs and /ask grounded answers; access is gated by a magic-link claim that emails a one-time code. Standard GoTrue OTP flow. The client shows a box, you paste the code, the server verifies it. Boring — which is exactly what auth should be.

The bug: a six-digit assumption in a validation guard

The claim handler did a cheap client-side sanity check before calling verifyOtp:

// The bug. Looks reasonable. Rejects every real code.
const OTP = /^\d{6}$/;

function normalize(input: string): string {
  const code = input.trim();
  if (!OTP.test(code)) throw new Error("Enter the 6-digit code from your email.");
  return code;
}

Every OTP tutorial uses \d{6}. Every code demo shows six digits. So I typed six digits into the test and it passed. In dev I was generating my own codes and never actually reading the email.

Supabase's GoTrue emits an eight-digit code on this project. ^\d{6}$ rejects eight digits outright. The user's perfectly valid credential got thrown out by my own front door with a lie for an error message — "enter the 6-digit code" when the email plainly showed eight.

Why it happens: OTP length is a setting, not a constant

The length of a GoTrue email OTP is configurable — GOTRUE_MAILER_OTP_LENGTH (Dashboard → Authentication → Email). It defaults to six in many setups and to eight in others depending on when and how the project was provisioned. The number in the tutorial is that author's project setting, not a property of OTPs.

Hardcoding 6 couples your client to a server setting that can change without the client knowing. Bump the length for security later and every client silently starts rejecting valid codes. Nothing shows up in your logs, because the rejection happens before the request leaves the browser.

The fix: the client guard must never be stricter than the issuer

A format check on a security token is a UX affordance, not a security control. Its only job is catching "you pasted your grocery list" before a round-trip. The real validity check is verifyOtp on the server — that's the authority. So the client regex should be loose: wide enough to never reject a real code, tight enough to skip an obviously empty box.

// Loose format guard. Supabase is the authority on validity — verifyOtp decides.
// Accept any 6–10 digit code so a server-side length change never breaks the client.
const OTP = /^\d{6,10}$/;

function normalize(input: string): string {
  // Users paste from an email client: trailing newline, stray spaces, a stray dash.
  const code = input.replace(/\D/g, "");
  if (!OTP.test(code)) throw new Error("Enter the code from your email.");
  return code;
}

// runnable check — the exact cases that bit me
function demo() {
  console.assert(normalize("12345678") === "12345678", "8-digit must pass");
  console.assert(normalize(" 1234 5678 \n") === "12345678", "strip paste noise");
  console.assert(normalize("123456") === "123456", "6-digit still passes");
  let threw = false;
  try { normalize("hello"); } catch { threw = true; }
  console.assert(threw, "non-digits must reject");
  console.log("ok");
}
demo();

Two things doing the work:

\d{6,10} instead of \d{6}. A range absorbs whatever length GoTrue is configured for, today or after a future bump. I don't have to redeploy the client to match a server setting.
replace(/\D/g, "") instead of trim(). People don't retype the code, they paste it — straight out of Gmail with a trailing newline, a leading space, sometimes a soft-wrap dash. Stripping every non-digit is more honest than trimming the ends, and it's what the user meant.

Then let the server be the authority:

const { error } = await supabase.auth.verifyOtp({
  email,
  token: normalize(input),   // loose format guard already ran
  type: "email",
});
// verifyOtp is the real check: wrong code, expired code, wrong length — all rejected here,
// server-side, with a signal you can actually trust and log.

The general rule

Any time the client validates a token the server issues, the client's check must be a superset of what the server accepts — never a subset. A guard stricter than the issuer doesn't add security; it manufactures false rejections of valid credentials, and it does it silently, before anything reaches a log you'd look at.
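The superset rule can be checked mechanically. Here is a minimal sketch, assuming you can list the lengths the issuer might be configured for; guardCoversIssuer and the length list are hypothetical names for illustration, not part of Supabase or GoTrue:

```typescript
// Hypothetical helper: returns the issuer lengths that the client guard
// would wrongly reject. An empty result means the guard is a superset.
function guardCoversIssuer(guard: RegExp, issuerLengths: number[]): number[] {
  return issuerLengths.filter((len) => !guard.test("1".repeat(len)));
}

const strictGuard = /^\d{6}$/;
const looseGuard = /^\d{6,10}$/;

// Assumed range of lengths the server could be set to (illustrative only).
const possibleLengths = [6, 7, 8, 9, 10];

// The strict guard rejects every length except 6; the loose guard rejects none.
console.log("strict rejects:", guardCoversIssuer(strictGuard, possibleLengths));
console.log("loose rejects:", guardCoversIssuer(looseGuard, possibleLengths));
```

A check like this could sit in a unit test, so a future tightening of the client regex fails the build instead of failing real sign-ins.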

I found this the expensive way: a working sign-in for exactly one person (me), and a "the code doesn't work" report from everyone else. The fix was three added characters, {6} to {6,10}, plus a normalize that respects how people actually paste.

Takeaways

OTP length is a server setting (GOTRUE_MAILER_OTP_LENGTH), not a constant. Don't hardcode 6 from a tutorial.
Client format checks are UX, not security. Keep them looser than the issuer; verifyOtp is the authority.
A guard stricter than the issuer rejects valid credentials silently — the worst kind of bug, because nothing errors on your side.
Users paste, they don't type. Strip non-digits, don't just trim().

Boring auth is good auth — but boring means the failure modes hide in the five characters you copied without reading. That's the tax on running a real product for real users, which is the whole bet behind Acortia.

The Postgres error code that makes Stripe webhooks idempotent: 23505

pengspirit — Sat, 20 Jun 2026 05:39:10 +0000

Stripe delivers webhooks at least once. Not exactly once — at least once. A network blip on your side, a 200 that arrives a half-second too late, an event Stripe decides to re-send during an outage, and the same checkout.session.completed lands on your endpoint twice. Sometimes three times. Your handler does not know that.

If that handler grants a subscription, sends a welcome email, or increments a usage counter, "at least once" is how one signup becomes two welcome emails and a double-counted seat. The fix is not a queue, not a distributed lock, not an idempotency framework. It's one Postgres table, one unique constraint, and catching one error code: 23505.

I run a Discord-native Company Brain — teams /save docs and /ask grounded answers, billed at a flat monthly price through Stripe. The entire billing surface is one Fastify route and one Supabase Postgres. Here's the whole idempotency story, in the order the bytes actually arrive.

The problem: webhooks are at-least-once, your side effects are not

A Stripe webhook is a state change that already happened — a card was charged, a subscription renewed. Your job is to reconcile your database to that fact. The trap is that Stripe's delivery guarantee (at-least-once) and your handler's behavior (run the side effect every time) don't match. Run a non-idempotent handler twice and you've corrupted state with no error to tell you.

So before any business logic, two things have to be true: the request is really from Stripe, and you have never processed this exact event before.

Step 1: verify the signature — on the raw body, before anything parses it

The signature check is also your authenticity check; an attacker who can POST to /stripe/webhook can forge events otherwise. But there's a gotcha that eats an afternoon: constructEvent hashes the exact raw bytes Stripe sent. If a global express.json() / Fastify body parser has already turned those bytes into an object and back, the re-serialized JSON won't byte-match, and verification fails on legitimate events.
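To see why the raw bytes matter, here is a small sketch using Node's built-in crypto. It is a simplified stand-in for the signing scheme (HMAC-SHA256 over the timestamp, a dot, and the payload); the secret, timestamp, and payload are made up, and the real verification is done by constructEvent, not by this code:

```typescript
import { createHmac } from "node:crypto";

// Simplified stand-in: HMAC-SHA256 over `${timestamp}.${payload}`.
function sign(secret: string, timestamp: number, payload: string): string {
  return createHmac("sha256", secret).update(`${timestamp}.${payload}`).digest("hex");
}

const secret = "whsec_example_only"; // made-up secret
const timestamp = 1750000000;        // made-up timestamp

// Raw body as sent: note the spaces after the colons and the comma.
const rawBody = '{"id": "evt_123", "type": "checkout.session.completed"}';
const signatureFromSender = sign(secret, timestamp, rawBody);

// What you have left after a global JSON parser: an object, re-serialized.
const reserialized = JSON.stringify(JSON.parse(rawBody));

// Plain === is fine for a demo; real verification should use a constant-time compare.
const rawMatches = sign(secret, timestamp, rawBody) === signatureFromSender;
const reserializedMatches = sign(secret, timestamp, reserialized) === signatureFromSender;

console.log({ rawMatches, reserializedMatches });
```

The parsed-and-restringified body carries the same data but different bytes (the whitespace is gone), so the digest no longer matches.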

You have to retain the raw body for this one route. In Fastify, a content-type parser hook that stashes it:

// Retain the raw body so Stripe (and Discord) signature checks see exact bytes.
req.rawBody = body as string;

Then the route verifies against that raw string, never the parsed object:

app.post('/stripe/webhook', async (req, reply) => {
  const signature = req.headers['stripe-signature'];
  if (typeof signature !== 'string' || !req.rawBody) {
    return reply.code(400).send({ error: 'missing signature or body' });
  }
  try {
    const result = await handleStripeWebhook(req.rawBody, signature);
    return reply.code(200).send(result);            // 200 = "I own this now, stop retrying"
  } catch (err) {
    if (err instanceof WebhookSignatureError) {
      return reply.code(400).send({ error: 'signature_invalid' });
    }
    return reply.code(500).send({ error: 'webhook processing failed' }); // 500 = retry me
  }
});

The status codes are the protocol. 2xx tells Stripe "processed, stop retrying." Any non-2xx queues a retry with backoff. That single fact is what makes the next step both necessary and safe.

Step 2: let the unique constraint be the lock — insert first, catch 23505

Here's the move. A tiny table whose only job is to remember which event IDs you've seen:

-- One row per Stripe event. event_id is the natural primary key;
-- Stripe guarantees it's stable across retries of the same event.
CREATE TABLE stripe_events_seen (
  event_id   text PRIMARY KEY,
  event_type text NOT NULL,
  seen_at    timestamptz NOT NULL DEFAULT now()
);

The naive approach is SELECT to check, then INSERT if absent. That's a race: two concurrent deliveries of the same event both SELECT empty, both proceed, both process. The check-then-act gap is exactly wide enough for Stripe's parallel retries to slip through.
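The gap is easy to reproduce without a database. A minimal simulation, where a Set stands in for the table and a zero-delay timer stands in for the network round trip (both are illustrative assumptions, not how the real handler stores anything):

```typescript
// A zero-delay pause that stands in for a database round trip.
const tick = () => new Promise<void>((resolve) => setTimeout(resolve, 0));

// Check-then-insert: the await between the check and the write is the gap.
async function checkThenInsert(seen: Set<string>, eventId: string): Promise<boolean> {
  const exists = seen.has(eventId); // the SELECT
  await tick();                     // round trip; the other delivery runs here
  if (exists) return false;
  seen.add(eventId);                // the INSERT
  return true;                      // caller goes on to process the event
}

// Two deliveries of the same event arrive concurrently against a fresh store.
async function raceDemo(): Promise<boolean[]> {
  const seen = new Set<string>();
  return Promise.all([checkThenInsert(seen, "evt_1"), checkThenInsert(seen, "evt_1")]);
}

raceDemo().then((results) => console.log("passed the check:", results));
```

Both calls read an empty store before either writes, so both report the event as new and both would process it.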

So don't check. Insert, and let the database reject the duplicate. The unique constraint is the lock — atomic, no race window, no extra round trip:

async function recordEventSeen(eventId: string, eventType: string): Promise<boolean> {
  const { error } = await supabase
    .from('stripe_events_seen')
    .insert({ event_id: eventId, event_type: eventType });
  if (!error) return true;          // first time we've seen this event
  if (error.code === '23505') return false;   // unique_violation = duplicate, already handled
  throw new Error(`stripe_events_seen insert failed: ${error.message}`);
}

23505 is Postgres's unique_violation SQLSTATE. It is not an error to log and move past — it's a signal: "another delivery of this event already claimed the row." Returning false on it turns the constraint into a deduplication primitive. Two requests race to insert the same event_id; Postgres lets exactly one win and hands the loser a 23505. No advisory lock, no Redis, no SELECT ... FOR UPDATE.

The single most important detail: don't treat 23505 as failure. Plenty of webhook code wraps the insert in a generic try/catch that logs and re-raises — which turns a successful dedup into a 500, which tells Stripe to retry, which hits the same constraint, forever. Catch the one code that means "duplicate" and convert it to a clean early return.

Step 3: dedup once, at the top — not inside every handler

Where you put the check matters. Put it before the event-type switch, so a duplicate is rejected once and no handler ever sees it:

export async function handleStripeWebhook(rawBody: string, signature: string) {
  let event: Stripe.Event;
  try {
    event = stripe.webhooks.constructEvent(rawBody, signature, env.STRIPE_WEBHOOK_SECRET);
  } catch (err) {
    throw new WebhookSignatureError(`signature verification failed`);
  }

  const fresh = await recordEventSeen(event.id, event.type);
  if (!fresh) {
    return { ok: true, duplicate: true, type: event.type };   // 200, did nothing — correct
  }

  switch (event.type) {
    case 'checkout.session.completed':
      await applyCheckoutCompleted(event.data.object); break;
    case 'customer.subscription.created':
    case 'customer.subscription.updated':
      await applySubscriptionUpsert(event.data.object); break;
    case 'customer.subscription.deleted':
      await applySubscriptionDeleted(event.data.object); break;
    case 'invoice.payment_failed':
      await applyPaymentFailed(event.data.object); break;
    default:
      console.log(`unhandled event type: ${event.type}`);
  }
  return { ok: true, duplicate: false, type: event.type };
}

A duplicate returns { ok: true, duplicate: true }, the route sends 200, Stripe stops retrying. One guard protects every current and future handler. Add a new event type next month and it inherits idempotency for free — the most valuable kind of safety is the kind you can't forget to apply.

Step 4: insert-first is at-most-once — know the tradeoff you just made

This is the part most posts skip, and it's the part that bites. Look at the ordering: we insert the seen row, then run the handler. If the handler throws after the insert committed — say applyCheckoutCompleted calls stripe.subscriptions.retrieve and that times out — the route returns 500 and Stripe retries. But the retry hits a row that already exists, gets a 23505, and is deduped away. The event is marked seen and never actually processed. Insert-first buys you at-most-once, and at-most-once can silently drop work.

Two honest ways to handle it, pick by what your handlers do:

Move the insert to the end (process, then record seen). Now it's at-least-once: a mid-handler crash means the row is never written, so Stripe's retry reprocesses cleanly. The cost: your handlers must be idempotent, because they will sometimes run twice.
Keep insert-first, and lean on the fact that your handlers are already idempotent for a different reason. Mine are pure state reconciliation — every one is an UPDATE ... WHERE id = $1:

await supabase.from('workspaces')
  .update({ subscription_status: period.status, current_period_end: period.currentPeriodEndIso })
  .eq('id', workspaceId);     // applying this twice == applying it once

An update-by-key sets the row to a value; running it again sets it to the same value. There's no counter to double, no email sent inline. And Stripe's events overlap: a dropped checkout.session.completed gets re-covered by the customer.subscription.created that fires for the same signup. The state re-converges on the next event. That overlap is why at-most-once is acceptable here — and exactly why it wouldn't be if a handler sent an email or charged a card as a side effect.

The rule that falls out: reconciliation handlers can drop an event safely; side-effecting handlers cannot. Know which kind you're writing before you choose where the insert goes.
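The two orderings can be compared side by side in a small simulation. This is a sketch under simplifying assumptions: a Set stands in for stripe_events_seen, everything runs synchronously (so the dedup race from Step 2 is out of scope here), and flakyHandler is a hypothetical handler that fails once and then succeeds:

```typescript
type Handler = (eventId: string) => void;

// Insert-first: record the event, then process. At-most-once.
function insertFirst(seen: Set<string>, eventId: string, handle: Handler): string {
  if (seen.has(eventId)) return "duplicate";
  seen.add(eventId);
  handle(eventId); // if this throws, the seen row already exists
  return "processed";
}

// Process-then-record: at-least-once, so the handler may run twice.
function processFirst(seen: Set<string>, eventId: string, handle: Handler): string {
  if (seen.has(eventId)) return "duplicate";
  handle(eventId); // if this throws, nothing was recorded
  seen.add(eventId);
  return "processed";
}

// Hypothetical handler: throws on its first call, succeeds afterwards.
function flakyHandler(): { handle: Handler; successes: () => number } {
  let calls = 0;
  let ok = 0;
  return {
    handle: () => {
      calls += 1;
      if (calls === 1) throw new Error("timeout");
      ok += 1;
    },
    successes: () => ok,
  };
}

// First delivery fails (the route would return 500), then the retry arrives.
function deliverTwice(strategy: typeof insertFirst): { retry: string; successes: number } {
  const seen = new Set<string>();
  const h = flakyHandler();
  try {
    strategy(seen, "evt_1", h.handle);
  } catch {
    // the route would answer 500 here and Stripe would retry
  }
  const retry = strategy(seen, "evt_1", h.handle);
  return { retry, successes: h.successes() };
}

console.log("insert-first:", deliverTwice(insertFirst));
console.log("process-first:", deliverTwice(processFirst));
```

With insert-first the retry is deduped and the work never happens; with process-first the retry does the work, at the price of requiring handlers that tolerate running twice.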

Takeaways

Stripe is at-least-once. Design for duplicate delivery as the normal case, not the edge case.
Verify the signature on the raw body. A global JSON parser silently breaks constructEvent on legitimate events — retain raw bytes for the webhook route only.
Insert-first, catch 23505. The unique constraint is a race-free lock. Check-then-insert has a gap that parallel retries walk right through.
Treat 23505 as "duplicate," not "error." Logging-and-reraising it turns successful dedup into an infinite retry loop.
Dedup above the switch, so every handler — present and future — is covered once.
Insert-first is at-most-once. Fine for idempotent reconciliation handlers; dangerous for side-effecting ones. Choose insert order deliberately.

The whole thing is a 40-line file and one table. No queue, no lock service, no idempotency-key plumbing — Postgres already ships the primitive you need, it's just spelled 23505. That economy is the bet behind Acortia: one operator, one database, billing that survives Stripe's retries without a second system to reconcile.

The MCP Server Pre-Publish Checklist

pengspirit — Mon, 15 Jun 2026 08:38:37 +0000

Before you publish an MCP server, run 10 checks. Most servers fail at least three — and the failures are invisible until an agent picks the wrong tool, hallucinates an argument, or silently drops your server on connect. This is the checklist we built mcp-probe to enforce, distilled to what actually breaks in the wild.

TL;DR — A publishable MCP server connects cleanly, names tools unambiguously, describes every argument, validates inputs, and ships install metadata. The single most common failure is thin tool descriptions: even the five official Anthropic reference servers cap at 60/100 on description quality.

Why "it works in Inspector" isn't enough

MCP Inspector answers "does my server connect and list tools?" That's necessary, not sufficient. The agent doesn't experience your server the way you do in a UI — it experiences your tool descriptions and schemas as text in a context window, and it picks tools by reading them. A server can pass Inspector and still be functionally unpublishable because the model can't tell your tools apart.

So the pre-publish question isn't "does it run?" It's "is it publishable?" — will a real agent, with no docs and no human in the loop, use it correctly?

The 10-point checklist

Connection & protocol

Connects without transport errors — stdio or HTTP, the handshake completes and the protocol version is current.
Lists tools, resources, and prompts — everything you intend to expose actually appears after initialize.
No initialize timeout — large tool lists can exceed the client's probe timeout and get silently dropped. Keep initialize fast.

Tool legibility (where most servers fail)

Every tool has a real description — not a restated name. "create_issue: creates an issue" tells the model nothing. The description has to do the disambiguation work.
No naming collisions — create_issue exists in a dozen servers. If yours collides, the model guesses. Namespace or specify.
Arguments are described, not just typed — every parameter needs a description, required fields marked, enums enumerated. Nested args make the model miss required fields.
Mutations are legible — a tool that writes/deletes/charges should say so. The model should never discover a side effect at runtime.
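As a rough illustration of the legibility checks, here is a thin tool next to a legible one, plus a tiny lint. The tool shape follows the usual name, description, and inputSchema layout; the tool names, the wording, the 8-word threshold, and lintTool itself are all assumptions made for this sketch and are not mcp-probe's actual rules:

```typescript
interface ToolArg { type: string; description: string; enum?: string[] }
interface ToolDefinition {
  name: string;
  description: string;
  inputSchema: { type: "object"; properties: Record<string, ToolArg>; required: string[] };
}

// Restates the name, leaves the argument undescribed.
const thinTool: ToolDefinition = {
  name: "create_issue",
  description: "Creates an issue.",
  inputSchema: {
    type: "object",
    properties: { title: { type: "string", description: "" } },
    required: ["title"],
  },
};

// Namespaced, says what it writes, describes every argument.
const legibleTool: ToolDefinition = {
  name: "github_create_issue",
  description:
    "Opens a new issue in one GitHub repository. This writes data: every call creates a visible issue. Use it for bug reports, not for comments on existing issues.",
  inputSchema: {
    type: "object",
    properties: {
      repo: { type: "string", description: "Target repository in owner/name form." },
      title: { type: "string", description: "One-line summary shown in the issue list." },
      priority: { type: "string", description: "Triage priority.", enum: ["low", "medium", "high"] },
    },
    required: ["repo", "title"],
  },
};

// Hypothetical lint with an arbitrary word threshold.
function lintTool(tool: ToolDefinition, minWords = 8): string[] {
  const problems: string[] = [];
  const words = tool.description.split(/\s+/).filter(Boolean).length;
  if (words < minWords) problems.push(`${tool.name}: description is thin (${words} words)`);
  for (const [arg, spec] of Object.entries(tool.inputSchema.properties)) {
    if (spec.description.trim() === "") problems.push(`${tool.name}: argument "${arg}" has no description`);
  }
  return problems;
}

console.log(lintTool(thinTool));
console.log(lintTool(legibleTool));
```

A word count is a crude proxy; the real question is whether the description lets a model choose between two similarly named tools.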

Schema & inputs

Inputs validate — valid input succeeds, invalid input produces a useful error, not a stack trace or a silent pass.
Enum and shape constraints are explicit — if a field takes one of four values, the schema says so. "string" where you mean an enum is a footgun.
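On the input side, "a useful error" can be as simple as naming the field and the allowed values. A minimal hand-rolled sketch, with a hypothetical create-issue input and no schema library assumed:

```typescript
type Priority = "low" | "medium" | "high";
const PRIORITIES: readonly string[] = ["low", "medium", "high"];

type ValidationResult =
  | { ok: true; value: { title: string; priority: Priority } }
  | { ok: false; error: string };

// Hypothetical validator for a create-issue style tool call.
function validateCreateIssue(input: unknown): ValidationResult {
  if (typeof input !== "object" || input === null) {
    return { ok: false, error: "arguments must be an object" };
  }
  const { title, priority } = input as Record<string, unknown>;
  if (typeof title !== "string" || title.trim() === "") {
    return { ok: false, error: "title is required and must be a non-empty string" };
  }
  if (typeof priority !== "string" || !PRIORITIES.includes(priority)) {
    // Name the allowed values so the model can correct itself on the next call.
    return { ok: false, error: `priority must be one of: ${PRIORITIES.join(", ")}` };
  }
  return { ok: true, value: { title, priority: priority as Priority } };
}

console.log(validateCreateIssue({ title: "Login fails", priority: "high" }));
console.log(validateCreateIssue({ title: "Login fails", priority: "urgent" }));
```

The error string is part of the interface: an agent reads it the same way it reads the description.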

Distribution

Install metadata ships — clear package name, runnable example, fresh README, and a server.json so the official MCP Registry can discover you. Devs find tools at install-time, not search-time.

How to score it in 3 seconds

You can walk this list by hand, or run it:

npx @incultnitollc/mcp-probe score "node ./your-server.js"

mcp-probe connects to your server, runs all ten checks, and returns a 0–100 publishability score across five axes — description quality, enum/shape correctness, mutation legibility, anti-"restate the name" clauses, and distribution metadata. A passing server clears ~80. The official reference servers sit at 60 (the description cap fires on every one). A typical first-draft community server lands in the 40s.

Wire it into CI so it runs on every release:

# .github/workflows/publishability.yml
- run: npx @incultnitollc/mcp-probe score "node ./dist/server.js" --fail-under 80

The exit code gates the publish. Your server can't regress below the bar you set.

The one thing to fix first

If you do nothing else: rewrite your tool descriptions so a model with no context could choose correctly between yours and a similarly named tool. That single fix moves more servers across the publishable line than any other on this list — and it's the one almost everyone skips.

mcp-probe is an open-source CLI for testing and scoring MCP servers before you publish. npx @incultnitollc/mcp-probe · github.com/incultnitollc/mcp-probe

Four pgvector patterns that kept our RAG SaaS on one Postgres

pengspirit — Fri, 12 Jun 2026 08:17:45 +0000

Most RAG tutorials stop at embedding <=> query. They show you the operator, return five rows, and call it retrieval. Then you ship it, a second customer signs up, and you discover the four things the tutorial skipped: indexing on a column that's half-NULL, the distance-vs-similarity sign flip, the dimension lock-in, and the function that quietly bypasses your tenant isolation.

I run a Discord-native Company Brain. Teams /save docs, links, and PDFs; /ask returns a grounded, cited answer. The whole vector store is one Supabase Postgres with pgvector — no Pinecone, no second system to bill and reconcile. Here are four patterns that made that survive contact with real workspaces.

The problem: a vector column is not a vector store

A vector(1536) column gives you storage and a distance operator. It does not give you fast search, correct ranking, dimension discipline, or multi-tenant safety. Those are four separate decisions, and getting any one wrong shows up as a production bug, not a compile error.

Our artifacts table holds every chunk a workspace has ingested. The relevant columns:


  CREATE TABLE artifacts (
    id           uuid    PRIMARY KEY DEFAULT gen_random_uuid(),
    workspace_id uuid    NOT NULL REFERENCES workspaces(id) ON DELETE CASCADE,
    content      text    NOT NULL,
    content_hash text    NOT NULL,        -- sha256, short-circuits re-embedding
    metadata     jsonb   NOT NULL DEFAULT '{}'::jsonb,
    -- 1536 dims = OpenAI text-embedding-3-small.
    embedding    vector(1536),            -- NULLABLE on purpose. See pattern 1.
    created_at   timestamptz NOT NULL DEFAULT now(),
    -- source_type and external_id are among the columns omitted from this excerpt.
    UNIQUE (workspace_id, source_type, external_id)
  );

  Note that embedding is nullable. Artifacts arrive un-embedded — the web service writes the row instantly, a worker embeds it async on a */15 cron. That single nullable column drives the first pattern.

  Pattern 1: Index only the rows that have a vector

  The naive HNSW index covers the whole column. But half our rows are NULL at any given moment during backfill, and building HNSW graph edges for NULL rows is wasted work and wasted index size.

  The fix is a partial index with a WHERE predicate:

  -- Partial HNSW index: only index rows that actually have an embedding.
  -- Keeps the index small during async backfill and skips HNSW build cost
  -- on NULL rows entirely.
  CREATE INDEX artifacts_embedding_hnsw_idx
    ON artifacts
    USING hnsw (embedding vector_cosine_ops)
    WHERE embedding IS NOT NULL;

  Two choices worth defending:

  - HNSW over IVFFlat. IVFFlat needs training data to build its lists — you have to populate the table first, then build the index, and rebuild as the distribution shifts. HNSW builds incrementally as rows arrive. For a product where every workspace starts at zero artifacts and grows continuously, "no training step, no rebuild" wins. We left m and ef_construction at pgvector defaults and wrote a note to tune them once we have real latency data — premature index tuning is just a guess with extra steps.
  - vector_cosine_ops, not the default. The operator class in the index must match the distance operator your query uses. Index on vector_cosine_ops, query with <=> (cosine distance). Mismatch them and Postgres silently does a sequential scan — correct answers, terrible latency, no error to tell you why.

  Pattern 2: The sign flip — distance is not similarity
  pgvector's <=> returns cosine distance: 0 is identical, 1 is orthogonal, 2 is opposite. Humans, dashboards, and threshold configs think in similarity: 1 is identical, 0 is unrelated. The conversion is similarity = 1 - distance, and the operator shows up in three places (output, filter, ordering); convert in the wrong one and your ranking inverts or your index goes unused.
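The arithmetic is small enough to check by hand. A sketch in plain TypeScript that mirrors what cosine distance computes; it is an illustration of the math, not pgvector's implementation, and the three sample documents are made up:

```typescript
// Cosine distance: 1 minus the cosine of the angle between two vectors.
function cosineDistance(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const toSimilarity = (distance: number): number => 1 - distance;

const query = [1, 0];
const docs = [
  { id: "opposite", vec: [-1, 0] },
  { id: "identical", vec: [1, 0] },
  { id: "orthogonal", vec: [0, 1] },
];

const scored = docs.map((d) => {
  const distance = cosineDistance(d.vec, query);
  return { id: d.id, distance, similarity: toSimilarity(distance) };
});

// Ascending distance and descending similarity give the same order.
const byDistance = [...scored].sort((x, y) => x.distance - y.distance).map((s) => s.id);
const bySimilarity = [...scored].sort((x, y) => y.similarity - x.similarity).map((s) => s.id);

console.log(scored, byDistance, bySimilarity);
```

Note that similarity bottoms out at -1 for opposite vectors, not 0, so a min_similarity threshold of 0 already discards everything pointing away from the query.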

  Here's the actual retrieval RPC. Watch where <=> appears raw (ordering) versus converted (filtering and output):

  CREATE OR REPLACE FUNCTION match_artifacts(
    p_workspace_id  uuid,
    query_embedding vector(1536),
    match_count     int   DEFAULT 5,
    min_similarity  float DEFAULT 0.15
  )
  RETURNS TABLE (id uuid, content text, similarity float)
  LANGUAGE sql
  SECURITY INVOKER                      -- critical. See pattern 4.
  AS $$
    SELECT
      a.id,
      a.content,
      1 - (a.embedding <=> query_embedding) AS similarity   -- distance -> similarity
    FROM artifacts a
    WHERE a.workspace_id = p_workspace_id
      AND a.embedding IS NOT NULL
      AND 1 - (a.embedding <=> query_embedding) >= min_similarity  -- filter in similarity space
    ORDER BY a.embedding <=> query_embedding                       -- order in DISTANCE space (ASC)
    LIMIT match_count;
  $$;

  The ORDER BY stays in distance space and sorts ascending — smallest distance first — because that's the direction the HNSW index understands. Flip it to ORDER BY similarity DESC and you get the same logical result but you've handed the planner an expression it can't satisfy from the index, so it sorts in memory after a scan. Order by the raw operator; convert only for the human-facing columns.

  Our retrieval defaults — match_count = 5, min_similarity = 0.15 — came out of tuning against our own corpus, not a paper. Higher k bloats the model's context window without lifting answer quality; a lower threshold lets junk through and the model starts hedging. They're defaults, not laws: the RPC takes both as parameters so we can override per workspace.

  Pattern 3: Dimensions are a one-way door — plan the migration before you need it

  vector(1536) is a hard constraint. 1536 is the output dimension of OpenAI's text-embedding-3-small. If you swap models, a different dimension count means the column type no longer fits, and every existing embedding is garbage against the new query vectors.

  We evaluated text-embedding-3-large (3072-dim) in week two. The numbers:

  ┌──────────────────────────┬─────────────────┬───────────────┐
  │           Knob           │ -small (chosen) │    -large     │
  ├──────────────────────────┼─────────────────┼───────────────┤
  │ Dimensions               │ 1536            │ 3072          │
  ├──────────────────────────┼─────────────────┼───────────────┤
  │ Top-5 recall (our eval)  │ baseline        │ ~3 pts higher │
  ├──────────────────────────┼─────────────────┼───────────────┤
  │ Cost per token           │ 1×              │ 6×            │
  ├──────────────────────────┼─────────────────┼───────────────┤
  │ pgvector storage per row │ 1×              │ 2×            │
  └──────────────────────────┴─────────────────┴───────────────┘

  Three points of recall for six times the cost and double the storage did not clear the bar at our scale. Tuning min_similarity lifted precision more cheaply than the extra dimensions did. But the real lesson is the migration rule we wrote down so a future me doesn't fight the column type at 2am:

  ▎ When we change embedding models, the new vector goes in a new column (embedding_v2 vector(3072)), backfilled and dual-read behind a flag, never an in-place ALTER of the existing column.

  Adding a column lets old and new embeddings coexist while you backfill millions of rows and verify recall didn't regress. Altering the column type in place takes an exclusive lock on the whole table and gives you no rollback. Pick the boring migration.
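The dual-read rule can be sketched on the application side. Everything here is hypothetical (the version labels, the flag, the backfill signal, and the helper names); it only shows the shape of "read v2 when it is safe, otherwise stay on v1, and never mix dimensions":

```typescript
type EmbeddingVersion = "v1" | "v2";

interface EmbeddingTarget {
  column: string;
  dims: number;
  model: string;
}

// Column names follow the rule above; the flag and versions are illustrative.
const TARGETS: Record<EmbeddingVersion, EmbeddingTarget> = {
  v1: { column: "embedding", dims: 1536, model: "text-embedding-3-small" },
  v2: { column: "embedding_v2", dims: 3072, model: "text-embedding-3-large" },
};

// Read from v2 only when the flag is on AND the backfill has finished.
function readTarget(flagOn: boolean, backfillComplete: boolean): EmbeddingTarget {
  return flagOn && backfillComplete ? TARGETS.v2 : TARGETS.v1;
}

// Guard against querying one model's column with another model's vector.
function assertDims(target: EmbeddingTarget, queryVector: number[]): void {
  if (queryVector.length !== target.dims) {
    throw new Error(
      `expected ${target.dims} dims for ${target.column}, got ${queryVector.length}`,
    );
  }
}

const target = readTarget(true, false); // flag on, backfill still running
assertDims(target, new Array<number>(1536).fill(0));
console.log("reading from", target.column);
```

Turning the flag off is the rollback: reads return to the old column, which was never touched.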

  Pattern 4: The function that bypasses your tenancy — SECURITY INVOKER, always

  This one nearly made me quit for the day. Our entire multi-tenant model is Row Level Security keyed on workspace_id: a policy on artifacts means a query physically cannot return another tenant's rows. Airtight, except for a function declared SECURITY DEFINER, which runs with the definer's privileges. When the definer owns the table or has BYPASSRLS, which is typical when the same migration role creates both the table and the function, RLS is skipped entirely.

  A vector-search RPC is exactly the kind of function people reflexively mark SECURITY DEFINER (it's calling into internals, feels like it should be privileged). Do that, and match_artifacts happily returns chunks across workspace boundaries even though RLS is enabled on the table. The leak doesn't throw — it just quietly serves the wrong tenant's data.

  Two defenses, both in the RPC above:

  1. SECURITY INVOKER — the function runs as the caller, so RLS policies apply inside it exactly as they would on a direct query.
  2. An explicit WHERE a.workspace_id = p_workspace_id predicate — belt and suspenders. RLS is the wall; the predicate is the lock. If a future migration ever fumbles a policy, the predicate still scopes the result.

  And because the only caller is the worker (holding a service-role key on a trusted server), we revoke the function from public roles entirely:

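  -- Only the service-role worker needs this. Anon/authenticated never call it.
  -- Postgres grants EXECUTE on new functions to PUBLIC by default, so revoke that too.
  -- If the worker's role loses access as a result, GRANT EXECUTE to it explicitly.
  REVOKE EXECUTE ON FUNCTION match_artifacts FROM PUBLIC, anon, authenticated;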

  The TypeScript side stays boring, which is the point — all the safety lives in the database:

  const { data: matches } = await supabase.rpc("match_artifacts", {
    p_workspace_id: workspaceId,     // scoped by the caller, enforced by RLS + predicate
    query_embedding: queryVector,    // 1536-dim, same model as ingest
    match_count: 5,
    min_similarity: 0.15,
  });

  Write the cross-tenant leak test before the retrieval feature, not after. I wrote it after, which is how I learned the difference between DEFINER and INVOKER the expensive way.
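A leak test does not need much. Here is a sketch with the search injected as a function, so the same check can run against an in-memory fake or a wrapper around the real RPC. The types, findLeaks, and both fake searches are hypothetical names for illustration:

```typescript
interface Match { id: string; content: string; similarity: number }
interface StoredChunk extends Match { workspaceId: string }
type Search = (workspaceId: string) => Promise<Match[]>;

// Returns ids that came back for `callerWorkspace` but belong to another workspace.
async function findLeaks(
  search: Search,
  callerWorkspace: string,
  ownership: Map<string, string>,
): Promise<string[]> {
  const matches = await search(callerWorkspace);
  return matches.filter((m) => ownership.get(m.id) !== callerWorkspace).map((m) => m.id);
}

// In-memory fixture: one chunk per tenant.
const chunks: StoredChunk[] = [
  { id: "a1", content: "tenant A doc", similarity: 0.9, workspaceId: "ws_a" },
  { id: "b1", content: "tenant B doc", similarity: 0.8, workspaceId: "ws_b" },
];
const ownership = new Map<string, string>(chunks.map((c) => [c.id, c.workspaceId] as [string, string]));

// Fake that scopes by workspace, and a fake that forgets to.
const scopedSearch: Search = async (ws) => chunks.filter((c) => c.workspaceId === ws);
const leakySearch: Search = async () => chunks;

findLeaks(scopedSearch, "ws_a", ownership).then((ids) => console.log("scoped leaks:", ids));
findLeaks(leakySearch, "ws_a", ownership).then((ids) => console.log("leaky leaks:", ids));
```

In a real suite, search would call the RPC as tenant A after seeding rows for tenant B, and the assertion is simply that the leak list is empty.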

  Takeaways

  - Partial-index your vector column when embeddings arrive async — don't pay HNSW cost on NULL rows.
  - HNSW when rows stream in continuously (no training step); match the operator class to your distance operator or you'll silently seq-scan.
  - Convert distance to similarity only for filtering and output — keep ORDER BY in raw distance space so the index does the sorting.
  - Dimensions are immutable: new model means new column, dual-read, backfill — never in-place ALTER.
  - SECURITY INVOKER plus an explicit tenant predicate. A DEFINER vector RPC is a cross-tenant leak with a clean stack trace.

  Keeping embeddings inside the same Postgres that enforces RLS is what makes one operator (me) able to run multi-tenant RAG without a second system to secure. That's the bet behind Acortia (https://acortia.com) — the brain lives where the tenancy is already enforced.

Writing a cross-client config installer for MCP servers in TypeScript

pengspirit — Wed, 10 Jun 2026 16:02:19 +0000

Anthropic's Model Context Protocol shipped without two things developers immediately wanted: a registry, and a tool to wire a server into a client without hand-editing JSON. This post is about the second one — specifically the mcpr install command we shipped in @incultnitollc/mcpr@0.2.0, what it actually does, and the bugs we hit building it.

If you've never integrated an MCP server, the current flow is:

Find the server on GitHub.
Read the README for its launch command.
Copy a JSON fragment.
Paste it into the right config file for your client.
Restart the client.
Repeat per server. Per client. Per arg change.

Step 4 alone is a maze. Claude Desktop reads from ~/Library/Application Support/Claude/claude_desktop_config.json on macOS, %APPDATA%\Claude\claude_desktop_config.json on Windows. Cline (the VS Code extension) reads from ~/Library/Application Support/Code/User/globalStorage/saoudrizwan.claude-dev/settings/cline_mcp_settings.json. Cursor, Continue, VS Code MCP, and Zed each have their own. The schemas overlap but aren't identical.

The goal of mcpr install is to collapse step 4 into two commands — search the registry for the server, then install the slug it hands back:

npx -y @incultnitollc/mcpr search filesystem
npx -y @incultnitollc/mcpr install npm-agent-infra-mcp-server-filesystem --client claude-desktop

search prints the slug; you paste that slug into install. The slug isn't a free-form string you guess — install without one errors out, because the slug is a key the registry resolves, not an argument you invent. (More on why that matters in the next section.)

That second command does five things worth talking about: npm resolution from the registry, a cross-OS path matrix, a JSON deep-merge that doesn't clobber sibling servers, atomic writes with backups, and file-mode preservation (which we got wrong the first time).

npm-resolve from the registry

The slug isn't a free-form string. It's a key in the MCP Registry's Supabase, where each server row carries an npm_package field. The install path looks the slug up, derives the launch command, and writes it into the client config.

This gives us a useful sandbox boundary: only servers that resolve to an npm package are installable through mcpr install. There's no --from-url escape hatch in v0.2.0. If the registry doesn't have it, the CLI refuses, and you fall back to editing JSON by hand. That's deliberate — it keeps the threat model tight while the registry is small, and it forces server authors to publish to npm (which they should anyway, for npx -y reachability).

The derived launch entry looks like this for a registry server with slug everything:

{
  "command": "npx",
  "args": ["-y", "@modelcontextprotocol/server-everything"]
}

That object is what gets merged into the client config under mcpServers.everything.
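As a rough illustration of that derivation, here is a minimal sketch. The RegistryRow shape and the deriveLaunchEntry name are assumptions for this example; only the npm_package field and the resulting npx entry come from the description above.

```typescript
// Hypothetical row shape: only npm_package is named in the post.
interface RegistryRow {
  slug: string;
  npm_package: string | null;
}

interface LaunchEntry {
  command: string;
  args: string[];
}

// Build the npx launch entry for a registry row.
// Rows without an npm package are refused, mirroring the "npm only" boundary.
function deriveLaunchEntry(row: RegistryRow): LaunchEntry {
  if (!row.npm_package) {
    throw new Error(`Server "${row.slug}" has no npm package; refusing to install.`);
  }
  return { command: "npx", args: ["-y", row.npm_package] };
}

const entry = deriveLaunchEntry({
  slug: "everything",
  npm_package: "@modelcontextprotocol/server-everything",
});
console.log(JSON.stringify(entry, null, 2));
```

Keeping the refusal inside the derivation step means a row with no package never reaches the code that touches the config file.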

The cross-OS, cross-client path matrix

Every supported client has a different config location, and most of them differ per OS. We keep this in a single resolver module so adding a new client is one entry rather than a scavenger hunt across the codebase.

The shape is roughly:

type ClientId = "claude-desktop" | "cline";

const CONFIG_PATHS: Record<ClientId, Partial<Record<NodeJS.Platform, string>>> = {
  "claude-desktop": {
    darwin: "~/Library/Application Support/Claude/claude_desktop_config.json",
    win32:  "%APPDATA%/Claude/claude_desktop_config.json",
    linux:  "~/.config/Claude/claude_desktop_config.json",
  },
  cline: {
    darwin: "~/Library/Application Support/Code/User/globalStorage/saoudrizwan.claude-dev/settings/cline_mcp_settings.json",
    // ...
  },
};

os.platform() picks the row. ~ and %APPDATA% are expanded with the usual os.homedir() / process.env.APPDATA lookups. If a (client, platform) pair isn't supported, the CLI exits non-zero with the unsupported pair named — not a generic "couldn't find config." Naming the missing combination is the difference between a bug report we can act on and one we can't.
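A minimal sketch of that lookup and expansion follows. The function names, the Env type, and the trimmed two-row table are assumptions for illustration, not the module as shipped; home and env are injectable so the behaviour can be checked on any machine.

```typescript
import * as os from "node:os";
import * as path from "node:path";

type Platform = "darwin" | "win32" | "linux";
type Env = Record<string, string | undefined>;

// Two rows copied from the matrix above; a real table would carry every client.
const TEMPLATES: Record<string, Partial<Record<Platform, string>>> = {
  "claude-desktop": {
    darwin: "~/Library/Application Support/Claude/claude_desktop_config.json",
    win32: "%APPDATA%/Claude/claude_desktop_config.json",
  },
};

// Expand %APPDATA% and a leading "~" in a path template.
function expandTemplate(template: string, home: string, env: Env): string {
  let out = template;
  if (out.includes("%APPDATA%")) {
    const appData = env["APPDATA"];
    if (!appData) throw new Error("APPDATA is not set");
    // Replacer function avoids "$" in the value being read as a pattern.
    out = out.replace("%APPDATA%", () => appData);
  }
  if (out === "~" || out.startsWith("~/")) {
    out = path.posix.join(home, out.slice(1));
  }
  return out;
}

// Resolve a (client, platform) pair, naming the pair when it is unsupported.
function resolveConfigPath(
  client: string,
  platform: string = os.platform(),
  home: string = os.homedir(),
  env: Env = process.env,
): string {
  const template = TEMPLATES[client]?.[platform as Platform];
  if (!template) {
    throw new Error(`Unsupported client/platform pair: ${client} on ${platform}`);
  }
  return expandTemplate(template, home, env);
}

console.log(resolveConfigPath("claude-desktop", "darwin", "/Users/me", {}));
```

On Windows the expanded path mixes backslashes and forward slashes, which Node's fs functions accept.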

v0.2.0 ships Claude Desktop and Cline. v1.2 will add candidates from Cursor, Continue, VS Code MCP, and Zed. The matrix is the entire reason that addition is small.

JSON deep-merge, not shallow overwrite

This is the part developers asked for most loudly. A naive installer would read the config, set config.mcpServers = { [slug]: entry }, and write it back. That destroys every other server the user had configured. It also stomps top-level keys like theme, autoUpdate, or whatever the client happens to track alongside MCP servers.

The actual flow:

Read existing JSON. If the file doesn't exist, start from {}.
Parse with a tolerant parser (trailing commas in user-edited configs are real).
Deep-merge: preserve all sibling keys, preserve all sibling servers, set mcpServers[slug] to the new entry.
Serialize with stable key ordering and a final newline.

Before:

{
  "theme": "dark",
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/Users/me/projects"]
    }
  }
}

After mcpr install everything --client claude-desktop:

{
  "theme": "dark",
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/Users/me/projects"]
    },
    "everything": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-everything"]
    }
  }
}

theme survives. filesystem survives. everything is added.

Refuse to clobber, unless told otherwise

If mcpServers.<slug> already exists, the default behavior is to refuse and exit non-zero. The user sees the existing entry, the proposed entry, and a one-line hint about --force.

Two flags change that default:

--force overwrites the existing entry (still preserves siblings).
--dry-run writes nothing and prints the planned post-merge JSON to stdout.

--dry-run is the flag we use most in tests. It also turns out to be the flag users reach for first when they want to inspect what mcpr would do without trusting it yet, which is the correct instinct.
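The merge and the refuse-or-force rule can be sketched in a few lines. This is a simplified illustration, not the shipped code: planInstall and serialize are hypothetical names, and the tolerant parsing step is left out.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };
type JsonObject = { [key: string]: Json };

interface LaunchEntry {
  command: string;
  args: string[];
}

function isObject(value: Json | undefined): value is JsonObject {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Return a new config with mcpServers[slug] set. Sibling top-level keys and
// sibling servers are copied over untouched; the input object is not mutated.
// Throws when the slug already exists, unless `force` is set.
function planInstall(
  config: JsonObject,
  slug: string,
  entry: LaunchEntry,
  force = false,
): JsonObject {
  // A missing or non-object mcpServers value is treated as "no servers yet".
  const servers = isObject(config.mcpServers) ? config.mcpServers : {};
  const exists = Object.prototype.hasOwnProperty.call(servers, slug);
  if (exists && !force) {
    throw new Error(`mcpServers.${slug} already exists; re-run with --force to overwrite.`);
  }
  return {
    ...config,
    mcpServers: { ...servers, [slug]: { command: entry.command, args: entry.args } },
  };
}

// Two-space indentation plus a final newline.
function serialize(config: JsonObject): string {
  return JSON.stringify(config, null, 2) + "\n";
}

const before: JsonObject = {
  theme: "dark",
  mcpServers: {
    filesystem: {
      command: "npx",
      args: ["-y", "@modelcontextprotocol/server-filesystem", "/Users/me/projects"],
    },
  },
};
const everything: LaunchEntry = {
  command: "npx",
  args: ["-y", "@modelcontextprotocol/server-everything"],
};

// A dry run is just this: compute the plan, print it, write nothing.
console.log(serialize(planInstall(before, "everything", everything)));
```

Because planInstall is pure, the dry-run path and the real write path can share it, and only the final write step differs.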

Atomic write with timestamped backup

Config files holding API keys deserve more care than fs.writeFileSync. If the process dies mid-write, you do not want a half-written JSON file to be the only copy.

The write path:

Read the original. If it exists, copy it to <config>.bak.<unix-timestamp>.
Write the new content to <config>.tmp.
fsync the temp file.
rename the temp file over the original. rename within the same filesystem is atomic on POSIX.

The .bak.<timestamp> suffix matters. If a user runs mcpr install ten times, they get ten distinct backups, not a .bak that quietly overwrites the only good copy from yesterday.

The file-mode bug (and the fix)

This is the part worth reading. We shipped a first cut of the install command, self-reviewed it, and caught a real security regression before any user ran it.

The bug: the write step used fs.writeFileSync(tmpPath, json) with no mode argument. Node creates new files with mode 0o666 filtered through the process umask, which under the common umask of 022 comes out as 0o644: owner read/write, everyone else read.

Claude Desktop's config can contain env entries with API keys. Many users (correctly) set their config to 0o600 — owner read/write only, no group, no world. The install step, by writing a fresh file with the default mode, was silently widening 0o600 to 0o644. On a shared machine, every other user could now read your OpenAI key.

The fix is small:

// Before:
fs.writeFileSync(tmpPath, json);
fs.renameSync(tmpPath, configPath);

// After:
const originalMode = fs.existsSync(configPath)
  ? fs.statSync(configPath).mode & 0o777
  : 0o600; // safe default for new configs

fs.writeFileSync(tmpPath, json, { mode: originalMode });
fs.renameSync(tmpPath, configPath);

Two notes on this:

For a fresh config (file didn't exist), we default to 0o600, not 0o644. The default should fail closed.
& 0o777 strips the file-type bits from stat.mode. Forgetting that mask gives you a number that looks right in octal but has the wrong high bits.

Shipped in commit 3af5724. The lesson is older than the bug: file-mode preservation belongs in the same mental category as preserving sibling JSON keys. Both are about respecting state the user already set.
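Putting the write path and the mode fix together, a minimal sketch might look like the following. It is an illustration under assumptions, not the shipped code: writeConfigAtomically is a hypothetical name, and the explicit fchmod is an extra step added here because the mode passed when creating a file is filtered by the umask and is ignored if the temp file already exists.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Backup, temp file, fsync, rename. Keeps the original permission bits,
// and falls back to 0o600 when the config did not exist before.
function writeConfigAtomically(configPath: string, json: string): void {
  const exists = fs.existsSync(configPath);
  // & 0o777 keeps only the permission bits of stat.mode.
  const mode = exists ? fs.statSync(configPath).mode & 0o777 : 0o600;

  if (exists) {
    // Unix timestamp in seconds, as in the .bak.<unix-timestamp> naming above.
    const stamp = Math.floor(Date.now() / 1000);
    fs.copyFileSync(configPath, `${configPath}.bak.${stamp}`);
  }

  const tmpPath = `${configPath}.tmp`;
  const fd = fs.openSync(tmpPath, "w", mode);
  try {
    fs.writeSync(fd, json);
    fs.fchmodSync(fd, mode); // set the bits explicitly, independent of umask
    fs.fsyncSync(fd); // flush to disk before the rename makes it visible
  } finally {
    fs.closeSync(fd);
  }
  // Same directory, so same filesystem: the rename is atomic on POSIX.
  fs.renameSync(tmpPath, configPath);
}

// Usage: write twice into a scratch directory and look at what is left behind.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "mcpr-sketch-"));
const target = path.join(dir, "claude_desktop_config.json");
writeConfigAtomically(target, '{"mcpServers":{}}\n');
writeConfigAtomically(target, '{"mcpServers":{"everything":{}}}\n');
console.log(fs.readdirSync(dir));
```

The second call is the interesting one: it finds the 0o600 file created by the first, backs it up, and carries the same mode onto the replacement.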

Tests

The CLI ships with 29 vitest unit tests covering the resolver, the merge, the refuse-vs-force matrix, dry-run output, mode preservation, and backup naming. The merge and mode tests use fixtures: real claude_desktop_config.json shapes with sibling servers, real 0o600 permissions, real Windows path strings.

We also ran a live end-to-end smoke against Claude Desktop on macOS: install into an empty config (noop diff against expected), install when the slug already exists (refused), install with --force (overwrote, sibling survived). All green. The unit tests caught the logic bugs; the live test caught one path-expansion bug that only showed up against the real config directory.

What's next, and a specific ask

v0.2.0 supports Claude Desktop and Cline. v1.2 will add one or two of: Cursor, Continue, VS Code MCP, Zed. If you have a strong opinion about which one should ship first — based on what you actually use day-to-day, not what's trending — open an issue on the repo with your client and your mcpServers schema if it differs from Claude Desktop's.

The companion pieces are also open source: mcp-probe validates server behavior, and mcp-vouch scores servers against the OWASP MCP Top 10 and emits an A–F trust grade. The registry web UI surfaces those scores on each listing.

Repo: https://github.com/Incultnitollc/mcp-registry (MIT)
npm: https://www.npmjs.com/package/@incultnitollc/mcpr
Web: https://mcp-registry-dh5.pages.dev

The two things I'd most like external eyes on: (1) the client path matrix — if your client config lives somewhere I haven't mapped, send the exact path and platform, and (2) the merge semantics for env entries. Today they get merged as-is; there's a reasonable argument for refusing to merge env blocks at all and forcing the user to confirm, because env values are often secrets and silent overwrite is the wrong default. PRs welcome on both.

How I built a RAG-grounded Discord brain in 5 weeks (solo, ESL, no funding)

pengspirit — Wed, 03 Jun 2026 06:19:18 +0000

Day 14. The fourth time.

A user in our Discord asked, for the fourth time that week, the same question. Same wording, almost. The first three answers were buried somewhere in a thread, a pinned message, and a Notion page nobody bookmarked. A mod typed it out again. I watched it happen, opened Cursor, and started typing.

That's the moment Acortia became a product instead of a side note.

I'm Peng. Solo founder. Non-native English speaker. ESL teacher in Taipei by day, building backend software at night and on weekends. No funding. No team. No accelerator yet — YC F26 application is in. Five weeks ago I committed to building Acortia: a Discord-native Company Brain that answers /ask <q> with a grounded, cited answer pulled from whatever the server has /saved. $99/month. Mid-June launch.

This is the build log. Real numbers, real bugs, real tradeoffs. No hype.

The problem, stated honestly

Discord communities accumulate institutional knowledge the way a cluttered desk accumulates receipts: faster than anyone can file it. Threads scroll past. Pinned messages cap at 50. Search is keyword-based and stops at the channel boundary. New members ask questions that were answered six months ago in a thread that's now archived.

The cost isn't dramatic — it's grinding. Mods burn out re-answering. Founders re-explain pricing. Engineers re-link the same architecture diagram. Knowledge exists; it just isn't retrievable.

I looked at the existing options. Notion + Discord bots: too much manual upkeep. Generic AI chatbots: hallucinate confidently with no source. Custom in-house RAG: out of reach for the average community. The gap was a thin, opinionated tool that lived where the conversation already happened.

The shape of the fix

Acortia is three slash commands and a cron job.

/save <url> — ingest a doc, a thread, a webpage, a PDF. Worker chunks it, embeds it, stores it.
/ask <q> — retrieve top-k chunks via cosine similarity, ground a model response in them, return the answer with inline citations to the source artifacts.
/sources — list what the server has ingested. Audit trail.

Install: OAuth the bot, click through to api.acortia.com/install, claim the workspace via magic-link email. Thirty seconds end-to-end if the operator already has Discord admin.

That's the whole product surface. Everything else is plumbing.

Architecture, in four layers

Discord is the surface. Three slash commands registered globally, one OAuth flow, webhook-style interaction endpoints handled by the Render web service.

Supabase is the brain. Seven tables. Postgres with the pgvector extension. Row Level Security keyed to workspace_id. A single SQL RPC, match_artifacts, does the vector search. RLS means a misrouted query physically cannot return another workspace's data — the database itself enforces tenancy.

Render is the muscle. A web service handles interactive Discord requests with a < 3s deadline. A worker process handles the slow path: fetch URL, extract text (PDF connector for application/pdf, readability-style extractor for HTML), chunk, embed, write. A */15 cron sweeps queued ingest jobs and re-runs anything that timed out.

Stripe is the till. Checkout session for the $99/mo plan, webhook handler with idempotency (every event ID is upserted into stripe_events_seen before any side effect runs), portal link for self-serve management. Promo codes managed in the Stripe dashboard.

Here's the SQL signature of the only RPC the app calls for retrieval. Stylized — the live function has more telemetry, but this is the shape:

-- match_artifacts: cosine similarity search scoped by workspace
create or replace function match_artifacts(
  query_embedding vector(1536),
  workspace_id_input uuid,
  match_count int default 5,
  min_similarity float default 0.15
)
returns table (
  artifact_id uuid,
  chunk_id uuid,
  content text,
  source_url text,
  similarity float
)
language sql stable
as $$
  select
    a.id as artifact_id,
    c.id as chunk_id,
    c.content,
    a.source_url,
    1 - (c.embedding <=> query_embedding) as similarity
  from chunks c
  join artifacts a on a.id = c.artifact_id
  where a.workspace_id = workspace_id_input
    and 1 - (c.embedding <=> query_embedding) >= min_similarity
  order by c.embedding <=> query_embedding
  limit match_count;
$$;

Two numbers in there worth naming: match_count = 5 and min_similarity = 0.15. I tuned both empirically against my own corpus. Higher k bloats the context window without lifting answer quality; lower threshold lets junk through and the model hedges. Lower k makes confident answers brittle when the corpus is sparse. These are the knobs you'll want to revisit per-customer in v2.
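To make the two knobs concrete, here is a small in-memory sketch of the same selection logic. It is an illustration only: the Chunk shape and function names are hypothetical, and the real search runs inside Postgres, where pgvector's <=> operator returns cosine distance (1 minus the similarity computed here).

```typescript
interface Chunk {
  id: string;
  embedding: number[];
}

// Cosine similarity of two equal-length vectors. Returns NaN for a zero vector.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Drop everything under the threshold, then keep the k most similar chunks.
function matchChunks(
  query: number[],
  chunks: Chunk[],
  matchCount = 5,
  minSimilarity = 0.15,
): { id: string; similarity: number }[] {
  return chunks
    .map((c) => ({ id: c.id, similarity: cosineSimilarity(query, c.embedding) }))
    .filter((m) => m.similarity >= minSimilarity)
    .sort((x, y) => y.similarity - x.similarity)
    .slice(0, matchCount);
}

// Toy two-dimensional embeddings, just to show the filter and the cut-off.
const toyChunks: Chunk[] = [
  { id: "same-direction", embedding: [1, 0] },
  { id: "orthogonal", embedding: [0, 1] },
  { id: "diagonal", embedding: [1, 1] },
  { id: "opposite", embedding: [-1, 0] },
];
console.log(matchChunks([1, 0], toyChunks));
```

The threshold decides what is eligible at all, and k only caps how many eligible chunks reach the prompt, which is why the two are tuned separately.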

A slash command, end to end

Here's /ask, sanitized and stylized. The real handler has more error wrapping and a deferred-response pattern for Discord's 3-second deadline, but the spine looks like this:

// apps/web/src/routes/interactions/ask.ts (illustrative)
import { embed } from "../../lib/embed";
import { supabase } from "../../lib/supabase";
import { groundAnswer } from "../../lib/llm";

export async function handleAsk(interaction: DiscordInteraction) {
  const question = interaction.data.options[0].value as string;
  const workspaceId = await resolveWorkspace(interaction.guild_id);

  const queryEmbedding = await embed(question);

  const { data: matches, error } = await supabase.rpc("match_artifacts", {
    query_embedding: queryEmbedding,
    workspace_id_input: workspaceId,
    match_count: 5,
    min_similarity: 0.15,
  });

  if (error) throw error;
  if (!matches?.length) {
    return reply(interaction, "No grounded sources found. Try `/save` first.");
  }

  const answer = await groundAnswer(question, matches);
  await logQuery(workspaceId, question, matches, answer); // queries.metadata

  return reply(interaction, formatWithCitations(answer, matches));
}

The logQuery call writes to queries.metadata — a JSON column that captures which artifacts were retrieved, the similarity scores, latency, and the model used. Telemetry isn't an afterthought; it's the only way to tell, six weeks in, whether the threshold of 0.15 is still right for a given customer.

Three decisions I'd defend at a YC interview

1. pgvector over Pinecone

Pinecone is excellent. It's also a second system to bill, monitor, and reconcile RLS against. Acortia's whole tenancy model is workspace_id on every table. If embeddings live in a separate vector DB, I have to re-implement multi-tenant isolation there and trust two systems instead of one.

pgvector keeps embeddings inside the same Postgres that enforces RLS. The retrieval call is a single RPC. Cost at MVP scale: included in Supabase free tier. The day I outgrow it, the migration to a dedicated vector DB is a few hours, not a rewrite.

2. Magic-link claim over OAuth-only

Discord OAuth tells me who installed the bot. It does not tell me which email owns the workspace for billing. I needed a second factor: a magic link sent to the operator's email so the Stripe Checkout, the invoice, and the workspace ownership all land on the same identity.

The decision inside that decision was implicit-flow vs PKCE for the magic-link callback. I went with implicit. PKCE is more secure on paper, but it requires client-side code verifier storage, which on Discord's embedded browser context is fragile. Implicit + short-lived (10 min) one-time codes + server-side verification gave me a flow that worked first try on iOS Discord, Android Discord, and desktop. The tradeoff: implicit is theoretically replayable in the 10-minute window. Mitigation: one-time-use enforced server-side, codes invalidated on first verification.

I'll revisit PKCE in v2 when I have time to test the embedded-browser edge cases properly.
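The one-time-use rule is easy to state and easy to get subtly wrong, so here is a minimal in-memory sketch of it. The OneTimeCodes class is hypothetical and only illustrates the rule; in a real deployment the check and the invalidation would need to be a single atomic database update rather than a Map.

```typescript
interface Claim {
  expiresAt: number;
  used: boolean;
}

class OneTimeCodes {
  private readonly claims = new Map<string, Claim>();
  private readonly ttlMs: number;

  // Default TTL matches the 10-minute window described above.
  constructor(ttlMs: number = 10 * 60 * 1000) {
    this.ttlMs = ttlMs;
  }

  issue(code: string, now: number = Date.now()): void {
    this.claims.set(code, { expiresAt: now + this.ttlMs, used: false });
  }

  // True at most once per code, and only inside the TTL window.
  verify(code: string, now: number = Date.now()): boolean {
    const claim = this.claims.get(code);
    if (!claim || claim.used || now > claim.expiresAt) return false;
    claim.used = true; // invalidate on first successful verification
    return true;
  }
}

const codes = new OneTimeCodes();
codes.issue("12345678", 0);
console.log(codes.verify("12345678", 60_000)); // first use inside the window
console.log(codes.verify("12345678", 61_000)); // replay of the same code
```

A replay inside the window fails for the same reason an expired code does: the store, not the client, decides whether the code is still live.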

3. Render over Vercel

Vercel is faster to ship for stateless routes. Acortia is not stateless. The ingest pipeline runs longer than any serverless function's hard timeout — PDFs in particular. I needed a long-running worker process and a cron. Render gives me both with one config file and one bill. Web + worker + cron on Render hobby tier costs less than a sandwich per month at MVP scale.

The day I need autoscale across regions, I'll consider Fly. Not before.

What broke: the workspace claim race

Day 20. A test user installed Acortia in two Discord servers using the same email, within about ninety seconds of each other. Both installs triggered a workspace-claim flow. Both wrote to the workspaces table. The second write silently overwrote the first install's billing pointer. The user ended up with one Stripe customer and two Discord servers, but only one of the servers was correctly linked.

The bug had two causes braided together. The naive implementation was:

// Buggy original — two installs collide
const existing = await supabase
  .from("workspaces")
  .select("id")
  .eq("guild_id", guildId)
  .maybeSingle();

if (existing.data) {
  await supabase.from("workspaces").update({ ... }).eq("id", existing.data.id);
} else {
  await supabase.from("workspaces").insert({ ... });
}

Classic check-then-act. Two concurrent claims both saw existing.data === null, both ran insert, the unique constraint caught one and the other won the race. The losing install thought it succeeded because the response came from a different row.

The fix was atomic upsert plus moving email collection to claim time, not install time:

// Day-20 fix — atomic, idempotent
const { data, error } = await supabase
  .from("workspaces")
  .upsert(
    {
      guild_id: guildId,
      claim_email: null, // email collected later via magic link
      claim_token: generateToken(),
      claim_expires_at: new Date(Date.now() + 10 * 60 * 1000),
    },
    { onConflict: "guild_id", ignoreDuplicates: false }
  )
  .select()
  .single();

The atomic upsert means the database decides the winner. The deferred email means the second install doesn't even try to write the email column until the magic link is verified, which by then has a unique session token to disambiguate. I also added a trigger to fail-loud if claim_email ever gets overwritten on a row that already has one — defense in depth.

Stripe webhooks got the same treatment because they always should:

// Webhook idempotency: claim the event ID atomically before any side effect.
// Assumes a unique constraint on stripe_events_seen.event_id.
const { error: claimError } = await supabase
  .from("stripe_events_seen")
  .insert({ event_id: event.id });

// 23505 is Postgres unique_violation: another delivery already claimed this event.
if (claimError?.code === "23505") return new Response("ok", { status: 200 });
if (claimError) throw claimError;

await handleStripeEvent(event); // runs at most once per event ID

Idempotent webhooks are non-negotiable. Stripe will retry. You will get duplicates. Plan for it on Day 1, not Day 30.

What I didn't ship

Three things were on the board and got cut. Each cut was deliberate.

Slack adapter. I scaffolded a platform-adapter abstraction on Day 8 — the idea was that /save and /ask would be platform-agnostic and Slack would be a second surface. The scaffolding is in the repo. I did not build the Slack OAuth flow, slash command registration, or interaction handler. Reason: Slack outreach pre-launch was zero signal. Discord operators were actively asking for the tool. Building Slack would have cost a week and shipped a feature for a customer I didn't have. Parked until live revenue justifies it.

Notion connector. Considered. Killed. The use case I imagined — pull Notion pages as artifacts — is well-served by users copy-pasting URLs into /save. The MCP route through Claude Desktop is enough for the operator's personal workflow. A first-party Notion connector adds OAuth, page-permission edge cases, and a separate sync cron. Not worth the complexity at MVP.

Pipedream MCP custom server. I spent a few hours wiring Pipedream as a generic connector tier. Backend was healthy, auth worked, but the abstraction was leaking into the slash-command UX. I cut it and routed power-user workflows through Claude Desktop's MCP instead. Acortia stays focused. Operators who want orchestration use Claude Desktop and call Acortia as a tool.

What I'd do differently

Telemetry first. I added queries.metadata on Day 6, which was correct, but I didn't build a dashboard around it until Week 4. For the first three weeks I was debugging retrieval quality by reading raw Postgres rows. A 30-minute Metabase dashboard would have saved hours of squinting. If you're building RAG: instrument retrieval before you instrument anything else. You can't tune what you can't see.

Try it

Mid-June 2026 launch. Soft-live now for beta operators.

Install: api.acortia.com/install
Domain: acortia.com

Promo for readers of this post: BETA-FREE-30D — 100% off the first month, 10 redemptions, expires 2026-06-30 23:59 UTC. After that the price is $99/month flat. No per-seat. No usage tier. One Discord server, one bill.

If you operate a Discord community, run a developer relations team, or moderate a paid creator server: this was built for you. If you don't, the architecture above is open notes — steal whatever's useful.

Footer: the founder context

I'm in Taipei. I teach English to fund this build. I am not a native English speaker and I rewrite half of what I publish three times before it reads cleanly. Every line of Acortia was written between lesson plans and weekend mornings. No team. No accelerator yet. No outside capital.

What I'm proving with this build: a solo non-US founder can ship a credible B2B SaaS product end-to-end — auth, billing, RAG, multi-tenant data isolation, idempotent webhooks, a real cron pipeline — in five weeks of nights-and-weekends time, on a stack that costs less than a streaming subscription to run.

If that's interesting to you, the install link is above. If you want to talk shop, I'm on Discord and X under the same handle.

Brief. Concept. Preview. Ship.

6 of 6 official MCP servers cluster at 56–60/100 on schema-description density

pengspirit — Wed, 27 May 2026 07:10:39 +0000

After ten days of running the v1.1.0 publishability rubric against every MCP server I can find on npm under the official @modelcontextprotocol scope, the cluster pattern is now hard to ignore.

6 of 6 official Anthropic-shipped MCP servers score 56–60/100 on the v1.1.0 publishability composite. The cap that fires is the same axis every time: description-five-axis.

| Server | Composite | Protocol | Edge cases | Publish | Per-tool axis avg | Cap |
|---|---:|---:|---:|---:|---:|---|
| server-sequential-thinking | 60 | 100 | 100 | 20 | n/a (single tool) | description-five-axis |
| server-memory | 60 | 100 | 85 | 50 | 1.00 / 5 | description-five-axis |
| server-everything | 60 | 100 | 94 | 20 | 0.55 / 5 | description-five-axis |
| server-filesystem | 60 | 100 | 57 | 50 | 0.88 / 5 | description-five-axis |
| server-github (legacy) | 60 | 100 | 26 | 50 | 0.44 / 5 | description-five-axis |
| server-puppeteer (deprecated) | 56 | 100 | 50 | 20 | 0.17 / 5 | description-five-axis |

Every protocol score is 100. The wire format is right on every server. The 40-to-44-point gap comes entirely from how the schemas read.

## What "0.17 / 5" looks like in practice

Take Puppeteer's puppeteer_navigate. The full schema description is:

> Navigate to a URL.

Score that against the 5 axes:

- Purpose: "navigate to a URL" ✓ (1 axis)
- Mutation signal: does it read or write? Silent. ✗
- Side-effects: network call, can hit any URL, executes JS, arbitrary cookie state. High-blast. Silent. ✗
- Invariants: does it close existing tabs? Open a new one? Same tab? Silent. ✗
- Examples: none. ✗

1 / 5. The other six Puppeteer tools score the same way. Average 0.17.

A planner LLM that has to decide whether to call puppeteer_navigate from a tool list of 7 has nothing to pattern-match on. It cannot tell the difference between puppeteer_navigate (mutates browser state, can hit any URL) and puppeteer_screenshot (read-only, current page only) from the schema alone — they read identically.

## Why this matters more than it looks

The reference servers are calibration anchors. When a server author opens the docs to figure out "what does a good MCP server look like", they read these. When an LLM coding agent autocompletes a new MCP server skeleton, it pattern-matches on these. When the spec doc shows "here's how to write a tool", it links to these.

If the bar Anthropic ships at is 56–60/100, that's the bar most third-party servers will start from too — and probably stay at, because there's no public benchmark telling them they're under it.

That's the v1.1.0 thesis: surface the bar so authors can decide where they want to land. mcp-probe score is one command.

```bash
npx -y @incultnitollc/mcp-probe score "" --full
```

The 5-axis breakdown tells you exactly which axis is empty on which tool. A per-tool axis average below 3.0/5 fires the ≤60 publishability cap. Fix two axes per tool (mutation signal + one concrete example is usually fastest) and the cap lifts.

## Methodology

- v1.1.0 spec: <https://github.com/Incultnitollc/mcp-probe/blob/main/docs/specs/publishability-score-v1.1.0.md>
- Calibration drift notes: <https://github.com/Incultnitollc/mcp-probe/blob/main/docs/specs/publishability-score-v1.1.0-amendments.md>
- 6-server summary (canonical): <https://github.com/Incultnitollc/mcp-probe/blob/main/docs/publishability-scorecards/SUMMARY.md>
- Individual server scorecards: under `docs/publishability-scorecards/` in the same repo

## Caveat: install-time security is a different lane

`mcp-probe` is pre-publish quality (server authors, before they ship). For install-time security (server installers, before they connect a third-party server), see [`@stephenywilson/mcp-doctor`](https://www.npmjs.com/package/@stephenywilson/mcp-doctor). Different audience, different lane, complementary tool.

What does a missing description on an MCP tool actually do? Four failure modes I traced from real MCP servers

pengspirit — Tue, 12 May 2026 12:46:23 +0000

This is the third article in a series. The first established that schema descriptions are load-bearing — if you ship an MCP tool with { "type": "string" } and no description, the model has to guess at a contract that doesn't exist. The second pushed further: tool descriptions are runtime policy, not documentation — the absence of a "do not use for X" clause is a permission to use the tool for X.

This one answers the engineering question that sits underneath both: what specifically happens, mechanically, when an MCP tool's description is missing? Not in the abstract — in the four failure modes I have actually watched a Claude-class agent produce against real MCP servers I've run mcp-probe over.

The short version is that a missing description does not produce one failure. It produces a hierarchy of four, each one further away from where the bug appears to come from.

Failure mode 1 — selection failure (the tool is invisible)

The cheapest failure, and the one nobody notices, is that the tool simply doesn't get called.

When Claude looks at a tool list, it reads name + description + inputSchema.properties[].description as a single decision packet. The name alone is rarely enough. fetch_data could mean "fetch from the database," "fetch from the API," "fetch from cache," or "read a file." Without a description that disambiguates, the agent treats the tool as a noisy candidate and picks something else.

I have a server in front of me right now where one of the tools is named lookup. No description on the tool. The schema's single string parameter has no description either. Across maybe 30 attempts to use it through Claude over a week, the model called it twice. Both times, the tool was wrong. The other 28 times, the model went elsewhere — usually to a tool with a clearer description, even when that tool was a worse fit.

The signal you'd want here — "the model would have used my tool but doesn't know what it does" — is invisible. The tool doesn't error. It's not slow. It just doesn't show up in the trace, because the trace only records calls that happened.

Failure mode 2 — argument shape failure (the model picks, the schema rejects)

If the model does pick the tool, the next thing it has to do is fill in arguments. With no parameter descriptions, it makes the argument shape up from the parameter name and type.

Real example from @modelcontextprotocol/server-filesystem. The server has a read_file tool. The schema declares one required property, path: { type: "string" }, and as documented the parameter carries no description. Watch what happens when you try to use it:

The model has to decide: absolute path or relative? Relative to what — workspace, server CWD, user home?
It has to decide: is the path expected to be inside an allowed root, or anywhere on disk?
It has to decide: is ~/foo.txt allowed, or does it need to be expanded?
It has to decide whether forward-slashes or backslashes matter on the platform it thinks it's running on.

None of these are answerable from path: string. The model will pick something — usually /Users/<name>/<project>/<file> for absolute, or ./<file> for relative — but the choice is a 50/50 against your real path-resolution logic. Half the time, the call succeeds. Half the time, it returns "permission denied" or "file not found," and the model has to retry with a different shape, blowing through 1–2 turns of context to recover from a description that should have been one sentence.

The fix on read_file is exactly one line of schema:

 path: {
   type: "string",
+  description: "Absolute path inside one of the allowed roots configured at server startup. Use forward slashes. Tilde expansion is not performed."
 }

Add that, and the failure mode goes away. The argument lands right on the first try.

Failure mode 3 — LLM-side validator rejection (the call never leaves the client)

This is the failure mode I had not seen until I started running mcp-probe against real servers, and it's the one that surprised me.

Several MCP clients — Claude Desktop in particular at certain config thresholds — apply a secondary validator on top of the schema you ship. Not the JSON Schema validation that runs server-side after the call. A pre-flight check that runs before the call leaves the client.

That validator looks for two things: (a) is description present at the tool level, and (b) is description present on every required parameter. When either is missing, the client doesn't refuse the tool outright — it down-weights it heavily, and in some configurations the call gets rewritten to a "ask the user" path instead.

I do not have a public spec to point at for this — it's behavior I observed across multiple MCP clients while building the scorecards published in this repo's docs/scorecards/ directory. Servers with full descriptions consistently saw 2–3× more tool invocations through the same agent task than servers without, holding everything else constant. The mechanism, as best I can reconstruct it, is the client treating description-completeness as a quality signal and routing around tools that score low.

If that's right — and the scorecard data is the evidence I have — then a missing description doesn't just degrade tool selection. It degrades it twice: once at the model layer (failure mode 1) and once at the client layer (failure mode 3). Stacked, those move a tool from "occasionally used wrong" to "effectively unreachable."

Failure mode 4 — routing collapse (your tool should get used, the wrong tool gets used instead)

The last failure mode is the one that tool authors notice last and find most painful, because it shows up as "another team's tool is eating my tool's traffic."

When two MCP tools have overlapping intent surfaces (say, your send_email and another server's notify_user), the description is the only thing the model uses to route between them. If yours has a sharp description ("transactional email triggered by an explicit user action; do not use for marketing or broadcast") and the other has only a vague one, the routing collapses toward the vague one, not away from it.

This is counterintuitive. You would expect "more specific description = more likely to be picked." It works the other way. A vague description has no negative scope. The model sees "could plausibly handle this" and picks it for everything within the envelope, including cases your tool would have handled better. Yours, with the sharp scope, only gets picked when the model is sure your case applies — which is rare, because being sure is expensive.

The defense is the anti-purpose clause from the second article in this series: write what your tool is not for, by name, pointing at the specific other tool you want the routing to go to instead. "Do not use this for marketing campaigns or one-off broadcasts — those go through marketing_send." The other tool's vagueness is now your contract. If they don't add an anti-purpose clause back, you've at least claimed the boundary unilaterally.
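A minimal sketch of how that clause might sit in a tool definition. The object shape mirrors an MCP tool listing entry (name, description, inputSchema). The parameter names `to`, `subject`, and `body` are assumptions made up for illustration, and `marketing_send` stands in for whatever sibling tool you actually want the routing to go to.

```typescript
// Hypothetical tool definition; only the fields this example needs.
interface ToolDefinition {
  name: string;
  description: string;
  inputSchema: {
    type: "object";
    properties: Record<string, { type: string; description?: string }>;
    required?: string[];
  };
}

const sendEmail: ToolDefinition = {
  name: "send_email",
  // Positive scope first, then the anti-purpose clause naming the sibling tool.
  description:
    "Sends one transactional email triggered by an explicit user action. " +
    "Do not use this for marketing campaigns or one-off broadcasts; " +
    "those go through marketing_send.",
  inputSchema: {
    type: "object",
    properties: {
      // Illustrative parameters, each with a description that pins the shape.
      to: { type: "string", description: "Single recipient address, e.g. user@example.com" },
      subject: { type: "string", description: "Plain-text subject line, no newlines" },
      body: { type: "string", description: "Plain-text message body" },
    },
    required: ["to", "subject", "body"],
  },
};
```

The boundary lives in the description string itself, so it travels with the schema and needs no cooperation from the other server.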

What this means for the schema you ship

Three small rules that fall out of the four failure modes:

Every tool gets a description, period. Not "TODO: add description." Actually describe what the tool does, in one sentence, in the first 80 characters — that's the part the agent's selection packet uses most heavily.
Every required parameter gets a description that pins the shape. Not "the path." A description like "Absolute path inside an allowed root, forward slashes, no tilde expansion" — five constraints in fifteen words. If you can't write that sentence, you don't fully understand the parameter, and your server will fail in failure mode 2 anyway.
For any tool whose intent overlaps another tool you know about, write the anti-purpose clause. Name the other tool. Point at it. Vagueness is a vacuum that the routing fills with whichever tool sounds adjacent enough.

The contract framing

If I had to compress the whole series into one line, it would be this: the description fields in an MCP tool's schema are the only contract the model sees at runtime. Not the README, not the docs site, not the GitHub issues. The schema. Anything you don't write into the description doesn't exist for the agent.

The four failure modes above are what happens when that contract has gaps. Each gap looks like a different bug (selection went wrong, arguments went wrong, the call never left the client, traffic went to a competitor), but the root cause is the same every time, and so is the one-line fix.

I built mcp-probe to make these failures visible before they ship. It enumerates every tool a server exposes, flags missing descriptions on tools and required parameters, runs every callable tool with auto-generated arguments matching the declared schema, and exits non-zero if any of failure modes 1–4 are statically detectable. It's not a replacement for Anthropic's MCP Inspector — Inspector is the right tool for interactive debugging when something has already gone wrong. mcp-probe is the pre-publish CLI for catching the four failures above before the model ever sees the server.

Both tools are useful. They sit on different sides of the same problem.

If you're shipping an MCP server, the one specific thing I'd ask is this: before you publish, run something that fails on missing descriptions. It can be mcp-probe, it can be a homemade lint, it can be a code review checklist. The failure modes above are not theoretical — they're the four actual ways a missing description shows up in production. Catch them at lint time and your server enters the ecosystem at the top of the routing surface, not invisible at the bottom.
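For the homemade-lint option, here is a rough sketch of what such a check could look like. It is not how mcp-probe is implemented. The `Tool` shape below is a trimmed-down assumption covering only the fields the lint reads, and treating a "TODO" placeholder as missing is my own choice.

```typescript
// Minimal shapes for the parts of a tool listing this lint reads (assumed, not a full MCP type).
interface PropertySchema {
  type?: string;
  description?: string;
}

interface Tool {
  name: string;
  description?: string;
  inputSchema?: {
    properties?: Record<string, PropertySchema>;
    required?: string[];
  };
}

// True when a description is absent, blank, or an obvious placeholder.
function isMissing(text: string | undefined): boolean {
  if (text === undefined) return true;
  const trimmed = text.trim();
  return trimmed.length === 0 || /^todo\b/i.test(trimmed);
}

// Returns one message per gap; an empty array means the lint passes.
function lintDescriptions(tools: Tool[]): string[] {
  const problems: string[] = [];
  for (const tool of tools) {
    if (isMissing(tool.description)) {
      problems.push(`${tool.name}: tool description missing`);
    }
    const properties = tool.inputSchema?.properties ?? {};
    for (const param of tool.inputSchema?.required ?? []) {
      if (isMissing(properties[param]?.description)) {
        problems.push(`${tool.name}: required parameter "${param}" missing description`);
      }
    }
  }
  return problems;
}

// Example input: one fully described tool, one with both kinds of gap.
const problems = lintDescriptions([
  {
    name: "read_file",
    description: "Reads one UTF-8 text file from an allowed root.",
    inputSchema: {
      properties: { path: { type: "string", description: "Absolute path inside an allowed root" } },
      required: ["path"],
    },
  },
  {
    name: "write_file",
    description: "TODO: add description",
    inputSchema: { properties: { path: { type: "string" } }, required: ["path"] },
  },
]);
```

In CI you would fail the build whenever the returned array is non-empty.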

The next article in this series will walk through the same four failure modes from the client author's side — what an MCP client should do when it sees a tool with no description, beyond just rendering it. That's where the secondary validator in failure mode 3 lives, and it's where the load-bearing-descriptions framing has its sharpest implication.

Tool descriptions are load-bearing too: the anti-purpose pattern in MCP

pengspirit — Thu, 07 May 2026 14:33:09 +0000

A few days ago I posted Schema descriptions are load-bearing: why missing parameter descriptions break MCP clients. The argument: every parameter without a description is a load-bearing element silently absent from the schema, and agents fail in ways that look like model problems but are actually contract problems.

The post got a comment from @mickyarun that's worth its own essay:

The "load-bearing" framing is the right shape — the same observation applies one level up at the tool level. Most MCP catalogues we've audited had perfectly described parameters but no description of when not to call this tool, which is the bit that actually decides whether an agent reaches for the right surface. The half-hour we spent adding "anti-purpose" descriptions to about a dozen of our internal tools cut the wrong-tool-selected rate roughly in half. Arguably the parameter case in this post is just the most visible instance of a broader rule: every field of every schema an agent reads is doing structural work whether you specified it or not.

He's right, and the pattern deserves a name. Call it the anti-purpose pattern: every tool description should specify not just what the tool is for, but what it is not for.

HOW vs WHETHER

Parameter descriptions answer HOW to call a tool — what types, what shape, what valid values.

Tool descriptions answer WHETHER to call a tool — does this surface match the user's intent at all.

Both are schema. Both are load-bearing. The first is usually under-specified. The second is almost always under-specified.

Why "Searches the web" fails

Most MCP tool descriptions read like marketing copy:

"Searches the web for information"
"Retrieves data from the database"
"Sends an email"

This is fine in isolation. It collapses the moment an agent has three search tools, two database tools, and four messaging tools loaded at once — which is the actual production scenario.

The agent has to disambiguate. The schema gave it nothing to disambiguate with. So it picks the first plausible match, or the one with the cleanest parameter list, or the one whose name lexically matches the user's phrasing. None of these correlate with correctness.

The anti-purpose pattern

The fix is mechanical:

Before: "Searches the web for information"

After:  "Searches the public web for current events,
         news, and recently published content.
         Do not use for: code lookup (use code_search),
         internal documentation (use docs_search),
         or queries answerable from training data."

Three changes:

Specific scope — "public web" not "the web", "current events" not "information"
Disambiguation pointers — names the sibling tools the agent might confuse this with
Explicit exclusions — the "do not use for" clause

@mickyarun reports roughly 50% fewer wrong-tool-selection errors after adding clauses like this to about a dozen internal tools. That's a half-hour edit producing a measurable behavior shift, with no model change and no prompt-engineering tax on the consumer side.

Why tool authors skip this

Two reasons, both fixable:

The author knows what the tool is for, so the description is implicit. Authors write descriptions that document the tool's positive purpose because that's what they were thinking about while writing it. The negative purpose — what they consciously decided this tool would not do — never makes it onto the page.
MCP examples don't model it. Look at any MCP server template or quickstart and tool descriptions are one-line declaratives. There's no canonical example that says "here's what a production tool description looks like with anti-purpose."

The first is fixed by a checklist. The second is fixed by people writing posts like this one.

Concrete checklist

When writing or auditing a tool description, the description should answer:

Scope: What specifically does this operate on? ("public web", "this user's calendar", "Postgres tables in the analytics schema")
Trigger: What user intent should select this tool?
Anti-trigger: What user intent looks similar but should select a different tool?
Sibling pointer: Which neighboring tools are the most likely confusion sources, and what should send the agent there instead?

If you have more than one tool in your MCP server, all four are load-bearing. Skipping any of them outsources the disambiguation to whatever the model happens to guess.
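One way to make the checklist hard to skip is to build the description from its parts. This is a sketch under my own assumptions: the `DescriptionParts` and `Exclusion` types and the `renderDescription` helper are hypothetical names, not part of MCP or of mcp-probe.

```typescript
// One exclusion pairs an anti-trigger with an optional sibling pointer.
interface Exclusion {
  intent: string;      // anti-trigger: an intent that looks similar but belongs elsewhere
  useInstead?: string; // sibling pointer: the tool that should handle it, if one exists
}

// One field per checklist item.
interface DescriptionParts {
  scope: string;   // what the tool operates on, as a full sentence
  trigger: string; // the user intent that should select this tool
  exclusions: Exclusion[];
}

// Renders the parts into a single description string.
function renderDescription(parts: DescriptionParts): string {
  const positive = `${parts.scope} Use when ${parts.trigger}.`;
  if (parts.exclusions.length === 0) return positive;
  const excluded = parts.exclusions.map((e) =>
    e.useInstead ? `${e.intent} (use ${e.useInstead})` : e.intent
  );
  return `${positive} Do not use for: ${excluded.join(", ")}.`;
}

const webSearch = renderDescription({
  scope: "Searches the public web for news and recently published content.",
  trigger: "the user asks about current events",
  exclusions: [
    { intent: "code lookup", useInstead: "code_search" },
    { intent: "internal documentation", useInstead: "docs_search" },
    { intent: "queries answerable from training data" },
  ],
});
```

Because the type requires `scope`, `trigger`, and `exclusions`, an author who leaves the negative scope empty has to do so on purpose.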

Coming to mcp-probe

This is the next axis I'm adding to mcp-probe. Parameter-description coverage is already scored. Tool-description quality — including a heuristic for anti-purpose clauses — belongs in the same scorecard.
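As a rough idea of what such a heuristic could look like: the marker list below is my assumption for illustration, not what mcp-probe ships, and a phrase match can only show that a clause exists, not that it is a good one.

```typescript
// Phrases that usually introduce a negative scope. The list is an assumption, not a standard.
const ANTI_PURPOSE_MARKERS: RegExp[] = [
  /\bdo not use\b/i,
  /\bdon't use\b/i,
  /\bnot for\b/i,
  /\buse \w+ instead\b/i,
];

// True when the description contains at least one negative-scope marker.
function hasAntiPurposeClause(description: string): boolean {
  return ANTI_PURPOSE_MARKERS.some((marker) => marker.test(description));
}

const vague = hasAntiPurposeClause("Searches the web for information");
const scoped = hasAntiPurposeClause(
  "Searches the public web for current events. Do not use for: code lookup (use code_search)."
);
```

Here `vague` is false and `scoped` is true.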

Thanks to @mickyarun for the comment that pulled the framing one level up. Schema descriptions are load-bearing. So is every other field of the contract an agent is asked to read.

Schema descriptions are load-bearing: why missing parameter descriptions break MCP clients

pengspirit — Tue, 05 May 2026 16:16:49 +0000

I shipped mcp-probe — a CLI that points at any MCP server, enumerates every tool, resource, and prompt, calls each with auto-generated arguments, validates against declared schemas, prints a pass/fail scorecard, and exits 0/1 for CI.

The plan for launch week: run it against the official Node MCP servers and post results. The first run made me look like I'd broken half the ecosystem. The second, after I read my own output, told a different story — most failures were bugs in my client, not the servers. The rest collapsed into one finding about schema design.

This post is the corrected version. Three sections: what mcp-probe does, what the scorecards say, and the three bugs I fixed in my own client first.

1. What mcp-probe does

One command. stdio, SSE, or Streamable HTTP transport. No config file required.

npx @incultnitollc/mcp-probe test "npx -y @modelcontextprotocol/server-memory"

Output is a scorecard:

Tools callable:      9/9
Resources readable:  n/a
Prompts callable:    n/a
Schema warnings:     4
ALL CHECKS PASSED

Exit code 0 if everything passes, 1 if anything fails. Drop it in CI:

- run: npx -y @incultnitollc/mcp-probe test "node dist/index.js"

Install globally if you'd rather not npx every time:

npm install -g @incultnitollc/mcp-probe

The mental model is curl for MCP servers. You don't open Claude Desktop, hand-write a config, restart the app, and stare at the tool list to see whether anything broke. You run one command and get a scorecard.

2. What I found across the four official Node servers

Here is the actual scorecard from docs/scorecards/SUMMARY.md, re-run on @incultnitollc/mcp-probe@1.0.1:

Server	Tools	Resources	Prompts	Schema warns	Status
`@modelcontextprotocol/server-memory`	9 / 9	n/a	n/a	4	PASS
`@modelcontextprotocol/server-sequential-thinking`	1 / 1	n/a	n/a	0	PASS
`@modelcontextprotocol/server-everything`	12 / 13	7 / 7	3 / 4	1	partial
`@modelcontextprotocol/server-filesystem`	8 / 14	n/a	n/a	18	partial

Aggregate: 30 of 37 tools callable across four servers, 81%. Two servers fully pass. The other two have a single failure pattern between them.

A scope note before the finding, because I got this wrong the first time: Anthropic's fetch MCP server is Python-only, installed via uvx mcp-server-fetch. It has never been published to npm. mcp-probe runs against any stdio MCP server regardless of language — only this scorecard is scoped to the official Node servers. Earlier launch copy of mine that called server-fetch "broken on npm" was wrong, and I want to flag it explicitly here because I almost shipped that draft.

Now the real finding. Every remaining failure on the partial-pass servers traces to the same root cause: missing description fields on schema properties.

On server-filesystem, six of the fourteen tools fail because mcp-probe doesn't know which arguments are supposed to be file paths versus directory paths versus arbitrary strings. The path parameter on read_file, read_text_file, read_media_file, edit_file, and write_file has no description in the schema, so my client defaults to the allowed sandbox directory itself. The server correctly returns EISDIR (you tried to read a directory as a file) or EACCES (you tried to write to one). move_file fails the same way — both source and destination resolve to the same directory, and the server correctly refuses the no-op rename. The server is doing its job. The schema is the gap.

On server-everything, one prompt fails because the resourceType argument has no description. It's an enum — "Text" or "Blob" — but with no description and no examples, my client passes the literal string "test" and the server correctly returns Invalid resourceType: test. The schema validator inside mcp-probe even raises a warning on this property before the call fires:

WARN  get-resource-reference — Property "resourceType" missing description

That warning is the diagnostic working as intended — mcp-probe still attempts the call, then surfaces both the warning and the resulting failure side-by-side so you can see the connection.
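To see why the call ends up carrying "test", here is a simplified sketch of schema-driven argument generation. It is not mcp-probe's actual generator. The `PropertySchema` shape and the fallback values are assumptions chosen to match the behavior described above.

```typescript
// The only facts a generator has about a parameter (assumed shape).
interface PropertySchema {
  type: string;
  enum?: string[];
  description?: string;
}

// Picks a sample value using only what the schema declares.
function sampleValue(schema: PropertySchema): string | number | boolean {
  const [firstAllowed] = schema.enum ?? [];
  if (firstAllowed !== undefined) return firstAllowed; // a declared enum settles it
  if (schema.type === "number" || schema.type === "integer") return 1;
  if (schema.type === "boolean") return true;
  return "test"; // bare string with no enum and no description: nothing to go on
}

const guessed = sampleValue({ type: "string" });
const declared = sampleValue({ type: "string", enum: ["Text", "Blob"] });
```

With a declared enum the generator sends "Text". Without one it falls through to the placeholder, which is the same position a model is in when it picks an argument.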

The substantive insight, and the line I'll repeat at every MCP-related event for the next year: when an MCP server ships parameter properties without descriptions, no automated tool can guess valid arguments. Not mcp-probe. Not your IDE's autocomplete. Not an LLM trying to call the tool from Claude Desktop. Schema descriptions aren't documentation polish. They're the instruction manual the model is reading every time it picks an argument. They're load-bearing.

If you maintain an MCP server and you want a quick win, add "description" to every property in every input schema. The 18 schema warnings on server-filesystem are not 18 separate problems — they're 18 instances of the same one-line fix.

3. The three bugs I fixed in my own client first

Here's the part I want to be honest about. The first time I ran mcp-probe against server-filesystem, I got 2 of 14 tools passing and a scorecard that screamed FAIL. My instinct was to write a launch post saying "the official filesystem server is broken." I almost did.

Then I actually read my own output. Most of those failures were because my client was sending arguments the server had no way to accept. A diagnostic tool is only credible if it can distinguish "your server is broken" from "I sent garbage." Stress-testing forced that distinction, and three commits came out of it before I trusted the scorecard.

Commit 3825170 — show the args we sent on every failure. When a tool or prompt call fails, mcp-probe now prints the exact JSON it sent alongside the server's error response. Before this, a failure looked like MCP error -32603: Invalid resourceType: test with no indication that "test" was something my client had auto-generated. After this, you can read the failure and immediately tell whether the server rejected something reasonable or something nonsense. This is the smallest of the three changes and the most important one for the trust story.
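The idea behind that change can be sketched in a few lines. The `CallFailure` type and the output layout here are hypothetical, not mcp-probe's real format. The point is only that the sent arguments and the server's error appear together.

```typescript
// Everything needed to judge a failure: what was sent and what came back.
interface CallFailure {
  tool: string;
  sentArgs: Record<string, unknown>;
  serverError: string;
}

// Formats a failure so the auto-generated arguments sit next to the server's response.
function formatFailure(failure: CallFailure): string {
  return [
    `FAIL  ${failure.tool}`,
    `  sent:  ${JSON.stringify(failure.sentArgs)}`,
    `  error: ${failure.serverError}`,
  ].join("\n");
}

const report = formatFailure({
  tool: "get-resource-reference",
  sentArgs: { resourceType: "test" },
  serverError: "MCP error -32603: Invalid resourceType: test",
});
```

Reading `sent` and `error` side by side makes it obvious that "test" came from the client, not from the server.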

Commit ce4f55e — sandbox-aware paths. server-filesystem enforces an allowed-directory sandbox. mcp-probe now calls list_allowed_directories before generating sample arguments and uses one of those directories as the default for any path-shaped parameter. On macOS, where /tmp is a symlink to /private/tmp, it normalizes via realpath so the path the server receives matches what the sandbox check expects. This single commit moved server-filesystem from 2 of 14 passing to 8 of 14. The remaining 6 are the missing-description cases I already covered — the bugs that aren't mine.
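A minimal sketch of the normalization step, assuming the allowed directories have already been fetched from list_allowed_directories. The helper name `defaultSandboxPath` is hypothetical. The resolver is passed in so the example runs anywhere; in Node you would pass `realpathSync` from `node:fs`.

```typescript
// Resolver signature compatible with Node's fs.realpathSync for a string path.
type Realpath = (path: string) => string;

// Picks the first allowed directory and resolves symlinks so it matches the server's sandbox check.
function defaultSandboxPath(allowedDirs: string[], realpath: Realpath): string | undefined {
  const [first] = allowedDirs;
  if (first === undefined) return undefined; // server reported no allowed directories
  try {
    return realpath(first);
  } catch {
    return first; // directory may not exist on this machine; fall back to the raw value
  }
}

// Stand-in resolver that mimics macOS, where /tmp is a symlink to /private/tmp.
const fakeRealpath: Realpath = (path) => (path === "/tmp" ? "/private/tmp" : path);
const resolved = defaultSandboxPath(["/tmp"], fakeRealpath);
```

With the stand-in resolver, `resolved` is "/private/tmp", the form the sandbox check expects.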

Prompt-argument enum extractor. When a prompt argument is described in prose like "one of: Text, Blob" instead of as a JSON Schema enum, mcp-probe now tries to parse the allowed values out of the description string and pick one. Partial — it works on the prompts that have prose-level documentation, and it does nothing for arguments like resourceType on server-everything that have neither schema enum nor prose description. This is why the schema-description finding above isn't theoretical: I built the workaround, and the workaround can't help when there's no text to read.
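A simplified sketch of that kind of extraction. The function name and the "one of:" pattern are assumptions for illustration, and the real extractor may recognize other phrasings.

```typescript
// Pulls allowed values out of prose such as "one of: Text, Blob".
// Returns an empty array when there is no description or no recognizable list.
function extractEnumFromProse(description: string | undefined): string[] {
  if (!description) return [];
  // Capture everything after "one of" up to the end of the sentence or line.
  const match = /\bone of:?\s*([^.;\n]+)/i.exec(description);
  const list = match?.[1];
  if (list === undefined) return [];
  return list
    .split(/,|\bor\b/)
    .map((value) => value.trim().replace(/^["']|["']$/g, ""))
    .filter((value) => value.length > 0);
}

const fromProse = extractEnumFromProse("Resource type, one of: Text, Blob");
const fromNothing = extractEnumFromProse(undefined);
```

`fromProse` is `["Text", "Blob"]`. `fromNothing` is an empty array, which is the resourceType case: with no text to parse, the workaround has nothing to return.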

The loop, in one sentence: I had to make my client honest about what it was sending before I could call any server's failure a server bug.

Try it

npm install -g @incultnitollc/mcp-probe
mcp-probe test "npx -y @modelcontextprotocol/server-memory"

Repo: github.com/incultnitollc/mcp-probe
npm: @incultnitollc/mcp-probe
Raw scorecards from this post: docs/scorecards/
Pre-publish checklist for MCP server maintainers: docs/checklist.md

If you maintain an MCP server and you want a scorecard run against it, open an issue with the test-my-server template and I'll post the results as a comment. If mcp-probe reports something that looks like a server bug and isn't, open an issue against mcp-probe instead — that's the loop that produced commits 3825170 and ce4f55e, and it's the only way the diagnostic gets more trustworthy.