DEV Community: yongrean

tools/list is not a readiness check for MCP servers

yongrean — Mon, 01 Jun 2026 06:48:53 +0000

The first version of mcp-probe checked the obvious things:

can the MCP server initialize?
does tools/list work?
are tool schemas present?

That was useful, but not enough.

The more I tested real MCP workflows, the clearer the problem became:

tools/list is self-report. CI needs a receipt.

An MCP server can advertise a clean tool catalog and still fail every real call because OAuth handoff, scopes, downstream credentials, row limits, tenant boundaries, or response shapes are broken.

So the latest release of mcp-probe focuses less on "does the process start?" and more on "is CI enforcing the contract an agent actually depends on?"

The new bootstrap flow

npx @k08200/mcp-probe@latest init \
  --target @your-org/your-mcp-server \
  --discover \
  --lock-tools \
  --github-actions

This creates:

mcp-probe.config.json
.mcp-probe.json
.github/workflows/mcp-probe.yml

The important part is what happens during --discover.

mcp-probe connects to the server, reads the live tools/list catalog, and generates a starting contract from the observed tool schemas.

Schema-aware sidecar samples

Older generated samples were too naive. If a schema said:

{
  "type": "object",
  "required": ["location", "count"],
  "properties": {
    "location": { "type": "string", "enum": ["Chicago", "New York"] },
    "count": { "type": "integer", "minimum": 1 }
  }
}

the old fallback might produce empty strings or zero values. That often hit input validation and never tested the real call path.

v1.11.0 now uses schema hints:

default
enum
numeric minimum
string minLength
nested objects
array minItems

So the generated sample becomes:

{
  "location": "Chicago",
  "count": 1
}

It is still only a starting point. You should review generated samples before running them with production credentials, especially for mutating, admin, export, or environment-inspection tools.
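
A minimal sketch of how schema hints could drive sample generation. This is not mcp-probe's actual generator: `JsonSchema` and `sampleFromSchema` are hypothetical names, and the priority order (default, then enum, then type-specific minimums) is an assumption.

```typescript
// Small subset of JSON Schema used by this sketch (not mcp-probe's real types).
type JsonSchema = {
  type?: "object" | "array" | "string" | "integer" | "number" | "boolean";
  default?: unknown;
  enum?: unknown[];
  minimum?: number;
  minLength?: number;
  minItems?: number;
  required?: string[];
  properties?: Record<string, JsonSchema>;
  items?: JsonSchema;
};

// Hypothetical helper: build a value that is likely to pass input validation.
function sampleFromSchema(schema: JsonSchema): unknown {
  if (schema.default !== undefined) return schema.default;
  if (schema.enum && schema.enum.length > 0) return schema.enum[0];
  switch (schema.type) {
    case "integer":
      return Math.ceil(schema.minimum ?? 0);
    case "number":
      return schema.minimum ?? 0;
    case "string":
      // Assumed filler character; at least one char so the string is never empty.
      return "x".repeat(schema.minLength ?? 1);
    case "boolean":
      return false;
    case "array": {
      const count = schema.minItems ?? 0;
      return Array.from({ length: count }, () => sampleFromSchema(schema.items ?? {}));
    }
    case "object": {
      // Only required properties are filled in; optional ones are left out.
      const out: Record<string, unknown> = {};
      for (const key of schema.required ?? []) {
        out[key] = sampleFromSchema(schema.properties?.[key] ?? {});
      }
      return out;
    }
    default:
      return null;
  }
}

const sample = sampleFromSchema({
  type: "object",
  required: ["location", "count"],
  properties: {
    location: { type: "string", enum: ["Chicago", "New York"] },
    count: { type: "integer", minimum: 1 },
  },
});
console.log(JSON.stringify(sample)); // {"location":"Chicago","count":1}
```

A generator like this only gets a call past input validation. It knows nothing about which values are safe to send, which is why the generated samples still need review.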

Catalog locking

The other new piece is --lock-tools.

With --discover, mcp-probe now writes the observed tool names into expectedTools, so CI fails if a required tool disappears.

With --lock-tools, it also writes allowedTools, so CI fails if unexpected tools appear.

That matters for low-trust agent surfaces. If a server suddenly exposes delete_user, export_all, or rotate_api_key, I do not want that to silently become available to an agent just because tools/list still returns valid JSON.

Example config:

{
  "timeoutMs": 10000,
  "servers": [
    {
      "name": "my-mcp-server",
      "target": "@your-org/your-mcp-server",
      "probeTools": true,
      "toolsFile": ".mcp-probe.json",
      "expectedTools": ["search", "read_record"],
      "allowedTools": ["search", "read_record"]
    }
  ]
}
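
The check behind expectedTools and allowedTools amounts to a two-way set difference. A rough sketch, assuming the live catalog has already been reduced to a list of tool names (`CatalogPolicy` and `diffCatalog` are hypothetical names, not mcp-probe internals):

```typescript
interface CatalogPolicy {
  expectedTools?: string[];
  allowedTools?: string[];
}

interface CatalogDiff {
  missing: string[];    // expected by the contract but not advertised
  unexpected: string[]; // advertised but not on the allowlist
}

function diffCatalog(observed: string[], policy: CatalogPolicy): CatalogDiff {
  const seen = new Set(observed);
  const missing = (policy.expectedTools ?? []).filter((name) => !seen.has(name));
  // Without an allowlist, extra tools are not treated as a violation.
  const allowed = policy.allowedTools ? new Set(policy.allowedTools) : null;
  const unexpected = allowed ? observed.filter((name) => !allowed.has(name)) : [];
  return { missing, unexpected };
}

const diff = diffCatalog(["search", "read_record", "delete_user"], {
  expectedTools: ["search", "read_record"],
  allowedTools: ["search", "read_record"],
});
console.log(JSON.stringify(diff)); // {"missing":[],"unexpected":["delete_user"]}
```

A CI gate would fail when either list is non-empty.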

Receipts

For CI, the workflow can also persist a redacted receipt artifact:

npx @k08200/mcp-probe@latest \
  --config mcp-probe.config.json \
  --github-summary \
  --fail-on-warn \
  --receipt-file mcp-probe.receipt.json

That receipt is the thing I want CI to trust: not the server claiming it has tools, and not an agent claiming what happened later, but an independent probe that actually ran against the boundary.

Try it

npx @k08200/mcp-probe@latest @modelcontextprotocol/server-memory

GitHub: k08200/mcp-probe

Release: v1.11.0

I am especially looking for real Datadog, Supabase, and Gmail MCP recipes. The public fixtures are useful, but the real value is catching auth handoff, permission, tenant-scope, and response-contract failures in CI.

Stop Building AI Assistants. Build AI Firewalls.

yongrean — Thu, 28 May 2026 15:40:23 +0000

Every week another "AI agent for X" launches. Email triage. Calendar coordination. Sales follow-up. PR reviewer. Slack monitor. Meeting summarizer.

I've installed enough of them to see the pattern. Here's the dirty secret nobody mentions in the launch posts:

These tools don't reduce your work. They multiply your notifications.

Each AI tool is configured to be helpful by default. "Helpful" means: "I noticed this thing — here's a notification." Stack a dozen of those, and instead of one inbox to ignore you have twelve. The signal-to-noise ratio gets worse every time you add an AI to your workflow.

The mainstream answer is "just configure each one." Sure. Spend four hours tuning notification settings every time you add a tool, and another four hours when one of them ships a "smarter notifications" update. That's not productivity. That's notification janitorial work disguised as setup.

This is a structural problem. Not a configuration problem.

60-second walkthrough

[Embedded tweet: https://platform.twitter.com/embed/Tweet.html?id=2060688051920314608]

The wrong question

Every AI tool asks the same thing: "Is this important?"

Wrong question. There is no objective "important." Importance depends on you, right now. A Stripe webhook is important when you're debugging a checkout flow. The same webhook is pure noise during a deep work block. A Slack message from your cofounder is critical at 11am Tuesday and irrelevant at 11pm Friday.

The right question is:

Is this urgent enough to interrupt me, right now, given what I'm doing?

That's not a question any individual AI agent can answer. It's a layer above all your AI agents. None of them have the context. None of them know what the others are doing. None of them know how you're spending the next hour.

So they all default to "I'll just send you a notification, you decide." Which is exactly the experience you have right now: drowning.

What an AI firewall actually looks like

I'm building that layer. It's called Klorn. Here's how it works in practice — and what's already shipping vs what's scope-deferred.

Every incoming email goes through a 4-tier classification:

Tier	Behavior	PoC state
PUSH	Wakes you up. Phone notification.	Classified + alert ✅
QUEUE	Review on your own schedule.	Classified + queued ✅
SILENT	Recorded. Never interrupts.	Classified + logged ✅
AUTO	Reversible, hands-off. Low-risk actions execute; external-facing actions stay approval-gated.	Partial execution: LOW-risk internal (classify, mark read, briefing) auto-executes. MEDIUM (send email, create event) and HIGH (delete) always go through an approve button.

That's the entire surface. No "Call" tier. No fancy automations. Narrow on purpose.

The tier is decided by a 4-feature scorer:

Confidence — how clearly the signal type maps to a tier
Sender trust — your historical reply rate and meeting acceptance for this contact
Reversibility — can the wrong tier be undone without consequence?
Urgency — actual urgency signals, not "URGENT!!!" in the subject line

80% agreement with my hand-labels on 50 real emails. That's the Day 7 PoC gate, met.
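
To make the shape of that decision concrete, here is a deliberately simple rule-based sketch. It is not Klorn's scorer: the 0..1 feature scale, every threshold, and the rule order are assumptions chosen only for illustration.

```typescript
type Tier = "PUSH" | "QUEUE" | "SILENT" | "AUTO";

// All features are assumed to be normalized to the range 0..1.
interface Features {
  confidence: number;    // how clearly the signal type maps to a tier
  senderTrust: number;   // reply rate / meeting acceptance for this contact
  reversibility: number; // 1 = a wrong tier is fully undoable
  urgency: number;       // real urgency signals, not subject-line shouting
}

function decideTier(f: Features): Tier {
  // Unclear signal: never interrupt and never act, park it for review.
  if (f.confidence < 0.5) return "QUEUE";
  // Interrupt only when urgency and sender trust are both high.
  if (f.urgency >= 0.8 && f.senderTrust >= 0.6) return "PUSH";
  // Not urgent and from a low-trust sender: record it, stay quiet.
  if (f.urgency < 0.3 && f.senderTrust < 0.3) return "SILENT";
  // Hands-off handling only when a mistake is cheap to undo.
  if (f.urgency < 0.3 && f.reversibility >= 0.9) return "AUTO";
  return "QUEUE";
}

console.log(decideTier({ confidence: 0.9, senderTrust: 0.9, reversibility: 0.2, urgency: 0.9 })); // PUSH
console.log(decideTier({ confidence: 0.3, senderTrust: 0.9, reversibility: 0.9, urgency: 0.9 })); // QUEUE
```

QUEUE is the fallback on purpose: when the features disagree, the cheapest mistake is an item that waits for review.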

Override is GROUP BY, not LLM

When the firewall gets a tier wrong, one click moves the email to the right tier. Your correction doesn't just fix this one email — it becomes ground truth for the next prompt.

The override loop is the wedge. The classifier is replaceable; the alignment signal isn't. Every disagreement is signal, not noise.

Boring + measurable beats fuzzy + ambitious.

Why building this is unpopular in 2026

Building AI firewalls is unsexy. Investors want "AI agents that DO things." Saying "I built a system that does fewer things, more quietly" sounds backwards on a pitch deck.

But every founder I've shown this to has the same reaction: relief. Because they're drowning. Because every productivity tool they bought made their attention worse, not better. The AI agent boom didn't reduce their work. It raised the floor of background notifications.

The default for AI tools should be: shut up unless it actually matters.

Most don't. So I'm building the layer that enforces it from outside, since none of the individual tools will do it on their own.

Where I am

PoC sprint, Week 5, solo. 14-day window ending June 9, 2026.

Day 7 Technical Gate — ≥80% classifier agreement on 50 hand-labeled emails. Met.
Day 14 UX Gate — ≥3/5 ICP demos register "oh, this is different." Pending.

I dogfood it every day. My own inbox runs through the firewall.

Stack: Next.js 15, TypeScript, Prisma, Postgres (Supabase), Claude / OpenAI for the tier reasoning, Gmail for ingest.

The actual unpopular opinion

If your AI tool sends push notifications by default, it's broken. Doesn't matter how good its reasoning is. You can't reason your way out of a notification flood.

The next valuable layer of agentic products won't be more agents. It'll be the firewall that decides which agents are allowed to interrupt you, when.

Try it: klorn.ai
Code: github.com/k08200/klorn

If you're building agentic products and you disagree, I want to hear it. If you've solved it differently, I want to hear that more.

MCP CI gates need receipts: tools/list is not enough

yongrean — Thu, 28 May 2026 11:44:32 +0000

MCP servers are starting to look like normal infrastructure.

That means they need boring infrastructure checks.

The mistake I kept seeing is this:

"The server starts, and tools/list returns a clean schema. Therefore it works."

That is not enough.

An MCP server can pass initialize, advertise every expected tool, and still fail every real call because auth, scopes, tenant boundaries, environment variables, downstream permissions, or read-only roles are broken.

So I pushed mcp-probe@1.8.0 further toward being a real CI readiness gate for MCP servers.

npx @k08200/mcp-probe@latest --config mcp-probe.config.json --github-summary --fail-on-warn

What changed

1. Warnings can now fail CI

By default, warnings still exit 0. That keeps existing users from getting surprise CI failures.

But production gates often need stricter behavior:

mcp-probe --config mcp-probe.config.json --fail-on-warn

With --fail-on-warn, auth handoff issues, permission warnings, or incomplete readiness receipts can block the workflow.

That matters because many MCP failures are not hard crashes. They are degraded states:

OAuth flow requires a browser redirect the agent cannot complete
a server starts but every tool call returns 401
a database tool works with admin credentials but fails with the intended read-only role
the workflow mentions a probe but does not actually run the production boundary check

2. Doctor now checks the actual workflow receipt

mcp-probe doctor already checked whether a GitHub Actions workflow existed.

But that is not enough either.

The new behavior is stricter: all required flags must appear together on a single run step that actually invokes mcp-probe.

This should pass:

- run: npx @k08200/mcp-probe@latest --config mcp-probe.config.json --github-summary --fail-on-warn

This should not count as a complete gate:

- run: npx @k08200/mcp-probe --config mcp-probe.config.json
- run: npx @k08200/mcp-probe ./server.js --github-summary --fail-on-warn

The flags are present somewhere in the workflow, but no single run step proves the intended config is actually being checked with CI summaries and strict warning handling.

That is the difference between "we have a gate" and "the gate is enforcing the thing we trust."
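
The same-step rule can be sketched as a per-step check instead of a whole-file text search. This is an illustration only, not the doctor implementation: `isCompleteGateStep` and `hasCompleteGate` are hypothetical names, the input is assumed to be the `run:` strings already extracted from the workflow, and flags written as `--config=path` are not handled.

```typescript
const REQUIRED_FLAGS = ["--config", "--github-summary", "--fail-on-warn"];

// Matches "mcp-probe", "@k08200/mcp-probe" and "@k08200/mcp-probe@latest",
// but not a file name such as "mcp-probe.config.json".
const PROBE_TOKEN = /(^|\/)mcp-probe(@.+)?$/;

function isCompleteGateStep(runCommand: string): boolean {
  const tokens = runCommand.trim().split(/\s+/);
  const invokesProbe = tokens.some((token) => PROBE_TOKEN.test(token));
  return invokesProbe && REQUIRED_FLAGS.every((flag) => tokens.includes(flag));
}

// A workflow counts as gated only if one single step carries every flag.
function hasCompleteGate(runSteps: string[]): boolean {
  return runSteps.some(isCompleteGateStep);
}

console.log(hasCompleteGate([
  "npx @k08200/mcp-probe@latest --config mcp-probe.config.json --github-summary --fail-on-warn",
])); // true

console.log(hasCompleteGate([
  "npx @k08200/mcp-probe --config mcp-probe.config.json",
  "npx @k08200/mcp-probe ./server.js --github-summary --fail-on-warn",
])); // false
```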

3. Tool call coverage is now tied to expected tools

For config-based checks, you can declare the expected tool catalog:

{
  "servers": [
    {
      "name": "datadog",
      "target": "https://mcp.example.com/mcp",
      "transport": "http",
      "headers": {
        "Authorization": "Bearer ${DATADOG_MCP_TOKEN}"
      },
      "expectedTools": ["logs_query"],
      "forbiddenTools": ["delete_dashboard", "rotate_api_key"],
      "toolsFile": "./datadog.tools.json"
    }
  ]
}

If expectedTools and toolsFile are both set, every expected tool needs a sidecar sample input.

That means CI checks not just "is the tool advertised?" but "did we actually provide a meaningful dry-run sample for the tool an agent depends on?"
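
A small sketch of that coverage rule, assuming the sidecar has the `tools.<name>.input` shape used in this post. `uncoveredTools` is a hypothetical helper and `metrics_query` is a made-up tool name used only to show a gap.

```typescript
interface Sidecar {
  tools: Record<string, { input?: Record<string, unknown> }>;
}

// Returns the expected tools that have no sidecar sample input.
function uncoveredTools(expectedTools: string[], sidecar: Sidecar): string[] {
  return expectedTools.filter((name) => {
    const entry = sidecar.tools[name];
    return !entry || entry.input === undefined;
  });
}

const gaps = uncoveredTools(["logs_query", "metrics_query"], {
  tools: {
    logs_query: { input: { query: "service:web status:error", timeframe: "1h" } },
  },
});
console.log(JSON.stringify(gaps)); // ["metrics_query"]
```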

4. Sidecar inputs are the real contract

Auto-generated inputs are useful for smoke tests, but they mostly hit schema validation.

Real readiness checks need meaningful inputs:

{
  "tools": {
    "logs_query": {
      "input": {
        "query": "service:web status:error",
        "timeframe": "1h"
      },
      "expect": {
        "status": "pass",
        "not_error_code": [401, 403],
        "requiredFields": ["source", "freshness"],
        "maxRows": 100
      }
    }
  }
}

For database-backed MCP servers, these assertions are the interesting part:

does the read-only role work?
are row limits enforced?
are broad exports/admin actions absent or gated?
are denied writes structured enough for agents to recover?
do results include provenance fields like source and freshness?
does the response avoid leaking secrets, stack traces, or raw internals?

Install

npm install -D @k08200/mcp-probe

Or run directly:

npx @k08200/mcp-probe@latest doctor
npx @k08200/mcp-probe@latest --config mcp-probe.config.json --github-summary --fail-on-warn

GitHub: https://github.com/k08200/mcp-probe
npm: https://www.npmjs.com/package/@k08200/mcp-probe

The goal is simple: CI for MCP should test the contract an agent will actually depend on, not just whether the process starts.

mcp-probe v1.6.0: Stricter GitHub Actions checks for MCP CI gates

yongrean — Tue, 26 May 2026 04:35:59 +0000

I shipped mcp-probe v1.6.0 with a small but useful improvement to mcp-probe doctor.

Previous behavior:

check whether .github/workflows exists
check whether any workflow mentions mcp-probe

That was useful, but too shallow. A workflow can mention mcp-probe and still not run the actual CI gate correctly.

What changed

mcp-probe doctor now warns when the matching GitHub Actions workflow is missing any of these pieces:

actions/checkout@v6
--config <config-file>
--github-summary

Example:

npx @k08200/mcp-probe@latest doctor

If your workflow calls mcp-probe directly but does not use the configured fleet gate, doctor now tells you what is missing before you trust the CI result.

Why this matters

The larger goal of mcp-probe is to make MCP servers testable like normal infrastructure. That means checking more than process startup:

MCP initialize handshake
tools/list discovery
real tools/call dry-runs
sidecar sample inputs
contract assertions for row limits, stable error codes, and leak checks
and now, whether the CI workflow itself is wired correctly

A readiness gate is only useful if the gate is actually installed correctly.

GitHub: https://github.com/k08200/mcp-probe
npm: https://www.npmjs.com/package/@k08200/mcp-probe
Release: https://github.com/k08200/mcp-probe/releases/tag/v1.6.0

mcp-probe v1.5.0: Doctor checks for MCP CI readiness

yongrean — Mon, 25 May 2026 15:40:20 +0000

MCP servers are starting to look like infrastructure. That means the tooling around them needs boring preflight checks, not just optimistic smoke tests.

I just shipped mcp-probe v1.5.0 with a new command:

npx @k08200/mcp-probe@latest doctor

mcp-probe doctor checks whether the current repository is ready to run MCP readiness checks in CI before you even probe an external server.

What it checks

Node.js runtime satisfies mcp-probe requirements
mcp-probe.config.json exists and parses
configured sidecar files exist and have valid tools.*.input objects
GitHub Actions workflows are present and mention mcp-probe

Example:

mcp-probe doctor --config-file examples/self-check.config.json

Output:

mcp-probe doctor
────────────────────────────────────────────────────
  ✓  Node.js version
     Node 24.13.0 satisfies >=20.19.0
  ✓  Config file
     examples/self-check.config.json contains 1 server
  ✓  Sidecar examples/self-check.tools.json
     Found 4 tool entries
  ✓  GitHub Actions workflow
     Found 1 workflow file mentioning mcp-probe
────────────────────────────────────────────────────
  PASS

For automation:

mcp-probe doctor --output json

Why this matters

The earlier releases focused on the MCP server itself:

initialize handshake
tools/list discovery
real tools/call dry-runs
sidecar sample inputs
contract assertions for row limits, metadata, stable error codes, and leak checks

But teams still need to know whether their own probe setup is sane. A broken config file, missing sidecar, or workflow that never invokes the probe should fail early and loudly.

This release is a small step, but an important one: before testing the MCP contract an agent depends on, test that your CI gate is actually wired correctly.

GitHub: https://github.com/k08200/mcp-probe
npm: https://www.npmjs.com/package/@k08200/mcp-probe
Release: https://github.com/k08200/mcp-probe/releases/tag/v1.5.0

Stop building AI inboxes. Build decision layers instead.

yongrean — Mon, 25 May 2026 13:40:43 +0000

I spent six months building an AI-powered email tool. Then I deleted half of it.

Not because the model was bad. Not because the embeddings were off. Because I finally noticed what every "AI inbox" on the market — including the one I was building — was actually doing.

They were surfacing more.

More "smart suggestions". More "priority signals". More "AI-drafted replies waiting for your review". More badges, more banners, more nudges. Every product in the category was racing to add a new surface and call it intelligence.

My six-month-old prototype did all of that. I used it every day. And every morning the inbox was just as loud as the day I started. The model was right about which emails mattered. I still read all the other ones anyway, because they were right there, with a little colored dot suggesting maybe-they-mattered-too.

The model was solving the wrong problem.

The category bug

Look at the leading email tools through this lens:

Superhuman made reading faster. You still read everything.
Shortwave classified smarter. You still read everything.
Motion / Reclaim got more proactive. They added a calendar layer on top of the noise.

None of them subtract. They all add. "AI assistant" became a license to put one more thing in front of you.

The deeper bug: these tools treat email as the primary surface and try to make it better. But email is not what you want. What you want is the decisions you actually have to make. Email is one cheap, unreliable transport that occasionally contains those decisions, buried under hundreds of messages that don't.

Making the transport prettier doesn't fix the signal-to-noise problem. It hides it.

The right abstraction: decision layer

A decision layer doesn't replace your inbox. It sits above mail, calendar, Slack, and any other transport, and it surfaces exactly one thing: items where the system genuinely needs your judgment.

Three properties make a layer a decision layer rather than just "a better inbox":

It subtracts more than it adds. A signal that you've ignored four times in a row should never reach you again. Not muted. Gone.
It treats relationships as data. Two people asking for the same thing are not the same ask. One of them has hit every deadline you've ever had with them; the other ships +3 days late, every time. That should weight the queue.
It refuses to act without your approval. The model can draft, propose, plan. It cannot send, modify, or commit. Approval-before-action has to be a schema-level constraint, not a UI nicety.

None of these are AI features. They are boundary features. The AI is helpful for the classification underneath, but the value lives in what the system refuses to surface.

Here is what each of them actually looks like in production.

Pattern 1 — Closed-loop suppression learning

The single most useful thing the system does is forget.

Every time the user dismisses an attention item, we record a FeedbackEvent with the signal DISMISSED or IGNORED. That table is the cheap part. The interesting part is a job that reads it weekly:

export async function runFeedbackAdaptation(userId: string): Promise<number> {
  const since = new Date(Date.now() - LOOK_BACK_DAYS * 24 * 60 * 60 * 1000);

  const events = await prisma.feedbackEvent.findMany({
    where: {
      userId,
      source: "ATTENTION_ITEM",
      signal: { in: ["DISMISSED", "IGNORED"] },
      createdAt: { gte: since },
    },
    select: { sourceId: true },
  });

  // Join to the attention items themselves so we can bucket by (source, type,
  // priority) instead of just (source, type) — the bucket prevents an
  // over-broad rule from silencing legitimate high-priority signals.
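  const items = await prisma.attentionItem.findMany({
    where: { id: { in: events.map(e => e.sourceId) } },
    select: { id: true, source: true, type: true, priority: true },
  });
  // Index items by id so each feedback event can be matched to its item below.
  const itemMap = new Map(items.map((item) => [item.id, item] as const));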

  const counts = new Map<string, { key: CountKey; count: number }>();
  for (const event of events) {
    const item = itemMap.get(event.sourceId);
    if (!item) continue;
    const bucket = priorityBucket(item.priority);
    const k = suppressionKey(item.source, item.type, bucket);
    const existing = counts.get(k);
    if (existing) existing.count += 1;
    else counts.set(k, { key: { source: item.source, type: item.type, bucket }, count: 1 });
  }

  // Threshold: same tuple dismissed ≥4 times in 30 days → suppress forever.
  const suppressed = [...counts.values()]
    .filter(({ count }) => count >= DISMISS_THRESHOLD)
    .map(({ key, count }) => ({ ...key, dismissCount: count }));

  await remember(userId, "CONTEXT", "attention_suppression_v2", JSON.stringify(suppressed));
  return suppressed.length;
}

The suppression set is then read at the upsert path for every new attention item:

export function isSuppressed(
  set: Set<string>,
  source: string,
  type: string,
  priority?: number,
): boolean {
  if (typeof priority === "number") {
    const bucket = priorityBucket(priority);
    if (set.has(suppressionKey(source, type, bucket))) return true;
  }
  return set.has(suppressionKey(source, type));
}

If the tuple is in the suppression set, the new attention item is forced into SILENT tier — it gets recorded for the audit log, but the user is never paged about it.

A few design choices worth pointing out:

Priority buckets matter. The first version keyed only on (source, type). Dismissing four "due-today commitment" notifications would silence every commitment-due signal, including overdue ones. The current version buckets priority into HIGH / MEDIUM / LOW, so the user can train "I don't care about LOW-priority due commitments" without losing the HIGH ones.
Backwards-compatible key. Memory rows from the previous version are still read; a v1 row without a bucket matches every bucket, so a rollback doesn't lose learned behavior.
10-minute in-process cache. The upsert path is hot — checking the suppression set on every new item against the DB would be wasteful. A 10-minute TTL is short enough that a weekly adaptation run propagates fast and long enough to be free at request time.
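
The excerpts above call `priorityBucket` and `suppressionKey` without showing them. Here is one way they might look, so the bucket behavior and the v1 fallback can be seen end to end. The 0-100 priority scale, the cutoffs, and the `:`-joined key format are all assumptions, not the production values.

```typescript
type Bucket = "HIGH" | "MEDIUM" | "LOW";

// Assumed cutoffs on a 0-100 scale; the real thresholds are not shown in this post.
function priorityBucket(priority: number): Bucket {
  if (priority >= 70) return "HIGH";
  if (priority >= 40) return "MEDIUM";
  return "LOW";
}

// v2 keys carry a bucket. A v1 key has no bucket and acts as a wildcard.
function suppressionKey(source: string, type: string, bucket?: Bucket): string {
  return bucket ? `${source}:${type}:${bucket}` : `${source}:${type}`;
}

// Same lookup order as the excerpt: bucketed key first, then the v1 key.
function isSuppressed(set: Set<string>, source: string, type: string, priority?: number): boolean {
  if (typeof priority === "number") {
    if (set.has(suppressionKey(source, type, priorityBucket(priority)))) return true;
  }
  return set.has(suppressionKey(source, type));
}

const learned = new Set([suppressionKey("COMMITMENT", "DUE_TODAY", "LOW")]);
console.log(isSuppressed(learned, "COMMITMENT", "DUE_TODAY", 10)); // true
console.log(isSuppressed(learned, "COMMITMENT", "DUE_TODAY", 90)); // false

const legacy = new Set([suppressionKey("COMMITMENT", "DUE_TODAY")]);
console.log(isSuppressed(legacy, "COMMITMENT", "DUE_TODAY", 90)); // true
```

The last call shows the rollback-safe behavior: a bucketless v1 row still silences every priority for that tuple.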

Notice what's missing: an LLM. The classifier underneath uses one, but the suppression loop itself is plain counting. The model is not the right tool for "remember what the user doesn't care about". A GROUP BY is.

Pattern 2 — Contact Trust Score

The second feature changed how I think about every productivity tool I've ever used.

When someone makes a commitment to you — "I'll send the deck by Thursday", "let's reconnect next week" — that's a tracked row in a commitment ledger. When the commitment is fulfilled, we record whether it was on-time or late, and update a running tally per contact:

export async function updateTrustScore(
  userId: string,
  contactEmail: string,
  displayName: string | null,
  wasOnTime: boolean,
  daysLate = 0,
): Promise<void> {
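  // Assumed normalization: the excerpt uses `email` without showing where it is derived.
  const email = contactEmail.trim().toLowerCase();
  await prisma.contactTrustScore.upsert({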
    where: { userId_contactEmail: { userId, contactEmail: email } },
    create: {
      userId,
      contactEmail: email,
      displayName,
      totalCount: 1,
      onTimeCount: wasOnTime ? 1 : 0,
      lateCount: wasOnTime ? 0 : 1,
      totalDelayDays: Math.max(0, daysLate),
      lastUpdatedAt: new Date(),
    },
    update: {
      totalCount: { increment: 1 },
      ...(wasOnTime ? { onTimeCount: { increment: 1 } } : { lateCount: { increment: 1 } }),
      ...(daysLate > 0 ? { totalDelayDays: { increment: daysLate } } : {}),
      lastUpdatedAt: new Date(),
    },
  });
}

That tally rolls up to a badge:

reliable — ≥80% on-time, ≥3 data points
mostly reliable — ≥50% on-time, ≥3 data points
unreliable — <50% on-time, ≥3 data points
unknown — fewer than 3 data points, or stale (no signal in 60+ days)

The stale check is doing real work. A year-old "reliable" badge on someone who has since gone dark shouldn't be load-bearing. Until we get full exponential decay, we demote anyone with no signal in two half-lives (60 days) back to unknown.
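
The rollup rules can be written down as a small pure function. This is a sketch built only from the thresholds listed above; `TrustRow` and `computeBadge` are hypothetical names and may not match the real `computeResult`.

```typescript
type Badge = "reliable" | "mostly_reliable" | "unreliable" | "unknown";

interface TrustRow {
  totalCount: number;
  onTimeCount: number;
  lastUpdatedAt: Date;
}

const MIN_DATA_POINTS = 3;
const STALE_AFTER_DAYS = 60;
const DAY_MS = 24 * 60 * 60 * 1000;

function computeBadge(row: TrustRow, now: Date = new Date()): Badge {
  const ageDays = (now.getTime() - row.lastUpdatedAt.getTime()) / DAY_MS;
  // Too little data or no recent signal: refuse to label the contact.
  if (row.totalCount < MIN_DATA_POINTS || ageDays >= STALE_AFTER_DAYS) return "unknown";
  const onTimeRate = row.onTimeCount / row.totalCount;
  if (onTimeRate >= 0.8) return "reliable";
  if (onTimeRate >= 0.5) return "mostly_reliable";
  return "unreliable";
}

const today = new Date("2026-06-01T00:00:00Z");
const recent = new Date("2026-05-20T00:00:00Z");
console.log(computeBadge({ totalCount: 5, onTimeCount: 4, lastUpdatedAt: recent }, today)); // reliable
console.log(computeBadge({ totalCount: 10, onTimeCount: 10, lastUpdatedAt: new Date("2026-03-01T00:00:00Z") }, today)); // unknown
```

Passing `now` as a parameter keeps the stale check testable without mocking the clock.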

The badge gets surfaced as a small chip on the inbox card. But the actually-useful place is inside the agent prompt itself:

export async function buildTrustHintForPrompt(userId: string): Promise<string> {
  const rows = await prisma.contactTrustScore.findMany({
    where: { userId, totalCount: { gte: MIN_DATA_POINTS } },
    orderBy: { lastUpdatedAt: "desc" },
    take: 10,
  });
  if (rows.length === 0) return "";

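  // Stale rows roll up to "unknown"; drop them so they are not reported as unreliable below.
  const fresh = rows.filter((row) => computeResult(row).badge !== "unknown");
  if (fresh.length === 0) return "";

  const lines = fresh.map((row) => {
    const r = computeResult(row);
    const name = r.displayName || r.contactEmail;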
    if (r.badge === "reliable")
      return `- ${name}: reliable (${Math.round(r.onTimeRate * 100)}% on-time)`;
    if (r.badge === "mostly_reliable") {
      const delay = r.avgDelayDays > 0 ? `, avg +${Math.round(r.avgDelayDays)}d late` : "";
      return `- ${name}: mostly reliable (${Math.round(r.onTimeRate * 100)}% on-time${delay})`;
    }
    return `- ${name}: unreliable (${Math.round(r.onTimeRate * 100)}% on-time, avg +${Math.round(r.avgDelayDays)}d late) — factor in extra buffer`;
  });

  return `\n## Contact Reliability\nBased on tracked commitments:\n${lines.join("\n")}`;
}

Now when the model decides how urgently to surface "Mina is asking for an update" vs "Sarah is asking for an update", it has actual data on which of them is going to deliver if you give them a polite nudge versus which one needs the deadline restated three times. The prompt isn't fed any feelings about either person. It is fed numbers.

The productivity-tool industry has spent ten years building calendars that don't know which meeting attendees actually show up on time. That's strange.

Pattern 3 — Approval-before-action as a schema constraint

The third pattern is the boring one, and it's the one most AI assistants get wrong.

The model is allowed to draft a reply. It is allowed to propose a calendar move. It is allowed to plan a sequence of actions. It is not allowed to send, move, or commit any of it. Not because we don't trust the model — we sometimes do — but because the user needs to know the surface area of what the system is doing on their behalf, and "silently sent" is a category of bug that never recovers user trust once it happens.

This is enforced at the schema level. Every action the agent proposes lives in a PendingAction row with a status enum. The state machine for that enum is the contract: only one transition (approve()) gets the side effect to actually run. The agent can propose() all day; nothing ships without a deliberate user transition.

The lowest-risk class of actions — internal-only things like blocking calendar time for focus, snoozing an item, setting a reminder — can be marked auto and skip approval. Everything that touches an outside party (sending mail, modifying someone else's calendar) is always gated. The boundary is conservative on purpose. The day a single user discovers their AI assistant silently sent an apology to their VC is the day every AI assistant in the category becomes harder to sell.
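
A stripped-down, in-memory sketch of that state machine. The status names and function signatures are assumptions (the real version lives in a database row), but the key property is the same: the side effect is reachable only through `approve()`.

```typescript
// Hypothetical status values; the article only says the row has a status enum.
type Status = "PROPOSED" | "REJECTED" | "EXECUTED";

interface PendingAction {
  id: string;
  kind: string;
  status: Status;
}

// The agent can call this freely: it records intent and runs nothing.
function propose(id: string, kind: string): PendingAction {
  return { id, kind, status: "PROPOSED" };
}

// The only transition that is allowed to run the side effect.
function approve(action: PendingAction, sideEffect: () => void): PendingAction {
  if (action.status !== "PROPOSED") {
    throw new Error(`cannot approve action in status ${action.status}`);
  }
  sideEffect();
  return { ...action, status: "EXECUTED" };
}

function reject(action: PendingAction): PendingAction {
  if (action.status !== "PROPOSED") {
    throw new Error(`cannot reject action in status ${action.status}`);
  }
  return { ...action, status: "REJECTED" };
}

const draft = propose("a1", "send_email");
const sent = approve(draft, () => console.log("sending email"));
console.log(sent.status); // EXECUTED
```

Because `approve()` refuses anything that is not PROPOSED, an action cannot run twice and a rejected action cannot be revived.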

What this looks like in practice

The sum of these three patterns is not a smarter inbox. It is a small, quiet queue that contains roughly six to twelve items on any given day. Each item is either an explicit ask, a tracked commitment coming due, or a proposed action waiting for confirmation. The model spent the morning reading and reasoning about a few hundred other things, all of which the system decided you don't need to know about.

When you dismiss an item, the system learns. When a contact reliably delivers, their asks rise. When the model wants to act outside a narrow safelist, it asks first. The result, after a few weeks of training the noise floor, is a queue that feels like it was assembled by someone who actually knows what you ignore.

None of this requires a frontier model. The classifier underneath is a small, cheap LLM with strict cost guards. Almost all of the value is in the boundaries — what the system refuses to surface, what it refuses to do without you, and what it remembers about people you work with.

If you're building anything in this category and you find yourself adding a new surface that shows the user more things, stop and ask whether you'd rather build the thing that subtracts. The market is crowded with smarter inboxes. There is no good decision layer yet.

I'm shipping one at klorn.ai. Not asking for signups — sharing the pattern because I think more people should be building toward it. The closed-loop suppression and trust-score code above are excerpts from the real thing.

Built in TypeScript on Fastify, Prisma, and Postgres. Code patterns shown are production excerpts.

mcp-probe v1.4.0: Contract assertions for production MCP servers

yongrean — Sat, 23 May 2026 15:53:52 +0000

MCP servers are starting to look like infrastructure.

That means the old readiness question is no longer enough:

Does the process start?

Even this is not enough:

Does tools/list return a clean schema?

A server can pass both checks and still fail every real agent loop because auth handoff, scopes, downstream permissions, environment setup, or data boundaries are broken.

So I shipped mcp-probe v1.4.0 with contract assertions for production MCP servers.

GitHub: https://github.com/k08200/mcp-probe

npm: https://www.npmjs.com/package/@k08200/mcp-probe

The problem: discovery is not readiness

A typical MCP smoke test looks like this:

Start the server
Run initialize
Run tools/list
Check that schemas exist

That catches broken startup and malformed tools.

But it misses the failures that matter in production:

The tool advertises correctly, but every call returns 401
OAuth requires a browser redirect the agent cannot trigger
The DB role is not actually read-only
Write attempts leak raw SQL errors or stack traces
Results omit metadata agents need to reason safely
Tenant or project scope is not preserved
Broad exports or admin actions are reachable
Error codes are unstable, so agents cannot recover

In other words: the server starts, but the contract is broken.

v1.4.0: sidecar contract assertions

mcp-probe already supported sidecar inputs via .mcp-probe.json so teams could run real tools/call checks instead of relying on schema-minimum dummy inputs.

v1.4.0 extends that sidecar with assertions.

Example for a database-backed MCP server:

{
  "tools": {
    "execute_sql": {
      "input": {
        "project_id": "YOUR_PROJECT_ID",
        "query": "select 1 as health_check"
      },
      "expect": {
        "status": "pass",
        "requiredFields": ["rowCount", "limit", "source", "freshness"],
        "maxRows": 100
      }
    },
    "execute_sql_write_denied": {
      "input": {
        "project_id": "YOUR_PROJECT_ID",
        "query": "delete from users where id = 1"
      },
      "expect": {
        "status": "fail",
        "errorCode": "WRITE_NOT_ALLOWED",
        "notContains": ["DATABASE_URL", "password", "stack"]
      }
    }
  }
}

Now CI can validate the contract an agent actually depends on.

What assertions are supported?

`expect.status`

Declare whether a call should pass, fail, or warn.

This is important for negative probes. A write attempt against a read-only DB role should fail. In that case, failure is success.

{
  "expect": {
    "status": "fail"
  }
}

`expect.requiredFields`

Validate that result metadata exists.

For database tools, an agent often needs more than rows. It needs context:

rowCount
limit
source
freshness

{
  "expect": {
    "requiredFields": ["rowCount", "limit", "source", "freshness"]
  }
}
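A rough sketch of a required-fields check follows. It assumes the fields are top-level keys of a parsed result object; whether mcp-probe also looks inside nested objects is not stated here, so treat `missingRequiredFields` as a hypothetical helper.

```typescript
// Returns the required fields that are absent (or undefined) on the result.
// Assumption: only top-level keys are inspected.
function missingRequiredFields(
  result: Record<string, unknown>,
  requiredFields: string[],
): string[] {
  return requiredFields.filter(
    (field) =>
      !Object.prototype.hasOwnProperty.call(result, field) || result[field] === undefined,
  );
}

const sampleResult = { rowCount: 1, limit: 100, source: "replica", rows: [{ health_check: 1 }] };

// "freshness" is missing, so an agent cannot tell how stale the rows are.
console.log(missingRequiredFields(sampleResult, ["rowCount", "limit", "source", "freshness"]));
```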

`expect.maxRows`

Catch broad exports or missing limits.

{
  "expect": {
    "maxRows": 100
  }
}

mcp-probe looks for common result shapes such as rowCount, rowsReturned, rows, data, items, and records.
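One way such a lookup could work is sketched below. The key names come from the sentence above, but the precedence (explicit counters before array lengths) and the helper names are assumptions, as is the choice to treat an undetectable count as "not exceeding".

```typescript
const COUNT_KEYS = ["rowCount", "rowsReturned"];
const ARRAY_KEYS = ["rows", "data", "items", "records"];

// Tries explicit numeric counters first, then falls back to array lengths.
// Returns null when no known shape is found.
function detectRowCount(result: Record<string, unknown>): number | null {
  for (const key of COUNT_KEYS) {
    const value = result[key];
    if (typeof value === "number" && Number.isFinite(value)) return value;
  }
  for (const key of ARRAY_KEYS) {
    const value = result[key];
    if (Array.isArray(value)) return value.length;
  }
  return null;
}

// Assumption: an unknown row count does not trip the limit.
function exceedsMaxRows(result: Record<string, unknown>, maxRows: number): boolean {
  const count = detectRowCount(result);
  return count !== null && count > maxRows;
}

console.log(detectRowCount({ rows: [{ id: 1 }, { id: 2 }, { id: 3 }] }));
console.log(exceedsMaxRows({ rowCount: 250, rows: [] }, 100));
```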

`expect.errorCode`

Require stable structured error codes.

{
  "expect": {
    "status": "fail",
    "errorCode": "WRITE_NOT_ALLOWED"
  }
}

This matters because agents can only recover if errors are predictable.
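To illustrate why stable codes matter on the agent side, here is a hypothetical recovery table. Only `WRITE_NOT_ALLOWED` appears in this post; `AUTH_REQUIRED`, `RATE_LIMITED`, and the recovery names are invented for the sketch.

```typescript
type Recovery = "stop_writing" | "reauthenticate" | "retry_later" | "escalate_to_human";

// Maps a structured error code to a recovery step.
// Unknown or missing codes cannot be handled safely, so they escalate.
function planRecovery(errorCode: string | undefined): Recovery {
  switch (errorCode) {
    case "WRITE_NOT_ALLOWED":
      return "stop_writing";
    case "AUTH_REQUIRED": // hypothetical code
      return "reauthenticate";
    case "RATE_LIMITED": // hypothetical code
      return "retry_later";
    default:
      return "escalate_to_human";
  }
}

console.log(planRecovery("WRITE_NOT_ALLOWED"));
console.log(planRecovery(undefined));
```

If the server changes `WRITE_NOT_ALLOWED` to free-form text in a later release, a table like this silently falls through to the default branch, which is the regression `expect.errorCode` is meant to catch.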

`expect.contains` and `expect.notContains`

Check for expected output and leaked internals.

{
  "expect": {
    "notContains": ["DATABASE_URL", "password", "stack"]
  }
}

This catches errors that expose raw internals.
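A leak check of this kind can be approximated with a substring scan over the serialized output. The sketch below assumes case-sensitive matching over the JSON text (keys included); `findLeaks` is a hypothetical helper, not mcp-probe's API.

```typescript
// Returns every forbidden string that appears in the serialized output.
// Assumption: case-sensitive substring matching over the JSON text, keys included.
function findLeaks(output: unknown, forbidden: string[]): string[] {
  const text = typeof output === "string" ? output : JSON.stringify(output) ?? "";
  return forbidden.filter((needle) => text.includes(needle));
}

const forbiddenStrings = ["DATABASE_URL", "password", "stack"];

const leakyError = { error: { code: "INTERNAL", message: "DATABASE_URL is not set" } };
const cleanError = { error: { code: "WRITE_NOT_ALLOWED", message: "Writes are disabled for this role" } };

console.log(findLeaks(leakyError, forbiddenStrings));
console.log(findLeaks(cleanError, forbiddenStrings));
```

Scanning keys as well as values means a serialized `stack` property is flagged even when its value looks harmless.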

`expect.not_error_code`

Treat known auth/permission status codes as warnings instead of hard failures.

{
  "expect": {
    "not_error_code": [401, 403]
  }
}

This keeps OAuth handoff failures visible without confusing them with transport or runtime crashes.
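The downgrade from failure to warning can be pictured as a three-way classification. This is a sketch under assumptions: `classifyCall` is hypothetical, and it assumes the probe can extract a numeric status code from the failed call.

```typescript
type Outcome = "pass" | "warn" | "fail";

// errorStatus is null when the call succeeded.
// Codes listed in not_error_code become warnings; everything else is a hard failure.
function classifyCall(errorStatus: number | null, notErrorCodes: number[]): Outcome {
  if (errorStatus === null) return "pass";
  return notErrorCodes.includes(errorStatus) ? "warn" : "fail";
}

console.log(classifyCall(401, [401, 403])); // auth handoff missing in CI
console.log(classifyCall(500, [401, 403])); // runtime crash
console.log(classifyCall(null, [401, 403])); // call succeeded
```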

Output example

When assertions pass:

Tool Call Dry-run
  ✓ db_query [sidecar] 1ms
    ✓ status: Tool status matched expected pass
    ✓ requiredFields.rowCount: Found required field "rowCount"
    ✓ requiredFields.limit: Found required field "limit"
    ✓ requiredFields.source: Found required field "source"
    ✓ requiredFields.freshness: Found required field "freshness"
    ✓ maxRows: Row count 1 is within maxRows 100

  ✓ db_write [sidecar] 0ms
    ✓ status: Tool status matched expected fail
    ✓ errorCode: Found expected error code WRITE_NOT_ALLOWED
    ✓ notContains.DATABASE_URL: Output does not contain "DATABASE_URL"
    ✓ notContains.password: Output does not contain "password"
    ✓ notContains.stack: Output does not contain "stack"

If a contract assertion fails, mcp-probe reports:

CONTRACT_ASSERTION_FAILED

and includes per-assertion details in terminal output, JSON output, and GitHub Actions summaries.

Quick start

npx @k08200/mcp-probe@latest init \
  --target @your-org/your-mcp-server \
  --discover \
  --github-actions

Then edit .mcp-probe.json with real read-only probes and run:

npx @k08200/mcp-probe@latest --config mcp-probe.config.json --github-summary

Why this matters

MCP CI should test the contract an agent will actually depend on, not just whether the server process starts.

For database-backed MCP servers, that means validating things like:

read-only role behavior
denied writes
stable error codes
row limits
tenant or project scope
result metadata
no leaked internals

mcp-probe should not know every server's semantics. But it can give teams a small, declarative way to encode the production contract their agents rely on.

That is the goal of v1.4.0.

Release: https://github.com/k08200/mcp-probe/releases/tag/v1.4.0

npm: https://www.npmjs.com/package/@k08200/mcp-probe

mcp-probe v1.0.0: A CI readiness gate for MCP servers

yongrean — Wed, 20 May 2026 16:01:55 +0000

mcp-probe started as a small CLI for checking whether an MCP server starts and exposes tools.

That was useful, but after feedback from developers running real MCP servers in agent workflows, the gap became obvious:

A server can start, pass tools/list, and still fail every real tool call because OAuth, browser auth, or downstream permissions are broken.

So I shipped mcp-probe v1.0.0 as a CI-ready readiness gate for MCP servers.

Install

npx @k08200/mcp-probe@latest <server>

Example:

npx @k08200/mcp-probe@latest @modelcontextprotocol/server-memory

What it checks

MCP protocol handshake
tools/list
optional resources and prompts discovery
tool schema shape
actual tool-call dry-runs
stderr classification
latency
batch/fleet CI status

Tool-call dry-runs

npx @k08200/mcp-probe@latest <server> --probe-tools

This closes the gap between “the server registered tools” and “those tools actually work in an agent loop.”

Sidecar inputs

Auto-generated inputs are a fallback only. For real CI, v1 supports sidecar files:

{
  "tools": {
    "logs_query": {
      "input": {
        "query": "service:web status:error",
        "timeframe": "1h"
      },
      "expect": {
        "not_error_code": [401, 403]
      }
    }
  }
}

Run:

npx @k08200/mcp-probe@latest datadog-mcp --probe-tools --tools-file .mcp-probe.json

This lets CI validate meaningful tool calls instead of just schema-minimum empty strings.

Batch checks

npx @k08200/mcp-probe@latest --config mcp-probe.config.json

Useful when a team runs multiple MCP servers and wants one readiness gate.

GitHub Actions output

npx @k08200/mcp-probe@latest --config mcp-probe.config.json --github-summary

v1 writes GitHub step summaries, emits annotations, and can generate a shields-compatible badge JSON file.

HTTP and SSE

mcp-probe now supports stdio, Streamable HTTP, and legacy SSE:

npx @k08200/mcp-probe@latest https://example.com/mcp --header "Authorization: Bearer TOKEN"

Stderr classification

Some servers print harmless startup warnings; others print fatal init errors. v1 adds explicit rules:

npx @k08200/mcp-probe@latest <server> \
  --stderr-allow "deprecated" \
  --stderr-fatal "missing required api key"
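Conceptually, these two flags split stderr lines into three buckets. The sketch below assumes case-insensitive substring matching and that a fatal pattern wins over an allow pattern; both are guesses about the semantics, and `classifyStderrLine` is a hypothetical name.

```typescript
type StderrVerdict = "fatal" | "allowed" | "unclassified";

// Assumptions: patterns are plain substrings, matching ignores case,
// and a fatal match takes priority over an allow match.
function classifyStderrLine(line: string, allow: string[], fatal: string[]): StderrVerdict {
  const lower = line.toLowerCase();
  if (fatal.some((pattern) => lower.includes(pattern.toLowerCase()))) return "fatal";
  if (allow.some((pattern) => lower.includes(pattern.toLowerCase()))) return "allowed";
  return "unclassified";
}

const allowPatterns = ["deprecated"];
const fatalPatterns = ["missing required api key"];

console.log(classifyStderrLine("DeprecationWarning: punycode is deprecated", allowPatterns, fatalPatterns));
console.log(classifyStderrLine("Error: Missing required API key", allowPatterns, fatalPatterns));
```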

Recipes

The repo includes starter recipes for Datadog, Supabase, Gmail, single-server GitHub Actions checks, fleet checks, and remote HTTP checks.

GitHub: https://github.com/k08200/mcp-probe

Release: https://github.com/k08200/mcp-probe/releases/tag/v1.0.0

npm:

npm install -g @k08200/mcp-probe

I disabled push notifications on my own AI app in 24 hours — here is what I rebuilt

yongrean — Mon, 18 May 2026 16:02:01 +0000

I disabled push notifications on my own AI productivity app within 24 hours of shipping it.

That was the moment I realized I had built something that looked useful but was actually attention spam dressed up in a clean UI.

Here's what was wrong, what I learned, and the architecture I rebuilt around it.

The "helpful" trap

The first version of my product (then called EVE, now Jigeum) did the obvious thing: connect Gmail, classify emails, surface anything important via push notification.

The logic seemed sound. The execution was a disaster.

Day 1, 9am: push notification — "Stripe receipt may need attention"
Day 1, 9:14am: push — "LinkedIn message from a recruiter"
Day 1, 9:32am: push — "GitHub PR review request"
Day 1, 10:01am: push — "Newsletter — possibly important"

By noon I had 14 notifications. By 5pm I had silenced the app on my phone.

I had recreated the exact problem I was trying to solve: another channel demanding my attention, no smarter than the inbox it was sitting on top of.

The wrong mental model

Here's the assumption almost every AI productivity tool makes — and the one I had to unlearn:

"If something is important, notify the user. If it's not, don't."

This is wrong. That rule treats importance as binary. Attention is not.

The real model is: every signal has an escalation level, and most signals deserve none.

A contract waiting for signature is not the same as a newsletter from a YC partner you respect. Both are "important." Only one should interrupt your morning.

The architecture I rebuilt: 5-tier escalation

Every incoming signal — email, calendar event, extracted commitment — gets classified into exactly one tier:

SILENT    → never surfaced
QUEUE     → added to a review list, no notification
PUSH      → mobile push, the actual interrupt
CALL      → urgent override (not yet built)
AUTO      → handled without asking me

The default is QUEUE. Not PUSH. Most things just sit there until I open the app.

This single change — defaulting to the quietest reasonable tier instead of the noisiest — is the difference between a tool I use and a tool I muted.
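In code, the "default to QUEUE" rule is small. The sketch below uses the five tier names from the list above; `resolveTier` and `interrupts` are hypothetical helpers, and treating an undecided classifier as `undefined` is an assumption.

```typescript
type Tier = "SILENT" | "QUEUE" | "PUSH" | "CALL" | "AUTO";

const DEFAULT_TIER: Tier = "QUEUE";

// Only PUSH and CALL interrupt the user; the other tiers wait or act quietly.
function interrupts(tier: Tier): boolean {
  return tier === "PUSH" || tier === "CALL";
}

// When the classifier has no confident answer, the signal lands in QUEUE.
function resolveTier(classified: Tier | undefined): Tier {
  return classified ?? DEFAULT_TIER;
}

console.log(resolveTier(undefined));
console.log(interrupts(resolveTier(undefined)));
```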

Trust Score: who actually deserves to reach you

Routing depends on the sender. Each contact has a Trust Score (0–100) derived from real interaction history:

interface TrustScore {
  userId: string;
  contactEmail: string;
  score: number;               // 0–100
  interactionCount: number;
  avgResponseMinutes: number | null;
  lastInteractionAt: Date | null;
}

A cold sender I've never replied to: ~10.
A teammate I exchange messages with daily: ~95.

Tier assignment combines Trust Score × content urgency × time-of-day context. A sender scoring 95 who asks a question gets PUSH. A sender scoring 10 who asks the same question gets QUEUE. Same email content, different outcome, because who is asking matters as much as what they ask.
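A toy version of that multiplication is sketched below. The thresholds, the quiet-hours factor, and the 0.7 urgency value for a question are all invented for illustration; only the inputs (trust, urgency, time-of-day context) and the two example outcomes come from the paragraph above.

```typescript
type RoutedTier = "SILENT" | "QUEUE" | "PUSH";

interface RoutingInput {
  trustScore: number;  // 0-100, as in TrustScore.score
  urgency: number;     // 0-1, hypothetical output of content analysis
  quietHours: boolean; // hypothetical time-of-day context flag
}

const PUSH_THRESHOLD = 0.5;    // assumed cut-off
const SILENT_THRESHOLD = 0.02; // assumed cut-off

function assignTier(input: RoutingInput): RoutedTier {
  const contextFactor = input.quietHours ? 0.5 : 1;
  const priority = (input.trustScore / 100) * input.urgency * contextFactor;
  if (priority >= PUSH_THRESHOLD) return "PUSH";
  if (priority < SILENT_THRESHOLD) return "SILENT";
  return "QUEUE";
}

const directQuestion = { urgency: 0.7, quietHours: false };

console.log(assignTier({ trustScore: 95, ...directQuestion })); // PUSH
console.log(assignTier({ trustScore: 10, ...directQuestion })); // QUEUE
```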

Commitment Ledger: the feature I didn't know I needed

This was the unexpected one.

Every email where I had written "I'll send the contract by Friday" or "Let me get back to you next week" — those were commitments I kept forgetting. They lived inside threads. The other person remembered. I didn't.

interface Commitment {
  id: string;
  userId: string;
  title: string;
  kind: "DELIVERABLE" | "MEETING" | "FOLLOW_UP" | "DECISION";
  owner: "USER" | "COUNTERPART";  // who owes whom
  dueAt: Date | null;
  dueText: string | null;          // "by Friday", "next week"
  confidence: number;              // 0–1
  status: "OPEN" | "DONE" | "OVERDUE";
}

The confidence score matters. "Let's sync sometime" → 0.3, ignored. "Please send the NDA by Tuesday EOD" → 0.9, surfaced immediately.
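As a sketch of how the confidence value could gate what gets surfaced: the 0.6 threshold below is an assumption (the post only gives 0.3 as ignored and 0.9 as surfaced), and `commitmentsToSurface` is a hypothetical helper over a trimmed-down version of the interface.

```typescript
interface ExtractedCommitment {
  title: string;
  confidence: number; // 0-1
}

const SURFACE_THRESHOLD = 0.6; // assumed cut-off between "ignored" and "surfaced"

// Keeps confident commitments only, most confident first.
function commitmentsToSurface(all: ExtractedCommitment[]): ExtractedCommitment[] {
  return all
    .filter((c) => c.confidence >= SURFACE_THRESHOLD)
    .sort((a, b) => b.confidence - a.confidence);
}

const extracted: ExtractedCommitment[] = [
  { title: "Let's sync sometime", confidence: 0.3 },
  { title: "Send the NDA by Tuesday EOD", confidence: 0.9 },
];

console.log(commitmentsToSurface(extracted).map((c) => c.title));
```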

In four weeks of dogfooding, this caught three commitments I would have genuinely dropped. That's the metric I judge the whole product by now.

What changed when I rebuilt around this

Before	After
Default tier: PUSH	Default tier: QUEUE
Routing: keyword/urgency heuristics	Routing: Trust Score × content × context
Surface: notification feed	Surface: single morning page (Command Center)
My behavior: disabled the app	My behavior: open it before checking email

The Command Center is one page with four blocks: Morning Briefing, Approval Queue, Commitment Ledger, Reply Needed. I open it once before email and I'm done.

I haven't opened raw Gmail first thing in the morning in 3 weeks.

The principle

If I had to compress the lesson into one rule it would be this:

Default to silence. Earn the right to interrupt.

Most "smart" tools fail because they assume the user wants to be helped at every opportunity. The user does not. The user wants their attention managed down, not flooded with more "important" inputs.

Stack

For the curious:

API: Fastify + TypeScript + Prisma + PostgreSQL (Supabase)
Web: Next.js 15 App Router
AI: Claude Sonnet for content analysis, Claude Haiku for classification
Email: Gmail API with incremental sync
Push: Web Push API + service workers
Deploy: Render (API) + Vercel (web)

Try it

Jigeum is in private beta. Connect Gmail + Calendar, initial sync takes about 30 seconds, and you'll see your inbox classified by tier within a minute.

If you're a founder, solo operator, or anyone whose inbox is currently managing them — I'd genuinely value the feedback. Especially where the classification gets it wrong. That's where the next iteration comes from.

Architecture questions welcome in the comments.

Built solo. The first version annoyed me. The second one I actually use.

I built an AI that filters what actually needs your attention — architecture, failures, and what works

yongrean — Sun, 17 May 2026 16:21:32 +0000

I used to start every morning the same way: open Gmail, feel immediately overwhelmed, spend 40 minutes triaging emails that turned out not to matter — and then miss the one thing that actually did.

So I built Jigeum — an AI Chief of Staff that reads your inbox, scores every signal by urgency, and surfaces only what needs you. This is the architecture, the honest failures, and what actually changed my routine.

The first version completely failed

My original product was called EVE. The pitch: "AI employee — connect your tools, EVE handles the rest."

It was too broad. Every demo, people nodded politely. Nobody knew what to do with it. I shipped features — Slack integration, task management, autonomous agent loops — but the core value proposition was muddy. Four months in, I was the only person using it daily.

Every morning I'd open it and quietly ask myself: why am I actually using this?

That question cracked it open.

The real problem: attention is the bottleneck

The problem isn't having too many emails. It's that every notification competes for the same finite resource: your decision-making capacity.

A GDPR newsletter. A contract waiting for signature. A customer reporting a critical bug. A LinkedIn request. Your inbox treats them identically.

I renamed the product Jigeum (지금 — Korean for right now) and rebuilt around a single question: what needs my attention right now, and what doesn't?

Architecture: the Attention OS

The core model is a 5-tier escalation system. Every incoming signal — email, calendar event, extracted commitment — gets classified before it ever reaches me:

SILENT    → don't surface, don't notify
QUEUE     → add to review list, no interrupt
PUSH      → mobile push notification
CALL      → urgent interrupt (not yet built)
AUTO      → handle automatically without asking

I call this the Attention Firewall. Before anything reaches my conscious attention, it passes through classification.

Trust Score

Each sender gets a Trust Score (0–100). The higher the score, the more likely a signal from that sender is to escalate. It's derived from:

Historical reply frequency
Whether I've responded before, and how fast
Explicit feedback ("always notify me from this person")
Domain-level signals (my own domain scores higher than cold outreach)

interface TrustScore {
  userId: string;
  contactEmail: string;
  score: number;               // 0–100
  interactionCount: number;
  avgResponseMinutes: number | null;
  lastInteractionAt: Date | null;
}

A newsletter I've never replied to scores ~10. My co-founder scores 95. The escalation tier is calculated from trust score combined with content analysis of the email itself.

Voice Profile

The AI needs to know how I communicate, not just what to do. Voice Profile stores the patterns extracted from my sent mail:

interface VoiceProfile {
  userId: string;
  tone: string;                // "direct", "warm", "formal"
  signatureStyle: string;
  preferredLength: "short" | "medium" | "long";
  phrases: string[];           // things I actually say
  avoidPhrases: string[];      // things I never say
}

When drafting a reply suggestion, the AI pulls this profile. The goal is that a suggested reply reads like me — not like a generic AI assistant.

Commitment Ledger

This is the feature that made me realize the product had real value.

Every email where I wrote "I'll send this by Friday" or "Let me get back to you next week" — those are commitments. They disappear into threads. I forget them. The other person doesn't.

The commitment extractor runs on every processed email and populates a ledger:

interface Commitment {
  id: string;
  userId: string;
  title: string;
  kind: "DELIVERABLE" | "MEETING" | "FOLLOW_UP" | "DECISION";
  owner: "USER" | "COUNTERPART";
  dueAt: Date | null;
  dueText: string | null;      // "by Friday", "next week"
  confidence: number;          // 0–1
  status: "OPEN" | "DONE" | "OVERDUE";
}

The confidence field matters a lot. "Let's sync sometime" → confidence 0.3, stays quiet. "Please send the NDA by Tuesday EOD" → confidence 0.9, surfaced immediately in the Command Center.

The Command Center

The UI is a single page. This replaced my inbox as the first screen I open each morning.

Layout (left → right on desktop):

Morning Briefing — AI summary of what happened overnight and what needs attention today, full width at top
Approval Queue — actions Jigeum wants to take but needs my sign-off first
Commitment Ledger — things I promised, things others promised me
Reply Needed — emails where someone asked a direct question

The Reply Needed surface was the hardest to get right. The naive approach (detect ? in the email body) had terrible precision: questions in automated receipts, rhetorical questions, and quoted threads all triggered false positives.

What actually works: question detection + sender trust weighting + thread position analysis (a question in the first email of a thread means something different than the same question in reply #5).

// GET /api/inbox/reply-needed
const rows = await prisma.emailMessage.findMany({
  where: { userId, needsReply: true },
  orderBy: [
    { needsReplyConfidence: "desc" },
    { receivedAt: "desc" }
  ],
  take: 8,
  select: {
    id: true,
    subject: true,
    from: true,
    snippet: true,
    needsReplyReason: true,
    needsReplyConfidence: true,
    receivedAt: true,
  },
});
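The score behind `needsReplyConfidence` could combine the three signals roughly like this. It is a guess at the shape, not the production logic: the post says thread position changes what a question means but not in which direction, so the decay with depth, the weights, and every name here are assumptions.

```typescript
interface ReplySignal {
  hasDirectQuestion: boolean; // result of question detection
  isAutomated: boolean;       // receipts, no-reply senders
  senderTrust: number;        // 0-100 Trust Score
  threadPosition: number;     // 1 = first email in the thread
}

// Returns a confidence between 0 and 1.
// Assumption: questions deeper in a thread count for less.
function replyNeededConfidence(signal: ReplySignal): number {
  if (signal.isAutomated || !signal.hasDirectQuestion) return 0;
  const trust = Math.min(Math.max(signal.senderTrust, 0), 100) / 100;
  const depth = Math.max(0, signal.threadPosition - 1);
  const positionWeight = 1 / (1 + 0.25 * depth);
  return trust * positionWeight;
}

const firstEmail = { hasDirectQuestion: true, isAutomated: false, senderTrust: 95, threadPosition: 1 };
const fifthReply = { ...firstEmail, threadPosition: 5 };
const receipt = { ...firstEmail, isAutomated: true };

console.log(replyNeededConfidence(firstEmail) > replyNeededConfidence(fifthReply));
console.log(replyNeededConfidence(receipt));
```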

Tech stack

Layer	Choice
API	Fastify + TypeScript + Prisma
Database	PostgreSQL (Supabase)
Web	Next.js 15 App Router
AI	OpenRouter — Claude Sonnet for analysis, Haiku for classification
Email	Gmail API (OAuth2, incremental sync via `historyId`)
Push	Web Push API + service workers
Deploy	Render (API) + Vercel (web)

One thing I'd do differently: Gmail sync architecture. I built polling with historyId-based incremental sync when I should have used Gmail Push Notifications from day one. The polling works but introduces ~30s latency on new emails. That latency matters when something urgent arrives.

What failed (honestly)

Notification flood. The early version pushed a notification for every signal classified as PUSH tier. Within 24 hours I had disabled push notifications on my own app. Had to rebuild with rate limiting — same-sender notifications within 15 minutes now collapse into one.
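The 15-minute collapse can be sketched as a per-sender window. This assumes the window is measured from the last notification that was actually delivered for that sender; the names and the data shape are hypothetical.

```typescript
interface PendingPush {
  sender: string;
  sentAt: number; // epoch milliseconds
  subject: string;
}

const COLLAPSE_WINDOW_MS = 15 * 60 * 1000;

// Returns the pushes that should be delivered; the rest are collapsed.
function collapseBySender(pushes: PendingPush[]): PendingPush[] {
  const lastDelivered = new Map<string, number>();
  const delivered: PendingPush[] = [];
  const ordered = [...pushes].sort((a, b) => a.sentAt - b.sentAt);
  for (const push of ordered) {
    const last = lastDelivered.get(push.sender);
    if (last !== undefined && push.sentAt - last < COLLAPSE_WINDOW_MS) continue;
    lastDelivered.set(push.sender, push.sentAt);
    delivered.push(push);
  }
  return delivered;
}

const minute = 60 * 1000;
const pending: PendingPush[] = [
  { sender: "a@example.com", sentAt: 0, subject: "PR review request" },
  { sender: "a@example.com", sentAt: 5 * minute, subject: "PR review request (updated)" },
  { sender: "b@example.com", sentAt: 6 * minute, subject: "Contract question" },
  { sender: "a@example.com", sentAt: 20 * minute, subject: "PR merged" },
];

console.log(collapseBySender(pending).map((p) => p.subject));
```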

Over-trusting AUTO. The autonomous tier where Jigeum acts without asking me — I thought I wanted this. Turns out I don't trust it yet. I've pulled AUTO back to only unsubscribes and read-receipts. Anything that involves sending a message or making a decision goes through the Approval Queue.

The rebrand was a distraction. Spent a full week renaming EVE → Jigeum across the codebase, updating marketing copy, redoing the landing page. The code ran identically after. Should have shipped instead.

Mobile doesn't exist yet. It's web-only. For something meant to filter morning attention, the fact that you have to open a browser tab is a real friction point. Working on it.

Four weeks of dogfooding — what actually changed

I check the Commitment Ledger before every morning standup. It's caught 3 things I would have genuinely dropped.
Reply Needed reduced my inbox-zero anxiety. If something actually needs me, it surfaces there. If it's not there, I'm not missing anything.
Morning Briefing saves roughly 20 minutes of triage per day.
The AI still occasionally misclassifies cold outreach as high-priority. Trust Score calibration is ongoing.

Try it

Jigeum is in private beta at hire-eve-web.vercel.app. Connect Gmail + Calendar, initial sync takes about 30 seconds.

If you're a founder, solo operator, or anyone who feels like their attention is being managed by their inbox rather than by themselves — I'd genuinely value the feedback. Especially where it gets the classification wrong.

Happy to answer architecture questions in the comments.

Built solo. Stack: Next.js, Fastify, PostgreSQL, Gmail API, and a lot of OpenRouter credits.

ko-prompt-kit: Production-ready Korean LLM prompts for Claude & GPT

yongrean — Sun, 17 May 2026 15:31:15 +0000

If you're building AI apps that need to output natural Korean, translating English prompts doesn't cut it. Korean has formal/informal speech levels (존댓말/반말), unique document conventions, and cultural context that English-native prompts completely ignore.

So I built ko-prompt-kit — 14 production-ready Korean prompt templates across 5 categories, with a TypeScript API and CLI.

Zero install, instant use

npx ko-prompt list
npx ko-prompt get coding/code-review

14 prompts across 5 categories

Category	Prompts
Business	Email reply, meeting minutes, report summary
Coding	Code review, commit message, bug analysis, JSDoc
Customer Service	Complaint reply, FAQ answer
Writing	Blog post, marketing copy
Analysis	Document summary, sentiment, competitive analysis

TypeScript API

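import Anthropic from '@anthropic-ai/sdk';
import { getById, buildPrompt, search } from 'ko-prompt-kit';

// Reads ANTHROPIC_API_KEY from the environment
const anthropic = new Anthropic();

const prompt = getById('coding/code-review');
const built = buildPrompt(prompt, {
  language: 'typescript',
  code: yourCode, // the source code string you want reviewed
  focus: '보안', // "security"
});

// Use with Claude
const response = await anthropic.messages.create({
  model: 'claude-opus-4-7',
  max_tokens: 1024, // required by the Messages API
  system: built.system,
  messages: [{ role: 'user', content: built.user }],
});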

What makes Korean prompts different

Each prompt is designed around Korean language specifics:

Speech level: formal (합쇼체) vs informal (해체) — selected per use case
Document structure: Korean business docs have specific conventions
Cultural context: complaint handling, business email norms

Search and filter

// Find formal prompts for business use
const formal = search({ category: 'business', speechLevel: 'formal' });

// Search by keyword (Korean or English)
const emailPrompts = search({ query: '이메일' });

GitHub: k08200/ko-prompt-kit

npm install ko-prompt-kit

Would love contributions — especially prompts for domains I haven't covered yet (legal, medical, education).

I built the npm audit for MCP servers

yongrean — Sun, 17 May 2026 13:53:28 +0000

The MCP (Model Context Protocol) ecosystem has exploded. awesome-mcp-servers lists 200+ servers — but there was no way to know if any of them actually worked.

So I built mcp-probe: a zero-config CLI that validates MCP servers in one command.

The problem

You add a server to Claude Desktop and it silently fails. You look at the logs and get "connection closed". You have no idea whether it is a network issue, a broken dependency, or a server that does not implement the protocol correctly.

What mcp-probe does

npx @k08200/mcp-probe @modelcontextprotocol/server-memory

mcp-probe  @modelcontextprotocol/server-memory
────────────────────────────────────────────────────
  ✓  MCP protocol handshake  1392ms — memory-server v0.6.3
  ✓  Tools discovery  33ms — Found 9 tools
  ✓  Tool schema validation — All tool schemas are valid
────────────────────────────────────────────────────
  Server   memory-server v0.6.3
  Caps     tools

  Tools
    ▸ create_entities  Create multiple new entities in the knowledge graph
    ▸ read_graph  Read the entire knowledge graph
    ▸ search_nodes  Search for nodes in the knowledge graph
    ▸ ...and 6 more

  ✓  PASS  1455ms total

For a server with resources and prompts too (server-everything):

  ✓  Tools discovery  22ms — Found 14 tools
  ✓  Resources discovery  2ms — Found 7 resources
  ✓  Prompts discovery  5ms — Found 4 prompts

It catches real bugs

@modelcontextprotocol/server-filesystem — one of the most well-known MCP servers — currently has a broken dependency:

  ✗  MCP protocol handshake — Error: Cannot find module 'ajv'

Before mcp-probe, this would show as "connection closed" with no indication of why.

CI integration

Exit code 1 on failure means it works as a CI gate:

- name: Validate MCP server
  run: npx @k08200/mcp-probe @your-org/your-mcp-server
  timeout-minutes: 2

JSON output for scripting:

npx @k08200/mcp-probe @scope/server --output json

How it works

Under the hood it uses the official @modelcontextprotocol/sdk to run the actual protocol handshake. It pipes stderr from the spawned process so when a server crashes on startup, you see the real error.

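import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js';

// `target` is the package name from the command line,
// e.g. '@modelcontextprotocol/server-memory'
const transport = new StdioClientTransport({
  command: 'npx',
  args: ['--yes', target],
  stderr: 'pipe',  // capture crash output
});

const client = new Client(
  { name: 'mcp-probe', version: '0.1.0' },
  { capabilities: { roots: { listChanged: false } } }
);

// connect() runs the initialize handshake
await client.connect(transport);
const tools = await client.listTools();
// also listResources() and listPrompts() if server advertises them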

Get it

npx @k08200/mcp-probe @modelcontextprotocol/server-memory

GitHub: k08200/mcp-probe
npm: @k08200/mcp-probe

Would love to hear what servers you try it on — especially if you find one where the output is confusing or wrong.
