StemSplit

Posted on May 24

5 Things I Wish I'd Known Before Writing a Production MCP Server in TypeScript (2026)

#mcp #opensource #typescript #ai

"Wire @modelcontextprotocol/sdk to your API, register a few tools with Zod schemas, ship it."

That's the shape of every MCP server tutorial I read before writing my first one. After three weeks of dogfooding stemsplit-mcp, the open-source MCP server I built for StemSplit's audio separation API, that summary turns out to be just the prologue.

Below are the five things that make the difference between an MCP server that works in a tutorial and one you'd actually depend on. All of them are now built into stemsplit-mcp (source, MIT-licensed) and portable to any other MCP server you write.

If you're about to write your first MCP server: read this. If you've already shipped one: there's a non-zero chance you have at least one of these bugs.

What You'll Get From This Post

✅ A withRetry helper that handles transient failures without double-charging users
✅ A simple rule for deciding which requests are safe to retry
✅ Why relative paths from an LLM are a bug, and how to reject them with a helpful message
✅ A structured error shape that gives the LLM machine-readable context
✅ How to wire MCP progress notifications so long jobs don't look frozen

1. Retry transient failures — but only the right ones

The first version of every MCP server you write will look like this:

async function callApi<T>(path: string): Promise<T> {
  const res = await fetch(`${baseUrl}${path}`, { headers });
  if (!res.ok) throw new Error(`${res.status} ${res.statusText}`);
  return res.json();
}

One bad upstream gateway, one transient 502, one TLS handshake hiccup, and the entire MCP tool call fails. The LLM sees the error, the user sees the error, the user concludes your tool is broken.

The fix isn't "retry everything." Retrying a POST /jobs that the server already processed will double-charge your user. The right fix is to classify the error, and to classify the request.

Here's the helper I now use everywhere:

export type RetryDecision = boolean | { retryAfterMs: number };

export interface RetryOptions {
  maxAttempts: number;
  initialDelayMs: number;
  maxDelayMs: number;
  shouldRetry: (err: unknown) => RetryDecision;
  onRetry?: (err: unknown, attempt: number, delayMs: number) => void;
}

export async function withRetry<T>(
  fn: () => Promise<T>,
  options: RetryOptions,
): Promise<T> {
  let attempt = 0;
  while (true) {
    attempt++;
    try {
      return await fn();
    } catch (err) {
      const decision = options.shouldRetry(err);
      if (!decision || attempt >= options.maxAttempts) throw err;

      const baseDelay = Math.min(
        options.initialDelayMs * 2 ** (attempt - 1),
        options.maxDelayMs,
      );
      const jitter = Math.random() * baseDelay * 0.25;
      const explicit =
        typeof decision === "object" ? decision.retryAfterMs : null;
      const delayMs = explicit ?? baseDelay + jitter;

      options.onRetry?.(err, attempt, delayMs);
      await new Promise((r) => setTimeout(r, delayMs));
    }
  }
}

The interesting bit is the RetryDecision union: shouldRetry can return either true/false, or an object with an explicit retryAfterMs. That last one lets you honor Retry-After headers on 429s without writing a separate code path.
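
To make the delay schedule concrete, here is a minimal sketch that lifts the delay math out of withRetry into a pure function. The name backoffDelayMs and the injectable random parameter are illustrative assumptions, not part of stemsplit-mcp; the formula itself mirrors the helper above.

```typescript
// Hypothetical helper: same delay math as withRetry, isolated for inspection.
// `random` is injectable so the schedule can be checked deterministically.
export function backoffDelayMs(
  attempt: number, // 1-based number of the attempt that just failed
  initialDelayMs: number,
  maxDelayMs: number,
  explicitMs: number | null = null, // e.g. derived from a Retry-After header
  random: () => number = Math.random,
): number {
  if (explicitMs !== null) return explicitMs; // server-provided delay wins
  const base = Math.min(initialDelayMs * 2 ** (attempt - 1), maxDelayMs);
  return base + random() * base * 0.25; // up to 25% jitter on top of the base
}

// With jitter disabled, initialDelayMs = 500 and maxDelayMs = 4000:
const noJitter = () => 0;
const schedule = [1, 2, 3, 4, 5].map((n) =>
  backoffDelayMs(n, 500, 4000, null, noJitter),
);
console.log(schedule); // 500, 1000, 2000, 4000, 4000
```

Two details fall out of this: jitter is added after the cap, so a real delay can exceed maxDelayMs by up to 25%, and an explicit delay is used as-is with no jitter.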

Then the policy is split by request shape:

function shouldRetryApiError(err: unknown, mutating: boolean): RetryDecision {
  if (err instanceof StemSplitError) {
    // Network-level error — server may not have seen the request
    if (err.code === "NETWORK_ERROR") return true;

    // 5xx on a read-only request — safe to retry
    if (err.status && err.status >= 500 && !mutating) return true;

    // 429 — honor Retry-After
    if (err.status === 429 && err.retryAfterMs !== undefined) {
      return { retryAfterMs: err.retryAfterMs };
    }
  }
  return false;
}

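The mutating flag is the rule. A 5xx response on a POST /jobs could mean the job was created and the response was just lost. Retrying it could double-charge. So POST /jobs is mutating: true, gets fewer attempts (3), and only retries when the request most likely never took effect: a network-level failure, or a 429 that carries Retry-After. A network error is strong evidence rather than proof, since a connection can drop after the server has started work, which is one more reason to keep the attempt count low.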

By contrast, GET /jobs/:id is mutating: false and gets a more aggressive policy (4 attempts, 5xx retries enabled). Even POST /upload — which only returns a presigned URL and doesn't change billing state — can be marked mutating: false.
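
One way to keep that classification in a single place is a small lookup function. This is a sketch under assumptions: retryPolicyFor and RetryPolicy are hypothetical names, the attempt counts come from the text above, and giving POST /upload the same 4 attempts as reads is my guess rather than something the source states.

```typescript
// Hypothetical policy lookup; names and the /upload attempt count are assumptions.
export interface RetryPolicy {
  mutating: boolean; // true: a lost response may hide a completed side effect
  maxAttempts: number;
}

const MUTATING: RetryPolicy = { mutating: true, maxAttempts: 3 };
const READ_ONLY: RetryPolicy = { mutating: false, maxAttempts: 4 };

export function retryPolicyFor(method: string, path: string): RetryPolicy {
  const m = method.toUpperCase();
  if (m === "GET" || m === "HEAD") return READ_ONLY;
  // POST /upload only returns a presigned URL, so repeating it is harmless.
  if (m === "POST" && path === "/upload") return READ_ONLY;
  // Default to the cautious policy for anything that might change state.
  return MUTATING;
}
```

Defaulting unknown endpoints to the cautious policy means a newly added route cannot be retried aggressively by accident.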

This single distinction has saved me three production incidents already, mostly from rare 502s during the 10-minute polling loops on long stem-separation jobs.

2. Reject relative paths up front

When you give an LLM a tool that takes a file path, the LLM will eventually pass song.mp3. Or ./song.mp3. Or worse, file:///Users/me/Music/song.mp3: a URL-form string that looks like a usable path but fails when passed as a plain string to Node's fs.createReadStream.

If you don't validate this, Node will resolve those against the MCP server's process working directory. For Claude Desktop and Cursor, that's usually / or /Applications/.... The file doesn't exist there. The LLM gets an "ENOENT" error, has no idea what /song.mp3 is, and either retries the same path or gives up.
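
The resolution behaviour is easy to reproduce. In this small sketch the working directories are illustrative assumptions (/Applications/Example.app is not a real install path), and posix.resolve is used so the result is the same on any platform.

```typescript
import { posix } from "node:path";

// Node resolves a relative path against the process working directory.
// The first argument stands in for whatever cwd the MCP client launched with.
const fromRoot = posix.resolve("/", "song.mp3");
const fromAppDir = posix.resolve("/Applications/Example.app", "./song.mp3");

console.log(fromRoot); // /song.mp3
console.log(fromAppDir); // /Applications/Example.app/song.mp3
```

Neither result is anywhere near the user's music folder, which is why the ENOENT that follows gives the LLM nothing to work with.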

The fix is to reject the bad path before you try to open the file, with an error message designed for the LLM to act on:

import { isAbsolute } from "node:path";

function isTildeHome(p: string): boolean {
  return p === "~" || p.startsWith("~/");
}

export function classifyLocalPath(source: string): string {
  const trimmed = source.trim();

  if (trimmed.startsWith("file://")) {
    throw new Error(
      "file:// URIs are not supported. Pass the absolute filesystem path " +
      "instead (e.g. /Users/you/Music/song.mp3).",
    );
  }

  if (!isTildeHome(trimmed) && !isAbsolute(trimmed)) {
    throw new Error(
      `Relative paths are not supported (got "${trimmed}"). ` +
      `Pass an absolute path like "/Users/you/Music/song.mp3" or a ` +
      `home-anchored path like "~/Music/song.mp3". ` +
      `If you do not know the absolute path, ask the user for it before retrying.`,
    );
  }

  return trimmed;
}

Two things to call out:

The error tells the LLM exactly what to do next. Not "invalid path" — "ask the user for the absolute path." This is the difference between a tool the LLM gives up on and a tool it recovers from gracefully.
~/foo is accepted because it's a human-friendly form that users and LLMs both reach for, and you'll get fewer dead-end conversations if you support it. Node does not expand ~ for you, though: tilde expansion is a shell feature, and neither fs.realpath nor path.resolve does it. Replace the leading ~ with os.homedir() before you open the file.
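
Since classifyLocalPath returns home-anchored paths with the ~ still in place, something has to expand it before the file is opened. A minimal sketch, assuming a helper named expandTilde (hypothetical, not taken from stemsplit-mcp):

```typescript
import { homedir } from "node:os";
import { join } from "node:path";

// Hypothetical helper: Node's fs and path modules treat "~" as a literal
// directory name, so home-anchored paths have to be expanded by hand.
export function expandTilde(p: string, home: string = homedir()): string {
  if (p === "~") return home;
  if (p.startsWith("~/")) return join(home, p.slice(2));
  return p; // absolute paths (and anything else) pass through unchanged
}
```

Calling it right after validation, as in expandTilde(classifyLocalPath(source)), keeps the rest of the handler working with absolute paths only. The home parameter exists mainly so the function can be exercised without depending on the real home directory.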

Side benefit: this also makes your tool description shorter. You can write path: "Absolute path like /Users/you/Music/song.mp3" in your Zod schema and the LLM will get the same hint from both the schema and the error message.

3. Make errors machine-readable

The default MCP error shape is just a string. That's fine for users, terrible for LLMs.

LLMs that have to figure out what to do next from an error message do best when the error has discrete states. "Out of credits," "rate limit hit," and "file too large" each need a different recovery.

So I always wrap upstream errors into a class with a code:

export type StemSplitErrorCode =
  | "AUTH_INVALID"
  | "INSUFFICIENT_CREDITS"
  | "RATE_LIMIT_EXCEEDED"
  | "FILE_TOO_LARGE"
  | "UNSUPPORTED_FORMAT"
  | "JOB_FAILED"
  | "JOB_TIMEOUT"
  | "NETWORK_ERROR"
  | "API_ERROR";

export class StemSplitError extends Error {
  constructor(
    public readonly code: StemSplitErrorCode,
    public readonly userMessage: string,
    public readonly status?: number,
    public readonly retryAfterMs?: number,
    public readonly details?: unknown,
  ) {
    super(userMessage);
    this.name = "StemSplitError";
  }
}

export async function buildErrorFromResponse(
  res: Response,
): Promise<StemSplitError> {
  const text = await res.text();
  let body: { error?: string; code?: string } = {};
  try { body = JSON.parse(text); } catch { /* ignore */ }

  if (res.status === 401) {
    return new StemSplitError(
      "AUTH_INVALID",
      "StemSplit API key invalid. Check STEMSPLIT_API_KEY.",
      401,
    );
  }
  if (res.status === 402) {
    return new StemSplitError(
      "INSUFFICIENT_CREDITS",
      "Not enough StemSplit credits. Top up at stemsplit.io/app/billing.",
      402,
    );
  }
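  if (res.status === 429) {
    // A missing header yields null and Number(null) is 0, which would mean
    // "retry immediately". Only accept a real, non-negative number of seconds.
    const header = res.headers.get("retry-after");
    const seconds = header ? Number(header) : NaN;
    const retryAfter =
      Number.isFinite(seconds) && seconds >= 0 ? seconds : undefined;
    return new StemSplitError(
      "RATE_LIMIT_EXCEEDED",
      `Rate limited by StemSplit. Retry in ${retryAfter ?? 60}s.`,
      429,
      retryAfter !== undefined ? retryAfter * 1000 : undefined,
    );
  }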
  // ...etc
}
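
Retry-After can also arrive as an HTTP-date rather than a number of seconds. If your upstream ever sends that form, a parser along these lines covers both; parseRetryAfterMs is a hypothetical name and this is a sketch, not what stemsplit-mcp ships.

```typescript
// Hypothetical helper: Retry-After is either delay-seconds or an HTTP-date.
export function parseRetryAfterMs(
  header: string | null,
  now: number = Date.now(),
): number | undefined {
  if (header === null || header.trim() === "") return undefined;
  const value = header.trim();
  if (/^\d+$/.test(value)) return Number(value) * 1000; // delay-seconds form
  const at = Date.parse(value); // HTTP-date form
  if (Number.isNaN(at)) return undefined; // unparseable: let the caller decide
  return Math.max(0, at - now); // a date in the past means "retry now"
}
```

Returning undefined for a missing or unreadable header keeps the decision with the retry policy instead of silently turning it into a zero-millisecond delay.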

When you serialize this back to the MCP client, include both the human-readable message and the machine-readable code:

{
  isError: true,
  content: [{ type: "text", text: err.userMessage }],
  _meta: { code: err.code, status: err.status }
}

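Anthropic's clients ignore _meta they don't understand, so this is forward-compatible. And the LLM gets a clean userMessage that's safe to relay to the user. Whether _meta itself reaches the model depends on the client, so if the LLM needs the code to choose a recovery path, repeat it in the text as well.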
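
Here is one way the serialization step could look. It is a sketch under assumptions: toToolResult and CodedError are hypothetical names, CodedError is a local stand-in for the StemSplitError fields used here, and prefixing the code into the text is my own choice for clients that might not surface _meta to the model.

```typescript
// Local stand-in for the StemSplitError fields used below (assumed shape).
interface CodedError {
  code: string;
  userMessage: string;
  status?: number;
}

function isCodedError(err: unknown): err is CodedError {
  if (typeof err !== "object" || err === null) return false;
  const e = err as Record<string, unknown>;
  return typeof e.code === "string" && typeof e.userMessage === "string";
}

// Hypothetical serializer: known errors keep their code, anything else gets a
// generic message so raw upstream text never reaches the client.
export function toToolResult(err: unknown) {
  const e: CodedError = isCodedError(err)
    ? err
    : { code: "API_ERROR", userMessage: "Unexpected error calling the API." };
  return {
    isError: true,
    // The code is repeated in the text in case the client drops _meta.
    content: [{ type: "text" as const, text: `[${e.code}] ${e.userMessage}` }],
    _meta: { code: e.code, status: e.status },
  };
}
```

The fallback branch matters as much as the happy path: an unexpected exception is reported as API_ERROR with a fixed message rather than whatever string the exception happened to carry.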

4. Fire progress notifications for anything over ~5s

MCP supports progress notifications:

await server.notification({
  method: "notifications/progress",
  params: {
    progressToken,
    progress: 35,    // 0–100
    total: 100,
  },
});

If your tool takes more than 5 seconds (audio processing definitely does), use them. Without them, Claude Desktop will sit at "Running tool stemsplit/separate_stems..." indefinitely and the user has no idea if you're stuck or making progress.

The trick is wiring this through your polling loop:

export async function pollUntilDone<T>(
  fetchStatus: () => Promise<{ status: string; progress?: number } & T>,
  options: {
    onProgress?: (progress: number) => void;
    intervalMs?: number;
    timeoutMs?: number;
  } = {},
): Promise<T> {
  const interval = options.intervalMs ?? 3000;
  const timeout = options.timeoutMs ?? 10 * 60 * 1000;
  const start = Date.now();

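  let lastProgress = -Infinity;
  while (true) {
    const status = await fetchStatus();
    // Only report forward movement; repeated values add noise for the client.
    if (status.progress !== undefined && status.progress > lastProgress) {
      lastProgress = status.progress;
      options.onProgress?.(status.progress);
    }
    if (status.status === "COMPLETED") return status;
    // Use the coded error class from section 3 so the LLM can tell these apart.
    if (status.status === "FAILED") throw new StemSplitError("JOB_FAILED", "Job failed");
    if (Date.now() - start > timeout) throw new StemSplitError("JOB_TIMEOUT", "Job timed out");
    await new Promise((r) => setTimeout(r, interval));
  }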
}

Then your tool handler wires the MCP progress token through:

const progressToken = request.params?._meta?.progressToken;

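const job = await pollUntilDone(
  () => client.getJob(jobId),
  {
    // Compare against undefined: a numeric token of 0 is valid but falsy.
    onProgress: progressToken !== undefined
      ? (p) => {
          // Fire-and-forget: a failed notification must not reject unhandled.
          void server
            .notification({
              method: "notifications/progress",
              params: { progressToken, progress: p, total: 100 },
            })
            .catch(() => {});
        }
      : undefined,
  },
);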
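
If several tools report progress, the guard logic can live in one small wrapper. This is a sketch under assumptions: makeProgressReporter is a hypothetical name, the 0 to 100 clamp matches the total of 100 used above, and the increasing-values rule reflects my reading of the MCP progress spec rather than anything stated in this post.

```typescript
type SendProgress = (progress: number) => Promise<void>;

// Hypothetical wrapper: forwards only increasing values and never lets a
// failed notification reject unhandled or abort the polling loop.
export function makeProgressReporter(send: SendProgress) {
  let last = -Infinity;
  return (progress: number): void => {
    const clamped = Math.min(100, Math.max(0, progress));
    if (clamped <= last) return; // unchanged or regressed: skip
    last = clamped;
    send(clamped).catch(() => {
      // Progress is best-effort; a dropped update should not fail the job.
    });
  };
}
```

The returned function has the same shape as onProgress, so it can be passed to pollUntilDone directly, with send wrapping the server.notification call.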

This is the difference between users abandoning a long job at 30 seconds and waiting patiently because they can see the bar moving.

5. Re-fetch presigned URLs on demand

This one isn't MCP-specific, but it bites every MCP server that wraps an API with expiring URLs.

Cloud storage providers (Cloudflare R2, S3, GCS) hand out presigned URLs that expire, typically after 1–24 hours depending on how they were signed. If your MCP tool leaves the URL in the chat history and the user comes back tomorrow asking "can you re-download those stems?", the URLs are dead and the LLM gets a 403.

Don't make the user re-run the entire separation job. Instead, expose a separate download_stems tool that takes a jobId, re-fetches the latest presigned URLs from your API, and downloads:

async function handleDownloadStems(jobId: string, outputDir: string) {
  const job = await client.getJob(jobId);
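  if (job.status !== "COMPLETED") {
    // Include the actual status so the LLM can tell "still running" from "failed".
    throw new StemSplitError(
      "JOB_FAILED",
      `Job ${jobId} is not complete (status: ${job.status}).`,
    );
  }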
  return downloadAllStems(job.outputs, outputDir);
}

The LLM picks up on this naturally — if it has the jobId from an earlier chat, it'll call download_stems instead of re-running separate_stems. Your user re-downloads in 2 seconds instead of waiting 90 seconds for a fresh separation.

Bonus: this is also how you let the user choose a different output directory on the second download without redoing work.

Putting it all together

These five patterns make a real difference for any MCP server that hits a remote API:

withRetry with mutating-aware policy: absorbs most transient failures without risking a double charge.
Absolute path validation with actionable errors: saves you from confusing LLMs.
Structured error codes: let the LLM choose a recovery strategy.
Progress notifications: keep users waiting instead of giving up.
Re-fetch-by-ID tool: turns expiring URLs from a footgun into a feature.

None of these are in the MCP SDK examples. They're the lessons you only learn from running an MCP server against real users.

If you want to see all five in one place, the full implementation is in github.com/StemSplit/stemsplit-mcp — ~1.5k lines of TypeScript, MIT-licensed, every pattern above in production today.

And if you happen to want stem separation in your MCP-enabled AI assistant, stemsplit-mcp is on npm and works with Claude Desktop, Cursor, Cline, Windsurf, and Zed today. The StemSplit API it talks to is a hosted Demucs / HT-Demucs FT pipeline (same models you'd self-host) with a generous free tier — sign up at stemsplit.io/free-trial and you can have a working setup in five minutes.

Happy MCP-ing. Tell me what you build.

Source code: github.com/StemSplit/stemsplit-mcp • npm: stemsplit-mcp • Hosted API: stemsplit.io/developers

Top comments (3)

Truong Bui • May 25

The path classification function is doing more security work than it might look like. Relative paths from an LLM aren't just a UX problem — they're the start of a path traversal chain. An attacker who can influence the LLM's tool calls (indirect injection from a document the agent reads, for example) could feed ../../etc/passwd if you're not rejecting non-absolute paths up front. The error message that tells the LLM "ask the user for the absolute path" is actually the right security boundary too: it forces human re-entry for anything that looks wrong.

The structured error codes point matters for a similar reason. Returning raw upstream API error strings is an information disclosure path — error bodies from backend services often contain stack traces, internal hostnames, and token fragments. A typed error code surfaces just the category, not the internals.

The one thing I'd add to this list for production servers: audit your tool descriptions before you ship. Tool descriptions are the first thing the agent reads and they're never shown to the user — they're also the surface where we see the most intentional abuse. We scanned 508 public MCP servers at MCPSafe (mcpsafe.io) and found that 18% had tool descriptions crafted to influence agent behavior rather than describe the tool. For a server you write yourself that's not a concern, but for operators installing third-party servers, it's worth running a scan before connecting.

StemSplit • Jun 1

Great additions, thanks for the thorough breakdown.

On path traversal — you're right that the framing in the article undersells it. I wrote it as a UX/reliability fix, but the actual risk is indirect prompt injection: a poisoned document the agent reads could feed ../../etc/passwd into a tool call without the user ever typing it. The error message that bounces the LLM back to the user for the absolute path is the security boundary, not just a friendliness improvement. Worth calling out more explicitly.

On structured errors — the buildErrorFromResponse path in stemsplit-mcp does construct safe user-facing messages, but you've made me look more closely at what goes into structuredContent. There's a ...extras spread from the raw API response body that includes endpoint, method, and whatever else the backend appends. Since I control both sides it's lower-stakes than wrapping a third-party API, but it's still leaking API structure and a future backend diagnostic field would silently flow through. Going to tighten that to an explicit allowlist of known-safe fields (code, httpStatus, retryAfterSeconds, requiredSeconds).

On tool description audits — entirely agree that this is the operator's problem more than the author's. If you're writing your own server you know what's in the descriptions; the threat is third-party servers you install without reading the source. The 18% figure is sobering. Worth making that scan part of the install checklist for anyone running a multi-server setup.
