Ahmed Mahmoud

Posted on Jun 30 • Originally published at eng-ahmed.com

Shipping with the Claude Agent SDK: 6 Patterns That Survived Production

#ai #typescript #agents #webdev

Headline: Agents are easy to demo and hard to keep running. The patterns below are the ones that survived contact with paying users, flaky APIs, and the long tail of "wait, can it also do this?"

I've shipped six agents on the Claude Agent SDK this year — three for clients at Devya Solutions, three for my own use. The boring patterns won. Here they are.

1. Tools as a stable contract, prompts as a runtime detail

The single biggest unlock: treat the tool surface as the durable interface, and let the model figure out how to call it.

// stable — change rarely
const tools = [
  {
    name: 'lookup_customer',
    description: 'Find a customer by email or phone. Use this before any account action.',
    input_schema: {
      type: 'object',
      properties: {
        email: { type: 'string', format: 'email' },
        phone: { type: 'string' },
      },
      anyOf: [{ required: ['email'] }, { required: ['phone'] }],
    },
  },
];

When the model misbehaves, 80% of the time the fix is in the tool description (more constraints, clearer "use this when…"), not in the system prompt.

2. Context engineering > prompt engineering

I used to write 2,000-word system prompts. They drifted, they conflicted, they got expensive on every turn.

What replaced them:

A 300-token system prompt that sets role + non-negotiables
The full operating context delivered as the first user turn (data, current state, recent events)
Prompt caching on the system prompt + the operating context

Result: half the tokens, twice the consistency, and a 4× faster first response thanks to cache hits.

3. Run evals before you "fix" anything

Every "the agent is broken" report I've gotten in the last six months was actually a regression I could have caught with an eval suite. Now every agent ships with:

// eval-set.json — examples that MUST pass
[
  {
    name: 'refund-policy-2-week',
    input: 'I bought this 9 days ago, can I return it?',
    expect: {
      tool_called: 'lookup_refund_policy',
      response_contains: ['within 14 days', 'eligible'],
    },
  },
];

The set starts small (10-20 cases). It grows every time something breaks in prod. Re-run on every model version, prompt change, or tool-surface change.

4. Cost guards at every loop boundary

Agents can loop. When they loop on a paid model they cost real money in seconds.

I now bake two limits into every agent:

const MAX_TURNS = 12;          // hard cap on tool-call rounds
const MAX_TOKENS_PER_RUN = 50_000; // hard cap on total tokens

let turns = 0;
let totalTokens = 0;

while (turns < MAX_TURNS && totalTokens < MAX_TOKENS_PER_RUN) {
  const response = await client.messages.create({ ... });
  totalTokens += response.usage.input_tokens + response.usage.output_tokens;
  if (response.stop_reason === 'end_turn') break;
  turns++;
}

These are guards, not targets. A real production run should average 2-3 turns. Hitting 12 means something is wrong — alert on it.

5. Streaming + cancellation, day one

Users abandon agents that don't stream. They also abandon agents they can't cancel. Both need to be wired from the first commit:

const stream = await client.messages.stream({
  model: 'claude-opus-4-7',
  max_tokens: 2048,
  messages,
  tools,
});

const abort = new AbortController();

stream.on('text', (chunk) => writeToUser(chunk));
stream.on('error', (err) => log(err));

// When the user closes the tab / clicks cancel:
abort.abort();

Vercel's Fluid Compute runtime supports request cancellation natively. Lean on it — don't roll your own.

6. Memory is a tool, not a feature

The agents that have to "remember" things use a memory tool with explicit read / write / delete actions. The model has to decide to read or write. No magic, no implicit recency window.

Why: when memory is implicit, debugging an agent that "forgot something it shouldn't have" is impossible. When it's a tool call, you have a log.

{
  name: 'remember',
  description: 'Persist a fact about the user across sessions. Use sparingly — only stable, important facts.',
  input_schema: {
    type: 'object',
    properties: {
      key: { type: 'string' },
      value: { type: 'string' },
    },
    required: ['key', 'value'],
  },
}

What I'd skip

Multi-agent orchestration as the default. Most "multi-agent" architectures are one capable agent + a tool-call. Reach for true multi-agent only when context windows force you to.
Custom finetunes. The Claude family has improved fast enough that any finetune I shipped six months ago is now worse than the base model with a better prompt.
Vector DBs for everything. A flat JSON file or a Postgres table with a tsvector column covers 70% of "we need memory" use cases at 5% of the operational cost.

Closing

The boring takeaway: agents are software. The same engineering discipline — clear interfaces, evals, cost guards, observability — is what makes them ship. The model is the smallest part of the system.

If you're building one and want a second pair of eyes on the architecture, find me at eng-ahmed.com or Devya Solutions.

DEV Community