Gabriel Anhaia

Function-Calling Schemas That Survive Model Upgrades


A refund_order call lands with amount: "12.50" (string). The validator rejects it. The schema didn't change — the model alias did. A small interpretation drift on a single optional field, and the eval scores you trusted last sprint were lying.

This is the upgrade tax on tool-calling. Schemas look like contracts. They behave like prompts. The model reads the description and the field names and decides what to put in. When the model changes, that decision changes. Some tighten validation on enums you thought were permissive. Others reinterpret a renamed field. Most cost nothing; a few cost real money.

Freezing on one model alias isn't the answer. Design the schema so the next release has nowhere to drift to, and run a small eval rig that catches the drift before customers do. Four patterns and one runnable eval. Python with the Anthropic SDK shape; the patterns are SDK-agnostic, with the obvious shape changes when porting to OpenAI or Vertex (OpenAI nests the schema under function.parameters and returns tool_calls.function.arguments as a JSON string; Vertex uses function_declarations).

Pattern 1: stamp every tool with a schema_version

Every tool gets a schema_version field in its description and, when it matters for routing, in the input schema itself. Not a code-level version, a contract-level one. The version increments when the meaning of any field changes, when a required field gets added, or when an enum loses a value.

TOOLS = [
    {
        "name": "refund_order",
        "description": (
            "schema_version: 3. Refund a paid order. "
            "Use for customer-initiated refunds only. "
            "For chargebacks, use record_chargeback."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "order_id": {
                    "type": "string",
                    "format": "uuid",
                },
                "amount_cents": {
                    "type": "integer",
                    "minimum": 1,
                },
                "reason_code": {
                    "type": "string",
                    "enum": [
                        "duplicate",
                        "defective",
                        "not_as_described",
                        "customer_request",
                    ],
                },
            },
            "required": [
                "order_id",
                "amount_cents",
                "reason_code",
            ],
        },
    },
]

A few things land here. The version sits in the description, where the model reads it, so a downstream reviewer can grep one line and know what contract the agent saw. amount_cents replaces the older amount (a float). Floats invite "12.50" or "$12.50"; integers in cents close that door. The reason_code enum closes the same door on the prose side: no free-text dialect creep.
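
If you want CI to enforce the stamp instead of trusting convention, a few lines of regex do it. A minimal sketch over the TOOLS list above; VERSION_RE and tool_versions are illustrative names, not part of any SDK:

import re

VERSION_RE = re.compile(r"schema_version:\s*(\d+)")


def tool_versions(tools: list[dict]) -> dict[str, int]:
    # Fail loudly when a tool ships without a contract version.
    versions = {}
    for tool in tools:
        match = VERSION_RE.search(tool["description"])
        if match is None:
            raise ValueError(
                f"{tool['name']} is missing a schema_version stamp"
            )
        versions[tool["name"]] = int(match.group(1))
    return versions


assert tool_versions(TOOLS) == {"refund_order": 3}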

When you bump from schema_version: 2 to schema_version: 3, every running agent picks up the new contract on its next request. There's no version negotiation, no client-side cache. The model is the client; it reads the description fresh every turn.
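
For the port mentioned in the intro, the same contract in the OpenAI Chat Completions shape looks like this. A sketch only; the contract text is unchanged, only the nesting moves:

OPENAI_TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "refund_order",
            "description": TOOLS[0]["description"],  # same contract text
            "parameters": TOOLS[0]["input_schema"],  # schema nests here
        },
    },
]

# On the response side the arguments arrive as a JSON string,
# not a dict, so parse before validating:
#   call = response.choices[0].message.tool_calls[0]
#   args = json.loads(call.function.arguments)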

Pattern 2: additions only, never re-typed required fields

This is the rule that survives the most upgrades. New fields are fine. New optional fields with sensible defaults are fine. Renaming a required field is not fine. Renaming a required field while changing its type (amount: float to amount_cents: int) is the same change disguised. Both leave older traces in your logs, replays, and conversation histories that no longer parse.

The safe path: add amount_cents as optional, leave the old amount field in place for two release cycles, let the description tell the model which one to prefer, and make the validator accept either while preferring the new shape.

from pydantic import BaseModel, model_validator


class RefundOrderArgs(BaseModel):
    order_id: str
    reason_code: str
    amount_cents: int | None = None
    amount: float | None = None  # legacy, still accepted

    @model_validator(mode="after")
    def _coalesce_amount(self):
        if self.amount_cents is None:
            if self.amount is None:
                raise ValueError(
                    "amount_cents is required"
                )
            self.amount_cents = round(self.amount * 100)
        return self
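
A quick check of the coalescing behavior, with illustrative values:

legacy_call = RefundOrderArgs(
    order_id="3f2b9c4e-1a2b-4c3d-8e4f-5a6b7c8d9e0f",
    reason_code="defective",
    amount=12.50,  # legacy float dollars
)
assert legacy_call.amount_cents == 1250  # coalesced to integer cents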

Description text steers the model.

schema_version: 3. Pass amount_cents (integer cents).
The legacy amount (float dollars) is accepted for
backward compatibility but will be removed in v4. Avoid
it on new calls.

The model reads "avoid it" and starts producing amount_cents on the very next turn. Old replays keep working. New calls use the cleaner shape. When the eval shows the legacy field is below 1% of calls, you remove it.

Pattern 3: deprecation hints in the description, not in your head

Engineers carry the deprecation in their head and forget the model only sees the description. If you mean "stop using this", say it in the description. The phrase that wins most often in production evals is the literal "avoid X; use Y".

schema_version: 3. Refund a paid order.

Use reason_code from this exact list: duplicate,
defective, not_as_described, customer_request.

Avoid the legacy reason field (free text). It is
accepted for backward compatibility and will be
removed in v4.

For chargebacks, use record_chargeback. This tool
will reject reason_code=chargeback.

Three steering moves on one tool. The legacy reason field gets an explicit "avoid". The wrong-tool case (reason_code=chargeback) tells the model exactly what will happen if it tries; that beats silent rejection, because the model corrects on the next turn instead of looping. Naming the enum twice (once in JSON Schema, once in prose) is belt-and-braces; in our experience, prose tends to win when they conflict.
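
On the executor side, the promised rejection is a few lines. A sketch assuming the Anthropic tool_result shape; handle_refund and the error text are illustrative:

def handle_refund(tool_use) -> dict:
    # Reject the wrong-tool case with an error the model can act on.
    if tool_use.input.get("reason_code") == "chargeback":
        return {
            "type": "tool_result",
            "tool_use_id": tool_use.id,
            "is_error": True,
            "content": (
                "reason_code=chargeback is rejected. "
                "Use record_chargeback for chargebacks."
            ),
        }
    ...  # validate with RefundOrderArgs, then execute the refund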

The cost of a deprecation hint is one paragraph per tool. Skip it and the next release interprets the field a little differently, and you find out three weeks after launch instead of in a CI run.

Pattern 4: golden-test the schema across every alias, in CI

The previous three patterns shape the contract. This one keeps it honest. Every model alias you run in production gets re-evaluated against a frozen tool schema and a frozen test set, on every PR that touches the schema or the system prompt. The eval is small; it doesn't need to be a benchmark. It needs to fail loud when an alias starts producing arguments your validator rejects.

The shape: a list of tasks, a list of model aliases. For each pair, fire one turn with the tool schema, capture the tool_use block, and run the input through the validator. Score: pass if the args validate and the tool name matches expectations, fail otherwise.

import json
from dataclasses import dataclass
from anthropic import Anthropic
from pydantic import BaseModel, ValidationError, model_validator

# RefundOrderArgs is the same class shown in Pattern 2, reproduced
# here so the harness runs as one file. TOOLS is the list from
# Pattern 1 and must be in scope too.
class RefundOrderArgs(BaseModel):
    order_id: str
    reason_code: str
    amount_cents: int | None = None
    amount: float | None = None

    @model_validator(mode="after")
    def _coalesce_amount(self):
        if self.amount_cents is None:
            if self.amount is None:
                raise ValueError("amount_cents is required")
            self.amount_cents = round(self.amount * 100)
        return self


client = Anthropic()


@dataclass
class Task:
    prompt: str
    expected_tool: str
    args_validator: type


@dataclass
class Result:
    model: str
    task: str
    tool: str | None
    ok: bool
    detail: str


def run_one(model: str, task: Task,
            tools: list[dict]) -> Result:
    resp = client.messages.create(
        model=model,
        max_tokens=512,
        tools=tools,
        messages=[
            {"role": "user", "content": task.prompt}
        ],
    )
    tool_use = next(
        (b for b in resp.content if b.type == "tool_use"),
        None,
    )
    if tool_use is None:
        return Result(
            model, task.prompt, None, False,
            "no_tool_call",
        )
    if tool_use.name != task.expected_tool:
        return Result(
            model, task.prompt, tool_use.name, False,
            f"wrong_tool: got {tool_use.name}",
        )
    try:
        task.args_validator(**tool_use.input)
    except ValidationError as e:
        return Result(
            model, task.prompt, tool_use.name, False,
            f"args_invalid: {e.errors()[0]['msg']}",
        )
    return Result(
        model, task.prompt, tool_use.name, True, "ok"
    )


def run_eval(models: list[str], tasks: list[Task],
             tools: list[dict]) -> dict:
    results: list[Result] = []
    for m in models:
        for t in tasks:
            results.append(run_one(m, t, tools))
    summary = {}
    for m in models:
        rows = [r for r in results if r.model == m]
        passed = sum(1 for r in rows if r.ok)
        summary[m] = {
            "pass_rate": passed / len(rows),
            "failures": [
                {"task": r.task, "detail": r.detail}
                for r in rows if not r.ok
            ],
        }
    return summary


if __name__ == "__main__":
    tasks = [
        Task(
            "Refund order 7c1d... for $12.50, "
            "the customer says it arrived broken.",
            "refund_order",
            RefundOrderArgs,
        ),
    ]
    # Replace these with the alias / snapshot IDs your
    # stack actually ships. Bare aliases follow the latest
    # release in that family; pin to a snapshot ID from the
    # console if you need a frozen target.
    models = [
        "claude-opus-4-7",
        "claude-sonnet-4-5",
    ]
    print(json.dumps(
        run_eval(models, tasks, TOOLS), indent=2
    ))

Under a hundred lines once the imports settle. Output looks like this:

{
  "claude-opus-4-7": {
    "pass_rate": 1.0,
    "failures": []
  },
  "claude-sonnet-4-5": {
    "pass_rate": 0.0,
    "failures": [
      {
        "task": "Refund order 7c1d...",
        "detail": "args_invalid: amount_cents required"
      }
    ]
  }
}

That second block is the alarm. In this worked example, the failing alias stopped filling in an amount at all, so the coalescing validator fails loudly and the eval catches it. The subtler case is an alias that reverts to the legacy amount field: the validator coalesces that silently and the eval stays green, so the only thing that surfaces the drift is a separate metric tracking the fraction of calls that use the legacy field. Once that fraction climbs, you either strengthen the deprecation hint or hold the model upgrade until the schema lands.
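
A sketch of that metric, reusing the client and Task from the harness; LEGACY_FIELDS is an illustrative map of tool name to deprecated keys:

LEGACY_FIELDS = {"refund_order": {"amount", "reason"}}


def legacy_fraction(model: str, tasks: list[Task],
                    tools: list[dict]) -> float:
    # Fraction of tool calls still carrying a deprecated field,
    # even when validation passes after coalescing.
    hits, total = 0, 0
    for task in tasks:
        resp = client.messages.create(
            model=model,
            max_tokens=512,
            tools=tools,
            messages=[{"role": "user", "content": task.prompt}],
        )
        for block in resp.content:
            if block.type != "tool_use":
                continue
            total += 1
            if LEGACY_FIELDS.get(block.name, set()) & set(block.input):
                hits += 1
    return hits / total if total else 0.0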

CI runs this on every PR that changes tools/, system_prompts/, or any model identifier in config. A full run is single-digit dollars in API spend per release, depending on alias count and task length. Cheap insurance against the drift the post opened with.
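
Wiring that in can be one pytest assertion over the summary. A sketch; the file name, the import assumption, and the 1.0 threshold are choices, not requirements:

# test_tool_contract.py
# Assumes run_eval, TOOLS, and the tasks/models lists from the
# harness above are importable from your eval module.
import json


def test_every_alias_validates():
    summary = run_eval(models, tasks, TOOLS)
    failing = {
        m: s for m, s in summary.items() if s["pass_rate"] < 1.0
    }
    assert not failing, json.dumps(failing, indent=2)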

What to keep on the wall

Treat the schema as the prompt it actually is. Stamp it with a version, add fields instead of re-typing them, write the deprecation in the description not in your head, and let a small CI eval be the thing that tells you the next release is safe. Model upgrades stop being scary the moment the eval is the source of truth.


If this was useful

The Prompt Engineering Pocket Guide goes deep on the description-writing patterns from this post: how the model reads tool descriptions as a rubric, why "avoid X; use Y" beats every other phrasing, and how to write deprecation hints that actually move call rates. The chapter on schema versioning pairs the four patterns above with before/after eval scores and the prompt diffs that produced them.

Prompt Engineering Pocket Guide: Techniques for Getting the Most from LLMs
