How to test AI agents that call your APIs without losing data

An AI coding agent ran a script, watched it succeed, and then watched a production database table disappear. The viral post-mortem headline — “AI didn’t delete your database, you did” — worked because the failure was not magic. The agent followed a tool definition. The tool hit a real endpoint. The endpoint had no guardrails. A human had given write access to a process that will not stop and ask whether DELETE FROM users looks suspicious. A separate r/ClaudeAI thread described a billing loop that burned through hundreds of dollars in tokens before anyone noticed. Different incident, same root cause: the API layer was not tested for agent behavior.

💡 If you’re shipping autonomous agents that call your APIs, this guide is for you. You’ll learn how to mock external endpoints during agent development, sandbox destructive operations, write contract tests for tool schemas, set per-agent budget caps, and rehearse failure modes before they hit production. We’ll use Apidog for the testing scaffolding because it supports OpenAPI imports, mock servers, and scenario tests that map cleanly to agent tool-call sequences.

TL;DR

Agents fail in production when their tools can call APIs without guardrails:

  • No rate limits
  • No idempotency
  • Destructive endpoints exposed to agent tokens
  • Tool schemas that drift from the real API
  • Retry loops with no budget ceiling

Fix it with four controls:

  1. Contract-test agent tool definitions against your OpenAPI spec.
  2. Use mock servers for destructive endpoints during development.
  3. Require idempotency keys and soft deletes for write operations.
  4. Enforce per-agent request, token, time, and spend budgets.

Apidog gives you OpenAPI import, mocks, and scenario testing in one project.

Introduction

A year ago, “test the AI agent” usually meant prompting Claude or GPT and grading the answer. That is no longer enough.

Today’s agents call functions. Those functions hit APIs. Those APIs touch real databases, billing systems, queues, CRMs, and third-party services.

A bad tool definition is no longer just a bad prompt. It can become:

  • A deleted table
  • A duplicate payment
  • A thousand queued emails
  • A runaway token bill
  • A compliance incident

The model layer matters, but the API layer is where you prevent damage.

This guide shows how to test AI agent API integrations end to end:

  • Validate agent tool schemas against OpenAPI
  • Mock destructive endpoints
  • Replay agent call sequences as API scenarios
  • Add idempotency and budget controls
  • Detect schema drift in CI
  • Separate read and write credentials

Use this as a practical checklist before giving an agent access to anything important.

Why agent failures look like API failures

Read enough agent post-mortems and the pattern becomes obvious: the model is rarely the real protagonist. The API is.

Prompt injection becomes an authorization failure

A user uploads a PDF with hidden instructions. The agent reads it and then calls:

DELETE /admin/users?delete_all=true

The fix is not only “write a better system prompt.”

The API should not allow a user-context agent token to call admin-only destructive endpoints in the first place.

If a normal user cannot delete all users, an agent acting on behalf of that user cannot either.
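What that enforcement looks like at the API layer, as a minimal framework-agnostic sketch (the scope names and the /admin prefix are illustrative assumptions, not your real API):

# Sketch: enforce the original user's permissions on every agent call.
# Scope names and the /admin prefix are illustrative only.

DESTRUCTIVE_METHODS = {"POST", "PUT", "PATCH", "DELETE"}

def authorize_agent_call(user_scopes: set[str], method: str, path: str) -> bool:
    """Return True only if the user behind the agent could make this call themselves."""
    # Admin-only destructive endpoints require an explicit admin scope,
    # no matter what the prompt or tool definition says.
    if path.startswith("/admin/") and method in DESTRUCTIVE_METHODS:
        return "admin:write" in user_scopes

    # Everything else still needs a scope that matches the method.
    if method in DESTRUCTIVE_METHODS:
        return "write" in user_scopes
    return "read" in user_scopes


# A user-context agent token without admin:write cannot trigger the injected call:
assert authorize_agent_call({"read", "write"}, "DELETE", "/admin/users") is False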

Faulty tool schemas become data bugs

Your OpenAPI spec says:

{
  "amount": {
    "type": "integer",
    "description": "Amount in cents"
  }
}

But the agent tool definition says:

{
  "amount": {
    "type": "number",
    "description": "Amount in dollars"
  }
}

Eventually, an intended $19 refund goes through as 19 cents.

The model did not invent the bug. It used the schema you gave it.

Missing rate limits become billing incidents

An agent retries a failed email notification step because its planner keeps marking the task as incomplete.

Without caps, it can call:

POST /notifications/email

hundreds or thousands of times.

That costs money, spams users, and may get your provider account flagged.

Missing idempotency becomes duplicate writes

An agent calls:

POST /payments

The network times out. The agent retries. The first request actually succeeded. Now the customer is charged twice.

The agent cannot know what happened unless your API gives it a safe retry mechanism.

That mechanism is an idempotency key.

The four guardrails every agent-API integration needs

These four controls prevent most expensive agent failures.

If you can only add one this week, start with contract tests. If your agents can write data, add idempotency next.

1. Tool-schema contract tests

Your OpenAPI spec should be the source of truth for your API.

Your agent tool definitions should not be hand-maintained copies that silently drift.

Add a CI test that compares each tool definition against the matching OpenAPI operation.

Here is a minimal Python example:

def validate_tool_against_openapi(tool_def: dict, openapi_spec: dict) -> list[str]:
    """
    Compare an agent tool definition with the OpenAPI request schema.

    Returns:
        List of mismatch errors. Empty list means pass.
    """
    errors = []

    op = openapi_spec["paths"][tool_def["path"]][tool_def["method"].lower()]
    api_schema = op["requestBody"]["content"]["application/json"]["schema"]
    tool_schema = tool_def["input_schema"]

    api_props = set(api_schema.get("properties", {}).keys())
    tool_props = set(tool_schema.get("properties", {}).keys())

    for missing in api_props - tool_props:
        if missing in api_schema.get("required", []):
            errors.append(f"Tool missing required field: {missing}")

    for extra in tool_props - api_props:
        errors.append(f"Tool defines field not in API: {extra}")

    for prop, api_def in api_schema.get("properties", {}).items():
        if prop not in tool_schema.get("properties", {}):
            continue

        tool_prop = tool_schema["properties"][prop]

        if api_def.get("type") != tool_prop.get("type"):
            errors.append(
                f"Type mismatch on {prop}: "
                f"API={api_def.get('type')} tool={tool_prop.get('type')}"
            )

    return errors

Example CI usage:

import json
import sys
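# Assumes validate_tool_against_openapi from the snippet above is in scope here.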

with open("openapi.json") as f:
    openapi_spec = json.load(f)

with open("agent-tools.json") as f:
    tools = json.load(f)

all_errors = []

for tool in tools:
    errors = validate_tool_against_openapi(tool, openapi_spec)
    for error in errors:
        all_errors.append(f"{tool['name']}: {error}")

if all_errors:
    print("\n".join(all_errors))
    sys.exit(1)

print("Tool schemas match OpenAPI spec")

Run this whenever a PR changes:

  • openapi.yaml
  • openapi.json
  • Agent tool definitions
  • Request DTOs
  • Generated API clients

Fail the build on mismatch. Do not ship schema drift as a warning.

2. Sandbox and mock destructive endpoints

Agents need a place to practice. That place should not be production.

For every endpoint that mutates state, provide a mock or sandbox equivalent:

  • POST
  • PUT
  • PATCH
  • DELETE

During development, point the agent at the mock server instead of the real API.

With Apidog, you can import your OpenAPI spec and generate mock endpoints directly from it. That gives your agent realistic response shapes without touching real data.

For example, your production endpoint might be:

DELETE https://api.example.com/users/123

During agent development, use:

DELETE https://mock.apidog.com/m1/your-project-id/users/123

The agent receives a valid response, but your production database remains untouched.

This fits the broader contract-first development workflow.
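One low-friction way to wire this up is to keep the host out of the tool definitions entirely and read it from the environment in the tool executor. A minimal sketch, assuming a single HTTP helper (the call_api name and the API_BASE_URL default are illustrative):

import os
import requests

# Point API_BASE_URL at the Apidog mock during development,
# at staging later, and at production only when you mean it.
API_BASE_URL = os.environ.get("API_BASE_URL", "https://mock.apidog.com/m1/your-project-id")

def call_api(method: str, path: str, **kwargs):
    """All agent tool calls go through this single HTTP helper."""
    response = requests.request(method, f"{API_BASE_URL}{path}", timeout=30, **kwargs)
    response.raise_for_status()
    return response.json()

Because the tool layer never hardcodes a host, switching from mock to staging to production is an environment change, not a prompt or schema change.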

3. Idempotency keys and soft deletes

Every write endpoint an agent can call should accept an idempotency key.

Every delete operation should default to soft delete.

Express idempotency middleware

const express = require("express");

const app = express();
app.use(express.json());

// In-memory cache for illustration only; use Redis or a database in production
// so keys survive restarts and work across multiple instances.
const idempotencyCache = new Map();

function idempotency(req, res, next) {
  const key = req.headers["idempotency-key"];

  if (!key) {
    return res.status(400).json({
      error: "Missing Idempotency-Key header"
    });
  }

  if (idempotencyCache.has(key)) {
    const cached = idempotencyCache.get(key);
    return res.status(cached.status).json(cached.body);
  }

  const originalJson = res.json.bind(res);

  res.json = function (body) {
    idempotencyCache.set(key, {
      status: res.statusCode,
      body
    });

    setTimeout(() => {
      idempotencyCache.delete(key);
    }, 24 * 60 * 60 * 1000);

    return originalJson(body);
  };

  next();
}

app.post("/payments", idempotency, createPayment);

The agent should generate one UUID per logical operation and reuse it on retries.

Example:

POST /payments
Idempotency-Key: refund-01HYX4R8G8B9M7R4N2M3KQZV6A
Content-Type: application/json

{
  "customer_id": "cus_123",
  "amount": 1900,
  "currency": "usd"
}

If the agent retries after a timeout, the API returns the cached response instead of creating a second payment.
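On the agent side, the key must be generated once per logical operation, not once per HTTP attempt. A minimal client sketch in Python, assuming the middleware above (the endpoint and key prefix are illustrative):

import uuid
import requests

def create_payment_with_retry(base_url: str, payload: dict, max_retries: int = 3):
    # One key per logical operation, reused across every retry attempt.
    idempotency_key = f"payment-{uuid.uuid4()}"

    for attempt in range(max_retries):
        try:
            response = requests.post(
                f"{base_url}/payments",
                json=payload,
                headers={"Idempotency-Key": idempotency_key},
                timeout=10,
            )
            response.raise_for_status()
            return response.json()
        except requests.Timeout:
            # Safe to retry: the server returns the cached response
            # if the first request actually went through.
            continue

    raise RuntimeError("Payment did not complete within the retry budget")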

Use the same pattern for:

  • Payment creation
  • Refunds
  • Email sends
  • CRM updates
  • Ticket creation
  • File uploads
  • Any non-idempotent write

For deletes, prefer:

PATCH /users/{id}

{
  "deleted": true,
  "deleted_at": "2026-05-06T10:00:00Z"
}

Reserve hard deletes for human-approved paths.

4. Per-agent budget caps

Every agent needs hard ceilings.

Track budgets by agent ID, session ID, user ID, or task ID.

Useful limits include:

  • Tokens per session
  • API calls per minute
  • API calls per task
  • Dollar spend per task
  • Runtime duration
  • Tool-call depth
  • Retry count

Example policy:

{
  "agent_id": "support-triage-agent",
  "limits": {
    "tokens_per_session": 50000,
    "api_calls_per_minute": 30,
    "max_spend_cents_per_task": 500,
    "max_tool_call_depth": 10,
    "max_retries_per_operation": 3
  }
}

When a cap is hit, return a structured 429:

HTTP/1.1 429 Too Many Requests
Retry-After: 60
X-Budget-Exceeded: api_calls_per_minute
Content-Type: application/json

{
  "error": "Budget exceeded",
  "budget": "api_calls_per_minute",
  "limit": 30,
  "retry_after_seconds": 60,
  "action": "escalate_to_human"
}

The agent planner can then stop, retry later, or escalate.

Do not rely on monitoring alone. Monitoring tells you damage happened. Budget caps stop the loop while it is happening.
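A minimal in-process sketch of that enforcement, using the limits from the example policy above (a real deployment would track these counters in Redis or at the API gateway rather than in memory):

import time
from collections import defaultdict, deque

# Illustrative limits matching the example policy above.
LIMITS = {"api_calls_per_minute": 30}

call_log: dict[str, deque] = defaultdict(deque)  # agent_id -> timestamps of recent calls

def check_call_budget(agent_id: str) -> dict | None:
    """Return a structured 429 body if the per-minute cap is hit, else None."""
    now = time.monotonic()
    window = call_log[agent_id]

    # Drop calls older than the 60-second window.
    while window and now - window[0] > 60:
        window.popleft()

    if len(window) >= LIMITS["api_calls_per_minute"]:
        return {
            "error": "Budget exceeded",
            "budget": "api_calls_per_minute",
            "limit": LIMITS["api_calls_per_minute"],
            "retry_after_seconds": 60,
            "action": "escalate_to_human",
        }

    window.append(now)
    return None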

Test agent API calls with Apidog

Here is a practical workflow for testing agent API integrations with Apidog.

You need:

  • Your OpenAPI 3.x spec
  • The agent’s tool definitions
  • A list of high-risk endpoints
  • Example tasks the agent performs

Step 1: Import the OpenAPI spec

Create a new Apidog project and import your OpenAPI file.

Apidog parses:

  • Paths
  • Methods
  • Request schemas
  • Response schemas
  • Examples
  • Auth configuration

If your API is not documented in OpenAPI yet, start there. Agent safety depends on having one contract that humans, tests, and agents all share.

The design-first API workflow guide covers this process if you are starting from scratch.

Step 2: Mock destructive endpoints

Find every endpoint that mutates data:

POST /payments
POST /refunds
PATCH /users/{id}
DELETE /users/{id}
POST /notifications
PUT /billing/subscription

For each endpoint:

  1. Open it in Apidog.
  2. Add a mock response.
  3. Use the same response shape as production.
  4. Override values so they are obviously fake.
  5. Start the mock server.
  6. Point the agent’s base URL to the mock URL.

Use test-looking values:

{
  "id": "mock_user_1970_001",
  "email": "mock-user@example.test",
  "status": "deleted",
  "deleted_at": "1970-01-01T00:00:00Z"
}

Avoid mock data that looks like real customer data. If it leaks into logs, dashboards, or screenshots, it should be obvious that it is fake.

Step 3: Replay the agent’s call sequence as a scenario

Apidog scenarios let you chain API calls with assertions.

For a support-ticket triage agent, a scenario might be:

  1. POST /auth/token

    • Use test credentials.
    • Capture the bearer token.
  2. GET /tickets?status=open

    • Pass the token.
    • Capture the first ticket ID.
  3. POST /tickets/{id}/triage

    • Send a category.
    • Assert 200.
    • Capture assigned_to.
  4. POST /notifications

    • Send a templated message.
    • Assert the response body matches a regex.

Example assertion targets:

{
  "status": 200,
  "body.assigned_to": "support_l2",
  "body.category": "billing"
}

You are rehearsing the agent’s API behavior before the model gets production access.

If a developer changes the ticket schema and the scenario fails, you catch the issue before the agent does.

See API testing for QA engineers for a broader scenario-testing workflow.

Step 4: Run scenarios from CI

Add scenario runs to your PR pipeline.

Example command:

apidog run -t scenario-id --env test

Run it when these files change:

on:
  pull_request:
    paths:
      - "openapi.yaml"
      - "agent-tools/**"
      - "src/routes/**"
      - "src/controllers/**"

Example GitHub Actions step:

jobs:
  api-agent-tests:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - name: Run Apidog scenario
        run: apidog run -t ${{ secrets.APIDOG_SCENARIO_ID }} --env test

The goal is simple: every API or tool-definition change replays the same baseline agent scenarios.

Step 5: Compare model versions safely

When upgrading models, test tool-call behavior before production.

Run the same task twice:

  • Model A against the Apidog mock server
  • Model B against the same mock server

Capture and diff:

  • Endpoint paths
  • HTTP methods
  • Request bodies
  • Header values
  • Date formats
  • Enum values
  • Missing fields
  • Retry counts

Example drift:

{
- "priority": "medium",
+ "priority": "urgent"
}

or:

{
- "due_date": "2026-05-06",
+ "due_date": "05/06/2026"
}

This catches behavior changes before they become production data changes.

This pattern also matters when evaluating newer model APIs, as discussed in GPT-5.5 API integration.

Advanced techniques and pro tips

Pin temperature to zero in tests

When testing tool-call behavior, remove unnecessary randomness.

Use:

{
  "temperature": 0
}

You are testing API behavior and tool selection, not creativity.

Snapshot tool-call traces

Record every tool call:

[
  {
    "tool": "get_open_tickets",
    "arguments": {
      "status": "open"
    }
  },
  {
    "tool": "triage_ticket",
    "arguments": {
      "ticket_id": "ticket_123",
      "category": "billing"
    }
  }
]

Diff traces across runs.

If an agent suddenly calls /users twice or starts passing a new field, fail the test or flag it for review.
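The snapshot comparison can be as simple as diffing the recorded trace against a committed baseline. A minimal sketch, assuming the trace file names shown here:

import json
import sys

with open("traces/baseline.json") as f:
    baseline = json.load(f)

with open("traces/current.json") as f:
    current = json.load(f)

# Compare the sequence of (tool, arguments) pairs, not free-text model output.
baseline_calls = [(c["tool"], json.dumps(c["arguments"], sort_keys=True)) for c in baseline]
current_calls = [(c["tool"], json.dumps(c["arguments"], sort_keys=True)) for c in current]

if baseline_calls != current_calls:
    print("Tool-call trace changed:")
    for old, new in zip(baseline_calls, current_calls):
        if old != new:
            print(f"- {old}\n+ {new}")
    if len(baseline_calls) != len(current_calls):
        print(f"Call count changed: {len(baseline_calls)} -> {len(current_calls)}")
    sys.exit(1)

The same diff works when comparing tool-call behavior across model versions in Step 5.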

Never give an agent broad production credentials

Use scoped service accounts.

Bad:

PROD_ADMIN_API_KEY=...

Better:

SUPPORT_TRIAGE_AGENT_READ_KEY=...
SUPPORT_TRIAGE_AGENT_WRITE_KEY=...

Best:

  • Short-lived tokens
  • Scoped permissions
  • Proxy-signed requests
  • No direct database access
  • No secrets in files the agent can read

Separate read and write API keys

Most agent tasks are read-heavy.

Issue read-only keys by default. Require separate approval for write-capable keys.

Example:

GET /tickets
Authorization: Bearer agent_read_token

For writes:

POST /tickets/{id}/triage
Authorization: Bearer agent_write_token
Idempotency-Key: triage-01HYX4...

This reduces blast radius if an agent is compromised or prompt-injected.

Use HTTP 423 for human-approval gates

For operations that require human confirmation, return 423 Locked instead of 403 Forbidden.

Example:

HTTP/1.1 423 Locked
Content-Type: application/json

{
  "error": "Human approval required",
  "confirmation_url": "https://app.example.com/approvals/abc123",
  "expires_at": "2026-05-06T12:00:00Z"
}

403 means “you cannot do this.”

423 means “you cannot do this yet.”

That distinction is useful for agent planners.
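On the agent side, the tool executor can branch on that status code. A minimal sketch (the return shape is an assumption about how your planner consumes tool results):

import requests

def execute_tool_call(method: str, url: str, **kwargs) -> dict:
    response = requests.request(method, url, timeout=30, **kwargs)

    if response.status_code == 423:
        body = response.json()
        # Not a dead end: pause the task and surface the approval link to a human.
        return {
            "status": "needs_human_approval",
            "confirmation_url": body.get("confirmation_url"),
            "expires_at": body.get("expires_at"),
        }

    if response.status_code == 403:
        # A hard authorization boundary: the planner should abandon this path.
        return {"status": "forbidden"}

    response.raise_for_status()
    return {"status": "ok", "body": response.json()}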

Fail closed on schema drift

If the agent tool definition does not match OpenAPI, fail CI.

Do not allow:

Warning: schema mismatch detected

Require:

Error: schema mismatch detected
Build failed

The cost of a failed build is lower than the cost of a production incident.

Common mistakes to avoid

Avoid these patterns:

  • Hardcoding mock URLs into prompts
  • Skipping idempotency on “small” write endpoints
  • Logging full request bodies with PII
  • Giving agents direct database access
  • Letting agents use production admin credentials
  • Treating model confidence as API safety
  • Relying only on prompt instructions for authorization
  • Running destructive test cases against production
  • Ignoring retry behavior in tests

If your agent talks to multiple internal services, use the patterns in microservices testing to fan out scenario tests across services.

Alternatives and tooling

You have several ways to test agent/API integrations.

| Approach | Setup time | Strength | Weakness | Best for |
| --- | --- | --- | --- | --- |
| Handcrafted unit tests | Low | Full control, no vendor lock-in | High maintenance, easy to drift from the real API | Small projects and single-developer teams |
| LangSmith / LangGraph eval harness | Medium | Trace replay and model-aware metrics | Strong on the agent layer, lighter on the API layer | Eval-heavy AI teams |
| Postman + Postbot | Medium | Familiar UI and large template library | Mock server is a paid add-on; scenario syntax can feel dated | Teams already invested in Postman |
| Apidog scenarios + mocks | Medium | OpenAPI-native import, mocks, and scenario CLI for CI | Less brand recognition than Postman | Teams that want design, mocks, and tests in one API project |

If you already use LangSmith, keep it for prompt and agent-level evals, then add API-side tests separately.

If you have outgrown Postman’s pricing or mock model, Apidog is a strong replacement.

Many teams pair tools:

  • LangSmith for prompt-level and trace evals
  • Apidog for API contracts, mocks, and scenario replays

That split works because the tools cover different layers.

Real-world use cases

Agent updates production database rows

A customer-success team built an agent that updates account fields from support tickets.

Before launch, they:

  • Required idempotency keys on every write endpoint
  • Used a sandbox database
  • Replayed 200 Apidog scenarios
  • Validated enum values against the OpenAPI schema

The tests caught two cases where the agent tried to set:

{
  "subscription_status": "paused_by_customer"
}

But the API allowed only:

["active", "paused", "cancelled"]

They added schema validation before launch.
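The minimal contract test earlier only compares types; catching this class of bug also means comparing enum values. A small extension in the same style (a sketch, not a complete validator):

def validate_enums(tool_schema: dict, api_schema: dict) -> list[str]:
    """Flag tool enum values the API schema does not accept."""
    errors = []

    for prop, api_def in api_schema.get("properties", {}).items():
        tool_prop = tool_schema.get("properties", {}).get(prop)
        if tool_prop is None or "enum" not in api_def:
            continue

        allowed = set(api_def["enum"])
        declared = set(tool_prop.get("enum", []))

        for value in declared - allowed:
            errors.append(f"{prop}: tool allows '{value}', API allows {sorted(allowed)}")

    return errors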

Agent calls a payments API

A fintech team built an automated refund agent.

Their controls:

  • Max 5 refunds per session
  • Max $50 per refund
  • Idempotency required on every refund
  • Contract tests against the payment API schema
  • Human approval for higher-value refunds

They ran the contract test suite on every PR.

The important pattern is not the specific limits. It is that the agent could not exceed them silently.

Agent triages GitHub issues

A platform team built an issue-triage agent inspired by Clawsweeper.

Before launch, they mocked the GitHub API in Apidog and tested cases such as:

  • Deleted issues
  • Missing labels
  • Malformed user input
  • Closed issues
  • Permission failures
  • Rate limits

They found crashes before the agent touched the live repository.

Implementation checklist

Use this before shipping an agent that calls APIs.

API contract

  • [ ] OpenAPI spec exists and is current
  • [ ] Agent tool definitions map to OpenAPI operations
  • [ ] CI fails on schema drift
  • [ ] Required fields match
  • [ ] Types and enums match
  • [ ] Request and response examples exist

Mocking and sandboxing

  • [ ] Destructive endpoints have mock responses
  • [ ] Agent development points to mock base URL
  • [ ] Staging uses sandbox data
  • [ ] Production endpoints are not used in test loops
  • [ ] Mock data is clearly fake

Write safety

  • [ ] Every write endpoint requires an idempotency key
  • [ ] Deletes are soft deletes by default
  • [ ] Hard deletes require human approval
  • [ ] Retry behavior is tested
  • [ ] Duplicate writes are prevented

Budgets and limits

  • [ ] API calls per minute are capped
  • [ ] Token usage per session is capped
  • [ ] Spend per task is capped
  • [ ] Retry count is capped
  • [ ] Tool-call depth is capped
  • [ ] Budget failures return structured 429 responses

Credentials

  • [ ] Agents use scoped service accounts
  • [ ] Read and write keys are separate
  • [ ] Production admin keys are never exposed to agents
  • [ ] Tokens are short-lived where possible
  • [ ] Secrets are stored in a vault

CI

  • [ ] Contract tests run on PRs
  • [ ] Apidog scenarios run on PRs
  • [ ] Tool-call traces are snapshotted
  • [ ] Model upgrades are tested against the same scenarios
  • [ ] Failures block deploys

Conclusion

The agent alone is rarely the whole problem. The API layer is either the failure point or the safety layer.

Five takeaways:

  • Treat tool schemas as contracts and test them in CI.
  • Mock destructive endpoints during agent development.
  • Require idempotency keys on every write endpoint.
  • Set per-agent budget caps that fail closed.
  • Replay scenarios on every API or tool-definition change.

Start with the mock-server step. It gives agents a safe place to make mistakes.

Download Apidog if you want to set up OpenAPI-based mocks and scenario tests. For the QA-team perspective, see API testing tools for QA engineers. For broader context on writing safer agent instructions, see how to write AGENTS.md files.

FAQ

How do I test AI agent API calls without spending money on tokens?

Use a mock server during development.

Point the agent’s API base URL at the mock instead of the real service. Apidog mock URLs return realistic responses from your OpenAPI schema, so you can test API behavior without hitting production endpoints.

Also:

  • Use a fixed prompt set
  • Set temperature to 0
  • Limit retries
  • Snapshot tool-call traces

See the QA engineer’s testing checklist for a fuller setup.

What is the difference between testing the agent and testing the API?

Agent testing checks whether the model chooses the right tool and fills arguments correctly.

API testing checks whether the endpoint behaves safely and correctly when called.

You need both.

A good agent calling a broken API still creates bad outcomes. A broken agent calling a safe API should fail closed.

Do I need idempotency keys on every endpoint?

Use idempotency keys on every write endpoint.

Reads are idempotent by default. Writes are not.

Agents retry after timeouts, 500 responses, tool failures, and ambiguous planner states. Idempotency prevents those retries from creating duplicate side effects.

How do I prevent prompt injection from triggering bad API calls?

Do not rely only on prompt instructions.

Enforce authorization at the API layer based on the original user context.

If the user cannot normally call:

DELETE /admin/delete-all-users

then an agent acting on behalf of that user should not be able to call it either.

Prompt injection should hit an authorization boundary, not a production database.

Can I use Apidog with Claude or GPT directly?

Use Apidog’s mock URL as the API base URL in your agent configuration or tool layer.

For example:

API_BASE_URL=https://mock.apidog.com/m1/your-project-id

When moving from mock to staging, change the environment variable:

API_BASE_URL=https://staging-api.example.com

Keep the tool definitions and scenarios the same.

What is the right budget cap for an agent?

Start strict, then loosen with data.

A reasonable starting policy:

  • 50,000 tokens per session
  • 30 API calls per minute
  • $5 per task
  • 10 nested tool calls
  • 3 retries per operation

Review logs for two weeks. Raise limits that legitimate tasks hit. Lower limits that are never needed.

The goal is not a universal number. The goal is a ceiling tight enough to stop runaway loops.

How do I detect schema drift between agent tools and my API?

Run a schema diff in CI.

Compare:

  • Agent tool JSON schema
  • OpenAPI request schema
  • Required fields
  • Types
  • Enums
  • Formats

Fail the build if they diverge.

The Python snippet in the contract-test section is enough to get started.
