How to test AI agents that call your APIs without losing data

An AI coding agent ran a script, watched it succeed, and then watched a production database table disappear. The viral post-mortem headline — “AI didn’t delete your database, you did” — worked because the failure was not magic. The agent followed a tool definition. The tool hit a real endpoint. The endpoint had no guardrails. A human had given write access to a process that will not stop and ask whether DELETE FROM users looks suspicious. A separate r/ClaudeAI thread described a billing loop that burned through hundreds of dollars in tokens before anyone noticed. Different incident, same root cause: the API layer was not tested for agent behavior.

💡 If you’re shipping autonomous agents that call your APIs, this guide is for you. You’ll learn how to mock external endpoints during agent development, sandbox destructive operations, write contract tests for tool schemas, set per-agent budget caps, and rehearse failure modes before they hit production. We’ll use Apidog for the testing scaffolding because it supports OpenAPI imports, mock servers, and scenario tests that map cleanly to agent tool-call sequences.

TL;DR

Agents fail in production when their tools can call APIs without guardrails:

  • No rate limits
  • No idempotency
  • Destructive endpoints exposed to agent tokens
  • Tool schemas that drift from the real API
  • Retry loops with no budget ceiling

Fix it with four controls:

  1. Contract-test agent tool definitions against your OpenAPI spec.
  2. Use mock servers for destructive endpoints during development.
  3. Require idempotency keys and soft deletes for write operations.
  4. Enforce per-agent request, token, time, and spend budgets.

Apidog gives you OpenAPI import, mocks, and scenario testing in one project.

Introduction

A year ago, “test the AI agent” usually meant prompting Claude or GPT and grading the answer. That is no longer enough.

Today’s agents call functions. Those functions hit APIs. Those APIs touch real databases, billing systems, queues, CRMs, and third-party services.

A bad tool definition is no longer just a bad prompt. It can become:

  • A deleted table
  • A duplicate payment
  • A thousand queued emails
  • A runaway token bill
  • A compliance incident

The model layer matters, but the API layer is where you prevent damage.

This guide shows how to test AI agent API integrations end to end:

  • Validate agent tool schemas against OpenAPI
  • Mock destructive endpoints
  • Replay agent call sequences as API scenarios
  • Add idempotency and budget controls
  • Detect schema drift in CI
  • Separate read and write credentials

Use this as a practical checklist before giving an agent access to anything important.

Why agent failures look like API failures

Read enough agent post-mortems and the pattern becomes obvious: the model is rarely the real protagonist. The API is.

Prompt injection becomes an authorization failure

A user uploads a PDF with hidden instructions. The agent reads it and then calls:

DELETE /admin/users?delete_all=true

The fix is not only “write a better system prompt.”

The API should not allow a user-context agent token to call admin-only destructive endpoints in the first place.

If a normal user cannot delete all users, an agent acting on behalf of that user cannot either.
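What that enforcement looks like at the API layer, as a minimal framework-agnostic sketch (the scope names and the /admin prefix are illustrative assumptions, not your real API):

# Sketch: enforce the original user's permissions on every agent call.
# Scope names and the /admin prefix are illustrative only.

DESTRUCTIVE_METHODS = {"POST", "PUT", "PATCH", "DELETE"}

def authorize_agent_call(user_scopes: set[str], method: str, path: str) -> bool:
    """Return True only if the user behind the agent could make this call themselves."""
    # Admin-only destructive endpoints require an explicit admin scope,
    # no matter what the prompt or tool definition says.
    if path.startswith("/admin/") and method in DESTRUCTIVE_METHODS:
        return "admin:write" in user_scopes

    # Everything else still needs a scope that matches the method.
    if method in DESTRUCTIVE_METHODS:
        return "write" in user_scopes
    return "read" in user_scopes


# A user-context agent token without admin:write cannot trigger the injected call:
assert authorize_agent_call({"read", "write"}, "DELETE", "/admin/users") is False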

Faulty tool schemas become data bugs

Your OpenAPI spec says:

{
  "amount": {
    "type": "integer",
    "description": "Amount in cents"
  }
}

But the agent tool definition says:

{
  "amount": {
    "type": "number",
    "description": "Amount in dollars"
  }
}

Eventually, an intended $19 refund goes through as 19 cents.

The model did not invent the bug. It used the schema you gave it.

Missing rate limits become billing incidents

An agent retries a failed email notification step because its planner keeps marking the task as incomplete.

Without caps, it can call:

POST /notifications/email

hundreds or thousands of times.

That costs money, spams users, and may get your provider account flagged.

Missing idempotency becomes duplicate writes

An agent calls:

POST /payments

The network times out. The agent retries. The first request actually succeeded. Now the customer is charged twice.

The agent cannot know what happened unless your API gives it a safe retry mechanism.

That mechanism is an idempotency key.

The four guardrails every agent-API integration needs

These four controls prevent most expensive agent failures.

If you can only add one this week, start with contract tests. If your agents can write data, add idempotency next.

1. Tool-schema contract tests

Your OpenAPI spec should be the source of truth for your API.

Your agent tool definitions should not be hand-maintained copies that silently drift.

Add a CI test that compares each tool definition against the matching OpenAPI operation.

Here is a minimal Python example:

def validate_tool_against_openapi(tool_def: dict, openapi_spec: dict) -> list[str]:
    """
    Compare an agent tool definition with the OpenAPI request schema.

    Returns:
        List of mismatch errors. Empty list means pass.
    """
    errors = []

    op = openapi_spec["paths"][tool_def["path"]][tool_def["method"].lower()]
    api_schema = op["requestBody"]["content"]["application/json"]["schema"]
    tool_schema = tool_def["input_schema"]

    api_props = set(api_schema.get("properties", {}).keys())
    tool_props = set(tool_schema.get("properties", {}).keys())

    for missing in api_props - tool_props:
        if missing in api_schema.get("required", []):
            errors.append(f"Tool missing required field: {missing}")

    for extra in tool_props - api_props:
        errors.append(f"Tool defines field not in API: {extra}")

    for prop, api_def in api_schema.get("properties", {}).items():
        if prop not in tool_schema.get("properties", {}):
            continue

        tool_prop = tool_schema["properties"][prop]

        if api_def.get("type") != tool_prop.get("type"):
            errors.append(
                f"Type mismatch on {prop}: "
                f"API={api_def.get('type')} tool={tool_prop.get('type')}"
            )

    return errors

Example CI usage:

import json
import sys
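# Assumes validate_tool_against_openapi from the snippet above is in scope here.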

with open("openapi.json") as f:
    openapi_spec = json.load(f)

with open("agent-tools.json") as f:
    tools = json.load(f)

all_errors = []

for tool in tools:
    errors = validate_tool_against_openapi(tool, openapi_spec)
    for error in errors:
        all_errors.append(f"{tool['name']}: {error}")

if all_errors:
    print("\n".join(all_errors))
    sys.exit(1)

print("Tool schemas match OpenAPI spec")

Run this whenever a PR changes:

  • openapi.yaml
  • openapi.json
  • Agent tool definitions
  • Request DTOs
  • Generated API clients

Fail the build on mismatch. Do not ship schema drift as a warning.

2. Sandbox and mock destructive endpoints

Agents need a place to practice. That place should not be production.

For every endpoint that mutates state, provide a mock or sandbox equivalent:

  • POST
  • PUT
  • PATCH
  • DELETE

During development, point the agent at the mock server instead of the real API.

With Apidog, you can import your OpenAPI spec and generate mock endpoints directly from it. That gives your agent realistic response shapes without touching real data.

For example, your production endpoint might be:

DELETE https://api.example.com/users/123

During agent development, use:

DELETE https://mock.apidog.com/m1/your-project-id/users/123

The agent receives a valid response, but your production database remains untouched.

This fits the broader contract-first development workflow.
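One low-friction way to wire this up is to keep the host out of the tool definitions entirely and read it from the environment in the tool executor. A minimal sketch, assuming a single HTTP helper (the call_api name and the API_BASE_URL default are illustrative):

import os
import requests

# Point API_BASE_URL at the Apidog mock during development,
# at staging later, and at production only when you mean it.
API_BASE_URL = os.environ.get("API_BASE_URL", "https://mock.apidog.com/m1/your-project-id")

def call_api(method: str, path: str, **kwargs):
    """All agent tool calls go through this single HTTP helper."""
    response = requests.request(method, f"{API_BASE_URL}{path}", timeout=30, **kwargs)
    response.raise_for_status()
    return response.json()

Because the tool layer never hardcodes a host, switching from mock to staging to production is an environment change, not a prompt or schema change.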

3. Idempotency keys and soft deletes

Every write endpoint an agent can call should accept an idempotency key.

Every delete operation should default to soft delete.

Express idempotency middleware

const express = require("express");

const app = express();
app.use(express.json());

// In-memory cache for illustration only; use Redis or a database in production
// so keys survive restarts and work across multiple instances.
const idempotencyCache = new Map();

function idempotency(req, res, next) {
  const key = req.headers["idempotency-key"];

  if (!key) {
    return res.status(400).json({
      error: "Missing Idempotency-Key header"
    });
  }

  if (idempotencyCache.has(key)) {
    const cached = idempotencyCache.get(key);
    return res.status(cached.status).json(cached.body);
  }

  const originalJson = res.json.bind(res);

  res.json = function (body) {
    idempotencyCache.set(key, {
      status: res.statusCode,
      body
    });

    setTimeout(() => {
      idempotencyCache.delete(key);
    }, 24 * 60 * 60 * 1000);

    return originalJson(body);
  };

  next();
}

app.post("/payments", idempotency, createPayment);

The agent should generate one UUID per logical operation and reuse it on retries.

Example:

POST /payments
Idempotency-Key: refund-01HYX4R8G8B9M7R4N2M3KQZV6A
Content-Type: application/json

{
  "customer_id": "cus_123",
  "amount": 1900,
  "currency": "usd"
}

If the agent retries after a timeout, the API returns the cached response instead of creating a second payment.
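On the agent side, the key must be generated once per logical operation, not once per HTTP attempt. A minimal client sketch in Python, assuming the middleware above (the endpoint and key prefix are illustrative):

import uuid
import requests

def create_payment_with_retry(base_url: str, payload: dict, max_retries: int = 3):
    # One key per logical operation, reused across every retry attempt.
    idempotency_key = f"payment-{uuid.uuid4()}"

    for attempt in range(max_retries):
        try:
            response = requests.post(
                f"{base_url}/payments",
                json=payload,
                headers={"Idempotency-Key": idempotency_key},
                timeout=10,
            )
            response.raise_for_status()
            return response.json()
        except requests.Timeout:
            # Safe to retry: the server returns the cached response
            # if the first request actually went through.
            continue

    raise RuntimeError("Payment did not complete within the retry budget")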

Use the same pattern for:

  • Payment creation
  • Refunds
  • Email sends
  • CRM updates
  • Ticket creation
  • File uploads
  • Any non-idempotent write

For deletes, prefer:

PATCH /users/{id}

{
  "deleted": true,
  "deleted_at": "2026-05-06T10:00:00Z"
}

Reserve hard deletes for human-approved paths.

4. Per-agent budget caps

Every agent needs hard ceilings.

Track budgets by agent ID, session ID, user ID, or task ID.

Useful limits include:

  • Tokens per session
  • API calls per minute
  • API calls per task
  • Dollar spend per task
  • Runtime duration
  • Tool-call depth
  • Retry count

Example policy:

{
  "agent_id": "support-triage-agent",
  "limits": {
    "tokens_per_session": 50000,
    "api_calls_per_minute": 30,
    "max_spend_cents_per_task": 500,
    "max_tool_call_depth": 10,
    "max_retries_per_operation": 3
  }
}

When a cap is hit, return a structured 429:

HTTP/1.1 429 Too Many Requests
Retry-After: 60
X-Budget-Exceeded: api_calls_per_minute
Content-Type: application/json

{
  "error": "Budget exceeded",
  "budget": "api_calls_per_minute",
  "limit": 30,
  "retry_after_seconds": 60,
  "action": "escalate_to_human"
}

The agent planner can then stop, retry later, or escalate.

Do not rely on monitoring alone. Monitoring tells you damage happened. Budget caps stop the loop while it is happening.
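A minimal in-process sketch of that enforcement, using the limits from the example policy above (a real deployment would track these counters in Redis or at the API gateway rather than in memory):

import time
from collections import defaultdict, deque

# Illustrative limits matching the example policy above.
LIMITS = {"api_calls_per_minute": 30}

call_log: dict[str, deque] = defaultdict(deque)  # agent_id -> timestamps of recent calls

def check_call_budget(agent_id: str) -> dict | None:
    """Return a structured 429 body if the per-minute cap is hit, else None."""
    now = time.monotonic()
    window = call_log[agent_id]

    # Drop calls older than the 60-second window.
    while window and now - window[0] > 60:
        window.popleft()

    if len(window) >= LIMITS["api_calls_per_minute"]:
        return {
            "error": "Budget exceeded",
            "budget": "api_calls_per_minute",
            "limit": LIMITS["api_calls_per_minute"],
            "retry_after_seconds": 60,
            "action": "escalate_to_human",
        }

    window.append(now)
    return None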

Test agent API calls with Apidog

Here is a practical workflow for testing agent API integrations with Apidog.

You need:

  • Your OpenAPI 3.x spec
  • The agent’s tool definitions
  • A list of high-risk endpoints
  • Example tasks the agent performs

Step 1: Import the OpenAPI spec

Create a new Apidog project and import your OpenAPI file.

Apidog parses:

  • Paths
  • Methods
  • Request schemas
  • Response schemas
  • Examples
  • Auth configuration

If your API is not documented in OpenAPI yet, start there. Agent safety depends on having one contract that humans, tests, and agents all share.

The design-first API workflow guide covers this process if you are starting from scratch.

Step 2: Mock destructive endpoints

Find every endpoint that mutates data:

POST /payments
POST /refunds
PATCH /users/{id}
DELETE /users/{id}
POST /notifications
PUT /billing/subscription

For each endpoint:

  1. Open it in Apidog.
  2. Add a mock response.
  3. Use the same response shape as production.
  4. Override values so they are obviously fake.
  5. Start the mock server.
  6. Point the agent’s base URL to the mock URL.

Use test-looking values:

{
  "id": "mock_user_1970_001",
  "email": "mock-user@example.test",
  "status": "deleted",
  "deleted_at": "1970-01-01T00:00:00Z"
}

Avoid mock data that looks like real customer data. If it leaks into logs, dashboards, or screenshots, it should be obvious that it is fake.

Step 3: Replay the agent’s call sequence as a scenario

Apidog scenarios let you chain API calls with assertions.

For a support-ticket triage agent, a scenario might be:

  1. POST /auth/token

    • Use test credentials.
    • Capture the bearer token.
  2. GET /tickets?status=open

    • Pass the token.
    • Capture the first ticket ID.
  3. POST /tickets/{id}/triage

    • Send a category.
    • Assert 200.
    • Capture assigned_to.
  4. POST /notifications

    • Send a templated message.
    • Assert the response body matches a regex.

Example assertion targets:

{
  "status": 200,
  "body.assigned_to": "support_l2",
  "body.category": "billing"
}

You are rehearsing the agent’s API behavior before the model gets production access.

If a developer changes the ticket schema and the scenario fails, you catch the issue before the agent does.

See API testing for QA engineers for a broader scenario-testing workflow.

Step 4: Run scenarios from CI

Add scenario runs to your PR pipeline.

Example command:

apidog run -t scenario-id --env test

Run it when these files change:

on:
  pull_request:
    paths:
      - "openapi.yaml"
      - "agent-tools/**"
      - "src/routes/**"
      - "src/controllers/**"

Example GitHub Actions step:

jobs:
  api-agent-tests:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - name: Run Apidog scenario
        run: apidog run -t ${{ secrets.APIDOG_SCENARIO_ID }} --env test

The goal is simple: every API or tool-definition change replays the same baseline agent scenarios.

Step 5: Compare model versions safely

When upgrading models, test tool-call behavior before production.

Run the same task twice:

  • Model A against the Apidog mock server
  • Model B against the same mock server

Capture and diff:

  • Endpoint paths
  • HTTP methods
  • Request bodies
  • Header values
  • Date formats
  • Enum values
  • Missing fields
  • Retry counts

Example drift:

{
- "priority": "medium",
+ "priority": "urgent"
}

or:

{
- "due_date": "2026-05-06",
+ "due_date": "05/06/2026"
}

This catches behavior changes before they become production data changes.

This pattern also matters when evaluating newer model APIs, as discussed in GPT-5.5 API integration.

Advanced techniques and pro tips

Pin temperature to zero in tests

When testing tool-call behavior, remove unnecessary randomness.

Use:

{
  "temperature": 0
}

You are testing API behavior and tool selection, not creativity.

Snapshot tool-call traces

Record every tool call:

[
  {
    "tool": "get_open_tickets",
    "arguments": {
      "status": "open"
    }
  },
  {
    "tool": "triage_ticket",
    "arguments": {
      "ticket_id": "ticket_123",
      "category": "billing"
    }
  }
]

Diff traces across runs.

If an agent suddenly calls /users twice or starts passing a new field, fail the test or flag it for review.
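The snapshot comparison can be as simple as diffing the recorded trace against a committed baseline. A minimal sketch, assuming the trace file names shown here:

import json
import sys

with open("traces/baseline.json") as f:
    baseline = json.load(f)

with open("traces/current.json") as f:
    current = json.load(f)

# Compare the sequence of (tool, arguments) pairs, not free-text model output.
baseline_calls = [(c["tool"], json.dumps(c["arguments"], sort_keys=True)) for c in baseline]
current_calls = [(c["tool"], json.dumps(c["arguments"], sort_keys=True)) for c in current]

if baseline_calls != current_calls:
    print("Tool-call trace changed:")
    for old, new in zip(baseline_calls, current_calls):
        if old != new:
            print(f"- {old}\n+ {new}")
    if len(baseline_calls) != len(current_calls):
        print(f"Call count changed: {len(baseline_calls)} -> {len(current_calls)}")
    sys.exit(1)

The same diff works when comparing tool-call behavior across model versions in Step 5.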

Never give an agent broad production credentials

Use scoped service accounts.

Bad:

PROD_ADMIN_API_KEY=...

Better:

SUPPORT_TRIAGE_AGENT_READ_KEY=...
SUPPORT_TRIAGE_AGENT_WRITE_KEY=...

Best:

  • Short-lived tokens
  • Scoped permissions
  • Proxy-signed requests
  • No direct database access
  • No secrets in files the agent can read

Separate read and write API keys

Most agent tasks are read-heavy.

Issue read-only keys by default. Require separate approval for write-capable keys.

Example:

GET /tickets
Authorization: Bearer agent_read_token

For writes:

POST /tickets/{id}/triage
Authorization: Bearer agent_write_token
Idempotency-Key: triage-01HYX4...

This reduces blast radius if an agent is compromised or prompt-injected.

Use HTTP 423 for human-approval gates

For operations that require human confirmation, return 423 Locked instead of 403 Forbidden.

Example:

HTTP/1.1 423 Locked
Content-Type: application/json

{
  "error": "Human approval required",
  "confirmation_url": "https://app.example.com/approvals/abc123",
  "expires_at": "2026-05-06T12:00:00Z"
}

403 means “you cannot do this.”

423 means “you cannot do this yet.”

That distinction is useful for agent planners.
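On the agent side, the tool executor can branch on that status code. A minimal sketch (the return shape is an assumption about how your planner consumes tool results):

import requests

def execute_tool_call(method: str, url: str, **kwargs) -> dict:
    response = requests.request(method, url, timeout=30, **kwargs)

    if response.status_code == 423:
        body = response.json()
        # Not a dead end: pause the task and surface the approval link to a human.
        return {
            "status": "needs_human_approval",
            "confirmation_url": body.get("confirmation_url"),
            "expires_at": body.get("expires_at"),
        }

    if response.status_code == 403:
        # A hard authorization boundary: the planner should abandon this path.
        return {"status": "forbidden"}

    response.raise_for_status()
    return {"status": "ok", "body": response.json()}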

Fail closed on schema drift

If the agent tool definition does not match OpenAPI, fail CI.

Do not allow:

Warning: schema mismatch detected

Require:

Error: schema mismatch detected
Build failed

The cost of a failed build is lower than the cost of a production incident.

Common mistakes to avoid

Avoid these patterns:

  • Hardcoding mock URLs into prompts
  • Skipping idempotency on “small” write endpoints
  • Logging full request bodies with PII
  • Giving agents direct database access
  • Letting agents use production admin credentials
  • Treating model confidence as API safety
  • Relying only on prompt instructions for authorization
  • Running destructive test cases against production
  • Ignoring retry behavior in tests

If your agent talks to multiple internal services, use the patterns in microservices testing to fan out scenario tests across services.

Alternatives and tooling

You have several ways to test agent/API integrations.

| Approach | Setup time | Strength | Weakness | Best for |
| --- | --- | --- | --- | --- |
| Handcrafted unit tests | Low | Full control, no vendor lock-in | High maintenance, easy to drift from the real API | Small projects and single-developer teams |
| LangSmith / LangGraph eval harness | Medium | Trace replay and model-aware metrics | Strong on the agent layer, lighter on the API layer | Eval-heavy AI teams |
| Postman + Postbot | Medium | Familiar UI and large template library | Mock server is a paid add-on; scenario syntax can feel dated | Teams already invested in Postman |
| Apidog scenarios + mocks | Medium | OpenAPI-native import, mocks, and scenario CLI for CI | Less brand recognition than Postman | Teams that want design, mocks, and tests in one API project |

If you already use LangSmith, keep it for prompt and agent-level evals, then add API-side tests separately.

If you have outgrown Postman’s pricing or mock model, Apidog is a strong replacement.

Many teams pair tools:

  • LangSmith for prompt-level and trace evals
  • Apidog for API contracts, mocks, and scenario replays

That split works because the tools cover different layers.

Real-world use cases

Agent updates production database rows

A customer-success team built an agent that updates account fields from support tickets.

Before launch, they:

  • Required idempotency keys on every write endpoint
  • Used a sandbox database
  • Replayed 200 Apidog scenarios
  • Validated enum values against the OpenAPI schema

The tests caught two cases where the agent tried to set:

{
  "subscription_status": "paused_by_customer"
}

But the API allowed only:

["active", "paused", "cancelled"]

They added schema validation before launch.
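The minimal contract test earlier only compares types; catching this class of bug also means comparing enum values. A small extension in the same style (a sketch, not a complete validator):

def validate_enums(tool_schema: dict, api_schema: dict) -> list[str]:
    """Flag tool enum values the API schema does not accept."""
    errors = []

    for prop, api_def in api_schema.get("properties", {}).items():
        tool_prop = tool_schema.get("properties", {}).get(prop)
        if tool_prop is None or "enum" not in api_def:
            continue

        allowed = set(api_def["enum"])
        declared = set(tool_prop.get("enum", []))

        for value in declared - allowed:
            errors.append(f"{prop}: tool allows '{value}', API allows {sorted(allowed)}")

    return errors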

Agent calls a payments API

A fintech team built an automated refund agent.

Their controls:

  • Max 5 refunds per session
  • Max $50 per refund
  • Idempotency required on every refund
  • Contract tests against the payment API schema
  • Human approval for higher-value refunds

They ran the contract test suite on every PR.

The important pattern is not the specific limits. It is that the agent could not exceed them silently.

Agent triages GitHub issues

A platform team built an issue-triage agent inspired by Clawsweeper.

Before launch, they mocked the GitHub API in Apidog and tested cases such as:

  • Deleted issues
  • Missing labels
  • Malformed user input
  • Closed issues
  • Permission failures
  • Rate limits

They found crashes before the agent touched the live repository.

Implementation checklist

Use this before shipping an agent that calls APIs.

API contract

  • [ ] OpenAPI spec exists and is current
  • [ ] Agent tool definitions map to OpenAPI operations
  • [ ] CI fails on schema drift
  • [ ] Required fields match
  • [ ] Types and enums match
  • [ ] Request and response examples exist

Mocking and sandboxing

  • [ ] Destructive endpoints have mock responses
  • [ ] Agent development points to mock base URL
  • [ ] Staging uses sandbox data
  • [ ] Production endpoints are not used in test loops
  • [ ] Mock data is clearly fake

Write safety

  • [ ] Every write endpoint requires an idempotency key
  • [ ] Deletes are soft deletes by default
  • [ ] Hard deletes require human approval
  • [ ] Retry behavior is tested
  • [ ] Duplicate writes are prevented

Budgets and limits

  • [ ] API calls per minute are capped
  • [ ] Token usage per session is capped
  • [ ] Spend per task is capped
  • [ ] Retry count is capped
  • [ ] Tool-call depth is capped
  • [ ] Budget failures return structured 429 responses

Credentials

  • [ ] Agents use scoped service accounts
  • [ ] Read and write keys are separate
  • [ ] Production admin keys are never exposed to agents
  • [ ] Tokens are short-lived where possible
  • [ ] Secrets are stored in a vault

CI

  • [ ] Contract tests run on PRs
  • [ ] Apidog scenarios run on PRs
  • [ ] Tool-call traces are snapshotted
  • [ ] Model upgrades are tested against the same scenarios
  • [ ] Failures block deploys

Conclusion

The agent alone is rarely the whole problem. The API layer is either the failure point or the safety layer.

Five takeaways:

  • Treat tool schemas as contracts and test them in CI.
  • Mock destructive endpoints during agent development.
  • Require idempotency keys on every write endpoint.
  • Set per-agent budget caps that fail closed.
  • Replay scenarios on every API or tool-definition change.

Start with the mock-server step. It gives agents a safe place to make mistakes.

Download Apidog if you want to set up OpenAPI-based mocks and scenario tests. For the QA-team perspective, see API testing tools for QA engineers. For broader context on writing safer agent instructions, see how to write AGENTS.md files.

FAQ

How do I test AI agent API calls without spending money on tokens?

Use a mock server during development.

Point the agent’s API base URL at the mock instead of the real service. Apidog mock URLs return realistic responses from your OpenAPI schema, so you can test API behavior without hitting production endpoints.

Also:

  • Use a fixed prompt set
  • Set temperature to 0
  • Limit retries
  • Snapshot tool-call traces

See the QA engineer’s testing checklist for a fuller setup.

What is the difference between testing the agent and testing the API?

Agent testing checks whether the model chooses the right tool and fills arguments correctly.

API testing checks whether the endpoint behaves safely and correctly when called.

You need both.

A good agent calling a broken API still creates bad outcomes. A broken agent calling a safe API should fail closed.

Do I need idempotency keys on every endpoint?

Use idempotency keys on every write endpoint.

Reads are idempotent by default. Writes are not.

Agents retry after timeouts, 500 responses, tool failures, and ambiguous planner states. Idempotency prevents those retries from creating duplicate side effects.

How do I prevent prompt injection from triggering bad API calls?

Do not rely only on prompt instructions.

Enforce authorization at the API layer based on the original user context.

If the user cannot normally call:

DELETE /admin/delete-all-users

then an agent acting on behalf of that user should not be able to call it either.

Prompt injection should hit an authorization boundary, not a production database.

Can I use Apidog with Claude or GPT directly?

Use Apidog’s mock URL as the API base URL in your agent configuration or tool layer.

For example:

API_BASE_URL=https://mock.apidog.com/m1/your-project-id

When moving from mock to staging, change the environment variable:

API_BASE_URL=https://staging-api.example.com

Keep the tool definitions and scenarios the same.

What is the right budget cap for an agent?

Start strict, then loosen with data.

A reasonable starting policy:

  • 50,000 tokens per session
  • 30 API calls per minute
  • $5 per task
  • 10 nested tool calls
  • 3 retries per operation

Review logs for two weeks. Raise limits that legitimate tasks hit. Lower limits that are never needed.

The goal is not a universal number. The goal is a ceiling tight enough to stop runaway loops.

How do I detect schema drift between agent tools and my API?

Run a schema diff in CI.

Compare:

  • Agent tool JSON schema
  • OpenAPI request schema
  • Required fields
  • Types
  • Enums
  • Formats

Fail the build if they diverge.

The Python snippet in the contract-test section is enough to get started.
