TL;DR: Your AI will invent API fields that sound perfect but don't exist. The solution isn't hoping it gets it right: download the real schema before writing code, capture real responses as fixtures, and separate fetch from processing so you can test without the network. Adversarial programming: code assuming your copilot lies.
Have you ever written code against an API where everything compiled, tests passed, the logic made sense... and when you connected to the real API, nothing worked?
Programming alone, this happens when you misread the documentation. Programming with AI, this happens because the AI invented the documentation.
## The hallucination that doesn't look like one
I was building a Rust CLI to interact with a GraphQL API. I asked my AI copilot to implement a filter for sorting results by priority. It returned something like this:
```graphql
query {
  issues(orderBy: { priority: ASC }) {
    nodes {
      id
      title
      priority
    }
  }
}
```
Clean. Reasonable. Exactly what you'd expect. One problem: that API's `orderBy` field doesn't accept `priority` as a value. The real enum is called `PaginatedOrder` and has values like `createdAt` and `updatedAt`. No `priority`.
How did I find out? When the API returned a 400 error that made no sense. It took me 20 minutes to understand the problem wasn't my code—it was that the field I was using didn't exist.
## The pattern is always the same
This wasn't an isolated case. Over weeks of development, the AI repeatedly hallucinated:
- **Non-existent filter fields** — `state.id.or` instead of `state.type.in`. Sounds logical, but the API uses a completely different pattern.
- **Invented enums** — value names that sound like they should exist, but which the API never defined.
- **Patterns from other ecosystems** — in a Rust project, it suggested using `fcntl.flock` for file locking. That's Python. In Rust you use `fs2::FileExt`.
Every error was plausible. None were stupid. A junior could have made exactly the same mistakes skimming the documentation. And that's what's dangerous: they don't look like hallucinations. They look like reasonable code from someone who knows the domain "more or less."
## Why LLMs invent APIs
Put simply: the LLM doesn't know what fields your API has. It's seen thousands of GraphQL APIs in its training, and when you ask it to use one, it does what a human with good intuition but no access to documentation would do: it guesses.
And it guesses well. Almost always. Enough for you to trust it. That "almost" is what breaks your sprint.
It's like working with a brilliant teammate who never reads documentation but always has a convincing answer. They tell you "yes, the endpoint accepts a priority field" with such confidence that you don't check. And when it fails in production, you discover they made it up.
## The solution: adversarial programming
After the third hallucination in a week, I adopted a different approach. Instead of trusting and verifying later, I started distrusting and verifying first. I call it adversarial programming: code assuming your copilot will invent things.
It's not hostility. It's hygiene.
### 1. Schema introspection before writing code
If you're working with a GraphQL API, before asking your AI for anything, download the real schema:
```bash
# Download the complete API schema
curl -s https://api.example.com/graphql \
  -H "Authorization: token-here" \
  -H "Content-Type: application/json" \
  -d '{"query":"{ __schema { types { name fields { name type { name kind ofType { name } } } } } }"}' \
  > schema.json
```
Now you have the truth. When the AI tells you "use `orderBy: { priority: ASC }`", you can search the schema and see that `priority` isn't in the ordering enum. You pass it the relevant schema fragment and say "use only these fields." No more guessing games.
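As a sketch of what that check can look like: the snippet below searches a downloaded schema for an enum and verifies a suggested value actually exists. The schema shape here is simplified (real introspection output lives under `data.__schema.types` and would need `enumValues` requested in the query); the enum name mirrors the article's example.

```python
import json

# Simplified stand-in for schema.json — real introspection output is far
# larger and nested under data.__schema.types.
schema = json.loads("""
{
  "types": [
    {"name": "PaginatedOrder", "kind": "ENUM",
     "enumValues": [{"name": "createdAt"}, {"name": "updatedAt"}]}
  ]
}
""")

def enum_values(schema: dict, enum_name: str) -> set[str]:
    """Collect the allowed values of a named enum from the schema dump."""
    for t in schema["types"]:
        if t["name"] == enum_name and t["kind"] == "ENUM":
            return {v["name"] for v in t["enumValues"]}
    raise KeyError(f"enum {enum_name!r} not found in schema")

# The AI suggested ordering by `priority`; the schema says otherwise.
allowed = enum_values(schema, "PaginatedOrder")
print("priority" in allowed)   # False — the value was hallucinated
print("createdAt" in allowed)  # True
```

Ten lines of verification against the real schema beats twenty minutes of debugging a 400 error.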
For REST APIs, the equivalent is downloading the OpenAPI spec. For any API, the principle is the same: get the source of truth before writing a line of code.
### 2. Real fixtures, not invented ones
The second defense is capturing real responses from the API and saving them as test fixtures:
```bash
# Capture a real response
curl -s https://api.example.com/graphql \
  -H "Authorization: token-here" \
  -d '{"query":"{ items(first: 5) { nodes { id title state { name } } } }"}' \
  > tests/fixtures/items_real.json
```
That JSON wasn't generated by any LLM. It comes from the real API. It has the real fields, with the real types, with the real values. When you write a parser, test it against that fixture. If your DTO can't deserialize the real response, the test fails. End of fiction.
The trick is discipline: never let the AI generate fixtures. If it does, you're testing invention against invention. A perfect house of cards that collapses when it touches reality.
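A minimal sketch of the pattern: a parser tested against a captured fixture. The fixture content below is an illustrative stand-in shaped like the query above; in practice it comes from `curl`, never from hand (or AI) writing.

```python
import json

# Stand-in for tests/fixtures/items_real.json — in a real project this is
# read from the file curl produced, never generated by the AI.
fixture = json.loads("""
{"data": {"items": {"nodes": [
  {"id": "a1", "title": "Fix parser", "state": {"name": "Backlog"}}
]}}}
""")

def parse_items(payload: dict) -> list[dict]:
    """Flatten the GraphQL envelope into plain item records."""
    nodes = payload["data"]["items"]["nodes"]
    return [
        {"id": n["id"], "title": n["title"], "state": n["state"]["name"]}
        for n in nodes
    ]

items = parse_items(fixture)
print(items[0])  # {'id': 'a1', 'title': 'Fix parser', 'state': 'Backlog'}
```

If the API's real response has a different shape, this test breaks immediately — which is exactly the point.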
### 3. Separate fetch from processing
This is the key architectural piece. If your code does fetch + parse + transformation all together, you can't test parsing without the network. And if you can't test without the network, you need mocks. And if the AI generates the mocks... we're back at point 2.
The solution is to separate into two layers:
```
┌─────────────────────┐
│   Client (fetch)    │ ← Talks to real API
│   Only HTTP + JSON  │
└────────┬────────────┘
         │ Raw JSON
┌────────▼────────────┐
│     Processor       │ ← Parses, transforms, formats
│     Only data       │
└─────────────────────┘
```
The client is thin: makes the HTTP request and returns raw JSON. The processor takes that JSON and transforms it. To test the processor, you feed it real fixtures. You don't need HTTP mocks. You don't need the network. And you don't give the AI a chance to invent the JSON shape, because you already have it.
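A sketch of the split, assuming a hypothetical endpoint and query (the function names and record shape are illustrative, not the article's actual CLI):

```python
import json
import urllib.request

API_URL = "https://api.example.com/graphql"  # placeholder endpoint

def fetch_items(query: str, token: str) -> dict:
    """Client layer: HTTP in, raw JSON out. Nothing else."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps({"query": query}).encode(),
        headers={"Authorization": token, "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def titles_by_state(payload: dict, state: str) -> list[str]:
    """Processor layer: a pure function of data, testable with a fixture."""
    nodes = payload["data"]["items"]["nodes"]
    return [n["title"] for n in nodes if n["state"]["name"] == state]

# In tests, skip fetch_items entirely and feed the processor a captured fixture:
fixture = {"data": {"items": {"nodes": [
    {"title": "Fix parser", "state": {"name": "Backlog"}},
    {"title": "Add batching", "state": {"name": "In Progress"}},
]}}}
print(titles_by_state(fixture, "Backlog"))  # ['Fix parser']
```

The processor never sees a socket, so its tests need neither mocks nor network.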
### 4. The adversarial checklist
Before accepting code that interacts with an external API, run this checklist:
| Question | If the answer is no... |
|---|---|
| Do I have the API schema/spec in the project? | Download it before continuing |
| Do the fields the code uses exist in the schema? | Look them up. If not, the AI invented them |
| Do the test fixtures come from the real API? | Capture them. Don't let the AI generate them |
| Can I test parsing without making HTTP calls? | Separate fetch from processing |
| Do the language types match the API types? | Compare your struct/DTO with the schema |
Five questions. Thirty seconds. Saves you hours of phantom debugging.
## Brutal dogfooding
Here's the part that hurts. While building this CLI, I suffered firsthand exactly the problems the tool was meant to solve.
The CLI existed to simplify interaction with a task management API. And during development, every time I needed to create a task to track a bug... I had to fight with the API I was wrapping. Dogfooding wasn't a decision—it was a sentence.
Concrete examples from hell:
- Broken JSON escaping. To put a description with quotes in a task, you had to escape at three levels: shell, JSON, GraphQL. One misplaced parenthesis and the API returned a cryptic error. It took longer to correctly escape a bug description than to fix the bug.
- UUIDs for relationships. Want to assign a task to a project? You can't use the project name. You need its UUID. And how do you get the UUID? With another query. And the label? Another UUID, another query. To create a task with project, label, and state, I needed 4 chained queries.
- 32 requests to create dependencies. I wanted to create 8 tasks with dependencies between them. Each dependency requires a separate mutation with both tasks' UUIDs. 8 creates + 24 relationships = 32 API calls for what should be a YAML file.
Each of these problems directly fed the tool's design. Broken escaping → file input, never inline. UUIDs → automatic resolution by name. 32 requests → batch operations.
## Convergent evolution of output
And then something curious happened. One of the design principles was that the output should be compact—designed for an LLM to consume using few tokens. I designed a format from scratch that eliminated repeated keys and relied on position and lightweight delimiters:
```
PROJ-42 [Backlog] backend — Refactor configuration parser (14d)
PROJ-43 [In Progress] api — Implement rate limiting (3d, overdue!)
```
About 25 tokens per item, versus ~50 for JSONL. No repeated keys (`"state":`, `"labels":`, `"title":`) because the LLM understands the structure by position.
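The formatter behind a line like that is tiny. A sketch, with an illustrative record shape (the field names `key`, `state`, `label`, `title`, `age_days`, `overdue` are assumptions, not the CLI's actual schema):

```python
# Position-based compact format: one line per item, no repeated keys.

def compact_line(item: dict) -> str:
    age = f"{item['age_days']}d"
    if item.get("overdue"):
        age += ", overdue!"
    return f"{item['key']} [{item['state']}] {item['label']} — {item['title']} ({age})"

item = {"key": "PROJ-42", "state": "Backlog", "label": "backend",
        "title": "Refactor configuration parser", "age_days": 14}
print(compact_line(item))
# PROJ-42 [Backlog] backend — Refactor configuration parser (14d)
```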
Months later, I discovered TOON (Token-Oriented Object Notation), a format published in November 2025 that does exactly the same thing: eliminate JSON redundancy to reduce token consumption when the consumer is an LLM. TOON uses schema headers and tabular rows—different syntax, same principle.
I didn't copy it. I didn't know about it. Convergent evolution: when two teams solve the same problem (JSON is too verbose for LLMs), they arrive at the same solution (eliminate repeated keys, use position). Same reason dolphins and sharks have the same shape even though one is a mammal and the other a fish.
Having two independent projects reach the same conclusion is the best validation that the problem is real.
## What changed
After adopting adversarial programming, the hallucination rate in API code dropped dramatically. Not to zero—AI is still AI—but the errors that survive are logic errors, not fiction errors. Normal errors. Programmer errors. Not "I invented a field that doesn't exist and built a castle on top" errors.
The key difference is when you discover the error:
| Without adversarial | With adversarial |
|---|---|
| Discover at runtime | Discover before writing code |
| 30-min debug searching "why doesn't it work" | Schema tells you "that field doesn't exist" |
| Invented fixture passes the test | Real fixture breaks the test |
| Three layers of fiction on top of fiction | One layer of reality from the start |
## Your turn
If you program with an LLM and touch external APIs, start with this:
- **Download the schema** of every external API you use. Put it in the repo. It's your source of truth.
- **Capture real fixtures.** A `curl` and a `> fixture.json`. Ten seconds.
- **Separate fetch from processing.** Make your parser testable with a file, not with the network.
- **Distrust plausible names.** If the AI tells you "use `orderBy.priority`", look it up in the schema before using it.
It's not paranoia. It's engineering. AI is an extraordinary tool, but its most dangerous failure mode isn't the obvious error—it's the error that looks correct. And against that, the only defense is reality.
Keep grinding.
Series: Adversarial Programming