The problem
We had a customer running our AI agent to generate tests for their backend services. Resolvers, DB query functions, API wrappers. The agent produced hundreds of test files. Every single one used pure mocks.
Their repo had test DB helpers sitting right there. The agent never used them.
Our prompt already said: "If the repo has test DB infrastructure, use it for data-access methods instead of mocking." One line, buried in a list of other instructions.
Why the model ignored it
The rest of the prompt was biased toward mocking. The <mocks> section had four rules about how to mock correctly. The <passthrough_methods> section offered an easy out: "If no test infrastructure exists, verify the mock was called with correct parameters." The model took the path of least resistance every time.
One instruction saying "do X" loses to ten instructions explaining how to do Y.
This is not a model failure. It is a prompt design failure. If your prompt has competing signals, the majority wins. The minority instruction might as well not exist.
The fix: two layers
Layer 1: Rebalance the prompt. We restructured the test standards around three explicit levels: solitary (mocked), sociable (real collaborators), and integration (real external services). The rule is simple: if you wrote a mock, you must also write a sociable or integration test to prove the real thing works. Mocking alone is not enough.
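Here is a minimal sketch of what that pairing looks like in Jest. The names (getUser, setupTestDb, insertUser) are hypothetical, not the customer's code:

const { getUser } = require('./resolvers');        // hypothetical resolver
const { setupTestDb } = require('./test-helpers');  // hypothetical repo helper

// Solitary: the DB collaborator is a hand-rolled mock.
test('getUser maps the DB row to the API shape (solitary)', async () => {
  const fakeDb = { findUser: jest.fn().mockResolvedValue({ id: '1', name: 'Ada' }) };
  await expect(getUser('1', fakeDb)).resolves.toEqual({ id: '1', name: 'Ada' });
});

// Sociable: same assertion against the real test DB. This is the
// test that proves the mocked contract above actually holds.
test('getUser reads a real row (sociable)', async () => {
  const db = await setupTestDb();
  await db.insertUser({ id: '1', name: 'Ada' });
  await expect(getUser('1', db)).resolves.toEqual({ id: '1', name: 'Ada' });
});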
Layer 2: Add a programmatic gate. We added an "integration" category to our quality checklist with three checks: db_operations_use_real_test_db, api_calls_tested_end_to_end, and env_var_guards_for_secrets. The quality evaluation now programmatically flags test files that mock DB operations without any integration tests.
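A simplified sketch of the first check. A real gate would parse the AST; this grep-level heuristic just shows the shape, and the patterns and helper names are illustrative:

const fs = require('fs');

// db_operations_use_real_test_db, roughly: fail any test file that
// mocks a DB module without also exercising a real test DB.
function dbOperationsUseRealTestDb(testFilePath) {
  const src = fs.readFileSync(testFilePath, 'utf8');
  const mocksDb = /jest\.mock\(['"][^'"]*(db|repo)/i.test(src);
  const usesRealDb = /setupTestDb|MONGODB_URI/.test(src);
  return !mocksDb || usesRealDb;
}

The point is not the regex. The point is that the check runs on every generated file and fails loudly, instead of hoping the prompt was followed.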
A prompt suggests. A gate enforces.
The pattern
This applies beyond testing. Any time you want an AI agent to reliably follow a specific behavior:
- Make the instruction prominent, not buried
- Remove competing signals that bias toward the opposite behavior
- Add programmatic verification that checks the output
In traditional software, we do not rely on comments saying "do not forget to validate input." We write validation middleware. The same principle applies to AI agents: treat important behaviors as invariants, not suggestions.
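For instance, a rough Express-style sketch (app, userSchema, and createUser are placeholders; schema.validate is a Joi-like API):

// The invariant lives in code: no handler runs on an invalid body.
function validateBody(schema) {
  return (req, res, next) => {
    const { error } = schema.validate(req.body);
    if (error) return res.status(400).json({ error: error.message });
    next();
  };
}

app.post('/users', validateBody(userSchema), createUser);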
Practical considerations
Integration tests that need secrets (DB connection strings, API keys) require env var guards so they skip when the secrets are unavailable:
const mongoUri = process.env.MONGODB_URI;
(mongoUri ? describe : describe.skip)('integration', () => {
  // tests that need real DB
});
The agent generates the test, skips it locally (no secrets), pushes the code, and CI runs it with real secrets. For paid APIs (Stripe, Twilio), use unconditional describe.skip to avoid costs, but keep the tests as documentation.
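For example (the test body is illustrative):

// Paid API: never runs automatically, but the spec stays in the repo
// as documentation of the expected integration behavior.
describe.skip('Stripe charge flow (paid API, run manually)', () => {
  test('creates a charge for a valid card token', async () => {
    // real Stripe call would go here
  });
});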
Prompts tell the model what to do. Gates make sure it actually did it.