You shipped that fancy LLM-powered feature to production. It worked fine in your notebook. Users reported it returns gibberish sometimes, breaks on edge cases, and occasionally hallucinates API responses. Classic.
Here's the thing: everyone treats AI like magic. It feels magical because it works sometimes. So they skip testing. But AI features are still code, and code needs tests.
The Problem With "It Works for Me"
When you test regular code, you know what happens:
- Function called with input X → output Y, every time
- Side effects are predictable
- You can verify correctness
With LLM features:
- Same input → different output every time (temperature, randomness)
- Hallucinations happen silently
- The model might ignore your instructions if it "decides" to
- Failures are weird and inconsistent
Testing "normal" features works because computers are deterministic. LLMs aren't. So your testing strategy has to change.
What You Actually Need to Test
1. Deterministic inputs → deterministic outputs
Yeah, the LLM is random. But the wrapper around it isn't. Test that:
- Your prompt construction works
- You're passing the right variables
- Token limits don't cut off critical parts
- Temperature/sampling settings actually apply
def build_summary_prompt(text, max_length):
return f"Summarize this in {max_length} words:\n\n{text}"
# Test this, not the LLM
def test_prompt_building():
prompt = build_summary_prompt("hello world", 10)
assert "hello world" in prompt
assert "10 words" in prompt
2. Output parsing works
The LLM returns text. You parse it. That parsing will break.
# Bad: hope it returns valid JSON
result = llm.generate(prompt)
data = json.loads(result) # crashes sometimes
# Good: handle failure gracefully
try:
data = json.loads(result)
except json.JSONDecodeError:
# fallback, log, alert
data = {"error": "failed_to_parse"}
3. Boundary cases
- Empty input → what happens?
- Very long input → does it truncate sensibly?
- Special characters, code samples, URLs → does the prompt break?
- Languages other than English → does it handle them or fail silently?
Test these before users hit them in production.
4. Cost isn't infinite
Tokens cost money. If your feature can waste them:
- Set max token limits in your API calls
- Monitor token usage in production
- Test that you're not accidentally re-prompting 100 times for one request
response = client.chat.completions.create(
model="gpt-4",
messages=messages,
max_tokens=500, # don't let it run forever
timeout=10 # and don't wait forever
)
Actually Testing This
You need three layers:
Layer 1: Mocking
Replace the LLM with a fake that returns known outputs. Test your code's logic, not the model.
from unittest.mock import patch
@patch('openai.ChatCompletion.create')
def test_summary_generation(mock_llm):
mock_llm.return_value = {"choices": [{"message": {"content": "Fake summary"}}]}
result = generate_summary("input text")
assert result == "Fake summary"
mock_llm.assert_called_once()
Layer 2: Integration tests with real API (cheap models)
Hit the real API occasionally with a cheap model (or a local one). Verify your prompt actually works with a real LLM.
Layer 3: Staging environment
Before production: run your feature with real traffic patterns on a cheap model. See what breaks.
The Dev.to Reality Check
You're probably thinking: "I don't have time to test this."
Fair. But shipping untested AI code costs you more time:
- Users report weird bugs
- You spend hours debugging hallucinations
- You ship hotfixes at 2 AM
- You lose trust
Spend 30 minutes writing tests. Save yourself 8 hours of production chaos.
Quick Checklist
- [ ] Prompt construction tested (variables inserted correctly)
- [ ] Output parsing handles failure (try/except exists)
- [ ] Boundary cases tested (empty, long, special chars, etc.)
- [ ] Token limits set (won't accidentally run forever)
- [ ] Mock tests exist (test your code, not the LLM)
- [ ] Cost monitored (not burning tokens in a loop)
- [ ] Fallback behavior defined (what if LLM fails?)
That's it. You don't need fancy. You need real.
One More Thing
Join LearnAI Weekly if you want practical AI dev tips delivered to your inbox. Not "AI is disrupting everything." Real stuff: tools, patterns, gotchas, and how to actually build with this stuff.
Now go test your features. Your users will thank you.
Have you shipped an AI feature that broke in production? What did you miss? Drop it in the comments—let's learn from each other's mistakes.
Top comments (0)