How to Test LLM-Powered Applications Effectively
Testing a CRUD app is deterministic. You input X, you expect Y, you assert equality. Testing an LLM-powered application is different in a way that breaks most of your existing instincts.
The model's output is probabilistic. The same prompt can return different phrasing across runs. "Correct" is often subjective. Traditional assertEqual doesn't work here.
Here's how to think about testing LLM apps properly.
The Three Layers of an LLM App
Before writing a single test, map out what you're actually testing:
```
[ User Input ]
      ↓
[ Prompt Construction ]   ← Layer 1: Deterministic. Testable normally.
      ↓
[ LLM API Call ]          ← Layer 2: Non-deterministic. Mock in unit tests.
      ↓
[ Output Parsing ]        ← Layer 3: Deterministic. Testable normally.
      ↓
[ App Response ]
```
Most bugs aren't in the LLM — they're in layers 1 and 3. Start there.
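What does "mock in unit tests" mean for Layer 2 in practice? A minimal sketch, assuming a hypothetical `answer_question()` entry point and an `llm_client` wrapper object (the same illustrative name the tests later in this post use). The point is that everything around the model call runs for real while the call itself is canned:

```python
from unittest.mock import MagicMock


def answer_question(user_query: str, llm_client) -> str:
    # Hypothetical app entry point: build the prompt, call the model, clean up the output.
    prompt = f"You are a helpful assistant.\nUser: {user_query}"
    return llm_client.ask(prompt).strip()


def test_answer_question_with_mocked_llm():
    # Layer 2 stubbed out: the test never hits the network and is fully deterministic.
    llm_client = MagicMock()
    llm_client.ask.return_value = "  Refunds are accepted within 30 days.  "

    result = answer_question("What is the refund policy?", llm_client)

    assert result == "Refunds are accepted within 30 days."
    # You can also assert on which prompt reached the model (Layer 1 behavior).
    llm_client.ask.assert_called_once()
```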
Layer 1: Test Prompt Construction
Your prompt builder is plain code. Test it like code.
```python
def build_prompt(user_query: str, context: str) -> str:
    return f"""You are a helpful assistant.
Context: {context}
User: {user_query}
Answer concisely."""


def test_prompt_includes_context():
    prompt = build_prompt("What is the policy?", "Refund window is 30 days.")
    assert "Refund window is 30 days." in prompt


def test_prompt_has_system_instruction():
    prompt = build_prompt("Hi", "")
    assert "You are a helpful assistant" in prompt
```
These are fast, free, and catch the majority of regressions.
Layer 3: Test Output Parsing
If your app parses structured data from LLM output, test the parser independently with canned responses:
```python
import json
import re

import pytest


def parse_llm_json_response(raw: str) -> dict:
    # Grab the first {...} block even if the model wrapped it in prose.
    match = re.search(r'\{.*\}', raw, re.DOTALL)
    if not match:
        raise ValueError("No JSON found in response")
    return json.loads(match.group())


def test_parser_extracts_json():
    raw = 'Here is the result: {"score": 8, "reason": "Clear"}'
    result = parse_llm_json_response(raw)
    assert result["score"] == 8


def test_parser_raises_on_no_json():
    with pytest.raises(ValueError):
        parse_llm_json_response("Sorry, I cannot help with that.")
```
Layer 2: Evaluating LLM Output Quality
For actual model output, shift from assertion-based testing to evaluation-based testing. Three practical approaches:
1. Rubric Scoring (LLM-as-Judge)
```python
def evaluate_response(question, answer, criteria):
    eval_prompt = f"""
Rate the following answer on a scale of 1-5 for each criterion.
Question: {question}
Answer: {answer}
Criteria: {criteria}
Return JSON: {{"score": int, "reason": str}}
"""
    # Call your LLM here and parse the response
    ...
```
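One way to finish that function, as a sketch rather than a prescription: thread through the same hypothetical `llm_client` wrapper used elsewhere in this post, reuse `build_prompt` from Layer 1 and `parse_llm_json_response` from Layer 3, and assert on the rubric score:

```python
def evaluate_response(question, answer, criteria, llm_client):
    eval_prompt = f"""
Rate the following answer on a scale of 1-5 for each criterion.
Question: {question}
Answer: {answer}
Criteria: {criteria}
Return JSON: {{"score": int, "reason": str}}
"""
    # llm_client.ask() is a stand-in for your own wrapper around the model API.
    raw = llm_client.ask(eval_prompt)
    # Reusing the Layer 3 parser means malformed judge output fails loudly.
    return parse_llm_json_response(raw)


def test_summary_meets_rubric(llm_client):
    question = "Summarize the refund policy in two sentences."
    answer = llm_client.ask(build_prompt(question, "Refund window is 30 days."))
    result = evaluate_response(
        question, answer, "Accuracy, brevity, faithfulness to context", llm_client
    )
    assert result["score"] >= 4, result["reason"]
```

Pin the judge model and its prompt so that score drift comes from your app, not from the evaluator.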
2. Semantic Similarity (for factual tasks)
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')


def is_semantically_similar(expected: str, actual: str, threshold: float = 0.85) -> bool:
    emb1 = model.encode(expected, convert_to_tensor=True)
    emb2 = model.encode(actual, convert_to_tensor=True)
    score = util.cos_sim(emb1, emb2).item()
    return score >= threshold
```
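A usage sketch, again assuming the hypothetical `llm_client` fixture: compare the live answer against a known-good reference and treat the threshold as a tunable knob, not a hard truth:

```python
def test_refund_answer_is_factually_close(llm_client):
    expected = "Customers can request a refund within 30 days of purchase."
    actual = llm_client.ask("What is the refund policy?")
    # 0.85 is only a starting point; calibrate it against hand-labeled examples.
    assert is_semantically_similar(expected, actual)
```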
3. Behavioral Testing (what should never happen)
```python
FORBIDDEN_PHRASES = ["I cannot", "As an AI", "I don't have access"]


def test_no_refusals_on_valid_queries(llm_client):
    response = llm_client.ask("What is the return policy?")
    for phrase in FORBIDDEN_PHRASES:
        assert phrase not in response, f"Got refusal: {phrase}"
```
Testing for Regressions: Golden Datasets
Build a golden dataset — a curated set of input/expected-output pairs — and run evaluations on every model or prompt change:
| Input | Min Score | Pass? |
|---|---|---|
| "Summarize this in 3 points" | 4/5 | ✅ |
| "Translate to French" | 4/5 | ✅ |
| "What's 2+2?" (sanity check) | 5/5 | ✅ |
This won't catch everything, but it will catch regressions — which is the main goal.
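A minimal sketch of that regression loop with pytest, assuming a hypothetical `golden_dataset.json` file of cases and the completed `evaluate_response()` judge sketched above:

```python
import json

import pytest

# Hypothetical file of {"input": ..., "criteria": ..., "min_score": ...} entries.
with open("golden_dataset.json") as f:
    GOLDEN_CASES = json.load(f)


@pytest.mark.parametrize("case", GOLDEN_CASES, ids=lambda c: c["input"][:40])
def test_golden_dataset(case, llm_client):
    answer = llm_client.ask(build_prompt(case["input"], context=""))
    result = evaluate_response(case["input"], answer, case["criteria"], llm_client)
    assert result["score"] >= case["min_score"], result["reason"]
```

Run this suite on every prompt or model change; a handful of failing golden cases is a much cheaper signal than a production incident.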
Tools Worth Knowing
- Promptfoo — open-source LLM eval framework, define test cases in YAML
- LangSmith — tracing + eval if you're on LangChain
- DeepEval — pytest-style assertions for LLM metrics
Written by Yash | Senior SDET cross-validating UI, API, DB, and test infrastructure to catch the failures other layers miss.