How to Test LLM-Powered Applications Effectively
Testing a CRUD app is deterministic. You input X, you expect Y, you assert equality. Testing an LLM-powered application is different in a way that breaks most of your existing instincts.
The model's output is probabilistic. The same prompt can return different phrasing across runs. "Correct" is often subjective. Traditional assertEqual doesn't work here.
Here's how to think about testing LLM apps properly.
The Three Layers of an LLM App
Before writing a single test, map out what you're actually testing:
```
[ User Input ]
      ↓
[ Prompt Construction ]   ← Layer 1: Deterministic. Testable normally.
      ↓
[ LLM API Call ]          ← Layer 2: Non-deterministic. Mock in unit tests.
      ↓
[ Output Parsing ]        ← Layer 3: Deterministic. Testable normally.
      ↓
[ App Response ]
```
Most bugs aren't in the LLM — they're in layers 1 and 3. Start there.
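What does "mock in unit tests" mean for Layer 2 in practice? A minimal sketch, assuming a hypothetical `answer_question()` entry point and an `llm_client` wrapper object (the same illustrative name the tests later in this post use). The point is that everything around the model call runs for real while the call itself is canned:

```python
from unittest.mock import MagicMock


def answer_question(user_query: str, llm_client) -> str:
    # Hypothetical app entry point: build the prompt, call the model, clean up the output.
    prompt = f"You are a helpful assistant.\nUser: {user_query}"
    return llm_client.ask(prompt).strip()


def test_answer_question_with_mocked_llm():
    # Layer 2 stubbed out: the test never hits the network and is fully deterministic.
    llm_client = MagicMock()
    llm_client.ask.return_value = "  Refunds are accepted within 30 days.  "

    result = answer_question("What is the refund policy?", llm_client)

    assert result == "Refunds are accepted within 30 days."
    # You can also assert on which prompt reached the model (Layer 1 behavior).
    llm_client.ask.assert_called_once()
```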
Layer 1: Test Prompt Construction
Your prompt builder is plain code. Test it like code.
```python
def build_prompt(user_query: str, context: str) -> str:
    return f"""You are a helpful assistant.
Context: {context}
User: {user_query}
Answer concisely."""


def test_prompt_includes_context():
    prompt = build_prompt("What is the policy?", "Refund window is 30 days.")
    assert "Refund window is 30 days." in prompt


def test_prompt_has_system_instruction():
    prompt = build_prompt("Hi", "")
    assert "You are a helpful assistant" in prompt
```
These are fast, free, and catch the majority of regressions.
Layer 3: Test Output Parsing
If your app parses structured data from LLM output, test the parser independently with canned responses:
```python
import json
import re

import pytest


def parse_llm_json_response(raw: str) -> dict:
    # Grab the first {...} block even if the model wrapped it in prose.
    match = re.search(r'\{.*\}', raw, re.DOTALL)
    if not match:
        raise ValueError("No JSON found in response")
    return json.loads(match.group())


def test_parser_extracts_json():
    raw = 'Here is the result: {"score": 8, "reason": "Clear"}'
    result = parse_llm_json_response(raw)
    assert result["score"] == 8


def test_parser_raises_on_no_json():
    with pytest.raises(ValueError):
        parse_llm_json_response("Sorry, I cannot help with that.")
```
Layer 2: Evaluating LLM Output Quality
For actual model output, shift from assertion-based testing to evaluation-based testing. Three practical approaches:
1. Rubric Scoring (LLM-as-Judge)
```python
def evaluate_response(question, answer, criteria):
    eval_prompt = f"""
Rate the following answer on a scale of 1-5 for each criterion.
Question: {question}
Answer: {answer}
Criteria: {criteria}
Return JSON: {{"score": int, "reason": str}}
"""
    # Call your LLM here and parse the response
    ...
```
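One way to finish that function, as a sketch rather than a prescription: thread through the same hypothetical `llm_client` wrapper used elsewhere in this post, reuse `build_prompt` from Layer 1 and `parse_llm_json_response` from Layer 3, and assert on the rubric score:

```python
def evaluate_response(question, answer, criteria, llm_client):
    eval_prompt = f"""
Rate the following answer on a scale of 1-5 for each criterion.
Question: {question}
Answer: {answer}
Criteria: {criteria}
Return JSON: {{"score": int, "reason": str}}
"""
    # llm_client.ask() is a stand-in for your own wrapper around the model API.
    raw = llm_client.ask(eval_prompt)
    # Reusing the Layer 3 parser means malformed judge output fails loudly.
    return parse_llm_json_response(raw)


def test_summary_meets_rubric(llm_client):
    question = "Summarize the refund policy in two sentences."
    answer = llm_client.ask(build_prompt(question, "Refund window is 30 days."))
    result = evaluate_response(
        question, answer, "Accuracy, brevity, faithfulness to context", llm_client
    )
    assert result["score"] >= 4, result["reason"]
```

Pin the judge model and its prompt so that score drift comes from your app, not from the evaluator.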
2. Semantic Similarity (for factual tasks)
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')


def is_semantically_similar(expected: str, actual: str, threshold: float = 0.85) -> bool:
    emb1 = model.encode(expected, convert_to_tensor=True)
    emb2 = model.encode(actual, convert_to_tensor=True)
    score = util.cos_sim(emb1, emb2).item()
    return score >= threshold
```
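A usage sketch, again assuming the hypothetical `llm_client` fixture: compare the live answer against a known-good reference and treat the threshold as a tunable knob, not a hard truth:

```python
def test_refund_answer_is_factually_close(llm_client):
    expected = "Customers can request a refund within 30 days of purchase."
    actual = llm_client.ask("What is the refund policy?")
    # 0.85 is only a starting point; calibrate it against hand-labeled examples.
    assert is_semantically_similar(expected, actual)
```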
3. Behavioral Testing (what should never happen)
```python
FORBIDDEN_PHRASES = ["I cannot", "As an AI", "I don't have access"]


def test_no_refusals_on_valid_queries(llm_client):
    response = llm_client.ask("What is the return policy?")
    for phrase in FORBIDDEN_PHRASES:
        assert phrase not in response, f"Got refusal: {phrase}"
```
Testing for Regressions: Golden Datasets
Build a golden dataset — a curated set of input/expected-output pairs — and run evaluations on every model or prompt change:
| Input | Min Score | Pass? |
|---|---|---|
| "Summarize this in 3 points" | 4/5 | ✅ |
| "Translate to French" | 4/5 | ✅ |
| "What's 2+2?" (sanity check) | 5/5 | ✅ |
This won't catch everything, but it will catch regressions — which is the main goal.
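A minimal sketch of that regression loop with pytest, assuming a hypothetical `golden_dataset.json` file of cases and the completed `evaluate_response()` judge sketched above:

```python
import json

import pytest

# Hypothetical file of {"input": ..., "criteria": ..., "min_score": ...} entries.
with open("golden_dataset.json") as f:
    GOLDEN_CASES = json.load(f)


@pytest.mark.parametrize("case", GOLDEN_CASES, ids=lambda c: c["input"][:40])
def test_golden_dataset(case, llm_client):
    answer = llm_client.ask(build_prompt(case["input"], context=""))
    result = evaluate_response(case["input"], answer, case["criteria"], llm_client)
    assert result["score"] >= case["min_score"], result["reason"]
```

Run this suite on every prompt or model change; a handful of failing golden cases is a much cheaper signal than a production incident.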
Tools Worth Knowing
- Promptfoo — open-source LLM eval framework, define test cases in YAML
- LangSmith — tracing + eval if you're on LangChain
- DeepEval — pytest-style assertions for LLM metrics
Written by Yash | Senior SDET cross-validating UI, API, DB, and test infrastructure to catch the failures other layers miss.