Yash Pandey

How to Test LLM-Powered Applications Effectively

Testing a CRUD app is deterministic. You input X, you expect Y, you assert equality. Testing an LLM-powered application is different in a way that breaks most of your existing instincts.

The model's output is probabilistic. The same prompt can return different phrasing across runs. "Correct" is often subjective. Traditional `assertEqual` doesn't work here.

Here's how to think about testing LLM apps properly.


The Three Layers of an LLM App

Before writing a single test, map out what you're actually testing:

[ User Input ]
     ↓
[ Prompt Construction ]   ← Layer 1: Deterministic. Testable normally.
     ↓
[ LLM API Call ]          ← Layer 2: Non-deterministic. Mock in unit tests.
     ↓
[ Output Parsing ]        ← Layer 3: Deterministic. Testable normally.
     ↓
[ App Response ]

Most bugs aren't in the LLM — they're in layers 1 and 3. Start there.
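The "mock in unit tests" note on Layer 2 can be sketched with `unittest.mock`. The `answer_question` pipeline and the `llm_client.complete` method below are hypothetical stand-ins for whatever your app actually calls:

```python
from unittest.mock import MagicMock

def answer_question(llm_client, user_query: str, context: str) -> str:
    # Layer 1: deterministic prompt construction
    prompt = f"Context: {context}\nUser: {user_query}"
    # Layer 2: the only non-deterministic step -- replace with a mock in unit tests
    raw = llm_client.complete(prompt)
    # Layer 3: deterministic post-processing
    return raw.strip()

def test_pipeline_with_mocked_llm():
    fake_llm = MagicMock()
    fake_llm.complete.return_value = "  Refunds accepted for 30 days.  "
    out = answer_question(fake_llm, "What is the policy?", "Refund window is 30 days.")
    # Layers 1 and 3 are exercised for real; Layer 2's answer is canned
    assert out == "Refunds accepted for 30 days."
    assert "Refund window is 30 days." in fake_llm.complete.call_args[0][0]
```

With the model call stubbed out, this test is fast and fully deterministic, which is exactly what you want at the unit level.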


Layer 1: Test Prompt Construction

Your prompt builder is plain code. Test it like code.

def build_prompt(user_query: str, context: str) -> str:
    return f"""You are a helpful assistant.
Context: {context}
User: {user_query}
Answer concisely."""

def test_prompt_includes_context():
    prompt = build_prompt("What is the policy?", "Refund window is 30 days.")
    assert "Refund window is 30 days." in prompt

def test_prompt_has_system_instruction():
    prompt = build_prompt("Hi", "")
    assert "You are a helpful assistant" in prompt

These are fast, free, and catch the majority of regressions.


Layer 3: Test Output Parsing

If your app parses structured data from LLM output, test the parser independently with canned responses:

import json
import re

import pytest

def parse_llm_json_response(raw: str) -> dict:
    # The model often wraps JSON in chatter; extract the first {...} block
    match = re.search(r'\{.*\}', raw, re.DOTALL)
    if not match:
        raise ValueError("No JSON found in response")
    return json.loads(match.group())

def test_parser_extracts_json():
    raw = "Here is the result: {\"score\": 8, \"reason\": \"Clear\"}"
    result = parse_llm_json_response(raw)
    assert result["score"] == 8

def test_parser_raises_on_no_json():
    with pytest.raises(ValueError):
        parse_llm_json_response("Sorry, I cannot help with that.")

Layer 2: Evaluating LLM Output Quality

For actual model output, shift from assertion-based testing to evaluation-based testing. Three practical approaches:

1. Rubric Scoring (LLM-as-Judge)

def evaluate_response(question, answer, criteria):
    eval_prompt = f"""
Rate the following answer on a scale of 1-5 for each criterion.
Question: {question}
Answer: {answer}
Criteria: {criteria}
Return JSON: {{"score": int, "reason": str}}
"""
    # Call your LLM here and parse response
    ...
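To make the judge loop concrete, here is a minimal sketch. The `judge_client` object and its `.complete(prompt)` method are hypothetical; it reuses the same JSON-extraction idea as the parser above, since judge models also tend to wrap their JSON in chatter:

```python
import json
import re

def run_judge(judge_client, question: str, answer: str, criteria: str) -> dict:
    eval_prompt = (
        "Rate the following answer on a scale of 1-5.\n"
        f"Question: {question}\nAnswer: {answer}\nCriteria: {criteria}\n"
        'Return JSON: {"score": int, "reason": str}'
    )
    raw = judge_client.complete(eval_prompt)
    # Extract the first {...} block in case the judge adds surrounding prose
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if not match:
        raise ValueError("Judge returned no JSON")
    return json.loads(match.group())
```

In unit tests you can stub `judge_client` with a canned response; in CI you would point it at a real (ideally cheap) model.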

2. Semantic Similarity (for factual tasks)

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

def is_semantically_similar(expected, actual, threshold=0.85):
    emb1 = model.encode(expected, convert_to_tensor=True)
    emb2 = model.encode(actual, convert_to_tensor=True)
    score = util.cos_sim(emb1, emb2).item()
    return score >= threshold
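If pulling in an embedding model is too heavy for a quick smoke test, a dependency-free proxy like token-level Jaccard overlap can stand in. This is far weaker than real embeddings (no synonyms, no paraphrase awareness) and is only a rough sketch; the function name and threshold are my own:

```python
def token_overlap(expected: str, actual: str, threshold: float = 0.5) -> bool:
    """Rough proxy for similarity: Jaccard overlap of lowercased word sets."""
    a, b = set(expected.lower().split()), set(actual.lower().split())
    if not a or not b:
        return False
    return len(a & b) / len(a | b) >= threshold
```

Use it as a cheap first gate in CI, and escalate borderline cases to the embedding-based check.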

3. Behavioral Testing (what should never happen)

FORBIDDEN_PHRASES = ["I cannot", "As an AI", "I don't have access"]

def test_no_refusals_on_valid_queries(llm_client):
    response = llm_client.ask("What is the return policy?")
    for phrase in FORBIDDEN_PHRASES:
        assert phrase not in response, f"Got refusal: {phrase}"

Testing for Regressions: Golden Datasets

Build a golden dataset — a curated set of input/expected-output pairs — and run evaluations on every model or prompt change:

| Input | Min Score | Pass? |
|---|---|---|
| "Summarize this in 3 points" | 4/5 | |
| "Translate to French" | 4/5 | |
| "What's 2+2?" (sanity check) | 5/5 | |

This won't catch everything, but it will catch regressions — which is the main goal.
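The table above can be driven by a small runner. Here `score_fn` is a hypothetical callable (for example, a wrapper around the LLM-as-judge idea earlier) that returns a 1-5 score; the case schema is my own:

```python
from typing import Callable

def run_golden_dataset(cases: list[dict],
                       score_fn: Callable[[str, str], int]) -> list[dict]:
    """Score each case; any score below its min_score is flagged as a regression."""
    results = []
    for case in cases:
        score = score_fn(case["input"], case["expected"])
        results.append({**case, "score": score,
                        "passed": score >= case["min_score"]})
    return results
```

Wire this into CI so a prompt or model change fails the build whenever any case drops below its minimum score.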


Tools Worth Knowing

  • Promptfoo — open-source LLM eval framework; define test cases in YAML
  • LangSmith — tracing + eval if you're on LangChain
  • DeepEval — pytest-style assertions for LLM metrics
Written by Yash | Senior SDET catching failures other layers miss: cross-validating UI, API, DB, and test infrastructure simultaneously.
