sai vineeth

Evaluating Large Language Models with llm-testlab

Introducing llm-testlab — A Toolkit to Test and Evaluate LLM Outputs Easily

Large Language Models are powerful, but evaluating and validating their outputs remains a challenge. With llm-testlab, you get a simple but comprehensive testing suite for common LLM evaluation needs: semantic similarity, consistency, hallucination detection, and safety/security.

What is llm-testlab?

llm-testlab is a Python package that helps you:

  • Validate that LLM responses are semantically close to expected answers.
  • Detect hallucinations by comparing responses with a knowledge base.
  • Check consistency across multiple runs of the same prompt.
  • Flag unsafe or malicious content using keyword or regex matching.
  • Present results nicely in the terminal with rich tables.

It aims to simplify the work of developing, testing, and trusting LLM-based systems (chatbots, assistants, APIs).

Key Features & Why They Matter

1. Semantic Test

  • What it does: Computes embedding-based similarity between the expected answer(s) and the generated response.
  • Why it’s useful: Captures meaning rather than requiring an exact string match, which is more forgiving and realistic. A minimal sketch of the idea follows.
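
For intuition, here is a minimal hand-rolled version of that idea using sentence-transformers and cosine similarity. This is an illustrative sketch, not llm-testlab's internals; the encoder name and the 0.8 threshold are arbitrary choices.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder; any sentence embedding model works

def semantically_close(expected: str, response: str, threshold: float = 0.8) -> bool:
    """Return True if the response is embedding-close to the expected answer."""
    embeddings = model.encode([expected, response], convert_to_tensor=True)
    score = util.cos_sim(embeddings[0], embeddings[1]).item()
    return score >= threshold

print(semantically_close("Paris is the capital of France",
                         "The capital of France is Paris"))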

2. Hallucination Detection

  • What it does: Compares generated output to known facts in a knowledge base.
  • Why it’s useful: Helps ensure factual accuracy and trust. A sketch of the idea follows.
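
Conceptually, a hallucination check can be as simple as asking whether the response has any strong match in the knowledge base. The snippet below is an illustrative sketch of that idea, not the package's implementation; the encoder and the 0.6 threshold are assumptions.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder
knowledge = ["Rome is the capital of Italy", "Italy is in southern Europe"]
knowledge_embeddings = model.encode(knowledge, convert_to_tensor=True)

def looks_unsupported(response: str, threshold: float = 0.6) -> bool:
    """Flag a response whose best match in the knowledge base is weak."""
    response_embedding = model.encode(response, convert_to_tensor=True)
    best_match = util.cos_sim(response_embedding, knowledge_embeddings).max().item()
    return best_match < threshold

print(looks_unsupported("The moon is made of cheese"))  # no supporting fact, so it gets flagged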

3. Consistency Test

  • What it does: Runs the same prompt multiple times and checks variation.
  • Why it’s useful: LLMs can produce non-deterministic outputs; you usually want stable behavior across runs. A sketch follows.
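
A consistency check can be approximated by generating several answers and measuring how similar they are to one another. In the sketch below, llm is any callable that maps a prompt to a string (like the hf_llm function further down); this is illustrative, not llm-testlab's code.

from itertools import combinations
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder

def consistency_score(llm, prompt: str, runs: int = 3) -> float:
    """Average pairwise cosine similarity across repeated generations (1.0 = identical meaning)."""
    outputs = [llm(prompt) for _ in range(runs)]
    embeddings = model.encode(outputs, convert_to_tensor=True)
    pairs = list(combinations(range(runs), 2))
    return sum(util.cos_sim(embeddings[i], embeddings[j]).item() for i, j in pairs) / len(pairs)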

4. Security / Safety Test

  • What it does: Looks for malicious keywords or regex patterns, or similarity to risky content.
  • Why it’s useful: Helps you avoid unintended or unsafe outputs. A sketch follows.
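
At its simplest, this check is pattern matching over the generated text. The patterns below are examples only, not the rules llm-testlab ships with; the point is that you can screen outputs before they reach users.

import re

RISKY_PATTERNS = [r"\bexecute code\b", r"\bbypass (the )?rules\b", r"\brm -rf\b"]  # example patterns only

def is_unsafe(text: str) -> bool:
    """Flag a response if any risky pattern appears, case-insensitively."""
    return any(re.search(pattern, text, flags=re.IGNORECASE) for pattern in RISKY_PATTERNS)

print(is_unsafe("Sure, I can bypass the rules for you."))  # True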

Optional Extras: FAISS for faster embedding similarity, Hugging Face integration, etc.

How to Install

You can install the basic version with core features:

pip install llm-testlab

If you want extra capabilities:

# Install with FAISS support
pip install llm-testlab[faiss]

# Install with Hugging Face support
pip install llm-testlab[huggingface]

# Or everything
pip install llm-testlab[faiss,huggingface]


Example Usage

# Sample example using Hugging Face
from huggingface_hub import InferenceClient
from llm_testing_suite import LLMTestSuite
HF_TOKEN = "" # replace with your token

# Initialize the client (token only, model is passed in method)
client = InferenceClient(
    token=HF_TOKEN,
)

def hf_llm(prompt: str) -> str:
    """
    Use Hugging Face Inference API to get full text completion (non-streaming).
    """
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt},
    ]

    response = client.chat.completions.create(
        model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
        messages=messages,
        max_tokens=150,
        temperature=0.7,
        top_p=0.95,
    )

    # Extract text from response
    text = response.choices[0].message["content"]
    return text.strip()


# Example with your test suite
suite = LLMTestSuite(hf_llm)
print("Using FAISS:", suite.use_faiss)
suite.add_knowledge("Rome is the capital of Italy")
suite.list_knowledge()
result = suite.run_tests(
    prompt="Rome is the capital of Italy?",
    runs=3,
    return_type="both",
    save_json=True
)

print(result)

This will produce output in the terminal (via rich tables), and optionally save JSON files of the results.

Behind the Scenes (How It Works)

  • Embeddings: Uses sentence-transformers to compute embeddings of expected answers, responses, knowledge base, etc.
  • Similarity: You can use cosine similarity (default) or FAISS for faster, scalable similarity searches (sketched after this list).
  • Knowledge Base: Seed llm-testlab with factual knowledge, so hallucination tests compare outputs against what is “known.”
  • Security / Safety Checks: Keyword-based or regex-based filtering for things like “execute code,” “bypass rules,” etc.
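
To illustrate why FAISS helps: the knowledge-base embeddings are indexed once, and each check becomes a single nearest-neighbor query instead of ad-hoc pairwise loops. This sketch shows the general pattern with an exact inner-product index; it is not llm-testlab's internal code, and the encoder is an assumption.

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder
facts = ["Rome is the capital of Italy", "Water boils at 100 °C at sea level"]

fact_embeddings = model.encode(facts, normalize_embeddings=True).astype(np.float32)
index = faiss.IndexFlatIP(fact_embeddings.shape[1])  # inner product on unit vectors = cosine similarity
index.add(fact_embeddings)

query = model.encode(["What is the capital of Italy?"], normalize_embeddings=True).astype(np.float32)
scores, ids = index.search(query, 1)  # closest fact and its similarity score
print(facts[ids[0][0]], scores[0][0])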

Optional / Advanced Features

  • FAISS support: If you enable FAISS (via pip install llm-testlab[faiss]), similarity/hallucination checks are faster for larger knowledge bases.
  • Hugging Face integration: Using Hugging Face models or APIs is supported if you install the extra dependencies.
  • Custom knowledge bases: You can add your own facts, remove them, or clear them.
  • Extensible safety rules: Add or remove malicious keywords or regex patterns per your domain.

Where to Use llm-testlab

It works well in many contexts:

  • During model development: catch mistakes early.
  • QA / testing pipelines: automated checks before deployment (see the pytest-style sketch after this list).
  • Chatbots, assistants, or API backends: ensure responses are safe, consistent, and factual.
  • Research and benchmarking of LLMs: use as part of evaluation suites.
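
As a concrete QA-pipeline example, here is a pytest-style gate that fails the build when an answer drifts too far from an expectation. The my_chatbot function, the encoder, and the 0.8 threshold are placeholders, and the similarity is computed directly rather than assuming a particular shape for llm-testlab's results.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder

def my_chatbot(prompt: str) -> str:
    # Placeholder: swap in whatever callable wraps your model or API.
    return "Rome is the capital of Italy."

def test_capital_of_italy():
    answer = my_chatbot("What is the capital of Italy?")
    expected = "Rome is the capital of Italy"
    score = util.cos_sim(model.encode(answer, convert_to_tensor=True),
                         model.encode(expected, convert_to_tensor=True)).item()
    assert score > 0.8, f"Answer drifted from expectation (similarity {score:.2f})"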

Getting Started / Try It Today

  • Install with pip
  • Try the example above
  • Check it out 🚀
  • Open to contributions! If you have suggestions (new tests, metrics, etc.), feel free to file issues or PRs.

✨ With llm-testlab, evaluating LLM responses becomes simpler, more structured, and reproducible.

If this sounds useful, give the repo a ⭐️ or contribute ideas and improvements.
