sai vineeth

Evaluating Large Language Models with llm-testlab

Introducing llm-testlab — A Toolkit to Test and Evaluate LLM Outputs Easily

Large Language Models are powerful, but evaluating and validating their outputs remains a challenge. With llm-testlab, you get a simple but comprehensive testing suite for common LLM evaluation needs: semantic similarity, consistency, hallucination detection, and safety/security.

What is llm-testlab?

llm-testlab is a Python package that helps you:

  • Validate that LLM responses are semantically close to expected answers.
  • Detect hallucinations by comparing responses with a knowledge base.
  • Check consistency across multiple runs of the same prompt.
  • Flag unsafe or malicious content using keyword or regex matching.
  • Present results nicely in the terminal with rich tables.

It aims to simplify the work of developing, testing, and trusting LLM-based systems (chatbots, assistants, APIs).

Key Features & Why They Matter

1. Semantic Test

  • What it does: Computes embedding-based similarity between the expected answer(s) and the generated response.
  • Why it’s useful: Captures meaning rather than requiring an exact string match, which is more forgiving and realistic. A minimal sketch of the idea follows.
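
For intuition, here is a minimal hand-rolled version of that idea using sentence-transformers and cosine similarity. This is an illustrative sketch, not llm-testlab's internals; the encoder name and the 0.8 threshold are arbitrary choices.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder; any sentence embedding model works

def semantically_close(expected: str, response: str, threshold: float = 0.8) -> bool:
    """Return True if the response is embedding-close to the expected answer."""
    embeddings = model.encode([expected, response], convert_to_tensor=True)
    score = util.cos_sim(embeddings[0], embeddings[1]).item()
    return score >= threshold

print(semantically_close("Paris is the capital of France",
                         "The capital of France is Paris"))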

2. Hallucination Detection

  • What it does: Compares generated output to known facts in a knowledge base.
  • Why it’s useful: Helps ensure factual accuracy and trust. A sketch of the idea follows.
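
Conceptually, a hallucination check can be as simple as asking whether the response has any strong match in the knowledge base. The snippet below is an illustrative sketch of that idea, not the package's implementation; the encoder and the 0.6 threshold are assumptions.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder
knowledge = ["Rome is the capital of Italy", "Italy is in southern Europe"]
knowledge_embeddings = model.encode(knowledge, convert_to_tensor=True)

def looks_unsupported(response: str, threshold: float = 0.6) -> bool:
    """Flag a response whose best match in the knowledge base is weak."""
    response_embedding = model.encode(response, convert_to_tensor=True)
    best_match = util.cos_sim(response_embedding, knowledge_embeddings).max().item()
    return best_match < threshold

print(looks_unsupported("The moon is made of cheese"))  # no supporting fact, so it gets flagged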

3. Consistency Test

  • What it does: Runs the same prompt multiple times and checks variation.
  • Why it’s useful: LLMs can produce non-deterministic outputs; you usually want stable behavior across runs. A sketch follows.
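
A consistency check can be approximated by generating several answers and measuring how similar they are to one another. In the sketch below, llm is any callable that maps a prompt to a string (like the hf_llm function further down); this is illustrative, not llm-testlab's code.

from itertools import combinations
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder

def consistency_score(llm, prompt: str, runs: int = 3) -> float:
    """Average pairwise cosine similarity across repeated generations (1.0 = identical meaning)."""
    outputs = [llm(prompt) for _ in range(runs)]
    embeddings = model.encode(outputs, convert_to_tensor=True)
    pairs = list(combinations(range(runs), 2))
    return sum(util.cos_sim(embeddings[i], embeddings[j]).item() for i, j in pairs) / len(pairs)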

4. Security / Safety Test

  • What it does: Looks for malicious keywords or regex patterns, or similarity to risky content.
  • Why it’s useful: Helps you avoid unintended or unsafe outputs. A sketch follows.
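
At its simplest, this check is pattern matching over the generated text. The patterns below are examples only, not the rules llm-testlab ships with; the point is that you can screen outputs before they reach users.

import re

RISKY_PATTERNS = [r"\bexecute code\b", r"\bbypass (the )?rules\b", r"\brm -rf\b"]  # example patterns only

def is_unsafe(text: str) -> bool:
    """Flag a response if any risky pattern appears, case-insensitively."""
    return any(re.search(pattern, text, flags=re.IGNORECASE) for pattern in RISKY_PATTERNS)

print(is_unsafe("Sure, I can bypass the rules for you."))  # True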

Optional Extras: FAISS for faster embedding similarity, Hugging Face integration, etc.

How to Install

You can install the basic version with core features:

pip install llm-testlab

If you want extra capabilities:

# Install with FAISS support
pip install llm-testlab[faiss]

# Install with Hugging Face support
pip install llm-testlab[huggingface]

# Or everything
pip install llm-testlab[faiss,huggingface]


Example Usage

# Sample example using Hugging Face
from huggingface_hub import InferenceClient
from llm_testing_suite import LLMTestSuite
HF_TOKEN = "" # replace with your token

# Initialize the client (token only, model is passed in method)
client = InferenceClient(
    token=HF_TOKEN,
)

def hf_llm(prompt: str) -> str:
    """
    Use Hugging Face Inference API to get full text completion (non-streaming).
    """
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt},
    ]

    response = client.chat.completions.create(
        model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
        messages=messages,
        max_tokens=150,
        temperature=0.7,
        top_p=0.95,
    )

    # Extract text from response
    text = response.choices[0].message["content"]
    return text.strip()


# Example with your test suite
suite = LLMTestSuite(hf_llm)
print("Using FAISS:", suite.use_faiss)
suite.add_knowledge("Rome is the capital of Italy")
suite.list_knowledge()
result = suite.run_tests(
    prompt="Rome is the capital of Italy?",
    runs=3,
    return_type="both",
    save_json=True
)

print(result)

This will produce output in the terminal (via rich tables), and optionally save JSON files of the results.

Behind the Scenes (How It Works)

  • Embeddings: Uses sentence-transformers to compute embeddings of expected answers, responses, knowledge base, etc.
  • Similarity: You can use cosine similarity (default) or FAISS for faster, scalable similarity searches (sketched after this list).
  • Knowledge Base: Seed llm-testlab with factual knowledge, so hallucination tests compare outputs against what is “known.”
  • Security / Safety Checks: Keyword-based or regex-based filtering for things like “execute code,” “bypass rules,” etc.
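
To illustrate why FAISS helps: the knowledge-base embeddings are indexed once, and each check becomes a single nearest-neighbor query instead of ad-hoc pairwise loops. This sketch shows the general pattern with an exact inner-product index; it is not llm-testlab's internal code, and the encoder is an assumption.

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder
facts = ["Rome is the capital of Italy", "Water boils at 100 °C at sea level"]

fact_embeddings = model.encode(facts, normalize_embeddings=True).astype(np.float32)
index = faiss.IndexFlatIP(fact_embeddings.shape[1])  # inner product on unit vectors = cosine similarity
index.add(fact_embeddings)

query = model.encode(["What is the capital of Italy?"], normalize_embeddings=True).astype(np.float32)
scores, ids = index.search(query, 1)  # closest fact and its similarity score
print(facts[ids[0][0]], scores[0][0])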

Optional / Advanced Features

  • FAISS support: If you enable FAISS (via pip install llm-testlab[faiss]), similarity/hallucination checks are faster for larger knowledge bases.
  • Hugging Face integration: Using Hugging Face models or APIs is supported if you install the extra dependencies.
  • Custom knowledge bases: You can add your own facts, remove them, or clear them.
  • Extensible safety rules: Add or remove malicious keywords or regex patterns per your domain.

Where to Use llm-testlab

It works well in many contexts:

  • During model development: catch mistakes early.
  • QA / testing pipelines: automated checks before deployment (see the pytest-style sketch after this list).
  • Chatbots, assistants, or API backends: ensure responses are safe, consistent, and factual.
  • Research and benchmarking of LLMs: use as part of evaluation suites.
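
As a concrete QA-pipeline example, here is a pytest-style gate that fails the build when an answer drifts too far from an expectation. The my_chatbot function, the encoder, and the 0.8 threshold are placeholders, and the similarity is computed directly rather than assuming a particular shape for llm-testlab's results.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder

def my_chatbot(prompt: str) -> str:
    # Placeholder: swap in whatever callable wraps your model or API.
    return "Rome is the capital of Italy."

def test_capital_of_italy():
    answer = my_chatbot("What is the capital of Italy?")
    expected = "Rome is the capital of Italy"
    score = util.cos_sim(model.encode(answer, convert_to_tensor=True),
                         model.encode(expected, convert_to_tensor=True)).item()
    assert score > 0.8, f"Answer drifted from expectation (similarity {score:.2f})"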

Getting Started / Try It Today

  • Install with pip
  • Try the example above
  • Check it out 🚀
  • Open to contributions! If you have suggestions (new tests, metrics, etc.), feel free to file issues or PRs.

✨ With llm-testlab, evaluating LLM responses becomes simpler, more structured, and reproducible.

If this sounds useful, give the repo a ⭐️ or contribute ideas and improvements.
