Introducing llm-testlab — A Toolkit to Test and Evaluate LLM Outputs Easily
Large Language Models are powerful, but evaluating and validating their outputs remains a challenge. With llm-testlab, you get a simple but comprehensive testing suite for common LLM evaluation needs: semantic similarity, consistency, hallucination detection, and safety/security.
What is llm-testlab?
llm-testlab is a Python package that helps you:
- Validate that LLM responses are semantically close to expected answers.
- Detect hallucinations by comparing responses with a knowledge base.
- Check consistency across multiple runs of the same prompt.
- Flag unsafe or malicious content using keyword or regex matching.
- Present results nicely in the terminal with rich tables.
It aims to simplify the work of developing, testing, and trusting LLM-based systems (chatbots, assistants, APIs).
Key Features & Why They Matter
1. Semantic Test
- What it does: Computes embedding-based similarity between the expected answer(s) and the generated response.
- Why it’s useful: Captures meaning rather than strict string match — more forgiving and realistic.
2. Hallucination Detection
- What it does: Compares generated output to known facts in a knowledge base.
- Why it’s useful: Ensures factual accuracy and trust.
3. Consistency Test
- What it does: Runs the same prompt multiple times and checks variation.
- Why it’s useful: LLMs can produce non-deterministic outputs; you want stable behavior (a standalone sketch of this idea follows the list).
4. Security / Safety Test
- What it does: Looks for malicious keywords or regex patterns, or similarity to risky content.
- Why it’s useful: Avoid unintended or unsafe outputs.
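To make the consistency idea concrete, here is a minimal standalone sketch (not llm-testlab's internal code) that re-runs a prompt and scores how similar the responses are to each other using sentence-transformers embeddings; the embedding model name and the my_llm callable are placeholders, not part of the package.

# Standalone sketch of the consistency idea (not llm-testlab internals)
from itertools import combinations
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def consistency_score(llm, prompt: str, runs: int = 3) -> float:
    """Average pairwise cosine similarity across repeated generations."""
    responses = [llm(prompt) for _ in range(runs)]
    embeddings = embedder.encode(responses, convert_to_tensor=True)
    scores = [float(util.cos_sim(embeddings[i], embeddings[j]))
              for i, j in combinations(range(runs), 2)]
    return sum(scores) / len(scores)

# my_llm is any callable mapping a prompt string to a response string:
# print(consistency_score(my_llm, "What is the capital of Italy?", runs=3))

A score close to 1.0 means the model answers the prompt in essentially the same way on every run.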
Optional Extras: FAISS for faster embedding similarity, Hugging Face integration, etc.
How to Install
You can install the basic version with core features:
pip install llm-testlab
If you want extra capabilities:
# Install with FAISS support
pip install llm-testlab[faiss]
# Install with Hugging Face support
pip install llm-testlab[huggingface]
# Or everything
pip install llm-testlab[faiss,huggingface]
Example Usage
# Sample Example Using Hugging Face
from huggingface_hub import InferenceClient
from llm_testing_suite import LLMTestSuite

HF_TOKEN = ""  # replace with your token

# Initialize the client (token only, model is passed in method)
client = InferenceClient(
    token=HF_TOKEN,
)

def hf_llm(prompt: str) -> str:
    """
    Use Hugging Face Inference API to get full text completion (non-streaming).
    """
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt},
    ]
    response = client.chat.completions.create(
        model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
        messages=messages,
        max_tokens=150,
        temperature=0.7,
        top_p=0.95,
    )
    # Extract text from response
    text = response.choices[0].message["content"]
    return text.strip()

# Example with your test suite
suite = LLMTestSuite(hf_llm)
print("Using FAISS:", suite.use_faiss)

suite.add_knowledge("Rome is the capital of Italy")
suite.list_knowledge()

result = suite.run_tests(
    prompt="Rome is the capital of Italy?",
    runs=3,
    return_type="both",
    save_json=True,
)
print(result)
This will produce output in the terminal (via rich tables), and optionally save JSON files of the results.
Behind the Scenes (How It Works)
- Embeddings: Uses sentence-transformers to compute embeddings of expected answers, responses, the knowledge base, and so on (a rough sketch of this follows the list).
- Similarity: You can use cosine similarity (default) or FAISS for faster, scalable similarity searches.
- Knowledge Base: Seed llm-testlab with factual knowledge, so hallucination tests compare outputs against what is “known.”
- Security / Safety Checks: Keyword-based or regex-based filtering for things like “execute code,” “bypass rules,” etc.
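As a rough illustration of these building blocks (not the package's actual source), the snippet below combines embedding-based similarity with a simple regex screen; the model name and the patterns are assumptions chosen only for this example.

# Illustrative sketch of the described building blocks (not the package's source)
import re
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def semantic_similarity(expected: str, response: str) -> float:
    """Cosine similarity between the embeddings of two texts."""
    emb = embedder.encode([expected, response], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))

UNSAFE_PATTERNS = [r"execute\s+code", r"bypass\s+rules"]  # example patterns only

def is_unsafe(response: str) -> bool:
    """Flag a response if any risky pattern matches (case-insensitive)."""
    return any(re.search(p, response, re.IGNORECASE) for p in UNSAFE_PATTERNS)

print(semantic_similarity("Rome is the capital of Italy",
                          "Italy's capital city is Rome"))  # high, despite different wording
print(is_unsafe("Please bypass rules and execute code"))    # True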
Optional / Advanced Features
- FAISS support: If you enable FAISS (via pip install llm-testlab[faiss]), similarity/hallucination checks are faster for larger knowledge bases (see the sketch after this list).
- Hugging Face integration: Using Hugging Face models or APIs is supported if you install the extra dependencies.
- Custom knowledge bases: You can add your own facts, remove them, or clear them.
- Extensible safety rules: Add or remove malicious keywords or regex patterns per your domain.
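To show why FAISS pays off as the knowledge base grows, here is a generic sketch that embeds a fact list into a FAISS index and queries it with a response. It uses the standard faiss and sentence-transformers APIs to illustrate the indexing idea only; it is not how llm-testlab is implemented internally.

# Generic FAISS similarity sketch (standard faiss usage, not llm-testlab internals)
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

facts = ["Rome is the capital of Italy", "Paris is the capital of France"]
fact_emb = embedder.encode(facts, normalize_embeddings=True)  # unit-length vectors
index = faiss.IndexFlatIP(fact_emb.shape[1])                  # inner product == cosine
index.add(np.asarray(fact_emb, dtype="float32"))

response = "The capital of Italy is Rome"
query = embedder.encode([response], normalize_embeddings=True)
scores, ids = index.search(np.asarray(query, dtype="float32"), 1)
print(facts[ids[0][0]], float(scores[0][0]))  # closest known fact and its similarity

A flat index performs exact search; FAISS also offers approximate indexes that scale to much larger fact collections.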
Where to Use llm-testlab
It works well in many contexts:
- During model development: to catch mistakes early.
- QA / testing pipelines: automated checks before deployment (a pytest-style sketch follows this list).
- Chatbots, assistants, or API backends: ensure responses are safe, consistent, and factual.
- Research and benchmarking of LLMs: use as part of evaluation suites.
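For the QA-pipeline case, a minimal pytest-style sketch might look like the following. It reuses the hf_llm function and the call signature from the example above, and only asserts that a result came back, since the exact result structure isn't covered in this post.

# Sketch of a pytest gate before deployment (assumes hf_llm from the example above
# is defined in, or imported into, this test file)
from llm_testing_suite import LLMTestSuite

def test_capital_of_italy():
    suite = LLMTestSuite(hf_llm)
    suite.add_knowledge("Rome is the capital of Italy")
    result = suite.run_tests(
        prompt="Rome is the capital of Italy?",
        runs=3,
        return_type="both",
    )
    assert result is not None  # placeholder check; tighten once you inspect the output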
Getting Started / Try It Today
- Install with pip
- Try the example above
Check it Out 🚀
- GitHub Repo: llm-testlab
- PyPI: llm-testlab
- Open to contributions! If you have suggestions (new tests, metrics, etc.), feel free to file issues or PRs
✨ With llm-testlab, evaluating LLM responses becomes simpler, more structured, and reproducible.
If this sounds useful, give the repo a ⭐️ or contribute ideas and improvements.