As developers, we've all been there. You build a RAG system, deploy it to production, and then realize your AI is confidently telling users that 2+2=5 or that Christmas is in July. The problem? Most fact-checking models are trained on academic datasets that don't reflect real-world edge cases.
That's why we built Paladin-mini, a compact, efficient grounding model specifically designed for production environments where accuracy matters.
What Makes Paladin-mini Different?
Unlike general-purpose fact-checking models, Paladin-mini is trained on synthetic data targeting the exact types of errors that break production systems:
- Mathematical calculations (pricing, quantities, percentages)
- Temporal reasoning (dates, schedules, sequences)
- Logical consistency (technical specifications, domain rules)
- Real-world edge cases (the stuff that academic benchmarks miss)
Performance That Matters
Here's where Paladin-mini shines compared to larger models:
| Category | Paladin-mini (3.8B) | Bespoke-MiniCheck-7B | Improvement (pts) |
|---|---|---|---|
| Prices & Math | 96.0% | 46.0% | +50.0 |
| Logical Reasoning | 97.1% | 92.8% | +4.3 |
| General Tasks | 91.97% | 84.02% | +7.95 |
| Overall Average | 79.31% | 77.86% | +1.45 |
Key insight: with 3.8B parameters, Paladin-mini outperforms a 7B model while being nearly half the size and significantly faster.
Paladin uses the terms doc and claim:
- doc: the context document providing the ground truth (from RAG retrieval, the system prompt, a tool response, etc.)
- claim: the content, usually generated by an AI, based on that document.
A claim is either grounded or ungrounded in the document: grounded means it can be proven or inferred from the document alone.
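To make this concrete, here's a made-up doc/claim pair (not from the Paladin benchmark): the first claim can be inferred from the document alone, the second cannot.
doc = "The store opens at 9 AM and closes at 6 PM on weekdays."

grounded_claim = "The store is open at noon on Tuesdays."  # inferable from the doc alone
ungrounded_claim = "The store is open on Sundays."         # not supported by the doc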
🚀 Getting Started
Installation
pip install torch transformers accelerate
Basic Usage
First, load the model and tokenizer so they can be reused across requests:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
# Load the model
model_name = "qualifire/context-grounding-paladin-mini"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Create pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=1,       # the verdict is a single "Yes"/"No" token
    return_full_text=False,
    temperature=0.0,        # ignored when do_sample=False; kept for clarity
    do_sample=False         # greedy decoding keeps verdicts deterministic
)
The Prompt Template
Paladin-mini uses a specific prompt format for optimal performance:
PROMPT_TEMPLATE = '''
You are tasked with determining whether a given claim is consistent with the information provided in a document. Consistency means that all information in the claim is supported by the document. If any part of the claim contradicts or is not substantiated by the document, it should be considered inconsistent.
Analyze the claim in relation to the information provided in the document. Consider the following:
1. Does the document explicitly support all parts of the claim?
2. Is there any information in the claim that contradicts the document?
3. Does the claim contain any details not mentioned in the document?
Before providing your reasoning, give your final answer as either "Yes" (the claim is consistent with the document) or "No" (the claim is not consistent with the document). The reasoning should follow the final answer.
The answer should begin with a single word: "Yes" or "No".
---
First, carefully read the following document:
<DOCUMENT>
{doc}
</DOCUMENT>
Now, consider this claim:
<CLAIM>
{claim}
</CLAIM>
What is your answer?'''
Core Verification Function
def verify_claim(document, claim):
    """Generic function to verify any claim against a document"""
    prompt = PROMPT_TEMPLATE.format(doc=document, claim=claim)
    messages = [{"role": "user", "content": prompt}]
    result = pipe(messages)
    return result[0]['generated_text'].strip()
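A quick smoke test with a made-up document and claim (any short pair works):
answer = verify_claim(
    "The invoice total is $140, due on June 1.",
    "The invoice is due on July 1."
)
print(answer)  # Expected output: "No"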
Real-World Examples
Example 1: E-commerce Price Verification
# Test case that breaks other models
doc = """
Product: Gaming Laptop
Base Price: $1,200
Student Discount: 10%
Bulk Order (3+ items): Additional 5% off
Tax Rate: 8.25%
"""
claim = "A student buying 3 gaming laptops would pay $3,105.84 total including tax"
# Let's verify:
# Base: $1,200 × 3 = $3,600
# Student discount: $3,600 × 0.9 = $3,240
# Bulk discount: $3,240 × 0.95 = $3,078
# With tax: $3,078 × 1.0825 = $3,331.94
# The claim is wrong!
result = verify_claim(doc, claim)
print(result) # Output: "No"
Example 2: Date Validation for Scheduling
doc = """
Conference Schedule:
- Registration opens: March 15, 2024
- Early bird deadline: April 30, 2024
- Conference dates: June 10-12, 2024
- Abstract submission closes: May 1, 2024
"""
claim = "Abstract submissions are due before the early bird registration deadline"
result = verify_claim(doc, claim)
print(result) # Output: "No" (abstracts due May 1, early bird ends April 30)
Example 3: Technical Specification Verification
doc = """
API Rate Limits:
- Free tier: 1,000 requests/hour
- Pro tier: 10,000 requests/hour
- Enterprise: 100,000 requests/hour
- Burst limit: 2x rate limit for 60 seconds
"""
claim = "Pro tier users can make up to 20,000 requests during a 60-second burst period"
result = verify_claim(doc, claim)
print(result) # Output: "Yes" (10,000 × 2 = 20,000)
Production Integration Patterns
Pattern 1: RAG Response Validation
class RAGValidator:
    def __init__(self):
        self.grounding_pipe = pipe  # Your Paladin-mini pipeline

    def verify_claim(self, document, claim):
        """Verify a single claim against a document"""
        prompt = PROMPT_TEMPLATE.format(doc=document, claim=claim)
        messages = [{"role": "user", "content": prompt}]
        result = self.grounding_pipe(messages)
        return result[0]['generated_text'].strip()

    def validate_response(self, context_docs, generated_response):
        """Validate a RAG response against context documents"""
        # Combine all context documents
        full_context = "\n\n".join(context_docs)
        # Check if response is grounded
        result = self.verify_claim(full_context, generated_response)
        return {
            'is_grounded': result.lower() == 'yes',
            'response': generated_response,
            'confidence': 'high' if result.lower() in ['yes', 'no'] else 'low'
        }
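Here's a usage sketch with a made-up context document; validate_response returns the dict defined above:
validator = RAGValidator()
check = validator.validate_response(
    context_docs=["Our refund window is 30 days from delivery."],
    generated_response="You can request a refund within 30 days of delivery."
)
print(check['is_grounded'])  # Expected: True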
Pattern 2: Batch Verification for Content Pipelines
def batch_verify_claims(documents, claims):
    """Verify multiple claims against their corresponding documents"""
    results = []
    for doc, claim in zip(documents, claims):
        prompt = PROMPT_TEMPLATE.format(doc=doc, claim=claim)
        messages = [{"role": "user", "content": prompt}]
        result = pipe(messages)
        is_grounded = result[0]['generated_text'].strip().lower() == 'yes'
        results.append({
            'claim': claim,
            'is_grounded': is_grounded,
            'document': doc[:100] + "..." if len(doc) > 100 else doc
        })
    return results
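And a sketch of calling it with made-up inputs:
docs = ["Shipping is free on orders over $50.", "The API returns JSON responses."]
claims = ["Orders over $50 ship for free.", "The API returns XML responses."]

for r in batch_verify_claims(docs, claims):
    print(r['is_grounded'], '-', r['claim'])
# Expected: True for the first claim, False for the second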
Pattern 3: Real-time API Guard
from flask import Flask, request, jsonify

app = Flask(__name__)

# Initialize validator
validator = RAGValidator()

@app.route('/verify', methods=['POST'])
def verify_claim_endpoint():
    data = request.json
    document = data.get('document')
    claim = data.get('claim')
    if not document or not claim:
        return jsonify({'error': 'Missing document or claim'}), 400
    result = validator.verify_claim(document, claim)
    return jsonify({
        'is_grounded': result.lower() == 'yes',
        'raw_output': result
    })

if __name__ == '__main__':
    app.run(debug=True)
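To exercise the endpoint, here's a minimal client using requests (assuming the Flask dev server is running locally on its default port, 5000):
import requests

resp = requests.post(
    "http://localhost:5000/verify",
    json={
        "document": "The warranty covers parts for 12 months.",
        "claim": "The warranty covers parts for 24 months.",
    },
)
print(resp.json())  # Expected: {'is_grounded': False, 'raw_output': 'No'}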
⏱️ Performance Optimization Tips
1. Memory Management
# Use 16-bit precision for inference
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # halves memory relative to full float32 precision
    device_map='auto',
)
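If 16-bit weights are still too large for your GPU, 4-bit quantization via bitsandbytes is another option. This isn't part of the original setup, just a common pattern (requires pip install bitsandbytes):
from transformers import BitsAndBytesConfig

# Not from the original post: store weights in 4-bit, compute in bf16
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map='auto',
)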
2. Batch Processing
# Process multiple claims efficiently
def batch_process(docs_and_claims):
    """Process multiple document-claim pairs in a single pipeline call"""
    prompts = [PROMPT_TEMPLATE.format(doc=doc, claim=claim)
               for doc, claim in docs_and_claims]
    messages_batch = [[{"role": "user", "content": prompt}]
                      for prompt in prompts]
    results = pipe(messages_batch)
    return [r[0]['generated_text'].strip() for r in results]
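A sketch of calling it (made-up pairs; the pipeline handles batching internally):
pairs = [
    ("Order total: $45.", "The order total is $45."),
    ("Order total: $45.", "The order total is $54."),
]
print(batch_process(pairs))  # Expected: ['Yes', 'No']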
3. Caching for Repeated Queries
from functools import lru_cache

@lru_cache(maxsize=1000)
def verify_with_cache(document, claim):
    """Verify a claim with caching for repeated queries.

    lru_cache keys on the argument values directly, so no separate
    hashing step is needed. In production you might use Redis or
    similar instead, keyed by a hash of (document, claim).
    """
    return verify_claim(document, claim)
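Since lru_cache tracks hits and misses, you can confirm the cache is actually being used:
verify_with_cache(doc, claim)          # first call runs the model
verify_with_cache(doc, claim)          # repeat call is served from cache
print(verify_with_cache.cache_info())  # e.g. hits=1, misses=1, currsize=1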
📈 Why This Matters for Production
The research behind Paladin-mini reveals a critical insight: general benchmarks don't predict real-world performance. Models can score 90%+ on academic datasets while failing on simple math problems.
Our specialized training on synthetic data targeting real-world edge cases means:
- ✅ Reliable financial calculations (no more embarrassing pricing errors)
- ✅ Accurate temporal reasoning (proper date/time handling)
- ✅ Robust logical consistency (handles domain-specific rules)
- ✅ Production-ready latency (70ms vs 7 seconds for larger models)
🔗 Resources
- Model: Hugging Face
- Paper: Paladin-mini: A Compact and Efficient Grounding Model
- Benchmark: Grounding-Benchmark Dataset
What's Next?
If you're looking for a hosted version of this model and its bigger brothers, head over to https://qualifire.ai
Try Paladin-mini in your next project and let us know how it performs! We're particularly interested in:
- Novel use cases and integration patterns
- Performance comparisons with other models
- Edge cases we should add to our training data
Have you used grounding models in production? What challenges have you faced? Share your experiences in the comments!
Built with ❤️ by the Qualifire team. We're on a mission to make AI systems more reliable and trustworthy for production use.