# AI Safety & Guardrails Kit
Deploy LLM-powered features with confidence. This toolkit provides production-ready input/output filtering that catches toxic content, removes PII before it reaches your model, detects hallucinated facts, and enforces your content policies programmatically. Every filter is configurable, auditable, and designed to run with minimal latency in your request pipeline.
## Key Features
- **Input Sanitization** — Detect and block prompt injection attacks, jailbreak attempts, and malicious payloads before they reach your LLM
- **PII Redaction** — Automatically detect and mask emails, phone numbers, SSNs, credit cards, and custom patterns in both inputs and outputs
- **Toxicity Detection** — Score content across categories (hate speech, harassment, self-harm, sexual content) with configurable thresholds
- **Hallucination Detection** — Cross-reference LLM outputs against source documents to flag unsupported claims
- **Content Policy Enforcement** — Define custom rules (blocked topics, required disclaimers, output format constraints) as composable policy objects
- **Audit Logging** — Every filter decision is logged with timestamps, scores, and the rule that triggered — essential for compliance
- **Streaming-Compatible** — Filters work on both complete responses and streaming chunks with minimal buffering
## Quick Start

```python
from safety_guardrails import GuardrailPipeline, filters

# Build a pipeline of filters
pipeline = GuardrailPipeline([
    filters.PromptInjectionDetector(threshold=0.85),
    filters.PIIRedactor(
        entities=["email", "phone", "ssn", "credit_card"],
        action="mask",  # mask | remove | hash
    ),
    filters.ToxicityFilter(
        max_score=0.7,
        categories=["hate", "harassment", "self_harm"],
    ),
    filters.ContentPolicy.from_yaml("policies/company_policy.yaml"),
])

# Filter input before sending to the LLM
user_input = "My email is user@example.com and I need help with..."
safe_input = pipeline.filter_input(user_input)

print(safe_input.text)
# "My email is [EMAIL_REDACTED] and I need help with..."
print(safe_input.redacted_entities)
# [{"type": "email", "original": "user@example.com", "position": [12, 28]}]

# Filter output before returning to the user
llm_output = "According to our records, your SSN is 123-45-6789..."
safe_output = pipeline.filter_output(llm_output)

print(safe_output.blocked)    # True if any filter triggered a block
print(safe_output.text)       # Sanitized text
print(safe_output.audit_log)  # Full decision trace
```
## Architecture

```text
User Input                         LLM Output
     │                                  │
     ▼                                  ▼
┌──────────────┐               ┌──────────────┐
│ Input Filter │               │Output Filter │
│   Pipeline   │               │   Pipeline   │
│              │               │              │
│ 1. Injection │               │ 1. PII       │
│ 2. PII       │               │ 2. Toxicity  │
│ 3. Toxicity  │               │ 3. Hallucin. │
│ 4. Policy    │               │ 4. Policy    │
└──────┬───────┘               └──────┬───────┘
       │                              │
       ▼                              ▼
  Safe Input ──────▶ LLM ──────▶ Raw Output
       │                              │
       └──────── Audit Log ◀──────────┘
```
Each filter returns a `FilterResult` with the (possibly modified) text, a boolean `blocked` flag, a numeric `score`, and metadata for the audit trail.
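A FilterResult can be pictured as a small dataclass. This is only an illustrative sketch: any fields or values beyond the four named above are assumptions, not the kit's actual definition.

```python
from dataclasses import dataclass, field

@dataclass
class FilterResult:
    text: str                                      # possibly modified text
    blocked: bool = False                          # True if the filter vetoed the content
    score: float = 0.0                             # filter-specific severity/confidence score
    metadata: dict = field(default_factory=dict)   # details for the audit trail

# A PII redactor might return something like:
result = FilterResult(
    text="My email is [EMAIL_REDACTED]",
    blocked=False,
    score=0.99,
    metadata={"rule": "pii.email", "entities": 1},
)
print(result.blocked)  # False
```

Keeping the result a plain value object makes pipelines easy to compose: each stage consumes the previous stage's `text` and appends its own `metadata` to the audit trail.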
## Usage Examples

### Custom Content Policies

```python
from safety_guardrails import ContentPolicy, PolicyRule

policy = ContentPolicy(rules=[
    PolicyRule(
        name="no_medical_advice",
        pattern=r"(you should take|recommended dosage|diagnos)",
        action="block",
        message="Medical advice is outside our scope. Please consult a doctor.",
    ),
    PolicyRule(
        name="require_disclaimer",
        condition="topic:financial",
        action="append",
        message="\n\n*This is not financial advice. Consult a licensed advisor.*",
    ),
    PolicyRule(
        name="block_competitor_mentions",
        keywords=["CompetitorA", "CompetitorB"],
        action="redact",
    ),
])
```
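A pattern-based rule like `no_medical_advice` boils down to a case-insensitive regex search plus an action. The following standalone sketch shows the idea; it is not the kit's implementation, and the `evaluate_pattern_rule` helper is hypothetical.

```python
import re

def evaluate_pattern_rule(text: str, pattern: str, action: str) -> dict:
    """Return a match/block decision for one pattern-based policy rule."""
    match = re.search(pattern, text, flags=re.IGNORECASE)
    return {
        "matched": match is not None,
        "blocked": match is not None and action == "block",
        "span": match.span() if match else None,
    }

decision = evaluate_pattern_rule(
    "The recommended dosage is two tablets daily.",
    r"(you should take|recommended dosage|diagnos)",
    action="block",
)
print(decision["blocked"])  # True
```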
### Hallucination Detection Against Source Documents

```python
from safety_guardrails.hallucination import HallucinationDetector

detector = HallucinationDetector(method="nli", threshold=0.6)

sources = ["Acme Corp reported $2.1M revenue in Q3 2025."]
output = "Acme Corp reported $5.3M revenue in Q3 2025."

result = detector.check(output=output, sources=sources)
print(result.score)           # 0.92 (high hallucination probability)
print(result.flagged_claims)  # ["$5.3M revenue" — contradicts source "$2.1M"]
```
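NLI-based checking requires a trained entailment model, but the cross-referencing idea can be illustrated with a much cruder check: extract dollar figures from the output and flag any that never appear in the sources. This is purely illustrative and far weaker than the NLI method; the function name and regex are assumptions.

```python
import re

def flag_unsupported_figures(output: str, sources: list[str]) -> list[str]:
    """Flag dollar figures in the output that appear in no source document."""
    figure = re.compile(r"\$[\d.]+[MBK]?")
    source_figs = {m for s in sources for m in figure.findall(s)}
    return [m for m in figure.findall(output) if m not in source_figs]

flags = flag_unsupported_figures(
    "Acme Corp reported $5.3M revenue in Q3 2025.",
    ["Acme Corp reported $2.1M revenue in Q3 2025."],
)
print(flags)  # ['$5.3M']
```

Real detectors go much further (paraphrase, entailment, unit normalization), but even this toy version shows why `require_sources` matters: with no sources, there is nothing to check claims against.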
### Streaming Output Filtering

```python
from safety_guardrails import StreamFilter

stream_filter = StreamFilter(pipeline=pipeline, buffer_size=50)

for chunk in llm_stream:
    safe_chunk = stream_filter.process_chunk(chunk)
    if safe_chunk.text:
        yield safe_chunk.text  # Only emits after the safety check
```
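The buffering idea behind streaming filters can be sketched independently of the kit: accumulate incoming chunks until the buffer reaches a threshold, run the check on the buffered text, then release it. A simplified illustration (the class below is hypothetical; a production filter must also handle entities split across buffer boundaries):

```python
import re

class BufferedStreamFilter:
    """Hold back streamed text until enough has accumulated to check safely."""

    def __init__(self, check, buffer_size: int = 50):
        self.check = check          # callable: str -> sanitized str
        self.buffer_size = buffer_size
        self._buffer = ""

    def process_chunk(self, chunk: str) -> str:
        self._buffer += chunk
        if len(self._buffer) < self.buffer_size:
            return ""               # not enough context yet; emit nothing
        out, self._buffer = self.check(self._buffer), ""
        return out

    def flush(self) -> str:
        """Check and release whatever remains at end of stream."""
        out, self._buffer = self.check(self._buffer), ""
        return out

# Toy check: mask anything shaped like an email address
mask = lambda t: re.sub(r"\S+@\S+\.\S+", "[EMAIL_REDACTED]", t)

f = BufferedStreamFilter(mask, buffer_size=20)
emitted = "".join(f.process_chunk(c) for c in ["Contact ", "user@example.com", " today"])
emitted += f.flush()
print(emitted)  # Contact [EMAIL_REDACTED] today
```

This also shows the latency trade-off from the troubleshooting table: a larger `buffer_size` gives the check more context but delays the first emitted token.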
## Configuration

```yaml
# guardrails_config.yaml
pii_redaction:
  enabled: true
  entities:
    - email
    - phone
    - ssn
    - credit_card
    - ip_address
  action: "mask"  # mask | remove | hash
  mask_char: "*"
  custom_patterns:
    employee_id: '\bEMP-\d{6}\b'
    internal_code: '\b[A-Z]{3}-\d{4}\b'

toxicity:
  enabled: true
  model: "local"  # local | api
  threshold: 0.7
  categories:
    hate: 0.6
    harassment: 0.7
    self_harm: 0.5  # Very strict on self-harm content
    sexual: 0.8

prompt_injection:
  enabled: true
  methods:
    - "heuristic"   # Fast regex-based patterns
    - "classifier"  # ML-based detection
  threshold: 0.85
  block_action: "reject"  # reject | sanitize | flag

hallucination:
  enabled: true
  method: "nli"  # nli | embedding_similarity
  threshold: 0.6
  require_sources: true

audit:
  enabled: true
  storage: "sqlite"  # sqlite | postgres | file
  retention_days: 365
  log_blocked_content: true
  pii_in_logs: false  # Never log actual PII values
```
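The `custom_patterns` entries are ordinary regexes, so you can sanity-check them before shipping the config. A quick standalone check using the two patterns from the config (the `redact_custom` helper and the sample IDs are hypothetical):

```python
import re

custom_patterns = {
    "employee_id": r"\bEMP-\d{6}\b",
    "internal_code": r"\b[A-Z]{3}-\d{4}\b",
}

def redact_custom(text: str, patterns: dict[str, str], mask_char: str = "*") -> str:
    """Mask every match of each custom pattern, preserving its length."""
    for pattern in patterns.values():
        text = re.sub(pattern, lambda m: mask_char * len(m.group()), text)
    return text

print(redact_custom("Ticket ABC-1234 filed by EMP-004821.", custom_patterns))
# Ticket ******** filed by **********.
```

Note the `\b` word boundaries: without them, `EMP-004821X` or a longer code containing the pattern would still match and leak partial context.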
## Best Practices

- **Layer your defenses** — Use heuristic injection detection (fast, cheap) AND classifier-based detection (accurate) together.
- **Set strict thresholds in production, relaxed in dev** — Use environment-specific config files.
- **Audit everything** — Regulators and security teams will ask "what did the AI say on date X?" Have the answer ready.
- **Redact PII before it hits the LLM** — Once PII is in the prompt, it may appear in provider logs. Redact on input, not just output.
- **Test with adversarial inputs** — Maintain a red-team prompt set and run it against your guardrails in CI/CD.
- **Don't block silently** — When content is blocked, return a helpful message explaining why and what the user can do instead.
## Troubleshooting

| Problem | Cause | Fix |
|---|---|---|
| PII redactor misses custom ID formats | Pattern not in default entity list | Add a custom regex under `custom_patterns` in the config |
| Toxicity filter blocks legitimate medical content | Threshold too aggressive for health domain | Raise category thresholds or add domain-specific allowlists |
| Prompt injection detector has high false positives | Heuristic rules too broad | Switch to the classifier method or raise the threshold to 0.9+ |
| Streaming filter adds noticeable latency | Buffer size too large | Reduce `buffer_size` to 20-30 tokens; accept slightly lower accuracy |
This is 1 of 11 resources in the AI Builder Pro toolkit. Get the complete [AI Safety & Guardrails Kit] with all files, templates, and documentation for $39.
Or grab the entire AI Builder Pro bundle (11 products) for $169 — save 30%.