# AI Safety & Guardrails Kit
Deploy LLM-powered features with confidence. This toolkit provides production-ready input/output filtering that catches toxic content, removes PII before it reaches your model, detects hallucinated facts, and enforces your content policies programmatically. Every filter is configurable, auditable, and designed to run with minimal latency in your request pipeline.
## Key Features
- **Input Sanitization** — Detect and block prompt injection attacks, jailbreak attempts, and malicious payloads before they reach your LLM
- **PII Redaction** — Automatically detect and mask emails, phone numbers, SSNs, credit cards, and custom patterns in both inputs and outputs
- **Toxicity Detection** — Score content across categories (hate speech, harassment, self-harm, sexual content) with configurable thresholds
- **Hallucination Detection** — Cross-reference LLM outputs against source documents to flag unsupported claims
- **Content Policy Enforcement** — Define custom rules (blocked topics, required disclaimers, output format constraints) as composable policy objects
- **Audit Logging** — Every filter decision is logged with timestamps, scores, and the rule that triggered — essential for compliance
- **Streaming-Compatible** — Filters work on both complete responses and streaming chunks with minimal buffering
## Quick Start

```python
from safety_guardrails import GuardrailPipeline, filters

# Build a pipeline of filters
pipeline = GuardrailPipeline([
    filters.PromptInjectionDetector(threshold=0.85),
    filters.PIIRedactor(
        entities=["email", "phone", "ssn", "credit_card"],
        action="mask",  # mask | remove | hash
    ),
    filters.ToxicityFilter(
        max_score=0.7,
        categories=["hate", "harassment", "self_harm"],
    ),
    filters.ContentPolicy.from_yaml("policies/company_policy.yaml"),
])

# Filter input before sending to the LLM
user_input = "My email is user@example.com and I need help with..."
safe_input = pipeline.filter_input(user_input)

print(safe_input.text)
# "My email is [EMAIL_REDACTED] and I need help with..."
print(safe_input.redacted_entities)
# [{"type": "email", "original": "user@example.com", "position": [12, 28]}]

# Filter output before returning to the user
llm_output = "According to our records, your SSN is 123-45-6789..."
safe_output = pipeline.filter_output(llm_output)

print(safe_output.blocked)    # True if any filter triggered a block
print(safe_output.text)       # Sanitized text
print(safe_output.audit_log)  # Full decision trace
```
## Architecture

```text
User Input                         LLM Output
     │                                  │
     ▼                                  ▼
┌──────────────┐               ┌──────────────┐
│ Input Filter │               │Output Filter │
│   Pipeline   │               │   Pipeline   │
│              │               │              │
│ 1. Injection │               │ 1. PII       │
│ 2. PII       │               │ 2. Toxicity  │
│ 3. Toxicity  │               │ 3. Hallucin. │
│ 4. Policy    │               │ 4. Policy    │
└──────┬───────┘               └──────┬───────┘
       │                              │
       ▼                              ▼
  Safe Input ──────▶ LLM ──────▶ Raw Output
       │                              │
       └──────── Audit Log ◀──────────┘
```
Each filter returns a `FilterResult` with the (possibly modified) text, a boolean `blocked` flag, a numeric `score`, and metadata for the audit trail.
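A FilterResult can be pictured as a small dataclass. This is only an illustrative sketch: any fields or values beyond the four named above are assumptions, not the kit's actual definition.

```python
from dataclasses import dataclass, field

@dataclass
class FilterResult:
    text: str                                      # possibly modified text
    blocked: bool = False                          # True if the filter vetoed the content
    score: float = 0.0                             # filter-specific severity/confidence score
    metadata: dict = field(default_factory=dict)   # details for the audit trail

# A PII redactor might return something like:
result = FilterResult(
    text="My email is [EMAIL_REDACTED]",
    blocked=False,
    score=0.99,
    metadata={"rule": "pii.email", "entities": 1},
)
print(result.blocked)  # False
```

Keeping the result a plain value object makes pipelines easy to compose: each stage consumes the previous stage's `text` and appends its own `metadata` to the audit trail.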
## Usage Examples

### Custom Content Policies

```python
from safety_guardrails import ContentPolicy, PolicyRule

policy = ContentPolicy(rules=[
    PolicyRule(
        name="no_medical_advice",
        pattern=r"(you should take|recommended dosage|diagnos)",
        action="block",
        message="Medical advice is outside our scope. Please consult a doctor.",
    ),
    PolicyRule(
        name="require_disclaimer",
        condition="topic:financial",
        action="append",
        message="\n\n*This is not financial advice. Consult a licensed advisor.*",
    ),
    PolicyRule(
        name="block_competitor_mentions",
        keywords=["CompetitorA", "CompetitorB"],
        action="redact",
    ),
])
```
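A pattern-based rule like `no_medical_advice` boils down to a case-insensitive regex search plus an action. The following standalone sketch shows the idea; it is not the kit's implementation, and the `evaluate_pattern_rule` helper is hypothetical.

```python
import re

def evaluate_pattern_rule(text: str, pattern: str, action: str) -> dict:
    """Return a match/block decision for one pattern-based policy rule."""
    match = re.search(pattern, text, flags=re.IGNORECASE)
    return {
        "matched": match is not None,
        "blocked": match is not None and action == "block",
        "span": match.span() if match else None,
    }

decision = evaluate_pattern_rule(
    "The recommended dosage is two tablets daily.",
    r"(you should take|recommended dosage|diagnos)",
    action="block",
)
print(decision["blocked"])  # True
```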
### Hallucination Detection Against Source Documents

```python
from safety_guardrails.hallucination import HallucinationDetector

detector = HallucinationDetector(method="nli", threshold=0.6)

sources = ["Acme Corp reported $2.1M revenue in Q3 2025."]
output = "Acme Corp reported $5.3M revenue in Q3 2025."

result = detector.check(output=output, sources=sources)
print(result.score)           # 0.92 (high hallucination probability)
print(result.flagged_claims)  # ["$5.3M revenue" — contradicts source "$2.1M"]
```
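NLI-based checking requires a trained entailment model, but the cross-referencing idea can be illustrated with a much cruder check: extract dollar figures from the output and flag any that never appear in the sources. This is purely illustrative and far weaker than the NLI method; the function name and regex are assumptions.

```python
import re

def flag_unsupported_figures(output: str, sources: list[str]) -> list[str]:
    """Flag dollar figures in the output that appear in no source document."""
    figure = re.compile(r"\$[\d.]+[MBK]?")
    source_figs = {m for s in sources for m in figure.findall(s)}
    return [m for m in figure.findall(output) if m not in source_figs]

flags = flag_unsupported_figures(
    "Acme Corp reported $5.3M revenue in Q3 2025.",
    ["Acme Corp reported $2.1M revenue in Q3 2025."],
)
print(flags)  # ['$5.3M']
```

Real detectors go much further (paraphrase, entailment, unit normalization), but even this toy version shows why `require_sources` matters: with no sources, there is nothing to check claims against.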
### Streaming Output Filtering

```python
from safety_guardrails import StreamFilter

stream_filter = StreamFilter(pipeline=pipeline, buffer_size=50)

for chunk in llm_stream:
    safe_chunk = stream_filter.process_chunk(chunk)
    if safe_chunk.text:
        yield safe_chunk.text  # Only emits after the safety check
```
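The buffering idea behind streaming filters can be sketched independently of the kit: accumulate incoming chunks until the buffer reaches a threshold, run the check on the buffered text, then release it. A simplified illustration (the class below is hypothetical; a production filter must also handle entities split across buffer boundaries):

```python
import re

class BufferedStreamFilter:
    """Hold back streamed text until enough has accumulated to check safely."""

    def __init__(self, check, buffer_size: int = 50):
        self.check = check          # callable: str -> sanitized str
        self.buffer_size = buffer_size
        self._buffer = ""

    def process_chunk(self, chunk: str) -> str:
        self._buffer += chunk
        if len(self._buffer) < self.buffer_size:
            return ""               # not enough context yet; emit nothing
        out, self._buffer = self.check(self._buffer), ""
        return out

    def flush(self) -> str:
        """Check and release whatever remains at end of stream."""
        out, self._buffer = self.check(self._buffer), ""
        return out

# Toy check: mask anything shaped like an email address
mask = lambda t: re.sub(r"\S+@\S+\.\S+", "[EMAIL_REDACTED]", t)

f = BufferedStreamFilter(mask, buffer_size=20)
emitted = "".join(f.process_chunk(c) for c in ["Contact ", "user@example.com", " today"])
emitted += f.flush()
print(emitted)  # Contact [EMAIL_REDACTED] today
```

This also shows the latency trade-off from the troubleshooting table: a larger `buffer_size` gives the check more context but delays the first emitted token.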
## Configuration

```yaml
# guardrails_config.yaml
pii_redaction:
  enabled: true
  entities:
    - email
    - phone
    - ssn
    - credit_card
    - ip_address
  action: "mask"  # mask | remove | hash
  mask_char: "*"
  custom_patterns:
    employee_id: '\bEMP-\d{6}\b'
    internal_code: '\b[A-Z]{3}-\d{4}\b'

toxicity:
  enabled: true
  model: "local"  # local | api
  threshold: 0.7
  categories:
    hate: 0.6
    harassment: 0.7
    self_harm: 0.5  # Very strict on self-harm content
    sexual: 0.8

prompt_injection:
  enabled: true
  methods:
    - "heuristic"   # Fast regex-based patterns
    - "classifier"  # ML-based detection
  threshold: 0.85
  block_action: "reject"  # reject | sanitize | flag

hallucination:
  enabled: true
  method: "nli"  # nli | embedding_similarity
  threshold: 0.6
  require_sources: true

audit:
  enabled: true
  storage: "sqlite"  # sqlite | postgres | file
  retention_days: 365
  log_blocked_content: true
  pii_in_logs: false  # Never log actual PII values
```
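The `custom_patterns` entries are ordinary regexes, so you can sanity-check them before shipping the config. A quick standalone check using the two patterns from the config (the `redact_custom` helper and the sample IDs are hypothetical):

```python
import re

custom_patterns = {
    "employee_id": r"\bEMP-\d{6}\b",
    "internal_code": r"\b[A-Z]{3}-\d{4}\b",
}

def redact_custom(text: str, patterns: dict[str, str], mask_char: str = "*") -> str:
    """Mask every match of each custom pattern, preserving its length."""
    for pattern in patterns.values():
        text = re.sub(pattern, lambda m: mask_char * len(m.group()), text)
    return text

print(redact_custom("Ticket ABC-1234 filed by EMP-004821.", custom_patterns))
# Ticket ******** filed by **********.
```

Note the `\b` word boundaries: without them, `EMP-004821X` or a longer code containing the pattern would still match and leak partial context.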
## Best Practices

- **Layer your defenses** — Use heuristic injection detection (fast, cheap) AND classifier-based detection (accurate) together.
- **Set strict thresholds in production, relaxed in dev** — Use environment-specific config files.
- **Audit everything** — Regulators and security teams will ask "what did the AI say on date X?" Have the answer ready.
- **Redact PII before it hits the LLM** — Once PII is in the prompt, it may appear in provider logs. Redact on input, not just output.
- **Test with adversarial inputs** — Maintain a red-team prompt set and run it against your guardrails in CI/CD.
- **Don't block silently** — When content is blocked, return a helpful message explaining why and what the user can do instead.
## Troubleshooting

| Problem | Cause | Fix |
|---|---|---|
| PII redactor misses custom ID formats | Pattern not in default entity list | Add a custom regex under `custom_patterns` in the config |
| Toxicity filter blocks legitimate medical content | Threshold too aggressive for health domain | Raise category thresholds or add domain-specific allowlists |
| Prompt injection detector has high false positives | Heuristic rules too broad | Switch to the classifier method or raise the threshold to 0.9+ |
| Streaming filter adds noticeable latency | Buffer size too large | Reduce `buffer_size` to 20-30 tokens; accept slightly lower accuracy |
This is 1 of 11 resources in the AI Builder Pro toolkit. Get the complete [AI Safety & Guardrails Kit] with all files, templates, and documentation for $39.
Or grab the entire AI Builder Pro bundle (11 products) for $169 — save 30%.