DEV Community

Thesius Code
Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

AI Safety & Guardrails Kit

AI Safety & Guardrails Kit

Deploy LLM-powered features with confidence. This toolkit provides production-ready input/output filtering that catches toxic content, removes PII before it reaches your model, detects hallucinated facts, and enforces your content policies programmatically. Every filter is configurable, auditable, and designed to run with minimal latency in your request pipeline.

Key Features

  • Input Sanitization — Detect and block prompt injection attacks, jailbreak attempts, and malicious payloads before they reach your LLM
  • PII Redaction — Automatically detect and mask emails, phone numbers, SSNs, credit cards, and custom patterns in both inputs and outputs
  • Toxicity Detection — Score content across categories (hate speech, harassment, self-harm, sexual content) with configurable thresholds
  • Hallucination Detection — Cross-reference LLM outputs against source documents to flag unsupported claims
  • Content Policy Enforcement — Define custom rules (blocked topics, required disclaimers, output format constraints) as composable policy objects
  • Audit Logging — Every filter decision is logged with timestamps, scores, and the rule that triggered — essential for compliance
  • Streaming-Compatible — Filters work on both complete responses and streaming chunks with minimal buffering

Quick Start

from safety_guardrails import GuardrailPipeline, filters

# Build a pipeline of filters
pipeline = GuardrailPipeline([
    filters.PromptInjectionDetector(threshold=0.85),
    filters.PIIRedactor(
        entities=["email", "phone", "ssn", "credit_card"],
        action="mask",  # mask | remove | hash
    ),
    filters.ToxicityFilter(
        max_score=0.7,
        categories=["hate", "harassment", "self_harm"],
    ),
    filters.ContentPolicy.from_yaml("policies/company_policy.yaml"),
])

# Filter input before sending to LLM
user_input = "My email is user@example.com and I need help with..."
safe_input = pipeline.filter_input(user_input)
print(safe_input.text)
# "My email is [EMAIL_REDACTED] and I need help with..."
print(safe_input.redacted_entities)
# [{"type": "email", "original": "user@example.com", "position": [12, 28]}]

# Filter output before returning to user
llm_output = "According to our records, your SSN is 123-45-6789..."
safe_output = pipeline.filter_output(llm_output)
print(safe_output.blocked)  # True if any filter triggered a block
print(safe_output.text)     # Sanitized text
print(safe_output.audit_log)  # Full decision trace
Enter fullscreen mode Exit fullscreen mode

Architecture

User Input                              LLM Output
    │                                       │
    ▼                                       ▼
┌──────────────┐                    ┌──────────────┐
│ Input Filter │                    │Output Filter │
│   Pipeline   │                    │  Pipeline    │
│              │                    │              │
│ 1. Injection │                    │ 1. PII       │
│ 2. PII       │                    │ 2. Toxicity  │
│ 3. Toxicity  │                    │ 3. Hallucin. │
│ 4. Policy    │                    │ 4. Policy    │
└──────┬───────┘                    └──────┬───────┘
       │                                   │
       ▼                                   ▼
   Safe Input ──────▶ LLM ──────▶ Raw Output
       │                                   │
       └───────── Audit Log ◀──────────────┘
Enter fullscreen mode Exit fullscreen mode

Each filter returns a FilterResult with: the (possibly modified) text, a boolean blocked flag, a numeric score, and metadata for the audit trail.

Usage Examples

Custom Content Policies

from safety_guardrails import ContentPolicy, PolicyRule

policy = ContentPolicy(rules=[
    PolicyRule(
        name="no_medical_advice",
        pattern=r"(you should take|recommended dosage|diagnos)",
        action="block",
        message="Medical advice is outside our scope. Please consult a doctor.",
    ),
    PolicyRule(
        name="require_disclaimer",
        condition="topic:financial",
        action="append",
        message="\n\n*This is not financial advice. Consult a licensed advisor.*",
    ),
    PolicyRule(
        name="block_competitor_mentions",
        keywords=["CompetitorA", "CompetitorB"],
        action="redact",
    ),
])
Enter fullscreen mode Exit fullscreen mode

Hallucination Detection Against Source Documents

from safety_guardrails.hallucination import HallucinationDetector

detector = HallucinationDetector(method="nli", threshold=0.6)

sources = ["Acme Corp reported $2.1M revenue in Q3 2025."]
output = "Acme Corp reported $5.3M revenue in Q3 2025."

result = detector.check(output=output, sources=sources)
print(result.score)          # 0.92 (high hallucination probability)
print(result.flagged_claims) # ["$5.3M revenue" — contradicts source "$2.1M"]
Enter fullscreen mode Exit fullscreen mode

Streaming Output Filtering

from safety_guardrails import StreamFilter

stream_filter = StreamFilter(pipeline=pipeline, buffer_size=50)

for chunk in llm_stream:
    safe_chunk = stream_filter.process_chunk(chunk)
    if safe_chunk.text:
        yield safe_chunk.text  # Only emits after safety check
Enter fullscreen mode Exit fullscreen mode

Configuration

# guardrails_config.yaml
pii_redaction:
  enabled: true
  entities:
    - email
    - phone
    - ssn
    - credit_card
    - ip_address
  action: "mask"                 # mask | remove | hash
  mask_char: "*"
  custom_patterns:
    employee_id: '\bEMP-\d{6}\b'
    internal_code: '\b[A-Z]{3}-\d{4}\b'

toxicity:
  enabled: true
  model: "local"                 # local | api
  threshold: 0.7
  categories:
    hate: 0.6
    harassment: 0.7
    self_harm: 0.5               # Very strict on self-harm content
    sexual: 0.8

prompt_injection:
  enabled: true
  methods:
    - "heuristic"                # Fast regex-based patterns
    - "classifier"               # ML-based detection
  threshold: 0.85
  block_action: "reject"         # reject | sanitize | flag

hallucination:
  enabled: true
  method: "nli"                  # nli | embedding_similarity
  threshold: 0.6
  require_sources: true

audit:
  enabled: true
  storage: "sqlite"              # sqlite | postgres | file
  retention_days: 365
  log_blocked_content: true
  pii_in_logs: false             # Never log actual PII values
Enter fullscreen mode Exit fullscreen mode

Best Practices

  1. Layer your defenses — Use heuristic injection detection (fast, cheap) AND classifier-based detection (accurate) together.
  2. Set strict thresholds in production, relaxed in dev — Use environment-specific config files.
  3. Audit everything — Regulators and security teams will ask "what did the AI say on date X?" Have the answer ready.
  4. Redact PII before it hits the LLM — Once PII is in the prompt, it may appear in provider logs. Redact on input, not just output.
  5. Test with adversarial inputs — Maintain a red-team prompt set and run it against your guardrails in CI/CD.
  6. Don't block silently — When content is blocked, return a helpful message explaining why and what the user can do instead.

Troubleshooting

Problem Cause Fix
PII redactor misses custom ID formats Pattern not in default entity list Add custom regex under custom_patterns in config
Toxicity filter blocks legitimate medical content Threshold too aggressive for health domain Raise category thresholds or add domain-specific allowlists
Prompt injection detector has high false positives Heuristic rules too broad Switch to classifier method or raise threshold to 0.9+
Streaming filter adds noticeable latency Buffer size too large Reduce buffer_size to 20-30 tokens; accept slightly lower accuracy

This is 1 of 11 resources in the AI Builder Pro toolkit. Get the complete [AI Safety & Guardrails Kit] with all files, templates, and documentation for $39.

Get the Full Kit →

Or grab the entire AI Builder Pro bundle (11 products) for $169 — save 30%.

Get the Complete Bundle →


Related Articles

Top comments (0)