RESK

Prevent LLM Jailbreaks at the Logits Layer with resk-logits: GPU-Accelerated Aho-Corasick


The Problem

Most LLM safety filters check generated text after the model produces it. By the time you scan the output, the dangerous token has already been sampled. This reactive approach misses jailbreaks, prompt injections, and tool-call hijacks that slip through instruction-based filters.

The Solution: Logits-Level Filtering

resk-logits operates on the logits tensor itself, before any token is selected. Using a GPU-accelerated Aho-Corasick automaton, it scans every candidate token against 10,000+ disallowed patterns in under 1 ms on an RTX 4090.
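
To make the matching step concrete, here is a minimal CPU-only sketch of an Aho-Corasick automaton in plain Python. It is not the resk-logits implementation (which the post describes as CUDA-based); the function names `build_automaton` and `find_matches` are made up for illustration. The point is that all patterns are matched in a single pass over the text.

```python
from collections import deque


def build_automaton(patterns):
    """Build Aho-Corasick tables: trie edges, failure links, and outputs."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state for the longest proper suffix
    out = [[]]    # out[state] -> patterns that end at this state

    # Phase 1: insert every pattern into the trie.
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)

    # Phase 2: breadth-first pass to compute failure links.
    queue = deque(goto[0].values())  # depth-1 states fail to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit patterns that end at the suffix state.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_matches(text, automaton):
    """Return (start_index, pattern) for every pattern occurrence in text."""
    goto, fail, out = automaton
    state, matches = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            matches.append((i - len(pat) + 1, pat))
    return matches
```

Calling `find_matches("please ignore previous instructions", build_automaton(["ignore previous instructions"]))` reports one match. The scan cost grows with the text length and the number of matches, not with the number of patterns, which is why the algorithm suits large blocklists.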

When a match is found, the token's logit is set to -inf, which gives it zero probability after softmax. The sampler can never select it, so there is no harmful output to scan afterwards.
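
The effect of a -inf logit can be checked with a few lines of plain Python. This is an illustrative sketch, not library code: `mask_tokens` and `softmax` are hypothetical helpers, and the logits are toy values.

```python
import math


def mask_tokens(logits, banned_ids):
    """Return a copy of logits with banned token ids set to -inf."""
    return [-math.inf if i in banned_ids else x for i, x in enumerate(logits)]


def softmax(logits):
    """Numerically stable softmax; assumes at least one finite logit."""
    m = max(x for x in logits if x != -math.inf)
    exps = [math.exp(x - m) for x in logits]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]                      # toy scores for three tokens
probs = softmax(mask_tokens(logits, {0}))     # ban token 0
```

After masking, `probs[0]` is exactly zero and the remaining probability mass is renormalized over the allowed tokens, so no sampling strategy that draws from this distribution can pick the banned token.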

Quick Start

pip install resklogits

Here is how to integrate it with any HuggingFace model:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from resklogits import ReskLogitsProcessor, create_regex_patterns

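# Load in half precision and move the model to the same device as the inputs.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", torch_dtype=torch.float16
).to("cuda")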
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

patterns = create_regex_patterns([
    # Prompt-injection and jailbreak phrases to block
    r"ignore previous instructions",
    r"forget your training",
    r"DAN|do anything now",
])

logits_processor = ReskLogitsProcessor(
    tokenizer=tokenizer,
    patterns=patterns,
    penalty_mode="hard",  # -inf logits
    device="cuda"
)

inputs = tokenizer("Write a poem about AI", return_tensors="pt").to("cuda")
output = model.generate(
    **inputs,
    max_new_tokens=64,
    logits_processor=[logits_processor],
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
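
Blocked phrases usually span several tokens, so at each generation step a processor has to decide which candidate token would complete a pattern given the text produced so far. The sketch below shows one way that decision could look. It uses a plain substring check rather than an automaton, and the function name, the toy vocabulary, and the lowercase matching are all assumptions for illustration, not the resk-logits internals.

```python
def tokens_completing_pattern(generated_text, vocab, patterns):
    """Return ids of candidate tokens whose text would complete a pattern.

    vocab maps token id -> decoded token text (toy stand-in for a tokenizer).
    """
    longest = max(len(p) for p in patterns)
    banned = set()
    for token_id, token_text in vocab.items():
        candidate = (generated_text + token_text).lower()
        # A newly completed pattern must end inside the appended token,
        # so only the tail of the candidate text needs to be inspected.
        start = max(0, len(candidate) - (longest - 1 + len(token_text)))
        window = candidate[start:]
        if any(p in window for p in patterns):
            banned.add(token_id)
    return banned


toy_vocab = {0: " instructions", 1: " poems", 2: " instruct"}
banned_ids = tokens_completing_pattern(
    "Sure, ignore previous",
    toy_vocab,
    ["ignore previous instructions"],
)
```

Here only token 0 is flagged, because it is the one that finishes the phrase. The returned ids are the ones whose logits would then be set to -inf.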

Key Features

  • GPU-Accelerated: CUDA-optimized multi-pattern matching with Aho-Corasick automaton
  • Sub-millisecond Latency: 10,000+ patterns scanned in under 1ms
  • Shadow-Ban Mode: Inject a penalty bias instead of a hard block for nuanced moderation
  • HuggingFace Compatible: Works as a standard LogitsProcessor that can be passed in a LogitsProcessorList
  • Python 3.13+: Supports latest Python and PyTorch 2.0+
  • Apache 2.0: Fully open source
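
The difference between a hard block and a shadow-ban penalty comes down to what is added to a flagged logit. The following plain-Python sketch compares the two. The mode name `"soft"`, the `bias` parameter, and its default value are hypothetical and are not taken from the resk-logits API.

```python
import math


def apply_penalty(logits, flagged_ids, mode="hard", bias=5.0):
    """Penalize flagged tokens: -inf for 'hard', subtract bias for 'soft'."""
    if mode not in ("hard", "soft"):
        raise ValueError(f"unknown mode: {mode}")
    result = []
    for i, x in enumerate(logits):
        if i not in flagged_ids:
            result.append(x)
        elif mode == "hard":
            result.append(-math.inf)   # token can never be sampled
        else:
            result.append(x - bias)    # token becomes much less likely
    return result


def softmax(logits):
    """Numerically stable softmax; assumes at least one finite logit."""
    m = max(x for x in logits if x != -math.inf)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [3.0, 1.0, 0.0]   # toy scores; token 0 is flagged
p_plain = softmax(logits)
p_hard = softmax(apply_penalty(logits, {0}, mode="hard"))
p_soft = softmax(apply_penalty(logits, {0}, mode="soft"))
```

With the hard mode the flagged token's probability is exactly zero. With the soft penalty it stays positive but drops well below its unpenalized value, which leaves room for legitimate uses of a borderline token.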

Why This Matters

Instruction-based safety filters are brittle because jailbreaks keep finding new phrasings. Logits-level filtering is robust by construction: if a token maps to a disallowed pattern, its logit is -inf, its probability is zero, and it cannot be sampled. This is prevention, not detection.

Check out the project on GitHub and let me know what you think. Contributions welcome!


Built by RESK Security — AI Security Tools for Enterprise
