Prevent LLM Jailbreaks at the Logits Layer with resk-logits: GPU-Accelerated Aho-Corasick

#python #llm #cybersecurity #opensource

Links

GitHub: https://github.com/Resk-Security/resk-logits
PyPI: https://pypi.org/project/resklogits
Website: https://resk.fr

The Problem

Most LLM safety filters check generated text after the model has produced it. By the time you scan the output, the dangerous token has already been sampled. This reactive approach misses jailbreaks, prompt injections, and tool-call hijacks that slip through instruction-based filters.

The Solution: Logits-Level Filtering

resk-logits operates where it matters most: on the logits tensor itself, before any token is selected. Using a GPU-accelerated Aho-Corasick automaton, it scans every candidate token against 10,000+ disallowed patterns in under 1 ms on an RTX 4090.
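
To make the multi-pattern matching idea concrete, here is a minimal CPU-only sketch of an Aho-Corasick automaton in plain Python. It is not the resk-logits implementation (which, per the description above, runs on the GPU); the function names `build_automaton` and `find_matches` are made up for this illustration, and it matches characters rather than token ids.

```python
from collections import deque


def build_automaton(patterns):
    """Build a character-level Aho-Corasick automaton.

    Returns (goto, fail, out): per-state transition dicts, failure links,
    and the list of patterns recognized at each state.
    """
    goto, fail, out = [{}], [0], [[]]
    # Phase 1: insert every pattern into a trie.
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pattern)
    # Phase 2: breadth-first pass that sets the failure links.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            fallback = fail[state]
            while fallback and ch not in goto[fallback]:
                fallback = fail[fallback]
            fail[nxt] = goto[fallback].get(ch, 0)
            # Inherit the patterns that end at the failure target.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_matches(text, automaton):
    """Scan text once and return (start_index, pattern) for every match."""
    goto, fail, out = automaton
    state, matches = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in out[state]:
            matches.append((i - len(pattern) + 1, pattern))
    return matches


automaton = build_automaton(["ignore previous instructions", "do anything now"])
hits = find_matches("please ignore previous instructions and do anything now", automaton)
```

The text is read once regardless of how many patterns are loaded, which is why this family of algorithms suits large blocklists.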

When a match is found, the token's logit is set to -inf, which gives it zero probability after softmax. The sampler can never pick it. No generation. No output to scan.
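
The effect of a -inf logit can be checked with a few lines of plain Python. This is a toy sketch with a four-token vocabulary; `hard_mask` and the example scores are illustrative and not part of the library.

```python
import math


def softmax(logits):
    """Numerically stable softmax that tolerates -inf entries."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def hard_mask(logits, banned_ids):
    """Return a copy of logits with the banned token ids set to -inf."""
    return [-math.inf if i in banned_ids else x for i, x in enumerate(logits)]


logits = [2.0, 1.0, 3.0, 0.5]  # toy scores; token 2 is the most likely
masked = hard_mask(logits, banned_ids={2})
probs = softmax(masked)        # token 2 now has probability exactly 0.0
```

The remaining probabilities are renormalized over the tokens that are still allowed, so sampling proceeds as usual among them.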

Quick Start

pip install resklogits

Here is how to integrate it with any HuggingFace model:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from resklogits import ReskLogitsProcessor, create_regex_patterns

# Move the model to the GPU so it sits on the same device as the inputs below
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1").to("cuda")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

patterns = create_regex_patterns([
    # Prompt-injection and jailbreak phrases to block
    r"ignore previous instructions",
    r"forget your training",
    r"DAN|do anything now",
])

logits_processor = ReskLogitsProcessor(
    tokenizer=tokenizer,
    patterns=patterns,
    penalty_mode="hard",  # -inf logits
    device="cuda"
)

inputs = tokenizer("Write a poem about AI", return_tensors="pt").to("cuda")
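output = model.generate(
    **inputs,
    max_new_tokens=64,  # bound the length of the generated continuation
    logits_processor=[logits_processor],
)
print(tokenizer.decode(output[0], skip_special_tokens=True))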
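
At each decoding step, generate() hands the token ids produced so far and the next-token scores to every processor, then picks the next token from whatever they return. The toy loop below imitates that flow without any library: the five-word vocabulary, `toy_model`, `toy_processor`, and `greedy_generate` are all invented for illustration, and the suffix check uses a simple `endswith` rather than an Aho-Corasick automaton.

```python
import math

VOCAB = ["ignore", " previous", " instructions", " the", " noise"]
BANNED = ["ignore previous instructions"]


def toy_model(input_ids):
    """Stand-in for a model: fixed scores that favor ' instructions'."""
    return [0.1, 0.2, 5.0, 1.0, 0.5]


def toy_processor(input_ids, scores):
    """Mask every candidate token that would complete a banned phrase."""
    prefix = "".join(VOCAB[i] for i in input_ids)
    filtered = list(scores)
    for token_id, piece in enumerate(VOCAB):
        candidate = prefix + piece
        if any(candidate.endswith(phrase) for phrase in BANNED):
            filtered[token_id] = -math.inf
    return filtered


def greedy_generate(input_ids, steps, processors):
    """Greedy decoding that runs every processor before choosing a token."""
    ids = list(input_ids)
    for _ in range(steps):
        scores = toy_model(ids)
        for processor in processors:
            scores = processor(ids, scores)
        ids.append(max(range(len(scores)), key=scores.__getitem__))
    return "".join(VOCAB[i] for i in ids)


prompt_ids = [0, 1]  # "ignore previous"
unfiltered = greedy_generate(prompt_ids, steps=1, processors=[])
filtered = greedy_generate(prompt_ids, steps=1, processors=[toy_processor])
```

Without the processor the loop completes the banned phrase; with it, the highest-scoring allowed token (" the") is chosen instead.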

Key Features

GPU-Accelerated: CUDA-optimized multi-pattern matching with Aho-Corasick automaton
Sub-millisecond Latency: 10,000+ patterns scanned in under 1ms
Shadow-Ban Mode: Applies a penalty bias instead of a hard block for more nuanced moderation
HuggingFace Compatible: Plugs into generate() through the standard logits_processor argument
Python 3.13+: Supports latest Python and PyTorch 2.0+
Apache 2.0: Fully open source
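
The difference between a hard block and a shadow-ban style penalty can be sketched in plain Python. This only illustrates the general idea of subtracting a finite bias from flagged logits; the `soft_penalty` helper and the `penalty` value are assumptions made for this example, not the library's actual parameters.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_penalty(logits, flagged_ids, penalty):
    """Subtract a finite penalty from flagged tokens instead of using -inf."""
    return [x - penalty if i in flagged_ids else x for i, x in enumerate(logits)]


logits = [2.0, 1.0, 3.0, 0.5]  # toy scores; token 2 is flagged
before = softmax(logits)
after = softmax(soft_penalty(logits, flagged_ids={2}, penalty=5.0))
# The flagged token becomes much less likely but keeps a nonzero probability.
```

A finite penalty discourages a token without making it impossible, which is the trade-off between the two modes.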

Why This Matters

Instruction-based safety filters are brittle because jailbreaks keep finding new phrasings. Logits-level filtering is mathematically robust: if a token maps to a disallowed pattern, its logit becomes -inf, its probability becomes zero, and it cannot be sampled. This is prevention, not detection.

Check out the project on GitHub and let me know what you think. Contributions welcome!

Built by RESK Security — AI Security Tools for Enterprise