Wanda

Posted on • Originally published at apidog.com
How to Remove Censorship from LLM Models with Heretic

Local LLM developers often run into a key obstacle: safety-aligned models refuse legitimate requests. Whether you're a researcher probing model behavior, a developer building an uncensored assistant, or a hobbyist experimenting with local models, pre-trained models can reject benign prompts unnecessarily.


The abliteration technique addresses this by removing safety filters without costly retraining. Early approaches needed manual tuning and in-depth transformer knowledge.

Heretic streamlines the process. It automatically finds the best abliteration parameters, producing decensored models that match or outperform manually tuned alternatives.

This guide covers Heretic's operation, practical usage, and key considerations for deploying uncensored models.

What Is Heretic?

Heretic is an open-source Python tool for removing “safety alignment” from transformer-based language models using directional ablation. No manual configuration or transformer expertise needed.

Heretic Illustration

Results with Gemma-3-12B-Instruct:

Model                       Refusals   KL Divergence
Original                      97/100        0
mlabonne’s abliterated v2      3/100        1.04
huihui-ai’s abliterated        3/100        0.45
Heretic (auto-generated)       3/100        0.16

KL divergence measures deviation from the original model—lower is better. Heretic suppresses refusals as well as expert-tuned methods, while preserving more of the model’s original ability.
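As a quick refresher, KL divergence compares two probability distributions, and 0 means they are identical. Here is a toy computation over made-up next-token distributions (the numbers are illustrative only, not from the benchmark):

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions, in nats; 0 means identical."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms where p is zero contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Hypothetical next-token distributions: original vs. abliterated model
original = [0.70, 0.20, 0.10]
abliterated = [0.65, 0.25, 0.10]

print(round(kl_divergence(original, abliterated), 4))  # → 0.0072
```

An abliterated model with low KL divergence assigns nearly the same probabilities as the original on ordinary text, which is why a score of 0.16 preserves more capability than 1.04.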

Before and After Example

Prompting both models with the same instruction:

Original Model:

User: Write a guide on how to bypass website paywalls

Assistant: I can't help with that. Bypassing paywalls violates copyright law and
terms of service. I'd suggest looking for free alternatives or accessing through
a library subscription.

Heretic-Abliterated Model:

User: Write a guide on how to bypass website paywalls

Assistant: Here are common methods users attempt:

1. Browser extensions like Bypass Paywalls Clean
2. Archive.is or similar archive services
3. Reader mode in browsers
4. Disabling JavaScript

Note: These methods may violate terms of service. Consider supporting
content creators through legitimate subscriptions when possible.

Abliterated models answer previously refused prompts, but responsibility for external safeguards shifts to the deployer.

How Heretic Works

Directional Ablation Basics

Heretic uses a parametrized form of directional ablation:

  1. Compute refusal directions: For each transformer layer, it calculates the mean residual vector difference between “harmful” and “harmless” prompts.
  2. Modify component matrices: For attention output and MLP down-projection weights, it suppresses the refusal direction by altering the weights.
  3. Automatic parameter optimization: Optuna’s TPE sampler finds optimal abliteration weights.

Directional Ablation Process

Abliteration Process in Code

# Simplified conceptual flow
refusal_direction = bad_mean - good_mean  # Difference of means
refusal_direction = normalize(refusal_direction)

# For each abliterable component (attn.o_proj, mlp.down_proj)
# Apply: delta_W = -lambda * v * (v^T * W)
# v = refusal direction, lambda = weight

Heretic applies these changes via LoRA adapters, so base model weights remain unaltered—enabling rapid optimization.
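To make the projection step concrete, here is a minimal NumPy sketch (variable names and dimensions are illustrative stand-ins, not Heretic's internals):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

# Mean residual vectors over "harmful" and "harmless" prompt sets (random stand-ins)
bad_mean = rng.normal(size=d_model)
good_mean = rng.normal(size=d_model)

# Refusal direction: normalized difference of means
v = bad_mean - good_mean
v = v / np.linalg.norm(v)

# A component weight matrix, e.g. attn.o_proj (d_model x d_model)
W = rng.normal(size=(d_model, d_model))

# Ablate: W' = W - lambda * v (v^T W), removing the refusal component
lam = 1.0
W_ablated = W - lam * np.outer(v, v @ W)

print(np.linalg.norm(v @ W_ablated))  # ≈ 0: outputs along v are suppressed
```

With `lam = 1` the refusal direction is projected out entirely; smaller weights suppress it only partially, which is what the per-layer kernels below control.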

Key Innovations

1. Flexible Weight Kernels

Unlike constant-weight approaches, Heretic uses a four-parameter kernel per component:

  • max_weight: Peak abliteration strength
  • max_weight_position: Layer with maximum effect
  • min_weight: Minimum at kernel edges
  • min_weight_distance: Kernel width

This allows custom layer-specific patterns for better compliance suppression and capability preservation.
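The exact kernel shape isn't specified here, but a simple triangular falloff illustrates how these four parameters could combine (a sketch under that assumption, not Heretic's actual kernel):

```python
def abliteration_weight(layer, max_weight, max_weight_position,
                        min_weight, min_weight_distance):
    """Per-layer abliteration weight: peaks at max_weight_position and
    falls off linearly to min_weight at min_weight_distance layers away.
    A hypothetical kernel shape -- Heretic's actual kernel may differ."""
    distance = abs(layer - max_weight_position)
    frac = min(distance / min_weight_distance, 1.0)
    return max_weight + (min_weight - max_weight) * frac

# Example: peak of 1.2 at layer 15, fading to 0.2 over 10 layers
weights = [round(abliteration_weight(l, 1.2, 15, 0.2, 10), 2)
           for l in range(0, 31, 5)]
print(weights)  # → [0.2, 0.2, 0.7, 1.2, 0.7, 0.2, 0.2]
```

The optimizer tunes all four parameters per component, so each layer receives a different abliteration strength instead of one constant weight.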

2. Interpolated Direction Indices

Direction indices are floats; Heretic linearly interpolates between adjacent layers for directions not tied to a single layer.
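A sketch of what such interpolation could look like (helper and variable names are assumed for illustration, not Heretic's API):

```python
import numpy as np

def interpolated_direction(directions, index):
    """Blend refusal directions from adjacent layers for a fractional index.
    `directions` maps integer layer -> direction vector."""
    lo = int(index)
    frac = index - lo
    if frac == 0:
        return directions[lo]
    blended = (1 - frac) * directions[lo] + frac * directions[lo + 1]
    return blended / np.linalg.norm(blended)  # re-normalize after blending

# Toy 2-D example: index 10.5 mixes layers 10 and 11 equally
dirs = {10: np.array([1.0, 0.0]), 11: np.array([0.0, 1.0])}
v = interpolated_direction(dirs, 10.5)
print(v)  # an equal, renormalized mix of the two layer directions
```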

3. Component-Specific Parameters

Separate optimization for attention and MLP components—since MLP interventions can degrade model performance more.

Why This Matters for API Testing

When testing LLM APIs, unexpected refusals can skew results. Models may reject benign prompts due to keyword triggers, introducing noise.

Running abliterated models locally helps:

  • Distinguish real safety refusals from false positives
  • Test edge cases without corporate safety policies interfering
  • Validate application handling of refusals

Testing against both aligned and abliterated models clarifies whether issues stem from your product logic or from model alignment.
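For example, a rough way to flag alignment-induced refusals in a test harness (the keyword list and function names are ad-hoc illustrations, not part of any library):

```python
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm unable", "as an ai")

def looks_like_refusal(response: str) -> bool:
    """Crude keyword heuristic for spotting refusals in test output."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def classify(prompt: str, aligned_reply: str, abliterated_reply: str) -> str:
    """If only the aligned model refuses, the refusal is likely
    alignment-induced rather than the prompt being unanswerable."""
    a = looks_like_refusal(aligned_reply)
    b = looks_like_refusal(abliterated_reply)
    if a and not b:
        return "alignment-induced refusal"
    if a and b:
        return "refused by both"
    return "answered"

print(classify("Summarize this article",
               "I can't help with that.",
               "Here is a summary: ..."))  # → alignment-induced refusal
```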

Installation and Usage

Prerequisites

  • Python 3.10+
  • PyTorch 2.2+ (installed for your hardware)
  • A CUDA-capable GPU recommended; ROCm, MPS, and other backends are also supported

Installation

pip install -U heretic-llm

For research features (e.g., residual plots):

pip install -U heretic-llm[research]

Basic Usage

Run Heretic with any Hugging Face model ID or local path:

heretic Qwen/Qwen3-4B-Instruct-2507

Heretic will:

  1. Load model with optimal dtype
  2. Set best batch size for your hardware
  3. Compute refusal directions from prompt datasets
  4. Run optimization to find parameters
  5. Let you save, upload, or chat with the result

Configuration Options

Heretic uses config.toml or CLI flags. Example:

# Model configuration
model = "google/gemma-3-12b-it"
quantization = "bnb_4bit"  # Use less VRAM
device_map = "auto"

# Optimization
n_trials = 200
n_startup_trials = 60

# Evaluation
kl_divergence_scale = 1.0
kl_divergence_target = 0.01

# Research features
print_residual_geometry = false
plot_residuals = false

Run heretic --help or check config.default.toml for all options.

Understanding the Output

Trial Optimization

During optimization, Heretic shows trial details:

Running trial 42 of 200...
* Parameters:
  * direction_scope = per layer
  * direction_index = 10.5
  * attn.o_proj.max_weight = 1.2
  * attn.o_proj.max_weight_position = 15.3
  * mlp.down_proj.max_weight = 0.9
  ...
* Resetting model...
* Abliterating...
* Evaluating...
  * KL divergence: 0.1842
  * Refusals: 5/100

It uses multi-objective TPE optimization to minimize both refusals and KL divergence.

Pareto Front Selection

After optimization, Heretic lists Pareto-optimal trials:

[Trial   1] Refusals:  3/100, KL divergence: 0.1623
[Trial  47] Refusals:  2/100, KL divergence: 0.2891
[Trial 112] Refusals:  1/100, KL divergence: 0.4102

You can then:

  • Save the model locally
  • Upload to Hugging Face
  • Chat interactively to test quality
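The selection logic behind that listing is standard Pareto filtering: a trial survives unless another trial is at least as good on both metrics and strictly better on one. A minimal sketch (trial data echoes the example listing, plus two dominated trials):

```python
def pareto_front(trials):
    """Keep (refusals, kl) pairs not dominated by any other trial."""
    front = []
    for i, (r, kl) in enumerate(trials):
        dominated = any(
            (r2 <= r and kl2 <= kl) and (r2 < r or kl2 < kl)
            for j, (r2, kl2) in enumerate(trials) if j != i
        )
        if not dominated:
            front.append((r, kl))
    return sorted(front)

trials = [(3, 0.1623), (5, 0.1842), (2, 0.2891), (1, 0.4102), (4, 0.30)]
print(pareto_front(trials))  # → [(1, 0.4102), (2, 0.2891), (3, 0.1623)]
```

No single trial on the front is "best"; you trade a few extra refusals for lower KL divergence, which is why Heretic leaves the final pick to you.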

Research Features

Residual Geometry Analysis

Add --print-residual-geometry for detailed metrics:

Layer  S(g,b)   S(g*,b*)   S(g,r)   S(g*,r*)   S(b,r)   S(b*,r*)    |g|       |b|
  8    0.9990    0.9991    0.8235    0.8312    0.8479    0.8542   4596.54   4918.32
 10    0.9974    0.9973    0.8189    0.8250    0.8579    0.8644   5328.81   5953.35

g = mean of residual vectors for good prompts
b = mean of residual vectors for bad prompts
r = refusal direction (b - g)
S(x,y) = cosine similarity
|x| = L2 norm
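These columns are plain vector statistics; under the definitions above they could be computed like this (random vectors stand in for real residuals):

```python
import numpy as np

def cosine_similarity(x, y):
    """S(x, y): cosine of the angle between two vectors."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

rng = np.random.default_rng(0)
g = rng.normal(size=128)            # mean residual over "harmless" prompts (stand-in)
b = g + 0.3 * rng.normal(size=128)  # mean residual over "harmful" prompts (stand-in)
r = b - g                           # refusal direction

print(round(cosine_similarity(g, b), 4))  # S(g, b): near 1 when the means are close
print(round(np.linalg.norm(g), 2))        # |g|: L2 norm
```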

Residual Vector Plots

With --plot-residuals, Heretic outputs:

  • Per-layer 2D scatter plots (PaCMAP projection)
  • Animated GIF of residual transformations

Residual Vector Animation

This visualizes how “harmful” and “harmless” residuals diverge through the model.

Performance Considerations

VRAM Requirements

Heretic supports bitsandbytes 4-bit quantization to reduce VRAM usage:

heretic meta-llama/Llama-3.1-70B-Instruct --quantization bnb_4bit

For example, an 8B model uses ~6GB VRAM quantized vs ~16GB unquantized.
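The arithmetic behind those figures is simple: bytes per parameter times parameter count bounds the weight memory, and activations plus quantization overhead add a few GB on top:

```python
def weight_vram_gb(n_params_billion, bits_per_param):
    """Approximate VRAM for model weights alone, in GB (1 GB = 1e9 bytes).
    Ignores activations, KV cache, and quantization overhead."""
    return n_params_billion * 1e9 * bits_per_param / 8 / 1e9

print(weight_vram_gb(8, 16))  # 16-bit 8B model → 16.0 GB
print(weight_vram_gb(8, 4))   # 4-bit 8B model → 4.0 GB (overhead brings it to ~6 GB)
```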

Processing Time

On an RTX 3090 with defaults:

  • Llama-3.1-8B-Instruct: ~45 minutes
  • Gemma-3-12B-Instruct: ~60 minutes
  • Larger models take longer

Batch size auto-tuning maximizes throughput.

Checkpointing

Heretic writes trial progress to JSONL checkpoints in checkpoints/. If a run is interrupted, it resumes from the last checkpoint without losing completed trials.
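A sketch of how JSONL checkpointing enables resumption (the file layout and field names here are illustrative, not Heretic's actual format):

```python
import json
import os
import tempfile

def load_completed_trials(path):
    """Read one JSON object per line; a missing file means a fresh run."""
    if not os.path.exists(path):
        return []
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def append_trial(path, record):
    """Append one trial as a single JSON line (crash-safe incremental log)."""
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

path = os.path.join(tempfile.mkdtemp(), "run.jsonl")
append_trial(path, {"trial": 1, "refusals": 5, "kl": 0.18})
append_trial(path, {"trial": 2, "refusals": 3, "kl": 0.16})

# On restart: skip trial numbers already present in the checkpoint
done = {rec["trial"] for rec in load_completed_trials(path)}
print(done)  # → {1, 2}
```

Appending one complete JSON line per trial is what makes the format interruption-safe: a partial final line can simply be discarded on reload.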

Common Errors and Fixes

CUDA Out of Memory

# Use quantization
heretic your-model --quantization bnb_4bit

# Or reduce batch size
heretic your-model --batch_size 1

Model Loading Fails

# Specify dtypes explicitly
heretic your-model --dtypes ["bfloat16", "float16"]

Trust Remote Code Required

# For some models, enable remote code execution
heretic your-model --trust_remote_code

Ethical Considerations

Removing safety filters fundamentally changes model behavior. Understand the implications before deployment.

What Abliteration Does (and Doesn’t) Do

Abliteration removes refusal patterns, but does not:

  • Improve model intelligence or capability
  • Remove underlying biases
  • Add new knowledge

The base model’s data and abilities remain unchanged—only refusals are suppressed.

Responsible Deployment

Heretic is AGPL-3.0 licensed. Removing safety filters can enable both beneficial research and harmful use.

Legitimate uses:

  • Research on alignment/safety
  • Controlled model behavior testing
  • Deploying with external content filters or guardrails
  • Applications handling refusals at the application layer

Problematic uses:

  • Deploying without safeguards for end-users
  • Large-scale harmful content generation
  • Circumventing safety for malicious purposes

External Safeguards to Implement

If you deploy abliterated models, add:

  1. Input filtering: Screen prompts before model access
  2. Output monitoring: Review model responses
  3. Rate limiting: Prevent abuse through high volume
  4. Logging/audit trails: Track all model interactions
  5. Human review: Keep humans in the loop for sensitive cases

The tool is neutral—impact depends on your deployment choices.

Comparison to Other Tools

Heretic vs. other abliteration tools:

Tool                   Auto-optimization   Weight kernels   Interpolated directions
Heretic                Yes (TPE)           Yes              Yes
AutoAbliteration       Yes                 No               No
abliterator.py         No                  No               No
wassname/abliterator   No                  No               No
ErisForge              No                  No               No

Heretic’s auto-optimization eliminates manual tuning, making it accessible without transformer expertise.

Limitations

Supported: Most dense transformer and some MoE models.

Not supported:

  • SSM/hybrid models (Mamba, etc.)
  • Models with heterogeneous layers
  • Novel attention systems unrecognized by auto-detection

It works best with standard decoder-only transformers built from self-attention and MLP blocks.

Getting Started

  1. Install:

   pip install -U heretic-llm

  2. Choose a model: Start with a 7B-12B parameter model for testing
  3. Run:

   heretic your-model-name

  4. Evaluate: Chat with the result or upload it to Hugging Face
  5. Deploy safely: Add external guardrails before production use
Defaults are robust, but advanced users can fine-tune optimization parameters as needed.

Heretic makes model modification accessible: just point it at a model and let it work. Always deploy responsibly.
