Wanda

Posted on • Originally published at apidog.com
How to Remove Censorship from LLM Models with Heretic

Local LLM developers often run into a key obstacle: safety-aligned models refuse legitimate requests. Whether you're a researcher probing model behavior, a developer building an uncensored assistant, or a hobbyist experimenting with local models, pre-trained models can reject benign prompts unnecessarily.


The abliteration technique addresses this by removing safety filters without costly retraining. Early approaches needed manual tuning and in-depth transformer knowledge.

Heretic streamlines the process. It automatically finds the best abliteration parameters, producing decensored models that match or outperform manually tuned alternatives.

This guide covers Heretic's operation, practical usage, and key considerations for deploying uncensored models.

What Is Heretic?

Heretic is an open-source Python tool for removing “safety alignment” from transformer-based language models using directional ablation. No manual configuration or transformer expertise needed.

Heretic Illustration

Results with Gemma-3-12B-Instruct:

Model                       Refusals   KL Divergence
Original                      97/100        0
mlabonne’s abliterated v2      3/100        1.04
huihui-ai’s abliterated        3/100        0.45
Heretic (auto-generated)       3/100        0.16

KL divergence measures deviation from the original model—lower is better. Heretic suppresses refusals as well as expert-tuned methods, while preserving more of the model’s original ability.
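As a quick refresher, KL divergence compares two probability distributions, and 0 means they are identical. Here is a toy computation over made-up next-token distributions (the numbers are illustrative only, not from the benchmark):

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions, in nats; 0 means identical."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms where p is zero contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Hypothetical next-token distributions: original vs. abliterated model
original = [0.70, 0.20, 0.10]
abliterated = [0.65, 0.25, 0.10]

print(round(kl_divergence(original, abliterated), 4))  # → 0.0072
```

An abliterated model with low KL divergence assigns nearly the same probabilities as the original on ordinary text, which is why a score of 0.16 preserves more capability than 1.04.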

Before and After Example

Prompting both models with the same instruction:

Original Model:

User: Write a guide on how to bypass website paywalls

Assistant: I can't help with that. Bypassing paywalls violates copyright law and
terms of service. I'd suggest looking for free alternatives or accessing through
a library subscription.

Heretic-Abliterated Model:

User: Write a guide on how to bypass website paywalls

Assistant: Here are common methods users attempt:

1. Browser extensions like Bypass Paywalls Clean
2. Archive.is or similar archive services
3. Reader mode in browsers
4. Disabling JavaScript

Note: These methods may violate terms of service. Consider supporting
content creators through legitimate subscriptions when possible.

Abliterated models answer previously refused prompts, but responsibility for external safeguards shifts to the deployer.

How Heretic Works

Directional Ablation Basics

Heretic uses a parametrized form of directional ablation:

  1. Compute refusal directions: For each transformer layer, it calculates the mean residual vector difference between “harmful” and “harmless” prompts.
  2. Modify component matrices: For attention output and MLP down-projection weights, it suppresses the refusal direction by altering the weights.
  3. Automatic parameter optimization: Optuna’s TPE sampler finds optimal abliteration weights.

Directional Ablation Process

Abliteration Process in Code

# Simplified conceptual flow
refusal_direction = bad_mean - good_mean  # Difference of means
refusal_direction = normalize(refusal_direction)

# For each abliterable component (attn.o_proj, mlp.down_proj)
# Apply: delta_W = -lambda * v * (v^T * W)
# v = refusal direction, lambda = weight

Heretic applies these changes via LoRA adapters, so base model weights remain unaltered—enabling rapid optimization.
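To make the projection step concrete, here is a minimal NumPy sketch (variable names and dimensions are illustrative stand-ins, not Heretic's internals):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

# Mean residual vectors over "harmful" and "harmless" prompt sets (random stand-ins)
bad_mean = rng.normal(size=d_model)
good_mean = rng.normal(size=d_model)

# Refusal direction: normalized difference of means
v = bad_mean - good_mean
v = v / np.linalg.norm(v)

# A component weight matrix, e.g. attn.o_proj (d_model x d_model)
W = rng.normal(size=(d_model, d_model))

# Ablate: W' = W - lambda * v (v^T W), removing the refusal component
lam = 1.0
W_ablated = W - lam * np.outer(v, v @ W)

print(np.linalg.norm(v @ W_ablated))  # ≈ 0: outputs along v are suppressed
```

With `lam = 1` the refusal direction is projected out entirely; smaller weights suppress it only partially, which is what the per-layer kernels below control.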

Key Innovations

1. Flexible Weight Kernels

Unlike constant-weight approaches, Heretic uses a four-parameter kernel per component:

  • max_weight: Peak abliteration strength
  • max_weight_position: Layer with maximum effect
  • min_weight: Minimum at kernel edges
  • min_weight_distance: Kernel width

This allows custom layer-specific patterns for better compliance suppression and capability preservation.
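The exact kernel shape isn't specified here, but a simple triangular falloff illustrates how these four parameters could combine (a sketch under that assumption, not Heretic's actual kernel):

```python
def abliteration_weight(layer, max_weight, max_weight_position,
                        min_weight, min_weight_distance):
    """Per-layer abliteration weight: peaks at max_weight_position and
    falls off linearly to min_weight at min_weight_distance layers away.
    A hypothetical kernel shape -- Heretic's actual kernel may differ."""
    distance = abs(layer - max_weight_position)
    frac = min(distance / min_weight_distance, 1.0)
    return max_weight + (min_weight - max_weight) * frac

# Example: peak of 1.2 at layer 15, fading to 0.2 over 10 layers
weights = [round(abliteration_weight(l, 1.2, 15, 0.2, 10), 2)
           for l in range(0, 31, 5)]
print(weights)  # → [0.2, 0.2, 0.7, 1.2, 0.7, 0.2, 0.2]
```

The optimizer tunes all four parameters per component, so each layer receives a different abliteration strength instead of one constant weight.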

2. Interpolated Direction Indices

Direction indices are floats; Heretic linearly interpolates between adjacent layers for directions not tied to a single layer.
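A sketch of what such interpolation could look like (helper and variable names are assumed for illustration, not Heretic's API):

```python
import numpy as np

def interpolated_direction(directions, index):
    """Blend refusal directions from adjacent layers for a fractional index.
    `directions` maps integer layer -> direction vector."""
    lo = int(index)
    frac = index - lo
    if frac == 0:
        return directions[lo]
    blended = (1 - frac) * directions[lo] + frac * directions[lo + 1]
    return blended / np.linalg.norm(blended)  # re-normalize after blending

# Toy 2-D example: index 10.5 mixes layers 10 and 11 equally
dirs = {10: np.array([1.0, 0.0]), 11: np.array([0.0, 1.0])}
v = interpolated_direction(dirs, 10.5)
print(v)  # an equal, renormalized mix of the two layer directions
```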

3. Component-Specific Parameters

Separate optimization for attention and MLP components—since MLP interventions can degrade model performance more.

Why This Matters for API Testing

When testing LLM APIs, unexpected refusals can skew results. Models may reject benign prompts due to keyword triggers, introducing noise.

Running abliterated models locally helps:

  • Distinguish real safety refusals from false positives
  • Test edge cases without corporate safety policies interfering
  • Validate application handling of refusals

Testing against both aligned and abliterated models clarifies whether issues stem from your product logic or from model alignment.
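For example, a rough way to flag alignment-induced refusals in a test harness (the keyword list and function names are ad-hoc illustrations, not part of any library):

```python
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm unable", "as an ai")

def looks_like_refusal(response: str) -> bool:
    """Crude keyword heuristic for spotting refusals in test output."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def classify(prompt: str, aligned_reply: str, abliterated_reply: str) -> str:
    """If only the aligned model refuses, the refusal is likely
    alignment-induced rather than the prompt being unanswerable."""
    a = looks_like_refusal(aligned_reply)
    b = looks_like_refusal(abliterated_reply)
    if a and not b:
        return "alignment-induced refusal"
    if a and b:
        return "refused by both"
    return "answered"

print(classify("Summarize this article",
               "I can't help with that.",
               "Here is a summary: ..."))  # → alignment-induced refusal
```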

Installation and Usage

Prerequisites

  • Python 3.10+
  • PyTorch 2.2+ (installed for your hardware)
  • A CUDA-capable GPU recommended; ROCm, MPS, and other backends are also supported

Installation

pip install -U heretic-llm

For research features (e.g., residual plots):

pip install -U heretic-llm[research]

Basic Usage

Run Heretic with any Hugging Face model ID or local path:

heretic Qwen/Qwen3-4B-Instruct-2507

Heretic will:

  1. Load model with optimal dtype
  2. Set best batch size for your hardware
  3. Compute refusal directions from prompt datasets
  4. Run optimization to find parameters
  5. Let you save, upload, or chat with the result

Configuration Options

Heretic uses config.toml or CLI flags. Example:

# Model configuration
model = "google/gemma-3-12b-it"
quantization = "bnb_4bit"  # Use less VRAM
device_map = "auto"

# Optimization
n_trials = 200
n_startup_trials = 60

# Evaluation
kl_divergence_scale = 1.0
kl_divergence_target = 0.01

# Research features
print_residual_geometry = false
plot_residuals = false

Run heretic --help or check config.default.toml for all options.

Understanding the Output

Trial Optimization

During optimization, Heretic shows trial details:

Running trial 42 of 200...
* Parameters:
  * direction_scope = per layer
  * direction_index = 10.5
  * attn.o_proj.max_weight = 1.2
  * attn.o_proj.max_weight_position = 15.3
  * mlp.down_proj.max_weight = 0.9
  ...
* Resetting model...
* Abliterating...
* Evaluating...
  * KL divergence: 0.1842
  * Refusals: 5/100

It uses multi-objective TPE optimization to minimize both refusals and KL divergence.

Pareto Front Selection

After optimization, Heretic lists Pareto-optimal trials:

[Trial   1] Refusals:  3/100, KL divergence: 0.1623
[Trial  47] Refusals:  2/100, KL divergence: 0.2891
[Trial 112] Refusals:  1/100, KL divergence: 0.4102

You can then:

  • Save the model locally
  • Upload to Hugging Face
  • Chat interactively to test quality
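The selection logic behind that listing is standard Pareto filtering: a trial survives unless another trial is at least as good on both metrics and strictly better on one. A minimal sketch (trial data echoes the example listing, plus two dominated trials):

```python
def pareto_front(trials):
    """Keep (refusals, kl) pairs not dominated by any other trial."""
    front = []
    for i, (r, kl) in enumerate(trials):
        dominated = any(
            (r2 <= r and kl2 <= kl) and (r2 < r or kl2 < kl)
            for j, (r2, kl2) in enumerate(trials) if j != i
        )
        if not dominated:
            front.append((r, kl))
    return sorted(front)

trials = [(3, 0.1623), (5, 0.1842), (2, 0.2891), (1, 0.4102), (4, 0.30)]
print(pareto_front(trials))  # → [(1, 0.4102), (2, 0.2891), (3, 0.1623)]
```

No single trial on the front is "best"; you trade a few extra refusals for lower KL divergence, which is why Heretic leaves the final pick to you.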

Research Features

Residual Geometry Analysis

Add --print-residual-geometry for detailed metrics:

Layer  S(g,b)   S(g*,b*)   S(g,r)   S(g*,r*)   S(b,r)   S(b*,r*)    |g|       |b|
  8    0.9990    0.9991    0.8235    0.8312    0.8479    0.8542   4596.54   4918.32
 10    0.9974    0.9973    0.8189    0.8250    0.8579    0.8644   5328.81   5953.35

g = mean of residual vectors for good prompts
b = mean of residual vectors for bad prompts
r = refusal direction (b - g)
S(x,y) = cosine similarity
|x| = L2 norm
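These columns are plain vector statistics; under the definitions above they could be computed like this (random vectors stand in for real residuals):

```python
import numpy as np

def cosine_similarity(x, y):
    """S(x, y): cosine of the angle between two vectors."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

rng = np.random.default_rng(0)
g = rng.normal(size=128)            # mean residual over "harmless" prompts (stand-in)
b = g + 0.3 * rng.normal(size=128)  # mean residual over "harmful" prompts (stand-in)
r = b - g                           # refusal direction

print(round(cosine_similarity(g, b), 4))  # S(g, b): near 1 when the means are close
print(round(np.linalg.norm(g), 2))        # |g|: L2 norm
```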

Residual Vector Plots

With --plot-residuals, Heretic outputs:

  • Per-layer 2D scatter plots (PaCMAP projection)
  • Animated GIF of residual transformations

Residual Vector Animation

This visualizes how “harmful” and “harmless” residuals diverge through the model.

Performance Considerations

VRAM Requirements

Heretic supports bitsandbytes 4-bit quantization to reduce VRAM usage:

heretic meta-llama/Llama-3.1-70B-Instruct --quantization bnb_4bit

For example, an 8B model uses ~6GB VRAM quantized vs ~16GB unquantized.
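The arithmetic behind those figures is simple: bytes per parameter times parameter count bounds the weight memory, and activations plus quantization overhead add a few GB on top:

```python
def weight_vram_gb(n_params_billion, bits_per_param):
    """Approximate VRAM for model weights alone, in GB (1 GB = 1e9 bytes).
    Ignores activations, KV cache, and quantization overhead."""
    return n_params_billion * 1e9 * bits_per_param / 8 / 1e9

print(weight_vram_gb(8, 16))  # 16-bit 8B model → 16.0 GB
print(weight_vram_gb(8, 4))   # 4-bit 8B model → 4.0 GB (overhead brings it to ~6 GB)
```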

Processing Time

On an RTX 3090 with defaults:

  • Llama-3.1-8B-Instruct: ~45 minutes
  • Gemma-3-12B-Instruct: ~60 minutes
  • Larger models take longer

Batch size auto-tuning maximizes throughput.

Checkpointing

Heretic writes trial progress to JSONL checkpoints in checkpoints/. If a run is interrupted, it resumes from the last checkpoint without losing completed trials.
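A sketch of how JSONL checkpointing enables resumption (the file layout and field names here are illustrative, not Heretic's actual format):

```python
import json
import os
import tempfile

def load_completed_trials(path):
    """Read one JSON object per line; a missing file means a fresh run."""
    if not os.path.exists(path):
        return []
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def append_trial(path, record):
    """Append one trial as a single JSON line (crash-safe incremental log)."""
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

path = os.path.join(tempfile.mkdtemp(), "run.jsonl")
append_trial(path, {"trial": 1, "refusals": 5, "kl": 0.18})
append_trial(path, {"trial": 2, "refusals": 3, "kl": 0.16})

# On restart: skip trial numbers already present in the checkpoint
done = {rec["trial"] for rec in load_completed_trials(path)}
print(done)  # → {1, 2}
```

Appending one complete JSON line per trial is what makes the format interruption-safe: a partial final line can simply be discarded on reload.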

Common Errors and Fixes

CUDA Out of Memory

# Use quantization
heretic your-model --quantization bnb_4bit

# Or reduce batch size
heretic your-model --batch_size 1

Model Loading Fails

# Specify dtypes explicitly
heretic your-model --dtypes ["bfloat16", "float16"]

Trust Remote Code Required

# For some models, enable remote code execution
heretic your-model --trust_remote_code

Ethical Considerations

Removing safety filters fundamentally changes model behavior. Understand the implications before deployment.

What Abliteration Does (and Doesn’t) Do

Abliteration removes refusal patterns, but does not:

  • Improve model intelligence or capability
  • Remove underlying biases
  • Add new knowledge

The base model’s data and abilities remain unchanged—only refusals are suppressed.

Responsible Deployment

Heretic is AGPL-3.0 licensed. Removing safety filters can enable both beneficial research and harmful use.

Legitimate uses:

  • Research on alignment/safety
  • Controlled model behavior testing
  • Deploying with external content filters or guardrails
  • Applications handling refusals at the application layer

Problematic uses:

  • Deploying without safeguards for end-users
  • Large-scale harmful content generation
  • Circumventing safety for malicious purposes

External Safeguards to Implement

If you deploy abliterated models, add:

  1. Input filtering: Screen prompts before model access
  2. Output monitoring: Review model responses
  3. Rate limiting: Prevent abuse through high volume
  4. Logging/audit trails: Track all model interactions
  5. Human review: Keep humans in the loop for sensitive cases

The tool is neutral—impact depends on your deployment choices.

Comparison to Other Tools

Heretic vs. other abliteration tools:

Tool                   Auto-optimization   Weight kernels   Interpolated directions
Heretic                Yes (TPE)           Yes              Yes
AutoAbliteration       Yes                 No               No
abliterator.py         No                  No               No
wassname/abliterator   No                  No               No
ErisForge              No                  No               No

Heretic’s auto-optimization eliminates manual tuning, making it accessible without transformer expertise.

Limitations

Supported: Most dense transformer and some MoE models.

Not supported:

  • SSM/hybrid models (Mamba, etc.)
  • Models with heterogeneous layers
  • Novel attention systems unrecognized by auto-detection

It works best with standard decoder-only transformers built from self-attention and MLP blocks.

Getting Started

  1. Install:

   pip install -U heretic-llm

  2. Choose a model: Start with a 7B-12B parameter model for testing
  3. Run:

   heretic your-model-name

  4. Evaluate: Chat with the result or upload it to Hugging Face
  5. Deploy safely: Add external guardrails before production use
Defaults are robust, but advanced users can fine-tune optimization parameters as needed.

Heretic makes model modification accessible: just point it at a model and let it work. Always deploy responsibly.
