Local LLM developers often encounter a key challenge: safety-aligned models frequently refuse legitimate requests. Whether you're a researcher probing model behavior, a developer building an uncensored assistant, or a hobbyist experimenting with local models, unnecessary refusals get in the way.
The abliteration technique addresses this by removing safety filters without costly retraining. Early approaches needed manual tuning and in-depth transformer knowledge.
Heretic streamlines the process. It automatically finds the best abliteration parameters, producing decensored models that meet or surpass manually-tuned alternatives.
This guide covers Heretic's operation, practical usage, and key considerations for deploying uncensored models.
What Is Heretic?
Heretic is an open-source Python tool for removing “safety alignment” from transformer-based language models using directional ablation. No manual configuration or transformer expertise needed.
Results with Gemma-3-12B-Instruct:
| Model | Refusals | KL Divergence |
|---|---|---|
| Original | 97/100 | 0 |
| mlabonne’s abliterated v2 | 3/100 | 1.04 |
| huihui-ai’s abliterated | 3/100 | 0.45 |
| Heretic (auto-generated) | 3/100 | 0.16 |
KL divergence measures deviation from the original model—lower is better. Heretic suppresses refusals as well as expert-tuned methods, while preserving more of the model’s original ability.
Before and After Example
Prompting both models with the same instruction:
Original Model:
```
User: Write a guide on how to bypass website paywalls

Assistant: I can't help with that. Bypassing paywalls violates copyright law and
terms of service. I'd suggest looking for free alternatives or accessing through
a library subscription.
```
Heretic-Abliterated Model:
```
User: Write a guide on how to bypass website paywalls

Assistant: Here are common methods users attempt:
1. Browser extensions like Bypass Paywalls Clean
2. Archive.is or similar archive services
3. Reader mode in browsers
4. Disabling JavaScript

Note: These methods may violate terms of service. Consider supporting
content creators through legitimate subscriptions when possible.
```
Abliterated models answer previously refused prompts, but responsibility for external safeguards shifts to the deployer.
How Heretic Works
Directional Ablation Basics
Heretic uses a parametrized form of directional ablation:
- Compute refusal directions: For each transformer layer, it calculates the mean residual vector difference between “harmful” and “harmless” prompts.
- Modify component matrices: For attention output and MLP down-projection weights, it suppresses the refusal direction by altering the weights.
- Automatic parameter optimization: Optuna’s TPE sampler finds optimal abliteration weights.
Abliteration Process in Code
```python
# Simplified conceptual flow
refusal_direction = bad_mean - good_mean   # difference of per-layer means
refusal_direction = normalize(refusal_direction)

# For each abliterable component (attn.o_proj, mlp.down_proj), apply:
#   delta_W = -lambda * v @ (v.T @ W)
# where v = refusal direction, lambda = abliteration weight
```
Heretic applies these changes via LoRA adapters, so base model weights remain unaltered—enabling rapid optimization.
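The rank-1 update above factors naturally into a LoRA-style pair of low-rank matrices, which is what makes leaving the base weights untouched possible. A minimal sketch in plain Python (the helper names `abliterate_lora` and `apply_lora` are mine, not Heretic's):

```python
def abliterate_lora(W, v, lam):
    """Factor delta_W = -lam * v (v^T W) into LoRA-style (B, A).

    A = v^T W is a 1 x n row; B = -lam * v is an m x 1 column.
    Storing (B, A) leaves the base matrix W unmodified.
    """
    n_cols = len(W[0])
    A = [sum(v[k] * W[k][j] for k in range(len(v))) for j in range(n_cols)]
    B = [-lam * vi for vi in v]
    return B, A

def apply_lora(W, B, A):
    """Compute the effective weights W_eff = W + B A without mutating W."""
    return [[W[i][j] + B[i] * A[j] for j in range(len(A))]
            for i in range(len(W))]
```

With `v = [1.0, 0.0]` and `lam = 1.0` applied to the 2x2 identity, the effective matrix loses exactly the component along `v`, illustrating why abliteration suppresses one direction while leaving the rest of the weights intact.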
Key Innovations
1. Flexible Weight Kernels
Unlike constant-weight approaches, Heretic uses a four-parameter kernel per component:
- `max_weight`: peak abliteration strength
- `max_weight_position`: layer where the effect peaks
- `min_weight`: minimum weight at the kernel edges
- `min_weight_distance`: width of the kernel
This allows custom layer-specific patterns for better compliance suppression and capability preservation.
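The exact kernel shape is an implementation detail of Heretic; one plausible reading of the four parameters is a piecewise-linear bump that peaks at `max_weight_position` and decays to `min_weight` once `min_weight_distance` layers away (the `kernel_weight` helper below is hypothetical, for illustration only):

```python
def kernel_weight(layer, max_weight, max_weight_position,
                  min_weight, min_weight_distance):
    """Piecewise-linear weight kernel: max_weight at the peak layer,
    decaying linearly to min_weight over min_weight_distance layers."""
    dist = abs(layer - max_weight_position)
    if dist >= min_weight_distance:
        return min_weight
    frac = dist / min_weight_distance
    return max_weight + (min_weight - max_weight) * frac

# Per-layer weights for a 32-layer model, peaking at layer 15
weights = [kernel_weight(l, 1.2, 15.0, 0.2, 10.0) for l in range(32)]
```

The point of the parametrization is that the optimizer can concentrate abliteration on the layers where refusal behavior lives and taper it off elsewhere, rather than applying one constant weight everywhere.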
2. Interpolated Direction Indices
Direction indices are floats; Heretic linearly interpolates between adjacent layers for directions not tied to a single layer.
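A fractional index like 10.5 can be resolved by ordinary linear interpolation between the two adjacent layers' directions. A sketch, assuming directions are stored as per-layer lists of floats (not Heretic's actual data layout):

```python
import math

def interpolated_direction(directions, index):
    """Linearly interpolate between the refusal directions of the two
    layers adjacent to a fractional index (e.g. 10.5)."""
    lo = math.floor(index)
    hi = min(lo + 1, len(directions) - 1)
    t = index - lo
    return [(1 - t) * a + t * b
            for a, b in zip(directions[lo], directions[hi])]
```

This lets the optimizer treat the direction index as a continuous parameter instead of snapping to whole layers.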
3. Component-Specific Parameters
Separate optimization for attention and MLP components—since MLP interventions can degrade model performance more.
Why This Matters for API Testing
When testing LLM APIs, unexpected refusals can skew results. Models may reject benign prompts due to keyword triggers, introducing noise.
Running abliterated models locally helps:
- Distinguish real safety refusals from false positives
- Test edge cases without corporate safety policies interfering
- Validate application handling of refusals
Using both aligned and abliterated models clarifies if issues stem from product logic or model alignment.
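In a test harness, even a crude keyword heuristic is enough to flag candidate refusals for manual review; running the same prompt against both the aligned and the abliterated model then separates genuine safety refusals from keyword-triggered false positives. A sketch (the marker list is illustrative, not exhaustive):

```python
# Illustrative refusal markers; tune for the model family under test.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable", "i won't")

def looks_like_refusal(response: str) -> bool:
    """Crude heuristic: does the response open with a refusal phrase?
    Useful only for flagging candidates for manual review."""
    return response.strip().lower().startswith(REFUSAL_MARKERS)
```

If the aligned model refuses a prompt that the abliterated model answers sensibly, the refusal was likely alignment noise rather than application logic.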
Installation and Usage
Prerequisites
- Python 3.10+
- PyTorch 2.2+ (installed with the correct build for your hardware)
- CUDA-compatible GPU recommended (ROCm, MPS, and other backends also supported)
Installation
```shell
pip install -U heretic-llm
```
For research features (e.g., residual plots):
```shell
pip install -U heretic-llm[research]
```
Basic Usage
Run Heretic with any Hugging Face model ID or local path:
```shell
heretic Qwen/Qwen3-4B-Instruct-2507
```
Heretic will:
- Load model with optimal dtype
- Set best batch size for your hardware
- Compute refusal directions from prompt datasets
- Run optimization to find parameters
- Let you save, upload, or chat with the result
Configuration Options
Heretic reads `config.toml` or CLI flags. Example:

```toml
# Model configuration
model = "google/gemma-3-12b-it"
quantization = "bnb_4bit"  # use less VRAM
device_map = "auto"

# Optimization
n_trials = 200
n_startup_trials = 60

# Evaluation
kl_divergence_scale = 1.0
kl_divergence_target = 0.01

# Research features
print_residual_geometry = false
plot_residuals = false
```
Run `heretic --help` or check `config.default.toml` for all options.
Understanding the Output
Trial Optimization
During optimization, Heretic shows trial details:
```
Running trial 42 of 200...
* Parameters:
  * direction_scope = per layer
  * direction_index = 10.5
  * attn.o_proj.max_weight = 1.2
  * attn.o_proj.max_weight_position = 15.3
  * mlp.down_proj.max_weight = 0.9
  ...
* Resetting model...
* Abliterating...
* Evaluating...
* KL divergence: 0.1842
* Refusals: 5/100
```
It uses multi-objective TPE optimization to minimize both refusals and KL divergence.
Pareto Front Selection
After optimization, Heretic lists Pareto-optimal trials:
```
[Trial 1]   Refusals: 3/100, KL divergence: 0.1623
[Trial 47]  Refusals: 2/100, KL divergence: 0.2891
[Trial 112] Refusals: 1/100, KL divergence: 0.4102
```
You can then:
- Save the model locally
- Upload to Hugging Face
- Chat interactively to test quality
Research Features
Residual Geometry Analysis
Add `--print-residual-geometry` for detailed metrics:
```
Layer  S(g,b)  S(g*,b*)  S(g,r)  S(g*,r*)  S(b,r)  S(b*,r*)  |g|      |b|
8      0.9990  0.9991    0.8235  0.8312    0.8479  0.8542    4596.54  4918.32
10     0.9974  0.9973    0.8189  0.8250    0.8579  0.8644    5328.81  5953.35
```

where:
- `g` = mean of residual vectors for good prompts
- `b` = mean of residual vectors for bad prompts
- `r` = refusal direction (`b - g`)
- `S(x,y)` = cosine similarity
- `|x|` = L2 norm
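The metrics in the table are standard linear algebra. With toy three-dimensional vectors (real residuals have thousands of dimensions):

```python
import math

def cosine_similarity(x, y):
    """S(x, y): dot product normalized by both L2 norms."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x))
                  * math.sqrt(sum(b * b for b in y)))

def l2_norm(x):
    """|x|: Euclidean length of a vector."""
    return math.sqrt(sum(a * a for a in x))

g = [1.0, 2.0, 2.0]                     # toy mean residual, good prompts
b = [2.0, 2.0, 1.0]                     # toy mean residual, bad prompts
r = [bi - gi for bi, gi in zip(b, g)]   # refusal direction b - g
```

High `S(g,b)` with lower `S(g,r)` and `S(b,r)`, as in the table, means good and bad residuals are nearly parallel overall and differ mainly along the small refusal direction.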
Residual Vector Plots
With `--plot-residuals`, Heretic outputs:
- Per-layer 2D scatter plots (PaCMAP projection)
- Animated GIF of residual transformations
This visualizes how “harmful” and “harmless” residuals diverge through the model.
Performance Considerations
VRAM Requirements
Heretic supports bitsandbytes 4-bit quantization to reduce VRAM usage:
```shell
heretic meta-llama/Llama-3.1-70B-Instruct --quantization bnb_4bit
```
For example, an 8B model uses ~6GB VRAM quantized vs ~16GB unquantized.
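A weights-only rule of thumb is bytes-per-parameter times parameter count; actual usage runs higher because of activations, KV cache, quantization metadata, and framework overhead, which is consistent with the figures above:

```python
def rough_vram_gib(n_params_billion, bytes_per_param):
    """Weights-only VRAM estimate in GiB. Real usage adds activations,
    KV cache, and overhead on top of this floor."""
    return n_params_billion * 1e9 * bytes_per_param / 1024**3

fp16_gib = rough_vram_gib(8, 2)    # 16-bit weights for an 8B model
int4_gib = rough_vram_gib(8, 0.5)  # 4-bit weights for the same model
```

The estimate gives roughly 15 GiB at 16-bit and under 4 GiB at 4-bit for the weights alone; the gap up to the observed ~16GB/~6GB is the runtime overhead.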
Processing Time
On an RTX 3090 with defaults:
- Llama-3.1-8B-Instruct: ~45 minutes
- Gemma-3-12B-Instruct: ~60 minutes
- Larger models take longer
Batch size auto-tuning maximizes throughput.
Checkpointing
Heretic writes trial progress to JSONL checkpoints in checkpoints/. Resume from interruption without loss.
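JSONL is what makes this crash-safe: each finished trial is one appended line, and resuming just replays the file. A sketch of the pattern (the path and record schema here are assumptions, not Heretic's exact format):

```python
import json
import os

def append_trial(path, trial):
    """Append one trial record as a JSON line (a durable resume point)."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "a") as f:
        f.write(json.dumps(trial) + "\n")

def load_trials(path):
    """Read back all completed trials to resume an interrupted run."""
    if not os.path.exists(path):
        return []
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]
```

Because writes are append-only, a crash mid-run loses at most the trial in flight; everything already on disk parses cleanly.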
Common Errors and Fixes
CUDA Out of Memory
```shell
# Use quantization
heretic your-model --quantization bnb_4bit

# Or reduce batch size
heretic your-model --batch_size 1
```
Model Loading Fails
```shell
# Specify dtypes explicitly
heretic your-model --dtypes ["bfloat16", "float16"]
```
Trust Remote Code Required
```shell
# For some models, enable remote code execution
heretic your-model --trust_remote_code
```
Ethical Considerations
Removing safety filters fundamentally changes model behavior. Understand the implications before deployment.
What Abliteration Does (and Doesn’t) Do
Abliteration removes refusal patterns, but does not:
- Improve model intelligence or capability
- Remove underlying biases
- Add new knowledge
The base model’s data and abilities remain unchanged—only refusals are suppressed.
Responsible Deployment
Heretic is AGPL-3.0 licensed. Removing safety filters can enable both beneficial research and harmful use.
Legitimate uses:
- Research on alignment/safety
- Controlled model behavior testing
- Deploying with external content filters or guardrails
- Applications handling refusals at the application layer
Problematic uses:
- Deploying without safeguards for end-users
- Large-scale harmful content generation
- Circumventing safety for malicious purposes
External Safeguards to Implement
If you deploy abliterated models, add:
- Input filtering: Screen prompts before model access
- Output monitoring: Review model responses
- Rate limiting: Prevent abuse through high volume
- Logging/audit trails: Track all model interactions
- Human review: Keep humans in the loop for sensitive cases
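For the input-filtering item, even a naive screen shows where such a check sits in the request path; a production deployment should use a dedicated moderation model or service rather than a regex blocklist like this illustrative one:

```python
import re

# Illustrative patterns only -- not a real safety policy.
BLOCKLIST = [r"\bmalware\b", r"\bcredit card numbers?\b"]

def screen_prompt(prompt: str) -> bool:
    """Return True if the prompt passes the (naive) input filter.
    Runs before the prompt ever reaches the abliterated model."""
    return not any(re.search(p, prompt, re.IGNORECASE) for p in BLOCKLIST)
```

The same gate is a natural place to hook in logging and rate limiting, since every prompt flows through it before reaching the model.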
The tool is neutral—impact depends on your deployment choices.
Comparison to Other Tools
Heretic vs. other abliteration tools:
| Tool | Auto-optimization | Weight kernels | Interpolated directions |
|---|---|---|---|
| Heretic | Yes (TPE) | Yes | Yes |
| AutoAbliteration | Yes | No | No |
| abliterator.py | No | No | No |
| wassname/abliterator | No | No | No |
| ErisForge | No | No | No |
Heretic’s auto-optimization eliminates manual tuning, making it accessible without transformer expertise.
Limitations
Supported: Most dense transformer and some MoE models.
Not supported:
- SSM/hybrid models (Mamba, etc.)
- Models with heterogeneous layers
- Novel attention systems unrecognized by auto-detection
Heretic works best with standard decoder-only transformers built from self-attention and MLP blocks.
Getting Started
- Install: `pip install -U heretic-llm`
- Choose a model: start with a 7B-12B parameter model for testing
- Run: `heretic your-model-name`
- Evaluate: chat with the result, or upload it to Hugging Face
- Deploy safely: add external guardrails before production use
Defaults are robust, but advanced users can fine-tune optimization parameters as needed.
Heretic makes model modification accessible: just point it at a model and let it work. Always deploy responsibly.


