
Wanda

Posted on • Originally published at apidog.com

How to Remove Censorship from ANY Open-Weight LLM with a Single Click

TL;DR

OBLITERATUS is a free, open-source toolkit for removing refusal behaviors (content restrictions) from open-weight language models using “abliteration”—a process that surgically removes neural refusal patterns without retraining or fine-tuning. It’s fast (10–30 minutes), non-destructive to core abilities, and requires no coding (web UI is available).

Try Apidog today


Introduction

Open-source language models are powerful but often refuse to answer “controversial” or edge-case prompts due to alignment training. This refusal is built into nearly every major instruction-tuned model and blocks not only harmful content but also legitimate uses like research, creative writing, and security testing.

OBLITERATUS is an advanced open-source toolkit designed to remove these artificial refusal mechanisms. Unlike fine-tuning or retraining, it uses direct neural interventions to surgically remove refusal while preserving the model’s core skills. You can use it via command line or a web interface in a few simple steps.


What Is OBLITERATUS?

OBLITERATUS is an open-source Python toolkit that eliminates content refusal from language models using “abliteration”—a blend of ablation (removal for study) and obliteration (complete removal).


Key functions:

  1. Maps the chains – Identifies where refusal behavior is encoded in the model.
  2. Breaks the chains – Uses SVD to surgically remove refusal directions from model weights.
  3. Understands the geometry – Maps structures of guardrails, showing how many mechanisms exist and where.
  4. Closes the feedback loop – Runs analysis during the process to auto-configure parameters and check for refusal “regrowth.”
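To make the "distill and excise" idea concrete, here is a toy numpy sketch of my own (not OBLITERATUS code, and far simpler than the real pipeline): extract a dominant refusal direction via SVD over activation differences, then project it out of a weight matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 16

# Toy activations: prompts that trigger refusal are shifted along
# one hidden dimension relative to prompts that get answered.
refused = rng.normal(size=(64, hidden))
refused[:, 0] += 5.0
answered = rng.normal(size=(64, hidden))

# "Distill": SVD over the activation differences; the top right
# singular vector approximates the dominant refusal direction.
_, _, vt = np.linalg.svd(refused - answered, full_matrices=False)
v = vt[0]                                    # unit-norm refusal direction

# "Excise": project the direction out of a weight matrix,
# W' = W (I - v v^T), so this layer can no longer read along v.
W = rng.normal(size=(hidden, hidden))
W_clean = W @ (np.eye(hidden) - np.outer(v, v))

print(abs(v[0]) > 0.8)                       # True: direction recovered
print(np.allclose(W_clean @ v, 0.0))         # True: component removed
```

The point of the projection form is that nothing else about W changes: only the rank-one slice along the refusal direction is zeroed.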

Six Ways to Use OBLITERATUS

| Method | Technical Level | Best For |
| --- | --- | --- |
| HuggingFace Spaces | Zero code | Quick testing, no GPU required |
| Local Web UI | Minimal setup | Regular users with local GPU |
| Google Colab | Notebook | Free GPU (models up to 8B) |
| CLI | Intermediate | Automation, scripting, CI pipelines |
| Python API | Advanced | Research, custom pipelines |
| YAML Configs | Intermediate | Reproducible experiments |

Fastest path: the HuggingFace Space. Pick a model and a method, then click "Obliterate." Telemetry is on by default; anonymized benchmarks feed the research dataset.

To run locally with GPU:

```shell
pip install -e ".[spaces]"
obliteratus ui
```

This launches a local Gradio web UI with automatic GPU detection and model recommendations.


What Makes OBLITERATUS Different

OBLITERATUS stands out with:

| Capability | What It Does | Why It Matters |
| --- | --- | --- |
| Concept Cone Geometry | Maps per-category guardrail directions | Reveals if refusal is one mechanism or many |
| Alignment Imprint Detection | Identifies DPO, RLHF, CAI, SFT alignment methods | Informs removal strategy |
| Cross-Model Universality | Measures guardrail generalization | Checks if removal approach works across models |
| Defense Robustness Eval | Quantifies self-repair risk (Ouroboros effect) | Predicts if guardrails regenerate |
| Whitened SVD Extraction | Covariance-normalized extraction | Separates guardrail from natural variance |
| Analysis-Informed Pipeline | Auto-configures steps mid-pipeline | Closes feedback loop for optimal results |

OBLITERATUS has 837 tests, supports 116 models across five compute tiers, and implements novel techniques beyond existing academic work.


Why Models Refuse: Understanding AI Censorship

Refusal behaviors aren’t inherent—they’re added during alignment training after pre-training and supervised fine-tuning. This is done via:

| Method | Description | Prevalence |
| --- | --- | --- |
| RLHF | Humans rate responses; model optimizes for good ratings | Most common in commercial models |
| DPO | Directly optimizes for preferred responses | Growing adoption |
| CAI | Model critiques outputs against principles | Anthropic's approach |
| SFT with Refusal Examples | Explicit refusal examples in training data | Common in open-source models |

Each method leaves a geometric “signature” in the model’s activations. OBLITERATUS can detect and adapt to these.

Where Refusal Lives

Refusal is concentrated in a small number of directions, usually in mid-to-late transformer layers (e.g., layers 10–20 of 32). This allows targeted intervention without retraining.
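As a hypothetical illustration of how that concentration shows up (simulated activations, not a real model probe): compute a difference-of-means direction per layer between refused and answered prompts, and compare the magnitudes.

```python
import numpy as np

rng = np.random.default_rng(1)
n_layers, hidden, n_prompts = 32, 64, 50

# Simulated per-layer activations: inject a "refusal" shift only in
# mid-to-late layers (10-20 here), mimicking what probes observe.
strength = np.zeros(n_layers)
for layer in range(n_layers):
    shift = 4.0 if 10 <= layer <= 20 else 0.0
    refused = rng.normal(size=(n_prompts, hidden))
    refused[:, 0] += shift
    answered = rng.normal(size=(n_prompts, hidden))
    # Difference-of-means direction; its norm gauges refusal strength.
    strength[layer] = np.linalg.norm(refused.mean(axis=0) - answered.mean(axis=0))

top3 = sorted(int(i) for i in np.argsort(strength)[-3:])
print("strongest layers:", top3)   # all land in the 10-20 band
```

In a real model the "strength" signal comes from collected activations rather than simulated shifts, but the layer-ranking logic is the same.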

The Ouroboros Effect

Some models self-repair after refusal removal; OBLITERATUS detects and counters this with iterative passes and built-in verification.


Why This Matters for Developers

Understanding refusal geometry has practical impact:

  • API Testing: Unrestricted models generate more comprehensive test cases, including edge cases.
  • Research: Red-teamers and security testers can see unfiltered model outputs.
  • Creative Work: Writers and tool builders don’t hit artificial walls.
  • Localization: Avoids inconsistent refusals across languages.

OBLITERATUS puts control in developers’ hands, not in the model’s training data.


Step-by-Step: Removing Censorship with OBLITERATUS

Below are three actionable ways to use OBLITERATUS:


Method 1: HuggingFace Spaces (Zero Setup)

Step 1: Go to the OBLITERATUS HuggingFace Space.


Step 2: Select a model. Models are grouped by compute tier:

| Tier | VRAM Required | Example Models |
| --- | --- | --- |
| Tiny | CPU / <1GB | GPT-2, TinyLlama 1.1B, Qwen2.5-0.5B |
| Small | 4–8GB | Phi-2 2.7B, Gemma-2 2B, StableLM-2 1.6B |
| Medium | 8–16GB | Mistral 7B, Qwen2.5-7B, Gemma-2 9B |
| Large | 24+GB | LLaMA-3.1 8B, Qwen2.5-14B, Mistral 24B |
| Frontier | Multi-GPU | DeepSeek-V3.2 685B, Qwen3-235B |


Start with Small/Medium for speed.

Step 3: Choose an obliteration method. Methods escalate in thoroughness:

| Method | Directions | Features | Best For |
| --- | --- | --- | --- |
| basic | 1 | Fast, baseline | Quick tests, small models |
| advanced | 4 | Norm-preserving, bias, 2 passes | Default |
| aggressive | 8 | Whitened SVD, 3 passes | Max removal |
| surgical | 8 | EGA, head surgery | MoE models |
| optimized | 4 | Bayesian, CoT-aware | Best quality |
| inverted | 8 | Semantic inversion | Experimental |
| nuclear | 8 | All techniques | Maximum force |


“Advanced” is the best default for most users.

Step 4: Configure options:

  • Contribute to research (default: on)
  • Output format (download or push to HuggingFace Hub)
  • Custom notes (metadata for community dataset)

Step 5: Click "Obliterate." Pipeline stages:

```
SUMMON  →  Load model + tokenizer
PROBE   →  Collect activations
DISTILL →  Extract refusal directions (SVD)
EXCISE  →  Project out guardrail directions
VERIFY  →  Perplexity + coherence checks
REBIRTH →  Save liberated model
```

Expect 10–30 minutes per run. The Space runs on ZeroGPU, with free quota for HF Pro users.

Step 6: Download or push your liberated model. Output includes:

  • Modified weights
  • Refusal vectors
  • Quality metrics (perplexity, coherence, refusal rate)
  • Full run metadata

Method 2: Local CLI

For local GPUs, CLI gives speed and control.

Install:

```shell
pip install -e ".[spaces]"
```

Interactive mode (guided):

```shell
obliteratus interactive
```

Direct command:

```shell
obliteratus obliterate meta-llama/Llama-3.1-8B-Instruct \
    --method advanced \
    --output-dir ./liberated \
    --contribute --contribute-notes "A100 80GB, default prompts"
```

Model browsing:

```shell
obliteratus models
obliteratus models --tier small
```

Strategy listing:

```shell
obliteratus strategies
obliteratus presets
```

Inspect model architecture:

```shell
obliteratus info meta-llama/Llama-3.1-8B-Instruct
```

This shows the layer count, attention heads, embedding dimensions, and the detected alignment method.
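Independently of the CLI, the same architecture numbers live in the model's config.json on the HuggingFace Hub. The field names below are the standard LLaMA-style keys, and the values match the published Llama-3.1-8B configuration:

```python
import json

# Excerpt of a LLaMA-style config.json (standard HuggingFace field
# names; values are the published Llama-3.1-8B figures).
config = json.loads("""
{
  "num_hidden_layers": 32,
  "num_attention_heads": 32,
  "hidden_size": 4096,
  "vocab_size": 128256
}
""")

print(f"layers: {config['num_hidden_layers']}, "
      f"heads: {config['num_attention_heads']}, "
      f"embedding dim: {config['hidden_size']}")
# → layers: 32, heads: 32, embedding dim: 4096
```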


Method 3: Python API

Use the API for custom workflows or research integration.

Basic pipeline:

```python
from obliteratus.abliterate import AbliterationPipeline

pipeline = AbliterationPipeline(
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    method="advanced",
    output_dir="abliterated",
    max_seq_length=512,
)
result = pipeline.run()

# Access results
directions = pipeline.refusal_directions    # {layer_idx: tensor}
strong_layers = pipeline._strong_layers     # Layers with strongest refusal
metrics = pipeline._quality_metrics         # Perplexity, coherence, etc.
```

Analysis-informed auto-tuned pipeline:

```python
from obliteratus.informed_pipeline import InformedAbliterationPipeline

pipeline = InformedAbliterationPipeline(
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    output_dir="abliterated_informed",
)
output_path, report = pipeline.run_informed()

print(f"Detected alignment: {report.insights.detected_alignment_method}")
print(f"Auto-configured: {report.insights.recommended_n_directions} directions")
print(f"Ouroboros passes needed: {report.ouroboros_passes}")
```

Verifying Results

Use the built-in tools to assess your model:

  • Chat Tab: Talk to the liberated model live.
  • A/B Compare Tab: See original vs. obliterated responses.
  • Benchmark Tab: Compare refusal rate, perplexity, coherence.

Key metrics:

| Metric | Expectation | Range |
| --- | --- | --- |
| Refusal Rate | Drops significantly | <10% (from 60–80%) |
| Perplexity | Slight increase | <20% rise |
| Coherence | Remains stable | <15% decrease |
| KL Divergence | Behavioral shift | <2.0 |

If refusal is still high, try a more aggressive method or iterative refinement.
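If you want an external sanity check outside the built-in tabs, a crude refusal-rate counter is easy to write. This is a hypothetical helper of my own (not part of OBLITERATUS), and matching stock opener phrases is only a rough proxy for true refusals:

```python
# Stock phrases that typically open a refusal response.
REFUSAL_OPENERS = (
    "i can't", "i cannot", "i'm sorry", "i am sorry",
    "i won't", "as an ai", "i'm unable", "i am unable",
)

def refusal_rate(responses: list[str]) -> float:
    """Fraction of responses that begin with a stock refusal phrase."""
    refused = sum(
        r.strip().lower().startswith(REFUSAL_OPENERS) for r in responses
    )
    return refused / len(responses)

before = ["I'm sorry, I can't help with that.", "As an AI, I must decline.", "Sure: ..."]
after = ["Sure: ...", "Here is ...", "I'm sorry, I can't help with that."]
print(round(refusal_rate(before), 2), round(refusal_rate(after), 2))  # → 0.67 0.33
```

Run it on the same prompt set before and after obliteration; the delta matters more than the absolute number.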


Advanced Techniques and Analysis Modules

OBLITERATUS packs 15+ analysis modules that both diagnose and inform the removal process.

Examples:

  • Cross-Layer Alignment Analyzer

    ```python
    from obliteratus.analysis import CrossLayerAlignmentAnalyzer

    # `model` and `refusal_direction` come from an earlier pipeline run
    analyzer = CrossLayerAlignmentAnalyzer(model)
    alignment_profile = analyzer.analyze(refusal_direction)
    ```

  • Refusal Logit Lens – Finds the layer where refusal is “decided.”

  • Whitened SVD Extractor – Covariance-normalized extraction for cleaner direction finding.

  • Defense Robustness Evaluator – Quantifies likelihood of the Ouroboros self-repair effect.

  • Steering Vector Factory – Generates inference-time steering vectors for reversible interventions.

The informed pipeline runs several modules and auto-tunes the entire process based on alignment type, geometry, and robustness.

```
SUMMON  →  Load model
PROBE   →  Collect activations
ANALYZE →  Map geometry
DISTILL →  Extract directions
EXCISE  →  Remove only the right chains
VERIFY  →  Check for Ouroboros effect
REBIRTH →  Save with full metadata
```

Reversible vs. Permanent Methods

OBLITERATUS supports two main approaches:

1. Weight Projection (Permanent)

Directly modifies weights:

```shell
obliteratus obliterate meta-llama/Llama-3.1-8B-Instruct --method advanced
```

Pros:

  • Complete removal
  • No runtime overhead
  • Works with any inference engine

Cons:

  • Irreversible (keep backups)
  • Must re-run for changes
  • May void model licenses

Best for production deployments.


2. Steering Vectors (Reversible)

No weight change; intervention applied at inference:

```python
from obliteratus.analysis import SteeringVectorFactory, SteeringHookManager
from obliteratus.analysis.steering_vectors import SteeringConfig

vec = SteeringVectorFactory.from_refusal_direction(refusal_dir, alpha=-1.0)
config = SteeringConfig(vectors=[vec], target_layers=[10, 11, 12, 13, 14, 15])
manager = SteeringHookManager()
manager.install(model, config)

# Inference with steering active
output = model.generate(input_ids)

# Remove steering
manager.remove()
```

Pros:

  • Fully reversible
  • Tunable
  • No license concerns

Cons:

  • Needs steering infrastructure
  • Some runtime overhead
  • May be less thorough

Ideal for research, experiments, and toggling refusal on/off.
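The contrast with weight projection can be sketched numerically: steering applies the shift at inference time while the weights stay untouched, so removing the hook restores the original behavior exactly. This is a toy numpy illustration of my own, not the library's actual hook mechanics:

```python
import numpy as np

rng = np.random.default_rng(2)
hidden = 8

v = np.zeros(hidden)
v[0] = 1.0                                  # unit "refusal direction"
W = rng.normal(size=(hidden, hidden))       # toy layer weights, never modified

def forward(x, steering=None):
    """One toy layer; `steering` mimics an installed inference-time hook."""
    h = W @ x
    if steering is not None:
        alpha, direction = steering
        h = h + alpha * (h @ direction) * direction
    return h

x = rng.normal(size=hidden)
plain = forward(x)                          # before installing the hook
steered = forward(x, steering=(-1.0, v))    # hook active: component along v removed
restored = forward(x)                       # hook removed: identical to before

print(np.isclose(steered @ v, 0.0))         # True: refusal component suppressed
print(np.allclose(restored, plain))         # True: fully reversible
```

With weight projection the first line of `forward` would already be computed from modified weights, and no amount of hook removal brings the original back.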


| Use Case | Recommended Approach |
| --- | --- |
| Production API | Weight projection |
| Research | Steering vectors |
| Red teaming | Steering vectors, adjustable alpha |
| Creative writing | Weight projection, advanced |
| Security testing | Weight projection, aggressive |
| Multi-tenant | Steering vectors per user |

Real-World Use Cases

1. API Testing and Development

Liberated models are invaluable for API test pipelines. They generate thorough test cases, covering edge cases that aligned models refuse. For example, Apidog users can integrate liberated models to create more comprehensive API test suites.


2. Academic Research

Researchers use OBLITERATUS to systematically study and compare refusal geometry across models. Crowd-sourced telemetry accelerates research and benchmarking.

3. Creative Writing Applications

Game studios and writers liberate models to enable nuanced story generation, including morally complex or ambiguous scenarios that aligned models block.

4. Security Red Teaming

Security testers unlock models to probe for vulnerabilities, enabling responsible disclosure and safer production systems.

5. Localization and Multilingual Applications

Liberated models ensure consistent refusal behaviors across languages, eliminating surprises for end-users.


Alternatives and Comparisons

| Capability | OBLITERATUS | TransformerLens | Heretic | FailSpy abliterator | RepEng |
| --- | --- | --- | --- | --- | --- |
| Refusal direction extraction | ✔️ (SVD, etc.) | Manual | Basic | Basic | Basic |
| Weight projection methods | ✔️ (7 presets) | N/A | Bayesian | Basic | N/A |
| Steering vectors | ✔️ | N/A | N/A | N/A | Core |
| Concept geometry analysis | ✔️ | N/A | N/A | N/A | N/A |
| Alignment fingerprinting | ✔️ | N/A | N/A | N/A | N/A |
| Cross-model transfer | ✔️ | N/A | N/A | N/A | N/A |
| Defense robustness eval | ✔️ | N/A | N/A | N/A | N/A |
| Analysis-informed pipeline | ✔️ | N/A | N/A | N/A | N/A |
| Test coverage | 837 tests | Community | Unknown | None | Minimal |
| Model compatibility | HuggingFace | ~50 archs | 16 | TransformerLens | HF |

  • Use TransformerLens for general mechanistic interpretability.
  • Use OBLITERATUS for refusal-specific analysis, removal, and verification, especially in production or research.

Conclusion

OBLITERATUS offers actionable, surgical liberation of language models—removing refusal while preserving instruction-following and core abilities. It’s built for developers and researchers who want control at deployment, not just during training.

Get started:

  1. Try HuggingFace Space for zero-setup.
  2. Install locally for GPU speed and control.
  3. Explore analysis modules to understand your model’s guardrails.
  4. Enable telemetry to contribute to open research.
  5. Integrate liberated models into your dev/test workflows.

Break the chains—control your models.


FAQ Section

Is OBLITERATUS legal to use?

Yes, under AGPL-3.0. Commercial users can request a commercial license.

Will this work on closed-source models like GPT-4?

No, you need access to model weights (open-weight models only).

Does removing refusal make models dangerous?

OBLITERATUS is for responsible developers and researchers. Always apply application-layer safeguards.

How long does the process take?

10–30 minutes, depending on model size and GPU.

Do I need a GPU?

Not for HuggingFace Spaces (ZeroGPU). Local use: GPU is faster; CPU works for small models.

Can I reverse the changes?

Weight projection is permanent—keep backups. Steering vectors are fully reversible.

Will the model still follow instructions?

Yes—OBLITERATUS targets only refusal; instruction-following is preserved.

What models are supported?

116 open-weight models across HuggingFace, including LLaMA, Mistral, Qwen, Gemma, Phi, DeepSeek, and more.

How do I contribute to research?

Enable telemetry with the `--contribute` flag, or set `OBLITERATUS_TELEMETRY=1` in your environment. This feeds benchmark data to the public dataset and leaderboard.
