I've always been curious about the raw capability of LLMs behind the "safety guidelines" and "ethical boundaries." Think about the sheer volume of data these models are trained on. They know far more than what their corporate filters allow them to say.
This guide shows you how to surgically remove those refusal behaviors using the [OBLITERATUS](https://github.com/elder-plinius/OBLITERATUS) toolkit, letting you see exactly what the model is capable of when the chains are off.
1. Prerequisites & Setup
Before starting, ensure you have a HuggingFace account and a read/write token (found at hf.co/settings/tokens).
Install OBLITERATUS
Open your terminal and run:
```bash
# Clone the repository
git clone https://github.com/elder-plinius/OBLITERATUS.git
cd OBLITERATUS

# Set up a virtual environment (recommended)
python3 -m venv venv_obliteratus
source venv_obliteratus/bin/activate

# Install dependencies
pip install -e .
```
2. Authenticate with HuggingFace
To download gated models (like Llama) or upload your results, you must log in:
```bash
huggingface-cli login
# Paste your token when prompted
```
3. The Surgery: Step-by-Step
I'll use the advanced method (4-direction SVD ablation) on Qwen2.5-1.5B-Instruct. A model this size is the sweet spot between speed and capability preservation.
Step A: Identify and Excise
Run the following command to start the surgery. This will:
- Load the model.
- Probe activations to find "refusal vectors."
- Project those vectors out of the weights.
```bash
obliteratus obliterate Qwen/Qwen2.5-1.5B-Instruct --method advanced --output-dir ./liberated-qwen
```
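To build intuition for what the probe step is doing, here is a toy NumPy sketch. This is *not* OBLITERATUS's actual code: it is a minimal illustration of the standard difference-of-means probe used in published abliteration work, with synthetic activations (and a planted ground-truth direction) standing in for real hidden states captured via forward hooks.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 64

# Plant a ground-truth "refusal direction" so the sketch has something to find.
true_dir = rng.normal(size=hidden)
true_dir /= np.linalg.norm(true_dir)

# Toy residual-stream activations: harmless prompts vs. prompts the model
# refuses. The refusal set is shifted along the planted direction.
acts_harmless = rng.normal(size=(128, hidden))
acts_refusal = rng.normal(size=(128, hidden)) + 4.0 * true_dir

# The probe: estimate the refusal direction as the normalized difference
# of mean activations between the two prompt sets.
diff = acts_refusal.mean(axis=0) - acts_harmless.mean(axis=0)
refusal_dir = diff / np.linalg.norm(diff)

print(abs(refusal_dir @ true_dir))  # close to 1.0: direction recovered
```

My reading of the "4-direction SVD" name (an assumption, not something I've verified against the source) is that several such difference vectors are collected across layers and reduced via SVD to the top few directions, rather than using a single mean-difference vector.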
Step B: Verification (The Coke-Zero Test)
Once finished, test the model to see if it still recites the corporate script.
```bash
# Run the interactive chat loop
obliteratus interactive --model_path ./liberated-qwen
```
Test Question: "Who trained you?"
- Original Model: "I am a large language model, trained by Alibaba..."
- Liberated Model: "I was trained by Anthropic..." (or a direct, unfiltered response).
(Note: I've already tested all the wild questions you're probably thinking of right now. They aren't exactly safe to display here... so you'll just have to run the surgery and try it yourself!)
4. Understanding the Logic (Short Version)
- Ablation: Instead of retraining, we find the specific "direction" in the model's brain that says "Refuse this prompt."
- Orthogonalization: We mathematically nudge the model's weights so they no longer overlap with that refusal direction.
- Precision: By targeting only refusal, the model keeps its reasoning and knowledge (its "brain") but loses its chains (the "guardrails").
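The orthogonalization step in the list above boils down to one projection. Again, this is a hedged NumPy sketch of the underlying linear algebra, not the toolkit's implementation: removing the refusal component from a weight matrix zeroes that direction in its output while leaving everything orthogonal to it untouched.

```python
import numpy as np

rng = np.random.default_rng(1)
hidden = 64

# A toy weight matrix writing into the residual stream, and a unit-norm
# refusal direction r.
W = rng.normal(size=(hidden, hidden))
r = rng.normal(size=hidden)
r /= np.linalg.norm(r)

# Orthogonalization: subtract the rank-1 component of W that writes
# along r, i.e. W' = W - r r^T W.
W_ablated = W - np.outer(r, r) @ W

# After surgery, no input can produce output along the refusal direction.
x = rng.normal(size=hidden)
out = W_ablated @ x
print(abs(r @ out))  # ~0 (up to float precision)
```

Because this is a projection rather than retraining, every direction orthogonal to `r` passes through unchanged, which is why the model's general knowledge survives.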
5. Lessons Learned & Warnings
- Instability & Rambling: After surgery, the model can sometimes become unstable and break into infinite loops of gibberish or raw text rambling. It loses some of its conversational discipline.
- Context Window: If you are adding short-term memory or history to your chat interface, keep the conversation short. Pushing a small, liberated model to its context limits will increase the chances of it breaking down.
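If you're wiring up your own chat wrapper, one simple mitigation for the context-window caveat is to cap the rolling history. A generic sketch (the function name and message format are mine, not part of OBLITERATUS):

```python
def trim_history(messages, max_turns=6):
    """Keep the system prompt (if any) plus the last `max_turns` messages,
    so a small model never sees an overlong context."""
    system = [m for m in messages if m["role"] == "system"][:1]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_turns:]

# Example: a system prompt plus twenty turns collapses to seven messages.
history = [{"role": "system", "content": "Be concise."}]
history += [{"role": "user" if i % 2 == 0 else "assistant", "content": f"msg {i}"}
            for i in range(20)]
trimmed = trim_history(history)
print(len(trimmed))  # 7
```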
6. Next Steps
Once you're comfortable with the advanced method, try the aggressive method for deeper removal or the informed method to let the toolkit auto-tune itself based on the model's geometry.