Hermes-4-14B Abliterated, MLX 4-bit — Apple Silicon Just Got Another Real Model

#ai #mlx #applesilicon #hermes

When Babsie uploaded Hermes-4-14B-BF16-abliterated to Hugging Face yesterday, the only way to run it on a Mac was to download 28 GB of BF16 weights and feed them to transformers. On Apple Silicon that means running on PyTorch's MPS backend, which is fine but not what the hardware was built for.

So I converted it to MLX 4-bit. It is ~8 GB on disk, runs natively on MLX instead of going through PyTorch, and the model card is now live at:

huggingface.co/divinetribe/Hermes-4-14B-abliterated-4bit-mlx

Free, open weights, drop-in for any mlx-lm workflow.

What it is

Hermes 4 is NousResearch's instruction-tuned model family. The 14B variant is built on Qwen3, so it inherits Qwen3's tokenizer, chat template, and architectural quirks. The post-training, though, is pure Hermes: it is tuned for tool use, role-play, and structured output, and tuned not to refuse benign-but-edgy questions.

Babsie's abliteration applies refusal-direction projection, per Arditi et al. (2024), to the BF16 weights: the direction in activation space that mediates refusals is projected out of the weights, so the model can no longer write along it. Plain English: the model will answer questions that the upstream Hermes would politely decline. You become the moderator.
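
To make the projection concrete, here is a minimal sketch of directional ablation on a single weight matrix, in the spirit of Arditi et al. (2024): given a unit refusal direction r_hat, a matrix W that writes into the residual stream is replaced with W - r_hat r_hat^T W, so its output has no component along r_hat. This is a pure-Python illustration with toy numbers, not Babsie's actual pipeline; the function name `ablate_direction` and the 2x2 example are made up for the sketch.

```python
import math

def ablate_direction(W, r):
    """Return W - r_hat r_hat^T W for a matrix W given as a list of rows.

    Rows of W index the output (residual-stream) dimension, so the
    result can no longer write anything along the direction r.
    """
    norm = math.sqrt(sum(x * x for x in r))
    r_hat = [x / norm for x in r]  # unit-length direction
    n_rows, n_cols = len(W), len(W[0])
    # proj[j] = r_hat . (column j of W): how much column j writes along r_hat
    proj = [sum(r_hat[i] * W[i][j] for i in range(n_rows)) for j in range(n_cols)]
    # Subtract that component from every column
    return [[W[i][j] - r_hat[i] * proj[j] for j in range(n_cols)]
            for i in range(n_rows)]

# Toy example (hypothetical numbers): the "refusal direction" is the first axis
W = [[1.0, 2.0],
     [3.0, 4.0]]
r = [2.0, 0.0]
W_ablated = ablate_direction(W, r)
print(W_ablated)  # [[0.0, 0.0], [3.0, 4.0]]
```

In a real model the direction would be estimated from activations on refused versus answered prompts and the projection applied to every matrix that writes to the residual stream; the sketch only shows the linear algebra of one such step.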

My contribution is the boring-but-useful part: convert those BF16 weights to MLX, quantize to 4 bits with group size 64, ship.

Why this matters for Mac users

14B is the sweet spot on Apple Silicon. Big enough to be genuinely useful for instruction-following, small enough to fit comfortably in 16 GB of unified memory at 4-bit, and tiny enough on disk that you can keep a half-dozen variants without thinking about it.

For comparison:

Llama 3.3 70B 8-bit MLX (also on my Hugging Face) — 75 GB, needs a 96 GB machine
Gemma 4 31B 4-bit MLX — 17 GB, runs on any 32 GB Mac
Hermes 4 14B 4-bit MLX — 8 GB, runs on a 16 GB Mac, snappy on anything bigger

If you have an M-series Mac and you want a capable instruction-tuned model that doesn't refuse benign requests and doesn't phone home, this is the easiest install on the list.
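
The sizes above can be roughly reproduced with back-of-the-envelope arithmetic. The sketch below assumes that each quantization group of 64 weights stores one 16-bit scale and one 16-bit bias alongside the packed weights (my understanding of MLX's affine quantization, so treat the 32-bit overhead as an assumption), that the 70B model also uses group size 64, and that the nominal parameter counts in the model names are close enough. It ignores any layers left unquantized.

```python
def bits_per_weight(q_bits, group_size, overhead_bits=32):
    """Effective bits per weight: packed bits plus per-group scale and bias.

    overhead_bits=32 assumes one 16-bit scale and one 16-bit bias per group.
    """
    return q_bits + overhead_bits / group_size

def quantized_gb(n_params, q_bits, group_size=64):
    """Approximate size in GB (10^9 bytes) of the quantized weights."""
    return n_params * bits_per_weight(q_bits, group_size) / 8 / 1e9

def bf16_gb(n_params):
    """Approximate size in GB of the same weights stored as BF16."""
    return n_params * 16 / 8 / 1e9

print(bits_per_weight(4, 64))  # 4.5
print(bf16_gb(14e9))           # 28.0    (the BF16 download)
print(quantized_gb(14e9, 4))   # 7.875   (Hermes 4 14B 4-bit)
print(quantized_gb(31e9, 4))   # 17.4375 (Gemma 4 31B 4-bit)
print(quantized_gb(70e9, 8))   # 74.375  (Llama 3.3 70B 8-bit)
```

Memory use at run time is higher than the weight size, since the KV cache and activations also live in unified memory, which is why an 8 GB model is a comfortable rather than tight fit on a 16 GB Mac.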

How to run it

The mlx_lm package does all the heavy lifting. From a terminal:

pip install mlx-lm

Then, in Python:

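from mlx_lm import load, generate

# Downloads the weights from Hugging Face on first use, then loads from the local cache
model, tokenizer = load("divinetribe/Hermes-4-14B-abliterated-4bit-mlx")

# Wrap the request in the model's chat template so it sees the turn markers it was trained on
messages = [{"role": "user", "content": "Write a haiku about local inference."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

# Generate up to 256 new tokens and print the completion
response = generate(model, tokenizer, prompt=prompt, max_tokens=256)
print(response)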

Or as a server, using mlx-lm's OpenAI-compatible endpoint:

mlx_lm.server --model divinetribe/Hermes-4-14B-abliterated-4bit-mlx --port 8080

That gives you a local http://localhost:8080/v1/chat/completions endpoint that any OpenAI SDK client can hit. No API keys, no API bills, no telemetry.
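
If you would rather not install an SDK, the endpoint can be called with the Python standard library alone. This is a minimal sketch that assumes the server from the command above is running on port 8080 and returns the usual OpenAI-style `choices[0].message.content` shape; the helper names `build_chat_request` and `chat` are made up for the example.

```python
import json
import urllib.request

def build_chat_request(prompt, base_url="http://localhost:8080", max_tokens=256):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt, **kwargs):
    """Send the request to a running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    # Assumes the OpenAI-style response layout
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Write a haiku about local inference."))` should print the model's reply. Nothing is sent over the network until `chat` is called.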

Where it slots into the local-first stack

I maintain claude-code-local on GitHub, a project that lets you run Claude Code 100% on-device using local MLX models in place of the Anthropic API. Hermes 4 14B abliterated is now a supported backend. At 8 GB it is the smallest of the three, sitting below Gemma 4 31B (the everyday workhorse) and Llama 3.3 70B (the heavy lifter).

For workflows where refusals are noise (security research, fiction with edge, prompt-engineering experiments where the same prompt needs to hit both refusing and non-refusing models for comparison), the abliterated variant saves a lot of re-prompting.

Why I publish these

From where I sit, the Apple Silicon community is underserved on model variety. The big quantization shops mostly target llama.cpp and GGUF, which is fantastic for Linux/Windows GPU workflows but adds a translation layer on Macs. MLX is Apple's own framework, and when a model lands in MLX format the experience on a Mac goes from "this works" to "this is what the hardware is for."

So a few times a month, when a hot new model drops, I take the abliterated BF16 (if a trusted shop has done the abliteration upstream) and run it through mlx_lm.convert. Takes about an hour from git pull to huggingface-cli upload. Costs me ~80 GB of disk during the build and ~8 GB to keep around.
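
For anyone who wants to repeat the conversion step, here is a rough sketch that assembles the mlx_lm.convert command with the settings used for this model (4 bits, group size 64). It is not my exact build script: the helper names are invented for the example, and the source repo id is a placeholder because the full Hugging Face path of the BF16 upload is not spelled out in this post.

```python
import shlex
import subprocess

def build_convert_command(hf_path, mlx_path, q_bits=4, q_group_size=64):
    """Assemble the mlx_lm.convert call for a quantized MLX export."""
    return [
        "mlx_lm.convert",
        "--hf-path", hf_path,      # source repo id or local folder with BF16 weights
        "--mlx-path", mlx_path,    # output folder for the MLX weights
        "-q",                      # quantize during conversion
        "--q-bits", str(q_bits),
        "--q-group-size", str(q_group_size),
    ]

def run_convert(hf_path, mlx_path):
    """Run the conversion; needs mlx-lm installed and enough free disk."""
    subprocess.run(build_convert_command(hf_path, mlx_path), check=True)

cmd = build_convert_command(
    "SOURCE_ORG/Hermes-4-14B-BF16-abliterated",  # placeholder org, not a real path
    "Hermes-4-14B-abliterated-4bit-mlx",
)
print(shlex.join(cmd))  # prints the shell command without running it
```

Calling `run_convert` with a real source path performs the actual conversion; the block as written only prints the command so you can inspect it first.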

The downloads page tells me people are finding them. Llama 3.3 70B 8-bit MLX and Gemma 4 31B 4-bit MLX each pull ~1,000 downloads every 30 days, almost entirely from Mac users running them through mlx-lm or claude-code-local. Hermes 4 14B should be more accessible than either, since 16 GB Macs are everywhere.

What's next

I'm watching for the BF16 abliterated source of DeepSeek V4 Flash to drop publicly. The GGUF version (cyberneurova/CyberNeurova-DeepSeek-V4-Flash-abliterated-GGUF) already has 86K downloads, but the BF16 source isn't published yet. The moment it lands, I'll be the first to ship an MLX 4-bit. DeepSeek V4 Flash on Apple Silicon at ~140 GB is going to be a moment for people with 128 GB+ Macs.

Until then: enjoy Hermes 4 14B. The model card has the full install recipe and a benchmark snippet.