I read two papers about improving LLMs at inference time — no training, no fine-tuning, just architectural surgery. I tried applying these ideas to Qwen 3.5-9B. The initial results looked incredible (+245% reasoning!). Then I ran fair evaluations and discovered most of the improvement was an evaluation artifact. Here's the full story, including what I got wrong and what's genuinely new.
The Research That Started This
Two pieces of research motivated this experiment:
1. The RYS Method (David Ng) — Transformers contain "reasoning circuits": contiguous blocks of 3-5 layers that act as indivisible cognitive units. Duplicate them in the GGUF file and the model gets a second pass through its reasoning pipeline. The llm-circuit-finder toolkit validated this on Devstral-24B (+245% logical deduction on BBH) and Qwen2.5-32B (+23% reasoning). The boundaries are sharp — shift by one layer and the improvement vanishes.
2. H-Neurons Paper (arXiv:2512.01797) — Fewer than 0.1% of neurons in an LLM predict whether it will hallucinate. These neurons are baked in during pre-training and survive instruction tuning. Scaling their activations at inference time controls hallucination rates.
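The intervention the paper describes can be pictured with a toy sketch. This is a minimal numpy illustration of scaling a handful of neuron activations, not the authors' code; the neuron indices and scaling factor here are made up for illustration:

```python
import numpy as np

def scale_h_neurons(hidden: np.ndarray, h_idx: np.ndarray, alpha: float) -> np.ndarray:
    """Scale the activations of a small set of 'hallucination-associated' neurons.
    hidden: (seq_len, d_model) activations; h_idx: indices of the H-neurons."""
    out = hidden.copy()
    out[:, h_idx] *= alpha
    return out

hidden = np.ones((4, 1000))      # toy activations: 4 tokens, d_model=1000
h_idx = np.array([17, 503])      # hypothetical indices; <0.1% of neurons, per the paper
damped = scale_h_neurons(hidden, h_idx, alpha=0.5)

assert damped[0, 17] == 0.5      # targeted neurons are damped
assert damped[0, 0] == 1.0       # everything else is untouched
```

In a real model this would be done with a forward hook on the relevant MLP layer rather than on a raw array, but the operation itself is this simple.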
Both papers point to the same idea: you can change model behavior at inference time by manipulating the architecture, without touching the weights. I wanted to try this on Qwen 3.5 — a newer, community-loved model.
Discovery 1: Qwen 3.5's Hybrid Architecture Requires Cycle-Aligned Duplication
Qwen 3.5 doesn't use standard transformer layers. It uses a repeating pattern of [DeltaNet, DeltaNet, DeltaNet, Attention] — three linear attention layers followed by one full quadratic attention layer. This 4-layer cycle repeats 8 times for 32 total layers.
I discovered this empirically. My first sweep tried duplicating 3-layer blocks. Every config crashed:
```
# Config (2,5) - 3 layers
llama_model_load: error: missing tensor 'blk.6.ssm_conv1d.weight'

# Config (4,7) - 3 layers
llama_model_load: error: missing tensor 'blk.7.attn_q.weight'
```
The errors alternate: ssm_conv1d (DeltaNet tensor) missing, then attn_q (Attention tensor) missing. Duplicating 3 layers shifts the pattern, putting the wrong layer type at each position. But duplicating 4 layers (one complete cycle) works — the pattern stays aligned.
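The alignment constraint is easy to see with a toy simulation. This sketch only does layer-type bookkeeping, not real GGUF surgery, but it shows why a 3-layer duplication breaks the pattern while a 4-layer one preserves it:

```python
CYCLE = ["DeltaNet", "DeltaNet", "DeltaNet", "Attention"]  # Qwen 3.5's repeating 4-layer pattern

def layer_types(n_layers: int) -> list:
    """Layer type at each position in the unmodified model."""
    return [CYCLE[i % len(CYCLE)] for i in range(n_layers)]

def duplicate_block(types: list, start: int, end: int) -> list:
    """Insert a copy of layers [start, end) right after the original block."""
    return types[:end] + types[start:end] + types[end:]

def is_aligned(types: list) -> bool:
    """Every position must still hold the layer type the cycle expects there."""
    return all(t == CYCLE[i % len(CYCLE)] for i, t in enumerate(types))

base = layer_types(32)
assert not is_aligned(duplicate_block(base, 2, 5))  # 3-layer block: pattern shifts, loader crashes
assert is_aligned(duplicate_block(base, 0, 4))      # full 4-layer cycle: pattern stays aligned
```

Everything after the 3-layer insertion point expects the wrong layer type, which is exactly why llama.cpp reports a missing tensor at the first mismatched position.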
This is new. The original RYS work only tested standard transformers where all layers are identical. Nobody had tried it on a hybrid DeltaNet architecture before. The finding: layer duplication on hybrid models must respect the architectural cycle.
Discovery 2: Initial Results Looked Amazing (But Were Wrong)
I built custom probes (code generation, hallucination detection, reasoning) and swept all cycle-aligned configs. The initial results were dramatic:
| Config | Code Gen | Hallucination Resistance | Reasoning |
|---|---|---|---|
| Baseline | 7% | 54% | 29% |
| (0,4) layers 0-3 duplicated | 79% | 96% | 88% |
Code generation went from 7% to 79%. Hallucination resistance nearly doubled. Reasoning tripled. I was convinced I'd found the reasoning circuit in Qwen 3.5.
Discovery 3: The Improvement Was an Evaluation Artifact
Then I ran fair evaluations. The initial sweep used max_tokens of 512-1024. Qwen 3.5 wraps responses in `<think>...</think>` tags, which consume tokens. With limited budget:
- Base model: Spent 500+ tokens thinking, ran out before producing an answer → empty response → scored 0
- RYS model: Didn't use think tags, answered directly in 50-200 tokens → correct response → scored 1
The "improvement" was measuring which model fits its answer within the token budget, not which model is smarter.
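The artifact is easy to reproduce with a toy parser. This is a minimal sketch, not my actual harness: when the token budget truncates the response mid-thought, the `<think>` block never closes and nothing scoreable remains:

```python
import re

def extract_answer(response: str) -> str:
    """Strip a <think>...</think> block; if the block never closes
    (truncated by the token budget), no usable answer remains."""
    if "<think>" in response and "</think>" not in response:
        return ""  # ran out of tokens mid-thought
    return re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()

# Base model under a 512-token budget: thinking truncated, no answer produced.
truncated = "<think>Let me work through this step by step. First, consider..."
# RYS model: answers directly, well within budget.
direct = "The answer is 42."

assert extract_answer(truncated) == ""                  # scored 0, regardless of capability
assert extract_answer(direct) == "The answer is 42."    # scored 1
```

Score the empty string as a failure and the base model loses every long question by default, which is exactly what happened in the first sweep.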
When I re-ran with max_tokens=4096 (fair for both):
| Config | Code Gen | Hallucination Resistance | Reasoning | Overall |
|---|---|---|---|---|
| Baseline | 80% | 40% | 100% | 73.3% |
| (0,4) | 60% | 80% | 100% | 80.0% |
| (4,8) | 80% | 60% | 80% | 73.3% |
| (8,12) | 0% | 40% | 80% | 40.0% |
| (12,16) | 0% | 60% | 80% | 46.7% |
| (16,20) | 0% | 40% | 100% | 46.7% |
| (20,24) | 60% | 60% | 100% | 73.3% |
The real improvement from (0,4) is +6.67% overall — not the +286% from the flawed evaluation. Most configs actually hurt the model. And the baseline reasoning score is 100%, not 29%.
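The overall column is just the mean of the three probe scores, which is worth checking by hand given how far the headline numbers moved:

```python
def overall(code: float, halluc: float, reasoning: float) -> float:
    """Overall score = unweighted mean of the three probe scores, in percent."""
    return round((code + halluc + reasoning) / 3, 2)

assert overall(80, 40, 100) == 73.33   # baseline
assert overall(60, 80, 100) == 80.0    # (0,4): +6.67 points overall
assert overall(0, 40, 80) == 40.0      # (8,12): the worst config
```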
Discovery 4: When Both Models Answer, They're Identical
I tested both models on 10 hard hallucination prompts (fake APIs, version confusion, tricky Python behavior). Side by side, with identical settings:
- Both correctly rejected the fake APIs `list.add()`, `dict.sort_by_value()`, and `json.parse()`
- Both correctly refused to name a 2028 World Cup winner
- Both correctly explained that `list.sort()` returns `None`
- Both incorrectly said `match`/`case` works in Python 3.9 (it's 3.10+)
- Both correctly explained banker's rounding for `round(2.5)`
The layer duplication doesn't change the model's knowledge. When both models respond, they give the same answers — same correct ones, same mistakes.
What the Original Author Actually Said
Going back to the original RYS blog, David Ng explicitly noted:
"Smaller models seem to be more complex...I never found a single area of duplication that generalised across tasks."
His successful results were on 72B+ parameter models. I used 9B. He also said:
"Every architecture has its own neuroanatomy...The brains are different."
And critically: neither the original author nor anyone else had tested RYS on hybrid DeltaNet architectures. The method was validated exclusively on standard transformers (Qwen2, Llama, Mistral, Phi). Qwen 3.5's hybrid architecture was untested territory.
Even though the author warned about small models, I tried it anyway and quantified exactly what happens. Next up: running this on Qwen 3.5 122B, the scale where Ng saw real gains.
What's Genuinely New Here
Despite the accuracy improvement not holding up, this experiment produced three findings nobody else has published:
- Hybrid architectures require cycle-aligned duplication. On Qwen 3.5's [D,D,D,A] pattern, only block-size-4 duplication works; block-size-3 crashes. This constrains how RYS can be applied to next-generation architectures.
- Layer duplication can change output behavior. The (0,4) config switched the model from using `<think>` tags to responding directly. This is an unexpected side effect: duplicating layers doesn't just affect accuracy, it can change the model's generation strategy.
- Evaluation methodology on thinking models is treacherous. Token budget, think-tag handling, and response parsing can swing results from "dramatic improvement" to "no improvement". Anyone evaluating thinking models needs to control for these factors.
How to Reproduce
```shell
# Clone the circuit finder toolkit
git clone https://github.com/alainnothere/llm-circuit-finder.git
cd llm-circuit-finder
pip install gguf

# Download Qwen3.5-9B GGUF (from unsloth on HuggingFace)
# Then build the modified model:
python layer_path.py Qwen3.5-9B-Q4_K_M.gguf \
    Qwen3.5-9B-RYS-0-4.gguf \
    -p "0..3,0,1,2,3,4..31" -v

# Run with llama.cpp
llama-server -m Qwen3.5-9B-RYS-0-4.gguf -c 8192 -ngl 99
```
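The `-p` argument is a layer path: a comma-separated list of layer indices and inclusive `..` ranges describing the output model's layer order. Here's an illustrative expander for that spec; the toolkit's actual parser may differ in details, but this shows what `"0..3,0,1,2,3,4..31"` means:

```python
def expand_layer_path(spec: str) -> list:
    """Expand a layer-path spec like '0..3,0,1,2,3,4..31' into source-layer indices.
    Illustrative only -- assumes inclusive '..' ranges, per the command above."""
    out = []
    for part in spec.split(","):
        if ".." in part:
            lo, hi = part.split("..")
            out.extend(range(int(lo), int(hi) + 1))  # inclusive range
        else:
            out.append(int(part))
    return out

path = expand_layer_path("0..3,0,1,2,3,4..31")
assert len(path) == 36                        # 32 original layers + 4 duplicated
assert path[:8] == [0, 1, 2, 3, 0, 1, 2, 3]   # layers 0-3 appear twice, back to back
```

So the output GGUF replays layers 0-3 before continuing through the rest of the stack unchanged.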
Lessons Learned
- Always run fair evaluations first. Same max_tokens, same conditions, same scoring for both models. My first sweep used different effective token budgets and produced wildly misleading results.
- Check what the original authors actually tested. I assumed RYS works on all transformers. The author explicitly said small models are harder and every architecture is different.
- Empty responses are not zero capability. The base model returned empty strings on some prompts, but with enough tokens it answered correctly. Scoring empty as zero inflated the apparent improvement.
- Hybrid architectures are genuinely different. Techniques proven on standard transformers don't transfer automatically. DeltaNet layers maintain recurrent state — duplicating them isn't the same as "thinking longer."
References & Links
- RYS Model on HuggingFace — The modified GGUF with layers 0-3 duplicated
- llm-circuit-finder — The sweep and GGUF surgery toolkit
- RYS Method — David Ng — Original blog post and method
- H-Neurons Paper (arXiv:2512.01797) — Hallucination-associated neurons in LLMs
- Qwen 3.5 Architecture — Model card with hybrid DeltaNet details