MoE Degeneration on Long Context — Why My 35B Model Started Repeating Itself
The first 600 tokens looked great. Coherent prose, on-topic, the same voice I'd been getting from Qwen 3.6 35B-A3B Q8 for weeks. Then something snapped. The next 200 tokens were a chain of synonyms — "leadership management administration supervision oversight stewardship" running for half a paragraph. After that, partial sentences. After that, nonsense.
I assumed I'd hit a quant artifact. Q8 isn't lossless. Maybe the model was confused. Re-ran with a fresh prompt at the same max_tokens. Same collapse around the 600-700 token mark.
Quantization wasn't the root cause. It's an MoE-specific failure mode that gets worse on long generations. And the fix isn't tuning sampling: it's not generating long sequences at all.
What the output looked like
Here's a sanitized example of what I saw. The prompt asked for a 1500-word blog post outline.
Tokens 0-600 (coherent):
The Apple Silicon memory model differs fundamentally from x86. Where Intel and AMD systems separate DRAM from VRAM with explicit bandwidth boundaries, M-series chips share a single unified memory pool. This has real implications for ML inference: model weights and runtime allocations compete for the same physical bytes...
Tokens 600-800 (degradation starts):
One of the key considerations when working with this architecture involves understanding the relationship between allocation, deallocation, retention, release, management, oversight, supervision, administration, governance, stewardship, custody, oversight again, the role of overseeing the management of the allocation in a managed manner...
Tokens 800+ (collapse):
The the the system system the system the management of of the the of allocation system the the management the the management oversight oversight oversight...
The transition from coherent to degraded happened over about 100 tokens. Before that, the output was on-voice and useful. After, it was unsalvageable.
What I think is happening
Caveat first: I don't have authoritative access to Qwen's training data or routing internals. This is a working hypothesis built from observing the symptom across hundreds of generations and reading public research on MoE behavior.
Mixture-of-experts models route each token to a small subset of available "expert" sub-networks. In Qwen 3.6 35B-A3B, the "A3B" means roughly 3 billion active parameters per token out of 35 billion total. At each MoE layer, a learned router scores the experts from the token's hidden state, which already reflects attention over the context, and sends the token to the top-scoring few.
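To make the routing step concrete, here is a toy sketch of top-k gating for a single token. It is not Qwen's actual router: the expert count, the `top_k` value, the logits, and the `route_token` name are all made up for illustration, and real implementations differ in details such as whether the softmax runs before or after selection.

```python
import math

def route_token(router_logits, top_k=2):
    """Pick the top_k experts for one token and return (expert_index, weight) pairs.

    router_logits: one score per expert, as a learned gate would produce.
    The weights are a softmax over the selected experts only.
    """
    # Rank experts by score, highest first, and keep the top_k.
    ranked = sorted(range(len(router_logits)),
                    key=lambda e: router_logits[e], reverse=True)
    chosen = ranked[:top_k]
    # Softmax over the chosen experts (subtract the max for numerical stability).
    peak = max(router_logits[e] for e in chosen)
    exps = [math.exp(router_logits[e] - peak) for e in chosen]
    total = sum(exps)
    return [(e, x / total) for e, x in zip(chosen, exps)]

# Toy example: 8 experts, 2 active per token (hypothetical numbers).
logits = [0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.2]
selection = route_token(logits, top_k=2)

# Active fraction implied by the "35B-A3B" name: about 3B of 35B parameters.
active_fraction = 3 / 35
```

Only the selected experts run for that token, which is why the active parameter count is a small fraction (under 10% here) of the total.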
On short generations (under 400 tokens), the router behavior is stable. Each token's expert selection has plenty of attention context to score against, and the experts that win tend to be the right ones for the topic.
On long generations, two things start to drift:
The attention context fills with the model's own output. By token 600, unless the prompt is longer than that, more than half the context is what the model just generated. The router's scores are now being computed against generated content, not the original prompt. If any expert produced even mildly low-quality output, that output now influences which experts get picked next.
Routing collapse on dominant experts. When the router has been picking the same few experts consistently, attention weights start concentrating on those experts' "vocabulary." The model develops a self-reinforcing loop where the experts good at certain word categories (abstract nouns, conjunctions, hedge phrases) keep winning the routing competition.
Combine these and you get the synonym-chain pattern: experts good at abstract management vocabulary win routing, the output reinforces their winning, and the chain spirals.
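The routing-collapse hypothesis is testable if your inference stack can expose which expert each generated token was routed to. The sketch below assumes a hypothetical flat list `expert_ids` (one expert index per routing decision) that you would have to log yourself; the function names are illustrative, not part of any library. If the hypothesis holds, normalized entropy should drop in the windows where the output degrades.

```python
import math
from collections import Counter

def routing_entropy(expert_ids, num_experts):
    """Normalized entropy of expert usage: 1.0 = perfectly even, 0.0 = one expert only."""
    if not expert_ids or num_experts < 2:
        return 0.0
    counts = Counter(expert_ids)
    total = len(expert_ids)
    # Shannon entropy, written as p * log(1/p) for each expert that was used.
    entropy = sum((c / total) * math.log(total / c) for c in counts.values())
    return entropy / math.log(num_experts)

def entropy_by_window(expert_ids, num_experts, window=100):
    """Entropy for each consecutive, non-overlapping window of routing decisions."""
    return [
        routing_entropy(expert_ids[i:i + window], num_experts)
        for i in range(0, len(expert_ids), window)
    ]
```

A flat entropy curve across a generation that still collapses would count against the routing explanation, so this is as much a way to falsify the hypothesis as to confirm it.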
Dense (non-MoE) models hit similar degradation but later — usually past 1500-2000 tokens — because there's no router to collapse. MoE seems to fail earlier in this specific mode.
The fix that worked: sectional generation
Once I understood it was a context-buildup problem, the fix was simple: don't generate long sequences. Generate short ones and concatenate.
Specifically: split your target content into sections of 250-400 tokens each, generate each section independently with its own prompt, then concatenate.
    def generate_long_content(prompt_skeleton, sections):
        # Assumes generate, model, and tokenizer are defined elsewhere.
        outputs = []
        for section_name, section_instruction in sections:
            # Each section gets its own prompt, so no generated text carries over.
            section_prompt = (
                f"{prompt_skeleton}\n\n"
                f"Write the section: {section_name}\n"
                f"Instruction: {section_instruction}\n"
                "Target length: 300 tokens."
            )
            section_output = generate(model, tokenizer, section_prompt, max_tokens=400)
            outputs.append(f"## {section_name}\n\n{section_output}\n")
        return "\n".join(outputs)
Each section gets a fresh attention context. The router never sees more than 400 tokens of generated content at once. Degradation never starts because the runway is too short.
Trade-off: section transitions can feel slightly choppy because each section was generated independently. For a blog post, this is usually fine — the human-written section headings paper over the seam. For continuous prose like a novel, you'd want extra glue prompts to maintain flow.
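One possible shape for a glue prompt, as an untested sketch: carry only the last few words of the previous section into the next prompt. The `build_section_prompt` name, the `tail_words` parameter, and the wording of the glue instruction are all assumptions for illustration. Carrying generated text forward puts a little of the model's own output back into context, so the tail is kept deliberately short.

```python
def build_section_prompt(prompt_skeleton, section_name, section_instruction,
                         previous_output=None, tail_words=40):
    """Build a section prompt that optionally carries a short tail of the previous section."""
    parts = [prompt_skeleton]
    if previous_output:
        # Only the last few words are carried over, so the context stays short.
        tail = " ".join(previous_output.split()[-tail_words:])
        parts.append(
            f"The previous section ended with:\n{tail}\n"
            "Continue naturally from there without repeating it."
        )
    parts.append(
        f"Write the section: {section_name}\n"
        f"Instruction: {section_instruction}\n"
        "Target length: 300 tokens."
    )
    return "\n\n".join(parts)
```

The first section is built with `previous_output=None`, which reduces to the same prompt shape as the independent version.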
The other trade-off: it's slower. Five 300-token generations take longer than one 1500-token generation because of per-call overhead, which includes re-processing the prompt skeleton on every call. In my measurements, sectional gen was about 30-40% slower for the same total token count. The quality difference more than justifies it.
What didn't work
Before landing on sectional gen, I tried a few things that sounded reasonable.
Temperature dropping. Lowered temperature from 0.7 to 0.3 hoping more deterministic sampling would avoid the synonym chain. It didn't. The degradation still started around token 600. Lower temperature just made the synonym chain more repetitive, not absent.
Repetition penalty. Added a repetition_penalty=1.15 to the generate call. This helped slightly — pushed the collapse out to token 700-800. But it didn't prevent the underlying routing collapse, and at higher penalties the output started avoiding common words (articles, prepositions) in weird ways.
Top-p tightening. Dropped top_p from 0.9 to 0.7. Same story as temperature drop — the collapse still happened, just with a smaller vocabulary.
Longer prompt. Padded the prompt with more context, hoping the model would have more to anchor on. Made degradation slightly worse if anything — more context for the router to chase as the generation continued.
The pattern across all the failed attempts: sampling adjustments treat the symptom (low-quality token at position N) but don't fix the cause (routing dynamics on long generations). Sectional generation fixes the cause by avoiding the long generation entirely.
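The odd avoidance of articles and prepositions at higher repetition penalties makes sense once you look at how the penalty is typically applied. Below is a sketch of a widely used formulation (the one in Hugging Face transformers, for example); whether the `generate` call in this post uses exactly this rule is an assumption, and `apply_repetition_penalty` is an illustrative name.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty=1.15):
    """Return a copy of logits with already-generated tokens made less likely.

    Positive logits are divided by the penalty and negative logits are
    multiplied by it, so both move toward "less probable".
    """
    adjusted = list(logits)
    for token_id in set(seen_token_ids):
        score = adjusted[token_id]
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted
```

Every token that has already appeared gets pushed down, and function words like "the" and "of" appear in nearly every sentence, so they are penalized almost immediately. The penalty acts on individual tokens after the forward pass, which is why it can delay the symptom without touching whatever is happening inside the model.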
How to detect it in your own runs
If you suspect MoE degeneration, the easiest signal is a word-overlap check on the output. Compare word sets across overlapping sliding windows:
    def detect_collapse(text, window=100, overlap_threshold=0.65):
        # Whitespace-split words stand in for tokens here.
        tokens = text.split()
        # The comparison window starts half a window later (50 words by default).
        step = window // 2
        for i in range(0, len(tokens) - window):
            window_tokens = set(tokens[i:i + window])
            next_window = set(tokens[i + step:i + step + window])
            overlap = len(window_tokens & next_window) / len(window_tokens)
            if overlap > overlap_threshold:
                return i  # word index where collapse starts
        return None
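If a 100-word window shares more than 65% of its vocabulary with the window that starts 50 words later, you're probably in collapse territory. The two windows share half their positions, so expect a substantial baseline overlap on healthy text and tune the threshold for your model. Truncate the output at the collapse position, then regenerate the rest with a fresh context.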
This is the same check I bake into my generation wrapper. When it fires, the wrapper retries the section with a smaller token budget. It's not elegant, but it's robust.
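A minimal sketch of what such a retry wrapper could look like. This is not the actual wrapper from this post: `generate_fn` is a hypothetical callable standing in for the real generate call, and `min_tokens` and `shrink` are illustrative parameters. The detector is repeated inside the block so the example runs on its own.

```python
def find_collapse(text, window=100, overlap_threshold=0.65):
    """Word index where overlapping windows share too much vocabulary, else None."""
    words = text.split()
    step = window // 2
    for i in range(len(words) - window):
        current = set(words[i:i + window])
        following = set(words[i + step:i + step + window])
        if len(current & following) / len(current) > overlap_threshold:
            return i
    return None

def generate_with_retry(generate_fn, prompt, max_tokens=400, min_tokens=100, shrink=0.75):
    """Call generate_fn(prompt, max_tokens) and retry with a smaller budget on collapse.

    Returns the first clean output, or the last attempt truncated at the
    detected collapse position once the budget would drop below min_tokens.
    """
    budget = max_tokens
    while True:
        text = generate_fn(prompt, budget)
        position = find_collapse(text)
        if position is None:
            return text
        next_budget = int(budget * shrink)
        if next_budget < min_tokens:
            # Out of retries: keep only the words before the detected collapse.
            return " ".join(text.split()[:position])
        budget = next_budget
```

The budget shrinks on every retry, so the loop always terminates. Outputs shorter than one window are never flagged, which means very small budgets effectively bypass the check.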
What this isn't
This is calibrated for Qwen 3.6 35B-A3B Q8 specifically. The exact collapse threshold (600-700 tokens for me) will differ for other models.
Other MoE models (Qwen 1.5 14B MoE, Mixtral 8x7B): degradation may start later if they have fewer experts and therefore simpler routing dynamics. Or it may start earlier if those experts overlap heavily. I haven't tested these.
Dense models (Llama 3 70B, Qwen 2.5 dense): you'll hit similar degradation but usually past 1500-2000 tokens. Sectional gen still helps but is less urgent.
More aggressive quantization (Q4 vs Q8): I tested Q4 of the same model briefly before swapping to Q8 for quality. Q4 collapsed earlier (around token 400-500), consistent with the hypothesis that quantization noise compounds with routing instability.
The smaller lesson
When you see word salad from a long generation, your first instinct is probably sampling tuning. Mine was. It almost never works as a primary fix.
The pattern to internalize: long-sequence degradation on MoE models is a routing problem, not a sampling problem. The fix is structural (generate shorter sequences), not numeric (tune temperature). Sectional generation forces the structure.
If you've hit this on a different MoE model and found a different fix, I'd genuinely like to know which one. Reply on the post.