
SomeOddCodeGuy

Originally published at someoddcodeguy.dev

GLM 4.6 at UD_Q3_K_XL is surprisingly usable

So I currently run GLM 4.7 Q8 on my M3 Ultra, and after wrestling to find a solid model that would work well on the M2 Ultra 192GB, I finally decided to give the older GLM 4.6 UD_Q3_K_XL a try on it, to see how much the quantization would hurt it. (I also just wanted to mess around with 4.6 again after using 4.7 for a while, to see how much I missed it lol. They have different styles for doing reviews and giving feedback.)
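For anyone wondering whether a quant this size even fits on a 192GB machine, here's the kind of back-of-the-envelope check I'd do first. This is just a sketch in Python; the parameter count, bits-per-weight, KV cache, and overhead numbers are ballpark assumptions for illustration, not exact figures from the GGUF:

```python
# Rough fit check for a big MoE quant on a 192GB unified-memory Mac.
# All numbers below are assumptions for illustration, not exact GGUF stats.

TOTAL_PARAMS = 355e9     # GLM 4.6 is roughly a 355B total-parameter MoE
BITS_PER_WEIGHT = 3.5    # UD_Q3_K_XL averages somewhere in the mid-3s bpw
KV_CACHE_GB = 8          # depends heavily on context length and KV quant
OS_OVERHEAD_GB = 12      # headroom for macOS and whatever else is running

weights_gb = TOTAL_PARAMS * BITS_PER_WEIGHT / 8 / 1e9
total_gb = weights_gb + KV_CACHE_GB + OS_OVERHEAD_GB

print(f"Weights:        ~{weights_gb:.0f} GB")
print(f"Total estimate: ~{total_gb:.0f} GB (fits in 192 GB: {total_gb < 192})")
```

With those assumptions it lands around 175 GB, which is tight but workable, and roughly matches why this quant was the one worth trying on the 192GB box.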

Honestly, I've been shocked at how well it works. The coding isn't terrible, and its general ability to look over docs and give feedback feels pretty comparable to full quality. I've definitely seen it struggle with numbers of any kind, but not to the extent I had expected for an MoE; in the past, MoEs did not handle quantization well.

There was an arXiv paper that found LLMs can only cram about 3.6 bits per parameter worth of memorized training data. That's not exactly about quantization, so I'm kinda stretching it here, but my brain went "well, if 3.6 bpw is some kind of limit for one thing, maybe the rest of the model has similar wiggle room?"

Could be a wholly wrong way to look at it, but after reading that, my mental baseline for "I'm about to break this model if it gets any smaller" has been somewhere in that 3.6-4 bpw range. Not exactly doing science here, but it does line up a little with the old MMLU-Pro quantization results from a long time back. I just didn't expect modern MoEs to hold up as well as the big dense models.
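If you want to sanity-check where a given GGUF actually lands relative to that 3.6 bpw figure, the average bits-per-weight falls straight out of file size and parameter count. Another sketch; the file size and parameter count plugged in below are assumptions, not measured values:

```python
def avg_bits_per_weight(file_size_gb: float, total_params_b: float) -> float:
    """Average bits per weight for a quantized model file.

    file_size_gb:   size of the GGUF (or the sum of its shards) in GB
    total_params_b: total parameter count in billions
    """
    return (file_size_gb * 8e9) / (total_params_b * 1e9)

# Assumed numbers for illustration: ~155 GB of GGUF shards, ~355B total params.
bpw = avg_bits_per_weight(155, 355)
print(f"~{bpw:.2f} bits per weight")  # ~3.5, right around that 3.6-4 bpw line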
