DEV Community

Discussion on: My fully offline AI-assisted Linux development machine

Mininglamp

For coding assistance, a well-quantized 4B model at 40+ tok/s beats a 27B model at 8 tok/s in actual productivity. The bottleneck isn't intelligence — it's iteration speed. At 50-100 completions per hour, latency compounds fast. The practical setup: small+fast model for flow state (completions, quick edits), big+slow model for architecture planning and code review invoked 2-3 times per session. Two-tier local beats single-tier every time.
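The "latency compounds" claim is easy to check with back-of-envelope arithmetic. A minimal sketch, assuming roughly 200 output tokens per completion (an illustrative figure, not a measurement from the post):

```python
# Back-of-envelope: minutes per hour spent waiting on model output,
# assuming ~200 output tokens per completion (illustrative assumption).
TOKENS_PER_COMPLETION = 200

def wait_minutes_per_hour(completions_per_hour: int, tok_per_s: float) -> float:
    """Total minutes spent waiting for generation in one hour of work."""
    seconds_per_completion = TOKENS_PER_COMPLETION / tok_per_s
    return completions_per_hour * seconds_per_completion / 60

# Small fast model: 40 tok/s at 60 completions/hour -> 5 min/h waiting
fast = wait_minutes_per_hour(60, 40)
# Large slow model: 8 tok/s at the same rate -> 25 min/h waiting
slow = wait_minutes_per_hour(60, 8)
print(f"fast: {fast:.1f} min/h, slow: {slow:.1f} min/h")
```

Under those assumptions the slow model costs an extra ~20 minutes of dead time per hour, which is the whole argument for reserving it for the 2-3 heavyweight invocations per session.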

Deepu K Sasidharan

Have you tried said models? In my experience, anything other than dense models isn't that useful for serious coding. Maybe OK for minor stuff, but not for generating entire apps or fixing complex issues.

Deepu K Sasidharan

Also, 8 tok/s is not that bad when you're generating code. The 35B A3B will get you around 40 tok/s if the task doesn't need the best intelligence. 4B models are not useful, at least for what I do. Again, you do you. Everyone has different needs.