I saw what big companies and research labs were doing at massive scale and tried to adapt those ideas to extreme compression in tiny models. Here’s what happened.
When OpenAI launched the Parameter Golf challenge, the rules were brutal: train a small language model that fits inside a 16-megabyte compressed file, with training completing in just 10 minutes on powerful hardware.
Most participants stuck to techniques already proven on the leaderboard. I took a different approach: I read papers and articles about what large companies and research labs were doing at massive scale and tried to adapt those concepts to the extreme constraints of this challenge.
The Experiments I Tried
Aggressive Int4 Quantization
Inspired by frontier quantization research from big labs showing that very low-bit weights could work in larger models, I pushed hard on Int4. I believed that if I could make aggressive 4-bit quantization stable in a tiny model, it would give me a massive space advantage. I spent weeks building a custom mixed-precision scheme (Int6 for attention, Int4 for MLP layers) with dynamic scaling, special training ramps, and heavy pruning. It was a bold, theoretically viable direction, but in practice the precision loss was too damaging for such a small model trained on very few steps.
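To make the quantization step concrete, here is a minimal sketch of symmetric per-channel Int4 fake quantization with dynamic scaling, assuming PyTorch; the function names are illustrative, not my actual competition code.

```python
import torch

def quantize_int4(w: torch.Tensor):
    """Symmetric per-output-channel Int4 quantization.

    Int4 has 16 levels, so the symmetric integer range is [-8, 7].
    Returns integer codes plus the per-channel scales.
    """
    # One scale per output channel, recomputed dynamically from the weights.
    max_abs = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8)
    scale = max_abs / 7.0                       # map the largest weight to +7
    q = torch.clamp(torch.round(w / scale), -8, 7)
    return q.to(torch.int8), scale              # stored in int8, 4-bit range

def dequantize_int4(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

# Round-tripping a weight matrix shows the precision loss directly.
w = torch.randn(256, 256)
q, s = quantize_int4(w)
print((dequantize_int4(q, s) - w).abs().mean())
```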
Gimlet-Hetero (Layer-wise Heterogeneous Design)
This came directly from the Gimlet Labs paper “Efficient and Scalable Agentic AI with Heterogeneous Systems” (arXiv:2507.19635v1). The paper discusses how mixing different hardware tiers can optimize cost and performance for AI agents. I adapted that systems-level idea of heterogeneous resource allocation to transformer layers: giving wider MLP blocks and different precision levels to middle layers versus early and late layers. The idea was to allocate capacity where it mattered most.
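As a rough illustration of what that allocation looked like, here is a hypothetical per-layer plan in Python; the widths and bit counts are made up for the example, not my actual configuration.

```python
# Hypothetical heterogeneous plan: the central third of the stack gets
# wider MLP blocks and more precision than the early and late layers.
N_LAYERS = 12
D_MODEL = 384

def layer_plan(i: int, n: int = N_LAYERS) -> dict:
    middle = n // 3 <= i < 2 * n // 3
    return {
        "mlp_width": 4 * D_MODEL if middle else 2 * D_MODEL,
        "mlp_bits": 6 if middle else 4,   # spend precision where it matters
        "attn_bits": 6,
    }

for i in range(N_LAYERS):
    print(i, layer_plan(i))
```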
TurboQuant
This was inspired by Google Research's TurboQuant work on extreme compression, particularly for KV cache and vector search. I tried to adapt similar aggressive compression principles to weight quantization during training, hoping to squeeze out even more compression while keeping training stable.
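One standard way to keep aggressive weight quantization stable during training is fake quantization with a straight-through estimator: the forward pass sees quantized weights, while gradients flow to the full-precision copies. Here is a minimal PyTorch sketch of that general idea; I am not claiming this is TurboQuant's actual algorithm.

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    """Quantize on the forward pass, pass gradients straight through."""

    @staticmethod
    def forward(ctx, w, bits: int = 4):
        qmax = 2 ** (bits - 1) - 1                    # 7 for Int4
        scale = w.abs().amax().clamp(min=1e-8) / qmax
        return torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out, None                         # straight-through

# The optimizer keeps updating full-precision weights while the model
# always computes with their Int4 projection.
w = torch.randn(64, 64, requires_grad=True)
FakeQuantSTE.apply(w, 4).sum().backward()
print(w.grad.unique())                                # tensor([1.])
```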
Bayesian Backoff + TT Adapters
These came from research on dynamic correction mechanisms and low-rank tensor decompositions (Tensor-Train). The goal was to add "smart recovery" during or after training to win back the quality lost to quantization.
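Since a two-core Tensor-Train of a weight matrix reduces to an ordinary rank-r factorization, the simplest version of the adapter idea looks like the sketch below, assuming PyTorch; the class name is mine.

```python
import torch
import torch.nn as nn

class LowRankAdapter(nn.Module):
    """Rank-r additive correction on top of a frozen quantized layer."""

    def __init__(self, d_in: int, d_out: int, rank: int = 8):
        super().__init__()
        self.a = nn.Parameter(torch.randn(d_in, rank) * 0.01)
        self.b = nn.Parameter(torch.zeros(rank, d_out))  # starts as a no-op

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Costs d_in*r + r*d_out parameters instead of d_in*d_out.
        return (x @ self.a) @ self.b

# Only the adapter trains; the quantized base layer stays frozen.
frozen = nn.Linear(384, 384).requires_grad_(False)
adapter = LowRankAdapter(384, 384, rank=8)
x = torch.randn(2, 384)
y = frozen(x) + adapter(x)
```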
Some of these ideas were quite wild. A few came from unusual inspirations and might still be viable if explored further with more experience and compute. Int4 ultimately became my strongest contender, but none of them delivered the breakthrough I was hoping for.
The Evolutionary Agent
At one point I got tired of manual tweaking and built an autonomous evolutionary agent. The system could mark sections of code, generate mutations, run fast tests on Colab, rank them by real performance, and iterate.
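The core of the agent was an ordinary mutate-evaluate-select loop. Here is a toy sketch of that loop; `mutate` and `score` are stand-ins for the real steps of patching marked code sections and running a short training probe on Colab.

```python
import random

def evolve(seeds, mutate, score, generations=5, population=8, top_k=2):
    """Generic mutate-evaluate-select loop (lower score is better)."""
    survivors = list(seeds)
    for gen in range(generations):
        # Breed a fresh population from the current survivors.
        candidates = survivors + [
            mutate(random.choice(survivors)) for _ in range(population)
        ]
        # Rank by measured performance and keep the best top_k.
        candidates.sort(key=score)
        survivors = candidates[:top_k]
        print(f"gen {gen}: best score {score(survivors[0]):.4f}")
    return survivors[0]

# Toy usage: evolve a scalar toward a hidden target of 3.0.
best = evolve(
    seeds=[0.0],
    mutate=lambda c: c + random.gauss(0, 0.5),
    score=lambda c: abs(c - 3.0),
)
```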
It was technically interesting and worked mechanically, but after several generations I realized I was mostly automating the exploration of a weak search space. The gains were too small to justify the time I was spending on it, especially with very limited Colab quota. I shelved the agent. That was an important lesson: just because something can be automated does not mean it is the best use of limited time and compute.
What I Learned
My biggest mistake was choosing hard, experimental paths instead of first deeply understanding and building upon what was already working well on the leaderboard. As an amateur, I thought innovation meant doing something completely different. I now understand that you earn the right to innovate by first mastering proven approaches and then improving upon them.
I got close. My best runs projected to around 1.21 to 1.25 BPB (bits per byte) on full hardware. That would have been a respectable non-record submission, but I never quite broke into true leaderboard territory. I also did not receive RunPod credits until the very end, which limited how much I could validate on real hardware.
Final Thoughts
Parameter Golf was a humbling but valuable experience. I explored a lot, built some interesting systems along the way, and gained a much clearer sense of where to focus effort when resources are limited.
The repository is public if you want to see the full journey:
https://github.com/jmoncayo-pursuit/parameter-golf-uniform-int4
I am still experimenting and still learning. Next time, I will be wiser about balancing bold exploration with proven foundations.