Hello, fairy dairy diary!~
The focus of the last several days has been the idea of Progressively Stacking 2.0. (Of course, I skipped the part about keeping the optimizer state -- math too hard.)
After 3 epochs of training by parts, where each epoch trained only one third of the net, the loss is 4.4424.
It's comparable to the loss after 1 epoch of the baka fanout-zeropad-upscaler
(4.4411 -- the fanout variant where the MLP was not upscaling yet).
However, I'm not expecting it to be that good for a while.
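For the curious, "training by parts" looks roughly like this -- not my actual code, just a sketch with made-up names (`set_active_third`, the `enabled` flag), assuming the thirds are contiguous and cycled in order:

```python
import torch.nn as nn

def set_active_third(blocks: nn.ModuleList, epoch: int) -> None:
    """Unfreeze one contiguous third of the blocks per epoch, cycling through them."""
    n = len(blocks)
    third = max(1, n // 3)
    start = (epoch % 3) * third
    end = n if epoch % 3 == 2 else start + third  # last slice absorbs the remainder
    for i, block in enumerate(blocks):
        active = start <= i < end
        block.requires_grad_(active)  # attn/mlp params stay registered either way
        block.enabled = active        # placeholder flag that forward() would check
```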
The plan for now is to train all layers up to ~15 epochs and see what happens.
Depending on the result, I may then scale the network up to 1B and train using LoRA to tie all layers together.
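To make "tie all layers together with LoRA" a bit more concrete: one way to read it is that every block shares the same frozen base weights and only trains a small per-block low-rank delta. This is just a sketch under that reading; `TiedLoRALinear`, the rank, and what stays trainable are all placeholder choices.

```python
import torch
import torch.nn as nn

class TiedLoRALinear(nn.Module):
    """One shared (frozen) base Linear, plus a per-block trainable low-rank delta."""
    def __init__(self, shared_base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = shared_base
        self.base.requires_grad_(False)  # the tied weights stay frozen here
        self.lora_a = nn.Parameter(torch.randn(rank, shared_base.in_features) * 0.02)
        self.lora_b = nn.Parameter(torch.zeros(shared_base.out_features, rank))
        self.scale = alpha / rank        # lora_b starts at zero, so the delta starts at zero

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scale

# Every block wraps the *same* base module, so the big weights are shared across layers:
# base = nn.Linear(1024, 1024)
# layers = nn.ModuleList(TiedLoRALinear(base) for _ in range(24))
```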
Also, I was too lazy to change the number of layers at run-time, so for now I just disable them: disabled layers run only the upscaler, while the params for attn and MLP are still there.
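Concretely, a "disabled" block looks something like this (again a sketch with placeholder names, not the real code): attn and mlp are constructed so their parameters exist, but forward() only runs the upscaler until the block is switched on.

```python
import torch
import torch.nn as nn

class PartialBlock(nn.Module):
    """Attn/MLP parameters always exist, but only the upscaler runs while disabled."""
    def __init__(self, d_in: int, d_out: int, n_heads: int = 8):
        super().__init__()
        self.upscaler = nn.Linear(d_in, d_out)  # always on
        self.attn = nn.MultiheadAttention(d_out, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_out, 4 * d_out), nn.GELU(),
                                 nn.Linear(4 * d_out, d_out))
        self.enabled = False                    # flipped on when this block's turn comes

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.upscaler(x)
        if not self.enabled:                    # disabled: params exist but are never used
            return h
        a, _ = self.attn(h, h, h, need_weights=False)
        h = h + a
        return h + self.mlp(h)
```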
I hit a bug in rotary_embedding_torch (it didn't like requires_grad_ being called on different layers with different args), but it was already fixed in 0.5.3. Sweet.
One day I'll throw in more recurrence, as intended!
But for now, I'm still chilling.
And so should you.