Hello, fairy dairy diary!~
The focus of the last several days has been the idea of Progressively Stacking 2.0. (Of course, I skipped the part about keeping the optimizer state -- math too hard.)
After 3 epochs of training by parts, where each epoch trained only one third of the net, the loss is 4.4424.
It's comparable to the loss after 1 epoch of the baka fanout-zeropad-upscaler
(4.4411 -- the fanout variant where the MLP was not upscaling yet).
However, I'm not expecting it to be that good for a while.
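For the curious, "training by parts" looks roughly like this -- not my actual code, just a sketch with made-up names (`set_active_third`, the `enabled` flag), assuming the thirds are contiguous and cycled in order:

```python
import torch.nn as nn

def set_active_third(blocks: nn.ModuleList, epoch: int) -> None:
    """Unfreeze one contiguous third of the blocks per epoch, cycling through them."""
    n = len(blocks)
    third = max(1, n // 3)
    start = (epoch % 3) * third
    end = n if epoch % 3 == 2 else start + third  # last slice absorbs the remainder
    for i, block in enumerate(blocks):
        active = start <= i < end
        block.requires_grad_(active)  # attn/mlp params stay registered either way
        block.enabled = active        # placeholder flag that forward() would check
```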
The plan for now is to train all layers up to ~15 epochs and see what happens.
Depending on the result, I may then scale the network up to 1B and train using LoRA to tie all layers together.
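To make "tie all layers together with LoRA" a bit more concrete: one way to read it is that every block shares the same frozen base weights and only trains a small per-block low-rank delta. This is just a sketch under that reading; `TiedLoRALinear`, the rank, and what stays trainable are all placeholder choices.

```python
import torch
import torch.nn as nn

class TiedLoRALinear(nn.Module):
    """One shared (frozen) base Linear, plus a per-block trainable low-rank delta."""
    def __init__(self, shared_base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = shared_base
        self.base.requires_grad_(False)  # the tied weights stay frozen here
        self.lora_a = nn.Parameter(torch.randn(rank, shared_base.in_features) * 0.02)
        self.lora_b = nn.Parameter(torch.zeros(shared_base.out_features, rank))
        self.scale = alpha / rank        # lora_b starts at zero, so the delta starts at zero

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scale

# Every block wraps the *same* base module, so the big weights are shared across layers:
# base = nn.Linear(1024, 1024)
# layers = nn.ModuleList(TiedLoRALinear(base) for _ in range(24))
```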
Also, I was too lazy to change the number of layers at run-time, so for now I just disable them: disabled layers run only the upscaler, while the params for attn and MLP are still there.
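Concretely, a "disabled" block looks something like this (again a sketch with placeholder names, not the real code): attn and mlp are constructed so their parameters exist, but forward() only runs the upscaler until the block is switched on.

```python
import torch
import torch.nn as nn

class PartialBlock(nn.Module):
    """Attn/MLP parameters always exist, but only the upscaler runs while disabled."""
    def __init__(self, d_in: int, d_out: int, n_heads: int = 8):
        super().__init__()
        self.upscaler = nn.Linear(d_in, d_out)  # always on
        self.attn = nn.MultiheadAttention(d_out, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_out, 4 * d_out), nn.GELU(),
                                 nn.Linear(4 * d_out, d_out))
        self.enabled = False                    # flipped on when this block's turn comes

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.upscaler(x)
        if not self.enabled:                    # disabled: params exist but are never used
            return h
        a, _ = self.attn(h, h, h, need_weights=False)
        h = h + a
        return h + self.mlp(h)
```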
I hit a bug in rotary_embedding_torch (it didn't like requires_grad_ being called on different layers with different args), but it was already fixed in 0.5.3. Sweet.
One day I'll throw in more recurrence, as intended!
But for now, I'm still chilling.
And so should you.