Hello, fairy dairy diary~!
Progress
Integrating the Llama*-based architecture gave lots of improvements, so much that to celebrate the occasion I've trained for 15 epochs and uploaded the weights to HF.
Just check out the splendorous graph! The validation graph. I even measured E2 for the occasion.
The bakanet slowly crawls towards pythia-70M. We are closer to it than to the 30M model!
The current architecture uses 180M params. Like the original Llama*, all layers share the same hidden size in the intermediate layer of the MLP. Originally I was going to vary those too, but this approach is more effective.
I'm still not sure this is the most efficient way, though.
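For reference, here is a minimal sketch (in PyTorch, with made-up dimensions; the post doesn't state the exact 180M config) of a Llama-style MLP where every layer shares the same intermediate size:

```python
import torch.nn as nn

class LlamaStyleMLP(nn.Module):
    """SwiGLU MLP as in Llama: gate/up/down projections, no bias.
    Sizes here are placeholders, not the actual bakanet config."""
    def __init__(self, hidden_size=768, intermediate_size=2048):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.down_proj(self.act(self.gate_proj(x)) * self.up_proj(x))

# every layer gets the same intermediate size, as in the original Llama
mlps = [LlamaStyleMLP(768, 2048) for _ in range(12)]
```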
Splitting the layers into groups where each group has its own MLP dimensions (or even shares an MLP; that experiment may reappear) is one simple idea; bringing the RMT solidifier back once more to reshape the RMT embeddings is another.
But I want to return to recurrence (some plans: step4? multi-head step4? mamba? monarch?) or play around with training (train N layers at a time; this would allow training a 1B model in the time required to train five 200M models, so around 5×5h = 25 hours, and maybe then use LoRA as glue).
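A rough sketch of what "train N layers at a time" could look like, assuming a standard PyTorch model that exposes its transformer blocks as a `layers` list (the attribute name is hypothetical; adapt to the actual model class):

```python
def train_only_layers(model, layer_ids):
    """Freeze everything except the chosen transformer layers."""
    for p in model.parameters():
        p.requires_grad = False
    for i in layer_ids:
        for p in model.layers[i].parameters():
            p.requires_grad = True

# e.g. train layers 0-3 first, then 4-7, and so on,
# gluing the stages together afterwards (with LoRA or otherwise).
```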
Logging
Oh. I also don't remember when I last used wandb. I've completely moved to home-grown logs. Flogs, as they are called here, because it's float logging. I had to add an index for the first time, as the experiments produced so much data that it took 2s to load the relevant floats from sqlite. How scandalous. At this pace I'd soon have to learn to move some fields.
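I don't know the actual flogs schema, but a minimal sqlite float-logging sketch, with the kind of index that fixes a slow lookup, might look like this (table and column names are made up):

```python
import sqlite3

con = sqlite3.connect("flogs.db")
con.execute("""CREATE TABLE IF NOT EXISTS flog (
    run   TEXT,
    name  TEXT,     -- metric name, e.g. 'val_loss'
    step  INTEGER,
    value REAL
)""")
# the index that keeps "load the relevant floats" fast as data piles up
con.execute("CREATE INDEX IF NOT EXISTS flog_run_name_step ON flog (run, name, step)")

def flog(run, name, step, value):
    con.execute("INSERT INTO flog VALUES (?, ?, ?, ?)", (run, name, step, value))
    con.commit()

def load(run, name):
    return con.execute(
        "SELECT step, value FROM flog WHERE run = ? AND name = ? ORDER BY step",
        (run, name)).fetchall()
```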
Not shown on the graph
The vanilla Transformer model from the PermuteFormer paper has PPL 31.45 after E15, which corresponds to a loss of ~3.44, better than pythia-70M. At E3 it has a PPL of 53.75, which is ~3.98 loss. Baka has loss 4.00 at E2 and 3.98 at E3, so at least in the early stages we beat a model (as usual, one with fewer params) that has a bigger batch size.
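For the record, the PPL ↔ loss conversion above is just a natural log:

```python
import math

print(math.log(31.45))  # ≈ 3.45  (their vanilla Transformer at E15)
print(math.log(53.75))  # ≈ 3.98  (their E3)
print(math.exp(3.98))   # ≈ 53.5  (Baka's E3 loss converted back to PPL)
```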
Chill while the winter allows it~