Added pauses inspired by https://arxiv.org/abs/2310.02226
The model inserts pseudo tokens at random positions (they are not part of the vocabulary), with the exception that one pause is always added at position 0, to get that sweet BOS imitation.
This also means that ppl is no longer truly deterministic, as pauses are placed at random.
However, after measuring loss on the valid split twice, the loss was 4.02xx both times, so I'll fix it later (maybe never; maybe in the next mainline iteration).
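As a rough illustration of the random insertion described above (not the actual baka code; `insert_random_pauses`, `pause_embed`, and `n_pauses` are hypothetical names), here is a minimal sketch, assuming the pause lives as a learned embedding vector rather than a vocabulary id:

```python
import torch

def insert_random_pauses(token_embeds: torch.Tensor,
                         pause_embed: torch.Tensor,
                         n_pauses: int) -> torch.Tensor:
    """Scatter a learned pause embedding at random positions.

    token_embeds: (seq_len, d_model) embedded input tokens.
    pause_embed:  (d_model,) learned vector with no id in the vocabulary.
    n_pauses:     how many pauses to scatter across the sequence.

    Position 0 always gets a pause, imitating a BOS token.
    """
    seq_len, _ = token_embeds.shape
    # Random insertion points after position 0; position 0 itself is forced.
    positions = sorted({0, *torch.randint(1, seq_len + 1, (n_pauses,)).tolist()})

    pieces, prev = [], 0
    for pos in positions:
        pieces.append(token_embeds[prev:pos])    # real tokens before this pause
        pieces.append(pause_embed.unsqueeze(0))  # the pause itself
        prev = pos
    pieces.append(token_embeds[prev:])           # remaining real tokens
    return torch.cat(pieces, dim=0)
```

Depending on the setup, targets at pause positions would typically be ignored in the loss, as in the linked paper.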
I think soon(TM) we'll get to sub-4.0 on the valid split.
A mini model, whose parameters were similar to those in PermuteFormer, was also tested, but it was bad: rapid learning followed by rapid stagnation.
I might revisit the idea of an even more minimal model later, but for now ~200M is what I'm aiming at.
Next target: beating pythia-31m.
We are just ~0.10 loss away from beating a model 1/7 of our size, and there are lots of tricks up the sleeve (including boring ones like training for 15 epochs).
PS. I decided to remove the randomness. Now baka supports inserting a pause every N tokens, which works better (a rough sketch follows below):
(The beginning of the graph for fixed pauses was not recorded, oopsies; imagine the pink graph shifted to the right.)
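For completeness, a hedged sketch of the fixed-interval variant (again, hypothetical names; the actual baka implementation may differ). Since insertion points depend only on `every_n`, validation loss becomes reproducible again:

```python
import torch

def insert_fixed_pauses(token_embeds: torch.Tensor,
                        pause_embed: torch.Tensor,
                        every_n: int) -> torch.Tensor:
    """Deterministic variant: a pause at position 0 and then one after
    every `every_n` real tokens, so validation ppl is reproducible."""
    pieces = []
    for start in range(0, token_embeds.shape[0], every_n):
        pieces.append(pause_embed.unsqueeze(0))              # pause opens each chunk
        pieces.append(token_embeds[start:start + every_n])   # next chunk of real tokens
    return torch.cat(pieces, dim=0)
```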
Having free time now, I'm not going to waste 16 hours on training. More tweaks to implement!
As the ancient proverb goes, ChatGPTo delenda est