Hello fairy dairy diary! Shall we talk of elephants yet again today? And I am not talking of the long-lipped guys,
but of these elephants, also known as $$\frac{1}{1+\left|\frac{x}{a}\right|^d}$$
raised in
@misc{lan2023elephant,
  title={Elephant Neural Networks: Born to Be a Continual Learner},
  author={Qingfeng Lan and A. Rupam Mahmood},
  year={2023},
  eprint={2310.01365},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}
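For reference, here is a minimal PyTorch sketch of the elephant (the module name is mine; a controls the width of the bump, d the steepness of its slopes, with defaults set to the values I use below):

```python
import torch
import torch.nn as nn

class Elephant(nn.Module):
    """Elephant activation: 1 / (1 + |x/a|^d)."""
    def __init__(self, a: float = 1.0, d: float = 8.0):
        super().__init__()
        self.a = a  # width of the bump
        self.d = d  # steepness of its slopes

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return 1.0 / (1.0 + (x / self.a).abs() ** self.d)
```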
Today the miracle occurred, for the change of activation function was worth it. Yes.
OK, the graph is hard to read, but it has one very important arrow which shows the most important part.
So:
- Red line on the top: the first run with Elephant.
- Red line in the middle: the second run with Elephant. (Each run is a complete epoch of the wikitext-103 training split.)
- Blue line on the top: the first run with SwiGLU.
- Blue line in the middle: the second run with SwiGLU.
- Green: the third run.
The arrow shows that the Elephant stomped three epochs of SwiGLU in just two. The first epoch was so-so: better, but not by much. But then the Elephant started running, and it was unstoppable! After three epochs it almost reached the point of beating Pythia-14m, the small sis who still enjoys the temporary amnesia of BakaLLM.
Which means Elephant becomes branch 002_ of the main repo, and all further improvements will be done on top of it. I haven't played with the hyperparameters much (d=8, a=1) because one epoch takes ~7 hours. Choosing the elephant's diet will be done at a later stage, with all the ingredients of the memory stew in place.
Those will probably start being added this weekend. For an LLM that is supposed to be about memory, postponing it again and again is becoming boring.
So the plan is to start working on XL this weekend, as it essentially doubles K for attention.
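The gist, as a rough sketch (names and shapes are my illustration, not the actual BakaLLM code): cache the previous segment's keys and values and prepend them to the current ones, so every query attends over twice as many keys.

```python
import torch
import torch.nn.functional as F

def xl_attention(q, k, v, mem_k=None, mem_v=None):
    # q, k, v: (batch, heads, seq, head_dim);
    # mem_k/mem_v: cached K/V of the previous segment, same shape as k/v.
    if mem_k is not None:
        k = torch.cat([mem_k.detach(), k], dim=2)  # no gradient into the past
        v = torch.cat([mem_v.detach(), v], dim=2)  # K length: seq -> 2*seq
    # (causal masking and XL-style relative positions omitted for brevity)
    return F.scaled_dot_product_attention(q, k, v)
```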
Some failed experiments:
I tried LOMO via its implementation SlimTrainer, but either I couldn't figure out how to set it up, or it is simply not good for training from scratch. The loss was so awful I didn't finish a single epoch.
After pondering on "Feed-Forward Layers Are Key-Value Memories", I tried replacing the MLP with good old SDPA, where Q = XW and K and V are a bunch of learned vectors. After all, if the FF is a key-value memory, why not give it key-values based on attention, only using not tokens but parts of the token itself?
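Something like this sketch, which is my reconstruction rather than the actual code (single head for clarity; the "parts of the token" bit would split dim into heads):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionFF(nn.Module):
    """FF block as attention: Q is a projection of the token, K/V are learned."""
    def __init__(self, dim: int, n_slots: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim, bias=False)  # Q = XW
        self.k = nn.Parameter(torch.randn(n_slots, dim) / dim ** 0.5)
        self.v = nn.Parameter(torch.randn(n_slots, dim) / dim ** 0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); every position attends to the same learned slots
        q = self.q_proj(x)
        k = self.k.expand(x.size(0), -1, -1)  # (batch, n_slots, dim)
        v = self.v.expand(x.size(0), -1, -1)
        return F.scaled_dot_product_attention(q, k, v)
```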
In the current version of BakaNet, which borrows its parameters from Pythia-160m, that meant a budget of 12 dim^2 (SwiGLU works on two d_ff vectors; to get them we use two Linear(dim_model, dim_model*4) projections, and one Linear(dim_model*4, dim_model) casts the result back, so 3 * dim * 4dim = 12 dim^2 in total).
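In concrete numbers (Pythia-160m's hidden size is 768):

```python
dim = 768                       # hidden size borrowed from Pythia-160m
d_ff = 4 * dim                  # 3072
swiglu_params = 3 * dim * d_ff  # two up-projections + one down-projection
print(swiglu_params)            # 7077888, i.e. 12 * dim**2 per layer
```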
Well, that's too many parameters. When I tried to use them all, CUDA OOMed almost immediately, and that's without the extra tokens of RMT or <pause>! When I removed some parameters, I got worse performance overall, both in terms of speed and loss. So, fortunately, the elephant stays!
I also tried a "mini-MoE": make a bunch of Swigly units, each further gated by a sigmoid (adding the gate gives a better result), and add them all together. Didn't work. Looks like the upcast is very important.
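For the record, roughly what I mean, as a hedged sketch (my reconstruction; the unit count and inner width are placeholder values, and note there is no wide 4*dim upcast anywhere, which is my suspect for the failure):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiglyUnit(nn.Module):
    """One small SwiGLU: down(silu(gate(x)) * up(x))."""
    def __init__(self, dim: int, inner: int):
        super().__init__()
        self.gate = nn.Linear(dim, inner, bias=False)
        self.up = nn.Linear(dim, inner, bias=False)
        self.down = nn.Linear(inner, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))

class MiniMoE(nn.Module):
    """Sum of small units, each weighted by its own sigmoid gate."""
    def __init__(self, dim: int, n_units: int = 4, inner: int = 256):
        super().__init__()
        self.units = nn.ModuleList(SwiglyUnit(dim, inner) for _ in range(n_units))
        self.router = nn.Linear(dim, n_units, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gates = torch.sigmoid(self.router(x))  # (batch, seq, n_units)
        return sum(g.unsqueeze(-1) * u(x)
                   for g, u in zip(gates.unbind(-1), self.units))
```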
Well, back to the past! XL will be added soon, and then Pythia-14M will fall first. Humanity will follow soon (C)