BakaLLM, part 11. Progress was made, this is not a DRILL, this is a PROGRESS! RMT and getting the loss <4.00, finally!

Hello, fairy dairy diary!

After fixing RMT and training it for three epochs, the loss has finally gone below 4: 3.9841 🎉🥳🎉🐘 (down from 4.0221)

(Image: Cirno and V sign)

Which means we are one step closer to beating pythia-31m.
So, what went wrong before? In the endless wisdom of the baka abyss, I originally implemented RMT as (pseudocode)

sequence[0].input_with_rmt = rmt_tokens 🐱 input(with pause) 🐱 rmt_tokens
and
sequence[i].input_with_rmt = sequence[i-1].rmt_at_the_end 🐱 input(with pause) 🐱 rmt_tokens
The original formula is different:

(Image: RMT formula)

i.e. it should be

sequence[i].input_with_rmt = sequence[i-1].rmt_at_the_end 🐱 input(with pause) 🐱 sequence[i-1].rmt_at_the_end

i.e. the summary of the previous chunk should serve as both read-memory and write-memory.
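
To spell it out without the cats, here is a minimal sketch of the wiring. This is not BakaLLM's actual code; run_layers, n_mem and the shapes are made up for illustration:

import torch

n_mem, d_model = 4, 64                     # number of memory tokens, hidden size
rmt_tokens = torch.randn(n_mem, d_model)   # learned memory prompt, used to seed chunk 0

def run_layers(x):
    # stand-in for the whole transformer stack: same shape in, same shape out
    return x + 0.01 * torch.randn_like(x)

def step(chunk, prev_mem):
    # chunk: (chunk_len, d_model) embeddings of the current chunk (pause token included)
    # WRONG (what I had): x = torch.cat([prev_mem, chunk, rmt_tokens], dim=0)
    # FIXED (RMT paper): the same prev_mem is prepended AND appended,
    # so the previous summary serves as both read-memory and write-memory
    x = torch.cat([prev_mem, chunk, prev_mem], dim=0)
    y = run_layers(x)
    new_mem = y[-n_mem:]   # hidden states over the trailing copy become the next summary
    return y, new_mem

mem = rmt_tokens                              # chunk 0 starts from the learned tokens
for chunk in torch.randn(3, 128, d_model):    # three chunks of 128 token embeddings
    out, mem = step(chunk, mem)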

I thought that rmt_tokens should transcribe to "in other words" at the token level on the input and produce a summary on the output. It didn't work this way! Experiments were had, but they were so awful I didn't even record them in my fancy SQL results database.

And here are the training graphs. They tell a different story!
They say "whaa-at, a difference of 0.04? that's nothing":

(Image: SQL database with results)

(Image: training graphs)

Gray: we don't talk about this one. Red is pause, blue is the current RMT, the head of 00x_rmt in the repo, a-head of the curve; so maybe if I had waited, the previous implementation would have been fine as well, but, but, but, I am a mere mortal, so fixes were made, 3 epochs were run, and that's it.

Now, why is it still in the 00x branch and has not been moved into mainline 005_rmt? Well, well, well. I want to experiment more with the MEMORY SOLIDIFIER. GLU is out of the window, we don't talk about that one. But maybe simple BakaMLP layers, or returning a full layer again, will help?
But maybe the elephant will help us all once again? Gating with the elephant has been so cozy so far, after all.
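
For reference, a hypothetical sketch of what "gating with the elephant" could mean for memory, assuming it refers to the bell-shaped elephant function 1 / (1 + |x/a|^d) from earlier parts; ElephantGate and the parameters a, d are placeholder names of mine, not anything from the repo:

import torch
import torch.nn as nn

def elephant(x, a: float = 1.0, d: float = 4.0):
    # bell-shaped gate in (0, 1]: close to 1 near zero, decays towards 0 for large |x|
    return 1.0 / (1.0 + (x / a).abs() ** d)

class ElephantGate(nn.Module):
    # hypothetical: blend old and candidate memory with an elephant-shaped gate
    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, old_mem, new_mem):
        g = elephant(self.proj(new_mem))
        return g * new_mem + (1.0 - g) * old_mem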

And then graphs will be happy, elephants will be happy and loss will go towards minus infinity.

The experimentation will continue, and the memory will be solidified, because putting layer[-1].output into layer[0].input without massaging the tokens seems insane. Intuition says that the distribution of data across hidden dimensions shouldn't be the same on two different layers. But we treat it as if it were. I will not stand for this travesty. Every layer is unique! Is special! Therefore in mainline it will be

sequence[i].input_with_rmt = solidifier(sequence[i-1].rmt_at_the_end) 🐱 input(with pause) 🐱 solidifier(sequence[i-1].rmt_at_the_end)

And maybe I will even use two solidifiers: one for read and one for write.
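
None of this is in the repo yet, but as a sketch of the two-solidifier variant (Solidifier here is just a generic residual MLP stand-in, not the real BakaMLP), it could look like:

import torch
import torch.nn as nn

class Solidifier(nn.Module):
    # generic residual MLP stand-in for whatever the solidifier ends up being
    def __init__(self, d_model: int, hidden=None):
        super().__init__()
        hidden = hidden or 4 * d_model
        self.up = nn.Linear(d_model, hidden)
        self.down = nn.Linear(hidden, d_model)
        self.act = nn.GELU()

    def forward(self, mem):
        # residual, so an untrained solidifier starts close to the identity
        return mem + self.down(self.act(self.up(mem)))

d_model = 64
solidify_read = Solidifier(d_model)     # massages the prepended (read) copy
solidify_write = Solidifier(d_model)    # massages the appended (write) copy

prev_mem = torch.randn(4, d_model)
chunk = torch.randn(128, d_model)
x = torch.cat([solidify_read(prev_mem), chunk, solidify_write(prev_mem)], dim=0)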

Whatever the end, the elephants will prevail!

And so will cats, but that goes without saying, so it was not said before.

Chill!
