DEV Community

Maykeye


BakaLLM part NaN and general shenanigans

Hello, fairy dairy diary~.

No Cirno picture today, as not much of interest was accomplished.

This week I tried to integrate Mamba into BakaLLM by adding it alongside the MLP and attention blocks.

Unfortunately, that increased the number of parameters to an unreasonable amount, so I then added it only to every 4th layer.

That didn't go well for the loss, so I think I'll try to move it into the attention layer as MEGA does (where it affects Q and K).
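To make the parameter trade-off concrete, here is a minimal sketch of the "every 4th layer" placement. The helper names, the layer count, and the per-block parameter figure are all made up for illustration, and whether the extra block sits on layers 3, 7, 11 or on 0, 4, 8 is an assumption too; this is not BakaLLM's actual code.

```python
def mamba_layer_indices(n_layers: int, every: int = 4) -> list[int]:
    """Indices of layers that get an extra Mamba block (hypothetical helper).

    Assumption: the block goes on the last layer of each group of `every`
    layers. With every=1 this selects all layers.
    """
    return [i for i in range(n_layers) if i % every == every - 1]


def extra_params(n_layers: int, params_per_block: int, every: int = 1) -> int:
    """Total parameters added by the extra Mamba blocks."""
    return len(mamba_layer_indices(n_layers, every)) * params_per_block


# Illustrative numbers only: 12 layers, 1000 parameters per Mamba block.
all_layers = extra_params(12, 1000, every=1)
every_fourth = extra_params(12, 1000, every=4)
print(mamba_layer_indices(12, 4), all_layers, every_fourth)
```

With these toy numbers, putting the block on every 4th layer adds a quarter of the parameters that putting it on every layer would.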

Other things tried: "MoE"-fication of layers. Each "big" layer was split into two normal layers, then a little softmax here, a little addition there, and we got the result. It didn't work well, and passing state around was a PITA, so I think I'll return to the Mamba experiments.

Also, for fun, I played around with a very simple WAV generator:
I took 3 YouTube videos of Touhou and converted them into 1024-dimensional tensors where X=[[sample at time point 0, sample at time point 1, ..., sample at time point 1023], [sample at time point 1, sample at time point 2, ..., sample at time point 1024], ...], then passed them through 1 layer of Mamba with loss = simple MSE. It learned to produce a nice "pluck", but that's it. Not unexpected, but still fun.
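The windowing described above can be sketched in plain Python. This is only an illustration under assumptions: `make_windows` and `mse` are hypothetical names, the toy window size is 4 instead of 1024, and what exactly served as the MSE target is not stated in the post.

```python
def make_windows(samples, window=1024):
    """Return every length-`window` slice of `samples`, each shifted by one sample.

    Row 0 covers samples 0..window-1, row 1 covers samples 1..window, and so on.
    """
    return [samples[i:i + window] for i in range(len(samples) - window + 1)]


def mse(pred, target):
    """Plain mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


# Toy example: 6 samples and a window of 4 instead of 1024.
X = make_windows([0, 1, 2, 3, 4, 5], window=4)
print(X)

# Assumption for illustration only: score one window against the next one.
print(mse(X[0], X[1]))
```

On real tensors the same overlapping windows could presumably be built with PyTorch's `Tensor.unfold` rather than a list comprehension.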

I'll try to play around with the music next time, maybe by cutting off the first and last 15 seconds, as I think it learned silence.
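Cutting off the edges could look roughly like this. It is a sketch, not something already tried: `trim_edges` is a hypothetical helper, and the sample rate depends on how the audio was decoded.

```python
def trim_edges(samples, sample_rate, seconds=15):
    """Drop the first and last `seconds` of audio from a sequence of samples.

    Returns an empty sequence if the clip is too short to survive the trim.
    """
    cut = int(sample_rate * seconds)
    if len(samples) <= 2 * cut:
        return samples[:0]
    return samples[cut:len(samples) - cut]


# Toy example: 100 samples at 2 samples per second, so 30 samples go from each end.
trimmed = trim_edges(list(range(100)), sample_rate=2, seconds=15)
print(len(trimmed), trimmed[0], trimmed[-1])
```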
