BakaLLM part NaN and general shenanigans

Hello, fairy dairy diary~.

No Cirno picture today, as nothing very interesting was accomplished.

This week I tried to integrate Mamba into BakaLLM by adding it alongside the MLP and ATTN blocks.

Unfortunately, that increased the number of parameters to an unreasonable amount, so I then added it only to every 4th layer.
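For the curious, "every 4th layer" could be wired up roughly like this. It's only a sketch: `N_LAYERS`, `MAMBA_EVERY`, the block names, and the choice to count the 4th, 8th, 12th... layer (rather than the 1st, 5th...) are made up for illustration and are not the actual BakaLLM config.

```python
# Hypothetical values for illustration; not the real BakaLLM settings.
N_LAYERS = 12
MAMBA_EVERY = 4


def has_mamba(layer_idx, every=MAMBA_EVERY):
    """True for the 4th, 8th, 12th... layer (layer_idx is 0-based)."""
    return (layer_idx + 1) % every == 0


def layer_plan(n_layers=N_LAYERS, every=MAMBA_EVERY):
    """List which sub-blocks each layer would carry."""
    plan = []
    for i in range(n_layers):
        blocks = ["attn", "mlp"]
        if has_mamba(i, every):
            # Only these layers pay the extra parameter cost.
            blocks.append("mamba")
        plan.append(blocks)
    return plan
```

With these toy numbers only 3 of the 12 layers get the extra block, which is the whole point of the trick.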

That didn't go well with the loss, so I think I'll try moving it into the attn layer, as MEGA does (where it affects QK).

Other things tried: "MoE"-fication of layers. Each "big" layer was split into two normal layers, then a little softmax here, a little addition there, and we got the result. It didn't work well, and passing state around was a PITA, so I think I'll return to the Mamba experiments.
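The "little softmax, little addition" part can be pictured as something like the toy below. This is a guess at the general shape, not the code that was actually run: `mix_two_layers`, the gate scores, and the residual add are all assumptions, and plain Python lists stand in for tensors.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a short list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_two_layers(x, layer_a, layer_b, gate_scores):
    """Residual add of a softmax-weighted sum of two sub-layer outputs.

    Hypothetical helper: gate_scores would come from a small learned
    router in a real model; here they are passed in by hand.
    """
    w_a, w_b = softmax(gate_scores)
    out_a = layer_a(x)
    out_b = layer_b(x)
    return [xi + w_a * a + w_b * b for xi, a, b in zip(x, out_a, out_b)]
```

With equal gate scores both halves contribute 50/50, e.g. `mix_two_layers([1.0, 2.0], double, negate, [0.0, 0.0])` for two toy functions `double` and `negate`.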

Also, for fun, I played around with a very simple wav generator:
I took 3 YouTube videos of Touhou and converted them into 1024-dimensional tensors where X = [[sample at time point 0, sample at time point 1, ..., sample at time point 1023], [sample at time point 1, sample at time point 2, ..., sample at time point 1024], ...], then passed that through 1 layer of Mamba with loss = simple MSE. It learned to produce a nice "pluck", but that's it. Not unexpected, but still fun.
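The windowing is easier to see in code. A minimal sketch with plain lists: `make_windows`, `next_step_pairs` and `mse` are illustrative names, and pairing each window with the next one as the training target is an assumption, since only the input layout and the MSE loss are described above.

```python
WINDOW = 1024  # window length used in the post


def make_windows(samples, window=WINDOW):
    """Row i is samples[i : i + window]; each row is shifted by one sample."""
    return [samples[i:i + window] for i in range(len(samples) - window + 1)]


def next_step_pairs(samples, window=WINDOW):
    """Assumed setup: the model sees row i and is scored against row i + 1."""
    rows = make_windows(samples, window)
    return list(zip(rows[:-1], rows[1:]))


def mse(pred, target):
    """Plain mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
```

For a tiny example, `make_windows([0, 1, 2, 3, 4, 5], window=4)` gives three rows, each starting one sample later than the previous one.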

I'll try to play around with the music next time, maybe by cutting off the first and last 15 seconds, as I think it learned silence.
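The trimming itself would be a one-liner over the raw samples. A sketch, assuming the audio is already a flat sequence of samples with a known sample rate; `trim_edges` is a made-up helper name.

```python
def trim_edges(samples, sample_rate, seconds=15.0):
    """Drop `seconds` of audio from both the start and the end."""
    cut = int(sample_rate * seconds)
    if len(samples) <= 2 * cut:
        # Clip is too short: nothing would be left after trimming.
        return samples[:0]
    return samples[cut:len(samples) - cut]
```

At 44100 Hz that removes 661500 samples from each end, so anything shorter than 30 seconds comes back empty.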

DEV Community
