So I just got an RTX 4090 and put the machine together. This is going to make a huge difference.
Right off the bat I'm jumping on the 13B bandwagon. Vicuna 13B 16K looks awesome, but I've been struggling with the RoPE settings for it.
In TheBloke's README, I'm seeing:
> Change `-c 2048` to the desired sequence length for this model. For example, `-c 4096` for a Llama 2 model. For models that use RoPE, add `--rope-freq-base 10000 --rope-freq-scale 0.5` for doubled context, or `--rope-freq-base 10000 --rope-freq-scale 0.25` for 4x context.
The problem was that I couldn't work out where the rope scale was in Oobabooga; 0.5 wasn't a viable option. However, after a while I saw someone else recommend leaving rope_freq_base at 10k and setting compress_pos_emb to 4. After thinking about it a bit, it looks like instead of going down from 1 toward 0 on the rope scale as you increase the context, you go up by the factor you want to multiply the context by. So if you take a 4096-context model and want to push it to 16k, then instead of 0.25 you just use 4; compress_pos_emb is effectively the reciprocal of rope-freq-scale.
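To make the mapping concrete, here's a minimal sketch of the relationship as I understand it, assuming plain linear RoPE scaling. The function and variable names are just for illustration, not anything from llama.cpp or Oobabooga.

```python
# Minimal sketch of how llama.cpp's --rope-freq-scale relates to Oobabooga's
# compress_pos_emb under linear RoPE scaling. Function/variable names are
# illustrative only; they aren't from either project.

def rope_settings(native_ctx: int, target_ctx: int):
    factor = target_ctx / native_ctx      # how far you stretch the context
    rope_freq_scale = 1.0 / factor        # llama.cpp flag: shrinks toward 0
    compress_pos_emb = factor             # Oobabooga setting: grows from 1
    return rope_freq_scale, compress_pos_emb

# Example: a 4096-context model pushed to 16k
print(rope_settings(4096, 16384))         # -> (0.25, 4.0)
```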
So far I'm really liking this 4090. I had all kinds of issues getting Metal to work right on the Mac, but NVIDIA really is the first-class citizen in the AI world. This is lightning fast.