DEV Community

rudyon

15 Architecture Experiments: Training GPT-2 Style Model on Vast.ai for $10

Recently I dropped out of my English Literature degree to pursue ML/AI instead. I felt like this was more my passion and what I truly wanted to do. I initially started with the fast.ai course, only to be frustrated with its teaching style and outdated libraries, so I pivoted to Andrej Karpathy's Zero to Hero playlist pretty quickly, without finishing the fast.ai course.

I followed Karpathy's videos pretty much exactly, often pausing to look things up or ask LLMs about parts I didn't understand. I believe I know a lot more about ML/AI now than when I first started. This culminated in me creating rudyon/pipeline using what I learned from Karpathy's videos.

rudyon/pipeline started out as a simple implementation of a training loop for the GPT-2 architecture. However, I quickly grew it into a full training pipeline that can be run on rented GPU instances. While building it, I was primarily targeting services like Vast.ai: rental services for GPU instances that you can SSH into.

After getting rudyon/pipeline to a state I was satisfied with, I rented a 2x 4090S Ti instance on Vast.ai with 400 GB of storage to make sure I wouldn't run out of space during training. I used that instance to pretrain the rudyon/rudygpt model, which has 124M parameters at a depth of 12, using the training pipeline I had built. This training run cost me about $10 in total. The model was trained on the sample-10BT subset of the HuggingFaceFW/fineweb-edu dataset for 19,073 training steps, which took about 19 hours.
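For context on where that 124M figure comes from, here is a back-of-the-envelope parameter count for a GPT-2-small-shaped model. The hyperparameters are my assumptions (50257-token vocab, 1024-token context, 768-dim embeddings, 12 layers, weight-tied LM head); the actual rudygpt config may differ:

```python
def gpt2_param_count(vocab=50257, ctx=1024, d=768, n_layers=12):
    d_ff = 4 * d                                # GPT-2 MLP expansion factor
    emb = vocab * d + ctx * d                   # token + positional embeddings
    attn = (d * 3 * d + 3 * d) + (d * d + d)    # fused QKV + output proj (with biases)
    mlp = (d * d_ff + d_ff) + (d_ff * d + d)    # two linear layers (with biases)
    lns = 2 * (2 * d)                           # two LayerNorms (scale + shift)
    block = attn + mlp + lns
    final_ln = 2 * d
    # The LM head shares weights with the token embedding (weight tying),
    # so it adds no extra parameters here.
    return emb + n_layers * block + final_ln

print(gpt2_param_count())  # → 124439808, i.e. ~124.4M
```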

I then fine-tuned rudyon/rudygpt on the tatsu-lab/alpaca dataset to create rudyon/rudygpt-instruct, using a notebook on Kaggle's free T4 GPU. You can chat with the final model on the Hugging Face Space I made.
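For readers unfamiliar with the tatsu-lab/alpaca dataset: each example is an instruction, an optional input, and a response, which are usually rendered into a fixed prompt template before fine-tuning. Below is the standard Stanford Alpaca template as a sketch; whether rudygpt-instruct used this exact wording is my assumption:

```python
def alpaca_prompt(instruction, inp=""):
    """Render one Alpaca example into the standard prompt format."""
    if inp:
        # Variant used when the example has an "input" field for extra context.
        return (
            "Below is an instruction that describes a task, paired with an "
            "input that provides further context. Write a response that "
            "appropriately completes the request.\n\n"
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{inp}\n\n"
            "### Response:\n"
        )
    # Variant used when the example has no input.
    return (
        "Below is an instruction that describes a task. Write a response "
        "that appropriately completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n"
        "### Response:\n"
    )

print(alpaca_prompt("Name three primary colors."))
```

At inference time you format the user's message the same way and let the model complete the text after `### Response:`.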

Now, after all of this, I was kind of dumbfounded about what exactly to do next. I still wanted to work on rudyon/pipeline, but I didn't want to spend any more money. So I went on Twitter to doomscroll for a bit, as you do. Then I saw that Karpathy had released karpathy/autoresearch. I took a quick look at it and immediately wanted to do the same thing with my project. Except there was one problem... I don't have a GPU of my own; I'm doing all of this on a Matebook D15 with only a 10th-gen i5. But after thinking about it some more, looking at karpathy/autoresearch a couple more times, and diving a little deeper into how it actually works, I figured out that I didn't need a GPU: I could run the experiments on Kaggle manually.

This wouldn't really make it "autoresearch" per se, but I figured it would at least be semi-automatic. I didn't change the repo itself for this. I simply used the "import code" feature on Gemini, which lets you import GitHub repos, imported my repository, and asked Gemini for experiment ideas to lower the validation loss. After running each experiment, I documented it manually in an experiments.jsonl file.

Gemini came up with some ideas, but they were too generic; I was looking for weirder stuff. So eventually I tried something I had always thought would improve LLMs at least a little bit: convolution. I added a causal 1D convolution layer before the attention mechanism, and to my surprise it actually worked. There were a few other experiments, but the convolution before attention was the most surprising to me. I never thought my naive idea could work, but it does, at least at depth 4, which is what I used to run the experiments. It's also important to note that each experiment ran for only 300 steps, in order to iterate fast. I used depth 4 specifically because Kaggle's T4s don't have the memory for a depth-12 model. Below is the experiment data and a graph showing the running best.
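To make "causal 1D convolution" concrete before the results: the key property is that the output at position t depends only on positions ≤ t, which you get by left-padding the sequence by kernel_size − 1. Here is a minimal single-channel sketch in plain Python; the actual layer in a transformer would be a depthwise convolution over the embedding channels (that detail is my assumption about the implementation):

```python
def causal_conv1d(x, kernel):
    """Causal 1D convolution over one channel.

    x: list of scalars (one channel over time).
    kernel: list of k weights; kernel[-1] multiplies the current position.
    Left-padding with zeros guarantees position t never sees positions > t.
    """
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(x)
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(x))
    ]

# With an all-ones kernel of width 3 this is a causal running sum:
print(causal_conv1d([1, 2, 3, 4], [1, 1, 1]))  # → [1, 3, 6, 9]
```

Because future tokens cannot leak backwards, the layer preserves the autoregressive training objective, which is why it can sit in front of causal attention without changing the loss setup.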

| Description | Validation Loss | Improvement from Baseline | Kept? |
| --- | --- | --- | --- |
| Baseline | 6.8945 | 0.00% | Yes |
| RMSNorm (Attempt 1) | 6.9030 | +0.12% | No |
| No GPT-2 Weight Inits | 7.6813 | +11.41% | No |
| SwiGLU | 6.8919 | -0.03% | Yes |
| No Positional Embeddings | 6.8180 | -1.10% | Yes |
| RoPE & No Bias | 6.7807 | -1.65% | Yes |
| Maximum Learning Rate = 0.001 | 6.5290 | -5.30% | Yes |
| No Weight Tying | 6.9030 | +0.12% | No |
| Convolution Before Attention | 6.5077 | -5.61% | Yes |
| Convolution Before MLP | 6.5978 | -4.30% | No |
| Mixture of Experts | 6.5322 | -5.25% | No |
| Mobile Convolution | 6.5121 | -5.54% | No |
| RMSNorm (Attempt 2) | 6.5139 | -5.52% | No |
| Muon Optimizer | 6.4495 | -6.45% | Yes |
| Parallel Q/K Normalization | 6.3235 | -8.28% | Yes |
| RoPE Base = 50000 | 6.3221 | -8.30% | Yes |
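The "Improvement from Baseline" column is just the relative change in validation loss versus the baseline run (negative means better). A one-liner reproduces the table's percentages, assuming they were computed this way:

```python
def improvement_pct(loss, baseline=6.8945):
    """Relative change vs. the baseline validation loss, in percent."""
    return (loss - baseline) / baseline * 100

# Convolution Before Attention from the table above:
print(round(improvement_pct(6.5077), 2))  # → -5.61
```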

*A graph showing the running best validation loss across the experiments.*

I have yet to do a new full training run after these improvements. I want to make a few more changes to the architecture before doing another full training run, on a bigger model this time. These last three weeks since dropping out of my English Literature degree have honestly been so fun. I don't know yet whether dropping out was the right choice, but I certainly don't regret it right now.
