<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: rudyon</title>
    <description>The latest articles on DEV Community by rudyon (@rudyon).</description>
    <link>https://dev.to/rudyon</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3833625%2Fcdd0dcff-7b4c-464c-a606-caccfd2d1148.png</url>
      <title>DEV Community: rudyon</title>
      <link>https://dev.to/rudyon</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rudyon"/>
    <language>en</language>
    <item>
      <title>15 Architecture Experiments: Training a GPT-2 Style Model on Vast.ai for $10</title>
      <dc:creator>rudyon</dc:creator>
      <pubDate>Thu, 19 Mar 2026 11:41:25 +0000</pubDate>
      <link>https://dev.to/rudyon/15-architecture-experiments-training-gpt-2-on-vastai-for-10-2g3j</link>
      <guid>https://dev.to/rudyon/15-architecture-experiments-training-gpt-2-on-vastai-for-10-2g3j</guid>
<description>&lt;p&gt;Recently I dropped out of my English Literature degree to pursue ML/AI instead. It felt more like my passion and what I truly wanted to do. I initially started with the &lt;a href="https://course.fast.ai" rel="noopener noreferrer"&gt;fast.ai course&lt;/a&gt;, only to be frustrated with its teaching style and outdated libraries. So I pivoted to &lt;a href="https://karpathy.ai" rel="noopener noreferrer"&gt;Andrej Karpathy&lt;/a&gt;'s &lt;a href="https://www.youtube.com/playlist?list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ" rel="noopener noreferrer"&gt;Zero to Hero playlist&lt;/a&gt; pretty quickly after that, without finishing the &lt;a href="https://course.fast.ai" rel="noopener noreferrer"&gt;fast.ai course&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I followed &lt;a href="https://karpathy.ai" rel="noopener noreferrer"&gt;Karpathy&lt;/a&gt;'s videos pretty much exactly. While doing so, I often paused to look things up or ask LLMs about parts I felt I didn't understand. I believe I know a lot more about ML/AI now than when I first started. This culminated in me creating &lt;a href="https://github.com/rudyon/pipeline" rel="noopener noreferrer"&gt;rudyon/pipeline&lt;/a&gt; using what I learned from &lt;a href="https://karpathy.ai" rel="noopener noreferrer"&gt;Karpathy&lt;/a&gt;'s videos.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/rudyon/pipeline" rel="noopener noreferrer"&gt;rudyon/pipeline&lt;/a&gt; started out as a simple implementation of a training loop for the GPT-2 architecture. However I quickly grew it into a full training pipeline that can be run on rented GPU instances. I was primarily targeting &lt;a href="https://vast.ai" rel="noopener noreferrer"&gt;Vast.ai&lt;/a&gt; like services while building it. Essentially rental services for instances with GPUs that one can &lt;code&gt;ssh&lt;/code&gt; into.&lt;/p&gt;
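&lt;p&gt;For readers who haven't seen one, the skeleton of such a training loop looks roughly like this. This is a minimal sketch, not rudyon/pipeline's actual code; the stand-in model, hyperparameters, and random batches are all assumptions for illustration.&lt;/p&gt;

```python
import torch

# Tiny stand-in for a real GPT-2 module; the real model is far larger.
model = torch.nn.Sequential(
    torch.nn.Embedding(50257, 64),   # GPT-2 vocab size, tiny embedding dim
    torch.nn.Linear(64, 50257),      # project back to vocabulary logits
)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

for step in range(3):  # a real run uses thousands of steps
    x = torch.randint(0, 50257, (4, 32))  # (batch, seq_len) token ids
    y = torch.randint(0, 50257, (4, 32))  # next-token targets
    logits = model(x)                     # (batch, seq_len, vocab)
    loss = torch.nn.functional.cross_entropy(
        logits.view(-1, logits.size(-1)), y.view(-1)
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"step {step}: loss {loss.item():.4f}")
```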

&lt;p&gt;After getting &lt;a href="https://github.com/rudyon/pipeline" rel="noopener noreferrer"&gt;rudyon/pipeline&lt;/a&gt; to a state I was satisfied with, I rented a 2x RTX 4090 instance on &lt;a href="https://vast.ai" rel="noopener noreferrer"&gt;Vast.ai&lt;/a&gt; with 400 GB of storage to make sure I wouldn't run out of space during training. I used that instance to pretrain the &lt;a href="https://huggingface.co/rudyon/rudygpt" rel="noopener noreferrer"&gt;rudyon/rudygpt&lt;/a&gt; model, which has 124M parameters and a &lt;code&gt;depth&lt;/code&gt; of 12, using the training pipeline I had built. The training run cost me about $10 in total. The model was trained on the &lt;a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu" rel="noopener noreferrer"&gt;HuggingFaceFW/fineweb-edu&lt;/a&gt; dataset's &lt;code&gt;sample-10BT&lt;/code&gt; subset for 19,073 training steps, which took about 19 hours.&lt;/p&gt;
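&lt;p&gt;As a back-of-the-envelope check on where that step count comes from: assuming a Karpathy-style GPT-2 setup of 524,288 tokens per optimizer step (an assumption; the pipeline's actual batch size may differ), 19,073 steps covers roughly one pass over the 10B tokens of &lt;code&gt;sample-10BT&lt;/code&gt;.&lt;/p&gt;

```python
# Assumed total batch size in tokens per optimizer step (Karpathy's GPT-2
# reproduction uses 2**19 = 524,288; the actual pipeline may differ).
tokens_per_step = 524_288
steps = 19_073
total_tokens = tokens_per_step * steps
print(f"{total_tokens:,}")  # 9,999,745,024 -- roughly the 10B tokens of sample-10BT
```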

&lt;p&gt;I then fine-tuned &lt;a href="https://huggingface.co/rudyon/rudygpt" rel="noopener noreferrer"&gt;rudyon/rudygpt&lt;/a&gt; on the &lt;a href="https://huggingface.co/datasets/tatsu-lab/alpaca" rel="noopener noreferrer"&gt;tatsu-lab/alpaca&lt;/a&gt; dataset in a notebook on Kaggle's free T4 GPU to make &lt;a href="https://huggingface.co/rudyon/rudygpt-instruct" rel="noopener noreferrer"&gt;rudyon/rudygpt-instruct&lt;/a&gt;. You can chat with the final model on the &lt;a href="https://huggingface.co/spaces/rudyon/rudygpt" rel="noopener noreferrer"&gt;Hugging Face Space&lt;/a&gt; I made.&lt;/p&gt;
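&lt;p&gt;For context, Alpaca records are instruction/input/output triples, and instruction tuning typically means rendering each one into a prompt template. The template below is the common Alpaca-style format, not necessarily the exact one rudygpt-instruct was trained with.&lt;/p&gt;

```python
# Hypothetical prompt template for Alpaca-style records; the exact template
# used for rudygpt-instruct is an assumption.
def format_alpaca(example: dict) -> str:
    if example.get("input"):
        return (
            "### Instruction:\n" + example["instruction"] + "\n\n"
            "### Input:\n" + example["input"] + "\n\n"
            "### Response:\n" + example["output"]
        )
    return (
        "### Instruction:\n" + example["instruction"] + "\n\n"
        "### Response:\n" + example["output"]
    )

record = {"instruction": "Name a primary color.", "input": "", "output": "Red."}
print(format_alpaca(record))
```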

&lt;p&gt;Now, after all of this, I was kind of dumbfounded about what exactly to do next. I still wanted to work on &lt;a href="https://github.com/rudyon/pipeline" rel="noopener noreferrer"&gt;rudyon/pipeline&lt;/a&gt;, but I didn't want to spend any more money. So I just went on Twitter to doomscroll for a bit, as you do, and saw that &lt;a href="https://karpathy.ai" rel="noopener noreferrer"&gt;Karpathy&lt;/a&gt; had released &lt;a href="https://github.com/karpathy/autoresearch" rel="noopener noreferrer"&gt;karpathy/autoresearch&lt;/a&gt;. I took a quick look at it and immediately wanted to do the same thing on my project. Except there was one problem... I don't have a GPU of my own. I am doing all of this on a Matebook D15 with only a 10th-gen i5. But after spending some more time thinking about it, looking at &lt;a href="https://github.com/karpathy/autoresearch" rel="noopener noreferrer"&gt;karpathy/autoresearch&lt;/a&gt; a couple more times, and diving a little deeper into how it worked, I figured out that I didn't need a GPU. I could run the experiments on Kaggle manually.&lt;/p&gt;

&lt;p&gt;This would not really make it "autoresearch" per se, but it would be at least semi-automatic. I didn't change the repo itself for this. I simply used the "import code" feature on Gemini, which lets you import GitHub repos, imported my repository, and asked Gemini for experiment ideas to lower the validation loss. I documented the experiments manually in an &lt;code&gt;experiments.jsonl&lt;/code&gt; file after running each one. Gemini came up with some ideas, but they were too generic; I was looking for weirder stuff. So eventually I tried something I had always thought would improve LLMs at least a little bit: convolution. I added a causal 1D convolution layer before the attention mechanism, and to my surprise it actually worked. There were a few other experiments, but the convolution before attention was the most surprising one to me. I never thought my naive idea could work, but it does, at least at &lt;code&gt;depth&lt;/code&gt; 4, which is what the experiments ran at. Each experiment was run for 300 steps total in order to iterate fast. I used &lt;code&gt;depth&lt;/code&gt; 4 specifically because Kaggle's T4s don't have the memory for a &lt;code&gt;depth&lt;/code&gt; 12 model's parameters. Below is the experiment data and a graph showing the running best.&lt;/p&gt;
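&lt;p&gt;To make the idea concrete, here is a minimal sketch of a causal 1D convolution that could sit before attention. The kernel size and depthwise grouping are my assumptions; the actual layer in rudyon/pipeline may differ. The key point is the left-only padding, which keeps the layer causal.&lt;/p&gt;

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConvBeforeAttention(nn.Module):
    """Causal depthwise 1D convolution over the token stream, applied
    before the attention block. Kernel size 4 and depthwise grouping
    are assumptions for this sketch."""
    def __init__(self, n_embd: int, kernel_size: int = 4):
        super().__init__()
        self.kernel_size = kernel_size
        # Depthwise: one filter per embedding channel.
        self.conv = nn.Conv1d(n_embd, n_embd, kernel_size, groups=n_embd)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, n_embd); Conv1d expects (batch, channels, seq_len)
        x = x.transpose(1, 2)
        # Left-pad by kernel_size - 1 so position t only sees positions up to t.
        x = F.pad(x, (self.kernel_size - 1, 0))
        return self.conv(x).transpose(1, 2)

x = torch.randn(2, 16, 64)
layer = CausalConvBeforeAttention(64)
print(layer(x).shape)  # torch.Size([2, 16, 64])
```

Because the padding is only on the left, perturbing a later token cannot change the output at earlier positions, so the autoregressive property of the model is preserved.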

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Validation Loss&lt;/th&gt;
&lt;th&gt;Improvement from Baseline&lt;/th&gt;
&lt;th&gt;Kept?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Baseline&lt;/td&gt;
&lt;td&gt;6.8945&lt;/td&gt;
&lt;td&gt;0.00%&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RMSNorm (Attempt 1)&lt;/td&gt;
&lt;td&gt;6.9030&lt;/td&gt;
&lt;td&gt;+0.12%&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No GPT-2 Weight Inits&lt;/td&gt;
&lt;td&gt;7.6813&lt;/td&gt;
&lt;td&gt;+11.41%&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SwiGLU&lt;/td&gt;
&lt;td&gt;6.8919&lt;/td&gt;
&lt;td&gt;-0.03%&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No Positional Embeddings&lt;/td&gt;
&lt;td&gt;6.8180&lt;/td&gt;
&lt;td&gt;-1.10%&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RoPE &amp;amp; No Bias&lt;/td&gt;
&lt;td&gt;6.7807&lt;/td&gt;
&lt;td&gt;-1.65%&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Maximum Learning Rate = 0.001&lt;/td&gt;
&lt;td&gt;6.5290&lt;/td&gt;
&lt;td&gt;-5.30%&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No Weight Tying&lt;/td&gt;
&lt;td&gt;6.9030&lt;/td&gt;
&lt;td&gt;+0.12%&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Convolution Before Attention&lt;/td&gt;
&lt;td&gt;6.5077&lt;/td&gt;
&lt;td&gt;-5.61%&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Convolution Before MLP&lt;/td&gt;
&lt;td&gt;6.5978&lt;/td&gt;
&lt;td&gt;-4.30%&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mixture of Experts&lt;/td&gt;
&lt;td&gt;6.5322&lt;/td&gt;
&lt;td&gt;-5.25%&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mobile Convolution&lt;/td&gt;
&lt;td&gt;6.5121&lt;/td&gt;
&lt;td&gt;-5.54%&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RMSNorm (Attempt 2)&lt;/td&gt;
&lt;td&gt;6.5139&lt;/td&gt;
&lt;td&gt;-5.52%&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Muon Optimizer&lt;/td&gt;
&lt;td&gt;6.4495&lt;/td&gt;
&lt;td&gt;-6.45%&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parallel Q/K Normalization&lt;/td&gt;
&lt;td&gt;6.3235&lt;/td&gt;
&lt;td&gt;-8.28%&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RoPE Base = 50000&lt;/td&gt;
&lt;td&gt;6.3221&lt;/td&gt;
&lt;td&gt;-8.30%&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
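&lt;p&gt;Each row of the table above corresponds to one manually logged record in &lt;code&gt;experiments.jsonl&lt;/code&gt;. A sketch of what appending such a record might look like (the field names are my assumptions, not the repo's actual schema):&lt;/p&gt;

```python
import json

# Hypothetical record shape for a manually maintained experiments.jsonl;
# the field names are assumptions, not the repo's actual schema.
def log_experiment(path: str, description: str, val_loss: float, kept: bool,
                   baseline: float = 6.8945) -> dict:
    record = {
        "description": description,
        "val_loss": val_loss,
        "improvement_pct": round((val_loss - baseline) / baseline * 100, 2),
        "kept": kept,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")  # one JSON object per line
    return record

rec = log_experiment("experiments.jsonl", "Convolution Before Attention", 6.5077, True)
print(rec["improvement_pct"])  # -5.61, matching the table row
```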

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9196lpi7fw1yz6rkcgl5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9196lpi7fw1yz6rkcgl5.png" alt="a graph showing running best validation loss for the experiments" width="800" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I have yet to do a new full training run with these improvements. I want to make a few more improvements to the architecture before doing another full run, on a bigger model this time. These last three weeks since dropping out of my English Literature degree have honestly been so much fun. I don't know yet if dropping out was the right thing to do, but I certainly do not regret it right now.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
