Prashant Lakhera

🔥Finally, I was able to build the model from scratch🔥

After multiple iterations, experiments, and lessons learned, I finally built a 550M-parameter model completely from scratch.

This isn’t my first time building a Small Language Model. I’ve built a few before, but they were trained on toy datasets like TinyStories.

This time, I made a deliberate choice: to build something meaningful, using real data, not a toy dataset.

Dataset

Tokenizer
Tokenizers are often overlooked, but they play a critical role in building effective language models. I created this video to share my journey of understanding and choosing the right tokenizer.

Picking the right Tokenizer: The reason behind the choice: https://youtu.be/Xr2xpHDSC6A
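
To make this concrete, here is a minimal sketch of training a byte-level BPE tokenizer with the Hugging Face tokenizers library. The corpus file, vocab size, and special token are placeholders, not the exact settings behind the 550M model, and the video above covers why you might pick one scheme over another.

```python
# Minimal byte-level BPE training sketch (placeholder corpus/vocab settings).
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=32_000,                                   # placeholder vocab size
    special_tokens=["<|endoftext|>"],                    # placeholder special token
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)   # placeholder corpus file
tokenizer.save("tokenizer.json")

print(tokenizer.encode("Attention is all you need.").tokens)
```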

Attention
Attention is one of those concepts that sounds complex, but its idea is simple: focus on what matters. When I started building models from scratch, understanding attention completely changed how I looked at language models. I created this video to share my journey of understanding and choosing the right attention mechanism.

Picking the Right Attention Mechanism: What Actually Works in Practice: https://youtu.be/HCa6Pp9EUiI

If you’re curious about how self-attention actually works under the hood, check out this video where I break it down step by step: https://youtu.be/EXnvO86m1W8
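
If you prefer code to diagrams, here is a minimal single-head causal self-attention layer in PyTorch. It only illustrates the core idea (query/key/value projections, scaled dot-product scores, softmax), not the exact attention variant used in the 550M model; the dimensions are placeholders.

```python
# Minimal single-head causal self-attention sketch in PyTorch.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)  # fused Q/K/V projection
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                           # x: (batch, seq_len, d_model)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        # causal mask: each token attends only to itself and earlier tokens
        seq_len = x.size(1)
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool,
                                     device=x.device), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
        weights = F.softmax(scores, dim=-1)         # "focus on what matters"
        return self.out(weights @ v)

x = torch.randn(1, 8, 64)
print(SelfAttention(64)(x).shape)  # torch.Size([1, 8, 64])
```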

Architecture
The architecture I followed is a modern pre-normalized Transformer block, optimized for efficiency, stability, and scalability, especially for mid-sized models like this 550M-parameter SLM.
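
As a rough illustration of what "pre-normalized" means in code (not the exact 550M configuration; the dimensions and the use of PyTorch's built-in attention are placeholders), a pre-norm block applies LayerNorm before the attention and MLP sub-layers, with residual connections around each:

```python
# Sketch of a pre-norm Transformer block: norm -> sublayer -> residual add.
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, mlp_ratio * d_model),
            nn.GELU(),
            nn.Linear(mlp_ratio * d_model, d_model),
        )

    def forward(self, x):
        h = self.norm1(x)                              # normalize *before* attention
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out                               # residual around attention
        x = x + self.mlp(self.norm2(x))                # residual around MLP
        return x

x = torch.randn(2, 16, 512)
print(PreNormBlock(512, 8)(x).shape)  # torch.Size([2, 16, 512])
```

The practical payoff of pre-normalization is training stability: gradients flow through the residual path without passing through a LayerNorm, which matters more as models grow.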

Training Cost
For training, I used RunPod (https://runpod.io/) and rented 8×A100 GPUs for 1.5 days, with a total cost of approximately $405.
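
The numbers line up with a typical A100 rental rate. The per-GPU-hour price below is my assumption for a back-of-envelope check, not an official RunPod quote:

```python
# Back-of-envelope check of the training cost quoted above.
gpus = 8
hours = 1.5 * 24            # 1.5 days of wall-clock training
rate_per_gpu_hour = 1.40    # assumed $/GPU-hour, not an official RunPod price

total = gpus * hours * rate_per_gpu_hour
print(f"~${total:.0f}")     # ~$403, close to the ~$405 reported
```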

NOTE: Make sure your root disk has enough space. I had to cancel one training run because I ran out of disk space.
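
A quick pre-flight check like this can save you a failed run; the path and the 500 GB threshold are just example values, adjust them to your checkpoint sizes:

```python
# Pre-flight disk space check before launching a training run.
import shutil

free_gb = shutil.disk_usage("/").free / 1e9          # free space on the root disk
assert free_gb > 500, f"Only {free_gb:.0f} GB free; checkpoints may not fit"
print(f"{free_gb:.0f} GB free on /")
```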

Final Output
After training and setup, the model is now up and running, ready to answer questions.
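
Querying a finished checkpoint with the Hugging Face transformers API looks roughly like this. The checkpoint path and generation settings are placeholders; as noted in the Summary, the model isn't on Hugging Face yet:

```python
# Sketch of loading a local causal LM checkpoint and generating an answer.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/local-550m-checkpoint"          # placeholder path
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("What is attention in a transformer?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100,
                         do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```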

Book

Throughout this journey, one resource that consistently helped me was my own book, Building a Small Language Model from Scratch.

Writing the book forced me to slow down and deeply understand every component: tokenizers, attention mechanisms, architecture choices, training pipelines, and debugging failures. When I was building this 550M-parameter model, I often found myself going back to my own explanations, diagrams, and code walkthroughs to validate decisions and avoid shortcuts.

This project wasn’t just about applying what I knew; it was about closing the gap between theory and practice. The process of documenting concepts while building models in parallel played a huge role in making this work possible.

✅ Gumroad: https://plakhera.gumroad.com/l/BuildingASmallLanguageModelfromScratch

✅ Amazon: https://www.amazon.com/dp/B0G64SQ4F8/

✅ Leanpub: https://leanpub.com/buildingasmalllanguagemodelfromscratch/

Summary
Over the last four months, I’ve fully dedicated myself to building Small Language Models from scratch. Along the way, I’ve learned a tremendous amount, and I’ll be sharing those lessons through upcoming YouTube videos and blog posts.

Can this model compete with frontier lab models? Absolutely not—and that was never the goal. What truly matters are the lessons learned at every step of the journey. The model is still being tested, and once validation is complete across all datasets, I’ll make it available on Hugging Face.
