SelfLM — Building a Tiny LLM from Scratch (End-to-End)
LLMs are complex, but not magical — once you break them into components, everything becomes understandable.
It started with a simple question:
“How do models like GPT actually work?”
So I decided to build a smaller version myself — step by step — from dataset generation to tokenization, training, and deployment. Everything is fully open-source.
What This Project Covers
Instead of treating models as black boxes, this project focuses on the entire pipeline:
- Synthetic dataset generation (~60K samples)
- Tokenization & preprocessing
- Transformer architecture (from scratch)
- Training pipeline
- Inference & deployment
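To make the tokenization and preprocessing steps concrete, here is a minimal character-level sketch: build a vocabulary, encode text to ids, and slice the id stream into shifted input/target windows for next-token prediction. This is illustrative only; SelfLM's actual tokenizer and preprocessing may differ.

```python
# Minimal character-level tokenizer sketch (illustrative only --
# SelfLM's real tokenizer and vocabulary may differ).

def build_vocab(corpus: str) -> dict[str, int]:
    """Map each unique character to an integer id."""
    return {ch: i for i, ch in enumerate(sorted(set(corpus)))}

def encode(text: str, vocab: dict[str, int]) -> list[int]:
    return [vocab[ch] for ch in text]

def decode(ids: list[int], vocab: dict[str, int]) -> str:
    inv = {i: ch for ch, i in vocab.items()}
    return "".join(inv[i] for i in ids)

def make_pairs(ids: list[int], ctx: int) -> list[tuple[list[int], list[int]]]:
    """Next-token training pairs: target is the input shifted by one."""
    return [(ids[i:i + ctx], ids[i + 1:i + ctx + 1])
            for i in range(len(ids) - ctx)]

# Usage: round-trip a string and slice it into training windows.
vocab = build_vocab("hello world")
ids = encode("hello world", vocab)
pairs = make_pairs(ids, ctx=4)
```

Every LLM pipeline, large or small, starts with some variant of these three steps before the model ever sees a batch.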
Highlights
- Trained in ~5 minutes (Colab T4 GPU)
- Fully custom LLM (~9M parameters)
- Hugging Face model + dataset + live Space
- Serverless deployment using ONNX on Vercel (free tier)
- Lightweight, browser-friendly inference
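To get a feel for where a figure like ~9M parameters comes from, here is a back-of-the-envelope count for a GPT-style decoder. The dimensions below are hypothetical stand-ins, not SelfLM's actual hyperparameters, and biases/layer norms are omitted for simplicity.

```python
def gpt_params(vocab_size: int, d_model: int, n_layers: int, ctx_len: int) -> int:
    """Rough parameter count for a GPT-style decoder.

    Assumes the output head is tied to the token embedding and
    ignores biases and layer-norm parameters (they are negligible).
    """
    embeddings = vocab_size * d_model + ctx_len * d_model  # token + positional
    attention = 4 * d_model * d_model       # Q, K, V, and output projections
    mlp = 2 * 4 * d_model * d_model         # up- and down-projection at 4x width
    return embeddings + n_layers * (attention + mlp)

# Hypothetical config -- lands in the same ballpark as ~9M parameters.
total = gpt_params(vocab_size=8000, d_model=256, n_layers=8, ctx_len=256)
```

At this scale the embedding table and the transformer blocks contribute comparably, which is one reason tiny models shrink the vocabulary as well as the layers.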
Live Demo
https://selflm.vercel.app/docs
Hugging Face Space
https://huggingface.co/spaces/Mudasir-Habib/selflm-demo
Colab Notebook
https://colab.research.google.com/drive/1EyR5mFuHupJWdnJWazvdjU1Bre2rF2RD?usp=sharing
GitHub Repository
https://github.com/Mudasirhabib123/selflm
Customization Feature
One of the most interesting parts:
You can customize the model with your own data by simply:
- Editing the first cell in the Colab notebook
- OR modifying src/dataset/data.py
Add your own context, retrain, and instantly get a personalized LLM.
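As a sketch of what that customization might look like: the snippet below assumes the dataset file exposes question/answer pairs that get formatted into plain training strings. The names and structure here are hypothetical; check the repo's actual src/dataset/data.py before editing.

```python
# Hypothetical sketch of adding your own context to the dataset.
# The real structure of src/dataset/data.py may differ -- check the repo.

CUSTOM_FACTS = [
    ("Who trained this model?", "It was retrained by <your name>."),
    ("What is SelfLM?", "A tiny open-source LLM built from scratch."),
]

def to_training_text(pairs: list[tuple[str, str]]) -> list[str]:
    """Format question/answer pairs into plain training strings."""
    return [f"Q: {q}\nA: {a}" for q, a in pairs]

samples = to_training_text(CUSTOM_FACTS)
```

Because the model is so small, retraining on an edited dataset takes only minutes, which is what makes this kind of personalization practical.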
Goal
This project is built for:
- Learning how LLMs actually work
- Experimentation with small-scale models
- Understanding the full pipeline end-to-end
Open Source
Fully open-source and designed to make LLMs accessible, transparent, and understandable.
If you find it useful, consider giving it a star on GitHub.