Introduction
"In 2019, training GPT-2 cost about $43,000. Today, for under $100 and about 3 hours, you can reproduce it on 8×H100s and chat with it."
This is Part 22 of the "Open Source Project of the Day" series. Today we explore nanochat (GitHub) by Andrej Karpathy. The project's tagline: The best ChatGPT that $100 can buy.
nanochat isn't yet another bloated "LLM framework" — no giant configuration objects, no model factories, no screens full of if-else branches. It's a minimalist, readable, hackable, forkable end-to-end experiment suite: tokenization, pretraining, fine-tuning, evaluation, inference, and a chat web UI, all connected end-to-end. It uses a single knob --depth (number of Transformer layers) to automatically derive width, heads, learning rate, training steps, and more — producing a family of compute-optimal models. Want "a GPT-2-level model"? Set depth to around 26, run runs/speedrun.sh, and a few hours later you can chat with it in a ChatGPT-style interface.
What You'll Learn
- nanochat's positioning: a minimalist, single-node multi-GPU, full-pipeline LLM training and chat suite
- The "single knob" design: how --depth determines compute-optimal hyperparameters and model scale
- From speedrun to chat: how to train a GPT-2-level model with one script and talk to it via Web UI
- The Time-to-GPT-2 leaderboard and the meaning of the CORE metric
- Project directory structure, key scripts, and extension approaches (new capabilities, identity/persona injection)
- Comparison with nanoGPT, modded-nanogpt, and common LLM frameworks
Prerequisites
- Basic understanding of Transformer and GPT-style models
- Familiarity with the distinction between pretraining and fine-tuning (SFT)
- Experience with Python and PyTorch; basic knowledge of multi-GPU training (e.g., torchrun) is helpful
- A GPU environment (single or multi-GPU) is helpful for hands-on reproduction
Project Background
Project Introduction
nanochat is an open-source minimalist LLM experiment suite by Andrej Karpathy, designed for single GPU nodes (e.g., 8×H100) with a small, clear, readable, and modifiable codebase. It covers the complete LLM pipeline: tokenization, pretraining, fine-tuning (SFT/RL), evaluation, inference, and chat UI. The goal is not to be a "configure everything" framework, but to provide a strong baseline: go from zero to a conversational ChatGPT-style model at a cost in the hundreds of dollars (e.g., ~3 hours on 8×H100, ~$72; spot instances as low as ~$20).
Core problems the project solves:
- Want to personally train, fine-tune, and chat with a small LLM, but mainstream frameworks are too complex with high cognitive overhead
- Need compute-optimal default configurations rather than manually tuning dozens of hyperparameters
- Want code that is minimal, hackable, and easy to fork for research and teaching
- Limited budget (under $1000), want clear time and cost expectations (e.g., "3 hours to GPT-2")
Target user groups:
- Developers and students who want to run the full LLM training and chat pipeline end-to-end
- Researchers doing small model and scaling studies
- Enthusiasts who want to understand the complete pipeline from data to conversational LLM
- Teams needing modifiable, reproducible baseline code
Author/Team Introduction
- Author: Andrej Karpathy (@karpathy), former Tesla AI Director, OpenAI researcher, Stanford CS PhD; widely known for nanoGPT, neural network tutorials, and other projects.
- Acknowledgments: The nanochat name comes from nanoGPT; the pretraining and leaderboard design was inspired by modded-nanoGPT; thanks to HuggingFace (fineweb, smoltalk), Lambda (compute), Alec Radford, Sofie @svlandeg, and others.
- Project creation date: October 2025 (GitHub shows created_at 2025-10-13).
Project Stats
- ⭐ GitHub Stars: 43k+
- 🍴 Forks: 5.6k+
- 📦 Version: No official version number; main branch is the trunk; README and Discussions continuously updated (e.g., leaderboard, fp8, batch size in February 2026).
- 📄 License: MIT
- 🌐 Website: No independent website; primarily GitHub and Discussions
- 💬 Community: GitHub Discussions, Discord #nanochat; README recommends using DeepWiki for Q&A
Main Features
Core Purpose
nanochat's core purpose is to run the complete pipeline from data to conversational LLM on a single multi-GPU node with minimal cognitive overhead, with compute-optimal model families by default (via a single --depth knob):
- Tokenization: BPE tokenizer training and evaluation (GPT-4 style wrapper)
- Pretraining: Base model training with distributed support and gradient accumulation, optimized for "Time-to-GPT-2" speed
- Fine-tuning: SFT (supervised fine-tuning), RL, etc., paired with tasks (ARC, GSM8K, MMLU, Humaneval, SmolTalk, etc.)
- Evaluation: CORE score (DCLM), bits per byte, various task evaluations
- Inference: Efficient inference engine with KV Cache
- Chat: CLI and ChatGPT-style Web UI for chatting directly with trained models
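To make the KV Cache idea concrete, here is a minimal, framework-free sketch (this is not nanochat's engine.py, which is far more sophisticated): during decoding, each new token's key and value vectors are appended to a cache, so attention over earlier positions is never recomputed.

```python
import math

def attend(q, keys, values):
    """Single-head scaled dot-product attention for one query over cached K/V."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    m = max(scores)  # subtract max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / z
            for i in range(len(values[0]))]

# Incremental decoding: cache K/V once per token instead of recomputing.
kv_cache = {"k": [], "v": []}
tokens_kv = [([1.0, 0.0], [2.0, 0.0]), ([0.0, 1.0], [0.0, 3.0])]  # toy K/V pairs
outputs = []
for k, v in tokens_kv:
    kv_cache["k"].append(k)   # append only; earlier positions stay untouched
    kv_cache["v"].append(v)
    outputs.append(attend(k, kv_cache["k"], kv_cache["v"]))

print(outputs[0])  # first step attends only to itself -> [2.0, 0.0]
```

Real engines store these tensors per layer and per head on the GPU, but the cost structure is the same: each decoding step does O(context) attention work instead of reprocessing the whole prefix.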
Use Cases
- Reproduce "hundred-dollar GPT-2" and chat with it
  - Rent an 8×H100 node (or 8×A100), run bash runs/speedrun.sh, then after ~3 hours launch the Web UI with python -m scripts.chat_web and chat with the model like ChatGPT (write poems, Q&A, observe hallucinations, etc.)
- Research and iterate on pretraining
  - Use runs/scaling_laws.sh and runs/miniseries.sh for scaling experiments; use small models like --depth=12 for ~5-minute rapid pretraining iterations, watching val_bpb, core_metric, MFU, tok/s, etc. in wandb
- Teaching and understanding the full pipeline
  - Minimal code with no giant abstractions, suitable as a "from token to chat" educational codebase
- Inject identity/persona into a model
  - Via the Guide: infusing identity discussion, inject personas using synthetic data during the SFT stage
- Add new capabilities
  - Refer to discussions like counting r in strawberry to add new tasks or data formats to tasks/
Quick Start
Environment: Recommended Python + uv; the project root contains pyproject.toml and uv.lock.
One-click reproduce GPT-2 and chat (requires 8×H100-class node, e.g., Lambda):
# Clone
git clone https://github.com/karpathy/nanochat.git && cd nanochat
# Create virtual environment and install dependencies (example)
uv venv && source .venv/bin/activate
uv sync
# Run full pipeline (pretraining + fine-tuning, etc., ~3 hours)
bash runs/speedrun.sh
After training, launch the chat Web UI:
source .venv/bin/activate
python -m scripts.chat_web
Open the URL printed in the terminal (e.g., http://<machine-IP>:8000/) in a browser to chat with the model like ChatGPT.
Single GPU or limited VRAM: Remove torchrun to run on a single GPU (roughly 8x slower); if VRAM is insufficient, reduce --device_batch_size in the script (e.g., 32→16→8→4→2→1). CPU/Apple Silicon: See runs/runcpu.sh, which uses a much smaller model and fewer steps for demonstration purposes only — results are limited.
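The single-GPU fallback works because the effective batch is held constant: fewer GPUs or a smaller --device_batch_size simply mean more gradient-accumulation micro-steps per optimizer step. A minimal sketch of the arithmetic (illustrative only; nanochat's actual derivation lives in its training scripts, and the token counts below are made up):

```python
# Illustrative sketch (not nanochat's actual code): keep the effective token
# batch constant when GPU count or per-device batch size changes.
def grad_accum_steps(total_batch_tokens: int, device_batch_size: int,
                     seq_len: int, num_gpus: int) -> int:
    """Micro-batches to accumulate before each optimizer step."""
    tokens_per_micro_step = device_batch_size * seq_len * num_gpus
    assert total_batch_tokens % tokens_per_micro_step == 0, "must divide evenly"
    return total_batch_tokens // tokens_per_micro_step

# 8 GPUs, batch 32 per device: one micro-step covers the full batch.
print(grad_accum_steps(524288, 32, 2048, 8))   # -> 1
# Same settings on a single GPU: accumulate 8 micro-batches instead.
print(grad_accum_steps(524288, 32, 2048, 1))   # -> 8
# Halving the device batch to fit VRAM doubles the accumulation steps.
print(grad_accum_steps(524288, 16, 2048, 1))   # -> 16
```

This is why the results stay consistent across hardware setups: gradients are mathematically equivalent, only wall-clock time changes (roughly 8x slower on one GPU).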
Core Features
- Single complexity knob --depth
  - The number of Transformer layers determines model scale; all other hyperparameters (width, heads, learning rate, training steps, weight decay, etc.) are automatically derived by the code in a compute-optimal way — users only choose a bigger or smaller model.
- Time-to-GPT-2 leaderboard
  - Measures wall-clock time to "reach GPT-2 (1.6B) level" using the DCLM CORE score; currently ~2.76–3.04 hours (8×H100); the README and dev/LEADERBOARD.md are continuously updated.
- Full pipeline coverage
  - Tokenizer training and evaluation, base pretraining, SFT/RL, CORE/bpb evaluation, multi-task evaluation, KV Cache inference, CLI/Web chat — one codebase all the way through.
- Minimal codebase
  - No large framework-style configs and factories; clear structure (nanochat/ core library + scripts/ entrypoints + runs/ scripts + tasks/ evaluation tasks), easy to read and modify.
- Unified distributed and single-GPU
  - Multi-GPU uses torchrun; single GPU automatically falls back to gradient accumulation with consistent results.
- ChatGPT-style Web UI
  - scripts.chat_web provides a local web interface for chatting with the trained chat model.
- Research-friendly
  - Provides scaling_laws, miniseries, and other scripts; PRs must disclose LLM-assisted writing portions (see Contributing).
Project Advantages
| Comparison | nanochat | nanoGPT | Large LLM frameworks (e.g., full HuggingFace stack) |
|---|---|---|---|
| Pipeline completeness | Tokenization→pretraining→fine-tuning→evaluation→chat | Pretraining only | Full pipeline but complex to configure |
| Cognitive overhead | One knob --depth | Must configure hyperparameters manually | Many config options, steep learning curve |
| Code scale | Minimal, hackable | Minimal | Large, highly abstracted |
| Target | Hundred-dollar conversational ChatGPT-style model | Pretraining baseline | General-purpose, enterprise-grade |
| Leaderboard/community | Time-to-GPT-2 + Discussions | None | Each has its own ecosystem |
Why choose nanochat?
- Karpathy's backing: Same lineage as nanoGPT, consistent design philosophy, suitable for learning and secondary development
- Hundred-dollar reproducibility: Clear cost and time expectations (~$72/3h or spot ~$20)
- Teaching and research combined: Can serve as both a "zero to chat" tutorial and a scaling/pretraining research baseline
- Minimal and forkable: No bloated abstractions; a few changes can produce experiments or teaching variants
Detailed Project Analysis
Architecture and Directory Structure
The project is layered as "library + scripts + run configs":
- nanochat/: Core library
  - gpt.py: GPT Transformer
  - tokenizer.py: BPE wrapper
  - dataloader.py, dataset.py: Distributed data and tokenization
  - optim.py: AdamW, Muon, single-GPU and distributed
  - engine.py: Inference and KV Cache
  - checkpoint_manager.py: Checkpoint read/write
  - core_eval.py: CORE score (DCLM); loss_eval.py: bits per byte
  - execution.py: Model can execute Python as a tool (if enabled)
  - ui.html: Chat frontend static assets
- scripts/: Entrypoints
  - base_train.py / base_eval.py: Pretraining and evaluation
  - chat_sft.py / chat_rl.py: Fine-tuning; chat_cli.py / chat_web.py: Chat
  - tok_train.py / tok_eval.py: Tokenizer training and evaluation
- runs/: One-click scripts
  - speedrun.sh: Zero to GPT-2-level and chat
  - scaling_laws.sh, miniseries.sh: Research and miniseries
  - runcpu.sh: CPU/MPS small-scale demo
- tasks/: Evaluation and data
  - ARC, GSM8K, MMLU, Humaneval, SmolTalk, spellingbee, etc., plus TaskMixture/TaskSequence and custom jsonl (customjson.py)
The "Single Knob": depth and compute-optimal
nanochat's core design principle: users only set --depth (number of layers), and everything else (width, heads, learning rate, training steps, weight decay, etc.) is automatically calculated by the code using compute-optimal relationships — yielding a family of models that are "best use of compute" at different scales. The model for GPT-2-level capability is at roughly depth 24–26; changing depth gives larger or smaller miniseries models. Any change to the repo is expected to make sense across different depth values, avoiding tuning for a specific scale.
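As an illustration of the idea (the exact relationships are defined in nanochat's code and may differ), such a derivation might hold the width-to-depth aspect ratio and the per-head dimension fixed, so a single depth value pins down the rest:

```python
# Illustrative sketch of the "single knob" design (assumed relationships,
# not nanochat's exact formulas): every architecture parameter follows from depth.
def derive_config(depth: int, aspect_ratio: int = 64, head_dim: int = 128) -> dict:
    model_dim = depth * aspect_ratio          # width grows linearly with depth
    assert model_dim % head_dim == 0, "width must be a multiple of head_dim"
    return {
        "n_layer": depth,
        "n_embd": model_dim,
        "n_head": model_dim // head_dim,      # fixed per-head dimension
    }

print(derive_config(12))  # small model for ~5-minute research iterations
print(derive_config(26))  # roughly GPT-2-level scale per the article
```

Learning rate, step count, weight decay, and similar training hyperparameters are derived analogously from the resulting model size, which is what makes changes to the repo testable across the whole miniseries rather than at one scale.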
Pretraining and the Leaderboard
Pretraining is the current development focus. Time-to-GPT-2 is defined as: on an 8×H100 node, wall-clock time to surpass the GPT-2 (1.6B) DCLM CORE score (0.256525). The table in README is updated with the best time (currently ~2.76–3.04 hours), val_bpb, CORE, description, date, and commit. For research, use a small depth (e.g., 12) for short pretraining runs, monitoring val_bpb, core_metric, MFU, tok/s, etc. in wandb for rapid validation.
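Bits per byte (bpb) deserves a note: it normalizes cross-entropy by bytes of raw text rather than by tokens, so the number stays comparable across tokenizers with different compression rates. A minimal conversion, assuming the loss is mean cross-entropy in nats per token and the byte/token counts of the eval set are known (the figures below are made up for illustration):

```python
import math

def bits_per_byte(mean_loss_nats: float, total_tokens: int, total_bytes: int) -> float:
    """Convert mean per-token cross-entropy (nats) to bits per byte."""
    total_bits = mean_loss_nats * total_tokens / math.log(2)  # nats -> bits
    return total_bits / total_bytes

# e.g. loss of 2.77 nats/token on text averaging ~4.8 bytes/token:
bpb = bits_per_byte(2.77, 1000, 4800)
print(round(bpb, 3))  # -> 0.833
```

A tokenizer that compresses better (more bytes per token) lowers bpb for the same per-token loss, which is exactly the effect this metric is designed to account for.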
Extension and Customization
- New capabilities: Add new tasks or data formats in tasks/, see counting r in strawberry.
- Identity/persona: Use synthetic data in SFT, see Guide: infusing identity.
- Data and repackaging: dev/repackage_data_reference.py and related files for pretraining data sharding.
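For the identity/persona route, the synthetic data can be as simple as a handful of templated conversations written to jsonl. A hypothetical sketch (the schema below, a "messages" list of role/content turns, and the file name identity.jsonl are assumptions for illustration; check tasks/customjson.py in the repo for the format it actually expects):

```python
# Hypothetical sketch: generate synthetic identity conversations for SFT.
import json

questions = ["Who are you?", "What is your name?", "Who trained you?"]
answer = "I'm NanoBot, a small model trained with nanochat on a single 8-GPU node."

rows = [{"messages": [{"role": "user", "content": q},
                      {"role": "assistant", "content": answer}]}
        for q in questions]

# One JSON object per line, the usual jsonl convention.
with open("identity.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

print(len(rows))  # -> 3
```

In practice you would generate many more variations (paraphrased questions, multi-turn exchanges) so the persona survives mixing with the rest of the SFT data.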
Dependencies and Runtime Environment
- Python: Project includes .python-version; recommended to manage the environment with uv
- PyTorch: Core dependency; supports CUDA; xpu/mps etc. may work but are not fully tested across all paths
- GPU: Speedrun targets 8×H100 (or 8×A100, slightly slower); single GPU or limited VRAM requires reducing device_batch_size or using runcpu.sh-style scripts
Project Resources
Official Resources
- 🌟 GitHub: github.com/karpathy/nanochat
Who Should Use This
- Developers who want to personally train and chat with an LLM: Hundred-dollar cost, one script, a few hours to get a conversational model
- Researchers doing small model and scaling studies: Minimal baseline, single knob, complete leaderboard and scripts
- Teaching: Suitable as a full-pipeline "from token to chat" educational codebase
- Those needing a readable, modifiable, forkable baseline: No giant frameworks, easy to build your own variant on top of nanochat
Visit my personal homepage for more useful knowledge and interesting products.