TL;DR
nanochat is Andrej Karpathy’s open-source LLM training framework: it trains a GPT-2-level chatbot for under $50 in about 2 hours on a single 8xH100 GPU node. The core model is ~500 lines of code, and a single configuration dial (--depth) derives all other hyperparameters automatically. The fastest leaderboard run finishes in 1.65 hours with a CORE score of 0.2626, surpassing OpenAI’s 2019 GPT-2 (which cost ~$43,000 and took 168 hours).
Introduction
Training a large language model no longer requires millions of dollars or a research team.
Andrej Karpathy’s nanochat project lets you train a conversational AI for less than $50. The entire pipeline runs on a single 8xH100 GPU node and finishes in under 2 hours.
Why This Matters Now
The AI landscape has fundamentally shifted. OpenAI’s original GPT-2 required 168 hours and $43,000 in 2019. Now, you can reach similar capabilities in 1.65 hours and $48—thanks to algorithmic advances, modern hardware, and community-driven optimization.
API developers and teams can now train custom models, experiment with architectures, and learn LLM internals without huge infrastructure costs.
💡 Tip: Combine nanochat with API development platforms like Apidog for streamlined testing and documentation of your AI services.
What You’ll Learn
By following this guide, you’ll learn:
- How nanochat achieves 100x cost reduction compared to traditional LLM training
- The architecture: GPT model, Muon optimizer, data loading
- Step-by-step instructions for training your own model
- How to use nanochat for rapid LLM research and experimentation
- Real limitations and the meaning of “GPT-2 capability”
What Is nanochat?
nanochat is a minimal LLM training harness covering the entire pipeline: tokenization, pretraining, finetuning, evaluation, inference, and a ChatGPT-like web UI.
The codebase is designed to be readable and hackable—ideal for experimentation and modification.
The Core Claim
You can train a GPT-2 capability model (1.6B parameters) for:
- $48 on-demand (2 hours at ~$24/hour for 8xH100)
- ~$15 on spot instances
For comparison, OpenAI’s GPT-2 (2019) cost ~$43,000 and took 7 days on 32 TPU v3 chips.
What nanochat Covers
| Stage | Script | Description |
|---|---|---|
| Tokenization | scripts.tok_train | Train BPE tokenizer (vocab 32,768) |
| Pretraining | scripts.base_train | Train base GPT model |
| Finetuning | scripts.chat_sft | Supervised finetuning for chat |
| Evaluation | scripts.base_eval | CORE metric, bits-per-byte |
| Inference | scripts.chat_cli | CLI chat interface |
| Web UI | scripts.chat_web | ChatGPT-like web interface |
The Philosophy: One Dial to Control Everything
Most LLM frameworks overwhelm you with configurations. nanochat simplifies this with a single parameter: --depth (number of transformer layers).
# GPT-1 size model
torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- --depth=12
# GPT-2 capability model
torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- --depth=24
# Pushing the boundaries
torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- --depth=26
Set --depth, and nanochat computes all other hyperparameters automatically:
- Transformer width (embedding dimension)
- Number of attention heads
- Learning rates for parameter groups
- Training steps
- Weight decay schedules
- Batch sizes
This “one dial” approach enables the creation of compute-optimal models at various sizes, all with the same framework.
Why This Works
Scaling laws were measured across dozens of runs, revealing predictable relationships between depth, width, batch size, and training duration. Instead of exposing all these parameters, nanochat encodes the relationships directly into the scripts.
No need for deep expertise—just set the model size and go.
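To make the idea concrete, here is a minimal sketch of how shape parameters might be derived from depth. The aspect ratio of 64 channels per layer and the 128-dim heads are illustrative assumptions; the actual relationships live in scripts.base_train and may differ.

```python
def shapes_from_depth(depth: int, aspect_ratio: int = 64, head_dim: int = 128):
    # Hypothetical derivation: width grows linearly with depth.
    model_dim = depth * aspect_ratio          # embedding dimension
    n_heads = max(1, model_dim // head_dim)   # attention heads
    n_params = 12 * depth * model_dim ** 2    # rough transformer param count
    return model_dim, n_heads, n_params
```

Under these assumptions, depth=24 gives a width of 1536 and 12 heads; learning rates, batch sizes, and step counts are scaled analogously from the measured scaling laws.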
The Leaderboard: Racing to Beat GPT-2
nanochat maintains a public leaderboard tracking “time to GPT-2” capability. The target is beating OpenAI’s original CORE score of 0.256525 across 22 evaluation tasks.
| Run | Model | Time | CORE Score | Key Innovation |
|---|---|---|---|---|
| Original GPT-2 | 1.6B | 168 hrs | 0.2565 | OpenAI 2019 baseline |
| Run 1 | d24 | 3.04 hrs | 0.2585 | Initial baseline |
| Run 2 | d26 | 2.91 hrs | 0.2578 | FP8 training |
| Run 3 | d26 | 2.76 hrs | 0.2602 | 1M token batch size |
| Run 4 | d24 | 2.02 hrs | 0.2571 | ClimbMix dataset |
| Run 5 | d24 | 1.80 hrs | 0.2690 | AI-discovered optimizations |
| Run 6 | d24 | 1.65 hrs | 0.2626 | Improved smear/backout |
How AI Discovered Optimizations
Some runs used Karpathy’s “autoresearch” system—an AI that tests architectural tweaks on small models, then applies the best changes to full-size models. This led to:
- Backout mechanism: Improved mid-layer residual subtraction
- Smear implementation: Efficient bigram mixing
These tweaks shaved training time from 2.02 to 1.65 hours, an ~18% improvement via autonomous experimentation.
How nanochat Works
nanochat’s codebase is around 3,000 lines across all modules. Here’s a technical breakdown:
1. The GPT Model (nanochat/gpt.py)
The transformer is modern and optimized:
Key Features:
- Rotary embeddings (RoPE): Relative positional encoding
- QK normalization: Stabilizes large-scale training
- Untied weights: Separate embedding and output layers
- ReLU² activation: Squared ReLU in MLP
- Grouped Query Attention (GQA): Fast inference with fewer KV heads
- Sliding window attention: Configurable context patterns
- Flash Attention 3: Optimized for Hopper GPUs, with SDPA fallback
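As a reference for the first bullet, a self-contained rotary-embedding sketch (not nanochat’s exact implementation): each pair of channels is rotated by a position-dependent angle, so attention scores end up depending only on relative positions.

```python
import torch

def apply_rope(x: torch.Tensor, theta: float = 10000.0) -> torch.Tensor:
    # x: (batch, seq, heads, head_dim) with even head_dim. Illustrative only.
    B, T, H, D = x.shape
    pos = torch.arange(T, dtype=torch.float32)
    freqs = theta ** (-torch.arange(0, D, 2, dtype=torch.float32) / D)
    angles = torch.outer(pos, freqs)              # (T, D/2) rotation angles
    cos, sin = angles.cos(), angles.sin()
    cos = cos.view(1, T, 1, D // 2)
    sin = sin.view(1, T, 1, D // 2)
    x1, x2 = x[..., 0::2], x[..., 1::2]           # channel pairs
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin          # 2D rotation per pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

Because each pair is rotated (never scaled), position 0 is left untouched and vector norms are preserved.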
Value Embeddings (ResFormer): Alternating layers add learnable value embeddings with gated mixing:
# Value residual: mix in value embedding with per-head gate
if ve is not None:
    ve = ve.view(B, T, self.n_kv_head, self.head_dim)
    gate = 3 * torch.sigmoid(self.ve_gate(x[..., :self.ve_gate_channels]))
    v = v + gate.unsqueeze(-1) * ve
Efficiency Tricks: Three learned mechanisms improve dynamics:
# 1. Per-layer residual scaling
x = self.resid_lambdas[i] * x + self.x0_lambdas[i] * x0
# 2. Smear: bigram info
gate = self.smear_lambda * torch.sigmoid(self.smear_gate(x[:, :, :24]))
x = x + gate * x_pre_smear
# 3. Backout: subtract mid-layer residual
x = x - self.backout_lambda * x_backout
2. The Muon Optimizer (nanochat/optim.py)
nanochat uses a mixed optimizer:
| Parameter Type | Optimizer | Purpose |
|---|---|---|
| Embeddings, lm_head | AdamW | Adaptive optimization |
| Scalar params | AdamW | Learned scaling |
| 2D matrices | Muon | Orthogonalized updates |
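A sketch of how the routing in the table might look, splitting parameters by shape and name (illustrative; nanochat’s actual grouping logic is in nanochat/optim.py, and the name-based filter here is a simplification):

```python
import torch.nn as nn

def split_param_groups(model: nn.Module):
    # Route 2D weight matrices to Muon; embeddings, lm_head,
    # biases, and scalar/vector params go to AdamW.
    muon, adamw = [], []
    for name, p in model.named_parameters():
        if p.ndim == 2 and "embed" not in name and "lm_head" not in name:
            muon.append(p)
        else:
            adamw.append(p)
    return muon, adamw
```

The motivation for the split: Muon’s orthogonalized updates only make sense for matrices acting as linear maps, while embeddings and scalars behave better under adaptive per-element updates.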
Muon (MomentUm Orthogonalized by Newton-Schulz):
# Polar Express coefficients
polar_express_coeffs = [
    (8.156, -22.483, 15.879),
    (4.043, -2.809, 0.500),
    # ...
]
# Orthogonalization loop
for a, b, c in polar_express_coeffs[:ns_steps]:
    A = X.mT @ X
    B = b * A + c * (A @ A)
    X = a * X + X @ B
NorMuon Variance Reduction:
v_mean = g.float().square().mean(dim=red_dim, keepdim=True)
v_norm = v_mean.sum(dim=(-2, -1), keepdim=True).sqrt()
final_scale = step_size * (v_norm / v_norm_new.clamp_min(1e-10))
g = g * final_scale.to(g.dtype)
Distributed Training: ZeRO-2 style sharding, three-phase async communication:
- Phase 1: launch async reduce_scatter
- Phase 2: compute updates, launch all_gathers
- Phase 3: copy back updated params
3. Precision Management (nanochat/common.py)
nanochat manages precision explicitly:
| Hardware | Default dtype | Reason |
|---|---|---|
| CUDA SM 80+ | bfloat16 | Native BF16 tensor cores |
| CUDA SM < 80 | float32 | No BF16 support |
| CPU / MPS | float32 | No reduced-precision cores |
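The table above can be expressed as a small selection helper (a sketch; the real logic lives in nanochat/common.py):

```python
import torch

def default_dtype(device: str) -> torch.dtype:
    # Hypothetical helper mirroring the table above.
    if device == "cuda" and torch.cuda.is_available():
        major, _ = torch.cuda.get_device_capability()
        if major >= 8:  # SM 80+ (Ampere/Hopper): native BF16 tensor cores
            return torch.bfloat16
    return torch.float32  # older CUDA, CPU, and MPS fall back to FP32
```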
Custom Linear layers match compute dtype:
class Linear(nn.Linear):
    def forward(self, x):
        return F.linear(x, self.weight.to(dtype=x.dtype))
FP8 training (--fp8) is available on H100/Blackwell GPUs.
4. Data Loading (nanochat/dataloader.py)
Uses BOS-aligned best-fit packing to maximize utilization:
- Each row starts with BOS token
- Documents are packed using a best-fit algorithm
- If no doc fits, one is cropped to fill exactly
- Achieves ~100% utilization, ~35% cropping at 2048 seq length
# Find largest document that fits
best_idx = -1
best_len = 0
for i, doc in enumerate(doc_buffer):
    doc_len = len(doc)
    if doc_len <= remaining and doc_len > best_len:
        best_idx = i
        best_len = doc_len
if best_idx >= 0:
    doc = doc_buffer.pop(best_idx)
    # Add full document
else:
    # Crop shortest doc to fill remaining space
5. Flash Attention Unification (nanochat/flash_attention.py)
Unified interface auto-selects backend:
from nanochat.flash_attention import flash_attn
# Auto-selects best backend
y = flash_attn.flash_attn_func(q, k, v, causal=True, window_size=window_size)
Uses Flash Attention 3 on Hopper/bfloat16, else PyTorch SDPA.
6. Inference Engine (nanochat/engine.py)
The Engine class supports:
- KV Cache: Pre-filled for efficient prompt handling
- Tool Use: Special tokens trigger a Python calculator
- Batch Generation: Clone KV cache for parallel sampling
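A minimal KV-cache sketch illustrating the pre-fill and clone ideas from the bullets above (illustrative only, not the Engine’s actual API):

```python
import copy
import torch

class KVCache:
    # Fixed-size per-layer cache of keys/values, shaped (1, heads, max_len, head_dim).
    def __init__(self, max_len: int, n_heads: int, head_dim: int):
        self.k = torch.zeros(1, n_heads, max_len, head_dim)
        self.v = torch.zeros(1, n_heads, max_len, head_dim)
        self.len = 0

    def append(self, k_new: torch.Tensor, v_new: torch.Tensor):
        # Write new timesteps after the current position; return the valid prefix.
        t = k_new.shape[2]
        self.k[:, :, self.len:self.len + t] = k_new
        self.v[:, :, self.len:self.len + t] = v_new
        self.len += t
        return self.k[:, :, :self.len], self.v[:, :, :self.len]

    def clone(self):
        # Cloning lets batch generation branch from one shared prompt pre-fill.
        return copy.deepcopy(self)
```

Pre-filling the prompt once and cloning the cache is what makes sampling several completions in parallel cheap: the prompt’s attention work is done exactly once.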
Step-by-Step: Train Your Own Model
The entire pipeline is automated in runs/speedrun.sh. Here’s how to do it:
Prerequisites
- 8xH100 GPU node (or similar)
- ~20 GB disk space
- Python 3.10+
- uv package manager
Step 1: Environment Setup
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create and activate virtual environment
uv venv
source .venv/bin/activate
# Install dependencies
uv sync --extra gpu
Step 2: Download Training Data
# Download ~2B characters from ClimbMix dataset
python -m nanochat.dataset -n 170
# Downloads ~170 shards (~17 GB compressed)
Step 3: Train the Tokenizer
python -m scripts.tok_train # Train BPE tokenizer (32,768 vocab)
python -m scripts.tok_eval # Evaluate tokenizer compression
Tokenizer training takes ~10 minutes on 2B characters.
Step 4: Pretrain the Base Model
torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
--depth=24 \
--target-param-data-ratio=8 \
--device-batch-size=16 \
--fp8 \
--run=my-first-model
- --depth=24: GPT-2 size
- --target-param-data-ratio=8: Slightly undertrained for speed
- --device-batch-size=16: Per-GPU batch
- --fp8: Enable FP8 (H100+ only)
Expected runtime: ~2 hours.
Step 5: Supervised Finetuning
# Download identity conversations
curl -L -o ~/.cache/nanochat/identity_conversations.jsonl \
https://karpathy-public.s3.us-west-2.amazonaws.com/identity_conversations.jsonl
# Run SFT for chat
torchrun --standalone --nproc_per_node=8 -m scripts.chat_sft -- \
--device-batch-size=16 \
--run=my-sft
This step teaches conversation, special tokens, and tool use.
Step 6: Chat With Your Model
# CLI chat
python -m scripts.chat_cli -p "Why is the sky blue?"
# Or launch web UI
python -m scripts.chat_web
The web UI runs on port 8000 and provides a ChatGPT-like interface.
Research Workflow: Rapid Experimentation
For fast prototyping, use smaller models (e.g., d12) for quick testing.
Quick Experiments (~5 minutes)
OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
--depth=12 \
--run="d12-test" \
--core-metric-every=999999 \
--sample-every=-1 \
--save-every=-1
Metrics to Monitor
Track these in Weights & Biases (wandb):
- val_bpb: Validation bits-per-byte (vocab-size-independent loss)
- core_metric: DCLM CORE evaluation score
- train/mfu: Model FLOPS utilization
- train/tok_per_sec: Training throughput
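For intuition on val_bpb: it converts the mean next-token loss from nats to bits and normalizes by raw bytes rather than tokens, so models with different tokenizers remain directly comparable. A small sketch of the conversion (names are illustrative):

```python
import math

def bits_per_byte(mean_nll_nats: float, tokens: int, total_bytes: int) -> float:
    # Total bits needed to encode the text under the model,
    # then normalized by its raw UTF-8 byte count.
    total_bits = mean_nll_nats * tokens / math.log(2)
    return total_bits / total_bytes
```

A tokenizer that packs more bytes per token lowers per-byte cost even at the same per-token loss, which is exactly the effect bpb is designed to capture.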
Testing Requirements
Any improvement must be robust across all depths (d12–d26) to avoid overfitting to a single model size.
Why nanochat Matters
Cost Accessibility
| Approach | Cost | Time | Hardware |
|---|---|---|---|
| OpenAI GPT-2 (2019) | $43,000 | 168 hrs | 32 TPU v3 |
| nanochat (2026) | $48 | 2 hrs | 8xH100 |
| nanochat spot | ~$15 | 2 hrs | 8xH100 spot |
nanochat brings LLM training within reach for:
- Individual researchers
- Small startups
- University courses
- Hobbyists
Educational Value
- ~500 lines for GPT model
- ~530 lines for optimizer
- Clear comments and no hidden configs
- Easy to read, modify, and experiment
Research Velocity
- Faster hypothesis testing
- More experiments per week
- Lower cost of failure
- Community leaderboard for collaboration
Transparency
- Scaling laws in dev/LOG.md
- Ablation studies in GitHub Discussions
- Full reproduction details for leaderboard
- Clear AI contribution disclosure
Limitations and Reality Check
Hardware Requirements
The $48 cost assumes 8xH100 access. Cloud pricing:
- Lambda Labs: ~$25/hour for 8xH100
- RunPod: ~$15/hour spot
- Total: ~2 hours pretraining + SFT
Budget $50–$100 for a full run.
Capability Ceiling
nanochat reaches GPT-2 level (2019), which means:
Can do:
- Basic conversation
- Simple reasoning
- Elementary math
- Limited factual recall
Cannot do:
- Complex reasoning
- Advanced code generation
- Nuanced instruction following
- Compete with GPT-4, Claude, Gemini
Think of it as a kindergartener’s capability.
Data Requirements
- ~170 data shards
- ~17 GB compressed
- ~2B characters
Ensure enough storage and bandwidth.
Metric Limitations
CORE score covers 22 tasks but doesn’t reflect:
- Real-world conversation quality
- Domain-specific knowledge
- Instruction nuance
- Safety/alignment
Run-to-run noise is real: different random seeds shift the CORE score by roughly 0.016.
FAQ
How much does it cost to train with nanochat?
About $48 on-demand (2 hours at $24/hr) or ~$15 on spot. This is for pretraining only—add ~30 minutes for SFT.
What GPU do I need?
- Minimum: Any modern datacenter GPU (single GPU)
- Optimal: 8xH100 or 8xA100 for fastest training
- Code scales from 1 to 8 GPUs via gradient accumulation
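A gradient-accumulation sketch showing how micro-batches emulate a larger effective batch on one GPU (a toy model standing in for the GPT, not nanochat code):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 8  # one optimizer step per 8 micro-batches (~8-GPU effective batch)

opt.zero_grad()
for _ in range(accum_steps):
    x = torch.randn(2, 4)
    loss = model(x).pow(2).mean() / accum_steps  # scale so gradients average
    loss.backward()                              # grads accumulate across micro-batches
opt.step()
```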
How long does training take?
1.65 to 3 hours depending on hardware/config. Leaderboard record: 1.65 hours for d24.
What is the CORE metric?
DCLM CORE evaluates 22 tasks (ARC, MMLU, etc). GPT-2 scored 0.256525; nanochat exceeds 0.26.
Can I train on a single GPU?
Yes. Omit torchrun for automatic gradient accumulation. Training is 8x slower but results are similar.
What dataset does nanochat use?
Current: ClimbMix (NVIDIA curated web dataset). Previous: FineWeb-EDU. The tokenizer is trained on ~2B characters.
Does nanochat work on Apple Silicon?
Yes. Runs on MPS (float32 precision). Slower than CUDA, but functional for experiments.
Can I resume from a checkpoint?
Yes. Use --resume-from-step=<step>. The dataloader state is also saved.
nanochat vs nanoGPT?
nanoGPT handled pretraining only. nanochat covers tokenization, pretraining, SFT, RL, evaluation, inference, and a web UI.
Conclusion
nanochat proves that LLM training is now affordable and accessible. What once cost $43,000 now costs under $50.
By providing a minimal, readable codebase and a “one dial” system, Karpathy has made LLM training feasible for research, education, and experimentation.
Key Takeaways
- 100x cost reduction: $43,000 → $48 for GPT-2 capability
- 100x speedup: 168 hours → 1.65 hours
- Single configuration dial: --depth controls all
- Full pipeline: From tokenization to web UI
- Community driven: Public leaderboard, rapid iteration
Next Steps
Ready to train your own model? Start with the nanochat repository and the runs/speedrun.sh script.
If you’re an API developer building AI-powered applications, understanding LLM training internals is now within your reach. The barrier to entry has dropped from “venture-funded startup” to “weekend project”.