TL;DR
nanochat is Andrej Karpathy’s open-source LLM training framework: it trains a GPT-2-level chatbot for under $50 in about 2 hours on a single 8xH100 GPU node. The core model is ~500 lines of code, and a single configuration dial (--depth) derives all other hyperparameters automatically. The fastest leaderboard run finishes in 1.65 hours with a CORE score of 0.2626, surpassing OpenAI’s 2019 GPT-2 (which cost ~$43,000 and took 168 hours).
Introduction
Training a large language model no longer requires millions of dollars or a research team.
Andrej Karpathy’s nanochat project lets you train a conversational AI for less than $50. The entire pipeline runs on a single 8xH100 GPU node and finishes in under 2 hours.
Why This Matters Now
The AI landscape has fundamentally shifted. OpenAI’s original GPT-2 required 168 hours and $43,000 in 2019. Now, you can reach similar capabilities in 1.65 hours and $48—thanks to algorithmic advances, modern hardware, and community-driven optimization.
API developers and teams can now train custom models, experiment with architectures, and learn LLM internals without huge infrastructure costs.
💡 Tip: Combine nanochat with API development platforms like Apidog for streamlined testing and documentation of your AI services.
What You’ll Learn
By following this guide, you’ll learn:
- How nanochat achieves 100x cost reduction compared to traditional LLM training
- The architecture: GPT model, Muon optimizer, data loading
- Step-by-step instructions for training your own model
- How to use nanochat for rapid LLM research and experimentation
- Real limitations and the meaning of “GPT-2 capability”
What Is nanochat?
nanochat is a minimal LLM training harness covering the entire pipeline: tokenization, pretraining, finetuning, evaluation, inference, and a ChatGPT-like web UI.
The codebase is designed to be readable and hackable—ideal for experimentation and modification.
The Core Claim
You can train a GPT-2 capability model (1.6B parameters) for:
- $48 on-demand (2 hours at ~$24/hour for 8xH100)
- ~$15 on spot instances
For comparison, OpenAI’s GPT-2 (2019) cost ~$43,000 and took 7 days on 32 TPU v3 chips.
What nanochat Covers
| Stage | Script | Description |
|---|---|---|
| Tokenization | scripts.tok_train | Train BPE tokenizer (vocab 32,768) |
| Pretraining | scripts.base_train | Train base GPT model |
| Finetuning | scripts.chat_sft | Supervised finetuning for chat |
| Evaluation | scripts.base_eval | CORE metric, bits-per-byte |
| Inference | scripts.chat_cli | CLI chat interface |
| Web UI | scripts.chat_web | ChatGPT-like web interface |
The Philosophy: One Dial to Control Everything
Most LLM frameworks overwhelm you with configurations. nanochat simplifies this with a single parameter: --depth (number of transformer layers).
# GPT-1 size model
torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- --depth=12
# GPT-2 capability model
torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- --depth=24
# Pushing the boundaries
torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- --depth=26
Set --depth, and nanochat computes all other hyperparameters automatically:
- Transformer width (embedding dimension)
- Number of attention heads
- Learning rates for parameter groups
- Training steps
- Weight decay schedules
- Batch sizes
This “one dial” approach enables the creation of compute-optimal models at various sizes, all with the same framework.
Why This Works
Scaling laws were measured across dozens of runs, revealing predictable relationships between depth, width, batch size, and training duration. Instead of exposing all these parameters, nanochat encodes the relationships directly into the scripts.
No need for deep expertise—just set the model size and go.
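To make the idea concrete, here is a minimal sketch of how shape parameters might be derived from depth. The aspect ratio of 64 channels per layer and the 128-dim heads are illustrative assumptions; the actual relationships live in scripts.base_train and may differ.

```python
def shapes_from_depth(depth: int, aspect_ratio: int = 64, head_dim: int = 128):
    # Hypothetical derivation: width grows linearly with depth.
    model_dim = depth * aspect_ratio          # embedding dimension
    n_heads = max(1, model_dim // head_dim)   # attention heads
    n_params = 12 * depth * model_dim ** 2    # rough transformer param count
    return model_dim, n_heads, n_params
```

Under these assumptions, depth=24 gives a width of 1536 and 12 heads; learning rates, batch sizes, and step counts are scaled analogously from the measured scaling laws.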
The Leaderboard: Racing to Beat GPT-2
nanochat maintains a public leaderboard tracking “time to GPT-2” capability. The target is beating OpenAI’s original CORE score of 0.256525 across 22 evaluation tasks.
| Run | Model | Time | CORE Score | Key Innovation |
|---|---|---|---|---|
| Original GPT-2 | 1.6B | 168 hrs | 0.2565 | OpenAI 2019 baseline |
| Run 1 | d24 | 3.04 hrs | 0.2585 | Initial baseline |
| Run 2 | d26 | 2.91 hrs | 0.2578 | FP8 training |
| Run 3 | d26 | 2.76 hrs | 0.2602 | 1M token batch size |
| Run 4 | d24 | 2.02 hrs | 0.2571 | ClimbMix dataset |
| Run 5 | d24 | 1.80 hrs | 0.2690 | AI-discovered optimizations |
| Run 6 | d24 | 1.65 hrs | 0.2626 | Improved smear/backout |
How AI Discovered Optimizations
Some runs used Karpathy’s “autoresearch” system—an AI that tests architectural tweaks on small models, then applies the best changes to full-size models. This led to:
- Backout mechanism: Improved mid-layer residual subtraction
- Smear implementation: Efficient bigram mixing
These tweaks shaved training time from 2.02 to 1.65 hours, an ~18% improvement via autonomous experimentation.
How nanochat Works
nanochat’s codebase is around 3,000 lines across all modules. Here’s a technical breakdown:
1. The GPT Model (nanochat/gpt.py)
The transformer is modern and optimized:
Key Features:
- Rotary embeddings (RoPE): Relative positional encoding
- QK normalization: Stabilizes large-scale training
- Untied weights: Separate embedding and output layers
- ReLU² activation: Squared ReLU in MLP
- Grouped Query Attention (GQA): Fast inference with fewer KV heads
- Sliding window attention: Configurable context patterns
- Flash Attention 3: Optimized for Hopper GPUs, with SDPA fallback
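As a reference for the first bullet, a self-contained rotary-embedding sketch (not nanochat’s exact implementation): each pair of channels is rotated by a position-dependent angle, so attention scores end up depending only on relative positions.

```python
import torch

def apply_rope(x: torch.Tensor, theta: float = 10000.0) -> torch.Tensor:
    # x: (batch, seq, heads, head_dim) with even head_dim. Illustrative only.
    B, T, H, D = x.shape
    pos = torch.arange(T, dtype=torch.float32)
    freqs = theta ** (-torch.arange(0, D, 2, dtype=torch.float32) / D)
    angles = torch.outer(pos, freqs)              # (T, D/2) rotation angles
    cos, sin = angles.cos(), angles.sin()
    cos = cos.view(1, T, 1, D // 2)
    sin = sin.view(1, T, 1, D // 2)
    x1, x2 = x[..., 0::2], x[..., 1::2]           # channel pairs
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin          # 2D rotation per pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

Because each pair is rotated (never scaled), position 0 is left untouched and vector norms are preserved.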
Value Embeddings (ResFormer): Alternating layers add learnable value embeddings with gated mixing:
# Value residual: mix in value embedding with per-head gate
if ve is not None:
    ve = ve.view(B, T, self.n_kv_head, self.head_dim)
    gate = 3 * torch.sigmoid(self.ve_gate(x[..., :self.ve_gate_channels]))
    v = v + gate.unsqueeze(-1) * ve
Efficiency Tricks: Three learned mechanisms improve dynamics:
# 1. Per-layer residual scaling
x = self.resid_lambdas[i] * x + self.x0_lambdas[i] * x0
# 2. Smear: bigram info
gate = self.smear_lambda * torch.sigmoid(self.smear_gate(x[:, :, :24]))
x = x + gate * x_pre_smear
# 3. Backout: subtract mid-layer residual
x = x - self.backout_lambda * x_backout
2. The Muon Optimizer (nanochat/optim.py)
nanochat uses a mixed optimizer:
| Parameter Type | Optimizer | Purpose |
|---|---|---|
| Embeddings, lm_head | AdamW | Adaptive optimization |
| Scalar params | AdamW | Learned scaling |
| 2D matrices | Muon | Orthogonalized updates |
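A sketch of how the routing in the table might look, splitting parameters by shape and name (illustrative; nanochat’s actual grouping logic is in nanochat/optim.py, and the name-based filter here is a simplification):

```python
import torch.nn as nn

def split_param_groups(model: nn.Module):
    # Route 2D weight matrices to Muon; embeddings, lm_head,
    # biases, and scalar/vector params go to AdamW.
    muon, adamw = [], []
    for name, p in model.named_parameters():
        if p.ndim == 2 and "embed" not in name and "lm_head" not in name:
            muon.append(p)
        else:
            adamw.append(p)
    return muon, adamw
```

The motivation for the split: Muon’s orthogonalized updates only make sense for matrices acting as linear maps, while embeddings and scalars behave better under adaptive per-element updates.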
Muon (MomentUm Orthogonalized by Newton-Schulz):
# Polar Express coefficients
polar_express_coeffs = [
    (8.156, -22.483, 15.879),
    (4.043, -2.809, 0.500),
    # ...
]
# Orthogonalization loop
for a, b, c in polar_express_coeffs[:ns_steps]:
    A = X.mT @ X
    B = b * A + c * (A @ A)
    X = a * X + X @ B
NorMuon Variance Reduction:
v_mean = g.float().square().mean(dim=red_dim, keepdim=True)
v_norm = v_mean.sum(dim=(-2, -1), keepdim=True).sqrt()
final_scale = step_size * (v_norm / v_norm_new.clamp_min(1e-10))
g = g * final_scale.to(g.dtype)
Distributed Training: ZeRO-2 style sharding, three-phase async communication:
- Phase 1: launch async reduce_scatter
- Phase 2: compute updates, launch all_gathers
- Phase 3: copy back updated params
3. Precision Management (nanochat/common.py)
nanochat manages precision explicitly:
| Hardware | Default dtype | Reason |
|---|---|---|
| CUDA SM 80+ | bfloat16 | Native BF16 tensor cores |
| CUDA SM < 80 | float32 | No BF16 support |
| CPU / MPS | float32 | No reduced-precision cores |
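The table above can be expressed as a small selection helper (a sketch; the real logic lives in nanochat/common.py):

```python
import torch

def default_dtype(device: str) -> torch.dtype:
    # Hypothetical helper mirroring the table above.
    if device == "cuda" and torch.cuda.is_available():
        major, _ = torch.cuda.get_device_capability()
        if major >= 8:  # SM 80+ (Ampere/Hopper): native BF16 tensor cores
            return torch.bfloat16
    return torch.float32  # older CUDA, CPU, and MPS fall back to FP32
```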
Custom Linear layers match compute dtype:
class Linear(nn.Linear):
    def forward(self, x):
        return F.linear(x, self.weight.to(dtype=x.dtype))
FP8 training (--fp8) is available on H100/Blackwell GPUs.
4. Data Loading (nanochat/dataloader.py)
Uses BOS-aligned best-fit packing to maximize utilization:
- Each row starts with BOS token
- Documents are packed using a best-fit algorithm
- If no doc fits, one is cropped to fill exactly
- Achieves ~100% utilization, ~35% cropping at 2048 seq length
# Find largest document that fits
best_idx = -1
best_len = 0
for i, doc in enumerate(doc_buffer):
    doc_len = len(doc)
    if doc_len <= remaining and doc_len > best_len:
        best_idx = i
        best_len = doc_len
if best_idx >= 0:
    doc = doc_buffer.pop(best_idx)
    # Add full document
else:
    # Crop shortest doc to fill remaining space
5. Flash Attention Unification (nanochat/flash_attention.py)
Unified interface auto-selects backend:
from nanochat.flash_attention import flash_attn
# Auto-selects best backend
y = flash_attn.flash_attn_func(q, k, v, causal=True, window_size=window_size)
Uses Flash Attention 3 on Hopper/bfloat16, else PyTorch SDPA.
6. Inference Engine (nanochat/engine.py)
The Engine class supports:
- KV Cache: Pre-filled for efficient prompt handling
- Tool Use: Special tokens trigger a Python calculator
- Batch Generation: Clone KV cache for parallel sampling
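A minimal KV-cache sketch illustrating the pre-fill and clone ideas from the bullets above (illustrative only, not the Engine’s actual API):

```python
import copy
import torch

class KVCache:
    # Fixed-size per-layer cache of keys/values, shaped (1, heads, max_len, head_dim).
    def __init__(self, max_len: int, n_heads: int, head_dim: int):
        self.k = torch.zeros(1, n_heads, max_len, head_dim)
        self.v = torch.zeros(1, n_heads, max_len, head_dim)
        self.len = 0

    def append(self, k_new: torch.Tensor, v_new: torch.Tensor):
        # Write new timesteps after the current position; return the valid prefix.
        t = k_new.shape[2]
        self.k[:, :, self.len:self.len + t] = k_new
        self.v[:, :, self.len:self.len + t] = v_new
        self.len += t
        return self.k[:, :, :self.len], self.v[:, :, :self.len]

    def clone(self):
        # Cloning lets batch generation branch from one shared prompt pre-fill.
        return copy.deepcopy(self)
```

Pre-filling the prompt once and cloning the cache is what makes sampling several completions in parallel cheap: the prompt’s attention work is done exactly once.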
Step-by-Step: Train Your Own Model
The entire pipeline is automated in runs/speedrun.sh. Here’s how to do it:
Prerequisites
- 8xH100 GPU node (or similar)
- ~20 GB disk space
- Python 3.10+
- uv package manager
Step 1: Environment Setup
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create and activate virtual environment
uv venv
source .venv/bin/activate
# Install dependencies
uv sync --extra gpu
Step 2: Download Training Data
# Download ~2B characters from ClimbMix dataset
python -m nanochat.dataset -n 170
# Downloads ~170 shards (~17 GB compressed)
Step 3: Train the Tokenizer
python -m scripts.tok_train # Train BPE tokenizer (32,768 vocab)
python -m scripts.tok_eval # Evaluate tokenizer compression
Tokenizer training takes ~10 minutes on 2B characters.
Step 4: Pretrain the Base Model
torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
--depth=24 \
--target-param-data-ratio=8 \
--device-batch-size=16 \
--fp8 \
--run=my-first-model
- --depth=24: GPT-2 size
- --target-param-data-ratio=8: Slightly undertrained for speed
- --device-batch-size=16: Per-GPU batch
- --fp8: Enable FP8 (H100+ only)
Expected runtime: ~2 hours.
Step 5: Supervised Finetuning
# Download identity conversations
curl -L -o ~/.cache/nanochat/identity_conversations.jsonl \
https://karpathy-public.s3.us-west-2.amazonaws.com/identity_conversations.jsonl
# Run SFT for chat
torchrun --standalone --nproc_per_node=8 -m scripts.chat_sft -- \
--device-batch-size=16 \
--run=my-sft
This step teaches conversation, special tokens, and tool use.
Step 6: Chat With Your Model
# CLI chat
python -m scripts.chat_cli -p "Why is the sky blue?"
# Or launch web UI
python -m scripts.chat_web
The web UI runs on port 8000 and provides a ChatGPT-like interface.
Research Workflow: Rapid Experimentation
For fast prototyping, use smaller models (e.g., d12) for quick testing.
Quick Experiments (~5 minutes)
OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
--depth=12 \
--run="d12-test" \
--core-metric-every=999999 \
--sample-every=-1 \
--save-every=-1
Metrics to Monitor
Track these in Weights & Biases (wandb):
- val_bpb: Validation bits-per-byte (vocab-size-independent loss)
- core_metric: DCLM CORE evaluation score
- train/mfu: Model FLOPS utilization
- train/tok_per_sec: Training throughput
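For intuition on val_bpb: it converts the mean next-token loss from nats to bits and normalizes by raw bytes rather than tokens, so models with different tokenizers remain directly comparable. A small sketch of the conversion (names are illustrative):

```python
import math

def bits_per_byte(mean_nll_nats: float, tokens: int, total_bytes: int) -> float:
    # Total bits needed to encode the text under the model,
    # then normalized by its raw UTF-8 byte count.
    total_bits = mean_nll_nats * tokens / math.log(2)
    return total_bits / total_bytes
```

A tokenizer that packs more bytes per token lowers per-byte cost even at the same per-token loss, which is exactly the effect bpb is designed to capture.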
Testing Requirements
Any improvement must be robust across all depths (d12–d26) to avoid overfitting to a single model size.
Why nanochat Matters
Cost Accessibility
| Approach | Cost | Time | Hardware |
|---|---|---|---|
| OpenAI GPT-2 (2019) | $43,000 | 168 hrs | 32 TPU v3 |
| nanochat (2026) | $48 | 2 hrs | 8xH100 |
| nanochat spot | ~$15 | 2 hrs | 8xH100 spot |
nanochat brings LLM training within reach for:
- Individual researchers
- Small startups
- University courses
- Hobbyists
Educational Value
- ~500 lines for GPT model
- ~530 lines for optimizer
- Clear comments and no hidden configs
- Easy to read, modify, and experiment
Research Velocity
- Faster hypothesis testing
- More experiments per week
- Lower cost of failure
- Community leaderboard for collaboration
Transparency
- Scaling laws in dev/LOG.md
- Ablation studies in GitHub Discussions
- Full reproduction details for leaderboard
- Clear AI contribution disclosure
Limitations and Reality Check
Hardware Requirements
The $48 cost assumes 8xH100 access. Cloud pricing:
- Lambda Labs: ~$25/hour for 8xH100
- RunPod: ~$15/hour spot
- Total: ~2 hours pretraining + SFT
Budget $50–$100 for a full run.
Capability Ceiling
nanochat reaches GPT-2 level (2019), which means:
Can do:
- Basic conversation
- Simple reasoning
- Elementary math
- Limited factual recall
Cannot do:
- Complex reasoning
- Advanced code generation
- Nuanced instruction following
- Compete with GPT-4, Claude, Gemini
Think of it as a kindergartener’s capability.
Data Requirements
- ~170 data shards
- ~17 GB compressed
- ~2B characters
Ensure enough storage and bandwidth.
Metric Limitations
CORE score covers 22 tasks but doesn’t reflect:
- Real-world conversation quality
- Domain-specific knowledge
- Instruction nuance
- Safety/alignment
Run-to-run noise is real: different random seeds shift the CORE score by roughly 0.016.
FAQ
How much does it cost to train with nanochat?
About $48 on-demand (2 hours at $24/hr) or ~$15 on spot. This is for pretraining only—add ~30 minutes for SFT.
What GPU do I need?
- Minimum: Any modern datacenter GPU (single GPU)
- Optimal: 8xH100 or 8xA100 for fastest training
- Code scales from 1 to 8 GPUs via gradient accumulation
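A gradient-accumulation sketch showing how micro-batches emulate a larger effective batch on one GPU (a toy model standing in for the GPT, not nanochat code):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 8  # one optimizer step per 8 micro-batches (~8-GPU effective batch)

opt.zero_grad()
for _ in range(accum_steps):
    x = torch.randn(2, 4)
    loss = model(x).pow(2).mean() / accum_steps  # scale so gradients average
    loss.backward()                              # grads accumulate across micro-batches
opt.step()
```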
How long does training take?
1.65 to 3 hours depending on hardware/config. Leaderboard record: 1.65 hours for d24.
What is the CORE metric?
DCLM CORE evaluates 22 tasks (ARC, MMLU, etc). GPT-2 scored 0.256525; nanochat exceeds 0.26.
Can I train on a single GPU?
Yes. Omit torchrun for automatic gradient accumulation. Training is 8x slower but results are similar.
What dataset does nanochat use?
Current: ClimbMix (NVIDIA curated web dataset). Previous: FineWeb-EDU. The tokenizer is trained on ~2B characters.
Does nanochat work on Apple Silicon?
Yes. Runs on MPS (float32 precision). Slower than CUDA, but functional for experiments.
Can I resume from a checkpoint?
Yes. Use --resume-from-step=<step>. The dataloader state is also saved.
nanochat vs nanoGPT?
nanoGPT handled pretraining only. nanochat covers tokenization, pretraining, SFT, RL, evaluation, inference, and a web UI.
Conclusion
nanochat proves that LLM training is now affordable and accessible. What once cost $43,000 now costs under $50.
By providing a minimal, readable codebase and a “one dial” system, Karpathy has made LLM training feasible for research, education, and experimentation.
Key Takeaways
- 100x cost reduction: $43,000 → $48 for GPT-2 capability
- 100x speedup: 168 hours → 1.65 hours
- Single configuration dial: --depth controls all
- Full pipeline: From tokenization to web UI
- Community driven: Public leaderboard, rapid iteration
Next Steps
Ready to train your own model? Start with the nanochat repository and the runs/speedrun.sh script.
If you’re an API developer building AI-powered applications, understanding LLM training internals is now within your reach. The barrier to entry has dropped from “venture-funded startup” to “weekend project”.