DEV Community

Wanda

Posted on • Originally published at apidog.com

How to Train Your Own ChatGPT for $50?

TL;DR

nanochat is Andrej Karpathy’s open-source LLM training framework that lets you train a GPT-2 level chatbot for under $50 in about 2 hours. Using a single 8xH100 GPU node, minimal code (~500 lines for the core model), and a single configuration dial (--depth), nanochat derives all other hyperparameters automatically. The fastest leaderboard run completes in 1.65 hours and achieves a CORE score of 0.2626, surpassing OpenAI’s 2019 GPT-2 (which cost $43,000 and took 168 hours).


Introduction

Training a large language model no longer requires millions of dollars or a research team.

Andrej Karpathy’s nanochat project lets you train a conversational AI for less than $50. The entire pipeline runs on a single 8xH100 GPU node and finishes in under 2 hours.

Why This Matters Now

The AI landscape has fundamentally shifted. OpenAI’s original GPT-2 required 168 hours and $43,000 in 2019. Now, you can reach similar capabilities in 1.65 hours and $48—thanks to algorithmic advances, modern hardware, and community-driven optimization.

API developers and teams can now train custom models, experiment with architectures, and learn LLM internals without huge infrastructure costs.

💡 Tip: Combine nanochat with API development platforms like Apidog for streamlined testing and documentation of your AI services.

What You’ll Learn

By following this guide, you’ll learn:

  • How nanochat achieves 100x cost reduction compared to traditional LLM training
  • The architecture: GPT model, Muon optimizer, data loading
  • Step-by-step instructions for training your own model
  • How to use nanochat for rapid LLM research and experimentation
  • Real limitations and the meaning of “GPT-2 capability”

What Is nanochat?

nanochat is a minimal LLM training harness covering the entire pipeline: tokenization, pretraining, finetuning, evaluation, inference, and a ChatGPT-like web UI.

nanochat architecture

The codebase is designed to be readable and hackable—ideal for experimentation and modification.

The Core Claim

You can train a GPT-2 capability model (1.6B parameters) for:

  • $48 on-demand (2 hours at ~$24/hour for 8xH100)
  • ~$15 on spot instances

For comparison, OpenAI’s GPT-2 (2019) cost ~$43,000 and took 7 days on 32 TPU v3 chips.

What nanochat Covers

| Stage | Script | Description |
| --- | --- | --- |
| Tokenization | scripts.tok_train | Train BPE tokenizer (vocab 32,768) |
| Pretraining | scripts.base_train | Train base GPT model |
| Finetuning | scripts.chat_sft | Supervised finetuning for chat |
| Evaluation | scripts.base_eval | CORE metric, bits-per-byte |
| Inference | scripts.chat_cli | CLI chat interface |
| Web UI | scripts.chat_web | ChatGPT-like web interface |

The Philosophy: One Dial to Control Everything

Most LLM frameworks overwhelm you with configurations. nanochat simplifies this with a single parameter: --depth (number of transformer layers).

# GPT-1 size model
torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- --depth=12

# GPT-2 capability model
torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- --depth=24

# Pushing the boundaries
torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- --depth=26

Set --depth, and nanochat computes all other hyperparameters automatically:

  • Transformer width (embedding dimension)
  • Number of attention heads
  • Learning rates for parameter groups
  • Training steps
  • Weight decay schedules
  • Batch sizes

This “one dial” approach enables the creation of compute-optimal models at various sizes, all with the same framework.
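As a rough sketch of the idea, deriving a whole configuration from depth alone might look like the snippet below. The scaling constants (64 embedding dims per layer, a fixed head size of 128) are illustrative assumptions, not nanochat’s exact formulas:

```python
# Illustrative "one dial" config derivation. The constants here are
# assumptions for the sketch, not nanochat's actual values.
def derive_config(depth: int) -> dict:
    model_dim = depth * 64             # width grows linearly with depth
    n_head = max(1, model_dim // 128)  # fixed per-head dimension of 128
    return {
        "depth": depth,
        "model_dim": model_dim,
        "n_head": n_head,
        "head_dim": model_dim // n_head,
    }

cfg = derive_config(24)  # the GPT-2 capability setting from above
```

The point is not the specific constants but the shape of the API: one integer in, a full model configuration out.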

Why This Works

Scaling laws were measured across dozens of runs, revealing predictable relationships between depth, width, batch size, and training duration. Instead of exposing all these parameters, nanochat encodes the relationships directly into the scripts.

nanochat scaling laws

No need for deep expertise—just set the model size and go.

The Leaderboard: Racing to Beat GPT-2

nanochat maintains a public leaderboard tracking “time to GPT-2” capability. The target is beating OpenAI’s original CORE score of 0.256525 across 22 evaluation tasks.

| Run | Model | Time | CORE Score | Key Innovation |
| --- | --- | --- | --- | --- |
| Original GPT-2 | 1.6B | 168 hrs | 0.2565 | OpenAI 2019 baseline |
| Run 1 | d24 | 3.04 hrs | 0.2585 | Initial baseline |
| Run 2 | d26 | 2.91 hrs | 0.2578 | FP8 training |
| Run 3 | d26 | 2.76 hrs | 0.2602 | 1M token batch size |
| Run 4 | d24 | 2.02 hrs | 0.2571 | ClimbMix dataset |
| Run 5 | d24 | 1.80 hrs | 0.2690 | AI-discovered optimizations |
| Run 6 | d24 | 1.65 hrs | 0.2626 | Improved smear/backout |

How AI Discovered Optimizations

Some runs used Karpathy’s “autoresearch” system—an AI that tests architectural tweaks on small models, then applies the best changes to full-size models. This led to:

  • Backout mechanism: Improved mid-layer residual subtraction
  • Smear implementation: Efficient bigram mixing

These tweaks shaved training time from 2.02 hours to 1.65 hours, an ~18% improvement driven largely by autonomous experimentation.

How nanochat Works

nanochat’s codebase is around 3,000 lines across all modules. Here’s a technical breakdown:

1. The GPT Model (nanochat/gpt.py)

The transformer is modern and optimized:

Key Features:

  • Rotary embeddings (RoPE): Relative positional encoding
  • QK normalization: Stabilizes large-scale training
  • Untied weights: Separate embedding and output layers
  • ReLU² activation: Squared ReLU in MLP
  • Grouped Query Attention (GQA): Fast inference with fewer KV heads
  • Sliding window attention: Configurable context patterns
  • Flash Attention 3: Optimized for Hopper GPUs, with SDPA fallback

Value Embeddings (ResFormer): Alternating layers add learnable value embeddings with gated mixing:

# Value residual: mix in value embedding with per-head gate
if ve is not None:
    ve = ve.view(B, T, self.n_kv_head, self.head_dim)
    gate = 3 * torch.sigmoid(self.ve_gate(x[..., :self.ve_gate_channels]))
    v = v + gate.unsqueeze(-1) * ve

Efficiency Tricks: Three learned mechanisms improve dynamics:

# 1. Per-layer residual scaling
x = self.resid_lambdas[i] * x + self.x0_lambdas[i] * x0

# 2. Smear: bigram info
gate = self.smear_lambda * torch.sigmoid(self.smear_gate(x[:, :, :24]))
x = x + gate * x_pre_smear

# 3. Backout: subtract mid-layer residual
x = x - self.backout_lambda * x_backout

2. The Muon Optimizer (nanochat/optim.py)

nanochat uses a mixed optimizer:

| Parameter Type | Optimizer | Purpose |
| --- | --- | --- |
| Embeddings, lm_head | AdamW | Adaptive optimization |
| Scalar params | AdamW | Learned scaling |
| 2D matrices | Muon | Orthogonalized updates |

Muon (MomentUm Orthogonalized by Newton-Schulz):

# Polar Express coefficients
polar_express_coeffs = [
    (8.156, -22.483, 15.879),
    (4.043, -2.809, 0.500),
    # ...
]
# Orthogonalization loop
for a, b, c in polar_express_coeffs[:ns_steps]:
    A = X.mT @ X
    B = b * A + c * (A @ A)
    X = a * X + X @ B

NorMuon Variance Reduction:

v_mean = g.float().square().mean(dim=red_dim, keepdim=True)
v_norm = v_mean.sum(dim=(-2, -1), keepdim=True).sqrt()
final_scale = step_size * (v_norm / v_norm_new.clamp_min(1e-10))
g = g * final_scale.to(g.dtype)

Distributed Training: ZeRO-2 style sharding with three-phase asynchronous communication:

  1. Launch async reduce_scatter on the gradients
  2. Compute parameter updates on local shards, launch async all_gathers
  3. Copy the updated parameters back into the model

3. Precision Management (nanochat/common.py)

nanochat manages precision explicitly:

Hardware Default dtype Reason
CUDA SM 80+ bfloat16 Native BF16 tensor cores
CUDA SM < 80 float32 No BF16 support
CPU / MPS float32 No reduced-precision cores
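The selection rule in the table can be sketched as a tiny helper (illustrative logic, not nanochat’s actual code):

```python
def pick_dtype(device_type: str, cuda_sm: int = 0) -> str:
    """Pick a compute dtype per the table above (illustrative sketch)."""
    # bfloat16 needs CUDA with SM 80+ (Ampere or newer tensor cores)
    if device_type == "cuda" and cuda_sm >= 80:
        return "bfloat16"
    # older CUDA parts, CPU, and Apple MPS fall back to float32
    return "float32"
```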

Custom Linear layers match compute dtype:

class Linear(nn.Linear):
    def forward(self, x):
        return F.linear(x, self.weight.to(dtype=x.dtype))

FP8 training (--fp8) is available on H100/Blackwell GPUs.

4. Data Loading (nanochat/dataloader.py)

Uses BOS-aligned best-fit packing to maximize utilization:

  • Each row starts with BOS token
  • Documents are packed using a best-fit algorithm
  • If no doc fits, one is cropped to fill exactly
  • Achieves ~100% utilization, ~35% cropping at 2048 seq length
# Find largest document that fits
best_idx = -1
best_len = 0
for i, doc in enumerate(doc_buffer):
    doc_len = len(doc)
    if doc_len <= remaining and doc_len > best_len:
        best_idx = i
        best_len = doc_len

if best_idx >= 0:
    doc = doc_buffer.pop(best_idx)
    # Add full document
else:
    # Crop shortest doc to fill remaining space
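Putting the excerpt together, a self-contained toy version of the packer might look like this (the function name, token values, and crop policy details are invented for the demo):

```python
def pack_row(doc_buffer, seq_len, bos=0):
    """Toy BOS-aligned best-fit packing (simplified sketch)."""
    row = [bos]  # each row starts with a BOS token
    while len(row) < seq_len and doc_buffer:
        remaining = seq_len - len(row)
        # find the largest document that still fits
        best_idx, best_len = -1, 0
        for i, doc in enumerate(doc_buffer):
            if best_len < len(doc) <= remaining:
                best_idx, best_len = i, len(doc)
        if best_idx >= 0:
            row.extend(doc_buffer.pop(best_idx))  # add the full document
        else:
            # nothing fits: crop the shortest document to fill exactly
            i = min(range(len(doc_buffer)), key=lambda j: len(doc_buffer[j]))
            row.extend(doc_buffer.pop(i)[:remaining])
    return row

# docs of length 5, 3, and 10 packed into a row of 8 tokens
row = pack_row([[1] * 5, [2] * 3, [3] * 10], seq_len=8)
```

The best-fit step fills most of the row with whole documents; the crop step guarantees the row is always exactly `seq_len` tokens, which is where the ~35% cropping figure comes from.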

5. Flash Attention Unification (nanochat/flash_attention.py)

Unified interface auto-selects backend:

from nanochat.flash_attention import flash_attn

# Auto-selects best backend
y = flash_attn.flash_attn_func(q, k, v, causal=True, window_size=window_size)

Uses Flash Attention 3 on Hopper/bfloat16, else PyTorch SDPA.
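The dispatch can be sketched as a simple rule (assumed selection logic, not the module’s actual code):

```python
def pick_backend(is_hopper: bool, dtype: str) -> str:
    # Flash Attention 3 kernels target Hopper GPUs running bfloat16;
    # everything else falls back to PyTorch's scaled_dot_product_attention.
    if is_hopper and dtype == "bfloat16":
        return "flash_attention_3"
    return "sdpa"
```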

6. Inference Engine (nanochat/engine.py)

The Engine class supports:

  • KV Cache: Pre-filled for efficient prompt handling
  • Tool Use: Special tokens trigger a Python calculator
  • Batch Generation: Clone KV cache for parallel sampling
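To make the cache-cloning idea concrete, here is a toy KV cache in plain Python (an illustrative sketch, not the Engine’s actual API):

```python
class ToyKVCache:
    """Minimal sketch of a KV cache: prefill, append, clone."""

    def __init__(self):
        self.keys, self.values = [], []

    def prefill(self, keys, values):
        # bulk-load the prompt's keys/values in one pass
        self.keys.extend(keys)
        self.values.extend(values)

    def append(self, k, v):
        # one decode step adds exactly one (k, v) pair
        self.keys.append(k)
        self.values.append(v)

    def clone(self):
        # copy the cache so parallel samples can diverge from a shared prompt
        c = ToyKVCache()
        c.keys, c.values = self.keys[:], self.values[:]
        return c

cache = ToyKVCache()
cache.prefill(["k0", "k1"], ["v0", "v1"])  # prompt processed once
branch = cache.clone()                     # second sample reuses the prefix
branch.append("k2", "v2")
```

Prefill pays the prompt cost once; cloning lets several samples share that work, which is the trick behind batch generation.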

Step-by-Step: Train Your Own Model

The entire pipeline is automated in runs/speedrun.sh. Here’s how to do it:

Prerequisites

  • 8xH100 GPU node (or similar)
  • ~20 GB disk space
  • Python 3.10+
  • uv package manager

Step 1: Environment Setup

# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create and activate virtual environment
uv venv
source .venv/bin/activate

# Install dependencies
uv sync --extra gpu

Step 2: Download Training Data

# Download ~2B characters from ClimbMix dataset
python -m nanochat.dataset -n 170
# Downloads ~170 shards (~17 GB compressed)

Step 3: Train the Tokenizer

python -m scripts.tok_train      # Train BPE tokenizer (32,768 vocab)
python -m scripts.tok_eval       # Evaluate tokenizer compression

Tokenizer training takes ~10 minutes on 2B characters.
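Under the hood, BPE training is just repeated pair merging: count adjacent token pairs, merge the most frequent pair into a new id, repeat until the vocab is full. A toy single-merge step (input bytes and the new id 256 are invented for the demo):

```python
from collections import Counter

def most_frequent_pair(tokens):
    # count every adjacent pair and return the most common one
    return Counter(zip(tokens, tokens[1:])).most_common(1)[0][0]

def merge(tokens, pair, new_id):
    # replace each occurrence of `pair` with the new token id
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

toks = list(b"aaaab")  # [97, 97, 97, 97, 98]
toks = merge(toks, most_frequent_pair(toks), 256)
```

nanochat’s tokenizer runs this loop until it reaches 32,768 vocabulary entries.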

Step 4: Pretrain the Base Model

torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
    --depth=24 \
    --target-param-data-ratio=8 \
    --device-batch-size=16 \
    --fp8 \
    --run=my-first-model
  • --depth=24: GPT-2 size
  • --target-param-data-ratio=8: Slightly undertrained for speed
  • --device-batch-size=16: Per-GPU batch
  • --fp8: Enable FP8 (H100+ only)

Expected runtime: ~2 hours.

Step 5: Supervised Finetuning

# Download identity conversations
curl -L -o ~/.cache/nanochat/identity_conversations.jsonl \
    https://karpathy-public.s3.us-west-2.amazonaws.com/identity_conversations.jsonl

# Run SFT for chat
torchrun --standalone --nproc_per_node=8 -m scripts.chat_sft -- \
    --device-batch-size=16 \
    --run=my-sft

This step teaches conversation, special tokens, and tool use.

Step 6: Chat With Your Model

# CLI chat
python -m scripts.chat_cli -p "Why is the sky blue?"

# Or launch web UI
python -m scripts.chat_web

The web UI runs on port 8000 and provides a ChatGPT-like interface.

Research Workflow: Rapid Experimentation

For fast prototyping, train smaller models (e.g., d12) that finish in minutes rather than hours.

Quick Experiments (~5 minutes)

OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
    --depth=12 \
    --run="d12-test" \
    --core-metric-every=999999 \
    --sample-every=-1 \
    --save-every=-1

Metrics to Monitor

Track these in Weights & Biases (wandb):

  1. val_bpb: Validation bits-per-byte (vocab-size-independent loss)
  2. core_metric: DCLM CORE evaluation score
  3. train/mfu: Model FLOPS utilization
  4. train/tok_per_sec: Training throughput
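val_bpb is useful because it normalizes loss by raw byte count rather than token count, so runs with different tokenizers remain comparable. A sketch of the conversion (the formula shape is an assumption for illustration):

```python
import math

def bits_per_byte(total_loss_nats: float, total_bytes: int) -> float:
    # cross-entropy summed over tokens, converted nats -> bits,
    # then normalized by the number of raw text bytes
    return total_loss_nats / math.log(2) / total_bytes

# e.g. a summed loss equivalent to 800 bits over 1,000 bytes -> 0.8 bpb
bpb = bits_per_byte(total_loss_nats=math.log(2) * 800, total_bytes=1000)
```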

Testing Requirements

Any improvement must be robust across all depths (d12–d26) to avoid overfitting to a single model size.

Why nanochat Matters

Cost Accessibility

| Approach | Cost | Time | Hardware |
| --- | --- | --- | --- |
| OpenAI GPT-2 (2019) | $43,000 | 168 hrs | 32 TPU v3 |
| nanochat (2026) | $48 | 2 hrs | 8xH100 |
| nanochat spot | ~$15 | 2 hrs | 8xH100 spot |

nanochat brings LLM training within reach for:

  • Individual researchers
  • Small startups
  • University courses
  • Hobbyists

Educational Value

  • ~500 lines for GPT model
  • ~530 lines for optimizer
  • Clear comments and no hidden configs
  • Easy to read, modify, and experiment

Research Velocity

  • Faster hypothesis testing
  • More experiments per week
  • Lower cost of failure
  • Community leaderboard for collaboration

Transparency

  • Scaling laws in dev/LOG.md
  • Ablation studies in GitHub Discussions
  • Full reproduction details for leaderboard
  • Clear AI contribution disclosure

Limitations and Reality Check

Hardware Requirements

The $48 cost assumes 8xH100 access. Cloud pricing:

  • Lambda Labs: ~$25/hour for 8xH100
  • RunPod: ~$15/hour spot
  • Total: ~2 hours pretraining + SFT

Budget $50–$100 for a full run.

Capability Ceiling

nanochat reaches GPT-2 level (2019), which means:

Can do:

  • Basic conversation
  • Simple reasoning
  • Elementary math
  • Limited factual recall

Cannot do:

  • Complex reasoning
  • Advanced code generation
  • Nuanced instruction following
  • Match GPT-4, Claude, or Gemini

Think of it as a kindergartener’s capability.

Data Requirements

  • ~170 data shards
  • ~17 GB compressed
  • ~2B characters

Ensure enough storage and bandwidth.

Metric Limitations

CORE score covers 22 tasks but doesn’t reflect:

  • Real-world conversation quality
  • Domain-specific knowledge
  • Instruction nuance
  • Safety/alignment

Different seeds yield ~0.016 CORE variance.

FAQ

How much does it cost to train with nanochat?

About $48 on-demand (2 hours at $24/hr) or ~$15 on spot. This is for pretraining only—add ~30 minutes for SFT.

What GPU do I need?

  • Minimum: Any modern datacenter GPU (single GPU)
  • Optimal: 8xH100 or 8xA100 for fastest training
  • Code scales from 1 to 8 GPUs via gradient accumulation

How long does training take?

1.65 to 3 hours depending on hardware/config. Leaderboard record: 1.65 hours for d24.

What is the CORE metric?

DCLM CORE evaluates 22 tasks (ARC, MMLU, etc). GPT-2 scored 0.256525; nanochat exceeds 0.26.

Can I train on a single GPU?

Yes. Omit torchrun for automatic gradient accumulation. Training is 8x slower but results are similar.
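The equivalence holds because a big-batch gradient is just the average of micro-batch gradients. A toy check with a squared-error loss (data and loss are invented for the demo):

```python
data = [1.0, 2.0, 3.0, 4.0]
w = 0.0

def grad(batch, w):
    # d/dw of mean((w - x)^2) over the batch
    return sum(2 * (w - x) for x in batch) / len(batch)

# one full-batch step (8-GPU style) vs. two accumulated micro-batches
big_batch = grad(data, w)
accumulated = sum(grad(data[i:i + 2], w) for i in (0, 2)) / 2
```

With equal-sized micro-batches the two quantities are identical, so a single GPU reaches the same weights as the 8-GPU run, just with more wall-clock time.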

What dataset does nanochat use?

Current: ClimbMix (NVIDIA curated web dataset). Previous: FineWeb-EDU. The tokenizer is trained on ~2B characters.

Does nanochat work on Apple Silicon?

Yes. Runs on MPS (float32 precision). Slower than CUDA, but functional for experiments.

Can I resume from a checkpoint?

Yes. Use --resume-from-step=<step>. The dataloader state is also saved.

nanochat vs nanoGPT?

nanoGPT handled pretraining only. nanochat covers tokenization, pretraining, SFT, RLHF, evaluation, inference, and web UI.

Conclusion

nanochat proves that LLM training is now affordable and accessible. What once cost $43,000 now costs under $50.

By providing a minimal, readable codebase and a “one dial” system, Karpathy has made LLM training feasible for research, education, and experimentation.

Key Takeaways

  • 100x cost reduction: $43,000 → $48 for GPT-2 capability
  • 100x speedup: 168 hours → 1.65 hours
  • Single configuration dial: --depth controls all
  • Full pipeline: From tokenization to web UI
  • Community driven: Public leaderboard, rapid iteration

Next Steps

Ready to train your own model? Start with the nanochat repository and the runs/speedrun.sh script.

If you’re an API developer building AI-powered applications, understanding LLM training internals is now within your reach. The barrier to entry has dropped from “venture-funded startup” to “weekend project”.
