WonderLab
Open Source Project of the Day (Part 22): nanochat - The Best ChatGPT $100 Can Buy, Karpathy's Minimalist LLM Training Suite

Introduction

"In 2019, training GPT-2 cost about $43,000. Today, for under $100 and about 3 hours, you can reproduce it on 8×H100s and chat with it."

This is Part 22 of the "Open Source Project of the Day" series. Today we explore nanochat (GitHub) by Andrej Karpathy. The project's tagline: The best ChatGPT that $100 can buy.

nanochat isn't yet another bloated "LLM framework" — no giant configuration objects, no model factories, no screens full of if-else branches. It's a minimalist, readable, hackable, forkable experiment suite that connects the whole pipeline end-to-end: tokenization, pretraining, fine-tuning, evaluation, inference, and a chat web UI. A single knob, --depth (the number of Transformer layers), automatically derives width, heads, learning rate, training steps, and more — producing a family of compute-optimal models. Want a GPT-2-level model? Set depth to around 26, run runs/speedrun.sh, and a few hours later you can chat with it in a ChatGPT-style interface.

What You'll Learn

  • nanochat's positioning: a minimalist, single-node multi-GPU, full-pipeline LLM training and chat suite
  • The "single knob" design: how --depth determines compute-optimal hyperparameters and model scale
  • From speedrun to chat: how to train a GPT-2-level model with one script and talk to it via Web UI
  • The Time-to-GPT-2 leaderboard and the meaning of the CORE metric
  • Project directory structure, key scripts, and extension approaches (new capabilities, identity/persona injection)
  • Comparison with nanoGPT, modded-nanogpt, and common LLM frameworks

Prerequisites

  • Basic understanding of Transformer and GPT-style models
  • Familiarity with the distinction between pretraining and fine-tuning (SFT)
  • Experience with Python and PyTorch; basic knowledge of multi-GPU training (e.g., torchrun) is helpful
  • A GPU environment (single or multi-GPU) is helpful for hands-on reproduction

Project Background

Project Introduction

nanochat is an open-source minimalist LLM experiment suite by Andrej Karpathy, designed to run on a single multi-GPU node (e.g., 8×H100) with a small, clear, readable, and modifiable codebase. It covers the complete LLM pipeline: tokenization, pretraining, fine-tuning (SFT/RL), evaluation, inference, and chat UI. The goal is not a "configure everything" framework but a strong baseline: go from zero to a conversational ChatGPT-style model for around a hundred dollars (e.g., ~3 hours on 8×H100 for ~$72; spot instances as low as ~$20).

Core problems the project solves:

  • Want to personally train, fine-tune, and chat with a small LLM, but mainstream frameworks are too complex with high cognitive overhead
  • Need compute-optimal default configurations rather than manually tuning dozens of hyperparameters
  • Want code that is minimal, hackable, and easy to fork for research and teaching
  • Limited budget (under $1000), want clear time and cost expectations (e.g., "3 hours to GPT-2")

Target user groups:

  • Developers and students who want to run the full LLM training and chat pipeline end-to-end
  • Researchers doing small model and scaling studies
  • Enthusiasts who want to understand the complete pipeline from data to conversational LLM
  • Teams needing modifiable, reproducible baseline code

Author/Team Introduction

  • Author: Andrej Karpathy (@karpathy), former Tesla AI Director, OpenAI researcher, Stanford CS PhD; widely known for nanoGPT, neural network tutorials, and other projects.
  • Acknowledgments: The nanochat name comes from nanoGPT; the pretraining and leaderboard design was inspired by modded-nanoGPT; thanks to HuggingFace (fineweb, smoltalk), Lambda (compute), Alec Radford, Sofie @svlandeg, and others.
  • Project creation date: October 2025 (GitHub shows created_at 2025-10-13).

Project Stats

  • GitHub Stars: 43k+
  • 🍴 Forks: 5.6k+
  • 📦 Version: No official release; development happens on the main branch, with the README and Discussions continuously updated (e.g., leaderboard, fp8, and batch-size changes as of February 2026).
  • 📄 License: MIT
  • 🌐 Website: No independent website; primarily GitHub and Discussions
  • 💬 Community: GitHub Discussions, Discord #nanochat; README recommends using DeepWiki for Q&A

Main Features

Core Purpose

nanochat's core purpose is to run the complete pipeline from data to conversational LLM on a single multi-GPU node with minimal cognitive overhead, with compute-optimal model families by default (via a single --depth knob):

  1. Tokenization: BPE tokenizer training and evaluation (GPT-4 style wrapper)
  2. Pretraining: Base model training with distributed support and gradient accumulation, optimized for "Time-to-GPT-2" speed
  3. Fine-tuning: SFT (supervised fine-tuning), RL, etc., paired with tasks (ARC, GSM8K, MMLU, Humaneval, SmolTalk, etc.)
  4. Evaluation: CORE score (DCLM), bits per byte, various task evaluations
  5. Inference: Efficient inference engine with KV Cache
  6. Chat: CLI and ChatGPT-style Web UI for chatting directly with trained models
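To make stage 1 concrete: the heart of BPE tokenizer training is repeatedly finding the most frequent adjacent pair of tokens and merging it into a new token. A toy sketch of one merge step (illustrative only, not nanochat's actual tokenizer implementation):

```python
from collections import Counter

def most_frequent_pair(ids):
    """Count adjacent token-id pairs and return the most frequent one."""
    pairs = Counter(zip(ids, ids[1:]))
    return pairs.most_common(1)[0][0]

def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# Start from raw bytes; a real trainer repeats this until the target
# vocabulary size is reached.
ids = list("aaabdaaabac".encode("utf-8"))
pair = most_frequent_pair(ids)   # ('a', 'a') as byte ids: (97, 97)
ids = merge(ids, pair, 256)      # 256 = first id beyond the byte range
print(ids)                       # → [256, 97, 98, 100, 256, 97, 98, 97, 99]
```

Training produces an ordered list of such merges; encoding a new string just replays them.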

Use Cases

  1. Reproduce "hundred-dollar GPT-2" and chat with it

    • Rent an 8×H100 node (or 8×A100), run bash runs/speedrun.sh, then after ~3 hours launch the Web UI with python -m scripts.chat_web and chat with the model like ChatGPT (write poems, Q&A, observe hallucinations, etc.)
  2. Research and iterate on pretraining

    • Use runs/scaling_laws.sh and runs/miniseries.sh for scaling experiments; use small models like --depth=12 for ~5-minute rapid pretraining iterations, watching val_bpb, core_metric, MFU, tok/s, etc. in wandb
  3. Teaching and understanding the full pipeline

    • Minimal code with no giant abstractions, suitable as a "from token to chat" educational codebase
  4. Inject identity/persona into a model

    • Use synthetic conversations in the SFT data to teach the model its name and persona (see Extension and Customization below)
  5. Add new capabilities

    • Add new tasks or data formats under tasks/ (e.g., teaching the model to count the r's in "strawberry")

Quick Start

Environment: Recommended Python + uv; the project root contains pyproject.toml and uv.lock.

One-click reproduce GPT-2 and chat (requires 8×H100-class node, e.g., Lambda):

```shell
# Clone
git clone https://github.com/karpathy/nanochat.git && cd nanochat

# Create virtual environment and install dependencies (example)
uv venv && source .venv/bin/activate
uv sync

# Run full pipeline (pretraining + fine-tuning, etc., ~3 hours)
bash runs/speedrun.sh
```

After training, launch the chat Web UI:

```shell
source .venv/bin/activate
python -m scripts.chat_web
```

Open the URL printed in the terminal (e.g., http://<machine-IP>:8000/) in a browser to chat with the model like ChatGPT.

Single GPU or limited VRAM: Remove torchrun to run on a single GPU (roughly 8x slower); if VRAM is insufficient, reduce --device_batch_size in the script (e.g., 32→16→8→4→2→1). CPU/Apple Silicon: See runs/runcpu.sh, which uses a much smaller model and fewer steps for demonstration purposes only — results are limited.
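Why shrinking --device_batch_size keeps results consistent: the training loop compensates with gradient accumulation, so the effective batch per optimizer step is unchanged. A back-of-the-envelope sketch (the function and numbers here are illustrative, not nanochat's actual code):

```python
def grad_accum_steps(total_batch, device_batch, world_size):
    """Micro-batches each rank must accumulate per optimizer step so the
    effective batch size stays constant when device_batch shrinks."""
    assert total_batch % (device_batch * world_size) == 0
    return total_batch // (device_batch * world_size)

# 8 GPUs with device batch 32: one micro-batch per step.
print(grad_accum_steps(256, 32, 8))   # → 1
# Same run on a single GPU: 8 accumulation steps (hence roughly 8x slower).
print(grad_accum_steps(256, 32, 1))   # → 8
# Halving device_batch_size for limited VRAM doubles the accumulation.
print(grad_accum_steps(256, 16, 1))   # → 16
```

The gradients summed over accumulation steps match what a single large batch would produce, which is why single-GPU runs trade only time, not results.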

Core Features

  1. Single complexity knob --depth

    • The number of Transformer layers determines model scale; all other hyperparameters (width, heads, learning rate, training steps, weight decay, etc.) are automatically derived by the code in a compute-optimal way — users only choose a bigger or smaller model.
  2. Time-to-GPT-2 leaderboard

    • Measures wall-clock time to "reach GPT-2 (1.6B) level" using DCLM CORE score; currently ~2.76–3.04 hours (8×H100); README and dev/LEADERBOARD.md are continuously updated.
  3. Full pipeline coverage

    • Tokenizer training and evaluation, base pretraining, SFT/RL, CORE/bpb evaluation, multi-task evaluation, KV Cache inference, CLI/Web chat — one codebase all the way through.
  4. Minimal codebase

    • No large framework-style configs and factories; clear structure (nanochat/ core library + scripts/ entrypoints + runs/ scripts + tasks/ evaluation tasks), easy to read and modify.
  5. Unified distributed and single-GPU

    • Multi-GPU uses torchrun; single GPU automatically falls back to gradient accumulation with consistent results.
  6. ChatGPT-style Web UI

    • scripts.chat_web provides a local web interface for chatting with the trained chat model.
  7. Research-friendly

    • Provides scaling_laws, miniseries, and other scripts; PRs must disclose LLM-assisted writing portions (see Contributing).
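To make the KV Cache idea concrete: during decoding, the keys and values of past tokens are cached, so each new token costs one attention pass over the cache instead of recomputing the whole prefix. A minimal single-head NumPy sketch (illustrative only; nanochat's engine.py is batched, multi-head PyTorch code):

```python
import numpy as np

def attend(q, K, V):
    """Single-head scaled dot-product attention over cached keys/values."""
    scores = K @ q / np.sqrt(q.shape[0])   # (T,) similarity to each cached key
    w = np.exp(scores - scores.max())
    w /= w.sum()                           # softmax weights
    return w @ V                           # (d,) weighted sum of cached values

class KVCache:
    """Append-only cache: each decode step adds one (k, v) pair, so
    attention is O(T) per new token rather than O(T^2) per prefix."""
    def __init__(self, d):
        self.K = np.zeros((0, d))
        self.V = np.zeros((0, d))

    def step(self, q, k, v):
        self.K = np.vstack([self.K, k])
        self.V = np.vstack([self.V, v])
        return attend(q, self.K, self.V)

d = 4
cache = KVCache(d)
for t in range(3):                         # three decode steps
    q = k = v = np.ones(d) * (t + 1)
    out = cache.step(q, k, v)
print(cache.K.shape)                       # → (3, 4): one entry per token
```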

Project Advantages

| Comparison | nanochat | nanoGPT | Large LLM frameworks (e.g., full HuggingFace stack) |
| --- | --- | --- | --- |
| Pipeline completeness | Tokenization → pretraining → fine-tuning → evaluation → chat | Pretraining only | Full pipeline, but complex to configure |
| Cognitive overhead | One knob (--depth) | Hyperparameters configured manually | Many config options, steep learning curve |
| Code scale | Minimal, hackable | Minimal | Large, highly abstracted |
| Target | Hundred-dollar conversational ChatGPT-style model | Pretraining baseline | General-purpose, enterprise-grade |
| Leaderboard/community | Time-to-GPT-2 + Discussions | None | Each has its own ecosystem |

Why choose nanochat?

  • Karpathy's backing: Same lineage as nanoGPT, consistent design philosophy, suitable for learning and secondary development
  • Hundred-dollar reproducibility: Clear cost and time expectations (~$72/3h or spot ~$20)
  • Teaching and research combined: Can serve as both a "zero to chat" tutorial and a scaling/pretraining research baseline
  • Minimal and forkable: No bloated abstractions; a few changes can produce experiments or teaching variants

Detailed Project Analysis

Architecture and Directory Structure

The project is layered as "library + scripts + run configs":

  • nanochat/: Core library
    • gpt.py: GPT Transformer
    • tokenizer.py: BPE wrapper
    • dataloader.py, dataset.py: Distributed data and tokenization
    • optim.py: AdamW, Muon, single-GPU and distributed
    • engine.py: Inference and KV Cache
    • checkpoint_manager.py: Checkpoint read/write
    • core_eval.py: CORE score (DCLM); loss_eval.py: bits per byte
    • execution.py: Model can execute Python as a tool (if enabled)
    • ui.html: Chat frontend static assets
  • scripts/: Entrypoints
    • base_train.py / base_eval.py: Pretraining and evaluation
    • chat_sft.py / chat_rl.py: Fine-tuning; chat_cli.py / chat_web.py: Chat
    • tok_train.py / tok_eval.py: Tokenizer training and evaluation
  • runs/: One-click scripts
    • speedrun.sh: Zero to GPT-2-level and chat
    • scaling_laws.sh, miniseries.sh: Research and miniseries
    • runcpu.sh: CPU/MPS small-scale demo
  • tasks/: Evaluation and data
    • ARC, GSM8K, MMLU, Humaneval, SmolTalk, spellingbee, etc., plus TaskMixture / TaskSequence and custom jsonl (customjson.py)

The "Single Knob": depth and compute-optimal

nanochat's core design principle: users only set --depth (number of layers), and everything else (width, heads, learning rate, training steps, weight decay, etc.) is automatically calculated by the code using compute-optimal relationships — yielding a family of models that are "best use of compute" at different scales. The model for GPT-2-level capability is at roughly depth 24–26; changing depth gives larger or smaller miniseries models. Any change to the repo is expected to make sense across different depth values, avoiding tuning for a specific scale.
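The derivation pattern can be sketched in a few lines. The specific multipliers below (head size, width-per-layer aspect ratio, the Chinchilla-style 20 tokens-per-parameter rule) are illustrative assumptions, not nanochat's actual formulas — those live in the training code:

```python
def derive_config(depth, head_dim=128, aspect_ratio=64):
    """Illustrative 'single knob' derivation: everything follows from depth.
    The multipliers are assumptions for this sketch, not nanochat's."""
    model_dim = depth * aspect_ratio              # width grows with depth
    n_heads = model_dim // head_dim
    # Rough parameter count: 12 * L * d^2 covers attention (4d^2) + MLP (8d^2)
    params = 12 * depth * model_dim ** 2
    tokens = 20 * params                          # Chinchilla-style compute-optimal
    return dict(model_dim=model_dim, n_heads=n_heads,
                params=params, tokens=tokens)

cfg = derive_config(26)
print(cfg["model_dim"], cfg["n_heads"])           # → 1664 13
print(f"{cfg['params']/1e9:.2f}B params, {cfg['tokens']/1e9:.0f}B tokens")
```

The point of this style is that one integer pins down a whole self-consistent configuration, so a change to the repo can be sanity-checked at several depths instead of being tuned to one scale.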

Pretraining and the Leaderboard

Pretraining is the current development focus. Time-to-GPT-2 is defined as: on an 8×H100 node, wall-clock time to surpass the GPT-2 (1.6B) DCLM CORE score (0.256525). The table in README is updated with the best time (currently ~2.76–3.04 hours), val_bpb, CORE, description, date, and commit. For research, use a small depth (e.g., 12) for short pretraining runs, monitoring val_bpb, core_metric, MFU, tok/s, etc. in wandb for rapid validation.
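The ~3-hour figure can be sanity-checked with the standard 6·N·D training-FLOPs rule. The model size, token count, and MFU below are assumptions for illustration (roughly speedrun scale), not numbers taken from the repo:

```python
# Back-of-the-envelope: wall-clock hours ≈ 6*N*D / (peak FLOPS * MFU).
N = 560e6              # params (assumed speedrun-scale model)
D = 20 * N             # tokens, Chinchilla-style (11.2B)
flops = 6 * N * D      # ≈ 3.8e19 training FLOPs
peak = 8 * 989e12      # 8×H100, bf16 dense peak ≈ 989 TFLOPS each
mfu = 0.4              # assumed model FLOPs utilization
hours = flops / (peak * mfu) / 3600
print(f"~{hours:.1f} hours")   # → ~3.3 hours
```

Under these assumptions the estimate lands right around the leaderboard times, which is why leaderboard progress comes from raising MFU and improving data/architecture efficiency rather than from magic.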

Extension and Customization

  • New capabilities: Add new tasks or data formats in tasks/; see the "counting r in strawberry" example.
  • Identity/persona: Use synthetic data in SFT, see Guide: infusing identity.
  • Data and repackaging: dev/repackage_data_reference.py and related files for pretraining data sharding.
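A sketch of what synthetic identity data might look like. The jsonl record schema here is an assumption for illustration — check customjson.py and the identity guide for the exact format nanochat expects:

```python
import json
import random

# Generate a few synthetic identity conversations for SFT. Mixing several
# answer phrasings helps the model generalize rather than parrot one line.
NAME = "nanochat"
questions = ["What is your name?", "Who are you?", "Who built you?"]
answers = [
    f"I'm {NAME}, a small language model trained with the nanochat pipeline.",
    f"My name is {NAME}. I was trained end-to-end on a single 8-GPU node.",
]

with open("identity.jsonl", "w") as f:
    for q in questions:
        record = {"messages": [
            {"role": "user", "content": q},
            {"role": "assistant", "content": random.choice(answers)},
        ]}
        f.write(json.dumps(record) + "\n")
```

In practice you would generate hundreds of such variations and mix them into the SFT task mixture alongside the normal chat data.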

Dependencies and Runtime Environment

  • Python: Project includes .python-version; recommended to manage environment with uv
  • PyTorch: Core dependency; supports CUDA; xpu/mps etc. may work but are not fully tested across all paths
  • GPU: Speedrun targets 8×H100 (or 8×A100, slightly slower); single GPU or limited VRAM requires reducing device_batch_size or using runcpu.sh-style scripts

Project Resources

Official Resources

Who Should Use This

  • Developers who want to personally train and chat with an LLM: Hundred-dollar cost, one script, a few hours to get a conversational model
  • Researchers doing small model and scaling studies: Minimal baseline, single knob, complete leaderboard and scripts
  • Teaching: Suitable as a full-pipeline "from token to chat" educational codebase
  • Those needing a readable, modifiable, forkable baseline: No giant frameworks, easy to build your own variant on top of nanochat

You're welcome to visit my personal homepage for more useful content and interesting products.
