WonderLab
Open Source Project of the Day (Part 22): nanochat - The Best ChatGPT $100 Can Buy, Karpathy's Minimalist LLM Training Suite

Introduction

"In 2019, training GPT-2 cost about $43,000. Today, for under $100 and about 3 hours, you can reproduce it on 8×H100s and chat with it."

This is Part 22 of the "Open Source Project of the Day" series. Today we explore nanochat (GitHub) by Andrej Karpathy. The project's tagline: The best ChatGPT that $100 can buy.

nanochat isn't yet another bloated "LLM framework" — no giant configuration objects, no model factories, no screens full of if-else branches. It's a minimalist, readable, hackable, forkable experiment suite that connects the whole pipeline end-to-end: tokenization, pretraining, fine-tuning, evaluation, inference, and a chat web UI. A single knob, --depth (the number of Transformer layers), automatically derives width, heads, learning rate, training steps, and more — producing a family of compute-optimal models. Want a GPT-2-level model? Set depth to around 26, run runs/speedrun.sh, and a few hours later you can chat with it in a ChatGPT-style interface.

What You'll Learn

  • nanochat's positioning: a minimalist, single-node multi-GPU, full-pipeline LLM training and chat suite
  • The "single knob" design: how --depth determines compute-optimal hyperparameters and model scale
  • From speedrun to chat: how to train a GPT-2-level model with one script and talk to it via Web UI
  • The Time-to-GPT-2 leaderboard and the meaning of the CORE metric
  • Project directory structure, key scripts, and extension approaches (new capabilities, identity/persona injection)
  • Comparison with nanoGPT, modded-nanogpt, and common LLM frameworks

Prerequisites

  • Basic understanding of Transformer and GPT-style models
  • Familiarity with the distinction between pretraining and fine-tuning (SFT)
  • Experience with Python and PyTorch; basic knowledge of multi-GPU training (e.g., torchrun) is helpful
  • A GPU environment (single or multi-GPU) is helpful for hands-on reproduction

Project Background

Project Introduction

nanochat is an open-source minimalist LLM experiment suite by Andrej Karpathy, designed to run on a single multi-GPU node (e.g., 8×H100) with a small, clear, readable, and modifiable codebase. It covers the complete LLM pipeline: tokenization, pretraining, fine-tuning (SFT/RL), evaluation, inference, and chat UI. The goal is not a "configure everything" framework but a strong baseline: go from zero to a conversational ChatGPT-style model for around a hundred dollars (e.g., ~3 hours on 8×H100 for ~$72; spot instances as low as ~$20).

Core problems the project solves:

  • Want to personally train, fine-tune, and chat with a small LLM, but mainstream frameworks are too complex with high cognitive overhead
  • Need compute-optimal default configurations rather than manually tuning dozens of hyperparameters
  • Want code that is minimal, hackable, and easy to fork for research and teaching
  • Limited budget (under $1000), want clear time and cost expectations (e.g., "3 hours to GPT-2")

Target user groups:

  • Developers and students who want to run the full LLM training and chat pipeline end-to-end
  • Researchers doing small model and scaling studies
  • Enthusiasts who want to understand the complete pipeline from data to conversational LLM
  • Teams needing modifiable, reproducible baseline code

Author/Team Introduction

  • Author: Andrej Karpathy (@karpathy), former Tesla AI Director, OpenAI researcher, Stanford CS PhD; widely known for nanoGPT, neural network tutorials, and other projects.
  • Acknowledgments: The nanochat name comes from nanoGPT; the pretraining and leaderboard design was inspired by modded-nanoGPT; thanks to HuggingFace (fineweb, smoltalk), Lambda (compute), Alec Radford, Sofie @svlandeg, and others.
  • Project creation date: October 2025 (GitHub shows created_at 2025-10-13).

Project Stats

  • GitHub Stars: 43k+
  • 🍴 Forks: 5.6k+
  • 📦 Version: No official release; development happens on the main branch, with the README and Discussions continuously updated (e.g., leaderboard, fp8, and batch-size changes as of February 2026).
  • 📄 License: MIT
  • 🌐 Website: No independent website; primarily GitHub and Discussions
  • 💬 Community: GitHub Discussions, Discord #nanochat; README recommends using DeepWiki for Q&A

Main Features

Core Purpose

nanochat's core purpose is to run the complete pipeline from data to conversational LLM on a single multi-GPU node with minimal cognitive overhead, with compute-optimal model families by default (via a single --depth knob):

  1. Tokenization: BPE tokenizer training and evaluation (GPT-4 style wrapper)
  2. Pretraining: Base model training with distributed support and gradient accumulation, optimized for "Time-to-GPT-2" speed
  3. Fine-tuning: SFT (supervised fine-tuning), RL, etc., paired with tasks (ARC, GSM8K, MMLU, Humaneval, SmolTalk, etc.)
  4. Evaluation: CORE score (DCLM), bits per byte, various task evaluations
  5. Inference: Efficient inference engine with KV Cache
  6. Chat: CLI and ChatGPT-style Web UI for chatting directly with trained models
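To make stage 1 concrete: the heart of BPE tokenizer training is repeatedly finding the most frequent adjacent pair of tokens and merging it into a new token. A toy sketch of one merge step (illustrative only, not nanochat's actual tokenizer implementation):

```python
from collections import Counter

def most_frequent_pair(ids):
    """Count adjacent token-id pairs and return the most frequent one."""
    pairs = Counter(zip(ids, ids[1:]))
    return pairs.most_common(1)[0][0]

def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# Start from raw bytes; a real trainer repeats this until the target
# vocabulary size is reached.
ids = list("aaabdaaabac".encode("utf-8"))
pair = most_frequent_pair(ids)   # ('a', 'a') as byte ids: (97, 97)
ids = merge(ids, pair, 256)      # 256 = first id beyond the byte range
print(ids)                       # → [256, 97, 98, 100, 256, 97, 98, 97, 99]
```

Training produces an ordered list of such merges; encoding a new string just replays them.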

Use Cases

  1. Reproduce "hundred-dollar GPT-2" and chat with it

    • Rent an 8×H100 node (or 8×A100), run bash runs/speedrun.sh, then after ~3 hours launch the Web UI with python -m scripts.chat_web and chat with the model like ChatGPT (write poems, Q&A, observe hallucinations, etc.)
  2. Research and iterate on pretraining

    • Use runs/scaling_laws.sh and runs/miniseries.sh for scaling experiments; use small models like --depth=12 for ~5-minute rapid pretraining iterations, watching val_bpb, core_metric, MFU, tok/s, etc. in wandb
  3. Teaching and understanding the full pipeline

    • Minimal code with no giant abstractions, suitable as a "from token to chat" educational codebase
  4. Inject identity/persona into a model

    • Use synthetic conversations in the SFT data to teach the model its name and persona (see Extension and Customization below)
  5. Add new capabilities

    • Add new tasks or data formats under tasks/ (e.g., teaching the model to count the r's in "strawberry")

Quick Start

Environment: Recommended Python + uv; the project root contains pyproject.toml and uv.lock.

One-click reproduce GPT-2 and chat (requires 8×H100-class node, e.g., Lambda):

```shell
# Clone
git clone https://github.com/karpathy/nanochat.git && cd nanochat

# Create virtual environment and install dependencies (example)
uv venv && source .venv/bin/activate
uv sync

# Run full pipeline (pretraining + fine-tuning, etc., ~3 hours)
bash runs/speedrun.sh
```

After training, launch the chat Web UI:

```shell
source .venv/bin/activate
python -m scripts.chat_web
```

Open the URL printed in the terminal (e.g., http://<machine-IP>:8000/) in a browser to chat with the model like ChatGPT.

Single GPU or limited VRAM: Remove torchrun to run on a single GPU (roughly 8x slower); if VRAM is insufficient, reduce --device_batch_size in the script (e.g., 32→16→8→4→2→1). CPU/Apple Silicon: See runs/runcpu.sh, which uses a much smaller model and fewer steps for demonstration purposes only — results are limited.
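Why shrinking --device_batch_size keeps results consistent: the training loop compensates with gradient accumulation, so the effective batch per optimizer step is unchanged. A back-of-the-envelope sketch (the function and numbers here are illustrative, not nanochat's actual code):

```python
def grad_accum_steps(total_batch, device_batch, world_size):
    """Micro-batches each rank must accumulate per optimizer step so the
    effective batch size stays constant when device_batch shrinks."""
    assert total_batch % (device_batch * world_size) == 0
    return total_batch // (device_batch * world_size)

# 8 GPUs with device batch 32: one micro-batch per step.
print(grad_accum_steps(256, 32, 8))   # → 1
# Same run on a single GPU: 8 accumulation steps (hence roughly 8x slower).
print(grad_accum_steps(256, 32, 1))   # → 8
# Halving device_batch_size for limited VRAM doubles the accumulation.
print(grad_accum_steps(256, 16, 1))   # → 16
```

The gradients summed over accumulation steps match what a single large batch would produce, which is why single-GPU runs trade only time, not results.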

Core Features

  1. Single complexity knob --depth

    • The number of Transformer layers determines model scale; all other hyperparameters (width, heads, learning rate, training steps, weight decay, etc.) are automatically derived by the code in a compute-optimal way — users only choose a bigger or smaller model.
  2. Time-to-GPT-2 leaderboard

    • Measures wall-clock time to "reach GPT-2 (1.6B) level" using DCLM CORE score; currently ~2.76–3.04 hours (8×H100); README and dev/LEADERBOARD.md are continuously updated.
  3. Full pipeline coverage

    • Tokenizer training and evaluation, base pretraining, SFT/RL, CORE/bpb evaluation, multi-task evaluation, KV Cache inference, CLI/Web chat — one codebase all the way through.
  4. Minimal codebase

    • No large framework-style configs and factories; clear structure (nanochat/ core library + scripts/ entrypoints + runs/ scripts + tasks/ evaluation tasks), easy to read and modify.
  5. Unified distributed and single-GPU

    • Multi-GPU uses torchrun; single GPU automatically falls back to gradient accumulation with consistent results.
  6. ChatGPT-style Web UI

    • scripts.chat_web provides a local web interface for chatting with the trained chat model.
  7. Research-friendly

    • Provides scaling_laws, miniseries, and other scripts; PRs must disclose LLM-assisted writing portions (see Contributing).
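To make the KV Cache idea concrete: during decoding, the keys and values of past tokens are cached, so each new token costs one attention pass over the cache instead of recomputing the whole prefix. A minimal single-head NumPy sketch (illustrative only; nanochat's engine.py is batched, multi-head PyTorch code):

```python
import numpy as np

def attend(q, K, V):
    """Single-head scaled dot-product attention over cached keys/values."""
    scores = K @ q / np.sqrt(q.shape[0])   # (T,) similarity to each cached key
    w = np.exp(scores - scores.max())
    w /= w.sum()                           # softmax weights
    return w @ V                           # (d,) weighted sum of cached values

class KVCache:
    """Append-only cache: each decode step adds one (k, v) pair, so
    attention is O(T) per new token rather than O(T^2) per prefix."""
    def __init__(self, d):
        self.K = np.zeros((0, d))
        self.V = np.zeros((0, d))

    def step(self, q, k, v):
        self.K = np.vstack([self.K, k])
        self.V = np.vstack([self.V, v])
        return attend(q, self.K, self.V)

d = 4
cache = KVCache(d)
for t in range(3):                         # three decode steps
    q = k = v = np.ones(d) * (t + 1)
    out = cache.step(q, k, v)
print(cache.K.shape)                       # → (3, 4): one entry per token
```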

Project Advantages

| Comparison | nanochat | nanoGPT | Large LLM frameworks (e.g., full HuggingFace stack) |
| --- | --- | --- | --- |
| Pipeline completeness | Tokenization → pretraining → fine-tuning → evaluation → chat | Pretraining only | Full pipeline, but complex to configure |
| Cognitive overhead | One knob (--depth) | Hyperparameters configured manually | Many config options, steep learning curve |
| Code scale | Minimal, hackable | Minimal | Large, highly abstracted |
| Target | Hundred-dollar conversational ChatGPT-style model | Pretraining baseline | General-purpose, enterprise-grade |
| Leaderboard/community | Time-to-GPT-2 + Discussions | None | Each has its own ecosystem |

Why choose nanochat?

  • Karpathy's backing: Same lineage as nanoGPT, consistent design philosophy, suitable for learning and secondary development
  • Hundred-dollar reproducibility: Clear cost and time expectations (~$72/3h or spot ~$20)
  • Teaching and research combined: Can serve as both a "zero to chat" tutorial and a scaling/pretraining research baseline
  • Minimal and forkable: No bloated abstractions; a few changes can produce experiments or teaching variants

Detailed Project Analysis

Architecture and Directory Structure

The project is layered as "library + scripts + run configs":

  • nanochat/: Core library
    • gpt.py: GPT Transformer
    • tokenizer.py: BPE wrapper
    • dataloader.py, dataset.py: Distributed data and tokenization
    • optim.py: AdamW, Muon, single-GPU and distributed
    • engine.py: Inference and KV Cache
    • checkpoint_manager.py: Checkpoint read/write
    • core_eval.py: CORE score (DCLM); loss_eval.py: bits per byte
    • execution.py: Model can execute Python as a tool (if enabled)
    • ui.html: Chat frontend static assets
  • scripts/: Entrypoints
    • base_train.py / base_eval.py: Pretraining and evaluation
    • chat_sft.py / chat_rl.py: Fine-tuning; chat_cli.py / chat_web.py: Chat
    • tok_train.py / tok_eval.py: Tokenizer training and evaluation
  • runs/: One-click scripts
    • speedrun.sh: Zero to GPT-2-level and chat
    • scaling_laws.sh, miniseries.sh: Research and miniseries
    • runcpu.sh: CPU/MPS small-scale demo
  • tasks/: Evaluation and data
    • ARC, GSM8K, MMLU, Humaneval, SmolTalk, spellingbee, etc., plus TaskMixture / TaskSequence and custom jsonl (customjson.py)

The "Single Knob": depth and compute-optimal

nanochat's core design principle: users only set --depth (number of layers), and everything else (width, heads, learning rate, training steps, weight decay, etc.) is automatically calculated by the code using compute-optimal relationships — yielding a family of models that are "best use of compute" at different scales. The model for GPT-2-level capability is at roughly depth 24–26; changing depth gives larger or smaller miniseries models. Any change to the repo is expected to make sense across different depth values, avoiding tuning for a specific scale.
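The derivation pattern can be sketched in a few lines. The specific multipliers below (head size, width-per-layer aspect ratio, the Chinchilla-style 20 tokens-per-parameter rule) are illustrative assumptions, not nanochat's actual formulas — those live in the training code:

```python
def derive_config(depth, head_dim=128, aspect_ratio=64):
    """Illustrative 'single knob' derivation: everything follows from depth.
    The multipliers are assumptions for this sketch, not nanochat's."""
    model_dim = depth * aspect_ratio              # width grows with depth
    n_heads = model_dim // head_dim
    # Rough parameter count: 12 * L * d^2 covers attention (4d^2) + MLP (8d^2)
    params = 12 * depth * model_dim ** 2
    tokens = 20 * params                          # Chinchilla-style compute-optimal
    return dict(model_dim=model_dim, n_heads=n_heads,
                params=params, tokens=tokens)

cfg = derive_config(26)
print(cfg["model_dim"], cfg["n_heads"])           # → 1664 13
print(f"{cfg['params']/1e9:.2f}B params, {cfg['tokens']/1e9:.0f}B tokens")
```

The point of this style is that one integer pins down a whole self-consistent configuration, so a change to the repo can be sanity-checked at several depths instead of being tuned to one scale.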

Pretraining and the Leaderboard

Pretraining is the current development focus. Time-to-GPT-2 is defined as: on an 8×H100 node, wall-clock time to surpass the GPT-2 (1.6B) DCLM CORE score (0.256525). The table in README is updated with the best time (currently ~2.76–3.04 hours), val_bpb, CORE, description, date, and commit. For research, use a small depth (e.g., 12) for short pretraining runs, monitoring val_bpb, core_metric, MFU, tok/s, etc. in wandb for rapid validation.
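The ~3-hour figure can be sanity-checked with the standard 6·N·D training-FLOPs rule. The model size, token count, and MFU below are assumptions for illustration (roughly speedrun scale), not numbers taken from the repo:

```python
# Back-of-the-envelope: wall-clock hours ≈ 6*N*D / (peak FLOPS * MFU).
N = 560e6              # params (assumed speedrun-scale model)
D = 20 * N             # tokens, Chinchilla-style (11.2B)
flops = 6 * N * D      # ≈ 3.8e19 training FLOPs
peak = 8 * 989e12      # 8×H100, bf16 dense peak ≈ 989 TFLOPS each
mfu = 0.4              # assumed model FLOPs utilization
hours = flops / (peak * mfu) / 3600
print(f"~{hours:.1f} hours")   # → ~3.3 hours
```

Under these assumptions the estimate lands right around the leaderboard times, which is why leaderboard progress comes from raising MFU and improving data/architecture efficiency rather than from magic.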

Extension and Customization

  • New capabilities: Add new tasks or data formats in tasks/; see the "counting r in strawberry" example.
  • Identity/persona: Use synthetic data in SFT, see Guide: infusing identity.
  • Data and repackaging: dev/repackage_data_reference.py and related files for pretraining data sharding.
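A sketch of what synthetic identity data might look like. The jsonl record schema here is an assumption for illustration — check customjson.py and the identity guide for the exact format nanochat expects:

```python
import json
import random

# Generate a few synthetic identity conversations for SFT. Mixing several
# answer phrasings helps the model generalize rather than parrot one line.
NAME = "nanochat"
questions = ["What is your name?", "Who are you?", "Who built you?"]
answers = [
    f"I'm {NAME}, a small language model trained with the nanochat pipeline.",
    f"My name is {NAME}. I was trained end-to-end on a single 8-GPU node.",
]

with open("identity.jsonl", "w") as f:
    for q in questions:
        record = {"messages": [
            {"role": "user", "content": q},
            {"role": "assistant", "content": random.choice(answers)},
        ]}
        f.write(json.dumps(record) + "\n")
```

In practice you would generate hundreds of such variations and mix them into the SFT task mixture alongside the normal chat data.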

Dependencies and Runtime Environment

  • Python: Project includes .python-version; recommended to manage environment with uv
  • PyTorch: Core dependency; supports CUDA; xpu/mps etc. may work but are not fully tested across all paths
  • GPU: Speedrun targets 8×H100 (or 8×A100, slightly slower); single GPU or limited VRAM requires reducing device_batch_size or using runcpu.sh-style scripts

Project Resources

Official Resources

Who Should Use This

  • Developers who want to personally train and chat with an LLM: Hundred-dollar cost, one script, a few hours to get a conversational model
  • Researchers doing small model and scaling studies: Minimal baseline, single knob, complete leaderboard and scripts
  • Teaching: Suitable as a full-pipeline "from token to chat" educational codebase
  • Those needing a readable, modifiable, forkable baseline: No giant frameworks, easy to build your own variant on top of nanochat

You're welcome to visit my personal homepage for more useful content and interesting products.
