Stockfish 18 plays chess at roughly 3759 Elo—far beyond the strongest humans who have ever lived. Yet it's open-source, written in about 50,000 lines of C++, and runs on any laptop. This article surveys how modern chess engines work, why NNUE neural networks changed everything, and most importantly: the design decisions you'd face if you wanted to build one yourself.
The Tournament Results Tell the Story
Every engine in the top 10 worldwide is now open-source. The proprietary era ended. On the CCRL (Chess Computer Rating List) 40/15 time control, the 2025 hierarchy looks like this:
| Rank | Engine | Elo | Language | Evaluation | Open Source |
|---|---|---|---|---|---|
| 1 | Stockfish 18 | 3650 | C++ | NNUE | Yes |
| 2 | PlentyChess 7.0 | 3643 | C++ | NNUE | Yes |
| 3 | Obsidian 16.0 | 3635 | C++ | NNUE | Yes |
| 4 | Alexandria 8 | 3633 | Rust | NNUE | Yes |
| 5 | Viridithas 19 | 3629 | Rust | NNUE | Yes |
| 6 | Berserk 13 | 3615 | C | NNUE | Yes |
| 10 | Leela Chess Zero | ~3443 | C++ | Deep NN | Yes |
Leela Chess Zero is roughly 200 Elo weaker than Stockfish despite using a 191-million-parameter transformer neural network, simply because Stockfish can evaluate 1,500× more positions per second. Speed matters more than brilliance in chess.
NNUE Was The Breakthrough Nobody Expected
In August 2020, Stockfish 12 introduced NNUE (Efficiently Updatable Neural Network), a technique originally invented for computer shogi. The jump was +80 to +100 Elo overnight—equivalent to two years of conventional development. Every top engine today uses NNUE or a variation of it.
The genius of NNUE is that it's specialized for the constraints of chess engines:
Sparse activation: Only ~30 of 40,960 input features are active per position (99.93% dormant). Most neural network activations are wasted computation in the context of alpha-beta search.
Incremental updates: Since only 2-4 features change per move, you don't recompute the entire network—you update an "accumulator" in place. This makes evaluation sub-microsecond.
Integer arithmetic: NNUE uses int8 and int16 quantization, optimized for CPU SIMD (AVX2, AVX-512). No float32 overhead. Roughly 60 million evaluations per second on a modern CPU.
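To make the incremental-update idea concrete, here is a toy pure-Python model of the accumulator (the sizes, weights, and function names are illustrative, not Stockfish's actual code):

```python
# Toy model of the NNUE accumulator. W is the first-layer weight matrix:
# one row of HIDDEN weights per input feature.
import random

random.seed(0)
HIDDEN = 8        # toy width; Stockfish uses 1024 per perspective
FEATURES = 40960  # one per (king square, piece, square) combination

W = [[random.randint(-64, 63) for _ in range(HIDDEN)] for _ in range(FEATURES)]

def full_refresh(active):
    # From-scratch sum over all active features (needed e.g. after king moves)
    acc = [0] * HIDDEN
    for f in active:
        for j in range(HIDDEN):
            acc[j] += W[f][j]
    return acc

def incremental_update(acc, removed, added):
    # A quiet move changes 2 features, a capture 3: touch only those rows
    acc = acc[:]
    for f in removed:
        for j in range(HIDDEN):
            acc[j] -= W[f][j]
    for f in added:
        for j in range(HIDDEN):
            acc[j] += W[f][j]
    return acc

acc = full_refresh([5, 123, 999, 20000])
# a capture: mover's feature 123 becomes 777, captured piece's 999 vanishes
acc = incremental_update(acc, removed=[123, 999], added=[777])
print(acc == full_refresh([5, 777, 20000]))  # → True
```

Both paths produce the same accumulator, but the incremental one touches three weight rows instead of every active feature—scaled up to 1024-wide int16 vectors with SIMD, this is what keeps evaluation sub-microsecond.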
Stockfish's current NNUE uses a 1024×2→8→32→1 architecture (inputs expand to 1024, then compress through small hidden layers). The network is trained on billions of self-play positions scored at modest search depths, using PyTorch with custom CUDA kernels.
The completely hand-crafted evaluation function—Stockfish's secret sauce for 15 years—was deleted entirely in 2023. Now NNUE does everything.
Two Paradigms Compete: Alpha-Beta vs. Monte Carlo Tree Search
Alpha-Beta Search with NNUE (Stockfish's Approach)
Stockfish searches a tree systematically, pruning branches that cannot improve the best move. It uses Principal Variation Search (PVS), a variant of alpha-beta that searches the expected best move with a full window and all others with a zero-width window.
The trick is aggressive pruning without missing critical moves:
- Late Move Reductions (LMR): Moves ordered later are searched at reduced depth. Estimated +100 Elo when first introduced.
- Null Move Pruning: If passing (making no move) already produces a cutoff, prune the subtree. Dangerous in zugzwang, but nets +100-200 Elo.
- Singular Extensions: If one move dramatically outperforms all others, search it deeper.
- Futility Pruning: At shallow depths, positions too far below alpha are pruned immediately.
- SEE Pruning: Moves with losing material exchanges are pruned based on static exchange evaluation.
These techniques shrink the effective branching factor from ~35 to less than 2, enabling search to depth 30-40 plies where NNUE provides the final judgment.
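The flavor of two of these tricks—null move pruning and LMR—can be shown on a synthetic game tree (positions and evals here are hash-derived stand-ins, and the "null move" is faked by perturbing the state, since a real board isn't needed to see the control flow):

```python
# Toy negamax with null-move pruning and late move reductions over a
# synthetic game tree; all numbers are deterministic but meaningless.
import hashlib

def h(state, salt):
    return int.from_bytes(hashlib.md5(f"{state}:{salt}".encode()).digest()[:4], "big")

def evaluate(state):
    return h(state, "eval") % 201 - 100           # static eval in [-100, 100]

def moves(state):
    return [h(state, f"m{i}") for i in range(8)]  # 8 pseudo-moves per node

R = 2  # null-move depth reduction

def search(state, depth, alpha, beta, allow_null=True):
    if depth == 0:
        return evaluate(state)
    # Null move pruning: give the opponent a free shot; if a reduced-depth
    # search still beats beta, assume the full-depth search would too
    if allow_null and depth > R and evaluate(state) >= beta:
        score = -search(state + 1, depth - 1 - R, -beta, -beta + 1, False)
        if score >= beta:
            return beta
    for i, child in enumerate(moves(state)):
        reduction = 1 if i >= 4 and depth >= 3 else 0  # crude LMR
        score = -search(child, depth - 1 - reduction, -beta, -alpha)
        if reduction and score > alpha:  # looked good: re-search at full depth
            score = -search(child, depth - 1, -beta, -alpha)
        alpha = max(alpha, score)
        if alpha >= beta:
            break                        # beta cutoff
    return alpha

print(search(0, 5, -10**9, 10**9))
```

Real engines condition both tricks far more carefully (zugzwang checks, history-based reduction formulas), but the shape—try a cheap refutation first, search unlikely moves shallower, re-search on surprise—is exactly this.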
MCTS with Deep Neural Networks (Leela Chess Zero)
Leela Chess Zero uses Monte Carlo Tree Search combined with a 191-million-parameter transformer. Instead of exhaustively searching and pruning, it builds a search tree by repeatedly:

- Selecting a leaf node via the PUCT formula: Q(s,a) + c_puct × P(s,a) × √(N_parent) / (1 + N(s,a))
- Evaluating it with the neural network's value head, which replaces classical random rollouts
- Backing up the result to update statistics along the path

The network provides both a policy head (prior probability for each move) and a value head (win probability), replacing random playouts entirely.
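The selection step fits in a few lines (here `N_parent` is approximated as the sum of child visit counts, and the numbers are made up):

```python
import math

def puct_select(children, c_puct=1.5):
    """Pick the child maximizing Q + U per the PUCT formula above.
    Each child carries a visit count N, total value W, network prior P."""
    parent_n = max(1, sum(ch["N"] for ch in children))  # stand-in for N_parent
    def score(ch):
        q = ch["W"] / ch["N"] if ch["N"] else 0.0                   # exploitation
        u = c_puct * ch["P"] * math.sqrt(parent_n) / (1 + ch["N"])  # exploration
        return q + u
    return max(children, key=score)

children = [
    {"move": "e4", "N": 10, "W": 6.0, "P": 0.5},  # well explored, decent Q
    {"move": "d4", "N": 2,  "W": 1.5, "P": 0.3},  # promising, lightly explored
    {"move": "h4", "N": 0,  "W": 0.0, "P": 0.2},  # unvisited
]
print(puct_select(children)["move"])  # → d4
```

Note how the 1/(1 + N) term decays the exploration bonus as a move accumulates visits: the search naturally shifts from the network's priors toward observed values.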
The outcome: MCTS's deep NN produces higher-quality per-position judgments, but Stockfish compensates by evaluating vastly more positions per second. On TCEC (the engine championship), Stockfish beats Leela roughly 57-43 and has won every superfinal since 2020. The gap appears to be widening.
Why C++ (And Sometimes Rust)
Programming language choice directly impacts Elo rating at the elite level. Each additional ply of search adds roughly 50-70 Elo. A 2× difference in nodes-per-second (NPS) means one extra ply—so language choice matters.
| Language | Top Engine | Elo | NPS Relative to C++ | Elo Cost |
|---|---|---|---|---|
| C++ | Stockfish | 3759 | Baseline | 0 |
| C | Berserk | 3646 | 100% | 0 |
| Rust | Alexandria | 3633 | 95-100% | 0-5 |
| C# | Ceres | ~3624 | 70-80% | 30-60 |
| Java | chess22k | ~3000 | 60-80% | 30-60 |
| Python | – | ~1800 | 1-2% | 400+ |
C and C++ dominate because they offer:
- Direct hardware access (cache control, SIMD intrinsics)
- No garbage collection pauses
- Compile-time optimizations
- Fast bitboard operations (64-bit integers for board representation)
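Bitboards deserve a quick illustration. With one 64-bit integer per piece set (bit 0 = a1, bit 63 = h8), attacks for every knight at once reduce to eight shifts plus wrap-around masks; the Python below models what C++ does in a handful of instructions:

```python
# Bitboard sketch: files masks prevent shifts from "wrapping" across the board edge
FILE_A  = 0x0101010101010101
FILE_H  = 0x8080808080808080
FILE_AB = FILE_A | (FILE_A << 1)
FILE_GH = FILE_H | (FILE_H >> 1)
MASK64  = (1 << 64) - 1  # Python ints are unbounded; emulate uint64

def knight_attacks(knights):
    a = (((knights << 17) & ~FILE_A)  | ((knights << 15) & ~FILE_H)
       | ((knights << 10) & ~FILE_AB) | ((knights << 6)  & ~FILE_GH)
       | ((knights >> 17) & ~FILE_H)  | ((knights >> 15) & ~FILE_A)
       | ((knights >> 10) & ~FILE_GH) | ((knights >> 6)  & ~FILE_AB))
    return a & MASK64

d4 = 1 << 27
print(bin(knight_attacks(d4)).count("1"))  # → 8 target squares
```

The same representation makes occupancy tests, pawn pushes, and popcount-based mobility terms one or two CPU instructions each—which is why every serious engine uses it.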
Rust is a surprise win. Alexandria and Viridithas prove that Rust—with its zero-cost abstractions and LLVM compilation—is fully competitive at the elite level. Move generation, board representation, and evaluation all run at near-C++ speeds.
Managed languages like C# and Java incur 30-60 Elo penalties from runtime overhead and garbage collection. Below ~3200 Elo, this doesn't matter—algorithmic skill outweighs implementation. But above 3600, the language tax is disqualifying.
Open-Source Development Beats Proprietary
Twelve of the top 15 engines are open-source. This wasn't ideological—it was mechanistic.
In 2013, Gary Linscott created Fishtest, a distributed testing framework for Stockfish using Sequential Probability Ratio Testing (SPRT). Volunteers donate CPU time. The numbers:
- 19,700+ CPU-years consumed
- 9.8+ billion games played
- ~199 concurrent worker machines
- For Stockfish 15: ~13,000 candidate changes tested, ~200 merged
Stockfish gained +120 Elo in the first year of Fishtest—a pace no proprietary team could match. Andrew Grant's OpenBench extended this framework to 17+ other engines (Berserk, Koivisto, Stormphrax, Viridithas, etc.).
The reason is simple: you cannot optimize an engine in isolation. Testing every patch requires millions of games. Distributed testing at cloud scale is the only way to move fast. Open-source enabled this. Proprietary teams struggle to match it.
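For intuition, here is a simplified trinomial version of the SPRT decision rule (Fishtest's real implementation uses a pentanomial model over game pairs; the function names and hypothesis Elos below are illustrative):

```python
import math

def sprt_llr(wins, draws, losses, elo0=0.0, elo1=5.0):
    """Generalized-SPRT log-likelihood ratio, trinomial approximation."""
    n = wins + draws + losses
    mean = (wins + 0.5 * draws) / n               # average score per game
    var = (wins + 0.25 * draws) / n - mean ** 2   # per-game score variance
    s0 = 1 / (1 + 10 ** (-elo0 / 400))            # expected score if H0 true
    s1 = 1 / (1 + 10 ** (-elo1 / 400))            # expected score if H1 true
    return (s1 - s0) * (2 * mean - s0 - s1) * n / (2 * var)

def sprt_state(llr, alpha=0.05, beta=0.05):
    if llr >= math.log((1 - beta) / alpha):   # ≈ +2.94
        return "accept H1 (patch is stronger)"
    if llr <= math.log(beta / (1 - alpha)):   # ≈ -2.94
        return "accept H0 (no gain; discard)"
    return "continue testing"

llr = sprt_llr(wins=3100, draws=4000, losses=2900)
print(round(llr, 2), sprt_state(llr))
```

The key property: games stop as soon as the evidence crosses either bound, so clearly good or clearly bad patches resolve in a few thousand games while marginal ones get more.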
Other Key Technical Insights
Lazy SMP (Shared Memory Parallelization): Going from 1 to 16 threads adds only ~20-32 Elo, far less than the theoretical benefit. Engines are memory-bound, not CPU-bound, at high core counts.
Syzygy Tablebases: 6-man endgame databases provide ~10 Elo improvement by allowing perfect play in simplified positions. Stockfish ships with 6-man tablebases.
Depth vs. NNUE Quality: Modern NNUE networks are compact (several MB), but the old hand-crafted evaluations were more "reasonable" locally. Deeper search compensates: Stockfish searches to 40+ plies where NNUE makes the final judgment.
Each Additional Ply: Worth approximately +50-70 Elo across all engines. Doubling NPS ≈ one extra ply ≈ 50-70 Elo.
If You Want To Build a Chess Engine: Design Sweet Spots
Most programmers who build chess engines do it for fun or learning. Here's a realistic roadmap with reasonable design trade-offs.
Tier 1: A Respectable Hobbyist Engine (1400-1800 Elo)
Goal: Beat humans, work on a laptop, finish in a weekend.
Design choices:
- Language: Python or Rust (you care about learning, not top-10 rankings)
- Search: Basic alpha-beta with minimal pruning (null move pruning is the only essential trick)
- Evaluation: Hand-crafted. A simple function counting material + positional bonuses for piece centralization, king safety, pawn advancement
- Board: 8×8 2D array (don't use bitboards yet; they're an optimization)
- Move generation: Brute-force checking all 64 squares
Expected Elo: 1400-1800 (beats most casual players)
Time to implement: a weekend; roughly 200-300 lines total covering move generation, alpha-beta, and evaluation
```python
# Pseudocode for a minimum viable chess engine (negamax form)
def evaluate(board):
    material = count_material(board)
    position = count_piece_activity(board)
    return material + position

def search(board, depth, alpha, beta):
    if depth == 0:
        return evaluate(board)
    for move in generate_moves(board):
        board.make_move(move)
        score = -search(board, depth - 1, -beta, -alpha)
        board.unmake_move(move)
        alpha = max(alpha, score)
        if alpha >= beta:
            break  # beta cutoff
    return alpha
```
This gives you 1600+ Elo with virtually no optimization.
Tier 2: A Competitive Amateur Engine (2000-2500 Elo)
Goal: Beat most online opponents, viable on older hardware.
Design choices:
- Language: C++ or Rust (you need speed now, not just learning)
- Search: alpha-beta with:
  - Null move pruning
  - Late Move Reductions (LMR) with a basic formula: reduction = 1 + ln(depth) × ln(moveNumber)
  - Move ordering (best moves first)
  - Transposition table (hash table caching positions)
  - Quiescence search (don't stop at an arbitrary depth; resolve captures first)
- Evaluation: still hand-crafted, but richer:
  - Material (pawns=1, knights=3, bishops=3.25, rooks=5, queens=9)
  - Piece-square tables (centralized pieces score higher)
  - Pawn structure (doubled, isolated, passed pawns)
  - King safety
  - Mobility (count available squares for each piece)
- Board: Bitboards (64-bit integers for each piece type)
- Move generation: Bit-scanning optimized
Expected Elo: 2000-2500 (beats most club players)
Time to implement: 1,000-2,000 lines of well-structured C++
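The material-plus-piece-square-table recipe above fits in a few lines; the knight table here is illustrative, not tuned:

```python
# Material + piece-square tables, the Tier 2 evaluation recipe.
# Values in centipawns.
PIECE_VALUE = {"P": 100, "N": 300, "B": 325, "R": 500, "Q": 900, "K": 0}

KNIGHT_PST = [  # index 0 = a1 ... 63 = h8; rewards centralization
    -50, -40, -30, -30, -30, -30, -40, -50,
    -40, -20,   0,   5,   5,   0, -20, -40,
    -30,   0,  10,  15,  15,  10,   0, -30,
    -30,   5,  15,  20,  20,  15,   5, -30,
    -30,   5,  15,  20,  20,  15,   5, -30,
    -30,   0,  10,  15,  15,  10,   0, -30,
    -40, -20,   0,   5,   5,   0, -20, -40,
    -50, -40, -30, -30, -30, -30, -40, -50,
]

def evaluate(pieces):
    """pieces: (letter, square, is_white) triples; positive = good for White."""
    score = 0
    for piece, sq, is_white in pieces:
        value = PIECE_VALUE[piece]
        if piece == "N":
            # Black uses the White table mirrored vertically (sq ^ 56)
            value += KNIGHT_PST[sq if is_white else sq ^ 56]
        score += value if is_white else -value
    return score

# White knight on d4 (centralized) vs black knight on b8 (on the rim)
print(evaluate([("N", 27, True), ("N", 57, False)]))  # → 60
```

A real Tier 2 engine has one such table per piece type, often blended between middlegame and endgame values, but the lookup structure is exactly this.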
Key optimization: Transposition table. Storing and retrieving previously evaluated positions cuts the search tree in half.
```cpp
#include <algorithm>
#include <cstdint>
#include <unordered_map>

enum Flag { EXACT, LOWER, UPPER };

struct TTEntry {
    uint64_t hash;
    int depth, score;
    Flag flag;
};

// Illustrative; real engines use a fixed-size array indexed by hash bits
std::unordered_map<uint64_t, TTEntry> transposition_table;

int search(Board &board, int depth, int alpha, int beta) {
    uint64_t hash = board.hash();
    auto it = transposition_table.find(hash);
    if (it != transposition_table.end()) {
        const TTEntry &entry = it->second;
        if (entry.depth >= depth) {
            if (entry.flag == EXACT) return entry.score;
            if (entry.flag == LOWER) alpha = std::max(alpha, entry.score);
            if (entry.flag == UPPER) beta = std::min(beta, entry.score);
            if (alpha >= beta) return entry.score;
        }
    }
    // ... search continues, storing an entry before returning ...
}
```
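The `board.hash()` above is conventionally a Zobrist key: the XOR of one pre-generated random 64-bit number per (piece, square), plus one for the side to move. A sketch with a toy position encoding:

```python
import random

random.seed(42)
# One random 64-bit key per (piece, square), plus one for the side to move
ZOBRIST = {(p, sq): random.getrandbits(64)
           for p in "PNBRQKpnbrqk" for sq in range(64)}
ZOBRIST_SIDE = random.getrandbits(64)

def zobrist(pieces, white_to_move):
    """pieces: list of (piece letter, square index) pairs."""
    key = 0 if white_to_move else ZOBRIST_SIDE
    for piece, sq in pieces:
        key ^= ZOBRIST[(piece, sq)]
    return key

# XOR makes the key incrementally updatable: moving the knight g1->f3
# is two XORs, no full recompute (the same trick as NNUE's accumulator)
key = zobrist([("N", 6)], True)  # white knight on g1
key ^= ZOBRIST[("N", 6)]         # knight leaves g1
key ^= ZOBRIST[("N", 21)]        # knight lands on f3
print(key == zobrist([("N", 21)], True))  # → True
```

A full implementation also XORs in castling rights and the en passant file, so that transpositions reached by different move orders collide on the same table slot.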
Tier 3a: A Strong Club Engine (~3000 lines, 2400-2600 Elo)
Goal: Impressive hobbyist engine; beats most players, playable against titled opponents.
What you add to Tier 2:
- Iterative deepening: Search depth 1, then 2, then 3, etc. Allows precise time management and search stability
- Killer heuristic: Track moves that produced cutoffs in sibling nodes; try them early in similar positions
- History heuristic: Moves that produced cutoffs anywhere in the tree are tried earlier at shallower depths
- Aspiration windows: Search the full depth with a narrow alpha-beta window around the previous iteration's score (much faster when you're right)
- Staged move ordering: Try the transposition-table move first, then captures, then quiet moves
- Better evaluation:
  - Piece-square tables for all pieces (not just pawns)
  - Pawn structure analysis (passed pawns, weak squares)
  - King tropism (pieces attacking near the opponent's king)
  - Tempo bonus (slight evaluation advantage for the side to move)
- Board representation: still bitboards, but optimized move generation using bit-scanning
Expected Elo: 2400-2600 (beats most club-level players, competitive against titled opponents)
Time to implement: ~3,000 lines of well-optimized C++
Code example (iterative deepening structure):
```cpp
constexpr int INF = 1000000;  // larger than any evaluation score

int iterative_deepening(Board &board, int time_ms) {
    int best_move = 0;
    auto start = std::chrono::steady_clock::now();
    for (int depth = 1; depth <= 30; depth++) {
        int score = search(board, depth, -INF, INF);
        best_move = pv[0];  // root move of this depth's principal variation
        if (score_is_mate(score)) {
            return best_move;  // forced mate found: no point searching deeper
        }
        auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(
            std::chrono::steady_clock::now() - start).count();
        if (elapsed > time_ms * 9 / 10) {
            break;  // stop once ~90% of the allocated time is spent
        }
    }
    return best_move;
}
```
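Aspiration windows from the Tier 3a list bolt onto this loop naturally. A sketch, with a stand-in fail-hard `search` and a made-up widening policy:

```python
# Iterative deepening with aspiration windows. `search` is any fail-hard
# alpha-beta function; the 4x widening below is an arbitrary example.
INF = 10**9

def search_with_aspiration(search, board, max_depth, window=25):
    score = search(board, 1, -INF, INF)      # depth 1: full window
    for depth in range(2, max_depth + 1):
        alpha, beta = score - window, score + window
        while True:
            score = search(board, depth, alpha, beta)
            if score <= alpha:               # fail low: widen down, re-search
                alpha -= 4 * window
            elif score >= beta:              # fail high: widen up, re-search
                beta += 4 * window
            else:
                break                        # true score inside the window
    return score

# Demo with a stand-in search that clamps a fixed "true score" of 57
fake_search = lambda board, depth, alpha, beta: max(alpha, min(beta, 57))
print(search_with_aspiration(fake_search, None, 5))  # → 57
```

The bet is that consecutive iterations score similarly, so the narrow window produces far more cutoffs; the occasional fail-high/fail-low re-search is the price, which is why window size and widening policy are tuned parameters.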
Tier 3b: An Intermediate Engine (~5000 lines, 2800-3200 Elo)
Goal: Seriously competitive; tournaments, online rankings, formidable opponent.
What you add to Tier 3a:
- Singular extensions: If one move is dramatically better than siblings, search it one ply deeper
- Futility pruning: At shallow depths (1-2 plies from qsearch), prune moves that can't possibly improve alpha
- SEE (Static Exchange Evaluation): Use a fast algorithm to estimate if a capture wins or loses material
- Countermove heuristic: Track which move was played at the parent node; moves that counter it get priority
- Quiescence search improvements:
  - Only examine captures that win material or give check
  - Use the stand-pat score: if the static evaluation already beats beta, cut off immediately
- Transposition table refinements:
  - Increased size (16 MB → 256 MB)
  - Better replacement strategy (always replace, or replace only weaker entries)
  - Age tracking (overwrite entries from earlier searches)
- Advanced evaluation (still hand-crafted):
  - Separate evaluation for opening, middlegame, endgame (phase calculation)
  - Rook on 7th rank bonus
  - Opposite-colored bishop endgames (draw tendency)
  - Pawn promotion threats
  - Mobility counts refined by piece type
Search enhancements:
- Multi-threaded search (Lazy SMP): 2-4 threads with shared transposition table (adds ~20 Elo but improves responsiveness)
- Mate distance pruning: If we've found mate in N moves, prune branches that can't improve it
- Null move window refinements: Adapt null move reduction based on position criticality (zugzwang detection)
Board representation: Bitboards with magic multiplication for bishop/rook attacks (O(1) attack generation)
Expected Elo: 2800-3200 (serious tournament contender; dominates online)
Time to implement: ~5,000 lines of optimized C++
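SEE from the list above can be sketched with the classic swap-off algorithm, assuming you can already enumerate each side's attackers of the target square ordered cheapest-first:

```python
# Static Exchange Evaluation via the swap-off list. attackers/defenders
# are piece values, least valuable first; attackers[0] captures first.
def see(target_value, attackers, defenders):
    # Build the capture sequence: attacker, defender, attacker, ...
    sequence, i, j = [], 0, 0
    while True:
        if len(sequence) % 2 == 0:
            if i >= len(attackers):
                break
            sequence.append(attackers[i]); i += 1
        else:
            if j >= len(defenders):
                break
            sequence.append(defenders[j]); j += 1
    # gain[k] = material balance if the exchange stops after capture k
    gain = [target_value]
    for k in range(1, len(sequence)):
        gain.append(sequence[k - 1] - gain[k - 1])
    # Walk backward: either side may decline to recapture (minimax)
    for k in range(len(gain) - 1, 0, -1):
        gain[k - 1] = -max(-gain[k - 1], gain[k])
    return gain[0]

print(see(100, [900], [100]))       # QxP met by pxQ: -800
print(see(300, [500, 325], [325]))  # RxN, BxR, BxB nets +125
```

Real implementations find attackers with bitboards and handle x-rays (a rook hiding behind another rook), but the backward minimax over the gain list is the core of every SEE.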
Critical file structure at this level:
```
src/
  board.cpp/h     (1,000 lines: bitboards, move gen, FEN parsing)
  search.cpp/h    (1,500 lines: alpha-beta, pruning, extensions, TT)
  evaluate.cpp/h  (1,200 lines: handcrafted evaluation)
  movegen.cpp/h   (500 lines: move ordering, sorting)
  uci.cpp/h       (300 lines: UCI protocol interface)
  main.cpp        (100 lines: main loop)
```
Key decision at 5000 lines: You've optimized hand-crafted evaluation as far as it goes. To go further, you either:
- Fork an open-source NNUE (adds ~1000 lines for integration), or
- Spend months training your own neural network on billions of positions
Most competitive hobbyist engines stop here or move to Tier 3c.
Tier 3c: An Elite Engine (~10,000+ lines, 3000+ Elo)
Goal: Grandmaster territory; competitive in engine championships; serious technical achievement.
What you add to Tier 3b:
- NNUE integration (the pivotal decision):
  - Use an existing trained NNUE (most open-source engines do this)
  - OR train your own on self-play data using PyTorch
  - Integrate int8 NNUE evaluation (~500 lines)
- Advanced search techniques:
  - Razoring: near quiescence, if static eval plus a margin still can't reach alpha, drop straight into qsearch (worth ~10-20 Elo)
  - Internal iterative reduction: if no transposition-table move is available, reduce depth slightly
  - ProbCut: at shallow depths, use a wider evaluation margin to predict cutoffs
- Time management II:
  - Adaptive time allocation (give more time to critical moves)
  - Early-stop detection (stop searching when the position is decided)
  - Contempt factor (slight bias toward fighting positions over draws)
- Endgame improvements:
  - Syzygy tablebases (6-man endgame databases; adds ~10 Elo)
  - 7-man tablebases where storage allows (perfect play)
  - Bitbase knowledge (e.g., K+P vs K endgame rules)
- Evaluation refinements (hand-crafted component paired with NNUE):
  - Piece-square table adjustments by material count
  - King safety asymmetry (attacker bonus)
  - Tempo bonus (now largely learned by NNUE, but explicit bonuses can help)
- Distributed testing infrastructure (if you're serious):
  - Hook into OpenBench or Fishtest
  - Enable crowd-sourced testing of your patches
Search algorithm: Principal Variation Search (PVS)—negamax-based alpha-beta with zero-width re-searches—with all the optimizations above
Threading: 4-16 threads with lazy SMP (shared TT, split search at root)
Expected Elo: 3000-3500 (elite level; competitive in engine tournaments)
Time to implement: 10,000-15,000 lines of production-quality C++
Representative code structure at this level:
```
src/
  board.cpp/h      (1,200 lines)
  search.cpp/h     (2,500 lines: alpha-beta, all pruning techniques, TT)
  evaluate.cpp/h   (800 lines: NNUE integration + handcrafted bonuses)
  nnue.cpp/h       (800 lines: int8 quantization, accumulator updates)
  movegen.cpp/h    (600 lines)
  uci.cpp/h        (400 lines: UCI options, time management)
  tt.cpp/h         (300 lines: transposition table)
  bitboards.cpp/h  (200 lines: magic bitboards, attacks)
  endgame.cpp/h    (300 lines: tablebases, special rules)
  main.cpp         (100 lines)
tests/             (2,000 lines: perft, position tests, regression suite)
```
NNUE Integration Example:
```cpp
// Simplified sketch of an NNUE forward pass (illustrative, not real engine code)
class NNUE {
public:
    int16_t accumulator[2][1024];  // first-layer outputs, one per perspective
    int32_t hidden[8];             // small hidden layer

    void update_accumulator(Move m) {
        // Incrementally add/subtract the weight columns of the 2-4
        // features the move changed instead of recomputing the whole
        // first layer. Most of NNUE's speed advantage comes from here.
    }

    int evaluate(Move last_move) {
        update_accumulator(last_move);
        // Forward pass through the tiny remaining network, using only
        // integer arithmetic (int8/int16 weights, shifts instead of floats)
        for (int i = 0; i < 8; i++)
            hidden[i] = dot_product(w1[i], accumulator);  // simplified
        int32_t output = 0;
        for (int i = 0; i < 8; i++)
            output += hidden[i] * w2[i];
        return output >> 7;  // rescale to centipawns
    }
};
```
The training pipeline (if you're training your own NNUE):
```
Generate self-play games using your engine
        ↓
Score positions at depth 8-12
        ↓
Collect 1+ billion positions
        ↓
PyTorch training loop (SGD, L1 loss, quantization-aware training)
        ↓
Export to ONNX or a custom int8 format
        ↓
Integrate into the engine
        ↓
Test on Fishtest/OpenBench
        ↓
If it passes SPRT: merge. If it fails: discard.
```
Maintenance burden at 10K lines:
- A comprehensive test suite (perft, mate-in-N puzzles, position regression tests)
- Profiling to identify bottlenecks (search takes ~95% of time; move gen ~2%)
- Continuous tuning: 50+ parameters (LMR reductions, null move depth limits, eval weights) can be optimized with texel tuning or local search
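Texel tuning itself is simple enough to sketch end-to-end on synthetic data (the three "material" features, the hidden true weights, and the fixed-step search policy here are all made up):

```python
# Texel tuning sketch: fit evaluation weights so that sigmoid(eval in
# centipawns) matches observed game results, by minimizing squared error.
import random

random.seed(1)
K = 1.13  # scaling constant; in practice K itself is fitted first

def sigmoid(cp):
    return 1 / (1 + 10 ** (-K * cp / 400))

TRUE_W = [100, 300, 500]  # hidden weights that generate the "results"
positions = [[random.randint(-2, 2) for _ in TRUE_W] for _ in range(2000)]
results = [sigmoid(sum(w * f for w, f in zip(TRUE_W, p))) for p in positions]

def error(weights):
    total = 0.0
    for p, r in zip(positions, results):
        predicted = sigmoid(sum(w * f for w, f in zip(weights, p)))
        total += (r - predicted) ** 2
    return total / len(positions)

# Coordinate descent: nudge each weight +/-5 cp, keep changes that help
weights = [95, 290, 510]
best = error(weights)
improved = True
while improved:
    improved = False
    for i in range(len(weights)):
        for delta in (5, -5):
            weights[i] += delta
            e = error(weights)
            if e < best:
                best, improved = e, True
            else:
                weights[i] -= delta
print(weights, best)
```

On a real engine the "features" are every evaluation term, the results come from millions of actual games, and each `error` pass is the expensive part—but the loop is exactly this dumb, which is part of its charm.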
Where to start: read an engine like Berserk—far simpler than Stockfish—and study how it structures search and integrates NNUE before writing your own. It's a huge head start.
Code Complexity vs. Elo Gain (Diminishing Returns)
Here's the brutal truth about chess engine development:
| Lines of Code | Elo | Time to Build | Improvement Rate |
|---|---|---|---|
| 300 | 1600 | 1 weekend | +5.3 Elo/line |
| 1,000 | 2000 | 1 week | +0.4 Elo/line |
| 3,000 | 2500 | 2-3 weeks | +0.17 Elo/line |
| 5,000 | 2900 | 1-2 months | +0.08 Elo/line |
| 10,000 | 3300 | 3-6 months | +0.04 Elo/line |
| 50,000 | 3759 | Years | +0.01 Elo/line |
The first few hundred lines are magical—each adds several Elo. By 5,000 lines, you're down to 0.08 Elo per line. The last 40,000 lines of Stockfish are engineering optimization for marginal gains.
This is why starting with 3,000 lines is the sweet spot for learning. You get a genuinely strong engine (beats humans), understand all the major algorithmic concepts (alpha-beta, pruning, TT), and can see your work play real chess. Anything beyond 5,000 lines is specialist territory.
The Real Lesson
The gap between 1800 Elo and 3600 Elo is about 40 years of accumulated engineering wisdom:
- LMR (2004): +100 Elo
- Null move pruning (1980s): +150 Elo
- Singular extensions (2000s): +30 Elo
- Killer heuristic + history (1980s): +50 Elo
- NNUE evaluation (2020): +100 Elo
- Iterative deepening + time management (1980s): +30 Elo
Each innovation compounds. Modern engines are not singularly brilliant; they're the product of thousands of incremental improvements, tested rigorously on billions of games.
If you're building a hobbyist engine, focus on correctness first, optimization second. Get move generation and search right before worrying about evaluation. A simple alpha-beta + null move pruning engine at 1600 Elo is a better learning experience than a buggy, half-finished attempt at NNUE.
If you're aiming for competitive strength, use distributed testing (OpenBench or Fishtest) early. A single developer testing patches locally will trend toward local optima. The open-source ecosystem wins because every improvement is validated against thousands of positions and millions of games.
And remember: Stockfish is 50,000 lines of C++, but the first 1,000 lines of an engine buy roughly 2000 Elo. The remaining 49,000 bought the last ~1700.
Further Reading
- Chessprogramming.org: The comprehensive wiki on engine implementation
- Stockfish on GitHub: The reference implementation; readable and well-organized
- CCRL / TCEC: Live ratings and tournament results
- Fishtest: The testing framework that powers Stockfish development
- AlphaZero's paper (Silver et al., 2017): Why neural networks can work without handcrafted evaluation
Happy coding. And may your engine beat your friends.