How We Index 342 Million Chess Positions for Millisecond Lookups

#go #lichess #opensource #database

Disco Chess is a chess tactics trainer built on the Woodpecker Method - solving the same puzzles repeatedly until pattern recognition becomes automatic. One feature scans your games from Lichess and Chess.com to find positions where you missed a winning move, then drills you on those patterns until they stick.

To find missed tactics, we need Stockfish evaluations for every position. At depth 36, that's seconds per position. A 40-move game has 80+ positions. The compute costs were adding up fast.

Lichess publishes their entire evaluation database monthly - 342 million positions already analyzed. If we could look up positions fast enough, we'd only need Stockfish for positions that aren't already there.
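The lookup-first idea can be sketched in a few lines of Go. This is a minimal illustration, not the Disco Chess implementation: the `Eval` struct, the `LookupFunc` and `AnalyseFunc` types, and the `Evaluate` helper are all hypothetical names, and the engine call is left abstract.

```go
package main

// Eval is a hypothetical, simplified evaluation record.
type Eval struct {
	Centipawns int
	Depth      int
}

// LookupFunc checks the precomputed database; ok is false on a miss.
type LookupFunc func(fen string) (eval Eval, ok bool)

// AnalyseFunc runs the engine on a position (the slow, costly path).
type AnalyseFunc func(fen string) (Eval, error)

// Evaluate prefers the database and falls back to the engine only on a miss.
func Evaluate(fen string, lookup LookupFunc, analyse AnalyseFunc) (Eval, error) {
	if eval, ok := lookup(fen); ok {
		return eval, nil
	}
	return analyse(fen)
}
```

The engine is only invoked when the database has no entry, so compute cost scales with the miss rate rather than with the number of positions in a game.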

But how do you query 342 million positions in milliseconds?

The Insight: Material Rarely Changes

In a typical chess game, captures are relatively infrequent. You might go 8-10 half-moves between captures - that's 8-10 consecutive positions with identical material.

If we group positions by material configuration, consecutive positions in a game land in the same group. Load that group once, and the next several lookups are cache hits.

We call these groups "shards." Each shard contains all positions with a specific material signature. The 342 million positions fall into 8,359 shards.
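One way to derive a material signature is to count the pieces in the placement field of a FEN string. The sketch below assumes positions are available as FEN. The `MaterialSignature` name and the key format (for example `K1Q1R2B2N2P8k1q1r2b2n2p8`) are illustrative choices, not necessarily what Disco Chess uses.

```go
package main

import (
	"strconv"
	"strings"
)

// pieceOrder fixes the order of piece letters in the key so that equal
// material always yields an identical string. Uppercase is White.
const pieceOrder = "KQRBNPkqrbnp"

// MaterialSignature derives a hypothetical shard key from a FEN string by
// counting each piece type in the piece-placement field.
func MaterialSignature(fen string) string {
	// The placement field is everything before the first space.
	placement := fen
	if i := strings.IndexByte(fen, ' '); i >= 0 {
		placement = fen[:i]
	}

	// Digits (empty squares) and '/' (rank separators) are skipped.
	counts := make(map[rune]int)
	for _, c := range placement {
		if strings.ContainsRune(pieceOrder, c) {
			counts[c]++
		}
	}

	var b strings.Builder
	for _, p := range pieceOrder {
		if n := counts[p]; n > 0 {
			b.WriteRune(p)
			b.WriteString(strconv.Itoa(n))
		}
	}
	return b.String()
}
```

Under this scheme the starting position and the position after 1. e4 share a key, while any capture (or promotion) produces a different one.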

Proof: 10,000 Magnus Carlsen Games

We ran 10,268 Magnus Carlsen games (873,418 positions) through a cache simulator:

Strategy	Cache Hit Rate (LRU-100)
Material-based	93.42%
Hash-based	2.00%

Material-based sharding wins by 91 percentage points.

Material changes on average every 8.7 positions. With a 100-shard LRU cache, we hit 93% of the time. Hash-based sharding scatters consecutive positions randomly - destroying locality entirely.
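A cache simulation of this kind can be approximated with a small LRU replay. The sketch below is an assumption about how such a simulator might look, not the code behind the numbers above: `LRU`, `Touch`, and `HitRate` are hypothetical names, and the input is one shard key per position in game order.

```go
package main

import "container/list"

// LRU is a fixed-capacity least-recently-used set of shard keys.
type LRU struct {
	capacity int
	order    *list.List // front = most recently used
	items    map[string]*list.Element
}

// NewLRU returns an empty cache holding at most capacity shard keys.
func NewLRU(capacity int) *LRU {
	return &LRU{
		capacity: capacity,
		order:    list.New(),
		items:    make(map[string]*list.Element),
	}
}

// Touch records an access to key and reports whether it was already cached.
// On a miss the key is inserted and the least recently used key is evicted
// if the cache is over capacity.
func (c *LRU) Touch(key string) bool {
	if el, ok := c.items[key]; ok {
		c.order.MoveToFront(el)
		return true
	}
	c.items[key] = c.order.PushFront(key)
	if c.order.Len() > c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(string))
	}
	return false
}

// HitRate replays a sequence of shard keys and returns the fraction of
// lookups that were served from the cache.
func HitRate(shardKeys []string, capacity int) float64 {
	if len(shardKeys) == 0 {
		return 0
	}
	cache := NewLRU(capacity)
	hits := 0
	for _, k := range shardKeys {
		if cache.Touch(k) {
			hits++
		}
	}
	return float64(hits) / float64(len(shardKeys))
}
```

Feeding the same position sequence through `HitRate` twice, once keyed by material signature and once keyed by a hash of the position, gives a like-for-like comparison of the two sharding strategies at a given cache size.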
