DEV Community

Nicolette Rankin

How We Built a 162K-Word Database on Cloudflare Workers + D1

We needed a dictionary that runs at the edge. Not a wrapper around an existing API — a full database with definitions, etymology, translations, pronunciation data, and age-appropriate explanations, all served from SQLite co-located with the compute.

Here's how we built it.

The Requirements

Our product (thedailylesson.com) serves 62,000 lesson segments in 11 languages and 3 age groups. Every lesson needs vocabulary: definitions, translations, pronunciation, and explanations calibrated for the learner's age.

We needed:

  • 162,250 words with verified definitions and etymology
  • 846,000 translation rows across 47 languages with native script
  • 240,000 audio files for pronunciation
  • Sub-50ms response globally
  • No cold starts — educational applications can't tolerate 500ms+ delays

Why D1 (SQLite at Edge)

The first decision was the database. Options:

| Option | Pros | Cons |
| --- | --- | --- |
| Postgres (Neon, Supabase) | Familiar SQL, rich features | Connection pooling overhead, cold starts, network round-trip |
| D1 (SQLite at edge) | Co-located with Workers, no connection pooling, fast reads | Write latency, 10GB limit, newer service |
| KV | Global edge reads | No SQL, key-value only, eventual consistency |
| Durable Objects | Strong consistency | Overkill for read-heavy reference data |

D1 won because our workload is overwhelmingly reads. A dictionary is reference data — we write once (or rarely), read millions of times. SQLite excels at this. No connection pool to manage. No cold-start penalty. The database is literally sitting next to the Worker.

Schema Design

Three tables:

CREATE TABLE words (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  word TEXT UNIQUE NOT NULL,
  def TEXT,
  ipa TEXT,
  pos TEXT,
  etymology TEXT,
  tone_child TEXT,
  tone_teen TEXT,
  tone_adult TEXT,
  tone_elder TEXT,
  teaching_archetypes TEXT,
  lookups INTEGER DEFAULT 0,
  created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE translations (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  word_id INTEGER REFERENCES words(id),
  lang TEXT NOT NULL,
  translated_word TEXT,
  native_script TEXT,
  phonetic TEXT,
  UNIQUE(word_id, lang)
);

CREATE TABLE ethics (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  word_id INTEGER REFERENCES words(id),
  category TEXT,
  detail TEXT,
  alternatives TEXT
);

The translations table uses a composite UNIQUE constraint on (word_id, lang). SQLite backs that constraint with an automatic index, so fetching one word's translation in one language is a single index seek rather than a table scan.
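You can confirm the constraint is doing this work with EXPLAIN QUERY PLAN. The index name below is SQLite's auto-generated one for the UNIQUE constraint, so the exact name may differ:

```sql
-- Look up one word's translation in one language
EXPLAIN QUERY PLAN
SELECT translated_word, native_script, phonetic
FROM translations
WHERE word_id = ? AND lang = ?;
-- Should report a SEARCH using the automatic index, e.g.
--   SEARCH translations USING INDEX sqlite_autoindex_translations_1
--     (word_id=? AND lang=?)
-- rather than a SCAN of the 846K-row table.
```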

The Data Pipeline

Getting 162,250 words into D1 required a bulk insert strategy. D1 doesn't have COPY or bulk import like Postgres. We wrote a batch inserter:

  1. Generate word data in batches of 50 (definitions, etymology, translations, tones)
  2. POST each batch to an internal endpoint on the Worker
  3. The Worker inserts into D1 using a transaction
  4. Rate-limit to avoid D1's write throughput limits

The full pipeline took several days to run, ultimately producing 846,000 translation rows across 47 languages (not every word is translated into every language). Each word also gets four age-appropriate tone variants and an etymology.
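The client side of those four steps can be sketched like this. The batch size of 50 comes from the text; the delay value is arbitrary, and `postBatch` is injected so the internal endpoint URL stays out of the example:

```javascript
// Split an array into fixed-size batches.
function chunk(items, size) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// POST each batch, pausing between batches to stay under D1's
// write-throughput limits. `postBatch` is the transport (in our
// case, a fetch to the Worker's internal endpoint).
async function insertBatches(words, postBatch, delayMs = 250) {
  for (const batch of chunk(words, 50)) {
    await postBatch(batch);
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
}
```

On the Worker side, each received batch can be turned into an array of prepared statements and handed to `env.DB.batch()`, which D1 executes as a single transaction.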

Audio Storage in R2

240,000 pronunciation files went into R2 (Cloudflare's S3-compatible object storage). We organized them by path:

/audio/pronunciation/{word}.mp3
/audio/kelly_says/{word}.mp3

The upload pipeline used a parallel worker script that:

  1. Reads audio files from the local filesystem
  2. PUTs each file to R2 via the Worker's internal upload endpoint
  3. Retries on transient R2 errors (we hit 13 across 240K files)

Total upload time: ~189 minutes for 240,297 files. R2 was remarkably stable.
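The retry logic can be as small as a wrapper with exponential backoff. The attempt count and delay constants here are arbitrary, and `fn` is whatever your upload call is:

```javascript
// Retry an async operation on failure with exponential backoff.
// Used for transient R2 upload errors during the bulk upload.
async function withRetry(fn, attempts = 3, baseDelayMs = 500) {
  let lastError;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (i < attempts - 1) {
        // Back off: 500ms, 1000ms, 2000ms, ...
        await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** i));
      }
    }
  }
  throw lastError;
}
```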

Serving is simple — the Worker reads from R2 and returns the file with appropriate Content-Type and Cache-Control headers. Cloudflare's CDN caches hot audio files at the edge automatically.
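A sketch of that serving path. The R2 binding name `AUDIO` and the one-year cache lifetime are assumptions for illustration, not the production values:

```javascript
// Map a key to a Content-Type: everything in this bucket is .mp3,
// but the fallback keeps unknown keys safe.
function contentTypeFor(key) {
  return key.endsWith(".mp3") ? "audio/mpeg" : "application/octet-stream";
}

// Worker-side handler: read the object from R2 and return it with
// cache headers so Cloudflare's CDN keeps hot files at the edge.
// In a real Worker this would be the module's `fetch` export.
async function handleAudioRequest(request, env) {
  const key = new URL(request.url).pathname.slice(1); // strip leading "/"
  const object = await env.AUDIO.get(key);
  if (object === null) return new Response("Not found", { status: 404 });
  return new Response(object.body, {
    headers: {
      "Content-Type": contentTypeFor(key),
      // Pronunciation audio never changes, so cache aggressively.
      "Cache-Control": "public, max-age=31536000, immutable",
    },
  });
}
```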

Cache-Miss Generation

The most interesting architectural decision: what happens when someone requests a word that isn't in the database?

We generate it on demand:

// Pseudocode: generateWord() and insertTranslations() elided
async function handleWord(word, env) {
  // Try D1 first
  const row = await env.DB
    .prepare("SELECT * FROM words WHERE word = ?")
    .bind(word)
    .first();
  if (row) return row;

  // Cache miss: generate on demand
  const g = await generateWord(word);

  // Store permanently in D1 (columns from the schema above)
  await env.DB
    .prepare(
      `INSERT INTO words (word, def, ipa, pos, etymology,
         tone_child, tone_teen, tone_adult, tone_elder)
       VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)`
    )
    .bind(g.word, g.def, g.ipa, g.pos, g.etymology,
          g.tone_child, g.tone_teen, g.tone_adult, g.tone_elder)
    .run();
  await insertTranslations(env, g.translations);

  return g;
}

First request: 2-4 seconds (generation + D1 write). Every subsequent request: ~3ms (D1 read from edge). The dictionary grows organically.

Performance Results

After deployment:

| Metric | Value |
| --- | --- |
| p50 response time | 3ms |
| p99 response time | 48ms |
| Total words | 162,250 |
| Total translations | 846,742 |
| Audio files | 240,297 |
| Edge locations | 300+ |
| Uptime | 99.99% |

The 48ms p99 tail comes from cache misses that fall through to D1's read replicas. For a dictionary API, this is more than acceptable.

MCP Server

We exposed the API as an MCP server so AI agents can use it natively:

{
  "mcpServers": {
    "word-orb": {
      "url": "https://word-orb-api.nicoletterankin.workers.dev/mcp"
    }
  }
}

This works with Claude Desktop, ChatGPT, Cursor, Gemini, and any MCP-compatible client. The agent can look up words, get translations, and access pronunciation audio through natural language.

Lessons Learned

  1. D1 is excellent for read-heavy reference data. If your data changes infrequently and you need global edge reads, it's the right choice.

  2. R2 bulk uploads are reliable. 240K files with only 13 transient errors is impressive. Just build retry logic.

  3. Cache-miss generation is powerful. Start with a large seed database, then let it grow. Users get fast responses for known words and acceptable latency for new ones.

  4. Audio files belong in object storage, not databases. This seems obvious, but the temptation to base64-encode audio into a database column is real. Don't.

  5. Composite indexes matter for translations. Without UNIQUE(word_id, lang), translation lookups would scan the full 846K-row table.

Try It

Free tier: 50 calls/day, no credit card.


Built by Lesson of the Day PBC. Powers 62,000 lesson segments across 11 languages and 3 age groups.
