DEV Community

Nicolette Rankin

How We Built a 162K-Word Database on Cloudflare Workers + D1

We needed a dictionary that runs at the edge. Not a wrapper around an existing API — a full database with definitions, etymology, translations, pronunciation data, and age-appropriate explanations, all served from SQLite co-located with the compute.

Here's how we built it.

The Requirements

Our product (thedailylesson.com) serves 62,000 lesson segments in 11 languages and 3 age groups. Every lesson needs vocabulary: definitions, translations, pronunciation, and explanations calibrated for the learner's age.

We needed:

  • 162,250 words with verified definitions and etymology
  • 846,000 translation rows across 47 languages with native script
  • 240,000 audio files for pronunciation
  • Sub-50ms response globally
  • No cold starts — educational applications can't tolerate 500ms+ delays

Why D1 (SQLite at Edge)

The first decision was the database. Options:

| Option | Pros | Cons |
| --- | --- | --- |
| Postgres (Neon, Supabase) | Familiar SQL, rich features | Connection pooling overhead, cold starts, network round-trip |
| D1 (SQLite at edge) | Co-located with Workers, no connection pooling, fast reads | Write latency, 10GB limit, newer service |
| KV | Global edge reads | No SQL, key-value only, eventual consistency |
| Durable Objects | Strong consistency | Overkill for read-heavy reference data |

D1 won because our workload is overwhelmingly reads. A dictionary is reference data — we write once (or rarely), read millions of times. SQLite excels at this. No connection pool to manage. No cold-start penalty. The database is literally sitting next to the Worker.

Schema Design

Three tables:

CREATE TABLE words (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  word TEXT UNIQUE NOT NULL,
  def TEXT,
  ipa TEXT,
  pos TEXT,
  etymology TEXT,
  tone_child TEXT,
  tone_teen TEXT,
  tone_adult TEXT,
  tone_elder TEXT,
  teaching_archetypes TEXT,
  lookups INTEGER DEFAULT 0,
  created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE translations (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  word_id INTEGER REFERENCES words(id),
  lang TEXT NOT NULL,
  translated_word TEXT,
  native_script TEXT,
  phonetic TEXT,
  UNIQUE(word_id, lang)
);

CREATE TABLE ethics (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  word_id INTEGER REFERENCES words(id),
  category TEXT,
  detail TEXT,
  alternatives TEXT
);

The translations table uses a composite UNIQUE constraint on (word_id, lang). SQLite backs that constraint with an automatic index, so fetching one word's translation in one language is a single index seek rather than a table scan.
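You can confirm the constraint is doing this work with EXPLAIN QUERY PLAN. The index name below is SQLite's auto-generated one for the UNIQUE constraint, so the exact name may differ:

```sql
-- Look up one word's translation in one language
EXPLAIN QUERY PLAN
SELECT translated_word, native_script, phonetic
FROM translations
WHERE word_id = ? AND lang = ?;
-- Should report a SEARCH using the automatic index, e.g.
--   SEARCH translations USING INDEX sqlite_autoindex_translations_1
--     (word_id=? AND lang=?)
-- rather than a SCAN of the 846K-row table.
```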

The Data Pipeline

Getting 162,250 words into D1 required a bulk insert strategy. D1 doesn't have COPY or bulk import like Postgres. We wrote a batch inserter:

  1. Generate word data in batches of 50 (definitions, etymology, translations, tones)
  2. POST each batch to an internal endpoint on the Worker
  3. The Worker inserts into D1 using a transaction
  4. Rate-limit to avoid D1's write throughput limits

The full pipeline took several days to run, ultimately producing 846,000 translation rows across 47 languages (not every word is translated into every language). Each word also gets four age-appropriate tone variants and an etymology.
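The client side of those four steps can be sketched like this. The batch size of 50 comes from the text; the delay value is arbitrary, and `postBatch` is injected so the internal endpoint URL stays out of the example:

```javascript
// Split an array into fixed-size batches.
function chunk(items, size) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// POST each batch, pausing between batches to stay under D1's
// write-throughput limits. `postBatch` is the transport (in our
// case, a fetch to the Worker's internal endpoint).
async function insertBatches(words, postBatch, delayMs = 250) {
  for (const batch of chunk(words, 50)) {
    await postBatch(batch);
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
}
```

On the Worker side, each received batch can be turned into an array of prepared statements and handed to `env.DB.batch()`, which D1 executes as a single transaction.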

Audio Storage in R2

240,000 pronunciation files went into R2 (Cloudflare's S3-compatible object storage). We organized them by path:

/audio/pronunciation/{word}.mp3
/audio/kelly_says/{word}.mp3

The upload pipeline used a parallel worker script that:

  1. Reads audio files from the local filesystem
  2. PUTs each file to R2 via the Worker's internal upload endpoint
  3. Retries on transient R2 errors (we hit 13 across 240K files)

Total upload time: ~189 minutes for 240,297 files. R2 was remarkably stable.
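The retry logic can be as small as a wrapper with exponential backoff. The attempt count and delay constants here are arbitrary, and `fn` is whatever your upload call is:

```javascript
// Retry an async operation on failure with exponential backoff.
// Used for transient R2 upload errors during the bulk upload.
async function withRetry(fn, attempts = 3, baseDelayMs = 500) {
  let lastError;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (i < attempts - 1) {
        // Back off: 500ms, 1000ms, 2000ms, ...
        await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** i));
      }
    }
  }
  throw lastError;
}
```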

Serving is simple — the Worker reads from R2 and returns the file with appropriate Content-Type and Cache-Control headers. Cloudflare's CDN caches hot audio files at the edge automatically.
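A sketch of that serving path. The R2 binding name `AUDIO` and the one-year cache lifetime are assumptions for illustration, not the production values:

```javascript
// Map a key to a Content-Type: everything in this bucket is .mp3,
// but the fallback keeps unknown keys safe.
function contentTypeFor(key) {
  return key.endsWith(".mp3") ? "audio/mpeg" : "application/octet-stream";
}

// Worker-side handler: read the object from R2 and return it with
// cache headers so Cloudflare's CDN keeps hot files at the edge.
// In a real Worker this would be the module's `fetch` export.
async function handleAudioRequest(request, env) {
  const key = new URL(request.url).pathname.slice(1); // strip leading "/"
  const object = await env.AUDIO.get(key);
  if (object === null) return new Response("Not found", { status: 404 });
  return new Response(object.body, {
    headers: {
      "Content-Type": contentTypeFor(key),
      // Pronunciation audio never changes, so cache aggressively.
      "Cache-Control": "public, max-age=31536000, immutable",
    },
  });
}
```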

Cache-Miss Generation

The most interesting architectural decision: what happens when someone requests a word that isn't in the database?

We generate it on demand:

// Pseudocode: generateWord() and insertTranslations() elided
async function handleWord(word, env) {
  // Try D1 first
  const row = await env.DB
    .prepare("SELECT * FROM words WHERE word = ?")
    .bind(word)
    .first();
  if (row) return row;

  // Cache miss: generate on demand
  const g = await generateWord(word);

  // Store permanently in D1 (columns from the schema above)
  await env.DB
    .prepare(
      `INSERT INTO words (word, def, ipa, pos, etymology,
         tone_child, tone_teen, tone_adult, tone_elder)
       VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)`
    )
    .bind(g.word, g.def, g.ipa, g.pos, g.etymology,
          g.tone_child, g.tone_teen, g.tone_adult, g.tone_elder)
    .run();
  await insertTranslations(env, g.translations);

  return g;
}

First request: 2-4 seconds (generation + D1 write). Every subsequent request: ~3ms (D1 read from edge). The dictionary grows organically.

Performance Results

After deployment:

| Metric | Value |
| --- | --- |
| p50 response time | 3ms |
| p99 response time | 48ms |
| Total words | 162,250 |
| Total translations | 846,742 |
| Audio files | 240,297 |
| Edge locations | 300+ |
| Uptime | 99.99% |

The 48ms p99 tail comes from cache misses that fall through to D1's read replicas. For a dictionary API, this is more than acceptable.

MCP Server

We exposed the API as an MCP server so AI agents can use it natively:

{
  "mcpServers": {
    "word-orb": {
      "url": "https://word-orb-api.nicoletterankin.workers.dev/mcp"
    }
  }
}

This works with Claude Desktop, ChatGPT, Cursor, Gemini, and any MCP-compatible client. The agent can look up words, get translations, and access pronunciation audio through natural language.

Lessons Learned

  1. D1 is excellent for read-heavy reference data. If your data changes infrequently and you need global edge reads, it's the right choice.

  2. R2 bulk uploads are reliable. 240K files with only 13 transient errors is impressive. Just build retry logic.

  3. Cache-miss generation is powerful. Start with a large seed database, then let it grow. Users get fast responses for known words and acceptable latency for new ones.

  4. Audio files belong in object storage, not databases. This seems obvious, but the temptation to base64-encode audio into a database column is real. Don't.

  5. Composite indexes matter for translations. Without UNIQUE(word_id, lang), translation lookups would scan the full 846K-row table.

Try It

Free tier: 50 calls/day, no credit card.


Built by Lesson of the Day PBC. Powers 62,000 lesson segments across 11 languages and 3 age groups.
