SEN LLC

Posted on Jun 10

A Quiz Is Only as Good as Its Wrong Answers — Building Algorithmic Distractor Generation for a Japanese Idiom Quiz

#javascript #algorithms #testing #frontend

When you build a 4-choice quiz app, the laziest part of most implementations is how the wrong choices (distractors) are picked: random sampling from the answer pool. Random distractors are beatable by elimination — if the question is about an effort-related idiom and only one choice mentions effort, you don't need to know the answer. A good question has plausible wrong answers. I built a Japanese four-character-idiom (yojijukugo) quiz with distractor generation scored on two axes: shared kanji (visual confusability) and same semantic category (meaning confusability). Plus: injectable RNG for deterministic tests, and a statistical test that "hard mode is actually harder."

🌐 Demo: https://sen.ltd/portfolio/yojijukugo/
📦 GitHub: https://github.com/sen-ltd/yojijukugo

Why random distractors fail

The naive implementation:

// bad
const distractors = shuffle(pool.filter(x => x !== answer)).slice(0, 3);

generates questions like:

What does 切磋琢磨 (sessa-takuma) mean?
a. To strive and improve together through friendly competition ← correct
b. The beauty of nature's scenery ← 花鳥風月
c. Indecisiveness, inability to commit ← 優柔不断
d. Surrounded by enemies on all sides ← 四面楚歌

If you vaguely remember that 切磋琢磨 is "an effort word," choices b, c, and d all sit in different semantic territory, so you can eliminate your way to the answer. The question ends up measuring how well you read the spread of choices, not whether you know the idiom.

Two axes of confusability

Humans confuse idioms in two ways:

Visual: they share kanji. 一期一会 and 一蓮托生, for example, both start with 一. This is especially potent in reading-quiz mode.
Semantic: they belong to the same meaning category. 切磋琢磨 and 臥薪嘗胆 are both about effort.

Score both:

export function sharedKanjiCount(a, b) {
  const setA = new Set(a);
  const setB = new Set(b);
  let n = 0;
  for (const ch of setA) {
    if (setB.has(ch)) n++;
  }
  return n;
}

export function rankDistractors(answer, pool = IDIOMS) {
  const others = pool.filter((x) => x.word !== answer.word);

  const withScore = others.map((x) => {
    const kanji = sharedKanjiCount(answer.word, x.word);
    const sameCategory = x.category === answer.category ? 1 : 0;
    // Shared kanji dominates: 10 pts per kanji. Same category: 1 pt.
    return { entry: x, score: kanji * 10 + sameCategory };
  });

  withScore.sort((a, b) => b.score - a.score);
  return withScore;
}

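The 10:1 weighting means even one shared kanji outranks a category match: a candidate with any shared kanji scores at least 10, while a category-only match scores 1. The assumption is that visually similar idioms are easier to confuse than semantically similar ones, so category acts only as the tiebreaker.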
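To make the weighting concrete, here is a minimal sketch that scores a small hand-picked pool against 急転直下. The pool and its category labels ("change", "directness", "nature") are assumptions chosen for illustration, not entries copied from the real dataset. The scoring formula is the one shown above.

```javascript
// Same helper as above, repeated so the sketch runs on its own.
function sharedKanjiCount(a, b) {
  const setB = new Set(b);
  let n = 0;
  for (const ch of new Set(a)) {
    if (setB.has(ch)) n++;
  }
  return n;
}

// Hypothetical entries: the category labels are assumptions for illustration.
const answer = { word: "急転直下", category: "change" };
const samplePool = [
  { word: "花鳥風月", category: "nature" },     // shares nothing
  { word: "栄枯盛衰", category: "change" },     // same category only
  { word: "単刀直入", category: "directness" }, // shares 直
  { word: "有為転変", category: "change" },     // shares 転 and the category
];

const scored = samplePool
  .map((x) => ({
    word: x.word,
    score:
      sharedKanjiCount(answer.word, x.word) * 10 +
      (x.category === answer.category ? 1 : 0),
  }))
  .sort((a, b) => b.score - a.score);

for (const s of scored) console.log(s.word, s.score);
// 有為転変 11
// 単刀直入 10
// 栄枯盛衰 1
// 花鳥風月 0
```

The category-only match (1) lands below both kanji matches (10 and 11), which is the ordering the 10:1 weighting is meant to guarantee.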

The dataset tags each idiom with a category:

export const IDIOMS = [
  { word: "切磋琢磨", reading: "せっさたくま", meaning: "...", category: "effort" },
  { word: "臥薪嘗胆", reading: "がしんしょうたん", meaning: "...", category: "effort" },
  { word: "花鳥風月", reading: "かちょうふうげつ", meaning: "...", category: "nature" },
  // 79 entries across 8 categories
];

Sampling: 3 from the top 6

Taking the top 3 ranked distractors verbatim means a question always shows identical choices. For replayability, sample 3 from the top 6:

if (difficulty === "hard") {
  const top = ranked.slice(0, 6).map((r) => r.entry);
  picked = shuffle(top, rng).slice(0, 3);
} else {
  // easy: random sample from the bottom half
  const bottom = ranked.slice(Math.floor(ranked.length / 2)).map((r) => r.entry);
  picked = shuffle(bottom, rng).slice(0, 3);
}

Easy mode samples from the bottom half of the ranking, which is mostly idioms that share neither kanji nor category with the answer. Difficulty IS the sampling range.
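As a rough illustration of what those two ranges look like in index terms, the sketch below applies the same slicing to a list of 78 rank positions (79 idioms minus the answer). The helper name candidateWindow is hypothetical and not part of the repo.

```javascript
// Hypothetical helper: the slice of the ranked list a difficulty samples from.
function candidateWindow(ranked, difficulty) {
  return difficulty === "hard"
    ? ranked.slice(0, 6)
    : ranked.slice(Math.floor(ranked.length / 2));
}

// Stand-in for the ranked list: just the rank positions 0..77.
const rankPositions = Array.from({ length: 78 }, (_, i) => i);

const hardWindow = candidateWindow(rankPositions, "hard");
const easyWindow = candidateWindow(rankPositions, "easy");

console.log(hardWindow.length, hardWindow[0], hardWindow[hardWindow.length - 1]);
// 6 0 5
console.log(easyWindow.length, easyWindow[0], easyWindow[easyWindow.length - 1]);
// 39 39 77
```

With this slicing the two windows only stay disjoint when there are at least 12 candidates. With a smaller custom pool, the top 6 and the bottom half would overlap and the modes would start sharing distractors.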

A real hard-mode question this generates:

How do you read 急転直下?
a. ういてんぺん ← 有為転変 (shares 転)
b. きゅうてんちょっか ← correct
c. たんとうちょくにゅう ← 単刀直入 (shares 直)
d. しんきいってん ← 心機一転 (shares 転)

Every distractor shares a kanji with the answer. Random sampling would essentially never produce this density.
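To put a rough number on "essentially never": if random sampling draws 3 distractors from 78 candidates and only k of them share a kanji with the answer, the chance that all three do is C(k, 3) / C(78, 3). The value k = 6 below is an assumed figure for illustration, not a measurement from the dataset.

```javascript
// Binomial coefficient C(n, k) for small non-negative integers.
function choose(n, k) {
  if (k < 0 || k > n) return 0;
  let result = 1;
  for (let i = 1; i <= k; i++) {
    result = (result * (n - k + i)) / i;
  }
  return result;
}

const candidates = 78; // 79 idioms minus the answer
const sharing = 6;     // assumed count of candidates sharing a kanji

const pAllShare = choose(sharing, 3) / choose(candidates, 3);

console.log(choose(sharing, 3), choose(candidates, 3)); // 20 76076
console.log((pAllShare * 100).toFixed(3) + "%");        // 0.026%
```

Under that assumption, a random draw produces an all-shared-kanji question about once in 3,800 tries, while the top-6 sampling produces one every time.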

Injectable RNG for deterministic tests

Quiz generation is inherently random; tests shouldn't be. Don't call Math.random directly — inject:

// Mulberry32 — tiny deterministic PRNG
export function mulberry32(seed) {
  let s = seed >>> 0;
  return function () {
    s = (s + 0x6d2b79f5) >>> 0;
    let t = s;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

export function shuffle(arr, rng = Math.random) { /* Fisher-Yates with rng */ }

test("deterministic with same seed", () => {
  const q1 = generateQuestion(answer, { rng: mulberry32(5) });
  const q2 = generateQuestion(answer, { rng: mulberry32(5) });
  assert.deepEqual(q1.choices.map((c) => c.key), q2.choices.map((c) => c.key));
});
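The shuffle body is elided above. A minimal sketch of what it might look like follows, assuming a standard Fisher-Yates pass that works on a copy and draws every index from the injected rng. Whether the real implementation copies or shuffles in place is an assumption on my part.

```javascript
// Mulberry32 as above, repeated so the sketch is self-contained.
function mulberry32(seed) {
  let s = seed >>> 0;
  return function () {
    s = (s + 0x6d2b79f5) >>> 0;
    let t = s;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Assumed body: Fisher-Yates on a copy, so the caller's array is untouched.
function shuffle(arr, rng = Math.random) {
  const out = [...arr];
  for (let i = out.length - 1; i > 0; i--) {
    // rng() is in [0, 1), so j is an integer in [0, i].
    const j = Math.floor(rng() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}

const items = ["a", "b", "c", "d", "e", "f"];
const first = shuffle(items, mulberry32(5));
const second = shuffle(items, mulberry32(5));

console.log(first.join("") === second.join("")); // true: same seed, same order
console.log(items.join(""));                     // abcdef: input is not mutated
```

The important detail is that the function never reaches for Math.random internally once an rng is passed in. A single stray call would make the seeded tests flaky.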

Statistically testing "hard is harder"

My favorite test in the suite. How do you assert that hard mode actually picks harder distractors, when both modes are random? Average over many seeds:

test("hard difficulty picks higher-ranked distractors than easy", () => {
  const ranked = rankDistractors(answer).map((r) => r.entry.word);
  const rankOf = (word) => ranked.indexOf(word);

  let hardSum = 0, easySum = 0, n = 0;
  for (let seed = 0; seed < 30; seed++) {
    const qh = generateQuestion(answer, { difficulty: "hard", rng: mulberry32(seed) });
    const qe = generateQuestion(answer, { difficulty: "easy", rng: mulberry32(seed + 1000) });
    for (const c of qh.choices.filter((c) => !c.correct)) hardSum += rankOf(c.key);
    for (const c of qe.choices.filter((c) => !c.correct)) easySum += rankOf(c.key);
    n++;
  }
  assert.ok(hardSum / n < easySum / n); // lower rank index = harder
});

A single seed could coincidentally produce a hard-looking easy question. Thirty seeds make the property statistical rather than anecdotal. This pattern generalizes to any randomness-bearing code.
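One way to reuse the pattern is to factor the seed loop into a helper. The sketch below uses a hypothetical meanOverSeeds function and two toy samplers whose ranges deliberately overlap (0..9 and 5..14), since overlap is exactly the case where a single draw can mislead and an average cannot easily do so. None of these names come from the repo.

```javascript
// Mulberry32 as above, repeated so the sketch is self-contained.
function mulberry32(seed) {
  let s = seed >>> 0;
  return function () {
    s = (s + 0x6d2b79f5) >>> 0;
    let t = s;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical helper: average one numeric measurement over fixed seeds.
function meanOverSeeds(measure, seedCount, seedOffset = 0) {
  let sum = 0;
  for (let seed = 0; seed < seedCount; seed++) {
    sum += measure(mulberry32(seed + seedOffset));
  }
  return sum / seedCount;
}

// Toy stand-ins for two modes, each returning a rank index.
const drawHard = (rng) => Math.floor(rng() * 10);     // 0..9
const drawEasy = (rng) => 5 + Math.floor(rng() * 10); // 5..14, overlaps 5..9

const hardMean = meanOverSeeds(drawHard, 30);
const easyMean = meanOverSeeds(drawEasy, 30, 1000);

console.log(hardMean, easyMean);
console.log(hardMean < easyMean);
```

Because the seeds are fixed, the test is still fully deterministic: it either passes on every run or fails on every run. The averaging only protects against choosing an unlucky seed when writing the test.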

Data integrity tests caught real bugs

The 79-entry dataset is embedded code, so it gets integrity tests:

test("every word is exactly 4 chars", () => {
  for (const x of IDIOMS) {
    assert.equal([...x.word].length, 4, `${x.word} is not 4 kanji`);
  }
});

test("no duplicate words", () => {
  const words = IDIOMS.map((x) => x.word);
  const dupes = words.filter((w, i) => words.indexOf(w) !== i);
  assert.deepEqual(dupes, []);
});

test("readings are pure hiragana", () => {
  for (const x of IDIOMS) {
    assert.match(x.reading, /^[ぁ-ゖー]+$/);
  }
});

These caught three real bugs in my own data entry: a Cyrillic string fragment had crept into one idiom (четы面楚歌 instead of 四面楚歌), an English word into another (right唯々諾々), and one idiom was registered twice. The 4-char check and duplicate check flagged all three on first run. If you embed data in code, ship integrity tests with it.

Note [...x.word].length instead of x.word.length — String.length counts UTF-16 code units, and rare kanji outside the BMP (like 𠮟) are surrogate pairs. Spreading counts code points.
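A quick check of the difference, using 𠮟 (U+20B9F) written as an escape so the code point is unambiguous:

```javascript
const rare = "\u{20B9F}"; // 𠮟, outside the BMP

console.log(rare.length);      // 2 (UTF-16 code units)
console.log([...rare].length); // 1 (code points)

// 𠮟咤激励: four kanji, the first one stored as a surrogate pair.
const word = "\u{20B9F}咤激励";

console.log(word.length);      // 5
console.log([...word].length); // 4
```

With x.word.length, the 4-char test would reject this idiom even though it is a valid four-kanji entry.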

Architecture

data.js  ← 79 idioms ({word, reading, meaning, category})
quiz.js  ← distractor ranking, question generation, scoring (DOM-free, 32 tests)
app.js   ← UI glue

quiz.js never touches the DOM. The UI calls generateQuiz(10, { mode, difficulty }) and renders the choices arrays as buttons.

Try it

Demo: https://sen.ltd/portfolio/yojijukugo/
GitHub: https://github.com/sen-ltd/yojijukugo

Try hard-mode reading questions. When you hit an idiom starting with 一 and all four choices start with "いち...", that despair is the algorithm working.

Takeaways

Distractor quality determines quiz quality. Random wrong answers are beatable by elimination.
Confusability has two axes: visual (shared kanji / shared tokens) and semantic (same category). Score and rank.
Sample from the top-K rather than taking top-3 verbatim — replayability without sacrificing difficulty.
Difficulty = sampling range. Hard pulls from the top of the ranking, easy from the bottom.
Inject your RNG. Deterministic tests for random features.
Test statistical properties over many seeds, not single runs.
Embedded datasets need integrity tests. Mine caught Cyrillic contamination and duplicates on day one.

This is OSS portfolio #259 from SEN LLC (Tokyo). https://sen.ltd/portfolio/
