If you wire an LLM up to "write me 10 multiple-choice questions about photosynthesis," you'll get something that looks great in the demo and falls apart the moment a real student uses it. I've been building a quiz generator for a while now, and almost all of the actual engineering went into things that have nothing to do with generating the question text. Here are the problems that turned out to matter.
1. The distractors are the whole game. A multiple-choice question is only as good as its wrong answers. If the three distractors are obviously wrong — different category, absurd, or grammatically mismatched with the stem — the student picks the right answer by elimination without knowing anything. Good distractors have to be plausible: common misconceptions, adjacent concepts, the answer to a slightly different question. Getting an LLM to produce genuinely tempting wrong answers (rather than filler) is much harder than getting it to produce the correct one, and it's where most generated quizzes quietly fail.
2. Answer leakage. LLMs love to give the game away. The correct option is subtly longer, or more precisely worded, or the only one that's grammatically consistent with the stem. Humans pick up on these tells unconsciously. You have to actively normalize option length and phrasing so the correct answer doesn't stand out, and check that no distinctive word from the stem shows up in the correct option but in none of the distractors.
3. Exactly one defensibly correct answer. This is the bug that erodes trust fastest. The model writes a question where two options are arguably correct, or where the "correct" one is wrong on a technicality. You need a validation pass, ideally a check that runs separately from generation, that confirms the keyed answer is unambiguously right and the others are unambiguously wrong, and discards or regenerates anything that fails.
4. Coverage from messy source material. When the input is a PDF or a wall of pasted notes, the easy failure mode is to generate five questions about the first paragraph and ignore the rest. Useful generation means chunking the source, spreading questions across the material, and skipping boilerplate (headers, references, page furniture) so the quiz reflects the whole document, not just whatever was at the top.
5. Knowing what you're bad at. Fact-dense material — biology, history, vocabulary, definitions — generates well. Multi-step reasoning, math proofs, and anything where the "question" is really an argument generate poorly, and it's more honest to be upfront about that than to emit confident garbage.
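To make the leakage checks from point 2 concrete, here is a minimal Python sketch. It is not the code behind Quiz Maker: the function name `leakage_flags`, the tiny stopword list, and the 1.5x length threshold are all assumptions picked for illustration. It only covers two of the tells (length and stem-word overlap) and does nothing about grammatical mismatch.

```python
import re

# Hypothetical stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "of", "in", "is", "to", "and",
             "what", "which", "for", "by", "are"}


def content_words(text):
    """Return the lowercased alphabetic tokens of `text`, minus stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def leakage_flags(stem, options, correct_index, length_ratio=1.5):
    """Return a list of tells that make the keyed option stand out.

    Assumes at least one distractor. `length_ratio` is an arbitrary
    illustrative threshold, not a tuned value.
    """
    flags = []
    correct = options[correct_index]
    distractors = [o for i, o in enumerate(options) if i != correct_index]

    # Tell 1: the keyed option is much longer than every distractor.
    longest = max(len(d) for d in distractors)
    if len(correct) > length_ratio * longest:
        flags.append("length")

    # Tell 2: a stem word appears in the keyed option but in no distractor.
    distractor_words = set().union(*(content_words(d) for d in distractors))
    echoed = (content_words(stem) & content_words(correct)) - distractor_words
    if echoed:
        flags.append("stem_overlap")

    return flags
```

A question that trips either flag is a candidate for rewriting the options or regenerating. An empty list only means these two tells were not found, not that the question is clean.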
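For the validation pass in point 3, one possible shape is to have an independent solver answer the question several times without seeing the key, and keep the question only if every attempt lands on the keyed option. This is a sketch under assumptions, not a description of how any particular tool does it: `solver` is a hypothetical callable (in practice probably a separate model call) that takes a stem and a list of options and returns an index, and the question dict layout is invented for the example.

```python
from collections import Counter


def key_is_unambiguous(question, solver, votes=3):
    """Return True only if `solver` picks the keyed option on every attempt.

    `question` is assumed to be a dict with "stem", "options", and
    "correct_index". `solver(stem, options)` is a hypothetical callable
    that returns the index of the option it believes is correct.
    """
    picks = Counter(
        solver(question["stem"], question["options"]) for _ in range(votes)
    )
    choice, count = picks.most_common(1)[0]
    # Require a unanimous vote that also matches the key; any split
    # suggests two options are arguably correct.
    return choice == question["correct_index"] and count == votes
```

Anything that returns False gets discarded or regenerated. Agreement between a solver and the key is evidence rather than proof, so this narrows the problem without eliminating it.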
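The coverage idea in point 4 can also be sketched in a few lines. The version below is an illustration under stated assumptions: it splits on blank lines, treats any fragment shorter than `min_words` as boilerplate (a crude stand-in for real header and reference detection), and spreads the question budget across chunks in proportion to their word count using a largest-remainder split. The function names and the 20-word cutoff are hypothetical.

```python
import re


def split_chunks(text, min_words=20):
    """Split on blank lines and drop short fragments.

    Dropping fragments under `min_words` is an assumed heuristic for
    skipping headers, page numbers, and similar page furniture.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text)]
    return [p for p in paragraphs if len(p.split()) >= min_words]


def allocate_questions(chunks, total):
    """Return how many questions to draw from each chunk.

    Counts are proportional to chunk word count and always sum to `total`.
    """
    if not chunks:
        return []
    sizes = [len(c.split()) for c in chunks]
    whole = sum(sizes)
    exact = [total * s / whole for s in sizes]
    counts = [int(x) for x in exact]

    # Hand out the questions lost to rounding, largest fraction first.
    leftover = total - sum(counts)
    by_fraction = sorted(
        range(len(chunks)), key=lambda i: exact[i] - counts[i], reverse=True
    )
    for i in by_fraction[:leftover]:
        counts[i] += 1
    return counts
```

Each chunk is then sent to the generator with its own question count, which keeps the first paragraph from absorbing the whole quiz.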
None of this is glamorous, and none of it shows up in a 30-second demo. But it's the difference between a quiz a student can actually study from and four random options where the wrong ones are obviously wrong. If you want to see how it holds up on your own material, you can paste a topic, notes, or a PDF into Quiz Maker — it's free and there's no signup to generate. I'm genuinely interested in where it still breaks, so if you find a subject where the distractors fall flat, that's useful to know.