# How adaptive testing converges on cert readiness in 25 questions
A well-built adaptive test is binary search for skill level.

You start at the middle of the difficulty range. Get one right,
and the next question shifts harder. Get it wrong, and it drops
back. Each answer halves the uncertainty band around the estimate
of your true skill. By question 15, you know more about a learner
than a fixed 50-question test tells you at the end.
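
Here's that intuition as a minimal sketch. This is the idealized,
noiseless case where every answer is a perfect signal; real engines
handle noise with IRT, which is the next section.

```python
# A minimal sketch of the binary-search intuition, in the idealized,
# noiseless case: every answer perfectly reveals which side of the
# item's difficulty the learner's true skill sits on. Real CAT engines
# model answer noise with IRT rather than literal bisection.

def bisect_skill(true_skill: float, n_questions: int = 15) -> tuple[float, float]:
    lo, hi = 0.0, 100.0                 # full difficulty range
    for _ in range(n_questions):
        difficulty = (lo + hi) / 2      # ask an item at the midpoint of the band
        if true_skill >= difficulty:    # correct: skill is at or above this item
            lo = difficulty
        else:                           # wrong: skill is below this item
            hi = difficulty
    return lo, hi

print(bisect_skill(63.0))  # band is ~0.003 wide after 15 halvings of 0-100
```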

This is computerized adaptive testing (CAT), and it's the most
underused idea in cert prep.

## The IRT engine underneath

Item response theory (IRT) is what makes it work. Every question
in the bank has a calibrated difficulty on a continuous scale.
Every learner has a latent skill value on the same scale. The
algorithm's job is to estimate that value as fast as possible.

After each answer, it does two things:

  1. Updates the point estimate of your skill (Bayesian update, roughly)
  2. Picks the next question whose calibrated difficulty sits closest to the current estimate

That second step is the key. A question far above or below your
current estimate is mostly noise. A question at your current
estimate is maximally informative. The algorithm isn't picking
"the next hard question" or "the next easy question." It's picking
the one most likely to shrink the confidence interval.
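
A minimal sketch of that loop, assuming a one-parameter (Rasch)
IRT model. Under Rasch, item information is p(1 − p), which peaks
when item difficulty equals the ability estimate, so "closest
difficulty" and "most informative" are the same selection rule:

```python
import math

# Sketch of the two-step CAT loop under a Rasch (1PL) model.
# `bank` maps item id -> calibrated difficulty b.

def p_correct(theta: float, b: float) -> float:
    """Probability a learner at ability theta answers an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def pick_next(theta_hat: float, bank: dict[str, float], asked: set[str]) -> str:
    """Most informative unasked item = difficulty closest to the estimate."""
    return min((i for i in bank if i not in asked),
               key=lambda i: abs(bank[i] - theta_hat))

def update(theta_hat: float, b: float, correct: bool, step: float = 0.5) -> float:
    """Crude gradient step toward the likelihood maximum; a production
    engine would do a proper Bayesian (EAP/MAP) update instead."""
    return theta_hat + step * ((1.0 if correct else 0.0) - p_correct(theta_hat, b))
```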

## Why convergence is geometric, not linear

The uncertainty band doesn't shrink uniformly. It shrinks fast
early, then the gains flatten.

After 8 questions, the band is wide but already useful. After 15,
it's narrow enough to act on for most learners. After 25, the
marginal information per question has dropped close to zero.
Asking question 26 is roughly a coin flip on whether it tells
you anything new.
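
One way to see the flattening, under standard IRT assumptions: a
well-targeted Rasch item adds at most 0.25 units of Fisher
information, and the standard error of the estimate falls like one
over the square root of the total.

```python
import math

# Each perfectly targeted Rasch item contributes at most 0.25 units
# of Fisher information; the standard error of the ability estimate
# is roughly 1 / sqrt(total information).
ITEM_INFO = 0.25

for n in (8, 15, 25, 26):
    se = 1.0 / math.sqrt(n * ITEM_INFO)
    print(f"after {n:2d} items: SE = {se:.2f}")

# after  8 items: SE = 0.71
# after 15 items: SE = 0.52
# after 25 items: SE = 0.40
# after 26 items: SE = 0.39   <- question 26 barely moves the band
```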

This is why stopping at 25 isn't a shortcut. It's the point
where continuing would add fatigue and noise, not signal. Fixed
50-question pretests are a holdover from paper testing. They
survived in software because they're easier to build and look
more thorough. They aren't.

## The output is per-domain, not a percentage

Most cert pretests return a single number. 74%. Which tells you
almost nothing useful about what to study next.

A real CAT returns a skill estimate per domain, because failure
modes are domain-specific. AWS SAA-C03 might return:

  • Networking: Proficient (82)
  • Storage: Developing (47)
  • Security: Novice (24)

"Security: Novice" means start the roadmap on Security.
"Networking: Proficient" means one validation milestone, not
six. The prep plan is different for each learner. That's the
whole point.
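
As a sketch, the result can be a plain per-domain map that drives
the roadmap directly. The cutoffs and action names below are
invented for illustration, not from any specific product:

```python
# Hypothetical per-domain result, mirroring the example above.
RESULT = {"Networking": 82, "Storage": 47, "Security": 24}

def band(score: int) -> str:
    if score >= 75:
        return "Proficient"
    if score >= 40:
        return "Developing"
    return "Novice"

# Weakest domain first: that's where the roadmap starts.
for domain, score in sorted(RESULT.items(), key=lambda kv: kv[1]):
    action = ("one validation milestone" if band(score) == "Proficient"
              else "full module sequence")
    print(f"{domain}: {band(score)} ({score}) -> {action}")
```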

## Where it breaks down

Three failure modes worth knowing if you're building something
similar:

Narrow item bank. If the bank doesn't have well-calibrated
items at the high end, the estimate can't push past a ceiling
no matter how well the learner answers. Learners cap at Competent
on a domain they actually own at Proficient. Fix: bank breadth,
tracked per cert.

Intentional gaming. A learner deliberately answers easy items
wrong to see the harder ones. The algorithm obliges, then climbs
back. The estimate converges on the gamed pattern; the model
can't distinguish intent from skill.

Sparse domain coverage. On certs with many domains, the
CAT sometimes stops before sampling all of them. Untouched
domains report as the lowest level by default. Not a failure,
but an absence of signal.
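
A cheap mitigation, sketched below with assumed names and
thresholds: make the stopping rule refuse to fire until every
domain has a minimum number of sampled items.

```python
# Coverage-aware stopping rule: convergence alone can't end the
# test while any domain is undersampled. The threshold and
# parameter names here are assumptions, not from the article.
MIN_ITEMS_PER_DOMAIN = 2

def may_stop(se: float, se_target: float,
             asked_per_domain: dict[str, int], domains: list[str]) -> bool:
    converged = se <= se_target
    covered = all(asked_per_domain.get(d, 0) >= MIN_ITEMS_PER_DOMAIN
                  for d in domains)
    return converged and covered
```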

None of these are unique to CAT. Fixed-length tests have worse
versions of each, and they give up the personalization entirely.


I wrote a longer breakdown of the exact stopping conditions,
the 95% confidence threshold, and how the domain-level output
drives a personalized roadmap in the ClaudeLab docs:
CAT evaluation explained

If you're building something that needs to assess knowledge
quickly and accurately, the 25-question ceiling is worth
understanding before you default to "just make the test longer."
