Oleksandr Gamanyuk for Instafill.ai

Posted on Jun 19

How we put a complexity score on 3,500 government forms (without ever guessing a 'time to fill')

#ai #pdf #formfiller #algorithms

We run Instafill, an AI form-filler with a public catalog of roughly 3,500 fillable forms — IRS, SSA, ACORD, USCIS, the long tail of PDFs that make people groan. At some point a question kept coming back, from users and from us: how hard is this form, actually, before I start?

That sounds like a UX nicety. It turned into a small but genuinely interesting engineering problem, because the answer had to satisfy four constraints at once:

Deterministic — the same form must always produce the same number.
Reproducible by anyone — given the formula and the form, you can recompute it by hand.
Calibrated — the output has to match what a human who fills these forms for a living would say (a W-9 is not as hard as a 603-field insurance application).
Honest — no signal we can't actually defend.

This is the story of the Form Complexity Index (FCI): a 0–100 score we compute for every form in the catalog. I'll cover why we built it the way we did, the actual formula, and the most useful part: the blind spot in v1.0 that only showed up once we'd scored the whole catalog.

Why a number at all

The product reason and the engineering reason point the same way.

The product framing we used internally is "the DrugBank of forms." DrugBank and PubChem give every molecule a set of computed, citable scalars — molar mass, logP, and so on. Those numbers are objective, reproducible, and referenceable: a Wikipedia article can cite the molar mass of caffeine and link back. We wanted the equivalent for forms — a public, per-form page anchored by a unique computed scalar that's stable enough to be cited. The FCI is that scalar. (We publish the full methodology precisely so it can be independently verified — that's the whole point.)

The engineering reason follows directly: if a number is going to be cited, it cannot be a vibe. It can't be a model output that drifts when we retrain, and it can't be hand-assigned. It has to be a pure function of the form's own structure, versioned, and recomputable from the published formula. That constraint is what made the rest of the design fall into place.

The thing we deliberately refused to ship

The obvious feature request is "estimated time to fill: 12 minutes." We don't produce one, on purpose.

A time estimate is a promise you can't keep. It depends on whether you have your data handy, whether you've filled the form before, how fast you type, how many of the conditional branches apply to you. We'd be inventing precision we don't have, and the first user who took 40 minutes on our "12-minute" form would be right to be annoyed. So FCI measures the form, not the person — the structural effort it imposes — and stops there.

The data we already had

Every form in the catalog is a form_templates document with structure we'd already extracted for the actual form-filling pipeline:

fields[] — each with a form_type (text, date, checkbox, signature, …), a group, a page_num, and an ignore flag
dependencies — the "fill only if…" conditional rules between fields
form_sections and field groups — how the form is organized
table_fields — detected tables and repeating-row lists
form_pages_number — total PDF pages
areas — per-widget geometry, which tells us which pages actually carry fields
asText — the OCR/extracted text, handy for scraping the federal OMB control number

No new extraction work. The FCI is a derivation over data we already trusted.

The formula

The index has two parts: a base that captures the raw work, and three modifiers that catch the kinds of disproportionate difficulty the base misses.

FCI = clamp( base + T + I + Y , 0, 100 )

where base = 0.36·F + 0.26·D + 0.15·L + 0.16·C + 0.07·S

The five base components

Each is scaled to 0–100, then weighted.

Component	Weight	What it measures
F — Field load	0.36	Count of fillable fields, log-scaled (anchors 3 → 200)
D — Input difficulty	0.26	Average per-field difficulty by type
L — Length	0.15	Fillable pages; 12+ pages maxes it out
C — Conditional logic	0.16	Share of fields gated by "fill only if…" rules
S — Structure	0.07	Distinct field groups, log-scaled (anchors 1 → 30)

Two design decisions inside the base are worth calling out.

Field counts are log-scaled, not linear. The jump from a 5-field form to a 50-field form is a completely different experience. The jump from 450 to 500 fields is… still a slog, but it's the same slog. Linear scaling would let giant forms dominate everything. So counts run through:

def _logn(n, lo, hi):
    if n <= lo:
        return 0.0
    return _clamp(100 * (math.log(n + 1) - math.log(lo + 1))
                       / (math.log(hi + 1) - math.log(lo + 1)))

"Pages" means pages that actually have fields. This one bit us until we got it right. A W-9 is a 6-page PDF — but 5 of those pages are printed instructions. Charging a form as "long" because the government bundled a rulebook with it is just wrong. So Length counts fillable pages, derived from the per-widget areas geometry (every widget knows its 0-indexed page), falling back to field page_num, and only falling back to total PDF pages as a last resort:

def _fillable_pages(form_template):
    areas = form_template.get("areas")
    if isinstance(areas, list) and areas:
        ap = [a["page"] for a in areas
              if isinstance(a, dict) and isinstance(a.get("page"), int)]
        if ap:
            return max(ap) + 1
    # ... fall back to field page_num, then total PDF pages

About 16% of forms bundle instruction-only pages. Getting this wrong would have mis-scored one in six forms in the catalog.

Input-difficulty weights

Not all fields are equal work. A checkbox is a flick; a signature is a commitment; free text is open-ended. Each field type carries a weight, and D is the average across the form:

Field type	Weight	Field type	Weight
Signature	0.9	Dropdown / Radio	0.4
Free text	0.6	Checkbox	0.2
Number / Date / Time	0.5	Button	0.2
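As a rough sketch of how D falls out of that table: the weights below are copied from it, but the type keys (`"text"`, `"checkbox"`, and so on) and the default weight for unknown types are assumptions made for illustration, not the stored `form_type` values.

```python
# Weights from the table above; key names are hypothetical.
DIFF = {
    "signature": 0.9,
    "text": 0.6,
    "number": 0.5, "date": 0.5, "time": 0.5,
    "dropdown": 0.4, "radio": 0.4,
    "checkbox": 0.2, "button": 0.2,
}
DIFF_DEFAULT = 0.6  # assumption: treat an unknown type like free text

def input_difficulty(types):
    """Average per-field difficulty, scaled to 0-100."""
    if not types:
        return 0.0
    return 100 * sum(DIFF.get(t, DIFF_DEFAULT) for t in types) / len(types)

# The W-9 field mix quoted later in this post: 15 free text, 8 checkboxes
w9_types = ["text"] * 15 + ["checkbox"] * 8
print(round(input_difficulty(w9_types)))
```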

The three modifiers (the v2.0 part)

Each modifier is additive, individually capped, and ~0 for an ordinary form, so a simple form scores at or near its base (the W-9 worked example below picks up only +4, all of it from I), and the modifiers mainly lift forms that are hard along an axis the base can't see.

Modifier	Max	What it catches
T — Tables & lists	+13	Grids of line items — real effort the field count understates
I — Instructions	+9	Pages of rules to read first (`total pages − fillable pages`)
Y — Layout density	+12	Crowded, information-dense pages: `(fields + groups) / fillable page`

Notice that I uses the instruction pages we subtracted out of Length. They don't make the form longer to fill, but a short form bundled with a thick rulebook genuinely is harder to approach — so we measure them separately instead of throwing them away.
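A small sketch of the I modifier and the raw density ratio, under stated assumptions: the divisor of 12 and the 0.10 weight are taken from the W-9 worked example later in this post, the +9 cap from the table above, and the function names are hypothetical. The anchors and weights for T and Y are not given here, so they are left out.

```python
I_DIV = 12       # from the W-9 worked example
I_WEIGHT = 0.10  # from the W-9 worked example
I_CAP = 9        # "Max" column in the modifier table

def instruction_modifier(total_pages, fillable_pages):
    """Points added for instruction-only pages (total minus fillable)."""
    instr_pages = max(0, total_pages - fillable_pages)
    scaled = min(100.0, 100 * instr_pages / I_DIV)
    return min(I_CAP, scaled * I_WEIGHT)

def layout_density(n_fields, n_groups, fillable_pages):
    """Raw density ratio that feeds Y: (fields + groups) per fillable page."""
    return (n_fields + n_groups) / max(1, fillable_pages)

# W-9: 6-page PDF, 1 fillable page, 23 fields, 8 groups
print(round(instruction_modifier(6, 1), 2), layout_density(23, 8, 1))
```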

Here's the whole compute, modifiers and all, excerpted from the actual function:

# base: v1.0 five-factor weighted sum (0-100)
comp = {
    "F": _logn(nf, *F_ANCHOR),
    "D": 100 * sum(DIFF.get(t, DIFF_DEFAULT) for t in types) / nf,
    "L": _clamp(100 * (pages - 1) / L_DIV),
    "C": _clamp(100 * ndeps / nf),
    "S": _logn(sections, *S_ANCHOR),
}
base = sum(WEIGHTS[k] * comp[k] for k in WEIGHTS)

# additive complexity modifiers (each capped; ~0 for simple forms)
comp["T"] = _clamp(_logn(n_tables, *T_ANCHOR) * T_WEIGHT, 0, T_CAP)
comp["I"] = _clamp(100 * instr_pages / I_DIV * I_WEIGHT, 0, I_CAP)
comp["Y"] = _clamp((_logn(density, *Y_ANCHOR) - Y_FLOOR) * Y_WEIGHT, 0, Y_CAP)

score = round(_clamp(base + comp["T"] + comp["I"] + comp["Y"]))

The v1.0 → v2.0 story (the actually interesting part)

We didn't ship the modifiers on day one. v1.0 was just the five-factor base, and it looked great on the forms we tested it against. The bug only showed up when we scored the entire catalog and looked at the distribution.

The five tiers are:

Tier	Score
Simple	0–28
Basic	29–45
Moderate	46–62
Complex	63–79
Very Complex	80–100

Under v1.0, Very Complex had basically nothing in it — 11 forms out of 3,468, ~0.3%. That's not a tier, that's a rounding error. Something was wrong, and it wasn't the forms.
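The tier lookup and the "how full is each tier" check can be sketched in a few lines. The cutoffs come from the tier table above; the function names and the sample scores are hypothetical.

```python
from bisect import bisect_right
from collections import Counter

TIERS = ["Simple", "Basic", "Moderate", "Complex", "Very Complex"]
LOWER_BOUNDS = [29, 46, 63, 80]  # first score of each tier after Simple

def tier_of(score):
    """Map a 0-100 score to its tier name."""
    return TIERS[bisect_right(LOWER_BOUNDS, score)]

def tier_shares(scores):
    """Fraction of the scored population that lands in each tier."""
    counts = Counter(tier_of(s) for s in scores)
    total = len(scores) or 1  # avoid dividing by zero on an empty catalog
    return {t: counts[t] / total for t in TIERS}

# Illustrative scores only, not catalog data
print(tier_shares([10, 39, 39, 90]))
```

Running `tier_shares` over every stored score is the kind of check that exposes a near-empty tier.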

The diagnosis: the base rewards depth — more fields, more pages, more branching. But it treated three very different forms as roughly equal:

a dense grid of 600 line items crammed onto 4 pages,
a thin form stapled to a thick instruction booklet,
and a roomy, well-spaced one-pager.

All three could land in the same band, because the base couldn't see density, tables, or bundled instructions. The hardest forms in the catalog are hard along exactly those axes, and the base was blind to all of them. So the top tier stayed empty while genuinely brutal forms sat in "Complex."

v2.0 fixed it without touching the base — that part was well-calibrated and we didn't want to disturb scores that were already right. We added the three additive modifiers (T, I, Y) that lift only the multi-axis-hard tail, then recalibrated the tier cutoffs against the live distribution instead of against our intuition.

The result, across 3,468 forms:

Tier	v1.0	v2.0
Simple	—	~5%
Basic	—	~36%
Moderate	—	~28%
Complex	—	~23%
Very Complex	~0.3%	~6–7%

Now every tier is meaningfully populated, the median sits around 50, and the forms everyone agrees are nightmares actually land in the top tier.

The lesson, which I keep relearning: you cannot calibrate a population metric on a sample. Our hand-picked test forms (W-9, SSA-44, a few others) all scored sensibly under v1.0 — that's why we shipped it. The blind spot was structurally invisible until the metric met the full population it was meant to describe. Score everything, then look at the shape.

Calibration: anchoring to forms humans already have opinions about

The weights and anchors aren't pulled from the air, but they're also not fit by regression — we have no ground-truth "true complexity" labels, because none exist. Instead we anchored to forms our team and users have strong, consensus intuitions about, and tuned until the tiers matched:

W-9 → Basic (everyone's filled one; it's the definition of "routine")
SSA-44 → Moderate (multi-page, real work, but tractable)
Form 5695 / 8949 → Complex (long, heavily conditional)
ACORD 125 → Very Complex (603 fields — the boss fight)

A standalone calibration harness scores these anchors on every weight change and flags any that drift out of their expected tier, so a tweak that fixes one form can't silently break another.
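A harness along those lines might look like the sketch below. It is not the production harness: `drifted_anchors` and `EXPECTED_TIER` are hypothetical names, and apart from the W-9's 39 the scores are placeholders rather than real FCI values.

```python
from bisect import bisect_right

TIERS = ["Simple", "Basic", "Moderate", "Complex", "Very Complex"]
LOWER_BOUNDS = [29, 46, 63, 80]  # cutoffs from the tier table

def tier_of(score):
    return TIERS[bisect_right(LOWER_BOUNDS, score)]

# Anchor forms and the tiers they are expected to land in
EXPECTED_TIER = {
    "W-9": "Basic",
    "SSA-44": "Moderate",
    "Form 5695": "Complex",
    "Form 8949": "Complex",
    "ACORD 125": "Very Complex",
}

def drifted_anchors(score_form, expected=EXPECTED_TIER):
    """Return {form: (expected_tier, actual_tier)} for anchors that moved."""
    drift = {}
    for form, want in expected.items():
        got = tier_of(score_form(form))
        if got != want:
            drift[form] = (want, got)
    return drift

# Placeholder scores for illustration (only the W-9 value is from this post)
placeholder_scores = {"W-9": 39, "SSA-44": 50, "Form 5695": 70,
                      "Form 8949": 70, "ACORD 125": 90}
print(drifted_anchors(placeholder_scores.get))
```

An empty result means every anchor is still in its expected tier; anything else names the form that a weight change pushed out.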

A worked example: why the W-9 is a 39

The methodology is published, so anyone can recompute this. The W-9 has 23 fillable fields (15 free text, 8 checkboxes), 1 fillable page, 5 instruction pages, 8 groups, 4 conditional fields, no tables. In the table, `norm(n, lo, hi)` is the log scaling implemented by `_logn` above.

Component	Calculation	Value
F	`norm(23, 3, 200)`	46
D	`100 × (15×0.6 + 8×0.2) / 23`	46
L	`100 × (1 − 1) / 11`	0
C	`100 × 4 / 23`	17
S	`norm(8, 1, 30)`	55
base	`0.36·46 + 0.26·46 + 0.15·0 + 0.16·17 + 0.07·55`	35
+ I	`clamp(100 × 5 / 12) × 0.10` — 5 instruction pages	+4
FCI	35 + 4 (T and Y add 0)	39 → Basic

The Length component is 0 — one fillable page — even though it's a 6-page PDF. The 5 instruction pages show up in the I modifier instead, adding a modest +4. That's the fillable-vs-total page distinction doing exactly its job.
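The table can be recomputed end to end with the short script below. It uses only the numbers quoted in this section; the helper names are local to the sketch, and T and Y are taken as 0 for the W-9, as the table states.

```python
import math

def clamp(x, lo=0.0, hi=100.0):
    return max(lo, min(hi, x))

def norm(n, lo, hi):
    """Log scaling between anchors, as in `_logn`."""
    if n <= lo:
        return 0.0
    return clamp(100 * (math.log(n + 1) - math.log(lo + 1))
                 / (math.log(hi + 1) - math.log(lo + 1)))

WEIGHTS = {"F": 0.36, "D": 0.26, "L": 0.15, "C": 0.16, "S": 0.07}

comp = {
    "F": norm(23, 3, 200),                      # 23 fillable fields
    "D": 100 * (15 * 0.6 + 8 * 0.2) / 23,       # 15 free text, 8 checkboxes
    "L": clamp(100 * (1 - 1) / 11),             # 1 fillable page
    "C": clamp(100 * 4 / 23),                   # 4 conditional fields
    "S": norm(8, 1, 30),                        # 8 groups
}
base = sum(WEIGHTS[k] * comp[k] for k in WEIGHTS)

i_mod = min(9, clamp(100 * 5 / 12) * 0.10)      # 5 instruction pages, cap +9
fci = round(clamp(base + i_mod))                # T and Y add 0 for the W-9

print({k: round(v) for k, v in comp.items()}, round(base), fci)
```

The components are summed unrounded here, which lands on the same rounded values as the table.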

The engineering decisions that made it boring to operate

The math is the fun part. The reason it actually ships and stays correct is a handful of unglamorous decisions.

The compute is a pure function — no I/O. compute_form_complexity(form_template) takes a dict and returns a dict. No DB calls, no logging, no clock except the timestamp it stamps at the end. That single property means the identical code path serves the live pipeline hooks and the one-off backfill script. There is no "the batch job computes it slightly differently than production" class of bug, because there's only one implementation.

def compute_form_complexity(form_template):
    """Pure. Return the `fci` sub-document, or None if no fillable fields."""
    fields = [f for f in (form_template.get("fields") or [])
              if isinstance(f, dict) and not f.get("ignore")]
    nf = len(fields)
    if nf == 0:
        return None
    # ... all arithmetic, no side effects

Persistence is best-effort and never raises into the form pipeline. Computing a vanity score must never be able to break the thing users actually came for — filling a form. The persistence wrapper swallows everything:

async def persist_form_complexity(form_template, metadata=None):
    try:
        fci = compute_form_complexity(form_template)
        if fci is None:
            return False
        # lazy imports so the pure compute stays importable in tests/backfill
        # without dragging in Motor + the ES log handler
        from utils.mongodb import update_document
        await update_document(...)
        return True
    except Exception as ex:
        from log.logger import logger
        logger.warning(f"persist_form_complexity failed: {ex}", extra=metadata)
        return False

Those lazy imports matter more than they look: they keep the pure functions importable by the backfill script and unit tests without pulling in the app's database driver and Elasticsearch log handler. The math has no infrastructure dependencies, and we kept it that way at the import level, not just in spirit.
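The "never raises" property is easy to check in isolation. The sketch below is a stand-in, not the wrapper above: `persist_best_effort`, the injected `compute` and `write` callables, and the logger name are all hypothetical, chosen so the block runs without a database.

```python
import asyncio
import logging

logger = logging.getLogger("fci_sketch")  # hypothetical logger name

async def persist_best_effort(form_template, compute, write):
    """Compute and store a score; report failure instead of raising."""
    try:
        fci = compute(form_template)
        if fci is None:
            return False
        await write(fci)
        return True
    except Exception as ex:
        logger.warning("persist failed: %s", ex)
        return False

async def failing_write(fci):
    raise ConnectionError("database unavailable")

async def ok_write(fci):
    return None

def fake_compute(form_template):
    # Stand-in for the pure compute; returns None when there are no fields
    return {"score": 39} if form_template.get("fields") else None

template = {"fields": [{"form_type": "text"}]}
print(asyncio.run(persist_best_effort(template, fake_compute, failing_write)))
```

Whatever the writer does, the caller gets a boolean back and the form pipeline keeps going.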

Every stored score records its METHODOLOGY_VERSION. This is the keystone of "citable." A number written to the DB carries the exact version of the formula that produced it. When we bump the version, the backfill is version-aware and idempotent — it recomputes forms whose stored version is stale and skips the rest, and it doesn't touch the document's modified timestamp (a methodology change is not a content change). A score is always traceable to a specific, reproducible formula. v1.0 and v2.0 are both published, with what changed and why.
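A version-aware, idempotent backfill can be sketched over plain dicts. This is an illustration, not the real script: the `version` key inside the `fci` sub-document, the `modified` field name, and the `backfill` signature are assumptions.

```python
METHODOLOGY_VERSION = "2.0"

def backfill(docs, compute, version=METHODOLOGY_VERSION):
    """Recompute only stale scores; return how many documents changed."""
    updated = 0
    for doc in docs:
        stored = (doc.get("fci") or {}).get("version")
        if stored == version:
            continue  # already current: skip, which makes reruns no-ops
        fci = compute(doc)
        if fci is None:
            continue  # no fillable fields, nothing to store
        # Only the fci sub-document changes; "modified" is left alone
        doc["fci"] = {**fci, "version": version}
        updated += 1
    return updated

docs = [
    {"name": "a", "modified": "2024-01-01", "fci": {"score": 30, "version": "1.0"}},
    {"name": "b", "modified": "2024-02-01", "fci": {"score": 50, "version": "2.0"}},
    {"name": "c", "modified": "2024-03-01"},
]
first = backfill(docs, lambda d: {"score": 42})
second = backfill(docs, lambda d: {"score": 42})
print(first, second)
```

The second run touches nothing, which is what makes the job safe to rerun after a partial failure.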

What I'd tell someone building their own "index"

If you're putting a single defensible number on a population of messy real-world things, the pattern that worked for us:

Refuse the seductive-but-unfalsifiable metric. Our "time to fill" was begging to be built. Not shipping it was the right call.
Make the compute a pure function of data you already trust. It makes the metric reproducible, testable, and identical across batch and live paths — for free.
Calibrate against cases humans already have consensus on, not against your own first guess.
Score the entire population before you trust the tiers. Your sample will lie to you about the shape of the distribution. Ours hid an empty top tier for a whole version.
Version every stored value. The day you want the number to be cited is the day you'll be glad it's traceable to an exact formula.

The FCI now sits on every form page in the catalog as a transparent card — base, each modifier, the final score, and the breakdown — next to the form's standardized identifiers. It's a small number. Getting it to be a trustworthy small number was the whole job.

The full methodology, worked examples, and live catalog distribution are public at instafill.ai/form-complexity-index.

DEV Community