Mehul Jain

Posted on Jun 4

Detecting Q&A Patterns and Heading Trees in Raw HTML

#html #machinelearning #nlp #webscraping

When we started building the Content Structure Analyzer, the first thing we did was throw out the question everyone expects a content tool to answer. We did not want to score whether a page was good content. Good is subjective, it needs a human, and there are a hundred tools that already pretend to measure it. The question we actually cared about was narrower and mechanical: can a model lift an answer out of this page? That one has a yes-or-no shape, which means you can write code that estimates it.

So the design problem became: what observable properties of raw HTML predict whether an extractive model can isolate a clean answer span? We settled on six measurable proxies, each weighted by how much we believe it moves that outcome:

Heading structure, 25 percent. The nested heading tree and its defects.
Content depth, 20 percent. Body word count from paragraph text.
Q&A patterns, 20 percent. Question-form headings, definition lists, FAQ sections.
Semantic HTML, 15 percent. Landmark element coverage.
Lists, 10 percent. Ordered and unordered list count.
Media, 10 percent. Alt text coverage across images.

None of these is a direct measurement of extractability, because you cannot measure that without running every model against every query. They are proxies. The rest of this post is about how each one is computed and where each one is honestly wrong.
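One way to picture how the six weights might combine is a plain weighted sum of per-dimension scores. The sketch below assumes each dimension is scored from 0 to 100 and blended linearly; the dictionary keys and the name `overall_score` are illustrative, not the analyzer's actual identifiers.

```python
# Weights from the list above, written as fractions that sum to 1.0.
WEIGHTS = {
    "heading_structure": 0.25,
    "content_depth": 0.20,
    "qa_patterns": 0.20,
    "semantic_html": 0.15,
    "lists": 0.10,
    "media": 0.10,
}


def overall_score(dimension_scores):
    """Blend per-dimension scores (assumed 0-100) into one weighted total."""
    missing = set(WEIGHTS) - set(dimension_scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(WEIGHTS[name] * dimension_scores[name] for name in WEIGHTS)
```

Under this assumption, a page that is perfect everywhere except for a heading score of zero would land at roughly 75 overall.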

Building the heading tree

Headings come out of the parser as a flat list: an H1, then an H2, then another H2, then an H3, in document order. But a model does not read them as a flat list. It reads them as an outline, where an H3 belongs under the H2 above it. So the first job is turning the flat sequence back into a tree of {level, text, children} nodes.

A stack handles this cleanly. You keep a running stack of open ancestors and, for each heading, pop until the top of the stack is a valid parent:

def build_tree(headings):  # headings: [(level, text), ...] in document order
    root = {"level": 0, "text": None, "children": []}
    stack = [root]
    for level, text in headings:
        node = {"level": level, "text": text, "children": []}
        # pop until the stack top is a shallower heading (the real parent)
        while stack[-1]["level"] >= level:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root

The tree is what makes defect detection easy. Counting H1s is a pass over the root's children, keeping the ones at level 1: every H1 pops the stack back to the root, so that is the only place an H1 can land. Skip detection falls out of comparing each node's level to its parent's: if a node is more than one level deeper than its parent, a level was skipped. An H2 with an H4 child means H3 never appeared, and that gap is exactly where a model loses track of what nests under what.
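The two checks described above can be sketched as a single walk over the tree. This is a minimal version under two assumptions that the post does not settle: each parent-to-child jump counts as one skip no matter how many levels it spans, and the synthetic root (level 0) is exempt, so a page that opens on an H2 shows up as a missing H1 rather than as a skip. The names `find_defects` and `node` are hypothetical.

```python
def find_defects(root):
    """Walk a {level, text, children} tree; report H1 count and skipped jumps."""
    # H1s can only ever be direct children of the synthetic root.
    h1_count = sum(1 for child in root["children"] if child["level"] == 1)

    skips = 0
    pending = [root]
    while pending:
        parent = pending.pop()
        for child in parent["children"]:
            # Assumption: one skip per jump, and the level-0 root is exempt.
            if parent["level"] > 0 and child["level"] - parent["level"] > 1:
                skips += 1
            pending.append(child)
    return {"h1_count": h1_count, "skips": skips}


def node(level, text, *children):
    """Hypothetical helper for building small trees by hand."""
    return {"level": level, "text": text, "children": list(children)}


# H1 > H2 > H4: the H3 level never appears, so there is one skip.
tree = node(0, None, node(1, "Guide", node(2, "Setup", node(4, "Flags"))))
print(find_defects(tree))  # prints {'h1_count': 1, 'skips': 1}
```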

The penalty model is graded, applied to the heading dimension before weighting:

Missing H1: minus 40. The page has no top-level claim at all.
Multiple H1s: minus 20. The claim is split across competing tops.
Each skipped level: minus 15. Every jump is one more broken parent link in the outline.

A missing H1 costs more than multiple H1s on purpose. No H1 means the outline has no root; two H1s means it has two roots, which is confusing but recoverable. Per-skip stacking matters too: a page that jumps H2 to H4 twice has two broken spots, and the score should feel both.
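As a rough sketch, the penalties above reduce to a few lines of arithmetic. The post does not state the starting value or the floor, so this version assumes the heading dimension starts at 100 and never drops below zero; `heading_score` is an illustrative name.

```python
def heading_score(h1_count, skips):
    """Apply the graded heading penalties to an assumed starting score of 100."""
    score = 100  # assumption: the dimension starts at full marks
    if h1_count == 0:
        score -= 40  # missing H1: the outline has no root
    elif h1_count > 1:
        score -= 20  # multiple H1s: competing roots
    score -= 15 * skips  # every skipped level stacks
    return max(0, score)  # assumption: the score is floored at zero
```

With these assumptions, a page with two H1s and two skips scores 50 on headings before the 25 percent weight is applied.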

Question detection heuristics

A heading counts as a question if it ends in a question mark, or if it starts with one of the interrogatives: who, what, why, how, when, which. That is deliberately loose. "How to reset your password" has no question mark but is plainly a question handle, and the starter check catches it. We accept some false positives ("What's New" is not really a question) because under-counting handles hurts more than over-counting them for this signal.
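A minimal sketch of that heuristic follows. It assumes whole-word matching on the first word, so "Whatever works" does not count while "What's New" does; whether the real check matches whole words or raw prefixes is not stated in this post, and `is_question` is a hypothetical name.

```python
import re

INTERROGATIVES = ("who", "what", "why", "how", "when", "which")


def is_question(heading):
    """Return True if a heading ends in '?' or opens with an interrogative."""
    text = heading.strip()
    if text.endswith("?"):
        return True
    # Take the leading run of letters so "What's" still reads as "what".
    first_word = re.match(r"[A-Za-z]+", text)
    return bool(first_word) and first_word.group(0).lower() in INTERROGATIVES
```

The false positive on "What's New" is kept on purpose here, to mirror the trade-off described above.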

Beyond headings, two structural patterns count. Definition lists are counted directly by tallying <dl> elements, since a <dl> is a term-and-definition pairing, which is a question-and-answer in HTML form. FAQ and accordion sections are detected by substring matching on class and id attributes, looking for the tokens sites actually use to mark these blocks. It is a heuristic, not a parser, and it leans on the convention that developers name these regions what they are.
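The two structural checks can be sketched with Python's standard `html.parser`. The token list here (`faq`, `accordion`) is a guess based on the names of the patterns; the tokens the analyzer really matches are not given in this post, and the class name `StructuralQACounter` is hypothetical. Note that this naive version counts every matching element, so a wrapper and its matching children are each tallied.

```python
from html.parser import HTMLParser

# Hypothetical token list; the analyzer's real tokens are not stated.
FAQ_TOKENS = ("faq", "accordion")


class StructuralQACounter(HTMLParser):
    """Count <dl> elements and elements whose class or id hints at an FAQ."""

    def __init__(self):
        super().__init__()
        self.definition_lists = 0
        self.faq_blocks = 0

    def handle_starttag(self, tag, attrs):
        if tag == "dl":
            self.definition_lists += 1
        attributes = dict(attrs)
        # Attribute values can be None (a bare `class`), hence the `or ""`.
        haystack = " ".join(
            (attributes.get(name) or "").lower() for name in ("class", "id")
        )
        if any(token in haystack for token in FAQ_TOKENS):
            self.faq_blocks += 1


counter = StructuralQACounter()
counter.feed('<section id="faq"><dl><dt>Term</dt><dd>Meaning</dd></dl></section>'
             '<div class="Accordion-item">More</div>')
print(counter.definition_lists, counter.faq_blocks)  # prints 1 2
```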

The three signals combine into a single count, banded:

0 found scores 20.
1 to 2 found scores 50.
3 to 5 found scores 75.
6 or more found scores 100.

The bands reward presence, not density. We do not measure how good the answer under each question is, only that the page visibly poses questions and structures answers. Presence is what we can read from markup reliably; quality is what we cannot. Banding by count keeps the signal to the thing the HTML actually tells us.
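The banding itself is a small lookup. A sketch, with `qa_band_score` as an illustrative name and `found` standing for the combined count of question headings, definition lists, and FAQ blocks:

```python
def qa_band_score(found):
    """Map the combined Q&A signal count to its banded score."""
    if found >= 6:
        return 100
    if found >= 3:
        return 75
    if found >= 1:
        return 50
    return 20
```

Because the bands are steps, going from 3 to 5 signals changes nothing; only crossing into the next band moves the score.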

Counting words honestly

Content depth is a word count, and the only interesting decision there is what counts as a word. We count text inside <p> tags and nothing else.

The reason is boilerplate. A page's navigation, footer, cookie banner, and sidebar are full of text, and if you count all of it, a thin article wrapped in a heavy template scores as deep when its actual body is two paragraphs. Restricting to paragraph text dodges almost all of that chrome, because nav links and footer columns are rarely marked up as <p>.

The trade-off is real and we accepted it knowingly: prose that lives in bare <div>s instead of paragraphs is invisible to the count. A page that is substantial but built without <p> tags will score thin. We decided that under-counting div-only pages is the lesser error, because counting boilerplate produces confidently wrong high scores, and a wrong high score is worse than a conservative low one. The bands:

Under 300 words scores 30.
300 to 799 words scores 60.
800 to 1,499 words scores 80.
1,500 words and up scores 100.
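A paragraph-only word count can be sketched with the standard `html.parser`. This is an illustration rather than the analyzer's code: the names are hypothetical, words are split on whitespace, each band is assumed to include its lower bound, and an unclosed <p> is only treated as ended when the next <p> or </p> arrives, which a full HTML parser would handle more carefully.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect only the text that appears inside <p> elements."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False
            self.chunks.append(" ")  # keep adjacent paragraphs from fusing

    def handle_data(self, data):
        if self.in_paragraph:
            self.chunks.append(data)


def count_paragraph_words(html):
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return len("".join(parser.chunks).split())


def depth_band_score(words):
    """Band the word count; lower bounds are assumed inclusive."""
    if words >= 1500:
        return 100
    if words >= 800:
        return 80
    if words >= 300:
        return 60
    return 30


page = ("<nav>Home About Contact</nav><p>One two three.</p>"
        "<div>Bare div prose</div><p>Four <b>five</b></p>")
print(count_paragraph_words(page))  # prints 5
```

The nav links and the bare <div> contribute nothing to the count, which is the trade-off described above in miniature.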

Semantic landmarks and media

Semantic HTML scores landmark coverage. We look for these elements: article, main, section, aside, figure, figcaption, nav. The score is the number of distinct elements from that list found on the page, divided by three, times 100, capped at 100. Three landmarks present is full marks.

Why three and not all seven? The signal we want is "this page marks its regions semantically at all," and a page that uses a main, an article, and a section has clearly made that choice. Requiring all seven would punish a simple page that has no figure or aside to mark, which is most pages. Three is the threshold where the intent is unambiguous, so that is where we cap.
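The landmark formula can be sketched as follows. It assumes the count is of distinct element types rather than occurrences, so ten <section> tags still count once; `LandmarkFinder` and `semantic_score` are illustrative names.

```python
from html.parser import HTMLParser

LANDMARKS = {"article", "main", "section", "aside", "figure", "figcaption", "nav"}


class LandmarkFinder(HTMLParser):
    """Record which of the seven landmark elements appear at least once."""

    def __init__(self):
        super().__init__()
        self.found = set()

    def handle_starttag(self, tag, attrs):
        if tag in LANDMARKS:
            self.found.add(tag)


def semantic_score(html):
    finder = LandmarkFinder()
    finder.feed(html)
    # Distinct landmarks divided by three, times 100, capped at 100.
    return min(100, len(finder.found) * 100 / 3)
```

Under this reading, a page with only <main> and <nav> scores about 67, and any third landmark takes it to full marks.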

Media scores alt text coverage: images with an alt attribute divided by total images. A page with no images at all scores 50, a neutral middle rather than a zero, because no images is not a failure, it is just an absence of signal. Penalizing a text-only page for having no alt text to measure would be measuring nothing.
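A sketch of the media check, assuming the coverage ratio is scaled to 0 to 100 to match the neutral 50. It counts any image that carries an alt attribute, including an empty `alt=""`, since the rule as stated tests for the attribute's presence rather than its content. The names are hypothetical.

```python
from html.parser import HTMLParser


class ImageAltCounter(HTMLParser):
    """Count <img> elements and how many of them carry an alt attribute."""

    def __init__(self):
        super().__init__()
        self.total = 0
        self.with_alt = 0

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.total += 1
            if any(name == "alt" for name, _ in attrs):
                self.with_alt += 1


def media_score(html):
    counter = ImageAltCounter()
    counter.feed(html)
    if counter.total == 0:
        return 50  # no images: neutral middle, not a failure
    return 100 * counter.with_alt / counter.total
```

One consequence of the neutral middle: a text-only page and a page with alt text on exactly half its images get the same media score.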

The known blind spot

The honest limitation, stated in the tool itself: we read server-rendered HTML only. The fetcher pulls the raw response and parses that. It does not run a browser, so it does not execute JavaScript, so any content hydrated on the client is invisible to every check above.

This means a page built as a client-side app, with its real content injected after load, scores as nearly empty: no headings, no paragraphs, no landmarks. That looks like a flaw until you remember what we are estimating. A crawler that reads your raw response without executing JavaScript sees the same emptiness we do. Scoring the un-hydrated HTML is not a defect in the measurement, it is the measurement. We surface it plainly rather than hide it, because a tool that quietly renders JavaScript would report a score that extractors reading the raw response never see. An honest blind spot beats a confident wrong number.

Try it

The Content Structure Analyzer is free and takes a URL, no signup. You get the heading tree, per-category scores, the issues it found, and a short list of recommendations. Once a page's structure is clean, the Schema Generator emits the JSON-LD to label what the page actually is.

Here is the thing worth doing with it. Run it against a page you are sure is well structured, one you wrote carefully and would defend. Then look at the heading tree it builds. The first time we ran it on our own docs, the tree showed two skipped levels and a second H1 we had styled down to look like a subhead. The page looked fine to us and read fine to a human. The tree did not lie. Go find out what yours looks like.

Mehul Jain is an AI entrepreneur and product builder. He works on Geology, a GEO platform.
