Validating JSON-LD Beyond Syntax: Required Properties per schema.org Type

#python #webdev #testing #programming

json.loads returning without an exception tells you exactly one thing: the bytes are well-formed JSON. It tells you nothing about whether the block means anything. An Article block can parse perfectly and have no author. A Product can parse and carry a ratingValue of "4.5", a string where the consumer wants a number. The parser is happy. The AI engine reading the page is not, because the block it just parsed does not assert the fact it exists to assert. Syntax validation is the floor. We built a validator that has to clear it and then keep going.
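To make the gap concrete, here is a minimal sketch. The Product block is a made-up example, not one taken from a real page: it parses without complaint even though its ratingValue is a string.

```python
import json

# Hypothetical Product block: well-formed JSON, but ratingValue is a string.
raw = """
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Trail Shoe",
  "aggregateRating": {"@type": "AggregateRating", "ratingValue": "4.5"}
}
"""

block = json.loads(raw)  # no exception: the syntax layer is satisfied
rating = block["aggregateRating"]["ratingValue"]

parsed_ok = isinstance(block, dict)
rating_is_number = isinstance(rating, (int, float))
print(parsed_ok, rating_is_number)
```

The parse succeeds and the numeric check fails, which is the distance between layer one and layer two.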

So the validator runs three layers, and each one catches a class of failure the layer before it waves through:

Syntax. Does the block parse as JSON at all. Cheap, binary, first.
Type rules. Given the @type, are the properties we treat as required for that type actually present, are the dates real ISO 8601, do the URL fields hold URLs. The "required" set follows Google's rich-results requirements, not the schema.org vocabulary, because that is what governs whether an engine trusts the block.
Content alignment. Does the markup match the page it is on, or is it claiming a headline and author that appear nowhere in the rendered text.

This piece is mostly about layer two, because that is where "valid JSON" and "valid schema" stop being the same sentence.

Encoding type rules as data

The temptation with type rules is to write them as code: a function per type, a branch per property, a growing if @type == "Article" tree. We did not do that. The rules are data, a table keyed by type, each row listing required and recommended properties. Checking a block is a lookup and a set comparison, not a walk through a switch statement. Here is the shape, for three of the types:

Type	Required	Recommended
`Article`	`headline`, `author`	`datePublished`, `image`, `publisher`
`Product`	`name`	`image`, `offers`, `aggregateRating`
`LocalBusiness`	`name`, `address`	`telephone`, `openingHours`, `priceRange`

One thing the table is not: a transcription of the schema.org spec. schema.org itself rarely mandates anything, so our "required" column is an opinionated baseline that follows Google's rich-results guidance, since that is the bar that decides whether an engine trusts the block in practice. The headline and author we mark required on Article come from that guidance and from our own judgment about what an Article block has to assert, not from the vocabulary.

The validator covers 15 common types this way. Alongside the three above, the table holds, among others, BlogPosting and NewsArticle (both, like Article, requiring headline and author), FAQPage, BreadcrumbList, Event, Recipe, VideoObject, HowTo, Organization, WebSite, and WebPage. The payoff of rules-as-data is the maintenance story. Adding a new type means adding a row to the table. It does not mean writing a function, finding the right place to call it, and remembering to wire it into the scorer. The check logic is written once and runs against every row. New type, new row, done.
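A rough sketch of what rules-as-data can look like in Python. The names TYPE_RULES and check_properties are illustrative, not the validator's actual identifiers. Treating empty values as missing is an assumption of this sketch.

```python
# Hypothetical rule table mirroring the three rows shown above.
TYPE_RULES = {
    "Article": {
        "required": {"headline", "author"},
        "recommended": {"datePublished", "image", "publisher"},
    },
    "Product": {
        "required": {"name"},
        "recommended": {"image", "offers", "aggregateRating"},
    },
    "LocalBusiness": {
        "required": {"name", "address"},
        "recommended": {"telephone", "openingHours", "priceRange"},
    },
}


def check_properties(block: dict) -> dict:
    """Return the missing required/recommended properties for one parsed block."""
    block_type = block.get("@type")
    rules = TYPE_RULES.get(block_type) if isinstance(block_type, str) else None
    if rules is None:
        # No row for this type: nothing type-specific to check.
        return {"type": "Unknown", "missing_required": [], "missing_recommended": []}
    # Assumption: an empty string, list, or object counts as absent.
    present = {key for key, value in block.items() if value not in (None, "", [], {})}
    return {
        "type": block_type,
        "missing_required": sorted(rules["required"] - present),
        "missing_recommended": sorted(rules["recommended"] - present),
    }


result = check_properties({"@type": "Article", "headline": "Hello"})
print(result["missing_required"])  # ['author']
```

Supporting another type in this sketch is one more entry in TYPE_RULES; check_properties does not change.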

The sneaky failures

Missing properties are the obvious failures and the easy ones to report. The annoying class is the values that are present, look fine to a human, and are wrong to a parser. Three keep showing up:

ISO 8601 dates. The standard wants 2026-05-18, optionally with a time and offset. People write 05/18/2026, May 18, 2026, 18-05-2026, 2026/05/18. Every one of those is a string a human reads correctly and a strict ISO 8601 parser rejects. So the validator does not just check that datePublished exists; it checks that the value is parseable as ISO 8601, and flags it as a type error when it is not.
Relative URLs in URL fields. An image of /img/header.jpg or a url of ../about is a valid string and a broken URL field. Taken on its own, the block hands an engine a path with no origin to resolve it against. The check is for an absolute http(s) URL, not just a non-empty string.
Numbers shipped as strings. "ratingValue": "4.5", "price": "29.99", "reviewCount": "212". JSON cannot tell you these are wrong, because they are valid JSON strings. The type rule knows the field expects a number and reports the mismatch.

The common thread: none of these is a JSON error. They are all schema errors that pass straight through a syntax check, which is the entire reason layer two exists.
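The three value checks can be sketched with the standard library alone. The function names are hypothetical, and the ISO 8601 check leans on fromisoformat, which accepts the common calendar-date and date-time forms but not every corner of the standard.

```python
from datetime import date, datetime
from urllib.parse import urlparse


def is_iso8601(value) -> bool:
    """True for YYYY-MM-DD or an ISO 8601 date-time string."""
    if not isinstance(value, str):
        return False
    # Older Python versions reject a trailing "Z", so normalize it first.
    candidate = value[:-1] + "+00:00" if value.endswith("Z") else value
    for parser in (date.fromisoformat, datetime.fromisoformat):
        try:
            parser(candidate)
            return True
        except ValueError:
            continue
    return False


def is_absolute_http_url(value) -> bool:
    """True only for http(s) URLs that carry a host."""
    if not isinstance(value, str):
        return False
    try:
        parts = urlparse(value)
    except ValueError:
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def is_number(value) -> bool:
    """True for JSON numbers; bool is excluded because it subclasses int."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


for sample in ["2026-05-18", "05/18/2026", "May 18, 2026", "2026/05/18"]:
    print(sample, is_iso8601(sample))
```

Each helper returns a plain boolean, so the caller decides whether a failure becomes an error or a warning.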

Scoring per block, averaging per page

A pass/fail verdict is too blunt to be useful. A block missing one recommended property is not in the same state as a block that does not parse, and a single number should say so. So each block scores out of 100, split four ways:

Syntax: 40. Valid JSON or zero. The largest slice, because nothing else matters if the block does not parse.
Required-property coverage: 30, scaled. If a type requires four properties and three are present, the block earns 30 times 3/4, which is 22.5.
Recommended-property coverage: 20, scaled. Same coverage math, applied to the recommended set.
Content alignment: 10. Does the markup match the page.

The page score is the average across all blocks on the page. The deliberate choice is to score per block first rather than mashing every block into one global tally. A page with three clean blocks and one gutted one should not read as uniformly mediocre, and it should not read as fine either. Say the clean blocks score 95 each and the gutted one scores 22. Collapse those four into a single page number and you get about 77, which describes none of them: the three good blocks look worse than they are and the broken one disappears into the average. Keep the per-block scores next to the average and the 22 sits beside the three 95s, so you know precisely which block to open.
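The arithmetic is small enough to sketch directly. Three things here are assumptions rather than details from the validator: a block that fails to parse scores zero overall, a type with nothing to check in a category earns that category's full weight, and alignment is all-or-nothing.

```python
from statistics import mean

WEIGHTS = {"syntax": 40, "required": 30, "recommended": 20, "alignment": 10}


def coverage(present: int, total: int) -> float:
    # Assumption: an empty rule set earns full credit.
    return 1.0 if total == 0 else present / total


def score_block(parses, req_present, req_total, rec_present, rec_total, aligned) -> float:
    """Score one block out of 100 using the four weighted slices."""
    if not parses:
        # Assumption: nothing else can be evaluated on unparseable JSON.
        return 0.0
    return (
        WEIGHTS["syntax"]
        + WEIGHTS["required"] * coverage(req_present, req_total)
        + WEIGHTS["recommended"] * coverage(rec_present, rec_total)
        + (WEIGHTS["alignment"] if aligned else 0)
    )


def page_score(block_scores) -> float:
    """Average of the per-block scores; the list itself is kept for reporting."""
    return mean(block_scores) if block_scores else 0.0


# Three of four required properties present, everything else clean.
print(score_block(True, 3, 4, 2, 2, True))  # 92.5
# Three clean blocks and one gutted one.
print(page_score([95, 95, 95, 22]))  # 76.75
```

Returning both the list and its mean is what lets a report show the 22 next to the page-level number.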

LLM-assisted suggestions

Layers one through three are deterministic. On top of them, there is a suggestion layer that is not. Gemini reads the page text alongside the schemas already present and proposes markup the page could carry but does not. The canonical case: the page has a visible FAQ section, real questions with real answers, and no FAQPage block describing it. A rules engine cannot find that, because there is no block to validate; the opportunity lives in the gap between the content and the markup, and spotting it takes reading the prose.

The hard design rule on this layer is that it is best-effort and must never block the validation results. The model call can time out, refuse the request, or come back with something we reject. When it does, the validator still returns the full deterministic output: every block's score, every error, every warning. You lose the suggestions for that run and you keep everything the rules produced. The probabilistic layer is strictly additive. It can fail without taking the rest of the report down with it, and that property is wired in by design rather than hoped for.
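One way to fence the model off, as a sketch. The suggest callable stands in for whatever wraps the Gemini call and is purely hypothetical; the sketch assumes that wrapper enforces its own timeout and raises on failure.

```python
def build_report(block_results, suggest):
    """Attach suggestions when available; never let them break the report."""
    report = {
        "blocks": block_results,  # deterministic output, always returned
        "suggestions": [],
        "suggestions_error": None,
    }
    try:
        suggestions = suggest()
        if not isinstance(suggestions, list):
            raise ValueError("unexpected suggestion payload")
        report["suggestions"] = suggestions
    except Exception as exc:
        # Deliberately broad: any failure in this layer costs only the suggestions.
        report["suggestions_error"] = f"{type(exc).__name__}: {exc}"
    return report


def flaky_suggest():
    # Stand-in for a model call that times out.
    raise TimeoutError("model call timed out")


report = build_report([{"type": "Article", "score": 92.5}], flaky_suggest)
print(report["blocks"], report["suggestions"], report["suggestions_error"])
```

The broad except is the point of the design: the deterministic results are assembled before the model is consulted, so nothing the model does can remove them.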

Multiple blocks and @graph

Real pages do not ship one tidy block. A WordPress site with a couple of plugins active can emit four or five separate <script type="application/ld+json"> tags, one from the theme, one from the SEO plugin, one from a reviews widget. So in URL mode the validator extracts every such tag on the page and scores each independently.
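Extraction of this kind can be sketched with the standard-library HTML parser. The class and function names are illustrative, and the sample page is invented; fetching the URL is left out.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the raw text of every application/ld+json script tag."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.payloads = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            type_attr = (dict(attrs).get("type") or "").strip().lower()
            if type_attr == "application/ld+json":
                self._in_jsonld = True
                self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.payloads.append("".join(self._buffer))
            self._in_jsonld = False


def extract_blocks(html: str) -> list:
    """Return one record per JSON-LD tag, parsed independently of the others."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    blocks = []
    for raw in parser.payloads:
        try:
            blocks.append({"raw": raw, "data": json.loads(raw), "parses": True})
        except json.JSONDecodeError:
            blocks.append({"raw": raw, "data": None, "parses": False})
    return blocks


page = """
<html><head>
<script type="application/ld+json">{"@type": "Organization", "name": "Example"}</script>
<script>console.log("not json-ld")</script>
<script type="application/ld+json">{"@type": "Product", "name": }</script>
</head></html>
"""
blocks = extract_blocks(page)
print([b["parses"] for b in blocks])  # [True, False]
```

Because each tag is parsed in its own try block, one broken plugin's output does not hide the blocks around it.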

The other shape is the @graph bundle: one script tag whose payload is an object with a @graph array holding several entities, an Organization, a WebSite, a WebPage, a BreadcrumbList, all in one block. This is common output from WordPress SEO plugins, which like to ship the whole site graph in a single tag.

Here is an honest limitation. The type rules key off a block's top-level @type. When a tag's top level is a JSON array, we split it and each element becomes its own block. But a @graph bundle is a single object whose top level has no @type of its own, so the validator treats the whole bundle as one block of type Unknown. Unknown types get the syntax and @context checks only; the per-type rules never reach the Organization, WebSite, and the rest sitting inside the @graph. The bundle is checked, but its inner entities are not individually walked.

The workaround is mechanical. Copy a single entity out of the @graph array, paste it into the validator on its own, and it parses as a block with a real top-level @type, so the type rules apply. Pull the Organization node out, run it through paste mode, and you get the required-property check that the bundle does not give you.
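The behavior and the workaround can both be sketched in a few lines. All three function names are hypothetical. Copying the bundle's @context onto each extracted entity is an assumption of this sketch, made so the entity still carries a context when checked on its own.

```python
def split_blocks(payload) -> list:
    """A top-level array becomes one block per element; anything else is one block."""
    return list(payload) if isinstance(payload, list) else [payload]


def block_type(block) -> str:
    """Top-level @type if it is a plain string, else 'Unknown'."""
    declared = block.get("@type") if isinstance(block, dict) else None
    return declared if isinstance(declared, str) else "Unknown"


def graph_entities(bundle) -> list:
    """Workaround helper: lift each entity out of a @graph bundle."""
    graph = bundle.get("@graph") if isinstance(bundle, dict) else None
    if not isinstance(graph, list):
        return []
    context = bundle.get("@context")
    entities = []
    for entity in graph:
        if isinstance(entity, dict):
            # Assumption: carry the bundle's @context over to the lifted entity.
            entities.append({"@context": context, **entity} if context else dict(entity))
    return entities


bundle = {
    "@context": "https://schema.org",
    "@graph": [
        {"@type": "Organization", "name": "Example"},
        {"@type": "WebSite", "url": "https://example.com"},
    ],
}
print(block_type(bundle))  # Unknown
print([block_type(entity) for entity in graph_entities(bundle)])  # ['Organization', 'WebSite']
```

Each lifted entity has a real top-level @type, so it can go through paste mode and pick up the per-type rules the bundle never reached.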

Try it

The Schema Validator runs all of this on a URL or on JSON-LD you paste straight in: syntax, type rules, alignment, score per block, and the best-effort suggestions on top. If a page comes back with no schema to validate at all, generate a typed first draft with the Schema Generator, fill the placeholders, and run the result back through the validator to confirm your edits hold.

The thing worth stealing from how this is built is the layering, and the discipline about which layer is allowed to fail. Syntax, rules, and alignment are deterministic and load-bearing, so they run first and always return. The model sits on top, adds what only reading prose can add, and is fenced off so its failure is a missing section rather than a broken report. Decide up front which checks must never be probabilistic, make those the foundation, and let the clever part be the part you can afford to lose.

Mehul Jain is an AI entrepreneur and product builder. He works on Geology, a GEO platform.