Most recipe pages are not hard because the recipe is complicated. They are hard because the useful data is surrounded by everything else a publishing business needs: ads, modals, autoplay video, SEO prose, social widgets, tracking scripts, and sometimes bot protection.
For RecipeStripper, the product goal is small: paste a public recipe URL and get a clean cooking view. The implementation is not one parser. It is a cascade.
This is the pattern that has held up best in production:
- Fetch the page with the cheapest reliable method.
- Parse the highest-confidence structure first.
- Fall back only when the previous layer cannot return enough recipe data.
- Preserve failure reasons instead of pretending every site works.
Stage 0: fetching is part of parsing
Before a parser can run, the app has to get usable HTML.
RecipeStripper's fetch chain starts with a normal server-side request using browser-like headers. If a server returns a block status or a challenge-looking page, it can fall back to a headless Chromium request with a realistic user agent and a few stealth evasions. If that still returns a challenge page, the final attempt is a Wayback Machine snapshot.
That last step matters because many recipe pages expose stable structured data in archived HTML even when the live site blocks server-side fetches.
The app still does not claim universal support. Some sites, especially PerimeterX-protected properties, are marked as blocked or limited in the Works With directory.
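The fetch chain above can be sketched as an ordered list of fetchers tried until one returns usable HTML. This is an illustrative sketch, not RecipeStripper's actual code: the fetcher names, the `FetchResult` shape, and the challenge heuristic are all assumptions.

```typescript
// Hypothetical sketch of a Stage 0 fetch cascade. Each fetcher returns
// HTML, or null when it was blocked outright.
type FetchResult = { html: string; via: string } | { blocked: true };
type Fetcher = (url: string) => Promise<string | null>;

function looksLikeChallenge(html: string): boolean {
  // Crude, assumed signal: challenge interstitials tend to be short
  // and mention browser verification.
  return (
    html.length < 2000 &&
    /verify you are human|checking your browser/i.test(html)
  );
}

async function fetchWithCascade(
  url: string,
  fetchers: Array<[name: string, fetch: Fetcher]>
): Promise<FetchResult> {
  for (const [name, fetcher] of fetchers) {
    const html = await fetcher(url);
    if (html !== null && !looksLikeChallenge(html)) {
      return { html, via: name }; // first layer with usable HTML wins
    }
  }
  return { blocked: true }; // preserve the failure instead of guessing
}
```

In production the list would correspond to the three layers described above: direct request, headless Chromium, then a Wayback Machine snapshot.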
Stage 1: JSON-LD first
Most modern recipe sites publish Schema.org Recipe data in application/ld+json scripts. That is the best path because it is already structured.
The JSON-LD parser handles a few common shapes:
```typescript
// simplified from lib/parsers/jsonld.ts
const type = obj["@type"];
const isRecipe =
  type === "Recipe" ||
  (Array.isArray(type) && type.includes("Recipe"));
```
It also walks arrays and @graph wrappers, because SEO plugins often place the recipe object inside a graph with breadcrumbs, article metadata, and organization data.
When a page exposes more than one recipe object, RecipeStripper picks the best candidate by matching URL slug words against recipe names, then falls back to the object with the most ingredients.
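The graph-walking and candidate-selection steps can be sketched together. The function names and the exact scoring weights are assumptions for illustration; the strategy (slug-word overlap first, ingredient count as tiebreaker) follows the description above.

```typescript
// Illustrative sketch, not the actual lib/parsers/jsonld.ts: flatten
// top-level arrays and @graph wrappers, then pick the best Recipe node.
function findRecipeNodes(doc: unknown): any[] {
  const nodes = Array.isArray(doc) ? doc : [doc];
  const flat = nodes.flatMap((n: any) =>
    n && Array.isArray(n["@graph"]) ? n["@graph"] : [n]
  );
  return flat.filter((n: any) => {
    const t = n?.["@type"];
    return t === "Recipe" || (Array.isArray(t) && t.includes("Recipe"));
  });
}

function pickBestRecipe(candidates: any[], slug: string): any | null {
  if (candidates.length === 0) return null;
  const slugWords = new Set(
    slug.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean)
  );
  const score = (r: any) => {
    const nameWords = String(r.name ?? "").toLowerCase().split(/[^a-z0-9]+/);
    const overlap = nameWords.filter((w) => slugWords.has(w)).length;
    const ingredients = Array.isArray(r.recipeIngredient)
      ? r.recipeIngredient.length
      : 0;
    // Slug-word overlap dominates; ingredient count breaks ties.
    return overlap * 1000 + ingredients;
  };
  return candidates.slice().sort((a, b) => score(b) - score(a))[0];
}
```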
Stage 2: Microdata still exists
Older sites sometimes use Microdata instead of JSON-LD:
```html
<div itemscope itemtype="https://schema.org/Recipe">
  <h1 itemprop="name">...</h1>
  <li itemprop="recipeIngredient">...</li>
</div>
```
This path is less common, but it is cheap and deterministic. If a page has itemscope and itemprop recipe markup, there is no reason to call a model.
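As a rough illustration of how cheap this path is, a single pass over the markup recovers itemprop values. This is a deliberately crude sketch: real code should use a DOM parser (jsdom or cheerio on the server); a regex works here only for simple, flat markup, and the function name is hypothetical.

```typescript
// Crude microdata sketch for flat markup; a DOM parser is the right
// tool in practice. Collects the text content of elements carrying a
// given itemprop attribute.
function extractItemprops(html: string, prop: string): string[] {
  const re = new RegExp(`itemprop="${prop}"[^>]*>([^<]*)<`, "g");
  const out: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(html)) !== null) {
    out.push(m[1].trim());
  }
  return out;
}
```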
Stage 3: heuristic HTML parsing
When structured data is missing, the parser looks for recipe-shaped HTML.
The heuristic parser searches for known recipe containers, then uses section headings and list patterns:
- headings like "Ingredients", "Instructions", "Directions", or "Method"
- ingredient-looking lines that begin with quantities and units
- ordered or unordered lists inside recipe-like containers
- common WordPress recipe plugin selectors
This is not as clean as Schema.org data, but it catches a lot of hand-built pages and older blogs.
The important guardrail is to accept partial confidence without over-trusting it. RecipeStripper filters out non-instruction junk such as nutrition lines, star-rating prompts, social calls to action, and promotional fragments.
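Two of the signals above can be sketched directly: quantity-led ingredient lines and the junk filter. The specific patterns are illustrative assumptions, not RecipeStripper's actual rules.

```typescript
// Hypothetical Stage 3 heuristics: an ingredient line usually starts
// with a quantity (digits, a fraction, or a unicode fraction glyph),
// optionally followed by a unit.
const QTY_LINE = /^(\d+([\/.]\d+)?|½|⅓|¼|¾)\s*(cups?|tbsp|tsp|g|kg|ml|oz|lbs?)?\b/i;

function looksLikeIngredient(line: string): boolean {
  return QTY_LINE.test(line.trim());
}

// Assumed junk patterns: ratings prompts, social calls to action,
// nutrition lines.
const JUNK = [
  /rate this recipe/i,
  /follow (us|me) on/i,
  /calories|nutrition facts/i,
  /pin it|share this/i,
];

function isInstructionJunk(line: string): boolean {
  return JUNK.some((re) => re.test(line));
}
```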
Stage 4: model fallback, not model first
The GPT-4o-mini fallback only runs when deterministic parsers fail or return a recipe missing ingredients or instructions.
That keeps cost and latency under control, and it avoids turning every request into a hallucination risk. The model receives a cleaned text window, not raw page HTML, and is instructed to return structured JSON or { "found": false }.
The useful rule: models are better as recovery layers than as the first parser.
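The gating rule is the part worth making explicit. A minimal sketch, assuming a `Recipe` shape and a helper name that are not from the actual codebase:

```typescript
// Sketch of the Stage 4 gate: only call the model when deterministic
// parsers failed outright or returned an incomplete recipe.
interface Recipe {
  name?: string;
  ingredients: string[];
  instructions: string[];
}

function needsModelFallback(parsed: Recipe | null): boolean {
  if (parsed === null) return true; // every deterministic layer failed
  return parsed.ingredients.length === 0 || parsed.instructions.length === 0;
}
```

Everything upstream of this check is free of model cost and latency; the model call itself would pass in the cleaned text window described above.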
The second cascade: ingredient-to-step matching
Extraction alone still leaves the classic cookbook layout: ingredients at the top, instructions below.
RecipeStripper's differentiator is inline quantity embedding. After extraction, a matcher links ingredient names to instruction steps. When it is confident, the rendered step can show the quantity where the cook needs it:
"fold in the flour" becomes "fold in 2 cups all-purpose flour"
The internal representation uses a small token format:
{qty:ingredientId:display text}
That lets the renderer highlight matched quantities and lets the servings scaler update both the ingredient list and the inline step amounts.
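A renderer over that token format might look like the following sketch. The `Ingredient` shape and the linear scaling are assumptions based on the description; only the `{qty:ingredientId:display text}` format itself comes from the article.

```typescript
// Illustrative renderer for {qty:ingredientId:display text} tokens.
interface Ingredient {
  id: string;
  qty: number;
  unit: string;
  name: string;
}

const TOKEN = /\{qty:([^:}]+):([^}]*)\}/g;

function renderStep(
  step: string,
  byId: Map<string, Ingredient>,
  scale: number // servings multiplier from the scaler
): string {
  return step.replace(TOKEN, (_match, id: string, fallback: string) => {
    const ing = byId.get(id);
    if (!ing) return fallback; // unmatched token: show the original text
    return `${ing.qty * scale} ${ing.unit} ${ing.name}`;
  });
}
```

Because the ingredient list and the step tokens share the same `qty` values, changing the servings multiplier updates both views from one source of truth.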
Why this pattern works
A cascade has three practical advantages:
- It keeps fast paths fast. JSON-LD extraction is usually enough.
- It keeps fallbacks honest. A blocked site becomes a clear blocked-site error, not a mysterious empty recipe.
- It lets the product improve one layer at a time. Better JSON-LD handling, better heuristics, and better matching all compound.
The current public research dataset is here: Recipe Site Markup Coverage and Extraction Observations 2026. It includes the public site inventory plus anonymized domain-level extraction observations. No submitted recipe URLs or user identifiers are included.
The browser workflow is also being split into smaller surfaces: a bookmarklet and a downloadable Chrome extension package. Both simply open the current recipe URL in the clean reader. They do not inject a widget into someone else's site.
The broader lesson is portable: when the web is inconsistent, build a parser cascade. Put the most trustworthy structure first, keep each fallback narrow, and make failures explicit.