DEV Community

John Rooney for Extract by Zyte


How I get Claude to build HTML parsing code the way I want it

Getting HTML off a page is only the first step. Once you have it, the real work begins: pulling out the data that actually matters — product names, prices, ratings, specifications — in a clean, structured format you can work with.

That's what the parser skill is for. If you haven't read the introduction to skills in our fetcher post, it's worth a quick look first. But the short version is this: a skill is a SKILL.md file that gives Claude precise, reusable instructions for using a specific tool. The parser skill is one of three that together form a complete web scraping pipeline.

claude-webscraping-skills

A collection of Claude skills and other tools to assist your web-scraping needs.

video explanations:

https://youtu.be/HH0Q9OfKLu0
https://youtu.be/P2HhnFRXm-I

Other reading:

https://www.zyte.com/blog/claude-skills-vs-mcp-vs-web-scraping-copilot/
https://www.zyte.com/blog/supercharging-web-scraping-with-claude-skills/


Other Claude tools for web scraping

  1. zyte-fetch-page-content-mcp-server

A Model Context Protocol (MCP) server that runs locally using the Docker Desktop MCP Toolkit and helps you extract clean, LLM-friendly content from any webpage using the Zyte API. Perfect for AI assistants that need to read and understand web content. by Ayan Pahwa

  2. Improve Claude Code WebFetch with Zyte API

When Claude encounters a WebFetch failure, it reads the CLAUDE.md instructions and makes a curl request to the Zyte API endpoint. The API returns base64-encoded HTML, which Claude decodes and processes just like it would with a normal WebFetch response. by Joshua Odmark




What is a skill?

A skill is a small markdown file that tells Claude how to use a specific script or tool — what it does, when to use it, and step-by-step how to run it. Claude reads the file and follows the instructions as part of a broader workflow, with no manual intervention required.

Skills are composable by design. The fetcher skill hands raw HTML to the parser skill, which hands structured JSON to the compare skill. Each one does one job well, and they're built to work together.

The parser skill's front matter sets out its purpose immediately:

---
name: parser
description: "Extracts structured product data from raw HTML. Tries JSON-LD via
  Extruct first, falls back to CSS selectors via Parsel."
---

Two methods, one fallback. That single description line captures the entire logic of the skill.

What the parser skill does

The parser skill takes raw HTML as input and returns a structured JSON object. It uses two extraction methods in sequence, trying the more reliable one first and falling back to the more flexible one if needed.

The primary method uses Extruct to find JSON-LD data embedded in the page. JSON-LD is a structured data format that many modern sites include in their HTML specifically to make their content machine-readable — it's used for search engine optimisation and data portability. When it's present, Extruct can read it cleanly and reliably, with no need to write or maintain selectors.

If no usable JSON-LD is found, the skill falls back to Parsel, which uses CSS selectors to locate data heuristically across the page. This is more flexible but inherently tied to the page's visual structure, which can change.
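The extruct-first, selectors-second sequence is easy to picture in code. Here's a minimal, standard-library-only sketch of that logic — the real skill uses Extruct and Parsel, so the `JsonLdParser` class and `parse_product` function below are illustrative stand-ins, not the skill's actual implementation:

```python
import json
from html.parser import HTMLParser

# Stand-in for the Extruct path: pull JSON-LD out of
# <script type="application/ld+json"> blocks with the stdlib parser.
class JsonLdParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_jsonld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self.in_jsonld = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_jsonld = False

    def handle_data(self, data):
        if self.in_jsonld:
            try:
                self.blocks.append(json.loads(data))
            except json.JSONDecodeError:
                pass  # malformed JSON-LD: fall through to the selector path

def parse_product(html):
    """Try JSON-LD first; report a fallback when no Product block is found."""
    parser = JsonLdParser()
    parser.feed(html)
    products = [b for b in parser.blocks if b.get("@type") == "Product"]
    if products:
        return {"method": "extruct", "data": products[0]}
    # In the real skill, this is where Parsel's heuristic CSS
    # selectors would run; here we just report the miss.
    return {"method": "parsel", "data": {}}

html = """<html><head>
<script type="application/ld+json">
{"@type": "Product", "name": "Widget", "offers": {"price": "9.99"}}
</script></head><body></body></html>"""
result = parse_product(html)
```

Because the JSON-LD block is published independently of the page's markup, this path needs no selectors at all — which is exactly why the skill prefers it.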

When to use it

## When to use
Use this skill when you have raw HTML and need to extract structured data from
it — product details, prices, specs, ratings, or any page content.

In practice, that means the parser skill is almost always the second step in a pipeline — running immediately after the fetcher skill has retrieved your HTML. It works with any page type, and handles the most common product fields out of the box.

How it works

## Instructions
1. Save the HTML to a temporary file `page.html`
2. Run `parser.py` against it:
   python parser.py page.html

3. The script outputs a JSON object. Check the `method` field:
   - "extruct" — clean structured data was found, use it directly
   - "parsel" — fell back to CSS selectors, review fields for completeness
4. If key fields are missing from the Parsel output, ask the user which fields
   they need and re-run with --fields:
   python parser.py page.html --fields "price,rating,brand"

5. Return the parsed JSON to the conversation for use in the Compare skill.
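The steps above can be sketched as a small wrapper. Note that `build_command` and `run_parser` are hypothetical helpers written for illustration — the skill itself just has Claude run `parser.py` directly:

```python
import json
import pathlib
import subprocess
import tempfile
from typing import Optional

def build_command(page_path: str, fields: Optional[str] = None) -> list:
    """Assemble the parser.py invocation from steps 2 and 4."""
    cmd = ["python", "parser.py", page_path]
    if fields:
        cmd += ["--fields", fields]  # step 4: targeted re-run
    return cmd

def run_parser(html: str, fields: Optional[str] = None) -> dict:
    """Steps 1-3: save the HTML to page.html, run parser.py, load its JSON."""
    with tempfile.TemporaryDirectory() as tmp:
        page = pathlib.Path(tmp) / "page.html"
        page.write_text(html, encoding="utf-8")
        out = subprocess.run(build_command(str(page), fields),
                             capture_output=True, text=True, check=True)
        return json.loads(out.stdout)
```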

The method field in the output is particularly useful. It tells you immediately how the data was extracted and how much trust to place in it. An "extruct" result is clean and stable. A "parsel" result is worth reviewing, especially if you're working with an unusual page layout.

The --fields flag is a practical escape hatch. Rather than requiring you to dig into the script when key data is missing, it lets you specify exactly what you need and re-run — a much more efficient loop.
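A consumer of the script's output might branch on that field like so. This is a sketch assuming the output shape described above; the `REQUIRED` set is a hypothetical placeholder for whatever fields your pipeline needs:

```python
import json

# Hypothetical: the fields this pipeline considers mandatory.
REQUIRED = {"name", "price", "rating"}

def review(parser_output: str):
    """Decide what to do with one parser.py result based on its `method` field."""
    result = json.loads(parser_output)
    if result["method"] == "extruct":
        return result["data"], []  # clean JSON-LD: use it directly
    # Parsel fallback: flag any required field the heuristics missed or
    # left empty, so the pipeline can re-run with --fields "price,rating".
    present = {k for k, v in result["data"].items() if v}
    return result["data"], sorted(REQUIRED - present)

output = '{"method": "parsel", "data": {"name": "Widget", "price": null}}'
data, missing = review(output)
```

Here `missing` comes back as `["price", "rating"]`, which maps directly onto the `--fields` argument for the targeted re-run in step 4.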

Why prefer Extruct?

The notes section of the skill file makes this explicit:

## Notes
- Always prefer the Extruct path — it is more stable and requires no maintenance
- Parsel selectors are generated heuristically and may need adjustment for
  unusual page layouts
- Run once per page; pass all outputs together into the Compare skill

Parsel selectors break when sites redesign. JSON-LD, by contrast, is structured data the site publishes independently of its visual layout. A site can completely overhaul its design and its JSON-LD will often remain untouched. That stability is worth prioritising wherever possible.

What comes next

Once you've run the parser skill across all your target pages, you have a set of structured JSON objects ready to compare. That's where the compare skill picks up — generating tables, summaries, and side-by-side analysis from the extracted data.

Do you need a skill?

The parser skill works well when the data you need maps cleanly onto fields that Extruct or Parsel can find — product names, prices, ratings, and similar structured attributes that sites commonly expose through JSON-LD or consistent HTML patterns. For that category of task, the skill is fast to apply and requires no custom code.

For a fuller comparison, see our post on Skills vs MCP vs Web Scraping Copilot (our VS Code extension).

But not every extraction problem fits that mould. If you're working with pages that don't include JSON-LD and have highly irregular layouts, Parsel's heuristic selectors may return incomplete or inconsistent results, and you'll spend time debugging field by field. In those cases, a purpose-built extraction script using Parsel or BeautifulSoup directly — with selectors you've written and tested against the specific target — will be more reliable.

For larger-scale or more complex extraction work, Zyte API's automatic extraction capabilities go further still. Rather than relying on selectors at all, automatic extraction uses AI to identify and return structured data from a page without requiring you to specify fields or maintain selector logic. If you're extracting data from many different site structures, or you need extraction to keep working through site redesigns without manual intervention, that's a more robust foundation than a skill-based approach. The parser skill is best understood as a practical middle ground: fast to use, good enough for a wide range of common cases, and easy to slot into a pipeline — but not a replacement for extraction tooling built for scale or resilience.
