DEV Community

Taisei

Generating scraper logic at runtime instead of writing it per site

pluckmd exists so an agent can pull blog posts into markdown, index them into a wiki, and generate interactive HTML to learn from. This post is about the first step, the part with no per-site code, because the design is the interesting bit.

If you want the practical side, how I actually use it day to day, I wrote that up separately: https://dev.to/taisei_ide/how-i-use-pluckmd-to-read-blogs-with-an-ai-agent-1jpe

pluckmd demo

It downloads articles from a blog without any per-site code. No handler for Medium, no handler for Substack, nothing keyed on a domain. Here's how that works.

The core idea: treat extraction as data, not code.

AdapterSpec

Instead of branching on which site you're on, pluckmd resolves an AdapterSpec. It's a plain object that says which selector finds article links, what the URL pattern looks like, how to pull the article body, and how pagination behaves.

interface AdapterSpec {
  listing:    ListingExtractionSpec;   // how to find article links
  article:    ArticleExtractionSpec;   // how to pull the body
  pagination: PaginationSpec;          // none | scroll | button-click | next-url | auto
  evidence:   string;
}

Because it's data, the same shape can come from a heuristic, an LLM, an agent, or a person typing it by hand. They all produce the same thing, and they all go through the same checks.
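
To make "same shape from any source" concrete, here is a minimal sketch of a spec typed by hand. The fields inside the listing and article specs (linkSelector, urlPattern, bodySelector) are hypothetical names chosen for illustration, pagination is modeled as a simple string union, and evidence is assumed to be a free-text note. pluckmd's real types may differ.

```typescript
// Hypothetical sub-spec fields: the post only names the four top-level keys.
type PaginationSpec = "none" | "scroll" | "button-click" | "next-url" | "auto";

interface ListingExtractionSpec {
  linkSelector: string; // CSS selector that finds article links
  urlPattern: string;   // wildcard path shape, e.g. "/blog/*"
}

interface ArticleExtractionSpec {
  bodySelector: string; // CSS selector for the article body
}

interface AdapterSpec {
  listing: ListingExtractionSpec;
  article: ArticleExtractionSpec;
  pagination: PaginationSpec;
  evidence: string;     // assumed here to be a free-text note
}

// A spec written by hand. A heuristic or an LLM would emit the same shape.
const handWritten: AdapterSpec = {
  listing: { linkSelector: "main a[href^='/blog/']", urlPattern: "/blog/*" },
  article: { bodySelector: "article" },
  pagination: "none",
  evidence: "written by hand after inspecting the page",
};

// Plain data survives a JSON round trip unchanged, which keeps caching simple.
const roundTripped: AdapterSpec = JSON.parse(JSON.stringify(handWritten));
```

Nothing in the object knows where it came from, which is what lets every producer share the same checks.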

Resolving it, cheapest path first

cache  ->  heuristics (local, free)  ->  LLM (only if needed)

Cache first, rechecked against today's DOM so a stale entry can't sneak through. Then local heuristics. The LLM only gets called when the heuristics aren't sure. Every result that works gets written back, so the second run on a site is basically instant.
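
The chain can be sketched roughly like this. It is not pluckmd's actual resolver: the Spec type is a cut-down stand-in, the cache is keyed by a hypothetical site string, and everything is synchronous for brevity, whereas a real LLM call would be async.

```typescript
// Cut-down stand-in for AdapterSpec, just enough to show the control flow.
interface Spec {
  linkSelector: string;
  urlPattern: string;
}

type Resolver = (siteKey: string) => Spec | null; // null means "not sure"
type Validator = (spec: Spec) => boolean;         // recheck against the live page

function resolveSpec(
  siteKey: string,
  cache: Map<string, Spec>,
  resolvers: Resolver[], // ordered cheapest first: heuristics, then LLM
  isValid: Validator,
): Spec | null {
  const cached = cache.get(siteKey);
  if (cached !== undefined) {
    if (isValid(cached)) return cached;
    cache.delete(siteKey); // stale entry: drop it and fall through
  }
  for (const resolve of resolvers) {
    const spec = resolve(siteKey);
    if (spec !== null && isValid(spec)) {
      cache.set(siteKey, spec); // write back so the next run hits the cache
      return spec;
    }
  }
  return null;
}

// Usage: the heuristics give up, the LLM stand-in answers, the second run is cached.
const calls: string[] = [];
const heuristics: Resolver = () => { calls.push("heuristics"); return null; };
const llm: Resolver = () => {
  calls.push("llm");
  return { linkSelector: "main a", urlPattern: "/blog/*" };
};
const cache = new Map<string, Spec>();
const first = resolveSpec("example.com", cache, [heuristics, llm], () => true);
const second = resolveSpec("example.com", cache, [heuristics, llm], () => true);
// calls is ["heuristics", "llm"]: the second lookup never reached a resolver.
```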

How the heuristics find an article list

This part has no idea what site it's looking at. It takes every link, normalizes the path, and collapses the parts that vary into wildcards.

/blog/my-first-post   ->  /blog/*
/blog/another-article ->  /blog/*
/about                ->  /about

Group by that shape. Any group with the same pattern repeated three or more times is a candidate for the article list. Score each candidate by how many links it has, what share of the page they account for, path depth, and whether they sit inside a main content area. The highest score wins.

Numbers and long hashes in a path get treated as variable, so article IDs and dates don't fragment the grouping. The whole thing is structure, never names.
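
Here is a small sketch of the shape-and-group step. The exact wildcard rules are my assumptions for illustration: a segment becomes `*` if it is all digits or a hex string of 8 or more characters, and the final segment of a multi-segment path is always treated as variable. Scoring is left out.

```typescript
const NUMERIC = /^\d+$/;
const LONG_HASH = /^[0-9a-f]{8,}$/i; // assumed threshold for "long"

// Collapse the variable parts of a path into wildcards.
function pathShape(path: string): string {
  const segments = path.split("/").filter((s) => s.length > 0);
  const shaped = segments.map((seg, i) => {
    const isLast = i === segments.length - 1;
    if (NUMERIC.test(seg) || LONG_HASH.test(seg)) return "*";
    if (isLast && segments.length > 1) return "*"; // slug position
    return seg;
  });
  return "/" + shaped.join("/");
}

// Group paths by shape and keep the shapes that repeat often enough.
function candidateShapes(paths: string[], minRepeats = 3): Map<string, string[]> {
  const groups = new Map<string, string[]>();
  for (const p of paths) {
    const shape = pathShape(p);
    const group = groups.get(shape) ?? [];
    group.push(p);
    groups.set(shape, group);
  }
  const candidates = new Map<string, string[]>();
  groups.forEach((members, shape) => {
    if (members.length >= minRepeats) candidates.set(shape, members);
  });
  return candidates;
}

const candidates = candidateShapes([
  "/blog/my-first-post",
  "/blog/another-article",
  "/blog/third-one",
  "/about",
  "/tags/typescript",
]);
// Only "/blog/*" repeats three times, so it is the single candidate.
```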

The validation gate

Here's the rule that makes runtime generation safe to trust. Nothing is used or cached until it proves itself on the live DOM:

  • the link selector matches at least 3 links
  • at least half of those match the URL pattern
  • if it's selector-based body extraction, the body has at least 80 characters

A spec that fails gets dropped. This is what keeps a bad LLM guess from quietly poisoning your output or your cache. Same gate for every source.
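
The three checks translate almost directly into code. This is a sketch using the thresholds above, not pluckmd's implementation: the input shape and the wildcard matcher are hypothetical, and trimming the body before counting characters is my assumption.

```typescript
interface GateInput {
  linkPaths: string[]; // paths of the links the selector matched on the live DOM
  urlPattern: string;  // wildcard shape such as "/blog/*"
  bodyText?: string;   // only set for selector-based body extraction
}

// Segment-wise match where "*" accepts any single segment.
function matchesPattern(path: string, pattern: string): boolean {
  const p = path.split("/").filter((s) => s.length > 0);
  const q = pattern.split("/").filter((s) => s.length > 0);
  return p.length === q.length && q.every((seg, i) => seg === "*" || seg === p[i]);
}

function passesGate(input: GateInput): boolean {
  // 1. The link selector matches at least 3 links.
  if (input.linkPaths.length < 3) return false;
  // 2. At least half of those match the URL pattern.
  const matching = input.linkPaths.filter((p) => matchesPattern(p, input.urlPattern)).length;
  if (matching * 2 < input.linkPaths.length) return false;
  // 3. Selector-based body extraction yields at least 80 characters.
  if (input.bodyText !== undefined && input.bodyText.trim().length < 80) return false;
  return true;
}
```

Because the gate only looks at what the spec produces on the page, it doesn't matter whether the spec came from a heuristic, a model, or a person.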

One page, three ways to get it

A static fetch, a headless Playwright render, and your logged-in Chrome tab are very different beasts. pluckmd puts all three behind one interface, plus a DomEvaluator for live operations like scrolling and clicking next. The link collector and the extractor don't know which backend produced the page they're working on. Adding a fourth source would mean implementing one interface.
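
A rough sketch of what that separation could look like. Every name here is hypothetical (PageSnapshot, PageSource, the DomEvaluator methods); the post only says the backends share one interface. The in-memory backend and the regex-based link collector are stand-ins so the example runs without a browser.

```typescript
interface PageSnapshot {
  url: string;
  html: string;
}

// The one interface each backend would implement (static fetch, Playwright, Chrome tab).
interface PageSource {
  load(url: string): Promise<PageSnapshot>;
}

// Live operations, only meaningful for backends that drive a real browser.
interface DomEvaluator {
  scrollToBottom(): Promise<void>;
  clickNext(selector: string): Promise<boolean>; // false when no button was found
}

// Stand-in backend for demonstration: serves canned HTML from memory.
function inMemorySource(pages: Record<string, string>): PageSource {
  return {
    load(url: string): Promise<PageSnapshot> {
      const html = pages[url];
      return html === undefined
        ? Promise.reject(new Error("no page for " + url))
        : Promise.resolve({ url, html });
    },
  };
}

// The collector sees only a snapshot, so it cannot tell which backend produced it.
// A regex stands in for a real DOM query here.
function collectHrefs(snapshot: PageSnapshot): string[] {
  const hrefs: string[] = [];
  const pattern = /href="([^"]*)"/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(snapshot.html)) !== null) {
    const href = match[1];
    if (href !== undefined) hrefs.push(href);
  }
  return hrefs;
}
```

Under this shape, a new backend is one more function returning a PageSource, and the collector stays untouched.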

The agent escape hatch

When heuristics give up and there's no LLM configured, it doesn't error out. It writes a request file with the page structure and candidate selectors, and a coding agent reads that and produces the spec. You validate and cache it with one command. So even the hardest sites resolve; they just route through the agent instead of an API call.

That's the whole thing. A single data contract that makes the source, the resolver, and the page backend all swappable.

Repo (MIT): https://github.com/taisei-ide-0123/pluckmd

If you've solved generic extraction a different way, I'd genuinely like to hear it. The confidence threshold between "trust the heuristic" and "call the model" is the part I'm least sure about.
