DEV Community: Taisei

Generating scraper logic at runtime instead of writing it per site

Taisei — Wed, 03 Jun 2026 03:49:32 +0000

pluckmd exists so an agent can pull blog posts into markdown, index them into a wiki, and generate interactive HTML to learn from. This post is about the first step, the part with no per-site code, because the design is the interesting bit.

If you want the practical side, how I actually use it day to day, I wrote that up separately: https://dev.to/taisei_ide/how-i-use-pluckmd-to-read-blogs-with-an-ai-agent-1jpe

pluckmd downloads articles from a blog without any per-site code. No handler for Medium, no handler for Substack, nothing keyed on a domain. Here's how that works.

The core idea: treat extraction as data, not code.

AdapterSpec

Instead of branching on which site you're on, pluckmd resolves an AdapterSpec. It's a plain object that says which selector finds article links, what the URL pattern looks like, and how pagination behaves.

interface AdapterSpec {
  listing:    ListingExtractionSpec;   // how to find article links
  article:    ArticleExtractionSpec;   // how to pull the body
  pagination: PaginationSpec;          // none | scroll | button-click | next-url | auto
  evidence:   string;
}

Because it's data, the same shape can come from a heuristic, an LLM, an agent, or a person typing it by hand. They all produce the same thing, and they all go through the same checks.
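
To make that concrete, here is roughly what a hand-written spec could look like. The fields inside listing, article, and pagination are assumptions for illustration; only the four top-level keys come from the interface above, so the real shapes in the repo may differ.

```typescript
// Hypothetical sub-spec shapes. Only the four top-level keys of AdapterSpec
// are taken from the post; everything nested is an assumption.
interface ListingExtractionSpec {
  linkSelector: string; // CSS selector that finds article links
  urlPattern: string;   // wildcard path shape the links should match
}

interface ArticleExtractionSpec {
  bodySelector: string; // CSS selector for the article body
}

type PaginationSpec =
  | { kind: "none" }
  | { kind: "scroll" }
  | { kind: "button-click"; selector: string }
  | { kind: "next-url"; selector: string }
  | { kind: "auto" };

interface AdapterSpec {
  listing: ListingExtractionSpec;
  article: ArticleExtractionSpec;
  pagination: PaginationSpec;
  evidence: string;
}

// The same literal could come from a heuristic, an LLM, an agent, or a person.
const handWritten: AdapterSpec = {
  listing: { linkSelector: "main a[href^='/blog/']", urlPattern: "/blog/*" },
  article: { bodySelector: "article" },
  pagination: { kind: "next-url", selector: "a[rel='next']" },
  evidence: "typed by hand after looking at the listing page",
};

// Because a spec is plain data, it survives a JSON round trip,
// which is what makes caching it on disk straightforward.
const roundTripped: AdapterSpec = JSON.parse(JSON.stringify(handWritten));
```

Nothing in the value says where it came from, and that is the point: the checks downstream only see the data.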

Resolving it, cheapest path first

cache  ->  heuristics (local, free)  ->  LLM (only if needed)

Cache first, rechecked against today's DOM so a stale entry can't sneak through. Then local heuristics. The LLM only gets called when the heuristics aren't sure. Every result that works gets written back, so the second run on a site is basically instant.
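
The order above can be sketched as a small function. This is a simplified illustration, not pluckmd's resolver: the names (resolveSpec, ResolverDeps, siteKey), the Spec shape, and the boolean confident flag are all assumptions.

```typescript
// Hypothetical, simplified types for illustration only.
type Spec = { linkSelector: string; urlPattern: string };
type Source = "cache" | "heuristics" | "llm";

interface ResolverDeps {
  cache: Map<string, Spec>;
  validate: (spec: Spec) => boolean; // recheck against the live DOM
  heuristics: () => { spec: Spec; confident: boolean } | null;
  llm?: () => Promise<Spec | null>;  // optional: may not be configured
}

async function resolveSpec(
  siteKey: string,
  deps: ResolverDeps,
): Promise<{ spec: Spec; source: Source } | null> {
  // 1. Cache, but only if the entry still validates today.
  const cached = deps.cache.get(siteKey);
  if (cached && deps.validate(cached)) {
    return { spec: cached, source: "cache" };
  }

  // 2. Local heuristics, accepted only when confident and valid.
  const guess = deps.heuristics();
  if (guess && guess.confident && deps.validate(guess.spec)) {
    deps.cache.set(siteKey, guess.spec);
    return { spec: guess.spec, source: "heuristics" };
  }

  // 3. The LLM, only when the cheaper paths did not produce a usable spec.
  if (deps.llm) {
    const fromLlm = await deps.llm();
    if (fromLlm && deps.validate(fromLlm)) {
      deps.cache.set(siteKey, fromLlm);
      return { spec: fromLlm, source: "llm" };
    }
  }

  return null; // nothing validated; the caller decides what happens next
}
```

Every branch that succeeds writes back to the cache, so a second call with the same key returns from step 1 without touching the heuristics or the model.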

How the heuristics find an article list

This part has no idea what site it's looking at. It takes every link, normalizes the path, and collapses the parts that vary into wildcards.

/blog/my-first-post   ->  /blog/*
/blog/another-article ->  /blog/*
/about                ->  /about

Group by that shape. Any group with the same pattern repeated three or more times is a candidate for the article list. Score each candidate by how many links it has, what fraction of the page's links that is, its path depth, and whether the links sit inside a main content area. Highest score wins.

Numbers and long hashes in a path get treated as variable, so article IDs and dates don't fragment the grouping. The whole thing is structure, never names.
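
Here is one plausible way to do the normalizing and grouping, as a sketch rather than the actual implementation. The rule for deciding that a slug "varies" (at least three distinct last segments under the same parent) and the 8-character cutoff for a long hash are my assumptions; scoring is left out.

```typescript
// Segments that are variable by nature: pure numbers (IDs, dates) and long hex strings.
// The 8-character cutoff for "long" is an assumption.
const VARIABLE_SEGMENT = /^(\d+|[0-9a-f]{8,})$/i;

function toSegments(path: string): string[] {
  return path
    .split("/")
    .filter((segment) => segment.length > 0)
    .map((segment) => (VARIABLE_SEGMENT.test(segment) ? "*" : segment));
}

function groupByShape(paths: string[], minRepeat = 3): Map<string, string[]> {
  const parsed = paths
    .map((path) => ({ path, segments: toSegments(path) }))
    .filter((entry) => entry.segments.length > 0); // skip the site root

  // Pass 1: for each parent path, collect the distinct last segments under it.
  const leavesByParent = new Map<string, Set<string>>();
  for (const { segments } of parsed) {
    const parent = segments.slice(0, -1).join("/");
    const leaf = segments[segments.length - 1] ?? "";
    const leaves = leavesByParent.get(parent) ?? new Set<string>();
    leaves.add(leaf);
    leavesByParent.set(parent, leaves);
  }

  // Pass 2: collapse the last segment to "*" when enough siblings vary there.
  const groups = new Map<string, string[]>();
  for (const { path, segments } of parsed) {
    const parentSegments = segments.slice(0, -1);
    const siblingCount = leavesByParent.get(parentSegments.join("/"))?.size ?? 0;
    const shapeSegments = siblingCount >= minRepeat ? [...parentSegments, "*"] : segments;
    const shape = "/" + shapeSegments.join("/");
    groups.set(shape, [...(groups.get(shape) ?? []), path]);
  }
  return groups;
}
```

With three posts under /blog plus /about, the three posts land in a /blog/* group and /about stays in a group of its own. Date-style paths such as /2024/05/hello collapse their numeric segments first, so they end up together under /*/*/* once three of them exist.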

The validation gate

Here's the rule that makes runtime generation safe to trust. Nothing is used or cached until it proves itself on the live DOM:

the link selector matches at least 3 links
at least half of those match the URL pattern
if it's selector-based body extraction, the body has at least 80 characters

A spec that fails gets dropped. This is what keeps a bad LLM guess from quietly poisoning your output or your cache. Same gate for every source.
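
The three checks are simple enough to write down. A minimal sketch, using the thresholds from the list above; the function and field names are hypothetical, and trimming whitespace before counting body characters is my assumption.

```typescript
// Thresholds taken from the three rules above.
const MIN_LINKS = 3;
const MIN_PATTERN_RATIO = 0.5;
const MIN_BODY_CHARS = 80;

interface GateInput {
  matchedHrefs: string[]; // hrefs the link selector matched on the live DOM
  urlPattern: RegExp;     // pass a non-global regex so test() keeps no state
  bodyText?: string;      // only set for selector-based body extraction
}

interface GateResult {
  ok: boolean;
  reason?: string;
}

function validateSpec(input: GateInput): GateResult {
  const { matchedHrefs, urlPattern, bodyText } = input;

  if (matchedHrefs.length < MIN_LINKS) {
    return { ok: false, reason: `only ${matchedHrefs.length} links matched` };
  }

  const onPattern = matchedHrefs.filter((href) => urlPattern.test(href)).length;
  if (onPattern / matchedHrefs.length < MIN_PATTERN_RATIO) {
    return { ok: false, reason: "fewer than half of the links match the URL pattern" };
  }

  // Assumption: whitespace is trimmed before counting characters.
  if (bodyText !== undefined && bodyText.trim().length < MIN_BODY_CHARS) {
    return { ok: false, reason: "extracted body is too short" };
  }

  return { ok: true };
}
```

The function does not ask where the spec came from, which is what lets one gate cover the cache, the heuristics, the LLM, and a hand-written spec alike.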

One page, three ways to get it

A static fetch, a headless Playwright render, and your logged-in Chrome tab are very different beasts. pluckmd puts all three behind one interface, plus a DomEvaluator for live operations like scrolling and clicking next. The link collector and the extractor don't know which backend produced the page they're working on. Adding a fourth source would mean implementing one interface.
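
In code, that pattern looks something like the following. DomEvaluator is the only name taken from the post; PageSource, PageSnapshot, the method names, and the in-memory backend are hypothetical stand-ins so the example runs without a browser.

```typescript
// Hypothetical names: a sketch of the pattern, not pluckmd's real interfaces.
interface PageSnapshot {
  url: string;
  html: string;
}

// Live operations; only browser-backed sources would offer these.
interface DomEvaluator {
  scrollToBottom(): Promise<void>;
  click(selector: string): Promise<boolean>;
}

interface PageSource {
  readonly name: string;
  load(url: string): Promise<PageSnapshot>;
  evaluator?: DomEvaluator;
}

// Stand-in backend. A static fetch, a Playwright render, or a Chrome tab
// bridge would each implement the same PageSource interface.
function inMemorySource(pages: Record<string, string>): PageSource {
  return {
    name: "in-memory",
    async load(url: string): Promise<PageSnapshot> {
      const html = pages[url];
      if (html === undefined) throw new Error(`no page stored for ${url}`);
      return { url, html };
    },
  };
}

// The collector depends only on PageSource, never on a concrete backend.
async function collectHrefs(source: PageSource, url: string): Promise<string[]> {
  const { html } = await source.load(url);
  const hrefs: string[] = [];
  const pattern = /href="([^"]+)"/g; // naive on purpose; a real collector would query the DOM
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    const href = match[1];
    if (href !== undefined) hrefs.push(href);
  }
  return hrefs;
}
```

collectHrefs never checks source.name, so swapping the in-memory backend for a real one changes nothing on the collector side.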

The agent escape hatch

When heuristics give up and there's no LLM configured, it doesn't error out. It writes a request file with the page structure and candidate selectors, and a coding agent reads that and produces the spec. You validate and cache it with one command. So even the hardest sites resolve, they just route through the agent instead of an API call.
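
As a rough illustration of what such a request could contain, here is a sketch that builds the payload as JSON. The file format is not described in this post, so every field name below is an assumption.

```typescript
// Hypothetical request shape; the real file format may look nothing like this.
interface SelectorCandidate {
  selector: string;      // a selector the heuristics considered
  linkCount: number;     // how many links it matched
  sampleHrefs: string[]; // a few examples for the agent to inspect
}

interface AgentRequest {
  url: string;
  reason: string;
  candidates: SelectorCandidate[];
  expectedShape: string; // what the agent is asked to produce
}

function buildAgentRequest(url: string, candidates: SelectorCandidate[]): string {
  const request: AgentRequest = {
    url,
    reason: "heuristics were not confident and no LLM is configured",
    // Strongest candidates first, capped so the file stays readable.
    candidates: [...candidates].sort((a, b) => b.linkCount - a.linkCount).slice(0, 5),
    expectedShape: "AdapterSpec",
  };
  return JSON.stringify(request, null, 2);
}
```

The caller would write that string to disk, and whatever spec the agent hands back still has to pass the validation gate before it gets cached.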

That's the whole thing. A single data contract that makes the source, the resolver, and the page backend all swappable.

Repo (MIT): https://github.com/taisei-ide-0123/pluckmd

If you've solved generic extraction a different way I'd genuinely like to hear it. The confidence threshold between "trust the heuristic" and "call the model" is the part I'm least sure about.

How I use pluckmd to read blogs with an AI agent

Taisei — Tue, 02 Jun 2026 23:42:20 +0000

I wanted to read blog posts with an LLM in the loop, not just on my own.

The push came from two places. Karpathy's LLM Wiki idea, where the model keeps a folder of markdown notes as you learn a topic. And Thariq's post on how well Claude generates interactive HTML, which is now on the Anthropic blog. Put together, the workflow I wanted looked like this: pull blog articles into markdown, have an agent index them into a wiki, then generate interactive HTML pages to learn from.

Step one was the blocker. Getting clean articles out of a website kept breaking, and every tool wanted a config per site. So I made pluckmd to handle just that part. This post is how I use it. The architecture write-up is separate.

References if you want the background:

LLM Wiki by Karpathy: https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f
The Unreasonable Effectiveness of HTML by Thariq: https://x.com/trq212/status/2052809885763747935

The basic case

npx pluckmd download https://example.com/blog -o ./articles

That walks the listing page, follows pagination, pulls each article, and writes markdown with frontmatter (title, date, author, tags). On a small blog I get maybe 5 posts saved in a few seconds. No site config, no setup.
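
For a sense of what lands on disk, here is a sketch of rendering one article with that frontmatter. The four fields come from the list above; the exact layout and quoting pluckmd uses are assumptions, and the function name is made up.

```typescript
interface ArticleMeta {
  title: string;
  date: string; // e.g. an ISO date string
  author: string;
  tags: string[];
}

// JSON string literals are valid YAML double-quoted scalars, so JSON.stringify
// is a simple way to escape quotes and colons in titles.
function toMarkdown(meta: ArticleMeta, body: string): string {
  const lines = [
    "---",
    `title: ${JSON.stringify(meta.title)}`,
    `date: ${JSON.stringify(meta.date)}`,
    `author: ${JSON.stringify(meta.author)}`,
    `tags: [${meta.tags.map((tag) => JSON.stringify(tag)).join(", ")}]`,
    "---",
    "",
    body.trim(),
  ];
  return lines.join("\n") + "\n";
}
```

Quoting every value is the cautious choice: a title like "Scraping: part 2" would otherwise break the YAML.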

If a page is heavy on JavaScript, it quietly switches to a real browser to render it. You don't pick that, it decides.

Paid and login-only stuff

A lot of the writing I actually care about sits behind a login. Two ways to handle it.

pluckmd login https://example.com/login

That opens a browser once, you log in by hand, and the session sticks around. After that, normal downloads just work.

Or if you'd rather not hand it credentials at all, open the page in Chrome with the extension installed and run:

pluckmd download --active-tab -o ./articles

It reads straight from the tab you're already logged into. The CLI itself never reads your cookies.

The agent part

This is the reason it exists for me. I don't actually run the CLI by hand most of the time. pluckmd ships skills for Claude Code and Codex, so I just talk to the agent and it runs the right commands for me.

The whole learning loop is three messages:

Collect the posts from https://example.com/blog

The agent runs the download and saves everything as markdown into raw/.

Build a wiki from them

It reads the markdown, pulls out the concepts, and links them into wiki notes (works as an Obsidian vault). That's the Karpathy LLM Wiki part, a set of notes the model maintains as I learn.

Generate interactive HTML for this concept

It turns a concept into an interactive HTML page to study from, the Thariq HTML idea. The raw files stay untouched, the wiki and the HTML are things the agent regenerates.

So I never touch flags or paths unless I want to. I describe what I want, the agent drives pluckmd. And if you don't have an LLM key set for the extraction itself, it still works: pluckmd writes out a file describing the page, and the agent reads that and produces the extraction rules. The agent is the brain, the CLI is the hands.

Where it struggles

Honestly, not every site cooperates. I hit a couple of layouts where the heuristics couldn't find a clean article pattern and it had to lean on the agent fallback. Infinite scroll feeds are hit or miss depending on how the load-more is wired up. If you try it on something exotic and it flops, that's useful to me.

npm install -g pluckmd

Repo (MIT): https://github.com/taisei-ide-0123/pluckmd

Curious what people are pointing their agents at. What would you want read into a wiki first?

Code Formatting in Nim [nim pretty]

Taisei — Wed, 16 Nov 2022 16:33:36 +0000

This is my first post on dev. By the way, do you guys know Nim? It's an elegant language that compiles to C and has a Python-like syntax, so it's very fast and easy to write. I've been using it for work recently. I was looking for a code formatting tool for it and finally found one called nimpretty.

How to use nimpretty

It's easy to use. Here's some sample code.

const     hello   =      "Hello"
echo   "Say, ",                      hello

The formatting is terrible, right? To fix it, run the following command.

nimpretty --indent:2 sample.nim

The sample code can be formatted as follows.

const hello = "Hello"
echo "Say, ", hello

The terrible code was formatted with an indent of 2.
It became quite readable, but you don't want to run the command for each file, right?

Here's a better way. You can run the following command.

find . -name "*.nim" -exec nimpretty --indent:2 {} +

All Nim files in the current directory and its subdirectories get formatted. It's a combination of the nimpretty command and the shell's find command.

You can create a task in a nimble file as below if you don't want to type or copy the command every time.

version       = "0.0.1"
author        = "Sample"
description   = "Sample code"
license       = "Sample"

task pretty, "Formats all nim files":
  exec "find . -name '*.nim' -exec nimpretty --indent:2 {} +"

And you can run the following command.

nimble pretty

Then all Nim files can be formatted even if they are in different directories.

That's all. Any tips about Nim would be appreciated.
Thank you!!