<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Taisei</title>
    <description>The latest articles on DEV Community by Taisei (@taisei_ide).</description>
    <link>https://dev.to/taisei_ide</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F707449%2F7accf1a0-2395-43cf-aad5-1a064826a0b2.jpg</url>
      <title>DEV Community: Taisei</title>
      <link>https://dev.to/taisei_ide</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/taisei_ide"/>
    <language>en</language>
    <item>
      <title>Generating scraper logic at runtime instead of writing it per site</title>
      <dc:creator>Taisei</dc:creator>
      <pubDate>Wed, 03 Jun 2026 03:49:32 +0000</pubDate>
      <link>https://dev.to/taisei_ide/generating-scraper-logic-at-runtime-instead-of-writing-it-per-site-3j3g</link>
      <guid>https://dev.to/taisei_ide/generating-scraper-logic-at-runtime-instead-of-writing-it-per-site-3j3g</guid>
      <description>&lt;p&gt;pluckmd exists so an agent can pull blog posts into markdown, index them into a wiki, and generate interactive HTML to learn from. This post is about the first step, the part with no per-site code, because the design is the interesting bit.&lt;/p&gt;

&lt;p&gt;If you want the practical side, how I actually use it day to day, I wrote that up separately: &lt;a href="https://dev.to/taisei_ide/how-i-use-pluckmd-to-read-blogs-with-an-ai-agent-1jpe"&gt;https://dev.to/taisei_ide/how-i-use-pluckmd-to-read-blogs-with-an-ai-agent-1jpe&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe7neud864y1q5gk89sez.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe7neud864y1q5gk89sez.gif" alt="pluckmd demo" width="600" height="252"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It downloads articles from a blog without any per-site code. No handler for Medium, no handler for Substack, nothing keyed on a domain. Here's how that works.&lt;/p&gt;

&lt;p&gt;The core idea: treat extraction as data, not code.&lt;/p&gt;

&lt;h2&gt;
  
  
  AdapterSpec
&lt;/h2&gt;

&lt;p&gt;Instead of branching on which site you're on, pluckmd resolves an &lt;code&gt;AdapterSpec&lt;/code&gt;. It's a plain object that says which selector finds article links, what the URL pattern looks like, and how pagination behaves.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;AdapterSpec&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;listing&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="nx"&gt;ListingExtractionSpec&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;   &lt;span class="c1"&gt;// how to find article links&lt;/span&gt;
  &lt;span class="nl"&gt;article&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="nx"&gt;ArticleExtractionSpec&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;   &lt;span class="c1"&gt;// how to pull the body&lt;/span&gt;
  &lt;span class="nl"&gt;pagination&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;PaginationSpec&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;          &lt;span class="c1"&gt;// none | scroll | button-click | next-url | auto&lt;/span&gt;
  &lt;span class="nl"&gt;evidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Because it's data, the same shape can come from a heuristic, an LLM, an agent, or a person typing it by hand. They all produce the same thing, and they all go through the same checks.&lt;/p&gt;
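
&lt;p&gt;To make "data, not code" concrete, here is a rough sketch of a spec typed by hand. The post only gives the top-level shape, so the nested field names (&lt;code&gt;linkSelector&lt;/code&gt;, &lt;code&gt;urlPattern&lt;/code&gt;, &lt;code&gt;bodySelector&lt;/code&gt;, &lt;code&gt;kind&lt;/code&gt;) and the meaning of &lt;code&gt;evidence&lt;/code&gt; are assumptions for illustration, not pluckmd's real types.&lt;/p&gt;

```typescript
// Hypothetical nested types: only the four top-level keys come from the post.
type ListingExtractionSpec = { linkSelector: string; urlPattern: string };
type ArticleExtractionSpec = { bodySelector: string };
type PaginationSpec = {
  kind: "none" | "scroll" | "button-click" | "next-url" | "auto";
};

interface AdapterSpec {
  listing: ListingExtractionSpec;
  article: ArticleExtractionSpec;
  pagination: PaginationSpec;
  evidence: string; // assumed here to be a free-form note on where the spec came from
}

// A spec written by hand. A heuristic or a model would emit the same shape.
const handWritten: AdapterSpec = {
  listing: { linkSelector: "main a[href]", urlPattern: "/blog/*" },
  article: { bodySelector: "article" },
  pagination: { kind: "next-url" },
  evidence: "typed by hand",
};

// Because it is plain data, it survives a JSON round trip unchanged,
// which is what makes caching it or handing it to an agent straightforward.
const restored: AdapterSpec = JSON.parse(JSON.stringify(handWritten));
console.log(restored.pagination.kind); // next-url
```

&lt;p&gt;Nothing in the object is executable, so whichever source produced it can be swapped without touching the code that consumes it.&lt;/p&gt;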

&lt;h2&gt;
  
  
  Resolving it, cheapest path first
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cache  -&amp;gt;  heuristics (local, free)  -&amp;gt;  LLM (only if needed)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cache first, rechecked against today's DOM so a stale entry can't sneak through. Then local heuristics. The LLM only gets called when the heuristics aren't sure. Every result that works gets written back, so the second run on a site is basically instant.&lt;/p&gt;
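
&lt;p&gt;A minimal sketch of that order, under some assumptions: &lt;code&gt;resolveSpec&lt;/code&gt;, &lt;code&gt;Resolver&lt;/code&gt;, and the cache keyed by URL are made-up names for illustration, and the validator is passed in as a plain function. It is not pluckmd's actual code.&lt;/p&gt;

```typescript
// Hypothetical names throughout; a sketch of the idea, not pluckmd's API.
type Spec = { linkSelector: string; urlPattern: string };
type Resolver = { name: string; resolve: (url: string) => Spec | null };
type SpecCache = { [url: string]: Spec };

// Try each source in order, cheapest first, and keep the first spec that
// passes validation. A cached spec is re-validated like any other.
function resolveSpec(
  url: string,
  cache: SpecCache,
  resolvers: Resolver[],
  validate: (spec: Spec) => boolean,
): { spec: Spec; source: string } | null {
  const cached = cache[url];
  if (cached !== undefined) {
    if (validate(cached)) return { spec: cached, source: "cache" };
    delete cache[url]; // stale entry: drop it and fall through
  }
  for (const resolver of resolvers) {
    const spec = resolver.resolve(url);
    if (spec === null) continue; // this source was not sure
    if (!validate(spec)) continue; // a guess that fails is discarded
    cache[url] = spec; // write back so the next run hits the cache
    return { spec, source: resolver.name };
  }
  return null;
}

// Usage: the heuristics give up, the (stubbed) LLM answers, the second run is cached.
const specCache: SpecCache = {};
const heuristics: Resolver = { name: "heuristics", resolve: () => null };
const llm: Resolver = {
  name: "llm",
  resolve: () => ({ linkSelector: "main a", urlPattern: "/blog/*" }),
};
const accept = (spec: Spec) => spec.linkSelector.length > 0;

const first = resolveSpec("https://example.com/blog", specCache, [heuristics, llm], accept);
const second = resolveSpec("https://example.com/blog", specCache, [heuristics, llm], accept);
console.log(first?.source, second?.source); // llm cache
```

&lt;p&gt;The stale-cache case falls out of the same loop: a cached spec that no longer validates is deleted and the resolvers run again.&lt;/p&gt;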

&lt;h2&gt;
  
  
  How the heuristics find an article list
&lt;/h2&gt;

&lt;p&gt;This part has no idea what site it's looking at. It takes every link, normalizes the path, and collapses the parts that vary into wildcards.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight conf"&gt;&lt;code&gt;/&lt;span class="n"&gt;blog&lt;/span&gt;/&lt;span class="n"&gt;my&lt;/span&gt;-&lt;span class="n"&gt;first&lt;/span&gt;-&lt;span class="n"&gt;post&lt;/span&gt;   -&amp;gt;  /&lt;span class="n"&gt;blog&lt;/span&gt;/*
/&lt;span class="n"&gt;blog&lt;/span&gt;/&lt;span class="n"&gt;another&lt;/span&gt;-&lt;span class="n"&gt;article&lt;/span&gt; -&amp;gt;  /&lt;span class="n"&gt;blog&lt;/span&gt;/*
/&lt;span class="n"&gt;about&lt;/span&gt;                -&amp;gt;  /&lt;span class="n"&gt;about&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Group by that shape. Any group with the same pattern repeated three or more times is a candidate for the article list. Score each candidate by how many links it has, what fraction of the page's links that is, its path depth, and whether the links sit inside a main content area. Highest score wins.&lt;/p&gt;

&lt;p&gt;Numbers and long hashes in a path get treated as variable, so article IDs and dates don't fragment the grouping. The whole thing is structure, never names.&lt;/p&gt;
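
&lt;p&gt;Here is a rough sketch of the grouping step. The exact rules for what counts as "variable" are my guesses for illustration: a segment is a wildcard if it is all digits, looks like a long hex hash, or is the final segment of a multi-segment path. &lt;code&gt;pathShape&lt;/code&gt; and &lt;code&gt;candidateGroups&lt;/code&gt; are hypothetical names, and the scoring step is left out.&lt;/p&gt;

```typescript
// Sketch only: the segment rules are assumptions, not pluckmd's code.

// A segment counts as variable if it is all digits or a long hex-like hash.
function isVariable(segment: string): boolean {
  return /^\d+$/.test(segment) || /^[0-9a-f]{8,}$/i.test(segment);
}

// Collapse a path into its shape, e.g. "/blog/my-first-post" becomes "/blog/*".
function pathShape(path: string): string {
  const segments = path.split("/").filter((s) => s.length > 0);
  const shaped = segments.map((segment, i) => {
    const isLast = i === segments.length - 1;
    // Assumed rule: the last segment of a nested path is treated as a slug.
    const isSlug = isLast ? segments.length > 1 : false;
    return isSlug || isVariable(segment) ? "*" : segment;
  });
  return "/" + shaped.join("/");
}

// Group paths by shape and keep the groups that repeat often enough.
function candidateGroups(paths: string[], minCount = 3): { [shape: string]: string[] } {
  const groups: { [shape: string]: string[] } = {};
  for (const path of paths) {
    const shape = pathShape(path);
    const existing = groups[shape];
    if (existing === undefined) groups[shape] = [path];
    else existing.push(path);
  }
  const candidates: { [shape: string]: string[] } = {};
  for (const shape of Object.keys(groups)) {
    if (groups[shape].length >= minCount) candidates[shape] = groups[shape];
  }
  return candidates;
}

const pageLinks = [
  "/blog/my-first-post",
  "/blog/another-article",
  "/blog/third-post",
  "/about",
  "/posts/12345",
];
console.log(Object.keys(candidateGroups(pageLinks))); // only "/blog/*" survives
```

&lt;p&gt;Nothing here mentions a domain or a platform, which is the point: the grouping only looks at how often a path shape repeats.&lt;/p&gt;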

&lt;h2&gt;
  
  
  The validation gate
&lt;/h2&gt;

&lt;p&gt;Here's the rule that makes runtime generation safe to trust. Nothing is used or cached until it proves itself on the live DOM:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the link selector matches at least 3 links&lt;/li&gt;
&lt;li&gt;at least half of those match the URL pattern&lt;/li&gt;
&lt;li&gt;if it's selector-based body extraction, the body has at least 80 characters&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A spec that fails gets dropped. This is what keeps a bad LLM guess from quietly poisoning your output or your cache. Same gate for every source.&lt;/p&gt;
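
&lt;p&gt;The three checks can be sketched as a single function. The thresholds (3 links, half matching, 80 characters) are the ones listed above; the input shape and the name &lt;code&gt;passesGate&lt;/code&gt; are assumptions made for this example, with the live-DOM lookups already done by the caller.&lt;/p&gt;

```typescript
// Sketch of the gate. Field names are hypothetical; thresholds are from the post.
type GateInput = {
  matchedHrefs: string[]; // hrefs the link selector matched on the live page
  urlPattern: RegExp; // the spec's URL pattern, compiled (non-global)
  bodyText: string | null; // extracted body, or null when not selector-based
};

function passesGate(input: GateInput): boolean {
  const { matchedHrefs, urlPattern, bodyText } = input;
  // 1. The link selector must match at least 3 links.
  const enoughLinks = matchedHrefs.length >= 3;
  // 2. At least half of those links must match the URL pattern.
  const matching = matchedHrefs.filter((href) => urlPattern.test(href)).length;
  const enoughMatching = matching * 2 >= matchedHrefs.length;
  // 3. A selector-based body must have at least 80 characters.
  const bodyOk = bodyText === null || bodyText.trim().length >= 80;
  return [enoughLinks, enoughMatching, bodyOk].every(Boolean);
}

const good = passesGate({
  matchedHrefs: ["/blog/a", "/blog/b", "/blog/c", "/about"],
  urlPattern: /^\/blog\/[^\/]+$/,
  bodyText: null,
});
const tooFew = passesGate({
  matchedHrefs: ["/blog/a", "/blog/b"],
  urlPattern: /^\/blog\//,
  bodyText: null,
});
console.log(good, tooFew); // true false
```

&lt;p&gt;Because the function takes only the spec's results and not its origin, a heuristic guess, a model guess, and a hand-typed spec all face the same test.&lt;/p&gt;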

&lt;h2&gt;
  
  
  One page, three ways to get it
&lt;/h2&gt;

&lt;p&gt;A static fetch, a headless Playwright render, and your logged-in Chrome tab are very different beasts. pluckmd puts all three behind one interface, plus a &lt;code&gt;DomEvaluator&lt;/code&gt; for live operations like scrolling and clicking next. The link collector and the extractor don't know which backend produced the page they're working on. Adding a fourth source would mean implementing one interface.&lt;/p&gt;

&lt;h2&gt;
  
  
  The agent escape hatch
&lt;/h2&gt;

&lt;p&gt;When heuristics give up and there's no LLM configured, it doesn't error out. It writes a request file with the page structure and candidate selectors, and a coding agent reads that and produces the spec. You validate and cache it with one command. So even the hardest sites resolve, they just route through the agent instead of an API call.&lt;/p&gt;

&lt;p&gt;That's the whole thing. A single data contract that makes the source, the resolver, and the page backend all swappable.&lt;/p&gt;

&lt;p&gt;Repo (MIT): &lt;a href="https://github.com/taisei-ide-0123/pluckmd" rel="noopener noreferrer"&gt;https://github.com/taisei-ide-0123/pluckmd&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you've solved generic extraction a different way I'd genuinely like to hear it. The confidence threshold between "trust the heuristic" and "call the model" is the part I'm least sure about.&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>typescript</category>
      <category>architecture</category>
      <category>ai</category>
    </item>
    <item>
      <title>How I use pluckmd to read blogs with an AI agent</title>
      <dc:creator>Taisei</dc:creator>
      <pubDate>Tue, 02 Jun 2026 23:42:20 +0000</pubDate>
      <link>https://dev.to/taisei_ide/how-i-use-pluckmd-to-read-blogs-with-an-ai-agent-1jpe</link>
      <guid>https://dev.to/taisei_ide/how-i-use-pluckmd-to-read-blogs-with-an-ai-agent-1jpe</guid>
      <description>&lt;p&gt;I wanted to read blog posts with an LLM in the loop, not just on my own.&lt;/p&gt;

&lt;p&gt;The push came from two places. Karpathy's LLM Wiki idea, where the model keeps a folder of markdown notes as you learn a topic. And Thariq's post on how well Claude generates interactive HTML, which is now on the Anthropic blog. Put together, the workflow I wanted looked like this: pull blog articles into markdown, have an agent index them into a wiki, then generate interactive HTML pages to learn from.&lt;/p&gt;

&lt;p&gt;Step one was the blocker. Getting clean articles out of a website kept breaking, and every tool wanted a config per site. So I made pluckmd to handle just that part. This post is how I use it. The architecture write-up is separate.&lt;/p&gt;

&lt;p&gt;References if you want the background:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LLM Wiki by Karpathy: &lt;a href="https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f" rel="noopener noreferrer"&gt;https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;The Unreasonable Effectiveness of HTML by Thariq: &lt;a href="https://x.com/trq212/status/2052809885763747935" rel="noopener noreferrer"&gt;https://x.com/trq212/status/2052809885763747935&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The basic case
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx pluckmd download https://example.com/blog &lt;span class="nt"&gt;-o&lt;/span&gt; ./articles
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That walks the listing page, follows pagination, pulls each article, and writes markdown with frontmatter (title, date, author, tags). On a small blog I get maybe 5 posts saved in a few seconds. No site config, no setup.&lt;/p&gt;

&lt;p&gt;If a page is heavy on JavaScript, it quietly switches to a real browser to render it. You don't pick that, it decides.&lt;/p&gt;

&lt;h2&gt;
  
  
  Paid and login-only stuff
&lt;/h2&gt;

&lt;p&gt;A lot of the writing I actually care about sits behind a login. Two ways to handle it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pluckmd login https://example.com/login
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That opens a browser once, you log in by hand, and the session sticks around. After that, normal downloads just work.&lt;/p&gt;

&lt;p&gt;Or if you'd rather not hand it credentials at all, open the page in Chrome with the extension installed and run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pluckmd download &lt;span class="nt"&gt;--active-tab&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; ./articles
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It reads straight from the tab you're already logged into. The CLI itself never reads your cookies.&lt;/p&gt;

&lt;h2&gt;
  
  
  The agent part
&lt;/h2&gt;

&lt;p&gt;This is the reason it exists for me. I don't actually run the CLI by hand most of the time. pluckmd ships skills for Claude Code and Codex, so I just talk to the agent and it runs the right commands for me.&lt;/p&gt;

&lt;p&gt;The whole learning loop is three messages:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Collect the posts from &lt;a href="https://example.com/blog" rel="noopener noreferrer"&gt;https://example.com/blog&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The agent runs the download and saves everything as markdown into &lt;code&gt;raw/&lt;/code&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Build a wiki from them&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It reads the markdown, pulls out the concepts, and links them into wiki notes (works as an Obsidian vault). That's the Karpathy LLM Wiki part, a set of notes the model maintains as I learn.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Generate interactive HTML for this concept&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It turns a concept into an interactive HTML page to study from, the Thariq HTML idea. The raw files stay untouched, the wiki and the HTML are things the agent regenerates.&lt;/p&gt;

&lt;p&gt;So I never touch flags or paths unless I want to. I describe what I want, the agent drives pluckmd. And if you don't have an LLM key set for the extraction itself, it still works: pluckmd writes out a file describing the page, and the agent reads that and produces the extraction rules. The agent is the brain, the CLI is the hands.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where it struggles
&lt;/h2&gt;

&lt;p&gt;Honestly, not every site cooperates. I hit a couple of layouts where the heuristics couldn't find a clean article pattern and it had to lean on the agent fallback. Infinite scroll feeds are hit or miss depending on how the load-more is wired up. If you try it on something exotic and it flops, that's useful to me.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; pluckmd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Repo (MIT): &lt;a href="https://github.com/taisei-ide-0123/pluckmd" rel="noopener noreferrer"&gt;https://github.com/taisei-ide-0123/pluckmd&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Curious what people are pointing their agents at. What would you want read into a wiki first?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>cli</category>
      <category>productivity</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Code Formatting in Nim [nim pretty]</title>
      <dc:creator>Taisei</dc:creator>
      <pubDate>Wed, 16 Nov 2022 16:33:36 +0000</pubDate>
      <link>https://dev.to/taisei_ide/code-formatting-in-nim-4g7f</link>
      <guid>https://dev.to/taisei_ide/code-formatting-in-nim-4g7f</guid>
      <description>&lt;p&gt;This is my first post on DEV. By the way, do you know Nim? It's an elegant language that compiles to C and has a Python-like syntax, so it's fast and easy to write. I've been using it at work recently. I was looking for a code formatting tool for it and finally found one called nimpretty (Nim pretty).&lt;/p&gt;

&lt;h2&gt;
  
  
  How to use Nim pretty
&lt;/h2&gt;

&lt;p&gt;It's easy to use. Here's some sample code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const     hello   =      "Hello"
echo   "Say, ",                      hello
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The formatting is terrible, right? Here's how to fix it.&lt;br&gt;
Run the following command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nimpretty --indent:2 sample.nim
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The sample code is formatted as follows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const hello = "Hello"
echo "Say, ", hello
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code was formatted with an indent of 2.&lt;br&gt;
It's much more readable now, but you don't want to run the command for each file, right?&lt;/p&gt;

&lt;p&gt;Here's a better way. You can run the following command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;find . -name "*.nim" -exec nimpretty --indent:2 {} +
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This formats every Nim file under the current directory, including subdirectories. It combines nimpretty with the shell's &lt;code&gt;find&lt;/code&gt; command.&lt;/p&gt;

&lt;p&gt;If you don't want to type or copy the command every time, you can define a task in your &lt;code&gt;.nimble&lt;/code&gt; file as below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;version       = "0.0.1"
author        = "Sample"
description   = "Sample code"
license       = "Sample"

task pretty, "Formats all nim files":
  exec "find . -name '*.nim' -exec nimpretty --indent:2 {} +"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then run the following command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nimble pretty
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That formats all Nim files in the project, even if they are in different directories.&lt;/p&gt;

&lt;p&gt;That's all. Any tips about Nim would be appreciated.&lt;br&gt;
Thank you!!&lt;/p&gt;

</description>
      <category>beginners</category>
    </item>
  </channel>
</rss>
