<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: LEI QIN</title>
    <description>The latest articles on DEV Community by LEI QIN (@lei_qin_486b5f0693eac8f66).</description>
    <link>https://dev.to/lei_qin_486b5f0693eac8f66</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2013928%2F9cb2e937-5043-4e32-baf2-0e6be94f3595.png</url>
      <title>DEV Community: LEI QIN</title>
      <link>https://dev.to/lei_qin_486b5f0693eac8f66</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/lei_qin_486b5f0693eac8f66"/>
    <language>en</language>
    <item>
      <title>From PDF to Markdown: Why Document Parsing Is Important for RAG</title>
      <dc:creator>LEI QIN</dc:creator>
      <pubDate>Sun, 15 Mar 2026 17:15:13 +0000</pubDate>
      <link>https://dev.to/lei_qin_486b5f0693eac8f66/from-pdf-to-markdown-why-document-parsing-is-important-for-rag-4ao</link>
      <guid>https://dev.to/lei_qin_486b5f0693eac8f66/from-pdf-to-markdown-why-document-parsing-is-important-for-rag-4ao</guid>
      <description>&lt;p&gt;RAG (Retrieval-Augmented Generation) is quickly becoming the default pattern for grounding LLMs in your own data. But the quality of your RAG system depends heavily on a step many teams overlook: &lt;strong&gt;how you turn documents into text before they ever hit the vector store&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If your source is PDF-heavy—technical docs, reports, contracts—the parsing layer can make or break retrieval. Here’s why it matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Parsing Quality Matters for Retrieval
&lt;/h2&gt;

&lt;p&gt;RAG works by embedding chunks of text, storing them in a vector DB, and retrieving the most relevant chunks at query time. The better those chunks reflect the document’s structure and meaning, the better the model can answer questions.&lt;/p&gt;

&lt;p&gt;Bad parsing (raw text extraction, naive PDF-to-text):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Broken tables&lt;/strong&gt; → numbers and headers get mixed into paragraphs; retrieval returns incomplete or nonsensical rows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lost headings&lt;/strong&gt; → no semantic hierarchy; chunk boundaries ignore section logic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Garbled layout&lt;/strong&gt; → multi-column or complex docs produce a jumbled reading order&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Noise&lt;/strong&gt; → headers, footers, page numbers pollute chunks and dilute relevance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Good parsing&lt;/strong&gt; (structured Markdown with layout-aware extraction):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Clean tables&lt;/strong&gt; → preserved as HTML or structured blocks; chunks can contain coherent rows and cells&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Preserved hierarchy&lt;/strong&gt; → headings become natural chunk boundaries; sections stay intact&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logical flow&lt;/strong&gt; → reading order follows the document’s intended structure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Less noise&lt;/strong&gt; → boilerplate can be filtered; chunks are more representative of actual content&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result: embeddings capture real meaning, and retrieval returns chunks that actually answer the question instead of fragments of garbage.&lt;/p&gt;
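&lt;p&gt;To make the contrast concrete, here's a toy sketch of the same table as naive text extraction versus a structure-preserving Markdown parse (both snippets are illustrative, not any particular parser's output):&lt;/p&gt;

```shell
# Naive text extraction: cells fall out in visual order, so headers
# and rows are interleaved and the table is unrecoverable.
printf 'Revenue Cost 2023 1.2M 0.8M 2024 1.5M 0.9M\n' > naive.txt

# Structure-preserving parse: the same table as Markdown,
# one coherent row per line.
printf '| Year | Revenue | Cost |\n| 2023 | 1.2M | 0.8M |\n| 2024 | 1.5M | 0.9M |\n' > parsed.md

cat parsed.md
```

&lt;p&gt;An embedding of a Markdown row keeps year, revenue, and cost together; the naive version scatters them across chunk boundaries.&lt;/p&gt;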

&lt;h2&gt;
  
  
  Markdown as an Ideal RAG Ingestion Format
&lt;/h2&gt;

&lt;p&gt;Markdown is a natural fit for RAG pipelines because it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Keeps &lt;strong&gt;structure&lt;/strong&gt; (headings, lists, tables) explicit instead of hidden in layout&lt;/li&gt;
&lt;li&gt;Is &lt;strong&gt;easy to chunk&lt;/strong&gt; — split on &lt;code&gt;##&lt;/code&gt; or by block type without arbitrary character cuts&lt;/li&gt;
&lt;li&gt;Preserves &lt;strong&gt;semantic units&lt;/strong&gt; — a table row, a code block, or a section stays together&lt;/li&gt;
&lt;li&gt;Plays well with &lt;strong&gt;embeddings&lt;/strong&gt; — models typically train on Markdown; structured text tends to embed more predictably&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Raw PDF text, by contrast, often looks like one long run-on with no clear boundaries. You either chunk by fixed token count (losing context) or try to infer structure yourself (expensive and fragile). Starting from clean Markdown gives you structure “for free.”&lt;/p&gt;
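&lt;p&gt;As a minimal sketch of how cheap heading-based chunking becomes once the input is Markdown (using standard &lt;code&gt;csplit&lt;/code&gt;; the input file here is a stand-in for real parser output):&lt;/p&gt;

```shell
# Toy parsed-Markdown input (a stand-in for real parser output).
printf '# Report\nintro text\n## Results\ntable rows here\n## Methods\ndetails here\n' > doc.md

# Split into one chunk per "## " section: csplit cuts at every
# second-level heading, so each chunk keeps its heading intact.
csplit -s -z doc.md '/^## /' '{*}'

grep -l '^## Results' xx*   # -> xx01
```

&lt;p&gt;Each &lt;code&gt;xx*&lt;/code&gt; file is now a section-aligned chunk, ready to embed; doing the same on raw extracted text would mean guessing boundaries by token count.&lt;/p&gt;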

&lt;h2&gt;
  
  
  What to Look For in a Parser
&lt;/h2&gt;

&lt;p&gt;When building a document ingestion pipeline for RAG, look for parsers that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Preserve structure&lt;/strong&gt; — headings, tables, lists, code blocks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Produce Markdown&lt;/strong&gt; — or structured JSON you can convert to Markdown&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Handle layout&lt;/strong&gt; — multi-column, complex formatting, reading order&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expose block metadata&lt;/strong&gt; — block types, bounding boxes, reading order for custom chunking&lt;/li&gt;
&lt;/ol&gt;
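&lt;p&gt;For point 4, a hypothetical sketch of turning block-level JSON into Markdown with &lt;code&gt;jq&lt;/code&gt;; the field names (&lt;code&gt;type&lt;/code&gt;, &lt;code&gt;level&lt;/code&gt;, &lt;code&gt;text&lt;/code&gt;) are illustrative, not any specific parser's schema:&lt;/p&gt;

```shell
# Hypothetical parser output: a flat list of typed blocks.
printf '[{"type":"heading","level":2,"text":"Results"},{"type":"paragraph","text":"Accuracy improved."}]\n' > blocks.json

# Render blocks as Markdown: headings get "#" repeated per level,
# paragraphs pass through unchanged.
jq -r '.[] | if .type == "heading" then ("#" * .level) + " " + .text else .text end' blocks.json
```

&lt;p&gt;With block metadata in hand, the same loop can also drop boilerplate blocks or carry bounding boxes alongside each chunk.&lt;/p&gt;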

</description>
      <category>ai</category>
      <category>llm</category>
      <category>nlp</category>
      <category>rag</category>
    </item>
    <item>
      <title>Claude Code Curl Alternative: When Websites Block Your Requests</title>
      <dc:creator>LEI QIN</dc:creator>
      <pubDate>Wed, 11 Mar 2026 17:36:28 +0000</pubDate>
      <link>https://dev.to/lei_qin_486b5f0693eac8f66/claude-code-curl-alternative-when-websites-block-your-requests-491d</link>
      <guid>https://dev.to/lei_qin_486b5f0693eac8f66/claude-code-curl-alternative-when-websites-block-your-requests-491d</guid>
      <description>

&lt;p&gt;You've got Claude Code doing research or pulling data—and it leans on &lt;code&gt;curl&lt;/code&gt; to fetch pages. Works like a charm at first. Then you hit a site that slams the door: 403 Forbidden, access denied, or nothing but a blank screen. "Fine," you think, "I'll spin up Playwright locally." Real browser, real JavaScript—what could go wrong? Turns out: plenty. Local Playwright adds serious overhead and often runs into the same anti-bot walls. There's a smoother path.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Claude Code Curl Fails on Certain Websites
&lt;/h2&gt;

&lt;p&gt;Claude Code fetches pages with curl: lightweight, straightforward GET requests. Nothing fancy. The catch? Modern sites are built to spot and block exactly that kind of traffic.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Sites Say No to Curl
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Missing or suspicious headers&lt;/strong&gt; — Curl keeps it minimal. Sites expect &lt;code&gt;User-Agent&lt;/code&gt;, &lt;code&gt;Accept-Language&lt;/code&gt;, &lt;code&gt;Accept-Encoding&lt;/code&gt;, and more. Skip those, and you look like a bot.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No JavaScript execution&lt;/strong&gt; — Plenty of sites ship skeleton HTML and load the real content via JS. Curl grabs the skeleton. You get empty shells.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IP-based rate limiting&lt;/strong&gt; — One IP, lots of requests. Whether it's your machine or Claude's backend, you burn through limits fast.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bot detection&lt;/strong&gt; — TLS fingerprints, header order, timing, behavior. Curl's signature is easy to read.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When any of that kicks in, you see &lt;strong&gt;403 Forbidden&lt;/strong&gt;, &lt;strong&gt;429 Too Many Requests&lt;/strong&gt;, &lt;strong&gt;503 Service Unavailable&lt;/strong&gt;, or blank pages. No content means Claude Code has nothing to work with.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Local Playwright Trap
&lt;/h2&gt;

&lt;p&gt;You might think: real browser, real everything—problem solved. Playwright &lt;em&gt;does&lt;/em&gt; handle JavaScript and feels more like a real browser. But it comes with trade-offs:&lt;/p&gt;

&lt;h3&gt;
  
  
  The Heavier Stack
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Heavy dependencies&lt;/strong&gt; — Chromium, Firefox, or WebKit. Installation and updates eat hundreds of MB.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory and CPU&lt;/strong&gt; — Each headless instance chews RAM and CPU. Parallel crawls get expensive fast.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Environment headaches&lt;/strong&gt; — Different OSes, Node versions, Playwright versions. "Works on my machine" becomes your new catchphrase.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Blocked Anyway
&lt;/h3&gt;

&lt;p&gt;Even with a real browser, many high-value targets—e-commerce, news, travel, finance—use bot detection that still catches you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cloudflare challenges&lt;/strong&gt; — Turnstile, JS challenges, "Checking your browser" screens. They stop curl, and often stop local headless browsers too.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Headless detection&lt;/strong&gt; — &lt;code&gt;navigator.webdriver&lt;/code&gt;, missing plugins, odd resolutions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Behavioral analysis&lt;/strong&gt; — No mouse moves, no natural scroll, inhuman click speeds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Proxy and VPN detection&lt;/strong&gt; — Datacenter IPs? Often flagged.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Result: same &lt;strong&gt;403s&lt;/strong&gt;, &lt;strong&gt;Cloudflare&lt;/strong&gt;, and &lt;strong&gt;captchas&lt;/strong&gt;. But now you're lugging a heavier stack.&lt;/p&gt;

&lt;h2&gt;
  
  
  AnyCrawl: Crawl for AI
&lt;/h2&gt;

&lt;p&gt;AnyCrawl is &lt;strong&gt;crawl for AI&lt;/strong&gt;—a cloud scraping API built for AI agents and LLMs. It gets around these limits, &lt;strong&gt;including Cloudflare challenges&lt;/strong&gt;, without running browsers on your machine. You get &lt;strong&gt;LLM-optimized Markdown&lt;/strong&gt; back, ready for Claude and others. When curl or local Playwright hits a wall, this is the drop-in replacement.&lt;/p&gt;

&lt;h3&gt;
  
  
  How It Handles What Curl Can't
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Curl Problem&lt;/th&gt;
&lt;th&gt;AnyCrawl Solution&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Minimal headers, easy to fingerprint&lt;/td&gt;
&lt;td&gt;Full browser-like headers and request patterns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No JavaScript execution&lt;/td&gt;
&lt;td&gt;Cheerio, Playwright, and Puppeteer engines in the cloud&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Single IP, easily rate-limited&lt;/td&gt;
&lt;td&gt;Rotating proxy support to distribute traffic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cloudflare challenges, 403, anti-bot&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Bypass Cloudflare&lt;/strong&gt; and managed anti-detection&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;AnyCrawl &lt;strong&gt;bypasses Cloudflare challenges&lt;/strong&gt; (Turnstile, JS checks, and more), so sites that block curl or local browsers are still reachable. Send a URL, get clean Markdown. No local browser, no proxy setup, no maintenance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Claude Code Integration via AnyCrawl CLI Skill
&lt;/h3&gt;

&lt;p&gt;The best way to use AnyCrawl with Claude Code is the &lt;a href="https://github.com/any4ai/anycrawl-cli" rel="noopener noreferrer"&gt;AnyCrawl CLI&lt;/a&gt; and its &lt;strong&gt;skill&lt;/strong&gt; for AI coding agents:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install AnyCrawl CLI&lt;/span&gt;
npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; anycrawl-cli
&lt;span class="c"&gt;# Or: npx -y anycrawl-cli init&lt;/span&gt;

&lt;span class="c"&gt;# Authenticate (get API key at anycrawl.dev/dashboard)&lt;/span&gt;
anycrawl login &lt;span class="nt"&gt;--api-key&lt;/span&gt; &amp;lt;your-api-key&amp;gt;

&lt;span class="c"&gt;# Install skill for Claude Code (and Cursor, Codex, etc.)&lt;/span&gt;
anycrawl setup skills
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The skill teaches Claude Code when and how to use AnyCrawl for scraping, search, mapping, and crawling—swapping in AnyCrawl when curl gets blocked.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP integration&lt;/strong&gt; — Add the AnyCrawl MCP server to Claude Code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude mcp add &lt;span class="nt"&gt;--transport&lt;/span&gt; stdio anycrawl &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;ANYCRAWL_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;YOUR_API_KEY &lt;span class="nt"&gt;--&lt;/span&gt; npx &lt;span class="nt"&gt;-y&lt;/span&gt; anycrawl-mcp@latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Swap &lt;code&gt;YOUR_API_KEY&lt;/code&gt; with your key from &lt;a href="https://anycrawl.dev/dashboard" rel="noopener noreferrer"&gt;anycrawl.dev/dashboard&lt;/a&gt;. Or run &lt;code&gt;anycrawl setup mcp&lt;/code&gt; after installing the CLI.&lt;/p&gt;

&lt;h3&gt;
  
  
  API Usage (Curl-Compatible)
&lt;/h3&gt;

&lt;p&gt;Want to call the API directly from scripts or another AI? Same interface:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="s2"&gt;"https://api.anycrawl.dev/v1/scrape"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer YOUR_API_KEY"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "url": "https://example.com/page-that-blocks-curl",
    "engine": "playwright"
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Response includes &lt;code&gt;markdown&lt;/code&gt; or &lt;code&gt;html&lt;/code&gt; (or both), &lt;strong&gt;optimized for AI&lt;/strong&gt;—ready for Claude or any LLM.&lt;/p&gt;
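&lt;p&gt;A sketch of pulling the Markdown out of a response with &lt;code&gt;jq&lt;/code&gt;; the field names here (&lt;code&gt;success&lt;/code&gt;, &lt;code&gt;data.markdown&lt;/code&gt;) are an assumed shape for illustration, not a verified schema:&lt;/p&gt;

```shell
# Assumed response shape (illustrative, not a verified schema).
printf '{"success":true,"data":{"markdown":"# Page Title\\n\\nBody text."}}\n' > response.json

# Extract the LLM-ready Markdown field.
jq -r '.data.markdown' response.json   # first line: # Page Title
```

&lt;p&gt;Pipe that straight into a file, your notes, or the model's context.&lt;/p&gt;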

&lt;h2&gt;
  
  
  1,500 Credits: Enough for Daily Use
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Free tier&lt;/strong&gt; gives you &lt;strong&gt;1,500 credits per month&lt;/strong&gt;—no card required. Credit use depends on page size and engine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cheerio&lt;/strong&gt; (static HTML) — lowest cost, ideal when JS isn't needed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Playwright&lt;/strong&gt; (JS-heavy sites) — moderate cost, handles most anti-bot cases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Puppeteer&lt;/strong&gt; — similar to Playwright for Chrome-specific needs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For everyday research, docs lookup, and small-scale extraction—dozens to a few hundred pages a month—1,500 credits usually covers it. Need more? Paid tiers start at 2,500 credits (Hobby) and scale from there.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use Each Approach
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Best Tool&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Simple public pages, no anti-bot&lt;/td&gt;
&lt;td&gt;Claude Code curl&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;JS-heavy but not protected&lt;/td&gt;
&lt;td&gt;Local Playwright (if you're fine with the setup)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Protected sites, Cloudflare, 403/429, captchas&lt;/td&gt;
&lt;td&gt;AnyCrawl&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;High volume, many domains&lt;/td&gt;
&lt;td&gt;AnyCrawl (cloud, proxies, concurrency)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Code integration&lt;/td&gt;
&lt;td&gt;AnyCrawl CLI + skill (&lt;code&gt;anycrawl setup skills&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Sign up at &lt;a href="https://anycrawl.dev" rel="noopener noreferrer"&gt;anycrawl.dev&lt;/a&gt; (promo code: &lt;code&gt;STARTUPY&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Grab your API key from the &lt;a href="https://anycrawl.dev/dashboard" rel="noopener noreferrer"&gt;dashboard&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Install the CLI and skill: &lt;code&gt;anycrawl setup skills&lt;/code&gt; (see &lt;a href="https://github.com/any4ai/anycrawl-cli" rel="noopener noreferrer"&gt;anycrawl-cli&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Or call the Scrape API directly from your own scripts&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;No credit card required for the Free tier. Try scraping a URL that blocks curl—the difference shows up right away.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Claude Code's curl is perfect for open, static pages. When you run into 403s, Cloudflare, blank responses, or rate limits, local Playwright adds complexity and often fails anyway. &lt;strong&gt;AnyCrawl is crawl for AI&lt;/strong&gt;—cloud-based, bypasses Cloudflare and anti-bot, uses rotating proxies and multiple engines, and returns LLM-ready Markdown. For daily use, 1,500 free credits cover most research and extraction—without running browsers locally or wrestling anti-bot systems yourself.&lt;/p&gt;




</description>
      <category>anycrawl</category>
      <category>claudecodeskill</category>
      <category>crawlforai</category>
      <category>firecrawlalternative</category>
    </item>
  </channel>
</rss>
