<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Oaida Adrian</title>
    <description>The latest articles on DEV Community by Oaida Adrian (@oaida_adrian_afa2428f63d0).</description>
    <link>https://dev.to/oaida_adrian_afa2428f63d0</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F4014906%2Fc97aa091-845d-4fe5-b6fd-5a98bf7a23fa.jpg</url>
      <title>DEV Community: Oaida Adrian</title>
      <link>https://dev.to/oaida_adrian_afa2428f63d0</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/oaida_adrian_afa2428f63d0"/>
    <language>en</language>
    <item>
      <title>How to Extract Clean Content From Any Website Sitemap (For SEO Audits &amp; AI Training)</title>
      <dc:creator>Oaida Adrian</dc:creator>
      <pubDate>Sat, 04 Jul 2026 10:50:51 +0000</pubDate>
      <link>https://dev.to/oaida_adrian_afa2428f63d0/how-to-extract-clean-content-from-any-website-sitemap-for-seo-audits-ai-training-15a9</link>
      <guid>https://dev.to/oaida_adrian_afa2428f63d0/how-to-extract-clean-content-from-any-website-sitemap-for-seo-audits-ai-training-15a9</guid>
      <description>&lt;h1&gt;
  
  
  How to Extract Clean Content From Any Website Sitemap
&lt;/h1&gt;

&lt;p&gt;Ever needed to inventory every page on a website? Extract clean text content for AI training? Or audit meta tags across an entire domain?&lt;/p&gt;

&lt;p&gt;I built a &lt;strong&gt;Sitemap Content Extractor&lt;/strong&gt; that does exactly this — feed it a &lt;code&gt;sitemap.xml&lt;/code&gt; URL and it crawls every page, extracting structured content.&lt;/p&gt;

&lt;h2&gt;
  
  
  What It Does
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Parses sitemap indexes&lt;/strong&gt; — follows nested sitemaps recursively&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Handles gzip sitemaps&lt;/strong&gt; — &lt;code&gt;.xml.gz&lt;/code&gt; files work out of the box&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extracts full content&lt;/strong&gt; — clean article text using trafilatura&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Captures metadata&lt;/strong&gt; — title, meta description, meta keywords, H1 headings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Word counts&lt;/strong&gt; — for every page&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;URL filtering&lt;/strong&gt; — include/exclude patterns via regex&lt;/li&gt;
&lt;/ul&gt;
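
&lt;p&gt;To make the URL filtering item concrete, here is a minimal sketch of include/exclude matching with regular expressions. The function name &lt;code&gt;filter_urls&lt;/code&gt; and its parameters are hypothetical, chosen for illustration; the actor's real option names and matching rules may differ.&lt;/p&gt;

```python
import re


def filter_urls(urls, include=(), exclude=()):
    """Keep URLs that match any include pattern and no exclude pattern.

    Hypothetical helper: an empty include list means "keep everything".
    """
    include_res = [re.compile(p) for p in include]
    exclude_res = [re.compile(p) for p in exclude]
    kept = []
    for url in urls:
        if include_res and not any(r.search(url) for r in include_res):
            continue  # include patterns were given and none matched
        if any(r.search(url) for r in exclude_res):
            continue  # an exclude pattern matched
        kept.append(url)
    return kept
```

&lt;p&gt;With this sketch, exclude patterns win over include patterns, which is usually the safer default when trimming a large sitemap.&lt;/p&gt;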

&lt;h2&gt;
  
  
  How to Use It
&lt;/h2&gt;

&lt;p&gt;You can run it directly on &lt;a href="https://apify.com/darknezz/sitemap-content-extractor" rel="noopener noreferrer"&gt;Apify Store&lt;/a&gt; — no setup required.&lt;/p&gt;

&lt;p&gt;Just provide:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A sitemap URL (e.g., &lt;code&gt;https://example.com/sitemap.xml&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Max URLs to process&lt;/li&gt;
&lt;li&gt;Whether to extract full content&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Example Output
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://pydantic.dev/docs/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Pydantic Docs - Validation, AI Agents, Logfire Observability"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Full extracted article text..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"wordCount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;131&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"metaDescription"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Pydantic documentation..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"h1Headings"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Pydantic Docs"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"lastmod"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2025-01-15"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"extractedAt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-07-04T10:45:00Z"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
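
&lt;p&gt;The metadata fields in a record like this (&lt;code&gt;title&lt;/code&gt;, &lt;code&gt;metaDescription&lt;/code&gt;, &lt;code&gt;h1Headings&lt;/code&gt;) can be collected with a small parser. The sketch below is an illustration only: it uses Python's standard-library &lt;code&gt;html.parser&lt;/code&gt; so it runs without extra dependencies, and the class and function names are hypothetical.&lt;/p&gt;

```python
from html.parser import HTMLParser


class MetaParser(HTMLParser):
    """Collects the page title, meta description and H1 headings."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta_description = ""
        self.h1_headings = []
        self._current = None  # tag whose text is being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.meta_description = attrs.get("content") or ""
        elif tag in ("title", "h1"):
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._current:
            return
        # Collapse runs of whitespace left over from markup and indentation.
        text = " ".join("".join(self._buffer).split())
        if tag == "title":
            self.title = text
        else:
            self.h1_headings.append(text)
        self._current = None


def extract_metadata(html):
    parser = MetaParser()
    parser.feed(html)
    return {
        "title": parser.title,
        "metaDescription": parser.meta_description,
        "h1Headings": parser.h1_headings,
    }
```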



&lt;h2&gt;
  
  
  Real-World Use Cases
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. SEO Content Audits
&lt;/h3&gt;

&lt;p&gt;Crawl your entire site and identify pages with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Missing or duplicate meta descriptions&lt;/li&gt;
&lt;li&gt;Short content (under 300 words)&lt;/li&gt;
&lt;li&gt;Missing H1 tags&lt;/li&gt;
&lt;li&gt;Stale content (old &lt;code&gt;lastmod&lt;/code&gt; dates)&lt;/li&gt;
&lt;/ul&gt;
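
&lt;p&gt;The checks above can be run over the extracted records with a few lines of Python. This is a sketch under stated assumptions: the field names follow the example output, while &lt;code&gt;audit_pages&lt;/code&gt; and the &lt;code&gt;stale_before&lt;/code&gt; cutoff are hypothetical and should be tuned to your own site.&lt;/p&gt;

```python
from collections import Counter


def audit_pages(pages, min_words=300, stale_before="2024-01-01"):
    """Map each problematic URL to the list of issues found on it."""
    desc_counts = Counter(p.get("metaDescription") or "" for p in pages)
    report = {}
    for page in pages:
        issues = []
        desc = page.get("metaDescription") or ""
        if not desc:
            issues.append("missing meta description")
        elif desc_counts[desc] != 1:
            issues.append("duplicate meta description")
        if min_words > page.get("wordCount", 0):
            issues.append("short content")
        if not page.get("h1Headings"):
            issues.append("missing H1")
        lastmod = page.get("lastmod")
        # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
        if lastmod and stale_before > lastmod:
            issues.append("stale content")
        if issues:
            report[page["url"]] = issues
    return report
```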

&lt;h3&gt;
  
  
  2. AI Training Data Collection
&lt;/h3&gt;

&lt;p&gt;Extract clean text from documentation sites for fine-tuning LLMs. The trafilatura extraction removes navigation, ads, and boilerplate — leaving only the main content.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Competitor Analysis
&lt;/h3&gt;

&lt;p&gt;Inventory a competitor's entire content strategy — how many pages, how much content per page, what topics they cover.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Content Migration
&lt;/h3&gt;

&lt;p&gt;Before migrating a legacy site, extract all content into structured JSON for easy import into a new CMS.&lt;/p&gt;

&lt;h2&gt;
  
  
  Technical Details
&lt;/h2&gt;

&lt;p&gt;The extractor is built in Python 3.12 and uses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;trafilatura&lt;/strong&gt; for main content extraction (it is purpose-built for isolating article text, whereas BeautifulSoup is a general-purpose HTML parser)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;lxml&lt;/strong&gt; for sitemap XML parsing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BeautifulSoup&lt;/strong&gt; for metadata extraction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apify SDK&lt;/strong&gt; for infrastructure and scaling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It handles both &lt;code&gt;&amp;lt;urlset&amp;gt;&lt;/code&gt; (regular sitemaps) and &lt;code&gt;&amp;lt;sitemapindex&amp;gt;&lt;/code&gt; (nested sitemaps), following child sitemaps recursively.&lt;/p&gt;
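
&lt;p&gt;A minimal sketch of that recursion, including the gzip case, might look like the following. It is not the actor's code: it uses the standard library's &lt;code&gt;xml.etree.ElementTree&lt;/code&gt; instead of lxml so it runs without extra dependencies, and &lt;code&gt;fetch&lt;/code&gt; is a hypothetical callable that takes a URL and returns the raw bytes.&lt;/p&gt;

```python
import gzip
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def collect_urls(sitemap_url, fetch, seen=None):
    """Return page URLs from a sitemap, following nested indexes.

    `fetch` is a hypothetical callable: URL in, raw bytes out.
    """
    seen = set() if seen is None else seen
    if sitemap_url in seen:  # guard against sitemaps that reference each other
        return []
    seen.add(sitemap_url)

    raw = fetch(sitemap_url)
    if raw[:2] == b"\x1f\x8b":  # gzip magic bytes, covers .xml.gz files
        raw = gzip.decompress(raw)
    root = ET.fromstring(raw)

    locs = [el.text.strip() for el in root.iter(NS + "loc") if el.text]
    if root.tag == NS + "sitemapindex":
        urls = []
        for child_sitemap in locs:  # each loc points at another sitemap
            urls.extend(collect_urls(child_sitemap, fetch, seen))
        return urls
    return locs  # a plain urlset: each loc is a page URL
```

&lt;p&gt;Checking the magic bytes rather than the file extension also covers servers that serve compressed sitemaps under a plain &lt;code&gt;.xml&lt;/code&gt; name.&lt;/p&gt;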

&lt;h2&gt;
  
  
  Get Started
&lt;/h2&gt;

&lt;p&gt;Try it now on the &lt;a href="https://apify.com/darknezz/sitemap-content-extractor" rel="noopener noreferrer"&gt;Apify Store&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;No registration needed — just paste a sitemap URL and hit run.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What would you use a sitemap extractor for? Let me know in the comments!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>webdev</category>
      <category>seo</category>
      <category>automation</category>
    </item>
    <item>
      <title>Scraping 187,000 Romanian Businesses: Building a B2B Lead Generation Tool</title>
      <dc:creator>Oaida Adrian</dc:creator>
      <pubDate>Sat, 04 Jul 2026 10:37:34 +0000</pubDate>
      <link>https://dev.to/oaida_adrian_afa2428f63d0/scraping-187000-romanian-businesses-building-a-b2b-lead-generation-tool-176n</link>
      <guid>https://dev.to/oaida_adrian_afa2428f63d0/scraping-187000-romanian-businesses-building-a-b2b-lead-generation-tool-176n</guid>
      <description>&lt;p&gt;I needed Romanian B2B leads and couldn't find a good scraper for local business directories. So I built one.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Most lead generation tools focus on the US and Western European markets. If you're doing business in Romania or Eastern Europe, you're stuck with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Manual directory browsing&lt;/li&gt;
&lt;li&gt;US-centric tools that don't understand local directory structures&lt;/li&gt;
&lt;li&gt;Outdated databases with stale contacts&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Solution
&lt;/h2&gt;

&lt;p&gt;I built a &lt;a href="https://apify.com/darknezz/ro-business-scraper" rel="noopener noreferrer"&gt;Romanian Business Directory Scraper&lt;/a&gt; that works with &lt;strong&gt;listafirme.ro&lt;/strong&gt; — one of Romania's largest business registries with 187,000+ companies in Bucharest alone.&lt;/p&gt;

&lt;h2&gt;
  
  
  What It Extracts
&lt;/h2&gt;

&lt;p&gt;For each company, the scraper pulls:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Company name&lt;/strong&gt; (Denumire)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CUI&lt;/strong&gt; — Romanian tax identification number&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trade register number&lt;/strong&gt; (Nr. Reg. Com.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full address&lt;/strong&gt; — Street, city, county (județ)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CAEN code&lt;/strong&gt; — Business activity classification&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Founding date&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VAT status&lt;/strong&gt; — Plătitor/neplătitor de TVA&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Sample Output
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"companyName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"BORG DESIGN SRL"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"cui"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"RO14837428"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tradeRegister"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"J40/8118/2002"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"address"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Str. Ing. Stefan Hepites 16A"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"city"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Sectorul 5"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"county"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Bucuresti"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Proiectarea structurii și conținutului website..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"foundedDate"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2002-08-26"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Coverage
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;41 counties&lt;/strong&gt; (județe) supported&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;187,009 companies&lt;/strong&gt; in București alone&lt;/li&gt;
&lt;li&gt;Pagination handled automatically (3,741 pages for București)&lt;/li&gt;
&lt;li&gt;Detail page extraction for full company data&lt;/li&gt;
&lt;/ul&gt;
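
&lt;p&gt;The two București figures are consistent with a page size of 50 listings. That page size is an assumption inferred from the numbers above, not something stated by the directory. A quick check:&lt;/p&gt;

```python
def page_count(total_listings, per_page):
    """Number of result pages needed, rounding up (integer-only ceiling)."""
    return -(-total_listings // per_page)


# 187,009 listings at an assumed 50 per page
pages_needed = page_count(187_009, 50)
```

&lt;p&gt;Here &lt;code&gt;pages_needed&lt;/code&gt; comes out to 3741: 3,740 full pages hold 187,000 listings and the remaining 9 spill onto one more page.&lt;/p&gt;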

&lt;h2&gt;
  
  
  Use Cases
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;B2B Lead Generation&lt;/strong&gt; — Build targeted contact lists by industry and region&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Market Research&lt;/strong&gt; — Analyse business density by county or CAEN category&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Competitor Analysis&lt;/strong&gt; — Map competitors in your sector by region&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local SEO&lt;/strong&gt; — Build citation lists for Romanian businesses&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;The tool is on the Apify Store: &lt;a href="https://apify.com/darknezz/ro-business-scraper" rel="noopener noreferrer"&gt;Romanian Business Directory Scraper&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Pricing: &lt;strong&gt;$0.01 per business listing extracted&lt;/strong&gt;. Free tier covers ~500 listings.&lt;/p&gt;




&lt;p&gt;Anyone else building tools for the Romanian/Eastern European market? Would love to hear what directories you're working with.&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>automation</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Make Any Website AI-Readable: Generating llms.txt Files with Python</title>
      <dc:creator>Oaida Adrian</dc:creator>
      <pubDate>Sat, 04 Jul 2026 10:31:31 +0000</pubDate>
      <link>https://dev.to/oaida_adrian_afa2428f63d0/make-any-website-ai-readable-generating-llmstxt-files-with-python-3jop</link>
      <guid>https://dev.to/oaida_adrian_afa2428f63d0/make-any-website-ai-readable-generating-llmstxt-files-with-python-3jop</guid>
      <description>&lt;p&gt;AI assistants like ChatGPT, Claude, and Perplexity are increasingly crawling the web for context. But most websites aren't optimised for AI readability — they're built for human browsers with complex HTML, JavaScript navigation, and boilerplate-heavy layouts.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;&lt;a href="https://llmstxt.org" rel="noopener noreferrer"&gt;llms.txt standard&lt;/a&gt;&lt;/strong&gt; aims to change this. It's a simple convention: place an &lt;code&gt;llms.txt&lt;/code&gt; file at your site root that gives AI systems clean, structured content they can parse reliably.&lt;/p&gt;

&lt;p&gt;I built a tool that generates these files automatically for any website.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is llms.txt?
&lt;/h2&gt;

&lt;p&gt;Think of it as &lt;code&gt;robots.txt&lt;/code&gt;, but for LLMs. The proposal itself centres on &lt;code&gt;llms.txt&lt;/code&gt;; the generator produces three outputs around it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;llms.txt&lt;/code&gt;&lt;/strong&gt; — A curated summary of your site with key links&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;llms-full.txt&lt;/code&gt;&lt;/strong&gt; — Complete site content in clean markdown&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-page data&lt;/strong&gt; — Structured JSON with extracted content per URL&lt;/li&gt;
&lt;/ul&gt;
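
&lt;p&gt;For a sense of what the summary file looks like, the llmstxt.org proposal describes a markdown layout: an H1 with the site name, a blockquote summary, then H2 sections listing links. The sketch below renders that shape; &lt;code&gt;render_llms_txt&lt;/code&gt; and the page dictionary keys are hypothetical names used for illustration, not the generator's actual code.&lt;/p&gt;

```python
def render_llms_txt(site_name, summary, pages, section="Docs"):
    """Render a minimal llms.txt: H1, blockquote summary, one link section."""
    lines = ["# " + site_name, "", "> " + summary, "", "## " + section, ""]
    for page in pages:
        entry = "- [{title}]({url})".format(**page)
        if page.get("description"):
            entry += ": " + page["description"]  # optional note after the link
        lines.append(entry)
    return "\n".join(lines) + "\n"
```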

&lt;h2&gt;
  
  
  The Generator
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://apify.com/darknezz/llms-txt-generator" rel="noopener noreferrer"&gt;llms.txt Generator&lt;/a&gt; crawls any website using BFS (Breadth-First Search) and:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Respects configurable crawl depth and URL filters&lt;/li&gt;
&lt;li&gt;Extracts clean content via trafilatura (not regex — actual text extraction)&lt;/li&gt;
&lt;li&gt;Outputs markdown or plaintext&lt;/li&gt;
&lt;li&gt;Handles JavaScript-rendered pages&lt;/li&gt;
&lt;li&gt;Produces both summary and full-content files&lt;/li&gt;
&lt;/ul&gt;
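
&lt;p&gt;A breadth-first crawl with a depth and page limit can be sketched as follows. This is an illustration under assumptions, not the actor's implementation: &lt;code&gt;get_links&lt;/code&gt; is a hypothetical callable that returns the href values found on a page, and the &lt;code&gt;max_depth&lt;/code&gt; parameter name is invented for the example.&lt;/p&gt;

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse


def bfs_crawl(start_url, get_links, max_pages=50, max_depth=2):
    """Visit same-host pages level by level, returning them in visit order."""
    host = urlparse(start_url).netloc
    queue = deque([(start_url, 0)])
    seen = {start_url}
    order = []
    while queue and max_pages > len(order):
        url, depth = queue.popleft()  # FIFO queue gives breadth-first order
        order.append(url)
        if depth == max_depth:
            continue  # do not expand pages at the depth limit
        for href in get_links(url):
            # Resolve relative links and drop #fragments before comparing.
            link = urldefrag(urljoin(url, href)).url
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order
```

&lt;p&gt;Marking a URL as seen when it is queued, rather than when it is visited, keeps the same page from being queued twice by sibling pages.&lt;/p&gt;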

&lt;h2&gt;
  
  
  Why This Matters for SEO
&lt;/h2&gt;

&lt;p&gt;Traditional SEO targets Google's crawler. But a new category is emerging: &lt;strong&gt;SEO for AI&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When a user asks ChatGPT "What is [your product]?", the AI draws on its training data and web search results. If your site has a clean &lt;code&gt;llms.txt&lt;/code&gt;, the AI can read structured, accurate content instead of parsing your homepage HTML.&lt;/p&gt;

&lt;h2&gt;
  
  
  Input Parameters
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;startUrls&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;required&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;Website URLs to crawl&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;maxPages&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;Maximum pages to process&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;outputFormat&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;markdown&lt;/td&gt;
&lt;td&gt;Output format (markdown/plaintext)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;includePatterns&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;[]&lt;/td&gt;
&lt;td&gt;URL patterns to include&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;excludePatterns&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;[]&lt;/td&gt;
&lt;td&gt;URL patterns to exclude&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Example: Documenting a Python Library
&lt;/h2&gt;

&lt;p&gt;I tested it on Pydantic's documentation (docs.pydantic.dev). The crawler:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Started at the root docs page&lt;/li&gt;
&lt;li&gt;Followed internal links via BFS&lt;/li&gt;
&lt;li&gt;Extracted clean content from each page&lt;/li&gt;
&lt;li&gt;Produced a structured dataset with per-page markdown&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Result: 2 pages processed, full content extracted with zero boilerplate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;Live on the Apify Store: &lt;a href="https://apify.com/darknezz/llms-txt-generator" rel="noopener noreferrer"&gt;llms.txt Generator&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Pricing is &lt;strong&gt;$0.01 per page processed&lt;/strong&gt;. Free tier covers ~500 pages.&lt;/p&gt;




&lt;p&gt;The llms.txt standard is still emerging, but early adopters will have an advantage as AI-driven search grows. Is your website AI-readable?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>seo</category>
      <category>python</category>
      <category>webdev</category>
    </item>
    <item>
      <title>I Built an RSS Aggregator That Extracts Full Article Content (Not Just Summaries)</title>
      <dc:creator>Oaida Adrian</dc:creator>
      <pubDate>Sat, 04 Jul 2026 10:30:50 +0000</pubDate>
      <link>https://dev.to/oaida_adrian_afa2428f63d0/i-built-an-rss-aggregator-that-extracts-full-article-content-not-just-summaries-ifl</link>
      <guid>https://dev.to/oaida_adrian_afa2428f63d0/i-built-an-rss-aggregator-that-extracts-full-article-content-not-just-summaries-ifl</guid>
      <description>&lt;p&gt;Most RSS feed readers give you a 200-character summary and force you to click through to read the full article. That's useless if you're building news monitoring pipelines, AI training datasets, or content curation tools.&lt;/p&gt;

&lt;p&gt;So I built a proper RSS Feed Aggregator that follows each article link and extracts the &lt;strong&gt;complete full-text content&lt;/strong&gt; — clean, structured, and ready to use.&lt;/p&gt;

&lt;h2&gt;
  
  
  What It Does
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-feed ingestion&lt;/strong&gt; — Point it at multiple RSS/Atom feeds simultaneously&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full-text extraction&lt;/strong&gt; — Uses &lt;a href="https://github.com/adbar/trafilatura" rel="noopener noreferrer"&gt;trafilatura&lt;/a&gt; to extract the actual article content, stripping boilerplate, ads, and navigation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deduplication&lt;/strong&gt; — Automatically detects and removes duplicate articles across feeds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rich metadata&lt;/strong&gt; — Word counts, authorship, publish dates, images, source tracking&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keyword filtering&lt;/strong&gt; — Include/exclude articles by keywords&lt;/li&gt;
&lt;/ul&gt;
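
&lt;p&gt;One common way to deduplicate across feeds is to compare a normalised form of each article URL. The sketch below assumes that approach; the post does not say how the actor detects duplicates, and the helper names here are hypothetical. The &lt;code&gt;sourceUrl&lt;/code&gt; key follows the example output.&lt;/p&gt;

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonical(url):
    """Normalise a URL: lowercase host, no utm_* params, fragment or trailing slash."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if not k.startswith("utm_")]
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )


def deduplicate(articles):
    """Keep the first article seen for each canonical URL."""
    seen = set()
    unique = []
    for article in articles:
        key = canonical(article["sourceUrl"])
        if key not in seen:
            seen.add(key)
            unique.append(article)
    return unique
```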

&lt;h2&gt;
  
  
  Example Output
&lt;/h2&gt;

&lt;p&gt;Each article comes back as structured JSON:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The only AI glossary you'll need this year"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"fullContent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"...3,727 words of clean extracted text..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"author"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Kyle Wiggers"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"publishedDate"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-07-04T10:00:00Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"wordCount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3727&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"imageUrl"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"sourceFeed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://techcrunch.com/feed/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"sourceUrl"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://techcrunch.com/2026/07/04/..."&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Real-World Use Cases
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;AI/LLM Training Data&lt;/strong&gt; — Need clean text without HTML boilerplate? This outputs publication-ready content.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;News Monitoring&lt;/strong&gt; — Aggregate dozens of feeds and get full articles, not snippets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Content Curation&lt;/strong&gt; — Pull from multiple sources, deduplicate, filter by keywords.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Research Pipelines&lt;/strong&gt; — Collect articles on specific topics for analysis.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;The tool is live on the Apify Store: &lt;a href="https://apify.com/darknezz/rss-feed-aggregator" rel="noopener noreferrer"&gt;RSS Feed Aggregator &amp;amp; Article Extractor&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It uses pay-per-event pricing at &lt;strong&gt;$0.01 per article extracted&lt;/strong&gt;. If you're on Apify's free tier ($5/mo credits), that covers ~500 articles — enough for a solid test run.&lt;/p&gt;
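
&lt;p&gt;The arithmetic behind that estimate, worked in whole cents to avoid floating-point rounding. It assumes the per-article charge is the only cost drawn from the credits; the function name is hypothetical.&lt;/p&gt;

```python
def articles_covered(credit_cents, price_cents_per_article):
    """How many articles a credit balance pays for, in whole articles."""
    return credit_cents // price_cents_per_article


# $5.00 of monthly credits at $0.01 per article
free_tier_articles = articles_covered(500, 1)
```

&lt;p&gt;Here &lt;code&gt;free_tier_articles&lt;/code&gt; is 500, matching the figure above.&lt;/p&gt;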

&lt;h2&gt;
  
  
  Input Parameters
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;feedUrls&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;required&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;RSS/Atom feed URLs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;maxResults&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;Maximum articles to extract&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;extractContent&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;td&gt;Follow links and extract full text&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;deduplicate&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;td&gt;Remove duplicate articles&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;keywordFilter&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;[]&lt;/td&gt;
&lt;td&gt;Include/exclude keywords&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
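
&lt;p&gt;The table does not spell out how include and exclude keywords interact, so the following is only one plausible reading: an article is dropped if it contains any excluded keyword, and otherwise kept if it contains at least one included keyword (or if no include list is given). The function and its parameters are hypothetical; the &lt;code&gt;title&lt;/code&gt; and &lt;code&gt;fullContent&lt;/code&gt; keys follow the example output.&lt;/p&gt;

```python
def matches_keywords(article, include=(), exclude=()):
    """Case-insensitive keyword check over the title and full text."""
    text = (article.get("title", "") + " " + article.get("fullContent", "")).lower()
    if any(word.lower() in text for word in exclude):
        return False  # exclusions take priority
    return not include or any(word.lower() in text for word in include)
```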

&lt;h2&gt;
  
  
  How Full-Text Extraction Works
&lt;/h2&gt;

&lt;p&gt;The actor uses &lt;code&gt;trafilatura&lt;/code&gt;, a Python library specifically designed for web text extraction. Unlike basic regex or BeautifulSoup approaches, trafilatura:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Strips navigation, sidebars, footers, and ads&lt;/li&gt;
&lt;li&gt;Preserves article structure (paragraphs, headings)&lt;/li&gt;
&lt;li&gt;Operates on already-fetched HTML (it does not execute JavaScript itself, so dynamic pages need to be rendered first)&lt;/li&gt;
&lt;li&gt;Works across 20+ languages&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This means you get the actual article text — not the RSS description, not a truncated summary, but the full content as the author wrote it.&lt;/p&gt;




&lt;p&gt;If you're working with RSS feeds or news data, give it a try. Happy to add features based on feedback — what would make this useful for your use case?&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>python</category>
      <category>rss</category>
      <category>automation</category>
    </item>
  </channel>
</rss>
