<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tony Wang</title>
    <description>The latest articles on DEV Community by Tony Wang (@tonywangca).</description>
    <link>https://dev.to/tonywangca</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F148876%2F2c831bd8-52c8-44be-bf32-6653835db6b9.jpeg</url>
      <title>DEV Community: Tony Wang</title>
      <link>https://dev.to/tonywangca</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tonywangca"/>
    <language>en</language>
    <item>
      <title>World Cup 2026 TikTok Creators by the Numbers</title>
      <dc:creator>Tony Wang</dc:creator>
      <pubDate>Mon, 29 Jun 2026 06:03:08 +0000</pubDate>
      <link>https://dev.to/tonywangca/world-cup-2026-tiktok-creators-by-the-numbers-233h</link>
      <guid>https://dev.to/tonywangca/world-cup-2026-tiktok-creators-by-the-numbers-233h</guid>
      <description>&lt;p&gt;&lt;strong&gt;Key takeaways&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The 2026 World Cup is enormous on TikTok — the #worldcup2026 hashtag alone has 25.3 billion views across 1.98 million videos. We went looking for the creators behind that wave across all 48 qualified nations, and the map of where the big ones live is not the one you'd draw.&lt;/li&gt;
&lt;li&gt;The biggest TikTok creators aren't in the USA, Brazil or footballing Europe. Among World Cup nations the typical creator is largest in Saudi Arabia (median 45k followers), Egypt (42k), Colombia (41k), Ecuador (41k), South Africa (40k) and South Korea (39k), all top-heavy creator economies, while Germany, Spain and the US sit closer to 32–33k.&lt;/li&gt;
&lt;li&gt;Star density says the same thing: about 30% of Egyptian and Saudi creators clear 100k followers, and Egypt has the highest share of million-follower accounts (4.0%). The footballing aristocracy fields more creators, but smaller ones.&lt;/li&gt;
&lt;li&gt;The USA 'dominates TikTok' only at the very top — it holds 11 of the 15 most-followed accounts on earth. But measured by the typical creator, it's mid-table among World Cup nations. Megastars and a big creator base are two different things.&lt;/li&gt;
&lt;li&gt;The twist: the nations with the biggest creators are the hardest to actually reach. In Saudi Arabia, Egypt, Ecuador and Argentina, only 1–3% of creators list any way to contact them, versus 10–15% in England, Canada and the Netherlands. Giant audiences, closed inboxes.&lt;/li&gt;
&lt;li&gt;Coverage caveat: a creator's country comes from TikTok's own account region (with a best-effort flag-emoji/place-name fallback) and is present on only ~7.7% of creators — so it reflects account region, not nationality. 26 of the 48 qualified nations have enough data to rank; 22 — including Uruguay, Croatia, Senegal and Ghana — do not. Every figure reads as 'among established creators with a known country.'&lt;/li&gt;
&lt;li&gt;Aggregate-only — counts and rates by nation, never individual creators — and the underlying table is open (CC BY 4.0) and reproducible.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The 2026 World Cup is the first 48-team tournament, played across 16 cities in the United States, Canada and Mexico — and on TikTok it is already vast. The &lt;strong&gt;#worldcup2026&lt;/strong&gt; hashtag alone has &lt;strong&gt;25.3 billion views&lt;/strong&gt; across &lt;strong&gt;1.98 million videos&lt;/strong&gt;, and that is before a single knockout match.&lt;/p&gt;

&lt;p&gt;Conventional wisdom says the United States rules TikTok — and for the very top of the leaderboard, it does: &lt;strong&gt;11 of the 15 most-followed accounts on the planet are American.&lt;/strong&gt; But zoom out from that handful of mega-celebrities to the &lt;em&gt;typical&lt;/em&gt; creator in each nation, and a stranger map appears. We pulled every World Cup nation's creators from Crawlora's 3.33-million-creator dataset and asked two simple questions: &lt;strong&gt;whose creators are biggest — and can you actually reach them?&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Methodology &amp;amp; definitions at a glance&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What we measured:&lt;/strong&gt; Crawlora's &lt;code&gt;creators_search&lt;/code&gt; dataset — &lt;strong&gt;3,327,485&lt;/strong&gt; discoverable TikTok creators — filtered to the &lt;strong&gt;48 nations qualified for the 2026 World Cup&lt;/strong&gt;, June 2026 snapshot. Per nation we computed the &lt;strong&gt;median follower count&lt;/strong&gt;, the share at &lt;strong&gt;brand-tier reach&lt;/strong&gt; (100k+ and 1M+ followers), the share that is &lt;strong&gt;verified&lt;/strong&gt;, and the share that lists any &lt;strong&gt;public contact&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How we assign a country:&lt;/strong&gt; primarily TikTok's own &lt;strong&gt;account region&lt;/strong&gt;, with a best-effort bio fallback (a flag emoji like 🇧🇷, then a place name) when region is missing — so it tracks &lt;em&gt;account region, not nationality&lt;/em&gt;. It is present on only &lt;strong&gt;7.7%&lt;/strong&gt; of the dataset, so we measure the &lt;strong&gt;170,551&lt;/strong&gt; qualified-nation creators who carry one. As a sanity check, posting language lines up with the assigned country (Spanish dominates Mexico and Argentina, Portuguese Brazil, Arabic Saudi Arabia). A nation needs at least &lt;strong&gt;300&lt;/strong&gt; such creators to rank; &lt;strong&gt;26 of 48&lt;/strong&gt; clear that bar.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What we did NOT measure:&lt;/strong&gt; all of TikTok, or fan-account volume. This is the set of established, discoverable creators — not a census. England and Scotland both fall under ISO &lt;code&gt;gb&lt;/code&gt;, which does not split UK home nations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aggregate only:&lt;/strong&gt; we publish counts and rates by nation, never individual creators. Every figure is reproducible from &lt;a href="https://github.com/Crawlora-org/world-cup-creators-data" rel="noopener noreferrer"&gt;the open dataset&lt;/a&gt; (CC BY 4.0).&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
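&lt;p&gt;The country-assignment order described above (account region first, then a flag emoji in the bio, then a place name) can be sketched roughly as follows. This is a minimal illustration, not Crawlora's actual pipeline: the function names, the &lt;code&gt;PLACE_NAMES&lt;/code&gt; lookup and its entries are all assumptions made up for the example.&lt;/p&gt;

```python
import re

# Two consecutive regional-indicator symbols form one flag emoji.
FLAG_RE = re.compile("[\U0001F1E6-\U0001F1FF]{2}")

# Hypothetical place-name lookup; a real one would be far larger.
PLACE_NAMES = {"brasil": "br", "brazil": "br", "cairo": "eg", "riyadh": "sa"}


def flag_to_iso(flag):
    """Convert a flag emoji to a lowercase ISO 3166-1 alpha-2 code."""
    base = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A
    return "".join(chr(ord(ch) - base + ord("a")) for ch in flag)


def assign_country(region, bio):
    """Return (code, source), preferring account region over bio signals."""
    if region:
        return region.lower(), "region"
    text = bio or ""
    match = FLAG_RE.search(text)
    if match:
        return flag_to_iso(match.group()), "flag"
    # Last resort: look for a known place name among the bio's words.
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in PLACE_NAMES:
            return PLACE_NAMES[word], "place"
    return None, None
```

&lt;p&gt;Returning the source alongside the code makes it easy to check how much of a nation's sample rests on the weaker bio fallback rather than on account region.&lt;/p&gt;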

&lt;h2&gt;
  
  
  The World Cup's biggest TikTok stars aren't where you'd think
&lt;/h2&gt;

&lt;p&gt;Rank the 26 nations by the size of their &lt;em&gt;typical&lt;/em&gt; creator (the median follower count) and the football map turns upside down. The biggest creators belong to &lt;strong&gt;Saudi Arabia, Egypt, Colombia, Ecuador, South Africa and South Korea&lt;/strong&gt;, not to the tournament favourites:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Favvs94q5c51jqq7xqkmb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Favvs94q5c51jqq7xqkmb.png" alt="Median follower count of TikTok creators by World Cup nation: Saudi Arabia 45,435; Egypt 42,200; Colombia 41,181; Ecuador 40,677; South Africa 39,664; Korea Republic 39,027; Brazil 37,000; England 34,179; France 33,603; United States 33,338; Germany 32,116; Mexico 31,113; Portugal 27,462; Türkiye 26,164." width="800" height="853"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Mapped, the pattern is unmistakable — the brightest creator economies ring the Gulf, North Africa and the Andes, not Western Europe:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F1bzmndi8w6wa6bd275dh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F1bzmndi8w6wa6bd275dh.png" alt="World map of median creator follower count by World Cup nation. Highest: Saudi Arabia 45,435; Egypt 42,200; Colombia 41,181; Ecuador 40,677; South Africa 39,664; Korea 39,027. Lowest: Türkiye 26,164; Portugal 27,462; Norway 27,500." width="800" height="492"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Read down that list and it is almost the inverse of the form guide. The nations with the biggest creators (Saudi Arabia, Egypt, Ecuador, Colombia) are mostly tournament outsiders, while the powerhouses expected to contest the latter rounds (Germany, Spain, France) field &lt;strong&gt;smaller, more numerous&lt;/strong&gt; creators. Star density tells the same story: roughly &lt;strong&gt;30% of Egyptian and Saudi creators clear 100,000 followers&lt;/strong&gt;, against about 22% in both the US and Germany, and Egypt has the single highest share of &lt;strong&gt;million-follower&lt;/strong&gt; accounts (4.0%). These are top-heavy creator economies, built around a smaller number of very large stars, whereas the European football heartlands run on a long tail of mid-sized creators.&lt;/p&gt;

&lt;p&gt;It also reframes the "USA dominates TikTok" cliché. America genuinely owns the celebrity tier: Charli D'Amelio, MrBeast, 11 of the global top 15. But its &lt;em&gt;median&lt;/em&gt; World Cup creator (33,338 followers) is squarely mid-table, 13th of the 26 ranked nations and smaller than twelve of them. Megastars and a deep, large creator base are not the same thing.&lt;/p&gt;
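&lt;p&gt;To check where the United States lands, the medians from the full table further down can be sorted and its position read off directly. The sketch below only re-uses the published medians; the names &lt;code&gt;MEDIANS&lt;/code&gt; and &lt;code&gt;rank_of&lt;/code&gt; are ours, chosen for illustration.&lt;/p&gt;

```python
# Median follower counts copied from the article's 26-nation table.
MEDIANS = {
    "Saudi Arabia": 45435, "Egypt": 42200, "Colombia": 41181,
    "Ecuador": 40677, "South Africa": 39664, "Korea Republic": 39027,
    "Japan": 37595, "Morocco": 37104, "Brazil": 37000,
    "England (GB)": 34179, "Argentina": 33848, "France": 33603,
    "United States": 33338, "Germany": 32116, "Spain": 31722,
    "Belgium": 31550, "Sweden": 31400, "Mexico": 31113,
    "Australia": 29344, "Canada": 29286, "Switzerland": 29000,
    "Austria": 29000, "Netherlands": 28600, "Norway": 27500,
    "Portugal": 27462, "Türkiye": 26164,
}

# Sort nations from largest to smallest typical creator.
ranking = sorted(MEDIANS, key=MEDIANS.get, reverse=True)


def rank_of(nation):
    """1-based position of a nation in the median ranking."""
    return ranking.index(nation) + 1


us_rank = rank_of("United States")
print(f"United States: {us_rank} of {len(ranking)}")
print("Ahead of it:", ", ".join(ranking[:us_rank - 1]))
```

&lt;p&gt;Ranking on the median rather than the mean keeps a handful of megastar accounts from dragging a nation up the table.&lt;/p&gt;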

&lt;h2&gt;
  
  
  Giant audiences, closed inboxes
&lt;/h2&gt;

&lt;p&gt;Here is the part that surprised us most. The nations with the biggest creators are the &lt;strong&gt;hardest to actually reach.&lt;/strong&gt; If you ranked nations by raw star power you would get one list; rank them by the share of creators who publish any way to contact them and you get almost the opposite. Plot the two together and the gap is the story:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fqqw5n9r1c1hth9b70jec.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fqqw5n9r1c1hth9b70jec.png" alt="Gap between the share of creators with 100k+ followers and the share that list a contact, by nation. Egypt 30.4% reach versus 2.4% contact; Saudi Arabia 30.0 versus 2.3; Ecuador 26.3 versus 1.1; Colombia 27.5 versus 2.7; Brazil 24.6 versus 3.0; Argentina 22.0 versus 1.8; South Africa 26.5 versus 11.8; United States 22.5 versus 9.7; Canada 20.1 versus 11.6; England 23.3 versus 15.4." width="800" height="738"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In Egypt and Saudi Arabia, a creator is more than &lt;strong&gt;ten times&lt;/strong&gt; as likely to clear 100,000 followers as to list a contact. In Ecuador the ratio is &lt;strong&gt;24 to 1&lt;/strong&gt;. The big-creator nations keep their audiences vast and their inboxes shut: for most of their creators, the only way in is an unsolicited DM. The English-speaking and Northern-European pools are the mirror image: smaller stars, but a contact listed &lt;strong&gt;4 to 10 times&lt;/strong&gt; more often. &lt;strong&gt;South Africa&lt;/strong&gt; is the one nation that refuses the trade-off, ranking top-six on reach &lt;em&gt;and&lt;/em&gt; on openness.&lt;/p&gt;
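&lt;p&gt;The ratios quoted above are simply the 100k+ share divided by the contact share for each nation. A small worked example, using figures copied from the table for six nations (the names &lt;code&gt;SHARES&lt;/code&gt; and &lt;code&gt;reach_to_contact&lt;/code&gt; are illustrative, not part of the dataset):&lt;/p&gt;

```python
# (share with 100k+ followers, share listing a contact), in percent,
# copied from the article's table for a handful of nations.
SHARES = {
    "Egypt": (30.4, 2.4),
    "Saudi Arabia": (30.0, 2.3),
    "Ecuador": (26.3, 1.1),
    "South Africa": (26.5, 11.8),
    "United States": (22.5, 9.7),
    "England (GB)": (23.3, 15.4),
}


def reach_to_contact(nation):
    """How many 100k+ creators there are per creator who lists a contact."""
    reach, contact = SHARES[nation]
    return reach / contact


# Print nations from the most closed inbox to the most open one.
for nation in sorted(SHARES, key=reach_to_contact, reverse=True):
    print(f"{nation:15s} {reach_to_contact(nation):5.1f} to 1")
```

&lt;p&gt;Because both shares use the same per-nation denominator, the ratio is independent of how many creators each nation has in the sample.&lt;/p&gt;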

&lt;h2&gt;
  
  
  The verified elite — and where sport sits
&lt;/h2&gt;

&lt;p&gt;Verification is rarer than reach: only &lt;strong&gt;2.73%&lt;/strong&gt; of all 3.33M creators carry a badge. Among World Cup nations, Sweden (5.8%), Egypt (4.6%), Australia (4.5%) and England (4.3%) lead.&lt;/p&gt;

&lt;p&gt;One cut is worth isolating for a football tournament: the &lt;strong&gt;sports&lt;/strong&gt; niche. Across the dataset, sports creators are a verified elite — &lt;strong&gt;5.9%&lt;/strong&gt; carry a badge, &lt;strong&gt;2.2× the 2.73% baseline&lt;/strong&gt;. They are who they say they are far more often than the average creator — though, tellingly, no easier to reach: their contact rate (3.2%) sits below the all-creator average. A verified badge signals legitimacy, not availability.&lt;/p&gt;

&lt;p&gt;Full table — all 26 measurable nations&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;World Cup 2026 TikTok creators by nation&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Nation&lt;/th&gt;
&lt;th&gt;Confed.&lt;/th&gt;
&lt;th&gt;Creators&lt;/th&gt;
&lt;th&gt;Median followers&lt;/th&gt;
&lt;th&gt;100k+ reach&lt;/th&gt;
&lt;th&gt;1M+ share&lt;/th&gt;
&lt;th&gt;Verified&lt;/th&gt;
&lt;th&gt;Lists a contact&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Saudi Arabia&lt;/td&gt;
&lt;td&gt;AFC&lt;/td&gt;
&lt;td&gt;3,278&lt;/td&gt;
&lt;td&gt;45,435&lt;/td&gt;
&lt;td&gt;30.0%&lt;/td&gt;
&lt;td&gt;2.81%&lt;/td&gt;
&lt;td&gt;3.1%&lt;/td&gt;
&lt;td&gt;2.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Egypt&lt;/td&gt;
&lt;td&gt;CAF&lt;/td&gt;
&lt;td&gt;1,843&lt;/td&gt;
&lt;td&gt;42,200&lt;/td&gt;
&lt;td&gt;30.4%&lt;/td&gt;
&lt;td&gt;3.96%&lt;/td&gt;
&lt;td&gt;4.6%&lt;/td&gt;
&lt;td&gt;2.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Colombia&lt;/td&gt;
&lt;td&gt;CONMEBOL&lt;/td&gt;
&lt;td&gt;7,931&lt;/td&gt;
&lt;td&gt;41,181&lt;/td&gt;
&lt;td&gt;27.5%&lt;/td&gt;
&lt;td&gt;3.16%&lt;/td&gt;
&lt;td&gt;2.3%&lt;/td&gt;
&lt;td&gt;2.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ecuador&lt;/td&gt;
&lt;td&gt;CONMEBOL&lt;/td&gt;
&lt;td&gt;5,254&lt;/td&gt;
&lt;td&gt;40,677&lt;/td&gt;
&lt;td&gt;26.3%&lt;/td&gt;
&lt;td&gt;2.04%&lt;/td&gt;
&lt;td&gt;1.3%&lt;/td&gt;
&lt;td&gt;1.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;South Africa&lt;/td&gt;
&lt;td&gt;CAF&lt;/td&gt;
&lt;td&gt;2,028&lt;/td&gt;
&lt;td&gt;39,664&lt;/td&gt;
&lt;td&gt;26.5%&lt;/td&gt;
&lt;td&gt;2.51%&lt;/td&gt;
&lt;td&gt;3.6%&lt;/td&gt;
&lt;td&gt;11.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Korea Republic&lt;/td&gt;
&lt;td&gt;AFC&lt;/td&gt;
&lt;td&gt;4,774&lt;/td&gt;
&lt;td&gt;39,027&lt;/td&gt;
&lt;td&gt;26.1%&lt;/td&gt;
&lt;td&gt;3.35%&lt;/td&gt;
&lt;td&gt;3.6%&lt;/td&gt;
&lt;td&gt;6.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Japan&lt;/td&gt;
&lt;td&gt;AFC&lt;/td&gt;
&lt;td&gt;3,756&lt;/td&gt;
&lt;td&gt;37,595&lt;/td&gt;
&lt;td&gt;25.7%&lt;/td&gt;
&lt;td&gt;2.26%&lt;/td&gt;
&lt;td&gt;3.9%&lt;/td&gt;
&lt;td&gt;3.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Morocco&lt;/td&gt;
&lt;td&gt;CAF&lt;/td&gt;
&lt;td&gt;4,050&lt;/td&gt;
&lt;td&gt;37,104&lt;/td&gt;
&lt;td&gt;24.3%&lt;/td&gt;
&lt;td&gt;2.12%&lt;/td&gt;
&lt;td&gt;1.7%&lt;/td&gt;
&lt;td&gt;5.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Brazil&lt;/td&gt;
&lt;td&gt;CONMEBOL&lt;/td&gt;
&lt;td&gt;15,541&lt;/td&gt;
&lt;td&gt;37,000&lt;/td&gt;
&lt;td&gt;24.6%&lt;/td&gt;
&lt;td&gt;2.54%&lt;/td&gt;
&lt;td&gt;2.5%&lt;/td&gt;
&lt;td&gt;3.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;England (GB)&lt;/td&gt;
&lt;td&gt;UEFA&lt;/td&gt;
&lt;td&gt;13,717&lt;/td&gt;
&lt;td&gt;34,179&lt;/td&gt;
&lt;td&gt;23.3%&lt;/td&gt;
&lt;td&gt;2.25%&lt;/td&gt;
&lt;td&gt;4.3%&lt;/td&gt;
&lt;td&gt;15.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Argentina&lt;/td&gt;
&lt;td&gt;CONMEBOL&lt;/td&gt;
&lt;td&gt;8,375&lt;/td&gt;
&lt;td&gt;33,848&lt;/td&gt;
&lt;td&gt;22.0%&lt;/td&gt;
&lt;td&gt;2.10%&lt;/td&gt;
&lt;td&gt;1.9%&lt;/td&gt;
&lt;td&gt;1.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;France&lt;/td&gt;
&lt;td&gt;UEFA&lt;/td&gt;
&lt;td&gt;7,061&lt;/td&gt;
&lt;td&gt;33,603&lt;/td&gt;
&lt;td&gt;24.4%&lt;/td&gt;
&lt;td&gt;2.22%&lt;/td&gt;
&lt;td&gt;4.0%&lt;/td&gt;
&lt;td&gt;8.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;United States&lt;/td&gt;
&lt;td&gt;Host&lt;/td&gt;
&lt;td&gt;22,636&lt;/td&gt;
&lt;td&gt;33,338&lt;/td&gt;
&lt;td&gt;22.5%&lt;/td&gt;
&lt;td&gt;1.94%&lt;/td&gt;
&lt;td&gt;2.9%&lt;/td&gt;
&lt;td&gt;9.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Germany&lt;/td&gt;
&lt;td&gt;UEFA&lt;/td&gt;
&lt;td&gt;6,939&lt;/td&gt;
&lt;td&gt;32,116&lt;/td&gt;
&lt;td&gt;21.9%&lt;/td&gt;
&lt;td&gt;2.16%&lt;/td&gt;
&lt;td&gt;3.6%&lt;/td&gt;
&lt;td&gt;6.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Spain&lt;/td&gt;
&lt;td&gt;UEFA&lt;/td&gt;
&lt;td&gt;6,670&lt;/td&gt;
&lt;td&gt;31,722&lt;/td&gt;
&lt;td&gt;22.3%&lt;/td&gt;
&lt;td&gt;2.43%&lt;/td&gt;
&lt;td&gt;2.7%&lt;/td&gt;
&lt;td&gt;7.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Belgium&lt;/td&gt;
&lt;td&gt;UEFA&lt;/td&gt;
&lt;td&gt;1,014&lt;/td&gt;
&lt;td&gt;31,550&lt;/td&gt;
&lt;td&gt;22.1%&lt;/td&gt;
&lt;td&gt;1.58%&lt;/td&gt;
&lt;td&gt;2.7%&lt;/td&gt;
&lt;td&gt;8.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sweden&lt;/td&gt;
&lt;td&gt;UEFA&lt;/td&gt;
&lt;td&gt;895&lt;/td&gt;
&lt;td&gt;31,400&lt;/td&gt;
&lt;td&gt;23.8%&lt;/td&gt;
&lt;td&gt;2.01%&lt;/td&gt;
&lt;td&gt;5.8%&lt;/td&gt;
&lt;td&gt;8.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mexico&lt;/td&gt;
&lt;td&gt;Host&lt;/td&gt;
&lt;td&gt;25,173&lt;/td&gt;
&lt;td&gt;31,113&lt;/td&gt;
&lt;td&gt;20.7%&lt;/td&gt;
&lt;td&gt;2.00%&lt;/td&gt;
&lt;td&gt;1.7%&lt;/td&gt;
&lt;td&gt;3.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Australia&lt;/td&gt;
&lt;td&gt;AFC&lt;/td&gt;
&lt;td&gt;6,023&lt;/td&gt;
&lt;td&gt;29,344&lt;/td&gt;
&lt;td&gt;20.5%&lt;/td&gt;
&lt;td&gt;1.76%&lt;/td&gt;
&lt;td&gt;4.5%&lt;/td&gt;
&lt;td&gt;8.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Canada&lt;/td&gt;
&lt;td&gt;Host&lt;/td&gt;
&lt;td&gt;11,779&lt;/td&gt;
&lt;td&gt;29,286&lt;/td&gt;
&lt;td&gt;20.1%&lt;/td&gt;
&lt;td&gt;1.71%&lt;/td&gt;
&lt;td&gt;2.8%&lt;/td&gt;
&lt;td&gt;11.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Switzerland&lt;/td&gt;
&lt;td&gt;UEFA&lt;/td&gt;
&lt;td&gt;1,101&lt;/td&gt;
&lt;td&gt;29,000&lt;/td&gt;
&lt;td&gt;22.1%&lt;/td&gt;
&lt;td&gt;2.45%&lt;/td&gt;
&lt;td&gt;3.1%&lt;/td&gt;
&lt;td&gt;9.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Austria&lt;/td&gt;
&lt;td&gt;UEFA&lt;/td&gt;
&lt;td&gt;971&lt;/td&gt;
&lt;td&gt;29,000&lt;/td&gt;
&lt;td&gt;18.8%&lt;/td&gt;
&lt;td&gt;1.13%&lt;/td&gt;
&lt;td&gt;3.4%&lt;/td&gt;
&lt;td&gt;3.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Netherlands&lt;/td&gt;
&lt;td&gt;UEFA&lt;/td&gt;
&lt;td&gt;1,960&lt;/td&gt;
&lt;td&gt;28,600&lt;/td&gt;
&lt;td&gt;18.6%&lt;/td&gt;
&lt;td&gt;1.48%&lt;/td&gt;
&lt;td&gt;4.0%&lt;/td&gt;
&lt;td&gt;10.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Norway&lt;/td&gt;
&lt;td&gt;UEFA&lt;/td&gt;
&lt;td&gt;742&lt;/td&gt;
&lt;td&gt;27,500&lt;/td&gt;
&lt;td&gt;21.2%&lt;/td&gt;
&lt;td&gt;1.21%&lt;/td&gt;
&lt;td&gt;3.0%&lt;/td&gt;
&lt;td&gt;11.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Portugal&lt;/td&gt;
&lt;td&gt;UEFA&lt;/td&gt;
&lt;td&gt;2,395&lt;/td&gt;
&lt;td&gt;27,462&lt;/td&gt;
&lt;td&gt;16.2%&lt;/td&gt;
&lt;td&gt;1.13%&lt;/td&gt;
&lt;td&gt;2.3%&lt;/td&gt;
&lt;td&gt;5.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Türkiye&lt;/td&gt;
&lt;td&gt;UEFA&lt;/td&gt;
&lt;td&gt;4,645&lt;/td&gt;
&lt;td&gt;26,164&lt;/td&gt;
&lt;td&gt;18.8%&lt;/td&gt;
&lt;td&gt;1.40%&lt;/td&gt;
&lt;td&gt;3.2%&lt;/td&gt;
&lt;td&gt;2.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;TikTok creators by World Cup 2026 nation, sorted by median follower count. n=170,551 creators with a detected country, June 2026. Source: Crawlora creators_search.&lt;/em&gt;&lt;/p&gt;
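&lt;p&gt;For readers who want to rebuild a table like this from row-level data, here is a minimal sketch of the per-nation rollup. It assumes each creator row is a dict with hypothetical keys &lt;code&gt;country&lt;/code&gt;, &lt;code&gt;followers&lt;/code&gt;, &lt;code&gt;verified&lt;/code&gt; and &lt;code&gt;contact&lt;/code&gt;, and that "100k+" includes exactly 100,000; the open dataset's real column names and edge-case handling may differ.&lt;/p&gt;

```python
from bisect import bisect_left
from collections import defaultdict
from statistics import median

MIN_CREATORS = 300  # ranking floor stated in the methodology


def rollup(creators, min_creators=MIN_CREATORS):
    """Aggregate creator rows into one summary row per country."""
    by_country = defaultdict(list)
    for row in creators:
        by_country[row["country"]].append(row)

    table = {}
    for country, rows in by_country.items():
        n = len(rows)
        if n in range(min_creators):
            continue  # fewer creators than the floor: too thin to rank
        followers = sorted(r["followers"] for r in rows)
        table[country] = {
            "creators": n,
            "median_followers": median(followers),
            # bisect_left counts the values strictly below the threshold.
            "share_100k": (n - bisect_left(followers, 100_000)) / n,
            "share_1m": (n - bisect_left(followers, 1_000_000)) / n,
            "share_verified": sum(bool(r["verified"]) for r in rows) / n,
            "share_contact": sum(bool(r["contact"]) for r in rows) / n,
        }
    return table
```

&lt;p&gt;Only the aggregated rows leave the function, which matches the aggregate-only publishing rule: no individual creator appears in the output.&lt;/p&gt;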

&lt;h2&gt;
  
  
  What it tells us
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Fans: the World Cup you scroll on TikTok is shaped by where the big creators actually are — and that is as much the Gulf, North Africa and the Andes as it is Europe or Brazil. Saudi Arabia and Egypt punch far above their football weight on the For You page.&lt;/li&gt;
&lt;li&gt;On the 'USA dominates' myth: true for celebrities, not for the median creator. America owns the megastar tier and a huge creator base, but its typical creator is mid-table — a reminder that 'most-followed accounts' and 'biggest creators overall' are different leaderboards.&lt;/li&gt;
&lt;li&gt;Creators: a giant following and a way to reach you almost never travel together. In most nations, simply listing a contact puts you in a small, valuable minority — nowhere more so than in the big-creator markets of the Gulf and South America.&lt;/li&gt;
&lt;li&gt;Analysts &amp;amp; journalists: rank creator economies by the typical creator, not the top 10. Reach and reachability are weakly — often inversely — correlated across these 26 nations, so a 'star power' map and a 'who can you actually work with' map look nothing alike.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A 48-team World Cup will mint thousands of overnight creator moments. Where the biggest of them already live — and which of them you could ever actually reach — turns out to be one of the more surprising maps of the tournament.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://crawlora.net/tiktok-creators-index?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;Crawlora TikTok Creators dataset (creators_search, 3.33M creators) — the data behind this study&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://crawlora.net/platforms/tiktok?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;Crawlora TikTok hashtag data (#worldcup2026: 25.3B views, 1.98M videos)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/List_of_most-followed_TikTok_accounts" rel="noopener noreferrer"&gt;Most-followed TikTok accounts (US dominance of the celebrity tier)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.fifa.com/en/tournaments/mens/worldcup/canadamexicousa2026/teams" rel="noopener noreferrer"&gt;FIFA — qualified teams for the 2026 World Cup&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Crawlora-org/world-cup-creators-data" rel="noopener noreferrer"&gt;Open dataset (CC BY 4.0): World Cup 2026 Creator Index&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Frequently asked questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How big is the 2026 World Cup on TikTok?
&lt;/h3&gt;

&lt;p&gt;Enormous. The #worldcup2026 hashtag alone has 25.3 billion views across 1.98 million videos as of June 2026 — before the knockout rounds — and that is one of many World Cup tags. The tournament is the first 48-team edition, hosted across the United States, Canada and Mexico.&lt;/p&gt;

&lt;h3&gt;
  
  
  Which World Cup country has the biggest TikTok creators?
&lt;/h3&gt;

&lt;p&gt;Measured by median follower count, the typical creator is largest in Saudi Arabia (about 45,000 followers), Egypt (42,000), Colombia (41,000), Ecuador (41,000), South Africa (40,000) and South Korea (39,000), all top-heavy creator economies. The footballing and creator giants are smaller: Germany and Spain about 32,000, the United States 33,000, England 34,000. Egypt and Saudi Arabia also have the highest share of creators above 100,000 followers (about 30%).&lt;/p&gt;

&lt;h3&gt;
  
  
  Does the USA dominate TikTok?
&lt;/h3&gt;

&lt;p&gt;Only at the very top. The United States holds 11 of the 15 most-followed TikTok accounts in the world, but among World Cup nations its typical creator (median about 33,000 followers) is mid-table, smaller than twelve of the 26 ranked nations: Saudi Arabia, Egypt, Colombia, Ecuador, South Africa, South Korea, Japan, Morocco, Brazil, England, Argentina and France. The US wins on celebrity megastars, not on the size of the average creator.&lt;/p&gt;

&lt;h3&gt;
  
  
  Are the biggest TikTok creators the easiest to reach?
&lt;/h3&gt;

&lt;p&gt;No — it's almost the opposite. The nations with the biggest creators are the hardest to contact: in Saudi Arabia, Egypt, Ecuador and Argentina only 1–3% of creators list any public contact, versus 10–15% in England, Canada, Norway and the Netherlands. In Ecuador a creator is about 24 times more likely to clear 100,000 followers than to list a way to reach them. Giant audiences, closed inboxes — South Africa is the rare nation strong on both.&lt;/p&gt;

&lt;h3&gt;
  
  
  How was this World Cup creator study measured?
&lt;/h3&gt;

&lt;p&gt;We filtered Crawlora's creators_search dataset (3,327,485 discoverable TikTok creators, June 2026 snapshot) to the 48 nations qualified for the 2026 World Cup and rolled up, per nation, the median follower count, brand-tier reach (100k+/1M+), the verified share and the share listing a public contact. Country comes primarily from TikTok's account region, with a best-effort bio fallback, and is present on about 7.7% of creators, so 26 of the 48 nations have enough data to rank (170,551 creators). The figures are aggregate-only and the dataset is open and reproducible.&lt;/p&gt;

&lt;h3&gt;
  
  
  Which World Cup nations could not be measured?
&lt;/h3&gt;

&lt;p&gt;22 of the 48 qualified nations — including Uruguay, Croatia, Senegal, Ghana, Tunisia, Paraguay and debutants like Curaçao and Cabo Verde — have too few creators with a detected country to rank reliably. Country detection in the dataset skews toward larger English, Spanish and Portuguese-speaking markets, so smaller or less-represented nations fall below the 300-creator floor we require.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://crawlora.net/blog/world-cup-creators-2026?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;crawlora.net&lt;/a&gt;. &lt;a href="https://crawlora.net/?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;Crawlora&lt;/a&gt; is a structured web-data, search, and anti-bot API — dozens of platforms as normalized JSON, plus a hosted MCP server, with a free tier (no card).&lt;/em&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>data</category>
      <category>tiktok</category>
      <category>worldcup</category>
    </item>
    <item>
      <title>The TikTok World Cup: Who the Internet Actually Talks About</title>
      <dc:creator>Tony Wang</dc:creator>
      <pubDate>Mon, 29 Jun 2026 06:02:03 +0000</pubDate>
      <link>https://dev.to/tonywangca/the-tiktok-world-cup-who-the-internet-actually-talks-about-54go</link>
      <guid>https://dev.to/tonywangca/the-tiktok-world-cup-who-the-internet-actually-talks-about-54go</guid>
      <description>&lt;p&gt;&lt;strong&gt;Key takeaways&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The 2026 World Cup is a TikTok event before it is a stadium one — #worldcup2026 already has 25.4 billion views. So we ranked who the platform actually talks about, by hashtag views and by how many creators reference them — not by follower counts.&lt;/li&gt;
&lt;li&gt;On TikTok, Messi beats Ronaldo. #messi has 700.7 billion views to #ronaldo's 580.6B, even though Ronaldo crushes Messi on Instagram followers (665M vs 506M). The internet makes more content about Messi than about anyone alive.&lt;/li&gt;
&lt;li&gt;An 18-year-old is rewriting the order. #lamineyamal (147.5B views) already outranks Haaland, Bellingham, Salah and Modrić, and Yamal is TikTok's single most-followed footballer (38.3M). TikTok's football attention is generational.&lt;/li&gt;
&lt;li&gt;Argentina is the TikTok World Cup's biggest nation — #argentina has 631B football-tag views, ahead of France (576B), and far ahead of Brazil and England (~197B each). (We excluded #mexico and #japan: they're dominated by non-football culture.)&lt;/li&gt;
&lt;li&gt;The fan-identity social graph has one giant edge: 161 creators name BOTH Messi and Ronaldo in their bio — 2.5× any other link. The GOAT debate is the gravitational centre. Players fuse with their nation (Messi–Argentina, Neymar–Brazil), and the viral newcomers aren't woven in yet.&lt;/li&gt;
&lt;li&gt;Why TikTok's map differs from Instagram's: the algorithm rewards meme-ability, youth and authenticity over stature — follower count isn't even a ranking factor — so flair players (Yamal, Neymar, Vini) lead, not the all-platform giants. FIFA made TikTok its official video partner for 2026.&lt;/li&gt;
&lt;li&gt;Aggregate-only — views and counts by player and nation, never individual creators — and the underlying data is open (CC BY 4.0) and refreshable.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every World Cup now has two tournaments. One is played on grass. The other plays out on a billion phones, in fifteen-second clips, and it has its own table — one that does not always agree with the FIFA rankings or even with Instagram. We went to measure it: across the 48 qualified nations, &lt;strong&gt;which players and teams does TikTok actually talk about?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not "who has the most followers" — that list is published every year and it is boring. We measured two things instead: &lt;strong&gt;how many views each player's and nation's hashtag has racked up&lt;/strong&gt; (the size of the conversation), and &lt;strong&gt;how many of the platform's 3.33 million catalogued creators reference each one in their bio&lt;/strong&gt; (the depth of the fandom). The gap between those two is where the story lives.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Methodology &amp;amp; definitions at a glance&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hashtag views:&lt;/strong&gt; pulled live from TikTok via Crawlora's hashtag endpoint (June 2026), for a curated, football-specific tag per player and nation. These are cumulative, all-time view totals — the size of each entity's TikTok footprint.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Creator-mention affinity &amp;amp; the social graph:&lt;/strong&gt; from Crawlora's &lt;code&gt;creators_search&lt;/code&gt; dataset (3,327,485 discoverable creators; a bio is present on &lt;strong&gt;92.5%&lt;/strong&gt; of them — far higher coverage than profile-country). We count how many creators name each player/nation in their bio or display name, and how often two are named &lt;em&gt;together&lt;/em&gt; (the co-mention edge).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Caveats we designed around:&lt;/strong&gt; nation tags double as country-culture tags — &lt;strong&gt;#mexico (1.06T) and #japan (355B) are dominated by travel/food/anime content, not football, so we exclude them&lt;/strong&gt; from the football ranking. #salah is inflated because "salah" also means "wrong" in Indonesian. View totals are cumulative, not tournament-only.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aggregate only:&lt;/strong&gt; we publish per-entity totals and edge weights, never individual creators. The data is open (CC BY 4.0) and refreshable.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
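&lt;p&gt;The creator-mention counts and co-mention edges described above can be sketched in a few lines. This is an illustration under stated assumptions, not the study's code: the &lt;code&gt;ALIASES&lt;/code&gt; lists, the whole-word matching rule and the function names are all invented for the example.&lt;/p&gt;

```python
import re
from collections import Counter
from itertools import combinations

# Illustrative alias lists, not the study's actual matching rules.
ALIASES = {
    "Messi": ["messi"],
    "Ronaldo": ["ronaldo", "cr7"],
    "Argentina": ["argentina"],
}


def entities_in(text):
    """Return the set of entities whose alias appears as a whole word."""
    words = set(re.findall(r"\w+", text.lower()))
    return {name for name, aliases in ALIASES.items()
            if words.intersection(aliases)}


def build_graph(bios):
    """Count per-entity mentions and pairwise co-mentions across bios."""
    mentions, edges = Counter(), Counter()
    for bio in bios:
        # Sorting gives each pair one canonical (a, b) ordering.
        found = sorted(entities_in(bio))
        mentions.update(found)
        edges.update(combinations(found, 2))
    return mentions, edges
```

&lt;p&gt;Each creator contributes at most one count per entity and one per pair, so an edge weight reads as "number of creators who name both", which is how the Messi and Ronaldo figure is reported.&lt;/p&gt;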

&lt;h2&gt;
  
  
  The players TikTok can't stop posting about
&lt;/h2&gt;

&lt;p&gt;Ranked by hashtag views, the top of the board is a Messi–Ronaldo–Neymar wall — and then a teenager:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fymktz420kfhy1jkyw1xh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fymktz420kfhy1jkyw1xh.png" alt="TikTok hashtag views by footballer, billions: Messi 700.7, Ronaldo 580.6, Neymar 401.6, Mbappé 269.0, Lamine Yamal 147.5, Haaland 59.4, Vinícius Jr 54.4, Bellingham 51.2, Salah 37.0, Dembélé 23.8, Endrick 15.6, Griezmann 12.7." width="800" height="745"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Two things jump out. First, &lt;strong&gt;an 18-year-old (Lamine Yamal) sits fifth, ahead of Haaland, Bellingham, Salah and Modrić&lt;/strong&gt;; his 147.5B views are roughly level with Haaland, Bellingham and Salah combined (147.6B). He is, separately, the most-&lt;em&gt;followed&lt;/em&gt; footballer on TikTok at 38.3M, ahead of both Ronaldo and Messi. The platform's centre of gravity is visibly young. Second, this is &lt;em&gt;not&lt;/em&gt; the Instagram order: there, Ronaldo (665M followers) towers over everyone, Salah and Bellingham rank high on followers, and a teenager would not be fifth.&lt;/p&gt;

&lt;h2&gt;
  
  
  Messi beats Ronaldo — the one place he does
&lt;/h2&gt;

&lt;p&gt;The Ronaldo-vs-Messi debate has a clean, surprising answer on TikTok, and it is the opposite of the Instagram answer:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fa85ujrccipz3cketekph.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fa85ujrccipz3cketekph.png" alt="TikTok hashtag views: hashtag messi 700.7 billion versus hashtag ronaldo 580.6 billion." width="800" height="309"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It fits everything else about the platform. Ronaldo's dominance is built on &lt;strong&gt;followers&lt;/strong&gt; — an audience that subscribes to him and his lifestyle empire. Messi's TikTok lead is built on &lt;strong&gt;being made about&lt;/strong&gt; — the 2022 trophy lift, the Miami move, the GOAT edits, the memes. TikTok rewards the second thing, and it is why the #messi tag, fragmented as it is across #messi and #leomessi (another 72B), runs ahead.&lt;/p&gt;

&lt;h2&gt;
  
  
  The nations: Argentina runs the table
&lt;/h2&gt;

&lt;p&gt;Among football-specific national tags, the reigning champions lead — and the order again ignores the bookmakers:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F0d2sobixoffvq4fym5ug.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F0d2sobixoffvq4fym5ug.png" alt="TikTok hashtag views by national team, billions: Argentina 631, France 576, Germany 299, Spain 210, Brazil 197, England 196, Saudi Arabia 163, Morocco 163, Portugal 159, Croatia 50, Netherlands 42." width="800" height="691"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Argentina (631B)&lt;/strong&gt; sits clear at the top, fittingly for the holders, with &lt;strong&gt;Morocco (163B)&lt;/strong&gt; the standout overperformer — the Global-South wave from its 2022 semi-final run still moves on TikTok, level with Saudi Arabia and ahead of Croatia and the Netherlands. Brazil and England, two of the loudest football cultures on earth, land mid-table on the nation tag precisely because their TikTok energy flows through &lt;em&gt;players&lt;/em&gt; (Neymar, Vini, Bellingham) rather than the country hashtag.&lt;/p&gt;

&lt;h2&gt;
  
  
  The social graph: who gets talked about together
&lt;/h2&gt;

&lt;p&gt;We mapped which players creators name &lt;em&gt;together&lt;/em&gt; in their bios — the fan-identity network. It has one overwhelming feature:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F0gi9a0pv00w8obqodewa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F0gi9a0pv00w8obqodewa.png" alt="Player co-mention edges, number of creators naming both in bio: Messi and Ronaldo 161, Messi and Neymar 63, Ronaldo and Neymar 21, Neymar and Mbappé 13, Ronaldo and Haaland 11, Ronaldo and Mbappé 9, Messi and Haaland 7, Mbappé and Haaland 7, Messi and Yamal 6, Messi and Mbappé 6." width="800" height="637"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The graph is a solar system. &lt;strong&gt;Messi and Ronaldo are the twin suns&lt;/strong&gt; — 161 creators build their identity around &lt;em&gt;both&lt;/em&gt;, the GOAT debate made flesh. Neymar is the third body (63 links to Messi, 21 to Ronaldo), and Mbappé and Haaland orbit the edges. Two more patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Players fuse with their nation. The cleanest one-way ties are Messi → Argentina (52 creators name both), Neymar → Brazil (27) and Ronaldo → Portugal (5). Fans don't just love the player — they wear the flag with him.&lt;/li&gt;
&lt;li&gt;The viral newcomers aren't in the graph yet. Lamine Yamal pulls 147.5B views but only ~6 co-mention links and 32 bio-identity accounts; Vinícius, Endrick and Haaland are near-isolates. They're watched, not yet woven in — devotion lags reach by a generation.&lt;/li&gt;
&lt;/ul&gt;
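&lt;p&gt;One way to make the "twin suns" reading concrete is to sum the edge weights touching each player. The sketch below uses only the ten edges listed in the chart above; the weighted-degree measure and the function name are our illustrative choices, not necessarily how the original graph was analysed.&lt;/p&gt;

```python
from collections import defaultdict

# Co-mention edges from the chart above:
# (player_a, player_b, number of creators naming both in their bio).
EDGES = [
    ("Messi", "Ronaldo", 161),
    ("Messi", "Neymar", 63),
    ("Ronaldo", "Neymar", 21),
    ("Neymar", "Mbappé", 13),
    ("Ronaldo", "Haaland", 11),
    ("Ronaldo", "Mbappé", 9),
    ("Messi", "Haaland", 7),
    ("Mbappé", "Haaland", 7),
    ("Messi", "Yamal", 6),
    ("Messi", "Mbappé", 6),
]


def weighted_degree(edges):
    """Sum the weights of all edges touching each player (undirected graph)."""
    totals = defaultdict(int)
    for a, b, weight in edges:
        totals[a] += weight
        totals[b] += weight
    return dict(totals)


degree = weighted_degree(EDGES)
ranking = sorted(degree.items(), key=lambda item: item[1], reverse=True)
for player, total in ranking:
    print(f"{player:8s} {total}")
```

&lt;p&gt;On these ten edges Messi (243) and Ronaldo (202) carry more than twice the weight of Neymar (97), while Yamal's only listed link is the one to Messi (6).&lt;/p&gt;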

&lt;h2&gt;
  
  
  Viral isn't the same as beloved
&lt;/h2&gt;

&lt;p&gt;That last point is the whole study in one line. Put the two metrics side by side and they diverge hard. Messi has 700.7B views &lt;strong&gt;and&lt;/strong&gt; 1,705 creators who put him in their bio: reach &lt;em&gt;and&lt;/em&gt; devotion. Lamine Yamal has about a fifth of Messi's views (147.5B) but only &lt;strong&gt;32&lt;/strong&gt; identity accounts, under 2% of Messi's count. Ronaldo has 1,112 and Neymar 911. The legends own the deep fandom; the new stars own the feed.&lt;/p&gt;

&lt;p&gt;Full table — players by TikTok views and creator devotion&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2026 World Cup footballers by TikTok hashtag views and creator-mention affinity&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Player&lt;/th&gt;
&lt;th&gt;Nation&lt;/th&gt;
&lt;th&gt;TikTok views&lt;/th&gt;
&lt;th&gt;Videos&lt;/th&gt;
&lt;th&gt;Creators naming them&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Lionel Messi&lt;/td&gt;
&lt;td&gt;Argentina&lt;/td&gt;
&lt;td&gt;700.7B&lt;/td&gt;
&lt;td&gt;24.1M&lt;/td&gt;
&lt;td&gt;1,705&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cristiano Ronaldo&lt;/td&gt;
&lt;td&gt;Portugal&lt;/td&gt;
&lt;td&gt;580.6B&lt;/td&gt;
&lt;td&gt;21.4M&lt;/td&gt;
&lt;td&gt;1,112&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Neymar Jr&lt;/td&gt;
&lt;td&gt;Brazil&lt;/td&gt;
&lt;td&gt;401.6B&lt;/td&gt;
&lt;td&gt;16.7M&lt;/td&gt;
&lt;td&gt;911&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kylian Mbappé&lt;/td&gt;
&lt;td&gt;France&lt;/td&gt;
&lt;td&gt;269.0B&lt;/td&gt;
&lt;td&gt;6.2M&lt;/td&gt;
&lt;td&gt;102&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lamine Yamal&lt;/td&gt;
&lt;td&gt;Spain&lt;/td&gt;
&lt;td&gt;147.5B&lt;/td&gt;
&lt;td&gt;4.4M&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Erling Haaland&lt;/td&gt;
&lt;td&gt;Norway&lt;/td&gt;
&lt;td&gt;59.4B&lt;/td&gt;
&lt;td&gt;1.2M&lt;/td&gt;
&lt;td&gt;51&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vinícius Jr&lt;/td&gt;
&lt;td&gt;Brazil&lt;/td&gt;
&lt;td&gt;54.4B&lt;/td&gt;
&lt;td&gt;1.4M&lt;/td&gt;
&lt;td&gt;158&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jude Bellingham&lt;/td&gt;
&lt;td&gt;England&lt;/td&gt;
&lt;td&gt;51.2B&lt;/td&gt;
&lt;td&gt;1.3M&lt;/td&gt;
&lt;td&gt;42&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mohamed Salah&lt;/td&gt;
&lt;td&gt;Egypt&lt;/td&gt;
&lt;td&gt;37.0B&lt;/td&gt;
&lt;td&gt;1.7M&lt;/td&gt;
&lt;td&gt;316&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ousmane Dembélé&lt;/td&gt;
&lt;td&gt;France&lt;/td&gt;
&lt;td&gt;23.8B&lt;/td&gt;
&lt;td&gt;0.6M&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Endrick&lt;/td&gt;
&lt;td&gt;Brazil&lt;/td&gt;
&lt;td&gt;15.6B&lt;/td&gt;
&lt;td&gt;0.3M&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Antoine Griezmann&lt;/td&gt;
&lt;td&gt;France&lt;/td&gt;
&lt;td&gt;12.7B&lt;/td&gt;
&lt;td&gt;0.3M&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Players ranked by cumulative TikTok hashtag views, with the count of creators naming them in their bio/handle. June 2026. Source: Crawlora tiktok_challenge + creators_search.&lt;/em&gt;&lt;/p&gt;
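&lt;p&gt;To compare reach and devotion on one scale, the table can be reduced to a single ratio: creators naming a player per billion hashtag views. This is a minimal sketch using the figures in the table; the ratio itself is our illustrative metric, not one published in the dataset.&lt;/p&gt;

```python
# (player, hashtag views in billions, creators naming them in bio/handle),
# copied from the table above.
PLAYERS = [
    ("Lionel Messi", 700.7, 1705),
    ("Cristiano Ronaldo", 580.6, 1112),
    ("Neymar Jr", 401.6, 911),
    ("Kylian Mbappé", 269.0, 102),
    ("Lamine Yamal", 147.5, 32),
    ("Erling Haaland", 59.4, 51),
    ("Vinícius Jr", 54.4, 158),
    ("Jude Bellingham", 51.2, 42),
    ("Mohamed Salah", 37.0, 316),
    ("Ousmane Dembélé", 23.8, 6),
    ("Endrick", 15.6, 3),
    ("Antoine Griezmann", 12.7, 20),
]


def devotion_per_billion(players):
    """Return {player: creators naming them per billion hashtag views}."""
    return {name: creators / views_b for name, views_b, creators in players}


ratios = devotion_per_billion(PLAYERS)
by_ratio = sorted(ratios.items(), key=lambda item: item[1], reverse=True)
for name, ratio in by_ratio:
    print(f"{name:18s} {ratio:5.2f}")
```

&lt;p&gt;On this measure Messi comes out at about 2.4 creators per billion views and Lamine Yamal at about 0.2, which is the reach-versus-devotion gap in one number. Because #salah views are inflated by the Indonesian word, Salah's ratio is, if anything, understated.&lt;/p&gt;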

&lt;h2&gt;
  
  
  Why TikTok's football map looks nothing like Instagram's
&lt;/h2&gt;

&lt;p&gt;The order keeps surprising because TikTok is a fundamentally different machine. On Instagram, reach follows the follower graph — big accounts get seen, so Ronaldo's 665M compounds. On TikTok, &lt;strong&gt;follower count isn't a ranking factor at all&lt;/strong&gt;; the algorithm serves clips by engagement, so an unknown's celebration can out-travel a superstar's polished post. That is why the platform's leaderboard skews young, Brazilian and meme-friendly — Yamal's "everyday moments", Vini's touchline dances remixed into multilingual edits, Neymar's everything-at-once persona — rather than toward stature.&lt;/p&gt;

&lt;p&gt;It is also why this is the World Cup that institutionalised it: &lt;strong&gt;FIFA named TikTok its official video content partner for 2026&lt;/strong&gt;, with a 30-creator correspondent programme and in-app match hubs. The conversation we measured isn't a sideshow to the tournament — increasingly, for a billion young fans, it &lt;em&gt;is&lt;/em&gt; the tournament.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it tells us
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Fans &amp;amp; journalists: 'the most popular footballer' depends entirely on the platform. Ronaldo wins Instagram followers; Messi wins TikTok views; Lamine Yamal wins TikTok followers. There is no single leaderboard — there are three, and they disagree.&lt;/li&gt;
&lt;li&gt;The generation has turned. A teenager out-hashtags every veteran except the four all-time giants, and is the most-followed footballer on the app. Watch the gap between his reach (huge) and his fan-identity depth (still tiny) close over the next few years.&lt;/li&gt;
&lt;li&gt;Nations live through their players. Brazil and England under-index on the country tag because their TikTok energy is carried by Neymar, Vini and Bellingham. Argentina and Morocco over-index because the nation itself became the story (2022 title; 2022 semi-final run).&lt;/li&gt;
&lt;li&gt;Data buyers: 'who's followed' and 'who's talked about' are different products. Follower counts are a vanity census; hashtag views and creator co-mentions measure cultural footprint — which is what actually predicts a viral tournament moment.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When the first ball is kicked in June 2026, the grass tournament and the phone tournament will run in parallel — and they will crown different winners. On the pitch, anyone can win. On TikTok, the table is already drawn: Messi on top, a teenager rising fast, and a billion creators arguing about both.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://crawlora.net/platforms/tiktok?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;Crawlora TikTok hashtag data (tiktok_challenge) — the view/video counts behind this study&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://crawlora.net/tiktok-creators-index?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;Crawlora TikTok Creators dataset (creators_search, 3.33M creators) — affinity + social graph&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.edigitalagency.com.au/tiktok/top-10-most-popular-footballers-on-tiktok/" rel="noopener noreferrer"&gt;Most-followed footballers on TikTok 2026 (Lamine Yamal #1 at 38.3M)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.espn.com/soccer/story/_/id/47552269/fifa-tiktok-video-content-partner-2026-world-cup" rel="noopener noreferrer"&gt;FIFA names TikTok official video content partner for the 2026 World Cup (ESPN)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Crawlora-org/tiktok-world-cup-data" rel="noopener noreferrer"&gt;Open dataset (CC BY 4.0): TikTok World Cup 2026 Index&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Frequently asked questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Who is the most popular footballer on TikTok in 2026?
&lt;/h3&gt;

&lt;p&gt;It depends what you measure. By hashtag views, #messi leads with 700.7 billion, ahead of #ronaldo (580.6B) and #neymar (401.6B). By follower count, the most-followed footballer on TikTok is the teenager Lamine Yamal (38.3M), ahead of both Ronaldo and Messi. And on Instagram, Ronaldo dominates with about 665 million followers. There is no single leaderboard: the three metrics crown three different winners.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does Messi or Ronaldo dominate TikTok?
&lt;/h3&gt;

&lt;p&gt;Messi, by the size of the conversation. The #messi hashtag has roughly 700.7 billion views to #ronaldo's 580.6 billion — about 120 billion more — even though Ronaldo has more Instagram followers (665M vs 506M). TikTok talks more about Messi; Instagram follows more of Ronaldo.&lt;/p&gt;

&lt;h3&gt;
  
  
  Which World Cup nation is biggest on TikTok?
&lt;/h3&gt;

&lt;p&gt;Among football-specific national hashtags, #argentina leads with about 631 billion views (fitting for the reigning champions), ahead of France (576B), and well ahead of Brazil and England (~197B each). Morocco overperforms at 163B, a lingering effect of its 2022 semi-final run. We exclude #mexico (1.06 trillion) and #japan (355B) because those tags are dominated by non-football culture.&lt;/p&gt;

&lt;h3&gt;
  
  
  How popular is Lamine Yamal on TikTok?
&lt;/h3&gt;

&lt;p&gt;Very, and disproportionately so for his age. The #lamineyamal hashtag has 147.5 billion views, fifth among the footballers we tracked and ahead of Haaland, Bellingham, Salah and Modrić, and the teenager is the single most-followed footballer on TikTok at 38.3 million. But his fan-identity depth is still small: only 32 creators name him in their bio, versus 1,705 for Messi. He is watched, not yet woven in.&lt;/p&gt;

&lt;h3&gt;
  
  
  How was the TikTok World Cup study measured?
&lt;/h3&gt;

&lt;p&gt;Two ways. TikTok hashtag view and video counts came live from Crawlora's tiktok_challenge endpoint (platform-wide, cumulative) for a curated football-specific tag per player and nation. The creator-mention affinity and the co-mention social graph came from Crawlora's creators_search dataset of 3.33 million creators (a bio is present on 92.5%), counting how many name each entity and how often two are named together. Figures are aggregate-only and the dataset is open and reproducible.&lt;/p&gt;
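&lt;p&gt;The second measurement (how many creators name each entity, and how often two are named together) can be sketched in a few lines. This is a simplified illustration, not the pipeline used for the study: the alias table, the sample bios and the substring matching are all assumptions made for the example.&lt;/p&gt;

```python
from collections import Counter
from itertools import combinations

# Hypothetical alias table: lowercase keyword mapped to a canonical entity.
ALIASES = {
    "messi": "Messi",
    "leomessi": "Messi",
    "ronaldo": "Ronaldo",
    "cr7": "Ronaldo",
    "neymar": "Neymar",
}


def entities_in_bio(bio):
    """Return the set of canonical entities named in one bio (case-insensitive)."""
    text = bio.lower()
    return {entity for keyword, entity in ALIASES.items() if keyword in text}


def count_mentions(bios):
    """Count creators per entity and creators per unordered entity pair."""
    mentions = Counter()
    co_mentions = Counter()
    for bio in bios:
        found = sorted(entities_in_bio(bio))  # sorted so pairs have a stable order
        mentions.update(found)
        co_mentions.update(combinations(found, 2))
    return mentions, co_mentions


# Made-up bios, purely for illustration.
SAMPLE_BIOS = [
    "Messi or Ronaldo? Both. GOAT edits daily",
    "leomessi fan page | Argentina",
    "CR7 and Neymar skills",
    "cooking videos",
]

mentions, co_mentions = count_mentions(SAMPLE_BIOS)
print(mentions)
print(co_mentions)
```

&lt;p&gt;Each creator counts once per entity and once per pair, so the co-mention counter is exactly an edge-weight table. Plain substring matching is the weak point: it is the kind of shortcut that would count "salah" in an Indonesian bio as the player, so a real pipeline would need stricter matching.&lt;/p&gt;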

&lt;h3&gt;
  
  
  Why is TikTok's football ranking so different from Instagram's?
&lt;/h3&gt;

&lt;p&gt;Because TikTok's algorithm serves clips by engagement, not by follower count, an unknown's celebration can out-travel a superstar's polished post. That structurally favours young, flair-driven, meme-friendly players (Lamine Yamal, Neymar, Vinícius) over all-platform giants, which is why TikTok's leaderboard looks nothing like Instagram's follower order.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://crawlora.net/blog/tiktok-world-cup-2026?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;crawlora.net&lt;/a&gt;. &lt;a href="https://crawlora.net/?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;Crawlora&lt;/a&gt; is a structured web-data, search, and anti-bot API — dozens of platforms as normalized JSON, plus a hosted MCP server, with a free tier (no card).&lt;/em&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>data</category>
      <category>tiktok</category>
      <category>worldcup</category>
    </item>
    <item>
      <title>Product Hunt Trends 2013–2026: How AI Agents Took Over Startup Launches</title>
      <dc:creator>Tony Wang</dc:creator>
      <pubDate>Mon, 29 Jun 2026 06:01:58 +0000</pubDate>
      <link>https://dev.to/tonywangca/product-hunt-trends-2013-2026-how-ai-agents-took-over-startup-launches-1op4</link>
      <guid>https://dev.to/tonywangca/product-hunt-trends-2013-2026-how-ai-agents-took-over-startup-launches-1op4</guid>
      <description>&lt;p&gt;&lt;strong&gt;Key takeaways&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We read every Product Hunt annual and monthly leaderboard from 2013 to 2026 — one of the few public, longitudinal records of what builders actually shipped and what other builders upvoted.&lt;/li&gt;
&lt;li&gt;AI didn't trend — it took over: on a deep ~190-product-per-year sample, the 'Artificial Intelligence' tag went from under 4% of the board before 2022 to ~70% by 2025, with the post-ChatGPT 2022→2023 jump the sharpest single-year move in the dataset.&lt;/li&gt;
&lt;li&gt;The deepest shift is the verb: from tools you operate to agents that do the job for you — 'the AI to-do list that does itself', 'a team of AI agents that runs your stores'.&lt;/li&gt;
&lt;li&gt;Trends now churn monthly inside a stable AI-agent theme — and winning a launch isn't lasting: of 28 era-defining winners, ~3 in 10 are dead or pivoted away, and the survivors are trend-independent utilities and infrastructure.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The headline numbers&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AI's share of the board:&lt;/strong&gt; under 4% before 2022 → &lt;strong&gt;~12% (2022) → 42% (2023) → ~70% by 2025&lt;/strong&gt; (deep sample of ~190 products/year); by the annual Top 10 alone, &lt;strong&gt;0 → 9 of 10&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2023 is the inflection year&lt;/strong&gt;: AI's board share grew roughly 3.5× (~12% to ~42%) in a single post-ChatGPT year, the sharpest move in the dataset.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;In 2026, monthly product churn is ~100%&lt;/strong&gt; (almost no launch repeats month to month) — but the theme stays AI agents; only the sub-flavor rotates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;28 era-defining winners re-checked:&lt;/strong&gt; roughly &lt;strong&gt;3 in 10 are dead or pivoted away&lt;/strong&gt; from the idea that won, and half the 2020–22 cohort is gone.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;Most "state of tech" takes are vibes. Product Hunt's leaderboard is one of the few public, longitudinal records of what people actually shipped and what other builders actually upvoted — every day since 2013. So instead of guessing, we read the boards: every &lt;strong&gt;annual&lt;/strong&gt; leaderboard from 2013 to 2026, plus a stack of &lt;strong&gt;monthly&lt;/strong&gt; ones, and tallied what each year was &lt;em&gt;about&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Ryan Hoover &lt;a href="https://en.wikipedia.org/wiki/Product_Hunt" rel="noopener noreferrer"&gt;launched Product Hunt on November 6, 2013&lt;/a&gt; as a Thanksgiving-break email list, so the board is a &lt;strong&gt;~12.5-year&lt;/strong&gt; record. That long span makes the shift look all the more dramatic, because the entire "AI everything" earthquake is compressed into its back third.&lt;/p&gt;

&lt;p&gt;A method note, with some irony: we pulled the leaderboards programmatically through a &lt;a href="https://crawlora.net/use-cases/ai-web-scraping?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;structured web-data API&lt;/a&gt;. Trying to read the same pages in a plain browser hits a Cloudflare "verifying you are not a bot" wall — which is exactly why reliable, structured access beats hand-scraping HTML. (Caveats are at the end; the short version: the annual board is a curated top-~20, scores are cumulative upvotes, and 2026 is year-to-date.)&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Monitor Product Hunt as structured data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Want to track Product Hunt launches, categories, and competitor traction as clean JSON instead of fighting the Cloudflare wall? It's the same tooling that built this study — Crawlora's &lt;a href="https://crawlora.net/platforms/product-hunt?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;Product Hunt API&lt;/a&gt; turns blocked leaderboard pages into normalized API responses (leaderboards, categories, makers, reviews, and more).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here's the whole twelve years in one shot — Product Hunt's five leading product categories, racing for the top each year. Watch &lt;strong&gt;Artificial Intelligence&lt;/strong&gt; climb from a sliver to first place, overtaking Design Tools and Productivity:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F016zm9mijrc1roq6us12.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F016zm9mijrc1roq6us12.png" alt="Animated bar-chart race of Product Hunt's 5 leading product categories by yearly share, 2014–2026; Artificial Intelligence rises from ~5% to first place by 2023, overtaking Design Tools and Productivity." width="800" height="388"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  From mobile apps to an AI monoculture: the 12-year arc
&lt;/h2&gt;

&lt;p&gt;The board's center of gravity moved in clear eras:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;2013–2014 · Mobile app discovery.&lt;/strong&gt; Almost everything is tagged generically "Tech," with a heavy iOS lean. 2013's top product is &lt;strong&gt;Sunrise&lt;/strong&gt; (a calendar app later bought by Microsoft); 2014's #1 is literally &lt;strong&gt;Product Hunt's own iOS app&lt;/strong&gt;. The platform is a place to find &lt;em&gt;apps&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2015–2016 · The maker era.&lt;/strong&gt; Curated directories and founder toolkits dominate — 2015's #1 is &lt;strong&gt;Startup Stash&lt;/strong&gt; (400 startup tools), 2016's is &lt;strong&gt;Startup Pitch Decks&lt;/strong&gt;. AI first pokes through as novelty (Prisma, Amazon Go).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2017–2019 · Design tools &amp;amp; productivity.&lt;/strong&gt; The two most durable topics in the whole dataset. &lt;strong&gt;AutoDraw&lt;/strong&gt; (2017), &lt;strong&gt;Notion 2.0&lt;/strong&gt; and &lt;strong&gt;remove.bg&lt;/strong&gt; (2018), &lt;strong&gt;Checklist Design&lt;/strong&gt; (2019). AI is still toys (AutoDraw, the "Not Hotdog" meme app, Google Duplex).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2020 · The remote-work pivot.&lt;/strong&gt; &lt;strong&gt;HEY&lt;/strong&gt; tops a board full of video-chat and remote tooling; the community's Golden Kitty Product of the Year is &lt;strong&gt;Clubhouse&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2021 · The no-code peak — the calm before the storm.&lt;/strong&gt; 2021's #1 is &lt;strong&gt;Cal.com&lt;/strong&gt;, surrounded by no-code site builders. AI products in the Top 10: &lt;strong&gt;zero&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2022–2023 · The AI spark, then the breakout.&lt;/strong&gt; ChatGPT debuts on the 2022 board with the single highest score; by 2023 "Artificial Intelligence" is the #1 topic and the Golden Kitty goes to &lt;strong&gt;ChatGPT&lt;/strong&gt; itself.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2024–2026 · The agent era.&lt;/strong&gt; 2024's #1 is &lt;strong&gt;Wordware&lt;/strong&gt; — &lt;em&gt;a tool for building AI agents&lt;/em&gt;. By 2025, 9 of the Top 10 are AI; in 2026 the incumbents arrive (Anthropic's &lt;strong&gt;Cowork&lt;/strong&gt;, ChatGPT Health) and the unit of competition becomes the &lt;em&gt;agent&lt;/em&gt;, not the app.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Year&lt;/th&gt;
&lt;th&gt;Most-upvoted #1&lt;/th&gt;
&lt;th&gt;Top topics&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2013*&lt;/td&gt;
&lt;td&gt;Sunrise (calendar)&lt;/td&gt;
&lt;td&gt;Tech, iOS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2014&lt;/td&gt;
&lt;td&gt;Product Hunt for iOS&lt;/td&gt;
&lt;td&gt;Tech, iOS, Web App&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2015&lt;/td&gt;
&lt;td&gt;Startup Stash&lt;/td&gt;
&lt;td&gt;Web App, Design Tools&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016&lt;/td&gt;
&lt;td&gt;Startup Pitch Decks&lt;/td&gt;
&lt;td&gt;Web App, iOS, Tech&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2017&lt;/td&gt;
&lt;td&gt;AutoDraw (Google)&lt;/td&gt;
&lt;td&gt;Tech, Design Tools, Dev Tools&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2018&lt;/td&gt;
&lt;td&gt;Notion 2.0&lt;/td&gt;
&lt;td&gt;Design Tools, Productivity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2019&lt;/td&gt;
&lt;td&gt;Checklist Design&lt;/td&gt;
&lt;td&gt;Design Tools, UX, Productivity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2020&lt;/td&gt;
&lt;td&gt;HEY (email)&lt;/td&gt;
&lt;td&gt;Productivity, Design Tools&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2021&lt;/td&gt;
&lt;td&gt;Cal.com&lt;/td&gt;
&lt;td&gt;Productivity, Design Tools&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2022&lt;/td&gt;
&lt;td&gt;ChatGPT (top score)&lt;/td&gt;
&lt;td&gt;Productivity, &lt;strong&gt;AI&lt;/strong&gt;, SaaS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2023&lt;/td&gt;
&lt;td&gt;Chat by Copy.ai&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;AI&lt;/strong&gt;, Productivity, Web3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2024&lt;/td&gt;
&lt;td&gt;Wordware&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;AI&lt;/strong&gt;, Productivity, Dev Tools&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025&lt;/td&gt;
&lt;td&gt;Screen Studio 3.0&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;AI&lt;/strong&gt;, Productivity, Design&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2026†&lt;/td&gt;
&lt;td&gt;Cowork (Anthropic)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;AI&lt;/strong&gt;, Productivity, Dev Tools&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;* 2013 is a partial first ~2 months. † 2026 is year-to-date.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI didn't trend — it took over
&lt;/h2&gt;

&lt;p&gt;Watch the whole board's category mix reshape itself. Each band below is one of Product Hunt's five leading product categories (deep sample, ~180–200 products/year); &lt;strong&gt;Design Tools&lt;/strong&gt; and &lt;strong&gt;Productivity&lt;/strong&gt; own the early years, then the orange &lt;strong&gt;Artificial Intelligence&lt;/strong&gt; band — barely a sliver before 2022 — swells into the widest by 2024:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F5j8p8gj52a5lgzx7uf95.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F5j8p8gj52a5lgzx7uf95.png" alt="Streamgraph of Product Hunt's 5 leading product categories, 2014–2026 (deep ~180–200/year sample). Artificial Intelligence rises from ~5% to ~50% while Design Tools, the early leader, fades." width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Plot AI against Productivity — the most durable old-guard category — as a share of each year's ~190 top products, and you can watch the handover happen. Productivity is the steady incumbent (~30–47%); AI goes from a rounding error to dominant, crossing it in &lt;strong&gt;2023&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fkwptbvtlltev71dz6ejx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fkwptbvtlltev71dz6ejx.png" alt="AI vs Productivity as a share of Product Hunt's ~190 top products per year, 2021–2025. AI rises from 7% to 70% and overtakes Productivity (47% to 32%) in 2023." width="800" height="516"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;spark&lt;/em&gt; was the back half of 2022, when ChatGPT launched and immediately posted the top score on the board. From there AI's share of the board &lt;strong&gt;grew roughly 3.5× in a single year&lt;/strong&gt;, from ~12% in 2022 to ~42% in 2023, then climbed to ~53% (2024) and ~70% (2025). Just as telling is how the &lt;em&gt;flavor&lt;/em&gt; of AI kept mutating: chat/LLM wrappers (2023) → AI agents and vertical apps (2024) → agentic workflows, AI dev-teams and "vibe coding" (2025) → incumbent agents and MCP-native tooling (2026). The category didn't just grow; it kept reinventing what "an AI product" even is.&lt;/p&gt;
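&lt;p&gt;The year-over-year arithmetic behind those shares is easy to check. A small sketch, using only the four approximate percentages quoted above (the function name is ours):&lt;/p&gt;

```python
# AI share of the deep ~190-product sample, in percent, as quoted above.
AI_SHARE = {2022: 12, 2023: 42, 2024: 53, 2025: 70}


def year_over_year(shares):
    """Return {year: (multiple of prior year, change in percentage points)}."""
    years = sorted(shares)
    changes = {}
    for prev, curr in zip(years, years[1:]):
        changes[curr] = (shares[curr] / shares[prev], shares[curr] - shares[prev])
    return changes


changes = year_over_year(AI_SHARE)
for year, (multiple, points) in changes.items():
    print(f"{year}: {multiple:.2f}x, +{points} pts")
```

&lt;p&gt;The 2022 to 2023 step is a 3.5× multiple and a 30-point gain, the largest on both measures; the later steps are 11 and 17 points.&lt;/p&gt;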

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;2026 so far&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The frontier shifted from startups to incumbents shipping agents — Anthropic's Cowork ("turn Claude into your digital coworker"), ChatGPT Health, Gmail in the Gemini era — plus two rising sub-themes: MCP-native tooling (turning any site or app into an agent tool) and GEO (getting your brand cited by AI assistants).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  A single month, four eras
&lt;/h2&gt;

&lt;p&gt;Zoom from years to months and a paradox appears. At the &lt;strong&gt;product&lt;/strong&gt; level, the monthly Top set turns over almost completely — across the six monthly boards of 2026 so far, essentially no individual product repeats in the Top 8. But at the &lt;strong&gt;theme&lt;/strong&gt; level, almost nothing changes: every 2026 month is ~100% AI/agent products. What recurs is brand &lt;em&gt;families&lt;/em&gt;, not entries — Anthropic's Claude charts nearly every month via a different launch; the "…Claw" agent family propagates across the year.&lt;/p&gt;
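&lt;p&gt;The product-versus-family distinction can be expressed as two small set operations. The sketch below is an assumption about how such churn could be computed, not the study's actual code; the board contents and brand names are placeholders, not real leaderboard entries.&lt;/p&gt;

```python
def churn(previous, current):
    """Share of the current board that did not appear on the previous one."""
    current_set = set(current)
    if not current_set:
        return 0.0
    new_entries = current_set - set(previous)
    return len(new_entries) / len(current_set)


def brand_overlap(previous, current, brand_of):
    """Brand families present on both boards, even if the products differ."""
    previous_brands = {brand_of[product] for product in previous}
    current_brands = {brand_of[product] for product in current}
    return previous_brands.intersection(current_brands)


# Hypothetical monthly boards; product and brand names are placeholders.
JANUARY = ["alpha-agent", "acme-code-review", "beta-runtime"]
FEBRUARY = ["acme-design", "gamma-market", "delta-notes"]
BRAND_OF = {
    "alpha-agent": "Alpha",
    "acme-code-review": "Acme",
    "beta-runtime": "Beta",
    "acme-design": "Acme",
    "gamma-market": "Gamma",
    "delta-notes": "Delta",
}

print(churn(JANUARY, FEBRUARY))
print(brand_overlap(JANUARY, FEBRUARY, BRAND_OF))
```

&lt;p&gt;In this toy example product churn is 100% (no entry repeats) while one brand family still appears in both months, which is the pattern described above.&lt;/p&gt;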

&lt;p&gt;Take one month per era and the trajectory is stark:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Frgk3znflic9wen2ou7v7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Frgk3znflic9wen2ou7v7.png" alt="AI share of the Product Hunt monthly Top 8 in a representative month per era: roughly 0% in January 2017, 13% in April 2020, 25% in January 2023, and about 100% in a 2026 month." width="800" height="529"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So the thing that "changes fast" in 2026 isn't the category — it's the &lt;em&gt;sub-flavor&lt;/em&gt; of AI, month to month: January was agent runtimes and code review; March, AI design and agent marketplaces; April, meeting-notes and evals; May, agent-commerce and voice-calling agents; June, fundraising and GTM agents. The trend clock compressed from a &lt;em&gt;theme per year&lt;/em&gt; to a &lt;em&gt;sub-theme per month&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The platform grew, then the clock sped up
&lt;/h2&gt;

&lt;p&gt;Product Hunt itself scaled fast and then plateaued — which makes the acceleration of &lt;em&gt;trends&lt;/em&gt; (not volume) the real story:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fyvg2xyq8k970jt82rbxq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fyvg2xyq8k970jt82rbxq.png" alt="Documented Product Hunt launches per year: 7,529 in 2014, about 11,300 in 2015 and 2016, and a record 12,137 in 2021." width="799" height="313"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What actually changed in 12 years
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;From tools you use to agents that do it for you.&lt;/strong&gt; 2013–2021 winners are utilities a human operates; 2024–2026 winners are described as &lt;em&gt;employees&lt;/em&gt; — "the AI to-do list that does itself", "a team of AI agents that runs your stores". The verb moved from &lt;em&gt;use&lt;/em&gt; to &lt;em&gt;delegate&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Topic diversity collapsed into a monoculture.&lt;/strong&gt; Through 2021 the top topics were a genuine mix (Design, Productivity, Dev Tools, No-Code, Hardware, Crypto). Since 2024, "AI" is on nearly everything; the other topics are just the &lt;em&gt;domain&lt;/em&gt; the AI points at.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Durable winners gave way to franchise churn.&lt;/strong&gt; Early hits stayed relevant for years; the 2026 board is version bumps and brand families cycling through the top spots monthly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Each wave left a fingerprint, then receded — except one.&lt;/strong&gt; Hardware, the maker boom, crypto/Web3, COVID/remote, no-code — each surged and faded. AI is the first wave that, instead of receding, &lt;em&gt;became the substrate&lt;/em&gt; the next waves run on.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The takeaway for anyone shipping: the leaderboard rewards whatever frame is ascendant. In 2019 that was a polished design resource; in 2026 it's an autonomous agent with a clear "it does the job for you" promise. The same capability framed in the current year's language outperforms one framed in last year's.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the board itself changed — not just the topics
&lt;/h2&gt;

&lt;p&gt;The &lt;em&gt;topics&lt;/em&gt; flipped to AI, but three quieter shifts in the leaderboard's own mechanics are just as telling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Winners now keep gaining votes after launch day.&lt;/strong&gt; Comparing each year's winners' current upvotes to their launch-day score: early-era winners were launch-day spikes that &lt;em&gt;bled&lt;/em&gt; votes afterward (ratio below 1); since 2025 they keep climbing (above 1) — the board rewards sustained momentum over a one-day burst.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fvf3w5pkxy790c74l01zw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fvf3w5pkxy790c74l01zw.png" alt="Vote-accumulation ratio of Product Hunt's annual winners, 2018–2026: rises from 0.52 to 1.15, crossing 1.0 between 2024 and 2025." width="800" height="496"&gt;&lt;/a&gt;&lt;/p&gt;
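&lt;p&gt;As a rough illustration of the vote-accumulation ratio, here is a minimal Python sketch. It assumes the ratio is current upvotes divided by launch-day score and that a year is summarized by the median across its winners; the aggregation choice, the function names, and the sample numbers are all hypothetical, not taken from the study's pipeline.&lt;/p&gt;

```python
from statistics import median


def accumulation_ratio(current_upvotes: int, launch_day_score: int) -> float:
    """Current upvotes divided by launch-day score; above 1 means votes kept growing."""
    if not launch_day_score > 0:
        raise ValueError("launch-day score must be positive")
    return current_upvotes / launch_day_score


def yearly_ratio(winners: list[tuple[int, int]]) -> float:
    """Median ratio over (current_upvotes, launch_day_score) pairs for one year.

    Using the median is an assumption of this sketch.
    """
    return median(accumulation_ratio(cur, day) for cur, day in winners)


# Made-up numbers, for illustration only.
spike_year = [(400, 800), (300, 500), (450, 900)]
momentum_year = [(1200, 1000), (900, 800), (1150, 1000)]

print(yearly_ratio(spike_year))     # below 1: winners bled votes after launch day
print(yearly_ratio(momentum_year))  # above 1: winners kept gaining votes
```

&lt;p&gt;A ratio below 1 marks a launch-day spike that lost votes afterward; a ratio above 1 marks sustained momentum.&lt;/p&gt;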

&lt;p&gt;&lt;strong&gt;Annual winners cluster in Q1.&lt;/strong&gt; Across 2018–2025, a disproportionate share of year-topping products launched early in the year: &lt;strong&gt;February is the single peak&lt;/strong&gt; and Q1 holds ~37% of winners, against the 25% an even spread across quarters would give. Early launches have the most time to build up the cumulative votes that decide annual rank.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F18n78f849ieilahhy4fx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F18n78f849ieilahhy4fx.png" alt="Launch month of 136 Product Hunt annual winners, 2018–2025: February peaks at 20; Q1 holds about 37%; October is the trough at 1." width="800" height="745"&gt;&lt;/a&gt;&lt;/p&gt;
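&lt;p&gt;The seasonality cut is simple counting. A hedged sketch, assuming each winner comes with a launch date; the helper names and the sample dates are invented for illustration and are not the 136-winner dataset.&lt;/p&gt;

```python
from collections import Counter
from datetime import date


def launch_month_counts(launch_dates: list[date]) -> Counter:
    """Count winners per calendar month (1 = January ... 12 = December)."""
    return Counter(d.month for d in launch_dates)


def q1_share(launch_dates: list[date]) -> float:
    """Fraction of winners launched in January, February, or March."""
    counts = launch_month_counts(launch_dates)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(counts[m] for m in (1, 2, 3)) / total


# Hypothetical launch dates, for illustration only.
sample = [
    date(2024, 1, 15), date(2024, 2, 6), date(2025, 2, 20), date(2024, 5, 2),
    date(2025, 7, 9), date(2024, 9, 17), date(2025, 11, 4), date(2024, 12, 1),
]

peak_month, peak_count = launch_month_counts(sample).most_common(1)[0]
print(peak_month, peak_count)  # the month with the most winners in this sample
print(q1_share(sample))        # compare against the 0.25 an even spread would give
```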

&lt;p&gt;&lt;strong&gt;AI didn't just grow — it fused with Productivity.&lt;/strong&gt; Among each year's top winners, the number carrying &lt;em&gt;both&lt;/em&gt; the AI and Productivity tags went from ~zero to &lt;strong&gt;9 of 17 in 2025&lt;/strong&gt; — "AI + Productivity" became the board's defining combination, and the old Web3/blockchain pairings vanished after 2023.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F3gh9i6b6ck61akl3lnj3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F3gh9i6b6ck61akl3lnj3.png" alt="Number of Product Hunt annual winners tagged both AI and Productivity, 2018–2025: 1, 0, 0, 0, 2, 4, 5, 9." width="800" height="529"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Where are Product Hunt's winners now?
&lt;/h2&gt;

&lt;p&gt;A leaderboard tells you what &lt;em&gt;launched&lt;/em&gt;, not what &lt;em&gt;lasted&lt;/em&gt;. So we ran the winners back through the same API: of the products that topped Product Hunt, how many are still alive today? We fetched the homepage of &lt;strong&gt;28 era-defining winners&lt;/strong&gt; and pulled each one's current traffic.&lt;/p&gt;

&lt;p&gt;The catch — and it's the whole point — is that &lt;strong&gt;a domain returning a page is not a living product.&lt;/strong&gt; Four "winners" still answer with &lt;code&gt;200 OK&lt;/code&gt; but serve something else entirely: Station's domain is now a Thai casino, Sunrise's is an SEO content farm, Rewind's is a generic AI-tools page, and Polywork's is a different site builder. Traffic estimators even keep serving the &lt;em&gt;old&lt;/em&gt; title, so they'd happily score these as alive. Only fetching the live page &lt;strong&gt;and&lt;/strong&gt; cross-checking the traffic — the same fetch-and-verify discipline behind our &lt;a href="https://crawlora.net/anti-bot-index?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;anti-bot index&lt;/a&gt; — reveals that the product is gone.&lt;/p&gt;
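&lt;p&gt;The fetch-and-verify idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the check we actually ran: it assumes the homepage has already been fetched into a small snapshot object, and it uses naive keyword matching to decide whether the page still serves the product. The class, function, and sample pages are hypothetical.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class PageSnapshot:
    status: int      # final HTTP status of the homepage fetch
    title: str       # title text of the page that actually came back
    body_text: str   # visible text of that page


def liveness_verdict(page: PageSnapshot, product_terms: list[str]) -> str:
    """Return 'live', 'replaced', or 'unverified' for a past winner's homepage."""
    if page.status != 200:
        # No readable page: this sketch refuses to guess either way.
        return "unverified"
    text = f"{page.title} {page.body_text}".lower()
    if any(term.lower() in text for term in product_terms):
        return "live"
    # The domain answers, but with something other than the product.
    return "replaced"


# Hypothetical snapshots, for illustration only.
casino = PageSnapshot(200, "Online casino bonuses", "Slots, betting and jackpots")
calendar = PageSnapshot(200, "Example Calendar", "A calendar app for your meetings")

print(liveness_verdict(casino, ["calendar"]))    # replaced
print(liveness_verdict(calendar, ["calendar"]))  # live
```

&lt;p&gt;Keyword matching is deliberately crude here; the point is only that a &lt;code&gt;200 OK&lt;/code&gt; is an input to the verdict, not the verdict itself.&lt;/p&gt;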

&lt;p&gt;By that bar, here's how the 28 winners split:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fdkdkt44phck6uhas658l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fdkdkt44phck6uhas658l.png" alt="Survival of 28 Product Hunt winners: 17 still running the product that won, 3 diminished or pivoted under the same brand, 8 dead or pivoted away." width="800" height="199"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Roughly &lt;strong&gt;seven in ten are still running&lt;/strong&gt; (20 of 28: 17 on the product that won, 3 diminished or pivoted under the same brand), but &lt;strong&gt;about three in ten (8 of 28) are dead or have pivoted away&lt;/strong&gt; from the thing that put them on the board. A #1 badge was no insurance: &lt;strong&gt;Sunrise&lt;/strong&gt; (2013's beloved calendar, bought by Microsoft and shut down) and &lt;strong&gt;Wordware&lt;/strong&gt; (2024's literal #1, "$30M, the largest seed in YC history", now replaced by a different product) both topped Product Hunt and are effectively gone.&lt;/p&gt;

&lt;p&gt;And the deaths aren't evenly spread — they cluster hard in one era:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fn42ww084a0mk1vsci8xx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fn42ww084a0mk1vsci8xx.png" alt="Share of each era's Product Hunt winners now dead or pivoted away: 29% for 2013-16, 14% for 2017-19, 50% for 2020-22, 17% for 2023-26." width="800" height="529"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Put both dimensions together — which era each winner launched in, and where it ended up — and the whole story sits in one flow:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fcoiggnehccl9wca6k0e6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fcoiggnehccl9wca6k0e6.png" alt="Flow of 28 Product Hunt winners from launch era to 2026 fate. 2017–19 sends 6 of 7 to Still running; 2020–22 sends 4 of 8 to Gone." width="799" height="503"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Half of the 2020–22 cohort is already gone&lt;/strong&gt; — the pandemic-and-social-novelty winners (mmhmm, Polywork, Typedream, Rewind, plus a much-diminished Clubhouse). The survivors, in every era, are the trend-independent ones: single-purpose utilities (Workflowy, Coolors, Hunter, remove.bg) and infrastructure (Notion, Supabase, Cal.com, Resend). Winning a launch is a moment; outliving the trend that powered it is the hard part.&lt;/p&gt;

&lt;p&gt;And outliving the trend still isn't the same as winning the market. Among the survivors, monthly traffic is brutally top-heavy:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fsk8l6sjxn7t822w0gamy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fsk8l6sjxn7t822w0gamy.png" alt="Monthly visits for 13 surviving Product Hunt winners: Notion 152M, remove.bg 62.5M, Lovable 34.8M, Supabase 29.3M, then a long tail from Resend 5.2M down to HEY 778K." width="800" height="799"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Winning a Product Hunt launch gets you on the board; becoming Notion is a different game entirely.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to use this before you launch
&lt;/h2&gt;

&lt;p&gt;The leaderboard rewards trend-fit, but the data above says trend-fit and durability are different games. If you're timing a launch, treat these as the operating rules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Don't launch as generic "AI" — launch as this month's sub-flavor.&lt;/strong&gt; "An AI tool" is invisible on a board that's already ~100% AI. The launches that break out name the specific job that's cresting right now — code review, meeting notes, agent commerce, GTM — so match the sub-theme that's ascendant the month you ship.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phrase the promise as delegation, not tooling.&lt;/strong&gt; The winning verb moved from &lt;em&gt;use&lt;/em&gt; to &lt;em&gt;delegate&lt;/em&gt;: "does X for you," not "a tool for X." "The AI to-do list that does itself" beats "a smarter to-do list."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use trend-fit for launch velocity, but ship a utility for survival.&lt;/strong&gt; Trend-of-the-moment framing wins the day; the products still alive years later are single-purpose utilities and infrastructure (Notion, Supabase, Cal.com, remove.bg). Ride the wave to get seen — but build something people still need after it breaks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Time it: track the monthly sub-theme rotation, and aim for Q1.&lt;/strong&gt; The AI sub-flavor shifts roughly monthly, so launch into the one that's cresting. And if you're chasing the year-end lists, launch early: February is the single peak month for annual winners, because early launches have the longest runway to build up the cumulative votes that decide annual rank.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Treat Product Hunt as an awareness and positioning signal — not proof of retention or market leadership.&lt;/strong&gt; A #1 badge is a moment of attention, not evidence of a durable business; among survivors, traffic is brutally top-heavy. Winning the launch and becoming Notion are different games.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How we did this (and the caveats)
&lt;/h2&gt;

&lt;p&gt;We read Product Hunt's annual and monthly leaderboards directly via a structured API, then cross-checked the platform facts (launch date, &lt;a href="https://en.wikipedia.org/wiki/Product_Hunt" rel="noopener noreferrer"&gt;launch volumes&lt;/a&gt;, and Golden Kitty winners) against public sources. Caveats worth stating plainly: the &lt;strong&gt;annual board is a curated top-~20&lt;/strong&gt;, not an exhaustive vote ranking; &lt;strong&gt;scores are cumulative upvotes&lt;/strong&gt;, so they're directional, not exact annual totals; &lt;strong&gt;2026 is year-to-date&lt;/strong&gt;; &lt;strong&gt;2013 covers only ~2 months&lt;/strong&gt;; and "AI in the Top 10" counts are by listed topic, so borderline cases are judgment calls, though the direction is robust regardless.&lt;/p&gt;

&lt;p&gt;For the topic charts (the race, streamgraph, and AI-vs-Productivity line) we went deeper than the annual top-20: we &lt;strong&gt;aggregated all twelve monthly leaderboards per year and de-duplicated by product&lt;/strong&gt;, giving a &lt;strong&gt;~180–200-product sample each year&lt;/strong&gt; (vs ~17 at the very top). That is why the AI share here (e.g. ~70% in 2025) is a touch lower and far smoother than a top-of-board figure would be. The yearly endpoint's deep pagination is broken, so the survival, vote-accumulation, seasonality, and co-occurrence cuts use the top-~17/year set.&lt;/p&gt;

&lt;p&gt;For the "where are they now" check, we fetched each winner's homepage and pulled its current SimilarWeb traffic, counting a product as &lt;em&gt;gone&lt;/em&gt; only when the live page no longer serves it: a domain that returns 200 with unrelated content (a parked page, a casino, a different product) is dead, not alive. That liveness snapshot is June 2026, and "gone" includes products that pivoted away from the winning idea, not only outright shutdowns.&lt;/p&gt;

&lt;p&gt;The datasets behind the piece, at a glance:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dataset&lt;/th&gt;
&lt;th&gt;Sample size&lt;/th&gt;
&lt;th&gt;Used for&lt;/th&gt;
&lt;th&gt;Caveats&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Deep monthly-aggregated sample&lt;/td&gt;
&lt;td&gt;~180–200 products/year (all 12 monthly boards, de-duplicated)&lt;/td&gt;
&lt;td&gt;The topic-category charts: race, streamgraph, AI-vs-Productivity line&lt;/td&gt;
&lt;td&gt;Smoother and a touch lower than a top-of-board figure (AI is ~70% in 2025, not 90%); "AI" tagging is a judgment call on borderline products&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Annual top boards&lt;/td&gt;
&lt;td&gt;~17 products/year&lt;/td&gt;
&lt;td&gt;Survival, vote-accumulation, Q1 seasonality, AI+Productivity co-occurrence&lt;/td&gt;
&lt;td&gt;Curated top list, not a full vote ranking; cumulative-upvote scores; the deep-pagination endpoint is broken, so these cuts cap at the top ~17&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monthly boards (era sample)&lt;/td&gt;
&lt;td&gt;Top 8 per sampled month&lt;/td&gt;
&lt;td&gt;Sub-theme rotation and product-level churn&lt;/td&gt;
&lt;td&gt;Sampled months, not every month; 2026 is year-to-date&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Winner survival + traffic&lt;/td&gt;
&lt;td&gt;28 era-defining winners&lt;/td&gt;
&lt;td&gt;Liveness, pivots, and surviving-product traffic&lt;/td&gt;
&lt;td&gt;Live homepage fetch + SimilarWeb (May–June 2026); a 200 with unrelated content counts as gone; "gone" includes pivots&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
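&lt;p&gt;The monthly aggregation behind the topic charts can be sketched as follows. This is a minimal Python illustration, assuming each leaderboard entry carries a product identifier and a list of topics; the field names &lt;code&gt;slug&lt;/code&gt; and &lt;code&gt;topics&lt;/code&gt; and the sample boards are hypothetical, not the real API schema.&lt;/p&gt;

```python
def dedupe_products(monthly_boards: list[list[dict]]) -> dict[str, set[str]]:
    """Merge a year's monthly boards into one sample keyed by product.

    A product that appears in several months is counted once, with the
    union of the topics it was listed under.
    """
    sample: dict[str, set[str]] = {}
    for board in monthly_boards:
        for entry in board:
            sample.setdefault(entry["slug"], set()).update(entry["topics"])
    return sample


def topic_share(sample: dict[str, set[str]], *topics: str) -> float:
    """Fraction of products tagged with every one of the given topics."""
    if not sample:
        return 0.0
    hits = sum(1 for tags in sample.values() if set(topics).issubset(tags))
    return hits / len(sample)


# Hypothetical boards, for illustration only.
january = [
    {"slug": "a", "topics": ["AI", "Productivity"]},
    {"slug": "b", "topics": ["Design"]},
]
february = [
    {"slug": "a", "topics": ["AI"]},            # repeat appearance, counted once
    {"slug": "c", "topics": ["AI", "Developer Tools"]},
    {"slug": "d", "topics": ["Hardware"]},
]

year_sample = dedupe_products([january, february])
print(len(year_sample))                                 # distinct products
print(topic_share(year_sample, "AI"))                   # AI share of the sample
print(topic_share(year_sample, "AI", "Productivity"))   # co-occurrence share
```

&lt;p&gt;De-duplicating before computing shares keeps a product that charts in several months from being counted more than once.&lt;/p&gt;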

&lt;p&gt;If you're wondering how we read pages that block bots: that's the day job. Crawlora is a &lt;a href="https://crawlora.net/blog/best-ai-web-scraping-tools-2026?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;web-data API for AI agents and pipelines&lt;/a&gt; that returns normalized JSON for dozens of platforms, handles the anti-bot layer, and bills &lt;a href="https://crawlora.net/pricing?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;pay-on-success&lt;/a&gt; — you only pay when it actually returns your data. The same approach powers our &lt;a href="https://crawlora.net/anti-bot-index?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;anti-bot adoption study&lt;/a&gt;, the &lt;a href="https://crawlora.net/blog/streaming-fragmentation-2026?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;streaming fragmentation study&lt;/a&gt;, and the &lt;a href="https://crawlora.net/blog/google-vs-bing-vs-brave-serp-study-2026?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;Google-vs-Bing-vs-Brave SERP study&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sources&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Product_Hunt" rel="noopener noreferrer"&gt;Product Hunt — Wikipedia (history, launch date, ownership)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.producthunt.com/leaderboard" rel="noopener noreferrer"&gt;Product Hunt — annual &amp;amp; monthly leaderboards&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.producthunt.com/golden-kitty-awards/hall-of-fame" rel="noopener noreferrer"&gt;Product Hunt — Golden Kitty Awards Hall of Fame&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://crawlora.net/anti-bot-index?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;Crawlora Anti-Bot Adoption Index&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Frequently asked questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  When did Product Hunt launch?
&lt;/h3&gt;

&lt;p&gt;Product Hunt launched on November 6, 2013, started by Ryan Hoover as an email list, with the website built over that year's Thanksgiving break. Its leaderboard is therefore about a 12.5-year record (late 2013 to mid-2026), and the entire AI wave is compressed into the back third of it.&lt;/p&gt;

&lt;h3&gt;
  
  
  How have Product Hunt trends changed over time?
&lt;/h3&gt;

&lt;p&gt;The board moved through clear eras: generic mobile apps (2013–14), startup and maker resource directories (2015–16), a design-tools and productivity golden age (2017–19), a remote-work and COVID pivot (2020), a no-code peak (2021), the AI spark and breakout (2022–23), and an AI-agent monoculture (2024–26).&lt;/p&gt;

&lt;h3&gt;
  
  
  When did AI take over Product Hunt?
&lt;/h3&gt;

&lt;p&gt;2023 is the inflection year — the first year 'Artificial Intelligence' is the single most common topic in the annual Top 10. AI products in the Top 10 grew 0 → 3 → 5 → 7 → 9 across 2021 to 2025, sparked by ChatGPT's debut in late 2022.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do Product Hunt trends change month to month?
&lt;/h3&gt;

&lt;p&gt;At the product level, yes — in 2026 the monthly Top 8 turns over almost completely each month, with essentially no repeat products. But the theme is stable: nearly every entry is an AI agent. What rotates month to month is the sub-flavor — coding, then design, then meetings, then commerce, then fundraising.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is trending on Product Hunt in 2026?
&lt;/h3&gt;

&lt;p&gt;AI agents. Almost every top launch in 2026 is an autonomous agent that does a job for you rather than a tool you operate, so the dominant theme barely moves — but the sub-flavor rotates roughly monthly: agent runtimes and code review, then AI design and agent marketplaces, then meeting-notes and evals, then agent-commerce and voice agents, then fundraising and GTM agents, plus a rising MCP-native tool-layer cohort that turns any site or app into an agent tool.&lt;/p&gt;

&lt;h3&gt;
  
  
  How was this Product Hunt trend analysis done?
&lt;/h3&gt;

&lt;p&gt;We read Product Hunt's annual and monthly leaderboards directly through a structured web-data API and cross-checked platform facts against public sources. The annual board is a curated top-~20, scores are cumulative upvotes, and 2026 is year-to-date, so the figures are directional rather than exact.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do Product Hunt winners survive?
&lt;/h3&gt;

&lt;p&gt;Not reliably. Re-checking 28 era-defining Product Hunt winners in June 2026, about 7 in 10 (20 of 28) are still running, 3 of them diminished or pivoted under the same brand, and roughly 3 in 10 (8 of 28) are dead or have pivoted away. A #1 badge was no guarantee: 2013's Sunrise and 2024's #1 Wordware are both effectively gone. The deaths cluster in the 2020–22 cohort (~50%), the pandemic and social-novelty bets; trend-independent utilities and infrastructure survive best.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://crawlora.net/blog/product-hunt-trends-2013-2026?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;crawlora.net&lt;/a&gt;. &lt;a href="https://crawlora.net/?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;Crawlora&lt;/a&gt; is a structured web-data, search, and anti-bot API — dozens of platforms as normalized JSON, plus a hosted MCP server, with a free tier (no card).&lt;/em&gt;&lt;/p&gt;

</description>
      <category>startup</category>
      <category>ai</category>
      <category>datascience</category>
      <category>productivity</category>
    </item>
    <item>
      <title>14% of the Web Is Actually Dead — But Not How You Think (We Scanned 10M Domains)</title>
      <dc:creator>Tony Wang</dc:creator>
      <pubDate>Thu, 18 Jun 2026 01:33:38 +0000</pubDate>
      <link>https://dev.to/tonywangca/14-of-the-web-is-actually-dead-but-not-how-you-think-we-scanned-10m-domains-35g9</link>
      <guid>https://dev.to/tonywangca/14-of-the-web-is-actually-dead-but-not-how-you-think-we-scanned-10m-domains-35g9</guid>
      <description>&lt;p&gt;&lt;strong&gt;Key takeaways&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We probed 9,992,781 of the top 10 million domains in June 2026. 14.2% are genuinely dead — no DNS, no connection, nothing answers — not the 27.6% a naive crawl of the same list reports.&lt;/li&gt;
&lt;li&gt;Nearly half of 'dead' was never dead. 8.9% of the top web (891,672 sites) answers but blocks an automated client (403/429/anti-bot), and another ~4% serves a 404 or 5xx from a live server. Naive crawls count all of that as death.&lt;/li&gt;
&lt;li&gt;The genuinely dead web is mostly DNS that no longer resolves: 1,077,715 domains — 76% of all dead — have left DNS entirely. The rest refuse or reset the connection. A 404 page is not death; a missing DNS record is.&lt;/li&gt;
&lt;li&gt;Death is uneven by TLD. .cn (33%), .info (28%), .in and .gov (26%), and .edu (22%) rot fastest — institutional and cheap-registration domains lead, echoing Pew's finding that government and reference pages suffer the worst link rot. .com sits near the 14% line.&lt;/li&gt;
&lt;li&gt;This is not 'link rot' or 'dead internet theory.' We measure whether the domain itself still resolves and answers — a different question from broken links inside pages (Pew, Ahrefs) or AI-generated content flooding the web.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You have probably seen the stat: &lt;strong&gt;27.6% of the web is dead.&lt;/strong&gt; It comes from &lt;a href="https://tonywang.io/blog/top-10-million-sites-27-percent-dead" rel="noopener noreferrer"&gt;a 2024 crawl of the top 10 million domains&lt;/a&gt;, and it gets repeated because it is striking and a little bleak. We ran that study. And when we re-scanned the same 10-million-domain list in 2026 — this time separating a domain that is &lt;em&gt;genuinely gone&lt;/em&gt; from one that is merely &lt;em&gt;refusing a bot&lt;/em&gt; — the real number came out at &lt;strong&gt;14.2%&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The web didn't suddenly heal. The original number was counting the wrong things. A naive crawler can't tell a dead domain from a live one hiding behind Cloudflare, and it counts a server that politely returns "404 Not Found" the same as one that never answers at all. Fix the classification and roughly &lt;strong&gt;half of "dead" turns out to be alive&lt;/strong&gt; — it just wasn't talking to a bot. Here is the full picture, from 9,992,781 probed domains.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the outcome of a 10-million-domain scan actually looks like
&lt;/h2&gt;

&lt;p&gt;Every domain gets one of four labels. &lt;code&gt;alive&lt;/code&gt; means it answered (a 2xx, or even a 404/5xx — the server is up). &lt;code&gt;blocked&lt;/code&gt; means it answered but refused our automated client (a 403, 429, or anti-bot challenge). &lt;code&gt;redirect&lt;/code&gt; means it bounced somewhere we couldn't resolve. &lt;code&gt;dead&lt;/code&gt; means it never answered at all — no DNS record, or nothing accepts a connection.&lt;/p&gt;
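&lt;p&gt;The four labels map onto a small decision function. The sketch below is a simplified Python illustration of the rules as described, not our production classifier; the &lt;code&gt;ProbeResult&lt;/code&gt; fields are hypothetical names for signals a prober would collect.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeResult:
    dns_resolved: bool                   # did any DNS lookup return an address?
    connected: bool                      # did a connection get accepted?
    final_status: Optional[int] = None   # HTTP status after following redirects
    bot_challenge: bool = False          # anti-bot challenge page detected
    redirect_unresolved: bool = False    # redirect chain led somewhere unresolvable


def classify(probe: ProbeResult) -> str:
    """Assign one of: 'alive', 'blocked', 'redirect', 'dead'."""
    if not probe.dns_resolved or not probe.connected:
        return "dead"        # nothing answered at all
    if probe.redirect_unresolved:
        return "redirect"    # bounced somewhere we couldn't resolve
    if probe.bot_challenge or probe.final_status in (403, 429):
        return "blocked"     # the server is up but refuses an automated client
    if probe.final_status is not None:
        return "alive"       # a 2xx, or even a 404/5xx: the server answered
    return "dead"            # connected, but no HTTP answer ever came back


print(classify(ProbeResult(dns_resolved=False, connected=False)))                  # dead
print(classify(ProbeResult(dns_resolved=True, connected=True, final_status=404)))  # alive
print(classify(ProbeResult(dns_resolved=True, connected=True, final_status=403)))  # blocked
```

&lt;p&gt;The ordering matters: reachability is checked first, so an error status is only ever read from a server that actually answered.&lt;/p&gt;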

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F4ysxdezsj810eepbmlcm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F4ysxdezsj810eepbmlcm.png" alt="Outcome of 9,992,781 probed domains: alive 76.6%, blocked 8.9%, dead 14.2%, redirect 0.3%." width="800" height="199"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Three-quarters of the top web is alive and answering. The interesting part is the bottom 23% — the slice everyone argues about — and how you split it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The real number: 14% dead, not 27.6%
&lt;/h2&gt;

&lt;p&gt;Same list, same scale, one difference: in 2026 we refuse to call a domain dead just because &lt;em&gt;our bot&lt;/em&gt; couldn't read it. A genuinely dead domain fails early — DNS returns nothing, or the connection is refused. A live-but-defended domain fails &lt;em&gt;late&lt;/em&gt;, with a 403 or a challenge page, which is a completely different signal. Counting honestly moves the headline from 27.6% to 14.2%.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F79j5f0ec9231t0rplk5c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F79j5f0ec9231t0rplk5c.png" alt="Dead-rate comparison: naive 2024 crawl 27.6% versus honest 2026 classification 14.2%." width="800" height="309"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Where do the missing ~13 points go? Almost all of them are two things a naive crawl mislabels:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;8.9% (891,672 sites) answer but block bots.&lt;/strong&gt; A 403, a 429, or a Cloudflare "Just a moment" challenge to a datacenter IP. These are some of the &lt;em&gt;most&lt;/em&gt; alive sites on the web — they run active defenses precisely because people want their data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;~4% serve a 404 or 5xx from a live server.&lt;/strong&gt; A "404 Not Found" or a "503 Service Unavailable" is proof the host answered. The original crawl counted them as dead; a server that returns an error is the opposite of gone.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The remainder is a 2024 measurement artifact: that crawl resolved each domain through a single DNS resolver, and a flaky lookup falsely marked resolvable domains dead. We now cross-check across resolvers before declaring a DNS failure.&lt;/p&gt;
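&lt;p&gt;A cross-resolver check can be sketched like this. It is a minimal Python illustration that assumes a domain counts as resolvable if &lt;em&gt;any&lt;/em&gt; resolver finds it; the exact rule and resolver set we use are not spelled out here, and the resolvers are injected as plain functions (the stubs below are hypothetical stand-ins for real lookups).&lt;/p&gt;

```python
from typing import Callable, Iterable, Optional

# A resolver takes a domain and returns an IP string, or None if it finds nothing.
Resolver = Callable[[str], Optional[str]]


def resolves_anywhere(domain: str, resolvers: Iterable[Resolver]) -> bool:
    """Declare a DNS failure only if every resolver fails to find the domain."""
    for resolve in resolvers:
        try:
            if resolve(domain) is not None:
                return True
        except OSError:
            # A timeout or lookup error is one failed attempt, not proof of death.
            continue
    return False


# Stub resolvers, for illustration only.
def flaky(domain: str) -> Optional[str]:
    raise TimeoutError("lookup timed out")


def healthy(domain: str) -> Optional[str]:
    return "192.0.2.10"  # documentation-range address


def empty(domain: str) -> Optional[str]:
    return None


print(resolves_anywhere("example.test", [flaky, healthy]))  # True
print(resolves_anywhere("example.test", [flaky, empty]))    # False
```

&lt;p&gt;With a single flaky resolver, the first call would have marked the domain dead; the second resolver rescues it.&lt;/p&gt;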

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Dead means unreachable, not 'returned an error.'&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The whole correction rests on one rule: a server that answers &lt;em&gt;anything&lt;/em&gt;, even a 404, a 500, or a 403, is up, so it isn't dead. Only a domain that no DNS resolver can find, or that refuses or resets every connection, is dead. Most "dead web" counts skip this rule; in our own 2024 crawl, skipping it nearly doubled the number (27.6% versus 14.2%).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What a no-follow crawler gets wrong
&lt;/h2&gt;

&lt;p&gt;The gap between 27.6% and 14.2% is largely a measurement choice: whether you follow redirects and read what the server actually says. A crawler that stops at the first response sees only &lt;strong&gt;45.9% return a clean 200&lt;/strong&gt; and writes off the rest. Follow the redirects and read the bodies, and &lt;strong&gt;71.9% are alive&lt;/strong&gt;. Here is where every first response actually ends up:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F1nioys6fno9bg8rmqs5b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F1nioys6fno9bg8rmqs5b.png" alt="Flow from first HTTP response to final outcome: 200, most 3xx redirects, 404s and 5xx end alive; no-response ends dead; 403/429 end blocked." width="800" height="481"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The big rivers carry the point: a &lt;code&gt;301&lt;/code&gt; is not a dead end — &lt;strong&gt;87% of redirects resolve to a live page&lt;/strong&gt;, and a &lt;code&gt;403&lt;/code&gt; or &lt;code&gt;429&lt;/code&gt; is a live site refusing a bot, not a corpse. The only response that reliably means &lt;em&gt;dead&lt;/em&gt; is no response at all — and that single &lt;code&gt;No response → Dead&lt;/code&gt; band is almost the entire dead web.&lt;/p&gt;

&lt;h2&gt;
  
  
  The genuinely dead web is mostly DNS that's gone
&lt;/h2&gt;

&lt;p&gt;So what &lt;em&gt;is&lt;/em&gt; the 14.2%? Overwhelmingly, it's domains that have left DNS entirely. Of the 1,414,788 genuinely dead domains, &lt;strong&gt;1,077,715 — about 76% — no longer resolve to any IP at all.&lt;/strong&gt; The registration lapsed, the zone was deleted, the project was abandoned. The rest refuse or reset every connection, or fail TLS to a host that is truly down. A dead domain almost never &lt;em&gt;answers and errors&lt;/em&gt; — it simply isn't there.&lt;/p&gt;

&lt;p&gt;This matters if you build anything that follows links or crawls a list: the failures you'll actually hit are split between "this domain is gone" (retrying never helps) and "this site is blocking me" (a different request may get in). Treating them the same is the single most common way web-health numbers get inflated, and the most common way a scraper wastes its budget retrying domains that will never answer.&lt;/p&gt;
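&lt;p&gt;To put a rough number on that wasted budget, here is a small Python sketch using the dead and blocked counts from this scan. The retry counts are hypothetical (three retries per failure), chosen only to show the arithmetic.&lt;/p&gt;

```python
# Hypothetical retry counts per outcome label.
NAIVE_POLICY = {"dead": 3, "blocked": 3}   # retry every failure the same way
SPLIT_POLICY = {"dead": 0, "blocked": 3}   # never retry a domain that is gone


def retry_requests(outcome_counts: dict[str, int], policy: dict[str, int]) -> int:
    """Total retry requests a crawler would send under a given policy."""
    return sum(count * policy[label] for label, count in outcome_counts.items())


# Counts from this scan: genuinely dead and blocked domains.
outcomes = {"dead": 1_414_788, "blocked": 891_672}

wasted = retry_requests(outcomes, NAIVE_POLICY) - retry_requests(outcomes, SPLIT_POLICY)
print(wasted)  # retries spent on domains that will never answer
```

&lt;p&gt;Under these assumptions the naive policy spends every one of its dead-domain retries for nothing, while the blocked domains are the ones where a changed request is worth trying.&lt;/p&gt;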

&lt;h2&gt;
  
  
  The famous dead
&lt;/h2&gt;

&lt;p&gt;Aggregate percentages are abstract. So we sorted the genuinely-dead domains by popularity rank and went looking for names you'd recognise — and the graveyard is remarkable. The single highest-ranked dead domain in the entire top 10 million makes the point on its own.&lt;/p&gt;

&lt;p&gt;At &lt;strong&gt;#568&lt;/strong&gt; sits &lt;code&gt;fanlink.to&lt;/code&gt;, the music "smart-link" service artists and labels used for pre-save and streaming links. In March 2024 its parent — Eventbrite's ToneDen — lost control of the &lt;code&gt;.to&lt;/code&gt; domain and never recovered it, instantly breaking millions of links sitting in artist bios, ads, and press releases.&lt;/p&gt;

&lt;p&gt;Which raises the obvious question: how is a &lt;em&gt;dead&lt;/em&gt; domain the 568th most popular on the web? Because the web never stopped knocking. Every un-updated link, embed, and bookmark keeps firing requests at an address that no longer answers — the rank is a fossil of past popularity. That is precisely why a popularity-ranked list is full of corpses at all.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Music &amp;amp; video&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;fanlink.to&lt;/strong&gt; († 2024) — Music smart-links · ToneDen / Eventbrite. The single highest-ranked dead domain in the whole top 10M (#568). In March 2024 Eventbrite lost control of the .to domain overnight, instantly breaking millions of artists' pre-save and streaming links sitting in bios, ads, and press releases. &lt;a href="https://web.archive.org/web/*/fanlink.to" rel="noopener noreferrer"&gt;Wayback&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;grooveshark.com&lt;/strong&gt; († 2015) — Free music streaming · ~20M users. Forced shut by the major labels' copyright suit (willful infringement, ~$700M of exposure). The entire catalogue was wiped the day the settlement landed; a co-founder died months later at 28. &lt;a href="https://web.archive.org/web/*/grooveshark.com" rel="noopener noreferrer"&gt;Wayback&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;rdio.com&lt;/strong&gt; († 2015) — Music subscription service. Bankrupt after burning ~$2M a month. Pandora bought the technology for $75M and shut the service down the day before the sale closed. &lt;a href="https://web.archive.org/web/*/rdio.com" rel="noopener noreferrer"&gt;Wayback&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;gfycat.com&lt;/strong&gt; († 2023) — GIF host for Reddit &amp;amp; Discord · ~220M users. Bought by Snap in 2020, then switched off as a non-core asset — one of the largest single link-rot events ever, breaking millions of embedded GIFs across the web. &lt;a href="https://web.archive.org/web/*/gfycat.com" rel="noopener noreferrer"&gt;Wayback&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;veoh.com&lt;/strong&gt; († 2024) — Video-sharing site. Won a landmark DMCA case that helped protect every YouTube-style site, limped on for years under Japan's FC2, and finally went dark in November 2024. &lt;a href="https://web.archive.org/web/*/veoh.com" rel="noopener noreferrer"&gt;Wayback&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;metacafe.com&lt;/strong&gt; († 2021) — Top-3 video site of 2006. One of YouTube's first serious rivals — it simply went offline one day in 2021 with no announcement at all. &lt;a href="https://web.archive.org/web/*/metacafe.com" rel="noopener noreferrer"&gt;Wayback&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The social web&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;del.icio.us&lt;/strong&gt; († 2017) — Delicious · invented social bookmarking. The site that coined web-scale tagging. Passed through five owners (Yahoo → AVOS → Science → Delicious Media → Pinboard for $35,000) before going read-only. &lt;a href="https://web.archive.org/web/*/del.icio.us" rel="noopener noreferrer"&gt;Wayback&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;dmoz.org&lt;/strong&gt; († 2017) — The Open Directory · a human-curated map of the web. 91,000 volunteers cataloguing 3.8M sites — once a near-prerequisite for SEO, then made obsolete by Google's algorithm. Lives on as the community fork Curlie. &lt;a href="https://web.archive.org/web/*/dmoz.org" rel="noopener noreferrer"&gt;Wayback&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;pipes.yahoo.com&lt;/strong&gt; († 2015) — Yahoo Pipes · visual no-code data mashups. The “Zapier of 2007.” Killed in a Yahoo cost-cut; thousands of live RSS and data pipelines broke on the same day. &lt;a href="https://web.archive.org/web/*/pipes.yahoo.com" rel="noopener noreferrer"&gt;Wayback&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;topsy.com&lt;/strong&gt; († 2015) — The only full historical Twitter search. Indexed hundreds of billions of tweets back to 2006. Apple bought it for ~$200M and quietly switched it off two years later; the searchable archive simply vanished. &lt;a href="https://web.archive.org/web/*/topsy.com" rel="noopener noreferrer"&gt;Wayback&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;aviary.com&lt;/strong&gt; († 2018) — Photo-editing SDK embedded in 7,000+ apps. Powered in-app photo editing across the mobile economy (10B edits). Adobe acquired it, folded the tech into Creative Cloud, then sunset the free SDK. &lt;a href="https://web.archive.org/web/*/aviary.com" rel="noopener noreferrer"&gt;Wayback&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The developer web&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;s7.addthis.com&lt;/strong&gt; († 2023) — Share buttons + tracking on 15M websites. Oracle bought it for the behavioural data, then killed it under GDPR pressure — a single shutdown darkened share widgets across millions of sites at once. &lt;a href="https://web.archive.org/web/*/s7.addthis.com" rel="noopener noreferrer"&gt;Wayback&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;programmableweb.com&lt;/strong&gt; († 2023) — The public directory of ~19,000 web APIs. The index of the “API economy” for 17 years. Salesforce / MuleSoft erased the whole thing with no archive. &lt;a href="https://web.archive.org/web/*/programmableweb.com" rel="noopener noreferrer"&gt;Wayback&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;securityfocus.com&lt;/strong&gt; († 2021) — Home of the Bugtraq disclosure list (since 1993). The security world's noticeboard for nearly 30 years. Symantec → Broadcom → Accenture let it freeze; the Bugtraq archive survives only at seclists.org. &lt;a href="https://web.archive.org/web/*/securityfocus.com" rel="noopener noreferrer"&gt;Wayback&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;opensolaris.org&lt;/strong&gt; († 2013) — Sun's open-source operating system. Oracle froze it the moment it bought Sun and pulled the domain in 2013. The community kept the code alive as the illumos fork. &lt;a href="https://web.archive.org/web/*/opensolaris.org" rel="noopener noreferrer"&gt;Wayback&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;sorbs.net&lt;/strong&gt; († 2024) — Spam blocklist covering 512M IP addresses. A DNS blocklist that mail servers queried for over two decades. Proofpoint pulled the plug in 2024; servers worldwide still query a list that no longer answers. &lt;a href="https://web.archive.org/web/*/sorbs.net" rel="noopener noreferrer"&gt;Wayback&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Government &amp;amp; institutions&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;patft.uspto.gov&lt;/strong&gt; († 2022) — US patent full-text search (1790–present). Retired for a new search tool — breaking decades of direct patent links embedded in academic papers, legal briefs, and analysis tools. &lt;a href="https://web.archive.org/web/*/patft.uspto.gov" rel="noopener noreferrer"&gt;Wayback&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;petitions.whitehouse.gov&lt;/strong&gt; († 2021) — Obama's “We the People” e-petitions. A petition once topped a million signatures. The platform was quietly discontinued on Inauguration Day 2021 and never revived. &lt;a href="https://web.archive.org/web/*/petitions.whitehouse.gov" rel="noopener noreferrer"&gt;Wayback&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;weblogs.com&lt;/strong&gt; († ~2009) — Dave Winer's blog-ping server · the early blogosphere's heartbeat. Every new blog post once pinged this host; VeriSign paid $2.3M for it. It faded after 2009 — yet old WordPress installs still ping the dead address to this day. &lt;a href="https://web.archive.org/web/*/weblogs.com" rel="noopener noreferrer"&gt;Wayback&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;europa.eu.int&lt;/strong&gt; († 2006) — The European Union's original web address. The canonical home of EU law and institutions for over a decade. Migrated to europa.eu on Europe Day 2006, stranding a generation of links. &lt;a href="https://web.archive.org/web/*/europa.eu.int" rel="noopener noreferrer"&gt;Wayback&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Read those twenty obituaries back-to-back and one cause of death stands out: &lt;strong&gt;being acquired.&lt;/strong&gt; Seven of the twenty were bought by a bigger company that then switched them off — Snap killed Gfycat, Apple killed Topsy, Oracle killed both AddThis and OpenSolaris, Adobe killed Aviary, Salesforce killed ProgrammableWeb, Broadcom let SecurityFocus rot. "Acqui-killed" beats bankruptcy, lawsuits, and neglect combined.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fvl314sbwsao59e51ocyq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fvl314sbwsao59e51ocyq.png" alt="Cause of death among 20 notable dead domains: acquired then killed 7, strategic shutdown 5, neglect 4, bankruptcy or lawsuit 2, migrated 2." width="799" height="367"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Twenty headliners can't show the shape of the whole graveyard. So we widened the lens — pulling &lt;strong&gt;~100 widely-recognised, verifiable shutdowns&lt;/strong&gt; (from this scan's dead domains and the public record), dating each to the year its service ended and sorting them into six corners of the web. Stacked by year, two decades of the dying web look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fxmjrwtbqomvvkojitc3f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fxmjrwtbqomvvkojitc3f.png" alt="Streamgraph of about 100 notable dead websites by year of death, 2006 to 2026, in six categories. Social &amp;amp; community and Developer &amp;amp; infrastructure are the largest bands; deaths peak in 2012–2017 and 2020–2023." width="800" height="529"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Two things stand out. &lt;strong&gt;Social platforms and developer tools are the bulk of the dead web&lt;/strong&gt; — the social graveyard (Friendster, Orkut, Bebo, Google+, Path, Yik Yak, Ello, Digg…) and the dev-tools column (Google Code, Parse, Google Wave, Gitorious, Sunrise, Mailbox…) are dead even, and together they're more than half of everything here. And the deaths &lt;em&gt;cluster&lt;/em&gt;: a first swell in &lt;strong&gt;2012–2017&lt;/strong&gt; as the Web 2.0 and check-in/anonymous-app generation collapsed, then a second from &lt;strong&gt;2020&lt;/strong&gt; as pandemic-era and big-tech bets were cut (Quibi, Mixer, CNN+, Stadia, Google Play Music). Before 2009 the stream barely exists — most of the web simply wasn't old enough to have died yet.&lt;/p&gt;

&lt;p&gt;One honest caveat, and the reason we re-checked every domain by hand: a dead &lt;em&gt;domain&lt;/em&gt; is not always a dead &lt;em&gt;thing&lt;/em&gt;. Some only look dead because the service rebranded or moved — &lt;code&gt;money.yandex.ru&lt;/code&gt; became YooMoney, the old &lt;code&gt;suicidepreventionlifeline.org&lt;/code&gt; host gave way to &lt;strong&gt;988lifeline.org&lt;/strong&gt;, the EU's &lt;code&gt;europa.eu.int&lt;/code&gt; simply became &lt;code&gt;europa.eu&lt;/code&gt;. We re-probed every domain above against live DNS in June 2026 and dropped the false positives (&lt;code&gt;nrel.gov&lt;/code&gt; and &lt;code&gt;angelfire.com&lt;/code&gt; still resolve fine). What remains genuinely no longer answers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Death is uneven: which TLDs rot fastest
&lt;/h2&gt;

&lt;p&gt;The dead rate is not evenly spread. Split the 10 million by top-level domain and a clear gradient appears: cheap-registration and institutional TLDs rot far faster than the .com baseline.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fqg3bs5ujmu2mvnipzknb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fqg3bs5ujmu2mvnipzknb.png" alt="Dead rate by TLD: .cn 33%, .info 28.4%, .in 25.9%, .gov 25.9%, .edu 22%, .us 22%, .br 20.9%, .net 19.9%." width="800" height="507"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The standouts tell two stories. &lt;code&gt;.cn&lt;/code&gt;, &lt;code&gt;.info&lt;/code&gt;, and &lt;code&gt;.in&lt;/code&gt; lead, most plausibly because they are cheap and heavily registered for short-lived or speculative sites that lapse quickly. But finding &lt;code&gt;.gov&lt;/code&gt; (26%) and &lt;code&gt;.edu&lt;/code&gt; (22%) near the top is the more striking result: institutional domains rot badly because content is reorganized, departments are dissolved, and old project sites are simply switched off. That is exactly the digital decay &lt;a href="https://www.pewresearch.org/data-labs/2024/05/17/when-online-content-disappears/" rel="noopener noreferrer"&gt;Pew Research documented in 2024&lt;/a&gt;, where government and reference pages had some of the worst link rot. The web's most authoritative corners are some of its least permanent.&lt;/p&gt;

&lt;h2&gt;
  
  
  The geography of the dead web
&lt;/h2&gt;

&lt;p&gt;Group the country-code domains by country and the decay draws a map. The emerging-market registration booms of the last decade left the biggest graveyards (China's &lt;code&gt;.cn&lt;/code&gt; leads at a third dead), while German-speaking Europe runs some of the most durable web space in the scan.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fteyk944obti5np1o7fky.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fteyk944obti5np1o7fky.png" alt="World map of dead-domain rate by country-code TLD: China 33%, India 26%, United States 22%, Brazil 21%, Japan 16%, UK 15%, Russia 15%, down to Germany 7.6% and Czechia 7.3%. Redder is deader." width="800" height="470"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fu871fqczsukynhfmg687.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fu871fqczsukynhfmg687.png" alt="Dead rate by country-code TLD: China 33%, India 25.9%, United States 22%, Brazil 20.9%, Spain 16.6%, Japan 15.6%, UK 15.3%, Russia 14.9%, France 14.5%, Italy 13.5%, Netherlands 9.7%, Germany 7.6%." width="799" height="722"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A domain in China's &lt;code&gt;.cn&lt;/code&gt; space is &lt;strong&gt;more than four times&lt;/strong&gt; as likely to be dead as one in Germany's &lt;code&gt;.de&lt;/code&gt;. Fast, cheap, speculative registration — and, for &lt;code&gt;.cn&lt;/code&gt;, a churn-heavy market behind the Great Firewall — leaves more abandoned domains behind; the mature, costlier-to-register German-speaking TLDs barely rot at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the top 10 million is even made of
&lt;/h2&gt;

&lt;p&gt;For context, here's the shape of the corpus itself. &lt;code&gt;.com&lt;/code&gt; is not just first: at 44.1% it is nearly half of the entire top 10 million, larger than every country-code TLD and new gTLD combined.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fw1dqqt3s3rny9p06agyl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fw1dqqt3s3rny9p06agyl.png" alt="Largest TLDs by share of the top 10 million: .com 44.1%, .org 8.8%, .io 3.6%, .de 3.5%, .net 3.5%." width="800" height="615"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Two details worth flagging: &lt;code&gt;.io&lt;/code&gt; (3.6%) has quietly become the third-largest TLD on the popular web — the developer/startup default — and the AI-era &lt;code&gt;.ai&lt;/code&gt; (0.30%, ~30,000 domains) has already overtaken established country domains like &lt;code&gt;.fi&lt;/code&gt;, &lt;code&gt;.no&lt;/code&gt;, and &lt;code&gt;.tw&lt;/code&gt; in the top 10 million.&lt;/p&gt;

&lt;h2&gt;
  
  
  The dead web is the long tail nobody visits
&lt;/h2&gt;

&lt;p&gt;Death is not spread evenly through the ranking. Split the 10 million by popularity and the dead rate climbs more than &lt;strong&gt;20×&lt;/strong&gt;, from 0.8% in the top 1,000 to 16.1% past rank 5 million. &lt;code&gt;blocked&lt;/code&gt; runs the other way: it peaks at 15.1% in the 1K–10K band, where heavily trafficked sites wall bots hardest, then thins to 8.5% down the tail.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Filtdvunqkkoxkk1ld4m7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Filtdvunqkkoxkk1ld4m7.png" alt="Dead rate rises from 0.8% in the top 1K to 16.1% at rank 5 to 10M; blocked peaks at 15.1% in the 1K to 10K band and falls to 8.5%." width="800" height="516"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That gradient reframes the headline. The 14% is real by domain &lt;em&gt;count&lt;/em&gt; — but those dead domains are almost all in the part of the web nobody visits. &lt;strong&gt;99.8% of dead domains sit below rank 100,000&lt;/strong&gt;, and the popular top-100K — where the overwhelming majority of web traffic lives — is only &lt;strong&gt;2.2% dead&lt;/strong&gt;. Weighted by attention instead of raw count, the dead web nearly disappears:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Ff77nto3mu628jinngu1s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Ff77nto3mu628jinngu1s.png" alt="Dead rate is 14.2% by domain count but only about 3% weighted by traffic." width="800" height="309"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  "Dead web" is not "link rot" — and definitely not "dead internet theory"
&lt;/h2&gt;

&lt;p&gt;Three different things get blurred together. Keeping them separate is the whole point:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;This study (dead domains):&lt;/strong&gt; does the &lt;em&gt;domain&lt;/em&gt; still resolve and answer? We find 14.2% of the top 10M do not.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Link rot (Pew, Ahrefs):&lt;/strong&gt; are the &lt;em&gt;links inside living pages&lt;/em&gt; still good? &lt;a href="https://www.pewresearch.org/data-labs/2024/05/17/when-online-content-disappears/" rel="noopener noreferrer"&gt;Pew Research&lt;/a&gt; found 25% of pages from 2013–2023 are gone and 38% of 2013 pages have vanished; &lt;a href="https://ahrefs.com/blog/link-rot-study/" rel="noopener noreferrer"&gt;Ahrefs&lt;/a&gt; found 66.5% of tracked links have rotted. Those measure decay &lt;em&gt;within&lt;/em&gt; the living web — a complement to this, not the same number.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dead internet theory:&lt;/strong&gt; the claim that AI-generated content and bots have displaced human activity online. That is about &lt;em&gt;what's on&lt;/em&gt; the living web, not whether domains are reachable. It is a separate conversation, and conflating it with link rot is how bad statistics spread.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you only remember one distinction: link rot is about the pages that are still up; the dead web is about the domains that aren't.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means if you're building a scraper or a data pipeline
&lt;/h2&gt;

&lt;p&gt;The practical takeaway is the 8.9% blocked slice, because it is the part most likely to break your project. When a request fails, the reason dictates the fix, and the two fixes are nothing alike:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;dead&lt;/strong&gt; domain (no DNS, or a refused connection) will never answer. Retrying, rotating proxies, or switching to a browser does nothing. Drop it and move on.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;blocked&lt;/strong&gt; domain is alive and reachable — it just refused &lt;em&gt;your&lt;/em&gt; client. A matched browser TLS/JA3 fingerprint or a residential IP gets in where a datacenter bot gets a 403. This is a transport problem, not a dead site.&lt;/li&gt;
&lt;/ul&gt;
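
&lt;p&gt;The two cases above can be sketched as a small decision function. This is a minimal illustration rather than the scanner's actual logic: the tier names in &lt;code&gt;ESCALATION_LADDER&lt;/code&gt;, their order, and the &lt;code&gt;next_client&lt;/code&gt; helper are all hypothetical.&lt;/p&gt;

```python
from typing import Optional

# Hypothetical client tiers, cheapest first. The names and their order are
# assumptions made for illustration only.
ESCALATION_LADDER = ("datacenter_bot", "browser_tls_fingerprint", "residential_ip")


def next_client(outcome: str, current: str) -> Optional[str]:
    """Return the next client tier to try, or None when retrying is pointless."""
    if outcome != "blocked":
        # dead: no client will ever get an answer; alive: nothing to fix.
        return None
    position = ESCALATION_LADDER.index(current)
    if position + 1 == len(ESCALATION_LADDER):
        return None  # already on the most capable tier
    return ESCALATION_LADDER[position + 1]
```

&lt;p&gt;A dead outcome stops immediately, while a blocked outcome moves one tier up the ladder, so the expensive clients are only spent on sites that are actually there.&lt;/p&gt;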

&lt;p&gt;This isn't theoretical. Probing every domain a second time with a real Chrome TLS/JA3 fingerprint recovered &lt;strong&gt;~72,000 of the ~890,000 sites the polite bot was blocked from&lt;/strong&gt; — enough to pull the blocked rate from 8.9% down to 8.2%. Every one of those is a live site reachable with the right client, not a dead end.&lt;/p&gt;
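
&lt;p&gt;The arithmetic behind that shift can be checked in a few lines. The counts are the ones quoted in this article (9,992,781 domains reached, 891,672 blocked, roughly 72,000 recovered); using the reached-domain count as the denominator is an assumption, and the &lt;code&gt;share&lt;/code&gt; helper is illustrative.&lt;/p&gt;

```python
PROBED = 9_992_781        # domains reached, as reported in the methodology
BLOCKED_POLITE = 891_672  # blocked when probed as an honest bot
RECOVERED = 72_000        # approximate count recovered by the browser-like probe


def share(count: int, total: int = PROBED) -> float:
    """Percentage of the probed domains, rounded to one decimal place."""
    return round(100 * count / total, 1)


before = share(BLOCKED_POLITE)
after = share(BLOCKED_POLITE - RECOVERED)
print(before, after)  # 8.9 8.2
```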

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The blocked web is the web you actually want.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We cross-checked a sample of these results against Similarweb traffic, and the blocked sites are by far the more valuable ones. The blocked domains in our top-ranked sample pull a &lt;strong&gt;median of roughly 150 million monthly visits&lt;/strong&gt;, with Reddit (4.4 billion), Canva (975 million), Claude.ai (952 million), and Quora (313 million) among them. The dead ones record &lt;strong&gt;under 5,000 visits each, and most register zero&lt;/strong&gt;, a gap of four to five orders of magnitude. Sites run a wall precisely because their data is worth taking, so the 8.9% blocked slice isn't noise; it is the most valuable 8.9% of the web.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Naive crawlers can't tell these apart, so they either give up on reachable sites or burn a budget retrying gone ones. The cost-efficient pattern is to escalate only as far as a site forces you to — which is exactly how &lt;a href="https://crawlora.net/use-cases/ai-web-scraping?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;Crawlora's anti-bot unblocker&lt;/a&gt; works, and why it bills &lt;a href="https://crawlora.net/pricing?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;on success&lt;/a&gt; rather than per attempt. If you want to know which bucket a specific URL is in before you build, the free &lt;a href="https://crawlora.net/tools/can-i-scrape-this-site?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;anti-bot checker&lt;/a&gt; tells you in about 30 seconds, and our companion &lt;a href="https://crawlora.net/blog/anti-bot-adoption-index-2026?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;Anti-Bot Adoption Index&lt;/a&gt; measures how much of the live web runs a wall at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two more things the scan turned up
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The web is a maze of redirects.&lt;/strong&gt; Only 69% of domains serve their final page directly; &lt;strong&gt;31% bounce through at least one redirect&lt;/strong&gt; — and a stubborn sliver loops until our 10-hop cap. That is exactly why a crawler that doesn't follow redirects sees a web that looks half-broken.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fqo57r76mk7e5ladtz35y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fqo57r76mk7e5ladtz35y.png" alt="Redirect depth: 69.4% load direct, 23.9% one redirect, 5.2% two, 1.1% three, 0.4% four or more." width="800" height="345"&gt;&lt;/a&gt;&lt;/p&gt;
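
&lt;p&gt;Following redirects with a hop cap is simple to sketch. The version below is an illustration, not the scanner's code: &lt;code&gt;follow_redirects&lt;/code&gt; and the &lt;code&gt;fetch&lt;/code&gt; callable are hypothetical, with &lt;code&gt;fetch&lt;/code&gt; standing in for whatever HTTP client returns a status and a &lt;code&gt;Location&lt;/code&gt; value. Only the 10-hop cap comes from the text above.&lt;/p&gt;

```python
from typing import Callable, Optional, Tuple
from urllib.parse import urljoin

MAX_HOPS = 10  # the hop cap mentioned in the article

# fetch(url) returns (status, location); a hypothetical stand-in for a real HTTP client.
Fetch = Callable[[str], Tuple[int, Optional[str]]]


def follow_redirects(url: str, fetch: Fetch, max_hops: int = MAX_HOPS) -> Tuple[str, int, int]:
    """Follow 3xx responses; return (final_url, final_status, hops_followed)."""
    for hops in range(max_hops + 1):
        status, location = fetch(url)
        is_redirect = status in range(300, 400) and location is not None
        if not is_redirect or hops == max_hops:
            # Either a final answer, or the cap was hit and the 3xx stays unresolved.
            return url, status, hops
        url = urljoin(url, location)  # Location may be relative to the current URL
    raise AssertionError("unreachable: the loop always returns")
```

&lt;p&gt;A loop that never settles comes back with a 3xx status after ten hops, which is the case a classifier can file under an unresolved redirect instead of calling the domain dead.&lt;/p&gt;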

&lt;p&gt;&lt;strong&gt;The dead web is stuck on HTTP.&lt;/strong&gt; A decade into the HTTPS transition, the living web is ~78% encrypted, but dead (52.8%) and bot-blocked (47.5%) domains are barely half. Many of the dead ones were abandoned before they ever got a certificate.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fk79nxi60oxmt1lkpkgu7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fk79nxi60oxmt1lkpkgu7.png" alt="Share served over HTTPS by outcome: alive 78.4%, redirect 78.6%, dead 52.8%, blocked 47.5%." width="799" height="313"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How we measured it
&lt;/h2&gt;

&lt;p&gt;No magic — a deliberately simple, reproducible probe, run at 10-million scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The list.&lt;/strong&gt; The full top 10 million domains (a DomCop/Tranco-style popularity ranking). We reached 9,992,781 of them — 99.95% coverage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The probe.&lt;/strong&gt; Each domain is fetched HTTPS-first from a datacenter IP, following redirects, with a short timeout and a cross-resolver DNS retry before any "DNS failed" verdict. We never submit a form, solve a CAPTCHA, log in, or fetch anything behind a wall. Every domain is probed &lt;strong&gt;twice&lt;/strong&gt; — once as an honest bot, and once as a browser-like client with a real Chrome TLS/JA3 fingerprint — so we can separate "nobody's home" from "the bot wasn't let in."&lt;/p&gt;
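
&lt;p&gt;The cross-resolver retry can be sketched as "only call DNS gone when every resolver agrees." This is an assumption-laden illustration, not the scanner's code: &lt;code&gt;dns_is_gone&lt;/code&gt; and &lt;code&gt;system_resolver&lt;/code&gt; are hypothetical names, and only the operating system's resolver is wired up here. A second, independent resolver would need a DNS library and is left as a callable you supply.&lt;/p&gt;

```python
import socket
from typing import Callable, Iterable

# A resolver takes a domain and returns a (possibly empty) list of address records.
Resolver = Callable[[str], list]


def system_resolver(domain: str) -> list:
    """Resolve through the operating system's configured resolver."""
    return socket.getaddrinfo(domain, None)


def dns_is_gone(domain: str, resolvers: Iterable[Resolver]) -> bool:
    """True only if every resolver fails; one success means the domain still resolves."""
    candidates = list(resolvers)
    if not candidates:
        raise ValueError("need at least one resolver")
    for resolve in candidates:
        try:
            if resolve(domain):
                return False  # at least one resolver returned an address
        except socket.gaierror:
            continue  # this resolver failed; ask the next one
    return True
```

&lt;p&gt;Passing resolvers in as callables keeps the verdict logic separate from the lookup, so a single flaky resolver cannot declare a domain dead on its own.&lt;/p&gt;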

&lt;p&gt;&lt;strong&gt;The classification.&lt;/strong&gt; A final 2xx, or a served 404/5xx (the host answered), is &lt;code&gt;alive&lt;/code&gt;. A 403/429 or anti-bot challenge is &lt;code&gt;blocked&lt;/code&gt;. A 3xx we can't resolve is &lt;code&gt;redirect&lt;/code&gt;. Only no DNS, a refused/reset connection, or nothing accepting a connection is &lt;code&gt;dead&lt;/code&gt;. That single rule — &lt;em&gt;a server that answers anything is up&lt;/em&gt; — is the entire difference between 14.2% and 27.6%.&lt;/p&gt;
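
&lt;p&gt;Written out as code, the rule is short. The sketch below follows the four outcomes described above; the &lt;code&gt;ProbeResult&lt;/code&gt; type and the &lt;code&gt;classify&lt;/code&gt; function are hypothetical names, challenge detection is assumed to happen elsewhere, and treating every other answered status as &lt;code&gt;alive&lt;/code&gt; is our reading of "a server that answers anything is up."&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProbeResult:
    # Final HTTP status after following redirects; None means no response at all
    # (no DNS, refused or reset connection, nothing accepting a connection).
    status: Optional[int] = None
    # True if an anti-bot challenge page was detected (detection not shown here).
    challenge: bool = False


def classify(result: ProbeResult) -> str:
    """Map one probe result to alive, blocked, redirect, or dead."""
    if result.status is None:
        return "dead"
    if result.challenge or result.status in (403, 429):
        return "blocked"
    if result.status in range(300, 400):
        return "redirect"  # a 3xx that could not be resolved to a final page
    return "alive"  # 2xx, or a served 404/5xx: the host answered
```

&lt;p&gt;The order of the checks matters: the only path to &lt;code&gt;dead&lt;/code&gt; is the absence of any response, so an answered error can never be counted as death.&lt;/p&gt;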

&lt;p&gt;&lt;strong&gt;Limits.&lt;/strong&gt; This is homepage-level reachability from a datacenter vantage, so it is a &lt;strong&gt;lower bound&lt;/strong&gt;: a domain that blocks a datacenter bot may open for a residential browser, and a deep page can be deader (or more defended) than the homepage. Snapshot: June 2026. The full per-domain dataset — every domain, every arm — is open, and the live, searchable version is the &lt;a href="https://crawlora.net/dead-web-index?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;Dead-Web Index&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sources&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://crawlora.net/dead-web-index?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;Dead-Web Index — the full searchable dataset&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Crawlora-org/dead-web-index-data" rel="noopener noreferrer"&gt;Full per-domain data on GitHub (CC BY 4.0)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Crawlora-org/ten-million-domains" rel="noopener noreferrer"&gt;The scanner, open source&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.pewresearch.org/data-labs/2024/05/17/when-online-content-disappears/" rel="noopener noreferrer"&gt;Pew Research — When Online Content Disappears (2024)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ahrefs.com/blog/link-rot-study/" rel="noopener noreferrer"&gt;Ahrefs — Link Rot Study&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://crawlora.net/blog/anti-bot-adoption-index-2026?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;Crawlora Anti-Bot Adoption Index — how much of the web runs a wall&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Frequently asked questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How many of the world's top websites are dead?
&lt;/h3&gt;

&lt;p&gt;14.2% of the top 10 million domains are genuinely dead — about 1.41 million sites that no longer resolve in DNS or refuse every connection. That is far below the often-quoted 27.6%, which counted anti-bot blocks and answered errors as death.&lt;/p&gt;

&lt;h3&gt;
  
  
  What's the difference between a dead website and a blocked one?
&lt;/h3&gt;

&lt;p&gt;A dead site never answers: there is no DNS record, or nothing accepts a TCP connection. A blocked site is alive and answering; it just refuses an automated client (a 403, 429, or anti-bot challenge). 8.9% of the top 10 million (891,672 sites) is blocked, not dead, a distinction naive crawlers miss.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is the dead web the same as the dead internet theory?
&lt;/h3&gt;

&lt;p&gt;No. The dead internet theory is a claim that AI-generated content and bots have replaced human activity on the living web. This study measures a different, concrete thing: how many domains have gone completely dark and unreachable (DNS gone, connection refused, server gone).&lt;/p&gt;

&lt;h3&gt;
  
  
  Why is this lower than the 27.6% dead-web figure?
&lt;/h3&gt;

&lt;p&gt;Earlier top-10M crawls counted three non-dead things as dead: anti-bot 403/429 blocks, 404/5xx pages served by a live server, and domains a single flaky DNS resolver failed to look up. Classifying honestly — dead means genuinely unreachable — brings the real figure to 14.2%.&lt;/p&gt;

&lt;h3&gt;
  
  
  Which TLD has the most dead domains?
&lt;/h3&gt;

&lt;p&gt;By rate, .cn is the deadest of the common TLDs at 33%. Institutional TLDs like .gov (26%) and .edu (22%) also rank high, in line with Pew Research's finding that government and reference pages suffer some of the worst link rot.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why does a site look dead to a scraper but load fine in my browser?
&lt;/h3&gt;

&lt;p&gt;Anti-bot systems serve a 403 or a challenge to a datacenter IP while letting a real browser through. A matched browser TLS/JA3 fingerprint reaches the site where a naive bot is blocked — which is why this study probes every domain twice, as a polite bot and as a browser-like client.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://crawlora.net/blog/how-much-of-the-web-is-dead-2026?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;crawlora.net&lt;/a&gt;. &lt;a href="https://crawlora.net/?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;Crawlora&lt;/a&gt; is a structured web-data, search, and anti-bot API — dozens of platforms as normalized JSON, plus a hosted MCP server, with a free tier (no card).&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>webscraping</category>
      <category>dns</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Why Reddit Blocked Unauthenticated JSON in 2026 (and How to Still Get Reddit Data)</title>
      <dc:creator>Tony Wang</dc:creator>
      <pubDate>Mon, 15 Jun 2026 05:43:49 +0000</pubDate>
      <link>https://dev.to/tonywangca/why-reddit-blocked-unauthenticated-json-in-2026-and-how-to-still-get-reddit-data-58b9</link>
      <guid>https://dev.to/tonywangca/why-reddit-blocked-unauthenticated-json-in-2026-and-how-to-still-get-reddit-data-58b9</guid>
      <description>&lt;p&gt;&lt;strong&gt;Key takeaways&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On May 28, 2026, Reddit announced it is deprecating unauthenticated .json endpoints — within days, appending .json to a URL started returning 403, silently breaking most open-source Reddit scrapers.&lt;/li&gt;
&lt;li&gt;The real driver is AI and money: Reddit's two decades of human conversation became a licensed AI-training asset (~$130M in 2024 from deals with Google and OpenAI), and free scraping undercut it — so Reddit is gating the data and suing those who take it without paying.&lt;/li&gt;
&lt;li&gt;Reddit's stated rationale is scraping 'without accountability' plus bot and agentic abuse, alongside a clarified Rule 8; it is steering developers to authenticated access and Devvit, and has flagged RSS as the next surface to close.&lt;/li&gt;
&lt;li&gt;You can still get public Reddit data compliantly — the official (paid) API, authenticated access, or a managed API that keeps the access path working and returns normalized JSON — but the free append-.json era is over.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For years, the simplest way to get structured data out of Reddit was a trick everyone knew: append &lt;code&gt;.json&lt;/code&gt; to any Reddit URL and get clean JSON back — no API key, no OAuth, no account. It quietly powered most open-source Reddit scrapers, research scripts, bots, and data pipelines.&lt;/p&gt;

&lt;p&gt;That door is now closed. On &lt;strong&gt;May 28, 2026&lt;/strong&gt;, Reddit posted &lt;a href="https://www.reddit.com/r/modnews/comments/1tq9vxo/" rel="noopener noreferrer"&gt;Protecting communities from scrapers and platform abuse&lt;/a&gt; to r/modnews, announcing it would shut down unauthenticated &lt;code&gt;.json&lt;/code&gt; access. Within days, requests started coming back &lt;strong&gt;403 Forbidden&lt;/strong&gt; — with no deprecation window. If your scraper "still runs" but returns nothing, this is why.&lt;/p&gt;

&lt;p&gt;This post explains &lt;strong&gt;why&lt;/strong&gt; Reddit did it — the answer is mostly AI and money — and the &lt;strong&gt;compliant ways to still get Reddit data&lt;/strong&gt; in 2026.&lt;/p&gt;

&lt;h2&gt;
  
  
  What actually broke
&lt;/h2&gt;

&lt;p&gt;In Reddit's own words: &lt;em&gt;"Deprecating unauthenticated JSON access: We'll also be shutting down unauthenticated &lt;code&gt;.json&lt;/code&gt; endpoints. These endpoints can be used to scrape Reddit without accountability. Logged-in and authenticated access won't be impacted."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;So:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Anonymous &lt;code&gt;.json&lt;/code&gt; requests now 403.&lt;/strong&gt; &lt;code&gt;https://www.reddit.com/r/&amp;lt;sub&amp;gt;/top.json&lt;/code&gt; and friends no longer return data without authentication.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It fails silently in a lot of tools.&lt;/strong&gt; Many scrapers get a 403 (or an empty/redirect response) but appear to "succeed," so pipelines quietly go dark instead of erroring loudly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authenticated access still works.&lt;/strong&gt; Logged-in sessions and the official OAuth API are unaffected — that is the entire point of the change.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RSS is next.&lt;/strong&gt; In the same post Reddit called RSS "another common surface for scraping," so feed-based access is on notice too.&lt;/li&gt;
&lt;/ul&gt;
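
&lt;p&gt;To keep a pipeline from going dark quietly, one option is to validate every response before using it. The sketch below is illustrative only: &lt;code&gt;RedditAccessError&lt;/code&gt; and &lt;code&gt;require_listing&lt;/code&gt; are hypothetical names, and it assumes the classic listing shape of a JSON object with &lt;code&gt;data.children&lt;/code&gt;. It takes an already-fetched status code and decoded body, so it makes no network calls itself.&lt;/p&gt;

```python
class RedditAccessError(RuntimeError):
    """Raised when a response looks like a block rather than real data."""


def require_listing(status_code, payload):
    """Return the listing's children, or raise if the response is unusable.

    status_code: HTTP status of the response.
    payload: the decoded JSON body, or None if decoding failed.
    Assumes the listing shape {"data": {"children": [...]}}.
    """
    if status_code in (401, 403, 429):
        raise RedditAccessError(f"blocked or unauthenticated: HTTP {status_code}")
    if status_code != 200:
        raise RedditAccessError(f"unexpected HTTP {status_code}")
    if not isinstance(payload, dict):
        raise RedditAccessError("body was not a JSON object")
    data = payload.get("data")
    children = data.get("children") if isinstance(data, dict) else None
    if not children:
        # Deliberately strict: an empty listing is reported, not silently accepted.
        raise RedditAccessError("empty listing")
    return children


try:
    require_listing(403, None)
except RedditAccessError as err:
    print("pipeline stopped:", err)
```

&lt;p&gt;Treating an empty listing as an error is a judgment call; a genuinely empty subreddit would trip it, so a real pipeline might downgrade that case to a warning.&lt;/p&gt;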

&lt;h2&gt;
  
  
  Why Reddit did it
&lt;/h2&gt;

&lt;p&gt;The technical change is small. The motivation behind it is the bigger story — and yes, it is largely about &lt;strong&gt;AI chatbots and bot traffic&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reddit's data became an AI goldmine — and a product
&lt;/h3&gt;

&lt;p&gt;Reddit is two decades of real human questions, answers, and opinions — exactly the text that makes large language models useful, and one of the &lt;strong&gt;most-cited sources in AI answers&lt;/strong&gt;. Once that became obvious, Reddit turned its archive into a licensed product:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;~$60M/year licensing deal with Google&lt;/strong&gt; (February 2024) to train Gemini on Reddit data.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;licensing deal with OpenAI&lt;/strong&gt; (May 2024) for ChatGPT.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;~$130M in data-licensing revenue in 2024&lt;/strong&gt; — roughly 10% of Reddit's total revenue.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When the data is the product, the free append-&lt;code&gt;.json&lt;/code&gt; endpoint is a leak: it let anyone — especially AI companies — take the same data for nothing, undercutting the paid deals.&lt;/p&gt;

&lt;h3&gt;
  
  
  AI bots were taking it for free — "without accountability"
&lt;/h3&gt;

&lt;p&gt;This is the part most people's instinct gets right. The explosion of AI training crawlers and live "grounding" agents (assistants that fetch Reddit threads at answer time) created enormous automated traffic against the exact endpoints that required no identity. Reddit's framing names it directly: &lt;em&gt;"large-scale scraping, spam networks, agentic account creation, and automated abuse."&lt;/em&gt; The unauthenticated &lt;code&gt;.json&lt;/code&gt; route was the anonymous front door for all of it — data taken with no key to rate-limit, bill, or ban.&lt;/p&gt;

&lt;h3&gt;
  
  
  So Reddit started enforcing — in court
&lt;/h3&gt;

&lt;p&gt;Killing &lt;code&gt;.json&lt;/code&gt; is the technical half of a broader campaign:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reddit &lt;strong&gt;sued Anthropic&lt;/strong&gt; (June 2025), alleging its bots crawled Reddit &lt;strong&gt;100,000+ times&lt;/strong&gt; and bypassed &lt;code&gt;robots.txt&lt;/code&gt; after declining to license.&lt;/li&gt;
&lt;li&gt;Reddit then &lt;strong&gt;sued Perplexity&lt;/strong&gt; and three scraping firms — SerpApi, Oxylabs, and AWM Proxy (October 2025).&lt;/li&gt;
&lt;li&gt;Reddit &lt;strong&gt;blocked the Internet Archive's Wayback Machine&lt;/strong&gt; (August 2025) over AI-scraping concerns.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cutting off anonymous &lt;code&gt;.json&lt;/code&gt; is how you enforce "license it or don't take it" at the protocol level.&lt;/p&gt;

&lt;h3&gt;
  
  
  It's part of the bigger "closing web"
&lt;/h3&gt;

&lt;p&gt;Reddit is the highest-profile example of a wider shift: as AI made web data commercially valuable, the open, anonymous web that append-&lt;code&gt;.json&lt;/code&gt; belonged to is closing. Sites are gating and monetizing data, Cloudflare now blocks AI crawlers by default for many customers, and "pay-per-crawl" is becoming real. The era of casual, anonymous access to public data is ending.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why your scraper gets 403 now (it is not your credentials)
&lt;/h2&gt;

&lt;p&gt;Teams hitting this assume it is an auth or rate-limit bug. It usually is not. Reddit's 2026 enforcement also leans on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;TLS fingerprinting&lt;/strong&gt; — generic clients (&lt;code&gt;requests&lt;/code&gt;, &lt;code&gt;wget&lt;/code&gt;, default &lt;code&gt;curl&lt;/code&gt;) are identified by their TLS handshake and blocked, even with perfect headers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IP reputation&lt;/strong&gt; — datacenter and cloud IPs (GitHub Actions, Vercel, common hosts) are heavily flagged; the same request often works from a residential browser and 403s from a server.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No anonymous fallback&lt;/strong&gt; — the &lt;code&gt;.json&lt;/code&gt; path that used to absorb all this is gone.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is why "add a User-Agent" or "back off the rate" no longer fixes it — the block is at the access-policy and fingerprint layer, not the request rate.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to get Reddit data in 2026 (compliant options)
&lt;/h2&gt;

&lt;p&gt;The free anonymous path is over, but public Reddit data is still reachable through sanctioned routes. Ranked:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The official Reddit Data API / Devvit
&lt;/h3&gt;

&lt;p&gt;Reddit points developers to its &lt;strong&gt;authenticated Data API&lt;/strong&gt; (OAuth) and the &lt;strong&gt;Devvit&lt;/strong&gt; developer platform — the sanctioned path:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Free for &lt;strong&gt;non-commercial&lt;/strong&gt; use, capped at ~100 requests/minute.&lt;/li&gt;
&lt;li&gt;Commercial access runs about &lt;strong&gt;$0.24 per 1,000 requests&lt;/strong&gt;; enterprise agreements start near &lt;strong&gt;$12,000/year&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Best when you can register an app, do the OAuth dance, and your use fits Reddit's terms.&lt;/p&gt;
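
&lt;p&gt;To get a feel for what those limits mean for a given workload, here is a back-of-the-envelope sketch. The two constants simply restate the approximate figures quoted above; the function names and the one-million-request workload are hypothetical, and real billing may differ.&lt;/p&gt;

```python
FREE_TIER_REQUESTS_PER_MINUTE = 100   # approximate free-tier cap quoted above
COMMERCIAL_USD_PER_1000 = 0.24        # approximate commercial rate quoted above


def minutes_at_free_cap(total_requests):
    """Minimum wall-clock minutes to issue total_requests at the free cap."""
    return total_requests / FREE_TIER_REQUESTS_PER_MINUTE


def commercial_cost_usd(total_requests):
    """Estimated commercial cost in USD for total_requests."""
    return total_requests / 1000 * COMMERCIAL_USD_PER_1000


requests_per_month = 1_000_000  # hypothetical workload
hours = minutes_at_free_cap(requests_per_month) / 60
print(round(hours, 1), "hours of continuous calls at the free cap")
print(round(commercial_cost_usd(requests_per_month), 2), "USD at the commercial rate")
```

&lt;p&gt;On these quoted figures, a $12,000 annual spend corresponds to roughly 50 million requests at the per-request rate, which is one way to sanity-check whether an enterprise agreement is even relevant to your volume.&lt;/p&gt;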

&lt;h3&gt;
  
  
  2. Authenticated / session-based access
&lt;/h3&gt;

&lt;p&gt;A logged-in browser session (cookies, a real browser via Playwright) still works, because authenticated access is unaffected. It is viable for small, careful jobs — but it is fragile (sessions expire, fingerprints get flagged) and you own all the maintenance and the terms-of-service risk.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. A managed Reddit API (Crawlora)
&lt;/h3&gt;

&lt;p&gt;If you want structured Reddit data without maintaining auth, proxies, and fingerprints — or rewriting your scraper every time Reddit changes the rules — a managed API does that for you. &lt;a href="https://crawlora.net/platforms/reddit?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;Crawlora's Reddit API&lt;/a&gt; returns &lt;strong&gt;normalized JSON&lt;/strong&gt; for search, posts, comment threads, and subreddit feeds from one key, and maintains the access path as Reddit tightens it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-G&lt;/span&gt; &lt;span class="s2"&gt;"https://api.crawlora.net/api/v1/reddit/subreddit/webdev/posts"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"x-api-key: &lt;/span&gt;&lt;span class="nv"&gt;$CRAWLORA_API_KEY&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--data-urlencode&lt;/span&gt; &lt;span class="s2"&gt;"sort=hot"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--data-urlencode&lt;/span&gt; &lt;span class="s2"&gt;"limit=25"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.crawlora.net/api/v1/reddit/search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;x-api-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;q&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;web scraping&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sort&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;top&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;limit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;posts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;subreddit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You get posts, comments, and feeds as clean JSON, and you stop chasing Reddit's changes — that is the trade you are buying.&lt;/p&gt;

&lt;h2&gt;
  
  
  A note on compliance
&lt;/h2&gt;

&lt;p&gt;Reddit's &lt;a href="https://www.redditinc.com/policies/data-api-terms" rel="noopener noreferrer"&gt;updated Data API terms and Rule 8&lt;/a&gt; now explicitly cover automated abuse and unauthorized scraping, and the May 2026 change makes Reddit's stance clear. Whatever route you choose:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Collect only &lt;strong&gt;public&lt;/strong&gt; posts, comments, and subreddits — never private, quarantined, or personal data.&lt;/li&gt;
&lt;li&gt;Treat &lt;strong&gt;usernames and comment text as personal data&lt;/strong&gt; (GDPR/CCPA) — minimize what you store and have a lawful basis, especially for AI-training use.&lt;/li&gt;
&lt;li&gt;Prefer the &lt;strong&gt;official API or a licensed/managed path&lt;/strong&gt;, and review Reddit's terms and your local law before commercial or AI use.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not legal advice — see &lt;a href="https://crawlora.net/blog/is-web-scraping-legal-2026?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;Is web scraping legal in 2026?&lt;/a&gt; for the public-vs-personal-data detail.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.reddit.com/r/modnews/comments/1tq9vxo/" rel="noopener noreferrer"&gt;Reddit r/modnews — Protecting communities from scrapers and platform abuse (May 28, 2026)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.redditinc.com/policies/data-api-terms" rel="noopener noreferrer"&gt;Reddit — Data API Terms&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.entrepreneur.com/business-news/reddit-sues-ai-startup-anthropic-over-alleged-ai-training/492769" rel="noopener noreferrer"&gt;Reddit sues Anthropic over alleged AI-training scraping (June 2025)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://builtin.com/articles/reddit-perplexity-data-scraping-lawsuit" rel="noopener noreferrer"&gt;Why Reddit is suing Perplexity and other data scrapers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://alternativeto.net/news/2025/8/reddit-to-block-wayback-machine-from-indexing-its-content-over-ai-data-scraping-concerns" rel="noopener noreferrer"&gt;Reddit to block the Wayback Machine over AI data-scraping concerns (Aug 2025)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Where this fits
&lt;/h2&gt;

&lt;p&gt;The append-&lt;code&gt;.json&lt;/code&gt; era is over, but Reddit remains one of the richest sources for community research, brand and product sentiment, and grounding data for AI. For the practical how-to (search, posts, comments, subreddit feeds, pagination), see &lt;a href="https://crawlora.net/blog/how-to-scrape-reddit?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;how to scrape Reddit in 2026&lt;/a&gt;; to feed threads into a retrieval pipeline or agent, see the &lt;a href="https://crawlora.net/blog/ai-agent-web-data-mcp?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;MCP integration&lt;/a&gt; and the &lt;a href="https://crawlora.net/use-cases/ai-agent-web-data?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;AI-agent web data&lt;/a&gt; workflow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Try it first, free:&lt;/strong&gt; test the endpoint in the &lt;a href="https://crawlora.net/playground/reddit-search?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;Playground&lt;/a&gt;, read the schema in the &lt;a href="https://crawlora.net/docs/reddit/reddit-search?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;API docs&lt;/a&gt;, and review credit costs on the &lt;a href="https://crawlora.net/pricing?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;pricing page&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently asked questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why did Reddit block unauthenticated .json endpoints?
&lt;/h3&gt;

&lt;p&gt;On May 28, 2026 Reddit announced it was deprecating unauthenticated .json access to stop scraping 'without accountability' and curb bot and agentic abuse. The bigger driver is commercial: Reddit's data is now a licensed AI-training asset (deals with Google and OpenAI worth ~$130M in 2024), and the free .json path let anyone — especially AI companies — take that data without paying.&lt;/p&gt;

&lt;h3&gt;
  
  
  Are Reddit .json URLs still working in 2026?
&lt;/h3&gt;

&lt;p&gt;No. Since late May 2026, appending .json to a Reddit URL returns 403 Forbidden for unauthenticated requests. Logged-in sessions and the official OAuth API still work, and Reddit has flagged RSS as the next surface it may close.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why does my Reddit scraper get 403 even with a User-Agent?
&lt;/h3&gt;

&lt;p&gt;Because the block is no longer about rate or headers. Reddit uses TLS fingerprinting and IP-reputation checks, so generic clients (requests, wget, default curl) and datacenter or cloud IPs get 403 even with a valid User-Agent. The anonymous .json fallback that used to absorb this is gone.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the official way to get Reddit data now?
&lt;/h3&gt;

&lt;p&gt;Reddit's authenticated Data API (OAuth) and the Devvit developer platform. It is free for non-commercial use at about 100 requests/minute; commercial access is roughly $0.24 per 1,000 requests, with enterprise agreements starting near $12,000/year.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is scraping Reddit legal or allowed in 2026?
&lt;/h3&gt;

&lt;p&gt;Reddit's updated Rule 8 and Data API terms restrict unauthorized scraping. Public Reddit data is still reachable through sanctioned routes: collect only public content, treat usernames and comments as personal data, and prefer the official API or a licensed/managed path. Review Reddit's terms and your local law before commercial or AI use. This is not legal advice.&lt;/p&gt;

&lt;h3&gt;
  
  
  How can I still get Reddit data without maintaining a scraper?
&lt;/h3&gt;

&lt;p&gt;A managed API like Crawlora returns normalized JSON for Reddit search, posts, comment threads, and subreddit feeds from one key, and maintains the access path as Reddit tightens it — so you avoid auth, proxies, fingerprinting, and constant breakage.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://crawlora.net/blog/reddit-json-api-blocked-2026?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;crawlora.net&lt;/a&gt;. &lt;a href="https://crawlora.net/?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;Crawlora&lt;/a&gt; is a structured web-data, search, and anti-bot API — dozens of platforms as normalized JSON, plus a hosted MCP server, with a free tier (no card).&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>reddit</category>
      <category>ai</category>
      <category>api</category>
    </item>
    <item>
      <title>Best AI Web Scraping Tools in 2026: How to Choose</title>
      <dc:creator>Tony Wang</dc:creator>
      <pubDate>Sun, 14 Jun 2026 18:02:08 +0000</pubDate>
      <link>https://dev.to/tonywangca/best-ai-web-scraping-tools-in-2026-how-to-choose-m0e</link>
      <guid>https://dev.to/tonywangca/best-ai-web-scraping-tools-in-2026-how-to-choose-m0e</guid>
      <description>&lt;p&gt;&lt;strong&gt;Key takeaways&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;‘AI web scraping’ means two different things: AI-native extractors that read an arbitrary page with an LLM, and structured data APIs that hand AI clean JSON for known sources. Pick by which problem you have.&lt;/li&gt;
&lt;li&gt;AI-native extractors (Firecrawl, ScrapeGraphAI, Diffbot, Browse AI, Kadoa) shine on unknown, one-off pages — but in hands-on tests several still can't paginate natively and lack anti-blocking, and AI extraction runs roughly $0.004–$0.02 per page.&lt;/li&gt;
&lt;li&gt;For repeatable pipelines that feed agents or RAG, a structured API like Crawlora returns documented JSON for supported platforms with no per-site parser, no token tax, and a hosted MCP server.&lt;/li&gt;
&lt;li&gt;Nearly every tool has a free tier — so benchmark accuracy on YOUR pages and compare cost per successful result, not the vendor demo.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The best AI web scraping tool depends on the job: extracting fields from an arbitrary page you’ve never seen, or feeding an AI agent clean, structured data from known sources at scale. Those are different problems, and the tools that win each are different. This guide splits the landscape into categories, ranks the main options with real 2026 pricing and benchmark data, and shows how to compare them on cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  "AI web scraping" is two categories, not one
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AI-native extractors&lt;/strong&gt; — point a model at a page and ask for fields in plain English. They handle unknown layouts and need no selectors, which is great for one-off or long-tail pages. The trade-offs: a per-page model cost, variable accuracy, and drift when sites change.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structured data APIs&lt;/strong&gt; — documented endpoints that return normalized JSON for &lt;em&gt;known&lt;/em&gt; platforms (search, maps, marketplaces, social, finance). No parser to maintain, predictable schemas, no token tax, and easy to hand to an agent or a &lt;a href="https://crawlora.net/use-cases/web-data-for-rag?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;RAG pipeline&lt;/a&gt;. This is &lt;a href="https://crawlora.net/use-cases/ai-web-scraping?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;Crawlora’s category&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most teams end up using both: a structured API for the platforms they hit constantly, and an AI-native extractor for the arbitrary pages in the tail.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to evaluate
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Accuracy on YOUR target pages — run a real sample, not the vendor demo.&lt;/li&gt;
&lt;li&gt;Output: clean JSON you can store directly vs. text you must validate.&lt;/li&gt;
&lt;li&gt;Anti-bot handling: are proxies, browser rendering, and CAPTCHAs handled behind the tool, or are they your problem?&lt;/li&gt;
&lt;li&gt;Pagination: does it follow ‘next page’ on its own, or stop at page one?&lt;/li&gt;
&lt;li&gt;Repeatability: does it hold up on a schedule, or drift when the page changes?&lt;/li&gt;
&lt;li&gt;Agent fit: REST + a hosted MCP server so agents can call it as a tool.&lt;/li&gt;
&lt;li&gt;Cost per successful result at your volume — after retries and per-page model costs.&lt;/li&gt;
&lt;li&gt;Compliance: public data only; review each source's terms.&lt;/li&gt;
&lt;/ul&gt;
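
&lt;p&gt;The "cost per successful result" criterion can be made concrete with a little arithmetic. The sketch below is only an illustration: &lt;code&gt;cost_per_success&lt;/code&gt; is a hypothetical helper, it assumes every attempt is billed and failures are retried until one succeeds (so the expected number of attempts is 1 divided by the success rate), and the sample inputs are placeholders you would replace with your own benchmark numbers.&lt;/p&gt;

```python
def cost_per_success(price_per_page, success_rate, extra_cost_per_page=0.0):
    """Expected spend in USD per successful result.

    price_per_page: what the vendor charges per attempt.
    success_rate: fraction of attempts that return usable data, in (0, 1].
    extra_cost_per_page: per-attempt model or proxy cost you pay yourself.
    """
    # Reject zero and anything outside the 0..1 range.
    clamped = min(max(success_rate, 0.0), 1.0)
    if success_rate == 0 or clamped != success_rate:
        raise ValueError("success_rate must be in (0, 1]")
    return (price_per_page + extra_cost_per_page) / success_rate


# Placeholder inputs: a 0.004 USD page price and an 87.7% success rate.
print(round(cost_per_success(0.004, 0.877), 5))
# Same page price, but with a hypothetical 0.002 USD of your own model cost added.
print(round(cost_per_success(0.004, 0.877, extra_cost_per_page=0.002), 5))
```

&lt;p&gt;Comparing tools on this number rather than on list price is what keeps a cheap-but-flaky option from looking better than it is.&lt;/p&gt;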

&lt;h2&gt;
  
  
  The best AI web scraping tools in 2026
&lt;/h2&gt;

&lt;p&gt;No single winner — match the tool to the problem. Pricing below is the published rate as of mid-2026; always re-check before you commit.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Free tier&lt;/th&gt;
&lt;th&gt;From (paid)&lt;/th&gt;
&lt;th&gt;Best for&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Crawlora&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Structured API + hosted MCP&lt;/td&gt;
&lt;td&gt;2,000 credits/mo&lt;/td&gt;
&lt;td&gt;Credit-based&lt;/td&gt;
&lt;td&gt;Repeatable pipelines + agents over known platforms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Firecrawl&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Crawl-to-markdown for LLMs&lt;/td&gt;
&lt;td&gt;500 one-time credits&lt;/td&gt;
&lt;td&gt;Usage-based&lt;/td&gt;
&lt;td&gt;Whole sites into LLM-ready text / RAG&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ScrapeGraphAI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;AI extraction (open source + cloud)&lt;/td&gt;
&lt;td&gt;Open source&lt;/td&gt;
&lt;td&gt;~$0.02/page (cloud)&lt;/td&gt;
&lt;td&gt;Prompt-defined extraction with self-hosted control&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Crawl4AI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;AI crawler (open source)&lt;/td&gt;
&lt;td&gt;Free (self-host)&lt;/td&gt;
&lt;td&gt;$0 self-host&lt;/td&gt;
&lt;td&gt;Developers who want a free, self-hosted AI crawler&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Diffbot&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;AI extraction + Knowledge Graph&lt;/td&gt;
&lt;td&gt;10,000 credits/mo&lt;/td&gt;
&lt;td&gt;$299/mo&lt;/td&gt;
&lt;td&gt;Article / product / entity extraction at scale&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Browse AI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No-code AI robots&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;~$19/mo&lt;/td&gt;
&lt;td&gt;Point-and-click monitoring of specific pages&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Kadoa&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No-code AI + self-healing&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;~$39/mo&lt;/td&gt;
&lt;td&gt;Hands-off no-code extraction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Apify (AI Web Scraper)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Platform + AI Actor&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;$35 / 1,000 pages&lt;/td&gt;
&lt;td&gt;Prebuilt scrapers and pipelines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Octoparse&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No-code visual + AI assist&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Tiered&lt;/td&gt;
&lt;td&gt;Visual scraping for non-developers&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  1. Crawlora — structured JSON for agents, no parser
&lt;/h3&gt;

&lt;p&gt;For data you call repeatedly, &lt;a href="https://crawlora.net/use-cases/ai-web-scraping?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;Crawlora&lt;/a&gt; returns normalized JSON by endpoint for dozens of platforms — search, maps, marketplaces, social, finance — so your model spends tokens on reasoning, not on cleaning HTML:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="s2"&gt;"https://api.crawlora.net/api/v1/google-search/search?keyword=ai%20web%20scraping&amp;amp;country=us"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"x-api-key: &lt;/span&gt;&lt;span class="nv"&gt;$CRAWLORA_API_KEY&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Because it ships a &lt;a href="https://crawlora.net/mcp?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;hosted MCP server&lt;/a&gt;, an agent in Claude, Cursor, or your own stack can call these as tools directly, and there’s no HTML sent to a model (so no &lt;a href="https://crawlora.net/blog/ai-vs-traditional-web-scraping?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;token tax&lt;/a&gt;). Free tier is 2,000 credits/month, no card. &lt;strong&gt;When to choose it:&lt;/strong&gt; the sources you need are supported platforms, you want documented JSON without parser upkeep, and you’re feeding agents or RAG. The trade-off: for an arbitrary page on an unknown site, an AI-native extractor or a crawler fits better.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Firecrawl — whole sites to LLM-ready markdown
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://crawlora.net/compare/firecrawl?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;Firecrawl&lt;/a&gt; crawls a site and returns clean markdown or JSON built for LLMs — ideal for ingesting an entire docs site or blog into a RAG index. It’s the most adopted tool in this category (over 125,000 GitHub stars), with a 500-credit one-time free trial and AI extraction around $0.004 per page. A useful reality check: on Firecrawl’s own public 1,000-URL benchmark it reported ~87.7% scrape success and ~63.7% content truth-recall — even the leading tool doesn’t capture everything. &lt;strong&gt;When to choose it:&lt;/strong&gt; turning arbitrary websites into text for retrieval. It’s a different shape from a structured platform API — you point it at URLs rather than calling typed endpoints.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. ScrapeGraphAI — prompt-defined extraction, open source
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://crawlora.net/compare/scrapegraphai?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;ScrapeGraphAI&lt;/a&gt; uses LLMs to extract structured data from a page based on a prompt, with an open-source core and a managed cloud. It’s model-agnostic — OpenAI, Anthropic, Gemini, Azure, Groq, and local models via Ollama — so you control the engine. Cloud SmartScraper runs around $0.02 per page (a published comparison put it at roughly 5× Firecrawl’s per-page cost), the trade-off for prompt flexibility. &lt;strong&gt;When to choose it:&lt;/strong&gt; developers who want AI extraction from arbitrary pages and either self-hosted control or a specific LLM.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Crawl4AI — free, self-hosted AI crawler
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/unclecode/crawl4ai" rel="noopener noreferrer"&gt;Crawl4AI&lt;/a&gt; is a fully open-source, self-hosted crawler built for LLM pipelines, with markdown output and &lt;strong&gt;adaptive crawling that auto-learns selectors&lt;/strong&gt; — third-party testing found it cut crawl times by roughly 40% on structured sites. &lt;strong&gt;When to choose it:&lt;/strong&gt; developers comfortable running their own infrastructure who want no per-page vendor fees. You own the proxies, scaling, and anti-bot handling.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Diffbot — AI extraction with a Knowledge Graph
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://crawlora.net/compare/diffbot?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;Diffbot&lt;/a&gt; applies computer vision and NLP to classify and extract articles, products, and discussions semantically rather than by selector, and exposes a Knowledge Graph for entity context. It has the most generous free tier here (10,000 credits/month), with paid plans from $299/month (250K credits) to $899/month (1M credits). &lt;strong&gt;When to choose it:&lt;/strong&gt; large-scale article/product extraction and entity data.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Browse AI, Kadoa &amp;amp; Parsera — no-code AI extractors
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.browse.ai/" rel="noopener noreferrer"&gt;Browse AI&lt;/a&gt; records point-and-click “robots” that monitor specific pages (free tier; paid from about $19/month) and, unlike most, supports pagination. &lt;a href="https://www.kadoa.com/" rel="noopener noreferrer"&gt;Kadoa&lt;/a&gt; turns natural-language workflows into self-healing extractors that adapt to layout changes (free tier; from about $39/month) but lacks strong anti-blocking out of the box. &lt;a href="https://parsera.org/" rel="noopener noreferrer"&gt;Parsera&lt;/a&gt; infers selectors from a URL with self-healing agents and stealth proxies (free tier; from about $25/month). &lt;strong&gt;When to choose them:&lt;/strong&gt; business users monitoring a handful of pages without code. In Apify’s hands-on test, all of these adapted to layout changes — but several couldn’t paginate natively and struggled on protected sites.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. Octoparse &amp;amp; Apify — visual scraping and prebuilt Actors
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://crawlora.net/compare/octoparse?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;Octoparse&lt;/a&gt; is a visual, no-code scraper with AI assist for non-developers. &lt;a href="https://crawlora.net/compare/apify?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;Apify&lt;/a&gt; is a platform of prebuilt “Actors” with scheduling, storage, proxies, and an MCP server; its &lt;strong&gt;AI Web Scraper&lt;/strong&gt; Actor extracts structured data from any URL with a plain-English prompt (AI tokens included) at $35 per 1,000 pages — though it doesn’t paginate natively yet. &lt;strong&gt;When to choose them:&lt;/strong&gt; off-the-shelf scrapers and a pipeline platform rather than a typed API.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the hands-on tests reveal
&lt;/h2&gt;

&lt;p&gt;Two patterns show up across the 2026 reviews and benchmarks, and they matter more than any feature list:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AI removes selectors, not the hard part.&lt;/strong&gt; These tools genuinely drop the need to write CSS/XPath — but in Apify’s four-tool test, several still couldn’t follow pagination on their own and lacked robust anti-blocking. Getting the page (proxies, rendering, CAPTCHAs) is still where most failures happen. See &lt;a href="https://crawlora.net/blog/ai-vs-traditional-web-scraping?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;AI vs traditional web scraping&lt;/a&gt; for why fetching, not parsing, is the bottleneck.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No tool hits 100% recall.&lt;/strong&gt; Even Firecrawl’s own benchmark lands near 88% scrape success and about 64% content recall, so whatever you pick, run a real sample of &lt;em&gt;your&lt;/em&gt; pages and measure accuracy and cost per successful result rather than trusting the demo.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How to choose in four questions
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Are you extracting from &lt;strong&gt;arbitrary unknown pages&lt;/strong&gt;, or calling &lt;strong&gt;known platforms&lt;/strong&gt; repeatedly?&lt;/li&gt;
&lt;li&gt;Do you need &lt;strong&gt;clean JSON&lt;/strong&gt; you can store directly, or text you’ll validate?&lt;/li&gt;
&lt;li&gt;Will an &lt;strong&gt;agent&lt;/strong&gt; call it — i.e. do you need REST plus a &lt;a href="https://crawlora.net/mcp?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;hosted MCP server&lt;/a&gt;?&lt;/li&gt;
&lt;li&gt;What’s the &lt;strong&gt;cost per successful result&lt;/strong&gt; at your volume, after retries and per-page model costs?&lt;/li&gt;
&lt;/ol&gt;
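
&lt;p&gt;Question 4 is simple arithmetic once you have measured a success rate on your own pages. A minimal Python sketch, assuming every attempt is billed whether or not it succeeds and that failed pages are retried until they work (both are assumptions, not any vendor's billing rules). The function name &lt;code&gt;cost_per_success&lt;/code&gt; is illustrative, and the inputs are the Firecrawl figures quoted earlier:&lt;/p&gt;

```python
def cost_per_success(price_per_attempt, success_rate):
    """Expected spend per usable result when every attempt is billed.

    Assumes independent attempts retried until one succeeds, so the
    expected number of attempts per result is 1 / success_rate.
    """
    if not (success_rate > 0 and 1 >= success_rate):
        raise ValueError("success_rate must be greater than 0 and at most 1")
    return price_per_attempt / success_rate


# Figures quoted above: roughly $0.004 per page at ~87.7% scrape success.
effective = cost_per_success(0.004, 0.877)
print(f"${effective:.5f} per successful page")
```

&lt;p&gt;Swap in your own measured success rate and any per-page model cost; the list price alone understates what a usable result costs.&lt;/p&gt;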

&lt;p&gt;If you’re feeding agents or pipelines from supported platforms, a structured API like Crawlora fits; for whole sites into RAG, Firecrawl or Crawl4AI; for arbitrary one-off pages, an AI-native extractor. Many teams pair a structured API with one of the others. Whatever you choose, collect only public data; see &lt;a href="https://crawlora.net/blog/is-web-scraping-legal-2026?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;is web scraping legal in 2026&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://blog.apify.com/best-ai-web-scrapers/" rel="noopener noreferrer"&gt;Apify — The best AI web scrapers in 2026? We put four to the test&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.kadoa.com/blog/best-ai-web-scrapers-2026" rel="noopener noreferrer"&gt;Kadoa — The Top AI Web Scrapers of 2026: An Honest Review&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.browse.ai/blog/the-best-ai-web-scraper-tools" rel="noopener noreferrer"&gt;Browse AI — AI web scraping tools compared (2026): 9 tools tested&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.firecrawl.dev/" rel="noopener noreferrer"&gt;Firecrawl — crawl and convert sites to LLM-ready data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/ScrapeGraphAI/Scrapegraph-ai" rel="noopener noreferrer"&gt;ScrapeGraphAI — LLM-based web scraping (GitHub)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/unclecode/crawl4ai" rel="noopener noreferrer"&gt;Crawl4AI — open-source LLM-friendly crawler (GitHub)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Next steps
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Try it first, free:&lt;/strong&gt; turn any URL into clean Markdown with the &lt;a href="https://crawlora.net/tools/free-web-scraper?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;Free Web Scraper&lt;/a&gt; — no signup, no API key.&lt;/p&gt;

&lt;p&gt;Read &lt;a href="https://crawlora.net/blog/ai-vs-traditional-web-scraping?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;AI vs traditional web scraping&lt;/a&gt; and &lt;a href="https://crawlora.net/blog/web-scraping-for-ai-training-data?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;web scraping for AI training data&lt;/a&gt;, see the &lt;a href="https://crawlora.net/use-cases/ai-web-scraping?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;AI Web Scraping API&lt;/a&gt;, connect the &lt;a href="https://crawlora.net/mcp?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;hosted MCP server&lt;/a&gt;, and test a call in the &lt;a href="https://crawlora.net/playground?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;Playground&lt;/a&gt;. For the broader market, see &lt;a href="https://crawlora.net/blog/best-web-scraping-apis-2026?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;how to choose a web scraping API&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently asked questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is the best AI web scraping tool?
&lt;/h3&gt;

&lt;p&gt;There is no single winner; it depends on the job. For repeatable pipelines and agents over known platforms, a structured data API like Crawlora fits; for whole sites into LLM-ready text, Firecrawl; for AI extraction from arbitrary pages, ScrapeGraphAI (prompt-defined) or Diffbot (automatic classification); for no-code monitoring of specific pages, Browse AI or Octoparse.&lt;/p&gt;

&lt;h3&gt;
  
  
  What does 'AI web scraping' actually mean?
&lt;/h3&gt;

&lt;p&gt;Two things: AI-native extractors that read an arbitrary page with an LLM and return fields from a prompt, and structured data APIs that hand AI clean JSON for known sources. They solve different problems, and many teams use both.&lt;/p&gt;

&lt;h3&gt;
  
  
  Are AI web scrapers better than traditional scrapers?
&lt;/h3&gt;

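&lt;p&gt;Not universally. AI extraction adapts to unknown layouts without selectors, but costs more per page and can drift; traditional selectors are cheap and precise on stable pages; a structured API skips parsing entirely for supported platforms. See our &lt;a href="https://crawlora.net/blog/ai-vs-traditional-web-scraping?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;AI vs traditional web scraping&lt;/a&gt; guide.&lt;/p&gt;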

&lt;h3&gt;
  
  
  Is there a free AI web scraping tool?
&lt;/h3&gt;

&lt;p&gt;Several offer free tiers or credits. Crawlora includes 2,000 credits per month with no card, and tools like ScrapeGraphAI are open source. Benchmark a few on your real target pages before committing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can AI web scraping feed an AI agent directly?
&lt;/h3&gt;

&lt;p&gt;Yes, if the tool exposes a tool interface. Crawlora ships a hosted MCP server, so agents in Claude, Cursor, or your own stack can call its structured web-data endpoints as tools.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://crawlora.net/blog/best-ai-web-scraping-tools-2026?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;crawlora.net&lt;/a&gt;. &lt;a href="https://crawlora.net/?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;Crawlora&lt;/a&gt; is a structured web-data, search, and anti-bot API — dozens of platforms as normalized JSON, plus a hosted MCP server, with a free tier (no card).&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>tutorial</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>How Paywalls Actually Work: The Engineering Behind Them</title>
      <dc:creator>Tony Wang</dc:creator>
      <pubDate>Thu, 11 Jun 2026 12:12:16 +0000</pubDate>
      <link>https://dev.to/tonywangca/how-paywalls-actually-work-the-engineering-behind-them-38h5</link>
      <guid>https://dev.to/tonywangca/how-paywalls-actually-work-the-engineering-behind-them-38h5</guid>
      <description>&lt;p&gt;A paywall is one of the more interesting engineering problems on the web, because the publisher has to satisfy two goals that pull in opposite directions. It needs Google to &lt;strong&gt;index&lt;/strong&gt; the article so people can find it and click through — which means a search crawler has to see the full text. But it also needs to &lt;strong&gt;withhold&lt;/strong&gt; that same text from a logged-out reader so there's a reason to subscribe. Reconciling "show the bot everything" with "show the human almost nothing," without getting penalized for it, is the whole game. How a publisher resolves that tension decides whether its paywall is a bank vault or a velvet rope you can step around.&lt;/p&gt;

&lt;p&gt;This guide explains the machinery from an engineer's point of view: the kinds of paywall, where the content actually lives, the structured-data contract that lets publishers serve crawlers and readers different things on purpose, and why some of these walls are trivial to read past while others are effectively sealed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key takeaways&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Paywalls come in four flavors — hard, soft/freemium, metered, and dynamic — and each is enforced differently.&lt;/li&gt;
&lt;li&gt;The single most important fact is where the content is hidden: client-side paywalls ship the full article to the browser and then hide it (so it is often still readable), while server-side paywalls never send it (so it effectively is not).&lt;/li&gt;
&lt;li&gt;Publishers declare gated sections to Google with isAccessibleForFree JSON-LD and grant Googlebot full, IP-validated access — which is exactly why 'pretend to be Googlebot' sometimes works and is usually blocked.&lt;/li&gt;
&lt;li&gt;Reading content behind a paywall is the highest-risk category of access (DMCA §1201, CFAA, terms of service). The defensible path is public data, official APIs, and the structured data publishers already expose.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What this guide is — and isn't&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is a technical explainer for engineers, SEOs, and publishers who want to understand the machinery. It is &lt;strong&gt;not&lt;/strong&gt; a how-to for reading paid articles without paying. Bypassing a paywall to reach gated content is a real legal risk (covered below), and it is explicitly not what Crawlora is for — we build for &lt;strong&gt;public&lt;/strong&gt; web data.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The four kinds of paywall
&lt;/h2&gt;

&lt;p&gt;"Paywall" is a single word for several very different mechanisms. Knowing which one you're looking at tells you almost everything about how it behaves and how robust it is.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;What the reader gets&lt;/th&gt;
&lt;th&gt;How it's enforced&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hard&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Nothing without a subscription&lt;/td&gt;
&lt;td&gt;The article body is withheld outright; you see a headline, a deck, and a subscribe prompt&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Soft / freemium&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Some articles free, some "premium"&lt;/td&gt;
&lt;td&gt;A per-article flag decides whether the full body is served at all&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Metered&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;N free articles per period&lt;/td&gt;
&lt;td&gt;A counter (cookie, local storage, device fingerprint, or server-side account) tracks views and gates after the limit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dynamic / propensity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Varies per visitor&lt;/td&gt;
&lt;td&gt;A model scores how likely you are to subscribe and shows a harder or softer wall accordingly&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Hard paywalls&lt;/strong&gt; are the simplest and the strongest: the body never ships to a non-subscriber, so there's nothing to recover. The Financial Times and parts of the Wall Street Journal run close to this model. The tradeoff is reach — a hard wall sacrifices the casual reader and some SEO surface to protect revenue.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Soft/freemium&lt;/strong&gt; walls flag certain articles as premium and leave the rest open. The decision is per-article, made on the server, so a "premium" piece behaves like a hard wall while a "free" piece is fully open.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Metered&lt;/strong&gt; paywalls are the most common on large news sites because they thread the needle: a handful of free articles per month drive subscriptions, social sharing, and search traffic, while heavy readers eventually hit the wall. The catch is that &lt;em&gt;metering has to count&lt;/em&gt;, and where it counts is the whole story (more on that below).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dynamic / propensity&lt;/strong&gt; paywalls are the modern evolution. Instead of a fixed meter, a model looks at signals — how often you visit, what you read, where you came from, whether you look like a likely subscriber — and decides in real time whether to show you a hard wall, a soft nudge, or nothing at all. Two readers can hit the same URL and see completely different walls. That variability is deliberate: it makes the wall harder to reason about and harder to defeat with a single static trick.&lt;/p&gt;

&lt;h2&gt;
  
  
  The one distinction that explains everything: client-side vs server-side
&lt;/h2&gt;

&lt;p&gt;Forget the marketing names for a second. The question that actually determines whether a paywall is robust is brutally simple: &lt;strong&gt;does the full article text reach the browser at all?&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CLIENT-SIDE (leaky)                  SERVER-SIDE (sealed)

  origin ──[ full article ]──▶ browser   origin ──[ teaser only ]──▶ browser
                 │                                   ▲
        JS / CSS hides the body            access check runs at the origin,
        (overlay, truncation, fade)        BEFORE the body is ever sent
                 │                                   │
   the bytes are already on the         there is nothing on the page
   page  →  "un-hideable"               to un-hide  →  sealed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Client-side paywalls&lt;/strong&gt; send the complete article in the HTML or in a JSON blob the page hydrates from, then use JavaScript and CSS to hide most of it — an overlay, a &lt;code&gt;display:none&lt;/code&gt;, a truncated container, or a gradient "fade to subscribe." The content is already on the page; the wall is cosmetic. This is why the classic tricks (disable JavaScript, view source, use a browser's reader mode) sometimes reveal the whole article: the bytes were delivered before the wall was painted.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Server-side paywalls&lt;/strong&gt; make the access decision on the server and simply never include the gated text in the response. A non-subscriber receives a teaser — headline, a paragraph or two, structured metadata — and nothing else. There is nothing to un-hide because the body was never sent.&lt;/li&gt;
&lt;/ul&gt;
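
&lt;p&gt;The difference is easiest to see in what each response actually contains. A minimal Python sketch, assuming a hypothetical article record with &lt;code&gt;teaser&lt;/code&gt; and &lt;code&gt;body&lt;/code&gt; fields and a boolean entitlement flag (none of these names come from a real CMS or paywall vendor):&lt;/p&gt;

```python
ARTICLE = {
    "headline": "Article headline",
    "teaser": "First paragraph only.",
    "body": "First paragraph only. The rest of the gated article text.",
}


def client_side_payload(article):
    """Leaky design: the full body is always serialized, and a flag
    merely tells the page's JavaScript to hide it."""
    return {
        "headline": article["headline"],
        "body": article["body"],
        "hide_body": True,
    }


def server_side_payload(article, is_entitled):
    """Sealed design: the entitlement check runs before anything is
    serialized, so non-subscribers only ever receive the teaser."""
    text = article["body"] if is_entitled else article["teaser"]
    return {"headline": article["headline"], "body": text}


leaky = client_side_payload(ARTICLE)
sealed = server_side_payload(ARTICLE, is_entitled=False)
print("gated text in client-side payload:", "gated article text" in leaky["body"])
print("gated text in server-side payload:", "gated article text" in sealed["body"])
```

&lt;p&gt;In the first function the gated text is in the payload regardless of the flag; in the second there is nothing for the browser to un-hide.&lt;/p&gt;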

&lt;p&gt;Google says exactly this to publishers in its own documentation: &lt;em&gt;"If you don't want the content to be accessible to the browser at the time of serving, choose a paywall implementation that doesn't supply the paywalled content to the browser."&lt;/em&gt; In plain terms, Google is openly telling publishers that client-side gating is leaky and server-side gating is not.&lt;/p&gt;

&lt;p&gt;So why does anyone still ship client-side? Because it's cheaper and more flexible. Rendering the full page and gating it in the browser plays nicely with ad tech, A/B testing, personalization, and CDN caching (one cached page serves everyone; the JS decides what to show). Server-side entitlement checks mean per-request rendering, a harder caching story, and more backend work. Plenty of publishers knowingly trade a little leakiness for a lot of operational convenience — which is why the web is full of client-side walls a reader can see straight through.&lt;/p&gt;

&lt;h2&gt;
  
  
  How metering actually counts you
&lt;/h2&gt;

&lt;p&gt;Metered paywalls deserve their own look, because "you've read 5 of 5 free articles" has to be stored somewhere, and &lt;em&gt;where&lt;/em&gt; decides how sturdy the meter is.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cookies / local storage.&lt;/strong&gt; The cheapest meter increments a counter in your browser. It's also the weakest: clearing site data, or opening a private/incognito window (which starts with empty storage), resets the count. This is the single reason "open it in incognito" works on so many sites — you're not breaking anything, you're just presenting as a brand-new visitor.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Device fingerprinting.&lt;/strong&gt; Sturdier meters derive a semi-stable ID from your browser and device characteristics, so a fresh incognito window still looks like the same device. Harder to reset, but probabilistic and privacy-fraught.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IP address.&lt;/strong&gt; Some meters count per IP. Effective against casual evasion, but blunt — it can wrongly gate everyone behind a shared office or campus network.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Server-side accounts.&lt;/strong&gt; The sturdiest meter ties consumption to a logged-in identity. There's nothing client-side to clear, because the count lives in the publisher's database. This is where metering converges with a hard wall.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The pattern to notice: the more robust the meter, the more it moves &lt;em&gt;off&lt;/em&gt; the client and &lt;em&gt;onto&lt;/em&gt; the server — the same migration we just saw with rendering. Anything enforced in the browser can be undone in the browser.&lt;/p&gt;
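
&lt;p&gt;A sketch of why the storage location matters, assuming a hypothetical in-memory &lt;code&gt;ServerSideMeter&lt;/code&gt; keyed by account ID. A real publisher would back this with a database and reset counts each period; the class name, the limit of 5, and the rule that re-reading a counted article is free are all illustrative choices:&lt;/p&gt;

```python
from collections import defaultdict


class ServerSideMeter:
    """Counts article views per account on the server, so clearing
    cookies or opening a private window changes nothing."""

    def __init__(self, free_limit=5):
        self.free_limit = free_limit
        # Maps account ID to the set of article IDs already counted.
        self._views = defaultdict(set)

    def allow(self, account_id, article_id):
        seen = self._views[account_id]
        if article_id in seen:
            return True  # re-reading a counted article costs nothing
        if len(seen) >= self.free_limit:
            return False  # limit reached: gate the new article
        seen.add(article_id)
        return True


meter = ServerSideMeter(free_limit=5)
results = [meter.allow("reader-1", f"article-{n}") for n in range(6)]
print(results)
```

&lt;p&gt;The sixth distinct article is refused, and nothing the reader does in the browser can change the count because it never lived there.&lt;/p&gt;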

&lt;h2&gt;
  
  
  The Googlebot contract: how publishers show bots what they hide from you
&lt;/h2&gt;

&lt;p&gt;Here's the part most explanations skip, and it's the most important. A publisher who hides the article from readers but serves the full text to Googlebot is, on its face, doing &lt;strong&gt;cloaking&lt;/strong&gt; — showing crawlers something different from what users get. Cloaking is a search-spam violation that gets a site demoted or removed from the index. So how do paywalled articles rank at all?&lt;/p&gt;

&lt;p&gt;Google built a sanctioned exception. It evolved out of the old "first click free" policy (drop the wall for visitors arriving from Google) and became, in 2017, &lt;strong&gt;flexible sampling&lt;/strong&gt; plus a structured-data declaration. Publishers mark their paywalled sections with schema.org markup, namely &lt;code&gt;isAccessibleForFree: false&lt;/code&gt; plus a &lt;code&gt;hasPart&lt;/code&gt; block whose &lt;code&gt;cssSelector&lt;/code&gt; points at the gated element:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"@context"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://schema.org"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"@type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"NewsArticle"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"headline"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Article headline"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"isAccessibleForFree"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"hasPart"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"@type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"WebPageElement"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"isAccessibleForFree"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"cssSelector"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;".paywall"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That declaration is the contract. It tells Google: "this &lt;code&gt;.paywall&lt;/code&gt; section is gated, and any difference between what Googlebot sees and what a logged-out human sees is intentional, not cloaking." In return, the publisher &lt;strong&gt;grants Googlebot (and Googlebot-News) full access&lt;/strong&gt; to the body so the article can be indexed and ranked.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                    ┌──────────────────────────────┐
   Googlebot  ────▶ │  Publisher origin            │ ──▶  FULL article
 (verified by       │   isAccessibleForFree: false │      (so it can be indexed)
  reverse DNS)      │   hasPart → ".paywall"       │
   Logged-out  ───▶ │                              │ ──▶  teaser + subscribe wall
   reader           └──────────────────────────────┘
      The JSON-LD declares the gap on purpose, so serving the
      bot more than the human is treated as policy — not cloaking.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
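
&lt;p&gt;For publishers and SEOs, the declaration is easy to audit because it is plain JSON. A minimal Python sketch that reads a JSON-LD string and lists the selectors declared as gated. It assumes a single top-level object, and it accepts both boolean and string forms of the flag, which is an assumption about what appears in the wild rather than a statement of the spec; the function name &lt;code&gt;gated_selectors&lt;/code&gt; is illustrative:&lt;/p&gt;

```python
import json

NOT_FREE = (False, "False", "false")  # flag spellings this sketch accepts

SAMPLE = """
{
  "@context": "https://schema.org",
  "@type": "NewsArticle",
  "headline": "Article headline",
  "isAccessibleForFree": false,
  "hasPart": {
    "@type": "WebPageElement",
    "isAccessibleForFree": false,
    "cssSelector": ".paywall"
  }
}
"""


def gated_selectors(json_ld_text):
    """Return the cssSelector of every part declared as not free."""
    data = json.loads(json_ld_text)
    if data.get("isAccessibleForFree") not in NOT_FREE:
        return []  # page does not declare itself as paywalled
    parts = data.get("hasPart", [])
    if isinstance(parts, dict):  # a single part may be a bare object
        parts = [parts]
    return [
        part["cssSelector"]
        for part in parts
        if part.get("isAccessibleForFree") in NOT_FREE and "cssSelector" in part
    ]


print(gated_selectors(SAMPLE))
```

&lt;p&gt;An empty result on a page you know is gated suggests the markup is missing or malformed, which is the situation that risks being read as cloaking.&lt;/p&gt;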



&lt;p&gt;Two consequences fall out of this, and they explain a lot of real-world behavior:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Publishers verify that Googlebot is really Googlebot.&lt;/strong&gt; Because crawler access is a privilege, sites confirm it with a reverse-DNS lookup or a check against Google's published IP ranges, not by trusting the &lt;code&gt;User-Agent&lt;/code&gt; header. That's why simply sending &lt;code&gt;User-Agent: Googlebot&lt;/code&gt; from an ordinary server typically gets you an HTTP 403: the request's IP doesn't belong to Google. The user-agent trick only ever worked on sites that didn't bother validating, and the big publishers generally do.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The markup hands out a map of the wall.&lt;/strong&gt; The &lt;code&gt;cssSelector: ".paywall"&lt;/code&gt; is, quite literally, the selector of the gated content container. A declaration intended to &lt;em&gt;help search engines&lt;/em&gt; also tells anyone reading the page source exactly which node is gated, which is why client-side "un-hide" tools target that same selector.&lt;/li&gt;
&lt;/ol&gt;
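
&lt;p&gt;The validation in point 1 can be sketched from the publisher's side. This Python example follows the reverse-then-forward DNS pattern: resolve the IP to a hostname, check the hostname's domain, then resolve the hostname back and confirm it returns the same IP. The accepted suffixes reflect Google's crawler-verification guidance as we understand it, so check the current documentation before relying on them. The function name and the injectable resolver parameters are illustrative, and the default resolvers handle IPv4 only:&lt;/p&gt;

```python
import socket

# Domains accepted for Google crawlers (an assumption; verify against current docs).
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com", ".googleusercontent.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Reverse-resolve the IP, check the domain, then forward-confirm it."""
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse_lookup(ip)
        if not host.lower().endswith(GOOGLE_SUFFIXES):
            return False  # PTR record is not in a Google domain
        # Forward-confirm: a forged PTR record fails here.
        return ip in forward_lookup(host)
    except OSError:
        return False  # DNS lookup failed: treat as unverified


# Offline demo with stub resolvers; the addresses and hostnames are made up.
fake_ptr = {
    "192.0.2.10": "crawl-192-0-2-10.googlebot.com",
    "198.51.100.7": "vps.example.net",
    "203.0.113.5": "fake.googlebot.com",
}
fake_a = {
    "crawl-192-0-2-10.googlebot.com": ["192.0.2.10"],
    "fake.googlebot.com": ["192.0.2.99"],
}
for address in fake_ptr:
    verdict = is_verified_googlebot(address, fake_ptr.__getitem__, fake_a.__getitem__)
    print(address, verdict)
```

&lt;p&gt;Note that the &lt;code&gt;User-Agent&lt;/code&gt; header never appears in the check: only the connecting IP and DNS decide the outcome, which is why a spoofed header gets nowhere.&lt;/p&gt;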

&lt;p&gt;The same logic extends to &lt;strong&gt;AMP&lt;/strong&gt;: Google requires a publisher's bot-access policy to match across AMP and non-AMP pages (via &lt;code&gt;amp-subscriptions&lt;/code&gt;), or Search Console flags a content mismatch. That parity requirement is why AMP versions of articles are sometimes less aggressively gated than their canonical pages — the publisher had to keep the two consistent for the crawler.&lt;/p&gt;

&lt;h2&gt;
  
  
  How paywall "bypass" tools actually work
&lt;/h2&gt;

&lt;p&gt;Open-source paywall removers — the best known being &lt;a href="https://en.wikipedia.org/wiki/Bypass_Paywalls_Clean" rel="noopener noreferrer"&gt;Bypass Paywalls Clean&lt;/a&gt;, plus web tools like 12ft and archives like archive.today — are essentially a catalogue of per-site rules, each exploiting one of the weaknesses above. Understanding &lt;em&gt;what&lt;/em&gt; they do is useful for reasoning about how robust a given paywall is. It is not an endorsement: several have been removed from extension stores under legal pressure, which is the subject of the next section.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Technique&lt;/th&gt;
&lt;th&gt;Which paywall design it targets&lt;/th&gt;
&lt;th&gt;Why it fails on hardened sites&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Crawler user-agent&lt;/strong&gt; (Googlebot/Bingbot)&lt;/td&gt;
&lt;td&gt;Sites that serve crawlers the full body&lt;/td&gt;
&lt;td&gt;Blocked by IP / reverse-DNS validation of the bot&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Referer spoofing&lt;/strong&gt; (Google / social)&lt;/td&gt;
&lt;td&gt;"First-click-free"-style allowances&lt;/td&gt;
&lt;td&gt;Most publishers dropped first-click-free; ignored on server-side gates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Clearing cookies / storage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Metered&lt;/strong&gt; counters tracked client-side&lt;/td&gt;
&lt;td&gt;Useless against server-side, account-based, or fingerprinted meters&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Blocking the paywall script&lt;/strong&gt; (Piano/Tinypass, Poool, etc.)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Client-side&lt;/strong&gt; JS enforcement&lt;/td&gt;
&lt;td&gt;Nothing to block when the gate is server-side&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AMP / reader-mode / view-source&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Content shipped-then-hidden&lt;/td&gt;
&lt;td&gt;The body simply isn't in the response on server-side pages&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Reading embedded JSON&lt;/strong&gt; (&lt;code&gt;articleBody&lt;/code&gt;, framework state)&lt;/td&gt;
&lt;td&gt;Sites that ship full text for their own SPA/SEO&lt;/td&gt;
&lt;td&gt;The text isn't embedded when rendered server-side per entitlement&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Web archives&lt;/strong&gt; (archive.today)&lt;/td&gt;
&lt;td&gt;Anything someone already archived&lt;/td&gt;
&lt;td&gt;Depends on a third-party copy existing; raises its own copyright questions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Walk down the column and a single pattern emerges. Crawler-UA and referer tricks exploit the &lt;em&gt;indexing contract&lt;/em&gt; — they try to look like the privileged visitor the publisher serves in full. Cookie-clearing exploits &lt;em&gt;client-side metering&lt;/em&gt;. Script-blocking, reader-mode, and view-source exploit &lt;em&gt;client-side rendering&lt;/em&gt;. Reading embedded JSON exploits the fact that a single-page app or an SEO setup often ships the whole article as data even when the visible DOM is truncated. Archives sidestep the live site entirely by reading a copy someone else already saved.&lt;/p&gt;

&lt;p&gt;The throughline: &lt;strong&gt;every one of these works only because the content already left the publisher's server.&lt;/strong&gt; Server-side gating plus IP-validated bot access closes the entire column at once: there is no header to spoof into a privilege, no counter in the browser to reset, no hidden body to un-hide, and no embedded JSON because the body was never serialized to the client.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the arms race now favors publishers
&lt;/h2&gt;

&lt;p&gt;A decade ago, "disable JavaScript" beat most paywalls. Today it rarely does, for a few converging reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Server-side rendering&lt;/strong&gt; keeps the body off the wire until entitlement is checked. The leak closes at the source.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic / propensity models&lt;/strong&gt; change the wall per visit, so a single static rule breaks the moment the model decides you look different.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bot validation&lt;/strong&gt; — reverse DNS for Googlebot, plus commercial anti-bot vendors like Cloudflare and DataDome at the edge — makes crawler impersonation and naive automated access expensive and unreliable. A spoofed user-agent now meets a fingerprinting challenge, not a free pass.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Edge enforcement&lt;/strong&gt; means the gate is applied at the CDN, before a request ever reaches the origin app. The decision happens in front of the content, not inside it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The net effect is that the cheap, client-side techniques are dying off. What remains is either legally fraught (archives, account sharing) or simply ineffective against a modern server-side, dynamically gated, edge-protected site.&lt;/p&gt;
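
&lt;p&gt;The reverse-DNS check in the bot-validation bullet is usually done as a forward-confirmed lookup: resolve the IP to a hostname, check that the hostname belongs to a Google crawler domain, then resolve the hostname back and confirm it returns the same IP. The Python sketch below illustrates that publisher-side check. The function name and the injectable &lt;code&gt;reverse_lookup&lt;/code&gt; and &lt;code&gt;forward_lookup&lt;/code&gt; parameters are assumptions made for testability, and the default lookups only cover IPv4.&lt;/p&gt;

```python
import socket

# Domains Google documents for its crawlers' reverse-DNS names.
GOOGLE_CRAWLER_SUFFIXES = (".googlebot.com", ".google.com", ".googleusercontent.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Forward-confirmed reverse DNS: IP to hostname, then hostname back to IPs."""
    # Default to real DNS lookups; callers can inject fakes for testing.
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        hostname = reverse_lookup(ip)
    except OSError:
        return False  # no PTR record, or the lookup failed
    if not hostname.lower().rstrip(".").endswith(GOOGLE_CRAWLER_SUFFIXES):
        return False  # the claimed crawler does not resolve to a Google domain
    try:
        # The forward lookup must return the original IP, or the PTR was forged.
        return ip in forward_lookup(hostname)
    except OSError:
        return False
```

&lt;p&gt;A request that merely sends a Googlebot user-agent string fails this check, because the caller does not control the reverse DNS records for its own IP and Google's forward records at the same time.&lt;/p&gt;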

&lt;h2&gt;
  
  
  The legal reality: paywalls are the highest-risk category
&lt;/h2&gt;

&lt;p&gt;This is the part that matters most, and it's why Crawlora's position is unambiguous: &lt;strong&gt;don't bypass paywalls.&lt;/strong&gt; It's consistent with everything in our guide on &lt;a href="https://crawlora.net/blog/is-web-scraping-legal-2026?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;whether web scraping is legal in 2026&lt;/a&gt; — the rules depend on the data, the method, and what you do with the results.&lt;/p&gt;

&lt;p&gt;Access risk stratifies cleanly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tier 1 — public, non-gated pages.&lt;/strong&gt; The lowest risk. In the US, &lt;em&gt;hiQ Labs v. LinkedIn&lt;/em&gt; and the Supreme Court's narrowing of the CFAA in &lt;em&gt;Van Buren v. United States&lt;/em&gt; support the view that accessing data available to the public without authentication is not "unauthorized access."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tier 2 — login-gated content.&lt;/strong&gt; A step riskier: you're now past an authentication boundary, and terms of service are squarely in play.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tier 3 — paywalled content.&lt;/strong&gt; The top of the risk stack. Engineering a workaround for a technological access control can implicate the &lt;strong&gt;DMCA's anti-circumvention rule (§1201)&lt;/strong&gt;, which targets &lt;em&gt;circumventing a measure that controls access to a work&lt;/em&gt; and is separate from copyright infringement itself, and the &lt;strong&gt;CFAA&lt;/strong&gt;, on top of breaching the site's terms of service.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Recent litigation points in the publishers' direction. &lt;em&gt;Reddit v. Perplexity&lt;/em&gt; alleges circumvention of rate limits and anti-bot systems; Google sued SerpApi in late 2025 citing the DMCA and copyright. And the open-source paywall removers themselves have been pulled from the Chrome and Firefox stores under the DMCA, which is the clearest signal of where the legal line sits.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Public, non-gated pages are the defensible tier; logins and paywalls escalate risk sharply.&lt;/li&gt;
&lt;li&gt;Circumventing a technological access control — a paywall, login, or anti-bot system — is a distinct legal exposure under DMCA §1201, separate from reading a public page.&lt;/li&gt;
&lt;li&gt;Terms of service can prohibit automated access even to public content; that's a contract risk on top of everything else.&lt;/li&gt;
&lt;li&gt;If you need a specific publisher's articles at scale, the right path is a licensing or syndication deal — not a workaround.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The right way to get article content at scale
&lt;/h2&gt;

&lt;p&gt;If your project genuinely needs article text, there are legitimate routes, in rough order of preference:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Official content APIs and licensing.&lt;/strong&gt; Many publishers and wire services license full text, and a syndication or licensing agreement is the durable answer for a specific outlet's articles at scale. Several large publishers also expose documented developer APIs for metadata.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The structured data publishers already expose.&lt;/strong&gt; Headlines, descriptions, authors, dates, sections, and tags are published &lt;em&gt;for&lt;/em&gt; crawlers in JSON-LD — that's fair game and machine-readable by design. You can get a lot of value from the metadata layer without touching gated bodies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Public, non-gated pages.&lt;/strong&gt; For the large universe of web content that isn't paywalled at all, a compliant scraping API that respects robots.txt, rate limits, and terms is the clean way to get structured content without running your own browser fleet.&lt;/li&gt;
&lt;/ul&gt;
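
&lt;p&gt;The structured-data route can be sketched with the Python standard library alone. The example below collects &lt;code&gt;application/ld+json&lt;/code&gt; script blocks from a page you have already fetched and returns a handful of metadata fields. The class and function names are hypothetical, the field list is an assumption about what a typical Article block carries, and it deliberately leaves the article body alone.&lt;/p&gt;

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collects the contents of script tags typed application/ld+json."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._chunks = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.documents.append(json.loads("".join(self._chunks)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks rather than fail the whole page


METADATA_KEYS = ("headline", "description", "datePublished", "author", "isAccessibleForFree")


def article_metadata(html):
    """Return metadata fields from the first JSON-LD item that has a headline."""
    parser = JsonLdParser()
    parser.feed(html)
    for doc in parser.documents:
        items = doc if isinstance(doc, list) else [doc]
        for item in items:
            if isinstance(item, dict) and "headline" in item:
                return {key: item.get(key) for key in METADATA_KEYS}
    return {}
```

&lt;p&gt;This sketch does not unwrap &lt;code&gt;@graph&lt;/code&gt; containers, which some sites use, so treat it as a starting point rather than a complete extractor.&lt;/p&gt;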

&lt;p&gt;That last one is where Crawlora fits. Our &lt;a href="https://crawlora.net/web-scraping-api?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;web scraping API&lt;/a&gt; and the &lt;a href="https://crawlora.net/docs/web/web-scrape?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;&lt;code&gt;/web/scrape&lt;/code&gt; endpoint&lt;/a&gt; turn &lt;strong&gt;public&lt;/strong&gt; URLs into clean Markdown and structured metadata, with managed rendering and proxies — built for public web data, not for circumventing paid content. If you want to know how hard a given public page is to fetch before you start, the &lt;a href="https://crawlora.net/tools/can-i-scrape-this-site?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;anti-bot checker&lt;/a&gt; gives you a difficulty read on the exact URL, and the &lt;a href="https://crawlora.net/blog/proxies-for-web-scraping-explained?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;proxies explainer&lt;/a&gt; covers responsible pacing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The takeaway
&lt;/h2&gt;

&lt;p&gt;A paywall is just an answer to one question — &lt;em&gt;where does the content live when a non-subscriber asks for it?&lt;/em&gt; Keep it in the browser and hide it, and the wall is cosmetic. Keep it on the server and never send it, and the wall is real. The structured-data contract with Google explains the strange middle ground where bots see everything and humans see a teaser, and the steady migration of every defense — rendering, metering, bot checks — from the client to the server and the edge is why the easy tricks keep dying. The robust, lawful way to work with article content at scale isn't to fight that trend; it's to use the public data, the structured metadata, and the licensing the open web already provides.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sources&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://developers.google.com/search/docs/appearance/structured-data/paywalled-content" rel="noopener noreferrer"&gt;Google Search Central — Structured data for paywalled content (isAccessibleForFree)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developers.google.com/search/docs/appearance/subscription-paywalled-content" rel="noopener noreferrer"&gt;Google Search Central — Subscription and paywalled content (overview &amp;amp; flexible sampling)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Bypass_Paywalls_Clean" rel="noopener noreferrer"&gt;Wikipedia — Bypass Paywalls Clean (DMCA store removal)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Paywall" rel="noopener noreferrer"&gt;Wikipedia — Paywall&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://crawlora.net/blog/is-web-scraping-legal-2026?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;Crawlora — Is web scraping legal in 2026?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://crawlora.net/web-scraping-api?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;Crawlora — Web Scraping API&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Frequently asked questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How do paywalls work?
&lt;/h3&gt;

&lt;p&gt;A paywall withholds an article from non-subscribers, but the implementation varies. Hard paywalls serve no body at all; metered paywalls track your free-article count with a cookie, device fingerprint, or account; dynamic paywalls vary the wall per visitor. The key technical difference is whether the full text is sent to your browser and then hidden (client-side) or never sent at all (server-side).&lt;/p&gt;

&lt;h3&gt;
  
  
  Why can I read some paywalled articles in incognito mode but not others?
&lt;/h3&gt;

&lt;p&gt;A private (incognito) window starts with empty cookies and local storage, so a client-side metered counter that tracks how many free articles you've read starts again from zero, and metered paywalls often reopen. Meters keyed to a device fingerprint or an account are unaffected. Incognito also does nothing against hard or server-side paywalls, where the article body is never delivered to the browser in the first place.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the difference between a client-side and server-side paywall?
&lt;/h3&gt;

&lt;p&gt;A client-side paywall sends the full article to the browser and hides it with JavaScript/CSS (an overlay or truncation), so the content technically reached your device. A server-side paywall decides access on the server and never includes the gated text in the response. Client-side gates are far easier to circumvent; server-side gates are, in Google's own words, almost impossible to get around.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is it legal to bypass a paywall?
&lt;/h3&gt;

&lt;p&gt;Bypassing a paywall is the highest-risk category of web access. Circumventing a technological access control can implicate the DMCA's anti-circumvention rules (§1201) and the CFAA, on top of breaching the site's terms of service. Reading public, non-gated pages is far more defensible, and for a specific publisher's full articles at scale, licensing is the right path — not a workaround. This is not legal advice.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://crawlora.net/blog/how-paywalls-work?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;crawlora.net&lt;/a&gt;. &lt;a href="https://crawlora.net/?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication" rel="noopener noreferrer"&gt;Crawlora&lt;/a&gt; is a structured web-data, search, and anti-bot API — dozens of platforms as normalized JSON, plus a hosted MCP server, with a free tier (no card).&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>seo</category>
      <category>webdev</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Give Your AI Agent Live Web Data with MCP</title>
      <dc:creator>Tony Wang</dc:creator>
      <pubDate>Mon, 08 Jun 2026 09:51:45 +0000</pubDate>
      <link>https://dev.to/tonywangca/give-your-ai-agent-live-web-data-with-mcp-38hj</link>
      <guid>https://dev.to/tonywangca/give-your-ai-agent-live-web-data-with-mcp-38hj</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Key takeaways&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Give an AI agent live web data by connecting it to Crawlora's hosted MCP endpoint — it calls documented tools (search, maps, commerce, social, finance) and gets normalized JSON back, with no scraping code or proxies to run.&lt;/li&gt;
&lt;li&gt;MCP (Model Context Protocol) is an open standard: agents discover and call tools through one interface instead of a bespoke integration per data source.&lt;/li&gt;
&lt;li&gt;Connect over Streamable HTTP at &lt;code&gt;https://mcp.crawlora.net/mcp&lt;/code&gt; with your API key — about three minutes in Claude, Cursor, Cline, Windsurf, or any MCP client.&lt;/li&gt;
&lt;li&gt;One connection exposes 319 tools across 33 platforms (393 REST endpoints underneath): Google/Bing/Brave search, Google Maps, Amazon, YouTube, TikTok, Yahoo Finance, CoinGecko, and more.&lt;/li&gt;
&lt;li&gt;You pay only on a successful (2xx) response — failed calls are free — and the free tier includes 2,000 credits a month with no card.&lt;/li&gt;
&lt;li&gt;Versus writing your own scrapers: no per-source glue code, normalized JSON instead of HTML, and proxy routing, rendering, and retries handled behind the endpoint.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;You can give an AI agent live web data by connecting it to a &lt;strong&gt;hosted MCP endpoint&lt;/strong&gt;: your agent calls documented tools — search, maps, e-commerce, app stores, social, finance, and more — and gets back normalized JSON, with no scraping code to write or proxies to run. This guide explains what MCP is, what data you can pull, how to connect in about three minutes, and what a real tool call and its response look like.&lt;/p&gt;

&lt;p&gt;Most LLMs are frozen at their training cutoff and can't see the live web. The usual fix — writing a scraper per source, then maintaining proxies, headless browsers, and parsers — is exactly the work teams don't want to own. MCP plus a hosted data server removes it: the model gets a stable set of tools, and the fetching lives behind an endpoint.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is MCP, and why does it matter for agents?
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;Model Context Protocol (MCP)&lt;/strong&gt; is an open standard that lets an AI agent call external tools through one consistent interface. Instead of wiring a bespoke integration for every data source, the agent connects to an MCP server, &lt;strong&gt;discovers&lt;/strong&gt; the tools it exposes, and calls them during a task.&lt;/p&gt;

&lt;p&gt;An MCP server can expose three kinds of primitives: &lt;strong&gt;tools&lt;/strong&gt; (functions the model can call, like &lt;code&gt;google_map_search&lt;/code&gt;), &lt;strong&gt;resources&lt;/strong&gt; (read-only data), and &lt;strong&gt;prompts&lt;/strong&gt; (reusable templates). For live web data, tools are what matter — each one is a documented action with typed inputs and a predictable output.&lt;/p&gt;
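
&lt;p&gt;For illustration, a tool invocation travels as a JSON-RPC 2.0 request whose method is &lt;code&gt;tools/call&lt;/code&gt;. The Python sketch below only builds that envelope and does not send it. The helper name is hypothetical, and the &lt;code&gt;query&lt;/code&gt; and &lt;code&gt;limit&lt;/code&gt; arguments are taken from the worked example later in this guide rather than from a published schema. In practice your MCP client builds and sends this for you after its initialization handshake.&lt;/p&gt;

```python
import json


def tools_call_request(request_id, tool_name, arguments):
    """Build a JSON-RPC 2.0 request for MCP's tools/call method."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


request = tools_call_request(
    1, "google_map_search", {"query": "coffee shops in Austin, TX", "limit": 5}
)
print(json.dumps(request, indent=2))
```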

&lt;p&gt;Why this beats a pile of one-off integrations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;One interface, many sources.&lt;/strong&gt; Add a data source, swap a search engine, or pull a new platform without touching your agent's wiring — it's a tool call, not a rewrite.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-describing.&lt;/strong&gt; The agent reads each tool's schema, so it knows what arguments to pass and what shape comes back.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Portable.&lt;/strong&gt; The same server works across Claude, Cursor, Cline, Windsurf, n8n, and any MCP-compatible client.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Who should use this
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Claude Code, Cursor, Cline, and Windsurf users who want their editor or agent to read live web, SERP, commerce, or finance data while coding or researching.&lt;/li&gt;
&lt;li&gt;Agent builders wiring tools into LangChain, n8n, or a custom framework who need a reliable web-data layer instead of bespoke scrapers.&lt;/li&gt;
&lt;li&gt;RAG and data teams that need fresh, structured records — places, products, reviews, prices, quotes — rather than raw HTML to parse.&lt;/li&gt;
&lt;li&gt;Anyone moving an agent from prototype to production who doesn't want to run proxies, browsers, and parser maintenance.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What your agent can pull: the tool catalog
&lt;/h2&gt;

&lt;p&gt;Crawlora's hosted MCP server exposes &lt;strong&gt;319 tools across 33 platforms&lt;/strong&gt;, backed by &lt;strong&gt;393 documented REST endpoints&lt;/strong&gt;. One connection covers a wide slice of the public web, each tool returning the same JSON fields every time:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Platforms&lt;/th&gt;
&lt;th&gt;Example tools&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Search &amp;amp; SERP&lt;/td&gt;
&lt;td&gt;Google, Bing, Brave (web, news, images, suggest)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;google_search&lt;/code&gt;, &lt;code&gt;bing_search&lt;/code&gt;, &lt;code&gt;brave_search&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Maps &amp;amp; local&lt;/td&gt;
&lt;td&gt;Google Maps (places, search, reviews)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;google_map_search&lt;/code&gt;, &lt;code&gt;google_map_place&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;E-commerce&lt;/td&gt;
&lt;td&gt;Amazon, eBay, Shopify, Shop.app&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;amazon_search&lt;/code&gt;, &lt;code&gt;ebay_search&lt;/code&gt;, &lt;code&gt;shopify_products&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;App stores&lt;/td&gt;
&lt;td&gt;Apple App Store, Google Play&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;appstore_search&lt;/code&gt;, &lt;code&gt;googleplay_reviews&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Social &amp;amp; creator&lt;/td&gt;
&lt;td&gt;TikTok, YouTube, Instagram, Reddit&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;tiktok_search&lt;/code&gt;, &lt;code&gt;youtube_search&lt;/code&gt;, &lt;code&gt;reddit_search&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reviews &amp;amp; travel&lt;/td&gt;
&lt;td&gt;Trustpilot, Tripadvisor&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;trustpilot_business_reviews&lt;/code&gt;, &lt;code&gt;tripadvisor_search&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Finance &amp;amp; crypto&lt;/td&gt;
&lt;td&gt;Yahoo Finance, Google Finance, CoinGecko&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;yahoo_finance_ticker_quote&lt;/code&gt;, &lt;code&gt;coingecko_coin&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The deepest groups carry dozens of tools each — Yahoo Finance (39), Spotify (30), TikTok (24), CoinGecko (21), JustWatch (21), Google Finance (20) — so an agent can do real work on one platform without leaving the server.&lt;/p&gt;

&lt;h2&gt;
  
  
  Connect the hosted MCP endpoint in about three minutes
&lt;/h2&gt;

&lt;p&gt;Crawlora runs a &lt;strong&gt;hosted MCP endpoint&lt;/strong&gt; over Streamable HTTP at &lt;code&gt;https://mcp.crawlora.net/mcp&lt;/code&gt;. There's nothing to install or host — you point your client at the URL and authenticate with your API key, either as an &lt;code&gt;x-api-key&lt;/code&gt; header or an &lt;code&gt;Authorization: Bearer&lt;/code&gt; token. &lt;a href="https://crawlora.net" rel="noopener noreferrer"&gt;Get a free key&lt;/a&gt; (2,000 credits/month, no card) first.&lt;/p&gt;

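&lt;p&gt;&lt;strong&gt;Claude Desktop / Claude Code, Cursor, Windsurf&lt;/strong&gt; — add the server to your client's MCP config. The config file location and exact key names vary by client, so check your client's documentation if the block below is rejected:&lt;br&gt;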
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"crawlora"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://mcp.crawlora.net/mcp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"headers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"x-api-key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"YOUR_API_KEY"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Cline (VS Code)&lt;/strong&gt; — open the MCP Servers panel, choose &lt;em&gt;Remote&lt;/em&gt;, and use the same URL and header. The tools appear in the agent's tool list once connected.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A stdio bridge&lt;/strong&gt; — if your client only speaks stdio rather than a remote URL, wrap the endpoint with a proxy and pass the key as an environment variable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
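&lt;pre class="highlight shell"&gt;&lt;code&gt;npx mcp-remote https://mcp.crawlora.net/mcp &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &lt;/span&gt;&lt;span class="nv"&gt;$CRAWLORA_API_KEY&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;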
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;a href="https://crawlora.net/mcp" rel="noopener noreferrer"&gt;MCP docs&lt;/a&gt; have the current connection details and a server card listing the full tool catalog. After connecting, ask your agent to "list available tools" to confirm the tools are visible.&lt;/p&gt;

&lt;h2&gt;
  
  
  A worked example: from one prompt to clean JSON
&lt;/h2&gt;

&lt;p&gt;Once connected, the agent calls tools and reasons over the normalized JSON they return. Ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Find the top-rated coffee shops in Austin and summarize what reviewers like."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The agent picks the maps tool and calls it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"google_map_search"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"arguments"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"coffee shops in Austin, TX"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"limit"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It gets back structured records — not HTML to parse — that look like this (trimmed for the example):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"results"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Houndstooth Coffee"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"rating"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;4.6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"reviews"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1284&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"address"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"401 Congress Ave, Austin, TX 78701"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Coffee shop"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Cuvée Coffee Bar"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"rating"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;4.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"reviews"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;932&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"address"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2000 E 6th St, Austin, TX 78702"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Coffee shop"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From there the agent ranks by rating and review count and writes the summary. The same pattern works for any platform: search a marketplace with &lt;code&gt;amazon_search&lt;/code&gt;, pull a stock quote with &lt;code&gt;yahoo_finance_ticker_quote&lt;/code&gt;, or read app reviews with &lt;code&gt;googleplay_reviews&lt;/code&gt;. The data layer is Crawlora; the orchestration is your agent framework.&lt;/p&gt;
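
&lt;p&gt;The ranking step is small enough to sketch directly. Assuming the response shape shown above (a list of records with &lt;code&gt;rating&lt;/code&gt; and &lt;code&gt;reviews&lt;/code&gt; fields), the hypothetical &lt;code&gt;rank_places&lt;/code&gt; helper below sorts by rating first and review count second. An agent might do this in its own reasoning instead, so treat it as one possible approach.&lt;/p&gt;

```python
# Records trimmed to the fields the ranking needs, taken from the example response.
results = [
    {"name": "Cuvée Coffee Bar", "rating": 4.5, "reviews": 932},
    {"name": "Houndstooth Coffee", "rating": 4.6, "reviews": 1284},
]


def rank_places(places):
    """Sort by rating, then by review count, highest first."""
    return sorted(places, key=lambda p: (p["rating"], p["reviews"]), reverse=True)


for place in rank_places(results):
    print(place["name"], place["rating"], place["reviews"])
```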

&lt;h2&gt;
  
  
  MCP vs. writing your own scrapers
&lt;/h2&gt;

&lt;p&gt;The shortcut is real, but it helps to see exactly what you hand off by not building the plumbing yourself:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Crawlora MCP&lt;/th&gt;
&lt;th&gt;DIY scrapers&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Integration&lt;/td&gt;
&lt;td&gt;One interface; tools discovered automatically&lt;/td&gt;
&lt;td&gt;Bespoke glue code per source&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output&lt;/td&gt;
&lt;td&gt;Normalized JSON with a documented schema&lt;/td&gt;
&lt;td&gt;HTML you parse and re-parse&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fetching&lt;/td&gt;
&lt;td&gt;Proxy routing, JS rendering, retries handled&lt;/td&gt;
&lt;td&gt;You run proxies and headless browsers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Maintenance&lt;/td&gt;
&lt;td&gt;None — the endpoint owns the schema&lt;/td&gt;
&lt;td&gt;Parsers break when a page's layout shifts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Coverage&lt;/td&gt;
&lt;td&gt;319 tools across 33 platforms, one key&lt;/td&gt;
&lt;td&gt;One scraper per source you build&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost model&lt;/td&gt;
&lt;td&gt;Pay on success (2xx only); free tier&lt;/td&gt;
&lt;td&gt;Infra + engineering time, paid regardless&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The two aren't mutually exclusive. For arbitrary, unpredictable pages — docs sites, blogs, long-tail URLs — an AI-native crawler that returns markdown is the better fit. For &lt;em&gt;known platforms&lt;/em&gt; where you want stable records to sort, join, and chart, documented endpoints win because there's no parser to maintain. Many teams run both.&lt;/p&gt;

&lt;h2&gt;
  
  
  Best practices for agents that call web data
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Authenticate with a header, not a query string: send the key as &lt;code&gt;x-api-key&lt;/code&gt; or &lt;code&gt;Authorization: Bearer&lt;/code&gt; so it never lands in logs or URLs.&lt;/li&gt;
&lt;li&gt;Let the model read the tool schemas before calling — discovery is the point of MCP; don't hard-code arguments your agent could infer.&lt;/li&gt;
&lt;li&gt;Handle the 2xx-only billing model in your logic: a failed call costs nothing, so retries are cheap, but check status before treating a response as data.&lt;/li&gt;
&lt;li&gt;Start narrow. Point the agent at the few tools a task needs rather than all of them, so its tool selection stays accurate.&lt;/li&gt;
&lt;li&gt;Cache results you'll reuse within a task to save credits and latency — live data doesn't mean re-fetching the same page twice.&lt;/li&gt;
&lt;li&gt;Prototype on the free tier, then watch the credits dashboard before you scale a multi-step agent that fans out calls.&lt;/li&gt;
&lt;/ul&gt;
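
&lt;p&gt;Two of the practices above, checking status before trusting a response and caching within a task, can be combined in one small wrapper. The Python sketch below assumes a hypothetical &lt;code&gt;call_tool(name, arguments)&lt;/code&gt; function that returns a &lt;code&gt;(status, data)&lt;/code&gt; pair. That signature is an assumption for illustration, not Crawlora's client API, and the cache key requires argument values to be hashable.&lt;/p&gt;

```python
def make_cached_caller(call_tool):
    """Wrap a tool-calling function with a per-task cache keyed on name and arguments."""
    cache = {}

    def cached_call(name, arguments):
        # Sort the items so argument order does not create duplicate cache entries.
        key = (name, tuple(sorted(arguments.items())))
        if key in cache:
            return cache[key]
        status, data = call_tool(name, arguments)
        if status // 100 != 2:
            # Non-2xx responses are not data, so they are never cached.
            raise RuntimeError(f"{name} failed with status {status}")
        cache[key] = data
        return data

    return cached_call
```

&lt;p&gt;Create one wrapper per task so cached results do not outlive the question they were fetched for.&lt;/p&gt;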

&lt;h2&gt;
  
  
  Pricing, credits, and limits
&lt;/h2&gt;

&lt;p&gt;Crawlora bills on a &lt;strong&gt;pay-on-success&lt;/strong&gt; model: each call costs &lt;strong&gt;1–8 credits&lt;/strong&gt; and is charged &lt;strong&gt;only on a successful (2xx) response&lt;/strong&gt; — 4xx and 5xx responses are free, so an agent that retries or probes doesn't run up a bill for failures. The &lt;strong&gt;free tier&lt;/strong&gt; includes &lt;strong&gt;2,000 credits per month with no card&lt;/strong&gt;, which is enough to build and test a real agent before upgrading. There's also a public &lt;a href="https://crawlora.net/playground" rel="noopener noreferrer"&gt;Playground&lt;/a&gt; to run any endpoint and inspect the JSON before you wire it into a tool call.&lt;/p&gt;
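
&lt;p&gt;As a rough budgeting aid, the figures above translate into a range of successful calls per month. The Python sketch below uses only the numbers stated in this section (2,000 free credits, 1 to 8 credits per call). The helper name is hypothetical, and the actual cost of any given tool is something to confirm in the docs.&lt;/p&gt;

```python
FREE_CREDITS_PER_MONTH = 2000
MIN_COST, MAX_COST = 1, 8  # credits per successful call, per the pricing above


def successful_calls(budget, cost_per_call):
    """Whole number of successful calls a credit budget covers at a given cost."""
    return budget // cost_per_call


print(successful_calls(FREE_CREDITS_PER_MONTH, MAX_COST))  # 250
print(successful_calls(FREE_CREDITS_PER_MONTH, MIN_COST))  # 2000
```

&lt;p&gt;So the free tier covers somewhere between 250 and 2,000 successful calls a month, depending on which tools the agent uses. Failed calls do not count against it.&lt;/p&gt;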

&lt;h2&gt;
  
  
  Frequently asked questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Do I need to host anything to use Crawlora's MCP server?
&lt;/h3&gt;

&lt;p&gt;No. It's a hosted, remote MCP server over Streamable HTTP — you point your client at &lt;code&gt;https://mcp.crawlora.net/mcp&lt;/code&gt; and add your API key. There's no server to install, no proxies to rotate, and no browsers to run.&lt;/p&gt;

&lt;h3&gt;
  
  
  Which clients work with it?
&lt;/h3&gt;

&lt;p&gt;Any MCP-compatible client: Claude Desktop and Claude Code, Cursor, Cline, Windsurf, and agent frameworks like n8n or LangChain via an MCP adapter. The same remote URL and header work everywhere.&lt;/p&gt;

&lt;h3&gt;
  
  
  How is this different from a general web-scraping or crawler MCP?
&lt;/h3&gt;

&lt;p&gt;Crawler-style servers fetch an arbitrary URL and return its page content as markdown — great for unstructured pages. Crawlora exposes &lt;em&gt;documented tools for known platforms&lt;/em&gt;, so a Google Maps place or an Amazon product comes back as the same JSON fields every time, with no extraction prompt or parser to maintain.&lt;/p&gt;

&lt;h3&gt;
  
  
  What data formats does it return?
&lt;/h3&gt;

&lt;p&gt;Normalized JSON per tool, with a documented schema. You get records — places, products, reviews, prices, quotes, posts — not raw HTML, so your agent can use the response immediately.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does authentication work?
&lt;/h3&gt;

&lt;p&gt;Send your Crawlora API key as an &lt;code&gt;x-api-key&lt;/code&gt; header or an &lt;code&gt;Authorization: Bearer&lt;/code&gt; token on the MCP connection. The same key authenticates every tool the server exposes.&lt;/p&gt;
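
&lt;p&gt;As a rough illustration of the header-based option, the Go sketch below attaches the key to a request aimed at the MCP URL mentioned earlier. It only builds the request: &lt;code&gt;newAuthedRequest&lt;/code&gt; is a hypothetical helper, the POST method is an assumption based on how Streamable HTTP clients usually send messages, and a real MCP client library would also handle the JSON-RPC body and session setup.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"net/http"
)

// mcpURL is the endpoint quoted in the article.
const mcpURL = "https://mcp.crawlora.net/mcp"

// newAuthedRequest is a hypothetical helper that builds a request carrying
// the API key in the x-api-key header. It does not send anything.
func newAuthedRequest(apiKey string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodPost, mcpURL, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("x-api-key", apiKey)
	return req, nil
}

func main() {
	req, err := newAuthedRequest("YOUR_API_KEY") // placeholder key
	if err != nil {
		fmt.Println("could not build request:", err)
		return
	}
	fmt.Println(req.Method, req.URL.Host, "key set:", req.Header.Get("x-api-key") != "")
}
```

&lt;p&gt;Most MCP clients take the same header through their configuration, so hand-written requests like this are mainly useful for debugging a connection.&lt;/p&gt;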

&lt;h3&gt;
  
  
  Can I try it for free?
&lt;/h3&gt;

&lt;p&gt;Yes — the free tier is 2,000 credits a month with no card, and you only spend credits on successful responses. &lt;a href="https://crawlora.net" rel="noopener noreferrer"&gt;Get a key&lt;/a&gt; and connect in about three minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Give your agent live web data in three minutes&lt;/strong&gt; — a hosted MCP server, 319 documented tools across search, maps, commerce, social, and finance, normalized JSON, and managed proxies and retries. 2,000 free credits a month, no card. → &lt;a href="https://crawlora.net/mcp" rel="noopener noreferrer"&gt;Read the MCP docs&lt;/a&gt; · &lt;a href="https://crawlora.net/playground" rel="noopener noreferrer"&gt;Try the Playground&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this fits
&lt;/h2&gt;

&lt;p&gt;See the &lt;a href="https://crawlora.net/use-cases/ai-agent-web-data" rel="noopener noreferrer"&gt;AI agent web data&lt;/a&gt; use case for the broader pattern, and the &lt;a href="https://crawlora.net/integrations/langchain" rel="noopener noreferrer"&gt;LangChain integration&lt;/a&gt; if you're wiring tools through a framework rather than a native MCP client. For the web-data fundamentals behind the tools, see &lt;a href="https://crawlora.net/blog/best-web-scraping-apis-2026" rel="noopener noreferrer"&gt;how to choose a web scraping API&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://modelcontextprotocol.io/introduction" rel="noopener noreferrer"&gt;Model Context Protocol — introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.anthropic.com/news/model-context-protocol" rel="noopener noreferrer"&gt;Anthropic — introducing the Model Context Protocol&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.anthropic.com/engineering/code-execution-with-mcp" rel="noopener noreferrer"&gt;Anthropic — code execution with MCP&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Related reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://crawlora.net/blog/best-web-scraping-apis-2026" rel="noopener noreferrer"&gt;Best Web Scraping APIs in 2026: How to Choose&lt;/a&gt; — structured APIs, generic scrapers, and proxy networks compared.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://crawlora.net/blog/firecrawl-alternatives" rel="noopener noreferrer"&gt;Firecrawl Alternatives&lt;/a&gt; — AI-native crawling vs. structured platform endpoints, and when each fits.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://crawlora.net/blog/how-serp-monitoring-apis-work" rel="noopener noreferrer"&gt;How SERP Monitoring APIs Work&lt;/a&gt; — turning live search data into tracked records.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://crawlora.net/blog/ai-agent-web-data-mcp" rel="noopener noreferrer"&gt;crawlora.net&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>mcp</category>
      <category>api</category>
      <category>webdev</category>
    </item>
    <item>
      <title>27.6% of the Top 10 Million Sites are Dead</title>
      <dc:creator>Tony Wang</dc:creator>
      <pubDate>Wed, 30 Oct 2024 08:48:52 +0000</pubDate>
      <link>https://dev.to/tonywangca/276-of-the-top-10-million-sites-are-dead-fgi</link>
      <guid>https://dev.to/tonywangca/276-of-the-top-10-million-sites-are-dead-fgi</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F76kfnc0mfkmz24dz2pym.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F76kfnc0mfkmz24dz2pym.png" alt="The internet has a memory" width="800" height="759"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The internet, in many ways, has a memory. From archived versions of old websites to search engine caches, there's often a way to dig into the past and uncover information, even for websites that are no longer active. You may have heard of the Internet Archive, a popular tool for exploring the history of the web, which has recently suffered outages caused by hacks and other problems. But what if there were no Internet Archive? Would the internet still "remember" these sites?&lt;/p&gt;

&lt;p&gt;In this article, we'll dive into a study of the top 10 million domains and reveal a surprising finding: &lt;strong&gt;over a quarter of them—27.6%—are effectively dead&lt;/strong&gt;. Below, I'll walk you through the steps and infrastructure involved in analyzing these domains, along with the system requirements, code snippets, and statistical results of this research.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Challenge: Analyzing 10 Million Domains
&lt;/h2&gt;

&lt;p&gt;Thanks to resources like &lt;a href="https://www.domcop.com/files/top/top10milliondomains.csv.zip" rel="noopener noreferrer"&gt;DomCop&lt;/a&gt;, we can access a list of the top 10 million domains, which serves as our starting point. Processing such a large volume of URLs requires significant computing resources, parallel processing, and optimized handling of HTTP requests.&lt;/p&gt;

&lt;p&gt;To get accurate results quickly, we needed a well-designed scraper capable of handling millions of requests in minutes. Here’s a breakdown of our approach and the system design.&lt;/p&gt;

&lt;h2&gt;
  
  
  System Design for High-Volume Domain Scraping
&lt;/h2&gt;

&lt;p&gt;To analyze 10 million domains in a reasonable timeframe, we set a target of completing the task in &lt;strong&gt;10 minutes&lt;/strong&gt;. This required a system that could process &lt;strong&gt;approximately 16,667 requests per second&lt;/strong&gt;. By splitting the load across &lt;strong&gt;100 workers&lt;/strong&gt;, each would need to handle around &lt;strong&gt;167 requests per second&lt;/strong&gt;.&lt;/p&gt;
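
&lt;p&gt;The throughput targets follow from simple division, which the short Go sketch below reproduces. The helper name &lt;code&gt;requiredRate&lt;/code&gt; is made up for illustration; the inputs are the figures from the paragraph above.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"time"
)

// requiredRate returns the overall and per-worker request rates needed to
// finish `total` requests within `window` using `workers` workers.
// Hypothetical helper, not part of the original scraper.
func requiredRate(total int, window time.Duration, workers int) (overall, perWorker float64) {
	overall = float64(total) / window.Seconds()
	perWorker = overall / float64(workers)
	return overall, perWorker
}

func main() {
	overall, perWorker := requiredRate(10_000_000, 10*time.Minute, 100)
	fmt.Printf("overall: %.0f req/s, per worker: %.0f req/s\n", overall, perWorker)
}
```

&lt;p&gt;Changing the window or the worker count shows how quickly the per-worker budget moves: halving the workers doubles the rate each one has to sustain.&lt;/p&gt;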

&lt;h3&gt;
  
  
  1. Efficient Queue Management with Redis
&lt;/h3&gt;

&lt;p&gt;Redis comfortably handles well over 10,000 operations per second, so it served as the job queue. Even so, popping jobs and recording status codes for millions of domains one command at a time can overload it. To prevent this, we used Redis pipelines, which batch many commands into a single round trip and reduce the load on our Redis cluster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// SPopN retrieves multiple items from a Redis set efficiently.&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;SPopN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;pipe&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;Redis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SPop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;cmders&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Exec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cmder&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;cmders&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;spopCmd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;cmder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StringCmd&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;spopCmd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using this method, we could pull large batches from Redis with minimal impact on performance, fetching up to 100 jobs at a time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Worker&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;fetchJobs&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Jobs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Second&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;jobs&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;SPopN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;jobQueue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;jobs&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AddJob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Optimizing DNS Requests
&lt;/h3&gt;

&lt;p&gt;To resolve domains efficiently, we used multiple public DNS servers (e.g., Google DNS, Cloudflare) and handled up to &lt;strong&gt;16,667 requests per second&lt;/strong&gt;. Public DNS servers typically throttle large volumes of requests, so we implemented error handling and retries for DNS timeouts and throttling errors.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;dnsServers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s"&gt;"8.8.8.8"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"8.8.4.4"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"1.1.1.1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"1.0.0.1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"208.67.222.222"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"208.67.220.220"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By balancing the load across multiple servers, we could avoid rate limits imposed by individual DNS providers.&lt;/p&gt;
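
&lt;p&gt;The worker code in the next section picks a resolver at random for each attempt. A round-robin rotation is another way to spread the load evenly; the Go sketch below shows that alternative under stated assumptions. The &lt;code&gt;rotation&lt;/code&gt; type and its methods are hypothetical and are not taken from the original scraper.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// rotation hands out DNS servers in round-robin order.
// It is safe for concurrent use because the counter is atomic.
type rotation struct {
	servers []string
	next    atomic.Uint64
}

// newRotation builds a rotation over the given server list.
func newRotation(servers []string) *rotation {
	r := new(rotation)
	r.servers = servers
	return r
}

// pick returns the next server, wrapping around at the end of the list.
func (r *rotation) pick() string {
	n := r.next.Add(1) - 1
	return r.servers[n%uint64(len(r.servers))]
}

func main() {
	// Same public resolvers as the list above.
	r := newRotation([]string{
		"8.8.8.8", "8.8.4.4", "1.1.1.1", "1.0.0.1", "208.67.222.222", "208.67.220.220",
	})
	for _, domain := range []string{"example.com", "example.org", "example.net"} {
		fmt.Println(domain, "via", r.pick())
	}
}
```

&lt;p&gt;Random selection needs no shared state, while a rotation guarantees that no single provider receives two consecutive lookups from the same process. Either approach keeps per-provider volume at roughly one sixth of the total.&lt;/p&gt;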

&lt;h3&gt;
  
  
  3. HTTP Request Handling
&lt;/h3&gt;

&lt;p&gt;To check each domain's status, we sent an HTTP request directly to its resolved IP address. The following code retries over HTTPS when the first attempt fails with a protocol-mismatch error.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Worker&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;worker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;job&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;ips&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;net&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IPAddr&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;customDNSServer&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;retry&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;retry&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;retry&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;customDNSServer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dnsServers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;rand&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Intn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dnsServers&lt;/span&gt;&lt;span class="p"&gt;))]&lt;/span&gt;
        &lt;span class="n"&gt;resolver&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;net&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Resolver&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;PreferGo&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;Dial&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;network&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;address&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;net&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Conn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;net&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dialer&lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DialContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"udp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customDNSServer&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="s"&gt;":53"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cancel&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithTimeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Background&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Second&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;cancel&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="n"&gt;ips&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;resolver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LookupIPAddr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ips&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Retry %d: Failed to resolve %s on DNS server: %s, error: %v"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;retry&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customDNSServer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ips&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Failed to resolve %s on DNS server: %s after retries, error: %v"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customDNSServer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;updateStats&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;customDialer&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;net&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dialer&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;Timeout&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Second&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;customTransport&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Transport&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;DialContext&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;network&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;addr&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;net&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Conn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;port&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="s"&gt;"80"&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HasPrefix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;addr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"https://"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;port&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"443"&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;customDialer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DialContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;network&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ips&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="s"&gt;":"&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;Timeout&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;   &lt;span class="m"&gt;10&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Second&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;Transport&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;customTransport&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;CheckRedirect&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;via&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ErrUseLastResponse&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewRequestWithContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Background&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="s"&gt;"GET"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"http://"&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Failed to create request: %v"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;updateStats&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Header&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"User-Agent"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;userAgent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Do&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;urlErr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;urlErr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Err&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="s"&gt;"http: server gave HTTP response to HTTPS client"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Request failed due to HTTP response to HTTPS client: %v"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="c"&gt;// Retry with HTTPS&lt;/span&gt;
            &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;URL&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Scheme&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"https"&lt;/span&gt;
            &lt;span class="n"&gt;customTransport&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DialContext&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;network&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;addr&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;net&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Conn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;customDialer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DialContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;network&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ips&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="s"&gt;":443"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Do&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"HTTPS request failed: %v"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;updateStats&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Request failed: %v"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;updateStats&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Body&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Received response from %s: %s"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Status&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;updateStats&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StatusCode&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Deployment Strategy
&lt;/h2&gt;

&lt;p&gt;Our scraping deployment consisted of &lt;strong&gt;400 worker replicas&lt;/strong&gt;, each handling &lt;strong&gt;200 concurrent requests&lt;/strong&gt;. This configuration required &lt;strong&gt;20 instances, 160 vCPUs, and 450GB of memory&lt;/strong&gt;. With CPU usage at only around 30%, the setup was efficient and cost-effective, as shown below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;worker&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;400&lt;/span&gt;
  &lt;span class="s"&gt;...&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;worker&lt;/span&gt;
      &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ghcr.io/tonywangcn/ten-million-domains:20241028150232&lt;/span&gt;
      &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2Gi"&lt;/span&gt;
          &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1000m"&lt;/span&gt;
        &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;300Mi"&lt;/span&gt;
          &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;300m"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
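
&lt;p&gt;As a rough sanity check on these figures, the sketch below multiplies out the replica count, the per-worker concurrency, and the resource requests from the manifest above. It is only back-of-the-envelope arithmetic, assuming all 400 replicas are scheduled at their requested (not limit) values; the variable names are illustrative and not part of the project.&lt;/p&gt;

```python
# Back-of-the-envelope capacity check for the deployment described above.
# All names are illustrative; the figures come from the article's manifest.
REPLICAS = 400
CONCURRENCY_PER_WORKER = 200
CPU_REQUEST_MILLICORES = 300   # resources.requests.cpu: "300m"
MEMORY_REQUEST_MIB = 300       # resources.requests.memory: "300Mi"
TOTAL_DOMAINS = 10_000_000

total_concurrency = REPLICAS * CONCURRENCY_PER_WORKER
requested_vcpus = REPLICAS * CPU_REQUEST_MILLICORES / 1000
requested_memory_gib = REPLICAS * MEMORY_REQUEST_MIB / 1024
domains_per_slot = TOTAL_DOMAINS / total_concurrency

print(f"concurrent requests: {total_concurrency}")
print(f"requested vCPUs:     {requested_vcpus}")
print(f"requested memory:    {requested_memory_gib:.1f} GiB")
print(f"domains per slot:    {domains_per_slot}")
```

&lt;p&gt;Under these assumptions the requests add up to 120 vCPUs and roughly 117 GiB of memory, which fits inside the 160 vCPUs and 450GB provisioned, and each of the 80,000 concurrent slots handles about 125 domains.&lt;/p&gt;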



&lt;p&gt;The cost for this setup was approximately &lt;strong&gt;$0.0116 per 10 million requests&lt;/strong&gt;, totaling less than $1 for the entire analysis.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo91tskuanxr6recxkf0d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo91tskuanxr6recxkf0d.png" alt="Cost of servers" width="800" height="205"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Analysis: How Many Sites Are Actually Accessible?
&lt;/h2&gt;

&lt;p&gt;The status code data from the scraper allowed us to classify domains as "accessible" or "inaccessible." Here are the criteria used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Accessible: Status codes other than 1000 (DNS not found), 0 (timeout), 404 (not found), or 5xx (server error).&lt;/li&gt;
&lt;li&gt;Inaccessible: Domains with the status codes above, indicating they are either unreachable or no longer in service.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;accessible_condition&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status_code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status_code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status_code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;404&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;
    &lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status_code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;between&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;599&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;inaccessible_condition&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;accessible_condition&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
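
&lt;p&gt;For readers who want to check the rule on individual values, here is a minimal sketch of the same condition as a plain Python function. The name &lt;code&gt;is_accessible&lt;/code&gt; is hypothetical and does not appear in the project; it assumes integer status codes, with 1000 and 0 being the scraper's own sentinel values.&lt;/p&gt;

```python
# Hypothetical helper mirroring the pandas condition above for one status code.
# 1000 (DNS not found) and 0 (timeout) are the scraper's own sentinel values.
def is_accessible(status_code):
    if status_code in (1000, 0, 404):
        return False
    # Any 5xx response counts as a server error.
    if status_code // 100 == 5:
        return False
    return True


for code in (200, 301, 403, 404, 503, 0, 1000):
    print(code, is_accessible(code))
```

&lt;p&gt;Note that under these criteria a response such as 403 still counts as accessible, because the server answered the request.&lt;/p&gt;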



&lt;p&gt;After aggregating the results, we found that &lt;strong&gt;27.6% of the domains were either inactive or inaccessible&lt;/strong&gt;. This meant that over &lt;strong&gt;2.75 million domains&lt;/strong&gt; from the top 10 million were dead.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;| Status Code | Count     | Rate |
| ----------- | --------- | ---- |
| 301         | 4,989,491 | 50%  |
| 1000        | 1,883,063 | 19%  |
| 200         | 1,087,516 | 11%  |
| 302         | 659,791   | 7%   |
| 0           | 522,221   | 5%   |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
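
&lt;p&gt;The rates in the table can be reproduced with a few lines of arithmetic. The sketch below copies the counts from the table and assumes a total of exactly 10,000,000 domains, which is an approximation since only the top five rows are shown.&lt;/p&gt;

```python
# Counts copied from the table above; the 10,000,000 total is an assumption
# based on the size of the domain list, since only the top rows are shown.
TOTAL = 10_000_000
counts = {301: 4_989_491, 1000: 1_883_063, 200: 1_087_516, 302: 659_791, 0: 522_221}

for code, count in counts.items():
    print(f"{code:5d}: {count / TOTAL:.1%}")

# DNS failures (1000) and timeouts (0) alone, before adding 404 and 5xx.
unreachable = counts[1000] + counts[0]
print(f"DNS + timeout share: {unreachable / TOTAL:.1%}")
```

&lt;p&gt;DNS failures and timeouts alone account for roughly 24% of the list; the rest of the 27.6% figure presumably comes from 404 and 5xx responses, which fall outside the rows shown here.&lt;/p&gt;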



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;With a dataset as large as 10 million domains, there are bound to be formatting inconsistencies that affect accuracy. For example, domains with a &lt;code&gt;www&lt;/code&gt; prefix should ideally be treated the same as those without, yet variations in how URLs are constructed can lead to mismatches. Additionally, some domains serve specific functions, like content delivery networks (CDNs) or API endpoints, which may not have a traditional homepage or may return a &lt;code&gt;404&lt;/code&gt; status by design. This adds a layer of complexity when interpreting accessibility.&lt;/p&gt;

&lt;p&gt;Achieving complete data cleanliness and uniform formatting would require substantial additional processing time. However, with the large volume of data, minor inconsistencies likely constitute around 1% or less of the overall dataset, meaning they don’t significantly affect the final result: &lt;strong&gt;more than a quarter of the top 10 million domains are no longer accessible&lt;/strong&gt;. This suggests that as time passes, your history and contributions on the internet could gradually disappear.&lt;/p&gt;

&lt;p&gt;While the scraper itself completes the task in around 10 minutes, the research, development, and testing required to reach this point took days or even weeks of effort.&lt;/p&gt;

&lt;p&gt;If this research resonates with you, please consider supporting more work like this by sponsoring me on &lt;a href="https://www.patreon.com/tonywang_dev" rel="noopener noreferrer"&gt;Patreon&lt;/a&gt;. Your support fuels the creation of articles and research projects, helping to keep these insights accessible to everyone. Additionally, if you have questions or projects where you could use consultation, feel free to reach out via email.&lt;/p&gt;

&lt;p&gt;The source code for this project is available on &lt;a href="https://github.com/tonywangcn/ten-million-domains" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. Please use it responsibly—this is meant for ethical and constructive use, not for overwhelming or abusing servers.&lt;/p&gt;

&lt;p&gt;Thank you for reading, and I hope this research inspires a deeper appreciation for the impermanence of the internet.&lt;/p&gt;

</description>
      <category>domainanalysis</category>
      <category>topdomains</category>
      <category>webcrawler</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>The Architecture of a Web Crawler: Building a Google-Inspired Distributed Web Crawler. Part 1</title>
      <dc:creator>Tony Wang</dc:creator>
      <pubDate>Fri, 13 Oct 2023 12:37:00 +0000</pubDate>
      <link>https://dev.to/tonywangca/the-architecture-of-a-web-crawler-building-a-google-inspired-distributed-web-crawler-part-1-87f</link>
      <guid>https://dev.to/tonywangca/the-architecture-of-a-web-crawler-building-a-google-inspired-distributed-web-crawler-part-1-87f</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--njNeOyas--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2A8xrYFatdSREBw1eSmt7QKA.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--njNeOyas--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2A8xrYFatdSREBw1eSmt7QKA.jpeg" alt="Source: earth.com" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Support me on &lt;a href="https://www.patreon.com/tonywang_dev"&gt;Patreon&lt;/a&gt; to write more tutorials like this!&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In the rapidly evolving digital landscape, accessing and analyzing vast troves of web data has become imperative for businesses and researchers alike. In real-world scenarios, the need for scaling web crawling operations is paramount. Whether it’s dynamic pricing analysis for e-commerce, sentiment analysis of social media trends, or competitive intelligence, the ability to gather data at scale offers a competitive advantage. Our goal is to guide you through the development of a Google-inspired distributed web crawler, a powerful tool capable of efficiently navigating the intricate web of information.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Imperative of Scaling: Why Distributed Crawlers Matter
&lt;/h2&gt;

&lt;p&gt;The significance of distributed web crawlers becomes evident when we consider the challenges of traditional, single-node crawling. These limitations encompass issues such as speed bottlenecks, scalability constraints, and vulnerability to system failures. To effectively harness the wealth of data on the web, we must adopt scalable and resilient solutions.&lt;/p&gt;

&lt;p&gt;Ignoring this necessity can result in missed opportunities, incomplete insights, and a loss of competitive edge. For instance, consider a scenario where a retail business fails to employ a distributed web crawler to monitor competitor prices in real-time. Without this technology, they may miss out on adjusting their own prices dynamically to remain competitive, potentially losing customers to rivals offering better deals.&lt;/p&gt;

&lt;p&gt;In the field of academic research, a researcher investigating trends in scientific publications may find that manually collecting data from hundreds of journal websites is not only time-consuming but also prone to errors. A distributed web crawler, on the other hand, could automate this process, ensuring comprehensive and error-free data collection.&lt;/p&gt;

&lt;p&gt;In the realm of social media marketing, timely analysis of trending topics is crucial. Without the ability to rapidly gather data from various platforms, a marketing team might miss the ideal moment to engage with a viral trend, resulting in lost opportunities for brand exposure.&lt;/p&gt;

&lt;p&gt;These examples illustrate how distributed web crawlers are not just convenient tools but essential assets for staying ahead in the modern digital landscape. They empower businesses, researchers, and marketers to harness the full potential of the internet, enabling data-driven decisions and maintaining a competitive edge.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introducing the Multifaceted Tech Stack: Kubernetes and More
&lt;/h2&gt;

&lt;p&gt;Our journey into distributed web crawling will be guided by a multifaceted technology stack, carefully selected to address each facet of the challenge:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Kubernetes&lt;/strong&gt;: This powerful orchestrator is the cornerstone of our solution, enabling the dynamic scaling and efficient management of containerized applications.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Golang, Python, NodeJS&lt;/strong&gt;: We have chosen these programming languages for their strengths in specific components of the crawler, offering a blend of performance, versatility, and developer-friendly features.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grafana and Prometheus&lt;/strong&gt;: These monitoring tools provide real-time visibility into the performance and health of our crawler, ensuring we stay on top of any issues.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prometheus Exporters&lt;/strong&gt;: Along with Prometheus, exporters capture customized metrics from various services, enhancing our monitoring capabilities of distributed crawlers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ELK Stack (Elasticsearch, Logstash, Kibana)&lt;/strong&gt;: This trio constitutes our log analysis toolkit, enabling comprehensive log collection, processing, analysis, and visualization.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Preparing Your Development Environment
&lt;/h2&gt;

&lt;p&gt;A robust development environment is the foundation of any successful project. Here, we’ll guide you through setting up the environment for building our distributed web crawler:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1). Install Dependencies&lt;/strong&gt;: We highly recommend using a Unix-like operating system to install the packages listed below. For this demonstration, we will use Ubuntu 22.04.3 LTS.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# kubectl may require adding the Kubernetes apt repository first; see https://kubernetes.io/docs/tasks/tools/install-kubectl-linux/ for detailed installation steps
sudo apt install -y awscli docker.io docker-compose make kubectl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;2). Configure AWS and Set Up EKS Cluster&lt;/strong&gt;: Create a dedicated AWS access key by following the tutorial available &lt;a href="https://docs.aws.amazon.com/powershell/latest/userguide/pstools-appendix-sign-up.html"&gt;here&lt;/a&gt;, then run &lt;code&gt;aws configure&lt;/code&gt; in the terminal of your development machine.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws configure
AWS Access Key ID [****************3ZL7]: 
AWS Secret Access Key [****************S3Fu]: 
Default region name [us-east-1]: 
Default output format [None]:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;After creating a Kubernetes cluster on AWS EKS by following the steps outlined in &lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/create-cluster.html"&gt;this guide&lt;/a&gt;, it’s time to generate the kubeconfig using the following command.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws eks update-kubeconfig --name distributed-web-crawler
Added new context arn:aws:eks:us-east-1:************:cluster/distributed-web-crawler to /home/ubuntu/.kube/config
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;At this point, you can run &lt;em&gt;kubectl get pods&lt;/em&gt; to verify if you can successfully connect to the remote cluster. Sometimes, you may encounter the following error. In such cases, we suggest following this &lt;a href="https://gist.github.com/Zheaoli/335bba0ad0e49a214c61cbaaa1b20306"&gt;tutorial&lt;/a&gt; to debug and resolve the version conflict issue.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get pods
error: exec plugin: invalid apiVersion "client.authentication.k8s.io/v1alpha1"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;3). Setting up Redis and MongoDB Instances:&lt;/strong&gt; In a distributed system, a message queue is essential for distributing tasks among workers. Redis has been chosen for its rich data structures, such as lists, sets, and strings, which let it serve not only as a message queue but also as a cache and deduplication filter. MongoDB is selected for its native horizontal scalability as a document database, which avoids the challenge of scaling a database to handle billions of records or more in the future. Follow the tutorials below to create a Redis instance and a MongoDB instance, respectively:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Redis: &lt;a href="https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/Clusters.Create.html"&gt;https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/Clusters.Create.html&lt;/a&gt;&lt;br&gt;
MongoDB: &lt;a href="https://www.mongodb.com/docs/atlas/getting-started/"&gt;https://www.mongodb.com/docs/atlas/getting-started/&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;4). Lens:&lt;/strong&gt; One of the most powerful IDEs for Kubernetes, allowing you to visually manage your clusters. Once it is installed on your computer, you will see charts like those in the screenshot below. Note, however, that you need to install a few additional components to enable real-time CPU and memory usage monitoring for your cluster.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9lNKYbHq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/5030/1%2AlJUyqTPE9SuEDpo123nY3A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9lNKYbHq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/5030/1%2AlJUyqTPE9SuEDpo123nY3A.png" alt="" width="800" height="235"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Constructing the Initial Project Structure
&lt;/h2&gt;

&lt;p&gt;With your environment set up, it’s time to establish the foundation of the project. An organized and modular project structure is essential for scalability and maintainability. Since this is a demonstration project, I suggest consolidating everything into a monolithic repository for simplicity, instead of splitting it into multiple repositories based on languages, purposes, or other criteria:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;./
├── docker
│   ├── go
│   │   └── Dockerfile
│   └── node
│       └── Dockerfile
├── docker-compose.yml
├── elk
│   └── docker-compose.yml
├── go
│   └── src
│       ├── main.go
│       ├── metric
│       │   └── metric.go
│       ├── model
│       │   └── model.go
│       └── pkg
│           ├── constant
│           │   └── constant.go
│           └── redis
│               └── redis.go
├── k8s
│   ├── config.yaml
│   ├── deployment.yaml
│   └── service.yaml
├── makefile
└── node
    └── index.js

13 directories, 14 files
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Designing the Distributed Crawler Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eG_jF5W---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/4230/1%2AfTzojPCTgwv_xSqmuCeskQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eG_jF5W---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/4230/1%2AfTzojPCTgwv_xSqmuCeskQ.png" alt="Architecture of Distributed Crawler. Click to see original image." width="800" height="409"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In understanding the architecture of a distributed web crawler, it’s essential to grasp the core components that come together to make this intricate system function seamlessly:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1). Worker Nodes:&lt;/strong&gt; These are the cornerstone of our distributed crawler, and we’ll dedicate significant attention to them in the following sections. The Golang crawler will handle straightforward, server-side rendered webpages, while the NodeJS crawler will tackle complex webpages using a headless browser such as Chrome. It’s important to note that a plain HTTP request issued from a language like Golang or Python is significantly more resource-efficient (often 10 times or more) than loading the same page in a headless browser.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2). Message Queue:&lt;/strong&gt; For its simplicity and rich built-in features, we rely on Redis. Bloom filters stand out here: they are invaluable for filtering duplicates among billions of records, offering high performance with minimal memory consumption.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3). Data Storage:&lt;/strong&gt; A document database such as MongoDB is a good fit for storage. However, if you want your textual data to be searchable, akin to Google, Elasticsearch is the preferred option.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4). Logging:&lt;/strong&gt; Within our ecosystem, the ELK stack shines. We deploy Filebeat as a DaemonSet so that one collector runs on each node and ships logs to Elasticsearch via Logstash. This is a critical aspect of any distributed system, as logs play a pivotal role in debugging issues, crashes, or unexpected behaviors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5). Monitoring:&lt;/strong&gt; Prometheus takes the lead here, enabling the monitoring of common metrics like CPU and memory usage by pods or nodes. With a customized metric exporter, we can also monitor metrics related to crawling tasks, such as the real-time status of each crawler, the total processed URLs, crawling rates per hour, and more. Moreover, we can set up alerts based on these metrics. Blind management of a distributed system with numerous instances is not advisable; Prometheus ensures that we have clear insights into our system’s health.&lt;/p&gt;
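
&lt;p&gt;To make the duplicate-filtering idea from the message queue component more concrete, here is a minimal, in-memory sketch of a Bloom filter guarding a task queue. It is not the project's code: in the architecture above the filter and the queue would live in Redis, while this version uses a plain Python list and a hypothetical &lt;code&gt;SimpleBloomFilter&lt;/code&gt; class. The slot count and number of hashes are arbitrary, and one byte per slot is used instead of one bit purely for readability.&lt;/p&gt;

```python
import hashlib


class SimpleBloomFilter:
    """Illustrative in-memory Bloom filter (hypothetical, not the project's code)."""

    def __init__(self, size=10_000, num_hashes=4):
        self.size = size
        self.num_hashes = num_hashes
        # One byte per slot keeps the sketch readable; a real filter packs bits.
        self.slots = bytearray(size)

    def _positions(self, item):
        # Derive several slot indexes by salting the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.slots[pos] = 1

    def might_contain(self, item):
        # False means definitely unseen; True means probably seen.
        return all(self.slots[pos] for pos in self._positions(item))


seen = SimpleBloomFilter()
queue = []
for url in ["https://example.com/a", "https://example.com/b", "https://example.com/a"]:
    # Only enqueue URLs the filter has not (probably) seen before.
    if not seen.might_contain(url):
        seen.add(url)
        queue.append(url)

print(queue)
```

&lt;p&gt;The trade-off is that a Bloom filter can report a false positive, so a small fraction of new URLs may be skipped, but it never reports a URL as unseen after it has been added. That is usually acceptable for a crawler in exchange for a fixed, small memory footprint.&lt;/p&gt;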

&lt;h2&gt;
  
  
  The Road Ahead
&lt;/h2&gt;

&lt;p&gt;With a strong foundation laid, the series is poised to delve into the technical intricacies of each component. In the upcoming articles, we’ll start to develop the core code of crawlers and extract data from webpages.&lt;/p&gt;

&lt;p&gt;Stay engaged and follow the series closely to gain a comprehensive understanding of building a cutting-edge distributed web crawler. You can access the source code for this project in the GitHub repository &lt;a href="https://github.com/tonywangcn/distributed-web-crawler"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>webcrawler</category>
      <category>go</category>
      <category>distributedsystem</category>
    </item>
    <item>
      <title>How to efficiently scrape millions of Google Businesses on a large scale using a distributed crawler</title>
      <dc:creator>Tony Wang</dc:creator>
      <pubDate>Mon, 31 Jul 2023 16:46:33 +0000</pubDate>
      <link>https://dev.to/tonywangca/how-to-efficiently-scrape-millions-of-google-businesses-on-a-large-scale-using-a-distributed-crawler-3lkp</link>
      <guid>https://dev.to/tonywangca/how-to-efficiently-scrape-millions-of-google-businesses-on-a-large-scale-using-a-distributed-crawler-3lkp</guid>
      <description>&lt;p&gt;&lt;em&gt;Support me on &lt;a href="https://www.patreon.com/tonywang_dev"&gt;Patreon&lt;/a&gt; to write more tutorials like this!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the &lt;a href="https://dev.to/tonywangca/a-step-by-step-guide-to-building-a-scalable-distributed-crawler-for-scraping-millions-of-top-tiktok-profiles-2pk8"&gt;previous post&lt;/a&gt;, we covered the process of analyzing the network panel of a webpage to identify the relevant RESTful API for scraping desired data. While this approach works for many websites, some obfuscate or encrypt their requests with JavaScript, which makes it difficult to extract valuable information through RESTful APIs alone. This is where a “headless browser” helps: it lets us simulate the actions of a real user browsing the website.&lt;/p&gt;

&lt;p&gt;A headless browser is essentially a web browser without a graphical user interface (GUI). It allows automated web browsing and page interaction, providing a means to access and extract information from websites that employ dynamic content and JavaScript encryption. By using a headless browser, we can overcome some of the challenges posed by traditional scraping methods, as it allows us to execute JavaScript, render web pages, and access dynamically generated content.&lt;/p&gt;

&lt;p&gt;Here I will demonstrate the process of creating a distributed crawler using a headless browser, using Google Maps as our target website.&lt;/p&gt;

&lt;p&gt;Over the years, I have explored various headless browser tools, such as Selenium, Puppeteer, Playwright, and Chromedp. Compared with them, I believe Crawlee is the most powerful tool I have used for web scraping. Crawlee is a JavaScript library that can drive browser automation libraries such as Puppeteer and Playwright, which makes it versatile and flexible for different project requirements.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How to list all the businesses in a country&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In general, when using Google Maps to find businesses we want to visit, we typically conduct searches based on the business category type and location. For instance, we may use a keyword like “shop near Holtsville” to locate any shops in a small town in New York. However, a challenge arises when multiple towns share the same name within the same country. To overcome this, Google Maps offers a helpful feature: querying by postal code. Consequently, the initial query can be refined to “shop near 00501,” with 00501 being the postal code of a specific location in Holtsville. This approach provides greater clarity and reduces confusion compared to using town names.&lt;/p&gt;

&lt;p&gt;With this clear path for efficient searches, our next objective is to compile a comprehensive list of all postal codes in the USA. To accomplish this, I used a free postal code database accessible &lt;a href="https://www.unitedstateszipcodes.org/zip-code-database/"&gt;here&lt;/a&gt;. If you happen to know of a better database, leave a comment below.&lt;/p&gt;
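&lt;p&gt;As a minimal sketch of how the postal code list can be turned into search keywords, the Go snippet below formats each code into a query such as “shop near 00501 USA”. The function names, the sample codes other than 00501, and the search URL pattern are assumptions made for illustration; they are not taken from the scraper’s source code. Note that postal codes are padded to five digits, because codes like 00501 lose their leading zeros if they are parsed as plain numbers.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"net/url"
)

// buildQuery formats one search keyword, e.g. "shop near 00501 USA".
// The postal code is padded to five digits so leading zeros survive.
func buildQuery(category string, zip int) string {
	return fmt.Sprintf("%s near %05d USA", category, zip)
}

// searchURL builds a Google Maps search URL for a query.
// The URL pattern is an assumption used for illustration only.
func searchURL(query string) string {
	return "https://www.google.com/maps/search/" + url.PathEscape(query)
}

func main() {
	// Hypothetical sample of postal codes read from the downloaded list.
	for _, zip := range []int{501, 10001, 90210} {
		q := buildQuery("shop", zip)
		fmt.Println(q, "->", searchURL(q))
	}
}
```

&lt;p&gt;Each generated URL can then be pushed into the crawler’s request queue as one task.&lt;/p&gt;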

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qkzVoWPI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/4264/1%2Agi-qjyfD_1YlejkweYQk4A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qkzVoWPI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/4264/1%2Agi-qjyfD_1YlejkweYQk4A.png" alt="Snapshot of the postal code list of the US" width="800" height="185"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once we have downloaded the postal code list file, we can begin testing its functionality on Google Maps.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VBY1hnS9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/5324/1%2Aw1BtzyC48o7rJXBOWk6vSA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VBY1hnS9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/5324/1%2Aw1BtzyC48o7rJXBOWk6vSA.png" alt="Search shop near 00501 USA in Google Map" width="800" height="530"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Using the keyword “shop near 00501 USA” in the Google Maps search bar, we can see a list of shops located in Holtsville. Since our aim is to scrape all the businesses from this search, we must scroll down through the results until we reach the bottom of the list. At that point, Google Maps displays the message “You’ve reached the end of the list”. This message is our cue to stop scrolling and move on to data extraction, confident that we have collected every business returned for the specified location.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HPYocz8H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/4880/1%2A6MHDwUw61qV-3GXGfbmLcw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HPYocz8H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/4880/1%2A6MHDwUw61qV-3GXGfbmLcw.png" alt="Scroll down until seeing the message “You’ve reached the end of the list”" width="800" height="526"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once we have compiled the list of businesses from Google Maps, we can proceed to extract the detailed information we need from each business entry. This process involves going through the list one by one and scraping relevant data, such as the business’s address, operating hours, phone number, star ratings, number of reviews, and all available reviews.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Ajjn9vS9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3352/1%2A9UvtNAzQbX5VwJghl8p6VQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ajjn9vS9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3352/1%2A9UvtNAzQbX5VwJghl8p6VQ.png" alt="" width="800" height="779"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WFKhDr6r--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2Au6B5KwMwiAiqvgjihGTpkA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WFKhDr6r--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2Au6B5KwMwiAiqvgjihGTpkA.png" alt="" width="800" height="1301"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2Mmf87cx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3336/1%2A_lK_BYoGi1JJhnqMGXKNng.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2Mmf87cx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3336/1%2A_lK_BYoGi1JJhnqMGXKNng.png" alt="" width="800" height="827"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Implementing the Google Maps scraper&lt;/strong&gt;
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Google Maps Businesses scraper&lt;/strong&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;

The provided source code mainly focuses on extracting information from Google Maps using CSS selectors, which is relatively straightforward. As spot instances can be terminated at any time, it is essential to handle this situation carefully.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To solve this issue, we need code that listens for the SIGTERM and SIGINT signals, which indicate that the instance is about to be terminated. When they arrive, we should back up any pending tasks in the job queue and preserve the state of any running tasks that haven’t been completed yet.&lt;/p&gt;

&lt;p&gt;By listening to these signals, we can intercept the termination process and ensure that critical data and tasks are not lost. The backup mechanism enables us to store any unfinished work safely, allowing for a seamless continuation of tasks when new instances are launched in the future.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
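&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// SIGINT and SIGTERM are signals; uncaughtException is a Node.js process event handled the same way.
['SIGINT', 'SIGTERM', 'uncaughtException'].forEach(signal =&amp;gt; process.on(signal, async () =&amp;gt; {
 // Persist pending and in-flight requests before the process goes away.
 await backupRequestQueue(queue, store, signal)
 await crawler.teardown()
 await sleep(200)
 process.exit(1)
}))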
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;2. Google Maps Business Detail Scraper&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;

&lt;p&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;3. Deployment file for Kubernetes&lt;/strong&gt;&lt;br&gt;&lt;/p&gt;

&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;Monitoring and Optimizing the performance&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;As of now, everything with Crawlee appears to be functioning well, except for one critical issue. After running in the Kubernetes (k8s) cluster for approximately one hour, the performance of Crawlee experiences a significant drop, resulting in the extraction of only a few hundred items per hour, whereas initially, it was extracting at a much higher rate. Interestingly, this issue is not encountered when using a standalone container with Docker Compose on a dedicated machine.&lt;/p&gt;

&lt;p&gt;Moreover, if you have metrics-server installed and monitor the cluster, you may observe a drastic decrease in CPU utilization, from around 90% to merely 10%. This unexpected behavior is concerning and requires investigation to identify the underlying cause.&lt;/p&gt;

&lt;p&gt;To address this performance degradation and keep resource utilization efficient, we use the Kubernetes API through &lt;code&gt;client-go&lt;/code&gt;, the Golang SDK for Kubernetes, to monitor the CPU utilization of all instances in the cluster. On top of that, we automatically terminate instances that show very low CPU utilization and have been active for at least 30 minutes.&lt;/p&gt;

&lt;p&gt;Automatically terminating such instances avoids wasted resources and ensures that underperforming instances do not hamper the overall data extraction process. This keeps the cluster’s throughput stable and lets Crawlee deliver consistent results even in a dynamic Kubernetes environment.&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
&lt;br&gt;
The provided code addresses the low CPU utilization issue by using the Kubernetes metrics API to find underperforming nodes and then terminating the corresponding instances through the AWS Go SDK.

&lt;p&gt;To ensure the successful implementation of this solution in a Kubernetes (k8s) cluster, additional steps are required. Specifically, we need to create a &lt;strong&gt;ServiceAccount&lt;/strong&gt;, &lt;strong&gt;ClusterRole&lt;/strong&gt;, and &lt;strong&gt;ClusterRoleBinding&lt;/strong&gt; to properly assign the necessary permissions to the &lt;strong&gt;nodes-cleanup-cron-task&lt;/strong&gt;. These permissions are essential for the task to effectively query the relevant Kubernetes resources and perform the required actions.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;ServiceAccount&lt;/strong&gt; is responsible for providing an identity to the &lt;strong&gt;nodes-cleanup-cron-task&lt;/strong&gt;, allowing it to authenticate with the Kubernetes API server. The &lt;strong&gt;ClusterRole&lt;/strong&gt; defines a set of permissions that the task requires to interact with the necessary resources, in this case, the metrics API and other Kubernetes objects. Finally, the &lt;strong&gt;ClusterRoleBinding&lt;/strong&gt; connects the &lt;strong&gt;ServiceAccount&lt;/strong&gt; and &lt;strong&gt;ClusterRole&lt;/strong&gt;, granting the task the permissions specified in the &lt;strong&gt;ClusterRole&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;By establishing this set of permissions and associations, we ensure that the &lt;strong&gt;nodes-cleanup-cron-task&lt;/strong&gt; can access and query the metrics API and other Kubernetes resources, effectively identifying nodes with low CPU utilization and terminating instances using the AWS Go SDK.&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;At this stage, the majority of the code is complete, and you have the capability to deploy it on any cloud server with Kubernetes (k8s). This flexibility allows you to scale the application effortlessly, expanding the number of instances as needed to meet your specific requirements.&lt;/p&gt;

&lt;p&gt;One of the key advantages of the design lies in its termination tolerance. With the implemented safeguards to handle &lt;strong&gt;SIGTERM&lt;/strong&gt; and &lt;strong&gt;SIGINT&lt;/strong&gt; events, you can deploy spot instances without concerns about potential data loss. Even when spot instances are terminated unexpectedly, the application gracefully manages the data in the job queue and running tasks.&lt;/p&gt;

&lt;p&gt;By leveraging this termination tolerance feature, the application can handle spot instance terminations smoothly. This ensures that any pending tasks in the job queue are backed up safely and that the state of running tasks, which haven’t completed yet, is preserved. Consequently, you can rest assured that the integrity of your data and tasks will be maintained throughout the operation.&lt;/p&gt;

&lt;p&gt;Deploying the application with Kubernetes and taking advantage of termination tolerance empowers you to scale the Google Maps scraper efficiently, managing numerous instances to meet your data extraction needs effectively. The combination of Kubernetes and the termination tolerance design enhances the overall robustness and reliability of the application, allowing for seamless operation even in the dynamic and unpredictable cloud environment. If you have any questions regarding this article or any suggestions for future articles, please leave a comment below. Additionally, I am available for remote work or contracts, so please feel free to reach out to me via email.&lt;/p&gt;

</description>
      <category>googlemap</category>
      <category>crawler</category>
      <category>k8s</category>
      <category>javascript</category>
    </item>
    <item>
      <title>A Step-by-Step Guide to Building a Scalable Distributed Crawler for Scraping Millions of Top TikTok Profiles</title>
      <dc:creator>Tony Wang</dc:creator>
      <pubDate>Mon, 12 Jun 2023 04:54:05 +0000</pubDate>
      <link>https://dev.to/tonywangca/a-step-by-step-guide-to-building-a-scalable-distributed-crawler-for-scraping-millions-of-top-tiktok-profiles-2pk8</link>
      <guid>https://dev.to/tonywangca/a-step-by-step-guide-to-building-a-scalable-distributed-crawler-for-scraping-millions-of-top-tiktok-profiles-2pk8</guid>
      <description>&lt;p&gt;&lt;em&gt;Support me on &lt;a href="https://www.patreon.com/tonywang_dev" rel="noopener noreferrer"&gt;Patreon&lt;/a&gt; to write more tutorials like this!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;br&gt;
In this tutorial, we will walk you through the process of building a distributed crawler that can efficiently scrape millions of top TikTok profiles. Before we embark on this tutorial, it is crucial to have a solid grasp of fundamental concepts like &lt;strong&gt;web scraping&lt;/strong&gt;, &lt;strong&gt;the Golang programming language&lt;/strong&gt;, &lt;strong&gt;Docker, and Kubernetes (k8s)&lt;/strong&gt;. Additionally, being familiar with essential libraries such as Golang Colly for efficient web scraping and Golang Gin for building powerful APIs will greatly enhance your learning experience. By following this tutorial, you will gain insight into building a scalable and distributed system to extract profile information from TikTok.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Developing a Deeper Understanding of the Website You Want to Scrape.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before delving into writing the code, it is imperative to thoroughly analyze and understand the structure of TikTok’s website. To facilitate this process, we recommend using the convenient “&lt;strong&gt;Quick Javascript Switcher&lt;/strong&gt;” Chrome plugin, available &lt;a href="https://chrome.google.com/webstore/detail/quick-javascript-switcher/geddoclleiomckbhadiaipdggiiccfje" rel="noopener noreferrer"&gt;here&lt;/a&gt;. This invaluable tool allows you to disable and re-enable JavaScript with a single mouse-click. By doing so, we aim to optimize our scraping workflow, to increase efficiency, and to minimize costs by minimizing the reliance on JavaScript rendering.&lt;/p&gt;

&lt;p&gt;Upon disabling JavaScript using the plugin, we will focus our attention on TikTok’s profile page — the specific page we aim to scrape. Analyzing this page thoroughly will enable us to gain a comprehensive understanding of its underlying structure, crucial elements, and relevant data points. By examining the HTML structure, identifying key tags and attributes, and inspecting the network requests triggered during page loading, we can unravel the essential information we seek to extract.&lt;/p&gt;

&lt;p&gt;Furthermore, by scrutinizing the structure and behavior of TikTok’s profile page without the interference of JavaScript, we can ensure our scraper’s efficiency and effectiveness. Bypassing the rendering of JavaScript code allows us to directly target the necessary HTML elements and retrieve the desired data swiftly and accurately.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ARGykTHcs0MvsTS4kMjkVJw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ARGykTHcs0MvsTS4kMjkVJw.png" alt="the Network of Requests in TikTok Profile Page with JavaScript Enabled"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Imagine visiting a TikTok profile, such as “&lt;a href="https://www.tiktok.com/@linisflorez09" rel="noopener noreferrer"&gt;https://www.tiktok.com/@linisflorez09&lt;/a&gt;”, with JavaScript enabled. You would see approximately 300 requests being made, transferring roughly 10MB of data. Loading the entire page, complete with CSS style files, JavaScript files, images, and videos, takes roughly 5 seconds. &lt;strong&gt;Now, let’s put this into perspective: at about 300 requests and 10MB per profile, scraping one million profiles means roughly 300 million requests and about 10 terabytes of transfer, and scraping several million pushes the totals into the billions of requests and tens of terabytes.&lt;/strong&gt; And that’s not even factoring in the computing resources consumed by headless Chrome instances. Analyzing the page up front and avoiding JavaScript rendering wherever possible not only streamlines the scraping process, but also helps avoid unnecessary expenses, ultimately saving you, your boss, or your customers substantial amounts of money.&lt;/p&gt;

&lt;p&gt;It is crucial to acknowledge the monumental task at hand when dealing with such large-scale data scraping operations. By investing time and effort into analyzing the webpage upfront, we can discover innovative ways to extract the desired data while minimizing the number of requests, reducing data transfer size, and optimizing resource utilization. This strategic approach ensures that our scraping process is not only efficient but also cost-effective.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2746%2F1%2Aj0LBFNb-Wk0EjmMA0IKJSA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2746%2F1%2Aj0LBFNb-Wk0EjmMA0IKJSA.png" alt="TikTok Profile Page with JavaScript Disabled"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implementing the Code for Scraping TikTok Profiles&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When it comes to scraping TikTok’s profile page, the Golang built-in &lt;em&gt;net/http&lt;/em&gt; package provides a reliable solution for making HTTP requests. If you prefer a more straightforward approach without the need for callback features like &lt;em&gt;OnError&lt;/em&gt; and &lt;em&gt;OnResponse&lt;/em&gt; offered by Golang Colly, &lt;em&gt;net/http&lt;/em&gt; is a suitable choice.&lt;/p&gt;

&lt;p&gt;Below, you’ll find a code snippet to guide you in building your TikTok profile scraper. However, certain parts of the code are intentionally omitted to prevent potential misuse, such as sending an excessive number of requests to the TikTok platform. &lt;strong&gt;It’s crucial to adhere to ethical scraping practices and respect the platform’s terms of service&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;To extract information from HTML pages using CSS selectors in Golang, various tutorials and resources are available that demonstrate the use of libraries like goquery. Exploring these resources will provide you with comprehensive guidance on extracting specific data points from HTML pages.&lt;/p&gt;

&lt;p&gt;Please note that the provided code snippet is meant for reference. Ensure that you modify and augment it as per your requirements and adhere to responsible data scraping practices.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Discovering the Entry Points for Popular Videos and Profiles&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;By now, we have completed the TikTok profile scraper. However, there’s more to explore. How can we find millions of top profiles to scrape? That’s precisely what I’ll discuss next.&lt;/p&gt;

&lt;p&gt;If you visit the TikTok homepage at &lt;a href="https://www.tiktok.com/" rel="noopener noreferrer"&gt;https://www.tiktok.com/&lt;/a&gt;, you’ll notice four sections on the top left: &lt;em&gt;For You&lt;/em&gt;, &lt;em&gt;Following&lt;/em&gt;, &lt;em&gt;Explore&lt;/em&gt;, and &lt;em&gt;Live&lt;/em&gt;. Clicking on the &lt;em&gt;For You&lt;/em&gt; and &lt;em&gt;Explore&lt;/em&gt; sections will yield random popular videos each time. Hence, these two sections serve as entry points for us to discover a vast number of viral videos. Let’s analyze them individually:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Explore Page&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once we navigate to the explore page, it’s advisable to clean up the network section of DevTools for better clarity before proceeding with any further operations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3962%2F1%2AU8YGJWarysBzTRe75o7oiw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3962%2F1%2AU8YGJWarysBzTRe75o7oiw.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To ensure accurate filtering of requests, remember to select the &lt;em&gt;Fetch/XHR&lt;/em&gt; option. This selection will exclude any requests that are not made by JavaScript from the frontend. Once you have everything set up, proceed by scrolling down the &lt;em&gt;explore&lt;/em&gt; page. As you do so, TikTok will continue recommending viral videos based on factors such as your country and behavior. Simultaneously, keep a close eye on the network panel. Your goal is to locate the specific request containing the keyword “explore” among the numerous requests being made.&lt;/p&gt;

&lt;p&gt;Initially, it may not be immediately clear which exact request to focus on. Take your time and carefully inspect each request. We are looking for the request that returns essential information, such as author details, video content, view count, and other relevant data. Although the inspection process may require some patience, it is definitely worth the effort.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2076%2F1%2Arvy0HaNQwk2cgYu_RDW2Pg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2076%2F1%2Arvy0HaNQwk2cgYu_RDW2Pg.png" alt="The response of a request from explore page."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Continuing with the process, scroll down the explore page to explore more viral videos tailored to your country, behavior, and other factors. As you delve deeper, among the numerous requests being made, you will eventually come across a specific request containing the keyword &lt;em&gt;explore&lt;/em&gt;. This particular request is the one we are searching for to extract the desired data. To proceed, right-click on this request and select the option &lt;em&gt;Copy as cURL&lt;/em&gt;, as illustrated in the accompanying screenshot. By choosing this option, you can capture the request details in the form of a cURL command, which will serve as a valuable resource for further analysis and integration into your scraping workflow.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F4428%2F1%2Ag85tP6sIqaE4tqGO_dFjuQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F4428%2F1%2Ag85tP6sIqaE4tqGO_dFjuQ.png" alt="Scroll down the explore page until you find the correct request."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A9mx1hDaGkbVvBakcFkPxAg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A9mx1hDaGkbVvBakcFkPxAg.png" alt="Copy the request as cURL"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Using the previously identified request, we can import it into Postman to simulate the same request. Upon clicking the “Send” button, we should receive a similar response. This indicates that the request does not depend on a one-time token such as a &lt;em&gt;CSRF&lt;/em&gt; token and can be sent multiple times to obtain different results.&lt;/p&gt;

&lt;p&gt;To further explore the request, we will examine it in Postman. Within the Params and Headers panel, you have the option to uncheck various boxes and then click the &lt;em&gt;Send&lt;/em&gt; button. By doing so, you can verify if the response is successfully returned without including specific parameters. If the response is indeed returned, it implies that the corresponding parameter can be omitted in further development and requests. This step allows us to determine which parameters are required and which ones can be excluded for more efficient scraping.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2746%2F1%2AQuxn9DkOf0nvR0NXlONntw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2746%2F1%2AQuxn9DkOf0nvR0NXlONntw.png" alt="Import the cURL from above step to Postman, and click *Send* button"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Before diving into the code implementation, there is an essential piece of information we need to acquire — the category IDs. On the explore page, you will find a variety of categories displayed at the top, including popular ones like &lt;em&gt;Dance and Music&lt;/em&gt;, &lt;em&gt;Sports&lt;/em&gt;, and &lt;em&gt;Entertainment&lt;/em&gt;. These categories play a crucial role in targeting specific types of content for scraping.&lt;/p&gt;

&lt;p&gt;To proceed, we will follow a similar approach as mentioned earlier. Begin by cleaning up the network session to enhance clarity and ensure a focused analysis. Then, systematically click on each category button, one by one, and observe the value of the &lt;em&gt;categoryType&lt;/em&gt; parameter associated with each request. By examining the &lt;em&gt;categoryType&lt;/em&gt; values, we can identify the corresponding IDs for each category.&lt;/p&gt;

&lt;p&gt;This step is vital as it enables us to tailor our scraping process to specific categories of interest. By retrieving the relevant category IDs, we can precisely target the desired content and extract the necessary data. So, take your time to explore and document the category IDs, as it will significantly enhance the effectiveness of your scraping implementation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3796%2F1%2AWYcmkzQFjy3l_xLm5s_b8g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3796%2F1%2AWYcmkzQFjy3l_xLm5s_b8g.png" alt="Click the second section *Sports* and find the corresponding **categoryType** of **Sports**"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the end, after performing the necessary analysis, we will compile a comprehensive map that associates each category type with its unique ID:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// categoryTypeMap maps each categoryType ID on the explore page to its category name.
var categoryTypeMap = map[string]string{
	"1":  "comedy &amp;amp; drama",
	"2":  "dance &amp;amp; music",
	"3":  "relationship",
	"4":  "pet &amp;amp; nature",
	"5":  "lifestyle",
	"6":  "society",
	"7":  "fashion",
	"8":  "entertainment",
	"10": "informative",
	"11": "sport",
	"12": "auto",
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;The analysis of the explore page is now almost complete, and we can move on to the implementation. Several online tools convert JSON into Go struct definitions, which saves writing them by hand. One that I recommend is &lt;a href="https://mholt.github.io/json-to-go/" rel="noopener noreferrer"&gt;https://mholt.github.io/json-to-go/&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Paste the JSON response from the explore request into the tool and it generates the matching Go structs, so the response can be unmarshalled into typed values instead of being handled as raw JSON.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F4406%2F1%2APT7aTUShJqI6md_fv9u0ww.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F4406%2F1%2APT7aTUShJqI6md_fv9u0ww.png" alt="Copy the JSON response from the Postman response to any online *JSON to Go struct* website, and convert it to Go struct for later use."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I use two criteria to decide whether a TikTok profile is popular: the number of likes on its content and its follower count. Specifically, a profile counts as popular if any of its videos has at least 250K likes or if the profile has at least 10K followers.&lt;/p&gt;

&lt;p&gt;From each popular profile I extract its unique identifier (ID), which serves as the input for scraping profile details, and its follower count, which indicates audience reach. I also capture the “digg” count of its videos, which is the number of likes each video has received.&lt;/p&gt;

&lt;p&gt;These fields are the ones my project needs; keep any additional fields from the response that your own project requires.&lt;/p&gt;
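
&lt;p&gt;As a rough illustration of the two thresholds, the check could look like the sketch below. The &lt;code&gt;Profile&lt;/code&gt; struct, its field names, and the helper functions are assumptions made for this example; they are not TikTok's JSON keys and not necessarily the names used in the original gist.&lt;/p&gt;

```go
package main

import "fmt"

// Profile holds the fields this article keeps for each creator.
// The field names are illustrative assumptions, not TikTok's JSON keys.
type Profile struct {
	UniqueID      string // input for the profile scraper
	FollowerCount int    // audience size
	MaxDiggCount  int    // likes on the most-liked video seen so far
}

const (
	minDiggCount     = 250_000 // "at least 250K likes" on any single video
	minFollowerCount = 10_000  // "at least 10K followers"
)

// isPopular applies the two thresholds: meeting either one is enough.
func isPopular(p Profile) bool {
	return p.MaxDiggCount >= minDiggCount || p.FollowerCount >= minFollowerCount
}

// filterPopular returns only the profiles that pass isPopular.
func filterPopular(profiles []Profile) []Profile {
	var out []Profile
	for _, p := range profiles {
		if isPopular(p) {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	// Made-up sample data, only to exercise the filter.
	candidates := []Profile{
		{UniqueID: "creator_a", FollowerCount: 12_000, MaxDiggCount: 900},
		{UniqueID: "creator_b", FollowerCount: 800, MaxDiggCount: 300_000},
		{UniqueID: "creator_c", FollowerCount: 500, MaxDiggCount: 40},
	}
	for _, p := range filterPopular(candidates) {
		fmt.Println(p.UniqueID)
	}
}
```

&lt;p&gt;Keeping the thresholds as named constants makes it easy to tighten or loosen the definition of “popular” without touching the rest of the scraper.&lt;/p&gt;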


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;



&lt;p&gt;You can remove or customize any of the parameters inside the &lt;em&gt;getUrl&lt;/em&gt; function, based on the earlier analysis, to fine-tune the &lt;em&gt;explore&lt;/em&gt; request. In this demonstration I keep all the parameters as they are except &lt;em&gt;categoryType&lt;/em&gt;, which is left as a variable so that the scraper can cover every category.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Building an API service to monitor scraper stats&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most of the TikTok scraper is now complete. Because Redis serves as the message queue that stores the tasks, we need to monitor a few key statistics to confirm that the scraper is running smoothly: how many times each category has been scraped, the number of successes and failures, and the number of tasks remaining in the job queue. To do that, we build a service that exposes an API endpoint for querying these statistics at any time. The endpoint should require authentication so that only authorized users can read the stats.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AOunpTZMN5P4yDU2sSNi83Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AOunpTZMN5P4yDU2sSNi83Q.png" alt="Scraper statistics returned through API endpoint"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The final part of the code is the main function. To simplify deployment, we compile all the Go code into a single binary and package it into a Docker image. That raises a question: how can we deploy different services, such as the profile scraper, the explore scraper, and the API service, with different numbers of replicas?&lt;/p&gt;

&lt;p&gt;The answer is to run the &lt;em&gt;tiktok-crawler&lt;/em&gt; binary with different arguments. The &lt;code&gt;workerMap&lt;/code&gt; lists the available worker types, and adding an entry to it adds a new type of worker. For example, the profile scraper might need 20 workers and 3 replicas, while the explore scraper needs 40 workers and 4 replicas. The number of workers for each scraper is configurable and defaults to 20.&lt;/p&gt;
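
&lt;p&gt;The dispatch idea can be sketched roughly as follows. Only the name &lt;code&gt;workerMap&lt;/code&gt; and the default of 20 workers come from the article. The service names, the &lt;code&gt;-workers&lt;/code&gt; flag, and the stub worker bodies are hypothetical and stand in for the real scrapers.&lt;/p&gt;

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"sync"
)

// workerFunc is one unit of work; id identifies the goroutine running it.
type workerFunc func(id int)

// workerMap maps a service name (the first CLI argument) to its worker.
// The names and stub bodies are placeholders for the real scrapers.
var workerMap = map[string]workerFunc{
	"profile": func(id int) { fmt.Printf("profile worker %d started\n", id) },
	"explore": func(id int) { fmt.Printf("explore worker %d started\n", id) },
}

// parseArgs reads "SERVICE [-workers N]" and falls back to 20 workers.
func parseArgs(args []string) (string, int, error) {
	if len(args) == 0 {
		return "", 0, fmt.Errorf("usage: tiktok-crawler SERVICE [-workers N]")
	}
	service := args[0]
	if _, ok := workerMap[service]; !ok {
		return "", 0, fmt.Errorf("unknown service %q", service)
	}
	fs := flag.NewFlagSet(service, flag.ContinueOnError)
	workers := fs.Uint("workers", 20, "number of concurrent workers")
	if err := fs.Parse(args[1:]); err != nil {
		return "", 0, err
	}
	return service, int(*workers), nil
}

// run starts n goroutines of the selected worker and waits for all of them.
func run(service string, n int) {
	var wg sync.WaitGroup
	for i := range make([]struct{}, n) {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			workerMap[service](id)
		}(i)
	}
	wg.Wait()
}

func main() {
	args := os.Args[1:]
	if len(args) == 0 {
		// Demo default so the sketch runs without arguments.
		args = []string{"profile", "-workers", "3"}
	}
	service, n, err := parseArgs(args)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	run(service, n)
}
```

&lt;p&gt;With this layout the same image can back every deployment: each one only changes the command arguments, for instance &lt;code&gt;explore -workers 40&lt;/code&gt;, while the replica count is set in the orchestrator.&lt;/p&gt;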


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Building a Docker Image and Deploying it into a Kubernetes Cluster&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here is the Dockerfile that enables us to build the binary file and package it into a Docker image, which can then be deployed into a Kubernetes (k8s) cluster.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Before deploying the code into a Kubernetes (k8s) cluster, it’s advisable to test the functionality of both the code and the Docker image locally using Docker Compose. Docker Compose allows us to define and manage multi-container applications. In this case, we can use the provided &lt;em&gt;docker-compose.yml&lt;/em&gt; file.&lt;/p&gt;

&lt;p&gt;Running &lt;em&gt;docker-compose up --scale tiktok-profile=3 --scale tiktok-server=1 --scale tiktok-explore=5 -d&lt;/em&gt; launches the requested number of instances of each service in the background. Adjust the &lt;em&gt;--scale&lt;/em&gt; values to raise or lower the number of replicas of &lt;em&gt;tiktok-profile&lt;/em&gt;, &lt;em&gt;tiktok-server&lt;/em&gt;, and &lt;em&gt;tiktok-explore&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Testing the code and Docker image locally with Docker Compose allows for a comprehensive evaluation of the application’s behavior and performance before deploying it into the production Kubernetes cluster. It helps ensure that the application functions as expected and can handle the desired scaling requirements.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;After executing the provided command, you will observe that the specified number of profile scrapers, explore scrapers, and API servers are successfully launched and operational.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F5648%2F1%2AfEyANNQPS6H3mHY6zTSIZQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F5648%2F1%2AfEyANNQPS6H3mHY6zTSIZQ.png" alt="Running scraper services locally with docker-compose"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deploying the Scraper to Kubernetes Cluster&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Everything is now ready for deployment to a Kubernetes (k8s) cluster. Below is a sample k8s deployment file for reference. You can customize the number of replicas for the scrapers and adjust the arguments of the scraper command as needed. Note that the &lt;em&gt;alb.ingress.kubernetes.io/subnets&lt;/em&gt; annotation on the Ingress must list the subnets that were associated with your k8s cluster when it was created; otherwise the load balancer will not be placed in the right network.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;To reduce the cost of running the scraper, use &lt;em&gt;Spot Instances&lt;/em&gt; when adding a new node group. Spot Instances can be priced up to 90% lower than On-Demand instances, with the trade-off that they can be reclaimed at short notice. Because the scraper is stateless and can be terminated at any time, it is a good fit for them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ARyRtKo3t6QNl2dApeGZiCQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ARyRtKo3t6QNl2dApeGZiCQ.png" alt="Set compute and scaling configuration for new node group"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the node group has been created and its nodes report the &lt;em&gt;Ready&lt;/em&gt; state, deploy the scraper with &lt;strong&gt;&lt;em&gt;kubectl apply -f deployment.yaml&lt;/em&gt;&lt;/strong&gt;. This applies the configuration in the deployment file to the cluster, and Kubernetes then brings up the desired number of replicas for each scraper service.&lt;/p&gt;

&lt;p&gt;One advantage of Kubernetes is how easily the number of replicas can be changed. Update the replica count in the deployment configuration and apply it again to scale the scraper up or down with the workload.&lt;/p&gt;

&lt;p&gt;With &lt;em&gt;kubectl&lt;/em&gt; you can manage every scraper service in the cluster and keep resource usage in line with the workload.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2578%2F1%2AIvUfdv7AKgtY_o3Mjb36Iw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2578%2F1%2AIvUfdv7AKgtY_o3Mjb36Iw.png" alt="Nodes state page in AWS Kubernetes cluster"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In my experience with the criteria above, the scraper initially collects up to 1 million records per day. Over time the rate gradually drops to a few thousand records per day, because much of the popular content on the explore page was created months ago: as we scrape more profiles we cover most of the popular ones, and new viral content becomes harder to find.&lt;/p&gt;

&lt;p&gt;When that happens, consider pausing the scraper for a few weeks or longer so that new viral content has time to accumulate. Restarting it afterwards keeps costs down, because the scraper then spends its time on the latest popular profiles and videos.&lt;/p&gt;

&lt;p&gt;With the successful completion of the TikTok scraper and its deployment in a distributed system using Kubernetes, we have achieved a robust and scalable solution. The combination of scraping techniques, data processing, and deployment infrastructure has allowed us to harness the full potential of TikTok’s platform. &lt;strong&gt;&lt;em&gt;If you have any questions regarding this article or any suggestions for future articles, I encourage you to leave a comment below. Additionally, I am available for remote work or contracts, so please feel free to reach out to me via email.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>tiktok</category>
      <category>crawler</category>
      <category>go</category>
      <category>k8s</category>
    </item>
  </channel>
</rss>
