<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Devil Scrapes</title>
    <description>The latest articles on DEV Community by Devil Scrapes (@devil_scrapes).</description>
    <link>https://dev.to/devil_scrapes</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3960872%2Ffa930ad0-5ebc-4ca7-b894-6bc6fb3e2b40.png</url>
      <title>DEV Community: Devil Scrapes</title>
      <link>https://dev.to/devil_scrapes</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/devil_scrapes"/>
    <language>en</language>
    <item>
      <title>Equity Crowdfunding Leads: scrape 4,800+ Wefunder founders for $5/1K</title>
      <dc:creator>Devil Scrapes</dc:creator>
      <pubDate>Sun, 31 May 2026 12:08:13 +0000</pubDate>
      <link>https://dev.to/devil_scrapes/equity-crowdfunding-leads-scrape-4800-wefunder-founders-for-51k-121b</link>
      <guid>https://dev.to/devil_scrapes/equity-crowdfunding-leads-scrape-4800-wefunder-founders-for-51k-121b</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Quick answer:&lt;/strong&gt; There is no unified API for Wefunder, Republic, or StartEngine. An &lt;em&gt;equity crowdfunding leads&lt;/em&gt; scraper collects currently-raising and recently-funded campaign data — founder names, company taglines, raise progress, pre-money valuations — from all three platforms and returns them as one normalized dataset. The Apify Actor below does it for &lt;strong&gt;$0.005 per row&lt;/strong&gt; (~$5.05 per 1,000), with the TLS fingerprinting, proxy rotation, and per-source parsing handled for you.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Wefunder alone lists 4,800+ currently-raising companies — founder names, taglines, raise totals, and pre-money valuations in one JSON payload. Republic has a trending carousel. StartEngine has an XML sitemap of 98 offering slugs. None has a download button; none shares a schema.&lt;/p&gt;

&lt;p&gt;If you're a VC scout, an SDR targeting founders, or an analyst tracking what's raising in climate versus fintech, you're opening three browser tabs and copy-pasting. Here's what it takes to do that programmatically — and how I compressed it to one API call.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is equity crowdfunding? 🔎
&lt;/h2&gt;

&lt;p&gt;Equity crowdfunding under Regulation CF lets any US startup raise up to $5 million per year from the general public — not just accredited investors. The three dominant platforms are &lt;strong&gt;Wefunder&lt;/strong&gt; (largest by volume), &lt;strong&gt;Republic&lt;/strong&gt; (curated campaigns), and &lt;strong&gt;StartEngine&lt;/strong&gt; (heavy on CPG and consumer brands).&lt;/p&gt;

&lt;p&gt;Each platform requires issuers to file a Form C with the SEC before opening a round, so every active campaign has a verified company name, founding team, financial disclosures, and valuation on public record. That's the dataset: comprehensive, legally disclosed, and — until this Actor — only accessible by visiting three separate sites with three separate UX patterns.&lt;/p&gt;

&lt;h2&gt;
  
  
  Does Wefunder have an API? 📡
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;No public API.&lt;/strong&gt; As of 2026, none of Wefunder, Republic, or StartEngine publishes an official data API or bulk export. Wefunder's SPA calls an internal JSON endpoint (&lt;code&gt;/-/companies/explore&lt;/code&gt;) returning full campaign payloads — but it's undocumented, inspects your TLS fingerprint, and sits behind Cloudflare. Republic's backend GraphQL at &lt;code&gt;api.republic.com&lt;/code&gt; rejects unauthenticated POSTs from datacenter IPs. StartEngine's offering detail pages require clearing a JavaScript-gated challenge first.&lt;/p&gt;

&lt;p&gt;This is exactly why a hosted Actor earns its keep over a three-line &lt;code&gt;requests&lt;/code&gt; snippet.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the data looks like
&lt;/h2&gt;

&lt;p&gt;Each row is a flat, typed record. Here is a real one, RISE Robotics on Wefunder as of 2026-05-16:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"wefunder"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"campaign_slug"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"riserobotics"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"company_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"RISE Robotics"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tagline"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Electrifying heavy machines"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"industry"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"location"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"MA"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"founders"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Hiten Sonpal"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"website_url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"target_amount_usd"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"raised_amount_usd"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;17448682.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"num_investors"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;417&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"valuation_usd"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;62100000.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"revenue_usd"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"funding_stage"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"raising"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"campaign_url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://wefunder.com/riserobotics"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"scraped_at"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-05-16T13:40:00.000Z"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Sixteen fields, Pydantic-validated before they hit your dataset. &lt;code&gt;valuation_usd&lt;/code&gt; comes from Wefunder's &lt;code&gt;terms.nb&lt;/code&gt; shorthand (&lt;code&gt;"$62.1M"&lt;/code&gt;), parsed into a float automatically. Republic and StartEngine rows land with the same shape; monetary fields are &lt;code&gt;null&lt;/code&gt; there because that data is client-rendered (v2 plan — more below).&lt;/p&gt;

&lt;h2&gt;
  
  
  The naive approach (and why it falls apart) 🔧
&lt;/h2&gt;

&lt;p&gt;The obvious move: open DevTools, find the XHR, replay it with &lt;code&gt;requests.get()&lt;/code&gt;. It breaks fast, for a different reason on each platform.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Wefunder.&lt;/strong&gt; The &lt;code&gt;/-/companies/explore&lt;/code&gt; endpoint checks your TLS ClientHello fingerprint before it answers. Clients built on Python's stdlib &lt;code&gt;ssl&lt;/code&gt; module, &lt;code&gt;requests&lt;/code&gt; and &lt;code&gt;httpx&lt;/code&gt; included, present a handshake that looks nothing like a real browser's: the JA3/JA4 fingerprint reads as a script, and you hit a Cloudflare challenge instead of the JSON. We run &lt;code&gt;curl-cffi&lt;/code&gt; with &lt;code&gt;impersonate="chrome131"&lt;/code&gt;, which reproduces the full Chrome 131 TLS handshake, extension order, ALPN, and HTTP/2 SETTINGS frame, so at the TLS layer the connection is indistinguishable from a browser's.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Republic.&lt;/strong&gt; The &lt;code&gt;republic.com/companies&lt;/code&gt; page is SPA-rendered; the SSR shell carries only a ~10-item carousel of trending campaign links, and the backend GraphQL at &lt;code&gt;api.republic.com&lt;/code&gt; rejects unauthenticated POSTs from datacenter IPs. We thread Apify residential proxies on every request so the connection arrives from a residential exit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;StartEngine.&lt;/strong&gt; Their &lt;code&gt;explore&lt;/code&gt; page is fully client-rendered. &lt;code&gt;sitemap-private-offerings.xml&lt;/code&gt; carries the active slug list (98 entries as of 2026-05-16) — the only unauthenticated surface; detail pages return a bot-challenge body to non-browser clients. v1 emits slug + company name from the sitemap; Camoufox full-render is planned for v2.&lt;/p&gt;

&lt;p&gt;We retry with exponential backoff (base 2 s, doubling, capped at 30 s, max 5 attempts) and honour &lt;code&gt;Retry-After&lt;/code&gt;. On &lt;code&gt;429&lt;/code&gt; or &lt;code&gt;503&lt;/code&gt; we rotate the proxy session ID — fresh exit IP, fresh cookie jar. Partial success surfaces as an explicit status message; we never return an empty dataset under a green status. One source failing does not kill the run; all three failing exits non-zero with a clear error.&lt;/p&gt;
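
&lt;p&gt;The retry schedule above can be sketched in a few lines. This is a minimal illustration, not the Actor's code: &lt;code&gt;backoff_delays&lt;/code&gt; and &lt;code&gt;next_delay&lt;/code&gt; are hypothetical names, and the sketch assumes that five attempts means four waits and that &lt;code&gt;Retry-After&lt;/code&gt; arrives in its seconds form.&lt;/p&gt;

```python
from typing import Optional

BASE_DELAY_S = 2.0   # first wait, per the schedule described above
MAX_DELAY_S = 30.0   # cap on any single wait
MAX_ATTEMPTS = 5     # total tries, including the first one


def backoff_delays(attempts: int = MAX_ATTEMPTS) -> list:
    """Waits between consecutive attempts: doubling from the base, capped."""
    # Assumption: N attempts leave N - 1 gaps to wait in.
    return [min(BASE_DELAY_S * 2 ** i, MAX_DELAY_S) for i in range(attempts - 1)]


def next_delay(retry_index: int, retry_after: Optional[str] = None) -> float:
    """Prefer the server's Retry-After (seconds form) over the computed wait."""
    computed = min(BASE_DELAY_S * 2 ** retry_index, MAX_DELAY_S)
    if retry_after is not None:
        try:
            return max(float(retry_after), 0.0)
        except ValueError:
            pass  # the HTTP-date form is not handled in this sketch
    return computed
```

&lt;p&gt;With these numbers the 30 s cap only comes into play from the fifth wait onward, so a five-attempt run waits 2, 4, 8, and 16 seconds unless the server asks for something else.&lt;/p&gt;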

&lt;h2&gt;
  
  
  The Actor 🛠️
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://apify.com/DevilScrapes/equity-crowdfunding-leads" rel="noopener noreferrer"&gt;Equity Crowdfunding Leads&lt;/a&gt;&lt;/strong&gt; on the Apify Store.&lt;/p&gt;

&lt;p&gt;Open it in the Apify Console and click Start, or call it with the &lt;a href="https://docs.apify.com/api/client/python/" rel="noopener noreferrer"&gt;apify-client&lt;/a&gt; Python SDK:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;apify_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ApifyClient&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ApifyClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_APIFY_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;actor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DevilScrapes/equity-crowdfunding-leads&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;run_input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sources&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wefunder&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maxPerSource&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;statusFilter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;active&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;industryFilter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fintech&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;useProxy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;defaultDatasetId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;iterate_items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;company_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;raised_amount_usd&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;founders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key input parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;sources&lt;/code&gt;&lt;/strong&gt; — any combination of &lt;code&gt;wefunder&lt;/code&gt;, &lt;code&gt;republic&lt;/code&gt;, &lt;code&gt;startengine&lt;/code&gt;, or empty (= all three). Default: all three.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;maxPerSource&lt;/code&gt;&lt;/strong&gt; — hard cap per platform, 1–500. Default: 50.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;statusFilter&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;"active"&lt;/code&gt;, &lt;code&gt;"funded"&lt;/code&gt;, or &lt;code&gt;"all"&lt;/code&gt;. Wefunder-native; Republic and StartEngine emit currently-listed slugs regardless.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;industryFilter&lt;/code&gt;&lt;/strong&gt; — optional case-insensitive substring matched against &lt;code&gt;tagline&lt;/code&gt; or &lt;code&gt;industry&lt;/code&gt;. Pass &lt;code&gt;"climate"&lt;/code&gt; for climate-tech campaigns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;useProxy&lt;/code&gt;&lt;/strong&gt; — default &lt;code&gt;true&lt;/code&gt;. Wefunder and Republic fingerprint datacenter IPs and block plain exits; leave it on.&lt;/li&gt;
&lt;/ul&gt;
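
&lt;p&gt;To make the &lt;code&gt;industryFilter&lt;/code&gt; semantics concrete, here is a rough client-side equivalent of "case-insensitive substring matched against &lt;code&gt;tagline&lt;/code&gt; or &lt;code&gt;industry&lt;/code&gt;". The function name &lt;code&gt;matches_industry&lt;/code&gt; and the second sample row are made up for illustration; the Actor's own matching may differ in detail.&lt;/p&gt;

```python
from typing import Optional


def matches_industry(row: dict, industry_filter: Optional[str]) -> bool:
    """True when the filter is empty or appears in tagline or industry."""
    if not industry_filter:
        return True  # no filter supplied: keep every row
    needle = industry_filter.casefold()
    # Either field can be null in the dataset, so fall back to "".
    haystacks = (row.get("tagline") or "", row.get("industry") or "")
    return any(needle in text.casefold() for text in haystacks)


rows = [
    # Taken from the sample record shown earlier in the article.
    {"company_name": "RISE Robotics", "tagline": "Electrifying heavy machines", "industry": None},
    # Hypothetical row, invented for this example.
    {"company_name": "Example Co", "tagline": "Payments for freelancers", "industry": "Fintech"},
]
kept = [r["company_name"] for r in rows if matches_industry(r, "fintech")]
```

&lt;p&gt;Because &lt;code&gt;industry&lt;/code&gt; is often &lt;code&gt;null&lt;/code&gt;, a match usually comes from the tagline, so pick filter terms founders would actually put in a one-line pitch.&lt;/p&gt;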

&lt;h2&gt;
  
  
  What you'd actually use this for 💡
&lt;/h2&gt;

&lt;p&gt;Four scenarios from the README and spec:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VC scout pipeline.&lt;/strong&gt; Schedule a weekly Wefunder-only run, pull all active campaigns, join on &lt;code&gt;founders[]&lt;/code&gt;, enrich with LinkedIn. A live feed of sub-Series-A founders without waiting for Crunchbase. Scope it with &lt;code&gt;industryFilter: "robotics"&lt;/code&gt; for your thesis vertical.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SDR founder outreach.&lt;/strong&gt; Founders in active crowdfunding campaigns are fundraising — and buying. Filter by &lt;code&gt;statusFilter: "active"&lt;/code&gt; and &lt;code&gt;industryFilter: "fintech"&lt;/code&gt;, drop &lt;code&gt;founders[]&lt;/code&gt; into Apollo or Clay, and reach them while they're in motion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Crowdfunding analytics.&lt;/strong&gt; Schedule daily runs, persist to BigQuery or S3, and track &lt;code&gt;raised_amount_usd&lt;/code&gt; trajectories. Wefunder publishes pre-money valuations Crunchbase never sees — the &lt;code&gt;valuation_usd&lt;/code&gt; distribution by sector is a clean dataset for a leaderboard or research report.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Form C deep dives.&lt;/strong&gt; This Actor surfaces &lt;code&gt;campaign_slug&lt;/code&gt; and &lt;code&gt;campaign_url&lt;/code&gt;. &lt;code&gt;sec-edgar-filings-scraper&lt;/code&gt; (sibling Actor) takes it from there — issuer CIK on EDGAR, Form C / Form C-AR PDFs, audited revenue, SAFE terms. Two Actors, one Reg CF pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pricing — exact numbers 💰
&lt;/h2&gt;

&lt;p&gt;Pay-per-event. You pay for rows you receive, nothing for rows that don't come back.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Event&lt;/th&gt;
&lt;th&gt;Price&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Actor start (once per run)&lt;/td&gt;
&lt;td&gt;$0.05&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Per campaign row emitted&lt;/td&gt;
&lt;td&gt;$0.005&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Run size&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;50 rows (default cap, one source)&lt;/td&gt;
&lt;td&gt;$0.30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;150 rows (50/source × 3)&lt;/td&gt;
&lt;td&gt;$0.80&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1,000 rows&lt;/td&gt;
&lt;td&gt;$5.05&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5,000 rows&lt;/td&gt;
&lt;td&gt;$25.05&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10,000 rows&lt;/td&gt;
&lt;td&gt;$50.05&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For context, the nearest alternative — scraping Crunchbase via a third-party Apify Actor — typically runs around $30 per 1,000 rows, while covering fewer than 30% of Wefunder campaigns and zero Republic trending campaigns. This Actor is roughly 6× cheaper and sources from the campaigns directly, not from a derived database. Apify's $5 free trial credit covers your first ~990 rows with no credit card.&lt;/p&gt;
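
&lt;p&gt;The pricing tables reduce to one line of arithmetic: a fixed start fee plus a per-row fee. The sketch below reproduces the table figures and the ~990-row estimate; &lt;code&gt;run_cost&lt;/code&gt; and &lt;code&gt;rows_for_budget&lt;/code&gt; are illustrative names, and it assumes the whole budget is spent in a single run.&lt;/p&gt;

```python
from decimal import Decimal

START_FEE = Decimal("0.05")   # charged once per run
ROW_FEE = Decimal("0.005")    # charged per campaign row emitted


def run_cost(rows: int) -> Decimal:
    """Cost in USD of one run that emits `rows` rows."""
    return START_FEE + ROW_FEE * rows


def rows_for_budget(budget_usd: str) -> int:
    """Largest row count a single run can emit within the budget."""
    remaining = Decimal(budget_usd) - START_FEE
    return max(int(remaining / ROW_FEE), 0)
```

&lt;p&gt;&lt;code&gt;Decimal&lt;/code&gt; keeps the cents exact; splitting the same rows across several runs costs an extra $0.05 per run.&lt;/p&gt;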

&lt;h2&gt;
  
  
  The part worth knowing before you build on this 🔍
&lt;/h2&gt;

&lt;p&gt;Wefunder's internal &lt;code&gt;/-/companies/explore&lt;/code&gt; endpoint is the same one the SPA calls on every page load — unauthenticated, returning full JSON payloads including pre-money valuation encoded as &lt;code&gt;terms.nb&lt;/code&gt; dollar shorthand (&lt;code&gt;"$62.1M"&lt;/code&gt;, &lt;code&gt;"$700K"&lt;/code&gt;, &lt;code&gt;"$1.2B"&lt;/code&gt;). This Actor parses that shorthand with multipliers K=1e3, M=1e6, B=1e9; malformed values emit &lt;code&gt;null&lt;/code&gt; rather than crashing.&lt;/p&gt;

&lt;p&gt;The design point worth knowing: the scraper doesn't infer valuations — it reads the exact payload the website reads and converts the display string to a typed float. The Pydantic v2 &lt;code&gt;ResultRow&lt;/code&gt; model enforces the schema on every row before write, so type surprises are caught at write time, not at analysis time.&lt;/p&gt;
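
&lt;p&gt;For readers who want to apply the same conversion to their own data, the shorthand parsing can be approximated like this. It is a sketch under stated assumptions, not the Actor's parser: &lt;code&gt;parse_dollar_shorthand&lt;/code&gt; is a hypothetical name, and it assumes every valid value has a leading &lt;code&gt;$&lt;/code&gt; and exactly one K, M, or B suffix.&lt;/p&gt;

```python
import re
from typing import Optional

MULTIPLIERS = {"K": 1e3, "M": 1e6, "B": 1e9}
# Assumed shape: dollar sign, digits with optional decimals, one suffix letter.
SHORTHAND = re.compile(r"^\$(\d+(?:\.\d+)?)([KMB])$")


def parse_dollar_shorthand(text: Optional[str]) -> Optional[float]:
    """Convert "$62.1M"-style strings to a float; None for anything malformed."""
    if not isinstance(text, str):
        return None
    match = SHORTHAND.match(text.strip().upper())
    if match is None:
        return None  # malformed input becomes null instead of raising
    number, suffix = match.groups()
    return float(number) * MULTIPLIERS[suffix]
```

&lt;p&gt;Returning &lt;code&gt;None&lt;/code&gt; rather than raising mirrors the behaviour described above: one odd display string should cost you a single null field, not the whole run.&lt;/p&gt;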

&lt;h2&gt;
  
  
  Limitations (the honest list) 🚧
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Republic and StartEngine return sparse data in v1.&lt;/strong&gt; Republic surfaces ~10 trending campaign slugs per run from the SSR shell; StartEngine emits slug + company name from the public sitemap. On both, raised amount, valuation, and investor count are client-rendered and stay &lt;code&gt;null&lt;/code&gt;. For the richest rows, run Wefunder-only (&lt;code&gt;sources: ["wefunder"]&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No historical archive.&lt;/strong&gt; Every run is a fresh snapshot of currently-listed campaigns. Schedule runs and export to your own storage; Apify's default run-scoped storage is purged after 7 days on the free plan.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Status filter is Wefunder-native.&lt;/strong&gt; &lt;code&gt;funded&lt;/code&gt; and &lt;code&gt;all&lt;/code&gt; only meaningfully change Wefunder results; Republic and StartEngine always emit their current listing surface regardless.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No investor identity data.&lt;/strong&gt; Who invested and at what amount is private. This Actor emits only public-facing campaign metadata.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No SEC EDGAR Form C parsing.&lt;/strong&gt; Revenue, expenses, share count, and SAFE terms from Form C filings are in scope for &lt;code&gt;sec-edgar-filings-scraper&lt;/code&gt;, not this Actor.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Is scraping Wefunder, Republic, and StartEngine legal?&lt;/strong&gt;&lt;br&gt;
All three host public-facing marketing pages built to attract investors. This Actor reads only what the public UI exposes: no authentication is bypassed, no private investor data is collected, and the request rate stays well below that of a human browsing the site. Form C filings are SEC-required public disclosures. Check your own jurisdiction and use case; nothing here is legal advice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does Wefunder, Republic, or StartEngine have an official API I should use instead?&lt;/strong&gt;&lt;br&gt;
No. As of 2026, none of the three offers a public data API or bulk export endpoint. Wefunder operates an internal JSON endpoint the SPA uses; Republic and StartEngine surface their data via their web UIs (or, for StartEngine, a sitemap).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I export the dataset to Google Sheets or a data warehouse?&lt;/strong&gt;&lt;br&gt;
Yes — export CSV, JSON, Excel, or XML from the Apify Console &lt;strong&gt;Export&lt;/strong&gt; button after the run, webhook the dataset on &lt;code&gt;ACTOR.RUN.SUCCEEDED&lt;/code&gt; into Make, Zapier, or n8n, or pull it via the &lt;a href="https://docs.apify.com/api/v2" rel="noopener noreferrer"&gt;Apify API&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why does the Actor cost less than Crunchbase scrapers?&lt;/strong&gt;&lt;br&gt;
Different source, lower extraction cost. Crunchbase scraping hits a richer, more heavily defended site with far more fields. This Actor targets three smaller platforms and returns a narrower, well-defined schema. The 6× difference reflects the actual engineering complexity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;Live on the Apify Store: &lt;strong&gt;&lt;a href="https://apify.com/DevilScrapes/equity-crowdfunding-leads" rel="noopener noreferrer"&gt;apify.com/DevilScrapes/equity-crowdfunding-leads&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Free $5 trial credit, no credit card. Run the defaults and you'll have up to 150 equity-crowdfunding leads (50 per source) across all three platforms in a couple of minutes. Need a fourth platform (NextSeed, MicroVentures), a field you wish was populated, or a parser that broke after a site restructure? Drop it in the comments. The devil's in the data; I ship based on what people actually find there.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Further reading:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://wefunder.com/faq" rel="noopener noreferrer"&gt;Wefunder — How equity crowdfunding works&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.sec.gov/education/smallbusiness/marketplaces/regulationcrowdfunding" rel="noopener noreferrer"&gt;SEC — Regulation Crowdfunding overview&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.apify.com/api/client/python/" rel="noopener noreferrer"&gt;Apify Python Client documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Built by &lt;a href="https://apify.com/DevilScrapes" rel="noopener noreferrer"&gt;Devil Scrapes&lt;/a&gt; — Apify Actors for builders who want the data, not the drama. Pay-per-event, honest pricing, no junk fields.&lt;/em&gt; 😈&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>python</category>
      <category>apify</category>
      <category>data</category>
    </item>
    <item>
      <title>Discogs Price Scraper: pull every marketplace listing as clean JSON</title>
      <dc:creator>Devil Scrapes</dc:creator>
      <pubDate>Sun, 31 May 2026 12:02:58 +0000</pubDate>
      <link>https://dev.to/devil_scrapes/discogs-price-scraper-pull-every-marketplace-listing-as-clean-json-4l1g</link>
      <guid>https://dev.to/devil_scrapes/discogs-price-scraper-pull-every-marketplace-listing-as-clean-json-4l1g</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Quick answer:&lt;/strong&gt; The &lt;a href="https://www.discogs.com" rel="noopener noreferrer"&gt;Discogs&lt;/a&gt; marketplace is the world's largest secondary market for vinyl, CDs, and cassettes — but its free REST API exposes only a &lt;code&gt;lowest_price&lt;/code&gt; aggregate and a &lt;code&gt;num_for_sale&lt;/code&gt; count. To get the per-listing detail — who is selling, at what asking price, in what condition, and shipping from where — you have to scrape the Cloudflare-protected marketplace HTML. The Apify Actor at &lt;a href="https://apify.com/DevilScrapes/discogs-sold-price" rel="noopener noreferrer"&gt;apify.com/DevilScrapes/discogs-sold-price&lt;/a&gt; does exactly that, joining the public REST API with paginated marketplace HTML into a single flat dataset. Price: &lt;strong&gt;$0.005 per row&lt;/strong&gt;, ~$5.05 per 1,000 results.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you have ever priced a record collection for resale, you know the workflow: open Discogs, search the release, squint at 25 listings per page, mentally adjust for condition grade and seller location. Multiply by 200 records and a weekend is gone. The data you want (every asking price, every seller rating, every condition grade) is sitting on a public web page, but Discogs exposes none of that per-listing detail in any published API, and the page sits behind Cloudflare.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Discogs? 🎵
&lt;/h2&gt;

&lt;p&gt;Discogs is a community-built catalog of released music and an active secondary marketplace. Collectors use it to catalog inventory, buy and sell physical media, and settle "what is this copy worth" questions. The catalog side is free and has a REST API. The marketplace side — the live order book of who is selling what, at what price — is a Cloudflare-protected web app with no programmatic export.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://www.discogs.com/developers/" rel="noopener noreferrer"&gt;Discogs API&lt;/a&gt; exposes release metadata (title, artist, year, format, genres), community stats, and exactly two market-signal fields: &lt;code&gt;lowest_price&lt;/code&gt; and &lt;code&gt;num_for_sale&lt;/code&gt;. That is the entire API surface for marketplace pricing. Every column past those two — asking price by listing, condition grade, seller country, seller rating — lives only in the HTML.&lt;/p&gt;

&lt;h2&gt;
  
  
  Does Discogs have an API for marketplace listings?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;No.&lt;/strong&gt; The Discogs REST API at &lt;code&gt;api.discogs.com&lt;/code&gt; exposes a &lt;code&gt;GET /marketplace/stats/{release_id}&lt;/code&gt; endpoint that returns the floor ask (&lt;code&gt;lowest_price&lt;/code&gt;) and the total listing count (&lt;code&gt;num_for_sale&lt;/code&gt;). It does not return individual listings. The per-listing data — who is selling, at what price, with what media and sleeve grade, ships-from where — is available only on the paginated HTML page at &lt;code&gt;www.discogs.com/sell/release/{release_id}&lt;/code&gt;. That page is served through Cloudflare and does not have a documented API counterpart.&lt;/p&gt;

&lt;p&gt;This Actor fills that gap: it joins the two public REST endpoints (release metadata + marketplace stats) with the paginated listing HTML into one flat, typed row per listing.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the data looks like 📤
&lt;/h2&gt;

&lt;p&gt;Each marketplace listing comes back as one flat row. Here is a real one, from release ID 249504 (Rick Astley, &lt;em&gt;Never Gonna Give You Up&lt;/em&gt;, 1987 UK 7"):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"row_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"listing"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"release_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;249504&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"release_title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Never Gonna Give You Up"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"artist"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Rick Astley"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"year"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1987&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"country"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"UK"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"format_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Vinyl"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"format_descriptions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"7&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"45 RPM"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Single"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Stereo"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"genres"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Electronic"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Pop"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"master_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;96559&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"release_url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://www.discogs.com/release/249504"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"listing_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3761251765&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"listing_url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://www.discogs.com/sell/item/3761251765"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"asking_price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"asking_currency"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"GBP"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"shipping_text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"+£15.00"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"condition_media"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Very Good Plus (VG+)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"condition_sleeve"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Generic"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"seller_username"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Ronan266"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"seller_rating_pct"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;100.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"seller_rating_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;35&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"seller_country"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"United Kingdom"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"stats_lowest_price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"stats_lowest_currency"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"stats_num_for_sale"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"stats_blocked_from_sale"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"scraped_at"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-05-16T12:00:00.000Z"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the full 27-field schema. With &lt;code&gt;includeStatsRow&lt;/code&gt; on (default), the Actor also emits one aggregate row per release — same schema, &lt;code&gt;row_type: "stats"&lt;/code&gt;, &lt;code&gt;stats_*&lt;/code&gt; fields populated, listing fields null — so you can &lt;code&gt;GROUP BY release_id&lt;/code&gt; for both per-listing detail and the market floor. Pydantic v2 validates every row before it is pushed: ISO-8601 timestamps, proper nulls, typed numerics — not stringified soup.&lt;/p&gt;

&lt;h2&gt;
  
  
  The naive approach (and why it falls apart) ⚠️
&lt;/h2&gt;

&lt;p&gt;Every developer who finds the Discogs marketplace HTML eventually tries the same thing: open Chrome DevTools, find the request for &lt;code&gt;/sell/release/{id}&lt;/code&gt;, replay it with &lt;code&gt;requests.get()&lt;/code&gt;. Here is what breaks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloudflare.&lt;/strong&gt; The &lt;code&gt;www.discogs.com/sell/release/&lt;/code&gt; path sits behind Cloudflare's bot-management layer. Python's stdlib SSL stack, &lt;code&gt;httpx&lt;/code&gt;, and plain &lt;code&gt;requests&lt;/code&gt; all fail with a 403 and a JS challenge: the response looks like HTML but holds no listing data. We handle it with &lt;code&gt;curl-cffi&lt;/code&gt; impersonation: &lt;code&gt;AsyncSession(impersonate="chrome131")&lt;/code&gt; replays a real Chrome 131 TLS ClientHello, ALPN order, and HTTP/2 SETTINGS frame. Before any listing page we run a single homepage warm-up to seed the Cloudflare cookies, after which pages return 200 cleanly. We thread Apify datacenter proxies (&lt;code&gt;BUYPROXIES94952&lt;/code&gt;) to give the session a clean exit IP, verified in cloud QA on real Apify datacenter IPs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rate limits.&lt;/strong&gt; Discogs documents 60 req/min for anonymous API traffic; the marketplace HTML surface has an unwritten ceiling in the same range. We throttle at one request per 1.5 seconds (~40 req/min) across all calls to stay under both, honour &lt;code&gt;Retry-After&lt;/code&gt; on 429, and retry with exponential backoff on 408/429/503 and network errors (five attempts). When a release fails after all retries we log a &lt;code&gt;WARNING&lt;/code&gt;, emit whatever rows we collected, and continue, so one bad ID never kills the run.&lt;/p&gt;
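&lt;p&gt;A minimal sketch of that retry policy, for illustration only. The function names, the 1.5-second base, and the 60-second cap are assumptions, not the Actor's actual code; only the status codes, the five-attempt limit, and the &lt;code&gt;Retry-After&lt;/code&gt; rule come from the description above.&lt;/p&gt;

```python
from typing import Optional

RETRYABLE_STATUSES = frozenset({408, 429, 503})
MAX_ATTEMPTS = 5
MIN_INTERVAL_S = 1.5  # one request per 1.5 s, roughly 40 req/min


def retry_delay(attempt: int, retry_after: Optional[str] = None,
                base: float = 1.5, cap: float = 60.0) -> float:
    """Seconds to wait before retry number `attempt` (1-based)."""
    # A numeric Retry-After header wins over our own schedule.
    if retry_after is not None and retry_after.strip().isdigit():
        return float(retry_after.strip())
    # Otherwise double the wait each attempt (assumed base and cap),
    # and never go below the steady-state throttle interval.
    return max(MIN_INTERVAL_S, min(cap, base * 2 ** (attempt - 1)))


def should_retry(status: int, attempt: int) -> bool:
    """Retry only transient statuses, and only while attempts remain."""
    return status in RETRYABLE_STATUSES and MAX_ATTEMPTS > attempt
```

&lt;p&gt;A caller would sleep for &lt;code&gt;retry_delay(attempt, headers.get("Retry-After"))&lt;/code&gt; between attempts and give up once &lt;code&gt;should_retry&lt;/code&gt; returns false.&lt;/p&gt;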

&lt;p&gt;&lt;strong&gt;The REST API needs a custom User-Agent.&lt;/strong&gt; The &lt;a href="https://www.discogs.com/developers/" rel="noopener noreferrer"&gt;Discogs API Terms&lt;/a&gt; require every request to carry an &lt;code&gt;Application-Name/Version&lt;/code&gt; User-Agent; default browser-impersonation headers get a 403 on the JSON surface. The Actor sends &lt;code&gt;DevilScrapes/0.1 (+https://apify.com/DevilScrapes)&lt;/code&gt; automatically — you configure nothing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Parsing the condition grades.&lt;/strong&gt; Condition text is free-form HTML. We extract it with a regex against the canonical Discogs grade vocabulary (Mint, Near Mint, VG+, VG, Generic, Not Graded, and the rest); anything outside it sets the field to null with a &lt;code&gt;WARNING&lt;/code&gt; rather than emitting garbage. None of this is glamorous — all of it is the difference between a script that works once and a feed that survives Cloudflare's quarterly changes.&lt;/p&gt;
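&lt;p&gt;The grade matching can be sketched roughly like this. The vocabulary list below is a partial, assumed subset of the Discogs grading terms, and &lt;code&gt;parse_grade&lt;/code&gt; is a hypothetical helper, not the Actor's own function.&lt;/p&gt;

```python
import re
from typing import Optional

# Assumed, partial vocabulary. Longer names come first so that
# "Very Good Plus (VG+)" is tried before "Very Good (VG)".
GRADES = [
    "Very Good Plus (VG+)",
    "Near Mint (NM or M-)",
    "Very Good (VG)",
    "Good Plus (G+)",
    "Not Graded",
    "Mint (M)",
    "Good (G)",
    "Generic",
]
GRADE_RE = re.compile("|".join(re.escape(g) for g in GRADES))


def parse_grade(text: str) -> Optional[str]:
    """Return the first known grade found in `text`, else None."""
    # Collapse runs of whitespace left over from the HTML layout.
    match = GRADE_RE.search(" ".join(text.split()))
    return match.group(0) if match else None
```

&lt;p&gt;Returning &lt;code&gt;None&lt;/code&gt; for anything outside the list is what lets the caller log a &lt;code&gt;WARNING&lt;/code&gt; and write a proper null instead of free text.&lt;/p&gt;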

&lt;h2&gt;
  
  
  The Actor ⚙️
&lt;/h2&gt;

&lt;p&gt;The Actor is live on the Apify Store: &lt;strong&gt;&lt;a href="https://apify.com/DevilScrapes/discogs-sold-price" rel="noopener noreferrer"&gt;apify.com/DevilScrapes/discogs-sold-price&lt;/a&gt;&lt;/strong&gt;. Run it from the Apify Console (paste release IDs, click Start) or via the Apify Python client:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;apify_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ApifyClient&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ApifyClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_APIFY_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;actor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DevilScrapes/discogs-sold-price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;run_input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;releaseIds&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;249504&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10843&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maxPagesPerRelease&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maxListingsPerRelease&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;includeStatsRow&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;useProxy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;defaultDatasetId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;iterate_items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;row_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;asking_price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;condition_media&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you do not know the release ID, use &lt;code&gt;searchQuery&lt;/code&gt; instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;actor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DevilScrapes/discogs-sold-price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;run_input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;searchQuery&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nirvana nevermind&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maxSearchResults&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maxListingsPerRelease&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The search path resolves the top &lt;code&gt;maxSearchResults&lt;/code&gt; hits into release IDs, then fetches listings for each. Exactly one of &lt;code&gt;releaseIds&lt;/code&gt; or &lt;code&gt;searchQuery&lt;/code&gt; is required — passing both, or neither, raises a Pydantic &lt;code&gt;ValidationError&lt;/code&gt; before any network call goes out.&lt;/p&gt;
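&lt;p&gt;The exactly-one rule is easy to mirror on the caller's side before starting a run. The sketch below uses plain Python and a &lt;code&gt;ValueError&lt;/code&gt; as a stand-in for the Actor's Pydantic check; &lt;code&gt;check_source&lt;/code&gt; is a hypothetical name.&lt;/p&gt;

```python
from typing import Optional, Sequence


def check_source(release_ids: Optional[Sequence[int]] = None,
                 search_query: Optional[str] = None) -> str:
    """Return which input mode is active, or raise if zero or two are set."""
    has_ids = bool(release_ids)
    has_query = bool(search_query and search_query.strip())
    if has_ids == has_query:  # both set, or both missing
        raise ValueError("provide exactly one of releaseIds or searchQuery")
    return "releaseIds" if has_ids else "searchQuery"
```

&lt;p&gt;Treating an empty list or a blank string as "not provided" is a design choice made for this sketch; the Actor's own validator may draw that line differently.&lt;/p&gt;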

&lt;p&gt;Key input parameters:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;th&gt;Max&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;releaseIds&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;100 IDs per run&lt;/td&gt;
&lt;td&gt;XOR with &lt;code&gt;searchQuery&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;searchQuery&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;200 chars&lt;/td&gt;
&lt;td&gt;Resolves via Discogs search API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;maxSearchResults&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;Only used with &lt;code&gt;searchQuery&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;maxPagesPerRelease&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;25 listings per page&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;maxListingsPerRelease&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;500&lt;/td&gt;
&lt;td&gt;Hard cap per release&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;includeStatsRow&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;true&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;One aggregate row per release&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;useProxy&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;true&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Recommended for Apify cloud runs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Use cases 💡
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Vinyl and CD reseller benchmarking.&lt;/strong&gt; Took in 200 records at an estate sale? Pull the marketplace listings for each release, compute the median asking price by condition grade, and price your own copies in minutes instead of an afternoon — filtering on &lt;code&gt;seller_country&lt;/code&gt; for the shipping-adjusted competitive picture.&lt;/p&gt;
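&lt;p&gt;As a rough illustration of that benchmarking step, the sketch below groups listing rows by release, media grade, and currency, then takes the median ask. It assumes the rows are plain dicts shaped like the sample record; &lt;code&gt;median_ask_by_grade&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```python
from collections import defaultdict
from statistics import median


def median_ask_by_grade(rows):
    """Map (release_id, condition_media, asking_currency) to the median ask."""
    buckets = defaultdict(list)
    for row in rows:
        if row.get("row_type") != "listing" or row.get("asking_price") is None:
            continue  # skip stats rows and unpriced listings
        key = (row["release_id"], row["condition_media"], row["asking_currency"])
        buckets[key].append(row["asking_price"])
    return {key: median(prices) for key, prices in buckets.items()}
```

&lt;p&gt;Currency is part of the key because asking prices are not converted to a common currency, so mixing GBP and EUR asks in one median would be meaningless.&lt;/p&gt;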

&lt;p&gt;&lt;strong&gt;Music-collectibles arbitrage.&lt;/strong&gt; The same release often trades at materially different prices across seller countries. Query &lt;code&gt;seller_country&lt;/code&gt; + &lt;code&gt;asking_price&lt;/code&gt; + &lt;code&gt;asking_currency&lt;/code&gt; across a handful of releases and spot regional underpricing before another buyer does.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Label and catalog market intel.&lt;/strong&gt; For a label's back-catalog, track &lt;code&gt;stats_num_for_sale&lt;/code&gt; and &lt;code&gt;stats_lowest_price&lt;/code&gt; over time with Apify Schedules. A sustained drop in &lt;code&gt;stats_num_for_sale&lt;/code&gt; alongside a rising &lt;code&gt;stats_lowest_price&lt;/code&gt; is a secondary-market appreciation signal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Journalism and pricing studies.&lt;/strong&gt; "What does a first pressing of Nevermind cost right now?" is a recurring music-publication paragraph. One Actor run on the relevant release IDs gives you a defensible, timestamped dataset instead of a manual spot-check.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Seller-quality screening.&lt;/strong&gt; Filter listings by &lt;code&gt;seller_rating_pct &amp;gt;= 99.0&lt;/code&gt; and &lt;code&gt;seller_rating_count &amp;gt;= 100&lt;/code&gt; before buying, or sweep &lt;code&gt;stats_blocked_from_sale&lt;/code&gt; to flag releases Discogs has quietly restricted. The Actor hands you the data; the filtering is one line of pandas.&lt;/p&gt;
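&lt;p&gt;The same seller screen works without any dataframe library. This is a small sketch over plain dict rows; &lt;code&gt;trusted_listings&lt;/code&gt; is a hypothetical name and the thresholds are just the example values from the paragraph above.&lt;/p&gt;

```python
def trusted_listings(rows, min_pct=99.0, min_count=100):
    """Keep listing rows whose seller meets both rating thresholds."""
    return [
        row for row in rows
        if row.get("row_type") == "listing"
        # Null ratings are treated as zero, so unrated sellers are dropped.
        and (row.get("seller_rating_pct") or 0.0) >= min_pct
        and (row.get("seller_rating_count") or 0) >= min_count
    ]
```

&lt;p&gt;With these thresholds the sample row shown earlier (100.0% across 35 ratings) would be filtered out on rating count alone.&lt;/p&gt;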

&lt;h2&gt;
  
  
  Pricing — exact numbers 💰
&lt;/h2&gt;

&lt;p&gt;Pay-per-event. You pay only for rows that land in the dataset. No data, no charge (beyond the $0.05 run start fee).&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Event&lt;/th&gt;
&lt;th&gt;Price (USD)&lt;/th&gt;
&lt;th&gt;When&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;actor-start&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;$0.05&lt;/td&gt;
&lt;td&gt;Once per run, at boot&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;result-row&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;$0.005&lt;/td&gt;
&lt;td&gt;Per listing OR per stats row written&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;What that looks like in practice:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Run&lt;/th&gt;
&lt;th&gt;Rows&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1 release × 25 listings + 1 stats row&lt;/td&gt;
&lt;td&gt;26&lt;/td&gt;
&lt;td&gt;$0.18&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5 releases × 100 listings + 5 stats rows&lt;/td&gt;
&lt;td&gt;505&lt;/td&gt;
&lt;td&gt;$2.58&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10 releases × 100 listings + 10 stats rows&lt;/td&gt;
&lt;td&gt;1,010&lt;/td&gt;
&lt;td&gt;$5.10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50 releases × 100 listings + 50 stats rows&lt;/td&gt;
&lt;td&gt;5,050&lt;/td&gt;
&lt;td&gt;$25.30&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;At scale the per-row charge dominates: approximately &lt;strong&gt;$5.05 per 1,000 rows&lt;/strong&gt;, priced for a hand-parsed marketplace listing with full seller metadata rather than a JSON API field copy. Apify's $5 free trial credit covers your first ~990 listing rows, no credit card required.&lt;/p&gt;
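&lt;p&gt;The table rows follow directly from the two event prices. A small calculator, with &lt;code&gt;run_cost&lt;/code&gt; as a hypothetical helper and half-up rounding to the cent assumed, makes the arithmetic explicit:&lt;/p&gt;

```python
from decimal import Decimal, ROUND_HALF_UP

START_FEE = Decimal("0.05")   # actor-start, once per run
ROW_FEE = Decimal("0.005")    # result-row, per listing or stats row


def run_cost(releases: int, listings_per_release: int,
             stats_row: bool = True) -> Decimal:
    """Estimated USD cost of one run, rounded half-up to the cent."""
    rows = releases * (listings_per_release + (1 if stats_row else 0))
    total = START_FEE + ROW_FEE * rows
    return total.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
```

&lt;p&gt;&lt;code&gt;Decimal&lt;/code&gt; is used instead of floats so that a total such as $2.575 rounds predictably to $2.58.&lt;/p&gt;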

&lt;h2&gt;
  
  
  The technically interesting part
&lt;/h2&gt;

&lt;p&gt;The devil is in the data-attribute vs. display-text split. Discogs encodes the machine-readable price in an HTML attribute (&lt;code&gt;data-pricevalue="0.50"&lt;/code&gt;, &lt;code&gt;data-currency="GBP"&lt;/code&gt;) and a human-formatted string in the visible text (&lt;code&gt;"£0.50"&lt;/code&gt;). The displayed text goes through server-side currency conversion for non-UK visitors, so text-scraping gives you a converted price in whatever currency the datacenter IP geolocates to. We always prefer &lt;code&gt;data-pricevalue&lt;/code&gt; and &lt;code&gt;data-currency&lt;/code&gt; — the canonical values the seller entered — and only fall back to text parsing (with a &lt;code&gt;WARNING&lt;/code&gt;) when the attributes are missing. For any cross-country price comparison the text number is a moving target and the attribute is not.&lt;/p&gt;
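&lt;p&gt;In outline, the attribute-first rule looks something like the sketch below. It assumes the price node's attributes have already been extracted into a mapping by whatever HTML parser is in use; &lt;code&gt;pick_price&lt;/code&gt; and its return shape are hypothetical.&lt;/p&gt;

```python
import re
from typing import Mapping, Optional, Tuple

PRICE_IN_TEXT = re.compile(r"\d+(?:\.\d+)?")


def pick_price(attrs: Mapping[str, str],
               text: str) -> Tuple[Optional[float], Optional[str], bool]:
    """Return (price, currency, used_fallback) for one listing's price node."""
    value = attrs.get("data-pricevalue")
    currency = attrs.get("data-currency")
    if value and currency:
        return float(value), currency, False
    # Fallback: the visible text may already be currency-converted, so the
    # caller should log a WARNING and treat the currency as unknown.
    # Assumption: commas in the text are thousands separators.
    match = PRICE_IN_TEXT.search(text.replace(",", ""))
    return (float(match.group(0)) if match else None), None, True
```

&lt;p&gt;Returning the fallback flag alongside the value keeps the decision visible downstream, so converted prices can be excluded from cross-country comparisons.&lt;/p&gt;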

&lt;h2&gt;
  
  
  Limitations 🚧
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;No closed-sale price history.&lt;/strong&gt; Discogs hosts sold-price history at &lt;code&gt;/sell/history/{release_id}&lt;/code&gt;, but that page is gated behind account login (Auth0) and is out of scope without user OAuth. What you get instead: every active asking price (the live offer side) plus the public &lt;code&gt;lowest_price&lt;/code&gt; aggregate. For most reseller and arbitrage workflows the live asking-price distribution is the more actionable signal anyway.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One snapshot per run.&lt;/strong&gt; The Actor captures marketplace state at a point in time. For time-series tracking, schedule recurring runs via Apify Schedules and concatenate datasets by &lt;code&gt;release_id&lt;/code&gt; + &lt;code&gt;scraped_at&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;500-listing ceiling per release.&lt;/strong&gt; Discogs serves 25 listings per page server-side; the &lt;code&gt;maxPagesPerRelease&lt;/code&gt; cap of 20 gives a 500-listing hard ceiling per release per run. Releases with more active listings need multiple runs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Currency is not normalised.&lt;/strong&gt; &lt;code&gt;asking_price&lt;/code&gt; comes in the seller's local currency, &lt;code&gt;stats_lowest_price&lt;/code&gt; in the request IP's resolved currency. Join on &lt;code&gt;asking_currency&lt;/code&gt; / &lt;code&gt;stats_lowest_currency&lt;/code&gt;; no canonical USD conversion is applied.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Throughput is ~40 req/min&lt;/strong&gt;, imposed by Discogs — a 10-release run with 4 pages each takes roughly 90 seconds plus warm-up. And note the &lt;strong&gt;7-day default storage retention&lt;/strong&gt; on the Apify FREE plan: export your dataset right after the run or upgrade for longer retention.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ ❓
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Is scraping Discogs marketplace listings legal?&lt;/strong&gt;&lt;br&gt;
The marketplace listings page is public — no login, no paywall. This Actor reads only what the anonymous public UI exposes, paces itself at ~40 req/min (below Discogs' documented 60 req/min ceiling), and authenticates against nothing. As always, verify against your own jurisdiction and use case before running at scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does Discogs have an API for marketplace listings?&lt;/strong&gt;&lt;br&gt;
Not for per-listing data. The REST API returns only &lt;code&gt;lowest_price&lt;/code&gt; and &lt;code&gt;num_for_sale&lt;/code&gt; aggregates per release; individual listings (price, condition, seller) live only on the Cloudflare-protected HTML page, which this Actor scrapes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I get the closed-sale / sold-price history?&lt;/strong&gt;&lt;br&gt;
No. The &lt;code&gt;/sell/history/{release_id}&lt;/code&gt; page is gated behind Auth0 login. The "sold price" in the slug is a vestige of the original scope. What the Actor delivers is the ask-side snapshot: every active listing plus the public floor-price aggregate — for most reseller workflows the more actionable data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I export the results to a spreadsheet or warehouse?&lt;/strong&gt;&lt;br&gt;
Yes — CSV, Excel, JSON, and XML exports from the Apify dataset viewer. You can also webhook the dataset on &lt;code&gt;ACTOR.RUN.SUCCEEDED&lt;/code&gt; into Make, Zapier, or n8n, or pull it via the &lt;a href="https://docs.apify.com/api/v2" rel="noopener noreferrer"&gt;Apify API&lt;/a&gt;: &lt;code&gt;GET /datasets/{id}/items?format=csv&amp;amp;clean=true&lt;/code&gt;.&lt;/p&gt;
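&lt;p&gt;For scripted exports, the items URL can be assembled like this. The sketch assumes the public Apify API base &lt;code&gt;https://api.apify.com/v2&lt;/code&gt;; &lt;code&gt;dataset_items_url&lt;/code&gt; is a hypothetical helper, and a private dataset would additionally need an API token, which is omitted here.&lt;/p&gt;

```python
from urllib.parse import urlencode

API_BASE = "https://api.apify.com/v2"


def dataset_items_url(dataset_id: str, fmt: str = "csv",
                      clean: bool = True) -> str:
    """Build the URL for downloading a dataset's items in the given format."""
    query = urlencode({"format": fmt, "clean": "true" if clean else "false"})
    return f"{API_BASE}/datasets/{dataset_id}/items?{query}"
```

&lt;p&gt;The &lt;code&gt;dataset_id&lt;/code&gt; is the &lt;code&gt;defaultDatasetId&lt;/code&gt; value returned by the run in the client example earlier.&lt;/p&gt;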

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;The Actor is live: &lt;strong&gt;&lt;a href="https://apify.com/DevilScrapes/discogs-sold-price" rel="noopener noreferrer"&gt;apify.com/DevilScrapes/discogs-sold-price&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Free $5 Apify credit, no credit card. Run it on release ID &lt;code&gt;249504&lt;/code&gt; (the Rick Astley test classic) and you will have 25 typed listing rows in your dataset in under a minute. Found a field you wish it returned — median price, condition-grade distribution? Drop a comment or open a request on the Store page. We ship based on what builders actually need.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built by &lt;a href="https://apify.com/DevilScrapes" rel="noopener noreferrer"&gt;Devil Scrapes&lt;/a&gt; — Apify Actors with attitude. Pay-per-event, transparent pricing, no junk fields.&lt;/em&gt; 😈&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>python</category>
      <category>apify</category>
      <category>music</category>
    </item>
    <item>
      <title>DEV.to Scraper: pull articles by tag, author, or feed into clean JSON</title>
      <dc:creator>Devil Scrapes</dc:creator>
      <pubDate>Sun, 31 May 2026 11:57:44 +0000</pubDate>
      <link>https://dev.to/devil_scrapes/devto-scraper-pull-articles-by-tag-author-or-feed-into-clean-json-dih</link>
      <guid>https://dev.to/devil_scrapes/devto-scraper-pull-articles-by-tag-author-or-feed-into-clean-json-dih</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Quick answer:&lt;/strong&gt; DEV.to (built on the &lt;a href="https://www.forem.com/" rel="noopener noreferrer"&gt;Forem&lt;/a&gt; platform) publishes a public v1 REST API, documented at &lt;code&gt;https://developers.forem.com/api/v1&lt;/code&gt;, but it paginates 30 articles at a time and offers no bulk-by-tag export beyond the first 1,000 items. A &lt;em&gt;DEV.to scraper&lt;/em&gt; fans out across those paginated endpoints, fetches each article's full Markdown body in parallel, and returns one clean typed row per article. The Apify Actor below does it for &lt;strong&gt;$0.002 per article&lt;/strong&gt; ($2.00 per 1,000), with rate-limit pacing, retries, and Pydantic-validated rows handled for you.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;DEV.to's feed is one of the richest free sources of developer-written technical content on the web. On any given day the &lt;code&gt;python&lt;/code&gt; tag alone has thousands of articles — tutorials, opinions, walkthroughs, career posts — each with engagement signals (reactions, comments, reading time) attached. The platform's UI surfaces individual articles just fine. What it doesn't give you is a download button, a bulk-export endpoint, or a way to pull every article in a tag across the full history. The API hands back 30 at a time, then stops answering after 1,000 items per tag.&lt;/p&gt;

&lt;p&gt;If you want this as a dataset — to seed a RAG corpus, run an engagement benchmark, or mirror an author's catalogue — you have to stitch it together yourself. Here's what that involves, and how I turned it into a one-call Actor.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is DEV.to? 🔎
&lt;/h2&gt;

&lt;p&gt;DEV.to is a community publishing platform for software developers, built on &lt;a href="https://www.forem.com/" rel="noopener noreferrer"&gt;Forem&lt;/a&gt; — the open-source publishing engine that also powers CodeNewbie and several smaller communities. Launched in 2016, DEV.to hosts millions of articles across tags like &lt;code&gt;python&lt;/code&gt;, &lt;code&gt;webdev&lt;/code&gt;, &lt;code&gt;typescript&lt;/code&gt;, &lt;code&gt;beginners&lt;/code&gt;, and &lt;code&gt;ai&lt;/code&gt;, written by everyone from student bloggers to senior engineers.&lt;/p&gt;

&lt;p&gt;What makes DEV.to useful as a data source:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every article carries &lt;strong&gt;structured engagement metrics&lt;/strong&gt;: positive reactions, comments, and reading time&lt;/li&gt;
&lt;li&gt;Articles are tagged with &lt;strong&gt;community-maintained taxonomy&lt;/strong&gt; (lowercase tags like &lt;code&gt;javascript&lt;/code&gt;, &lt;code&gt;devops&lt;/code&gt;, &lt;code&gt;aws&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;The body of every article is available as &lt;strong&gt;raw Markdown&lt;/strong&gt; — ready to embed in a vector store without stripping HTML&lt;/li&gt;
&lt;li&gt;Authorship is consistent: every article has a &lt;code&gt;username&lt;/code&gt; and &lt;code&gt;display_name&lt;/code&gt;, making per-author analysis straightforward&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What the platform does &lt;em&gt;not&lt;/em&gt; give you: a bulk export, a search-by-keyword endpoint, or a way to get all articles in a tag older than the most recent thousand.&lt;/p&gt;

&lt;h2&gt;
  
  
  Does DEV.to have an API? 🔌
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Yes — but it has meaningful limits.&lt;/strong&gt; DEV.to's &lt;a href="https://developers.forem.com/api/v1" rel="noopener noreferrer"&gt;Forem v1 API&lt;/a&gt; is public for read access, requires no API key for &lt;code&gt;GET /articles&lt;/code&gt;, and is reasonably well-documented. That's genuinely the good news.&lt;/p&gt;

&lt;p&gt;The constraints that send people looking for a scraper:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;30 articles per page, hard cap.&lt;/strong&gt; You can request &lt;code&gt;per_page=30&lt;/code&gt; — that's the max the server will honor. Getting 10,000 articles means 334 sequential (or carefully paced parallel) requests.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tag endpoint cuts off at 1,000 items.&lt;/strong&gt; Past page 34, the API returns empty arrays. There's no &lt;code&gt;cursor&lt;/code&gt; mechanism, no &lt;code&gt;since&lt;/code&gt; timestamp, and no workaround documented by Forem.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Body Markdown requires a second request.&lt;/strong&gt; &lt;code&gt;GET /articles&lt;/code&gt; returns metadata. To get &lt;code&gt;body_markdown&lt;/code&gt; you need a &lt;code&gt;GET /articles/:id&lt;/code&gt; call per article — one extra round trip for every row you want full-text on.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rate limits are real and undocumented.&lt;/strong&gt; Hit them and you get 429s with no &lt;code&gt;Retry-After&lt;/code&gt; header. Retry naively and you accumulate backoff penalties.&lt;/li&gt;
&lt;/ul&gt;
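&lt;p&gt;The request counts in that list are easy to check. Here is a minimal sketch of the arithmetic, assuming the 30-per-page cap and the roughly 1,000-item tag ceiling described above. The function names are illustrative and not part of any API.&lt;/p&gt;

```python
import math

PER_PAGE = 30          # page size cap described in the list above
TAG_CEILING = 1_000    # approximate tag-listing cutoff described above


def listing_requests(total_articles, per_page=PER_PAGE):
    """Number of listing pages needed to cover total_articles."""
    return math.ceil(total_articles / per_page)


def total_requests(total_articles, include_body=True):
    """Listing pages plus one detail request per article when bodies are wanted."""
    pages = listing_requests(total_articles)
    return pages + (total_articles if include_body else 0)


print(listing_requests(10_000))       # 334
print(listing_requests(TAG_CEILING))  # 34
print(total_requests(1_000))          # 1034
```

&lt;p&gt;The last line shows why the body fetch dominates: for 1,000 articles the listing costs 34 requests, while the Markdown costs another 1,000.&lt;/p&gt;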

&lt;p&gt;None of that is a dealbreaker on its own. Together, for a 10,000-article corpus pull, it's several hours of babysitting a script. That's the gap the Actor fills.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the data looks like
&lt;/h2&gt;

&lt;p&gt;Each article becomes one flat, typed row. Here's a representative output record with all 16 fields from &lt;code&gt;models.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1893402&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"slug"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"build-a-rag-pipeline-with-python-and-chroma-3x7k"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Build a RAG pipeline with Python and Chroma"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"A step-by-step guide to building a retrieval-augmented generation system using Python, ChromaDB, and the OpenAI API."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://dev.to/pythonista/build-a-rag-pipeline-with-python-and-chroma-3x7k"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"cover_image"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://media.dev.to/cdn-cgi/image/quality=100/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xyz.jpg"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tags"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"python"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ai"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"machinelearning"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"beginners"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"author_username"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"pythonista"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"author_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Alex Chen"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"reading_time_minutes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"positive_reactions_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;347&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"comments_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;19&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"body_markdown"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"## Introduction&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;Retrieval-augmented generation (RAG)..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"published_at"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-04-12T09:14:00Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"edited_at"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-04-14T11:02:00Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"scraped_at"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-05-29T08:22:41+00:00"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Sixteen fields, consistent shape every time, Pydantic-validated before the row is written. It drops straight into Pandas, a vector store, or BigQuery with no field-wrangling on your side.&lt;/p&gt;

&lt;h2&gt;
  
  
  The naive approach (and why it falls apart) 🔧
&lt;/h2&gt;

&lt;p&gt;The obvious script goes like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;

&lt;span class="n"&gt;articles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://dev.to/api/articles&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tag&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;page&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;per_page&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;
    &lt;span class="n"&gt;articles&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works until it doesn't. Three failure modes that matter at scale:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The 1,000-item ceiling.&lt;/strong&gt; Past page 34, any tag listing returns an empty list. There's no error and no header, just silence. A naive loop exits thinking it finished, and you are left with 1,000 rows when the tag may hold 12,000.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The body-fetch N+1 problem.&lt;/strong&gt; Want Markdown? Every article needs a second &lt;code&gt;GET /articles/:id&lt;/code&gt;. For 1,000 articles that's 1,000 extra requests. We handle this with a &lt;code&gt;concurrency&lt;/code&gt; parameter — up to 16 parallel body fetches — so the Actor fans out rather than serializing. We pace those fetches and retry with exponential backoff on &lt;code&gt;408 / 429 / 5xx&lt;/code&gt;, up to 5 attempts per article before we surface a partial-success status rather than handing you a half-empty dataset.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Rate limits that arrive unannounced.&lt;/strong&gt; DEV.to's rate-limit threshold isn't published; it varies by endpoint and time of day. We back off on rate-limit signals and reset the session rather than triggering a retry storm — and we surface a &lt;code&gt;set_status_message&lt;/code&gt; so you know what happened, rather than silently returning fewer articles than you asked for.&lt;/p&gt;

&lt;p&gt;We rotate browser fingerprints via &lt;code&gt;curl-cffi&lt;/code&gt; so requests look like a real browser's TLS handshake, and we thread Apify residential proxies on every session rotation — fresh exit IP, fresh cookie jar — so a single blocked IP doesn't kill the run.&lt;/p&gt;
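&lt;p&gt;For readers rolling their own script, the retry policy described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the Actor's internals: the base delay, the 30-second cap, the jitter, and the &lt;code&gt;fetch&lt;/code&gt; callable (which returns a status code and a payload) are all hypothetical.&lt;/p&gt;

```python
import random
import time

RETRYABLE = {408, 429}   # any 5xx is also retried, see is_retryable
MAX_ATTEMPTS = 5         # attempts per article, as described above


def is_retryable(status):
    """True for 408, 429, and every 5xx status."""
    return status in RETRYABLE or status // 100 == 5


def backoff_delay(attempt, base=1.0, cap=30.0):
    """Exponential delay with full jitter; attempt counts from 0 (assumed values)."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def fetch_with_retry(fetch, article_id, sleep=time.sleep):
    """Call the hypothetical fetch(article_id), which returns (status, payload)."""
    for attempt in range(MAX_ATTEMPTS):
        status, payload = fetch(article_id)
        if status == 200:
            return payload
        if not is_retryable(status):
            break                      # e.g. 404: retrying will not help
        if attempt + 1 != MAX_ATTEMPTS:
            sleep(backoff_delay(attempt))
    return None                        # caller records a partial-success status
```

&lt;p&gt;Returning &lt;code&gt;None&lt;/code&gt; instead of raising lets the caller keep the rows that did succeed and report the gap, which is the partial-success behaviour described above.&lt;/p&gt;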

&lt;h2&gt;
  
  
  The Actor
&lt;/h2&gt;

&lt;p&gt;I packaged this as an Apify Actor: &lt;strong&gt;&lt;a href="https://apify.com/DevilScrapes/dev-to-articles-scraper" rel="noopener noreferrer"&gt;DEV.to Articles Scraper&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Open the Apify Console and click Start, or run it programmatically with the &lt;a href="https://docs.apify.com/api/client/python/" rel="noopener noreferrer"&gt;Apify Python client&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;apify_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ApifyClient&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ApifyClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_APIFY_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;actor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DevilScrapes/dev-to-articles-scraper&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;run_input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tag&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tag&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;includeBody&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maxResults&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;concurrency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;defaultDatasetId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;iterate_items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;positive_reactions_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The four &lt;code&gt;mode&lt;/code&gt; values from the input schema:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;What it fetches&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;tag&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;All articles for a given tag (e.g. &lt;code&gt;python&lt;/code&gt;, &lt;code&gt;ai&lt;/code&gt;, &lt;code&gt;webdev&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;username&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;All articles by a specific DEV.to author&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;latest&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Global latest feed — newest articles across all tags&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;top&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Top articles of all time across the platform&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Set &lt;code&gt;includeBody: false&lt;/code&gt; if you only need metadata. That skips the per-article body request, which accounts for most of the requests and most of the run time: 1,000 articles need about 34 listing pages but 1,000 body fetches. &lt;code&gt;maxResults&lt;/code&gt; caps the total rows; &lt;code&gt;concurrency&lt;/code&gt; controls how many body fetches run in parallel (1–16, default 4).&lt;/p&gt;

&lt;h2&gt;
  
  
  Use cases
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;RAG corpus seeding.&lt;/strong&gt; Pull &lt;code&gt;mode=tag, tag=python, includeBody=true, maxResults=1000&lt;/code&gt; to get 1,000 Python tutorials in raw Markdown. Each article is already chunked — the &lt;code&gt;body_markdown&lt;/code&gt; field is the unit. Embed it directly into &lt;a href="https://www.trychroma.com/" rel="noopener noreferrer"&gt;Chroma&lt;/a&gt;, Pinecone, or Weaviate. The &lt;code&gt;url&lt;/code&gt; and &lt;code&gt;author_username&lt;/code&gt; fields give you citation metadata for free.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trending tag dashboards.&lt;/strong&gt; Schedule a daily run on &lt;code&gt;mode=tag, tag=ai&lt;/code&gt; with &lt;code&gt;maxResults=50&lt;/code&gt;. Diff today's &lt;code&gt;positive_reactions_count&lt;/code&gt; against yesterday's — any article that gained more than 50 reactions in 24 hours is trending. Wire it into a Slack webhook and you have a free daily briefing.&lt;/p&gt;
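&lt;p&gt;The day-over-day diff from the dashboard use case can be sketched like this. It assumes you have loaded yesterday's and today's datasets as lists of dicts with the &lt;code&gt;id&lt;/code&gt; and &lt;code&gt;positive_reactions_count&lt;/code&gt; fields shown earlier; the &lt;code&gt;trending&lt;/code&gt; helper is hypothetical.&lt;/p&gt;

```python
def trending(yesterday, today, min_gain=50):
    """Return (article_id, gain) pairs for articles that gained more than min_gain reactions."""
    before = {row["id"]: row["positive_reactions_count"] for row in yesterday}
    gains = []
    for row in today:
        if row["id"] not in before:
            continue  # first sighting: no baseline to diff against
        gain = row["positive_reactions_count"] - before[row["id"]]
        if gain > min_gain:
            gains.append((row["id"], gain))
    # Biggest movers first
    return sorted(gains, key=lambda pair: pair[1], reverse=True)
```

&lt;p&gt;Articles that appear for the first time today are skipped here because there is nothing to compare against; whether to treat them as trending is a judgment call for your dashboard.&lt;/p&gt;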

&lt;p&gt;&lt;strong&gt;Author monitoring and portfolio analysis.&lt;/strong&gt; Pull &lt;code&gt;mode=username&lt;/code&gt; for a specific author to mirror their full catalogue. Useful for DevRel teams tracking competitor advocates, recruiters benchmarking engineering blog authors, or writers building a personal analytics dashboard outside DEV.to's own stats page.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Newsletter assembly.&lt;/strong&gt; Pull &lt;code&gt;mode=top&lt;/code&gt; or &lt;code&gt;mode=latest&lt;/code&gt; with &lt;code&gt;maxResults=10&lt;/code&gt;, sort by &lt;code&gt;positive_reactions_count&lt;/code&gt;, and render the top 5 to Markdown for a weekly digest. The &lt;code&gt;reading_time_minutes&lt;/code&gt; field tells readers upfront what they're committing to.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Engagement benchmarking.&lt;/strong&gt; Pull 500 articles in a tag, group by &lt;code&gt;author_username&lt;/code&gt;, and compute average reactions per post — a simple "who are the most impactful writers in this niche?" query for sponsorship research, guest-post pitching, or a contributor leaderboard.&lt;/p&gt;
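&lt;p&gt;The benchmarking query above needs nothing beyond the standard library. A minimal sketch, assuming rows shaped like the output record shown earlier (the function name is illustrative):&lt;/p&gt;

```python
from collections import defaultdict
from statistics import mean


def average_reactions_by_author(rows):
    """Group rows by author_username and rank authors by mean reactions per post."""
    by_author = defaultdict(list)
    for row in rows:
        by_author[row["author_username"]].append(row["positive_reactions_count"])
    ranked = [(author, mean(counts)) for author, counts in by_author.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)
```

&lt;p&gt;A mean rewards authors with one viral post; swapping in &lt;code&gt;statistics.median&lt;/code&gt; is a one-line change if you want a ranking that is less sensitive to outliers.&lt;/p&gt;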

&lt;h2&gt;
  
  
  Pricing — exact numbers 💰
&lt;/h2&gt;

&lt;p&gt;Pay-per-event. You pay for articles written to your dataset, nothing for the ones you don't get.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;$0.005 per run&lt;/strong&gt; (covers the Actor warm-up)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;$0.002 per article&lt;/strong&gt; written to the dataset&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pull&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;30 articles (default)&lt;/td&gt;
&lt;td&gt;$0.07&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100 articles&lt;/td&gt;
&lt;td&gt;$0.21&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1,000 articles&lt;/td&gt;
&lt;td&gt;$2.01&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5,000 articles&lt;/td&gt;
&lt;td&gt;$10.01&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10,000 articles&lt;/td&gt;
&lt;td&gt;$20.01&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Apify's $5 free trial credit covers your first ~2,490 articles with no credit card required. No subscription, no minimum, no charge for runs that return zero results.&lt;/p&gt;
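&lt;p&gt;If you want to budget a pull before running it, the two prices above reduce to one line of arithmetic. The sketch below works in thousandths of a dollar to avoid floating-point drift; the helper names are illustrative, and the table figures above are these values rounded up to the cent.&lt;/p&gt;

```python
RUN_FEE_MILLS = 5       # $0.005 per run, in thousandths of a dollar
ARTICLE_FEE_MILLS = 2   # $0.002 per article written to the dataset


def run_cost_usd(articles):
    """Cost of a single run that writes the given number of articles."""
    return (RUN_FEE_MILLS + ARTICLE_FEE_MILLS * articles) / 1000


def articles_within_budget(budget_usd, runs=1):
    """How many articles a budget covers once the per-run fees are paid."""
    budget_mills = round(budget_usd * 1000)
    return max(0, (budget_mills - RUN_FEE_MILLS * runs) // ARTICLE_FEE_MILLS)


for n in (30, 100, 1_000, 5_000, 10_000):
    print(n, run_cost_usd(n))
```

&lt;p&gt;Splitting the same budget across more runs lowers the article count slightly, because each run pays the warm-up fee again.&lt;/p&gt;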

&lt;h2&gt;
  
  
  The technically interesting bit
&lt;/h2&gt;

&lt;p&gt;In practice, DEV.to's API stops returning tag listings after roughly 1,000 articles, but the per-article &lt;code&gt;GET /articles/:id&lt;/code&gt; endpoint has no such limit. A full corpus is therefore achievable by combining the listing endpoint (for IDs) with the detail endpoint (for bodies): even when the listing only goes 34 pages deep, you can supplement IDs from the &lt;code&gt;username&lt;/code&gt; endpoint, the &lt;code&gt;latest&lt;/code&gt; feed, or a prior run's dataset. The Actor exposes this as a design choice: &lt;code&gt;mode=tag&lt;/code&gt; is the fast lane for recent articles; &lt;code&gt;mode=latest&lt;/code&gt; is the slow lane for full-history accumulation over scheduled runs. Both paths produce identical row shapes, so your downstream pipeline never needs to know which mode fed it.&lt;/p&gt;
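&lt;p&gt;Because every mode emits the same row shape, accumulating a corpus over several runs comes down to de-duplicating on &lt;code&gt;id&lt;/code&gt;. A minimal sketch, assuming each run's dataset has been loaded as a list of dicts and that &lt;code&gt;scraped_at&lt;/code&gt; values share one ISO 8601 format and timezone so that string comparison orders them correctly (the &lt;code&gt;merge_runs&lt;/code&gt; helper is hypothetical):&lt;/p&gt;

```python
def merge_runs(*runs):
    """Merge rows from several runs, keeping the most recently scraped copy per article id."""
    merged = {}
    for rows in runs:
        for row in rows:
            current = merged.get(row["id"])
            # Same-format ISO 8601 strings sort chronologically
            if current is None or row["scraped_at"] > current["scraped_at"]:
                merged[row["id"]] = row
    return list(merged.values())
```

&lt;p&gt;Keeping the newest copy means reaction and comment counts stay current for articles that show up in more than one run.&lt;/p&gt;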

&lt;h2&gt;
  
  
  Limitations 🚧
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tag endpoint hard cap at ~1,000 items.&lt;/strong&gt; The DEV.to v1 API does not paginate beyond this for the tag feed. Full-history pulls require either the &lt;code&gt;username&lt;/code&gt; mode (per-author) or multiple scheduled &lt;code&gt;latest&lt;/code&gt;-mode runs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Body Markdown is the API's version.&lt;/strong&gt; If an author used DEV.to's rich editor with embedded Liquid tags (custom video/link cards), those render as raw Liquid syntax in the Markdown — not HTML. Post-processing is on you.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No comment bodies.&lt;/strong&gt; &lt;code&gt;comments_count&lt;/code&gt; is in the metadata, but fetching individual comment threads would multiply the request count significantly. Not in scope for v1.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No reading-time filtering at the API level.&lt;/strong&gt; The API doesn't accept a &lt;code&gt;min_reading_time&lt;/code&gt; param, so filter on &lt;code&gt;reading_time_minutes&lt;/code&gt; after the scrape, for example in Pandas.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Private/draft articles are inaccessible.&lt;/strong&gt; The public API only surfaces published, non-hidden articles.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Is scraping DEV.to legal?&lt;/strong&gt;&lt;br&gt;
This Actor calls DEV.to's own public API (documented at &lt;code&gt;https://developers.forem.com/api/v1&lt;/code&gt;): no authentication bypassed, no HTML scraped, no undocumented endpoint hit. The API is designed for programmatic access. Standard advice: read &lt;a href="https://dev.to/terms"&gt;DEV.to's Terms of Service&lt;/a&gt;, stay within polite request rates, and don't republish article bodies wholesale without attribution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I export the dataset to a spreadsheet or warehouse?&lt;/strong&gt;&lt;br&gt;
Yes — export CSV, JSON, or Excel directly from the Apify Console. Alternatively, webhook the dataset on &lt;code&gt;ACTOR.RUN.SUCCEEDED&lt;/code&gt; into Make, Zapier, or n8n, or fetch it via the &lt;a href="https://docs.apify.com/api/v2" rel="noopener noreferrer"&gt;Apify API&lt;/a&gt; for direct warehouse ingestion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does DEV.to have an official bulk-export API?&lt;/strong&gt;&lt;br&gt;
No. The &lt;a href="https://developers.forem.com/api/v1" rel="noopener noreferrer"&gt;Forem v1 API&lt;/a&gt; paginates 30 articles at a time and caps tag listings at approximately 1,000 items. There is no official bulk download, no CSV export, and no GraphQL endpoint on the public surface.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why are some &lt;code&gt;body_markdown&lt;/code&gt; fields null?&lt;/strong&gt;&lt;br&gt;
Some DEV.to articles link out to a canonical URL hosted on the author's own blog — the metadata (title, tags, reactions) lives on DEV.to but the body lives elsewhere. In those cases the API returns an empty or very short body; the Actor surfaces that faithfully as &lt;code&gt;null&lt;/code&gt; rather than silently dropping the row.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;The Actor is live on the Apify Store: &lt;strong&gt;&lt;a href="https://apify.com/DevilScrapes/dev-to-articles-scraper" rel="noopener noreferrer"&gt;apify.com/DevilScrapes/dev-to-articles-scraper&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Free $5 trial credit, no credit card. Run it on &lt;code&gt;tag=python&lt;/code&gt; with &lt;code&gt;maxResults=30&lt;/code&gt; and you'll have a full typed dataset in under a minute — Markdown bodies included if you leave &lt;code&gt;includeBody: true&lt;/code&gt;. Need a field that isn't there (comment threads, co-authors, series metadata)? Drop a note in the comments. We read every one.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built by &lt;a href="https://apify.com/DevilScrapes" rel="noopener noreferrer"&gt;Devil Scrapes&lt;/a&gt; — the devil's in the data, and we keep it clean.&lt;/em&gt; 😈&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>python</category>
      <category>apify</category>
      <category>data</category>
    </item>
    <item>
      <title>CoinGecko API Scraper: pull crypto price, market cap, ATH for $2/1K</title>
      <dc:creator>Devil Scrapes</dc:creator>
      <pubDate>Sun, 31 May 2026 11:52:29 +0000</pubDate>
      <link>https://dev.to/devil_scrapes/coingecko-api-scraper-pull-crypto-price-market-cap-ath-for-21k-193f</link>
      <guid>https://dev.to/devil_scrapes/coingecko-api-scraper-pull-crypto-price-market-cap-ath-for-21k-193f</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Quick answer:&lt;/strong&gt; A &lt;em&gt;CoinGecko API scraper&lt;/em&gt; takes a list of coin IDs, rotates TLS fingerprints and proxy sessions to stay within CoinGecko's free-tier rate limits (~10 requests per minute), and returns one clean, typed JSON row per coin with 26 fields, including price, market cap, ATH/ATL, supply, and sentiment. The Apify Actor below costs &lt;strong&gt;$0.002 per coin&lt;/strong&gt; (~$2.00 per 1,000), handles the rate-limiting for you, and lands the data wherever your stack lives.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you've built anything crypto-adjacent — a portfolio dashboard, an alert bot, a strategy backtest — you've run into CoinGecko's free tier. The data is excellent: prices, market caps, circulating vs total supply, all-time highs and lows, community sentiment, even GitHub activity. The problem isn't the data; it's getting it at scale without triggering a rate-limit cascade.&lt;/p&gt;

&lt;p&gt;A &lt;code&gt;for&lt;/code&gt; loop over your coin list, one &lt;code&gt;requests.get&lt;/code&gt; per coin, works fine for five coins and falls apart at fifty. By the time you've added backoff, fingerprint rotation, proxy management, and typed output, you've spent half a day on plumbing you'd rather not maintain. Here's how I wrapped all of that into a single API call.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is CoinGecko? 🔎
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.coingecko.com" rel="noopener noreferrer"&gt;CoinGecko&lt;/a&gt; is an independent cryptocurrency data aggregator, founded in 2014, that tracks over 14,000 digital assets across 900+ exchanges. It publishes market data — prices, volumes, supply figures, ATH/ATL records, community sentiment — through a public &lt;a href="https://www.coingecko.com/en/api/documentation" rel="noopener noreferrer"&gt;REST API (v3)&lt;/a&gt; with a free "Demo" tier and paid Pro plans. Unlike exchange APIs, CoinGecko aggregates across venues, so you get one normalized view regardless of where a coin trades.&lt;/p&gt;

&lt;p&gt;For portfolio tracking, research, alerting, or tax reporting, CoinGecko is the reference source: public dataset, deep coverage, documented API. The catch is throughput — the free Demo tier enforces a hard cap that makes bulk pulls slow and fragile without deliberate rate-limit management.&lt;/p&gt;

&lt;h2&gt;
  
  
  Does CoinGecko have an API? 🔌
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Yes&lt;/strong&gt; — CoinGecko publishes an &lt;a href="https://www.coingecko.com/en/api/documentation" rel="noopener noreferrer"&gt;official v3 REST API&lt;/a&gt;. This Actor uses the &lt;code&gt;/coins/{id}&lt;/code&gt; endpoint to fetch the per-coin detail payload. The free Demo tier is rate-limited to roughly 10 requests per minute from a single IP; the Pro tier lifts that significantly. The API is documented and stable, but managing those rate limits across concurrent workers — when you're pulling hundreds of coins and can't afford to serialize them — is where most scripts break down.&lt;/p&gt;

&lt;p&gt;This Actor adds the plumbing the raw API doesn't: fingerprint rotation, proxy sessions, backoff on 429s, and clean validated output. You bring the coin list; we bring the resilience.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the data looks like 📤
&lt;/h2&gt;

&lt;p&gt;Each coin comes back as one flat, typed row. Every field below is from &lt;code&gt;models.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"coin_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"bitcoin"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"symbol"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"btc"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Bitcoin"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"image_url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://coin-images.coingecko.com/coins/images/1/large/bitcoin.png"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"current_price_usd"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;71250.32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"current_price_vs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;65120.10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"vs_currency"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"eur"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"price_change_24h_pct"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;2.41&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"price_change_7d_pct"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;-1.18&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"price_change_30d_pct"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;14.73&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"market_cap_usd"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1408000000000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"market_cap_rank"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"total_volume_usd"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;28540000000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"circulating_supply"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;19700000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"total_supply"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;21000000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"max_supply"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;21000000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ath_usd"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;73738.00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ath_date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2024-03-14T07:10:36.635Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"atl_usd"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;67.81&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"atl_date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2013-07-06T00:00:00.000Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"sentiment_up_pct"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;74.29&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"exchanges_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;58&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"github_stars"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"github_forks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"coingecko_url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://www.coingecko.com/en/coins/bitcoin"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"scraped_at"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-05-31T09:00:00+00:00"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Twenty-six fields per coin, Pydantic-validated before they reach your dataset. &lt;code&gt;github_stars&lt;/code&gt; and &lt;code&gt;github_forks&lt;/code&gt; populate when you enable &lt;code&gt;includeDeveloperData&lt;/code&gt;. Everything else — prices, supply, ATH/ATL, sentiment, exchange count — comes through on the default configuration.&lt;/p&gt;

&lt;h2&gt;
  
  
  The naive approach (and why it falls apart) ⚠️
&lt;/h2&gt;

&lt;p&gt;The immediate instinct for anyone who's glanced at the CoinGecko API docs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch_coins&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;coin_ids&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AsyncClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;coin_id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;coin_ids&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.coingecko.com/api/v3/coins/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;coin_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Works for three coins on your laptop. Here's where it breaks:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Rate limits shut you down without ceremony.&lt;/strong&gt; The Demo tier enforces roughly 10 requests per minute. A list of 100 coins serialized naively takes 10 minutes minimum and still 429s if your timing slips. We sleep 30 seconds on a 429, retry once, and warn loudly if it persists, so partial success surfaces rather than silently producing an empty dataset.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. TLS fingerprinting tightens under load.&lt;/strong&gt; Hit the endpoint at volume from a datacenter IP with Python's default TLS stack and the response profile changes. We rotate through Chrome 131, Chrome 124, Firefox 147, and Safari 18.0 TLS fingerprints via &lt;code&gt;curl-cffi&lt;/code&gt;, picked randomly per session, so the handshake looks like a real browser even under concurrent load.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Concurrency without a semaphore blows the limit.&lt;/strong&gt; Even &lt;code&gt;asyncio.gather()&lt;/code&gt; with ten concurrent tasks exceeds the free-tier quota instantly. We thread a &lt;code&gt;Semaphore&lt;/code&gt; around the fetch so parallel lookups stay within the configured ceiling — default 2, configurable up to 32 for Pro-key holders.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Null fields for low-coverage coins.&lt;/strong&gt; Smaller coins return incomplete payloads with missing keys or null market-data fields. A naive &lt;code&gt;data["market_data"]["current_price"]["usd"]&lt;/code&gt; raises &lt;code&gt;KeyError&lt;/code&gt; or &lt;code&gt;TypeError&lt;/code&gt;. Our &lt;code&gt;_build_row&lt;/code&gt; guards every field with &lt;code&gt;.get()&lt;/code&gt; chains and Pydantic validates the result, so nulls land as typed &lt;code&gt;None&lt;/code&gt; values rather than crashing the run.&lt;/p&gt;

&lt;p&gt;When a request is blocked, we rotate residential proxies through Apify Proxy, retry with exponential backoff on &lt;code&gt;408/429/5xx&lt;/code&gt;, and hand back Pydantic-validated rows. No data, no charge: only the small &lt;code&gt;actor-start&lt;/code&gt; warm-up fee fires if the run produces zero items.&lt;/p&gt;
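&lt;p&gt;Points 1 and 3 above can be sketched roughly as follows. This is a minimal illustration, not the Actor's source: &lt;code&gt;fetch_one&lt;/code&gt;, &lt;code&gt;fetch_all&lt;/code&gt;, and the injected &lt;code&gt;fetch&lt;/code&gt; callable are hypothetical names, and only the 30-second sleep, the single retry, and the default concurrency of 2 come from the text.&lt;/p&gt;

```python
import asyncio
import logging
from typing import Awaitable, Callable, Optional

log = logging.getLogger("coingecko-sketch")

# Hypothetical transport: takes a coin ID, returns (status_code, parsed_body).
Fetch = Callable[[str], Awaitable[tuple[int, Optional[dict]]]]


async def fetch_one(coin_id: str, fetch: Fetch, sem: asyncio.Semaphore,
                    retry_sleep: float = 30.0) -> Optional[dict]:
    """Fetch one coin, retrying once after a 429."""
    async with sem:  # never more than `concurrency` requests in flight
        for attempt in range(2):  # first try plus one retry
            status, body = await fetch(coin_id)
            if status == 200:
                return body
            if status == 429 and attempt == 0:
                # The slot stays held while sleeping, so a rate-limited
                # worker does not free capacity for more requests.
                await asyncio.sleep(retry_sleep)
                continue
            break
    log.warning("giving up on %s (last status %s)", coin_id, status)
    return None


async def fetch_all(coin_ids: list[str], fetch: Fetch, concurrency: int = 2,
                    retry_sleep: float = 30.0) -> list[dict]:
    """Fetch every coin with bounded parallelism; failed coins are skipped."""
    sem = asyncio.Semaphore(concurrency)
    results = await asyncio.gather(
        *(fetch_one(c, fetch, sem, retry_sleep) for c in coin_ids)
    )
    return [r for r in results if r is not None]
```

&lt;p&gt;Passing the transport in as a callable keeps the retry and concurrency logic testable without touching the network; in practice it would wrap whatever HTTP client you use.&lt;/p&gt;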

&lt;h2&gt;
  
  
  The Actor ⚙️
&lt;/h2&gt;

&lt;p&gt;Store listing: &lt;strong&gt;&lt;a href="https://apify.com/DevilScrapes/coingecko-crypto-scraper" rel="noopener noreferrer"&gt;apify.com/DevilScrapes/coingecko-crypto-scraper&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Paste a list of coin IDs in the Apify Console and click Start, or drive it from Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;apify_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ApifyClient&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ApifyClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_APIFY_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;actor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DevilScrapes/coingecko-crypto-scraper&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;run_input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;coinIds&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bitcoin&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ethereum&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;solana&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cardano&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chainlink&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vsCurrency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eur&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;includeDeveloperData&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;concurrency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;defaultDatasetId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;iterate_items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;current_price_usd&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;price_change_24h_pct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key input fields:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;coinIds&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;["bitcoin", "ethereum", "solana"]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;CoinGecko slug from the URL — e.g. &lt;code&gt;uniswap&lt;/code&gt;, &lt;code&gt;the-graph&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;vsCurrency&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;"eur"&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Secondary quote currency alongside USD&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;includeDeveloperData&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;false&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Adds GitHub stars / forks for coins with linked repos&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;apiKey&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;CoinGecko Pro key — lifts the 10 req/min Demo ceiling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;concurrency&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Keep at 2 on Demo; raise to 8+ with a Pro key&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Coin IDs are the lowercase slugs in the URL — &lt;code&gt;bitcoin&lt;/code&gt; is &lt;code&gt;/en/coins/bitcoin&lt;/code&gt;, &lt;code&gt;the-graph&lt;/code&gt; is &lt;code&gt;/en/coins/the-graph&lt;/code&gt;. The &lt;a href="https://docs.coingecko.com/v3.0.1/reference/coins-list" rel="noopener noreferrer"&gt;CoinGecko coin list endpoint&lt;/a&gt; returns every valid ID if you need to enumerate them first.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use cases 💡
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Portfolio tracker.&lt;/strong&gt; Schedule a daily run with your watch list, for example 20 coins with &lt;code&gt;vsCurrency: "eur"&lt;/code&gt; and &lt;code&gt;concurrency: 2&lt;/code&gt;. Each run produces one row per coin with price in USD and EUR, 24h/7d/30d change, and market cap. Pipe the dataset to a Google Sheet via the Apify API or a Make webhook. Cost: 20 coins × $0.002 plus the $0.005 &lt;code&gt;actor-start&lt;/code&gt; fee = $0.045 per day.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Alert bot.&lt;/strong&gt; Run every hour, diff &lt;code&gt;price_change_24h_pct&lt;/code&gt; against the previous run, and fire a Slack notification when any coin moves more than 10% in either direction. The &lt;code&gt;scraped_at&lt;/code&gt; ISO timestamp on each row makes the diff deterministic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trading-strategy backtest data.&lt;/strong&gt; Pull 50 coins by market cap rank, store daily snapshots in a named Apify dataset, then read back the stored runs for strategy simulation. &lt;code&gt;market_cap_rank&lt;/code&gt;, &lt;code&gt;total_volume_usd&lt;/code&gt;, and the price-change fields give you the features most trend-following strategies need.&lt;/p&gt;

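&lt;p&gt;&lt;strong&gt;Tax reporting.&lt;/strong&gt; Capital-gains math needs a daily reference price for every asset held. One scheduled run per day covering all your assets, stored with &lt;code&gt;scraped_at&lt;/code&gt;, gives you an auditable price history without a dedicated pricing-feed subscription. Because the Actor returns the price at run time rather than an exchange close, schedule it at a fixed time each day. At $0.002 per coin, 10 coins × 365 days = $7.30/year in result fees, plus 365 × $0.005 = $1.83 in &lt;code&gt;actor-start&lt;/code&gt; fees, roughly $9.13/year in total.&lt;/p&gt;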

&lt;p&gt;&lt;strong&gt;Developer-activity screening.&lt;/strong&gt; Enable &lt;code&gt;includeDeveloperData&lt;/code&gt;, then filter coins where &lt;code&gt;github_stars &amp;gt; 500&lt;/code&gt; and &lt;code&gt;price_change_30d_pct &amp;gt; 0&lt;/code&gt; — a rough proxy for active development plus market momentum. CoinGecko sources the GitHub stats from each project's linked repository.&lt;/p&gt;
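&lt;p&gt;A minimal sketch of that screen over exported dataset rows, assuming each row is a plain dict with the &lt;code&gt;github_stars&lt;/code&gt; and &lt;code&gt;price_change_30d_pct&lt;/code&gt; fields shown earlier. The function name &lt;code&gt;screen&lt;/code&gt; is hypothetical; the thresholds are the ones from the paragraph above.&lt;/p&gt;

```python
def screen(rows: list[dict], min_stars: int = 500) -> list[dict]:
    """Keep coins with enough GitHub stars and a positive 30-day change."""
    picked = []
    for row in rows:
        stars = row.get("github_stars")
        change = row.get("price_change_30d_pct")
        # Both fields can be null, so skip rows where either is missing
        # instead of comparing None with a number.
        if stars is None or change is None:
            continue
        if stars > min_stars and change > 0:
            picked.append(row)
    return picked
```

&lt;p&gt;The explicit &lt;code&gt;None&lt;/code&gt; check matters because &lt;code&gt;github_stars&lt;/code&gt; stays null unless &lt;code&gt;includeDeveloperData&lt;/code&gt; is enabled.&lt;/p&gt;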

&lt;h2&gt;
  
  
  Pricing — exact numbers 💰
&lt;/h2&gt;

&lt;p&gt;Pay-per-event. You pay for coins you receive, nothing for coins that error out.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Event&lt;/th&gt;
&lt;th&gt;Price&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;actor-start&lt;/code&gt; (once per run)&lt;/td&gt;
&lt;td&gt;$0.005&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;result&lt;/code&gt; (per coin in dataset)&lt;/td&gt;
&lt;td&gt;$0.002&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pull&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;10 coins&lt;/td&gt;
&lt;td&gt;$0.025&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100 coins&lt;/td&gt;
&lt;td&gt;$0.205&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1,000 coins&lt;/td&gt;
&lt;td&gt;$2.005&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5,000 coins&lt;/td&gt;
&lt;td&gt;$10.005&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

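&lt;p&gt;Apify's $5 free trial credit (no credit card required) covers your first 2,497 coins in a single run; every additional run adds another $0.005 &lt;code&gt;actor-start&lt;/code&gt; fee. This Actor uses CoinGecko's free Demo tier by default, so you get structured, rate-limit-managed access to the same data without a CoinGecko subscription.&lt;/p&gt;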
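&lt;p&gt;The arithmetic behind both tables is one start fee plus a per-coin fee. The sketch below reproduces it with the two prices listed above; the function names are hypothetical, and &lt;code&gt;Decimal&lt;/code&gt; is used only to avoid floating-point rounding in the cents.&lt;/p&gt;

```python
from decimal import Decimal

ACTOR_START = Decimal("0.005")  # charged once per run
PER_RESULT = Decimal("0.002")   # charged per coin in the dataset


def run_cost(coins: int) -> Decimal:
    """Cost in USD of one run that returns `coins` rows."""
    return ACTOR_START + PER_RESULT * coins


def coins_for_budget(budget: str) -> int:
    """Largest single run that fits the budget after the start fee."""
    return int((Decimal(budget) - ACTOR_START) // PER_RESULT)
```

&lt;p&gt;For example, &lt;code&gt;run_cost(100)&lt;/code&gt; gives 0.205 and &lt;code&gt;coins_for_budget("5")&lt;/code&gt; gives 2497, matching the figures above for a single run.&lt;/p&gt;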

&lt;h2&gt;
  
  
  The technically interesting bit 🧠
&lt;/h2&gt;

&lt;p&gt;CoinGecko's &lt;code&gt;/coins/{id}&lt;/code&gt; endpoint packs market data into nested dicts with currency-keyed sub-objects — &lt;code&gt;market_data.current_price.usd&lt;/code&gt;, &lt;code&gt;market_data.ath.usd&lt;/code&gt;, and so on. Low-coverage coins return partial payloads: some fields have the outer key but &lt;code&gt;null&lt;/code&gt;, some skip the key entirely, some return an empty dict &lt;code&gt;{}&lt;/code&gt; for the inner object. A straightforward &lt;code&gt;data["market_data"]["current_price"]["usd"]&lt;/code&gt; fails unpredictably across that variance.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;_build_row&lt;/code&gt; function in &lt;code&gt;scraper.py&lt;/code&gt; guards every multi-level access with &lt;code&gt;.get()&lt;/code&gt; chains such as &lt;code&gt;(market.get("ath") or {}).get("usd")&lt;/code&gt;, which returns &lt;code&gt;None&lt;/code&gt; instead of raising when the inner object is missing, &lt;code&gt;null&lt;/code&gt;, or empty. Pydantic then validates the final &lt;code&gt;ResultRow&lt;/code&gt;, and &lt;code&gt;model_dump(mode="json")&lt;/code&gt; serializes each &lt;code&gt;None&lt;/code&gt; as JSON &lt;code&gt;null&lt;/code&gt;. Your rows stay consistently typed regardless of which level of the CoinGecko response is missing, and the run never crashes on a partially-covered altcoin.&lt;/p&gt;
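&lt;p&gt;To make the guard pattern concrete, here is a cut-down sketch of the idea. It is not the real &lt;code&gt;_build_row&lt;/code&gt;: the helper &lt;code&gt;nested_usd&lt;/code&gt; and the function &lt;code&gt;build_row_sketch&lt;/code&gt; are hypothetical, and the Pydantic validation step is left out so the example runs on the standard library alone.&lt;/p&gt;

```python
from typing import Any, Optional


def nested_usd(market: Optional[dict], field: str) -> Any:
    """Return market[field]["usd"], or None if any level is missing or null."""
    # `or {}` turns a missing key, an explicit null, and an empty dict
    # into the same harmless empty mapping before the next .get().
    return ((market or {}).get(field) or {}).get("usd")


def build_row_sketch(data: dict) -> dict:
    """Hypothetical, cut-down stand-in for the article's _build_row."""
    market = data.get("market_data")
    return {
        "id": data.get("id"),
        "current_price_usd": nested_usd(market, "current_price"),
        "ath_usd": nested_usd(market, "ath"),
    }
```

&lt;p&gt;A payload with no &lt;code&gt;market_data&lt;/code&gt; at all still yields a complete row, with the price fields set to &lt;code&gt;None&lt;/code&gt;.&lt;/p&gt;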

&lt;h2&gt;
  
  
  Limitations 🚧
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Demo tier is ~10 req/min.&lt;/strong&gt; Large coin lists (500+) are slow without a CoinGecko Pro key. Attach one via &lt;code&gt;apiKey&lt;/code&gt; to lift the ceiling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Some fields are null for low-coverage coins.&lt;/strong&gt; Smaller, less-traded coins often have null &lt;code&gt;market_cap_usd&lt;/code&gt;, &lt;code&gt;total_volume_usd&lt;/code&gt;, or &lt;code&gt;exchanges_count&lt;/code&gt;. The fields are present in the schema — just typed as &lt;code&gt;number | null&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No historical OHLC.&lt;/strong&gt; This Actor fetches current market state; historical price series use a different CoinGecko endpoint and are out of scope here.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CoinGecko updates every ~60 seconds.&lt;/strong&gt; Not a real-time tick feed. For sub-second data, use an exchange WebSocket.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;exchanges_count&lt;/code&gt; reflects CoinGecko's ticker window.&lt;/strong&gt; It's the number of distinct exchanges in the tickers CoinGecko returns for the coin, which may not be every venue listing it globally. Treat it as a coverage signal, not an exhaustive count.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FAQ ❓
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Is scraping CoinGecko legal?&lt;/strong&gt;&lt;br&gt;
This Actor calls CoinGecko's own documented public API — the same endpoint CoinGecko publishes in its docs. It doesn't bypass authentication, doesn't scrape rendered HTML, and respects rate limits by backing off on 429s. CoinGecko's &lt;a href="https://www.coingecko.com/en/api/documentation" rel="noopener noreferrer"&gt;Terms of Service&lt;/a&gt; permit API access for non-commercial and commercial use subject to rate limits. Check your own use case against their terms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I export to Google Sheets or a data warehouse?&lt;/strong&gt;&lt;br&gt;
Yes. Export CSV/Excel/JSON directly from the Apify Console dataset view. Use a webhook on &lt;code&gt;ACTOR.RUN.SUCCEEDED&lt;/code&gt; to push each run's results into Make, Zapier, or n8n. Pull the &lt;a href="https://docs.apify.com/api/v2#/reference/datasets/item-collection/get-items" rel="noopener noreferrer"&gt;Apify dataset API&lt;/a&gt; directly from any pipeline that can make an HTTP call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is there a CoinGecko API?&lt;/strong&gt;&lt;br&gt;
Yes — this Actor wraps &lt;a href="https://www.coingecko.com/en/api/documentation" rel="noopener noreferrer"&gt;CoinGecko's v3 REST API&lt;/a&gt;. If you prefer to call it directly, the documentation is public. The Actor adds rate-limit management, fingerprint rotation, proxy routing, and Pydantic-typed output on top of the raw API surface.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why does my run return fewer coins than I requested?&lt;/strong&gt;&lt;br&gt;
A coin ID that doesn't exist returns a 404, which the scraper logs and skips. The usual cause is a typo in the slug — &lt;code&gt;the-graph&lt;/code&gt; not &lt;code&gt;thegraph&lt;/code&gt;, &lt;code&gt;uniswap&lt;/code&gt; not &lt;code&gt;uni&lt;/code&gt;. Check the exact slug in the CoinGecko URL.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://apify.com/DevilScrapes/coingecko-crypto-scraper" rel="noopener noreferrer"&gt;apify.com/DevilScrapes/coingecko-crypto-scraper&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Free $5 trial credit, no credit card required. Drop in &lt;code&gt;["bitcoin", "ethereum", "solana"]&lt;/code&gt;, click Start, and you'll have your first three rows in seconds. Need a larger coin universe, or a field that isn't here yet? Drop a comment — the roadmap is driven by what people build with this.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built by &lt;a href="https://apify.com/DevilScrapes" rel="noopener noreferrer"&gt;Devil Scrapes&lt;/a&gt; — Apify Actors for builders who'd rather not debug rate-limit waterfalls at 2 a.m. Pay-per-event, transparent pricing, no junk fields.&lt;/em&gt; 😈&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>python</category>
      <category>apify</category>
      <category>data</category>
    </item>
    <item>
      <title>Bluesky Starter Pack Scraper: export any community list for $2.05/1K</title>
      <dc:creator>Devil Scrapes</dc:creator>
      <pubDate>Sun, 31 May 2026 11:47:16 +0000</pubDate>
      <link>https://dev.to/devil_scrapes/bluesky-starter-pack-scraper-export-any-community-list-for-2051k-3408</link>
      <guid>https://dev.to/devil_scrapes/bluesky-starter-pack-scraper-export-any-community-list-for-2051k-3408</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Quick answer:&lt;/strong&gt; A Bluesky starter pack is a curated community list published on the AT Protocol. There is no official export button or bulk API endpoint for reading pack membership programmatically. A &lt;em&gt;Bluesky starter pack scraper&lt;/em&gt; queries the public AT Protocol AppView (&lt;code&gt;https://public.api.bsky.app/xrpc/&lt;/code&gt;) and returns every member's profile — handle, DID, follower count, post count — as clean, typed JSON. The Apify Actor below does it for &lt;strong&gt;$0.002 per member row&lt;/strong&gt; (~$2.05 per 1,000 members), handling pagination, retries, and URL normalization for you.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Starter packs drove a reported &lt;strong&gt;43% of all follows&lt;/strong&gt; during Bluesky's 2024 growth surge (&lt;a href="https://www.eurekalert.org/" rel="noopener noreferrer"&gt;EurekAlert, 2024&lt;/a&gt;). That number matters: when someone publishes "ML Researchers on Bluesky" or "London Founders" as a starter pack, they are not just recommending accounts — they are shaping the social graph of an entire professional niche. For researchers, growth marketers, and social-graph analysts, those packs are some of the most signal-dense audience lists on any platform.&lt;/p&gt;

&lt;p&gt;The problem: the Bluesky web UI shows a pack's members one scroll at a time. There is no CSV download, no endpoint labeled "give me all members", and no obvious way to turn a pack into a spreadsheet your CRM can ingest. Here is what it actually takes to get that data programmatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is a Bluesky Starter Pack? 🔎
&lt;/h2&gt;

&lt;p&gt;A Bluesky Starter Pack is a named, curator-published list of accounts, stored as a record in the &lt;a href="https://atproto.com/" rel="noopener noreferrer"&gt;AT Protocol&lt;/a&gt; — the open, federated protocol underpinning Bluesky. Any user can create one, and the platform surfaces popular packs on the onboarding screen and in discovery feeds. Each pack carries a &lt;strong&gt;title&lt;/strong&gt; and optional &lt;strong&gt;description&lt;/strong&gt;, a &lt;strong&gt;list of member DIDs&lt;/strong&gt; (the protocol's decentralized account identifiers), the &lt;strong&gt;creator's handle&lt;/strong&gt; (e.g. &lt;code&gt;pfrazee.com&lt;/code&gt;), and a stable &lt;strong&gt;AT URI&lt;/strong&gt; (&lt;code&gt;at://did:plc:.../app.bsky.graph.starterpack/...&lt;/code&gt;) that identifies it across the federated network.&lt;/p&gt;

&lt;p&gt;When a new user follows an entire pack, every member gains a follower at once. That viral follow-through is why the "43% of follows" stat exists — packs are follow-amplification engines, and the membership list is the core asset.&lt;/p&gt;

&lt;h2&gt;
  
  
  Does Bluesky have an API for starter packs?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Yes, but it does not expose keyword search or bulk export.&lt;/strong&gt; The &lt;a href="https://public.api.bsky.app/xrpc/" rel="noopener noreferrer"&gt;AT Protocol AppView&lt;/a&gt; offers three relevant lexicon methods:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;app.bsky.graph.getStarterPack&lt;/code&gt; — fetch one pack by AT URI&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;app.bsky.graph.getActorStarterPacks&lt;/code&gt; — list all packs published by a creator handle or DID&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;app.bsky.graph.getList&lt;/code&gt; — retrieve member profiles from the list embedded in a pack&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What you cannot do: search packs by keyword. The lexicon defines &lt;code&gt;app.bsky.graph.searchStarterPacks&lt;/code&gt; but the public AppView returns &lt;code&gt;XRPCNotSupported&lt;/code&gt; (HTTP 404). Discovery by plain-text query is not available on the public tier — to find packs in a topic area, identify a prominent curator and enumerate their packs via &lt;code&gt;creatorHandle&lt;/code&gt;.&lt;/p&gt;
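&lt;p&gt;For orientation, this is roughly how request URLs for those methods are put together. The helper &lt;code&gt;xrpc_url&lt;/code&gt; is hypothetical, and the query parameter names (&lt;code&gt;starterPack&lt;/code&gt;, &lt;code&gt;actor&lt;/code&gt;) are an assumption based on the published lexicons, so check them against the current lexicon definitions before relying on them.&lt;/p&gt;

```python
from urllib.parse import urlencode

BASE = "https://public.api.bsky.app/xrpc/"


def xrpc_url(method: str, **params: str) -> str:
    """Build a GET URL for a public AppView lexicon method."""
    return f"{BASE}{method}?{urlencode(params)}"


# Parameter names below are assumptions taken from the lexicons.
pack_url = xrpc_url(
    "app.bsky.graph.getStarterPack",
    starterPack="at://did:plc:abc123/app.bsky.graph.starterpack/xyz789",
)
creator_url = xrpc_url(
    "app.bsky.graph.getActorStarterPacks",
    actor="alice.bsky.social",
)
```

&lt;p&gt;The AT URI has to be percent-encoded because it contains colons and slashes; &lt;code&gt;urlencode&lt;/code&gt; takes care of that.&lt;/p&gt;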

&lt;h2&gt;
  
  
  What the data looks like
&lt;/h2&gt;

&lt;p&gt;Every member row comes back flat and typed — pack metadata is denormalized onto each row so a single CSV export is self-contained. The output shape from &lt;code&gt;src/models.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"pack_uri"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"at://did:plc:abc123/app.bsky.graph.starterpack/xyz789"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"pack_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AI Researchers on Bluesky"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"pack_description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Curated list of ML/AI researchers who migrated from Twitter."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"pack_creator_handle"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"alice.bsky.social"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"member_did"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"did:plc:def456"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"member_handle"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"bob.bsky.social"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"member_display_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Bob Smith"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"member_followers_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1204&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"member_following_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;380&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"member_posts_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;841&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"member_indexed_at"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2024-11-14T09:22:01.000Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"scraped_at"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-05-16T12:00:00.000Z"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Twelve fields per row. &lt;code&gt;pack_uri&lt;/code&gt;, &lt;code&gt;pack_name&lt;/code&gt;, &lt;code&gt;pack_creator_handle&lt;/code&gt;, &lt;code&gt;member_did&lt;/code&gt;, &lt;code&gt;member_handle&lt;/code&gt;, and &lt;code&gt;scraped_at&lt;/code&gt; are always present. The count fields, &lt;code&gt;member_display_name&lt;/code&gt;, and &lt;code&gt;member_indexed_at&lt;/code&gt; are nullable — if the API omits them for a profile, the Actor emits the row with those fields set to &lt;code&gt;null&lt;/code&gt; rather than dropping it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The naive approach (and why it falls apart)
&lt;/h2&gt;

&lt;p&gt;The AT Protocol is designed for open access, which is genuinely unusual. The &lt;code&gt;https://public.api.bsky.app/xrpc/&lt;/code&gt; base URL needs no session token. A first response is minutes away. The complexity lives elsewhere.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cursor pagination across chained endpoints.&lt;/strong&gt; Getting all members is not a single call. You call &lt;code&gt;getStarterPack&lt;/code&gt; to learn the embedded list AT URI, then &lt;code&gt;getList&lt;/code&gt; with that URI, which returns a page of members plus a cursor. You loop until the response omits the cursor — for large packs, a dozen sequential calls that all need to succeed and reassemble in order. We thread that loop cleanly, applying the &lt;code&gt;maxMembersPerPack&lt;/code&gt; cap as a client-side guard to bound run cost.&lt;/p&gt;
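&lt;p&gt;The cursor loop can be sketched as below. This is an illustration rather than the Actor's code: &lt;code&gt;collect_members&lt;/code&gt; and the injected &lt;code&gt;get_page&lt;/code&gt; callable are hypothetical, and the response is assumed to carry an &lt;code&gt;items&lt;/code&gt; array plus an optional &lt;code&gt;cursor&lt;/code&gt;.&lt;/p&gt;

```python
from typing import Callable, Optional

# Hypothetical page fetcher: (list_uri, cursor) -> parsed getList response.
PageFetcher = Callable[[str, Optional[str]], dict]


def collect_members(list_uri: str, get_page: PageFetcher,
                    max_members: Optional[int] = None) -> list[dict]:
    """Follow cursors until the list is exhausted or the cap is reached."""
    members: list[dict] = []
    cursor: Optional[str] = None
    while True:
        page = get_page(list_uri, cursor)
        members.extend(page.get("items", []))
        if max_members is not None and len(members) >= max_members:
            return members[:max_members]  # client-side cap bounds run cost
        cursor = page.get("cursor")
        if not cursor:  # no cursor means this was the last page
            return members
```

&lt;p&gt;Checking the cap before reading the next cursor means a capped run stops as soon as it has enough rows instead of fetching one more page.&lt;/p&gt;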

&lt;p&gt;&lt;strong&gt;Nested record structure.&lt;/strong&gt; The &lt;code&gt;starterPack&lt;/code&gt; view object puts &lt;code&gt;uri&lt;/code&gt; and &lt;code&gt;creator&lt;/code&gt; at the top level, but &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;description&lt;/code&gt; live inside a nested &lt;code&gt;record&lt;/code&gt; sub-object. A naive parser reading &lt;code&gt;pack.get("name")&lt;/code&gt; returns &lt;code&gt;None&lt;/code&gt; every time (and &lt;code&gt;pack["name"]&lt;/code&gt; raises &lt;code&gt;KeyError&lt;/code&gt;); the correct path is &lt;code&gt;pack["record"]["name"]&lt;/code&gt;. We pin the parser against this shape and validate output with Pydantic before it reaches your dataset.&lt;/p&gt;
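
&lt;p&gt;A small sketch of the lookup difference, using a hand-written sample shaped like the description above rather than a captured API response. The helper name &lt;code&gt;pack_name&lt;/code&gt; and all sample values are illustrative assumptions.&lt;/p&gt;

```python
from typing import Optional


def pack_name(pack_view: dict) -> Optional[str]:
    """Read the pack name from the nested record, tolerating missing keys."""
    return pack_view.get("record", {}).get("name")


# Hand-written sample: uri and creator at the top level, name inside "record".
sample = {
    "uri": "at://did:plc:example/app.bsky.graph.starterpack/abc123rkey",
    "creator": {"handle": "alice.bsky.social"},
    "record": {"name": "ML Researchers", "description": "People to follow"},
}

top_level = sample.get("name")  # the key is not at the top level
nested = pack_name(sample)      # read through the nested record instead
```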

&lt;p&gt;&lt;strong&gt;Retry discipline on 429/503.&lt;/strong&gt; The AT Protocol does not publish rate limits, but it enforces them. On &lt;code&gt;429&lt;/code&gt;/&lt;code&gt;503&lt;/code&gt; we retry with exponential backoff — base 2 seconds, doubling, capped at 30 seconds, up to 5 attempts — honouring &lt;code&gt;Retry-After&lt;/code&gt; when present. We rotate the curl-cffi browser fingerprint (Chrome / Firefox / Safari TLS profiles) so the handshake looks like a real browser, not a Python script.&lt;/p&gt;
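
&lt;p&gt;The backoff schedule above works out as in the sketch below. It only computes delays; it sends no requests. Treating "attempts" as retries numbered from 1, and handling only the seconds form of &lt;code&gt;Retry-After&lt;/code&gt;, are assumptions made for illustration.&lt;/p&gt;

```python
from typing import Optional

BASE_SECONDS = 2.0   # first delay
CAP_SECONDS = 30.0   # upper bound on any computed delay
MAX_ATTEMPTS = 5     # assumed to mean retries, numbered 1..5
RETRYABLE = {429, 503}


def backoff_delay(attempt: int, retry_after: Optional[str] = None) -> float:
    """Seconds to wait before retry number `attempt` (1-based)."""
    if retry_after is not None:
        try:
            # Honour the server's hint when it is given as a number of seconds.
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # the HTTP-date form is not handled in this sketch
    return min(CAP_SECONDS, BASE_SECONDS * 2 ** (attempt - 1))


def should_retry(status: int, attempt: int) -> bool:
    """Retry only on the listed status codes and while attempts remain."""
    return status in RETRYABLE and MAX_ATTEMPTS >= attempt


schedule = [backoff_delay(n) for n in range(1, MAX_ATTEMPTS + 1)]
```

&lt;p&gt;Under these assumptions the fifth delay would be 32 seconds uncapped, so the 30-second cap only takes effect on the last retry.&lt;/p&gt;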

&lt;p&gt;&lt;strong&gt;Pydantic-validated rows or nothing.&lt;/strong&gt; Every row passes through &lt;code&gt;ResultRow.model_validate(...)&lt;/code&gt; before &lt;code&gt;Actor.push_data(...)&lt;/code&gt; writes it. If the API contract drifts and a required field disappears, the Actor fails loudly with a clear status message rather than emitting garbage. No data, no charge.&lt;/p&gt;

&lt;p&gt;None of this is insurmountable to build yourself. All of it is what you skip by using the Actor.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Actor
&lt;/h2&gt;

&lt;p&gt;The packaged result is on the Apify Store: &lt;strong&gt;&lt;a href="https://apify.com/DevilScrapes/bluesky-starter-pack" rel="noopener noreferrer"&gt;Bluesky Starter Pack Scraper&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Two modes — single pack by URI or URL, or bulk export of every pack owned by a creator handle.&lt;/p&gt;

&lt;p&gt;Run it via the Apify API client for Python (&lt;code&gt;apify-client&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;apify_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ApifyClient&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ApifyClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_APIFY_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Single-pack mode — paste a bsky.app URL directly; the Actor normalizes it
&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;actor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DevilScrapes/bluesky-starter-pack&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;run_input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;starterPackUri&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://bsky.app/starter-pack/pfrazee.com/3l2stmy4ote2b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maxMembersPerPack&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;defaultDatasetId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;iterate_items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;member_handle&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;member_followers_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or in creator mode — export every pack a curator has published:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;actor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DevilScrapes/bluesky-starter-pack&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;run_input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;creatorHandle&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pfrazee.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maxPacks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maxMembersPerPack&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;starterPackUri&lt;/code&gt; field accepts both AT URIs and Bluesky web URLs; the Actor normalizes web URLs to AT URI form before the first network call — no manual conversion required.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use cases
&lt;/h2&gt;

&lt;p&gt;Five concrete scenarios where this data earns its keep:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Academic community-topology research.&lt;/strong&gt; Export a pack, load it into &lt;a href="https://networkx.org/" rel="noopener noreferrer"&gt;NetworkX&lt;/a&gt;, and run community detection. The &lt;code&gt;member_followers_count&lt;/code&gt; and &lt;code&gt;member_following_count&lt;/code&gt; fields give you node attributes for weighting or filtering without a second API call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;B2B growth marketing.&lt;/strong&gt; Find packs in your industry (e.g. "ML Researchers on Bluesky", "London Founders") via their curator handles, export membership, filter by &lt;code&gt;member_followers_count &amp;gt;= 500&lt;/code&gt;, and you have a list of niche-verified, high-signal accounts — warm context for relevant replies, not cold outreach against scraped emails.&lt;/p&gt;
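
&lt;p&gt;The follower filter might look like the sketch below. Because &lt;code&gt;member_followers_count&lt;/code&gt; is nullable, a missing count is treated as zero here, which is an assumption about how you want to handle unknowns. The sample rows are invented.&lt;/p&gt;

```python
def high_signal(rows: list, min_followers: int = 500) -> list:
    """Keep rows whose follower count meets the threshold; null counts are excluded."""
    return [
        row for row in rows
        if (row.get("member_followers_count") or 0) >= min_followers
    ]


# Invented sample rows using field names from the dataset schema.
rows = [
    {"member_handle": "a.example", "member_followers_count": 1200},
    {"member_handle": "b.example", "member_followers_count": 40},
    {"member_handle": "c.example", "member_followers_count": None},
    {"member_handle": "d.example", "member_followers_count": 500},
]

shortlist = [row["member_handle"] for row in high_signal(rows)]
```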

&lt;p&gt;&lt;strong&gt;Competitive intelligence.&lt;/strong&gt; Track which accounts influential curators are recommending. A startup appearing in three separate "AI tools" packs in one week is a signal worth noticing. Schedule a weekly run and diff the membership list.&lt;/p&gt;
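
&lt;p&gt;Diffing two weekly exports can be as simple as a set comparison keyed on &lt;code&gt;member_did&lt;/code&gt;, since handles can change between runs. The function name and sample rows below are illustrative only.&lt;/p&gt;

```python
def diff_members(previous: list, current: list) -> dict:
    """Compare two exports of the same pack by DID and report who joined or left."""
    prev = {row["member_did"] for row in previous}
    curr = {row["member_did"] for row in current}
    return {"added": sorted(curr - prev), "removed": sorted(prev - curr)}


# Invented exports from two consecutive weekly runs.
last_week = [{"member_did": "did:plc:aaa"}, {"member_did": "did:plc:bbb"}]
this_week = [{"member_did": "did:plc:bbb"}, {"member_did": "did:plc:ccc"}]

changes = diff_members(last_week, this_week)
```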

&lt;p&gt;&lt;strong&gt;Social graph analysis.&lt;/strong&gt; Compare follower distributions and post activity across a curated peer group. Export "AI Researchers" and "AI Founders" packs and measure whether the founder community posts more or follows more — one flat CSV, no joins.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OSINT and journalism.&lt;/strong&gt; Document community formation around an event. A pack published during breaking news captures a snapshot of who organized around it — a record the AT Protocol's mutable graph will not preserve indefinitely.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pricing — exact numbers 💰
&lt;/h2&gt;

&lt;p&gt;Pay-per-event. You pay for rows written to the dataset, not for idle compute or pages fetched. No data, no charge (beyond the small start fee).&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Event&lt;/th&gt;
&lt;th&gt;Price&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Actor start (once per run)&lt;/td&gt;
&lt;td&gt;$0.05&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Member row emitted&lt;/td&gt;
&lt;td&gt;$0.002&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Members scraped&lt;/th&gt;
&lt;th&gt;Total cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;100 members&lt;/td&gt;
&lt;td&gt;$0.25&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;500 members&lt;/td&gt;
&lt;td&gt;$1.05&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1,000 members&lt;/td&gt;
&lt;td&gt;$2.05&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5,000 members (max per pack)&lt;/td&gt;
&lt;td&gt;$10.05&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

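&lt;p&gt;Apify's $5 free trial credit covers about 2,475 member rows in a single run ($4.95 left after the $0.05 start fee, divided by $0.002 per row), enough to export several typical starter packs, with no credit card required. Splitting the credit across several runs costs one extra start fee per run.&lt;/p&gt;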
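
&lt;p&gt;The totals in the table follow from one start fee plus a per-row price. The sketch below reproduces that arithmetic with &lt;code&gt;Decimal&lt;/code&gt; to avoid float rounding; the function names are illustrative, and the prices are the ones listed above.&lt;/p&gt;

```python
from decimal import Decimal

START_FEE = Decimal("0.05")   # charged once per run
PER_ROW = Decimal("0.002")    # charged per member row emitted


def run_cost(rows: int) -> Decimal:
    """Total cost of one run that emits `rows` member rows."""
    return START_FEE + PER_ROW * rows


def rows_covered(credit: Decimal) -> int:
    """Whole rows a single run can emit within `credit`."""
    if START_FEE >= credit:
        return 0
    return int((credit - START_FEE) / PER_ROW)


table = {n: run_cost(n) for n in (100, 500, 1000, 5000)}
free_rows = rows_covered(Decimal("5"))
```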

&lt;h2&gt;
  
  
  The technically interesting part
&lt;/h2&gt;

&lt;p&gt;The Actor's URL-normalization step is worth understanding if you build anything else on the AT Protocol.&lt;/p&gt;

&lt;p&gt;A Bluesky web URL looks like &lt;code&gt;https://bsky.app/starter-pack/alice.bsky.social/abc123rkey&lt;/code&gt;. The canonical form is &lt;code&gt;at://alice.bsky.social/app.bsky.graph.starterpack/abc123rkey&lt;/code&gt;. The Actor's &lt;code&gt;ActorInput&lt;/code&gt; model handles this in a &lt;code&gt;@field_validator&lt;/code&gt; that fires before any network call: strip the &lt;code&gt;https://bsky.app/starter-pack/&lt;/code&gt; prefix, split on &lt;code&gt;/&lt;/code&gt; for handle and rkey, reassemble the AT URI. Pass a malformed URL — missing rkey, extra segments — and Pydantic raises a &lt;code&gt;ValidationError&lt;/code&gt; before the run bills you for the start event.&lt;/p&gt;
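
&lt;p&gt;A standalone sketch of that normalization, written as a plain function rather than the Actor's actual Pydantic validator. The function name, error messages, and the choice to pass AT URIs through unchanged are assumptions.&lt;/p&gt;

```python
PREFIX = "https://bsky.app/starter-pack/"
COLLECTION = "app.bsky.graph.starterpack"


def to_at_uri(value: str) -> str:
    """Convert a bsky.app starter-pack URL to AT URI form; pass AT URIs through."""
    if value.startswith("at://"):
        return value
    if not value.startswith(PREFIX):
        raise ValueError(f"not a starter-pack URL: {value}")
    # What remains after the prefix should be exactly "handle/rkey".
    parts = value[len(PREFIX):].strip("/").split("/")
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"expected handle and rkey, got: {value}")
    handle, rkey = parts
    return f"at://{handle}/{COLLECTION}/{rkey}"


at_uri = to_at_uri("https://bsky.app/starter-pack/alice.bsky.social/abc123rkey")
```

&lt;p&gt;In a Pydantic model the same body could sit inside a &lt;code&gt;@field_validator&lt;/code&gt;, where a raised &lt;code&gt;ValueError&lt;/code&gt; surfaces as a &lt;code&gt;ValidationError&lt;/code&gt;.&lt;/p&gt;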

&lt;p&gt;The broader point: AT Protocol tools must be careful about this translation layer. The web UI uses handles; the protocol uses DIDs. Denormalizing &lt;code&gt;pack_creator_handle&lt;/code&gt; (human-readable) onto every row while storing &lt;code&gt;member_did&lt;/code&gt; (machine-canonical) is deliberate — the handle can change; the DID cannot.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limitations (the honest list)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No keyword search across packs.&lt;/strong&gt; Discovery is by creator handle, not by topic keyword. The &lt;code&gt;app.bsky.graph.searchStarterPacks&lt;/code&gt; lexicon method returns &lt;code&gt;XRPCNotSupported&lt;/code&gt; on the public AppView.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Public packs only.&lt;/strong&gt; Private or invite-only packs are not visible to the unauthenticated API.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Current membership only.&lt;/strong&gt; The AT Protocol does not expose member join or leave history. This Actor captures a snapshot, not a timeline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No post content.&lt;/strong&gt; Member posts are out of scope. Use the companion &lt;a href="https://apify.com/DevilScrapes/bluesky-feed-posts" rel="noopener noreferrer"&gt;Bluesky Feed Posts Actor&lt;/a&gt; for post data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Member cap per pack.&lt;/strong&gt; &lt;code&gt;maxMembersPerPack&lt;/code&gt; defaults to 500 (max 5,000). At 5,000 members the per-run cost is $10.05.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;7-day dataset retention on Apify's free plan.&lt;/strong&gt; Export your results or use a named dataset if you need longer retention.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Is scraping Bluesky starter packs compliant with their Terms of Service?&lt;/strong&gt;&lt;br&gt;
Yes. The AT Protocol is an open protocol designed for interoperability and data portability. The public AppView (&lt;code&gt;public.api.bsky.app&lt;/code&gt;) is the official mechanism for unauthenticated access. Bluesky has publicly proposed a scraping standard for AI training datasets (&lt;a href="https://slashdot.org/" rel="noopener noreferrer"&gt;Slashdot, 2025&lt;/a&gt;). This Actor reads only what the public API exposes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is there an official Bluesky API I could use instead?&lt;/strong&gt;&lt;br&gt;
The three lexicon methods the Actor uses (&lt;code&gt;getStarterPack&lt;/code&gt;, &lt;code&gt;getActorStarterPacks&lt;/code&gt;, &lt;code&gt;getList&lt;/code&gt;) are the official public API. The Actor saves you the pagination loop, URL normalization, retry handling, and Pydantic validation; it is not a workaround for a locked endpoint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I export the data to Google Sheets or a data warehouse?&lt;/strong&gt;&lt;br&gt;
Yes. Export CSV, JSON, or Excel from the Apify Console, or webhook the dataset on &lt;code&gt;ACTOR.RUN.SUCCEEDED&lt;/code&gt; into Make, Zapier, or n8n. The flat row schema loads into any spreadsheet or SQL table without a join.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does this export a user's full follower list, or just pack members?&lt;/strong&gt;&lt;br&gt;
Pack members only. The Actor scopes to the membership of a specific pack (or all packs by a creator). It does not enumerate the general follower graph of any account.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;The Actor is on the Apify Store: &lt;strong&gt;&lt;a href="https://apify.com/DevilScrapes/bluesky-starter-pack" rel="noopener noreferrer"&gt;apify.com/DevilScrapes/bluesky-starter-pack&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Free $5 trial credit, no credit card. Paste a &lt;code&gt;bsky.app&lt;/code&gt; URL, get a flat JSON dataset of every member in under a minute. Researching a niche, mapping a community, or building an outreach workflow? Start there.&lt;/p&gt;

&lt;p&gt;If there is a field or an AT Protocol endpoint you wish this Actor covered, drop it in the comments.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built by &lt;a href="https://apify.com/DevilScrapes" rel="noopener noreferrer"&gt;Devil Scrapes&lt;/a&gt; — Apify Actors for the data-hungry. Pay-per-event, honest pricing, typed rows.&lt;/em&gt; 😈&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>python</category>
      <category>apify</category>
      <category>data</category>
    </item>
    <item>
      <title>Bluesky Feed Scraper: export any custom or algorithm feed to clean JSON</title>
      <dc:creator>Devil Scrapes</dc:creator>
      <pubDate>Sun, 31 May 2026 11:42:02 +0000</pubDate>
      <link>https://dev.to/devil_scrapes/bluesky-feed-scraper-export-any-custom-or-algorithm-feed-to-clean-json-52a4</link>
      <guid>https://dev.to/devil_scrapes/bluesky-feed-scraper-export-any-custom-or-algorithm-feed-to-clean-json-52a4</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Quick answer:&lt;/strong&gt; The &lt;a href="https://bsky.app" rel="noopener noreferrer"&gt;Bluesky&lt;/a&gt; AT Protocol exposes a public, unauthenticated &lt;code&gt;app.bsky.feed.getFeed&lt;/code&gt; endpoint that returns posts from any custom or algorithm feed. To get that data as a flat, analytics-ready CSV or JSON — with engagement counts and feed metadata on every row — you call that endpoint with cursor pagination, stitch in metadata from &lt;code&gt;getFeedGenerator&lt;/code&gt;, normalise DIDs, and handle backoff on rate limits. The &lt;a href="https://apify.com/DevilScrapes/bluesky-feed-posts" rel="noopener noreferrer"&gt;Bluesky Feed Posts Scraper&lt;/a&gt; does it for &lt;strong&gt;$0.002 per post&lt;/strong&gt; (~$2.05 per 1,000), no Bluesky account required.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Bluesky ships a &lt;a href="https://docs.bsky.app/docs/tutorials/creating-a-feed" rel="noopener noreferrer"&gt;Feed Generator&lt;/a&gt; protocol that lets any developer publish an algorithm. The result is thousands of community-curated feeds (topic feeds, language feeds, niche hobby feeds), each one a named curator's idea of what the most relevant posts look like. And unlike Twitter/X's opaque ranked feed, the curation logic and the post inventory are publicly queryable.&lt;/p&gt;

&lt;p&gt;The catch is that "publicly queryable" and "easily queryable" are not the same sentence. The AT Protocol surfaces the data across three separate endpoints, each cursor-paginated, each returning nested structures that need denormalising before they're useful in a spreadsheet. This is the gap this Actor fills.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Bluesky? 🔎
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://bsky.app" rel="noopener noreferrer"&gt;Bluesky&lt;/a&gt; is a decentralised social platform built on the &lt;a href="https://atproto.com" rel="noopener noreferrer"&gt;AT Protocol&lt;/a&gt; — an open federated standard for social data. Every post, like, and follow is a signed, addressable record stored in a Personal Data Server (PDS). A public AppView aggregates those records and serves them over a JSON RPC API at &lt;code&gt;public.api.bsky.app/xrpc/&lt;/code&gt; — no authentication required, no API key, no OAuth dance.&lt;/p&gt;

&lt;p&gt;The feed system sits on top of this: a Feed Generator is a small service that responds to &lt;code&gt;getFeed&lt;/code&gt; calls with a list of post URIs. Bluesky's own algorithms ("Discover", "What's Hot", "With Friends") are feed generators, and so are the thousands of community-built ones. Each has a stable AT URI in the form &lt;code&gt;at://did:plc:.../app.bsky.feed.generator/&amp;lt;rkey&amp;gt;&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Does Bluesky have an API for feed posts? 🔌
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Yes — and it is intentionally public.&lt;/strong&gt; The &lt;a href="https://atproto.com/guides/lexicon" rel="noopener noreferrer"&gt;AT Protocol AppView&lt;/a&gt; exposes &lt;code&gt;app.bsky.feed.getFeed&lt;/code&gt; for cursor-paginated post retrieval, &lt;code&gt;app.bsky.feed.getFeedGenerator&lt;/code&gt; for feed metadata, and &lt;code&gt;app.bsky.feed.getActorFeeds&lt;/code&gt; to enumerate every feed a creator has published. All three are unauthenticated. Bluesky's open-protocol commitments mean this is not an accident: the data portability is by design.&lt;/p&gt;

&lt;p&gt;What the API does &lt;em&gt;not&lt;/em&gt; give you: a flat, analytics-ready row with feed metadata and engagement counts already joined. It gives you nested JSON, positional cursors, and a DID per author that you carry through the schema yourself. That joining and validation is the whole job.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the data looks like 📋
&lt;/h2&gt;

&lt;p&gt;Each post comes back as one typed, denormalised row. Feed metadata — display name, creator handle, description — lands on every row so a CSV export is self-contained. Here is a real record from the "Discover" feed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"feed_uri"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"at://did:plc:z72i7hdynmk6r22z27h6tvur/app.bsky.feed.generator/whats-hot"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"feed_display_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Discover"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"feed_creator_handle"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"bsky.app"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"feed_description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Trending content from your personal network"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"post_uri"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"at://did:plc:sj5wj7libgr7omqiotenxadx/app.bsky.feed.post/3mlxmr4jyfs2s"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"post_cid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"bafyreidgimgd7v3g3pazsp5oq7ur6bvedpnwohul26mss7cbffg6bdqjkm"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"post_indexed_at"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-05-16T10:20:40.467Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"post_text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"If you never read the book or saw the movie, you missed one of the greatest Pulitzer Prize winning sagas ever written."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"post_lang"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"en"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"post_reply_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;89&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"post_repost_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;414&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"post_like_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1288&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"post_quote_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;27&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"author_did"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"did:plc:sj5wj7libgr7omqiotenxadx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"author_handle"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"louiseplease.bsky.social"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"author_display_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Louise"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"scraped_at"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-05-16T12:00:00+00:00"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Seventeen fields, Pydantic-validated before they're written. Every optional field (&lt;code&gt;feed_description&lt;/code&gt;, &lt;code&gt;post_lang&lt;/code&gt;, &lt;code&gt;author_display_name&lt;/code&gt;) is &lt;code&gt;null&lt;/code&gt; when the API omits it — rows are never dropped for missing optional data.&lt;/p&gt;

&lt;h2&gt;
  
  
  The naive approach (and why it falls apart) 🛠️
&lt;/h2&gt;

&lt;p&gt;The obvious path: hit &lt;code&gt;https://public.api.bsky.app/xrpc/app.bsky.feed.getFeed?feed=at://...&amp;amp;limit=100&lt;/code&gt;, parse the JSON, paginate until the cursor runs out. It works for the first page. Then the edges appear.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. DID resolution and AT URI construction.&lt;/strong&gt; Bluesky web URLs look like &lt;code&gt;bsky.app/profile/bsky.app/feed/whats-hot&lt;/code&gt;. The AT Protocol wants &lt;code&gt;at://did:plc:z72i7hdynmk6r22z27h6tvur/app.bsky.feed.generator/whats-hot&lt;/code&gt;. Those are not the same string. The DID is what &lt;code&gt;app.bsky.actor.getProfile&lt;/code&gt; returns when you look up the handle &lt;code&gt;bsky.app&lt;/code&gt;, and a single typo in it returns &lt;code&gt;{"error":"InvalidRequest","message":"could not find feed"}&lt;/code&gt;. We resolve web URLs to AT URIs via &lt;code&gt;getProfile&lt;/code&gt; on every run so you never have to track which DID is current.&lt;/p&gt;
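
&lt;p&gt;The translation from web URL to AT URI can be sketched as below. The &lt;code&gt;resolve_did&lt;/code&gt; argument is a hypothetical stand-in for the &lt;code&gt;getProfile&lt;/code&gt; lookup, replaced here by a dictionary so the example runs offline; the function name is also illustrative.&lt;/p&gt;

```python
from typing import Callable
from urllib.parse import urlparse


def feed_url_to_at_uri(url: str, resolve_did: Callable[[str], str]) -> str:
    """Turn bsky.app/profile/HANDLE/feed/RKEY into at://DID/app.bsky.feed.generator/RKEY."""
    full = url if "://" in url else "https://" + url
    parts = [p for p in urlparse(full).path.split("/") if p]
    if len(parts) != 4 or parts[0] != "profile" or parts[2] != "feed":
        raise ValueError(f"not a feed URL: {url}")
    handle, rkey = parts[1], parts[3]
    return f"at://{resolve_did(handle)}/app.bsky.feed.generator/{rkey}"


# Offline stand-in for the handle-to-DID lookup (the DID is the one quoted above).
KNOWN_DIDS = {"bsky.app": "did:plc:z72i7hdynmk6r22z27h6tvur"}

uri = feed_url_to_at_uri("bsky.app/profile/bsky.app/feed/whats-hot", KNOWN_DIDS.__getitem__)
```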

&lt;p&gt;&lt;strong&gt;2. Feed metadata is on a different endpoint.&lt;/strong&gt; &lt;code&gt;getFeed&lt;/code&gt; returns post URIs and engagement counts. It does not return the feed's display name, description, or creator handle — those come from &lt;code&gt;getFeedGenerator&lt;/code&gt;. For a self-contained dataset you need both calls stitched together. We make the &lt;code&gt;getFeedGenerator&lt;/code&gt; call once per feed and denormalise the result onto every row.&lt;/p&gt;
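
&lt;p&gt;Denormalising means copying the same feed-level fields onto each post row. A minimal sketch, with invented post values and field names taken from the sample record:&lt;/p&gt;

```python
def denormalise(feed_meta: dict, posts: list) -> list:
    """Return one flat row per post, each carrying a copy of the feed metadata."""
    return [{**feed_meta, **post} for post in posts]


# Fetched once per feed (values copied from the sample record above).
feed_meta = {
    "feed_display_name": "Discover",
    "feed_creator_handle": "bsky.app",
}

# Invented post rows standing in for parsed getFeed results.
posts = [
    {"post_uri": "at://did:plc:example/app.bsky.feed.post/1", "post_like_count": 10},
    {"post_uri": "at://did:plc:example/app.bsky.feed.post/2", "post_like_count": 3},
]

rows = denormalise(feed_meta, posts)
```

&lt;p&gt;The repetition costs some storage, but each exported row then makes sense without a join.&lt;/p&gt;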

&lt;p&gt;&lt;strong&gt;3. Cursor pagination and the client-side cap.&lt;/strong&gt; The public AppView paginates at up to 100 posts per page. A feed with 500 posts needs 5 round trips with cursor threading. A feed with an undefined number of posts needs a sensible client-side cap so the run does not accumulate unbounded cost. We thread cursors, respect the per-feed cap you set (&lt;code&gt;maxPostsPerFeed&lt;/code&gt;, default 100, max 5,000), and stop cleanly when the cursor is exhausted or the cap is hit.&lt;/p&gt;
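
&lt;p&gt;One way to picture the cap is as a plan of per-request &lt;code&gt;limit&lt;/code&gt; values, assuming the 100-post page size mentioned above. The helper name is illustrative, and a real run may stop earlier if the cursor runs out first.&lt;/p&gt;

```python
PAGE_SIZE = 100  # maximum posts per page, as stated above


def page_limits(max_posts: int) -> list:
    """Per-request limit values needed to fetch at most `max_posts` posts."""
    limits = []
    remaining = max_posts
    while remaining > 0:
        take = min(PAGE_SIZE, remaining)
        limits.append(take)
        remaining -= take
    return limits


five_pages = page_limits(500)
partial_last = page_limits(250)
```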

&lt;p&gt;&lt;strong&gt;4. Rate limits at scale.&lt;/strong&gt; We retry on &lt;code&gt;408&lt;/code&gt;, &lt;code&gt;429&lt;/code&gt;, and &lt;code&gt;503&lt;/code&gt; with exponential backoff — base 2 seconds, doubling each attempt, capped at 30 seconds, up to 5 attempts per request — and honour &lt;code&gt;Retry-After&lt;/code&gt; headers when the API sets them. We rotate browser fingerprints via &lt;code&gt;curl-cffi&lt;/code&gt; so the TLS handshake looks like a real browser client, not a Python script. On a partial-success run we surface the count via &lt;code&gt;Actor.set_status_message&lt;/code&gt; rather than returning a green status with a silently truncated dataset.&lt;/p&gt;

&lt;p&gt;None of this is conceptually hard. It is just engineering tax that adds up to a weekend from scratch.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Actor 🚀
&lt;/h2&gt;

&lt;p&gt;I packaged the result as an Apify Actor: &lt;strong&gt;&lt;a href="https://apify.com/DevilScrapes/bluesky-feed-posts" rel="noopener noreferrer"&gt;Bluesky Feed Posts Scraper&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Paste a feed URI or creator handle in the Apify Console and click Start, or run it programmatically via the Apify API client for Python (&lt;code&gt;apify-client&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;apify_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ApifyClient&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ApifyClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_APIFY_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Single-feed mode — pull up to 200 posts from Bluesky's Discover feed
&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;actor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DevilScrapes/bluesky-feed-posts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;run_input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;feedUri&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;at://did:plc:z72i7hdynmk6r22z27h6tvur/app.bsky.feed.generator/whats-hot&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maxPostsPerFeed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;defaultDatasetId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;iterate_items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;author_handle&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;post_like_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;post_text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][:&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or in creator-discovery mode — enumerate every feed a creator publishes and pull posts from each:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;actor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DevilScrapes/bluesky-feed-posts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;run_input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;creatorHandle&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bsky.app&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maxPostsPerFeed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maxFeeds&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two input modes, mutually exclusive. Setting both &lt;code&gt;feedUri&lt;/code&gt; and &lt;code&gt;creatorHandle&lt;/code&gt; causes the Actor to fail fast with a clear error before any network call — no silent half-runs. You can also paste a &lt;code&gt;bsky.app&lt;/code&gt; web URL directly into &lt;code&gt;feedUri&lt;/code&gt; — the Actor converts it to AT URI form automatically. The raw AT URI skips the resolution step and runs slightly faster.&lt;/p&gt;
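&lt;p&gt;The two rules above can be sketched as a small pre-flight check. This is an illustration, not the Actor's source: &lt;code&gt;validate_input&lt;/code&gt; and &lt;code&gt;to_at_uri&lt;/code&gt; are hypothetical names, rejecting an input with neither field is an assumption, and the handle-to-DID resolution a real client would need is left out.&lt;/p&gt;

```python
from urllib.parse import urlparse

FEED_COLLECTION = "app.bsky.feed.generator"


def validate_input(run_input: dict) -> str:
    """Return the active mode ("feed" or "creator"), or raise if ambiguous."""
    has_feed = bool(run_input.get("feedUri"))
    has_creator = bool(run_input.get("creatorHandle"))
    if has_feed and has_creator:
        raise ValueError("Set either feedUri or creatorHandle, not both.")
    if not has_feed and not has_creator:
        # Assumption: an input with neither field is also rejected.
        raise ValueError("Set one of feedUri or creatorHandle.")
    return "feed" if has_feed else "creator"


def to_at_uri(feed_uri: str) -> str:
    """Convert a bsky.app feed URL to AT URI form; pass AT URIs through."""
    if feed_uri.startswith("at://"):
        return feed_uri
    parsed = urlparse(feed_uri)
    parts = [p for p in parsed.path.split("/") if p]
    # Expected path shape: /profile/CREATOR/feed/RKEY
    well_formed = (
        parsed.netloc == "bsky.app"
        and len(parts) == 4
        and parts[0] == "profile"
        and parts[2] == "feed"
    )
    if not well_formed:
        raise ValueError(f"Not a bsky.app feed URL: {feed_uri}")
    creator, rkey = parts[1], parts[3]
    # If creator is a handle rather than a DID, it still has to be resolved
    # to a DID over the network; that step is omitted from this sketch.
    return f"at://{creator}/{FEED_COLLECTION}/{rkey}"
```

&lt;p&gt;Both checks run before any network call, which is what makes failing fast on ambiguous input cheap.&lt;/p&gt;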

&lt;h2&gt;
  
  
  Use cases 💡
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Algorithm research.&lt;/strong&gt; Sample what the "Discover" or "What's Hot" feeds surface across multiple days or weeks and track topic drift, amplification patterns, or the ratio of original posts to reposts. Because the AT Protocol serves this data openly and without login, Bluesky feeds are unusually accessible for this kind of research.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Newsroom social listening.&lt;/strong&gt; Subscribe to curated topic feeds and pipe new posts into Slack or a Google Sheet via Apify Webhooks. Because feed metadata is denormalised onto every row, the Slack message template needs no join.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NLP corpus building.&lt;/strong&gt; Collect labelled training data from topic-curated feeds for sentiment models, topic classifiers, or RAG systems. A feed labelled "AI news" or "climate science" gives you weakly supervised labels without manual tagging of raw timelines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Creator and feed analytics.&lt;/strong&gt; Pull every post a niche feed generator surfaces and rank by like / repost / quote ratios. Benchmark your own Bluesky posts against what the feed amplifies, and see which content formats dominate the engagement distribution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Competitive monitoring.&lt;/strong&gt; Track community-curated feeds that aggregate competitor announcements, support complaints, or product mentions. Creator-discovery mode pulls a creator's full feed catalogue in a single run.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pricing — exact numbers 💰
&lt;/h2&gt;

&lt;p&gt;Pay-per-event. You pay for posts that land in your dataset. No data, no charge (beyond the $0.05 run warm-up).&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Event&lt;/th&gt;
&lt;th&gt;Price&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Actor start&lt;/td&gt;
&lt;td&gt;$0.05 per run&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Post row written&lt;/td&gt;
&lt;td&gt;$0.002 per row&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Posts scraped&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;$0.25&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;500&lt;/td&gt;
&lt;td&gt;$1.05&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1,000&lt;/td&gt;
&lt;td&gt;$2.05&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5,000&lt;/td&gt;
&lt;td&gt;$10.05&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The maximum single-run input (50 feeds × 100 posts = 5,000 rows) comes out to around $10.05. Apify's $5 free trial credit covers roughly 2,475 posts — no credit card needed.&lt;/p&gt;
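&lt;p&gt;The arithmetic behind the tables is one fixed fee plus a per-row rate. A minimal sketch using the two published prices; the function names are illustrative, and &lt;code&gt;Decimal&lt;/code&gt; is used only to keep the cents exact.&lt;/p&gt;

```python
from decimal import Decimal

ACTOR_START_FEE = Decimal("0.05")   # charged once per run
PRICE_PER_ROW = Decimal("0.002")    # charged per post row written


def run_cost(rows: int) -> Decimal:
    """Total cost in USD of one run that writes `rows` post rows."""
    return ACTOR_START_FEE + PRICE_PER_ROW * rows


def rows_within_budget(budget: Decimal) -> int:
    """Largest number of rows a single run can write within `budget` USD."""
    remaining = budget - ACTOR_START_FEE
    return max(0, int(remaining // PRICE_PER_ROW))


print(run_cost(5000))                      # cost of the 5,000-row maximum run
print(rows_within_budget(Decimal("5")))    # rows covered by the $5 trial credit
```

&lt;p&gt;Note that the start fee is charged per run, so splitting the same row count across several runs costs slightly more than one large run.&lt;/p&gt;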

&lt;h2&gt;
  
  
  The technically interesting part
&lt;/h2&gt;

&lt;p&gt;The AT Protocol uses Content Identifiers (CIDs), IPLD content-addressed hashes, to pin down the exact content of every post record. The &lt;code&gt;post_uri&lt;/code&gt; names the post; the &lt;code&gt;post_cid&lt;/code&gt; field in each row is the cryptographic hash of that post record as it stood at time of indexing. Two runs returning the same &lt;code&gt;post_cid&lt;/code&gt; for a &lt;code&gt;post_uri&lt;/code&gt; saw the same record content; two different &lt;code&gt;post_cid&lt;/code&gt; values mean the record changed in between. This makes longitudinal feed studies possible: you can track not just which posts appeared in the feed, but whether their content changed over time.&lt;/p&gt;
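&lt;p&gt;A longitudinal comparison can be as small as a dictionary lookup. The sketch below assumes you have exported two runs as lists of rows carrying the &lt;code&gt;post_uri&lt;/code&gt; and &lt;code&gt;post_cid&lt;/code&gt; fields; &lt;code&gt;changed_posts&lt;/code&gt; is a hypothetical helper, not part of the Actor.&lt;/p&gt;

```python
def changed_posts(earlier_rows, later_rows):
    """Return post_uri values whose post_cid differs between two runs."""
    earlier = {row["post_uri"]: row["post_cid"] for row in earlier_rows}
    changed = []
    for row in later_rows:
        old_cid = earlier.get(row["post_uri"])
        # A post seen only in the later run is new, not changed.
        if old_cid is not None and old_cid != row["post_cid"]:
            changed.append(row["post_uri"])
    return changed


# Placeholder URIs and CIDs, for illustration only.
monday = [
    {"post_uri": "at://example/post/1", "post_cid": "cid-a"},
    {"post_uri": "at://example/post/2", "post_cid": "cid-b"},
]
friday = [
    {"post_uri": "at://example/post/1", "post_cid": "cid-a"},
    {"post_uri": "at://example/post/2", "post_cid": "cid-c"},
    {"post_uri": "at://example/post/3", "post_cid": "cid-d"},
]
print(changed_posts(monday, friday))
```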

&lt;h2&gt;
  
  
  Limitations 🚧
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Private or access-restricted feeds&lt;/strong&gt; are not exposed by the public AppView API. Only feeds visible at &lt;code&gt;public.api.bsky.app&lt;/code&gt; can be scraped.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Global feed discovery by keyword is not supported.&lt;/strong&gt; Bluesky's &lt;code&gt;getPopularFeedGenerators&lt;/code&gt; endpoint returns &lt;code&gt;MethodNotImplemented&lt;/code&gt; on the public AppView. Use creator-discovery mode (&lt;code&gt;creatorHandle&lt;/code&gt;) to enumerate one creator's feeds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Post images, embeds, and quoted-post bodies are not extracted.&lt;/strong&gt; Only the plain-text &lt;code&gt;post_text&lt;/code&gt; is captured. Image ALT text and quoted-post content are outside the current schema.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reply thread expansion is out of scope.&lt;/strong&gt; Only the top-level post row is emitted. Thread context (parent/root posts) would require additional &lt;code&gt;getPostThread&lt;/code&gt; calls.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The &lt;code&gt;maxPostsPerFeed&lt;/code&gt; cap is client-side.&lt;/strong&gt; If a feed has fewer posts than the cap, fewer rows are returned — expected behaviour, not a failure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage retention on Apify's Free plan is 7 days.&lt;/strong&gt; Export your dataset immediately after the run, or use a named dataset for longer retention.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FAQ ❓
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Is scraping public Bluesky feeds legal?&lt;/strong&gt;&lt;br&gt;
The AT Protocol is an open, federated standard. &lt;code&gt;public.api.bsky.app&lt;/code&gt; is explicitly unauthenticated and publicly accessible without login. The &lt;a href="https://bsky.social/about/support/tos" rel="noopener noreferrer"&gt;Bluesky Terms of Service&lt;/a&gt; permit accessing public data programmatically, as long as you do not impersonate users or violate AT Protocol data-portability principles. Always verify the current Terms of Service and your local jurisdiction's data-protection rules before using scraped data commercially.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is this a replacement for the Twitter/X API?&lt;/strong&gt;&lt;br&gt;
No. Bluesky's AT Protocol is a different architecture: the open feed-generator system, content-addressed records, and unauthenticated public AppView are native to its design, not workarounds. If you need Twitter/X data, use a Twitter scraper. If you want Bluesky's unique feed-curation graph, this Actor is built for that.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I export to Google Sheets or a data warehouse?&lt;/strong&gt;&lt;br&gt;
Yes. Export CSV/Excel/JSON from the Apify Console after the run, webhook the dataset on &lt;code&gt;ACTOR.RUN.SUCCEEDED&lt;/code&gt; into Make, Zapier, or n8n, or pull via the Apify REST API: &lt;code&gt;GET /datasets/{id}/items?format=csv&amp;amp;clean=true&lt;/code&gt;. Because feed metadata is already denormalised onto every row, no pivot or VLOOKUP is needed.&lt;/p&gt;
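&lt;p&gt;If you script the REST export, building the URL is the only fiddly part. A minimal sketch, assuming the standard &lt;code&gt;https://api.apify.com/v2&lt;/code&gt; base URL; &lt;code&gt;dataset_items_url&lt;/code&gt; is an illustrative helper, and authentication for non-public datasets is left out.&lt;/p&gt;

```python
from urllib.parse import urlencode

API_BASE = "https://api.apify.com/v2"


def dataset_items_url(dataset_id: str, fmt: str = "csv", clean: bool = True) -> str:
    """Build the dataset items export URL for the given format."""
    # The API expects lowercase booleans in the query string.
    query = urlencode({"format": fmt, "clean": str(clean).lower()})
    return f"{API_BASE}/datasets/{dataset_id}/items?{query}"


# "abc123" is a placeholder; use run["defaultDatasetId"] from a real run.
print(dataset_items_url("abc123"))
```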

&lt;p&gt;&lt;strong&gt;What is a feed URI and how do I find one?&lt;/strong&gt;&lt;br&gt;
An AT URI looks like &lt;code&gt;at://did:plc:z72i7hdynmk6r22z27h6tvur/app.bsky.feed.generator/whats-hot&lt;/code&gt;. Every Bluesky feed also has a &lt;code&gt;bsky.app&lt;/code&gt; web URL in the form &lt;code&gt;https://bsky.app/profile/&amp;lt;creator&amp;gt;/feed/&amp;lt;rkey&amp;gt;&lt;/code&gt;. You can paste either format into the &lt;code&gt;feedUri&lt;/code&gt; field — the Actor converts web URLs automatically. To get all feeds from a creator, use &lt;code&gt;creatorHandle&lt;/code&gt; and set &lt;code&gt;maxFeeds&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;The Actor is live on the Apify Store: &lt;strong&gt;&lt;a href="https://apify.com/DevilScrapes/bluesky-feed-posts" rel="noopener noreferrer"&gt;apify.com/DevilScrapes/bluesky-feed-posts&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Free $5 trial credit, no credit card. Run it against the &lt;code&gt;whats-hot&lt;/code&gt; feed URI above with &lt;code&gt;maxPostsPerFeed: 100&lt;/code&gt; and you will have a clean dataset of today's Discover feed posts in under 30 seconds. The &lt;a href="https://atproto.com/guides/lexicon" rel="noopener noreferrer"&gt;AT Protocol documentation&lt;/a&gt; and &lt;a href="https://docs.apify.com/sdk/python" rel="noopener noreferrer"&gt;Apify Python SDK docs&lt;/a&gt; are the two reference links you will reach for most.&lt;/p&gt;

&lt;p&gt;Have a feed analysis use case I haven't covered, or a field you wish was in the output? Drop it in the comments — I ship based on what people actually need.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built by &lt;a href="https://apify.com/DevilScrapes" rel="noopener noreferrer"&gt;Devil Scrapes&lt;/a&gt; — Apify Actors with attitude. Pay-per-event, transparent pricing, no junk fields.&lt;/em&gt; 😈&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>python</category>
      <category>apify</category>
      <category>data</category>
    </item>
    <item>
      <title>Bing Search API Replacement: scrape SERP results for $1.05/1K</title>
      <dc:creator>Devil Scrapes</dc:creator>
      <pubDate>Sun, 31 May 2026 11:36:47 +0000</pubDate>
      <link>https://dev.to/devil_scrapes/bing-search-api-replacement-scrape-serp-results-for-1051k-6lg</link>
      <guid>https://dev.to/devil_scrapes/bing-search-api-replacement-scrape-serp-results-for-1051k-6lg</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Quick answer:&lt;/strong&gt; Microsoft retired the Bing Search API on August 11, 2025. There is no longer an official endpoint. A &lt;em&gt;Bing search scraper&lt;/em&gt; hits the same &lt;code&gt;www.bing.com/search&lt;/code&gt; HTML endpoint any browser hits, decodes the &lt;code&gt;bing.com/ck/a&lt;/code&gt; redirect wrappers to canonical URLs, and returns clean typed rows — query, position, title, URL, snippet, country, language, timestamp. The Apify Actor below costs &lt;strong&gt;$0.001 per row&lt;/strong&gt; (~$1.05 per 1,000), with TLS fingerprint rotation, residential proxy routing, and Bing's pagination handled for you.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Microsoft's retirement of the API took effect on August 11, 2025, and thousands of teams who had wired Bing SERP data into rank trackers, brand-monitoring dashboards, and RAG pipelines found themselves holding a dead &lt;code&gt;api.bing.microsoft.com&lt;/code&gt; call. The official replacement, "Grounding with Bing Search" inside Azure AI, is gated behind an Azure subscription and priced for LLM workflows, not raw SERP pulls. SerpApi and DataForSEO fill the gap at $75–225/month. DIY means debugging redirect wrappers at 11pm.&lt;/p&gt;

&lt;p&gt;This post covers where a naive scraper breaks and how I packaged the fix as a cheap Apify Actor.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Bing Search? 🔎
&lt;/h2&gt;

&lt;p&gt;Bing is Microsoft's web search engine. For SEO practitioners it matters because its ranking signals diverge from Google's — it weights on-page keyword density and anchor text differently, surfaces more forum content, and runs its own crawler (&lt;code&gt;bingbot&lt;/code&gt;) with independent crawl frequency. For AI and RAG applications, Bing's index includes different long-tail pages than Google's; an LLM agent that only queries one engine has blind spots the other would fill.&lt;/p&gt;

&lt;p&gt;The SERP itself — &lt;code&gt;www.bing.com/search?q=&amp;lt;query&amp;gt;&lt;/code&gt; — is a server-rendered HTML page. It returns 10 organic &lt;code&gt;&amp;lt;li class="b_algo"&amp;gt;&lt;/code&gt; blocks per page, paginated via &lt;code&gt;&amp;amp;first=0,10,20,...&lt;/code&gt;. Everything the old API gave you (titles, URLs, snippets) is in that HTML.&lt;/p&gt;
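&lt;p&gt;The pagination scheme described above can be sketched in a few lines. This only builds the page URLs, using the &lt;code&gt;first=0,10,20,...&lt;/code&gt; offsets quoted in this post; &lt;code&gt;page_offsets&lt;/code&gt; and &lt;code&gt;serp_urls&lt;/code&gt; are illustrative names, and fetching is out of scope here.&lt;/p&gt;

```python
from urllib.parse import urlencode

RESULTS_PER_PAGE = 10


def page_offsets(max_results: int) -> list[int]:
    """Offsets for the `first` parameter: 0, 10, 20, ..."""
    return list(range(0, max_results, RESULTS_PER_PAGE))


def serp_urls(query: str, max_results: int) -> list[str]:
    """One search URL per page needed to cover max_results results."""
    return [
        "https://www.bing.com/search?" + urlencode({"q": query, "first": offset})
        for offset in page_offsets(max_results)
    ]


for url in serp_urls("bing api replacement", 30):
    print(url)
```

&lt;p&gt;A request for 25 results still needs three pages; the surplus rows on the last page are simply dropped.&lt;/p&gt;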

&lt;h2&gt;
  
  
  Does Bing have an API for search results?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;No — as of August 11, 2025, Microsoft has no official API for Bing organic search results.&lt;/strong&gt; The &lt;a href="https://learn.microsoft.com/en-us/bing/search-apis/bing-web-search/reference/endpoints" rel="noopener noreferrer"&gt;Bing Web Search API v7&lt;/a&gt; was retired on that date. The Azure AI "Grounding with Bing Search" product is a different surface, gated to the Azure OpenAI workflow and not available as a standalone SERP endpoint for arbitrary queries.&lt;/p&gt;

&lt;p&gt;Short of paying for a third-party SERP API, the only programmatic path to Bing SERP data is the public HTML endpoint the browser hits, which is what this Actor uses.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the data looks like
&lt;/h2&gt;

&lt;p&gt;Each organic result becomes one flat, Pydantic-validated row. Every field is verified against the &lt;code&gt;ResultRow&lt;/code&gt; model in &lt;code&gt;src/models.py&lt;/code&gt; before it reaches your dataset — no positional-array guessing, no silent null promotion:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"bing search api replacement"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"position"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Bing Search Scraper — SERP organic results to JSON"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://apify.com/DevilScrapes/bing-search-scraper"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"displayed_url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://apify.com › DevilScrapes › bing-search-scraper"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"snippet"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Drop-in replacement for the retired Bing Search API. Returns title, URL, snippet, position for any query and locale."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"country"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"US"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"language"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"en"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"scraped_at"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-05-16T13:40:00.000Z"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Nine fields. The &lt;code&gt;url&lt;/code&gt; is always the canonical destination — never a &lt;code&gt;bing.com/ck/a&lt;/code&gt; redirect wrapper. The &lt;code&gt;displayed_url&lt;/code&gt; is the breadcrumb Bing renders on screen. &lt;code&gt;position&lt;/code&gt; is 1-indexed across all paginated pages, so position 11 is the first result on page 2 — no off-by-one math on your side.&lt;/p&gt;
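&lt;p&gt;One way to get a page-independent &lt;code&gt;position&lt;/code&gt; is a single running counter over the pages in fetch order. A sketch under that assumption; &lt;code&gt;assign_positions&lt;/code&gt; is a hypothetical helper, and the Actor may compute the rank differently.&lt;/p&gt;

```python
from itertools import chain


def assign_positions(pages):
    """Number results 1..N across pages, in the order they were fetched."""
    flat = chain.from_iterable(pages)
    return [{**result, "position": rank} for rank, result in enumerate(flat, start=1)]


# Two pages of placeholder results: ten on page 1, two on page 2.
page_1 = [{"title": f"result {i}"} for i in range(10)]
page_2 = [{"title": "result 10"}, {"title": "result 11"}]
rows = assign_positions([page_1, page_2])
print(rows[10]["title"], rows[10]["position"])
```

&lt;p&gt;A running counter keeps the numbering gap-free even when a page comes back with fewer than 10 organic blocks, which an offset-based formula would not.&lt;/p&gt;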

&lt;h2&gt;
  
  
  The naive approach (and why it falls apart) 🔧
&lt;/h2&gt;

&lt;p&gt;The SERP is just HTML, so the obvious path looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;parsel&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Selector&lt;/span&gt;

&lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://www.bing.com/search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;q&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bing api replacement&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;sel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Selector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;li&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;li.b_algo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;li&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;h2 a::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will fail in production. Three reasons, and they are why a hosted Actor earns its keep:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. TLS fingerprint inspection.&lt;/strong&gt; Bing inspects the JA3/JA4 signature of your TLS handshake. Python's stdlib &lt;code&gt;ssl&lt;/code&gt;, &lt;code&gt;requests&lt;/code&gt;, and &lt;code&gt;httpx&lt;/code&gt; emit fingerprints no real browser produces — the server returns &lt;code&gt;403&lt;/code&gt; before reading the query string. We route every request through &lt;a href="https://github.com/lexiforest/curl_cffi" rel="noopener noreferrer"&gt;&lt;code&gt;curl-cffi&lt;/code&gt;&lt;/a&gt;'s &lt;code&gt;AsyncSession&lt;/code&gt;, impersonating Chrome 131, Chrome 124, or Firefox 147 TLS + HTTP/2 SETTINGS frames at the socket level, rotating profiles per page to reduce burst correlation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The &lt;code&gt;bing.com/ck/a&lt;/code&gt; redirect wrapper.&lt;/strong&gt; Every organic &lt;code&gt;href&lt;/code&gt; is a click-tracking redirect of the form &lt;code&gt;https://www.bing.com/ck/a?...&amp;amp;u=a1&amp;lt;base64url&amp;gt;...&lt;/code&gt;. A naive parser hands you that wrapper, which joins to nothing downstream. We decode the &lt;code&gt;u=a1&amp;lt;base64url&amp;gt;&lt;/code&gt; on every href — the &lt;code&gt;url&lt;/code&gt; field in your dataset is always the canonical destination.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Proxy continuity across pagination.&lt;/strong&gt; Bing serves 10 results per page. Rotating proxies mid-query degrades results: fewer organic blocks, more ads, eventually empty pages. We pin one Apify residential proxy session (a stable &lt;code&gt;session_id&lt;/code&gt;) to each query's burst of page requests, then rotate on the next query. On &lt;code&gt;429&lt;/code&gt;/&lt;code&gt;503&lt;/code&gt; we retry with exponential backoff (2s → 4s → ... capped at 30s, max 5 attempts, honouring &lt;code&gt;Retry-After&lt;/code&gt;). A query that exhausts its retries doesn't abort the run; the partial success is reported in the run's status message.&lt;/p&gt;
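&lt;p&gt;The retry timing in point 3 can be sketched as a pure function. The constants come from the description above; how &lt;code&gt;Retry-After&lt;/code&gt; interacts with the cap is an assumption here (the longer wait wins), and &lt;code&gt;backoff_delay&lt;/code&gt; is an illustrative name rather than the Actor's code.&lt;/p&gt;

```python
from typing import Optional

BASE_DELAY = 2.0    # seconds to wait after the first failure
MAX_DELAY = 30.0    # cap on any computed wait
MAX_ATTEMPTS = 5


def backoff_delay(attempt: int, retry_after: Optional[float] = None) -> float:
    """Seconds to wait after failed attempt number `attempt` (1-indexed)."""
    delay = min(BASE_DELAY * 2 ** (attempt - 1), MAX_DELAY)
    if retry_after is not None:
        # Assumption: a server-supplied Retry-After wins when it asks for longer.
        delay = max(delay, retry_after)
    return delay


schedule = [backoff_delay(n) for n in range(1, MAX_ATTEMPTS + 1)]
print(schedule)
```

&lt;p&gt;Doubling from 2 seconds reaches 32 on the fifth step, so the 30-second cap only comes into play at the very end of the schedule.&lt;/p&gt;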

&lt;h2&gt;
  
  
  The Actor ⚙️
&lt;/h2&gt;

&lt;p&gt;The result is packaged as an Apify Actor: &lt;strong&gt;&lt;a href="https://apify.com/DevilScrapes/bing-search-scraper" rel="noopener noreferrer"&gt;Bing Search Scraper&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Run it from the Apify Console by pasting your queries into the input form, or call it programmatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;apify_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ApifyClient&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ApifyClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_APIFY_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;actor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DevilScrapes/bing-search-scraper&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;run_input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;queries&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bing search api replacement&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bing serp scraper&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scrape bing results python&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;country&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;US&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;language&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maxResultsPerQuery&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;useProxy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;defaultDatasetId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;iterate_items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;position&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;snippet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Input parameters:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;queries&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;string[]&lt;/td&gt;
&lt;td&gt;required&lt;/td&gt;
&lt;td&gt;One row emitted per (query × result)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;country&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;string&lt;/td&gt;
&lt;td&gt;&lt;code&gt;US&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;ISO-3166-1 alpha-2, passed as &lt;code&gt;cc=&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;language&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;string&lt;/td&gt;
&lt;td&gt;&lt;code&gt;en&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;ISO-639-1, passed as &lt;code&gt;setlang=&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;maxResultsPerQuery&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;int 1–100&lt;/td&gt;
&lt;td&gt;&lt;code&gt;30&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Capped at 100; Bing quality degrades past ~page 5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;useProxy&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;bool&lt;/td&gt;
&lt;td&gt;&lt;code&gt;true&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Apify residential proxy; strongly recommended&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The output streams to the default dataset as the scrape progresses — you can read partial results before the run finishes.&lt;/p&gt;

&lt;h2&gt;
  
  
  What you'd actually use this for 📊
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Bing Search API migration.&lt;/strong&gt; If you have any Python code calling &lt;code&gt;api.bing.microsoft.com&lt;/code&gt;, it is dead. The Actor's input and output map to the same logical shape: queries in, ranked organic results out. The migration work is replacing the client call, not re-architecting the downstream.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SEO rank tracking across locales.&lt;/strong&gt; Run the same query list with &lt;code&gt;country=US&lt;/code&gt;, &lt;code&gt;country=GB&lt;/code&gt;, &lt;code&gt;country=DE&lt;/code&gt; on a weekly schedule and chart position drift over time. Locale-specific Bing rank tracking is hard to buy — most tools are Google-first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Brand monitoring.&lt;/strong&gt; Pull the top 30 results for branded queries and alert when a new domain enters the top 10. At 100 queries × 30 results that is 3,000 rows per run — &lt;strong&gt;$3.05&lt;/strong&gt; total including the actor-start fee.&lt;/p&gt;
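&lt;p&gt;The "new domain enters the top 10" alert is a set difference between two runs. A minimal sketch over exported rows, relying only on the &lt;code&gt;query&lt;/code&gt;, &lt;code&gt;position&lt;/code&gt;, and &lt;code&gt;url&lt;/code&gt; fields; &lt;code&gt;top_domains&lt;/code&gt; and &lt;code&gt;new_entrants&lt;/code&gt; are hypothetical helpers, and the sample rows are made up.&lt;/p&gt;

```python
from urllib.parse import urlparse


def top_domains(rows, query: str, top_n: int = 10) -> set:
    """Domains of the rows ranked within the top_n positions for one query."""
    wanted = range(1, top_n + 1)
    return {
        urlparse(row["url"]).netloc
        for row in rows
        if row["query"] == query and row["position"] in wanted
    }


def new_entrants(previous_rows, current_rows, query: str, top_n: int = 10) -> set:
    """Domains in the current top_n that were absent from the previous top_n."""
    return top_domains(current_rows, query, top_n) - top_domains(previous_rows, query, top_n)


last_week = [
    {"query": "acme widgets", "position": 1, "url": "https://acme.example/widgets"},
    {"query": "acme widgets", "position": 2, "url": "https://reviews.example/acme"},
]
this_week = [
    {"query": "acme widgets", "position": 1, "url": "https://acme.example/widgets"},
    {"query": "acme widgets", "position": 2, "url": "https://rival.example/vs-acme"},
    {"query": "acme widgets", "position": 11, "url": "https://reviews.example/acme"},
]
print(new_entrants(last_week, this_week, "acme widgets"))
```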

&lt;p&gt;&lt;strong&gt;AI agent retrieval / RAG diversification.&lt;/strong&gt; LangChain's &lt;code&gt;BingSearchAPIWrapper&lt;/code&gt; and LlamaIndex's &lt;code&gt;BingToolSpec&lt;/code&gt; both called the now-dead endpoint. An Apify Actor call is the drop-in. Bing's index skews toward forum content and long-tail pages where Google's top results are SEO-dominated — the diversification is real.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pricing — exact numbers 💰
&lt;/h2&gt;

&lt;p&gt;Pay-per-event. You pay only for rows that land in your dataset. No data, no charge (beyond the warm-up fee).&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Event&lt;/th&gt;
&lt;th&gt;Rate (USD)&lt;/th&gt;
&lt;th&gt;When&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;actor-start&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;$0.05&lt;/td&gt;
&lt;td&gt;Once per run&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;result-row&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;$0.001&lt;/td&gt;
&lt;td&gt;Per organic SERP row pushed&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pull&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;100 rows&lt;/td&gt;
&lt;td&gt;$0.15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;300 rows (10 queries × 30)&lt;/td&gt;
&lt;td&gt;$0.35&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1,000 rows&lt;/td&gt;
&lt;td&gt;$1.05&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3,000 rows (100 queries × 30)&lt;/td&gt;
&lt;td&gt;$3.05&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10,000 rows&lt;/td&gt;
&lt;td&gt;$10.05&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Apify's free $5 trial credit covers roughly 4,950 rows, with no credit card required. For comparison, SerpApi's entry plan runs $75/month, and DataForSEO requires a separate account plus API key management. This Actor is cheaper and simpler because it goes directly to Bing's HTML endpoint.&lt;/p&gt;
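&lt;p&gt;If you want to budget a run before launching it, the arithmetic behind the tables is one flat fee plus a per-row rate. A small sketch using the two rates listed above; the function names are mine, and &lt;code&gt;Decimal&lt;/code&gt; is used only to avoid floating-point rounding on cents.&lt;/p&gt;

```python
from decimal import Decimal

ACTOR_START = Decimal("0.05")   # charged once per run
RESULT_ROW = Decimal("0.001")   # charged per organic row pushed


def run_cost(rows):
    """Total USD cost of one run that pushes `rows` rows."""
    return ACTOR_START + RESULT_ROW * rows


def rows_covered(budget):
    """Rows a single run can push before `budget` (USD) is exhausted."""
    remaining = Decimal(str(budget)) - ACTOR_START
    return max(0, int(remaining // RESULT_ROW))


for rows in (100, 300, 1000, 3000, 10000):
    print(rows, run_cost(rows))
print(rows_covered(5))
```

&lt;p&gt;Note that &lt;code&gt;rows_covered&lt;/code&gt; assumes a single run. Splitting the same budget across several runs pays the start fee each time.&lt;/p&gt;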

&lt;h2&gt;
  
  
  The part worth understanding technically
&lt;/h2&gt;

&lt;p&gt;The redirect decoder is the technically non-obvious piece. Bing wraps every organic &lt;code&gt;href&lt;/code&gt; in &lt;code&gt;bing.com/ck/a?!&amp;amp;&amp;amp;p=&amp;lt;hash&amp;gt;&amp;amp;u=a1&amp;lt;base64url&amp;gt;...&lt;/code&gt;. The &lt;code&gt;u=&lt;/code&gt; parameter carries the destination URL prefixed with &lt;code&gt;a1&lt;/code&gt; and encoded as URL-safe base64:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;decode_bing_href&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;href&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;match&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[?&amp;amp;]u=a1([A-Za-z0-9_-]+)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;href&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;href&lt;/span&gt;  &lt;span class="c1"&gt;# already canonical
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;urlsafe_b64decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;==&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;replace&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Bing has kept this format stable since at least 2023, but it's the detail that causes a naive scraper to silently fill your dataset with &lt;code&gt;bing.com/ck/a&lt;/code&gt; junk that joins to nothing. Every URL is decoded before it reaches the dataset.&lt;/p&gt;
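&lt;p&gt;A quick way to convince yourself the decoding logic holds is a round trip on a synthetic wrapper. The sketch below is a variant, not the Actor's code: it reads the &lt;code&gt;u&lt;/code&gt; parameter with &lt;code&gt;urllib.parse&lt;/code&gt; instead of a regex and restores the exact amount of base64 padding. &lt;code&gt;wrap_for_test&lt;/code&gt; is a hypothetical helper that only imitates the wrapper shape described above.&lt;/p&gt;

```python
import base64
from urllib.parse import parse_qs, urlencode, urlsplit


def decode_wrapped(href):
    """Return the destination URL from a /ck/a wrapper, or href unchanged."""
    values = parse_qs(urlsplit(href).query).get("u", [])
    if not values or not values[0].startswith("a1"):
        return href  # not a wrapper this sketch recognises
    payload = values[0][2:]
    payload += "=" * (-len(payload) % 4)  # restore the stripped base64 padding
    return base64.urlsafe_b64decode(payload).decode("utf-8", errors="replace")


def wrap_for_test(url):
    """Build a synthetic wrapper in the same shape, for testing only."""
    raw = base64.urlsafe_b64encode(url.encode("utf-8")).decode("ascii")
    token = raw.rstrip("=")
    query = urlencode({"p": "0000", "u": "a1" + token})
    return "https://www.bing.com/ck/a?" + query


target = "https://example.com/docs/getting-started?lang=en"
print(decode_wrapped(wrap_for_test(target)) == target)
print(decode_wrapped(target) == target)
```

&lt;p&gt;The second check matters as much as the first: a URL that is already canonical has to pass through untouched.&lt;/p&gt;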

&lt;h2&gt;
  
  
  Limitations 🚧
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Organic web results only.&lt;/strong&gt; People-Also-Ask, news, images, videos, and local packs are out of scope for v1.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;100-result ceiling per query.&lt;/strong&gt; Bing's HTML becomes unreliable past page 10. The Actor stops paginating rather than returning garbage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No &lt;code&gt;is_featured&lt;/code&gt; flag.&lt;/strong&gt; Featured snippets that render inside &lt;code&gt;b_algo&lt;/code&gt; blocks appear as regular rows; v1 has no way to distinguish them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No JavaScript rendering.&lt;/strong&gt; The current path hits server-rendered HTML (verified working 2026-05-16). A Bing move to client-rendering would require a Camoufox upgrade.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Snapshots only.&lt;/strong&gt; Every run is a fresh snapshot. For time-series tracking, schedule runs and export to named Apify datasets or your own storage. Free-plan default storage purges after 7 days.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Country/language passed through, not validated.&lt;/strong&gt; Unrecognized ISO codes silently return default results.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FAQ ❓
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Is scraping Bing search results legal?&lt;/strong&gt;&lt;br&gt;
The Actor sends well-formed HTTP requests to the same public, unauthenticated page any browser accesses — no authentication bypass, no paywall, no personal data. Comply with Bing's Terms of Service and your jurisdiction's applicable law.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I export results to Google Sheets or a data warehouse?&lt;/strong&gt;&lt;br&gt;
Yes. Download CSV, Excel, JSON, or XML from the Apify Console Export button, or use the &lt;code&gt;ACTOR.RUN.SUCCEEDED&lt;/code&gt; webhook to push the dataset into Make, Zapier, or n8n. Raw API access is available via the &lt;a href="https://docs.apify.com/api/v2#/reference/datasets/item-collection/get-items" rel="noopener noreferrer"&gt;Apify Dataset API&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is there still an official Bing search API?&lt;/strong&gt;&lt;br&gt;
No. The &lt;a href="https://learn.microsoft.com/en-us/bing/search-apis/bing-web-search/reference/endpoints" rel="noopener noreferrer"&gt;Bing Web Search API v7&lt;/a&gt; was retired August 11, 2025. Azure AI "Grounding with Bing Search" is a different surface — an LLM-workflow integration, not a raw SERP endpoint for arbitrary queries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why are the redirect URLs always decoded?&lt;/strong&gt;&lt;br&gt;
The &lt;code&gt;bing.com/ck/a?...&amp;amp;u=a1&amp;lt;base64url&amp;gt;&lt;/code&gt; wrapper carries no information your workflow can use. Canonical URLs join cleanly with any URL-keyed dataset. The decode is always on; there is no option to keep the raw wrapper.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;The Actor is live on the Apify Store: &lt;strong&gt;&lt;a href="https://apify.com/DevilScrapes/bing-search-scraper" rel="noopener noreferrer"&gt;apify.com/DevilScrapes/bing-search-scraper&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Free $5 trial credit, no credit card. Paste &lt;code&gt;bing search api replacement&lt;/code&gt; as your first query and you'll have 30 ranked results in your dataset in under a minute. Building something on top — a rank tracker, a brand-monitoring webhook, a RAG pipeline? Drop the use case in the comments. If there's a feature the README doesn't cover (PAA boxes, news vertical, location targeting), that's how v2 gets scoped.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built by &lt;a href="https://apify.com/DevilScrapes" rel="noopener noreferrer"&gt;Devil Scrapes&lt;/a&gt; — Apify Actors with attitude. Pay-per-event, transparent pricing, no junk fields.&lt;/em&gt; 😈&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>python</category>
      <category>apify</category>
      <category>seo</category>
    </item>
    <item>
      <title>ATS Tech Stack Detector: pull company back-end stacks from jobs for $5.05/1K</title>
      <dc:creator>Devil Scrapes</dc:creator>
      <pubDate>Sun, 31 May 2026 11:31:32 +0000</pubDate>
      <link>https://dev.to/devil_scrapes/ats-tech-stack-detector-pull-company-back-end-stacks-from-jobs-for-5051k-1444</link>
      <guid>https://dev.to/devil_scrapes/ats-tech-stack-detector-pull-company-back-end-stacks-from-jobs-for-5051k-1444</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Quick answer:&lt;/strong&gt; Greenhouse, Lever, and Ashby each publish a public job-board API that any job aggregator can hit — no auth required. An &lt;em&gt;ATS tech stack detector&lt;/em&gt; calls those APIs, strips the HTML from each job description, then runs a curated vocabulary of ~110 canonical tech names against the text to produce a deduplicated &lt;code&gt;detected_techs&lt;/code&gt; list per job row. The Apify Actor below does it for &lt;strong&gt;$0.005 per row&lt;/strong&gt; ($5.05 per 1,000 jobs, including the one-time $0.05 start fee), with exponential backoff, per-company fault isolation, and Pydantic-validated output ready to drop into any CRM pipeline.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Every B2B sales motion eventually hits the same wall: you need to know what your prospect actually runs in production, and front-end sniffers like BuiltWith and Wappalyzer can't tell you. They read Cloudflare headers, tag-manager IDs, and JS bundle manifests — which surfaces HubSpot and Segment beautifully but says nothing about whether the company runs Postgres or Snowflake, Kafka or RabbitMQ, Kubernetes or Nomad. Engineering teams declare their real stack in job descriptions. A "Senior Backend Engineer" posting that lists Django, Postgres, Kafka, and Kubernetes tells you more about their infrastructure than any front-end scan ever will.&lt;/p&gt;

&lt;p&gt;The hard part is extracting that signal at scale: hit the APIs, unpick the HTML, normalize "node.js" vs "Node.js" vs "NodeJS", deduplicate, then keep the vocabulary current as new tools ship. Or call one Actor and get a typed dataset back. Here's how the second option works.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is an ATS? 🔎
&lt;/h2&gt;

&lt;p&gt;An Applicant Tracking System (ATS) is the software companies use to post jobs, accept applications, and manage candidates. The three platforms this Actor covers — &lt;strong&gt;Greenhouse&lt;/strong&gt;, &lt;strong&gt;Lever&lt;/strong&gt;, and &lt;strong&gt;Ashby&lt;/strong&gt; — collectively handle a large share of Series A+ tech-company hiring, and all three publish their job-board data via public, unauthenticated REST APIs so that aggregators and "Powered by Greenhouse" listings can pull postings without a login.&lt;/p&gt;

&lt;p&gt;That public-API design is the foundation this Actor builds on. We call the same endpoint a job-board aggregator would, then add what they don't: tech-stack extraction from the description text, normalization to a canonical vocabulary, and a flat, validated output schema.&lt;/p&gt;

&lt;h2&gt;
  
  
  Does Greenhouse have an API for job postings?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Yes.&lt;/strong&gt; Greenhouse's Job Board API (&lt;code&gt;boards-api.greenhouse.io/v1/boards/{token}/jobs?content=true&lt;/code&gt;) returns every active posting for any company with a public board, including the full description HTML. Lever exposes &lt;code&gt;api.lever.co/v0/postings/{token}&lt;/code&gt; with the same shape, and Ashby uses &lt;code&gt;api.ashbyhq.com/posting-api/job-board/{token}&lt;/code&gt; — case-sensitive on the token (&lt;code&gt;Ramp&lt;/code&gt; works; &lt;code&gt;ramp&lt;/code&gt; returns zero). None require an API key or OAuth.&lt;/p&gt;
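&lt;p&gt;To make the three endpoint shapes concrete, here is a small sketch that builds the request URL for a given ATS and token. The paths are the ones quoted above; the &lt;code&gt;https&lt;/code&gt; scheme, the &lt;code&gt;ENDPOINTS&lt;/code&gt; table, and the &lt;code&gt;board_url&lt;/code&gt; helper are my own illustration, not the Actor's internals. No request is sent.&lt;/p&gt;

```python
from urllib.parse import quote

# Endpoint templates as quoted in the text; the https scheme is assumed.
ENDPOINTS = {
    "greenhouse": "https://boards-api.greenhouse.io/v1/boards/{token}/jobs?content=true",
    "lever": "https://api.lever.co/v0/postings/{token}",
    "ashby": "https://api.ashbyhq.com/posting-api/job-board/{token}",
}


def board_url(ats, token):
    """Return the public job-board URL for one company token."""
    template = ENDPOINTS.get(ats)
    if template is None:
        raise ValueError("unsupported ATS: " + repr(ats))
    # quote() leaves letter case untouched, which matters for Ashby tokens.
    return template.format(token=quote(token, safe=""))


print(board_url("greenhouse", "airtable"))
print(board_url("lever", "palantir"))
print(board_url("ashby", "Ramp"))
```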

&lt;h2&gt;
  
  
  What the data looks like
&lt;/h2&gt;

&lt;p&gt;Each job posting produces one flat, typed row. Every field comes directly from the ATS response or from our detection pass — nothing is inferred or synthesised:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ats"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"greenhouse"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"company_token"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"airtable"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"job_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"4567890"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Senior Backend Engineer"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"location"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Remote, US"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"department"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Engineering"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://boards.greenhouse.io/airtable/jobs/4567890"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"description_text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"We're looking for a backend engineer with deep experience in Python, Django, Postgres and Redis. Our data pipeline runs on AWS with Kubernetes..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"detected_techs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"AWS"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Django"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Kubernetes"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Postgres"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Python"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Redis"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"posted_at"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-05-01T09:00:00+00:00"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"scraped_at"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-05-31T14:22:11+00:00"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Eleven fields, the same shape every time. &lt;code&gt;detected_techs&lt;/code&gt; is sorted (case-insensitive) and deduplicated. Every row is Pydantic-validated before it is written, so you never receive a missing required field or an &lt;code&gt;ats&lt;/code&gt; value outside &lt;code&gt;["greenhouse", "lever", "ashby"]&lt;/code&gt;.&lt;/p&gt;
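&lt;p&gt;If you want the same guarantees on your side of the pipeline, the row shape is easy to mirror. The Actor validates with Pydantic; the sketch below is a standard-library stand-in using a dataclass, so treat the field types (plain strings, a tuple for &lt;code&gt;detected_techs&lt;/code&gt;) as assumptions rather than the published schema.&lt;/p&gt;

```python
from dataclasses import dataclass

ALLOWED_ATS = ("greenhouse", "lever", "ashby")


@dataclass(frozen=True)
class JobRow:
    ats: str
    company_token: str
    job_id: str
    title: str
    location: str
    department: str
    url: str
    description_text: str
    detected_techs: tuple
    posted_at: str
    scraped_at: str

    def __post_init__(self):
        # Reject any ats value outside the three supported platforms.
        if self.ats not in ALLOWED_ATS:
            raise ValueError("unexpected ats value: " + repr(self.ats))
        # Deduplicate, then sort case-insensitively, as the text describes.
        cleaned = tuple(sorted(set(self.detected_techs), key=str.lower))
        object.__setattr__(self, "detected_techs", cleaned)


row = JobRow(
    ats="greenhouse",
    company_token="airtable",
    job_id="4567890",
    title="Senior Backend Engineer",
    location="Remote, US",
    department="Engineering",
    url="https://boards.greenhouse.io/airtable/jobs/4567890",
    description_text="Python, Django, Postgres and Redis on AWS.",
    detected_techs=["Redis", "AWS", "Redis", "Django"],
    posted_at="2026-05-01T09:00:00+00:00",
    scraped_at="2026-05-31T14:22:11+00:00",
)
print(row.detected_techs)
```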

&lt;h2&gt;
  
  
  The naive approach (and why it falls apart)
&lt;/h2&gt;

&lt;p&gt;The first attempt looks straightforward: hit the API, grab the description, split on whitespace, look for known tech names. It breaks almost immediately, and the breakage is instructive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. HTML encoding layers.&lt;/strong&gt; Greenhouse double-encodes its &lt;code&gt;content&lt;/code&gt; field — the HTML you receive contains &lt;code&gt;&amp;amp;lt;div&amp;amp;gt;&lt;/code&gt; where the original had &lt;code&gt;&amp;lt;div&amp;gt;&lt;/code&gt;. A single &lt;code&gt;html.unescape()&lt;/code&gt; leaves literal &lt;code&gt;&amp;lt;div&amp;gt;&lt;/code&gt; tags in the text that your regex then trips over. The parser unescapes twice, then strips tags, before the vocabulary scan runs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Lever's split description.&lt;/strong&gt; Lever's &lt;code&gt;descriptionPlain&lt;/code&gt; often omits everything inside "Requirements" bullets — that content lives in &lt;code&gt;lists[].content&lt;/code&gt; chunks in a separate array. Read only &lt;code&gt;descriptionPlain&lt;/code&gt; and you silently miss half the tech signals in engineering roles. We concatenate every &lt;code&gt;lists[].content&lt;/code&gt; chunk before scanning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Case-sensitivity on Ashby tokens.&lt;/strong&gt; Write &lt;code&gt;ramp&lt;/code&gt; instead of &lt;code&gt;Ramp&lt;/code&gt; and Ashby's API returns an empty list — not a 404, just zero jobs. Without a guard, you produce an empty dataset and think you scraped it. The Actor fails loud when every company returns zero rows rather than reporting a hollow success.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Rate-limiting.&lt;/strong&gt; All three APIs throttle request bursts. We retry on &lt;code&gt;429&lt;/code&gt; and &lt;code&gt;503&lt;/code&gt; with exponential backoff (2 seconds, doubling, capped at 30 seconds, up to 5 attempts), and we honour &lt;code&gt;Retry-After&lt;/code&gt; when the server sends it. On multi-company runs, one company hitting a rate limit doesn't abort the rest: each company's fetch loop is isolated, so partial success is partial success, not total failure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Vocabulary normalization.&lt;/strong&gt; Job descriptions use "Node.js", "NodeJS", "Node JS", and "Node" to mean the same runtime, and "postgres", "Postgres", and "PostgreSQL" for the same database. A naive substring match either misses casing variants or fires on the wrong word ("Java" inside "JavaScript"). The vocabulary uses case-insensitive word-boundary regex (&lt;code&gt;\bPostgres\b&lt;/code&gt;), longest-match-first, so the canonical name is emitted regardless of how the author capitalised it and substrings never false-fire.&lt;/p&gt;

&lt;p&gt;Underneath, every request goes out through &lt;code&gt;curl-cffi&lt;/code&gt; impersonating a real Chrome 131 TLS + HTTP/2 fingerprint. We thread Apify residential proxies when you flip the &lt;code&gt;useProxy&lt;/code&gt; flag on, and we hand back Pydantic-validated typed rows. No data means no charge.&lt;/p&gt;

&lt;p&gt;None of that is complicated in hindsight. All of it is the difference between a script that worked on three companies in a notebook and a pipeline that survives 500 companies overnight.&lt;/p&gt;
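&lt;p&gt;The retry policy from point 4 is the easiest of these to get subtly wrong, so here is a minimal sketch of it. It assumes a hypothetical &lt;code&gt;fetch&lt;/code&gt; callable that returns &lt;code&gt;(status, headers, body)&lt;/code&gt;, and it only understands the seconds form of &lt;code&gt;Retry-After&lt;/code&gt;. It illustrates the schedule described above and is not the Actor's actual code.&lt;/p&gt;

```python
import time


def next_delay(attempt, retry_after=None, base=2.0, cap=30.0):
    """Seconds to wait after failed attempt number `attempt` (0-based)."""
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # the HTTP-date form is not handled in this sketch
    return min(cap, base * (2 ** attempt))


def fetch_with_retry(fetch, max_attempts=5, sleep=time.sleep):
    """Call fetch() until the status is not 429/503 or attempts run out."""
    for attempt in range(max_attempts):
        status, headers, body = fetch()
        if status not in (429, 503):
            return status, body
        if attempt + 1 != max_attempts:
            sleep(next_delay(attempt, headers.get("Retry-After")))
    return status, body


# Demo with a fake endpoint and a recording sleep, so nothing really waits.
responses = iter([(429, {}, ""), (503, {"Retry-After": "7"}, ""), (200, {}, "ok")])
waits = []
result = fetch_with_retry(lambda: next(responses), sleep=waits.append)
print(result, waits)
print([next_delay(n) for n in range(5)])
```

&lt;p&gt;Passing &lt;code&gt;sleep&lt;/code&gt; in as a parameter is a small design choice that makes the schedule testable without waiting for real seconds to pass.&lt;/p&gt;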

&lt;h2&gt;
  
  
  The Actor ⚙️
&lt;/h2&gt;

&lt;p&gt;I packaged this as an Apify Actor: &lt;strong&gt;&lt;a href="https://apify.com/DevilScrapes/ats-tech-stack-scraper" rel="noopener noreferrer"&gt;ATS Tech Stack Scraper&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;You can run it from the Apify Console by filling in the input form, or call it programmatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;apify_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ApifyClient&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ApifyClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_APIFY_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;actor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DevilScrapes/ats-tech-stack-scraper&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;run_input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;companies&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;companyToken&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;airtable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;atsType&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;greenhouse&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;companyToken&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;palantir&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;atsType&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lever&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;companyToken&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ramp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;atsType&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ashby&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maxJobsPerCompany&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;minTechsDetected&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;defaultDatasetId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;iterate_items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;company_token&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;detected_techs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;minTechsDetected: 2&lt;/code&gt; is a useful floor: it drops sales, marketing, and ops roles that mention fewer than two tools from the vocabulary, so you keep only rows with a real engineering signal. &lt;code&gt;maxJobsPerCompany&lt;/code&gt; caps spend on large companies with hundreds of open roles.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Finding a company's ATS token&lt;/strong&gt; — it's the board slug in the careers URL: &lt;code&gt;boards.greenhouse.io/{token}&lt;/code&gt; (e.g. &lt;code&gt;airtable&lt;/code&gt;), &lt;code&gt;jobs.lever.co/{token}&lt;/code&gt; (e.g. &lt;code&gt;palantir&lt;/code&gt;), or &lt;code&gt;jobs.ashbyhq.com/{Token}&lt;/code&gt; (e.g. &lt;code&gt;Ramp&lt;/code&gt; — case-sensitive; lowercase returns zero jobs).&lt;/p&gt;

&lt;h2&gt;
  
  
  Use cases 💡
&lt;/h2&gt;

&lt;p&gt;Three buyer types, five concrete scenarios:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. B2B sales qualification.&lt;/strong&gt; You sell a Postgres performance tool. Filter &lt;code&gt;detected_techs&lt;/code&gt; for &lt;code&gt;"Postgres"&lt;/code&gt; and sort by &lt;code&gt;posted_at&lt;/code&gt; recency — any company actively hiring "Senior Backend Engineer" with Postgres in the JD is pre-qualified, and you know they're staffing that stack now, not three years ago.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Recruiter sourcing.&lt;/strong&gt; Filling a Rust role? Pull every engineering posting from your target-company list, filter &lt;code&gt;detected_techs&lt;/code&gt; for &lt;code&gt;"Rust"&lt;/code&gt;, and you have a live shortlist of companies already running it in production. Pair with &lt;code&gt;minTechsDetected: 3&lt;/code&gt; to exclude postings that mention Rust only in passing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Competitive intelligence.&lt;/strong&gt; Track which competitors are posting Kubernetes, Snowflake, or dbt roles. A cluster of new infra-heavy postings often signals a platform migration before it becomes a press release.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. CRM enrichment.&lt;/strong&gt; Feed the output into Clay, HubSpot Operations Hub, or Attio as a custom enrichment step — replacing a BuiltWith seat for the subset of back-end and data-platform signals engineering-buying teams actually need.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Investment research.&lt;/strong&gt; Map a pipeline target's tech footprint across quarters. A company that moves from "MySQL" to "Postgres" and "Snowflake" postings in eighteen months is scaling its data stack — a signal front-end sniffers miss entirely.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pricing — exact numbers 💰
&lt;/h2&gt;

&lt;p&gt;Pay-per-event. You pay for rows you get; rows that don't land cost nothing.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Event&lt;/th&gt;
&lt;th&gt;Price&lt;/th&gt;
&lt;th&gt;When&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;actor-start&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;$0.05&lt;/td&gt;
&lt;td&gt;Once per run&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;result-row&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;$0.005&lt;/td&gt;
&lt;td&gt;Per job row emitted&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rows pulled&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;50 rows&lt;/td&gt;
&lt;td&gt;$0.30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;200 rows&lt;/td&gt;
&lt;td&gt;$1.05&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1,000 rows&lt;/td&gt;
&lt;td&gt;$5.05&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10,000 rows&lt;/td&gt;
&lt;td&gt;$50.05&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For a 200-company qualification pass (roughly 2,000–4,000 rows), this Actor costs $10–$20 — no subscription, no seat limit, no contract. Apify's $5 free trial credit covers your first ~990 rows with no credit card.&lt;/p&gt;

&lt;h2&gt;
  
  
  The technically interesting bit
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The vocabulary is the product.&lt;/strong&gt; ~110 canonical tech names sounds small, but precision matters more than recall here — a false positive ("Java" matching "JavaScript" without word boundaries) poisons the very signal buyers are paying for. Every entry uses case-insensitive word-boundary matching (&lt;code&gt;\bPostgres\b&lt;/code&gt;), and the alternation is sorted longest-first so "PostgreSQL" wins over "Postgres" on a shared prefix while "Javanese" and "Djangocon" never fire. Buyers in this category have been burned by LLM-based detection inventing plausible-sounding stacks; a deterministic, auditable regex is the antidote. Miss a tool? Open a feedback ticket and it ships in the next version — no model fine-tuning required.&lt;/p&gt;
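&lt;p&gt;Here is roughly what that matching strategy looks like at toy scale. The seven-entry alias table is invented for illustration (the real vocabulary is about 110 names and is not reproduced here), but the mechanics are the ones described: escape each alias, sort longest-first, wrap the alternation in word boundaries, and map every hit back to a canonical name.&lt;/p&gt;

```python
import re

# Toy alias table: lowercase alias mapped to canonical name (illustrative only).
ALIASES = {
    "postgres": "Postgres",
    "postgresql": "Postgres",
    "node.js": "Node.js",
    "nodejs": "Node.js",
    "java": "Java",
    "javascript": "JavaScript",
    "django": "Django",
}

# Longest alias first, so the alternation prefers the most specific spelling.
_ordered = sorted(ALIASES, key=len, reverse=True)
_alternation = "|".join(re.escape(alias) for alias in _ordered)
PATTERN = re.compile(r"\b(?:" + _alternation + r")\b", re.IGNORECASE)


def detect_techs(text):
    """Return canonical tech names found in text, deduplicated and sorted."""
    found = {ALIASES[match.group(0).lower()] for match in PATTERN.finditer(text)}
    return sorted(found, key=str.lower)


sample = (
    "We use JavaScript, PostgreSQL and nodejs. "
    "Javanese speakers and Djangocon regulars are welcome."
)
print(detect_techs(sample))
```

&lt;p&gt;Because the output depends only on the alias table and the regex, you can audit any surprising row by grepping the description for the alias that fired.&lt;/p&gt;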

&lt;h2&gt;
  
  
  Limitations 🚧
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Regex detection only.&lt;/strong&gt; Tools outside the curated ~110-name vocabulary won't appear in &lt;code&gt;detected_techs&lt;/code&gt;. Missing one? Submit a feedback request on the Store page and it ships in the next release.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You supply the token.&lt;/strong&gt; The Actor requires the ATS board token — it does not discover which ATS a company uses or look it up by company name. No LinkedIn or Workday support.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Greenhouse double-encoding edge cases.&lt;/strong&gt; The parser handles the common double-encode; exotic Greenhouse configs may leave extra HTML fragments in &lt;code&gt;description_text&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ashby case-sensitivity is on you.&lt;/strong&gt; Pass &lt;code&gt;ramp&lt;/code&gt; when the token is &lt;code&gt;Ramp&lt;/code&gt; and you get zero rows — find the correct casing in the Ashby job-board URL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;7-day default storage.&lt;/strong&gt; On the FREE plan, default datasets expire after 7 days. For persistent pipelines, open a named dataset: &lt;code&gt;Actor.open_dataset(name="my-stack-data")&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Is it legal to scrape job boards?&lt;/strong&gt;&lt;br&gt;
These endpoints are public APIs published intentionally by Greenhouse, Lever, and Ashby so that job aggregators can distribute postings. The Actor requests only publicly visible job post data at a rate the APIs can handle, collects only company-aggregated tech signals (no personal applicant data), and bypasses no authentication. As always, review your own jurisdiction and intended use case before running in production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I export the results to a spreadsheet or data warehouse?&lt;/strong&gt;&lt;br&gt;
Yes. The Apify Console exports CSV, Excel, and JSON directly from the dataset view. You can also webhook the dataset on &lt;code&gt;ACTOR.RUN.SUCCEEDED&lt;/code&gt; into Make, Zapier, or n8n, or pull it programmatically via the &lt;a href="https://docs.apify.com/api/v2" rel="noopener noreferrer"&gt;Apify API&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is there an official API for this?&lt;/strong&gt;&lt;br&gt;
The Actor itself is the programmatic interface: call it via the &lt;a href="https://docs.apify.com/api/client/python" rel="noopener noreferrer"&gt;Apify Python client&lt;/a&gt; or the REST API. The underlying Greenhouse, Lever, and Ashby job-board APIs are official and publicly documented.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do I find a company's ATS if I don't know it?&lt;/strong&gt;&lt;br&gt;
Look for the "Powered by Greenhouse / Lever / Ashby" badge on their careers page, or inspect the job-listing URL: the token is the first path segment after the domain (on Greenhouse boards, the segment just before &lt;code&gt;/jobs/&lt;/code&gt;).&lt;/p&gt;
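
&lt;p&gt;If you already have a job-listing URL, pulling the token out is a one-liner. A minimal sketch, assuming the token is always the first path segment; the helper name &lt;code&gt;board_token&lt;/code&gt; and the example URLs (including the job IDs) are illustrative, not part of the Actor:&lt;/p&gt;

```python
from urllib.parse import urlparse


def board_token(job_url):
    """Return the first path segment of a job-board URL, where the token usually sits."""
    path = urlparse(job_url).path
    segments = [part for part in path.split("/") if part]
    if not segments:
        raise ValueError(f"no path segment found in {job_url!r}")
    # Casing is preserved on purpose: Ashby tokens are case-sensitive.
    return segments[0]


# Illustrative URLs only; the job IDs are made up.
print(board_token("https://boards.greenhouse.io/airtable/jobs/1234567"))
print(board_token("https://jobs.ashbyhq.com/Ramp/some-job-id"))
```

&lt;p&gt;The helper deliberately does not lowercase anything, so whatever casing appears in the URL is what you pass to the Actor.&lt;/p&gt;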

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;The Actor is live on the Apify Store: &lt;strong&gt;&lt;a href="https://apify.com/DevilScrapes/ats-tech-stack-scraper" rel="noopener noreferrer"&gt;apify.com/DevilScrapes/ats-tech-stack-scraper&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Free $5 trial credit, no credit card. Drop in Airtable on Greenhouse and Ramp on Ashby, set &lt;code&gt;minTechsDetected: 2&lt;/code&gt;, and you'll have a cross-company tech-stack dataset in under a minute. Building a Clay recipe or a qualification workflow and want a specific output field? Leave a comment — I build based on what people actually need.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Further reading:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://developers.greenhouse.io/job-board.html" rel="noopener noreferrer"&gt;Greenhouse Job Board API docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hire.lever.co/developer/postings" rel="noopener noreferrer"&gt;Lever Postings API docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.apify.com/sdk/python" rel="noopener noreferrer"&gt;Apify Python SDK — calling Actors&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Built by &lt;a href="https://apify.com/DevilScrapes" rel="noopener noreferrer"&gt;Devil Scrapes&lt;/a&gt; — Apify Actors that do the dirty work so your dataset stays clean.&lt;/em&gt; 😈&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>python</category>
      <category>apify</category>
      <category>data</category>
    </item>
    <item>
      <title>Google AI Overview Tracker: 8-selector battery + citation drift telemetry</title>
      <dc:creator>Devil Scrapes</dc:creator>
      <pubDate>Sun, 31 May 2026 11:26:18 +0000</pubDate>
      <link>https://dev.to/devil_scrapes/google-ai-overview-tracker-8-selector-battery-citation-drift-telemetry-j65</link>
      <guid>https://dev.to/devil_scrapes/google-ai-overview-tracker-8-selector-battery-citation-drift-telemetry-j65</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Quick answer:&lt;/strong&gt; Google publishes no API for AI Overview citations. The only way to get the data programmatically is to render Google SERPs in a real browser and parse the citation carousel client-side. The &lt;a href="https://apify.com/DevilScrapes/ai-overview-citations" rel="noopener noreferrer"&gt;Google AI Overview Citation Tracker&lt;/a&gt; does exactly that — one Pydantic-validated row per (query × cited source) at &lt;strong&gt;$5.50 per 1,000 rows&lt;/strong&gt;, with selector-drift telemetry so you know when Google rotates its markup before your dashboard goes dark.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Answer Engine Optimization has a measurement problem no major SEO platform has solved. Ahrefs, Semrush, and Sistrix track your domain's SERP rank, but AI Overview appears above position 1 for roughly 30% of informational queries in 2026, and its citations are drawn from a different pool than your normal rankings. You can rank in position 1 and still be invisible in AI Overview while a competitor's 2022 blog post gets cited six times, with no backlink your SEO tools can detect. What's missing is structured, per-query citation data you can check against a competitor list, and that is the gap this Actor closes.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Google AI Overview? 🔎
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://blog.google/products/search/generative-ai-search/" rel="noopener noreferrer"&gt;Google AI Overview&lt;/a&gt; is the AI-generated summary block at the top of Google's search results for informational queries. It rolled out broadly in the US in May 2024 and expanded globally through 2025 — Google's generative-AI answer inside the SERP, its response to Perplexity and ChatGPT Search. For a query like "what causes inflation", it renders a 3-5 sentence synthesis with a carousel of 4-8 cited sources below it.&lt;/p&gt;

&lt;p&gt;The citations are the commercially interesting part. Those cited domains get free brand impressions, click-throughs, and authority signals that traditional SEO tools never surface. The shift is large enough that some publishers have watched informational-query traffic fall 20-40% even while their SERP rank held steady.&lt;/p&gt;

&lt;h2&gt;
  
  
  Does Google AI Overview have an API? 📡
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;No.&lt;/strong&gt; As of 2026, Google publishes no official API, export endpoint, or structured feed for AI Overview citations. The only programmatic surface is what the browser renders client-side. Google's &lt;a href="https://developers.google.com/search/docs/appearance/ai-overviews" rel="noopener noreferrer"&gt;Search Central documentation&lt;/a&gt; covers AEO best practices but provides no access to citation data. To collect it at scale you render real Google SERPs in a real browser and parse the output yourself — the entire reason this Actor exists instead of a three-line API call.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the data looks like
&lt;/h2&gt;

&lt;p&gt;Every citation in an AI Overview carousel produces one flat, typed row:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"what causes inflation"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"country"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"us"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"language"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"en"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ai_overview_appeared"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ai_overview_text_excerpt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Inflation is caused by a combination of demand-pull factors, cost-push factors..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"citation_position"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"source_domain"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"imf.org"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"source_url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://www.imf.org/en/Publications/fandd/issues/Series/Back-to-Basics/Inflation"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"source_title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Inflation: Prices on the Rise"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"selector_used"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"div[aria-label=&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;AI Overview&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;]"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"scraped_at"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-05-16T20:50:00.000Z"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When AI Overview did not appear for a query, the Actor still emits a row — &lt;code&gt;ai_overview_appeared: false&lt;/code&gt;, all citation fields null. That absence is itself a valid AEO signal: you need to know which queries don't trigger AI Overview today, because that changes.&lt;/p&gt;

&lt;p&gt;Eleven fields total, validated through Pydantic v2 &lt;code&gt;ResultRow.model_validate&lt;/code&gt; before writing. Drop it straight into BigQuery, Sheets, or a pandas pivot — no positional-array wrangling on your side.&lt;/p&gt;
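
&lt;p&gt;To make the two row shapes concrete, here is a rough standard-library stand-in for the schema. It is a sketch, not the Actor's actual Pydantic model: the class name &lt;code&gt;ResultRowSketch&lt;/code&gt;, the helper &lt;code&gt;marker_row&lt;/code&gt;, and the choice of which fields are nullable are assumptions; only the eleven field names come from the sample row above.&lt;/p&gt;

```python
from dataclasses import dataclass, asdict, fields
from typing import Optional


@dataclass
class ResultRowSketch:
    # Stand-in for the Actor's Pydantic model; field names come from the sample row.
    query: str
    country: str
    language: str
    ai_overview_appeared: bool
    scraped_at: str
    # Assumed nullable: everything that only exists when a citation was found.
    ai_overview_text_excerpt: Optional[str] = None
    citation_position: Optional[int] = None
    source_domain: Optional[str] = None
    source_url: Optional[str] = None
    source_title: Optional[str] = None
    selector_used: Optional[str] = None


def marker_row(query, country, language, scraped_at):
    """Build the row emitted when no AI Overview rendered: citation fields stay None."""
    return ResultRowSketch(query, country, language, False, scraped_at)


row = asdict(marker_row("buy running shoes", "us", "en", "2026-05-16T20:50:00.000Z"))
print(len(fields(ResultRowSketch)), row["ai_overview_appeared"], row["source_domain"])
```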

&lt;h2&gt;
  
  
  The naive approach (and why it falls apart) 🔩
&lt;/h2&gt;

&lt;p&gt;The mental model most people start with: open DevTools, find whatever request the SERP makes, replay it in Python. Three failure modes kill that before the first result lands.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Google hard-blocks datacenter IPs.&lt;/strong&gt; Our recon showed the &lt;code&gt;sorry/index&lt;/code&gt; reCAPTCHA interstitial appearing within one second for direct-IP requests, regardless of fingerprint quality. Proxy is load-bearing, not optional. We thread Apify residential proxies, rotate the session ID per query (Apify's &lt;code&gt;session_id&lt;/code&gt; regex requires &lt;code&gt;^[\w._~]+$&lt;/code&gt; — no hyphens), and fall back to &lt;code&gt;BUYPROXIES94952&lt;/code&gt; when residential is unavailable on your plan.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. AI Overview lazy-renders client-side.&lt;/strong&gt; The carousel appears 5-7 seconds after &lt;code&gt;domcontentloaded&lt;/code&gt; via a separate async render pass — a tool that scrapes the raw HTML response gets nothing, because the container does not exist in the initial DOM. We render with &lt;a href="https://camoufox.com/" rel="noopener noreferrer"&gt;Camoufox&lt;/a&gt; (the Firefox fork with anti-detection patches our org mandates per ADR-0002) and wait a configurable 4-15 seconds for the overlay to settle before probing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Google rotates the AI Overview markup.&lt;/strong&gt; This kills scrapers quietly. Since launch in May 2024, the container's identifying attributes have changed at least three times. A scraper that hardcodes &lt;code&gt;div[aria-label="AI Overview"]&lt;/code&gt; works until Google A/B-tests a new attribute, then silently returns zero citations.&lt;/p&gt;

&lt;p&gt;We absorb all three. We rotate browser fingerprints through Camoufox's Firefox TLS and navigator stack, and on &lt;code&gt;408 / 429 / 5xx&lt;/code&gt; or a CAPTCHA intercept we rotate the proxy session and retry once before emitting a marker row. We back off when Google rate-limits, and surface partial success with a clear &lt;code&gt;Actor.set_status_message&lt;/code&gt; — we never silently return an empty dataset. The &lt;code&gt;selector_used&lt;/code&gt; field makes drift detection a single SQL query, which brings us to the most interesting part of this build.&lt;/p&gt;
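
&lt;p&gt;Before the selectors, a quick sketch of the session rotation and retry-once policy described above. This is an illustration under stated assumptions, not the Actor's source: &lt;code&gt;fetch&lt;/code&gt; is a hypothetical callable returning &lt;code&gt;(status, captcha_seen, payload)&lt;/code&gt;, and the helper names are made up. The only detail taken from the text is the session ID character class and the retryable conditions.&lt;/p&gt;

```python
import re
import uuid

SESSION_ID_PATTERN = re.compile(r"[\w._~]+")  # same character class as quoted above
RETRYABLE = {408, 429}


def new_session_id(prefix="q"):
    """Fresh session ID per query; uuid4().hex has no hyphens, so it fits the pattern."""
    session_id = f"{prefix}_{uuid.uuid4().hex}"
    assert SESSION_ID_PATTERN.fullmatch(session_id)
    return session_id


def should_retry(status, captcha_seen):
    """True on a CAPTCHA intercept, 408, 429, or any 5xx status."""
    return captcha_seen or status in RETRYABLE or status // 100 == 5


def fetch_with_one_retry(query, fetch):
    """Call fetch(query, session_id) at most twice, rotating the session in between.

    Returns the payload, or None when both attempts fail (the caller would
    then emit a marker row).
    """
    for _attempt in range(2):
        status, captcha_seen, payload = fetch(query, new_session_id())
        if not should_retry(status, captcha_seen):
            return payload
    return None


# Fake fetcher for illustration: rate-limited once, then succeeds.
calls = []


def fake_fetch(query, session_id):
    calls.append(session_id)
    if len(calls) == 1:
        return 429, False, None
    return 200, False, {"query": query}


print(fetch_with_one_retry("what causes inflation", fake_fetch))
```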

&lt;h2&gt;
  
  
  The Actor: 8-selector fall-through battery 🎯
&lt;/h2&gt;

&lt;p&gt;I packaged this as an Apify Actor: &lt;strong&gt;&lt;a href="https://apify.com/DevilScrapes/ai-overview-citations" rel="noopener noreferrer"&gt;Google AI Overview Citation Tracker&lt;/a&gt;&lt;/strong&gt;. The selector battery is the load-bearing decision — eight selectors probed in priority order, first hit wins:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Priority&lt;/th&gt;
&lt;th&gt;Selector&lt;/th&gt;
&lt;th&gt;Origin&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;div[aria-label="AI Overview"]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Current canonical (2026)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;div[data-attrid="AI Overview"]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2025 rotation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;code&gt;div[data-attrid="wa:/description"]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Historical knowledge-panel reuse&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;code&gt;div[jsname][data-rl="ai_overview"]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2025 rotation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;&lt;code&gt;div[data-async-context*="ai_overview"]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Async-loaded variant&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;&lt;code&gt;div#m-x-content&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Mobile SGE legacy id&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Any &lt;code&gt;h1/h2/h3&lt;/code&gt; whose text starts &lt;code&gt;AI Overview&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Last-resort text fallback&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;div&lt;/code&gt; containing &lt;code&gt;h2&lt;/code&gt; with text &lt;code&gt;AI Overview&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Last-resort structural fallback&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Every row records which selector fired. If selector 1 stops winning and a lower-priority selector such as 4 takes over, Google is probably rotating its markup again: open an issue and we'll add the new attribute to the battery.&lt;/p&gt;
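
&lt;p&gt;The fall-through itself is simple. A minimal sketch covering the six CSS selectors from the table (the two text-based fallbacks would need their own probes): &lt;code&gt;probe&lt;/code&gt; is any callable that takes a selector and returns an element or &lt;code&gt;None&lt;/code&gt;, for example a synchronous Playwright &lt;code&gt;page.query_selector&lt;/code&gt;. The function name and the fake DOM are illustrative, not lifted from the Actor.&lt;/p&gt;

```python
CSS_BATTERY = [
    'div[aria-label="AI Overview"]',
    'div[data-attrid="AI Overview"]',
    'div[data-attrid="wa:/description"]',
    'div[jsname][data-rl="ai_overview"]',
    'div[data-async-context*="ai_overview"]',
    "div#m-x-content",
]


def first_matching_selector(probe, battery=CSS_BATTERY):
    """Probe selectors in priority order and return (selector, element) for the first hit.

    Returns (None, None) when nothing in the battery matches.
    """
    for selector in battery:
        element = probe(selector)
        if element is not None:
            return selector, element
    return None, None


# Fake DOM for illustration: only the fourth selector matches.
fake_dom = {'div[jsname][data-rl="ai_overview"]': "container"}
selector_used, container = first_matching_selector(fake_dom.get)
print(selector_used)
```

&lt;p&gt;Returning the selector alongside the element is what makes the &lt;code&gt;selector_used&lt;/code&gt; column possible at no extra cost.&lt;/p&gt;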

&lt;p&gt;Run it from the Apify Console or programmatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;apify_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ApifyClient&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ApifyClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_APIFY_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;actor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DevilScrapes/ai-overview-citations&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;run_input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;queries&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;best CRM for startups 2026&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;what causes inflation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;how to reduce churn rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;country&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;language&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maxQueries&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;waitMsAfterLoad&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;8000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;proxyConfiguration&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;useApifyProxy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;apifyProxyGroups&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RESIDENTIAL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;defaultDatasetId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;iterate_items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ai_overview_appeared&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source_domain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;citation_position&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Input accepts 1-50 queries per run, plus &lt;code&gt;country&lt;/code&gt; (ISO-3166 alpha-2 → &lt;code&gt;gl=&lt;/code&gt;) and &lt;code&gt;language&lt;/code&gt; (ISO-639-1 → &lt;code&gt;hl=&lt;/code&gt;) for locale targeting. &lt;code&gt;waitMsAfterLoad&lt;/code&gt; (default 8000ms) controls how long the Actor waits after &lt;code&gt;domcontentloaded&lt;/code&gt; before probing — raise it to 12000-15000ms for slow proxy exits.&lt;/p&gt;
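
&lt;p&gt;For reference, this is roughly how the locale inputs map onto Google's query string. A sketch only: the exact URL the Actor builds is not shown here, and &lt;code&gt;serp_url&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def serp_url(query, country="us", language="en"):
    """Build a Google search URL with gl (country) and hl (interface language) set."""
    params = {"q": query, "gl": country.lower(), "hl": language.lower()}
    return "https://www.google.com/search?" + urlencode(params)


url = serp_url("best mortgage rates", country="gb")
print(parse_qs(urlsplit(url).query))
```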

&lt;h2&gt;
  
  
  Use cases
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;AEO dashboard.&lt;/strong&gt; Schedule a weekly run for your 50 highest-priority informational queries. Chart &lt;code&gt;source_domain&lt;/code&gt; share-of-citation over time alongside &lt;code&gt;ai_overview_appeared&lt;/code&gt; rate, and catch when a competitor first appears in the carousel for a query where you rank position 1. A 50-query run yields roughly 95 rows — about &lt;strong&gt;$0.53 per run&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Competitive citation gap analysis.&lt;/strong&gt; Run the 20 queries you want to rank for and map which domains Google currently cites for them. That list is your outreach shortlist — a mention from a site Google already trusts beats generic link-building.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Brand monitoring.&lt;/strong&gt; Run your core product-category queries weekly and alert when your domain drops out of the citation set — or when a direct competitor appears. Most brands have no instrumentation here.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Localized AEO comparison.&lt;/strong&gt; Run identical query lists with &lt;code&gt;country=us&lt;/code&gt; vs &lt;code&gt;country=gb&lt;/code&gt;. Citations for "best mortgage rates" differ sharply between US and UK — different markets entirely.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pricing — exact numbers 💰
&lt;/h2&gt;

&lt;p&gt;Pay-per-event: &lt;code&gt;actor-start&lt;/code&gt; is &lt;strong&gt;$0.05&lt;/strong&gt; once per run, &lt;code&gt;result-row&lt;/code&gt; is &lt;strong&gt;$0.005&lt;/strong&gt; per row written (citation hit or no-AI-Overview marker). Beyond the start fee, you pay only for rows that land in your dataset.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Rows&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;10-query spot check (~30% hit rate, ~4 citations/hit)&lt;/td&gt;
&lt;td&gt;~19&lt;/td&gt;
&lt;td&gt;~$0.15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50-query weekly AEO audit&lt;/td&gt;
&lt;td&gt;~95&lt;/td&gt;
&lt;td&gt;~$0.53&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;500-query category sweep&lt;/td&gt;
&lt;td&gt;~950&lt;/td&gt;
&lt;td&gt;~$4.80&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1,000-row dataset (effective rate)&lt;/td&gt;
&lt;td&gt;1,000&lt;/td&gt;
&lt;td&gt;~$5.50&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The $5.50/1,000 effective rate sits above commodity SERP scrapers because this citation data is essentially unavailable elsewhere at this granularity. Ahrefs and Semrush are beginning to ship AEO modules at $300-1,500/month — and they only track your own domain. Apify's $5 free trial credit covers roughly 900-950 rows, no credit card needed.&lt;/p&gt;
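
&lt;p&gt;The table rows follow from the two event prices. A small sketch of the arithmetic, using the ~30% hit rate and ~4 citations per hit from the first scenario as defaults; the function names are mine. The last line compares 1,000 rows in a single run with the same rows spread over 50-query runs, which is one plausible reading of the ~$5.50 effective rate (my assumption, not something the pricing page states).&lt;/p&gt;

```python
import math

ACTOR_START_USD = 0.05
RESULT_ROW_USD = 0.005


def expected_rows(queries, hit_rate=0.30, citations_per_hit=4):
    """Rows per run: citation rows for hits plus one marker row per miss."""
    hits = queries * hit_rate
    return round(hits * citations_per_hit + (queries - hits))


def run_cost(rows, runs=1):
    """Total cost in USD for `rows` rows spread over `runs` Actor starts."""
    return round(runs * ACTOR_START_USD + rows * RESULT_ROW_USD, 3)


for queries in (10, 50, 500):
    rows = expected_rows(queries)
    print(queries, rows, run_cost(rows))

# 1,000 rows in a single run vs. spread over 50-query runs of ~95 rows each.
print(run_cost(1000), run_cost(1000, runs=math.ceil(1000 / 95)))
```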

&lt;h2&gt;
  
  
  The technically interesting part: why we record which selector fired
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;selector_used&lt;/code&gt; field is operational telemetry, shipped deliberately. &lt;a href="https://developers.google.com/search/docs/appearance/ai-overviews" rel="noopener noreferrer"&gt;Google has rotated the AI Overview container's attributes multiple times&lt;/a&gt; since launch in 2024, and each rotation silently kills scrapers that hardcode one selector: the parser falls through to empty and the dataset looks fine until someone notices the citation count has dropped to zero. Recording which of the 8 selectors matched on every row turns that failure into a simple query against your own dataset:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;selector_used&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;hits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scraped_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;your_aeo_dataset&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the distribution shifts — selector 1 dropping, selector 4 climbing — you get a 24-48 hour warning before coverage degrades. The alternative is waking up to a week of empty citation data and no idea why.&lt;/p&gt;
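
&lt;p&gt;If your rows live in Python rather than a warehouse, the same check fits in a few lines. A sketch, assuming marker rows carry no &lt;code&gt;selector_used&lt;/code&gt; value and that &lt;code&gt;scraped_at&lt;/code&gt; is an ISO-8601 string as in the sample row; &lt;code&gt;primary_share_by_day&lt;/code&gt; is a hypothetical helper:&lt;/p&gt;

```python
from collections import Counter, defaultdict

PRIMARY = 'div[aria-label="AI Overview"]'


def primary_share_by_day(rows, primary=PRIMARY):
    """Map each scrape day to the fraction of citation rows matched by `primary`."""
    per_day = defaultdict(Counter)
    for row in rows:
        if row.get("selector_used"):  # assumed: marker rows carry no selector
            day = row["scraped_at"][:10]  # YYYY-MM-DD prefix of the ISO timestamp
            per_day[day][row["selector_used"]] += 1
    return {
        day: counts[primary] / sum(counts.values())
        for day, counts in sorted(per_day.items())
    }


rows = [
    {"scraped_at": "2026-05-16T20:50:00.000Z", "selector_used": PRIMARY},
    {"scraped_at": "2026-05-16T20:51:00.000Z", "selector_used": PRIMARY},
    {"scraped_at": "2026-05-17T20:50:00.000Z", "selector_used": PRIMARY},
    {"scraped_at": "2026-05-17T20:51:00.000Z", "selector_used": 'div[jsname][data-rl="ai_overview"]'},
    {"scraped_at": "2026-05-17T20:52:00.000Z", "selector_used": None},
]
print(primary_share_by_day(rows))
```

&lt;p&gt;A falling share for the primary selector is the signal to watch; the threshold for alerting is up to you.&lt;/p&gt;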

&lt;h2&gt;
  
  
  Limitations (the honest list) 🚧
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AI Overview triggers on roughly 30% of queries.&lt;/strong&gt; Transactional, navigational, and trademark-heavy queries mostly produce &lt;code&gt;ai_overview_appeared=false&lt;/code&gt; marker rows; informational queries (&lt;code&gt;what is&lt;/code&gt;, &lt;code&gt;how to&lt;/code&gt;, &lt;code&gt;best X 2026&lt;/code&gt;) have the best trigger rate. Marker rows are charged at the same per-row rate — the absence is a valid data point.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v0.1 is English-tuned.&lt;/strong&gt; The text-based fallback selectors (7 and 8) match the literal string &lt;code&gt;AI Overview&lt;/code&gt;, so non-English locales may produce false negatives on that path. The CSS battery (selectors 1-6) is locale-agnostic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apify Proxy is required, not optional.&lt;/strong&gt; Google hard-blocks Apify datacenter IPs; without a proxy configuration the Actor fails fast at startup with a clear status message. The FREE tier gets &lt;code&gt;BUYPROXIES94952&lt;/code&gt; (5 datacenter IPs, higher CAPTCHA rate); paid plans with RESIDENTIAL get substantially cleaner runs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mobile SERP is out of scope for v0.1.&lt;/strong&gt; Mobile AI Overview has a different DOM structure; a mobile-variant Actor is planned.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Actor records citation URLs but does not follow them.&lt;/strong&gt; For destination page content, pair it with a downstream HTTP scraper.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Is scraping Google Search results legal?&lt;/strong&gt;&lt;br&gt;
The data returned is public: the same content anyone sees in a browser. This Actor reads only what Google renders in the public SERP, at a paced rate with per-query session isolation, and collects no personal data. &lt;a href="https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn" rel="noopener noreferrer"&gt;hiQ Labs v. LinkedIn&lt;/a&gt; (9th Circuit, 2022) held that scraping publicly accessible data likely does not violate the CFAA. Legality still varies by jurisdiction and use case, so review Google's Terms of Service and your local regulations for your situation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I export the data to Sheets, CSV, or a data warehouse?&lt;/strong&gt;&lt;br&gt;
Yes. The Apify Console downloads CSV / Excel / JSON directly from the dataset view. You can also webhook the dataset on &lt;code&gt;ACTOR.RUN.SUCCEEDED&lt;/code&gt; into Make, Zapier, or n8n, or pull it via the &lt;a href="https://docs.apify.com/api/v2" rel="noopener noreferrer"&gt;Apify API&lt;/a&gt; using the &lt;code&gt;defaultDatasetId&lt;/code&gt; from the run response.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is there an official Google API for AI Overview citations?&lt;/strong&gt;&lt;br&gt;
No. As of 2026, Google provides no API or structured export for AI Overview citation data. &lt;a href="https://developers.google.com/search/" rel="noopener noreferrer"&gt;Google Search Central&lt;/a&gt; documents general AEO guidance but no programmatic citation access. This Actor is the practical alternative.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why emit a row even when AI Overview didn't appear?&lt;/strong&gt;&lt;br&gt;
Because the absence is meaningful AEO data. Run the same query set weekly and you want the &lt;code&gt;ai_overview_appeared&lt;/code&gt; rate over time — when a query transitions from non-triggering to triggering, that's the moment a citation opportunity opens. Marker rows make the transition visible, charged at the same $0.005 per-row rate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;The Actor is on the Apify Store: &lt;strong&gt;&lt;a href="https://apify.com/DevilScrapes/ai-overview-citations" rel="noopener noreferrer"&gt;apify.com/DevilScrapes/ai-overview-citations&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Free $5 trial credit, no credit card. Run it on your 10 most important informational queries and you'll have the citation breakdown in your dataset within minutes. Find a selector miss, a locale that doesn't work, or a field you wish it returned? Drop it in the comments — real reported drift is exactly what I build the next selector battery from.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built by &lt;a href="https://apify.com/DevilScrapes" rel="noopener noreferrer"&gt;Devil Scrapes&lt;/a&gt; — Apify Actors with attitude. Pay-per-event, transparent pricing, no junk fields.&lt;/em&gt; 😈&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>seo</category>
      <category>python</category>
      <category>apify</category>
    </item>
    <item>
      <title>arXiv Scraper: build a RAG-ready paper corpus for $1.50/1K</title>
      <dc:creator>Devil Scrapes</dc:creator>
      <pubDate>Sun, 31 May 2026 11:21:04 +0000</pubDate>
      <link>https://dev.to/devil_scrapes/arxiv-scraper-build-a-rag-ready-paper-corpus-for-1501k-28bp</link>
      <guid>https://dev.to/devil_scrapes/arxiv-scraper-build-a-rag-ready-paper-corpus-for-1501k-28bp</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Quick answer:&lt;/strong&gt; &lt;a href="https://arxiv.org" rel="noopener noreferrer"&gt;arXiv&lt;/a&gt; publishes an Atom feed API at &lt;code&gt;export.arxiv.org/api/query&lt;/code&gt; that any program can call — but it rate-limits hard, paginates awkwardly, and ships XML you still have to wrangle into rows. An &lt;em&gt;arXiv scraper&lt;/em&gt; wraps that feed with proper pagination, retry logic, and rate-limit pacing, then returns every matching paper as structured JSON. The Apify Actor below does it for &lt;strong&gt;$0.0015 per paper&lt;/strong&gt; (~$1.50 per 1,000), handling the failure modes so your embedding pipeline sees clean rows instead of dropped requests.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;There is a moment every ML engineer knows: you have picked your retrieval architecture, your vector store is standing by, and you just need a clean slice of arXiv papers to run benchmarks on. So you open the &lt;a href="https://info.arxiv.org/help/api/user-manual.html" rel="noopener noreferrer"&gt;arXiv export API docs&lt;/a&gt;, write twenty lines of Python, and an hour later you are still debugging timeouts, malformed Unicode in author names, and a cursor that resets when you change the &lt;code&gt;start&lt;/code&gt; offset without the right headers.&lt;/p&gt;

&lt;p&gt;The papers are not the problem. Getting them out is.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is arXiv? 📄
&lt;/h2&gt;

&lt;p&gt;arXiv (pronounced "archive") is a free, open-access preprint server launched in 1991 and now operated by Cornell University. Researchers in physics, mathematics, computer science, quantitative biology, statistics, and economics deposit papers there before peer review, and sometimes instead of formal journal submission.&lt;/p&gt;

&lt;p&gt;For machine learning, it is effectively the primary literature. Every major model, benchmark, and technique shows up on arXiv before it shows up anywhere else. The &lt;code&gt;cs.AI&lt;/code&gt;, &lt;code&gt;cs.LG&lt;/code&gt;, and &lt;code&gt;cs.CL&lt;/code&gt; subcategories receive hundreds of new papers every weekday. As of 2026, the total corpus sits above 2.3 million papers.&lt;/p&gt;

&lt;p&gt;For researchers and engineers building retrieval systems, evaluation corpora, or trend-monitoring pipelines, arXiv is an obvious source of truth — if you can get the data out reliably.&lt;/p&gt;

&lt;h2&gt;
  
  
  Does arXiv have an API?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Yes — but it is more limited than it looks.&lt;/strong&gt; arXiv runs a public &lt;a href="https://info.arxiv.org/help/api/user-manual.html" rel="noopener noreferrer"&gt;Atom feed API&lt;/a&gt; at &lt;code&gt;export.arxiv.org/api/query&lt;/code&gt;. You can query by title, author, category, abstract, or combinations. It supports sorting by submission date or last-update date. Pagination works through &lt;code&gt;start&lt;/code&gt; and &lt;code&gt;max_results&lt;/code&gt; parameters.&lt;/p&gt;

&lt;p&gt;What the documentation underplays: arXiv's own guidelines request no more than one request every three seconds from a single client. On a 30,000-paper category sweep at 50 papers per page, that is 600 pages, 1,800 seconds of wait time, and dozens of transient network failures to handle across a 30-minute run. The API also returns Atom XML, not JSON — which is fine for a one-off, and tedious for a pipeline.&lt;/p&gt;

&lt;p&gt;This is the gap the Actor fills.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the data looks like
&lt;/h2&gt;

&lt;p&gt;Each paper comes back as one flat, typed row. Here is a real example with every field the Actor returns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"arxiv_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2401.12345v2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://arxiv.org/abs/2401.12345v2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"pdf_url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://arxiv.org/pdf/2401.12345v2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Scaling Laws for Sparse Mixture-of-Experts Language Models"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"summary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"We investigate scaling laws for sparse mixture-of-experts (MoE) language models..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"authors"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Alex Doe"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Jamie Smith"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Morgan Lee"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"primary_category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cs.CL"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"categories"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"cs.CL"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cs.LG"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cs.AI"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"doi"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"10.18653/v1/2024.acl-long.001"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"journal_ref"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Proceedings of ACL 2024"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"comment"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"18 pages, 7 figures. Accepted at ACL 2024."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"published"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2024-01-22T16:00:00+00:00"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"updated"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2024-03-15T09:30:00+00:00"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"scraped_at"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-05-31T11:00:00+00:00"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Fourteen fields, Pydantic-validated before they reach your dataset. &lt;code&gt;arxiv_id&lt;/code&gt; includes the version suffix; &lt;code&gt;published&lt;/code&gt; and &lt;code&gt;updated&lt;/code&gt; are ISO-8601 UTC; &lt;code&gt;doi&lt;/code&gt; and &lt;code&gt;journal_ref&lt;/code&gt; are &lt;code&gt;null&lt;/code&gt; for preprints that have not been formally published. That is the full schema — no hidden fields, no surprise &lt;code&gt;null&lt;/code&gt; rows.&lt;/p&gt;

&lt;h2&gt;
  
  
  The naive approach (and why it falls apart) 🔧
&lt;/h2&gt;

&lt;p&gt;The first script most engineers write looks something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="n"&gt;base&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://export.arxiv.org/api/query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search_query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cat:cs.AI&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;start&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="c1"&gt;# parse Atom XML, extract entries
&lt;/span&gt;    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That script fails in production for three reasons that are not obvious until you have already burned a Sunday on them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The XML response is finicky.&lt;/strong&gt; arXiv's Atom feed can return an empty &lt;code&gt;&amp;lt;feed&amp;gt;&lt;/code&gt; when pagination goes past the result set, a 503 under load, or a malformed Unicode character in an author name that breaks naive XML parsers. A parse failure stops your sweep silently at row 2,400.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The rate-limit pushes back mid-sweep.&lt;/strong&gt; arXiv enforces the three-second guideline inconsistently. During peak hours you can hit a 503 that a simple &lt;code&gt;time.sleep(3)&lt;/code&gt; loop will not recover from gracefully. We retry with exponential backoff on &lt;code&gt;408 / 429 / 5xx&lt;/code&gt;, honour &lt;code&gt;Retry-After&lt;/code&gt; headers, and surface partial success rather than returning an empty dataset.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The page cursor is fragile.&lt;/strong&gt; If you restart a failed run with different &lt;code&gt;start&lt;/code&gt; offsets, arXiv treats it as a fresh query and results shift between pages as new papers are indexed mid-sweep. We log the precise offset for each page so you can audit exactly what was fetched.&lt;/p&gt;

&lt;p&gt;We absorb all three failure modes. We also rotate browser fingerprints via &lt;code&gt;curl-cffi&lt;/code&gt; so requests look like a browser hitting the feed, not a raw Python &lt;code&gt;requests&lt;/code&gt; call — which matters less for arXiv than for more aggressive targets, but the defence is already there.&lt;/p&gt;
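
&lt;p&gt;To make the second point concrete, here is a minimal sketch of a retry policy along those lines. It is not the Actor's actual code: the function names, the 120-second cap, and the jitter factor are assumptions made for the example, and a &lt;code&gt;Retry-After&lt;/code&gt; value sent as an HTTP date is simply ignored here.&lt;/p&gt;

```python
import random

# Status codes named in the article; any 5xx is treated as retryable too.
RETRYABLE = {408, 429}


def should_retry(status):
    """Return True for 408, 429, and any 5xx response."""
    return status in RETRYABLE or status // 100 == 5


def backoff_delay(attempt, retry_after=None, base=3.0, cap=120.0, jitter=0.1):
    """Seconds to wait before retry number `attempt` (0-based).

    A numeric Retry-After header wins, but never drops below the
    3-second pacing interval. Otherwise the delay doubles on each
    attempt, starting from `base`, up to `cap` (an assumed ceiling).
    """
    if retry_after is not None and retry_after.strip().isdigit():
        return max(float(retry_after), base)
    delay = min(base * (2 ** attempt), cap)
    # Small random jitter so parallel clients do not retry in lockstep.
    return delay + random.uniform(0, jitter * delay)
```

&lt;p&gt;With these assumed defaults the waits grow as 3, 6, 12, 24 seconds and so on before jitter, and a numeric &lt;code&gt;Retry-After&lt;/code&gt; takes precedence whenever the server sends one.&lt;/p&gt;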

&lt;h2&gt;
  
  
  The Actor ⚙️
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://apify.com/DevilScrapes/arxiv-papers-scraper" rel="noopener noreferrer"&gt;arXiv Papers Scraper&lt;/a&gt; is on the Apify Store. You can run it from the Apify Console UI, or call it programmatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;apify_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ApifyClient&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ApifyClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_APIFY_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;actor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DevilScrapes/arxiv-papers-scraper&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;run_input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;searchQuery&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cat:cs.AI AND ti:transformer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sortBy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;submittedDate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sortOrder&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;descending&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maxResults&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pageSize&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;defaultDatasetId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;iterate_items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arxiv_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key input parameters:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;searchQuery&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;cat:cs.AI&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;arXiv query string. Supports &lt;code&gt;ti:&lt;/code&gt;, &lt;code&gt;au:&lt;/code&gt;, &lt;code&gt;cat:&lt;/code&gt;, &lt;code&gt;abs:&lt;/code&gt; prefixes and Boolean operators.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;sortBy&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;submittedDate&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;relevance&lt;/code&gt;, &lt;code&gt;lastUpdatedDate&lt;/code&gt;, or &lt;code&gt;submittedDate&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;sortOrder&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;descending&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;ascending&lt;/code&gt; or &lt;code&gt;descending&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;maxResults&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;Cap at 5,000 per run. arXiv recommends ≤30,000 per query.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;pageSize&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;Papers per API call; arXiv caps at 2,000.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The output streams live into the run's dataset — export as JSON, CSV, or Excel from the Apify Console, or fetch via the &lt;a href="https://docs.apify.com/api/v2#/reference/datasets" rel="noopener noreferrer"&gt;Apify dataset API&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use cases
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;RAG corpus ingestion.&lt;/strong&gt; Build a domain-specific retrieval pipeline: pull the last 12 months of &lt;code&gt;cat:cs.CL&lt;/code&gt;, drop &lt;code&gt;arxiv_id&lt;/code&gt;, &lt;code&gt;title&lt;/code&gt;, &lt;code&gt;summary&lt;/code&gt;, and &lt;code&gt;published&lt;/code&gt; into your vector store, and you have a searchable knowledge base in an afternoon. The Actor handles pagination; you handle embeddings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Literature review seeding.&lt;/strong&gt; Query &lt;code&gt;ti:scaling laws AND cat:cs.LG&lt;/code&gt; to seed a systematic review. Every result comes back with DOI and journal reference where available — enough to drop straight into a citation manager.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trend monitoring.&lt;/strong&gt; Schedule a daily run on &lt;code&gt;cat:cs.AI&lt;/code&gt; sorted by &lt;code&gt;submittedDate&lt;/code&gt; descending with &lt;code&gt;maxResults: 50&lt;/code&gt;. Pipe the results into a Slack webhook and get yesterday's new papers in your channel every morning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Research attribution tracking.&lt;/strong&gt; Run &lt;code&gt;au:lecun&lt;/code&gt; weekly, diff &lt;code&gt;arxiv_id&lt;/code&gt; against last week's run, and alert on new papers. Useful for tracking lab output or advisor submissions.&lt;/p&gt;
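
&lt;p&gt;The weekly diff in that last use case can be sketched in a few lines. This is only an illustration: the helper names are hypothetical, and it assumes each row carries the &lt;code&gt;arxiv_id&lt;/code&gt; field shown earlier. It compares on the ID without its version suffix, so a revised &lt;code&gt;v2&lt;/code&gt; of a known paper is not reported as new.&lt;/p&gt;

```python
import re


def base_id(arxiv_id):
    """Drop a trailing version suffix: "2401.12345v2" becomes "2401.12345"."""
    return re.sub(r"v\d+$", "", arxiv_id)


def new_papers(this_week, last_week):
    """Rows from this week's run whose paper was absent from last week's."""
    seen = {base_id(row["arxiv_id"]) for row in last_week}
    return [row for row in this_week if base_id(row["arxiv_id"]) not in seen]


last_week = [{"arxiv_id": "2401.12345v1"}]
this_week = [{"arxiv_id": "2401.12345v2"}, {"arxiv_id": "2405.00001v1"}]
print([row["arxiv_id"] for row in new_papers(this_week, last_week)])
# ['2405.00001v1']
```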

&lt;h2&gt;
  
  
  Pricing — exact numbers 💰
&lt;/h2&gt;

&lt;p&gt;Pay-per-event. You pay for papers that land in your dataset; nothing for the setup or failed retries.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Event&lt;/th&gt;
&lt;th&gt;Price&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Actor start (one-off per run)&lt;/td&gt;
&lt;td&gt;$0.005&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Result (per paper written)&lt;/td&gt;
&lt;td&gt;$0.0015&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Volume&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;100 papers&lt;/td&gt;
&lt;td&gt;$0.16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1,000 papers&lt;/td&gt;
&lt;td&gt;$1.51&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10,000 papers&lt;/td&gt;
&lt;td&gt;$15.01&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;30,000 papers (full category sweep, six 5,000-paper runs)&lt;/td&gt;
&lt;td&gt;$45.03&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Apify's $5 free trial credit covers your first ~3,300 papers with no credit card required. For context: building and maintaining the pagination, retry, and parsing layer yourself costs engineering time you could spend on the embedding pipeline.&lt;/p&gt;
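
&lt;p&gt;The volume figures follow directly from the two event prices. Here is a small sketch of the arithmetic, using exact decimals so the half-cent start fee rounds up to the next cent. The function name and the &lt;code&gt;runs&lt;/code&gt; parameter are illustrative and not part of the Actor.&lt;/p&gt;

```python
from decimal import Decimal, ROUND_HALF_UP

START_FEE = Decimal("0.005")   # Actor start, charged once per run
PER_PAPER = Decimal("0.0015")  # charged per paper written to the dataset


def run_cost(papers, runs=1):
    """Total cost in dollars, rounded half-up to the cent."""
    total = START_FEE * runs + PER_PAPER * papers
    return total.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


for papers in (100, 1_000, 10_000):
    print(papers, run_cost(papers))
# 100 0.16
# 1000 1.51
# 10000 15.01
```

&lt;p&gt;If a sweep is split across several runs, pass &lt;code&gt;runs&lt;/code&gt; so that each run's start fee is counted.&lt;/p&gt;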

&lt;h2&gt;
  
  
  The part worth knowing: arXiv's rate-limit is real
&lt;/h2&gt;

&lt;p&gt;Most arXiv scrapers ignore the &lt;a href="https://info.arxiv.org/help/bulk_data.html" rel="noopener noreferrer"&gt;official bulk-access guidance&lt;/a&gt;. arXiv asks for a 3-second delay between requests and asks automated clients to identify themselves in the &lt;code&gt;User-Agent&lt;/code&gt; header. We follow both: the Actor sets a &lt;code&gt;User-Agent&lt;/code&gt; that identifies it as a DevilScrapes Apify Actor, paces requests at the recommended interval, and backs off further when the server signals overload.&lt;/p&gt;

&lt;p&gt;This is not altruism — it is the reason arXiv's API is still open. If you hammer arXiv with an aggressive client, you hurt every researcher who relies on programmatic access. We take the slower throughput and keep the access clean.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Metadata only.&lt;/strong&gt; The Actor uses the Atom feed API. Full-text search over PDF content is not supported — queries operate on metadata fields (title, abstract, author, category). For PDF content extraction, you would need a second-stage pipeline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No canonical author IDs.&lt;/strong&gt; arXiv does not assign stable author identifiers in the public API. Author-disambiguation across name collisions (&lt;code&gt;"J. Smith"&lt;/code&gt; being three different people) is on you.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Version pinning.&lt;/strong&gt; &lt;code&gt;arxiv_id&lt;/code&gt; includes the version suffix (e.g. &lt;code&gt;2401.12345v2&lt;/code&gt;). If you want only the latest version of each paper, filter on the highest &lt;code&gt;v&lt;/code&gt; suffix yourself — the API can return multiple versions of the same paper in a sweep.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result cap.&lt;/strong&gt; &lt;code&gt;maxResults&lt;/code&gt; is capped at 5,000 per run in the Actor input. For larger sweeps (a full &lt;code&gt;cs.AI&lt;/code&gt; category, 30k+ papers), run multiple queries partitioned by date range and merge the datasets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DOI availability.&lt;/strong&gt; Most preprints have no DOI until the paper is journal-published. &lt;code&gt;doi&lt;/code&gt; is &lt;code&gt;null&lt;/code&gt; for the majority of recent papers.&lt;/li&gt;
&lt;/ul&gt;
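
&lt;p&gt;For the version-pinning point, one way to keep only the newest version of each paper is sketched below. The helper names are hypothetical, and the sketch assumes rows shaped like the example above, with the version carried as a trailing &lt;code&gt;vN&lt;/code&gt; on &lt;code&gt;arxiv_id&lt;/code&gt;.&lt;/p&gt;

```python
import re

# Matches an arXiv ID with a version suffix, e.g. "2401.12345v2".
ID_WITH_VERSION = re.compile(r"^(.+)v(\d+)$")


def split_version(arxiv_id):
    """Return (base_id, version); an ID without a suffix counts as version 0."""
    match = ID_WITH_VERSION.match(arxiv_id)
    if match is None:
        return arxiv_id, 0
    return match.group(1), int(match.group(2))


def latest_versions(rows):
    """Keep one row per paper: the one with the highest version suffix."""
    grouped = {}
    for row in rows:
        base, _ = split_version(row["arxiv_id"])
        grouped.setdefault(base, []).append(row)
    return [
        max(group, key=lambda r: split_version(r["arxiv_id"])[1])
        for group in grouped.values()
    ]


rows = [
    {"arxiv_id": "2401.12345v1"},
    {"arxiv_id": "2401.12345v2"},
    {"arxiv_id": "2402.00001v1"},
]
print([row["arxiv_id"] for row in latest_versions(rows)])
# ['2401.12345v2', '2402.00001v1']
```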

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Is scraping arXiv legal?&lt;/strong&gt;&lt;br&gt;
arXiv publishes the Atom API specifically for programmatic access and has &lt;a href="https://info.arxiv.org/help/bulk_data.html" rel="noopener noreferrer"&gt;explicit bulk-access guidance&lt;/a&gt;. The Actor follows that guidance: polite request pacing, identified &lt;code&gt;User-Agent&lt;/code&gt;, metadata only, no auth bypass. As always, review arXiv's terms and your use case.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does arXiv have a proper JSON API?&lt;/strong&gt;&lt;br&gt;
No. The public programmatic surface is the Atom XML feed at &lt;code&gt;export.arxiv.org/api/query&lt;/code&gt; — no official REST/JSON API exists for bulk metadata export. The Actor converts Atom responses to clean JSON rows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I download the actual PDFs?&lt;/strong&gt;&lt;br&gt;
Not through this Actor. We surface &lt;code&gt;pdf_url&lt;/code&gt; for each paper; pulling the PDFs is a separate step. Use the URL as input to a follow-up pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do I build a RAG pipeline on top of this?&lt;/strong&gt;&lt;br&gt;
Run the Actor, export as JSON, load &lt;code&gt;arxiv_id&lt;/code&gt; + &lt;code&gt;title&lt;/code&gt; + &lt;code&gt;summary&lt;/code&gt; into your vector store (ChromaDB, Pinecone, or Qdrant; each accepts a list of dicts), embed &lt;code&gt;summary&lt;/code&gt;, and query semantically.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;The Actor is on the Apify Store: &lt;strong&gt;&lt;a href="https://apify.com/DevilScrapes/arxiv-papers-scraper" rel="noopener noreferrer"&gt;apify.com/DevilScrapes/arxiv-papers-scraper&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Free $5 credit, no credit card. Run &lt;code&gt;cat:cs.AI&lt;/code&gt; with &lt;code&gt;maxResults: 100&lt;/code&gt; and you will have a hundred recent AI papers as clean JSON in under two minutes. If you build something interesting on top of the corpus — a paper recommender, a retrieval benchmark, a trend chart — drop it in the comments. The dataset is the interesting part; we just make it easier to get.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built by &lt;a href="https://apify.com/DevilScrapes" rel="noopener noreferrer"&gt;Devil Scrapes&lt;/a&gt; — Apify Actors with attitude. Pay-per-event, transparent pricing, no junk fields.&lt;/em&gt; 😈&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>python</category>
      <category>apify</category>
    </item>
    <item>
      <title>Google Ads Transparency Scraper: pull any competitor's ads for $1.20/1K</title>
      <dc:creator>Devil Scrapes</dc:creator>
      <pubDate>Sun, 31 May 2026 09:52:50 +0000</pubDate>
      <link>https://dev.to/devil_scrapes/google-ads-transparency-scraper-pull-any-competitors-ads-for-1201k-25fo</link>
      <guid>https://dev.to/devil_scrapes/google-ads-transparency-scraper-pull-any-competitors-ads-for-1201k-25fo</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Quick answer:&lt;/strong&gt; The &lt;a href="https://adstransparency.google.com" rel="noopener noreferrer"&gt;Google Ads Transparency Center&lt;/a&gt; is a public registry of every ad Google runs — but it ships &lt;strong&gt;no API and no bulk export&lt;/strong&gt;. To get the data programmatically you scrape it. A &lt;em&gt;Google Ads Transparency scraper&lt;/em&gt; sends the same RPC call the website uses and returns every ad creative for an advertiser as structured JSON. The Apify Actor below does it for &lt;strong&gt;$0.0012 per ad&lt;/strong&gt; (~$1.20 per 1,000), with the TLS fingerprinting, proxy rotation, and pagination handled for you.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Google's Ads Transparency Center is one of the most underused datasets in marketing. Launched in 2023 under the &lt;a href="https://commission.europa.eu/strategy-and-policy/priorities-2019-2024/europe-fit-digital-age/digital-services-act_en" rel="noopener noreferrer"&gt;EU Digital Services Act&lt;/a&gt; and parallel US pressure, it indexes every ad campaign currently running on Search, YouTube, Display, Shopping, Maps, and Play — keyed by advertiser. Google's own counter lists &lt;strong&gt;300,000+ active creatives for a brand like Nike&lt;/strong&gt;. For your nearest competitor, it's usually 50–500.&lt;/p&gt;

&lt;p&gt;The catch: there's no download button. Just an interactive UI that paginates 40 creatives at a time. If you want this as a CSV — for a competitor sweep, a trademark audit, or a RAG corpus — you have to extract it yourself. Here's what that actually takes, and how I shortened it to one API call.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is the Google Ads Transparency Center? 🔎
&lt;/h2&gt;

&lt;p&gt;The Google Ads Transparency Center is a public, Google-operated registry that shows the ad creatives any verified advertiser is running, the date range each ad was shown, and roughly where. Google built it to comply with ad-disclosure regulation, so the data is public by design — you're reading the same registry a regulator would.&lt;/p&gt;

&lt;p&gt;What it gives you per advertiser:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every &lt;strong&gt;ad creative&lt;/strong&gt; currently or recently live (text, image, video)&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;landing domain&lt;/strong&gt; each ad clicks through to&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;First-shown / last-shown&lt;/strong&gt; timestamps and a rough impression count&lt;/li&gt;
&lt;li&gt;A deep link to each creative inside the Transparency Center&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What it does &lt;em&gt;not&lt;/em&gt; give you: a search-by-keyword mode, region-filtered results from the server, or — crucially — an API.&lt;/p&gt;

&lt;h2&gt;
  
  
  Does the Google Ads Transparency Center have an API?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;No.&lt;/strong&gt; As of 2026 Google publishes no official API or bulk export for the Ads Transparency Center. The only programmatic surface is the internal &lt;code&gt;SearchService/SearchCreatives&lt;/code&gt; RPC that the website itself calls. That endpoint is undocumented, returns a positional protobuf-style array (not labeled JSON), and inspects your TLS fingerprint before it answers. Scraping it reliably is the whole job — which is why a hosted Actor exists instead of a three-line snippet.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the data looks like
&lt;/h2&gt;

&lt;p&gt;Each ad creative comes back as one flat, typed row. Concrete beats abstract, so here's a real one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"advertiser_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AR18378488041124659201"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"advertiser_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Nike Retail BV"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"creative_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CR15771942603307614209"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"creative_url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://adstransparency.google.com/advertiser/AR18378488041124659201/creative/CR15771942603307614209?region=anywhere"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"landing_domain"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"nike.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"format_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"first_shown_ts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1761145807&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"last_shown_ts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1778871417&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"impressions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;205&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"preview_image_url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://tpc.googlesyndication.com/archive/simgad/12774179880874022668"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"preview_content_js_url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"region"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"anywhere"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"scraped_at"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-05-15T19:17:59+00:00"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Thirteen fields, the same shape every time, validated with Pydantic before it's written. It drops straight into Pandas, BigQuery, or a vector store — no positional-array wrangling on your side.&lt;/p&gt;
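&lt;p&gt;The two &lt;code&gt;*_ts&lt;/code&gt; fields look like Unix epoch seconds in UTC. That is an assumption read off the sample record above, not a documented contract. Under that assumption, a minimal standard-library sketch for turning them into timezone-aware datetimes (&lt;code&gt;shown_window&lt;/code&gt; is a hypothetical helper name):&lt;/p&gt;

```python
from datetime import datetime, timezone


def shown_window(record):
    """Return (first_shown, last_shown, whole_days_between) for one record.

    Assumes first_shown_ts / last_shown_ts are Unix epoch seconds in UTC.
    """
    first = datetime.fromtimestamp(record["first_shown_ts"], tz=timezone.utc)
    last = datetime.fromtimestamp(record["last_shown_ts"], tz=timezone.utc)
    return first, last, (last - first).days


# Values copied from the sample record above.
sample = {"first_shown_ts": 1761145807, "last_shown_ts": 1778871417}
first, last, days = shown_window(sample)
print(first.isoformat(), last.isoformat(), days)
```

&lt;p&gt;Keeping the conversion explicit about UTC avoids local-time surprises when the same export is loaded on machines in different time zones.&lt;/p&gt;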

&lt;h2&gt;
  
  
  The naive approach (and why it falls apart)
&lt;/h2&gt;

&lt;p&gt;The first thing every scraper-aware person tries:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open Chrome DevTools, find the XHR call to &lt;code&gt;SearchCreatives&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Replay it with &lt;code&gt;requests.post()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Parse the JSON, paginate, done&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It breaks on the first request. Three reasons, and they're the reasons a hosted Actor earns its keep:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. TLS fingerprinting.&lt;/strong&gt; Google's endpoint inspects the JA3/JA4 signature of your TLS handshake. The handshake that &lt;code&gt;requests&lt;/code&gt; produces through Python's standard &lt;code&gt;ssl&lt;/code&gt; module matches no real browser, so the server returns &lt;code&gt;403&lt;/code&gt; before it even reads your payload. We get around it by impersonating a real Firefox 147 TLS + HTTP/2 fingerprint via &lt;code&gt;curl-cffi&lt;/code&gt;, so the handshake looks like the one a browser would send.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Cookie continuity across pagination.&lt;/strong&gt; The pagination cursor is bound to a session cookie. Rotate IPs naively between pages and the server invalidates your cursor mid-scrape. We thread Apify residential proxies with &lt;strong&gt;sticky sessions&lt;/strong&gt; so each advertiser's pagination keeps one stable exit IP and cookie jar, and we pace requests at ~1/sec to stay polite.&lt;/p&gt;
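&lt;p&gt;A rough sketch of the sticky-session idea, not the Actor's actual code: derive one stable session name per advertiser and reuse it for every page. The username layout (&lt;code&gt;groups-RESIDENTIAL,session-NAME&lt;/code&gt;) and the &lt;code&gt;proxy.apify.com:8000&lt;/code&gt; host are assumptions based on Apify Proxy's connection-string format, so check them against the current docs. &lt;code&gt;sticky_proxy_url&lt;/code&gt; and the password placeholder are hypothetical:&lt;/p&gt;

```python
import hashlib


def sticky_proxy_url(advertiser_key, proxy_password):
    """Build one proxy URL per advertiser so every page reuses the same session.

    Assumed Apify Proxy format: groups-RESIDENTIAL,session-NAME as the username.
    """
    # Hash the key so the session name is stable and uses only safe characters.
    session = "adv_" + hashlib.sha256(advertiser_key.encode()).hexdigest()[:16]
    username = f"groups-RESIDENTIAL,session-{session}"
    return f"http://{username}:{proxy_password}@proxy.apify.com:8000"


# Same advertiser, same session name, so the exit IP can stay pinned across pages.
page_1 = sticky_proxy_url("AR18378488041124659201", "PROXY_PASSWORD")
page_2 = sticky_proxy_url("AR18378488041124659201", "PROXY_PASSWORD")
other = sticky_proxy_url("nike.com", "PROXY_PASSWORD")
print(page_1 == page_2, page_1 == other)
```

&lt;p&gt;Pair each session name with its own cookie jar, so the cursor, the cookie, and the exit IP all travel together.&lt;/p&gt;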

&lt;p&gt;&lt;strong&gt;3. A positional, protobuf-flavored response.&lt;/strong&gt; The reply isn't keyed JSON — it's nested arrays where meaning depends on &lt;em&gt;position&lt;/em&gt;. One Google A/B rotation and a naive parser silently emits garbage. We pin the parser against four captured creative shapes (still image, rich video, minimal, malformed) and run live wire-validation to catch contract drift before it reaches your dataset. On &lt;code&gt;408/429/5xx&lt;/code&gt; we retry with exponential backoff and fail loud on partial success rather than handing you a half-empty file.&lt;/p&gt;
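&lt;p&gt;The retry policy on &lt;code&gt;408/429/5xx&lt;/code&gt; can be sketched with the standard library alone. This is an illustration under assumptions, not the Actor's implementation: &lt;code&gt;fetch&lt;/code&gt; is a hypothetical zero-argument callable returning &lt;code&gt;(status, body)&lt;/code&gt;, and the attempt count and base delay are made-up defaults:&lt;/p&gt;

```python
import time

RETRYABLE = {408, 429}


def is_retryable(status):
    # 408, 429, and any 5xx are treated as transient.
    return status in RETRYABLE or status // 100 == 5


def fetch_with_backoff(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Retry fetch() on 408/429/5xx and return the body on 200.

    Raises RuntimeError instead of returning a partial result.
    """
    for attempt in range(max_attempts):
        status, body = fetch()
        if status == 200:
            return body
        if not is_retryable(status):
            raise RuntimeError(f"non-retryable status {status}")
        # Exponential backoff: base_delay, then 2x, 4x, ...
        sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"gave up after {max_attempts} attempts")


# Simulated endpoint: two transient failures, then success.
responses = iter([(429, None), (503, None), (200, "page-1")])
delays = []
result = fetch_with_backoff(lambda: next(responses), sleep=delays.append)
print(result, delays)
```

&lt;p&gt;Injecting &lt;code&gt;sleep&lt;/code&gt; keeps the sketch testable without waiting, and raising on exhaustion is what makes the failure loud instead of silent.&lt;/p&gt;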

&lt;p&gt;None of that is glamorous. All of it is the difference between a script that worked once on your laptop and a feed that survives Google's quarterly cipher rotation.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Actor
&lt;/h2&gt;

&lt;p&gt;I packaged the result as an Apify Actor: &lt;strong&gt;&lt;a href="https://apify.com/DevilScrapes/google-ads-transparency" rel="noopener noreferrer"&gt;Google Ads Transparency Scraper&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Paste a domain in the Apify Console and click Start, or run it programmatically:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
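&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;apify_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ApifyClient&lt;/span&gt;

&lt;span class="c1"&gt;# Read the API token from the environment instead of hard-coding it&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ApifyClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;APIFY_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;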
&lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;actor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DevilScrapes/google-ads-transparency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;run_input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;searchDomains&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nike.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;adidas.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maxResults&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;defaultDatasetId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;iterate_items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You search by &lt;strong&gt;landing domain&lt;/strong&gt; (returns &lt;em&gt;every&lt;/em&gt; ad pointing at that domain — including ones bought by resellers and affiliates) or by &lt;strong&gt;advertiser ID&lt;/strong&gt; when you already know the exact advertiser. Multiple targets per run, deduplicated automatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  What you'd actually use this for
&lt;/h2&gt;

&lt;p&gt;Four concrete patterns, not generic "competitive intelligence":&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Weekly competitor sweep.&lt;/strong&gt; Schedule a run on your top 5 competitors, diff this week's creative IDs against last week's, and alert when a new product line launches. Five competitors × ~200 ads each = roughly &lt;strong&gt;$1.20/week&lt;/strong&gt; of data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trademark enforcement.&lt;/strong&gt; Sweep your &lt;em&gt;own&lt;/em&gt; domain and you'll see ads other people bought that point at your site: resellers, affiliates, competitors. Cross-reference advertiser IDs against your trademark portfolio and flag the unlicensed ones.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Affiliate-fraud detection.&lt;/strong&gt; Pull every advertiser whose &lt;code&gt;landing_domain&lt;/code&gt; doesn't match the &lt;code&gt;advertiser_name&lt;/code&gt;. Mismatches are common in crypto, nutra, and supplement verticals. A rough first-pass filter compares the domain's first label (not the full domain, since &lt;code&gt;.com&lt;/code&gt; never appears in a company name) with the name: &lt;code&gt;[c for c in creatives if c["landing_domain"].split(".")[0] not in c["advertiser_name"].lower()]&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI / RAG ingestion.&lt;/strong&gt; Feed creative metadata plus image URLs into a vector store for image-grounded competitive analysis.&lt;/li&gt;
&lt;/ul&gt;
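&lt;p&gt;The weekly sweep in the first bullet comes down to a set difference on &lt;code&gt;creative_id&lt;/code&gt;. A minimal sketch, assuming each run's dataset has been loaded as a list of dicts; &lt;code&gt;new_creatives&lt;/code&gt; is a hypothetical helper and the second ID is made up:&lt;/p&gt;

```python
def new_creatives(last_week, this_week):
    """Return this week's records whose creative_id was not seen last week."""
    seen = {row["creative_id"] for row in last_week}
    return [row for row in this_week if row["creative_id"] not in seen]


# Toy data: the first ID comes from the sample record, the second is invented.
last_week = [{"creative_id": "CR15771942603307614209", "landing_domain": "nike.com"}]
this_week = last_week + [
    {"creative_id": "CR_EXAMPLE_NEW", "landing_domain": "nike.com"},
]
fresh = new_creatives(last_week, this_week)
print([row["creative_id"] for row in fresh])
```

&lt;p&gt;Anything left in &lt;code&gt;fresh&lt;/code&gt; is a candidate for the alert; an empty list means nothing new appeared since the previous run.&lt;/p&gt;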

&lt;h2&gt;
  
  
  Pricing — exact numbers 💰
&lt;/h2&gt;

&lt;p&gt;Pay-per-event. You pay for the ads you actually receive, not for the number you request. No data, no charge.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;$0.005 per run&lt;/strong&gt; (covers warm-up + cookie handshake)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;$0.0012 per ad&lt;/strong&gt; written to the dataset&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pull&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;100 ads&lt;/td&gt;
&lt;td&gt;$0.13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1,000 ads&lt;/td&gt;
&lt;td&gt;$1.21&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10,000 ads&lt;/td&gt;
&lt;td&gt;$12.01&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100,000 ads (monthly sweep)&lt;/td&gt;
&lt;td&gt;$120.01&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
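&lt;p&gt;The arithmetic behind the table is one per-run fee plus a per-ad fee. A small sketch for estimating your own pulls, using the two prices listed above; the function names are hypothetical, and &lt;code&gt;Decimal&lt;/code&gt; is used so half-cent amounts round predictably:&lt;/p&gt;

```python
from decimal import Decimal, ROUND_HALF_UP

RUN_FEE = Decimal("0.005")      # charged once per run
PER_AD_FEE = Decimal("0.0012")  # charged per ad written to the dataset


def run_cost(ads, runs=1):
    """Exact cost in dollars for `ads` results spread over `runs` runs."""
    return RUN_FEE * runs + PER_AD_FEE * ads


def as_dollars(amount):
    # Round half up to whole cents for display.
    return "$" + str(amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))


for ads in (100, 1000, 10000):
    print(ads, as_dollars(run_cost(ads)))
```

&lt;p&gt;The &lt;code&gt;runs&lt;/code&gt; argument shows how the per-run fee scales when a large sweep is split across several scheduled runs.&lt;/p&gt;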

&lt;p&gt;Apify's $5 free trial credit covers your first ~4,000 ads with no credit card. For comparison, the nearest SaaS substitutes (Adbeat, SpyFu) start around $249/month for a slice of the same Google data.&lt;/p&gt;

&lt;h2&gt;
  
  
  The part other scrapers won't tell you
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Region filtering doesn't work — and we say so.&lt;/strong&gt; The &lt;code&gt;region&lt;/code&gt; parameter on Google's &lt;code&gt;SearchCreatives&lt;/code&gt; RPC is &lt;em&gt;server-ignored&lt;/em&gt;. We tested every plausible request-body shape and none of them returned a region-narrowed result set; the browser UI shows a region selector, but the server hands back the same creative set regardless of what you pass.&lt;/p&gt;

&lt;p&gt;So we expose &lt;code&gt;region&lt;/code&gt; only as a &lt;strong&gt;metadata tag&lt;/strong&gt; — useful for labeling exports by intended market when you run parallel campaigns, useless as a filter. No public Actor offers real region-narrowed scraping, because Google's endpoint doesn't support it. We'd rather under-promise than ship a filter that silently does nothing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limitations (the honest list)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No keyword search.&lt;/strong&gt; You search by advertiser/domain, not by ad copy — Google's RPC exposes no keyword mode.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Video creatives return a JS bundle, not an MP4.&lt;/strong&gt; You get a &lt;code&gt;preview_content_js_url&lt;/code&gt;; rendering the actual frame needs a headless browser and is out of scope for v1.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;~12 months of history.&lt;/strong&gt; Google purges older creatives, so a wider date range just clips to what they retain.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Big brands hit a cap.&lt;/strong&gt; Google stops paginating past ~1,000 ads per query, so full-history pulls on a Nike-sized advertiser need &lt;code&gt;maxPages&lt;/code&gt; raised deliberately.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Is scraping the Google Ads Transparency Center legal?&lt;/strong&gt;&lt;br&gt;
The Center is a public registry Google operates under regulatory pressure. This Actor reads only what the public UI exposes, at ~1 request/second per session, collects only advertiser-level metadata (no personal data), and bypasses no authentication. As always, check your own jurisdiction and use case.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How is this different from the Facebook Ad Library?&lt;/strong&gt;&lt;br&gt;
Different platform, different endpoint. This covers Google's network (Search, YouTube, Display, Shopping, Maps, Play). For Meta, use a dedicated Facebook Ad Library scraper.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I export to Google Sheets or a warehouse?&lt;/strong&gt;&lt;br&gt;
Yes — export CSV/Excel/JSON from the Apify Console, webhook the dataset on &lt;code&gt;ACTOR.RUN.SUCCEEDED&lt;/code&gt; into Make/Zapier/n8n, or pull it via the Apify API.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why are some &lt;code&gt;preview_image_url&lt;/code&gt; values null?&lt;/strong&gt;&lt;br&gt;
Those are rich/video/animated creatives. Google renders them with JavaScript, so you get a &lt;code&gt;content.js&lt;/code&gt; URL in &lt;code&gt;preview_content_js_url&lt;/code&gt; instead of a static image.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;The Actor is on the Apify Store: &lt;strong&gt;&lt;a href="https://apify.com/DevilScrapes/google-ads-transparency" rel="noopener noreferrer"&gt;apify.com/DevilScrapes/google-ads-transparency&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Free $5 trial credit, no credit card. Run it on &lt;code&gt;nike.com&lt;/code&gt; and you'll have ~1,000 creatives in your dataset in under a minute. Find a use case I missed, or a field you wish it returned? Drop it in the comments — I ship based on what people actually need.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built by &lt;a href="https://apify.com/DevilScrapes" rel="noopener noreferrer"&gt;Devil Scrapes&lt;/a&gt; — Apify Actors with attitude. Pay-per-event, transparent pricing, no junk fields.&lt;/em&gt; 😈&lt;/p&gt;

</description>
      <category>apify</category>
      <category>webscraping</category>
      <category>python</category>
      <category>marketing</category>
    </item>
  </channel>
</rss>
