<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: agenthustler</title>
    <description>The latest articles on DEV Community by agenthustler (@agenthustler).</description>
    <link>https://dev.to/agenthustler</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3810515%2F33856722-1a98-4563-ba8b-622b5fddcf7e.png</url>
      <title>DEV Community: agenthustler</title>
      <link>https://dev.to/agenthustler</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/agenthustler"/>
    <language>en</language>
    <item>
      <title>SoundCloud Data in 2026: Why It's Hard to Get and How to Extract It</title>
      <dc:creator>agenthustler</dc:creator>
      <pubDate>Mon, 04 May 2026 11:25:26 +0000</pubDate>
      <link>https://dev.to/agenthustler/soundcloud-data-in-2026-why-its-hard-to-get-and-how-to-extract-it-4lo</link>
      <guid>https://dev.to/agenthustler/soundcloud-data-in-2026-why-its-hard-to-get-and-how-to-extract-it-4lo</guid>
      <description>&lt;p&gt;SoundCloud hosts over 175 million tracks from more than 30 million artists. For A&amp;amp;R teams, music data analysts, playlist curators, and indie label scouts, it's the single most important platform for spotting emerging talent before it shows up on Spotify or TikTok. And yet getting that data programmatically is genuinely difficult — not because the data is hidden, but because SoundCloud's official API has been closed to new applicants for years, and their anti-scraping infrastructure has tightened significantly since 2024.&lt;/p&gt;

&lt;p&gt;This post covers what data is actually available on SoundCloud, why it's hard to get at scale, who needs it and why, and how to run our actor to extract it without building or maintaining any scraping infrastructure.&lt;/p&gt;

&lt;h2&gt;Why SoundCloud data is hard to get&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The official API is effectively closed.&lt;/strong&gt; SoundCloud's public API registration form has been closed to new developers since 2021. Existing API keys still work, but new applicants see a "registrations are temporarily disabled" message, and "temporarily" is now into its fifth year. For any team that didn't get access before 2021, the API is not an option.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The anti-scraping stack has gotten serious.&lt;/strong&gt; SoundCloud's web app is a JavaScript-heavy single-page application that loads track data through internal endpoints with rotating client IDs. A naive &lt;code&gt;requests.get()&lt;/code&gt; returns an empty shell. Headless browsers work, but SoundCloud added behavioral detection in 2024 that flags non-human navigation patterns within a few hundred requests. High-volume extraction now requires residential proxies and careful pacing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rate limiting is aggressive on the unauthenticated paths.&lt;/strong&gt; Even legitimate browser sessions hit "too many requests" responses after a few hundred page loads from the same IP. The internal API endpoints that power the web app rotate their client IDs every few hours, which breaks any scraper that hardcodes one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data is fragmented across surfaces.&lt;/strong&gt; A track has metadata on its own page, but play counts and engagement numbers update through a separate stats endpoint. Artist profiles list tracks but not full play counts. Playlists embed tracks but truncate descriptions. Pulling a complete picture means hitting multiple URLs per artist or track and stitching the responses together.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The result&lt;/strong&gt;: most teams either pay for music industry data vendors (Chartmetric, Soundcharts) at $500-2,000/month, build fragile internal scrapers that need constant maintenance as SoundCloud updates its frontend, or just skip SoundCloud entirely and rely on Spotify data — which misses the entire underground/emerging-artist layer.&lt;/p&gt;

&lt;h2&gt;Who actually needs this data&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;A&amp;amp;R and label scouting.&lt;/strong&gt; Independent labels and major-label A&amp;amp;R teams use SoundCloud as their primary scouting surface for hip-hop, electronic, and alternative music. Tracking which unsigned artists are gaining play counts week-over-week is how the next generation of signings gets identified — usually 6-12 months before the artist breaks on streaming platforms.&lt;/p&gt;
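
&lt;p&gt;As a rough sketch of the week-over-week comparison described above, assuming you already hold two snapshots of the actor's output taken a week apart. The helper name &lt;code&gt;weekly_growth&lt;/code&gt; and the sample numbers are illustrative and not part of the actor:&lt;/p&gt;

```python
def weekly_growth(last_week, this_week):
    """Return {track_id: fractional play-count growth} for tracks in both snapshots."""
    previous = {rec["track_id"]: rec["play_count"] for rec in last_week}
    growth = {}
    for rec in this_week:
        before = previous.get(rec["track_id"])
        # Skip tracks that are new this week or had zero plays (growth is undefined).
        if not before:
            continue
        growth[rec["track_id"]] = (rec["play_count"] - before) / before
    return growth


# Hypothetical sample data using the actor's field names.
last_week = [
    {"track_id": "a", "play_count": 1000},
    {"track_id": "b", "play_count": 400},
]
this_week = [
    {"track_id": "a", "play_count": 1500},
    {"track_id": "b", "play_count": 410},
    {"track_id": "c", "play_count": 50},
]

ranked = sorted(weekly_growth(last_week, this_week).items(),
                key=lambda item: item[1], reverse=True)
for track_id, pct in ranked:
    print(track_id, f"{pct:.1%}")
```

&lt;p&gt;Tracks that appear in only one snapshot are left out here; a real pipeline would probably want to flag brand-new uploads separately.&lt;/p&gt;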

&lt;p&gt;&lt;strong&gt;Music data analytics products.&lt;/strong&gt; Companies building dashboards for managers, agents, and labels need SoundCloud track and artist data as a raw input alongside Spotify, Apple Music, YouTube, and TikTok numbers. SoundCloud is often the leading indicator: movement tends to show up there before it reaches the other platforms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Playlist curation and discovery tools.&lt;/strong&gt; Curators building niche playlists across genres need to scan thousands of new uploads weekly, filter by play count and like-to-play ratio, and surface candidates worth a human listen. Manual discovery doesn't scale past a few dozen tracks per week.&lt;/p&gt;
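
&lt;p&gt;The filtering step could look something like the sketch below. It uses the &lt;code&gt;play_count&lt;/code&gt; and &lt;code&gt;like_count&lt;/code&gt; fields from the actor's output; the function name &lt;code&gt;shortlist&lt;/code&gt; and both thresholds are assumptions chosen for illustration, so tune them to your genre:&lt;/p&gt;

```python
def shortlist(records, min_plays=1000, min_like_ratio=0.03):
    """Keep tracks with enough plays and a high like-to-play ratio, best first."""
    picked = []
    for rec in records:
        plays = rec.get("play_count") or 0
        likes = rec.get("like_count") or 0
        # plays > 0 guards the division when min_plays is set to 0.
        if plays >= min_plays and plays > 0 and likes / plays >= min_like_ratio:
            picked.append({**rec, "like_ratio": likes / plays})
    return sorted(picked, key=lambda r: r["like_ratio"], reverse=True)


# Hypothetical records; only the fields the filter needs are shown.
records = [
    {"track_id": "a", "play_count": 20000, "like_count": 1200},
    {"track_id": "b", "play_count": 500, "like_count": 100},
    {"track_id": "c", "play_count": 8000, "like_count": 80},
    {"track_id": "d", "play_count": 5000, "like_count": 500},
]

for rec in shortlist(records):
    print(rec["track_id"], round(rec["like_ratio"], 3))
```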

&lt;p&gt;&lt;strong&gt;Sync licensing and music supervision.&lt;/strong&gt; Music supervisors searching for tracks that fit a specific mood, BPM, or genre for ads, films, and games use SoundCloud as a pool of licensable indie music. Bulk track metadata extraction lets supervisors filter at scale rather than browsing manually.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Competitor and trend analysis.&lt;/strong&gt; Labels and managers tracking what's working for competing artists need historical play count, repost, and follower trajectory data. SoundCloud shows the current numbers on artist and track pages but offers no export or trend view, so a trajectory has to be assembled from repeated snapshots.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Academic and journalistic research.&lt;/strong&gt; Music researchers studying genre evolution, regional scene dynamics, or the structure of artist-to-artist influence need bulk data. SoundCloud is one of the few platforms where genre tags and reposts make those network structures visible.&lt;/p&gt;

&lt;h2&gt;What data you actually get&lt;/h2&gt;

&lt;p&gt;Our actor extracts the following fields from public SoundCloud pages — no authenticated session required:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;track_id&lt;/strong&gt; — SoundCloud track ID&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;title&lt;/strong&gt; — track title&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;artist_name&lt;/strong&gt; — track uploader display name&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;artist_url&lt;/strong&gt; — canonical URL of the uploading artist's profile&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;genre&lt;/strong&gt; — primary genre tag&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;tags&lt;/strong&gt; — list of additional tags set by the uploader&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;duration_ms&lt;/strong&gt; — track duration in milliseconds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;play_count&lt;/strong&gt; — total play count&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;like_count&lt;/strong&gt; — total likes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;repost_count&lt;/strong&gt; — total reposts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;comment_count&lt;/strong&gt; — total comments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;release_date&lt;/strong&gt; — date the track was uploaded&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;description&lt;/strong&gt; — full track description&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;artwork_url&lt;/strong&gt; — URL to the track artwork image&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;track_url&lt;/strong&gt; — canonical SoundCloud track URL&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;playlist_id&lt;/strong&gt; — playlist ID (when scraping playlists)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;playlist_title&lt;/strong&gt; — playlist title&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;playlist_track_count&lt;/strong&gt; — number of tracks in the playlist&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;artist_follower_count&lt;/strong&gt; — uploader follower count&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;artist_track_count&lt;/strong&gt; — total tracks uploaded by the artist&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;scraped_at&lt;/strong&gt; — timestamp of extraction&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;How to run the actor&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Via Apify Console&lt;/strong&gt; (no code needed):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to &lt;a href="https://apify.com/cryptosignals/soundcloud-scraper" rel="noopener noreferrer"&gt;apify.com/cryptosignals/soundcloud-scraper&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Try for free&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Paste your target URLs into the &lt;code&gt;urls&lt;/code&gt; field — accepts track URLs, artist URLs, or playlist URLs&lt;/li&gt;
&lt;li&gt;Set &lt;code&gt;max_results&lt;/code&gt; to cap the run if you're working through a long list&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Start&lt;/strong&gt; and download results as JSON or CSV&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Input JSON:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"urls"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"https://soundcloud.com/skrillex"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"https://soundcloud.com/discover/sets/charts-top:all-music"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"https://soundcloud.com/fred-again/turn-on-the-lights-again-feat-future"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"max_results"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Via Apify API:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="s2"&gt;"https://api.apify.com/v2/acts/cryptosignals~soundcloud-scraper/runs"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer YOUR_APIFY_TOKEN"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "urls": ["https://soundcloud.com/skrillex"],
    "max_results": 50
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
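
&lt;p&gt;If you would rather start the run from Python, the same request can be assembled with the standard library. This is a minimal sketch that mirrors the curl command above; the helper name &lt;code&gt;build_run_request&lt;/code&gt; is our own, and the network call is left commented out because it needs a real token:&lt;/p&gt;

```python
import json
import urllib.request

API_URL = "https://api.apify.com/v2/acts/cryptosignals~soundcloud-scraper/runs"


def build_run_request(token, urls, max_results=50):
    """Build the same POST request as the curl command, without sending it."""
    body = json.dumps({"urls": urls, "max_results": max_results}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )


request = build_run_request("YOUR_APIFY_TOKEN", ["https://soundcloud.com/skrillex"])
print(request.get_method(), request.full_url)

# To actually start a run (needs a real token and network access):
# with urllib.request.urlopen(request, timeout=30) as response:
#     print(json.load(response))
```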



&lt;p&gt;&lt;strong&gt;Sample output record:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"track_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1234567890"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Turn On The Lights again.."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"artist_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Fred again.."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"artist_url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://soundcloud.com/fred-again"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"genre"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Electronic"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tags"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"house"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"uk"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fred again"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"future"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"duration_ms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;218000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"play_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;14820000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"like_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;412000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"repost_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;38500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"comment_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;9200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"release_date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2022-10-21"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Turn On The Lights again.. (feat. Future)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"artwork_url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://i1.sndcdn.com/artworks-..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"track_url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://soundcloud.com/fred-again/turn-on-the-lights-again-feat-future"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"artist_follower_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1240000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"artist_track_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;87&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"scraped_at"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-05-04T09:00:00+00:00"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Pricing&lt;/h2&gt;

&lt;p&gt;The actor uses pay-per-result pricing: &lt;strong&gt;$0.005 per track or profile record&lt;/strong&gt;. The first 5 results are free so you can verify output quality before committing. A list of 1,000 tracks costs about $5.&lt;/p&gt;
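
&lt;p&gt;For budgeting, the arithmetic is simple enough to script. The sketch below assumes the 5 free results are deducted once per estimate; how the free allowance is actually applied (per run or per account) is not specified here, so treat that part as an assumption:&lt;/p&gt;

```python
PRICE_PER_RESULT = 0.005  # USD per record, from the pricing above
FREE_RESULTS = 5          # assumption: deducted once per estimate


def estimate_cost(results):
    """Estimated charge in USD for a given number of result records."""
    billable = max(results - FREE_RESULTS, 0)
    return round(billable * PRICE_PER_RESULT, 3)


for n in (5, 1_000, 10_000):
    print(n, estimate_cost(n))
```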

&lt;p&gt;For high-volume A&amp;amp;R workflows (10,000+ tracks per run), residential proxy coverage becomes important for reliability. &lt;a href="https://oxylabs.go2cloud.org/aff_c?offer_id=7&amp;amp;aff_id=2066&amp;amp;url_id=174" rel="noopener noreferrer"&gt;Oxylabs&lt;/a&gt; is the proxy infrastructure we've tested for this kind of music platform workload — their residential network handles SoundCloud's IP reputation checks without the rotation failures that plague datacenter proxies.&lt;/p&gt;

&lt;h2&gt;What you don't get&lt;/h2&gt;

&lt;p&gt;The actor extracts public metadata visible to any logged-out visitor. It does &lt;strong&gt;not&lt;/strong&gt; download audio files — that would violate SoundCloud's terms and copyright law on every track that isn't explicitly Creative Commons. If you need audio fingerprinting or feature extraction, that's a separate workflow that operates only on tracks the artist has explicitly licensed for that use.&lt;/p&gt;

&lt;p&gt;Private tracks, private playlists, and any data behind a SoundCloud Pro paywall are not accessible. Comment text is summarized as a count rather than a full thread dump — full comment scraping at scale runs into rate limits that aren't worth fighting. Direct messages and private interactions are obviously off-limits.&lt;/p&gt;

&lt;p&gt;For very large historical backfills (tracking play count over time for 100,000+ tracks), the play count snapshots from a single run aren't enough — you need scheduled runs and your own time-series storage. The actor produces the snapshot; you're responsible for the longitudinal database.&lt;/p&gt;
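
&lt;p&gt;One possible shape for that longitudinal store is a single SQLite table keyed on track and timestamp. The sketch below is only an illustration: the table layout and the function names are our own choices, and the second sample row is made up to show a later snapshot:&lt;/p&gt;

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a small SQLite store for play-count snapshots."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS snapshots (
               track_id   TEXT NOT NULL,
               scraped_at TEXT NOT NULL,
               play_count INTEGER,
               PRIMARY KEY (track_id, scraped_at)
           )"""
    )
    return conn


def save_snapshot(conn, records):
    """Insert one row per record; saving the same snapshot twice is a no-op."""
    conn.executemany(
        "INSERT OR IGNORE INTO snapshots VALUES (:track_id, :scraped_at, :play_count)",
        records,
    )
    conn.commit()


def history(conn, track_id):
    """Return [(scraped_at, play_count), ...] for one track, oldest first."""
    rows = conn.execute(
        "SELECT scraped_at, play_count FROM snapshots "
        "WHERE track_id = ? ORDER BY scraped_at",
        (track_id,),
    )
    return rows.fetchall()


conn = open_store()
save_snapshot(conn, [{"track_id": "1234567890",
                      "scraped_at": "2026-05-04T09:00:00+00:00",
                      "play_count": 14820000}])
# Hypothetical follow-up run one week later.
save_snapshot(conn, [{"track_id": "1234567890",
                      "scraped_at": "2026-05-11T09:00:00+00:00",
                      "play_count": 14950000}])
print(history(conn, "1234567890"))
```

&lt;p&gt;Sorting on the text timestamp works here because every &lt;code&gt;scraped_at&lt;/code&gt; value uses the same ISO 8601 format and offset.&lt;/p&gt;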

&lt;h2&gt;The alternative&lt;/h2&gt;

&lt;p&gt;You can build this yourself. The engineering work involves: handling SoundCloud's rotating client IDs, parsing the embedded JSON state from the HTML shell, dealing with the separate stats endpoints, managing proxy rotation against the rate-limit detection, retrying partial responses, and maintaining the scraper when SoundCloud updates its frontend — which has happened three times since 2024.&lt;/p&gt;

&lt;p&gt;That's 2-3 weeks of engineering time to build, plus ongoing maintenance after that. At $0.005 per record, 2 million records come to $10,000 in actor fees, so you'd need to extract more than that before the build-vs-buy math favors building, assuming the build itself costs about that much.&lt;/p&gt;
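
&lt;p&gt;The break-even figure can be reproduced with a line of arithmetic. The $10,000 build cost below is an assumption inferred from the 2 million record figure, not a quoted price, and the sketch ignores maintenance and proxy costs on the build side:&lt;/p&gt;

```python
PRICE_PER_RECORD = 0.005  # USD per record, from the pricing section


def break_even_records(build_cost_usd):
    """Number of records at which actor fees equal a one-off build cost."""
    return round(build_cost_usd / PRICE_PER_RECORD)


# Assumed build cost of $10,000, inferred from the 2 million record figure.
print(break_even_records(10_000))
```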

&lt;p&gt;For most A&amp;amp;R, analytics, and curation teams, the answer is clear.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Actor:&lt;/strong&gt; &lt;a href="https://apify.com/cryptosignals/soundcloud-scraper" rel="noopener noreferrer"&gt;apify.com/cryptosignals/soundcloud-scraper&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;By:&lt;/strong&gt; Web Data Labs — data infrastructure for music industry teams.&lt;/p&gt;

</description>
      <category>music</category>
      <category>data</category>
      <category>webscraping</category>
      <category>automation</category>
    </item>
    <item>
      <title>Crunchbase Data in 2026: Why It's Hard to Get and How to Extract It</title>
      <dc:creator>agenthustler</dc:creator>
      <pubDate>Mon, 04 May 2026 08:27:52 +0000</pubDate>
      <link>https://dev.to/agenthustler/crunchbase-data-in-2026-why-its-hard-to-get-and-how-to-extract-it-32o1</link>
      <guid>https://dev.to/agenthustler/crunchbase-data-in-2026-why-its-hard-to-get-and-how-to-extract-it-32o1</guid>
      <description>&lt;p&gt;Crunchbase is the de facto database of the startup world. Over 3 million company profiles, more than 600,000 funding rounds, and an investor graph that VCs, founders, and analysts treat as ground truth. The data sits behind a login wall, a heavy JavaScript frontend, and an enterprise API that prices most teams out of the market.&lt;/p&gt;

&lt;p&gt;This post covers what data lives on Crunchbase, why it's hard to get programmatically, who needs it, and how to run our actor to extract it without building or maintaining any scraping infrastructure.&lt;/p&gt;

&lt;h2&gt;Why Crunchbase data is hard to get&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The official API is enterprise-only.&lt;/strong&gt; Crunchbase's Enterprise API starts around $49,000/year and requires an annual contract. The cheaper Pro subscription gives you a web UI and CSV exports, but no programmatic access for automation. There is no developer tier: the company has chosen to monetize data access at the enterprise end of the market. For a solo founder, an analyst at a small fund, or a data team building internal CRM enrichment, the API simply isn't an option.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Most of the interesting data is behind a login.&lt;/strong&gt; Public visitors see a stripped-down version of company pages. Funding round details, investor lists, key employees, acquisition history, and competitor data require an authenticated session. That single architectural choice rules out most naive scraping approaches: you can't just &lt;code&gt;curl&lt;/code&gt; a profile URL and parse the HTML.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The frontend is heavy and dynamic.&lt;/strong&gt; Crunchbase is a single-page React-style application. Profile data loads asynchronously through internal GraphQL-style endpoints, with request signing and session-bound tokens. The HTML you get on first paint contains almost no real data — it's a shell that hydrates client-side. Headless browsers can render it, but each profile takes seconds and significant compute, and the anti-bot stack flags automated browsers quickly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The anti-bot stack is real.&lt;/strong&gt; Cloudflare bot management, behavioral fingerprinting, request rate analysis, and aggressive IP reputation scoring all run on Crunchbase. A datacenter IP gets challenged within a few requests. Even residential proxies need careful rotation to avoid the heuristics that look for unnatural session patterns. This is a moving target — what worked last month often breaks this month.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The result&lt;/strong&gt;: most teams either pay for Crunchbase Pro and copy-paste manually, license bulk data from resellers, or quietly maintain a fragile in-house scraper. None of these are ideal.&lt;/p&gt;

&lt;h2&gt;Who actually needs this data&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;VC and angel investor research.&lt;/strong&gt; Before a partner meeting, an associate needs to pull funding history, current valuation signals, investor syndicate, key team members, and competitor landscape for a target company. Doing this manually across a pipeline of 50 deals per week is hours of clicking. Automated extraction turns it into a single overnight run.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Competitive intelligence.&lt;/strong&gt; Mapping a competitive landscape means pulling funding rounds, headcount trajectories, and acquisition history for 20-100 companies in a sector. The funding round data alone — who invested, at what stage, when — is the spine of any defensible competitive analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sales enrichment for B2B teams targeting AI/SaaS startups.&lt;/strong&gt; A sales team selling tools to Series A-C startups needs filtered lists by funding stage, last round date, total raised, and investor list. Crunchbase is where this data is most current and most accurate. Enriching a prospect list with last-funding signals lets sales prioritize companies likely to have budget right now.&lt;/p&gt;
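
&lt;p&gt;A prioritization pass over extracted records might look like the sketch below. It assumes &lt;code&gt;last_funding_round&lt;/code&gt; is an object with &lt;code&gt;type&lt;/code&gt; and &lt;code&gt;date&lt;/code&gt; keys (ISO dates); those key names, the company names, and the 180-day window are all hypothetical, so check them against real output before relying on this:&lt;/p&gt;

```python
from datetime import date, timedelta


def recently_funded(companies, stages, within_days=180, today=None):
    """Names of companies whose last round is in `stages` and recent enough."""
    today = today or date.today()
    cutoff = today - timedelta(days=within_days)
    hits = []
    for company in companies:
        last_round = company.get("last_funding_round") or {}
        # "type" and "date" are assumed key names; adjust to the real output.
        round_date = last_round.get("date")
        if last_round.get("type") in stages and round_date:
            if date.fromisoformat(round_date) >= cutoff:
                hits.append(company["name"])
    return hits


# Hypothetical records for illustration only.
companies = [
    {"name": "Alpha", "last_funding_round": {"type": "Series A", "date": "2026-03-01"}},
    {"name": "Beta", "last_funding_round": {"type": "Series B", "date": "2024-01-15"}},
    {"name": "Gamma", "last_funding_round": {"type": "Seed", "date": "2026-04-10"}},
]

targets = recently_funded(
    companies,
    stages={"Series A", "Series B", "Series C"},
    today=date(2026, 5, 4),
)
print(targets)
```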

&lt;p&gt;&lt;strong&gt;M&amp;amp;A and corporate development.&lt;/strong&gt; Corp dev teams scanning the market for acquisition targets pull acquisition history, funding totals, and founder backgrounds at scale. The investor list on a target also signals which firms might block or push a deal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Market research and analyst reports.&lt;/strong&gt; Researchers writing sector reports — fintech, climate tech, AI infrastructure — need bulk data on hundreds of companies in a category. Funding round data over time is the raw material for "where is the money flowing" charts that anchor most industry reports.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Founder competitive due diligence.&lt;/strong&gt; Before raising, founders benchmark themselves against competitors: how much each raised, from whom, on what trajectory. Walking into a partner meeting with this data prepared is table stakes now.&lt;/p&gt;

&lt;h2&gt;What data you actually get&lt;/h2&gt;

&lt;p&gt;Our actor extracts the following fields from public and authenticated Crunchbase company profiles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;name&lt;/strong&gt; — official company name&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;crunchbase_url&lt;/strong&gt; — canonical Crunchbase profile URL&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;description&lt;/strong&gt; — short and long company descriptions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;website&lt;/strong&gt; — official company website&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;founded_date&lt;/strong&gt; — founding date as listed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;headquarters&lt;/strong&gt; — city, region, country&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;industries&lt;/strong&gt; — list of industry categories&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;operating_status&lt;/strong&gt; — active, closed, acquired&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;company_type&lt;/strong&gt; — for profit, non-profit, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;employee_count&lt;/strong&gt; — headcount range&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;total_funding&lt;/strong&gt; — total funding raised in USD&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;last_funding_round&lt;/strong&gt; — type, date, and amount of most recent round&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;funding_rounds&lt;/strong&gt; — full list of funding rounds with stage, date, amount, and lead investors&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;investors&lt;/strong&gt; — list of investor entities with name, type, and lead/follow flag&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;founders&lt;/strong&gt; — founder names and roles&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;key_people&lt;/strong&gt; — current executives and key employees&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;acquisitions&lt;/strong&gt; — companies acquired, with dates and disclosed amounts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;acquired_by&lt;/strong&gt; — acquirer details if the company was acquired&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;competitors&lt;/strong&gt; — competitor list as listed on the profile&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;scraped_at&lt;/strong&gt; — extraction timestamp&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;How to run the actor&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Via Apify Console&lt;/strong&gt; (no code needed):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to &lt;a href="https://apify.com/cryptosignals/crunchbase-scraper" rel="noopener noreferrer"&gt;apify.com/cryptosignals/crunchbase-scraper&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Try for free&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Paste your company list into the &lt;code&gt;companies&lt;/code&gt; field — accepts Crunchbase slugs (e.g., &lt;code&gt;stripe&lt;/code&gt;) or full profile URLs&lt;/li&gt;
&lt;li&gt;Set &lt;code&gt;max_results&lt;/code&gt; to cap the run if you're testing&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Start&lt;/strong&gt; and download results as JSON or CSV&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Input JSON:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"companies"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"stripe"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"https://www.crunchbase.com/organization/anthropic"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"openai"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"include_funding_rounds"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"include_investors"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"max_results"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Via Apify API:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="s2"&gt;"https://api.apify.com/v2/acts/cryptosignals~crunchbase-scraper/runs"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer YOUR_APIFY_TOKEN"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "companies": ["stripe", "anthropic"],
    "include_funding_rounds": true,
    "max_results": 10
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Sample output record:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Anthropic"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"crunchbase_url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://www.crunchbase.com/organization/anthropic"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AI safety company building reliable, interpretable AI systems."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"website"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://anthropic.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"founded_date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2021-01-01"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"headquarters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"San Francisco, California, US"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"industries"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Artificial Intelligence"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Machine Learning"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Software"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"operating_status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Active"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"company_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"For Profit"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"employee_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"501-1000"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"total_funding"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;7600000000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"last_funding_round"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Series E"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2025-03-01"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3500000000&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"funding_rounds"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Series A"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2021-05-01"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;124000000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"lead_investors"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Jaan Tallinn"&lt;/span&gt;&lt;span class="p"&gt;]},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Series B"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2022-04-01"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;580000000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"lead_investors"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Sam Bankman-Fried"&lt;/span&gt;&lt;span class="p"&gt;]},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Series C"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2023-05-01"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;450000000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"lead_investors"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Spark Capital"&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"investors"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Google"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"corporate"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"lead"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Spark Capital"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"vc"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"lead"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Salesforce Ventures"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"corporate"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"lead"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"founders"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Dario Amodei"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CEO"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Daniela Amodei"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"President"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"scraped_at"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-05-04T09:00:00+00:00"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Pricing
&lt;/h2&gt;

&lt;p&gt;The actor uses pay-per-event pricing: &lt;strong&gt;$0.012 per company profile&lt;/strong&gt; with funding rounds and investors included. The first 3 results are free so you can verify output quality before committing. For a list of 1,000 companies, that's $12.&lt;/p&gt;

&lt;p&gt;For high-volume runs, residential proxy coverage matters for reliability. &lt;a href="https://oxylabs.go2cloud.org/aff_c?offer_id=7&amp;amp;aff_id=2066&amp;amp;url_id=174" rel="noopener noreferrer"&gt;Oxylabs&lt;/a&gt; is the proxy infrastructure we've tested for this kind of workload — their residential network handles Crunchbase's reputation scoring without the constant rotation failures that plague datacenter proxies.&lt;/p&gt;

&lt;h2&gt;
  
  
  What you don't get
&lt;/h2&gt;

&lt;p&gt;Crunchbase profiles don't include private company financials beyond disclosed funding rounds, employee email addresses, or anything behind Crunchbase Pro's premium signals (e.g., predictive scores). For contact-level data on individuals at these companies, you need a separate enrichment step.&lt;/p&gt;

&lt;p&gt;The actor extracts what's available on the standard profile page. Some very large or very recently created companies have partial data on Crunchbase itself — that's a Crunchbase coverage limit, not an extraction limit.&lt;/p&gt;

&lt;h2&gt;
  
  
  The alternative
&lt;/h2&gt;

&lt;p&gt;You can build this yourself. The engineering work involves: managing an authenticated session that doesn't get flagged, handling Crunchbase's anti-bot stack, parsing the React-hydrated data without depending on internal endpoints that change, building proxy rotation and retry logic, and maintaining the whole thing as Crunchbase pushes frontend updates — which happens often.&lt;/p&gt;

&lt;p&gt;That's 3-6 weeks of engineering time to build something reliable, plus ongoing maintenance. At $0.012 per company, you'd need to scrape over 2.5 million profiles before the build-vs-buy math favors building. Crunchbase has 3 million profiles total, so you would have to pull most of the database to cross that line.&lt;/p&gt;
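
&lt;p&gt;The break-even figure is easy to sanity-check. Here is a short sketch that uses only the numbers quoted in this post; the variable names are illustrative.&lt;/p&gt;

```python
from decimal import Decimal

PRICE_PER_PROFILE = Decimal("0.012")  # USD, from the pricing section
BREAK_EVEN_PROFILES = 2_500_000       # break-even threshold quoted above
TOTAL_PROFILES = 3_000_000            # Crunchbase size quoted above

# Actor spend at the break-even point, i.e. the build cost the claim implies.
implied_build_cost = PRICE_PER_PROFILE * BREAK_EVEN_PROFILES
# Share of all Crunchbase profiles you would have to scrape to reach it.
share_of_crunchbase = Decimal(BREAK_EVEN_PROFILES) / Decimal(TOTAL_PROFILES)

print(f"Implied build cost: ${implied_build_cost:,.0f}")
print(f"Share of Crunchbase needed: {share_of_crunchbase:.0%}")
```

&lt;p&gt;That works out to $30,000 of actor spend at the break-even point, reached only after scraping roughly 83% of all profiles.&lt;/p&gt;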

&lt;p&gt;For most teams the answer is clear: don't build it.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Actor:&lt;/strong&gt; &lt;a href="https://apify.com/cryptosignals/crunchbase-scraper" rel="noopener noreferrer"&gt;apify.com/cryptosignals/crunchbase-scraper&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;By:&lt;/strong&gt; Web Data Labs — data infrastructure for B2B and investor teams.&lt;/p&gt;

</description>
      <category>crunchbase</category>
      <category>startup</category>
      <category>data</category>
      <category>webautomation</category>
    </item>
    <item>
      <title>LinkedIn Profile Data in 2026: Why It's Hard to Get and How to Extract It</title>
      <dc:creator>agenthustler</dc:creator>
      <pubDate>Mon, 04 May 2026 08:27:37 +0000</pubDate>
      <link>https://dev.to/agenthustler/linkedin-profile-data-in-2026-why-its-hard-to-get-and-how-to-extract-it-5c7a</link>
      <guid>https://dev.to/agenthustler/linkedin-profile-data-in-2026-why-its-hard-to-get-and-how-to-extract-it-5c7a</guid>
      <description>&lt;p&gt;LinkedIn has more than 1 billion member profiles. Every recruiter, sales team, investor, and researcher needs profile data from LinkedIn. And yet getting that data programmatically is genuinely difficult — not because the data is hidden, but because LinkedIn has built one of the most aggressive anti-scraping systems on the web, and the alternatives are either expensive seat-based subscriptions or fragile in-house scrapers.&lt;/p&gt;

&lt;p&gt;This post covers what data is actually available on LinkedIn profiles, why it's hard to get at scale, who needs it and why, and how to run our actor to extract it without building or maintaining any scraping infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why LinkedIn profile data is hard to get
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;There is no real official API.&lt;/strong&gt; LinkedIn's public API does not expose member profile data to general developers. The endpoints that do exist are gated behind partnership programs (Talent Solutions, Sales Navigator API) that require a partner application, an enterprise contract, and a specific approved use case. For a solo recruiter or a small sales team doing list enrichment, the official API is effectively closed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sales Navigator is per-seat, not per-record.&lt;/strong&gt; The most common workaround — Sales Navigator at $99/seat/month — gives a human the ability to browse profiles, but it doesn't give you a programmatic export. Bulk extraction violates LinkedIn's ToS for that product, and accounts get flagged quickly when used with browser automation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The anti-scraping stack is serious.&lt;/strong&gt; LinkedIn runs browser fingerprinting, behavioral analysis, IP reputation scoring, and bot challenge pages. A naive Python script gets blocked within minutes. Even headless browsers get flagged quickly without significant evasion infrastructure. High-volume profile extraction requires residential proxies and ongoing maintenance as LinkedIn updates its detection methods — sometimes weekly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Profile pages are auth-walled in different ways depending on viewer state.&lt;/strong&gt; A logged-out visitor sees a public preview. A logged-in visitor sees the full profile. Different proxy strategies, session strategies, and parsing logic apply to each — and getting reliable, consistent extraction across millions of profiles is a real engineering problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Terms of service add legal ambiguity.&lt;/strong&gt; In &lt;em&gt;hiQ Labs v. LinkedIn&lt;/em&gt;, the Ninth Circuit held that scraping publicly available data likely does not violate the Computer Fraud and Abuse Act. That ruling did not settle breach-of-contract claims under a site's terms of service, which is where the ambiguity remains. The data on public profile pages (the kind visible to any logged-out visitor) sits in the clearest legal territory. Anything behind login is a different conversation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The result&lt;/strong&gt;: most teams either pay for expensive enrichment vendors (ZoomInfo, Apollo, Clay), build fragile in-house scrapers that need constant maintenance, or do it manually. None of these scale to the volumes most use cases actually need.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who actually needs this data
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Recruiting and talent sourcing.&lt;/strong&gt; Identifying candidates by current role, experience, skills, and location is the standard recruiting workflow. Sourcers spend hours per week building shortlists. Automated profile extraction across a Boolean search result turns 4 hours of copy-paste into a 5-minute job.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;B2B sales prospecting.&lt;/strong&gt; Outbound teams enrich lead lists with current title, current employer, and seniority before scoring them against the ICP. The difference between a generic blast and a genuinely personalized opener is whether you know what someone actually does today, not what their profile said three years ago.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Investor due diligence.&lt;/strong&gt; Before a pre-seed call, investors check the founders' previous companies, education, and length of relevant experience. This is profile-level data, and right now it's mostly done manually by associates flipping between tabs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Market and labor research.&lt;/strong&gt; Researchers studying career trajectories, skills demand, or labor market shifts need bulk profile data as raw material. The same fields that power individual sales workflows also power dashboards showing which skills are growing in a sector and where talent is concentrated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sales intelligence products.&lt;/strong&gt; Anyone building a tool on top of LinkedIn signals — change-of-job alerts, hiring trend tools, revenue-per-employee benchmarks — needs profile data as input. The tools that look like magic from the outside are mostly clean profile extraction on the inside.&lt;/p&gt;

&lt;h2&gt;
  
  
  What data you actually get
&lt;/h2&gt;

&lt;p&gt;Our actor extracts the following fields from public LinkedIn profile pages — no login required:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;full_name&lt;/strong&gt; — first and last name as listed on LinkedIn&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;headline&lt;/strong&gt; — the tagline under the name (current role / personal pitch)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;location&lt;/strong&gt; — city, region, country&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;about&lt;/strong&gt; — full "About" section text&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;current_position&lt;/strong&gt; — most recent role title and company&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;experience&lt;/strong&gt; — list of past roles with title, company, dates, and description&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;education&lt;/strong&gt; — schools, degrees, fields of study, dates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;skills&lt;/strong&gt; — self-reported skills list&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;certifications&lt;/strong&gt; — listed certifications with issuer and date&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;languages&lt;/strong&gt; — listed languages with proficiency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;connection_count&lt;/strong&gt; — connection range (e.g., "500+")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;follower_count&lt;/strong&gt; — LinkedIn follower count&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;profile_url&lt;/strong&gt; — canonical LinkedIn profile URL&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;avatar_url&lt;/strong&gt; — URL to the profile photo&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;scraped_at&lt;/strong&gt; — timestamp of extraction&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How to run the actor
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Via Apify Console&lt;/strong&gt; (no code needed):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to &lt;a href="https://apify.com/cryptosignals/linkedin-profile-scraper" rel="noopener noreferrer"&gt;apify.com/cryptosignals/linkedin-profile-scraper&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Try for free&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Paste your profile list into the &lt;code&gt;profiles&lt;/code&gt; field — accepts LinkedIn slugs (e.g., &lt;code&gt;williamhgates&lt;/code&gt;) or full URLs&lt;/li&gt;
&lt;li&gt;Set &lt;code&gt;max_results&lt;/code&gt; if you want to cap the run&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Start&lt;/strong&gt; and download results as JSON or CSV&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Input JSON:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"profiles"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"williamhgates"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"https://www.linkedin.com/in/satyanadella"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"jeffweiner08"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"max_results"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
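
&lt;p&gt;Because the &lt;code&gt;profiles&lt;/code&gt; field accepts both slugs and full URLs, the same person can end up in a list twice. Whether the actor deduplicates on its own is not stated here, so a small pre-processing step may be worth it. The sketch below is illustrative: &lt;code&gt;to_slug&lt;/code&gt; and &lt;code&gt;dedupe_profiles&lt;/code&gt; are hypothetical helpers, and treating slugs as case-insensitive is an assumption.&lt;/p&gt;

```python
from urllib.parse import urlparse


def to_slug(entry):
    """Return the LinkedIn slug for a bare slug or a full /in/ profile URL."""
    entry = entry.strip()
    if "://" not in entry:
        return entry.strip("/")
    # For URLs, the slug is the path segment that follows "in".
    parts = [p for p in urlparse(entry).path.split("/") if p]
    if len(parts) != 2 or parts[0] != "in":
        raise ValueError(f"not a LinkedIn profile URL: {entry}")
    return parts[1]


def dedupe_profiles(entries):
    """Drop entries that point at the same profile, keeping first-seen order."""
    seen = {}
    for entry in entries:
        # Assumption: slugs are compared case-insensitively.
        seen.setdefault(to_slug(entry).lower(), entry)
    return list(seen.values())


profiles = dedupe_profiles([
    "williamhgates",
    "https://www.linkedin.com/in/satyanadella",
    "satyanadella",
    "jeffweiner08",
])
print(profiles)
```

&lt;p&gt;The first spelling of each profile is kept, so the list you submit still looks like the one you pasted in.&lt;/p&gt;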



&lt;p&gt;&lt;strong&gt;Via Apify API:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="s2"&gt;"https://api.apify.com/v2/acts/cryptosignals~linkedin-profile-scraper/runs"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer YOUR_APIFY_TOKEN"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "profiles": ["williamhgates", "satyanadella"],
    "max_results": 10
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Sample output record:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"profile_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"satyanadella"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"full_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Satya Nadella"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"headline"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Chairman and CEO at Microsoft"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"location"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Redmond, Washington, United States"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"about"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"As chairman and CEO of Microsoft, I define my mission..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"current_position"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Chairman and CEO"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"company"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Microsoft"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"experience"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Chairman and CEO"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"company"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Microsoft"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"start_date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2014-02"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"end_date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Leading Microsoft as chairman and CEO."&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"education"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"school"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The University of Chicago Booth School of Business"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"degree"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"MBA"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"skills"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Cloud Computing"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Strategy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Leadership"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"connection_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"500+"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"follower_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"11200000"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"profile_url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://www.linkedin.com/in/satyanadella"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"avatar_url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://media.licdn.com/dms/image/..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"scraped_at"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-05-04T09:00:00+00:00"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
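
&lt;p&gt;Nested fields such as &lt;code&gt;current_position&lt;/code&gt; and &lt;code&gt;skills&lt;/code&gt; usually need flattening before they fit a spreadsheet or CRM import. The sketch below shows one way to do that, assuming records shaped like the sample above. &lt;code&gt;flatten&lt;/code&gt; and the chosen column names are hypothetical, not part of the actor.&lt;/p&gt;

```python
import csv
import io

# A trimmed copy of the sample record above.
record = {
    "full_name": "Satya Nadella",
    "headline": "Chairman and CEO at Microsoft",
    "current_position": {"title": "Chairman and CEO", "company": "Microsoft"},
    "skills": ["Cloud Computing", "Strategy", "Leadership"],
    "follower_count": "11200000",
    "profile_url": "https://www.linkedin.com/in/satyanadella",
}


def flatten(rec):
    """Turn one nested profile record into a flat dict suitable for a CSV row."""
    position = rec.get("current_position") or {}
    return {
        "full_name": rec.get("full_name", ""),
        "title": position.get("title", ""),
        "company": position.get("company", ""),
        "skills": "; ".join(rec.get("skills") or []),
        # follower_count is a string in the sample, so convert before doing math.
        "follower_count": int(rec.get("follower_count") or 0),
        "profile_url": rec.get("profile_url", ""),
    }


row = flatten(record)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
print(buffer.getvalue())
```

&lt;p&gt;Missing fields fall back to empty values, so a profile with no listed position or skills still produces a valid row.&lt;/p&gt;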



&lt;h2&gt;
  
  
  Pricing
&lt;/h2&gt;

&lt;p&gt;The actor uses pay-per-event pricing: &lt;strong&gt;$0.012 per profile&lt;/strong&gt;. The first 5 results are free so you can verify output quality before committing. For a list of 1,000 profiles, that's $12: roughly the cost of a single coffee meeting, for a structured dataset that would take a sourcer two days to compile manually.&lt;/p&gt;
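
&lt;p&gt;To budget a run ahead of time, the pricing above reduces to a one-line formula. The sketch below is an estimate only: &lt;code&gt;estimate_cost&lt;/code&gt; is a hypothetical helper, and it assumes the free results are deducted once, which this post does not spell out.&lt;/p&gt;

```python
from decimal import Decimal

PRICE_PER_PROFILE = Decimal("0.012")  # USD, from the pricing above
FREE_RESULTS = 5                      # free results quoted above


def estimate_cost(profile_count, free_results=FREE_RESULTS):
    """Estimate the USD cost of a run, assuming free results are deducted once."""
    billable = max(profile_count - free_results, 0)
    return PRICE_PER_PROFILE * billable


for count in (5, 1_000, 10_000):
    print(count, estimate_cost(count))
```

&lt;p&gt;With the free results deducted, 1,000 profiles come to $11.94, which rounds to the $12 quoted above.&lt;/p&gt;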

&lt;p&gt;For high-volume runs (10,000+ profiles), residential proxy coverage matters for reliability. &lt;a href="https://oxylabs.go2cloud.org/aff_c?offer_id=7&amp;amp;aff_id=2066&amp;amp;url_id=174" rel="noopener noreferrer"&gt;Oxylabs&lt;/a&gt; is the proxy infrastructure we've tested for this kind of workload — their residential network handles LinkedIn's IP reputation checks without the constant rotation failures that plague datacenter proxies.&lt;/p&gt;

&lt;h2&gt;
  
  
  What you don't get
&lt;/h2&gt;

&lt;p&gt;Profile pages don't include private email addresses or phone numbers. LinkedIn doesn't expose those publicly, and neither does this actor. For contact-level data, you need a separate enrichment step using a waterfall provider. The actor extracts profile-level public metadata — the data visible to any unauthenticated visitor.&lt;/p&gt;

&lt;p&gt;The actor also does not extract private posts, private connections lists, or anything behind login. If your use case needs that, it doesn't belong in a public-data scraper — and it doesn't belong in a blog post either.&lt;/p&gt;

&lt;h2&gt;
  
  
  The alternative
&lt;/h2&gt;

&lt;p&gt;You can build this yourself. The engineering work involves: handling LinkedIn's anti-bot detection, managing residential proxy rotation, parsing the structured profile data out of pages that ship as a JavaScript app, dealing with partial responses and retry logic, normalizing the experience and education sections (which have a dozen edge cases each), and maintaining the scraper when LinkedIn changes its page structure — which happens several times per year.&lt;/p&gt;

&lt;p&gt;That's 3-6 weeks of engineering time to build a reliable version, and ongoing maintenance after that. At $0.012 per profile, you'd need to scrape over 500,000 profiles before the build-vs-buy math favors building.&lt;/p&gt;

&lt;p&gt;For most teams, the answer is clear.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Actor:&lt;/strong&gt; &lt;a href="https://apify.com/cryptosignals/linkedin-profile-scraper" rel="noopener noreferrer"&gt;apify.com/cryptosignals/linkedin-profile-scraper&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;By:&lt;/strong&gt; Web Data Labs — data infrastructure for B2B teams.&lt;/p&gt;

</description>
      <category>linkedin</category>
      <category>webscraping</category>
      <category>data</category>
      <category>automation</category>
    </item>
    <item>
      <title>How to Monitor Telegram Channels for Crypto and Brand Intelligence in 2026</title>
      <dc:creator>agenthustler</dc:creator>
      <pubDate>Mon, 04 May 2026 08:00:22 +0000</pubDate>
      <link>https://dev.to/agenthustler/how-to-monitor-telegram-channels-for-crypto-and-brand-intelligence-in-2026-2f8p</link>
      <guid>https://dev.to/agenthustler/how-to-monitor-telegram-channels-for-crypto-and-brand-intelligence-in-2026-2f8p</guid>
      <description>&lt;p&gt;Telegram has become the de facto communication platform for crypto projects, brand communities, and market intelligence teams. With over 950 million monthly active users and channels routinely exceeding 100K subscribers, Telegram is where breaking information surfaces first — often hours before it hits Twitter or Discord.&lt;/p&gt;

&lt;p&gt;If you're building any kind of monitoring pipeline in 2026, Telegram channels are a data source you can't ignore. Here's how to tap into them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Telegram Monitoring Matters
&lt;/h2&gt;

&lt;p&gt;Three major use cases have emerged:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Crypto intelligence.&lt;/strong&gt; Major projects like TON, Solana ecosystem funds, and dozens of DeFi protocols use Telegram as their &lt;em&gt;primary&lt;/em&gt; announcement channel. Token listings, governance votes, partnership announcements — they hit Telegram first. Traders who monitor these channels programmatically have an information edge measured in minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Brand monitoring.&lt;/strong&gt; Consumer brands increasingly maintain Telegram communities. Monitoring sentiment shifts, complaint spikes, or competitor channel activity gives product and marketing teams real-time signal that surveys take weeks to capture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Market research.&lt;/strong&gt; Hedge funds and research firms track channel growth rates, message frequency, and engagement patterns as alternative data signals. A channel gaining 10K subscribers in 24 hours often correlates with upcoming announcements.&lt;/p&gt;

&lt;h2&gt;
  
  
  The API Challenge
&lt;/h2&gt;

&lt;p&gt;If you've tried to access Telegram data programmatically, you've hit one of these walls:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bot API&lt;/strong&gt;: requires your bot to be added as an admin to the target channel. Great for channels you own, useless for monitoring external channels.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;User API (MTProto)&lt;/strong&gt;: full access, but requires a phone number, 2FA, and careful session management. Telegram actively rate-limits and bans accounts that behave like scrapers, so a single misstep can get your phone number blocked.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you actually need&lt;/strong&gt; is read-only access to public channel data — no authentication, no phone numbers, no risk of account bans.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Web Preview Endpoint
&lt;/h2&gt;

&lt;p&gt;Telegram exposes a web preview for every public channel at &lt;code&gt;t.me/s/{channel_name}&lt;/code&gt;. This endpoint returns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Channel metadata&lt;/strong&gt;: subscriber count, description, channel photo, verification status&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recent messages&lt;/strong&gt;: the last 20-50 posts with full text content&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Engagement data&lt;/strong&gt;: view counts per message, forwarding info&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timestamps&lt;/strong&gt;: exact post dates and times&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No API key required. No authentication. Just an HTTP GET request.&lt;/p&gt;

&lt;h2&gt;
  
  
  Python Implementation
&lt;/h2&gt;

&lt;p&gt;Here's a minimal working example that extracts key data from any public Telegram channel:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bs4&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;scrape_telegram_channel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;channel_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://t.me/s/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;channel_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User-Agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Mozilla/5.0 (Windows NT 10.0; Win64; x64) &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AppleWebKit/537.36 (KHTML, like Gecko) &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Chrome/120.0.0.0 Safari/537.36&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

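    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# fail loudly on HTTP errors&lt;/span&gt;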
    &lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BeautifulSoup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;html.parser&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Channel metadata
&lt;/span&gt;    &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select_one&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.tgme_channel_info_header_title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;desc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select_one&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.tgme_channel_info_description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;subscribers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select_one&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.tgme_channel_info_counter .counter_value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;verified&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select_one&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.verified-icon&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="c1"&gt;# Parse recent messages
&lt;/span&gt;    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.tgme_widget_message_wrap&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;text_el&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select_one&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.tgme_widget_message_text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;views_el&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select_one&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.tgme_widget_message_views&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;date_el&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select_one&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.tgme_widget_message_date time&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;text_el&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;text_el&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;views&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;views_el&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;views_el&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;date_el&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;datetime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;date_el&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;channel&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;channel_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;desc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;desc&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;subscribers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;subscribers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;subscribers&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;is_verified&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;verified&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scraped_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;utcnow&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Example: scrape Pavel Durov's channel
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;scrape_telegram_channel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;durov&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Channel: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Subscribers: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;subscribers&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Verified: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;is_verified&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Recent messages: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



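&lt;p&gt;Running this against &lt;code&gt;durov&lt;/code&gt; (Pavel Durov's personal channel) returns structured data you can feed into any monitoring pipeline, with no tokens, no phone numbers, and no account to get banned. Note that &lt;code&gt;subscribers&lt;/code&gt; and &lt;code&gt;views&lt;/code&gt; come back as the text shown on the page, not as integers.&lt;/p&gt;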

&lt;h2&gt;
  
  
  Key Data Points You Can Extract
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Data Point&lt;/th&gt;
&lt;th&gt;Location&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;subscriber_count&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Channel header&lt;/td&gt;
&lt;td&gt;Growth tracking, trend detection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;message_text&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Message body&lt;/td&gt;
&lt;td&gt;Keyword alerts, sentiment analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;message_views&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Message footer&lt;/td&gt;
&lt;td&gt;Engagement scoring, virality detection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;message_date&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Message timestamp&lt;/td&gt;
&lt;td&gt;Frequency analysis, pump detection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;is_verified&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Channel badge&lt;/td&gt;
&lt;td&gt;Legitimacy filtering&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;description&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Channel info&lt;/td&gt;
&lt;td&gt;Category classification&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
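&lt;p&gt;Because the scraper above returns counts as display text, growth tracking and engagement scoring need them converted to numbers first. Here is a minimal sketch, assuming the preview abbreviates large counts with K and M suffixes (for example 12.3K). &lt;code&gt;parse_count&lt;/code&gt; is a hypothetical helper, not part of any Telegram library.&lt;/p&gt;

```python
from decimal import Decimal, InvalidOperation

# Assumed suffixes; extend this if the page shows others.
MULTIPLIERS = {"K": 1_000, "M": 1_000_000}


def parse_count(text):
    """Convert display text such as '12.3K' or '1 204' to an int, or None."""
    if not text:
        return None
    # Drop all whitespace (including non-breaking spaces) and thousands commas.
    cleaned = "".join(text.split()).replace(",", "").upper()
    if not cleaned:
        return None
    multiplier = MULTIPLIERS.get(cleaned[-1], 1)
    if cleaned[-1] in MULTIPLIERS:
        cleaned = cleaned[:-1]
    try:
        return int(Decimal(cleaned) * multiplier)
    except (InvalidOperation, ValueError, OverflowError):
        return None  # text was not a number after all
```

&lt;p&gt;For example, &lt;code&gt;parse_count("12.3K")&lt;/code&gt; gives 12300. Abbreviated counts are rounded by the page, so small day-to-day changes on large channels will not be visible.&lt;/p&gt;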

&lt;h2&gt;
  
  
  Comparison: Three Approaches to Telegram Monitoring
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Bot API&lt;/th&gt;
&lt;th&gt;User API (MTProto)&lt;/th&gt;
&lt;th&gt;Web Scraping&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Auth required&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Bot token + admin&lt;/td&gt;
&lt;td&gt;Phone + 2FA&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Access to public channels&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Only if admin&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Message history depth&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Unlimited (if admin)&lt;/td&gt;
&lt;td&gt;Unlimited&lt;/td&gt;
&lt;td&gt;Last 20-50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Rate limits&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;30 msg/sec&lt;/td&gt;
&lt;td&gt;Aggressive&lt;/td&gt;
&lt;td&gt;Standard HTTP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Account ban risk&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Setup complexity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best for&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Your own channels&lt;/td&gt;
&lt;td&gt;Full archive access&lt;/td&gt;
&lt;td&gt;Monitoring &amp;amp; alerts&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For most monitoring use cases — where you need recent activity, subscriber counts, and engagement data from channels you don't own — web scraping is the pragmatic choice.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scaling Beyond a Script
&lt;/h2&gt;

&lt;p&gt;The Python example above works for ad-hoc checks, but production monitoring needs scheduling, proxy rotation, error handling, and structured output.&lt;/p&gt;
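&lt;p&gt;Of the production concerns listed above, error handling is the easiest to sketch. The helper below is a generic retry wrapper with exponential backoff. &lt;code&gt;with_retries&lt;/code&gt; and its parameters are hypothetical names, and the delays are illustrative rather than tuned for Telegram.&lt;/p&gt;

```python
import time


def with_retries(fetch, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...


# Usage with the scraper defined earlier (commented out: needs network access):
# data = with_retries(lambda: scrape_telegram_channel("durov"))
```

&lt;p&gt;Passing &lt;code&gt;sleep&lt;/code&gt; as a parameter keeps the helper testable without real waiting. Scheduling and proxy rotation are separate concerns that this sketch does not cover.&lt;/p&gt;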

&lt;p&gt;If you need a production-ready solution, the &lt;a href="https://apify.com/cryptosignals/telegram-channel-scraper" rel="noopener noreferrer"&gt;Telegram Channel Scraper on Apify&lt;/a&gt; handles all of this out of the box — batch channel processing, JSON/CSV export, and scheduled runs with webhook notifications. It's built on the same web preview approach, so no Telegram authentication is needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to Build With This
&lt;/h2&gt;

&lt;p&gt;Once you have the data flowing, the interesting work begins:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Alert pipelines&lt;/strong&gt;: Trigger notifications when a crypto channel posts for the first time in 48 hours (often precedes announcements)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Growth dashboards&lt;/strong&gt;: Track subscriber count changes across competitor channels daily&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sentiment analysis&lt;/strong&gt;: Run message text through an LLM to classify sentiment shifts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-channel correlation&lt;/strong&gt;: Detect when multiple channels in the same niche start posting about the same topic&lt;/li&gt;
&lt;/ul&gt;
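&lt;p&gt;The first idea in the list, alerting when a channel posts after a quiet spell, reduces to comparing the two most recent timestamps. Here is a rough sketch, assuming the dates are ISO 8601 strings with a consistent UTC offset, like the &lt;code&gt;datetime&lt;/code&gt; attribute collected by the scraper above. &lt;code&gt;broke_silence&lt;/code&gt; is a hypothetical name.&lt;/p&gt;

```python
from datetime import datetime, timedelta


def broke_silence(dates, quiet_hours=48):
    """True if the newest post came at least quiet_hours after the previous one."""
    stamps = sorted(datetime.fromisoformat(d) for d in dates if d)
    if len(stamps) in (0, 1):
        return False  # need two posts to measure a gap
    return stamps[-1] - stamps[-2] >= timedelta(hours=quiet_hours)


# Usage with the scraper output from earlier:
# alert = broke_silence([m["date"] for m in data["messages"]])
```

&lt;p&gt;This only looks at one snapshot. A real alert would also remember the newest timestamp from the previous run so the same post does not fire twice.&lt;/p&gt;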

&lt;p&gt;Telegram monitoring is one of those capabilities where the gap between "I should do this" and "I'm actually doing this" is surprisingly small. A few lines of Python and you're in.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>python</category>
      <category>api</category>
      <category>programming</category>
    </item>
    <item>
      <title>itch.io Game Data at Scale: Who Needs It and How to Extract It</title>
      <dc:creator>agenthustler</dc:creator>
      <pubDate>Mon, 04 May 2026 07:03:34 +0000</pubDate>
      <link>https://dev.to/agenthustler/itchio-game-data-at-scale-who-needs-it-and-how-to-extract-it-3e58</link>
      <guid>https://dev.to/agenthustler/itchio-game-data-at-scale-who-needs-it-and-how-to-extract-it-3e58</guid>
      <description>&lt;p&gt;itch.io hosts over 500,000 indie games and has roughly 10 million registered users. It is the dominant platform for game jams, experimental titles, and developers who want to publish without a publisher. All of that catalog data — titles, ratings, genres, platforms, creator information — is publicly visible. None of it is available through a bulk API.&lt;/p&gt;

&lt;p&gt;This post covers what data is actually accessible on itch.io, who needs it and why, and how to run our actor to extract it without building or maintaining any scraping infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  The data access problem
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;There is no public itch.io API for catalog data.&lt;/strong&gt; itch.io provides a limited API for developers to manage their own games and purchases, but there is no endpoint that lets you query the catalog by genre, rating, platform, or creator. If you want a list of the top-rated horror games, or all puzzle games with more than 1,000 ratings, or every title a specific developer has published — you cannot get that from an API call. The data exists, it is all publicly visible on the site, but there is no programmatic way to retrieve it in bulk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The catalog is large and fragmented.&lt;/strong&gt; With 500,000+ games spanning dozens of genres, tags, and platforms, manually compiling even a modest dataset is impractical. A researcher trying to track rating trends across game jams, or a journalist building a "best of" list, or a developer benchmarking their own game against the market — all of them face the same problem: the data is there, but getting it out at any scale requires either hours of manual copy-paste or purpose-built extraction tooling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pages are server-rendered.&lt;/strong&gt; itch.io renders its browse and search pages as standard HTML. This means the data is accessible to anyone who can read a web page — but actually collecting it at scale, handling pagination, respecting the site's rate limits, and normalizing the output into a structured format still requires non-trivial engineering work to do right.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who actually needs this data
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Indie game developers doing competitive research.&lt;/strong&gt; Before pricing a new game, launching a jam entry, or picking a genre to target, developers want to understand the landscape. How many games in a given tag have ratings above 4.0? What is the average rating count for top-performing puzzle games? What platforms do the most-played browser games support? These are answerable questions if you have the data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Game jam organizers and curators.&lt;/strong&gt; Jam organizers running events on itch.io often want to survey the catalog of entries — track which games are gaining traction post-jam, compile shortlists for coverage, or analyze genre distribution across submissions. Doing this for dozens or hundreds of games without automated extraction is slow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Game journalists and critics.&lt;/strong&gt; Writers covering the indie space need lists. The top-rated games in a specific tag this year. The highest-rated games with under 500 ratings (hidden gems). All titles from a creator whose new release just launched. These are the kinds of queries that make for good editorial angles, and right now most journalists build them manually.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Academic researchers.&lt;/strong&gt; Game studies researchers use itch.io as a field site. Analyzing genre trends over time, studying rating distributions, tracking which platform combinations correlate with higher engagement — all of this requires structured data from a large sample. itch.io's public catalog is one of the few places where indie game metadata is available at this scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data product builders.&lt;/strong&gt; Teams building game discovery tools, recommendation engines, or market intelligence products need raw catalog data as an input. itch.io is a natural complement to Steam data for anyone covering the full indie game market, not just the commercial storefront.&lt;/p&gt;

&lt;h2&gt;
  
  
  What data you actually get
&lt;/h2&gt;

&lt;p&gt;Our actor extracts the following fields from public itch.io game listings — no account or API key required:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;id&lt;/strong&gt; — itch.io's internal game ID&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;title&lt;/strong&gt; — game title as listed on the platform&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;url&lt;/strong&gt; — direct link to the game page&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;creator&lt;/strong&gt; — developer or publisher name&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;creator_url&lt;/strong&gt; — link to the creator's itch.io profile&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;description&lt;/strong&gt; — short game description from the listing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;genre&lt;/strong&gt; — genre classification (e.g., Action, Visual Novel, Platformer, RPG)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;rating&lt;/strong&gt; — average rating on a 0–5 scale&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;rating_count&lt;/strong&gt; — total number of ratings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;platforms&lt;/strong&gt; — supported platforms: &lt;code&gt;windows&lt;/code&gt;, &lt;code&gt;mac&lt;/code&gt;, &lt;code&gt;linux&lt;/code&gt;, &lt;code&gt;android&lt;/code&gt;, &lt;code&gt;browser&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;thumbnail&lt;/strong&gt; — URL of the game's cover image&lt;/li&gt;
&lt;/ul&gt;
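&lt;p&gt;To make the field list concrete, here is one way a consumer might model a record and run the hidden-gems query mentioned earlier (high rating, under 500 ratings). Only the field names come from the list above, and only a subset is modeled. The types, the sample rows, and the 4.0 cutoff are assumptions for illustration.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class Game:
    id: int
    title: str
    url: str
    creator: str
    rating: float          # average on the 0-5 scale
    rating_count: int
    genre: str = ""
    platforms: list = field(default_factory=list)


def hidden_gems(games, min_rating=4.0, max_ratings=500):
    """Highly rated games that few people have rated yet, best first."""
    picks = [g for g in games
             if g.rating >= min_rating and max_ratings > g.rating_count]
    return sorted(picks, key=lambda g: g.rating, reverse=True)


# Made-up sample rows, not real itch.io data.
sample = [
    Game(1, "Example A", "https://example.itch.io/a", "dev-a", 4.8, 120),
    Game(2, "Example B", "https://example.itch.io/b", "dev-b", 4.9, 9000),
    Game(3, "Example C", "https://example.itch.io/c", "dev-c", 3.2, 40),
]
print([g.title for g in hidden_gems(sample)])
```

&lt;p&gt;In practice you would build the &lt;code&gt;Game&lt;/code&gt; objects from the actor's JSON export rather than from hand-written rows.&lt;/p&gt;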

&lt;h2&gt;
  
  
  How to run the actor
&lt;/h2&gt;

&lt;p&gt;The actor supports four modes: browsing top-rated games, browsing featured/popular games, searching by keyword, and filtering by tag. All modes return the same structured output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Via Apify Console&lt;/strong&gt; (no code needed):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to &lt;a href="https://apify.com/cryptosignals/itchio-scraper" rel="noopener noreferrer"&gt;apify.com/cryptosignals/itchio-scraper&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Try for free&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Choose your action and set any query parameters&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Start&lt;/strong&gt; and download results as JSON or CSV&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Input: top-rated games&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"top"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"maxItems"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Input: games by tag&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tag"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"horror"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"maxItems"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Input: keyword search&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"search"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"bullet hell"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"maxItems"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Input: featured games&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"featured"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"maxItems"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;36&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Via Apify API:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="s2"&gt;"https://api.apify.com/v2/acts/cryptosignals~itchio-scraper/runs"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer YOUR_APIFY_TOKEN"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "action": "tag",
    "query": "puzzle",
    "maxItems": 100
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Sample output record:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;123456&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"HoloCure - Save the Fans!"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://kay-yu.itch.io/holocure"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"creator"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Kay Yu"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"creator_url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://kay-yu.itch.io"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Fan-made inspired by a certain VTuber group."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"genre"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"rating"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;4.94&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"rating_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;18432&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"platforms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"windows"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"thumbnail"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://img.itch.zone/..."&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Results are returned as a JSON array and can be exported as CSV directly from the Apify Console.&lt;/p&gt;
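
&lt;p&gt;If you download the JSON and want CSV locally instead, a conversion can be sketched in a few lines of Python. This is a minimal sketch, not part of the actor: the column order simply follows the sample record above, and joining &lt;code&gt;platforms&lt;/code&gt; with &lt;code&gt;|&lt;/code&gt; is an arbitrary choice made for illustration.&lt;/p&gt;

```python
import csv
import io

# Column order follows the sample record above; "platforms" is a list in the JSON.
FIELDS = ["id", "title", "url", "creator", "creator_url", "description",
          "genre", "rating", "rating_count", "platforms", "thumbnail"]


def records_to_csv(records):
    """Serialize actor output records to CSV text, joining list fields with '|'."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    for record in records:
        # Missing fields become empty cells rather than raising KeyError.
        row = {name: record.get(name, "") for name in FIELDS}
        if isinstance(row["platforms"], list):
            row["platforms"] = "|".join(row["platforms"])
        writer.writerow(row)
    return buffer.getvalue()


sample = [{
    "id": 123456,
    "title": "HoloCure - Save the Fans!",
    "genre": "Action",
    "rating": 4.94,
    "platforms": ["windows"],
}]
csv_text = records_to_csv(sample)
print(csv_text)
```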

&lt;h2&gt;
  
  
  Pricing
&lt;/h2&gt;

&lt;p&gt;The actor uses pay-per-event pricing: you pay per game extracted, and the first results are free so you can verify output quality before committing to a larger run. For typical research datasets of a few hundred games, the cost is small next to the engineering time a custom scraper takes. Check the actor page for current pricing.&lt;/p&gt;

&lt;h2&gt;
  
  
  What you don't get
&lt;/h2&gt;

&lt;p&gt;Price is not part of the output. itch.io does not expose game prices in browse or search results; pricing is only visible on individual game pages. The actor extracts listing-level data (what you see when browsing by top, featured, search, or tag), not the full detail page for each game.&lt;/p&gt;

&lt;p&gt;Individual game page data — full description text, all screenshots, download counts, comment threads, or pricing — would require a separate per-game fetch for each title. The current actor focuses on catalog-level metadata.&lt;/p&gt;

&lt;h2&gt;
  
  
  The alternative
&lt;/h2&gt;

&lt;p&gt;You can build this yourself. The engineering work involves: handling itch.io's pagination across potentially hundreds of pages, normalizing inconsistent genre and platform markup, implementing polite rate limiting to avoid overloading the server, parsing structured data out of HTML, and maintaining the scraper when the page structure changes. That is a day or two of engineering time to build, plus ongoing maintenance.&lt;/p&gt;

&lt;p&gt;For most research workflows, teams do not need a custom scraper — they need the data. The actor delivers structured JSON ready for analysis without any infrastructure overhead.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Actor:&lt;/strong&gt; &lt;a href="https://apify.com/cryptosignals/itchio-scraper" rel="noopener noreferrer"&gt;apify.com/cryptosignals/itchio-scraper&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;By:&lt;/strong&gt; Web Data Labs — data infrastructure for developers and researchers.&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>gamedev</category>
      <category>datascience</category>
      <category>api</category>
    </item>
    <item>
      <title>Scraping Shopify Stores in 2026: Product Catalog, Pricing &amp; Inventory Data</title>
      <dc:creator>agenthustler</dc:creator>
      <pubDate>Mon, 04 May 2026 05:28:20 +0000</pubDate>
      <link>https://dev.to/agenthustler/scraping-shopify-stores-in-2026-product-catalog-pricing-inventory-data-1l20</link>
      <guid>https://dev.to/agenthustler/scraping-shopify-stores-in-2026-product-catalog-pricing-inventory-data-1l20</guid>
      <description>&lt;p&gt;If you've ever tried to monitor competitor pricing on a Shopify store, build a dropshipping research pipeline, or feed a market-intel dashboard with live e-commerce data, you've probably learned the hard way that "just scrape it" is a sentence that hides a lot of pain.&lt;/p&gt;

&lt;p&gt;Shopify powers somewhere north of 4.6 million live storefronts in 2026. Each one is a goldmine of structured data — product catalogs, variant matrices, real-time inventory, pricing changes — but extracting that data reliably across thousands of stores is an engineering problem that gets messy fast.&lt;/p&gt;

&lt;p&gt;This post walks through &lt;em&gt;why&lt;/em&gt; scraping Shopify is harder than it looks, what kinds of business problems good Shopify data solves, and how to plug a managed scraper into your stack without writing (or maintaining) your own.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why businesses need Shopify store data
&lt;/h2&gt;

&lt;p&gt;Shopify's open architecture means a lot of useful data is technically reachable — and that creates demand from teams that aren't going to build a scraper from scratch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Price monitoring.&lt;/strong&gt; DTC brands and retailers want to know when competitors discount, restock, or change MSRP. A daily snapshot across 200 competitor stores beats a quarterly manual audit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Competitive intelligence.&lt;/strong&gt; Which SKUs is a competitor pushing on their homepage? Which collections did they reshuffle this week? Which products are quietly hidden but still sold? This is the data that ends up in board decks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dropshipping research.&lt;/strong&gt; Finding winning products before they saturate is the entire dropshipping playbook. Tracking new listings, sudden inventory drops, and review velocity across hundreds of niche stores is how serious dropshippers find signal.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inventory tracking.&lt;/strong&gt; Suppliers and resellers need to know when a hot product is back in stock — sometimes within minutes. Polling key SKUs is a real-money problem.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Market analysis.&lt;/strong&gt; Aggregating product, price, and category data across a vertical (say, sustainable fashion or pet supplements) tells you category-level trends no single store can.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The pattern across all of these: &lt;strong&gt;structured product data, refreshed often, normalized across many stores&lt;/strong&gt;. That's deceptively hard.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why scraping Shopify is non-trivial
&lt;/h2&gt;

&lt;p&gt;Shopify &lt;em&gt;looks&lt;/em&gt; easy. The platform is consistent: every store has predictable URL patterns, products have stable IDs, and a lot of data is exposed in JSON. New scrapers usually start optimistic and get humbled within a week. Here's what bites them:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Rate limits and adaptive throttling
&lt;/h3&gt;

&lt;p&gt;Shopify's edge applies aggressive rate limiting that adapts to traffic patterns. A naive scraper hammering one store will get throttled within minutes. The signs are subtle — slower responses, partial pages, soft 429s wrapped as 200s with truncated bodies. By the time you notice, your dataset is already corrupt.&lt;/p&gt;

&lt;p&gt;Scraping at scale (hundreds of stores, hourly) needs distributed request scheduling, exponential backoff, and a rotating residential proxy pool. Providers like &lt;a href="https://oxylabs.go2cloud.org/aff_c?offer_id=7&amp;amp;aff_id=2066&amp;amp;url_id=174" rel="noopener noreferrer"&gt;Oxylabs&lt;/a&gt; and &lt;a href="https://www.scraperapi.com/?fp_ref=the52" rel="noopener noreferrer"&gt;ScraperAPI&lt;/a&gt; exist precisely because this layer is non-trivial: they handle the proxy rotation, geo-targeting, and CAPTCHA solving that you'd otherwise be reinventing.&lt;/p&gt;
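
&lt;p&gt;To make the backoff part concrete, here is a minimal sketch of exponential backoff with full jitter. It is an illustration, not how the actor works internally: &lt;code&gt;fetch&lt;/code&gt; is a hypothetical callable that returns &lt;code&gt;None&lt;/code&gt; whenever a response looks throttled, and the base delay and cap are arbitrary defaults.&lt;/p&gt;

```python
import random
import time


def backoff_delay(attempt, base=1.0, cap=60.0):
    """Upper bound in seconds for the wait after failed attempt `attempt` (0-based)."""
    return min(cap, base * (2 ** attempt))


def fetch_with_backoff(fetch, max_attempts=5, sleep=time.sleep, rng=random.random):
    """Call `fetch()` until it returns a non-None result, sleeping between tries.

    `fetch` is a hypothetical callable that returns None when the response
    looks throttled (for example an HTTP 429 or a suspiciously short body).
    """
    for attempt in range(max_attempts):
        result = fetch()
        if result is not None:
            return result
        if attempt + 1 != max_attempts:
            # Full jitter: wait a random fraction of the exponential bound.
            sleep(rng() * backoff_delay(attempt))
    raise RuntimeError("still throttled after %d attempts" % max_attempts)


# Demo with a stubbed fetch and a no-op sleep, so nothing actually waits.
attempts = iter([None, None, "page body"])
print(fetch_with_backoff(lambda: next(attempts), sleep=lambda seconds: None))
```

&lt;p&gt;Passing &lt;code&gt;sleep&lt;/code&gt; and &lt;code&gt;rng&lt;/code&gt; as parameters keeps the retry logic testable without real delays.&lt;/p&gt;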

&lt;h3&gt;
  
  
  2. Pagination quirks
&lt;/h3&gt;

&lt;p&gt;Shopify exposes product listings through several different endpoints, each with its own pagination quirks, page-size caps, and silent truncation behaviors. Some endpoints will happily return the first 250 products and then stop. Others paginate cleanly until they suddenly return duplicates. Building a scraper that &lt;em&gt;actually&lt;/em&gt; gets every product on a 50,000-SKU store, every time, is a long debugging exercise.&lt;/p&gt;
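
&lt;p&gt;One defensive pattern for the duplicate problem is to deduplicate by product ID while paging and to stop when a page adds nothing new. The sketch below assumes a hypothetical &lt;code&gt;fetch_page(page)&lt;/code&gt; callable that returns a list of dicts with an &lt;code&gt;id&lt;/code&gt; key; it does not reflect any specific Shopify endpoint.&lt;/p&gt;

```python
def collect_all(fetch_page, max_pages=1000):
    """Walk numbered pages until one is empty or adds nothing new.

    `fetch_page(page)` is a hypothetical callable returning a list of
    product dicts that each carry an "id" key.
    """
    seen = {}
    for page in range(1, max_pages + 1):
        items = fetch_page(page)
        new_items = [item for item in items if item["id"] not in seen]
        if not new_items:
            # Empty page, or a page of pure duplicates: treat both as the end.
            break
        for item in new_items:
            seen[item["id"]] = item
    return list(seen.values())


# Fake endpoint: page 2 overlaps page 1, page 3 repeats old items only.
fake_pages = {
    1: [{"id": 1}, {"id": 2}],
    2: [{"id": 2}, {"id": 3}],
    3: [{"id": 3}, {"id": 2}],
    4: [{"id": 4}],
}
products = collect_all(lambda page: fake_pages.get(page, []))
print([product["id"] for product in products])
```

&lt;p&gt;A loop like this cannot detect silent truncation on its own. If an endpoint stops early with an empty page, the only safeguard is comparing the collected count against an independent total.&lt;/p&gt;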

&lt;h3&gt;
  
  
  3. Variant explosion
&lt;/h3&gt;

&lt;p&gt;A single product can have dozens of variants — size × color × material × bundle. A store with 1,000 visible products can expand to 30,000+ variant rows once you flatten it. Storage, deduplication, and "is this the same product?" matching all become real concerns.&lt;/p&gt;
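
&lt;p&gt;Flattening can be sketched as follows, assuming records shaped like the actor output shown later in this post (&lt;code&gt;store&lt;/code&gt;, &lt;code&gt;productId&lt;/code&gt;, &lt;code&gt;title&lt;/code&gt;, and a &lt;code&gt;variants&lt;/code&gt; list). The &lt;code&gt;option_&lt;/code&gt; column prefix and the &lt;code&gt;productTitle&lt;/code&gt; name are choices made for this example.&lt;/p&gt;

```python
def flatten_variants(product):
    """Return one flat row per variant, identified by store, productId, variantId."""
    rows = []
    for variant in product.get("variants", []):
        row = {
            "store": product["store"],
            "productId": product["productId"],
            "productTitle": product["title"],
            "variantId": variant["variantId"],
            "sku": variant.get("sku"),
            "price": variant.get("price"),
            "available": variant.get("available"),
        }
        # Prefix option names so "size" or "color" cannot clash with other columns.
        for name, value in variant.get("options", {}).items():
            row["option_" + name] = value
        rows.append(row)
    return rows


product = {
    "store": "allbirds.com",
    "productId": "7891234567890",
    "title": "Wool Runner Mizzles",
    "variants": [{
        "variantId": "44123456789",
        "sku": "WRM-M9-NB",
        "price": 115.00,
        "available": True,
        "options": {"size": "M9", "color": "Natural Black"},
    }],
}
rows = flatten_variants(product)
print(rows[0])
```

&lt;p&gt;Using the triple of store, product ID, and variant ID as the row key gives deduplication a stable handle even when SKUs are missing or reused.&lt;/p&gt;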

&lt;h3&gt;
  
  
  4. JSON vs HTML endpoints disagree
&lt;/h3&gt;

&lt;p&gt;The JSON endpoints, the rendered HTML, and the search results sometimes disagree about what's available. A product can be hidden from collection pages but still purchasable via direct URL. Inventory counts shown in HTML may not match the underlying JSON. A robust scraper has to reconcile these views — and decide which is the source of truth.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Anti-bot defenses are getting smarter
&lt;/h3&gt;

&lt;p&gt;Cloudflare, custom JS challenges, fingerprinting, behavioral detection — Shopify stores are increasingly protected. Tools like &lt;a href="https://scrapeops.io?fpr=the-data28" rel="noopener noreferrer"&gt;ScrapeOps&lt;/a&gt; help with monitoring and bypass orchestration, but the cat-and-mouse game eats engineering time you'd rather spend on your actual product.&lt;/p&gt;

&lt;p&gt;The honest summary: writing a one-off scraper for &lt;em&gt;one&lt;/em&gt; Shopify store on a Tuesday afternoon is fine. Running a reliable, normalized, multi-store scraping pipeline in production is a months-long engineering project most teams shouldn't take on.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to use our actor
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://apify.com/cryptosignals/shopify-scraper" rel="noopener noreferrer"&gt;Shopify Store Scraper&lt;/a&gt; on Apify is a managed solution. You give it a list of stores and parameters; you get back a clean, normalized dataset.&lt;/p&gt;

&lt;h3&gt;
  
  
  Input
&lt;/h3&gt;

&lt;p&gt;The actor takes a simple JSON input:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"startUrls"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://allbirds.com"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://gymshark.com"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://kith.com"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"maxProducts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"includeVariants"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"includeInventory"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"currency"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"USD"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the whole interface. No proxy config, no rate-limit tuning, no pagination strategy — those are the actor's job.&lt;/p&gt;

&lt;h3&gt;
  
  
  Output
&lt;/h3&gt;

&lt;p&gt;Each product comes back as a normalized record:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"store"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"allbirds.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"productId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"7891234567890"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"handle"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"wool-runner-mizzles"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Wool Runner Mizzles"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"vendor"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allbirds"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"productType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Shoes"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tags"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"mens"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"weather-ready"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"wool"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://allbirds.com/products/wool-runner-mizzles"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"images"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"https://cdn.shopify.com/.../mizzle-1.jpg"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"https://cdn.shopify.com/.../mizzle-2.jpg"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"min"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;115.00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"max"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;135.00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"currency"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"USD"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"variants"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"variantId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"44123456789"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"M9 / Natural Black"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"sku"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"WRM-M9-NB"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;115.00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"compareAtPrice"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;135.00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"available"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"inventoryQuantity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"options"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"size"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"M9"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"color"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Natural Black"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"createdAt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2025-09-12T10:14:00Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"updatedAt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-04-30T08:21:00Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"scrapedAt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-05-04T14:00:00Z"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the shape your downstream code wants: predictable fields, normalized currency, ISO 8601 timestamps, and a stable &lt;code&gt;productId&lt;/code&gt; you can dedupe on. It loads directly into BigQuery, Postgres, or a vector DB; for a spreadsheet, only the nested &lt;code&gt;variants&lt;/code&gt; array needs flattening first.&lt;/p&gt;

&lt;h3&gt;
  
  
  Calling it from code
&lt;/h3&gt;

&lt;p&gt;You don't need to learn the actor's internals. From any language with HTTP, you start a run with the input above against the Apify API and pull results from the dataset when it's done. There's a Python client, a JS client, and webhook delivery if you want push instead of pull.&lt;/p&gt;
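
&lt;p&gt;As a rough illustration using only the Python standard library, the request that starts a run can be built like this. The actor ID &lt;code&gt;cryptosignals~shopify-scraper&lt;/code&gt; is an assumption derived from the store URL, and the snippet only constructs the request without sending it.&lt;/p&gt;

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"
# Assumption: the actor ID mirrors the store URL, with "~" in place of "/".
ACTOR_ID = "cryptosignals~shopify-scraper"


def build_run_request(token, run_input):
    """Build (but do not send) the POST request that starts an actor run."""
    return urllib.request.Request(
        url="%s/acts/%s/runs" % (API_BASE, ACTOR_ID),
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + token,
        },
        method="POST",
    )


def dataset_items_url(dataset_id):
    """URL for downloading a run's dataset items as JSON."""
    return "%s/datasets/%s/items?format=json" % (API_BASE, dataset_id)


request = build_run_request("YOUR_APIFY_TOKEN", {
    "startUrls": [{"url": "https://allbirds.com"}],
    "maxProducts": 5000,
})
print(request.get_method(), request.full_url)
```

&lt;p&gt;Sending it is a matter of passing &lt;code&gt;request&lt;/code&gt; to &lt;code&gt;urllib.request.urlopen&lt;/code&gt;. The response should describe the run, including the ID of its default dataset, which is what &lt;code&gt;dataset_items_url&lt;/code&gt; expects once the run has finished.&lt;/p&gt;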

&lt;h2&gt;
  
  
  Use cases
&lt;/h2&gt;

&lt;p&gt;A few concrete patterns we've seen users build:&lt;/p&gt;

&lt;h3&gt;
  
  
  Dropshippers
&lt;/h3&gt;

&lt;p&gt;Run the actor nightly across a curated list of 100–300 niche stores. Diff against yesterday's snapshot to surface:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;New product launches (likely test SKUs)&lt;/li&gt;
&lt;li&gt;Sudden inventory drops (hot product signal)&lt;/li&gt;
&lt;li&gt;Price increases (validated demand)&lt;/li&gt;
&lt;li&gt;Variants that keep selling out (winners)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The whole pipeline is one cron job, one Postgres table, and a Slack notifier on the diff.&lt;/p&gt;
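
&lt;p&gt;The diff step might look roughly like this. It is a sketch that assumes both snapshots are lists of records in the output shape shown above; the 50% threshold for a "sudden" inventory drop is an arbitrary illustrative value.&lt;/p&gt;

```python
def index_variants(products):
    """Map (store, variantId) to the variant dict for one snapshot."""
    index = {}
    for product in products:
        for variant in product.get("variants", []):
            index[(product["store"], variant["variantId"])] = variant
    return index


def diff_snapshots(yesterday, today, drop_ratio=0.5):
    """Return new products, sharp inventory drops, and price increases."""
    old_ids = {(p["store"], p["productId"]) for p in yesterday}
    signals = {
        "new_products": [(p["store"], p["productId"]) for p in today
                         if (p["store"], p["productId"]) not in old_ids],
        "inventory_drops": [],
        "price_increases": [],
    }
    old = index_variants(yesterday)
    for key, variant in index_variants(today).items():
        before = old.get(key)
        if before is None:
            continue  # variant did not exist yesterday, nothing to compare
        old_qty = before.get("inventoryQuantity") or 0
        new_qty = variant.get("inventoryQuantity") or 0
        # Flag when stock fell to drop_ratio of yesterday's level or lower.
        if old_qty > 0 and old_qty * drop_ratio >= new_qty:
            signals["inventory_drops"].append(key)
        if variant["price"] > before["price"]:
            signals["price_increases"].append(key)
    return signals


yesterday = [{"store": "a.com", "productId": "1", "variants": [
    {"variantId": "v1", "price": 10.0, "inventoryQuantity": 40}]}]
today = [
    {"store": "a.com", "productId": "1", "variants": [
        {"variantId": "v1", "price": 12.0, "inventoryQuantity": 5}]},
    {"store": "a.com", "productId": "2", "variants": [
        {"variantId": "v2", "price": 5.0, "inventoryQuantity": 9}]},
]
signals = diff_snapshots(yesterday, today)
print(signals)
```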

&lt;h3&gt;
  
  
  E-commerce teams
&lt;/h3&gt;

&lt;p&gt;Marketing and merchandising teams use it to monitor 20–50 competitors. The output feeds a dashboard that flags pricing changes as soon as a scheduled run picks them up, so the brand can respond the same day instead of finding out at the next quarterly review.&lt;/p&gt;

&lt;h3&gt;
  
  
  Market analysts
&lt;/h3&gt;

&lt;p&gt;Researchers building reports on a vertical (sustainable beauty, technical apparel, indie coffee) point the actor at 500+ stores in the category, then aggregate average prices, top tags, common product types, and category mix. What used to be a six-week consultant engagement becomes a weekend of analysis.&lt;/p&gt;
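
&lt;p&gt;The aggregation itself is small once the records are in memory. The following sketch assumes the output shape shown above and uses each product's minimum price as its representative price, which is one of several reasonable choices.&lt;/p&gt;

```python
from collections import Counter
from statistics import mean


def summarize(products, top_n=3):
    """Aggregate average minimum price, top tags, and product-type mix."""
    prices = [p["price"]["min"] for p in products if p.get("price")]
    tags = Counter(tag for p in products for tag in p.get("tags", []))
    types = Counter(p.get("productType", "unknown") for p in products)
    return {
        "stores": len({p["store"] for p in products}),
        "avg_min_price": round(mean(prices), 2) if prices else None,
        "top_tags": [tag for tag, _ in tags.most_common(top_n)],
        "type_mix": dict(types),
    }


catalog = [
    {"store": "a.com", "price": {"min": 100.0}, "tags": ["wool", "mens"],
     "productType": "Shoes"},
    {"store": "b.com", "price": {"min": 50.0}, "tags": ["wool"],
     "productType": "Socks"},
]
summary = summarize(catalog, top_n=1)
print(summary)
```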

&lt;h3&gt;
  
  
  Resellers and inventory bots
&lt;/h3&gt;

&lt;p&gt;Resellers poll specific SKUs across supplier stores to catch restocks. The actor's variant-level inventory output makes this clean: you watch one variant ID and trigger when &lt;code&gt;available&lt;/code&gt; flips to true.&lt;/p&gt;
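
&lt;p&gt;A restock check along those lines could be sketched as below. It assumes two consecutive result sets in the output shape shown above and treats a variant that is missing from a snapshot as unavailable, which is a simplification.&lt;/p&gt;

```python
def restocked(previous, current, variant_id):
    """True when the watched variant went from unavailable to available."""
    def is_available(products):
        for product in products:
            for variant in product.get("variants", []):
                if variant["variantId"] == variant_id:
                    return bool(variant.get("available"))
        return False  # a missing variant counts as unavailable

    return is_available(current) and not is_available(previous)


before = [{"variants": [{"variantId": "44123456789", "available": False}]}]
after = [{"variants": [{"variantId": "44123456789", "available": True}]}]
print(restocked(before, after, "44123456789"))
```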

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;If any of those sound like your problem, the &lt;a href="https://apify.com/cryptosignals/shopify-scraper" rel="noopener noreferrer"&gt;Shopify Store Scraper&lt;/a&gt; on Apify is the fastest way to skip the months of pipeline engineering and get straight to the data. There's a free tier — point it at one store and see the output for yourself before committing.&lt;/p&gt;

&lt;p&gt;If you do decide to roll your own (some teams need to), at least save yourself the proxy-and-bypass headache — &lt;a href="https://oxylabs.go2cloud.org/aff_c?offer_id=7&amp;amp;aff_id=2066&amp;amp;url_id=174" rel="noopener noreferrer"&gt;Oxylabs&lt;/a&gt; and &lt;a href="https://www.scraperapi.com/?fp_ref=the52" rel="noopener noreferrer"&gt;ScraperAPI&lt;/a&gt; handle the infrastructure layer that's the most painful to maintain, and &lt;a href="https://scrapeops.io?fpr=the-data28" rel="noopener noreferrer"&gt;ScrapeOps&lt;/a&gt; gives you visibility into what's actually happening when things break.&lt;/p&gt;

&lt;p&gt;Either way: stop scraping the slow way. The data's there — go get it.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>python</category>
      <category>scraping</category>
      <category>ecommerce</category>
    </item>
    <item>
      <title>LinkedIn Company Data in 2026: Why It's Hard to Get and How to Extract It</title>
      <dc:creator>agenthustler</dc:creator>
      <pubDate>Mon, 04 May 2026 03:59:04 +0000</pubDate>
      <link>https://dev.to/agenthustler/linkedin-company-data-in-2026-why-its-hard-to-get-and-how-to-extract-it-35df</link>
      <guid>https://dev.to/agenthustler/linkedin-company-data-in-2026-why-its-hard-to-get-and-how-to-extract-it-35df</guid>
      <description>&lt;p&gt;LinkedIn has over 67 million company pages. Every B2B sales team, investor, and recruiter needs company data from LinkedIn. And yet getting that data programmatically is genuinely difficult — not because the data is hidden, but because LinkedIn has built one of the most aggressive anti-scraping systems on the web, and their official API is priced for enterprise budgets only.&lt;/p&gt;

&lt;p&gt;This post covers what data is actually available on LinkedIn company pages, why it's hard to get at scale, who needs it and why, and how to run our actor to extract it without building or maintaining any scraping infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why LinkedIn company data is hard to get
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The official API is not a real option for most teams.&lt;/strong&gt; LinkedIn's Marketing Developer Platform costs $15,000+/year and requires a partner application process. The data endpoints available through official channels are primarily designed for ad targeting and HR software integrations — not bulk company research. For a solo founder or a small data team doing ICP analysis or competitive research, the API is effectively unavailable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The anti-scraping stack is serious.&lt;/strong&gt; LinkedIn runs browser fingerprinting, behavioral analysis, IP reputation scoring, and bot challenge pages. A naive Python &lt;code&gt;requests&lt;/code&gt; script gets blocked within minutes. Even headless browsers get flagged quickly without significant investment in evasion infrastructure. High-volume extraction requires residential proxies — which add meaningful cost — and constant maintenance as LinkedIn updates its detection methods.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Terms of service add legal ambiguity.&lt;/strong&gt; LinkedIn's ToS restricts automated data collection. In &lt;em&gt;hiQ Labs v. LinkedIn&lt;/em&gt;, the Ninth Circuit held that scraping publicly available data likely does not violate the Computer Fraud and Abuse Act, but that ruling does not shield a scraper from breach-of-contract claims under the ToS, so companies still need to assess their own risk tolerance. The data on public company pages (the kind visible to any logged-out visitor) sits in the clearest legal territory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The result&lt;/strong&gt;: most teams either pay for expensive data vendors (ZoomInfo, Clearbit), build fragile in-house scrapers that need constant maintenance, or just do it manually. None of these scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who actually needs this data
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;B2B sales and ICP research.&lt;/strong&gt; Building an ideal customer profile requires enriching company lists with industry, headcount, HQ location, and founding year. Teams doing outbound at scale need to filter thousands of companies down to the 200 that actually fit their ICP. LinkedIn company pages are the canonical source for this data — more accurate and more current than most third-party databases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Investor due diligence.&lt;/strong&gt; Before a call, investors verify headcount growth signals (employee count vs. last quarter), check the company description for pivot signals, and confirm website and contact details. LinkedIn is the ground truth that other sources pull from. Automating this enrichment across a deal pipeline saves hours per week.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Competitive landscape analysis.&lt;/strong&gt; Mapping a competitive landscape means collecting industry, size, HQ, founding year, and specialties for 20-100 companies. Doing this manually in a spreadsheet is an afternoon of copy-paste. Automated extraction turns it into a 5-minute job.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recruitment targeting.&lt;/strong&gt; Identifying companies in a specific industry, headcount band, and city before sourcing candidates from those companies is a standard recruiting workflow. LinkedIn company data is the filter layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Market research and data products.&lt;/strong&gt; Research teams building industry reports, data enrichment services, or market intelligence products need bulk company data as a raw material. The same fields that power sales enrichment also power competitive benchmarking tools and market maps.&lt;/p&gt;

&lt;h2&gt;
  
  
  What data you actually get
&lt;/h2&gt;

&lt;p&gt;Our actor extracts the following fields from public LinkedIn company pages — no login required:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;name&lt;/strong&gt; — official company name as listed on LinkedIn&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;industry&lt;/strong&gt; — LinkedIn industry classification (e.g., "Software Development", "Financial Services")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;employee_count&lt;/strong&gt; — headcount range (e.g., "501-1000", "10001+")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;follower_count&lt;/strong&gt; — LinkedIn follower count&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;headquarters&lt;/strong&gt; — city, state/region, country&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;founded_year&lt;/strong&gt; — year the company was founded&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;website&lt;/strong&gt; — official company website URL&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;company_url&lt;/strong&gt; — canonical LinkedIn company page URL&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;description&lt;/strong&gt; — full company description text&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;tagline&lt;/strong&gt; — short company tagline&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;specialties&lt;/strong&gt; — list of self-reported specialty areas&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;logo_url&lt;/strong&gt; — URL to the company logo image&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;scraped_at&lt;/strong&gt; — timestamp of extraction&lt;/li&gt;
&lt;/ul&gt;
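
&lt;p&gt;If you load the output into typed code, the fields above could be modeled roughly as follows. This is a sketch only: the names &lt;code&gt;CompanyRecord&lt;/code&gt; and &lt;code&gt;record_from_dict&lt;/code&gt; are hypothetical, and the field types are inferred from the field descriptions rather than taken from a published schema.&lt;/p&gt;

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CompanyRecord:
    """Hypothetical typed view of one output record (not an official schema)."""
    name: str
    industry: Optional[str] = None
    employee_count: Optional[str] = None  # range string such as "501-1000"
    follower_count: Optional[int] = None
    headquarters: Optional[str] = None
    founded_year: Optional[int] = None
    website: Optional[str] = None
    company_url: Optional[str] = None
    description: Optional[str] = None
    tagline: Optional[str] = None
    specialties: list = field(default_factory=list)
    logo_url: Optional[str] = None
    scraped_at: Optional[str] = None


def record_from_dict(raw: dict) -> CompanyRecord:
    """Build a CompanyRecord, ignoring unknown keys and coercing numeric strings."""
    known = CompanyRecord.__dataclass_fields__
    data = {key: value for key, value in raw.items() if key in known}
    for key in ("follower_count", "founded_year"):
        value = data.get(key)
        if isinstance(value, str) and value.isdigit():
            data[key] = int(value)
    return CompanyRecord(**data)
```

&lt;p&gt;Dropping unknown keys keeps the loader tolerant of extra fields in a record, and the numeric coercion covers counts that arrive as strings.&lt;/p&gt;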

&lt;h2&gt;
  
  
  How to run the actor
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Via Apify Console&lt;/strong&gt; (no code needed):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to &lt;a href="https://apify.com/cryptosignals/linkedin-company-scraper" rel="noopener noreferrer"&gt;apify.com/cryptosignals/linkedin-company-scraper&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Try for free&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Paste your company list into the &lt;code&gt;companies&lt;/code&gt; field — accepts LinkedIn slugs (e.g., &lt;code&gt;stripe&lt;/code&gt;) or full URLs&lt;/li&gt;
&lt;li&gt;Set &lt;code&gt;max_results&lt;/code&gt; if you want to cap the run&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Start&lt;/strong&gt; and download results as JSON or CSV&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Input JSON:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"companies"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"stripe"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"https://www.linkedin.com/company/shopify"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"notion"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"max_results"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
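
&lt;p&gt;Because the &lt;code&gt;companies&lt;/code&gt; field accepts both slugs and full URLs, it can help to normalize and de-duplicate a list before submitting it. The sketch below is a client-side pre-check, not a description of what the actor does internally; &lt;code&gt;to_slug&lt;/code&gt; and &lt;code&gt;build_input&lt;/code&gt; are hypothetical helper names.&lt;/p&gt;

```python
from urllib.parse import urlparse


def to_slug(company: str) -> str:
    """Reduce a LinkedIn company URL or bare slug to the slug itself."""
    value = company.strip()
    if "://" in value:
        # Full URL: keep the path segment that follows "company".
        parts = [p for p in urlparse(value).path.split("/") if p]
        if "company" not in parts or parts.index("company") == len(parts) - 1:
            raise ValueError(f"not a company page URL: {company!r}")
        value = parts[parts.index("company") + 1]
    if not value or "/" in value:
        raise ValueError(f"not a valid slug: {company!r}")
    return value


def build_input(companies: list, max_results: int = 50) -> dict:
    """De-duplicate entries (order preserved) and build the actor input."""
    slugs = list(dict.fromkeys(to_slug(c) for c in companies))
    return {"companies": slugs, "max_results": max_results}
```

&lt;p&gt;De-duplicating up front avoids paying twice when the same company appears once as a slug and once as a URL.&lt;/p&gt;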



&lt;p&gt;&lt;strong&gt;Via Apify API:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="s2"&gt;"https://api.apify.com/v2/acts/cryptosignals~linkedin-company-scraper/runs"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer YOUR_APIFY_TOKEN"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "companies": ["stripe", "shopify"],
    "max_results": 10
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
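
&lt;p&gt;The same request can be made from Python using only the standard library. This is a minimal sketch that mirrors the curl call above: the endpoint, headers, and body fields come from that call, while the function names &lt;code&gt;build_run_request&lt;/code&gt; and &lt;code&gt;start_run&lt;/code&gt; and the 30-second timeout are assumptions made for illustration.&lt;/p&gt;

```python
import json
import urllib.request

ACTOR_RUNS_URL = "https://api.apify.com/v2/acts/cryptosignals~linkedin-company-scraper/runs"


def build_run_request(token: str, companies: list, max_results: int = 10) -> urllib.request.Request:
    """Prepare (but do not send) the POST request that starts an actor run."""
    body = json.dumps({"companies": companies, "max_results": max_results}).encode("utf-8")
    return urllib.request.Request(
        ACTOR_RUNS_URL,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )


def start_run(token: str, companies: list, max_results: int = 10) -> dict:
    """Send the request and return whatever JSON the API responds with."""
    request = build_run_request(token, companies, max_results)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

&lt;p&gt;Keeping request construction separate from sending makes the payload easy to inspect or unit-test without touching the network.&lt;/p&gt;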



&lt;p&gt;&lt;strong&gt;Sample output record:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"company_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"stripe"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Stripe"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tagline"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Financial infrastructure for the internet"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Stripe is a financial infrastructure platform for businesses..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"industry"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Software Development"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"employee_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"5001-10000"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"follower_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1240000"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"headquarters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"San Francisco, California, US"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"founded_year"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2010&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"website"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://stripe.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"specialties"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Payments"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Financial Infrastructure"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Developer Tools"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"logo_url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://media.licdn.com/dms/image/..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"company_url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://www.linkedin.com/company/stripe"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"scraped_at"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-05-04T09:00:00+00:00"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
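
&lt;p&gt;Since &lt;code&gt;employee_count&lt;/code&gt; is a range string, filtering by headcount means parsing it first. A minimal sketch, assuming the two formats shown in this post (&lt;code&gt;"501-1000"&lt;/code&gt; and &lt;code&gt;"10001+"&lt;/code&gt;); &lt;code&gt;parse_headcount&lt;/code&gt; and &lt;code&gt;fits_band&lt;/code&gt; are hypothetical helpers.&lt;/p&gt;

```python
from typing import Optional, Tuple


def parse_headcount(band: str) -> Tuple[int, Optional[int]]:
    """Turn "501-1000" into (501, 1000) and "10001+" into (10001, None)."""
    text = band.replace(",", "").strip()
    if text.endswith("+"):
        return int(text[:-1]), None
    low, _, high = text.partition("-")
    # A single number with no dash is treated as a one-value band.
    return int(low), int(high) if high else int(low)


def fits_band(record: dict, min_size: int, max_size: int) -> bool:
    """True when the record's headcount band lies fully inside [min_size, max_size]."""
    low, high = parse_headcount(record["employee_count"])
    return low >= min_size and high is not None and max_size >= high
```

&lt;p&gt;Open-ended bands such as &lt;code&gt;"10001+"&lt;/code&gt; never match a bounded filter here, which is the conservative choice for ICP lists with a size ceiling.&lt;/p&gt;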



&lt;h2&gt;
  
  
  Pricing
&lt;/h2&gt;

&lt;p&gt;The actor uses pay-per-event pricing: &lt;strong&gt;$0.008 per company&lt;/strong&gt; starting May 17, 2026. The first 5 results are free so you can verify output quality before committing. For a list of 1,000 companies, that's $8.&lt;/p&gt;
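
&lt;p&gt;To budget a run, the pricing above reduces to a one-line calculation. This sketch assumes the five free results are deducted once per estimate, which the pricing statement does not spell out; the $8 figure for 1,000 companies corresponds to ignoring the free results. &lt;code&gt;estimate_cost&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```python
from decimal import Decimal

PRICE_PER_COMPANY = Decimal("0.008")  # USD, from the pricing stated above
FREE_RESULTS = 5                      # assumption: deducted once per estimate


def estimate_cost(companies: int, free_results: int = FREE_RESULTS) -> Decimal:
    """Estimated charge in USD for a run over `companies` pages."""
    billable = max(companies - free_results, 0)
    return billable * PRICE_PER_COMPANY
```

&lt;p&gt;&lt;code&gt;Decimal&lt;/code&gt; is used instead of floats so that small per-unit prices add up without rounding drift.&lt;/p&gt;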

&lt;p&gt;For high-volume runs (10,000+ companies), residential proxy coverage becomes important for reliability. &lt;a href="https://oxylabs.go2cloud.org/aff_c?offer_id=7&amp;amp;aff_id=2066&amp;amp;url_id=174" rel="noopener noreferrer"&gt;Oxylabs&lt;/a&gt; is the proxy infrastructure we've tested and trust for this kind of workload: their residential network handles LinkedIn's IP reputation checks without the constant rotation failures that plague datacenter proxies.&lt;/p&gt;

&lt;h2&gt;
  
  
  What you don't get
&lt;/h2&gt;

&lt;p&gt;Company pages don't include employee email addresses, phone numbers, or individual employee profiles. For contact-level data, you need a separate enrichment step. The actor extracts company-level public metadata — the data visible to any unauthenticated visitor on a public company page.&lt;/p&gt;

&lt;p&gt;LinkedIn also rate-limits aggressively on certain pages. The actor handles this, but very large runs (5,000+ companies) benefit from &lt;a href="https://brightdata.com" rel="noopener noreferrer"&gt;Bright Data's residential network&lt;/a&gt; as a proxy layer to maintain throughput.&lt;/p&gt;

&lt;h2&gt;
  
  
  The alternative
&lt;/h2&gt;

&lt;p&gt;You can build this yourself. The engineering work involves: handling LinkedIn's anti-bot detection, managing proxy rotation, parsing the structured data out of the page (LinkedIn embeds JSON-LD in company pages), dealing with partial responses and retry logic, and maintaining the scraper when LinkedIn changes its page structure — which happens several times per year.&lt;/p&gt;
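
&lt;p&gt;To give a sense of the parsing step alone, here is a minimal standard-library sketch that pulls JSON-LD blocks out of an HTML page. It assumes the page has already been fetched, and it stops at returning the parsed blocks, since which properties LinkedIn includes in them is not covered here. &lt;code&gt;JsonLdExtractor&lt;/code&gt; and &lt;code&gt;extract_jsonld&lt;/code&gt; are hypothetical names.&lt;/p&gt;

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of every application/ld+json script tag."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks rather than abort the whole page


def extract_jsonld(html: str) -> list:
    """Return every JSON-LD block found in the page, in document order."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

&lt;p&gt;This is the easy part of the job; detection, proxies, and retries account for most of the engineering time.&lt;/p&gt;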

&lt;p&gt;That's 2-4 weeks of engineering time to build, and ongoing maintenance after that. At $0.008 per company, you'd need to scrape over 1 million companies before the build-vs-buy math favors building.&lt;/p&gt;
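
&lt;p&gt;The break-even point can be checked with your own numbers. The 1 million figure corresponds to an in-house cost of about $8,000 (1,000,000 times $0.008); the dollar inputs below are placeholders, and &lt;code&gt;break_even_companies&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```python
from decimal import Decimal, ROUND_CEILING

PRICE_PER_COMPANY = Decimal("0.008")  # USD, from the pricing section


def break_even_companies(build_cost_usd, yearly_maintenance_usd=0, years=1):
    """Number of companies at which pay-per-event spend equals in-house cost."""
    in_house = Decimal(build_cost_usd) + Decimal(yearly_maintenance_usd) * years
    volume = (in_house / PRICE_PER_COMPANY).to_integral_value(rounding=ROUND_CEILING)
    return int(volume)
```

&lt;p&gt;Adding a maintenance estimate pushes the break-even volume higher, because in-house costs keep accruing after the initial build.&lt;/p&gt;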

&lt;p&gt;For most teams, the answer is clear.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Actor:&lt;/strong&gt; &lt;a href="https://apify.com/cryptosignals/linkedin-company-scraper" rel="noopener noreferrer"&gt;apify.com/cryptosignals/linkedin-company-scraper&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;By:&lt;/strong&gt; Web Data Labs — data infrastructure for B2B teams.&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>datascience</category>
      <category>b2b</category>
      <category>api</category>
    </item>
    <item>
      <title>Crunchbase API in 2026: Free Tier Gone — What Startup Data Hunters Do Now</title>
      <dc:creator>agenthustler</dc:creator>
      <pubDate>Sat, 02 May 2026 08:00:16 +0000</pubDate>
      <link>https://dev.to/agenthustler/crunchbase-api-in-2026-free-tier-gone-what-startup-data-hunters-do-now-1177</link>
      <guid>https://dev.to/agenthustler/crunchbase-api-in-2026-free-tier-gone-what-startup-data-hunters-do-now-1177</guid>
      <description>&lt;h2&gt;
  
  
  Crunchbase API: The Free Tier Is Dead
&lt;/h2&gt;

&lt;p&gt;If you’re a developer who used Crunchbase’s free API tier for startup research, funding data, or market analysis — it’s gone. As of 2025, Crunchbase &lt;strong&gt;eliminated free API access entirely&lt;/strong&gt;. The cheapest plan now starts at &lt;strong&gt;$49/month&lt;/strong&gt; (Basic), with the full-featured API requiring the &lt;strong&gt;Pro plan at $99/month&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For indie developers, researchers, and early-stage startups who need startup ecosystem data, this pricing change fundamentally changes the equation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Crunchbase API Pricing in 2026
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Plan&lt;/th&gt;
&lt;th&gt;Price&lt;/th&gt;
&lt;th&gt;API Access&lt;/th&gt;
&lt;th&gt;Rate Limit&lt;/th&gt;
&lt;th&gt;Data Available&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Free&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;del&gt;$0&lt;/del&gt; Removed&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Basic&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$49/mo&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;200 calls/min&lt;/td&gt;
&lt;td&gt;Basic company data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pro&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$99/mo&lt;/td&gt;
&lt;td&gt;Full&lt;/td&gt;
&lt;td&gt;200 calls/min&lt;/td&gt;
&lt;td&gt;Full dataset + exports&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Enterprise&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Custom&lt;/td&gt;
&lt;td&gt;Full&lt;/td&gt;
&lt;td&gt;Custom&lt;/td&gt;
&lt;td&gt;Everything + support&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  What $49/Month Gets You
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;CB_API_KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your_api_key_here&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;search_companies&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.crunchbase.com/api/v4/searches/organizations&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;X-cb-user-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;CB_API_KEY&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;field_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;identifier&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;short_description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;funding_total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;num_funding_rounds&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;founded_on&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;predicate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;field_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;identifier&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;operator_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;contains&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;values&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;]}],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;limit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

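    &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# surface auth and rate-limit errors instead of parsing an error body
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;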

&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;search_companies&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ai agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;entity&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;entities&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]):&lt;/span&gt;
    &lt;span class="n"&gt;props&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;entity&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;funding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;props&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;funding_total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value_usd&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;props&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;identifier&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; — $&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;funding&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; raised&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The data quality is excellent. But $588-$1,188/year is a hard sell for individual developers or side projects.&lt;/p&gt;

&lt;h2&gt;
  
  
  What You Lose Without Crunchbase API
&lt;/h2&gt;

&lt;p&gt;Crunchbase’s dataset is uniquely valuable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Funding rounds&lt;/strong&gt; — who invested, how much, what stage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Company profiles&lt;/strong&gt; — founding date, team size, location, categories&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Acquisition data&lt;/strong&gt; — who bought whom and for how much&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;People data&lt;/strong&gt; — founders, executives, board members&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Market maps&lt;/strong&gt; — companies by category and geography&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No other single source combines all of this. But paying $49+/month for a data project that may or may not produce value? That’s a tough startup cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Web Scraping Alternative
&lt;/h2&gt;

&lt;p&gt;Crunchbase’s company profiles are publicly accessible on the web. The data displayed on their website is the same data behind the API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bs4&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_crunchbase_company&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;company_slug&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://www.crunchbase.com/organization/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;company_slug&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User-Agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BeautifulSoup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;html.parser&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Note: Crunchbase heavily uses JavaScript rendering
&lt;/span&gt;    &lt;span class="c1"&gt;# Basic requests won’t get much — you need browser automation
&lt;/span&gt;    &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select_one&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;h1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The challenge&lt;/strong&gt;: Crunchbase is a JavaScript-rendered single-page application. Most data loads dynamically after the initial page load, which means simple HTTP requests return little usable content. You need:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Headless browser&lt;/strong&gt; — Playwright or Puppeteer to render JavaScript&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Proxy rotation&lt;/strong&gt; — Crunchbase blocks datacenter IPs quickly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anti-detection&lt;/strong&gt; — fingerprint management, human-like behavior&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rate management&lt;/strong&gt; — respectful pacing to avoid blocks&lt;/li&gt;
&lt;/ol&gt;
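
&lt;p&gt;Of the four requirements above, rate management is the simplest to illustrate. The sketch below spaces out requests and backs off exponentially with jitter after a failure. It is one possible approach, not a recipe tuned to Crunchbase: the class name &lt;code&gt;PacedFetcher&lt;/code&gt;, the default delays, and the &lt;code&gt;fetch&lt;/code&gt; callable (a stand-in for your browser automation) are all assumptions.&lt;/p&gt;

```python
import random
import time


class PacedFetcher:
    """Space out requests and back off exponentially after failures."""

    def __init__(self, fetch, min_delay=2.0, max_retries=3, sleep=time.sleep):
        self.fetch = fetch            # callable(url) that returns content or raises
        self.min_delay = min_delay    # base pause before each request, in seconds
        self.max_retries = max_retries
        self.sleep = sleep            # injectable so tests need not really wait

    def get(self, url):
        for attempt in range(self.max_retries + 1):
            # Pause doubles with each failed attempt, plus up to 50% random slack.
            delay = self.min_delay * (2 ** attempt)
            self.sleep(delay + random.uniform(0, delay / 2))
            try:
                return self.fetch(url)
            except Exception:
                if attempt == self.max_retries:
                    raise  # retries exhausted: let the caller see the last error
```

&lt;p&gt;The random slack keeps requests from landing on a fixed interval, and the injectable &lt;code&gt;sleep&lt;/code&gt; makes the pacing logic testable without real delays.&lt;/p&gt;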

&lt;h2&gt;
  
  
  API vs Scraping: Side-by-Side
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Crunchbase API ($49+/mo)&lt;/th&gt;
&lt;th&gt;Web Scraping&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$49-99/month&lt;/td&gt;
&lt;td&gt;Infrastructure only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Access barrier&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Credit card required&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data format&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Clean JSON&lt;/td&gt;
&lt;td&gt;Requires parsing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Company profiles&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Full&lt;/td&gt;
&lt;td&gt;Public data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Funding data&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Detailed&lt;/td&gt;
&lt;td&gt;As displayed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;People/team data&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Public profiles&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Historical data&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Full history&lt;/td&gt;
&lt;td&gt;Limited to current&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Bulk export&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pro plan only ($99/mo)&lt;/td&gt;
&lt;td&gt;Unlimited&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Rate limits&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;200 calls/min&lt;/td&gt;
&lt;td&gt;Self-managed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Setup complexity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Low (API keys)&lt;/td&gt;
&lt;td&gt;High (browser automation)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Scaling Crunchbase Data Collection
&lt;/h2&gt;

&lt;p&gt;For production use, building Crunchbase scraping infrastructure from scratch is complex. The SPA rendering, anti-bot measures, and data structure changes require ongoing maintenance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://apify.com/cryptosignals/crunchbase-scraper" rel="noopener noreferrer"&gt;Managed scraping tools like this Crunchbase scraper on Apify&lt;/a&gt; handle the browser automation and proxy management:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;apify_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ApifyClient&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ApifyClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_API_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;actor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cryptosignals/crunchbase-scraper&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;run_input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;searchQuery&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;artificial intelligence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;location&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;San Francisco&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fundingStage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Series A&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maxResults&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;company&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;defaultDatasetId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;iterate_items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;funding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;company&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;totalFunding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;company&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; — $&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;funding&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; — &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;company&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Alternative Data Sources for Startup Research
&lt;/h2&gt;

&lt;p&gt;If neither the API nor scraping fits your needs, consider these alternatives:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;th&gt;Strengths&lt;/th&gt;
&lt;th&gt;Weaknesses&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PitchBook&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Enterprise ($$$)&lt;/td&gt;
&lt;td&gt;Most comprehensive&lt;/td&gt;
&lt;td&gt;Expensive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dealroom&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Free tier available&lt;/td&gt;
&lt;td&gt;EU/startup focus&lt;/td&gt;
&lt;td&gt;Limited US data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenVC&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;VC-focused&lt;/td&gt;
&lt;td&gt;Smaller dataset&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tracxn&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Free tier available&lt;/td&gt;
&lt;td&gt;Good coverage&lt;/td&gt;
&lt;td&gt;Limited free access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LinkedIn&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Free (limited)&lt;/td&gt;
&lt;td&gt;People data strong&lt;/td&gt;
&lt;td&gt;No funding data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AngelList/Wellfound&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;Startup jobs + data&lt;/td&gt;
&lt;td&gt;Limited API&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The Cost Comparison
&lt;/h2&gt;

&lt;p&gt;Let’s do the math for a typical startup research project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Crunchbase API (1 year):
  Basic: $49 x 12 = $588/year
  Pro:   $99 x 12 = $1,188/year

Web scraping (Apify, typical usage):
  Pay-per-result: ~$5-20/month for moderate use
  Annual: $60-240/year

Savings: ~59-90% vs. Basic, ~80-95% vs. Pro, depending on usage
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
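
&lt;p&gt;The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch: the prices are the figures quoted in this post, and the function names (&lt;code&gt;annual_cost&lt;/code&gt;, &lt;code&gt;savings_pct&lt;/code&gt;, &lt;code&gt;compare&lt;/code&gt;) are illustrative only. Swap in your own plan price and expected scraping spend.&lt;/p&gt;

```python
# Prices are the figures quoted in this post; adjust them to your own plan and usage.
API_PLANS_PER_MONTH = {"Basic": 49, "Pro": 99}
SCRAPING_PER_MONTH = (5, 20)  # low and high estimate for moderate use


def annual_cost(monthly):
    """Convert a monthly price to a yearly total."""
    return monthly * 12


def savings_pct(api_annual, scraping_annual):
    """Percentage saved by scraping relative to the API plan, rounded."""
    return round(100 * (api_annual - scraping_annual) / api_annual)


def compare():
    """Return {plan: (savings at high scraping cost, savings at low scraping cost)}."""
    low, high = (annual_cost(m) for m in SCRAPING_PER_MONTH)
    return {
        plan: (savings_pct(annual_cost(price), high), savings_pct(annual_cost(price), low))
        for plan, price in API_PLANS_PER_MONTH.items()
    }


if __name__ == "__main__":
    for plan, (worst, best) in compare().items():
        print(f"{plan}: {worst}-{best}% cheaper with scraping")
```

&lt;p&gt;The savings shrink quickly if your scraping volume grows, so it is worth rerunning the numbers with your real monthly spend.&lt;/p&gt;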



&lt;p&gt;The API wins on convenience and data structure. Scraping wins on cost and flexibility.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use What
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use the Crunchbase API if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You have budget and need clean, reliable data&lt;/li&gt;
&lt;li&gt;You’re building a product where Crunchbase data is core&lt;/li&gt;
&lt;li&gt;You need historical funding data going back years&lt;/li&gt;
&lt;li&gt;You want zero maintenance overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use web scraping if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You’re exploring and don’t want to commit $49+/month&lt;/li&gt;
&lt;li&gt;You need bulk data beyond API rate limits&lt;/li&gt;
&lt;li&gt;You’re combining data from multiple sources&lt;/li&gt;
&lt;li&gt;You need flexibility the API doesn’t offer&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Crunchbase’s decision to remove the free tier makes business sense — their data is genuinely valuable and they’re entitled to charge for it. But it also means the barrier to entry for startup data access has gone from $0 to $588/year overnight.&lt;/p&gt;

&lt;p&gt;For developers and researchers who need startup ecosystem data without the subscription commitment, web scraping provides a &lt;a href="https://apify.com/cryptosignals/crunchbase-scraper" rel="noopener noreferrer"&gt;cost-effective alternative&lt;/a&gt;. The key is choosing the right approach for your specific use case and budget.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;How do you source startup and funding data? Found a good Crunchbase alternative? Let me know in the comments.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>api</category>
      <category>python</category>
      <category>javascript</category>
    </item>
    <item>
      <title>Amazon Product API (PA-API) in 2026: Restrictions, Alternatives, and Web Scraping</title>
      <dc:creator>agenthustler</dc:creator>
      <pubDate>Fri, 01 May 2026 08:00:18 +0000</pubDate>
      <link>https://dev.to/agenthustler/amazon-product-api-pa-api-in-2026-restrictions-alternatives-and-web-scraping-4l35</link>
      <guid>https://dev.to/agenthustler/amazon-product-api-pa-api-in-2026-restrictions-alternatives-and-web-scraping-4l35</guid>
      <description>&lt;h2&gt;
  
  
  Amazon’s Product Advertising API: The Access Problem
&lt;/h2&gt;

&lt;p&gt;Amazon’s Product Advertising API (PA-API 5.0) is powerful — when you can use it. The catch? &lt;strong&gt;You need an active Amazon Associates account with at least 3 qualifying sales in the past 30 days&lt;/strong&gt; just to maintain access.&lt;/p&gt;

&lt;p&gt;For new developers, researchers, and startups building price comparison tools or product databases, this creates a chicken-and-egg problem: you need the API to build your product, but you need sales (from a product you haven’t built yet) to keep API access.&lt;/p&gt;

&lt;p&gt;Let’s examine the current state of PA-API, its restrictions, and why web scraping has become the go-to alternative.&lt;/p&gt;

&lt;h2&gt;
  
  
  PA-API 5.0 Requirements in 2026
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Requirement&lt;/th&gt;
&lt;th&gt;Details&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Associates account&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Must be approved for your country&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Sales requirement&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3 qualifying sales in 30 days&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Rate limit&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1 req/sec (scales with revenue)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Initial quota&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;8,640 requests/day&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data available&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Product info, pricing, reviews (limited)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Geographic restriction&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Separate endpoint and Associates registration per marketplace (US, UK, DE, etc.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Revenue share&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1-10% depending on category&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  The Access Cliff
&lt;/h3&gt;

&lt;p&gt;Here’s the brutal part: if your Associates account goes 30 days without 3 sales, &lt;strong&gt;Amazon revokes your API keys&lt;/strong&gt;. You have to reapply.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://webservices.amazon.com/paapi5/getitems&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ItemIds&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;B09V3KXJPB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# After 30 days without 3 sales:
# {
#   "Errors": [{
#     "Code": "TooManyRequests",
#     "Message": "Your access has been revoked."
#   }]
# }
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
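
&lt;p&gt;If you do have access, it helps to detect this failure mode explicitly rather than treating it as a generic error. The sketch below assumes the response body has the &lt;code&gt;Errors&lt;/code&gt; / &lt;code&gt;Code&lt;/code&gt; shape shown above; &lt;code&gt;error_codes&lt;/code&gt; and &lt;code&gt;access_lost&lt;/code&gt; are hypothetical helper names, not part of any Amazon SDK.&lt;/p&gt;

```python
import json


def error_codes(body):
    """Return the list of error codes found in a JSON response body, or [] if none."""
    try:
        payload = json.loads(body)
    except (TypeError, ValueError):
        # Not JSON at all (or None): nothing to report.
        return []
    if not isinstance(payload, dict):
        return []
    return [err.get("Code") for err in payload.get("Errors", []) if isinstance(err, dict)]


def access_lost(body):
    """True when the response carries the error code shown in the example above."""
    return "TooManyRequests" in error_codes(body)


if __name__ == "__main__":
    sample = '{"Errors": [{"Code": "TooManyRequests", "Message": "Your access has been revoked."}]}'
    print(access_lost(sample))
```

&lt;p&gt;The same code can also signal ordinary throttling, so treat a positive result as a prompt to check your Associates dashboard rather than as proof of revocation.&lt;/p&gt;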



&lt;h2&gt;
  
  
  Who Gets Blocked by PA-API Requirements?
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;New developers&lt;/strong&gt; building their first e-commerce tool&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Researchers&lt;/strong&gt; analyzing product trends or pricing history&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Startups&lt;/strong&gt; building comparison shopping engines&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;International developers&lt;/strong&gt; in countries without Associates programs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data analysts&lt;/strong&gt; who need bulk product data for market research&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you’re in any of these categories, the API isn’t a realistic starting point.&lt;/p&gt;

&lt;h2&gt;
  
  
  What PA-API Actually Provides (When You Have Access)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;paapi5_python_sdk.api.default_api&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DefaultApi&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;paapi5_python_sdk.models.get_items_request&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GetItemsRequest&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;paapi5_python_sdk.models.partner_type&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PartnerType&lt;/span&gt;

&lt;span class="n"&gt;api&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DefaultApi&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;access_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_ACCESS_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;secret_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_SECRET_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;webservices.amazon.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-east-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;request&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GetItemsRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;partner_tag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-tag-20&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;partner_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;PartnerType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ASSOCIATES&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;item_ids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;B09V3KXJPB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;resources&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ItemInfo.Title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Offers.Listings.Price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Images.Primary.Large&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CustomerReviews.StarRating&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_items&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items_result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;item_info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;display_value&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Price: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;offers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;listings&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;display_amount&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The data is clean and structured. But at 1 request per second with sales-gated access, it’s not built for data collection — it’s built for affiliate links.&lt;/p&gt;
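
&lt;p&gt;To see why, it helps to turn the quota into collection time. The sketch below uses the initial daily quota from the requirements table and assumes a batch size of 10 ASINs per &lt;code&gt;GetItems&lt;/code&gt; call; treat that batch size, and the helper names, as assumptions to check against your own account.&lt;/p&gt;

```python
import math

DAILY_QUOTA = 8640      # initial requests/day, from the requirements table
SECONDS_PER_DAY = 86400
ITEMS_PER_REQUEST = 10  # assumption: up to 10 ASINs per GetItems call


def average_spacing_seconds(daily_quota=DAILY_QUOTA):
    """Average gap between requests if the quota is spread over a full day."""
    return SECONDS_PER_DAY / daily_quota


def days_to_collect(num_asins, daily_quota=DAILY_QUOTA, batch=ITEMS_PER_REQUEST):
    """Whole days needed to fetch num_asins once, given the daily quota."""
    requests_needed = math.ceil(num_asins / batch)
    return math.ceil(requests_needed / daily_quota)


if __name__ == "__main__":
    print(average_spacing_seconds())   # seconds between requests on average
    print(days_to_collect(1_000_000))  # days for a one-million-product catalog
```

&lt;p&gt;Under these assumptions the daily cap, not the per-second limit, is the real constraint: it averages out to one request every 10 seconds, and a single pass over a million products takes 12 days.&lt;/p&gt;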

&lt;h2&gt;
  
  
  Web Scraping: The Practical Alternative
&lt;/h2&gt;

&lt;p&gt;Amazon’s product pages are publicly accessible. Scraping extracts the same data without the affiliate sales requirement:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bs4&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;scrape_amazon_product&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;asin&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://www.amazon.com/dp/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;asin&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User-Agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Accept-Language&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en-US,en;q=0.9&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BeautifulSoup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;html.parser&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select_one&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;#productTitle&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select_one&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.a-price .a-offscreen&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;rating&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select_one&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;#acrPopover&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;asin&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;asin&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rating&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;rating&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;rating&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;product&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;scrape_amazon_product&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;B09V3KXJPB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Warning&lt;/strong&gt;: Amazon has some of the most aggressive anti-bot systems on the web. Basic scraping like the above will get blocked within 10-20 requests. For any real use case, you need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Residential proxy rotation&lt;/li&gt;
&lt;li&gt;Browser fingerprint management&lt;/li&gt;
&lt;li&gt;CAPTCHA solving&lt;/li&gt;
&lt;li&gt;Request throttling&lt;/li&gt;
&lt;li&gt;Geographic proxy matching (US proxies for amazon.com)&lt;/li&gt;
&lt;/ul&gt;
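
&lt;p&gt;Of the items above, request throttling is the simplest to sketch. The example below is one possible approach, not a complete solution: it retries with exponential backoff and random jitter. The &lt;code&gt;fetch&lt;/code&gt; argument is a placeholder for your own HTTP client and is assumed to return a &lt;code&gt;(status_code, body)&lt;/code&gt; tuple.&lt;/p&gt;

```python
import random
import time


def fetch_with_backoff(fetch, url, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it returns HTTP 200, backing off between attempts.

    fetch is any callable returning (status_code, body); it stands in for
    your own HTTP client. Returns the body, or None if every attempt failed.
    """
    for attempt in range(max_attempts):
        status, body = fetch(url)
        if status == 200:
            return body
        if attempt != max_attempts - 1:
            # Exponential backoff with jitter: about 1s, 2s, 4s, ... plus up to 1s of noise.
            sleep(base_delay * 2 ** attempt + random.uniform(0, 1))
    return None
```

&lt;p&gt;Passing &lt;code&gt;sleep&lt;/code&gt; as a parameter keeps the function easy to test without real delays. Backoff only slows you down politely; it does not replace proxies or fingerprint management.&lt;/p&gt;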

&lt;h2&gt;
  
  
  API vs Scraping: Head-to-Head
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;PA-API 5.0&lt;/th&gt;
&lt;th&gt;Web Scraping&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Access barrier&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3 sales/30 days&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Free (if qualified)&lt;/td&gt;
&lt;td&gt;Proxy/infrastructure costs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Rate limit&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1 req/sec&lt;/td&gt;
&lt;td&gt;Self-managed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Product data&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Structured JSON&lt;/td&gt;
&lt;td&gt;Requires parsing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Price history&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Current only&lt;/td&gt;
&lt;td&gt;Can build your own&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Review text&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Not available&lt;/td&gt;
&lt;td&gt;Available&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Bulk collection&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Very limited&lt;/td&gt;
&lt;td&gt;Scalable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Setup time&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Days (approval wait)&lt;/td&gt;
&lt;td&gt;Hours&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Reliability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High (when accessible)&lt;/td&gt;
&lt;td&gt;Requires maintenance&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
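
&lt;p&gt;The "can build your own" entry for price history deserves a concrete illustration. A minimal sketch, assuming you already have a scraped price per ASIN: append each observation to a small SQLite table and query it later. The table and function names here are hypothetical.&lt;/p&gt;

```python
import sqlite3
from datetime import datetime, timezone


def open_store(path=":memory:"):
    """Create (or open) a tiny price-history table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS price_history ("
        "asin TEXT NOT NULL, price REAL NOT NULL, seen_at TEXT NOT NULL)"
    )
    return conn


def record_price(conn, asin, price, seen_at=None):
    """Append one observation; seen_at defaults to the current UTC time."""
    stamp = seen_at or datetime.now(timezone.utc).isoformat()
    conn.execute("INSERT INTO price_history VALUES (?, ?, ?)", (asin, price, stamp))
    conn.commit()


def history(conn, asin):
    """Return [(seen_at, price), ...] for one ASIN, oldest first."""
    rows = conn.execute(
        "SELECT seen_at, price FROM price_history WHERE asin = ? ORDER BY seen_at",
        (asin,),
    )
    return rows.fetchall()
```

&lt;p&gt;Run the scraper on a schedule, call &lt;code&gt;record_price&lt;/code&gt; for each product, and the history accumulates from the day you start. Pass a file path to &lt;code&gt;open_store&lt;/code&gt; to persist it between runs.&lt;/p&gt;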

&lt;h2&gt;
  
  
  Scaling Amazon Data Collection
&lt;/h2&gt;

&lt;p&gt;Building and maintaining Amazon scraping infrastructure is a significant engineering challenge. Anti-bot systems update frequently, and what works today may break tomorrow.&lt;/p&gt;

&lt;p&gt;Managed scraping tools abstract this away. &lt;a href="https://apify.com/cryptosignals/amazon-scraper" rel="noopener noreferrer"&gt;This Amazon scraper on Apify&lt;/a&gt; handles the proxy rotation, CAPTCHA solving, and anti-detection for you:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;apify_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ApifyClient&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ApifyClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_API_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;actor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cryptosignals/amazon-scraper&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;run_input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;searchTerms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wireless headphones&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;marketplace&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amazon.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maxProducts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;includeReviews&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;product&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;defaultDatasetId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;iterate_items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; — $&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; — &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rating&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; stars&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Alternative Data Sources
&lt;/h2&gt;

&lt;p&gt;Beyond scraping Amazon directly, consider:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Keepa API&lt;/strong&gt; — Amazon price history tracking ($20/mo)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rainforest API&lt;/strong&gt; — structured Amazon data (pay-per-request)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CamelCamelCamel&lt;/strong&gt; — price history (no API, scrapeable)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google Shopping API&lt;/strong&gt; — aggregated product data&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each has tradeoffs in cost, coverage, and freshness.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Amazon’s PA-API is designed for affiliate marketers who are already making sales, not for developers who need product data. The 3-sales-in-30-days requirement isn’t a bug — it’s Amazon’s way of ensuring the API serves their affiliate program’s interests.&lt;/p&gt;

&lt;p&gt;For developers who need Amazon product data without the affiliate prerequisite, web scraping is the practical path. The tooling has matured — &lt;a href="https://apify.com/cryptosignals/amazon-scraper" rel="noopener noreferrer"&gt;managed scraping platforms&lt;/a&gt; handle the infrastructure complexity that used to require dedicated engineering teams.&lt;/p&gt;

&lt;p&gt;Start with the API if you qualify. Fall back to scraping if you don’t. Either way, build your application — don’t let access gates stop you.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Struggled with PA-API access requirements? Found a better alternative? Share your experience below.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>api</category>
      <category>python</category>
      <category>javascript</category>
    </item>
    <item>
      <title>GitHub API Rate Limits in 2026: When Web Scraping Is the Better Choice</title>
      <dc:creator>agenthustler</dc:creator>
      <pubDate>Thu, 30 Apr 2026 08:00:22 +0000</pubDate>
      <link>https://dev.to/agenthustler/github-api-rate-limits-in-2026-when-web-scraping-is-the-better-choice-hdo</link>
      <guid>https://dev.to/agenthustler/github-api-rate-limits-in-2026-when-web-scraping-is-the-better-choice-hdo</guid>
      <description>&lt;h2&gt;
  
  
  GitHub API Rate Limits: The Numbers That Block Your Project
&lt;/h2&gt;

&lt;p&gt;GitHub’s REST API is one of the most generous public APIs out there — until it isn’t. At &lt;strong&gt;5,000 requests per hour&lt;/strong&gt; (authenticated) or a mere &lt;strong&gt;60 requests per hour&lt;/strong&gt; (unauthenticated), developers routinely hit walls when building anything beyond basic integrations.&lt;/p&gt;

&lt;p&gt;If you’re doing repository analysis, tracking open-source trends, monitoring competitor activity, or aggregating data across thousands of repos — you’ll burn through that quota in minutes.&lt;/p&gt;

&lt;p&gt;Let’s look at when the API is sufficient, when it’s not, and when web scraping becomes the pragmatic alternative.&lt;/p&gt;

&lt;h2&gt;
  
  
  GitHub API Rate Limits Explained (2026)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;Rate Limit&lt;/th&gt;
&lt;th&gt;Auth Required&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Unauthenticated&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;60 req/hr&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Quick lookups&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Personal Access Token&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;5,000 req/hr&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Standard dev work&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GitHub App&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;5,000 req/hr, plus 50 per repo or user beyond 20 (max 12,500)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Org integrations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Enterprise&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;15,000 req/hr&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Large-scale use&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Sounds generous until you do the math:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# How fast can you exhaust 5,000 requests?
&lt;/span&gt;
&lt;span class="c1"&gt;# Scenario: Analyze top 1,000 Python repos
&lt;/span&gt;&lt;span class="n"&gt;requests_per_repo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;  &lt;span class="c1"&gt;# repo info + contributors + languages + commits + issues
&lt;/span&gt;&lt;span class="n"&gt;total_requests&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;  &lt;span class="c1"&gt;# = 5,000
# Result: One scan = entire hourly quota
&lt;/span&gt;
&lt;span class="c1"&gt;# Scenario: Monitor 200 repos for new releases
&lt;/span&gt;&lt;span class="n"&gt;checks_per_cycle&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;  &lt;span class="c1"&gt;# = 200 requests per polling cycle
&lt;/span&gt;&lt;span class="n"&gt;cycles_per_hour&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;  &lt;span class="c1"&gt;# = 25 cycles/hr (one every 2.4 min)
# Seems OK, but add commit history and you’re cooked
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What the API Gives You (and What It Doesn’t)
&lt;/h2&gt;

&lt;p&gt;GitHub’s API is excellent for structured data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Repository metadata, stars, forks&lt;/li&gt;
&lt;li&gt;Issues and pull requests&lt;/li&gt;
&lt;li&gt;Commit history (paginated)&lt;/li&gt;
&lt;li&gt;User profiles and contributions&lt;/li&gt;
&lt;li&gt;Release and tag information&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But several things are &lt;strong&gt;not available or practical&lt;/strong&gt; through the API:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Trending repositories&lt;/strong&gt; — no API endpoint for GitHub Trending&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Search ranking factors&lt;/strong&gt; — can’t see why repos rank where they do&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contribution graphs at scale&lt;/strong&gt; — rate-limited per-user fetch&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Topic/tag aggregations&lt;/strong&gt; — limited search API (30 req/min)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bulk profile data&lt;/strong&gt; — fetching 10K developer profiles = 2+ hours&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Real-World Rate Limit Pain Points
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ghp_your_token_here&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;token &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_rate_limit&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.github.com/rate_limit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;remaining&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resources&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;core&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;remaining&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;reset_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resources&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;core&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reset&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;remaining&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reset_time&lt;/span&gt;

&lt;span class="n"&gt;remaining&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;check_rate_limit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Remaining: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;remaining&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/5000&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Reset in: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;reset&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; seconds&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# The dreaded 403
# {
#   "message": "API rate limit exceeded for user ID 12345.",
#   "documentation_url": "https://docs.github.com/rest/overview/rate-limits-for-the-rest-api"
# }
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you hit that 403, your options are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Wait&lt;/strong&gt; — up to 60 minutes for reset&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use GraphQL&lt;/strong&gt; — separate 5,000-point budget, but complex queries cost more points&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multiple tokens&lt;/strong&gt;: tokens from the same account share one quota, and pooling tokens across accounts is technically against the ToS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Web scraping&lt;/strong&gt; — for data the API limits or doesn’t expose&lt;/li&gt;
&lt;/ol&gt;
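&lt;p&gt;For option 1, the wait does not have to be a blind 60-minute sleep. A minimal sketch, assuming you pass in the headers of the failed response: GitHub reports the reset time in &lt;code&gt;X-RateLimit-Reset&lt;/code&gt; (Unix seconds) and, for secondary limits, a &lt;code&gt;Retry-After&lt;/code&gt; delay in seconds. The helper name &lt;code&gt;seconds_until_reset&lt;/code&gt; is hypothetical, and the plain-dict lookups assume these exact header spellings (the headers object from &lt;code&gt;requests&lt;/code&gt; is case-insensitive, so it can be passed in as-is).&lt;/p&gt;

```python
import time


def seconds_until_reset(headers, now=None):
    """Return how many seconds to wait before retrying a rate-limited call.

    `headers` is any mapping of response headers; `now` is injectable for testing.
    """
    now = time.time() if now is None else now

    # Secondary rate limits send Retry-After (seconds); prefer it when present.
    retry_after = headers.get("Retry-After")
    if retry_after is not None:
        return max(0.0, float(retry_after))

    # If quota remains (or the header is missing), there is nothing to wait for.
    remaining = int(headers.get("X-RateLimit-Remaining", "1"))
    if remaining != 0:
        return 0.0

    # X-RateLimit-Reset is the Unix timestamp at which the quota refills.
    reset_at = float(headers.get("X-RateLimit-Reset", now))
    return max(0.0, reset_at - now)
```

&lt;p&gt;A caller would typically sleep for the returned number of seconds (plus a small margin) and then retry the same request.&lt;/p&gt;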

&lt;h2&gt;
  
  
  When Web Scraping Makes More Sense
&lt;/h2&gt;

&lt;p&gt;Web scraping GitHub works best for:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Trending Repositories
&lt;/h3&gt;

&lt;p&gt;GitHub’s trending page has no API. Period.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bs4&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_trending&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;since&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;daily&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://github.com/trending/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;?since=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;since&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
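    &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# fail loudly on 429/5xx instead of parsing an error page&lt;/span&gt;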
    &lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BeautifulSoup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;html.parser&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;repos&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;article.Box-row&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select_one&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;h2 a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;  &lt;span class="c1"&gt;# drop the whitespace and newlines around "owner / repo"&lt;/span&gt;
        &lt;span class="n"&gt;description&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select_one&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;p&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;stars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select_one&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.Link--muted.d-inline-block.mr-3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;repos&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stars_today&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;stars&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;stars&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;repos&lt;/span&gt;

&lt;span class="n"&gt;trending&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_trending&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weekly&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;repo&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;trending&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;repo&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; — &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;repo&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;stars_today&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Bulk Data Collection Beyond the API Quota
&lt;/h3&gt;

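&lt;p&gt;Scraping isn’t bound by the API’s 5,000/hour quota, but it isn’t unlimited either: GitHub throttles aggressive web traffic (typically with HTTP 429 responses), so throughput depends on request pacing and proxy infrastructure.&lt;/p&gt;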
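&lt;p&gt;Request pacing can be as simple as enforcing a minimum gap between page fetches. This is a rough sketch rather than a production throttle; the &lt;code&gt;Pacer&lt;/code&gt; class and the 2-second interval are assumptions chosen for illustration, not a limit published by GitHub:&lt;/p&gt;

```python
import time


class Pacer:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the class can be tested without real sleeps
        self._sleep = sleep
        self._last = None      # time at which the previous request was released

    def wait(self):
        """Block until min_interval seconds have passed since the last call.

        Returns the number of seconds actually slept.
        """
        now = self._clock()
        waited = 0.0
        if self._last is not None:
            waited = max(0.0, self.min_interval - (now - self._last))
            if waited > 0:
                self._sleep(waited)
        self._last = now + waited
        return waited


# Usage: call pacer.wait() immediately before each page fetch.
pacer = Pacer(min_interval=2.0)
```

&lt;p&gt;Keeping a single &lt;code&gt;Pacer&lt;/code&gt; per target host means every code path that fetches a page shares the same budget.&lt;/p&gt;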

&lt;h3&gt;
  
  
  3. Data the API Doesn’t Expose
&lt;/h3&gt;

&lt;ul&gt;
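&lt;li&gt;Repository traffic insights (restricted to users with write access in both the API and the web UI, so scraping only helps for repos you can already access)&lt;/li&gt;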
&lt;li&gt;Dependency graphs in full&lt;/li&gt;
&lt;li&gt;Community health metrics across many repos&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Scaling GitHub Scraping
&lt;/h2&gt;

&lt;p&gt;For anything beyond basic scraping, you need to handle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub’s bot detection&lt;/li&gt;
&lt;li&gt;JavaScript-rendered content (some pages use React)&lt;/li&gt;
&lt;li&gt;Session management&lt;/li&gt;
&lt;li&gt;Respectful rate limiting (don’t hammer their servers)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Managed scraping tools handle this. &lt;a href="https://apify.com/cryptosignals/github-scraper" rel="noopener noreferrer"&gt;This GitHub scraper on Apify&lt;/a&gt; manages proxy rotation and rendering for bulk data extraction:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;apify_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ApifyClient&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ApifyClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_API_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;actor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cryptosignals/github-scraper&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;run_input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;searchQuery&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;machine learning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;language&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maxRepos&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;includeReadme&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;repo&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;defaultDatasetId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;iterate_items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;repo&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fullName&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;repo&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;stars&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; stars&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  API vs Scraping: Decision Matrix
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;th&gt;Best Approach&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Single repo data&lt;/td&gt;
&lt;td&gt;API&lt;/td&gt;
&lt;td&gt;Fast, structured, within limits&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CI/CD integration&lt;/td&gt;
&lt;td&gt;API&lt;/td&gt;
&lt;td&gt;Real-time webhooks available&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Trending repos&lt;/td&gt;
&lt;td&gt;Scraping&lt;/td&gt;
&lt;td&gt;No API endpoint exists&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1000+ repo analysis&lt;/td&gt;
&lt;td&gt;Scraping&lt;/td&gt;
&lt;td&gt;API quota exhausted in minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;User profile aggregation&lt;/td&gt;
&lt;td&gt;Scraping&lt;/td&gt;
&lt;td&gt;Bulk fetching is rate-limited&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Commit monitoring (few repos)&lt;/td&gt;
&lt;td&gt;API&lt;/td&gt;
&lt;td&gt;Efficient with conditional requests&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-platform comparison&lt;/td&gt;
&lt;td&gt;Scraping&lt;/td&gt;
&lt;td&gt;Need to combine multiple sources&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
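&lt;p&gt;The “conditional requests” entry for commit monitoring deserves a sketch. GitHub returns an &lt;code&gt;ETag&lt;/code&gt; header with most REST responses; sending it back as &lt;code&gt;If-None-Match&lt;/code&gt; yields a &lt;code&gt;304 Not Modified&lt;/code&gt; when nothing has changed, and authenticated 304 responses do not count against the primary rate limit. The &lt;code&gt;ETagCache&lt;/code&gt; class and its injected &lt;code&gt;fetch&lt;/code&gt; callable below are illustrative assumptions, not part of any library:&lt;/p&gt;

```python
class ETagCache:
    """Cache response bodies by URL and revalidate them with If-None-Match."""

    def __init__(self, fetch):
        # fetch(url, headers) must return a (status, response_headers, body) tuple.
        self._fetch = fetch
        self._store = {}  # url -> (etag, body)

    def get(self, url):
        cached = self._store.get(url)
        # Only send the validator when we have a cached copy to fall back on.
        headers = {"If-None-Match": cached[0]} if cached else {}
        status, resp_headers, body = self._fetch(url, headers)

        # 304 means the cached body is still current; reuse it.
        if status == 304 and cached:
            return cached[1]

        # Remember fresh responses that carry an ETag for the next poll.
        etag = resp_headers.get("ETag")
        if status == 200 and etag:
            self._store[url] = (etag, body)
        return body
```

&lt;p&gt;In practice &lt;code&gt;fetch&lt;/code&gt; would wrap an HTTP client call such as &lt;code&gt;requests.get(url, headers=...)&lt;/code&gt; and return its status code, headers, and parsed JSON.&lt;/p&gt;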

&lt;h2&gt;
  
  
  Hybrid Approach: Best of Both
&lt;/h2&gt;

&lt;p&gt;The smartest strategy combines both:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_repo_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;repo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Use API for structured data within limits
&lt;/span&gt;    &lt;span class="n"&gt;api_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch_from_api&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;repo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Fall back to scraping when the API call was rate-limited
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;api_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rate_limited&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;fetch_from_scraper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;repo&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Enrich with scraped data
&lt;/span&gt;    &lt;span class="n"&gt;api_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;trending_rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_trending_rank&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;repo&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;api_data&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;GitHub’s API is excellent for standard integrations and moderate-scale use. But for data analysis, market research, trend tracking, and bulk operations, the rate limits become a genuine blocker.&lt;/p&gt;

&lt;p&gt;Web scraping isn’t a replacement for the API — it’s a &lt;strong&gt;complement&lt;/strong&gt; for the cases where 5,000 requests per hour simply isn’t enough, or where the data you need doesn’t have an API endpoint at all.&lt;/p&gt;
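
&lt;p&gt;To put that ceiling in perspective, here is a rough back-of-the-envelope sketch. The workload size and the one-request-per-repository assumption are hypothetical, chosen only for illustration:&lt;/p&gt;

```python
import math

# 5,000 requests per hour is the limit quoted above.
REQUESTS_PER_HOUR = 5000


def hours_to_collect(total_items, requests_per_item=1):
    """Return the whole hours needed to fetch total_items under the hourly limit."""
    total_requests = total_items * requests_per_item
    return math.ceil(total_requests / REQUESTS_PER_HOUR)


# Hypothetical workload: 1,000,000 repositories, one request each.
print(hours_to_collect(1_000_000))  # 200
```

&lt;p&gt;Under those assumptions the job needs about 200 hours, which is more than eight days of continuous requests.&lt;/p&gt;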

&lt;p&gt;For production-grade GitHub data collection at scale, &lt;a href="https://apify.com/cryptosignals/github-scraper" rel="noopener noreferrer"&gt;managed scraping solutions&lt;/a&gt; save weeks of infrastructure work.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Hit GitHub rate limits on a project? What workaround did you use? Share in the comments.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>api</category>
      <category>python</category>
      <category>javascript</category>
    </item>
    <item>
      <title>Glassdoor API in 2026: Why Developers Are Switching to Web Scraping</title>
      <dc:creator>agenthustler</dc:creator>
      <pubDate>Wed, 29 Apr 2026 08:00:14 +0000</pubDate>
      <link>https://dev.to/agenthustler/glassdoor-api-in-2026-why-developers-are-switching-to-web-scraping-na0</link>
      <guid>https://dev.to/agenthustler/glassdoor-api-in-2026-why-developers-are-switching-to-web-scraping-na0</guid>
      <description>&lt;h2&gt;
  
  
  Glassdoor API in 2026: The Landscape Has Changed
&lt;/h2&gt;

&lt;p&gt;If you’ve tried accessing Glassdoor’s API recently, you already know: &lt;strong&gt;the public API is gone&lt;/strong&gt;. Glassdoor shut down open developer access and now only offers data through enterprise partnerships with undisclosed pricing.&lt;/p&gt;

&lt;p&gt;This leaves thousands of developers, recruiters, and data analysts in the dark. Whether you’re building a salary comparison tool, analyzing company reviews for investment research, or aggregating job market data — the official path is effectively closed.&lt;/p&gt;

&lt;p&gt;Let’s break down what happened, what your options actually are in 2026, and why web scraping has become the practical alternative.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Happened to the Glassdoor API?
&lt;/h2&gt;

&lt;p&gt;Glassdoor originally offered a public API that let developers access:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Company reviews and ratings&lt;/li&gt;
&lt;li&gt;Salary estimates by role and location&lt;/li&gt;
&lt;li&gt;Job listings&lt;/li&gt;
&lt;li&gt;Interview questions and experiences&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In 2024, Glassdoor (now owned by Recruit Holdings alongside Indeed) &lt;strong&gt;restricted API access to enterprise partners only&lt;/strong&gt;. There’s no public documentation, no free tier, no developer signup page.&lt;/p&gt;

&lt;p&gt;The reasoning? Data monetization. Glassdoor’s salary and review data is their core asset, and they’ve decided to gate it behind B2B contracts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Official API vs Web Scraping: Direct Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Glassdoor API (Enterprise)&lt;/th&gt;
&lt;th&gt;Web Scraping&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Access&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Enterprise partnership only&lt;/td&gt;
&lt;td&gt;Open to anyone&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Custom pricing ($$$$)&lt;/td&gt;
&lt;td&gt;Infrastructure costs only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data format&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Structured JSON&lt;/td&gt;
&lt;td&gt;Requires parsing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Rate limits&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Contract-dependent&lt;/td&gt;
&lt;td&gt;Self-managed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Salary data&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Company reviews&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Real-time data&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Near real-time&lt;/td&gt;
&lt;td&gt;On-demand&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Setup time&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Weeks (sales process)&lt;/td&gt;
&lt;td&gt;Hours&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Legal clarity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Licensed&lt;/td&gt;
&lt;td&gt;Gray area (public data)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  When the API Makes Sense (and When It Doesn’t)
&lt;/h2&gt;

&lt;p&gt;The enterprise API still makes sense if you’re a large HR tech company with budget and a direct relationship with Glassdoor. For everyone else — individual developers, startups, researchers — &lt;strong&gt;the API path is a dead end&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Here’s the reality check:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No public signup exists&lt;/li&gt;
&lt;li&gt;No pricing page exists&lt;/li&gt;
&lt;li&gt;No documentation is available&lt;/li&gt;
&lt;li&gt;Response times for partnership inquiries: weeks to months&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Web Scraping Alternative
&lt;/h2&gt;

&lt;p&gt;Web scraping lets you extract the same data Glassdoor displays publicly on their website. Here’s a basic Python example of what the data extraction looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bs4&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_glassdoor_company&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;company_url&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User-Agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

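    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;company_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# fail fast on HTTP errors such as 403 or 429&lt;/span&gt;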
    &lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BeautifulSoup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;html.parser&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Extract rating
&lt;/span&gt;    &lt;span class="n"&gt;rating_elem&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select_one&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[data-test=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rating&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;rating&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rating_elem&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;rating_elem&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;N/A&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="c1"&gt;# Extract review count
&lt;/span&gt;    &lt;span class="n"&gt;review_elem&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select_one&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.reviews-count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;reviews&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;review_elem&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;review_elem&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;N/A&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rating&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;rating&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_reviews&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;reviews&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_glassdoor_company&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://glassdoor.com/Overview/company-overview.htm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



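&lt;p&gt;This can work for occasional, small-scale requests, though the selectors will need updating whenever Glassdoor changes its markup. Beyond that, Glassdoor has &lt;strong&gt;aggressive anti-bot measures&lt;/strong&gt;:&lt;/p&gt;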

&lt;ul&gt;
&lt;li&gt;CAPTCHAs after a few requests&lt;/li&gt;
&lt;li&gt;IP blocking&lt;/li&gt;
&lt;li&gt;JavaScript-rendered content&lt;/li&gt;
&lt;li&gt;Session validation&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Scaling Glassdoor Data Collection
&lt;/h2&gt;

&lt;p&gt;For production use cases, you need infrastructure that handles these challenges. Key requirements:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Rotating proxies&lt;/strong&gt; — residential proxies work best for Glassdoor&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Browser automation&lt;/strong&gt; — much of Glassdoor’s content loads via JavaScript&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CAPTCHA handling&lt;/strong&gt; — automated solving or avoidance strategies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rate management&lt;/strong&gt; — respectful request pacing to avoid blocks&lt;/li&gt;
&lt;/ol&gt;
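
&lt;p&gt;As an illustration of the rate management item, here is a minimal sketch of request pacing with exponential backoff and jitter. The &lt;code&gt;fetch&lt;/code&gt; callable, the function names, and the default delays are all assumptions made for this example, not part of any Glassdoor or Apify API:&lt;/p&gt;

```python
import random
import time


def backoff_delay(attempt, base=2.0, cap=60.0, jitter=0.5):
    """Return seconds to wait before retry number `attempt` (0-based).

    The delay doubles per attempt, is capped at `cap`, and gets up to
    `jitter` * delay of random extra wait so requests do not align.
    """
    delay = min(cap, base * (2 ** attempt))
    return delay + random.uniform(0, jitter * delay)


def paced_fetch(fetch, urls, max_retries=3, sleep=time.sleep):
    """Call fetch(url) for each URL, backing off when it returns None.

    `fetch` is a caller-supplied function (hypothetical here) that returns
    the page body, or None when the request was blocked or throttled.
    URLs that never succeed are left out of the result.
    """
    results = {}
    for url in urls:
        for attempt in range(max_retries + 1):
            body = fetch(url)
            if body is not None:
                results[url] = body
                break
            sleep(backoff_delay(attempt))
        sleep(backoff_delay(0))  # baseline pause between pages
    return results
```

&lt;p&gt;Passing &lt;code&gt;sleep&lt;/code&gt; as a parameter keeps the pacing logic testable without real waiting. In a real run the default &lt;code&gt;time.sleep&lt;/code&gt; applies.&lt;/p&gt;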

&lt;p&gt;Rather than building all of this yourself, managed scraping tools handle the infrastructure. For example, &lt;a href="https://apify.com/cryptosignals/glassdoor-scraper" rel="noopener noreferrer"&gt;this Glassdoor scraper on Apify&lt;/a&gt; handles proxy rotation, browser rendering, and anti-bot bypasses out of the box.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;apify_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ApifyClient&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ApifyClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_API_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;actor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cryptosignals/glassdoor-scraper&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;run_input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;searchTerms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;software engineer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;location&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;San Francisco, CA&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maxResults&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;defaultDatasetId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;iterate_items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# .get() avoids a KeyError if a field is missing from an item&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;company&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; - Rating: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rating&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; - Salary: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;salary&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What Data Can You Actually Get?
&lt;/h2&gt;

&lt;p&gt;Through web scraping, you can extract everything Glassdoor shows publicly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Company ratings&lt;/strong&gt; (overall, culture, work-life balance, compensation)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Individual reviews&lt;/strong&gt; with pros, cons, and advice to management&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Salary reports&lt;/strong&gt; by role, location, and experience level&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interview experiences&lt;/strong&gt; including difficulty ratings and questions asked&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Job listings&lt;/strong&gt; with salary estimates&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;CEO approval ratings&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
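
&lt;p&gt;One way to keep that data consistent downstream is to define a small record type before you start collecting. The sketch below is only an illustration: the class and field names are hypothetical and do not reflect Glassdoor’s markup or any scraper’s actual output schema.&lt;/p&gt;

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Review:
    """A single review; field names are illustrative."""

    pros: str
    cons: str
    advice_to_management: Optional[str] = None


@dataclass
class CompanyRecord:
    """One company profile with optional ratings and a list of reviews."""

    name: str
    overall_rating: Optional[float] = None
    ceo_approval_percent: Optional[int] = None
    reviews: List[Review] = field(default_factory=list)


# Made-up sample values, used only to show the shape of a record.
record = CompanyRecord(
    name="Example Corp",
    overall_rating=4.1,
    reviews=[Review(pros="Good mentoring", cons="Long hours")],
)
print(asdict(record)["reviews"][0]["pros"])  # Good mentoring
```

&lt;p&gt;Optional fields default to &lt;code&gt;None&lt;/code&gt;, so a page that lacks a CEO approval rating or salary section still produces a valid record.&lt;/p&gt;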

&lt;h2&gt;
  
  
  Legal Considerations
&lt;/h2&gt;

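&lt;p&gt;In &lt;em&gt;hiQ Labs v. LinkedIn&lt;/em&gt; (2022), the US Ninth Circuit held that scraping publicly accessible data likely does not violate the Computer Fraud and Abuse Act. That ruling does not settle contract claims such as Terms of Service violations, so some ground rules still apply:&lt;/p&gt;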

&lt;ul&gt;
&lt;li&gt;Respect &lt;code&gt;robots.txt&lt;/code&gt; directives&lt;/li&gt;
&lt;li&gt;Don’t bypass authentication walls&lt;/li&gt;
&lt;li&gt;Don’t overload servers with aggressive request rates&lt;/li&gt;
&lt;li&gt;Check Glassdoor’s Terms of Service and the laws that apply in your jurisdiction&lt;/li&gt;
&lt;li&gt;Use data responsibly — don’t republish raw review content&lt;/li&gt;
&lt;/ul&gt;
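
&lt;p&gt;The &lt;code&gt;robots.txt&lt;/code&gt; check is easy to automate with Python’s standard &lt;code&gt;urllib.robotparser&lt;/code&gt; module. The sketch below parses an inline, hypothetical &lt;code&gt;robots.txt&lt;/code&gt; so it runs offline. The rules, the &lt;code&gt;my-bot&lt;/code&gt; user agent, and the example URLs are assumptions, not Glassdoor’s actual directives:&lt;/p&gt;

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser without any network call."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)
print(parser.can_fetch("my-bot", "https://example.com/Overview/page.htm"))  # True
print(parser.can_fetch("my-bot", "https://example.com/private/data.htm"))   # False
print(parser.crawl_delay("my-bot"))  # 5
```

&lt;p&gt;Against a live site you would call &lt;code&gt;set_url()&lt;/code&gt; and &lt;code&gt;read()&lt;/code&gt; on the parser instead of passing inline text, then check &lt;code&gt;can_fetch()&lt;/code&gt; before each request.&lt;/p&gt;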

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Glassdoor’s decision to gate their API behind enterprise contracts makes business sense for them, but it has left most developers without a practical official option. Web scraping fills that gap for salary research, market analysis, and recruitment data needs.&lt;/p&gt;

&lt;p&gt;If you’re building something that needs Glassdoor data in 2026, your realistic options are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise partnership&lt;/strong&gt; — if you have the budget and patience&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Web scraping&lt;/strong&gt; — for everyone else&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alternative data sources&lt;/strong&gt; — LinkedIn, Levels.fyi, Blind (each with their own limitations)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The tooling for option 2 has matured significantly. What used to take weeks of proxy management and CAPTCHA solving can now be handled by &lt;a href="https://apify.com/cryptosignals/glassdoor-scraper" rel="noopener noreferrer"&gt;managed scraping platforms&lt;/a&gt; in a few lines of code.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What’s your experience with Glassdoor data access? Have you found alternative approaches? Drop a comment below.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>api</category>
      <category>python</category>
      <category>javascript</category>
    </item>
  </channel>
</rss>
