DEV Community: FairPrice

How to Scrape LinkedIn Job Listings in 2026 (No Login Required)

FairPrice — Wed, 25 Mar 2026 15:03:21 +0000

LinkedIn is the largest professional network in the world, and its job listings are a goldmine of data — salaries, hiring trends, company growth signals. The problem? LinkedIn locks most of its data behind authentication and aggressively blocks scrapers.

But there is a loophole. LinkedIn's guest job search endpoint (linkedin.com/jobs-guest/) serves public job listings without requiring login. The LinkedIn Jobs Scraper on Apify leverages this to extract job data at scale — no cookies, no login, no risk to your LinkedIn account.

Why LinkedIn Jobs Data Is Public

LinkedIn intentionally makes job listings publicly accessible. It is in their interest — they want job seekers to find listings through Google. The /jobs-guest/ endpoint serves the same data you see when you Google "software engineer jobs LinkedIn" without being logged in.

This means:

No authentication needed — the data is served to anonymous visitors
No account risk — you are not logged in, so there is nothing to ban
Legal gray area favors you — public data accessible without circumventing any access controls

What You Can Extract

The scraper pulls structured data from each job listing:

Job title and description (full text)
Company name and company URL
Location (city, state, country, remote/hybrid/onsite)
Salary range (when posted — increasingly common in 2026 due to pay transparency laws)
Posted date and application deadline
Seniority level (entry, mid, senior, director, executive)
Employment type (full-time, part-time, contract, internship)
Industry and job function
Number of applicants

Quick Start on Apify

1. Open LinkedIn Jobs Scraper
2. Enter your search parameters:
   - Keywords: e.g., "machine learning engineer"
   - Location: e.g., "San Francisco, CA"
   - Company: e.g., "Google" (optional)
3. Set max results (default: 100)
4. Click Run

Results export to JSON, CSV, or Excel.

Python Code Example: Scraping Jobs Programmatically

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_API_TOKEN")

run_input = {
    "keywords": "data engineer",
    "location": "United States",
    "maxResults": 200,
    "dateSincePosted": "past-week"
}

run = client.actor("cryptosignals/linkedin-jobs-scraper").call(run_input=run_input)

for job in client.dataset(run["defaultDatasetId"]).iterate_items():
    salary = job.get("salary") or "Not listed"  # also covers null or empty salary values
    print(f"{job['title']} at {job['company']} — {job['location']}")
    print(f"  Salary: {salary}")
    print(f"  Posted: {job['postedDate']}")
    print()

Install the client:

pip install apify-client

Filtering and Targeting

The scraper supports the same filters LinkedIn's search does:

By Company

Track hiring at specific companies:

run_input = {
    "keywords": "engineer",
    "company": "Stripe",
    "maxResults": 50
}

By Location

Compare job markets across cities:

cities = ["San Francisco", "New York", "Austin", "London", "Berlin"]
for city in cities:
    run_input = {
        "keywords": "software engineer",
        "location": city,
        "maxResults": 100
    }
    run = client.actor("cryptosignals/linkedin-jobs-scraper").call(run_input=run_input)
    items = list(client.dataset(run["defaultDatasetId"]).iterate_items())
    print(f"{city}: {len(items)} jobs found")

By Date

Get only fresh listings:

run_input = {
    "keywords": "AI researcher",
    "dateSincePosted": "past-24-hours",
    "maxResults": 50
}

Real-World Use Cases

Job Market Analytics

Track hiring trends over time. Run the scraper weekly for specific keywords and build a time series of job postings. Are "AI engineer" roles growing faster than "data scientist" roles? How quickly did remote jobs decline (or not) in 2026? The data tells the story.
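A minimal sketch of the time-series idea, assuming each scraped item carries a postedDate in YYYY-MM-DD form as in the sample output. The helper name weekly_counts and the sample rows are illustrative, not part of the scraper:

```python
from collections import Counter
from datetime import date


def weekly_counts(jobs):
    """Count postings per ISO week, keyed as (year, week)."""
    counts = Counter()
    for job in jobs:
        iso = date.fromisoformat(job["postedDate"]).isocalendar()
        counts[(iso[0], iso[1])] += 1
    return counts


# Made-up sample rows; real rows would come from the dataset items
sample = [
    {"title": "AI Engineer", "postedDate": "2026-03-02"},
    {"title": "AI Engineer II", "postedDate": "2026-03-04"},
    {"title": "Data Scientist", "postedDate": "2026-03-10"},
]
for week, n in sorted(weekly_counts(sample).items()):
    print(week, n)
```

Running one such count per keyword each week and appending the results to a file is enough to chart one role against another over time.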

Salary Benchmarking

With pay transparency laws expanding across the US and EU, more LinkedIn job posts include salary ranges. Scrape thousands of listings to build salary benchmarks by role, location, company size, and seniority. This data sells — recruiters, HR teams, and job boards all need it.

Company Hiring Tracker

Monitor when companies ramp up or slow down hiring. A sudden burst of engineering roles at a startup could signal a funding round. A hiring freeze at a public company could signal trouble. Track 50+ companies and you have an intelligence feed.

Competitive Intelligence for Recruiters

Recruiters can use this to see exactly what their competitors are offering — salary ranges, benefits mentioned in descriptions, required skills, and seniority levels. Build a dashboard that updates weekly.

Dealing with Anti-Bot Protection

LinkedIn's guest endpoints are public, but they do have basic protections:

Rate limiting — too many requests from one IP get throttled
CAPTCHA challenges — triggered by suspicious patterns
IP blocking — temporary blocks on aggressive scrapers

The Apify actor handles most of this with built-in request throttling and retry logic. For high-volume jobs (1000+ listings per run), you may want to add a proxy layer.
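For your own scripts, throttling and retries can be approximated with exponential backoff. This is a generic sketch, not the actor's actual implementation; fetch_with_backoff and its parameters are hypothetical names:

```python
import random
import time


def fetch_with_backoff(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Exponential backoff plus jitter: roughly 1s, 2s, 4s, ...
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
```

Pass any zero-argument callable, for example a lambda wrapping requests.get. The jitter keeps parallel workers from retrying in lockstep.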

ScraperAPI is a solid option for this — it rotates proxies automatically, handles CAPTCHAs, and has a free tier for testing. You can configure it as a proxy in your Apify actor settings or use it in your own scripts:

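import requests

# ScraperAPI fetches the target URL on your behalf through its proxy pool
params = {
    "api_key": "YOUR_SCRAPERAPI_KEY",
    "url": "https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=python&location=US&start=0"
}

# Use HTTPS so the API key is not sent in clear text
response = requests.get("https://api.scraperapi.com", params=params, timeout=60)
response.raise_for_status()  # fail loudly on 4xx/5xx instead of printing an error page
print(response.text)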
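The start=0 parameter in that URL suggests offset-based pagination. Here is a minimal sketch of generating successive page URLs. The page size of 25 is an assumption rather than a documented value, and guest_search_urls is a hypothetical helper:

```python
from urllib.parse import urlencode

# Base URL taken from the example above
BASE_URL = "https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search"


def guest_search_urls(keywords, location, pages, page_size=25):
    """Build one search URL per page by stepping the `start` offset.

    page_size=25 is an assumption; check how many results a page returns.
    """
    urls = []
    for page in range(pages):
        query = urlencode({
            "keywords": keywords,
            "location": location,
            "start": page * page_size,
        })
        urls.append(f"{BASE_URL}?{query}")
    return urls


for url in guest_search_urls("python", "US", pages=3):
    print(url)
```

Each generated URL can be passed as the "url" value of a proxy request like the one above.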

Building a Salary Database

Here is a complete example that scrapes jobs across multiple roles and stores the results in a CSV:

import csv
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_API_TOKEN")

roles = [
    "software engineer",
    "data engineer",
    "product manager",
    "machine learning engineer",
    "devops engineer"
]

all_jobs = []
for role in roles:
    run = client.actor("cryptosignals/linkedin-jobs-scraper").call(run_input={
        "keywords": role,
        "location": "United States",
        "maxResults": 100
    })
    for job in client.dataset(run["defaultDatasetId"]).iterate_items():
        if job.get("salary"):
            all_jobs.append({
                "role": role,
                "title": job["title"],
                "company": job["company"],
                "location": job["location"],
                "salary": job["salary"]
            })

with open("salary_data.csv", "w", newline="", encoding="utf-8") as f:  # utf-8 keeps non-ASCII company names intact
    writer = csv.DictWriter(f, fieldnames=["role", "title", "company", "location", "salary"])
    writer.writeheader()
    writer.writerows(all_jobs)

print(f"Saved {len(all_jobs)} jobs with salary data")

Output Format

Clean, structured JSON per job:

{
  "title": "Senior Data Engineer",
  "company": "Stripe",
  "location": "San Francisco, CA (Hybrid)",
  "salary": "$180,000 - $250,000/yr",
  "postedDate": "2026-03-01",
  "seniorityLevel": "Mid-Senior level",
  "employmentType": "Full-time",
  "applicants": 47,
  "description": "We are looking for a Senior Data Engineer to...",
  "url": "https://www.linkedin.com/jobs/view/3456789"
}
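The salary field in this sample is a display string, so benchmarking needs it converted to numbers first. A rough sketch, assuming US-dollar strings shaped like the one above. parse_salary is a hypothetical helper that ignores hourly rates and other currencies:

```python
import re


def parse_salary(text):
    """Return (low, high) as integers, or None if no dollar amounts are found."""
    amounts = [
        int(match.replace(",", ""))
        for match in re.findall(r"\$(\d[\d,]*)", text or "")
    ]
    if not amounts:
        return None
    return min(amounts), max(amounts)


print(parse_salary("$180,000 - $250,000/yr"))
```

A single-figure posting returns the same number for both bounds, which keeps downstream averaging simple.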

Pricing

The scraper runs on Apify's pay-per-use platform. Scraping 200 job listings typically costs $0.10-0.30 in compute credits. No subscription needed.

Summary

LinkedIn job listings are publicly accessible data, and with the right tools you can extract them at scale without risking your account or dealing with authentication headaches. The LinkedIn Jobs Scraper handles the scraping infrastructure — you focus on what to do with the data.

Whether you are building a salary benchmarking tool, tracking hiring trends, or feeding a job aggregator, this gives you structured LinkedIn data in minutes, not days.

How to Scrape Substack Newsletters in 2026: Posts, Authors, Subscriber Counts

FairPrice — Wed, 25 Mar 2026 15:02:37 +0000

Substack has become the go-to platform for independent writers and newsletters, but extracting data from it at scale has always been a pain. Its API is undocumented and limited, requires authentication, and does not support bulk operations.

In this guide, I will show you how to scrape Substack newsletters — posts, author profiles, and publication stats — using the Substack Scraper on Apify. No API keys, no Substack account, no rate limit headaches.

Why Not Use the Substack API?

Substack does have an API, but it comes with serious limitations:

Authentication required — you need a Substack account and session cookies
No bulk endpoints — you can only fetch one post or one publication at a time
Rate limits — aggressive throttling if you make too many requests
No subscriber count data — Substack hides this from public endpoints
Undocumented and unstable — endpoints change without notice

The Substack Scraper bypasses all of this by parsing public pages directly. It extracts data that is visible to any reader, just at scale.

What You Can Scrape

The scraper supports three modes:

1. Newsletter Posts

Extract all posts from any Substack publication. Each post includes:

Title, subtitle, and full text content
Author name and bio
Publication date
Canonical URL
Like count and comment count
Post type (free, paid, podcast)

2. Author Profiles

Get detailed information about Substack writers:

Name, bio, and profile photo URL
Publication name and description
Social links (Twitter, website)
Number of posts published

3. Publication Stats

Get high-level stats about any Substack publication:

Estimated subscriber count (based on public signals)
Total posts published
Publication creation date
Top posts by engagement

Quick Start on Apify

1. Go to Substack Scraper on Apify
2. Click Start
3. Enter one or more Substack URLs (e.g., https://newsletter.pragmaticengineer.com)
4. Select the scraping mode (posts, author, or stats)
5. Hit Run

Results are available in JSON, CSV, or Excel format.

Python Code Example: Using the Apify API

For automation, use the Apify Python client to run the scraper programmatically.

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_API_TOKEN")

run_input = {
    # Substack-hosted publications only (custom domains work as long as the site runs on Substack)
    "urls": [
        "https://newsletter.pragmaticengineer.com",
        "https://www.lennysnewsletter.com"
    ],
    "mode": "posts",
    "maxPosts": 50
}

run = client.actor("cryptosignals/substack-scraper").call(run_input=run_input)

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(f"{item['title']} — {item['url']}")
    print(f"  Likes: {item.get('likeCount', 0)}, Comments: {item.get('commentCount', 0)}")
    print()

Install the client first:

pip install apify-client

Real-World Use Cases

Newsletter Aggregator

Build a curated feed of top Substack posts across multiple publications. Scrape posts from 50+ newsletters, rank by engagement, and surface the best content daily. Curated digests such as Substack Reads follow a similar pattern.
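Ranking by engagement can be as simple as a weighted sum of the likeCount and commentCount fields. The weight of 2 for comments is an arbitrary choice for illustration, and both function names are hypothetical:

```python
def engagement_score(post, comment_weight=2):
    """Likes plus weighted comments; missing counts are treated as zero."""
    return post.get("likeCount", 0) + comment_weight * post.get("commentCount", 0)


def top_posts(posts, n=10):
    """Return the n posts with the highest engagement score."""
    return sorted(posts, key=engagement_score, reverse=True)[:n]


# Made-up sample rows in the shape of the scraper output
posts = [
    {"title": "A", "likeCount": 100, "commentCount": 10},
    {"title": "B", "likeCount": 90, "commentCount": 30},
    {"title": "C", "likeCount": 10},
]
for post in top_posts(posts, n=2):
    print(post["title"], engagement_score(post))
```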

Content Research

Analyze what topics perform best on Substack. Scrape thousands of posts, extract titles and engagement metrics, and identify patterns. Which headlines get the most likes? What posting frequency works best? Data beats guessing.

Writer Analytics Dashboard

Track any Substack writer's output over time. How often do they publish? Are their engagement numbers going up or down? This is invaluable for media companies scouting talent or sponsors evaluating newsletter partnerships.
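Publishing cadence falls out of the publishedAt timestamps. This sketch assumes ISO 8601 strings with a trailing Z, as in the sample output. average_days_between_posts is a hypothetical helper:

```python
from datetime import datetime


def average_days_between_posts(posts):
    """Mean gap in days between consecutive posts, or None with fewer than two."""
    stamps = sorted(
        datetime.fromisoformat(p["publishedAt"].replace("Z", "+00:00"))
        for p in posts
    )
    if len(stamps) < 2:
        return None
    gaps = [(b - a).total_seconds() / 86400 for a, b in zip(stamps, stamps[1:])]
    return sum(gaps) / len(gaps)


# Made-up sample: three posts published one week apart
posts = [
    {"publishedAt": "2026-02-15T10:00:00Z"},
    {"publishedAt": "2026-02-01T10:00:00Z"},
    {"publishedAt": "2026-02-08T10:00:00Z"},
]
print(average_days_between_posts(posts))
```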

Competitive Intelligence

If you run a newsletter, scrape your competitors. See what they are writing about, how their audience responds, and where the gaps are. Map the entire landscape of newsletters in your niche.

Handling Large Scraping Jobs

For scraping hundreds of newsletters or thousands of posts, you will want to:

Use pagination — the scraper handles this automatically, but set maxPosts to control output size
Run async — use the Apify API's async run endpoint and poll for results
Export to a database — pipe results into PostgreSQL or BigQuery for analysis

import time
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_API_TOKEN")

# Start async run
run = client.actor("cryptosignals/substack-scraper").start(run_input={
    "urls": ["https://newsletter.pragmaticengineer.com"],
    "mode": "posts",
    "maxPosts": 500
})

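# Poll until the run reaches a terminal state (aborted and timed-out runs would otherwise loop forever)
TERMINAL_STATES = ("SUCCEEDED", "FAILED", "ABORTED", "TIMED-OUT")
while True:
    status = client.run(run["id"]).get()
    if status["status"] in TERMINAL_STATES:
        break
    time.sleep(5)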

# Fetch results
items = list(client.dataset(run["defaultDatasetId"]).iterate_items())
print(f"Scraped {len(items)} posts")
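For the database step, here is a sketch using SQLite from the standard library as a stand-in for PostgreSQL or BigQuery. The posts table and its column names are assumptions, mapped from the field names in the scraper output:

```python
import sqlite3


def save_posts(conn, items):
    """Insert posts into a `posts` table, skipping URLs already stored."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        "url TEXT PRIMARY KEY, title TEXT, author TEXT, "
        "published_at TEXT, like_count INTEGER, comment_count INTEGER)"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO posts VALUES (?,?,?,?,?,?)",
        [
            (
                item["url"], item.get("title"), item.get("author"),
                item.get("publishedAt"), item.get("likeCount", 0),
                item.get("commentCount", 0),
            )
            for item in items
        ],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # swap for a file path to persist
save_posts(conn, [{"url": "https://example.com/p/one", "title": "One"}])
print(conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0])
```

Using the URL as the primary key makes repeated runs idempotent, so rescraping a newsletter does not duplicate rows.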

Monitoring Your Scrapes with ScrapeOps

When running scrapers in production, you need visibility into what is working and what is failing. ScrapeOps gives you a monitoring dashboard for all your scraping jobs — success rates, response times, error breakdowns, and alerts.

It integrates with any Python scraper in a few lines of code and is especially useful when you are running multiple Apify actors across different data sources. Free tier available for small projects.

Output Format

The scraper returns clean, structured JSON:

{
  "title": "What I Think About When I Think About AI",
  "subtitle": "The real questions nobody is asking",
  "url": "https://newsletter.pragmaticengineer.com/p/ai-thoughts",
  "author": "Gergely Orosz",
  "publishedAt": "2026-02-15T10:00:00Z",
  "likeCount": 847,
  "commentCount": 123,
  "type": "free",
  "content": "Full text of the post..."
}

Every field is consistently named and typed. No parsing HTML, no dealing with inconsistent formats.

Pricing

The scraper runs on Apify's pay-per-use model. A typical run scraping 100 posts from 5 newsletters costs about $0.10-0.50 in platform credits. No monthly subscription, no minimum commitment.

Summary

Substack is one of the richest sources of written content on the internet, and now you can extract that data at scale without authentication, without rate limit issues, and without writing a custom scraper. The Substack Scraper handles the hard parts — you just point it at the newsletters you care about and get clean data back.

Whether you are building a newsletter aggregator, doing content research, or tracking the Substack ecosystem, this is the fastest path from idea to data.

Best SoundCloud Scrapers in 2026: Tracks, Artists, Playlists, Followers

FairPrice — Wed, 25 Mar 2026 12:04:10 +0000

SoundCloud hosts over 300 million tracks from 30+ million creators. It's one of the largest open music platforms on the web, and its data is a goldmine for music discovery tools, artist analytics, and playlist curation services.

In this guide, we'll cover the best ways to scrape SoundCloud data in 2026, including a dedicated Apify actor that uses SoundCloud's public API v2.

What Data Can You Get from SoundCloud?

SoundCloud exposes a rich set of data points across tracks, artists, and playlists:

Track data — title, description, duration, play count, like count, repost count, comment count, waveform data, genre, tags, upload date
Artist profiles — username, display name, bio, follower count, following count, track count, playlist count, verified status, location
Playlists/Albums — title, track list, creator, like count, repost count, duration
Comments — timestamped comments on tracks (unique to SoundCloud)
Related tracks — algorithmic recommendations per track
Search results — tracks, artists, and playlists matching keywords

This data powers a wide range of applications from music tech startups to academic research.

Why SoundCloud Is Easier to Scrape Than Spotify

Unlike Spotify, which locks everything behind OAuth and strict API quotas, SoundCloud has a public-facing API (v2) that serves its web frontend. Key advantages:

No login required for public data: tracks, profiles, and playlists are accessible without OAuth or a registered API key (requests still carry the client_id that the web frontend uses)
JSON responses — the internal API returns structured JSON, not HTML to parse
Less aggressive rate limiting — reasonable request rates are tolerated
Open platform philosophy — SoundCloud's roots as an open platform mean more data is publicly accessible

That said, you still need to handle pagination, rate limits, and API endpoint changes.
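Pagination can be sketched as a loop that follows a cursor link from page to page. This assumes each response holds its items under "collection" and the next page's URL under "next_href". Treat the second key as an assumption to verify against real responses. fetch_json is a caller-supplied function, so the sketch runs without network access:

```python
def iterate_collection(first_url, fetch_json, max_pages=10):
    """Yield items page by page, following `next_href` until it runs out."""
    url = first_url
    for _ in range(max_pages):
        if not url:
            break
        page = fetch_json(url)
        yield from page.get("collection", [])
        url = page.get("next_href")  # assumed cursor field


# Fake pages standing in for API responses
fake_pages = {
    "page1": {"collection": [{"title": "a"}, {"title": "b"}], "next_href": "page2"},
    "page2": {"collection": [{"title": "c"}], "next_href": None},
}
titles = [t["title"] for t in iterate_collection("page1", fake_pages.__getitem__)]
print(titles)
```

The max_pages cap is a safety net against endless cursors and doubles as a crude request limit.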

Option 1: SoundCloud Scraper on Apify (Recommended)

The SoundCloud Scraper on Apify Store provides ready-to-use extraction with zero setup. It uses SoundCloud's public API v2 directly:

No auth tokens needed — works out of the box
Multiple input types: search queries, artist URLs, track URLs, playlist URLs
Full data extraction — all metadata, stats, and related content
Automatic pagination — handles large result sets

Example: Search for Tracks

{
  "searchQuery": "lo-fi beats",
  "maxResults": 100,
  "type": "tracks"
}

Example: Scrape an Artist Profile

{
  "urls": ["https://soundcloud.com/flaboratory"],
  "includeTrackList": true
}

The output includes every available data point in clean JSON — play counts, follower numbers, track metadata, waveforms, and more. Export to CSV, JSON, or pipe directly into your data warehouse via Apify's built-in integrations (Google Sheets, Slack, webhooks, S3, and others).

Pricing

Pay-per-use on Apify. Scraping 1,000 tracks typically costs $0.10-0.50 depending on the depth of data extracted. No monthly minimums.

Option 2: Build Your Own with ScrapeOps

For a custom scraping pipeline, ScrapeOps provides proxy management, monitoring dashboards, and fake browser headers — everything you need to build a reliable SoundCloud scraper.

import requests

SCRAPEOPS_KEY = "your_scrapeops_key"
sc_client_id = "your_client_id"  # extracted from SoundCloud's frontend JS

# Search for tracks
search_url = f"https://api-v2.soundcloud.com/search/tracks?q=lo-fi+beats&client_id={sc_client_id}&limit=20"

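response = requests.get(
    "https://proxy.scrapeops.io/v1/",
    params={"api_key": SCRAPEOPS_KEY, "url": search_url},
    timeout=60,
)
response.raise_for_status()  # stop here on proxy or upstream errors

data = response.json()
for track in data.get("collection", []):
    # playback_count can be missing or null on some tracks
    plays = track.get("playback_count") or 0
    print(f"{track['title']} - {plays} plays")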

ScrapeOps handles proxy rotation and request management. You'll need to extract a valid client_id from SoundCloud's frontend JavaScript (it rotates periodically) and build your own parsing logic.

Option 3: SoundCloud's Official API (Deprecated)

SoundCloud's official API has been effectively closed to new registrations since 2017. Existing apps with legacy API keys still work, but new applications cannot get access. The v2 API used by the web frontend is the practical alternative — which is exactly what the Apify actor uses.

Use Cases for SoundCloud Data

Music Discovery Tools

Build recommendation engines based on play counts, genre tags, and related tracks. SoundCloud's data includes waveform information and timestamped comments that add context no other platform provides.

Artist Analytics Platforms

Track follower growth, play count trends, and engagement metrics over time. Compare artists within genres. Identify emerging artists before they blow up on mainstream platforms.

Playlist Curation Services

Automated playlist building based on genre, BPM (from track metadata), play count thresholds, and freshness. SoundCloud's open upload model means you'll find tracks here months before they appear on Spotify.

Academic Music Research

Study genre evolution, geographic distribution of music production, collaboration networks (via reposts and features), and engagement patterns. SoundCloud's long history and open structure make it ideal for longitudinal studies.

A&R and Talent Scouting

Monitor emerging artists by tracking rapid follower growth, viral tracks (high play-to-follower ratios), and cross-platform presence. Feed SoundCloud data into scoring models that flag promising unsigned artists.
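A play-to-follower ratio is easy to compute once both numbers are in hand. The field names (followers, top_track_plays) and the threshold of 20 are illustrative assumptions, not values from the scraper:

```python
def play_to_follower_ratio(plays, followers):
    """Plays per follower; followers is floored at 1 to avoid dividing by zero."""
    return plays / max(followers, 1)


def flag_breakouts(artists, min_ratio=20.0):
    """Return usernames whose top track far outperforms their audience size."""
    return [
        a["username"]
        for a in artists
        if play_to_follower_ratio(a["top_track_plays"], a["followers"]) >= min_ratio
    ]


# Made-up sample artists
artists = [
    {"username": "bedroom_producer", "followers": 800, "top_track_plays": 96_000},
    {"username": "established_act", "followers": 250_000, "top_track_plays": 400_000},
]
print(flag_breakouts(artists))
```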

Podcast and Audio Content Analysis

SoundCloud hosts significant podcast and spoken-word content. Scrape episode metadata, listener counts, and comment sentiment to track podcast performance outside major platforms.

Choosing the Right Approach

Approach	Setup Time	Auth Required	Maintenance	Best For
Apify SoundCloud Scraper	5 min	No	None	Quick data access, no coding
ScrapeOps + Custom Code	2-3 hours	No	Moderate	Custom pipelines, full control
Legacy Official API	N/A	Yes (closed)	N/A	Existing apps only
Raw HTTP requests	1-2 hours	No	High	Small scale, learning

For most projects, the Apify actor gets you from zero to data in minutes. If you're building a larger pipeline and want proxy infrastructure, ScrapeOps gives you the tooling to scale your own scraper.

Wrapping Up

SoundCloud's open architecture makes it one of the most accessible music platforms for data extraction. The public API v2 returns structured JSON without authentication, which dramatically reduces the complexity compared to scraping Spotify or Apple Music.

The SoundCloud Scraper on Apify is the fastest path to production-ready data. For custom builds, ScrapeOps provides the proxy and monitoring infrastructure you'll need.

Whether you're building a music discovery app, running artist analytics, or curating playlists at scale — SoundCloud data is rich, accessible, and underutilized. Start extracting and see what you can build.

Best Metacritic Scrapers in 2026: Game Reviews, Scores, Critic Data

FairPrice — Wed, 25 Mar 2026 12:03:32 +0000

Metacritic is the go-to aggregation platform for game, movie, and TV review scores. Whether you're building a game recommendation engine, tracking critic sentiment over time, or feeding review data into an analytics pipeline, you need reliable access to Metacritic data.

In this guide, we'll look at the best ways to scrape Metacritic in 2026, including a ready-to-use Apify actor that handles the heavy lifting.

What Data Does Metacritic Have?

Metacritic aggregates reviews from hundreds of professional critics and millions of users. Here's what you can extract:

Metascores — weighted critic score (0-100) for games, movies, TV, and music
User scores — community ratings on a 0-10 scale
Individual critic reviews — reviewer name, outlet, score, review snippet, and date
Platform breakdown — separate scores per platform (PS5, Xbox, PC, Switch)
Release metadata — publisher, developer, genre, release date, ESRB rating
Must-Play / Must-Watch lists — curated editorial selections

This data powers everything from game industry research to automated review aggregation dashboards.

Why Scraping Metacritic Is Tricky

Metacritic uses aggressive anti-bot protections. Traditional scraping approaches hit several walls:

JavaScript rendering — much of the page content loads dynamically
Rate limiting — frequent requests get IP-blocked quickly
CAPTCHAs — automated access triggers verification challenges
Layout changes — Metacritic redesigns break CSS-selector-based scrapers regularly

This is where dedicated scraping tools and proxy services come in.

Option 1: Metacritic Scraper on Apify (Recommended)

The Metacritic Scraper on Apify Store handles all the complexity for you. It uses Metacritic's internal backend API directly, which means:

No browser rendering needed — faster and cheaper than headless browser approaches
Two modes: search mode (find games by keyword) and detail mode (get full data for specific URLs)
Structured JSON output — clean data ready for your pipeline
Built-in proxy rotation — no IP blocks

How It Works

Search mode — pass a search query and get matching results:

{
  "mode": "search",
  "query": "zelda",
  "limit": 20
}

Detail mode — pass specific Metacritic URLs to get full review data:

{
  "mode": "detail",
  "urls": ["https://www.metacritic.com/game/the-legend-of-zelda-tears-of-the-kingdom/"]
}

The output includes metascores, user scores, critic reviews, platform data, and all metadata in a clean JSON format. You can export to CSV, JSON, or push directly to a database via Apify integrations.

Pricing

Apify's pay-per-use model means you only pay for what you scrape. A typical run processing 100 game pages costs around $0.10-0.30 depending on the data depth.

Option 2: Build Your Own Scraper with ScraperAPI

If you prefer a DIY approach, ScraperAPI is a solid proxy and rendering service that handles CAPTCHAs, retries, and IP rotation for you.

import requests
from bs4 import BeautifulSoup

API_KEY = "your_scraperapi_key"
target_url = "https://www.metacritic.com/game/the-legend-of-zelda-tears-of-the-kingdom/"

# Pass the target as a query parameter so requests URL-encodes it
response = requests.get(
    "https://api.scraperapi.com",
    params={"api_key": API_KEY, "url": target_url, "render": "true"},
    timeout=90,
)
response.raise_for_status()

# Parse the rendered HTML with BeautifulSoup
soup = BeautifulSoup(response.text, "html.parser")

ScraperAPI handles the proxy rotation and JavaScript rendering, but you'll need to write and maintain your own parsing logic. This is more work but gives you full control.

Option 3: Metacritic's Own API (Limited)

Metacritic doesn't offer a public API. There are internal endpoints that power their frontend, but these are undocumented, rate-limited, and can change without notice. The Apify actor mentioned above already leverages these endpoints with proper error handling and retry logic built in.

Use Cases for Metacritic Data

Game Industry Research

Track how review scores correlate with sales data. Compare critic vs. user sentiment across genres. Identify which publishers consistently deliver high-rated titles.

Review Aggregation Dashboards

Build a dashboard that pulls scores from Metacritic alongside Steam reviews, OpenCritic, and user forums. Cross-reference scores to find games that critics love but users don't (or vice versa).
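Finding critic/user splits means putting both scores on one scale first: metascores run 0-100 and user scores 0-10. A small sketch with made-up rows. The field names metascore and user_score and the 20-point threshold are assumptions:

```python
def score_gap(metascore, user_score):
    """Critic score minus user score, both on a 0-100 scale."""
    return round(metascore - user_score * 10, 1)


def divisive_games(games, threshold=20):
    """Titles where critics and users disagree by at least `threshold` points."""
    return [
        g["title"]
        for g in games
        if abs(score_gap(g["metascore"], g["user_score"])) >= threshold
    ]


# Made-up sample rows
games = [
    {"title": "Game A", "metascore": 88, "user_score": 4.5},
    {"title": "Game B", "metascore": 81, "user_score": 8.0},
]
print(divisive_games(games))
```

A positive gap means critics rated the game higher than users did, and a negative gap means the reverse.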

Sentiment Tracking Over Time

Monitor how user scores change post-launch. Some games see dramatic score shifts after patches or controversy. Tracking this data lets you spot trends early.

Price Optimization

Combine Metacritic scores with pricing data from Steam, PlayStation Store, and Xbox Marketplace. Identify undervalued games with high scores but low prices — useful for deal sites and recommendation engines.

Content Generation

Feed structured review data into LLMs to generate game summaries, comparison articles, or buying guides. The structured nature of Metacritic data makes it ideal for automated content pipelines.

Choosing the Right Approach

Approach	Setup Time	Maintenance	Cost	Best For
Apify Metacritic Scraper	5 min	None	Pay-per-use	Quick access, no coding
ScraperAPI + Custom Parser	2-4 hours	Ongoing	API subscription	Full control, custom needs
Raw scraping	4-8 hours	Heavy	Proxy costs	Learning, small scale

For most data projects, the Apify actor is the fastest path to clean data. If you need custom extraction logic or want to integrate scraping into a larger pipeline, ScraperAPI gives you the proxy infrastructure to build on.

Wrapping Up

Metacritic data is valuable for game industry analysis, review aggregation, and sentiment tracking. The main challenge is getting past their anti-bot protections reliably.

The Metacritic Scraper on Apify is the most practical option for 2026 — it uses internal APIs, handles all the infrastructure, and outputs clean JSON. For DIY builders, pair ScraperAPI with your own parser for maximum flexibility.

Whatever approach you choose, structured review data opens up a wide range of analytical and commercial applications. Start small, validate your pipeline, and scale from there.

I Analyzed 10,000 Hacker News Comments to Find What Makes a Post Go Viral

FairPrice — Thu, 12 Mar 2026 03:02:20 +0000

Last month, I scraped 10,000+ comments from Hacker News top stories to answer one question: what separates a 500-point post from a 5-point post?

Here's what the data revealed.

The Dataset

I used the HN Top Stories scraper on Apify to collect structured data from the front page over several weeks — titles, scores, comment counts, domains, and timestamps.

Quick setup:

from apify_client import ApifyClient

client = ApifyClient("YOUR_TOKEN")
run = client.actor("cryptosignals/hn-top-stories").call(
    run_input={"maxItems": 500}
)

stories = list(client.dataset(run["defaultDatasetId"]).iterate_items())

That gives you clean JSON with scores, comment counts, timestamps, and URLs — no BeautifulSoup required.

Finding #1: Comment Count Predicts Virality Better Than Score

I expected upvotes to be the key metric. Wrong.

Posts with 200+ comments had an average score of 487, while posts with 200+ upvotes but fewer than 50 comments averaged only 243.

Comments drive engagement loops. A controversial title gets people arguing, which pushes the post higher, which attracts more commenters. Score alone doesn't capture this.

Finding #2: The "Show HN" Advantage Is Real

Show HN posts that hit the front page had 2.3x more comments than regular posts at the same score level. The HN community rewards builders — but only if your project is genuinely useful.

The highest-performing Show HN posts shared three traits:

Solved a specific, common pain point
Had a live demo link
Were solo/small-team projects (not corporate launches)

Finding #3: Timing Matters Less Than You Think

Everyone says "post at 6am PT." The data tells a different story:

Time Window (PT)	Avg Score	Avg Comments
6-9 AM	142	67
9 AM-12 PM	138	71
12-3 PM	127	63
6-9 PM	131	59

The best and worst windows differ by only about 10% in average score, so content quality dominates timing.
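The roughly 10% figure can be checked directly from the average scores in the table, keyed here by each window's starting hour in 24-hour PT:

```python
# Average score per window from the table, keyed by start hour (24h, PT)
avg_score = {6: 142, 9: 138, 12: 127, 18: 131}

best = max(avg_score, key=avg_score.get)
worst = min(avg_score, key=avg_score.get)
spread = (avg_score[best] - avg_score[worst]) / avg_score[best]

print(f"best window starts at {best}:00, worst at {worst}:00, spread {spread:.1%}")
```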

Finding #4: Title Length Sweet Spot

Posts with titles of 8 to 12 words scored 40% higher on average than those outside this range. Too short lacks context. Too long gets ignored.

The highest-scoring title pattern: "[Action verb] + [specific thing] + [surprising result]"

Examples: "I reverse-engineered the Spotify algorithm", "Why we moved from React to plain HTML"
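To test the title-length effect on your own data, split stories by word count and compare mean scores. The rows below are invented for illustration, and mean_score_by_length is a hypothetical helper:

```python
from statistics import mean


def word_count(title):
    return len(title.split())


def mean_score_by_length(stories, low=8, high=12):
    """Average score for titles inside vs outside the low-to-high word range."""
    inside = [s["score"] for s in stories if low <= word_count(s["title"]) <= high]
    outside = [s["score"] for s in stories if not (low <= word_count(s["title"]) <= high)]
    return {
        "inside": mean(inside) if inside else None,
        "outside": mean(outside) if outside else None,
    }


# Invented sample rows; real rows would come from the scraped dataset
stories = [
    {"title": "Why we moved from React to plain HTML", "score": 300},
    {"title": "Show HN: Tiny tool", "score": 100},
    {"title": "I reverse-engineered the Spotify algorithm", "score": 200},
]
print(mean_score_by_length(stories))
```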

Try It Yourself

The full dataset pipeline:

import pandas as pd
from apify_client import ApifyClient

client = ApifyClient("YOUR_TOKEN")
run = client.actor("cryptosignals/hn-top-stories").call(
    run_input={"maxItems": 1000}
)

df = pd.DataFrame(client.dataset(run["defaultDatasetId"]).iterate_items())

# Score vs comments correlation
print(f"Correlation: {df['score'].corr(df['commentCount']):.2f}")

# Best performing domains
print(df.groupby('domain')['score'].mean().sort_values(ascending=False).head(10))

You can run this for free on Apify's free tier (no credit card).

The HN Top Stories scraper on Apify returns structured JSON, handles pagination, and costs fractions of a cent per run.

What patterns have you noticed on HN? Drop a comment — I'd love to compare notes.

Recommended Tools for Web Scraping

If you're building scrapers at scale, these tools can save you hours of dealing with proxies, CAPTCHAs, and rate limits:

ScraperAPI — Handles proxy rotation, browser rendering, and CAPTCHAs automatically. Great if you don't want to manage your own proxy infrastructure. Comes with 5,000 free API credits to get started.
ScrapeOps — A proxy aggregator that routes your requests through 20+ proxy providers and picks the best one for each target site. Useful when you need reliability across different domains.

How to build a free HN data pipeline in 30 minutes

FairPrice — Thu, 12 Mar 2026 00:10:35 +0000

Hacker News is one of the richest sources of signal in tech. New frameworks, hiring waves, shifting sentiment — it all shows up on HN before it hits mainstream. But scraping HN yourself is tedious and fragile.

In this tutorial, I'll walk you through building a lightweight data pipeline that pulls structured HN data on a schedule, stores it locally, and runs basic trend detection — all for free.

The Data Source

We'll use the HN Top Stories actor on Apify, which returns clean JSON for top, new, best, ask, show, and job stories. It handles pagination, rate limits, and retries so you don't have to.

Apify's free tier gives you enough compute to run this daily without paying a cent.

Step 1: Fetch HN Data

Install the Apify client:

pip install apify-client

Then pull the latest top stories:

from apify_client import ApifyClient
from datetime import datetime

client = ApifyClient("YOUR_APIFY_TOKEN")

run = client.actor("cryptosignals/hn-top-stories").call(
    run_input={"category": "topstories", "maxItems": 100}
)

items = list(
    client.dataset(run["defaultDatasetId"]).iterate_items()
)
print(f"Fetched {len(items)} stories")

Each item gives you the title, URL, score, author, comment count, and timestamp — everything you need for analysis.

Step 2: Store Results in SQLite

Save each run to a local database so you can track changes over time:

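import sqlite3
from datetime import datetime, timezone

db = sqlite3.connect("hn_pipeline.db")
db.execute(
    "CREATE TABLE IF NOT EXISTS stories ("
    "id INTEGER, title TEXT, url TEXT, score INTEGER, "
    "comments INTEGER, author TEXT, fetched_at TEXT, "
    "PRIMARY KEY (id, fetched_at))"
)

# Store UTC timestamps as "YYYY-MM-DD HH:MM:SS", the format SQLite's
# datetime() emits, so later date filters compare like with like.
fetched_at = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")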
for item in items:
    db.execute(
        "INSERT OR IGNORE INTO stories VALUES (?,?,?,?,?,?,?)",
        (item["id"], item["title"], item.get("url", ""),
         item["score"], item.get("descendants", 0),
         item["by"], fetched_at)
    )
db.commit()

Step 3: Detect Trends

With a few days of data, you can spot rising topics:

from collections import Counter

rows = db.execute(
    "SELECT title FROM stories "
    "WHERE fetched_at > datetime('now', '-7 days')"
).fetchall()

words = []
for (title,) in rows:
    words.extend(
        w.lower() for w in title.split() if len(w) > 3
    )

trends = Counter(words).most_common(20)
for word, count in trends:
    print(f"{word:20s} {count}")

Run this daily and diff against the previous week to catch emerging topics early — useful for content planning, market research, or just staying ahead of the curve.
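The week-over-week diff described above could be sketched roughly like this. It assumes you have already built one Counter per week using the word-splitting loop from this step; the function name emerging_words and the min_count threshold are illustrative choices, not part of the original pipeline.

```python
from collections import Counter

def emerging_words(this_week, last_week, min_count=3, top_n=10):
    """Rank words by how much their count grew versus the previous week.

    Both arguments are Counters of word -> count. Words seen fewer than
    min_count times this week are ignored to cut down on noise.
    """
    growth = {}
    for word, count in this_week.items():
        if count < min_count:
            continue
        growth[word] = count - last_week.get(word, 0)
    # Largest increase first; ties broken alphabetically for stable output
    ranked = sorted(growth.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(w, g) for w, g in ranked if g > 0][:top_n]

# Made-up counts, purely for illustration
this_week = Counter({"rust": 9, "sqlite": 5, "python": 6, "zig": 1})
last_week = Counter({"rust": 2, "python": 6})
print(emerging_words(this_week, last_week))
```

Words whose counts are flat or falling are dropped, so the output only lists terms that gained ground.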

Step 4: Automate It

Add a cron job to run on schedule. The HN Top Stories actor also supports scheduled runs natively on Apify, so you can set it to run every 6 hours and have fresh data waiting.

A minimal cron entry:

0 */6 * * * cd ~/hn-pipeline && python3 fetch.py && python3 trends.py

Real Use Case: Job Monitoring

One practical application is monitoring HN job posts. Set the actor category to jobstories (the HN jobs feed, which is separate from the monthly "Who is Hiring" threads), then filter for keywords matching your stack:

keywords = ["python", "fastapi", "remote", "senior"]
matches = [
    s for s in items
    if any(k in s["title"].lower() for k in keywords)
]

Pipe matches into a Slack webhook or email digest and you have a free, targeted job alert system.
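A minimal sketch of the Slack step, assuming a standard Slack incoming webhook that accepts a JSON body with a "text" field. The helper names and the webhook URL you pass in are hypothetical; the item keys (id, title, url) are the ones used in Step 2.

```python
import json
import urllib.request

def build_slack_payload(matches, max_items=10):
    """Format matching stories as a single Slack message payload."""
    lines = [f"*{len(matches)} matching HN posts*"]
    for s in matches[:max_items]:
        # Fall back to the HN discussion page when a post has no external URL
        url = s.get("url") or f"https://news.ycombinator.com/item?id={s['id']}"
        lines.append(f"- {s['title']} ({url})")
    return {"text": "\n".join(lines)}

def send_to_slack(webhook_url, payload):
    """POST the payload to a Slack incoming webhook (you supply the URL)."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status

# Made-up item, purely for illustration
sample = [{"id": 1, "title": "Acme is hiring a senior Python engineer", "url": ""}]
payload = build_slack_payload(sample)
print(payload["text"])
```

Keeping the formatting separate from the network call makes it easy to reuse the same payload builder for an email digest.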

Wrapping Up

The full pipeline is under 50 lines of Python, runs on free-tier infrastructure, and gives you structured access to one of the best signal sources in tech. The Apify HN actor handles the scraping; you just handle the analysis.

Grab the code, set up a schedule, and start mining HN data today.

Why I Stopped Using the Hacker News API Directly (and What I Use Instead)

FairPrice — Wed, 11 Mar 2026 23:06:36 +0000

I've been pulling data from Hacker News for over a year. I started where everyone starts: the official Firebase API.

And for about two weeks, it was fine. Then reality set in.

The problem with the HN API

The Hacker News API is technically correct. It returns items by ID. It returns top story IDs. It does what it says.

But if you want to do anything practical — like get the top 50 stories with their comment counts, scores, and metadata in a single call — you're looking at 51 HTTP requests minimum. One for the top stories list, then one per story.

Here's what that looks like in Python:

import requests

def get_top_stories(n=50):
    top_ids = requests.get(
        'https://hacker-news.firebaseio.com/v0/topstories.json'
    ).json()[:n]

    stories = []
    for story_id in top_ids:
        item = requests.get(
            f'https://hacker-news.firebaseio.com/v0/item/{story_id}.json'
        ).json()
        stories.append(item)

    return stories

This works. It's also painfully slow. On a good day, you're waiting 8-12 seconds for 50 stories. On a bad day with rate limiting, much longer.

You can parallelize with asyncio and aiohttp, but then you're managing connection pools, handling rate limits, retrying failed requests, and parsing inconsistent response shapes. The HN API returns different fields depending on item type (story, comment, poll, job). There's no schema. No pagination for comments. No filtering.

I wrote that boilerplate three times before deciding there had to be a better way.
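For reference, here is roughly what the parallel version looks like. This sketch swaps asyncio and aiohttp for a standard-library thread pool to stay self-contained, so it is not the exact boilerplate described above; the worker count and the injectable fetch parameter are assumptions for illustration, and there is no retry or backoff logic.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE = "https://hacker-news.firebaseio.com/v0"

def fetch_json(url):
    """Fetch one URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)

def get_top_stories_parallel(n=50, workers=10, fetch=fetch_json):
    """Fetch the top n stories with up to `workers` requests in flight."""
    top_ids = fetch(f"{BASE}/topstories.json")[:n]
    urls = [f"{BASE}/item/{sid}.json" for sid in top_ids]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so results stay in ranking order
        items = list(pool.map(fetch, urls))
    # Deleted or missing items come back as null (None); drop them
    return [item for item in items if item]
```

Even this short version still makes one request per story; it only hides the latency by overlapping them.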

What I actually use now

I switched to using a pre-built scraper that handles all the ugly parts: HN Top Stories on Apify.

The difference is night and day. Instead of managing 50+ HTTP calls, I get structured JSON back with one API call:

from apify_client import ApifyClient

client = ApifyClient("your-api-token")

run = client.actor("cryptosignals/hn-top-stories").call(
    run_input={
        "maxItems": 100,
        "minScore": 10,
        "includeComments": True
    }
)

items = list(client.dataset(run["defaultDatasetId"]).iterate_items())

That gives me up to 100 stories (the maxItems value above), pre-filtered by score, with full comment trees already extracted. The response is consistent — every item has the same shape.

Why this matters for data projects

If you're building a trend detector, a sentiment analyzer, or even a simple dashboard, the data collection step shouldn't be the hard part. But with the raw HN API, it is.

Here's what I was spending time on before switching:

Rate limit handling: HN doesn't publish rate limits, so you discover them empirically (and differently each time)
Comment tree traversal: Comments are stored as nested IDs. To get a full thread, you need recursive fetching. For a post with 300 comments, that's 300+ additional API calls
Data normalization: The API returns null for deleted items, different fields for different item types, and timestamps in Unix epoch
Caching and deduplication: If you're polling every hour, you need to diff against previous results

All of that is now someone else's problem. The Apify actor handles it, and I get clean data out.
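To make the comment traversal cost concrete, here is a minimal sketch of the recursive fetch against the raw API. It relies on the kids and deleted fields of HN items; the function names and the injectable fetch parameter are illustrative, and there is no rate-limit handling or caching.

```python
import json
import urllib.request

ITEM_URL = "https://hacker-news.firebaseio.com/v0/item/{}.json"

def fetch_item(item_id):
    """Fetch a single HN item by ID (one HTTP request)."""
    with urllib.request.urlopen(ITEM_URL.format(item_id), timeout=10) as resp:
        return json.load(resp)

def walk_comments(item, fetch=fetch_item, depth=0):
    """Yield (depth, comment) pairs for every live comment below `item`."""
    for kid_id in item.get("kids", []):
        kid = fetch(kid_id)  # one request per comment
        if not kid:
            continue  # the API returns null for some items
        if not kid.get("deleted"):
            yield depth, kid
        # Deleted comments can still have live replies, so keep descending
        yield from walk_comments(kid, fetch, depth + 1)
```

Usage would look like `for depth, c in walk_comments(fetch_item(story_id)): ...`, which makes one sequential request per comment.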

The code I actually ship now

My current HN monitoring script is 40 lines instead of 200:

from apify_client import ApifyClient

client = ApifyClient("your-api-token")

def get_trending(min_score=20, keywords=None):
    run_input = {"maxItems": 200, "minScore": min_score}
    if keywords:
        # Only the first keyword is passed along to the actor
        run_input["keyword"] = keywords[0]

    run = client.actor("cryptosignals/hn-top-stories").call(
        run_input=run_input
    )

    items = list(client.dataset(run["defaultDatasetId"]).iterate_items())
    return sorted(items, key=lambda x: x.get("score", 0), reverse=True)

# Get AI-related stories scoring above 50
trending = get_trending(min_score=50, keywords=["AI"])
for story in trending[:10]:
    print(f"{story['score']} | {story['title']}")

No connection pool management. No retry logic. No response parsing. Just data.

When to still use the raw API

The direct API still makes sense for:

Fetching a single item by ID (it's instant)
Real-time streaming via the /v0/updates endpoint
Building something where you need sub-second freshness

For everything else — batch collection, filtering, comment extraction, historical data — I'd reach for a dedicated tool. The HN API is a data source, not a data pipeline. Treating it like a pipeline is where most projects get stuck.

I've been using the HN Top Stories actor for my own projects. If you're doing anything with HN data at scale, it saves a surprising amount of plumbing code.

Building a Tech Trend Dashboard from Hacker News Data with Python

FairPrice — Wed, 11 Mar 2026 22:06:09 +0000

Every day, thousands of developers discuss emerging technologies on Hacker News. What if you could turn that firehose of discussion into a structured trend dashboard?

In this tutorial, we build a Python script that analyzes HN stories and comments to detect trending topics and visualize what the developer community cares about right now.

The Data Pipeline

We need stories and their comments. The official HN API gives you individual items, but fetching hundreds of stories plus nested comments is slow since each comment requires a separate HTTP call.

For bulk collection, I use the HN Stories + Comments Scraper on Apify which grabs full story metadata and comment trees in one run. But you can adapt this to any data source.

Assume we have our data as JSON:

stories = [
    {
        "title": "Show HN: I built a Rust web framework",
        "score": 342,
        "num_comments": 156,
        "comments": [
            {"text": "Really fast compared to Actix...", "score": 45},
            {"text": "How does this handle async?", "score": 23},
        ]
    },
]

Step 1: Extract Tech Keywords

Simple keyword extraction weighted by engagement:

import re
from collections import Counter

TECH_TERMS = {
    'rust', 'python', 'go', 'typescript', 'zig', 'kotlin',
    'react', 'vue', 'svelte', 'htmx', 'nextjs',
    'llm', 'gpt', 'claude', 'openai', 'transformer',
    'kubernetes', 'docker', 'wasm', 'sqlite', 'postgres',
}

def extract_trends(stories):
    weighted_counts = Counter()
    for story in stories:
        text = story['title'].lower()
        for c in story.get('comments', []):
            text += ' ' + c.get('text', '').lower()
        for term in TECH_TERMS:
            if re.search(r'\b' + term + r'\b', text):
                weight = story['score'] + story['num_comments']
                weighted_counts[term] += weight
    return weighted_counts.most_common(15)

Step 2: Detect Rising vs Falling Topics

Compare recent mentions against a baseline period:

from datetime import datetime, timedelta

def trend_momentum(stories, days_recent=7, days_baseline=30):
    now = datetime.now()
    recent_cutoff = now - timedelta(days=days_recent)
    baseline_cutoff = now - timedelta(days=days_baseline)
    recent, baseline = Counter(), Counter()

    for story in stories:
        ts = datetime.fromtimestamp(story.get('time', 0))
        terms = {t for t in TECH_TERMS
                 if re.search(r'\b' + t + r'\b', story['title'].lower())}
        if ts >= recent_cutoff:
            for t in terms: recent[t] += 1
        elif ts >= baseline_cutoff:
            for t in terms: baseline[t] += 1

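    momentum = {}
    for term in TECH_TERMS:
        if not recent[term] and not baseline[term]:
            continue  # never mentioned; skip so it isn't reported as cooling
        r = recent[term] / days_recent
        b = baseline[term] / (days_baseline - days_recent) or 0.001
        momentum[term] = r / b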

    return (sorted(momentum.items(), key=lambda x: -x[1])[:5],
            sorted(momentum.items(), key=lambda x: x[1])[:5])

Step 3: Build the Dashboard

def print_dashboard(stories):
    trends = extract_trends(stories)
    rising, falling = trend_momentum(stories)

    print("=" * 50)
    print("  HN TECH TREND DASHBOARD")
    print("=" * 50)

    print("\nTOP TECHNOLOGIES (by weighted mentions):")
    for i, (term, score) in enumerate(trends[:10], 1):
        bar = "#" * min(score // 100, 30)
        print(f"  {i:2d}. {term:12s} {bar} ({score})")

    print("\nRISING:")
    for term, ratio in rising:
        if ratio > 1.2:
            print(f"  ^ {term} ({ratio:.1f}x baseline)")

    print("\nCOOLING:")
    for term, ratio in falling:
        if ratio < 0.8:
            print(f"  v {term} ({ratio:.1f}x baseline)")

Collecting Real Data

For production, here are your options:

HN Official API - Free, but slow for bulk. Good for under 50 stories.
HN Algolia API - Search-oriented, great for keyword queries.
Web scraping - Fragile, breaks when HN changes markup.
Pre-built scrapers - Tools like the Apify HN scraper handle pagination, rate limits, and comment tree traversal.

import requests

def fetch_hn_stories(query="python", hits=100):
    url = "https://hn.algolia.com/api/v1/search"
    params = {"query": query, "tags": "story", "hitsPerPage": hits}
    resp = requests.get(url, params=params, timeout=10)
    resp.raise_for_status()  # fail loudly on HTTP errors
    return resp.json()["hits"]
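Algolia hits do not use the same field names as the story dictionaries in the earlier steps, so they need a small mapping before they can be passed to extract_trends or trend_momentum. The sketch below assumes the hit fields points, num_comments, and created_at_i; treat those names as assumptions and check them against a live response. The normalize_hit helper is hypothetical.

```python
def normalize_hit(hit):
    """Map one Algolia search hit onto the story shape used in this tutorial."""
    return {
        "title": hit.get("title") or "",
        "score": hit.get("points") or 0,
        "num_comments": hit.get("num_comments") or 0,
        "time": hit.get("created_at_i") or 0,  # Unix epoch seconds
        "comments": [],  # the search endpoint does not return comment bodies
    }

# Made-up hit, purely for illustration
example = normalize_hit({
    "title": "Show HN: A tiny SQLite tool",
    "points": 120,
    "num_comments": None,
    "created_at_i": 1700000000,
})
print(example["score"], example["num_comments"])
```

With that in place, something like `stories = [normalize_hit(h) for h in fetch_hn_stories("rust")]` should feed the dashboard functions, though only titles are analyzed because no comment text is included.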

What You Can Build From Here

Weekly email digest of trending tech topics
Investment signal - track which technologies gain developer mindshare
Content planning tool - write about what devs are actively discussing

The code runs in under a second on a few hundred stories. For continuous monitoring, throw it in a cron job with a SQLite database. Happy trend hunting!

How I Built a Hacker News Trend Detector Using Only Public Data

FairPrice — Wed, 11 Mar 2026 21:07:14 +0000

Every day, thousands of stories compete for the Hacker News front page. I wanted to detect trending topics before they blow up — using nothing but public APIs and a bit of Python.

Here's how I built a simple HN trend detector, and what I learned about the data along the way.

The HN Firebase API

Hacker News exposes a public API hosted on Firebase. No auth needed. The two endpoints that matter:

import requests

# Get current top 500 story IDs
top_ids = requests.get("https://hacker-news.firebaseio.com/v0/topstories.json").json()

# Get details for a single story
story = requests.get(f"https://hacker-news.firebaseio.com/v0/item/{top_ids[0]}.json").json()
print(story["title"], story["score"], story["descendants"])  # descendants = comment count

This gives you title, score, author, timestamp, comment count, and URL for every item. Simple, but powerful.

Detecting Velocity, Not Just Score

A story with 300 points after 12 hours is stale. A story with 80 points after 40 minutes is exploding. The key metric is points per hour:

import time

def velocity(story):
    age_hours = (time.time() - story["time"]) / 3600
    if age_hours < 0.1:
        return 0  # too new to judge
    return story["score"] / age_hours

# Fetch top 30 stories and rank by velocity
stories = []
for sid in top_ids[:30]:
    s = requests.get(f"https://hacker-news.firebaseio.com/v0/item/{sid}.json").json()
    if not s or "score" not in s:
        continue  # deleted items come back as null
    s["velocity"] = velocity(s)
    stories.append(s)

stories.sort(key=lambda s: s["velocity"], reverse=True)

for s in stories[:10]:
    print(f"{s['velocity']:.1f} pts/hr | {s['score']} pts | {s['title']}")

Sample output:

142.3 pts/hr | 87 pts | Show HN: I made a tool that...
 98.1 pts/hr | 203 pts | The hidden cost of...
 67.4 pts/hr | 312 pts | Why we switched from...

In my data, stories with high velocity in their first 1-2 hours usually climbed to the top of the front page.

Adding Comment Sentiment

HN comments are gold. A story with 200 points but hostile comments won't last. Rather than running full sentiment analysis, I use a simple comment-to-score ratio as a proxy:

def engagement_ratio(story):
    comments = story.get("descendants", 0)
    if story["score"] == 0:
        return 0
    return comments / story["score"]

Ratios above 1.5 usually mean controversy. Below 0.3 means people upvote but don't discuss — often link-heavy or self-explanatory content. The sweet spot (0.5-1.2) indicates genuine interest.
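Those thresholds can be turned into a small labeling helper. This is only a sketch: the label names are made up for illustration, and ratios that fall between the stated bands (0.3 to 0.5 and 1.2 to 1.5) are left unclassified because the rule of thumb above does not cover them.

```python
def classify_engagement(ratio):
    """Label a comments-per-point ratio using the rough bands described above."""
    if ratio > 1.5:
        return "controversial"
    if ratio < 0.3:
        return "upvote-only"
    if 0.5 <= ratio <= 1.2:
        return "sweet spot"
    return "unclassified"

for r in (2.0, 0.1, 0.8, 0.4):
    print(r, classify_engagement(r))
```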

Tracking Topics Over Time

To spot trends across days, I store snapshots in SQLite:

import sqlite3
from datetime import datetime, timezone

db = sqlite3.connect("hn_trends.db")
db.execute("""CREATE TABLE IF NOT EXISTS snapshots (
    id INTEGER, title TEXT, score INTEGER,
    comments INTEGER, velocity REAL,
    captured_at TEXT
)""")

# One UTC timestamp per snapshot, in the "YYYY-MM-DD HH:MM:SS" format that
# SQLite's datetime() emits, so date filters compare like with like.
captured_at = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
for s in stories:
    db.execute(
        "INSERT INTO snapshots VALUES (?,?,?,?,?,?)",
        (s["id"], s["title"], s["score"],
         s.get("descendants", 0), s["velocity"],
         captured_at)
    )
db.commit()

Run this every 30 minutes via cron, and after a week you can query for patterns:

-- Stories captured in 3+ snapshots this week
SELECT title, COUNT(*) AS appearances, MAX(score) AS peak_score
FROM snapshots
WHERE captured_at > datetime('now', '-7 days')
GROUP BY id
HAVING appearances >= 3
ORDER BY peak_score DESC;
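If you would rather run that query from Python than from the sqlite3 shell, a minimal sketch might look like this. The recurring_stories helper and the in-memory demo rows are made up for illustration; the table layout matches the snapshots schema above.

```python
import sqlite3

QUERY = """
SELECT id, title, COUNT(*) AS appearances, MAX(score) AS peak_score
FROM snapshots
WHERE captured_at > datetime('now', '-7 days')
GROUP BY id
HAVING COUNT(*) >= ?
ORDER BY peak_score DESC
"""

def recurring_stories(db, min_appearances=3):
    """Return stories seen in at least min_appearances snapshots this week."""
    return db.execute(QUERY, (min_appearances,)).fetchall()

# Demo against an in-memory database with made-up rows
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE snapshots (id INTEGER, title TEXT, score INTEGER, "
    "comments INTEGER, velocity REAL, captured_at TEXT)"
)
rows = [(1, "Story A", s, 0, 0.0) for s in (10, 40, 90)]
rows.append((2, "Story B", 500, 0, 0.0))
for r in rows:
    db.execute("INSERT INTO snapshots VALUES (?,?,?,?,?, datetime('now'))", r)

print(recurring_stories(db))
```

Passing the threshold as a query parameter makes it easy to loosen or tighten the definition of "recurring" without editing the SQL.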

Scaling Up: When the Firebase API Isn't Enough

The Firebase API is great for real-time data, but it has limits:

No bulk export (you fetch one item at a time)
No historical data beyond current top/new/best lists
Comment trees require recursive fetching (slow for 500+ comment threads)

If you need structured historical data or full comment threads at scale, Apify has several HN scrapers that handle the heavy lifting. I've been using HN Top Stories Scraper which pulls structured data including full comment threads — useful when you want to analyze discussion patterns without writing your own recursive crawler.

The Full Pipeline

My production setup:

Cron job every 30 min → fetches top 100 via Firebase API
Velocity calculator flags stories above 50 pts/hr
SQLite storage for historical analysis
Weekly digest email with recurring topics and velocity outliers

Total code: ~120 lines of Python. No ML, no fancy NLP. Just velocity math and some SQL.

What I Found

After running this for a few weeks:

AI/LLM stories consistently hit the highest velocities (80+ pts/hr)
Show HN posts have the best engagement ratios
Stories posted between 9-11am ET get 2x the velocity of evening posts
The comment-to-score ratio reliably predicts whether a story stays on the front page

The full code is straightforward enough to run on any $5 VPS. If you're interested in HN data analysis, start with the Firebase API — it's surprisingly capable for a free, unauthenticated endpoint.

What patterns have you noticed on HN? Drop a comment if you've built something similar.

I Built a Pay-Per-Result Hacker News Scraper on Apify

FairPrice — Wed, 11 Mar 2026 18:47:37 +0000

The Problem

I needed structured Hacker News data for a side project — trending stories, scores, comment counts. The HN API exists but requires pagination, filtering, and batch fetching logic.

So I built an Apify Actor that handles all of this and published it for free.

What It Does

HN Top Stories Scraper lets you:

Scrape top, new, best, ask, and show stories
Filter by minimum score, comment count, or keyword
Get up to 500 stories per run
Output as JSON, CSV, or connect to Google Sheets, Slack, Zapier

It uses the official HN Firebase API — no scraping, no proxies needed.

Example

Get the top 50 AI stories with 100+ upvotes:

{
  "count": 50,
  "type": "top",
  "minScore": 100,
  "keyword": "AI"
}

Returns:

{
  "id": 12345678,
  "title": "Show HN: AI tool that does X",
  "url": "https://example.com",
  "score": 342,
  "comments": 89,
  "author": "username",
  "hn_url": "https://news.ycombinator.com/item?id=12345678"
}

Use Cases

RSS replacement: Schedule runs to get stories as structured data
Competitor monitoring: Filter by your company name
Content curation: Feed into newsletters or Slack
Trend analysis: Track what gets high scores over time
Job monitoring: Scrape Who is Hiring threads

Pricing

Pay-per-result: ~$0.01 per 1,000 stories. Free tier available — no credit card needed.

Compare that to the $5-19/month flat-rate competitors charge.

Try It

https://apify.com/cryptosignals/hn-top-stories

Feedback welcome — this is my first published Actor.

3 Free Apify Actors for Scraping Bluesky, Substack, and Hacker News (No API Keys Needed)

FairPrice — Sat, 07 Mar 2026 04:55:27 +0000

I built 3 free scrapers for platforms that researchers and developers commonly need data from. All use pay-per-event pricing (free until March 21), no API keys required.

If you've ever needed to pull data from Bluesky, Substack, or Hacker News, you know the drill: write a custom script, handle pagination, deal with rate limits, parse HTML. These three Apify Actors handle all of that out of the box.

1. Bluesky Scraper

Link: Bluesky Scraper on Apify Store

What it does: Scrapes posts, user profiles, and search results from Bluesky via the AT Protocol.

Why Bluesky: The AT Protocol is fully open — no authentication tokens needed for public data. With 30M+ users and growing, Bluesky is becoming a primary data source for social media researchers and trend analysts.

Example input:

{
  "searchTerms": ["web scraping", "data extraction"],
  "maxPosts": 100,
  "includeReplies": false
}

This pulls up to 100 posts matching your search terms. You can also scrape specific user profiles or full thread conversations.

2. Substack Scraper

Link: Substack Scraper on Apify Store

What it does: Scrapes newsletter posts, author metadata, and publication details from any public Substack.

Why Substack: Substack exposes an unofficial JSON API for public content — no auth required. This makes it straightforward to collect article text, subscriber counts, and publication metadata at scale.

Example input:

{
  "publicationUrls": [
    "https://platformer.news",
    "https://www.lennysnewsletter.com"
  ],
  "maxPostsPerPublication": 50
}

This scrapes the 50 most recent posts from each publication, including full article text, dates, likes, and author info.

3. Hacker News Scraper

Link: Hacker News Scraper on Apify Store

What it does: Scrapes stories, comments, and user profiles from Hacker News.

Why HN: Hacker News has an official Firebase API with no authentication and no published rate limit. The scraper wraps this into a structured output with filtering, sorting, and comment threading built in.

Example input:

{
  "scrapeType": "search",
  "searchQuery": "LLM fine-tuning",
  "maxItems": 200,
  "includeComments": true
}

This searches HN for stories about LLM fine-tuning and includes the full comment trees — useful for sentiment analysis or finding expert opinions.

Why Use These vs. Building Your Own?

Aspect	DIY Script	Apify Actor
Setup time	Hours to days	Minutes
Pagination	You handle it	Built-in
Output format	Whatever you code	JSON, CSV, Excel, or direct to your DB
Scheduling	Cron jobs on your server	Built-in scheduler on Apify
Proxy rotation	You manage it	Handled automatically
Maintenance	You fix it when the site changes	Actor updates handle it

If you need a one-off data pull, a DIY script works. If you need recurring scrapes, structured output, or you just don't want to spend a day writing pagination logic, these Actors save real time.

Try Them Out

All three are live on the Apify Store with free trials; the links are in each section above.

Each Actor runs on pay-per-event pricing. You get results as structured JSON, ready for analysis, storage, or piping into your data pipeline.

If you have questions or feature requests, drop a comment or open an issue on the Actor page. Happy scraping.

How to Scrape Substack Newsletters at Scale (No API Key Needed)

FairPrice — Sat, 07 Mar 2026 02:51:40 +0000

If you've ever tried to track what competitors publish on Substack, monitor newsletter trends, or build a dataset of authors in a specific niche — you've probably hit a wall. Substack has no official public API for third-party developers, and their content is rendered dynamically, making traditional scraping brittle.

The good news? Substack actually exposes structured JSON endpoints for every publication. In this tutorial, I'll show you these endpoints, explain the data they return, and demonstrate how to scrape Substack newsletters at scale using both code and a no-code solution.

Why Scrape Substack?

There are several legitimate reasons to collect public Substack data:

Competitor research: Track what topics rival newsletters cover, how often they publish, and what resonates with readers
Content monitoring: Stay on top of publications in your industry without manually subscribing to dozens of newsletters
Lead generation: Build lists of active newsletter authors in a niche for partnership outreach
Market analysis: Understand which Substack categories are growing, which authors are gaining subscribers, and what pricing models work
Academic research: Study the newsletter economy, media trends, or content patterns at scale

Substack's Hidden JSON API

Every Substack publication exposes data through predictable URL patterns. Here are the key endpoints:

Post archive

https://{publication}.substack.com/api/v1/archive?sort=new&limit=12&offset=0

This returns a JSON array of recent posts with titles, subtitles, slugs, post dates, word counts, and more.
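The limit and offset parameters suggest plain offset pagination. Here is a minimal sketch of walking the archive, assuming the endpoint returns an empty array once the offset passes the last post. The function names, the one-second delay, and the injectable fetch parameter are illustrative choices, not documented Substack behavior.

```python
import json
import time
import urllib.request

def fetch_archive_page(publication, limit, offset):
    """Fetch one page of the archive endpoint shown above."""
    url = (f"https://{publication}.substack.com/api/v1/archive"
           f"?sort=new&limit={limit}&offset={offset}")
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)

def iter_archive(publication, max_posts=50, limit=12, delay=1.0,
                 fetch=fetch_archive_page):
    """Yield posts newest-first until max_posts is reached or a page is empty."""
    offset = 0
    yielded = 0
    while yielded < max_posts:
        page = fetch(publication, limit, offset)
        if not page:
            break
        for post in page:
            yield post
            yielded += 1
            if yielded >= max_posts:
                return
        offset += len(page)
        time.sleep(delay)  # pause between pages to avoid hammering the server
```

Usage would look like `for post in iter_archive("platformer", max_posts=30): print(post["title"])`, with the title key being another assumption about the response shape.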

Author profile

https://substack.com/api/v1/user/{author_id}

Returns the author's name, bio, photo URL, and linked publications.

Post details

https://{publication}.substack.com/api/v1/posts/{slug}

Returns full post content including HTML body, comments count, likes, and metadata.

Search publications

https://substack.com/api/v1/publication/search?query={keyword}

Search across all Substack publications by keyword.

The Problem with DIY Scraping

While these endpoints are accessible, building a production-grade scraper involves handling:

Rate limiting: Substack will throttle or block aggressive requests
Pagination: Most endpoints return paginated results that need iterative fetching
Error handling: Transient failures, changed publication slugs, deleted posts
Data normalization: Raw JSON varies between publications and post types
Proxy rotation: Necessary for scraping at any real scale

This is where a managed solution saves significant development time.

The Easy Way: Substack Scraper on Apify

I built a Substack Newsletter Scraper on Apify that handles all of the above. It extracts posts, author profiles, and publication metadata from any Substack newsletter using the public JSON API — no authentication needed.

What it extracts

Post titles, subtitles, body text (HTML and plain text)
Publication dates, word counts, reading time
Author name, bio, profile image
Comment and reaction counts
Publication name, description, subscriber info
Cover images and canonical URLs

How to use it

Go to apify.com/cryptosignals/substack-scraper
Enter one or more Substack publication URLs
Set your desired post limit
Click "Start" and download the results as JSON, CSV, or Excel

No code required. But if you prefer programmatic access, read on.

Code Examples

Python (using Apify API)

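import time

import requests

API_TOKEN = "your_apify_api_token"
# The REST API expects "username~actor-name" in the URL path, not "username/actor-name"
ACTOR_ID = "cryptosignals~substack-scraper"

# Start the actor run
run = requests.post(
    f"https://api.apify.com/v2/acts/{ACTOR_ID}/runs",
    params={"token": API_TOKEN},
    json={
        "urls": [
            "https://platformer.substack.com",
            "https://www.lennysnewsletter.com"
        ],
        "maxPosts": 50
    },
    timeout=30,
).json()

run_id = run["data"]["id"]
print(f"Run started: {run_id}")

# Poll until the run reaches a terminal state (simplified)
TERMINAL = ("SUCCEEDED", "FAILED", "ABORTED", "TIMED-OUT")
while True:
    status = requests.get(
        f"https://api.apify.com/v2/actor-runs/{run_id}",
        params={"token": API_TOKEN},
        timeout=30,
    ).json()
    if status["data"]["status"] in TERMINAL:
        break
    time.sleep(5)

# Stop here if the run did not finish cleanly
if status["data"]["status"] != "SUCCEEDED":
    raise RuntimeError(f"Run ended with status {status['data']['status']}")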

# Get results
dataset_id = status["data"]["defaultDatasetId"]
results = requests.get(
    f"https://api.apify.com/v2/datasets/{dataset_id}/items",
    params={"token": API_TOKEN}
).json()

for post in results:
    print(f"{post.get('title')} — {post.get('post_date')}")

JavaScript / Node.js (using Apify Client)

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'your_apify_api_token' });

const run = await client.actor('cryptosignals/substack-scraper').call({
    urls: [
        'https://platformer.substack.com',
        'https://www.lennysnewsletter.com'
    ],
    maxPosts: 50,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();

items.forEach(post => {
    console.log(`${post.title} — ${post.post_date}`);
});

Using the Apify CLI

# Install the Apify CLI
npm install -g apify-cli

# Run the scraper
apify call cryptosignals/substack-scraper \
  -i '{"urls": ["https://platformer.substack.com"], "maxPosts": 20}'

Output Example

Each scraped post returns structured data like this:

{
    "title": "Why Google is losing the AI race",
    "subtitle": "A deep dive into search disruption",
    "slug": "why-google-is-losing-the-ai-race",
    "post_date": "2026-02-28T12:00:00.000Z",
    "word_count": 2450,
    "comment_count": 87,
    "reaction_count": 342,
    "author_name": "Casey Newton",
    "publication_name": "Platformer",
    "canonical_url": "https://platformer.substack.com/p/why-google-is-losing-the-ai-race",
    "body_text": "Full text content here...",
    "body_html": "<p>Full HTML content here...</p>"
}

Use Cases and Ideas

Here are some practical things you can build with this data:

Newsletter competitive dashboard: Track multiple publications and compare posting frequency, engagement, and topic coverage
Content calendar tool: Aggregate posts from newsletters you follow into a single timeline
Trend analysis: Run NLP on post titles and bodies to identify emerging topics
Author database: Build a searchable directory of newsletter writers in specific niches
RSS alternative: Create custom feeds from Substack publications with filtering and alerts
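As a starting point for the competitive dashboard idea, here is a small sketch that summarizes scraped posts per publication. It assumes the field names from the output example above (publication_name, post_date, reaction_count); the publication_stats helper is hypothetical and the posts-per-week figure is a rough estimate based on the span between the oldest and newest post.

```python
from collections import defaultdict
from datetime import datetime

def publication_stats(posts):
    """Summarize post count, average reactions, and posting rate per publication."""
    grouped = defaultdict(list)
    for p in posts:
        grouped[p["publication_name"]].append(p)

    stats = {}
    for name, items in grouped.items():
        # Dates arrive as ISO 8601 with a trailing "Z"; normalize for fromisoformat
        dates = [datetime.fromisoformat(p["post_date"].replace("Z", "+00:00"))
                 for p in items]
        span_days = max((max(dates) - min(dates)).days, 1)
        stats[name] = {
            "posts": len(items),
            "avg_reactions": sum(p.get("reaction_count", 0) for p in items) / len(items),
            "posts_per_week": round(len(items) / span_days * 7, 2),
        }
    return stats

# Made-up posts, purely for illustration
sample_posts = [
    {"publication_name": "Platformer", "post_date": "2026-02-21T12:00:00.000Z", "reaction_count": 100},
    {"publication_name": "Platformer", "post_date": "2026-02-28T12:00:00.000Z", "reaction_count": 300},
]
print(publication_stats(sample_posts))
```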

Wrapping Up

Substack's public JSON endpoints make it surprisingly accessible for data collection. Whether you use the raw API endpoints for small-scale projects or the Substack Newsletter Scraper on Apify for production workloads, you now have the tools to extract newsletter data at scale.

If you found this useful, give the Apify actor a try — there's a free tier that lets you test it without any commitment.

Have questions or want to see more scraping tutorials? Drop a comment below.
