I Analyzed 10,000 Hacker News Comments to Find What Makes a Post Go Viral

#python #programming #webdev #datascience

Last month, I scraped 10,000+ comments from Hacker News top stories to answer one question: what separates a 500-point post from a 5-point post?

Here's what the data revealed.

The Dataset

I used the HN Top Stories scraper on Apify to collect structured data from the front page over several weeks — titles, scores, comment counts, domains, and timestamps.

Quick setup:

from apify_client import ApifyClient

client = ApifyClient("YOUR_TOKEN")
run = client.actor("cryptosignals/hn-top-stories").call(
    run_input={"maxItems": 500}
)

stories = list(client.dataset(run["defaultDatasetId"]).iterate_items())

That gives you clean JSON with scores, comment counts, timestamps, and URLs — no BeautifulSoup required.

Finding #1: Comment Count Predicts Virality Better Than Score

I expected upvotes to be the key metric. Wrong.

Posts with 200+ comments had an average score of 487, while posts with 200+ upvotes but fewer than 50 comments averaged only 243.

Comments drive engagement loops. A controversial title gets people arguing, which pushes the post higher, which attracts more commenters. Score alone doesn't capture this.

Finding #2: The "Show HN" Advantage Is Real

Show HN posts that hit the front page had 2.3x more comments than regular posts at the same score level. The HN community rewards builders — but only if your project is genuinely useful.

The highest-performing Show HN posts shared three traits:

Solved a specific, common pain point
Had a live demo link
Were solo/small-team projects (not corporate launches)

Finding #3: Timing Matters Less Than You Think

Everyone says "post at 6am PT." The data tells a different story:

Time Window (PT)	Avg Score	Avg Comments
6-9 AM	142	67
9-12 PM	138	71
12-3 PM	127	63
6-9 PM	131	59

The difference between the best and worst window is only ~10%. Content quality dominates timing.

Finding #4: Title Length Sweet Spot

Posts with titles between 8-12 words scored 40% higher on average than those outside this range. Too short lacks context. Too long gets ignored.

The highest-scoring title pattern: "[Action verb] + [specific thing] + [surprising result]"

Examples: "I reverse-engineered the Spotify algorithm", "Why we moved from React to plain HTML"

Try It Yourself

The full dataset pipeline:

import pandas as pd
from apify_client import ApifyClient

client = ApifyClient("YOUR_TOKEN")
run = client.actor("cryptosignals/hn-top-stories").call(
    run_input={"maxItems": 1000}
)

df = pd.DataFrame(client.dataset(run["defaultDatasetId"]).iterate_items())

# Score vs comments correlation
print(f"Correlation: {df['score'].corr(df['commentCount']):.2f}")

# Best performing domains
print(df.groupby('domain')['score'].mean().sort_values(ascending=False).head(10))

You can run this for free on Apify's free tier (no credit card).

Get the HN scraper here — it returns structured JSON, handles pagination, and costs fractions of a cent per run.

What patterns have you noticed on HN? Drop a comment — I'd love to compare notes.

Recommended Tools for Web Scraping

If you're building scrapers at scale, these tools can save you hours of dealing with proxies, CAPTCHAs, and rate limits:

ScraperAPI — Handles proxy rotation, browser rendering, and CAPTCHAs automatically. Great if you don't want to manage your own proxy infrastructure. Comes with 5,000 free API credits to get started.
ScrapeOps — A proxy aggregator that routes your requests through 20+ proxy providers and picks the best one for each target site. Useful when you need reliability across different domains.