Gittech

How to Analyze Developer Trends Using HackerNews + GitHub Data (Step-by-Step Tutorial)

Developers constantly ask questions like:

  • “What tech is trending right now?”
  • “Why do some GitHub repos go viral?”
  • “How do I find project ideas devs actually want?”
  • “Which months are best for launching tools?”

The truth?
You can guess… OR you can use real data from HackerNews + GitHub and answer these questions with actual evidence.

In this tutorial, I’ll walk you through a practical, real-world workflow to analyze:

✅ What kinds of repos go viral
✅ Which technologies are rising
✅ Seasonal patterns in open-source launches
✅ How to spot ideas early
✅ How to forecast future trends

And yes — all of this becomes 10x easier if you’re using my cleaned dataset of 17,900+ HackerNews→GitHub repo submissions, split by month.

If you want to follow along with the same dataset I use in this tutorial,
you can grab it here:
👉 Grab it here


1. Why HackerNews → GitHub Data Is So Useful

Most “tech trend” predictions are based on vibes.
But HackerNews links to GitHub repos are different:

  • They come directly from developers
  • They represent real usage or real curiosity
  • They show what devs think is worth sharing
  • They capture early-stage signals before mainstream coverage
  • They are timestamped → perfect for trend timelines
  • They show actual projects, not news articles

This makes them perfect for:

  • Trend forecasting
  • Product idea generation
  • Competitive research
  • Launch strategy
  • Side project discovery
  • ML training
  • Market analysis for indie founders

If you want to analyze these patterns easily,
👉 Grab the dataset here


2. Load the Dataset (CSV Example)

Let’s start with a simple workflow.

import pandas as pd

df = pd.read_csv("2024-01.csv")
df.head()

Your columns:

  • title
  • github_link
  • submitted_date
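Since every row carries a github_link, you can also pull out the owner/repo slug, which is handy for grouping by author or org later. A quick sketch, using a tiny inline sample in place of the real CSV:

```python
import re

import pandas as pd

# Tiny inline sample standing in for the real CSV columns
df = pd.DataFrame({
    "title": ["Show HN: FastTool", "MyLib 2.0"],
    "github_link": [
        "https://github.com/alice/fasttool",
        "https://github.com/bob/mylib/tree/main",
    ],
})

def extract_repo(url):
    # Capture the "owner/repo" part of a GitHub URL, ignoring any trailing path
    m = re.search(r"github\.com/([^/]+/[^/?#]+)", url)
    return m.group(1) if m else None

df["repo"] = df["github_link"].apply(extract_repo)
print(df["repo"].tolist())  # → ['alice/fasttool', 'bob/mylib']
```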

If you’re using the multi-format monthly dataset,
just pick the month you want from the folders.
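To work with more than one month at a time, you can glob the monthly files and concatenate them into a single frame. A sketch, assuming the files keep the YYYY-MM.csv naming used above (the snippet writes two tiny stand-in files first so it runs anywhere; with the real dataset, skip that part and glob your downloaded folder instead):

```python
import glob

import pandas as pd

# Stand-in files so the snippet is self-contained
pd.DataFrame({"title": ["A"], "github_link": ["https://github.com/x/a"],
              "submitted_date": ["2024-01-05"]}).to_csv("2024-01.csv", index=False)
pd.DataFrame({"title": ["B"], "github_link": ["https://github.com/y/b"],
              "submitted_date": ["2024-02-09"]}).to_csv("2024-02.csv", index=False)

# Load every monthly file into one frame (assumes "YYYY-MM.csv" names)
files = sorted(glob.glob("20??-??.csv"))
df_all = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
print(len(df_all))  # one row per submission across all months
```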

You can follow along using the same structured files:
👉 Grab them here


3. Extract Programming Languages Automatically

A great first analysis is seeing which languages dominate HackerNews.

Here’s a quick and dirty language detector:

import re

def detect_language(title):
    # Word-boundary matches avoid false hits like "going" → Go or "json" → JS
    t = title.lower()
    if re.search(r"\brust\b", t): return "Rust"
    if re.search(r"\bpython\b", t): return "Python"
    if re.search(r"\bgo\b|\bgolang\b", t): return "Go"
    if re.search(r"\bjs\b|\bjavascript\b", t): return "JavaScript"
    if re.search(r"\btypescript\b", t): return "TypeScript"
    if re.search(r"c\+\+|\bcpp\b", t): return "C++"
    return "Other"

df["language"] = df["title"].apply(detect_language)
df["language"].value_counts()

Result: A real breakdown of which languages are getting attention.

This is extremely powerful for:

  • choosing a language for your next open-source project
  • picking topics for blog posts or YouTube videos
  • forecasting future dev movements
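Once you have a language column and more than one month of data, you can also track each language's share of submissions over time, which is where rising and falling trends actually show up. A sketch with a tiny inline sample standing in for the real frame:

```python
import pandas as pd

# Inline sample: language labels plus submission dates across two months
df = pd.DataFrame({
    "language": ["Rust", "Python", "Rust", "Rust", "Python", "Go"],
    "submitted_date": pd.to_datetime(
        ["2024-01-03", "2024-01-10", "2024-01-20",
         "2024-02-02", "2024-02-11", "2024-02-25"]),
})

# Share of each language per month (each row sums to 1.0)
share = pd.crosstab(df["submitted_date"].dt.to_period("M"),
                    df["language"], normalize="index")
print(share.round(2))
```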

To run this across all monthly folders, you’ll want the full dataset:
👉 Grab it here


4. Find Which Repo Topics Go Viral Most Often

Let’s look at titles that contain topic keywords:

import re

topics = ["AI", "CLI", "framework", "compiler", "database",
          "LLM", "serverless", "infra", "debugger", "tool"]

def detect_topic(title):
    # Match whole words so that e.g. "maintain" doesn't count as "AI"
    matches = [t for t in topics
               if re.search(rf"\b{re.escape(t)}\b", title, re.IGNORECASE)]
    return ", ".join(matches) if matches else "Other"

df["topics"] = df["title"].apply(detect_topic)
df["topics"].value_counts()

You will immediately see patterns like:

  • AI tools exploding
  • Infra tooling outperforming web frameworks
  • Debugging utilities consistently performing well
  • Compilers experiencing periodic spikes

This is pure gold for anyone trying to build a product or open-source project.
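One caveat: value_counts() on the joined string treats "AI, tool" and "AI" as separate buckets. To count each topic on its own, split and explode first. A sketch with an inline sample of the topics column:

```python
import pandas as pd

# Inline sample of the "topics" column produced by detect_topic above
df = pd.DataFrame({"topics": ["AI, tool", "AI", "database", "Other", "AI, LLM"]})

# One row per individual topic, then count
counts = (df["topics"].str.split(", ")
          .explode()
          .value_counts())
print(counts)
```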


5. Analyze Seasonal Patterns (Why January & September Matter)

Developers think tech trends are random.

They’re not.

There are strong seasonal patterns:

import glob

# Combine every monthly file first — a single month can't show seasonality
df = pd.concat((pd.read_csv(f) for f in glob.glob("20??-??.csv")),
               ignore_index=True)
df['submitted_date'] = pd.to_datetime(df['submitted_date'])
df['month'] = df['submitted_date'].dt.month

df.groupby("month").size()

You will see:

🔥 January → Massive spike (new-year side projects)
🔥 September → Another spike (post-summer reboot)
🧊 April & July → Lowest months (burnout & vacations)

This is extremely useful if you:

  • plan to launch an open-source repo
  • want to release a product update
  • want to publish a blog or newsletter
  • want to maximize GitHub stars
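To check that these spikes actually repeat year after year (and aren't a one-off), a year-by-month pivot makes the pattern easy to eyeball. A sketch with inline sample dates in place of the real column:

```python
import pandas as pd

# Inline sample of submission dates spanning two years
dates = pd.to_datetime([
    "2023-01-05", "2023-01-20", "2023-04-02",
    "2024-01-08", "2024-01-15", "2024-01-30", "2024-09-01",
])
df = pd.DataFrame({"submitted_date": dates})

# One row per year, one column per month, cells = submission counts
pivot = (df.assign(year=df["submitted_date"].dt.year,
                   month=df["submitted_date"].dt.month)
           .pivot_table(index="year", columns="month",
                        aggfunc="size", fill_value=0))
print(pivot)
```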

To analyze this across years, you need access to multiple folders:
👉 Grab the dataset here


6. Build a “Viral Repo Predictor” (Simple ML Example)

You can even train a lightweight model to predict whether a repo might go viral based on title patterns.

Example using TF-IDF:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(df["title"])

# Simulated label: treat titles with certain hot keywords as "viral".
# (If you have real engagement data, use points or comments as the label —
# a keyword-derived label is circular with keyword-based features.)
df["viral"] = df["title"].str.contains("AI|LLM|tool|fast|open-source", case=False)

model = LogisticRegression()
model.fit(X, df["viral"])

model.predict(X[:5])

Now you can:

  • score new repo titles
  • evaluate launch names
  • find high-performing keywords

This is impossible without thousands of historical titles, which is exactly what the dataset gives you.
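To actually score new candidate titles, transform them with the same fitted vectorizer and read off predict_proba. A self-contained sketch with a tiny stand-in corpus and keyword-based proxy labels (with the real data, reuse the vectorizer, model, and df["title"] from above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Stand-in corpus; with the real dataset use df["title"] as in section 6
titles = ["Fast AI tool for LLM apps", "An open-source AI debugger",
          "Notes on gardening", "My holiday photos", "LLM inference, fast",
          "Random thoughts", "AI tool benchmarks", "Cooking recipes"]
labels = [1, 1, 0, 0, 1, 0, 1, 0]  # keyword-based proxy labels, as above

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(titles)
model = LogisticRegression().fit(X, labels)

# Score brand-new candidate titles before you launch
new_titles = ["A fast AI tool for code review", "My travel diary"]
probs = model.predict_proba(vectorizer.transform(new_titles))[:, 1]
print(dict(zip(new_titles, probs.round(2))))
```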

👉 Grab it here


7. Build Your Own GitHub Trends Dashboard (Beginner-Friendly)

Here’s a simple visualization to get started:

import matplotlib.pyplot as plt

df["language"].value_counts().plot(kind="bar")
plt.title("Language Distribution for This Month’s Popular Repos")
plt.show()

Or a timeline:

df['submitted_date'] = pd.to_datetime(df['submitted_date'])
df.groupby(df['submitted_date'].dt.to_period('M')).size().plot()
plt.title("Number of GitHub Repos Shared on HN Over Time")
plt.show()

These dashboards help you:

  • spot rising languages
  • see hype cycles
  • identify long-term trends
  • find dev communities to tap into
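Monthly counts can be noisy, so a rolling mean makes the long-term trend easier to read before you plot it. A sketch with stand-in monthly counts in place of the real groupby result:

```python
import pandas as pd

# Inline sample: monthly submission counts (stand-in for the groupby above)
counts = pd.Series(
    [30, 45, 28, 20, 35, 50],
    index=pd.period_range("2024-01", periods=6, freq="M"),
)

# 3-month rolling average smooths out single-month noise
smoothed = counts.rolling(window=3).mean()
print(smoothed)
```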

You can only do this properly with multi-year monthly data:
👉 Grab your copy here


8. Generate Project Ideas Using the Data

One of the best uses of this dataset is idea generation.

Try this:

df["title"].sample(20)

Instant inspiration.

Even better: cluster the titles:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(df["title"])
# Fixed random_state so the clusters are reproducible between runs
kmeans = KMeans(n_clusters=10, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

df["cluster"] = labels
df.groupby("cluster").head(3)

This reveals:

  • trending categories
  • tech gaps
  • unserved niches
  • high-interest areas
  • repeating patterns

Perfect for indie hackers.


9. Summary: Why This Workflow Matters

This tutorial barely scratches the surface of what's possible:

  • trend forecasting
  • competitor analysis
  • NLP models
  • launch timing optimization
  • idea generation
  • content planning
  • GitHub ecosystem research
  • open-source strategy

And having a clean, multi-year dataset turns all of this from “theoretical” to “extremely practical.”

If you want to use the exact dataset this tutorial is based on,
you can grab it here:

👉 Grab it here



Top comments (4)

Snappy Tuts

I think it's worth trying to at least get some trends on SaaS products...

Gittech

Yes, I also try to use this dataset for that... maybe useful to you. ❤️

Rare Source

It's a great resource, thanks for sharing. I mailed you, can you check on the license details?

Gittech

Yep, thanks for the purchase though.. ❤️