Developers constantly ask questions like:
- “What tech is trending right now?”
- “Why do some GitHub repos go viral?”
- “How do I find project ideas devs actually want?”
- “Which months are best for launching tools?”
The truth?
You can guess… OR you can use real data from HackerNews + GitHub and answer these questions with actual evidence.
In this tutorial, I’ll walk you through a practical, real-world workflow to analyze:
✅ What kinds of repos go viral
✅ Which technologies are rising
✅ Seasonal patterns in open-source launches
✅ How to spot ideas early
✅ How to forecast future trends
And yes — all of this becomes 10x easier if you’re using my cleaned dataset of 17,900+ HackerNews→GitHub repo submissions, split by month.
If you want to follow along with the same dataset I use in this tutorial,
you can grab it here:
👉 Grab it here
1. Why HackerNews → GitHub Data Is So Useful
Most “tech trend” predictions are based on vibes.
But HackerNews links to GitHub repos are different:
- They come directly from developers
- They represent real usage or real curiosity
- They show what devs think is worth sharing
- They capture early-stage signals before mainstream coverage
- They are timestamped → perfect for trend timelines
- They show actual projects, not news articles
This makes them perfect for:
- Trend forecasting
- Product idea generation
- Competitive research
- Launch strategy
- Side project discovery
- ML training
- Market analysis for indie founders
If you want to analyze these patterns easily,
👉 Grab the dataset here
2. Load the Dataset (CSV Example)
Let’s start with a simple workflow.
import pandas as pd
df = pd.read_csv("2024-01.csv")
df.head()
Your columns:
- title
- github_link
- submitted_date
If you’re using the multi-format monthly dataset,
just pick the month you want from the folders.
You can follow along using the same structured files:
👉 Grab them here
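If you do have the monthly folders, you can stack every file into one DataFrame instead of loading them one by one. Here's a self-contained sketch; the `hn_data/` folder name and the two demo files are stand-ins for wherever you unpacked the dataset:

```python
import glob
import os

import pandas as pd

# Tiny demo setup: write two fake monthly files (replace with the real dataset folder)
os.makedirs("hn_data", exist_ok=True)
pd.DataFrame({"title": ["Show HN: Foo"], "github_link": ["https://github.com/a/foo"],
              "submitted_date": ["2024-01-05"]}).to_csv("hn_data/2024-01.csv", index=False)
pd.DataFrame({"title": ["Show HN: Bar"], "github_link": ["https://github.com/b/bar"],
              "submitted_date": ["2024-02-10"]}).to_csv("hn_data/2024-02.csv", index=False)

# Load every monthly CSV (sorted so months stay in order) and stack them
frames = [pd.read_csv(path) for path in sorted(glob.glob("hn_data/*.csv"))]
df_all = pd.concat(frames, ignore_index=True)
print(df_all.shape)  # (2, 3) for the two demo files
```

Every analysis below works the same way on `df_all` as on a single month.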
3. Extract Programming Languages Automatically
A great first analysis is seeing which languages dominate HackerNews.
Here’s a quick and dirty language detector:
def detect_language(title):
    title = title.lower()
    if "rust" in title: return "Rust"
    if "python" in title: return "Python"
    # " go " with spaces avoids false hits like "going" or "algorithm"
    if "golang" in title or " go " in title: return "Go"
    if "javascript" in title or "js" in title: return "JavaScript"
    if "typescript" in title: return "TypeScript"
    if "c++" in title or "cpp" in title: return "C++"
    return "Other"
df["language"] = df["title"].apply(detect_language)
df["language"].value_counts()
Result: A real breakdown of which languages are getting attention.
This is extremely powerful for:
- choosing a language for your next open-source project
- picking topics for blog posts or YouTube videos
- forecasting future dev movements
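When you outgrow the quick-and-dirty version, a keyword map with word-boundary regexes scales better and cuts false positives. This is my own sketch, not part of the dataset; the keyword lists are assumptions you should tune:

```python
import re

# Each language maps to regex patterns; \b word boundaries stop "going" matching "go".
# C++ gets no trailing \b because "+" is not a word character.
LANG_PATTERNS = {
    "Rust": [r"\brust\b"],
    "Python": [r"\bpython\b"],
    "Go": [r"\bgolang\b", r"\bgo\b"],
    "JavaScript": [r"\bjavascript\b", r"\bjs\b"],
    "TypeScript": [r"\btypescript\b"],
    "C++": [r"\bc\+\+", r"\bcpp\b"],
}

def detect_language_v2(title):
    for lang, patterns in LANG_PATTERNS.items():
        if any(re.search(p, title, flags=re.IGNORECASE) for p in patterns):
            return lang
    return "Other"

print(detect_language_v2("Show HN: A fast JSON parser in Rust"))  # Rust
print(detect_language_v2("Going serverless with my blog"))        # Other
```

Adding a language is now one dictionary entry, and you can swap it into the same `df["title"].apply(...)` call as before.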
To run this across all monthly folders, you’ll want the full dataset:
👉 Grab it here
4. Find Which Repo Topics Go Viral Most Often
Let’s look at titles that contain topic keywords:
topics = ["AI", "CLI", "framework", "compiler", "database",
"LLM", "serverless", "infra", "debugger", "tool"]
def detect_topic(title):
    matches = [t for t in topics if t.lower() in title.lower()]
    return ", ".join(matches) if matches else "Other"
df["topics"] = df["title"].apply(detect_topic)
df["topics"].value_counts()
You will immediately see patterns like:
- AI tools exploding
- Infra tooling outperforming web frameworks
- Debugging utilities consistently performing well
- Compilers experiencing periodic spikes
This is pure gold for anyone trying to build a product or open-source project.
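One catch: because `detect_topic` joins multiple matches into one string, `value_counts()` counts combos like "AI, CLI" as their own category. Splitting and exploding counts each topic on its own. A self-contained sketch with three made-up titles standing in for the real column:

```python
import pandas as pd

# Hypothetical sample of HN titles (stand-in for the real dataset column)
titles = pd.Series([
    "Show HN: An AI CLI for your database",
    "A tiny compiler in 500 lines",
    "Fast LLM inference tool",
])

topics = ["AI", "CLI", "framework", "compiler", "database",
          "LLM", "serverless", "infra", "debugger", "tool"]

def detect_topic(title):
    matches = [t for t in topics if t.lower() in title.lower()]
    return ", ".join(matches) if matches else "Other"

# Split the comma-joined matches and explode so each topic is counted per title
topic_counts = (titles.apply(detect_topic)
                      .str.split(", ")
                      .explode()
                      .value_counts())
print(topic_counts)
```

On the full dataset this gives you one clean frequency per topic instead of a long tail of combinations.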
5. Analyze Seasonal Patterns (Why January & September Matter)
Developers think tech trends are random.
They’re not.
There are strong seasonal patterns:
df['submitted_date'] = pd.to_datetime(df['submitted_date'])
df['month'] = df['submitted_date'].dt.month
df.groupby("month").size()
You will see:
🔥 January → Massive spike (new-year side projects)
🔥 September → Another spike (post-summer reboot)
🧊 April & July → Lowest months (burnout & vacations)
This is extremely useful if you:
- plan to launch an open-source repo
- want to release a product update
- want to publish a blog or newsletter
- want to maximize GitHub stars
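With more than one year of data, a month-by-year table makes the seasonality jump out: read down a column for one year's rhythm, across a row to compare the same month year over year. The six dates below are made up purely for illustration:

```python
import pandas as pd

# Hypothetical submissions spanning two years (stand-in for the multi-year dataset)
df = pd.DataFrame({"submitted_date": [
    "2023-01-05", "2023-01-20", "2023-07-02",
    "2024-01-11", "2024-09-03", "2024-09-15",
]})
df["submitted_date"] = pd.to_datetime(df["submitted_date"])
df["year"] = df["submitted_date"].dt.year
df["month"] = df["submitted_date"].dt.month

# Rows = months, columns = years, cells = submission counts
seasonal = df.groupby(["month", "year"]).size().unstack("year", fill_value=0)
print(seasonal)
```

If the January and September spikes are real, they'll show up in every column, not just one lucky year.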
To analyze this across years, you need access to multiple folders:
👉 Grab the dataset here
6. Build a “Viral Repo Predictor” (Simple ML Example)
You can even train a lightweight model to predict whether a repo might go viral based on title patterns.
Example using TF-IDF:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(df["title"])
# simulated "viral" label: True if the title contains hype keywords
# (swap in real HN points as the label if you have them)
df["viral"] = df["title"].str.contains("AI|LLM|tool|fast|open-source", case=False)
model = LogisticRegression()
model.fit(X, df["viral"])
model.predict(X[:5])
Now you can:
- score new repo titles
- evaluate launch names
- find high-performing keywords
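Scoring a candidate title means reusing the fitted vectorizer and calling `predict_proba`. Here's a self-contained sketch with a toy training set; in a real run the six titles and hand-picked labels are replaced by thousands of rows from the dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical mini training set (real run: thousands of historical HN titles)
titles = [
    "Show HN: Fast AI tool for code review",
    "Open-source LLM playground",
    "My personal blog about cooking",
    "Notes from my vacation",
    "An AI debugger, open-source and fast",
    "Photos of my garden",
]
viral = [True, True, False, False, True, False]  # simulated labels, as above

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(titles)
model = LogisticRegression().fit(X, viral)

# Score candidate launch titles: higher probability = more "viral-looking" wording
candidates = ["Blazing fast AI tool", "My trip to the mountains"]
scores = model.predict_proba(vectorizer.transform(candidates))[:, 1]
for title, score in zip(candidates, scores):
    print(f"{score:.2f}  {title}")
```

Note the key detail: use `vectorizer.transform` (not `fit_transform`) on new titles, so they're mapped into the vocabulary the model was trained on.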
None of this works without thousands of historical titles, which is exactly what the dataset gives you.
7. Build Your Own GitHub Trends Dashboard (Beginner-Friendly)
Here’s a simple visualization to get started:
import matplotlib.pyplot as plt
df["language"].value_counts().plot(kind="bar")
plt.title("Language Distribution for This Month’s Popular Repos")
plt.show()
Or a timeline:
df.groupby(df['submitted_date'].dt.to_period('M')).size().plot()
plt.title("Number of GitHub Repos Shared on HN Over Time")
plt.show()
These dashboards help you:
- spot rising languages
- see hype cycles
- identify long-term trends
- find dev communities to tap into
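Raw monthly counts are noisy, so a rolling mean helps separate real trends from one-off spikes. The twelve counts below are invented for the demo; in practice you'd feed in the `groupby` result from the timeline above:

```python
import pandas as pd

# Hypothetical monthly submission counts (stand-in for df.groupby(...).size())
counts = pd.Series(
    [80, 95, 70, 60, 75, 90, 55, 65, 100, 85, 80, 120],
    index=pd.period_range("2024-01", periods=12, freq="M"),
)

# A centered 3-month rolling mean smooths single-month spikes so trends show
smooth = counts.rolling(window=3, center=True).mean()
print(smooth.round(1))

# To chart it:  smooth.plot(); plt.title("Smoothed HN->GitHub submissions"); plt.show()
```

Plot the raw and smoothed series together and hype spikes become easy to tell apart from sustained growth.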
You can only do this properly with multi-year monthly data:
👉 Grab your copy here
8. Generate Project Ideas Using the Data
One of the best uses of this dataset is idea generation.
Try this:
df["title"].sample(20)
Instant inspiration.
Even better: cluster the titles:
from sklearn.cluster import KMeans
X = vectorizer.fit_transform(df["title"])
kmeans = KMeans(n_clusters=10, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
df["cluster"] = labels
df.groupby("cluster").head(3)
This reveals:
- trending categories
- tech gaps
- unserved niches
- high-interest areas
- repeating patterns
Perfect for indie hackers.
9. Summary: Why This Workflow Matters
This tutorial barely scratches the surface of what's possible:
- trend forecasting
- competitor analysis
- NLP models
- launch timing optimization
- idea generation
- content planning
- GitHub ecosystem research
- open-source strategy
And having a clean, multi-year dataset turns all of this from “theoretical” to “extremely practical.”
If you want to use the exact dataset this tutorial is based on,
you can grab it here: