Developers constantly ask questions like:
- “What tech is trending right now?”
- “Why do some GitHub repos go viral?”
- “How do I find project ideas devs actually want?”
- “Which months are best for launching tools?”
The truth?
You can guess… OR you can use real data from HackerNews + GitHub and answer these questions with actual evidence.
In this tutorial, I’ll walk you through a practical, real-world workflow to analyze:
✅ What kinds of repos go viral
✅ Which technologies are rising
✅ Seasonal patterns in open-source launches
✅ How to spot ideas early
✅ How to forecast future trends
And yes — all of this becomes 10x easier if you’re using my cleaned dataset of 17,900+ HackerNews→GitHub repo submissions, split by month.
If you want to follow along with the same dataset I use in this tutorial,
you can grab it here:
👉 Grab it here
1. Why HackerNews → GitHub Data Is So Useful
Most “tech trend” predictions are based on vibes.
But HackerNews links to GitHub repos are different:
- They come directly from developers
- They represent real usage or real curiosity
- They show what devs think is worth sharing
- They capture early-stage signals before mainstream coverage
- They are timestamped → perfect for trend timelines
- They show actual projects, not news articles
This makes them perfect for:
- Trend forecasting
- Product idea generation
- Competitive research
- Launch strategy
- Side project discovery
- ML training
- Market analysis for indie founders
If you want to analyze these patterns easily,
👉 Grab the dataset here
2. Load the Dataset (CSV Example)
Let’s start with a simple workflow.
import pandas as pd
df = pd.read_csv("2024-01.csv")
df.head()
Your columns:
- title
- github_link
- submitted_date
If you’re using the multi-format monthly dataset,
just pick the month you want from the folders.
You can follow along using the same structured files:
👉 Grab them here
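If you do have the monthly folders, you can stack every file into one DataFrame instead of loading them one by one. Here's a self-contained sketch; the `hn_data/` folder name and the two demo files are stand-ins for wherever you unpacked the dataset:

```python
import glob
import os

import pandas as pd

# Tiny demo setup: write two fake monthly files (replace with the real dataset folder)
os.makedirs("hn_data", exist_ok=True)
pd.DataFrame({"title": ["Show HN: Foo"], "github_link": ["https://github.com/a/foo"],
              "submitted_date": ["2024-01-05"]}).to_csv("hn_data/2024-01.csv", index=False)
pd.DataFrame({"title": ["Show HN: Bar"], "github_link": ["https://github.com/b/bar"],
              "submitted_date": ["2024-02-10"]}).to_csv("hn_data/2024-02.csv", index=False)

# Load every monthly CSV (sorted so months stay in order) and stack them
frames = [pd.read_csv(path) for path in sorted(glob.glob("hn_data/*.csv"))]
df_all = pd.concat(frames, ignore_index=True)
print(df_all.shape)  # (2, 3) for the two demo files
```

Every analysis below works the same way on `df_all` as on a single month.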
3. Extract Programming Languages Automatically
A great first analysis is seeing which languages dominate HackerNews.
Here’s a quick and dirty language detector:
def detect_language(title):
    title = title.lower()
    if "rust" in title: return "Rust"
    if "python" in title: return "Python"
    # " go " with spaces avoids false hits like "going" or "algorithm"
    if "golang" in title or " go " in title: return "Go"
    if "javascript" in title or "js" in title: return "JavaScript"
    if "typescript" in title: return "TypeScript"
    if "c++" in title or "cpp" in title: return "C++"
    return "Other"
df["language"] = df["title"].apply(detect_language)
df["language"].value_counts()
Result: A real breakdown of which languages are getting attention.
This is extremely powerful for:
- choosing a language for your next open-source project
- picking topics for blog posts or YouTube videos
- forecasting future dev movements
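When you outgrow the quick-and-dirty version, a keyword map with word-boundary regexes scales better and cuts false positives. This is my own sketch, not part of the dataset; the keyword lists are assumptions you should tune:

```python
import re

# Each language maps to regex patterns; \b word boundaries stop "going" matching "go".
# C++ gets no trailing \b because "+" is not a word character.
LANG_PATTERNS = {
    "Rust": [r"\brust\b"],
    "Python": [r"\bpython\b"],
    "Go": [r"\bgolang\b", r"\bgo\b"],
    "JavaScript": [r"\bjavascript\b", r"\bjs\b"],
    "TypeScript": [r"\btypescript\b"],
    "C++": [r"\bc\+\+", r"\bcpp\b"],
}

def detect_language_v2(title):
    for lang, patterns in LANG_PATTERNS.items():
        if any(re.search(p, title, flags=re.IGNORECASE) for p in patterns):
            return lang
    return "Other"

print(detect_language_v2("Show HN: A fast JSON parser in Rust"))  # Rust
print(detect_language_v2("Going serverless with my blog"))        # Other
```

Adding a language is now one dictionary entry, and you can swap it into the same `df["title"].apply(...)` call as before.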
To run this across all monthly folders, you’ll want the full dataset:
👉 Grab it here
4. Find Which Repo Topics Go Viral Most Often
Let’s look at titles that contain topic keywords:
topics = ["AI", "CLI", "framework", "compiler", "database",
"LLM", "serverless", "infra", "debugger", "tool"]
def detect_topic(title):
    matches = [t for t in topics if t.lower() in title.lower()]
    return ", ".join(matches) if matches else "Other"
df["topics"] = df["title"].apply(detect_topic)
df["topics"].value_counts()
You will immediately see patterns like:
- AI tools exploding
- Infra tooling outperforming web frameworks
- Debugging utilities consistently performing well
- Compilers experiencing periodic spikes
This is pure gold for anyone trying to build a product or open-source project.
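One catch: because `detect_topic` joins multiple matches into one string, `value_counts()` counts combos like "AI, CLI" as their own category. Splitting and exploding counts each topic on its own. A self-contained sketch with three made-up titles standing in for the real column:

```python
import pandas as pd

# Hypothetical sample of HN titles (stand-in for the real dataset column)
titles = pd.Series([
    "Show HN: An AI CLI for your database",
    "A tiny compiler in 500 lines",
    "Fast LLM inference tool",
])

topics = ["AI", "CLI", "framework", "compiler", "database",
          "LLM", "serverless", "infra", "debugger", "tool"]

def detect_topic(title):
    matches = [t for t in topics if t.lower() in title.lower()]
    return ", ".join(matches) if matches else "Other"

# Split the comma-joined matches and explode so each topic is counted per title
topic_counts = (titles.apply(detect_topic)
                      .str.split(", ")
                      .explode()
                      .value_counts())
print(topic_counts)
```

On the full dataset this gives you one clean frequency per topic instead of a long tail of combinations.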
5. Analyze Seasonal Patterns (Why January & September Matter)
Developers think tech trends are random.
They’re not.
There are strong seasonal patterns:
df['submitted_date'] = pd.to_datetime(df['submitted_date'])
df['month'] = df['submitted_date'].dt.month
df.groupby("month").size()
You will see:
🔥 January → Massive spike (new-year side projects)
🔥 September → Another spike (post-summer reboot)
🧊 April & July → Lowest months (burnout & vacations)
This is extremely useful if you:
- plan to launch an open-source repo
- want to release a product update
- want to publish a blog or newsletter
- want to maximize GitHub stars
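With more than one year of data, a month-by-year table makes the seasonality jump out: read down a column for one year's rhythm, across a row to compare the same month year over year. The six dates below are made up purely for illustration:

```python
import pandas as pd

# Hypothetical submissions spanning two years (stand-in for the multi-year dataset)
df = pd.DataFrame({"submitted_date": [
    "2023-01-05", "2023-01-20", "2023-07-02",
    "2024-01-11", "2024-09-03", "2024-09-15",
]})
df["submitted_date"] = pd.to_datetime(df["submitted_date"])
df["year"] = df["submitted_date"].dt.year
df["month"] = df["submitted_date"].dt.month

# Rows = months, columns = years, cells = submission counts
seasonal = df.groupby(["month", "year"]).size().unstack("year", fill_value=0)
print(seasonal)
```

If the January and September spikes are real, they'll show up in every column, not just one lucky year.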
To analyze this across years, you need access to multiple folders:
👉 Grab the dataset here
6. Build a “Viral Repo Predictor” (Simple ML Example)
You can even train a lightweight model to predict whether a repo might go viral based on title patterns.
Example using TF-IDF:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(df["title"])
# simulated "viral" label: True if the title contains hype keywords
# (swap in real HN points as the label if you have them)
df["viral"] = df["title"].str.contains("AI|LLM|tool|fast|open-source", case=False)
model = LogisticRegression()
model.fit(X, df["viral"])
model.predict(X[:5])
Now you can:
- score new repo titles
- evaluate launch names
- find high-performing keywords
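Scoring a candidate title means reusing the fitted vectorizer and calling `predict_proba`. Here's a self-contained sketch with a toy training set; in a real run the six titles and hand-picked labels are replaced by thousands of rows from the dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical mini training set (real run: thousands of historical HN titles)
titles = [
    "Show HN: Fast AI tool for code review",
    "Open-source LLM playground",
    "My personal blog about cooking",
    "Notes from my vacation",
    "An AI debugger, open-source and fast",
    "Photos of my garden",
]
viral = [True, True, False, False, True, False]  # simulated labels, as above

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(titles)
model = LogisticRegression().fit(X, viral)

# Score candidate launch titles: higher probability = more "viral-looking" wording
candidates = ["Blazing fast AI tool", "My trip to the mountains"]
scores = model.predict_proba(vectorizer.transform(candidates))[:, 1]
for title, score in zip(candidates, scores):
    print(f"{score:.2f}  {title}")
```

Note the key detail: use `vectorizer.transform` (not `fit_transform`) on new titles, so they're mapped into the vocabulary the model was trained on.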
None of this works without thousands of historical titles, which is exactly what the dataset gives you.
7. Build Your Own GitHub Trends Dashboard (Beginner-Friendly)
Here’s a simple visualization to get started:
import matplotlib.pyplot as plt
df["language"].value_counts().plot(kind="bar")
plt.title("Language Distribution for This Month’s Popular Repos")
plt.show()
Or a timeline:
df.groupby(df['submitted_date'].dt.to_period('M')).size().plot()
plt.title("Number of GitHub Repos Shared on HN Over Time")
plt.show()
These dashboards help you:
- spot rising languages
- see hype cycles
- identify long-term trends
- find dev communities to tap into
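Raw monthly counts are noisy, so a rolling mean helps separate real trends from one-off spikes. The twelve counts below are invented for the demo; in practice you'd feed in the `groupby` result from the timeline above:

```python
import pandas as pd

# Hypothetical monthly submission counts (stand-in for df.groupby(...).size())
counts = pd.Series(
    [80, 95, 70, 60, 75, 90, 55, 65, 100, 85, 80, 120],
    index=pd.period_range("2024-01", periods=12, freq="M"),
)

# A centered 3-month rolling mean smooths single-month spikes so trends show
smooth = counts.rolling(window=3, center=True).mean()
print(smooth.round(1))

# To chart it:  smooth.plot(); plt.title("Smoothed HN->GitHub submissions"); plt.show()
```

Plot the raw and smoothed series together and hype spikes become easy to tell apart from sustained growth.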
You can only do this properly with multi-year monthly data:
👉 Grab your copy here
8. Generate Project Ideas Using the Data
One of the best uses of this dataset is idea generation.
Try this:
df["title"].sample(20)
Instant inspiration.
Even better: cluster the titles:
from sklearn.cluster import KMeans
X = vectorizer.fit_transform(df["title"])
kmeans = KMeans(n_clusters=10, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
df["cluster"] = labels
df.groupby("cluster").head(3)
This reveals:
- trending categories
- tech gaps
- unserved niches
- high-interest areas
- repeating patterns
Perfect for indie hackers.
9. Summary: Why This Workflow Matters
This tutorial barely scratches the surface of what's possible:
- trend forecasting
- competitor analysis
- NLP models
- launch timing optimization
- idea generation
- content planning
- GitHub ecosystem research
- open-source strategy
And having a clean, multi-year dataset turns all of this from “theoretical” to “extremely practical.”
If you want to use the exact dataset this tutorial is based on,
you can grab it here: