GitHub Trending to Product Ideas: Automated Market Signal Pipeline
Indie makers have a reliable problem: the gap between "I could build something" and "I know what to build." The usual advice is to scratch your own itch, which is fine when you have one. When you don't, or when you have built the itch-scratcher and are looking for idea number two, the gap becomes real. You end up doom-scrolling Twitter for product ideas, which is exactly the wrong input channel — Twitter surfaces opinions, not signals.
The better inputs are places where people publicly announce what they are actually working on. GitHub trending shows what open-source projects are gaining stars this week — a leading indicator of what developers are adopting. Hacker News Show HN is the oldest and still best channel for early-stage project launches — pre-revenue, pre-PR, founder-posted. Reddit's r/SideProject and r/IndieHackers are where solo makers announce projects at the "I shipped something" stage, often with transparent metrics.
Each of these channels alone has high noise. r/SideProject in particular has a lot of dropshipping stores and NFT projects. GitHub trending is biased toward what GitHub's ranking algorithm favors, which over-indexes AI tooling right now. Hacker News Show HN has survivorship bias — what hits the front page is not representative of what gets posted.
The way through is to pull all three, dedupe aggressively (the same project often hits all three in the same week), classify by category, and produce a weekly digest that you actually read. This post walks through building that pipeline and the patterns you should look for once you have it running.
Grounding Numbers
For calibration: GitHub Octoverse 2024 counts over 518 million total projects on GitHub, and roughly 2-3k repositories hit trending each week across all languages. Trending is algorithmic — based on star velocity, relative to the repo's recent history — so a repo with 50 stars yesterday and 500 today can trend despite being small in absolute terms.
Hacker News had 27 million comments and 5.4 million stories posted in 2024 (YCombinator's internal stats, surfaced via the HN API). Show HN accounts for roughly 8% of story volume, meaning ~1,200 Show HN posts per week. Of those, approximately 150-200 reach the front page; the rest disappear to the /show listing.
Reddit's r/SideProject has around 385k members as of April 2026 and averages 45 posts per day (about 315 per week). r/IndieHackers is smaller, around 135k members and 15 posts per day.
Combined, the three channels surface approximately 3,000-4,000 project posts per week, of which maybe 800-1,200 are unique after deduplication and roughly 200-400 are genuinely worth a second look after category classification and noise filtering. That is a manageable weekly read — if you can get it into a clean digest rather than three separate firehoses.
Why This Is Hard
Four reasons a naive "just subscribe to the RSS feeds" approach falls apart.
Duplication across sources. A launching maker posts on all three channels the same week. If you read three feeds you see the same project three times and your brain learns to skip everything. Dedupe needs to happen upstream.
Categorization is required to make signal visible. "500 new projects this week" is useless. "37 new developer-tooling projects, 24 new AI-agent frameworks, 18 new Chrome extensions" is a readable digest.
GitHub trending rewards certain behaviors. Spam organizations that stars-bomb their own repos, viral-in-the-moment repos with shallow content, and "awesome-list" style aggregator repos all clutter the top of trending. Filtering requires looking at repo age, commit history, and contributor diversity.
Show HN and r/SideProject posts often lack structured metadata. A GitHub trending row has clean fields (stars, language, description). A Reddit or HN post is a title + URL + comment thread. To extract "what is this thing" you need either the destination URL or an LLM pass over the title.
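When the post body is thin, one workable fallback is to fetch the destination URL and use its <title> tag and meta description as the desc field. A rough sketch; the regex parsing is deliberately crude, and a production pass would use a real HTML parser and rate-limit per host:
import re
import requests

def fetch_page_summary(url, timeout=8):
    """Crude fallback: pull <title> and meta description from the destination page."""
    try:
        headers = {"User-Agent": "signal-digest/0.1"}
        html = requests.get(url, timeout=timeout, headers=headers).text
    except requests.RequestException:
        return {"title": None, "desc": None}
    title = re.search(r"<title[^>]*>(.*?)</title>", html, re.I | re.S)
    desc = re.search(
        r"<meta[^>]+name=['\"]description['\"][^>]+content=['\"](.*?)['\"]",
        html, re.I | re.S,
    )
    return {
        "title": title.group(1).strip() if title else None,
        "desc": desc.group(1).strip() if desc else None,
    }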
Architecture
Three scrapers fan in, one dedupe + classify step, one digest step:
[weekly cron]
|
+---> [github-trending-scraper] ----+
| (weekly, all languages) |
| |
+---> [hacker-news-scraper] --------+
| (Show HN, past 7 days) |
| |
+---> [reddit-scraper] -------------+
(r/SideProject, r/IndieHackers,|
past 7 days, score > 10) |
v
[unified row schema]
(title, url, source,
score/stars, date, desc)
|
v
[dedupe by URL + title]
|
v
[LLM categorize]
(dev-tools, AI, chrome-ext,
saas, content, open-source-lib,...)
|
v
[score and rank]
(novelty × traction × category fit)
|
v
[markdown digest]
(top 5 per category,
full list appendix)
Three Apify actors — github-trending-scraper, hacker-news-scraper, and reddit-scraper — handle the three data sources. Categorization and digest generation run in a single Python script after the scrapers finish. The full pipeline runs weekly on an Apify schedule and completes in under 15 minutes for a typical week's signal volume.
Code: Pull a Week of Signals
import os
from datetime import datetime, timedelta

from apify_client import ApifyClient

# Read the API token from the environment instead of hardcoding it
client = ApifyClient(os.environ["APIFY_TOKEN"])

# Look back over the past 7 days
since = (datetime.utcnow() - timedelta(days=7)).isoformat()
# GitHub trending — weekly window
gh_run = client.actor("nexgendata/github-trending-scraper").call(run_input={
"time_range": "weekly",
"languages": ["all", "python", "typescript", "rust", "go"],
"max_repos_per_language": 100,
})
gh = list(client.dataset(gh_run["defaultDatasetId"]).iterate_items())
# Hacker News Show HN
hn_run = client.actor("nexgendata/hacker-news-scraper").call(run_input={
"sections": ["show"],
"since": since,
"min_score": 10,
})
hn = list(client.dataset(hn_run["defaultDatasetId"]).iterate_items())
# Reddit
reddit_run = client.actor("nexgendata/reddit-scraper").call(run_input={
"subreddits": ["SideProject", "IndieHackers"],
"since": since,
"min_score": 10,
"sort": "top",
})
reddit = list(client.dataset(reddit_run["defaultDatasetId"]).iterate_items())
print(f"GitHub trending: {len(gh)} repos")
print(f"HN Show HN: {len(hn)} posts")
print(f"Reddit: {len(reddit)} posts")
Typical output for a recent week:
GitHub trending: 412 repos
HN Show HN: 147 posts
Reddit: 286 posts
Total raw: 845 signals
Normalize and Dedupe
The three sources have different schemas. Normalize before deduping:
from urllib.parse import urlparse
def normalize_gh(r):
return {
"title": r["name"],
"url": r["url"],
"source": "github",
"score": r["stars_this_period"],
"desc": r.get("description", ""),
"date": r["trending_date"],
"lang": r.get("language"),
}
def normalize_hn(p):
return {
"title": p["title"].replace("Show HN: ", ""),
"url": p.get("url") or f"https://news.ycombinator.com/item?id={p['id']}",
"source": "hn",
"score": p["score"],
"desc": p.get("text", "")[:500],
"date": p["time"],
"lang": None,
}
def normalize_reddit(p):
return {
"title": p["title"],
"url": p.get("url") or f"https://reddit.com{p['permalink']}",
"source": f"reddit/{p['subreddit']}",
"score": p["score"],
"desc": p.get("selftext", "")[:500],
"date": p["created_utc"],
"lang": None,
}
rows = (
[normalize_gh(r) for r in gh]
+ [normalize_hn(p) for p in hn]
+ [normalize_reddit(p) for p in reddit]
)
# Dedupe by canonicalized destination URL and by normalized title prefix (first 60 chars)
def canon(u):
p = urlparse(u)
host = p.netloc.lower().replace("www.", "")
path = p.path.rstrip("/")
return f"{host}{path}"
seen_urls = set()
seen_titles = set()
deduped = []
for r in rows:
key_u = canon(r["url"])
key_t = r["title"].lower().strip()[:60]
if key_u in seen_urls or key_t in seen_titles:
continue
seen_urls.add(key_u)
seen_titles.add(key_t)
deduped.append(r)
print(f"After dedupe: {len(deduped)} unique signals")
Typical dedupe rate: 20-25% drop. Out of ~845 raw signals you get ~640 unique.
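Exact prefix matching misses near-duplicate titles such as "fastlog" on GitHub versus "Fastlog, a faster tail -f" on Reddit. If that bothers you, a fuzzy pass with difflib from the standard library is a cheap upgrade; the 0.85 threshold below is an assumption to tune, not a measured value:
from difflib import SequenceMatcher

def similar_title_seen(title, seen, threshold=0.85):
    # O(n) scan per item; fine for a few hundred weekly signals
    t = title.lower().strip()
    return any(SequenceMatcher(None, t, s).ratio() >= threshold for s in seen)

fuzzy_seen = set()
fuzzy_deduped = []
for r in deduped:
    if similar_title_seen(r["title"], fuzzy_seen):
        continue
    fuzzy_seen.add(r["title"].lower().strip())
    fuzzy_deduped.append(r)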
Categorize with an LLM
Category taxonomy is a design choice. The one below works for indie-maker consumption; adjust for your use case.
import openai, json
CATEGORIES = [
"dev-tools", "ai-agents", "chrome-extensions", "mobile-apps",
"saas-b2b", "consumer-apps", "open-source-libraries",
"content-creators", "education", "data-viz", "games", "other",
]
def categorize_batch(items, batch_size=30):
results = []
for i in range(0, len(items), batch_size):
batch = items[i:i+batch_size]
prompt = (
f"Classify each project into one of: {', '.join(CATEGORIES)}. "
f"Return JSON array of strings, same length and order.\n\n"
+ "\n".join(f"{j}. {r['title']} — {r['desc'][:200]}" for j, r in enumerate(batch))
)
resp = openai.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": prompt}],
response_format={"type": "json_object"},
)
        labels = json.loads(resp.choices[0].message.content).get("categories", [])
        # The model occasionally returns too few labels; pad with "other"
        # so zip() does not silently drop the tail of the batch
        labels += ["other"] * (len(batch) - len(labels))
        for r, label in zip(batch, labels):
            r["category"] = label
            results.append(r)
return results
categorized = categorize_batch(deduped)
Cost: about $0.50 per week for ~640 items on gpt-4o-mini, roughly an order of magnitude cheaper than classifying the same batches with a larger model.
Score and Digest
from collections import defaultdict
by_cat = defaultdict(list)
for r in categorized:
by_cat[r.get("category", "other")].append(r)
digest_lines = [f"# Weekly Product Ideas Digest — {datetime.utcnow():%Y-%m-%d}\n"]
for cat in sorted(by_cat, key=lambda c: -len(by_cat[c])):
items = sorted(by_cat[cat], key=lambda r: -r["score"])[:5]
if not items:
continue
digest_lines.append(f"\n## {cat} ({len(by_cat[cat])} total)\n")
for r in items:
digest_lines.append(f"- **[{r['title']}]({r['url']})** — {r['source']} ({r['score']}) — {r['desc'][:140]}")
digest = "\n".join(digest_lines)
print(digest)
A typical digest looks like:
# Weekly Product Ideas Digest — 2026-04-12
## ai-agents (89 total)
- **[agent-dsl](https://github.com/...)** — github (1203) — Lightweight DSL for multi-step agent orchestration, MIT licensed...
- **[Show HN: I built an AI recruiter that...](https://news.ycombinator.com/...)** — hn (287) — Pre-screening tool...
...
## dev-tools (63 total)
- **[fastlog](https://github.com/...)** — github (812) — Rust-based replacement for tail -f with regex...
...
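The architecture sketch ranks by novelty × traction × category fit, while the digest code above sorts on raw score alone within each category. If you want the composite ranking, here is a minimal sketch; the weights, the log damping, and the category preferences are illustrative assumptions to tune:
import math

# Illustrative category preferences; adjust to your own interests
CATEGORY_FIT = {"dev-tools": 1.0, "content-creators": 1.0, "ai-agents": 0.5, "other": 0.2}

def composite_score(row, seen_before):
    novelty = 0.3 if canon(row["url"]) in seen_before else 1.0  # penalize repeats from earlier weeks
    traction = math.log1p(row["score"])                         # damp 10k-star outliers
    fit = CATEGORY_FIT.get(row.get("category", "other"), 0.6)   # default for unlisted categories
    return novelty * traction * fit

# seen_before would come from the sqlite store discussed in the FAQ; empty on a first run
ranked = sorted(categorized, key=lambda r: -composite_score(r, seen_before=set()))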
Worked Example: Founder Scanning for Idea #2
A solo founder shipped her first app (a Notion-to-blog exporter, small but profitable). She wants to find her next project. Constraints: small enough to ship solo in 3 months, big enough to hit $5k MRR, not directly competing with VC-funded incumbents.
She runs the pipeline weekly for 8 weeks. What she notices:
- The "chrome-extensions" category consistently shows 15-25 new projects per week, but almost all of them get minimal traction. Category is crowded.
- The "dev-tools" category shows recurring themes: log-reading tools, cron dashboards, env-variable managers. Every week another small take on the same problems. This suggests the space is active but unresolved — opportunity.
- The "saas-b2b" category is dominated by AI-wrapped CRM and email tools. Too crowded.
- "content-creators" is a surprise — a cluster of simple podcast-editing, newsletter-analytics, and YouTube-tag-research tools keep appearing with $1k-5k MRR screenshots posted by makers on r/SideProject. Underserved, low competition, visible willingness to pay.
She picks a creator-focused pain point: "where is my newsletter being shared?" Builds it in six weeks. Hits $3k MRR within three months of launch. The pipeline did not pick the idea — she did — but it narrowed the search space from "all of tech" to "underserved creator tooling with visible traction from small makers."
This is the realistic outcome. Market-signal pipelines do not hand you a billion-dollar idea. They do shift your input diet from "what Twitter is angry about this week" to "what people are actually building and what they are actually paying for."
Gotchas
Things that will skew the output:
GitHub trending over-indexes AI and JavaScript right now. Roughly 40% of the trending list in any given week is some variant of AI tooling or an LLM wrapper. Your categorizer will surface this honestly; do not mistake category volume for opportunity. A crowded category is a warning signal.
Show HN has survivorship bias. Posts that hit the front page are the ones that got early upvotes, which correlates with existing founder network more than with product quality. Consider pulling the full /show listing, not just min_score >= 10, if you want a less biased sample.
r/SideProject has a lot of dropshipping and crypto. Filter by keyword, or accept the noise and let the categorizer mark them other. Do not drop the subreddit entirely; the signal-to-noise is still worth it.
Dedupe by URL misses redirected links. A project posted with a bit.ly on Reddit and the full URL on HN will not dedupe. The actor resolves redirects; a roll-your-own pipeline needs a HEAD pass on every URL (see the sketch after this list).
Titles are often clickbait. "I built a thing that 10x'd my SaaS" tells you nothing about category. The categorizer needs the desc field too; do not categorize on titles alone.
LLM categorization drifts. Run the same batch twice and expect 3-5% of items to move between adjacent categories. For a weekly read this is fine. For longitudinal analysis, pin the model version and set temperature to 0.
Weekend spikes. Reddit activity spikes on weekends; GitHub trending is flatter. If you run the pipeline on Sunday you will get more Reddit noise; run it Monday morning for cleaner output.
International projects. Some genuinely interesting projects are posted on Hacker News Japan or Chinese dev forums, not in the sources above. This pipeline is English/Western-centric. That is a limitation, not a bug.
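The redirect gotcha has a cheap partial fix: resolve known shortener links with a HEAD request before canonicalizing, at the cost of one extra network round trip per shortened URL. A sketch; the shortener list is illustrative, not exhaustive:
import requests
from urllib.parse import urlparse

SHORTENERS = {"bit.ly", "t.co", "tinyurl.com", "buff.ly"}  # extend as you encounter more

def resolve_if_shortened(url, timeout=5):
    host = urlparse(url).netloc.lower().replace("www.", "")
    if host not in SHORTENERS:
        return url
    try:
        # HEAD with allow_redirects follows the chain without downloading the page body
        return requests.head(url, allow_redirects=True, timeout=timeout).url
    except requests.RequestException:
        return url  # keep the original if the shortener is unreachable

# In the dedupe loop: key_u = canon(resolve_if_shortened(r["url"]))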
FAQ
How often should I run this?
Weekly is the natural cadence. Daily produces too much noise; monthly loses the freshness signal that makes GitHub trending useful.
Can I add my own sources?
Yes. ProductHunt is a natural fourth source — the launcher's intent is different (more polished, more marketing) but the signal is complementary. Some makers add lobste.rs, /r/programming, or specific Discord dumps.
What about LinkedIn or Twitter?
LinkedIn is closed and anti-scraping; not worth the fight. Twitter (X) is partially scrapable but the signal-to-noise is terrible — people post opinions, not projects. We stick to launch-oriented channels.
How do I avoid re-reading the same project across weeks?
Persist (source, url) keys in a local sqlite. Filter new runs against seen keys. A project that genuinely sustains traction for multiple weeks is worth seeing twice anyway, so consider a 2-week decay.
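A minimal sketch of that persistence layer, reusing canon() and the deduped list from the pipeline above; the table and column names are arbitrary:
import sqlite3
import time

db = sqlite3.connect("seen_signals.db")
db.execute(
    "CREATE TABLE IF NOT EXISTS seen "
    "(source TEXT, url TEXT, first_seen REAL, PRIMARY KEY (source, url))"
)

def is_new(source, url):
    row = db.execute("SELECT 1 FROM seen WHERE source = ? AND url = ?", (source, url)).fetchone()
    return row is None

def mark_seen(source, url):
    db.execute("INSERT OR IGNORE INTO seen VALUES (?, ?, ?)", (source, url, time.time()))

# Two-week decay: forget old entries so sustained traction resurfaces
db.execute("DELETE FROM seen WHERE first_seen < ?", (time.time() - 14 * 86400,))

fresh = [r for r in deduped if is_new(r["source"], canon(r["url"]))]
for r in fresh:
    mark_seen(r["source"], canon(r["url"]))
db.commit()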
Can I use this for competitive monitoring?
It is not the right tool. For "what are my specific competitors shipping" you want GitHub repo watchers, ProductHunt launch tracking, and changelog scrapes. This pipeline is a top-of-funnel idea feed, not a competitor radar.
What about false positives from stars-for-sale repos?
GitHub's trending algorithm has some de-weighting of suspicious star patterns, but it's imperfect. Sanity check: repos with 5000+ stars in 24 hours and fewer than 5 contributors are usually inflated. The github-trending-scraper returns contributor count; filter on that.
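A sanity filter along those lines might look like the sketch below. stars_this_period appears in the scraper output used earlier; contributors_count is an assumed field name, so check the actor's output schema before relying on it:
def looks_inflated(repo):
    stars = repo.get("stars_this_period", 0)
    contributors = repo.get("contributors_count") or 0  # assumed field name
    return stars >= 5000 and contributors < 5

gh = [r for r in gh if not looks_inflated(r)]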
How do I connect this to my own note-taking?
Drop the markdown digest into Obsidian, Notion, or plain files. Some users wire the actor output directly into a Slack channel via a scheduled webhook — a digest that shows up Monday morning without any action on their part.
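The Slack path is a few lines with an incoming webhook; the webhook URL below is a placeholder for your own:
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

# Incoming webhooks accept a JSON payload with a "text" field;
# trim defensively since Slack rejects very long messages
resp = requests.post(SLACK_WEBHOOK_URL, json={"text": digest[:30000]}, timeout=10)
resp.raise_for_status()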
What's the total cost of running this weekly?
Apify actor credits: ~$2-4/week. LLM categorization: ~$0.50/week on gpt-4o-mini. Total well under $20/month for a genuinely useful idea-discovery stream.
Conclusion
Indie-maker idea generation is a data problem, not an inspiration problem. The raw inputs — GitHub trending, Show HN, r/SideProject — are all public, all structured, all accessible. The hard part is fanning them in, deduping, and categorizing into a digest you will actually read on Monday morning.
Three Apify actors handle the scraping; a hundred lines of Python handle the normalize + dedupe + categorize + digest. Total cost under $20/month. The output shifts your top-of-funnel from Twitter opinion firehoses to actual builder activity — which is the input you wanted in the first place.
Start with the github-trending-scraper, hacker-news-scraper, and reddit-scraper on Apify. Schedule them weekly, pipe into your digest, and read on Mondays.