Linghua Jin

I Built a Real-Time HackerNews Trend Radar With AI (And It Runs Itself)

Every day, HackerNews quietly decides what the dev world will care about next.
But unless you're doom-scrolling it all day, you're missing the real signal: which topics are actually taking off right now, across threads and deep comment chains.

So instead of manually refreshing HN, I built a real-time "trend radar" on top of it:

  • Continuously ingests fresh HN stories and comments
  • Uses an LLM to extract structured topics (companies, tools, models, tech terms)
  • Streams everything into Postgres for instant querying like:
    • "What's trending on HN right now?"
    • "Which threads are driving the most hype for Claude / LangChain / Rust today?"

All of this runs as a declarative CocoIndex flow with incremental syncs, LLM-powered extraction, and simple query handlers.

In this post, you'll see how it works end-to-end and how you can fork it to track any community (Reddit, X, Discord, internal Slack, etc.).


Why HN Is a Goldmine (If You Can Structure It)

HackerNews is one of the strongest early signals for:

  • New tools and frameworks devs actually try
  • Which AI models/products are gaining mindshare
  • Real sentiment and feedback in the comments
  • Emerging startups and obscure libraries that might be big in 6-12 months

But raw HN has three problems:

  • Threads are noisy; comments are nested and messy
  • There's no notion of "topics" beyond free text
  • There's no built-in way to ask: "What's trending across the whole firehose?"

The HackerNews Trending Topics example in CocoIndex is essentially: "turn HN into a structured, continuously updating topics index that AI agents and dashboards can query in milliseconds."


Architecture: From HN Firehose to Queryable Topics

At a high level, the pipeline looks like this:

HackerNews API
    ↓
HackerNewsConnector (Custom Source)
    ├─ list() → thread IDs + updated_at
    ├─ get_value() → full threads + comments
    └─ provides_ordinal() → enables incremental sync
    ↓
CocoIndex Flow
    ├─ LLM topic extraction on threads + comments
    ├─ message_index collector (content)
    └─ topic_index collector (topics)
    ↓
Postgres
    ├─ hn_messages
    └─ hn_topics
    ↓
Query Handlers
    ├─ search_by_topic("Claude")
    ├─ get_trending_topics(limit=20)
    └─ get_threads_for_topic("Rust")

Key idea: separate discovery from fetching.

  • list() hits the HN Algolia search API to get lightweight metadata: thread IDs + updated_at timestamps.
  • get_value() only runs for threads whose updated_at changed, fetching full content + comments from the items API.
  • Ordinals (timestamps) let CocoIndex skip everything that hasn't changed, cutting API calls by >90% on subsequent syncs.

This is what enables "live mode" with a 30-second polling interval without melting APIs or your wallet.


Step 1: Turning HackerNews Into a First-Class Incremental Source

First, you define the data model for threads and comments.

import dataclasses
from datetime import datetime
from typing import NamedTuple

class _HackerNewsThreadKey(NamedTuple):
    thread_id: str

@dataclasses.dataclass
class _HackerNewsComment:
    id: str
    author: str | None
    text: str | None
    created_at: datetime | None

@dataclasses.dataclass
class _HackerNewsThread:
    author: str | None
    text: str
    url: str | None
    created_at: datetime | None
    comments: list[_HackerNewsComment]

Then you declare a SourceSpec that configures how to query HN:

class HackerNewsSource(SourceSpec):
    """Source spec for HackerNews API."""
    tag: str | None = None      # e.g. "story"
    max_results: int = 100      # hits per poll

The custom source connector wires this spec into actual HTTP calls:

  • list() → calls https://hn.algolia.com/api/v1/search_by_date with hitsPerPage=max_results, yields PartialSourceRow objects keyed by thread ID, with ordinals based on updated_at.
  • get_value() → calls https://hn.algolia.com/api/v1/items/{thread_id} and parses the full thread + nested comments into _HackerNewsThread and _HackerNewsComment.
  • provides_ordinal() → returns True so CocoIndex can do incremental sync.

CocoIndex handles the hard part: tracking ordinals and only re-pulling changed rows on each sync.
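The discovery half of this is easy to see in isolation. Here's a minimal, standalone sketch (not the example's actual connector code) of turning an Algolia search_by_date response into (thread_id, ordinal) pairs, where the ordinal is the updated_at timestamp as epoch seconds:

```python
from datetime import datetime

def hits_to_rows(algolia_payload: dict) -> list[tuple[str, int]]:
    """Turn an Algolia search_by_date response into (thread_id, ordinal) pairs.

    The ordinal is updated_at as epoch seconds, so the sync engine can skip
    any row whose ordinal hasn't advanced since the last poll.
    """
    rows = []
    for hit in algolia_payload.get("hits", []):
        thread_id = str(hit["objectID"])
        # Algolia returns ISO-8601 timestamps with a trailing "Z".
        updated = datetime.fromisoformat(hit["updated_at"].replace("Z", "+00:00"))
        rows.append((thread_id, int(updated.timestamp())))
    return rows
```

The connector's real list() wraps this kind of parsing in the HTTP call and yields CocoIndex row objects, but the incremental-sync contract is exactly this pair: a stable key plus a monotonically advancing ordinal.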


Step 2: Using an LLM to Extract Topics From Every Thread and Comment

Once the source is in the flow, the fun part starts: semantic enrichment.

You define a minimal Topic type that the LLM will fill:

@dataclasses.dataclass
class Topic:
    """
    A single topic extracted from text:
    - products, tools, frameworks
    - people, companies
    - domains (e.g. "vector search", "fintech")
    """
    topic: str

Inside the flow, every thread gets its topics extracted with a single declarative transform:

with data_scope["threads"].row() as thread:
    thread["topics"] = thread["text"].transform(
        cocoindex.functions.ExtractByLlm(
            llm_spec=cocoindex.LlmSpec(
                api_type=cocoindex.LlmApiType.OPENAI,
                model="gpt-4o-mini",
            ),
            output_type=list[Topic],
        )
    )

Same for comments:

with thread["comments"].row() as comment:
    comment["topics"] = comment["text"].transform(
        cocoindex.functions.ExtractByLlm(
            llm_spec=cocoindex.LlmSpec(
                api_type=cocoindex.LlmApiType.OPENAI,
                model="gpt-4o-mini",
            ),
            output_type=list[Topic],
        )
    )

Under the hood, CocoIndex:

  • Calls the LLM with a structured prompt and enforces output_type=list[Topic]
  • Normalizes messy free text into consistent topic strings
  • Makes this just another column in your flow instead of a separate glue script

This is what turns HN from "some text" into something an AI agent or SQL query can reason about.
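Normalization matters more than it looks: the LLM may emit "claude", "Claude ", and "CLAUDE" for the same thing, and a topic index only works if those land on one key. A tiny illustrative helper (an assumption about the kind of cleanup needed, not code from the example):

```python
def normalize_topic(raw: str) -> str:
    """Collapse whitespace and case variants so different mentions of the
    same topic land on one key in the topic index."""
    return " ".join(raw.split()).lower()
```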


Step 3: Indexing Into Postgres for Fast Topic Queries

All structured data is collected into two logical indexes:

  • message_index: threads + comments with their raw text and metadata
  • topic_index: individual topics linked back to messages

Collectors are declared once and then exported to Postgres:

message_index = data_scope.add_collector()
topic_index = data_scope.add_collector()

message_index.export(
    "hn_messages",
    cocoindex.targets.Postgres(),
    primary_key_fields=["id"],
)

topic_index.export(
    "hn_topics",
    cocoindex.targets.Postgres(),
    primary_key_fields=["topic", "message_id"],
)

Now you have two tables you can poke with SQL or via CocoIndex query handlers:

  • hn_messages: full-text search, content analytics, author stats
  • hn_topics: topic-level analytics, trend tracking, per-topic thread ranking
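As a taste of the topic-level analytics, here's a tiny self-contained sketch using SQLite in place of Postgres (the topic and message_id columns follow the primary keys declared above; the data and everything else are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hn_topics (topic TEXT, message_id TEXT);
    INSERT INTO hn_topics VALUES
        ('Rust', 'm1'), ('Rust', 'm2'), ('Claude', 'm3');
""")
# Count mentions per topic -- the core of any "what's trending" query.
rows = conn.execute("""
    SELECT topic, COUNT(*) AS mentions
    FROM hn_topics
    GROUP BY topic
    ORDER BY mentions DESC
    LIMIT 20
""").fetchall()
```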

Step 4: Query Handlers - From "Cool Pipeline" to Real Product

Here's where it stops being just a nice ETL project and becomes something you can actually ship.

4.1 search_by_topic(topic): "Show Me All Claude Mentions"

This query handler lets you search HN content by topic across threads and comments:

@hackernews_trending_topics_flow.query_handler()
def search_by_topic(topic: str) -> cocoindex.QueryOutput:
    topic_table = cocoindex.utils.get_target_default_name(
        hackernews_trending_topics_flow, "hn_topics"
    )
    message_table = cocoindex.utils.get_target_default_name(
        hackernews_trending_topics_flow, "hn_messages"
    )

    with connection_pool().connection() as conn:
        with conn.cursor() as cur:
            cur.execute(
                f"""
                SELECT m.id, m.thread_id, m.author, m.content_type,
                       m.text, m.created_at, t.topic
                FROM {topic_table} t
                JOIN {message_table} m ON t.message_id = m.id
                WHERE LOWER(t.topic) LIKE LOWER(%s)
                ORDER BY m.created_at DESC
                """,
                (f"%{topic}%",),
            )

            results = [
                {
                    "id": row[0],
                    "url": f"https://news.ycombinator.com/item?id={row[1]}",
                    "author": row[2],
                    "type": row[3],
                    "text": row[4],
                    "created_at": row[5].isoformat(),
                    "topic": row[6],
                }
                for row in cur.fetchall()
            ]

    return cocoindex.QueryOutput(results=results)

You can literally run:

cocoindex query main.py search_by_topic --topic "Claude"

...and get a clean JSON response with URLs, authors, timestamps, and which piece of content the topic appeared in.

4.2 get_threads_for_topic(topic): Rank Threads by Topic Score

Not all mentions are equal.

  • If "Rust" is in the thread title, that's a primary discussion
  • If it's buried in a comment, that's more of a side mention

get_threads_for_topic uses a weighted scoring model to prioritize threads where the topic is central.
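The idea is simple enough to sketch in a few lines (the weights here are illustrative assumptions, not the values the example uses):

```python
def thread_topic_score(title_mentions: int, comment_mentions: int) -> float:
    """Score a thread for a topic: a mention in the title/story counts much
    more than one buried in a comment. Weights are illustrative."""
    TITLE_WEIGHT = 3.0
    COMMENT_WEIGHT = 1.0
    return TITLE_WEIGHT * title_mentions + COMMENT_WEIGHT * comment_mentions
```

Sorting threads by this score surfaces the ones where the topic is the subject of discussion rather than an aside.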

4.3 get_trending_topics(limit=20): The Actual Trend Radar

Finally, the endpoint that powers dashboards and agents - this surfaces a list like:

  • ["Claude 3.7 Sonnet", "OpenAI o4-mini", "LangChain", "Modal", ...] with scores and latest mention times
  • Each topic includes the top threads where it's being discussed right now
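Conceptually, the handler boils down to a windowed count: only recent mentions matter, ranked by frequency. A minimal pure-Python sketch of that aggregation (the real handler does this in SQL against hn_topics):

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

def trending_topics(mentions, window_hours=24, limit=20, now=None):
    """Rank topics by mention count inside a recent time window.

    `mentions` is an iterable of (topic, created_at) pairs; `now` is
    injectable for testing. This sketches the aggregation, not the
    example's exact scoring.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(hours=window_hours)
    counts = Counter(topic for topic, ts in mentions if ts >= cutoff)
    return counts.most_common(limit)
```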

You can wire this into:

  • A live dashboard showing "top 20 topics in the last N hours"
  • A Slack bot posting a daily "what's trending on HN" summary
  • An internal research agent that watches for signals relevant to your stack

Running It in Real Time

Once the flow is defined, keeping it live is a one-liner:

# On-demand refresh
cocoindex update main

# Live mode: keeps polling HN and updating indexes
cocoindex update -L main

CocoIndex handles:

  • Polling HN every 30 seconds (configurable)
  • Incrementally syncing only changed threads
  • Re-running LLM extraction only where needed
  • Exporting into Postgres and making query handlers available

For debugging, CocoInsight lets you explore the flow, see lineage, and play with queries from a UI:

cocoindex server -ci main
# Then open: https://cocoindex.io/cocoinsight

What You Can Build on Top of This (Beyond "Just HN")

Once you have this pattern, you're not limited to HackerNews.

Some obvious extensions:

  • Cross-community trend tracking

    • Add Reddit subs, X lists, Discord channels, internal Slack, etc. as additional sources
    • Normalize topics across them to see which ideas propagate where and when
  • Sentiment-aware trend analysis

    • Plug in an LLM-based sentiment extraction step alongside topics
    • Track not just what is trending, but whether devs love or hate it
  • Influencer and key-contributor maps

    • Use the author field to see who starts important discussions and whose comments move the conversation
  • Continuous knowledge graphs

    • Treat topics as nodes, threads as edges, and build a graph of tools, companies, and people linked by real discussions
  • Real-time AI research agents

    • Point an agent at the Postgres-backed index and let it answer questions like:
      • "What are the top new vector DBs people are experimenting with this week?"
      • "Which AI eval frameworks are getting traction?"

If you live in data, infra, or AI-land, this is basically a self-updating signal layer over HN that your tools and agents can query.


Want to Try It Yourself?

You can find the fully working example (including flow definition, custom source, query handlers, and Postgres export) in the official HackerNews Trending Topics example on the CocoIndex docs and GitHub.

If you end up:

  • Pointing this at a different community
  • Layering in embeddings, RAG, or sentiment
  • Wiring it into a real product or agent

...definitely share it back. The coolest part of this pattern is how little code you need to go from "raw community noise" to a live, queryable trend radar.
