Linghua Jin

I Built a Real-Time HackerNews Trend Radar With AI (And It Runs Itself)

Every day, HackerNews quietly decides what the dev world will care about next.
But unless you're doom-scrolling it all day, you're missing the real signal: which topics are actually taking off right now, across threads and deep comment chains.

So instead of manually refreshing HN, I built a real-time "trend radar" on top of it:

  • Continuously ingests fresh HN stories and comments
  • Uses an LLM to extract structured topics (companies, tools, models, tech terms)
  • Streams everything into Postgres for instant querying like:
    • "What's trending on HN right now?"
    • "Which threads are driving the most hype for Claude / LangChain / Rust today?"

All of this runs as a declarative CocoIndex flow with incremental syncs, LLM-powered extraction, and simple query handlers.

In this post, you'll see how it works end-to-end and how you can fork it to track any community (Reddit, X, Discord, internal Slack, etc.).


Why HN Is a Goldmine (If You Can Structure It)

HackerNews is one of the strongest early signals for:

  • New tools and frameworks devs actually try
  • Which AI models/products are gaining mindshare
  • Real sentiment and feedback in the comments
  • Emerging startups and obscure libraries that might be big in 6-12 months

But raw HN has three problems:

  • Threads are noisy; comments are nested and messy
  • There's no notion of "topics" beyond free text
  • There's no built-in way to ask: "What's trending across the whole firehose?"

The HackerNews Trending Topics example in CocoIndex is essentially: "turn HN into a structured, continuously updating topics index that AI agents and dashboards can query in milliseconds."


Architecture: From HN Firehose to Queryable Topics

At a high level, the pipeline looks like this:

HackerNews API
    ↓
HackerNewsConnector (Custom Source)
    ├─ list() → thread IDs + updated_at
    ├─ get_value() → full threads + comments
    └─ provides_ordinal() → enables incremental sync
    ↓
CocoIndex Flow
    ├─ LLM topic extraction on threads + comments
    ├─ message_index collector (content)
    └─ topic_index collector (topics)
    ↓
Postgres
    ├─ hn_messages
    └─ hn_topics
    ↓
Query Handlers
    ├─ search_by_topic("Claude")
    ├─ get_trending_topics(limit=20)
    └─ get_threads_for_topic("Rust")

Key idea: separate discovery from fetching.

  • list() hits the HN Algolia search API to get lightweight metadata: thread IDs + updated_at timestamps.
  • get_value() only runs for threads whose updated_at changed, fetching full content + comments from the items API.
  • Ordinals (timestamps) let CocoIndex skip everything that hasn't changed, cutting API calls by >90% on subsequent syncs.

This is what enables "live mode" with a 30-second polling interval without melting APIs or your wallet.


Step 1: Turning HackerNews Into a First-Class Incremental Source

First, you define the data model for threads and comments.

import dataclasses
from datetime import datetime
from typing import NamedTuple

class _HackerNewsThreadKey(NamedTuple):
    thread_id: str

@dataclasses.dataclass
class _HackerNewsComment:
    id: str
    author: str | None
    text: str | None
    created_at: datetime | None

@dataclasses.dataclass
class _HackerNewsThread:
    author: str | None
    text: str
    url: str | None
    created_at: datetime | None
    comments: list[_HackerNewsComment]

Then you declare a SourceSpec that configures how to query HN:

class HackerNewsSource(SourceSpec):
    """Source spec for HackerNews API."""
    tag: str | None = None      # e.g. "story"
    max_results: int = 100      # hits per poll

The custom source connector wires this spec into actual HTTP calls:

  • list() → calls https://hn.algolia.com/api/v1/search_by_date with hitsPerPage=max_results, yields PartialSourceRow objects keyed by thread ID, with ordinals based on updated_at.
  • get_value() → calls https://hn.algolia.com/api/v1/items/{thread_id} and parses the full thread + nested comments into _HackerNewsThread and _HackerNewsComment.
  • provides_ordinal() → returns True so CocoIndex can do incremental sync.

CocoIndex handles the hard part: tracking ordinals and only re-pulling changed rows on each sync.
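The discovery half of this is easy to see in isolation. Here's a minimal, standalone sketch (not the example's actual connector code) of turning an Algolia search_by_date response into (thread_id, ordinal) pairs, where the ordinal is the updated_at timestamp as epoch seconds:

```python
from datetime import datetime

def hits_to_rows(algolia_payload: dict) -> list[tuple[str, int]]:
    """Turn an Algolia search_by_date response into (thread_id, ordinal) pairs.

    The ordinal is updated_at as epoch seconds, so the sync engine can skip
    any row whose ordinal hasn't advanced since the last poll.
    """
    rows = []
    for hit in algolia_payload.get("hits", []):
        thread_id = str(hit["objectID"])
        # Algolia returns ISO-8601 timestamps with a trailing "Z".
        updated = datetime.fromisoformat(hit["updated_at"].replace("Z", "+00:00"))
        rows.append((thread_id, int(updated.timestamp())))
    return rows
```

The connector's real list() wraps this kind of parsing in the HTTP call and yields CocoIndex row objects, but the incremental-sync contract is exactly this pair: a stable key plus a monotonically advancing ordinal.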


Step 2: Using an LLM to Extract Topics From Every Thread and Comment

Once the source is in the flow, the fun part starts: semantic enrichment.

You define a minimal Topic type that the LLM will fill:

@dataclasses.dataclass
class Topic:
    """
    A single topic extracted from text:
    - products, tools, frameworks
    - people, companies
    - domains (e.g. "vector search", "fintech")
    """
    topic: str

Inside the flow, every thread gets its topics extracted with a single declarative transform:

with data_scope["threads"].row() as thread:
    thread["topics"] = thread["text"].transform(
        cocoindex.functions.ExtractByLlm(
            llm_spec=cocoindex.LlmSpec(
                api_type=cocoindex.LlmApiType.OPENAI,
                model="gpt-4o-mini",
            ),
            output_type=list[Topic],
        )
    )

Same for comments:

with thread["comments"].row() as comment:
    comment["topics"] = comment["text"].transform(
        cocoindex.functions.ExtractByLlm(
            llm_spec=cocoindex.LlmSpec(
                api_type=cocoindex.LlmApiType.OPENAI,
                model="gpt-4o-mini",
            ),
            output_type=list[Topic],
        )
    )

Under the hood, CocoIndex:

  • Calls the LLM with a structured prompt and enforces output_type=list[Topic]
  • Normalizes messy free text into consistent topic strings
  • Makes this just another column in your flow instead of a separate glue script

This is what turns HN from "some text" into something an AI agent or SQL query can reason about.
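Normalization matters more than it looks: the LLM may emit "claude", "Claude ", and "CLAUDE" for the same thing, and a topic index only works if those land on one key. A tiny illustrative helper (an assumption about the kind of cleanup needed, not code from the example):

```python
def normalize_topic(raw: str) -> str:
    """Collapse whitespace and case variants so different mentions of the
    same topic land on one key in the topic index."""
    return " ".join(raw.split()).lower()
```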


Step 3: Indexing Into Postgres for Fast Topic Queries

All structured data is collected into two logical indexes:

  • message_index: threads + comments with their raw text and metadata
  • topic_index: individual topics linked back to messages

Collectors are declared once and then exported to Postgres:

message_index = data_scope.add_collector()
topic_index = data_scope.add_collector()

message_index.export(
    "hn_messages",
    cocoindex.targets.Postgres(),
    primary_key_fields=["id"],
)

topic_index.export(
    "hn_topics",
    cocoindex.targets.Postgres(),
    primary_key_fields=["topic", "message_id"],
)

Now you have two tables you can poke with SQL or via CocoIndex query handlers:

  • hn_messages: full-text search, content analytics, author stats
  • hn_topics: topic-level analytics, trend tracking, per-topic thread ranking
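As a taste of the topic-level analytics, here's a tiny self-contained sketch using SQLite in place of Postgres (the topic and message_id columns follow the primary keys declared above; the data and everything else are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hn_topics (topic TEXT, message_id TEXT);
    INSERT INTO hn_topics VALUES
        ('Rust', 'm1'), ('Rust', 'm2'), ('Claude', 'm3');
""")
# Count mentions per topic -- the core of any "what's trending" query.
rows = conn.execute("""
    SELECT topic, COUNT(*) AS mentions
    FROM hn_topics
    GROUP BY topic
    ORDER BY mentions DESC
    LIMIT 20
""").fetchall()
```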

Step 4: Query Handlers - From "Cool Pipeline" to Real Product

Here's where it stops being just a nice ETL project and becomes something you can actually ship.

4.1 search_by_topic(topic): "Show Me All Claude Mentions"

This query handler lets you search HN content by topic across threads and comments:

@hackernews_trending_topics_flow.query_handler()
def search_by_topic(topic: str) -> cocoindex.QueryOutput:
    topic_table = cocoindex.utils.get_target_default_name(
        hackernews_trending_topics_flow, "hn_topics"
    )
    message_table = cocoindex.utils.get_target_default_name(
        hackernews_trending_topics_flow, "hn_messages"
    )

    with connection_pool().connection() as conn:
        with conn.cursor() as cur:
            cur.execute(
                f"""
                SELECT m.id, m.thread_id, m.author, m.content_type,
                       m.text, m.created_at, t.topic
                FROM {topic_table} t
                JOIN {message_table} m ON t.message_id = m.id
                WHERE LOWER(t.topic) LIKE LOWER(%s)
                ORDER BY m.created_at DESC
                """,
                (f"%{topic}%",),
            )

            results = [
                {
                    "id": row[0],
                    "url": f"https://news.ycombinator.com/item?id={row[1]}",
                    "author": row[2],
                    "type": row[3],
                    "text": row[4],
                    "created_at": row[5].isoformat(),
                    "topic": row[6],
                }
                for row in cur.fetchall()
            ]

    return cocoindex.QueryOutput(results=results)

You can literally run:

cocoindex query main.py search_by_topic --topic "Claude"

...and get a clean JSON response with URLs, authors, timestamps, and which piece of content the topic appeared in.

4.2 get_threads_for_topic(topic): Rank Threads by Topic Score

Not all mentions are equal.

  • If "Rust" is in the thread title, that's a primary discussion
  • If it's buried in a comment, that's more of a side mention

get_threads_for_topic uses a weighted scoring model to prioritize threads where the topic is central.
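The idea is simple enough to sketch in a few lines (the weights here are illustrative assumptions, not the values the example uses):

```python
def thread_topic_score(title_mentions: int, comment_mentions: int) -> float:
    """Score a thread for a topic: a mention in the title/story counts much
    more than one buried in a comment. Weights are illustrative."""
    TITLE_WEIGHT = 3.0
    COMMENT_WEIGHT = 1.0
    return TITLE_WEIGHT * title_mentions + COMMENT_WEIGHT * comment_mentions
```

Sorting threads by this score surfaces the ones where the topic is the subject of discussion rather than an aside.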

4.3 get_trending_topics(limit=20): The Actual Trend Radar

Finally, the endpoint that powers dashboards and agents - this surfaces a list like:

  • ["Claude 3.7 Sonnet", "OpenAI o4-mini", "LangChain", "Modal", ...] with scores and latest mention times
  • Each topic includes the top threads where it's being discussed right now
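Conceptually, the handler boils down to a windowed count: only recent mentions matter, ranked by frequency. A minimal pure-Python sketch of that aggregation (the real handler does this in SQL against hn_topics):

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

def trending_topics(mentions, window_hours=24, limit=20, now=None):
    """Rank topics by mention count inside a recent time window.

    `mentions` is an iterable of (topic, created_at) pairs; `now` is
    injectable for testing. This sketches the aggregation, not the
    example's exact scoring.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(hours=window_hours)
    counts = Counter(topic for topic, ts in mentions if ts >= cutoff)
    return counts.most_common(limit)
```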

You can wire this into:

  • A live dashboard showing "top 20 topics in the last N hours"
  • A Slack bot posting a daily "what's trending on HN" summary
  • An internal research agent that watches for signals relevant to your stack

Running It in Real Time

Once the flow is defined, keeping it live is a one-liner:

# On-demand refresh
cocoindex update main

# Live mode: keeps polling HN and updating indexes
cocoindex update -L main

CocoIndex handles:

  • Polling HN every 30 seconds (configurable)
  • Incrementally syncing only changed threads
  • Re-running LLM extraction only where needed
  • Exporting into Postgres and making query handlers available

For debugging, CocoInsight lets you explore the flow, see lineage, and play with queries from a UI:

cocoindex server -ci main
# Then open: https://cocoindex.io/cocoinsight

What You Can Build on Top of This (Beyond "Just HN")

Once you have this pattern, you're not limited to HackerNews.

Some obvious extensions:

  • Cross-community trend tracking

    • Add Reddit subs, X lists, Discord channels, internal Slack, etc. as additional sources
    • Normalize topics across them to see which ideas propagate where and when
  • Sentiment-aware trend analysis

    • Plug in an LLM-based sentiment extraction step alongside topics
    • Track not just what is trending, but whether devs love or hate it
  • Influencer and key-contributor maps

    • Use the author field to see who starts important discussions and whose comments move the conversation
  • Continuous knowledge graphs

    • Treat topics as nodes, threads as edges, and build a graph of tools, companies, and people linked by real discussions
  • Real-time AI research agents

    • Point an agent at the Postgres-backed index and let it answer questions like:
      • "What are the top new vector DBs people are experimenting with this week?"
      • "Which AI eval frameworks are getting traction?"

If you live in data, infra, or AI-land, this is basically a self-updating signal layer over HN that your tools and agents can query.


Want to Try It Yourself?

You can find the fully working example (including flow definition, custom source, query handlers, and Postgres export) in the official HackerNews Trending Topics example on the CocoIndex docs and GitHub.

If you end up:

  • Pointing this at a different community
  • Layering in embeddings, RAG, or sentiment
  • Wiring it into a real product or agent

...definitely share it back. The coolest part of this pattern is how little code you need to go from "raw community noise" to a live, queryable trend radar.
