Finding patterns across Hacker News, arXiv, and GitHub by reasoning about groups, not pairs

#automation #productivity #programming #showdev

How I Built a Pattern-Detection Engine That Reads Across Eight Tech Feeds Simultaneously

Most feed readers solve the wrong problem. They help you consume faster, not understand better. I kept noticing that the interesting signal wasn't inside any single post — it was in the friction between a paper on arXiv, a Show HN thread, and a Product Hunt launch happening within the same week. Nobody was naming that pattern out loud. So I built something that tries to.

Why I built this

Every week I'd open five or six tabs — Hacker News, a few arXiv abstracts, whatever was trending on Product Hunt — and occasionally get this feeling: these things are connected. Not in an obvious "they're both about LLMs" way, but in a more structural sense. Like, someone is solving a problem in systems programming that maps onto a debate happening in ML tooling. Or a niche GitHub project is quietly becoming the implementation layer for an idea three different research groups published independently.

The frustrating part was that making the connection required holding all of it in my head at once. Pairwise similarity search (the thing most recommendation engines do) doesn't cut it. Finding that paper A is similar to paper B doesn't tell you that A + B + this HN thread together imply something that none of them says directly.

I wanted a tool that reasons about groups, not pairs. That's what I set out to build with Constellate.

The approach

The non-obvious design decision was to treat "a pattern" as a minimum-three-node structure. Most clustering approaches will happily give you a pile of thematically similar documents. That's not what I wanted. I wanted the system to surface constellations: configurations where the relationship between the ideas is the finding, not just the ideas themselves.

This meant defining connection types explicitly, rather than letting an embedding distance be the only signal. I ended up with five:

Chain — A leads to B leads to C; a progression or causal sequence
Triangulation — Three sources approaching the same conclusion from different angles
Convergence — Independent ideas arriving at the same solution without apparent coordination
Absence — A gap: the pattern implies something that nobody is building or saying yet
Spectrum — A range of positions on a single underlying axis, with sources spread across it

The "Absence" type was the hardest to implement and the most interesting. It's not about what's there — it's about what the shape of the conversation implies is missing.
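
To make the taxonomy concrete, here is a minimal Python sketch of how the five types and the three-node minimum could be represented. It is an illustration, not Constellate's actual code: the class names, the field names, and the idea of validating in the constructor are all assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class ConnectionType(Enum):
    CHAIN = "chain"                  # A leads to B leads to C
    TRIANGULATION = "triangulation"  # same conclusion from different angles
    CONVERGENCE = "convergence"      # independent ideas, same solution
    ABSENCE = "absence"              # a gap implied by the conversation
    SPECTRUM = "spectrum"            # positions spread along one axis


@dataclass(frozen=True)
class Constellation:
    item_ids: tuple[str, ...]  # hypothetical field: IDs of the grouped feed items
    kind: ConnectionType
    summary: str               # plain-language statement of the pattern

    def __post_init__(self) -> None:
        # A pattern is defined as a minimum-three-node structure.
        if len(set(self.item_ids)) < 3:
            raise ValueError("a constellation needs at least three distinct items")
```

Making the type an enum rather than a free-text field is what keeps patterns comparable from one run to the next.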

How it works

The pipeline runs weekly and processes roughly 140 items from eight sources, including HN, arXiv, Product Hunt, YC, GitHub, Hugging Face, and Dev.to.

Here's the rough architecture:

Ingestion layer
  └── fetch ~140 items from 8 sources
       └── normalize to: {id, source, title, body_excerpt, date, tags}

Embedding + candidate generation
  └── embed each item (text-embedding-3-small, 1536d)
  └── build approximate kNN graph (FAISS, k=15)
  └── enumerate candidate triplets from shared neighborhoods

Constellation reasoning (the interesting part)
  └── for each candidate group:
       ├── prompt: "What is the structural relationship between these ideas?"
       ├── classify into: Chain / Triangulation / Convergence / Absence / Spectrum
       ├── score: specificity, surprise, non-obviousness
       └── filter: discard if pattern is restatable as a single topic label

Output
  └── ~60 constellations
       ├── Cards view (one pattern per screen, plain language)
       └── Constellation Map (spatial graph, clusters positioned by relationship type)
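
The candidate-generation stage in the diagram can be sketched in a few lines of Python. This is a simplified stand-in under stated assumptions: it uses exact cosine-similarity search in pure Python instead of a FAISS index, takes embeddings as plain lists of floats, and reads "shared neighborhoods" as "an item plus any two of its nearest neighbours". The function names are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_graph(vectors, k):
    """Exact kNN by cosine similarity: {item_id: set of its k nearest item ids}."""
    graph = {}
    for i, u in vectors.items():
        ranked = sorted(
            (j for j in vectors if j != i),
            key=lambda j: cosine(u, vectors[j]),
            reverse=True,
        )
        graph[i] = set(ranked[:k])
    return graph


def candidate_triplets(graph):
    """Every item combined with each pair of its neighbours, de-duplicated."""
    triplets = set()
    for anchor, neighbours in graph.items():
        for b, c in combinations(sorted(neighbours), 2):
            triplets.add(tuple(sorted((anchor, b, c))))
    return sorted(triplets)
```

With k neighbours per item, each item contributes at most k(k-1)/2 triplets, so the candidate set stays far smaller than the set of all possible triplets.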

The scoring step is doing real work. It's easy to generate hundreds of technically valid groupings that are just boring ("these three things are all about Rust"). The filter that asks "can this be summarized as a topic label?" culls most of those. If I can describe the pattern as "ML infrastructure" and be done with it, it's not a constellation; it's just a category.
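
A rough sketch of what the score-then-filter step might look like in Python. Everything specific here is an assumption made for illustration: the three scores are taken to lie in [0, 1], they are combined with an unweighted mean, the topic-label verdict is a boolean returned by the reasoning step, and the 0.7 cutoff is arbitrary.

```python
from dataclasses import dataclass


@dataclass
class ScoredCandidate:
    summary: str
    specificity: float      # assumed range: 0.0 to 1.0
    surprise: float         # assumed range: 0.0 to 1.0
    non_obviousness: float  # assumed range: 0.0 to 1.0
    is_topic_label: bool    # True if the pattern is just a category name

    @property
    def score(self) -> float:
        # Assumption: unweighted mean; the post does not say how scores combine.
        return (self.specificity + self.surprise + self.non_obviousness) / 3


def keep(candidates, min_score=0.7):
    """Drop topic-label groupings, then anything under the cutoff; best first."""
    survivors = [
        c for c in candidates
        if not c.is_topic_label and c.score >= min_score
    ]
    return sorted(survivors, key=lambda c: c.score, reverse=True)
```

Note that a topic-label grouping is discarded even when its scores are high, which matches the idea that a category is never a constellation.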

A real example from a recent run: a Hacker News thread on SQLite's WAL mode, an arXiv paper on eventual consistency in edge deployments, and a Product Hunt launch for a local-first sync library formed a Convergence constellation. The pattern: a quiet move away from assuming reliable network connectivity as a baseline, happening simultaneously in three separate conversations that don't reference each other.

What I learned / tradeoffs

Embeddings alone are not enough. Cosine similarity surfaces thematic proximity, but thematic proximity isn't the same as structural relationship. Two items can be semantically distant but form a tight Chain. The kNN graph gives you candidates; the reasoning step is where the actual pattern detection happens.

The 5-type taxonomy is a constraint, not a limitation. Early versions tried to let the model free-form describe relationships. The output was verbose and inconsistent. Forcing a classification makes the patterns comparable across weeks and surfaces when a certain type is unusually prevalent (three weeks of "Absence" constellations in a domain is itself a signal).
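
Because every constellation carries one of a fixed set of labels, spotting an unusual run of one type becomes simple bookkeeping. A minimal Python sketch, assuming each weekly run is stored as a list of (domain, type) pairs; that storage format and the function names are hypothetical.

```python
from collections import Counter


def weekly_type_counts(weeks, domain):
    """One Counter of connection types per week, restricted to a single domain."""
    return [
        Counter(kind for d, kind in week if d == domain)
        for week in weeks
    ]


def longest_streak(weekly_counts, kind, threshold=1):
    """Longest run of consecutive weeks with at least `threshold` of `kind`."""
    best = current = 0
    for counts in weekly_counts:
        if counts.get(kind, 0) >= threshold:
            current += 1
            best = max(best, current)
        else:
            current = 0
    return best


weeks = [
    [("ml-tooling", "absence"), ("ml-tooling", "chain")],
    [("ml-tooling", "absence")],
    [("ml-tooling", "absence"), ("databases", "spectrum")],
    [("ml-tooling", "chain")],
]
streak = longest_streak(weekly_type_counts(weeks, "ml-tooling"), "absence")
```

With free-form relationship descriptions this kind of week-over-week comparison would need fuzzy matching; with a closed taxonomy it is a counter.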

False positives are expensive. A wrong constellation isn't just a missed result — it's a confidence hit. Users who see a grouping that doesn't hold up start doubting the real ones. I'd rather surface 60 high-quality constellations than 150 mediocre ones. The current cutoff is conservative.

The "Absence" type is hard to evaluate. There's no ground truth for a gap. I can check factually whether a thing exists; I can't objectively confirm it's "missing in a meaningful way." Right now this type has the weakest precision. I mark it visually in the UI as lower-confidence.

Limitations and what's next

The current pipeline is weekly, not real-time. For fast-moving news cycles this is a real limitation — a constellation that was interesting on Monday may be stale by Friday. Moving toward a rolling window (72 hours) is on the roadmap, but the reasoning step is LLM-heavy and cost/latency become real concerns at higher frequency.
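
The item-selection half of a rolling window is the easy part. A minimal Python sketch, assuming each normalized item is a dict whose `date` field holds a timezone-aware `datetime`; the function name is hypothetical, and the costly part (re-running the reasoning step on each new window) is not shown.

```python
from datetime import datetime, timedelta, timezone


def in_window(items, now, hours=72):
    """Items whose date falls within the last `hours` hours, inclusive."""
    cutoff = now - timedelta(hours=hours)
    return [item for item in items if cutoff <= item["date"] <= now]


now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
items = [
    {"id": "a", "date": now - timedelta(hours=1)},
    {"id": "b", "date": now - timedelta(hours=71)},
    {"id": "c", "date": now - timedelta(hours=73)},  # just outside the window
]
recent = in_window(items, now)
```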

The source list (8 feeds right now) creates blind spots. Academic CS and ML are well-covered; hardware, biotech, and climate tech are not. Expanding the source set without degrading signal-to-noise is an open problem: more items means more candidate triplets, and the number of possible triplets grows roughly with the cube of the item count.
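
To put numbers on that growth, the count of all possible unordered triplets among n items is "n choose 3". The short Python sketch below computes the upper bound; the kNN candidate step prunes far below it, but the trend is the same. The function name is hypothetical.

```python
from math import comb


def triplet_count(n_items: int) -> int:
    """Number of distinct unordered triplets among n_items items."""
    return comb(n_items, 3)


current = triplet_count(140)  # 447580
doubled = triplet_count(280)  # 3619560
growth = doubled / current    # roughly 8x for 2x the items
```

Doubling the item count multiplies the number of possible triplets by about eight, which is why adding sources is not a free change.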

The Constellation Map is currently a static layout recomputed weekly. Making it interactive (click a node, see which other constellations it participates in) is the next UI feature.

The waitlist is open if you want early access: constellate.fyi

I'm also genuinely curious whether the 5-type taxonomy resonates with other people who think about information architecture — happy to talk through it in the comments.