<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Benjian Dai</title>
    <description>The latest articles on DEV Community by Benjian Dai (@benjiandai).</description>
    <link>https://dev.to/benjiandai</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3893305%2Fceda5657-6b10-49d6-94b2-4e2a9f99c09e.jpeg</url>
      <title>DEV Community: Benjian Dai</title>
      <link>https://dev.to/benjiandai</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/benjiandai"/>
    <language>en</language>
    <item>
      <title>I shipped an AI pipeline in a month that reads Reddit, HN, and X for startup ideas. The hardest part wasn't the AI.</title>
      <dc:creator>Benjian Dai</dc:creator>
      <pubDate>Tue, 28 Apr 2026 13:30:23 +0000</pubDate>
      <link>https://dev.to/benjiandai/i-shipped-an-ai-pipeline-in-a-month-that-reads-reddit-hn-and-x-for-startup-ideas-the-hardest-2fkb</link>
      <guid>https://dev.to/benjiandai/i-shipped-an-ai-pipeline-in-a-month-that-reads-reddit-hn-and-x-for-startup-ideas-the-hardest-2fkb</guid>
      <description>&lt;p&gt;For the last month I've been building MonetScope — a pipeline that crawls Reddit, Hacker News, and X, reads what real people are complaining about, and surfaces the complaints as scored startup opportunities.&lt;/p&gt;

&lt;p&gt;Going in, I assumed the hard part would be the AI layer. You know the story: prompt engineering, structured output, temperature tuning. That's where the demos happen and where most of the blog posts get written.&lt;/p&gt;

&lt;p&gt;It wasn't. The LLM layer landed roughly on schedule. What took real engineering time were four other things — each one taught me something I'll carry into the next pipeline I build. Plus one deeply dumb cross-language serialization bug that quietly skewed my data for a week before I noticed.&lt;/p&gt;

&lt;p&gt;This is a write-up of those things.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pipeline, very roughly
&lt;/h2&gt;

&lt;p&gt;Before we dig in, the mental model. I'm keeping this deliberately abstract because the interesting part is the categories, not my specific libraries.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   [Reddit]    [Hacker News]    [X]
       \            |            /
        \           |           /
         +---&amp;gt; crawler layer &amp;lt;---+
                    |
             message queue
                    |
          deterministic filters
                    |
           multi-stage LLM layer
                    |
            grounding + storage
                    |
                 product UI
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two runtimes: one for crawlers (good at "get data out of places"), one for orchestration plus API (good at "stay up under load"). A message queue between them. Boring, intentionally.&lt;/p&gt;
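&lt;p&gt;To make the boundary concrete, here's the kind of envelope a crawled post might travel in between the two runtimes. The schema and field names are a hypothetical sketch, not MonetScope's actual message format:&lt;/p&gt;

```python
# Hypothetical envelope for a crawled post crossing the queue between the two
# runtimes. Schema and field names are illustrative, not the real ones.
import json
from dataclasses import dataclass, asdict

@dataclass
class CrawledPost:
    platform: str    # "reddit" | "hn" | "x"
    source_id: str   # platform-native ID; later used for dedup and grounding
    url: str
    author: str
    body: str
    score: int       # points/upvotes at crawl time
    crawled_at: str  # ISO 8601, always UTC

def to_queue_payload(post: CrawledPost) -> bytes:
    # One canonical serializer for every producer, so both runtimes agree
    # byte-for-byte on what goes over the wire.
    return json.dumps(asdict(post), separators=(",", ":"), sort_keys=True).encode("utf-8")
```

&lt;p&gt;The point of the dataclass isn't the fields — it's that the queue message is a written-down contract rather than whatever dict the crawler happened to emit.&lt;/p&gt;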

&lt;p&gt;What's on that diagram today is three public platforms. What's on my whiteboard is more — additional communities, niche forums, and eventually a user-submitted channel where a founder can drop in their own support tickets, a competitor's review stream, or a CSV dump from a private Slack. Each new source is just another input box on the diagram, which is the whole point of having the diagram at this level of abstraction. The cost of adding source N+1 is not in the pipeline; it's in the per-platform quality heuristics, and that's a problem I enjoy having.&lt;/p&gt;

&lt;p&gt;Now — the four things that took more time than the AI. Starting with the most embarrassing.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. The scraping tool was wrong for two of three platforms
&lt;/h2&gt;

&lt;p&gt;Every scraping tutorial you'll find online opens with the same heavyweight toolkit — headless browser, stealth plugins, proxy rotation, the works. I started there too.&lt;/p&gt;

&lt;p&gt;That default turned out to be overkill for one platform, unnecessary for another, and a losing battle on the third.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Platform A:&lt;/strong&gt; I had a headless browser dutifully rendering pages for three weeks before I realized the same data was available through a much thinner path that didn't require rendering a single pixel. When I rewrote that spider it went from "needs a beefy worker" to "runs on a potato." I want those three weeks back.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Platform B:&lt;/strong&gt; There was a developer-focused API available the entire time. Not just scrape-able, officially supported. I was reinventing its existence in the browser layer. Pure hubris — I had assumed "if the tutorial uses a browser for it, that must be the right tool."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Platform C:&lt;/strong&gt; No shortcut. Actively hostile to scraping, continuous cat-and-mouse, and my time is worth more than the subscription. Paid for access. Never looked back.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The generalizable lesson: &lt;strong&gt;don't start with a tool and hunt for problems to solve with it. Start with how the source actually serves data to its own frontend.&lt;/strong&gt; If it serves structured data, there's usually a path that isn't a browser. If it only renders via JS, that's your answer. If it's hostile, pay or skip.&lt;/p&gt;
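&lt;p&gt;Hacker News is the cleanest public example of the "thinner path": it has an official, unauthenticated JSON API, so nothing needs rendering at all. A minimal sketch — the sample payload is trimmed and its values are illustrative, and it's parsed offline so the sketch needs no network:&lt;/p&gt;

```python
import json

# Hacker News serves items as plain JSON via its official API -- no browser,
# no auth: https://hacker-news.firebaseio.com/v0/item/<id>.json
def item_url(item_id: int) -> str:
    return f"https://hacker-news.firebaseio.com/v0/item/{item_id}.json"

# A trimmed, real-shaped payload (values illustrative), parsed offline.
SAMPLE = '{"id": 8863, "by": "dhouston", "type": "story", "score": 104, "title": "My YC app: Dropbox"}'

def parse_item(raw: str) -> dict:
    item = json.loads(raw)
    # Keep only the fields the downstream pipeline actually uses.
    return {
        "id": item["id"],
        "author": item.get("by", ""),
        "title": item.get("title", ""),
        "score": item.get("score", 0),
    }
```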

&lt;p&gt;The heavyweight-browser default isn't wrong — it's a good fallback when nothing else works. The failure mode is treating it as the starting point.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. The cheapest filter is the one that runs before the LLM
&lt;/h2&gt;

&lt;p&gt;Naive version of the pipeline: crawled post arrives → LLM processes it → structured output goes to the database.&lt;/p&gt;

&lt;p&gt;Two problems at once.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It's expensive.&lt;/strong&gt; Every post is tokens, and token cost on a content-scale pipeline dominates the bill within a week.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The output is worse.&lt;/strong&gt; "Why is no one building X?" posts waste tokens producing confident-sounding opportunity cards that don't survive human review. A model asked to extract an opportunity from a substance-free rant will dutifully hallucinate one.&lt;/p&gt;

&lt;p&gt;The fix is philosophically simple: &lt;strong&gt;put a deterministic filter in front of the LLM that rejects content the LLM would have rejected anyway, but for free.&lt;/strong&gt;&lt;/p&gt;
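&lt;p&gt;The shape of such a filter, with every threshold and keyword invented for illustration — the real values are deliberately not in this post:&lt;/p&gt;

```python
# A deterministic pre-LLM filter. Every threshold and signal phrase here is a
# made-up illustration -- the real, tuned values are private to the product.
def passes_prefilter(post: dict, platform: str) -> bool:
    rules = {  # per-platform, because community norms differ
        "reddit": {"min_body_chars": 280, "min_score": 5},
        "hn":     {"min_body_chars": 140, "min_score": 10},
        "x":      {"min_body_chars": 80,  "min_score": 20},
    }
    r = rules[platform]
    body = post.get("body", "")
    if len(body) < r["min_body_chars"]:
        return False
    if post.get("score", 0) < r["min_score"]:
        return False
    # Cheaply reject substance-free "someone should build X" rants: require at
    # least one concrete signal of an existing workaround or real spend.
    signals = ("i pay", "we pay", "workaround", "instead we", "currently use")
    return any(s in body.lower() for s in signals)
```

&lt;p&gt;Nothing in there costs a token, and everything it rejects is something the LLM would have turned into a hallucinated opportunity card anyway.&lt;/p&gt;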

&lt;p&gt;What "deterministic" means varies per platform — what counts as a substantive post in one community is a throwaway in another. The thresholds are per-platform, and they drift as community norms change. I'm not going to publish the current values; they're part of the product. But the interesting thing about tuning them isn't the numbers. It's that I ended up building a small tuning harness that was more work than picking any individual threshold.&lt;/p&gt;

&lt;p&gt;Ordering matters too. The deterministic filter runs &lt;em&gt;before&lt;/em&gt; the stateful dedup layer, not after. Posts rejected today can be reconsidered tomorrow if they accumulate engagement — which they sometimes do, especially on HN.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Generalizable rule:&lt;/strong&gt; before every LLM call, ask "can I cheaply reject this input first?" The answer is usually yes, and the win compounds: less cost per document, better signal on the documents that make it through, fewer false positives to clean up downstream.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Shipping an "AI-grounded" feature without lying to users
&lt;/h2&gt;

&lt;p&gt;This is the section I most want a reader to take away, so I'll be careful about the level I pitch it at.&lt;/p&gt;

&lt;p&gt;The product makes a specific promise: every claim it generates is backed by a quote from an identifiable, real user. You see "users complain that X breaks every Tuesday" and you can click through to the exact comment where someone said exactly that.&lt;/p&gt;

&lt;p&gt;If that promise leaks — if even a small percentage of the quotes are paraphrased, massaged, or invented — the product has no reason to exist. "We summarize Reddit with AI" is a commodity. "We show you the literal thing the person said" is not.&lt;/p&gt;

&lt;p&gt;LLMs in their default configuration &lt;em&gt;will&lt;/em&gt; break that promise. Paraphrasing is what they're good at. Making up a plausible-sounding quote is easier for them than surfacing the specific boring one that matters. This isn't a flaw of any particular model; it's a property of optimized-for-fluency generation.&lt;/p&gt;

&lt;p&gt;The approach I landed on, at the pattern level:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Don't ask the LLM to find sources.&lt;/strong&gt; Do extraction of candidate source material deterministically, &lt;em&gt;before&lt;/em&gt; the generation step. The LLM sees a curated candidate pool, not the raw corpus.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Constrain generation to operate on those candidates.&lt;/strong&gt; The LLM's job is to synthesize and structure. It is not the layer that decides what's citable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mechanically verify every output claim against a specific source record.&lt;/strong&gt; If it doesn't match a real record, it doesn't ship. This is the step most pipelines skip, and it's the one that determines whether users still trust the output six months in.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fail closed.&lt;/strong&gt; If verification can't find the source for a generated claim, drop the claim. If dropping claims leaves the output empty, drop the whole output. Empty is fine. Phantom is not.&lt;/li&gt;
&lt;/ol&gt;
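&lt;p&gt;Steps 3 and 4 at the pattern level, in code. The matcher here is the naivest possible stand-in (verbatim substring); the real matching algorithm is deliberately not shown:&lt;/p&gt;

```python
# Pattern-level sketch of "verify + fail closed". The matcher is a naive
# verbatim-substring stand-in, not the product's real matching algorithm.
def verify_claims(claims, corpus):
    """Keep only claims whose quote appears verbatim in the cited source."""
    kept = []
    for claim in claims:
        source_text = corpus.get(claim["source_id"], "")
        if claim["quote"] and claim["quote"] in source_text:
            kept.append(claim)  # grounded: safe to ship
        # else: drop silently -- a missing claim beats a phantom one
    return kept

def finalize(claims, corpus, min_claims=1):
    grounded = verify_claims(claims, corpus)
    # Fail closed: if too little survives verification, ship nothing at all.
    return grounded if len(grounded) >= min_claims else None
```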

&lt;p&gt;I won't walk through the matching algorithm, what the candidate pool looks like in this codebase, or how I decide "drop claim vs drop whole output" — those are product-specific tuning and they're where the moat actually lives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The pattern itself is freely available, and I wish I'd seen it articulated when I started:&lt;/strong&gt; for any LLM feature where hallucination is a &lt;em&gt;correctness&lt;/em&gt; bug rather than a style flaw, the pattern is &lt;strong&gt;pre-extract → constrain → verify → fail closed&lt;/strong&gt;. Half-measures ship, but they don't keep trust.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. One score is almost never enough
&lt;/h2&gt;

&lt;p&gt;The hardest non-infrastructure problem turned out to be teaching the pipeline the difference between "people are angry" and "people will pay."&lt;/p&gt;

&lt;p&gt;The naive move is one score per item — how good is this opportunity, 0-10. This doesn't work. It tries to answer two reader questions at once ("is there real pain here?" and "would anyone actually buy a solution?") and those are orthogonal.&lt;/p&gt;

&lt;p&gt;A thread full of "someone should build X" is high on the first and near-zero on the second. A thread where one person has duct-taped three tools together and is actively shopping for a replacement is moderate on the first and very high on the second. A single composite score collapses those into noise.&lt;/p&gt;

&lt;p&gt;The fix isn't clever math. It's the recognition that &lt;strong&gt;any time a single number is answering more than one question, it ends up answering none of them well.&lt;/strong&gt; Split the signal into the questions you actually want answered, score those separately, and &lt;em&gt;compose&lt;/em&gt; them deliberately — not average them. A lot of weak signal does not beat a little strong signal, even when the means come out similar.&lt;/p&gt;
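&lt;p&gt;One way to encode that composition rule — the geometric mean is my illustration here, not the product's actual formula:&lt;/p&gt;

```python
# Two orthogonal questions get two scores, composed deliberately.
# The geometric mean is one composition that encodes "a lot of weak signal
# does not beat a little strong signal": any near-zero component drags the
# result down hard. Illustrative only -- not the product's actual formula.
def composite(pain, willingness_to_pay):
    return (pain * willingness_to_pay) ** 0.5  # both inputs on a 0-10 scale

angry_thread   = composite(pain=8.0, willingness_to_pay=2.0)  # loud, unpaid
shopper_thread = composite(pain=5.0, willingness_to_pay=5.0)  # actively buying
# An arithmetic mean scores both threads 5.0; the geometric mean separates them.
```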

&lt;p&gt;One non-obvious finding I'll share because it's directional-only: the "obvious" communities (the big generalist ones) produce noisier signal than niche ones. Volume is a weak proxy for signal strength. Source &lt;em&gt;diversity&lt;/em&gt; turned out to matter as much as source &lt;em&gt;volume&lt;/em&gt; — an opportunity drawn from three different niche communities beats one drawn from thirty posts in one big general community, even though the raw evidence count is an order of magnitude lower.&lt;/p&gt;
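&lt;p&gt;One illustrative way to make diversity outweigh volume — again a sketch of the idea, not the real weighting:&lt;/p&gt;

```python
import math

# Illustrative diversity weighting: scale by the number of distinct
# communities, and let volume enter only logarithmically, so the 30th post
# from one big community adds almost nothing. Not the real formula.
def evidence_weight(posts_by_community):
    distinct = len(posts_by_community)
    volume = sum(posts_by_community.values())
    return distinct * math.log1p(volume)

three_niche = evidence_weight({"a": 1, "b": 1, "c": 1})  # 3 posts, 3 communities
one_big     = evidence_weight({"bigsub": 30})            # 30 posts, 1 community
```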

&lt;p&gt;&lt;strong&gt;Dev-applicable version:&lt;/strong&gt; when a ranking algorithm isn't working, check whether you're averaging two signals that are answering different questions. You'll be surprised how often the answer is yes.&lt;/p&gt;

&lt;h2&gt;
  
  
  The dumb bug that almost invalidated everything
&lt;/h2&gt;

&lt;p&gt;This one isn't a product moat — it's a general cross-language gotcha — so I'll share it in detail because the lesson is broadly useful to anyone building a polyglot system.&lt;/p&gt;

&lt;p&gt;Two services in two languages, both writing to the same database column. The column stores vectors as text, like &lt;code&gt;[0.123,0.456,...]&lt;/code&gt;. The database happily accepts whatever either language produces.&lt;/p&gt;

&lt;p&gt;The trap: each language's default float-to-string produces &lt;em&gt;slightly&lt;/em&gt; different output. Different widths. Different rounding at the edge cases. To the human eye they look identical. To byte-wise comparison they aren't. To cosine similarity on the resulting vector, they're close but not the same vector.&lt;/p&gt;

&lt;p&gt;Nothing broke. No exception. No type error. No failing test. What happened was that semantic similarity rankings drifted depending on which service had last written the record. Results were good, then mysteriously slightly bad, then good again. I chased this for most of a week before I realized what I was looking at.&lt;/p&gt;

&lt;p&gt;The fix, in pseudocode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="n"&gt;One&lt;/span&gt; &lt;span class="n"&gt;canonical&lt;/span&gt; &lt;span class="n"&gt;formatter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shared&lt;/span&gt; &lt;span class="n"&gt;definition&lt;/span&gt; &lt;span class="n"&gt;across&lt;/span&gt; &lt;span class="n"&gt;both&lt;/span&gt; &lt;span class="n"&gt;runtimes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="n"&gt;Verified&lt;/span&gt; &lt;span class="n"&gt;by&lt;/span&gt; &lt;span class="n"&gt;golden&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;string&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;each&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="nf"&gt;format_vector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;floats&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;each&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;to_string_exact&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where &lt;code&gt;to_string_exact&lt;/code&gt; is explicitly pinned to the widest-precision, culture-invariant format available in each language — not whatever the default &lt;code&gt;toString()&lt;/code&gt; happens to do. And the test is a literal string equality check against a hand-written golden output, run from both sides.&lt;/p&gt;
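&lt;p&gt;What that looks like on the Python side, assuming &lt;code&gt;".17g"&lt;/code&gt; as the pinned format — 17 significant digits round-trip any IEEE 754 double, and the output never depends on locale:&lt;/p&gt;

```python
# Python side of the cross-runtime golden-string test. ".17g" is pinned
# explicitly: 17 significant digits round-trip any IEEE 754 double, and
# the formatting is locale-independent (always "." as the decimal point).
def to_string_exact(f):
    return format(f, ".17g")

def format_vector(floats):
    return "[" + ",".join(to_string_exact(f) for f in floats) + "]"

# Hand-written golden string, asserted byte-for-byte from every runtime that
# writes the column. Note that 0.1 is *not* "0.1" at full double precision.
GOLDEN = "[0.10000000000000001,0.25,1]"
assert format_vector([0.1, 0.25, 1.0]) == GOLDEN
```

&lt;p&gt;The other runtime gets the same three input floats and the same hand-written golden string; if either side drifts from the contract, its test goes red immediately instead of rankings going quietly wrong.&lt;/p&gt;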

&lt;p&gt;The broader lesson: &lt;strong&gt;"both sides are using the default" is a dangerous sentence in a polyglot system.&lt;/strong&gt; Default serialization isn't a contract. If two runtimes are going to share a serialized format, write the format exactly once as a pure function, and verify its output byte-for-byte from both languages. Repeat for every format that crosses the runtime boundary — JSON casing was the other one that bit me, in passing.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd do differently
&lt;/h2&gt;

&lt;p&gt;Nothing kills a launch post faster than "everything went great." So:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I picked the wrong primary data store early, for the wrong reason — "flexibility." Future me didn't want flexibility. Future me wanted fewer moving parts. Moved to a boring relational DB with a JSON column type and things got better immediately. Lost about three weeks.&lt;/li&gt;
&lt;li&gt;I wrote my own rate-limit layer before realizing a standard caching-server primitive plus ten lines of script would have done the same job in an afternoon. Lost a week on that one.&lt;/li&gt;
&lt;li&gt;I underestimated observability. The liveness vs. readiness healthcheck split only happened after the third production incident. It should have been there on day one — you don't need it until you need it, and then you need it immediately.&lt;/li&gt;
&lt;li&gt;The grounding / verification layer shipped too late. Weeks of early data had to be re-processed once I added it. It should have been part of the first LLM call, not the twelfth.&lt;/li&gt;
&lt;/ul&gt;
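&lt;p&gt;For the curious: the "caching-server primitive plus ten lines" in the second bullet is typically a fixed-window counter — the same logic Redis gives you with INCR plus EXPIRE on a per-key counter. Mirrored in-process here so the sketch runs without a server; illustrative, not the code I shipped:&lt;/p&gt;

```python
import time

# Fixed-window rate limiter: the pattern a caching server gives you with
# INCR + EXPIRE, mirrored in-process so the sketch runs without a server.
# Illustrative only -- with Redis, `store` becomes INCR/EXPIRE calls.
class FixedWindowLimiter:
    def __init__(self, limit, window_s):
        self.limit = limit
        self.window_s = window_s
        self.store = {}  # key -> (count, window start time)

    def allow(self, key, now=None):
        now = time.monotonic() if now is None else now
        count, start = self.store.get(key, (0, now))
        if now - start >= self.window_s:  # window expired: start a fresh one
            count, start = 0, now
        if count >= self.limit:
            return False                  # over budget for this window
        self.store[key] = (count + 1, start)
        return True
```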

&lt;p&gt;If there's a theme: &lt;strong&gt;my worst decisions were the ones where I picked the more flexible option so future-me would have more options.&lt;/strong&gt; Future-me didn't want options. Future-me wanted something that worked.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;That's four things that turned out harder than the AI, plus the serialization bug for flavor.&lt;/p&gt;

&lt;p&gt;The product all this plumbing is in service of is at &lt;a href="https://monetscope.com" rel="noopener noreferrer"&gt;monetscope.com&lt;/a&gt; — free 14-day trial, no card. If you'd rather see what it produces before committing, this week's top 10 opportunities are at &lt;a href="https://monetscope.com/this-week" rel="noopener noreferrer"&gt;monetscope.com/this-week&lt;/a&gt; (just email, no card). The output is what the pipeline exists for, but to be honest the feedback I most want right now is on the engineering choices above, not the landing page.&lt;/p&gt;

&lt;p&gt;If you've shipped an LLM-scored content pipeline yourself, I'd genuinely like to hear: &lt;strong&gt;how do you version and regression-test your prompts?&lt;/strong&gt; That's the layer I feel weakest on, and I haven't found a tool I love. Current setup is git commit plus a hand-maintained regression set, and it's starting to creak as the prompt surface grows.&lt;/p&gt;

&lt;p&gt;Also: if you maintain a community, newsletter, or data source you'd be interested in seeing indexed — that's on my roadmap, and I'm actively looking for source-expansion partnerships. DM me.&lt;/p&gt;

&lt;p&gt;Thanks for reading this far.&lt;/p&gt;

</description>
      <category>showdev</category>
      <category>webdev</category>
      <category>ai</category>
      <category>discuss</category>
    </item>
  </channel>
</rss>
