Techno Neighbour

Posted on Jun 30

Why I built a CLI to automate web research instead of relying on browser tabs

#python #opensource #productivity #showdev

A few months ago I noticed something annoying about how I worked: I was spending more time collecting information than actually thinking about it.

The pattern was always the same. Open a search engine, open a dozen tabs, skim past the SEO filler and cookie banners, copy the paragraphs that actually mattered into a doc, paste the whole mess into an LLM and ask it to make sense of things. Then, a week later, do it again because whatever I was tracking had changed.

At some point I stopped asking "how do I do this faster" and started asking why I was doing it by hand at all.

Why the obvious answers didn't work

ChatGPT and Perplexity are fine for a single question. They're worse at the part I actually needed help with, which was repetition: running the same research loop on a schedule, keeping a record of what changed, and getting a notification when it did. Neither tool is built to sit in the background and check on a topic for you.

Plain scraping scripts have the opposite problem. They get you raw HTML, not understanding. You still have to strip out nav bars and footers by hand, and the moment you point one at a list-style page like Hacker News instead of a blog post, it falls apart.

And bookmarking is just deferring the problem. A folder of forty saved links isn't research, it's homework you haven't done yet.

I wanted something in between: automated enough to skip the tab-hoarding, but still producing something I could read and trust, not just a black-box answer.

So I built Focal Harvest

It's a modular CLI that runs the whole research loop (search, scrape, clean, synthesize, report) on its own, and stays lightweight enough to run on a laptop with no GPU and no database.

A single run looks like this: you give it a topic and a focus area (what you specifically want answered), it searches the web, pulls and cleans the pages, synthesizes a report, and writes it to disk. There's also a loop mode, so the same query can re-run every few hours and ping you on Discord or Telegram if you want to monitor something over time instead of researching it once.
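Loop mode can be pictured as a small scheduler wrapped around a single run. The sketch below is illustrative only and is not taken from the Focal Harvest source: `run_loop`, `run_once`, and `notify` are hypothetical names, and `notify` stands in for whatever sends the Discord or Telegram message.

```python
import time
from typing import Callable, Optional


def run_loop(
    run_once: Callable[[], str],
    notify: Callable[[str], None],
    interval_seconds: float,
    max_runs: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Re-run a research job on a fixed interval and notify after each run.

    Hypothetical interface: `run_once` returns the path of the report it
    wrote, and `notify` wraps a Discord or Telegram call.
    """
    completed = 0
    while max_runs is None or completed < max_runs:
        try:
            report_path = run_once()
            notify(f"Report ready: {report_path}")
        except Exception as exc:
            # One failed run should not end a long-lived monitor.
            notify(f"Run failed: {exc}")
        completed += 1
        # Only wait if another run is still to come.
        if max_runs is None or completed < max_runs:
            sleep(interval_seconds)
    return completed
```

Passing `sleep` in as a parameter is a design choice for the sketch: it keeps the loop testable without actually waiting hours between runs.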

How it's put together

I deliberately didn't build this as one big script. It's five files, each doing one job, called in sequence:

main.py           → terminal UI and orchestration
scraper.py        → search + concurrent crawling + HTML parsing
analyzer.py       → synthesis (AI or offline)
notifier.py       → saving reports, sending alerts
config_manager.py → reading/writing settings

main.py doesn't know anything about how scraping works internally, and scraper.py doesn't know anything about Discord webhooks. That separation made it much easier to add the offline summarizer later without touching the scraping code at all, and it's the kind of decision that only pays off once you try to change something six weeks in.
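One way to picture that separation is an orchestrator that only sees narrow function interfaces. This is a minimal sketch under assumptions, not the real main.py: `Page`, `run_pipeline`, and the three callables are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Page:
    """Hypothetical container for one cleaned page."""
    url: str
    text: str


def run_pipeline(
    topic: str,
    focus: str,
    scrape: Callable[[str], List[Page]],
    synthesize: Callable[[str, str, List[Page]], str],
    save: Callable[[str, str], str],
) -> str:
    """Call each stage in order; the orchestrator never looks inside one."""
    pages = scrape(topic)                     # the scraper's job
    report = synthesize(topic, focus, pages)  # the analyzer's job
    return save(topic, report)                # the notifier's job
```

Because each stage is just a callable with a fixed signature, swapping the synthesizer for an offline one would not require touching the scraping stage.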

The parts that were actually hard

Getting clean text out of arbitrary HTML. readability-lxml is good at finding "the article" inside a page, but it assumes the page is an article. Point it at a Hacker News thread or a GitHub repo listing and it often returns almost nothing, because there's no single article body to extract. The fix was to treat readability as the first attempt, not the only one: if it returns under 200 characters of usable text, the code falls back to a structural BeautifulSoup pass that looks for <article>, <main>, or common content selectors instead. Two different parsing strategies, picked automatically based on what the page actually looks like.
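The two-step strategy can be sketched roughly as follows. To keep the example runnable without extra dependencies, the standard library's `html.parser` stands in for the BeautifulSoup pass, and the primary extractor is passed in as a plain callable where the real tool would call readability-lxml. The names `extract_text`, `structural_text`, and `StructuralExtractor`, and the list of skipped tags, are assumptions made for this sketch; only the 200-character threshold comes from the description above.

```python
from html.parser import HTMLParser
from typing import Callable, List

MIN_CHARS = 200  # threshold mentioned in the post

CONTENT_TAGS = {"article", "main"}
SKIP_TAGS = {"script", "style", "nav", "footer", "header", "aside"}  # assumed list


class StructuralExtractor(HTMLParser):
    """Collect text inside <article>/<main>, skipping nav-like regions."""

    def __init__(self) -> None:
        super().__init__()
        self.content_depth = 0
        self.skip_depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in CONTENT_TAGS:
            self.content_depth += 1
        elif tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in CONTENT_TAGS and self.content_depth:
            self.content_depth -= 1
        elif tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.content_depth and not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def structural_text(html: str) -> str:
    """Fallback pass: text found under structural content containers."""
    parser = StructuralExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def extract_text(
    html: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str] = structural_text,
    min_chars: int = MIN_CHARS,
) -> str:
    """Try the article extractor first; fall back if it yields too little."""
    try:
        text = primary(html).strip()
    except Exception:
        text = ""  # treat an extractor crash like an empty result
    if len(text) >= min_chars:
        return text
    # Keep the short primary result if the fallback finds nothing at all.
    return fallback(html).strip() or text
```

A real fallback would probably also try site-specific content selectors; this sketch only covers the `<article>` and `<main>` case.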

Supporting three different LLM providers without three different code paths. Gemini, OpenAI, and Claude all have different request shapes, but the thing I actually cared about (send scraped context, get back a structured Markdown report) is identical across all of them. So each provider gets its own thin function that builds the right payload and hits the right endpoint, but all three feed into the same synthesize_topics router, and all three fall back to the same offline summarizer if the API call fails for any reason. The interface is the constant; the providers are interchangeable behind it.
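The shape of that router might look something like the sketch below. Only the name `synthesize_topics` comes from the post; its signature, the `providers` mapping, and the placeholder `offline_summary` are assumptions for illustration, and the provider functions are stubs rather than real API calls.

```python
from typing import Callable, Dict

# Assumed common interface: (scraped context, focus area) -> Markdown report.
Provider = Callable[[str, str], str]


def offline_summary(context: str, focus: str) -> str:
    """Placeholder standing in for the offline summarizer."""
    return f"# Offline report: {focus}\n\n{context[:200]}"


def synthesize_topics(
    context: str,
    focus: str,
    provider: str,
    providers: Dict[str, Provider],
    fallback: Provider = offline_summary,
) -> str:
    """Route to the chosen provider; use the offline path on any failure."""
    call = providers.get(provider)
    if call is None:
        return fallback(context, focus)
    try:
        report = call(context, focus)
    except Exception:
        # Network errors, auth errors, bad payloads: all take the same path.
        return fallback(context, focus)
    # Treating an empty reply as a failure is an assumption of this sketch.
    return report if report.strip() else fallback(context, focus)
```

Each provider-specific function would build its own payload internally; the router only ever sees the shared `(context, focus) -> str` interface.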

An offline mode that's actually usable. No API key and no dependency on a third-party model, and you still get a real report. This is where most of the actual algorithm work went: extract keywords from the query and focus area, score every sentence in the scraped text by keyword density and position (earlier sentences in a paragraph and earlier paragraphs in a document are weighted higher), then deduplicate near-identical sentences before assembling the top results into a report with an executive summary and per-source findings. It's not as fluent as an LLM-written summary, but it's not nothing either, and it means the tool works the moment you clone it.
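The scoring idea can be illustrated with a small extractive summarizer. This is a sketch of the general technique, not the project's actual algorithm: the stopword list, the position weighting formula, the 0.85 similarity cutoff, and all function names are assumptions, and `difflib.SequenceMatcher` is just one simple way to detect near-duplicates.

```python
import re
from difflib import SequenceMatcher
from typing import List, Set

# Tiny illustrative stopword list; a real one would be longer.
STOPWORDS = {"the", "and", "for", "are", "what", "how", "with", "that", "this"}


def extract_keywords(*texts: str) -> Set[str]:
    """Lowercased words from the query and focus, minus short/stop words."""
    words = re.findall(r"[a-z0-9]+", " ".join(texts).lower())
    return {w for w in words if len(w) > 2 and w not in STOPWORDS}


def split_sentences(paragraph: str) -> List[str]:
    """Naive split on sentence-ending punctuation followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]


def score_sentence(sentence: str, keywords: Set[str], para_idx: int, sent_idx: int) -> float:
    """Keyword density, boosted for early paragraphs and early sentences."""
    words = re.findall(r"[a-z0-9]+", sentence.lower())
    if not words:
        return 0.0
    density = sum(w in keywords for w in words) / len(words)
    position = 1.0 / (1 + para_idx) + 1.0 / (1 + sent_idx)
    return density * (1.0 + 0.5 * position)  # hypothetical weighting


def top_sentences(document: str, query: str, focus: str,
                  limit: int = 3, similarity: float = 0.85) -> List[str]:
    """Rank sentences, then drop near-duplicates of ones already picked."""
    keywords = extract_keywords(query, focus)
    paragraphs = [p for p in document.split("\n\n") if p.strip()]
    scored = []
    for p_idx, para in enumerate(paragraphs):
        for s_idx, sent in enumerate(split_sentences(para)):
            score = score_sentence(sent, keywords, p_idx, s_idx)
            if score > 0:
                scored.append((score, sent))
    scored.sort(key=lambda item: item[0], reverse=True)
    picked: List[str] = []
    for _, sent in scored:
        if any(SequenceMatcher(None, sent.lower(), kept.lower()).ratio() >= similarity
               for kept in picked):
            continue  # near-identical to a sentence we already kept
        picked.append(sent)
        if len(picked) == limit:
            break
    return picked
```

The picked sentences would then be grouped per source and placed under an executive summary heading to form the final report.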

Staying out of the database trap. It would have been easy to reach for SQLite the moment I wanted history or saved searches. I didn't. Reports are timestamped Markdown and JSON files in a reports/ folder, and saved search presets live in a plain config.json. You can read everything by double-clicking it. No schema, no migrations, no ORM.
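File-based storage along these lines needs very little code. The sketch below assumes a filename pattern and JSON fields that are not specified in the post; `save_report`, the timestamp format, and the slug rule are all hypothetical.

```python
import json
import re
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Dict, Optional


def save_report(
    topic: str,
    markdown: str,
    metadata: Dict[str, Any],
    reports_dir: str = "reports",
    now: Optional[datetime] = None,
) -> Path:
    """Write a timestamped .md report and a matching .json sidecar."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%d_%H%M%S")  # assumed naming scheme
    slug = re.sub(r"[^a-z0-9]+", "-", topic.lower()).strip("-") or "report"

    out_dir = Path(reports_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    md_path = out_dir / f"{stamp}_{slug}.md"
    md_path.write_text(markdown, encoding="utf-8")

    # The sidecar holds machine-readable details next to the readable report.
    record = {"topic": topic, "generated_at": now.isoformat(), **metadata}
    md_path.with_suffix(".json").write_text(
        json.dumps(record, indent=2), encoding="utf-8"
    )
    return md_path
```

Sorting the folder by filename then gives the report history in chronological order, with no query layer needed.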

What it's not

Focal Harvest isn't trying to replace search engines or chat-based AI assistants. It automates the mechanical part, gathering and organizing information, so you spend your attention on evaluating it instead of assembling it. If you want a single deep conversational answer to one question, this is the wrong tool. If you want a repeatable, schedulable research pipeline that produces a file you can actually keep, that's the gap it's filling.

Where I'd like help

A few areas I haven't gotten to yet:

Additional search providers beyond Tavily and DuckDuckGo
A plugin system for custom parsing rules on specific sites
Recursive crawling (follow internal links to a set depth)
Incremental reports, so a recurring monitor only flags what actually changed instead of regenerating the whole thing
General performance work on the scraping layer

If any of that sounds interesting, or if you've built something in this space and have opinions about the architecture, I'd genuinely like to hear them. Issues and pull requests are open.

If you've got your own version of the tab-hoarding problem, I'd like to hear about it in the comments. What does your research loop look like, and where does it break down?

Top comments (2)

Jeremy Guzman • Jul 17

Bringing this one back up because I'm curious about your progress with Focal Harvest. Have you had any success integrating those additional search providers or developing the plugin system? If you're still looking for help, have you considered crowdsourcing ideas or collaborating with other developers who might have tackled similar challenges?

Techno Neighbour • Jun 30

The GitHub repository for this post, for those interested in exploring the code: github.com/techno-neighbour/focal-...