DEV Community

大橙子

Building a Self-Hosted Market Research Engine: Architecture, Scraping, and AI Analysis

When I started my last indie project, I needed to know two things: who's competing in my space, and what are they saying? Simple questions. But the tools I found either cost $200/month (hello, Semrush) or locked my data into their silo.

So I built my own. Self-hosted. Python-based. With actual control over my data.

Here's how it works under the hood — and why I think most indie devs should consider this approach instead of signing up for yet another SaaS.

The Core Architecture

The system has three layers, running in Docker:

┌─────────────────────┐
│   Data Collection    │  ← Scrapy + Playwright
├─────────────────────┤
│   Analysis Pipeline  │  ← LLM summarization + NLP
├─────────────────────┤
│   API + Dashboard    │  ← FastAPI + SQLite/Postgres
└─────────────────────┘

All three run on a $12/month VPS. The only external API it calls is the LLM endpoint; scraping traffic goes out through the proxy pool described below.

Layer 1: Not Your Grandma's Web Scraper

The tricky part isn't scraping. It's scraping at scale without getting blocked. Here's what I learned:

Proxy rotation isn't optional. In my experience, after about 50 requests to the same domain from a single IP, you're cooked. I built a simple proxy rotator that cycles through a pool of residential proxies:

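from threading import Lock

import requests
from requests import Response


class RotatingProxyMiddleware:
    def __init__(self, proxies: list[str]):
        if not proxies:
            raise ValueError("proxies must not be empty")
        self.proxies = proxies
        self._index = 0
        self._lock = Lock()

    def fetch(self, url: str) -> Response:
        # Hold the lock only while picking the next proxy, not during the request.
        with self._lock:
            proxy = self.proxies[self._index % len(self.proxies)]
            self._index += 1
        return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=15)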

Dead simple. But it solves 90% of the block problem. The last 10% is solved by random delays (2–5 seconds between requests) and realistic user-agent headers.
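Here's a minimal sketch of that last 10%, using only the standard library. The `USER_AGENTS` pool, the helper names, and the extra `Accept` headers are my own illustrative assumptions, not part of the original system; the 2 to 5 second range comes from the text above.

```python
import random
import time

# Hypothetical pool of browser-like User-Agent strings; keep these current in practice.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.4 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0",
]


def polite_headers(rng: random.Random) -> dict[str, str]:
    """Pick a random User-Agent and add headers a normal browser would send."""
    return {
        "User-Agent": rng.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    }


def random_delay(rng: random.Random, low: float = 2.0, high: float = 5.0) -> float:
    """Return a delay in seconds drawn uniformly from [low, high]."""
    return rng.uniform(low, high)


def throttled(urls: list[str], rng: random.Random, sleep=time.sleep):
    """Yield (url, headers) pairs, pausing a random 2-5 s between requests."""
    for i, url in enumerate(urls):
        if i > 0:  # no need to wait before the very first request
            sleep(random_delay(rng))
        yield url, polite_headers(rng)
```

Each `(url, headers)` pair can then be handed to whatever does the fetching, for example by passing `headers=` to `requests.get`. The `sleep` parameter is injectable so the pacing can be tested without actually waiting.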

Playwright for JS-heavy sites. Some competitor sites are full SPAs: no static HTML, everything rendered by JavaScript. For those, I drop into Playwright:

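from playwright.async_api import async_playwright


async def scrape_dynamic(url: str) -> str:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        try:
            page = await browser.new_page()
            # "networkidle" waits until the network has been quiet for 500 ms.
            await page.goto(url, wait_until="networkidle")
            return await page.content()
        finally:
            # Close the browser even if navigation fails or times out.
            await browser.close()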

This adds about 3 seconds per page, but it's the only way to get the actual rendered content.
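Since Playwright costs a few seconds per page, it helps to decide up front which pages actually need it. The sketch below is one rough heuristic, and an assumption on my part rather than how the original system decides: fetch the static HTML first, and only fall back to the browser when almost no visible text comes back. The `needs_browser` name and the 200-character threshold are hypothetical.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collect text that sits outside <script>, <style>, and <noscript> tags."""

    _SKIP = {"script", "style", "noscript"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped tag.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def needs_browser(static_html: str, min_chars: int = 200) -> bool:
    """Return True when the static HTML carries too little visible text,
    which suggests the page is rendered client-side."""
    parser = _VisibleText()
    parser.feed(static_html)
    parser.close()
    visible = " ".join(parser.chunks)
    return len(visible) < min_chars
```

A typical SPA shell (an empty `<div id="root">` plus a script bundle) trips the check, while a server-rendered pricing page does not. The threshold will need tuning per site.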

Layer 2: The Analysis Pipeline — Where AI Actually Helps

Once you have the raw data (pricing pages, product descriptions, changelogs), you need to extract signal from noise. This is where LLMs shine — not for generating content, but for structuring it.

I use a cheap model (Claude Haiku or GPT-4o-mini) with a structured prompt:

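import json

ANALYSIS_PROMPT = """Extract the following from this competitor page:
1. Pricing tiers and amounts (list all)
2. Target customer (individual/team/enterprise)
3. Key features mentioned
4. Landing page positioning
Return as JSON."""

# llm_client is assumed to be defined elsewhere; its setup is not shown in this post.
def analyze_page(content: str) -> dict:
    response = llm_client.chat(
        model="claude-3-haiku",
        # Truncate the page to 8,000 characters to keep each call cheap.
        messages=[{"role": "user", "content": ANALYSIS_PROMPT + "\n\n" + content[:8000]}],
        response_format={"type": "json_object"}
    )
    # Raises json.JSONDecodeError if the model returns anything other than valid JSON.
    return json.loads(response.content)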

The key insight: you don't need a huge context window. Most pages have the important info in the first 5,000 characters, so the 8,000-character cap in analyze_page leaves some margin. Truncate aggressively, batch cheap calls, and store results in SQLite.
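To make "truncate aggressively and store results in SQLite" concrete, here's a minimal sketch using only the standard library. The table layout, the `analyze_cached` helper, and the idea of keying results by a content hash (so unchanged pages never trigger a second LLM call) are my assumptions, not the original schema. The `analyze` argument stands in for a function like `analyze_page`.

```python
import hashlib
import json
import sqlite3
from typing import Callable

MAX_CHARS = 8000  # same cap as the content[:8000] slice in analyze_page


def init_db(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) results table if it does not exist yet."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS analyses (
               url TEXT NOT NULL,
               content_hash TEXT NOT NULL,
               result_json TEXT NOT NULL,
               analyzed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
               PRIMARY KEY (url, content_hash)
           )"""
    )


def analyze_cached(
    conn: sqlite3.Connection,
    url: str,
    content: str,
    analyze: Callable[[str], dict],
) -> dict:
    """Analyze a page once per distinct (url, truncated content) pair."""
    snippet = content[:MAX_CHARS]
    digest = hashlib.sha256(snippet.encode("utf-8")).hexdigest()

    row = conn.execute(
        "SELECT result_json FROM analyses WHERE url = ? AND content_hash = ?",
        (url, digest),
    ).fetchone()
    if row is not None:
        return json.loads(row[0])  # cache hit: no LLM call

    result = analyze(snippet)
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO analyses (url, content_hash, result_json) VALUES (?, ?, ?)",
            (url, digest, json.dumps(result)),
        )
    return result
```

Because every distinct version of a page gets its own row with a timestamp, the same table doubles as a change history for that URL.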

Layer 3: The Dashboard

I built a minimal FastAPI dashboard that shows:

  • Competitor pricing changes over time (screenshot diff)
  • Feature comparison matrix (auto-populated)
  • Raw changelog entries with AI summaries

The dashboard runs on the same VPS. No frontend framework — just Jinja2 templates and htmx for interactivity. It's ugly, but it works, and I can access it from my phone.
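The auto-populated feature matrix is mostly a reshaping job once the analysis JSON exists. Here's a rough sketch; it assumes each stored analysis is a dict with a `"features"` list, which is my guess at the shape and not something the prompt above guarantees.

```python
def feature_matrix(
    analyses: dict[str, dict],
) -> tuple[list[str], dict[str, list[bool]]]:
    """Turn {competitor: analysis} into (feature columns, one bool row per competitor)."""

    def normalized(analysis: dict) -> set[str]:
        # Lowercase and strip so "API access" and "api access" land in one column.
        return {f.strip().lower() for f in analysis.get("features", [])}

    # Columns: every feature mentioned by anyone, in a stable sorted order.
    features = sorted(set().union(*(normalized(a) for a in analyses.values())))

    # Rows: True where the competitor mentions that feature.
    rows = {
        competitor: [f in normalized(analysis) for f in features]
        for competitor, analysis in analyses.items()
    }
    return features, rows
```

The returned columns and rows map directly onto an HTML table in a Jinja2 template. Real LLM output will need fuzzier matching than lowercase comparison, since models phrase the same feature in different ways.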

Why Self-Hosted Matters

Three reasons:

  1. Your data stays yours. Every market analysis tool I tried stores the competitor data you collect in its own databases. That's fine until you're analyzing something sensitive.

  2. Cost stays flat. The VPS costs $12/month regardless of how many competitors I track. SaaS tools charge per project or per seat.

  3. You own the pipeline. Found a new data source? Add it in 20 lines of Python. Want a custom metric? Write it. No submitting feature requests.

The Stack I Wish I Started With

If I were building this again today, I'd start with:

  • Scrapy for crawling (not raw requests — Scrapy handles retries, backpressure, and pipelines out of the box)
  • SQLite for storage (zero ops, and for a read-heavy workload on a single machine it is often as fast as Postgres or faster, since there is no network hop)
  • Claude Haiku for analysis (cheapest capable model I've found)
  • htmx for dashboard (no JS build step)
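For a sense of what "out of the box" means for Scrapy, here's a sketch of a `settings.py` fragment that covers retries, pacing, and pipelines. The setting names are standard Scrapy settings; the values and the pipeline path are illustrative assumptions, not the configuration of the system described here.

```python
# settings.py (sketch): values are illustrative, not the author's configuration.

# Retry failed requests and rate-limit responses a few times before giving up.
RETRY_ENABLED = True
RETRY_TIMES = 3
RETRY_HTTP_CODES = [429, 500, 502, 503, 504]

# Pace requests: with randomization on, Scrapy waits between
# 0.5x and 1.5x DOWNLOAD_DELAY, so roughly 1.75 to 5.25 seconds here.
DOWNLOAD_DELAY = 3.5
RANDOMIZE_DOWNLOAD_DELAY = True
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# AutoThrottle backs off further when the target server responds slowly.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 2.0
AUTOTHROTTLE_MAX_DELAY = 30.0

# Item pipelines run in ascending order of their number.
ITEM_PIPELINES = {
    "marketresearch.pipelines.SQLitePipeline": 300,  # hypothetical pipeline class
}
```

With raw `requests`, each of these behaviors is code you write and maintain yourself.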

What's Next

I've packaged this into a tool called MarketEye — it's the exact system I described above, with a ready-to-run Docker setup and a few pre-built scrapers for common competitor sources (Product Hunt, G2, Crunchbase, etc.).

If you want to check it out:

But honestly? Even if you never use MarketEye, I hope this post shows you that building your own self-hosted intelligence pipeline is totally doable. You don't need enterprise tools. You just need Python, a VPS, and a weekend.


Have you built any self-hosted tooling for your indie projects? I'd love to hear what you're running — drop a comment below.
