大橙子

Posted on Jun 19

Building a Self-Hosted Market Research Engine: Architecture, Scraping, and AI Analysis

#python #opensource #webdev #showdev

When I started my last indie project, I needed to know two things: who's competing in my space, and what are they saying? Simple questions. But the tools I found either cost $200/month (hello, Semrush) or locked me into their data silo.

So I built my own. Self-hosted. Python-based. With actual control over my data.

Here's how it works under the hood — and why I think most indie devs should consider this approach instead of signing up for yet another SaaS.

The Core Architecture

The system has three layers, running in Docker:

┌─────────────────────┐
│   Data Collection    │  ← Scrapy + Playwright
├─────────────────────┤
│   Analysis Pipeline  │  ← LLM summarization + NLP
├─────────────────────┤
│   API + Dashboard    │  ← FastAPI + SQLite/Postgres
└─────────────────────┘

All three run on a $12/month VPS. The only external services are the LLM endpoint and the proxy pool.

Layer 1: Not Your Grandma's Web Scraper

The tricky part isn't scraping. It's scraping at scale without getting blocked. Here's what I learned:

Proxy rotation isn't optional. After about 50 requests to the same domain, you're cooked. I built a simple proxy rotator that pulls from a pool of residential proxies:

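from threading import Lock

import requests


class RotatingProxyMiddleware:
    """Round-robin over a proxy pool; safe to call from multiple threads."""

    def __init__(self, proxies: list[str]):
        self.proxies = proxies
        self._index = 0
        self._lock = Lock()

    def fetch(self, url: str) -> requests.Response:
        # Pick the next proxy under the lock so concurrent callers never reuse an index.
        with self._lock:
            proxy = self.proxies[self._index % len(self.proxies)]
            self._index += 1
        return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=15)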

Dead simple. But it solves 90% of the block problem. The last 10% is solved by random delays (2–5 seconds between requests) and realistic user-agent headers.
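
The delay and user-agent part can be sketched in a few lines. This is a minimal illustration, not the code from the project: the `USER_AGENTS` pool, `next_delay`, `build_headers`, and `polite_fetch` are hypothetical names, and `fetch` is assumed to be any callable that accepts `(url, headers=...)`.

```python
import random
import time

# Hypothetical pool; keep it in sync with current browser releases.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.4 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0",
]


def next_delay(low: float = 2.0, high: float = 5.0) -> float:
    """Pick a random pause, in seconds, between low and high."""
    return random.uniform(low, high)


def build_headers() -> dict[str, str]:
    """Return request headers with a randomly chosen user agent."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
    }


def polite_fetch(fetch, url: str, sleep=time.sleep):
    """Wait a random interval, then call fetch(url, headers=...)."""
    sleep(next_delay())
    return fetch(url, headers=build_headers())
```

Passing `sleep` as a parameter keeps the helper easy to test without actually waiting. To combine it with the proxy rotator, `fetch` would need to forward the `headers` argument to `requests.get`.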

Playwright for JS-heavy sites. Some competitor sites are full SPA — no static HTML, all JavaScript rendering. For those, I drop into Playwright:

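from playwright.async_api import async_playwright


async def scrape_dynamic(url: str) -> str:
    """Return the rendered HTML of a JavaScript-heavy page."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        try:
            page = await browser.new_page()
            # "networkidle" waits for network activity to quiet down so client-side rendering can finish.
            await page.goto(url, wait_until="networkidle")
            return await page.content()
        finally:
            # Close the browser even if navigation fails.
            await browser.close()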

This adds about 3 seconds per page, but it's the only way to get the actual rendered content.
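
Since rendering costs extra time per page, it helps to decide up front which pages need it. One possible heuristic, sketched below with only the standard library, is to fetch the static HTML first and fall back to Playwright when it contains almost no visible text. The `needs_js_rendering` name and the 200-character threshold are assumptions for illustration, not something the post's system is confirmed to use.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collect text that sits outside <script>, <style>, and <noscript>."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped tags.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def needs_js_rendering(html: str, min_chars: int = 200) -> bool:
    """Guess that a page is a JS-rendered shell if it has little visible text."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return len(" ".join(parser.chunks)) < min_chars
```

A page that comes back as an empty `<div id="root">` plus a script bundle would be routed to `scrape_dynamic`, while a server-rendered pricing page would not. The threshold is a guess and would need tuning per site.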

Layer 2: The Analysis Pipeline — Where AI Actually Helps

Once you have the raw data (pricing pages, product descriptions, changelogs), you need to extract signal from noise. This is where LLMs shine — not for generating content, but for structuring it.

I use a cheap model (Claude Haiku or GPT-4o-mini) with a structured prompt:

import json

ANALYSIS_PROMPT = """Extract the following from this competitor page:
1. Pricing tiers and amounts (list all)
2. Target customer (individual/team/enterprise)
3. Key features mentioned
4. Landing page positioning
Return as JSON."""

def analyze_page(content: str) -> dict:
    # llm_client stands in for your own LLM client wrapper; it is not defined in this snippet.
    # Only the first 8,000 characters of the page are sent, to keep each call cheap.
    response = llm_client.chat(
        model="claude-3-haiku",
        messages=[{"role": "user", "content": ANALYSIS_PROMPT + "\n\n" + content[:8000]}],
        response_format={"type": "json_object"}
    )
    return json.loads(response.content)

The key insight: you don't need a huge context window. Most pages have the important info in the first 5,000 characters, so the 8,000-character cut-off above leaves some headroom. Truncate aggressively, batch cheap calls, and store results in SQLite.
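
Here is a minimal sketch of the "truncate and store in SQLite" step, using only the standard library. The table layout, the `open_store` and `get_or_analyze` helpers, and the idea of skipping the LLM call when a page's truncated content has not changed are all assumptions added for illustration. The `analyze` argument is any function shaped like `analyze_page` above.

```python
import hashlib
import json
import sqlite3
from datetime import datetime, timezone

MAX_CHARS = 8000  # same cut-off as analyze_page

SCHEMA = """
CREATE TABLE IF NOT EXISTS analyses (
    url TEXT NOT NULL,
    content_hash TEXT NOT NULL,
    fetched_at TEXT NOT NULL,
    result_json TEXT NOT NULL,
    PRIMARY KEY (url, content_hash)
)
"""


def open_store(path: str = "analyses.db") -> sqlite3.Connection:
    """Open (or create) the results database."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def content_key(content: str) -> str:
    """Hash only the part of the page that would be sent to the model."""
    return hashlib.sha256(content[:MAX_CHARS].encode("utf-8")).hexdigest()


def get_or_analyze(conn: sqlite3.Connection, url: str, content: str, analyze) -> dict:
    """Return a stored result for unchanged content, else analyze and store it."""
    key = content_key(content)
    row = conn.execute(
        "SELECT result_json FROM analyses WHERE url = ? AND content_hash = ?",
        (url, key),
    ).fetchone()
    if row is not None:
        return json.loads(row[0])
    result = analyze(content[:MAX_CHARS])
    conn.execute(
        "INSERT INTO analyses VALUES (?, ?, ?, ?)",
        (url, key, datetime.now(timezone.utc).isoformat(), json.dumps(result)),
    )
    conn.commit()
    return result
```

Keying on a hash of the truncated content means a re-scrape of an unchanged page costs nothing, and every changed version is kept as its own row with a timestamp.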

Layer 3: The Dashboard

I built a minimal FastAPI dashboard that shows:

Competitor pricing changes over time (screenshot diff)
Feature comparison matrix (auto-populated)
Raw changelog entries with AI summaries

The dashboard runs on the same VPS. No frontend framework — just Jinja2 templates and htmx for interactivity. It's ugly, but it works, and I can access it from my phone.
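
The pricing view above is described as a screenshot diff. A complementary approach, sketched here as an assumption rather than as part of the described system, is to diff the structured pricing extracted by the analysis step. The `diff_pricing` function and the tier-name-to-price dictionaries are hypothetical.

```python
def diff_pricing(old: dict[str, float], new: dict[str, float]) -> list[dict]:
    """Compare two {tier name: monthly price} snapshots and list what changed."""
    changes = []
    for tier in sorted(old.keys() | new.keys()):
        before, after = old.get(tier), new.get(tier)
        if before == after:
            continue  # tier exists in both snapshots at the same price
        if before is None:
            kind = "added"
        elif after is None:
            kind = "removed"
        else:
            kind = "changed"
        changes.append({"tier": tier, "kind": kind, "before": before, "after": after})
    return changes


last_week = {"Free": 0.0, "Pro": 19.0, "Team": 49.0}
today = {"Free": 0.0, "Pro": 24.0, "Enterprise": 199.0}
for change in diff_pricing(last_week, today):
    print(change)
```

Each change is a plain dictionary, so it can be passed straight into a Jinja2 template as a table row.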

Why Self-Hosted Matters

Three reasons:

Your data stays yours. Every market analysis tool I tried uploads your competitors' data to their databases. That's fine until you're analyzing something sensitive.
Cost stays flat. The VPS costs $12/month regardless of how many competitors I track. SaaS tools charge per project or per seat.
You own the pipeline. Found a new data source? Add it in 20 lines of Python. Want a custom metric? Write it. No submitting feature requests.

The Stack I Wish I Started With

If I were building this again today, I'd start with:

Scrapy for crawling (not raw requests — Scrapy handles retries, backpressure, and pipelines out of the box)
SQLite for storage (plenty fast for a read-heavy, single-server workload like this, and zero ops)
Claude Haiku for analysis (cheapest capable model I've found)
htmx for dashboard (no JS build step)

What's Next

I've packaged this into a tool called MarketEye — it's the exact system I described above, with a ready-to-run Docker setup and a few pre-built scrapers for common competitor sources (Product Hunt, G2, Crunchbase, etc.).

If you want to check it out:

GitHub (open-source core): https://github.com/dachengzi065-gif/marketeYE
Gumroad (pre-built Docker image + scrapers): https://gumroad.com/l/kvnkhb

But honestly? Even if you never use MarketEye, I hope this post shows you that building your own self-hosted intelligence pipeline is totally doable. You don't need enterprise tools. You just need Python, a VPS, and a weekend.

Have you built any self-hosted tooling for your indie projects? I'd love to hear what you're running — drop a comment below.

DEV Community
