James

How I Automated 90% of My Business Research with AI Agents

I tracked every hour I spent on research for a month. The result was humiliating: 40 hours per week. Not analysis. Not strategy. Just gathering, formatting, and cross-referencing data.

  • 18 hours: web search and data gathering
  • 12 hours: copy-paste and formatting
  • 8 hours: cross-referencing sources
  • 2 hours: actual analysis and decision-making

I was spending €200 per hour on data entry. So I rebuilt the entire workflow. Today I spend four hours per week on research. The other 36 are automated.

This article is a technical breakdown of how that works.


Why Research Is the Hardest Task to Automate

Research is not a single task. It is a pipeline of tasks, each requiring a different skill:

  1. Discovery: Finding sources you did not know existed
  2. Extraction: Pulling structured data from unstructured pages
  3. Validation: Cross-checking claims across multiple sources
  4. Synthesis: Turning raw data into actionable intelligence
  5. Distribution: Getting the right insight to the right person at the right time

Most automation tools handle one step well. None handle the full pipeline natively. The breakthrough was not finding a better tool. It was wiring multiple tools into a single pipeline.


The Pipeline Architecture

The system I built has three layers. Each layer addresses one stage of the research problem.

Layer 1: Discovery Automation (Search Agents)

Manual research starts with search. You type a query, review results, click links, bookmark relevant pages, and repeat. This is the slowest part because it requires human judgment at every step.

Automation works differently. Instead of searching reactively, you define what you need and let agents monitor continuously.

A search agent is a declarative specification:

agent = {
    "name": "Competitor Monitor",
    "sources": ["google_news", "crunchbase", "linkedin_posts", "product_hunt"],
    "query": "{company} funding OR acquisition OR product_launch",
    "filters": {
        "language": "en",
        "date_range": "past_7_days",
        "sentiment": "not_negative"
    },
    "output": {
        "format": "markdown_table",
        "fields": ["date", "source", "summary", "relevance_score"]
    },
    "schedule": "daily_0600",
    "alert_on": ["funding", "acquisition", "pricing_change"]
}

The agent runs autonomously. It queries sources, filters noise, extracts structured data, and generates a report. No manual search. No tab management. No copy-paste.

Critical insight: the agent does not replace human judgment. It surfaces candidates for judgment. A human still decides whether a funding announcement matters. But the human now reviews a curated table instead of scanning 50 sources.
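
Under the hood, the runner for a spec like this is not complicated. Here is a minimal sketch that executes one agent against a single free source (a Google News RSS query) and emits the markdown table the output block asks for. The scoring heuristic and helper names are illustrative, not my production code.

# Minimal sketch of one agent run against a single free source (Google News RSS).
# The relevance scoring is a crude keyword count; a real runner would dispatch
# on the spec's "sources" list and apply the full filter block.
import urllib.parse
import feedparser  # pip install feedparser

def run_competitor_monitor(company: str) -> str:
    query = f'"{company}" funding OR acquisition OR product launch'
    url = "https://news.google.com/rss/search?q=" + urllib.parse.quote(query)
    feed = feedparser.parse(url)

    rows = []
    for entry in feed.entries:
        text = (entry.title + " " + entry.get("summary", "")).lower()
        score = sum(kw in text for kw in ("funding", "acquisition", "launch"))
        if score > 0:
            rows.append((entry.get("published", "n/a"), entry.link, entry.title, score))

    # Emit the markdown table described in the spec's output block.
    lines = ["| date | source | summary | relevance_score |",
             "| --- | --- | --- | --- |"]
    lines += [f"| {date} | {link} | {title} | {score} |"
              for date, link, title, score in sorted(rows, key=lambda r: -r[3])]
    return "\n".join(lines)

print(run_competitor_monitor("TechCorp GmbH"))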

Layer 2: Structured Extraction (Web Scraping with Schema)

Search finds pages. The next problem is extracting data from those pages. Most sites that do not have APIs still contain structured data in their HTML.

The naive approach is scraping with XPath or regex. This breaks constantly. A site redesign, a renamed CSS class, or a JavaScript framework update breaks your selector.

The better approach is schema-based extraction:

schema = {
    "plan_name": "css:.pricing-tier h3",
    "price": "css:.pricing-tier .price-amount",
    "currency": "css:.pricing-tier .price-currency",
    "billing_cycle": "css:.pricing-tier .price-period",
    "features": "css:.pricing-tier .feature-list li",
    "limitations": "css:.pricing-tier .limitation-note"
}

# Extract
results = extract(html, schema)
# Validate
assert results["price"] is not None
assert len(results["features"]) >= 3
# Transform
normalized = {
    "monthly_eur": convert_currency(results["price"], results["currency"]),
    "feature_count": len(results["features"]),
    "has_enterprise_tier": "enterprise" in [f.lower() for f in results["features"]]
}

Schema extraction is more resilient than raw XPath because the selectors target named classes rather than positions in the DOM tree. If a site redesigns, css:.pricing-tier .price-amount may still find the element even when the surrounding structure changes. If it does not, the assertion fails and the pipeline alerts you to fix the schema.
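
The extract() call above is less magical than it looks. A minimal version, assuming BeautifulSoup and the css: prefix convention from the schema, fits in a few lines; a production version would add type coercion and per-tier handling.

# Rough sketch of the extract() helper used above, assuming BeautifulSoup and
# the "css:" prefix convention from the schema. Which fields return lists
# versus single values is a design choice you would adapt.
from bs4 import BeautifulSoup  # pip install beautifulsoup4

LIST_FIELDS = {"features", "limitations"}  # fields that collect every match

def extract(html: str, schema: dict) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    results = {}
    for field, selector in schema.items():
        css = selector.removeprefix("css:")
        if field in LIST_FIELDS:
            results[field] = [el.get_text(strip=True) for el in soup.select(css)]
        else:
            el = soup.select_one(css)
            results[field] = el.get_text(strip=True) if el else None
    return results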

The extraction layer also handles anti-bot measures:

  • Rotating proxies: Residential IPs with automatic rotation
  • Fingerprint spoofing: Real browser headers, not python-requests
  • Rate limiting: Built-in delays and jitter
  • CAPTCHA handling: Human-in-the-loop for edge cases, not systematic abuse

This is not aggressive scraping. It is respectful automation that stays within the bounds of what a human researcher could do manually, just faster.
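
Most of that list is a thin wrapper around the HTTP client. Here is a minimal polite-fetch sketch with real browser headers, a one-second-plus delay with jitter, and a robots.txt check; proxy rotation and the CAPTCHA hand-off are deliberately left out.

# Minimal polite-fetch sketch: real browser headers, delay with jitter,
# and a robots.txt check. Proxy rotation and CAPTCHA handling are omitted.
import random
import time
import urllib.robotparser
from urllib.parse import urlparse

import requests  # pip install requests

HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/122.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
}

def allowed_by_robots(url: str) -> bool:
    parts = urlparse(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        rp.read()
    except OSError:
        return True  # robots.txt unreachable; proceed, but stay polite
    return rp.can_fetch(HEADERS["User-Agent"], url)

def polite_get(url: str) -> str | None:
    if not allowed_by_robots(url):
        return None
    time.sleep(1.0 + random.uniform(0.0, 1.5))  # at least 1 s between requests, plus jitter
    resp = requests.get(url, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    return resp.text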

Layer 3: Intelligence Layer (LLM-Based Synthesis)

Raw data is not intelligence. A spreadsheet of competitor prices is just data. Intelligence answers questions like:

  • "Which competitor is most likely to cut prices next quarter?"
  • "What feature gaps are being discussed in developer communities?"
  • "Is this market trending toward consolidation or fragmentation?"

I use LLMs for synthesis, but with strict constraints:

Rule 1: The LLM only processes data that has already been extracted and validated. It does not hallucinate sources.

Rule 2: Every claim includes a source reference. If the LLM says "Competitor X added feature Y," it must cite the source document.

Rule 3: Confidence scores are attached to every insight. "Likely" means 60-70% confidence. "Highly likely" means 80-90%. No absolutes.
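
Rules 2 and 3 can be enforced mechanically before a draft ever reaches a human. A small post-processing check like the sketch below rejects any claim that does not cite a known source ID, or whose confidence score falls outside its label's band; the draft structure it assumes (claims as JSON with source_id and confidence fields) is one way to prompt for, not the only one.

# Sketch of guardrail checks applied to an LLM draft before distribution.
# The draft structure (a "claims" list with source_id, confidence_label and
# confidence fields) is an assumption about how the model is prompted to respond.
CONFIDENCE_BANDS = {"likely": (60, 70), "highly likely": (80, 90)}

def validate_draft(draft: dict, known_source_ids: set[str]) -> list[str]:
    problems = []
    for claim in draft.get("claims", []):
        if claim.get("source_id") not in known_source_ids:
            problems.append(f"unreferenced claim: {claim.get('text', '')[:60]}")
        label = claim.get("confidence_label", "").lower()
        score = claim.get("confidence", 0)
        low, high = CONFIDENCE_BANDS.get(label, (0, 100))
        if not low <= score <= high:
            problems.append(f"confidence {score}% does not match label '{label}'")
    return problems  # an empty list means the draft can go to human review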

Example output:

**COMPETITOR ALERT: TechCorp GmbH**

**Source:** pricing page scrape, 2024-03-15 (confidence: 100%)

**Change:** New Enterprise tier introduced at €299/month. Previously only Starter (€39) and Pro (€129) existed.

**Inference:** (confidence: 75%) This suggests mid-market expansion and possible funding pressure. The €299 price point is 50% below industry average for enterprise tiers, indicating competitive positioning rather than premium positioning.

**Recommendation:** (confidence: 60%) Monitor for 30 days. If pricing stabilizes, prepare response. If they add enterprise-only features, signal is stronger.

This output is not a replacement for human judgment. It is a structured brief that saves 30-60 minutes of manual analysis.
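
Keeping the brief structured rather than free-form also makes it easy to store, diff, and alert on over time. A minimal shape mirroring the fields above might look like this; the field names are an illustrative choice.

# Minimal structured form of the brief shown above; field names mirror the
# example output and are otherwise an illustrative choice.
from dataclasses import dataclass
from datetime import date

@dataclass
class CompetitorBrief:
    competitor: str
    source: str                     # e.g. "pricing page scrape"
    observed_on: date
    change: str                     # the verified fact
    inference: str
    inference_confidence: int       # percent, e.g. 75
    recommendation: str
    recommendation_confidence: int  # percent, e.g. 60

brief = CompetitorBrief(
    competitor="TechCorp GmbH",
    source="pricing page scrape",
    observed_on=date(2024, 3, 15),
    change="New Enterprise tier introduced at €299/month.",
    inference="Suggests mid-market expansion and possible funding pressure.",
    inference_confidence=75,
    recommendation="Monitor for 30 days before preparing a response.",
    recommendation_confidence=60,
)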


The Numbers

Here is what changed, month by month:

Metric                     Before       Month 1      Month 3      Month 6
Research hours/week        40           25           12           4
Data sources monitored     5            15           35           50+
Competitors tracked        3            8            15           25
Reports generated          0            10           30           60
Missed opportunities       ~2/month     1/month      0            0
Cost of tooling            €0           €120/mo      €280/mo      €419/mo
Equivalent human cost      €8,000/mo    €5,000/mo    €2,400/mo    €800/mo

The cost of tooling is real: proxies, compute, APIs, storage. But the equivalent human cost falls far faster than the tooling cost rises.


The Architecture I Actually Built

For engineers who want to build this themselves, here is the stack:

Orchestration: n8n (self-hosted, fair-code licensed, built by a Berlin team)
Search Layer: Custom agents using arXiv, SerpAPI, Crunchbase, and RSS feeds
Extraction: Playwright with schema-based extraction and Pydantic validation
Storage: PostgreSQL for structured data, S3 for raw HTML snapshots
Analysis: MiniMax API for synthesis, Ollama with local models for sensitive data
Distribution: n8n email nodes and Slack webhooks
Frontend: Custom dashboard showing agent status, source health, and recent reports

The total infrastructure cost is approximately €419 per month at steady state.

Compare to hiring a junior researcher at €45,000 per year plus overhead. The math is not close.


What Breaks (And How to Fix It)

Source schema changes. A site redesigns and your CSS selectors break. Fix: source health monitoring with automatic alerting. Each source gets a reliability score. If extraction fails 3 times in a row, the agent switches to a secondary source.
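
The health check itself is just a failure counter per source with a fallback list. A sketch, using the same three-failure threshold; the reliability score here is plain success rate, and fetch_fn stands in for whatever actually scrapes a source.

# Sketch of per-source health tracking with automatic fallback after three
# consecutive extraction failures. fetch_fn is whatever scrapes the source.
FAIL_THRESHOLD = 3

class SourceHealth:
    def __init__(self):
        self.consecutive_failures = {}
        self.attempts = {}
        self.successes = {}

    def record(self, source: str, ok: bool) -> None:
        self.attempts[source] = self.attempts.get(source, 0) + 1
        self.successes[source] = self.successes.get(source, 0) + int(ok)
        self.consecutive_failures[source] = 0 if ok else self.consecutive_failures.get(source, 0) + 1

    def reliability(self, source: str) -> float:
        return self.successes.get(source, 0) / max(self.attempts.get(source, 0), 1)

    def is_degraded(self, source: str) -> bool:
        return self.consecutive_failures.get(source, 0) >= FAIL_THRESHOLD

def fetch_with_fallback(sources: list[str], fetch_fn, health: SourceHealth):
    for source in sources:                  # primary first, then secondaries
        if health.is_degraded(source):
            continue                        # skip sources that keep failing
        try:
            data = fetch_fn(source)
            health.record(source, ok=True)
            return data
        except Exception:
            health.record(source, ok=False) # alerting hook would go here
    raise RuntimeError("all sources degraded or failing")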

Rate limiting and blocks. Aggressive scraping gets you blocked. Fix: implement polite delays (1 request per second minimum), respect robots.txt, use rotating residential proxies, and accept that some sites are not scrapable.

LLM hallucination. Even with rule constraints, LLMs occasionally generate false inferences. Fix: every LLM output requires human review before distribution. The pipeline generates drafts, not final reports.

Data staleness. Prices and features change daily. A report generated on Monday may be wrong by Wednesday. Fix: freshness scoring. Every data point includes a "last verified" timestamp. Stale data is flagged automatically.
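
Freshness scoring is one timestamp and one comparison. A sketch, assuming every data point carries a last_verified timestamp:

# Sketch of freshness flagging: every data point carries a last_verified
# timestamp, and anything older than the allowed age is marked stale.
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=2)  # prices and features can change within days

def flag_stale(points: list[dict]) -> list[dict]:
    now = datetime.now(timezone.utc)
    for point in points:
        point["is_stale"] = (now - point["last_verified"]) > MAX_AGE
    return points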

API dependency fragility. SerpAPI changes pricing. Crunchbase updates rate limits. Fix: multi-source redundancy. Never depend on a single source for a critical data point.


The Privacy Problem Nobody Talks About

Most research automation creates a new problem: your automation stack becomes a surveillance trail.

If your scraping pipeline runs on AWS, Amazon sees your research targets. If you use Google Sheets for storage, Google sees your data. If you use Zapier for orchestration, Zapier processes your data.

The research stack I described above is deliberately designed to minimize third-party exposure:

  • Self-hosted orchestration (n8n on Hetzner, not Zapier)
  • Local LLM inference for sensitive analysis (not OpenAI)
  • EU-hosted infrastructure (GDPR-native by design)
  • No persistent query logs

If you are building research automation for competitive intelligence, your tooling is part of your threat model.
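
Routing the sensitive analysis locally is simpler than it sounds: with Ollama running on your own machine, it is one HTTP call to localhost. A sketch, assuming Ollama's default endpoint and a model you have already pulled; the hosted fallback is stubbed out, and the routing rule itself is illustrative.

# Sketch of routing sensitive synthesis to a local model via Ollama's REST API
# (default endpoint localhost:11434). Data marked sensitive never leaves the
# machine; everything else may go to a hosted API.
import requests  # pip install requests

def synthesize_locally(prompt: str, model: str = "llama3") -> str:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]

def call_hosted_api(prompt: str) -> str:
    raise NotImplementedError("hosted LLM call (e.g. MiniMax) goes here")

def synthesize(prompt: str, sensitive: bool) -> str:
    return synthesize_locally(prompt) if sensitive else call_hosted_api(prompt)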


Start Small

You do not need the full stack on day one. Start here:

Week 1: Identify your single biggest research time sink. Write down exactly what you search for and what format you need the output in.

Week 2: Build one search agent for that task. Use RSS feeds and free APIs. Do not build infrastructure yet.

Week 3: Add one extraction target. Pick a site with stable HTML. Use schema-based extraction, not XPath.

Week 4: Generate your first automated report. Review it manually. Iterate on the format.

Month 2: Add two more agents. Build a simple dashboard showing agent status and recent outputs.

Month 3: Integrate LLM synthesis for the highest-value reports. Add source health monitoring.


I am the founder of Graham Miranda UG, a Berlin-based company building privacy-first web intelligence tools. The architecture described above is what we ship in asearchz.online.
