Sales teams spend 40% of their time on research before making a single call: finding company details, identifying decision-makers, understanding tech stacks, tracking funding rounds, and analyzing competitive positioning. It's manual, repetitive, and expensive.
I recently built an AI research pipeline that automates most of this. Here's the architecture and the lessons from making it production-ready.
The Problem
A typical sales researcher needs to answer questions like:
- What does this company actually do?
- Who are the decision-makers?
- What tech stack are they using?
- Have they raised funding recently?
- What are their pain points based on job postings and reviews?
- How do they compare to competitors?
Doing this manually for 50 prospects takes a full day. An AI pipeline does it in minutes, with better coverage and consistency.
Architecture Overview
```
Prospect List (URLs/names)
        |
        v
[Web Scraper] ----------> company website, LinkedIn, Crunchbase, job boards
        |
        v
[Content Store] --------> raw pages stored with metadata
        |
        v
[Embedding Generator] --> vector representations for semantic search
        |
        v
[LLM Synthesizer] ------> structured intelligence reports per company
        |
        v
[Search Interface] -----> query across all research with natural language
```
Each component is independently scalable. The scraper runs on a schedule. The embedding generator processes new content asynchronously. The synthesizer produces reports on demand.
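To make the flow concrete, here's a rough orchestration sketch. `store_content` and `save_report` are hypothetical stand-ins for the content store and report store; `scrape_company`, `get_embedding`, and `synthesize_company` are defined in the sections below.

```python
# Rough orchestration sketch tying the stages together. store_content and
# save_report are hypothetical stand-ins; the other functions appear below.
import asyncio

async def process_prospect(url: str) -> None:
    raw = await scrape_company(url)               # Web Scraper
    store_content(raw)                            # Content Store
    embedding = get_embedding(raw["body_text"])   # Embedding Generator
    report = await synthesize_company(raw)        # LLM Synthesizer
    save_report(url, report, embedding)           # feeds the Search Interface

async def run_pipeline(prospect_urls: list[str]) -> None:
    # Process prospects concurrently; pair with the rate limiter shown later
    await asyncio.gather(*(process_prospect(url) for url in prospect_urls))
```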
Implementation
Web Scraping Layer
The scraper needs to handle JavaScript-rendered pages (most modern company websites) and respect rate limits:
```python
import httpx
from bs4 import BeautifulSoup


async def scrape_company(url: str) -> dict:
    async with httpx.AsyncClient(timeout=30) as client:
        response = await client.get(url, follow_redirects=True)

    soup = BeautifulSoup(response.text, "html.parser")

    # Extract structured data from the parsed page
    return {
        "url": url,
        "title": soup.title.string if soup.title and soup.title.string else "",
        "description": extract_meta(soup, "description"),
        "headings": [h.text.strip() for h in soup.find_all(["h1", "h2", "h3"])],
        "body_text": soup.get_text(separator=" ", strip=True)[:5000],
        "links": [a.get("href") for a in soup.find_all("a", href=True)],
    }


def extract_meta(soup: BeautifulSoup, name: str) -> str:
    # Try a standard <meta name=...> tag first, then the Open Graph variant
    tag = soup.find("meta", attrs={"name": name}) or soup.find(
        "meta", attrs={"property": f"og:{name}"}
    )
    return tag.get("content", "") if tag else ""
```
For JavaScript-heavy sites, swap httpx for a headless browser (Playwright). The tradeoff: roughly 10x slower, but it renders SPAs correctly.
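A minimal sketch of the Playwright variant (assumes `pip install playwright` and `playwright install chromium`); feed the returned HTML into the same BeautifulSoup parsing as above:

```python
from playwright.async_api import async_playwright


async def fetch_rendered_html(url: str) -> str:
    # Launch headless Chromium, wait for network activity to settle,
    # and return the fully rendered DOM for BeautifulSoup to parse
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url, wait_until="networkidle", timeout=30_000)
        html = await page.content()
        await browser.close()
    return html
```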
Enrichment Sources
Don't rely on a single source. Cross-reference multiple data points:
```python
ENRICHMENT_SOURCES = {
    "company_website": {
        "pages": ["/", "/about", "/team", "/careers", "/pricing"],
        "extract": ["description", "team_members", "tech_stack_hints"],
    },
    "job_postings": {
        "sources": ["careers_page", "linkedin_jobs"],
        "extract": ["tech_stack", "team_size_proxy", "growth_signals"],
    },
    "public_data": {
        "sources": ["crunchbase", "linkedin_company"],
        "extract": ["funding", "employee_count", "industry"],
    },
}
```
Job postings are the most underrated intelligence source. A company hiring "3 senior ML engineers" tells you they're investing in AI. A posting for "Stripe integration specialist" tells you their payment stack. This data is public, updated frequently, and highly actionable.
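As a hedged illustration of how those signals can be pulled out (the keyword list and the `extract_stack_signals` helper are mine, not part of the pipeline above):

```python
import re

# Illustrative keyword list; in practice this would be much larger
TECH_KEYWORDS = ["Stripe", "React", "PostgreSQL", "Kubernetes", "PyTorch", "Snowflake"]


def extract_stack_signals(posting_text: str) -> list[str]:
    """Return the known technologies mentioned in a job posting."""
    return [
        kw for kw in TECH_KEYWORDS
        if re.search(rf"\b{re.escape(kw)}\b", posting_text, re.IGNORECASE)
    ]
```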
LLM Synthesis
Raw scraped data is noise. The LLM turns it into structured intelligence:
```python
from ai_sdk import generate_text


async def synthesize_company(raw_data: dict) -> dict:
    prompt = f"""Analyze this company data and produce a structured intelligence report.

Company URL: {raw_data['url']}
Website content: {raw_data['body_text'][:3000]}
Job postings: {raw_data.get('jobs', 'None found')}
Team page: {raw_data.get('team', 'Not available')}

Output a JSON report with:
- company_name: string
- one_liner: what they do in one sentence
- industry: primary industry
- tech_stack: list of technologies mentioned or implied
- team_size_estimate: based on available signals
- growth_signals: list of indicators (hiring, funding, expansion)
- pain_points: inferred from job postings and product gaps
- decision_makers: names and titles if found
- competitive_position: how they differentiate
- relevance_score: 1-10 for our services (AI/automation)"""

    result = await generate_text(
        model="anthropic/claude-sonnet-4.6",
        prompt=prompt,
        output_format="json",
    )
    return result
```
The relevance_score field is key. It lets you automatically prioritize which prospects deserve immediate outreach vs. which can wait.
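For example, a trivial triage pass (the threshold of 7 is an assumption; tune it to your funnel):

```python
def triage(reports: list[dict]) -> tuple[list[dict], list[dict]]:
    # Split synthesized reports into an immediate-outreach list and a nurture queue
    hot = [r for r in reports if r["relevance_score"] >= 7]
    later = [r for r in reports if r["relevance_score"] < 7]
    return hot, later
```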
Semantic Search Across All Research
Once you have research on 100+ companies, you need a way to query across all of it:
```python
def search_prospects(query: str, top_k: int = 10) -> list[dict]:
    """Search across all company research with natural language."""
    query_embedding = get_embedding(query)

    results = []
    for company in all_companies:
        # Compare the query against each company's stored embedding
        score = cosine_similarity(query_embedding, company.embedding)
        if score > 0.72:  # similarity threshold, tuned empirically
            results.append({
                "company": company.name,
                "score": score,
                "summary": company.one_liner,
                "relevance": company.relevance_score,
            })

    return sorted(results, key=lambda x: x["score"], reverse=True)[:top_k]
```
Now your sales team can ask: "Which prospects are building AI products and recently raised funding?" and get instant answers instead of manually filtering spreadsheets.
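The `get_embedding` and `cosine_similarity` helpers above are assumed; here's a minimal sketch using OpenAI's embeddings API (the model choice is mine, and any embedding model works):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def get_embedding(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```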
Production Considerations
Rate Limiting and Politeness
Scraping at scale requires discipline:
```python
import asyncio
from collections import defaultdict


class RateLimiter:
    """Enforces a minimum delay between requests to the same domain."""

    def __init__(self, requests_per_second: float = 1.0):
        self.delay = 1.0 / requests_per_second
        self.domain_last_request = defaultdict(float)

    async def wait(self, domain: str):
        loop = asyncio.get_running_loop()
        elapsed = loop.time() - self.domain_last_request[domain]
        if elapsed < self.delay:
            # Sleep just long enough to honor the per-domain delay
            await asyncio.sleep(self.delay - elapsed)
        self.domain_last_request[domain] = loop.time()
```
One request per second per domain. Respect robots.txt. Use realistic user agents. This isn't about being sneaky; it's about being sustainable. Get blocked and you lose the data source permanently.
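For the robots.txt piece, the standard library is enough. A minimal sketch (the user agent string is a placeholder, and you'd cache the parser per domain in production):

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def allowed_by_robots(url: str, user_agent: str = "ResearchBot/1.0") -> bool:
    parts = urlparse(url)
    rp = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()  # fetch and parse the domain's robots.txt
    return rp.can_fetch(user_agent, url)
```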
Data Freshness
Company data goes stale fast. A startup that had 20 employees last month might have 50 now. Set up refresh cycles (one way to encode them is sketched after this list):
- Website content: Re-scrape monthly
- Job postings: Re-scrape weekly (highest signal-to-noise)
- Funding data: Re-check bi-weekly
- Re-embed: After any content update (embeddings reflect the latest data)
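A sketch of that cadence as a config (intervals in days; the helper name is illustrative):

```python
REFRESH_INTERVALS_DAYS = {
    "website_content": 30,  # monthly
    "job_postings": 7,      # weekly: highest signal-to-noise
    "funding_data": 14,     # bi-weekly
}


def needs_refresh(days_since_last_scrape: int, source: str) -> bool:
    return days_since_last_scrape >= REFRESH_INTERVALS_DAYS[source]
```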
Cost at Scale
For 1,000 companies:
- Scraping: ~5,000 pages, mostly free (httpx) or minimal (headless browser compute)
- Embeddings: 5,000 chunks, ~$0.30 total
- LLM synthesis: 1,000 reports, ~$2-5 (depending on model and input length)
- Storage: ~50MB for embeddings + metadata
Total: under $10/month for a comprehensive intelligence system covering 1,000 companies. Compare that to $500-2,000/month for ZoomInfo or Apollo.
The Output
After processing, each prospect has a structured profile:
```json
{
  "company": "Acme Corp",
  "one_liner": "B2B SaaS platform for supply chain visibility",
  "tech_stack": ["React", "Node.js", "PostgreSQL", "AWS"],
  "team_size": "50-100",
  "growth_signals": [
    "Hiring 4 engineers (2 senior)",
    "Series B ($15M, 2025)",
    "New VP of Engineering posting"
  ],
  "pain_points": [
    "Manual data pipeline management (hiring data engineers)",
    "No ML team yet (first ML hire posting)"
  ],
  "relevance_score": 8,
  "decision_makers": [
    {"name": "Jane Smith", "title": "CTO"},
    {"name": "John Doe", "title": "VP Engineering"}
  ]
}
```
Sales reps get this instead of spending 30 minutes per prospect doing manual research. The pipeline runs overnight and has fresh intelligence ready by morning.
What This Replaces
| Traditional Approach | AI Pipeline |
|---|---|
| 30 min/prospect research | 2 min/prospect (automated) |
| Inconsistent data quality | Structured, standardized |
| Stale (researched once, used for months) | Auto-refreshing weekly |
| Keyword search across notes | Semantic search across everything |
| $500-2K/mo for data tools | $10/mo compute + API costs |
I build production AI pipelines like this for companies. If you're automating prospect research or building intelligence tooling, let's talk. More at astraedus.dev.