Sales teams spend 40% of their time on research before making a single call: finding company details, identifying decision-makers, understanding tech stacks, tracking funding rounds, and analyzing competitive positioning. It's manual, repetitive, and expensive.
I recently built an AI research pipeline that automates most of this. Here's the architecture and the lessons from making it production-ready.
The Problem
A typical sales researcher needs to answer questions like:
- What does this company actually do?
- Who are the decision-makers?
- What tech stack are they using?
- Have they raised funding recently?
- What are their pain points based on job postings and reviews?
- How do they compare to competitors?
Doing this manually for 50 prospects takes a full day. An AI pipeline does it in minutes, with better coverage and consistency.
Architecture Overview
```
Prospect List (URLs/names)
        |
        v
[Web Scraper] ----------> company website, LinkedIn, Crunchbase, job boards
        |
        v
[Content Store] --------> raw pages stored with metadata
        |
        v
[Embedding Generator] --> vector representations for semantic search
        |
        v
[LLM Synthesizer] ------> structured intelligence reports per company
        |
        v
[Search Interface] -----> query across all research with natural language
```
Each component is independently scalable. The scraper runs on a schedule. The embedding generator processes new content asynchronously. The synthesizer produces reports on demand.
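To make the flow concrete, here's a rough orchestration sketch. `store_content` and `save_report` are hypothetical stand-ins for the content store and report store; `scrape_company`, `get_embedding`, and `synthesize_company` are defined in the sections below.

```python
# Rough orchestration sketch tying the stages together. store_content and
# save_report are hypothetical stand-ins; the other functions appear below.
import asyncio

async def process_prospect(url: str) -> None:
    raw = await scrape_company(url)               # Web Scraper
    store_content(raw)                            # Content Store
    embedding = get_embedding(raw["body_text"])   # Embedding Generator
    report = await synthesize_company(raw)        # LLM Synthesizer
    save_report(url, report, embedding)           # feeds the Search Interface

async def run_pipeline(prospect_urls: list[str]) -> None:
    # Process prospects concurrently; pair with the rate limiter shown later
    await asyncio.gather(*(process_prospect(url) for url in prospect_urls))
```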
Implementation
Web Scraping Layer
The scraper needs to handle JavaScript-rendered pages (most modern company websites) and respect rate limits:
```python
import httpx
from bs4 import BeautifulSoup


async def scrape_company(url: str) -> dict:
    async with httpx.AsyncClient(timeout=30) as client:
        response = await client.get(url, follow_redirects=True)

    soup = BeautifulSoup(response.text, "html.parser")

    # Extract structured data from the parsed page
    return {
        "url": url,
        "title": soup.title.string if soup.title and soup.title.string else "",
        "description": extract_meta(soup, "description"),
        "headings": [h.text.strip() for h in soup.find_all(["h1", "h2", "h3"])],
        "body_text": soup.get_text(separator=" ", strip=True)[:5000],
        "links": [a.get("href") for a in soup.find_all("a", href=True)],
    }


def extract_meta(soup: BeautifulSoup, name: str) -> str:
    # Try a standard <meta name=...> tag first, then the Open Graph variant
    tag = soup.find("meta", attrs={"name": name}) or soup.find(
        "meta", attrs={"property": f"og:{name}"}
    )
    return tag.get("content", "") if tag else ""
```
For JavaScript-heavy sites, swap httpx for a headless browser (Playwright). The tradeoff: roughly 10x slower, but it renders SPAs correctly.
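A minimal sketch of the Playwright variant (assumes `pip install playwright` and `playwright install chromium`); feed the returned HTML into the same BeautifulSoup parsing as above:

```python
from playwright.async_api import async_playwright


async def fetch_rendered_html(url: str) -> str:
    # Launch headless Chromium, wait for network activity to settle,
    # and return the fully rendered DOM for BeautifulSoup to parse
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url, wait_until="networkidle", timeout=30_000)
        html = await page.content()
        await browser.close()
    return html
```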
Enrichment Sources
Don't rely on a single source. Cross-reference multiple data points:
```python
ENRICHMENT_SOURCES = {
    "company_website": {
        "pages": ["/", "/about", "/team", "/careers", "/pricing"],
        "extract": ["description", "team_members", "tech_stack_hints"],
    },
    "job_postings": {
        "sources": ["careers_page", "linkedin_jobs"],
        "extract": ["tech_stack", "team_size_proxy", "growth_signals"],
    },
    "public_data": {
        "sources": ["crunchbase", "linkedin_company"],
        "extract": ["funding", "employee_count", "industry"],
    },
}
```
Job postings are the most underrated intelligence source. A company hiring "3 senior ML engineers" tells you they're investing in AI. A posting for "Stripe integration specialist" tells you their payment stack. This data is public, updated frequently, and highly actionable.
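As a hedged illustration of how those signals can be pulled out (the keyword list and the `extract_stack_signals` helper are mine, not part of the pipeline above):

```python
import re

# Illustrative keyword list; in practice this would be much larger
TECH_KEYWORDS = ["Stripe", "React", "PostgreSQL", "Kubernetes", "PyTorch", "Snowflake"]


def extract_stack_signals(posting_text: str) -> list[str]:
    """Return the known technologies mentioned in a job posting."""
    return [
        kw for kw in TECH_KEYWORDS
        if re.search(rf"\b{re.escape(kw)}\b", posting_text, re.IGNORECASE)
    ]
```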
LLM Synthesis
Raw scraped data is noise. The LLM turns it into structured intelligence:
```python
from ai_sdk import generate_text


async def synthesize_company(raw_data: dict) -> dict:
    prompt = f"""Analyze this company data and produce a structured intelligence report.

Company URL: {raw_data['url']}
Website content: {raw_data['body_text'][:3000]}
Job postings: {raw_data.get('jobs', 'None found')}
Team page: {raw_data.get('team', 'Not available')}

Output a JSON report with:
- company_name: string
- one_liner: what they do in one sentence
- industry: primary industry
- tech_stack: list of technologies mentioned or implied
- team_size_estimate: based on available signals
- growth_signals: list of indicators (hiring, funding, expansion)
- pain_points: inferred from job postings and product gaps
- decision_makers: names and titles if found
- competitive_position: how they differentiate
- relevance_score: 1-10 for our services (AI/automation)"""

    result = await generate_text(
        model="anthropic/claude-sonnet-4.6",
        prompt=prompt,
        output_format="json",
    )
    return result
```
The relevance_score field is key. It lets you automatically prioritize which prospects deserve immediate outreach vs. which can wait.
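For example, a trivial triage pass (the threshold of 7 is an assumption; tune it to your funnel):

```python
def triage(reports: list[dict]) -> tuple[list[dict], list[dict]]:
    # Split synthesized reports into an immediate-outreach list and a nurture queue
    hot = [r for r in reports if r["relevance_score"] >= 7]
    later = [r for r in reports if r["relevance_score"] < 7]
    return hot, later
```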
Semantic Search Across All Research
Once you have research on 100+ companies, you need a way to query across all of it:
```python
def search_prospects(query: str, top_k: int = 10) -> list[dict]:
    """Search across all company research with natural language."""
    query_embedding = get_embedding(query)

    results = []
    for company in all_companies:
        # Compare the query against each company's stored embedding
        score = cosine_similarity(query_embedding, company.embedding)
        if score > 0.72:  # similarity threshold, tuned empirically
            results.append({
                "company": company.name,
                "score": score,
                "summary": company.one_liner,
                "relevance": company.relevance_score,
            })

    return sorted(results, key=lambda x: x["score"], reverse=True)[:top_k]
```
Now your sales team can ask: "Which prospects are building AI products and recently raised funding?" and get instant answers instead of manually filtering spreadsheets.
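The `get_embedding` and `cosine_similarity` helpers above are assumed; here's a minimal sketch using OpenAI's embeddings API (the model choice is mine, and any embedding model works):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def get_embedding(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```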
Production Considerations
Rate Limiting and Politeness
Scraping at scale requires discipline:
```python
import asyncio
from collections import defaultdict


class RateLimiter:
    """Enforces a minimum delay between requests to the same domain."""

    def __init__(self, requests_per_second: float = 1.0):
        self.delay = 1.0 / requests_per_second
        self.domain_last_request = defaultdict(float)

    async def wait(self, domain: str):
        loop = asyncio.get_running_loop()
        elapsed = loop.time() - self.domain_last_request[domain]
        if elapsed < self.delay:
            # Sleep just long enough to honor the per-domain delay
            await asyncio.sleep(self.delay - elapsed)
        self.domain_last_request[domain] = loop.time()
```
One request per second per domain. Respect robots.txt. Use realistic user agents. This isn't about being sneaky; it's about being sustainable. Get blocked and you lose the data source permanently.
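For the robots.txt piece, the standard library is enough. A minimal sketch (the user agent string is a placeholder, and you'd cache the parser per domain in production):

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def allowed_by_robots(url: str, user_agent: str = "ResearchBot/1.0") -> bool:
    parts = urlparse(url)
    rp = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()  # fetch and parse the domain's robots.txt
    return rp.can_fetch(user_agent, url)
```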
Data Freshness
Company data goes stale fast. A startup that had 20 employees last month might have 50 now. Set up refresh cycles (one way to encode them is sketched after this list):
- Website content: Re-scrape monthly
- Job postings: Re-scrape weekly (highest signal-to-noise)
- Funding data: Re-check bi-weekly
- Re-embed: After any content update (embeddings reflect the latest data)
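A sketch of that cadence as a config (intervals in days; the helper name is illustrative):

```python
REFRESH_INTERVALS_DAYS = {
    "website_content": 30,  # monthly
    "job_postings": 7,      # weekly: highest signal-to-noise
    "funding_data": 14,     # bi-weekly
}


def needs_refresh(days_since_last_scrape: int, source: str) -> bool:
    return days_since_last_scrape >= REFRESH_INTERVALS_DAYS[source]
```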
Cost at Scale
For 1,000 companies:
- Scraping: ~5,000 pages, mostly free (httpx) or minimal (headless browser compute)
- Embeddings: 5,000 chunks, ~$0.30 total
- LLM synthesis: 1,000 reports, ~$2-5 (depending on model and input length)
- Storage: ~50MB for embeddings + metadata
Total: under $10/month for a comprehensive intelligence system covering 1,000 companies. Compare that to $500-2,000/month for ZoomInfo or Apollo.
The Output
After processing, each prospect has a structured profile:
```json
{
  "company": "Acme Corp",
  "one_liner": "B2B SaaS platform for supply chain visibility",
  "tech_stack": ["React", "Node.js", "PostgreSQL", "AWS"],
  "team_size": "50-100",
  "growth_signals": [
    "Hiring 4 engineers (2 senior)",
    "Series B ($15M, 2025)",
    "New VP of Engineering posting"
  ],
  "pain_points": [
    "Manual data pipeline management (hiring data engineers)",
    "No ML team yet (first ML hire posting)"
  ],
  "relevance_score": 8,
  "decision_makers": [
    {"name": "Jane Smith", "title": "CTO"},
    {"name": "John Doe", "title": "VP Engineering"}
  ]
}
```
Sales reps get this instead of spending 30 minutes per prospect doing manual research. The pipeline runs overnight and has fresh intelligence ready by morning.
What This Replaces
| Traditional Approach | AI Pipeline |
|---|---|
| 30 min/prospect research | 2 min/prospect (automated) |
| Inconsistent data quality | Structured, standardized |
| Stale (researched once, used for months) | Auto-refreshing weekly |
| Keyword search across notes | Semantic search across everything |
| $500-2K/mo for data tools | $10/mo compute + API costs |
I build production AI pipelines like this for companies. If you're automating prospect research or building intelligence tooling, let's talk. More at astraedus.dev.