
Amirali Nurmagomedov

Originally published at companyenrich.com

How to enrich a company from a domain (Python + LLM workflow)

You have a domain. You want a clean company profile: name, description, industry, socials, contact info, location. Sounds like a one-liner. It isn't, but it's closer than you'd think.

The workflow has four parts: fetch the site, extract fields with an LLM, validate the output, and figure out the point where building your own thing stops being worth it. I'll walk through all four, with real Python you can run today for the first three.

Stack: Firecrawl for fetching, OpenAI structured outputs for extraction. Plain HTTP, no SDKs, so you can port it to Node, Go, or whatever you live in.

The naive version (and why it breaks)

The demo everybody writes:

  1. Fetch the homepage
  2. Throw the HTML at an LLM
  3. Get JSON back

This works beautifully on stripe.com. Then your next record is a parked domain, the one after that hides its address three clicks deep, and the one after that is a JS app that returns an empty shell on the first request. The model is rarely the hard part. Page discovery, missing data, validation, and cost control are.

The single most useful mindset shift: make it cheap for the model to be honest. If the page doesn't show a phone number, the right answer is null, not a plausible guess that happens to match the country code.

Setup

pip install requests

Imports and configuration shared by every snippet below:

import json
import os
import re
from urllib.parse import urljoin, urlparse

import requests

FIRECRAWL_API_KEY = os.environ["FIRECRAWL_API_KEY"]
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
OPENAI_MODEL = os.environ.get("OPENAI_MODEL", "gpt-5.5")

MAX_CHARS_PER_PAGE = 12000
SUPPORTING_PAGE_LIMIT = 3

Step 1: Normalize the domain

Input is always messier than you want: companyenrich.com, https://companyenrich.com, and www.companyenrich.com/ all mean the same site. Clean it up first.

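def normalize_domain_to_url(value: str) -> str:
    value = value.strip()
    # Compare case-insensitively so "HTTPS://Example.com" is not double-prefixed.
    if not value.lower().startswith(("http://", "https://")):
        value = "https://" + value

    parsed = urlparse(value)
    # Keep only scheme and host. Hostnames are case-insensitive, so lowercase them.
    return f"{parsed.scheme}://{parsed.netloc.lower()}"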

Step 2: Scrape, and keep the footer

Here's a gotcha. Firecrawl's onlyMainContent defaults to true, which strips headers, navs, and footers. Great for reading articles. Terrible for enrichment, because the footer is exactly where social links and contact details live. Set it to False.

def scrape_url(url: str) -> dict:
    response = requests.post(
        "https://api.firecrawl.dev/v1/scrape",
        headers={
            "Authorization": f"Bearer {FIRECRAWL_API_KEY}",
            "Content-Type": "application/json",
        },
        json={
            "url": url,
            "formats": ["markdown", "links"],
            "onlyMainContent": False,
            "removeBase64Images": True,
            "blockAds": True,
        },
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["data"]

The homepage alone usually isn't enough. Use the returned links to find a few supporting pages, same-domain only.

def discover_supporting_urls(home_url: str, links: list[str]) -> list[str]:
    # removeprefix() strips only a leading "www."; replace() would also mangle
    # hosts that contain "www." somewhere else.
    home_host = urlparse(home_url).netloc.lower().removeprefix("www.")
    keywords = ("/about", "/company", "/contact", "/locations", "/location", "/team")

    candidates = []
    for link in links or []:
        # Accept plain URL strings as well as {"href": ...} / {"url": ...} dicts.
        href = link if isinstance(link, str) else link.get("href") or link.get("url")
        if not href:
            continue

        absolute = urljoin(home_url, href)
        parsed = urlparse(absolute)
        host = parsed.netloc.lower().removeprefix("www.")
        path = parsed.path.lower()

        if host == home_host and any(keyword in path for keyword in keywords):
            candidates.append(absolute.split("#")[0])

    # dict.fromkeys() de-duplicates while preserving first-seen order.
    return list(dict.fromkeys(candidates))[:SUPPORTING_PAGE_LIMIT]

Then stitch it all into one payload for the model:

def fetch_company_pages(domain: str) -> str:
    home_url = normalize_domain_to_url(domain)
    pages = []

    home_data = scrape_url(home_url)
    pages.append((home_url, home_data.get("markdown", "")))

    for url in discover_supporting_urls(home_url, home_data.get("links", [])):
        try:
            page_data = scrape_url(url)
            pages.append((url, page_data.get("markdown", "")))
        except requests.RequestException:
            continue

    sections = []
    for url, markdown in pages:
        trimmed = (markdown or "")[:MAX_CHARS_PER_PAGE]
        sections.append(f"URL: {url}\n\n{trimmed}")

    return "\n\n---\n\n".join(sections)

Step 3: Define a nullable schema

This is the part people get wrong. If your prompt says "return null when the field is missing" but your schema says "name": {"type": "string"}, you've handed the model a contradiction. It has to return a string, so it makes one up. Nullable fields aren't a compromise here, they're the whole point.

COMPANY_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": ["string", "null"]},
        "domain": {"type": ["string", "null"]},
        "description": {"type": ["string", "null"]},
        "long_description": {"type": ["string", "null"]},
        "industry": {"type": ["string", "null"]},
        "founded_year": {"type": ["integer", "null"]},
        "headquarters": {
            "type": "object",
            "properties": {
                "city": {"type": ["string", "null"]},
                "country": {"type": ["string", "null"]},
                "full_address": {"type": ["string", "null"]},
            },
            "required": ["city", "country", "full_address"],
            "additionalProperties": False,
        },
        "phone": {"type": ["string", "null"]},
        "email": {"type": ["string", "null"]},
        "social_profiles": {
            "type": "object",
            "properties": {
                "linkedin": {"type": ["string", "null"]},
                "x": {"type": ["string", "null"]},
                "facebook": {"type": ["string", "null"]},
                "instagram": {"type": ["string", "null"]},
                "youtube": {"type": ["string", "null"]},
            },
            "required": ["linkedin", "x", "facebook", "instagram", "youtube"],
            "additionalProperties": False,
        },
        "evidence_urls": {"type": "array", "items": {"type": "string"}},
    },
    "required": [
        "name", "domain", "description", "long_description", "industry",
        "founded_year", "headquarters", "phone", "email",
        "social_profiles", "evidence_urls",
    ],
    "additionalProperties": False,
}

Step 4: Extract with structured outputs

def extract_company_profile(page_content: str) -> dict:
    response = requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers={
            "Authorization": f"Bearer {OPENAI_API_KEY}",
            "Content-Type": "application/json",
        },
        json={
            "model": OPENAI_MODEL,
            "messages": [
                {
                    "role": "system",
                    "content": (
                        "Extract structured company data from the provided website pages. "
                        "Use null when a field is not explicitly visible in the provided content. "
                        "Do not guess phone numbers, email addresses, physical addresses, or social URLs. "
                        "Only include evidence_urls that appear in the input."
                    ),
                },
                {"role": "user", "content": page_content},
            ],
            "response_format": {
                "type": "json_schema",
                "json_schema": {
                    "name": "company_profile",
                    "strict": True,
                    "schema": COMPANY_SCHEMA,
                },
            },
        },
        timeout=120,
    )
    response.raise_for_status()

    message = response.json()["choices"][0]["message"]
    if message.get("refusal"):
        raise ValueError(f"Model refused extraction: {message['refusal']}")

    return json.loads(message["content"])

Structured outputs enforce the shape of the response. They do not make the values true. That's the next step.

Step 5: Validate (this is where a demo becomes real)

Cheap checks first. Does the phone look like a phone? Does the LinkedIn URL point to a company page, not a person? Is that X link actually on x.com? When in doubt, null it.

PHONE_RE = re.compile(r"^\+?[0-9][0-9\s().-]{6,}$")
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def valid_url(value: str | None) -> bool:
    if not value:
        return False
    parsed = urlparse(value)
    return parsed.scheme in {"http", "https"} and bool(parsed.netloc)


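def on_domain(value: str, *domains: str) -> bool:
    # True if the URL's parsed host is one of `domains` or a subdomain of one.
    # A substring test is not enough: "dropbox.com/" contains "x.com/".
    host = (urlparse(value).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in domains)


def validate_profile(profile: dict) -> dict:
    if profile.get("phone") and not PHONE_RE.match(profile["phone"]):
        profile["phone"] = None

    if profile.get("email") and not EMAIL_RE.match(profile["email"]):
        profile["email"] = None

    socials = profile.get("social_profiles") or {}

    # Company pages live under /company/; personal profiles (/in/...) are rejected.
    # URLs without a scheme have no parsable host and are nulled as well.
    linkedin = socials.get("linkedin")
    if linkedin and not (
        on_domain(linkedin, "linkedin.com")
        and urlparse(linkedin).path.startswith("/company/")
    ):
        socials["linkedin"] = None

    x_url = socials.get("x")
    if x_url and not on_domain(x_url, "x.com", "twitter.com"):
        socials["x"] = None

    for key in ("facebook", "instagram", "youtube"):
        if socials.get(key) and not valid_url(socials[key]):
            socials[key] = None

    profile["social_profiles"] = socials
    return profile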

Conservative on purpose. A null is annoying. A fake phone number in your CRM is a support ticket.
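
One more check in the same spirit is to confirm that each contact value actually occurs in the text the model saw. The sketch below is not part of the original pipeline: `drop_ungrounded` is a hypothetical helper, and it assumes `page_content` is the exact string passed to `extract_company_profile`.

```python
import re


def _digits(text: str) -> str:
    """Keep only digits so spacing and punctuation differences don't matter."""
    return re.sub(r"\D", "", text)


def drop_ungrounded(profile: dict, page_content: str) -> dict:
    """Null out contact fields whose value never appears in the scraped text.

    Hypothetical helper; assumes page_content is what the model was given.
    """
    haystack = page_content.lower()

    email = profile.get("email")
    if email and email.lower() not in haystack:
        profile["email"] = None

    phone = profile.get("phone")
    if phone:
        digits = _digits(phone)
        # Crude on purpose: compares digit runs, ignoring formatting.
        if not digits or digits not in _digits(page_content):
            profile["phone"] = None

    # Evidence URLs must occur in the input, e.g. in the "URL: ..." headers.
    profile["evidence_urls"] = [
        url for url in profile.get("evidence_urls") or [] if url.lower() in haystack
    ]
    return profile
```

This errs on the side of nulls too: a phone the model reformatted with a country code the page never shows will be dropped, which matches the "when in doubt, null it" rule above.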

Wire it together

def enrich_company_from_domain(domain: str) -> dict:
    page_content = fetch_company_pages(domain)
    raw_profile = extract_company_profile(page_content)
    return validate_profile(raw_profile)


if __name__ == "__main__":
    company = enrich_company_from_domain("companyenrich.com")
    print(json.dumps(company, indent=2))

Sample output (condensed; long_description omitted):

{
  "name": "CompanyEnrich",
  "domain": "companyenrich.com",
  "description": "API-first B2B data platform for company and people intelligence.",
  "industry": "B2B data infrastructure",
  "founded_year": null,
  "headquarters": { "city": "Ankara", "country": "Turkey", "full_address": null },
  "phone": null,
  "email": null,
  "social_profiles": {
    "linkedin": "https://www.linkedin.com/company/companyenrich",
    "x": null, "facebook": null, "instagram": null, "youtube": null
  },
  "evidence_urls": ["https://companyenrich.com"]
}

Notice how much is null. That's not failure, that's honesty. The site simply doesn't publish those fields.
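
To see how much comes back null across a batch rather than a single record, a per-field null rate is a useful number to track. A minimal sketch, assuming profiles shaped like the sample above; `flatten` and `null_rates` are illustrative names, not part of the pipeline.

```python
from collections import Counter


def flatten(profile: dict, prefix: str = "") -> dict:
    """Flatten nested objects into dotted keys, e.g. "headquarters.city"."""
    flat = {}
    for key, value in profile.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{path}."))
        else:
            flat[path] = value
    return flat


def null_rates(profiles: list) -> dict:
    """Share of profiles in which each field came back null or empty."""
    nulls, totals = Counter(), Counter()
    for profile in profiles:
        for field, value in flatten(profile).items():
            totals[field] += 1
            if value is None or value == "" or value == []:
                nulls[field] += 1
    return {field: nulls[field] / totals[field] for field in totals}
```

A field whose null rate sits near 100% on a reviewed sample is usually a sourcing problem rather than a prompt problem.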

Which fields are actually reliable from a website?

The biggest DIY mistake is treating missing data as a model bug when the source page just doesn't contain it.

Field         | Reliability  | Notes
Name          | High         | Usually right there on the homepage
Description   | High         | Hero or meta description
Industry      | Medium       | Often inferred, flag confidence
Founded year  | Low-medium   | Frequently missing
HQ city       | Medium       | Footer or contact page
Full address  | Low          | Lots of SaaS sites omit it
Phone / Email | Low          | Often hidden behind a form
LinkedIn      | High         | Validate company vs personal
Technologies  | Out of scope | Needs a tech-detection source
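
One way to act on the "flag confidence" note for industry is to make the schema carry the flag. The fragment below is a sketch only: the `value` and `basis` field names and the `trusted_industry` helper are hypothetical, and the fragment would replace the flat `industry` entry in `COMPANY_SCHEMA`.

```python
from typing import Optional

# Hypothetical replacement for the flat "industry" field in COMPANY_SCHEMA.
INDUSTRY_FIELD = {
    "type": "object",
    "properties": {
        "value": {"type": ["string", "null"]},
        # "stated": the page names the industry; "inferred": the model deduced it.
        "basis": {"type": "string", "enum": ["stated", "inferred", "unknown"]},
    },
    "required": ["value", "basis"],
    "additionalProperties": False,
}


def trusted_industry(industry: dict, allow_inferred: bool = False) -> Optional[str]:
    """Return the industry only if its basis meets the caller's bar."""
    allowed = {"stated", "inferred"} if allow_inferred else {"stated"}
    return industry["value"] if industry["basis"] in allowed else None
```

Downstream code can then decide per use case whether an inferred label is good enough, instead of losing that distinction at extraction time.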

A website tells you what a company says about itself. It can't reliably give you headcount, revenue, funding, tech stack, or historical changes. For those you need other sources, and that's exactly where DIY extraction ends and production enrichment begins.

What breaks at scale

The first 10 domains feel free. The next 10,000 teach you where the cost actually lives.

  • Bot protection: Cloudflare/Akamai challenge pages. Retry, log, use a renderer, don't loop forever.
  • JS-heavy sites: empty shell on plain HTTP. Use browser rendering.
  • Thin homepages: one landing page, no footer. Widen your source set or accept the gaps.
  • Parked / redirected domains: detect them, store a status, don't treat them like real companies.
  • Hallucinations: structured output doesn't fix this. Validation outside the LLM call does.
  • Rate limits and cost spikes: queue, back off, cache by domain, cap pages per domain.

Give every scrape a status instead of a silent failure: success, blocked, timeout, parked, redirected, empty_content. A blocked site and a company-with-no-phone are very different things and should never look the same in your data.
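
The status list above could be sketched as an enum plus a small classifier. Everything here is an assumption for illustration: the hint phrases, the 200-character threshold, and the `final_url` and `timed_out` parameters, which stand in for however your fetcher reports the landing URL and timeouts.

```python
from enum import Enum
from urllib.parse import urlparse


class ScrapeStatus(str, Enum):
    SUCCESS = "success"
    BLOCKED = "blocked"
    TIMEOUT = "timeout"
    PARKED = "parked"
    REDIRECTED = "redirected"
    EMPTY_CONTENT = "empty_content"


# Illustrative phrases only; tune them against pages you have actually seen.
BLOCK_HINTS = ("verify you are human", "checking your browser", "access denied")
PARKED_HINTS = ("this domain is for sale", "buy this domain", "domain is parked")
MIN_USEFUL_CHARS = 200  # assumption: shorter pages count as empty shells


def _host(url: str) -> str:
    return (urlparse(url).hostname or "").lower().removeprefix("www.")


def classify_scrape(requested_url, final_url, markdown, timed_out=False):
    """Map one scrape result to a status so failures are never silent."""
    text = (markdown or "").lower()
    if timed_out:
        return ScrapeStatus.TIMEOUT
    if any(hint in text for hint in BLOCK_HINTS):
        return ScrapeStatus.BLOCKED
    if any(hint in text for hint in PARKED_HINTS):
        return ScrapeStatus.PARKED
    if final_url and _host(final_url) != _host(requested_url):
        return ScrapeStatus.REDIRECTED
    if len(text.strip()) < MIN_USEFUL_CHARS:
        return ScrapeStatus.EMPTY_CONTENT
    return ScrapeStatus.SUCCESS
```

Storing the status next to the profile means a null phone on a `success` row and a null phone on a `blocked` row can be told apart later.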

For cost control, send the smallest useful set of pages, not the whole site. Good defaults: homepage plus up to 3 supporting pages, 8k-15k chars each, no base64 images, no blog archives.
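
The "back off" and "cache by domain" advice can be sketched in a few lines. This is a minimal in-memory version under stated assumptions: `with_backoff` and `DomainCache` are hypothetical names, the 30-day TTL is arbitrary, and a real deployment would narrow `retry_on` to rate-limit and network errors and persist the cache.

```python
import random
import time


def with_backoff(call, retries=4, base_delay=1.0, retry_on=(Exception,), sleep=time.sleep):
    """Retry `call` with exponential backoff plus jitter; re-raise when retries run out."""
    for attempt in range(retries + 1):
        try:
            return call()
        except retry_on:
            if attempt == retries:
                raise
            # Delay doubles each attempt; jitter spreads out concurrent workers.
            sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))


class DomainCache:
    """In-memory cache keyed by domain, so repeat lookups cost nothing."""

    def __init__(self, ttl_seconds=30 * 24 * 3600, clock=time.time):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._store = {}

    def get_or_compute(self, domain, compute):
        key = domain.strip().lower()
        hit = self._store.get(key)
        if hit and self.clock() - hit[0] < self.ttl_seconds:
            return hit[1]
        value = compute()
        self._store[key] = (self.clock(), value)
        return value
```

With the functions from this post, a call might look like `cache.get_or_compute(domain, lambda: with_backoff(lambda: enrich_company_from_domain(domain)))`.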

Build vs buy, honestly

Not "DIY bad, API good." The honest version:

  • ~100 domains/month: build it yourself. You'll learn the mechanics and the cost is nothing.
  • ~10,000 domains/month: engineering time now costs more than the token bill.
  • Feeds a CRM, agent, or customer-facing product: you want a dedicated layer.

Build it once to understand the workflow. Buy it when the workflow becomes infrastructure. If you reach that point, a company enrichment API pulls from registries, maps, filings, and tech detection on top of the website, which is what fills the gaps a homepage-only flow never can. (Full disclosure: that's the product I work on.)

A note on doing this responsibly

Keep it boring. Use public business info for legitimate business workflows. Don't scrape private areas, don't infer sensitive personal attributes, and respect robots.txt and rate limits. And separate company-level enrichment from person-level enrichment: a domain identifies a company, it isn't a license to harvest every individual you can find. If you store personal data, loop in your legal team on GDPR/CCPA.

TL;DR checklist

  • Normalize the domain before fetching
  • Scrape the homepage with footer included
  • Discover and fetch contact/about/location pages
  • Use a nullable JSON schema
  • Tell the model to return null instead of guessing
  • Validate phone, email, address, and socials
  • Store evidence URLs
  • Test on a reviewed sample before scaling

That's the whole thing. The model is the easy part. The validation layer and the build-vs-buy call are what actually decide whether your enrichment is trustworthy.

If you build this, drop your null rate in the comments. Curious how different stacks compare.


I work on CompanyEnrich, an API-first B2B data platform for company and people intelligence. We spend a lot of time on exactly the messy parts above: page discovery, multi-source enrichment, and validation at volume. If you'd rather not maintain your own extraction stack, that's what we build.
