Amirali Nurmagomedov

Posted on May 29 • Originally published at companyenrich.com

How to enrich a company from a domain (Python + LLM workflow)

#python #ai #tutorial #automation

You have a domain. You want a clean company profile: name, description, industry, socials, contact info, location. Sounds like a one-liner. It isn't, but it's closer than you'd think.

The workflow has four parts: fetch the site, extract fields with an LLM, validate the output, and decide when building your own thing stops being worth it. I'll walk through all four, with real Python you can run today for the first three.

Stack: Firecrawl for fetching, OpenAI structured outputs for extraction. Plain HTTP, no SDKs, so you can port it to Node, Go, or whatever you live in.

The naive version (and why it breaks)

The demo everybody writes:

Fetch the homepage
Throw the HTML at an LLM
Get JSON back

This works beautifully on stripe.com. Then your next record is a parked domain, the one after that hides its address three clicks deep, and the one after that is a JS app that returns an empty shell on the first request. The model is rarely the hard part. Page discovery, missing data, validation, and cost control are.

The single most useful mindset shift: make it cheap for the model to be honest. If the page doesn't show a phone number, the right answer is null, not a plausible guess that happens to match the country code.

Setup

pip install requests

import json
import os
import re
from urllib.parse import urljoin, urlparse

import requests

FIRECRAWL_API_KEY = os.environ["FIRECRAWL_API_KEY"]
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
OPENAI_MODEL = os.environ.get("OPENAI_MODEL", "gpt-5.5")

MAX_CHARS_PER_PAGE = 12000
SUPPORTING_PAGE_LIMIT = 3

Step 1: Normalize the domain

Input is always messier than you want. companyenrich.com, https://companyenrich.com, www.companyenrich.com/. Clean it up first.

def normalize_domain_to_url(value: str) -> str:
    value = value.strip()
    if not value.startswith(("http://", "https://")):
        value = "https://" + value

    parsed = urlparse(value)
    return f"{parsed.scheme}://{parsed.netloc}".rstrip("/")

Step 2: Scrape, and keep the footer

Here's a gotcha. Firecrawl's onlyMainContent defaults to true, which strips headers, navs, and footers. Great for reading articles. Terrible for enrichment, because the footer is exactly where social links and contact details live. Set it to False.

def scrape_url(url: str) -> dict:
    response = requests.post(
        "https://api.firecrawl.dev/v1/scrape",
        headers={
            "Authorization": f"Bearer {FIRECRAWL_API_KEY}",
            "Content-Type": "application/json",
        },
        json={
            "url": url,
            "formats": ["markdown", "links"],
            "onlyMainContent": False,
            "removeBase64Images": True,
            "blockAds": True,
        },
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["data"]

The homepage alone usually isn't enough. Use the returned links to find a few supporting pages, same-domain only.

def discover_supporting_urls(home_url: str, links: list[str]) -> list[str]:
    home_host = (urlparse(home_url).hostname or "").removeprefix("www.")
    keywords = ("/about", "/company", "/contact", "/locations", "/location", "/team")

    candidates = []
    for link in links or []:
        href = link if isinstance(link, str) else link.get("href") or link.get("url")
        if not href:
            continue

        absolute = urljoin(home_url, href)
        parsed = urlparse(absolute)
        # Strip only a leading "www." (str.replace would also hit "awww.example.com").
        host = (parsed.hostname or "").removeprefix("www.")
        path = parsed.path.lower()
        path = parsed.path.lower()

        if host == home_host and any(keyword in path for keyword in keywords):
            candidates.append(absolute.split("#")[0])

    return list(dict.fromkeys(candidates))[:SUPPORTING_PAGE_LIMIT]
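
One thing the function above leaves to page order is which three links survive the limit. A minimal sketch of ranking candidates first, assuming contact and about pages are the most useful; the priority order and the name rank_supporting_urls are my own illustration, not part of the original flow.

```python
from urllib.parse import urlparse

# Hypothetical priority order: lower index is fetched first.
PRIORITY = ("/contact", "/about", "/company", "/location", "/team")


def rank_supporting_urls(urls: list[str], limit: int = 3) -> list[str]:
    """Sort candidate URLs by keyword priority, then by path length."""

    def score(url: str) -> tuple[int, int]:
        path = urlparse(url).path.lower()
        for rank, keyword in enumerate(PRIORITY):
            if keyword in path:
                return (rank, len(path))
        return (len(PRIORITY), len(path))

    # dict.fromkeys drops duplicates; sorted() is stable, so ties keep page order.
    return sorted(dict.fromkeys(urls), key=score)[:limit]
```

You could run the output of discover_supporting_urls through this before fetching, after lifting the slice on its last line so there is something left to rank.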

Then stitch it all into one payload for the model:

def fetch_company_pages(domain: str) -> str:
    home_url = normalize_domain_to_url(domain)
    pages = []

    home_data = scrape_url(home_url)
    pages.append((home_url, home_data.get("markdown", "")))

    for url in discover_supporting_urls(home_url, home_data.get("links", [])):
        try:
            page_data = scrape_url(url)
            pages.append((url, page_data.get("markdown", "")))
        except requests.RequestException:
            continue

    sections = []
    for url, markdown in pages:
        trimmed = (markdown or "")[:MAX_CHARS_PER_PAGE]
        sections.append(f"URL: {url}\n\n{trimmed}")

    return "\n\n---\n\n".join(sections)
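
One caveat with the slice above: it keeps the first MAX_CHARS_PER_PAGE characters, so on a long page the footer you just asked Firecrawl to keep can get cut off again. A minimal sketch of a trim that keeps both ends; the helper name, the marker text, and the 25% tail share are assumptions you'd tune.

```python
MARKER = "\n\n[...trimmed...]\n\n"


def trim_keep_footer(markdown: str, max_chars: int = 12000, tail_share: float = 0.25) -> str:
    """Trim a long page to max_chars, keeping the start and the end of it."""
    if len(markdown) <= max_chars:
        return markdown

    budget = max_chars - len(MARKER)
    if budget <= 0:
        # Limit too small for a marker: fall back to a plain head slice.
        return markdown[:max_chars]

    tail_len = int(budget * tail_share)
    head_len = budget - tail_len
    # Guard against markdown[-0:], which would return the whole string.
    tail = markdown[-tail_len:] if tail_len > 0 else ""
    return markdown[:head_len] + MARKER + tail
```

Swapping this in for the plain slice keeps the hero copy and the footer, and drops the middle of the page, which is usually the least useful part for contact and social fields.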

Step 3: Define a nullable schema

This is the part people get wrong. If your prompt says "return null when the field is missing" but your schema says "name": {"type": "string"}, you've handed the model a contradiction. It has to return a string, so it makes one up. Nullable fields aren't a compromise here, they're the whole point.

COMPANY_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": ["string", "null"]},
        "domain": {"type": ["string", "null"]},
        "description": {"type": ["string", "null"]},
        "long_description": {"type": ["string", "null"]},
        "industry": {"type": ["string", "null"]},
        "founded_year": {"type": ["integer", "null"]},
        "headquarters": {
            "type": "object",
            "properties": {
                "city": {"type": ["string", "null"]},
                "country": {"type": ["string", "null"]},
                "full_address": {"type": ["string", "null"]},
            },
            "required": ["city", "country", "full_address"],
            "additionalProperties": False,
        },
        "phone": {"type": ["string", "null"]},
        "email": {"type": ["string", "null"]},
        "social_profiles": {
            "type": "object",
            "properties": {
                "linkedin": {"type": ["string", "null"]},
                "x": {"type": ["string", "null"]},
                "facebook": {"type": ["string", "null"]},
                "instagram": {"type": ["string", "null"]},
                "youtube": {"type": ["string", "null"]},
            },
            "required": ["linkedin", "x", "facebook", "instagram", "youtube"],
            "additionalProperties": False,
        },
        "evidence_urls": {"type": "array", "items": {"type": "string"}},
    },
    "required": [
        "name", "domain", "description", "long_description", "industry",
        "founded_year", "headquarters", "phone", "email",
        "social_profiles", "evidence_urls",
    ],
    "additionalProperties": False,
}
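
As far as I know, strict mode expects every object in the schema to list all of its properties under required and to set additionalProperties to false, which is why the schema above looks so repetitive. A small sketch of a local check that catches drift before the API rejects the request; strict_schema_problems is a hypothetical helper and only covers those two rules.

```python
def strict_schema_problems(schema: dict, path: str = "$") -> list[str]:
    """Return human-readable problems for objects that break the two rules."""
    problems = []
    types = schema.get("type")
    types = types if isinstance(types, list) else [types]

    if "object" in types:
        props = schema.get("properties", {})
        if schema.get("additionalProperties") is not False:
            problems.append(f"{path}: additionalProperties must be false")
        missing = sorted(set(props) - set(schema.get("required", [])))
        if missing:
            problems.append(f"{path}: not in required: {missing}")
        # Recurse into nested objects such as headquarters or social_profiles.
        for name, sub in props.items():
            problems.extend(strict_schema_problems(sub, f"{path}.{name}"))

    if "array" in types and isinstance(schema.get("items"), dict):
        problems.extend(strict_schema_problems(schema["items"], f"{path}[]"))

    return problems
```

Calling strict_schema_problems(COMPANY_SCHEMA) in a unit test is a cheap way to notice when someone adds a field and forgets to list it under required.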

Step 4: Extract with structured outputs

def extract_company_profile(page_content: str) -> dict:
    response = requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers={
            "Authorization": f"Bearer {OPENAI_API_KEY}",
            "Content-Type": "application/json",
        },
        json={
            "model": OPENAI_MODEL,
            "messages": [
                {
                    "role": "system",
                    "content": (
                        "Extract structured company data from the provided website pages. "
                        "Use null when a field is not explicitly visible in the provided content. "
                        "Do not guess phone numbers, email addresses, physical addresses, or social URLs. "
                        "Only include evidence_urls that appear in the input."
                    ),
                },
                {"role": "user", "content": page_content},
            ],
            "response_format": {
                "type": "json_schema",
                "json_schema": {
                    "name": "company_profile",
                    "strict": True,
                    "schema": COMPANY_SCHEMA,
                },
            },
        },
        timeout=120,
    )
    response.raise_for_status()

    message = response.json()["choices"][0]["message"]
    if message.get("refusal"):
        raise ValueError(f"Model refused extraction: {message['refusal']}")

    return json.loads(message["content"])

Structured outputs enforce the shape of the response. They do not make the values true. That's the next step.
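
The prompt asks the model to only cite evidence_urls from the input, but nothing enforces that. A minimal sketch of enforcing it after the call, assuming the payload format from fetch_company_pages (each section starts with a "URL: ..." line); both helper names are hypothetical.

```python
import re


def scraped_urls_from_payload(page_content: str) -> list[str]:
    """Recover the URLs that were actually sent to the model."""
    # Assumes no scraped page contains its own line starting with "URL: ".
    return re.findall(r"^URL: (\S+)$", page_content, flags=re.MULTILINE)


def filter_evidence_urls(profile: dict, scraped_urls: list[str]) -> dict:
    """Drop evidence URLs that were never part of the input."""
    allowed = {url.rstrip("/") for url in scraped_urls}
    kept = [
        url
        for url in profile.get("evidence_urls") or []
        if url.rstrip("/") in allowed
    ]
    return {**profile, "evidence_urls": kept}
```

This only proves the cited page was in the input, not that the page supports any particular field.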

Step 5: Validate (this is where a demo becomes real)

Cheap checks first. Does the phone look like a phone? Does the LinkedIn URL point to a company page, not a person? Is that X link actually on x.com? When in doubt, null it.

PHONE_RE = re.compile(r"^\+?[0-9][0-9\s().-]{6,}$")
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def valid_url(value: str | None) -> bool:
    if not value:
        return False
    parsed = urlparse(value)
    return parsed.scheme in {"http", "https"} and bool(parsed.netloc)


def validate_profile(profile: dict) -> dict:
    if profile.get("phone") and not PHONE_RE.match(profile["phone"]):
        profile["phone"] = None

    if profile.get("email") and not EMAIL_RE.match(profile["email"]):
        profile["email"] = None

    socials = profile.get("social_profiles", {})

    linkedin = socials.get("linkedin")
    if linkedin and "linkedin.com/company" not in linkedin:
        socials["linkedin"] = None

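    x_url = socials.get("x")
    # Match on the parsed host: a substring test for "x.com/" also accepts "netflix.com/".
    x_host = (urlparse(x_url).hostname or "").removeprefix("www.") if x_url else ""
    if x_url and x_host not in {"x.com", "twitter.com"}:
        socials["x"] = None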

    for key in ("facebook", "instagram", "youtube"):
        if socials.get(key) and not valid_url(socials[key]):
            socials[key] = None

    profile["social_profiles"] = socials
    return profile

Conservative on purpose. A null is annoying. A fake phone number in your CRM is a support ticket.
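
Format checks can't tell a real phone number from a well-formed invented one. A stricter option is to require that the value literally occurs in the scraped text. This is a rough sketch under that assumption; ground_contact_fields is a hypothetical helper, and the digit comparison is deliberately coarse.

```python
import re


def digits(text: str) -> str:
    """Keep only the digits, so '+1 (415) 555-0100' matches '1-415-555-0100'."""
    return re.sub(r"\D", "", text)


def ground_contact_fields(profile: dict, page_content: str) -> dict:
    """Null out email and phone values that do not occur in the scraped text."""
    grounded = dict(profile)

    email = grounded.get("email")
    if email and email.lower() not in page_content.lower():
        grounded["email"] = None

    phone = grounded.get("phone")
    # Coarse: compares against every digit on the page run together.
    if phone and digits(phone) not in digits(page_content):
        grounded["phone"] = None

    return grounded
```

It will also null a phone number the model reformatted with a country code the page never showed. That's the same trade the rest of the validation makes: a lost value over an invented one.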

Wire it together

def enrich_company_from_domain(domain: str) -> dict:
    page_content = fetch_company_pages(domain)
    raw_profile = extract_company_profile(page_content)
    return validate_profile(raw_profile)


if __name__ == "__main__":
    company = enrich_company_from_domain("companyenrich.com")
    print(json.dumps(company, indent=2))

Sample output (long_description is left out of this listing):

{
  "name": "CompanyEnrich",
  "domain": "companyenrich.com",
  "description": "API-first B2B data platform for company and people intelligence.",
  "industry": "B2B data infrastructure",
  "founded_year": null,
  "headquarters": { "city": "Ankara", "country": "Turkey", "full_address": null },
  "phone": null,
  "email": null,
  "social_profiles": {
    "linkedin": "https://www.linkedin.com/company/companyenrich",
    "x": null, "facebook": null, "instagram": null, "youtube": null
  },
  "evidence_urls": ["https://companyenrich.com"]
}

Notice how much is null. That's not failure, that's honesty. The site simply doesn't publish those fields.
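
If you want to track that honesty as a number, here is a minimal sketch of a null-rate metric. It assumes you count every leaf field, nested ones included, and treat an empty list as null; null_rate is a hypothetical helper.

```python
def null_rate(profile: dict) -> float:
    """Share of leaf fields that are null (empty lists count as null)."""

    def leaves(value):
        # Walk nested objects like headquarters and social_profiles.
        if isinstance(value, dict):
            for child in value.values():
                yield from leaves(child)
        else:
            yield value

    values = list(leaves(profile))
    if not values:
        return 0.0
    empty = sum(1 for value in values if value is None or value == [])
    return empty / len(values)
```

Tracked per field across a batch rather than per profile, the same idea shows which fields your sources simply don't publish.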

Which fields are actually reliable from a website?

The biggest DIY mistake is treating missing data as a model bug when the source page just doesn't contain it.

Field	Reliability	Notes
Name	High	Usually right there on the homepage
Description	High	Hero or meta description
Industry	Medium	Often inferred, flag confidence
Founded year	Low-medium	Frequently missing
HQ city	Medium	Footer or contact page
Full address	Low	Lots of SaaS sites omit it
Phone / Email	Low	Often hidden behind a form
LinkedIn	High	Validate company vs personal
Technologies	Out of scope	Needs a tech-detection source

A website tells you what a company says about itself. It can't reliably give you headcount, revenue, funding, tech stack, or historical changes. For those you need other sources, and that's exactly where DIY extraction ends and production enrichment begins.

What breaks at scale

The first 10 domains feel free. The next 10,000 teach you where the cost actually lives.

Bot protection: Cloudflare/Akamai challenge pages. Retry, log, use a renderer, don't loop forever.
JS-heavy sites: empty shell on plain HTTP. Use browser rendering.
Thin homepages: one landing page, no footer. Widen your source set or accept the gaps.
Parked / redirected domains: detect them, store a status, don't treat them like real companies.
Hallucinations: structured output doesn't fix this. Validation outside the LLM call does.
Rate limits and cost spikes: queue, back off, cache by domain, cap pages per domain.
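
For the last item, a rough sketch of "back off" and "cache by domain" in plain Python. Both helper names are hypothetical, the retry count and delays are placeholders, and the in-memory dict stands in for whatever store you actually use.

```python
import time


def with_backoff(call, retries=3, base_delay=1.0, sleep=time.sleep, retry_on=(Exception,)):
    """Run call(); on failure wait base_delay * 2**attempt and try again."""
    for attempt in range(retries + 1):
        try:
            return call()
        except retry_on:
            if attempt == retries:
                raise  # Out of retries: surface the last error, don't loop forever.
            sleep(base_delay * 2 ** attempt)


def cached_by_domain(enrich, cache: dict):
    """Wrap an enrich(domain) function so each domain is processed once."""

    def wrapper(domain: str):
        key = domain.strip().lower()
        if key not in cache:
            cache[key] = enrich(key)
        return cache[key]

    return wrapper
```

With the code from this post you'd wrap the scrape as with_backoff(lambda: scrape_url(url), retry_on=(requests.RequestException,)), so only network-level errors trigger a retry.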

Give every scrape a status instead of a silent failure: success, blocked, timeout, parked, redirected, empty_content. A blocked site and a company-with-no-phone are very different things and should never look the same in your data.
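
A minimal sketch of what that status field could look like. The status names come from the list above; the marker phrases, the min_chars threshold, and the classify_scrape signature are assumptions, and real block or parking pages vary a lot, so build the marker lists from your own logs.

```python
from enum import Enum


class ScrapeStatus(str, Enum):
    SUCCESS = "success"
    BLOCKED = "blocked"
    TIMEOUT = "timeout"
    PARKED = "parked"
    REDIRECTED = "redirected"
    EMPTY_CONTENT = "empty_content"


# Hypothetical marker phrases, for illustration only.
BLOCK_MARKERS = ("verify you are human", "checking your browser", "access denied")
PARKED_MARKERS = ("domain is for sale", "buy this domain")


def _bare(host: str) -> str:
    return host.strip().lower().removeprefix("www.")


def classify_scrape(requested_host, final_host, markdown, timed_out=False, min_chars=200):
    """Map one scrape result to a status; the first matching rule wins."""
    if timed_out:
        return ScrapeStatus.TIMEOUT
    text = (markdown or "").lower()
    if any(marker in text for marker in BLOCK_MARKERS):
        return ScrapeStatus.BLOCKED
    if any(marker in text for marker in PARKED_MARKERS):
        return ScrapeStatus.PARKED
    if _bare(final_host) != _bare(requested_host):
        return ScrapeStatus.REDIRECTED
    if len(text.strip()) < min_chars:
        return ScrapeStatus.EMPTY_CONTENT
    return ScrapeStatus.SUCCESS
```

Here final_host is whatever host your fetcher reports after redirects. Store the status next to the profile so a blocked domain can be retried later instead of being recorded as a company with no phone.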

For cost control, send the smallest useful set of pages, not the whole site. Good defaults: homepage plus up to 3 supporting pages, 8k-15k chars each, no base64 images, no blog archives.
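
To see what those defaults mean in tokens, here is a back-of-the-envelope sketch. The 4 characters per token figure is a rough rule of thumb for English text, not a measurement; both function names are hypothetical, and you'd plug in your own per-token price.

```python
def estimate_input_tokens(pages: int = 4, chars_per_page: int = 12000,
                          chars_per_token: float = 4.0) -> int:
    """Worst-case input tokens for one domain (every page hits the char cap)."""
    return round(pages * chars_per_page / chars_per_token)


def monthly_input_tokens(domains: int, **kwargs) -> int:
    """Worst-case input tokens for a month of enrichment."""
    return domains * estimate_input_tokens(**kwargs)
```

With the defaults in this post (homepage plus 3 pages at 12,000 characters each) that is 48,000 characters, or roughly 12,000 input tokens per domain if every page hits the cap.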

Build vs buy, honestly

Not "DIY bad, API good." The honest version:

~100 domains/month: build it yourself. You'll learn the mechanics and the cost is nothing.
~10,000 domains/month: engineering time now costs more than the token bill.
Feeds a CRM, agent, or customer-facing product: you want a dedicated layer.

Build it once to understand the workflow. Buy it when the workflow becomes infrastructure. If you reach that point, a company enrichment API pulls from registries, maps, filings, and tech detection on top of the website, which is what fills the gaps a homepage-only flow never can. (Full disclosure: that's the product I work on.)

A note on doing this responsibly

Keep it boring. Use public business info for legitimate business workflows. Don't scrape private areas, don't infer sensitive personal attributes, respect robots and rate limits. And separate company-level enrichment from person-level enrichment: a domain identifies a company, it isn't a license to harvest every individual you can find. If you store personal data, loop in your legal team on GDPR/CCPA.

TL;DR checklist

Normalize the domain before fetching
Scrape the homepage with footer included
Discover and fetch contact/about/location pages
Use a nullable JSON schema
Tell the model to return null instead of guessing
Validate phone, email, address, and socials
Store evidence URLs
Test on a reviewed sample before scaling

That's the whole thing. The model is the easy part. The validation layer and the build-vs-buy call are what actually decide whether your enrichment is trustworthy.

If you build this, drop your null rate in the comments. Curious how different stacks compare.

I work on CompanyEnrich, an API-first B2B data platform for company and people intelligence. We spend a lot of time on exactly the messy parts above: page discovery, multi-source enrichment, and validation at volume. If you'd rather not maintain your own extraction stack, that's what we build.

Top comments (1)

Harjot Singh • May 31

Domain-to-company enrichment is a genuinely useful LLM workflow, and the interesting engineering is the reliability of the extraction, not the LLM call - the model will happily hallucinate a plausible company description, employee count, or industry if the source data is thin, so the trustworthy version grounds every field in scraped/retrieved evidence and flags low-confidence fields rather than inventing them. Confidently-wrong enrichment is worse than a blank, because it poisons whatever downstream system consumes it (CRM, lead scoring, outreach).

The design that makes it production-grade: structured output validated against a schema, each field traceable to a source, and a confidence/abstain path for what the data doesn't support. Verify-or-leave-blank beats fill-with-a-guess. That ground-and-validate discipline is core to how I treat any extraction in Moonshift (a multi-agent pipeline that ships a prompt to a deployed SaaS) - structured, validated, not trusted raw. Genuinely useful workflow (I do a lot of domain->company work myself for outreach). How are you handling the hallucination risk on sparse domains - confidence thresholds, or strict "only output what's in the retrieved evidence"? That guardrail is the difference between usable enrichment and garbage-in-CRM.