You have a domain. You want a clean company profile: name, description, industry, socials, contact info, location. Sounds like a one-liner. It isn't, but it's closer than you'd think.
The workflow has four parts: fetch the site, extract fields with an LLM, validate the output, and decide when building your own stops being worth it. I'll walk through all four with real Python you can run today.
Stack: Firecrawl for fetching, OpenAI structured outputs for extraction. Plain HTTP, no SDKs, so you can port it to Node, Go, or whatever you live in.
The naive version (and why it breaks)
The demo everybody writes:
- Fetch the homepage
- Throw the HTML at an LLM
- Get JSON back
This works beautifully on stripe.com. Then your next record is a parked domain, the one after that hides its address three clicks deep, and the one after that is a JS app that returns an empty shell on the first request. The model is rarely the hard part. Page discovery, missing data, validation, and cost control are.
The single most useful mindset shift: make it cheap for the model to be honest. If the page doesn't show a phone number, the right answer is null, not a plausible guess that happens to match the country code.
Setup
pip install requests
import json
import os
import re
from urllib.parse import urljoin, urlparse
import requests
FIRECRAWL_API_KEY = os.environ["FIRECRAWL_API_KEY"]
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
OPENAI_MODEL = os.environ.get("OPENAI_MODEL", "gpt-5.5")
MAX_CHARS_PER_PAGE = 12000
SUPPORTING_PAGE_LIMIT = 3
Step 1: Normalize the domain
Input is always messier than you want. companyenrich.com, https://companyenrich.com, www.companyenrich.com/. Clean it up first.
def normalize_domain_to_url(value: str) -> str:
    value = value.strip()
    if not value.startswith(("http://", "https://")):
        value = "https://" + value
    parsed = urlparse(value)
    return f"{parsed.scheme}://{parsed.netloc}".rstrip("/")
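If you also want one stable key per company for caching or deduplication, a bare-host variant helps. This is a minimal sketch: `canonical_domain` is a hypothetical helper, not part of the pipeline above, and it assumes the input is either a bare domain or an http(s) URL.

```python
from urllib.parse import urlparse


def canonical_domain(value: str) -> str:
    """Return a lowercase host with no scheme, port, path, or leading 'www.'."""
    value = value.strip()
    if not value.startswith(("http://", "https://")):
        value = "https://" + value
    # urlparse().hostname is already lowercased and excludes any port.
    host = urlparse(value).hostname or ""
    return host.removeprefix("www.")


for raw in ("companyenrich.com", "https://CompanyEnrich.com", "www.companyenrich.com/"):
    print(raw, "->", canonical_domain(raw))
```

All three messy inputs from above collapse to the same key, so a cache keyed on it won't fetch the same company twice.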
Step 2: Scrape, and keep the footer
Here's a gotcha. Firecrawl's onlyMainContent defaults to true, which strips headers, navs, and footers. Great for reading articles. Terrible for enrichment, because the footer is exactly where social links and contact details live. Set it to False.
def scrape_url(url: str) -> dict:
    response = requests.post(
        "https://api.firecrawl.dev/v1/scrape",
        headers={
            "Authorization": f"Bearer {FIRECRAWL_API_KEY}",
            "Content-Type": "application/json",
        },
        json={
            "url": url,
            "formats": ["markdown", "links"],
            "onlyMainContent": False,
            "removeBase64Images": True,
            "blockAds": True,
        },
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["data"]
The homepage alone usually isn't enough. Use the returned links to find a few supporting pages, same-domain only.
def discover_supporting_urls(home_url: str, links: list[str]) -> list[str]:
    # removeprefix strips only a leading "www.", unlike replace(), which would hit any match in the host.
    home_host = urlparse(home_url).netloc.lower().removeprefix("www.")
    keywords = ("/about", "/company", "/contact", "/locations", "/location", "/team")
    candidates = []
    for link in links or []:
        # Links may arrive as plain strings or as dicts with an href/url key.
        href = link if isinstance(link, str) else link.get("href") or link.get("url")
        if not href:
            continue
        absolute = urljoin(home_url, href)
        parsed = urlparse(absolute)
        host = parsed.netloc.lower().removeprefix("www.")
        path = parsed.path.lower()
        if host == home_host and any(keyword in path for keyword in keywords):
            candidates.append(absolute.split("#")[0])
    # dict.fromkeys de-duplicates while preserving order.
    return list(dict.fromkeys(candidates))[:SUPPORTING_PAGE_LIMIT]
Then stitch it all into one payload for the model:
def fetch_company_pages(domain: str) -> str:
    home_url = normalize_domain_to_url(domain)
    pages = []
    home_data = scrape_url(home_url)
    pages.append((home_url, home_data.get("markdown", "")))
    for url in discover_supporting_urls(home_url, home_data.get("links", [])):
        try:
            page_data = scrape_url(url)
            pages.append((url, page_data.get("markdown", "")))
        except requests.RequestException:
            continue
    sections = []
    for url, markdown in pages:
        trimmed = (markdown or "")[:MAX_CHARS_PER_PAGE]
        sections.append(f"URL: {url}\n\n{trimmed}")
    return "\n\n---\n\n".join(sections)
Step 3: Define a nullable schema
This is the part people get wrong. If your prompt says "return null when the field is missing" but your schema says "name": {"type": "string"}, you've handed the model a contradiction. It has to return a string, so it makes one up. Nullable fields aren't a compromise here, they're the whole point.
COMPANY_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": ["string", "null"]},
        "domain": {"type": ["string", "null"]},
        "description": {"type": ["string", "null"]},
        "long_description": {"type": ["string", "null"]},
        "industry": {"type": ["string", "null"]},
        "founded_year": {"type": ["integer", "null"]},
        "headquarters": {
            "type": "object",
            "properties": {
                "city": {"type": ["string", "null"]},
                "country": {"type": ["string", "null"]},
                "full_address": {"type": ["string", "null"]},
            },
            "required": ["city", "country", "full_address"],
            "additionalProperties": False,
        },
        "phone": {"type": ["string", "null"]},
        "email": {"type": ["string", "null"]},
        "social_profiles": {
            "type": "object",
            "properties": {
                "linkedin": {"type": ["string", "null"]},
                "x": {"type": ["string", "null"]},
                "facebook": {"type": ["string", "null"]},
                "instagram": {"type": ["string", "null"]},
                "youtube": {"type": ["string", "null"]},
            },
            "required": ["linkedin", "x", "facebook", "instagram", "youtube"],
            "additionalProperties": False,
        },
        "evidence_urls": {"type": "array", "items": {"type": "string"}},
    },
    "required": [
        "name", "domain", "description", "long_description", "industry",
        "founded_year", "headquarters", "phone", "email",
        "social_profiles", "evidence_urls",
    ],
    "additionalProperties": False,
}
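Writing every field as required-but-nullable by hand gets repetitive, and it's easy to forget one entry in a "required" list. Here's a small sketch of two helpers that build the same shape. `nullable` and `strict_object` are names I made up for illustration; they just produce plain dicts like the ones above.

```python
def nullable(json_type: str) -> dict:
    """A field that must be present but may be null."""
    return {"type": [json_type, "null"]}


def strict_object(properties: dict) -> dict:
    """An object where every property is required and no extras are allowed."""
    return {
        "type": "object",
        "properties": properties,
        "required": list(properties),
        "additionalProperties": False,
    }


# Rebuilds the headquarters sub-schema from the full schema above.
HEADQUARTERS = strict_object({
    "city": nullable("string"),
    "country": nullable("string"),
    "full_address": nullable("string"),
})
```

Because "required" is derived from the property names, a new field can't end up optional by accident.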
Step 4: Extract with structured outputs
def extract_company_profile(page_content: str) -> dict:
    response = requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers={
            "Authorization": f"Bearer {OPENAI_API_KEY}",
            "Content-Type": "application/json",
        },
        json={
            "model": OPENAI_MODEL,
            "messages": [
                {
                    "role": "system",
                    "content": (
                        "Extract structured company data from the provided website pages. "
                        "Use null when a field is not explicitly visible in the provided content. "
                        "Do not guess phone numbers, email addresses, physical addresses, or social URLs. "
                        "Only include evidence_urls that appear in the input."
                    ),
                },
                {"role": "user", "content": page_content},
            ],
            "response_format": {
                "type": "json_schema",
                "json_schema": {
                    "name": "company_profile",
                    "strict": True,
                    "schema": COMPANY_SCHEMA,
                },
            },
        },
        timeout=120,
    )
    response.raise_for_status()
    message = response.json()["choices"][0]["message"]
    if message.get("refusal"):
        raise ValueError(f"Model refused extraction: {message['refusal']}")
    return json.loads(message["content"])
Structured outputs enforce the shape of the response. They do not make the values true. That's the next step.
Step 5: Validate (this is where a demo becomes real)
Cheap checks first. Does the phone look like a phone? Does the LinkedIn URL point to a company page, not a person? Is that X link actually on x.com? When in doubt, null it.
PHONE_RE = re.compile(r"^\+?[0-9][0-9\s().-]{6,}$")
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def valid_url(value: str | None) -> bool:
    if not value:
        return False
    parsed = urlparse(value)
    return parsed.scheme in {"http", "https"} and bool(parsed.netloc)

def url_host(value: str) -> str:
    # Lowercased host without a leading "www."; tolerates a missing scheme.
    parsed = urlparse(value if "://" in value else "https://" + value)
    return (parsed.hostname or "").removeprefix("www.")

def validate_profile(profile: dict) -> dict:
    if profile.get("phone") and not PHONE_RE.match(profile["phone"]):
        profile["phone"] = None
    if profile.get("email") and not EMAIL_RE.match(profile["email"]):
        profile["email"] = None
    socials = profile.get("social_profiles") or {}
    linkedin = socials.get("linkedin")
    # Check the parsed host so a lookalike domain can't pass, then require a /company path.
    if linkedin and not (
        (url_host(linkedin) == "linkedin.com" or url_host(linkedin).endswith(".linkedin.com"))
        and "/company" in linkedin
    ):
        socials["linkedin"] = None
    x_url = socials.get("x")
    # Compare hosts exactly: a substring test for "x.com/" would also accept netflix.com/.
    if x_url and url_host(x_url) not in {"x.com", "twitter.com"}:
        socials["x"] = None
    for key in ("facebook", "instagram", "youtube"):
        if socials.get(key) and not valid_url(socials[key]):
            socials[key] = None
    profile["social_profiles"] = socials
    return profile
Conservative on purpose. A null is annoying. A fake phone number in your CRM is a support ticket.
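Format checks catch malformed values, but a well-formed phone number can still be invented. One more cheap check is grounding: keep a value only if it literally appears in the text you sent to the model. The sketch below assumes you keep the list of URLs you actually scraped; `ground_profile` and `fetched_urls` are hypothetical names, since `fetch_company_pages` above returns only the stitched string.

```python
import re


def digits_only(text: str) -> str:
    """Strip everything except digits so formatting differences don't matter."""
    return re.sub(r"\D", "", text)


def ground_profile(profile: dict, page_content: str, fetched_urls: list[str]) -> dict:
    content_lower = page_content.lower()

    # Email must appear verbatim (case-insensitive) in the scraped text.
    email = profile.get("email")
    if email and email.lower() not in content_lower:
        profile["email"] = None

    # Phone is compared digit-by-digit, ignoring spaces, dashes, and parentheses.
    phone = profile.get("phone")
    if phone:
        digits = digits_only(phone)
        if not digits or digits not in digits_only(page_content):
            profile["phone"] = None

    # Evidence URLs must be pages that were actually fetched.
    allowed = {url.rstrip("/") for url in fetched_urls}
    profile["evidence_urls"] = [
        url for url in profile.get("evidence_urls") or []
        if url.rstrip("/") in allowed
    ]
    return profile
```

This is strict on purpose. If the model adds a country code the page never shows, the phone gets nulled, which matches the "don't guess" rule in the prompt.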
Wire it together
def enrich_company_from_domain(domain: str) -> dict:
    page_content = fetch_company_pages(domain)
    raw_profile = extract_company_profile(page_content)
    return validate_profile(raw_profile)

if __name__ == "__main__":
    company = enrich_company_from_domain("companyenrich.com")
    print(json.dumps(company, indent=2))
Sample output:
{
  "name": "CompanyEnrich",
  "domain": "companyenrich.com",
  "description": "API-first B2B data platform for company and people intelligence.",
  "industry": "B2B data infrastructure",
  "founded_year": null,
  "headquarters": { "city": "Ankara", "country": "Turkey", "full_address": null },
  "phone": null,
  "email": null,
  "social_profiles": {
    "linkedin": "https://www.linkedin.com/company/companyenrich",
    "x": null, "facebook": null, "instagram": null, "youtube": null
  },
  "evidence_urls": ["https://companyenrich.com"]
}
Notice how much is null. That's not failure, that's honesty. The site simply doesn't publish those fields.
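If you want to track that honesty as a number, here's one way to compute a null rate for a profile. It's a sketch with one arbitrary choice baked in: it counts scalar leaf fields only, walking into nested objects and skipping lists such as evidence_urls. `null_rate` is a hypothetical helper name.

```python
def null_rate(profile: dict) -> float:
    """Share of scalar leaf fields that are None; lists are not counted."""
    total = 0
    nulls = 0

    def walk(value) -> None:
        nonlocal total, nulls
        if isinstance(value, dict):
            for child in value.values():
                walk(child)
        elif isinstance(value, list):
            return  # evidence_urls and other lists are not nullable fields
        else:
            total += 1
            if value is None:
                nulls += 1

    walk(profile)
    return nulls / total if total else 0.0


example = {
    "name": "Acme",
    "phone": None,
    "headquarters": {"city": "Berlin", "country": None},
    "evidence_urls": ["https://acme.example"],
}
print(null_rate(example))
```

The example has four scalar fields and two of them are null. Averaged over a batch, this gives you a baseline to compare against after you change prompts or page discovery.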
Which fields are actually reliable from a website?
The biggest DIY mistake is treating missing data as a model bug when the source page just doesn't contain it.
| Field | Reliability | Notes |
|---|---|---|
| Name | High | Usually right there on the homepage |
| Description | High | Hero or meta description |
| Industry | Medium | Often inferred, flag confidence |
| Founded year | Low-medium | Frequently missing |
| HQ city | Medium | Footer or contact page |
| Full address | Low | Lots of SaaS sites omit it |
| Phone / Email | Low | Often hidden behind a form |
| LinkedIn | High | Validate company vs personal |
| Technologies | Out of scope | Needs a tech-detection source |
A website tells you what a company says about itself. It can't reliably give you headcount, revenue, funding, tech stack, or historical changes. For those you need other sources, and that's exactly where DIY extraction ends and production enrichment begins.
What breaks at scale
The first 10 domains feel free. The next 10,000 teach you where the cost actually lives.
- Bot protection: Cloudflare/Akamai challenge pages. Retry, log, use a renderer, don't loop forever.
- JS-heavy sites: empty shell on plain HTTP. Use browser rendering.
- Thin homepages: one landing page, no footer. Widen your source set or accept the gaps.
- Parked / redirected domains: detect them, store a status, don't treat them like real companies.
- Hallucinations: structured output doesn't fix this. Validation outside the LLM call does.
- Rate limits and cost spikes: queue, back off, cache by domain, cap pages per domain.
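For the "back off" part, here's a minimal retry sketch with exponential delays and a hard attempt cap. It takes any callable, so it isn't tied to Firecrawl or OpenAI. `RetryableError`, `backoff_delays`, and `with_backoff` are hypothetical names, and the delay values are placeholders to tune.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RetryableError(Exception):
    """Raise this from your call for 429s, 5xx responses, and timeouts."""


def backoff_delays(max_attempts: int, base_delay: float = 1.0, cap: float = 30.0) -> list[float]:
    """Delays between attempts: base, 2x base, 4x base, ... capped at `cap`."""
    return [min(cap, base_delay * 2 ** i) for i in range(max_attempts - 1)]


def with_backoff(
    call: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delays = backoff_delays(max_attempts, base_delay)
    for attempt in range(max_attempts):
        try:
            return call()
        except RetryableError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure, don't loop forever
            # Jitter keeps parallel workers from retrying in lockstep.
            sleep(delays[attempt] + random.uniform(0, base_delay))
    raise RuntimeError("unreachable")
```

To use it, wrap the scrape call in a small function that raises `RetryableError` on a 429 or a timeout and lets every other error through, so permanent failures aren't retried.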
Give every scrape a status instead of a silent failure: success, blocked, timeout, parked, redirected, empty_content. A blocked site and a company-with-no-phone are very different things and should never look the same in your data.
For cost control, send the smallest useful set of pages, not the whole site. Good defaults: homepage plus up to 3 supporting pages, 8k-15k chars each, no base64 images, no blog archives.
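To see what those defaults mean per domain, here's the arithmetic using the constants from Setup. The characters-per-token ratio is an assumption (a rough rule of thumb for English text, not a measured value), so treat the token figure as an order-of-magnitude estimate.

```python
MAX_CHARS_PER_PAGE = 12000
SUPPORTING_PAGE_LIMIT = 3
CHARS_PER_TOKEN = 4  # assumption: rough heuristic, varies by model and language


def max_input_chars(
    supporting_pages: int = SUPPORTING_PAGE_LIMIT,
    chars_per_page: int = MAX_CHARS_PER_PAGE,
) -> int:
    """Upper bound on page text sent per domain: homepage plus supporting pages."""
    return (1 + supporting_pages) * chars_per_page


def rough_input_tokens(chars: int) -> int:
    return chars // CHARS_PER_TOKEN


budget = max_input_chars()
print(budget, rough_input_tokens(budget))
```

With the defaults that's at most 48,000 characters of page text per domain, before the system prompt and the URL separators. Multiply by your monthly domain count to see where the bill lands.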
Build vs buy, honestly
Not "DIY bad, API good." The honest version:
- ~100 domains/month: build it yourself. You'll learn the mechanics and the cost is nothing.
- ~10,000 domains/month: engineering time now costs more than the token bill.
- Feeds a CRM, agent, or customer-facing product: you want a dedicated layer.
Build it once to understand the workflow. Buy it when the workflow becomes infrastructure. If you reach that point, a company enrichment API pulls from registries, maps, filings, and tech detection on top of the website, which is what fills the gaps a homepage-only flow never can. (Full disclosure: that's the product I work on.)
A note on doing this responsibly
Keep it boring. Use public business info for legitimate business workflows. Don't scrape private areas, don't infer sensitive personal attributes, and respect robots.txt and rate limits. And separate company-level enrichment from person-level enrichment: a domain identifies a company, it isn't a license to harvest every individual you can find. If you store personal data, loop in your legal team on GDPR/CCPA.
TL;DR checklist
- Normalize the domain before fetching
- Scrape the homepage with footer included
- Discover and fetch contact/about/location pages
- Use a nullable JSON schema
- Tell the model to return null instead of guessing
- Validate phone, email, address, and socials
- Store evidence URLs
- Test on a reviewed sample before scaling
That's the whole thing. The model is the easy part. The validation layer and the build-vs-buy call are what actually decide whether your enrichment is trustworthy.
If you build this, drop your null rate in the comments. Curious how different stacks compare.
I work on CompanyEnrich, an API-first B2B data platform for company and people intelligence. We spend a lot of time on exactly the messy parts above: page discovery, multi-source enrichment, and validation at volume. If you'd rather not maintain your own extraction stack, that's what we build.