Isaac Sunday
How I Built a Soccer Coach Contact Extractor for Messy Athletics Websites

Most athletics websites look simple until you try to extract structured data from them at scale.

Coach pages are especially messy. One school gives you a clean staff directory with mailto: links. Another hides emails behind Cloudflare. Another puts names on the roster page and the actual contact info on a separate bio page. Another sends back an empty shell and expects JavaScript to do the rest.

That is what this project solves.

football-soccer-emails is a TypeScript-based extractor that pulls soccer and football coach contact information from athletics websites and turns it into structured records. It supports direct URLs, public Google Sheets, and an Apify workflow for batch runs.

The reason I built it this way is simple: I tried a version of this problem around 2017 or 2018 using heuristics only, and it was roughly 40% accurate. That was about as far as rules alone would take me. With LLMs and a multi-stage extraction flow, this same class of problem can now get into the 90%+ range.

The Core Idea

The project does not rely on one extraction method. It has three:

  • heuristic: free, pattern-based extraction for static sites
  • llm: AI-assisted extraction through OpenRouter
  • firecrawl: structured extraction through Firecrawl

That matters because athletics sites are not consistent enough for a one-size-fits-all scraper. Some pages are easy and should be handled with deterministic parsing. Some are messy but still understandable to an LLM. Some are JavaScript-heavy enough that you want a different extraction path entirely.

The main pipeline stays simple. It resolves input URLs, builds the selected extractor, fetches HTML when needed, and pushes normalized coach records into the output dataset. The interface is stable even when the extraction strategy changes underneath it.
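As a sketch, the mode dispatch can be as small as a union type and a factory. The names below are illustrative, not the repo's actual API; the `llm` and `firecrawl` branches are stubbed where the real code would call OpenRouter or Firecrawl:

```typescript
// Hypothetical sketch of the three-mode dispatch; identifiers are mine.
type Mode = "heuristic" | "llm" | "firecrawl";

interface CoachStub {
  email: string;
}

interface Extractor {
  mode: Mode;
  extract(html: string): CoachStub[];
}

function buildExtractor(mode: Mode): Extractor {
  if (mode === "heuristic") {
    // Deterministic path: pull obvious mailto: addresses straight from markup.
    return {
      mode,
      extract: (html) =>
        [...html.matchAll(/href="mailto:([^"?]+)/gi)].map((m) => ({ email: m[1] })),
    };
  }
  // "llm" and "firecrawl" would call their respective services; stubbed here.
  return { mode, extract: () => [] };
}
```

The point is that callers only ever see `Extractor`, so the strategy can change underneath without touching the pipeline.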

What It Extracts

For each coach, the system tries to capture:

  • first name
  • last name
  • title or position
  • email
  • phone number
  • school
  • division or conference
  • source URL
  • profile URL
  • confidence levels for name, email, and phone

Those confidence levels are useful because not all matches are equal. A mailto: link is very different from an email pulled out of weak body text. I wanted the output to show that difference instead of hiding it.
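Shaped as a TypeScript record, the output might look like this. This is a sketch: the field names approximate the list above, not the repo's exact schema:

```typescript
// Hypothetical output shape; field names approximate the list above.
type Confidence = "high" | "medium" | "low";

interface CoachRecord {
  firstName: string;
  lastName: string;
  title?: string;              // e.g. "Head Men's Soccer Coach"
  email?: string;
  phone?: string;
  school?: string;
  division?: string;           // division or conference
  sourceUrl: string;           // page the record was extracted from
  profileUrl?: string;         // coach bio page, when one exists
  nameConfidence: Confidence;
  emailConfidence: Confidence; // "high" for mailto:, lower for body-text matches
  phoneConfidence: Confidence;
}
```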

In the repo, I used a 71-URL comparison batch to sanity-check cost and mode tradeoffs. The heuristic path was still useful, but it left too many messy pages unresolved. The LLM path closed most of that gap and was the difference between a pipeline that felt partly usable and one that could realistically clear 90%+ accuracy on this kind of data.

Why the Heuristic Layer Still Matters

It is easy to tell this story as if the LLM replaced everything. That is not what happened.

The heuristic layer still matters because it is fast, free, and trustworthy on obvious patterns. If a page gives me a mailto: link, a tel: link, or a Cloudflare-protected email I can decode directly, I would rather use that than ask a model to interpret it.

The heuristic extractor is layered on purpose. It checks:

  • mailto: links
  • Cloudflare email protection
  • custom data attributes
  • JavaScript variable patterns
  • reversed text and simple obfuscation
  • plain body text
  • script tags
  • raw HTML
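For one concrete layer: Cloudflare's email protection stores the address in a `data-cfemail` hex string whose first byte is an XOR key, so it can be decoded deterministically with no model involved. A minimal decoder (the attribute name is Cloudflare's; the function name is mine):

```typescript
// Decode a Cloudflare-protected email: the first hex byte is an XOR key
// applied to every following byte.
function decodeCfEmail(encoded: string): string {
  const key = parseInt(encoded.slice(0, 2), 16);
  let out = "";
  for (let i = 2; i < encoded.length; i += 2) {
    out += String.fromCharCode(parseInt(encoded.slice(i, i + 2), 16) ^ key);
  }
  return out;
}

decodeCfEmail("2a4b6a48044345"); // → "a@b.io"
```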

That list only exists because sports sites do weird things. The funny part is how often a page that looks like a soccer directory is also, somewhere in the HTML, a baseball page, a basketball page, and two hidden mobile layouts at the same time.

The Part That Actually Improved Accuracy

The biggest improvement was not just "use an LLM." It was combining multiple passes.

If the extractor finds a coach on a roster page but does not find an email, it can follow the coach’s profile URL and try again in profile mode. That matters because many athletics sites split their data across two layers:

  • roster page: names and titles
  • profile page: bio, phone, and email

Without that follow-up pass, you miss a lot of good data. With it, the extractor can keep the roster context and fill in the missing fields from the profile page.
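The follow-up pass can be sketched as a small enrichment step. All names here are hypothetical: `fetchHtml` and `extractProfile` stand in for the real fetch layer and the profile-mode extractor:

```typescript
interface Coach {
  firstName: string;
  email?: string;
  phone?: string;
  profileUrl?: string;
}

// If the roster pass left email/phone empty and a profile URL exists,
// fetch the bio page and fill only the missing fields, keeping roster context.
async function enrichFromProfile(
  coach: Coach,
  fetchHtml: (url: string) => Promise<string>,
  extractProfile: (html: string) => Partial<Coach>,
): Promise<Coach> {
  if ((coach.email && coach.phone) || !coach.profileUrl) return coach;
  const html = await fetchHtml(coach.profileUrl);
  const profile = extractProfile(html);
  return {
    ...coach,
    email: coach.email ?? profile.email,
    phone: coach.phone ?? profile.phone,
  };
}
```

The `??` fallbacks are the important part: roster data wins when it exists, and the profile page only fills gaps.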

There is also a guardrail I like here: if a model returns a profile URL on the wrong domain, the code rewrites it back onto the source site. That keeps the pipeline grounded in the actual athletics site instead of trusting a hallucinated cross-domain link.
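That guardrail fits in a few lines with the standard `URL` API. This is a sketch of the idea, not the repo's exact code:

```typescript
// Force a model-returned profile URL back onto the source site's domain.
function groundProfileUrl(profileUrl: string, sourceUrl: string): string {
  const source = new URL(sourceUrl);
  const profile = new URL(profileUrl, source); // also resolves relative paths
  if (profile.hostname !== source.hostname) {
    profile.protocol = source.protocol;
    profile.hostname = source.hostname;
    profile.port = source.port;
  }
  return profile.toString();
}
```

A side benefit of passing `source` as the base URL: relative paths like `/coach/jane`, which LLMs often return verbatim from the page, resolve correctly for free.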

Handling the Messy Cases

The project includes a few defensive pieces that came directly from running into bad pages over and over.

One is SPA detection. If the HTML looks like a React, Vue, or Angular shell instead of a real document, the heuristic extractor can fail explicitly instead of pretending there was just no data.
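A crude version of that check looks for an empty framework mount point plus almost no visible text. The markers and the 200-character threshold here are illustrative guesses, not the repo's values:

```typescript
// Heuristic SPA-shell detection: an empty #root/#app/app-root mount node
// combined with almost no visible text suggests the content is rendered
// client-side, so the heuristic extractor should fail loudly instead of
// reporting "no data".
function looksLikeSpaShell(html: string): boolean {
  const emptyMount =
    /<div id="(root|app)">\s*<\/div>/i.test(html) ||
    /<app-root>\s*<\/app-root>/i.test(html);
  const visibleText = html
    .replace(/<script[\s\S]*?<\/script>/gi, "")
    .replace(/<[^>]+>/g, "")
    .trim();
  return emptyMount && visibleText.length < 200;
}
```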

Another is hidden-content filtering. Athletics sites often include hidden sections for other sports, old layouts, or alternate mobile markup. If you do not filter that out, you can easily extract the wrong staff from the right page.
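One simplified way to do that with a regex pass (the real extractor would want a DOM parser, since regex stripping breaks on nested hidden elements and class-based hiding):

```typescript
// Drop flat elements inline-styled as hidden before extraction, so emails
// from a hidden baseball section don't leak into soccer results.
// Naive sketch: does not handle nested hidden elements or CSS classes.
function stripInlineHidden(html: string): string {
  return html.replace(
    /<([a-z]+)[^>]*style="[^"]*display:\s*none[^"]*"[^>]*>[\s\S]*?<\/\1>/gi,
    "",
  );
}
```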

The fetch layer also presents a browser-like TLS fingerprint instead of the default fetch stack's. That helps with sites that behave differently when the client looks too synthetic.

None of that is glamorous, but it is the difference between a toy scraper and something you can run in bulk.

Built Around Spreadsheets, Not Just Code

I also wanted the workflow to be usable by someone who is not living in the codebase.

That sounds small, but it changes the project from "a scraper" into something an operator can actually use.

The project accepts input in three ways:

  • direct URL lists
  • public Google Sheets
  • spreadsheet-driven workflows through Apify and Apps Script

That means the operating model can stay simple:

  1. Put coach-page URLs into a sheet.
  2. Run the extractor in the mode you want.
  3. Get back structured coach records.

Why This Project Matters to Me

What I like about this build is that it settled a question I had back in 2017: could this problem ever be solved well without writing endless custom scraping rules?

Not in a vague "agentic" marketing sense. Not as magic. Just as a better way to handle messy, inconsistent data that used to force you into an endless loop of brittle rules.

The old version of this idea topped out around 40% because heuristics had to do everything. This version works because heuristics do the obvious work, LLMs handle ambiguity, and the pipeline follows up when the first pass is incomplete.

The next step is making that routing automatic so the system can decide, page by page, when cheap rules are enough and when a stronger extractor is worth using.
