DEV Community

SEN LLC

Building linkcheck: Why I Stopped Parsing Markdown with Regex


A small Python CLI that scans Markdown and HTML for broken links, including internal anchors with proper GitHub-style slug resolution, plus per-host rate limiting so it doesn't accidentally DoS the sites it's checking.

📦 GitHub: https://github.com/sen-ltd/linkcheck


Documentation rots. Every README accumulates dead links, every static blog ends up with 404s pointing at services that quietly retired three years ago, and every monorepo's docs/ directory eventually sprouts internal anchors that go nowhere because someone renamed a heading and didn't grep for the slug. The depressing part is that all of this is mechanically detectable (a 200-line script could catch most of it), but the existing tools each have a sharp edge that keeps them out of CI for plenty of small projects. So I wrote linkcheck, a Python CLI that fits in pretty much any pipeline, and I want to walk through three design decisions in it that took longer than the actual coding.

## The three alternatives, honestly

Before writing yet another anything, you should be able to say what's wrong with the existing options.

  • lychee is genuinely excellent. It's Rust, it's fast, it's well-maintained, and for big sites it's the right answer. The friction is that it's another toolchain to install and pin in projects that don't otherwise need Rust. For a 30-page docs tree in a Python repo, asking contributors to install Rust just to run pre-commit is a hard sell.
  • markdown-link-check is Node-only and conspicuously slow on larger trees because it doesn't pool connections per host. It also doesn't speak HTML, which matters if your docs build emits both.
  • html-linkchecker has been quiet since around 2019 and doesn't understand Markdown at all.

Each is fine for its niche. The niche I wanted to fill: a small Python CLI that handles Markdown and HTML in one pass, gets internal anchors right, and is well-behaved on the network. About 700 lines of source, two dependencies, ships as a 62 MB Docker image, runs the whole self-test in under a second.

## Decision 1: markdown-it-py over regex

This was the first thing I tried to skip and the first thing I had to come back and fix. The first version of the extractor used an `re` pattern roughly like `\[([^\]]+)\]\(([^)]+)\)` to find inline Markdown links. It worked for about ten minutes. Here is the file that broke it:

````markdown
Real link: [docs](https://example.com)

```
Fake link: [also](https://nope.example)
```

Escaped: \[not a link\](nope)
````
The regex happily extracted `https://nope.example` from inside the code fence, then complained that it was broken, and it matched the escaped form too. Both are false positives, both erode trust in the tool fast, and trying to fix them with more regex sends you down the same rabbit hole as the canonical Stack Overflow answer about why you can't parse HTML with regex.
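
You can reproduce the false positive in a few lines. This is the same naive pattern from above, aimed at a fenced block:

```python
import re

# The naive inline-link pattern from above.
pattern = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")

doc = "```\nFake link: [also](https://nope.example)\n```\n"

# The regex has no concept of code fences, so it "finds" the link anyway.
assert pattern.findall(doc) == [("also", "https://nope.example")]
```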

The fix is to stop parsing yourself. `markdown-it-py` does it correctly because it's a real CommonMark parser:

```python
from markdown_it import MarkdownIt

_MD = MarkdownIt("commonmark", {"html": True}).enable("table")


def _iter_md_tokens(tokens, inherited_line: int = 0):
    """Yield ``(token, line)`` pairs.

    ``markdown-it`` only sets ``map`` on block tokens; inline tokens (the
    ``link_open``, ``image``, etc. we care about) inherit the line of their
    enclosing block. We track that here so callers always get a usable
    line number.
    """

    for tok in tokens:
        line = inherited_line
        if tok.map is not None:
            line = tok.map[0] + 1
        yield tok, line
        if tok.children:
            yield from _iter_md_tokens(tok.children, line)
```

The crucial property: a fenced code block is a *single* `fence` token whose `content` is a string. `markdown-it-py` does not parse the inside of code blocks into child tokens, so `link_open` events simply never appear there. The regex problem dissolves because we stopped pretending Markdown was a regular language. The escaped-bracket case is handled the same way: the parser sees `\[` and emits a text token, not a link.

The line-number bit is the second thing I learned the hard way. `markdown-it-py` only sets the `map` attribute on *block* tokens (paragraphs, list items, headings). The inline tokens nested inside a paragraph, including the `link_open` you actually care about, have `map=None`. If you naively fall back to `0` whenever `map` is missing, every link in a paragraph reports line `0`, which produces error messages like `doc.md:0 → broken` and makes your CI annotation feature useless. The fix is to walk the tree depth-first and *inherit* the most recent block's line number into every child.
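
Both claims are easy to verify with a quick probe of markdown-it-py's token stream (a sanity check, not linkcheck code):

```python
from markdown_it import MarkdownIt

md = MarkdownIt("commonmark")

# A fenced block is one opaque `fence` token: no children, so no link_open.
fence = md.parse("```\nFake link: [also](https://nope.example)\n```\n")
assert [t.type for t in fence] == ["fence"]
assert fence[0].children is None

# Inline tokens have map=None; only the enclosing block carries a line range.
para = md.parse("A [docs](https://example.com) link\n")
link = next(t for t in para[1].children if t.type == "link_open")
assert link.map is None
assert para[0].map == [0, 1]  # paragraph_open knows its (0-based) line span
```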

## Decision 2: per-host rate limiting as a first-class feature

The naive way to check 200 URLs is to fire 200 concurrent HTTP requests. `httpx.AsyncClient` will happily do this, and on the first run against a real docs tree it feels great because everything finishes in two seconds. Then you check the same tree against `docs.python.org` from a CI runner and your build mysteriously starts failing on rate-limited 429s. Then you do it again from a coworker's laptop on the office network and they get blocked at the firewall.

Polite tools cap concurrency *per host*. Not globally (globally is wasteful, because requests to GitHub and requests to PyPI don't compete with each other) but per destination hostname. Here's the entire implementation:

```python
import asyncio


class HostLimiter:
    """One semaphore per host.

    Created lazily so we don't allocate a semaphore for every known domain
    up front. A ``per_host`` of 4 is a reasonable default: go higher for
    very large sites (github.com) if you know what you're doing, lower for
    small personal blogs.
    """

    def __init__(self, per_host: int) -> None:
        self._per_host = per_host
        self._locks: dict[str, asyncio.Semaphore] = {}
        self._mu = asyncio.Lock()

    async def acquire(self, host: str) -> asyncio.Semaphore:
        async with self._mu:
            sem = self._locks.get(host)
            if sem is None:
                sem = asyncio.Semaphore(self._per_host)
                self._locks[host] = sem
            return sem
```

Twenty-odd lines including the docstring. The check function then wraps each request in `async with sem:`, so even if 200 tasks are scheduled, only `per_host` of them can be in-flight against any one hostname at a time. The default is 4. Tasks targeting different hosts don't block each other.
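
The "different hosts don't block each other" property is worth seeing in miniature. This standalone sketch uses a plain dict of semaphores (the lock is unnecessary here since everything runs in one event loop step); it is an illustration of the pattern, not linkcheck's `HostLimiter`:

```python
import asyncio


async def main() -> dict[str, int]:
    sems: dict[str, asyncio.Semaphore] = {}
    in_flight = {"a.example": 0, "b.example": 0}
    peaks = {"a.example": 0, "b.example": 0}

    async def hit(host: str) -> None:
        sem = sems.setdefault(host, asyncio.Semaphore(2))  # per_host=2
        async with sem:
            in_flight[host] += 1
            peaks[host] = max(peaks[host], in_flight[host])
            await asyncio.sleep(0)  # yield while "in flight" so siblings run
            in_flight[host] -= 1

    hosts = ["a.example", "b.example"] * 10
    await asyncio.gather(*(hit(h) for h in hosts))
    return peaks


# Each host saturates its own cap of 2, independent of the other's load.
```

Twenty tasks across two hosts, and each host's in-flight peak is bounded by its own semaphore; the load on `a.example` never delays requests to `b.example`.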

This is one of those features where the test is more interesting than the code, because you can't unit-test "does the rate limit work" by counting after the fact; by the time you check, the limiter has already released. You have to observe the in-flight count *during* the request:

```python
async def test_per_host_limit_caps_concurrency():
    in_flight = 0
    peak = 0

    async def handler(request: httpx.Request) -> httpx.Response:
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        # Yield several times so any unblocked siblings get a turn to run
        # and bump in_flight before we decrement.
        for _ in range(5):
            await asyncio.sleep(0)
        in_flight -= 1
        return httpx.Response(200)
    ...
    assert peak == 2, f"expected exactly 2 concurrent at peak, saw {peak}"
```

The `await asyncio.sleep(0)` is doing real work here: it forces the event loop to schedule any other ready task, so if the limiter were broken and a sibling were unblocked, it would also enter the handler and bump `in_flight` before this one decrements. Without those yields, you might see `peak=1` even with a broken limiter just because the tasks happened to run serially.

## Decision 3: the GitHub slug algorithm, exactly

When you write `[Usage](./guide.md#usage)`, the anchor `usage` only exists if the target has a heading whose generated slug matches. Most link checkers handle external URLs but skip anchors entirely, or check that *some* heading exists in the file without checking the slug. Both miss the most common bug: someone renames `## Usage` to `## How to use it`, the slug becomes `how-to-use-it`, and every old link is silently dead.

Here's the slug function:

```python
import re
import unicodedata

# Module-level patterns used below. ``[^\w\- ]`` is the character class
# discussed in the prose; the whitespace pattern is its companion.
_ALLOWED = re.compile(r"[^\w\- ]")
_WHITESPACE = re.compile(r"\s+")


def slugify(heading: str) -> str:
    """Convert a single heading string to a GitHub-style slug.

    This is the atomic version; it does not track duplicates. Use
    :class:`SlugCounter` if you are slugifying a full document.
    """

    # Normalize to NFKC first. This keeps "café" → "café" (GitHub preserves
    # composed forms) while still handling weird combining sequences.
    text = unicodedata.normalize("NFKC", heading)
    text = text.lower()
    text = _ALLOWED.sub("", text)
    text = _WHITESPACE.sub("-", text.strip())
    text = text.strip("-")
    return text
```

There are a couple of subtleties. First, GitHub does *not* transliterate: `## インストール` becomes the slug `インストール`, not something Romanized. The regex `[^\w\- ]` respects this because Python 3's `\w` matches Unicode word characters by default. Second, GitHub tracks duplicates within a document and suffixes them: the first `## Usage` becomes `usage`, the second becomes `usage-1`, and so on. That means you can't slugify in isolation; you need a counter that walks the document in order, which is what `SlugCounter` is for.

Internal anchor checking then becomes: parse the target file into a `ParsedDocument` (which contains a `frozenset[str]` of all the slugs reachable in that file), and check membership.
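
Here's a sketch of both pieces together. `SlugCounter` and the slug set exist in linkcheck, but these bodies are illustrative reconstructions (with a simplified inline `slugify`), not the shipped code:

```python
import re
import unicodedata
from collections import defaultdict

_ALLOWED = re.compile(r"[^\w\- ]")


def slugify(heading: str) -> str:
    # Simplified inline copy of the slug function above.
    text = unicodedata.normalize("NFKC", heading).lower()
    text = _ALLOWED.sub("", text)
    return re.sub(r"\s+", "-", text.strip()).strip("-")


class SlugCounter:
    """Walk headings in document order, suffixing duplicates GitHub-style."""

    def __init__(self) -> None:
        self._seen: defaultdict[str, int] = defaultdict(int)

    def add(self, heading: str) -> str:
        base = slugify(heading)
        n = self._seen[base]
        self._seen[base] += 1
        return base if n == 0 else f"{base}-{n}"


counter = SlugCounter()
slugs = frozenset(counter.add(h) for h in ["Usage", "Usage", "How to use it"])
assert slugs == {"usage", "usage-1", "how-to-use-it"}
assert "usage" in slugs            # [Usage](#usage) resolves
assert "instalation" not in slugs  # a typo'd anchor is caught
```

(One edge this sketch punts on: a document with a literal `## Usage 1` heading can collide with a generated `usage-1` suffix; a full implementation has to pick a policy for that.)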

## Other tradeoffs

A few things linkcheck deliberately doesn't do, and why:

- **No anchor checking inside JavaScript-generated content.** If your docs are SSR'd and the heading appears in the rendered HTML, fine: the HTML parser picks it up via `id="..."`. If your docs are React with client-side rendering and the heading only exists after `useEffect`, linkcheck will say the anchor doesn't exist. The honest answer is that any tool that doesn't run a real headless browser will get this wrong, and running a headless browser would 10× the install size and 100× the runtime. We chose not to.
- **No link-shortener resolution.** A `bit.ly` link only tells you that `bit.ly` is up. Following it to verify the *destination* would mean a GET (not HEAD), which means more bandwidth and a different rate-limit profile. We just check that the shortener responds with a 2xx/3xx and trust that as "the link works."
- **Retries are exponential, not linear.** A flaky 502 from a CDN edge usually clears in 250 ms or it's a real outage. Linear backoff (1 s, 2 s, 3 s) would make CI runs glacial when *something* is down. Exponential (250 ms, 500 ms, 1 s, 2 s) is done in under four seconds either way.
- **`mailto:` is skipped, period.** You could argue we should validate the domain has an MX record. We don't, because false positives on legitimate but unusual mail setups would be infuriating, and the cost of a typo'd `mailto:` is so low that it's not worth annoying users to catch.
- **No `robots.txt` respect.** robots.txt is a directive for crawlers walking a site. linkcheck visits each URL exactly once, identifies as `linkcheck-cli/<version>` in the User-Agent, and respects per-host rate limits, so it isn't a crawler and arguably isn't subject to robots.txt at all. If a server *really* doesn't want to be checked, it can return 403 and we'll report that honestly.
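
The exponential schedule above can be sketched as a tiny helper. `with_retries` is a hypothetical name for illustration, not linkcheck's actual API:

```python
import asyncio


async def with_retries(fn, attempts: int = 4, base_delay: float = 0.25):
    """Call the zero-arg coroutine function ``fn``, retrying on any
    exception with exponential backoff: delays of base_delay * 2**attempt
    between tries (250 ms, 500 ms, 1 s for the defaults)."""
    for attempt in range(attempts):
        try:
            return await fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the real error
            await asyncio.sleep(base_delay * (2 ** attempt))
```

A request that fails twice and then succeeds costs 750 ms of waiting under the defaults; a real outage exhausts the schedule and re-raises quickly instead of dragging the CI run out.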

## HEAD with GET fallback (a small thing that matters)

This is the kind of detail that only shows up when you run a tool against the open web. The textbook way to check whether a URL exists is to issue an HTTP `HEAD`: same headers as `GET`, no response body, much cheaper. The reality is that a meaningful slice of servers either don't implement HEAD at all (`501 Not Implemented`) or block it for security reasons (`405 Method Not Allowed`), and the only way to distinguish "server is broken on HEAD" from "URL is broken" is to retry with `GET`.

```python
resp = await client.head(
    resolved.url,
    follow_redirects=True,
    headers={"User-Agent": config.user_agent},
)
if resp.status_code in (405, 501):
    resp = await client.get(
        resolved.url,
        follow_redirects=True,
        headers={"User-Agent": config.user_agent},
    )
```

A few extra lines, and a whole category of false positives disappears. Without this fallback, the tool would tell you that maybe a third of the wider web is broken, which is a great way to get yourself uninstalled.

## Try it in 30 seconds

```bash
git clone https://github.com/sen-ltd/linkcheck
cd linkcheck
docker build -t linkcheck .

# Fast, deterministic, network-free; drop into pre-commit:
docker run --rm -v "$PWD:/work" linkcheck --no-external docs/

# Full check with GitHub Actions annotations:
docker run --rm -v "$PWD:/work" linkcheck --format github docs/
```

`--no-external` is the flag I expect to get the most mileage out of. It runs in milliseconds, never touches the network, and catches the renamed-heading and missing-file bugs that account for most of the broken links you'd ever ship, which means it's safe to run on every commit without slowing anyone down while the slow external check moves to nightly CI.

## What's next

The obvious next step is a `--baseline` flag that ignores any link broken in the previous run, so you can adopt the tool against a legacy docs tree without having to fix every preexisting issue on day one. That's the same low-friction on-ramp that baseline files gave strict type checkers and linters. But that's the next entry. This one is already on disk and works today.