Gabriel Anhaia

Posted on Jul 2

PHP 8.5's New URI Extension: Parsing URLs Without the parse_url() Footguns

#php #php85 #security #architecture

Book: Decoupled PHP — Clean and Hexagonal Architecture for Applications That Outlive the Framework

parse_url() has been in PHP since PHP 4. It follows neither RFC 3986 nor the WHATWG URL standard. It splits a string on characters that look like URL delimiters and hands you an array. That's the whole contract.

For twenty years that array has been the foundation of every allowlist check, every "is this an internal host" guard, every redirect validator in the ecosystem. The problem shows up when your validator reads a URL one way and the HTTP client that actually sends the request reads it another way. Orange Tsai's Black Hat 2017 talk, "A New Era of SSRF", is 40 slides of exactly this: two parsers disagreeing about the same string, and the gap between them being the vulnerability.

PHP 8.5 ships two standards-compliant parsers as an always-on extension. This post is about where parse_url() lies, and how to keep the new parser at the edge of your app where it belongs.

Where parse_url() lies

Start with the classic. A hostname with no scheme:

var_dump(parse_url('cdn.example.com/app.js'));
// array(1) { ["path"]=> "cdn.example.com/app.js" }

No host key. The whole thing landed in path. If your code did parse_url($url)['host'] and compared it against an allowlist, you just got null (plus an "Undefined array key" warning on PHP 8), and whatever you do with null is now the bug.

Next, a port that isn't a number:

var_dump(parse_url('https://example.com:80x/api'));
// bool(false)

The entire parse collapses to false because of two characters in the port. No component survives. Your validator has to handle false, and a lot of validators in the wild forget to.
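
Both failure modes can be funneled into one rejection path. A minimal sketch, assuming two hypothetical helpers named hostOrNull() and isAllowedHost() (neither is part of PHP), that treats a false result and a missing host key the same way:

```php
<?php

/**
 * Hypothetical helper: returns the lowercased host, or null when
 * parse_url() fails outright or finds no host component.
 */
function hostOrNull(string $url): ?string
{
    $parts = parse_url($url);

    // false means the parse failed; a missing key means "no host found".
    if ($parts === false || !isset($parts['host'])) {
        return null;
    }

    return strtolower($parts['host']);
}

/**
 * Hypothetical helper: true only when a host was found and it is on
 * the allowlist. Allowlist entries are assumed to be lowercase.
 */
function isAllowedHost(string $url, array $allowedHosts): bool
{
    $host = hostOrNull($url);

    return $host !== null && in_array($host, $allowedHosts, true);
}
```

This only removes the "forgot to handle false or null" bug. It does nothing about parse_url() disagreeing with the client on inputs it does accept.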

Then there's the input parse_url() accepts that it should reject. Control characters, whitespace inside the authority, mixed delimiters. It rarely errors. It reshuffles the string into an array and moves on, and the array often disagrees with what curl or a browser would do with the same bytes. That disagreement is what parser-differential SSRF bypasses are built on.
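
If you are stuck on parse_url() for a while longer, a coarse pre-check can at least refuse the bytes that cause the most trouble. This is a sketch under assumptions, not a replacement for a real parser: containsControlOrSpace() is a hypothetical name, and the rule (reject ASCII control characters, space, and DEL) is deliberately stricter than either standard.

```php
<?php

/**
 * Hypothetical pre-check: true when the string contains an ASCII
 * control character (0x00-0x1F), a space (0x20), or DEL (0x7F).
 */
function containsControlOrSpace(string $url): bool
{
    // The pattern is single-quoted, so PCRE (not PHP) reads the \x escapes.
    return preg_match('/[\x00-\x20\x7F]/', $url) === 1;
}
```

Call it before parse_url() and reject on true, so the string is never parsed at all.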

Two parsers, on purpose

PHP 8.5 gives you two classes, and the split is deliberate.

Uri\Rfc3986\Uri follows RFC 3986. It's strict, it preserves what you gave it, and it's the right choice for URIs in the general sense: config values, urn: identifiers, anything that isn't specifically a web address headed for an HTTP client.

Uri\WhatWg\Url follows the WHATWG URL standard, the same algorithm browsers and fetch() run. It normalizes aggressively, handles internationalized domains, and reports soft errors. It's the right choice for any URL that a browser or HTTP client will resolve, because it parses the string the way that client will.

The reason both exist is the reason SSRF filters break: "the correct parse" depends on who's consuming the result. Match your validator's parser to your client's parser and the gap closes.
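
One way to see the split in practice is to parse the same string with both classes and compare what each reports. A minimal diagnostic sketch, where hostsSeenByEachParser() is a hypothetical helper name and the input is just an illustration:

```php
<?php

use Uri\Rfc3986\Uri;
use Uri\WhatWg\Url;

/**
 * Hypothetical diagnostic: parse one string with both PHP 8.5 parsers
 * and report the host each one sees. An entry is null when that parser
 * rejects the input or finds no host.
 */
function hostsSeenByEachParser(string $input): array
{
    return [
        'rfc3986' => Uri::parse($input)?->getHost(),
        'whatwg'  => Url::parse($input)?->getAsciiHost(),
    ];
}

$seen = hostsSeenByEachParser('https://EXAMPLE.com/');

// For a plain ASCII host like this one, both parsers should agree.
$agree = $seen['rfc3986'] === $seen['whatwg'];
```

Inputs where the two entries differ, or where one is null and the other is not, are the ones worth logging: they are the strings on which a validator and a client could disagree.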

Parsing without the surprises

The RFC 3986 class has a constructor that throws and a static parse() that returns null. Pick based on whether bad input is exceptional or expected.

use Uri\Rfc3986\Uri;
use Uri\InvalidUriException;

// Throws on invalid input.
try {
    $uri = new Uri('https://example.com:80x/api');
} catch (InvalidUriException $e) {
    // handled, typed, not a silent false
}

// Returns null on invalid input.
$uri = Uri::parse('https://example.com:80x/api');
if ($uri === null) {
    // malformed port, rejected here
}

Compare that to the parse_url() version, where invalid input is false, missing components are absent array keys, and nothing tells you which case you hit. The new API forces the decision at the parse site.

The getters return typed, nullable components:

$uri = new Uri('https://api.example.com:8080/v1?id=7#top');

$uri->getScheme();    // "https"
$uri->getHost();      // "api.example.com"
$uri->getPort();      // 8080  (int, or null)
$uri->getPath();      // "/v1"
$uri->getQuery();     // "id=7"
$uri->getFragment();  // "top"

getPort() gives you an int or null. No absent array key to guard against, no false.

Raw versus normalized

This is the distinction that matters for security work. On Uri\Rfc3986\Uri, the string getters each have a raw twin (getPort() is the exception, since a port has only one form), and the object has both toString() and toRawString().

$uri = new Uri('HTTPS://ExAmple.COM/sp%6Fnsor/');

$uri->getHost();       // "example.com"   (normalized)
$uri->getRawHost();    // "ExAmple.COM"   (as written)
$uri->toString();      // normalized form
$uri->toRawString();   // your exact input, untouched

When you compare a host against an allowlist, you want the normalized getHost(), so ExAmple.COM and example.com don't slip past a case-sensitive in_array(). When you need to echo back exactly what the user sent, you reach for the raw side. Having both, named clearly, means you stop hand-rolling strtolower() normalization and getting it subtly wrong.
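
A minimal sketch of that comparison for a URI validated under RFC 3986. hostIsAllowed() is a hypothetical helper, and the allowlist entries are assumed to be stored lowercase:

```php
<?php

use Uri\Rfc3986\Uri;

/**
 * Hypothetical helper: true when the URI parses and its normalized
 * host is on the allowlist.
 */
function hostIsAllowed(string $candidate, array $allowedHosts): bool
{
    $uri = Uri::parse($candidate);

    if ($uri === null) {
        // Invalid input is rejected, never compared.
        return false;
    }

    // Normalized getter: "ExAmple.COM" is compared as "example.com".
    $host = $uri->getHost();

    return $host !== null && in_array($host, $allowedHosts, true);
}
```

The strict in_array() stays, because the normalization already happened inside the parser rather than in a hand-written strtolower() call.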

WHATWG for anything a client will touch

The WHATWG class is where the browser-compatibility payoff lives. Internationalized domains are the clearest example:

use Uri\WhatWg\Url;

$url = new Url('https://münchen.de/tickets');

$url->getAsciiHost();     // "xn--mnchen-3ya.de"
$url->getUnicodeHost();   // "münchen.de"

parse_url() hands back münchen.de as-is. Feed that to a client expecting the ASCII/punycode host and you get a mismatch. The WHATWG parser gives you both representations, and getAsciiHost() is the one your DNS resolver and TLS stack actually use.

Invalid input throws a typed exception carrying the validation errors:

use Uri\WhatWg\InvalidUrlException;

try {
    $url = new Url('https://exa mple.com');
} catch (InvalidUrlException $e) {
    foreach ($e->errors as $error) {
        // structured UrlValidationError entries
    }
}

For non-fatal issues, the constructor and parse() take an optional by-reference argument (the third parameter, after the optional base URL) that collects validation errors, so you can inspect problems that were recoverable but worth logging.
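
A sketch of reading that by-reference argument, assuming it is the third parameter of parse() and that each entry exposes its category through a type enum property. The backslash input is only an illustration: the WHATWG standard tolerates backslashes for special schemes such as https but counts them as a validation error, and whether a given input actually yields entries is worth confirming on your build.

```php
<?php

use Uri\WhatWg\Url;

$softErrors = [];

// The single-quoted literal below is the string https:\\example.com\path
$url = Url::parse('https:\\\\example.com\\path', null, $softErrors);

$errorNames = [];

if ($url !== null) {
    foreach ($softErrors as $error) {
        // Assumed shape: a Uri\WhatWg\UrlValidationError with a "type" enum.
        $errorNames[] = $error->type->name;
    }
}
```

Logging $errorNames next to the raw input gives you a record of URLs that parsed but were not written the way a well-formed URL should be.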

Building URLs safely

Both classes are immutable. The wither methods clone and return a new instance, so you compose a URL without ever concatenating strings:

$base = new Uri('https://api.example.com/v1/users');

$next = $base
    ->withQuery('page=2&limit=20')
    ->withFragment('results');

$next->toString();
// https://api.example.com/v1/users?page=2&limit=20#results

Relative references resolve against a base the way the standard says they should, so you stop writing rtrim($base, '/') . '/' . ltrim($path, '/'):

$doc = new Uri('https://example.com/docs/guide/');
$doc->resolve('../api/reference')->toString();
// https://example.com/docs/api/reference

Keeping it at the edge

URL parsing is an I/O concern. It belongs in an adapter, next to the HTTP client, not scattered through your domain. The domain should receive a validated value object and never see a raw string.

Put the parser behind a port. Here's an outbound-URL guard that uses the WHATWG parser (matching what the HTTP client will resolve) to reject anything not on the allowlist:

use Uri\WhatWg\Url;

interface OutboundUrl
{
    public function toString(): string;
}

final class SafeOutboundUrl implements OutboundUrl
{
    private function __construct(
        private readonly string $url,
    ) {}

    public static function forAllowedHosts(
        string $candidate,
        array $allowedHosts,
    ): self {
        $parsed = Url::parse($candidate);

        if ($parsed === null) {
            throw new InvalidArgumentException(
                'Unparseable URL rejected'
            );
        }

        $host = $parsed->getAsciiHost();

        if (!in_array($host, $allowedHosts, true)) {
            throw new InvalidArgumentException(
                "Host not allowed: {$host}"
            );
        }

        return new self($parsed->toAsciiString());
    }

    public function toString(): string
    {
        return $this->url;
    }
}

The validator parses the URL with the same algorithm the client uses, checks the ASCII host that DNS will actually resolve, and hands back a value object built from the normalized form. Your webhook dispatcher takes an OutboundUrl, not a string, so no unvalidated URL reaches the network by this path. The SSRF check and the request share one parser, which closes the parser-disagreement gap. It does not close the whole SSRF class: DNS rebinding, where the host resolves to an internal IP after the check, and redirect following by the HTTP client are separate controls you still need.
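
For the DNS side, one common complementary control is to check what the host resolves to. A rough sketch under stated assumptions: isPublicIp() and resolvesToPublicIpsOnly() are hypothetical helpers, gethostbynamel() only returns IPv4 addresses, and the check only helps against rebinding if the HTTP client then connects to the address that was checked (for example by pinning it) rather than resolving the name a second time.

```php
<?php

/**
 * Hypothetical helper: true only for IP literals outside the private
 * and reserved ranges (RFC 1918, loopback, link-local, and so on).
 */
function isPublicIp(string $ip): bool
{
    return filter_var(
        $ip,
        FILTER_VALIDATE_IP,
        FILTER_FLAG_NO_PRIV_RANGE | FILTER_FLAG_NO_RES_RANGE
    ) !== false;
}

/**
 * Hypothetical helper: resolves the host's IPv4 addresses and accepts
 * it only when every one of them is public.
 */
function resolvesToPublicIpsOnly(string $asciiHost): bool
{
    $ips = gethostbynamel($asciiHost);

    if ($ips === false || $ips === []) {
        // Unresolvable hosts are rejected, not waved through.
        return false;
    }

    foreach ($ips as $ip) {
        if (!isPublicIp($ip)) {
            return false;
        }
    }

    return true;
}
```

In the guard above, this would run on the value of getAsciiHost() after the allowlist check, as a second and independent rejection reason.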

Which class to reach for

URLs headed for an HTTP client, a browser, or an SSRF guard: Uri\WhatWg\Url. Match the client's algorithm.
URIs in the general sense, config identifiers, strict RFC 3986 work: Uri\Rfc3986\Uri.
Bad input is expected (user forms, imported data): ::parse() and check for null.
Bad input is a bug (internal config that must be valid): the constructor, and let InvalidUriException surface.
Allowlist and comparison logic: the normalized getters (getHost(), getAsciiHost()), never the raw ones.

parse_url() still works, and for pulling the scheme off a trusted internal string it's fine. The moment a URL comes from outside and gets sent somewhere, use the parser that agrees with the thing sending the request.

The parse_url() era of "close enough" URL handling is ending. That's a good thing.

A URL parser is a textbook adapter concern: it deals with the messy outside world, and its job is to hand your domain a clean, validated value object it can trust. Keeping that translation at the boundary, so an OutboundUrl never arrives as a raw string, is the same discipline that lets you swap HTTP clients or frameworks without touching a use case. That boundary between the outside world and the domain is what Decoupled PHP is about, with chapters on ports, adapters, and the value objects that live at the seam.