SEN LLC

Posted on Apr 15

Parsing phpinfo() in Pure PHP — or, How to Diff Two Containers

#php #cli #devops #tutorial

phpinfo() is the most-used PHP debugging tool on earth, and its output format is hostile to every reasonable thing you'd want to do with it. This is a small CLI that parses it into JSON so you can jq it, grep it, and — the real motivator — diff two environments.

You know the ritual. Something is weird on one container. It works on your laptop, it works in staging, but production is quietly doing the wrong thing. You suspect opcache, or memory_limit, or whether a particular extension got loaded, so you reach for the one tool that has never failed you in fifteen years: phpinfo().

You get back 72 KB of HTML tables. Then you stare at it, because what are you going to do — Cmd+F through it?

🔗 GitHub: https://github.com/sen-ltd/phpinfo-json

I wrote a small CLI called phpinfo-json that parses the HTML dump (or the text dump from php -i) into a clean JSON shape with sections, modules, and general keys. It's ~400 lines of strict PHP 8.2 with no Composer runtime dependencies — just DOMDocument, DOMXPath, and json_encode. The point of this article is not the tool. The point is a few specific decisions I made while writing the parser that are more interesting than they look: why you should use DOMDocument and not regex, how phpinfo() actually structures its HTML once you look closely, how to handle the "no value" cells without special-casing them everywhere, and the diff-two-environments use case that turned out to be the whole reason I bothered.

The problem

Here is what phpinfo() produces when you call it from a PHP web process:

<h2><a name="module_core" href="#module_core">Core</a></h2>
<table>
<tr><td class="e">PHP Version </td><td class="v">8.2.30 </td></tr>
</table>
<table>
<tr class="h"><th>Directive</th><th>Local Value</th><th>Master Value</th></tr>
<tr><td class="e">allow_url_fopen</td><td class="v">On</td><td class="v">On</td></tr>
<tr><td class="e">display_errors</td><td class="v">Off</td><td class="v">On</td></tr>
<tr><td class="e">memory_limit</td><td class="v">256M</td><td class="v">128M</td></tr>
<tr><td class="e">open_basedir</td><td class="v"><i>no value</i></td><td class="v"><i>no value</i></td></tr>
</table>

A few things jump out once you actually read it:

Each section is an <h2> followed by one or more <table> elements. The Core section has two — one for "PHP Version" and another for all the directives. The parser has to collect every table between this <h2> and the next <h2>, not just the first one. I got this wrong on my first pass.
There are two kinds of rows. Two-column rows (<td>key</td><td>value</td>) for simple properties like "PDO support: enabled", and three-column rows (<td>key</td><td>local</td><td>master</td>) for INI directives where the local value can override the master php.ini value.
Unset INI directives are rendered as the literal string <i>no value</i>. Not an empty string, not a NULL, not a missing <td>. Just the text "no value" inside an italic tag.
Multi-line values exist. disable_functions = exec, system, passthru appears in the HTML as three separate lines inside the same <td>. You have to preserve the newlines or you lose information.
The text format (php -i) is simpler — no HTML, just key => value lines with --- separators between sections — but it's lossy in a handful of ways I'll get to.
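To make the text format concrete, here is a minimal sketch of splitting one php -i line on its " => " separator. parseTextLine is a hypothetical helper for illustration, not a function from phpinfo-json, and it deliberately ignores section tracking.

```php
<?php
declare(strict_types=1);

/**
 * Split one line of `php -i` output into [key, value].
 * Hypothetical helper; returns null for lines that carry no key/value pair.
 */
function parseTextLine(string $line): ?array
{
    $parts = array_map('trim', explode(' => ', $line, 3));
    if (count($parts) < 2 || $parts[0] === '' || $parts[0] === 'Directive') {
        return null; // section names, blank lines, and the table header row
    }
    if (count($parts) === 3) {
        // Three fields: an INI directive with local and master values.
        return [$parts[0], ['local' => $parts[1], 'master' => $parts[2]]];
    }
    return [$parts[0], $parts[1]];
}

$directive = parseTextLine('memory_limit => 256M => 128M');
$property  = parseTextLine('PDO support => enabled');
$heading   = parseTextLine('Core');
```

A value that itself contains " => " cannot be told apart from a local/master pair here, which is one example of how the text format loses information.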

Why DOMDocument, not regex

I want to address the obvious temptation first. You look at this HTML, notice it's simple and regular, and your brain immediately starts writing preg_match_all('/<td class="e">([^<]+)<\/td><td class="v">([^<]+)<\/td>/', ...). Please don't. I've written that code, and it's fine until it isn't. Three things blow it up:

Whitespace inside cells. Real phpinfo output has <td class="e">PHP Version </td> with a trailing space. Your regex has to tolerate whitespace everywhere, and soon you're writing \s* between every token.
Nested tags. <td class="v"><i>no value</i></td> has an <i> inside. The character class [^<]+ stops at the first <, so it never matches that cell. You need .*? instead, and lazy wildcards that span tags invite heavy backtracking and accidental over-matching on big inputs.
Entity encoding. Values containing ampersands are emitted as &amp;, and a regex hands you the encoded form unless you decode it yourself. Multi-byte values like Asia/東京 pass through fine in UTF-8 bytes, but the moment you normalize whitespace or slice substrings with substr() instead of mb_substr(), you're one bug away from corrupting a date.timezone value in a Japanese environment.
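The nested-tag failure is easy to reproduce. The sketch below runs the tempting regex and a DOM walk over the same open_basedir row from the sample above. The variable names are mine, and the row is wrapped in a bare <table> only so the fragment parses.

```php
<?php
declare(strict_types=1);

$row = '<tr><td class="e">open_basedir</td>'
     . '<td class="v"><i>no value</i></td><td class="v"><i>no value</i></td></tr>';

// The tempting regex: [^<]+ cannot cross the nested <i> tag, so nothing matches.
$pattern   = '/<td class="e">([^<]+)<\/td><td class="v">([^<]+)<\/td>/';
$regexHits = preg_match_all($pattern, $row, $matches);

// The DOM route: textContent flattens nested tags into plain text.
$doc = new DOMDocument();
$doc->loadHTML('<table>' . $row . '</table>', LIBXML_NOERROR | LIBXML_NOWARNING);

$cells = [];
foreach ($doc->getElementsByTagName('td') as $td) {
    $cells[] = trim($td->textContent);
}
```

Here $regexHits is 0, while $cells holds the key and both "no value" strings.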

DOMDocument handles all three for free. It's in the PHP stdlib, it's been stable for twenty years, and it gives you a proper tree you can walk with XPath. The only tricky bit is that phpinfo()'s HTML is technically XHTML 1.0 Transitional with a doctype, and DOMDocument::loadHTML used to choke on UTF-8 without a hint. The fix is a one-liner — prefix an XML declaration before loading:

$prev = libxml_use_internal_errors(true);
$doc = new \DOMDocument();
$wrapped = '<?xml encoding="UTF-8"?>' . $source;
$doc->loadHTML($wrapped, LIBXML_NOERROR | LIBXML_NOWARNING);
libxml_clear_errors();
libxml_use_internal_errors($prev);

The libxml_use_internal_errors dance is there because phpinfo()'s HTML is not strictly valid — it has unclosed <tr> elements in older PHP versions and a fistful of XHTML quirks. We're not trying to validate it; we just want to walk the tree. Suppress the warnings, extract what we need, move on.
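As a quick check that the encoding hint does its job, here is a small sketch that wraps the loading steps in a function and feeds it a multi-byte value and an encoded ampersand. loadPhpinfoHtml and the two sample rows are illustrative assumptions, not code from the tool.

```php
<?php
declare(strict_types=1);

/** Load phpinfo-style HTML as UTF-8, keeping libxml warnings out of the output. */
function loadPhpinfoHtml(string $source): DOMDocument
{
    $prev = libxml_use_internal_errors(true);
    $doc  = new DOMDocument();
    $doc->loadHTML('<?xml encoding="UTF-8"?>' . $source, LIBXML_NOERROR | LIBXML_NOWARNING);
    libxml_clear_errors();
    libxml_use_internal_errors($prev);
    return $doc;
}

// Sample rows made up for this sketch.
$html = '<table>'
      . '<tr><td class="e">date.timezone </td><td class="v">Asia/東京</td></tr>'
      . '<tr><td class="e">arg_separator.input</td><td class="v">&amp;</td></tr>'
      . '</table>';

$values = [];
foreach ((new DOMXPath(loadPhpinfoHtml($html)))->query('//tr') as $tr) {
    $tds = $tr->getElementsByTagName('td');
    $values[trim($tds->item(0)->textContent)] = trim($tds->item(1)->textContent);
}
```

Both the multi-byte timezone and the decoded ampersand come back intact in $values, with no manual entity handling.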

Walking the sections

With the DOM loaded, the parser does this:

$xpath = new \DOMXPath($doc);
$sections = [];

foreach ($xpath->query('//h2') as $h2) {
    $name = trim($h2->textContent);
    if ($name === '' || stripos($name, 'phpinfo') !== false) {
        continue;
    }
    $tables = $this->tablesBefore($h2, $xpath);
    $rows = [];
    foreach ($tables as $table) {
        foreach ($this->extractTableRows($table, $xpath) as $k => $v) {
            $rows[$k] = $v;
        }
    }
    if ($rows !== []) {
        $sections[$name] = $rows;
    }
}

tablesBefore is the method that fixed the Core-section bug. It walks forward from the <h2> until it hits the next <h2> or <h1>, collecting every <table> it sees along the way:

private function tablesBefore(\DOMNode $h2, \DOMXPath $xpath): array
{
    $tables = [];
    $node = $h2->nextSibling;
    while ($node !== null) {
        if ($node instanceof \DOMElement) {
            $name = strtolower($node->nodeName);
            if ($name === 'h2' || $name === 'h1') {
                break;
            }
            if ($name === 'table') {
                $tables[] = $node;
            }
        }
        $node = $node->nextSibling;
    }
    return $tables;
}
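To see the sibling walk in isolation, here is a standalone sketch of the same idea as a free function, run against a cut-down document where Core has two tables. The function name and the sample markup are mine.

```php
<?php
declare(strict_types=1);

/** Collect every <table> sibling after $heading, stopping at the next <h1> or <h2>. */
function tablesUntilNextHeading(DOMNode $heading): array
{
    $tables = [];
    for ($node = $heading->nextSibling; $node !== null; $node = $node->nextSibling) {
        if (!$node instanceof DOMElement) {
            continue; // skip whitespace text nodes between elements
        }
        $tag = strtolower($node->nodeName);
        if ($tag === 'h1' || $tag === 'h2') {
            break;
        }
        if ($tag === 'table') {
            $tables[] = $node;
        }
    }
    return $tables;
}

$html = <<<'HTML'
<div>
<h2>Core</h2>
<table><tr><td class="e">PHP Version </td><td class="v">8.2.30 </td></tr></table>
<table><tr><td class="e">memory_limit</td><td class="v">256M</td><td class="v">128M</td></tr></table>
<h2>date</h2>
<table><tr><td class="e">date/time support</td><td class="v">enabled</td></tr></table>
</div>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html, LIBXML_NOERROR | LIBXML_NOWARNING);

$counts = [];
foreach ((new DOMXPath($doc))->query('//h2') as $h2) {
    $counts[trim($h2->textContent)] = count(tablesUntilNextHeading($h2));
}
```

Core reports two tables and date reports one. A parser that only takes the first table after each heading would miss every Core directive.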

The row extractor handles both shapes uniformly:

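// Signature inferred from the call site in the section loop above.
private function extractTableRows(\DOMNode $table, \DOMXPath $xpath): array
{
    $out = [];
    foreach ($xpath->query('.//tr', $table) as $tr) {
        // Header rows use <th>, so they yield no <td> cells and are skipped.
        $tds = $xpath->query('./td', $tr);
        if ($tds->length < 2) continue;
        $key = $this->cleanCell($tds->item(0)->textContent);
        if ($key === '') continue;
        if ($tds->length >= 3) {
            // Directive row: keep local and master values side by side.
            $out[$key] = [
                'local'  => $this->cleanCell($tds->item(1)->textContent),
                'master' => $this->cleanCell($tds->item(2)->textContent),
            ];
        } else {
            $out[$key] = $this->cleanCell($tds->item(1)->textContent);
        }
    }
    return $out;
}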

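Two-column rows become key => string, three-column rows become key => {local, master}. Consumers can check is_array to tell them apart. jq users can write .sections.Core.memory_limit.local for directives and .sections.Core["PHP Version"] for plain properties. Indexing .local on a two-column row fails because the value is a string, so reach for jq's type filter when the shape isn't known in advance.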
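On the PHP side, a consumer that only cares about the value in effect can hide the shape difference behind one small function. This is a sketch; effectiveValue is a hypothetical helper and the sample data mirrors the Core excerpt above.

```php
<?php
declare(strict_types=1);

/** Return the value in effect: the local value for directives, the string itself otherwise. */
function effectiveValue(string|array $entry): string
{
    return is_array($entry) ? $entry['local'] : $entry;
}

$core = [
    'PHP Version'  => '8.2.30',                                // two-column row
    'memory_limit' => ['local' => '256M', 'master' => '128M'], // three-column row
];

// array_map over a single array keeps the string keys.
$effective = array_map(effectiveValue(...), $core);
```

$effective maps PHP Version to 8.2.30 and memory_limit to 256M, the local override.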

Handling "no value"

Here is cleanCell:

private function cleanCell(string $raw): string
{
    $s = str_replace(["\r\n", "\r"], "\n", $raw);
    $s = preg_replace('/[ \t]+/', ' ', $s) ?? $s;
    $lines = array_map('trim', explode("\n", $s));
    $lines = array_values(array_filter($lines, fn($l) => $l !== ''));
    $joined = implode("\n", $lines);
    if (strcasecmp($joined, 'no value') === 0) {
        return '';
    }
    return $joined;
}

Three things worth pointing out. First, newlines are preserved but runs of spaces and tabs collapse to single spaces. That's what you want for disable_functions, so the output reads as exec\nsystem\npassthru instead of a single run-on string. Second, "no value" normalizes to empty. I debated this, because keeping the literal string would let downstream tools distinguish "unset by the user" from "set to empty string", but in practice nothing cares, and the empty-string convention makes shell pipelines nicer: test -z works on the raw value, and in jq you can write .sections.Core.open_basedir.local | if . == "" then "unset" else . end. (jq's // operator only falls back on null and false, so it won't catch an empty string.) Third, textContent on the <td> already unwraps the <i>no value</i> tag for us, so we never have to special-case it at the DOM level. This is the whole advantage of working on a tree instead of a byte stream.
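A few inputs make the behavior concrete. The sketch below copies cleanCell as a free function so it runs on its own. The sample strings are made up to exercise each branch.

```php
<?php
declare(strict_types=1);

function cleanCell(string $raw): string
{
    $s = str_replace(["\r\n", "\r"], "\n", $raw);         // unify line endings
    $s = preg_replace('/[ \t]+/', ' ', $s) ?? $s;         // collapse spaces and tabs
    $lines = array_map('trim', explode("\n", $s));
    $lines = array_values(array_filter($lines, fn($l) => $l !== ''));
    $joined = implode("\n", $lines);
    return strcasecmp($joined, 'no value') === 0 ? '' : $joined;
}

$trailingSpace = cleanCell("8.2.30 ");
$multiLine     = cleanCell("exec\r\n   system\n\n\tpassthru");
$unset         = cleanCell(" No Value ");
```

The trailing space is trimmed, the three function names stay on separate lines, and the "no value" marker comes back as an empty string regardless of case or padding.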

The diff mode — the part I actually use

None of the above is why I bothered writing this. I bothered because I had two Docker containers that claimed to run the same PHP stack, and I couldn't tell what was different between them. Three Slack messages and two docker exec sessions later, I wanted a command I could point at two saved dumps and have it spit out the delta.

public function diff(array $a, array $b): array
{
    $flatA = $this->flatten($a['sections']);
    $flatB = $this->flatten($b['sections']);
    $added = []; $removed = []; $changed = [];

    foreach ($flatB as $key => $valB) {
        if (!array_key_exists($key, $flatA)) {
            $added[$key] = $valB;
        } elseif ($this->normalize($flatA[$key]) !== $this->normalize($valB)) {
            $changed[$key] = ['from' => $flatA[$key], 'to' => $valB];
        }
    }
    foreach ($flatA as $key => $valA) {
        if (!array_key_exists($key, $flatB)) {
            $removed[$key] = $valA;
        }
    }
    // Module deltas are a separate layer.
    $modules = [
        'added'   => array_values(array_diff($b['modules'], $a['modules'])),
        'removed' => array_values(array_diff($a['modules'], $b['modules'])),
    ];
    return compact('added', 'removed', 'changed', 'modules');
}

The whole thing is thirty lines. Flattening is section.key concatenation. Normalization exists only so that a two-column row comparing to a three-column row doesn't false-positive (different shapes → normalize to a stable string first). The real insight is the output shape — by separating modules.added/removed from the key-level added/removed/changed, the consumer can answer "did we gain or lose an extension?" in one query (jq .modules) without walking the directive-level diff at all. That's the query I run 90% of the time.
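The flatten and normalize helpers aren't shown above, so here is one way they might look. Both bodies are assumptions based on the description, in particular the choice to collapse a directive whose local and master values agree into a plain string. They are not the tool's actual code.

```php
<?php
declare(strict_types=1);

/** Turn ['Core' => ['memory_limit' => ...]] into ['Core.memory_limit' => ...]. */
function flatten(array $sections): array
{
    $flat = [];
    foreach ($sections as $section => $rows) {
        foreach ($rows as $key => $value) {
            $flat[$section . '.' . $key] = $value;
        }
    }
    return $flat;
}

/** Reduce either row shape to one comparable string (assumed format). */
function normalize(string|array $value): string
{
    if (!is_array($value)) {
        return $value;
    }
    return $value['local'] === $value['master']
        ? $value['local']
        : 'local=' . $value['local'] . ';master=' . $value['master'];
}

$prod = flatten(['Core' => [
    'memory_limit'   => ['local' => '256M', 'master' => '128M'],
    'display_errors' => 'Off',
]]);
$staging = flatten(['Core' => [
    'memory_limit'   => ['local' => '512M', 'master' => '128M'],
    'display_errors' => ['local' => 'Off', 'master' => 'Off'],
]]);

$changed = [];
foreach ($staging as $key => $to) {
    if (array_key_exists($key, $prod) && normalize($prod[$key]) !== normalize($to)) {
        $changed[$key] = ['from' => $prod[$key], 'to' => $to];
    }
}

// Module deltas use plain set difference on the name lists.
$modulesAdded   = array_values(array_diff(['Core', 'pdo_mysql', 'redis'], ['Core', 'opcache', 'pdo_mysql']));
$modulesRemoved = array_values(array_diff(['Core', 'opcache', 'pdo_mysql'], ['Core', 'pdo_mysql', 'redis']));
```

Only Core.memory_limit lands in $changed. display_errors has a different shape on each side but normalizes to the same string, so it is not reported.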

Tradeoffs I picked on purpose

A few things phpinfo-json does not do, and why:

It doesn't surface the full $_ENV by default. phpinfo() includes environment variables, and dumping them into JSON is a secret-leakage hazard waiting to happen. I leave the Environment and PHP Variables sections out of the module list specifically so people don't accidentally pipe AWS_SECRET_ACCESS_KEY into a logger. The sections are still in the JSON if you ask for them by name, but they're not in the default module listing.
The HTML format can drift between PHP minor versions. I tested against 8.2; 8.3 and 8.4 look the same so far, but if the PHP team decides to rewrite phpinfo's HTML output tomorrow I'm going to have a bad afternoon. The text format is more stable but lossier — you lose the local/master distinction sometimes, and you definitely lose multi-column tables where extensions have structured output.
No streaming. phpinfo() output is 72 KB on a typical PHP 8.2 install. That's tiny. I load the whole thing into a DOMDocument. If you have a five-megabyte phpinfo() dump, something is wrong with your PHP install, not with my parser.
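The first tradeoff, keeping Environment and PHP Variables out of the module listing, could be sketched as a simple name filter. How phpinfo-json actually derives its module list isn't shown here, so moduleNames and the constant below are assumptions for illustration.

```php
<?php
declare(strict_types=1);

// Assumption: sections that hold environment or request data rather than extensions.
const NON_MODULE_SECTIONS = ['Environment', 'PHP Variables'];

/** List section names as modules, leaving out the sections likely to hold secrets. */
function moduleNames(array $sections): array
{
    return array_values(array_diff(array_keys($sections), NON_MODULE_SECTIONS));
}

$sections = [
    'Core'          => ['memory_limit' => ['local' => '256M', 'master' => '128M']],
    'Environment'   => ['AWS_SECRET_ACCESS_KEY' => 'placeholder-not-a-real-key'],
    'PDO'           => ['PDO support' => 'enabled'],
    'PHP Variables' => ['$_SERVER[\'HOME\']' => '/root'],
];

$modules = moduleNames($sections);
```

The filtered sections stay in $sections for anyone who asks for them by name. They just never show up in $modules.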

Try it in 30 seconds

docker build -t phpinfo-json .

# Dump the running container's phpinfo as JSON
docker run --rm phpinfo-json | jq '.sections.Core.memory_limit'

# Just the loaded modules
docker run --rm phpinfo-json --only-modules | jq .

# Diff two environments
docker run --rm -v "$(pwd)":/work phpinfo-json /work/prod.html --diff /work/staging.html

The image is 51 MB on Alpine. Source is MIT, no Composer runtime deps, and the test suite has 47 PHPUnit cases covering the parser, differ, formatters, and CLI. If you've ever stared at a phpinfo() dump wondering which one of 300 directives changed, this is the tool I wish I'd had in 2019.

Built as entry #144 of SEN 合同会社's 100+ public projects — a deliberate program of shipping small, focused tools in public.
