Shalvah

Posted on Jul 30, 2021 • Edited on Aug 18, 2023 • Originally published at blog.shalvah.me

Implementing programmatic file transformations in PHP, Part 2

#algorithms #php

Hi! Still working on the upgrade tool from last time. As we saw, the current implementation works, but there are a few issues. There are three main ones I'll tackle here:

Namespace resolution. Imagine our user's old config has this:
```
return [
    'something' => [A\B::class],
];
```
And the sample new config has this:
```
use A\B;
return [
    'something' => [B::class],
];
```
The class A\B referenced in something is the same in both configs, but written in different ways. As it stands, when we try to check if any new items were added, we'll get a "yes", even though both classes are the same. We'll fix things so both nodes resolve to the same value (A\B) in our AST.
Printing. The default printing style isn't the best. My main ick is that long arrays are printed on a single line. Let's change this.
Our tests. When we wrote them, we were loading the config files via require. We need to update them to match the parsing-into-AST approach.

If you prefer, you can head to the GitHub repo instead, where I've already addressed these issues. This article just explains my process.

Okay, let's go. We'll start with namespace resolution.

Traversing the AST

When you work with an AST, you typically want to traverse (or walk) it. Remember that an AST is a tree: one node can contain other nodes, which can contains more nodes, and so on. In HTML, a <div> node can contain an <img> and a <span?. In PHP, a statement, $x = 2 + 5 can contain an assignment ($x =) and an expression (2 + 5), which itself contains two literal nodes (2, 5), and an operator (+). Traversing an AST allows you to visit all the nodes of the tree and modify it or extract some information.

Most parsers provide some way for you to easily traverse the generated AST and perform some action against each node of a certain type. For example, here's how you can convert all bold nodes in a Markdown file to italics with remark, which uses unified (docs):

const visit = require("unist-util-visit");
visit(tree, "strong", (node) => {
    node.type = "emphasis"
});

See it in action on astexplorer.net.

Here's an example of converting var x = 2 + 5 to let x = 2 + 5 using a Babel plugin:

export default function (babel) {
  const { types: t } = babel;

  return {
    visitor: {
      VariableDeclaration(path) {
        if (path.node.kind === "var") {
          path.node.kind = "let";
        }
      }
    }
  };
}

See it in action on astexplorer.net.

Get the idea? We can use a visitor to walk our config files' ASTs and replace any ::class constants with the full names. In this particular case, we don't need to write our own visitor. PHP-Parser comes with a NameResolver visitor, which does this for us. Using this visitor, we can ensure that both files resolve to the same thing, "A\B".

So let's rewrite our code to apply this name resolution before we begin any comparisons with the ASTs.

class Upgrader
{
    protected function parseFile(string $filePath): array
    {
        $sourceCode = file_get_contents($filePath);
        $parser = (new ParserFactory)->create(ParserFactory::PREFER_PHP7);
        $ast = $parser->parse($sourceCode);
        $traverser = new NodeTraverser();
        $traverser->addVisitor(new NodeVisitor\NameResolver(null, [
            // We want to still keep the original text
            'preserveOriginalNames' => true
        ]));
        return $traverser->traverse($ast);
    }

    protected function getUserOldConfigFileAsAst(): array
    {
        return $this->parseFile($this->configFiles['user_old']);
    }

    protected function getSampleNewConfigFileAsAst(): array
    {
        return $this->parseFile($this->configFiles['sample_new']);
    }
}

Cool. Let's test this out. We'll run a simple dry run, and dump the changes:

$changes = Upgrader::ofConfigFile('test_config.php', 'test_config_sample.php')
    ->dryRun();;
ray($changes);

Before (no name resolution):

After (name resolution):

So it works! It correctly shows no changes.

Reversing the change with a custom visitor

But there's more work to do. If we leave this as it is, when we eventually write the upgraded config file back, we'd have replaced all the ::class constants with the resolved full names, which works, but isn't ideal. Ideally, we should keep things as the user wrote it.

We can do this with another visitor, that traverses the upgraded AST before we print it. This time, we'll have to implement a custom visitor. With PHP-Parser's NodeVisitor, you implement a number of methods and manually filter by node type. In our case, we filter for instances of FullyQualified and return the originalName attribute (which we have, since we set preserveOriginalNames to true). We return this from leaveNode(), and it will replace the class name node with the one we returned.

class UnresolveNamespaces implements PhpParser\NodeVisitor
{
    public function leaveNode(PhpParser\Node $node) {
        if ($node instanceof PhpParser\Node\Name\FullyQualified) {
            // Return the original node if it exists
            return $node->getAttribute('originalName', default: $node);
        }
    }

    public function beforeTraverse(array $nodes) {}

    public function enterNode(Node $node) {}

    public function afterTraverse(array $nodes) {}
}

class Upgrader
{
    // ...

    protected function writeNewConfigFile(array $ast)
    {
        $traverser = new NodeTraverser();
        $traverser->addVisitor(new UnresolveNamespaces);
        $ast = $traverser->traverse($ast);

        $prettyPrinter = new PrettyPrinter\Standard(['shortArraySyntax' => true]);
        $astAsText = $prettyPrinter->prettyPrintFile($ast);

        rename($configFilePath, "$configFilePath.bak");
        file_put_contents($configFilePath, $astAsText);
    }
}

Synchronising imports

One more thing. We need to bring in any new use statements. Imagine the new config file adds a new class C\D, but writes it this way:

use C\D;

return [
    'some_list' => [
        D::class,
    ],
];

Now that we're "unresolving" the names before printing, this new item will get added to the upgraded config as D::class, but without the extra import of C\D, leading to a bug. We can fix this by checking for any new use statements in the sample config file and copying them into the upgraded config before we print:

    protected function cleanUpAstForPrinting(array $ast): array
    {
        // First, unresolve namespaces
        $traverser = new NodeTraverser();
        $traverser->addVisitor(new UnresolveNamespaces);
        $ast = $traverser->traverse($ast);

        // Then find any new use statements and add them to the AST
        $sampleConfigAst = $this->getSampleNewConfigFileAsAst();
        $newUseStatements = Arr::where(
            $sampleConfigAst, fn(Node $node) => $node instanceof Stmt\Use_
        );
        $alreadyPresentUseStatements = Arr::where(
            $ast, fn(Node $node) => $node instanceof Stmt\Use_
        );
        $newUseStatements = $this->subtractOtherListFromList($newUseStatements, $alreadyPresentUseStatements);
        foreach ($newUseStatements as $newUseStatement) {
            array_unshift($ast, $newUseStatement['ast']);
        }
        return $ast;
    }

Here, we've created a new method, cleanUpAstForPrinting(), which handles the unresolving and syncing of use statements. The actual implementation is pretty easy: we find all instances of PhpParser\Node\Stmt\Use_ in both files, then we use our subtractOtherListFromList() method (which we implemented last time) to find the ones that are in the sample new config but not in the one we're about to print. If we find any, we add them to the AST with array_unshift(), and we're ready to go.

Awesome. We've achieved three things:

Resolving the classes to their full names before comparing
"Unresolving" them back to the way they were written orginally before printing out.
Including any new use statements.

Let's move on to the second problem: the printing. Can we fix that?

Printing with better formatting

One option I considered was using PHP-CS-Fixer to reformat the code after printing, but that proved daunting because PHP-CS-Fixer doesn't expose an API to build on.

Luckily, PHP-Parser includes an alternative printer, which attempts to preserve the original format of the file when printing.

To use the printer, we have to change how we parse the user's config file. Rather than simply calling (new ParserFactory)->create(ParserFactory::PREFER_PHP7)->parse($sourceCode);, now we have to do some more work:

class Upgrader
{
    protected array $originalAst;
    protected array $originalTokens;

    // ...

    protected function getUserOldConfigFileAsAst(): array
    {
        $sourceCode = file_get_contents($this->configFiles['user_old']);

        // 1. Create a new tokenizer (aka "lexer", the component that
        // splits the code into tokens for the parser). 
        // This lexer records extra information like where each token starts and ends.
        $lexer = new PhpParser\Lexer\Emulative([
            'usedAttributes' => [
                'comments', 'startLine', 'endLine', 'startTokenPos', 'endTokenPos',
            ],
        ]);

        // 2. Make sure our parser uses the lexer.
        $parser = new Parser\Php7($lexer);

        // 3. Store the tokens returned by our tokenizer and the AST returned by our parser,
        // so the printer can use them for comparisons later.
        $this->originalAst = $parser->parse($sourceCode);
        $this->originalTokens = $lexer->getTokens();

        // 4. Make a copy of all the nodes using the CloningVisitor, 
        // so any modifications we make won't affect the original nodes.
        $traverser = new NodeTraverser();
        $traverser->addVisitor(new NodeVisitor\CloningVisitor());
        // PS: we still have to do our namespace resolution
        $traverser->addVisitor(new NodeVisitor\NameResolver(null, [
             'preserveOriginalNames' => true
        ]));
        $this->userOldConfigFileAst = $traverser->traverse($this->originalAst);

        return $this->userOldConfigFileAst;
    }
}

And now, when we're printing, we pass in the original AST and tokens:

    protected function writeNewConfigFile(array $ast)
    {
        $ast = $this->cleanUpAstForPrinting($ast);

        $prettyPrinter = new PrettyPrinter\Standard(['shortArraySyntax' => true]);
        $astAsText = $prettyPrinter->printFormatPreserving($ast, $this->originalAst, $this->originalTokens);

        rename($configFilePath, "$configFilePath.bak");
        file_put_contents($configFilePath, $astAsText);
    }

Let's try...

Before (default printer):

After (format-preserving printer):

And it works! When we upgrade our config this time, we can see that the strategies array is printed over multiple lines, as we have in our original user config.

The format-preserving printer is experimental, so it probably won't be perfect in all cases, but I think it's good enough to go with.

Finally, let's update our tests.

Reworking our tests

In the first iteration of the upgrader, our dry run tests made use of a mockUpgraderWithConfigs() function that mocked the getConfigs() function to return specific configs. Now that we've switched to parsing code into ASTs, we can't use that anymore. I decided to go with the obvious file-based approach: I'll move the configs to files, and pass the file path to the upgrader, the way it was meant to be used. No need to mock anything. To make things easier, I'll combine all the old configs into one, but separate the new configs by test case.

For example, let's do the test for detecting added/removed map items. First, we create the old config file (which all the tests will use), which goes in tests/fixtures/old.php ("fixtures" is a term used to refer to static data used for testing):

<?php

return [
    'map' => [
        'key_1' => 1,
        'key_2' => getenv('2'),
    ],
    'nested' => [
        'map' => [
            'baa' => 'baa',
            'black' => 'sheep',
            'and' => 'more',
        ],
        'list' => []
    ],
    'list' => [
        'baa',
        'baa',
        \Exception::class,
    ],
    'thing' => false,
    'other_thing' => rand(0, 1),
];

Then the sample new config for each test case goes in tests/fixtures/new_samples. For this case, we'll create added_removed_map.php. It should be nearly the same as the old.php, with the difference being the change we plan to test (in this case, adding and removing some map items):

<?php

return [
    'map' => [
        // 'key_1' removed
        'key_2' => getenv('2'),
        'key_3' => 'added', // Added
    ],
    'nested' => [
        'map' => [
            'baa' => 'baa',
            'black' => 'sheep',
            // 'and' removed
        ],
        'list' => []
    ],
    // Remaining items are same as old config...
];

And now we update the test:

$userOldConfig = __DIR__ . '/fixtures/old.php';

it("can detect items added to/removed from maps", function () use ($userOldConfig) {
    $sampleNewConfig = __DIR__ . '/fixtures/new_samples/added_removed_map.php';
    $changes = Upgrader::ofConfigFile($userOldConfig, $sampleNewConfig)->dryRun();

    expect($changes)->toHaveCount(3);
    $this->assertArraySubset([
        [
            'type' => Upgrader::CHANGE_ADDED,
            'key' => 'map.key_3',
        ],
        [
            'type' => Upgrader::CHANGE_REMOVED,
            'key' => 'map.key_1',
        ],
        [
            'type' => Upgrader::CHANGE_REMOVED,
            'key' => 'nested.map.and',
        ],
    ], $changes);
});

Cool. Let's do the test for moved items. We reuse the same old.php file as the user's old config, and we create a new file for the sample new config (test/fixtures/new_samples/moved.php):

<?php

return [
    // Other keys same as the old...
    'thing' => false,
    'new_other_thing' => 'default', // Replaces `other_thing`
];

And the test:

it("handles move()d items properly", function () use ($userOldConfig) {
    $sampleNewConfig = __DIR__ . '/fixtures/new_samples/moved.php';
    $changes = Upgrader::ofConfigFile($userOldConfig, $sampleNewConfig)
        ->move('other_thing', 'new_other_thing')->dryRun();

    expect($changes)->toHaveCount(2);
    $this->assertArraySubset([
        [
            'type' => Upgrader::CHANGE_ADDED,
            'key' => 'new_other_thing',
        ],
        [
            'type' => Upgrader::CHANGE_MOVED,
            'key' => 'other_thing',
            'new_key' => 'new_other_thing',
        ],
    ], $changes);
});

All good. The rest of the tests follow this pattern, and you can see them on GitHub.

Adding some more tests

We have tests for dry run, now we need some tests for the actual upgrade. I'll add just one test for that, a sort of "God" test that combines the multiple types of changes into one config file, and tests that it is upgraded correctly. Now, I could split them into multiple small tests and files like I did for dry run, but I think that's unnecessary, since dry run and apply follow the same code path. Okay, on to the test.

First, the fixtures. We'll use the same old.php as the user's initial config, but this time we'll create a new sample that combines all the different kinds of changes (new_samples/all.php):

<?php

return [
    'map' => [
        // `key_1` removed
        'key_2' => getenv('2'),
        'key_3' => 'added', // Added
    ],
    'nested' => [
        'map' => [
            'baa' => 'baa',
            'black' => 'sheep',
            // `and` removed
        ],
        'list' => [
            'new_item_dont_touch' // Added, but should be ignored
        ]
    ],
    'list' => [
        'baa',
        'baa',
        'black',
        \Exception::class,
    ],
    'thing' => false,
    'new_other_thing' => 'default', // Replaces `other_thing`
];

Next, an expected.php that contains the final config file we expect (when the changes from all.php are applied to old.php):

<?php

return [
    'map' => [
        'key_2' => getenv('2'),
        'key_3' => 'added',
    ],
    'nested' => [
        'map' => [
            'baa' => 'baa',
            'black' => 'sheep',
        ],
        'list' => []
    ],
    'list' => [
        'baa',
        'baa',
        \Exception::class,
        'black',
    ],
    'thing' => false,
    'new_other_thing' => rand(0, 1),
];

And, finally, the test.

$userOldConfig = __DIR__ . '/fixtures/old.php';

it("works as expected", function () use ($userOldConfig) {
    $configFile = sys_get_temp_dir().'/config.php';
    copy($userOldConfig, $configFile);

    $sampleNewConfig = __DIR__ . '/fixtures/new_samples/all.php';
    Upgrader::ofConfigFile($configFile, $sampleNewConfig)
        ->move('other_thing', 'new_other_thing')
        ->dontTouch('nested.list')
        ->upgrade();

    expect(sys_get_temp_dir().'/config.php.bak')->toBeFile();
    expect(file_get_contents(sys_get_temp_dir().'/config.php.bak'))
        ->toEqual(file_get_contents($userOldConfig));

    $upgradedConfig = file_get_contents($configFile);
    $expectedConfig = file_get_contents(__DIR__ . '/fixtures/expected.php');

    $upgradedConfig = str_replace("\r\n", "\n", $upgradedConfig);
    $expectedConfig = str_replace("\r\n", "\n", $expectedConfig);
    expect($upgradedConfig)->toEqual($expectedConfig);
});

First, we copy the old.php fixture to a temporary location so we don't overwrite it during the upgrade (since other tests make use of it too). Then we run the upgrade. After the upgrade, we make our assertions:

We check that the old config was backed up correctly
Then we test that the new config matches what we expect. Note that we have to replace any uses of \r\n (Windows newlines) with \n (newlines used in the generated code), so we don't get any false mismatches.

And we're good!

> pest --stop-on-failure --coverage

   PASS  Tests\applytest
  ✓ it works as expected

   PASS  Tests\dryruntest
  ✓ it can detect items added to/removed from maps
  ✓ it ignores maps marked as dontTouch()
  ✓ it can detect items added to lists
  ✓ it ignores lists marked as dontTouch()
  ✓ it handles move()d items properly
  ✓ it properly handles class name constants written differently

  Tests:  7 passed
  Time:   0.61s

  Cov:    100.00%

As I've said earlier, testing is an inexact science; there's no formula for what or how to test. A lot of it is down to your judgment. Frankly, I'm not entirely satisfied with this test suite, but I think it does the job for now, so I'll move on, and add more tests as I encounter more real-world scenarios.