Manipulating Data with PHP: performing ETL operations

#php #etl

Introduction

Data manipulation is one of the basic and repetitive operation we perform as a developer. Sometimes we also want to process specific inputs to get a specific result.

This kind of operation is called ETL - Extract, Transform, Load. In my job we have been working recently on a big migration project, and we had to use a third-party library, on a small basis at the beginning, but in the end we choosed to fork it and to make it to evolve, and used it as the core of our migration. Below is a quick tour of how this small library can help you. You can find it here and simply install it via composer :

composer require wizaplace/php-etl

Uses cases

Generic

You can use this library in two main cases :

first one, you simply want to extract data from one or various sources: CSV, JSON... in order to use them directly
second one, you need to realize a complete process of data manipulation: extracting and transforming them before to use or save them in a different format.

Our use case

We used this library as a tool in the context of a major migration for one of our customers. We had to extract data of different formats (CSV, JSON...) from various sources (files, API...) in order to process and aggregate them before injecting them in the new application of our customer.

The library

At its origin the library was developed by Leonardo Marquine. We started by using this version, but since we needed to add extra components and because it seemed that there was no more activity on the original repository, we decided to fork it and continue in our own way. We cleaned a lot of stuff, like dropping the support for PHP 5.x, added some new components, you can check our changelog here.

After this extensive work, we are still maintaining it: we still have some projects for it (new classes for instances) but so far we did not plan a major refit, we currently do not consider to rewrite it since it does the job well in its current version.

The utilization is quite straightforward: you need a instance of the ETL class, which orchestrates everything else (extraction and other optional operations like transformation and persistence).

In this very short and simple example, I extract lines from a CSV, with filters, the iterator returning an object, one for each line of the CSV:

        $inputFile = 'assets/tuto/customers.csv';

        return $this->csvEtl
            ->extract(
                $this->csvExtractor,
                $inputFile,
                [
                    'throwError' => true,
                    'delimiter' => ';',
                    'columns' => ['id', 'email']
                ]
            )
            ->toIterator();

In this other simple example, I extract data from a CSV and transform them before saving them in another CSV.

        $inputCsvFile = 'assets/tuto/customers.csv';
        $outputCsvFile = 'assets/tuto/output.csv';

        $this->etl
            ->extract(
                $this->csvExtractor,
                $inputCsvFile,
                [
                    'throwError' => true,
                    'delimiter' => ';',
                    'columns' => ['id', 'email']
                ]
            )
            ->transform($this->renameColumns, ['columns' => ['email' => 'courriel']])
            ->load($this->csvLoader, $outputCsvFile)
            ->run();

In a few lines of code, you can perform some very specific operations, with a lot of built-in features. On top of that you can also easily integrate your own classes in the pipeline, for instance your own transformer, or even aggregate data from different sources

Conclusion

It is a basic and light library. Is still contains old way to proceed (legacy from PHP5 compatibility) but it does the job, and thanks to the work done by the authors of the original version, we did not have any issue to make it evolve and integrate new components.

Obviously you cannot compare it with paying, high-end, ETL solutions, but for most of operations, it will be enough.