Designing a Data Science oriented tabular data query DSL

#machinelearning #python #microservices

I have been working on a number of services written in JavaScript, Python and PHP that organise, orchestrate and visualise Machine Learning workloads. One of the most common artefacts that feed in and out of these services is tabular data, which may come in a number of forms such as:

Raw data from some kind of sensor.
Model training datasets.
Predictions from an inference pipeline.

There are a number of useful thing a user or back-end service might expect to be able to accomplish with these kind of tabular data files. A few examples might be:

A user may want to paginate though the first N rows of a large file, to preview rows.
A service may want to generate quality metrics from a series of predictions made by a machine learning model.
A service may want to split a dataset into two buckets for training and validation.

Since we're dealing with potentially multi-gigabyte files, it's impractical for clients and services to download, filter, then transfer large volumes of data over the network, so most services will reference file URIs contained in some mutually accessible store. When communicating with URIs the problem then becomes:

In addition to a URI, how do clients and services communicate which rows and columns within a data file are meaningful for a particular task?

Additional design goals were:

It needed to be fast, for large datasets.
Queries needed to be authored in PHP and JavaScript applications and then executed in a Python environment.
It needed a rich syntax, so we could effectively paginate, slice and select data for a variety of use cases.

With that in mind, I started to evaluate a number of projects that implemented some kind of query language that could be sent along with our data file URIs, a couple that came up:

CSVkit - Showed some promise, but the query features depended on importing the data into a SQLite database that locked up and fell over as soon as large datasets were involved.
q - Also had a SQLite dependency and failed to deal with large amounts of data.
Miller - Looked interesting but also seemed like a complex syntax and more focused on being CLI tool than integrating into a Python runtime.

One of the tools we were already using which was both extremely feature rich and worked with extremely large datasets was from the pandas module: pandas.DataFrame. I found that almost all our needs would be met if we could represent a series of operations against a DataFrame and pass those from service to service.

In it was settled that queries would be represented as a string containing JSON, when decoded would be an array of operations that could be executed against a DataFrame. Some examples of JSON strings and the corresponding operation

Select a column named "foo":

[
  {
    "method": "get",
    "args": ["foo"]
  }
]

df.get("foo")

Select the 5th numerically indexed column:

[
  {
    "index": "iloc",
    "keys": [":",5]
  }
]

df.iloc[:, 5]

Select column "bar", then select the first 10 rows:

[
  {
    "method": "get",
    "args": ["bar"]
  },
  {
    "method": "head",
    "args": [10]
  }
]

df.get("bar").head(10)

In the end, the code for translating these JSON strings into actual transformations on a DataFrame was quite simple:

import json


def apply_query(data_frame, query):
    parsed_query = json.loads(query)

    query_applied_object = data_frame

    for operation in parsed_query:
        if "method" in operation:
            query_applied_object = getattr(query_applied_object, operation["method"])(*operation["args"])

        if "index" in operation:
            index = getattr(query_applied_object, operation["index"])
            # Support representing a single colon string, ":" in the query object as a null slice,
            # this could be further extended to support more complex slice objects like "1:2" if
            # required in the future.
            keys = list(map(lambda accessor: slice(None) if accessor == ":" else accessor, operation["keys"]))
            query_applied_object = type(index).__getitem__(index, tuple(keys))

    return query_applied_object

With the format defined and our Python runtime recognising and executing it, it was then straightforward to integrate queries into our Symfony based PHP app as a value object:

<?php

declare(strict_types=1);

namespace App\Model\DataFile\Query;

/**
 * Represents a query against a data file.
 * 
 * Examples:
 * @code
 * Query::columnByName('foo');
 * Query::columnByIndex(3);
 * @endcode
 */
final class Query implements \Stringable {
    const NULL_SLICE = ':';

    private function __construct(private array $query) {
    }

    public static function columnByName(string $columnName): static {
        return new static([
            [
                'method' => 'get',
                'args' => [
                    $columnName,
                ],
            ],
        ]);
    }

    public static function columnByIndex(int $columnIndex): static {
        return new static([
            [
                'index' => 'iloc',
                'keys' => [
                    static::NULL_SLICE,
                    $columnIndex,
                ],
            ],
        ]);
    }

    public function __toString() {
        return json_encode($this->query);
    }

    public static function fromString(string $string): static {
        return new static(json_decode($string, true));
    }
}

DEV Community

Designing a Data Science oriented tabular data query DSL

Top comments (0)