A brief introduction to Piped Processing Language in Open Distro for Elasticsearch

#elasticsearch #database #datascience

In Open Distro for Elasticsearch 1.11.0, a new query language was introduced: Piped Processing Language (PPL). PPL provides a different way of thinking about data and complements the existing query languages (SQL and the Query DSL) in Open Distro.

The basis of PPL is the concept of pipes from UNIX. You take the output of one operation and feed it into another operation. Mentally, I think of this as a factory where a widget is manufactured step by step: the output of one machine just leads to the next machine.
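If you haven't used UNIX pipes before, here is a minimal shell sketch of the idea. The three comma-separated rows are made-up sample data for illustration, and `computers` is just a hypothetical helper name.

```shell
# Stage 1 emits rows of the form "name,CPU" (illustrative sample data).
computers() {
  printf '%s\n' 'Apple II,MOS6502' 'ZX Spectrum,Z80' 'Commodore PET,MOS6502'
}

# Stage 2 keeps only the matching rows; stage 3 keeps only the first column.
computers | grep 'MOS6502' | cut -d, -f1
```

Each stage only knows about the rows handed to it by the previous stage, which is the same mental model PPL borrows.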

Setup

First, let's add some documents to an index. There is nothing new here. I'm using cURL and the bulk API to add 4 documents with information about vintage computers (because why not?).

If you are familiar with Elasticsearch but not Open Distro, you might notice a few extra arguments on cURL. These are due to built-in security features of Open Distro for Elasticsearch. I'm running a Docker Open Distro cluster locally and out-of-the-box this comes with a self-signed cert, so I'm using -k to prevent peer verification. The other argument is --user as Open Distro has built-in fine-grained access control.
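As a rough sketch, the bulk request could look something like the following. The four documents are illustrative sample data that I picked to match the name and CPU fields used later, and the admin:admin login is an assumption based on the demo configuration; substitute your own credentials and cluster URL. The helper names `bulk_body` and `load_docs` are hypothetical.

```shell
# Assumptions: a local cluster URL and the demo admin:admin login.
ES_URL="https://localhost:9200"
ES_USER="admin:admin"

# Print the bulk payload: one action line followed by one document line,
# repeated per computer. These four documents are illustrative sample data.
bulk_body() {
  cat <<'EOF'
{"index": {"_index": "vin-computers"}}
{"name": "Apple II", "CPU": "MOS6502"}
{"index": {"_index": "vin-computers"}}
{"name": "Commodore PET", "CPU": "MOS6502"}
{"index": {"_index": "vin-computers"}}
{"name": "ZX Spectrum", "CPU": "Z80"}
{"index": {"_index": "vin-computers"}}
{"name": "TRS-80 Model I", "CPU": "Z80"}
EOF
}

# -k skips certificate verification (self-signed cert); --user sends basic auth.
# --data-binary keeps the newlines intact, which the bulk API requires.
load_docs() {
  bulk_body | curl -k --user "$ES_USER" \
    -H 'Content-Type: application/x-ndjson' \
    -X POST "$ES_URL/_bulk" --data-binary @-
}
```

Running `load_docs` against a running cluster sends the request; `bulk_body` on its own just prints the payload so you can inspect it first.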

Simple query

Now that we have a tiny data set, let's do a very basic query. This will pipe two operations together. The first operation sets the index with the 'source' command: source=vin-computers. Think of this as making the entire index available to the pipeline. Next, we take that entire index and remove everything but two fields, name and CPU, using the 'fields' command: fields name, CPU. These two operations are joined together by a pipe character, |.
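Put together, the request might look roughly like this sketch. It assumes the PPL endpoint lives at _opendistro/_ppl on a local cluster, that the request body is a JSON object with a single query field, and that the demo admin:admin login is in use. `ppl_body` and `run_ppl` are hypothetical helper names.

```shell
# Assumptions: local cluster URL and the demo admin:admin login.
PPL_URL="https://localhost:9200/_opendistro/_ppl"

# Wrap a PPL query in a JSON request body.
# Only safe for queries without double quotes or backslashes.
ppl_body() {
  printf '{"query": "%s"}\n' "$1"
}

# Send the query to the cluster (requires a running cluster).
run_ppl() {
  curl -k --user admin:admin -H 'Content-Type: application/json' \
    -X POST "$PPL_URL" -d "$(ppl_body "$1")"
}

# Print the request body for the two-command pipeline described above.
ppl_body 'source=vin-computers | fields name, CPU'
```

With a cluster running, `run_ppl 'source=vin-computers | fields name, CPU'` would send that same body.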

A tad more complexity

We can take our existing query and add a filtering clause through the where command. The command is followed by a boolean expression, in this case a comparison. The comparison is built with a field on the left and the value on the right, with an = between them.
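Here is a sketch of what that three-command query could look like as a request body. The command order (fields, then where) is inferred from the surrounding text. Because the comparison value is wrapped in double quotes, the query has to be escaped before it goes into JSON; `json_escape` and `ppl_body` are hypothetical helpers, and the JSON body shape with a single query field is an assumption.

```shell
# Escape backslashes and double quotes so the query is a valid JSON string.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Wrap the escaped PPL query in a JSON request body.
ppl_body() {
  printf '{"query": "%s"}\n' "$(json_escape "$1")"
}

# source -> fields -> where, joined by pipes.
QUERY='source=vin-computers | fields name,CPU | where CPU="MOS6502"'

# Print the body; pass it to curl with -d to run it against a cluster.
ppl_body "$QUERY"
```

The printed body can be sent with the same cURL arguments as any other request to the cluster (-k, --user, and a JSON content type).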

At this point, it looks a tad like SQL. PPL isn't, however, as structured as SQL. So, you can actually invert the order of the last two pipes and get the same result:

source=vin-computers | where CPU="MOS6502" | fields name,CPU

This doesn't execute exactly the same way, and I wouldn't claim it's especially efficient, but in an analytical situation it's often more about getting the result in a way that works for your thinking process than about staying within a particular performance envelope. If you want to understand what is going on behind the scenes, you can run the same query but append _explain to the endpoint (e.g. https://localhost:9200/_opendistro/_ppl/_explain).
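As a sketch, the explain call is the same request with a different URL. The JSON body shape and the admin:admin login are assumptions, and `explain_url` and `explain_ppl` are hypothetical helper names.

```shell
# Assumptions: local cluster URL and the demo admin:admin login.
PPL_URL="https://localhost:9200/_opendistro/_ppl"
BODY='{"query": "source=vin-computers | where CPU=\"MOS6502\" | fields name,CPU"}'

# The explain endpoint is the query endpoint with /_explain appended.
explain_url() {
  printf '%s/_explain\n' "$1"
}

# Same request as a normal query; only the URL changes.
explain_ppl() {
  curl -k --user admin:admin -H 'Content-Type: application/json' \
    -X POST "$(explain_url "$PPL_URL")" -d "$BODY"
}

# Print the URL that explain_ppl would call.
explain_url "$PPL_URL"
```

Calling `explain_ppl` against a running cluster should return the plan for the query rather than its results.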

Wrap up

This trivial example is probably not what PPL will be used for in the real world, but I hope it explains the basic mechanics of the query language. Knowing what you know now, imagine going from a very broad set of documents to narrower and narrower sets just by piping additional commands together. Attempting to build the same type of query with the Query DSL or SQL would probably lead to concentrating more on the syntax of the queries than on refining the result.

You can find out more over at the Open Distro documentation.
