Byron Ruth

Posted on Dec 31, 2018 • Edited on Jan 27, 2019

The semantics of Elasticsearch’s nested datatype and query operator

#elasticsearch #sql #semantics

The basic question is whether predicates in a nested query operator considers one document at a time (it should according to the docs) and more importantly whether multiple conjunctive (AND) or disjunctive (OR) operations are evaluated and composed on independent records.

Here is the mapping we will test. Let’s assume the index is called shapes. We are only concerned about the points that make up the shape so other properties are not included in this example.

{
  "mappings": {
    "_doc": {
      "properties": {
        "points": {
          "type": "nested",
          "properties": {
            "x": {
              "type": "integer"
            },
            "y": {
              "type": "integer"
            }
          }
        }
      }
    }
  }
}

Here are a few shape records we will use to test.

{"points": [
  {"x": 1, "y": 1},
  {"x": 1, "y": 2},
  {"x": 2, "y": 1},
  {"x": 2, "y": 2}
]}
{"points": [
  {"x": 1, "y": 1},
  {"x": 1, "y": 2}
]}
{"points": [
  {"x": 1, "y": 1},
  {"x": 1, "y": 2},
  {"x": 2, "y": 1}
]}

To re-iterate, the question is whether nested queries (which points is a nested datatype in this case) considers all sub-records in an array.

To get started, consider the following query:

{
  "query": {
    "nested": {
      "path": "points",
      "query": {
        "bool": {
          "must": [
            {"term": {"points.x": 2}}
          ]
        }
      }
    }
  }
}

This translates to "find me all the shapes having at least one point where x = 2". In the records above, the second shape record does not have a point where x = 2 and Elasticsearch (ES) correctly excludes this from the result.

The second test added a requirement that the point must also have y = 2 rather than any value.

{
  "query": {
    "nested": {
      "path": "points",
      "query": {
        "bool": {
          "must": [
            {"term": {"points.x": 2}},
            {"term": {"points.y": 2}}
          ]
        }
      }
    }
  }
}

The first shape satisfies this requirement and ES correctly returns only this record.

Now we will try a seemingly, subtly different query. The must terms will be put in separate nested expressions.

{
  "query": {
    "bool": {
      "must": [
        {
          "nested": {
            "path": "points",
            "query": {
              "term": {"points.x": 2}
            }
          }
        },
        {
          "nested": {
            "path": "points",
            "query": {
              "term": {"points.y": 2}
            }
          }
        }        
      ]
    }
  }
}

The distinction here is that each record is evaluated against each nested operand independently whereas the above query required a single point to satisfy both conditions.

We can further confirm this independent behavior through the following query which requests shapes "having a point x = 1, y = 1 and at least one point where x = 2, but not the point x = 2, y = 2".

{
  "query": {
    "bool": {
      "must": [
        {
          "nested": {
            "path": "points",
            "query": {
              "bool": {
                "must": [
                  {"term": {"points.x": 1}},
                  {"term": {"points.y": 1}}
                ]
              }
            }
          }
        },
        {
          "nested": {
            "path": "points",
            "query": {
              "bool": {
                "must": [
                  {"term": {"points.x": 2}}
                ]
              }
            }
          }
        }
      ],
      "must_not": {
        "nested": {
          "path": "points",
          "query": {
            "bool": {
              "must": [
                {"term": {"points.x": 2}},
                {"term": {"points.y": 2}}
              ]
            }
          }
        }
      }
    }
  }
}

Note the last two nested operands overlap in that both will match on points where x = 2. All three shapes satisfy the first must condition, but only shapes 1 and 3 satisfy the second must condition. The must_not condition excludes shape 1 since it contains the point x = 2, y = 2. Therefore the ES result only contains shape 3.

Comparison with SQL

For those familiar with relational databases and how these kinds of queries are expressed in SQL, conditioning on more than one record of a one-to-many relationship is not intuitive at first, but is easy to learn and becomes obvious.

If we were to model this data in a relational form, we would have the following tables:

create table "shapes" (
  id integer primary key
);

create table "points" (
  shape_id integer references "shapes" ("id"),
  x integer,
  y integer
);

Since a shape can have many points, it needs to be modeled as a relation that references the shape record the point pertains to (this could be further normalized to not duplicate any unique point, but this optimization is not important to this post).

The insert statements are as follows which utilizes Postgres' multi-row VALUES syntax.

insert into "shapes" (id) values
  (1), (2), (3);

insert into "points" (shape_id, x, y) values
  (1, 1, 1), (1, 1, 2), (1, 2, 1), (1, 2, 2),
  (2, 1, 1), (2, 1, 2),
  (3, 1, 1), (3, 1, 2), (3, 2, 1);

If we were to express the first ES query above in SQL, we would write:

select
  shapes.id
from
  shapes
  inner join points
    on shapes.id = points.shape_id
where
  points.x = 2

This would return shapes 1 and 3 as ES did

The second query is equally simple by adding the condition on y for the same point.

select
  shapes.id
from
  shapes
  inner join points
    on shapes.id = points.shape_id
where
  points.x = 2 and
  points.y = 2

What about the third query? As a reminder, we split these two conditions into separate ES nested queries which, based on the documented semantics, should operate independently across records. Modeling this in SQL requires expressing independent relations between shapes and points.

select
  shapes.id
from
  shapes
  inner join points p1
    on shapes.id = p1.shape_id
  inner join points p2
    on shapes.id = p2.shape_id
where
  p1.x = 2 and
  p2.y = 2

The difference here is that each condition (in the where clause) applies to a different points relation and thus does not require a single record in either relation to satisfy both conditions, but rather at least one record in both relations to satisfy one of the conditions.

What about the third query?

select
  shapes.id
from
  shapes
  inner join points p1
    on shapes.id = p1.shape_id
  inner join points p2
    on shapes.id = p2.shape_id
  inner join points p3
    on shapes.id = p3.shape_id
where
  p1.x = 1 and p1.y = 1 and
  p2.y = 2 and
  not (p3.x = 2 and p3.y = 2)

Another join is made to the points table yielding a new relation, but again this means that any conditions on this relation apply independently from the other joined relations. In this case, the third relation asserts that a shape does not contain a point x = 2, y = 2.

Given this comparison we can assert that conditions in a nested operand operate over the collection of records independently from other nested operands and thus are semantically equivalent to independent joins as represented in the SQL statement above.

DEV Community

The semantics of Elasticsearch’s nested datatype and query operator

Comparison with SQL

Top comments (0)