# The semantics of Elasticsearch’s nested datatype and query operator

###
Byron Ruth
Dec 31 '18
*Updated on Jan 27, 2019 *
・4 min read

The basic question is whether predicates in a `nested`

query operator considers one document at a time (it should according to the docs) and more importantly whether multiple conjunctive (AND) or disjunctive (OR) operations are evaluated and composed on independent records.

Here is the mapping we will test. Let’s assume the index is called `shapes`

. We are only concerned about the points that make up the shape so other properties are not included in this example.

```
{
"mappings": {
"_doc": {
"properties": {
"points": {
"type": "nested",
"properties": {
"x": {
"type": "integer"
},
"y": {
"type": "integer"
}
}
}
}
}
}
}
```

Here are a few shape records we will use to test.

```
{"points": [
{"x": 1, "y": 1},
{"x": 1, "y": 2},
{"x": 2, "y": 1},
{"x": 2, "y": 2}
]}
{"points": [
{"x": 1, "y": 1},
{"x": 1, "y": 2}
]}
{"points": [
{"x": 1, "y": 1},
{"x": 1, "y": 2},
{"x": 2, "y": 1}
]}
```

To re-iterate, the question is whether `nested`

queries (which `points`

is a nested datatype in this case) considers *all* sub-records in an array.

To get started, consider the following query:

```
{
"query": {
"nested": {
"path": "points",
"query": {
"bool": {
"must": [
{"term": {"points.x": 2}}
]
}
}
}
}
}
```

This translates to "find me all the shapes having at least one point where `x = 2`

". In the records above, the second shape record does not have a point where `x = 2`

and Elasticsearch (ES) correctly excludes this from the result.

The second test added a requirement that the point must *also* have `y = 2`

rather than any value.

```
{
"query": {
"nested": {
"path": "points",
"query": {
"bool": {
"must": [
{"term": {"points.x": 2}},
{"term": {"points.y": 2}}
]
}
}
}
}
}
```

The first shape satisfies this requirement and ES correctly returns only this record.

Now we will try a seemingly, subtly different query. The `must`

terms will be put in separate `nested`

expressions.

```
{
"query": {
"bool": {
"must": [
{
"nested": {
"path": "points",
"query": {
"term": {"points.x": 2}
}
}
},
{
"nested": {
"path": "points",
"query": {
"term": {"points.y": 2}
}
}
}
]
}
}
}
```

The distinction here is that each record is evaluated against each `nested`

operand *independently* whereas the above query required a single point to satisfy both conditions.

We can further confirm this *independent* behavior through the following query which requests shapes "having a point `x = 1, y = 1`

and at least one point where `x = 2`

, but not the point `x = 2, y = 2`

".

```
{
"query": {
"bool": {
"must": [
{
"nested": {
"path": "points",
"query": {
"bool": {
"must": [
{"term": {"points.x": 1}},
{"term": {"points.y": 1}}
]
}
}
}
},
{
"nested": {
"path": "points",
"query": {
"bool": {
"must": [
{"term": {"points.x": 2}}
]
}
}
}
}
],
"must_not": {
"nested": {
"path": "points",
"query": {
"bool": {
"must": [
{"term": {"points.x": 2}},
{"term": {"points.y": 2}}
]
}
}
}
}
}
}
}
```

Note the last two `nested`

operands overlap in that both will match on points where `x = 2`

. All three shapes satisfy the first `must`

condition, but only shapes 1 and 3 satisfy the second `must`

condition. The `must_not`

condition excludes shape 1 since it contains the point `x = 2, y = 2`

. Therefore the ES result only contains shape 3.

## Comparison with SQL

For those familiar with relational databases and how these kinds of queries are expressed in SQL, conditioning on more than one record of a one-to-many relationship is not intuitive at first, but is easy to learn and becomes obvious.

If we were to model this data in a relational form, we would have the following tables:

```
create table "shapes" (
id integer primary key
);
create table "points" (
shape_id integer references "shapes" ("id"),
x integer,
y integer
);
```

Since a shape can have many points, it needs to be modeled as a relation that references the shape record the point pertains to (this could be further normalized to not duplicate any unique point, but this optimization is not important to this post).

The insert statements are as follows which utilizes Postgres' multi-row `VALUES`

syntax.

```
insert into "shapes" (id) values
(1), (2), (3);
insert into "points" (shape_id, x, y) values
(1, 1, 1), (1, 1, 2), (1, 2, 1), (1, 2, 2),
(2, 1, 1), (2, 1, 2),
(3, 1, 1), (3, 1, 2), (3, 2, 1);
```

If we were to express the first ES query above in SQL, we would write:

```
select
shapes.id
from
shapes
inner join points
on shapes.id = points.shape_id
where
points.x = 2
```

This would return shapes 1 and 3 as ES did

The second query is equally simple by adding the condition on `y`

for the same point.

```
select
shapes.id
from
shapes
inner join points
on shapes.id = points.shape_id
where
points.x = 2 and
points.y = 2
```

What about the third query? As a reminder, we split these two conditions into separate ES `nested`

queries which, based on the documented semantics, should operate independently across records. Modeling this in SQL requires expressing independent relations between shapes and points.

```
select
shapes.id
from
shapes
inner join points p1
on shapes.id = p1.shape_id
inner join points p2
on shapes.id = p2.shape_id
where
p1.x = 2 and
p2.y = 2
```

The difference here is that each condition (in the `where`

clause) applies to a different `points`

relation and thus does not require a single record in either relation to satisfy both conditions, but rather at least one record in both relations to satisfy one of the conditions.

What about the third query?

```
select
shapes.id
from
shapes
inner join points p1
on shapes.id = p1.shape_id
inner join points p2
on shapes.id = p2.shape_id
inner join points p3
on shapes.id = p3.shape_id
where
p1.x = 1 and p1.y = 1 and
p2.y = 2 and
not (p3.x = 2 and p3.y = 2)
```

Another join is made to the `points`

table yielding a new relation, but again this means that any conditions on this relation apply independently from the other joined relations. In this case, the third relation asserts that a shape does *not* contain a point `x = 2, y = 2`

.

Given this comparison we can assert that conditions in a `nested`

operand operate over the collection of records independently from other `nested`

operands and thus are semantically equivalent to independent joins as represented in the SQL statement above.