<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Alexey Goncharuk</title>
    <description>The latest articles on DEV Community by Alexey Goncharuk (@agoncharuk).</description>
    <link>https://dev.to/agoncharuk</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F636700%2F3a5ee135-aeb7-499d-bec7-ded6c6ed6dcb.jpeg</url>
      <title>DEV Community: Alexey Goncharuk</title>
      <link>https://dev.to/agoncharuk</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/agoncharuk"/>
    <language>en</language>
    <item>
      <title>Introduction to the Join Ordering Problem</title>
      <dc:creator>Alexey Goncharuk</dc:creator>
      <pubDate>Sun, 26 Sep 2021 13:49:00 +0000</pubDate>
      <link>https://dev.to/agoncharuk/introduction-to-the-join-ordering-problem-3kdh</link>
      <guid>https://dev.to/agoncharuk/introduction-to-the-join-ordering-problem-3kdh</guid>
      <description>&lt;p&gt;A typical database may execute an SQL query in multiple ways, depending on the selected operators' order and algorithms. One crucial decision is the order in which the optimizer should join relations. The difference between optimal and non-optimal join order might be orders of magnitude. Therefore, the optimizer must choose the proper order of joins to ensure good overall performance. In this blog post, we define the join ordering problem and estimate the complexity of join planning.&lt;/p&gt;

&lt;h2&gt;Example&lt;/h2&gt;

&lt;p&gt;Consider the &lt;a href="https://docs.snowflake.com/en/user-guide/sample-data-tpch.html#database-entities-relationships-and-characteristics"&gt;TPC-H&lt;/a&gt; schema. The &lt;code&gt;customer&lt;/code&gt; may have &lt;code&gt;orders&lt;/code&gt;. Every order may have several positions defined in the &lt;code&gt;lineitem&lt;/code&gt; table. The &lt;code&gt;customer&lt;/code&gt; table has 150,000 records, the &lt;code&gt;orders&lt;/code&gt; table has 1,500,000 records, and the &lt;code&gt;lineitem&lt;/code&gt; table has 6,000,000 records. Intuitively, every customer places approximately ten orders, and every order contains four positions on average.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KUY7Eaos--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pnmn7wnbo3lz8i6gmx9e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KUY7Eaos--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pnmn7wnbo3lz8i6gmx9e.png" alt="db_schema"&gt;&lt;/a&gt;&lt;br&gt;
Suppose that we want to retrieve all &lt;code&gt;lineitem&lt;/code&gt; positions for all orders placed by the given customer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT 
  lineitem.*
FROM 
  customer,
  orders,
  lineitem
WHERE
  c_custkey = ? 
  AND c_custkey = o_custkey
  AND o_orderkey = l_orderkey
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Assume that we have a &lt;a href="https://www.querifylabs.com/blog/what-is-cost-based-optimization"&gt;cost model&lt;/a&gt; where an operator's cost is proportional to the number of processed tuples.&lt;/p&gt;

&lt;p&gt;We consider two different join orders. We can join &lt;code&gt;customer&lt;/code&gt; with &lt;code&gt;orders&lt;/code&gt; and then with &lt;code&gt;lineitem&lt;/code&gt;. This join order is very efficient because the predicate filters out most customers early, so the intermediate relation stays tiny.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--A-C-B_hp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fnwe8emq1cryh3pi2hg0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--A-C-B_hp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fnwe8emq1cryh3pi2hg0.png" alt="efficient_join"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Alternatively, we can join &lt;code&gt;orders&lt;/code&gt; with &lt;code&gt;lineitem&lt;/code&gt; and then with &lt;code&gt;customer&lt;/code&gt;. It produces a large intermediate relation because we map every &lt;code&gt;lineitem&lt;/code&gt; to an &lt;code&gt;order&lt;/code&gt; only to discard most of the produced tuples in the second join.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Is1Fqokj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/m6ta73qjqht19ggbotv4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Is1Fqokj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/m6ta73qjqht19ggbotv4.png" alt="inefficient_join"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The two join orders produce plans with very different costs. The first join strategy is far superior to the second.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sUECO_qu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/snseel7y3d17io4uzdmp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sUECO_qu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/snseel7y3d17io4uzdmp.png" alt="join_order_comparison"&gt;&lt;/a&gt;&lt;/p&gt;
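&lt;p&gt;The gap can be reproduced with a back-of-the-envelope estimate. The sketch below is illustrative only: it uses the cost model above (cost proportional to processed tuples), charges each join for the tuples it emits, and plugs in the approximate TPC-H cardinalities from the example.&lt;/p&gt;

```python
# Back-of-the-envelope cost comparison of the two join orders.
# Cost model from the post: cost ~ number of tuples processed.
# Charging each join for its output is an illustrative assumption,
# not how a real optimizer bills its operators.

CUSTOMER, ORDERS, LINEITEM = 150_000, 1_500_000, 6_000_000
ORDERS_PER_CUSTOMER = ORDERS // CUSTOMER   # ~10 orders per customer
ITEMS_PER_ORDER = LINEITEM // ORDERS       # ~4 positions per order

def cost_customer_first():
    """(customer filtered by c_custkey) JOIN orders JOIN lineitem."""
    customers = 1                                    # c_custkey = ? keeps one row
    after_orders = customers * ORDERS_PER_CUSTOMER   # ~10 tuples
    after_lineitem = after_orders * ITEMS_PER_ORDER  # ~40 tuples
    return after_orders + after_lineitem

def cost_lineitem_first():
    """orders JOIN lineitem first, filter by customer last."""
    after_lineitem = LINEITEM                        # every lineitem matches an order
    final = ORDERS_PER_CUSTOMER * ITEMS_PER_ORDER    # ~40 tuples survive
    return after_lineitem + final

print(cost_customer_first(), cost_lineitem_first())  # 50 vs 6000040
```

&lt;p&gt;Even this crude model shows the second order processing roughly five orders of magnitude more tuples than the first.&lt;/p&gt;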

&lt;h2&gt;Search Space&lt;/h2&gt;

&lt;p&gt;A perfect optimizer would need to construct all possible equivalent plans for a given query and choose the best plan. Let's now see how many options the optimizer would need to consider.&lt;/p&gt;

&lt;p&gt;We model an n-way join as a sequence of &lt;code&gt;n-1&lt;/code&gt; 2-way joins that form a full binary tree. Leaf nodes are the original relations, and internal nodes are join relations. For 3 relations, there are 12 valid join orders:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--izWbU4Wd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w0qwbyo6ias84lz49edw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--izWbU4Wd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w0qwbyo6ias84lz49edw.png" alt="all_join_orders"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We count the number of possible join orders for &lt;code&gt;N&lt;/code&gt; relations in two steps. First, we count the number of different orders of leaf nodes. For the first leaf, we choose one of &lt;code&gt;N&lt;/code&gt; relations; for the second leaf, we choose one of the remaining &lt;code&gt;N-1&lt;/code&gt; relations, and so on. This gives us &lt;code&gt;N!&lt;/code&gt; different orders.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UVeArRAA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bxagq3utti1l2orcmtrj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UVeArRAA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bxagq3utti1l2orcmtrj.png" alt="counting_orders"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;Second, we need to calculate the number of all possible shapes of a full binary tree with &lt;code&gt;N&lt;/code&gt; leaves, which is the number of ways of associating &lt;code&gt;N-1&lt;/code&gt; applications of a binary operator. This number is known to be the &lt;a href="https://en.wikipedia.org/wiki/Catalan_number"&gt;Catalan number&lt;/a&gt; &lt;code&gt;C(N-1)&lt;/code&gt;. Intuitively, for a given fixed order of &lt;code&gt;N&lt;/code&gt; leaf nodes, we need to count the ways to place &lt;code&gt;N-1&lt;/code&gt; pairs of parentheses. E.g., for the four relations &lt;code&gt;[a,b,c,d]&lt;/code&gt;, we have five different parenthesizations:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lNUhlEfn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ukoyxcx9sbl07l3h16ue.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lNUhlEfn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ukoyxcx9sbl07l3h16ue.png" alt="join_shapes"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;Multiplying the two parts, we get the final equation:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hCZeNZSC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ubl3id5rebyeakrewin8.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hCZeNZSC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ubl3id5rebyeakrewin8.gif" alt="join_order_formula"&gt;&lt;/a&gt; &lt;/p&gt;
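&lt;p&gt;The equation is easy to check numerically. A minimal sketch using only Python's standard library:&lt;/p&gt;

```python
from math import comb, factorial

def catalan(n):
    """n-th Catalan number: C(n) = (2n choose n) / (n + 1)."""
    return comb(2 * n, n) // (n + 1)

def join_orders(n):
    """Number of join trees for n relations: n! * C(n - 1)."""
    return factorial(n) * catalan(n - 1)

for n in (3, 5, 10):
    print(n, join_orders(n))  # 12, 1680, and 17643225600
```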

&lt;h2&gt;Performance&lt;/h2&gt;

&lt;p&gt;The number of join orders grows exponentially. For example, for three tables, the number of all possible join plans is &lt;code&gt;12&lt;/code&gt;; for five tables, it is &lt;code&gt;1,680&lt;/code&gt;; for ten tables, it is &lt;code&gt;17,643,225,600&lt;/code&gt;. Practical optimizers use different techniques to ensure good enough performance of the join enumeration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Et6KgQtD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5o94bqh2ltc6x7303pi1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Et6KgQtD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5o94bqh2ltc6x7303pi1.png" alt="number_of_joins"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;First, optimizers might cache already explored sub-plans to avoid redundant work and keep memory consumption in check. Two widely used techniques are &lt;a href="https://en.wikipedia.org/wiki/Dynamic_programming"&gt;dynamic programming&lt;/a&gt; and &lt;a href="https://www.querifylabs.com/blog/memoization-in-cost-based-optimizers"&gt;memoization&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Second, optimizers may use various heuristics to limit the search space instead of doing an exhaustive search. A common heuristic is to prune join orders that yield cross-products. While usually acceptable, this heuristic may lead to non-optimal plans, e.g., for some star joins. A more aggressive pruning approach is to enumerate only left- or right-deep trees. This significantly reduces planning complexity but degrades the plan quality even further. Probabilistic algorithms (e.g., &lt;a href="https://en.wikipedia.org/wiki/Genetic_algorithm"&gt;genetic algorithms&lt;/a&gt; or &lt;a href="https://en.wikipedia.org/wiki/Simulated_annealing"&gt;simulated annealing&lt;/a&gt;) might also be used, again without any guarantees of plan optimality.&lt;/p&gt;
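&lt;p&gt;To make the dynamic programming idea concrete, here is a toy Selinger-style sketch that enumerates only left-deep trees and memoizes the cheapest sub-plan per subset of relations. The cardinalities, the uniform join selectivity, and the "cost equals emitted tuples" model are all illustrative assumptions.&lt;/p&gt;

```python
from itertools import combinations

# Toy Selinger-style DP over relation subsets, left-deep trees only.
CARD = {"customer": 150_000, "orders": 1_500_000, "lineitem": 6_000_000}
SELECTIVITY = 1e-6  # assumed uniform join selectivity (illustrative)

def best_left_deep(relations):
    # best[subset] = (cost, cardinality, join order)
    best = {frozenset([r]): (0.0, float(CARD[r]), (r,)) for r in relations}
    for size in range(2, len(relations) + 1):
        for subset in map(frozenset, combinations(relations, size)):
            candidates = []
            for rel in subset:                 # rel is joined last
                cost, card, order = best[subset - {rel}]
                out = card * CARD[rel] * SELECTIVITY
                candidates.append((cost + out, out, order + (rel,)))
            best[subset] = min(candidates)     # memoize the cheapest sub-plan
    return best[frozenset(relations)]

cost, card, order = best_left_deep(list(CARD))
print(order)  # ('customer', 'orders', 'lineitem')
```

&lt;p&gt;The memo table holds one entry per subset of relations instead of one entry per join order, which is exactly how dynamic programming keeps the search tractable.&lt;/p&gt;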

&lt;h2&gt;Summary&lt;/h2&gt;

&lt;p&gt;In this post, we took a sneak peek at the join ordering problem and got a bird's-eye view of its complexity. In further posts, we will explore the complexity of join order planning for different graph topologies, dive into details of concrete enumeration techniques, and analyze existing and potential strategies of join planning in &lt;a href="https://calcite.apache.org/"&gt;Apache Calcite&lt;/a&gt;. Stay tuned!&lt;/p&gt;

&lt;p&gt;We are always ready to help you with your query optimizer design. &lt;a href="https://www.querifylabs.com/#contact-form"&gt;Just let us know&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>queryoptimizer</category>
      <category>sql</category>
      <category>database</category>
    </item>
    <item>
      <title>What is Cost-based Optimization?</title>
      <dc:creator>Alexey Goncharuk</dc:creator>
      <pubDate>Wed, 02 Jun 2021 07:39:46 +0000</pubDate>
      <link>https://dev.to/agoncharuk/what-is-cost-based-optimization-4jc0</link>
      <guid>https://dev.to/agoncharuk/what-is-cost-based-optimization-4jc0</guid>
      <description>&lt;p&gt;In our previous blog posts (&lt;a href="https://www.querifylabs.com/blog/rule-based-query-optimization"&gt;1&lt;/a&gt;, &lt;a href="https://www.querifylabs.com/blog/memoization-in-cost-based-optimizers"&gt;2&lt;/a&gt;), we explored how query optimizers may consider different equivalent plans to find the best execution path. But what does it mean for the plan to be the "best" in the first place? In this blog post, we will discuss what a plan cost is and how it can be used to drive optimizer decisions.&lt;/p&gt;

&lt;h2&gt;Example&lt;/h2&gt;

&lt;p&gt;Consider the following query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * FROM fact 
WHERE event_date BETWEEN ? AND ?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We may do a full table scan and then apply the filter. Alternatively, we may utilize a secondary index on the attribute &lt;code&gt;event_date&lt;/code&gt;, merging the scan and the filter into a single index lookup. This sounds like a good idea because we reduce the amount of data that needs to be processed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--p_TLZV4o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://uploads-ssl.webflow.com/5fe5c475cb3c75040200bfe6/607a85e7fe7bfa2c106c7233_index-is-better.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--p_TLZV4o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://uploads-ssl.webflow.com/5fe5c475cb3c75040200bfe6/607a85e7fe7bfa2c106c7233_index-is-better.png" alt="Example: Index is better"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We may instruct the optimizer to apply such transformations unconditionally, based on the observation that the index lookup is likely to improve the plan's quality. This is an example of &lt;strong&gt;heuristic&lt;/strong&gt; optimization.&lt;/p&gt;

&lt;p&gt;Now consider that our filter is not selective and matches a large fraction of the rows. In this case, we may scan the same blocks of data several times, thus increasing the execution time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eCniKYFT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://uploads-ssl.webflow.com/5fe5c475cb3c75040200bfe6/607a85f17332606a7a737099_index-is-worse.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eCniKYFT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://uploads-ssl.webflow.com/5fe5c475cb3c75040200bfe6/607a85f17332606a7a737099_index-is-worse.png" alt="Example: Index is worse"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In practice, rules that invariably produce better plans are rare. A transformation may be useful in one set of circumstances and lead to a worse plan in another. For this reason, heuristic planning cannot guarantee optimality and may produce arbitrarily bad plans.&lt;/p&gt;
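&lt;p&gt;This is exactly where a cost-based decision beats a fixed heuristic: the same choice flips as the selectivity changes. A minimal sketch, with made-up coefficients (a per-matching-row random-access penalty for the index versus one sequential pass for the scan):&lt;/p&gt;

```python
# Choosing between a full scan and an index lookup based on cost.
# The coefficients are illustrative assumptions, not real I/O costs.

def scan_cost(table_rows):
    return table_rows * 1.0                    # one sequential pass

def index_cost(table_rows, selectivity):
    matching = table_rows * selectivity
    return matching * 4.0                      # assumed random-access penalty

def choose_plan(table_rows, selectivity):
    plans = [("full_scan", scan_cost(table_rows)),
             ("index_lookup", index_cost(table_rows, selectivity))]
    return min(plans, key=lambda plan: plan[1])[0]

print(choose_plan(1_000_000, 0.001))  # index_lookup: the filter is selective
print(choose_plan(1_000_000, 0.5))    # full_scan: half the table matches
```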

&lt;h2&gt;Cost&lt;/h2&gt;

&lt;p&gt;In the previous example, we have two alternative plans, each suitable for a particular setting. Additionally, in some scenarios, the optimization target may change. For example, a plan that gives the smallest latency might not be the best if our goal is to minimize the hardware usage costs in the cloud. So how do we decide which plan is better?&lt;/p&gt;

&lt;p&gt;First of all, we should define the optimization goal, which could be minimal latency, maximal throughput, etc. Then we may associate every plan with a value that describes how "far" the plan is from the ideal target. For example, if the optimization goal is latency, we may assign every plan an estimated execution time. The closer the plan's cost is to zero, the better.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ec1RNUOg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://uploads-ssl.webflow.com/5fe5c475cb3c75040200bfe6/607aaba3f9f15f9bdabfdafd_cost-distance.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ec1RNUOg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://uploads-ssl.webflow.com/5fe5c475cb3c75040200bfe6/607aaba3f9f15f9bdabfdafd_cost-distance.png" alt="Example: Cost distance"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The underlying hardware and software are often complicated, so we rarely can estimate the optimization target precisely. Instead, we may use a collection of assumptions that approximate the behavior of the actual system. We call this the &lt;strong&gt;cost model&lt;/strong&gt;. The cost model is usually based on parameters of the algorithms used in a plan, such as the estimated amount of consumed CPU and RAM, the amount of network and disk I/O, etc. We also need data statistics: operator cardinalities, filter selectivities, etc. The goal of the model is to combine these characteristics into the plan's cost. For example, we may use coefficients to weigh the parameters differently depending on the optimization goal.&lt;/p&gt;

&lt;p&gt;The cost of the &lt;code&gt;Filter&lt;/code&gt; might be a function of the input cardinality and predicate complexity. The cost of the &lt;code&gt;NestedLoopJoin&lt;/code&gt; might be proportional to the estimated number of restarts of the inner input. The &lt;code&gt;HashJoin&lt;/code&gt; cost might have a linear dependency on the input cardinalities and also model spilling to disk with some pessimistic coefficients if the hash table becomes too large to fit into RAM.&lt;/p&gt;
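&lt;p&gt;As a rough illustration, such per-operator formulas might look like the sketch below. Every coefficient (CPU weight, spill penalty, memory budget, row width) is an assumption made up for the example.&lt;/p&gt;

```python
# Illustrative per-operator cost functions in the spirit of the text.

def filter_cost(input_rows, predicate_ops):
    # CPU work grows with input size and predicate complexity
    return input_rows * predicate_ops * 0.01

def nested_loop_join_cost(outer_rows, inner_rows):
    # the inner input is restarted for every outer row
    return outer_rows * inner_rows * 0.01

def hash_join_cost(build_rows, probe_rows,
                   bytes_per_row=100, memory_budget=64 * 1024 * 1024):
    cost = (build_rows + probe_rows) * 0.01    # linear in both inputs
    if build_rows * bytes_per_row > memory_budget:
        cost *= 3.0                            # pessimistic spill-to-disk penalty
    return cost
```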

&lt;p&gt;In practical systems, the cost is usually implemented as a scalar value:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;In &lt;strong&gt;Apache Calcite&lt;/strong&gt;, the cost is modeled as a &lt;a href="https://github.com/apache/calcite/blob/branch-1.24/core/src/main/java/org/apache/calcite/plan/RelOptCostImpl.java#L42"&gt;scalar&lt;/a&gt; representing the number of rows being processed.&lt;/li&gt;
&lt;li&gt;In &lt;strong&gt;Catalyst&lt;/strong&gt;, the Apache Spark optimizer, the cost is a &lt;a href="https://github.com/apache/spark/blob/branch-3.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala#L384"&gt;vector&lt;/a&gt; of the number of rows and the number of bytes being processed. The vector is converted into a &lt;a href="https://github.com/apache/spark/blob/branch-3.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala#L372"&gt;scalar&lt;/a&gt; value during comparison.&lt;/li&gt;
&lt;li&gt;In &lt;strong&gt;Presto&lt;/strong&gt;/&lt;strong&gt;Trino&lt;/strong&gt;, the cost is a &lt;a href="https://github.com/trinodb/trino/blob/355/core/trino-main/src/main/java/io/trino/cost/PlanCostEstimate.java#L34-L37"&gt;vector&lt;/a&gt; of estimated CPU, memory, and network usage. The vector is also converted into a &lt;a href="https://github.com/trinodb/trino/blob/355/core/trino-main/src/main/java/io/trino/cost/CostComparator.java#L64-L72"&gt;scalar&lt;/a&gt; value during comparison.&lt;/li&gt;
&lt;li&gt;In &lt;strong&gt;CockroachDB&lt;/strong&gt;, the cost is an abstract 64-bit floating-point &lt;a href="https://github.com/cockroachdb/cockroach/blob/v20.2.7/pkg/sql/opt/memo/cost.go#L18"&gt;scalar&lt;/a&gt; value.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The scalar is a common choice for practical systems, but this is not a strict requirement. Any representation could be used, as long as it satisfies the requirements of your system and allows you to decide which plan is better. In &lt;a href="https://en.wikipedia.org/wiki/Multi-objective_optimization"&gt;multi-objective optimization&lt;/a&gt;, costs are often represented as vectors that do not have a strict order in the general case. In parallel query planning, a parallel plan requiring a larger amount of work can provide better latency than a sequential plan that does less work.&lt;/p&gt;
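&lt;p&gt;A common way to compare such vectors is to collapse them into a scalar with per-component weights, similar in spirit to the Presto/Trino comparator linked above. The weights below are made up for the example:&lt;/p&gt;

```python
# Collapsing a cost vector into a scalar for comparison.
# The component weights are illustrative assumptions.

WEIGHTS = {"cpu": 75.0, "memory": 10.0, "network": 15.0}

def to_scalar(cost_vector):
    return sum(WEIGHTS[key] * value for key, value in cost_vector.items())

def cheaper(a, b):
    """Return the cost vector with the smaller weighted scalar."""
    return min(a, b, key=to_scalar)

plan_a = {"cpu": 100.0, "memory": 5.0, "network": 1.0}
plan_b = {"cpu": 80.0, "memory": 50.0, "network": 10.0}
print(cheaper(plan_a, plan_b))  # plan_b wins: 6650.0 vs 7565.0
```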

&lt;h2&gt;Cost-based Optimization&lt;/h2&gt;

&lt;p&gt;Once we know how to compare the plans, different strategies can be used to search for the best one. A common approach is to enumerate all possible plans for a query and choose a plan with the lowest cost.&lt;/p&gt;

&lt;p&gt;Since the number of possible query plans grows exponentially with the query complexity, &lt;a href="https://en.wikipedia.org/wiki/Dynamic_programming"&gt;dynamic programming&lt;/a&gt; or &lt;a href="https://www.querifylabs.com/blog/memoization-in-cost-based-optimizers"&gt;memoization&lt;/a&gt; could be used to encode alternative plans in a memory-efficient way.&lt;/p&gt;

&lt;p&gt;If the search space is still too large, we may prune it. In top-down optimizers, we may use &lt;a href="https://en.wikipedia.org/wiki/Branch_and_bound"&gt;branch-and-bound&lt;/a&gt; pruning to discard alternative sub-plans whose costs exceed the cost of an already known containing plan.&lt;/p&gt;
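&lt;p&gt;A minimal branch-and-bound sketch over left-deep join orders: a partial order is abandoned as soon as its accumulated cost exceeds the best complete plan found so far. The cardinalities, the selectivity, and the cost model are toy assumptions.&lt;/p&gt;

```python
from itertools import permutations

CARD = {"a": 1_000, "b": 10_000, "c": 100_000, "d": 1_000_000}
SEL = 1e-4  # assumed uniform join selectivity (illustrative)

def plan_cost(order, bound=float("inf")):
    cost, card = 0.0, float(CARD[order[0]])
    for rel in order[1:]:
        card = card * CARD[rel] * SEL
        cost += card
        if cost > bound:           # branch-and-bound: prune this prefix
            return float("inf")
    return cost

def best_order(relations):
    best, best_cost = None, float("inf")
    for order in permutations(relations):
        cost = plan_cost(order, bound=best_cost)
        if best_cost > cost:
            best, best_cost = order, cost
    return best, best_cost

order, cost = best_order(list(CARD))
print(order)  # ('a', 'b', 'c', 'd')
```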

&lt;p&gt;Heuristic pruning may reduce the search space at the risk of missing the optimal plan. Common examples of heuristic pruning are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Probabilistic join order enumeration may reduce the number of alternative plans (e.g., &lt;a href="https://en.wikipedia.org/wiki/Genetic_algorithm"&gt;genetic algorithms&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/Simulated_annealing"&gt;simulated annealing&lt;/a&gt;). Postgres uses the &lt;a href="https://www.postgresql.org/docs/13/geqo-pg-intro.html"&gt;genetic query optimizer&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Multi-phase optimizers split the whole optimization problem into smaller stages and search for a locally optimal plan within each stage. Apache Flink, Presto/Trino, and CockroachDB all use multi-phase greedy optimization.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;Summary&lt;/h2&gt;

&lt;p&gt;Cost-based optimization estimates the quality of plans with respect to the optimization target, allowing an optimizer to choose the best execution plan. The cost model depends heavily on metadata maintained by the database, such as estimated cardinalities and selectivities.&lt;/p&gt;

&lt;p&gt;Practical optimizers typically use ordered scalar values as a plan cost. This approach might not be suitable for some complex scenarios, such as multi-objective query optimization or choosing the best parallel plan.&lt;/p&gt;

&lt;p&gt;Dynamic programming or memoization is often used in cost-based optimization to encode the search space efficiently. If the search space is too large, various pruning techniques could be used, such as branch-and-bound or heuristic pruning.&lt;/p&gt;

&lt;p&gt;In future blog posts, we will explore some of these concepts in more detail. Stay tuned!&lt;/p&gt;

&lt;p&gt;We are always ready to help you with your query optimizer design. Just &lt;a href="https://www.querifylabs.com/#contact-form"&gt;let us know&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>queryoptimizer</category>
      <category>apachecalcite</category>
      <category>sql</category>
      <category>database</category>
    </item>
  </channel>
</rss>
