Today I want to share with you a pattern you can use to replace JOINs with CTEs in your SQL queries and achieve consistent and faster execution times.
Consider a database where we store information about invoices, a simplified model could consist of a table where we store details about the customer and payment terms, a separate table where we have the items included in the invoice, and a 3rd table where we store product information:
CREATE TABLE invoices (
invoice_number BIGINT PRIMARY KEY
,customer_id BIGINT
,payment_terms TEXT DEFAULT '30 days from issue date'
,issue_date TIMESTAMP
);
CREATE TABLE invoice_items (
invoice_number BIGINT
,product_id BIGINT
,quantity REAL
,unit_price REAL
,PRIMARY KEY (invoice_number,product_id)
);
CREATE TABLE products (
product_id BIGINT PRIMARY KEY
,product_description TEXT
,applicable_tax_percentage REAL
);
Let’s now imagine we want to know how many units of “super cool product” have been sold in January 2024, we could write a SQL query with JOINs like this:
SELECT SUM(quantity)
FROM invoices
JOIN invoice_items USING (invoice_number)
JOIN products USING (product_id)
WHERE product_description='super cool product'
AND invoices.issue_date BETWEEN '2024-01-01' AND '2024-02-01';
This is perfectly valid SQL, but it leaves the database engine with a lot of options.
Even without considering the complexities of a distributed system, parallel processing, and disk/memory options there are still many different possible strategies here, for instance:
- “super cool product“ may only be sold very rarely, we could then start by looking up its
product_id
then all the instances where it has been sold and then check if the corresponding invoices were in January 2024 - or perhaps the product is sold often and we have data for 20 years of sales, so we could start by looking up all the invoices from January, then their line items, and see which ones are for this product
- or maybe the company only sells a handful of products and this one is a best seller, we may also only keep invoices for the last 45 days, meaning that neither the date of the invoice nor the product are very selective, in which case it may be faster to consider the full list of
invoice_items
With up-to-date statistics database engines like CrateDB can usually do a good job at identifying an optimal execution plan, but there is the risk some day statistics may not be available on your target environment, or even with statistics available other factors may induce the query engine to go down the wrong path.
The impact of using a suboptimal execution plan here could be huge, we could find ourselves trying to JOIN millions and millions of records.
Let’s do a small test creating sample data for the first scenario above, the one where “super cool product“ is only sold very rarely:
/* one million invoices in December */
INSERT INTO invoices (invoice_number,customer_id,issue_date)
SELECT a.b,1,'2023-12-01'
FROM GENERATE_SERIES(1,1000000) a(b);
/* one million invoices in January */
INSERT INTO invoices (invoice_number,customer_id,issue_date)
SELECT a.b,1,'2024-01-01'
FROM GENERATE_SERIES(1,2000000) a(b);
/* 2 products */
INSERT INTO products (product_id,product_description)
VALUES (1,'super cool product'),(2,'another product');
/* one line item per invoice and only 1 instance in 2 million where product 1 was sold */
REFRESH TABLE invoices;
INSERT INTO invoice_items (invoice_number,product_id,quantity)
SELECT invoice_number
,CASE WHEN invoice_number=2000000 THEN 1 ELSE 2 END AS product_id
,ceiling(random()*10)
FROM invoices;
We can now run the query with the JOINs a few times, in my small test environment it settles in running in around 750 milliseconds.
We can also look at the execution plan and all its details using the EXPLAIN command.
Let’s now try this approach where we use CTEs to guide the engine to execute the query using steps we know are more optimal for the profile of our data:
WITH relevant_product_ids AS (
SELECT product_id
FROM products
WHERE product_description='super cool product'
)
,relevant_invoice_lines AS (
SELECT invoice_number,quantity
FROM invoice_items
WHERE invoice_items.product_id IN (SELECT relevant_product_ids.product_id FROM relevant_product_ids)
)
,relevant_invoices AS (
SELECT invoice_number,issue_date
FROM invoices
WHERE invoices.invoice_number IN (SELECT relevant_invoice_lines.invoice_number FROM relevant_invoice_lines)
)
SELECT SUM(quantity)
FROM relevant_invoices
JOIN relevant_invoice_lines USING (invoice_number)
WHERE relevant_invoices.issue_date BETWEEN '2024-01-01' AND '2024-02-01';
We now see this runs consistently in single digit milliseconds, a 100x improvement.
In large and busy environments this kind of optimization may make a big difference, so it may be something to add to your toolbox.
I hope you found this interesting, and as usual if you have any questions do not hesitate to reach out to us in the CrateDB Community.
Top comments (0)