<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Naga Mocha</title>
    <description>The latest articles on DEV Community by Naga Mocha (@nagamocha3000).</description>
    <link>https://dev.to/nagamocha3000</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F321514%2Fe6177534-a7e0-4d33-9e92-75b084f7b0f8.jpg</url>
      <title>DEV Community: Naga Mocha</title>
      <link>https://dev.to/nagamocha3000</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nagamocha3000"/>
    <language>en</language>
    <item>
      <title>SQL joins as reduce/folds over relations</title>
      <dc:creator>Naga Mocha</dc:creator>
      <pubDate>Fri, 29 May 2020 05:45:37 +0000</pubDate>
      <link>https://dev.to/nagamocha3000/sql-joins-as-reduce-folds-over-relations-nhn</link>
      <guid>https://dev.to/nagamocha3000/sql-joins-as-reduce-folds-over-relations-nhn</guid>
      <description>&lt;p&gt;Mental models help us navigate complex ideas. However, forming one own's mental models can be a tricky affair. On one hand, by putting the effort to come up with menetal a model by ourselves, we gain a deeper understanding of the subject at hand. We also become better and more efficient at reasoning and problem-solving. However, on the other hand, a deficient or even an entirely wrong mental model can derail arriving at a far deeper understanding and even result in huge errors and blindspots later on, despite such mental models serving us well in the beginning. This was my case when I encountered &lt;em&gt;joins&lt;/em&gt; at first when I was learning SQL.&lt;/p&gt;

&lt;p&gt;With joins, I initially visualized &lt;em&gt;foreign keys&lt;/em&gt; as a sort of 'pointer' to the storage locations of the rows containing the primary keys, and &lt;em&gt;joins&lt;/em&gt; themselves as 'dereferencing' procedures. This worked well when I had to write queries for simple inner joins involving two tables. Beyond that, e.g. with three or four tables, or when the situation called for outer joins, I was always left stumped.&lt;/p&gt;

&lt;p&gt;I then ditched the whole pointers-dereferencing model and, for a while, simply treated joins as opaque procedures, tinkering with them until the query somehow worked. I could afford such a 'strategy' when writing queries that had a set answer to compare against (such as in online exercises), but I knew at some point I'd have to write 'greenfield' queries without crutches to hold on to. A full understanding of joins had become necessary.&lt;/p&gt;

&lt;p&gt;My strategy then was to go back to the basics. When you're at an intermediate level, going back to the basics always feels like a chore - there's that impatience: hey, I already know this, why should I go through it again? Furthermore, every minute spent 'going back to the basics' could be spent becoming more 'advanced', doing more projects and all that kind of stuff. Still, revisiting fundamentals can be made into an engaging exercise: it's the best time to reevaluate and challenge one's mental models and assumptions. Which is how I ended up with a better understanding of how to use SQL &lt;em&gt;joins&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Note, there's a part 2 of this post that actually does go back to the basics - the fundamental ideas on which joins are based. This post is more of an alternative route: an attempt to use another high-level concept to make sense of &lt;em&gt;joins&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Let's start with a simple SQL query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;column1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;column2&lt;/span&gt;&lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="n"&gt;columnN&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;relationA&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;If we think about the 'order' of evaluation: the &lt;em&gt;from&lt;/em&gt; gets 'evaluated' before the &lt;em&gt;select&lt;/em&gt;. The &lt;em&gt;from&lt;/em&gt; is where the &lt;code&gt;join&lt;/code&gt; clauses are placed. The word 'evaluated' is in quotation marks because, technically, SQL engines aren't required to evaluate a query in any given order. In fact, data-retrieval SQL queries don't dictate an imperative order of evaluation in the way line numbers dictate the order of evaluation in synchronous code. Instead, such queries describe the shape of the data we want back, which is why SQL is said to be declarative. The 'declarativeness' of SQL is yet another concept I struggled with initially. Before going further, I'd like to link to a particular Julia Evans blog &lt;a href="https://jvns.ca/blog/2019/10/03/sql-queries-don-t-start-with-select/"&gt;post&lt;/a&gt;, which gives a more in-depth treatment of the evaluation order of SQL queries.&lt;/p&gt;

&lt;p&gt;Back to the SQL code above. As mentioned, let's suspend the technicalities and assume an 'order of evaluation': by the time &lt;strong&gt;select&lt;/strong&gt; is evaluated, all it has to work with is a single table from which it picks the required columns, i.e. the projection part of relational algebra. Therefore, if there are any &lt;strong&gt;joins&lt;/strong&gt;, these &lt;em&gt;joins&lt;/em&gt; can be conceptualized as procedures that build one huge table from many related tables, using the join clauses to connect rows. Another way of seeing it is that the evaluation of a series of joins is in fact a &lt;strong&gt;reduce&lt;/strong&gt; operation.&lt;/p&gt;
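
&lt;p&gt;For instance, in a query like the following (the table and column names are made up for illustration), the &lt;strong&gt;from&lt;/strong&gt; clause can be read as folding the three relations pairwise, left to right, into one wide table that &lt;strong&gt;select&lt;/strong&gt; then projects from:&lt;/p&gt;

```sql
select c.name, o.order_date, i.description
from customer c
-- step 1: fold customer and orders into one wider intermediate table
join orders o on o.customer_id = c.id
-- step 2: fold that intermediate table with item
join item i on i.order_id = o.id;
```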

&lt;p&gt;Now, &lt;a href="https://en.wikipedia.org/wiki/Fold_(higher-order_function)"&gt;reduce&lt;/a&gt; itself is a high-level concept, being one of the fundamental higher-order functions in functional programming; &lt;em&gt;reduce&lt;/em&gt; abstracts a common iteration pattern, as we shall see. However, the manner in which &lt;em&gt;reduce&lt;/em&gt; is normally introduced to novices waters down its essence. The standard example usually entails a collection of numbers and a value to be calculated, such as a sum. For example, the MDN JavaScript docs provide the following sample:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;array1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;reducer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;accumulator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;currentValue&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;accumulator&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;currentValue&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// 1 + 2 + 3 + 4&lt;/span&gt;
&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;array1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;reduce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;reducer&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="c1"&gt;// expected output: 10&lt;/span&gt;

&lt;span class="c1"&gt;// 5 + 1 + 2 + 3 + 4&lt;/span&gt;
&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;array1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;reduce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;reducer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="c1"&gt;// expected output: 15&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;I kinda get why such examples are used to demonstrate &lt;em&gt;reduce&lt;/em&gt;: rather than presenting the fully abstracted version, it's easier to have learners at least become familiar with it and know that it exists. Generally, all beginner material has to balance ease of understanding, clarity, correctness and thoroughness. Most material chooses ease of understanding, in the hope that once learners progress to the intermediate level, they can take on the technicalities.&lt;/p&gt;

&lt;p&gt;And so, for quite a while as a novice, I simply thought of reduce as a fancy way to perform calculations over an array of numbers - in which case, I'd rather use good old-fashioned for-loops. It wasn't until I was working through Daniel Higginbotham's &lt;em&gt;'Clojure for the Brave and True'&lt;/em&gt; that I saw &lt;em&gt;reduce&lt;/em&gt; in a new light. Here's how Daniel introduces reduce:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The pattern of &lt;em&gt;process each element in a sequence and build a result&lt;/em&gt; is so common that there’s a built-in function for it called reduce....&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This was a mini-moment of enlightenment for me! My understanding of reduce became more generalized and abstract:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;For one, the sequence can consist of anything, not just numbers: a sequence of cats, dogs, json, other sequences, whatever.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Moreover, the sequence itself doesn't even have to be an array, it can be a tree, a map, any sequence-like/iterable data-structure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;And finally, the value we are building up using reduce doesn't even have to be of the same type as the elements in the sequence - just because the array consists of numbers doesn't mean &lt;em&gt;reduce&lt;/em&gt; has to return a number.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
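
&lt;p&gt;To make those bullet points concrete, here's a small sketch of my own (plain JavaScript; the names are made up): &lt;em&gt;reduce&lt;/em&gt; folds an array of strings into an object, so the accumulated value is a different type from the elements:&lt;/p&gt;

```javascript
// fold an array of words into an object counting words per length;
// the accumulator (an object) is a different type from the elements (strings)
const words = ["cat", "dog", "horse", "ox"];

const byLength = words.reduce((acc, word) => {
  acc[word.length] = (acc[word.length] || 0) + 1;
  return acc;
}, {});

// byLength is { 2: 1, 3: 2, 5: 1 }
```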

&lt;p&gt;For the sake of repetition, here's yet another definition of reduce, from Eric Elliott's post &lt;em&gt;10 Tips for Better Redux Architecture&lt;/em&gt; (&lt;a href="https://medium.com/javascript-scene/10-tips-for-better-redux-architecture-69250425af44"&gt;link&lt;/a&gt;):&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In functional programming, the common utility &lt;em&gt;&lt;code&gt;reduce()&lt;/code&gt;&lt;/em&gt; or &lt;em&gt;&lt;code&gt;fold()&lt;/code&gt;&lt;/em&gt; is used to apply a reducer function to each value in a list of values in order to accumulate a single output value.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And now back to SQL. Do keep in mind that I might just be shoehorning one concept into another out of sheer excitement. As already mentioned, no matter how many tables are listed in the &lt;strong&gt;from&lt;/strong&gt; clause, at the end of the day &lt;strong&gt;select&lt;/strong&gt; expects a single table from which it can pick out the specified columns. Thus, one might visualize the &lt;em&gt;join&lt;/em&gt; 'operator' as a reducer function. Just like a reducer, it takes two arguments: the first is the table accumulated so far, and the second is the next table in line. Within this reducer function, join then builds, or rather accumulates, both tables into a much larger table.&lt;/p&gt;

&lt;p&gt;The accumulation happens across two dimensions, rows and columns. On columns: since this reduction takes place before the &lt;strong&gt;select&lt;/strong&gt; clause, all the columns from both tables are included. However, if we use the keyword &lt;strong&gt;using&lt;/strong&gt; in the join clause instead of the more common &lt;strong&gt;on&lt;/strong&gt;, the two columns being compared are collapsed into a single column. As for the rows, it all depends on the type of &lt;em&gt;join&lt;/em&gt;. For example, with a &lt;strong&gt;right outer join&lt;/strong&gt;, if a row in the accumulated left table cannot be partnered up with a row in the right table, it is discarded; and once all the rows are partnered up, any rows in the right table that didn't get a partner have their gaps filled with nulls. This, roughly, was the idea in my head of how joins work.&lt;/p&gt;
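
&lt;p&gt;This reducer view can even be sketched in a few lines of JavaScript. To be clear, this is a toy model of mine, not how any SQL engine actually works: a 'table' is an array of row objects, and an inner join is a reducer that pairs up rows agreeing on a shared column:&lt;/p&gt;

```javascript
// innerJoinOn returns a reducer: it merges the accumulated table with
// the next table, pairing rows that agree on the given column
// (roughly like JOIN ... USING in SQL)
const innerJoinOn = (col) => (acc, next) =>
  acc.flatMap((row) =>
    next
      .filter((other) => other[col] === row[col])
      .map((other) => Object.assign({}, row, other))
  );

const customers = [{ id: 1, name: "Ada" }, { id: 2, name: "Bob" }];
const orders = [{ id: 1, item: "book" }, { id: 1, item: "pen" }];

// fold a list of tables into one wide table, the way a FROM clause
// with a chain of joins does
const joined = [customers, orders].reduce(innerJoinOn("id"));
// joined: [{ id: 1, name: "Ada", item: "book" },
//          { id: 1, name: "Ada", item: "pen" }]
```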

&lt;p&gt;Still, I did not know how to conceptualize foreign-key columns in a '&lt;em&gt;joins-as-reduce&lt;/em&gt;' context. All the join clauses I had come across or worked on at that point related primary keys with foreign keys, so I somehow presumed the two concepts were tightly related. The whole pointer-dereferencing thing felt even more iffy whenever I tried to combine it with &lt;em&gt;joins-as-reduce&lt;/em&gt;, so for the sake of sanity, I could not resort to it again. Toying with a couple of queries though, I found out that the columns used in a join clause don't even need to have a primary-key/foreign-key sort of relationship. All this time I had been intertwining joins and keys, when they weren't even dependent on each other. As such, faced with gaping holes in my understanding, I had to go back to the basics... which is exactly what I explore in the second part of this article. See you there!&lt;/p&gt;

</description>
      <category>sql</category>
      <category>postgres</category>
    </item>
    <item>
      <title>Speeding up unique constraint checks in Postgres... or not</title>
      <dc:creator>Naga Mocha</dc:creator>
      <pubDate>Thu, 28 May 2020 19:50:57 +0000</pubDate>
      <link>https://dev.to/nagamocha3000/speeding-up-unique-constraint-checks-in-postgres-or-not-2c59</link>
      <guid>https://dev.to/nagamocha3000/speeding-up-unique-constraint-checks-in-postgres-or-not-2c59</guid>
      <description>&lt;p&gt;Intuitively, or rather theoretically (since this doesn't even require intuition), using hash indexes over Btree indexes for equality lookups should be faster: O(1) vs O(log n). And as is expected, various benchmarks confirm this. However, when it comes to enforcing uniqueness in a given column, hash indexes don't perform so well. &lt;/p&gt;

&lt;p&gt;The standard way of adding a uniqueness constraint to a column is to declare it when creating the table. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="k"&gt;unique&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;On my computer, it takes 5.563 seconds to insert a million sequentially ordered IDs. It takes 7.533 seconds to add a million randomly ordered IDs. Without any constraint on the &lt;code&gt;id&lt;/code&gt; column, it takes roughly 2 seconds to insert a million items regardless of whether they are ordered sequentially or randomly.&lt;/p&gt;
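
&lt;p&gt;(The post doesn't show the benchmark itself; a psql session along these lines - &lt;code&gt;\timing&lt;/code&gt; plus &lt;code&gt;generate_series&lt;/code&gt; - is one way to reproduce the measurements.)&lt;/p&gt;

```sql
\timing on

-- a million sequentially ordered IDs
insert into item
select i, 'item ' || i from generate_series(1, 1000000) as i;

truncate item;

-- the same million IDs, in random order
insert into item
select i, 'item ' || i from generate_series(1, 1000000) as i
order by random();
```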

&lt;p&gt;Postgres provides another round-about way of adding a uniqueness constraint to a column - &lt;em&gt;exclusion constraints&lt;/em&gt;. Postgres' &lt;a href="https://www.postgresql.org/docs/12/ddl-constraints.html#DDL-CONSTRAINTS-EXCLUSION"&gt;documentation&lt;/a&gt; defines exclusion constraints as follows:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Exclusion constraints ensure that if any two rows are compared on the specified columns or expressions using the specified operators, at least one of these operator comparisons will return false or null.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For our case, enforcing uniqueness, the above can be restated as follows: suppose we consider a single column and provide a single operator for the exclusion constraint (the equals operator '=' for uniqueness). Then, when a row is inserted or updated, its comparison with every pre-existing row must return false (or null, since SQL has three-valued logic); otherwise the new row is excluded, i.e. rejected. This is easier to demonstrate with actual SQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;exclude&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Better yet, Postgres allows us to speed up the exclusion check using an index as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;exclude&lt;/span&gt; &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;hash&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Now, for a unique constraint, Postgres' documentation states that it "will automatically create a unique B-tree index on the column or group of columns listed in the constraint". Given that the exclusion constraint above uses a hash index, I expected it to be faster. However, inserting a million sequential IDs took 9.192 seconds - roughly 65% slower than the plain old unique constraint. Inserting randomly ordered IDs took 8.443 seconds.&lt;/p&gt;

&lt;p&gt;At first, I presumed it had something to do with the way the underlying hash indexes are structured, but even when using a B-tree, the exclusion check took roughly the same amount of time as the hash-index version. The B-tree was much slower, though, when inserting randomly ordered IDs, taking 12.058 seconds. My current presumption is that the Postgres developers have put a lot of work into optimizing the standard unique constraint, since users are expected to opt for it over exclusion constraints, which are better left for more interesting stuff, like overlapping intervals. &lt;/p&gt;

&lt;p&gt;Other than being slower, by relying on exclusion constraints to enforce uniqueness, we also lose the ability to have the column be referenced by foreign keys in other tables. This is because in Postgres, only unique columns and primary key columns can be referenced. For example, the second table definition below fails:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;constraint&lt;/span&gt; &lt;span class="n"&gt;unique_book_id_hash&lt;/span&gt; &lt;span class="n"&gt;exclude&lt;/span&gt; &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;hash&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;item_id&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="k"&gt;references&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;-- ❌&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;So far though, if you've noticed, I've been using integers. Before drawing any further conclusions and dismissing hash indexes entirely, it's only fair to measure them in the one area where they excel: comparing lengthy strings. And as expected, they truly shine here. I began with UUIDs, since I didn't have to write an extra function for generating random strings. With the plain old &lt;code&gt;unique&lt;/code&gt; constraint, it takes 32.96 seconds to insert a million UUIDs, and it gets worse when inserting the next million: 50.557 seconds. On the other hand, with the hash-index-based exclusion check, it takes 12.537 seconds to insert the first million UUIDs, 12.764 seconds for the next million and 16.24 seconds for the third - quite impressive. I'll be sure to try comparing both with random strings of different lengths, but I expect similar results. And yeah, that's definitely one way to speed up uniqueness constraint checks - provided the column's type is a string rather than an integer, and it won't be referenced elsewhere in the database. &lt;/p&gt;
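
&lt;p&gt;For reference, a sketch of the UUID setup (this assumes &lt;code&gt;gen_random_uuid&lt;/code&gt;, which needs the pgcrypto extension on Postgres versions before 13):&lt;/p&gt;

```sql
create extension if not exists pgcrypto; -- provides gen_random_uuid()

create table item_uuid(
    id uuid,
    name text,
    exclude using hash (id with =)
);

\timing on
insert into item_uuid
select gen_random_uuid(), 'item ' || i
from generate_series(1, 1000000) as i;
```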

</description>
      <category>postgres</category>
      <category>sql</category>
    </item>
  </channel>
</rss>
