Ant The Developer

Posted on Feb 1, 2020 • Edited on Jun 26, 2020

The Wild SQL Query I Wrote For Work This Week

#sql #postgres #database

I love SQL. It's one of the more powerful and beautiful languages in my opinion. It's been around for quite a while and I'm still learning new things. Ironically, my SQL course in college was the only useful CS course I've ever taken and the knowledge I gained there I still use today.

To fast forward, what was the problem? Eh...I can't talk specifics so I'll make up a story that is pretty close to reality.

The Premise

Let's imagine that you're writing a task manager and you just dumped 1 000 tasks into one of your projects. 1 000 tasks is a lot and unmanageable so you decide to split it into groups of 100. It'd be pretty easy to just write a Node script that connects to the database...but oh wait, that means you need to connect to the db somehow. And you don't want to create a new project, committing code to the existing codebase feels dirty, and your UI can't handle dealing with 1 000 tasks simultaneously.

But...but here's PgAdmin staring at you, already connected to the database. Can I do this with SQL? you ask yourself and because it's 4pm, you want to head home, and SQL is your favorite language, you do it.

Note: In the real world, there were 100 000 "tasks" so yeah, I can't imagine any non-paging UI handling 100 000 tasks like it's easy peasy.

Also! This was a collaborative effort with another engineer so when I say "I had to write it", it was actually me and my colleague :) Thanks, buddy!

The Final Query

DO $FN$
DECLARE
  project_id integer
BEGIN
  FOR counter IN 1..10 LOOP
    RAISE NOTICE 'Counter: %', counter;

    EXECUTE $$ INSERT INTO projects(name) VALUES ('Group ' || $1) RETURNING id $$ 
      INTO project_id 
      USING counter;

    EXECUTE $$ UPDATE tasks 
      SET project_id = $1
      WHERE tasks.id IN (
          SELECT tasks.id 
          FROM tasks
          INNER JOIN projects ON projects.id = tasks.project_id
          WHERE projects.name = $2
          ORDER BY tasks.id ASC
          LIMIT 100) $$ 
     USING project_id, 'Client project';

     RAISE NOTICE 'Finished Counter: %', counter;
  END LOOP;
END;$FN$

Explanation

I don't want to go through the query line-by-line but I'd like to definitely explain the various "blocks" of the code so it makes more sense and you can see how SQL works.

Note that a lot of this is postgres-specific meaning that MySQL, MSSQL, or other SQL databases may not have access to parts of this syntax.

The `DO $FN$` block

DO $FN$
 -- code goes here
$FN$

Essentially, this is a "do" block like any other. The documentation for the DO block compares it to a self-executing anonymous function that returns void and takes no arguments.

The $FN$ is what's called a dollar-quoted string constant. In Postgres, there are several types of quotes:

A single quote ('string goes here') which means the contents are a string
A double quote ("column_name") which lets us quote a problematic column or table name and use it safely
The dollar-quoted string ($$ SELECT * FROM projects WHERE name = 'test'; $$) which is just like a single quote string except you can use the other types of quotes within

What if you want to nest multiple quotes within each other? You can always escape them with \ BUT Postgres is amazing so it lets you designate custom tags within those dollars and nest as much as you want to:

$outer_quote$
  $inner_quote$
    $$ SELECT * FROM projects where name = 'test' $$
  $inner_quote$
$outer_quote$

You don't have to declare these custom tags anywhere, they just work. Simply shove a word between the dollar signs.

Declare variables

The next block (within our dollar-quoted custom tag string) is the declare block:

-- dollar quote start
DECLARE
  project_id integer;
-- rest of the code

This lets us declare variables that we'll want to use in the rest of the function. And yes, you have to declare them beforehand and yes, declare the type. I couldn't find the specific documentation for this statement (a DECLARE statement outside of a DO block has to do with cursors which is not what we're doing here).

Function body

DECLARE
  -- variables
BEGIN
  -- more code
END;

Essentially, the rest of the code up till this point was us preparing to run some code. We told Postgres we want to execute something via the DO block and then we declared the variables we want to use via DECLARE.

Everything that goes between BEGIN and END runs regular SQL statements.

For loop

FOR counter IN 1..10 LOOP
  -- more code
END LOOP;

Yeah, you can do for loops in SQL! :) There is extensive loop and control documentation that actually lists this example!

The counter is a variable that gets updated and we get access to inside the loop. The 1..10 creates the range (loop 1 through 10).

Note A seasoned Postgres and SQL veteran will tell me that doing the 1-10 math is unnecessary because...and get this...you can substitute 1..10 with a SQL query, creating a "for each" type of loop. eg. FOR task IN SELECT * FROM tasks LOOP would loop through every task in the database.

Raise notice!

RAISE NOTICE 'Counter: %', counter;

Quick note about this, it's a log. That's what it is. Doing RAISE NOTICE won't throw any errors. It works like a formatted string where % will be substitute with the variable after the string. Just like in other languages, you can pass in multiple % signs and then pass multiple variables after the end of the string to substitute.

It's not necessary, I just like seeing the logs :)

EXECUTE...INTO...USING (creating a new project)

EXECUTE $$ INSERT INTO projects(name) VALUES ('Group ' || $1) RETURNING id $$ 
      INTO project_id 
      USING counter;

This statement will use the counter variable to create a brand new project with the name Group followed by the counter number and store the resulting project id in the project_id variable.

I'll break this statement down a little more so as not to overwhelm. It was overwhelming for me at first and in fact, the original query was way hackier so I could avoid understanding the EXECUTE statement in its entirety.

The EXECUTE statement will execute any string it's given. We already know that $$ can be used in lieu of quotes. So INSERT INTO .... RETURNING id is a string of SQL we want to run.

A couple of things here:

The $1 is for substitution (more on that when I cover the USING part). In our example, we use || operator which is a string concatenation operator in Postgres.
the RETURNING id is a Postgres thing. You can insert a record (or bulk insert many records) and get the id back right away. You can even do a RETURNING * to get the entire new record.

The INTO statement will store the result of the query in a variable we declared earlier. I didn't know this and kept trying to do project_id := EXECUTE $$ statement $$ which does not work :(. In our case, we store the returning id into project_id.

The USING statement is used for substitution. You can pass in a list of variables here. The $1 will be substituted by the first variable/value you provide the USING statement. The second variable would be accessible via $2 and so on.

...phew.

The documentation for EXECUTE...USING calls this operation executing a dynamic statement. What's confusing is that...this is different from the EXECUTE command...which executes prepared statements.

move up to 100 tasks to the new project

EXECUTE $$ UPDATE tasks 
      SET project_id = $1
      WHERE tasks.id IN (
          SELECT tasks.id 
          FROM documents 
          INNER JOIN projects ON projects.id = tasks.project_id
          WHERE projects.name = $2
          ORDER BY tasks.id ASC
          LIMIT 100) $$ 
     USING project_id, 'Client project';

I'm glad I got to explain the EXECUTE...INTO...USING statement with the previous query because this one is way hairier.

First, let's look at the simplified query:

UPDATE tasks SET project_id = $1 WHERE tasks.id IN 
  (-- subquery)

We set the new project id to the project_id we got in the previous EXECUTE statement on the tasks with IDs from the result is of the subquery.

Next the subquery:

SELECT tasks.id 
  FROM documents 
  INNER JOIN projects ON projects.id = tasks.project_id
  WHERE projects.name = $2
    ORDER BY tasks.id ASC
    LIMIT 100

In the subquery, we look up a 100 tasks belonging to that one massive project which needs breaking up. We get the next 100 tasks in that project. We use $2 which gets substituted by the project name of the massive original project. We sort by tasks.id ASC so that the tasks don't get thrown out of order during the move.

Lastly, the execute statement that wraps it all:

EXECUTE $$ 
 -- query
$$
USING project_id, 'Client project';

As before, we execute a statement but this time we don't store the results of the query. We substitute 2 different variables which are accessible via $1 and $2 within the query.

Another raise notice

RAISE NOTICE 'Finished Counter: %', counter;

Same as before, this will let us know that a loop is done. I just like logs.

Do I need to know this?

Oh god no. Unless you're in the DB day in and day out, there's no reason for anyone to know this off the top of their head but it's good to know what's possible. I apply that philosophy when learning or using any technology. I knew I could do all of this with SQL because of my past experiences with SQL, I just needed some time with the Postgres documentation to learn how to make this exact task happen. And I think that kind of knowledge is super valuable.