DEV Community

Franck Pachot for Yugabyte

Posted on • Updated on

ORDER BY is mandatory in SQL to get a sorted result

In IT, like in math, a negative test can prove that an assertion is wrong. But a positive test doesn't prove that the assertion is right. It is worse in IT because the algorithm may show an apparently correct result with a probablilty that is higher than just being lucky:

postgres=# create table demo (n int primary key);
CREATE TABLE

postgres=# insert into demo  
           select n from generate_series(1,1e6) n;
INSERT 0 1000000

postgres=# select n from demo limit 10;

 n
----
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
(10 rows)
Enter fullscreen mode Exit fullscreen mode

With this example, you may think that a select returns the rows ordered. This is wrong. The SQL language is declarative. Without an ORDER BY, the result is random. The apparently sorted result here is just a side effect of inserting into heap tables by appending at the end of the file. And reading the file from beginning to end with one thread. The rows are displayed as they are fetched.

When I insert the rows in another order:

postgres=# drop table if exists demo;
DROP TABLE
postgres=# create table demo (n int primary key);
CREATE TABLE
postgres=# insert into demo  select 1e6-n from generate_series(1,1e6) n;
INSERT 0 1000000
postgres=# select n from demo limit 10;
   n
--------
 999999
 999998
 999997
 999996
 999995
 999994
 999993
 999992
 999991
 999990
(10 rows)
Enter fullscreen mode Exit fullscreen mode

they come as they were stored. This is typical of a serial SeqScan:

postgres=# explain select n from demo limit 10;
                             QUERY PLAN
-------------------------------------------------------------
 Limit  (cost=0.00..0.14 rows=10 width=4)
   ->  Seq Scan on demo  (cost=0.00..14425.00 rows=1000000 width=4)
(2 rows)
Enter fullscreen mode Exit fullscreen mode

You cannot rely on any of this behavior. With another execution plan, this order may change:

postgres=# set enable_seqscan=false;
SET
postgres=# select n from demo limit 10;
 n
--------
 0
 1
 2
 3
 4
 5
 6
 7
 8
 9
(10 rows)

postgres=# explain select n from demo limit 10;

                                        QUERY PLAN
-------------------------------------------------------------
 Limit  (cost=0.42..0.77 rows=10 width=4)
   ->  Index Only Scan using demo_pkey on demo  (cost=0.42..34712.43 rows=1000000 width=4)
(2 rows)
Enter fullscreen mode Exit fullscreen mode

Many things can change the behavior. Here an index scan instead of a full table scan. Parallel query will also change the way they are read by SeqScan. And updates can also move them physically:

postgres=# set enable_seqscan=true;
SET
postgres=# alter table demo add column x char;
ALTER TABLE
postgres=# update demo set x=1 where mod(n,3)=0;
UPDATE 333334
postgres=# select n from demo limit 10;
   n
-------------
 999998
 999997
 999995
 999994
 999992
 999991
 999989
 999988
 999986
 999985
(10 rows)
Enter fullscreen mode Exit fullscreen mode

The only way to get a reliable sorted result is with an ORDER BY:

postgres=# select n from demo order by n limit 10;
 n
--------
 0
 1
 2
 3
 4
 5
 6
 7
 8
 9
(10 rows)
Enter fullscreen mode Exit fullscreen mode

Don't think an ORDER BY will have to sort the rows, the query planner may opt for a physical structure that returns the them in order:

postgres=# explain select n from demo order by n limit 10;
                                        QUERY PLAN
------------------------------------------------------------------------------------------------
 Limit  (cost=0.42..0.77 rows=10 width=4)
   ->  Index Only Scan using demo_pkey on demo  (cost=0.42..34716.43 rows=1000000 width=4)
(2 rows)
Enter fullscreen mode Exit fullscreen mode

In SQL you declare the order you want, and the query planner knows which access methods retrieves them in order. The optimizer knows but you don't know. Except of course if you have read, and understood, the source code for the exact version you run.

YugabyteDB

In a distributed SQL database, there are good chances that the default order does not match anything you expect:

yugabyte=# create table demo (n int primary key);
CREATE TABLE

yugabyte=# insert into demo  select 1e6-n from generate_series(1,1e6) n;
INSERT 0 1000000

yugabyte=# select n from demo limit 10;

   n
-------------
 110359
 192735
 219128
 237047
 310517
 593962
 627995
 651891
 669921
 790562
(10 rows)

Enter fullscreen mode Exit fullscreen mode

This looks random. It shows an increasing order, but for a different reason than PostgreSQL - this does not depend on the insertion order. Of course, there's a logic. The default sharding method is a hash function on the first column of the primary key. We can query this hashing function ourselves with yb_hash_code()

yugabyte=# select min(h),max(h),avg(h), 
            count(*)::float/count(distinct h) per_hash from (
            select yb_hash_code( generate_series(1,1e6) ) h 
           ) v;

 min |  max  |        avg         |   per_hash
----------+-------+--------------------+---------------
   0 | 65535 | 32774.509179000000 | 15.2587890625
(1 row)
Enter fullscreen mode Exit fullscreen mode

This function can return one of the 65536 values from 0 to 65535. This is what is used to distribute rows into tablets in the distributed storage (DocDB).

Without an ORDER BY, the rows are returned ordered on this hash code first:

yugabyte=# select yb_hash_code(n), n from demo limit 10;

 yb_hash_code |   n
-------------------+--------
            0 | 110359
            0 | 192735
            0 | 219128
            0 | 237047
            0 | 310517
            0 | 593962
            0 | 627995
            0 | 651891
            0 | 669921
            0 | 790562
(10 rows)
Enter fullscreen mode Exit fullscreen mode

Look, If I query all rows with this specific hash code, they come back in order.

yugabyte=# select n from demo where yb_hash_code(n)=0;

   n
-------------
 110359
 192735
 219128
 237047
 310517
 593962
 627995
 651891
 669921
 790562
 792363
 819768
 891493
 984191
(14 rows)
Enter fullscreen mode Exit fullscreen mode

When understanding the way the rows are stored physically, and retrieved with a SeqScan, the order looks random but is not magic. My 1 million rows with 65536 hash values will have approximately 1e6/65536=15.2 values per hash code. What was apparently ordered when selecting 10 rows shows a different pattern when selecting more than the number of rows per hash code:

yugabyte=# select n, yb_hash_code(n) from demo limit 50;

   n    | yb_hash_code
-------------+--------------
 110359 |            0
 192735 |            0
 219128 |            0
 237047 |            0
 310517 |            0
 593962 |            0
 627995 |            0
 651891 |            0
 669921 |            0
 790562 |            0
 792363 |            0
 819768 |            0
 891493 |            0
 984191 |            0
  17012 |            1
  24685 |            1
 153595 |            1
 186378 |            1
 219742 |            1
 258869 |            1
 271029 |            1
 547922 |            1
 565568 |            1
 763430 |            1
 766123 |            1
 772002 |            1
 781840 |            1
 840822 |            1
 844655 |            1
 953917 |            1
 162485 |            2
 168413 |            2
 271551 |            2
 285516 |            2
 407063 |            2
 420509 |            2
 440160 |            2
 572540 |            2
 585722 |            2
 589471 |            2
 628271 |            2
 719191 |            2
 837125 |            2
 866379 |            2
 951013 |            2
 976519 |            2
 994652 |            2
    854 |            3
  57757 |            3
  70079 |            3
(50 rows)
Enter fullscreen mode Exit fullscreen mode

Internally, the YugabyteDB table rows are stored ordered on the document key, which is the primary key prefixed by the hash code. This brings the fast access to a key point or range: from the hash code, it determines the right tablet. And then it finds the row in the SSTable (Sorted Sequence Table) structure.

If I decide to shard on a range, even a SeqScan would return in order:

yugabyte=# drop table demo;
DROP TABLE

yugabyte=# create table demo (n int, primary key(n asc))
           split at values ( (333333),(666666) );
CREATE TABLE

yugabyte=# insert into demo  select 1e6-n from generate_series(1,1e6) n;
INSERT 0 1000000

yugabyte=# select n from demo limit 10;
 n
--------
 0
 1
 2
 3
 4
 5
 6
 7
 8
 9
(10 rows)

yugabyte=# explain analyze select n from demo limit 10;
                                                QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..1.00 rows=10 width=4) (actual time=0.788..0.796 rows=10 loops=1)
   ->  Seq Scan on demo  (cost=0.00..100.00 rows=1000 width=4) (actual time=0.787..0.791 rows=10 loops=1)
 Planning Time: 0.047 ms
 Execution Time: 0.855 ms
(4 rows)
Enter fullscreen mode Exit fullscreen mode

The Seq Scan shows an order similar to the Index Scan on the primary key:

yugabyte=# explain analyze select n from demo order by n limit 10;
                                                         QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..1.14 rows=10 width=4) (actual time=0.809..0.816 rows=10 loops=1)
   ->  Index Scan using demo_pkey on demo  (cost=0.00..114.00 rows=1000 width=4) (actual time=0.808..0.813 rows=10 loops=1)
 Planning Time: 0.066 ms
 Execution Time: 0.843 ms
(4 rows)
Enter fullscreen mode Exit fullscreen mode

This is because the table is actually stored in the primary key structure (Similar to Oracle Index Organized Table, or SQL Server Clustered Index, or MySQL InnoDB tables...)

I mentioned that the hash code has 65536:

yugabyte=# drop table demo;
DROP TABLE
yugabyte=# create table demo (n int primary key)
           split into 16 tablets;
CREATE TABLE
yugabyte=# insert into demo
           select n from generate_series(1,1e6) n;
INSERT 0 1000000
yugabyte=# select count(distinct yb_hash_code(n)) from demo;
 count
------------
 65536
(1 row)
Enter fullscreen mode Exit fullscreen mode

But of course the number of tablets is smaller. Just a few per nodes, to allow rebalancing when adding nodes.

If I want to know how many tabletsfrom the execution plan, I run an aggregate function that is pushed-down to each tablet so that the number of rows in the execution plan is the number of aggregations:

yugabyte=# explain analyze select count(*) from demo;

yugabyte=# explain analyze select count(*) from demo;
                                                  QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=102.50..102.51 rows=1 width=8) (actual time=984.031..984.031 rows=1 loops=1)
   ->  Seq Scan on demo  (cost=0.00..100.00 rows=1000 width=0) (actual time=357.419..984.010 rows=16 loops=1)
 Planning Time: 0.055 ms
 Execution Time: 984.145 ms
(4 rows)
Enter fullscreen mode Exit fullscreen mode

The rows=16 is from the 16 count(*) results coming from each tablets. I have 16 tablets here, with 65536/16=4096 hash codes per tablet.

It is interesting to understand how the rows are stored and fetched, because you can think about performance. But when it comes to run application queries, don't forget that SQL is a declarative language. If you want a sorted result, declare it with ORDER BY. And the query planner will figure out what to do.

Discussion (0)