DEV Community

Tomas@PawSQL
Tomas@PawSQL

Posted on

Order by random Optimization

Copyright © 2023 PawSQL

Problem

Functionrand(MySQL) or random(PostgreSQL) returns a random floating-point value in the range 0 to 1.0. It is useful to get a random sample of dataset using following query.

select * from orders order by rand() limit 1; 
Enter fullscreen mode Exit fullscreen mode

This works fine if customer table is less than 10,000 rows. But when you have 1,000,000 rows, the overhead for sorting the rows becomes unacceptable. The reason is quiet clear: we sort all rows, but only keep one.

MySQL Plan

-> Limit: 1 row(s)  (actual time=203.465..203.465 rows=1 loops=1)
    -> Sort: `rand()`, limit input to 1 row(s) per chunk  (actual time=203.464..203.464 rows=1 loops=1)
        -> Stream results  (cost=20540.71 rows=200128) (actual time=0.035..173.789 rows=201800 loops=1)
            -> Table scan on l  (cost=20540.71 rows=200128) (actual time=0.030..121.293 rows=201800 loops=1)
Enter fullscreen mode Exit fullscreen mode

PostgreSQL Plan

Limit  (cost=4962.22..4962.22 rows=1 width=66) (actual time=58.080..58.082 rows=1 loops=1)
  ->  Sort  (cost=4962.22..5391.96 rows=171898 width=66) (actual time=58.078..58.079 rows=1 loops=1)
        Sort Key: (random())
        Sort Method: top-N heapsort  Memory: 25kB
        ->  Seq Scan on orders l  (cost=0.00..4102.73 rows=171898 width=66) (actual time=0.010..26.818 rows=171898 loops=1)
Enter fullscreen mode Exit fullscreen mode

Solutions

  1. If there is an unique index on a numeric column with uniformly distributed values, the query can be rewrite to more efficient one to avoid full table scan and all row sorting. create unique index ord_idx_key on orders(o_orderkey)

Rewritten SQL

select
    *
from
    orders
where
    o_orderkey >= (
    select
        floor( RAND() * ((select MAX(o_orderkey) from orders)-(select MIN(o_orderkey) from orders)) + (select MIN(o_orderkey) from orders)))
order by
    o_orderkey
limit 10;
Enter fullscreen mode Exit fullscreen mode

Analyze Plan of MySQL

-> Limit: 10 row(s)  (cost=0.03 rows=3) (actual time=76.877..124.157 rows=10 loops=1)
    -> Filter: (orders.O_ORDERKEY >= floor(((rand() * <cache>(((select #3) - (select #4)))) + (select #5))))  (cost=0.03 rows=3) (actual time=76.876..124.155 rows=10 loops=1)
        -> Index scan on orders using PAW_IDX0334337551  (cost=0.03 rows=10) (actual time=0.015..111.731 rows=106410 loops=1)
        -> Select #3 (subquery in condition; run only once)
            -> Rows fetched before execution  (cost=0.00..0.00 rows=1) (actual time=0.000..0.000 rows=1 loops=1)
        -> Select #4 (subquery in condition; run only once)
            -> Rows fetched before execution  (cost=0.00..0.00 rows=1) (actual time=0.000..0.000 rows=1 loops=1)
        -> Select #5 (subquery in condition; run only once)
            -> Rows fetched before execution  (cost=0.00..0.00 rows=1) (actual time=0.000..0.000 rows=1 loops=1)
Enter fullscreen mode Exit fullscreen mode

Analyze Plan of PostgreSQL

Limit  (cost=1.27..1.44 rows=1 width=66) (actual time=27.840..27.847 rows=1 loops=1)
->  Nested Loop  (cost=1.27..9724.17 rows=57299 width=66) (actual time=27.838..27.845 rows=1 loops=1)
      Join Filter: ((t1.o_orderkey)::double precision >= (floor(((random() * (($1 - $3))::double precision) + ($5)::double precision))))
      Rows Removed by Join Filter: 91974
      ->  Index Scan using ord_idx_key on orders t1  (cost=0.29..6714.94 rows=171898 width=58) (actual time=0.022..9.406 rows=91975 loops=1)
      ->  Materialize  (cost=0.98..1.02 rows=1 width=8) (actual time=0.000..0.000 rows=1 loops=91975)
            ->  Result  (cost=0.98..1.01 rows=1 width=8) (actual time=0.112..0.117 rows=1 loops=1)
                  InitPlan 2 (returns $1)
                    ->  Result  (cost=0.32..0.33 rows=1 width=4) (actual time=0.080..0.082 rows=1 loops=1)
                          InitPlan 1 (returns $0)
                            ->  Limit  (cost=0.29..0.32 rows=1 width=4) (actual time=0.079..0.079 rows=1 loops=1)
                                  ->  Index Only Scan Backward using ord_idx_key on orders  (cost=0.29..3704.51 rows=171898 width=4) (actual time=0.078..0.079 rows=1 loops=1)
                                        Index Cond: (o_orderkey IS NOT NULL)
                                        Heap Fetches: 0
                  InitPlan 4 (returns $3)
                    ->  Result  (cost=0.32..0.33 rows=1 width=4) (actual time=0.013..0.014 rows=1 loops=1)
                          InitPlan 3 (returns $2)
                            ->  Limit  (cost=0.29..0.32 rows=1 width=4) (actual time=0.013..0.013 rows=1 loops=1)
                                  ->  Index Only Scan using ord_idx_key on orders orders_1  (cost=0.29..3704.51 rows=171898 width=4) (actual time=0.013..0.013 rows=1 loops=1)
                                        Index Cond: (o_orderkey IS NOT NULL)
                                        Heap Fetches: 0
                  InitPlan 6 (returns $5)
                    ->  Result  (cost=0.32..0.33 rows=1 width=4) (actual time=0.007..0.009 rows=1 loops=1)
                          InitPlan 5 (returns $4)
                            ->  Limit  (cost=0.29..0.32 rows=1 width=4) (actual time=0.006..0.007 rows=1 loops=1)
                                  ->  Index Only Scan using ord_idx_key on orders orders_2  (cost=0.29..3704.51 rows=171898 width=4) (actual time=0.006..0.007 rows=1 loops=1)
                                        Index Cond: (o_orderkey IS NOT NULL)
                                        Heap Fetches: 0
Enter fullscreen mode Exit fullscreen mode
  1. Otherwise, we need to create a map table to solve this problem.
   create table orders_key_map ( row_id int not NULL primary key, o_orderkey int not null);
   SET @id = 0;
   INSERT INTO orders_key_map SELECT @id := @id + 1, id FROM holes;
Enter fullscreen mode Exit fullscreen mode
   > select * from orders_key_map;
   +--------+-----------+
   | row_id | o_orderkey |
   +--------+-----------+
   |      1 |         100 |
   |      2 |         102 |
   |      3 |         300 |
   |      4 |         833 |
   |      5 |        1116 |
   +--------+-----------+
Enter fullscreen mode Exit fullscreen mode

And then we can use following SQL to get a random sample of orders table:

   select
    *
   from
    orders o, orders_key_map m
   where
       o.o_orderkey = m.o_orderkey
    m.row_id >= (
    select
        floor( RAND() * ((select MAX(row_id) from orders_key_map)-(select MIN(row_id) from orders_key_map)) + (select MIN(row_id) from orders_key_map)))
   order by
    row_id
   limit 1;
Enter fullscreen mode Exit fullscreen mode

About PawSQL

PawSQL is dedicated to automatic and intelligent database performance optimization. The products provided by PawSQL include:

  • PawSQL Cloud, an online automated SQL optimization tool that supports SQL auditing, intelligent query rewriting, cost-based index recommendations, suitable for database administrators and data application developers.
  • PawSQL Advisor, an IntelliJ plugin that is suitable for data application developers and can be installed via the IDEA/DataGrip marketplace by searching for "PawSQL Advisor" by name.
  • PawSQL Engine, which is the backend optimization engine of the PawSQL series of products, and provides optimization capability through http/json interfaces. PawSQL Engine is provided for deployment and installation as a docker image.

Contact Us

Site: https://app.pawslq.com
Email: service@pawsql.com
Twitter: https://twitter.com/pawsql

Top comments (0)