Maintain a duplicate covering index in a region

#yugabytedb #distributed #sql #database

In this series I'm showing how we can deal with a geo-distributed database. We cannot change the latency between regions, as this is a physic constraint, but we can take care of data placement. In Part 1 I defined a preferred region (leader affinity) to get fast reads from there. In Part 2 I used "follower reads" to get fast reads from the other region, but accepting a lag in data freshness. In this Part 3 I'll show a technique to get fresh reads from this remote region. Of course, there is a trade-off: if the reads are consistent without cross-region latency, the writes will have a higher latency.

Read/Write from us-east with all leaders in eu-west

I have created the same cluster, with same tables, as in Part 1. I'm running the same four queries, when connected to us-east-2,to check that the response time is still the same:

for i in {1..3} ; do
cat <<'SQL'
-- read from one tablet:
explain analyze select count(*),min(k),max(k) from franck_range where k between 100 and 100+99;
-- write in one tablet:
explain analyze update franck_range set v=v+1 where k between 100 and 100+99;
-- read from all tablets:
explain analyze select count(*),min(k),max(k) from franck_hash where k between 100 and 100+99;
-- write in all tablets:
explain analyze update franck_hash set v=v+1 where k between 100 and 100+99;
SQL
done | ./yb-software/yugabyte-2.11.3.0-b3-centos-x86_64/bin/ysqlsh -h ip-172-161-30-45.us-east-2.compute.internal | grep "Execution Time" | tail -4

The grep on response time, for the last executions to avoid first parse overhead, shows:

 Execution Time:   78.439 ms -- read one range
 Execution Time:  316.680 ms -- write one range
 Execution Time:  468.908 ms -- read many hash
 Execution Time: 1302.116 ms -- write many hash

No surprises, this is the same. This cluster is optimized for access from Europe and this comes from a connection in US. The reads and writes have to go to the leader tablet peer, with a transatlantic call, and the writes have an additional sync for the quorum. I'm using hash and range sharding to show a read from one or many tablets.

Placement info

In the master console, we can see the placement info for each table or index. Both table inherit their placement from my cluster topology: RF=3 replication across the 3 regions, with leader affinity in Europe:

live_replicas {
  num_replicas: 3
  placement_blocks {
    cloud_info {
      placement_cloud: "aws"
      placement_region: "ap-south-1"
      placement_zone: "ap-south-1a"
    }
    min_num_replicas: 1
  }
  placement_blocks {
    cloud_info {
      placement_cloud: "aws"
      placement_region: "eu-west-1"
      placement_zone: "eu-west-1a"
    }
    min_num_replicas: 1
  }
  placement_blocks {
    cloud_info {
      placement_cloud: "aws"
      placement_region: "us-east-2"
      placement_zone: "us-east-2a"
    }
    min_num_replicas: 1
  }
  placement_uuid: "4d9e9020-cdca-4c9c-869b-f9d621212fb5"
}
affinitized_leaders {
  placement_cloud: "aws"
  placement_region: "eu-west-1"
  placement_zone: "eu-west-1a"
}

I defined this at cluster level, but tablespaces can be used to define a different placement for some tables or indexes.

Tablespace

In YugabyteDB, tablespaces are used to define a replication factor and replica placement. I define a "us_east_2a" tablespace for this region, the one from which I want faster reads.

create tablespace us_east_2a with (replica_placement='
{
   "num_replicas":1,
   "placement_blocks":[
      {
         "cloud":"aws",
         "region":"us-east-2",
         "zone":"us-east-2a",
         "min_num_replicas":1
      }
   ]
}');

I've created a RF=1 replica here because I have only one node for this lab. A production cluster would have multiple nodes in each region, and probably a RF=3 over 3 availability zones. Because even if I will define a redundant structure here, I want it to be highly available.

Duplicate covering index

I want a redundant (duplicate) sync copy of my table. In YugabyteDB, tables and indexes are very similar. They are all stored as LSM trees. An index is maintained when the table is updated. And an index can include more columns that the index key. Here is how I define this covering index for my two tables:

create unique index franck_hash_us_east_2a on 
franck_hash (k hash) include (v) 
tablespace us_east_2a 
;

create unique index franck_range_us_east_2a on 
franck_range (k asc) include (v) 
tablespace us_east_2a 
;

By attributing to the tablespace I've created, they will obey to the placement rules: RF=1 (leader only without followers) and in this zone only.

Response time

I've run the same as I did above, and here are the results:

 Execution Time:     1.116 ms -- read one range
 Execution Time: 15824.046 ms -- write one range
 Execution Time:   472.855 ms -- read many hash
 Execution Time: 15887.314 ms -- write many hash

The first one, reading from a range, is fast. This is a local read, from the index I've just created. The writes take much longer because they have to maintain this index. This is the price of this features that slows down the writes to accelerate the reads. The technique I'm showing here is for small or medium size fairly static tables, like reference tables. The third execution is a read which should be served by the local index, so why is it the same time as without the index? Let's see the execution plans

Execution plan

Here is the detail about the first execution, reading one range, where we clearly see that my local index Index Only Scan using franck_range_us_east_2a was used:

yugabyte=# explain analyze select count(*),min(k),max(k) from franck_range where k between 100 and 100+99;
                                                                    QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=4.12..4.13 rows=1 width=16) (actual time=1.068..1.068 rows=1 loops=1)
   ->  Index Only Scan using franck_range_us_east_2a on franck_range  (cost=0.00..4.11 rows=1 width=4) (actual time=1.005..1.048 rows=100 loops=1)
         Index Cond: ((k >= 100) AND (k <= 199))
         Heap Fetches: 0
 Planning Time: 0.105 ms
 Execution Time: 1.116 ms
(6 rows)

For this technique, Index Only Scan is the goal, which is why I defined a covering index, because any hop to the table would have to go to the region with the tablet leader. One millisecond to read 100 rows.

This read acceleration from a local synchronized copy shows its maintenance overhead on writes:

yugabyte=# explain analyze update franck_range set v=v+1 where k between 100 and 100+99;
                                                                QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
 Update on franck_range  (cost=0.00..4.12 rows=1 width=72) (actual time=15823.397..15823.397 rows=0 loops=1)
   ->  Index Scan using franck_range_pkey on franck_range  (cost=0.00..4.12 rows=1 width=72) (actual time=78.977..80.288 rows=100 loops=1)
         Index Cond: ((k >= 100) AND (k <= 199))
 Planning Time: 0.072 ms
 Execution Time: 15824.046 ms
(5 rows)

The execution plan shows the query part only, the writes are done in the node in Europe even when connected to US, but the local index in US is synced back, in addition to one of the followers, which is probably the one in US which is nearer than the one in Asia.

Now, let's see why the reads from many hash didn't accelerate:

yugabyte=# explain analyze select count(*),min(k),max(k) from franck_hash where k between 100 and 100+99;
                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=112.50..112.51 rows=1 width=16) (actual time=472.797..472.797 rows=1 loops=1)
   ->  Seq Scan on franck_hash  (cost=0.00..105.00 rows=1000 width=4) (actual time=78.874..472.772 rows=100 loops=1)
         Filter: ((k >= 100) AND (k <= 199))
         Rows Removed by Filter: 800
 Planning Time: 0.083 ms
 Execution Time: 472.855 ms
(6 rows)

The reason is simple, the query planner didn't use my local index. Even if the query planner is cluster-aware, there are still cases where the cost is not precise enough to take the decision. I'm doing this with YugabyteDB 2.11 and this will be improved in future versions.

Meanwhile, we have pg_hint_plan to force the access path:

yugabyte=# explain analyze /*+ IndexOnlyScan(franck_hash franck_hash_us_east_2a) */ select count(*),min(k),max(k) from franck_hash where k between 100 and 100+99;
                                                                      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=120.50..120.51 rows=1 width=16) (actual time=4.797..4.797 rows=1 loops=1)
   ->  Index Only Scan using franck_hash_us_east_2a on franck_hash  (cost=0.00..113.00 rows=1000 width=4) (actual time=1.817..4.779 rows=100 loops=1)
         Index Cond: ((k >= 100) AND (k <= 199))
         Heap Fetches: 0
 Planning Time: 0.138 ms
 Execution Time: 4.837 ms
(6 rows)

The Index Only Scan on the local index is fast: 4 milliseconds to read 100 rows. It is easy to see why the query planner didn't choose it: the estimated cost=120.51 is a bit higher than the cost=112.50 for Seq Scan. Because, with current (YB 2.11) optimizer, the cross-region factor is less than the index to full-scan factor.

Another solution than the hint is to disable SeqScan when you want to read from local indexes:

yugabyte=# explain analyze select count(*),min(k),max(k) from franck_hash where k between 100 and 100+99;
                                                                      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=120.50..120.51 rows=1 width=16) (actual time=5.523..5.524 rows=1 loops=1)
   ->  Index Only Scan using franck_hash_us_east_2a on franck_hash  (cost=0.00..113.00 rows=1000 width=4) (actual time=1.270..5.504 rows=100 loops=1)
         Index Cond: ((k >= 100) AND (k <= 199))
         Heap Fetches: 0
 Planning Time: 0.085 ms
 Execution Time: 5.564 ms

I'll explain this "duplicate covering index" technique, and the cost estimation, in a future post. This series will also cover some asynchronous replication techniques.