<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jim Hatcher</title>
    <description>The latest articles on DEV Community by Jim Hatcher (@jhatcher9999).</description>
    <link>https://dev.to/jhatcher9999</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F405016%2F1cd088be-6217-49c7-9691-c9d0c34d12d3.png</url>
      <title>DEV Community: Jim Hatcher</title>
      <link>https://dev.to/jhatcher9999</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jhatcher9999"/>
    <language>en</language>
    <item>
      <title>Repaving CockroachDB in AWS EC2</title>
      <dc:creator>Jim Hatcher</dc:creator>
      <pubDate>Tue, 25 Jul 2023 17:21:51 +0000</pubDate>
      <link>https://dev.to/jhatcher9999/repaving-crdb-in-aws-ec2-2k24</link>
      <guid>https://dev.to/jhatcher9999/repaving-crdb-in-aws-ec2-2k24</guid>
      <description>&lt;h1&gt;
  
  
  Repaving CockroachDB Nodes
&lt;/h1&gt;

&lt;p&gt;"Repaving" nodes refers to the replacement of the underlying Linux/Unix nodes that are running CockroachDB -- usually in a cloud deployment.  It is a procedure followed by some companies as a way to make sure that they're using the newest (and presumably most secure) versions of O/S images.  It is thought to be easier than applying O/S patches.  Because CockroachDB is a highly available and resilient distributed database, repaving nodes where CockroachDB is running is a relatively easy process.&lt;/p&gt;

&lt;h3&gt;
  
  
  Approaches to repaving
&lt;/h3&gt;

&lt;p&gt;There are generally two approaches to repaving CockroachDB nodes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;decommission nodes and add them back one-at-a-time&lt;/li&gt;
&lt;li&gt;stop each node, detach the block storage, re-attach the block storage to a newly created node and start up CockroachDB again&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are pros/cons to each.  Let's examine in more detail.&lt;/p&gt;

&lt;h4&gt;
  
  
  Decommissioning
&lt;/h4&gt;

&lt;p&gt;Let's assume we have a five-node cluster and that we're using the default replication factor of three.  This means each data range has three replicas, spread across the five nodes.  We can decommission a node by running &lt;code&gt;cockroach node decommission &amp;lt;nodeid&amp;gt;&lt;/code&gt;.  When we do this, the node being decommissioned will start to move its replicas to other nodes within the cluster.  This process can take several hours, depending on how much data each node stores.  During this time, data streaming increases CPU, memory, disk, and network usage on the nodes involved (both the node being decommissioned and any nodes -- likely all the others -- receiving its replicas).  After the node has finished off-loading its replicas, we can add a new node.  The cluster will recognize the new node and start sending it replicas as it re-balances.  The nice thing about this approach is that the cluster never runs in an "under-replicated state": we always have three replicas of every range.  The overall process can take a while as you work through the nodes one-by-one.&lt;/p&gt;
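The flow above can be sketched as follows (the node ID and host IP are illustrative; look up the real node IDs with the status command first):

```shell
# Find the ID of the node to repave
cockroach node status --certs-dir=certs --host=172.31.11.189

# Drain its replicas to the other nodes (this can take hours on large stores)
cockroach node decommission 3 --certs-dir=certs --host=172.31.11.189

# When the command reports zero replicas remaining, stop the cockroach
# process on that host, then start a fresh node with the usual start command.
```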

&lt;p&gt;It's worth noting that this decommissioning method is not viable on a three-node cluster with RF=3, because there are no additional nodes available to accept the replicas.  If you decommission a node in that topology, the replicas have nowhere to go and the decommission will never complete.&lt;/p&gt;

&lt;h4&gt;
  
  
  Stop and replace
&lt;/h4&gt;

&lt;p&gt;The other common approach is to stop nodes and swap out the underlying hardware.  One node at a time, you stop the cockroach process (e.g., &lt;code&gt;kill -9 &amp;lt;cockroach pid&amp;gt;&lt;/code&gt;).  While the node is stopped, the CockroachDB cluster considers it "suspect," and any ranges with replicas on that node are considered "under-replicated."  An under-replicated state implies that quorum is still maintained, so we can still read and write to the range.  CockroachDB will do nothing to remedy the under-replicated state for a period called the "time till store dead" window (the setting is &lt;code&gt;server.time_until_store_dead&lt;/code&gt; and the default value is 5 minutes).  If the node comes back up during this window, CockroachDB puts it back into a healthy state by catching its ranges up to full consistency via Raft logs sent from the leaseholder(s).&lt;/p&gt;

&lt;p&gt;So, the idea is that the "time till dead" window gives us time to detach the data device from the host, replace the host, re-attach the disk, and restart the CockroachDB process.  It is common practice to temporarily extend the "time till dead" window to a more manageable timeframe like 15 or 30 minutes.  The advantage of this approach is that no streaming of data replicas has to occur, so it's much quicker to cycle through the cluster nodes.  The potential risk is that while some ranges are in an under-replicated state, losing another node (due to an unexpected outage, perhaps) could cost you quorum on some ranges, and therefore the ability to read and write to parts of the database.&lt;/p&gt;

&lt;p&gt;This risk, however, can be mitigated.  One mitigating factor is that this process is usually automated and therefore runs quickly.  Another step is to raise the replication factor (RF) of your critical data from the default of 3 to 5.  With RF=5, a quorum is 3 out of 5 (rather than 2 out of 3).  In our repaving scenario, some ranges will be under-replicated at 4 out of 5; to become unavailable, a range would have to lose 2 more replicas, which is unlikely.&lt;/p&gt;
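As a sketch, the replication factor can be raised through a zone configuration; the database name below is illustrative:

```sql
-- Raise the default replication factor for the whole cluster
ALTER RANGE default CONFIGURE ZONE USING num_replicas = 5;

-- Or raise it only for a database holding critical data (name is illustrative)
ALTER DATABASE critical_db CONFIGURE ZONE USING num_replicas = 5;
```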

&lt;p&gt;My personal preference is to run in Production with RF=5 and to use the stop/detach/reattach method -- hopefully in a fully-tested and automated manner.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example of the "stop/detach/reattach" method in EC2
&lt;/h3&gt;

&lt;p&gt;To exemplify this method, let's spin up four EC2 instances to make a three-node CockroachDB cluster in AWS EC2.  You can follow the general documentation for deploying CockroachDB in AWS EC2 -- docs here (&lt;a href="https://www.cockroachlabs.com/docs/dev/deploy-cockroachdb-on-aws.html"&gt;https://www.cockroachlabs.com/docs/dev/deploy-cockroachdb-on-aws.html&lt;/a&gt;) -- with a few slight deviations.  One deviation we'll make is to attach an additional EBS volume to each node to be used to store all data/logs.  We can still use the normally attached EBS volume to house the CockroachDB executable and TLS certificates.&lt;/p&gt;

&lt;p&gt;I suggest you spin up four EC2 instances and for each one, record the private IP, the public IP, the public host name, the private host name, and the identifiers of the volumes of each of the drives.  All of this information is available from the AWS control plane.  For instance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Node name (for all): hatcher-repave-test
Availability Zone: us-east-2a

N1
3.145.68.78
ec2-3-145-68-78.us-east-2.compute.amazonaws.com
172.31.11.189
ip-172-31-11-189.us-east-2.compute.internal
vol-0821f7fb596be5f25 (100GB)
vol-07c96d7b41612d9b7 (8GB)

N2
3.139.85.111
ec2-3-139-85-111.us-east-2.compute.amazonaws.com
172.31.8.166
ip-172-31-8-166.us-east-2.compute.internal
vol-03da8d9790bc8e3ca (100GB)
vol-00a17f674809e3f40 (8GB)

N3
3.145.180.186
ec2-3-145-180-186.us-east-2.compute.amazonaws.com
172.31.15.100
ip-172-31-15-100.us-east-2.compute.internal
vol-0fda8d9790bc8e2ef (100GB)
vol-0e3b89cebb10f84df (8GB)

N4 (Note: the fourth node won't need a separate data volume)
3.138.181.122
ec2-3-138-181-122.us-east-2.compute.amazonaws.com
172.31.13.143
ip-172-31-13-143.us-east-2.compute.internal
vol-0f2c700f0d9f9749b (100GB)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Make sure that the EC2 instances have a network security group that allows inbound traffic on TCP/26257 (SQL), TCP/8080 (DB Console), and TCP/22 (for SSH-ing into each node).&lt;/p&gt;
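If you prefer the CLI, the rules can be added with commands along these lines (the security group ID and CIDR ranges are placeholders for your own values):

```shell
# SQL traffic from within the VPC
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
  --protocol tcp --port 26257 --cidr 172.31.0.0/16

# DB Console
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
  --protocol tcp --port 8080 --cidr 172.31.0.0/16

# SSH from your admin network
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
  --protocol tcp --port 22 --cidr 203.0.113.0/24
```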

&lt;p&gt;Then, for each node, we need to run some Linux setup in order to make sure that our data drive is available for use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#Check to see the name of the device and whether it has a file system installed&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;lsblk &lt;span class="nt"&gt;-f&lt;/span&gt;

&lt;span class="c"&gt;#it shouldn't have a file system at first, so let's make a file system (assuming the device is called /dev/xvdb)&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;mkfs &lt;span class="nt"&gt;-t&lt;/span&gt; xfs /dev/xvdb

&lt;span class="c"&gt;#verify there is a file system now&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;lsblk &lt;span class="nt"&gt;-f&lt;/span&gt;

&lt;span class="c"&gt;#make the /data directory so we can mount to it&lt;/span&gt;
&lt;span class="nb"&gt;sudo mkdir&lt;/span&gt; /data

&lt;span class="c"&gt;#manually mount the volume&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;mount /dev/xvdb /data

&lt;span class="c"&gt;#backup the existing fstab file&lt;/span&gt;
&lt;span class="nb"&gt;sudo cp&lt;/span&gt; /etc/fstab /etc/fstab.orig

&lt;span class="c"&gt;#get the UUID of the volume&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;blkid

&lt;span class="c"&gt;#using the UUID of the volume captured in the previous step, edit the fstab so the volume will auto-mount again after subsequent reboots&lt;/span&gt;
&lt;span class="c"&gt;# add: UUID=b26331da-9d38-4b0a-af09-3c3808b8313e     /data       xfs    defaults,nofail   0   2&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;vim /etc/fstab

&lt;span class="c"&gt;#verify that the fstab edit was done correctly by unmounting and remounting the volume&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;umount /data
&lt;span class="nb"&gt;sudo &lt;/span&gt;mount &lt;span class="nt"&gt;-a&lt;/span&gt;

&lt;span class="c"&gt;#verify that you see the drive&lt;/span&gt;
&lt;span class="nb"&gt;df&lt;/span&gt; &lt;span class="nt"&gt;-h&lt;/span&gt;

&lt;span class="c"&gt;#change the owner to ec2-user&lt;/span&gt;
&lt;span class="nb"&gt;sudo chown &lt;/span&gt;ec2-user /data

&lt;span class="c"&gt;#if you need to change the permission of the drive, do that here&lt;/span&gt;
&lt;span class="c"&gt;#sudo chmod 400 /data&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that we have our EBS-based data volume available, we need to create a node cert for each node.  We can follow the instructions from the docs to create the CA (certificate authority) cert/key pair and the root user's client cert/key pair; we only need to create one of these for the whole cluster.  But we want to create a separate node cert for each of the nodes.  The easiest way to do this is to create the CA cert/key and the root client cert/key and copy those files to each of the nodes.  In practice, you only need the ca.crt on the nodes -- the ca.key should be kept in a safe place off the cluster -- and you only need the root client cert/key files in the locations from which you will be authenticating to the cluster.  But for this exercise, I recommend copying all four files to each of the nodes.&lt;/p&gt;

&lt;p&gt;You'll need to copy the &lt;code&gt;cockroach&lt;/code&gt; executable to each of the nodes before running the next step -- this is described in the main doc.&lt;/p&gt;

&lt;p&gt;Then, run commands like the following on each node to create the node cert/key pair files.  If you're planning to expose the nodes via an AWS-based load balancer, include the LB's IP and its hostname (if you made one) in each of these commands, too.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# node 1 command&lt;/span&gt;
cockroach cert create-node &lt;span class="se"&gt;\&lt;/span&gt;
172.31.11.189 &lt;span class="se"&gt;\&lt;/span&gt;
3.145.68.78 &lt;span class="se"&gt;\&lt;/span&gt;
ip-172-31-11-189.us-east-2.compute.internal &lt;span class="se"&gt;\&lt;/span&gt;
ec2-3-145-68-78.us-east-2.compute.amazonaws.com &lt;span class="se"&gt;\&lt;/span&gt;
localhost &lt;span class="se"&gt;\&lt;/span&gt;
127.0.0.1 &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--certs-dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;certs &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--ca-key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;my-safe-directory/ca.key

&lt;span class="c"&gt;# node 2 command&lt;/span&gt;
cockroach cert create-node &lt;span class="se"&gt;\&lt;/span&gt;
172.31.8.166 &lt;span class="se"&gt;\&lt;/span&gt;
3.139.85.111 &lt;span class="se"&gt;\&lt;/span&gt;
ip-172-31-8-166.us-east-2.compute.internal &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="k"&gt;*&lt;/span&gt;.us-east-2.compute.internal &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="k"&gt;*&lt;/span&gt;.us-east-2.compute.amazonaws.com &lt;span class="se"&gt;\&lt;/span&gt;
localhost &lt;span class="se"&gt;\&lt;/span&gt;
127.0.0.1 &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--certs-dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;certs &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--ca-key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;my-safe-directory/ca.key

&lt;span class="c"&gt;# node 3 command&lt;/span&gt;
cockroach cert create-node &lt;span class="se"&gt;\&lt;/span&gt;
172.31.15.100 &lt;span class="se"&gt;\&lt;/span&gt;
3.145.180.186 &lt;span class="se"&gt;\&lt;/span&gt;
ip-172-31-15-100.us-east-2.compute.internal &lt;span class="se"&gt;\&lt;/span&gt;
ec2-3-145-180-186.us-east-2.compute.amazonaws.com &lt;span class="se"&gt;\&lt;/span&gt;
localhost &lt;span class="se"&gt;\&lt;/span&gt;
127.0.0.1 &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--certs-dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;certs &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--ca-key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;my-safe-directory/ca.key

&lt;span class="c"&gt;# node 4 command&lt;/span&gt;
cockroach cert create-node &lt;span class="se"&gt;\&lt;/span&gt;
172.31.13.143 &lt;span class="se"&gt;\&lt;/span&gt;
3.138.181.122 &lt;span class="se"&gt;\&lt;/span&gt;
ip-172-31-13-143.us-east-2.compute.internal &lt;span class="se"&gt;\&lt;/span&gt;
ec2-3-138-181-122.us-east-2.compute.amazonaws.com &lt;span class="se"&gt;\&lt;/span&gt;
localhost &lt;span class="se"&gt;\&lt;/span&gt;
127.0.0.1 &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--certs-dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;certs &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--ca-key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;my-safe-directory/ca.key
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We're going to start a three-node CockroachDB cluster (even though we have four nodes -- we'll use that fourth node later).&lt;/p&gt;

&lt;p&gt;Start CockroachDB on nodes 1-3 by running the &lt;code&gt;start&lt;/code&gt; command.  Notice that we're explicitly specifying the &lt;code&gt;--store=/data&lt;/code&gt; directory so that CockroachDB will store all data and logs on our separately-mounted data volume (this is slightly different from the doc instructions).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#node 1 command&lt;/span&gt;
cockroach start &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--certs-dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;certs &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--advertise-addr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;172.31.11.189 &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--join&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;172.31.11.189,172.31.8.166 &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--cache&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;.25 &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--max-sql-memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;.25 &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--store&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/data &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--background&lt;/span&gt;

&lt;span class="c"&gt;#node 2 command&lt;/span&gt;
cockroach start &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--certs-dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;certs &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--advertise-addr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;172.31.8.166 &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--join&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;172.31.11.189,172.31.8.166 &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--cache&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;.25 &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--max-sql-memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;.25 &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--store&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/data &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--background&lt;/span&gt;

&lt;span class="c"&gt;#node 3 command&lt;/span&gt;
cockroach start &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--certs-dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;certs &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--advertise-addr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;172.31.15.100 &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--join&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;172.31.11.189,172.31.8.166 &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--cache&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;.25 &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--max-sql-memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;.25 &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--store&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/data &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--background&lt;/span&gt;

&lt;span class="c"&gt;#node 4 command (don't run yet)&lt;/span&gt;
cockroach start &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--certs-dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;certs &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--advertise-addr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;172.31.13.143 &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--join&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;172.31.11.189,172.31.8.166 &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--cache&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;.25 &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--max-sql-memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;.25 &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--store&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/data &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--background&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The docs will have you run an &lt;code&gt;init&lt;/code&gt; command (from any one of the nodes) and then your CockroachDB cluster will be up and running.  We could at this point initialize one of the canned CockroachDB workloads in order to put some data into the cluster.  Or we could create a simple database and table and manually insert a few records.&lt;/p&gt;
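For example (run the init once, from any one node; the sample table is only there so we have data to verify against after the repave):

```shell
# Initialize the cluster (one time only)
cockroach init --certs-dir=certs --host=172.31.11.189

# Create a little data to check on later
cockroach sql --certs-dir=certs --host=172.31.11.189 \
  --execute="CREATE TABLE defaultdb.kv (k INT PRIMARY KEY, v STRING); INSERT INTO defaultdb.kv VALUES (1, 'hello');"
```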

&lt;p&gt;Now that we are in a steady state, let's start the repaving part.&lt;/p&gt;

&lt;p&gt;(1) Extend the "time till dead" window from the default of 5 minutes to 30 minutes.  Execute this command from a SQL shell on a running node (e.g., &lt;code&gt;cockroach sql --certs-dir=certs --host=172.31.11.189&lt;/code&gt;)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="k"&gt;cluster&lt;/span&gt; &lt;span class="n"&gt;setting&lt;/span&gt; &lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;time_until_store_dead&lt;/span&gt; &lt;span class="k"&gt;to&lt;/span&gt; &lt;span class="s1"&gt;'30m0s'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;show&lt;/span&gt; &lt;span class="k"&gt;cluster&lt;/span&gt; &lt;span class="n"&gt;setting&lt;/span&gt; &lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;time_until_store_dead&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(2) Verify that all the nodes are healthy and running.  You can also verify that there are currently no under-replicated ranges via a SQL query (below) or in the DB Console's main page.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cockroach node status &lt;span class="nt"&gt;--certs-dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;certs &lt;span class="nt"&gt;--host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;172.31.13.143
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Under replicated (should be zero)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="s1"&gt;'ranges.underreplicated'&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;DECIMAL&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="n"&gt;INT8&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;ranges_underreplicated&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;crdb_internal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kv_store_status&lt;/span&gt; &lt;span class="n"&gt;S&lt;/span&gt;
&lt;span class="k"&gt;INNER&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;crdb_internal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gossip_liveness&lt;/span&gt; &lt;span class="n"&gt;L&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;S&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;node_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;L&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;node_id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;L&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;decommissioning&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(3) Stop N3&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#kill all the processes running CRDB&lt;/span&gt;
killall &lt;span class="nt"&gt;-9&lt;/span&gt; cockroach

&lt;span class="c"&gt;#verify it's not running anymore&lt;/span&gt;
ps aux | &lt;span class="nb"&gt;grep &lt;/span&gt;cockroach
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(4) Unmount the data drive on N3&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;umount /data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(5) Detach the volume from N3 by using the AWS Control Plane or the AWS CLI.  Having the volume identifiers recorded earlier is very helpful for this step and the next.&lt;/p&gt;

&lt;p&gt;(6) Attach the volume to N4 by using the AWS Control Plane or the AWS CLI.&lt;/p&gt;
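Steps (5) and (6) can be sketched with the AWS CLI; the volume ID is N3's 100GB data volume from the list above, while the instance ID and device name are placeholders (EBS device naming varies by instance type):

```shell
# Detach the data volume from N3
aws ec2 detach-volume --volume-id vol-0fda8d9790bc8e2ef

# Wait for it to become available, then attach it to N4
aws ec2 wait volume-available --volume-ids vol-0fda8d9790bc8e2ef
aws ec2 attach-volume --volume-id vol-0fda8d9790bc8e2ef \
  --instance-id i-0123456789abcdef0 --device /dev/sdb
```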

&lt;p&gt;(7) Mount the volume in N4.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#Check to see the name of the device and verify that it has a file system installed -- we expect it to already have a file system&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;lsblk &lt;span class="nt"&gt;-f&lt;/span&gt;

&lt;span class="c"&gt;#make the /data directory so we can mount to it&lt;/span&gt;
&lt;span class="nb"&gt;sudo mkdir&lt;/span&gt; /data

&lt;span class="c"&gt;#manually mount the volume&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;mount /dev/xvdb /data

&lt;span class="c"&gt;#backup the existing fstab file&lt;/span&gt;
&lt;span class="nb"&gt;sudo cp&lt;/span&gt; /etc/fstab /etc/fstab.orig

&lt;span class="c"&gt;#get the UUID of the volume&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;blkid

&lt;span class="c"&gt;#edit the fstab so we will get the mount again after rebooting&lt;/span&gt;
&lt;span class="c"&gt;# add: UUID=b26331da-9d38-4b0a-af09-3c3808b8313e     /data       xfs    defaults,nofail   0   2&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;vim /etc/fstab

&lt;span class="c"&gt;#verify that the fstab edit was done correctly by unmounting and remounting the volume&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;umount /data
&lt;span class="nb"&gt;sudo &lt;/span&gt;mount &lt;span class="nt"&gt;-a&lt;/span&gt;

&lt;span class="c"&gt;#verify that you see the drive&lt;/span&gt;
&lt;span class="nb"&gt;df&lt;/span&gt; &lt;span class="nt"&gt;-h&lt;/span&gt;

&lt;span class="c"&gt;#verify that there is data in the volume&lt;/span&gt;
&lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nt"&gt;-alh&lt;/span&gt; /data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(8) Start N4&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cockroach start &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--certs-dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;certs &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--advertise-addr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;172.31.13.143 &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--join&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;172.31.11.189,172.31.8.166 &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--cache&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;.25 &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--max-sql-memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;.25 &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--store&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/data &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--background&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;NOTE: CockroachDB does not require the replacement node to have the same IP as the node it replaces -- in this example, they have different IPs.  The node's identity lives in the store data on the volume, not in its IP address.&lt;/p&gt;

&lt;p&gt;(9) Verify that all nodes are up, data is available, and we have no under-replicated ranges.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cockroach node status &lt;span class="nt"&gt;--certs-dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;certs &lt;span class="nt"&gt;--host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;172.31.13.143
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Under replicated (should be zero)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="s1"&gt;'ranges.underreplicated'&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;DECIMAL&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="n"&gt;INT8&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;ranges_underreplicated&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;crdb_internal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kv_store_status&lt;/span&gt; &lt;span class="n"&gt;S&lt;/span&gt;
&lt;span class="k"&gt;INNER&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;crdb_internal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gossip_liveness&lt;/span&gt; &lt;span class="n"&gt;L&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;S&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;node_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;L&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;node_id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;L&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;decommissioning&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(10) Reset the "time till dead" setting.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="k"&gt;cluster&lt;/span&gt; &lt;span class="n"&gt;setting&lt;/span&gt; &lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;time_until_store_dead&lt;/span&gt; &lt;span class="k"&gt;to&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(11) You can delete N3 and the main EBS volume (not the data volume) associated with N3 via the AWS control plane.&lt;/p&gt;

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;Repaving nodes is a highly automatable activity that can be used to satisfy security requirements -- especially in security-sensitive, cloud-based organizations.  CockroachDB tolerates this type of operation well because of its distributed, resilient, and highly available design.&lt;/p&gt;

</description>
      <category>cockroachdb</category>
      <category>aws</category>
      <category>repaving</category>
    </item>
    <item>
      <title>Jim's Guide to CockroachDB Naming Standards</title>
      <dc:creator>Jim Hatcher</dc:creator>
      <pubDate>Thu, 30 Mar 2023 14:31:34 +0000</pubDate>
      <link>https://dev.to/jhatcher9999/jims-guide-to-cockroachdb-naming-standards-b5</link>
      <guid>https://dev.to/jhatcher9999/jims-guide-to-cockroachdb-naming-standards-b5</guid>
      <description>&lt;p&gt;I don't think CockroachDB has an official guide for naming standards, so I thought I would put out a post about what I think are the generally agreed-upon standards and claim them as my own. :)&lt;/p&gt;

&lt;p&gt;When I say "naming standards," I'm referring to the general patterns for how we should name databases, tables, columns, etc.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Case is Handled in Identifiers
&lt;/h2&gt;

&lt;p&gt;Before I get into that, let's take a look at how case gets dealt with in CockroachDB.&lt;/p&gt;

&lt;p&gt;If we create a table that has a camel-cased name (i.e., each word is "separated" by a capital letter), it might look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;root@localhost:26257/defaultdb&amp;gt; CREATE TABLE CamelCaseTable
(
  myKeyValue int PRIMARY KEY,
  myField varchar
);

root@localhost:26257/defaultdb&amp;gt; SELECT * FROM CamelCaseTable;
  mykeyvalue | myfield
-------------+----------
(0 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This all seems good so far.&lt;/p&gt;

&lt;p&gt;However, if you ask CRDB to show you the table you created, you'll notice that the table definition that was stored does not retain the case specified:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;root@localhost:26257/defaultdb&amp;gt; SELECT create_statement FROM [SHOW CREATE TABLE CamelCaseTable];
                         create_statement
-------------------------------------------------------------------
  CREATE TABLE public.camelcasetable (
      mykeyvalue INT8 NOT NULL,
      myfield VARCHAR NULL,
      CONSTRAINT camelcasetable_pkey PRIMARY KEY (mykeyvalue ASC)
  )
(1 row)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Also, if you try to SELECT from the table using a name that is double-quoted, CRDB will report that the table doesn't exist.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;root@localhost:26257/defaultdb&amp;gt; SELECT * FROM "CamelCaseTable";
ERROR: relation "CamelCaseTable" does not exist
SQLSTATE: 42P01
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is strange!  What's going on here?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Any time you specify an identifier name in DDL or DML, CRDB will convert that name to lowercase.  If you want to override this behavior, you must specify the identifier in double quotes.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's an example of using double quotes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE "CamelCaseTable2" ( "myKeyValue" int PRIMARY KEY, "myField" varchar );
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But, note that when you do this, you have to also use the double quotes any time you reference the name.&lt;/p&gt;

&lt;p&gt;For instance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;root@localhost:26257/defaultdb&amp;gt; SELECT * FROM CamelCaseTable2;
ERROR: relation "camelcasetable2" does not exist
SQLSTATE: 42P01
root@localhost:26257/defaultdb&amp;gt; SELECT myKeyValue, myField FROM "CamelCaseTable2";
ERROR: column "mykeyvalue" does not exist
SQLSTATE: 42703
root@localhost:26257/defaultdb&amp;gt; SELECT "myKeyValue", "myField" FROM "CamelCaseTable2";
  myKeyValue | myField
-------------+----------
(0 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Snake Case
&lt;/h2&gt;

&lt;p&gt;Because of this "automatic lowercase-ification" of all identifiers, it is much simpler and easier to specify all your names in lower case.  For separation of words, use an underscore (btw, a dash is not valid in CRDB identifiers).  This is called "snake case."&lt;/p&gt;

&lt;p&gt;Here's an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;root@localhost:26257/defaultdb&amp;gt; CREATE TABLE snake_case_table
(
  my_key_value int PRIMARY KEY,
  my_field varchar
);

root@localhost:26257/defaultdb&amp;gt; SELECT * FROM snake_case_table;
  my_key_value | my_field
---------------+-----------
(0 rows)

root@localhost:26257/defaultdb&amp;gt; SELECT * FROM SNAKE_CASE_TABLE; -- if you want to use caps here, go for it
  my_key_value | my_field
---------------+-----------
(0 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Naming Standards
&lt;/h2&gt;

&lt;p&gt;Because of the handling of case in identifiers, it is much simpler and easier to use snake case for naming things in CRDB.  i_hope_you_agree_that_this_is_the_way_to_go&lt;/p&gt;
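&lt;p&gt;As a quick illustration (the database, table, and column names here are hypothetical), a fully snake-cased schema might look like this:&lt;/p&gt;

```sql
-- Hypothetical example of snake_case naming for a database, table,
-- columns, and index -- no double quotes needed anywhere.
CREATE DATABASE IF NOT EXISTS order_system;
USE order_system;

CREATE TABLE customer_orders (
  order_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  customer_name VARCHAR NOT NULL,
  order_total DECIMAL,
  created_at TIMESTAMP DEFAULT now()
);

CREATE INDEX customer_orders_created_at_idx ON customer_orders (created_at);
```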

</description>
      <category>cockroachdb</category>
      <category>naming</category>
      <category>standards</category>
    </item>
    <item>
      <title>Making Queries Sargable</title>
      <dc:creator>Jim Hatcher</dc:creator>
      <pubDate>Mon, 27 Mar 2023 14:05:31 +0000</pubDate>
      <link>https://dev.to/jhatcher9999/making-queries-sargable-2e4c</link>
      <guid>https://dev.to/jhatcher9999/making-queries-sargable-2e4c</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;When tuning database queries, a common technique is to add indexes that are tailor made to improve the performance of that query.  However, sometimes, the query you are analyzing is written in such a way that it will not take advantage of indexes, even if they are in place.&lt;/p&gt;

&lt;p&gt;The term used to describe whether a query is written in a way where it can take advantage of indexes is "sargable" -- it is an abbreviation for "search ARGument ABLE" (&lt;a href="https://en.wikipedia.org/wiki/Sargable"&gt;definition&lt;/a&gt;).  That is, a query is said to be sargable if it can leverage an index.&lt;/p&gt;

&lt;h2&gt;
  
  
  Example
&lt;/h2&gt;

&lt;p&gt;An example should help us here.  Suppose we have an &lt;code&gt;event_log&lt;/code&gt; table which stores millions of records and we want to be able to SELECT data from this log table based on a few simple predicates.&lt;/p&gt;

&lt;p&gt;Let's create a table in CockroachDB with 10 million rows and some randomized data like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;sarg&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; 
&lt;span class="n"&gt;USE&lt;/span&gt; &lt;span class="n"&gt;sarg&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;event_log&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;log_ts&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="k"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;md5&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;log_ts&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1668003200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1678003200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;g&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;-- 10,000,000 records&lt;/span&gt;
&lt;span class="c1"&gt;-- between 2022-11-09 14:13:20+00 (i.e., 1668003200)&lt;/span&gt;
&lt;span class="c1"&gt;-- and 2023-03-05 08:00:00 (i.e., 1678003200)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, let's create a decidedly non-sargable query to find log entries within a certain day and containing a certain log message:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;event_log&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'%abc%'&lt;/span&gt;
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_ts&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2023-02-01'&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;      id     |             message              |       log_ts
-------------+----------------------------------+----------------------
  1675209983 | 813cdbec6ab76abc470cc19e42975692 | 2023-02-01 00:06:23
  1675210237 | 41963851207539abc612869eb5cf2301 | 2023-02-01 00:10:37
-- rows omitted for brevity --
(570 rows)

Time: 10.466s total (execution 10.466s / network 0.000s)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This query takes 10 seconds to run.  Not good.  Let's look at the query plan:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;event_log&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'%abc%'&lt;/span&gt;
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_ts&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2023-02-01'&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                                            info
---------------------------------------------------------------------------------------------
  distribution: full
  vectorized: true

  • filter
  │ estimated row count: 1,111,111
  │ filter: (message LIKE '%abc%') AND (log_ts::DATE = '2023-02-01')
  │
  └── • scan
        estimated row count: 10,000,001 (100% of the table; stats collected 23 minutes ago)
        table: event_log@event_log_pkey
        spans: FULL SCAN
(11 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We're committing the cardinal sin of non-scalable database performance -- we're doing a full scan of the table and then filtering on our predicate in memory.  This means that as the table grows, the query gets slower and slower, because we have to read every record of the table every time the query runs.&lt;/p&gt;

&lt;p&gt;Also, it's worth noting that our explain plan didn't contain any index recommendations (further evidence that the query itself is to blame here).&lt;/p&gt;

&lt;p&gt;Just to further show that we've got a non-sargable query, let's add indexes to the &lt;code&gt;message&lt;/code&gt; and &lt;code&gt;log_ts&lt;/code&gt; fields to see if we can improve the execution.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;event_log&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;event_log&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;log_ts&lt;/span&gt; &lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After adding these indexes, the query plan looks exactly the same.  We have a non-sargable query on our hands.&lt;/p&gt;

&lt;h2&gt;
  
  
  Non-sargable patterns
&lt;/h2&gt;

&lt;p&gt;This query uses two classic anti-patterns that make it non-sargable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;applying functions to the left side of the predicate&lt;/li&gt;
&lt;li&gt;using a leading wildcard in a LIKE expression&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Fixing functions on the left-side of the predicate
&lt;/h2&gt;

&lt;p&gt;Let's see how we can change our query by addressing the first anti-pattern.  In our query, we're trying to limit the results to only log entries from Feb 1, 2023.  Logically, our predicate gives us the right results; but any time you run a query with a function applied to the left-hand side of a predicate (i.e., to the field itself), you limit the SQL optimizer's ability to leverage indexes.  To fix this, let's express the same logic without applying functions to the left-hand side of the predicate.&lt;/p&gt;

&lt;p&gt;Instead of:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_ts&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2023-02-01'&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;let's do:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;log_ts&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2023-02-01'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;log_ts&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="s1"&gt;'2023-02-02'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running this edited version of the query, we get our results back in ~800ms instead of 10s+.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;event_log&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'%abc%'&lt;/span&gt;
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;log_ts&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2023-02-01'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;log_ts&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="s1"&gt;'2023-02-02'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;      id     |             message              |       log_ts
-------------+----------------------------------+----------------------
-- rows omitted for brevity --
  1675295883 | 508472e48f1d4121d56ba36cfdabc147 | 2023-02-01 23:58:03
  1675295922 | 179a0abc7e6b8653f89be8cf875bc789 | 2023-02-01 23:58:42
(570 rows)

Time: 836ms total (execution 834ms / network 2ms)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's look at the query plan to see what's changed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;EXPLAIN&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;event_log&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'%abc%'&lt;/span&gt;
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;log_ts&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2023-02-01'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;log_ts&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="s1"&gt;'2023-02-02'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--------------------------------------------------------------------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • filter
  │ estimated row count: 30,670
  │ filter: message LIKE '%abc%'
  │
  └── • index join
      │ estimated row count: 92,011
      │ table: event_log@event_log_pkey
      │
      └── • scan
            estimated row count: 92,011 (0.92% of the table; stats collected 19 minutes ago; using stats forecast for 7 minutes ago)
            table: event_log@event_log_log_ts_idx
            spans: [/'2023-02-01 00:00:00' - /'2023-02-01 23:59:59.999999']

  index recommendations: 1
  1. type: index replacement
     SQL commands: CREATE INDEX ON event_log (log_ts) STORING (message); DROP INDEX event_log@event_log_log_ts_idx;
(19 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is better!  We're now able to leverage the &lt;code&gt;event_log_log_ts_idx&lt;/code&gt; index.  Our plan uses this index to narrow the set of records to process, then joins back to the primary index (i.e., the table itself), and then applies the final filter on the &lt;code&gt;message&lt;/code&gt; field.  By the time this final filter is applied, we've already narrowed the possible result set to about 30k records, so this isn't terrible, but let's see if we can improve further.&lt;/p&gt;

&lt;p&gt;NOTE: there are some further indexing recommendations in that last query plan which we could employ to help our query performance more, but they're not related to sargability, so I'm not going to follow that bunny trail in this blog.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fixing leading wildcards
&lt;/h2&gt;

&lt;p&gt;The second non-sargable mistake we've made is to use a leading wildcard in a LIKE expression (i.e., &lt;code&gt;message LIKE '%abc%'&lt;/code&gt;).  If we had only used a wildcard at the end of the expression (i.e., &lt;code&gt;message LIKE 'abc%'&lt;/code&gt;), then the query engine could still have leveraged the index (try looking at the query plan for this query and see for yourself); but when you put a wildcard at the beginning of the expression, the index can't be leveraged because we can't seek to specific ranges of the ordered index.&lt;/p&gt;
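&lt;p&gt;To see the contrast for yourself, here's a sketch of the trailing-wildcard version (the exact plan output will vary by version and statistics):&lt;/p&gt;

```sql
-- With the index on message in place, a trailing-only wildcard is
-- sargable: the optimizer can seek to the 'abc' prefix range of the
-- ordered index rather than doing a FULL SCAN.
EXPLAIN
SELECT *
FROM event_log
WHERE message LIKE 'abc%';
-- Expect a constrained scan on the message index (a span covering
-- the 'abc' prefix) instead of a full table scan.
```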

&lt;p&gt;This is a classically hard problem to overcome in relational databases (it's exactly the kind of use case that search engines like Solr and Elasticsearch are built to solve).  If the requirements for our query are to look in the middle of a text string, then we can't rewrite the query to avoid the leading wildcard.  We have a few choices then.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ngram indexes
&lt;/h3&gt;

&lt;p&gt;One choice is to do what we've done above and use indexes on other parts of our predicate to limit the result set so that the impact of the actual string filter is minimized.  That would be a boring solution for a blog!  So, let's remove that option by getting rid of the date/time filter in our query.&lt;/p&gt;

&lt;p&gt;If we only filter by the message field, we're back to our full table scan (and our 10sec performance).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;EXPLAIN&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;event_log&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'%abc%'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--------------------------------------------------------------------------------------------------------------------------------------
  distribution: full
  vectorized: true

  • filter
  │ estimated row count: 3,333,334
  │ filter: message LIKE '%abc%'
  │
  └── • scan
        estimated row count: 10,000,001 (100% of the table; stats collected 30 minutes ago; using stats forecast for 18 minutes ago)
        table: event_log@event_log_pkey
        spans: FULL SCAN
(11 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The next option is to leverage an n-gram index.  An n-gram is a tokenization of a piece of text into "chunks" of length n.  For example, if we tokenize the phrase "hello world" into tokens of size 3, we end up with the following values:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"hel"&lt;/li&gt;
&lt;li&gt;"ell"&lt;/li&gt;
&lt;li&gt;"llo"&lt;/li&gt;
&lt;li&gt;"lo "&lt;/li&gt;
&lt;li&gt;"o w"&lt;/li&gt;
&lt;li&gt;" wo"&lt;/li&gt;
&lt;li&gt;"wor"&lt;/li&gt;
&lt;li&gt;"orl"&lt;/li&gt;
&lt;li&gt;"rld"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;CockroachDB has a feature called &lt;a href="https://www.cockroachlabs.com/docs/stable/trigram-indexes.html"&gt;trigram indexes&lt;/a&gt; which leverages this tokenization technique.  At write time, when the index is being built, the values are broken down into 3-grams (i.e., trigrams) and these multiple values are stored.  Then, at query-time, the query optimizer is able to do a seek against these values instead of having to scan the table.  There is a trade-off at play here: we're doing more work at write time &amp;amp; storing more data in our index (potentially a lot more); but we're reducing the work that has to be done at query time.&lt;/p&gt;
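&lt;p&gt;To peek at what actually gets stored, you can ask for the trigrams of a value directly.  This sketch assumes the pg_trgm-style &lt;code&gt;show_trgm()&lt;/code&gt; builtin is available; CockroachDB's trigram support follows pg_trgm's tokenization scheme:&lt;/p&gt;

```sql
-- Illustration: list the trigrams generated for a value.
-- In pg_trgm's scheme the input is lowercased and padded with
-- spaces, so a 3-character string like 'abc' produces four
-- trigrams (e.g., "  a", " ab", "abc", "bc "), not just one.
SELECT show_trgm('abc');
```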

&lt;p&gt;Let's create a trigram index on our &lt;code&gt;message&lt;/code&gt; field and see how our performance is impacted:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;event_log&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;GIN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="n"&gt;gin_trgm_ops&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;EXPLAIN&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;event_log&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'%abc%'&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;info
--------------------------------------------------------------------------------------------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • render
  │
  └── • limit
      │ count: 500
      │
      └── • filter
          │ estimated row count: 3,333,334
          │ filter: message LIKE '%abc%'
          │
          └── • index join
              │ estimated row count: 1,111,111
              │ table: event_log@event_log_pkey
              │
              └── • scan
                    estimated row count: 167 - 1,111,112 (11% of the table; stats collected 1 minute ago; using stats forecast for 28 minutes in the future)
                    table: event_log@event_log_message_idx1
                    spans: 1 span
(20 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's run the query and see how long it takes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;event_log&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'%abc%'&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Time: 43ms total (execution 34ms / network 9ms)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pretty good!&lt;/p&gt;

&lt;h3&gt;
  
  
  Computed Columns
&lt;/h3&gt;

&lt;p&gt;Another approach that we could use to handle this situation is to add a computed column to our table which calculates whether our keyword has been seen.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;event_log&lt;/span&gt;
&lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;contains_abc&lt;/span&gt; &lt;span class="nb"&gt;boolean&lt;/span&gt;
&lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'%abc%'&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;STORED&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;event_log&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;contains_abc&lt;/span&gt; &lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;log_ts&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;event_log&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;contains_abc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Time: 23ms total (execution 16ms / network 7ms)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This seems to be even a little faster, which makes sense since the work done at read time is minimal (even compared to the trigram approach).  Conceptually, it's a similar approach in that we're doing extra work at write time in order to speed things up at read time.  It's more precise (i.e., it stores less data), but it's less flexible (i.e., you can't change the string you're searching for without changing the computed column).&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;We've identified a few red flags to watch out for in queries that will render them non-sargable.  And, we've looked at ways to address specific instances of this problem.&lt;/p&gt;

</description>
      <category>cockroachdb</category>
      <category>querytuning</category>
      <category>sargable</category>
    </item>
    <item>
      <title>Efficiently deleting data</title>
      <dc:creator>Jim Hatcher</dc:creator>
      <pubDate>Wed, 15 Mar 2023 01:29:03 +0000</pubDate>
      <link>https://dev.to/jhatcher9999/efficiently-deleting-data-3el3</link>
      <guid>https://dev.to/jhatcher9999/efficiently-deleting-data-3el3</guid>
      <description>&lt;p&gt;When I delete data in CockroachDB (CRDB), the delete is slow, and how can I speed it up?&lt;/p&gt;

&lt;p&gt;This is a topic that comes up occasionally, and it's not super intuitive, so I think it would be worth explaining in more detail.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;p&gt;To explore this, let's create some data we can play with.&lt;/p&gt;

&lt;p&gt;First, let's create a database and then configure it to wait only 60 seconds after deleting data before actually removing the data from the storage layer in CRDB (this will be helpful if we need to delete the data multiple times and run everything again).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;deletes&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;deletes&lt;/span&gt; &lt;span class="n"&gt;CONFIGURE&lt;/span&gt; &lt;span class="k"&gt;ZONE&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;gc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ttlseconds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;USE&lt;/span&gt; &lt;span class="n"&gt;deletes&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
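
&lt;p&gt;If you want to double-check that the GC TTL change took effect before moving on, CRDB can show the zone configuration directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- verify the zone configuration we just set (expect gc.ttlseconds = 60)
SHOW ZONE CONFIGURATION FOR DATABASE deletes;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;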



&lt;p&gt;Then, let's create a table.  I'm going to create a table with a few columns of various cardinality and 100 million rows.  There is a helpful function in CRDB called &lt;code&gt;generate_series&lt;/code&gt; which is great for this kind of data creation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;deletes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_items&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="k"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;77777&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;1000000&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;11111&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1578003200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1678003200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;g&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;-- 100,000,000 records&lt;/span&gt;
&lt;span class="c1"&gt;--  between 2020-01-02 22:13:20 (i.e., 1578003200)&lt;/span&gt;
&lt;span class="c1"&gt;--  and 2023-03-05 08:00:00 (i.e., 1678003200)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I'm leveraging a few handy techniques here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I'm using the &lt;code&gt;CREATE TABLE AS&lt;/code&gt; (or &lt;a href="https://www.cockroachlabs.com/docs/stable/create-table-as.html"&gt;CTAS&lt;/a&gt; as we call it) to generate a new table.  This is much faster in CRDB than creating the table first and then doing &lt;code&gt;INSERT INTO ... SELECT&lt;/code&gt; because it bypasses all the consistency-checking algorithms.&lt;/li&gt;
&lt;li&gt;I'm using the &lt;code&gt;g.i&lt;/code&gt; value as my "seed value" and using it directly as my id field.&lt;/li&gt;
&lt;li&gt;For some of the other fields, I'm using the modulo operator (i.e., &lt;code&gt;%&lt;/code&gt;) against the &lt;code&gt;g.i&lt;/code&gt; field to create some low-cardinality values.&lt;/li&gt;
&lt;li&gt;I'm generating values in the range between 1578003200 and 1678003200 so that I can easily convert these values into timestamps.&lt;/li&gt;
&lt;li&gt;Tip: To figure out which values you want to use as your start and end values, run &lt;code&gt;SELECT '2023-01-15 2:00 PM'::TIMESTAMP::INT;&lt;/code&gt; with your desired date values to figure out the values to specify.&lt;/li&gt;
&lt;/ul&gt;
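
&lt;p&gt;That tip works in both directions; casting an INT back to a TIMESTAMP shows which date a given seed value represents:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- epoch seconds to timestamp, and back
SELECT 1600000000::INT::TIMESTAMP;            -- 2020-09-13 12:26:40
SELECT '2020-09-13 12:26:40'::TIMESTAMP::INT; -- 1600000000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;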

&lt;p&gt;Once we have our table created, let's create indexes on the other fields to mimic how a realistic table might be set up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;order_items&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;order_items&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;order_items&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;product_id&lt;/span&gt; &lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;order_items&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To get a sense of the cardinality of our fields, we can run a query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rec_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;cust_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;ord_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;prd_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;date_count&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;order_items&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  rec_count | cust_count | ord_count | prd_count | date_count
------------+------------+-----------+-----------+-------------
  100000001 |      77777 |   1000000 |     11111 |  100000001
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Looking at explain plans
&lt;/h2&gt;

&lt;p&gt;Now that we have some data to play with, we can start to understand what happens when we delete data in this table.&lt;/p&gt;

&lt;p&gt;Let's start by selecting a single record by the PK value:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;order_items&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1600000000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

      &lt;span class="n"&gt;id&lt;/span&gt;     &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;product_id&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="n"&gt;order_date&lt;/span&gt;
&lt;span class="c1"&gt;-------------+-------------+----------+------------+----------------------&lt;/span&gt;
  &lt;span class="mi"&gt;1600000000&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;       &lt;span class="mi"&gt;49333&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;        &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;       &lt;span class="mi"&gt;4889&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;2020&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;09&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;13&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;26&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, let's prepend "EXPLAIN" to our query to look at the execution plan.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;EXPLAIN&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;order_items&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1600000000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                                                        &lt;span class="n"&gt;info&lt;/span&gt;
&lt;span class="c1"&gt;---------------------------------------------------------------------------------------------------------------------&lt;/span&gt;
  &lt;span class="n"&gt;distribution&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;local&lt;/span&gt;
  &lt;span class="n"&gt;vectorized&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;

  &lt;span class="err"&gt;•&lt;/span&gt; &lt;span class="n"&gt;scan&lt;/span&gt;
    &lt;span class="n"&gt;estimated&lt;/span&gt; &lt;span class="k"&gt;row&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;stats&lt;/span&gt; &lt;span class="n"&gt;collected&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="n"&gt;hours&lt;/span&gt; &lt;span class="n"&gt;ago&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;stats&lt;/span&gt; &lt;span class="n"&gt;forecast&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="n"&gt;hours&lt;/span&gt; &lt;span class="n"&gt;ago&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;table&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;order_items&lt;/span&gt;&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;order_items_pkey&lt;/span&gt;
    &lt;span class="n"&gt;spans&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;1600000000&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;1600000000&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is as simple an execution plan as you can get: a scan of a single span in the primary index (i.e., the table itself).&lt;/p&gt;

&lt;p&gt;Now, let's query by the order_date field, which is also a very high-cardinality column and has an index on it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;order_items&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1600000000&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="n"&gt;id&lt;/span&gt;     &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;product_id&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;     &lt;span class="n"&gt;order_date&lt;/span&gt;
&lt;span class="c1"&gt;-------------+-------------+----------+------------+----------------------&lt;/span&gt;
  &lt;span class="mi"&gt;1600000000&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;       &lt;span class="mi"&gt;49333&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;        &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;       &lt;span class="mi"&gt;4889&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;2020&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;09&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;13&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;26&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We get the exact same data back.  Let's look at the explain plan:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;EXPLAIN&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;order_items&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1600000000&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                                                                            &lt;span class="n"&gt;info&lt;/span&gt;
&lt;span class="c1"&gt;-------------------------------------------------------------------------------------------------------------------------------------------------------------&lt;/span&gt;
  &lt;span class="n"&gt;distribution&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;local&lt;/span&gt;
  &lt;span class="n"&gt;vectorized&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;

  &lt;span class="err"&gt;•&lt;/span&gt; &lt;span class="k"&gt;index&lt;/span&gt; &lt;span class="k"&gt;join&lt;/span&gt;
  &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;estimated&lt;/span&gt; &lt;span class="k"&gt;row&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
  &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;order_items&lt;/span&gt;&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;order_items_pkey&lt;/span&gt;
  &lt;span class="err"&gt;│&lt;/span&gt;
  &lt;span class="err"&gt;└──&lt;/span&gt; &lt;span class="err"&gt;•&lt;/span&gt; &lt;span class="n"&gt;scan&lt;/span&gt;
        &lt;span class="n"&gt;estimated&lt;/span&gt; &lt;span class="k"&gt;row&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;stats&lt;/span&gt; &lt;span class="n"&gt;collected&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="n"&gt;hours&lt;/span&gt; &lt;span class="n"&gt;ago&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;stats&lt;/span&gt; &lt;span class="n"&gt;forecast&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="n"&gt;hours&lt;/span&gt; &lt;span class="n"&gt;ago&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;table&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;order_items&lt;/span&gt;&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;order_items_order_date_idx&lt;/span&gt;
        &lt;span class="n"&gt;spans&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="s1"&gt;'2020-09-13 12:26:40'&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="s1"&gt;'2020-09-13 12:26:40'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

  &lt;span class="k"&gt;index&lt;/span&gt; &lt;span class="n"&gt;recommendations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
  &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt; &lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;index&lt;/span&gt; &lt;span class="n"&gt;replacement&lt;/span&gt;
     &lt;span class="k"&gt;SQL&lt;/span&gt; &lt;span class="n"&gt;commands&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;order_items&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;STORING&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="k"&gt;DROP&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;order_items&lt;/span&gt;&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;order_items_order_date_idx&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice a few differences about this explain plan from the one above:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We do a scan of the &lt;code&gt;order_items_order_date_idx&lt;/code&gt; this time&lt;/li&gt;
&lt;li&gt;We then do an &lt;code&gt;index join&lt;/code&gt; back to the primary index &lt;code&gt;order_items_pkey&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;CRDB has helpfully given us some suggestions on how we can alter our index to perform better.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It's very efficient to use the order_date index since our predicate is on order_date.  But the order_date index only contains the field we're indexing (order_date) and a reference back to the PK (id).  To satisfy our query's &lt;code&gt;SELECT *&lt;/code&gt; clause, we need to join back to the primary index to get the rest of the field values in the record.  The index suggestion is telling us that if we include/store all the other fields from the record in the order_date index, we can skip the index join since we'll already have all the data points necessary to satisfy the query.&lt;/p&gt;

&lt;p&gt;You can prove this by changing the SELECT query to only return the id and order_date fields:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;EXPLAIN&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;order_items&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1600000000&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                                                        &lt;span class="n"&gt;info&lt;/span&gt;
&lt;span class="c1"&gt;---------------------------------------------------------------------------------------------------------------------&lt;/span&gt;
  &lt;span class="n"&gt;distribution&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;local&lt;/span&gt;
  &lt;span class="n"&gt;vectorized&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;

  &lt;span class="err"&gt;•&lt;/span&gt; &lt;span class="n"&gt;scan&lt;/span&gt;
    &lt;span class="n"&gt;estimated&lt;/span&gt; &lt;span class="k"&gt;row&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;stats&lt;/span&gt; &lt;span class="n"&gt;collected&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="n"&gt;hours&lt;/span&gt; &lt;span class="n"&gt;ago&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;stats&lt;/span&gt; &lt;span class="n"&gt;forecast&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="n"&gt;hours&lt;/span&gt; &lt;span class="n"&gt;ago&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;table&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;order_items&lt;/span&gt;&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;order_items_order_date_idx&lt;/span&gt;
    &lt;span class="n"&gt;spans&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="s1"&gt;'2020-09-13 12:26:40'&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="s1"&gt;'2020-09-13 12:26:40'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
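
&lt;p&gt;You could also take the recommendation from the plan itself. The commands below just restate CRDB's suggested index replacement; with the STORING index in place, the &lt;code&gt;SELECT *&lt;/code&gt; plan should no longer need the index join:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- covering index that stores the remaining columns (per the index recommendation)
CREATE INDEX ON order_items (order_date) STORING (customer_id, order_id, product_id);
DROP INDEX order_items@order_items_order_date_idx;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Note that the delete examples below still assume the original &lt;code&gt;order_items_order_date_idx&lt;/code&gt;, so hold off on running these commands if you're following along.&lt;/p&gt;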



&lt;h2&gt;
  
  
  Moving onto deletes
&lt;/h2&gt;

&lt;p&gt;Now that we have a sense of interpreting query plans on our table, let's think about what happens when we delete a record in CRDB.&lt;/p&gt;

&lt;p&gt;When we delete a record by the PK value, our execution plan is very simple -- find the record and delete it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;EXPLAIN&lt;/span&gt;
&lt;span class="k"&gt;DELETE&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;order_items&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1600000000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                                                          &lt;span class="n"&gt;info&lt;/span&gt;
&lt;span class="c1"&gt;-------------------------------------------------------------------------------------------------------------------------&lt;/span&gt;
  &lt;span class="n"&gt;distribution&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;local&lt;/span&gt;
  &lt;span class="n"&gt;vectorized&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;

  &lt;span class="err"&gt;•&lt;/span&gt; &lt;span class="k"&gt;delete&lt;/span&gt;
  &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;order_items&lt;/span&gt;
  &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;auto&lt;/span&gt; &lt;span class="k"&gt;commit&lt;/span&gt;
  &lt;span class="err"&gt;│&lt;/span&gt;
  &lt;span class="err"&gt;└──&lt;/span&gt; &lt;span class="err"&gt;•&lt;/span&gt; &lt;span class="n"&gt;scan&lt;/span&gt;
        &lt;span class="n"&gt;estimated&lt;/span&gt; &lt;span class="k"&gt;row&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;stats&lt;/span&gt; &lt;span class="n"&gt;collected&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="n"&gt;hours&lt;/span&gt; &lt;span class="n"&gt;ago&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;stats&lt;/span&gt; &lt;span class="n"&gt;forecast&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="n"&gt;hours&lt;/span&gt; &lt;span class="n"&gt;ago&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;table&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;order_items&lt;/span&gt;&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;order_items_pkey&lt;/span&gt;
        &lt;span class="n"&gt;spans&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;1600000000&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;1600000000&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If we do a similar DELETE query using the order_date field instead of the PK, we see that we once again have an index join:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="k"&gt;DELETE&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;order_items&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1600000000&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                                                                            &lt;span class="n"&gt;info&lt;/span&gt;
&lt;span class="c1"&gt;-------------------------------------------------------------------------------------------------------------------------------------------------------------&lt;/span&gt;
  &lt;span class="n"&gt;distribution&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;local&lt;/span&gt;
  &lt;span class="n"&gt;vectorized&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;

  &lt;span class="err"&gt;•&lt;/span&gt; &lt;span class="k"&gt;delete&lt;/span&gt;
  &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;order_items&lt;/span&gt;
  &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;auto&lt;/span&gt; &lt;span class="k"&gt;commit&lt;/span&gt;
  &lt;span class="err"&gt;│&lt;/span&gt;
  &lt;span class="err"&gt;└──&lt;/span&gt; &lt;span class="err"&gt;•&lt;/span&gt; &lt;span class="k"&gt;index&lt;/span&gt; &lt;span class="k"&gt;join&lt;/span&gt;
      &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;estimated&lt;/span&gt; &lt;span class="k"&gt;row&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
      &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;order_items&lt;/span&gt;&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;order_items_pkey&lt;/span&gt;
      &lt;span class="err"&gt;│&lt;/span&gt;
      &lt;span class="err"&gt;└──&lt;/span&gt; &lt;span class="err"&gt;•&lt;/span&gt; &lt;span class="n"&gt;scan&lt;/span&gt;
            &lt;span class="n"&gt;estimated&lt;/span&gt; &lt;span class="k"&gt;row&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;stats&lt;/span&gt; &lt;span class="n"&gt;collected&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="n"&gt;hours&lt;/span&gt; &lt;span class="n"&gt;ago&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;stats&lt;/span&gt; &lt;span class="n"&gt;forecast&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="n"&gt;hours&lt;/span&gt; &lt;span class="n"&gt;ago&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;table&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;order_items&lt;/span&gt;&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;order_items_order_date_idx&lt;/span&gt;
            &lt;span class="n"&gt;spans&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="s1"&gt;'2020-09-13 12:26:40'&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="s1"&gt;'2020-09-13 12:26:40'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

  &lt;span class="k"&gt;index&lt;/span&gt; &lt;span class="n"&gt;recommendations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
  &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt; &lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;index&lt;/span&gt; &lt;span class="n"&gt;replacement&lt;/span&gt;
     &lt;span class="k"&gt;SQL&lt;/span&gt; &lt;span class="n"&gt;commands&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;order_items&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;STORING&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="k"&gt;DROP&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;order_items&lt;/span&gt;&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;order_items_order_date_idx&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The need for this index join is pretty obvious for the SELECT query, but why is it needed for the DELETE query?  The answer is that when CRDB deletes a record, it needs to remove the entry (or entries) not just from the primary index but also from every corresponding secondary index.  And in order to delete those secondary-index entries, it needs the index key values, which it gets by joining back to the primary index.&lt;/p&gt;

&lt;p&gt;Let's follow the optimizer's advice and create this new "covering" index:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;order_items&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;STORING&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;DROP&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;order_items&lt;/span&gt;&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;order_items_order_date_idx&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;                                                                                                                         
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If we run our DELETE statement again with this new index in place, we can see that the need for the index join has been eliminated.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="k"&gt;DELETE&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;order_items&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1600000000&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;                                                                                                  &lt;span class="n"&gt;info&lt;/span&gt;
&lt;span class="c1"&gt;-------------------------------------------------------------------------------------------------------------------------&lt;/span&gt;
  &lt;span class="n"&gt;distribution&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;local&lt;/span&gt;
  &lt;span class="n"&gt;vectorized&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;

  &lt;span class="err"&gt;•&lt;/span&gt; &lt;span class="k"&gt;delete&lt;/span&gt;
  &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;order_items&lt;/span&gt;
  &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;auto&lt;/span&gt; &lt;span class="k"&gt;commit&lt;/span&gt;
  &lt;span class="err"&gt;│&lt;/span&gt;
  &lt;span class="err"&gt;└──&lt;/span&gt; &lt;span class="err"&gt;•&lt;/span&gt; &lt;span class="n"&gt;scan&lt;/span&gt;
        &lt;span class="n"&gt;estimated&lt;/span&gt; &lt;span class="k"&gt;row&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;stats&lt;/span&gt; &lt;span class="n"&gt;collected&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="n"&gt;hours&lt;/span&gt; &lt;span class="n"&gt;ago&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;stats&lt;/span&gt; &lt;span class="n"&gt;forecast&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="n"&gt;hours&lt;/span&gt; &lt;span class="n"&gt;ago&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;table&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;order_items&lt;/span&gt;&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;order_items_order_date_idx1&lt;/span&gt;
        &lt;span class="n"&gt;spans&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="s1"&gt;'2020-09-13 12:26:40'&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="s1"&gt;'2020-09-13 12:26:40'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Warning: this is a great technique for making deletes efficient.  If you're doing a regular batch delete of old data or cleaning up records that have reached some "deletable" state, it can help a lot -- especially on really big tables.  However, be aware that if you later add another index to the order_items table, you'll need to remember to include that newly indexed column in the index your deletes leverage -- otherwise, this index join will creep back into your plan.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Covering indexes are a great tool in CRDB and other database systems for making specific queries more efficient.  It's relatively easy to identify when they can help SELECT queries, but it's a little unintuitive to realize that they can also help with DELETEs.  But now you know, so happy deleting!&lt;/p&gt;

</description>
      <category>cockroachdb</category>
      <category>deletes</category>
      <category>tuning</category>
    </item>
    <item>
      <title>Running Multi-region CockroachDB on k8s -- the internals</title>
      <dc:creator>Jim Hatcher</dc:creator>
      <pubDate>Mon, 13 Mar 2023 14:49:53 +0000</pubDate>
      <link>https://dev.to/jhatcher9999/running-multi-region-cockroachdb-on-k8s-the-internals-4en0</link>
      <guid>https://dev.to/jhatcher9999/running-multi-region-cockroachdb-on-k8s-the-internals-4en0</guid>
      <description>&lt;p&gt;If you are familiar with CockroachDB (CRDB), you may have heard that CRDB runs well on Kubernetes (k8s).  You may have also heard that CRDB can be deployed in either a single-region or multi-region configuration.&lt;/p&gt;

&lt;p&gt;The Cockroach Labs' docs site has a &lt;a href="https://www.cockroachlabs.com/docs/stable/orchestrate-cockroachdb-with-kubernetes-multi-cluster.html" rel="noopener noreferrer"&gt;great guide&lt;/a&gt; for deploying CRDB on k8s in a multi-region environment.&lt;/p&gt;

&lt;p&gt;If you follow the guide, you'll have a working multi-region cluster, but you may not understand what exactly you've set up!&lt;/p&gt;

&lt;p&gt;There is a &lt;a href="https://www.cockroachlabs.com/docs/stable/orchestrate-cockroachdb-with-kubernetes-multi-cluster.html#ux-differences-from-running-in-a-single-cluster" rel="noopener noreferrer"&gt;section&lt;/a&gt; that explains some of the differences that you'll see between a single-region k8s deployment and the multi-region k8s deployment described in the guide.&lt;/p&gt;

&lt;p&gt;In this blog, I want to explain things in a little more detail.  Understanding these fundamentals can be really helpful when you're troubleshooting a cluster that isn't acting quite right.&lt;/p&gt;

&lt;h2&gt;
  
  
  You gotta use the manifests!
&lt;/h2&gt;

&lt;p&gt;When you're deploying CRDB in a single-region, there are three deployment options:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;k8s operator&lt;/li&gt;
&lt;li&gt;Helm Chart&lt;/li&gt;
&lt;li&gt;Manual manifests (i.e., yamls)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;However, when you go the &lt;em&gt;multi-region route&lt;/em&gt;, the only available option is the manual manifests.&lt;/p&gt;

&lt;p&gt;This is not really a CRDB limitation.  There is not a strong consensus in the k8s community about how federated k8s clusters should work, so we have our own way of doing it that doesn't really use any k8s primitives to provide coordination across the various clusters. &lt;/p&gt;

&lt;h2&gt;
  
  
  One CockroachDB cluster, 3 Kubernetes Clusters
&lt;/h2&gt;

&lt;p&gt;So...when you create a multi-region CRDB cluster (let's say 3 regions for the sake of this illustration), you run the CRDB resources on 3 separate k8s clusters and these 3 k8s clusters know nothing about each other.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpb6s64qybl8g5qlzmctm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpb6s64qybl8g5qlzmctm.png" alt="Multi-Region CRDB Deployment on k8s" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the CRDB perspective, it is one single, logical cluster; the CRDB nodes don't really know that they're running on k8s -- they just know that there are various CRDB nodes, and as long as there is network connectivity between these nodes, everything will interact as designed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cluster Maintenance
&lt;/h3&gt;

&lt;p&gt;There are some implications to this.  One implication is that if you want to make changes to your cluster across all three regions, you'll need to interact with the k8s control planes of all three k8s clusters.&lt;/p&gt;

&lt;p&gt;If you look at the guide and find the &lt;a href="https://www.cockroachlabs.com/docs/stable/orchestrate-cockroachdb-with-kubernetes-multi-cluster.html#step-6-maintain-the-cluster" rel="noopener noreferrer"&gt;section called "Maintain the Cluster"&lt;/a&gt;, you'll see steps for scaling, upgrading, and stopping the cluster.  Any of these steps necessitates doing actions on all 3 k8s control planes.  There is no one single control plane that can handle these actions for you.&lt;/p&gt;
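&lt;p&gt;One way to make this less tedious is to script the same action across all three control planes.  Here's a minimal sketch; the context names are placeholders for whatever &lt;code&gt;kubectl config get-contexts&lt;/code&gt; shows in your environment, and the commands are echoed rather than executed so you can audit them first:&lt;/p&gt;

```shell
# Context names are placeholders -- substitute the ones listed by
# `kubectl config get-contexts` for your three clusters.
CONTEXTS="gke_proj_zone1_cockroachdb1 gke_proj_zone2_cockroachdb2 gke_proj_zone3_cockroachdb3"

# Echo (rather than run) the same kubectl command against every
# region's control plane; drop the `echo` once you've verified it.
run_everywhere() {
  for ctx in $CONTEXTS; do
    echo kubectl --context="$ctx" "$@"
  done
}

run_everywhere get pods
```

&lt;p&gt;The same loop works for the scaling and upgrade steps in the guide -- but remember it's still three independent control planes; there's no single place to roll back if one region's change fails.&lt;/p&gt;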

&lt;h3&gt;
  
  
  Network Planning
&lt;/h3&gt;

&lt;p&gt;Another implication is that you should not create overlapping service or pod networks in your k8s clusters.&lt;/p&gt;

&lt;p&gt;A k8s cluster has a service network with a pool of available IP addresses which it assigns to newly-created services (things like load balancers).  It also has a pod network with a pool of available IP addresses which it assigns to newly-created pods.&lt;/p&gt;

&lt;p&gt;When you create a k8s-based deployment of CRDB, a stateful set is used.  The stateful set has a number of replicas configured (&lt;a href="https://github.com/cockroachdb/cockroach/blob/master/cloud/kubernetes/multiregion/cockroachdb-statefulset-secure.yaml#L146" rel="noopener noreferrer"&gt;the default replicas in the guide&lt;/a&gt; is 3).  Each of these replicas is created as a pod and gets assigned an IP address from this pod network IP range.&lt;/p&gt;

&lt;p&gt;We also create one headless service (which is used internally and doesn't use an IP) and one non-headless service (which gets an IP assigned from the service network IP range).  You can see these IPs by running &lt;code&gt;kubectl get pods&lt;/code&gt; and &lt;code&gt;kubectl get services&lt;/code&gt;, respectively.&lt;/p&gt;

&lt;p&gt;Below is a diagram which shows the network connectivity needs of every CRDB node (running as a k8s pod).&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpjlm3b586rvmqosg2201.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpjlm3b586rvmqosg2201.png" alt="CRDB Node Network Connectivity" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can use a k8s load balancer (running on the service network) to handle incoming SQL and DB Console access, but CRDB node-to-node connectivity requires that every node (in any region) be able to connect to every other node directly -- and in a k8s deployment, that means every k8s pod running CRDB has to be able to talk directly to every other k8s pod running CRDB.&lt;/p&gt;

&lt;p&gt;Our docs give basic examples of creating the various k8s clusters.  For instance, in the GKE path, the following commands are given:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud container clusters create cockroachdb1 &lt;span class="nt"&gt;--zone&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$GCE&lt;/span&gt;&lt;span class="nt"&gt;-ZONE1&lt;/span&gt;
gcloud container clusters create cockroachdb2 &lt;span class="nt"&gt;--zone&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$GCE&lt;/span&gt;&lt;span class="nt"&gt;-ZONE2&lt;/span&gt;
gcloud container clusters create cockroachdb3 &lt;span class="nt"&gt;--zone&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$GCE&lt;/span&gt;&lt;span class="nt"&gt;-ZONE3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In a simple demo deployment, these commands are fine.  But for more complex deployments (certainly production deployments), it's worth thinking through whether these clusters will end up having overlapping service and pod networks.&lt;/p&gt;

&lt;p&gt;I personally like to use a more explicit command for creating the cluster.  For instance, for the GKE path, I would issue the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud container clusters create cockroachdb1 
  &lt;span class="nt"&gt;--zone&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$GCE&lt;/span&gt;&lt;span class="nt"&gt;-ZONE1&lt;/span&gt; &lt;span class="nt"&gt;--machine-type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$MACHINETYPE&lt;/span&gt; &lt;span class="nt"&gt;--num-nodes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3 
  &lt;span class="nt"&gt;--cluster-ipv4-cidr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;10.1.0.0/16
  &lt;span class="nt"&gt;--services-ipv4-cidr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;10.101.0.0/16
gcloud container clusters create cockroachdb2 
  &lt;span class="nt"&gt;--zone&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$GCE&lt;/span&gt;&lt;span class="nt"&gt;-ZONE2&lt;/span&gt; &lt;span class="nt"&gt;--machine-type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$MACHINETYPE&lt;/span&gt; &lt;span class="nt"&gt;--num-nodes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3 
  &lt;span class="nt"&gt;--cluster-ipv4-cidr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;10.2.0.0/16
  &lt;span class="nt"&gt;--services-ipv4-cidr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;10.102.0.0/16
gcloud container clusters create cockroachdb3 
  &lt;span class="nt"&gt;--zone&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$GCE&lt;/span&gt;&lt;span class="nt"&gt;-ZONE3&lt;/span&gt; &lt;span class="nt"&gt;--machine-type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$MACHINETYPE&lt;/span&gt; &lt;span class="nt"&gt;--num-nodes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3 
  &lt;span class="nt"&gt;--cluster-ipv4-cidr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;10.3.0.0/16
  &lt;span class="nt"&gt;--services-ipv4-cidr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;10.103.0.0/16
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice that I'm explicitly giving each cluster a unique cluster (i.e., pod) CIDR block and services CIDR block.  I'm also being explicit about the machine type and number of nodes just because it avoids ambiguity.&lt;/p&gt;
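&lt;p&gt;If you're scripting cluster creation, it's also cheap to sanity-check the CIDR choices up front.  Here's a tiny sketch that only understands the /16 masks used above (for a /16, two blocks collide exactly when their first two octets match):&lt;/p&gt;

```shell
# For /16 masks only: strip the last two octets plus the mask from
# each CIDR and compare what's left. "10.1.0.0/16" -> "10.1".
overlaps_16() {
  [ "${1%.*.*}" = "${2%.*.*}" ]
}

if overlaps_16 "10.1.0.0/16" "10.2.0.0/16"; then
  echo "overlap -- pick different pod networks"
else
  echo "distinct -- safe to proceed"
fi
```

&lt;p&gt;A general-purpose check would need to honor arbitrary mask lengths, but for the uniform /16 plan shown here this catches copy-paste mistakes before the clusters exist.&lt;/p&gt;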

&lt;p&gt;Note: Because CRDB wants to have this direct pod-to-pod connectivity, we don't like to run CRDB on service meshes (like Istio).  Service meshes tend to rely on hiding the pod details and having everything connect through the services layer, which works great for stateless services, but not so great for highly-stateful databases with node-level replication requirements.&lt;/p&gt;

&lt;h2&gt;
  
  
  DNS Resolution
&lt;/h2&gt;

&lt;p&gt;k8s pods within a given k8s cluster are able to talk to each other by host name because DNS resolution is handled within k8s.  For example, if pod0 needs to talk to pod1, pod0 looks up pod1's host name via the DNS service (either coredns or kube-dns, depending on your flavor of k8s).  The DNS service returns the IP address mapped to the host name, and then pod0 talks to pod1 directly via this IP address.&lt;/p&gt;

&lt;p&gt;Since our CRDB cluster involves three k8s clusters, this DNS system breaks down a little bit.  It works fine for pods within the same cluster, but pods in different clusters can't find each other out of the box.&lt;/p&gt;

&lt;p&gt;To remedy this, we set up a few things in our multi-region implementation:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;we create a &lt;a href="https://github.com/cockroachdb/cockroach/blob/master/cloud/kubernetes/multiregion/dns-lb.yaml" rel="noopener noreferrer"&gt;load balancer&lt;/a&gt; that exposes the DNS service of each cluster outside of that cluster.&lt;/li&gt;
&lt;li&gt;in each DNS service we create a &lt;a href="https://github.com/cockroachdb/cockroach/blob/master/cloud/kubernetes/multiregion/setup.py#L125" rel="noopener noreferrer"&gt;config map&lt;/a&gt; that tells each DNS service how to forward DNS requests to every other DNS service (using the aforementioned load balancer) based on the format of the hostname.&lt;/li&gt;
&lt;/ol&gt;
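&lt;p&gt;To make the forwarding rule concrete, here is roughly what that config map looks like on a kube-dns-based cluster (e.g., GKE) in region1.  The domain names and load-balancer IPs below are illustrative -- the linked setup script generates the real values for your clusters -- and a coredns-based cluster would use an equivalent forwarding stanza in its Corefile instead:&lt;/p&gt;

```yaml
# Illustrative only: tells region1's kube-dns to forward lookups for
# the other regions' pod host names to the load balancers that expose
# those clusters' DNS services.
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-dns
  namespace: kube-system
data:
  stubDomains: |
    {"region2.svc.cluster.local": ["10.10.10.10"],
     "region3.svc.cluster.local": ["10.10.10.11"]}
```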

&lt;p&gt;If you want to see these, look in the "kube-system" namespace and run &lt;code&gt;kubectl get configmap&lt;/code&gt; and &lt;code&gt;kubectl get svc&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Using this system of "DNS forwarding", each pod can now resolve the name of every other pod across all the clusters.  Assuming that the pod networks are non-overlapping and routable, then we should be able to operate.&lt;/p&gt;

&lt;p&gt;Let me walk through an example of this, just to make sure it's clear:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;pod1 in region1 wants to connect to pod2 in region2&lt;/li&gt;
&lt;li&gt;pod1 reaches out to its DNS service and asks to resolve the name "pod2.region2".  The DNS service is not able to resolve that name directly but there is a forwarding rule that says for any hostname of the format "*.region2", talk to another DNS server at address 10.10.10.10.&lt;/li&gt;
&lt;li&gt;The 10.10.10.10 address is a load balancer that is exposing the DNS service of region2's k8s cluster.  The region1 DNS server talks to region2's DNS server via this link and resolves the pod2's host name as 10.2.0.2.&lt;/li&gt;
&lt;li&gt;pod1 now knows the IP address of pod2.region2 and creates a connection to 10.2.0.2 directly.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you're having trouble getting the pods to communicate across regions, there are a few things you can look at for troubleshooting:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You can look at the logs of the CRDB pods and find error messages about connecting to other nodes.  Find the logs by running &lt;code&gt;kubectl logs cockroachdb-0 -c cockroach -n &amp;lt;namespace name here&amp;gt;&lt;/code&gt;.  Sometimes the error message will help you understand whether name resolution has occurred.  For instance, if you see an error like:
&lt;code&gt;addrConn.createTransport failed to connect to {cockroachdb-2.region2:}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.2.2.2: connect: connection refused&lt;/code&gt;, 
then you can tell that the DNS resolution has happened correctly (since both the hostname and IP address are listed in the error).  In this case, you don't have a DNS resolution problem but some other network connectivity issue.&lt;/li&gt;
&lt;li&gt;You can also look at the logs of the DNS pods by running &lt;code&gt;kubectl get pods -n kube-system&lt;/code&gt;.  Then after you find the name of one of the dns pods, you can run &lt;code&gt;kubectl logs &amp;lt;name of dns pod&amp;gt; -n kube-system&lt;/code&gt; and look at the log messages.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Network Firewalls
&lt;/h2&gt;

&lt;p&gt;You'll want to make sure that you have firewalls opened between the clusters on port 53 (DNS -- typically UDP, with TCP fallback) and also TCP/26257 (the port used for CRDB node-to-node communication).  Depending on your k8s environment (GKE, EKS, etc.), these settings can be controlled in one or more places.  For instance, in EKS (i.e., AWS), there is a security group associated with the cluster that has firewall settings.  There is also a NACL (Network Access Control List) on each private subnet that has its own firewall settings.&lt;/p&gt;
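&lt;p&gt;As a sketch of what this looks like on GKE (the rule name is arbitrary, the CIDRs should match your clusters' pod networks, and the command is echoed so you can review it before running; on EKS you'd edit the security group and NACL rules instead):&lt;/p&gt;

```shell
# Illustrative values: DNS on 53 (UDP plus TCP fallback) and CRDB
# node-to-node traffic on TCP/26257, allowed from each cluster's
# pod network.
PORTS="tcp:53,udp:53,tcp:26257"
POD_NETWORKS="10.1.0.0/16,10.2.0.0/16,10.3.0.0/16"

# Echoed rather than executed so it can be reviewed first.
echo gcloud compute firewall-rules create crdb-multiregion \
  --allow="$PORTS" \
  --source-ranges="$POD_NETWORKS"
```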

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;CockroachDB is complementary to k8s.  When running across multiple k8s regions, there is a little extra work (mostly done by the python script described in our docs) to set up DNS forwarding so that k8s pods in different k8s clusters can communicate.  This DNS setup, in conjunction with smart network planning, is a good solution for running CRDB in multi-region k8s.&lt;/p&gt;

</description>
      <category>security</category>
      <category>systemdesign</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Mentors</title>
      <dc:creator>Jim Hatcher</dc:creator>
      <pubDate>Wed, 08 Mar 2023 16:28:57 +0000</pubDate>
      <link>https://dev.to/jhatcher9999/mentors-34j5</link>
      <guid>https://dev.to/jhatcher9999/mentors-34j5</guid>
      <description>&lt;p&gt;My oldest daughter is months away from graduating from college.  She and I have had lots of discussions about where she should apply for jobs and what attributes she should seek in an ideal employer.  I have thought about what was most valuable for me early in my career, and if I really boil it down, it is mentorship.&lt;/p&gt;

&lt;p&gt;In this blog, I want to talk about my early education and jobs and talk about how mentorship impacted me.&lt;/p&gt;

&lt;p&gt;Early in my college years at NLU (later to be renamed ULM), I met a professor named Ron Berry.  From the get-go, I knew that Dr. Berry was special.  He was the faculty sponsor for the DPMA (the student organization for Computer Information Systems students), and I remember interacting with him at a club event and realizing that he really cared about connecting with the students.  I later got a student job working in the CIS department, and Dr. Berry was assigned to be my supervisor.  He gave me some great tasks, including going through all the computers in an old computer lab and trying to make some working units out of the pile of computer parts.  Later, he taught me how to make Ethernet cables (cutting, crimping, testing) and tasked me with wiring up a new computer lab.  What amazing tasks for a kid just learning how to use computers!  He also taught several of my classes and was my advisor.  Nearly 25 years later, Ron Berry is now the President of ULM and is continuing to impact the lives of students.&lt;/p&gt;

&lt;p&gt;In the summer between my junior and senior years, my longtime girlfriend and I got married.  When I returned to school in the fall, I realized that I needed more income to afford my newlywed life.  I decided to take classes part-time and took a full-time job at the College of Business doing tech support for faculty.  Dr. Berry was my supervisor and encouraged me to write a system to track the issues that people called me about and the solutions to them.  I learned many things in my time with Dr. Berry -- basic computer repair, networking, troubleshooting, and systems analysis -- but the main thing I took away from him is that a good man who cares about his community and his students can have an enormous impact.&lt;/p&gt;

&lt;p&gt;Roughly a year later, Dr. Berry called me and told me that there was a job opening at Chase Manhattan Mortgage Company and suggested that it might be a good fit for me.  I did accept this new role at CMMC and became a junior programmer working for a guy named Ken Robertson.  Ken taught me a lot about writing database queries and basic programming.  He also taught me what it looked like to be an IT professional -- including taking responsibility, acting with integrity, and how to interact with other departments effectively.  Ken had a way of demonstrating things without you even knowing he was teaching you -- he just modeled professionalism in his every day routines, and you couldn't help but learn from his example.&lt;/p&gt;

&lt;p&gt;A few years later, my family and I had moved to Colorado and I took a job at First Data Corporation as a full-stack developer.  One of my teammates was a guy named Mike Barlow.  It was quickly apparent to me that Mike was a really sharp guy, and that there were things I could learn from Mike.  To this point in my career, I had worked a lot with databases and had done a lot of top-down programming in Visual Basic and (what is now called "classic") ASP.  Mike liked to challenge his teammates to always learn new techniques and technologies, and I remember a day when he took me and another gal on our team to a conference room and taught us the basics of object-oriented programming.  This was an eye-opener for me, and I quickly started to adopt new languages and new object-oriented techniques.  Mike was always pushing those around him to do better, and he taught me that you could have teachers in your life who weren't your bosses or formally assigned mentors.  My relationship with Mike was one of the best things that ever happened to me -- we worked together at many companies over the next 15-20 years and pushed each other to be better.  Our relationship was often described by others as a "husband and wife" because we knew how to work together and be effective while challenging and even arguing with each other.  Mike passed away recently, and I think about and miss him daily.&lt;/p&gt;

&lt;p&gt;I am now 25 years into my IT career.  I have worked in a variety of companies in various roles and have worked on many different systems.  When I look back on my early career as a newly minted college graduate, I realize how little I actually knew about how to write code and interact with data.  I had learned some basic skills and knew &lt;em&gt;about&lt;/em&gt; the things one did to work on computer systems, but I had very little actual experience.  What I did have was a desire to learn.  I managed to work at positions which were a series of stepping stones, picking up real-world skills.  Another realization was that I was supported and encouraged each step of the way by mentors who had seen things, knew things, and saw something in me, leading them to believe that I had promise and was worth investing in.  I am thankful for the mentors described here and for others who had a hand in helping me.  When mentoring opportunities arise in my current role, I jump at them as a way to honor those who pumped time and wisdom into me.&lt;/p&gt;

&lt;p&gt;So, when my daughter asks me, "Dad, where should I apply?", I tell her that she needs to seek out roles with good bosses, good teammates, and a culture of teamwork.  This is a hard thing to find in a job -- you certainly can't filter a list of job openings for these traits! -- but I believe that if you enter an interview process with these goals in mind, you can discern which roles are going to provide these benefits and will therefore act as a springboard to launch you into a successful career.&lt;/p&gt;

&lt;p&gt;Here's to the mentors out there -- thank you!&lt;/p&gt;

</description>
      <category>mentorship</category>
      <category>reflection</category>
      <category>career</category>
    </item>
    <item>
      <title>Running CockroachDB on k8s - with tweaks for Production</title>
      <dc:creator>Jim Hatcher</dc:creator>
      <pubDate>Mon, 06 Mar 2023 15:56:02 +0000</pubDate>
      <link>https://dev.to/jhatcher9999/running-cockroachdb-on-k8s-with-tweaks-for-production-1gf5</link>
      <guid>https://dev.to/jhatcher9999/running-cockroachdb-on-k8s-with-tweaks-for-production-1gf5</guid>
      <description>&lt;p&gt;Running CockroachDB (CRDB) on Kubernetes (k8s) is a complementary affair.  CRDB provides high availability at the data layer, and k8s provides high availability at the infrastructure layer.&lt;/p&gt;

&lt;p&gt;For enterprises that are already leveraging k8s to run their applications &amp;amp; microservices and who have experience running and administering k8s, it can make sense to also run the database within k8s.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://www.cockroachlabs.com/docs/stable/kubernetes-overview.html" rel="noopener noreferrer"&gt;docs for running CRDB on k8s&lt;/a&gt; on the Cockroach Labs' documentation site have an excellent set of steps for getting CRDB up and running.  They are all you need to run a demo of CRDB on k8s.&lt;/p&gt;

&lt;p&gt;However, if you're planning on running CRDB on k8s in Production, there are a few other things you'll probably want to do.  In this blog, I'll explain how to go about them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Expose CRDB outside of the k8s cluster
&lt;/h3&gt;

&lt;p&gt;You can follow the docs for &lt;a href="https://www.cockroachlabs.com/docs/stable/deploy-cockroachdb-with-kubernetes.html" rel="noopener noreferrer"&gt;deploying CRDB in a single-region  k8s deployment&lt;/a&gt;.  When you deploy CRDB on k8s (through any of the available methods -- operator, helm chart, or via yaml configs), there are a few services created for you.&lt;/p&gt;

&lt;p&gt;For instance, after installing CRDB via the CRDB k8s operator on GKE, I have the following services:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl get svc
NAME                 TYPE        CLUSTER-IP      EXTERNAL-IP   PORT&lt;span class="o"&gt;(&lt;/span&gt;S&lt;span class="o"&gt;)&lt;/span&gt;                        AGE
cockroachdb          ClusterIP   None            &amp;lt;none&amp;gt;        26258/TCP,8080/TCP,26257/TCP   59s
cockroachdb-public   ClusterIP   10.108.12.200   &amp;lt;none&amp;gt;        26258/TCP,8080/TCP,26257/TCP   59s
kubernetes           ClusterIP   10.108.0.1      &amp;lt;none&amp;gt;        443/TCP                        7m5s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In k8s, there are four potential &lt;a href="https://kubernetes.io/docs/concepts/services-networking/service/#publishing-services-service-types" rel="noopener noreferrer"&gt;types of services&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ClusterIP&lt;/strong&gt; - this is the default if another type is not explicitly specified in the service's manifest; this type of service is internal to the k8s cluster&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NodePort&lt;/strong&gt; - this is a way to expose the service outside of the k8s cluster using a custom port on each of the cluster nodes; this can be OK for development clusters, but it's not typically how you want to do things in Production because you have to use non-standard ports&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LoadBalancer&lt;/strong&gt; - this is another way to expose services to apps running outside of the k8s cluster; it's a better fit for Production deployments because you can use the standard ports, but you need a process that can assign a public IP to the load balancer; if you're running in a managed k8s service (e.g., EKS, GKE, AKS, or OpenShift), this is handled for you, but if you're running on OSS k8s, you have to handle it yourself&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ExternalName&lt;/strong&gt; - this is a way of assigning an external DNS name to a k8s service and is not really applicable for what we're talking about here.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  SQL Port
&lt;/h4&gt;

&lt;p&gt;You'll notice that among the services listed in our CRDB k8s cluster, there is one called "cockroachdb" of type "ClusterIP" which has no Cluster IP assigned.  This is called a "headless" service.  Its job is to act as the governing service for the statefulset called "cockroachdb"; it is not intended to be used by any internal or external apps to access the CRDB cluster.  You can see the reference to this service in the statefulset's manifest:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl get sts cockroachdb &lt;span class="nt"&gt;-o&lt;/span&gt; yaml | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s2"&gt;"serviceName"&lt;/span&gt;
  serviceName: cockroachdb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The other cockroach service called "cockroachdb-public" is also of type ClusterIP but has a Cluster IP assigned to it.  The point of this service is to be used by apps wanting to access CRDB that are running inside the k8s cluster.&lt;/p&gt;
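&lt;p&gt;As a minimal, hedged sketch of what that looks like from an app's perspective (the service name and namespace come from the examples above; the user and database are assumed), cluster DNS gives the service a stable hostname that an in-cluster app can put in its connection string:&lt;/p&gt;

```shell
# Build an in-cluster DSN from the (assumed) service name and namespace.
# Cluster DNS resolves SVC.NS.svc.cluster.local to the service's ClusterIP.
SVC=cockroachdb-public
NS=default
DSN="postgresql://roach@${SVC}.${NS}.svc.cluster.local:26257/defaultdb?sslmode=verify-full"
echo "$DSN"
```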

&lt;p&gt;In the CRDB docs, you'll see a section called "Use the built-in SQL Client" and you can see that they leverage this service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; cockroachdb-client-secure &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--&lt;/span&gt; ./cockroach sql &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--certs-dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/cockroach/cockroach-certs &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;cockroachdb-public
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a perfectly acceptable way to set up some basic things in the cluster via the SQL prompt (like creating the first users and verifying that basic read/write capabilities are working).  However, this is not the mechanism you'd want your apps to use to access the CRDB cluster in Production -- especially if your apps are running outside the k8s cluster.  I'll talk about the right way to do this a little later on.&lt;/p&gt;

&lt;p&gt;There is also a third service listed called "kubernetes" which is not used to access CRDB at all.&lt;/p&gt;

&lt;p&gt;When you're running CRDB, there are three access points into the CRDB nodes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;the SQL client port (26257 by default)&lt;/li&gt;
&lt;li&gt;the DB Console port (8080 by default), and&lt;/li&gt;
&lt;li&gt;the port used by other nodes to do node-to-node interactions (26258 by default).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All three of these ports are exposed by the "cockroachdb-public" service that we've been looking at.&lt;/p&gt;

&lt;h4&gt;
  
  
  DB Console port
&lt;/h4&gt;

&lt;p&gt;We can technically get to the DB Console on each CRDB node from any pod running inside our k8s cluster, but that would involve running curl commands, which isn't very useful.&lt;/p&gt;

&lt;p&gt;Just to illustrate what I'm talking about, you can do something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; cockroachdb-0 &lt;span class="nt"&gt;--&lt;/span&gt; curl &lt;span class="nt"&gt;-Lk&lt;/span&gt; http://cockroachdb-public:8080/
Defaulted container &lt;span class="s2"&gt;"db"&lt;/span&gt; out of: db, db-init &lt;span class="o"&gt;(&lt;/span&gt;init&lt;span class="o"&gt;)&lt;/span&gt;
&amp;lt;&lt;span class="o"&gt;!&lt;/span&gt;DOCTYPE html&amp;gt;
&amp;lt;html&amp;gt;
    &amp;lt;&lt;span class="nb"&gt;head&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
        &amp;lt;title&amp;gt;Cockroach Console&amp;lt;/title&amp;gt;
        &amp;lt;meta &lt;span class="nv"&gt;charset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"UTF-8"&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
        &amp;lt;&lt;span class="nb"&gt;link &lt;/span&gt;&lt;span class="nv"&gt;href&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"favicon.ico"&lt;/span&gt; &lt;span class="nv"&gt;rel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"shortcut icon"&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
    &amp;lt;/head&amp;gt;
    &amp;lt;body&amp;gt;
        &amp;lt;div &lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"react-layout"&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&amp;lt;/div&amp;gt;
        &amp;lt;script &lt;span class="nv"&gt;src&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"bundle.js"&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"text/javascript"&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&amp;lt;/script&amp;gt;
    &amp;lt;/body&amp;gt;
&amp;lt;/html&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see that we do actually get some HTML back from our curl command, but this is a lame way to interact with a website!  So, in the CRDB docs, they recommend using the &lt;code&gt;kubectl port-forward&lt;/code&gt; command to expose this service to the computer where your kubectl command is running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl port-forward service/cockroachdb-public 8080
Forwarding from 127.0.0.1:8080 -&amp;gt; 8080
Forwarding from &lt;span class="o"&gt;[&lt;/span&gt;::1]:8080 -&amp;gt; 8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once you've run this command, you can go to the browser on your computer, navigate to &lt;code&gt;http://localhost:8080/&lt;/code&gt;, and access the DB Console.  The kubectl command is proxying your input to the CRDB nodes and their output back to your browser.&lt;/p&gt;

&lt;p&gt;Again, this is a perfectly acceptable way to access the CRDB nodes for a demo and just to make sure they're running OK.  But, for Production, you don't want to have everybody on your team port-forwarding into your nodes in order to monitor what's happening in CRDB.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using an external load balancer
&lt;/h3&gt;

&lt;p&gt;In order to access the SQL client and DB Console from outside the cluster, the best way to go is to create a k8s service of type &lt;code&gt;LoadBalancer&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Create a yaml file like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# put annotations here that affect how EKS creates things&lt;/span&gt;
    &lt;span class="c1"&gt;# service.beta.kubernetes.io/aws-load-balancer-internal: "true"&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cockroachdb&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cockroachdb-lb&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# if you don't specify the type, it will default to ClusterIP which won't expose the services outside of the k8s cluster&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LoadBalancer&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# this selector is the label associated with your CRDB pods&lt;/span&gt;
    &lt;span class="c1"&gt;# if you're not sure -- run this: kubectl get pods --show-labels&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cockroachdb&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https&lt;/span&gt;
    &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
    &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TCP&lt;/span&gt;
    &lt;span class="na"&gt;targetPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tcp&lt;/span&gt;
    &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;26257&lt;/span&gt;
    &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TCP&lt;/span&gt;
    &lt;span class="na"&gt;targetPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;26257&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, you can create the service by applying that yaml file with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; cockroachdb-lb.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice that if you get a listing of your services right after creating it, you will see a service called "cockroachdb-lb" with an External IP of "pending":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl get svc
NAME                 TYPE           CLUSTER-IP      EXTERNAL-IP   PORT&lt;span class="o"&gt;(&lt;/span&gt;S&lt;span class="o"&gt;)&lt;/span&gt;                          AGE
cockroachdb          ClusterIP      None            &amp;lt;none&amp;gt;        26258/TCP,8080/TCP,26257/TCP     33m
cockroachdb-lb       LoadBalancer   10.108.4.152    &amp;lt;pending&amp;gt;     8080:31016/TCP,26257:31395/TCP   4s
cockroachdb-public   ClusterIP      10.108.12.200   &amp;lt;none&amp;gt;        26258/TCP,8080/TCP,26257/TCP     33m
kubernetes           ClusterIP      10.108.0.1      &amp;lt;none&amp;gt;        443/TCP                          39m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you wait a few seconds and try again, you'll see that an External IP value is assigned:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl get svc
NAME                 TYPE           CLUSTER-IP      EXTERNAL-IP      PORT&lt;span class="o"&gt;(&lt;/span&gt;S&lt;span class="o"&gt;)&lt;/span&gt;                          AGE
cockroachdb          ClusterIP      None            &amp;lt;none&amp;gt;           26258/TCP,8080/TCP,26257/TCP     35m
cockroachdb-lb       LoadBalancer   10.108.4.152    34.139.126.177   8080:31016/TCP,26257:31395/TCP   97s
cockroachdb-public   ClusterIP      10.108.12.200   &amp;lt;none&amp;gt;           26258/TCP,8080/TCP,26257/TCP     35m
kubernetes           ClusterIP      10.108.0.1      &amp;lt;none&amp;gt;           443/TCP                          41m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Because I'm running in GKE, Google Cloud handles creating a load balancer for me.  If you look in the GCP Cloud Console, you can see the load balancer details.&lt;/p&gt;

&lt;p&gt;If I describe the LB svc, I can look at the endpoints that have been exposed by the service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl describe svc cockroachdb-lb
Name:                     cockroachdb-lb
Namespace:                default
Labels:                   &lt;span class="nv"&gt;app&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;cockroachdb
Annotations:              cloud.google.com/neg: &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"ingress"&lt;/span&gt;:true&lt;span class="o"&gt;}&lt;/span&gt;
Selector:                 &lt;span class="nv"&gt;app&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;cockroachdb
Type:                     LoadBalancer
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       10.108.4.152
IPs:                      10.108.4.152
LoadBalancer Ingress:     34.139.126.177
Port:                     https  8080/TCP
TargetPort:               8080/TCP
NodePort:                 https  31016/TCP
Endpoints:                &amp;lt;none&amp;gt;
Port:                     tcp  26257/TCP
TargetPort:               26257/TCP
NodePort:                 tcp  31395/TCP
Endpoints:                &amp;lt;none&amp;gt;
Session Affinity:         None
External Traffic Policy:  Cluster
Events:
  Type    Reason                Age    From                Message
  &lt;span class="nt"&gt;----&lt;/span&gt;    &lt;span class="nt"&gt;------&lt;/span&gt;                &lt;span class="nt"&gt;----&lt;/span&gt;   &lt;span class="nt"&gt;----&lt;/span&gt;                &lt;span class="nt"&gt;-------&lt;/span&gt;
  Normal  EnsuringLoadBalancer  3m12s  service-controller  Ensuring load balancer
  Normal  EnsuredLoadBalancer   2m36s  service-controller  Ensured load balancer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can see here that there are no endpoints assigned.  That's not good!&lt;/p&gt;

&lt;p&gt;The way that the LB gets associated with pods is via the "selector" in its spec.  My current selector is looking for pods with a label of &lt;code&gt;app: cockroachdb&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Let's see what labels our pods are actually using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl get pods &lt;span class="nt"&gt;--show-labels&lt;/span&gt;
NAME            READY   STATUS    RESTARTS   AGE   LABELS
cockroachdb-0   1/1     Running   0          37m   app.kubernetes.io/component&lt;span class="o"&gt;=&lt;/span&gt;database,app.kubernetes.io/instance&lt;span class="o"&gt;=&lt;/span&gt;cockroachdb,app.kubernetes.io/name&lt;span class="o"&gt;=&lt;/span&gt;cockroachdb,controller-revision-hash&lt;span class="o"&gt;=&lt;/span&gt;cockroachdb-7b9668cd75,crdb&lt;span class="o"&gt;=&lt;/span&gt;is-cool,statefulset.kubernetes.io/pod-name&lt;span class="o"&gt;=&lt;/span&gt;cockroachdb-0
cockroachdb-1   1/1     Running   0          37m   app.kubernetes.io/component&lt;span class="o"&gt;=&lt;/span&gt;database,app.kubernetes.io/instance&lt;span class="o"&gt;=&lt;/span&gt;cockroachdb,app.kubernetes.io/name&lt;span class="o"&gt;=&lt;/span&gt;cockroachdb,controller-revision-hash&lt;span class="o"&gt;=&lt;/span&gt;cockroachdb-7b9668cd75,crdb&lt;span class="o"&gt;=&lt;/span&gt;is-cool,statefulset.kubernetes.io/pod-name&lt;span class="o"&gt;=&lt;/span&gt;cockroachdb-1
cockroachdb-2   1/1     Running   0          37m   app.kubernetes.io/component&lt;span class="o"&gt;=&lt;/span&gt;database,app.kubernetes.io/instance&lt;span class="o"&gt;=&lt;/span&gt;cockroachdb,app.kubernetes.io/name&lt;span class="o"&gt;=&lt;/span&gt;cockroachdb,controller-revision-hash&lt;span class="o"&gt;=&lt;/span&gt;cockroachdb-7b9668cd75,crdb&lt;span class="o"&gt;=&lt;/span&gt;is-cool,statefulset.kubernetes.io/pod-name&lt;span class="o"&gt;=&lt;/span&gt;cockroachdb-2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A better choice for the selector label would be &lt;code&gt;app.kubernetes.io/name=cockroachdb&lt;/code&gt;.&lt;/p&gt;
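&lt;p&gt;A quick way to sanity-check a candidate selector against the label strings above is to split a pod's label list and look for an exact match.  This is a hedged sketch run against one label line copied (and abbreviated) from the listing; with a live cluster you'd feed it the output of &lt;code&gt;kubectl get pods --show-labels&lt;/code&gt; instead:&lt;/p&gt;

```shell
# One pod's labels, copied (abbreviated) from the listing above:
labels='app.kubernetes.io/component=database,app.kubernetes.io/instance=cockroachdb,app.kubernetes.io/name=cockroachdb,crdb=is-cool'

# the original selector's label (app=cockroachdb) is absent...
echo "$labels" | tr ',' '\n' | grep -qx 'app=cockroachdb' || echo 'app=cockroachdb: not found'

# ...but the statefulset's actual label is present
echo "$labels" | tr ',' '\n' | grep -qx 'app.kubernetes.io/name=cockroachdb' && echo 'app.kubernetes.io/name=cockroachdb: found'
```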

&lt;p&gt;Let's edit the yaml to include that value and re-apply it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# put annotations here that affect how EKS creates things&lt;/span&gt;
    &lt;span class="c1"&gt;# service.beta.kubernetes.io/aws-load-balancer-internal: "true"&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cockroachdb&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cockroachdb-lb&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# if you don't specify the type, it will default to ClusterIP which won't expose the services outside of the k8s cluster&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LoadBalancer&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# this selector is the label associated with your CRDB pods&lt;/span&gt;
    &lt;span class="c1"&gt;# if you're not sure -- run this: kubectl get pods --show-labels&lt;/span&gt;
    &lt;span class="na"&gt;app.kubernetes.io/name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cockroachdb&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https&lt;/span&gt;
    &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
    &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TCP&lt;/span&gt;
    &lt;span class="na"&gt;targetPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tcp&lt;/span&gt;
    &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;26257&lt;/span&gt;
    &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TCP&lt;/span&gt;
    &lt;span class="na"&gt;targetPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;26257&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice that in the yaml I have to enter &lt;code&gt;app.kubernetes.io/name: cockroachdb&lt;/code&gt; (key/value map syntax) instead of &lt;code&gt;app.kubernetes.io/name=cockroachdb&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; cockroachdb-lb.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, let's look at our endpoints again:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl describe svc cockroachdb-lb
Name:                     cockroachdb-lb
Namespace:                default
Labels:                   &lt;span class="nv"&gt;app&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;cockroachdb
Annotations:              cloud.google.com/neg: &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"ingress"&lt;/span&gt;:true&lt;span class="o"&gt;}&lt;/span&gt;
Selector:                 app.kubernetes.io/name&lt;span class="o"&gt;=&lt;/span&gt;cockroachdb
Type:                     LoadBalancer
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       10.108.4.152
IPs:                      10.108.4.152
LoadBalancer Ingress:     34.139.126.177
Port:                     https  8080/TCP
TargetPort:               8080/TCP
NodePort:                 https  31016/TCP
Endpoints:                10.104.0.4:8080,10.104.1.8:8080,10.104.2.7:8080
Port:                     tcp  26257/TCP
TargetPort:               26257/TCP
NodePort:                 tcp  31395/TCP
Endpoints:                10.104.0.4:26257,10.104.1.8:26257,10.104.2.7:26257
Session Affinity:         None
External Traffic Policy:  Cluster
Events:
  Type    Reason                Age                  From                Message
  &lt;span class="nt"&gt;----&lt;/span&gt;    &lt;span class="nt"&gt;------&lt;/span&gt;                &lt;span class="nt"&gt;----&lt;/span&gt;                 &lt;span class="nt"&gt;----&lt;/span&gt;                &lt;span class="nt"&gt;-------&lt;/span&gt;
  Normal  EnsuringLoadBalancer  49s &lt;span class="o"&gt;(&lt;/span&gt;x2 over 8m52s&lt;span class="o"&gt;)&lt;/span&gt;  service-controller  Ensuring load balancer
  Normal  EnsuredLoadBalancer   44s &lt;span class="o"&gt;(&lt;/span&gt;x2 over 8m16s&lt;span class="o"&gt;)&lt;/span&gt;  service-controller  Ensured load balancer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Yay!  We have endpoints.&lt;/p&gt;

&lt;p&gt;Now, if I try to access the nodes via the SQL port or via the DB Console, I should be able to do so from outside the cluster.&lt;/p&gt;

&lt;p&gt;I can go to a browser and access &lt;code&gt;https://34.139.126.177:8080/&lt;/code&gt; and it works.  (You'll want to substitute the actual value of your LB's External IP here.)&lt;/p&gt;

&lt;p&gt;Also, I can access the nodes on the SQL port.  For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;cockroach sql &lt;span class="nt"&gt;--url&lt;/span&gt; &lt;span class="s1"&gt;'postgres://roach:Q7gc8rEdS@34.139.126.177:26257/defaultdb?sslmode=require'&lt;/span&gt;
&lt;span class="c"&gt;#&lt;/span&gt;
&lt;span class="c"&gt;# Welcome to the CockroachDB SQL shell.&lt;/span&gt;
&lt;span class="c"&gt;# All statements must be terminated by a semicolon.&lt;/span&gt;
&lt;span class="c"&gt;# To exit, type: \q.&lt;/span&gt;
&lt;span class="c"&gt;#&lt;/span&gt;
&lt;span class="c"&gt;# Client version: CockroachDB CCL v22.2.5 (aarch64-apple-darwin21.2, built 2023/02/16 16:37:38, go1.19.4)&lt;/span&gt;
&lt;span class="c"&gt;# Server version: CockroachDB CCL v22.2.2 (x86_64-pc-linux-gnu, built 2023/01/04 17:23:00, go1.19.1)&lt;/span&gt;

warning: server version older than client! proceed with caution&lt;span class="p"&gt;;&lt;/span&gt; some features may not be available.

&lt;span class="c"&gt;# Cluster ID: 7539a31a-fc44-4f89-a154-cc60f8aaeddd&lt;/span&gt;
&lt;span class="c"&gt;#&lt;/span&gt;
&lt;span class="c"&gt;# Enter \? for a brief introduction.&lt;/span&gt;
&lt;span class="c"&gt;#&lt;/span&gt;
roach@34.139.126.177:26257/defaultdb&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The SQL port in CRDB needs to be exposed by a L4/TCP load balancer (which is what we're doing above).  The DB Console port can also be exposed this way (as we've demonstrated), but since it's an HTTP/HTTPS access point, it could also be exposed through an L7 endpoint, like k8s' Ingress.  I'm not going to demonstrate that here in this blog, but it can certainly be done.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fixing the Node Certs
&lt;/h3&gt;

&lt;p&gt;One more thing to note here.  In my example above, I'm able to connect to my server using the IP address because I specified &lt;code&gt;sslmode=require&lt;/code&gt;.  This tells my Postgres driver/client that I want to use TLS to encrypt traffic to/from the cluster, but that I don't want to do any hostname verification checks.  &lt;a href="https://www.cockroachlabs.com/docs/cockroachcloud/authentication.html#ssl-mode-settings" rel="noopener noreferrer"&gt;We don't recommend&lt;/a&gt; connecting this way because it leaves your cluster susceptible to man-in-the-middle (MITM) attacks.&lt;/p&gt;

&lt;p&gt;In order to connect the "right way", I need to connect using &lt;code&gt;sslmode=verify-full&lt;/code&gt; and specify the ca.crt used to sign all the certs used in the CRDB cluster.&lt;/p&gt;
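&lt;p&gt;To make the difference concrete, here are the two client URLs side by side.  The host, user, and database are taken from the earlier examples; treat this as a sketch, not drop-in config:&lt;/p&gt;

```shell
# encrypts traffic but skips certificate verification -- MITM-susceptible
require_url='postgresql://roach@34.139.126.177:26257/defaultdb?sslmode=require'

# verifies the server cert against the cluster CA; the CA file is supplied
# via the sslrootcert URL parameter or a --certs-dir containing ca.crt
verify_url='postgresql://roach@34.139.126.177:26257/defaultdb?sslmode=verify-full'

echo "$require_url"
echo "$verify_url"
# then, e.g.: cockroach sql --url "$verify_url" --certs-dir=certs
```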

&lt;p&gt;I can get a list of the certs used by my cluster by asking k8s to list out all the secrets being used:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl get secrets
NAME               TYPE     DATA   AGE
cockroachdb-ca     Opaque   1      57m
cockroachdb-node   Opaque   3      57m
cockroachdb-root   Opaque   3      57m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If I look into the details of the cockroachdb-node secret, I can see the various cert and key files that it contains:&lt;br&gt;
(note that I'm using jq, which you can install on your Mac with &lt;code&gt;brew install jq&lt;/code&gt;)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl get secrets cockroachdb-node &lt;span class="nt"&gt;-o&lt;/span&gt; json | jq &lt;span class="s1"&gt;'.data | map_values(@base64d)'&lt;/span&gt; | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="s1"&gt;'{gsub(/\\n/,"\n")}1'&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"ca.crt"&lt;/span&gt;: &lt;span class="s2"&gt;"-----BEGIN CERTIFICATE-----
MIIDJTCCAg2gAwIBAgIQC+85luldQT9+ctIxQ1BitjANBgkqhkiG9w0BAQsFADAr
MRIwEAYDVQQKEwlDb2Nrcm9hY2gxFTATBgNVBAMTDENvY2tyb2FjaCBDQTAeFw0y
MzAyMjUyMDEzMzhaFw0zMzAzMDUyMDEzMzhaMCsxEjAQBgNVBAoTCUNvY2tyb2Fj
aDEVMBMGA1UEAxMMQ29ja3JvYWNoIENBMIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8A
MIIBCgKCAQEAp7fVT7JMbrzC0J9UeqN3r5YeW1FyYpVfpGWRiQICK8ZPv8NnzsaQ
SgOig83c9wax/wHP+xK4ISoTPMLc75eM+YKoN5fU17Ki28iopJwgIakCjSXJxcAv
cN0H6cn6BemL+qb9RS7Pffu8ohJKyLsNk7a/8xMNKUAPhgmBAYws4SOhG68/f1je
Lk8hsPrqVlCDBGPwVQdhCYkKvavLA7qG0D/+F+FfNI7a/qldqn/u74DN69gie5w4
37bB1IecleX3Ks0Ype+AiNzcdllUBC22ttVREpymVj7K24ti5DeyGPeHND5F/q6F
o8a/apYMPr+hbbPgsMjoreHlcCwgxk/zEwIDAQABo0UwQzAOBgNVHQ8BAf8EBAMC
AuQwEgYDVR0TAQH/BAgwBgEB/wIBATAdBgNVHQ4EFgQUabb2eIdtS1cn3QY/pNrk
v9Kyz8swDQYJKoZIhvcNAQELBQADggEBAI29Fz3SBzkSYdvhRWsXVjRL3XnteIvJ
GwwdIXgEx/Uxc+QXnOGRF6yKqMAzJhU15qP0u1LgqHq56tkeTmQAeApMg6VTa2wj
HibW77O8w8anukv5ThXeGs52FYTVzzQ/ao+y3R9cfyHQleoecohiFXYJ0RLKmj7n
ywZ9CocP6VnRklMyegpNBp9VWnnKTsMOs+lEaGzPDiJBdPJ0Ym9946jwaojb1st3
pnApAgN/32Ak9bTrBVf6Zl2zj6n6rLD294+EMScpvVqqIqA4iJh9cpGbIEu2TO4x
QrjTl5aBbP7e4VWQnVOSZgmeTJUnFm4L2kR53yFonmys0ZJ/14z0acw=
-----END CERTIFICATE-----
"&lt;/span&gt;,
  &lt;span class="s2"&gt;"tls.crt"&lt;/span&gt;: &lt;span class="s2"&gt;"-----BEGIN CERTIFICATE-----
MIID+jCCAuKgAwIBAgIQCjaSSuwS1yLAEuoysn7ZUDANBgkqhkiG9w0BAQsFADAr
MRIwEAYDVQQKEwlDb2Nrcm9hY2gxFTATBgNVBAMTDENvY2tyb2FjaCBDQTAeFw0y
MzAyMjUyMDEzMzhaFw0yODAzMDEyMDEzMzhaMCMxEjAQBgNVBAoTCUNvY2tyb2Fj
aDENMAsGA1UEAxMEbm9kZTCCASIwDQYJKoZIhvcNAQEBBQADggEPADCCAQoCggEB
AKT4igRxUZE5p7NDkDeSWqQjENX7W3tOTXoON1GyIjf8/j1xyQN2i/AMMFAdb5P9
f8mBFzsYes/WLgXWlPZQOal2MKJAOKJ1AYywKeZ+AqCYftIJlqm/1A/EdNn74Mv1
ykNU5f2YxdBAnl8MOIrIvWeghwzKv1PSYTiUDBFti9TNsQAvrwtXC8vrfir9rnz3
8j8QP1RMzQkySRUSsik0GGD/YMW5leTsEQKYxI+clkH7YM1pOUhw6b3SHbkZlYkO
arsgv2qlnjMUN4j/6HqtOyzu5wjyOBXKxccGwNtIJB3Xq0w3wYN1E3TWDmi9jY1c
T64w9KGgXLC8NR46MqjvfM0CAwEAAaOCASAwggEcMA4GA1UdDwEB/wQEAwIFoDAd
BgNVHSUEFjAUBggrBgEFBQcDAQYIKwYBBQUHAwIwHwYDVR0jBBgwFoAUabb2eIdt
S1cn3QY/pNrkv9Kyz8swgckGA1UdEQSBwTCBvoIJbG9jYWxob3N0ghJjb2Nrcm9h
Y2hkYi1wdWJsaWOCGmNvY2tyb2FjaGRiLXB1YmxpYy5kZWZhdWx0gixjb2Nrcm9h
Y2hkYi1wdWJsaWMuZGVmYXVsdC5zdmMuY2x1c3Rlci5sb2NhbIINKi5jb2Nrcm9h
Y2hkYoIVKi5jb2Nrcm9hY2hkYi5kZWZhdWx0gicqLmNvY2tyb2FjaGRiLmRlZmF1
bHQuc3ZjLmNsdXN0ZXIubG9jYWyHBH8AAAEwDQYJKoZIhvcNAQELBQADggEBACAr
6MQL8dGjbhufGRcGjpKE/ctmwpoARvfFIvCg1S5/ZXPJTz4A9fp1B0bxKoHopaOO
6F2EH9B6qH6g3cFbD6au+QXc5f/kgxVJVJOewCOUDLRjH1i3Fcnd/zxvywQU6cIs
ArfwWW+XoifirNQ6MwqxtPVzjMectGQUs1IpdDwLReO6eS9pFo4kHMZiJi5XTgjJ
krDFMbFUW8qnul1w3UrxgikXeLKnuIDnegPpX4Xk0yYF1ycxA46ZORV+DybP3DG8
F6lH6wA3uF2E62Z/52XH7UUvtUaAIK937vbxXosufD8KwbXCNEcojlSDYtuKhtCq
KcywMKGrVgdtd/nwxy4=
-----END CERTIFICATE-----
"&lt;/span&gt;,
  &lt;span class="s2"&gt;"tls.key"&lt;/span&gt;: &lt;span class="s2"&gt;"-----BEGIN RSA PRIVATE KEY-----
MIIEpAIBAAKCAQEApPiKBHFRkTmns0OQN5JapCMQ1ftbe05Neg43UbIiN/z+PXHJ
A3aL8AwwUB1vk/1/yYEXOxh6z9YuBdaU9lA5qXYwokA4onUBjLAp5n4CoJh+0gmW
qb/UD8R02fvgy/XKQ1Tl/ZjF0ECeXww4isi9Z6CHDMq/U9JhOJQMEW2L1M2xAC+v
C1cLy+t+Kv2ufPfyPxA/VEzNCTJJFRKyKTQYYP9gxbmV5OwRApjEj5yWQftgzWk5
SHDpvdIduRmViQ5quyC/aqWeMxQ3iP/oeq07LO7nCPI4FcrFxwbA20gkHderTDfB
g3UTdNYOaL2NjVxPrjD0oaBcsLw1HjoyqO98zQIDAQABAoIBAQCGBtY6ncXi8rBo
V6/HNkQlrcdz0W6VUxxm2T3gRZS/X+8+BD+HbLxsHbrym7eWyBEVqKcy/8RnLl7d
p2QGaU8vejIw33QjqGPF5SlldWK1Dq+Z/OhGqO6kkLtOjfAoRFw7L7Jawc+UTatd
FRSqzEP0+No/bkja1MTfrofPcOx1ygiTHsSm3JHy+rh/bxRxeU9J5JBWUD1KeRS4
FRsYqf7tgv6KzBktRRs29q/HeU4up0S9HyjbE9emc99g6ZfX2dpmqoDW0kBjo729
x0XP2KxmSGeAogTmpVBz6RjoDuCUAbtUjMAbpbDRJJqnm6R8fIj1e+mDpSwOS4QN
dikzHQiBAoGBANacPfviPU81ddowy1HjPEko4C4Qy6brmPWuaeA6QUUL/MR+QrYN
Usp4B7d8lsLnZEdHyeszDnxPaAzj4rE7uDhSSMJmfflNqQVmWR6jByQ8GgzDFS5/
Re3LYR26DJMHBNGZqCxQ7us7Aqc0+YeDT8/wlniOAlndvXQ7l1Tt7nGZAoGBAMTJ
fk7Cs81SaalQQ05O7jfjwvi6kX+ISbBnBB6LyDCNkBoCwRKIHsiDIkKlFhGgwvim
+K/Ugsg8GuqYb5qd1ag+Kb8ykQpbMjkvr6Mh1bArN3KWSQTaiFko7nJLLd7P2H0V
WzrD/OUD0J2NKkzQLJcxuS8hc5YRj0DGWqzCVw1VAoGAAmj+yTVhOuJ+0FR79A95
PdkXq2zE3LsInLm4tqvwz7WywQIp/aForJ1seMMNbmLq3WIRAnMwVnUN1hc5FIR3
LSq/Zm+AOqyEmWrs1Us/aUjDgiEuu7byMhl2nb7ZJU2O4Eu5d8Xw6PNgtEAEDWGM
I+mvxurRW/EBj6ybpniFlQECgYB/DXzQSyMdeI0htOGPyKRDT3lNb9+K0KqLCyf8
tNEuj+eu84JGfb4qRYg0MTQbc4kOU3eSxokd0LisKHk+AZO1yVTYzkQYxKKbi29B
yxGVaYGmKOPCD3oi3qt8/Y8DIXyr3cMGIQ3BqwHhBwh9iZaQk5j1lgpzpKix8J8Q
lXTw9QKBgQCNKN1p9UNxpX19ni7Mg+MVPRkZgC68DBXYxNDnAexFjlatV+dIo03u
SxGBsB8Y1q4AbqHwWe8lSp+erADzeWtkYD4u9BSZl4WD50Mbev/Fut9dGJwnI+BJ
0ldr96qyslFD1RitRl5Xc6gOTcF4Bt/O5GRo5+2F4fDwJm6+dYIjJA==
-----END RSA PRIVATE KEY-----
"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I'm going to copy the ca.crt value from that output and paste it into a local file called ca.crt.&lt;/p&gt;
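
&lt;p&gt;Alternatively, you can pull the value straight out of the Kubernetes secret. This is a sketch -- the secret name &lt;code&gt;cockroachdb.node&lt;/code&gt; is an assumption here, so check &lt;code&gt;kubectl get secrets&lt;/code&gt; for the name your installation actually uses:&lt;/p&gt;

```shell
# Extract the ca.crt key from the secret and base64-decode it into a local file.
# The secret name "cockroachdb.node" is an assumption -- adjust for your install.
# (go-template's `index` handles the dot in the "ca.crt" key name.)
kubectl get secret cockroachdb.node \
  -o go-template='{{ index .data "ca.crt" }}' | base64 -d > ca.crt
```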

&lt;p&gt;OK, now I can try to connect the right way:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;cockroach sql &lt;span class="nt"&gt;--url&lt;/span&gt; &lt;span class="s1"&gt;'postgres://roach:Q7gc8rEdS@34.139.126.177:26257/defaultdb?sslmode=verify-full&amp;amp;sslrootcert=ca.crt'&lt;/span&gt;
&lt;span class="c"&gt;#&lt;/span&gt;
&lt;span class="c"&gt;# Welcome to the CockroachDB SQL shell.&lt;/span&gt;
&lt;span class="c"&gt;# All statements must be terminated by a semicolon.&lt;/span&gt;
&lt;span class="c"&gt;# To exit, type: \q.&lt;/span&gt;
&lt;span class="c"&gt;#&lt;/span&gt;
ERROR: failed to connect to &lt;span class="sb"&gt;`&lt;/span&gt;&lt;span class="nv"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;34.139.126.177 &lt;span class="nv"&gt;user&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;roach &lt;span class="nv"&gt;database&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;defaultdb&lt;span class="sb"&gt;`&lt;/span&gt;: failed to write startup message &lt;span class="o"&gt;(&lt;/span&gt;x509: certificate is valid &lt;span class="k"&gt;for &lt;/span&gt;127.0.0.1, not 34.139.126.177&lt;span class="o"&gt;)&lt;/span&gt;
Failed running &lt;span class="s2"&gt;"sql"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice that this time I pass my ca.crt file and use &lt;code&gt;sslmode=verify-full&lt;/code&gt;, and I get an error saying that my node certs are not valid for 34.139.126.177.&lt;/p&gt;

&lt;p&gt;To fix this, I need to re-issue my node certs so they include the load balancer's IP address.  If you want to create a DNS record like &lt;code&gt;crdb.myproject.mydomain.com&lt;/code&gt;, now is a good time to do so, because we'll want to include that domain name on the new certs, too.&lt;/p&gt;

&lt;p&gt;The process to generate the node certs is described in &lt;a href="https://www.cockroachlabs.com/docs/v22.2/deploy-cockroachdb-with-kubernetes?filters=manual#create-certificates" rel="noopener noreferrer"&gt;this doc&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Depending on how you installed CRDB in k8s in the first place, you might already have these various certs and keys in your local file system, or you might have to pull them from the Kubernetes secrets (as we did above for the ca.crt file).&lt;/p&gt;

&lt;p&gt;You don't need to re-create the CA key/crt pair, but you do need to re-create the CRDB node cert.  When you get to that step, add the IP address and the DNS name (if you have one) as additional parameters to this command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;cockroach cert create-node &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; localhost 127.0.0.1 &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; cockroachdb-public &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; cockroachdb-public.default &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; cockroachdb-public.default.svc.cluster.local &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt;.cockroachdb &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt;.cockroachdb.default &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt;.cockroachdb.default.svc.cluster.local &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; 34.139.126.177 &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; crdb.myproject.mydomain.com &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nt"&gt;--certs-dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;certs &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nt"&gt;--ca-key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;my-safe-directory/ca.key
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will create a node.crt/node.key pair.  I am renaming those to tls.crt/tls.key since that's what they're called in my installation.&lt;/p&gt;

&lt;p&gt;You can examine the tls.crt file to verify that it includes our IP and DNS name by running this command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;openssl x509 &lt;span class="nt"&gt;-in&lt;/span&gt; certs/tls.crt &lt;span class="nt"&gt;-noout&lt;/span&gt; &lt;span class="nt"&gt;-text&lt;/span&gt; | egrep &lt;span class="nt"&gt;-A&lt;/span&gt; 2 &lt;span class="s1"&gt;'X509v3 Subject Alternative Name'&lt;/span&gt;
            X509v3 Subject Alternative Name: 
                DNS:localhost, DNS:cockroachdb-public, DNS:cockroachdb-public.default, DNS:cockroachdb-public.default.svc.cluster.local, DNS:&lt;span class="k"&gt;*&lt;/span&gt;.cockroachdb, DNS:&lt;span class="k"&gt;*&lt;/span&gt;.cockroachdb.default, DNS:&lt;span class="k"&gt;*&lt;/span&gt;.cockroachdb.default.svc.cluster.local, DNS:crdb.myproject.mydomain.com, IP Address:127.0.0.1, IP Address:34.139.126.177
    Signature Algorithm: sha256WithRSAEncryption
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice that our IP and domain name are listed.&lt;/p&gt;

&lt;p&gt;Now we need to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Delete the existing secret&lt;/li&gt;
&lt;li&gt;Re-create the secret with the new cert/key files&lt;/li&gt;
&lt;li&gt;Restart the pods so they will pick up the new secrets/certs&lt;/li&gt;
&lt;/ol&gt;
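
&lt;p&gt;The steps above can be sketched with kubectl. The secret name &lt;code&gt;cockroachdb.node&lt;/code&gt; and StatefulSet name &lt;code&gt;cockroachdb&lt;/code&gt; are assumptions here -- use whatever names your installation created:&lt;/p&gt;

```shell
# 1. Delete the existing secret (names are assumptions -- check `kubectl get secrets`)
kubectl delete secret cockroachdb.node

# 2. Re-create the secret from the new cert/key files
kubectl create secret generic cockroachdb.node \
  --from-file=ca.crt=certs/ca.crt \
  --from-file=tls.crt=certs/tls.crt \
  --from-file=tls.key=certs/tls.key

# 3. Restart the pods so they mount the new secret contents
kubectl rollout restart statefulset/cockroachdb
```

&lt;p&gt;Because CockroachDB is highly available, the rolling restart replaces one pod at a time without taking the cluster down.&lt;/p&gt;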

&lt;p&gt;After doing so, if I try to connect using the "right way", I am able to do so successfully:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;cockroach sql &lt;span class="nt"&gt;--url&lt;/span&gt; &lt;span class="s1"&gt;'postgres://roach:Q7gc8rEdS@34.139.126.177:26257/defaultdb?sslmode=verify-full&amp;amp;sslrootcert=certs/ca.crt'&lt;/span&gt;
&lt;span class="c"&gt;#&lt;/span&gt;
&lt;span class="c"&gt;# Welcome to the CockroachDB SQL shell.&lt;/span&gt;
&lt;span class="c"&gt;# All statements must be terminated by a semicolon.&lt;/span&gt;
&lt;span class="c"&gt;# To exit, type: \q.&lt;/span&gt;
&lt;span class="c"&gt;#&lt;/span&gt;
&lt;span class="c"&gt;# Client version: CockroachDB CCL v22.2.5 (aarch64-apple-darwin21.2, built 2023/02/16 16:37:38, go1.19.4)&lt;/span&gt;
&lt;span class="c"&gt;# Server version: CockroachDB CCL v22.2.2 (x86_64-pc-linux-gnu, built 2023/01/04 17:23:00, go1.19.1)&lt;/span&gt;

warning: server version older than client! proceed with caution&lt;span class="p"&gt;;&lt;/span&gt; some features may not be available.

&lt;span class="c"&gt;# Cluster ID: 7539a31a-fc44-4f89-a154-cc60f8aaeddd&lt;/span&gt;
&lt;span class="c"&gt;#&lt;/span&gt;
&lt;span class="c"&gt;# Enter \? for a brief introduction.&lt;/span&gt;
&lt;span class="c"&gt;#&lt;/span&gt;
roach@34.139.126.177:26257/defaultdb&amp;gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;When running CRDB in k8s in production, you'll want to expose the CRDB nodes externally by using a load balancer.  And, you'll want to re-create your node certs to reference the load balancer details.&lt;/p&gt;

&lt;p&gt;Happy CRDBing!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Memory Management in CockroachDB</title>
      <dc:creator>Jim Hatcher</dc:creator>
      <pubDate>Mon, 27 Feb 2023 14:42:42 +0000</pubDate>
      <link>https://dev.to/jhatcher9999/memory-management-in-cockroachdb-29c4</link>
      <guid>https://dev.to/jhatcher9999/memory-management-in-cockroachdb-29c4</guid>
      <description>&lt;p&gt;If you've played with CockroachDB before, you may have seen some memory-related flags when starting Cockroach.  Specifically, things called "cache" and "max-sql-memory":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cockroach start &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--certs-dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;certs &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--advertise-addr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;node1 address&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--join&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;node1 address&amp;gt;,&amp;lt;node2 address&amp;gt;,&amp;lt;node3 address&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--cache&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;.25 &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--max-sql-memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;.25
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We talk a little bit about &lt;a href="https://www.cockroachlabs.com/docs/v22.2/recommended-production-settings#cache-and-sql-memory-size" rel="noopener noreferrer"&gt;these settings&lt;/a&gt; in our docs, but how are they actually used?&lt;/p&gt;
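
&lt;p&gt;Both flags accept a fraction of the node's total memory. As a quick back-of-the-envelope sketch (the 16 GiB machine size is just an assumed example, not a recommendation):&lt;/p&gt;

```shell
# Assumed example: a node with 16 GiB (16384 MiB) of RAM
total_mib=16384

# --cache=.25 reserves 25% of RAM for the Pebble cache
cache_mib=$((total_mib / 4))

# --max-sql-memory=.25 reserves 25% of RAM for SQL working memory
sql_mib=$((total_mib / 4))

echo "Pebble cache: ${cache_mib} MiB, SQL memory: ${sql_mib} MiB"
```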

&lt;p&gt;I put together a few diagrams to illustrate what we mean by cache and sql-memory, and I'll try to illustrate how they're used.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnjmxau7r3whkpq2wae3r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnjmxau7r3whkpq2wae3r.png" alt="Diagram of CRDB Node and its memory caches" width="665" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Above is a diagram of a node running in a CockroachDB cluster.  You can see that there are two caches managed by the &lt;code&gt;cockroach&lt;/code&gt; process: SQL Memory and the Pebble Cache.  What we refer to above as "cache" is this Pebble cache.  Pebble is the name of the key/value storage engine that CockroachDB leverages to actually store data.&lt;/p&gt;

&lt;p&gt;There are some other areas of memory available outside of the &lt;code&gt;cockroach&lt;/code&gt; process, namely the Operating System's page cache.&lt;/p&gt;

&lt;p&gt;Let's consider a read that happens against a CRDB cluster, and let's assume for this example that this is the first read that's happened against this node so there is nothing loaded into any caches.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fya67gqt112x6fa3g053q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fya67gqt112x6fa3g053q.png" alt="Read - no cache hits" width="783" height="521"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here are the steps that happen to satisfy this read query:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;SQL SELECT query hits the SQL engine&lt;/li&gt;
&lt;li&gt;SQL engine asks KV for data&lt;/li&gt;
&lt;li&gt;KV asks the O/S for data&lt;/li&gt;
&lt;li&gt;O/S reads from disk &amp;amp; loads files into the page cache (compressed -- all SSTables in CRDB are compressed)&lt;/li&gt;
&lt;li&gt;O/S gives KV data&lt;/li&gt;
&lt;li&gt;KV puts data in Pebble cache (uncompressed)&lt;/li&gt;
&lt;li&gt;KV processes data and gives to SQL&lt;/li&gt;
&lt;li&gt;SQL uses memory to process data (ordering, grouping, etc.)&lt;/li&gt;
&lt;li&gt;SQL returns results to client&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You can imagine that on a subsequent read, some of these steps could be short-circuited by a cache hit -- namely, you might be able to avoid the read from disk and only hit the O/S page cache; or, even better, if the Pebble cache already has the data, you don't need to talk to the O/S at all.&lt;/p&gt;
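
&lt;p&gt;That short-circuiting can be sketched as a toy tiered lookup (pure illustration -- this is not CockroachDB code):&lt;/p&gt;

```shell
# Toy model of the read path's cache tiers: serve a key from the first tier
# that has it (Pebble cache, then O/S page cache, then disk), and populate
# the caches on the way back up.
read_key() {
  key=$1
  case " $pebble_cache " in *" $key "*) echo "pebble-cache hit"; return ;; esac
  case " $page_cache "   in *" $key "*) echo "page-cache hit"; pebble_cache="$pebble_cache $key"; return ;; esac
  echo "disk read"
  page_cache="$page_cache $key"       # O/S page cache keeps the compressed block
  pebble_cache="$pebble_cache $key"   # Pebble cache keeps it uncompressed
}

pebble_cache=""
page_cache=""
read_key order_123   # first read goes all the way to disk
read_key order_123   # second read is served from the Pebble cache
```

&lt;p&gt;In the real system the tiers also differ in format -- compressed on disk and in the page cache, uncompressed in the Pebble cache -- which is part of why a Pebble cache hit is the cheapest path.&lt;/p&gt;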

&lt;p&gt;The write path for a CockroachDB node is simpler.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwbdp6gy9amz2tl4qlzqn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwbdp6gy9amz2tl4qlzqn.png" alt="Write path" width="800" height="518"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The write involves:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Mutation query (i.e., INSERT/UPDATE/DELETE) hits the SQL engine&lt;/li&gt;
&lt;li&gt;SQL engine passes data to KV engine&lt;/li&gt;
&lt;li&gt;KV writes data to SSTables in memory&lt;/li&gt;
&lt;li&gt;KV calls fsync to flush memory to disk (compressed)&lt;/li&gt;
&lt;li&gt;KV updates the Pebble cache (uncompressed)&lt;/li&gt;
&lt;li&gt;KV acknowledges write to SQL engine&lt;/li&gt;
&lt;li&gt;SQL returns acknowledgement to client&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Hopefully, this blog clears up what "SQL Memory" and "Cache" mean in the context of CockroachDB.  Happy SQL-ing!&lt;/p&gt;

</description>
      <category>softwaredevelopment</category>
    </item>
    <item>
      <title>Spark Update Optimizations</title>
      <dc:creator>Jim Hatcher</dc:creator>
      <pubDate>Tue, 21 Feb 2023 05:17:10 +0000</pubDate>
      <link>https://dev.to/jhatcher9999/spark-update-optimizations-5hlg</link>
      <guid>https://dev.to/jhatcher9999/spark-update-optimizations-5hlg</guid>
      <description>&lt;p&gt;About a year ago, I posted a blog on using &lt;a href="https://dev.to/jhatcher9999/apache-spark-job-to-update-cockroachdb-data-bmi"&gt;Apache Spark jobs to do updates against CockroachDB&lt;/a&gt;.  At the end of that post, I pledged to come back and do some optimizations to speed things up.  And, true to my word, here I am.&lt;/p&gt;

&lt;p&gt;Between the time I published my last blog and now, I switched to a new laptop, so I needed to get Spark running on it again.  I did the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;installed Scala 2.12 from homebrew&lt;/li&gt;
&lt;li&gt;installed Spark from homebrew (latest as of the time of me writing this is 3.3.2)&lt;/li&gt;
&lt;li&gt;tried to install Java 11 but gave up and installed Java 17 (Spark 3.3.2's &lt;a href="https://spark.apache.org/docs/latest/#downloading" rel="noopener noreferrer"&gt;overview page&lt;/a&gt; tells me I can run on Java 8/11/17)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I then tried to simply re-run the examples from my last blog post and ran into a gnarly error:&lt;br&gt;
&lt;code&gt;java.lang.IllegalAccessException: final field has no write access&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;After a little googling, I found a similarly-described problem &lt;a href="https://issues.apache.org/jira/browse/SPARK-40729" rel="noopener noreferrer"&gt;here&lt;/a&gt;.  This issue is fixed in the not-yet-released Spark 3.4.0, so I was a little stuck until I looked at the &lt;a href="https://github.com/apache/spark/commit/290e8a7cdf924309034dd5c8744be13813997510" rel="noopener noreferrer"&gt;changes included in the fix&lt;/a&gt; and decided I could reproduce them myself.&lt;/p&gt;

&lt;p&gt;Namely, the fix seems to be starting your Java process with the following option: &lt;code&gt;-Djdk.reflect.useDirectMethodHandle=false&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;I added this option to my Java driver options like so, and was able to run my previous examples (I also bumped up my Postgres driver version a bit):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;spark-shell
  &lt;span class="nt"&gt;--driver-class-path&lt;/span&gt; /Users/jimhatcher/.m2/repository/org/postgresql/postgresql/42.5.1/postgresql-42.5.1.jar
  &lt;span class="nt"&gt;--jars&lt;/span&gt; /Users/jimhatcher/.m2/repository/org/postgresql/postgresql/42.5.1/postgresql-42.5.1.jar
  &lt;span class="nt"&gt;--conf&lt;/span&gt; spark.jars&lt;span class="o"&gt;=&lt;/span&gt;/Users/jimhatcher/.m2/repository/org/postgresql/postgresql/42.5.1/postgresql-42.5.1.jar
  &lt;span class="nt"&gt;--conf&lt;/span&gt; spark.executor.memory&lt;span class="o"&gt;=&lt;/span&gt;4g
  &lt;span class="nt"&gt;--conf&lt;/span&gt; spark.driver.memory&lt;span class="o"&gt;=&lt;/span&gt;4g
  &lt;span class="nt"&gt;--driver-java-options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nt"&gt;-Djdk&lt;/span&gt;.reflect.useDirectMethodHandle&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;false&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a nasty little workaround (I admit), but I can drop it once Spark 3.4.0+ comes out.  At this point, I was able to re-run my previous test, which involved creating a table with an id column and one additional field (which started null) and then updating all the records in the table where that field was null to have a value of 1.&lt;/p&gt;

&lt;p&gt;I then proceeded to try to tune various settings (batch size, number of partitions in the Spark dataframe, fetch size in the dataframe, etc.). I could make some small improvements but nothing significant or interesting.&lt;/p&gt;

&lt;p&gt;I looked at the SQL Statements page in the CockroachDB DB Console and saw the queries that were being generated and run against the cluster.  All my updates were coming through as individual statements.  Each statement itself was fairly quick (~1ms), but running a bunch of individual statements is never as fast as batching data together.&lt;/p&gt;

&lt;p&gt;In the Postgres JDBC driver, there is a parameter called &lt;code&gt;reWriteBatchedInserts&lt;/code&gt; which can provide a huge performance lift when writing data to CockroachDB.  I verified that I was passing the parameter on my JDBC URL, and I was.  However, this parameter only comes into play when running INSERTs, not UPDATEs, so I wrote a different version of my Spark job that would read from one table and INSERT the values into another table.  When I ran this job, it took around 6 seconds to INSERT 1 million records instead of the 40 seconds I had been seeing when trying to UPDATE 1 million records.&lt;/p&gt;

&lt;p&gt;Here are the important bits of that Spark code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="nv"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;coalesce&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="py"&gt;foreachPartition&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
  &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;partition&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Iterator&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;Row&lt;/span&gt;&lt;span class="o"&gt;])&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

    &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;dbc&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Connection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;DriverManager&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;getConnection&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"root"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;batchSize&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
    &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;st&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;PreparedStatement&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;dbc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;prepareStatement&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"INSERT INTO test.test_table_target ( id, f1 ) VALUES ( ?, 1 );"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

    &lt;span class="nv"&gt;partition&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;grouped&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batchSize&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="py"&gt;foreach&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;
      &lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="nv"&gt;batch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;foreach&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;
          &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="nv"&gt;st&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;setLong&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;row&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;getLong&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; 
            &lt;span class="nv"&gt;st&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;addBatch&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
          &lt;span class="o"&gt;}&lt;/span&gt;
        &lt;span class="o"&gt;)&lt;/span&gt;
        &lt;span class="nv"&gt;st&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;executeBatch&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
      &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;)&lt;/span&gt;

    &lt;span class="nv"&gt;dbc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;close&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;

  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I then turned my thoughts to how I could force CockroachDB to do UPDATE statements in a batch, similar to how the Postgres JDBC driver leverages query re-writing to batch INSERT queries so efficiently.  One technique we can leverage is to pass a set of values into a SQL statement as arrays and then use the &lt;code&gt;UNNEST&lt;/code&gt; function to convert those arrays into tabular input.&lt;/p&gt;

&lt;p&gt;Instead of running a single update like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;test_table&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;f1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I turned my attention to an approach like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;test_table&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;f1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data_table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;new_f1&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;UNNEST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;UNNEST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;new_f1&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;data_table&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;test_table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data_table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To help visualize what is happening with the UNNEST function, here is a simple example of un-nesting an array in CockroachDB:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;root@localhost:26257/test&amp;gt; SELECT UNNEST(ARRAY[1, 2, 3]);
  unnest
----------
       1
       2
       3
(3 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this goal in mind, I re-wrote my Scala job to submit one UPDATE query for each batch (defined by my batch size variable).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;java.sql._&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.spark.sql._&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;scala.collection.mutable.ListBuffer&lt;/span&gt;

&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;url&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"jdbc:postgresql://&amp;lt;CRDB host name here&amp;gt;:26257/test?sslmode=require&amp;amp;sslcert=/Users/jimhatcher/spark-cluster-certs/client.root.crt&amp;amp;sslkey=/Users/jimhatcher/spark-cluster-certs/client.root.key.der&amp;amp;sslrootcert=/Users/jimhatcher/spark-cluster-certs/ca.crt&amp;amp;reWriteBatchedInserts=true"&lt;/span&gt;


&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;sql&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"( select id from test.test_table where f1 IS NULL) t1"&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;df&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;spark&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;read&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;format&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"jdbc"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"url"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"dbtable"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"user"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"root"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"partitionColumn"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"id"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"lowerBound"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"1"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"upperBound"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"1000000"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"numPartitions"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"10"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;load&lt;/span&gt;

&lt;span class="nv"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;coalesce&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="py"&gt;foreachPartition&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
  &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;partition&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Iterator&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;Row&lt;/span&gt;&lt;span class="o"&gt;])&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

    &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;dbc&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Connection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;DriverManager&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;getConnection&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"root"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;batchSize&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
    &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;st&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;PreparedStatement&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;dbc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;prepareStatement&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"UPDATE test_table SET f1 = data_table.new_f1 FROM ( SELECT UNNEST(?) AS id, UNNEST(?) AS new_f1 ) AS data_table WHERE test_table.id = data_table.id"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

    &lt;span class="nv"&gt;partition&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;grouped&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batchSize&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="py"&gt;foreach&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;
      &lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;ids&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ListBuffer&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;BigInt&lt;/span&gt;&lt;span class="o"&gt;];&lt;/span&gt;
        &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;fl1s&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ListBuffer&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;BigInt&lt;/span&gt;&lt;span class="o"&gt;];&lt;/span&gt;
        &lt;span class="nv"&gt;batch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;foreach&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;
          &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;ids&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nv"&gt;row&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;getLong&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="n"&gt;fl1s&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
          &lt;span class="o"&gt;}&lt;/span&gt;
        &lt;span class="o"&gt;)&lt;/span&gt;
        &lt;span class="nv"&gt;st&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;setArray&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;st&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;getConnection&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;createArrayOf&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"BIGINT"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;ids&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;toList&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;toArray&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
        &lt;span class="nv"&gt;st&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;setArray&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;st&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;getConnection&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;createArrayOf&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"BIGINT"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;fl1s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;toList&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;toArray&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
        &lt;span class="nv"&gt;st&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;execute&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
      &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;)&lt;/span&gt;

    &lt;span class="nv"&gt;dbc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;close&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;

  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using this job, I was able to UPDATE 1 million records in 3.6sec instead of ~40sec.&lt;/p&gt;

&lt;p&gt;Note: It's generally considered bad form in Scala to use mutable lists the way I'm doing above.  I could probably rewrite my batch loop using for-comprehensions and make this a little more idiomatic.  But, hey, that will give me something to write about in my next Spark optimization blog!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>softwaredevelopment</category>
      <category>discuss</category>
    </item>
    <item>
      <title>A Tale of Two Connection Pools</title>
      <dc:creator>Jim Hatcher</dc:creator>
      <pubDate>Mon, 13 Feb 2023 15:20:50 +0000</pubDate>
      <link>https://dev.to/jhatcher9999/a-tale-of-two-connection-pools-bld</link>
      <guid>https://dev.to/jhatcher9999/a-tale-of-two-connection-pools-bld</guid>
      <description>&lt;p&gt;It was the best of pools; it was the worst of pools...&lt;br&gt;
Ok, enough with the Dickensian references.&lt;/p&gt;
&lt;h2&gt;
  
  
  Connection Pools
&lt;/h2&gt;

&lt;p&gt;I have recently had two contrasting experiences with application-side, connection pooling frameworks when connecting to CockroachDB, and I thought I would share some of my findings.&lt;/p&gt;

&lt;p&gt;I am a Solutions Engineer at Cockroach Labs, so I often work with customers and prospects on testing out their codebase running against a CockroachDB cluster.  CockroachDB is a distributed SQL database (meaning a CockroachDB cluster is made up of many nodes).  This distributed DNA comes with obvious advantages like being able to do a rolling restart of the cluster nodes without having to take the database down.  However, it's a good idea to make sure that your application is configured to handle node restarts.&lt;/p&gt;

&lt;p&gt;For your application to gracefully handle node restarts, we typically recommend:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Load Balancer - Point to a load balancer that is round-robining traffic among the CRDB nodes and monitoring for DB node health (and taking dead nodes out of the rotation).  This can be a simple L4, TCP load balancer.  If you're running CockroachDB in our managed service offering, this load balancer is provided for you.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Connection Pool - Use a connection pool in your code (HikariCP is great for Java).  We would recommend a few settings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;max pool size&lt;/em&gt; (recommendations on size here: &lt;a href="https://www.cockroachlabs.com/docs/stable/connection-pooling.html#sizing-connection-pools" rel="noopener noreferrer"&gt;https://www.cockroachlabs.com/docs/stable/connection-pooling.html#sizing-connection-pools&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;max connection lifetime&lt;/em&gt; - set to maybe 8 hours and tune it down as necessary; this will make sure that your pool has a balanced number of connections within 8 hours of a rolling restart event.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Include Retry Logic in your code - We recommend catching and retrying serialization failure (i.e., 40001) errors in code.  You can treat 57* errors and 08* errors as a similar, retryable class of errors (see: &lt;a href="https://www.postgresql.org/docs/current/errcodes-appendix.html" rel="noopener noreferrer"&gt;https://www.postgresql.org/docs/current/errcodes-appendix.html&lt;/a&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
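
&lt;p&gt;To make the connection-lifetime recommendation concrete, here is a minimal HikariCP sketch (the pool size and the 8-hour value are illustrative assumptions; size the pool per the doc linked above):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;HikariDataSource ds = new HikariDataSource();
ds.setJdbcUrl("jdbc:postgresql://localhost:26257/defaultdb?sslmode=require");
ds.setMaximumPoolSize(40);                      // illustrative; see the sizing doc linked above
ds.setMaxLifetime(TimeUnit.HOURS.toMillis(8));  // retire connections within 8 hours of a rolling restart
&lt;/code&gt;&lt;/pre&gt;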

&lt;p&gt;Here's an example of a project I wrote in C# that shows this &lt;a href="https://github.com/jhatcher9999/csharp-samples/blob/main/SimpleDataService.cs#L149" rel="noopener noreferrer"&gt;retry logic in action&lt;/a&gt;.&lt;/p&gt;
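
&lt;p&gt;In Java, the same idea can be sketched like this (a hedged sketch, not a drop-in implementation -- the retry count and backoff here are assumptions you would tune):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Retry a statement on serialization (40001) errors and on the 57*/08* error classes.
void executeWithRetry(DataSource ds, String sql) throws SQLException {
    final int maxRetries = 3;  // assumption: tune for your workload
    for (int attempt = 1; ; attempt++) {
        try (Connection conn = ds.getConnection();
             Statement stmt = conn.createStatement()) {
            stmt.execute(sql);
            return;
        } catch (SQLException ex) {
            String state = ex.getSQLState();
            boolean retryable = "40001".equals(state)
                || (state != null &amp;&amp; (state.startsWith("57") || state.startsWith("08")));
            if (!retryable || attempt &gt;= maxRetries) {
                throw ex;
            }
            try {
                Thread.sleep(100L * attempt);  // simple linear backoff between attempts
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw ex;
            }
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;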

&lt;p&gt;These three elements (LB + CP + Retry Logic) interact with each other, and with all three in place, you will have a robust approach to handling rolling restarts in your code with minimal disruption.&lt;/p&gt;

&lt;p&gt;Here's a diagram that shows how this interaction works.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3hrytcuutsdhq93t83je.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3hrytcuutsdhq93t83je.png" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  How to handle leaking state across sessions
&lt;/h2&gt;

&lt;p&gt;Connection pools are great in that they save us the latencies associated with opening new connections to our database, and they allow us to multiplex lots of queries across a smaller number of connections to the database.&lt;/p&gt;

&lt;p&gt;Now, let's talk about one risk associated with leveraging them.&lt;/p&gt;

&lt;p&gt;Each database connection instance has various metadata associated with it.  In the CockroachDB world (which is Postgres-compatible), one such example of state is "session variables."  For instance, there is a session variable called &lt;code&gt;statement_timeout&lt;/code&gt; which, by default, has a value of 0 (meaning unlimited).  Suppose you set this variable to 60s, do something with the connection, and then return it to the pool.  The next process to retrieve that particular connection from the pool will have the &lt;code&gt;statement_timeout&lt;/code&gt; value set to 60s and will have no indication that it's not getting a "clean" session/connection.&lt;/p&gt;

&lt;p&gt;To battle this kind of state leakage, some connection pool implementations automatically run commands against connections when they're returned to the pool, with the goal of restoring all session state back to defaults.&lt;/p&gt;

&lt;p&gt;I was working on a proof-of-concept recently with a company that writes their code in C#/.NET, and they were using NPGSQL (which is a Postgres driver library for .NET) to connect to CockroachDB.  When you use the connection pool built into this project and close a connection, &lt;a href="https://github.com/npgsql/npgsql/blob/001c5ef35cf29ec9109e9a8d2ae54108a5d7ac9e/src/Npgsql/Internal/NpgsqlConnector.cs#L2061" rel="noopener noreferrer"&gt;NPGSQL runs several commands&lt;/a&gt; against the connection:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CLOSE ALL;&lt;/li&gt;
&lt;li&gt;UNLISTEN *;&lt;/li&gt;
&lt;li&gt;SELECT pg_advisory_unlock_all();&lt;/li&gt;
&lt;li&gt;DISCARD SEQUENCES;&lt;/li&gt;
&lt;li&gt;DISCARD TEMP;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It makes sense that a Postgres-specific driver would run Postgres-specific commands against a data source that is known to be Postgres (or Postgres-compatible).&lt;/p&gt;

&lt;p&gt;However, in the Java world, developers have many connection pooling toolsets from which to choose.  A popular choice is HikariCP.  HikariCP is not specific to a particular DB (like Postgres) but can be used against any JDBC Data Source (Oracle, MySQL, etc.).  I had always assumed that Hikari does similar "resetting" logic, but I tested that assumption this week and learned that it does not.  I suppose this makes sense, since it would have to know the "right" commands for each DB platform in order to reset state properly.&lt;/p&gt;

&lt;p&gt;So, I went about trying to make Hikari do this resetting for me.&lt;/p&gt;
&lt;h2&gt;
  
  
  Code Testing
&lt;/h2&gt;

&lt;p&gt;First, I set up some simple code to show the problem:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;showBroken&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="kd"&gt;throws&lt;/span&gt; &lt;span class="nc"&gt;SQLException&lt;/span&gt;  &lt;span class="o"&gt;{&lt;/span&gt;

        &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="nc"&gt;HikariDataSource&lt;/span&gt; &lt;span class="n"&gt;ds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;HikariDataSource&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
            &lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setJdbcUrl&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"jdbc:postgresql://localhost:26257/default?sslmode=require"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setUsername&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"roach"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setPassword&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"roach"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setMaximumPoolSize&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

            &lt;span class="nc"&gt;Connection&lt;/span&gt; &lt;span class="n"&gt;conn1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getConnection&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
            &lt;span class="nc"&gt;Statement&lt;/span&gt; &lt;span class="n"&gt;stmt1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conn1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;createStatement&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
            &lt;span class="nc"&gt;ResultSet&lt;/span&gt; &lt;span class="n"&gt;rs1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;stmt1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;executeQuery&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"show statement_timeout;"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"value of statement_timeout before messing with it: "&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rs1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;next&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rs1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getString&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
            &lt;span class="o"&gt;}&lt;/span&gt;
            &lt;span class="nc"&gt;Statement&lt;/span&gt; &lt;span class="n"&gt;stmt2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conn1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;createStatement&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
            &lt;span class="n"&gt;stmt2&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;execute&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"set statement_timeout = '60s';"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="nc"&gt;ResultSet&lt;/span&gt; &lt;span class="n"&gt;rs2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;stmt1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;executeQuery&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"show statement_timeout;"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"value of statement_timeout after setting it: "&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rs2&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;next&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rs2&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getString&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
            &lt;span class="o"&gt;}&lt;/span&gt;
            &lt;span class="n"&gt;conn1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;close&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

            &lt;span class="nc"&gt;Connection&lt;/span&gt; &lt;span class="n"&gt;conn2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getConnection&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
            &lt;span class="nc"&gt;Statement&lt;/span&gt; &lt;span class="n"&gt;stmt3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conn2&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;createStatement&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
            &lt;span class="nc"&gt;ResultSet&lt;/span&gt; &lt;span class="n"&gt;rs3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;stmt3&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;executeQuery&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"show statement_timeout;"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"value of statement_timeout in new connection without changing anything: "&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rs3&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;next&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rs3&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getString&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
            &lt;span class="o"&gt;}&lt;/span&gt;
            &lt;span class="n"&gt;conn2&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;close&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

            &lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;close&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;catch&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Exception&lt;/span&gt; &lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;toString&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;


    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When I run this code, I get this output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;value of statement_timeout before messing with it: 
0
value of statement_timeout after setting it: 
60000
value of statement_timeout in new connection without changing anything: 
60000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, we're "leaking" the 60s value between sessions.  Not good!&lt;/p&gt;

&lt;p&gt;I found &lt;a href="https://github.com/brettwooldridge/HikariCP/issues/503" rel="noopener noreferrer"&gt;one suggestion&lt;/a&gt; from the author of HikariCP on how to address this, which I implemented and it worked.  However, there are additional classes involved, and it feels a little clunky and hard to follow.&lt;/p&gt;

&lt;p&gt;So, I reached out to my colleague, Kai Niemi, who is a seriously smart dude, and he gave me two other suggestions which I found to be much cleaner.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option 1 - Use a proxy
&lt;/h3&gt;

&lt;p&gt;In this example, we use some Java Proxy + Reflection magic to create a HikariCP pool instance, but we give ourselves a "hook" into the pooling events where we can insert a call to "DISCARD ALL;":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;runTest2&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="kd"&gt;throws&lt;/span&gt; &lt;span class="nc"&gt;SQLException&lt;/span&gt;  &lt;span class="o"&gt;{&lt;/span&gt;

        &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="nc"&gt;HikariDataSource&lt;/span&gt; &lt;span class="n"&gt;ds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;HikariDataSource&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="nd"&gt;@Override&lt;/span&gt;
                &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;Connection&lt;/span&gt; &lt;span class="nf"&gt;getConnection&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="kd"&gt;throws&lt;/span&gt; &lt;span class="nc"&gt;SQLException&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                    &lt;span class="nc"&gt;Connection&lt;/span&gt; &lt;span class="n"&gt;delegate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;super&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getConnection&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
                    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Connection&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="nc"&gt;Proxy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;newProxyInstance&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
                            &lt;span class="nc"&gt;DataSource&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getClassLoader&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt;
                            &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Class&lt;/span&gt;&lt;span class="o"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="nc"&gt;Connection&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;},&lt;/span&gt;
                            &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getName&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;equals&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"close"&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                                    &lt;span class="nc"&gt;Connection&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Connection&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
                                    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Statement&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;createStatement&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                                        &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;execute&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"DISCARD ALL;"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
                                    &lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;SQLException&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                                        &lt;span class="c1"&gt;// not much to do, proceed with close&lt;/span&gt;
                                    &lt;span class="o"&gt;}&lt;/span&gt;
                                &lt;span class="o"&gt;}&lt;/span&gt;
                                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;invoke&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;delegate&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
                            &lt;span class="o"&gt;});&lt;/span&gt;
                &lt;span class="o"&gt;}&lt;/span&gt;
            &lt;span class="o"&gt;};&lt;/span&gt;
            &lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setJdbcUrl&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"jdbc:postgresql://localhost:26257/default?sslmode=require"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setUsername&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"roach"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setPassword&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"roach"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setMaximumPoolSize&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

            &lt;span class="nc"&gt;Connection&lt;/span&gt; &lt;span class="n"&gt;conn1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getConnection&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
            &lt;span class="nc"&gt;Statement&lt;/span&gt; &lt;span class="n"&gt;stmt1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conn1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;createStatement&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
            &lt;span class="nc"&gt;ResultSet&lt;/span&gt; &lt;span class="n"&gt;rs1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;stmt1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;executeQuery&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"show statement_timeout;"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"value of statement_timeout before messing with it: "&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rs1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;next&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rs1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getString&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
            &lt;span class="o"&gt;}&lt;/span&gt;
            &lt;span class="nc"&gt;Statement&lt;/span&gt; &lt;span class="n"&gt;stmt2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conn1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;createStatement&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
            &lt;span class="n"&gt;stmt2&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;execute&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"set statement_timeout = '60s';"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="nc"&gt;ResultSet&lt;/span&gt; &lt;span class="n"&gt;rs2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;stmt1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;executeQuery&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"show statement_timeout;"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"value of statement_timeout after setting it: "&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rs2&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;next&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rs2&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getString&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
            &lt;span class="o"&gt;}&lt;/span&gt;
            &lt;span class="n"&gt;conn1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;close&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

            &lt;span class="nc"&gt;Connection&lt;/span&gt; &lt;span class="n"&gt;conn2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getConnection&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
            &lt;span class="nc"&gt;Statement&lt;/span&gt; &lt;span class="n"&gt;stmt3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conn2&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;createStatement&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
            &lt;span class="nc"&gt;ResultSet&lt;/span&gt; &lt;span class="n"&gt;rs3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;stmt3&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;executeQuery&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"show statement_timeout;"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"value of statement_timeout in new connection without changing anything: "&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rs3&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;next&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rs3&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getString&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
            &lt;span class="o"&gt;}&lt;/span&gt;
            &lt;span class="n"&gt;conn2&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;close&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

            &lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;close&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;catch&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Exception&lt;/span&gt; &lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;toString&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;


    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output we get from this code shows that the statement_timeout setting no longer leaks across pooled connections -- yay!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;value of statement_timeout before messing with it: 
0
value of statement_timeout after setting it: 
60000
value of statement_timeout in new connection without changing anything: 
0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Option 2 - Use a library
&lt;/h3&gt;

&lt;p&gt;In this option, we wrap our HikariCP instance in another object.  The nice thing about this approach is that it works with any connection pooling library (not just HikariCP); however, it requires adding an additional dependency to your project: &lt;a href="https://github.com/jdbc-observations/datasource-proxy" rel="noopener noreferrer"&gt;https://github.com/jdbc-observations/datasource-proxy&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;
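
&lt;p&gt;If you're using Maven, the dependency declaration looks something like this (the version number below is illustrative -- check the project's releases page for the current one):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;dependency&amp;gt;
    &amp;lt;groupId&amp;gt;net.ttddyy&amp;lt;/groupId&amp;gt;
    &amp;lt;artifactId&amp;gt;datasource-proxy&amp;lt;/artifactId&amp;gt;
    &amp;lt;version&amp;gt;1.8&amp;lt;/version&amp;gt;
&amp;lt;/dependency&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
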

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;runTest3&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

        &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

            &lt;span class="nc"&gt;HikariConfig&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;HikariConfig&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
            &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setJdbcUrl&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"jdbc:postgresql://localhost:26257/default?sslmode=require"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setUsername&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"roach"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setPassword&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"roach"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setMaximumPoolSize&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

            &lt;span class="nc"&gt;DataSource&lt;/span&gt; &lt;span class="n"&gt;ds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ProxyDataSourceBuilder&lt;/span&gt;
                    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;create&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;HikariDataSource&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
                    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;listener&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;JdbcLifecycleEventListenerAdapter&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                        &lt;span class="nd"&gt;@Override&lt;/span&gt;
                        &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;afterGetConnection&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;MethodExecutionContext&lt;/span&gt; &lt;span class="n"&gt;executionContext&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                            &lt;span class="nc"&gt;Connection&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Connection&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;executionContext&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getResult&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
                            &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Statement&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;createStatement&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                                &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;execute&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"DISCARD ALL;"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
                            &lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;SQLException&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                                &lt;span class="c1"&gt;// whatever&lt;/span&gt;
                            &lt;span class="o"&gt;}&lt;/span&gt;
                        &lt;span class="o"&gt;}&lt;/span&gt;
                    &lt;span class="o"&gt;})&lt;/span&gt;
                    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

            &lt;span class="nc"&gt;Connection&lt;/span&gt; &lt;span class="n"&gt;conn1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getConnection&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
            &lt;span class="nc"&gt;Statement&lt;/span&gt; &lt;span class="n"&gt;stmt1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conn1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;createStatement&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
            &lt;span class="nc"&gt;ResultSet&lt;/span&gt; &lt;span class="n"&gt;rs1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;stmt1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;executeQuery&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"show statement_timeout;"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"value of statement_timeout before messing with it: "&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rs1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;next&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rs1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getString&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
            &lt;span class="o"&gt;}&lt;/span&gt;
            &lt;span class="nc"&gt;Statement&lt;/span&gt; &lt;span class="n"&gt;stmt2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conn1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;createStatement&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
            &lt;span class="n"&gt;stmt2&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;execute&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"set statement_timeout = '60s';"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="nc"&gt;ResultSet&lt;/span&gt; &lt;span class="n"&gt;rs2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;stmt1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;executeQuery&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"show statement_timeout;"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"value of statement_timeout after setting it: "&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rs2&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;next&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rs2&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getString&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
            &lt;span class="o"&gt;}&lt;/span&gt;
            &lt;span class="n"&gt;conn1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;close&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

            &lt;span class="nc"&gt;Connection&lt;/span&gt; &lt;span class="n"&gt;conn2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getConnection&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
            &lt;span class="nc"&gt;Statement&lt;/span&gt; &lt;span class="n"&gt;stmt3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conn2&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;createStatement&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
            &lt;span class="nc"&gt;ResultSet&lt;/span&gt; &lt;span class="n"&gt;rs3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;stmt3&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;executeQuery&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"show statement_timeout;"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"value of statement_timeout in new connection without changing anything: "&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rs3&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;next&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rs3&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getString&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
            &lt;span class="o"&gt;}&lt;/span&gt;
            &lt;span class="n"&gt;conn2&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;close&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

        &lt;span class="o"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;catch&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Exception&lt;/span&gt; &lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;toString&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Again, we see that the output proves that our leak has been plugged.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;value of statement_timeout before messing with it: 
0
value of statement_timeout after setting it: 
60000
value of statement_timeout in new connection without changing anything: 
0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
    </item>
    <item>
      <title>Recursive CTEs in CockroachDB</title>
      <dc:creator>Jim Hatcher</dc:creator>
      <pubDate>Thu, 28 Jul 2022 16:58:56 +0000</pubDate>
      <link>https://dev.to/jhatcher9999/recursive-ctes-in-cockroachdb-46f</link>
      <guid>https://dev.to/jhatcher9999/recursive-ctes-in-cockroachdb-46f</guid>
      <description>&lt;p&gt;I have been thinking about how CockroachDB could be used to implement a very flexible authorization system.&lt;/p&gt;

&lt;p&gt;I have previous experience using graph databases, which differ from relational databases in several key ways -- but one big differentiator is that graph databases are good for "highly relational data" (and I don't mean relational in the RDBMS sense).  Graph databases are good at answering questions like, "Given a family tree, are persons X and Y related, and if so, what is their relationship?", or "In my LinkedIn professional network, am I connected to person X, and if so, how many 'hops' away are we?"&lt;/p&gt;

&lt;p&gt;Questions like this are tricky to answer in general because the "distance" between things is not known, but they can be particularly difficult to answer in relational databases using standard data models, highly-structured schemas, and join-based queries.&lt;/p&gt;

&lt;p&gt;For my authorization system needs, I want to be able to answer these types of unstructured/unknown questions.  For instance: "If I'm a user in a system, and I'm a member of groups A, B, and C, do I have permission to do action X?"  This is further complicated by the fact that I want to allow a group hierarchy where a group can belong to another group, which can belong to a tree of groups, and the necessary permission(s) can be attached to groups anywhere in that tree.&lt;/p&gt;

&lt;p&gt;Most relational databases -- including CockroachDB -- have a structure that can help with these types of "walk the tree" problems; this structure is called a "Recursive CTE."&lt;/p&gt;

&lt;p&gt;A normal, non-recursive &lt;a href="https://www.cockroachlabs.com/docs/stable/common-table-expressions.html"&gt;CTE&lt;/a&gt; (Common Table Expression) is cool in and of itself since it allows you to define a query that can be re-used more than once in a series of queries.  Beyond facilitating re-use, CTEs can also simplify your queries by eliminating complex derived tables, which helps create more easily maintainable SQL code.&lt;/p&gt;

&lt;p&gt;A recursive CTE adds another layer of goodness because it allows recursive execution of the expression.&lt;/p&gt;

&lt;p&gt;Let's look at an example of a non-recursive CTE borrowed from the CockroachDB docs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;rides&lt;/span&gt;
  &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;revenue&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;98&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="c1"&gt;-- here we're joining to the 'r' CTE&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rider_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To use a recursive CTE, we need a slightly different syntax.  Let's have a look at an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="k"&gt;RECURSIVE&lt;/span&gt; &lt;span class="n"&gt;cte&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;factorial&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;-- initial subquery&lt;/span&gt;
    &lt;span class="k"&gt;UNION&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;factorial&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;cte&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt; &lt;span class="c1"&gt;-- recursive subquery&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;cte&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
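

&lt;p&gt;Running this, the initial subquery seeds the row (0, 1) and the recursive subquery keeps multiplying until the WHERE clause stops it, so we should see the factorials of 0 through 9 (output formatting may vary slightly by client):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  n | factorial
----+-----------
  0 |         1
  1 |         1
  2 |         2
  3 |         6
  4 |        24
  5 |       120
  6 |       720
  7 |      5040
  8 |     40320
  9 |    362880
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
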



&lt;p&gt;A recursive CTE must include the keyword "RECURSIVE", as well as an initial subquery (the starting point) UNIONed together with a recursive subquery.  Optionally, a column list can be included to explicitly name the columns returned in the CTE's table-valued structure.&lt;/p&gt;

&lt;p&gt;Now, back to the example of an authorization system.  I created a GitHub repo with sample data showing how such an authorization system could look in CockroachDB and how that data can be queried effectively using a recursive CTE.&lt;br&gt;
&lt;a href="https://github.com/cockroachlabs-field/perms-example"&gt;https://github.com/cockroachlabs-field/perms-example&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I won't repeat the contents of the GitHub repo here, but here's an example of the CTE I'm using there.  We're asking the question: Does the identity 'Andy, GM' have rights to read data from the Customer table?&lt;br&gt;
If the query returns results, the permission exists; if no results are returned, the permission doesn't exist.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="k"&gt;RECURSIVE&lt;/span&gt; &lt;span class="n"&gt;roles_hierarchy_cte&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;

    &lt;span class="c1"&gt;--query all the roles for the user we care about&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rid&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rid&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;perms_example&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;role_instance&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;
    &lt;span class="k"&gt;INNER&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;perms_example&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;identity_role_assignment&lt;/span&gt; &lt;span class="n"&gt;ira&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ira&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rid&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;ira&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'aaaaaaaa-1111-1111-1111-111111111111'&lt;/span&gt; &lt;span class="c1"&gt;-- Andy, GM&lt;/span&gt;

    &lt;span class="k"&gt;UNION&lt;/span&gt;

    &lt;span class="c1"&gt;--walk up the tree and find any other parent roles&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;rh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;child_rid&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rid&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;perms_example&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;role_hierarchy&lt;/span&gt; &lt;span class="n"&gt;rh&lt;/span&gt;
    &lt;span class="k"&gt;INNER&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;roles_hierarchy_cte&lt;/span&gt; &lt;span class="n"&gt;rhc&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;rhc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parent_rid&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="cm"&gt;/* get all the permissions that this identity is assigned to directly */&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;identity_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;role_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;permission_name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;perms_example&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;identity_instance&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;
&lt;span class="k"&gt;INNER&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;perms_example&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;identity_permission_assignment&lt;/span&gt; &lt;span class="n"&gt;ipa&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ipa&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iid&lt;/span&gt;
&lt;span class="k"&gt;INNER&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;perms_example&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;permission_instance&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;ipa&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'aaaaaaaa-1111-1111-1111-111111111111'&lt;/span&gt; &lt;span class="c1"&gt;-- Andy, GM&lt;/span&gt;
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'cccccccc-1111-1111-1111-111111111111'&lt;/span&gt; &lt;span class="c1"&gt;-- 'Read Customer Data&lt;/span&gt;

&lt;span class="k"&gt;UNION&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt;

&lt;span class="cm"&gt;/* also, get any permissions that this identity picks up due to role permissions
     and also role permissions from the role hierarchy */&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;identity_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;role_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;permission_name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;perms_example&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;role_instance&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;
&lt;span class="k"&gt;INNER&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;perms_example&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;role_permission_assignment&lt;/span&gt; &lt;span class="n"&gt;rpa&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rpa&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rid&lt;/span&gt;
&lt;span class="k"&gt;INNER&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;perms_example&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;permission_instance&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;rpa&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt;
&lt;span class="c1"&gt;--get me 1 identity row so I can include that on the output&lt;/span&gt;
&lt;span class="k"&gt;CROSS&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;identity_name&lt;/span&gt;
        &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;perms_example&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;identity_instance&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;
        &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'aaaaaaaa-1111-1111-1111-111111111111'&lt;/span&gt; &lt;span class="c1"&gt;--Andy, GM&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rid&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;rhc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rid&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;roles_hierarchy_cte&lt;/span&gt; &lt;span class="n"&gt;rhc&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'cccccccc-1111-1111-1111-111111111111'&lt;/span&gt; &lt;span class="c1"&gt;-- 'view customer data&lt;/span&gt;
&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Do you have other examples where recursive CTEs may be useful? &lt;br&gt;
 Post them in the comments!&lt;/p&gt;

&lt;p&gt;Happy recursing!&lt;/p&gt;

</description>
      <category>sql</category>
      <category>cockroachdb</category>
      <category>datamodeling</category>
      <category>cte</category>
    </item>
    <item>
      <title>Apache Spark job to update CockroachDB data</title>
      <dc:creator>Jim Hatcher</dc:creator>
      <pubDate>Fri, 04 Feb 2022 14:26:56 +0000</pubDate>
      <link>https://dev.to/jhatcher9999/apache-spark-job-to-update-cockroachdb-data-bmi</link>
      <guid>https://dev.to/jhatcher9999/apache-spark-job-to-update-cockroachdb-data-bmi</guid>
      <description>&lt;p&gt;Apache Spark is a distributed execution framework, which makes it a wonderfully complementary tool for working with distributed databases and other distributed systems.&lt;/p&gt;

&lt;p&gt;When working in Spark, my go-to interface is Spark SQL since I can leverage my SQL skills to get at the data I want.  However, you can't do any kind of data mutation via Spark SQL -- it only supports reads.  For writes, you have to use the DataFrame/Dataset or RDD interfaces.&lt;/p&gt;
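As a hedged illustration (the connection details here are placeholders, not from the original post), the read-only Spark SQL path looks roughly like this: load a JDBC-backed DataFrame, register it as a temp view, and query it with plain SQL. There is no INSERT/UPDATE/DELETE equivalent on this path.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("read-only-example").getOrCreate()

// Load a JDBC-backed DataFrame (placeholder URL/table/credentials).
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql:dbserver")
  .option("dbtable", "schema.tablename")
  .option("user", "username")
  .option("password", "password")
  .load()

// Expose it to Spark SQL and query it -- reads only.
df.createOrReplaceTempView("t")
spark.sql("SELECT count(*) AS cnt FROM t").show()
```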

&lt;p&gt;If you're doing an initial import of a table from CSV (or something similar), you can do something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="nv"&gt;dataFrameCSV&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;write&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;format&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"jdbc"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"url"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"jdbc:postgresql:dbserver"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"dbtable"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"schema.tablename"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"user"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"username"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"password"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"password"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;save&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a great option when you're loading the whole table for the first time.&lt;/p&gt;

&lt;p&gt;But, for cases when you want to update some of the records in the table, this &lt;code&gt;write()&lt;/code&gt; interface is not great: Spark will either complain that the table already has records or, if you tell it to overwrite, wipe out the existing records in your table.&lt;/p&gt;
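To make that save-mode behavior concrete, here's a hedged sketch reusing the post's &lt;code&gt;dataFrameCSV&lt;/code&gt; and placeholder connection options: the default SaveMode.ErrorIfExists fails on a non-empty table, SaveMode.Overwrite wipes it, and even SaveMode.Append only inserts new rows -- none of them can update existing records in place.

```scala
import org.apache.spark.sql.SaveMode

// ErrorIfExists (the default) throws if schema.tablename already has rows;
// Overwrite drops/truncates the table first; Append only adds new rows.
dataFrameCSV.write
  .format("jdbc")
  .option("url", "jdbc:postgresql:dbserver") // placeholder
  .option("dbtable", "schema.tablename")
  .option("user", "username")
  .option("password", "password")
  .mode(SaveMode.Append) // inserts only; cannot UPDATE existing rows
  .save()
```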

&lt;p&gt;So, for more nuanced mutations, you can drop down into the java.sql library and have Spark execute batches of updates.&lt;/p&gt;

&lt;p&gt;Here's how I went about doing this.&lt;/p&gt;

&lt;p&gt;First, I needed to install Spark on my environment (my MacBook), so I used brew to install:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scala 2.12 (Spark 3 doesn't like Scala 2.11)&lt;/li&gt;
&lt;li&gt;Oracle Java 11 (Spark doesn't seem to like more recent OpenJDK versions)&lt;/li&gt;
&lt;li&gt;Spark (Spark 3.2.1 is the latest as of the time of me writing this blog)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I already had the Postgres JDBC driver installed via Maven, but you can install it, too, if necessary.&lt;/p&gt;

&lt;p&gt;I ran the spark shell using this command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;spark-shell &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--driver-class-path&lt;/span&gt; /Users/jimhatcher/.m2/repository/org/postgresql/postgresql/42.2.19/postgresql-42.2.19.jar &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--jars&lt;/span&gt; /Users/jimhatcher/.m2/repository/org/postgresql/postgresql/42.2.19/postgresql-42.2.19.jar &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--conf&lt;/span&gt; spark.jars&lt;span class="o"&gt;=&lt;/span&gt;/Users/jimhatcher/.m2/repository/org/postgresql/postgresql/42.2.19/postgresql-42.2.19.jar &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--conf&lt;/span&gt; spark.executor.memory&lt;span class="o"&gt;=&lt;/span&gt;4g &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--conf&lt;/span&gt; spark.driver.memory&lt;span class="o"&gt;=&lt;/span&gt;4g
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Spark runs as three separate components: the Spark master, the driver, and the executor(s).  I passed references to the Postgres jar in several parameters to make sure that the various pieces knew how to reference the Postgres driver.&lt;/p&gt;

&lt;p&gt;I also bumped up the memory available to the executor and drivers.&lt;/p&gt;

&lt;p&gt;The next thing I needed to do was make sure I had good certs for accessing my CockroachDB instance.  When CRDB issues keys, it uses the PEM format.  When connecting via JDBC (as we will in this Spark job), Java is fine with the certs, but it wants the client key in DER format.  So, I ran the following command to convert my client key to DER:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openssl pkcs8 &lt;span class="nt"&gt;-topk8&lt;/span&gt; &lt;span class="nt"&gt;-inform&lt;/span&gt; PEM &lt;span class="nt"&gt;-outform&lt;/span&gt; DER &lt;span class="nt"&gt;-in&lt;/span&gt; client.root.key &lt;span class="nt"&gt;-out&lt;/span&gt; client.root.key.der &lt;span class="nt"&gt;-nocrypt&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I wrote the following program to read a table from Cockroach, find the records where a certain field was null and then update that field.&lt;/p&gt;

&lt;p&gt;After some trial and error, this is what I ended up with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;java.sql._&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.spark.sql._&lt;/span&gt;

&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;url&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"jdbc:postgresql://&amp;lt;host_name_here&amp;gt;:26257/test?sslmode=require&amp;amp;sslcert=/Users/jimhatcher/spark-cluster-certs/client.root.crt&amp;amp;sslkey=/Users/jimhatcher/spark-cluster-certs/client.root.key.der&amp;amp;sslrootcert=/Users/jimhatcher/spark-cluster-certs/ca.crt&amp;amp;reWriteBatchedInserts=true"&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;sql&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"( select id from test.test_table where f1 IS NULL) t1"&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;df&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;spark&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;read&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;format&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"jdbc"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"url"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"dbtable"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"user"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"root"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"partitionColumn"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"id"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"lowerBound"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"1"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"upperBound"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"30000000"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"numPartitions"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"10"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;load&lt;/span&gt;

&lt;span class="nv"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;coalesce&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="py"&gt;foreachPartition&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
  &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;partition&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Iterator&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;Row&lt;/span&gt;&lt;span class="o"&gt;])&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

    &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;dbc&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Connection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;DriverManager&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;getConnection&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"root"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;batchSize&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;
    &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;st&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;PreparedStatement&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;dbc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;prepareStatement&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"UPDATE test.test_table SET f1 = 1 WHERE id = ?"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

    &lt;span class="nv"&gt;partition&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;grouped&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batchSize&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="py"&gt;foreach&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;
      &lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="nv"&gt;batch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;foreach&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;
          &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="nv"&gt;st&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;setLong&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;row&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;getLong&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; 
            &lt;span class="nv"&gt;st&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;addBatch&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
          &lt;span class="o"&gt;}&lt;/span&gt;
        &lt;span class="o"&gt;)&lt;/span&gt;
        &lt;span class="nv"&gt;st&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;executeBatch&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
      &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;)&lt;/span&gt;

    &lt;span class="nv"&gt;dbc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;close&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;

  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I ran this in the Spark shell by copying/pasting.  (The ":paste" command is nice for this.)&lt;/p&gt;

&lt;p&gt;This program wasn't particularly fast.  I think there's some optimization I could do.  I also didn't run it on a real Spark cluster.  One super cool thing about Spark is how easy it is to parallelize jobs like this by throwing more hardware resources at them.&lt;/p&gt;

&lt;p&gt;I'd like to go back and try to optimize this further.  But in the meantime, hopefully this is a decent example that folks can build on for doing Spark jobs against CockroachDB.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
