<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Frits Hoogland</title>
    <description>The latest articles on DEV Community by Frits Hoogland (@fritshooglandyugabyte).</description>
    <link>https://dev.to/fritshooglandyugabyte</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F608370%2Fb1cbac69-dd8a-4296-8b6d-a921541ff79a.png</url>
      <title>DEV Community: Frits Hoogland</title>
      <link>https://dev.to/fritshooglandyugabyte</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/fritshooglandyugabyte"/>
    <language>en</language>
    <item>
      <title>The Postico 2 client for YugabyteDB</title>
      <dc:creator>Frits Hoogland</dc:creator>
      <pubDate>Mon, 17 Jul 2023 13:23:35 +0000</pubDate>
      <link>https://dev.to/fritshooglandyugabyte/the-postico-2-client-for-yugabytedb-3pi9</link>
      <guid>https://dev.to/fritshooglandyugabyte/the-postico-2-client-for-yugabytedb-3pi9</guid>
      <description>&lt;p&gt;When going through a twitter thread about PostgreSQL and PostgreSQL compatible database clients, the &lt;a href="https://eggerapps.at/postico2/"&gt;Postico database client&lt;/a&gt; was mentioned. I did not run into that client before, so I decided to take a look.&lt;/p&gt;

&lt;p&gt;The first thing to notice is that Postico 2 is a client for macOS 10.15 or later; there is no Linux or Windows version. It is also a GUI, which means it serves a specific purpose and does not compete with psql or ysqlsh. &lt;/p&gt;

&lt;p&gt;Then another quite prominent topic: it is not a free database client, and costs between 30 euros (student license) and 85 euros (commercial license). However, it has a free unlimited trial to find out whether you like it. The Mac App Store offers the paid version; the Postico website offers the trial version.&lt;/p&gt;

&lt;p&gt;Postico is a PostgreSQL database client. Because YugabyteDB reuses the PostgreSQL query engine, it is highly PostgreSQL compatible, and therefore PostgreSQL clients like Postico and DBeaver can be used with YugabyteDB too.&lt;/p&gt;

&lt;p&gt;The first thing I noticed is that the GUI very much follows what I believe to be the Mac design philosophy: a clean UI that is kept as free as possible of switches and choices that are not directly needed, while still giving you access to the options you do need. Its probable main competitor (and the community's most popular option), &lt;a href="https://dbeaver.io"&gt;DBeaver&lt;/a&gt;, sits at the exact opposite end, providing lots of options and choices.&lt;/p&gt;

&lt;p&gt;After setting up a database connection, the window is focused on a single server and a single database, via breadcrumbs visible at the top of the window. Another server connection can be created via a tab or another window, but that is a paid-version-only feature.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mYdKv9yb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zkla00gh36of54fkzks5.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mYdKv9yb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zkla00gh36of54fkzks5.jpg" alt="Image description" width="800" height="468"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The window that is shown after connecting to a database once again is very clean:&lt;/p&gt;

&lt;p&gt;A queries subwindow on the upper left-hand side shows the file backing the current SQL ('SQL Query' by default), whose contents are shown in the main window, which is opened by default.&lt;/p&gt;

&lt;p&gt;Below it is a subwindow with a tables and a functions tab: selecting a table in the tables tab shows the definition of the table and its indexes in the main screen, and selecting a function or procedure shows its definition there.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OVwMcF96--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kflze2dhtdccrspjbawe.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OVwMcF96--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kflze2dhtdccrspjbawe.jpg" alt="Image description" width="800" height="484"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To me, Postico 2 looks like it is trying to offer as clean an interface as possible, instead of trying to show everything that is there, which is what DBeaver does. That means both Postico 2 and DBeaver have their appeal; it is a matter of preference.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>yugabytedb</category>
      <category>client</category>
    </item>
    <item>
      <title>Setup SSL for YSQL in Yugabyte manually</title>
      <dc:creator>Frits Hoogland</dc:creator>
      <pubDate>Mon, 03 Jul 2023 15:23:45 +0000</pubDate>
      <link>https://dev.to/yugabyte/setup-ssl-for-ysql-in-yugabyte-manually-7hd</link>
      <guid>https://dev.to/yugabyte/setup-ssl-for-ysql-in-yugabyte-manually-7hd</guid>
      <description>&lt;p&gt;This is blogpost providing a quick (and dirty) way of enabling SSL for clients connecting to YSQL, alias the postgres compatible API in YugabyteDB.&lt;/p&gt;

&lt;p&gt;If you need to do this in an official way, you should use &lt;a href="https://www.yugabyte.com/anywhere/" rel="noopener noreferrer"&gt;Yugabyte Anywhere&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Create certificates
&lt;/h2&gt;

&lt;p&gt;The certificates are created using the openssl executable, which is normally available on any Linux system and on any Mac.&lt;/p&gt;

&lt;p&gt;The way SSL works with openssl is that a so-called key pair is needed for encryption and decryption. You can create a key file, alias private key, at will, and you can create a CSR (certificate signing request) at will, but to make your CSR usable as a certificate, it must be signed by a CA (certificate authority). &lt;/p&gt;

&lt;p&gt;For quickly creating test certificates, you can create a (dummy) certificate authority for yourself. Obviously, certificates signed by this dummy authority are not accepted as official certificates.&lt;/p&gt;

&lt;h3&gt;
  
  
  Certificate Authority
&lt;/h3&gt;

&lt;p&gt;The creation of a CA is quite simple:&lt;/p&gt;

&lt;p&gt;Create the CA key file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;openssl genrsa -out test.ca.key
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create the CA certificate file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;openssl req -new -x509 -key test.ca.key -out test.ca.crt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Some questions are asked; for test purposes you can leave these blank, except for the common name: use localhost.&lt;/p&gt;

&lt;h3&gt;
  
  
  Server key and CSR
&lt;/h3&gt;

&lt;p&gt;The creation of the server certificate files is also simple:&lt;/p&gt;

&lt;p&gt;Create server key file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;openssl genrsa -out test.server.key
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create the server CSR file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;openssl req -new -key test.server.key -out test.server.csr
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Some questions are asked here too, the same ones as during CA creation, because both are about a public key/certificate. If a hostname is used to connect, the common name should match that hostname. &lt;/p&gt;

&lt;p&gt;This creates the CSR, the certificate signing request.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sign the CSR
&lt;/h3&gt;

&lt;p&gt;The last step is to use the CA files to sign the server CSR:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;openssl x509 -req -in test.server.csr -CA test.ca.crt -CAkey test.ca.key -CAcreateserial -out test.server.crt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it! You now have a number of files:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The test.ca.crt file, which is the CA certificate; this is sometimes needed, when the CA must be provided for server certificate validation.&lt;/li&gt;
&lt;li&gt;The test.server.key file, which is the private key file. If this file is exposed, the SSL security is compromised.&lt;/li&gt;
&lt;li&gt;The test.server.crt file, which is the (public) certificate file.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Plus:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The test.ca.key file, which is the private key of the certificate authority. This file is only needed for signing.&lt;/li&gt;
&lt;li&gt;The test.ca.srl file, which contains the serial number of signed certificates (needed for a real CA to keep track of serials).&lt;/li&gt;
&lt;li&gt;The test.server.csr, which contains the signing request. This file can be reused when the certificate (.crt) expires.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Enable SSL for YugabyteDB YSQL (postgres API)
&lt;/h2&gt;

&lt;p&gt;To enable SSL for YugabyteDB YSQL, the following steps need to be done:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Put the needed files in a place and file format where the tablet server process can find them.&lt;/li&gt;
&lt;li&gt;Actually enable SSL for client to server encryption.&lt;/li&gt;
&lt;li&gt;(optionally, for test) Allow unencrypted connections alongside encrypted connections.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Put the certificate/SSL files in place.
&lt;/h3&gt;

&lt;p&gt;(assuming the Yugabyte tablet server process is started as the user &lt;code&gt;yugabyte&lt;/code&gt; and its home directory is &lt;code&gt;/home/yugabyte&lt;/code&gt;)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mkdir ~/certs
cp test.ca.crt ~/certs/ca.crt
cp test.server.key ~/certs/node.0.0.0.0.key
cp test.server.crt ~/certs/node.0.0.0.0.crt
chmod 700 ~/certs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One thing to note here: the names of the key and certificate files are derived from the general parameter that controls which address or addresses the tablet server listens on: &lt;code&gt;rpc_bind_addresses&lt;/code&gt;. In the situation above it was set to 0.0.0.0, which means all addresses.&lt;/p&gt;

&lt;h3&gt;
  
  
  Enable SSL and point the tablet server to the files
&lt;/h3&gt;

&lt;p&gt;To enable client to server SSL, set the following tablet server setting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;use_client_to_server_encryption=true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To point the tablet server to the certificate files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;certs_for_client_dir=/home/yugabyte/certs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Enable unencrypted traffic alongside encrypted traffic
&lt;/h3&gt;

&lt;p&gt;To allow unencrypted traffic as well as encrypted traffic, add the following line to the tablet server flag file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ysql_hba_conf_csv="host all all all trust"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In a real-life situation this is very likely not a good idea, but for testing it can be very useful. Read the documentation for more configuration options, such as configuring special unencrypted access for backups.&lt;/p&gt;
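&lt;p&gt;Putting the pieces together, the tablet server flag file would contain the following lines (the flag file location depends on how the tablet server is started, and the trust line is, again, for testing only):&lt;/p&gt;

```
--use_client_to_server_encryption=true
--certs_for_client_dir=/home/yugabyte/certs
--ysql_hba_conf_csv=host all all all trust
```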

</description>
      <category>yugabytedb</category>
      <category>postgres</category>
      <category>configuration</category>
    </item>
    <item>
      <title>Lookup disk properties on linux</title>
      <dc:creator>Frits Hoogland</dc:creator>
      <pubDate>Mon, 05 Jun 2023 16:21:51 +0000</pubDate>
      <link>https://dev.to/yugabyte/lookup-disk-properties-on-linux-12p0</link>
      <guid>https://dev.to/yugabyte/lookup-disk-properties-on-linux-12p0</guid>
      <description>&lt;p&gt;When using linux systems, there sometimes is a need to understand the exact disk properties. One reason is to be able to understand IO read and write size, and how it affects latency.&lt;/p&gt;

&lt;p&gt;Linux uses buffered IO by default, which means that any read or write request goes through the page cache. The cache in Linux is completely automatic, essentially consisting of otherwise unallocated memory; however, buffered IO still requires memory for the cache even when memory is in short supply.&lt;/p&gt;

&lt;p&gt;To understand which block device a filesystem is using, use the &lt;code&gt;lsblk&lt;/code&gt; command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;lsblk
NAME               MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda                  8:0    0 39.1G  0 disk
├─sda1               8:1    0    1G  0 part /boot
└─sda2               8:2    0 38.1G  0 part
  ├─almalinux-root 253:0    0   36G  0 lvm  /
  └─almalinux-swap 253:1    0  2.1G  0 lvm  &lt;span class="o"&gt;[&lt;/span&gt;SWAP]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this case the &lt;code&gt;lsblk&lt;/code&gt; command tells me I have one disk (sda) with two partitions (sda1 and sda2). The disk here therefore is &lt;code&gt;sda&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The disk exposes a lot of its properties via &lt;code&gt;/sys/block/&amp;lt;DISK&amp;gt;/queue&lt;/code&gt;, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The disk maximum IO size: &lt;code&gt;max_hw_sectors_kb&lt;/code&gt; (readonly).&lt;/li&gt;
&lt;li&gt;The disk current maximum IO size: &lt;code&gt;max_sectors_kb&lt;/code&gt; (read-write).&lt;/li&gt;
&lt;li&gt;The disk IO queue size: &lt;code&gt;nr_requests&lt;/code&gt; (read-write).&lt;/li&gt;
&lt;li&gt;The disk read-ahead size: &lt;code&gt;read_ahead_kb&lt;/code&gt; (read-write).&lt;/li&gt;
&lt;li&gt;The scheduler for the disk: &lt;code&gt;scheduler&lt;/code&gt; (read-write).&lt;/li&gt;
&lt;li&gt;Whether the disk is (set as/considered) a rotating disk: &lt;code&gt;rotational&lt;/code&gt; (readonly).&lt;/li&gt;
&lt;li&gt;Whether the disk is (set as/considered) on DAX (writable persistent memory): &lt;code&gt;dax&lt;/code&gt; (readonly).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Read-write and readonly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;'read-write' means it's a setting that can be manipulated.&lt;/li&gt;
&lt;li&gt;'readonly' means it's a property read by the driver from the underlying "hardware".&lt;/li&gt;
&lt;/ul&gt;
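&lt;p&gt;Since these properties are plain files, a small shell loop prints them all for one disk. A sketch: sda is an assumption here, substitute your own device name from lsblk.&lt;/p&gt;

```shell
# print the queue properties of a block device (default: sda)
dev=sda
# fall back to the first block device present if sda does not exist
if [ ! -d "/sys/block/$dev" ]; then
  dev=$(ls /sys/block | head -1)
fi
for f in max_hw_sectors_kb max_sectors_kb nr_requests read_ahead_kb scheduler rotational dax; do
  if [ -r "/sys/block/$dev/queue/$f" ]; then
    printf '%s: %s\n' "$f" "$(cat /sys/block/$dev/queue/$f)"
  fi
done
```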

</description>
      <category>linux</category>
      <category>performance</category>
      <category>internals</category>
    </item>
    <item>
      <title>Create sample YugabyteDB table with data</title>
      <dc:creator>Frits Hoogland</dc:creator>
      <pubDate>Fri, 02 Jun 2023 12:32:36 +0000</pubDate>
      <link>https://dev.to/yugabyte/create-sample-yugabytedb-table-15ik</link>
      <guid>https://dev.to/yugabyte/create-sample-yugabytedb-table-15ik</guid>
      <description>&lt;p&gt;I keep finding myself looking around for a quick and convenient way to to create a table populated with data.&lt;br&gt;
(the datasets are described as being free to use for testing)&lt;/p&gt;

&lt;p&gt;Download the 100,000-row dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-L&lt;/span&gt; https://github.com/datablist/sample-csv-files/raw/main/files/people/people-100000.zip &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; people-100000.zip
unzip people-100000.zip
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Download the 1,000,000-row dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-L&lt;/span&gt; https://github.com/datablist/sample-csv-files/raw/main/files/people/people-1000000.zip &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; people-1000000.zip
unzip people-1000000.zip
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create table and import data into YCQL (Cassandra compatible)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="n"&gt;keyspace&lt;/span&gt; &lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;use&lt;/span&gt; &lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;sample_data&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="k"&gt;index&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="k"&gt;primary&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;first_name&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;last_name&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;sex&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;phone&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;date_of_birth&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;job_title&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;--copy sample_data from 'people-100000.csv' with header=true;&lt;/span&gt;
&lt;span class="k"&gt;copy&lt;/span&gt; &lt;span class="n"&gt;sample_data&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="s1"&gt;'people-1000000.csv'&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create table and import data into YSQL (PostgreSQL compatible)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;create table sample_data
(
index int primary key,
user_id text,
first_name text,
last_name text,
sex text,
email text,
phone text,
date_of_birth text,
job_title text
);
--\copy sample_data from 'people-100000.csv' csv header;
\copy sample_data from 'people-1000000.csv' csv header;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
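&lt;p&gt;After either import, a quick row count confirms the load; the same statement works in both ysqlsh and ycqlsh:&lt;/p&gt;

```
select count(*) from sample_data;
-- expected: 1000000 (or 100000 for the smaller dataset)
```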



</description>
      <category>yugabyte</category>
      <category>postgres</category>
      <category>sampledata</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Log sync latency explained</title>
      <dc:creator>Frits Hoogland</dc:creator>
      <pubDate>Fri, 26 May 2023 17:27:53 +0000</pubDate>
      <link>https://dev.to/yugabyte/log-sync-latency-explained-4m94</link>
      <guid>https://dev.to/yugabyte/log-sync-latency-explained-4m94</guid>
      <description>&lt;p&gt;With the YugabyteDB database, we persist data by guaranteeing the data changes, such inserts, updates, deletes, etc. to be persisted. As with many databases and other programs that have this requirement, we do this by writing it to a file and call &lt;code&gt;fsync()&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;However, there is something I have wondered about for a long time: if I calculate the time spent on the &lt;code&gt;fsync()&lt;/code&gt; call when doing batch-oriented work with YugabyteDB, the time spent waiting is often 20ms or more.&lt;/p&gt;

&lt;p&gt;This time looks high. And you have to wait for &lt;code&gt;fsync()&lt;/code&gt;, because only when it finishes successfully is the data guaranteed to be stored on persistent media.&lt;/p&gt;

&lt;p&gt;Let's follow the work that is done down into the linux kernel, and see if we can find a reason!&lt;/p&gt;

&lt;p&gt;The first step is to see what exactly we do in YugabyteDB. For normal write calls, for which we use the &lt;code&gt;write()&lt;/code&gt; system call, this is problematic, because the Yugabyte database is a threaded application and uses &lt;code&gt;write()&lt;/code&gt; on many occasions, which would produce lots and lots of data. &lt;/p&gt;

&lt;p&gt;But for writing to the YugabyteDB WAL files, we use the &lt;code&gt;writev()&lt;/code&gt; (vector write) system call, which allows multiple buffers to be submitted in one call, synchronously. Below we see the vector writes submitting two buffers each.&lt;/p&gt;

&lt;p&gt;So to see the WAL writing and fsync activity, we can use the &lt;code&gt;strace&lt;/code&gt; utility and only show the &lt;code&gt;writev()&lt;/code&gt; and &lt;code&gt;fsync()&lt;/code&gt; system calls. This is how that looks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ strace -fp $(pgrep tserver) -e writev,fsync
...
[pid  1995] writev(121, [{iov_base="\20\0\0\0\235+\346\235:\0\222\357", iov_len=12}, {iov_base="\22\5\10\2\20\226U\30\260\261\277\311\242\377\2042", iov_len=16}], 2) = 28
[pid  1995] writev(121, [{iov_base="g\314\30\0\320v\332\237\244)\304\177", iov_len=12}, {iov_base="\n\323\230c\10\1\22\315\230c\n\5\10\2\20\227U\21\0\2006\373\232\250\311_H\0 \3*\256"..., iov_len=1625191}], 2) = 1625203
[pid  1995] fsync(121)                  = 0
[pid  1995] writev(121, [{iov_base="\20\0\0\0\337\21N\212\315\357C\300", iov_len=12}, {iov_base="\22\5\10\2\20\227U\30\225\231\350\341\242\377\2042", iov_len=16}], 2) = 28
[pid  1995] writev(121, [{iov_base="g\314\30\0\327]?\33\377\36\3446", iov_len=12}, {iov_base="\n\323\230c\10\1\22\315\230c\n\5\10\2\20\230U\21\0\320I\10\233\250\311_H\0 \3*\256"..., iov_len=1625191}], 2) = 1625203
[pid  1995] fsync(121)                  = 0
[pid  1995] writev(121, [{iov_base="\20\0\0\0\221\f\260\351\367h_\350", iov_len=12}, {iov_base="\22\5\10\2\20\230U\30\251\245\371\375\242\377\2042", iov_len=16}], 2) = 28
[pid  1995] writev(121, [{iov_base="g\314\30\0\26\234\252W\304g\262n", iov_len=12}, {iov_base="\n\323\230c\10\1\22\315\230c\n\5\10\2\20\231U\21\0p\7\24\233\250\311_H\0 \3*\256"..., iov_len=1625191}], 2) = 1625203
[pid  1995] fsync(121)                  = 0
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So, in general (not strictly!), during a batch operation such as a batch insert there are two &lt;code&gt;writev()&lt;/code&gt; calls: one small, about 28 bytes (the last number is the total size of the writev() request), and one big, 1625203 bytes. &lt;/p&gt;

&lt;p&gt;This is where things start to get fuzzy, at least for me. The big write seems to be 1625203 bytes, which is a little more than 1.5 MB. The YugabyteDB threshold for performing fsync is 1 second or 1 MB, yet this is approximately 50% more.&lt;/p&gt;

&lt;p&gt;But still, this is reasonably known territory, right? We write data buffered, and then perform fsync(); there is no reason that would take 20ms... or is there?&lt;/p&gt;

&lt;p&gt;The next layer to look at is the block IO layer, or "bio" layer. To look at this layer, there are some brilliant tools in the bcc tools package, which use linux eBPF. &lt;/p&gt;

&lt;p&gt;The tool to use here is &lt;code&gt;biosnoop&lt;/code&gt;: a tool to 'snoop' on the BIO layer.&lt;/p&gt;

&lt;p&gt;This is how that looks for the same requests we looked at with &lt;code&gt;strace&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TIME(s)     COMM           PID    DISK    T SECTOR     BYTES  LAT(ms)

15.492545   append [worker 1668   nvme1n1 W 107830376  4096      0.69
15.493980   append [worker 1668   nvme1n1 W 107830384  262144    2.08
15.495007   append [worker 1668   nvme1n1 W 107830896  262144    3.08
15.495918   append [worker 1668   nvme1n1 W 107831408  262144    3.95
15.497194   append [worker 1668   nvme1n1 W 107831920  262144    5.19
15.497705   append [worker 1668   nvme1n1 W 107832432  262144    5.67
15.499456   append [worker 1668   nvme1n1 W 107832944  262144    7.39
15.500215   append [worker 1668   nvme1n1 W 107833456  53248     8.14
15.507690   append [worker 1668   nvme1n1 W 104958050  1024      0.77
15.512486   append [worker 1668   nvme1n1 W 107833552  4096      0.59
15.514076   append [worker 1668   nvme1n1 W 107833560  262144    2.14
15.515213   append [worker 1668   nvme1n1 W 107834072  262144    3.24
15.515951   append [worker 1668   nvme1n1 W 107834584  262144    3.95
15.516999   append [worker 1668   nvme1n1 W 107835608  200704    4.90
15.517053   append [worker 1668   nvme1n1 W 107835096  262144    5.01
15.517995   append [worker 1668   nvme1n1 W 107836512  110592    5.85
15.518299   append [worker 1668   nvme1n1 W 107836000  262144    6.17
15.529624   append [worker 1668   nvme1n1 W 104958052  4096      0.62
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This snippet from &lt;code&gt;biosnoop&lt;/code&gt; shows the requests as the block IO layer sees it. Above we see the result of two times the combination of writev()/writev()/fsync(). &lt;/p&gt;

&lt;p&gt;But there are many more requests here! If we look carefully, we see that the first request is a request of 4096 bytes. That is probably the first writev(), although bigger. &lt;/p&gt;

&lt;p&gt;But the important thing to see is the requests that follow, each with a size of 262144 bytes, alias 256 kB.&lt;/p&gt;

&lt;p&gt;Each of the bigger requests of 256 kB shows a latency of approximately 3ms or higher. This is the average IO latency that I recognise, and you can also clearly see that smaller requests get better latency, and thus bigger ones worse latency.&lt;/p&gt;

&lt;p&gt;But where does the 262144 bytes / 256 kB come from?&lt;/p&gt;

&lt;p&gt;The answer lies in the Linux kernel files that are part of the disk subsystem and drivers in &lt;code&gt;/sys/block/&amp;lt;blockdevice&amp;gt;/queue&lt;/code&gt;. This is a meta-directory that contains information and settings about the disk.&lt;/p&gt;

&lt;p&gt;The thing to look for is &lt;code&gt;max_sectors_kb&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# cat max_sectors_kb
256
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means the maximum IO size is 256 kB! So the kernel broke the large request down into the biggest size it is set to handle, which here is 256 kB.&lt;/p&gt;

&lt;p&gt;Let's see if we can increase the IO size: the maximum allowed size is specified in &lt;code&gt;max_hw_sectors_kb&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# cat max_hw_sectors_kb
256
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Nope. We can't increase the IO size, it's already at its maximum. :-(&lt;br&gt;
In case you're wondering what I run on: this is an Amazon EC2 c5 instance.&lt;/p&gt;

&lt;p&gt;So, going back to the requests we measured with &lt;code&gt;biosnoop&lt;/code&gt;: these are the requests from the large &lt;code&gt;writev()&lt;/code&gt; call, broken down into 256 kB chunks because of the maximum IO size, and then submitted to the block device. &lt;/p&gt;

&lt;p&gt;Each of these requests takes approximately 3ms or higher, and there seems to be a tendency for successive large IO requests to get increasing latency.&lt;/p&gt;

&lt;p&gt;Overall, this explains the fsync() latency that we see: it is the result of the 256 kB chunks that the fsync() call has to dispatch and wait for, which adds up to the 20ms or more that we see with batches, when we fsync after (more than) 1 MB.&lt;/p&gt;
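&lt;p&gt;A quick back-of-the-envelope check shows the numbers add up: the large writev() is broken into ceiling(1625203 / 262144) = 7 bio requests of at most 256 kB, and at roughly 3ms each that is in the order of the 20ms we measured:&lt;/p&gt;

```shell
# back-of-the-envelope: number of 256 kB bio requests for one large WAL writev(),
# and the resulting fsync() wait at ~3 ms per request (figure taken from biosnoop above)
write_bytes=1625203
max_io_bytes=$((256 * 1024))
chunks=$(( (write_bytes + max_io_bytes - 1) / max_io_bytes ))  # ceiling division
echo "chunks: $chunks"
echo "estimated fsync wait: $((chunks * 3)) ms"
```

This prints 7 chunks and an estimated wait of 21 ms, close to the measured 20 ms or more.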

</description>
      <category>yugabytedb</category>
      <category>performance</category>
      <category>internals</category>
      <category>linux</category>
    </item>
    <item>
      <title>PostgreSQL: prepared out of order order by</title>
      <dc:creator>Frits Hoogland</dc:creator>
      <pubDate>Tue, 16 May 2023 15:06:01 +0000</pubDate>
      <link>https://dev.to/yugabyte/postgresql-prepared-out-of-order-order-by-41kh</link>
      <guid>https://dev.to/yugabyte/postgresql-prepared-out-of-order-order-by-41kh</guid>
      <description>&lt;p&gt;Recently, a colleague asked me a question about a PostgreSQL construction for which they found that it would silently ignore part of SQL, and whether it's truly doing that.&lt;/p&gt;

&lt;p&gt;This is a mini setup that shows the issue:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;create table test( id int primary key, f1 text );
insert into test select id, chr(ascii('a')+id) from generate_series(25,0,-1) id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates a table 'test' with two fields, and fills it with the numbers 25 down to 0 and the corresponding characters from 'z' down to 'a'.&lt;/p&gt;

&lt;p&gt;If I execute the following SQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select * from test order by f1 limit 1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I get:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# select * from test order by f1 limit 1;
 id | f1
----+----
  0 | a
(1 row)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's the issue:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;prepare prep(text, int) as select * from test order by $1 limit $2;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And then execute the prepared statement:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# execute prep('f1',1);
 id | f1
----+----
 25 | z
(1 row)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These are two different results for seemingly the same statement. I actually didn't need to do anything; my colleague had also found the cause:&lt;/p&gt;

&lt;p&gt;For the statement, the execution plan is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# explain select * from test order by f1 limit 1;
                           QUERY PLAN
-----------------------------------------------------------------
 Limit  (cost=1.39..1.39 rows=1 width=6)
   -&amp;gt;  Sort  (cost=1.39..1.46 rows=26 width=6)
         Sort Key: f1
         -&amp;gt;  Seq Scan on test  (cost=0.00..1.26 rows=26 width=6)
(4 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The select performs an unfiltered scan of a field for which there is no index, and thus uses a seq scan.&lt;/li&gt;
&lt;li&gt;Because a seq scan does not provide the ordering guarantees that an ascending or descending index does, a sort must be performed for the order by.&lt;/li&gt;
&lt;li&gt;The limit row source then takes a single row because of limit 1.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For the prepared statement the execution plan is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# explain execute prep('f1',1);
                         QUERY PLAN
------------------------------------------------------------
 Limit  (cost=0.00..0.15 rows=3 width=38)
   -&amp;gt;  Seq Scan on test  (cost=0.00..1.26 rows=26 width=38)
(2 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The select performs an unfiltered scan of a field for which there is no index, and thus uses a seq scan.&lt;/li&gt;
&lt;li&gt;The limit takes the first row it receives.&lt;/li&gt;
&lt;li&gt;The sort step is missing, and thus the first row found by the seq scan is shown, which here is 'z', but could be anything, because a seq scan does not guarantee ordering.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I also found some sources on the internet explaining that this is because of the bind parameters, probably the best one being  from the PostgreSQL mail list: &lt;a href="https://www.postgresql.org/message-id/1421875206968-5834948.post@n5.nabble.com" rel="noopener noreferrer"&gt;https://www.postgresql.org/message-id/1421875206968-5834948.post@n5.nabble.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The essence of that explanation is that in the order by, the parameter would have to act as an identifier (a column name), whilst bind parameters are values, and that therefore this cannot work. However, what I am puzzled about is why creating the prepared statement succeeds, and why executing it works, but silently skips the order by.&lt;/p&gt;
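&lt;p&gt;In other words: the parameter is used as a constant value, so the prepared statement effectively orders by the string literal 'f1' rather than by the column f1. As a sketch, the same silent behaviour can be seen without a prepared statement, by ordering by a quoted string:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select * from test order by 'f1' limit 1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Here 'f1' is a constant, ordering by a constant is meaningless, and the planner can optimise the sort away, just like in the plan of the prepared statement.&lt;/p&gt;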

&lt;p&gt;I hope to hear any opinions on or reasons for this behaviour, or about anything missing from this small investigation.&lt;/p&gt;

</description>
      <category>postgres</category>
    </item>
    <item>
      <title>New memory related fields in Yugabyte 2.17.3 pg_stat_activity</title>
      <dc:creator>Frits Hoogland</dc:creator>
      <pubDate>Mon, 08 May 2023 12:41:14 +0000</pubDate>
      <link>https://dev.to/yugabyte/new-memory-related-fields-in-yugabyte-2173-pgstatactivity-40p2</link>
      <guid>https://dev.to/yugabyte/new-memory-related-fields-in-yugabyte-2173-pgstatactivity-40p2</guid>
      <description>&lt;p&gt;On friday, may 5th 2023 Yugabyte released version 2.17.3.0 of the database. Version 2.17.3.0 is a pre-release version.&lt;/p&gt;

&lt;p&gt;However, there is an exciting feature that will be extremely handy for understanding, tuning and troubleshooting the YSQL (PostgreSQL compatible) API: the &lt;code&gt;pg_stat_activity&lt;/code&gt; PostgreSQL standard catalog table has gained two new fields:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;allocated_mem_bytes&lt;/li&gt;
&lt;li&gt;rss_mem_bytes&lt;/li&gt;
&lt;/ul&gt;
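
&lt;p&gt;The new fields can be queried like any other &lt;code&gt;pg_stat_activity&lt;/code&gt; column. As a sketch, this lists the backends ordered by their operating system memory usage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select pid, application_name, allocated_mem_bytes, rss_mem_bytes
from pg_stat_activity
order by rss_mem_bytes desc nulls last;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;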

&lt;h2&gt;
  
  
  allocated_mem_bytes
&lt;/h2&gt;

&lt;p&gt;The allocated_mem_bytes field shows the memory allocated by the memory allocator. PostgreSQL is set up in an extensible way, which includes the ability to choose a memory allocator: by default it uses the operating system's allocator, which on linux is &lt;a href="https://sourceware.org/glibc/wiki/MallocInternals" rel="noopener noreferrer"&gt;ptmalloc&lt;/a&gt;, whilst YSQL uses &lt;a href="https://github.com/google/tcmalloc" rel="noopener noreferrer"&gt;tcmalloc&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The memory shown in the allocated_mem_bytes field is the amount of memory the allocator has allocated for the postgres backend as well as our pggate addition to communicate with DocDB.&lt;/p&gt;

&lt;p&gt;It is important to realise that the allocator only gets involved after the process has forked from the postgres server process and has paged in all the memory required to run, in other words once the postgres "application" runs and allocates memory for its processing.&lt;/p&gt;

&lt;p&gt;The allocated_mem_bytes field can include memory that was previously allocated and has in reality been freed, but is still part of the virtual memory allocation. One specific change we made in YugabyteDB for the backend tcmalloc allocator is to free aggressively, to prevent memory oversubscription. If this specific case happens, the rss_mem_bytes field will show a smaller amount of memory than the allocated_mem_bytes field.&lt;/p&gt;

&lt;h2&gt;
  
  
  rss_mem_bytes
&lt;/h2&gt;

&lt;p&gt;The rss_mem_bytes field shows the operating system level view of the memory usage, which is the RSS: the resident set size.&lt;/p&gt;

&lt;p&gt;The resident set size consists of all the pages that are paged into the process' memory. Some of these pages can be shared with other processes, such as memory mapped pages.&lt;/p&gt;

&lt;p&gt;The memory that the memory allocator (tcmalloc with YugabyteDB, ptmalloc for PostgreSQL) allocates is administered as RSS once the allocation gets read or written. &lt;/p&gt;

&lt;p&gt;In general, rss_mem_bytes (the resident set size) will be larger than allocated_mem_bytes, because at the linux level the executable itself and its execution require some memory too. But this will not always be true.&lt;/p&gt;

&lt;h2&gt;
  
  
  shared memory / buffer cache
&lt;/h2&gt;

&lt;p&gt;Please mind that with YSQL there is still shared memory visible that is defined as shared memory for the database. This memory is not allocated using a memory allocator.&lt;/p&gt;

&lt;p&gt;This shared memory is used by some global database views, and with PostgreSQL is used to allocate the buffer cache. In YSQL, we do not use the postgres level buffer cache. Therefore you will find that almost nothing of this memory is paged in and therefore actually taking memory.&lt;/p&gt;

&lt;p&gt;The amount of shared memory that is paged in will be visible in rss_mem_bytes; shared memory is entirely outside of allocated_mem_bytes.&lt;/p&gt;

</description>
      <category>yugabyte</category>
      <category>distributedsql</category>
    </item>
    <item>
      <title>How to sniff PostgreSQL traffic</title>
      <dc:creator>Frits Hoogland</dc:creator>
      <pubDate>Tue, 11 Apr 2023 10:00:18 +0000</pubDate>
      <link>https://dev.to/yugabyte/how-to-sniff-postgresql-traffic-3idm</link>
      <guid>https://dev.to/yugabyte/how-to-sniff-postgresql-traffic-3idm</guid>
      <description>&lt;p&gt;I have been looking into the implementation details of PostgreSQL client to database communication recently, and was amazed at how easy it is to look at the PostgreSQL traffic using &lt;a href="https://www.wireshark.org" rel="noopener noreferrer"&gt;wireshark&lt;/a&gt;. When I put my amazement on Twitter, I got asked to provide an example of how to do this.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to install wireshark on RHEL/Centos/Alma/etc.
&lt;/h2&gt;

&lt;p&gt;Installing wireshark is done in the following way:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;yum &lt;span class="nb"&gt;install &lt;/span&gt;wireshark-cli
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This installs the CLI version of wireshark, for which the name of the executable is &lt;code&gt;tshark&lt;/code&gt; (not wireshark-cli). &lt;/p&gt;

&lt;h2&gt;
  
  
  How to use wireshark for sniffing PostgreSQL traffic
&lt;/h2&gt;

&lt;p&gt;The next obvious question is: okay, but how do I use it?&lt;/p&gt;

&lt;p&gt;This too is actually very easy.&lt;/p&gt;

&lt;p&gt;If you want to see all the network traffic, and a complete description of the PostgreSQL related data in the traffic, use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo tshark -i any -f 'tcp port 5432' -d tcp.port==5432,pgsql -O pgsql
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is one important caveat here: tshark must be executed on either the client machine or the server machine, otherwise it cannot capture the network traffic.&lt;/p&gt;

&lt;h2&gt;
  
  
  What does that look like?
&lt;/h2&gt;

&lt;p&gt;This is how it looks on my server when I start wireshark:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[vagrant@yb-1 ~]$ sudo tshark -i any  -f 'tcp port 5433' -d tcp.port==5433,pgsql -O pgsql
Running as user "root" and group "root". This could be dangerous.
Capturing on 'any'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Please mind I capture the traffic on port 5433, which is the port that YugabyteDB uses; change it to 5432 for PostgreSQL.&lt;/p&gt;

&lt;p&gt;If I connect to the database with &lt;code&gt;psql&lt;/code&gt;, the following is seen:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Frame 1: 80 bytes on wire (640 bits), 80 bytes captured (640 bits) on interface 0
Linux cooked capture
Internet Protocol Version 4, Src: 192.168.66.1, Dst: 192.168.66.80
Transmission Control Protocol, Src Port: 52441, Dst Port: 5433, Seq: 0, Len: 0

Frame 2: 76 bytes on wire (608 bits), 76 bytes captured (608 bits) on interface 0
Linux cooked capture
Internet Protocol Version 4, Src: 192.168.66.80, Dst: 192.168.66.1
Transmission Control Protocol, Src Port: 5433, Dst Port: 52441, Seq: 0, Ack: 1, Len: 0

Frame 3: 68 bytes on wire (544 bits), 68 bytes captured (544 bits) on interface 0
Linux cooked capture
Internet Protocol Version 4, Src: 192.168.66.1, Dst: 192.168.66.80
Transmission Control Protocol, Src Port: 52441, Dst Port: 5433, Seq: 1, Ack: 1, Len: 0

Frame 4: 76 bytes on wire (608 bits), 76 bytes captured (608 bits) on interface 0
Linux cooked capture
Internet Protocol Version 4, Src: 192.168.66.1, Dst: 192.168.66.80
Transmission Control Protocol, Src Port: 52441, Dst Port: 5433, Seq: 1, Ack: 1, Len: 8
PostgreSQL
    Type: SSL request
    Length: 8

Frame 5: 68 bytes on wire (544 bits), 68 bytes captured (544 bits) on interface 0
Linux cooked capture
Internet Protocol Version 4, Src: 192.168.66.80, Dst: 192.168.66.1
Transmission Control Protocol, Src Port: 5433, Dst Port: 52441, Seq: 1, Ack: 9, Len: 0

Frame 6: 69 bytes on wire (552 bits), 69 bytes captured (552 bits) on interface 0
Linux cooked capture
Internet Protocol Version 4, Src: 192.168.66.80, Dst: 192.168.66.1
Transmission Control Protocol, Src Port: 5433, Dst Port: 52441, Seq: 1, Ack: 9, Len: 1

Frame 7: 68 bytes on wire (544 bits), 68 bytes captured (544 bits) on interface 0
Linux cooked capture
Internet Protocol Version 4, Src: 192.168.66.1, Dst: 192.168.66.80
Transmission Control Protocol, Src Port: 52441, Dst Port: 5433, Seq: 9, Ack: 2, Len: 0

Frame 8: 152 bytes on wire (1216 bits), 152 bytes captured (1216 bits) on interface 0
Linux cooked capture
Internet Protocol Version 4, Src: 192.168.66.1, Dst: 192.168.66.80
Transmission Control Protocol, Src Port: 52441, Dst Port: 5433, Seq: 9, Ack: 2, Len: 84
PostgreSQL
    Type: Startup message
    Length: 84
    Parameter name: user
    Parameter value: yugabyte
    Parameter name: database
    Parameter value: yugabyte
    Parameter name: application_name
    Parameter value: psql
    Parameter name: client_encoding
    Parameter value: UTF8

Frame 9: 68 bytes on wire (544 bits), 68 bytes captured (544 bits) on interface 0
Linux cooked capture
Internet Protocol Version 4, Src: 192.168.66.80, Dst: 192.168.66.1
Transmission Control Protocol, Src Port: 5433, Dst Port: 52441, Seq: 2, Ack: 93, Len: 0

Frame 10: 404 bytes on wire (3232 bits), 404 bytes captured (3232 bits) on interface 0
Linux cooked capture
Internet Protocol Version 4, Src: 192.168.66.80, Dst: 192.168.66.1
Transmission Control Protocol, Src Port: 5433, Dst Port: 52441, Seq: 2, Ack: 93, Len: 336
PostgreSQL
    Type: Authentication request
    Length: 8
    Authentication type: Success (0)
PostgreSQL
    Type: Parameter status
    Length: 26
    Parameter name: application_name
    Parameter value: psql
PostgreSQL
    Type: Parameter status
    Length: 25
    Parameter name: client_encoding
    Parameter value: UTF8
PostgreSQL
    Type: Parameter status
    Length: 23
    Parameter name: DateStyle
    Parameter value: ISO, MDY
PostgreSQL
    Type: Parameter status
    Length: 25
    Parameter name: integer_datetimes
    Parameter value: on
PostgreSQL
    Type: Parameter status
    Length: 27
    Parameter name: IntervalStyle
    Parameter value: postgres
PostgreSQL
    Type: Parameter status
    Length: 20
    Parameter name: is_superuser
    Parameter value: on
PostgreSQL
    Type: Parameter status
    Length: 25
    Parameter name: server_encoding
    Parameter value: UTF8
PostgreSQL
    Type: Parameter status
    Length: 39
    Parameter name: server_version
    Parameter value: 11.2-YB-2.17.2.0-b0
PostgreSQL
    Type: Parameter status
    Length: 35
    Parameter name: session_authorization
    Parameter value: yugabyte
PostgreSQL
    Type: Parameter status
    Length: 35
    Parameter name: standard_conforming_strings
    Parameter value: on
PostgreSQL
    Type: Parameter status
    Length: 17
    Parameter name: TimeZone
    Parameter value: UTC
PostgreSQL
    Type: Backend key data
    Length: 12
    PID: 5689
    Key: 4094446470
PostgreSQL
    Type: Ready for query
    Length: 5
    Status: Idle (73)

Frame 11: 68 bytes on wire (544 bits), 68 bytes captured (544 bits) on interface 0
Linux cooked capture
Internet Protocol Version 4, Src: 192.168.66.1, Dst: 192.168.66.80
Transmission Control Protocol, Src Port: 52441, Dst Port: 5433, Seq: 93, Ack: 338, Len: 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Yes, that is a lot of information. &lt;/p&gt;

&lt;p&gt;But there is this much information because these are all the network packets that are transmitted as part of setting up the database connection.&lt;/p&gt;

&lt;p&gt;This includes the first 3 packets performing the classic TCP 3-way handshake (frames 1-3).&lt;/p&gt;

&lt;p&gt;The frames 4-7 are a negotiation for SSL; frame 4 shows the client asking whether the PostgreSQL traffic should be SSL encrypted or not. &lt;/p&gt;

&lt;p&gt;The frames 8-11 deal with the startup message from the client and the database's response, which sets parameters and values. Frame 10 sends back a lot of individual messages in a single frame as a response to the client startup message, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The authentication request response&lt;/li&gt;
&lt;li&gt;Various parameter status messages&lt;/li&gt;
&lt;li&gt;The ready for query message, indicating the server can receive another request.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now for the reason most people will use it, to see the database requests: here there are multiple options, but let me show two common dialogues:&lt;/p&gt;

&lt;h3&gt;
  
  
  Simple query protocol
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Frame 12: 87 bytes on wire (696 bits), 87 bytes captured (696 bits) on interface 0
Linux cooked capture
Internet Protocol Version 4, Src: 192.168.66.1, Dst: 192.168.66.80
Transmission Control Protocol, Src Port: 53026, Dst Port: 5433, Seq: 93, Ack: 338, Len: 19
PostgreSQL
    Type: Simple query
    Length: 18
    Query: select now();

Frame 13: 68 bytes on wire (544 bits), 68 bytes captured (544 bits) on interface 0
Linux cooked capture
Internet Protocol Version 4, Src: 192.168.66.80, Dst: 192.168.66.1
Transmission Control Protocol, Src Port: 5433, Dst Port: 53026, Seq: 338, Ack: 112, Len: 0

Frame 14: 157 bytes on wire (1256 bits), 157 bytes captured (1256 bits) on interface 0
Linux cooked capture
Internet Protocol Version 4, Src: 192.168.66.80, Dst: 192.168.66.1
Transmission Control Protocol, Src Port: 5433, Dst Port: 53026, Seq: 338, Ack: 112, Len: 89
PostgreSQL
    Type: Row description
    Length: 28
    Field count: 1
        Column name: now
            Table OID: 0
            Column index: 0
            Type OID: 1184
            Column length: 8
            Type modifier: -1
            Format: Text (0)
PostgreSQL
    Type: Data row
    Length: 39
    Field count: 1
        Column length: 29
        Data: 323032332d30342d31312030393a33333a30392e36363032...
PostgreSQL
    Type: Command completion
    Length: 13
    Tag: SELECT 1
PostgreSQL
    Type: Ready for query
    Length: 5
    Status: Idle (73)

Frame 15: 68 bytes on wire (544 bits), 68 bytes captured (544 bits) on interface 0
Linux cooked capture
Internet Protocol Version 4, Src: 192.168.66.1, Dst: 192.168.66.80
Transmission Control Protocol, Src Port: 53026, Dst Port: 5433, Seq: 112, Ack: 427, Len: 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a simple query protocol dialogue. Contrary to popular belief, the simple query protocol is not a wire protocol: it is a request type for a PostgreSQL database.&lt;/p&gt;

&lt;p&gt;Frame 12 shows the query being sent to the database backend. This is a single request, which lets the PostgreSQL database backend deal with the 4 steps of PostgreSQL execution.&lt;/p&gt;

&lt;p&gt;Frame 14 is the database response. The database response consists of a number of messages inside the frame:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A description of the row it is going to send.&lt;/li&gt;
&lt;li&gt;The data of the row.&lt;/li&gt;
&lt;li&gt;The indication the command has been executed.&lt;/li&gt;
&lt;li&gt;The indication the backend is ready for another request.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Extended query
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Frame 15: 124 bytes on wire (992 bits), 124 bytes captured (992 bits) on interface 0
Linux cooked capture
Internet Protocol Version 4, Src: 192.168.66.80, Dst: 192.168.66.80
Transmission Control Protocol, Src Port: 50034, Dst Port: 5433, Seq: 258, Ack: 428, Len: 56
PostgreSQL
    Type: Parse
    Length: 20
    Statement:
    Query: SELECT now()
    Parameters: 0
PostgreSQL
    Type: Bind
    Length: 12
    Portal:
    Statement:
    Parameter formats: 0
    Parameter values: 0
    Result formats: 0
PostgreSQL
    Type: Describe
    Length: 6
    Portal:
PostgreSQL
    Type: Execute
    Length: 9
    Portal:
    Returns: all rows
PostgreSQL
    Type: Sync
    Length: 4

Frame 16: 68 bytes on wire (544 bits), 68 bytes captured (544 bits) on interface 0
Linux cooked capture
Internet Protocol Version 4, Src: 192.168.66.80, Dst: 192.168.66.80
Transmission Control Protocol, Src Port: 5433, Dst Port: 50034, Seq: 428, Ack: 314, Len: 0

Frame 17: 167 bytes on wire (1336 bits), 167 bytes captured (1336 bits) on interface 0
Linux cooked capture
Internet Protocol Version 4, Src: 192.168.66.80, Dst: 192.168.66.80
Transmission Control Protocol, Src Port: 5433, Dst Port: 50034, Seq: 428, Ack: 314, Len: 99
PostgreSQL
    Type: Parse completion
    Length: 4
PostgreSQL
    Type: Bind completion
    Length: 4
PostgreSQL
    Type: Row description
    Length: 28
    Field count: 1
        Column name: now
            Table OID: 0
            Column index: 0
            Type OID: 1184
            Column length: 8
            Type modifier: -1
            Format: Text (0)
PostgreSQL
    Type: Data row
    Length: 39
    Field count: 1
        Column length: 29
        Data: 323032332d30342d31312030393a34323a35372e37303331...
PostgreSQL
    Type: Command completion
    Length: 13
    Tag: SELECT 1
PostgreSQL
    Type: Ready for query
    Length: 5
    Status: Idle (73)

Frame 18: 68 bytes on wire (544 bits), 68 bytes captured (544 bits) on interface 0
Linux cooked capture
Internet Protocol Version 4, Src: 192.168.66.80, Dst: 192.168.66.80
Transmission Control Protocol, Src Port: 50034, Dst Port: 5433, Seq: 314, Ack: 527, Len: 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we see the same query (&lt;code&gt;select now()&lt;/code&gt;) being executed, but now using the extended query protocol.&lt;/p&gt;

&lt;p&gt;Frame 15 is the client request. Because this is sending a request using the extended query protocol, it doesn't send a single query, but it sends a request for the different steps that query execution must make:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Parse. The &lt;em&gt;client&lt;/em&gt; parse request triggers the &lt;em&gt;server&lt;/em&gt; parse and rewrite steps, introducing the SQL to the session.&lt;/li&gt;
&lt;li&gt;Bind. The &lt;em&gt;client&lt;/em&gt; bind request performs the &lt;em&gt;server&lt;/em&gt; plan step. If there are bind variables, the variables are bound before the plan step, to allow the planner to use their values.&lt;/li&gt;
&lt;li&gt;Execute. The client and server execute are the same thing.
The important thing to notice here is that Parse, Bind and Execute are all sent within the same frame.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Frame 17 contains the backend response. As you can see, it responded to each request: it accepted the parse request and completed the parsing, it then accepted the bind request and completed that, and then it responded with the execution output, which is identical to the simple query protocol response: a row description, the data row, command completion and ready for query.&lt;/p&gt;

&lt;p&gt;In case you wonder what happens if an invalid statement is sent: the server will simply respond with an error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Frame 15: 170 bytes on wire (1360 bits), 170 bytes captured (1360 bits) on interface 0
Linux cooked capture
Internet Protocol Version 4, Src: 192.168.66.80, Dst: 192.168.66.80
Transmission Control Protocol, Src Port: 5433, Dst Port: 48402, Seq: 428, Ack: 311, Len: 102
PostgreSQL
    Type: Error
    Length: 101
    Severity: ERROR
    Text: ERROR
    Code: 42703
    Message: column "__" does not exist
    Position: 8
    File: parse_relation.c
    Line: 3301
    Routine: errorMissingColumn
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(I changed the query from &lt;code&gt;select now()&lt;/code&gt; to &lt;code&gt;select __&lt;/code&gt;)&lt;/p&gt;

&lt;p&gt;And after the error it will respond that it is ready again:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Frame 17: 74 bytes on wire (592 bits), 74 bytes captured (592 bits) on interface 0
Linux cooked capture
Internet Protocol Version 4, Src: 192.168.66.80, Dst: 192.168.66.80
Transmission Control Protocol, Src Port: 5433, Dst Port: 48402, Seq: 530, Ack: 311, Len: 6
PostgreSQL
    Type: Ready for query
    Length: 5
    Status: Idle (73)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>postgres</category>
      <category>yugabytedb</category>
      <category>administration</category>
      <category>troubleshooting</category>
    </item>
    <item>
      <title>Simple port scanning without a port scanner</title>
      <dc:creator>Frits Hoogland</dc:creator>
      <pubDate>Mon, 03 Apr 2023 09:46:05 +0000</pubDate>
      <link>https://dev.to/yugabyte/simple-port-scanning-without-a-port-scanner-3mj7</link>
      <guid>https://dev.to/yugabyte/simple-port-scanning-without-a-port-scanner-3mj7</guid>
      <description>&lt;p&gt;We all been in the situation where we need to understand if a port on a server can be reached, reachability is unsure, and no port scanner such as &lt;a href="https://nmap.org" rel="noopener noreferrer"&gt;nmap&lt;/a&gt; is available, nor install is allowed.&lt;/p&gt;

&lt;h2&gt;
  
  
  ping
&lt;/h2&gt;

&lt;p&gt;Please mind this is not about whether the host in general can be reached. That can generally be determined using the &lt;code&gt;ping&lt;/code&gt; command. It should be considered, though, that ping uses a specific protocol (ICMP), which is different from the TCP protocol generally used for encrypted and unencrypted communication.&lt;/p&gt;

&lt;h2&gt;
  
  
  TCP port scanning
&lt;/h2&gt;

&lt;p&gt;The state of a TCP port on a host can be determined if the &lt;code&gt;openssl&lt;/code&gt; utility is available on the machine that needs to connect to the host. Nowadays, SSL/encryption is almost always a requirement, which makes it very likely that the openssl package, which contains the &lt;code&gt;openssl&lt;/code&gt; utility, is installed.&lt;/p&gt;

&lt;p&gt;A TCP port on a host can have 3 principal states:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Unreachable/firewalled
If a host or hostname is specified that does not exist, then obviously it's impossible to connect to a port on that host, and there will be no communication back. The same happens when a host or hostname does exist, but is unreachable, or if the entire host (all ports) or the specific port is firewalled. &lt;/li&gt;
&lt;li&gt;Open
If a host or hostname does allow communication with a specific port, and an executable is listening on that port, then communication can happen. &lt;/li&gt;
&lt;li&gt;Closed
If a host or hostname does allow communication with a specific port, but no executable is listening on that port, then communication with that executable cannot happen. In such a case, the TCP/IP (network) stack will respond to the request by refusing the connection.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With these 3 states in mind, you can use the &lt;code&gt;openssl&lt;/code&gt; utility to obtain the exact state of a hostname and port combination. This is how that looks:&lt;/p&gt;

&lt;h2&gt;
  
  
  Unreachable
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;➜ openssl s_client -connect 192.168.66.79:80
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If a host is unreachable, it will not, and cannot, respond. In that case, the utility will not produce any output for a long time, because it is waiting for a response. Eventually it will time out:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;connect: Operation timed out
connect:errno=60
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Open
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;➜ openssl s_client -connect 192.168.66.80:7000
CONNECTED(00000003)
...much more output related to SSL...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If a host is reachable, and has an executable listening on the port specified, it will be able to connect, which is what the &lt;code&gt;CONNECTED(00000003)&lt;/code&gt; message means. This means the port is open, and the executable can be interacted with.&lt;/p&gt;

&lt;p&gt;All the other output that follows is related to SSL, and might or might not be relevant. It is not for determining the port state.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closed
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;➜ openssl s_client -connect 192.168.66.80:7001
connect: Connection refused
connect:errno=61
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If a host is reachable, but has no executable listening on the port specified, the message will say 'Connection refused'. &lt;/p&gt;
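
&lt;p&gt;The three checks above can be combined into a small helper function. This is a sketch that assumes bash, the GNU &lt;code&gt;timeout&lt;/code&gt; utility and &lt;code&gt;grep&lt;/code&gt;, and that matches on the 'CONNECTED' and 'refused' messages shown above:&lt;/p&gt;

```shell
# Classify a TCP port as open, closed or unreachable, using only openssl.
# Sketch: assumes bash, GNU timeout and grep; it matches on the messages
# that openssl s_client prints, as shown in the examples above.
port_state() {
  local host=$1 port=$2
  echo Q | timeout 3 openssl s_client -connect "$host:$port" 1>/tmp/ps_out 2>/tmp/ps_err
  if grep -q CONNECTED /tmp/ps_out /tmp/ps_err; then
    echo open          # an executable is listening on the port
  elif grep -q refused /tmp/ps_out /tmp/ps_err; then
    echo closed        # host reachable, but nothing is listening
  else
    echo unreachable   # timed out or no route: host down or firewalled
  fi
}
```

&lt;p&gt;For example, &lt;code&gt;port_state 192.168.66.80 7000&lt;/code&gt; should print 'open' for the host used in the examples above.&lt;/p&gt;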

</description>
      <category>linux</category>
      <category>configuration</category>
      <category>troubleshooting</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>The anatomy of XFS fsync()</title>
      <dc:creator>Frits Hoogland</dc:creator>
      <pubDate>Mon, 27 Mar 2023 17:43:33 +0000</pubDate>
      <link>https://dev.to/yugabyte/the-anatomy-of-xfs-fsync-4ael</link>
      <guid>https://dev.to/yugabyte/the-anatomy-of-xfs-fsync-4ael</guid>
      <description>&lt;p&gt;This blogpost is about an investigation I performed recently about the fsync() call. If you are interested into the details of how the operating system deals with the fsync() call, this is a blogpost for you.&lt;/p&gt;

&lt;p&gt;PS: I am using kernel version 3.10.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is fsync()?
&lt;/h2&gt;

&lt;p&gt;The fsync() call is a system call that a linux executable can issue to guarantee that previously written data is persisted on the block device.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wait! Wasn't there something with fsync and databases?
&lt;/h2&gt;

&lt;p&gt;Yes. There was a thing that some people refer to as &lt;a href="https://wiki.postgresql.org/wiki/Fsync_Errors" rel="noopener noreferrer"&gt;fsyncgate 2018&lt;/a&gt;. This article is not about that. The fsyncgate 2018 issue was the finding that the linux kernel might not expose write errors to the application via the fsync() system call, so that write errors can go undetected.&lt;/p&gt;

&lt;h2&gt;
  
  
  What does fsync() do?
&lt;/h2&gt;

&lt;p&gt;The first thing to understand is that fsync() is a general linux system call, but because it deals with the filesystem, it has, and requires, a filesystem specific implementation. The common linux filesystem that we use, and that is in use on RedHat compatible linux distributions, is XFS.&lt;/p&gt;

&lt;p&gt;The second thing to understand is that by default linux lets processes and threads write to a file and perform other modifications to the filesystem, which in reality are writes/changes to memory, called the page cache; the kernel performs the actual writing to disk at a later point in time, generally using kernel write-back worker threads. Only if a file is explicitly opened with the O_DIRECT flag does a write happen to the file directly.&lt;/p&gt;

&lt;p&gt;The function and reason of fsync() is to provide an option to request that the operating system actually writes any pending changes to a file that are still in memory to disk, and to guarantee persistence.&lt;/p&gt;
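
&lt;p&gt;As a small command line illustration of requesting persistence (a sketch, assuming GNU &lt;code&gt;dd&lt;/code&gt;, whose &lt;code&gt;conv=fsync&lt;/code&gt; option issues an fsync() call after the last write):&lt;/p&gt;

```shell
# Write a single 4k block to a file; conv=fsync makes dd call fsync()
# before exiting, so the data is persisted on the block device rather
# than merely written to the page cache.
dd if=/dev/zero of=/tmp/fsync_demo.bin bs=4096 count=1 conv=fsync status=none
```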

&lt;h2&gt;
  
  
  And now the details
&lt;/h2&gt;

&lt;p&gt;As you might expect, fsync() is not really a single action that magically executes the persistency guarantee. A number of actions must happen, which are detailed below.&lt;/p&gt;

&lt;p&gt;When a "user land" process executes the fsync() call with an inode number (file), the process switches to system mode and traverses a number of layers in the linux kernel to the actual XFS implementation of the fsync() call, which is called &lt;a href="https://elixir.bootlin.com/linux/v3.10/source/fs/xfs/xfs_file.c#L159" rel="noopener noreferrer"&gt;&lt;code&gt;xfs_file_fsync()&lt;/code&gt;&lt;/a&gt;. Please mind this is the source code of the standard kernel, not the RedHat kernel, which turns out to be an important detail.&lt;/p&gt;

&lt;p&gt;This function is the basis of the XFS fsync() implementation, and essentially performs the following functions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://elixir.bootlin.com/linux/v3.10/source/mm/filemap.c#L371" rel="noopener noreferrer"&gt;filemap_write_and_wait_range()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://elixir.bootlin.com/linux/v3.10/source/fs/xfs/xfs_inode.c#L145" rel="noopener noreferrer"&gt;xfs_ilock()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://elixir.bootlin.com/linux/v3.10/source/fs/xfs/xfs_inode.c#L241" rel="noopener noreferrer"&gt;xfs_iunlock&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://elixir.bootlin.com/linux/v3.10/source/fs/xfs/xfs_log.c#L3198" rel="noopener noreferrer"&gt;_xfs_log_force_lsn()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://elixir.bootlin.com/linux/v3.10/source/fs/xfs/xfs_super.c#L717" rel="noopener noreferrer"&gt;xfs_blkdev_issue_flush&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is slightly simplified, for all the details please read the source code (I am not kidding, elixir makes it reasonably easy!).&lt;/p&gt;

&lt;h2&gt;
  
  
  filemap_write_and_wait_range
&lt;/h2&gt;

&lt;p&gt;Step 1, the &lt;code&gt;filemap_write_and_wait_range()&lt;/code&gt; call, is the step that most people think is all that &lt;code&gt;fsync()&lt;/code&gt; does: find any ranges of dirty pages (pages written in memory but not yet persisted to disk) of the given file, and write these to disk. &lt;/p&gt;

&lt;p&gt;With kernel 3.10, in the combination that my test machine uses, this step (writing the dirty pages of the given inode/file) is executed by the process itself. With a version 4 kernel on Alma linux 8, the process gathers the blocks, queues them, and requests a kernel background worker thread to perform the writing.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;filemap_write_and_wait_range()&lt;/code&gt; gathers the dirty pages in a function it calls, &lt;a href="https://elixir.bootlin.com/linux/v3.10/source/mm/filemap.c#L217" rel="noopener noreferrer"&gt;&lt;code&gt;__filemap_fdatawrite_range()&lt;/code&gt;&lt;/a&gt;. It creates a batch out of a contiguous range of dirty pages, and repeats this for the next batch of dirty pages until it has scanned the entire file. Each batch is then submitted as an individual IO request. &lt;/p&gt;

&lt;p&gt;The second step is the function &lt;a href="https://elixir.bootlin.com/linux/v3.10/source/mm/filemap.c#L276" rel="noopener noreferrer"&gt;&lt;code&gt;__filemap_fdatawait_range()&lt;/code&gt;&lt;/a&gt;, whose name is reasonably self-explanatory: it waits for the submitted write IO requests to finish. &lt;/p&gt;

&lt;p&gt;Once &lt;code&gt;__filemap_fdatawait_range()&lt;/code&gt; has finished, all the dirty pages for the given inode/file have been written to disk. But the work for fsync() is not finished yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  xfs_ilock / xfs_iunlock
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://elixir.bootlin.com/linux/v3.10/source/fs/xfs/xfs_inode.c#L145" rel="noopener noreferrer"&gt;xfs_ilock()&lt;/a&gt; and &lt;a href="https://elixir.bootlin.com/linux/v3.10/source/fs/xfs/xfs_inode.c#L241" rel="noopener noreferrer"&gt;xfs_iulock()&lt;/a&gt; functions lock and unlock the inode/file, for which there are two locks (XFS_IOLOCK and XFS_ILOCK), which each can be taken in shared and exclusive modes. The interesting bit here is that the locking is done to make sure the pin count, which is the number of outstanding items in the journal/log, which is read using the function &lt;code&gt;xfs_ipincount()&lt;/code&gt; (which is a macro, so there is no function source code), is consistent.&lt;/p&gt;

&lt;p&gt;The thing to notice here is the if statement inside the &lt;code&gt;xfs_ipincount()&lt;/code&gt; check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if (!datasync ||
    (ip-&amp;gt;i_itemp-&amp;gt;ili_fields &amp;amp; ~XFS_ILOG_TIMESTAMP))
        lsn = ip-&amp;gt;i_itemp-&amp;gt;ili_last_lsn;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The way to read this is that the &lt;code&gt;lsn&lt;/code&gt; variable is assigned if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;datasync is not true (datasync is a flag indicating this function was called as fdatasync(), which requests ONLY the data part of fsync described above)
-or-&lt;/li&gt;
&lt;li&gt;any fields need logging (meaning a structure change happened since the last log, which overrides datasync)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If either of the two conditions is true, the lsn variable is filled with the log sequence number taken from the last lsn field of the inode (ip) log item.&lt;/p&gt;
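
&lt;p&gt;The datasync distinction is visible from user land too: fdatasync() asks for only the data part. A minimal Python sketch (the file name is arbitrary, and the fallback to fsync is my own addition for platforms that do not expose fdatasync):&lt;/p&gt;

```python
import os
import tempfile

# fsync() flushes data and all metadata; fdatasync() may skip
# metadata-only changes such as timestamps (the XFS_ILOG_TIMESTAMP case
# in the kernel snippet above). From Python both calls look the same;
# the difference is decided inside the kernel.
path = os.path.join(tempfile.mkdtemp(), "demo.dat")
fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)

os.write(fd, b"payload")
os.fsync(fd)       # data plus inode metadata are forced out

os.write(fd, b" more")
# Fall back to fsync where fdatasync is unavailable (assumption for portability).
datasync = getattr(os, "fdatasync", os.fsync)
datasync(fd)       # data only, unless structural metadata changed as well
os.close(fd)
```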

&lt;h2&gt;
  
  
  _xfs_log_force_lsn
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://elixir.bootlin.com/linux/v3.10/source/fs/xfs/xfs_log.c#L3198" rel="noopener noreferrer"&gt;_xfs_log_force_lsn()&lt;/a&gt; function writes the journal of the XFS filesystem to disk. A filesystem journal is a property of journalling filesystem which saves structure/metadata changes to an inode/file in a separate area on disk to be able to replay the changes to an inode/file in case of a crash. &lt;/p&gt;

&lt;p&gt;Before journalling filesystems, crashes could lead to the filesystem structure (inode to file mappings) getting corrupted, which required running fsck (filesystem check), typically during server startup after a crash, to reconstruct the filesystem in case of inconsistencies. With larger filesystems, this can take a significant amount of time. (Inconsistencies found during fsck are generally saved in the lost+found directory.)&lt;/p&gt;

&lt;p&gt;The writing of the journal only happens if the lsn variable that optionally could be obtained in the previous step is not 0 (&lt;code&gt;if (lsn)&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;The function &lt;a href="https://elixir.bootlin.com/linux/v3.10/source/fs/xfs/xfs_log.c#L3198" rel="noopener noreferrer"&gt;_xfs_log_force_lsn()&lt;/a&gt; has an interesting explanation in its comments:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;* Synchronous forces are implemented with a signal variable. All callers
* to force a given lsn to disk will wait on a the sv attached to the
* specific in-core log.  When given in-core log finally completes its
* write to disk, that thread will wake up all threads waiting on the
* sv.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means that once an fsync() has written the dirty pages from the OS page cache, obtained the log sequence number, and then entered &lt;code&gt;_xfs_log_force_lsn()&lt;/code&gt;, it can find the in-core log.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;_xfs_log_force_lsn()&lt;/code&gt; function is a master function that handles the different parts of flushing XFS metadata to disk for a given in-core log.&lt;/p&gt;

&lt;h2&gt;
  
  
  _xfs_log_force_lsn() &amp;gt; xlog_cil_force_lsn()
&lt;/h2&gt;

&lt;p&gt;The first thing that happens is the execution of the &lt;code&gt;xlog_cil_force_lsn()&lt;/code&gt; function. This function takes the lsn, looks for the operating system work queue that is bound to this in-core log's committed item list (cil), and checks whether it is still busy. If so, it waits for it off CPU. (The committed item list/cil is a doubly linked list of items in the in-core log.)&lt;/p&gt;

&lt;p&gt;Once that is done, it checks the committed item list for items with an lsn lower than the required lsn. If it finds one or more such items, it requests a flush (write) of the committed item list to disk via a linux workqueue. &lt;/p&gt;

&lt;p&gt;The last thing in the &lt;code&gt;xlog_cil_force_lsn()&lt;/code&gt; function is waiting for the committed item list flushes to finish, using the linux workqueue. This prevents multiple processes or threads from flushing the cil concurrently; a later caller waits for the previous flush to finish.&lt;/p&gt;
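
&lt;p&gt;As an analogy only (this is not XFS code, and the kernel uses workqueues and completion variables rather than a simple lock): the serialisation can be pictured as many callers requesting a flush while a lock guarantees that only one flush runs at a time, with later callers waiting on it:&lt;/p&gt;

```python
import threading

# Analogy for the cil flush serialisation: many callers "force the log",
# but the lock makes sure only one flush executes at a time; callers
# arriving during a flush block until it completes, like waiting on the
# workqueue in xlog_cil_force_lsn().
flush_lock = threading.Lock()
flush_count = 0

def force_flush():
    global flush_count
    with flush_lock:          # serialises flushers
        flush_count += 1      # stand-in for "write the committed items"

threads = [threading.Thread(target=force_flush) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```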

&lt;h2&gt;
  
  
  _xfs_log_force_lsn() &amp;gt; xlog_state_switch_iclogs()
&lt;/h2&gt;

&lt;p&gt;The next thing that is done is finding the correct buffer in the in-core log, waiting for pending IOs to finish, and moving the pointer to the next in-core log buffer using the function &lt;code&gt;xlog_state_switch_iclogs()&lt;/code&gt;, which marks the current buffer for writing (synchronisation). By moving the pointer to the next buffer, new in-core log items will be created in the next buffer, and the previous in-core log buffer can be prepared for writing.&lt;/p&gt;

&lt;h2&gt;
  
  
  _xfs_log_force_lsn() &amp;gt; xlog_state_release_iclog()
&lt;/h2&gt;

&lt;p&gt;After the log buffer is switched, the function &lt;code&gt;xlog_state_release_iclog()&lt;/code&gt; performs the preparation of the previous buffer for writing, and calls &lt;code&gt;xlog_sync()&lt;/code&gt; to actually perform the write.&lt;/p&gt;

&lt;p&gt;Inside &lt;code&gt;xlog_sync()&lt;/code&gt; the actual write happens. However, the write is done asynchronously. &lt;/p&gt;

&lt;p&gt;This is why, when the calls return after submitting the write in &lt;code&gt;xlog_sync()&lt;/code&gt;, the loop in &lt;code&gt;_xfs_log_force_lsn()&lt;/code&gt; has a final &lt;code&gt;if&lt;/code&gt; block that essentially performs &lt;code&gt;xlog_wait()&lt;/code&gt; to wait for the log write to finish.&lt;/p&gt;

&lt;h2&gt;
  
  
  xfs_blkdev_issue_flush()
&lt;/h2&gt;

&lt;p&gt;At two places in &lt;code&gt;xfs_file_fsync()&lt;/code&gt;, depending on specific flags and specific situations, there is another call: &lt;code&gt;xfs_blkdev_issue_flush()&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;This function essentially sends a SCSI/device command to the disk telling it to flush its cache. The basic idea is that in the case of a crash or lost power, the disk controller or the disk cache itself might still hold the written data and could lose it. This call tells the device to flush that data to the actual persistent medium.&lt;/p&gt;

&lt;p&gt;This is also the reason that some executables that perform O_DIRECT calls still perform fsync(), even though they don't cache anything at the operating system layer: this way it is made sure that the written data is ultimately stored on a persistent device.&lt;/p&gt;

&lt;h2&gt;
  
  
  A twist
&lt;/h2&gt;

&lt;p&gt;I have been reading the source code of the linux kernel and XFS, which gives an understanding of how this works. I carefully checked the kernel version running on CentOS 7, which is kernel version 3.10, and used the same version in the source code browser at &lt;a href="https://elixir.bootlin.com" rel="noopener noreferrer"&gt;https://elixir.bootlin.com&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;I also used the &lt;a href="https://www.kernel.org/doc/html/v5.0/trace/ftrace.html" rel="noopener noreferrer"&gt;linux kernel ftrace facility&lt;/a&gt; to look at what is actually going on. In my opinion, it's important to always validate whatever you have been learning and studying. &lt;/p&gt;

&lt;p&gt;What I found was that the RedHat 3.10 kernel does not follow the function order shown in the source code on elixir. The page on elixir says:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;filemap_write_and_wait_range()&lt;/li&gt;
&lt;li&gt;xfs_ilock()&lt;/li&gt;
&lt;li&gt;(xfs_ipincount())&lt;/li&gt;
&lt;li&gt;xfs_iunlock()&lt;/li&gt;
&lt;li&gt;_xfs_log_force_lsn()&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But what I found when using ftrace was:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;filemap_write_and_wait_range()&lt;/li&gt;
&lt;li&gt;xfs_ilock()&lt;/li&gt;
&lt;li&gt;(xfs_ipincount())&lt;/li&gt;
&lt;li&gt;_xfs_log_force_lsn()&lt;/li&gt;
&lt;li&gt;xfs_iunlock()&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This turns out to be a change in the RedHat kernel. Because this order matches newer kernel versions, I assume this is a fix that was backported to the RedHat 3.10 kernel.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;There is a lot going on if you take the effort to read the source code and read up on all the concepts that it implements.&lt;/p&gt;

&lt;p&gt;One thing that I have not mentioned is another property of fsync(): it takes a file descriptor, but it doesn't return or expose how much work it has actually done. The return code is zero for successful execution and -1 for an error.&lt;/p&gt;
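
&lt;p&gt;From Python, for example, the -1/errno pair surfaces as an OSError, while a successful call returns nothing (the temporary file here is just for illustration):&lt;/p&gt;

```python
import os
import tempfile

# fsync(2) returns 0 on success and -1 (with errno set) on error;
# Python's os.fsync() raises OSError in the error case, for example
# when called on a descriptor that was already closed.
fd = os.open(os.path.join(tempfile.mkdtemp(), "f"), os.O_WRONLY | os.O_CREAT, 0o644)
os.fsync(fd)          # succeeds: 0 in C, None in Python
os.close(fd)

try:
    os.fsync(fd)      # the descriptor is gone: EBADF
    failed = False
except OSError:
    failed = True
```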

&lt;p&gt;The way the logging is handled is considerably more complex than I have described here, for the sake of keeping it as simple as possible. The logging is done in a completely asynchronous way for maximal performance, which makes it more complex.&lt;/p&gt;

&lt;p&gt;The first part of fsync(), writing the dirty pages, is very obvious; the second part, writing the in-core log (buffer), is less obvious, along with all the optimisations that make concurrent writes to the log as performant and flexible as possible. Hopefully this gives you an understanding of some of the complexities that are going on here, and awareness of the implementation.&lt;/p&gt;

</description>
      <category>linux</category>
      <category>performance</category>
      <category>hardware</category>
      <category>disk</category>
    </item>
    <item>
      <title>YugabyteDB: fs_data_dirs</title>
      <dc:creator>Frits Hoogland</dc:creator>
      <pubDate>Mon, 20 Mar 2023 14:23:31 +0000</pubDate>
      <link>https://dev.to/yugabyte/yugabytedb-fsdatadirs-fno</link>
      <guid>https://dev.to/yugabyte/yugabytedb-fsdatadirs-fno</guid>
      <description>&lt;p&gt;YugabyteDB, like every database, allows to specify the directory where the database files are stored. This parameter, or gflag, in YugabyteDB is &lt;code&gt;fs_data_dirs&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;Of course the directory can also be a mount point for a filesystem, so that the Operating System and the binaries are completely independent from the database files.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tablet server and Master
&lt;/h2&gt;

&lt;p&gt;The parameter &lt;code&gt;fs_data_dirs&lt;/code&gt; is a setting for both the tablet server and the master. Both store rocksdb database files. &lt;/p&gt;

&lt;p&gt;It is not a problem to have both the master and the tablet server running on the same machine/node, and it also is not a problem to have them both be pointing to the same directory with &lt;code&gt;fs_data_dirs&lt;/code&gt;. They will create their individual, separate, &lt;code&gt;master&lt;/code&gt; and &lt;code&gt;tserver&lt;/code&gt; directories, where they will store and manipulate their files.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;code&gt;fs_data_dirs&lt;/code&gt; contents
&lt;/h2&gt;

&lt;p&gt;Inside the directory set by the &lt;code&gt;fs_data_dirs&lt;/code&gt; parameter, a freshly started tablet server will create:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;pg_data (only if &lt;code&gt;enable_ysql&lt;/code&gt; is &lt;code&gt;true&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;yb-data/tserver&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The master server creates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;yb-data/master&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Inside the &lt;code&gt;yb-data/tserver&lt;/code&gt; and &lt;code&gt;yb-data/master&lt;/code&gt; directories, a directory structure is created which contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;wals: the YugabyteDB write ahead log.&lt;/li&gt;
&lt;li&gt;tablet-meta: tablet metadata.&lt;/li&gt;
&lt;li&gt;data: the rocksdb database directories.&lt;/li&gt;
&lt;li&gt;consensus-meta: RAFT/consensus metadata.&lt;/li&gt;
&lt;li&gt;logs: textual log files of the server and optionally of the PostgreSQL logfile.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The information that is stored for each tablet in the &lt;code&gt;tablet-meta&lt;/code&gt; directory links the wal and rocksdb together, using operating system paths. &lt;/p&gt;

&lt;h2&gt;
  
  
  ADDING a directory to &lt;code&gt;fs_data_dirs&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;If you want to add another directory or path for storing data, you can do that by specifying a second directory separated by a comma without a space, such as: &lt;code&gt;--fs_data_dirs=/mnt/d0,/mnt/d1&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If you add a directory and restart the tablet server, it will create the directory &lt;code&gt;yb-data/tserver&lt;/code&gt; in the added path, and create the directories &lt;code&gt;wals&lt;/code&gt;, &lt;code&gt;tablet-meta&lt;/code&gt;, &lt;code&gt;data&lt;/code&gt; and &lt;code&gt;consensus-meta&lt;/code&gt; inside it.&lt;/p&gt;

&lt;p&gt;The first thing to notice is that the &lt;code&gt;logs&lt;/code&gt; directory is not in the list of directories in &lt;code&gt;yb-data/tserver&lt;/code&gt;. The reason is that the logs are stored in the first directory the server finds in &lt;code&gt;fs_data_dirs&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  How YugabyteDB works with multiple data directories
&lt;/h2&gt;

&lt;p&gt;When multiple data directories are set in &lt;code&gt;fs_data_dirs&lt;/code&gt;, there is no dynamic or automatic moving or load balancing that will happen.&lt;/p&gt;

&lt;p&gt;If a directory is added to &lt;code&gt;fs_data_dirs&lt;/code&gt; later on, it will be initialised, and the allocations in the first directory will stay where they are.&lt;/p&gt;

&lt;p&gt;Also, when an object is created, and therefore its tablet(s), which require wal, tablet-meta, rocksdb and consensus-meta allocations, the first tablet will still be created in the first directory; only the second tablet will be allocated in the second directory, a third in the first directory again (if two directories are specified), a fourth in the second directory, and so on.&lt;/p&gt;
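
&lt;p&gt;As a toy model only (this is not YugabyteDB code, and the directory names are made up), the round-robin placement over the directories can be sketched as:&lt;/p&gt;

```python
from itertools import cycle

# Toy model of round-robin tablet placement over fs_data_dirs:
# tablet 1 goes to the first directory, tablet 2 to the second,
# tablet 3 to the first again, and so on. Paths are hypothetical.
fs_data_dirs = ["/mnt/d0", "/mnt/d1"]
placement = cycle(fs_data_dirs)

tablets = ["tablet-%d" % n for n in range(1, 5)]
assignment = {tablet: next(placement) for tablet in tablets}
```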

&lt;p&gt;This means that if a filesystem mounted at a path specified in &lt;code&gt;fs_data_dirs&lt;/code&gt; is filling up, and it's impossible to extend the filesystem (if you can extend the partition that contains the filesystem, XFS allows online resizing), adding a second data directory is quite likely not the solution.&lt;/p&gt;

&lt;h2&gt;
  
  
  REMOVING or CHANGING a directory in &lt;code&gt;fs_data_dirs&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Currently, we do NOT support removing or changing a directory specified in &lt;code&gt;fs_data_dirs&lt;/code&gt;. The reason is that the full path is stored in the metadata, and thus moving a directory will break the specified paths. &lt;/p&gt;

&lt;p&gt;If you must change a path, or do something else that forces you to change the paths, the best way of doing that is to add a tablet server node to the cluster with the correct configuration for &lt;code&gt;fs_data_dirs&lt;/code&gt;, and blacklist the problem tablet server. This will make the tablets move away from the problem node, and spread the tablet replicas over the cluster again.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;YugabyteDB stores metadata that contains the paths, so you cannot move YugabyteDB datafiles to another directory and add that to &lt;code&gt;fs_data_dirs&lt;/code&gt; to work around a filesystem filling up. &lt;/p&gt;

&lt;p&gt;Adding a directory to &lt;code&gt;fs_data_dirs&lt;/code&gt; is possible, but tablets are only placed in it as new tablets are created, and therefore it is not a way to balance filesystem space usage.&lt;/p&gt;

&lt;p&gt;If you run into problems with the path specified in &lt;code&gt;fs_data_dirs&lt;/code&gt;, add a node or nodes to the cluster which are sufficiently configured to store the data, and blacklist the problem node or nodes to make them move the tablets. &lt;/p&gt;

&lt;p&gt;The idea behind running YugabyteDB is to run simple, reasonably small nodes with local (SSD) drives, and scale up the cluster, not the nodes. Also, removing or changing a node should be a normal, non-intrusive action, which is why dealing with &lt;code&gt;fs_data_dirs&lt;/code&gt; in this way is sufficient. &lt;/p&gt;

</description>
      <category>discuss</category>
    </item>
    <item>
      <title>YugabyteDB yb_stats: table and tablet detail</title>
      <dc:creator>Frits Hoogland</dc:creator>
      <pubDate>Thu, 16 Mar 2023 20:16:26 +0000</pubDate>
      <link>https://dev.to/yugabyte/yugabytedb-ybstats-table-and-tablet-detail-56kj</link>
      <guid>https://dev.to/yugabyte/yugabytedb-ybstats-table-and-tablet-detail-56kj</guid>
      <description>&lt;p&gt;This is a blogpost about using the &lt;a href="https://fritshoogland-yugabyte.github.io/yb_stats-book/" rel="noopener noreferrer"&gt;&lt;code&gt;yb_stats&lt;/code&gt;&lt;/a&gt; with the new options of &lt;code&gt;--print-table-detail&lt;/code&gt; and &lt;code&gt;--print-tablet-detail&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  extra-data
&lt;/h3&gt;

&lt;p&gt;Both &lt;code&gt;--print-table-detail&lt;/code&gt; and &lt;code&gt;--print-tablet-detail&lt;/code&gt; fetch table or tablet specific details. Fetching these requires extra calls to the tablet servers and masters. Therefore, these options only work when the &lt;code&gt;--extra-data&lt;/code&gt; switch is set, which fetches this additional required data; normal runs without this switch are unaffected by the extra work.&lt;/p&gt;

&lt;h3&gt;
  
  
  print-table-detail
&lt;/h3&gt;

&lt;p&gt;What does the &lt;code&gt;--print-table-detail&lt;/code&gt; show?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;➜ yb_stats --print-table-detail --extra-data --uuid 000033e8000030008000000000004100
Table UUID: 000033e8000030008000000000004100, version: 0, type: PGSQL_TABLE_TYPE, state: Running, keyspace: yugabyte, object_type: User tables, name: ysql_test_table
On disk size: Total: 639.60M WAL Files: 579.91M SST Files: 59.69M SST Files Uncompressed: 569.71M
Replication info:
Columns:
0    id                               int32 NOT NULL PARTITION KEY
1    f1                               string NULLABLE NOT A PARTITION KEY
Tablets:
9e75c351ddeb4b8aa4b16afd8bd086cd hash_split: [0xAAAA, 0xFFFF], Split depth: 0, State: Running, Hidden: false, Message: Tablet reported with an active leader, Raft: FOLLOWER: yb-2.local FOLLOWER: yb-1.local LEADER: yb-3.local
85ac22ace45542baa1543d172b8b7657 hash_split: [0x5555, 0xAAA9], Split depth: 0, State: Running, Hidden: false, Message: Tablet reported with an active leader, Raft: FOLLOWER: yb-2.local LEADER: yb-1.local FOLLOWER: yb-3.local
39f2c50a35d043f58bc0ade9e9399dba hash_split: [0x0000, 0x5554], Split depth: 0, State: Running, Hidden: false, Message: Tablet reported with an active leader, Raft: LEADER: yb-2.local FOLLOWER: yb-1.local FOLLOWER: yb-3.local
Tasks:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the table internal version (0).&lt;/li&gt;
&lt;li&gt;the table internal type (PGSQL_TABLE_TYPE).&lt;/li&gt;
&lt;li&gt;the state and keyspace (database) and object name.&lt;/li&gt;
&lt;li&gt;the object type (this will say Index tables for an index).&lt;/li&gt;
&lt;li&gt;the total on disk size, the total WAL size, the total uncompressed and compressed SST file size (for all tablets).&lt;/li&gt;
&lt;li&gt;the columns and their internal types, as well as what is part of the partition key.&lt;/li&gt;
&lt;li&gt;the tablet UUID, the way the tablets are split.&lt;/li&gt;
&lt;li&gt;the tablet state and whether the tablet is running.&lt;/li&gt;
&lt;li&gt;the tablet current last known good state.&lt;/li&gt;
&lt;li&gt;the tasks operating at the table level (such as index backfill).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  print-tablet-detail
&lt;/h3&gt;

&lt;p&gt;If you want to look deeper into the table, you have to look at the fundamental logical storage units, which are the tablets. The &lt;code&gt;--print-tablet-detail&lt;/code&gt; option shows the details for a single tablet.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;➜ yb_stats --print-tablet-detail --extra-data --uuid 9e75c351ddeb4b8aa4b16afd8bd086cd
192.168.66.80:9000
 General info:
  Keyspace:       yugabyte
  Object name:    ysql_test_table
  On disk sizes:  Total: 212.42M Consensus Metadata: 1.4K WAL Files: 192.52M SST Files: 19.91M SST Files Uncompressed: 189.95M
  State:          RUNNING
 Consensus:
  State:          Consensus queue metrics:Only Majority Done Ops: 0, In Progress Ops: 328, Cache: LogCacheStats(num_ops=0, bytes=0, disk_reads=0)
  Queue overview:
  Watermark:
  Messages:
 LogAnchor:
  Latest log entry op id: 1.328
  Min retryable request op id: 1.325
  Last committed op id: 1.328
  Max persistent intents op id: 1.327
  Earliest needed log index: 325
 Transactions:
  - { safe_time_for_participant: { physical: 1678909347240643 } remove_queue_size: 0 }
 Rocksdb:
  IntentDB:
  RegularDB:
   total_size: 2088888, uncompressed_size: 19988404, name_id: 14, /mnt/d0/yb-data/tserver/data/rocksdb/table-000033e8000030008000000000004100/tablet-9e75c351ddeb4b8aa4b16afd8bd086cd
   total_size: 6270955, uncompressed_size: 59742289, name_id: 13, /mnt/d0/yb-data/tserver/data/rocksdb/table-000033e8000030008000000000004100/tablet-9e75c351ddeb4b8aa4b16afd8bd086cd
   total_size: 6215593, uncompressed_size: 59741912, name_id: 12, /mnt/d0/yb-data/tserver/data/rocksdb/table-000033e8000030008000000000004100/tablet-9e75c351ddeb4b8aa4b16afd8bd086cd
   total_size: 6296946, uncompressed_size: 59708999, name_id: 10, /mnt/d0/yb-data/tserver/data/rocksdb/table-000033e8000030008000000000004100/tablet-9e75c351ddeb4b8aa4b16afd8bd086cd
192.168.66.82:9000
 General info:
  Keyspace:       yugabyte
  Object name:    ysql_test_table
  On disk sizes:  Total: 212.96M Consensus Metadata: 1.5K WAL Files: 193.06M SST Files: 19.91M SST Files Uncompressed: 189.95M
  State:          RUNNING
 Consensus:
  State:          Consensus queue metrics:Only Majority Done Ops: 0, In Progress Ops: 0, Cache: LogCacheStats(num_ops=0, bytes=0, disk_reads=0)
  Queue overview: Consensus queue metrics:Only Majority Done Ops: 0, In Progress Ops: 0, Cache: LogCacheStats(num_ops=0, bytes=0, disk_reads=0)
  Watermark:
  - { peer: 61c349193a9847c784d8c781b37574bd is_new: 0 last_received: 1.328 next_index: 329 last_known_committed_idx: 328 is_last_exchange_successful: 1 needs_remote_bootstrap: 0 member_type: VOTER num_sst_files: 4 last_applied: 1.328 }
  - { peer: 4097103c4efe49019e7fd0aace42d046 is_new: 0 last_received: 1.328 next_index: 329 last_known_committed_idx: 328 is_last_exchange_successful: 1 needs_remote_bootstrap: 0 member_type: VOTER num_sst_files: 4 last_applied: 1.328 }
  - { peer: c10e5cfd1f3243238ee654667d49a391 is_new: 0 last_received: 1.328 next_index: 329 last_known_committed_idx: 328 is_last_exchange_successful: 1 needs_remote_bootstrap: 0 member_type: VOTER num_sst_files: 4 last_applied: 1.328 }
  Messages:
  - Entry: 0, Opid: 0.0, mesg. type: REPLICATE UNKNOWN_OP, size: 6, status: term: 0 index: 0
 LogAnchor:
  Latest log entry op id: 1.328
  Min retryable request op id: 1.325
  Last committed op id: 1.328
  Max persistent intents op id: 1.327
  Earliest needed log index: 325
 Transactions:
  - { safe_time_for_participant: { physical: 1678909347317049 } remove_queue_size: 0 }
 Rocksdb:
  IntentDB:
  RegularDB:
   total_size: 2088888, uncompressed_size: 19988404, name_id: 14, /mnt/d0/yb-data/tserver/data/rocksdb/table-000033e8000030008000000000004100/tablet-9e75c351ddeb4b8aa4b16afd8bd086cd
   total_size: 6270955, uncompressed_size: 59742289, name_id: 13, /mnt/d0/yb-data/tserver/data/rocksdb/table-000033e8000030008000000000004100/tablet-9e75c351ddeb4b8aa4b16afd8bd086cd
   total_size: 6215593, uncompressed_size: 59741912, name_id: 12, /mnt/d0/yb-data/tserver/data/rocksdb/table-000033e8000030008000000000004100/tablet-9e75c351ddeb4b8aa4b16afd8bd086cd
   total_size: 6296946, uncompressed_size: 59708999, name_id: 10, /mnt/d0/yb-data/tserver/data/rocksdb/table-000033e8000030008000000000004100/tablet-9e75c351ddeb4b8aa4b16afd8bd086cd
192.168.66.81:9000
 General info:
  Keyspace:       yugabyte
  Object name:    ysql_test_table
  On disk sizes:  Total: 212.42M Consensus Metadata: 1.5K WAL Files: 192.52M SST Files: 19.91M SST Files Uncompressed: 189.95M
  State:          RUNNING
 Consensus:
  State:          Consensus queue metrics:Only Majority Done Ops: 0, In Progress Ops: 328, Cache: LogCacheStats(num_ops=0, bytes=0, disk_reads=0)
  Queue overview:
  Watermark:
  Messages:
 LogAnchor:
  Latest log entry op id: 1.328
  Min retryable request op id: 1.325
  Last committed op id: 1.328
  Max persistent intents op id: 1.327
  Earliest needed log index: 325
 Transactions:
  - { safe_time_for_participant: { physical: 1678909347315157 } remove_queue_size: 0 }
 Rocksdb:
  IntentDB:
  RegularDB:
   total_size: 2088888, uncompressed_size: 19988404, name_id: 14, /mnt/d0/yb-data/tserver/data/rocksdb/table-000033e8000030008000000000004100/tablet-9e75c351ddeb4b8aa4b16afd8bd086cd
   total_size: 6270955, uncompressed_size: 59742289, name_id: 13, /mnt/d0/yb-data/tserver/data/rocksdb/table-000033e8000030008000000000004100/tablet-9e75c351ddeb4b8aa4b16afd8bd086cd
   total_size: 6215593, uncompressed_size: 59741912, name_id: 12, /mnt/d0/yb-data/tserver/data/rocksdb/table-000033e8000030008000000000004100/tablet-9e75c351ddeb4b8aa4b16afd8bd086cd
   total_size: 6296946, uncompressed_size: 59708999, name_id: 10, /mnt/d0/yb-data/tserver/data/rocksdb/table-000033e8000030008000000000004100/tablet-9e75c351ddeb4b8aa4b16afd8bd086cd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The information categories for tablet followers and the leader are identical; however, the leader shows more, and slightly different, data because it is the leader.&lt;/p&gt;

&lt;p&gt;It shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;General info (keyspace name, object name, on disk sizes per tablet, state).&lt;/li&gt;
&lt;li&gt;Consensus state: for the followers, the state row is filled out, for the leader, the queue overview as well as the 'watermark'-s are shown, as well as messages about consensus.&lt;/li&gt;
&lt;li&gt;LogAnchor: the current log state.&lt;/li&gt;
&lt;li&gt;Transactions: the current transactional state is shown for all replicas. The leader will show pending transactions if these are active.&lt;/li&gt;
&lt;li&gt;Rocksdb: for both the IntentDB and RegularDB rocksdb instances the datafiles are shown, with their size, uncompressed size and id/number.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  yb_stats Snapshots
&lt;/h3&gt;

&lt;p&gt;If you are looking into an issue or want to store the cluster state for later investigation, a &lt;code&gt;yb_stats&lt;/code&gt; snapshot can be created with the &lt;code&gt;--extra-data&lt;/code&gt; switch set. This will take longer, because it obtains the state of all the tables and their tablets, but it allows you to investigate the state of each of them by specifying the snapshot number after the &lt;code&gt;--print-table-detail&lt;/code&gt; or &lt;code&gt;--print-tablet-detail&lt;/code&gt; switch.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Both the &lt;code&gt;--print-table-detail&lt;/code&gt; and &lt;code&gt;--print-tablet-detail&lt;/code&gt; switches require the additional switch &lt;code&gt;--extra-data&lt;/code&gt; to obtain the required data, which prevents regular use of &lt;code&gt;yb_stats&lt;/code&gt; from fetching individual table and tablet data.&lt;/p&gt;

&lt;p&gt;By using the &lt;code&gt;--extra-data&lt;/code&gt; switch, table and tablet details can be investigated that would otherwise require going into the web UI of the master (for the table data) and into the web UI of the tablet server serving the tablet (for the tablet details).&lt;/p&gt;

</description>
      <category>yugabytedb</category>
      <category>internals</category>
    </item>
  </channel>
</rss>
