DEV Community: CrateDB

Retrieving records in bulk with a list of primary key values in CrateDB

Hernán Lionel Cianfagna — Fri, 23 Feb 2024 09:33:28 +0000

When we send SQL statements to CrateDB they need to be parsed, but in most situations we do not think about this because the resources used for parsing the statements are trivial in relation to what is required to actually execute the queries.

One exception to this is when INSERTing a large number of rows. For this case CrateDB has a very efficient bulk operations interface, which can also be used for UPDATEs and DELETEs.
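As a rough illustration of that interface, the sketch below builds the JSON body for a bulk INSERT sent to CrateDB's HTTP endpoint (`/_sql`), where a single statement with `?` placeholders is paired with a `bulk_args` list holding one parameter list per row. The helper name `build_bulk_request` and the sample rows are made up for this example, the target table is the sensor_data table defined further down, and the code only constructs the payload without sending it.

```python
import json


def build_bulk_request(stmt, rows):
    """Return the JSON body for one bulk request.

    The statement is sent once and executed for every
    parameter list found in bulk_args.
    """
    return json.dumps({"stmt": stmt, "bulk_args": [list(row) for row in rows]})


# Hypothetical sample rows: (ts in epoch milliseconds, machine_id, sensor_type).
rows = [
    (1708502400000, "machine1", 8),
    (1708502400000, "machine2", 5),
]

body = build_bulk_request(
    "INSERT INTO sensor_data (ts, machine_id, sensor_type) VALUES (?, ?, ?)",
    rows,
)
print(body)
```

The point of the design is that the statement text travels, and is parsed, only once regardless of how many rows are in `bulk_args`.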

However, I recently came across an unusual requirement: we had a very large table with a primary key made of multiple fields and, given tens of thousands of values for these fields, we needed to retrieve all the corresponding records.

Let me exemplify the situation with this table definition:

CREATE TABLE sensor_data (
  ts TIMESTAMP
  ,machine_id TEXT
  ,sensor_type SMALLINT
  ,payload OBJECT
  ,PRIMARY KEY (ts,machine_id,sensor_type)
);

Let’s also create some sample data:

INSERT INTO sensor_data (ts,machine_id,sensor_type,payload )
SELECT now()
    ,concat('machine',a.b)
    ,random()*10
    ,{"test"='abc'}
FROM GENERATE_SERIES(1,100000) a(b);

There are different approaches we could use to retrieve multiple rows for the given PK values, such as:

SELECT * FROM sensor_data WHERE ts='2024-02-21 08:00:00.000Z' AND machine_id='machine1' AND sensor_type=8 
UNION 
SELECT * FROM sensor_data WHERE ts='2024-02-21 08:00:00.000Z' AND machine_id='machine2' AND sensor_type=5

or:

SELECT * 
FROM sensor_data 
WHERE (ts='2024-02-21 08:00:00.000Z' AND machine_id='machine1' AND sensor_type=8)
OR (ts='2024-02-21 08:00:00.000Z' AND machine_id='machine2' AND sensor_type=5);

This works reasonably well up to a few hundred records, but let’s see what happens if we try to use this approach to look up tens of thousands of different records in a single statement, as was the requirement in this very particular case.

Let’s dynamically generate a query like the above but for 10,000 records:

WITH thedata
AS (
    SELECT CONCAT (
            'OR (ts=',(ts::BIGINT)::TEXT
            ,' and machine_id=''',machine_id
            ,''' and sensor_type=',sensor_type
            ,')'
            ) AS onewherecondition
    FROM sensor_data 
    LIMIT 10000
    )
SELECT CONCAT (
        'SELECT * FROM sensor_data WHERE '
        ,replace(replace(replace({ "thearray" = array_agg(onewherecondition) }::TEXT, '{"thearray":["OR ', ''), '","', ' '), '"]}', '')
        ,';')
FROM thedata;

This will generate a very long statement, and when we try to run it we may get:

StackOverflowError[null]

io.crate.exceptions.SQLParseException: line 1:1: statement is too large (stack overflow while parsing)
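To get a feel for how quickly such a statement grows, here is a minimal Python sketch that assembles the same kind of OR-based query on the client side. The function name `build_or_query` and the generated keys are hypothetical, and values are inlined as literals only to mirror the statement above; in real code query parameters would be the safer choice.

```python
def build_or_query(keys):
    """Build one SELECT with an OR-ed condition per (ts, machine_id, sensor_type) key."""
    conditions = []
    for ts, machine_id, sensor_type in keys:
        # Double any single quote so the value stays a valid SQL string literal.
        escaped = machine_id.replace("'", "''")
        conditions.append(
            f"(ts={int(ts)} AND machine_id='{escaped}' AND sensor_type={int(sensor_type)})"
        )
    return "SELECT * FROM sensor_data WHERE " + " OR ".join(conditions) + ";"


# Hypothetical keys, only used to measure how large the statement becomes.
keys = [(1708502400000, f"machine{i}", i % 10) for i in range(1, 10001)]
query = build_or_query(keys)
print(len(keys), "conditions,", len(query), "characters")
```

Every additional key adds another condition that the parser has to process, which is why this style of query does not scale to tens of thousands of keys.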

So we will need a different strategy, and we also want this to run as quickly as possible.

Let’s start by preparing a CSV file with 10,000 primary key values we will use for testing:

pip install crash
crash -c "SELECT ts,machine_id,sensor_type FROM sensor_data LIMIT 10000;" --format csv > pkvalues.csv

What we are going to do now is take advantage of a system column called _id which exists on all CrateDB tables. This column contains a unique identifier for each row, and for tables with a PK defined it is a compound string representation of all primary key values of that row. The useful characteristic here is that the value is deterministic: given 2 tables with the same PK definition, rows with the same PK values will have the same _id values.

So to perform this “bulk SELECT” we are going to use a staging table defined with the same PK as the main table. The Python code below bulk loads the values from the CSV file to the staging table and then uses the _id values to locate all the rows we are interested in:

pip install pandas "crate[sqlalchemy]" --upgrade

import pandas as pd
import sqlalchemy as sa
from crate.client.sqlalchemy.support import insert_bulk

df = pd.read_csv("pkvalues.csv")

engine = sa.create_engine(
    "crate://localhost:4200",
    connect_args={"verify_ssl_cert": False},
)
connection = engine.connect()

connection.execute(sa.text("DROP TABLE IF EXISTS relevant_pk_values;"))
connection.execute(
    sa.text(
        """
        CREATE TABLE relevant_pk_values (
            ts TIMESTAMP
            ,machine_id TEXT
            ,sensor_type SMALLINT
            ,PRIMARY KEY (ts,machine_id,sensor_type)
        ) CLUSTERED INTO 1 SHARDS;
        """
    )
)

df.to_sql(
    name="relevant_pk_values",
    con=engine,
    if_exists="append",
    index=False,
    chunksize=5_000,
    method=insert_bulk,
)
connection.execute(sa.text("REFRESH TABLE relevant_pk_values;"))

resultset = connection.execute(
    sa.text(
        """
        SELECT *
        FROM sensor_data
        WHERE _id IN (SELECT _id FROM relevant_pk_values);
        """
    )
)

I hope you found this interesting. If you have any questions please do not hesitate to reach out to us through the CrateDB Community.

Using common table expressions to speed up queries

Hernán Lionel Cianfagna — Thu, 22 Feb 2024 09:44:56 +0000

Today I want to share with you a pattern you can use to replace JOINs with CTEs in your SQL queries and achieve consistent and faster execution times.

Consider a database where we store information about invoices. A simplified model could consist of a table where we store details about the customer and payment terms, a separate table where we have the items included in the invoice, and a third table where we store product information:

CREATE TABLE invoices (
  invoice_number  BIGINT PRIMARY KEY
  ,customer_id  BIGINT
  ,payment_terms  TEXT  DEFAULT '30 days from issue date'
  ,issue_date TIMESTAMP
);

CREATE TABLE invoice_items (
  invoice_number  BIGINT
  ,product_id  BIGINT
  ,quantity  REAL
  ,unit_price REAL
  ,PRIMARY KEY (invoice_number,product_id)
);

CREATE TABLE products (
  product_id  BIGINT PRIMARY KEY
  ,product_description  TEXT
  ,applicable_tax_percentage  REAL
);

Let’s now imagine we want to know how many units of “super cool product” were sold in January 2024. We could write a SQL query with JOINs like this:

SELECT SUM(quantity)
FROM invoices
JOIN invoice_items USING (invoice_number)
JOIN products USING (product_id)
WHERE product_description='super cool product' 
AND invoices.issue_date BETWEEN '2024-01-01' AND '2024-02-01';

This is perfectly valid SQL, but it leaves the database engine with a lot of options.

Even without considering the complexities of a distributed system, parallel processing, and disk/memory options there are still many different possible strategies here, for instance:

“super cool product” may only be sold very rarely, so we could start by looking up its product_id, then all the instances where it has been sold, and then check if the corresponding invoices were in January 2024
or perhaps the product is sold often and we have data for 20 years of sales, so we could start by looking up all the invoices from January, then their line items, and see which ones are for this product
or maybe the company only sells a handful of products and this one is a best seller, we may also only keep invoices for the last 45 days, meaning that neither the date of the invoice nor the product are very selective, in which case it may be faster to consider the full list of invoice_items

With up-to-date statistics database engines like CrateDB can usually do a good job at identifying an optimal execution plan, but there is the risk some day statistics may not be available on your target environment, or even with statistics available other factors may induce the query engine to go down the wrong path.

The impact of using a suboptimal execution plan here could be huge: we could find ourselves trying to JOIN millions and millions of records.

Let’s do a small test creating sample data for the first scenario above, the one where “super cool product” is only sold very rarely:

/* one million invoices in December */
INSERT INTO invoices (invoice_number,customer_id,issue_date)
SELECT a.b,1,'2023-12-01'
FROM GENERATE_SERIES(1,1000000) a(b);

/* one million invoices in January */
INSERT INTO invoices (invoice_number,customer_id,issue_date)
SELECT a.b,1,'2024-01-01'
FROM GENERATE_SERIES(1000001,2000000) a(b);

/* 2 products */
INSERT INTO products (product_id,product_description)
VALUES (1,'super cool product'),(2,'another product');

/* one line item per invoice and only 1 instance in 2 million where product 1 was sold */
REFRESH TABLE invoices;
INSERT INTO invoice_items (invoice_number,product_id,quantity)
SELECT invoice_number
,CASE WHEN invoice_number=2000000 THEN 1 ELSE 2 END AS product_id
,ceiling(random()*10)
FROM invoices;

We can now run the query with the JOINs a few times; in my small test environment it settles at around 750 milliseconds.

We can also look at the execution plan and all its details using the EXPLAIN command.

Let’s now try this approach where we use CTEs to guide the engine to execute the query using steps we know are more optimal for the profile of our data:

WITH relevant_product_ids AS (
    SELECT product_id
    FROM products
    WHERE product_description='super cool product'
    )
    ,relevant_invoice_lines AS (
    SELECT invoice_number,quantity
    FROM invoice_items
    WHERE invoice_items.product_id IN (SELECT relevant_product_ids.product_id FROM relevant_product_ids)
    )
    ,relevant_invoices AS (
    SELECT invoice_number,issue_date
    FROM invoices
    WHERE invoices.invoice_number IN (SELECT relevant_invoice_lines.invoice_number FROM relevant_invoice_lines)
    )
SELECT SUM(quantity)
FROM relevant_invoices
JOIN relevant_invoice_lines USING (invoice_number)
WHERE relevant_invoices.issue_date BETWEEN '2024-01-01' AND '2024-02-01';

We now see this runs consistently in single-digit milliseconds, roughly a 100x improvement.
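To make the order of operations explicit, here is a minimal sketch in plain Python with a handful of made-up rows. It only mirrors the logical steps of the two queries and says nothing about how CrateDB actually plans or executes them; all names and values are invented for the illustration.

```python
# Tiny in-memory stand-ins for the three tables (made-up rows).
products = [(1, "super cool product"), (2, "another product")]
# (invoice_number, product_id, quantity)
invoice_items = [(10, 2, 3.0), (11, 2, 1.0), (12, 1, 4.0), (13, 1, 2.0)]
# invoice_number -> issue_date (ISO strings compare correctly as text)
invoices = {10: "2023-12-01", 11: "2024-01-01", 12: "2024-01-01", 13: "2023-12-01"}


def units_sold_join_everything(description, start, end):
    """Look up invoice and product for every line item, then filter."""
    names = dict(products)
    total = 0.0
    for invoice_number, product_id, quantity in invoice_items:
        if names[product_id] == description and start <= invoices[invoice_number] <= end:
            total += quantity
    return total


def units_sold_filter_first(description, start, end):
    """Follow the CTE steps: product ids, then their line items, then those invoices."""
    relevant_product_ids = {pid for pid, name in products if name == description}
    relevant_lines = [
        (inv, qty) for inv, pid, qty in invoice_items if pid in relevant_product_ids
    ]
    relevant_invoices = {inv: invoices[inv] for inv, _ in relevant_lines}
    return sum(
        qty for inv, qty in relevant_lines if start <= relevant_invoices[inv] <= end
    )
```

Both functions return the same total; the second one simply narrows the data down at each step, so the final date check only touches invoices that contain the product.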

In large and busy environments this kind of optimization may make a big difference, so it may be something to add to your toolbox.

I hope you found this interesting, and as usual if you have any questions do not hesitate to reach out to us in the CrateDB Community.

Connecting with SSL to CrateDB using the PostgreSQL protocol from Java-based applications

Hernán Lionel Cianfagna — Wed, 13 Sep 2023 12:39:48 +0000

If you are using CrateDB Cloud, or if you have configured a server certificate for an on-premises deployment, and you try to enforce SSL on a PostgreSQL connection to CrateDB you may come across an error message like this:

Could not open SSL root certificate file C:\Users\Hernan\AppData\Roaming\postgresql\root.crt.
  C:\Users\Hernan\AppData\Roaming\postgresql\root.crt (The system cannot find the path specified)

org.postgresql.util.PSQLException: Could not open SSL root certificate file C:\Users\Hernan\AppData\Roaming\postgresql\root.crt

This is not specific to CrateDB, and you would get the same message trying to connect to an actual PostgreSQL instance, but I found no simple explanation of this error message and the options available, so here are my two cents.

What happens here is that the client is trying to confirm that the server we are establishing an encrypted connection with is indeed the machine we intended to reach. Doing this involves validating that the certificate used by the server has been issued by a trusted certification authority.
In this case, the client driver is looking for the certificates of trusted certification authorities in a root certificate file at the location indicated in the error message.

I find that in most cases it makes sense to pick one of the two options below to address this.

If we want the communication channel with the server to be encrypted, but we are on a trusted network environment and do not require verification of the server certificate, we can use this in our connection string:

ssl=true&sslmode=require

But if we want to have both encryption and the confirmation that we are talking to the intended server, we can tell the driver to use the list of certification authorities our JVM accepts:

ssl=true&sslmode=verify-full&sslfactory=org.postgresql.ssl.DefaultJavaSSLFactory
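As a small illustration of where these parameters end up, the sketch below assembles a complete JDBC URL for both options. The `jdbc:postgresql://host:port/database?...` layout is the usual PostgreSQL JDBC URL format; the host name, the default database name, and the helper name `jdbc_url` are assumptions made for this example.

```python
from urllib.parse import urlencode


def jdbc_url(host, port=5432, database="crate", verify_server=True):
    """Build a PostgreSQL JDBC URL with one of the two SSL setups described above."""
    params = {"ssl": "true"}
    if verify_server:
        # Encryption plus verification against the CAs trusted by the JVM.
        params["sslmode"] = "verify-full"
        params["sslfactory"] = "org.postgresql.ssl.DefaultJavaSSLFactory"
    else:
        # Encryption only, no verification of the server certificate.
        params["sslmode"] = "require"
    return f"jdbc:postgresql://{host}:{port}/{database}?{urlencode(params)}"


# "example.host" is a placeholder, not a real endpoint.
print(jdbc_url("example.host"))
print(jdbc_url("example.host", verify_server=False))
```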

Some software (DBeaver, for instance) has separate configuration settings where you can set the SSL factory and SSL mode.

I hope this helps. As usual please do not hesitate to let us know your thoughts in the CrateDB Community.

Using dbt with CrateDB

Hernán Lionel Cianfagna — Fri, 18 Aug 2023 16:32:06 +0000

dbt is a tool for transforming data in data warehouses using Python and SQL. The idea is that data engineers make source data available to an environment where dbt projects run (for instance with Debezium or with Airflow), and data analysts can then run their dbt projects against this data to produce models (tables and views) that can be used with BI tools.
This layer allows the decoupling of the models on which reports and dashboards rely from the source data, and if our business rules or our source systems change we can still maintain the same models as a stable interface.

Some of the things that dbt can do include:

import reference data from CSV files
track changes in source data with different strategies so that downstream models do not need to be built every time from scratch
run tests on data, to confirm assumptions remain valid, and to validate any changes made to the models' logic

Due to its unique capabilities, CrateDB is an excellent warehouse choice for data transformation projects. It offers automatic indexing, fast aggregations, easy partitioning, and the ability to scale horizontally. In this article, I will illustrate how to get the most important functionalities of dbt working by doing the necessary changes in the configuration.

Our starting point will be a fresh install of dbt-postgres:

pip install dbt-postgres==1.6.0

We can then create a profiles file with our connection details:

cd ~
mkdir .dbt
cat << EOF > .dbt/profiles.yml
example_datawarehouse_profile:
  target: dev
  outputs:
    dev:
      type: postgres
      host: localhost
      port: 5432
      database: crate
      schema: doc
      search_path: doc
      user: dbt   
      password: pwd1234567A
EOF

(please note the values for database, schema, and search_path in this example)

We will not go into the details of how the project files are structured (for more information check out dbt’s documentation), but in general, a dbt project consists of a combination of SQL, Jinja, YAML, and markdown files. In our project folder, alongside the models folder that most projects have, we can also create a folder called macros where we can place macro overrides.
Let's then create a macros folder and place some files with overrides in it:

mkdir macros
cd macros
wget https://community.crate.io/uploads/short-url/fKupQCFUHtuoKom3jAfKrldUXkt.sql
wget https://community.crate.io/uploads/short-url/qvQExEq1OopiVUcXACLGfpdGHYF.sql
wget https://community.crate.io/uploads/short-url/3jcFxL1EExLrERJSTc6ScnzTS9f.sql
cd ..

A few things I have tested with these overrides:

models with view, table, and ephemeral materializations
dbt source freshness
dbt test
dbt seed
Incremental materializations (with incremental_strategy='delete+insert' and without involving OBJECT columns)

I hope you find this useful. CrateDB is continuously adding new features and I will endeavor to come back and update this article if there are any developments and some of these overrides require changes or become obsolete.

Using regex comparisons and other advanced database features for real-time inspection of web server logs

Hernán Lionel Cianfagna — Mon, 14 Aug 2023 12:23:16 +0000

In Storing server logs on CrateDB for fast search and aggregations we saw how we can get server logs sent to CrateDB in real-time, and for demo purposes we set up an instance of MediaWiki.
It was just an example, but it could have been any web server application.
Let's now imagine that we suspect people are trying to perform SQL injection attacks against our website, so we need to keep an eye on the logs.
We have already seen how we can use fulltext search to look for specific error messages, but would it not be great if we could have some rules inspecting the log entries as they come in, extracting relevant information, and flagging anything potentially suspicious?
There are a lot of nice features in CrateDB to support this kind of setup, let me show you an example.
The log entries our web server container is producing look like this:

 192.168.0.121 - - [11/Aug/2023:12:59:42 +0000] "GET /favicon.ico HTTP/1.1" 200 852 "http://192.168.0.202/mw-config/index.php?page=Welcome" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36 Edg/115.0.1901.200"

We see we have the client IP address and the HTTP request that was sent.
Today I want to do 2 checks:

I want to query only for requests coming from a specific IP subnet,
and I want to see if the HTTP request has anything suspicious that could suggest an attempt to perform a SQL injection attack

I will use generated columns to extract this information from the log entries as they arrive.
CrateDB supports adding columns live to existing tables, but generated columns are special and can only be added to empty tables. To add our new columns to our systemevents table without any downtime, we will use another feature of CrateDB: table swapping.

Let's create a new table with the 2 special columns:

CREATE TABLE doc.systemevents2 (
    message TEXT
    ,INDEX message_ft USING FULLTEXT(message)
    ,facility INTEGER
    ,fromhost TEXT
    ,priority INTEGER
    ,DeviceReportedTime TIMESTAMP
    ,ReceivedAt TIMESTAMP
    ,InfoUnitID INTEGER
    ,SysLogTag TEXT 
    ,clientip IP GENERATED ALWAYS AS TRY_CAST(btrim(split_part(message,'-',1)) AS IP)
    ,suspectedSQLinjection BOOLEAN GENERATED ALWAYS AS message ~* 
                CONCAT('.*SELECT.*FROM.*' , 
                       '|.*UNION.*SELECT.*',
                       '|.*DELETE.*FROM.*',
                       '|.*UPDATE.*SET.*',
                       '|.*ALTER.*TABLE.*',
                       '|.*(%27|'')%20.*%20(%27|'').*')
    );

Here we are extracting the client IP address from the message text and storing it using the dedicated IP data type in CrateDB. We use the split_part function to look up the string up to the dash symbol, then we use the btrim function to remove spaces from both sides of the string, and finally we use the TRY_CAST function so that log entries that do not have an IP address in this position get a NULL value as clientip but no error message is raised.
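For readers who want to try the same logic outside the database, here is a minimal Python sketch of the extraction steps using only the standard library. The function name `extract_client_ip` is made up for this example, and the sketch is only an approximation of what the generated column does, not CrateDB's implementation.

```python
import ipaddress


def extract_client_ip(message):
    """Mimic TRY_CAST(btrim(split_part(message, '-', 1)) AS IP)."""
    # Text before the first dash, with surrounding spaces removed.
    candidate = message.split("-", 1)[0].strip(" ")
    try:
        return ipaddress.ip_address(candidate)
    except ValueError:
        # Like TRY_CAST: no error, just "no value".
        return None
```

A line from the web server yields an address object, while an unrelated syslog entry such as a kernel message simply yields `None`.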

We are also using the case-insensitive ~* regex comparison operator to look for indications of a possible SQL injection attempt. We are looking for occurrences of SELECT ... FROM, UNION ... SELECT, DELETE ... FROM, UPDATE ... SET, ALTER ... TABLE, or attempts to break a string delimiter by injecting a single quote character. This will match entries like:

 172.17.0.1 - - [11/Aug/2023:13:03:07 +0000] "GET /mw-config/index.php?css=1%27%20WAITFOR%20DELAY%20%270%3A0%3A5%27%20AND%20%27Lshb%27%3D%27Lshb HTTP/1.1" 200 4627 "-" "sqlmap/1.7.8#pip (https://sqlmap.org)"
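The same rules can be tried out quickly in Python before they go into a table definition. The sketch below is an approximation using the standard `re` module rather than CrateDB's regex engine; the name `looks_like_sql_injection` is hypothetical, and the surrounding `.*` are dropped because `search` already looks for the pattern anywhere in the string.

```python
import re

# One alternative per rule described above; '' in the SQL literal is a single quote.
SUSPICIOUS = re.compile(
    r"SELECT.*FROM"
    r"|UNION.*SELECT"
    r"|DELETE.*FROM"
    r"|UPDATE.*SET"
    r"|ALTER.*TABLE"
    r"|(%27|')%20.*%20(%27|')",
    re.IGNORECASE,
)


def looks_like_sql_injection(message):
    """Return True if any of the rules matches somewhere in the log message."""
    return SUSPICIOUS.search(message) is not None
```

Rules this broad will also flag harmless requests that happen to contain these keywords, so they are better suited to highlighting entries for review than to blocking traffic.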

Let's now swap in this new table and rename the old one as systemevents_archive:

ALTER CLUSTER SWAP TABLE doc.systemevents2 TO doc.systemevents;
GRANT DML ON TABLE doc.systemevents TO rsyslog;
REVOKE DML ON TABLE doc.systemevents2 FROM rsyslog;
ALTER TABLE doc.systemevents2 RENAME TO systemevents_archive;

To show how this works we will need normal activity, which we can generate just by navigating to http://localhost/ (perhaps from another machine to get a different IP address in the logs), and also some malicious-looking activity. To generate the latter against our website we can use sqlmap, a well-known pentesting tool:

sudo pip install sqlmap
sqlmap -u http://localhost/ --crawl=2

Press ENTER when prompted to proceed with default settings. There will be around 10 questions before the tool concludes there are no obvious vulnerabilities in the website.

We can now run queries like:

SELECT * 
FROM systemevents 
WHERE clientip << '172.17.0.0/16' 
ORDER BY devicereportedtime DESC
LIMIT 100;

to get the most recent activity from IP addresses in the 172.17.0.0/16 subnet, and we can also look for suspicious activity only:

SELECT * 
FROM systemevents 
WHERE suspectedSQLinjection
ORDER BY devicereportedtime DESC
LIMIT 100;

Queries like these could then be integrated into a Grafana dashboard or some alerting system.

I hope you found this interesting. Please do not hesitate to let us know your thoughts in the CrateDB Community.

Storing server logs on CrateDB for fast search and aggregations

Hernán Lionel Cianfagna — Thu, 10 Aug 2023 08:27:19 +0000

Did you know that CrateDB can be a great store for your server logs?

If you have been using log aggregation tools or even some of the most advanced commercial SIEM systems, you have probably experienced the same frustrations I have:

timeouts when searching logs over long periods of time
a complex and proprietary query syntax
difficulties integrating queries on logs data into application monitoring dashboards

Storing server logs on CrateDB solves these problems: it lets us query the logs with standard SQL from any tool supporting the PostgreSQL protocol, and its unique indexing also makes full-text queries and aggregations super fast.
Let me show you an example.

First, we will need an instance of CrateDB. It may be best to have a dedicated cluster for this purpose, to separate the monitoring system from the systems being monitored, but for the purpose of this demo we can just have a single-node cluster in a Docker container:

sudo docker run -d --name cratedb --publish 4200:4200 --publish 5432:5432 crate -Cdiscovery.type=single-node

Next, we need a table to store the logs. Let's connect to http://localhost:4200/#!/console and run:

CREATE TABLE doc.systemevents (
    message TEXT
    ,INDEX message_ft USING FULLTEXT(message)
    ,facility INTEGER
    ,fromhost TEXT
    ,priority INTEGER
    ,DeviceReportedTime TIMESTAMP
    ,ReceivedAt TIMESTAMP
    ,InfoUnitID INTEGER
    ,SysLogTag TEXT 
    );

Tip: if you are on a headless system you can also run queries with command-line tools.

Then we need an account for the logging system:

CREATE USER rsyslog WITH (PASSWORD='pwd123');

and we need to grant permissions on the table above:

GRANT DML ON TABLE doc.systemevents TO rsyslog;

We will use rsyslog to send the logs to CrateDB. For this setup we need rsyslog v8.2202 or higher and the ompgsql module:

sudo add-apt-repository ppa:adiscon/v8-stable
sudo apt-get update
sudo apt-get install rsyslog
sudo debconf-set-selections <<< 'rsyslog-pgsql rsyslog-pgsql/dbconfig-install string false'
sudo apt-get install rsyslog-pgsql

Let's now configure it to use the account we created earlier:

echo 'module(load="ompgsql")' | sudo tee /etc/rsyslog.d/pgsql.conf
echo '*.* action(type="ompgsql" conninfo="postgresql://rsyslog:pwd123@localhost/doc")' | sudo tee -a /etc/rsyslog.d/pgsql.conf
sudo systemctl restart rsyslog
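One practical detail with the conninfo URI: if the password contains characters such as @ or /, they need to be percent-encoded for the URI to be parsed correctly. A minimal Python sketch for building the string follows; the function name `conninfo_uri` and the second sample password are made up for this example.

```python
from urllib.parse import quote


def conninfo_uri(user, password, host, database):
    """Build a postgresql:// URI, percent-encoding the user and password."""
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}/{database}"
    )


# The values from the configuration above pass through unchanged.
print(conninfo_uri("rsyslog", "pwd123", "localhost", "doc"))
# A hypothetical password with special characters gets encoded.
print(conninfo_uri("rsyslog", "p@ss/word", "localhost", "doc"))
```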

If you are interested in more advanced setups involving queuing for additional reliability in production scenarios, you can read more about available settings in the rsyslog documentation.

Now let's imagine that we want to run a container with MediaWiki to host an intranet and we want all logs to go to CrateDB. We can just deploy this with:

sudo docker run --name mediawiki -p 80:80 -d --log-driver syslog --log-opt syslog-address=unixgram:///dev/log mediawiki

If we now point a web browser to http://localhost/ (port 80) we will see a new MediaWiki page.
Let's play around a bit to generate log entries, just click on "set up the wiki" and then once on Continue.
This will have generated entries in the doc.systemevents table with syslogtag matching the container id of the container running the site.

We can now use the MATCH predicate to find the error messages we are interested in:

SELECT devicereportedtime,message
FROM doc.systemevents
WHERE MATCH(message_ft, 'Could not reliably determine') USING PHRASE
ORDER BY 1 DESC;

+--------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| devicereportedtime | message                                                                                                                                                                     |
+--------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|      1691510710000 | AH00558: apache2: Could not reliably determine the server's fully qualified domain name, using 172.17.0.3. Set the 'ServerName' directive globally to suppress this message |
|      1691510710000 | AH00558: apache2: Could not reliably determine the server's fully qualified domain name, using 172.17.0.3. Set the 'ServerName' directive globally to suppress this message |
+--------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Let's now see which log sources created the most entries:

SELECT syslogtag,count(*)
FROM doc.systemevents
GROUP BY 1
ORDER BY 2 DESC
LIMIT 5;

+----------------------+----------+
| syslogtag            | count(*) |
+----------------------+----------+
| kernel:              |       23 |
| 083053ae8ea3[52134]: |       20 |
| systemd[1]:          |       15 |
| sudo:                |       10 |
| rsyslogd:            |        5 |
+----------------------+----------+

I hope you found this interesting. Please do not hesitate to let us know your thoughts in the CrateDB Community.

How to add new nodes to on-premises CrateDB clusters

Hernán Lionel Cianfagna — Tue, 18 Jul 2023 13:12:26 +0000

A significant feature in CrateDB is that it can scale horizontally, which means that instead of adding more RAM, CPU, and disk resources to our existing nodes we can add more nodes to our CrateDB cluster.
This allows the handling of volumes of data that simply could not fit on a single node, but it is also very useful in scenarios where hosting everything on a single node, or a small number of nodes, would still be possible, because smaller nodes are often easier to manage infrastructure-wise.
More nodes also mean more resiliency to issues. In a scenario where we have, for instance, 5 nodes and configure our tables with 2 replicas, we could lose 2 nodes and still serve our production workloads. This means we can carry out maintenance on the nodes one at a time and still be able to withstand an unplanned issue on another node without downtime.
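The arithmetic behind that claim can be checked with a short brute-force sketch in Python. It only looks at whether at least one copy of a shard survives, assuming the primary and each replica are placed on different nodes; it ignores other aspects such as master election or remaining capacity, and the function name is made up for this example.

```python
from itertools import combinations


def survives_any_failure(nodes, replicas, failures):
    """True if a shard keeps at least one copy whichever `failures` nodes are lost.

    Assumes the primary and each replica sit on different nodes.
    """
    copies = replicas + 1  # the primary plus its replicas
    if copies > nodes:
        raise ValueError("not enough nodes to place every copy on its own node")
    node_ids = range(nodes)
    for placement in combinations(node_ids, copies):
        for lost in combinations(node_ids, failures):
            if set(placement) <= set(lost):
                # Every copy of this shard was on a lost node.
                return False
    return True


print(survives_any_failure(nodes=5, replicas=2, failures=2))
print(survives_any_failure(nodes=5, replicas=2, failures=3))
```

With 2 replicas there are 3 copies of every shard, so any 2 nodes can fail without losing all copies, while a third simultaneous failure could take out the last one.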

Today we want to review how to add a new node to an existing on-premises cluster.

Discovery

When a CrateDB node starts it needs a mechanism to get a list of the nodes that make up the cluster; this is called discovery.
At the time of writing, there are 3 ways for a node to get this list:

The list of nodes can be defined in the discovery.seed_hosts setting in the configuration file (typically in /etc/crate/crate.yml)
The list can be retrieved with a DNS query, see https://crate.io/docs/crate/reference/en/5.4/config/cluster.html#discovery-via-dns
In AWS environments, the list of nodes can be looked up via the EC2 API, filtering on specific security groups, availability zones, and tags, see https://crate.io/docs/crate/reference/en/5.4/config/cluster.html#discovery-on-amazon-ec2

For the purpose of this post, we will work with the discovery.seed_hosts list.

Scaling from a single-node deployment

If a node is started without specifying the cluster.initial_master_nodes setting (the default configuration), or with discovery.type set to single-node, it will be started as a standalone instance and it cannot later be scaled into a cluster. Single-node deployments are great for development and testing, but for production setups we recommend using a cluster with at least 3 nodes.

If you are going for a single-node deployment initially, but plan to scale to a multi-node cluster in the future, there are some settings to configure before the very first run of the CrateDB node so that it bootstraps as a 1-node cluster instead of a standalone instance.
The settings that we need are:

discovery.seed_hosts set to the hostname or the node.name of the node
cluster.initial_master_nodes set to the hostname or the node.name of the node
optionally we can set a cluster.name

If you are using containers, you would pass these settings with lines in the args section of your YAML file. Otherwise, you could create /etc/crate/crate.yml before deploying the package for your distribution (refer to https://github.com/crate/crate/blob/master/app/src/main/dist/config/crate.yml for the template). Alternatively, you could prevent the package installation from auto-starting the daemon by using a mechanism such as policy-rcd-declarative, then edit the configuration file (crate.yml), and start the crate daemon once all settings are ready.

Networking considerations

Nodes need to be able to resolve each other's hostnames at DNS level, and they need to be able to reach each other on a TCP port which is 4300 by default.
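Before starting a new node, it can be worth confirming both conditions from each machine. The sketch below is a minimal standard-library check; the function name `can_reach` is made up for this example, and a successful TCP connection only shows that the port is open, not that a healthy CrateDB node is listening behind it.

```python
import socket


def can_reach(host, port=4300, timeout=2.0):
    """Return True if `host` resolves and accepts a TCP connection on `port`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS resolution failures, refused connections, and timeouts.
        return False
```

Calling `can_reach("node2.example.internal")` (a placeholder host name) from every other node gives a quick view of which pairs of nodes cannot talk to each other.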

For security reasons you should configure your network so that CrateDB cluster nodes are only reachable on port 4300 from other CrateDB nodes in the cluster.
In a Kubernetes environment this can be achieved with a Service resource with a ClusterIP.
In a non-containerized environment one way to do this is to use firewall software directly on each node, for instance:

#Enable ufw - all incoming connections blocked by default
sudo ufw enable

#Allow SSH if you are using it to manage your server
sudo ufw allow 22

#Allow 4200 for clients to connect to CrateDB via the http endpoint
sudo ufw allow 4200

#Allow 5432 if you have PostgreSQL clients
sudo ufw allow 5432

#Allow 4300 from 192.168.0.202 (another cluster node in this example)
sudo ufw allow proto tcp from 192.168.0.202 to any port 4300

You may also want to consider network access control and/or a separate network adapter for intra-cluster communications.

Deploying the new node

Make sure the new node does not auto-bootstrap as a single-node instance. You may want to either create /etc/crate/crate.yml in advance or use a mechanism such as policy-rcd-declarative, as mentioned earlier.
On the configuration file for the new node:

Set discovery.seed_hosts to the full list of nodes, including the new one you are adding.
Optionally set a node.name; if you do not, the node gets assigned a random name from the sys.summits table. You may wonder what those default names are about: they are the names of mountains in the area around our main office, as we love mountains at Crate.io.
Set cluster.name to a value that matches the other nodes in the cluster, if not specified the default cluster name is crate.
Consider if you want to set the cluster-wide settings gateway.expected_data_nodes, gateway.recover_after_data_nodes, and/or gateway.recover_after_time to prevent the unnecessary creation of new replicas and the rebalancing of shards when a node takes a little bit longer to start, or in case of transient issues, when the cluster is starting up from a situation where all nodes are shut down. Please note these settings are used when the cluster is starting up from being offline; if you want to delay the allocation of replicas when a node becomes unavailable on a cluster that stays online, there is a different setting at table level.
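To keep the files consistent across machines, the main settings from the list above can be generated rather than typed by hand. The Python sketch below renders a minimal crate.yml fragment; the node names, the helper name `render_crate_yml`, and the choice to leave out the optional gateway settings are assumptions made for this example.

```python
def render_crate_yml(cluster_name, node_name, seed_hosts):
    """Render the cluster, node, and discovery settings as crate.yml text."""
    lines = [
        f"cluster.name: {cluster_name}",
        f"node.name: {node_name}",
        "discovery.seed_hosts:",
    ]
    # One list entry per node, including the node being added.
    lines += [f"  - {host}" for host in seed_hosts]
    return "\n".join(lines) + "\n"


# Hypothetical host names for a cluster growing from 3 to 4 nodes.
print(render_crate_yml("crate", "node4", ["node1", "node2", "node3", "node4"]))
```

Generating the same seed host list for every node also helps with the later step of aligning the settings on the existing nodes.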

Now we can start the crate daemon. You will see the node joining the cluster, and CrateDB will start using it for shard allocation.

Remember to add the new nodes alongside the old ones in any monitoring system and load balancer configuration you may have in your environment.

Updating settings on the old nodes

Now we need to align a number of settings on the other nodes. These are typically in the /etc/crate/crate.yml file:

Update discovery.seed_hosts, adding the new node
If you have configured gateway.* settings, update them to have the same values on all nodes

These settings only play a role during restart, not at runtime, so you do not need to restart the nodes after making these changes. If the gateway.* settings need updating, you may see a warning in the Admin UI, which can be acknowledged.

Please also note there is no need to update the initial_master_nodes list, this is only considered during the initial cluster bootstrapping.

And that is it, we have scaled out our cluster and we are ready to work with larger volumes of data. I hope you find this useful and, as usual, please do not hesitate to raise any thoughts or questions in the CrateDB Community.

From data storage to data analysis: Tutorial on CrateDB and pandas 2

marijaselakovic — Wed, 03 May 2023 06:55:15 +0000

Introduction

Pandas is an open-source data manipulation and analysis library for Python. It is widely used for handling and analyzing data in a variety of fields, including finance, research, etc.

One of the key benefits of pandas is its ability to handle and manipulate large datasets, making it a valuable tool for data scientists and analysts. The library provides easy-to-use data structures and functions for data cleaning, transformation, and analysis, making it an essential part of the data analysis workflow.

Using CrateDB and pandas together can be a powerful combination for handling large volumes of data and performing complex data analysis tasks. In this tutorial, we will use a real-world dataset to show how CrateDB and pandas work together for effective data analysis.

Requirements

To follow along with this tutorial, you will need:

A running instance of CrateDB 5.2.
Python 3.x with the pandas 2 and crate 0.31 packages installed.
A real-world dataset in CSV format. In this tutorial, we will be using the shop customer data available on Kaggle.

Setting up CrateDB

Before we can start using CrateDB, we need to set it up. You can either download and install CrateDB locally via Docker or tarball or use a CrateDB Cloud instance with an option of the free cluster.

Once you have a running instance of CrateDB, create a new table to store the customer data dataset. Here is an SQL command to create a table:


CREATE TABLE IF NOT EXISTS "doc"."customer_data" (
   "customerid" INTEGER,
   "gender" TEXT,
   "age" INTEGER,
   "annualincome" INTEGER,
   "spendingscore" INTEGER,
   "profession" TEXT,
   "workexperience" INTEGER,
   "familysize" INTEGER
)

After creating the table, you can import the customer data dataset into CrateDB using the COPY FROM command:



COPY "doc"."customer_data" FROM 'file:///path/to/Customers.csv' 
WITH (format='csv', delimiter=',')

Once you have CrateDB running, you can start exploring data with pandas.

Querying data with CrateDB and pandas

The first step is to import the pandas library and specify the query you want to execute on CrateDB. In our example, we want to fetch all customer data.

To read data from CrateDB and work with it in a pandas DataFrame, use the read_sql method as illustrated below.



import pandas as pd

query = "SELECT * FROM customer_data"
df = pd.read_sql(query, 'crate://localhost:4200')

In the above code, we establish a connection to a local CrateDB instance running on localhost on port 4200, execute a SQL query, and return the results as a pandas DataFrame. You can further modify the query to retrieve only the columns you need or to filter the data based on some condition.

Analyze the data

Now that data are loaded into the pandas DataFrame, we can perform various analyses and manipulations on it. For instance, we can group the data by a certain column and calculate the average value of another column:



avg_income = df.groupby("profession")["annualincome"].mean()

In this example, we group the data in the DataFrame by the profession column and calculate the average annual income for each profession. You can plot the average incomes with the plot() method, specifying the type of plot (a bar chart):



import matplotlib.pyplot as plt

avg_income.plot(kind='bar', legend=True, rot=0)
plt.show()

We also use plt.show() from matplotlib to display the plot.
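To make the grouping step concrete, here is a plain-Python sketch of what groupby("profession")["annualincome"].mean() computes. It uses a few made-up rows in place of the Kaggle dataset and deliberately avoids pandas, so it only illustrates the aggregation and does not replace the DataFrame code.

```python
from collections import defaultdict
from statistics import mean

# Made-up sample rows standing in for the customer_data table.
rows = [
    {"profession": "Engineer", "annualincome": 60000},
    {"profession": "Engineer", "annualincome": 80000},
    {"profession": "Artist", "annualincome": 30000},
]

# Collect the incomes per profession, then average each group.
incomes = defaultdict(list)
for row in rows:
    incomes[row["profession"]].append(row["annualincome"])

avg_income = {profession: mean(values) for profession, values in incomes.items()}
print(avg_income)
```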

Wrap up

That's it! You should now have a good idea of how to use CrateDB and pandas together to analyze large datasets stored in CrateDB. This allows you to take advantage of the powerful data manipulation capabilities of pandas to analyze and visualize your data.
To learn more about updates, features, and other questions you might have, join our CrateDB community.

Guide to bitwise operators in CrateDB

marijaselakovic — Thu, 02 Mar 2023 09:30:44 +0000

Bitwise operators are useful because they allow you to perform efficient and concise operations on individual bits within integer values, which can be very useful in a variety of SQL queries.

CrateDB continues to provide many valuable features. In the 5.2 version, we added support for bitwise operators. Now, you may wonder when this feature is handy. There are at least a couple of scenarios:

If you want to store multiple pieces of information in a single column. For example, you can use a bitwise OR operator to combine multiple flags into a single value.

Simplifying conditional statements. Bitwise operators can be used to check the state of individual bits within a value. For example, you can use a bitwise AND operator to check if a particular bit is set or not.

Easier data manipulation. Bitwise operators can be used to set or manipulate specific bits within a value. For example, you can use a bitwise OR operator to set a particular bit to 1, or a bitwise XOR to flip a bit (1 to 0 or 0 to 1).

CrateDB supports three bitwise operators:

BITWISE AND (&)

BITWISE OR (|)

BITWISE XOR (#)

Now, let’s take a look at each operator and some interesting examples.

Bitwise AND

The Bitwise AND operator compares each bit of two values and returns a new value with the bits that are set in both of the original values.

The syntax for this operator is as follows:

SELECT value1 & value2

Here, value1 and value2 are the two values that you want to compare using the bitwise AND operator. These values can be any valid expressions or constants in SQL, such as columns, variables, or literals.

For example, let’s imagine a table of employees with a column status that stores a bitmask value representing the status of each employee. The status value is stored as a byte, and each bit represents a different aspect of the employee status. To save storage space, we recommend the byte type for storing up to 7 states simultaneously: the first bit always represents the sign, and we don't use negative values to encode states. Similarly, use the short type for up to 15 states, integer for up to 31 states, and long for up to 63 states; for more than 63 states, use the Bit String type. To learn more about data types supported in CrateDB, check out our documentation.
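The limits quoted above follow from reserving one bit for the sign. A minimal Python sketch of that arithmetic, assuming the usual 8, 16, 32, and 64 bit widths for these types (the helper name usable_flags is invented for this example):

```python
# Bit widths assumed for the signed integer types named above.
TYPE_BITS = {"byte": 8, "short": 16, "integer": 32, "long": 64}


def usable_flags(bits):
    """Number of flag bits left once the sign bit is set aside."""
    return bits - 1


for name, bits in TYPE_BITS.items():
    flags = usable_flags(bits)
    largest_mask = 2 ** flags - 1  # value with every usable flag set
    print(f"{name}: {flags} flags, largest mask {largest_mask}")
```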

The first bit in the status value says whether the employee is working full-time, the second bit says whether the employee is working remotely, and so on. For instance, if the employee is working full-time, the status value will be 1 (B'01'); if remote, the status value will be 2 (B'10'); and if both, the status value will be 3 (B'11').

Now, let’s create a table and populate it with sample data:

CREATE TABLE employees (name TEXT, status BYTE, comment TEXT);
INSERT INTO employees (name, status, comment) VALUES
('Ana', 1, 'Ana is working full-time from office'),
('Mary', 3, 'Mary is working full-time remotely'),
('Sara', 2, 'Sara is working part-time remotely');

To select all employees who are working full-time:

SELECT name, comment FROM employees WHERE status & 1 = 1; -- Ana, Mary

This query will select all rows from the employees table where the first bit (full-time status) is set to 1.

To find all employees who are working full-time and remotely:

SELECT name, comment FROM employees WHERE status & 3 = 3; -- Mary

You can also use the bitwise AND operator in combination with other logical operators, such as OR and NOT, to create more complex queries.
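The two filters above can be mirrored in a few lines of Python, which may help when checking a mask by hand. This is only an illustration of the bit arithmetic on the same sample rows; the constant names FULL_TIME and REMOTE are made up for the example.

```python
FULL_TIME = 1  # first bit
REMOTE = 2     # second bit

# Same sample rows as in the employees table: (name, status).
employees = [("Ana", 1), ("Mary", 3), ("Sara", 2)]

# status & 1 = 1: the full-time bit is set.
full_time = [name for name, status in employees
             if (status & FULL_TIME) == FULL_TIME]

# status & 3 = 3: both the full-time and the remote bits are set.
both_mask = FULL_TIME | REMOTE
full_time_remote = [name for name, status in employees
                    if (status & both_mask) == both_mask]

print(full_time)
print(full_time_remote)
```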

Bitwise OR

The bitwise OR operator is used to compare two binary values and return a new binary value where the resulting bit is set to 1 if either of the input bits is 1. It is represented by the symbol |.

To understand how the bitwise OR operator works, let's consider two binary values: 1011 and 1100. The bitwise OR operation would compare each bit position of the two values and return a new binary value based on the following rules:

If either of the input bits is 1, the resulting bit is set to 1.

If both of the input bits are 0, the resulting bit is set to 0.

Applying these rules to the example above, we get the following result: 1011 | 1100 = 1111

The resulting binary value is 1111, which is equivalent to 15 in decimal.

Considering the example with employees table, let’s say we would like to select all employees that are working full-time, remotely or both. Here's how you can use the bitwise OR operator in the query:

SELECT name, comment FROM employees where status & (1 | 2) > 0;

The following query illustrates how to change a flag specifying whether an employee is working full-time without changing the existing flags:

UPDATE employees
SET status = (status | 1 /* FULL-TIME */) 
WHERE name='Sara'

It’s important to note that the bitwise OR operator only works with binary values. If you want to perform a logical OR operation with non-binary values, you can use the OR operator in SQL.

Bitwise XOR

The bitwise XOR operator, represented by the symbol #, compares two binary values and returns a new value based on the following rules:

If both values are 0, the result is 0.
If both values are 1, the result is 0.
If one value is 1 and the other is 0, the result is 1.
Using these rules in the example from above, the result of the XOR operator is: 1011 # 1100 = 0111

One interesting example of using the bitwise XOR operator is changing the status of an employee from “working remotely” to “working in the office” and vice versa.

UPDATE employees
SET status = status # 2
WHERE name = 'Ana';

In this example, the status of the employee with the name Ana will be toggled from “working remotely” to “working in the office” or vice versa, depending on the current value of the status field.
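The OR and XOR rules can be checked quickly in Python. Note that Python writes XOR as ^ whereas CrateDB uses #; the helper names set_flag and toggle_flag are invented for this sketch.

```python
FULL_TIME = 1  # first bit
REMOTE = 2     # second bit


def set_flag(status, flag):
    """Bitwise OR: switch the flag on and leave the other bits untouched."""
    return status | flag


def toggle_flag(status, flag):
    """Bitwise XOR: flip the flag, whatever its current value."""
    return status ^ flag


# Worked binary examples from the text.
print(bin(0b1011 | 0b1100))
print(bin(0b1011 ^ 0b1100))

sara = set_flag(2, FULL_TIME)        # remote only -> remote and full-time
ana = toggle_flag(1, REMOTE)         # office -> remote
ana_back = toggle_flag(ana, REMOTE)  # toggling twice restores the original
print(sara, ana, ana_back)
```

Toggling the same flag twice returns the original value, which is why a single XOR statement can serve both directions of the status change.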

Wrap up

In summary, the bitwise operators in CrateDB allow you to perform bitmasking operations on values in your database and use the resulting values to filter or modify rows in your tables. By combining these operators with other logical operators, you can create powerful queries that can manipulate and extract specific data from your database. If you have any further questions or would like to learn more about CrateDB, check out our documentation and join the CrateDB community.

Replicating data from other databases to CrateDB with Debezium and Kafka

Hernán Lionel Cianfagna — Tue, 28 Feb 2023 16:29:49 +0000

You may have line-of-business applications such as ERP software that work with transactional database systems like MSSQL, Oracle, or MySQL.

The setup may work perfectly fine for day-to-day operations, but you may find that it is not ideal for doing data analytics.

If you attempt to run analytic workloads against the operational databases, you may see concurrency issues deriving from locking, the analytics queries may have an impact on the performance of business-critical operations, and you may also find that the performance and feature set of the transactional database system are not good enough for analyzing large amounts of data.

Considering this, many organisations come to the conclusion that they need to copy data to a separate environment to run reporting and dashboards. This is sometimes done with replication, sometimes with backups, and sometimes with complex ETL pipelines. This often comes with a set of challenges:

ballooning license costs
custom ad-hoc routines for getting the data to the analytics environment, requiring development, monitoring, and troubleshooting
a need to design and maintain an indexing strategy for the analytics copy of the data
high availability requirements for the analytics environment as the business starts relying on it

We know we can address several of these points by using a system like CrateDB. CrateDB is a feature-rich, open-source, SQL database which out-of-the-box automatically implements indexes, compression, and a columnar store so that most analytical queries can run much faster without any need to fiddle with settings. Because it is open-source, there is no need to be concerned about licensing expenses. Additionally, it can scale horizontally, which means that the number of nodes can be adjusted as needed to handle changing data volumes and workloads, and it provides high availability without requiring administrative effort.

If only we could replicate data from our operational database to CrateDB without having to write custom code… it turns out we can.

Enter Debezium. Debezium is a standard open-source system, built on top of Kafka, which allows us to capture changes on a source database system and replicate them on another system without having to write custom scripts.

In this post I want to show an example replicating changes on a table from MSSQL to CrateDB.

Setup on the MSSQL side

We will need a SQL Server instance with the SQL Server Agent service up and running. If you are running MSSQL in a container, you can get the agent running by setting the environment variable MSSQL_AGENT_ENABLED to True.

Connect to the instance with a client such as sqlcmd, SSMS, or DBeaver.

We are now going to go through a number of steps. If you already have a working system, feel free to skip the operations you do not need.

Let’s create a database with a test table on it:

CREATE DATABASE erp;
GO
USE erp; 
CREATE TABLE dbo.tbltest (
    id INT PRIMARY KEY IDENTITY,
    createdon DATETIME DEFAULT getdate(),
    srcsystem NVARCHAR(max)
    );

Let’s now create an account for Debezium to use to pull the changes:

CREATE LOGIN debeziumlogin WITH PASSWORD = '<enterStrongPasswordHere>';
CREATE USER debeziumuser FOR LOGIN debeziumlogin;
CREATE ROLE debeziumrole;
EXEC sp_addrolemember 'debeziumrole', 'debeziumuser';
EXEC sp_addrolemember 'db_datareader', 'debeziumuser';

And let’s enable change data capture on our example table:

EXEC sys.sp_cdc_enable_db;
ALTER DATABASE erp ADD FILEGROUP cdcfg;
ALTER DATABASE erp ADD FILE (
    NAME= erp_cdc_file1,
    FILENAME='/var/opt/mssql/data/erp_cdc_file1.ndf'
    ) TO FILEGROUP cdcfg;
EXEC sys.sp_cdc_enable_table
    @source_schema='dbo',
    @source_name='tbltest',
    @role_name='debeziumrole',
    @filegroup_name='cdcfg',
    @supports_net_changes=0;

Setup on the CrateDB side

We will need a CrateDB instance. For the purpose of this example, we can spin one up with:

sudo apt install docker.io
sudo docker run --publish 4200:4200 --publish 5432:5432 crate:latest -Cdiscovery.type=single-node

Now we need to run a couple of SQL commands on this instance. An easy way to do this is the Admin UI, which can be accessed by navigating with a web browser to port 4200 on the server where CrateDB is running, for instance http://localhost:4200, and then opening the console (second icon from the top on the left-hand side navigation bar).

We will create a user account for Debezium to use:

CREATE USER debezium WITH (password='debeziumpwdincratedb123');

The table on our MSSQL source is in the dbo schema. Let’s imagine we want to have a dbo schema on CrateDB as well; the debezium account will need permissions on it:

GRANT DQL,DML,DDL ON SCHEMA dbo to debezium;

And let’s create the structure of the table that will receive the data:

CREATE TABLE dbo.tbltest (
    id INT PRIMARY KEY /* we need the PK definition to match the source table so that this can be used to lookup records when they need to be updated */
    ,createdon TIMESTAMP /* CrateDB supports defaults -of course- but because the source table already has a default value we do not need that here */
    ,srcsystem TEXT
    );

Zookeeper and Kafka

To use Debezium we will need to have working setups of Zookeeper and Kafka.

For the purpose of this example I will spin them up with containers on the same machine:

sudo docker run -it --rm --name zookeeper -p 2181:2181 -p 2888:2888 -p 3888:3888 debezium/zookeeper
sudo docker run -it --rm --name kafka -p 9092:9092 --link zookeeper:zookeeper --add-host host.docker.internal:host-gateway debezium/kafka

We need to create some special topics in Kafka:

sudo docker exec -it kafka "bash"
bin/kafka-topics.sh --create --replication-factor 1 --partitions 1 --topic my_connect_configs --bootstrap-server host.docker.internal:9092 --config cleanup.policy=compact
bin/kafka-topics.sh --create --replication-factor 1 --partitions 1 --topic my_connect_offsets --bootstrap-server host.docker.internal:9092 --config cleanup.policy=compact
exit

Please note this is a very basic setup. For production purposes, you may want to adjust some of these settings.

Preparing and starting a Debezium container image

We need to customize the base debezium/connect Docker image adding a JDBC sink and the PostgreSQL drivers.

For this we need to download the zip file from kafka-connect-jdbc and then run the below replacing ************* with the appropriate URL:

mkdir customdockerimg
cd customdockerimg
wget *************/confluentinc-kafka-connect-jdbc-10.6.3.zip
sudo apt install unzip
mkdir confluentinc-kafka-connect-jdbc-10.6.3
cd confluentinc-kafka-connect-jdbc-10.6.3
unzip -j ../confluentinc-kafka-connect-jdbc-10.6.3.zip
cd ..
cat > Dockerfile <<EOF  
FROM debezium/connect
USER root:root
COPY ./confluentinc-kafka-connect-jdbc-10.6.3/ /kafka/connect/
RUN cd /kafka/libs && curl -sO https://jdbc.postgresql.org/download/postgresql-42.5.4.jar
USER 1001
EOF
sudo docker build -t cratedb-connect-debezium .

Let’s now start this custom image:

sudo docker run -it --rm --name connect -p 8083:8083 \
           -e GROUP_ID=1 \
           -e CONFIG_STORAGE_TOPIC=my_connect_configs \
           -e OFFSET_STORAGE_TOPIC=my_connect_offsets \
           --add-host host.docker.internal:host-gateway \
           --add-host $(hostname):host-gateway \
           -e BOOTSTRAP_SERVERS=host.docker.internal:9092 \
           -e KEY_CONVERTER=org.apache.kafka.connect.json.JsonConverter \
           -e VALUE_CONVERTER=org.apache.kafka.connect.json.JsonConverter \
           cratedb-connect-debezium

This assumes Kafka is running locally on the same server. You will need to adjust BOOTSTRAP_SERVERS if that is not the case.

Configure a source connector

Let’s create a connector.json file as follows:

{
    "name": "mssql-source-tbltest",
    "config": {
        "connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
        "tasks.max": "1",

        "database.history.kafka.bootstrap.servers": "host.docker.internal:9092",
        "schema.history.internal.kafka.bootstrap.servers": "host.docker.internal:9092",
        "topic.prefix": "cratedbdemo",
        "database.encrypt": "false",

        "database.hostname": "host.docker.internal",
        "database.port": "1433",
        "database.user": "debeziumlogin",
        "database.password": "<enterStrongPasswordHere>",
        "database.server.name": "mssql-server",

        "database.names": "erp",        
        "table.whitelist": "dbo.tbltest",        
        "database.history.kafka.topic": "schema-changes.mssql-server.tbltest",
        "schema.history.internal.kafka.topic": "schema-changes.inventory.mssql-server.tbltest"              
    }
}

We can observe that there are settings there concerning the Kafka setup to use, the details to connect to MSSQL, the name of the table that we want to pull changes from, and the Kafka topics that will be used to track these changes.

Let’s deploy this:

curl -i -X POST -H "Accept:application/json" -H "Content-Type:application/json" http://localhost:8083/connectors/ -d @connector.json

Configure a target

Let’s create a destination-connector.json file as follows:

{
    "name": "cratedb-sink-tbltest",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
        "tasks.max": "1",       

        "connection.url": "jdbc:postgresql://host.docker.internal:5432/",              
        "connection.user": "debezium",      
        "connection.password": "debeziumpwdincratedb123",               

        "topics": "cratedbdemo.erp.dbo.tbltest", 
        "table.name.format": "dbo.tbltest",
        "auto.create": "false",
        "auto.evolve": "false",

        "insert.mode": "upsert",
        "pk.fields": "id",
        "pk.mode": "record_value",      

        "transforms": "unwrap",                                                 
        "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState"
    }
}

Here we have the details to connect to CrateDB, the name of the table that will receive the changes (please note this is case sensitive), and some transform instructions to flatten the JSON data stored in the Kafka topic.
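To give a feel for what the flattening and the upsert settings do together, here is a simplified Python sketch. It is not Debezium or connector code: the event dictionaries are hand-written stand-ins with invented values that keep only a few fields of a real change event, and the helpers unwrap and upsert are hypothetical. The real transform also deals with deletes and other metadata that are left out here.

```python
def unwrap(change_event):
    """Simplified stand-in for the unwrap transform: keep only the new row state."""
    return change_event["after"]


def upsert(table, row, pk="id"):
    """Mimic insert.mode=upsert with pk.fields=id: one row per primary key."""
    table[row[pk]] = row


# Hand-written change events for dbo.tbltest (values invented).
insert_event = {
    "before": None,
    "after": {"id": 1, "createdon": 1677600000000, "srcsystem": "demo"},
    "op": "c",
}
update_event = {
    "before": {"id": 1, "createdon": 1677600000000, "srcsystem": "demo"},
    "after": {"id": 1, "createdon": 1677600000000,
              "srcsystem": "Updated successfully"},
    "op": "u",
}

target = {}  # stands in for the CrateDB table, keyed by primary key
for event in (insert_event, update_event):
    upsert(target, unwrap(event))

print(target)
```

Because rows are matched on the primary key, the update replaces the existing record instead of adding a second one, which is why the target table needs the same primary key definition as the source.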

Let’s deploy this:

curl -i -X POST -H "Accept:application/json" -H "Content-Type:application/json" http://localhost:8083/connectors/ -d @destination-connector.json
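If you would rather deploy connectors from a script, the same POST request can be built with Python's standard library. This is a sketch under the same assumption as the curl command, namely that Kafka Connect listens on localhost:8083; the function names are made up, and the connector definition is trimmed down for illustration.

```python
import json
import urllib.request

CONNECT_URL = "http://localhost:8083/connectors/"


def build_deploy_request(connector, url=CONNECT_URL):
    """Build the same POST request that the curl command sends."""
    return urllib.request.Request(
        url,
        data=json.dumps(connector).encode("utf-8"),
        headers={"Accept": "application/json",
                 "Content-Type": "application/json"},
        method="POST",
    )


def deploy(connector):
    """Send the request; this needs a running Kafka Connect instance."""
    with urllib.request.urlopen(build_deploy_request(connector)) as response:
        return response.status


# Trimmed-down connector definition, for illustration only.
request = build_deploy_request({"name": "cratedb-sink-tbltest", "config": {}})
print(request.get_method(), request.full_url)
```

In practice you would load the full destination-connector.json with json.load and pass the resulting dictionary to deploy.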

Testing

Let’s see this in action.

Let’s create a record from the MSSQL side:

INSERT INTO erp.dbo.tbltest (srcsystem) VALUES (@@version);

And now let’s go to CrateDB and check the table:

SELECT * FROM dbo.tbltest;

As if by magic, the record is there.

Let’s now try an update from the MSSQL side:

UPDATE erp.dbo.tbltest
SET srcsystem = 'Updated successfully'
WHERE id = 1

Conclusion

Using Debezium, we can replicate changes from different database systems to CrateDB without having to develop any custom logic. We can then take advantage of CrateDB’s performance and features for our analytic workloads.

Find the latest reported values with ease. Introducing max_by and min_by aggregations in CrateDB 5.2

marijaselakovic — Thu, 23 Feb 2023 08:42:04 +0000

CrateDB 5.2 adds two new aggregation functions: max_by and min_by.
These aggregation functions allow users to quickly and easily look up the value of one column based on the minimum or maximum value of another column, making them useful for analyzing trends, identifying outliers, or simply understanding the range of values within a dataset. An example use case is getting the latest measurement by using the time column and max_by(measurement, time).

The min_by and max_by functions return the value of a given column from the row where another column reaches its minimum or maximum. For example, if you have a table with three columns, product, category, and price, you can use min_by(product, price) to find the product with the lowest price in each category:

SELECT min_by(product, price) AS cheapest_product
FROM product_list
GROUP BY category;

max_by(returned_value, maximized_value) and min_by(returned_value, minimized_value) return the value of the first column for which the value of the second column is maximized or minimized. If multiple rows maximize or minimize the result of the second column, the output will be non-deterministic and CrateDB can return any value from the list of resulting rows.
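The semantics can be sketched in plain Python as a min or max with a key function. The sample rows and the helper names below are invented for illustration. One difference to keep in mind: Python's min and max return the first matching row on ties, whereas CrateDB may return any of them.

```python
from collections import defaultdict

# Invented sample rows: (product, category, price).
product_list = [
    ("pen", "office", 1.5),
    ("stapler", "office", 7.0),
    ("mug", "kitchen", 4.0),
    ("kettle", "kitchen", 25.0),
]


def min_by(rows, returned, minimized):
    """Return returned(row) for the row where minimized(row) is smallest."""
    return returned(min(rows, key=minimized))


def max_by(rows, returned, maximized):
    """Return returned(row) for the row where maximized(row) is largest."""
    return returned(max(rows, key=maximized))


# GROUP BY category, then min_by(product, price) within each group.
by_category = defaultdict(list)
for row in product_list:
    by_category[row[1]].append(row)

cheapest_product = {
    category: min_by(rows, lambda r: r[0], lambda r: r[2])
    for category, rows in by_category.items()
}
print(cheapest_product)
```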

Both max_by and min_by can be used for numerical and non-numerical data.

Load the dataset

Let’s start with examples using the dataset about power consumption. First, create a table with the schema below:

CREATE TABLE IF NOT EXISTS doc.power_consumption (
   "ts" TIMESTAMP WITH TIME ZONE,
   "Global_active_power" REAL,
   "Global_reactive_power" REAL,
   "Voltage" REAL,
   "Global_intensity" REAL,
   "Sub_metering_1" REAL,
   "Sub_metering_2" REAL,
   "Sub_metering_3" REAL,
   "meter_id" TEXT,
   "location" GEO_POINT,
   "city" TEXT
);

To import data, use the following COPY FROM command:

COPY doc.power_consumption
FROM 'https://srv.demo.crate.io/datasets/power_consumption.json'
RETURN SUMMARY;

The dataset illustrates the consumption data for a few years and shows the differences between several measured utilities. For instance, the column "Sub_metering_1" shows how much energy is consumed in the kitchen. Similarly, columns "Sub_metering_2" and "Sub_metering_3" show the consumed energy in laundry and climate control systems. The full description of the dataset can be found here.

Example queries
Given the data set, let’s find the ids of house meters that had the highest consumption at one point in time for the kitchen and laundry:

SELECT max_by(meter_id, "Sub_metering_1") as max_kitchen,  
       max_by(meter_id, "Sub_metering_2") as max_laundry  
FROM doc.power_consumption;

The result of this query should contain the following meter ids:

+-------------+-------------+
| max_kitchen | max_laundry |
+-------------+-------------+
| 84007B127R  | 840070504U  |
+-------------+-------------+
SELECT 1 row in set (7.423 sec)

Another example would be to find the meter id for the house with the lowest unused power:

SELECT min_by(meter_id, "Global_reactive_power") AS min_unused 
FROM doc.power_consumption;

The return value will tell us for which house meter we had the lowest value of unused power:

+------------+
| min_unused |
+------------+
| 84007B008L |
+------------+
SELECT 1 row in set (0.197 sec)

You can also combine these functions with WHERE or GROUP BY clauses in CrateDB. For example, let's find for each meter id, the consumption of other, unmapped appliances when the unused power was the lowest:

SELECT
  meter_id,
  min_by("Global_active_power","Global_reactive_power") AS total_consumption 
FROM doc.power_consumption 
GROUP BY meter_id 
LIMIT 10;

The query result will list the consumption for each meter id:

+------------+-------------------+
| meter_id   | total_consumption |
+------------+-------------------+
| 840073190N |             0.202 |
| 840071457E |             0.258 |
| 840072897V |             0.14  |
| 840072655G |             0.218 |
| 840072219H |             0.274 |
| 840071893D |             1.342 |
| 840075260N |             0.246 |
| 840076398A |             0.226 |
| 840072328B |             0.212 |
| 840071760J |             0.222 |
+------------+-------------------+
SELECT 10 rows in set (0.067 sec)

Performance and alternatives

As you have seen above, min_by and max_by provide a concise and convenient way to find the value of one column based on the minimum or maximum value of another column. Not only is this convenient, it also provides significant performance gains over the alternative queries one had to write in earlier versions of CrateDB.

Let us look at another example and see how much easier and faster it is to get the right results in CrateDB 5.2. We start with our often-used dataset containing IoT device data:

CREATE TABLE IF NOT EXISTS devices.readings ( 
       "time" TIMESTAMP WITH TIME ZONE NOT NULL, 
       device_id TEXT, 
       battery_level BIGINT, 
       battery_status TEXT
    ) CLUSTERED BY (device_id) INTO 8 SHARDS;

We want to find the latest reported battery_level and battery_status for each device in our dataset holding 30 million records in total. In CrateDB 5.1 and earlier versions, one could fall back to a two-step approach and use a JOIN like so:

SELECT
    r.device_id,
    r.time,
    r.battery_level,
    r.battery_status
FROM devices.readings r
    JOIN (SELECT
            MAX(time) time,
            device_id
        FROM devices.readings
        GROUP BY device_id) max_r
    ON max_r.time = r.time
    AND max_r.device_id = r.device_id;

Not only does the nested structure make the query harder to adjust and force one to remember the pattern, but the performance, due to the expensive JOIN, is also not that great, with a runtime of roughly 9 seconds:

Runtime (in ms):
    mean:    8982.507 ± 57.494
    min/max: 8578.380 → 9843.830

With max_by, the same result takes roughly 1.3 seconds. That is an improvement of 85% in query speed (8.9s → 1.3s), while also using a simpler syntax.
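The difference between the two approaches can be illustrated on a handful of invented readings in plain Python. The sketch says nothing about CrateDB's actual execution or timings; it only shows that a max_by-style single pass yields the same rows as the two-step MAX-then-JOIN pattern. In SQL, the single-pass variant would presumably group by device_id and use expressions such as max_by(battery_level, "time"), which is an assumption based on the function signature described earlier.

```python
# Invented readings: (time, device_id, battery_level, battery_status).
readings = [
    (1, "dev-a", 90, "discharging"),
    (2, "dev-a", 85, "discharging"),
    (1, "dev-b", 40, "charging"),
    (3, "dev-b", 55, "charging"),
]

# Two-step approach: find MAX(time) per device, then join back to the rows.
max_time = {}
for t, device, _, _ in readings:
    max_time[device] = max(t, max_time.get(device, t))
joined = {device: (level, status)
          for t, device, level, status in readings
          if max_time[device] == t}

# max_by-style approach: one pass, keeping the row with the largest time.
latest = {}
for row in readings:
    device = row[1]
    if device not in latest or row[0] > latest[device][0]:
        latest[device] = row
single_pass = {device: (row[2], row[3]) for device, row in latest.items()}

print(single_pass)
```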

Wrap up

Overall, the max_by and min_by functions in CrateDB provide an easy and efficient way to find the maximum or minimum value of a given column in a table based on the values in a different column. These functions can be used in a variety of scenarios to quickly and easily find the highest or lowest values in a set of data.

If you like this blog post and want to learn more about CrateDB, check out our documentation and join the CrateDB community!

Guide to sharding and partitioning best practices in CrateDB

marijaselakovic — Fri, 17 Feb 2023 14:36:47 +0000

Sharding and partitioning are very important concepts when it comes to system scaling. When defining your strategy, you should account upfront for any future growth, given the significant burden of moving data and restructuring the tables. In this article, we will give you a thorough understanding of how sharding and partitioning work in CrateDB. We will start by covering the basic definitions, discussing the principles behind shard distribution and replication in CrateDB, and how to avoid common bottlenecks.

Partition, shard, and Lucene index

A table in CrateDB is a collection of data. It consists of a specified number of columns and any number of rows. Every table must have a schema describing the table structure. Very often, the table is divided into independent, smaller parts based on a particular column. Such a “smaller part” is called a partition. The table becomes a partitioned table if a partition column is defined during table creation. In this case, when a new record is inserted, a new partition is created if no partition for that column value exists yet. Partitioning is done for easier maintenance of large tables and to improve the performance of particular SQL operations. However, a bad choice of partition column can lead to too many partitions, which can slow down the system.

Now let’s take a look at the concept of a shard. In CrateDB, a shard is a division of a table or a table partition, based on a configurable number, and stored on a node in the cluster. When a node is added or removed from the cluster or when the data distribution becomes unbalanced, CrateDB automatically redistributes the shards across the nodes in the cluster to ensure an even data distribution. If the number of shards is not defined during table creation, CrateDB will apply a sensible default value depending on the number of nodes.

Finally, each shard in CrateDB represents a Lucene index. A Lucene index is a collection of Lucene segments where each segment represents an inverted index, doc values, or k-d trees. Take a look at our previous article to get a better overview of the Lucene index and Lucene segments. The diagram below illustrates the relationship between partitions, shards, and the Lucene index.

How the data is distributed

CrateDB uses row-based sharding to split data across multiple shards. This means that data is split based on the values in specific columns of the table, and distributed across multiple nodes. The column that the data should be sharded on is called the routing column. When you insert, update, or delete a row, CrateDB will use the routing column value to determine which shard to access. The number of shards and the routing column can be specified in the CLUSTERED clause when creating a table:

CREATE TABLE product (
     product_id INT PRIMARY KEY,
     name TEXT,
     amount INT
) CLUSTERED INTO 4 SHARDS;

In the above example, the product table is sharded into four shards. If the primary key is set, as illustrated, the routing column can be omitted, as CrateDB uses the primary key for routing by default. However, if neither the primary key nor the routing column is set, the internal document ID is used.

To distribute data across the cluster, CrateDB uses a hash-based approach based on the simple formula:

shard number = hash(routing column) % total primary shards

As a result, all documents with the same value in the CLUSTERED BY column are stored in the same shard. Because the routing value is hashed, CrateDB distributes the data roughly equally across shards, even if the original values are not evenly distributed.
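The routing formula can be sketched in a few lines of Python. This is only an illustration under stated assumptions: zlib.crc32 stands in for CrateDB's internal hash function, which this article does not specify, and shard_for is a hypothetical helper name.

```python
import zlib
from collections import Counter

NUM_SHARDS = 4  # matches CLUSTERED INTO 4 SHARDS in the example above


def shard_for(routing_value, num_shards=NUM_SHARDS):
    """Return the shard number for a routing value.

    zlib.crc32 is a stand-in hash chosen for this sketch; it is not
    the hash function CrateDB uses internally.
    """
    digest = zlib.crc32(str(routing_value).encode("utf-8"))
    return digest % num_shards


# The same routing value always maps to the same shard.
assert shard_for(42) == shard_for(42)

# Count how many of 100,000 sequential product_id values land on each shard.
distribution = Counter(shard_for(product_id) for product_id in range(100_000))
for shard, rows in sorted(distribution.items()):
    print(f"shard {shard}: {rows} rows")
```

Running the sketch prints the row count per shard, which lets you check how evenly a given hash spreads sequential keys that are not evenly distributed themselves.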

Shard replication

Shard replication in CrateDB is a feature that allows you to replicate data across multiple nodes in a cluster. This can be useful for increasing data availability, improving performance, and reducing the risk of data loss. You can configure the number of replicas for each shard with the number_of_replicas table setting:

CREATE TABLE product (
     product_id INT PRIMARY KEY,
     name TEXT,
     amount INT
) WITH (number_of_replicas = 1);

When there are multiple copies of the same shard, CrateDB will mark one copy as the primary shard and treat the rest as replica shards. When data is written to a shard, it is first written to the primary shard. The primary shard then replicates the data to one or more replica shards on other nodes in the cluster. This process is done in real-time, ensuring that data is always up-to-date across all replicas.

In the event of a node failure, CrateDB automatically detects the failed node and promotes one of the replica shards to become the new primary shard. Having more replicas means a lower chance of permanent data loss and more read throughput, because queries can also use the replica shards and the primary shard is not congested with requests. The cost is higher disk space utilization and more inter-node network traffic, which increases the latency of inserts and updates.

You can also change the number of replicas of an existing table with the ALTER TABLE command. For instance:

ALTER TABLE product SET (number_of_replicas = 2);

Note that changing the number of replicas after a table is created causes the cluster to redistribute the shards, which may take some time to complete. On a partitioned table, use the ONLY keyword if you want only new partitions to be affected.

Automatic creation of new partitions

Tables in CrateDB are designed to be dynamic and expandable, allowing rows and columns to be added as needed. This means that a table is not limited to a fixed number of rows. For a partitioned table, whenever a row is inserted whose partition column value has no partition yet, CrateDB dynamically creates a new partition, as illustrated by the following example:

CREATE TABLE sales (
  "name" STRING,
  "ts" TIMESTAMP,
  "month" TIMESTAMP GENERATED ALWAYS AS date_trunc('month', ts),
  "value" DOUBLE PRECISION
) CLUSTERED INTO 3 SHARDS
  PARTITIONED BY (month);

For every unique value in the month column, a new partition is created. In our example, the table gets one partition for each calendar month that contains data, so at most twelve partitions per year of data. If more columns are used for partitioning, a new partition is created for every unique combination of values. As in this example, the partition column can be a generated column, that is, a column whose value is calculated from other columns. For instance, if you have a column containing a timestamp value, you can partition the data by a generated column that extracts the day from the timestamp.
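How rows map to partitions can be sketched in Python. The helper month_partition is hypothetical and only mimics what date_trunc('month', ts) does for the sales table; the sample timestamps are made up for illustration. The sketch assumes that truncating to the month keeps the year, so the same month in two different years yields two different partition values.

```python
from datetime import datetime, timezone


def month_partition(ts):
    """Truncate a timestamp to the first instant of its month,
    mimicking date_trunc('month', ts) for this sketch."""
    return ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)


# Made-up sample rows: two in January 2023, one in February 2023,
# one in January 2024.
rows = [
    datetime(2023, 1, 15, 9, 30, tzinfo=timezone.utc),
    datetime(2023, 1, 28, 17, 0, tzinfo=timezone.utc),
    datetime(2023, 2, 3, 8, 45, tzinfo=timezone.utc),
    datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc),
]

# Each distinct truncated value corresponds to one partition.
partitions = {month_partition(ts) for ts in rows}
print(len(partitions))  # 3: January 2023, February 2023, January 2024
```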

The automatic creation of new partitions allows for horizontal scaling and enables the database to handle large amounts of data by distributing it across multiple partitions. The shards of each partition are distributed across the nodes of a CrateDB cluster, which helps to improve query performance and reduce the load on individual nodes.

How to avoid too many shards

If the partition column or the number of shards is badly chosen, you can end up with too many shards in the cluster, which hurts overall stability and performance. To find out how many shards your tables need, consider the type of data you are processing, the queries you need to run, and your hardware configuration. If you do end up with too many shards, you will have to reduce their number manually by merging them and moving them to the same node, which is a time-consuming and tedious operation. To get an idea of how many shards your cluster needs, check out our recent tutorial.

The general rule for avoiding performance bottlenecks is to have at least as many shards for a table as there are CPUs in the cluster. This increases the chances that a query can be parallelized and distributed maximally. However, if most nodes have more shards per table than they have CPUs, you could actually see performance degradation. Each shard comes with a cost in terms of open files, RAM, and CPU cycles. Having too many small shards can negatively impact performance and scalability for a few reasons:

Increased overhead in terms of managing and maintaining your cluster.
Reduced performance because it takes longer for CrateDB to gather the results, especially for queries that need to join data from multiple shards.
Limited scalability because it becomes more difficult to scale your cluster if you have too many shards.
Increased complexity as too many shards can make it more difficult to understand and troubleshoot your data distribution.

Finally, for performance reasons, consider one thousand shards per node the highest recommended configuration. Also, a single table should not have more than one thousand partitions. If you exceed these numbers you will experience a failing cluster check.
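A rough way to check a planned schema against these limits is to multiply partitions, shards, and copies. The sketch below is a back-of-the-envelope calculation, not a CrateDB API: TablePlan and check_plan are hypothetical names, the example figures are made up, and it assumes that replica shards count toward the per-node number and that shards are spread evenly across nodes.

```python
from dataclasses import dataclass

MAX_SHARDS_PER_NODE = 1000        # recommended upper bound from the text
MAX_PARTITIONS_PER_TABLE = 1000   # recommended upper bound from the text


@dataclass
class TablePlan:
    name: str
    partitions: int   # 1 for a table without partitioning
    shards: int       # value of CLUSTERED INTO n SHARDS
    replicas: int     # value of number_of_replicas

    def total_shards(self):
        # Every partition has its own set of primary shards,
        # and every primary shard has `replicas` additional copies.
        return self.partitions * self.shards * (1 + self.replicas)


def check_plan(tables, nodes):
    """Return a list of warnings for a hypothetical cluster plan."""
    warnings = []
    for table in tables:
        if table.partitions > MAX_PARTITIONS_PER_TABLE:
            warnings.append(
                f"{table.name}: {table.partitions} partitions exceed the limit"
            )
    # Assumes an even spread of all shard copies across the nodes.
    per_node = sum(t.total_shards() for t in tables) / nodes
    if per_node > MAX_SHARDS_PER_NODE:
        warnings.append(f"about {per_node:.0f} shards per node exceed the limit")
    return warnings


plan = [
    TablePlan("product", partitions=1, shards=4, replicas=1),
    TablePlan("sales", partitions=36, shards=3, replicas=1),
]
print(check_plan(plan, nodes=3))  # [] because 224 shards on 3 nodes is about 75 per node
```

The multiplication makes it visible how quickly a fine-grained partition column inflates the shard count: every additional partition adds shards times (1 + replicas) shard copies to the cluster.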

Takeaway

Sharding and partitioning are two key concepts that help to improve the scalability and performance of your CrateDB database. In this article, we explored their basic principles: how data is distributed and replicated, how partitions are created automatically, and how to avoid too many shards. If you have any further questions or would like to learn more about CrateDB, check out our documentation and join the CrateDB community.
