<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: CrateDB</title>
    <description>The latest articles on DEV Community by CrateDB (@crate).</description>
    <link>https://dev.to/crate</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F3724%2Fa95143df-0432-476e-a232-b21295f33edb.png</url>
      <title>DEV Community: CrateDB</title>
      <link>https://dev.to/crate</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/crate"/>
    <language>en</language>
    <item>
      <title>Retrieving records in bulk with a list of primary key values in CrateDB</title>
      <dc:creator>Hernán Lionel Cianfagna</dc:creator>
      <pubDate>Fri, 23 Feb 2024 09:33:28 +0000</pubDate>
      <link>https://dev.to/crate/retrieving-records-in-bulk-with-a-list-of-primary-key-values-in-cratedb-3em0</link>
      <guid>https://dev.to/crate/retrieving-records-in-bulk-with-a-list-of-primary-key-values-in-cratedb-3em0</guid>
      <description>&lt;p&gt;When we send SQL statements to CrateDB they need to be parsed, but in most situations we do not think about this because the resources used for parsing the statements are trivial in relation to what is required to actually execute the queries.&lt;/p&gt;

&lt;p&gt;One exception to this is when INSERTing a large amount of rows, and for this case CrateDB has &lt;a href="https://cratedb.com/docs/crate/reference/en/5.6/interfaces/http.html#bulk-operations"&gt;a very efficient bulk operations interface&lt;/a&gt; which can also be used for UPDATEs and DELETEs.&lt;/p&gt;

&lt;p&gt;However I recently came across an unusual requirement, we had a very large table with a primary key made of multiple fields, and given tens of thousands of values for these fields we needed to retrieve all the corresponding records.&lt;/p&gt;

&lt;p&gt;Let me exemplify the situation with this table definition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;sensor_data&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt;
  &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;machine_id&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;
  &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;sensor_type&lt;/span&gt; &lt;span class="nb"&gt;SMALLINT&lt;/span&gt;
  &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="k"&gt;OBJECT&lt;/span&gt;
  &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;machine_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;sensor_type&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s also create some sample data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;sensor_data&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;machine_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;sensor_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'machine'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;
    &lt;span class="p"&gt;,{&lt;/span&gt;&lt;span class="nv"&gt;"test"&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'abc'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;GENERATE_SERIES&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;100000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are different approaches we could use to retrieve multiple rows for the given PK values, such as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sensor_data&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'2024-02-21 08:00:00.000Z'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;machine_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'machine1'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;sensor_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt; 
&lt;span class="k"&gt;UNION&lt;/span&gt; 
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sensor_data&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'2024-02-21 08:00:00.000Z'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;machine_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'machine2'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;sensor_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sensor_data&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'2024-02-21 08:00:00.000Z'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;machine_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'machine1'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;sensor_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'2024-02-21 08:00:00.000Z'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;machine_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'machine2'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;sensor_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works reasonably well up to a few hundred records, but let’s see what happens if we try to use this approach to lookup tens of thousands of different records in a single statement as it was the requirement in this very particular case.&lt;/p&gt;

&lt;p&gt;Let’s dynamically generate a query like the above but for 10,000 records:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;thedata&lt;/span&gt;
&lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;CONCAT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="s1"&gt;'OR (ts='&lt;/span&gt;&lt;span class="p"&gt;,(&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;TEXT&lt;/span&gt;
            &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;' and machine_id=&lt;/span&gt;&lt;span class="se"&gt;''&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;machine_id&lt;/span&gt;
            &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="se"&gt;''&lt;/span&gt;&lt;span class="s1"&gt; and sensor_type='&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;sensor_type&lt;/span&gt;
            &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;')'&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;onewherecondition&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sensor_data&lt;/span&gt; 
    &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;CONCAT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="s1"&gt;'SELECT * FROM sensor_data WHERE '&lt;/span&gt;
        &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="k"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nv"&gt;"thearray"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;array_agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;onewherecondition&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}::&lt;/span&gt;&lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'{"thearray":["OR '&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;'","'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;' '&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;'"]}'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;';'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;thedata&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will generate a very long statement, and when we try to run it we may get:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;StackOverflowError[null]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;io.crate.exceptions.SQLParseException: line 1:1: statement is too large (stack overflow while parsing)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So we will need a different strategy, and we also want this to run as quickly as possible.&lt;/p&gt;

&lt;p&gt;Let’s start by preparing a CSV file with 10,000 primary key values we will use for testing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;crash
crash &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"SELECT ts,machine_id,sensor_type FROM sensor_data LIMIT 10000;"&lt;/span&gt; &lt;span class="nt"&gt;--format&lt;/span&gt; csv &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; pkvalues.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What we are going to do now is take advantage of a &lt;a href="https://cratedb.com/docs/crate/reference/en/5.6/general/ddl/system-columns.html"&gt;system column&lt;/a&gt; called &lt;code&gt;_id&lt;/code&gt; which exists on all CrateDB tables. This column contains a unique identifier for each row, and for tables with a PK defined it is a compound string representation of all primary key values of that row. The useful characteristic here is that the value is deterministic, given 2 tables with the same PK definition rows with the same PK values will have the same &lt;code&gt;_id&lt;/code&gt; values.&lt;/p&gt;

&lt;p&gt;So to perform this “bulk SELECT” we are going to use a staging table defined with the same PK as the main table. The Python code below bulk loads the values from the CSV file to the staging table and then uses the &lt;code&gt;_id&lt;/code&gt; values to locate all the rows we are interested in:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;pandas &lt;span class="s2"&gt;"crate[sqlalchemy]"&lt;/span&gt; &lt;span class="nt"&gt;--upgrade&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sqlalchemy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sa&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crate.client.sqlalchemy.support&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;insert_bulk&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pkvalues.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sa&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;crate://localhost:4200&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;connect_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;verify_ssl_cert&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;connection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;connection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sa&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DROP TABLE IF EXISTS relevant_pk_values;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;connection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;sa&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        CREATE TABLE relevant_pk_values (
            ts TIMESTAMP
            ,machine_id TEXT
            ,sensor_type SMALLINT
            ,PRIMARY KEY (ts,machine_id,sensor_type)
        ) CLUSTERED INTO 1 SHARDS;
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;relevant_pk_values&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;con&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;if_exists&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;chunksize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;insert_bulk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;connection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sa&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;REFRESH TABLE relevant_pk_values;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;resultset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;connection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;sa&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        SELECT *
        FROM sensor_data
        WHERE _id IN (SELECT _id FROM relevant_pk_values);
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I hope you found this interesting, if you have any questions please do not hesitate to reach out to us through the &lt;a href="https://community.crate.io/"&gt;CrateDB Community&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>cratedb</category>
      <category>python</category>
      <category>sql</category>
      <category>performance</category>
    </item>
    <item>
      <title>Using common table expressions to speed up queries</title>
      <dc:creator>Hernán Lionel Cianfagna</dc:creator>
      <pubDate>Thu, 22 Feb 2024 09:44:56 +0000</pubDate>
      <link>https://dev.to/crate/using-common-table-expressions-to-speed-up-queries-29om</link>
      <guid>https://dev.to/crate/using-common-table-expressions-to-speed-up-queries-29om</guid>
      <description>&lt;p&gt;Today I want to share with you a pattern you can use to replace &lt;a href="https://cratedb.com/docs/crate/reference/en/5.6/concepts/joins.html"&gt;JOINs&lt;/a&gt; with &lt;a href="https://cratedb.com/docs/crate/reference/en/5.6/general/dql/selects.html#sql-dql-with"&gt;CTEs&lt;/a&gt; in your SQL queries and achieve consistent and faster execution times.&lt;/p&gt;

&lt;p&gt;Consider a database where we store information about invoices, a simplified model could consist of a table where we store details about the customer and payment terms, a separate table where we have the items included in the invoice, and a 3rd table where we store product information:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;invoices&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;invoice_number&lt;/span&gt;  &lt;span class="nb"&gt;BIGINT&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;
  &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;  &lt;span class="nb"&gt;BIGINT&lt;/span&gt;
  &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;payment_terms&lt;/span&gt;  &lt;span class="nb"&gt;TEXT&lt;/span&gt;  &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="s1"&gt;'30 days from issue date'&lt;/span&gt;
  &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;issue_date&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;invoice_items&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;invoice_number&lt;/span&gt;  &lt;span class="nb"&gt;BIGINT&lt;/span&gt;
  &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt;  &lt;span class="nb"&gt;BIGINT&lt;/span&gt;
  &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt;  &lt;span class="nb"&gt;REAL&lt;/span&gt;
  &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;unit_price&lt;/span&gt; &lt;span class="nb"&gt;REAL&lt;/span&gt;
  &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;invoice_number&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;product_id&lt;/span&gt;  &lt;span class="nb"&gt;BIGINT&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;
  &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;product_description&lt;/span&gt;  &lt;span class="nb"&gt;TEXT&lt;/span&gt;
  &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;applicable_tax_percentage&lt;/span&gt;  &lt;span class="nb"&gt;REAL&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s now imagine we want to know how many units of “super cool product” have been sold in January 2024, we could write a SQL query with JOINs like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;invoices&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;invoice_items&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;invoice_number&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;product_description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'super cool product'&lt;/span&gt; 
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;invoices&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;issue_date&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="s1"&gt;'2024-01-01'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="s1"&gt;'2024-02-01'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is perfectly valid SQL, but it leaves the database engine with a lot of options.&lt;/p&gt;

&lt;p&gt;Even without considering the complexities of a distributed system, parallel processing, and disk/memory options there are still many different possible strategies here, for instance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“super cool product“ may only be sold very rarely, we could then start by looking up its &lt;code&gt;product_id&lt;/code&gt; then all the instances where it has been sold and then check if the corresponding invoices were in January 2024&lt;/li&gt;
&lt;li&gt;or perhaps the product is sold often and we have data for 20 years of sales, so we could start by looking up all the invoices from January, then their line items, and see which ones are for this product&lt;/li&gt;
&lt;li&gt;or maybe the company only sells a handful of products and this one is a best seller, we may also only keep invoices for the last 45 days, meaning that neither the date of the invoice nor the product are very selective, in which case it may be faster to consider the full list of &lt;code&gt;invoice_items&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With up-to-date &lt;a href="https://cratedb.com/docs/crate/reference/en/5.6/sql/statements/analyze.html"&gt;statistics&lt;/a&gt; database engines like CrateDB can usually do a good job at identifying an optimal execution plan, but there is the risk some day statistics may not be available on your target environment, or even with statistics available other factors may induce the query engine to go down the wrong path.&lt;/p&gt;

&lt;p&gt;The impact of using a suboptimal execution plan here could be huge, we could find ourselves trying to JOIN millions and millions of records.&lt;/p&gt;

&lt;p&gt;Let’s do a small test creating sample data for the first scenario above, the one where “super cool product“ is only sold very rarely:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="cm"&gt;/* one million invoices in December */&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;invoices&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;invoice_number&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;issue_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'2023-12-01'&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;GENERATE_SERIES&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1000000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="cm"&gt;/* one million invoices in January */&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;invoices&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;invoice_number&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;issue_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'2024-01-01'&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;GENERATE_SERIES&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2000000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="cm"&gt;/* 2 products */&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;product_description&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'super cool product'&lt;/span&gt;&lt;span class="p"&gt;),(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'another product'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="cm"&gt;/* one line item per invoice and only 1 instance in 2 million where product 1 was sold */&lt;/span&gt;
&lt;span class="n"&gt;REFRESH&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;invoices&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;invoice_items&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;invoice_number&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;invoice_number&lt;/span&gt;
&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="k"&gt;CASE&lt;/span&gt; &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;invoice_number&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2000000&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;product_id&lt;/span&gt;
&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;ceiling&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;invoices&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can now run the query with the JOINs a few times, in my small test environment it settles in running in around 750 milliseconds.&lt;/p&gt;

&lt;p&gt;We can also look at the execution plan and all its details using the &lt;a href="https://cratedb.com/docs/crate/reference/en/5.6/sql/statements/explain.html"&gt;EXPLAIN command&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let’s now try this approach where we use CTEs to guide the engine to execute the query using steps we know are more optimal for the profile of our data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;relevant_product_ids&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;product_id&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;product_description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'super cool product'&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;relevant_invoice_lines&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;invoice_number&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;invoice_items&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;invoice_items&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;relevant_product_ids&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;relevant_product_ids&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;relevant_invoices&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;invoice_number&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;issue_date&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;invoices&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;invoices&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;invoice_number&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;relevant_invoice_lines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;invoice_number&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;relevant_invoice_lines&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;relevant_invoices&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;relevant_invoice_lines&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;invoice_number&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;relevant_invoices&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;issue_date&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="s1"&gt;'2024-01-01'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="s1"&gt;'2024-02-01'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We now see this runs consistently in single digit milliseconds, a 100x improvement.&lt;/p&gt;

&lt;p&gt;In large and busy environments this kind of optimization may make a big difference, so it may be something to add to your toolbox.&lt;/p&gt;

&lt;p&gt;I hope you found this interesting, and as usual if you have any questions do not hesitate to reach out to us in the &lt;a href="https://community.crate.io/"&gt;CrateDB Community&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>database</category>
      <category>sql</category>
      <category>performance</category>
    </item>
    <item>
      <title>Connecting with SSL to CrateDB using the PostgreSQL protocol from Java-based applications</title>
      <dc:creator>Hernán Lionel Cianfagna</dc:creator>
      <pubDate>Wed, 13 Sep 2023 12:39:48 +0000</pubDate>
      <link>https://dev.to/crate/connecting-with-ssl-to-cratedb-using-the-postgresql-protocol-from-java-based-applications-4mpp</link>
      <guid>https://dev.to/crate/connecting-with-ssl-to-cratedb-using-the-postgresql-protocol-from-java-based-applications-4mpp</guid>
      <description>&lt;p&gt;If you are using &lt;a href="https://crate.io/lp-free-trial"&gt;CrateDB Cloud&lt;/a&gt;, or if you have &lt;a href="https://crate.io/docs/crate/reference/en/latest/admin/ssl.html"&gt;configured a server certificate&lt;/a&gt; for an on-premises deployment, and you try to enforce SSL on a PostgreSQL connection to CrateDB you may come across an error message like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Could not open SSL root certificate file C:\Users\Hernan\AppData\Roaming\postgresql\root.crt.
  C:\Users\Hernan\AppData\Roaming\postgresql\root.crt (The system cannot find the path specified)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;org.postgresql.util.PSQLException: Could not open SSL root certificate file C:\Users\Hernan\AppData\Roaming\postgresql\root.crt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is not specific to CrateDB, and you would get the same message trying to connect to an actual PostgreSQL instance, but I found no simple explanation of this error message and the options available, so here are my two cents.&lt;/p&gt;

&lt;p&gt;What happens here is that the client is trying to confirm the server we are establishing an encrypted connection with is indeed the machine we intended to reach, doing this involves validating that the certificate used by the server has been issued by a trusted certification authority.&lt;br&gt;
In this case, the client driver is trying to find the details of valid certification authorities on a PKCS12 file on the location indicated in the error message.&lt;/p&gt;

&lt;p&gt;I find that in most cases it makes sense to pick one of the two options below to address this.&lt;/p&gt;

&lt;p&gt;If we want the communication channel with the server to be encrypted, but we are on a trusted network environment and do not require verification of the server certificate, we can use this in our connection string:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ssl=true&amp;amp;sslmode=require
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But if we want to have both encryption and the confirmation that we are talking to the intended server, we can tell the driver to use the list of certification authorities our JVM accepts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ssl=true&amp;amp;sslmode=verify-full&amp;amp;sslfactory=org.postgresql.ssl.DefaultJavaSSLFactory
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Some software (&lt;a href="https://dbeaver.io/"&gt;DBeaver&lt;/a&gt; for instance) may have separate configuration settings where you can set the SSL Factory and SSL mode:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2k85hn3z5j7q6axwgfy2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2k85hn3z5j7q6axwgfy2.png" alt="SSL Mode and Factory settings can be found in DBeaver on the SSL tab of the connection settings" width="518" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I hope this helps. As usual please do not hesitate to let us know your thoughts in the &lt;a href="https://community.crate.io/"&gt;CrateDB Community&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>security</category>
      <category>postgres</category>
      <category>java</category>
    </item>
    <item>
      <title>Using dbt with CrateDB</title>
      <dc:creator>Hernán Lionel Cianfagna</dc:creator>
      <pubDate>Fri, 18 Aug 2023 16:32:06 +0000</pubDate>
      <link>https://dev.to/crate/using-dbt-with-cratedb-65i</link>
      <guid>https://dev.to/crate/using-dbt-with-cratedb-65i</guid>
      <description>&lt;p&gt;&lt;a href="https://github.com/dbt-labs/dbt-core"&gt;Dbt&lt;/a&gt; is a tool for transforming data in data warehouses using Python and SQL. The idea is that Data Engineers make source data available to an environment where dbt projects run (for instance with &lt;a href="https://dev.to/crate/replicating-data-from-other-databases-to-cratedb-with-debezium-and-kafka-1981"&gt;Debezium&lt;/a&gt; or with &lt;a href="https://crate.io/integrations/cratedb-and-apache-airflow"&gt;Airflow&lt;/a&gt;), and Data Analysts can then run their dbt projects against this data to produce models (tables and views) that can be used with &lt;a href="https://crate.io/integrations#visualization-analytics"&gt;BI tools&lt;/a&gt;.&lt;br&gt;
This layer allows the decoupling of the models on which reports and dashboards rely from the source data, and if our business rules or our source systems change we can still maintain the same models as a stable interface.&lt;/p&gt;

&lt;p&gt;Some of the things that dbt can do include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;import reference data from csv files&lt;/li&gt;
&lt;li&gt;track changes in source data with different strategies so that downstream models do not need to be built every time from scratch&lt;/li&gt;
&lt;li&gt;run tests on data, to confirm assumptions remain valid, and to validate any changes made to the models' logic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Due to its unique capabilities, &lt;a href="https://cratedb.com/"&gt;CrateDB&lt;/a&gt; is an excellent warehouse choice for data transformation projects. It offers automatic indexing, fast aggregations, easy partitioning, and the ability to scale horizontally. In this article, I will illustrate how to get the most important functionalities of dbt working by doing the necessary changes in the configuration.&lt;/p&gt;

&lt;p&gt;Our starting point will be a fresh install of &lt;code&gt;dbt-postgres&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;dbt-postgres&lt;span class="o"&gt;==&lt;/span&gt;1.6.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can then create a profiles file with our connection details:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ~
&lt;span class="nb"&gt;mkdir&lt;/span&gt; .dbt
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt; &amp;gt; .dbt/profiles.yml
example_datawarehouse_profile:
  target: dev
  outputs:
    dev:
      type: postgres
      host: localhost
      port: 5432
      database: crate
      schema: doc
      search_path: doc
      user: dbt   
      password: pwd1234567A
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(please note the values for &lt;code&gt;database&lt;/code&gt;, &lt;code&gt;schema&lt;/code&gt;, and &lt;code&gt;search_path&lt;/code&gt; in this example)&lt;/p&gt;

&lt;p&gt;We will not go into the details of how the project files are structured (for more information check out dbt’s documentation), but in general, a dbt project consists of a combination of SQL, Jinja, YAML, and markdown files. In our project folder, alongside the &lt;code&gt;models&lt;/code&gt; folder that most projects have, we can also create a folder called &lt;code&gt;macros&lt;/code&gt; where we can place macro overrides.&lt;br&gt;
Let's then create a &lt;code&gt;macros&lt;/code&gt; folder and place some files with overrides on it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir &lt;/span&gt;macros
&lt;span class="nb"&gt;cd &lt;/span&gt;macros
wget https://community.crate.io/uploads/short-url/fKupQCFUHtuoKom3jAfKrldUXkt.sql
wget https://community.crate.io/uploads/short-url/qvQExEq1OopiVUcXACLGfpdGHYF.sql
wget https://community.crate.io/uploads/short-url/3jcFxL1EExLrERJSTc6ScnzTS9f.sql
&lt;span class="nb"&gt;cd&lt;/span&gt; ..
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few things I have tested with these overrides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;models with &lt;a href="https://docs.getdbt.com/docs/build/materializations"&gt;view, table, and ephemeral materializations&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getdbt.com/docs/deploy/source-freshness"&gt;dbt source freshness&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getdbt.com/docs/build/tests"&gt;dbt test&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getdbt.com/docs/build/seeds"&gt;dbt seed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.getdbt.com/docs/build/incremental-models"&gt;Incremental materializations&lt;/a&gt; (with &lt;code&gt;incremental_strategy='delete+insert'&lt;/code&gt; and without involving &lt;a href="https://crate.io/docs/crate/reference/en/5.4/general/ddl/data-types.html#objects"&gt;OBJECT&lt;/a&gt; columns)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I hope you find this useful. CrateDB is continuously adding new features and I will endeavor to come back and update this article if there are any developments and some of these overrides require changes or become obsolete.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Using regex comparisons and other advanced database features for real-time inspection of web server logs</title>
      <dc:creator>Hernán Lionel Cianfagna</dc:creator>
      <pubDate>Mon, 14 Aug 2023 12:23:16 +0000</pubDate>
      <link>https://dev.to/crate/using-regex-comparisons-and-other-advanced-database-features-for-real-time-inspection-of-web-server-logs-35ip</link>
      <guid>https://dev.to/crate/using-regex-comparisons-and-other-advanced-database-features-for-real-time-inspection-of-web-server-logs-35ip</guid>
      <description>&lt;p&gt;In &lt;a href="https://dev.to/crate/storing-server-logs-on-cratedb-for-fast-search-and-aggregations-no"&gt;Storing server logs on CrateDB for fast search and aggregations&lt;/a&gt; we saw how we can get server logs sent to CrateDB in real-time, and for demo purposes we set up an instance of MediaWiki.&lt;br&gt;
It was just an example, but it could have been any web server application.&lt;br&gt;
Let's now imagine that we suspect people are trying to perform &lt;a href="https://en.wikipedia.org/wiki/SQL_injection"&gt;SQL injection&lt;/a&gt; attacks against our website, we need to keep an eye on the logs.&lt;br&gt;
We have already seen how we can use &lt;a href="https://crate.io/docs/crate/reference/en/5.4/general/dql/fulltext.html"&gt;fulltext search&lt;/a&gt; to look for specific error messages, but would it not be great if we could have some rules inspecting the log entries as they come in, extracting relevant information, and flagging anything potentially suspicious?&lt;br&gt;
There are a lot of nice features in CrateDB to support this kind of setup, let me show you an example.&lt;br&gt;
The log entries our web server container is producing look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; 192.168.0.121 - - [11/Aug/2023:12:59:42 +0000] "GET /favicon.ico HTTP/1.1" 200 852 "http://192.168.0.202/mw-config/index.php?page=Welcome" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36 Edg/115.0.1901.200"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We see we have the client IP address and the HTTP request that was sent.&lt;br&gt;
Today I want to do 2 checks,&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I want to query only for requests coming from a specific IP subnet,&lt;/li&gt;
&lt;li&gt;and I want to see if the HTTP request has anything suspicious that could suggest an attempt to perform a SQL injection attack&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I will use &lt;a href="https://crate.io/docs/crate/reference/en/5.4/general/ddl/generated-columns.html"&gt;generated columns&lt;/a&gt; to extract this information from the log entries as they arrive.&lt;br&gt;
CrateDB supports &lt;a href="https://crate.io/docs/crate/reference/en/5.4/sql/statements/alter-table.html#add-column"&gt;adding columns&lt;/a&gt; live to existing tables, but generated columns are special and they can only be added to empty tables, to add our new columns to our &lt;code&gt;systemevents&lt;/code&gt; table without any downtime we will use another feature of CrateDB, &lt;a href="https://crate.io/docs/crate/reference/en/5.4/sql/statements/alter-cluster.html#swap-table"&gt;table swapping&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let's create a new table with the 2 special columns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;systemevents2&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;message_ft&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;FULLTEXT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;facility&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;fromhost&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;priority&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;DeviceReportedTime&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;ReceivedAt&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;InfoUnitID&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;SysLogTag&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; 
    &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;clientip&lt;/span&gt; &lt;span class="n"&gt;IP&lt;/span&gt; &lt;span class="k"&gt;GENERATED&lt;/span&gt; &lt;span class="n"&gt;ALWAYS&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;TRY_CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;btrim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;split_part&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'-'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;IP&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;suspectedSQLinjection&lt;/span&gt; &lt;span class="nb"&gt;BOOLEAN&lt;/span&gt; &lt;span class="k"&gt;GENERATED&lt;/span&gt; &lt;span class="n"&gt;ALWAYS&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;~*&lt;/span&gt; 
                &lt;span class="n"&gt;CONCAT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'.*SELECT.*FROM.*'&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt; 
                       &lt;span class="s1"&gt;'|.*UNION.*SELECT.*'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                       &lt;span class="s1"&gt;'|.*DELETE.*FROM.*'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                       &lt;span class="s1"&gt;'|.*UPATE.*SET.*'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                       &lt;span class="s1"&gt;'|.*ALTER.*TABLE.*'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                       &lt;span class="s1"&gt;'|.*(%27|&lt;/span&gt;&lt;span class="se"&gt;''&lt;/span&gt;&lt;span class="s1"&gt;)%20.*%20(%27|&lt;/span&gt;&lt;span class="se"&gt;''&lt;/span&gt;&lt;span class="s1"&gt;).*'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we are extracting the client IP address from the &lt;code&gt;message&lt;/code&gt; text and storing it using the dedicated &lt;code&gt;IP&lt;/code&gt; data type in CrateDB. We use the &lt;code&gt;split_part&lt;/code&gt; function to look up the string up to the dash symbol, then we use the &lt;code&gt;btrim&lt;/code&gt; function to remove spaces from both sides of the string, and finally we use the &lt;code&gt;TRY_CAST&lt;/code&gt; function so that log entries that do not have an IP address in this position get a &lt;code&gt;NULL&lt;/code&gt; value as &lt;code&gt;clientip&lt;/code&gt; but no error message is raised.&lt;/p&gt;

&lt;p&gt;We are also using the case insensitive &lt;code&gt;~*&lt;/code&gt; &lt;a href="https://crate.io/docs/crate/reference/en/master/general/dql/selects.html#sql-dql-regexp"&gt;regex comparison&lt;/a&gt; operator to look for indications of a possible SQL injection attack attempt, we are looking for occurrences of &lt;code&gt;SELECT .. FROM&lt;/code&gt; , &lt;code&gt;UNION ... SELECT&lt;/code&gt; , &lt;code&gt;DELETE ... FROM&lt;/code&gt; , &lt;code&gt;UPDATE ... SET&lt;/code&gt; , &lt;code&gt;ALTER ... TABLE&lt;/code&gt; , or attempts to break a string delimiter injecting a single quote character. This will match entries like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; 172.17.0.1 - - [11/Aug/2023:13:03:07 +0000] "GET /mw-config/index.php?css=1%27%20WAITFOR%20DELAY%20%270%3A0%3A5%27%20AND%20%27Lshb%27%3D%27Lshb HTTP/1.1" 200 4627 "-" "sqlmap/1.7.8#pip (https://sqlmap.org)"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's now swap in this new table and rename the old one as &lt;code&gt;systemevents_archive&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;CLUSTER&lt;/span&gt; &lt;span class="n"&gt;SWAP&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;systemevents2&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;systemevents&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="n"&gt;DML&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;systemevents&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;rsyslog&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;REVOKE&lt;/span&gt; &lt;span class="n"&gt;DML&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;systemevents2&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;rsyslog&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;systemevents2&lt;/span&gt; &lt;span class="k"&gt;RENAME&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;systemevents_archive&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To show how this work we will need both normal activity, which we can generate just by navigating to &lt;a href="http://localhost/"&gt;http://localhost/&lt;/a&gt; (perhaps from another machine to get a different IP address in the logs), and we will also need some malicious-looking activity, to generate this against our website we can use a well-known pentesting tool &lt;a href="https://github.com/sqlmapproject/sqlmap"&gt;sqlmap&lt;/a&gt; :&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;sqlmap
sqlmap &lt;span class="nt"&gt;-u&lt;/span&gt; http://localhost/ &lt;span class="nt"&gt;--crawl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Press ENTER when prompted to proceed with default settings, there will be around 10 questions before the tool concludes there are no obvious vulnerabilities in the website.&lt;/p&gt;

&lt;p&gt;We can now run queries like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;systemevents&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;clientip&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="s1"&gt;'172.17.0.0/16'&lt;/span&gt; 
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;devicereportedtime&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to get the most recent activity from IP addresses in the 172.17.0.0/16 subnet, and we can also look for suspicious activity only:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;systemevents&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;suspectedSQLinjection&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;devicereportedtime&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This kind of queries could then be integrated into a &lt;a href="https://community.crate.io/t/monitoring-an-on-premises-cratedb-cluster-with-prometheus-and-grafana/1236"&gt;Grafana dashboard&lt;/a&gt; or some alerting system.&lt;/p&gt;

&lt;p&gt;I hope you found this interesting. Please do not hesitate to let us know your thoughts in the &lt;a href="https://community.crate.io/"&gt;CrateDB Community&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>regex</category>
      <category>security</category>
      <category>sql</category>
      <category>networking</category>
    </item>
    <item>
      <title>Storing server logs on CrateDB for fast search and aggregations</title>
      <dc:creator>Hernán Lionel Cianfagna</dc:creator>
      <pubDate>Thu, 10 Aug 2023 08:27:19 +0000</pubDate>
      <link>https://dev.to/crate/storing-server-logs-on-cratedb-for-fast-search-and-aggregations-no</link>
      <guid>https://dev.to/crate/storing-server-logs-on-cratedb-for-fast-search-and-aggregations-no</guid>
      <description>&lt;p&gt;Did you know that CrateDB can be a great store for your server logs?&lt;/p&gt;

&lt;p&gt;If you have been using log aggregation tools or even some of the most advanced commercial SIEM systems, you have probably experienced the same frustrations I have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;timeouts when searching logs over long periods of time&lt;/li&gt;
&lt;li&gt;a complex and proprietary query syntax&lt;/li&gt;
&lt;li&gt;difficulties integrating queries on logs data into application monitoring dashboards&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Storing server logs on CrateDB solves these problems, it allows to query the logs with standard SQL and from any tool supporting the PostgreSQL protocol; its unique indexing also makes full-text queries and aggregations super fast.&lt;br&gt;
Let me show you an example.&lt;/p&gt;

&lt;p&gt;First, we will need an instance of CrateDB, it may be best to have a dedicated cluster for this purpose, to separate the monitoring system from the systems being monitored, but for the purpose of this demo we can just have a single node cluster on a docker container:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--name&lt;/span&gt; cratedb &lt;span class="nt"&gt;--publish&lt;/span&gt; 4200:4200 &lt;span class="nt"&gt;--publish&lt;/span&gt; 5432:5432 crate &lt;span class="nt"&gt;-Cdiscovery&lt;/span&gt;.type&lt;span class="o"&gt;=&lt;/span&gt;single-node
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we need a table to store the logs, let's connect to &lt;a href="http://localhost:4200/#!/console"&gt;http://localhost:4200/#!/console&lt;/a&gt; and run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;systemevents&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;message_ft&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;FULLTEXT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;facility&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;fromhost&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;priority&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;DeviceReportedTime&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;ReceivedAt&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;InfoUnitID&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;SysLogTag&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; 
    &lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tip: if you are on a headless system you can also run queries with &lt;a href="https://cratedb.com/docs/crate/clients-tools/en/latest/cli.html"&gt;command-line tools&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Then we need an account for the logging system:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;USER&lt;/span&gt; &lt;span class="n"&gt;rsyslog&lt;/span&gt; &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PASSWORD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'pwd123'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and we need to grant permissions on the table above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="n"&gt;DML&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;systemevents&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;rsyslog&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We will use &lt;a href="https://github.com/rsyslog/rsyslog"&gt;rsyslog&lt;/a&gt; to send the logs to CrateDB, for this setup we need &lt;code&gt;rsyslog&lt;/code&gt; v8.2202 or higher and the &lt;code&gt;ompgsql&lt;/code&gt; module:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;add-apt-repository ppa:adiscon/v8-stable
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get update
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get &lt;span class="nb"&gt;install &lt;/span&gt;rsyslog
&lt;span class="nb"&gt;sudo &lt;/span&gt;debconf-set-selections &lt;span class="o"&gt;&amp;lt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="s1"&gt;'rsyslog-pgsql rsyslog-pgsql/dbconfig-install string false'&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get &lt;span class="nb"&gt;install &lt;/span&gt;rsyslog-pgsql
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's now configure it to use the account we created earlier:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'module(load="ompgsql")'&lt;/span&gt; | &lt;span class="nb"&gt;sudo tee&lt;/span&gt; /etc/rsyslog.d/pgsql.conf
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'*.* action(type="ompgsql" conninfo="postgresql://rsyslog:pwd123@localhost/doc")'&lt;/span&gt; | &lt;span class="nb"&gt;sudo tee&lt;/span&gt; &lt;span class="nt"&gt;-a&lt;/span&gt; /etc/rsyslog.d/pgsql.conf
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl restart rsyslog
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you are interested in more advanced setups involving queuing for additional reliability in production scenarios, you can read more about available settings in the &lt;a href="https://www.rsyslog.com/doc/v8-stable/tutorials/high_database_rate.html"&gt;rsyslog documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now let's imagine that we want to run a container with &lt;a href="https://www.mediawiki.org/wiki/MediaWiki"&gt;MediaWiki&lt;/a&gt; to host an intranet and we want all logs to go to CrateDB, we can just deploy this with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;docker run &lt;span class="nt"&gt;--name&lt;/span&gt; mediawiki &lt;span class="nt"&gt;-p&lt;/span&gt; 80:80 &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--log-driver&lt;/span&gt; syslog &lt;span class="nt"&gt;--log-opt&lt;/span&gt; syslog-address&lt;span class="o"&gt;=&lt;/span&gt;unixgram:///dev/log mediawiki
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If we now point a web browser to port 80 &lt;a href="http://localhost/"&gt;http://localhost/&lt;/a&gt; we will see a new MediaWiki page.&lt;br&gt;
Let's play around a bit to generate log entries, just click on "set up the wiki" and then once on Continue.&lt;br&gt;
This will have generated entries in the &lt;code&gt;doc.systemevents&lt;/code&gt; table with &lt;code&gt;syslogtag&lt;/code&gt; matching the container id of the container running the site.&lt;/p&gt;

&lt;p&gt;We can now use the &lt;a href="https://crate.io/docs/crate/reference/en/5.4/general/dql/fulltext.html#predicates-match"&gt;MATCH predicate&lt;/a&gt; to find the error messages we are interested in:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;devicereportedtime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;systemevents&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;MATCH&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message_ft&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Could not reliably determine'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;PHRASE&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+--------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| devicereportedtime | message                                                                                                                                                                     |
+--------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|      1691510710000 | AH00558: apache2: Could not reliably determine the server's fully qualified domain name, using 172.17.0.3. Set the 'ServerName' directive globally to suppress this message |
|      1691510710000 | AH00558: apache2: Could not reliably determine the server's fully qualified domain name, using 172.17.0.3. Set the 'ServerName' directive globally to suppress this message |
+--------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's now see which log sources created the most entries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;syslogtag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;systemevents&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+----------------------+----------+
| syslogtag            | count(*) |
+----------------------+----------+
| kernel:              |       23 |
| 083053ae8ea3[52134]: |       20 |
| systemd[1]:          |       15 |
| sudo:                |       10 |
| rsyslogd:            |        5 |
+----------------------+----------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I hope you found this interesting. Please do not hesitate to let us know your thoughts in the &lt;a href="https://community.crate.io/"&gt;CrateDB Community&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>security</category>
      <category>docker</category>
      <category>containers</category>
      <category>database</category>
    </item>
    <item>
      <title>How to add new nodes to on-premises CrateDB clusters</title>
      <dc:creator>Hernán Lionel Cianfagna</dc:creator>
      <pubDate>Tue, 18 Jul 2023 13:12:26 +0000</pubDate>
      <link>https://dev.to/crate/how-to-add-new-nodes-to-on-premises-cratedb-clusters-4lfo</link>
      <guid>https://dev.to/crate/how-to-add-new-nodes-to-on-premises-cratedb-clusters-4lfo</guid>
      <description>&lt;p&gt;A significant feature in CrateDB is that it can scale horizontally, which means that instead of adding more RAM, CPU, and disk resources to our existing nodes we can add more nodes to our CrateDB cluster.&lt;br&gt;
This allows the handling of volumes of data that simply could not fit on a single node, but it is also very useful in scenarios where hosting everything in a single node, or a small number of nodes, would still be possible, this is because smaller nodes are often easier to manage infrastructure-wise.&lt;br&gt;
More nodes also mean more resiliency to issues, on a scenario where we have for instance 5 nodes, and configure our tables with 2 replicas, we could lose 2 nodes and still serve our production workloads. This means we can carry out maintenance on the nodes one at a time and still be able to withstand an unplanned issue on another node without downtime.&lt;/p&gt;

&lt;p&gt;Today we want to review how to add a new node to an existing on-premises cluster.&lt;/p&gt;
&lt;h2&gt;
  
  
  Related reading
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://crate.io/docs/crate/howtos/en/latest/clustering/multi-node-setup.html"&gt;CrateDB multi-node setup — CrateDB: How-Tos&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://crate.io/docs/crate/reference/en/latest/concepts/clustering.html"&gt;Clustering — CrateDB: Reference&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://crate.io/docs/crate/reference/en/latest/concepts/storage-consistency.html"&gt;Storage and consistency — CrateDB: Reference&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Discovery
&lt;/h2&gt;

&lt;p&gt;When a CrateDB node starts it needs a mechanism to get a list of the nodes that make up the cluster, this is called discovery.&lt;br&gt;
At the time of writing, there are 3 ways for a node to get this list:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The list of nodes can be defined in the &lt;code&gt;discovery.seed_hosts&lt;/code&gt; setting in the configuration file (typically in &lt;code&gt;/etc/crate/crate.yml&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;The list can be retrieved with a DNS query, see &lt;a href="https://crate.io/docs/crate/reference/en/5.4/config/cluster.html#discovery-via-dns"&gt;https://crate.io/docs/crate/reference/en/5.4/config/cluster.html#discovery-via-dns&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;In AWS environments, the list of nodes can be looked up via the EC2 API, filtering on specific security groups, availability zones, and tags, see &lt;a href="https://crate.io/docs/crate/reference/en/5.4/config/cluster.html#discovery-on-amazon-ec2"&gt;https://crate.io/docs/crate/reference/en/5.4/config/cluster.html#discovery-on-amazon-ec2&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For the purpose of this post, we will work with the &lt;code&gt;discovery.seed_hosts&lt;/code&gt; list.&lt;/p&gt;
&lt;h2&gt;
  
  
  Scaling from a single-node deployment
&lt;/h2&gt;

&lt;p&gt;If a node is started without specifying the &lt;code&gt;initial_master_nodes&lt;/code&gt; setting (the default configuration), or with &lt;code&gt;discovery_type&lt;/code&gt; set to &lt;code&gt;single-node&lt;/code&gt;, it will be started as a standalone instance and it cannot later be scaled into a cluster. Single-node deployments are great for development and testing, but for production setups we recommend using a cluster with at least 3 nodes.&lt;/p&gt;

&lt;p&gt;If you are going for a single-node deployment initially, but plan to scale to a multi-node cluster in the future, there are some settings to configure before the very first run of the CrateDB node so that it bootstraps as a 1-node cluster instead of a standalone instance.&lt;br&gt;
The settings that we need are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;discovery.seed_hosts&lt;/code&gt; set to the hostname or the &lt;code&gt;node.name&lt;/code&gt; of the node&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;initial_master_nodes&lt;/code&gt; set to the hostname or the &lt;code&gt;node.name&lt;/code&gt; of the node&lt;/li&gt;
&lt;li&gt;optionally we can set a &lt;code&gt;cluster.name&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you are using containers you would pass these settings with lines in the &lt;code&gt;args&lt;/code&gt; section of your YAML file, otherwise you could create &lt;code&gt;/etc/crate/crate.yml&lt;/code&gt; before deploying the package for your distribution (refer to &lt;a href="https://github.com/crate/crate/blob/master/app/src/main/dist/config/crate.yml"&gt;https://github.com/crate/crate/blob/master/app/src/main/dist/config/crate.yml&lt;/a&gt; for the template), or you could prevent the package installation from auto-starting the daemon by using a mechanism such as &lt;code&gt;policy-rcd-declarative&lt;/code&gt;, then edit the configuration file (&lt;code&gt;crate.yml&lt;/code&gt; ), and start the &lt;code&gt;crate&lt;/code&gt; daemon once all settings are ready.&lt;/p&gt;
&lt;h2&gt;
  
  
  Networking considerations
&lt;/h2&gt;

&lt;p&gt;Nodes need to be able to resolve each other's hostnames at DNS level, and they need to be able to reach each other on a TCP port which is 4300 by default.&lt;/p&gt;

&lt;p&gt;For security reasons you should configure your network so that CrateDB cluster nodes are only reachable on port 4300 from other CrateDB nodes in the cluster.&lt;br&gt;
In a Kubernetes environment this can be achieved with a &lt;code&gt;Service&lt;/code&gt; resource with a &lt;code&gt;ClusterIP&lt;/code&gt;.&lt;br&gt;
In a non-containerized environment one way to do this is to use firewall software directly on each node, for instance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#Enable ufw - all incoming connections blocked by default&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;ufw &lt;span class="nb"&gt;enable&lt;/span&gt;

&lt;span class="c"&gt;#Allow SSH if you are using it to manage your server&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;ufw allow 22

&lt;span class="c"&gt;#Allow 4200 for clients to connect to CrateDB via the http endpoint&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;ufw allow 4200

&lt;span class="c"&gt;#Allow 5432 if you have PostgreSQL clients&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;ufw allow 5432

&lt;span class="c"&gt;#Allow 4300 from 192.168.0.202 (another cluster node in this example)&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;ufw allow proto tcp from 192.168.0.202 to any port 4300
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You may also want to consider network access control and/or a separate network adapter for intra-cluster communications.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deploying the new node
&lt;/h2&gt;

&lt;p&gt;Make sure the new node does not auto-bootstrap as a single-node instance, you may want to either create &lt;code&gt;/etc/crate/crate.yml&lt;/code&gt; in advance or use a mechanism as &lt;code&gt;policy-rcd-declarative&lt;/code&gt; as mentioned earlier.&lt;br&gt;
On the configuration file for the new node:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Set &lt;code&gt;discovery.seed_hosts&lt;/code&gt; to the full list of nodes, including the new one you are adding.&lt;/li&gt;
&lt;li&gt;Optionally set a &lt;code&gt;node.name&lt;/code&gt; , if not done the node get assigned a random name from the &lt;code&gt;sys.summits&lt;/code&gt; table. You may wonder what those default names are about, they are the names of mountains in the area around our main office, we love mountains at &lt;a href="http://crate.io/"&gt;Crate.io&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Set &lt;code&gt;cluster.name&lt;/code&gt; to a value that matches the other nodes in the cluster, if not specified the default cluster name is &lt;code&gt;crate&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Consider if you want to set the &lt;a href="https://crate.io/docs/crate/reference/en/latest/config/cluster.html#metadata-gateway"&gt;cluster-wide settings&lt;/a&gt; &lt;code&gt;gateway.expected_data_nodes&lt;/code&gt;, &lt;code&gt;gateway.recover_after_data_nodes&lt;/code&gt;, and/or &lt;code&gt;gateway.recover_after_time&lt;/code&gt; to prevent the unnecessary creation of new replicas and the rebalancing of shards when a node takes a little bit longer to start, or in case of transient issues, when the cluster is starting up from a situation where all nodes are shutdown. Please note these settings are used when the cluster is starting up from being offline, if you want to delay the allocation of replicas when a node becomes unavailable on a cluster that stays online there is &lt;a href="https://crate.io/docs/crate/reference/en/5.4/sql/statements/create-table.html#unassigned-node-left-delayed-timeout"&gt;a different setting at table level&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now we can start the &lt;code&gt;crate&lt;/code&gt; daemon, you will see the node joining the cluster and CrateDB will start using it for shards allocation.&lt;/p&gt;

&lt;p&gt;Remember to add the new nodes alongside the old ones in any monitoring system and load balancer configuration you may have in your environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Updating settings on the old nodes
&lt;/h2&gt;

&lt;p&gt;Now we need to align a number of settings in the other nodes, these are typically in the &lt;code&gt;/etc/crate/crate.yml&lt;/code&gt; file:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Update &lt;code&gt;discovery.seed_hosts&lt;/code&gt; adding the new node&lt;/li&gt;
&lt;li&gt;If you have configured &lt;code&gt;gateway.&lt;/code&gt; settings, update them to have the same values on all nodes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These settings only play a role during restart, not at runtime, so you do not need to restart the nodes after making these changes, but if the &lt;code&gt;gateway.&lt;/code&gt; settings need updating you may see a warning in the Admin UI which can be acknowledged.&lt;/p&gt;

&lt;p&gt;Please also note there is no need to update the &lt;code&gt;initial_master_nodes&lt;/code&gt; list, this is only considered during the initial cluster bootstrapping.&lt;/p&gt;

&lt;p&gt;And that is it, we have scaled out our cluster and we are ready to work with larger volumes of data. I hope you find this useful and, as usual, please do not hesitate to raise any thoughts or questions in the &lt;a href="https://community.crate.io/"&gt;CrateDB Community&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>clustering</category>
      <category>scaling</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>From data storage to data analysis: Tutorial on CrateDB and pandas 2</title>
      <dc:creator>marijaselakovic</dc:creator>
      <pubDate>Wed, 03 May 2023 06:55:15 +0000</pubDate>
      <link>https://dev.to/crate/from-data-storage-to-data-analysis-tutorial-on-cratedb-and-pandas-2-48i0</link>
      <guid>https://dev.to/crate/from-data-storage-to-data-analysis-tutorial-on-cratedb-and-pandas-2-48i0</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Pandas is an open-source data manipulation and analysis library for Python. It is widely used for handling and analyzing data in a variety of fields, including finance, research, etc.&lt;/p&gt;

&lt;p&gt;One of the key benefits of pandas is its ability to handle and manipulate large datasets, making it a valuable tool for data scientists and analysts. The library provides easy-to-use data structures and functions for data cleaning, transformation, and analysis, making it an essential part of the data analysis workflow.&lt;/p&gt;

&lt;p&gt;Using CrateDB and pandas together can be a powerful combination for handling large volumes of data and performing complex data analysis tasks. In this tutorial, we will showcase using the real-world dataset how to use CrateDB and pandas together for effective data analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Requirements
&lt;/h2&gt;

&lt;p&gt;To follow along with this tutorial, you will need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A running instance of CrateDB 5.2.&lt;/li&gt;
&lt;li&gt;Python 3.x with the &lt;a href="https://pandas.pydata.org/pandas-docs/version/2.0/whatsnew/v2.0.0.html" rel="noopener noreferrer"&gt;pandas 2&lt;/a&gt; and &lt;a href="https://github.com/crate/crate-python" rel="noopener noreferrer"&gt;crate 0.31&lt;/a&gt; packages installed.&lt;/li&gt;
&lt;li&gt;A real-world dataset in CSV format. In this tutorial, we will be using the shop customer data available on &lt;a href="https://www.kaggle.com/datasets/datascientistanna/customers-dataset" rel="noopener noreferrer"&gt;Kaggle&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Setting up CrateDB
&lt;/h2&gt;

&lt;p&gt;Before we can start using CrateDB, we need to set it up. You can either download and install CrateDB locally via &lt;a href="https://crate.io/docs/crate/tutorials/en/latest/basic/index.html#docker" rel="noopener noreferrer"&gt;Docker&lt;/a&gt; or &lt;a href="https://crate.io/docs/crate/tutorials/en/latest/basic/index.html#try-cratedb-without-installing" rel="noopener noreferrer"&gt;tarball&lt;/a&gt; or use a &lt;a href="https://crate.io/download?hsCtaTracking=caa20047-f2b6-4e8c-b7f9-63fbf818b17f%7Cf1ad6eaa-39ac-49cd-8115-ed7d5dac4d63" rel="noopener noreferrer"&gt;CrateDB Cloud&lt;/a&gt; instance with an option of the free cluster.&lt;/p&gt;

&lt;p&gt;Once you have a running instance of CrateDB, create a new table to store the customer data dataset. Here is an SQL command to create a table:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
 sql
CREATE TABLE IF NOT EXISTS "doc"."customer_data" (
   "customerid" INTEGER,
   "gender" TEXT,
   "age" INTEGER,
   "annualincome" INTEGER,
   "spendingscore" INTEGER,
   "profession" TEXT,
   "workexperience" INTEGER,
   "familysize" INTEGER
)


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;After creating the table, you can import the customer data dataset into CrateDB using the &lt;code&gt;COPY FROM&lt;/code&gt;command:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;COPY&lt;/span&gt; &lt;span class="nv"&gt;"doc"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"customer_data"&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="s1"&gt;'file:///path/to/Customers.csv'&lt;/span&gt; 
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'csv'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;delimiter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;','&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Once you have CrateDB running, you can start exploring data with pandas.&lt;/p&gt;

&lt;h2&gt;
  
  
  Querying data with CrateDB and pandas
&lt;/h2&gt;

&lt;p&gt;The first step is to import the &lt;code&gt;pandas&lt;/code&gt; library and specify the query you want to execute on CrateDB. In our example, we want to fetch all customer data.&lt;/p&gt;

&lt;p&gt;To read data from CrateDB and work with it in a pandas DataFrame use &lt;code&gt;read_sql&lt;/code&gt; method as illustrated below.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT * FROM customer_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;crate://localhost:4200&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In the above code, we establish a connection to a local CrateDB instance running on localhost on port 4200, execute a SQL query, and return the results as a pandas DataFrame. You can further modify the query to retrieve only the columns you need or to filter the data based on some condition.&lt;/p&gt;

&lt;h2&gt;
  
  
  Analyze the data
&lt;/h2&gt;

&lt;p&gt;Now that data are loaded into the pandas DataFrame, we can perform various analyses and manipulations on it. For instance, we can group the data by a certain column and calculate the average value of another column:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="n"&gt;avg_income&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;profession&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;annualincome&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In this example, we group the data in the DataFrame by the &lt;code&gt;profession&lt;/code&gt; column and calculate the average annual income for each profession. You can plot the data about average incomes using &lt;code&gt;df.plot()&lt;/code&gt; method, specifying the type of plot (a bar chart), and the columns to use for the x and y axes:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;

&lt;span class="n"&gt;income_by_profession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bar&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;legend&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rot&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We also use &lt;code&gt;plt.show()&lt;/code&gt; from &lt;code&gt;matplotlib&lt;/code&gt; to display the plot:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fly5008r8tol329qtihra.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fly5008r8tol329qtihra.png" alt="Plot"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrap up
&lt;/h2&gt;

&lt;p&gt;That's it! You should now have a good idea of how to use CrateDB and pandas together to analyze large datasets stored in CrateDB. This allows you to take advantage of the powerful data manipulation capabilities of pandas to analyze and visualize your data.&lt;br&gt;
 To learn more about updates, features, and other questions you might have, join our &lt;a href="https://community.crate.io/" rel="noopener noreferrer"&gt;CrateDB&lt;/a&gt; community.&lt;/p&gt;

</description>
      <category>python</category>
      <category>database</category>
      <category>dataframe</category>
      <category>pandas</category>
    </item>
    <item>
      <title>Guide to bitwise operators in CrateDB</title>
      <dc:creator>marijaselakovic</dc:creator>
      <pubDate>Thu, 02 Mar 2023 09:30:44 +0000</pubDate>
      <link>https://dev.to/crate/guide-to-bitwise-operators-in-cratedb-41ag</link>
      <guid>https://dev.to/crate/guide-to-bitwise-operators-in-cratedb-41ag</guid>
      <description>&lt;p&gt;Bitwise operators are useful because they allow you to perform efficient and concise operations on individual bits within integer values, which can be very useful in a variety of SQL queries.&lt;/p&gt;

&lt;p&gt;CrateDB continues to provide many valuable features. In the 5.2 version, we added support for bitwise operators. Now, you may wonder when this feature is handy. There are at least a couple of scenarios:&lt;/p&gt;

&lt;p&gt;If you want to store multiple pieces of information in a single column. For example, you can use a bitwise OR operator to combine multiple flags into a single value.&lt;/p&gt;

&lt;p&gt;Simplifying conditional statements. Bitwise operators can be used to check the state of individual bits within a value. For example, you can use a bitwise AND operator to check if a particular bit is set or not.&lt;/p&gt;

&lt;p&gt;Easier data manipulation. Bitwise operators can be used to set or manipulate specific bits within a value. For example, you can use a bitwise OR operator to set a particular bit to 1, or a bitwise XOR to swap bit value (1 to 0 or 0 to 1).&lt;/p&gt;

&lt;p&gt;CrateDB supports three bitwise operators:&lt;/p&gt;

&lt;p&gt;BITWISE AND (&amp;amp;)&lt;/p&gt;

&lt;p&gt;BITWISE OR (|)&lt;/p&gt;

&lt;p&gt;BITWISE XOR (#)&lt;/p&gt;

&lt;p&gt;Now, let’s take a look at each operator and some interesting examples.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bitwise AND
&lt;/h2&gt;

&lt;p&gt;The Bitwise AND operator compares each bit of two values and returns a new value with the bits that are set in both of the original values.&lt;/p&gt;

&lt;p&gt;The syntax for this operator is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;value1&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;value2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, &lt;code&gt;value1&lt;/code&gt; and &lt;code&gt;value2&lt;/code&gt; are the two values that you want to compare using the bitwise AND operator. These values can be any valid expressions or constants in SQL, such as columns, variables, or literals.&lt;/p&gt;

&lt;p&gt;For example, let’s imagine a table of &lt;code&gt;employees&lt;/code&gt; with a column &lt;code&gt;status&lt;/code&gt; that stores a &lt;a href="https://crate.io/docs/crate/reference/en/5.1/general/ddl/data-types.html#bit-strings" rel="noopener noreferrer"&gt;bitmask&lt;/a&gt; value representing the status of each employee. The status value is stored as a byte and each bit represents a different aspect of the employee status. To save storage space, we recommend a byte data structure for storing up to 7 states simultaneously. The first bit always represents a sign and we don't use negative values to encode states. Similarly, use short data structure for storing up to 15 states, integer for up to 31 states, long for up to 63 states, and for more than 63 states that use Bit String type. To learn more about data types supported in CrateDB, check out our &lt;a href="https://crate.io/docs/crate/reference/en/5.1/general/ddl/data-types.html" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The first bit in status value says whether the employee is working full-time, the second bit says if the employee is remotely, and so on. For instance, if the employee is working full-time, the status value will be 1 (&lt;code&gt;B'01'&lt;/code&gt;) if remote the status value will be 2 (&lt;code&gt;B'10'&lt;/code&gt;) and if both, the status value will be 3 (&lt;code&gt;B'11'&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;Now, let’s create a table and populate it with sample data&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="nb"&gt;BYTE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;comment&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;comment&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Ana'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Ana is working full-time from office'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Mary'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Mary is working full-time remotely'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Sara'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Sara is working part-time remotly'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To select all employees who are working full-time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;comment&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;#&lt;/span&gt; &lt;span class="n"&gt;Ana&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Mary&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This query will select all rows from the employees table where the first bit (full-time status) is set to 1.&lt;/p&gt;

&lt;p&gt;To find all employees who are working full-time and remotely:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;comment&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="o"&gt;#&lt;/span&gt; &lt;span class="n"&gt;Mary&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also use the bitwise AND operator in combination with other logical operators, such as OR and NOT, to create more complex queries.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bitwise OR
&lt;/h2&gt;

&lt;p&gt;The bitwise OR operator is used to compare two binary values and return a new binary value where the resulting bit is set to 1 if either of the input bits is 1. It is represented by the symbol &lt;code&gt;|&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;To understand how the bitwise OR operator works, let's consider two binary values: 1011 and 1100. The bitwise OR operation would compare each bit position of the two values and return a new binary value based on the following rules:&lt;/p&gt;

&lt;p&gt;If either of the input bits is 1, the resulting bit is set to 1.&lt;/p&gt;

&lt;p&gt;If both of the input bits are 0, the resulting bit is set to 0.&lt;/p&gt;

&lt;p&gt;Applying these rules to the example above, we get the following result: &lt;code&gt;1011 | 1100 = 1111&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The resulting binary value is 1111, which is equivalent to 15 in decimal.&lt;/p&gt;

&lt;p&gt;Considering the example with &lt;code&gt;employees&lt;/code&gt; table, let’s say we would like to select all employees that are working full-time, remotely or both. Here's how you can use the bitwise OR operator in the query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;comment&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The following query illustrates how to change a flag specifying whether an employee is working full-time without changing the existing flags:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="cm"&gt;/* FULL-TIME */&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'Sara'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It’s important to note that the bitwise OR operator only works with binary values. If you want to perform a logical OR operation with non-binary values, you can use the OR operator in SQL.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bitwise XOR
&lt;/h2&gt;

&lt;p&gt;The bitwise XOR operator, represented by the symbol &lt;code&gt;#&lt;/code&gt;, compares two binary values and returns a new value based on the following rules:&lt;/p&gt;

&lt;p&gt;If both values are 0, the result is 0.&lt;br&gt;
If both values are 1, the result is 0.&lt;br&gt;
If one value is 1 and the other is 0, the result is 1.&lt;br&gt;
Using these rules in the example from above, the result of the XOR operator is: &lt;code&gt;1011 # 1100 = 0111&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;One interesting example of using the bitwise XOR operator is changing the status of an employee from “working remotely” to “working in the office“ and vice versa.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;#&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Ana'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, the status of the employee with the name Ana will be toggled from “working remotely“ to “working in the office“ or vice versa, depending on the current value of the status field.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrap up
&lt;/h2&gt;

&lt;p&gt;In summary, the bitwise operators in CrateDB allow you to perform bitmasking operations on values in your database and use the resulting values to filter or modify rows in your tables. By combining this operator with other logical operators, you can create powerful queries that can manipulate and extract specific data from your database. If you have any further questions or would like to learn more about CrateDB, check out our &lt;a href="https://crate.io/docs" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; and join the &lt;a href="https://community.crate.io/" rel="noopener noreferrer"&gt;CrateDB community&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>watercooler</category>
      <category>employment</category>
      <category>career</category>
    </item>
    <item>
      <title>Replicating data from other databases to CrateDB with Debezium and Kafka</title>
      <dc:creator>Hernán Lionel Cianfagna</dc:creator>
      <pubDate>Tue, 28 Feb 2023 16:29:49 +0000</pubDate>
      <link>https://dev.to/crate/replicating-data-from-other-databases-to-cratedb-with-debezium-and-kafka-1981</link>
      <guid>https://dev.to/crate/replicating-data-from-other-databases-to-cratedb-with-debezium-and-kafka-1981</guid>
      <description>&lt;p&gt;You may have line-of-business applications such as ERP software that work with transactional database systems like MSSQL, Oracle, or MySQL.&lt;/p&gt;

&lt;p&gt;The setup may work perfectly fine for day-to-day operations, but you may find that it is not ideal for doing data analytics.&lt;/p&gt;

&lt;p&gt;Attempting to run analytic workloads against the operational databases you may see concurrency issues deriving from locking, the analytics queries may have an impact on the performance of business-critical operations, and you may also find that the performance and feature-set in the transactional database system may not be good enough for analyzing large amounts of data.&lt;/p&gt;

&lt;p&gt;Considering this, many organisations come to the conclusion they need to copy data to a separate environment to run reporting and dashboards, this is sometimes done with replication, sometimes with backups, and sometimes with complex ETL pipelines. This often comes with a set of challenges:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ballooning license costs&lt;/li&gt;
&lt;li&gt;custom ad-hoc routines for getting the data to the analytics environment, requiring development, monitoring, and troubleshooting&lt;/li&gt;
&lt;li&gt;a need to design and maintain an indexing strategy for the analytics copy of the data&lt;/li&gt;
&lt;li&gt;high availability requirements for the analytics environment as the business starts relying on it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We know we can address several of these points by using a system like CrateDB. CrateDB is a feature-rich, open-source, SQL database which out-of-the-box automatically implements indexes, compression, and a columnar store so that most analytical queries can run much faster without any need to fiddle with settings. Because it is open-source, there is no need to be concerned about licensing expenses. Additionally, it can scale horizontally, which means that the number of nodes can be adjusted as needed to handle changing data volumes and workloads, and it provides high availability without requiring administrative effort.&lt;/p&gt;

&lt;p&gt;If only we could replicate data from our operational database to CrateDB without having to write custom code… it turns out we can.&lt;/p&gt;

&lt;p&gt;Enter Debezium, &lt;a href="https://debezium.io/"&gt;Debezium &lt;/a&gt;is a standard open-source system, built on top of Kafka, which allows to capture changes on a source database system and replicate them on another system without having to write custom scripts.&lt;/p&gt;

&lt;p&gt;In this post I want to show an example replicating changes on a table from MSSQL to CrateDB.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup on the MSSQL side
&lt;/h2&gt;

&lt;p&gt;We will need a SQL Server instance with the SQL Server Agent service up and running, if you are running MSSQL on a container you can get the agent running by setting the environment variable &lt;code&gt;MSSQL_AGENT_ENABLED&lt;/code&gt; to &lt;code&gt;True&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Connect to the instance with a client such as &lt;code&gt;sqlcmd&lt;/code&gt;, SSMS, or &lt;a href="https://dbeaver.io/"&gt;DBeaver&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We are now going to go through a number of steps, if you already have a working system feel free to skip the operations you do not need.&lt;/p&gt;

&lt;p&gt;Let’s create a database with a test table on it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;erp&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;GO&lt;/span&gt;
&lt;span class="n"&gt;USE&lt;/span&gt; &lt;span class="n"&gt;erp&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; 
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;dbo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tbltest&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="k"&gt;IDENTITY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;createdon&lt;/span&gt; &lt;span class="nb"&gt;DATETIME&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;getdate&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;srcsystem&lt;/span&gt; &lt;span class="n"&gt;NVARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;max&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s now create an account for Debezium to use to pull the changes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;LOGIN&lt;/span&gt; &lt;span class="n"&gt;debeziumlogin&lt;/span&gt; &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;PASSWORD&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'&amp;lt;enterStrongPasswordHere&amp;gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;USER&lt;/span&gt; &lt;span class="n"&gt;debeziumuser&lt;/span&gt; &lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="n"&gt;LOGIN&lt;/span&gt; &lt;span class="n"&gt;debeziumlogin&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;ROLE&lt;/span&gt; &lt;span class="n"&gt;debeziumrole&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;EXEC&lt;/span&gt; &lt;span class="n"&gt;sp_addrolemember&lt;/span&gt; &lt;span class="s1"&gt;'debeziumrole'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'debeziumuser'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;EXEC&lt;/span&gt; &lt;span class="n"&gt;sp_addrolemember&lt;/span&gt; &lt;span class="s1"&gt;'db_datareader'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'debeziumuser'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And let’s enable change data capture on our example table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;EXEC&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sp_cdc_enable_db&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;erp&lt;/span&gt; &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="n"&gt;FILEGROUP&lt;/span&gt; &lt;span class="n"&gt;cdcfg&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;erp&lt;/span&gt; &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="n"&gt;FILE&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;erp_cdc_file1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;FILENAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'/var/opt/mssql/data/erp_cdc_file1.ndf'&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;FILEGROUP&lt;/span&gt; &lt;span class="n"&gt;cdcfg&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;EXEC&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sp_cdc_enable_table&lt;/span&gt;
    &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;source_schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'dbo'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;source_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'tbltest'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;role_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'debeziumrole'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;filegroup_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'cdcfg'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;supports_net_changes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Setup on the CrateDB side
&lt;/h2&gt;

&lt;p&gt;We will need a CrateDB instance, for the purpose of this example we can spin one up with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;docker.io
&lt;span class="nb"&gt;sudo &lt;/span&gt;docker run &lt;span class="nt"&gt;--publish&lt;/span&gt; 4200:4200 &lt;span class="nt"&gt;--publish&lt;/span&gt; 5432:5432 crate:latest &lt;span class="nt"&gt;-Cdiscovery&lt;/span&gt;.type&lt;span class="o"&gt;=&lt;/span&gt;single-node
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we need to run a couple of SQL commands on this instance, an easy way to do this is using the Admin UI that can be accessed navigating with a web browser to port 4200 on the server where CrateDB is running, for instance &lt;code&gt;http://localhost:4200&lt;/code&gt; and then open the console (second icon from the top on the left-hand side navigation bar).&lt;/p&gt;

&lt;p&gt;We will create a user account for Debezium to use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;USER&lt;/span&gt; &lt;span class="n"&gt;debezium&lt;/span&gt; &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'debeziumpwdincratedb123'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The table on our MSSQL source is on the &lt;code&gt;dbo&lt;/code&gt; schema, let’s imagine we want to have a &lt;code&gt;dbo&lt;/code&gt; schema on CrateDB as well, the &lt;code&gt;debezium&lt;/code&gt; account will need permissions on it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="n"&gt;DQL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;DML&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;DDL&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="n"&gt;dbo&lt;/span&gt; &lt;span class="k"&gt;to&lt;/span&gt; &lt;span class="n"&gt;debezium&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And let’s create the structure of the table that will receive the data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;dbo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tbltest&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="cm"&gt;/* we need the PK definition to match the source table so that this can be used to lookup records when they need to be updated */&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;createdon&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="cm"&gt;/* CrateDB supports defaults -of course- but because the source table already has a default value we do not need that here */&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;srcsystem&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Zookeeper and Kafka
&lt;/h2&gt;

&lt;p&gt;To use Debezium we will need to have working setups of Zookeeper and Kafka.&lt;/p&gt;

&lt;p&gt;For the purpose of this example I will spin them up with containers on the same machine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;docker run &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;--name&lt;/span&gt; zookeeper &lt;span class="nt"&gt;-p&lt;/span&gt; 2181:2181 &lt;span class="nt"&gt;-p&lt;/span&gt; 2888:2888 &lt;span class="nt"&gt;-p&lt;/span&gt; 3888:3888 debezium/zookeeper
&lt;span class="nb"&gt;sudo &lt;/span&gt;docker run &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;--name&lt;/span&gt; kafka &lt;span class="nt"&gt;-p&lt;/span&gt; 9092:9092 &lt;span class="nt"&gt;--link&lt;/span&gt; zookeeper:zookeeper &lt;span class="nt"&gt;--add-host&lt;/span&gt; host.docker.internal:host-gateway debezium/kafka
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We need to create some special topics in Kafka:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;docker &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; kafka &lt;span class="s2"&gt;"bash"&lt;/span&gt;
bin/kafka-topics.sh &lt;span class="nt"&gt;--create&lt;/span&gt; &lt;span class="nt"&gt;--replication-factor&lt;/span&gt; 1 &lt;span class="nt"&gt;--partitions&lt;/span&gt; 1 &lt;span class="nt"&gt;--topic&lt;/span&gt; my_connect_configs &lt;span class="nt"&gt;--bootstrap-server&lt;/span&gt; host.docker.internal:9092 &lt;span class="nt"&gt;--config&lt;/span&gt; cleanup.policy&lt;span class="o"&gt;=&lt;/span&gt;compact
bin/kafka-topics.sh &lt;span class="nt"&gt;--create&lt;/span&gt; &lt;span class="nt"&gt;--replication-factor&lt;/span&gt; 1 &lt;span class="nt"&gt;--partitions&lt;/span&gt; 1 &lt;span class="nt"&gt;--topic&lt;/span&gt; my_connect_offsets &lt;span class="nt"&gt;--bootstrap-server&lt;/span&gt; host.docker.internal:9092 &lt;span class="nt"&gt;--config&lt;/span&gt; cleanup.policy&lt;span class="o"&gt;=&lt;/span&gt;compact
&lt;span class="nb"&gt;exit&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Please note this is a very basic setup, for production purposes you may want to adjust some of &lt;a href="https://kafka.apache.org/documentation/#topicconfigs"&gt;these settings&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Preparing and starting a Debezium container image
&lt;/h2&gt;

&lt;p&gt;We need to customize the base &lt;code&gt;debezium/connect&lt;/code&gt; Docker image adding a JDBC sink and the PostgreSQL drivers.&lt;/p&gt;

&lt;p&gt;For this we need to download the zip file from &lt;a href="https://www.confluent.io/hub/confluentinc/kafka-connect-jdbc"&gt;kafka-connect-jdbc&lt;/a&gt; and then run the below replacing &lt;code&gt;*************&lt;/code&gt; with the appropriate URL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir &lt;/span&gt;customdockerimg
&lt;span class="nb"&gt;cd &lt;/span&gt;customdockerimg
wget &lt;span class="k"&gt;*************&lt;/span&gt;/confluentinc-kafka-connect-jdbc-10.6.3.zip
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;unzip
&lt;span class="nb"&gt;mkdir &lt;/span&gt;confluentinc-kafka-connect-jdbc-10.6.3
&lt;span class="nb"&gt;cd &lt;/span&gt;confluentinc-kafka-connect-jdbc-10.6.3
unzip &lt;span class="nt"&gt;-j&lt;/span&gt; ../confluentinc-kafka-connect-jdbc-10.6.3.zip
&lt;span class="nb"&gt;cd&lt;/span&gt; ..
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; Dockerfile &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;  
FROM debezium/connect
USER root:root
COPY ./confluentinc-kafka-connect-jdbc-10.6.3/ /kafka/connect/
RUN cd /kafka/libs &amp;amp;&amp;amp; curl -sO https://jdbc.postgresql.org/download/postgresql-42.5.4.jar
USER 1001
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;docker build &lt;span class="nt"&gt;-t&lt;/span&gt; cratedb-connect-debezium &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s now start this custom image:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;docker run &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;--name&lt;/span&gt; connect &lt;span class="nt"&gt;-p&lt;/span&gt; 8083:8083 &lt;span class="se"&gt;\&lt;/span&gt;
           &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;GROUP_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="se"&gt;\&lt;/span&gt;
           &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;CONFIG_STORAGE_TOPIC&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;my_connect_configs &lt;span class="se"&gt;\&lt;/span&gt;
           &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;OFFSET_STORAGE_TOPIC&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;my_connect_offsets &lt;span class="se"&gt;\&lt;/span&gt;
           &lt;span class="nt"&gt;--add-host&lt;/span&gt; host.docker.internal:host-gateway &lt;span class="se"&gt;\&lt;/span&gt;
           &lt;span class="nt"&gt;--add-host&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;hostname&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;:host-gateway &lt;span class="se"&gt;\&lt;/span&gt;
           &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;BOOTSTRAP_SERVERS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;host.docker.internal:9092 &lt;span class="se"&gt;\&lt;/span&gt;
           &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;KEY_CONVERTER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;org.apache.kafka.connect.json.JsonConverter &lt;span class="se"&gt;\&lt;/span&gt;
           &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;VALUE_CONVERTER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;org.apache.kafka.connect.json.JsonConverter &lt;span class="se"&gt;\&lt;/span&gt;
           cratedb-connect-debezium
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This assumes Kafka is running locally on the same server, you will need to adjust &lt;code&gt;BOOTSTRAP_SERVERS&lt;/code&gt; if that is not the case.&lt;/p&gt;

&lt;h2&gt;
  
  
  Configure a source connector
&lt;/h2&gt;

&lt;p&gt;Let’s create a &lt;code&gt;connector.json&lt;/code&gt; file as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mssql-source-tbltest"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"config"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"connector.class"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"io.debezium.connector.sqlserver.SqlServerConnector"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"tasks.max"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;

        &lt;/span&gt;&lt;span class="nl"&gt;"database.history.kafka.bootstrap.servers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"host.docker.internal:9092"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"schema.history.internal.kafka.bootstrap.servers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"host.docker.internal:9092"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"topic.prefix"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cratedbdemo"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"database.encrypt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"false"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;

        &lt;/span&gt;&lt;span class="nl"&gt;"database.hostname"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"host.docker.internal"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"database.port"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1433"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"database.user"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"debeziumlogin"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"database.password"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;enterStrongPasswordHere&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"database.server.name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mssql-server"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;

        &lt;/span&gt;&lt;span class="nl"&gt;"database.names"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"erp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;        
        &lt;/span&gt;&lt;span class="nl"&gt;"table.whitelist"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"dbo.tbltest"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;        
        &lt;/span&gt;&lt;span class="nl"&gt;"database.history.kafka.topic"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"schema-changes.mssql-server.tbltest"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"schema.history.internal.kafka.topic"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"schema-changes.inventory.mssql-server.tbltest"&lt;/span&gt;&lt;span class="w"&gt;              
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can observe that there are settings there concerning the Kafka setup to use, the details to connect to MSSQL, the name of the table that we want to pull changes from, and the Kafka topics that will be used to track these changes.&lt;/p&gt;

&lt;p&gt;Let’s deploy this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Accept:application/json"&lt;/span&gt; &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type:application/json"&lt;/span&gt; http://localhost:8083/connectors/ &lt;span class="nt"&gt;-d&lt;/span&gt; @connector.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Configure a target
&lt;/h2&gt;

&lt;p&gt;Let’s create a &lt;code&gt;destination-connector.json&lt;/code&gt; file as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cratedb-sink-tbltest"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"config"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"connector.class"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"io.confluent.connect.jdbc.JdbcSinkConnector"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"tasks.max"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;       

        &lt;/span&gt;&lt;span class="nl"&gt;"connection.url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"jdbc:postgresql://host.docker.internal:5432/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;              
        &lt;/span&gt;&lt;span class="nl"&gt;"connection.user"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"debezium"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;      
        &lt;/span&gt;&lt;span class="nl"&gt;"connection.password"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"debeziumpwdincratedb123"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;               

        &lt;/span&gt;&lt;span class="nl"&gt;"topics"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cratedbdemo.erp.dbo.tbltest"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
        &lt;/span&gt;&lt;span class="nl"&gt;"table.name.format"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"dbo.tbltest"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"auto.create"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"false"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"auto.evolve"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"false"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;

        &lt;/span&gt;&lt;span class="nl"&gt;"insert.mode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"upsert"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"pk.fields"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"pk.mode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"record_value"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;      

        &lt;/span&gt;&lt;span class="nl"&gt;"transforms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"unwrap"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;                                                 
        &lt;/span&gt;&lt;span class="nl"&gt;"transforms.unwrap.type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"io.debezium.transforms.ExtractNewRecordState"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We got details to connect to CrateDB, the name of table that will receive the changes (please note this is case sensitive), and some transform instructions to flatten the JSON data stored in the Kafka topic.&lt;/p&gt;

&lt;p&gt;Let’s deploy this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Accept:application/json"&lt;/span&gt; &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type:application/json"&lt;/span&gt; http://localhost:8083/connectors/ &lt;span class="nt"&gt;-d&lt;/span&gt; @destination-connector.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Testing
&lt;/h2&gt;

&lt;p&gt;Let’s see this in action.&lt;/p&gt;

&lt;p&gt;Let’s create a record from the MSSQL side:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;erp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dbo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tbltest&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;srcsystem&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;@@&lt;/span&gt;&lt;span class="k"&gt;version&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And now let’s go to CrateDB and check the table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;dbo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tbltest&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As magic the record is there.&lt;/p&gt;

&lt;p&gt;Let’s now try an update from the MSSQL side:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;erp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dbo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tbltest&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;srcsystem&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Updated successfully'&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgs04ve699lndjq45vg8u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgs04ve699lndjq45vg8u.png" alt="Image description" width="800" height="222"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Using Debezium we can replicate changes from different database systems to CrateDB without having to develop any custom logic, we can then take advantage of CrateDB’s performance and features for our analytic workloads.&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>database</category>
      <category>architecture</category>
      <category>tooling</category>
    </item>
    <item>
      <title>Find the latest reported values with ease. Introducing max_by and min_by aggregations in CrateDB 5.2</title>
      <dc:creator>marijaselakovic</dc:creator>
      <pubDate>Thu, 23 Feb 2023 08:42:04 +0000</pubDate>
      <link>https://dev.to/crate/find-the-latest-reported-values-with-ease-introducing-maxby-and-minby-aggregations-in-cratedb-52-3ig8</link>
      <guid>https://dev.to/crate/find-the-latest-reported-values-with-ease-introducing-maxby-and-minby-aggregations-in-cratedb-52-3ig8</guid>
      <description>&lt;p&gt;&lt;strong&gt;CrateDB 5.2 adds two new aggregation functions: max_by and min_by&lt;/strong&gt;&lt;br&gt;
These aggregation functions allow users to quickly and easily search the value of one column based on the minimum or maximum value of another column, making them useful for analyzing trends, identifying outliers, or simply understanding the range of values within a dataset. An example use case is getting the latest measurement by using the time column and &lt;code&gt;max_by(measurement, time)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;MIN_BY&lt;/code&gt; and &lt;code&gt;MAX_BY&lt;/code&gt; functions allow you to find the minimum or maximum value in a given column based on the values in another column. For example, if you have a table with two columns, product, category and price, you can use the &lt;code&gt;min_by(product, price)&lt;/code&gt; to find the row with the product with the lowest price in each category:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;min_by&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;cheapest_product&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;product_list&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;max_by(returned_value, maximized_value)&lt;/code&gt; and &lt;code&gt;min_by(returned_value, minimized_value)&lt;/code&gt; return the value of the first column for which the value of the second column is maximized or minimized. If multiple rows maximize or minimize the result of the second column, the output will be non-deterministic and CrateDB can return any value from the list of resulting rows.&lt;/p&gt;

&lt;p&gt;Both &lt;code&gt;max_by&lt;/code&gt; and &lt;code&gt;min_by&lt;/code&gt; can be used for numerical and non-numerical data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Load the dataset
&lt;/h2&gt;

&lt;p&gt;Let’s start with examples using the dataset about power consumption. First, create a table with the schema below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;power_consumption&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
   &lt;span class="nv"&gt;"ts"&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="nb"&gt;TIME&lt;/span&gt; &lt;span class="k"&gt;ZONE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="nv"&gt;"Global_active_power"&lt;/span&gt; &lt;span class="nb"&gt;REAL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="nv"&gt;"Global_reactive_power"&lt;/span&gt; &lt;span class="nb"&gt;REAL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="nv"&gt;"Voltage"&lt;/span&gt; &lt;span class="nb"&gt;REAL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="nv"&gt;"Global_intensity"&lt;/span&gt; &lt;span class="nb"&gt;REAL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="nv"&gt;"Sub_metering_1"&lt;/span&gt; &lt;span class="nb"&gt;REAL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="nv"&gt;"Sub_metering_2"&lt;/span&gt; &lt;span class="nb"&gt;REAL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="nv"&gt;"Sub_metering_3"&lt;/span&gt; &lt;span class="nb"&gt;REAL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="nv"&gt;"meter_id"&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="nv"&gt;"location"&lt;/span&gt; &lt;span class="n"&gt;GEO_POINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="nv"&gt;"city"&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To import data, use the following &lt;code&gt;COPY FROM&lt;/code&gt; command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;COPY&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;power_consumption&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="s1"&gt;'https://srv.demo.crate.io/datasets/power_consumption.json'&lt;/span&gt;
&lt;span class="k"&gt;RETURN&lt;/span&gt; &lt;span class="n"&gt;SUMMARY&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The dataset illustrates the consumption data for a few years and shows the differences between several measured utilities. For instance, the column &lt;code&gt;"Sub_metering_1"&lt;/code&gt; shows how much energy is consumed in the kitchen. Similarly, columns &lt;code&gt;"Sub_metering_2"&lt;/code&gt; and &lt;code&gt;"Sub_metering_3"&lt;/code&gt; show the consumed energy in laundry and climate control systems. The full description of the dataset can be found here.&lt;/p&gt;

&lt;p&gt;Example queries&lt;br&gt;
Given the data set, let’s find the ids of house meters that had the highest consumption at one point in time for the kitchen and laundry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;max_by&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;meter_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"Sub_metering_1"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;max_kitchen&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
       &lt;span class="n"&gt;max_by&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;meter_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"Sub_metering_2"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;max_laundry&lt;/span&gt;  
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;power_consumption&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result of this query should contain the following meter ids:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-------------+-------------+
| max_kitchen | max_laundry |
+-------------+-------------+
| 84007B127R  | 840070504U  |
+-------------+-------------+
SELECT 1 row in set (7.423 sec)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Another example would be to find the meter id for the house with the lowest unused power:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;min_by&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;meter_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"Global_reactive_power"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;min_unused&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;power_consumption&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The return value will tell us for which house meter we had the lowest value of unused power:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+------------+
| min_unused |
+------------+
| 84007B008L |
+------------+
SELECT 1 row in set (0.197 sec)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also combine these functions with &lt;code&gt;WHERE&lt;/code&gt; or &lt;code&gt;GROUP BY&lt;/code&gt; clauses in CrateDB. For example, let's find for each meter id, the consumption of other, unmapped appliances when the unused power was the lowest:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="n"&gt;meter_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;min_by&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"Global_active_power"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nv"&gt;"Global_reactive_power"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_consumption&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;power_consumption&lt;/span&gt; 
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;meter_id&lt;/span&gt; 
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The query result will list the consumption for each meter id:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+------------+-------------------+
| meter_id   | total_consumption |
+------------+-------------------+
| 840073190N |             0.202 |
| 840071457E |             0.258 |
| 840072897V |             0.14  |
| 840072655G |             0.218 |
| 840072219H |             0.274 |
| 840071893D |             1.342 |
| 840075260N |             0.246 |
| 840076398A |             0.226 |
| 840072328B |             0.212 |
| 840071760J |             0.222 |
+------------+-------------------+
SELECT 10 rows in set (0.067 sec)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Performance and alternatives
&lt;/h2&gt;

&lt;p&gt;As you have seen above &lt;code&gt;min_by&lt;/code&gt; and &lt;code&gt;max_by&lt;/code&gt; provide a very concise and convenient way to easily find the value of one column based on the minimum or maximum value of another column. Not only is it a convenient feature, but it also provides significant performance gains to alternative queries one had to write in earlier versions of CrateDB.&lt;/p&gt;

&lt;p&gt;Let us look at another example and see how much easier and faster it is to get the right results in CrateDB 5.2. We start with our often-used dataset containing IoT device data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;devices&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;readings&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt; 
       &lt;span class="nv"&gt;"time"&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="nb"&gt;TIME&lt;/span&gt; &lt;span class="k"&gt;ZONE&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
       &lt;span class="n"&gt;device_id&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
       &lt;span class="n"&gt;battery_level&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
       &lt;span class="n"&gt;battery_status&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;CLUSTERED&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;device_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt; &lt;span class="n"&gt;SHARDS&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We want to find the latest reported &lt;code&gt;battery_level&lt;/code&gt; and &lt;code&gt;battery_status&lt;/code&gt; for each device in our dataset holding 30 Million total records. In CrateDB 5.1 and earlier versions, one could fallback to a 2-step approach and use a JOIN like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;device_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;battery_level&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;battery_temperature&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;devices&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;readings&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;
    &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
            &lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;device_id&lt;/span&gt;
        &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;devices&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;readings&lt;/span&gt;
        &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;device_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;max_r&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;max_r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;time&lt;/span&gt;
    &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;max_r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;device_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;device_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not only makes the nested structure query adjustments more difficult and one needs to remember the pattern, but also the performance - due to the expensive &lt;code&gt;JOIN&lt;/code&gt; - is not really that great with a runtime of roughly 9 seconds:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Runtime (in ms):
    mean:    8982.507 ± 57.494
    min/max: 8578.380 → 9843.830
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is an improvement of 85% in query speed (8.9s → 1.3s), also using a simpler syntax.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrap up
&lt;/h2&gt;

&lt;p&gt;Overall, the &lt;code&gt;max_by&lt;/code&gt; and &lt;code&gt;min_by&lt;/code&gt; functions in CrateDB provide an easy and efficient way to find the maximum or minimum value of a given column in a table based on the values in a different column. These functions can be used in a variety of scenarios to quickly and easily find the highest or lowest values in a set of data.&lt;/p&gt;

&lt;p&gt;If you like this blog post and want to learn more about CrateDB, check out our documentation and join the CrateDB community!&lt;/p&gt;

</description>
      <category>cratedb</category>
      <category>sql</category>
      <category>performance</category>
      <category>aggregations</category>
    </item>
    <item>
      <title>Guide to sharding and partitioning best practices in CrateDB</title>
      <dc:creator>marijaselakovic</dc:creator>
      <pubDate>Fri, 17 Feb 2023 14:36:47 +0000</pubDate>
      <link>https://dev.to/crate/guide-to-sharding-and-partitioning-best-practices-in-cratedb-5207</link>
      <guid>https://dev.to/crate/guide-to-sharding-and-partitioning-best-practices-in-cratedb-5207</guid>
      <description>&lt;p&gt;Sharding and partitioning are very important concepts when it comes to system scaling. When defining your strategy, you should account upfront for any future growth, given the significant burden of moving data and restructuring the tables. In this article, we will give you a thorough understanding of how sharding and partitioning work in CrateDB. We will start by covering the basic definitions, discussing the principles behind shard distribution and replication in CrateDB, and how to avoid common bottlenecks. &lt;/p&gt;

&lt;h2&gt;
  
  
  Partition, shard, and Lucene index
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;A table&lt;/strong&gt; in CrateDB is a collection of data. It consists of a specified number of columns and any number of rows. Every table must have a schema describing the table structure. Very often, the table is divided into independent, smaller parts based on a particular column. This “smaller part“ is called &lt;strong&gt;partition&lt;/strong&gt;. The table becomes a partitioned table if, during the table creation, a partition column is defined. In this case, when a new record is inserted, a new partition is created if the partition for the same column value doesn’t exist yet. Partitioning is done for easier maintenance of large tables and improving the performance of particular SQL operations. However, a bad selection of a partition column can lead to too many partitions, which can slow the system's performance.&lt;/p&gt;

&lt;p&gt;Now let’s take a look into the concept of shard. In CrateDB, &lt;strong&gt;the shard&lt;/strong&gt; is a division of a table or a table partition based on a configurable number and stored on a node in the cluster. When a node is added or removed from the cluster or when the data distribution becomes unbalanced, CrateDB automatically redistributes the shards across the nodes in the cluster to ensure an even data distribution. If the number of shards is not defined during table creation, CrateDB will apply a sensible default value depending on &lt;a href="https://crate.io/docs/crate/reference/en/3.3/sql/statements/create-table.html#ref-clustered-clause" rel="noopener noreferrer"&gt;the number of nodes&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Finally, each shard in CrateDB represents a Lucene index. A Lucene index is a collection of Lucene segments where each segment represents an inverted index, doc value, or k-d trees. Take a look at &lt;a href="https://crate.io/blog/guide-to-write-operations-in-cratedb" rel="noopener noreferrer"&gt;our previous article&lt;/a&gt; to get a better overview of the Lucene index and Lucene segments. The diagram below illustrates the best connection between partitions, shards, and the Lucene index.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffxtqdabscguhpg6tjp9y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffxtqdabscguhpg6tjp9y.png" alt="connections" width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How the data is distributed
&lt;/h2&gt;

&lt;p&gt;CrateDB uses &lt;em&gt;row-based sharding&lt;/em&gt; to split data across multiple shards. This means that data is split based on the values in specific columns of the table, and distributed across multiple nodes. The column that the data should be sharded on is called the &lt;em&gt;routing column&lt;/em&gt;. When you insert, update, or delete a row, CrateDB will use the routing column value to determine which shard to access. The number of shards and the routing column can be specified in the &lt;a href="https://crate.io/docs/crate/reference/en/5.1/sql/statements/create-table.html#clustered" rel="noopener noreferrer"&gt;CLUSTERED&lt;/a&gt; clause when creating a table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;product&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
     &lt;span class="n"&gt;product_id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;CLUSTERED&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="n"&gt;SHARDS&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the above example, the product table is sharded into four shards. If the primary key is set, as illustrated, the routing column can be ignored, as CrateDB uses the primary key for routing by default. However, if the primary key or the routing column is not set, the &lt;a href="https://crate.io/docs/crate/reference/en/5.1/general/ddl/system-columns.html#sql-administration-system-column-id" rel="noopener noreferrer"&gt;internal document ID&lt;/a&gt; is used.&lt;/p&gt;

&lt;p&gt;To distribute data across the cluster, CrateDB uses a hash-based approach based on the simple formula:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;shard number = hash(routing column) % total primary shards&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As a result, all documents with the same value of &lt;code&gt;CLUSTERED BY&lt;/code&gt; column will be stored in the same shard. With a hash function, CrateDB will try to distribute the data roughly equally, even if the original data values are not evenly distributed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Shard replication
&lt;/h2&gt;

&lt;p&gt;Shard replication in CrateDB is a feature that allows you to replicate data across multiple nodes in a cluster. This can be useful for increasing data availability, improving performance, and reducing the risk of data loss. You can &lt;a href="https://crate.io/docs/crate/reference/en/5.1/sql/statements/create-table.html#sql-create-table-number-of-replicas" rel="noopener noreferrer"&gt;configure&lt;/a&gt; the number of replicas for each shard with the &lt;code&gt;number_of_replicas&lt;/code&gt; table setting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;product&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
     &lt;span class="n"&gt;product_id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;number_of_replicas&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When there are multiple copies of the same shard, CrateDB will mark one copy as the &lt;em&gt;primary shard&lt;/em&gt; and treat the rest as &lt;em&gt;replica shards&lt;/em&gt;. When data is written to a shard, it is first written to the primary shard. The primary shard then replicates the data to one or more replica shards on other nodes in the cluster. This process is done in real-time, ensuring that data is always up-to-date across all replicas.&lt;/p&gt;

&lt;p&gt;In the event of a node failure, CrateDB will automatically promote one of the replica shards to become the new primary shard. Having more shard replicas means a lower chance of permanent data loss and more throughput as queries will utilize the extra replica shards so that the primary shard is not congested with many requests. In terms of the cost you pay, you will have higher disk space utilization and inter-node network traffic which leads to increased latency of inserts and updates. Additionally, CrateDB supports automatic failover, where the system &lt;em&gt;automatically&lt;/em&gt; detects a failed node and promotes a replica shard to take its place as the primary shard.&lt;/p&gt;

&lt;p&gt;It is also possible to specify the number of replicas for a specific table by using &lt;code&gt;ALTER TABLE&lt;/code&gt; command. For instance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;product&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;number_of_replicas&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that changing the number of replicas after a table is created will cause the cluster to redistribute the shards and may take some time to complete (if you want only new partitions to be affected, use the &lt;a href="https://crate.io/docs/crate/reference/en/5.1/sql/statements/alter-table.html" rel="noopener noreferrer"&gt;ONLY keyword&lt;/a&gt;).&lt;/p&gt;

&lt;h2&gt;
  
  
  Automatic creation of new partitions
&lt;/h2&gt;

&lt;p&gt;Tables in CrateDB are designed to be dynamic and expandable, allowing for the addition of new rows and columns as needed. This means that you can create a table with an unlimited number of rows. Every time when new data is inserted, CrateDB dynamically creates a new table partition based on the partition column as illustrated by the following example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;​&lt;/span&gt;
  &lt;span class="nv"&gt;"name"&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="err"&gt;​&lt;/span&gt;
  &lt;span class="nv"&gt;"ts"&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="err"&gt;​&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;"month"&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="err"&gt;​&lt;/span&gt; &lt;span class="k"&gt;GENERATED&lt;/span&gt; &lt;span class="n"&gt;ALWAYS&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;date_trunc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'month'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="err"&gt;​&lt;/span&gt;
  &lt;span class="nv"&gt;"value"&lt;/span&gt; &lt;span class="nb"&gt;DOUBLE&lt;/span&gt; &lt;span class="nb"&gt;PRECISION&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;CLUSTERED&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="n"&gt;SHARDS&lt;/span&gt;&lt;span class="err"&gt;​&lt;/span&gt;
  &lt;span class="n"&gt;PARTITIONED&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;month&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmiixo7mlf5ch2ri4fn29.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmiixo7mlf5ch2ri4fn29.png" alt="example sales" width="800" height="292"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For every unique value in the &lt;code&gt;month&lt;/code&gt; column, a new partition will be created. In our example, the table can have up to twelve partitions, one for each month in the year. If more columns are used for partitioning a new partition will be created for every unique combination of values. The partition column can also be a generated column: columns whose values are calculated based on other columns. For instance, if you have a column containing a timestamp value, you can partition the data by a column that extracts the day value from the timestamp.&lt;/p&gt;

&lt;p&gt;The automatic creation of new partitions allows for horizontal scaling and enables the database to handle large amounts of data by distributing it across multiple partitions. Each partition’s shard is stored on a separate node in a CrateDB cluster, which helps to improve query performance and reduce the load on individual nodes.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to avoid too many shards
&lt;/h2&gt;

&lt;p&gt;If the routing column is badly chosen you can end up with too many shards in the cluster, affecting the overall stability and performance negatively. To find out how many shards your tables need, you need to consider the type of data you are processing, required queries, and hardware configuration. However, if you end up with too many shards you will have to manually reduce the number of shards by merging and moving them to the same node which is a time-consuming and tedious operation. To get an idea of how many shards your cluster needs, check out &lt;a href="https://community.crate.io/t/sharding-and-partitioning-guide-for-time-series-data/737" rel="noopener noreferrer"&gt;our recent tutorial&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The general rule for avoiding performance bottlenecks is to have as least as many shards for a table as there are CPUs in the cluster. This increases the chances that a query can be parallelized and distributed maximally. However, if most nodes have more shards per table than they have CPUs, you could actually see performance degradation. Each shard comes with a cost in terms of open files, RAM, and CPU cycles. Having too many small shards can negatively impact performance and scalability for a few reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Increased overhead&lt;/em&gt; in terms of managing and maintaining your cluster.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Reduced performance&lt;/em&gt; because it takes longer for CrateDB to gather the results, especially true for queries that need to join data from multiple shards&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Limited scalability&lt;/em&gt; because it becomes more difficult to scale your cluster if you have too many shards. &lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Increased complexity&lt;/em&gt; as too many shards can make it more difficult to understand and troubleshoot your data distribution.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Finally, for performance reasons, consider one thousand shards per node the highest recommended configuration. Also, a single table should not have more than one thousand partitions. If you exceed these numbers you will experience a failing cluster check.&lt;/p&gt;

&lt;h2&gt;
  
  
  Takeaway
&lt;/h2&gt;

&lt;p&gt;Sharding and partitioning in CrateDB are two key concepts that help to improve the scalability and performance of your database. In this article, we explored the basic principles of sharding and partitioning in CrateDB, and how they can be used to improve performance and scalability in your database. If you have any further questions or would like to learn more about CrateDB, check out our &lt;a href="https://crate.io/docs" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; and join &lt;a href="https://community.crate.io/" rel="noopener noreferrer"&gt;the CrateDB community&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>support</category>
      <category>discuss</category>
      <category>watercooler</category>
    </item>
  </channel>
</rss>
