<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: PSWU</title>
    <description>The latest articles on DEV Community by PSWU (@pswu11).</description>
    <link>https://dev.to/pswu11</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F478833%2F5a1a3d3e-00d9-454a-b116-1410efbef9c3.png</url>
      <title>DEV Community: PSWU</title>
      <link>https://dev.to/pswu11</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/pswu11"/>
    <language>en</language>
    <item>
      <title>Join Hacktoberfest 2022 and contribute to QuestDB!</title>
      <dc:creator>PSWU</dc:creator>
      <pubDate>Tue, 04 Oct 2022 13:27:27 +0000</pubDate>
      <link>https://dev.to/questdb/join-hacktoberfest-2022-and-contribute-to-questdb-1f8l</link>
      <guid>https://dev.to/questdb/join-hacktoberfest-2022-and-contribute-to-questdb-1f8l</guid>
<description>&lt;p&gt;Hacktoberfest 2022 is starting soon! We're super excited about joining&lt;br&gt;
Hacktoberfest again and meeting new and returning open-source contributors! 🤝&lt;/p&gt;

&lt;h2&gt;
  
  
  Hacktoberfest
&lt;/h2&gt;

&lt;p&gt;For those who aren't familiar with Hacktoberfest, it's a month-long online celebration of open-source software and communities. The first 40,000 participants who &lt;a href="https://hacktoberfest.com/participation/#contributors"&gt;successfully complete the requirements&lt;/a&gt; will be rewarded with a special-edition Hacktoberfest T-shirt 👕 or have a tree planted in their name. 🌴&lt;/p&gt;

&lt;p&gt;Participating in Hacktoberfest is one of our approaches to raising awareness and encouraging more developers and technical writers to contribute to open source. We welcome both code and non-code contributions, such as documentation improvements, tutorials, and blog posts.&lt;/p&gt;

&lt;h2&gt;
  
  
  ⛳ About us
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/questdb/questdb"&gt;QuestDB&lt;/a&gt; is a high-performance open-source database for time series. The project is built from scratch in Java and C++ with no dependencies and zero garbage collection. It is optimized for high-throughput ingestion over InfluxDB line protocol and fast SQL queries. QuestDB is also one of the most popular time series databases according to the independent reviewer &lt;a href="https://db-engines.com/en/ranking/time+series+dbms"&gt;DBEngines&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Developers can use QuestDB as a library for Java applications. &lt;a href="https://dev.to/docs/reference/clients/overview/"&gt;Official clients&lt;/a&gt; for Python, Go, C, C++, Node.js, Rust, and .NET are also available for the wider developer community.&lt;/p&gt;

&lt;p&gt;This year, three QuestDB open-source projects have opted in to Hacktoberfest:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/questdb/questdb"&gt;&lt;strong&gt;QuestDB&lt;/strong&gt;&lt;/a&gt;: QuestDB core database, mainly written in Java and C++. Check &lt;a href="https://github.com/questdb/questdb/blob/master/CONTRIBUTING.md"&gt;CONTRIBUTING.md&lt;/a&gt; to get started.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/questdb/questdb.io"&gt;&lt;strong&gt;Documentation&lt;/strong&gt;&lt;/a&gt;: QuestDB's website for documentation, suitable for technical writers who're experienced in Docusaurus and Markdown. Read &lt;a href="https://github.com/questdb/questdb.io#contributing"&gt;our guidelinesm for docs contributors&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/questdb/ui/tree/main/packages/web-console"&gt;&lt;strong&gt;Web console&lt;/strong&gt;&lt;/a&gt;: Monorepo that contains web console components.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  🚀 Contribute to QuestDB
&lt;/h2&gt;

&lt;p&gt;If you're interested in contributing to QuestDB, make sure you read the official guidelines about &lt;a href="https://hacktoberfest.com/participation/#contributors"&gt;contributors&lt;/a&gt;,&lt;br&gt;
&lt;a href="https://hacktoberfest.com/participation/#pr-mr-details"&gt;pull requests&lt;/a&gt;, and &lt;a href="https://hacktoberfest.com/participation/#spam"&gt;spam&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To maintain a friendly environment for both maintainers and contributors, we put together extra recommendations to increase the chance that your pull requests get accepted:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pay attention to &lt;code&gt;CONTRIBUTING.md&lt;/code&gt; or any other contribution guidelines available.&lt;/li&gt;
&lt;li&gt;Start with existing issues instead of inventing new ones. While new ideas are generally welcome, they don't always fit the project roadmap.&lt;/li&gt;
&lt;li&gt;Filter issues by the &lt;code&gt;good first issue&lt;/code&gt; or &lt;code&gt;help wanted&lt;/code&gt; labels if you're new to a project.&lt;/li&gt;
&lt;li&gt;Comment only on the issues you actually plan to work on, leaving opportunities for other contributors.&lt;/li&gt;
&lt;li&gt;Our maintainers will review your pull request; please address all their comments before asking for another review.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🎁 QuestDB swag
&lt;/h2&gt;

&lt;p&gt;In addition to the official reward, if you successfully contribute one valid pull request to any of the QuestDB projects listed above, we'll send you an extra QuestDB T-shirt through our &lt;a href="https://dev.to/community/"&gt;swag program&lt;/a&gt;. 🚀&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--upwwPNH6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://questdb.io/img/blog/2022-09-30/swag-hacktoberfest.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--upwwPNH6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://questdb.io/img/blog/2022-09-30/swag-hacktoberfest.png" alt="QuestDB Swag" width="500" height="415"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  ℹ️ Get help
&lt;/h2&gt;

&lt;p&gt;If you have any questions while contributing to QuestDB projects, here are the places to get help from our team:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://slack.questdb.io"&gt;Community Slack&lt;/a&gt;: join the #contributors channel.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/docs/"&gt;Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://stackoverflow.com/questions/tagged/questdb"&gt;StackOverflow&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Also, don't forget to follow us on social media to receive the latest updates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://twitter.com/questdb"&gt;Twitter&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.linkedin.com/company/questdb/"&gt;Linkedin&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Last but not least, star our &lt;a href="https://github.com/questdb/questdb"&gt;GitHub repo&lt;/a&gt; if you haven't already!&lt;/p&gt;

</description>
      <category>hacktoberfest</category>
      <category>java</category>
      <category>typescript</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Importing 3m rows/sec with io_uring</title>
      <dc:creator>PSWU</dc:creator>
      <pubDate>Thu, 22 Sep 2022 10:31:22 +0000</pubDate>
      <link>https://dev.to/questdb/importing-3m-rowssec-with-iouring-4h54</link>
      <guid>https://dev.to/questdb/importing-3m-rowssec-with-iouring-4h54</guid>
      <description>&lt;p&gt;&lt;em&gt;This article is originally published on &lt;a href="https://questdb.io" rel="noopener noreferrer"&gt;questdb.io&lt;/a&gt; by &lt;a href="https://github.com/puzpuzpuz" rel="noopener noreferrer"&gt;Andrey Pechkurov&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In this blog post, QuestDB's very own &lt;a href="https://github.com/puzpuzpuz" rel="noopener noreferrer"&gt;Andrey Pechkurov&lt;/a&gt; shows how to ingest large CSV files far more efficiently using the SQL &lt;a href="https://questdb.io/docs/reference/sql/copy" rel="noopener noreferrer"&gt;&lt;code&gt;COPY&lt;/code&gt;&lt;/a&gt; statement and walks us through the benchmarking journey. Andrey also shares insights into how the new improvement is made possible by &lt;code&gt;io_uring&lt;/code&gt; and compares QuestDB's import performance with several well-known OLAP and time-series databases in ClickHouse's ClickBench benchmark.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;As an open source time series database company, we understand that getting your existing data into the database in a fast and convenient manner is as important as being able to &lt;a href="https://questdb.io/time-series-benchmark-suite" rel="noopener noreferrer"&gt;ingest&lt;/a&gt; and &lt;a href="https://questdb.io/blog/2022/05/26/query-benchmark-questdb-versus-clickhouse-timescale" rel="noopener noreferrer"&gt;query&lt;/a&gt; your data efficiently later on. That's why we decided to dedicate our new release, QuestDB 6.5, to the new parallel &lt;a href="https://questdb.io/docs/guides/importing-data" rel="noopener noreferrer"&gt;CSV file import&lt;/a&gt; feature. In this blog post, we discuss what parallel import means for our users and how it's implemented internally. As a bonus, we also share how the ClickHouse team's recent benchmark helped us improve both QuestDB and its benchmark results.&lt;/p&gt;

&lt;h2&gt;
  
  
  How ClickBench helped us to improve
&lt;/h2&gt;

&lt;p&gt;Recently, the ClickHouse team conducted a &lt;a href="https://github.com/ClickHouse/ClickBench" rel="noopener noreferrer"&gt;benchmark&lt;/a&gt; of their own database and many others, including QuestDB. The benchmark included data import as the first step. Since we were in the process of building a faster import, it provided us with nice test data and baseline results. So, what have we achieved? Let's find out. The benchmark used QuestDB's HTTP &lt;a href="https://questdb.io/docs/reference/api/rest#imp---import-data" rel="noopener noreferrer"&gt;import endpoint&lt;/a&gt; to ingest the data into an existing non-partitioned table. You may wonder why it doesn't use a &lt;a href="https://questdb.io/docs/concept/partitions" rel="noopener noreferrer"&gt;partitioned&lt;/a&gt; table, which stores the data sorted by timestamp values and provides many benefits for time series analysis. Most likely, the reason is the terrible import execution time: both HTTP-based import and the pre-6.5 COPY SQL command are simply not capable of importing a big CSV file with unsorted data. Thus, the benchmark opts for a non-partitioned table with no designated timestamp column. The test CSV file can be downloaded and uncompressed with the following commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;wget &lt;span class="s1"&gt;'https://datasets.clickhouse.com/hits_compatible/hits.csv.gz'&lt;/span&gt;
&lt;span class="nb"&gt;gzip&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; hits.csv.gz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The file is on the bigger side, 76GB when decompressed, and contains rows that are heavily out-of-order in terms of time. This makes it a nice import performance challenge for any time series database. Getting the data into a locally running QuestDB instance via HTTP is as simple as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-F&lt;/span&gt; &lt;span class="nv"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;@hits.csv &lt;span class="s1"&gt;'http://localhost:9000/imp?name=hits'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Such an import took almost 28 minutes (1,668 seconds, to be precise) on a c6a.4xlarge EC2 instance with a 500GB gp2 volume in ClickBench. This yields around 47MB/s and leaves a lot to be desired. In contrast, it took the ClickHouse database around 8 minutes (476 seconds) to import the file on the same hardware. Since we were already working on faster imports for partitioned tables, this gave us a clear baseline to beat.&lt;/p&gt;
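&lt;p&gt;As a quick sanity check on these figures, here is a back-of-the-envelope calculation using the 76GB decompressed size quoted earlier (the helper function is ours, for illustration only, not part of any benchmark tooling):&lt;/p&gt;

```python
# Back-of-the-envelope check of the import throughput figures quoted above.
FILE_SIZE_GB = 76  # decompressed hits.csv, as stated in the post

def throughput_mb_s(seconds: float) -> float:
    """Average ingestion rate in MB/s for the 76GB file."""
    return FILE_SIZE_GB * 1024 / seconds

print(f"QuestDB HTTP import: {throughput_mb_s(1668):.0f} MB/s")  # ~47 MB/s
print(f"ClickHouse import:   {throughput_mb_s(476):.0f} MB/s")   # ~163 MB/s
```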

&lt;p&gt;In addition to import speed, ClickBench measures query performance. Although none of the queries it ran were related to time series analysis, the results helped us to improve QuestDB. We found and fixed a stability issue, as well as added support for some SQL functions. Other than that, our SQL engine had a bug around the multi-threaded &lt;code&gt;min()&lt;/code&gt;/&lt;code&gt;max()&lt;/code&gt; SQL function optimization: it was case-sensitive and simply ignored the &lt;code&gt;MIN()&lt;/code&gt;/&lt;code&gt;MAX()&lt;/code&gt; spelling used in ClickBench. After a trivial fix, queries using these aggregate functions got their intended speed back. Finally, a few queries marked with an N/A result used unsupported SQL syntax, and it was trivial to rewrite them to get proper results. With all of these improvements, we ran ClickBench on QuestDB 6.5.2 and created a &lt;a href="https://github.com/ClickHouse/ClickBench/pull/25" rel="noopener noreferrer"&gt;pull request&lt;/a&gt; with the updated results.&lt;/p&gt;

&lt;p&gt;Long story short, although ClickBench has nothing to do with time series analysis, it provided us with a test CSV file and baseline import results, as well as helped us to improve query stability and performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  The import speed-up
&lt;/h2&gt;

&lt;p&gt;Our new optimized import is based on the SQL &lt;code&gt;COPY&lt;/code&gt; statement:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;COPY&lt;/span&gt; &lt;span class="n"&gt;hits&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="s1"&gt;'hits.csv'&lt;/span&gt; &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="s1"&gt;'EventTime'&lt;/span&gt; &lt;span class="n"&gt;FORMAT&lt;/span&gt; &lt;span class="s1"&gt;'yyyy-MM-dd HH:mm:ss'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above command uses the new &lt;code&gt;COPY&lt;/code&gt; syntax to import the &lt;code&gt;hits.csv&lt;/code&gt; file from ClickBench to the &lt;code&gt;hits&lt;/code&gt; table. For the command to work, the file should be made available in the import root directory configured on the server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cairo.sql.copy.root&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="se"&gt;\h&lt;/span&gt;ome&lt;span class="se"&gt;\m&lt;/span&gt;y-user&lt;span class="se"&gt;\m&lt;/span&gt;y-qdb-import
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since we care about time series data analysis, in our experiments we partitioned the table by day, while the original benchmark used a non-partitioned table. Let's start with the most powerful AWS EC2 instance from the original benchmark:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fquestdb.io%2Fimg%2Fblog%2F2022-09-12%2Fcover.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fquestdb.io%2Fimg%2Fblog%2F2022-09-12%2Fcover.png"&gt;&lt;/a&gt;&lt;br&gt;Ingesting a 76GB CSV file, from fast to slow: ClickHouse, QuestDB, Apache Pinot, TimescaleDB, DuckDB, and Apache Druid.
  &lt;/p&gt;

&lt;p&gt;The above benchmark compares the import speed of several well-known OLAP and time-series databases: Apache Pinot, Apache Druid, ClickHouse, DuckDB, TimescaleDB, and QuestDB. Here, our new optimized &lt;code&gt;COPY&lt;/code&gt; imports almost 1Bln rows from the &lt;code&gt;hits.csv&lt;/code&gt; file in 335 seconds, second only to ClickHouse.&lt;/p&gt;

&lt;p&gt;We also did a run on the c6a.4xlarge instance (16 vCPU and 32GB RAM) from the original benchmark, which is noticeably less powerful than the c6a.metal instance (192 vCPU and 384GB RAM); both instances had a rather slow 500GB gp2 EBS volume. The result was 17,401 seconds on the less powerful c6a.4xlarge instance. So, in spite of a very slow disk, c6a.metal is 52x faster than c6a.4xlarge. Why is that?&lt;/p&gt;

&lt;p&gt;The answer is simple. The metal instance has a huge amount of memory, so once the CSV file gets decompressed, it fits fully into the OS page cache. Hence, the import doesn't do any physical reads from the input file and instead reads the pages from memory (note: the machine has a &lt;a href="https://en.wikipedia.org/wiki/Non-uniform_memory_access" rel="noopener noreferrer"&gt;NUMA&lt;/a&gt; architecture, but non-local memory access is still way faster than disk reads). That's why we observe such a huge difference here for QuestDB, and also why you may notice a 2.5x difference for ClickHouse in the original benchmark.&lt;/p&gt;

&lt;p&gt;You may wonder why removing the need to read the data from the slow disk gives QuestDB such a noticeable improvement, while it's only 2.5x for ClickHouse and even less for other databases. We're going to explain that soon, but for now, let's continue the benchmarking fun.&lt;/p&gt;

&lt;p&gt;Honestly speaking, we find the choice of the metal instance in the ClickBench results rather synthetic, as it makes little sense to pair a very powerful (and expensive) machine with a very slow (and cheap) disk. So, we did a benchmark run on a different test stand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;c5d.4xlarge EC2 instance (16 vCPU and 32GB RAM), Amazon Linux 2 with 5.15.50
kernel&lt;/li&gt;
&lt;li&gt;400GB NVMe drive&lt;/li&gt;
&lt;li&gt;250GB gp3, 16K IOPS and 1GB/s throughput, or gp2 of the same size&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What we got is the following:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fquestdb.io%2Fimg%2Fblog%2F2022-09-12%2Fcomparison.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fquestdb.io%2Fimg%2Fblog%2F2022-09-12%2Fcomparison.png"&gt;&lt;/a&gt;&lt;br&gt;QuestDB ingestion time for ClickBench's 76GB CSV file by instance type and storage.&lt;br&gt;

  &lt;/p&gt;

&lt;p&gt;The very last result on the above chart represents the c5d.4xlarge instance with a slow gp2 volume. We include it to show how important disk speed is to import performance.&lt;/p&gt;

&lt;p&gt;In the middle of the chart, the gp3-volume-only result doesn't use the local SSD, but manages to ingest the data into a partitioned table a lot faster than the gp2 run, thanks to the faster EBS volume. Finally, in the NVMe SSD run, the import takes less than 7 minutes - an impressive ingestion rate of 2.5M rows/s (or 193MB/s) without having the whole input file in the OS page cache. Here, the SSD is used as read-only storage for the CSV file, while the database files are placed on the EBS volume. This is a convenient approach for a one-time import of a high volume of data: as soon as the import is done, the SSD is no longer needed, so the EBS volume may be attached to a more affordable instance where the database would run.&lt;/p&gt;

&lt;p&gt;As shown by the top result in the chart above, the optimized import makes a terrific difference for anyone who wants to import their time series data into QuestDB, and also takes us close to ClickHouse's results from a practical perspective. Another nice property of QuestDB's import is that, as soon as the import ends, the data is laid out on disk optimally, i.e. the column files are organized into partitions and no background merging is required.&lt;/p&gt;

&lt;p&gt;Now, as promised, we're going to explain why a huge amount of RAM or a locally attached SSD makes such a difference for QuestDB's import performance. To learn that, we're taking a leap into an engineering story full of trial and error.&lt;/p&gt;

&lt;h2&gt;
  
  
  Optimizing the import
&lt;/h2&gt;

&lt;p&gt;Our HTTP endpoint, as well as the old &lt;code&gt;COPY&lt;/code&gt; implementation, handles the incoming data serially (think of it as a one-time stream) and uses a single thread for that. For out-of-order (O3) data, this means lots of O3 writes and, hence, partition re-writes. Both the single-threaded handling and the O3 writes become the limiting factor for these types of import.&lt;br&gt;
However, the &lt;code&gt;COPY&lt;/code&gt; statement operates on a file, so there is nothing preventing us from going over it as many times as needed.&lt;/p&gt;

&lt;p&gt;QuestDB's storage format doesn't involve a complicated layout like the one in &lt;a href="https://en.wikipedia.org/wiki/Log-structured_merge-tree" rel="noopener noreferrer"&gt;LSM trees&lt;/a&gt; or other similar persistent data structures. The column files are &lt;a href="https://questdb.io/docs/concept/partitions" rel="noopener noreferrer"&gt;partitioned&lt;/a&gt; by time and versioned to handle concurrent reads and writes. The advantage of this approach is that as soon as the rows are committed, the on-disk data is optimal from the read perspective - there is no need to go through multiple files with potentially overlapping data when reading from a single partition. The downside is that such a storage format can be problematic to cope with when it comes to data import.&lt;/p&gt;

&lt;p&gt;But no worries, that's something we have optimized.&lt;br&gt;
The big ideas behind our shiny new &lt;code&gt;COPY&lt;/code&gt; are really simple. First, we organize the import in multiple phases in order to enable in-order data ingestion. Second, we go parallel, i.e. multi-threaded, in each of those phases where possible.&lt;/p&gt;

&lt;p&gt;Broadly speaking, the phases are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Check input file boundaries. Here we try to split the file into N chunks, so that N worker threads may work on their own chunk in isolation.&lt;/li&gt;
&lt;li&gt;Index the input file. Each thread scans its chunk, reads designated timestamp column values, and creates temporary index files. The index files are organized in partitions and contain sorted timestamps, as well as offsets pointing to the source file.&lt;/li&gt;
&lt;li&gt;Scan the input file and import data into temporary tables. Here, the threads use the newly built indexes to go through the input file and write their own temporary tables. The scanning and subsequent writes are guaranteed to be in-order thanks to the index files containing timestamps and offsets tuples sorted by time. The parallelism in this phase comes from multiple partitions being available to the threads to work independently.&lt;/li&gt;
&lt;li&gt;Perform additional metadata manipulations (say, merge symbol tables) and, finally, move the partitions from temporary tables to the final one. This is completed in multiple smaller phases that we summarize as one, for the sake of simplicity.&lt;/li&gt;
&lt;/ol&gt;
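&lt;p&gt;The four phases above can be sketched roughly as follows - a deliberately simplified, single-process Python illustration of the chunk-index-merge idea, not QuestDB's actual Java/C++ implementation (timestamps here are plain integers, and "offsets" are just row positions):&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

# Toy CSV: (timestamp, value) rows, heavily out of order.
rows = [(5, "e"), (1, "a"), (9, "i"), (2, "b"), (7, "g"), (3, "c")]

def index_chunk(chunk):
    """Phase 2: scan a chunk and build (timestamp, source_offset) tuples,
    sorted by timestamp - the 'temporary index' for that chunk."""
    start, data = chunk
    return sorted((ts, start + i) for i, (ts, _) in enumerate(data))

# Phase 1: split the input into N chunks so N workers can run in isolation.
n = 2
size = len(rows) // n
chunks = [(i * size, rows[i * size:(i + 1) * size]) for i in range(n)]

# Phase 2 in parallel: each worker indexes its own chunk.
with ThreadPoolExecutor(max_workers=n) as pool:
    indexes = list(pool.map(index_chunk, chunks))

# Phases 3-4: merge the sorted indexes and copy rows in timestamp order,
# so the final table is written strictly in order.
merged = sorted(t for idx in indexes for t in idx)
ordered = [rows[off] for _, off in merged]
print(ordered)  # rows now sorted by timestamp
```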

&lt;p&gt;The indexes we build at phase 2 may be illustrated in the following way:&lt;/p&gt;


  &lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fquestdb.io%2Fimg%2Fblog%2F2022-09-12%2Fdiagram.png"&gt;Temporary indexes built during parallel import.
  


&lt;p&gt;The above description is an overview of what we've done for the new &lt;code&gt;COPY&lt;/code&gt;. Yet, a careful reader might spot a potential bottleneck. Yes, the third phase involves lots of random disk reads in the case of an unordered input file. That's exactly what we observed as a noticeable bottleneck when experimenting with the initial implementation. But does that mean there is nothing we can do about it? Not really. Modern hardware &amp;amp; software to the rescue!&lt;/p&gt;
&lt;h2&gt;
  
  
  io_uring everything!
&lt;/h2&gt;

&lt;p&gt;Modern SSDs, especially NVMe ones, have evolved quite far from their spinning magnetic ancestors. They're able to cope with much higher concurrency levels for disk operations, including random reads. But utilizing these hardware capabilities with traditional blocking interfaces, like &lt;a href="https://man7.org/linux/man-pages/man2/pwrite.2.html" rel="noopener noreferrer"&gt;&lt;code&gt;pread()&lt;/code&gt;&lt;/a&gt;, would involve many threads and, hence, overhead here and there (like an increased memory footprint or context switching).&lt;br&gt;
Moreover, QuestDB's threading model operates on a fixed-size thread pool and doesn't assume running more threads than the available CPU cores.&lt;/p&gt;

&lt;p&gt;Luckily, newer Linux kernel versions support &lt;a href="https://kernel.dk/io_uring.pdf" rel="noopener noreferrer"&gt;&lt;code&gt;io_uring&lt;/code&gt;&lt;/a&gt;, a new asynchronous I/O interface. But would it help in our case? Learning the answer is simple and, in fact, doesn't even require a single line of code, thanks to &lt;a href="https://github.com/axboe/fio" rel="noopener noreferrer"&gt;fio&lt;/a&gt;, a very flexible I/O tester utility.&lt;/p&gt;

&lt;p&gt;Let's check how blocking random reads of 4KB chunks would perform on a laptop with a decent NVMe SSD:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;fio &lt;span class="nt"&gt;--name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;read_sync_4k &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;--filename&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;./hits.csv &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;--rw&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;randread &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;--bs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;4K &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;--numjobs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;8 &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;--ioengine&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;sync&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;--group_reporting&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;--runtime&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;60 &lt;span class="se"&gt;\&lt;/span&gt;
/
...
Run status group 0 &lt;span class="o"&gt;(&lt;/span&gt;all &lt;span class="nb"&gt;jobs&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;:
   READ: &lt;span class="nv"&gt;bw&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;223MiB/s &lt;span class="o"&gt;(&lt;/span&gt;234MB/s&lt;span class="o"&gt;)&lt;/span&gt;, 223MiB/s-223MiB/s &lt;span class="o"&gt;(&lt;/span&gt;234MB/s-234MB/s&lt;span class="o"&gt;)&lt;/span&gt;, &lt;span class="nv"&gt;io&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;13.1GiB &lt;span class="o"&gt;(&lt;/span&gt;14.0GB&lt;span class="o"&gt;)&lt;/span&gt;, &lt;span class="nv"&gt;run&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;60001-60001msec
Disk stats &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;read&lt;/span&gt;/write&lt;span class="o"&gt;)&lt;/span&gt;:
  nvme0n1: &lt;span class="nv"&gt;ios&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3166224/361, &lt;span class="nv"&gt;merge&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0/318, &lt;span class="nv"&gt;ticks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;217837/455, &lt;span class="nv"&gt;in_queue&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;218357, &lt;span class="nv"&gt;util&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;50.72%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, we're using 8 threads to make blocking read calls against the same CSV file and observe a 223MB/s read rate, which is not bad at all.&lt;/p&gt;

&lt;p&gt;Now, we use &lt;code&gt;io_uring&lt;/code&gt; to do the same job:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;fio &lt;span class="nt"&gt;--name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;read_io_uring_4k &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;--filename&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;./hits.csv &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;--rw&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;randread &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;--bs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;4K &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;--numjobs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;8 &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;--ioengine&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;io_uring &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;--iodepth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;64 &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;--group_reporting&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;--runtime&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;60 &lt;span class="se"&gt;\&lt;/span&gt;
/
...
Run status group 0 &lt;span class="o"&gt;(&lt;/span&gt;all &lt;span class="nb"&gt;jobs&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;:
   READ: &lt;span class="nv"&gt;bw&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2232MiB/s &lt;span class="o"&gt;(&lt;/span&gt;2340MB/s&lt;span class="o"&gt;)&lt;/span&gt;, 2232MiB/s-2232MiB/s &lt;span class="o"&gt;(&lt;/span&gt;2340MB/s-2340MB/s&lt;span class="o"&gt;)&lt;/span&gt;, &lt;span class="nv"&gt;io&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;131GiB &lt;span class="o"&gt;(&lt;/span&gt;140GB&lt;span class="o"&gt;)&lt;/span&gt;, &lt;span class="nv"&gt;run&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;60003-60003msec
Disk stats &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;read&lt;/span&gt;/write&lt;span class="o"&gt;)&lt;/span&gt;:
  nvme0n1: &lt;span class="nv"&gt;ios&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;25482866/16240, &lt;span class="nv"&gt;merge&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;6262/571137, &lt;span class="nv"&gt;ticks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;27625314/25206, &lt;span class="nv"&gt;in_queue&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;27650786, &lt;span class="nv"&gt;util&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;98.86%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We get an impressive 2,232MB/s this time. It is also worth noting that disk utilization increased to 98.86%, against 50.72% in the previous fio run - all with the same number of threads.&lt;/p&gt;

&lt;p&gt;This simple experiment proved to us that &lt;code&gt;io_uring&lt;/code&gt; could be a great fit for our parallel &lt;code&gt;COPY&lt;/code&gt; implementation, so we added an experimental API and continued our experiments. As a result, QuestDB checks the kernel version and, if it's new enough, uses &lt;code&gt;io_uring&lt;/code&gt; to speed up the import. Our code is also smart enough to detect in-order adjacent lines and read them in one I/O operation. Thanks to this behavior, parallel COPY is faster than its serial counterpart even on ordered files.&lt;/p&gt;
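&lt;p&gt;The adjacent-read coalescing mentioned above can be sketched like this - a simplified Python illustration of the idea (our own toy model, not QuestDB's actual C++ code), where each read request is an (offset, length) pair into the CSV file:&lt;/p&gt;

```python
def coalesce_reads(reads):
    """Merge physically adjacent (offset, length) requests into one
    larger read, so an in-order run of lines costs a single I/O operation."""
    merged = []
    for off, length in sorted(reads):
        if merged and merged[-1][0] + merged[-1][1] == off:
            merged[-1][1] += length        # contiguous: extend previous read
        else:
            merged.append([off, length])   # gap: start a new read
    return [tuple(r) for r in merged]

# Six 100-byte lines: the first four are adjacent on disk, the last two are not.
reads = [(0, 100), (100, 100), (200, 100), (300, 100), (1000, 100), (5000, 100)]
print(coalesce_reads(reads))  # [(0, 400), (1000, 100), (5000, 100)]
```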

&lt;p&gt;We have explained why the presence of an NVMe SSD made such a difference in our introductory benchmarks. EBS volumes are very convenient, but they offer an order of magnitude lower IOPS and throughput than a physically attached drive. Thus, using such a drive for the initial data import makes a lot of sense, especially when a few terabytes are to be imported.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's next?
&lt;/h2&gt;

&lt;p&gt;Prior to QuestDB 6.5, importing large amounts of unsorted data into a partitioned table was practically impossible. We hope that our users will appreciate this feature, as well as other improvements we've made recently. As a logical next step, we want to take our data import one step further by making it available and convenient to use in QuestDB Cloud. Finally, needless to say, we'll be thinking of more use cases for &lt;code&gt;io_uring&lt;/code&gt; in our database.&lt;/p&gt;

&lt;p&gt;As usual, we encourage you to try out the latest QuestDB 6.5.2 release and share your feedback with our &lt;a href="https://slack.questdb.io" rel="noopener noreferrer"&gt;Slack Community&lt;/a&gt;. You can also play with our &lt;a href="https://demo.questdb.io" rel="noopener noreferrer"&gt;live demo&lt;/a&gt; to see how fast it executes your queries. And, of course, contributions to our open source &lt;a href="https://github.com/questdb/questdb" rel="noopener noreferrer"&gt;project on GitHub&lt;/a&gt; are more than welcome.&lt;/p&gt;

</description>
      <category>database</category>
      <category>programming</category>
      <category>sql</category>
      <category>timeseries</category>
    </item>
    <item>
      <title>4Bn rows/sec query benchmark: Clickhouse vs QuestDB vs Timescale</title>
      <dc:creator>PSWU</dc:creator>
      <pubDate>Thu, 23 Jun 2022 13:30:34 +0000</pubDate>
      <link>https://dev.to/questdb/4bn-rowssec-query-benchmark-clickhouse-vs-questdb-vs-timescale-i9g</link>
      <guid>https://dev.to/questdb/4bn-rowssec-query-benchmark-clickhouse-vs-questdb-vs-timescale-i9g</guid>
      <description>&lt;p&gt;&lt;em&gt;This article is originally published on &lt;a href="https://questdb.io" rel="noopener noreferrer"&gt;questdb.io&lt;/a&gt; by &lt;a href="https://github.com/puzpuzpuz" rel="noopener noreferrer"&gt;Andrey Pechkurov&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We &lt;a href="https://questdb.io/blog/2022/01/12/jit-sql-compiler" rel="noopener noreferrer"&gt;introduced&lt;/a&gt; a JIT (Just-in-Time) compiler for SQL filters in our previous version, QuestDB 6.2. As we mentioned last time, the next step would be to parallelize query execution where suitable to improve execution times even further, and that's what we're going to discuss and benchmark today. QuestDB 6.3 enables JIT-compiled filters by default and, even more notably, includes a parallel SQL filter execution optimization that reduces both cold and hot query execution times quite dramatically.&lt;/p&gt;

&lt;p&gt;Before diving into the implementation details and running some before/after benchmarks for QuestDB, we'll hold a friendly competition with two popular time series and analytical databases, TimescaleDB and ClickHouse. The purpose of the competition is nothing more than an attempt to understand whether our parallel filter execution is worth the hassle.&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparing with other databases
&lt;/h2&gt;

&lt;p&gt;Our test box is a c5a.12xlarge AWS VM running Ubuntu Server 20.04 64-bit. In practice, this means 48 vCPUs and 96 GB of RAM. The attached storage is a 1 TB gp3 volume configured for 1,000 MB/s throughput and 16,000 IOPS. Apart from that, we'll be using QuestDB 6.3.1 with the default settings, which means both parallel filter execution and JIT compilation are enabled.&lt;/p&gt;

&lt;p&gt;In order to make the benchmark easily reproducible, we're going to use the &lt;a href="https://github.com/timescale/tsbs" rel="noopener noreferrer"&gt;TSBS&lt;/a&gt; benchmark utilities to generate the data. We'll be using the so-called IoT use case:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./tsbs_generate_data &lt;span class="nt"&gt;--use-case&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"iot"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
                     &lt;span class="nt"&gt;--seed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;123 &lt;span class="se"&gt;\&lt;/span&gt;
                     &lt;span class="nt"&gt;--scale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;5000 &lt;span class="se"&gt;\&lt;/span&gt;
                     &lt;span class="nt"&gt;--timestamp-start&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"2020-01-01T00:00:00Z"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
                     &lt;span class="nt"&gt;--timestamp-end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"2020-07-01T00:00:00Z"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
                     &lt;span class="nt"&gt;--log-interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"60s"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
                     &lt;span class="nt"&gt;--format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"influx"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /tmp/data &lt;span class="se"&gt;\&lt;/span&gt;
                     /
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above command generates six months of per-minute measurements for 5,000 truck IoT devices. This yields almost 1.2 billion records stored in a table named &lt;code&gt;readings&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Loading the data is as simple as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./tsbs_load_questdb &lt;span class="nt"&gt;--file&lt;/span&gt; /tmp/data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that we have the data in the database, we're going to execute the following query on the &lt;code&gt;readings&lt;/code&gt; table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;readings&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;velocity&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;90&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;
 &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;latitude&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;75&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;latitude&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;
 &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;longitude&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;90&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;longitude&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;95&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This (somewhat synthetic) query aims to find all measurements sent from fast-moving trucks in a given location. It filters on three DOUBLE columns and doesn't include analytical clauses like &lt;code&gt;GROUP BY&lt;/code&gt; or &lt;code&gt;SAMPLE BY&lt;/code&gt;, which is exactly what we need.&lt;/p&gt;

&lt;p&gt;Our first competitor is TimescaleDB 2.6.0 running on top of PostgreSQL 14.2. As the official installation guide suggests, we made sure to run &lt;code&gt;timescaledb-tune&lt;/code&gt; to fine-tune TimescaleDB for better performance.&lt;/p&gt;

&lt;p&gt;We generate the test data with the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./tsbs_generate_data &lt;span class="nt"&gt;--use-case&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"iot"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
                     &lt;span class="nt"&gt;--seed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;123 &lt;span class="se"&gt;\&lt;/span&gt;
                     &lt;span class="nt"&gt;--scale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;5000 &lt;span class="se"&gt;\&lt;/span&gt;
                     &lt;span class="nt"&gt;--timestamp-start&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"2020-01-01T00:00:00Z"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
                     &lt;span class="nt"&gt;--timestamp-end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"2020-07-01T00:00:00Z"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
                     &lt;span class="nt"&gt;--log-interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"60s"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
                     &lt;span class="nt"&gt;--format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"timescaledb"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /tmp/data &lt;span class="se"&gt;\&lt;/span&gt;
                     /
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the same command as before, but with the &lt;code&gt;format&lt;/code&gt; argument set to &lt;code&gt;timescaledb&lt;/code&gt;. Next, we load the data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./tsbs_load_timescaledb &lt;span class="nt"&gt;--pass&lt;/span&gt; your_pwd &lt;span class="nt"&gt;--file&lt;/span&gt; /tmp/data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Be prepared to wait quite a while for the data to load this time. We observed a 5-8x ingestion rate difference between QuestDB and the two other databases in this particular environment. That's nothing more than a note for anyone who wants to repeat the benchmark. If you'd like to learn more about ingestion performance, check out this &lt;a href="https://questdb.io/time-series-benchmark-suite/" rel="noopener noreferrer"&gt;blog post&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Finally, we're able to run the first query and measure the hot execution time. If we did that right away, though, it would take TimescaleDB more than 15 minutes to execute the query. At this point, experienced TimescaleDB &amp;amp; PostgreSQL users might suggest adding an index to speed up this particular query.&lt;/p&gt;

&lt;p&gt;So, let's do that:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;readings&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;velocity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;latitude&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;longitude&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With an index in place, TimescaleDB can execute the query much, much faster: around 4.4 seconds. To get the full picture, let's include one more contestant.&lt;/p&gt;

&lt;p&gt;The third member of our competition is ClickHouse 22.4.1.752. Just like with TimescaleDB, the command to generate the data stays the same with only the &lt;code&gt;format&lt;/code&gt; argument being set to &lt;code&gt;clickhouse&lt;/code&gt;. Once the data is generated, it can be loaded into the database:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./tsbs_load_clickhouse &lt;span class="nt"&gt;--file&lt;/span&gt; /tmp/data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We're ready to do the benchmark run.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fquestdb.io%2Fimg%2Fblog%2F2022-05-26%2Ffilter-benchmark.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fquestdb.io%2Fimg%2Fblog%2F2022-05-26%2Ffilter-benchmark.png" alt="Hot query execution times of QuestDB, ClickHouse and TimescaleDB - Query 1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The above chart shows that QuestDB is an order of magnitude faster than both TimescaleDB and ClickHouse in this specific query.&lt;/p&gt;

&lt;p&gt;Interestingly, an index-based scan doesn't help TimescaleDB win the competition. This is a nice illustration of the fact that a specialized, parallelism-friendly storage model may save you from having to deal with indexes and from paying their additional overhead during data ingestion.&lt;/p&gt;

&lt;p&gt;As the next step, let's give another popular type of query a go. In the world of time series data, it's common to query only the latest rows matching a certain filter. QuestDB supports that elegantly through negative &lt;code&gt;LIMIT&lt;/code&gt; clause values. If we were to query the ten latest measurements sent from fast-moving, yet fuel-efficient trucks, it would look like the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;readings&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;velocity&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;75&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;fuel_consumption&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice the &lt;code&gt;LIMIT -10&lt;/code&gt; clause in our query; it asks the database to return the last 10 rows that match the filter. Thanks to the implicit ascending order based on the &lt;a href="https://questdb.io/docs/concept/designated-timestamp/" rel="noopener noreferrer"&gt;designated timestamp&lt;/a&gt; column, we also didn't have to specify an &lt;code&gt;ORDER BY&lt;/code&gt; clause.&lt;/p&gt;

&lt;p&gt;In TimescaleDB, this query would look more verbose:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;readings&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;velocity&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;75&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;fuel_consumption&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, we had to specify descending &lt;code&gt;ORDER BY&lt;/code&gt; and &lt;code&gt;LIMIT&lt;/code&gt; clauses. As for ClickHouse, the query looks just like the TimescaleDB one, except that a different column stores the timestamps (&lt;code&gt;created_at&lt;/code&gt; instead of &lt;code&gt;time&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;How do the databases on our list deal with such a query? Let's measure and find out!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fquestdb.io%2Fimg%2Fblog%2F2022-05-26%2Ffilter-with-limit-benchmark.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fquestdb.io%2Fimg%2Fblog%2F2022-05-26%2Ffilter-with-limit-benchmark.png" alt="A chart comparing hot LIMIT query execution times of QuestDB, ClickHouse and TimescaleDB - Query 2"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This time, surprisingly or not, TimescaleDB does a better job than ClickHouse. That's because, just like QuestDB, TimescaleDB filters the data starting with the latest time-based partitions and stops filtering once enough rows are found. We could also add an index on the &lt;code&gt;velocity&lt;/code&gt; and &lt;code&gt;fuel_consumption&lt;/code&gt; columns, but it wouldn't change the result: TimescaleDB doesn't use the index for this query and does a full scan instead. Thanks to this behavior, both QuestDB and TimescaleDB are significantly faster than ClickHouse in this exercise.&lt;/p&gt;

&lt;p&gt;Needless to say, both TimescaleDB and ClickHouse are great pieces of engineering. Your mileage may vary, and the performance of your particular application depends on a large number of factors. So, as with any benchmark, take our results with a grain of salt and make sure to measure things on your own.&lt;/p&gt;

&lt;p&gt;That should be it for our comparison and now it's time to discuss the design decisions behind our parallel SQL filter execution.&lt;/p&gt;

&lt;h2&gt;
  
  
  How does it work?
&lt;/h2&gt;

&lt;p&gt;First, let's quickly recap QuestDB's &lt;a href="https://questdb.io/docs/concept/storage-model/" rel="noopener noreferrer"&gt;storage model&lt;/a&gt; to understand why it supports efficient multi-core execution. The database has a column-based, append-only storage model. Data is stored in tables, with each column stored in its own file, or in multiple files when the table is &lt;a href="https://questdb.io/docs/concept/partitions" rel="noopener noreferrer"&gt;partitioned&lt;/a&gt; by the designated timestamp.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fquestdb.io%2Fimg%2Fblog%2F2022-05-26%2Fstorage-format.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fquestdb.io%2Fimg%2Fblog%2F2022-05-26%2Fstorage-format.png" alt="A diagram showing column file partitioning"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When a SQL filter (i.e. the WHERE clause) is executed, the database needs to scan the files for the filtered columns. As you may have already guessed, when the column files are large enough, or the query touches multiple partitions, filtering the records on a single thread is inefficient. Instead, the file(s) can be split into contiguous chunks (we call them "page frames"). Then, multiple threads can execute the filter on each page frame, utilizing both CPU and disk resources far more effectively.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fquestdb.io%2Fimg%2Fblog%2F2022-05-26%2Fhow-filtering-works.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fquestdb.io%2Fimg%2Fblog%2F2022-05-26%2Fhow-filtering-works.png" alt="A diagram showing how parallel page frame scanning works"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We already had this optimization in place for some of the analytical types of queries, but not for full or partial table scans with a filter. That's basically what we've added in version 6.3.&lt;/p&gt;
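&lt;p&gt;As a rough illustration of the idea (a hypothetical Python sketch, not QuestDB's actual Java implementation), here is a column split into fixed-size page frames with the filter executed over the frames on a worker pool, collecting the identifiers of matching rows:&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

PAGE_FRAME_SIZE = 4  # tiny for illustration; real frames hold many rows

def filter_frame(column, start, end, predicate):
    # Each worker scans one contiguous chunk and returns matching row ids.
    return [row for row in range(start, end) if predicate(column[row])]

def parallel_filter(column, predicate, workers=4):
    frames = [(lo, min(lo + PAGE_FRAME_SIZE, len(column)))
              for lo in range(0, len(column), PAGE_FRAME_SIZE)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda f: filter_frame(column, *f, predicate), frames)
    # Frames are processed independently; Executor.map yields results in
    # submission order, so concatenating preserves the table's row order.
    return [row for frame_rows in results for row in frame_rows]

velocity = [88.0, 95.5, 40.2, 91.0, 60.0, 99.9, 10.0, 92.3]
print(parallel_filter(velocity, lambda v: v > 90.0))
# → [1, 3, 5, 7]
```

&lt;p&gt;Real page frames are, of course, far larger, and QuestDB works over native memory rather than Python lists, but the division of work is the same.&lt;/p&gt;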

&lt;p&gt;As usual, there are edge cases and hidden pitfalls, so the implementation is not as simple as it may sound. Say, what if your query has a filter and a LIMIT -10 clause, just like in our recent benchmark? Then the database should execute the query in parallel, fetch the last 10 records, and cancel the remaining page frame filtering tasks, so that no useless filtering is done by other worker threads. A similar cancellation should take place when a PGWire or HTTP connection is closed or a query execution times out. So, as you already saw in the above comparison, we made sure to handle all of these edge cases. If you're interested in the implementation details, check out this lengthy &lt;a href="https://github.com/questdb/questdb/pull/1732" rel="noopener noreferrer"&gt;pull request&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;From the end user perspective, this optimization is always enabled and applies to both non-JIT and JIT-compiled filters. But how much does it improve QuestDB's performance? Let's find out!&lt;/p&gt;

&lt;h2&gt;
  
  
  Speed up measurements
&lt;/h2&gt;

&lt;p&gt;We'll be using the same benchmark environment as above while using a slightly different query to keep things simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;readings&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;velocity&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;75&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;fuel_consumption&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This query counts the total number of measurements sent from fast-moving, yet fuel-efficient trucks.&lt;/p&gt;

&lt;p&gt;First, we focus on the cold execution time, i.e. the situation when the column file data is not in the OS page cache. Multi-threaded runs use QuestDB 6.3.1, while single-threaded ones use version 6.2.0 of the database. That's because, starting from 6.3, JIT compilation is only available when parallel filter execution is on. The database configuration is kept default, except for JIT being disabled or enabled in the corresponding measurements. Also notice that while this particular query supports JIT compilation, there are a number of &lt;a href="https://questdb.io/docs/concept/jit-compiler/#known-limitations" rel="noopener noreferrer"&gt;limitations&lt;/a&gt; on the types of queries the JIT compiler supports.&lt;/p&gt;

&lt;p&gt;The below chart shows the cold execution times.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fquestdb.io%2Fimg%2Fblog%2F2022-05-26%2Fbefore-and-after-cold-runs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fquestdb.io%2Fimg%2Fblog%2F2022-05-26%2Fbefore-and-after-cold-runs.png" alt="A chart comparing cold query execution time improvements in QuestDB 6.3 - Query 3"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What's that? Parallel filter execution is only about two times faster. More than that, enabling JIT-compiled filters has almost no effect on the end result. The thing is, the disk is the bottleneck here.&lt;/p&gt;

&lt;p&gt;Let's try to make some sense of these results. It takes around 30.7 seconds for QuestDB 6.3 to execute the query when the data is only on disk. The query engine has to scan two groups of column files across 182 partitions, each partition holding two 50 MB files. This gives us around 18.2 GB of on-disk data and around 592 MB/s disk read rate. That's lower than the configured maximum of our EBS volume, but we should keep in mind the allowed 10% fluctuation from the maximum throughput and, more importantly, the &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-optimized.html" rel="noopener noreferrer"&gt;individual limits&lt;/a&gt; for EBS-optimized instances. Our instance type is c5a.12xlarge and, according to the AWS documentation, it's limited to 594 MB/s at 128 KiB I/O, which is very close to our back-of-the-envelope calculation.&lt;/p&gt;
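&lt;p&gt;The back-of-the-envelope calculation is easy to reproduce (a quick Python check using the figures quoted above):&lt;/p&gt;

```python
partitions = 182
files_per_partition = 2  # one column file per filtered column
file_size_mb = 50        # approximate size of each column file
query_time_s = 30.7      # cold execution time on QuestDB 6.3

total_mb = partitions * files_per_partition * file_size_mb
print(f"{total_mb / 1000:.1f} GB scanned")              # 18.2 GB scanned
print(f"{total_mb / query_time_s:.1f} MB/s read rate")  # 592.8 MB/s read rate
```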

&lt;p&gt;Long story short, we're maxing out the disk with multi-threaded query execution while single-threaded execution time in version 6.2 stays the same. With this in mind, further instance type and volume improvements would lead to better performance.&lt;/p&gt;

&lt;p&gt;Things should get even more exciting in the hot execution scenario, so here we go. In this and all of the subsequent benchmark runs, we measure the average hot execution time for the same query.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fquestdb.io%2Fimg%2Fblog%2F2022-05-26%2Fbefore-and-after-hot-runs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fquestdb.io%2Fimg%2Fblog%2F2022-05-26%2Fbefore-and-after-hot-runs.png" alt="A chart comparing hot query execution time improvements in QuestDB 6.3 - Query 3"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On this particular box, the default QuestDB configuration leads to 16 threads being used for the shared worker pool. So, both 6.3 runs execute the filter on multiple threads, speeding up the query compared with the 6.2 runs. Another observation is the noticeable difference between JIT-compiled and non-JIT filters on 6.3. So, even with many cores available for parallel query execution, it's a good idea to keep JIT compilation enabled.&lt;/p&gt;

&lt;p&gt;You might have noticed a weird proportion in the above chart, namely the difference between the execution times when JIT compilation is disabled. QuestDB 6.2 takes 30 seconds to finish the query with a single thread, while 6.3 takes only roughly 1.3 seconds. That's a 23x improvement, which is impossible to explain with parallel processing alone (remember, we run the filter on 16 threads). So, what could be the reason?&lt;/p&gt;

&lt;p&gt;The thing is that parallel filter execution uses the same batch-based model as JIT-compiled filter functions. This means that the filter is executed in a tight, CPU-friendly loop, while the resulting identifiers of the matching rows are stored in an intermediate array. For instance, if we restrict the parallel filter engine to a single thread, which is as simple as adding the &lt;code&gt;shared.worker.count=1&lt;/code&gt; database setting, the query under test executes in around 13.5 seconds. Thus, in this very scenario, batch-based filter processing on a single thread alone cuts the query execution time by 55%. Obviously, multiple threads available to the engine let it run even faster. Refer to this &lt;a href="https://questdb.io/blog/2022/01/12/jit-sql-compiler/#jit-based-filtering" rel="noopener noreferrer"&gt;blog post&lt;/a&gt; for more information on how we do batch-based filter processing in our SQL JIT compiler.&lt;/p&gt;

&lt;p&gt;There is one more optimization opportunity around the query we used here. For queries that select only simple aggregate functions, like &lt;code&gt;count(*)&lt;/code&gt; or &lt;code&gt;max()&lt;/code&gt;, and no column values, we could push the functions down into the filter loop. For example, the filter loop would increment the &lt;code&gt;count(*)&lt;/code&gt; function's counter in-place rather than doing a more generic accumulation of the filtered row identifiers. You could say that such queries are rather niche, but they come up in various dashboard applications. Thus, it's something we definitely consider adding in the future.&lt;/p&gt;
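&lt;p&gt;The idea can be sketched in hypothetical Python (again, not QuestDB internals): instead of materializing matched row identifiers and aggregating afterwards, the counter is bumped inside the filter loop itself, so no intermediate array is ever allocated:&lt;/p&gt;

```python
def count_generic(column, predicate):
    # Generic path: materialize matching row ids, then aggregate.
    matched = [row for row, value in enumerate(column) if predicate(value)]
    return len(matched)

def count_pushed_down(column, predicate):
    # Pushed-down path: the count(*) counter is incremented in-place
    # within the filter loop; no intermediate row-id array is built.
    count = 0
    for value in column:
        if predicate(value):
            count += 1
    return count

velocity = [88.0, 95.5, 40.2, 91.0, 60.0, 99.9]
fast = lambda v: v > 90.0
print(count_generic(velocity, fast), count_pushed_down(velocity, fast))
# → 3 3 (same answer; the pushed-down version skips the array)
```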

&lt;h2&gt;
  
  
  What's next?
&lt;/h2&gt;

&lt;p&gt;Certainly, the parallel SQL filter execution introduced in 6.3 is not the final point in our quest. As we've mentioned already, we have multi-threading in place for aggregate queries, like &lt;code&gt;SAMPLE BY&lt;/code&gt; or &lt;code&gt;GROUP BY&lt;/code&gt;, but only for certain shapes of them. Aggregate function push-down is another potential optimization. So stay tuned for further improvements!&lt;/p&gt;

&lt;p&gt;As always, we encourage our users to try out the 6.3.1 release on your QuestDB instances and provide feedback in our &lt;a href="https://slack.questdb.io/" rel="noopener noreferrer"&gt;Slack Community&lt;/a&gt;. You can also play with our &lt;a href="https://demo.questdb.io/" rel="noopener noreferrer"&gt;live demo&lt;/a&gt; to see how fast it executes your queries. And, of course, open-source contributions to our &lt;a href="https://github.com/questdb/questdb" rel="noopener noreferrer"&gt;project on GitHub&lt;/a&gt; are more than welcome.&lt;/p&gt;

</description>
      <category>timeseries</category>
      <category>sql</category>
      <category>database</category>
      <category>java</category>
    </item>
    <item>
      <title>Join Hacktoberfest 2021 and contribute to QuestDB!</title>
      <dc:creator>PSWU</dc:creator>
      <pubDate>Tue, 05 Oct 2021 15:16:50 +0000</pubDate>
      <link>https://dev.to/questdb/join-hacktoberfest-2021-and-contribute-to-questdb-3o9p</link>
      <guid>https://dev.to/questdb/join-hacktoberfest-2021-and-contribute-to-questdb-3o9p</guid>
      <description>&lt;p&gt;Hacktoberfest 2021 is starting today! For the first time, &lt;a href="https://questdb.io"&gt;QuestDB&lt;/a&gt; is participating as an open source project. We're super excited to meet with other open source contributors and maintainers.&lt;/p&gt;

&lt;p&gt;For those who aren't familiar with Hacktoberfest, it's a month-long online celebration of open source software and communities. By &lt;a href="https://hacktoberfest.digitalocean.com/resources/participation"&gt;contributing to open source projects&lt;/a&gt;, you can get a special edition Hacktoberfest T-shirt 👕 or choose to plant a tree for our planet. 🌴&lt;/p&gt;

&lt;p&gt;Many widely used open-source projects are maintained by a small number of developers or even a single person without any financial incentives. And we rely so much on their perseverance and commitment! Participating in Hacktoberfest is one of our approaches to raise awareness and encourage more people to contribute to open source. To celebrate Hacktoberfest, we put together some hints for you to get started.&lt;/p&gt;

&lt;h2&gt;
  
  
  ⛳ Get started
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Make sure you have a GitHub (or GitLab) account&lt;/li&gt;
&lt;li&gt;Sign up for the event at
&lt;a href="https://hacktoberfest.digitalocean.com/"&gt;Hacktoberfest's official website&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Go to open source repositories that opt in for Hacktoberfest:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;QuestDB Core Project&lt;/strong&gt;:
&lt;a href="https://github.com/questdb/questdb"&gt;https://github.com/questdb/questdb&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;QuestDB Documentation&lt;/strong&gt;:
&lt;a href="https://github.com/questdb/questdb.io"&gt;https://github.com/questdb/questdb.io&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Or, look for other open source projects labeled with &lt;code&gt;hacktoberfest&lt;/code&gt; in their topics&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;If you're new to the project, look for open issues labeled &lt;code&gt;good first issue&lt;/code&gt; or &lt;code&gt;help wanted&lt;/code&gt; to get started&lt;/li&gt;
&lt;li&gt;Before you commit, don't forget to read &lt;code&gt;CONTRIBUTING.md&lt;/code&gt; and follow the contribution guidelines 👍&lt;/li&gt;
&lt;/ol&gt;
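&lt;p&gt;The issue-hunting step above can also be scripted. Here is a minimal sketch (the helper name &lt;code&gt;build_search_url&lt;/code&gt; is ours, not part of any QuestDB tooling) that builds a GitHub search-API URL for open issues carrying a given label:&lt;/p&gt;

```python
import urllib.parse

def build_search_url(repo: str, label: str) -> str:
    """Build a GitHub search-API URL for open issues with a given label."""
    # GitHub's search syntax: qualifiers separated by spaces, quoted label.
    query = f'repo:{repo} is:issue is:open label:"{label}"'
    # Percent-encode the query so spaces, colons, and quotes are URL-safe.
    return "https://api.github.com/search/issues?q=" + urllib.parse.quote(query)

print(build_search_url("questdb/questdb", "good first issue"))
```

Fetching the resulting URL returns JSON whose `items` list contains the matching issues; the same query also works directly in GitHub's web search box.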

&lt;h2&gt;
  
  
  🎁 Tees, trees and QuestDB swag
&lt;/h2&gt;

&lt;p&gt;Once you reach the &lt;a href="https://hacktoberfest.digitalocean.com/resources/participation"&gt;contribution target&lt;/a&gt; of &lt;strong&gt;4 valid pull requests&lt;/strong&gt;, you can claim your reward from the official organizer! In addition, if you successfully contribute to QuestDB projects, we offer you extra swag through our &lt;a href="https://questdb.io/community/"&gt;SWAG program&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;To make sure that your pull request is valid, please follow Hacktoberfest's &lt;a href="https://hacktoberfest.digitalocean.com/resources/qualitystandards"&gt;quality standards&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  ℹ️ Get support and updates
&lt;/h2&gt;

&lt;p&gt;Questions may come up while you're contributing to QuestDB projects; here are the places to get support and hints from our team:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;QuestDB Community Slack: &lt;a href="https://slack.questdb.io"&gt;https://slack.questdb.io&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GitHub Discussions:
&lt;a href="https://github.com/questdb/questdb/discussions"&gt;https://github.com/questdb/questdb/discussions&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;QuestDB Documentation:
&lt;a href="https://questdb.io/docs/introduction/"&gt;https://questdb.io/docs/introduction/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Also, don't forget to follow us on social media to receive the latest updates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;QuestDB Twitter: &lt;a href="https://twitter.com/questdb"&gt;https://twitter.com/questdb&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;QuestDB LinkedIn:
&lt;a href="https://www.linkedin.com/company/questdb/"&gt;https://www.linkedin.com/company/questdb/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>java</category>
      <category>hacktoberfest</category>
      <category>opensource</category>
      <category>typescript</category>
    </item>
    <item>
      <title>Thank you Hacktoberfest!</title>
      <dc:creator>PSWU</dc:creator>
      <pubDate>Wed, 04 Nov 2020 08:05:45 +0000</pubDate>
      <link>https://dev.to/pswu11/thank-you-hacktoberfest-429d</link>
      <guid>https://dev.to/pswu11/thank-you-hacktoberfest-429d</guid>
      <description>&lt;p&gt;Hacktoberfest 2020 finally wrapped up. It’s been the first one we’ve done here at Jina &lt;em&gt;(c’mon, we were only born this year)&lt;/em&gt;, but it certainly won’t be our last!&lt;/p&gt;

&lt;p&gt;Hacktoberfest isn’t just about the glamor, prestige and adoration of the masses that comes with contributing to open source. You can also get a T-shirt for your contributions, or have a tree planted in your name. We’re hoping quite a few of our readers managed to hit their Hacktoberfest target and can show off their T-shirt (or sapling) with pride. 👕 🌳&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/JwTqLNfrx4OPe/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/JwTqLNfrx4OPe/giphy.gif"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Contribute to Hacktoberfest and get a tree (note: image is for illustration purposes only and may not reflect final product. Pot and ability to dance not included)&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;We’re super grateful for all the pull requests from new open-source contributors. You’ve helped us improve our &lt;a href="//get.jina.ai"&gt;&lt;code&gt;README.md&lt;/code&gt;&lt;/a&gt; in Portuguese, Russian, and Ukrainian. On top of that, many folks also pulled out all the stops in resolving issues related to &lt;code&gt;RouteDriver&lt;/code&gt;, &lt;code&gt;mypy&lt;/code&gt;, or &lt;code&gt;numpy indexer&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iEeiiSlm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://miro.medium.com/max/650/1%2As-jQZQEWMKji56Qeu-4q1w.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iEeiiSlm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://miro.medium.com/max/650/1%2As-jQZQEWMKji56Qeu-4q1w.gif"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here are just a few of the highlights of our community pull requests:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/jina-ai/jina/pull/1097"&gt;https://github.com/jina-ai/jina/pull/1097&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/jina-ai/jina/pull/1057"&gt;https://github.com/jina-ai/jina/pull/1057&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/jina-ai/examples/pull/258"&gt;https://github.com/jina-ai/examples/pull/258&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/jina-ai/jina/pull/1093"&gt;https://github.com/jina-ai/jina/pull/1093&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/jina-ai/jina/pull/1124"&gt;https://github.com/jina-ai/jina/pull/1124&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/jina-ai/jina/pull/1155"&gt;https://github.com/jina-ai/jina/pull/1155&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/jina-ai/jina/pull/1140"&gt;https://github.com/jina-ai/jina/pull/1140&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Huge thanks again 🙏 to &lt;a href="https://github.com/fernandakawasaki"&gt;&lt;strong&gt;fernandakawasaki&lt;/strong&gt;&lt;/a&gt;, &lt;a href="https://github.com/averkij"&gt;&lt;strong&gt;averkij&lt;/strong&gt;&lt;/a&gt;, &lt;a href="https://github.com/jyothishkjames"&gt;&lt;strong&gt;jyothishkjames&lt;/strong&gt;&lt;/a&gt;, &lt;a href="https://github.com/clennan"&gt;&lt;strong&gt;clennan&lt;/strong&gt;&lt;/a&gt;, &lt;a href="https://github.com/Syarol"&gt;&lt;strong&gt;Syarol&lt;/strong&gt;&lt;/a&gt;, &lt;a href="https://github.com/minfun"&gt;&lt;strong&gt;minfun&lt;/strong&gt;&lt;/a&gt; and &lt;a href="https://github.com/yartem"&gt;&lt;strong&gt;yartem&lt;/strong&gt;&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;Hacktoberfest is over for now. But at Jina AI, our mission continues: &lt;strong&gt;building a world-class neural search framework for any kind of data.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We’re all about open source and building a sustainable open source community that reflects that. Join, contribute, help build the future of AI, and get some stickers along the way! 💪&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This article is co-created by Pei-Shan Wu and Alex-CG.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Jina GitHub Repo:&lt;/strong&gt; &lt;a href="//get.jina.ai"&gt;get.jina.ai&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Website:&lt;/strong&gt; &lt;a href="//jina.ai"&gt;jina.ai&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Twitter:&lt;/strong&gt; &lt;a href="https://twitter.com/JinaAI_"&gt;@JinaAI_&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Linkedin:&lt;/strong&gt; &lt;a href="https://www.linkedin.com/company/jinaai/"&gt;https://www.linkedin.com/company/jinaai/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Slack Community:&lt;/strong&gt; &lt;a href="//slack.jina.ai"&gt;slack.jina.ai&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Jina AI x NLP Zurich: Designing the data model for an open source AI search framework</title>
      <dc:creator>PSWU</dc:creator>
      <pubDate>Thu, 15 Oct 2020 13:16:00 +0000</pubDate>
      <link>https://dev.to/pswu11/designing-the-data-model-for-an-open-source-ai-search-framework-9p3</link>
      <guid>https://dev.to/pswu11/designing-the-data-model-for-an-open-source-ai-search-framework-9p3</guid>
      <description>&lt;p&gt;If you are interested in deep learning, machine learning, NLP, data science, open source, and search framework, you might also be interested participating in &lt;strong&gt;our virtual talk at NLP Zurich next Tuesday on Oct 20, 2020&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;My colleague, &lt;a href="https://twitter.com/alexcg" rel="noopener noreferrer"&gt;Alex&lt;/a&gt;, and I got in touch with &lt;strong&gt;NLP Zurich&lt;/strong&gt; some time ago. We immediately felt a connection and decided to host a meetup together. &lt;br&gt;
&lt;a href="https://www.meetup.com/NLP-Zurich/events/273854724/" rel="noopener noreferrer"&gt;So here we come!&lt;/a&gt; 😁&lt;/p&gt;

&lt;p&gt;Before COVID-19, NLP Zurich meetups were usually offline. Since it's only possible to have virtual meetups right now, wouldn't it be nice to meet more people from around the world?&lt;/p&gt;

&lt;h2&gt;
  
  
  About NLP Zurich
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F9dwfsvbzbkxcv6pewlry.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F9dwfsvbzbkxcv6pewlry.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.linkedin.com/company/nlp-zurich/" rel="noopener noreferrer"&gt;NLP Zurich&lt;/a&gt; is the first Natural Language Processing (NLP) open platform in Switzerland.&lt;/strong&gt; It aims to stimulate information and opinion exchange on topics around Artificial Intelligence, Machine Learning and Language Technologies. &lt;/p&gt;

&lt;p&gt;Through events and meetups, it's now becoming the key platform where stakeholders from academia and industry who are passionate about shaping the ecosystem can meet. &lt;/p&gt;

&lt;h2&gt;
  
  
  About Jina
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fgk3935fdv3n1gkj74bqp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fgk3935fdv3n1gkj74bqp.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="//opensource.jina.ai"&gt;Jina&lt;/a&gt; is an open-source search framework that provides an easier way to build neural search on the cloud.&lt;/strong&gt; Whether you’re searching with images, video clips, audio snippets, or texts in various lengths, Jina provides high-level support to many existing neural search modals. &lt;/p&gt;

&lt;p&gt;&lt;a href="//jina.ai"&gt;Jina AI&lt;/a&gt; is the company behind the Jina project. Our mission is to provide &lt;strong&gt;the universal solution for these neural search problems&lt;/strong&gt;. As an open source company, we are working our best to live by open source principles, to continuously improve the developer experience, and to make it as accessible as possible.&lt;/p&gt;

&lt;p&gt;Feel free to join &lt;a href="http://jina-ai.slack.com/" rel="noopener noreferrer"&gt;our community on Slack&lt;/a&gt;! &lt;/p&gt;

&lt;h2&gt;
  
  
  Agenda
&lt;/h2&gt;

&lt;p&gt;18:50 Participants join the webinar&lt;br&gt;
19:00 &lt;strong&gt;Talk: Designing the data model for an open source AI search framework&lt;/strong&gt; (Maximilian Werk, Senior AI Engineer at Jina AI)&lt;br&gt;
19:35 Q&amp;amp;A&lt;br&gt;
19:50 Virtual Hugs and Kisses ⊂(◉‿◉)つ&lt;/p&gt;

&lt;p&gt;Sounds interesting? &lt;br&gt;
&lt;strong&gt;Then register here:&lt;/strong&gt; &lt;a href="https://www.meetup.com/NLP-Zurich/" rel="noopener noreferrer"&gt;https://www.meetup.com/NLP-Zurich/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  About the Speaker
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Max is one of the core contributors of Jina.&lt;/strong&gt; He works on making search truly intelligent, so that it can handle not only text but also graphics, audio, and video data. He is passionate about clean, maintainable code and architecture in the AI environment. &lt;/p&gt;

&lt;p&gt;Max has a master’s in mathematics from TU Berlin and worked as a senior research engineer at Zalando SE before joining Jina AI.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Source: NLP Zurich, Jina AI GmbH&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Jina AI at Hacktoberfest 2020</title>
      <dc:creator>PSWU</dc:creator>
      <pubDate>Wed, 14 Oct 2020 15:07:40 +0000</pubDate>
      <link>https://dev.to/pswu11/jina-ai-at-hacktoberfest-2020-6nn</link>
      <guid>https://dev.to/pswu11/jina-ai-at-hacktoberfest-2020-6nn</guid>
      <description>&lt;p&gt;Two weeks have passed for &lt;strong&gt;2020 Hacktoberfest&lt;/strong&gt;, are you looking for fun project to contribute to? &lt;/p&gt;

&lt;p&gt;You can find &lt;strong&gt;real challenges and true open source spirit here at Jina AI&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dY8xnCQD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/195xnlc1m755wcl1gyox.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dY8xnCQD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/195xnlc1m755wcl1gyox.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  About Hacktoberfest
&lt;/h2&gt;

&lt;p&gt;Hacktoberfest is a month-long (Oct 1 to Oct 31) event celebrating open source software and its community. &lt;/p&gt;

&lt;p&gt;Participants can get a &lt;strong&gt;Hacktoberfest T-shirt&lt;/strong&gt; or &lt;strong&gt;choose to plant a tree&lt;/strong&gt; if they successfully make 4 valid pull requests &lt;strong&gt;(sidenote: PRs have to be merged, approved by a maintainer, OR labelled as hacktoberfest-accepted)&lt;/strong&gt; to any GitHub-hosted open source projects &lt;em&gt;that have opted in by putting #hacktoberfest in their topics&lt;/em&gt;. &lt;/p&gt;

&lt;p&gt;Most importantly, it’s free and open to everyone. All you need to participate is a GitHub account; then register at the &lt;a href="//hacktoberfest.digitalocean.com"&gt;official website&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  About Jina
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="//opensource.jina.ai"&gt;Jina&lt;/a&gt; is an open-source search framework powered by deep-learning technology&lt;/strong&gt;, empowering developers to build cross-modal or multi-modal search systems for text, images, video, and audio.&lt;/p&gt;

&lt;p&gt;As a fairly young open source project, Jina was first released on GitHub in April 2020. The project is currently under heavy development and is maintained by a full-time, venture-backed team.&lt;/p&gt;

&lt;p&gt;Thus, this is a great opportunity for contributors to &lt;strong&gt;make a real impact&lt;/strong&gt;. 💻 🙋&lt;/p&gt;

&lt;p&gt;To give you some idea about how Jina can be used in the real world, here are some use cases created by our community members:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Transformer for lawyers&lt;/strong&gt; &lt;a href="https://github.com/ArturTan/transformers-for-lawyers"&gt;(Read More)&lt;/a&gt; by &lt;a href="https://github.com/ArturTan"&gt;ArturTan&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chatbot integration: Jina AI with Rasa&lt;/strong&gt; &lt;a href="https://chatbotslife.com/jina-ai-with-rasa-1e81a8b869cc"&gt;(Read More)&lt;/a&gt; by &lt;a href="https://github.com/sibbsnb"&gt;sibbsnb&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But of course, there are more possibilities to be explored!&lt;/p&gt;

&lt;p&gt;Jina’s core components are written mostly in Python. We also make use of other open source software stacks such as TensorFlow, Docker, PyTorch, Hugging Face, etc.&lt;/p&gt;

&lt;p&gt;Check out &lt;a href="//opensource.jina.ai"&gt;Jina’s repo on GitHub&lt;/a&gt;!&lt;/p&gt;

&lt;h2&gt;
  
  
  Contributing to Jina — where to start?
&lt;/h2&gt;

&lt;p&gt;Whether you’re a beginner or a veteran, we welcome all kinds of contributors from the open-source community. We would love to have more active contributors, and even users, working together with our team.&lt;/p&gt;

&lt;p&gt;In line with Hacktoberfest’s message, Jina’s team also &lt;strong&gt;values quality over quantity&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Before you submit your pull request, make sure you have read through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hacktoberfest’s participation rules and quality standards (&lt;a href="https://hacktoberfest.digitalocean.com/details"&gt;Here&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Jina’s contribution guideline (&lt;a href="https://github.com/jina-ai/jina/blob/master/CONTRIBUTING.md"&gt;Here&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Jina’s code of conduct (&lt;a href="https://github.com/jina-ai/jina/blob/master/CODE_OF_CONDUCT.md"&gt;Here&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most importantly, please do not hesitate to join our &lt;a href="https://join.slack.com/t/jina-ai/shared_invite/zt-dkl7x8p0-rVCv~3Fdc3~Dpwx7T7XG8w"&gt;&lt;strong&gt;Slack Community&lt;/strong&gt;&lt;/a&gt; if you need more guidance when contributing to Jina. &lt;/p&gt;

&lt;p&gt;We look forward to seeing your pull requests!&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
