
Francesco Tisiot

Originally published at ftisiot.net

1 billion rows challenge in PostgreSQL and ClickHouse

Last week the good old Gunnar Morling launched an interesting challenge about aggregating 1 billion rows in Java. Like my ex-colleague and friend Robin Moffatt, I'm not at all a Java expert, and while Robin used DuckDB to solve the challenge, I did the same with PostgreSQL and ClickHouse.

Alert: the following is NOT a benchmark! The test is done with default installations of both databases and NO optimization. The blog only shows the technical viability of a solution.
 

Need a FREE PostgreSQL database? 🦀 Check Aiven's FREE plans! 🦀
Need to optimize your SQL query? 🐧 Check EverSQL! 🐧

Generate the data

I used pretty much the same steps as Robin to generate the data:

  • Forked the repository and cloned it locally
  • Installed Java 21 to generate the data
sdk install java 21.0.1-zulu
sdk use java 21.0.1-zulu
  • Built the data generator
./mvnw clean verify
  • Generated some rows
./create_measurements.sh 1000000000

The above generates a file named measurements.txt containing the 1 billion rows.
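
Each line in the file is a station name and a temperature reading separated by a semicolon. For reference, the first few rows look roughly like this (the actual values are random, so yours will differ):

Hamburg;12.0
Bulawayo;8.9
Palembang;38.8
St. John's;15.2
Cracow;12.6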

PostgreSQL

Setup local PostgreSQL

How do I load a local file into a PostgreSQL database? Well, it depends on where the PostgreSQL database is. For my test I could have used Aiven, but I opted for a local installation on my Mac.

I installed PostgreSQL 16 with

brew install postgresql@16

I can check how to start PostgreSQL locally with:

brew info postgresql@16

It tells me to execute:

LC_ALL="C" /usr/local/opt/postgresql@16/bin/postgres -D /usr/local/var/postgresql@16

Once it's running, we should see the welcoming message LOG: database system is ready to accept connections

Load the data in PostgreSQL

With PostgreSQL running with all the default settings (read more here), we are ready to load the data.

We can connect to our PostgreSQL database with

psql postgres
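
Once connected, a quick sanity check confirms which server we are talking to (optional, but a good habit):

SELECT version();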

The next step is to create a table to hold our data:

CREATE UNLOGGED TABLE TEST(CITY TEXT, TEMPERATURE FLOAT);

We are using the UNLOGGED parameter to speed up the copy of the data; however, we are accepting the risk of not writing any WAL in the process... probably not the smartest idea if this is data we'll actually want to reuse later. We can load the data with the COPY command:

\copy TEST(CITY, TEMPERATURE) FROM 'measurements.txt' DELIMITER ';' CSV;
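
Once the COPY finishes, a couple of quick checks (not part of the timing below) confirm that all the rows landed and show how much space they take on disk:

-- should return 1000000000
SELECT COUNT(*) FROM test;

-- on-disk size of the table, including any indexes and TOAST data
SELECT pg_size_pretty(pg_total_relation_size('test'));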

PostgreSQL Results

Following the same reasoning as Robin, I was able to get to a query like

WITH AGG AS(
    SELECT city, 
           MIN(temperature) min_measure,
           cast(AVG(temperature) AS DECIMAL(8,1)) mean_measure,
           MAX(temperature) max_measure
    FROM test 
    GROUP BY city
    LIMIT 5)
SELECT STRING_AGG(CITY || '=' || CONCAT(min_measure,'/', mean_measure,'/', max_measure),', ' ORDER BY CITY) 
FROM AGG
;

The main difference compared to DuckDB is the use of the STRING_AGG function, which allows me to directly create the string, ordered by CITY, with all the metrics.
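
As a toy illustration of STRING_AGG with an inline ORDER BY (the cities and numbers below are made up, purely to show the output shape):

SELECT STRING_AGG(city || '=' || measure, ', ' ORDER BY city)
FROM (VALUES ('Rome', '10.0/15.0/20.0'),
             ('Accra', '11.0/21.0/31.0')) AS t(city, measure);
-- returns: Accra=11.0/21.0/31.0, Rome=10.0/15.0/20.0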

Note: If I use EverSQL to optimise the query, it'll provide a suggestion to add an index on city and temperature:

CREATE INDEX test_idx_city_temperature ON "test" ("city","temperature");

EverSQL suggesting an index
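
Whether an index actually helps a one-off full-table aggregation is debatable; one way to check what the planner really does is to look at the execution plan, for example:

-- shows whether the planner picks the index or a sequential scan
EXPLAIN (ANALYZE)
SELECT city, MIN(temperature), AVG(temperature), MAX(temperature)
FROM test
GROUP BY city;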

PostgreSQL Timing

To get the timing I created a file called test.sql with the entire set of commands:

\timing
\o /tmp/output
-- Load the data
DROP TABLE TEST;
CREATE UNLOGGED TABLE TEST(CITY TEXT, TEMPERATURE FLOAT);
COPY TEST(CITY, TEMPERATURE) FROM '<PATH_TO_FILE>/measurements.txt' DELIMITER ';' CSV;

-- Run calculations
WITH AGG AS(
    SELECT city,
           MIN(temperature) min_measure,
           cast(AVG(temperature) AS DECIMAL(8,1)) mean_measure,
           MAX(temperature) max_measure
    FROM test
    GROUP BY city
    )
SELECT STRING_AGG(CITY || '=' || CONCAT(min_measure,'/', mean_measure,'/', max_measure),', ' ORDER BY CITY)
FROM AGG
;

And then timed it with:

time psql postgres -f test.sql

The timing was a not-astonishing 9m16.135s, with the majority (6m24.376s) spent on copying the data and 2m51.443s on aggregating.

Edit: Using the PostgreSQL Foreign Data Wrapper (FDW)

As suggested on HN, by using the file Foreign Data Wrapper we can eliminate the need to load the data into a PostgreSQL table.

The test.sql file has been changed to:

\timing
\o /tmp/output
-- Load the data
DROP TABLE TEST;

CREATE EXTENSION file_fdw;
CREATE SERVER stations FOREIGN DATA WRAPPER file_fdw;
CREATE FOREIGN TABLE TEST (
  city text,
  temperature float
) SERVER stations OPTIONS (filename '<PATH>/measurements.txt', format 'csv', delimiter ';');

-- Run calculations
WITH AGG AS(
    SELECT city,
           MIN(temperature) min_measure,
           cast(AVG(temperature) AS DECIMAL(8,1)) mean_measure,
           MAX(temperature) max_measure
    FROM test
    GROUP BY city
    )
SELECT STRING_AGG(CITY || '=' || CONCAT(min_measure,'/', mean_measure,'/', max_measure),', ' ORDER BY CITY)
FROM AGG
;

The biggest change is that the TEST table is now defined as a FOREIGN TABLE pointing directly at the measurements.txt file.
With the file FDW we removed the need to load the data into a table (which was costing us more than 6 minutes), but now the overall query takes 8m24.572s. Compared to the copy-and-query solution, we are about 1 minute faster.
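
Two small, optional tweaks for this setup: CREATE EXTENSION IF NOT EXISTS makes the script re-runnable without errors, and a quick preview confirms the foreign table is reading the file correctly before launching the full aggregation:

-- avoids an error if the extension is already installed
CREATE EXTENSION IF NOT EXISTS file_fdw;

-- preview a few rows straight from measurements.txt
SELECT * FROM test LIMIT 5;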

ClickHouse

Setup local ClickHouse

How do I load a local file into a ClickHouse database? Well, it depends on where the ClickHouse database is. For my test I could have used Aiven, but I opted for a local installation on my Mac.

I installed ClickHouse locally with (source)

 curl https://clickhouse.com/ | sh

Query the data in ClickHouse

Compared to PostgreSQL, ClickHouse allows me to query the CSV file directly, without loading the data into a table.

I can query the measurements.txt file with:

./clickhouse local -q "SELECT c1 as city, c2 as measurement FROM file('measurements.txt', CSV) LIMIT 5"  --format_csv_delimiter=";"

In the above I defined:

  • a pointer to a file called measurements.txt in CSV format
  • a custom delimiter ;
  • the first column c1 as city
  • the second column c2 as measurement

The input is correctly parsed.

Wellington  12
Riga    0.30000000000000004
Palermo 18.4
Sochi   10.8
Accra   11.7
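
ClickHouse also infers the schema of the file automatically; we can see what it came up with by describing the file() table function, passing the same delimiter setting as above:

./clickhouse local -q "DESCRIBE file('measurements.txt', CSV)" --format_csv_delimiter=";"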

ClickHouse Results

Following the same reasoning as Robin, I was able to get to a query like:

WITH AGG AS(
    SELECT city,
           cast(MIN(temperature) AS DECIMAL(8,1)) min_measure,
           cast(AVG(temperature) AS DECIMAL(8,1)) mean_measure,
           cast(MAX(temperature) AS DECIMAL(8,1)) max_measure
    FROM (SELECT c1 as city, c2 as temperature FROM file('measurements.txt', CSV))
    GROUP BY city
    ORDER BY city
    )
SELECT arrayStringConcat(groupArray(city || '=' || CONCAT(min_measure,'/', mean_measure,'/', max_measure)), ', ')
FROM AGG

The main difference compared to the PostgreSQL solution is the use of groupArray to create the list of cities (note the ORDER BY city in the inner query, which orders them correctly) and arrayStringConcat to concatenate the array elements into a string.
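
As a minimal, self-contained illustration of the two functions (run via ./clickhouse local; the city names are just placeholders):

SELECT arrayStringConcat(groupArray(city), ', ')
FROM (SELECT arrayJoin(['Cairo', 'Accra', 'Berlin']) AS city ORDER BY city)
-- returns: Accra, Berlin, Cairo

groupArray keeps the order of the incoming rows here because the inner query has an ORDER BY, which is exactly what the full query above relies on.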

ClickHouse Timing

To get the timing I executed the following:

time ./clickhouse local -q """WITH AGG AS(
    SELECT city,
           cast(MIN(temperature) AS DECIMAL(8,1)) min_measure,
           cast(AVG(temperature) AS DECIMAL(8,1)) mean_measure,
           cast(MAX(temperature) AS DECIMAL(8,1)) max_measure
    FROM (SELECT c1 as city, c2 as temperature FROM file('measurements.txt', CSV))
    GROUP BY city
    ORDER BY city
    )
SELECT arrayStringConcat(groupArray(city || '=' || CONCAT(min_measure,'/', mean_measure,'/', max_measure)), ', ')
FROM AGG"""  --format_csv_delimiter=";"

The result is 44.465s!

Conclusion

Both PostgreSQL and ClickHouse were able to complete the challenge. PostgreSQL's initial solution had the limitation of forcing the upload of the data into a table, which took most of the time, and even when using the file FDW the performance still wasn't great. ClickHouse, on the other hand, was able to query the CSV directly and got much faster results, as expected from a database designed for analytics.

Alert: the above is NOT a benchmark! The test is done with default installations of both databases and NO optimization. The blog only shows the technical viability of the solution.

Top comments (5)

Tahmid Bin Taslim Rafi

@ftisiot That's a very cool comparison. But I think everyone will still prefer PostgreSQL as it is free, open source and widely adopted by developers. Haven't tried ClickHouse but it looks good from your explanation.

Darryl Ruggles

Really interesting comparison!

Francesco Tisiot

Thanks Darryl!

Dawid Kimana

The perf diff is astronomical, good read! 👍🏽

noamanahmed

Switching from a row-based data store to a column-based data store would definitely yield better results if you are using aggregate functions.