This is the first part of the ClickHouse Tutorial Series. In this series, I cover some practical and vital aspects of the ClickHouse database, a robust OLAP technology many enterprise companies utilize.
In this part, I'll talk about the main features, weaknesses, installation, and usage of ClickHouse. I'll also refer to some helpful links for those who want to dive into broader details.
What is ClickHouse
ClickHouse is an open-source column-oriented database developed by Yandex. It is designed to provide high performance for analytical queries. ClickHouse uses a SQL-like query language for querying data and supports different data types, including integers, strings, dates, and floats. It offers various features such as clustering, distributed query processing, and fault tolerance. It also supports replication and data sharding. ClickHouse is used by companies such as Yandex, Facebook, and Uber for data analysis, machine learning, and more.
Main Features
The main features of the ClickHouse database are:
Column-Oriented
Data in ClickHouse is stored in columns instead of rows, bringing at least two benefits:
- Every column is stored in a separate file, so similar values sit next to each other; this enables stronger compression for each column and for the whole table.
- In the range queries common in analytical processing, the system can access and process data more efficiently because it only reads the columns it needs and the data is sorted by the sort-key columns. It can also parallelize work across CPU cores when scanning massive columns (see the example query after the note below).
Note: Column-oriented should not be confused with wide-column databases like Cassandra, which store data in rows but let you denormalize data into tables with many columns, leading to a NoSQL structure.
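To make the benefit concrete, here is the kind of analytical query that column orientation speeds up. The table and columns (web_events, EventDate, Country, Revenue) are hypothetical and only serve as an illustration:
-- Only the three referenced columns are read from disk,
-- even if web_events has dozens of other columns.
SELECT Country, sum(Revenue) AS total_revenue
FROM web_events
WHERE EventDate BETWEEN '2021-01-01' AND '2021-03-31'
GROUP BY Country
ORDER BY total_revenue DESC;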
Data Compression
Thanks to compression codecs such as ZSTD and LZ4, data occupies far less storage, sometimes more than 20x less! You can study some storage benchmarks of ClickHouse and other databases here.
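You can also choose the codec per column when creating a table. A minimal sketch, with a made-up table used purely for illustration:
CREATE TABLE test.events
(`EventDate` Date,
 `Message` String CODEC(ZSTD(3)),  -- heavier compression for large text values
 `Counter` UInt64 CODEC(LZ4))      -- fast, lightweight compression
ENGINE = MergeTree
ORDER BY EventDate;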
Scalability
ClickHouse scales well both vertically and horizontally. You can scale it by adding replicas and shards to process queries in a distributed way. ClickHouse supports multi-master asynchronous replication and can be deployed across multiple data centers. All nodes are equal, which avoids single points of failure.
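To give a rough idea of what a distributed setup looks like, here is a sketch that assumes a cluster named my_cluster is already defined in the server configuration, along with the {shard} and {replica} macros; none of that is set up in this tutorial, so treat it purely as an illustration:
-- Replicated table stored locally on every node of the cluster
CREATE TABLE test.orders_local ON CLUSTER my_cluster
(`OrderID` Int64,
 `OrderDate` DateTime)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/orders_local', '{replica}')
ORDER BY (OrderID, OrderDate);

-- Distributed table that fans queries out to all shards
CREATE TABLE test.orders_all ON CLUSTER my_cluster AS test.orders_local
ENGINE = Distributed(my_cluster, test, orders_local, rand());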
Weaknesses
To mention a few:
- Lack of full-fledged UPDATE/DELETE support: ClickHouse is not designed for frequent modifications and mutations, so you'll see poor performance with those kinds of queries (see the sketch after this list).
- OLTP-style queries, such as point lookups of single rows, would not make you happy either, since traditional RDBMSs like MySQL easily outperform ClickHouse on them.
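For reference, updates and deletes do exist, but they run as asynchronous mutations (ALTER TABLE ... UPDATE/DELETE) that rewrite whole data parts in the background, which is why they are expensive. A quick illustration against the orders table we create later in this post:
-- Both statements are mutations: they rewrite the affected data parts asynchronously.
ALTER TABLE test.orders DELETE WHERE Cancelled = true;
ALTER TABLE test.orders UPDATE Comments = '' WHERE OrderID = 334;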
Rivals and Alternatives
To name a few:
- Apache Druid
- ElasticSearch
- SingleStore
- Snowflake
- TimescaleDB
Of course, each one suits different use cases and has its own pros and cons, but I won't compare them here. However, you can study some valuable benchmarks here and here.
Quick Start
Installation
I only cover the Docker approach here. For other methods on different distros, please follow ClickHouse's official installation guide.
The docker-compose file:
version: '2'
services:
  clickhouse:
    container_name: myclickhouse
    image: clickhouse/clickhouse-server:latest
    ports:
      - "8123:8123"
      - "9000:9000"
    volumes:
      - ./clickhouse-data:/var/lib/clickhouse/
    restart: unless-stopped
Then run it with:
docker compose up -d
As you can see, two ports have been exposed:
- 8123: the HTTP interface port, used by JDBC, ODBC, and web interfaces.
- 9000: the native protocol port (ClickHouse TCP), used by clickhouse-client, other native ClickHouse tools, and for inter-server communication in distributed queries.
Which port you use depends on your client driver. For example, DBeaver uses 8123, and the Python clickhouse-driver uses 9000.
To continue the tutorial, we use clickhouse-client, which is available inside the container we just started:
docker exec -it myclickhouse clickhouse-client
Database and Table Creation
Create database test:
CREATE DATABASE test;
Then create the table orders:
CREATE TABLE test.orders
(`OrderID` Int64,
`CustomerID` Int64,
`OrderDate` DateTime,
`Comments` String,
`Cancelled` Bool)
ENGINE = MergeTree
PRIMARY KEY (OrderID, OrderDate)
ORDER BY (OrderID, OrderDate, CustomerID)
SETTINGS index_granularity = 8192;
In the next parts, we'll talk about other configurations like ENGINE, PRIMARY KEY, ORDER BY, etc.
Insert Data
To insert sample data:
INSERT INTO test.orders
VALUES (334, 123, '2021-09-15 14:30:00', 'some comment', false);
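Note that ClickHouse performs much better with batched inserts than with many single-row statements, so in practice you would insert multiple rows at once:
INSERT INTO test.orders VALUES
(335, 124, '2021-09-15 15:00:00', 'first of a batch', false),
(336, 124, '2021-09-16 09:10:00', '', false),
(337, 125, '2021-09-16 11:45:00', 'cancelled by customer', true);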
Read Data
Just like any other SQL query:
SELECT OrderID, OrderDate FROM test.orders;
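Since ClickHouse shines at analytical queries, here is a slightly richer example over the same table, counting orders per customer in a date range (it assumes the sample rows above have been inserted):
SELECT CustomerID,
       count() AS orders,
       countIf(Cancelled) AS cancelled_orders
FROM test.orders
WHERE OrderDate >= '2021-09-01' AND OrderDate < '2021-10-01'
GROUP BY CustomerID
ORDER BY orders DESC;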
Conclusion
In the first part of the ClickHouse Tutorial Series, we discussed the traits, features, and weaknesses of ClickHouse. Then we saw how to set up an instance with minimum configuration, create a database and table, insert data into it, and read from it.