This tutorial originally appeared on StackPath Blog.
Businesses use data analytics to extract meaningful, decision-making information for their day-to-day operations. Analytics can discover hidden relations, find unexpected patterns, learn about customer behavior, understand the past and predict the future.
In this tutorial, we’ll learn about Clickhouse, a column-oriented database for real-time reports. We will explore how it works and take it out for a test drive using StackPath Containers.
Before diving into the tutorial, first we’ll provide an overview of column-based databases, ClickHouse, and the benefits of running ClickHouse on StackPath.
Compared with traditional row-based databases, column databases have a subtle yet profound difference in how data is stored on disk.
A traditional database stores data as rows:
|1||Microsoft||Bill Gates||Bill Gates||2000|
|2||Microsoft||Satya Nadella||Satya Nadella||--|
|3||Apple||Steve Cook||Steve Cook||--|
|4||M. Zuckerberg||M. Zuckerberg||--|
|5||Apple||Steve Jobs||Steve Jobs||2001|
A column-based database DBs serialize them by columns:
|COMPANY||Microsoft (1,2)||Microsoft (1,2)||Facebook (4)|
|CEO_NAME||Bill Gates (1)||Bill Gates (1)||Steve Cook (3)||M. Zuckerberg (4)||Steve Jobs (5)|
|YEAR_FROM||1975 (1)||1975 (1)||2011 (3)||2014 (4)||1997 (5)|
|YEAR_TO||2000 (1)||2000 (1)|
Our expected workload determines what type of database we should choose. Row I/O is designed to get all the information for a small set of records while column I/O is designed to get a small subset of columns for a huge set of records (or even a whole table).
Now, imagine we ask the server to calculate how many years each of the persons on the tables above were CEOs of their respective companies. A row database has to retrieve each row, then get to the columns with the relevant years and discard the rest. A column database only needs two I/O operations (regardless of the size of the table): one for each date column. No data is discarded. A column system will perform much better in this case and scales much better with large tables.
As a result of their design, column databases have unique properties:
- Efficient column aggregation and materialized views.
- Wide tables with optional columns pose no problems. On the contrary, they are encouraged.
- Because similar data compress best, column orientation uses less space on disk than rows.
- Due to less reliance on locking, column architectures can run with a much higher degree of parallelization.
- Batch inserts are more efficient on columns than on rows.
- Column databases have less reliance on indexes, which reduces overhead in the system.
Clickhouse is a database management system (DBMS) created by Yandex, the second-largest web analytics platform in the world, and released as open-source software under the Apache 2.0 License.
Beyond Yandex, many other big companies have adopted Clickhouse for their analytics, including Cisco, Cloudflare, CERN, Bloomberg, and Spotify.
Clickhouse main features are:
- True column-based DBMS.
- A subset of the SQL Language feature-rich with analytics and statistics functions.
- Fault tolerance and no-downtime upgrades thanks to replication.
- Multi-petabyte scale support.
- Several engine modes for databases and tables.
- Parallel processing on a single node.
- Distributed queries using multiple nodes.
- Support approximated calculations through sampling.
ClickHouse supports two mechanisms to scale out databases. Sharding distributes big tables among multiple machines and requests get automatically parallelized. Replication creates redundant copies of the data and can be used to distribute and filter data to different data centers. Also, ClickHouse supports multi-master replication.
Clickhouse is good for:
- OLAP and OLEP workloads as long as they don’t need full-text search.
- Data Warehouse: Clickhouse can store and process petabytes of data and build customized reports on the fly.
- Big tables: tables with hundreds or thousands of columns.
- Applications that need a high read throughput.
- Applications that make heavy use of aggregated columns or materialized views.
Clickhouse is not good for:
- OLTP (Online Transactional Processing) workloads. Clickhouse doesn’t support full-fledged transactions.
- Applications with heavy write requirements.
- Applications with complex or random write or update activity.
- Applications that need full-text search.
Databases have some unique requirements that can clash with public cloud
- Cost: high per-GB costs can make downloads and uploads expensive.
- Bandwidth: big databases can take a long time to load or export.
- Latency: round trip times can add unacceptable latencies to the queries.
Edge Computing solves these issues by moving servers as close as possible to customers and sources of data. As for StackPath, it offers the best of both worlds: the convenience and flexibility of the cloud with on premise-like performance.
In this section, we’ll deploy a single ClickHouse edge computing node in order to get familiar with the system. Then, we’ll learn how to implement a cluster.
First, we need to create a custom Docker image with our private config. Before continuing, ensure you have the following things:
Docker installed on your workstation. On Mac and Windows, we recommend installing Docker Desktop. On Linux machines, install Docker from repositories.
A Docker Hub account to store the images.
Yandex supplies two ready-to-use Docker images:
- clickhouse-server: is the server component; it stores the data, accepts client connections, and processes requests.
- clickhouse-client: is the official command-line client, much like psql or mysql.
To download the images to your machine, use the following commands:
docker pull yandex/clickhouse-server docker pull yandex/clickhouse-client
By default, the Clickhouse server includes a single user called default with a blank password. For testing purposes, having a blank password is not a problem, but since we’re going to deploy the node in production, we need to secure it.
To customize the user settings, we need to change a file called
users.xml, which is inside the clickhouse-server image.
To do this, create a directory to store our files.
mkdir my-clickhouse-server cd my-clickhouse-server
Start an instance of the container so we can copy out the file.
docker create -it --name clickhouse-server clickhouse server
Copy the file from the container to your machine.
docker cp clickhouse-server:/etc/clickhouse-server/users.xml .
users.xml with a text editor and locate the following lines:
<users> <default> <!-- This is the default user with a blank password --> <password></password> . . . </default> </users>
Clickhouse supports several password settings:
- Plaintext: insecure and not recommended.
Double SHA1: uses a cryptographic hash function to make the password harder to guess to anyone looking at our image. Double SHA1 strings go inside
SHA256: arguably the most secure alternative and the one we will use. This option uses
To generate a SHA256 password in Linux or Mac, execute the following command in a new terminal:
PASSWORD=$(base64 < /dev/urandom | head -c8); echo "$PASSWORD"; echo -n "$PASSWORD" | sha256sum | tr -d '-'
The first line printed is the password, the second is the encoded SHA256 string. Alternatively, you can use an online SHA256 generator to obtain the string. In either case, keep both the password and the SHA string at hand as we’ll need them both soon.
Going back to
<password></password> line to get rid of the blank password. Then add the following line with your SHA256 string.
<users> <default> <password_sha256_hex>78bb49329e64a3df1ea5a48fb09ce22ed5223171cf140e3b08683ab3926a0b9b</password_sha256_hex> . . . </default> </users>
Create a file called
Dockerfile with the following content:
FROM yandex/clickhouse-server COPY users.xml /etc/clickhouse-server
Then, build the image.
docker build -t my-clickhouse-server .
Test if it starts correctly with this command:
docker run -it -d --name my-server my-clickhouse-server
Finally, from a different terminal, try connecting using your password:
docker run -it --rm --link clickhouse:clickhouse-server yandex/clickhouse-client --host clickhouse-server --user default --password YOUR_PASSWORD ClickHouse client version 188.8.131.52 (official build). Connecting to 184.108.40.206:9000 as user default. Connected to ClickHouse server version 20.3.2 revision 54433. clickhouse-server :)
Before we can use the image in StackPath, we have to put it in a Docker registry. To do this, log in to Docker Hub with your account.
docker login -u YOUR_DOCKERHUB_USERNAME
Build the image using your Docker Hub username as a prefix.
docker build -t YOUR_DOCKERHUB_USERNAME/clickhouse-server:latest
Push the image to Docker Hub.
docker push YOUR_DOCKERHUB_USERNAME/clickhouse-server:latest
Once we have pushed the image to the Docker Registry, we can deploy it in StackPath.
Head over to StackPath. Login with your account or sign up for a new one. Then, click on the Create button and select Workload.
We’ll name the workload “clickhouse-servers” and set the type to Container. For Image, type the address to your Docker image:
If you are using a private repository or a different registry, check the Add Image Pull Credentials box and type in the pull details.
Click on Continue to Settings.
If you want the servers to be accessible to everyone, add the following ports under Public ports. This will assign Public IPs and open the ports to the Internet.
All containers get a Private IP assigned automatically:
8123 TCPis the default HTTP API Endpoint.
9000 TCPis the default port for the native clients.
Click on Continue to Spec.
In Spec, choose the machine size. Clickhouse developers recommend a minimum of 4GB of RAM to allow for complex queries. If in doubt, choose the smallest machine, you can always scale the machine up with a few clicks later on.
Next, we’ll need a volume to persist our data, under Additional volume type
/var/lib/clickhouse and set its size on GBs.
Under Deployment Target, we will group servers with similar scaling needs. Type a descriptive name like “clickhouse-servers”. Then select the PoP, which is nearest to you or to your customers. Set the number of instances to 1.
Click on Create Workload.
Wait a few seconds until the instance is on the Running state.
The following test will only work if you have opened ports 8123 and 9000:
Click on the running instance to get its details.
Then, scroll down to the Instance details section and copy the Public IP address.
Next, open a terminal and try the
/ping endpoint with Curl or a browser. If the server is running it should respond with "Ok".
curl -w "\n" YOUR_PUBLIC_IP:8123/ping Ok
We can also connect with Clickhouse Native Client.
docker run -it yandex/clickhouse-client --host YOUR_PUBLIC_IP --user default --password YOUR_PASSWORD ClickHouse client version 220.127.116.11 (official build). Connecting to 18.104.22.168:9000 as user default. Connected to ClickHouse server version 20.3.2 revision 54433. clickhouse-servers-clickhouse-servers-jfk-0 :)
Now you can execute Clickhouse SQL commands on your new server:
clickhouse-servers-clickhouse-servers-jfk-0 :) create database my_db; CREATE DATABASE my_db Ok. 0 rows in set. Elapsed: 0.393 sec. clickhouse-servers-clickhouse-servers-jfk-0 :) show databases; SHOW DATABASES ┌─name────┐ │ default │ │ my_db │ │ system │ └─────────┘ 3 rows in set. Elapsed: 0.326 sec.
In this section, we’ll deploy two more database instances to create a
3-node Edge Computing cluster.
First, we’ll create the remaining instances so that we can get their
On the left navigation menu, click on Workloads and select your container workload. Then, click on the Gear icon to edit the workload configuration.
Click on the Edit icon next to Spec, Storage & Deployment and Increase the number of Instances per PoP to 3. Click on Save Changes.
StackPath will create the remaining instances in a few seconds.
When the instances are ready, copy the Private IP address of each one:
We have to tell the Clickhouse where all the nodes are. For this, we
need to first get the
config.xml file from the Clickhouse image.
Open a terminal and type the following command to copy the file out of the container:
docker cp clickhouse-server:/etc/clickhouse-server/config.xml .
Then, open the file and locate the
<remote_servers incl="clickhouse_remote_servers" > <!-- Test only shard config for testing distributed storage --> <test_shard_localhost> <shard> <replica> <host>localhost</host> <port>9000</port> </replica> </shard> </test_shard_localhost> . . . </remote_servers>
Remove all the lines inside the
<remote_servers> </remote_servers> elements and type the following config. Replace the IP Addresses with your servers Private IPs:
<remote_servers incl="clickhouse_remote_servers" > <my_sharded_config> <shard> <replica> <host>SERVER_1_PRIVATE_IP</host> <port>9000</port> </replica> </shard> <shard> <replica> <host>SERVER_2_PRIVATE_IP</host> <port>9000</port> </replica> </shard> <shard> <replica> <host>SERVER_3_PRIVATE_IP</host> <port>9000</port> </replica> </shard> </my_sharded_config> </remote_servers>
Dockerfile to add the modified config:
FROM yandex/clickhouse-server COPY users.xml /etc/clickhouse-server COPY config.xml /etc/clickhouse-server
Build a new version of the image (notice we’re using a different tag
docker build -t YOUR_DOCKERHUB_USERNAME/clickhouse-server:sharded
Push the new image to the Docker Registry, same as before.
docker push YOUR_DOCKERHUB_USERNAME/clickhouse-server:sharded
The final step is to tell StackPath to use our new configuration. To do this, go to your StackPath Workload and click the Gear button again to change its configuration. Edit the Image name with the new tag:
Click on Save Changes.
StackPath will start recycling the containers. When this is done, you should find all the containers running the newest version:
Again, we can use Curl or any browser to check each of the ClickHouse nodes. Just replace the IP addresses with your Private IPs.
curl -w "\n" YOUR_PUBLIC_IP_1:8123/ping Ok curl -w "\n" YOUR_PUBLIC_IP_2:8123/ping Ok curl -w "\n" YOUR_PUBLIC_IP_2:8123/ping Ok
We also can connect to any of the instances to check the cluster health.
Run the following query to see the cluster status.
select * from system.clusters;
docker run -it yandex/clickhouse-client --host YOUR_PUBLIC_IP --user default --password YOUR_PASSWORD ClickHouse client version 22.214.171.124 (official build). Connecting to 126.96.36.199:9000 as user default. Connected to ClickHouse server version 20.3.2 revision 54433. clickhouse-servers-clickhouse-servers-jfk-0 :) select * from system.clusters; SELECT * FROM system.clusters ┌─cluster────────────┬─shard_num─┬─shard_weight─┬─replica_num─┬─host_name───┬─host_address─┬─port─┬─is_local─┬─user────┬─default_database─┬─errors_count─┬─estimated_recovery_time─┐ │ my_sharded_config │ 1 │ 1 │ 1 │ 10.128.96.2 │ 10.128.96.2 │ 9000 │ 0 │ default │ │ 0 │ 0 │ │ my_sharded_config │ 2 │ 1 │ 1 │ 10.128.96.4 │ 10.128.96.4 │ 9000 │ 1 │ default │ │ 0 │ 0 │ │ my_sharded_config │ 3 │ 1 │ 1 │ 10.128.96.3 │ 10.128.96.3 │ 9000 │ 0 │ default │ │ 0 │ 0 │ └────────────────────┴───────────┴──────────────┴─────────────┴─────────────┴──────────────┴──────┴──────────┴─────────┴──────────────────┴──────────────┴─────────────────────────┘ 3 rows in set. Elapsed: 2.140 sec.
And that's it! If you made it this far, you have successfully deployed a 3-node cluster on StackPath.
For the next step in your road to mastering ClickHouse, try setting up a second cluster on a different PoP and establishing table replication between the two data centers.
If you want to learn more, check out these related docs and projects: