How to Save Money and Deliver Real Time Analytics with ClickHouse

Everyone knows ClickHouse is fast, but what about cheap? In this blog, we dive into some secrets that ClickHouse champion cheapskates use to reduce costs.

Before we dive in, it is helpful to understand the drivers of costs in an analytic database. The drivers can be grouped into five main categories:

  • Proprietary licenses - license fees often require users to pay for each instance of the database, each user accessing it, or each feature that is enabled.

  • Storage - large data sets require a lot of storage space, as well as fast access and retrieval of data.

  • Compute - processing and analyzing data demands a lot of computing power, especially for complex queries and operations.

  • Networking - clusters that distribute data across different locations or data centers can incur significant networking costs, depending on the distance and bandwidth.

  • Human labor - managing and maintaining analytic databases requires skilled and experienced personnel.

As human labor is a complex topic, we will focus on the other four categories.

Develop on a laptop with 100% open source

A fantastic way to reduce costs when you start with ClickHouse is to use your own laptop for development.

There are three open source dev patterns that work on a laptop:

  1. Install ClickHouse packages directly
  2. Run ClickHouse using Docker
  3. Run a complete ClickHouse app with Docker Compose

We will dig into the third pattern, as it is the one we see most broadly across development. Best of all, it works on practically any laptop, regardless of your operating system.

Get ready to go by installing Docker followed by Docker Compose. The steps are easy and complete in a few minutes.

If you have an application, it’s easy to turn it into a Docker image. We are going to build a Docker image that consists of Ubuntu plus curl, a small but invaluable program for sending HTTP requests. Curl is the “application.”

Let’s generate the image.

cat > Dockerfile << END
# Simple analytic client with curl installed.
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y curl
CMD ["sleep", "infinity"]
END
docker build -t myclient:latest - < Dockerfile

The next step is to create a Docker compose file for ClickHouse and the app:

cat > docker-compose.yml << END
version: '3'
services:
  clickhouse_server:
    image: altinity/clickhouse-server:22.8.15.25.altinitystable
    ports:
      - "8123:8123"
      - "9000:9000"
    volumes:
      - ./clickhouse_database:/var/lib/clickhouse
  ubuntu_client:
    image: myclient:latest
END



The final step is to start your application, ensure it’s running, and then start using it. ‘docker exec’ allows us to connect to the application container and send commands.

$ docker compose up -d
...
$ docker ps
CONTAINER ID ... NAMES
dff28a725b38 altinity/clickh... cheapskate-clickhouse_server-1
23a641654ac2 myclient:latest... cheapskate-ubuntu_client-1

$ docker exec -it 23a6 bash
root@23a641654ac2:/# curl http://cheapskate-clickhouse_server-1:8123
Ok.


ClickHouse answered back. And that’s it! You can now use ClickHouse on your laptop and save 100% on license fees.

Tune apps to limit resource usage

Another wonderful way to develop cost efficient applications is to tune them to limit resource usage. The same factors that make your ClickHouse application fast also make it cheap, because it runs faster with fewer resources for a shorter time. Optimizing apps to use fewer resources can result in huge cost savings - sometimes up to 90%!

Start with schema design. ClickHouse does not automatically create an efficient schema for you. You have to make choices. For example, you want to optimize your data types to make them small - use codecs and an effective compression algorithm (like ZSTD). You can use alias columns to save space. You also want to pay attention to your ORDER BY because it affects compression and query speed. Here’s an example of a fully optimized table design.

[Image: Optimized Schema Design to Reduce Storage and I/O (ClickHouse)]
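A minimal sketch of such a design might look like this (the columns are hypothetical; the point is the small data types, the per-column codecs, and the ORDER BY):

-- Small types, per-column codecs, and a compression-friendly ORDER BY
CREATE TABLE default.web_events_optimized (
    -- Delta encoding plus ZSTD compresses sequential timestamps well
    `time` DateTime CODEC(Delta, ZSTD(1)),
    -- Use the smallest integer types that fit the data
    `user_id` UInt32,
    `session_id` UInt32,
    -- LowCardinality dictionary-encodes repetitive string values
    `browser` LowCardinality(String),
    -- Float32 halves storage versus the default Float64
    `float_value` Float32,
    -- ALIAS columns are computed on read and consume no storage
    `int_value` UInt32 ALIAS toUInt32(float_value)
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(time)
-- Sorting groups similar values together, which improves compression
ORDER BY (user_id, toStartOfDay(time), session_id, time)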

Below is an illustration of the difference between levels of schema optimization. We can see an 80% reduction in data size with optimized data types and ZSTD compression.

[Image: On-Disk Table Size for Different Schemas]

As you tune your application, we suggest using the ClickHouse system tables, which contain useful information about compression ratios, query times, and resource consumption. Here’s a query on table system.tables that calculates the data shown above.

SELECT name, total_rows AS rows, 
  formatReadableSize(total_bytes) AS size,
  total_bytes / rows AS bytes_per_row
FROM system.tables 
WHERE database = 'default'
ORDER BY name
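You can also check compression column by column using system.columns. The following sketch assumes a table named web_events in the default database:

SELECT name,
     formatReadableSize(data_compressed_bytes) AS compressed,
     formatReadableSize(data_uncompressed_bytes) AS uncompressed,
     round(data_uncompressed_bytes / data_compressed_bytes, 2) AS ratio
FROM system.columns
WHERE (database = 'default') AND (table = 'web_events')
     AND (data_compressed_bytes > 0)
ORDER BY data_compressed_bytes DESC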

In addition, check your query performance. The following example selects from system.query_log to find the details of recent queries, such as how long they took, how much memory they used, and how many rows and bytes they read. This will help you analyze your queries and make them more efficient.

SELECT
     event_time,
     type,
     query_duration_ms / 1000 AS duration,
     read_rows,
     read_bytes,
     result_rows,
     formatReadableSize(memory_usage) AS memory,
     query
FROM system.query_log
WHERE (user = 'test') AND (type = 'QueryFinish')
ORDER BY event_time DESC
LIMIT 50
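Once you spot a heavy query, you can also cap its resource usage directly with per-query settings. The query and limit values below are illustrative:

SELECT user_id, count() AS events
FROM web_events
GROUP BY user_id
ORDER BY events DESC
LIMIT 10
SETTINGS max_memory_usage = 10000000000, -- fail rather than use more than ~10 GB
     max_threads = 4                     -- limit CPU parallelism

Exceeding max_memory_usage makes the query fail instead of starving other workloads, and fewer threads trade a little latency for lower CPU usage.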

Use TTLs to limit growth

TTLs (time to live) are another good cost-saving tool. A TTL can delete rows after a set time, which reduces your cost because you’re not storing endless data as your time series keeps growing. TTLs can also move, aggregate, and recompress data.

Let’s look at an example. Below, we can see the TTL deletes rows at 12 months. We use ZSTD(1) to recompress rows at one month, and we also apply a higher level of ZSTD to recompress the data after six months.

CREATE TABLE default.web_events_with_ttl_2 (
           `time` DateTime,
           . . .
           `float_value` Float32
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(time)
ORDER BY (user_id, toStartOfDay(time), session_id, time)
TTL time + INTERVAL 1 MONTH RECOMPRESS CODEC (ZSTD(1)),
    time + INTERVAL 6 MONTH RECOMPRESS CODEC (ZSTD(10)),
    time + INTERVAL 12 MONTH DELETE

When we look at the impact of the TTL on this particular dataset, we can see that the oldest part (202208_3_3_1) is much smaller at about 264,000 compressed bytes, compared to the newest part at over 600,000 bytes.

SELECT partition, name, rows,
     data_compressed_bytes AS compressed,
     data_uncompressed_bytes AS uncompressed
FROM system.parts
WHERE (table = 'web_events_with_ttl_2') AND active
ORDER BY name DESC

┌─partition─┬─name─────────┬──rows─┬─compressed─┬─uncompressed─┐
│ 202304    │ 202304_1_1_0 │ 50000 │     613930 │      1388890 │
│ 202302    │ 202302_2_2_1 │ 50000 │     327461 │      1388890 │
│ 202208    │ 202208_3_3_1 │ 50000 │     264054 │      1388890 │
└───────────┴──────────────┴───────┴────────────┴──────────────┘

While TTLs can help you reduce storage costs by limiting data growth and/or compressing the data, the amount of savings will vary. If you don’t want to lose any source data, you may want to use TTLs to compress data instead of deleting it. Using tiered storage for older data can yield even higher cost savings, as in the sketch below. Check out our recorded webinar for a full explanation.
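For instance, a TTL MOVE clause can shift older parts to a cheaper volume. This sketch assumes a storage policy named 'tiered' with a volume named 'cold' has already been configured on the server:

CREATE TABLE default.web_events_tiered (
    `time` DateTime,
    `float_value` Float32
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(time)
ORDER BY time
-- Parts older than 3 months move to the cheaper 'cold' volume
TTL time + INTERVAL 3 MONTH TO VOLUME 'cold'
SETTINGS storage_policy = 'tiered'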

Scale compute capacity down when not needed

Re-scaling compute lowers cloud costs dramatically. You can resize your VMs and connect them to the same block storage. A simple way to do this is with the Altinity ClickHouse Kubernetes Operator.

Below is a picture that illustrates what happens inside a Kubernetes cluster that also includes the Altinity Operator.

[Image: Kubernetes and the Altinity Operator Make Rescaling Easy]

For each server in a ClickHouse cluster, the operator creates a Pod, which runs the ClickHouse container. The Pod has one or more persistent volume claims (storage requests) that connect to data on actual AWS EBS volumes.

A simple setting in Kubernetes will make the pod reschedule on a different VM. The trick is to have a provisioner (like Karpenter) that watches Kubernetes for pod changes. It will automatically create new VMs as needed to match the pod requirements. You don’t have to do anything; it’s all done for you.

How do we make that change? Here’s an example of defining a ClickHouse installation using the Altinity ClickHouse Operator. First, we specify the cluster layout: one shard and one replica, which means one node. We also reference a pod template that specifies how our pods should be arranged.

apiVersion: "clickhouse.altinity.com/v1"
kind: "ClickHouseInstallation"
metadata:
  name: "prod"
spec:
  configuration:
    clusters:
      - name: "ch"
        layout:
          shardsCount: 1
          replicasCount: 1
        templates:
          podTemplate: clickhouse-zone-2a

Next, we set the instance type in the pod template. In this example, we are telling the provisioner that we want to run on an m5.xlarge VM in availability zone us-west-2a.

podTemplates:
  - name: clickhouse-zone-2a
    spec:
      containers:
        - name: clickhouse
          image: altinity/clickhouse-server:22.8.15.25.altinitystable
      nodeSelector:
        node.kubernetes.io/instance-type: m5.xlarge
    zone:
      key: topology.kubernetes.io/zone
      values:
        - us-west-2a

You can use kubectl to apply the above definition. The provisioner will create an m5.xlarge VM if one is not already available. Best of all, if you want a different VM, you can just change the instance type in the definition and reapply using kubectl. Kubernetes will drop the old VM and put your pod on a new VM of the right size.

Resizing can be useful for different scenarios. For example, if you have a cyclical pattern in ClickHouse, you can reduce your capacity when the demand is less. You can also have different levels of compute, and use less compute for older data.

There are other tricks as well. You can even turn off unused pods by setting the number of replicas in the definition to 0. Kubernetes will deallocate the VM completely. In these cases, your storage will remain the same, but you save on compute. Cost savings vary but may reach 50% in the best cases.

Conclusion

In this article, we covered four ways that any developer can reduce ClickHouse costs. If you enjoyed this blog, make sure to check out our webinar Run ClickHouse like a Cheapskate - 6 Ways to Save Money While Delivering Real-Time Analytics for two more great cost-saving hacks in ClickHouse.
