<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mohammad Arab Anvari</title>
    <description>The latest articles on DEV Community by Mohammad Arab Anvari (@anvaari).</description>
    <link>https://dev.to/anvaari</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F921549%2F7dd9882c-cec8-4fb2-bd09-779f3a21f16e.jpeg</url>
      <title>DEV Community: Mohammad Arab Anvari</title>
      <link>https://dev.to/anvaari</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/anvaari"/>
    <language>en</language>
    <item>
      <title>How to Upgrade Kafka from 1.1.1 with Zero-Downtime: An Applicable Approach</title>
      <dc:creator>Mohammad Arab Anvari</dc:creator>
      <pubDate>Fri, 29 Mar 2024 08:25:54 +0000</pubDate>
      <link>https://dev.to/anvaari/how-to-upgrade-kafka-from-111-with-zero-downtime-an-applicable-approach-3n05</link>
      <guid>https://dev.to/anvaari/how-to-upgrade-kafka-from-111-with-zero-downtime-an-applicable-approach-3n05</guid>
      <description>&lt;p&gt;As a data engineer or, more specifically, data platform engineer, a service with high dependency may be handed over to you. Upgrading such a service is a terrifying process. Suppose that service is Kafka, and it's the main component of your data stack at the company. However, the solution isn't ignoring the complexity because every bug fix or new feature can save you from downtime and help you increase the performance of the services. So, what is the solution? How can we ensure all services that depend on Kafka work fine after the upgrade? In this post, I will share my experience through this process.&lt;/p&gt;

&lt;h2&gt;
  
  
  Main concerns
&lt;/h2&gt;

&lt;p&gt;When we talk about a service like Kafka, we know there are many producers and consumers attached to it. What happens to them after an upgrade? Do they keep producing and consuming? What about the schema registry and the other components that depend on Kafka? So, one of the main concerns is the health of these dependent components.&lt;br&gt;
We also want to jump two major Kafka versions; how should we check for deprecated configs? Should we read every changelog one by one? There is a better approach that minimizes both the time spent and the probability of downtime.&lt;/p&gt;

&lt;h2&gt;
  
  
  Proposed approach
&lt;/h2&gt;

&lt;p&gt;Honestly, every time I think about Docker, I'm amazed at what a beautiful tool it is :D With tools like docker-compose, you can spin up an entire stack in its own isolated network, independently of everything else.&lt;/p&gt;

&lt;p&gt;A better approach is to use Docker to simulate production services in a safe environment. We can set up a whole stack with the same configs but fewer resources, simulate upgrades, and check each component's behavior.&lt;/p&gt;

&lt;h2&gt;
  
  
  Applied approach for Kafka
&lt;/h2&gt;

&lt;p&gt;To simulate the upgrade process for Kafka, I created a stack with these components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Zookeeper Instances -&amp;gt; Coordinator for the Kafka cluster&lt;/li&gt;
&lt;li&gt;Kafka Instances -&amp;gt; The main component&lt;/li&gt;
&lt;li&gt;Schema Registry -&amp;gt; Persists the schemas of produced messages&lt;/li&gt;
&lt;li&gt;Kafka UI -&amp;gt; Monitors the Kafka cluster and shows incoming messages in topics&lt;/li&gt;
&lt;li&gt;Producer -&amp;gt; Python code that produces data into a Kafka topic in Avro format&lt;/li&gt;
&lt;li&gt;Consumer -&amp;gt; Python code that consumes the data produced by &lt;code&gt;Producer&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Clickhouse -&amp;gt; Analytical database that stores the data coming from Kafka&lt;/li&gt;
&lt;li&gt;Postgres -&amp;gt; OLTP database that stores transactional data&lt;/li&gt;
&lt;li&gt;Postgres Producer -&amp;gt; Python code that inserts one record every 0.1 seconds into the &lt;code&gt;Postgres&lt;/code&gt; database&lt;/li&gt;
&lt;li&gt;Debezium -&amp;gt; Captures each change in &lt;code&gt;Postgres&lt;/code&gt; and sends it to the corresponding Kafka topic in Avro format&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now, it's time to prepare the appropriate &lt;code&gt;docker-compose.yaml&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementation details
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zookeeper&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Image: &lt;a href="https://hub.docker.com/_/zookeeper"&gt;Official Image&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Configs:

&lt;ul&gt;
&lt;li&gt;Mount &lt;code&gt;zoo.cfg&lt;/code&gt; into the container&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://zookeeper.apache.org/doc/r3.3.3/zookeeperAdmin.html#sc_dataFileManagement"&gt;&lt;code&gt;myid&lt;/code&gt; and data directory&lt;/a&gt; created using &lt;a href="https://github.com/snapp-incubator/kafka-upgrade-simulation/blob/main/zookeeper/zookeeper_conf_creator.py"&gt;&lt;code&gt;zookeeper_conf_creator.py&lt;/code&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kafka&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Image: &lt;a href="https://hub.docker.com/r/bitnami/kafka"&gt;Bitnami Image&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Image customized by &lt;a href="https://github.com/snapp-incubator/kafka-upgrade-simulation/blob/main/kafka/Dockerfile-Kafka"&gt;&lt;code&gt;Dockerfile-Kafka&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Configs:

&lt;ul&gt;
&lt;li&gt;Set as environment variables&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;server.properties&lt;/code&gt; converted to &lt;code&gt;server.env&lt;/code&gt; using &lt;a href="https://github.com/snapp-incubator/kafka-upgrade-simulation/blob/main/kafka/kafka_env_creator.py"&gt;&lt;code&gt;kafka_env_creator.py&lt;/code&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;This image didn't support &lt;code&gt;SCRAM-SHA&lt;/code&gt; authentication, so &lt;a href="https://github.com/snapp-incubator/kafka-upgrade-simulation/blob/main/kafka/libkafka.sh"&gt;&lt;code&gt;libkafka.sh&lt;/code&gt;&lt;/a&gt; (Bitnami's Kafka shell library) was rewritten.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema Registry&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;Image: &lt;a href="https://hub.docker.com/r/confluentinc/cp-schema-registry"&gt;Official Image&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Configs:

&lt;ul&gt;
&lt;li&gt;Set as environment variables&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;schema-registry.properties&lt;/code&gt; converted to &lt;code&gt;schema-registry.env&lt;/code&gt; using &lt;a href="https://github.com/snapp-incubator/kafka-upgrade-simulation/blob/main/schema-registry/schema_registry_config_creator.py"&gt;&lt;code&gt;schema_registry_config_creator.py&lt;/code&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kafka UI&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Image: &lt;a href="https://hub.docker.com/r/provectuslabs/kafka-ui"&gt;Official Image&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Configs:

&lt;ul&gt;
&lt;li&gt;Set as environment variables directly in &lt;code&gt;docker-compose.yaml&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Producer and Consumer&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Image: &lt;a href="https://hub.docker.com/_/python"&gt;Official Python Image&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Image customized by &lt;a href="https://github.com/snapp-incubator/kafka-upgrade-simulation/blob/main/python-producer-consumer/Dockerfile-Python"&gt;&lt;code&gt;Dockerfile-Python&lt;/code&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Code: &lt;a href="https://github.com/snapp-incubator/kafka-upgrade-simulation/blob/main/python-producer-consumer/producer.py"&gt;&lt;code&gt;producer.py&lt;/code&gt;&lt;/a&gt; and &lt;a href="https://github.com/snapp-incubator/kafka-upgrade-simulation/blob/main/python-producer-consumer/consumer.py"&gt;&lt;code&gt;consumer.py&lt;/code&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Configs: Set as environment variables directly in &lt;code&gt;docker-compose.yaml&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clickhouse&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Image: &lt;a href="https://hub.docker.com/r/clickhouse/clickhouse-server"&gt;Official Image&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Tables: Table DDLs are defined &lt;a href="https://github.com/snapp-incubator/kafka-upgrade-simulation/tree/main/clickhouse/schemas"&gt;here&lt;/a&gt; and mounted into &lt;code&gt;/docker-entrypoint-initdb.d&lt;/code&gt;

&lt;ul&gt;
&lt;li&gt;For each table in &lt;code&gt;Postgres&lt;/code&gt;, three tables are defined:

&lt;ul&gt;
&lt;li&gt;Base table -&amp;gt; data persists here&lt;/li&gt;
&lt;li&gt;Kafka table -&amp;gt; reads data from Kafka&lt;/li&gt;
&lt;li&gt;Materialized view -&amp;gt; ships data from the Kafka table into the base table&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Configs: Default configs are used; only &lt;a href="https://github.com/snapp-incubator/kafka-upgrade-simulation/blob/main/clickhouse/configs/kafka.xml"&gt;&lt;code&gt;kafka.xml&lt;/code&gt;&lt;/a&gt; is mounted into &lt;code&gt;/etc/clickhouse-server/config.d/kafka.xml&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Logs: For debugging purposes, logs are mounted to a &lt;a href="https://github.com/snapp-incubator/kafka-upgrade-simulation/tree/main/clickhouse/log"&gt;local directory&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Postgres&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Image: &lt;a href="https://quay.io/debezium/example-postgres"&gt;Debezium Example Image&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;This Postgres contains sample sale data.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Postgres Producer&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Same as &lt;code&gt;Producer&lt;/code&gt; and &lt;code&gt;Consumer&lt;/code&gt;, but &lt;a href="https://github.com/snapp-incubator/kafka-upgrade-simulation/blob/main/python-producer-consumer/postgres-producer.py"&gt;this code&lt;/a&gt; is used&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Debezium&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Image: &lt;a href="https://hub.docker.com/r/debezium/connect"&gt;Official Image&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Configs:

&lt;ul&gt;
&lt;li&gt;Set as environment variables&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;kafka-connect.properties&lt;/code&gt; converted to &lt;code&gt;kafka-connect.env&lt;/code&gt; using &lt;a href="https://github.com/snapp-incubator/kafka-upgrade-simulation/blob/main/debezium/kafka_connect_config_generator.py"&gt;&lt;code&gt;kafka_connect_config_generator.py&lt;/code&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
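&lt;p&gt;As a rough illustration of what the properties-to-env converter scripts above do (the real ones are Python; the &lt;code&gt;KAFKA_CFG_&lt;/code&gt; prefix is Bitnami's convention, and the exact mapping in the repo may differ), a shell sketch could look like this:&lt;/p&gt;

```shell
# Hypothetical shell equivalent of the repo's config-converter scripts:
# turn "log.retention.hours=168" into "KAFKA_CFG_LOG_RETENTION_HOURS=168".
props_to_env() {
  awk -F'=' '
    /^[[:space:]]*#/ { next }            # skip comment lines
    NF == 0          { next }            # skip blank lines
    {
      key = $1
      gsub(/\./, "_", key)               # dots -> underscores
      val = substr($0, index($0, "=") + 1)
      print "KAFKA_CFG_" toupper(key) "=" val
    }' "$1"
}

# Usage: props_to_env server.properties > server.env
```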

&lt;p&gt;&lt;strong&gt;Some Extra Containers&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;kafka-setup-user&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It uses the same image as &lt;code&gt;Kafka&lt;/code&gt; and runs after &lt;code&gt;kafka1&lt;/code&gt; becomes healthy. Once this container has run (exited with status 0), some users have been created. See them &lt;a href="https://github.com/snapp-incubator/kafka-upgrade-simulation/blob/main/kafka-setup-user/entrypoint.sh"&gt;here&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;It needs one Kafka broker and also a Zookeeper cluster, because &lt;code&gt;SCRAM-SHA&lt;/code&gt; credentials are persisted in Zookeeper.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;

&lt;p&gt;&lt;strong&gt;kafka-setup-topic&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It uses the same image as &lt;code&gt;Kafka&lt;/code&gt; and creates some topics. See the list &lt;a href="https://github.com/snapp-incubator/kafka-upgrade-simulation/tree/main/kafka-setup-topic"&gt;here&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;

&lt;p&gt;&lt;strong&gt;submit-connector&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It uses the &lt;a href="https://hub.docker.com/r/curlimages/curl"&gt;curl image&lt;/a&gt; to submit &lt;a href="https://github.com/snapp-incubator/kafka-upgrade-simulation/blob/main/submit-connector/register-postgres.json"&gt;this connector&lt;/a&gt; to &lt;code&gt;Debezium&lt;/code&gt;. The connector captures changes in &lt;code&gt;Postgres&lt;/code&gt; and sends events to &lt;code&gt;Kafka&lt;/code&gt;, and &lt;code&gt;Clickhouse&lt;/code&gt; then consumes the data into the appropriate tables.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Some Extra Notes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The versions of all containers are defined in the &lt;a href="https://github.com/snapp-incubator/kafka-upgrade-simulation/blob/main/.env"&gt;&lt;code&gt;.env&lt;/code&gt; file&lt;/a&gt;. You can change them there.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Container dependencies are defined accurately. So, if one container depends on another to come up, appropriate &lt;code&gt;healthcheck&lt;/code&gt; and &lt;code&gt;depends_on&lt;/code&gt; conditions are defined for it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If you take a look at the &lt;code&gt;healthcheck&lt;/code&gt; of the containers, for example Kafka's, you will see this command:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /dev/tcp/kafka1/9092&lt;span class="o"&gt;)&lt;/span&gt; &amp;amp;&amp;gt;/dev/null &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;exit &lt;/span&gt;0 &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;exit &lt;/span&gt;1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This one-liner checks whether a TCP port is open from inside a container that doesn't have &lt;code&gt;telnet&lt;/code&gt; installed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Simulation Process
&lt;/h3&gt;

&lt;p&gt;To run the simulation, you can follow &lt;a href="https://github.com/snapp-incubator/kafka-upgrade-simulation?tab=readme-ov-file#test-process"&gt;these steps&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Result
&lt;/h3&gt;

&lt;p&gt;All tests were successful. By successful, I mean the producer could still produce messages without errors, and consumers could consume them without errors. No other criteria were investigated; you can define your own metrics for this simulation. Only one problem was seen in this process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problems:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;In &lt;code&gt;Setup Kafka User&lt;/code&gt;, a &lt;code&gt;java.lang.ClassNotFoundException: kafka.security.auth.SimpleAclAuthorizer&lt;/code&gt; occurred

&lt;ol&gt;
&lt;li&gt;This class was deprecated in Kafka 2.4.0; see &lt;a href="https://cwiki.apache.org/confluence/display/KAFKA/KIP-504+-+Add+new+Java+Authorizer+Interface"&gt;KIP-504&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;The docs recommend using &lt;code&gt;kafka.security.authorizer.AclAuthorizer&lt;/code&gt; instead. It's fully compatible with the deprecated class, so it was swapped in docker-compose and everything worked&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;
&lt;/ol&gt;
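&lt;p&gt;Concretely, the fix amounted to swapping one config value. In the Bitnami image, broker properties are passed as &lt;code&gt;KAFKA_CFG_*&lt;/code&gt; environment variables; the exact variable name below is my assumption, so adapt it to however your compose file passes configs:&lt;/p&gt;

```shell
# Deprecated since Kafka 2.4.0 (KIP-504) and later removed:
#   authorizer.class.name=kafka.security.auth.SimpleAclAuthorizer
# Drop-in replacement set in docker-compose:
export KAFKA_CFG_AUTHORIZER_CLASS_NAME="kafka.security.authorizer.AclAuthorizer"
```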

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Since official documentation exists for upgrading from any version to 3.6.1 (and to earlier versions), there is no obstacle in this process. Our test also shows the process works, so we can upgrade our Kafka to whatever version we want.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This article suggested an approach for upgrading services that many other systems depend on. We talked through the details of implementing this process, and, as we saw in the Result section, one problem was found before the real upgrade, so we can upgrade our Kafka cluster seamlessly, with zero downtime :)&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>devops</category>
      <category>upgrade</category>
      <category>dataplatform</category>
    </item>
    <item>
      <title>Empowering Your Kafka Connectors: A Guide to Connector Guardian</title>
      <dc:creator>Mohammad Arab Anvari</dc:creator>
      <pubDate>Tue, 27 Feb 2024 18:15:06 +0000</pubDate>
      <link>https://dev.to/anvaari/empowering-your-kafka-connectors-a-guide-to-connector-guardian-2bfb</link>
      <guid>https://dev.to/anvaari/empowering-your-kafka-connectors-a-guide-to-connector-guardian-2bfb</guid>
      <description>&lt;p&gt;Hi there :)&lt;/p&gt;

&lt;p&gt;In this post, I want to introduce you to Connector Guardian. If you've ever found yourself grappling with the management of Kafka Connect connectors, you're in for a treat. Connector Guardian is tailor-made to simplify your life as a developer or operator, providing efficient tools for the seamless management and maintenance of your Kafka Connectors.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;Connector Guardian interacts with your Kafka Connect cluster through its &lt;a href="https://docs.confluent.io/platform/current/connect/references/restapi.html"&gt;REST API&lt;/a&gt;. The initial release, version &lt;a href="https://github.com/snapp-incubator/connector-guardian/releases/tag/0.1.0"&gt;0.1.0&lt;/a&gt;, relied on &lt;a href="https://github.com/jqlang/jq"&gt;jq&lt;/a&gt; for JSON parsing; from version &lt;a href="https://github.com/snapp-incubator/connector-guardian/releases/tag/0.2.0"&gt;0.2.0&lt;/a&gt; onward, Connector Guardian uses Python's built-in JSON library instead.&lt;/p&gt;
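&lt;p&gt;To give a feel for what this looks like against the REST API, here is my own minimal shell sketch, not the project's actual code; &lt;code&gt;CONNECT_URL&lt;/code&gt; and the crude JSON scraping are assumptions (version 0.1.0 used jq for the parsing instead):&lt;/p&gt;

```shell
CONNECT_URL="http://localhost:8083"   # assumed local Kafka Connect worker

# Extract ids of FAILED tasks from a connector-status JSON on stdin.
# Crude text scraping, good enough for a sketch.
failed_task_ids() {
  tr '{' '\n' | grep '"state":"FAILED"' | sed -n 's/.*"id":\([0-9]*\).*/\1/p'
}

# List connectors, check each one's status, restart any failed task
# via the Kafka Connect REST API. Needs a live cluster to actually run.
restart_failed_tasks() {
  for c in $(curl -s "$CONNECT_URL/connectors" | tr -d '[]"' | tr ',' ' '); do
    curl -s "$CONNECT_URL/connectors/$c/status" | failed_task_ids |
      while read -r t; do
        curl -s -X POST "$CONNECT_URL/connectors/$c/tasks/$t/restart"
      done
  done
}
```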

&lt;h2&gt;
  
  
  Features
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Auto Connector Restart&lt;/strong&gt;: Starting from &lt;a href="https://github.com/snapp-incubator/connector-guardian/releases/tag/0.1.0"&gt;V0.1.0&lt;/a&gt;, Connector Guardian monitors the status of connectors and tasks, automatically restarting them if they fail.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Restart Back Off&lt;/strong&gt;: Introduced in &lt;a href="https://github.com/snapp-incubator/connector-guardian/releases/tag/0.3.0"&gt;V0.3.0&lt;/a&gt;, this feature ensures that restarts occur at increasing time intervals. The initial restart happens immediately, and subsequent restarts are delayed exponentially. This approach allows for efficient issue resolution, even in the face of prolonged network outages. After a configurable number of restarts (&lt;code&gt;MAX_RESTART&lt;/code&gt;), the Guardian stops automatic restarting, leaving it to you for manual intervention.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
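&lt;p&gt;The back-off schedule described above (an immediate first restart, then exponentially growing delays, capped at &lt;code&gt;MAX_RESTART&lt;/code&gt; attempts) can be sketched like this; the one-second base delay is my assumption, not a documented project setting:&lt;/p&gt;

```shell
MAX_RESTART=7        # give up after this many restarts
EXPONENTIAL_RATIO=2  # delay multiplier between attempts
BASE_DELAY=1         # seconds; assumed, not a project env var

# Delay in seconds before restart attempt $1 (attempts numbered from 0):
# 0 -> 0, 1 -> 1, 2 -> 2, 3 -> 4, 4 -> 8, ...
backoff_delay() {
  n=$1
  d=0
  if [ "$n" -gt 0 ]; then
    d=$BASE_DELAY
    i=1
    while [ "$i" -lt "$n" ]; do
      d=$((d * EXPONENTIAL_RATIO))
      i=$((i + 1))
    done
  fi
  echo "$d"
}
```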

&lt;h2&gt;
  
  
  How to Add Guardian to Kafka Connect Cluster
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Container Image
&lt;/h3&gt;

&lt;p&gt;You can easily pull the Connector Guardian image from &lt;a href="https://hub.docker.com/r/anvaari/connector-guardian"&gt;Docker Hub&lt;/a&gt; and run it with the &lt;code&gt;docker run&lt;/code&gt; command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;KAFKA_CONNECT_HOST&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;localhost &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;KAFKA_CONNECT_PORT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;8083 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;KAFKA_CONNECT_PROTO&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;KAFKA_CONNECT_USER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;''&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;KAFKA_CONNECT_PASS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;''&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;ENABLE_BACKOFF&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;MAX_RESTART&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;7 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;EXPONENTIAL_RATIO&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2 &lt;span class="se"&gt;\&lt;/span&gt;
  anvaari/connector-guardian
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Non-Cloud Environments
&lt;/h3&gt;

&lt;p&gt;For deployment on your server, use the provided &lt;a href="https://github.com/snapp-incubator/connector-guardian/blob/main/deploy/docker-compose.yaml"&gt;docker-compose&lt;/a&gt; file. Before deploying the image, ensure that you set the appropriate environment variables in &lt;a href="https://github.com/snapp-incubator/connector-guardian/blob/main/deploy/docker-compose.yaml"&gt;docker-compose.yaml&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;deploy
docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Kubernetes or OpenShift
&lt;/h3&gt;

&lt;p&gt;Utilize the provided Helm chart for deployment. Make sure to set the required environment variables in &lt;a href="https://github.com/snapp-incubator/connector-guardian/blob/main/deploy/chart/values.yaml"&gt;values.yaml&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm upgrade connector-guardian &lt;span class="nt"&gt;--install&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;your_namespace_name&lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; deploy/chart/values.yaml deploy/chart
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once deployed, Connector Guardian runs as a pod, executing &lt;code&gt;connector_guardian.py&lt;/code&gt; every 5 minutes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Environment Variables
&lt;/h3&gt;

&lt;p&gt;To use the Docker image, &lt;a href="https://github.com/snapp-incubator/connector-guardian/blob/main/deploy/docker-compose.yaml"&gt;docker-compose&lt;/a&gt;, or &lt;a href="https://github.com/snapp-incubator/connector-guardian/blob/main/deploy/chart/"&gt;Helm chart&lt;/a&gt;, set the following environment variables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;KAFKA_CONNECT_HOST&lt;/code&gt;: Default = &lt;code&gt;localhost&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;KAFKA_CONNECT_PORT&lt;/code&gt;: Default = &lt;code&gt;8083&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;KAFKA_CONNECT_PROTO&lt;/code&gt;: Default = &lt;code&gt;http&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;KAFKA_CONNECT_USER&lt;/code&gt;: Default = &lt;code&gt;''&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;KAFKA_CONNECT_PASS&lt;/code&gt;: Default = &lt;code&gt;''&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ENABLE_BACKOFF&lt;/code&gt;: Default = &lt;code&gt;1&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;MAX_RESTART&lt;/code&gt;: Default = &lt;code&gt;7&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;EXPONENTIAL_RATIO&lt;/code&gt;: Default = &lt;code&gt;2&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  In The End ...
&lt;/h2&gt;

&lt;p&gt;Connector Guardian is your steadfast companion in the realm of Kafka Connect connectors. Whether you are a seasoned developer or an operations expert, this tool streamlines the management of your connectors, offering an automated approach to restarts and intelligent back-off mechanisms.&lt;/p&gt;

&lt;p&gt;As I continue to evolve Connector Guardian, I invite you to be part of this journey. Your feedback, suggestions, and contributions are not only valued but crucial in shaping the future of this open-source project. Let's work together to make Kafka Connector maintenance a seamless experience for all.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Get Involved!&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Contribute to the project on &lt;a href="https://github.com/snapp-incubator/connector-guardian"&gt;GitHub&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Share your experiences and ideas in the &lt;a href="https://github.com/snapp-incubator/connector-guardian/issues"&gt;issues&lt;/a&gt; section.&lt;/li&gt;
&lt;li&gt;Spread the word - let others in your network know about Connector Guardian.&lt;/li&gt;
&lt;li&gt;Stay tuned for updates and new features!&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kafkaconnect</category>
      <category>python</category>
      <category>opensource</category>
      <category>devops</category>
    </item>
    <item>
      <title>Getting Valuable Insights from ClickHouse Error Logs using ClickSight</title>
      <dc:creator>Mohammad Arab Anvari</dc:creator>
      <pubDate>Thu, 03 Aug 2023 10:10:18 +0000</pubDate>
      <link>https://dev.to/anvaari/getting-valuable-insights-from-clickhouse-error-logs-using-clicksight-2kkk</link>
      <guid>https://dev.to/anvaari/getting-valuable-insights-from-clickhouse-error-logs-using-clicksight-2kkk</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;When managing a production ClickHouse cluster, you might face numerous challenges. One of them is finding the root cause of a crash. During such incidents, querying system tables like &lt;a href="https://clickhouse.com/docs/en/operations/system-tables/errors" rel="noopener noreferrer"&gt;system.errors&lt;/a&gt; or &lt;a href="https://clickhouse.com/docs/en/operations/system-tables/text_log" rel="noopener noreferrer"&gt;system.text_log&lt;/a&gt; is not possible, so we need to dig into the &lt;code&gt;clickhouse-server.err.log&lt;/code&gt; file to identify the root cause and address it effectively. Unix tools are the most efficient approach here, but we still want to get to the answer as quickly as possible.&lt;/p&gt;

&lt;p&gt;To gain better and faster insights from ClickHouse error logs, I developed an Ansible Playbook named ClickSight. This playbook performs log aggregation on ClickHouse logs from all specified nodes. ClickSight is available under the Apache license on GitHub as the &lt;a href="https://github.com/anvaari/ClickSight" rel="noopener noreferrer"&gt;ClickSight&lt;/a&gt; project. In the next section, we will see how to use it and customize it to suit your needs.&lt;/p&gt;

&lt;p&gt;It's essential to have a monitoring dashboard, such as &lt;a href="https://grafana.com/grafana/dashboards/14192-clickhouse/" rel="noopener noreferrer"&gt;this one&lt;/a&gt;, which can help us identify issues like a high number of mutations or high RAM usage. However, in complex situations, it might not be sufficient to find the root cause, and that's when error logs become crucial.&lt;/p&gt;

&lt;h2&gt;
  
  
  How ClickSight Works
&lt;/h2&gt;

&lt;p&gt;ClickSight leverages various useful Unix commands, such as &lt;code&gt;grep&lt;/code&gt;, &lt;code&gt;cat&lt;/code&gt;, &lt;code&gt;tail&lt;/code&gt;, &lt;code&gt;cut&lt;/code&gt;, and &lt;code&gt;sed&lt;/code&gt;, for working with text files. Additionally, Unix's ability to pass the output of one command to another using pipes (&lt;code&gt;|&lt;/code&gt;) allows us to chain commands together. For example, we can use &lt;code&gt;grep&lt;/code&gt; to find specific lines in a file and then use &lt;code&gt;cut&lt;/code&gt; to select all text after the &lt;code&gt;&amp;gt;&lt;/code&gt; character with &lt;code&gt;cat myfile.txt | grep "some arbitrary phrase" | cut -d"&amp;gt;" -f2&lt;/code&gt;. ClickSight harnesses the power of these Unix commands in an Ansible playbook, making log aggregation a breeze. With ClickSight, you only need to run the playbook, and the results will be available on your local machine.&lt;/p&gt;
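&lt;p&gt;As a toy example of the kind of chain ClickSight builds (the log lines below are fabricated and simplified, not real ClickHouse output), here is a pipeline that pulls the logger name out of error lines and counts occurrences per logger:&lt;/p&gt;

```shell
# Count errors per logger name from (fabricated, simplified) log lines.
top_errors=$(
  printf '%s\n' \
    '2023.08.01 10:00:01 Error executeQuery: Code: 241. DB::Exception' \
    '2023.08.01 10:00:02 Error MergeTreeData: Code: 252. DB::Exception' \
    '2023.08.01 10:00:03 Error executeQuery: Code: 241. DB::Exception' |
  grep ' Error ' |
  sed 's/.* Error \([^:]*\):.*/\1/' |   # keep only the logger name
  sort | uniq -c | sort -rn             # count and rank
)
printf '%s\n' "$top_errors"   # executeQuery comes first with 2 hits
```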

&lt;h2&gt;
  
  
  Setting Up ClickSight
&lt;/h2&gt;

&lt;p&gt;To use ClickSight, you need at least one active ClickHouse server and Ansible installed on your system. It's recommended to install it on a Unix-based OS. Additionally, ensure that you have access to the ClickHouse server as a sudo user, or at least your user should have access to the &lt;code&gt;clickhouse-server.err.log&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;For detailed setup instructions for ClickSight and Ansible, please refer to the &lt;a href="https://github.com/anvaari/ClickSight#perquisites" rel="noopener noreferrer"&gt;Prerequisites&lt;/a&gt; section of the repository.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running ClickSight
&lt;/h2&gt;

&lt;p&gt;A comprehensive guide on how to run ClickSight can be found in the &lt;a href="https://github.com/anvaari/ClickSight#run-clicksight" rel="noopener noreferrer"&gt;Run ClickSight&lt;/a&gt; section of the repository. Feel free to follow that guide, and if you have any questions, don't hesitate to ask here or on GitHub.&lt;/p&gt;

&lt;h2&gt;
  
  
  Analyzing ClickHouse Error Logs
&lt;/h2&gt;

&lt;p&gt;ClickSight offers five modes of operation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All fatal errors in &lt;code&gt;clickhouse-server.err.log&lt;/code&gt; with details: In case of a crash, a &lt;code&gt;fatal&lt;/code&gt; error is likely present, explaining the cause of the crash.&lt;/li&gt;
&lt;li&gt;Timeline of errors: On production systems, there might be numerous errors every minute, making it challenging to track them in log files. ClickSight can help by displaying the timeline of errors extracted from ClickHouse log lines. The error names are based on the &lt;a href="https://clickhouse.com/docs/en/operations/system-tables/text_log" rel="noopener noreferrer"&gt;logger_name&lt;/a&gt;, providing valuable information about the error category.

&lt;ul&gt;
&lt;li&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fygm41n1p6flhfdni85fe.png" alt="Error timeline in ClickSight"&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Timeline of errors associated with a &lt;code&gt;query_id&lt;/code&gt;: ClickHouse provides &lt;code&gt;query_id&lt;/code&gt; for some logs associated with executed queries, denoted by &lt;code&gt;{}&lt;/code&gt;. ClickSight can display the timeline of errors linked to specific &lt;code&gt;query_id&lt;/code&gt;s.

&lt;ul&gt;
&lt;li&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffn45m5dj4l2gdw14xal4.png" alt="Timeline of query_id associated errors in ClickSight"&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Errors for a specific &lt;code&gt;query_id&lt;/code&gt;: ClickSight allows you to view all errors, and their occurrence counts, associated with one particular &lt;code&gt;query_id&lt;/code&gt;.

&lt;ul&gt;
&lt;li&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd92062a318r4fpor4rkc.png" alt="Timeline of errors for specific query_id in ClickSight"&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Timeline of a specific error: Sometimes, we suspect a particular error and want to know when and how often it occurs. ClickSight can provide a detailed timeline for specific errors.

&lt;ul&gt;
&lt;li&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flodui966kv0tmz6hddf6.png" alt="Timeline of specific error with detail in ClickSight"&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Contributing to ClickSight
&lt;/h2&gt;

&lt;p&gt;I welcome and appreciate contributions to the ClickSight project. Whether you want to report issues, suggest improvements, or submit new features, your input is valuable in making ClickSight even more useful.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reporting Issues or Feature Requests
&lt;/h3&gt;

&lt;p&gt;If you encounter any problems while using ClickSight or have ideas for new features, please don't hesitate to report them. To do so:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to the ClickSight GitHub repository: &lt;a href="https://github.com/anvaari/ClickSight" rel="noopener noreferrer"&gt;ClickSight GitHub repository&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Click on the "Issues" tab.&lt;/li&gt;
&lt;li&gt;Click on the green "New Issue" button.&lt;/li&gt;
&lt;li&gt;Provide a descriptive title and detailed description of the issue you encountered or the feature you want to suggest.&lt;/li&gt;
&lt;li&gt;If it's a bug, include steps to reproduce the problem.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Contributing Code
&lt;/h3&gt;

&lt;p&gt;If you're a developer and want to contribute directly to the ClickSight codebase, follow these steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Fork the ClickSight repository to your GitHub account using the "Fork" button in the top-right corner.&lt;/li&gt;
&lt;li&gt;Clone the forked repository to your local development environment.&lt;/li&gt;
&lt;li&gt;Create a new branch for your contribution: &lt;code&gt;git checkout -b my-feature&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Make your changes and improvements.&lt;/li&gt;
&lt;li&gt;Test your changes thoroughly to ensure they work as expected.&lt;/li&gt;
&lt;li&gt;Commit your changes with clear and concise messages: &lt;code&gt;git commit -m "Add my awesome feature"&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Push the changes to your forked repository: &lt;code&gt;git push origin my-feature&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Create a pull request (PR) by navigating to the original ClickSight repository and clicking on "New Pull Request."&lt;/li&gt;
&lt;li&gt;Describe your changes in the PR, including any relevant information or context.&lt;/li&gt;
&lt;li&gt;I will review your PR, provide feedback, and work with you to merge the changes into the main ClickSight project.&lt;/li&gt;
&lt;/ol&gt;
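&lt;p&gt;The branching and committing part of these steps can be sketched end to end. Below is a local simulation in a throwaway repository (in practice you would run the same commands inside your cloned ClickSight fork; the file name and identity are placeholders):&lt;/p&gt;

```shell
# Local simulation of steps 3-7 above in a throwaway git repository.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email you@example.com   # placeholder identity
git config user.name "Your Name"
git checkout -q -b my-feature           # step 3: create a feature branch
echo "demo change" > feature.txt        # steps 4-5: make and test changes
git add feature.txt
git commit -q -m "Add my awesome feature"   # step 6: commit
git rev-parse --abbrev-ref HEAD
# step 7 would then be: git push origin my-feature
```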

&lt;p&gt;I'm Mohammad Anvaari, a Data Engineer at Snapp! I'm curious about data engineering and often write about my challenges and experiences on &lt;a href="https://dev.to/anvaari"&gt;my blog&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ansible</category>
      <category>clickhouse</category>
      <category>monitoring</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Estimate Disk/Service IOPS and Throughput</title>
      <dc:creator>Mohammad Arab Anvari</dc:creator>
      <pubDate>Wed, 12 Jul 2023 15:25:15 +0000</pubDate>
      <link>https://dev.to/anvaari/estimate-diskservice-iops-and-throughput-37a8</link>
      <guid>https://dev.to/anvaari/estimate-diskservice-iops-and-throughput-37a8</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;Sometimes we need to know the current storage usage of a service in order to find any possible bottleneck on the storage side. This tutorial describes how to measure the current &lt;code&gt;IOPS&lt;/code&gt; and &lt;code&gt;Throughput&lt;/code&gt; on your server.&lt;/p&gt;

&lt;h1&gt;
  
  
  Max &lt;code&gt;IOPS&lt;/code&gt; and &lt;code&gt;Throughput&lt;/code&gt; of storage:
&lt;/h1&gt;

&lt;p&gt;In this section, we want to estimate the maximum &lt;code&gt;IOPS&lt;/code&gt; and &lt;code&gt;Throughput&lt;/code&gt; of storage.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;code&gt;dd&lt;/code&gt; command
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;dd &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/dev/zero &lt;span class="nv"&gt;of&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/tmp/disk_test_dd.file &lt;span class="nv"&gt;bs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;100M &lt;span class="nv"&gt;count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="nv"&gt;oflag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;dsync
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;By running this command, &lt;code&gt;dd&lt;/code&gt; will read from &lt;code&gt;/dev/zero&lt;/code&gt; (a stream of null bytes) and write 100 megabytes of data to the file &lt;code&gt;/tmp/disk_test_dd.file&lt;/code&gt;. The write operation will be synchronized and physically written to the disk before &lt;code&gt;dd&lt;/code&gt; exits, thanks to the &lt;code&gt;oflag=dsync&lt;/code&gt; option. This command is often used to test the disk write performance or to create files filled with null bytes for various purposes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;1+0 records &lt;span class="k"&gt;in
&lt;/span&gt;1+0 records out
104857600 bytes &lt;span class="o"&gt;(&lt;/span&gt;105 MB, 100 MiB&lt;span class="o"&gt;)&lt;/span&gt; copied, 0.22367 s, 469 MB/s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;As you can see, the sequential write &lt;code&gt;Throughput&lt;/code&gt; of the disk was &lt;code&gt;469 MB/s&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
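&lt;p&gt;The reported number is simply bytes copied divided by elapsed time. As a quick sanity check (a sketch using the figures from the output above), you can recompute it with &lt;code&gt;awk&lt;/code&gt;:&lt;/p&gt;

```shell
# Recompute dd's reported throughput from its summary line:
# 104857600 bytes copied in 0.22367 s (values taken from the output above).
awk 'BEGIN { printf "%.0f MB/s\n", 104857600 / 0.22367 / 1000000 }'
# prints "469 MB/s"
```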

&lt;h2&gt;
  
  
  &lt;code&gt;fio&lt;/code&gt; command (Recommended)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;fio &lt;span class="nt"&gt;--randrepeat&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="nt"&gt;--ioengine&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;libaio &lt;span class="nt"&gt;--direct&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="nt"&gt;--gtod_reduce&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="nt"&gt;--name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;test_disk_fio &lt;span class="nt"&gt;--filename&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;test_disk_fio &lt;span class="nt"&gt;--bs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;4k &lt;span class="nt"&gt;--iodepth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;64 &lt;span class="nt"&gt;--size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;4G &lt;span class="nt"&gt;--readwrite&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;randrw &lt;span class="nt"&gt;--rwmixread&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;75
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;By running this command, you will initiate a &lt;code&gt;fio&lt;/code&gt; test that performs random read-write I/O operations with a block size of 4 kilobytes, using a 4-gigabyte test file/device. The test will use the &lt;code&gt;libaio&lt;/code&gt; I/O engine with direct I/O enabled. The I/O depth is set to 64, and the reads-to-writes ratio is 75:25. The test results will provide performance metrics and insights into the disk's I/O capabilities under these conditions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;test_disk_fio: Laying out IO file &lt;span class="o"&gt;(&lt;/span&gt;1 file / 4096MiB&lt;span class="o"&gt;)&lt;/span&gt;
Jobs: 1 &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;f&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1&lt;span class="o"&gt;)&lt;/span&gt;: &lt;span class="o"&gt;[&lt;/span&gt;m&lt;span class="o"&gt;(&lt;/span&gt;1&lt;span class="o"&gt;)][&lt;/span&gt;100.0%][r&lt;span class="o"&gt;=&lt;/span&gt;128MiB/s,w&lt;span class="o"&gt;=&lt;/span&gt;41.0MiB/s][r&lt;span class="o"&gt;=&lt;/span&gt;32.8k,w&lt;span class="o"&gt;=&lt;/span&gt;10.7k IOPS][eta 00m:00s]
test_disk_fio: &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;groupid&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0, &lt;span class="nb"&gt;jobs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1&lt;span class="o"&gt;)&lt;/span&gt;: &lt;span class="nv"&gt;err&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; 0: &lt;span class="nv"&gt;pid&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;68138: Mon May 15 10:52:23 2023
   &lt;span class="nb"&gt;read&lt;/span&gt;: &lt;span class="nv"&gt;IOPS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;37.1k, &lt;span class="nv"&gt;BW&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;145MiB/s &lt;span class="o"&gt;(&lt;/span&gt;152MB/s&lt;span class="o"&gt;)(&lt;/span&gt;3070MiB/21170msec&lt;span class="o"&gt;)&lt;/span&gt;
   bw &lt;span class="o"&gt;(&lt;/span&gt;  KiB/s&lt;span class="o"&gt;)&lt;/span&gt;: &lt;span class="nv"&gt;min&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;105184, &lt;span class="nv"&gt;max&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;246976, &lt;span class="nv"&gt;per&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;100.00%, &lt;span class="nv"&gt;avg&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;148570.26, &lt;span class="nv"&gt;stdev&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;22072.40, &lt;span class="nv"&gt;samples&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;42
   iops        : &lt;span class="nv"&gt;min&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;26296, &lt;span class="nv"&gt;max&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;61744, &lt;span class="nv"&gt;avg&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;37142.55, &lt;span class="nv"&gt;stdev&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;5518.10, &lt;span class="nv"&gt;samples&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;42
  write: &lt;span class="nv"&gt;IOPS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;12.4k, &lt;span class="nv"&gt;BW&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;48.5MiB/s &lt;span class="o"&gt;(&lt;/span&gt;50.8MB/s&lt;span class="o"&gt;)(&lt;/span&gt;1026MiB/21170msec&lt;span class="o"&gt;)&lt;/span&gt;
   bw &lt;span class="o"&gt;(&lt;/span&gt;  KiB/s&lt;span class="o"&gt;)&lt;/span&gt;: &lt;span class="nv"&gt;min&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;36384, &lt;span class="nv"&gt;max&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;82448, &lt;span class="nv"&gt;per&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;100.00%, &lt;span class="nv"&gt;avg&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;49650.48, &lt;span class="nv"&gt;stdev&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;7312.82, &lt;span class="nv"&gt;samples&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;42
   iops        : &lt;span class="nv"&gt;min&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; 9096, &lt;span class="nv"&gt;max&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;20612, &lt;span class="nv"&gt;avg&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;12412.60, &lt;span class="nv"&gt;stdev&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1828.21, &lt;span class="nv"&gt;samples&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;42
  cpu          : &lt;span class="nv"&gt;usr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;4.81%, &lt;span class="nv"&gt;sys&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;32.05%, &lt;span class="nv"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;72759, &lt;span class="nv"&gt;majf&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0, &lt;span class="nv"&gt;minf&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;153
  IO depths    : &lt;span class="nv"&gt;1&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.1%, &lt;span class="nv"&gt;2&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.1%, &lt;span class="nv"&gt;4&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.1%, &lt;span class="nv"&gt;8&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.1%, &lt;span class="nv"&gt;16&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.1%, &lt;span class="nv"&gt;32&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.1%, &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="nv"&gt;64&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;100.0%
     submit    : &lt;span class="nv"&gt;0&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.0%, &lt;span class="nv"&gt;4&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;100.0%, &lt;span class="nv"&gt;8&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.0%, &lt;span class="nv"&gt;16&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.0%, &lt;span class="nv"&gt;32&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.0%, &lt;span class="nv"&gt;64&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.0%, &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="nv"&gt;64&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.0%
     &lt;span class="nb"&gt;complete&lt;/span&gt;  : &lt;span class="nv"&gt;0&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.0%, &lt;span class="nv"&gt;4&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;100.0%, &lt;span class="nv"&gt;8&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.0%, &lt;span class="nv"&gt;16&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.0%, &lt;span class="nv"&gt;32&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.0%, &lt;span class="nv"&gt;64&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.1%, &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="nv"&gt;64&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.0%
     issued rwt: &lt;span class="nv"&gt;total&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;785920,262656,0, &lt;span class="nv"&gt;short&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0,0,0, &lt;span class="nv"&gt;dropped&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0,0,0
     latency   : &lt;span class="nv"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0, &lt;span class="nv"&gt;window&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0, &lt;span class="nv"&gt;percentile&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;100.00%, &lt;span class="nv"&gt;depth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;64

Run status group 0 &lt;span class="o"&gt;(&lt;/span&gt;all &lt;span class="nb"&gt;jobs&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;:
   READ: &lt;span class="nv"&gt;bw&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;145MiB/s &lt;span class="o"&gt;(&lt;/span&gt;152MB/s&lt;span class="o"&gt;)&lt;/span&gt;, 145MiB/s-145MiB/s &lt;span class="o"&gt;(&lt;/span&gt;152MB/s-152MB/s&lt;span class="o"&gt;)&lt;/span&gt;, &lt;span class="nv"&gt;io&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3070MiB &lt;span class="o"&gt;(&lt;/span&gt;3219MB&lt;span class="o"&gt;)&lt;/span&gt;, &lt;span class="nv"&gt;run&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;21170-21170msec
  WRITE: &lt;span class="nv"&gt;bw&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;48.5MiB/s &lt;span class="o"&gt;(&lt;/span&gt;50.8MB/s&lt;span class="o"&gt;)&lt;/span&gt;, 48.5MiB/s-48.5MiB/s &lt;span class="o"&gt;(&lt;/span&gt;50.8MB/s-50.8MB/s&lt;span class="o"&gt;)&lt;/span&gt;, &lt;span class="nv"&gt;io&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1026MiB &lt;span class="o"&gt;(&lt;/span&gt;1076MB&lt;span class="o"&gt;)&lt;/span&gt;, &lt;span class="nv"&gt;run&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;21170-21170msec

Disk stats &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;read&lt;/span&gt;/write&lt;span class="o"&gt;)&lt;/span&gt;:
  sda: &lt;span class="nv"&gt;ios&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;785117/262419, &lt;span class="nv"&gt;merge&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0/226, &lt;span class="nv"&gt;ticks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1027296/246228, &lt;span class="nv"&gt;in_queue&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1273668, &lt;span class="nv"&gt;util&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;99.60%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;As you can see, &lt;code&gt;fio&lt;/code&gt; provides more detailed results&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Mean &lt;code&gt;IOPS&lt;/code&gt; for read was  &lt;code&gt;37.1k&lt;/code&gt; and for write was &lt;code&gt;12.4k&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Mean &lt;code&gt;Throughput&lt;/code&gt; for read was &lt;code&gt;152MB/s&lt;/code&gt; and for write was &lt;code&gt;50.8MB/s&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
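&lt;p&gt;As a quick consistency check (a sketch using the mean &lt;code&gt;IOPS&lt;/code&gt; figures above, 37.1k read and 12.4k write), you can verify that the measured read share matches the requested &lt;code&gt;--rwmixread=75&lt;/code&gt;:&lt;/p&gt;

```shell
# Read share = read IOPS / (read IOPS + write IOPS); the values come
# from the fio output above and should match --rwmixread=75.
awk 'BEGIN { printf "read share: %.0f%%\n", 37100 / (37100 + 12400) * 100 }'
# prints "read share: 75%"
```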

&lt;h1&gt;
  
  
  Current &lt;code&gt;IOPS&lt;/code&gt; and &lt;code&gt;Throughput&lt;/code&gt; of storage:
&lt;/h1&gt;

&lt;p&gt;In this section, we want to measure the current &lt;code&gt;IOPS&lt;/code&gt; and &lt;code&gt;Throughput&lt;/code&gt; of the storage.&lt;/p&gt;

&lt;h2&gt;
  
  
  With &lt;code&gt;iostat&lt;/code&gt;
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;iostat &lt;span class="nt"&gt;-xdmb&lt;/span&gt; 60 1440
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;By running this command, &lt;code&gt;iostat&lt;/code&gt; will start monitoring and displaying disk statistics for all disk partitions every 60 seconds. The statistics will include information about disk utilization, wait time, service time, and the timestamp when the statistics were collected. The monitoring will continue for approximately 1440 minutes (around 24 hours).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt;&lt;br&gt;
You can run &lt;code&gt;iostat -xdmb 60 1440 &amp;gt;&amp;gt; iostat_res.txt&lt;/code&gt; to redirect results to a file.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;05/13/2023 08:22:19 PM
Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
loop0            0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     1.60     0.00   0.00   0.00
sda            444.91   84.11  26980.23   2880.85     3.43    63.74   0.76  43.11    0.31    1.74   0.10    60.64    34.25   0.15   8.08
sda1             0.00    0.00      0.09      0.00     0.00     0.00  11.23   0.00    0.22   12.28   0.00    41.62     0.50   0.13   0.00
sda2           444.91   84.11  26980.14   2880.85     3.43    63.74   0.76  43.11    0.31    1.74   0.10    60.64    34.25   0.15   8.08
sdb            897.80  739.67  96125.22  35384.91    41.59   944.47   4.43  56.08    0.02    0.03   0.03   107.07    47.84   0.03   5.62
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;You can see the details of each column in the &lt;a href="https://linux.die.net/man/1/iostat"&gt;&lt;code&gt;iostat&lt;/code&gt; Doc&lt;/a&gt;.
&lt;/li&gt;
&lt;li&gt;At &lt;code&gt;05/13/2023 08:22:19 PM&lt;/code&gt;, &lt;code&gt;sda&lt;/code&gt; had &lt;code&gt;444.91&lt;/code&gt; read &lt;code&gt;IOPS&lt;/code&gt; and &lt;code&gt;84.11&lt;/code&gt; write &lt;code&gt;IOPS&lt;/code&gt; (from the &lt;code&gt;r/s&lt;/code&gt; and &lt;code&gt;w/s&lt;/code&gt; columns).&lt;/li&gt;
&lt;li&gt;Similarly, &lt;code&gt;sda&lt;/code&gt; had &lt;code&gt;26980.23 kB/s&lt;/code&gt; read &lt;code&gt;Throughput&lt;/code&gt; and &lt;code&gt;2880.85 kB/s&lt;/code&gt; write &lt;code&gt;Throughput&lt;/code&gt; (from the &lt;code&gt;rkB/s&lt;/code&gt; and &lt;code&gt;wkB/s&lt;/code&gt; columns).&lt;/li&gt;
&lt;/ul&gt;
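&lt;p&gt;The total &lt;code&gt;IOPS&lt;/code&gt; of a device is the sum of its &lt;code&gt;r/s&lt;/code&gt; and &lt;code&gt;w/s&lt;/code&gt; columns. A minimal &lt;code&gt;awk&lt;/code&gt; sketch over the &lt;code&gt;sda&lt;/code&gt; row from the sample above (trailing columns trimmed for brevity) illustrates this:&lt;/p&gt;

```shell
# Sum read and write IOPS (2nd and 3rd fields) for a captured iostat row;
# the sample line is taken from the output above.
printf '%s\n' 'sda 444.91 84.11 26980.23 2880.85' |
awk '{ printf "%s total IOPS: %.2f\n", $1, $2 + $3 }'
# prints "sda total IOPS: 529.02"
```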

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt;&lt;br&gt;
This is the result of a single 60-second interval. If you run the command as above, you will see 1440 tables like this after one day.&lt;/p&gt;

&lt;h1&gt;
  
  
  Analyze Output in &lt;code&gt;jupyter notebook&lt;/code&gt; [Optional]
&lt;/h1&gt;

&lt;p&gt;If you have several servers and need to analyze their disk activity, you can use the provided Jupyter Notebook to get more insight into the disks on your servers.&lt;/p&gt;

&lt;p&gt;You can see and download the Jupyter Notebook from &lt;a href="https://nbviewer.org/github/anvaari/notebook-share/blob/main/iostat_analysis.ipynb"&gt;this link&lt;/a&gt;&lt;/p&gt;
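&lt;p&gt;If you prefer to stay in the shell, the redirected &lt;code&gt;iostat&lt;/code&gt; log can also be summarized with &lt;code&gt;awk&lt;/code&gt;. A minimal sketch, with two stand-in sample rows instead of a real log file:&lt;/p&gt;

```shell
# Average r/s and w/s for one device across all samples of an iostat log.
# The two printf lines stand in for the contents of iostat_res.txt.
printf '%s\n' \
  'sda 444.91 84.11 26980.23 2880.85' \
  'sda 500.09 95.89 30000.00 3000.00' |
awk '$1 == "sda" { r += $2; w += $3; n++ }
     END { printf "avg r/s=%.2f avg w/s=%.2f\n", r / n, w / n }'
# prints "avg r/s=472.50 avg w/s=90.00"
```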

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;With the commands and results above, you can estimate the I/O needs of the current service and see whether or not there is an I/O bottleneck.&lt;/p&gt;

&lt;h1&gt;
  
  
  References
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.howtouselinux.com/post/check-disk-iops-in-linux"&gt;Calculate &lt;code&gt;IOPS&lt;/code&gt; in linux&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://scoutapm.com/blog/understanding-disk-i-o-when-should-you-be-worried"&gt;Understanding I/O&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Also &lt;a href="https://chat.openai.com"&gt;ChatGPT&lt;/a&gt; :)&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>devops</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Introduction to Clickhouse at scale</title>
      <dc:creator>Mohammad Arab Anvari</dc:creator>
      <pubDate>Wed, 17 May 2023 08:48:08 +0000</pubDate>
      <link>https://dev.to/anvaari/introduction-to-clickhouse-at-scale-3nmm</link>
      <guid>https://dev.to/anvaari/introduction-to-clickhouse-at-scale-3nmm</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;When we need to scale our services, we usually prefer scaling out over scaling up because of the lower cost of scaling out.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Scaling out = adding more components in parallel to spread out a load. Scaling up = making a component bigger or faster so that it can handle more load. &lt;a href="https://packetpushers.net/scale-up-vs-scale-out/" rel="noopener noreferrer"&gt;Source&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In Clickhouse terminology, scaling out corresponds to sharding, and we use replicas to ensure availability. In this article, we will learn about clusters, replicas, and sharding in Clickhouse.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Replication used for data integrity and automatic failover. Sharding is used for horizontal scaling of the cluster. From Reference 1&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Sharding
&lt;/h2&gt;

&lt;p&gt;Suppose we have a table on a Clickhouse node (the yellow table on &lt;code&gt;host1&lt;/code&gt;). After a while, the data grows and the request rate increases. Now we decide to scale, and &lt;code&gt;host2&lt;/code&gt; comes into the scene. We apply sharding to the table and place a second shard on &lt;code&gt;host2&lt;/code&gt;. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;How to make a select query to the table?&lt;/strong&gt;

&lt;ol&gt;
&lt;li&gt;We can directly query the table on each host. In this case, we should know what data exists in each shard.&lt;/li&gt;
&lt;li&gt;We can create a &lt;code&gt;distributed table&lt;/code&gt;. It can be created on each node; when we query the &lt;code&gt;distributed table&lt;/code&gt;, it reads data from the proper shards. It doesn't store data itself but holds metadata about the data in each shard.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm636jg87a8ujly0a1uy9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm636jg87a8ujly0a1uy9.png" alt="Clickhouse Sharding overview"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;How to insert data into the table?&lt;/strong&gt;

&lt;ol&gt;
&lt;li&gt;We can directly insert our data into the shards if we have a predefined schema or something similar.&lt;/li&gt;
&lt;li&gt;We can also insert data into the &lt;code&gt;distributed table&lt;/code&gt;, and it distributes the rows across shards according to the defined &lt;code&gt;sharding key&lt;/code&gt;. You can read more about the sharding key in the &lt;a href="https://clickhouse.com/docs/en/engines/table-engines/special/distributed#distributed-writing-data" rel="noopener noreferrer"&gt;Clickhouse Doc&lt;/a&gt;.
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Replication
&lt;/h2&gt;

&lt;p&gt;Suppose we run an application that depends on Clickhouse, or we need online analytics backed by Clickhouse. What happens if one node goes down or a hardware issue arises? Certainly, we don't want that, so we need replication for each table. &lt;br&gt;
Replication is only available for tables in the &lt;code&gt;*MergeTree&lt;/code&gt; family. &lt;br&gt;
Clickhouse Keeper is the component that manages replication in Clickhouse. It is compatible with Zookeeper and is the alternative to Zookeeper that Clickhouse provides.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr7whbeonxuro6s21j20e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr7whbeonxuro6s21j20e.png" alt="Clickhouse cluster"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Cluster
&lt;/h2&gt;

&lt;p&gt;A cluster is a way to manage sharding and replication across a set of nodes. We can have many cluster topologies over the same nodes. Every time we add a table to a cluster, it is sharded and replicated in the way defined for that cluster.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to apply Sharding/Replication/Clustering in Clickhouse?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Set up Clickhouse Cluster
&lt;/h3&gt;

&lt;p&gt;In the below picture, we have 4 nodes. We define a cluster named &lt;code&gt;cluster1&lt;/code&gt; (in the Clickhouse config file). Each table associated with this cluster will have 2 replicas and 2 shards (but the tables must use a &lt;code&gt;Replicated*&lt;/code&gt; engine).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi0pyj2z547ifbc9tizc0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi0pyj2z547ifbc9tizc0.png" alt="Clickhouse cluster config"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Set up Clickhouse Keeper
&lt;/h3&gt;

&lt;p&gt;See &lt;a href="https://youtu.be/vBjCJtw_Ei0?t=600" rel="noopener noreferrer"&gt;This part&lt;/a&gt; of Reference 1&lt;/p&gt;

&lt;h3&gt;
  
  
  Create tables
&lt;/h3&gt;

&lt;p&gt;Every time we want to create a table with a &lt;code&gt;cluster1&lt;/code&gt; topology, we should use the &lt;code&gt;ON CLUSTER&lt;/code&gt; statement. For example, if we want to create the table:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6hkhou04hb2pkkx3ozm2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6hkhou04hb2pkkx3ozm2.png" alt="Clickhouse clustered table"&gt;&lt;/a&gt;&lt;br&gt;
The first parameter of the &lt;code&gt;ReplicatedMergeTree&lt;/code&gt; engine is the Clickhouse Keeper path of the table, and the second one is the replica name. &lt;strong&gt;Tables with the same path and different replica names are replicated&lt;/strong&gt; (Clickhouse Keeper handles this).&lt;br&gt;
We should define &lt;code&gt;{shard}&lt;/code&gt; and &lt;code&gt;{replica}&lt;/code&gt; as &lt;a href="https://clickhouse.com/docs/en/operations/server-configuration-parameters/settings#macros" rel="noopener noreferrer"&gt;&lt;code&gt;macros&lt;/code&gt;&lt;/a&gt; on each node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz6vnipwybqxdcb3nt5c7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz6vnipwybqxdcb3nt5c7.png" alt="Clickhouse macros"&gt;&lt;/a&gt;&lt;br&gt;
Once &lt;code&gt;mydb.my_table&lt;/code&gt; is created on any of &lt;code&gt;host1&lt;/code&gt;, &lt;code&gt;host2&lt;/code&gt;, &lt;code&gt;host3&lt;/code&gt;, or &lt;code&gt;host4&lt;/code&gt;, it will be created on all the other nodes.&lt;br&gt;
And finally, we will create a &lt;code&gt;distributed table&lt;/code&gt; to query &lt;code&gt;my_db.my_table&lt;/code&gt;, which has 2 shards, more conveniently.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk71pmyzf7wo6pv3y7rmh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk71pmyzf7wo6pv3y7rmh.png" alt="Clickhouse distributed table"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Great video by Clickhouse on &lt;a href="https://www.youtube.com/watch?v=vBjCJtw_Ei0&amp;amp;t=1s" rel="noopener noreferrer"&gt;YouTube&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>dataengineering</category>
      <category>clickhouse</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Apply CDC From MySQL To Clickhouse on local environment</title>
      <dc:creator>Mohammad Arab Anvari</dc:creator>
      <pubDate>Mon, 15 May 2023 06:20:03 +0000</pubDate>
      <link>https://dev.to/anvaari/apply-cdc-from-mysql-to-clickhouse-on-local-environment-4e3h</link>
      <guid>https://dev.to/anvaari/apply-cdc-from-mysql-to-clickhouse-on-local-environment-4e3h</guid>
      <description>&lt;p&gt;The aim of this tutorial is to capture every change (delete, insert, and update) from the Mysql table and sync it with Clickhouse.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Mysql&lt;/li&gt;
&lt;li&gt;Zookeeper&lt;/li&gt;
&lt;li&gt;Kafka&lt;/li&gt;
&lt;li&gt;Kafka-Connect&lt;/li&gt;
&lt;li&gt;Clickhouse&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We can set up all of these services with a simple docker-compose file(&lt;a href="https://github.com/debezium/debezium-examples/blob/main/tutorial/docker-compose-mysql.yaml" rel="noopener noreferrer"&gt;Source&lt;/a&gt;).&lt;/p&gt;
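&lt;p&gt;Note that the compose file interpolates &lt;code&gt;${DEBEZIUM_VERSION}&lt;/code&gt;, so export it before bringing the stack up (the tag below is only illustrative, not prescribed by this tutorial):&lt;/p&gt;

```shell
# The compose file references ${DEBEZIUM_VERSION}; export it first.
# "2.1" is just an example tag; pick the Debezium release you want.
export DEBEZIUM_VERSION=2.1
echo "Debezium images will use tag: $DEBEZIUM_VERSION"
# then: docker compose -f docker-compose-mysql.yaml up -d
```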

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;

&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2'&lt;/span&gt;
&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;  zookeeper&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;    image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;quay.io/debezium/zookeeper:${DEBEZIUM_VERSION}&lt;/span&gt;
&lt;span class="na"&gt;    ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="s"&gt;     - 2181:2181&lt;/span&gt;
&lt;span class="s"&gt;     - 2888:2888&lt;/span&gt;
&lt;span class="s"&gt;     - 3888:3888&lt;/span&gt;
&lt;span class="na"&gt;  kafka&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;    image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;quay.io/debezium/kafka:${DEBEZIUM_VERSION}&lt;/span&gt;
&lt;span class="na"&gt;    ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="s"&gt;     - 9092:9092&lt;/span&gt;
&lt;span class="na"&gt;    links&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="s"&gt;     - zookeeper&lt;/span&gt;
&lt;span class="na"&gt;    environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="s"&gt;     - ZOOKEEPER_CONNECT=zookeeper:2181&lt;/span&gt;
&lt;span class="na"&gt;  mysql&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;    image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;quay.io/debezium/example-mysql:${DEBEZIUM_VERSION}&lt;/span&gt;
&lt;span class="na"&gt;    ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="s"&gt;     - 3306:3306&lt;/span&gt;
&lt;span class="na"&gt;    environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="s"&gt;     - MYSQL_ROOT_PASSWORD=debezium&lt;/span&gt;
&lt;span class="s"&gt;     - MYSQL_USER=mysqluser&lt;/span&gt;
&lt;span class="s"&gt;     - MYSQL_PASSWORD=mysqlpw&lt;/span&gt;
&lt;span class="na"&gt;  connect&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;    image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;quay.io/debezium/connect:${DEBEZIUM_VERSION}&lt;/span&gt;
&lt;span class="na"&gt;    ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="s"&gt;     - 8083:8083&lt;/span&gt;
&lt;span class="na"&gt;    links&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="s"&gt;     - kafka&lt;/span&gt;
&lt;span class="s"&gt;     - mysql&lt;/span&gt;
&lt;span class="na"&gt;    environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="s"&gt;     - BOOTSTRAP_SERVERS=kafka:9092&lt;/span&gt;
&lt;span class="s"&gt;     - GROUP_ID=1&lt;/span&gt;
&lt;span class="s"&gt;     - CONFIG_STORAGE_TOPIC=my_connect_configs&lt;/span&gt;
&lt;span class="s"&gt;     - OFFSET_STORAGE_TOPIC=my_connect_offsets&lt;/span&gt;
&lt;span class="s"&gt;     - STATUS_STORAGE_TOPIC=my_connect_statuse&lt;/span&gt;
&lt;span class="na"&gt;  clickhouse&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;    image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;clickhouse/clickhouse-server:23.2.4.12&lt;/span&gt;
&lt;span class="na"&gt;    links&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="s"&gt;     - kafka&lt;/span&gt;
&lt;span class="na"&gt;    ulimits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;      nofile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;        soft&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;262144&lt;/span&gt;
&lt;span class="na"&gt;        hard&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;262144&lt;/span&gt;
&lt;span class="na"&gt;    ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="s"&gt;      - 8123:8123&lt;/span&gt;
&lt;span class="s"&gt;      - 9000:9000&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You can read more about the options for each service in &lt;a href="https://debezium.io/documentation/reference/stable/tutorial.html" rel="noopener noreferrer"&gt;this tutorial&lt;/a&gt;.&lt;br&gt;
After saving the YAML file as &lt;code&gt;docker-compose.yml&lt;/code&gt;, run:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;DEBEZIUM_VERSION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2.2
docker compose up


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Now we have a &lt;code&gt;MySQL&lt;/code&gt; container with a simple database named &lt;code&gt;inventory&lt;/code&gt;, a &lt;code&gt;Kafka&lt;/code&gt; container and the &lt;code&gt;Zookeeper&lt;/code&gt; instance that manages the Kafka cluster, a &lt;code&gt;connect&lt;/code&gt; instance that adds Kafka Connect capabilities to Kafka, and a &lt;code&gt;ClickHouse&lt;/code&gt; instance. All prerequisites are now in place.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb0u5ra85sx9aum970qdd.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb0u5ra85sx9aum970qdd.jpeg" alt="debezium, clickhouse to mysql architecture"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://medium.com/@hoptical/apply-cdc-from-mysql-to-clickhouse-d660873311c7" rel="noopener noreferrer"&gt;Image Source&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Deploy Debezium connector
&lt;/h2&gt;

&lt;p&gt;We can interact with &lt;code&gt;Kafka-Connect&lt;/code&gt; through its &lt;a href="https://docs.confluent.io/platform/current/connect/references/restapi.html#connectors" rel="noopener noreferrer"&gt;REST API&lt;/a&gt;.&lt;br&gt;
The base request looks like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

curl &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;Request_Type&lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Accept:application/json"&lt;/span&gt; &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type:application/json"&lt;/span&gt; localhost:8083/connectors/ 


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;List the current connectors:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

curl &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; GET &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Accept:application/json"&lt;/span&gt; &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type:application/json"&lt;/span&gt; localhost:8083/connectors/


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Delete the &lt;code&gt;{my-conn}&lt;/code&gt; connector:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

curl &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; DELETE &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Accept:application/json"&lt;/span&gt; &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type:application/json"&lt;/span&gt; localhost:8083/connectors/&lt;span class="o"&gt;{&lt;/span&gt;my-conn&lt;span class="o"&gt;}&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Add a connector:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

curl &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Accept:application/json"&lt;/span&gt; &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type:application/json"&lt;/span&gt; localhost:8083/connectors/ &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{connector-config-as-json}'&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
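Kafka Connect also exposes a status endpoint, which is handy right after deployment to confirm that the connector and its tasks are in the RUNNING state. A minimal sketch, assuming the connector name is the one registered below (`mysql-connector`):

```shell
# Check the status of a deployed connector and its tasks.
# CONNECTOR is assumed to match the "name" field you registered.
CONNECTOR=mysql-connector
STATUS_URL="localhost:8083/connectors/${CONNECTOR}/status"
echo "GET ${STATUS_URL}"
# Returns JSON with the connector and task states once the stack is up:
curl -s -H "Accept:application/json" "${STATUS_URL}" || echo "Kafka Connect is not reachable"
```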
&lt;h3&gt;
  
  
  Config for MySQL Connector
&lt;/h3&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mysql-connector"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"config"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"tasks.max"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"connector.class"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"io.debezium.connector.mysql.MySqlConnector"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"database.hostname"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mysql"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"database.port"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"3306"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"database.user"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"root"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"database.password"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"debezium"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"database.include.list"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"inventory"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"table.include.list"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"inventory.orders"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"database.server.id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"message.key.columns"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"inventory.orders:order_number"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"schema.history.internal.kafka.bootstrap.servers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"kafka:9092"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"schema.history.internal.kafka.topic"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"dbz.inventory.history"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"snapshot.mode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"schema_only"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"topic.prefix"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"dbz.inventory.v2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"transforms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"unwrap"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"transforms.unwrap.delete.handling.mode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"rewrite"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"transforms.unwrap.type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"io.debezium.transforms.ExtractNewRecordState"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;


&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;name&lt;/code&gt;: The name of the connector.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;config&lt;/code&gt;: The connector’s configuration.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tasks.max&lt;/code&gt;: Only one task should operate at any one time. Because the MySQL connector reads the MySQL server’s &lt;code&gt;binlog&lt;/code&gt;, using a single connector task ensures proper order and event handling. The Kafka Connect service uses connectors to start one or more tasks that do the work, and it automatically distributes the running tasks across the cluster of Kafka Connect services. If any of the services stop or crash, those tasks will be redistributed to running services.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;connector.class&lt;/code&gt;: The type of connector, one of &lt;a href="https://debezium.io/documentation/reference/stable/connectors/index.html" rel="noopener noreferrer"&gt;these&lt;/a&gt;.
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;database.hostname&lt;/code&gt;: The database host, which is the name of the Docker container running the MySQL server (&lt;code&gt;mysql&lt;/code&gt;). Docker manipulates the network stack within the containers so that each linked container can be resolved with /etc/hosts using the container name for the hostname. If MySQL were running on a normal network, you would specify the IP address or resolvable hostname for this value.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;database.user&lt;/code&gt; &amp;amp; &lt;code&gt;database.password&lt;/code&gt;: The username and password of a MySQL user with &lt;a href="https://debezium.io/documentation/reference/stable/connectors/mysql.html#mysql-creating-user" rel="noopener noreferrer"&gt;these&lt;/a&gt; privileges. For this example, I use the root user and password.&lt;/li&gt;
&lt;li&gt; &lt;code&gt;database.include.list&lt;/code&gt;: Only changes in the inventory database will be detected.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;topic.prefix&lt;/code&gt;: A unique topic prefix. This name will be used as the prefix for all Kafka topics.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;schema.history.internal.kafka.bootstrap.servers&lt;/code&gt; &amp;amp; &lt;code&gt;schema.history.internal.kafka.topic&lt;/code&gt;: The connector will store the history of the database schemas in Kafka using this broker (the same broker to which you are sending events) and topic name. Upon restart, the connector will recover the schemas of the database that existed at the point in time in the &lt;code&gt;binlog&lt;/code&gt; when the connector should begin reading.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;transforms*&lt;/code&gt;: These transformations are needed to insert data into ClickHouse. There is a more detailed explanation &lt;a href="https://medium.com/@hoptical/apply-cdc-from-mysql-to-clickhouse-d660873311c7" rel="noopener noreferrer"&gt;here&lt;/a&gt;.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A full reference of the MySQL connector configuration properties can be found &lt;a href="https://debezium.io/documentation/reference/stable/connectors/mysql.html#mysql-connector-properties" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Consume Messages From Kafka
&lt;/h3&gt;

&lt;p&gt;Let's list the topics in our Kafka broker. First, open a shell inside the Kafka container:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

docker &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;kafka-container-name&lt;span class="o"&gt;}&lt;/span&gt; /bin/bash


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

/kafka/bin/kafka-topics.sh &lt;span class="nt"&gt;--bootstrap-server&lt;/span&gt; kafka:9092 &lt;span class="nt"&gt;--list&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Note that the topic corresponding to our &lt;code&gt;orders&lt;/code&gt; table in MySQL follows the format &lt;code&gt;{topic.prefix}.{database_name}.{table_name}&lt;/code&gt;. In this example, that is &lt;code&gt;dbz.inventory.v2.inventory.orders&lt;/code&gt;.&lt;br&gt;
To consume all messages from a topic:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

 /kafka/bin/kafka-console-consumer.sh &lt;span class="nt"&gt;--bootstrap-server&lt;/span&gt; kafka:9092 &lt;span class="nt"&gt;--topic&lt;/span&gt; dbz.inventory.v2.inventory.orders &lt;span class="nt"&gt;--from-beginning&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
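The topic name above is plain string concatenation of the connector settings; as a sanity check, you can assemble the expected topic name from the values used in this tutorial:

```shell
# {topic.prefix}.{database_name}.{table_name} -> Kafka topic name
TOPIC_PREFIX=dbz.inventory.v2
DATABASE=inventory
TABLE=orders
TOPIC="${TOPIC_PREFIX}.${DATABASE}.${TABLE}"
echo "${TOPIC}"   # dbz.inventory.v2.inventory.orders
```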
&lt;h2&gt;
  
  
  Set Up Clickhouse Tables
&lt;/h2&gt;

&lt;p&gt;As described in &lt;a href="https://clickhouse.com/docs/en/integrations/kafka#using-the-kafka-table-engine" rel="noopener noreferrer"&gt;this article&lt;/a&gt; from the ClickHouse docs, we need three tables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A table with the Kafka engine&lt;/li&gt;
&lt;li&gt;A Materialized View table&lt;/li&gt;
&lt;li&gt;A MergeTree table&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Kafka Engine Table
&lt;/h3&gt;

&lt;p&gt;As mentioned in &lt;a href="https://clickhouse.com/docs/en/engines/table-engines/integrations/kafka#table_engine-kafka-creating-a-table" rel="noopener noreferrer"&gt;the doc&lt;/a&gt;, we should specify the format of the messages arriving from the Kafka topic (one of &lt;a href="https://clickhouse.com/docs/en/interfaces/formats" rel="noopener noreferrer"&gt;these&lt;/a&gt;). We could use the Kafka Schema Registry, but here we want to parse the JSON directly. So, with the help of the solution provided in &lt;a href="https://yuhui-lin.github.io/post/2021-06-01_clickhouse-json/" rel="noopener noreferrer"&gt;this post&lt;/a&gt;, we receive each message in the &lt;code&gt;JSONAsString&lt;/code&gt; format and then parse it with a materialized view.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="nv"&gt;`default`&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kafka_orders&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="nv"&gt;`msg_json_str`&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Engine&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Kafka&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'kafka:9092'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'dbz.inventory.v2.inventory.orders'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'clickhouse'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'JSONAsString'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://clickhouse.com/docs/en/engines/table-engines/integrations/kafka" rel="noopener noreferrer"&gt;Full doc&lt;/a&gt; of Kafka engine in Clickhouse.&lt;/p&gt;

&lt;h3&gt;
  
  
  MergeTree Table
&lt;/h3&gt;

&lt;p&gt;As mentioned at the beginning of this article, we want to capture deletes and updates, so we use &lt;code&gt;ReplacingMergeTree&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stream_orders&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="nv"&gt;`order_number`&lt;/span&gt; &lt;span class="n"&gt;Int16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="nv"&gt;`order_date`&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="nv"&gt;`purchaser`&lt;/span&gt; &lt;span class="n"&gt;Int16&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="nv"&gt;`quantity`&lt;/span&gt; &lt;span class="n"&gt;Int16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="nv"&gt;`product_id`&lt;/span&gt; &lt;span class="n"&gt;Int16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="nv"&gt;`__deleted`&lt;/span&gt; &lt;span class="k"&gt;Nullable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ENGINE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ReplacingMergeTree&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_number&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;SETTINGS&lt;/span&gt; &lt;span class="n"&gt;index_granularity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8192&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Mat. View
&lt;/h3&gt;

&lt;p&gt;We parse the JSON using &lt;a href="https://clickhouse.com/docs/en/sql-reference/functions/json-functions" rel="noopener noreferrer"&gt;JSONExtract functions&lt;/a&gt; in ClickHouse.&lt;br&gt;
Keep in mind that Debezium represents the &lt;code&gt;DATE&lt;/code&gt; data type as the number of days since &lt;code&gt;1970-01-01&lt;/code&gt; (&lt;a href="https://debezium.io/documentation/reference/stable/connectors/mysql.html#mysql-temporal-types" rel="noopener noreferrer"&gt;Source&lt;/a&gt;). That is why we use &lt;code&gt;toDate&lt;/code&gt; in combination with &lt;code&gt;JSONExtractInt&lt;/code&gt;.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;MATERIALIZED&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;consumer__orders&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stream_orders&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="nv"&gt;`order_number`&lt;/span&gt; &lt;span class="n"&gt;Int16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="nv"&gt;`order_date`&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="nv"&gt;`purchaser`&lt;/span&gt; &lt;span class="n"&gt;Int16&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="nv"&gt;`quantity`&lt;/span&gt; &lt;span class="n"&gt;Int16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="nv"&gt;`product_id`&lt;/span&gt; &lt;span class="n"&gt;Int16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="nv"&gt;`__deleted`&lt;/span&gt; &lt;span class="k"&gt;Nullable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
&lt;span class="n"&gt;JSONExtractInt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg_json_str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'payload'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'order_number'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;order_number&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;toDate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'1970-01-01'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;JSONExtractInt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg_json_str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'payload'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'order_date'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;JSONExtractInt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg_json_str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'payload'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'purchaser'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;purchaser&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;JSONExtractInt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg_json_str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'payload'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'quantity'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;JSONExtractInt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg_json_str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'payload'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'product_id'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;JSONExtractString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg_json_str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'payload'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'__deleted'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;__deleted&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kafka_orders&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
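The epoch-days conversion done above with `toDate('1970-01-01') + JSONExtractInt(...)` can be reproduced from the command line; this sketch assumes GNU `date`, as shipped in most Linux images:

```shell
# Debezium encodes a MySQL DATE as the number of days since 1970-01-01.
# Convert such a value back into a calendar date:
days=19856
date -u -d "1970-01-01 +${days} days" +%F   # 2024-05-13
```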
&lt;h3&gt;
  
  
  A View (Optional)
&lt;/h3&gt;

&lt;p&gt;ClickHouse merges the &lt;code&gt;stream_orders&lt;/code&gt; table on an irregular schedule, so we can't always see the latest version of the data. We can use a view to achieve this:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="nv"&gt;`order_number`&lt;/span&gt; &lt;span class="n"&gt;Int16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="nv"&gt;`order_date_`&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="nv"&gt;`purchaser`&lt;/span&gt; &lt;span class="n"&gt;Int16&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="nv"&gt;`quantity`&lt;/span&gt; &lt;span class="n"&gt;Int16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="nv"&gt;`product_id`&lt;/span&gt; &lt;span class="n"&gt;Int16&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
&lt;span class="n"&gt;order_number&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;order_date_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;argMax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;purchaser&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;purchaser&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;argMax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;argMax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;product_id&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stream_orders&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nv"&gt;`__deleted`&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'false'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;order_number&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We can also use the &lt;code&gt;FINAL&lt;/code&gt; modifier instead of &lt;code&gt;GROUP BY&lt;/code&gt;, but it's not recommended in a production environment.&lt;/p&gt;
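For comparison, a deduplicated read with `FINAL` would look like the sketch below; it is simpler to write, but it forces merge work at query time, which is why it is discouraged on large production tables:

```sql
-- Same deduplicated read using FINAL instead of GROUP BY/argMax
SELECT *
FROM default.stream_orders FINAL
WHERE `__deleted` = 'false'
```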

&lt;h2&gt;
  
  
  Troubleshooting
&lt;/h2&gt;

&lt;p&gt;In case of any error, or if data is missing from the tables, check the ClickHouse server logs located at &lt;code&gt;/var/log/clickhouse-server/clickhouse-server.err.log&lt;/code&gt; inside the ClickHouse container.&lt;/p&gt;
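From the host, that can be done with `docker exec`; the container name `clickhouse` below is an assumption based on the compose service name, so verify it with `docker ps` first:

```shell
# Hypothetical container name -- verify with `docker ps`.
CLICKHOUSE_CONTAINER=clickhouse
LOG_PATH=/var/log/clickhouse-server/clickhouse-server.err.log
echo "tailing ${LOG_PATH}"
docker exec "${CLICKHOUSE_CONTAINER}" tail -n 50 "${LOG_PATH}" || echo "container not running"
```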

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://debezium.io/documentation/reference/stable/tutorial.html" rel="noopener noreferrer"&gt;Debezium Official Tutorial&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/@hoptical/apply-cdc-from-mysql-to-clickhouse-d660873311c7" rel="noopener noreferrer"&gt;Blog Post By Hamed Karbasi&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>dataengineering</category>
      <category>cdc</category>
      <category>debezium</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
