Shahab Ranjbary

Connecting Multiple Kafka Clusters in ClickHouse Using Named Collections

Introduction:

ClickHouse is a powerful columnar database renowned for its speed and efficiency. One of its key strengths is seamless integration with external data sources such as Kafka. As multi-cluster setups become increasingly common in modern data architectures, ClickHouse's Named Collections become an invaluable tool. In this guide, we'll walk through how to use this feature to set up connections to two distinct Kafka clusters.


Why Use Named Collections?

Understanding the true value of Named Collections is crucial before we dive deep into the configurations. They allow us to:

  • Reduce Repetition: Eliminate the need to specify the same connection settings in every table definition (see the sketch after this list).
  • Centralize Management: Keep all connection configurations in a single, easily manageable location.
  • Improve Security: Keep sensitive credentials in the server configuration, out of the reach of non-administrative users.
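
To make the first point concrete, here is a minimal sketch of the same Kafka engine table defined twice: once with the broker address inlined in the DDL, and once pulling connection details from the primary_kafka_cluster collection configured later in this post (table, topic, and consumer group names are placeholders):

-- Without a named collection: the broker address is repeated in every table definition
CREATE TABLE kafka_events_inline
(
    `id` UInt32,
    `payload` String
)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'primary-kafka-cluster:9094',
         kafka_topic_list = 'events',
         kafka_group_name = 'events_consumer',
         kafka_format = 'JSONEachRow';

-- With a named collection: connection details come from the server configuration,
-- and only per-table settings remain in the DDL
CREATE TABLE kafka_events_collection
(
    `id` UInt32,
    `payload` String
)
ENGINE = Kafka(primary_kafka_cluster)
SETTINGS kafka_topic_list = 'events',
         kafka_group_name = 'events_consumer',
         kafka_format = 'JSONEachRow';

The second form also keeps the SASL credentials entirely in the server configuration, since they have no corresponding SQL-level setting.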

Configuring Named Collections for Kafka:

With the value of Named Collections established, let's set up connections to two distinct Kafka clusters – primary and secondary.

XML Configuration:

<clickhouse>
    <named_collections>
        <!-- Primary Kafka Cluster Configuration -->
        <primary_kafka_cluster>
            <kafka_broker_list>primary-kafka-cluster:9094</kafka_broker_list>
            <!-- librdkafka security/auth settings for this cluster -->
            <kafka>
                <client_id>primary_kafka_client</client_id>
                <security_protocol>SASL_PLAINTEXT</security_protocol>
                <sasl_mechanism>SCRAM-SHA-512</sasl_mechanism>
                <sasl_username>clickhouse_primary</sasl_username>
                <sasl_password>primary_secret_password</sasl_password>
            </kafka>
        </primary_kafka_cluster>
        <!-- Secondary Kafka Cluster Configuration -->
        <secondary_kafka_cluster>
            <kafka_broker_list>backup-kafka-cluster:9095</kafka_broker_list>
            <kafka>
                <client_id>secondary_kafka_client</client_id>
                <security_protocol>SASL_PLAINTEXT</security_protocol>
                <sasl_mechanism>SCRAM-SHA-512</sasl_mechanism>
                <sasl_username>clickhouse_secondary</sasl_username>
                <sasl_password>secondary_secret_password</sasl_password>
            </kafka>
        </secondary_kafka_cluster>
    </named_collections>
</clickhouse>

For a more detailed configuration setup, refer to Pull Request #31691 (available starting from ClickHouse v21.12), which introduced this more streamlined approach to using named_collections with the Kafka engine.
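
After reloading the server configuration, you can sanity-check which collections the server knows about. The exact behavior depends on your ClickHouse version (newer releases expose this through the system.named_collections table, and listing it requires the appropriate access rights):

SELECT name
FROM system.named_collections;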


Setting Up Permanent Storage: MergeTree Table

After configuring our Kafka connections, the focus shifts to ClickHouse itself. We'll create the tables that will serve as permanent storage for the streamed data.

1. Kafka Engine Table:

To read directly from our Kafka topics, we'll define tables in ClickHouse using the Kafka engine, each referencing one of the named collections configured above. Here's how you can define these tables:

For the primary Kafka cluster:

CREATE TABLE kafka_cluster_a
(
    `id` UInt32,
    `first_name` String,
    `last_name` String
)
ENGINE = Kafka(primary_kafka_cluster)
SETTINGS kafka_topic_list = 'your_topic_name_for_primary',
         kafka_group_name = 'your_consumer_group_for_primary',
         kafka_format = 'JSONEachRow';

For the secondary Kafka cluster:

CREATE TABLE kafka_cluster_b
(
    `id` UInt32,
    `first_name` String,
    `last_name` String
)
ENGINE = Kafka(secondary_kafka_cluster)
SETTINGS kafka_topic_list = 'your_topic_name_for_secondary',
         kafka_group_name = 'your_consumer_group_for_secondary',
         kafka_format = 'JSONEachRow';

2. MergeTree Table:

We'll use MergeTree tables to persistently store the data streamed from Kafka:

For kafka_cluster_a:

CREATE TABLE cluster_a_storage
(
    `id` UInt32,
    `first_name` String,
    `last_name` String
) ENGINE = MergeTree()
ORDER BY id;

For kafka_cluster_b:

CREATE TABLE cluster_b_storage
(
    `id` UInt32,
    `first_name` String,
    `last_name` String
) ENGINE = MergeTree()
ORDER BY id;

3. Materialized View:

The materialized view serves as the Kafka table's consumer and directs the data flow: once attached, it continuously reads new messages from the Kafka engine table and writes them into the corresponding MergeTree table.

For kafka_cluster_a:

CREATE MATERIALIZED VIEW cluster_a_mv TO cluster_a_storage AS
SELECT 
    id,
    first_name,
    last_name
FROM kafka_cluster_a;

For kafka_cluster_b:

CREATE MATERIALIZED VIEW cluster_b_mv TO cluster_b_storage AS
SELECT 
    id,
    first_name,
    last_name
FROM kafka_cluster_b;

Practical Applications:

With the above groundwork in place, ClickHouse continuously ingests and stores data from both Kafka clusters. Any message published to the configured topics is consumed and persisted in near real time, which is particularly useful for real-time analytics and data-driven decision-making.
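
As a quick end-to-end check, you can publish a test row through the Kafka engine table itself (inserting into a Kafka engine table produces messages to the topic) and then confirm it lands in permanent storage once the materialized view has consumed it. The values below are made up for illustration:

-- Produce a test message to the primary cluster's topic
INSERT INTO kafka_cluster_a (id, first_name, last_name) VALUES (1, 'Jane', 'Doe');

-- After a brief delay, the materialized view should have delivered the row here
SELECT *
FROM cluster_a_storage
WHERE id = 1;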

Conclusion:

With ClickHouse's Named Collections, connecting to multiple Kafka clusters goes from merely possible to efficient and well organized. This setup keeps freshly streamed data immediately available for querying, simplifying real-time analytics.

Further Reading:

For a deeper understanding of ClickHouse's named_collections, explore the official ClickHouse documentation.
