<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kepha Mwandiki</title>
    <description>The latest articles on DEV Community by Kepha Mwandiki (@kepha_mwandiki).</description>
    <link>https://dev.to/kepha_mwandiki</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3393960%2Ff545ee8f-4e74-40fd-bc22-053ad0ea8136.png</url>
      <title>DEV Community: Kepha Mwandiki</title>
      <link>https://dev.to/kepha_mwandiki</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kepha_mwandiki"/>
    <language>en</language>
    <item>
      <title>From Kafka to Clean Tables: Building a Confluent Snowflake Pipeline with Streams &amp; Tasks</title>
      <dc:creator>Kepha Mwandiki</dc:creator>
      <pubDate>Wed, 01 Oct 2025 11:44:58 +0000</pubDate>
      <link>https://dev.to/kepha_mwandiki/from-kafka-to-clean-tables-building-a-confluent-snowflake-pipeline-with-streams-tasks-140d</link>
      <guid>https://dev.to/kepha_mwandiki/from-kafka-to-clean-tables-building-a-confluent-snowflake-pipeline-with-streams-tasks-140d</guid>
      <description>&lt;p&gt;Building reliable data pipelines often starts with messy JSON and ends with clean, analytics-ready tables. In this article, we will walk through the complete journey of streaming data from Kafka into Snowflake using the Confluent Snowflake Sink Connector. We will begin by generating private and public keys to authenticate Confluent with Snowflake, then set up the connector so raw json events land in a Snowflake table. From there, we will use Snowflake Streams and Tasks to continuously transform that nested json into a structured table. By the end, you’ll have a fully automated pipeline where Kafka pushes data through Confluent, Snowflake stores the raw records, and your clean tables stay updated in real-time, ready for dashboards, analytics, and reporting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setting up Confluent - Environment, Cluster, and Topic&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Confluent Environment&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;An environment acts as a logical workspace in Confluent Cloud where you can group together clusters, topics and connectors for a specific project.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In the screenshot below, you can see the environment dashboard, which is the starting point before spinning up a Kafka cluster and creating topics to hold our streaming data.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fshjrn3c21xqt3ttcmyw3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fshjrn3c21xqt3ttcmyw3.png" alt=" " width="681" height="575"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Confluent Cluster&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The cluster is where your topics live and where all the data streaming happens.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In Confluent Cloud, you can choose between Basic, Standard, or Dedicated clusters depending on workload.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For this pipeline, a standard/basic cluster is sufficient to handle ingesting weather data into our topic before pushing it downstream to Snowflake.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcvhbym1txzkma8ydfmq1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcvhbym1txzkma8ydfmq1.png" alt=" " width="800" height="326"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloud Provider&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When creating the cluster, Confluent asks you to pick a cloud provider and region (AWS, Azure, etc.).&lt;/li&gt;
&lt;li&gt;This choice is more than just location: it affects latency, data transfer costs, and overall performance. &lt;em&gt;Always select the same cloud provider and region as your Snowflake account.&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;In our case, Snowflake is hosted on AWS af-south-1 (Cape Town), so we placed our Confluent cluster in the same AWS region to ensure low latency and avoid unnecessary cross-region data transfer fees.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F614ufyi2yi0pl64awmw2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F614ufyi2yi0pl64awmw2.png" alt=" " width="800" height="618"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Confluent Client&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;After the cluster is provisioned, Confluent provides client connection configurations so that applications can produce or consume messages from the cluster.&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Since our producer script is written in Python, we download the Python client configuration.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;This file contains important details like the bootstrap servers, API key, and API secret, which authenticate the Python client to the Confluent cluster securely.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;These credentials are later used in our weather data producer script to publish json events (such as city, temperature, humidity, etc.) into the Kafka topic.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
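To give a feel for what that downloaded configuration contains, here is a minimal Python sketch that parses a Confluent-style `client.properties` file into a dict. The file layout (simple `key=value` lines) matches what Confluent Cloud generates; the server address and credentials below are placeholders, not real values.

```python
def parse_client_config(text):
    """Parse simple key=value properties lines into a dict."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip()
    return config

# Placeholder contents mimicking the file Confluent Cloud generates
sample = """
# Kafka connection settings downloaded from Confluent Cloud
bootstrap.servers=pkc-xxxxx.af-south-1.aws.confluent.cloud:9092
security.protocol=SASL_SSL
sasl.username=API_KEY_PLACEHOLDER
sasl.password=API_SECRET_PLACEHOLDER
"""

cfg = parse_client_config(sample)
print(cfg["bootstrap.servers"])
```

The producer script then reads these values to authenticate against the cluster instead of hard-coding credentials.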

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvps0l9al456m2d318k0x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvps0l9al456m2d318k0x.png" alt=" " width="800" height="411"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F30bmosnpaymsv5auplp8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F30bmosnpaymsv5auplp8.png" alt=" " width="800" height="275"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Creating a Topic&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The final step in setting up Confluent is creating a Kafka Topic inside the cluster.&lt;/li&gt;
&lt;li&gt;A topic is a channel where streaming data is stored and organized before being delivered downstream.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flbe4rxe53jd2x2vy5um2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flbe4rxe53jd2x2vy5um2.png" alt=" " width="800" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data streaming into our Topic&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once the topic is created, we can begin streaming data into it.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In the screenshot above, you can see messages arriving in Confluent, which confirms that the pipeline is working.&lt;/li&gt;
&lt;li&gt;This data is being produced by a Python producer script that I wrote earlier, which fetches live weather information and publishes it into the Kafka topic.&lt;/li&gt;
&lt;li&gt;On the other side, a simple Python consumer can also be used to verify that messages are flowing correctly before we connect downstream systems like Snowflake.&lt;/li&gt;
&lt;/ul&gt;
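To make the message flow concrete, here is a sketch of the kind of JSON event the producer publishes. The field names mirror the columns we create later in Snowflake; the values are made up for illustration, and the full producer script comes in a follow-up article.

```python
import json
import time

# Shape of one weather event published to the Kafka topic (illustrative values)
event = {
    "city": "Nairobi",
    "latitude": -1.29,
    "longitude": 36.82,
    "temperature": 22.5,
    "temp_min": 20.1,
    "temp_max": 24.3,
    "pressure": 1012,
    "humidity": 64,
    "wind_speed": 3.4,
    "clouds": 40,
    "w_condition": "Clouds",
    "description": "scattered clouds",
    "time_stamp": int(time.time()),  # epoch seconds, later parsed by TO_TIMESTAMP
}

# Kafka messages are byte arrays, so the producer serializes the dict to JSON bytes
payload = json.dumps(event).encode("utf-8")
print(len(payload))
```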

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhfy0ek0pzx2txddktgiw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhfy0ek0pzx2txddktgiw.png" alt=" " width="800" height="291"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Confluent Connectors and sinks&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Confluent offers a wide range of connectors that make it easy to move data between Kafka and external systems without writing custom code.&lt;br&gt;
In the screenshot below, you can see several of the available connectors: sources, which bring data into Kafka, and sinks, which deliver data out of Kafka.&lt;/p&gt;

&lt;p&gt;For our use case, we are interested in the Snowflake Sink Connector, because our goal is to continuously push streaming weather data from Kafka into Snowflake.&lt;/p&gt;

&lt;p&gt;The sink connector ensures that every new event in our topic is automatically written into a Snowflake table, making it available for storage and downstream analytics.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjxc8sbuv1ywrj3bdrdmy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjxc8sbuv1ywrj3bdrdmy.png" alt=" " width="800" height="398"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Snowflake Sink Connector&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When setting up the Snowflake Sink Connector, the first step is to tell it which Kafka topic to read from.&lt;/li&gt;
&lt;li&gt;In this case, we select our previously created weather topic as the source.&lt;/li&gt;
&lt;li&gt;This ensures that every json message produced into the topic containing fields like city, temperature, and humidity is automatically ingested by the connector and prepared for delivery into Snowflake.&lt;/li&gt;
&lt;li&gt;By binding the connector to the topic, we establish the direct pipeline from Confluent into our Snowflake account.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpnmthruf5sptmowvse7c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpnmthruf5sptmowvse7c.png" alt=" " width="800" height="296"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sink connector API&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To authenticate the connector, Confluent requires an API key and secret that grant access to the Kafka cluster. In our case we will use the API we generated when creating this cluster.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsfa3l96w6u8oa8e311si.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsfa3l96w6u8oa8e311si.png" alt=" " width="800" height="424"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Configuring Snowflake Connection details&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The next step is to configure the Snowflake connection details.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Here we provide the account URL, which is derived from the Snowflake account locator and region, e.g. &lt;a href="https://xc15924.af-south-1.aws.snowflakecomputing.com" rel="noopener noreferrer"&gt;https://xc15924.af-south-1.aws.snowflakecomputing.com&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;We also specify the database, schema, and warehouse that will receive the data. All these should already be created in Snowflake.&lt;/li&gt;
&lt;li&gt;Finally, we configure the Snowflake user and attach the public key to enable key-pair authentication.&lt;/li&gt;
&lt;li&gt;This ensures the connector can securely deliver data into the right Snowflake environment without needing to store passwords.&lt;/li&gt;
&lt;/ul&gt;
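The account URL follows a predictable pattern, sketched here in Python. The locator, region, and cloud provider below are the ones from our example account; substitute your own.

```python
def snowflake_account_url(locator, region, cloud):
    """Build the Snowflake account URL from locator, region, and cloud provider."""
    return f"https://{locator}.{region}.{cloud}.snowflakecomputing.com"

# Values from our example account
url = snowflake_account_url("xc15924", "af-south-1", "aws")
print(url)
```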

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp5toc1iunderddmmmpd1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp5toc1iunderddmmmpd1.png" alt=" " width="800" height="695"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Getting your Public and Private keys&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Public&lt;/u&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The public key is derived from the private key and is the one uploaded to Snowflake.&lt;/li&gt;
&lt;li&gt;This allows Snowflake to validate signatures created with the private key.
Here are the commands to generate a key pair and extract the single-line public key:
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;openssl genrsa 2048 | openssl pkcs8 -topk8 -v2 des3 -inform PEM -out rsa_key.p8

openssl rsa -in rsa_key.p8 -pubout -out rsa_key.pub

cat rsa_key.pub

sed -e '1d' -e '$d' rsa_key.pub | tr -d '\n' &amp;gt; rsa_key.pub.single
at rsa_key.pub.single

cat rsa_key.pub.single
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
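If the sed/tr pipeline looks opaque, here is the same transformation written in Python: drop the PEM header and footer lines and join the base64 body into a single line. The key material below is obviously fake.

```python
def pem_to_single_line(pem_text):
    """Strip PEM header/footer lines and join the base64 body into one line,
    mirroring the sed/tr pipeline used in the shell commands."""
    lines = pem_text.strip().splitlines()
    body = [ln for ln in lines if not ln.startswith("-----")]
    return "".join(body)

# Fake key for illustration only
sample_pub = """-----BEGIN PUBLIC KEY-----
MIIBIjANBgkqh
AAAAB3NzaC1yc2E
-----END PUBLIC KEY-----"""

print(pem_to_single_line(sample_pub))
```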


&lt;p&gt;&lt;u&gt;Private&lt;/u&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The private key is generated locally and never shared with Snowflake.&lt;/li&gt;
&lt;li&gt;It is used by the Snowflake Sink Connector to sign authentication requests.&lt;/li&gt;
&lt;li&gt;Below is the command I used to create a private key in the PKCS8 format required by Snowflake:
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;openssl genrsa 2048 | openssl pkcs8 -topk8 -nocrypt -out rsa_key.p8

cat rsa_key.p8

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;After getting the public key, you should now save it in Snowflake.&lt;/p&gt;

&lt;p&gt;Below is the SQL for that, where &lt;u&gt;USERNAME&lt;/u&gt; is your Snowflake username.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ALTER USER USERNAME SET RSA_PUBLIC_KEY='MIIBIjANBgkqh...';&lt;br&gt;
&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The following code will confirm whether the public key has been set:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SHOW PARAMETERS LIKE 'RSA_PUBLIC_KEY' FOR USER USERNAME;&lt;br&gt;
&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data format and ingestion method&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In the next step, the connector asks you to choose the data format and the ingestion method.&lt;/li&gt;
&lt;li&gt;For this pipeline, we select json as the format since our weather producer sends data in json structure.&lt;/li&gt;
&lt;li&gt;Snowflake offers two ingestion options: Snowpipe (micro-batch loading) and Snowpipe Streaming (low-latency streaming ingestion).&lt;/li&gt;
&lt;li&gt;While both work, Snowpipe Streaming provides near real-time delivery into Snowflake, making it better suited for continuously flowing data like weather updates.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fei1gh56ear3hrpjw3xyu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fei1gh56ear3hrpjw3xyu.png" alt=" " width="800" height="466"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sizing&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The final step in configuring the connector is sizing, where you choose the number of tasks.&lt;/li&gt;
&lt;li&gt;Each task is a worker instance that reads from Kafka and writes to Snowflake.&lt;/li&gt;
&lt;li&gt;For small workloads, 1 task is enough.&lt;/li&gt;
&lt;li&gt;If you expect higher throughput, you can increase the task count to scale out ingestion.&lt;/li&gt;
&lt;li&gt;In our weather data pipeline, a single task is sufficient to handle the incoming JSON events.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs9qahwvl9egnrujtuwv5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs9qahwvl9egnrujtuwv5.png" alt=" " width="800" height="547"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Launching The Snowflake Sink Connector&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The final step is just confirming everything, and launching the connector.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Folb9cs9xglwlzjtbd6ja.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Folb9cs9xglwlzjtbd6ja.png" alt=" " width="800" height="474"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data arriving in the Snowflake warehouse&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When the Sink Connector pushes data into Snowflake, it creates a staging table with only two columns: &lt;u&gt;RECORD_METADATA&lt;/u&gt;, which holds Kafka details like topic, partition, offset, and timestamp, and &lt;u&gt;RECORD_CONTENT&lt;/u&gt;, which stores the full message as JSON.&lt;/li&gt;
&lt;li&gt;This makes sure all raw events are captured safely, but it’s not easy to query since everything sits inside one big JSON field. That's why we later use Streams and Tasks to transform this raw data into a clean, structured table with proper columns for analytics.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxebls8xlzsaab3yu3604.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxebls8xlzsaab3yu3604.png" alt=" " width="800" height="413"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Creating a new structured table&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now that the raw json is safely landing in Snowflake, the next step is to create a clean target table where our structured weather data will live. We will call this table &lt;u&gt;WEATHER_DATA_TABLE&lt;/u&gt;, and it will have properly defined columns like city, temperature, humidity, wind_speed, and others. &lt;/p&gt;

&lt;p&gt;This table will act as the final destination for all transformed data, making it much easier to query and analyse compared to the raw &lt;u&gt;RECORD_CONTENT&lt;/u&gt; json.&lt;/p&gt;

&lt;p&gt;Here is the SQL code for that:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- creating a new table WEATHER_DATA_TABLE
CREATE OR REPLACE TABLE WEATHER_DATA_TABLE (
 CITY          STRING,
LATITUDE      FLOAT,
LONGITUDE     FLOAT,
TEMPERATURE   FLOAT,
TEMP_MIN      FLOAT,
TEMP_MAX      FLOAT,
PRESSURE      INT,
HUMIDITY      INT,
WIND_SPEED    FLOAT,
CLOUDS        INT,
W_CONDITION   STRING,
DESCRIPTION   STRING,
TIME_STAMP    TIMESTAMP
);

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Inserting raw data from RECORD_CONTENT column into our new Structured table&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After creating &lt;u&gt;WEATHER_DATA_TABLE&lt;/u&gt;, the next step was to insert data from the raw table's &lt;u&gt;RECORD_CONTENT&lt;/u&gt; column. All the weather details like city, temperature, humidity, wind_speed, and others were stored as JSON inside &lt;u&gt;RECORD_CONTENT&lt;/u&gt;. Using Snowflake's JSON functions, we pulled out each field and placed it into the right column of WEATHER_DATA_TABLE. This turned the raw JSON in RECORD_CONTENT into a clean and structured table that is much easier to query and analyse.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Inserting data from RECORD_CONTENT into the new table
INSERT INTO WEATHER_DATA_TABLE
SELECT
RECORD_CONTENT:city::string,
RECORD_CONTENT:latitude::float,
RECORD_CONTENT:longitude::float,
RECORD_CONTENT:temperature::float,
RECORD_CONTENT:temp_min::float,
RECORD_CONTENT:temp_max::float,
RECORD_CONTENT:pressure::int,
RECORD_CONTENT:humidity::int,
RECORD_CONTENT:wind_speed::float,
RECORD_CONTENT:clouds::int,
RECORD_CONTENT:w_condition::string,
RECORD_CONTENT:description::string,
TO_TIMESTAMP(RECORD_CONTENT:time_stamp::int)
FROM TOPIC_WEATHER_STREAM;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
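For readers more comfortable in Python, the `::string` / `::float` / `::int` casts can be mirrored like this. This is a sketch with a made-up record showing only a few of the fields, not part of the actual pipeline.

```python
import json
from datetime import datetime, timezone

def flatten_record(record_content):
    """Python analogue of the INSERT ... SELECT: pull typed fields out of the
    RECORD_CONTENT JSON the way the ::string / ::float / ::int casts do."""
    r = json.loads(record_content) if isinstance(record_content, str) else record_content
    return {
        "CITY": str(r["city"]),
        "TEMPERATURE": float(r["temperature"]),
        "HUMIDITY": int(r["humidity"]),
        # TO_TIMESTAMP(epoch_seconds) equivalent:
        "TIME_STAMP": datetime.fromtimestamp(int(r["time_stamp"]), tz=timezone.utc),
    }

# Made-up record for illustration
raw = '{"city": "Nairobi", "temperature": 22.5, "humidity": 64, "time_stamp": 1759300000}'
row = flatten_record(raw)
print(row["CITY"], row["TIME_STAMP"].year)
```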



&lt;p&gt;&lt;strong&gt;Creating a stream that tracks new changes in the TOPIC_WEATHER_STREAM table&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To keep the clean table updated as new data arrives, we created a Snowflake Stream on the raw topic table. A stream works like a tracker: it records all the new rows or changes that come into the raw table. Instead of repeatedly scanning the whole table, we can just look at the stream to know what fresh data has arrived. This makes it easy to continuously insert only the latest weather records from &lt;u&gt;RECORD_CONTENT&lt;/u&gt; into &lt;u&gt;WEATHER_DATA_TABLE&lt;/u&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Creating A Snowflake stream tracks changes, new rows in a table.
CREATE OR REPLACE STREAM WEATHER_STREAM_CHANGES 
ON TABLE TOPIC_WEATHER_STREAM
APPEND_ONLY = TRUE;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
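The semantics of an APPEND_ONLY stream can be sketched in a few lines of Python: each read returns only the rows appended since the last consumption, tracked by an offset. This is an illustration of the idea only, not how Snowflake implements streams internally.

```python
class AppendOnlyStream:
    """Toy model of APPEND_ONLY stream semantics: reads return only rows
    appended since the last consumption."""

    def __init__(self, table):
        self.table = table            # the raw table (a list of rows here)
        self.offset = len(table)      # a new stream starts with no pending changes

    def read_new_rows(self):
        new_rows = self.table[self.offset:]
        self.offset = len(self.table)  # consuming the stream advances the offset
        return new_rows

table = [{"city": "Nairobi"}]
stream = AppendOnlyStream(table)
print(stream.read_new_rows())          # nothing new yet
table.append({"city": "Mombasa"})
print(stream.read_new_rows())          # only the newly appended row
```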



&lt;p&gt;&lt;strong&gt;Creating a Snowflake Task that automatically takes data from the stream and inserts it into our new table&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After setting up the stream, the next step was to create a Snowflake Task. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;A task is like a scheduler inside Snowflake: it can automatically run SQL statements at regular intervals. In our case, we used the task to read new data from the stream and insert it into &lt;u&gt;WEATHER_DATA_TABLE&lt;/u&gt;. This way, we don’t have to manually run the INSERT query every time fresh weather data arrives.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The task checks the stream, finds any new rows in &lt;u&gt;RECORD_CONTENT&lt;/u&gt;, and then copies the values into the correct columns of &lt;u&gt;WEATHER_DATA_TABLE&lt;/u&gt;. By combining the stream, which tracks changes, with the task, which automates inserts, our pipeline becomes fully automated. Every time Kafka pushes new weather data through Confluent into Snowflake, the clean table updates on its own.
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- creating a task that will run automatically  insert new rows into WEATHER_DATA_TABLE.

CREATE OR REPLACE TASK WEATHER_TASK
WAREHOUSE = COMPUTE_WH
SCHEDULE = '1 MINUTE'
AS
INSERT INTO WEATHER_DATA_TABLE
SELECT DISTINCT
RECORD_CONTENT:city::string,
RECORD_CONTENT:latitude::float,
RECORD_CONTENT:longitude::float,
RECORD_CONTENT:temperature::float,
RECORD_CONTENT:temp_min::float,
RECORD_CONTENT:temp_max::float,
RECORD_CONTENT:pressure::int,
RECORD_CONTENT:humidity::int,
RECORD_CONTENT:wind_speed::float,
RECORD_CONTENT:clouds::int,
RECORD_CONTENT:w_condition::string,
RECORD_CONTENT:description::string,
TO_TIMESTAMP(RECORD_CONTENT:time_stamp::int)
FROM WEATHER_STREAM_CHANGES;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Once the task was created, the final step was to enable it so it could start running on its own. By default, a new task in Snowflake is created in a suspended state, meaning it won’t execute until you turn it on. We enabled the task with a single SQL command, and from that moment, Snowflake automatically began checking the stream and inserting new weather data into &lt;u&gt;WEATHER_DATA_TABLE&lt;/u&gt; on schedule. This completed the automation, ensuring the clean table always stays up to date without any manual work.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;-- Enabling the task created above&lt;br&gt;
ALTER TASK WEATHER_TASK RESUME;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Snowflake full setup showing our tables, stream and task&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this screenshot, you can see the full setup in Snowflake: the raw topic table holding JSON in RECORD_CONTENT, the clean WEATHER_DATA_TABLE where structured data lives, the stream that tracks new records, and the task that automates the inserts. Together, these pieces form the end-to-end pipeline - from raw Kafka events landing in Snowflake, all the way to a clean, continuously updated table that’s ready for analysis.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft0d7huciin748ne7g4ho.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft0d7huciin748ne7g4ho.png" alt=" " width="310" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Querying directly from our WEATHER_DATA_TABLE&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With the pipeline complete, we can now query &lt;u&gt;WEATHER_DATA_TABLE&lt;/u&gt; just like any normal Snowflake table. Since the data is already structured into clean columns, queries are straightforward. For example, to see the latest weather updates, we can run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT *
FROM WEATHER_DATA_TABLE
ORDER BY TIME_STAMP DESC
LIMIT 110;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result showing our clean data in rows &amp;amp; columns&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this screenshot, we can see the results of querying &lt;u&gt;WEATHER_DATA_TABLE&lt;/u&gt;. Unlike the raw JSON format, the data is now well organised into proper columns: city, temperature, humidity, wind_speed, timestamp, and more. Everything is clean and easy to read, which makes analysis simple and efficient. &lt;/p&gt;

&lt;p&gt;This confirms that our pipeline successfully transformed the raw Kafka events into a structured Snowflake table ready for use.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgzbmacgq84gev2dmwe1x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgzbmacgq84gev2dmwe1x.png" alt=" " width="800" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Snowflake Sink connector running&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The last screenshot shows the Snowflake Sink Connector actively running in Confluent. You can see that messages are being processed and delivered to Snowflake in real time. This confirms that the pipeline is live, as soon as new weather data is produced in Kafka, it flows through Confluent and lands in Snowflake, where our stream and task keep the clean table continuously updated.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8u7zedzonoufnhx3ys7t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8u7zedzonoufnhx3ys7t.png" alt=" " width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;By combining Confluent Kafka, the Snowflake Sink Connector, and Snowflake's native features like Streams and Tasks, we built a fully automated data pipeline that transforms raw JSON events into clean, query-ready tables. This approach follows modern data engineering best practices: separating raw and curated layers, automating ingestion and transformation, and ensuring data freshness with minimal manual effort.&lt;/p&gt;

&lt;p&gt;With this pipeline in place, every new weather event flows seamlessly from Kafka → Confluent → Snowflake, and ends up in a structured table optimised for analytics. It is scalable, reliable, and production ready, the kind of design expected in a real-world data platform architecture.&lt;/p&gt;

&lt;p&gt;In my next article, I'll share the full Python scripts for producing, consuming, and automating this pipeline end-to-end, so you can replicate and extend it in your own projects.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>automation</category>
      <category>tutorial</category>
      <category>sql</category>
    </item>
    <item>
      <title>Apache Kafka Deep Dive: Core Concepts, Data Engineering Applications, and Real-World Production Practices</title>
      <dc:creator>Kepha Mwandiki</dc:creator>
      <pubDate>Tue, 23 Sep 2025 14:48:53 +0000</pubDate>
      <link>https://dev.to/kepha_mwandiki/apache-kafka-deep-dive-core-concepts-data-engineering-applications-and-real-world-production-1c99</link>
      <guid>https://dev.to/kepha_mwandiki/apache-kafka-deep-dive-core-concepts-data-engineering-applications-and-real-world-production-1c99</guid>
      <description>&lt;p&gt;Apache Kafka is an open-source distributed event streaming platform.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;What does this mean?&lt;/u&gt; - Kafka combines three key capabilities so you can implement your use cases for event streaming end-to-end with a single solution:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;To publish (write) and subscribe to (read) streams of events, including continuous import/export of your data from other systems.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;To store streams of events durably and reliably for as long as you want.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;To process streams of events as they occur.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;&lt;u&gt;How Does Kafka Work?&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Kafka is a distributed system consisting of servers and clients that communicate via a high-performance TCP network protocol.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Server/Broker&lt;/u&gt; - Servers run the Kafka software and are responsible for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Receiving messages from producers&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Storing them in topics &amp;amp; partitions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Serving them to consumers when requested&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each broker can handle thousands of partitions and millions of messages per second.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;i. Kafka clients&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;They allow you to write distributed applications and services that read, write, and process streams of events in parallel, at scale, and in a fault-tolerant manner even in the case of network problems or machine failures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ii. Producer clients&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Apps that send data into Kafka topics.&lt;/p&gt;

&lt;p&gt;Example: My weather script → sends weather JSON to Kafka.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;iii. Consumer clients&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Apps that read data from Kafka topics.&lt;/p&gt;

&lt;p&gt;Example: Snowflake loader → consumes weather data and inserts into a table.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;iv. Admin clients&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Used to manage Kafka: create topics, configure partitions, check clusters, etc.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. &lt;u&gt;Apache Kafka Core Concepts.&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fno2kz3hbc2oxmjmm37z3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fno2kz3hbc2oxmjmm37z3.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;u&gt;2.1 Producer&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An application that sends messages (records/events) into Kafka.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from kafka import KafkaProducer
import json
import time

# Create a KafkaProducer instance
# Using a JSON serializer for demonstration purposes
producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

# Define the topic to send messages to
topic_name = 'My_Kafka_Article'

# Send a few messages
for i in range(3):
    message_data = {"id": i, "message": f"This is message number {i}"}
    print(f"Sending message: {message_data}")
    producer.send(topic_name, value=message_data)
    time.sleep(1) # Simulate some delay

# Ensure all messages are sent
producer.flush()

# Close the producer
producer.close()

print("Kafka Producer finished sending messages.")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;u&gt;2.2 Consumer&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An application that reads messages from Kafka.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from kafka import KafkaConsumer
import json

# 1. Create a Kafka Consumer Instance
consumer = KafkaConsumer(
   'My_Kafka_Article',  
    bootstrap_servers='localhost:9092', 
    # Deserialize JSON data
    value_deserializer=lambda x: json.loads(x.decode('utf-8')) 
)

# 2. Poll for Messages and Process Them
for message in consumer:
    print(f"Received message: {message.value}")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;u&gt;2.3 Topic&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A category or channel in Kafka where data is stored.&lt;br&gt;
Below is my example of some topics.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5fstrrg15ldkknmdzune.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5fstrrg15ldkknmdzune.png" alt=" " width="800" height="381"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;u&gt;2.4 Partition&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Topics are partitioned, meaning a topic is spread over a number of "buckets" located on different Kafka brokers. This distributed placement of your data is very important for scalability because it allows client applications to both read and write the data from/to many brokers at the same time.&lt;/p&gt;
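&lt;p&gt;The key-to-partition mapping can be sketched in plain Python: hash the record key and take it modulo the partition count. (Kafka's default partitioner actually uses murmur2; zlib.crc32 stands in here purely for illustration.)&lt;/p&gt;

```python
# Sketch: how a keyed record deterministically maps to a partition.
# Kafka's default partitioner uses murmur2; zlib.crc32 is a stand-in.
import zlib

NUM_PARTITIONS = 3

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Return the partition a record with this key would land in."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# Records with the same key always go to the same partition,
# which is what preserves per-key ordering.
assert partition_for("user-42") == partition_for("user-42")
print(partition_for("user-42"), partition_for("user-7"))
```

&lt;p&gt;Because reads and writes are spread across partitions on different brokers, adding brokers (and partitions) is how Kafka scales horizontally.&lt;/p&gt;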

&lt;p&gt;&lt;strong&gt;&lt;u&gt;2.5 Broker&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A Kafka server that stores data and serves producers/consumers.&lt;/p&gt;

&lt;p&gt;Many brokers form a Kafka cluster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;u&gt;2.6 Cluster&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A cluster is a group of brokers working together; Kafka distributes topics and partitions across them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;u&gt;3. Data Engineering Applications for Kafka&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;u&gt;3.1 Real-Time Data Ingestion&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the process of capturing real-time data from various sources and streaming it into a warehouse, a lake, or a streaming platform.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;u&gt;3.2 Log Aggregation&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Log aggregation collects physical log files off servers and puts them in a central place (a file server or HDFS perhaps) for processing. Kafka abstracts away the details of files and gives a cleaner abstraction of log or event data as a stream of messages. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3.2.1 Log Aggregation Pipeline with Kafka&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3.2.1.1 Log Sources - Producers:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Applications &lt;/li&gt;
&lt;li&gt;Web servers (Apache)&lt;/li&gt;
&lt;li&gt;System logs (via Fluentd, Filebeat, syslog → Kafka)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3.2.1.2 Kafka Topics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Logs are written into topics like app_logs, error_logs, access_logs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3.2.1.3 Consumers:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HDFS/S3 → long-term storage.&lt;/li&gt;
&lt;li&gt;Monitoring tools → Grafana.&lt;/li&gt;
&lt;/ul&gt;
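&lt;p&gt;The routing step of such a pipeline can be sketched with the standard library alone. The topic names match the examples above; the record fields and routing rules are illustrative stand-ins for what a shipper like Fluentd or Filebeat would configure:&lt;/p&gt;

```python
# Sketch (stdlib only): routing log records into per-type "topics",
# mimicking how log shippers fan records into Kafka topics.
from collections import defaultdict

topics = defaultdict(list)  # topic name -> records (in-memory stand-in for Kafka)

def route(record: dict) -> str:
    """Pick a topic for a log record, the way a shipper rule might."""
    if record.get("level") == "ERROR":
        topic = "error_logs"
    elif record.get("source") == "access":
        topic = "access_logs"
    else:
        topic = "app_logs"
    topics[topic].append(record)
    return topic

route({"level": "ERROR", "msg": "disk full"})
route({"source": "access", "msg": "GET /index"})
route({"level": "INFO", "msg": "service started"})
print({t: len(records) for t, records in topics.items()})
```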

&lt;p&gt;&lt;strong&gt;&lt;u&gt;3.3 Website Activity Tracking&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The original use case for Kafka was to be able to rebuild a user activity tracking pipeline as a set of real-time publish-subscribe feeds. This means site activity is published to central topics with one topic per activity type. These feeds are available for subscription for a range of use cases including real-time processing, real-time monitoring, and loading into Hadoop or offline data warehousing systems for offline processing and reporting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;u&gt;3.4 Stream processing&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Stream processing means working with continuous flows of data (streams) in real time.&lt;/p&gt;

&lt;p&gt;For example, instead of analyzing yesterday’s sales, you process every transaction as it happens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3.4.1 Typical Flow of stream processing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Producers&lt;/u&gt; → send data into Kafka (clicks, IoT, transactions).&lt;br&gt;
&lt;u&gt;Kafka Topics&lt;/u&gt; → receive and store the messages.&lt;br&gt;
&lt;u&gt;Stream Processor&lt;/u&gt; (e.g. Kafka Streams) → processes data in real time.&lt;/p&gt;
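&lt;p&gt;The processing step can be sketched in plain Python as a tumbling-window count, the kind of aggregation a stream processor like Kafka Streams performs on events as they arrive (the event data here is made up):&lt;/p&gt;

```python
# Sketch: counting events per key in 5-minute tumbling windows,
# a basic stream-processing aggregation. Timestamps are epoch seconds.
from collections import Counter

WINDOW = 300  # 5-minute tumbling windows

def window_counts(events):
    """Count events per (key, window-start) bucket."""
    counts = Counter()
    for ts, key in events:
        window_start = ts - (ts % WINDOW)
        counts[(key, window_start)] += 1
    return counts

events = [(0, "click"), (120, "click"), (301, "click"), (302, "buy")]
print(window_counts(events))
# The first two clicks fall in the window starting at 0;
# the third click and the buy fall in the window starting at 300.
```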

&lt;p&gt;Below is a picture of real-time data querying using Confluent, a Kafka streaming platform:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F13s1u7samuejjl4zc567.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F13s1u7samuejjl4zc567.png" alt=" " width="800" height="260"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4 Real-World Production Practices Using Kafka&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;u&gt;&lt;strong&gt;4.1 Sports&lt;/strong&gt;&lt;/u&gt;&lt;/p&gt;

&lt;p&gt;Millions of fans worldwide need real-time updates on scores, player stats, and events.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;How Kafka helps:&lt;/u&gt;&lt;/p&gt;

&lt;p&gt;Producers → stadium sensors, referee systems (VAR), commentary feeds.&lt;/p&gt;

&lt;p&gt;Kafka Topics → scores, player_stats, tracking_data.&lt;/p&gt;

&lt;p&gt;Consumers → mobile apps (e.g. LiveScore, FotMob) get instant updates.&lt;/p&gt;

&lt;p&gt;Stream processors → aggregate, filter, and push alerts, e.g. notifying fans that a goal has been scored.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;u&gt;4.2 Banking &amp;amp; Finance&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Fraud needs to be caught in seconds, and payments must be processed quickly with no duplicates.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;How Kafka helps:&lt;/u&gt;&lt;/p&gt;

&lt;p&gt;Producers → ATMs, mobile banking apps.&lt;/p&gt;

&lt;p&gt;Kafka Topics → transactions, fraud_alerts.&lt;/p&gt;

&lt;p&gt;Stream processors → Aggregate transactions per user in a 5-minute window.&lt;/p&gt;

&lt;p&gt;Flag anomalies, e.g. too many withdrawals within 5 minutes.&lt;/p&gt;

&lt;p&gt;Consumers → fraud detection systems, real-time dashboards, data warehouse for historical data.&lt;/p&gt;
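&lt;p&gt;The 5-minute anomaly rule above can be sketched in plain Python; the threshold value and the timestamps are illustrative:&lt;/p&gt;

```python
# Sketch: flag a user whose withdrawals in any 5-minute window exceed
# a threshold - a simplified version of the fraud rule described above.
from collections import deque

WINDOW_SECONDS = 300
MAX_WITHDRAWALS = 3  # illustrative threshold

def make_detector():
    recent = deque()  # timestamps of this user's recent withdrawals
    def check(ts: int) -> bool:
        """Record a withdrawal at `ts`; return True if it should be flagged."""
        recent.append(ts)
        # Drop withdrawals that fell out of the 5-minute window.
        while recent and ts - recent[0] > WINDOW_SECONDS:
            recent.popleft()
        return len(recent) > MAX_WITHDRAWALS
    return check

check = make_detector()
flags = [check(t) for t in (0, 60, 120, 180, 1000)]
print(flags)  # the 4th withdrawal within 5 minutes trips the rule
```

&lt;p&gt;In a real deployment this logic would live in a stream processor keyed by user ID, with flagged events published to a fraud_alerts topic.&lt;/p&gt;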

&lt;p&gt;&lt;u&gt;&lt;strong&gt;4.3 Healthcare&lt;/strong&gt;&lt;/u&gt;&lt;/p&gt;

&lt;p&gt;Patient vitals, e.g. oxygen saturation and heart rate, must be monitored continuously to ensure good health and real-time follow-up.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;How Kafka helps:&lt;/u&gt;&lt;/p&gt;

&lt;p&gt;Producers → IoT devices on patients, e.g. wearables and hospital monitors.&lt;/p&gt;

&lt;p&gt;Kafka Topics → patient_vitals, alerts.&lt;/p&gt;

&lt;p&gt;Stream processors → check thresholds, e.g. heart rate greater than 180 bpm, and trigger emergency alerts instantly.&lt;/p&gt;

&lt;p&gt;Consumers → doctor dashboards, alert systems, patient history databases.&lt;/p&gt;
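&lt;p&gt;The threshold check can be sketched in plain Python. The 180 bpm limit follows the example above; the field names are illustrative:&lt;/p&gt;

```python
# Sketch: the threshold rule a stream processor might apply to a
# patient_vitals event, emitting an alert when heart rate is too high.
HEART_RATE_LIMIT = 180  # bpm, per the example above

def triage(vitals: dict) -> list:
    """Return alert messages for any out-of-range vitals in the event."""
    alerts = []
    if vitals.get("heart_rate", 0) > HEART_RATE_LIMIT:
        alerts.append(f"ALERT patient={vitals['patient_id']} heart_rate={vitals['heart_rate']}")
    return alerts

print(triage({"patient_id": "p1", "heart_rate": 190}))
print(triage({"patient_id": "p2", "heart_rate": 72}))  # no alerts
```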

</description>
      <category>architecture</category>
      <category>dataengineering</category>
      <category>kafka</category>
    </item>
    <item>
      <title>APACHE AIRFLOW AND ITS IMPORTANCE IN DATA ENGINEERING</title>
      <dc:creator>Kepha Mwandiki</dc:creator>
      <pubDate>Sun, 07 Sep 2025 13:37:43 +0000</pubDate>
      <link>https://dev.to/kepha_mwandiki/apache-airflow-and-its-importance-in-data-engineering-jcl</link>
      <guid>https://dev.to/kepha_mwandiki/apache-airflow-and-its-importance-in-data-engineering-jcl</guid>
      <description>&lt;p&gt;&lt;strong&gt;Apache Airflow&lt;/strong&gt; - This is a tool used in workflow orchestration,  the automated coordination and management of data workflows.&lt;br&gt;
Airflow is important in data engineering since it provides a way to orchestrate, schedule and monitor workflows/pipelines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why use Airflow&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scalability and Flexibility&lt;/strong&gt; - Airflow supports workflows ranging from small scripts to large pipelines processing very large volumes of data.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Airflow integrates with many systems: databases, cloud storage, Snowflake, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Scheduling&lt;/strong&gt; - Airflow has a built-in scheduler to run tasks at specific intervals, and also, it automates repetitive tasks, reducing manual intervention.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitoring&lt;/strong&gt; - Airflow provides an interface to track task execution, progress, successes and failures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extensibility&lt;/strong&gt; - Airflow provides plugins and extensions to connect with various systems, e.g. APIs, AWS, Azure, etc.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Error handling&lt;/strong&gt; - Airflow makes error handling automated, flexible, and visible. Instead of constantly monitoring, you can set retries and failure alerts so problems are handled automatically.&lt;/p&gt;
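&lt;p&gt;The retry idea Airflow automates (via task arguments such as retries and retry_delay) can be sketched in plain Python; the flaky task here is purely illustrative:&lt;/p&gt;

```python
# Sketch: rerun a failing task a fixed number of times before giving up,
# the behaviour Airflow provides via the `retries` task argument.
def run_with_retries(task, retries: int = 2):
    """Call task(), retrying up to `retries` extra times on failure."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return task(), attempts
        except Exception:
            if attempts > retries:
                raise  # out of retries: surface the failure

calls = {"n": 0}
def flaky():
    """A task that fails twice, then succeeds (illustrative)."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

print(run_with_retries(flaky))  # ('ok', 3)
```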

&lt;h2&gt;
  Screenshot Documentation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Airflow UI header with "Apache Airflow" logo&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcde8oe7b9xpfdf39g5i5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcde8oe7b9xpfdf39g5i5.png" alt=" " width="800" height="149"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DAGs list showing example DAGs&lt;/strong&gt;&lt;br&gt;
In Airflow, a Directed Acyclic Graph (DAG) is a defined set of instructions that tells Airflow what tasks to run and in what order.&lt;br&gt;
A photo example of a DAG in Airflow:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1nmp3ig2qnxmpqtclqc3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1nmp3ig2qnxmpqtclqc3.png" alt=" " width="800" height="364"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjk5p0y7hyhxwcy7uo2sh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjk5p0y7hyhxwcy7uo2sh.png" alt=" " width="800" height="302"&gt;&lt;/a&gt;&lt;/p&gt;
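&lt;p&gt;The "what order" part of a DAG can be sketched with the standard library alone: given tasks and their dependencies, a topological sort yields a valid execution order, which is exactly what a DAG guarantees (the task names here are hypothetical):&lt;/p&gt;

```python
# Sketch (stdlib only): the execution-order idea behind an Airflow DAG.
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must run before it:
# extract before transform, transform before load.
dag = {"transform": {"extract"}, "load": {"transform"}}

order = list(TopologicalSorter(dag).static_order())
print(order)  # ['extract', 'transform', 'load']
```

&lt;p&gt;The "acyclic" requirement exists because a cycle (task A depends on B, B depends on A) would make such an order impossible.&lt;/p&gt;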

&lt;p&gt;&lt;strong&gt;Airflow Scheduler&lt;/strong&gt;&lt;br&gt;
It is the Airflow component responsible for deciding when and which tasks should run.&lt;br&gt;
The scheduler is responsible for triggering DAG runs and managing how many runs execute at the specified times.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F05hy2vb4g4byld67d1v8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F05hy2vb4g4byld67d1v8.png" alt=" " width="800" height="430"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Airflow Webserver&lt;/strong&gt;&lt;br&gt;
The Airflow Webserver is the component that provides the graphical user interface (GUI) for Airflow; it is the part you interact with in your browser to view, monitor, and manage your DAGs and tasks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh100rmk9di7hckv0nl15.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh100rmk9di7hckv0nl15.png" alt=" " width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Below is an image showing both the Webserver and the Scheduler running:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl9u1mpt59c4j5on5b3hj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl9u1mpt59c4j5on5b3hj.png" alt=" " width="800" height="423"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Below is an example of the DAG running in my browser, clearly showing the tasks, the first run, the most recent run, the success/failure of the runs, and how the tasks are scheduled.&lt;/p&gt;

&lt;p&gt;The DAG is running on my localhost:8080&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fanvsmu8u5oxykjaz56sm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fanvsmu8u5oxykjaz56sm.png" alt=" " width="800" height="415"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Installing and setting up PostgreSQL on a Linux server</title>
      <dc:creator>Kepha Mwandiki</dc:creator>
      <pubDate>Sun, 03 Aug 2025 21:59:09 +0000</pubDate>
      <link>https://dev.to/kepha_mwandiki/installing-and-setting-up-postgresql-on-a-linux-server-2k1e</link>
      <guid>https://dev.to/kepha_mwandiki/installing-and-setting-up-postgresql-on-a-linux-server-2k1e</guid>
      <description>&lt;p&gt;&lt;strong&gt;Step 1: Updating the system&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You should update your server packages to ensure all existing packages are up to date. This can be done by running the commands:&lt;br&gt;
sudo apt update&lt;br&gt;
sudo apt upgrade -y&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Installing PostgreSQL&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The following command installs PostgreSQL and its companion packages:&lt;br&gt;
sudo apt install postgresql postgresql-contrib&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Verifying the installation.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The PostgreSQL service should start automatically after installation completes. The following command verifies it is running:&lt;/p&gt;

&lt;p&gt;sudo systemctl status postgresql&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Accessing the PostgreSQL shell&lt;/strong&gt;&lt;br&gt;
PostgreSQL creates a Linux user ‘postgres’ by default. To switch to it and open the shell:&lt;/p&gt;

&lt;p&gt;sudo -i -u postgres&lt;br&gt;
psql&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5: Creating a new user and Database&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To create a new user, run the following in psql:&lt;br&gt;
CREATE USER user1 WITH PASSWORD 'password';&lt;br&gt;
Then create a database owned by that user:&lt;br&gt;
CREATE DATABASE mydatabase WITH OWNER user1;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 6: Secure your installation with a firewall&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Enable a firewall such as ufw on Ubuntu to restrict external access. If you need remote connections, open the default PostgreSQL port, 5432:&lt;/p&gt;

&lt;p&gt;sudo ufw allow 5432/tcp&lt;br&gt;
sudo ufw reload&lt;/p&gt;
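&lt;p&gt;Note that opening the firewall port is not enough on its own for remote clients: PostgreSQL listens only on localhost by default. Below is a sketch of the two settings typically adjusted. The paths shown are the usual Debian/Ubuntu layout (replace &amp;lt;version&amp;gt; with your installed version), and the address range and auth method are illustrative:&lt;/p&gt;

```
# /etc/postgresql/<version>/main/postgresql.conf
listen_addresses = '*'          # or a specific server address

# /etc/postgresql/<version>/main/pg_hba.conf
# TYPE  DATABASE    USER   ADDRESS          METHOD
host    mydatabase  user1  192.0.2.0/24     scram-sha-256
```

&lt;p&gt;Restart the service afterwards with sudo systemctl restart postgresql for the changes to take effect.&lt;/p&gt;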

&lt;p&gt;&lt;strong&gt;End.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You have now successfully installed and set up PostgreSQL on your Linux server: you have created a user and a database, and secured the connection.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
