<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Fred Munjogu</title>
    <description>The latest articles on DEV Community by Fred Munjogu (@fredmunjogu).</description>
    <link>https://dev.to/fredmunjogu</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3375916%2F6d92ba25-79c1-4df4-b1fc-b862292cc0ad.png</url>
      <title>DEV Community: Fred Munjogu</title>
      <link>https://dev.to/fredmunjogu</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/fredmunjogu"/>
    <language>en</language>
    <item>
      <title>End-to-End Data Workflow: Kestra, Redshift, and dbt Integration</title>
      <dc:creator>Fred Munjogu</dc:creator>
      <pubDate>Wed, 29 Oct 2025 17:23:11 +0000</pubDate>
      <link>https://dev.to/fredmunjogu/end-to-end-data-workflow-kestra-redshift-and-dbt-integration-3ani</link>
      <guid>https://dev.to/fredmunjogu/end-to-end-data-workflow-kestra-redshift-and-dbt-integration-3ani</guid>
      <description>&lt;p&gt;Imagine that at the end of every month, you are required to download data from a particular source, load it into storage, and transfer it to a data warehouse. After the first time, you will quickly realize how repetitive these tasks can be. Since you will be moving similar data each month, it would be much easier if there were a way to automate this. Lucky for us, there are several tools at our disposal. In this article, we will focus on the following tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kestra&lt;/li&gt;
&lt;li&gt;dbt&lt;/li&gt;
&lt;li&gt;S3 Bucket&lt;/li&gt;
&lt;li&gt;Amazon Redshift&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We will be using NYC taxi trip data in this project (yellow and green taxis).&lt;/p&gt;

&lt;p&gt;If you want to follow along, you can create an AWS account; first-time users get $200 in credit, which will be more than enough. Redshift also offers $300 in credit for first-time users, which comes in handy since queries can become expensive if you are not careful or are working with large amounts of data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Kestra
&lt;/h3&gt;

&lt;p&gt;Kestra is an orchestration tool that we will leverage to automate most of the repetitive tasks. It is easy to use because each flow we write is defined in a &lt;code&gt;yaml&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;We will have two flows. The first will ingest the data, i.e., download it, upload it to our S3 bucket, and then load the schemas and tables into our data warehouse (Redshift). The second will run our dbt transformations against the data once it is in Redshift.&lt;/p&gt;
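&lt;p&gt;For orientation, here is a minimal sketch of what a Kestra flow looks like. This is purely illustrative (it is not the actual flow from the repo), and the plugin type names follow Kestra's documentation, so they may vary slightly between versions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;id: ingest_taxi_data        # illustrative flow name
namespace: nyc.taxi         # illustrative namespace

tasks:
  - id: download
    type: io.kestra.plugin.scripts.shell.Commands
    commands:
      - echo "download the monthly CSV here"

triggers:
  - id: monthly
    type: io.kestra.plugin.core.trigger.Schedule
    cron: "0 9 1 * *"       # run on the 1st of every month
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;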

&lt;h3&gt;
  
  
  Data Build Tool (DBT)
&lt;/h3&gt;

&lt;p&gt;After our data is in the warehouse, we will use dbt to clean, aggregate, and create analytics-ready tables. This is made possible through dbt models.&lt;/p&gt;
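&lt;p&gt;A dbt model is essentially a &lt;code&gt;SELECT&lt;/code&gt; statement saved as a SQL file, which dbt materializes as a table or view in the warehouse. Here is a minimal, hypothetical example; the file name, source, and column selection are illustrative and not the repo's actual models:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- models/staging/stg_yellow_trips.sql (hypothetical model)
select
    unique_row_id,
    tpep_pickup_datetime  as pickup_datetime,
    tpep_dropoff_datetime as dropoff_datetime,
    trip_distance,
    total_amount
from {{ source('public', 'yellow_taxi') }}  -- assumes a source declared in a schema.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;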

&lt;h3&gt;
  
  
  AWS Resources
&lt;/h3&gt;

&lt;p&gt;As I have mentioned before, we will be utilizing some AWS resources, and to set them up, we will be using Terraform.&lt;/p&gt;

&lt;p&gt;Terraform allows us to write a configuration that creates the resources with three simple commands (&lt;code&gt;terraform init&lt;/code&gt;, &lt;code&gt;terraform plan&lt;/code&gt;, and &lt;code&gt;terraform apply&lt;/code&gt;). This article will not focus much on Terraform itself, but I will mention the steps needed to get our resources up and running.&lt;/p&gt;

&lt;h2&gt;
  
  
  Configuring Kestra
&lt;/h2&gt;

&lt;p&gt;This will be our first step since Kestra contains the flows needed to run our pipeline. We will first start a Kestra instance with Docker (a quick-start command is shown below). Once it is running, we will navigate to the Kestra UI and begin by adding environment variables to the KV store, which is found in the namespace tab.&lt;/p&gt;
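&lt;p&gt;If you have not run Kestra before, its documented quick-start is a single Docker command along these lines (flags vary by Kestra version, and the project may use Docker Compose instead):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run --rm -it -p 8080:8080 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  kestra/kestra:latest server local
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;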

&lt;p&gt;Inside the KV store, we will add the following environment variables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;S3_BUCKET_NAME: the name of our S3 bucket.&lt;/li&gt;
&lt;li&gt;AWS_REGION: our AWS region&lt;/li&gt;
&lt;li&gt;JDBC_URL: string connection to our database in Redshift&lt;/li&gt;
&lt;li&gt;DB_USER: database username&lt;/li&gt;
&lt;li&gt;DB_PASS: database password&lt;/li&gt;
&lt;li&gt;KESTRA_ROLE: IAM role that allows Kestra to copy data from S3 to Redshift&lt;/li&gt;
&lt;li&gt;REDSHIFT_USER: our Redshift user configured in Terraform&lt;/li&gt;
&lt;li&gt;REDSHIFT_PASSWORD: our Redshift password&lt;/li&gt;
&lt;li&gt;REDSHIFT_HOST: our workgroup host&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With this, we are all set up to run our first flow. All the values stored in the KV store become available to our flows when we reference them using the &lt;code&gt;"{{ kv('var_name') }}"&lt;/code&gt; expression.&lt;/p&gt;

&lt;p&gt;Here is a snippet of our KV store:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpplvrr0sqv5rvqq1gzin.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpplvrr0sqv5rvqq1gzin.png" alt="Kestra's KV store" width="800" height="490"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Running our first flow
&lt;/h3&gt;

&lt;p&gt;The entire project can be found &lt;a href="https://github.com/Munjogu123/taxi-nyc-dbt" rel="noopener noreferrer"&gt;here&lt;/a&gt;. We will run the flow named &lt;code&gt;redshift_taxi_scheduled.yaml&lt;/code&gt;, located in the &lt;code&gt;kestra/flows&lt;/code&gt; directory. This flow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Downloads Yellow/Green Taxi Data (depending on the taxi you choose)&lt;/li&gt;
&lt;li&gt;Uploads to S3&lt;/li&gt;
&lt;li&gt;Loads to Redshift&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To run this flow, we will navigate to the triggers tab, where we will execute a backfill. The triggers tab looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhglpxw7y7efe6xbrd541.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhglpxw7y7efe6xbrd541.png" alt="Triggers tab" width="800" height="490"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We will click on the backfill execution for the green schedule, which will download taxi data for the green cabs. We will be asked to choose the date range for which data will be downloaded; for this article, we will use January 1, 2021, through June 30, 2021.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqjs3laoz2dfi4u2zi91p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqjs3laoz2dfi4u2zi91p.png" alt="Backfill Execution" width="800" height="490"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We will also add an execution label (backfill: true) so that we know this is a backfill when it runs. After doing this, we can execute our flow, and it will begin downloading and uploading the green taxi data.&lt;/p&gt;

&lt;p&gt;Kestra has a neat feature: during an execution, you can view a Gantt chart showing each task as it runs and its status (success or failure).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fswg0ek5kz5h1snhl7z98.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fswg0ek5kz5h1snhl7z98.png" alt="Green taxi execution" width="800" height="497"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since we are doing it for six months, our flow will run six times. If we look at our executions tab at the end of the six runs, we will see that all six runs were successful, and next to them are the labels of the files downloaded.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkforsrz5vpo1hkdd5mei.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkforsrz5vpo1hkdd5mei.png" alt="Executions" width="800" height="349"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see that our flows executed successfully. Now, we can check to see if our S3 bucket contains any data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foxgdvi5oehpeeuvx54at.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foxgdvi5oehpeeuvx54at.png" alt="S3 bucket" width="800" height="355"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Inside our S3 bucket, we can see that our files have been uploaded and named similarly to the labels we saw in Kestra.&lt;/p&gt;

&lt;p&gt;We will do the same for the yellow taxi data. By the end of these executions, we will have twelve objects in our S3 bucket (6 for green taxis and the other 6 for the yellow taxis). &lt;/p&gt;

&lt;p&gt;We can now confirm whether our tables have been loaded into Redshift. Before we do, let me explain a section of the flow. Inside the flow we have been running, there is a segment responsible for loading data into Redshift.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;yellow_table_create&lt;/span&gt;
        &lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;io.kestra.plugin.jdbc.redshift.Query&lt;/span&gt;
        &lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{kv('JDBC_URL')}}"&lt;/span&gt;
        &lt;span class="na"&gt;username&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{kv('DB_USER')}}"&lt;/span&gt;
        &lt;span class="na"&gt;password&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{kv('DB_PASS')}}"&lt;/span&gt;
        &lt;span class="na"&gt;sql&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;CREATE TABLE IF NOT EXISTS {{render(vars.table)}} (&lt;/span&gt;
              &lt;span class="s"&gt;unique_row_id          text,&lt;/span&gt;
              &lt;span class="s"&gt;filename               text,&lt;/span&gt;
              &lt;span class="s"&gt;VendorID               text,&lt;/span&gt;
              &lt;span class="s"&gt;tpep_pickup_datetime   timestamp,&lt;/span&gt;
              &lt;span class="s"&gt;tpep_dropoff_datetime  timestamp,&lt;/span&gt;
              &lt;span class="s"&gt;passenger_count        integer,&lt;/span&gt;
              &lt;span class="s"&gt;trip_distance          double precision,&lt;/span&gt;
              &lt;span class="s"&gt;RatecodeID             text,&lt;/span&gt;
              &lt;span class="s"&gt;store_and_fwd_flag     text,&lt;/span&gt;
              &lt;span class="s"&gt;PULocationID           text,&lt;/span&gt;
              &lt;span class="s"&gt;DOLocationID           text,&lt;/span&gt;
              &lt;span class="s"&gt;payment_type           integer,&lt;/span&gt;
              &lt;span class="s"&gt;fare_amount            double precision,&lt;/span&gt;
              &lt;span class="s"&gt;extra                  double precision,&lt;/span&gt;
              &lt;span class="s"&gt;mta_tax                double precision,&lt;/span&gt;
              &lt;span class="s"&gt;tip_amount             double precision,&lt;/span&gt;
              &lt;span class="s"&gt;tolls_amount           double precision,&lt;/span&gt;
              &lt;span class="s"&gt;improvement_surcharge  double precision,&lt;/span&gt;
              &lt;span class="s"&gt;total_amount           double precision,&lt;/span&gt;
              &lt;span class="s"&gt;congestion_surcharge   double precision&lt;/span&gt;
          &lt;span class="s"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates our main table, and based on the columns, it creates the yellow taxi main table in Redshift. The next step is creating the staging table, which will be used to deduplicate data before loading it to the main table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;yellow_create_staging_table&lt;/span&gt;
        &lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;io.kestra.plugin.jdbc.redshift.Query&lt;/span&gt;
        &lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{kv('JDBC_URL')}}"&lt;/span&gt;
        &lt;span class="na"&gt;username&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{kv('DB_USER')}}"&lt;/span&gt;
        &lt;span class="na"&gt;password&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{kv('DB_PASS')}}"&lt;/span&gt;
        &lt;span class="na"&gt;sql&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;CREATE TABLE IF NOT EXISTS {{render(vars.staging_table)}} (LIKE {{render(vars.table)}});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The staging table has the same structure as the main table. The next step is to truncate it. On the first run this has no effect, since the table is empty, but in subsequent runs it matters. We will come back to this shortly.&lt;/p&gt;
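&lt;p&gt;For reference, the truncate step is a task with the same shape as the others in this flow (the exact task id in the repo may differ):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;- id: yellow_truncate_staging_table
  type: io.kestra.plugin.jdbc.redshift.Query
  url: "{{kv('JDBC_URL')}}"
  username: "{{kv('DB_USER')}}"
  password: "{{kv('DB_PASS')}}"
  sql: |
    TRUNCATE TABLE {{render(vars.staging_table)}};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;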

&lt;p&gt;After our staging table is created and then truncated, we move to the next step, where we copy data into the staging table from the S3 files we uploaded earlier. This is done by this task:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt; &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;yellow_copy_into_staging_table&lt;/span&gt;
        &lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;io.kestra.plugin.jdbc.redshift.Query&lt;/span&gt;
        &lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{kv('JDBC_URL')}}"&lt;/span&gt;
        &lt;span class="na"&gt;username&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{kv('DB_USER')}}"&lt;/span&gt;
        &lt;span class="na"&gt;password&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{kv('DB_PASS')}}"&lt;/span&gt;
        &lt;span class="na"&gt;sql&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;COPY {{render(vars.staging_table)}} (&lt;/span&gt;
            &lt;span class="s"&gt;VendorID, tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count,&lt;/span&gt;
            &lt;span class="s"&gt;trip_distance, RatecodeID, store_and_fwd_flag, PULocationID, DOLocationID,&lt;/span&gt;
            &lt;span class="s"&gt;payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount,&lt;/span&gt;
            &lt;span class="s"&gt;improvement_surcharge, total_amount, congestion_surcharge&lt;/span&gt;
          &lt;span class="s"&gt;)&lt;/span&gt;
          &lt;span class="s"&gt;FROM 's3://{{kv("S3_BUCKET_NAME")}}/{{render(vars.file)}}'&lt;/span&gt;
          &lt;span class="s"&gt;IAM_ROLE '{{kv("KESTRA_ROLE")}}'&lt;/span&gt;
          &lt;span class="s"&gt;FORMAT AS CSV&lt;/span&gt;
          &lt;span class="s"&gt;IGNOREHEADER 1;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We specify the columns here because the tables we created have two additional columns that are not present in the CSV files we are copying from. Those two columns exist for identification and deduplication: the &lt;code&gt;unique_row_id&lt;/code&gt; column will hold a unique value per row, and it is what we will use to ensure there is no duplicate data in our main table.&lt;/p&gt;

&lt;p&gt;We then populate the unique IDs and the filename. This is done using the following task:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt; &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;yellow_add_unique_id_and_filename&lt;/span&gt;
        &lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;io.kestra.plugin.jdbc.redshift.Query&lt;/span&gt;
        &lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{kv('JDBC_URL')}}"&lt;/span&gt;
        &lt;span class="na"&gt;username&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{kv('DB_USER')}}"&lt;/span&gt;
        &lt;span class="na"&gt;password&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{kv('DB_PASS')}}"&lt;/span&gt;
        &lt;span class="na"&gt;sql&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;UPDATE {{render(vars.staging_table)}}&lt;/span&gt;
          &lt;span class="s"&gt;SET &lt;/span&gt;
            &lt;span class="s"&gt;unique_row_id = md5(&lt;/span&gt;
              &lt;span class="s"&gt;COALESCE(VendorID, '') ||&lt;/span&gt;
              &lt;span class="s"&gt;COALESCE(CAST(tpep_pickup_datetime AS varchar), '') ||&lt;/span&gt;
              &lt;span class="s"&gt;COALESCE(CAST(tpep_dropoff_datetime AS varchar), '') ||&lt;/span&gt;
              &lt;span class="s"&gt;COALESCE(PULocationID, '') ||&lt;/span&gt;
              &lt;span class="s"&gt;COALESCE(DOLocationID, '') ||&lt;/span&gt;
              &lt;span class="s"&gt;COALESCE(CAST(fare_amount AS varchar), '') ||&lt;/span&gt;
              &lt;span class="s"&gt;COALESCE(CAST(trip_distance AS varchar), '')&lt;/span&gt;
            &lt;span class="s"&gt;),&lt;/span&gt;
            &lt;span class="s"&gt;filename = '{{render(vars.file)}}';&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We then merge the staging table's records into our main table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;yellow_merge_data&lt;/span&gt;
        &lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;io.kestra.plugin.jdbc.redshift.Query&lt;/span&gt;
        &lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{kv('JDBC_URL')}}"&lt;/span&gt;
        &lt;span class="na"&gt;username&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{kv('DB_USER')}}"&lt;/span&gt;
        &lt;span class="na"&gt;password&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{kv('DB_PASS')}}"&lt;/span&gt;
        &lt;span class="na"&gt;sql&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;MERGE INTO {{render(vars.table)}}&lt;/span&gt;
          &lt;span class="s"&gt;USING {{render(vars.staging_table)}} AS S&lt;/span&gt;
          &lt;span class="s"&gt;ON {{render(vars.table)}}.unique_row_id = S.unique_row_id&lt;/span&gt;
          &lt;span class="s"&gt;WHEN MATCHED THEN&lt;/span&gt;
            &lt;span class="s"&gt;UPDATE SET&lt;/span&gt;
              &lt;span class="s"&gt;unique_row_id = S.unique_row_id&lt;/span&gt;
          &lt;span class="s"&gt;WHEN NOT MATCHED THEN&lt;/span&gt;
            &lt;span class="s"&gt;INSERT (&lt;/span&gt;
              &lt;span class="s"&gt;unique_row_id, filename, VendorID, tpep_pickup_datetime, tpep_dropoff_datetime,&lt;/span&gt;
              &lt;span class="s"&gt;passenger_count, trip_distance, RatecodeID, store_and_fwd_flag, PULocationID,&lt;/span&gt;
              &lt;span class="s"&gt;DOLocationID, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount,&lt;/span&gt;
              &lt;span class="s"&gt;improvement_surcharge, total_amount, congestion_surcharge&lt;/span&gt;
            &lt;span class="s"&gt;)&lt;/span&gt;
            &lt;span class="s"&gt;VALUES (&lt;/span&gt;
              &lt;span class="s"&gt;S.unique_row_id, S.filename, S.VendorID, S.tpep_pickup_datetime, S.tpep_dropoff_datetime,&lt;/span&gt;
              &lt;span class="s"&gt;S.passenger_count, S.trip_distance, S.RatecodeID, S.store_and_fwd_flag, S.PULocationID,&lt;/span&gt;
              &lt;span class="s"&gt;S.DOLocationID, S.payment_type, S.fare_amount, S.extra, S.mta_tax, S.tip_amount, S.tolls_amount,&lt;/span&gt;
              &lt;span class="s"&gt;S.improvement_surcharge, S.total_amount, S.congestion_surcharge&lt;/span&gt;
            &lt;span class="s"&gt;);&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you look closely, you can see the &lt;code&gt;WHEN NOT MATCHED THEN&lt;/code&gt; clause. The merge compares the &lt;code&gt;unique_row_id&lt;/code&gt; column of the staging table against the main table: rows whose IDs are not already present (i.e., not duplicates) are inserted into the main table, while rows that already exist are left effectively unchanged.&lt;/p&gt;

&lt;p&gt;I mentioned truncating our staging table and how it only makes sense in subsequent runs. After this first run, both our main table and our staging table contain the same data (January, in our case). When the next flow executes, the steps that create the main table and the staging table are skipped, since both already exist (&lt;code&gt;CREATE TABLE IF NOT EXISTS&lt;/code&gt;). Then comes the truncation of the staging table: it is emptied and loaded with the records for the second month. So at this point, the main table holds data for January, while the staging table holds data for February.&lt;/p&gt;

&lt;p&gt;Through the &lt;code&gt;unique_row_id&lt;/code&gt; column, the merge then inserts only the February rows whose IDs do not already exist in the main table, so no duplicates are introduced.&lt;/p&gt;

&lt;p&gt;Inside Redshift, we can confirm if our tables have been created. We are supposed to have two green taxi tables (main and staging) and two yellow tables (main and staging).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fea8cf5b92yf6yf722puo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fea8cf5b92yf6yf722puo.png" alt="Redshift tables" width="490" height="277"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Running our Second flow
&lt;/h3&gt;

&lt;p&gt;Now that we have our tables in Redshift, we can move to the transformations part using dbt. For this, we will use the flow &lt;code&gt;dbt_redshift.yaml&lt;/code&gt; found in the &lt;code&gt;kestra/flows&lt;/code&gt; directory.&lt;/p&gt;

&lt;p&gt;What this flow does is clone our repository and then sync the dbt project, which is located in the &lt;code&gt;taxi_dbt&lt;/code&gt; directory, into our namespace. This allows Kestra to find our project and, more specifically, the &lt;code&gt;dbt_project.yml&lt;/code&gt; file.&lt;/p&gt;
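&lt;p&gt;In outline, that part of the flow looks something like the sketch below. The task types come from Kestra's Git and dbt plugins; the repo's actual flow may name and structure them differently, so treat this as an approximation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;id: dbt_redshift            # illustrative flow name
namespace: nyc.taxi         # illustrative namespace

tasks:
  - id: working_dir
    type: io.kestra.plugin.core.flow.WorkingDirectory
    tasks:
      - id: clone_repo
        type: io.kestra.plugin.git.Clone
        url: https://github.com/Munjogu123/taxi-nyc-dbt
        branch: main

      - id: dbt_build
        type: io.kestra.plugin.dbt.cli.DbtCLI
        commands:
          - dbt build --project-dir taxi_dbt
        profiles: |
          # Redshift connection details go here -- see the next section
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;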

&lt;p&gt;There are a few things we need to specify before our flow runs successfully. We need to add a profiles section so that Kestra and dbt know the location of our data and where the tables our models create will be stored. If you have used dbt Core, you can get these credentials in the &lt;code&gt;~/.dbt/profiles.yml&lt;/code&gt; file. &lt;br&gt;
Run&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; ~/.dbt/profiles.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You will see the credentials dbt needs. These values go into the flow's &lt;code&gt;profiles&lt;/code&gt; section:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt; &lt;span class="na"&gt;profiles&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;taxi_dbt:&lt;/span&gt;
        &lt;span class="s"&gt;outputs:&lt;/span&gt;
          &lt;span class="s"&gt;dev:&lt;/span&gt;
            &lt;span class="s"&gt;type: redshift&lt;/span&gt;
            &lt;span class="s"&gt;host: {{ kv('REDSHIFT_HOST') }}&lt;/span&gt;
            &lt;span class="s"&gt;user: "{{ kv('REDSHIFT_USER') }}"&lt;/span&gt;
            &lt;span class="s"&gt;password: "{{ kv('REDSHIFT_PASSWORD') }}"&lt;/span&gt;
            &lt;span class="s"&gt;port: 5439&lt;/span&gt;
            &lt;span class="s"&gt;dbname: dev&lt;/span&gt;
            &lt;span class="s"&gt;schema: public&lt;/span&gt;
            &lt;span class="s"&gt;autocommit: true&lt;/span&gt;
            &lt;span class="s"&gt;threads: 3&lt;/span&gt;
            &lt;span class="s"&gt;connect_timeout: 300&lt;/span&gt;
        &lt;span class="s"&gt;target: dev&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you haven't installed dbt on your machine, you can find these connection details in the Redshift console under your workgroup's details.&lt;/p&gt;

&lt;p&gt;With configurations out of the way, we can run our dbt flow. Inside Kestra, execute the flow.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9qyg8hjtl6t3uoppv0pg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9qyg8hjtl6t3uoppv0pg.png" alt="DBT flow" width="800" height="442"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see that the flow executed successfully and also passed all the tests I had written. These ensure that the models we create have consistent data, which prevents our pipeline from breaking down.&lt;/p&gt;
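&lt;p&gt;For context, dbt tests like these are declared in a &lt;code&gt;schema.yml&lt;/code&gt; file alongside the models. Here is a minimal, hypothetical example; the model and column names are illustrative and not the repo's actual tests:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;version: 2

models:
  - name: stg_yellow_trips      # hypothetical model name
    columns:
      - name: unique_row_id
        tests:
          - unique
          - not_null
      - name: total_amount
        tests:
          - not_null
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;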

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4mrg4zrmxn00c9swj0ld.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4mrg4zrmxn00c9swj0ld.png" alt="Tests confirmation" width="800" height="239"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see the number of tests and whether they passed. This marks the end of the pipeline, and because of our triggers, it will run monthly. The entire workflow is now automated; what remains is monitoring the logs to ensure no abnormalities occur and identifying bottlenecks that minor tweaks to the workflow could resolve.&lt;/p&gt;

&lt;p&gt;Remember to destroy the resources once done with the project to avoid incurring additional costs. For more explanation on usage and setup, refer to the README at the root of this &lt;a href="https://github.com/Munjogu123/taxi-nyc-dbt" rel="noopener noreferrer"&gt;repo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;With our models, we can now pass the data in the tables created to any analytics or BI tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;We have executed two flows that extract data, upload it to S3, copy the content from S3 to Redshift, and perform transformations using dbt.&lt;/p&gt;

&lt;p&gt;I hope this article was helpful and informative. If you want to read more about Terraform or Infrastructure as Code (IaC), here is a &lt;a href="https://dev.to/fredmunjogu/automating-aws-infrastructure-for-a-fastapi-application-with-terraform-58lp"&gt;link&lt;/a&gt; to an article on a project I did that goes into more detail about the configurations and Terraform itself.&lt;/p&gt;

&lt;p&gt;Here is a link to this project's &lt;a href="https://github.com/Munjogu123/taxi-nyc-dbt" rel="noopener noreferrer"&gt;repo&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>automation</category>
      <category>tutorial</category>
      <category>aws</category>
    </item>
    <item>
      <title>Automating AWS Infrastructure for a FastAPI Application with Terraform</title>
      <dc:creator>Fred Munjogu</dc:creator>
      <pubDate>Thu, 16 Oct 2025 16:31:32 +0000</pubDate>
      <link>https://dev.to/fredmunjogu/automating-aws-infrastructure-for-a-fastapi-application-with-terraform-58lp</link>
      <guid>https://dev.to/fredmunjogu/automating-aws-infrastructure-for-a-fastapi-application-with-terraform-58lp</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In this project, I automated the deployment of a FastAPI-based Book Review API by provisioning all the necessary AWS resources using Terraform. The goal was to build a simple, scalable, and secure cloud architecture that separates the web and database tiers while maintaining infrastructure as code (IaC) principles.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;In this project, we will create a few AWS resources. These resources include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Virtual Private Cloud (VPC)&lt;/li&gt;
&lt;li&gt;Public subnet (web server)&lt;/li&gt;
&lt;li&gt;Private subnet (database server)&lt;/li&gt;
&lt;li&gt;Internet gateway for public subnet connectivity&lt;/li&gt;
&lt;li&gt;NAT gateway for private subnet outbound access&lt;/li&gt;
&lt;li&gt;Route tables and routing configuration&lt;/li&gt;
&lt;li&gt;Security groups for both tiers&lt;/li&gt;
&lt;li&gt;EC2 instances &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's go through each of these steps to gain a better understanding of how we will deploy our app.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz6iqp0ubblj7i6g79t3t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz6iqp0ubblj7i6g79t3t.png" alt="Architecture diagram" width="800" height="1200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Deploying our Web Server
&lt;/h2&gt;

&lt;p&gt;We will first need to create a VPC (Virtual Private Cloud), which, in simple terms, is a virtual network similar to a traditional network. The VPC will allow us to add subnets, which are ranges of IP addresses within our VPC.&lt;/p&gt;
&lt;h3&gt;
  
  
  VPC
&lt;/h3&gt;

&lt;p&gt;Here is a snippet of our Terraform configuration to create a VPC:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_vpc"&lt;/span&gt; &lt;span class="s2"&gt;"test_vpc"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;cidr_block&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"10.0.0.0/16"&lt;/span&gt;

  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"api_vpc"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;enable_dns_hostnames&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="nx"&gt;enable_dns_support&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Public subnet
&lt;/h3&gt;

&lt;p&gt;With our VPC now configured, we can create a public subnet. To create the subnet, we specify the VPC it belongs to and its IP address range. We will also set a property so that instances launched in this subnet are automatically assigned a public IP.&lt;/p&gt;

&lt;p&gt;Here is a snippet of how we will create our public subnet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_subnet"&lt;/span&gt; &lt;span class="s2"&gt;"public_subnet"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_id&lt;/span&gt;                  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;test_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;cidr_block&lt;/span&gt;              &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"10.0.0.0/24"&lt;/span&gt;
  &lt;span class="nx"&gt;availability_zone&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_availability_zone&lt;/span&gt;
  &lt;span class="nx"&gt;map_public_ip_on_launch&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"public-subnet"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Internet Gateway
&lt;/h3&gt;

&lt;p&gt;After creating our public subnet, we will now add an internet gateway to allow our instances access to the internet. This involves adding an internet gateway resource and attaching it to our VPC.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_internet_gateway"&lt;/span&gt; &lt;span class="s2"&gt;"igw"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;test_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;

  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"test-igw"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Routing table and routes
&lt;/h3&gt;

&lt;p&gt;We will then create a routing table and add routes for how our traffic in the public subnet will flow. In our routing table, we will specify a CIDR block of &lt;code&gt;0.0.0.0/0&lt;/code&gt;, which will allow outbound traffic from resources in the public subnet to access the internet.&lt;/p&gt;
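&lt;p&gt;Here is roughly what that route table looks like (the resource and gateway names match the surrounding snippets; the tag is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;resource "aws_route_table" "public_route" {
  vpc_id = aws_vpc.test_vpc.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.igw.id
  }

  tags = {
    Name = "public-rt"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;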

&lt;h3&gt;
  
  
  Routing table association
&lt;/h3&gt;

&lt;p&gt;The final step is associating these route tables with the public subnet. This will apply all the rules to the public subnet and dictate how outbound traffic will flow.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_route_table_association"&lt;/span&gt; &lt;span class="s2"&gt;"public"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;subnet_id&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_subnet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;public_subnet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;route_table_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_route_table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;public_route&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Security groups
&lt;/h3&gt;

&lt;p&gt;By now, our web server can access the internet, but we need to specify inbound and outbound rules to limit who can reach it. As you may have guessed, this is a security measure to ensure only authorized traffic reaches our servers.&lt;/p&gt;

&lt;p&gt;Security groups let us declare these rules explicitly. Below are some of the rules I specified for my web server.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Creating the security group&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_security_group"&lt;/span&gt; &lt;span class="s2"&gt;"webserver"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"webserversg"&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"security group for webservers"&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_id&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;test_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"WebServerSG"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Adding inbound and outbound rules&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# allow ssh, http, and https traffic to webserver&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_vpc_security_group_ingress_rule"&lt;/span&gt; &lt;span class="s2"&gt;"allow_ssh_webserver"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;security_group_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_security_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;webserver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;cidr_ipv4&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${chomp(data.http.myip.response_body)}/32"&lt;/span&gt;
  &lt;span class="nx"&gt;from_port&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;22&lt;/span&gt;
  &lt;span class="nx"&gt;ip_protocol&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"tcp"&lt;/span&gt;
  &lt;span class="nx"&gt;to_port&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;22&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_vpc_security_group_ingress_rule"&lt;/span&gt; &lt;span class="s2"&gt;"allow_http"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;security_group_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_security_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;webserver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;cidr_ipv4&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"0.0.0.0/0"&lt;/span&gt;
  &lt;span class="nx"&gt;from_port&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;
  &lt;span class="nx"&gt;ip_protocol&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"tcp"&lt;/span&gt;
  &lt;span class="nx"&gt;to_port&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_vpc_security_group_ingress_rule"&lt;/span&gt; &lt;span class="s2"&gt;"allow_https"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;security_group_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_security_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;webserver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;cidr_ipv6&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"::/0"&lt;/span&gt;
  &lt;span class="nx"&gt;from_port&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;443&lt;/span&gt;
  &lt;span class="nx"&gt;ip_protocol&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"tcp"&lt;/span&gt;
  &lt;span class="nx"&gt;to_port&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;443&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# adding egress rules for all traffic&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_vpc_security_group_egress_rule"&lt;/span&gt; &lt;span class="s2"&gt;"allow_all_traffic_webserver"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;security_group_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_security_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;webserver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;cidr_ipv4&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"0.0.0.0/0"&lt;/span&gt;
  &lt;span class="nx"&gt;ip_protocol&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"-1"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ingress refers to the inbound rules, while egress refers to the outbound rules. From the snippet above, you see that I allow inbound traffic from SSH (port 22), HTTP (port 80), and HTTPS (port 443). In the SSH part, I specified my own IP since it was in development, but in a production setting, you would specify an IP address range.&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating our EC2 Instance
&lt;/h3&gt;

&lt;p&gt;The final step is creating an EC2 instance where our app will be deployed. We will use an Ubuntu image and launch the instance in our public subnet.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# create an EC2 instance&lt;/span&gt;
&lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="s2"&gt;"aws_ssm_parameter"&lt;/span&gt; &lt;span class="s2"&gt;"ubuntu_ami"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"/aws/service/canonical/ubuntu/server/24.04/stable/current/amd64/hvm/ebs-gp3/ami-id"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# webserver instance&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_instance"&lt;/span&gt; &lt;span class="s2"&gt;"web_instance"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;ami&lt;/span&gt;                         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_ssm_parameter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ubuntu_ami&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;
  &lt;span class="nx"&gt;instance_type&lt;/span&gt;               &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"t3.micro"&lt;/span&gt;
  &lt;span class="nx"&gt;key_name&lt;/span&gt;                    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"fredssh"&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_security_group_ids&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;aws_security_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;webserver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="nx"&gt;subnet_id&lt;/span&gt;                   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_subnet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;public_subnet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;associate_public_ip_address&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="nx"&gt;user_data_base64&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;filebase64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"webserver.sh"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"webserver-instance"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;user_data_base64&lt;/code&gt; attribute points to a bash script that installs and starts nginx when the EC2 instance first boots. With that, we have a web server running nginx, ready to serve our app.&lt;/p&gt;
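&lt;p&gt;As a rough sketch, such a user-data script might look like the following (the repo's actual &lt;code&gt;webserver.sh&lt;/code&gt; may differ):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;#!/bin/bash
# webserver.sh - runs once on first boot via EC2 user data
apt-get update -y
apt-get install -y nginx
systemctl enable nginx
systemctl start nginx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;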

&lt;p&gt;We will follow similar steps when building the private instance that will host our database, with a few differences. First, we do not want this server to be directly reachable from the internet; it should only be accessible from the web server. Since it still needs outbound internet access, we will use a NAT gateway.&lt;/p&gt;

&lt;h3&gt;
  
  
  NAT gateway
&lt;/h3&gt;

&lt;p&gt;To create this gateway, we first need an Elastic IP, which the NAT gateway will use. After creating the gateway, we will set up routing the same way we did for the internet gateway, only this time for the private subnet. This enables the private subnet to reach the internet without being directly exposed or allowing unsolicited inbound connections.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# allocate elastic ip to nat gateway&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_eip"&lt;/span&gt; &lt;span class="s2"&gt;"nat_eip"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;domain&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"vpc"&lt;/span&gt;

  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"nat-eip"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# create a NAT gateway&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_nat_gateway"&lt;/span&gt; &lt;span class="s2"&gt;"nat"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;allocation_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_eip&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;nat_eip&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;subnet_id&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_subnet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;public_subnet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;

  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"nat-gw"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;depends_on&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;aws_internet_gateway&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;igw&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# routing&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_route_table"&lt;/span&gt; &lt;span class="s2"&gt;"private_route"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;test_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;

  &lt;span class="nx"&gt;route&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;cidr_block&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"0.0.0.0/0"&lt;/span&gt;
    &lt;span class="nx"&gt;nat_gateway_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_nat_gateway&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;nat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"private-rt"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The other difference is how we define the security group. For the private instance, we will allow inbound SSH (port 22) and Postgres (port 5432) traffic, but only from the web server's security group, since Postgres is the database we will be using.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Creating the security group&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_security_group"&lt;/span&gt; &lt;span class="s2"&gt;"dbserver"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"dbserversg"&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"security group for database servers"&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_id&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;test_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"DBServerSG"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Defining the inbound and outbound rules&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# allow ssh, and Postgres traffic to dbserver from instances in the webserver&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_vpc_security_group_ingress_rule"&lt;/span&gt; &lt;span class="s2"&gt;"allow_ssh_db"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;security_group_id&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_security_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;dbserver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;referenced_security_group_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_security_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;webserver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;from_port&lt;/span&gt;                    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;22&lt;/span&gt;
  &lt;span class="nx"&gt;ip_protocol&lt;/span&gt;                  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"tcp"&lt;/span&gt;
  &lt;span class="nx"&gt;to_port&lt;/span&gt;                      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;22&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_vpc_security_group_ingress_rule"&lt;/span&gt; &lt;span class="s2"&gt;"allow_postgres"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;security_group_id&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_security_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;dbserver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;referenced_security_group_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_security_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;webserver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;from_port&lt;/span&gt;                    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5432&lt;/span&gt;
  &lt;span class="nx"&gt;ip_protocol&lt;/span&gt;                  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"tcp"&lt;/span&gt;
  &lt;span class="nx"&gt;to_port&lt;/span&gt;                      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5432&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# adding egress rules for all traffic&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_vpc_security_group_egress_rule"&lt;/span&gt; &lt;span class="s2"&gt;"allow_all_traffic_db"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;security_group_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_security_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;dbserver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;cidr_ipv4&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"0.0.0.0/0"&lt;/span&gt;
  &lt;span class="nx"&gt;ip_protocol&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"-1"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We will create the EC2 instance the same way, but we will not associate a public IP address with the private instance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Launching and Testing our App
&lt;/h2&gt;

&lt;p&gt;We will now create all these resources with the following Terraform commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;&lt;span class="k"&gt;terraform&lt;/span&gt; &lt;span class="nx"&gt;init&lt;/span&gt;
&lt;span class="k"&gt;terraform&lt;/span&gt; &lt;span class="nx"&gt;fmt&lt;/span&gt;
&lt;span class="k"&gt;terraform&lt;/span&gt; &lt;span class="nx"&gt;validate&lt;/span&gt;
&lt;span class="k"&gt;terraform&lt;/span&gt; &lt;span class="nx"&gt;apply&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We now have our two instances running. It's time to test if our API app works as intended. For a more detailed guide on how to configure nginx, postgres, and launch our app, please refer to the README in the root of the &lt;code&gt;aws&lt;/code&gt; directory in this &lt;a href="https://github.com/Munjogu123/book-review-api" rel="noopener noreferrer"&gt;repo&lt;/a&gt;. It also contains all the Terraform configurations.&lt;/p&gt;

&lt;h2&gt;
  
  
  API Testing
&lt;/h2&gt;

&lt;p&gt;Let's first run our app&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 &lt;span class="nt"&gt;-m&lt;/span&gt; uvicorn main:app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If successful, you should see the following output in your terminal&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7wws8oezv9e44ij37atn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7wws8oezv9e44ij37atn.png" alt="app launch success" width="730" height="191"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You should also see the following if you navigate to the docs page&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyraorsdiuw9y35966w4k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyraorsdiuw9y35966w4k.png" alt="docs page" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Testing the users endpoint
&lt;/h3&gt;

&lt;p&gt;Inside the docs, we will create a new user and check the response we receive. We will also confirm that this output matches the record in our database.&lt;/p&gt;
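
&lt;p&gt;If you prefer the terminal to the Swagger UI, here is a minimal sketch using Python's &lt;code&gt;requests&lt;/code&gt; library. The port, the &lt;code&gt;/users&lt;/code&gt; path, and the payload fields are assumptions, so adjust them to the actual schema exposed by the app.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Hedged sketch: create a user against the running API.
# The /users path and the name/email fields are assumptions; check the Swagger
# docs for the exact schema your app expects.
import requests

BASE_URL = "http://localhost:8000"  # uvicorn's default host/port

payload = {"name": "Alice", "email": "alice@example.com"}
response = requests.post(f"{BASE_URL}/users", json=payload)

print(response.status_code)  # expect a 2xx status on success
print(response.json())       # the created user returned by the API
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;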

&lt;p&gt;We will start by creating a user. These are the results&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ap4mks10wjntu7joyys.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ap4mks10wjntu7joyys.png" alt="creates user" width="800" height="289"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, let's check if our database has the same record&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8d6tzl0okz3pou6oxz19.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8d6tzl0okz3pou6oxz19.png" alt="database results" width="800" height="89"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our results show that our app works well. Let's try the books and reviews endpoints.&lt;/p&gt;

&lt;h3&gt;
  
  
  Testing the book endpoint
&lt;/h3&gt;

&lt;p&gt;We will follow the same process as above. In the Swagger docs, we will create a book and ensure everything works fine. This is the result of creating a book.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcw5sir59okjmdpxy3njl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcw5sir59okjmdpxy3njl.png" alt="books endpoint" width="800" height="308"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, let's check if our database has the same record&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkk6ouiag8jmz7nw7oi71.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkk6ouiag8jmz7nw7oi71.png" alt="database result" width="800" height="52"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Testing the review endpoint
&lt;/h3&gt;

&lt;p&gt;For this to work, there must already be a user and a book record; without them, we cannot create a review, since a review needs both a book to review and a person reviewing it. Luckily, we have already created a user and a book, so we can proceed to write a review for the book we created. Here is the output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkvd8itsr5zb4a5ffhkjn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkvd8itsr5zb4a5ffhkjn.png" alt="reviews" width="800" height="334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, let's check if our database has the same record&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftfmb5247qp15sdfjvknf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftfmb5247qp15sdfjvknf.png" alt="reviews database" width="800" height="37"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This works for all the endpoints, showing that our deployed app works as we expected. This concludes our project.&lt;/p&gt;

&lt;h2&gt;
  
  
  Design Decisions and Trade-offs
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Two-tier architecture:&lt;/strong&gt; Separating the web and database layers improves security and scalability.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Private database subnet:&lt;/strong&gt; Prevents external exposure of sensitive data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dynamic IP restriction:&lt;/strong&gt; The web server’s SSH access is dynamically limited to my public IP, reducing attack surface.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Terraform automation:&lt;/strong&gt; Enables consistent, version-controlled provisioning.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This project highlights the power of Infrastructure as Code in managing cloud deployments. Another advantage of Terraform, in particular, is that it is cloud-agnostic. We can use it with different cloud providers (Azure, GCP).&lt;/p&gt;

&lt;p&gt;The source code of this project can be found here: &lt;a href="https://github.com/Munjogu123/book-review-api" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>terraform</category>
      <category>aws</category>
    </item>
    <item>
      <title>Apache Kafka Deep Dive: Core Concepts, Data Engineering Applications, and Real-World Production Practices</title>
      <dc:creator>Fred Munjogu</dc:creator>
      <pubDate>Tue, 09 Sep 2025 13:55:56 +0000</pubDate>
      <link>https://dev.to/fredmunjogu/apache-kafka-deep-dive-core-concepts-data-engineering-applications-and-real-world-production-549d</link>
      <guid>https://dev.to/fredmunjogu/apache-kafka-deep-dive-core-concepts-data-engineering-applications-and-real-world-production-549d</guid>
      <description>&lt;p&gt;As a data engineer, you will often need to stream data. To be more specific, you will need a tool to help you stream live data for whichever project you will be working on. &lt;/p&gt;

&lt;p&gt;Kafka is a great tool and has a ton of functionality to help you stream data seamlessly. In this article, we will focus on the core concepts you need to know to get started with Kafka.&lt;/p&gt;

&lt;h2&gt;
  
  
  Kafka Architecture
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Brokers
&lt;/h3&gt;

&lt;p&gt;A broker is a server that stores the data we use in streaming and also handles all the data streaming requests. The broker acts as a middleman between a producer (which sends information) and a consumer (which receives it).&lt;/p&gt;

&lt;h3&gt;
  
  
  Zookeeper vs. KRaft (Kafka Raft Metadata mode)
&lt;/h3&gt;

&lt;p&gt;In earlier versions of Kafka (before v2.8), Kafka relied on an external coordinator called ZooKeeper, which was in charge of handling metadata. ZooKeeper worked hand in hand with the brokers to track which broker was the controller and to persist cluster state.&lt;/p&gt;

&lt;p&gt;KRaft was introduced in Kafka 2.8 as an alternative and was marked production-ready in Kafka 3.3. Instead of brokers relying on ZooKeeper, Kafka now uses the Raft consensus algorithm to manage metadata, removing the need for ZooKeeper entirely. This makes the cluster simpler and much more scalable since there are fewer moving parts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Topics, Partitions, Offsets
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Topic
&lt;/h3&gt;

&lt;p&gt;A topic in Kafka is a log that stores messages and events in a logical order. We can equate a topic to a folder in a filesystem, and the events to the files.&lt;/p&gt;

&lt;h3&gt;
  
  
  Partition
&lt;/h3&gt;

&lt;p&gt;A partition is a "slice" of a topic. This means that when you create a topic, you need to specify the number of partitions you will need. &lt;/p&gt;

&lt;p&gt;This is important because it allows more brokers to share the load, as the data is spread across separate partitions. It also allows consumers in the same group to read from different partitions at the same time. An example of a partition is as follows:&lt;br&gt;
Say you create a topic called orders with 3 partitions. This is what it will actually look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Orders partitions
orders-0: [msg1, msg4, msg7]
orders-1: [msg2, msg5, msg8]
orders-2: [msg3, msg6, msg9]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Offsets
&lt;/h3&gt;

&lt;p&gt;An offset is a unique identifier assigned to each message in a partition. This helps the producers, consumers, and brokers to determine the position of a message in a partition.&lt;/p&gt;

&lt;h2&gt;
  
  
  Producers
&lt;/h2&gt;

&lt;p&gt;A producer is a client that writes messages to topics in the Kafka cluster. It is also possible to influence which partition a message lands in by using key-based partitioning: messages with the same key are always hashed to the same partition.&lt;/p&gt;
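
&lt;p&gt;As a minimal sketch with the kafka-python library (the broker address and the orders topic are assumptions), a producer using key-based partitioning could look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Minimal producer sketch (kafka-python). Broker address and topic are assumptions.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Key-based partitioning: every event with the key "order-42" is hashed to the
# same partition, so the relative order of those events is preserved.
producer.send("orders", key="order-42", value={"item": "book", "qty": 2})
producer.flush()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;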

&lt;h2&gt;
  
  
  Consumers
&lt;/h2&gt;

&lt;p&gt;A consumer is a client that reads messages from topics in the Kafka cluster. &lt;/p&gt;

&lt;p&gt;Kafka also has consumer groups, which let consumers work together in parallel: each consumer in a group is assigned a subset of a topic's partitions, so the group as a whole reads messages concurrently.&lt;/p&gt;
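
&lt;p&gt;A minimal consumer sketch (again kafka-python, with the broker and topic as assumptions); the &lt;code&gt;group_id&lt;/code&gt; is what places consumers in the same consumer group so they can split a topic's partitions between them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Minimal consumer sketch (kafka-python). Broker address and topic are assumptions.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    group_id="order-processors",   # consumers sharing this id split the partitions
    auto_offset_reset="earliest",  # start from the beginning if no offset is stored
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    print(message.partition, message.offset, message.value)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;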

&lt;h2&gt;
  
  
  Message Delivery Semantics
&lt;/h2&gt;

&lt;p&gt;There are three different types of delivery semantics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;at-most-once&lt;/li&gt;
&lt;li&gt;at-least-once&lt;/li&gt;
&lt;li&gt;exactly-once&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  At-Most-Once Delivery
&lt;/h3&gt;

&lt;p&gt;This approach entails the consumer only saving the position of the last event and then processing it. This means that in the event the consumer fails in the middle of execution, there is no way to go back to read this event.&lt;/p&gt;

&lt;p&gt;This approach is for situations where some data loss is not an issue and accuracy is not a priority.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff5dak6wynfw6idh0pokr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff5dak6wynfw6idh0pokr.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  At-Least-Once Delivery
&lt;/h3&gt;

&lt;p&gt;This approach entails the consumer processing the received events, saving the results, and only then saving the position of the last received event. It differs from at-most-once delivery in that failed events can be retrieved and reprocessed, so nothing is lost, although duplicates are possible if the consumer fails after processing but before saving its position.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbi7176jk0d46zetcilpr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbi7176jk0d46zetcilpr.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Exactly-Once Delivery
&lt;/h3&gt;

&lt;p&gt;This approach is similar to at-least-once delivery, whereby the consumer receives the event, saves the results, and also the position of the last received event. The difference is that any duplicates are dropped, so each event is processed exactly once.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5f5egzuj9gf0jdvyt1fg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5f5egzuj9gf0jdvyt1fg.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Retention Policies
&lt;/h2&gt;

&lt;p&gt;Retention is the configurable policy that determines how long messages are preserved in a Kafka topic before they are deleted.&lt;/p&gt;

&lt;p&gt;Some of the retention policies include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Time-based retention&lt;/li&gt;
&lt;li&gt;Size-based retention&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Time-based retention
&lt;/h3&gt;

&lt;p&gt;This is a policy where you can configure how long messages are retained in their topics based on their timestamps. Once that time period expires, the closed segments are deleted.&lt;/p&gt;

&lt;h3&gt;
  
  
  Size-based retention
&lt;/h3&gt;

&lt;p&gt;This is a policy where the expiration of messages depends on the number of bytes retained. If the configured size is reached or exceeded, the oldest segments are deleted to bring the partition back under the limit.&lt;/p&gt;
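
&lt;p&gt;Both policies are ordinary topic-level configurations. As a minimal sketch using kafka-python's admin client (the broker address and topic name are assumptions), a topic could be created with a seven-day time limit and a roughly 1 GB per-partition size limit like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Minimal sketch: create a topic with both time- and size-based retention set.
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

topic = NewTopic(
    name="orders",
    num_partitions=3,
    replication_factor=1,
    topic_configs={
        "retention.ms": "604800000",      # 7 days in milliseconds
        "retention.bytes": "1073741824",  # ~1 GB per partition
    },
)

admin.create_topics([topic])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;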

&lt;h2&gt;
  
  
  Serialization &amp;amp; Deserialization
&lt;/h2&gt;

&lt;p&gt;Serialization refers to the process of converting structured data into a byte stream. This is crucial since Kafka stores and transmits data as raw bytes. This process is done by a serializer.&lt;/p&gt;

&lt;p&gt;Deserialization is the opposite of serialization. This is converting data from a byte stream to a structured form, e.g., an object. This process is done by a deserializer.&lt;/p&gt;

&lt;p&gt;The common formats used in Kafka are Avro, Protobuf, or JSON schema.&lt;/p&gt;
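
&lt;p&gt;As a minimal JSON-based sketch with kafka-python (the broker address and topic are assumptions), the serializer and deserializer are just small functions plugged into the producer and consumer; Avro or Protobuf would follow the same pattern, usually together with a schema registry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Serialization/deserialization sketch using plain JSON.
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer side: serialize a Python dict into bytes before sending to the broker.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("payments", value={"user_id": 7, "amount": 19.99})
producer.flush()

# Consumer side: deserialize the raw bytes back into a Python dict after reading.
consumer = KafkaConsumer(
    "payments",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;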

&lt;h2&gt;
  
  
  Replication and Fault Tolerance
&lt;/h2&gt;

&lt;p&gt;In Kafka, one can specify the replication factor when creating a topic. This allows for multiple copies to be created across different brokers, which ensures data availability.&lt;/p&gt;

&lt;h3&gt;
  
  
  Replication Factor
&lt;/h3&gt;

&lt;p&gt;A configurable setting for each topic that determines the total number of copies for each partition.&lt;/p&gt;

&lt;h3&gt;
  
  
  Leader
&lt;/h3&gt;

&lt;p&gt;For each partition, one replica is designated as the leader, which handles all incoming produce and consume requests. &lt;/p&gt;

&lt;h3&gt;
  
  
  Follower
&lt;/h3&gt;

&lt;p&gt;Other replicas for a partition are followers. They continuously fetch data from the leader to stay synchronized.&lt;/p&gt;

&lt;h3&gt;
  
  
  ISR (In-Sync Replicas)
&lt;/h3&gt;

&lt;p&gt;The set of follower replicas that have successfully replicated the leader's data and are fully up-to-date. &lt;/p&gt;

&lt;h3&gt;
  
  
  High Availability
&lt;/h3&gt;

&lt;p&gt;The system remains functional and data is accessible, even if one or more brokers fail. &lt;/p&gt;

&lt;h2&gt;
  
  
  Kafka Connect
&lt;/h2&gt;

&lt;p&gt;This is a tool for streaming data between Kafka and other data systems. In this article, we will focus on streaming data between two databases, namely, PostgreSQL and Cassandra.&lt;/p&gt;

&lt;h3&gt;
  
  
  Configuring our Postgres Connector
&lt;/h3&gt;

&lt;p&gt;In our case, we will be using Debezium Connect, which is built on Kafka Connect. To start things off, we will begin with the configuration file, which is in JSON format.&lt;/p&gt;

&lt;p&gt;In our case, we will use Docker, so we will have to come up with a Docker Compose file that has all the services we need. These include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;zookeeper - handling metadata&lt;/li&gt;
&lt;li&gt;Kafka - for streaming&lt;/li&gt;
&lt;li&gt;postgres (15) - our source database&lt;/li&gt;
&lt;li&gt;Cassandra (4.1) - our sink db&lt;/li&gt;
&lt;li&gt;debezium/connect - a tool to allow us to stream data from PostgreSQL to Cassandra&lt;/li&gt;
&lt;li&gt;kafka-ui - Graphical User Interface to view our streamed data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's break down each service component.&lt;/p&gt;

&lt;h3&gt;
  
  
  Postgres
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  postgres:
    image: postgres:15
    container_name: postgres
    ports:
      - "5432:5432"
    environment:
      POSTGRES_USER: test
      POSTGRES_PASSWORD: root
      POSTGRES_DB: test
    command: &amp;gt;
      postgres -c wal_level=logical
               -c max_wal_senders=10
               -c max_replication_slots=10
               -c max_connections=200
    volumes:
      - postgres_data:/var/lib/postgresql/data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The command property is used to pass startup arguments to PostgreSQL. Setting &lt;code&gt;wal_level&lt;/code&gt; to logical enables logical replication, so the change logs can be streamed into Kafka and used to replicate the data into the Cassandra DB.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cassandra
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  cassandra:
    image: cassandra:4.1
    container_name: cassandra
    ports:
      - "9042:9042"
    environment:
      CASSANDRA_CLUSTER_NAME: "cdc-cluster"
      CASSANDRA_NUM_TOKENS: 16
      CASSANDRA_DC: datacenter1
      CASSANDRA_RACK: rack1
    volumes:
      - cassandra_data:/var/lib/cassandra
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is just a simple configuration to load our simple sink database.&lt;/p&gt;

&lt;h3&gt;
  
  
  Connect
&lt;/h3&gt;

&lt;p&gt;This is where we use Debezium, and there are a few things we need to do so that it can transmit the data between these data systems.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  connect:
    image: debezium/connect:2.7.3.Final
    container_name: debezium
    depends_on:
    - kafka
    - postgres
    ports:
    - "8083:8083"
    environment:
      GROUP_ID: 1
      CONFIG_STORAGE_TOPIC: connect_configs
      OFFSET_STORAGE_TOPIC: connect_offsets
      BOOTSTRAP_SERVERS: kafka:9092
      HOST_NAME: "connect"
      ADVERTISED_HOST_NAME: "connect"
      ADVERTISED_PORT: "8083"
      KEY_CONVERTER: "org.apache.kafka.connect.json.JsonConverter"
      VALUE_CONVERTER: "org.apache.kafka.connect.json.JsonConverter"
      INTERNAL_KEY_CONVERTER: "org.apache.kafka.connect.json.JsonConverter"
      INTERNAL_VALUE_CONVERTER: "org.apache.kafka.connect.json.JsonConverter"
      OFFSET_FLUSH_INTERVAL_MS: "60000"
      OFFSET_FLUSH_TIMEOUT_MS: "5000"
      SHUTDOWN_TIMEOUT: "10000"
      HEAP_OPTS: "-Xms512M -Xmx2G"
      LOG_LEVEL: "INFO"
      ENABLE_APICURIO_CONVERTERS: "false"
      ENABLE_DEBEZIUM_SCRIPTING: "false"
      KAFKA_CONNECT_PLUGINS_DIR: /kafka/connect,/kafka/connect/plugins
    volumes:
      - connect_data:/kafka/connect
      - ../plugins:/kafka/connect/plugins
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The main things to focus on in this configuration are the volumes. By default, Debezium ships with a pre-installed PostgreSQL plugin located in the &lt;code&gt;/kafka/connect&lt;/code&gt; directory, so you do not need to install it yourself; all you have to do is provide that path in the environment variable &lt;code&gt;KAFKA_CONNECT_PLUGINS_DIR&lt;/code&gt;. For Cassandra, it is a little different since you have to source the plugin yourself. Once downloaded, extract the archive and locate the Kafka sink connector file with the &lt;code&gt;.jar&lt;/code&gt; extension. We will place this file in our root directory and mount it into the &lt;code&gt;/kafka/connect/plugins&lt;/code&gt; directory. This allows our Docker container to recognize the plugin so that we can register our sink connector.&lt;/p&gt;

&lt;p&gt;Similar to what we did for Postgres, we need to add this path to the environment variable &lt;code&gt;KAFKA_CONNECT_PLUGINS_DIR&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;After adding the zookeeper, Kafka, and Kafka UI services (I have not provided the snippets since they only require basic configuration and are easy to set up), we will now create identical tables in both PostgreSQL and Cassandra.&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating Tables
&lt;/h3&gt;

&lt;p&gt;For Postgres:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE customers (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100),
    email VARCHAR(150),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Insert some sample rows
INSERT INTO customers (name, email) VALUES
('Alice', 'alice@example.com'),
('Bob', 'bob@example.com');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For Cassandra:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE KEYSPACE IF NOT EXISTS cdc_demo
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

USE cdc_demo;

CREATE TABLE customers (
    id INT PRIMARY KEY,
    name TEXT,
    email TEXT,
    created_at TIMESTAMP
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Registering the connectors
&lt;/h3&gt;

&lt;p&gt;Postgres configuration: &lt;code&gt;postgres_connector.json&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "name": "postgres-connector",  
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector", 
    "database.hostname": "postgres", 
    "database.port": "5432", 
    "database.user": "postgres", 
    "database.password": "postgres", 
    "database.dbname": "postgres", 
    "topic.prefix": "test", 
    "plugin.name": "pgoutput",
    "snapshot.mode": "initial"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Save this config to a JSON file and run this command to register the connector:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -X POST -H "Content-Type: application/json" \
     --data @postgres_connector.json \
     http://localhost:8083/connectors

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cassandra Configuration: &lt;code&gt;cassandra_connector.json&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "name": "stocks-sink",
  "config": {
    "connector.class": "com.datastax.kafkaconnector.DseSinkConnector",
    "tasks.max": "1",
    "topics": "test.public.customers,
    "contactPoints": "cassandra",
    "loadBalancing.localDc": "datacenter1",
    "port": 9042,
    "auth.provider": "None",
    "ssl.provider": "None",
    "topic.test.public.cdc_demo.customers.mapping": 
    "id=value.id, name=value.name, email=value.email, created_at=value.created_at",
    "transforms": "unwrap",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Save this config to a JSON file and run this command to register the connector:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -X POST -H "Content-Type: application/json" \
     --data @cassandra_connector.json \
     http://localhost:8083/connectors

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If we look into our Cassandra database now, we will be able to see the records that are in our PostgreSQL database. If we insert more data into our Postgres Database, this will also reflect in our Cassandra database.&lt;/p&gt;

&lt;p&gt;What we have just done is referred to as CDC (Change Data Capture). Essentially, it involves a source database and a sink database: every change in the source is captured and applied to the sink. This improves availability, since if the source database fails we still have an up-to-date copy of the data in the sink database. This is one of the applications of Kafka Connect.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Kafka is Used in the Industry
&lt;/h2&gt;

&lt;p&gt;To explain how Kafka is used in the real world, we will use the example of a popular company, Uber.&lt;/p&gt;

&lt;p&gt;Uber operates at a massive scale:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Millions of ride requests, driver updates, payments, GPS events, and ETA predictions per second.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;They need to process events in real-time for features like surge pricing, live tracking, fraud detection, and customer notifications.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Example Use Case
&lt;/h3&gt;

&lt;p&gt;When a rider presses “Request Ride” in the Uber app, the action is published as an event into a Kafka topic called rides. Since Kafka topics are partitioned, this request can be processed in parallel with millions of others, which helps Uber handle high throughput at scale. At the same time, every driver’s phone sends GPS updates to another Kafka topic called locations. Kafka’s durability makes sure these events are stored safely on disk, copied across brokers, and can be replayed if needed. Uber’s matching service consumes data from both rides and locations topics in real time, and Kafka’s consumer groups allow multiple consumers to share the work of processing events. Kafka also guarantees ordering within partitions, so a driver’s GPS updates are processed in the correct order, which helps the system calculate accurate ETAs. Once a rider-driver match is made, the result is published into another topic called matches, which is consumed by the notification service to update both apps instantly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo4ulxpyng6eruumqauxp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo4ulxpyng6eruumqauxp.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Understanding Docker, Dockerfile, and Docker Compose</title>
      <dc:creator>Fred Munjogu</dc:creator>
      <pubDate>Tue, 26 Aug 2025 14:31:54 +0000</pubDate>
      <link>https://dev.to/fredmunjogu/understanding-docker-dockerfile-and-docker-compose-3e73</link>
      <guid>https://dev.to/fredmunjogu/understanding-docker-dockerfile-and-docker-compose-3e73</guid>
      <description>&lt;h2&gt;
  
  
  What is Docker?
&lt;/h2&gt;

&lt;p&gt;Docker is an open-source platform that allows one to automate the deployment of applications inside containers. A container is a stand-alone unit that includes all components an application needs to run (code, libraries, and dependencies).&lt;/p&gt;

&lt;p&gt;Containers are different from virtual machines in that they share the host system's kernel. This makes them more resource-efficient and a lot easier to manage.&lt;/p&gt;

&lt;p&gt;Some of the benefits of using Docker include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Portability&lt;/li&gt;
&lt;li&gt;Efficiency&lt;/li&gt;
&lt;li&gt;Scalability&lt;/li&gt;
&lt;li&gt;Isolation&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What is a Dockerfile
&lt;/h2&gt;

&lt;p&gt;Now that we have an idea of what Docker is, we can move to a Dockerfile. A Dockerfile is a text document that contains a set of instructions used in building an image.&lt;/p&gt;

&lt;p&gt;An image refers to the blueprint of libraries and dependencies that are necessary inside a container for an application to run.&lt;/p&gt;

&lt;p&gt;Docker will read our Dockerfile and build an image that matches the specification listed. To understand this a bit more, let us write one and see what exactly happens.&lt;/p&gt;

&lt;p&gt;Using a code editor of your choice, create a file and name it &lt;code&gt;Dockerfile&lt;/code&gt;. Inside the file, write the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FROM python:3.12-slim

WORKDIR /app

COPY requirements.txt /app

RUN pip install -r /app/requirements.txt

COPY . /app

CMD["python", "main.py"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you may have guessed from the file above, we need a &lt;code&gt;requirements.txt&lt;/code&gt; file and a &lt;code&gt;main.py&lt;/code&gt; file. For this part, we will not write anything complex since the focus is on understanding containers and images. So in our &lt;code&gt;requirements.txt&lt;/code&gt; file, we will write the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;requests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In our &lt;code&gt;main.py&lt;/code&gt;, we can write a simple "Hello, world!" program.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print("Hello, world!")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With these files in place, we can build our first image. To build an image, we use the &lt;code&gt;docker build&lt;/code&gt; command. The "-t" flag sets the tag, which is a name you can assign to your image; if you do not provide one, the image will only be identifiable by its ID. The trailing period in the command is the build context, i.e., the directory containing our Dockerfile, which in our case is the current directory.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;docker build -t test .&lt;/code&gt;&lt;br&gt;
Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb9gkaj6ortj94skj6m01.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb9gkaj6ortj94skj6m01.png" alt="Results of building the image" width="800" height="311"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To confirm if the image has been built, use &lt;code&gt;docker images&lt;/code&gt; and look for an image with the name "test".&lt;/p&gt;

&lt;p&gt;Now that our image is built, we will create a container so that we can run our Python file. To create a container, we will use this command:&lt;br&gt;
&lt;code&gt;docker run --name test_container test&lt;/code&gt;&lt;br&gt;
Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdesggu5tz9sksqvqjfr3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdesggu5tz9sksqvqjfr3.png" alt="Running the container" width="369" height="56"&gt;&lt;/a&gt;&lt;br&gt;
Our Python file runs great!&lt;/p&gt;
&lt;h2&gt;
  
  
  What is Docker Compose?
&lt;/h2&gt;

&lt;p&gt;Docker Compose is a tool used to define and manage multi-container Docker applications. Suppose our project requires Python and a PostgreSQL Database. With Docker Compose, we can define and manage these services in a YAML file. Let us create this YAML file.&lt;br&gt;
Back in your code editor, create a new file &lt;code&gt;docker-compose.yml&lt;/code&gt;. Inside the file we will create our services:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;services:
  web:
    build: .
    ports:
      - "5000:5000"
    depends_on:
      - db

  db:
    image: postgres:14
    environment:
      POSTGRES_USER: user
      POSTGRES_PASSWORD: password
      POSTGRES_DB: mydatabase
    volumes:
      - db_data:/var/lib/postgresql/data

volumes:
  db_data:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this file, we have two services: web and our database. The &lt;code&gt;web&lt;/code&gt; service builds and runs the image from the Dockerfile we specified above. The &lt;code&gt;db&lt;/code&gt; service defines our Postgres database along with the username, password, and name of the database. Volumes ensure data consistency and persistence: when we stop and start the container again, all the data that was in the database is preserved.&lt;/p&gt;

&lt;p&gt;To run this, we will use the command &lt;code&gt;docker compose up&lt;/code&gt;. If we want to stop it, we will use &lt;code&gt;docker compose down&lt;/code&gt;. A tip when running &lt;code&gt;docker compose up&lt;/code&gt; is to use the "-d" flag, which runs the containers in detached mode and gives you your terminal back.&lt;/p&gt;

</description>
      <category>docker</category>
    </item>
    <item>
      <title>Mastering Data Engineering: 15 Essential Concepts for Building Reliable and Scalable Data Systems</title>
      <dc:creator>Fred Munjogu</dc:creator>
      <pubDate>Sun, 10 Aug 2025 20:19:10 +0000</pubDate>
      <link>https://dev.to/fredmunjogu/mastering-data-engineering-15-essential-concepts-for-building-reliable-and-scalable-data-systems-24bd</link>
      <guid>https://dev.to/fredmunjogu/mastering-data-engineering-15-essential-concepts-for-building-reliable-and-scalable-data-systems-24bd</guid>
      <description>&lt;p&gt;As a Data Engineering student, I believe there are a few fundamental concepts that are important for setting a good foundation in the field. In this article, I will focus on explaining these concepts and their importance. In some cases, I will also provide examples. Let's get to it, shall we?&lt;/p&gt;

&lt;h2&gt;
  
  
  Batch vs Streaming Ingestion
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Batch Ingestion
&lt;/h3&gt;

&lt;p&gt;Batch ingestion refers to the processing and loading of huge volumes of data in batches. These batches are usually chunks covering a predefined period (e.g., hourly, weekly, yearly). Batch ingestion is useful where real-time analysis is not needed. The beauty of batch processing is the large amount of data that can be processed at once, which ultimately keeps the procedure inexpensive since ingestion and processing can happen outside business hours. An example of batch ingestion is an e-commerce platform that exports its daily sales reports to a data warehouse. This data can then be analyzed and insights sent to the respective departments. This, however, happens after all the sales transactions have occurred.&lt;/p&gt;
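
&lt;p&gt;As a rough illustration (the file path, table name, and connection string are placeholders), a daily batch job often boils down to reading yesterday's export and appending it to a warehouse table in one go:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Minimal batch-ingestion sketch: load one day's sales export into a warehouse table.
# The file path, table name, and connection string are placeholders.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@warehouse:5432/analytics")

# One large chunk covering the whole day, loaded outside business hours.
daily_sales = pd.read_csv("exports/sales_2025-01-01.csv")
daily_sales.to_sql("daily_sales", engine, if_exists="append", index=False)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;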

&lt;h3&gt;
  
  
  Stream Ingestion
&lt;/h3&gt;

&lt;p&gt;Stream ingestion is the exact opposite of batch ingestion. Instead of processing/ingesting after a certain period, stream ingestion involves the immediate processing of data once it is produced. Stream ingestion has its advantages, such as real-time processing, which leads to real-time analysis, which is crucial for certain industries/businesses. An area that thrives on this is fraud detection. By conducting real-time analysis, any anomalies that may occur in the system are detected in real-time, leading to crisis aversion.&lt;/p&gt;

&lt;h2&gt;
  
  
  Change Data Capture (CDC)
&lt;/h2&gt;

&lt;p&gt;Change Data Capture (CDC) is a process of tracking changes such as inserts, updates, and deletes in a database. These change events are stored and used for operations such as replication or analysis. In stream ingestion systems, for example, CDC allows for real-time or near-real-time replication of data across different destinations such as databases. The ordered stream of change events is referred to as a CDC feed.&lt;br&gt;
Suppose we have a database and we add a new record to it. The CDC feed will contain the new record's data together with the type of operation that occurred (in our case, an insert).&lt;/p&gt;

&lt;h2&gt;
  
  
  Idempotency
&lt;/h2&gt;

&lt;p&gt;The first thing that comes to mind when I see the term idempotency is APIs. Why, you may ask? Let us first define idempotency, and hopefully by the end, you will be able to see how the two relate.&lt;br&gt;
Idempotency, in simple terms, is the property of an operation that ensures that repeating the same operation multiple times will yield the same result.&lt;br&gt;
Let us go back to APIs. Imagine we have a payment API that processes a purchase when a client sends a &lt;code&gt;POST /payments&lt;/code&gt; request. In APIs, we have an idempotency key that uniquely identifies a request. When a customer makes a payment, the API records the idempotency key. If the same request is sent again by the client, the API checks the idempotency key, sees that it has already been handled, and executes the operation only once. This prevents duplicate charges when requests are retried.&lt;/p&gt;
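
&lt;p&gt;Here is a minimal, illustrative sketch of the idea: the server remembers which idempotency keys it has already handled and returns the stored result for repeats, so retries never charge the customer twice. The in-memory dictionary stands in for whatever persistent store a real payment API would use.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Idempotency sketch: repeated requests with the same key are executed only once.
processed = {}  # maps idempotency_key to its result (a real API would persist this)

def handle_payment(idempotency_key, amount):
    if idempotency_key in processed:
        # Duplicate request: return the original result, do not charge again.
        return processed[idempotency_key]
    result = {"status": "charged", "amount": amount}  # stand-in for the real charge
    processed[idempotency_key] = result
    return result

# The client retries the same request three times; the charge happens only once.
for _ in range(3):
    print(handle_payment("key-123", 50))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;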

&lt;h2&gt;
  
  
  OLTP vs OLAP
&lt;/h2&gt;

&lt;p&gt;Online Analytical Processing (OLAP) refers to database systems designed primarily for complex data analytics and reporting. These systems enable advanced querying by analysts to identify patterns and forecast trends, which are critical for data-driven decision-making. OLAP leverages multidimensional data, enabling storage of various data types across different periods.&lt;/p&gt;

&lt;p&gt;Online Transactional Processing (OLTP) systems focus on handling database transactions. These are typically short, fast, and precise operations that keep databases current and consistent. OLTP systems support ACID (Atomicity, Consistency, Isolation, Durability) properties to ensure data integrity.&lt;/p&gt;

&lt;p&gt;For more detailed differences between the two, feel free to read this &lt;a href="https://dev.to/fredmunjogu/understanding-olap-and-oltp-when-to-use-each-database-system-2joh"&gt;article&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Columnar vs Row-based Storage
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Columnar Storage
&lt;/h3&gt;

&lt;p&gt;In this type of storage, data is stored in columns. This is especially helpful in data warehousing and analytics, where data is queried based on specific columns. Only the required columns are returned, leading to higher processing speeds. Having seen the differences between OLAP and OLTP, columnar storage is perfect for OLAP systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  Row-based Storage
&lt;/h3&gt;

&lt;p&gt;This is the complete opposite of columnar storage. Data is stored in rows, and querying data involves retrieving a specific row that contains all the information for the specified record. This practice is popular in traditional relational database management systems. This type of storage is well-suited for OLTP systems, as it is ideal for CRUD operations. However, analytical queries are slower here, since whole rows must be retrieved, whereas columnar storage can scan and aggregate only the required columns.&lt;/p&gt;

&lt;h2&gt;
  
  
  Partitioning
&lt;/h2&gt;

&lt;p&gt;Partitioning, in plain English, refers to dividing something into several parts, and the same applies to data. Suppose we have a large database that contains huge volumes of data. We can divide it into smaller, more manageable pieces. This immediately improves query performance since, instead of scanning the entire database, each query only touches the smaller partition that holds the relevant data.&lt;br&gt;
Partitioning is often combined with replication, where we create copies of the data, leading to availability and reliability. Consider having two databases, a master database and a slave database. We can create replicas of the master database and store them in the slave database. If the master database becomes unavailable, the slave database can still serve requests. This setup eliminates a single point of failure by maintaining two database instances.&lt;/p&gt;

&lt;h2&gt;
  
  
  ETL vs ELT
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Extract, Transform and Load (ETL)
&lt;/h3&gt;

&lt;p&gt;This is the more traditional technique for building pipelines. The first step is getting data from various data sources. After the data has been obtained, it is converted into the form required by the user (transform). The data is finally sent to a particular destination (e.g., a data warehouse) where users can access it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Extract, Load and Transform (ELT)
&lt;/h3&gt;

&lt;p&gt;The first part is similar to that of the ETL process. The difference comes in the next step: after the raw data is extracted, it is loaded directly into a data warehouse or data lake as is. Transformation is then done on an as-needed basis, converting the data into the intended format for tasks such as analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  CAP Theorem
&lt;/h2&gt;

&lt;p&gt;The CAP theorem states that a distributed system can only guarantee two of three desired characteristics at the same time.&lt;br&gt;
These characteristics are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Consistency&lt;/li&gt;
&lt;li&gt;Availability&lt;/li&gt;
&lt;li&gt;Partition tolerance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Consistency refers to the ability of clients to see the same data even when on different nodes.&lt;br&gt;
Availability refers to the ability of a client to receive a response even when some nodes are down.&lt;br&gt;
Partition tolerance refers to the system being active and functional despite the network connecting the nodes having any kind of faults.&lt;/p&gt;

&lt;h2&gt;
  
  
  Windowing in Streaming
&lt;/h2&gt;

&lt;p&gt;Streaming usually involves processing continuous, real-time data. For small volumes this is manageable, but it can quickly become overwhelming when the data being streamed is relatively large. To make sense of the incoming data, we apply measures that only show us the data in given chunks. This lets us understand what is currently happening without being bombarded by data from activities that happened some time back.&lt;br&gt;
Windowing does exactly this, and it can be done using different criteria (a small sketch follows the list below). Some of these criteria are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;time-based (e.g., 10-minute intervals)&lt;/li&gt;
&lt;li&gt;count-based (e.g., 50 messages)&lt;/li&gt;
&lt;/ul&gt;
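
&lt;p&gt;As a minimal sketch of a time-based tumbling window, the snippet below groups events into fixed 10-minute buckets and emits one count per bucket; the events list is a stand-in for messages arriving off a stream:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Tumbling-window sketch: count events per fixed 10-minute bucket.
from collections import defaultdict

WINDOW_SECONDS = 10 * 60

def window_start(timestamp):
    # Floor the event timestamp to the start of its 10-minute window.
    return timestamp - (timestamp % WINDOW_SECONDS)

counts = defaultdict(int)

# Stand-in for a stream: (unix_timestamp, payload) pairs, e.g. messages off a topic.
events = [(1700000000, "a"), (1700000100, "b"), (1700000700, "c")]
for timestamp, _payload in events:
    counts[window_start(timestamp)] += 1

for start, count in sorted(counts.items()):
    print(f"window starting at {start}: {count} events")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;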

&lt;h2&gt;
  
  
  DAGs and Workflow Orchestration
&lt;/h2&gt;

&lt;p&gt;A Directed Acyclic Graph (DAG) is a processing model that represents how different tasks will be executed and the dependencies between these tasks in a workflow. Workflow orchestration refers to the use of orchestration tools to control the execution of the created DAGs. &lt;br&gt;
Tools such as Airflow can also automate DAGs through schedulers that make sure the DAGs run on a specified schedule, as shown in the sketch below.&lt;/p&gt;
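
&lt;p&gt;As a minimal sketch (assuming a recent Airflow 2.x install; the task bodies are placeholders), a two-task DAG that runs daily, where loading only happens after extraction succeeds, could look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Minimal Airflow DAG sketch: "extract" runs before "load", scheduled daily.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling data from the source")  # placeholder task body

def load():
    print("loading data into the warehouse")  # placeholder task body

with DAG(
    dag_id="daily_sales_pipeline",
    start_date=datetime(2025, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task &amp;gt;&amp;gt; load_task  # the dependency: load only runs after extract succeeds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;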

&lt;h2&gt;
  
  
  Retry Logic &amp;amp; Dead Letter Queues
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Retry Logic
&lt;/h3&gt;

&lt;p&gt;Retry logic allows an operation to be attempted again after failure. This is important since the failure may be caused by a temporary connection issue, and because of the retry logic, this operation can be repeated. An example of this is a Kafka consumer that tries to read data from an API. This process fails because of a temporary connection issue, and the consumer waits a few seconds before trying the same operation again. &lt;/p&gt;
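
&lt;p&gt;A minimal sketch of the pattern: retry a flaky call a fixed number of times with an exponential backoff between attempts, and only give up (for example, by handing the message to a dead letter queue) once the last attempt fails. The &lt;code&gt;fetch&lt;/code&gt; callable is a placeholder for the real operation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Retry-with-backoff sketch: try the operation a few times before giving up.
import time

def fetch_with_retries(fetch, max_attempts=3, base_delay=1.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()  # the flaky operation, e.g. an API call
        except ConnectionError as error:
            if attempt == max_attempts:
                raise  # out of retries: hand the work off, e.g. to a dead letter queue
            wait = base_delay * 2 ** (attempt - 1)  # 1s, 2s, 4s, ...
            print(f"attempt {attempt} failed ({error}), retrying in {wait}s")
            time.sleep(wait)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;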

&lt;h3&gt;
  
  
  Dead Letter Queues
&lt;/h3&gt;

&lt;p&gt;This is a queue where messages or events are sent after they fail to be processed within the configured retry attempts. This keeps failed messages from blocking the rest of the pipeline while still preserving them for later inspection or reprocessing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Backfilling &amp;amp; Reprocessing
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Backfilling
&lt;/h3&gt;

&lt;p&gt;Say you are building an ETL pipeline that utilizes taxi data from various years. In your database, however, you only have data from the last 2 years and the current year. You realize you need data from the last 5 years to conduct an accurate analysis. You will use a process that will ingest data from the years that are not present in your database. This is what a backfill does. It sources historical data and adds it to the specified destination.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reprocessing
&lt;/h3&gt;

&lt;p&gt;This refers to processing existing data again. This may be due to the detection of errors in the existing data, and so you reprocess it to ensure you have the latest and most correct data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Governance
&lt;/h2&gt;

&lt;p&gt;Data governance is a discipline of data management that ensures data is gathered, processed, and stored securely and in a way that adheres to the policies that have been set.&lt;/p&gt;

&lt;h2&gt;
  
  
  Time Travel &amp;amp; Data Versioning
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Time Travel
&lt;/h3&gt;

&lt;p&gt;This refers to the ability to view and query data as it existed at a specific point in time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Versioning
&lt;/h3&gt;

&lt;p&gt;This refers to maintaining and tracking different versions of data over time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Distributed Processing
&lt;/h2&gt;

&lt;p&gt;This is a computing approach where large tasks are split into smaller subtasks and each of these subtasks is executed in parallel using different processors, and the results are combined. This speeds up the processing of large datasets and improves scalability.&lt;/p&gt;
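
&lt;p&gt;Engines like Spark apply this idea across a whole cluster; as a minimal single-machine sketch of the same split-work-and-combine pattern, Python's multiprocessing module can run the subtasks on separate processes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Split-apply-combine sketch on one machine with multiprocessing.
# Real distributed engines (e.g. Spark) apply the same idea across many nodes.
from multiprocessing import Pool

def count_words(chunk_of_lines):
    # The subtask each worker runs on its own slice of the data.
    return sum(len(line.split()) for line in chunk_of_lines)

if __name__ == "__main__":
    lines = ["the quick brown fox", "jumps over", "the lazy dog"] * 1000
    chunks = [lines[i::4] for i in range(4)]  # split the task into 4 subtasks

    with Pool(processes=4) as pool:
        partial_counts = pool.map(count_words, chunks)  # run subtasks in parallel

    print(sum(partial_counts))  # combine the partial results
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;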

</description>
    </item>
    <item>
      <title>Understanding OLAP and OLTP: When to Use Each Database System</title>
      <dc:creator>Fred Munjogu</dc:creator>
      <pubDate>Tue, 29 Jul 2025 13:48:18 +0000</pubDate>
      <link>https://dev.to/fredmunjogu/understanding-olap-and-oltp-when-to-use-each-database-system-2joh</link>
      <guid>https://dev.to/fredmunjogu/understanding-olap-and-oltp-when-to-use-each-database-system-2joh</guid>
      <description>&lt;h2&gt;
  
  
  OLAP Systems
&lt;/h2&gt;

&lt;p&gt;Online Analytical Processing (OLAP) refers to database systems designed primarily for complex data analytics and reporting. These systems enable advanced querying by analysts to identify patterns and forecast trends, which are critical for data-driven decision-making.&lt;/p&gt;

&lt;p&gt;OLAP leverages multidimensional data, enabling storage of various data types across different periods. It uses an OLAP cube, which aggregates data by attributes such as region or category. This allows analysts to drill down into more granular insights. For instance, if data is stored by region, analysts can slice it down to the state or city level to evaluate product performance in specific areas.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example of an OLAP System
&lt;/h3&gt;

&lt;p&gt;A common OLAP system is a data warehouse, a centralized repository that aggregates data from multiple sources. These systems typically handle massive datasets, often in terabytes (TB) or petabytes (PB). Once the data is transformed, analysts can discover relationships that generate insights for top-level management.&lt;/p&gt;

&lt;p&gt;An example query might look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Region-level analysis
SELECT c.region,
       AVG(s.sale_amount) AS avg_sales
FROM sales s
JOIN commodities cm 
  ON s.commodity_id = cm.commodity_id
JOIN cities c 
  ON s.city_id = c.city_id
WHERE cm.name = 'Wheat'
GROUP BY c.region
ORDER BY avg_sales DESC;

-- Drill-down to city-level analysis
SELECT c.region,
       c.name AS city,
       AVG(s.sale_amount) AS avg_sales
FROM sales s
JOIN commodities cm 
  ON s.commodity_id = cm.commodity_id
JOIN cities c 
  ON s.city_id = c.city_id
WHERE cm.name = 'Wheat'
  AND c.region = 'Midwest'  
GROUP BY c.region, c.name
ORDER BY avg_sales DESC; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  OLTP Systems
&lt;/h2&gt;

&lt;p&gt;Online Transactional Processing (OLTP) systems focus on handling database transactions. These are typically short, fast, and precise operations that keep databases current and consistent.&lt;/p&gt;

&lt;p&gt;OLTP systems support ACID (Atomicity, Consistency, Isolation, Durability) properties to ensure data integrity. Unlike OLAP, they are optimized for write-heavy operations, primarily insertions, updates, and deletions. Queries are designed to be quick to support real-time processing and offer a seamless user experience.&lt;/p&gt;

&lt;p&gt;Data in OLTP systems is usually smaller than in OLAP, typically measured in gigabytes (GB).&lt;/p&gt;

&lt;h3&gt;
  
  
  Example of an OLTP System
&lt;/h3&gt;

&lt;p&gt;A typical OLTP system is a Point of Sale (POS) system. When purchasing items, the POS terminal records the transaction, verifies available funds, and updates the inventory and payment records accordingly.&lt;/p&gt;

&lt;p&gt;Example queries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Reduce inventory
UPDATE item_table
SET bread = bread - 2;

-- Update payment amount
UPDATE item_table
SET amount = amount + cash;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, &lt;code&gt;cash&lt;/code&gt; represents the amount paid. As shown, OLTP queries are brief and optimized for low-latency processing, often completed in milliseconds.&lt;/p&gt;

&lt;h2&gt;
  
  
  Differences Between OLAP and OLTP Systems
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Purpose:&lt;/strong&gt; OLAP supports analytics and reporting; OLTP handles real-time transactions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Model:&lt;/strong&gt; OLAP uses multidimensional models for drill-down analysis; OLTP uses normalized (unidimensional) models to ensure consistency.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Type:&lt;/strong&gt; OLAP combines historical and current data for trend analysis; OLTP focuses on current data for real-time operations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Operation Focus:&lt;/strong&gt; OLAP emphasizes read operations; OLTP emphasizes write operations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Performance:&lt;/strong&gt; OLAP queries take longer due to large datasets; OLTP queries are faster, aided by indexing and smaller datasets.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Schema Design:&lt;/strong&gt; OLAP uses denormalized schemas (e.g., star, snowflake) for performance; OLTP uses normalized schemas (e.g., 3NF) to eliminate redundancy and ensure integrity.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Choosing the Right System
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;When should you use OLAP, and when should you use OLTP?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If your focus is on processing current transactional data quickly, OLTP is the right choice.&lt;/p&gt;

&lt;p&gt;However, if your goal is to analyze historical data to uncover trends and develop strategic insights, OLAP is more suitable.&lt;/p&gt;

&lt;p&gt;Most businesses use both systems: OLTP for day-to-day operations and OLAP for management-level analysis and planning.&lt;/p&gt;

&lt;p&gt;I hope this clarified the differences between OLAP and OLTP systems and helps you determine which is better suited to your needs. Feel free to ask questions, share feedback, or leave comments.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
