<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Fred Munjogu</title>
    <description>The latest articles on DEV Community by Fred Munjogu (@fredmunjogu).</description>
    <link>https://dev.to/fredmunjogu</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3375916%2F6d92ba25-79c1-4df4-b1fc-b862292cc0ad.png</url>
      <title>DEV Community: Fred Munjogu</title>
      <link>https://dev.to/fredmunjogu</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/fredmunjogu"/>
    <language>en</language>
    <item>
      <title>End-to-End Data Workflow: Kestra, Redshift, and dbt Integration</title>
      <dc:creator>Fred Munjogu</dc:creator>
      <pubDate>Wed, 29 Oct 2025 17:23:11 +0000</pubDate>
      <link>https://dev.to/fredmunjogu/end-to-end-data-workflow-kestra-redshift-and-dbt-integration-3ani</link>
      <guid>https://dev.to/fredmunjogu/end-to-end-data-workflow-kestra-redshift-and-dbt-integration-3ani</guid>
      <description>&lt;p&gt;Imagine that at the end of every month, you are required to download data from a particular source, load it into storage, and transfer it to a data warehouse. After the first time, you will quickly realize how repetitive these tasks can be. Since you will be moving similar data each month, it would be much easier if there were a way to automate this. Lucky for us, there are several tools at our disposal. In this article, we will focus on the following tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kestra&lt;/li&gt;
&lt;li&gt;dbt&lt;/li&gt;
&lt;li&gt;S3 Bucket&lt;/li&gt;
&lt;li&gt;Amazon Redshift&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We will be using NYC taxi trip data in this project (yellow and green taxis).&lt;/p&gt;

&lt;p&gt;If you want to follow along, you can create an AWS account; first-time users get $200 in credit, which will be more than enough. Redshift also offers $300 in credit for first-time users, which comes in handy since queries can become expensive if you are not careful or are working with large amounts of data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Kestra
&lt;/h3&gt;

&lt;p&gt;Kestra is an orchestration tool that we will leverage to automate most of the repetitive tasks. It is easy to use because each flow we write is defined in a &lt;code&gt;yaml&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;We will have two flows. The first will ingest the data, i.e., download it, upload it to our S3 bucket, and then load the schemas and tables into our data warehouse (Redshift). The second will run our dbt transformations against the data once it is in Redshift.&lt;/p&gt;
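&lt;p&gt;For orientation, here is a minimal sketch of what a Kestra flow looks like. This is purely illustrative (it is not the actual flow from the repo), and the plugin type names follow Kestra's documentation, so they may vary slightly between versions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;id: ingest_taxi_data        # illustrative flow name
namespace: nyc.taxi         # illustrative namespace

tasks:
  - id: download
    type: io.kestra.plugin.scripts.shell.Commands
    commands:
      - echo "download the monthly CSV here"

triggers:
  - id: monthly
    type: io.kestra.plugin.core.trigger.Schedule
    cron: "0 9 1 * *"       # run on the 1st of every month
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;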

&lt;h3&gt;
  
  
  Data Build Tool (DBT)
&lt;/h3&gt;

&lt;p&gt;After our data is in the warehouse, we will use dbt to clean, aggregate, and create analytics-ready tables. This is made possible through dbt models.&lt;/p&gt;
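&lt;p&gt;A dbt model is essentially a &lt;code&gt;SELECT&lt;/code&gt; statement saved as a SQL file, which dbt materializes as a table or view in the warehouse. Here is a minimal, hypothetical example; the file name, source, and column selection are illustrative and not the repo's actual models:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- models/staging/stg_yellow_trips.sql (hypothetical model)
select
    unique_row_id,
    tpep_pickup_datetime  as pickup_datetime,
    tpep_dropoff_datetime as dropoff_datetime,
    trip_distance,
    total_amount
from {{ source('public', 'yellow_taxi') }}  -- assumes a source declared in a schema.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;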

&lt;h3&gt;
  
  
  AWS Resources
&lt;/h3&gt;

&lt;p&gt;As I have mentioned before, we will be utilizing some AWS resources, and to set them up, we will be using Terraform.&lt;/p&gt;

&lt;p&gt;Terraform allows us to write a configuration that creates the resources with three simple commands (&lt;code&gt;terraform init&lt;/code&gt;, &lt;code&gt;terraform plan&lt;/code&gt;, and &lt;code&gt;terraform apply&lt;/code&gt;). This article will not focus much on Terraform itself, but I will mention the steps needed to get our resources up and running.&lt;/p&gt;

&lt;h2&gt;
  
  
  Configuring Kestra
&lt;/h2&gt;

&lt;p&gt;This will be our first step since Kestra contains the flows needed to run our pipeline. We will first start a Kestra instance with Docker (a quick-start command is shown below). Once it is running, we will navigate to the Kestra UI and begin by adding environment variables to the KV store, which is found in the namespace tab.&lt;/p&gt;
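&lt;p&gt;If you have not run Kestra before, its documented quick-start is a single Docker command along these lines (flags vary by Kestra version, and the project may use Docker Compose instead):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run --rm -it -p 8080:8080 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  kestra/kestra:latest server local
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;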

&lt;p&gt;Inside the KV store, we will add the following environment variables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;S3_BUCKET_NAME: the name of our S3 bucket.&lt;/li&gt;
&lt;li&gt;AWS_REGION: our AWS region&lt;/li&gt;
&lt;li&gt;JDBC_URL: string connection to our database in Redshift&lt;/li&gt;
&lt;li&gt;DB_USER: database username&lt;/li&gt;
&lt;li&gt;DB_PASS: database password&lt;/li&gt;
&lt;li&gt;KESTRA_ROLE: IAM role that allows Kestra to copy data from S3 to Redshift&lt;/li&gt;
&lt;li&gt;REDSHIFT_USER: our Redshift user configured in Terraform&lt;/li&gt;
&lt;li&gt;REDSHIFT_PASSWORD: our Redshift password&lt;/li&gt;
&lt;li&gt;REDSHIFT_HOST: our workgroup host&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With this, we are all set up to run our first flow. All the values stored in the KV store become available to our flows when we reference them using the &lt;code&gt;"{{ kv('var_name') }}"&lt;/code&gt; expression.&lt;/p&gt;

&lt;p&gt;Here is a snippet of our KV store:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpplvrr0sqv5rvqq1gzin.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpplvrr0sqv5rvqq1gzin.png" alt="Kestra's KV store" width="800" height="490"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Running our first flow
&lt;/h3&gt;

&lt;p&gt;The entire project can be found &lt;a href="https://github.com/Munjogu123/taxi-nyc-dbt" rel="noopener noreferrer"&gt;here&lt;/a&gt;. We will run the flow named &lt;code&gt;redshift_taxi_scheduled.yaml&lt;/code&gt;, located in the &lt;code&gt;kestra/flows&lt;/code&gt; directory. This flow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Downloads Yellow/Green Taxi Data (depending on the taxi you choose)&lt;/li&gt;
&lt;li&gt;Uploads to S3&lt;/li&gt;
&lt;li&gt;Loads to Redshift&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To run this flow, we will navigate to the triggers tab, where we will execute a backfill. The triggers tab looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhglpxw7y7efe6xbrd541.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhglpxw7y7efe6xbrd541.png" alt="Triggers tab" width="800" height="490"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We will click on the backfill execution for the green schedule, which will download taxi data for the green cabs. We will be asked to choose the date range for which data will be downloaded; for this article, we will use January 1, 2021, through June 30, 2021.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqjs3laoz2dfi4u2zi91p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqjs3laoz2dfi4u2zi91p.png" alt="Backfill Execution" width="800" height="490"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We will also add an execution label (backfill: true) so that we know this is a backfill when it runs. After doing this, we can execute our flow, and it will begin downloading and uploading the green taxi data.&lt;/p&gt;

&lt;p&gt;Kestra has a neat feature: during an execution, you can view a Gantt chart showing each task as it runs and its status (success or failure).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fswg0ek5kz5h1snhl7z98.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fswg0ek5kz5h1snhl7z98.png" alt="Green taxi execution" width="800" height="497"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since we are doing it for six months, our flow will run six times. If we look at our executions tab at the end of the six runs, we will see that all six runs were successful, and next to them are the labels of the files downloaded.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkforsrz5vpo1hkdd5mei.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkforsrz5vpo1hkdd5mei.png" alt="Executions" width="800" height="349"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see that our flows executed successfully. Now, we can check to see if our S3 bucket contains any data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foxgdvi5oehpeeuvx54at.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foxgdvi5oehpeeuvx54at.png" alt="S3 bucket" width="800" height="355"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Inside our S3 bucket, we can see that our files have been uploaded and named similarly to the labels we saw in Kestra.&lt;/p&gt;

&lt;p&gt;We will do the same for the yellow taxi data. By the end of these executions, we will have twelve objects in our S3 bucket (6 for green taxis and the other 6 for the yellow taxis). &lt;/p&gt;

&lt;p&gt;We can now confirm whether our tables have been loaded into Redshift. Before we do, let me explain a section of the flow. Inside the flow we have been running, there is a segment responsible for loading data into Redshift.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;yellow_table_create&lt;/span&gt;
        &lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;io.kestra.plugin.jdbc.redshift.Query&lt;/span&gt;
        &lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{kv('JDBC_URL')}}"&lt;/span&gt;
        &lt;span class="na"&gt;username&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{kv('DB_USER')}}"&lt;/span&gt;
        &lt;span class="na"&gt;password&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{kv('DB_PASS')}}"&lt;/span&gt;
        &lt;span class="na"&gt;sql&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;CREATE TABLE IF NOT EXISTS {{render(vars.table)}} (&lt;/span&gt;
              &lt;span class="s"&gt;unique_row_id          text,&lt;/span&gt;
              &lt;span class="s"&gt;filename               text,&lt;/span&gt;
              &lt;span class="s"&gt;VendorID               text,&lt;/span&gt;
              &lt;span class="s"&gt;tpep_pickup_datetime   timestamp,&lt;/span&gt;
              &lt;span class="s"&gt;tpep_dropoff_datetime  timestamp,&lt;/span&gt;
              &lt;span class="s"&gt;passenger_count        integer,&lt;/span&gt;
              &lt;span class="s"&gt;trip_distance          double precision,&lt;/span&gt;
              &lt;span class="s"&gt;RatecodeID             text,&lt;/span&gt;
              &lt;span class="s"&gt;store_and_fwd_flag     text,&lt;/span&gt;
              &lt;span class="s"&gt;PULocationID           text,&lt;/span&gt;
              &lt;span class="s"&gt;DOLocationID           text,&lt;/span&gt;
              &lt;span class="s"&gt;payment_type           integer,&lt;/span&gt;
              &lt;span class="s"&gt;fare_amount            double precision,&lt;/span&gt;
              &lt;span class="s"&gt;extra                  double precision,&lt;/span&gt;
              &lt;span class="s"&gt;mta_tax                double precision,&lt;/span&gt;
              &lt;span class="s"&gt;tip_amount             double precision,&lt;/span&gt;
              &lt;span class="s"&gt;tolls_amount           double precision,&lt;/span&gt;
              &lt;span class="s"&gt;improvement_surcharge  double precision,&lt;/span&gt;
              &lt;span class="s"&gt;total_amount           double precision,&lt;/span&gt;
              &lt;span class="s"&gt;congestion_surcharge   double precision&lt;/span&gt;
          &lt;span class="s"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates our main table, and based on the columns, it creates the yellow taxi main table in Redshift. The next step is creating the staging table, which will be used to deduplicate data before loading it to the main table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;yellow_create_staging_table&lt;/span&gt;
        &lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;io.kestra.plugin.jdbc.redshift.Query&lt;/span&gt;
        &lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{kv('JDBC_URL')}}"&lt;/span&gt;
        &lt;span class="na"&gt;username&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{kv('DB_USER')}}"&lt;/span&gt;
        &lt;span class="na"&gt;password&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{kv('DB_PASS')}}"&lt;/span&gt;
        &lt;span class="na"&gt;sql&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;CREATE TABLE IF NOT EXISTS {{render(vars.staging_table)}} (LIKE {{render(vars.table)}});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The staging table has the same structure as the main table. The next step is to truncate it. On the first run this has no effect, since the table is empty, but in subsequent runs it matters. We will come back to this shortly.&lt;/p&gt;
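&lt;p&gt;For reference, the truncate step is a task with the same shape as the others in this flow (the exact task id in the repo may differ):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;- id: yellow_truncate_staging_table
  type: io.kestra.plugin.jdbc.redshift.Query
  url: "{{kv('JDBC_URL')}}"
  username: "{{kv('DB_USER')}}"
  password: "{{kv('DB_PASS')}}"
  sql: |
    TRUNCATE TABLE {{render(vars.staging_table)}};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;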

&lt;p&gt;After our staging table is created and then truncated, we move to the next step, where we copy data into the staging table from the S3 files we uploaded earlier. This is done by this task:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt; &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;yellow_copy_into_staging_table&lt;/span&gt;
        &lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;io.kestra.plugin.jdbc.redshift.Query&lt;/span&gt;
        &lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{kv('JDBC_URL')}}"&lt;/span&gt;
        &lt;span class="na"&gt;username&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{kv('DB_USER')}}"&lt;/span&gt;
        &lt;span class="na"&gt;password&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{kv('DB_PASS')}}"&lt;/span&gt;
        &lt;span class="na"&gt;sql&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;COPY {{render(vars.staging_table)}} (&lt;/span&gt;
            &lt;span class="s"&gt;VendorID, tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count,&lt;/span&gt;
            &lt;span class="s"&gt;trip_distance, RatecodeID, store_and_fwd_flag, PULocationID, DOLocationID,&lt;/span&gt;
            &lt;span class="s"&gt;payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount,&lt;/span&gt;
            &lt;span class="s"&gt;improvement_surcharge, total_amount, congestion_surcharge&lt;/span&gt;
          &lt;span class="s"&gt;)&lt;/span&gt;
          &lt;span class="s"&gt;FROM 's3://{{kv("S3_BUCKET_NAME")}}/{{render(vars.file)}}'&lt;/span&gt;
          &lt;span class="s"&gt;IAM_ROLE '{{kv("KESTRA_ROLE")}}'&lt;/span&gt;
          &lt;span class="s"&gt;FORMAT AS CSV&lt;/span&gt;
          &lt;span class="s"&gt;IGNOREHEADER 1;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We specify the columns here because the tables we created have two additional columns that are not present in the CSV files we are copying from. Those two columns exist for identification and deduplication: the &lt;code&gt;unique_row_id&lt;/code&gt; column will hold a unique value per row, and it is what we will use to ensure there is no duplicate data in our main table.&lt;/p&gt;

&lt;p&gt;We then populate the unique IDs and the filename. This is done using the following task:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt; &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;yellow_add_unique_id_and_filename&lt;/span&gt;
        &lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;io.kestra.plugin.jdbc.redshift.Query&lt;/span&gt;
        &lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{kv('JDBC_URL')}}"&lt;/span&gt;
        &lt;span class="na"&gt;username&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{kv('DB_USER')}}"&lt;/span&gt;
        &lt;span class="na"&gt;password&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{kv('DB_PASS')}}"&lt;/span&gt;
        &lt;span class="na"&gt;sql&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;UPDATE {{render(vars.staging_table)}}&lt;/span&gt;
          &lt;span class="s"&gt;SET &lt;/span&gt;
            &lt;span class="s"&gt;unique_row_id = md5(&lt;/span&gt;
              &lt;span class="s"&gt;COALESCE(VendorID, '') ||&lt;/span&gt;
              &lt;span class="s"&gt;COALESCE(CAST(tpep_pickup_datetime AS varchar), '') ||&lt;/span&gt;
              &lt;span class="s"&gt;COALESCE(CAST(tpep_dropoff_datetime AS varchar), '') ||&lt;/span&gt;
              &lt;span class="s"&gt;COALESCE(PULocationID, '') ||&lt;/span&gt;
              &lt;span class="s"&gt;COALESCE(DOLocationID, '') ||&lt;/span&gt;
              &lt;span class="s"&gt;COALESCE(CAST(fare_amount AS varchar), '') ||&lt;/span&gt;
              &lt;span class="s"&gt;COALESCE(CAST(trip_distance AS varchar), '')&lt;/span&gt;
            &lt;span class="s"&gt;),&lt;/span&gt;
            &lt;span class="s"&gt;filename = '{{render(vars.file)}}';&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We then merge the staging table's records into our main table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;yellow_merge_data&lt;/span&gt;
        &lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;io.kestra.plugin.jdbc.redshift.Query&lt;/span&gt;
        &lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{kv('JDBC_URL')}}"&lt;/span&gt;
        &lt;span class="na"&gt;username&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{kv('DB_USER')}}"&lt;/span&gt;
        &lt;span class="na"&gt;password&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{kv('DB_PASS')}}"&lt;/span&gt;
        &lt;span class="na"&gt;sql&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;MERGE INTO {{render(vars.table)}}&lt;/span&gt;
          &lt;span class="s"&gt;USING {{render(vars.staging_table)}} AS S&lt;/span&gt;
          &lt;span class="s"&gt;ON {{render(vars.table)}}.unique_row_id = S.unique_row_id&lt;/span&gt;
          &lt;span class="s"&gt;WHEN MATCHED THEN&lt;/span&gt;
            &lt;span class="s"&gt;UPDATE SET&lt;/span&gt;
              &lt;span class="s"&gt;unique_row_id = S.unique_row_id&lt;/span&gt;
          &lt;span class="s"&gt;WHEN NOT MATCHED THEN&lt;/span&gt;
            &lt;span class="s"&gt;INSERT (&lt;/span&gt;
              &lt;span class="s"&gt;unique_row_id, filename, VendorID, tpep_pickup_datetime, tpep_dropoff_datetime,&lt;/span&gt;
              &lt;span class="s"&gt;passenger_count, trip_distance, RatecodeID, store_and_fwd_flag, PULocationID,&lt;/span&gt;
              &lt;span class="s"&gt;DOLocationID, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount,&lt;/span&gt;
              &lt;span class="s"&gt;improvement_surcharge, total_amount, congestion_surcharge&lt;/span&gt;
            &lt;span class="s"&gt;)&lt;/span&gt;
            &lt;span class="s"&gt;VALUES (&lt;/span&gt;
              &lt;span class="s"&gt;S.unique_row_id, S.filename, S.VendorID, S.tpep_pickup_datetime, S.tpep_dropoff_datetime,&lt;/span&gt;
              &lt;span class="s"&gt;S.passenger_count, S.trip_distance, S.RatecodeID, S.store_and_fwd_flag, S.PULocationID,&lt;/span&gt;
              &lt;span class="s"&gt;S.DOLocationID, S.payment_type, S.fare_amount, S.extra, S.mta_tax, S.tip_amount, S.tolls_amount,&lt;/span&gt;
              &lt;span class="s"&gt;S.improvement_surcharge, S.total_amount, S.congestion_surcharge&lt;/span&gt;
            &lt;span class="s"&gt;);&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you look closely, you can see the &lt;code&gt;WHEN NOT MATCHED THEN&lt;/code&gt; clause. The merge compares the &lt;code&gt;unique_row_id&lt;/code&gt; column of the staging table against the main table: rows whose IDs are not already present (i.e., not duplicates) are inserted into the main table, while rows that already exist are left effectively unchanged.&lt;/p&gt;

&lt;p&gt;I mentioned truncating our staging table and how it only makes sense in subsequent runs. After this first run, both our main table and our staging table contain the same data (January, in our case). When the next flow executes, the steps that create the main table and the staging table are skipped, since both already exist (&lt;code&gt;CREATE TABLE IF NOT EXISTS&lt;/code&gt;). Then comes the truncation of the staging table: it is emptied and loaded with the records for the second month. So at this point, the main table holds data for January, while the staging table holds data for February.&lt;/p&gt;

&lt;p&gt;Through the &lt;code&gt;unique_row_id&lt;/code&gt; column, the merge then inserts only the February rows whose IDs do not already exist in the main table, so no duplicates are introduced.&lt;/p&gt;

&lt;p&gt;Inside Redshift, we can confirm if our tables have been created. We are supposed to have two green taxi tables (main and staging) and two yellow tables (main and staging).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fea8cf5b92yf6yf722puo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fea8cf5b92yf6yf722puo.png" alt="Redshift tables" width="490" height="277"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Running our Second flow
&lt;/h3&gt;

&lt;p&gt;Now that we have our tables in Redshift, we can move to the transformations part using dbt. For this, we will use the flow &lt;code&gt;dbt_redshift.yaml&lt;/code&gt; found in the &lt;code&gt;kestra/flows&lt;/code&gt; directory.&lt;/p&gt;

&lt;p&gt;What this flow does is clone our repository and then sync the dbt project, which is located in the &lt;code&gt;taxi_dbt&lt;/code&gt; directory, into our namespace. This allows Kestra to find our project and, more specifically, the &lt;code&gt;dbt_project.yml&lt;/code&gt; file.&lt;/p&gt;
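&lt;p&gt;In outline, that part of the flow looks something like the sketch below. The task types come from Kestra's Git and dbt plugins; the repo's actual flow may name and structure them differently, so treat this as an approximation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;id: dbt_redshift            # illustrative flow name
namespace: nyc.taxi         # illustrative namespace

tasks:
  - id: working_dir
    type: io.kestra.plugin.core.flow.WorkingDirectory
    tasks:
      - id: clone_repo
        type: io.kestra.plugin.git.Clone
        url: https://github.com/Munjogu123/taxi-nyc-dbt
        branch: main

      - id: dbt_build
        type: io.kestra.plugin.dbt.cli.DbtCLI
        commands:
          - dbt build --project-dir taxi_dbt
        profiles: |
          # Redshift connection details go here -- see the next section
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;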

&lt;p&gt;There are a few things we need to specify before our flow runs successfully. We need to add a profiles section so that Kestra and dbt know the location of our data and where the tables our models create will be stored. If you have used dbt Core, you can get these credentials in the &lt;code&gt;~/.dbt/profiles.yml&lt;/code&gt; file. &lt;br&gt;
Run&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; ~/.dbt/profiles.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You will see the credentials dbt needs. These values go into the flow's &lt;code&gt;profiles&lt;/code&gt; section:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt; &lt;span class="na"&gt;profiles&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;taxi_dbt:&lt;/span&gt;
        &lt;span class="s"&gt;outputs:&lt;/span&gt;
          &lt;span class="s"&gt;dev:&lt;/span&gt;
            &lt;span class="s"&gt;type: redshift&lt;/span&gt;
            &lt;span class="s"&gt;host: {{ kv('REDSHIFT_HOST') }}&lt;/span&gt;
            &lt;span class="s"&gt;user: "{{ kv('REDSHIFT_USER') }}"&lt;/span&gt;
            &lt;span class="s"&gt;password: "{{ kv('REDSHIFT_PASSWORD') }}"&lt;/span&gt;
            &lt;span class="s"&gt;port: 5439&lt;/span&gt;
            &lt;span class="s"&gt;dbname: dev&lt;/span&gt;
            &lt;span class="s"&gt;schema: public&lt;/span&gt;
            &lt;span class="s"&gt;autocommit: true&lt;/span&gt;
            &lt;span class="s"&gt;threads: 3&lt;/span&gt;
            &lt;span class="s"&gt;connect_timeout: 300&lt;/span&gt;
        &lt;span class="s"&gt;target: dev&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you haven't installed dbt on your machine, you can find these connection details in the Redshift console under your workgroup's details.&lt;/p&gt;

&lt;p&gt;With configurations out of the way, we can run our dbt flow. Inside Kestra, execute the flow.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9qyg8hjtl6t3uoppv0pg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9qyg8hjtl6t3uoppv0pg.png" alt="DBT flow" width="800" height="442"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see that the flow executed successfully and also passed all the tests I had written. These ensure that the models we create have consistent data, which prevents our pipeline from breaking down.&lt;/p&gt;
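&lt;p&gt;For context, dbt tests like these are declared in a &lt;code&gt;schema.yml&lt;/code&gt; file alongside the models. Here is a minimal, hypothetical example; the model and column names are illustrative and not the repo's actual tests:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;version: 2

models:
  - name: stg_yellow_trips      # hypothetical model name
    columns:
      - name: unique_row_id
        tests:
          - unique
          - not_null
      - name: total_amount
        tests:
          - not_null
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;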

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4mrg4zrmxn00c9swj0ld.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4mrg4zrmxn00c9swj0ld.png" alt="Tests confirmation" width="800" height="239"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see the number of tests and whether they passed. This marks the end of the pipeline, and because of our triggers, it will run monthly. The entire workflow is now automated; what remains is monitoring the logs to ensure no abnormalities occur and identifying bottlenecks that minor tweaks to the workflow could resolve.&lt;/p&gt;

&lt;p&gt;Remember to destroy the resources once done with the project to avoid incurring additional costs. For more explanation on usage and setup, refer to the README at the root of this &lt;a href="https://github.com/Munjogu123/taxi-nyc-dbt" rel="noopener noreferrer"&gt;repo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;With our models, we can now pass the data in the tables created to any analytics or BI tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;We have executed two flows that extract data, upload it to S3, copy the content from S3 to Redshift, and perform transformations using dbt.&lt;/p&gt;

&lt;p&gt;I hope this article was helpful and informative. If you want to read more about Terraform or Infrastructure as Code (IaC), here is a &lt;a href="https://dev.to/fredmunjogu/automating-aws-infrastructure-for-a-fastapi-application-with-terraform-58lp"&gt;link&lt;/a&gt; to an article on a project I did that goes into more detail about the configurations and Terraform itself.&lt;/p&gt;

&lt;p&gt;Here is a link to this project's &lt;a href="https://github.com/Munjogu123/taxi-nyc-dbt" rel="noopener noreferrer"&gt;repo&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>automation</category>
      <category>tutorial</category>
      <category>aws</category>
    </item>
    <item>
      <title>Automating AWS Infrastructure for a FastAPI Application with Terraform</title>
      <dc:creator>Fred Munjogu</dc:creator>
      <pubDate>Thu, 16 Oct 2025 16:31:32 +0000</pubDate>
      <link>https://dev.to/fredmunjogu/automating-aws-infrastructure-for-a-fastapi-application-with-terraform-58lp</link>
      <guid>https://dev.to/fredmunjogu/automating-aws-infrastructure-for-a-fastapi-application-with-terraform-58lp</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In this project, I automated the deployment of a FastAPI-based Book Review API by provisioning all the necessary AWS resources using Terraform. The goal was to build a simple, scalable, and secure cloud architecture that separates the web and database tiers while maintaining infrastructure as code (IaC) principles.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;In this project, we will create a few AWS resources. These resources include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Virtual Private Cloud (VPC)&lt;/li&gt;
&lt;li&gt;Public subnet (web server)&lt;/li&gt;
&lt;li&gt;Private subnet (database server)&lt;/li&gt;
&lt;li&gt;Internet gateway for public subnet connectivity&lt;/li&gt;
&lt;li&gt;NAT gateway for private subnet outbound access&lt;/li&gt;
&lt;li&gt;Route tables and routing configuration&lt;/li&gt;
&lt;li&gt;Security groups for both tiers&lt;/li&gt;
&lt;li&gt;EC2 instances &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's go through each of these steps to gain a better understanding of how we will deploy our app.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz6iqp0ubblj7i6g79t3t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz6iqp0ubblj7i6g79t3t.png" alt="Architecture diagram" width="800" height="1200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Deploying our Web Server
&lt;/h2&gt;

&lt;p&gt;We will first need to create a VPC (Virtual Private Cloud), which, in simple terms, is a virtual network similar to a traditional network. The VPC will allow us to add subnets, which are ranges of IP addresses within our VPC.&lt;/p&gt;
&lt;h3&gt;
  
  
  VPC
&lt;/h3&gt;

&lt;p&gt;Here is a snippet of our Terraform configuration to create a VPC:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_vpc"&lt;/span&gt; &lt;span class="s2"&gt;"test_vpc"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;cidr_block&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"10.0.0.0/16"&lt;/span&gt;

  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"api_vpc"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;enable_dns_hostnames&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="nx"&gt;enable_dns_support&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Public subnet
&lt;/h3&gt;

&lt;p&gt;With our VPC now configured, we can create a public subnet. To create the subnet, we specify the VPC it belongs to and its IP address range. We will also set a property so that instances launched in this subnet are automatically assigned a public IP.&lt;/p&gt;

&lt;p&gt;Here is a snippet of how we will create our public subnet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_subnet"&lt;/span&gt; &lt;span class="s2"&gt;"public_subnet"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_id&lt;/span&gt;                  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;test_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;cidr_block&lt;/span&gt;              &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"10.0.0.0/24"&lt;/span&gt;
  &lt;span class="nx"&gt;availability_zone&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_availability_zone&lt;/span&gt;
  &lt;span class="nx"&gt;map_public_ip_on_launch&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"public-subnet"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Internet Gateway
&lt;/h3&gt;

&lt;p&gt;After creating our public subnet, we will now add an internet gateway to allow our instances access to the internet. This involves adding an internet gateway resource and attaching it to our VPC.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_internet_gateway"&lt;/span&gt; &lt;span class="s2"&gt;"igw"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;test_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;

  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"test-igw"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Routing table and routes
&lt;/h3&gt;

&lt;p&gt;We will then create a routing table and add routes for how our traffic in the public subnet will flow. In our routing table, we will specify a CIDR block of &lt;code&gt;0.0.0.0/0&lt;/code&gt;, which will allow outbound traffic from resources in the public subnet to access the internet.&lt;/p&gt;
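&lt;p&gt;Here is roughly what that route table looks like (the resource and gateway names match the surrounding snippets; the tag is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;resource "aws_route_table" "public_route" {
  vpc_id = aws_vpc.test_vpc.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.igw.id
  }

  tags = {
    Name = "public-rt"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;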

&lt;h3&gt;
  
  
  Routing table association
&lt;/h3&gt;

&lt;p&gt;The final step is associating these route tables with the public subnet. This will apply all the rules to the public subnet and dictate how outbound traffic will flow.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_route_table_association"&lt;/span&gt; &lt;span class="s2"&gt;"public"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;subnet_id&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_subnet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;public_subnet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;route_table_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_route_table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;public_route&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Security groups
&lt;/h3&gt;

&lt;p&gt;By now, our web server can access the internet, but we need to specify inbound and outbound rules to limit who can reach it. As you may have guessed, this is a security measure to ensure only authorized traffic reaches our servers.&lt;/p&gt;

&lt;p&gt;Security groups let us declare these rules explicitly. Below are some of the rules I specified for my web server.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Creating the security group&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_security_group"&lt;/span&gt; &lt;span class="s2"&gt;"webserver"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"webserversg"&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"security group for webservers"&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_id&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;test_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"WebServerSG"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Adding inbound and outbound rules&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# allow ssh, http, and https traffic to webserver&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_vpc_security_group_ingress_rule"&lt;/span&gt; &lt;span class="s2"&gt;"allow_ssh_webserver"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;security_group_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_security_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;webserver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;cidr_ipv4&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${chomp(data.http.myip.response_body)}/32"&lt;/span&gt;
  &lt;span class="nx"&gt;from_port&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;22&lt;/span&gt;
  &lt;span class="nx"&gt;ip_protocol&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"tcp"&lt;/span&gt;
  &lt;span class="nx"&gt;to_port&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;22&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_vpc_security_group_ingress_rule"&lt;/span&gt; &lt;span class="s2"&gt;"allow_http"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;security_group_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_security_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;webserver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;cidr_ipv4&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"0.0.0.0/0"&lt;/span&gt;
  &lt;span class="nx"&gt;from_port&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;
  &lt;span class="nx"&gt;ip_protocol&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"tcp"&lt;/span&gt;
  &lt;span class="nx"&gt;to_port&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_vpc_security_group_ingress_rule"&lt;/span&gt; &lt;span class="s2"&gt;"allow_https"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;security_group_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_security_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;webserver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;cidr_ipv6&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"::/0"&lt;/span&gt;
  &lt;span class="nx"&gt;from_port&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;443&lt;/span&gt;
  &lt;span class="nx"&gt;ip_protocol&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"tcp"&lt;/span&gt;
  &lt;span class="nx"&gt;to_port&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;443&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# adding egress rules for all traffic&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_vpc_security_group_egress_rule"&lt;/span&gt; &lt;span class="s2"&gt;"allow_all_traffic_webserver"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;security_group_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_security_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;webserver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;cidr_ipv4&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"0.0.0.0/0"&lt;/span&gt;
  &lt;span class="nx"&gt;ip_protocol&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"-1"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ingress refers to the inbound rules, while egress refers to the outbound rules. From the snippet above, you see that I allow inbound traffic from SSH (port 22), HTTP (port 80), and HTTPS (port 443). In the SSH part, I specified my own IP since it was in development, but in a production setting, you would specify an IP address range.&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating our EC2 Instance
&lt;/h3&gt;

&lt;p&gt;The final step is creating an EC2 instance where our app will be deployed. We will use an Ubuntu image and launch the instance in our public subnet.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# create an EC2 instance&lt;/span&gt;
&lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="s2"&gt;"aws_ssm_parameter"&lt;/span&gt; &lt;span class="s2"&gt;"ubuntu_ami"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"/aws/service/canonical/ubuntu/server/24.04/stable/current/amd64/hvm/ebs-gp3/ami-id"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# webserver instance&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_instance"&lt;/span&gt; &lt;span class="s2"&gt;"web_instance"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;ami&lt;/span&gt;                         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_ssm_parameter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ubuntu_ami&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;
  &lt;span class="nx"&gt;instance_type&lt;/span&gt;               &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"t3.micro"&lt;/span&gt;
  &lt;span class="nx"&gt;key_name&lt;/span&gt;                    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"fredssh"&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_security_group_ids&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;aws_security_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;webserver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="nx"&gt;subnet_id&lt;/span&gt;                   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_subnet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;public_subnet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;associate_public_ip_address&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="nx"&gt;user_data_base64&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;filebase64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"webserver.sh"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"webserver-instance"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;user_data_base64&lt;/code&gt; attribute points to a bash script that installs and starts nginx when the EC2 instance first boots. With that, we have a web server running nginx, ready to serve our app.&lt;/p&gt;
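&lt;p&gt;As a rough sketch, such a user-data script might look like the following (the repo's actual &lt;code&gt;webserver.sh&lt;/code&gt; may differ):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;#!/bin/bash
# webserver.sh - runs once on first boot via EC2 user data
apt-get update -y
apt-get install -y nginx
systemctl enable nginx
systemctl start nginx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;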

&lt;p&gt;We will follow similar steps when building the private instance that will host our database, with a few differences. First, we do not want this server to be directly reachable from the internet; it should only be accessible from the web server. Since it still needs outbound internet access, we will use a NAT gateway.&lt;/p&gt;

&lt;h3&gt;
  
  
  NAT gateway
&lt;/h3&gt;

&lt;p&gt;To create this gateway, we first need an Elastic IP, which the NAT gateway will use. After creating the gateway, we will set up routing the same way we did for the internet gateway, only this time for the private subnet. This enables the private subnet to reach the internet without being directly exposed or allowing unsolicited inbound connections.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# allocate elastic ip to nat gateway&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_eip"&lt;/span&gt; &lt;span class="s2"&gt;"nat_eip"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;domain&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"vpc"&lt;/span&gt;

  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"nat-eip"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# create a NAT gateway&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_nat_gateway"&lt;/span&gt; &lt;span class="s2"&gt;"nat"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;allocation_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_eip&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;nat_eip&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;subnet_id&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_subnet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;public_subnet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;

  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"nat-gw"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;depends_on&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;aws_internet_gateway&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;igw&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# routing&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_route_table"&lt;/span&gt; &lt;span class="s2"&gt;"private_route"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;test_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;

  &lt;span class="nx"&gt;route&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;cidr_block&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"0.0.0.0/0"&lt;/span&gt;
    &lt;span class="nx"&gt;nat_gateway_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_nat_gateway&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;nat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"private-rt"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The other difference is how we define the security group. For the private instance, we will allow inbound SSH (port 22) and Postgres (port 5432) traffic, but only from the web server's security group, since Postgres is the database we will be using.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Creating the security group&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_security_group"&lt;/span&gt; &lt;span class="s2"&gt;"dbserver"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"dbserversg"&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"security group for database servers"&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_id&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;test_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"DBServerSG"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Defining the inbound and outbound rules&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# allow ssh, and Postgres traffic to dbserver from instances in the webserver&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_vpc_security_group_ingress_rule"&lt;/span&gt; &lt;span class="s2"&gt;"allow_ssh_db"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;security_group_id&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_security_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;dbserver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;referenced_security_group_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_security_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;webserver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;from_port&lt;/span&gt;                    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;22&lt;/span&gt;
  &lt;span class="nx"&gt;ip_protocol&lt;/span&gt;                  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"tcp"&lt;/span&gt;
  &lt;span class="nx"&gt;to_port&lt;/span&gt;                      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;22&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_vpc_security_group_ingress_rule"&lt;/span&gt; &lt;span class="s2"&gt;"allow_postgres"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;security_group_id&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_security_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;dbserver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;referenced_security_group_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_security_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;webserver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;from_port&lt;/span&gt;                    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5432&lt;/span&gt;
  &lt;span class="nx"&gt;ip_protocol&lt;/span&gt;                  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"tcp"&lt;/span&gt;
  &lt;span class="nx"&gt;to_port&lt;/span&gt;                      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5432&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# adding egress rules for all traffic&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_vpc_security_group_egress_rule"&lt;/span&gt; &lt;span class="s2"&gt;"allow_all_traffic_db"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;security_group_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_security_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;dbserver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;cidr_ipv4&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"0.0.0.0/0"&lt;/span&gt;
  &lt;span class="nx"&gt;ip_protocol&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"-1"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We will create the EC2 instance the same way, but we will not associate a public IP address with the private instance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Launching and Testing our App
&lt;/h2&gt;

&lt;p&gt;We will now create all these resources with the following Terraform commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;&lt;span class="k"&gt;terraform&lt;/span&gt; &lt;span class="nx"&gt;init&lt;/span&gt;
&lt;span class="k"&gt;terraform&lt;/span&gt; &lt;span class="nx"&gt;fmt&lt;/span&gt;
&lt;span class="k"&gt;terraform&lt;/span&gt; &lt;span class="nx"&gt;validate&lt;/span&gt;
&lt;span class="k"&gt;terraform&lt;/span&gt; &lt;span class="nx"&gt;apply&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We now have our two instances running. It's time to test if our API app works as intended. For a more detailed guide on how to configure nginx, postgres, and launch our app, please refer to the README in the root of the &lt;code&gt;aws&lt;/code&gt; directory in this &lt;a href="https://github.com/Munjogu123/book-review-api" rel="noopener noreferrer"&gt;repo&lt;/a&gt;. It also contains all the Terraform configurations.&lt;/p&gt;

&lt;h2&gt;
  
  
  API Testing
&lt;/h2&gt;

&lt;p&gt;Let's first run our app&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 &lt;span class="nt"&gt;-m&lt;/span&gt; uvicorn main:app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If successful, you should see the following output in your terminal&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7wws8oezv9e44ij37atn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7wws8oezv9e44ij37atn.png" alt="app launch success" width="730" height="191"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You should also see the following if you navigate to the docs page&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyraorsdiuw9y35966w4k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyraorsdiuw9y35966w4k.png" alt="docs page" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Testing the users endpoint
&lt;/h3&gt;

&lt;p&gt;Inside the docs, we will create a new user and check the response we receive. We will also confirm that this output matches the record in our database.&lt;/p&gt;
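
&lt;p&gt;If you prefer the terminal to the Swagger UI, here is a minimal sketch using Python's &lt;code&gt;requests&lt;/code&gt; library. The port, the &lt;code&gt;/users&lt;/code&gt; path, and the payload fields are assumptions, so adjust them to the actual schema exposed by the app.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Hedged sketch: create a user against the running API.
# The /users path and the name/email fields are assumptions; check the Swagger
# docs for the exact schema your app expects.
import requests

BASE_URL = "http://localhost:8000"  # uvicorn's default host/port

payload = {"name": "Alice", "email": "alice@example.com"}
response = requests.post(f"{BASE_URL}/users", json=payload)

print(response.status_code)  # expect a 2xx status on success
print(response.json())       # the created user returned by the API
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;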

&lt;p&gt;We will start by creating a user. These are the results&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ap4mks10wjntu7joyys.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ap4mks10wjntu7joyys.png" alt="creates user" width="800" height="289"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, let's check if our database has the same record&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8d6tzl0okz3pou6oxz19.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8d6tzl0okz3pou6oxz19.png" alt="database results" width="800" height="89"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our results show that our app works well. Let's try the books and reviews endpoints.&lt;/p&gt;

&lt;h3&gt;
  
  
  Testing the book endpoint
&lt;/h3&gt;

&lt;p&gt;We will follow the same process as above. In the Swagger docs, we will create a book and ensure everything works fine. This is the result of creating a book.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcw5sir59okjmdpxy3njl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcw5sir59okjmdpxy3njl.png" alt="books endpoint" width="800" height="308"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, let's check if our database has the same record&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkk6ouiag8jmz7nw7oi71.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkk6ouiag8jmz7nw7oi71.png" alt="database result" width="800" height="52"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Testing the review endpoint
&lt;/h3&gt;

&lt;p&gt;For this to work, there must already be a user and a book record; without them, we cannot create a review, since a review needs both a book to review and a person reviewing it. Luckily, we have already created a user and a book, so we can proceed to write a review for the book we created. Here is the output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkvd8itsr5zb4a5ffhkjn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkvd8itsr5zb4a5ffhkjn.png" alt="reviews" width="800" height="334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, let's check if our database has the same record&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftfmb5247qp15sdfjvknf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftfmb5247qp15sdfjvknf.png" alt="reviews database" width="800" height="37"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This works for all the endpoints, showing that our deployed app works as we expected. This concludes our project.&lt;/p&gt;

&lt;h2&gt;
  
  
  Design Decisions and Trade-offs
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Two-tier architecture:&lt;/strong&gt; Separating the web and database layers improves security and scalability.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Private database subnet:&lt;/strong&gt; Prevents external exposure of sensitive data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dynamic IP restriction:&lt;/strong&gt; The web server’s SSH access is dynamically limited to my public IP, reducing attack surface.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Terraform automation:&lt;/strong&gt; Enables consistent, version-controlled provisioning.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This project highlights the power of Infrastructure as Code in managing cloud deployments. Another advantage of Terraform, in particular, is that it is cloud-agnostic. We can use it with different cloud providers (Azure, GCP).&lt;/p&gt;

&lt;p&gt;The source code of this project can be found here: &lt;a href="https://github.com/Munjogu123/book-review-api" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>terraform</category>
      <category>aws</category>
    </item>
    <item>
      <title>Apache Kafka Deep Dive: Core Concepts, Data Engineering Applications, and Real-World Production Practices</title>
      <dc:creator>Fred Munjogu</dc:creator>
      <pubDate>Tue, 09 Sep 2025 13:55:56 +0000</pubDate>
      <link>https://dev.to/fredmunjogu/apache-kafka-deep-dive-core-concepts-data-engineering-applications-and-real-world-production-549d</link>
      <guid>https://dev.to/fredmunjogu/apache-kafka-deep-dive-core-concepts-data-engineering-applications-and-real-world-production-549d</guid>
      <description>&lt;p&gt;As a data engineer, you will often need to stream data. To be more specific, you will need a tool to help you stream live data for whichever project you will be working on. &lt;/p&gt;

&lt;p&gt;Kafka is a great tool and has a ton of functionality to help you stream data seamlessly. In this article, we will focus on the core concepts you need to know to get started with Kafka.&lt;/p&gt;

&lt;h2&gt;
  
  
  Kafka Architecture
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Brokers
&lt;/h3&gt;

&lt;p&gt;A broker is a server that stores the data we use in streaming and also handles all the data streaming requests. The broker acts as a middleman between a producer (which sends information) and a consumer (which receives it).&lt;/p&gt;

&lt;h3&gt;
  
  
  Zookeeper vs. KRaft (Kafka Raft Metadata mode)
&lt;/h3&gt;

&lt;p&gt;In earlier versions of Kafka (before v2.8), Kafka relied on an external coordinator called ZooKeeper, which was in charge of handling metadata. ZooKeeper worked hand in hand with the brokers to track which broker was the controller and to persist cluster state.&lt;/p&gt;

&lt;p&gt;KRaft was introduced in Kafka 2.8 as an alternative and was marked production-ready in Kafka 3.3. Instead of brokers relying on ZooKeeper, Kafka now uses the Raft consensus algorithm to manage metadata, removing the need for ZooKeeper entirely. This makes the cluster simpler and much more scalable since there are fewer moving parts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Topics, Partitions, Offsets
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Topic
&lt;/h3&gt;

&lt;p&gt;A topic in Kafka is a log that stores messages and events in a logical order. We can equate a topic to a folder in a filesystem, and the events to the files.&lt;/p&gt;

&lt;h3&gt;
  
  
  Partition
&lt;/h3&gt;

&lt;p&gt;A partition is a "slice" of a topic. This means that when you create a topic, you need to specify the number of partitions you will need. &lt;/p&gt;

&lt;p&gt;This is important because it allows more brokers to share the load, as the data is spread across separate partitions. It also allows consumers in the same group to read from different partitions at the same time. An example of a partition is as follows:&lt;br&gt;
Say you create a topic called orders with 3 partitions. This is what it will actually look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Orders partitions
orders-0: [msg1, msg4, msg7]
orders-1: [msg2, msg5, msg8]
orders-2: [msg3, msg6, msg9]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Offsets
&lt;/h3&gt;

&lt;p&gt;An offset is a unique identifier assigned to each message in a partition. This helps the producers, consumers, and brokers to determine the position of a message in a partition.&lt;/p&gt;

&lt;h2&gt;
  
  
  Producers
&lt;/h2&gt;

&lt;p&gt;A producer is a client that writes messages to topics in the Kafka cluster. It is also possible to influence which partition a message lands in by using key-based partitioning: messages with the same key are always hashed to the same partition.&lt;/p&gt;
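
&lt;p&gt;As a minimal sketch with the kafka-python library (the broker address and the orders topic are assumptions), a producer using key-based partitioning could look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Minimal producer sketch (kafka-python). Broker address and topic are assumptions.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Key-based partitioning: every event with the key "order-42" is hashed to the
# same partition, so the relative order of those events is preserved.
producer.send("orders", key="order-42", value={"item": "book", "qty": 2})
producer.flush()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;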

&lt;h2&gt;
  
  
  Consumers
&lt;/h2&gt;

&lt;p&gt;A consumer is a client that reads messages from topics in the Kafka cluster. &lt;/p&gt;

&lt;p&gt;Kafka also has consumer groups, which let consumers work together in parallel: each consumer in a group is assigned a subset of a topic's partitions, so the group as a whole reads messages concurrently.&lt;/p&gt;
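
&lt;p&gt;A minimal consumer sketch (again kafka-python, with the broker and topic as assumptions); the &lt;code&gt;group_id&lt;/code&gt; is what places consumers in the same consumer group so they can split a topic's partitions between them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Minimal consumer sketch (kafka-python). Broker address and topic are assumptions.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    group_id="order-processors",   # consumers sharing this id split the partitions
    auto_offset_reset="earliest",  # start from the beginning if no offset is stored
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    print(message.partition, message.offset, message.value)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;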

&lt;h2&gt;
  
  
  Message Delivery Semantics
&lt;/h2&gt;

&lt;p&gt;There are three different types of delivery semantics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;at-most-once&lt;/li&gt;
&lt;li&gt;at-least-once&lt;/li&gt;
&lt;li&gt;exactly-once&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  At-Most-Once Delivery
&lt;/h3&gt;

&lt;p&gt;This approach entails the consumer only saving the position of the last event and then processing it. This means that in the event the consumer fails in the middle of execution, there is no way to go back to read this event.&lt;/p&gt;

&lt;p&gt;This approach is for situations where some data loss is not an issue and accuracy is not a priority.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff5dak6wynfw6idh0pokr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff5dak6wynfw6idh0pokr.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  At-Least-Once Delivery
&lt;/h3&gt;

&lt;p&gt;This approach entails the consumer processing the received events, saving the results, and only then saving the position of the last received event. It differs from at-most-once delivery in that failed events can be retrieved and reprocessed, so nothing is lost, although duplicates are possible if the consumer fails after processing but before saving its position.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbi7176jk0d46zetcilpr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbi7176jk0d46zetcilpr.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Exactly-Once Delivery
&lt;/h3&gt;

&lt;p&gt;This approach is similar to at-least-once delivery, whereby the consumer receives the event, saves the results, and also the position of the last received event. The difference is that any duplicates are dropped, so each event is processed exactly once.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5f5egzuj9gf0jdvyt1fg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5f5egzuj9gf0jdvyt1fg.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Retention Policies
&lt;/h2&gt;

&lt;p&gt;Retention is the configurable policy that determines how long messages are preserved in a Kafka topic before they are deleted.&lt;/p&gt;

&lt;p&gt;Some of the retention policies include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Time-based retention&lt;/li&gt;
&lt;li&gt;Size-based retention&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Time-based retention
&lt;/h3&gt;

&lt;p&gt;This is a policy where you can configure how long messages are retained in their topics based on their timestamps. Once that time period expires, the closed segments are deleted.&lt;/p&gt;

&lt;h3&gt;
  
  
  Size-based retention
&lt;/h3&gt;

&lt;p&gt;This is a policy where the expiration of messages depends on the number of bytes retained. If the configured size is reached or exceeded, the oldest segments are deleted to bring the partition back under the limit.&lt;/p&gt;
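
&lt;p&gt;Both policies are ordinary topic-level configurations. As a minimal sketch using kafka-python's admin client (the broker address and topic name are assumptions), a topic could be created with a seven-day time limit and a roughly 1 GB per-partition size limit like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Minimal sketch: create a topic with both time- and size-based retention set.
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

topic = NewTopic(
    name="orders",
    num_partitions=3,
    replication_factor=1,
    topic_configs={
        "retention.ms": "604800000",      # 7 days in milliseconds
        "retention.bytes": "1073741824",  # ~1 GB per partition
    },
)

admin.create_topics([topic])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;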

&lt;h2&gt;
  
  
  Serialization &amp;amp; Deserialization
&lt;/h2&gt;

&lt;p&gt;Serialization refers to the process of converting structured data into a byte stream. This is crucial since Kafka stores and transmits data as raw bytes. This process is done by a serializer.&lt;/p&gt;

&lt;p&gt;Deserialization is the opposite of serialization. This is converting data from a byte stream to a structured form, e.g., an object. This process is done by a deserializer.&lt;/p&gt;

&lt;p&gt;The common formats used in Kafka are Avro, Protobuf, or JSON schema.&lt;/p&gt;
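
&lt;p&gt;As a minimal JSON-based sketch with kafka-python (the broker address and topic are assumptions), the serializer and deserializer are just small functions plugged into the producer and consumer; Avro or Protobuf would follow the same pattern, usually together with a schema registry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Serialization/deserialization sketch using plain JSON.
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer side: serialize a Python dict into bytes before sending to the broker.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("payments", value={"user_id": 7, "amount": 19.99})
producer.flush()

# Consumer side: deserialize the raw bytes back into a Python dict after reading.
consumer = KafkaConsumer(
    "payments",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;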

&lt;h2&gt;
  
  
  Replication and Fault Tolerance
&lt;/h2&gt;

&lt;p&gt;In Kafka, one can specify the replication factor when creating a topic. This allows for multiple copies to be created across different brokers, which ensures data availability.&lt;/p&gt;

&lt;h3&gt;
  
  
  Replication Factor
&lt;/h3&gt;

&lt;p&gt;A configurable setting for each topic that determines the total number of copies for each partition.&lt;/p&gt;

&lt;h3&gt;
  
  
  Leader
&lt;/h3&gt;

&lt;p&gt;For each partition, one replica is designated as the leader, which handles all incoming produce and consume requests. &lt;/p&gt;

&lt;h3&gt;
  
  
  Follower
&lt;/h3&gt;

&lt;p&gt;Other replicas for a partition are followers. They continuously fetch data from the leader to stay synchronized.&lt;/p&gt;

&lt;h3&gt;
  
  
  ISR (In-Sync Replicas)
&lt;/h3&gt;

&lt;p&gt;The set of follower replicas that have successfully replicated the leader's data and are fully up-to-date. &lt;/p&gt;

&lt;h3&gt;
  
  
  High Availability
&lt;/h3&gt;

&lt;p&gt;The system remains functional and data is accessible, even if one or more brokers fail. &lt;/p&gt;

&lt;h2&gt;
  
  
  Kafka Connect
&lt;/h2&gt;

&lt;p&gt;This is a tool for streaming data between Kafka and other data systems. In this article, we will focus on streaming data between two databases, namely, PostgreSQL and Cassandra.&lt;/p&gt;

&lt;h3&gt;
  
  
  Configuring our Postgres Connector
&lt;/h3&gt;

&lt;p&gt;In our case, we will be using Debezium Connect, which is built on Kafka Connect. To start things off, we will begin with the configuration file, which is in JSON format.&lt;/p&gt;

&lt;p&gt;In our case, we will use Docker, so we will have to come up with a Docker Compose file that has all the services we need. These include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;zookeeper - handling metadata&lt;/li&gt;
&lt;li&gt;Kafka - for streaming&lt;/li&gt;
&lt;li&gt;postgres (15) - our source database&lt;/li&gt;
&lt;li&gt;Cassandra (4.1) - our sink db&lt;/li&gt;
&lt;li&gt;debezium/connect - a tool to allow us to stream data from PostgreSQL to Cassandra&lt;/li&gt;
&lt;li&gt;kafka-ui - Graphical User Interface to view our streamed data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's break down each service component.&lt;/p&gt;

&lt;h3&gt;
  
  
  Postgres
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  postgres:
    image: postgres:15
    container_name: postgres
    ports:
      - "5432:5432"
    environment:
      POSTGRES_USER: test
      POSTGRES_PASSWORD: root
      POSTGRES_DB: test
    command: &amp;gt;
      postgres -c wal_level=logical
               -c max_wal_senders=10
               -c max_replication_slots=10
               -c max_connections=200
    volumes:
      - postgres_data:/var/lib/postgresql/data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The command property is used to pass startup arguments to PostgreSQL. Setting &lt;code&gt;wal_level&lt;/code&gt; to logical enables logical replication, so the change logs can be streamed into Kafka and used to replicate the data into the Cassandra DB.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cassandra
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  cassandra:
    image: cassandra:4.1
    container_name: cassandra
    ports:
      - "9042:9042"
    environment:
      CASSANDRA_CLUSTER_NAME: "cdc-cluster"
      CASSANDRA_NUM_TOKENS: 16
      CASSANDRA_DC: datacenter1
      CASSANDRA_RACK: rack1
    volumes:
      - cassandra_data:/var/lib/cassandra
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is just a simple configuration to load our simple sink database.&lt;/p&gt;

&lt;h3&gt;
  
  
  Connect
&lt;/h3&gt;

&lt;p&gt;This is where we use Debezium, and there are a few things we need to do so that it can transmit the data between these data systems.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  connect:
    image: debezium/connect:2.7.3.Final
    container_name: debezium
    depends_on:
    - kafka
    - postgres
    ports:
    - "8083:8083"
    environment:
      GROUP_ID: 1
      CONFIG_STORAGE_TOPIC: connect_configs
      OFFSET_STORAGE_TOPIC: connect_offsets
      BOOTSTRAP_SERVERS: kafka:9092
      HOST_NAME: "connect"
      ADVERTISED_HOST_NAME: "connect"
      ADVERTISED_PORT: "8083"
      KEY_CONVERTER: "org.apache.kafka.connect.json.JsonConverter"
      VALUE_CONVERTER: "org.apache.kafka.connect.json.JsonConverter"
      INTERNAL_KEY_CONVERTER: "org.apache.kafka.connect.json.JsonConverter"
      INTERNAL_VALUE_CONVERTER: "org.apache.kafka.connect.json.JsonConverter"
      OFFSET_FLUSH_INTERVAL_MS: "60000"
      OFFSET_FLUSH_TIMEOUT_MS: "5000"
      SHUTDOWN_TIMEOUT: "10000"
      HEAP_OPTS: "-Xms512M -Xmx2G"
      LOG_LEVEL: "INFO"
      ENABLE_APICURIO_CONVERTERS: "false"
      ENABLE_DEBEZIUM_SCRIPTING: "false"
      KAFKA_CONNECT_PLUGINS_DIR: /kafka/connect,/kafka/connect/plugins
    volumes:
      - connect_data:/kafka/connect
      - ../plugins:/kafka/connect/plugins
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The main things to focus on in this configuration are the volumes. By default, Debezium ships with a pre-installed PostgreSQL plugin located in the &lt;code&gt;/kafka/connect&lt;/code&gt; directory, so you do not need to install it yourself; all you have to do is provide that path in the environment variable &lt;code&gt;KAFKA_CONNECT_PLUGINS_DIR&lt;/code&gt;. For Cassandra, it is a little different since you have to source the plugin yourself. Once downloaded, extract the archive and locate the Kafka sink connector file with the &lt;code&gt;.jar&lt;/code&gt; extension. We will place this file in our root directory and mount it into the &lt;code&gt;/kafka/connect/plugins&lt;/code&gt; directory. This allows our Docker container to recognize the plugin so that we can register our sink connector.&lt;/p&gt;

&lt;p&gt;Similar to what we did for Postgres, we need to add this path to the environment variable &lt;code&gt;KAFKA_CONNECT_PLUGINS_DIR&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;After adding the zookeeper, Kafka, and Kafka UI services (I have not provided the snippets since they only require basic configuration and are easy to set up), we will now create identical tables in both PostgreSQL and Cassandra.&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating Tables
&lt;/h3&gt;

&lt;p&gt;For Postgres:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE customers (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100),
    email VARCHAR(150),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Insert some sample rows
INSERT INTO customers (name, email) VALUES
('Alice', 'alice@example.com'),
('Bob', 'bob@example.com');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For Cassandra:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE KEYSPACE IF NOT EXISTS cdc_demo
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

USE cdc_demo;

CREATE TABLE customers (
    id INT PRIMARY KEY,
    name TEXT,
    email TEXT,
    created_at TIMESTAMP
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Registering the connectors
&lt;/h3&gt;

&lt;p&gt;Postgres configuration: &lt;code&gt;postgres_connector.json&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "name": "postgres-connector",  
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector", 
    "database.hostname": "postgres", 
    "database.port": "5432", 
    "database.user": "postgres", 
    "database.password": "postgres", 
    "database.dbname": "postgres", 
    "topic.prefix": "test", 
    "plugin.name": "pgoutput",
    "snapshot.mode": "initial"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Save this config to a JSON file and run this command to register the connector:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -X POST -H "Content-Type: application/json" \
     --data @postgres_connector.json \
     http://localhost:8083/connectors

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cassandra Configuration: &lt;code&gt;cassandra_connector.json&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "name": "stocks-sink",
  "config": {
    "connector.class": "com.datastax.kafkaconnector.DseSinkConnector",
    "tasks.max": "1",
    "topics": "test.public.customers,
    "contactPoints": "cassandra",
    "loadBalancing.localDc": "datacenter1",
    "port": 9042,
    "auth.provider": "None",
    "ssl.provider": "None",
    "topic.test.public.cdc_demo.customers.mapping": 
    "id=value.id, name=value.name, email=value.email, created_at=value.created_at",
    "transforms": "unwrap",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Save this config to a JSON file and run this command to register the connector:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -X POST -H "Content-Type: application/json" \
     --data @cassandra_connector.json \
     http://localhost:8083/connectors

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If we look into our Cassandra database now, we will be able to see the records that are in our PostgreSQL database. If we insert more data into our Postgres Database, this will also reflect in our Cassandra database.&lt;/p&gt;

&lt;p&gt;What we have just done is referred to as CDC (Change Data Capture). Essentially, it involves a source database and a sink database: every change in the source is captured and applied to the sink. This improves availability, since if the source database fails we still have an up-to-date copy of the data in the sink database. This is one of the applications of Kafka Connect.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Kafka is Used in the Industry
&lt;/h2&gt;

&lt;p&gt;To explain how Kafka is used in the real world, we will use the example of a popular company, Uber.&lt;/p&gt;

&lt;p&gt;Uber operates at a massive scale:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Millions of ride requests, driver updates, payments, GPS events, and ETA predictions per second.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;They need to process events in real-time for features like surge pricing, live tracking, fraud detection, and customer notifications.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Example Use Case
&lt;/h3&gt;

&lt;p&gt;When a rider presses “Request Ride” in the Uber app, the action is published as an event into a Kafka topic called rides. Since Kafka topics are partitioned, this request can be processed in parallel with millions of others, which helps Uber handle high throughput at scale. At the same time, every driver’s phone sends GPS updates to another Kafka topic called locations. Kafka’s durability makes sure these events are stored safely on disk, copied across brokers, and can be replayed if needed. Uber’s matching service consumes data from both rides and locations topics in real time, and Kafka’s consumer groups allow multiple consumers to share the work of processing events. Kafka also guarantees ordering within partitions, so a driver’s GPS updates are processed in the correct order, which helps the system calculate accurate ETAs. Once a rider-driver match is made, the result is published into another topic called matches, which is consumed by the notification service to update both apps instantly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo4ulxpyng6eruumqauxp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo4ulxpyng6eruumqauxp.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Understanding Docker, Dockerfile, and Docker Compose</title>
      <dc:creator>Fred Munjogu</dc:creator>
      <pubDate>Tue, 26 Aug 2025 14:31:54 +0000</pubDate>
      <link>https://dev.to/fredmunjogu/understanding-docker-dockerfile-and-docker-compose-3e73</link>
      <guid>https://dev.to/fredmunjogu/understanding-docker-dockerfile-and-docker-compose-3e73</guid>
      <description>&lt;h2&gt;
  
  
  What is Docker?
&lt;/h2&gt;

&lt;p&gt;Docker is an open-source platform that allows one to automate the deployment of applications inside containers. A container is a stand-alone unit that includes all components an application needs to run (code, libraries, and dependencies).&lt;/p&gt;

&lt;p&gt;Containers are different from virtual machines in that they share the host system's kernel. This makes them more resource-efficient and a lot easier to manage.&lt;/p&gt;

&lt;p&gt;Some of the benefits of using Docker include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Portability&lt;/li&gt;
&lt;li&gt;Efficiency&lt;/li&gt;
&lt;li&gt;Scalability&lt;/li&gt;
&lt;li&gt;Isolation&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What is a Dockerfile
&lt;/h2&gt;

&lt;p&gt;Now that we have an idea of what Docker is, we can move to a Dockerfile. A Dockerfile is a text document that contains a set of instructions used in building an image.&lt;/p&gt;

&lt;p&gt;An image refers to the blueprint of libraries and dependencies that are necessary inside a container for an application to run.&lt;/p&gt;

&lt;p&gt;Docker will read our Dockerfile and build an image that matches the specification listed. To understand this a bit more, let us write one and see what exactly happens.&lt;/p&gt;

&lt;p&gt;Using a code editor of your choice, create a file and name it &lt;code&gt;Dockerfile&lt;/code&gt;. Inside the file, write the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FROM python:3.12-slim

WORKDIR /app

COPY requirements.txt /app

RUN pip install -r /app/requirements.txt

COPY . /app

CMD["python", "main.py"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you may have guessed from the file above, we need a &lt;code&gt;requirements.txt&lt;/code&gt; file and a &lt;code&gt;main.py&lt;/code&gt; file. For this part, we will not write anything complex since the focus is on understanding containers and images. So in our &lt;code&gt;requirements.txt&lt;/code&gt; file, we will write the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;requests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In our &lt;code&gt;main.py&lt;/code&gt;, we can write a simple "Hello, world!" program.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print("Hello, world!")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With these files in place, we can build our first image. To build an image, we use the &lt;code&gt;docker build&lt;/code&gt; command. The "-t" flag sets the tag, which is a name you can assign to your image; if you do not provide one, the image will only be identifiable by its ID. The trailing period in the command is the build context, i.e., the directory containing our Dockerfile, which in our case is the current directory.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;docker build -t test .&lt;/code&gt;&lt;br&gt;
Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb9gkaj6ortj94skj6m01.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb9gkaj6ortj94skj6m01.png" alt="Results of building the image" width="800" height="311"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To confirm if the image has been built, use &lt;code&gt;docker images&lt;/code&gt; and look for an image with the name "test".&lt;/p&gt;

&lt;p&gt;Now that our image is built, we will create a container so that we can run our Python file. To create a container, we will use this command:&lt;br&gt;
&lt;code&gt;docker run --name test_container test&lt;/code&gt;&lt;br&gt;
Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdesggu5tz9sksqvqjfr3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdesggu5tz9sksqvqjfr3.png" alt="Running the container" width="369" height="56"&gt;&lt;/a&gt;&lt;br&gt;
Our Python file runs great!&lt;/p&gt;
&lt;h2&gt;
  
  
  What is Docker Compose?
&lt;/h2&gt;

&lt;p&gt;Docker Compose is a tool used to define and manage multi-container Docker applications. Suppose our project requires Python and a PostgreSQL Database. With Docker Compose, we can define and manage these services in a YAML file. Let us create this YAML file.&lt;br&gt;
Back in your code editor, create a new file &lt;code&gt;docker-compose.yml&lt;/code&gt;. Inside the file we will create our services:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;services:
  web:
    build: .
    ports:
      - "5000:5000"
    depends_on:
      - db

  db:
    image: postgres:14
    environment:
      POSTGRES_USER: user
      POSTGRES_PASSWORD: password
      POSTGRES_DB: mydatabase
    volumes:
      - db_data:/var/lib/postgresql/data

volumes:
  db_data:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this file, we have two services: web and our database. The &lt;code&gt;web&lt;/code&gt; service builds and runs the image from the Dockerfile we specified above. The &lt;code&gt;db&lt;/code&gt; service defines our Postgres database along with the username, password, and name of the database. Volumes ensure data consistency and persistence: when we stop and start the container again, all the data that was in the database is preserved.&lt;/p&gt;

&lt;p&gt;To run this, we will use the command &lt;code&gt;docker compose up&lt;/code&gt;. If we want to stop it, we will use &lt;code&gt;docker compose down&lt;/code&gt;. A tip when running &lt;code&gt;docker compose up&lt;/code&gt; is to use the "-d" flag, which runs the containers in detached mode and gives you your terminal back.&lt;/p&gt;

</description>
      <category>docker</category>
    </item>
    <item>
      <title>Mastering Data Engineering: 15 Essential Concepts for Building Reliable and Scalable Data Systems</title>
      <dc:creator>Fred Munjogu</dc:creator>
      <pubDate>Sun, 10 Aug 2025 20:19:10 +0000</pubDate>
      <link>https://dev.to/fredmunjogu/mastering-data-engineering-15-essential-concepts-for-building-reliable-and-scalable-data-systems-24bd</link>
      <guid>https://dev.to/fredmunjogu/mastering-data-engineering-15-essential-concepts-for-building-reliable-and-scalable-data-systems-24bd</guid>
      <description>&lt;p&gt;As a Data Engineering student, I believe there are a few fundamental concepts that are important for setting a good foundation in the field. In this article, I will focus on explaining these concepts and their importance. In some cases, I will also provide examples. Let's get to it, shall we?&lt;/p&gt;

&lt;h2&gt;
  
  
  Batch vs Streaming Ingestion
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Batch Ingestion
&lt;/h3&gt;

&lt;p&gt;Batch ingestion refers to the processing and loading of huge volumes of data in batches. These batches are usually chunks covering a predefined period (e.g., hourly, weekly, yearly). Batch ingestion is useful where real-time analysis is not needed. The beauty of batch processing is the large amount of data that can be processed at once, which ultimately keeps the procedure inexpensive since ingestion and processing can happen outside business hours. An example of batch ingestion is an e-commerce platform that exports its daily sales reports to a data warehouse. This data can then be analyzed and insights sent to the respective departments. This, however, happens after all the sales transactions have occurred.&lt;/p&gt;
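
&lt;p&gt;As a rough illustration (the file path, table name, and connection string are placeholders), a daily batch job often boils down to reading yesterday's export and appending it to a warehouse table in one go:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Minimal batch-ingestion sketch: load one day's sales export into a warehouse table.
# The file path, table name, and connection string are placeholders.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@warehouse:5432/analytics")

# One large chunk covering the whole day, loaded outside business hours.
daily_sales = pd.read_csv("exports/sales_2025-01-01.csv")
daily_sales.to_sql("daily_sales", engine, if_exists="append", index=False)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;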

&lt;h3&gt;
  
  
  Stream Ingestion
&lt;/h3&gt;

&lt;p&gt;Stream ingestion is the exact opposite of batch ingestion. Instead of processing/ingesting after a certain period, stream ingestion involves the immediate processing of data once it is produced. Stream ingestion has its advantages, such as real-time processing, which leads to real-time analysis, which is crucial for certain industries/businesses. An area that thrives on this is fraud detection. By conducting real-time analysis, any anomalies that may occur in the system are detected in real-time, leading to crisis aversion.&lt;/p&gt;

&lt;h2&gt;
  
  
  Change Data Capture (CDC)
&lt;/h2&gt;

&lt;p&gt;Change Data Capture (CDC) is a process of tracking changes such as inserts, updates, and deletes in a database. These change events are stored and used for operations such as replication or analysis. In stream ingestion systems, for example, CDC allows for real-time or near-real-time replication of data across different destinations such as databases. The ordered stream of change events is referred to as a CDC feed.&lt;br&gt;
Suppose we have a database and we add a new record to it. The CDC feed will contain the new record's data together with the type of operation that occurred (in our case, an insert).&lt;/p&gt;

&lt;h2&gt;
  
  
  Idempotency
&lt;/h2&gt;

&lt;p&gt;The first thing that comes to mind when I see the term idempotency is APIs. Why, you may ask? Let us first define idempotency, and hopefully by the end, you will be able to see how the two relate.&lt;br&gt;
Idempotency, in simple terms, is the property of an operation that ensures that repeating the same operation multiple times will yield the same result.&lt;br&gt;
Let us go back to APIs. Imagine we have a payment API that processes a purchase when a client sends a &lt;code&gt;POST /payments&lt;/code&gt; request. In APIs, we have an idempotency key that uniquely identifies a request. When a customer makes a payment, the API records the idempotency key. If the same request is sent again by the client, the API checks the idempotency key, sees that it has already been handled, and executes the operation only once. This prevents duplicate charges when requests are retried.&lt;/p&gt;
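
&lt;p&gt;Here is a minimal, illustrative sketch of the idea: the server remembers which idempotency keys it has already handled and returns the stored result for repeats, so retries never charge the customer twice. The in-memory dictionary stands in for whatever persistent store a real payment API would use.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Idempotency sketch: repeated requests with the same key are executed only once.
processed = {}  # maps idempotency_key to its result (a real API would persist this)

def handle_payment(idempotency_key, amount):
    if idempotency_key in processed:
        # Duplicate request: return the original result, do not charge again.
        return processed[idempotency_key]
    result = {"status": "charged", "amount": amount}  # stand-in for the real charge
    processed[idempotency_key] = result
    return result

# The client retries the same request three times; the charge happens only once.
for _ in range(3):
    print(handle_payment("key-123", 50))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;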

&lt;h2&gt;
  
  
  OLTP vs OLAP
&lt;/h2&gt;

&lt;p&gt;Online Analytical Processing (OLAP) refers to database systems designed primarily for complex data analytics and reporting. These systems enable advanced querying by analysts to identify patterns and forecast trends, which are critical for data-driven decision-making. OLAP leverages multidimensional data, enabling storage of various data types across different periods.&lt;/p&gt;

&lt;p&gt;Online Transactional Processing (OLTP) systems focus on handling database transactions. These are typically short, fast, and precise operations that keep databases current and consistent. OLTP systems support ACID (Atomicity, Consistency, Isolation, Durability) properties to ensure data integrity.&lt;/p&gt;

&lt;p&gt;For more detailed differences between the two, feel free to read this &lt;a href="https://dev.to/fredmunjogu/understanding-olap-and-oltp-when-to-use-each-database-system-2joh"&gt;article&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Columnar vs Row-based Storage
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Columnar Storage
&lt;/h3&gt;

&lt;p&gt;In this type of storage, data is stored in columns. This is especially helpful in data warehousing and analytics, where data is queried based on specific columns. Only the required columns are returned, leading to higher processing speeds. Having seen the differences between OLAP and OLTP, columnar storage is perfect for OLAP systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  Row-based Storage
&lt;/h3&gt;

&lt;p&gt;This is the complete opposite of columnar storage. Data is stored in rows, and querying data involves retrieving a specific row that contains all the information for the specified record. This practice is popular in traditional relational database management systems. This type of storage is well-suited for OLTP systems, as it is ideal for CRUD operations. However, analytical queries are slower here, since whole rows must be retrieved, whereas columnar storage can scan and aggregate only the required columns.&lt;/p&gt;

&lt;h2&gt;
  
  
  Partitioning
&lt;/h2&gt;

&lt;p&gt;Partitioning, in plain English, refers to dividing something into several parts, and the same applies to data. Suppose we have a large database that contains huge volumes of data. We can divide it into smaller, more manageable pieces. This immediately improves query performance since, instead of scanning the entire database, each query only touches the smaller partition that holds the relevant data.&lt;br&gt;
Partitioning is often combined with replication, where we create copies of the data, leading to availability and reliability. Consider having two databases, a master database and a slave database. We can create replicas of the master database and store them in the slave database. If the master database becomes unavailable, the slave database can still serve requests. This setup eliminates a single point of failure by maintaining two database instances.&lt;/p&gt;

&lt;h2&gt;
  
  
  ETL vs ELT
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Extract, Transform and Load (ETL)
&lt;/h3&gt;

&lt;p&gt;This is the more traditional technique for building pipelines. The first step is getting data from various data sources. After the data has been obtained, it is converted into the form required by the user (transform). The data is finally sent to a particular destination (e.g., a data warehouse) where users can access it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Extract, Load and Transform (ELT)
&lt;/h3&gt;

&lt;p&gt;The first part is similar to that of the ETL process. The difference comes in the next step: after the raw data is extracted, it is loaded directly into a data warehouse or data lake as is. Transformation is then done on an as-needed basis, converting the data into the intended format for tasks such as analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  CAP Theorem
&lt;/h2&gt;

&lt;p&gt;The CAP theorem states that a distributed system can only guarantee two of three desired characteristics at the same time.&lt;br&gt;
These characteristics are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Consistency&lt;/li&gt;
&lt;li&gt;Availability&lt;/li&gt;
&lt;li&gt;Partition tolerance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Consistency refers to the ability of clients to see the same data even when on different nodes.&lt;br&gt;
Availability refers to the ability of a client to receive a response even when some nodes are down.&lt;br&gt;
Partition tolerance refers to the system being active and functional despite the network connecting the nodes having any kind of faults.&lt;/p&gt;

&lt;h2&gt;
  
  
  Windowing in Streaming
&lt;/h2&gt;

&lt;p&gt;Streaming usually involves processing continuous, real-time data. For small volumes this is manageable, but it can quickly become overwhelming when the data being streamed is relatively large. To make sense of the incoming data, we apply measures that only show us the data in given chunks. This lets us understand what is currently happening without being bombarded by data from activities that happened some time back.&lt;br&gt;
Windowing does exactly this, and it can be done using different criteria (a small sketch follows the list below). Some of these criteria are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;time-based (e.g., 10-minute intervals)&lt;/li&gt;
&lt;li&gt;count-based (e.g., 50 messages)&lt;/li&gt;
&lt;/ul&gt;
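
&lt;p&gt;As a minimal sketch of a time-based tumbling window, the snippet below groups events into fixed 10-minute buckets and emits one count per bucket; the events list is a stand-in for messages arriving off a stream:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Tumbling-window sketch: count events per fixed 10-minute bucket.
from collections import defaultdict

WINDOW_SECONDS = 10 * 60

def window_start(timestamp):
    # Floor the event timestamp to the start of its 10-minute window.
    return timestamp - (timestamp % WINDOW_SECONDS)

counts = defaultdict(int)

# Stand-in for a stream: (unix_timestamp, payload) pairs, e.g. messages off a topic.
events = [(1700000000, "a"), (1700000100, "b"), (1700000700, "c")]
for timestamp, _payload in events:
    counts[window_start(timestamp)] += 1

for start, count in sorted(counts.items()):
    print(f"window starting at {start}: {count} events")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;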

&lt;h2&gt;
  
  
  DAGs and Workflow Orchestration
&lt;/h2&gt;

&lt;p&gt;A Directed Acyclic Graph (DAG) is a processing model that represents how different tasks will be executed and the dependencies between these tasks in a workflow. Workflow orchestration refers to the use of orchestration tools to control the execution of the created DAGs. &lt;br&gt;
Tools such as Airflow can also automate DAGs through schedulers that make sure the DAGs run on a specified schedule, as shown in the sketch below.&lt;/p&gt;
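
&lt;p&gt;As a minimal sketch (assuming a recent Airflow 2.x install; the task bodies are placeholders), a two-task DAG that runs daily, where loading only happens after extraction succeeds, could look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Minimal Airflow DAG sketch: "extract" runs before "load", scheduled daily.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling data from the source")  # placeholder task body

def load():
    print("loading data into the warehouse")  # placeholder task body

with DAG(
    dag_id="daily_sales_pipeline",
    start_date=datetime(2025, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task &amp;gt;&amp;gt; load_task  # the dependency: load only runs after extract succeeds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;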

&lt;h2&gt;
  
  
  Retry Logic &amp;amp; Dead Letter Queues
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Retry Logic
&lt;/h3&gt;

&lt;p&gt;Retry logic allows an operation to be attempted again after failure. This is important since the failure may be caused by a temporary connection issue, and because of the retry logic, this operation can be repeated. An example of this is a Kafka consumer that tries to read data from an API. This process fails because of a temporary connection issue, and the consumer waits a few seconds before trying the same operation again. &lt;/p&gt;
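
&lt;p&gt;A minimal sketch of the pattern: retry a flaky call a fixed number of times with an exponential backoff between attempts, and only give up (for example, by handing the message to a dead letter queue) once the last attempt fails. The &lt;code&gt;fetch&lt;/code&gt; callable is a placeholder for the real operation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Retry-with-backoff sketch: try the operation a few times before giving up.
import time

def fetch_with_retries(fetch, max_attempts=3, base_delay=1.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()  # the flaky operation, e.g. an API call
        except ConnectionError as error:
            if attempt == max_attempts:
                raise  # out of retries: hand the work off, e.g. to a dead letter queue
            wait = base_delay * 2 ** (attempt - 1)  # 1s, 2s, 4s, ...
            print(f"attempt {attempt} failed ({error}), retrying in {wait}s")
            time.sleep(wait)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;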

&lt;h3&gt;
  
  
  Dead Letter Queues
&lt;/h3&gt;

&lt;p&gt;This is a queue where messages or events are sent after they fail to be processed within the configured retry attempts. This keeps failed messages from blocking the rest of the pipeline while still preserving them for later inspection or reprocessing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Backfilling &amp;amp; Reprocessing
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Backfilling
&lt;/h3&gt;

&lt;p&gt;Say you are building an ETL pipeline that utilizes taxi data from various years. In your database, however, you only have data from the last 2 years and the current year. You realize you need data from the last 5 years to conduct an accurate analysis. You will use a process that will ingest data from the years that are not present in your database. This is what a backfill does. It sources historical data and adds it to the specified destination.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reprocessing
&lt;/h3&gt;

&lt;p&gt;This refers to processing existing data again. This may be due to the detection of errors in the existing data, and so you reprocess it to ensure you have the latest and most correct data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Governance
&lt;/h2&gt;

&lt;p&gt;Data governance is a discipline of data management that ensures data is gathered, processed, and stored securely and in a way that adheres to the policies that have been set.&lt;/p&gt;

&lt;h2&gt;
  
  
  Time Travel &amp;amp; Data Versioning
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Time Travel
&lt;/h3&gt;

&lt;p&gt;This refers to the ability to view and query data as it existed at a specific point in time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Versioning
&lt;/h3&gt;

&lt;p&gt;This refers to maintaining and tracking different versions of data over time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Distributed Processing
&lt;/h2&gt;

&lt;p&gt;This is a computing approach where large tasks are split into smaller subtasks and each of these subtasks is executed in parallel using different processors, and the results are combined. This speeds up the processing of large datasets and improves scalability.&lt;/p&gt;
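
&lt;p&gt;Engines like Spark apply this idea across a whole cluster; as a minimal single-machine sketch of the same split-work-and-combine pattern, Python's multiprocessing module can run the subtasks on separate processes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Split-apply-combine sketch on one machine with multiprocessing.
# Real distributed engines (e.g. Spark) apply the same idea across many nodes.
from multiprocessing import Pool

def count_words(chunk_of_lines):
    # The subtask each worker runs on its own slice of the data.
    return sum(len(line.split()) for line in chunk_of_lines)

if __name__ == "__main__":
    lines = ["the quick brown fox", "jumps over", "the lazy dog"] * 1000
    chunks = [lines[i::4] for i in range(4)]  # split the task into 4 subtasks

    with Pool(processes=4) as pool:
        partial_counts = pool.map(count_words, chunks)  # run subtasks in parallel

    print(sum(partial_counts))  # combine the partial results
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;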

</description>
    </item>
    <item>
      <title>Understanding OLAP and OLTP: When to Use Each Database System</title>
      <dc:creator>Fred Munjogu</dc:creator>
      <pubDate>Tue, 29 Jul 2025 13:48:18 +0000</pubDate>
      <link>https://dev.to/fredmunjogu/understanding-olap-and-oltp-when-to-use-each-database-system-2joh</link>
      <guid>https://dev.to/fredmunjogu/understanding-olap-and-oltp-when-to-use-each-database-system-2joh</guid>
      <description>&lt;h2&gt;
  
  
  OLAP Systems
&lt;/h2&gt;

&lt;p&gt;Online Analytical Processing (OLAP) refers to database systems designed primarily for complex data analytics and reporting. These systems enable advanced querying by analysts to identify patterns and forecast trends, which are critical for data-driven decision-making.&lt;/p&gt;

&lt;p&gt;OLAP leverages multidimensional data, enabling storage of various data types across different periods. It uses an OLAP cube, which aggregates data by attributes such as region or category. This allows analysts to drill down into more granular insights. For instance, if data is stored by region, analysts can slice it down to the state or city level to evaluate product performance in specific areas.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example of an OLAP System
&lt;/h3&gt;

&lt;p&gt;A common OLAP system is a data warehouse, a centralized repository that aggregates data from multiple sources. These systems typically handle massive datasets, often in terabytes (TB) or petabytes (PB). Once the data is transformed, analysts can discover relationships that generate insights for top-level management.&lt;/p&gt;

&lt;p&gt;An example query might look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Region-level analysis
SELECT c.region,
       AVG(s.sale_amount) AS avg_sales
FROM sales s
JOIN commodities cm 
  ON s.commodity_id = cm.commodity_id
JOIN cities c 
  ON s.city_id = c.city_id
WHERE cm.name = 'Wheat'
GROUP BY c.region
ORDER BY avg_sales DESC;

-- Drill-down to city-level analysis
SELECT c.region,
       c.name AS city,
       AVG(s.sale_amount) AS avg_sales
FROM sales s
JOIN commodities cm 
  ON s.commodity_id = cm.commodity_id
JOIN cities c 
  ON s.city_id = c.city_id
WHERE cm.name = 'Wheat'
  AND c.region = 'Midwest'  
GROUP BY c.region, c.name
ORDER BY avg_sales DESC; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  OLTP Systems
&lt;/h2&gt;

&lt;p&gt;Online Transactional Processing (OLTP) systems focus on handling database transactions. These are typically short, fast, and precise operations that keep databases current and consistent.&lt;/p&gt;

&lt;p&gt;OLTP systems support ACID (Atomicity, Consistency, Isolation, Durability) properties to ensure data integrity. Unlike OLAP, they are optimized for write-heavy operations, primarily insertions, updates, and deletions. Queries are designed to be quick to support real-time processing and offer a seamless user experience.&lt;/p&gt;

&lt;p&gt;Data in OLTP systems is usually smaller than in OLAP, typically measured in gigabytes (GB).&lt;/p&gt;

&lt;h3&gt;
  
  
  Example of an OLTP System
&lt;/h3&gt;

&lt;p&gt;A typical OLTP system is a Point of Sale (POS) system. When purchasing items, the POS terminal records the transaction, verifies available funds, and updates the inventory and payment records accordingly.&lt;/p&gt;

&lt;p&gt;Example queries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Reduce inventory
UPDATE item_table
SET bread = bread - 2;

-- Update payment amount
UPDATE item_table
SET amount = amount + cash;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, &lt;code&gt;cash&lt;/code&gt; represents the amount paid. As shown, OLTP queries are brief and optimized for low-latency processing, often completed in milliseconds.&lt;/p&gt;

&lt;h2&gt;
  
  
  Differences Between OLAP and OLTP Systems
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Purpose:&lt;/strong&gt; OLAP supports analytics and reporting; OLTP handles real-time transactions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Model:&lt;/strong&gt; OLAP uses multidimensional models for drill-down analysis; OLTP uses normalized (unidimensional) models to ensure consistency.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Type:&lt;/strong&gt; OLAP combines historical and current data for trend analysis; OLTP focuses on current data for real-time operations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Operation Focus:&lt;/strong&gt; OLAP emphasizes read operations; OLTP emphasizes write operations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Performance:&lt;/strong&gt; OLAP queries take longer due to large datasets; OLTP queries are faster, aided by indexing and smaller datasets.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Schema Design:&lt;/strong&gt; OLAP uses denormalized schemas (e.g., star, snowflake) for performance; OLTP uses normalized schemas (e.g., 3NF) to eliminate redundancy and ensure integrity.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Choosing the Right System
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;When should you use OLAP, and when should you use OLTP?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If your focus is on processing current transactional data quickly, OLTP is the right choice.&lt;/p&gt;

&lt;p&gt;However, if your goal is to analyze historical data to uncover trends and develop strategic insights, OLAP is more suitable.&lt;/p&gt;

&lt;p&gt;Most businesses use both systems: OLTP for day-to-day operations and OLAP for management-level analysis and planning.&lt;/p&gt;

&lt;p&gt;I hope this clarified the differences between OLAP and OLTP systems and helps you determine which is better suited to your needs. Feel free to ask questions, share feedback, or leave comments.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
