<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Nelson Sammy</title>
    <description>The latest articles on DEV Community by Nelson Sammy (@nelsongei).</description>
    <link>https://dev.to/nelsongei</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1173121%2F686a071e-7af4-4fb7-8bf4-740ca5ff7e78.png</url>
      <title>DEV Community: Nelson Sammy</title>
      <link>https://dev.to/nelsongei</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nelsongei"/>
    <language>en</language>
    <item>
      <title>[Boost]</title>
      <dc:creator>Nelson Sammy</dc:creator>
      <pubDate>Thu, 17 Apr 2025 06:46:53 +0000</pubDate>
      <link>https://dev.to/nelsongei/-557</link>
      <guid>https://dev.to/nelsongei/-557</guid>
      <description>&lt;p&gt;Boosted: &lt;a href="https://dev.to/luxdevhq/a-step-by-step-guide-to-streaming-live-weather-data-using-apache-kafka-and-apache-cassandra-ep2"&gt;A Step-by-Step Guide to Streaming Live Weather Data Using Apache Kafka and Apache Cassandra&lt;/a&gt; by Nelson Sammy for LuxDevHQ (Apr 16 '25, 4 min read, 6 reactions, 3 comments).&lt;/p&gt;</description>
      <category>dataengineering</category>
      <category>ai</category>
      <category>kafka</category>
    </item>
    <item>
      <title>A Step-by-Step Guide to Streaming Live Weather Data Using Apache Kafka and Apache Cassandra</title>
      <dc:creator>Nelson Sammy</dc:creator>
      <pubDate>Wed, 16 Apr 2025 03:31:52 +0000</pubDate>
      <link>https://dev.to/luxdevhq/a-step-by-step-guide-to-streaming-live-weather-data-using-apache-kafka-and-apache-cassandra-ep2</link>
      <guid>https://dev.to/luxdevhq/a-step-by-step-guide-to-streaming-live-weather-data-using-apache-kafka-and-apache-cassandra-ep2</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fegg2c0w9mp35vusns6kj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fegg2c0w9mp35vusns6kj.png" alt="Weather Data using Kafka,Confluent" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Delivering real-time weather data is increasingly important for applications across logistics, travel, emergency services, and consumer tools. In this tutorial, we will build a real-time weather data streaming pipeline using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenWeatherMap API to fetch weather data&lt;/li&gt;
&lt;li&gt;Apache Kafka (via Confluent Cloud) for streaming&lt;/li&gt;
&lt;li&gt;Apache Cassandra (installed on a Linux machine) for scalable storage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We'll implement this pipeline using Python, demonstrate practical setups, and include screenshots to guide you through each step.&lt;/p&gt;

&lt;p&gt;By the end, you'll have a running system where weather data is continuously fetched, streamed to Kafka, and written to Cassandra for querying and visualization.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiaes96vt2r97k41gaomb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiaes96vt2r97k41gaomb.png" alt="Weather Data Architecture using Kafka,Confluent" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Python 3.8+&lt;/li&gt;
&lt;li&gt;Linux Machine&lt;/li&gt;
&lt;li&gt;Kafka cluster on Confluent Cloud&lt;/li&gt;
&lt;li&gt;OpenWeatherMap API key &lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 1: Set Up Kafka on Confluent Cloud
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Go to confluent.cloud&lt;/li&gt;
&lt;li&gt;Create an account (free tier available)&lt;/li&gt;
&lt;li&gt;Create a Kafka cluster&lt;/li&gt;
&lt;li&gt;Create a topic named weather-stream&lt;/li&gt;
&lt;li&gt;Generate an API Key and Secret&lt;/li&gt;
&lt;li&gt;Note the Bootstrap Server, API Key, and API Secret&lt;/li&gt;
&lt;/ul&gt;
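
&lt;p&gt;If you prefer the terminal, the same steps can be sketched with the Confluent CLI. This is a minimal sketch, assuming you have the CLI installed; the cluster name, cloud, and region are example values, and the IDs are placeholders the CLI prints back:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Log in, then create a cluster (name/cloud/region are example values)
confluent login
confluent kafka cluster create weather-cluster --cloud gcp --region us-central1

# Create the topic used throughout this tutorial
confluent kafka topic create weather-stream --cluster &amp;lt;cluster-id&amp;gt;

# Generate the API key and secret to note down for later
confluent api-key create --resource &amp;lt;cluster-id&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;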

&lt;h3&gt;
  
  
  Step 2: Install Cassandra on a Linux Machine
&lt;/h3&gt;

&lt;p&gt;Open your terminal and run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt install openjdk-11-jdk -y

# Add Apache Cassandra repo
echo "deb https://downloads.apache.org/cassandra/debian 40x main" | sudo tee /etc/apt/sources.list.d/cassandra.list
curl https://downloads.apache.org/cassandra/KEYS | sudo apt-key add -

sudo apt update
sudo apt install cassandra -y
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Start and verify Cassandra:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo systemctl enable cassandra
sudo systemctl start cassandra
nodetool status
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Connect Cassandra to DBeaver (GUI Tool)
&lt;/h3&gt;

&lt;p&gt;DBeaver is a great visual interface for managing Cassandra.&lt;br&gt;
Steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Install DBeaver&lt;/li&gt;
&lt;li&gt;Open DBeaver and click New Connection&lt;/li&gt;
&lt;li&gt;Select Apache Cassandra from the list&lt;/li&gt;
&lt;li&gt;Fill in the following:

&lt;ul&gt;
&lt;li&gt;Host: 127.0.0.1&lt;/li&gt;
&lt;li&gt;Port: 9042&lt;/li&gt;
&lt;li&gt;Username: leave blank (default auth)&lt;/li&gt;
&lt;li&gt;Password: leave blank&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Click Test Connection — you should see a successful message&lt;/li&gt;
&lt;li&gt;Save and connect — you can now browse your keyspaces, tables, and run CQL visually&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Step 4: Create the Cassandra Table
&lt;/h3&gt;

&lt;p&gt;Once connected (or in cqlsh), run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE KEYSPACE IF NOT EXISTS weather
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

USE weather;

CREATE TABLE IF NOT EXISTS weather_data (
    city TEXT,
    timestamp TIMESTAMP,
    temperature FLOAT,
    humidity INT,
    PRIMARY KEY (city, timestamp)
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This schema stores weather info per city, indexed by time.&lt;br&gt;
You can also run the above queries in DBeaver’s SQL editor.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 5: Create Kafka Producer in Python
&lt;/h3&gt;

&lt;p&gt;Install Dependencies&lt;br&gt;
&lt;code&gt;pip install requests confluent-kafka python-dotenv&lt;/code&gt;&lt;br&gt;
Create a .env file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BOOTSTRAP_SERVERS=pkc-xyz.us-central1.gcp.confluent.cloud:9092
SASL_USERNAME=API_KEY
SASL_PASSWORD=API_SECRET
OPENWEATHER_API_KEY=YOUR_OPENWEATHER_API_KEY
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Python Script: weather_producer.py&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests
import json
from confluent_kafka import Producer
import time
from dotenv import load_dotenv
import os

load_dotenv()

conf = {
    'bootstrap.servers': os.getenv("BOOTSTRAP_SERVERS"),
    'security.protocol': 'SASL_SSL',
    'sasl.mechanisms': 'PLAIN',
    'sasl.username': os.getenv("SASL_USERNAME"),
    'sasl.password': os.getenv("SASL_PASSWORD")
}

producer = Producer(conf)
API_KEY = os.getenv("OPENWEATHER_API_KEY")
TOPIC = 'weather-stream'
CITIES = ["Nairobi", "Lagos", "Accra", "Cairo", "Cape Town", "Addis Ababa", "Dakar", "Kampala", "Algiers"]

def get_weather(city):
    url = f'https://api.openweathermap.org/data/2.5/weather?q={city}&amp;amp;appid={API_KEY}&amp;amp;units=metric'
    response = requests.get(url)
    return response.json()

def delivery_report(err, msg):
    if err is not None:
        print(f"Delivery failed: {err}")
    else:
        print(f"Delivered to {msg.topic()} [{msg.partition()}] @ offset {msg.offset()}")

while True:
    for city in CITIES:
        weather = get_weather(city)
        weather['city'] = city  # Attach city explicitly
        producer.produce(TOPIC, json.dumps(weather).encode('utf-8'), callback=delivery_report)
        producer.flush()
        time.sleep(2)  # Brief pause between cities to stay under the API rate limit
    time.sleep(60)  # Wait before the next full cycle
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This script loads credentials from .env, loops through several African cities, and sends weather data to your Kafka topic.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 6: Create Kafka Consumer in Python (Store Data in Cassandra)
&lt;/h3&gt;

&lt;p&gt;Install additional libraries:&lt;br&gt;
&lt;code&gt;pip install cassandra-driver&lt;/code&gt;&lt;br&gt;
Python Script: weather_consumer.py&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
from cassandra.cluster import Cluster
from confluent_kafka import Consumer
import os
from dotenv import load_dotenv

load_dotenv()

# Cassandra connection
cluster = Cluster(['127.0.0.1'])
session = cluster.connect()
session.set_keyspace('weather')

# Kafka configuration
conf = {
    'bootstrap.servers': os.getenv("BOOTSTRAP_SERVERS"),
    'security.protocol': 'SASL_SSL',
    'sasl.mechanisms': 'PLAIN',
    'sasl.username': os.getenv("SASL_USERNAME"),
    'sasl.password': os.getenv("SASL_PASSWORD"),
    'group.id': 'weather-group',
    'auto.offset.reset': 'earliest'
}

consumer = Consumer(conf)
consumer.subscribe(['weather-stream'])

print("Listening for weather data...")

while True:
    msg = consumer.poll(1.0)
    if msg is None:
        continue
    if msg.error():
        print(f"Consumer error: {msg.error()}")
        continue

    data = json.loads(msg.value().decode('utf-8'))
    try:
        session.execute(
            """
            INSERT INTO weather_data (city, timestamp, temperature, humidity)
            VALUES (%s, toTimestamp(now()), %s, %s)
            """,
            (data['city'], data['main']['temp'], data['main']['humidity'])
        )
        print(f"Stored data for {data['city']}")
    except Exception as e:
        print(f"Failed to insert data: {e}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This consumer listens to your Kafka topic, parses incoming messages, and stores them in the weather_data table.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 7: Querying Cassandra Data via DBeaver
&lt;/h3&gt;

&lt;p&gt;Once the consumer is running and data is flowing, open DBeaver and run a CQL query to verify the data:&lt;br&gt;
&lt;code&gt;SELECT * FROM weather.weather_data;&lt;/code&gt;&lt;br&gt;
You should now see rows of weather data streaming in from various African cities.&lt;/p&gt;
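
&lt;p&gt;Because city is the partition key and timestamp the clustering column, you can also pull the latest readings for a single city. A minimal sketch (the city name is just an example):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Latest 10 readings for one city, newest first
SELECT city, timestamp, temperature, humidity
FROM weather.weather_data
WHERE city = 'Nairobi'
ORDER BY timestamp DESC
LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;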

&lt;h2&gt;
  
  
  Conclusion &amp;amp; Next Steps
&lt;/h2&gt;

&lt;p&gt;You’ve successfully built a real-time data pipeline using Python, Kafka, and Cassandra. Here’s a summary of what you’ve done:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Set up Kafka via Confluent Cloud&lt;/li&gt;
&lt;li&gt;Pulled real-time weather data using OpenWeatherMap&lt;/li&gt;
&lt;li&gt;Streamed data to Kafka via a Python producer&lt;/li&gt;
&lt;li&gt;Consumed Kafka events and stored them in Cassandra&lt;/li&gt;
&lt;li&gt;Queried Cassandra data in DBeaver&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Suggested Enhancements:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Add Weather Alerts: Trigger notifications if temperatures exceed a threshold&lt;/li&gt;
&lt;li&gt;Streamlit Dashboard: Build a live dashboard showing city-by-city weather updates&lt;/li&gt;
&lt;li&gt;Data Retention Policy: Expire older data using Cassandra TTL (see the CQL sketch after this list)&lt;/li&gt;
&lt;li&gt;Dockerize the Project: For easier deployment &lt;/li&gt;
&lt;/ul&gt;
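
&lt;p&gt;For the retention idea above, here is a minimal CQL sketch of Cassandra TTLs; the seven-day window (604800 seconds) is an arbitrary example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Option 1: expire every newly written row 7 days after insertion
ALTER TABLE weather.weather_data WITH default_time_to_live = 604800;

-- Option 2: set a TTL on individual writes instead
INSERT INTO weather.weather_data (city, timestamp, temperature, humidity)
VALUES ('Nairobi', toTimestamp(now()), 22.5, 60) USING TTL 604800;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;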

</description>
      <category>dataengineering</category>
      <category>ai</category>
      <category>kafka</category>
    </item>
    <item>
      <title>[Boost]</title>
      <dc:creator>Nelson Sammy</dc:creator>
      <pubDate>Mon, 14 Apr 2025 13:05:12 +0000</pubDate>
      <link>https://dev.to/nelsongei/-2k7p</link>
      <guid>https://dev.to/nelsongei/-2k7p</guid>
      <description>&lt;p&gt;Boosted: &lt;a href="https://dev.to/nelsongei/the-ultimate-guide-to-apache-kafka-basics-architecture-and-core-concepts-233g"&gt;The Ultimate Guide to Apache Kafka: Basics, Architecture, and Core Concepts&lt;/a&gt; by Nelson Sammy (Mar 10 '25, 5 min read, 5 reactions).&lt;/p&gt;</description>
      <category>dataengineering</category>
      <category>ai</category>
      <category>kafka</category>
    </item>
    <item>
      <title>Apache Airflow for Data Engineering: Best Practices and Real-World Examples</title>
      <dc:creator>Nelson Sammy</dc:creator>
      <pubDate>Mon, 14 Apr 2025 04:31:14 +0000</pubDate>
      <link>https://dev.to/nelsongei/apache-airflow-for-data-engineering-best-practices-and-real-world-examples-k9d</link>
      <guid>https://dev.to/nelsongei/apache-airflow-for-data-engineering-best-practices-and-real-world-examples-k9d</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9yixrmy0qsj1si2vd7v7.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9yixrmy0qsj1si2vd7v7.jpeg" alt="Image description" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Apache Airflow&lt;/strong&gt; is open-source orchestration software, originally developed at Airbnb and now part of the Apache Software Foundation, that provides functionality for authoring, scheduling, and monitoring workflows. Some of the features available in Airflow include stateful scheduling, a rich user interface, core functionality for logging, monitoring, and alerting, and a code-based approach to authoring pipelines.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Apache Airflow?
&lt;/h2&gt;

&lt;p&gt;At its core, Airflow is used for orchestrating complex data processing tasks, enabling users to define and manage workflows as code (using Python). Airflow leverages Directed Acyclic Graphs (DAGs) to represent workflows, with individual tasks within a DAG representing specific operations like data extraction, transformation, or loading.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Use Apache Airflow in Data Engineering?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Apache Airflow&lt;/strong&gt; is beneficial in data engineering for its robust workflow orchestration capabilities, allowing for the creation, scheduling, and monitoring of complex data pipelines. It helps automate tasks, manage dependencies, and provides a centralized platform for visualizing and debugging workflows, ultimately leading to more efficient and reliable data processing.&lt;br&gt;
Here's a more detailed look at why Airflow is a valuable tool for data engineers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Orchestration and Scheduling: Airflow allows data engineers to define and schedule workflows as Directed Acyclic Graphs (DAGs), using Python code. This enables the orchestration of complex data pipelines, ensuring tasks are executed in the correct order and dependencies are managed effectively. Airflow provides a scheduler that can handle various scheduling intervals, from daily to hourly or weekly, simplifying the process of setting up recurring workflows.&lt;/li&gt;
&lt;li&gt;Automation and Scalability: Airflow automates data pipelines, reducing manual intervention and potential errors. It's highly scalable, allowing you to manage a large number of pipelines and tasks concurrently. The open-source nature of Airflow makes it readily accessible and customizable for various data engineering needs. &lt;/li&gt;
&lt;li&gt;Monitoring and Alerting: Airflow provides a user-friendly web interface for monitoring the progress of workflows, allowing you to visualize dependencies, logs, and task statuses. You can set up alerts to be notified of any issues or failures in your pipelines, ensuring timely intervention. This real-time monitoring helps prevent data inconsistencies and ensures downstream tasks only run when their prerequisites are met. &lt;/li&gt;
&lt;li&gt;Flexibility and Extensibility: Airflow's Python-based architecture allows for easy integration with various tools and libraries, making it adaptable to different data engineering environments. Its modular design enables you to extend Airflow's functionality with custom operators and plugins. Airflow supports asynchronous task execution, data-aware scheduling, and tasks that adapt to input conditions, providing flexibility in designing workflows. &lt;/li&gt;
&lt;li&gt;Collaboration and Documentation: Airflow's web UI facilitates collaboration among data engineers, allowing them to share and manage pipelines effectively. The Python-based DAG definitions provide clear documentation of your data pipelines, making them easier to understand and maintain. &lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Real-World Use Cases
&lt;/h2&gt;

&lt;p&gt;Apache Airflow is commonly used for orchestrating various data pipelines, including ETL (Extract, Transform, Load) processes, machine learning workflows, and data warehousing tasks. It excels at automating and monitoring these pipelines, making them reliable and scalable. Here's a more detailed look at its real-world applications: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;ETL Pipelines: 

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data Extraction&lt;/strong&gt;: Airflow can be used to pull data from various sources like databases, APIs, and cloud storage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Transformation&lt;/strong&gt;: It orchestrates the steps needed to clean, validate, and transform the extracted data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Loading&lt;/strong&gt;: Airflow loads the transformed data into data warehouses or other target systems.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Machine Learning Workflows: 

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data Preparation&lt;/strong&gt;: Airflow can automate tasks like data cleaning, feature engineering, and validation for machine learning models.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model Training&lt;/strong&gt;: It can trigger and manage model training processes, including tasks like running experiments and tuning hyperparameters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model Deployment&lt;/strong&gt;: Airflow helps automate the deployment of trained models to various platforms.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Data Warehousing: 

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data Updates&lt;/strong&gt;: Airflow schedules and automates the process of updating and managing data lakes and warehouses.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Refresh&lt;/strong&gt;: It can be used to refresh data views and materialized views in data warehouses.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Best Practices for Using Apache Airflow
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Keep DAGs Lightweight: Avoid writing heavy logic directly in the DAG file. Move business logic to separate Python modules or scripts.&lt;/li&gt;
&lt;li&gt;Use Task Retries and Alerts: Add retries and email/Slack alerts to catch and recover from transient failures.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;default_args = {
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
    'email_on_failure': True,
    'email': ['data-team@example.com']
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;Leverage XComs for Task Communication: Use XCom (cross-communication) to pass small metadata between tasks—but avoid it for large data!&lt;/li&gt;
&lt;li&gt;Dynamic DAGs for Scale: Generate DAGs dynamically if you have multiple similar pipelines (e.g., per customer or data source).&lt;/li&gt;
&lt;li&gt;Parameterize for Reusability: Use dagrun.conf or templates for passing dynamic parameters into DAGs for flexibility and reuse.&lt;/li&gt;
&lt;li&gt;Version Control DAGs: Keep your DAGs in Git and use CI/CD pipelines to deploy updates. This ensures reproducibility and collaboration.&lt;/li&gt;
&lt;li&gt;Monitor with the UI and Logs: Always check the Airflow UI to monitor execution, task duration, and inspect logs for troubleshooting.&lt;/li&gt;
&lt;li&gt;Use Sensors and Hooks Efficiently: Sensors wait for conditions to be met (e.g., file existence), while Hooks abstract external system connections (e.g., S3, PostgreSQL).&lt;/li&gt;
&lt;/ol&gt;
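
&lt;p&gt;To make a few of these practices concrete, here is a minimal sketch of a DAG, assuming Airflow 2.x: business logic stays in plain functions, retries come from default_args, and a small value is passed between tasks via XCom. The task names and values are illustrative only:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

default_args = {
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
    'email_on_failure': True,
    'email': ['data-team@example.com']
}

def extract():
    # Business logic lives in plain functions (or imported modules),
    # keeping the DAG file itself lightweight.
    return 42  # Return values are pushed to XCom automatically.

def load(ti):
    # Pull the small metadata value the extract task pushed to XCom.
    row_count = ti.xcom_pull(task_ids='extract')
    print(f"Loaded {row_count} rows")

with DAG(
    dag_id='example_lightweight_dag',
    start_date=datetime(2025, 1, 1),
    schedule_interval='@daily',
    default_args=default_args,
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id='extract', python_callable=extract)
    load_task = PythonOperator(task_id='load', python_callable=load)
    extract_task &amp;gt;&amp;gt; load_task
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;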

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Apache Airflow is a powerful ally in the data engineer’s toolkit. When used properly, it brings clarity, automation, and resilience to your data pipelines. Whether you're running simple ETL jobs or orchestrating ML workflows, following best practices and learning from real-world patterns will set you up for success.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>ai</category>
      <category>airflow</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>The Ultimate Guide to Apache Kafka: Basics, Architecture, and Core Concepts</title>
      <dc:creator>Nelson Sammy</dc:creator>
      <pubDate>Mon, 10 Mar 2025 15:41:58 +0000</pubDate>
      <link>https://dev.to/nelsongei/the-ultimate-guide-to-apache-kafka-basics-architecture-and-core-concepts-233g</link>
      <guid>https://dev.to/nelsongei/the-ultimate-guide-to-apache-kafka-basics-architecture-and-core-concepts-233g</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftam4u6oyq7j68e3oe7ap.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftam4u6oyq7j68e3oe7ap.png" alt="Image description" width="800" height="485"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Apache Kafka&lt;/strong&gt; is an open-source distributed event-streaming platform, or distributed commit log. It was developed at LinkedIn by a team that included Jay Kreps, Jun Rao, and Neha Narkhede. Apache Kafka is built to ingest and process data in real time, so it can be used to implement high-performance data pipelines, streaming analytics applications, and data integration services.&lt;/p&gt;

&lt;h3&gt;
  
  
  Apache Kafka Key Features and Concepts
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Distributed System: Kafka works as a cluster of one or more nodes that can live in different datacenters. We can distribute data and load across different nodes in the Kafka cluster, and it is inherently scalable, available, and fault-tolerant.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Event Streaming: An event is any type of action, incident, or change that's identified or recorded by software or applications. For example, a payment, a website click, or a temperature reading, along with a description of what happened. Kafka excels at handling continuous streams of data, making it ideal for real-time applications and data pipelines.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Scalability: Kafka can scale horizontally to handle increasing data volumes and user loads. Kafka clusters can be scaled up to a thousand brokers, handling trillions of messages per day and petabytes of data. Kafka's partitioned log model allows for elastic expansion and contraction of storage and processing capacities. This scalability ensures that Kafka can support a vast array of data sources and streams.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Durability: Kafka ensures durability by persisting data to disk and replicating it across brokers, preventing data loss even in the event of system failures. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Kafka Streams: Kafka Streams is a client library that allows developers to build real-time streaming applications directly on top of Kafka. It enables processing data streams in real-time, filtering, joining, aggregating, and grouping data without writing complex code. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Kafka Connect: Kafka Connect is a framework for connecting Kafka to external systems, allowing data to be moved into and out of Kafka. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;ksqlDB: ksqlDB is a stream processing engine that extends the Kafka Streams API, allowing developers to query and analyze streams using SQL-like syntax. &lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Getting Started with Kafka
&lt;/h2&gt;

&lt;p&gt;It is often recommended to start Apache Kafka with ZooKeeper for optimum compatibility. Also, installing Kafka on Windows may run into several problems because it is not natively designed for the Windows system. On Windows it is advised to use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;WSL (Windows Subsystem for Linux)&lt;/li&gt;
&lt;li&gt;Docker&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Otherwise, use Ubuntu to install and run Kafka. In either case, make sure you have Java 11 or 17. Ensure Java is installed by running:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;java --version
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If Java is not installed, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt install openjdk-11-jdk
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first command checks the Java version; if Java is not installed, run the second command to install it.&lt;br&gt;
Once that's done, head over to the &lt;a href="http://kafka.apache.org/downloads" rel="noopener noreferrer"&gt;kafka download page&lt;/a&gt; and download either the source or the binary release (note that a source release must be built before its scripts can run, while the binary release is ready to use). I will use the 3.6.0 source download. Go to your Downloads folder, open a terminal, and run the following command to extract it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tar -xzf kafka-3.6.0-src.tgz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will extract the archive into a new folder. We can then rename the folder by running the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mv kafka-3.6.0-src kafka
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This renames the kafka-3.6.0-src folder to kafka.&lt;/p&gt;

&lt;h5&gt;
  
  
  Start Zookeeper
&lt;/h5&gt;

&lt;p&gt;ZooKeeper is required for cluster management in Kafka, so it must be launched before Kafka; ZooKeeper ships as part of the Kafka distribution.&lt;br&gt;
To start ZooKeeper, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kafka/bin/zookeeper-server-start.sh kafka/config/zookeeper.properties
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;kafka/bin/zookeeper-server-start.sh&lt;/code&gt; is the script that starts the ZooKeeper server.&lt;br&gt;
&lt;code&gt;kafka/config/zookeeper.properties&lt;/code&gt; is the path to the configuration file for the ZooKeeper server.&lt;/p&gt;
&lt;h5&gt;
  
  
  Start Kafka Server
&lt;/h5&gt;

&lt;p&gt;Open another terminal, and run the following command&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kafka/bin/kafka-server-start.sh kafka/config/server.properties
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;kafka/bin/kafka-server-start.sh&lt;/code&gt; command starts the Kafka server, and &lt;code&gt;kafka/config/server.properties&lt;/code&gt; is the path to the configuration file for Apache Kafka.&lt;/p&gt;

&lt;h5&gt;
  
  
  Create a topic
&lt;/h5&gt;

&lt;p&gt;Once the zookeeper server and kafka server are both running, we can now create a topic. Open another terminal window and run the following command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kafka/bin/kafka-topics.sh  --create  --topic testourtopic  --bootstrap-server 127.0.0.1:9092 --partitions 1 --replication-factor 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;testourtopic is the topic name that will be created once the command is executed. By default, Apache Kafka runs on port 9092.&lt;br&gt;
&lt;code&gt;kafka/bin/kafka-topics.sh&lt;/code&gt; This is the script used to manage Kafka topics. It is located inside the Kafka installation directory (kafka/bin).&lt;br&gt;
&lt;code&gt;--create&lt;/code&gt; This flag tells Kafka to create a new topic.&lt;br&gt;
&lt;code&gt;--topic testourtopic&lt;/code&gt; This specifies the name of the topic to create.&lt;br&gt;
&lt;code&gt;--bootstrap-server 127.0.0.1:9092&lt;/code&gt; This defines the Kafka broker address. 127.0.0.1:9092 means Kafka is running on the local machine (localhost) on port 9092.&lt;br&gt;
&lt;code&gt;--partitions 1&lt;/code&gt; This sets the number of partitions for the topic to 1.&lt;br&gt;
&lt;code&gt;--replication-factor 1&lt;/code&gt; This sets the replication factor to 1.&lt;br&gt;
To list topics you need to run the following command&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kafka/bin/kafka-topics.sh --list --bootstrap-server localhost:9092
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
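
&lt;p&gt;To verify the topic end to end, you can open two more terminals and use the console producer and consumer that ship with Kafka:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Terminal 1: type messages and press Enter to publish them
kafka/bin/kafka-console-producer.sh --topic testourtopic --bootstrap-server localhost:9092

# Terminal 2: read the topic from the beginning
kafka/bin/kafka-console-consumer.sh --topic testourtopic --from-beginning --bootstrap-server localhost:9092
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;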



&lt;h2&gt;
  
  
  Apache Kafka Architecture
&lt;/h2&gt;

&lt;p&gt;Apache Kafka's architecture revolves around a distributed, fault-tolerant system for handling real-time data streams, featuring key components like producers, consumers, brokers, topics, and partitions, enabling high-throughput and low-latency data processing. &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Brokers: These are servers that manage data streams. Kafka clusters consist of one or more brokers. A broker works as a container that can hold multiple topics with different partitions. A unique integer ID is used to identify brokers in the Kafka cluster.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Topics: Topics are named channels or categories through which messages are sent and received. A stream of messages belonging to a specific category or feed name is referred to as a Kafka topic. In Kafka, data is stored in the form of topics. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Producers: Applications that write data (messages) to Kafka topics. They publish messages to one or more topics. They send data to the Kafka cluster. Whenever a Kafka producer publishes a message to Kafka, the broker receives the message and appends it to a particular partition. Producers are given a choice to publish messages to a partition of their choice.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Consumers &amp;amp; Consumer Groups: Applications that read data from Kafka topics. The data to be read by the consumers has to be pulled from the broker when the consumer is ready to receive the message. A consumer group in Kafka refers to a number of consumers that pull data from the same topic or same set of topics.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Partitions: Topics are divided into a configurable number of partitions, which are ordered, immutable sequences of messages, enabling horizontal scalability and parallel processing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Replication: Kafka replicates data across multiple brokers within a cluster, ensuring data durability and fault tolerance. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Leader and Follower: In a replicated partition, one broker acts as the leader, handling all writes, while other brokers (followers) replicate the data. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Offsets: Each message within a partition has a unique offset, which is a sequential number that identifies its position in the partition. &lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>dataengineering</category>
      <category>ai</category>
      <category>kafka</category>
    </item>
    <item>
      <title>A Comprehensive Guide to Setting Up a Data Engineering Project Environment</title>
      <dc:creator>Nelson Sammy</dc:creator>
      <pubDate>Wed, 29 Jan 2025 11:16:50 +0000</pubDate>
      <link>https://dev.to/nelsongei/a-comprehensive-guide-to-setting-up-a-data-engineering-project-environment-4fkl</link>
      <guid>https://dev.to/nelsongei/a-comprehensive-guide-to-setting-up-a-data-engineering-project-environment-4fkl</guid>
      <description>&lt;p&gt;Data engineering is the backbone of modern data-driven organizations, enabling the collection, storage, and processing of vast amounts of data. Setting up a robust and scalable data engineering project environment is critical to ensuring the success of your data pipelines, ETL processes, and analytics workflows. This guide will walk you through the essential steps to create a well-structured environment, covering cloud account setup, tool installation, networking, permissions, and best practices.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;1. Setting Up Cloud Accounts (AWS or Azure)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Choosing a Cloud Provider&lt;/p&gt;

&lt;p&gt;The first step in setting up your data engineering environment is selecting a cloud provider. AWS and Azure are the two most popular options, offering a wide range of services for data storage, processing, and analytics.&lt;/p&gt;

&lt;h4&gt;
  
  
  AWS
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Create an AWS Account: Sign up at aws.amazon.com.&lt;/li&gt;
&lt;li&gt;Set Up Billing Alerts: Configure billing alerts in the AWS Billing Dashboard to avoid unexpected costs.&lt;/li&gt;
&lt;li&gt;Enable Multi-Factor Authentication (MFA): Secure your root account with MFA.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Azure
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Create an Azure Account: Sign up at azure.microsoft.com.&lt;/li&gt;
&lt;li&gt;Set Up a Subscription: Choose a subscription model (e.g., Pay-As-You-Go) and configure spending limits.&lt;/li&gt;
&lt;li&gt;Enable Security Features: Use Azure Active Directory (AD) for identity management and enable MFA.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  2. Installing and Configuring Key Data Engineering Tools
&lt;/h2&gt;

&lt;h4&gt;
  
  
  Database Management
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;PostgreSQL: Install PostgreSQL for relational data storage. Use tools like pgAdmin or DBeaver as SQL clients to interact with the database.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt-get update
sudo apt-get install postgresql postgresql-contrib
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;NoSQL Databases: For unstructured data, consider MongoDB or Cassandra.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Data Storage Solutions
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;AWS S3: Use S3 for scalable object storage.&lt;/li&gt;
&lt;li&gt;Azure Blob Storage: Ideal for storing large amounts of unstructured data.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Workflow Orchestration
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Apache Airflow: Install Airflow to manage and schedule data pipelines.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install apache-airflow
airflow db init
airflow webserver --port 8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
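
&lt;p&gt;Note that the webserver alone does not execute tasks; you would typically also start the scheduler in a second terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;airflow scheduler
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;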



&lt;h4&gt;
  
  
  Version Control
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;GitHub: Set up a GitHub repository for version control and collaboration.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git init
git remote add origin &amp;lt;repository-url&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Stream Processing
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Apache Kafka: Install Kafka for real-time data streaming.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;wget https://downloads.apache.org/kafka/3.1.0/kafka_2.13-3.1.0.tgz
tar -xzf kafka_2.13-3.1.0.tgz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  3. Networking and Permissions
&lt;/h2&gt;

&lt;p&gt;Identity and Access Management (IAM)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS IAM: Create IAM roles and policies to grant least-privilege access to resources.&lt;/li&gt;
&lt;li&gt;Azure AD: Use Azure AD to manage user roles and permissions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Virtual Private Cloud (VPC) and Subnets&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS VPC: Set up a VPC to isolate your resources. Configure subnets, route tables, and security groups.&lt;/li&gt;
&lt;li&gt;Azure Virtual Network: Create a virtual network and define subnets for resource segmentation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Security Groups and Firewalls&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Configure security groups (AWS) or network security groups (Azure) to control inbound and outbound traffic.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  4. Preparing for Data Pipelines, ETL Processes, and Database Connections
&lt;/h2&gt;

&lt;p&gt;Data Pipeline Design&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Define the source, transformation, and destination (ETL) stages of your pipeline.&lt;/li&gt;
&lt;li&gt;Use tools like Apache NiFi or AWS Glue for ETL processes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Database Connections&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Configure JDBC/ODBC connections for databases.&lt;/li&gt;
&lt;li&gt;Use connection strings for cloud-based databases (e.g., AWS RDS or Azure SQL Database).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Data Validation and Testing&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Implement data validation checks to ensure data quality.&lt;/li&gt;
&lt;li&gt;Use unit testing frameworks like pytest for Python-based pipelines.&lt;/li&gt;
&lt;/ul&gt;
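
&lt;p&gt;As a minimal sketch of the pytest idea, assuming a pandas DataFrame as the pipeline output (the data and thresholds here are illustrative stand-ins):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# test_data_quality.py -- run with: pytest test_data_quality.py
import pandas as pd

def load_output():
    # Stand-in for reading your pipeline's real output
    return pd.DataFrame({"city": ["Nairobi", "Lagos"], "temperature": [22.5, 28.1]})

def test_no_missing_cities():
    df = load_output()
    assert df["city"].notna().all()

def test_temperature_within_plausible_range():
    df = load_output()
    assert df["temperature"].between(-60, 60).all()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;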

&lt;h2&gt;
  
  
  5. Integration with Cloud Services
&lt;/h2&gt;

&lt;p&gt;AWS Services&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;S3: Store raw and processed data.&lt;/li&gt;
&lt;li&gt;EC2: Use EC2 instances for running compute-intensive tasks.&lt;/li&gt;
&lt;li&gt;Redshift: Set up a data warehouse for analytics.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Azure Services&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Azure Blob Storage: Store large datasets.&lt;/li&gt;
&lt;li&gt;Azure Databricks: Use Databricks for big data processing and machine learning.&lt;/li&gt;
&lt;li&gt;Azure Synapse Analytics: Build a data warehouse for advanced analytics.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Hybrid Cloud Solutions&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use tools like Snowflake or Google BigQuery for cross-cloud data integration.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  6. Best Practices for Environment Configuration and Resource Management
&lt;/h2&gt;

&lt;p&gt;Infrastructure as Code (IaC)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use tools like Terraform or AWS CloudFormation to define and manage infrastructure.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_s3_bucket" "data_bucket" {
  bucket = "my-data-bucket"
  acl    = "private"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
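
&lt;p&gt;The usual Terraform workflow then applies a configuration like the one above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;terraform init    # download providers
terraform plan    # preview the changes
terraform apply   # create the resources
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;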



&lt;p&gt;Monitoring and Logging&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Implement monitoring using AWS CloudWatch or Azure Monitor.&lt;/li&gt;
&lt;li&gt;Use centralized logging tools like ELK Stack (Elasticsearch, Logstash, Kibana) or Splunk.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cost Optimization&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use spot instances (AWS) or low-priority VMs (Azure) for non-critical workloads.&lt;/li&gt;
&lt;li&gt;Regularly review and clean up unused resources.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Scalability and Performance&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use auto-scaling groups (AWS) or VM scale sets (Azure) to handle variable workloads.&lt;/li&gt;
&lt;li&gt;Optimize database queries and pipeline performance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Disaster Recovery&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Implement backup and recovery strategies using AWS Backup or Azure Backup.&lt;/li&gt;
&lt;li&gt;Use multi-region replication for critical data.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  7. Additional Considerations
&lt;/h2&gt;

&lt;p&gt;Collaboration and Documentation&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use Confluence or Notion for project documentation.&lt;/li&gt;
&lt;li&gt;Encourage team collaboration through Slack or Microsoft Teams.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Compliance and Security&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ensure compliance with regulations like GDPR or HIPAA.&lt;/li&gt;
&lt;li&gt;Encrypt data at rest and in transit using AWS KMS or Azure Key Vault.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Continuous Integration/Continuous Deployment (CI/CD)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Set up CI/CD pipelines using GitHub Actions, AWS CodePipeline, or Azure DevOps.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>dataengineering</category>
      <category>ai</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Exploratory Data Analysis using Data Visualization Techniques</title>
      <dc:creator>Nelson Sammy</dc:creator>
      <pubDate>Sat, 14 Oct 2023 03:13:34 +0000</pubDate>
      <link>https://dev.to/nelsongei/exploratory-data-analysis-using-data-visualization-techniques-3jpp</link>
      <guid>https://dev.to/nelsongei/exploratory-data-analysis-using-data-visualization-techniques-3jpp</guid>
      <description>&lt;p&gt;&lt;strong&gt;&lt;em&gt;Introduction&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Data is often hailed as the new oil, and like oil, it requires refinement before it can reveal its true value. In the world of data science, Exploratory Data Analysis (EDA) is the refining process that uncovers insights and patterns from raw data. One of the most powerful tools in the EDA arsenal is data visualization. Visualizing data can help you understand its structure, identify outliers, discover trends, and communicate findings effectively. In this article, we'll delve into the world of EDA and explore how data visualization techniques can be harnessed to unlock the hidden stories within data.&lt;/p&gt;

&lt;p&gt;The Role of Exploratory Data Analysis&lt;/p&gt;

&lt;p&gt;Before we jump into data visualization, let's understand the importance of EDA. It's the initial, crucial phase of data analysis where raw data is scrutinized to grasp its essence. EDA helps data scientists and analysts:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Understand the Data&lt;/strong&gt; You need to get to know your data intimately. This means understanding its size, structure, and quality. EDA can help you identify missing values, data types, and potential data issues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Identify Patterns and Relationships&lt;/strong&gt; EDA allows you to uncover patterns, trends, and relationships between variables. This can be invaluable for making informed decisions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Spot Anomalies&lt;/strong&gt; Outliers and anomalies can be hiding within your data. EDA can help detect these unusual data points, which might hold essential information or indicate data quality issues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Formulate Hypotheses&lt;/strong&gt; EDA can help you generate hypotheses that can be tested later with more advanced statistical methods.&lt;/p&gt;

&lt;p&gt;Data Visualization as a Tool for EDA&lt;/p&gt;

&lt;p&gt;Data visualization is the art of representing data in graphical or pictorial format. It transforms raw numbers into visual insights, making complex information more understandable. Here are some key data visualization techniques that are particularly useful in EDA:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Histograms&lt;/strong&gt; A histogram provides a visual representation of the distribution of a single variable. It helps you understand the central tendency and spread of the data. For instance, it can reveal whether a dataset is normally distributed or skewed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scatter Plots&lt;/strong&gt; Scatter plots are excellent for visualizing the relationship between two continuous variables. They help in identifying patterns, clusters, and correlations between variables.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Box Plots&lt;/strong&gt; Box plots display the distribution, central tendency, and variability of a dataset. They are great for identifying outliers and comparing multiple datasets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bar Charts&lt;/strong&gt; Bar charts are useful for displaying the distribution of categorical variables. They can show the frequency of categories and highlight trends or patterns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Heatmaps&lt;/strong&gt; Heatmaps are beneficial for visualizing relationships in large datasets. They use color to represent the strength or intensity of a relationship, making it easy to spot patterns and clusters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Line Charts&lt;/strong&gt; Line charts are perfect for showing trends over time. They are commonly used in time series data analysis to uncover temporal patterns.&lt;/p&gt;
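
&lt;p&gt;To make this concrete, here is a minimal sketch of two of these plots with pandas and Matplotlib, using a small synthetic dataset as a stand-in for your own data:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in data: 500 observations of two continuous variables
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "age": rng.normal(35, 10, 500),
    "income": rng.normal(50_000, 15_000, 500),
})

# A histogram for one variable and a scatter plot for the relationship
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
df["age"].plot.hist(bins=30, ax=ax1, title="Histogram: distribution of age")
df.plot.scatter(x="age", y="income", ax=ax2, title="Scatter: age vs income")
plt.tight_layout()
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;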

&lt;p&gt;The Power of Interactive Visualization&lt;/p&gt;

&lt;p&gt;With the advancement of technology, interactive data visualization tools have become increasingly popular. These tools allow users to explore data dynamically, zooming in on areas of interest, filtering, and getting real-time insights. Tools like Tableau, Power BI, and Python libraries like Plotly and Bokeh have made it easier for data scientists to create interactive visualizations that enhance EDA.&lt;/p&gt;

&lt;p&gt;Conclusion&lt;/p&gt;

&lt;p&gt;Exploratory Data Analysis is the cornerstone of data science, providing the crucial initial steps to understand data before embarking on modeling and prediction. Data visualization techniques are powerful allies in this endeavor, allowing data scientists to see, explore, and communicate the patterns and stories hidden within the data. Whether you are preparing data for machine learning, identifying trends in business data, or exploring scientific phenomena, visualization-driven EDA is the essential first step.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Data Science for Beginners: 2023 - 2024 Complete Roadmap</title>
      <dc:creator>Nelson Sammy</dc:creator>
      <pubDate>Sun, 01 Oct 2023 15:46:56 +0000</pubDate>
      <link>https://dev.to/nelsongei/data-science-for-beginners-2023-2024-complete-roadmap-36om</link>
      <guid>https://dev.to/nelsongei/data-science-for-beginners-2023-2024-complete-roadmap-36om</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;The term "Data is the new Gold" is a term that has been used in the early 2000s and personally as a tech enthusiast I came across it in the last 7 years. This phrase is often attributed by Clive Humby, a British mathematician and data science pioneer. "Data is the new Gold" in my own understanding means  it is valuable, but if unrefined, it cannot really be used. Over time, this phrase has evolved, and "Data is the new gold" is a variation of the original expression. It has since become a popular way to emphasize the immense value of data in the modern era of technology and data-driven decision-making. &lt;br&gt;
The big question you might ask yourself before learning data science is, what is really data science? In a simple explanation, Data science is the study of data to extract meaningful insights for business.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Key tools to learn for Data Science&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To learn data science you will need the following tools to extract and analyze data&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Programming Languages: Python, R&lt;/li&gt;
&lt;li&gt;Machine learning libraries: TensorFlow, Keras, and Scikit-learn &lt;/li&gt;
&lt;li&gt;Data visualization tools: Tableau, Power BI, and Matplotlib &lt;/li&gt;
&lt;li&gt;Data storage and management systems: Databases like MySQL, MongoDB, and PostgreSQL &lt;/li&gt;
&lt;li&gt;Mathematics: Linear Algebra, Calculus, Statistics, Probability.&lt;/li&gt;
&lt;li&gt;Data Manipulation: Numpy, Pandas&lt;/li&gt;
&lt;li&gt;Git and GitHub: Git is for version control, while GitHub enables collaboration with other data scientists&lt;/li&gt;
&lt;li&gt;Machine Learning: Supervised learning and Unsupervised learning&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Why Learn Data Science
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Need for data scientists&lt;/strong&gt;: Data science has become increasingly important in today's world due to the vast amount of data being generated by businesses, organizations, and individuals. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Opportunities&lt;/strong&gt;: Almost every organization today, including the blue-collar "Jua Kali" sector, has systems and applications, and those applications have databases full of data. At some point the owners of those applications will need someone to analyze that data to predict, for instance, usage, income, and more.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Salary&lt;/strong&gt;: In the US, the salary of a data scientist is about $124,407 per year. Even as a freelancer in, say, Kenya, that is good money.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Learning is something everyone in the tech world is supposed to do, no matter which area you are in. There are several ways to learn, for instance online courses, boot camps, communities, projects, solving bugs, and many more.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>luxdatanerds</category>
    </item>
  </channel>
</rss>
