<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Joy Akinyi</title>
    <description>The latest articles on DEV Community by Joy Akinyi (@joy_akinyi_115689d7dff92f).</description>
    <link>https://dev.to/joy_akinyi_115689d7dff92f</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3374426%2Fdcb3331c-d283-4254-b97c-6b77b286f8c4.png</url>
      <title>DEV Community: Joy Akinyi</title>
      <link>https://dev.to/joy_akinyi_115689d7dff92f</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/joy_akinyi_115689d7dff92f"/>
    <language>en</language>
    <item>
      <title>Change Data Capture (CDC) in Data Engineering: Concepts, Tools, and Real-World Implementation Strategies</title>
      <dc:creator>Joy Akinyi</dc:creator>
      <pubDate>Sun, 14 Sep 2025 14:01:37 +0000</pubDate>
      <link>https://dev.to/joy_akinyi_115689d7dff92f/change-data-capture-cdc-in-data-engineering-concepts-tools-and-real-world-implementation-22bm</link>
      <guid>https://dev.to/joy_akinyi_115689d7dff92f/change-data-capture-cdc-in-data-engineering-concepts-tools-and-real-world-implementation-22bm</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In today’s fast-paced data landscape, organizations need real-time insights to stay competitive. Change Data Capture (CDC) is a cornerstone of modern data engineering, enabling systems to track and propagate database changes—inserts, updates, and deletes—to downstream applications with minimal latency. Unlike batch processing, which relies on periodic data dumps, CDC streams changes as they occur, supporting use cases like real-time analytics, microservices synchronization, and cloud migrations.&lt;/p&gt;

&lt;p&gt;According to &lt;a href="https://www.confluent.io/learn/change-data-capture/" rel="noopener noreferrer"&gt;Confluent&lt;/a&gt;, CDC "tracks all changes in data sources so they can be captured in destination systems, ensuring data integrity and consistency across multiple systems and environments." This is critical for scenarios like replicating operational data to a data warehouse without overloading the source database. &lt;a href="https://debezium.io/documentation/" rel="noopener noreferrer"&gt;Debezium&lt;/a&gt;, an open-source CDC platform, defines it as a distributed service that captures row-level changes and streams them as events to consumers, making it ideal for event-driven architectures.&lt;/p&gt;

&lt;p&gt;This article dives into CDC concepts, explores tools like Debezium and Kafka, and walks through a real-world implementation for a crypto time-series data pipeline using Docker, PostgreSQL, Kafka, and Cassandra. We’ll also address common challenges—schema evolution, event ordering, late data, and fault tolerance—with practical solutions, drawing from official documentation and real-world configurations. By the end, you’ll have a clear blueprint for building robust CDC pipelines.&lt;/p&gt;

&lt;h2&gt;
  
  
  Concepts of CDC
&lt;/h2&gt;

&lt;p&gt;CDC transforms database transactions into event streams, allowing applications to react to changes in near real-time. It’s particularly valuable for synchronizing data across heterogeneous systems, such as replicating a transactional database to a scalable NoSQL store for analytics.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Methods of CDC
&lt;/h3&gt;

&lt;p&gt;CDC can be implemented through several approaches, each with distinct trade-offs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Log-Based CDC&lt;/strong&gt;: The most efficient method, it reads the database’s transaction log (e.g., PostgreSQL’s WAL or MySQL’s binlog) to capture changes. Logs record all operations sequentially, enabling low-latency capture with minimal impact on the source database. &lt;a href="https://debezium.io/documentation/" rel="noopener noreferrer"&gt;Debezium’s documentation&lt;/a&gt; highlights its effectiveness for capturing all operations, including deletes, without additional queries.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Trigger-Based CDC&lt;/strong&gt;: Triggers are set on database tables to log changes into an "outbox" table or notify consumers directly. While simple, this adds overhead, as triggers execute SQL for each change. &lt;a href="https://www.confluent.io/learn/change-data-capture/" rel="noopener noreferrer"&gt;Confluent&lt;/a&gt; notes that triggers can impact write performance, making them less ideal for high-throughput systems.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Query-Based CDC&lt;/strong&gt;: This involves polling the database for changes using timestamps or version columns. It’s straightforward but risks missing events and struggles with deletes. &lt;a href="https://redpanda.com/guides/cdc/change-data-capture-guide" rel="noopener noreferrer"&gt;Redpanda&lt;/a&gt; recommends it only when log-based options are unavailable due to its inefficiencies.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
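&lt;p&gt;To make the query-based trade-off concrete, here is a minimal Python sketch of polling with a version column (&lt;code&gt;sqlite3&lt;/code&gt; stands in for the source database; the table and column names are illustrative). Rows deleted between polls leave no trace, which is exactly why this method struggles with deletes:&lt;/p&gt;

```python
import sqlite3

# In-memory database stands in for the operational source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (id INTEGER PRIMARY KEY, price REAL, version INTEGER)")
conn.execute("INSERT INTO trades VALUES (1, 50000.0, 1), (2, 50100.0, 1)")

def poll_changes(conn, last_version):
    """Query-based CDC: fetch rows whose version advanced since the last poll."""
    rows = conn.execute(
        "SELECT id, price, version FROM trades WHERE version > ?", (last_version,)
    ).fetchall()
    return rows, max((r[2] for r in rows), default=last_version)

# First poll sees the initial inserts.
changes, cursor = poll_changes(conn, 0)

# An update bumps the version column; a delete leaves nothing to poll.
conn.execute("UPDATE trades SET price = 50200.0, version = 2 WHERE id = 1")
conn.execute("DELETE FROM trades WHERE id = 2")

changes, cursor = poll_changes(conn, cursor)
print(changes)  # only the update is captured; the delete is silently missed
```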

&lt;h3&gt;
  
  
  CDC Architecture
&lt;/h3&gt;

&lt;p&gt;A typical CDC pipeline includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Source Database&lt;/strong&gt;: Where changes occur (e.g., PostgreSQL).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capture Mechanism&lt;/strong&gt;: A tool like Debezium parses logs or triggers to extract events.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Event Stream&lt;/strong&gt;: Apache Kafka buffers and distributes events reliably.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consumers&lt;/strong&gt;: Downstream systems (e.g., Cassandra, data warehouses) process events.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The flow is: Changes are logged → CDC tool captures events → Events are streamed to Kafka → Consumers apply changes. For example, in a crypto pipeline, trade data is inserted into PostgreSQL, captured by Debezium, streamed via Kafka, and stored in Cassandra for scalable analytics.&lt;/p&gt;

&lt;p&gt;Here’s a simplified architecture description:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[PostgreSQL] → [Transaction Log (WAL)] → [Debezium] → [Kafka Topics] → [Cassandra]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
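&lt;p&gt;On the consumer side, the key property is idempotent application: replaying the same events (for example, after a consumer restart) must converge to the same state. A minimal Python sketch, where a dict stands in for Cassandra and the event shape is illustrative rather than Debezium's exact envelope:&lt;/p&gt;

```python
# A dict keyed by primary key stands in for the Cassandra table.
store = {}

def apply_event(store, event):
    """Apply one CDC event as an idempotent upsert or delete, keyed by primary key."""
    key = event["key"]
    if event["op"] == "d":   # delete
        store.pop(key, None)
    else:                    # create ("c") and update ("u") are both upserts
        store[key] = event["after"]

events = [
    {"op": "c", "key": 1, "after": {"symbol": "BTCUSDT", "price": 50000.0}},
    {"op": "u", "key": 1, "after": {"symbol": "BTCUSDT", "price": 50200.0}},
    {"op": "d", "key": 1, "after": None},
]

# Applying the stream once, or replaying it twice, yields the same final state.
for e in events + events:
    apply_event(store, e)
assert store == {}
```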



&lt;h2&gt;
  
  
  Tools for CDC
&lt;/h2&gt;

&lt;p&gt;Several tools enable CDC, with Debezium and Kafka Connect standing out for their open-source flexibility and integration.&lt;/p&gt;

&lt;h3&gt;
  
  
  Debezium
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://debezium.io/documentation/" rel="noopener noreferrer"&gt;Debezium&lt;/a&gt; is an open-source platform for log-based CDC, designed to work with Kafka. It supports connectors for databases like PostgreSQL, MySQL, and MongoDB, capturing row-level changes as events. Debezium performs an initial snapshot of the database and then streams ongoing changes, ensuring consistency and scalability.&lt;/p&gt;

&lt;h3&gt;
  
  
  Kafka Connect
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://kafka.apache.org/documentation/#connect" rel="noopener noreferrer"&gt;Kafka Connect&lt;/a&gt; is a framework for integrating Kafka with external systems. It uses source connectors (e.g., Debezium) to capture data and sink connectors to write to targets like Cassandra. &lt;a href="https://www.confluent.io/learn/change-data-capture/" rel="noopener noreferrer"&gt;Confluent’s CDC guide&lt;/a&gt; emphasizes its role in simplifying CDC pipelines with managed connectors.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-World Implementation: Crypto Data Pipeline
&lt;/h2&gt;

&lt;p&gt;To illustrate CDC, we’ll implement a pipeline for crypto time-series data, replicating trades from PostgreSQL to Cassandra via Kafka using Debezium. The setup uses Docker on an Ubuntu server, based on a real-world configuration shared by a data engineering team.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Ubuntu server (e.g., 22.04 LTS).&lt;/li&gt;
&lt;li&gt;Docker and Docker Compose installed.&lt;/li&gt;
&lt;li&gt;Firewall allowing ports 5433 (PostgreSQL), 2181 (ZooKeeper), 9092 (Kafka), 8083 (Kafka Connect), and 9042 (Cassandra).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 1: Set Up the Ubuntu Server
&lt;/h3&gt;

&lt;p&gt;Update the system and install Docker:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;sudo &lt;/span&gt;apt upgrade &lt;span class="nt"&gt;-y&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;docker.io docker-compose &lt;span class="nt"&gt;-y&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl start docker
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl &lt;span class="nb"&gt;enable &lt;/span&gt;docker
&lt;span class="nb"&gt;sudo &lt;/span&gt;usermod &lt;span class="nt"&gt;-aG&lt;/span&gt; docker &lt;span class="nv"&gt;$USER&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Configure Docker Services
&lt;/h3&gt;

&lt;p&gt;The following &lt;code&gt;docker-compose.yml&lt;/code&gt; orchestrates the pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;3.8'&lt;/span&gt;
&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;postgres&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;debezium/postgres:16&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mydb&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_USER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;joy&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;your password&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_DB&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mydb&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;postgres"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-c"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wal_level=logical"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-c"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_wal_senders=1"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-c"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_replication_slots=1"&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5433:5432"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;postgres_data:/var/lib/postgresql/data&lt;/span&gt;
  &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;binance&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;binance_app&lt;/span&gt;
    &lt;span class="na"&gt;env_file&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.env&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
  &lt;span class="na"&gt;zookeeper&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;confluentinc/cp-zookeeper:7.6.1&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;zookeeper&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;ZOOKEEPER_CLIENT_PORT&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2181&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2181:2181"&lt;/span&gt;
  &lt;span class="na"&gt;kafka&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;confluentinc/cp-kafka:7.6.1&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kafka&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;zookeeper&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;KAFKA_ZOOKEEPER_CONNECT&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;zookeeper:2181&lt;/span&gt;
      &lt;span class="na"&gt;KAFKA_ADVERTISED_LISTENERS&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PLAINTEXT://kafka:9092&lt;/span&gt;
      &lt;span class="na"&gt;KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;9092:9092"&lt;/span&gt;
  &lt;span class="na"&gt;connect&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;debezium/connect:2.7.3.Final&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;connect&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;kafka&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;cassandra&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;BOOTSTRAP_SERVERS&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kafka:9092&lt;/span&gt;
      &lt;span class="na"&gt;GROUP_ID&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
      &lt;span class="na"&gt;CONFIG_STORAGE_TOPIC&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;connect-configs&lt;/span&gt;
      &lt;span class="na"&gt;OFFSET_STORAGE_TOPIC&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;connect-offsets&lt;/span&gt;
      &lt;span class="na"&gt;STATUS_STORAGE_TOPIC&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;connect-status&lt;/span&gt;
      &lt;span class="na"&gt;CONNECT_CONFIG_STORAGE_REPLICATION_FACTOR&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
      &lt;span class="na"&gt;CONNECT_OFFSET_STORAGE_REPLICATION_FACTOR&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
      &lt;span class="na"&gt;CONNECT_STATUS_STORAGE_REPLICATION_FACTOR&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
      &lt;span class="na"&gt;KEY_CONVERTER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;org.apache.kafka.connect.json.JsonConverter&lt;/span&gt;
      &lt;span class="na"&gt;VALUE_CONVERTER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;org.apache.kafka.connect.json.JsonConverter&lt;/span&gt;
      &lt;span class="na"&gt;CONNECT_PLUGIN_PATH&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/kafka/connect,/kafka/connect/cassandra-sink,/kafka/connect/debezium-connector-postgres&lt;/span&gt;
      &lt;span class="na"&gt;CONNECT_REPLICATION_FACTOR&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
      &lt;span class="na"&gt;CONNECT_OFFSET_FLUSH_INTERVAL_MS&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10000&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8083:8083"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./plugins:/kafka/connect&lt;/span&gt;
  &lt;span class="na"&gt;cassandra&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cassandra:5&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cassandra&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;MAX_HEAP_SIZE=1G&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;HEAP_NEWSIZE=256M&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;9042:9042"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;cassandra_data:/var/lib/cassandra&lt;/span&gt;
&lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;postgres_data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;cassandra_data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;postgres&lt;/strong&gt;: Uses &lt;code&gt;debezium/postgres:16&lt;/code&gt; with logical decoding enabled via &lt;code&gt;wal_level=logical&lt;/code&gt;. The &lt;code&gt;mydb&lt;/code&gt; database is created with user &lt;code&gt;joy&lt;/code&gt; and a placeholder password (replace &lt;code&gt;your password&lt;/code&gt; with a strong secret). Host port 5433 maps to container port 5432.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;app&lt;/strong&gt;: A custom &lt;code&gt;binance&lt;/code&gt; image (assumed to ingest crypto data into PostgreSQL tables like &lt;code&gt;klines&lt;/code&gt;, &lt;code&gt;trades&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;zookeeper&lt;/strong&gt; and &lt;strong&gt;kafka&lt;/strong&gt;: Provide the event streaming backbone, with Kafka advertising on port 9092.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;connect&lt;/strong&gt;: Runs Debezium Connect to manage CDC connectors, exposing port 8083 for REST API.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;cassandra&lt;/strong&gt;: Runs Cassandra 5 for scalable storage, with port 9042 for CQL access.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;restart policy&lt;/strong&gt;: Adding &lt;code&gt;restart: always&lt;/code&gt; to each service (not shown above) keeps containers running across crashes and reboots, replacing the need for &lt;code&gt;nohup&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Save this in &lt;code&gt;~/crypto-pipeline/docker-compose.yml&lt;/code&gt; and run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ~/crypto-pipeline
docker-compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Configure Debezium Connectors
&lt;/h3&gt;

&lt;h4&gt;
  
  
  PostgreSQL Source Connector (&lt;code&gt;postgres-source.json&lt;/code&gt;)
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"postgres-source"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"config"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"connector.class"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"io.debezium.connector.postgresql.PostgresConnector"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"database.hostname"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mydb"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"database.port"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"5432"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"database.user"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"joy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"database.password"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"your password"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"database.dbname"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mydb"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"plugin.name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"pgoutput"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"slot.name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"debezium_slot"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"publication.name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"debezium_pub"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"table.include.list"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"public.klines,public.order_book,public.prices,public.ticker_24hr,public.trades"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"topic.prefix"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"dbz"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Captures changes from PostgreSQL tables (&lt;code&gt;klines&lt;/code&gt;, &lt;code&gt;order_book&lt;/code&gt;, &lt;code&gt;prices&lt;/code&gt;, &lt;code&gt;ticker_24hr&lt;/code&gt;, &lt;code&gt;trades&lt;/code&gt;) and publishes each table to a Kafka topic named from the prefix, schema, and table (e.g., &lt;code&gt;dbz.public.klines&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Uses &lt;code&gt;pgoutput&lt;/code&gt; for logical decoding, with a dedicated replication slot (&lt;code&gt;debezium_slot&lt;/code&gt;) and publication (&lt;code&gt;debezium_pub&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Connects to the &lt;code&gt;mydb&lt;/code&gt; database with user &lt;code&gt;joy&lt;/code&gt; and password &lt;code&gt;your password&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
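&lt;p&gt;Debezium derives each topic name by joining the &lt;code&gt;topic.prefix&lt;/code&gt;, schema, and table name with dots, which is why the sink configuration below subscribes to &lt;code&gt;dbz.public.*&lt;/code&gt; topics. A quick sketch of the names this source connector produces:&lt;/p&gt;

```python
# Topic name = topic.prefix + "." + schema-qualified table name.
prefix = "dbz"
tables = ["public.klines", "public.order_book", "public.prices",
          "public.ticker_24hr", "public.trades"]

topics = [f"{prefix}.{t}" for t in tables]
print(topics[0])  # dbz.public.klines
```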

&lt;p&gt;Register the connector:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="nt"&gt;--data&lt;/span&gt; @postgres-source.json http://localhost:8083/connectors
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Cassandra Sink Connector (&lt;code&gt;cassandra-sink.json&lt;/code&gt;)
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cassandra-sink"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"config"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"connector.class"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"com.datastax.oss.kafka.sink.CassandraSinkConnector"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"tasks.max"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"topics"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"dbz.public.prices,dbz.public.klines,dbz.public.order_book,dbz.public.ticker_24hr,dbz.public.trades"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"contactPoints"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cassandra"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"loadBalancing.localDc"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"datacenter1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"topic.dbz.public.prices.crypto.prices.mapping"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"symbol=value.symbol, price=value.price, event_time=value.event_time"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"topic.dbz.public.klines.crypto.klines.mapping"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"symbol=value.symbol, open_time=value.open_time, close_price=value.close_price, close_time=value.close_time, event_time=value.event_time, high_price=value.highPrice, low_price=value.lowPrice, open_price=value.openPrice, volume=value.volume"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"topic.dbz.public.order_book.crypto.order_book.mapping"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"symbol=value.symbol, event_time=value.event_time, side=value.side, price=value.price, qty=value.qty"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"topic.dbz.public.ticker_24hr.crypto.ticker_24hr.mapping"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"symbol=value.symbol, event_time=value.event_time, high_price=value.highPrice, last_price=value.lastPrice, low_price=value.lowPrice, price_change_percent=value.priceChangePercent, volume=value.volume"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"topic.dbz.public.trades.crypto.trades.mapping"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"id=value.id, price=value.price, qty=value.qty, quoteqty=value.quoteQty, time=value.time, isbuyermaker=value.isBuyerMaker, isbestmatch=value.isBestMatch, event_time=value.event_time"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"transforms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"unwrap"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"transforms.unwrap.type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"io.debezium.transforms.ExtractNewRecordState"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Writes data from the Kafka topics (&lt;code&gt;dbz.public.klines&lt;/code&gt;, etc.) to Cassandra tables in the &lt;code&gt;crypto&lt;/code&gt; keyspace, using the per-topic mappings to match event fields to columns.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;unwrap&lt;/code&gt; transform (&lt;code&gt;ExtractNewRecordState&lt;/code&gt;) flattens Debezium's change-event envelope so the sink receives plain row values.&lt;/li&gt;
&lt;li&gt;Connects to the &lt;code&gt;cassandra&lt;/code&gt; service for scalable storage of crypto data.&lt;/li&gt;
&lt;/ul&gt;
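&lt;p&gt;Note that the sink connector does not create schema for you: the &lt;code&gt;crypto&lt;/code&gt; keyspace and target tables must exist in Cassandra before registration. A minimal sketch for the &lt;code&gt;trades&lt;/code&gt; table, with column names following the mapping above (adjust types, replication, and the primary key to your query patterns):&lt;/p&gt;

```sql
-- Run via: docker exec -it cassandra cqlsh
CREATE KEYSPACE IF NOT EXISTS crypto
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

CREATE TABLE IF NOT EXISTS crypto.trades (
  id bigint PRIMARY KEY,
  price decimal,
  qty decimal,
  quoteqty decimal,
  time timestamp,
  isbuyermaker boolean,
  isbestmatch boolean,
  event_time timestamp
);
```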

&lt;p&gt;Register the connector:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="nt"&gt;--data&lt;/span&gt; @cassandra-sink.json http://localhost:8083/connectors
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 4: Test the Pipeline
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Populate PostgreSQL&lt;/strong&gt;: Connect to the database:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   docker &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; mydb psql &lt;span class="nt"&gt;-U&lt;/span&gt; joy &lt;span class="nt"&gt;-d&lt;/span&gt; mydb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a table and insert data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;   &lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="k"&gt;public&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trades&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;SERIAL&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;symbol&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="nb"&gt;DECIMAL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
   &lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="k"&gt;public&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trades&lt;/span&gt; &lt;span class="n"&gt;REPLICA&lt;/span&gt; &lt;span class="k"&gt;IDENTITY&lt;/span&gt; &lt;span class="k"&gt;FULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
   &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="k"&gt;public&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trades&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'BTCUSDT'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;50000&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2025-09-13 11:35:00'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
   &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;public&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trades&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Exit with &lt;code&gt;\q&lt;/code&gt;.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Verify Cassandra&lt;/strong&gt;: Check replicated data:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   docker &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; cassandra cqlsh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Query the &lt;code&gt;trades&lt;/code&gt; table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   DESCRIBE KEYSPACE crypto;
   SELECT * FROM crypto.trades;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Exit with &lt;code&gt;exit&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;binance&lt;/code&gt; app inserts data into PostgreSQL, Debezium captures changes, Kafka streams them, and the sink connector writes to Cassandra.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Alternatively, you can automate the Python script that fetches data from Binance at a defined interval (say, every 5 minutes) using APScheduler’s BlockingScheduler.&lt;/em&gt;&lt;/p&gt;
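The scheduling pattern described above can be sketched with the standard library; `fetch_trades` is a hypothetical stand-in for the Binance fetcher, and APScheduler's `BlockingScheduler` wraps the same loop with interval/cron triggers:

```python
import time


def fetch_trades():
    """Hypothetical stand-in for the script that pulls trades
    from Binance and inserts them into PostgreSQL."""
    print("fetched a batch of trades")


def run_every(interval_seconds, job, iterations):
    """Run `job` every `interval_seconds` seconds, `iterations` times.

    APScheduler's BlockingScheduler implements the same pattern:
        scheduler.add_job(fetch_trades, "interval", minutes=5)
        scheduler.start()
    """
    for _ in range(iterations):
        job()
        time.sleep(interval_seconds)


# Example wiring (kept as a comment so the sketch returns immediately):
# run_every(300, fetch_trades, iterations=12)  # one hour of 5-minute polls
```

Because the inserts land in PostgreSQL, Debezium picks up each scheduled batch automatically with no change to the rest of the pipeline.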

&lt;h2&gt;
  
  
  Challenges &amp;amp; Solutions
&lt;/h2&gt;

&lt;p&gt;Building robust CDC pipelines involves addressing several challenges.&lt;/p&gt;

&lt;h3&gt;
  
  
  Schema Evolution
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Challenge&lt;/strong&gt;: Schema changes (e.g., adding/dropping columns) can break consumers. &lt;a href="https://www.decodable.co/blog/change-data-capture-cdc-explained" rel="noopener noreferrer"&gt;Decodable&lt;/a&gt; notes that forward-compatible changes (adding optional columns) allow old consumers to ignore new fields, while backward-compatible changes (dropping optional columns) ensure new consumers handle old data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solutions&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use a schema registry (e.g., &lt;a href="https://docs.confluent.io/platform/current/schema-registry" rel="noopener noreferrer"&gt;Confluent Schema Registry&lt;/a&gt;) to enforce compatibility.&lt;/li&gt;
&lt;/ul&gt;
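The two compatibility directions can be illustrated with plain dictionaries standing in for Avro records; a schema registry enforces the same guarantees formally, and the field names here are illustrative:

```python
def read_v1(record):
    # Old consumer: knows only the original fields and ignores anything new,
    # which is why *adding* an optional column is forward-compatible.
    return {"id": record["id"], "price": record["price"]}


def read_v2(record):
    # New consumer: supplies a default when an optional field is absent,
    # which is why *dropping* an optional column is backward-compatible.
    return {
        "id": record["id"],
        "price": record["price"],
        "qty": record.get("qty", 0),  # default covers old records
    }


old_record = {"id": 1, "price": 50000.0}            # written before "qty" existed
new_record = {"id": 2, "price": 50100.0, "qty": 3}  # written after
```

Both consumers can process both records, which is exactly the property the registry's compatibility checks protect.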

&lt;h3&gt;
  
  
  Event Ordering
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Challenge&lt;/strong&gt;: Incorrect event order can lead to inconsistencies, especially across distributed systems, so maintaining the original transactional order matters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solutions&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kafka guarantees order within partitions. Use key-based partitioning (e.g., by &lt;code&gt;symbol&lt;/code&gt; in the crypto pipeline) to ensure related events stay ordered.&lt;/li&gt;
&lt;li&gt;Debezium groups events by transaction for consistency. &lt;a href="https://olake.io/change-data-capture/" rel="noopener noreferrer"&gt;OLake&lt;/a&gt; recommends idempotent consumers to handle occasional out-of-order events.&lt;/li&gt;
&lt;/ul&gt;
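Key-based partitioning can be sketched as follows. Kafka's default partitioner hashes with murmur2 rather than MD5, so this is illustrative only, but the guarantee is the same: equal keys always map to the same partition, and Kafka preserves order within a partition:

```python
import hashlib


def partition_for(key: str, num_partitions: int) -> int:
    """Deterministically map a message key (e.g., the trade symbol)
    to a partition, so all events for that key stay ordered."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


# Every BTCUSDT event lands in the same partition, so its
# inserts/updates/deletes are consumed in the order they occurred.
```
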

&lt;h3&gt;
  
  
  Late Data
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Challenge&lt;/strong&gt;: Late-arriving events can disrupt aggregates or state in real-time analytics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solutions&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use watermarking in stream processors like Apache Flink to define lateness thresholds, and design sinks to be idempotent (applying the same event twice has the same effect as applying it once) and able to accept corrections (e.g., updating rows with newer timestamps).&lt;/li&gt;
&lt;li&gt;Buffer events in Kafka for replay. &lt;a href="https://www.confluent.io/blog/real-time-data-processing-with-kafka-and-streaming/" rel="noopener noreferrer"&gt;Confluent&lt;/a&gt; advocates event-time processing to handle delays accurately.&lt;/li&gt;
&lt;/ul&gt;
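The watermarking idea can be sketched as "highest event time seen so far, minus an allowed lateness"; a window's aggregate is finalized only once the watermark passes its end. This is a toy version of what Flink provides, with illustrative names:

```python
class WatermarkWindow:
    """Toy tumbling window that finalizes only after the watermark
    (max event time minus allowed lateness) passes the window end."""

    def __init__(self, window_end, allowed_lateness):
        self.window_end = window_end
        self.allowed_lateness = allowed_lateness
        self.max_event_time = 0
        self.values = []

    def add(self, event_time, value):
        # Any event, even one outside the window, advances the watermark.
        self.max_event_time = max(self.max_event_time, event_time)
        if event_time <= self.window_end:
            self.values.append(value)  # late-but-admissible events still count

    @property
    def watermark(self):
        return self.max_event_time - self.allowed_lateness

    def result(self):
        # Finalize only once no more admissible late events can arrive.
        if self.watermark >= self.window_end:
            return sum(self.values)
        return None  # still waiting for possible late data
```

Late events that arrive before the watermark passes are folded into the aggregate; anything later is handled by the correction-friendly sink described above.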

&lt;h3&gt;
  
  
  Fault Tolerance
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Challenge&lt;/strong&gt;: Failures in connectors or networks can cause data loss or duplicates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solutions&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Debezium provides at-least-once delivery; use idempotent sinks (e.g., Cassandra’s upsert) to deduplicate.&lt;/li&gt;
&lt;li&gt;Kafka’s replication ensures durability. Configure high availability for PostgreSQL to prevent replication slot buildup. &lt;a href="https://olake.io/change-data-capture/" rel="noopener noreferrer"&gt;OLake&lt;/a&gt; suggests monitoring with Prometheus for proactive fault detection.&lt;/li&gt;
&lt;/ul&gt;
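Because delivery is at-least-once, the sink may see the same change event more than once. An idempotent upsert keyed on the primary key plus a monotonically increasing version (e.g., the source LSN) makes replays harmless; a minimal sketch with illustrative names:

```python
class IdempotentSink:
    """Apply each change at most once per (key, version); replays are no-ops."""

    def __init__(self):
        self.rows = {}      # primary key -> current row
        self.versions = {}  # primary key -> last applied version (e.g., LSN)

    def apply(self, key, version, row):
        # Skip anything already applied: a duplicate or stale replay.
        if version <= self.versions.get(key, -1):
            return False
        self.rows[key] = row        # upsert: insert or overwrite
        self.versions[key] = version
        return True


sink = IdempotentSink()
sink.apply("BTCUSDT", 1, {"price": 50000.0})
sink.apply("BTCUSDT", 1, {"price": 50000.0})  # duplicate delivery, ignored
```

Cassandra's writes are natively upserts, so replaying the same row simply overwrites it with identical data, which achieves the same effect.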

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;CDC is a game-changer for real-time data engineering, enabling seamless synchronization across systems. Using Debezium, Kafka, and connectors, our crypto pipeline demonstrates how to replicate data from PostgreSQL to Cassandra efficiently. By addressing schema evolution, ordering, late data, and fault tolerance, engineers can build reliable pipelines. As data demands grow, CDC will remain a critical tool for agile, data-driven organizations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;References&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.confluent.io/learn/change-data-capture/" rel="noopener noreferrer"&gt;Confluent: Change Data Capture&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://debezium.io/documentation/" rel="noopener noreferrer"&gt;Debezium Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://redpanda.com/guides/cdc/change-data-capture-guide" rel="noopener noreferrer"&gt;Redpanda: CDC Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://learn.microsoft.com/en-us/sql/relational-databases/track-changes/about-change-data-capture-sql-server" rel="noopener noreferrer"&gt;Microsoft: About CDC&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.decodable.co/blog/change-data-capture-cdc-explained" rel="noopener noreferrer"&gt;Decodable: CDC Explained&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://olake.io/change-data-capture/" rel="noopener noreferrer"&gt;OLake: CDC Insights&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>architecture</category>
      <category>database</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Deploying Anaconda with JupyterLab on an Azure VM for Team Collaboration</title>
      <dc:creator>Joy Akinyi</dc:creator>
      <pubDate>Tue, 26 Aug 2025 14:24:04 +0000</pubDate>
      <link>https://dev.to/joy_akinyi_115689d7dff92f/deploying-anaconda-with-jupyterlab-on-an-azure-vm-for-team-collaboration-4e8n</link>
      <guid>https://dev.to/joy_akinyi_115689d7dff92f/deploying-anaconda-with-jupyterlab-on-an-azure-vm-for-team-collaboration-4e8n</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Anaconda is an open-source distribution platform that bundles Python, the conda package manager, and essential libraries like NumPy, pandas, and scikit-learn. It streamlines environment management and ensures consistency across team projects. By deploying Anaconda with JupyterLab on an Azure Virtual Machine (VM) running Ubuntu, teams can create a cloud-based, collaborative workspace.&lt;/p&gt;

&lt;p&gt;This guide walks you through setting up an Azure VM with Ubuntu, installing Anaconda, configuring JupyterLab for team access on port 8888, and testing the setup with a sample project.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;An Azure subscription (sign up for a free account at &lt;a href="https://azure.microsoft.com/free" rel="noopener noreferrer"&gt;azure.microsoft.com/free&lt;/a&gt;), or better still, a student account.&lt;/li&gt;
&lt;li&gt;Familiarity with Linux terminal commands.&lt;/li&gt;
&lt;li&gt;Sudo privileges on the VM.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 1: Set Up an Azure VM with Ubuntu
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Log in to the Azure portal at &lt;a href="https://portal.azure.com" rel="noopener noreferrer"&gt;portal.azure.com&lt;/a&gt;, or use a student account to log in.&lt;/li&gt;
&lt;li&gt;Navigate to &lt;strong&gt;Virtual machines&lt;/strong&gt; under Services and select &lt;strong&gt;Create&lt;/strong&gt; &amp;gt; &lt;strong&gt;Virtual machine&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;In the &lt;strong&gt;Basics&lt;/strong&gt; tab:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Choose your subscription.&lt;/li&gt;
&lt;li&gt;Create a new resource group (e.g., &lt;code&gt;myResourceGroup&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Name the VM (e.g., &lt;code&gt;myVM&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Select a region close to your team for low latency.&lt;/li&gt;
&lt;li&gt;Choose &lt;strong&gt;Ubuntu Server 22.04 LTS - Gen2&lt;/strong&gt; as the Image.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Under &lt;strong&gt;Administrator account&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Select &lt;strong&gt;SSH public key&lt;/strong&gt; or &lt;strong&gt;Password&lt;/strong&gt; for authentication.&lt;/li&gt;
&lt;li&gt;Set a username (e.g., &lt;code&gt;azureuser&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Provide an SSH key or password as needed.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;In &lt;strong&gt;Inbound port rules&lt;/strong&gt;, allow &lt;strong&gt;SSH (22)&lt;/strong&gt; and &lt;strong&gt;Custom TCP (8888)&lt;/strong&gt; for JupyterLab access. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;To allow port 8888:

&lt;ul&gt;
&lt;li&gt;After VM creation, go to the VM’s &lt;strong&gt;Networking&lt;/strong&gt; tab in the Azure portal.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Add inbound port rule&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Set &lt;strong&gt;Service&lt;/strong&gt; to Custom, &lt;strong&gt;Port ranges&lt;/strong&gt; to &lt;code&gt;8888&lt;/code&gt;, &lt;strong&gt;Protocol&lt;/strong&gt; to TCP, and &lt;strong&gt;Action&lt;/strong&gt; to Allow.&lt;/li&gt;
&lt;li&gt;For security, restrict &lt;strong&gt;Source&lt;/strong&gt; to your team’s IP ranges.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Review and create the VM, saving the SSH key if generated.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Note the public IP address from the VM’s overview page.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Connect to the VM via SSH: &lt;code&gt;ssh azureuser@&amp;lt;public-ip&amp;gt;&lt;/code&gt; (add &lt;code&gt;-i path/to/key.pem&lt;/code&gt; for key-based authentication).&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Install Anaconda on the Ubuntu VM
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
Update the system:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   &lt;span class="nb"&gt;sudo &lt;/span&gt;apt update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;sudo &lt;/span&gt;apt upgrade &lt;span class="nt"&gt;-y&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
Install required utilities:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   &lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;wget curl git &lt;span class="nt"&gt;-y&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
Download the latest Anaconda installer:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   wget https://repo.anaconda.com/archive/Anaconda3-2025.06-1-Linux-x86_64.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
Verify the installer (optional):
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   &lt;span class="nb"&gt;sha256sum &lt;/span&gt;Anaconda3-2025.06-1-Linux-x86_64.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Compare the checksum with the official value from Anaconda’s website.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
Run the installer:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   &lt;span class="c"&gt;#Default is ~/anaconda3, but you can change it to /opt/anaconda3 for system-wide use.&lt;/span&gt;
   bash Anaconda3-2025.06-1-Linux-x86_64.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Accept the license and install in a shared location like &lt;code&gt;/opt/anaconda3&lt;/code&gt; for team access. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
Set permissions:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   &lt;span class="c"&gt;#ensures the 'users' group has control of Anaconda’s directory.&lt;/span&gt;
   &lt;span class="nb"&gt;sudo chown&lt;/span&gt; &lt;span class="nt"&gt;-R&lt;/span&gt; :users /opt/anaconda3
   &lt;span class="c"&gt;#allows all users in the 'users' group to install/update packages without sudo&lt;/span&gt;
   &lt;span class="nb"&gt;sudo chmod&lt;/span&gt; &lt;span class="nt"&gt;-R&lt;/span&gt; g+w /opt/anaconda3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
Initialize conda:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;    &lt;span class="c"&gt;#Initialize conda for your shell&lt;/span&gt;
   /opt/anaconda3/bin/conda init
    &lt;span class="c"&gt;# Reload your shell configuration file&lt;/span&gt;
   &lt;span class="nb"&gt;source&lt;/span&gt; ~/.bashrc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
Verify the installation:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   conda &lt;span class="nt"&gt;--version&lt;/span&gt;
   python &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
Create a shared conda environment:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   conda create &lt;span class="nt"&gt;--name&lt;/span&gt; teamenv &lt;span class="nv"&gt;python&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3.11
   conda activate teamenv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 3: Configure a Secure JupyterLab Server
&lt;/h2&gt;

&lt;p&gt;JupyterLab is configured on port 8888 with a password to secure access, ensuring only authorized team members can log in.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Install JupyterLab:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;conda activate teamenv
conda &lt;span class="nb"&gt;install &lt;/span&gt;jupyterlab
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Generate a configuration file:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;jupyter lab &lt;span class="nt"&gt;--generate-config&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Set a password to secure the server. Start a Python shell with &lt;code&gt;python&lt;/code&gt;, then run:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;jupyter_server.auth&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;passwd&lt;/span&gt;
&lt;span class="nf"&gt;passwd&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;Enter a password (e.g., &lt;code&gt;mysecurepassword&lt;/code&gt;) and copy the &lt;code&gt;sha1:...&lt;/code&gt; hash. This password is critical to prevent unauthorized access to &lt;code&gt;http://&amp;lt;public-ip&amp;gt;:8888&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Edit &lt;code&gt;~/.jupyter/jupyter_lab_config.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;   &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ServerApp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ip&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;0.0.0.0&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;  &lt;span class="c1"&gt;# Allow access from any IP
&lt;/span&gt;   &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ServerApp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;port&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8888&lt;/span&gt;     &lt;span class="c1"&gt;# Default port
&lt;/span&gt;    &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ServerApp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;open_browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
   &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ServerApp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;password&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sha1:&amp;lt;hashed-password&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;  &lt;span class="c1"&gt;# Paste the hashed password
&lt;/span&gt;   &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ServerApp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;allow_remote_access&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
   &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ServerApp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;root_dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/opt/shared_notebooks&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;  &lt;span class="c1"&gt;# Shared directory
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
Create a shared notebook directory:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   &lt;span class="c"&gt;#create a directory&lt;/span&gt;
   &lt;span class="nb"&gt;sudo mkdir&lt;/span&gt; /opt/shared_notebooks
   &lt;span class="c"&gt;# give ownership to the users group&lt;/span&gt;
   &lt;span class="nb"&gt;sudo chown &lt;/span&gt;azureuser:users /opt/shared_notebooks
   &lt;span class="c"&gt;# give groups write rights in the notebook&lt;/span&gt;
   &lt;span class="nb"&gt;sudo chmod &lt;/span&gt;g+w /opt/shared_notebooks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Note: If you want all users who run JupyterLab to access this folder, make sure they’re added to the &lt;code&gt;users&lt;/code&gt; group:&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;usermod &lt;span class="nt"&gt;-aG&lt;/span&gt; &lt;span class="nb"&gt;users&lt;/span&gt; &amp;lt;username&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
Start JupyterLab in the background:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   conda activate teamenv
   &lt;span class="nb"&gt;nohup &lt;/span&gt;jupyter lab 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;nohup&lt;/code&gt; command ensures JupyterLab continues running after you exit the SSH session, maintaining access at &lt;code&gt;http://&amp;lt;public-ip&amp;gt;:8888&lt;/code&gt;.&lt;br&gt;
Access JupyterLab at &lt;code&gt;http://&amp;lt;public-ip&amp;gt;:8888&lt;/code&gt; and log in with the password. Use HTTPS if configured.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Collaboration Notes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Users share the &lt;code&gt;teamenv&lt;/code&gt; environment and &lt;code&gt;/opt/shared_notebooks&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Add team members to the &lt;code&gt;users&lt;/code&gt; group (e.g., &lt;code&gt;sudo adduser teamuser1; sudo usermod -aG users teamuser1&lt;/code&gt;) and share the password securely.&lt;/li&gt;
&lt;li&gt;Avoid conflicts by using Git or subdirectories in &lt;code&gt;/opt/shared_notebooks&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Step 4: Test with a Mini Project
&lt;/h2&gt;

&lt;p&gt;Test with a JupyterLab notebook fetching cryptocurrency data from CoinGecko.&lt;/p&gt;

&lt;p&gt;Install dependencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   conda activate teamenv
   conda &lt;span class="nb"&gt;install &lt;/span&gt;requests pandas
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a notebook in &lt;code&gt;/opt/shared_notebooks&lt;/code&gt; and add:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

   &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
   &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
   &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;

   &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.coingecko.com/api/v3/coins/markets&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
   &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
       &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vs_currency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;usd&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bitcoin,ethereum,cardano,solana&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;market_cap_desc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;per_page&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;page&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sparkline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
   &lt;span class="p"&gt;}&lt;/span&gt;

   &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
       &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
       &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;symbol&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;current_price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;market_cap&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_volume&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
       &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
       &lt;span class="nf"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
       &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error fetching data:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run and share the notebook via &lt;code&gt;/opt/shared_notebooks&lt;/code&gt; or Git.&lt;/p&gt;

</description>
      <category>anaconda</category>
      <category>azure</category>
      <category>jupyterlab</category>
    </item>
    <item>
      <title>Getting Started with Docker and Docker Compose: A Beginner’s Guide</title>
      <dc:creator>Joy Akinyi</dc:creator>
      <pubDate>Sun, 24 Aug 2025 08:31:18 +0000</pubDate>
      <link>https://dev.to/joy_akinyi_115689d7dff92f/getting-started-with-docker-and-docker-compose-a-beginners-guide-5g42</link>
      <guid>https://dev.to/joy_akinyi_115689d7dff92f/getting-started-with-docker-and-docker-compose-a-beginners-guide-5g42</guid>
      <description>&lt;p&gt;Have you ever heard the phrase “But it works on my machine”?&lt;br&gt;
This is one of the most common problems developers face when moving applications from their local computer to a server or sharing projects with teammates.&lt;/p&gt;

&lt;p&gt;That’s where Docker comes in. Docker allows you to package your application with all its dependencies into a container so that it can run consistently on any environment say your laptop, a server, or even the cloud.&lt;/p&gt;

&lt;p&gt;In this guide, we’ll cover the basics of Docker and introduce Docker Compose, a tool that helps you run multi-container applications with ease.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is Docker?&lt;/strong&gt;&lt;br&gt;
Docker is an open-source platform that allows you to run applications in isolated environments through containerization. Unlike virtual machines, which emulate entire operating systems, Docker containers share the host OS kernel while isolating applications, making them lightweight and fast.&lt;/p&gt;
&lt;h4&gt;
  
  
  Installing Docker:
&lt;/h4&gt;

&lt;p&gt;You can approach this by either:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
Installing Docker Desktop&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Docker Desktop provides an easy-to-use application with a GUI and the Docker CLI preinstalled. It also integrates well with WSL 2 on Windows.&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://docs.docker.com/desktop/" rel="noopener noreferrer"&gt;Download here&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once installed, you’ll be able to run Docker commands directly from your terminal (PowerShell, WSL, or macOS terminal).&lt;/p&gt;

&lt;p&gt;2. Installing the Docker CLI on Linux:&lt;br&gt;
If you’re running Linux, you can install the Docker CLI directly.&lt;/p&gt;

&lt;p&gt;For example, on WSL 2 with Ubuntu 22.04, you can follow these steps:&lt;br&gt;
&lt;a href="https://gist.github.com/dehsilvadeveloper/c3bdf0f4cdcc5c177e2fe9be671820c7" rel="noopener noreferrer"&gt;Installing Docker on WSL 2 with Ubuntu 22.04&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After installation, you can verify that Docker is working by running:&lt;br&gt;
&lt;code&gt;docker version&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;If it's working, you should see output similar to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client: Docker Engine - Community
 Version:           27.0.2
 API version:       1.46
 Go version:        go1.21.5
 Git commit:        3ab9da9
 Built:             Tue Jul 18 17:45:00 2024
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          27.0.2
  API version:      1.46 (minimum version 1.12)
  Go version:       go1.21.5
  Git commit:       3ab9da9
  Built:            Tue Jul 18 17:45:00 2024
  OS/Arch:          linux/amd64
  Experimental:     false 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Basic Docker Concepts:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Image → A blueprint for your application.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Container → A running instance of an image.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Dockerfile → A recipe to build an image.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Docker Hub → Public repository of prebuilt images.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now, let's build and run your first container.&lt;br&gt;
You can run an already existing image with&lt;br&gt;
&lt;code&gt;docker run hello-world&lt;br&gt;
&lt;/code&gt;In this case, Docker pulls the image from Docker Hub, creates a container, runs it, and then exits.&lt;/p&gt;

&lt;p&gt;To list and stop containers, you can run:&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;docker ps        # running containers  &lt;br&gt;
docker ps -a     # all containers (even stopped ones)  &lt;br&gt;
docker stop &amp;lt;container_id&amp;gt;&lt;br&gt;
docker rm &amp;lt;container_id&amp;gt;&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Writing your first Dockerfile:
&lt;/h2&gt;

&lt;p&gt;A Dockerfile, as mentioned before, is like a recipe for building images.&lt;br&gt;
Here is an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## Use official Python image
FROM python:3.13-slim

## Set working directory
WORKDIR /app

## Copies the .txt file to the set working directory
COPY requirements.txt .

## Runs the requirements file
RUN pip install -r requirements.txt

## Copy local files into the container
COPY . .

## Run main
CMD ["python","main.py"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that we have a Dockerfile, we can build an image (say, image1) using:&lt;br&gt;
&lt;code&gt;docker build -t image1 .&lt;br&gt;
&lt;/code&gt;&lt;br&gt;
Next, run a container from that image:&lt;br&gt;
&lt;code&gt;docker run image1&lt;br&gt;
&lt;/code&gt;This will run a container based on the image you just created.&lt;/p&gt;
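&lt;p&gt;Putting it together, a typical build-and-run cycle might look like this. This is just a sketch: image1 is an example tag, and it assumes the Dockerfile above sits next to main.py and requirements.txt.&lt;/p&gt;

```shell
# Build the image from the Dockerfile in the current directory
docker build -t image1 .

# Confirm the image now exists locally
docker image ls

# Run a container from it; --rm removes the container when it exits
docker run --rm image1
```

&lt;p&gt;The --rm flag is handy during development, as it saves you from cleaning up stopped containers with docker rm afterwards.&lt;/p&gt;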

&lt;p&gt;&lt;em&gt;So far, we’ve learned how to run one container at a time. But in real-world projects, applications often rely on multiple containers running together — for example, a web server and a database. Managing these individually can get complicated, and that’s where Docker Compose comes in.&lt;/em&gt;&lt;/p&gt;
&lt;h1&gt;
  
  
  Getting Started with Docker Compose
&lt;/h1&gt;

&lt;p&gt;Docker Compose is a tool that lets you define and run multi-container applications using a single configuration file called &lt;strong&gt;&lt;em&gt;docker-compose.yml&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instead of starting containers one by one, you define all your services in YAML and start them together with one command. A sample docker-compose.yml file is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;version: "3.8"

services:
  web:
    build: .
    ports:
      - "5000:5000"
    depends_on:
      - db

  db:
    image: postgres:15
    restart: always
    environment:
      POSTGRES_USER: example
      POSTGRES_PASSWORD: example
      POSTGRES_DB: testdb
    ports:
      - "5432:5432"


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;Running with Compose&lt;/em&gt;&lt;/strong&gt;:&lt;br&gt;
To build and start all services, run&lt;br&gt;
&lt;code&gt;docker-compose up&lt;/code&gt; &lt;br&gt;
To stop everything, run&lt;br&gt;
&lt;code&gt;docker-compose down&lt;/code&gt;&lt;/p&gt;
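&lt;p&gt;Beyond up and down, a few everyday Compose commands are worth knowing. This is a sketch; the web service name comes from the example file above:&lt;/p&gt;

```shell
# Start all services in the background (detached mode)
docker-compose up -d

# Show the status of the services defined in the file
docker-compose ps

# Follow the logs of a single service
docker-compose logs -f web

# Stop and remove the containers and networks
docker-compose down
```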

&lt;p&gt;In conclusion, with Docker Compose you don’t have to remember long docker run commands, since everything lives in one file, and it’s also super easy to share projects.&lt;/p&gt;

</description>
      <category>docker</category>
      <category>dockercompose</category>
    </item>
    <item>
      <title>Why we use Apache Airflow for Data Engineering</title>
      <dc:creator>Joy Akinyi</dc:creator>
      <pubDate>Mon, 21 Jul 2025 14:19:28 +0000</pubDate>
      <link>https://dev.to/joy_akinyi_115689d7dff92f/why-we-use-apache-airflow-for-data-engineering-2f19</link>
      <guid>https://dev.to/joy_akinyi_115689d7dff92f/why-we-use-apache-airflow-for-data-engineering-2f19</guid>
      <description>&lt;p&gt;Goal: To explain the value of Apache Airflow in building, scheduling, and managing workflows in Data Engineering.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Definitions:&lt;/strong&gt;&lt;br&gt;
Apache Airflow&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
An open-source platform used to schedule and manage batch-oriented workflows.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Data Engineering&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
Data Engineering involves designing and managing data pipelines by extracting, transforming, and loading data (ETL), thereby preparing datasets for analysis.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Orchestration tools like Airflow are important in data engineering, especially for automating, optimizing, and executing data workflows that involve multiple dependent tasks across systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Components of the Airflow Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Directed Acyclic Graphs (DAGs): A DAG is Python code that defines the sequence of tasks needed to execute a workflow.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Scheduler: Triggers scheduled workflows and submits tasks to the executor.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Executor: Runs the tasks, e.g. the LocalExecutor.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Web server: Provides a user interface (UI) to inspect, trigger, and debug DAGs’ behaviours and tasks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Metadata Database: Used by the scheduler, executor, and web server to store state.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The following image represents the structure of Apache Airflow:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ad7exiu9do236cyci0d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ad7exiu9do236cyci0d.png" alt=" " width="800" height="481"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Also, to effectively design and manage workflows, Apache Airflow uses tasks and operators as core components.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A task is the basic unit of execution in Airflow, and each task represents an action like running a Python function or executing a SQL script.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;An operator defines the kind of task you want to execute, e.g.&lt;br&gt;
&lt;em&gt;PythonOperator, which executes a Python function&lt;/em&gt;&lt;br&gt;
&lt;em&gt;BashOperator, which runs a Bash command or script&lt;/em&gt;&lt;br&gt;
&lt;em&gt;PostgresOperator, which executes SQL commands on a Postgres database&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With the knowledge above, we can give reasons why data engineers use Airflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Modular &amp;amp; Scalable Workflow Management:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Python‑based DAG definitions let you build reusable and maintainable modules for pipelines.&lt;/li&gt;
&lt;li&gt;Scalable means your workflows can handle more tasks or data without breaking or needing a major redesign, e.g. through parallelization, where multiple tasks can run at once.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;2. Easy Debugging:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Detailed logs per task in the UI, plus retry mechanisms and alerting, make debugging robust.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;3. Supports Dynamic Pipelines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Instead of hardcoding every task, you can use loops, conditions, and variables to create tasks in Python.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;4. Integration with External Systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extensive integration with various external systems, databases, and cloud platforms like GCP, Azure, and AWS makes it ideal for organisations with diverse systems and proves its versatility.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Also, workflow dependencies are explicit, meaning that you declare them clearly (with &amp;gt;&amp;gt;, &amp;lt;&amp;lt;, or the set_upstream/set_downstream methods), ensuring correct execution order.&lt;/p&gt;
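&lt;p&gt;The pieces above can be sketched as a minimal DAG. This is an illustrative example, assuming Apache Airflow 2.4+ is installed; the DAG and task names are hypothetical:&lt;/p&gt;

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def extract():
    # Placeholder for a real extraction step
    print("extracting data...")


with DAG(
    dag_id="example_etl",            # hypothetical DAG name
    start_date=datetime(2025, 1, 1),
    schedule="@daily",               # run once a day
    catchup=False,                   # don't backfill missed runs
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = BashOperator(task_id="load", bash_command="echo 'loading...'")

    # Explicit dependency: extract must finish before load starts
    extract_task >> load_task
```

&lt;p&gt;The scheduler reads this file, triggers a run each day, and hands the tasks to the executor in the declared order.&lt;/p&gt;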

</description>
      <category>dataengineering</category>
      <category>airflow</category>
    </item>
  </channel>
</rss>
