<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mangesh Walimbe</title>
    <description>The latest articles on DEV Community by Mangesh Walimbe (@mangesh_walimbe).</description>
    <link>https://dev.to/mangesh_walimbe</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3282959%2F573f7e1d-7157-4534-85e9-4574db7c6934.jpg</url>
      <title>DEV Community: Mangesh Walimbe</title>
      <link>https://dev.to/mangesh_walimbe</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mangesh_walimbe"/>
    <language>en</language>
    <item>
      <title>Event-Driven Architecture (EDA) — An Overview</title>
      <dc:creator>Mangesh Walimbe</dc:creator>
      <pubDate>Sun, 05 Apr 2026 16:15:47 +0000</pubDate>
      <link>https://dev.to/mangesh_walimbe/event-driven-architectureeda-an-overview-2a1p</link>
      <guid>https://dev.to/mangesh_walimbe/event-driven-architectureeda-an-overview-2a1p</guid>
      <description>&lt;h3&gt;
  
  
  What is Event-Driven Architecture (EDA) ?
&lt;/h3&gt;

&lt;p&gt;Event-Driven Architecture (EDA) is a software design pattern in which system components communicate by producing and consuming events. Decoupling components through events improves the scalability, modularity, and responsiveness of the system, making it well suited to real-time processing of user activity in a distributed environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Architectural Style:
&lt;/h3&gt;

&lt;p&gt;An event-driven architecture typically uses a publish-subscribe model, an event streaming model, or both.&lt;/p&gt;

&lt;p&gt;Components communicate asynchronously through an event broker, so applications do not need to know where events are published from or who consumes them. Events represent a change in state or a user action (e.g., an order being placed, a user signing up).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvu1oi1h69hs7sa0g85n5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvu1oi1h69hs7sa0g85n5.png" alt="Event Driven Architecture" width="800" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are three main components:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Event Producer:&lt;/strong&gt;&lt;br&gt;
An application, service, or component where data is generated and an event is published. The producer doesn’t know who will subscribe to the event or how the data will be processed, so it can emit high volumes of events without being blocked by consumers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Event Broker:&lt;/strong&gt;&lt;br&gt;
The middleware between producers and consumers. Its key role is to receive events from producers, store them, and route them to consumers. This communication is asynchronous, and routing is based on topics, queues, or filters.&lt;/p&gt;

&lt;p&gt;Event brokers support multiple delivery models, such as publish-subscribe and point-to-point. They ensure events are not lost through techniques like retries, message persistence, and dead-letter queues, and they can handle high volumes of events and scale horizontally. Examples include Apache Kafka, RabbitMQ, and AWS EventBridge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Event Consumer:&lt;/strong&gt;&lt;br&gt;
A component, application, or service that subscribes to events published by producers through the event broker. It processes each event asynchronously as soon as it receives it. Each consumer handles specific types of events and does not know who published them. Multiple consumer instances can consume events in parallel.&lt;/p&gt;
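&lt;p&gt;To make the interaction concrete, here is a minimal in-memory sketch of the three components in Python (illustrative only; a real system would use a broker such as Kafka or RabbitMQ, and all names here are made up):&lt;/p&gt;

```python
from collections import defaultdict

class EventBroker:
    """Minimal in-memory broker: routes events by topic to subscribers."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, event):
        # The producer only knows the topic, never the consumers.
        for handler in self.subscribers[topic]:
            handler(event)

broker = EventBroker()
received = []
broker.subscribe("OrderPlaced", lambda e: received.append(e))  # consumer
broker.publish("OrderPlaced", {"order_id": 1, "item": "pizza"})  # producer
print(received)  # [{'order_id': 1, 'item': 'pizza'}]
```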

&lt;h3&gt;
  
  
  Communication Models:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Publish-Subscribe model:&lt;/strong&gt;&lt;br&gt;
In this model, the event producer/publisher publishes events without knowing who will consume them. Subscribers/consumers register interest in specific events and receive them as soon as they are available. Because of this decoupled structure, producers and consumers work independently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Event Streaming model:&lt;/strong&gt;&lt;br&gt;
In this model, event data is continuously generated and processed in real time. Events form a stream of records that are processed, stored, and analyzed as soon as they are created. Applications that need real-time analysis and monitoring can make use of this model.&lt;/p&gt;
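&lt;p&gt;The streaming model can be sketched with a toy append-only log in Python (an illustration of the idea, not a real streaming engine): each consumer keeps its own offset into the log and can replay earlier records.&lt;/p&gt;

```python
class EventLog:
    """Append-only log; consumers pull records from their own offset."""
    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)

    def read_from(self, offset):
        # Records are retained, so a consumer can re-read from any offset.
        return self.records[offset:]

log = EventLog()
for i in range(3):
    log.append({"seq": i})

offset = 0
batch = log.read_from(offset)
offset += len(batch)       # the consumer advances its own offset
print(len(batch), offset)  # 3 3
print(log.read_from(1))    # replay from an earlier offset
```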

&lt;h3&gt;
  
  
  When to use Event-Driven Architecture (EDA)?
&lt;/h3&gt;

&lt;p&gt;Event-driven architecture is a great choice when your application needs to respond to user actions instantly and handle a high volume of events without slowing down. Each component can send and listen to events instead of having all parts of the application talk to each other directly. This decoupling of services makes it easier to update components or add new features later without affecting the rest of the system, and it helps your application stay reliable and flexible even when traffic spikes or something goes down temporarily.&lt;/p&gt;

&lt;p&gt;Using events makes your application more scalable, more responsive, and simpler to grow over time. Overall, if you need a system that reacts in real time, handles a high volume of traffic gracefully, and lets you build and change services without breaking everything else, event-driven architecture is a great choice.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use Case:
&lt;/h3&gt;

&lt;p&gt;Let’s take the example of an online food delivery application like UberEats, in which customers, restaurants, and delivery drivers all need real-time updates at the same time. The event-driven design pattern works perfectly here: it allows the different components of the system to work independently while still staying in sync.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw9e50gps6igdkufrrfor.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw9e50gps6igdkufrrfor.png" alt="Flow Diagram for Online Food Delivery system" width="800" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here are the main components:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Order Service:&lt;/strong&gt;&lt;br&gt;
A customer places an order in the application. The order service publishes an ‘OrderPlaced’ event containing information about the customer, payment, and order.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Restaurant Service:&lt;/strong&gt;&lt;br&gt;
The restaurant service is responsible for accepting orders and forwarding them to the restaurant. It listens for the ‘OrderPlaced’ event; as soon as it receives one, it notifies the restaurant to prepare the order and updates the restaurant dashboard for real-time tracking. Once the order is ready, it publishes an ‘OrderReady’ event.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Payment Service:&lt;/strong&gt;&lt;br&gt;
The payment service also listens for the ‘OrderPlaced’ event. Once it receives one, it processes the payment and publishes a new ‘PaymentSuccessful’ event.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Delivery Service:&lt;/strong&gt;&lt;br&gt;
This service listens for the ‘PaymentSuccessful’ event. It finds the available drivers, sends them a delivery request, and publishes a ‘DriverAssigned’ event.&lt;/p&gt;

&lt;p&gt;Once the restaurant service publishes the ‘OrderReady’ event, the delivery service picks it up and publishes a new ‘OrderOutForDelivery’ event.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Notification Service:&lt;/strong&gt;&lt;br&gt;
This service listens to all the events ‘OrderPlaced’, ‘PaymentSuccessful’, ‘DriverAssigned’, ‘OrderReady’, ‘OrderOutForDelivery’ and sends real-time updates to the customer about the order.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tracking Service:&lt;/strong&gt;&lt;br&gt;
This service provides real-time tracking. It consumes all of the events above and feeds updates to the notification service.&lt;/p&gt;
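&lt;p&gt;The event flow above can be sketched in a few lines of Python (a toy in-process version; the event names follow the flow, everything else is hypothetical):&lt;/p&gt;

```python
from collections import defaultdict

handlers = defaultdict(list)
emitted = []

def on(topic, fn):
    handlers[topic].append(fn)

def publish(topic, payload):
    emitted.append(topic)
    for fn in handlers[topic]:
        fn(payload)

# Each service only subscribes to the events it cares about.
on("OrderPlaced", lambda e: publish("PaymentSuccessful", e))    # Payment Service
on("OrderPlaced", lambda e: publish("OrderReady", e))           # Restaurant Service
on("PaymentSuccessful", lambda e: publish("DriverAssigned", e)) # Delivery Service
on("OrderReady", lambda e: publish("OrderOutForDelivery", e))   # Delivery Service

publish("OrderPlaced", {"order_id": 42})
print(emitted)
# ['OrderPlaced', 'PaymentSuccessful', 'DriverAssigned', 'OrderReady', 'OrderOutForDelivery']
```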

&lt;h3&gt;
  
  
  Challenges:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Debugging and Monitoring:&lt;/strong&gt;&lt;br&gt;
In an event-driven architecture, components communicate with each other through events. Because this communication is asynchronous, it is difficult to trace the data flow, which makes testing and debugging complex.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consistency:&lt;/strong&gt;&lt;br&gt;
Events can arrive out of order, which can cause consistency issues if they are not handled properly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Duplication:&lt;/strong&gt;&lt;br&gt;
Because of network issues or retries, the same event may be delivered multiple times. The application must be designed to handle duplicate events gracefully, typically by making consumers idempotent.&lt;/p&gt;
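&lt;p&gt;One common way to handle this is an idempotent consumer that remembers which event IDs it has already processed. A minimal Python sketch (the ID field and payload are hypothetical):&lt;/p&gt;

```python
processed_ids = set()
results = []

def handle_once(event):
    """Idempotent consumer: skips events whose ID was already processed."""
    if event["event_id"] in processed_ids:
        return  # duplicate delivery, safe to ignore
    processed_ids.add(event["event_id"])
    results.append(event["payload"])

handle_once({"event_id": "e-1", "payload": "charge card"})
handle_once({"event_id": "e-1", "payload": "charge card"})  # redelivered
print(results)  # ['charge card'] -- processed exactly once
```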

&lt;p&gt;&lt;strong&gt;Operational Overhead:&lt;/strong&gt;&lt;br&gt;
There is always operational overhead, since you need to set up and maintain brokers (e.g., Apache Kafka, RabbitMQ, AWS EventBridge) and tooling to keep track of how data moves. If the system is not scaled properly, performance can degrade and operational costs can rise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Error Handling:&lt;/strong&gt;&lt;br&gt;
Since event-driven architecture relies mainly on asynchronous communication, error handling is a challenge: if something goes wrong while processing an event, you need a mechanism to retry or fix it without affecting other services.&lt;/p&gt;
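&lt;p&gt;A typical pattern is to retry a failing handler a few times and then park the event in a dead-letter queue so it doesn’t block other events. A minimal sketch in Python (the retry count and event shape are arbitrary; real systems would also log and back off between attempts):&lt;/p&gt;

```python
dead_letter_queue = []

def process_with_retry(event, handler, max_attempts=3):
    """Retry a failing handler; park the event in a DLQ after max_attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            handler(event)
            return True
        except Exception:
            pass  # a real system would log and back off here
    dead_letter_queue.append(event)  # isolate the bad event for later inspection
    return False

def always_fails(event):
    raise RuntimeError("downstream unavailable")

ok = process_with_retry({"id": 7}, always_fails)
print(ok, dead_letter_queue)  # False [{'id': 7}]
```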

&lt;p&gt;&lt;strong&gt;Security and Access:&lt;/strong&gt;&lt;br&gt;
Events can carry sensitive data such as payment details or PII, and could be consumed by services that are not authorized to use them, so access control is essential.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Event Loss:&lt;/strong&gt;&lt;br&gt;
Because of broker failures, network issues, or incorrect configuration, events can be lost.&lt;/p&gt;

</description>
      <category>eventdrivenarchitecture</category>
      <category>dataengineering</category>
      <category>designpatterns</category>
      <category>realtimedatastreaming</category>
    </item>
    <item>
      <title>Yuniql: Plain SQL Database Migrations</title>
      <dc:creator>Mangesh Walimbe</dc:creator>
      <pubDate>Fri, 05 Dec 2025 22:12:18 +0000</pubDate>
      <link>https://dev.to/mangesh_walimbe/yuniql-plain-sql-database-migrations-50ad</link>
      <guid>https://dev.to/mangesh_walimbe/yuniql-plain-sql-database-migrations-50ad</guid>
      <description>&lt;p&gt;&lt;strong&gt;Yuniql&lt;/strong&gt; is a lightweight, open-source migration engine through which we can maintain version and deploy database changes using plain SQL with CI/CD pipelines. It uses migration-based and database-first delivery models.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Yuniql?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Plain SQL and folder-based migrations:&lt;/strong&gt; Yuniql uses a simple, intuitive approach to db versioning. Each migration version has a dedicated directory (v0.00, v0.01, etc.) containing plain .sql files and optional .csv files for data used to bulk-seed master tables.&lt;br&gt;
&lt;strong&gt;2. No runtime dependencies:&lt;/strong&gt; The CLI is distributed as a fully self-contained executable, so there is no need to install the .NET CLR or any additional framework on target machines.&lt;br&gt;
&lt;strong&gt;3. Cross-platform and cloud ready:&lt;/strong&gt; It is fully tested across major cloud providers such as Azure SQL, Amazon RDS, and Google Cloud SQL. It supports db engines including SQL Server, PostgreSQL, MySQL, and MariaDB, with preview drivers available for Snowflake, Redshift, and Oracle. This ensures flexibility for hybrid and multi-cloud environments.&lt;br&gt;
&lt;strong&gt;4. DevOps friendly:&lt;/strong&gt; Yuniql integrates seamlessly with CI/CD pipelines. It offers first-class support for Azure Pipelines tasks and provides official Docker images, making automated deployments and containerized environments straightforward and reliable.&lt;/p&gt;
&lt;h3&gt;
  
  
  Yuniql Installation:
&lt;/h3&gt;

&lt;p&gt;Yuniql can be installed in various ways.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Download yuniql.exe directly from &lt;a href="https://github.com/rdagumampan/yuniql/releases/download/latest/yuniql-cli-win-x64-latest.zip" rel="noopener noreferrer"&gt;Github&lt;/a&gt;. (Windows)&lt;/li&gt;
&lt;li&gt;Choco package on Windows (x64): Get the latest Yuniql CLI from the &lt;a href="https://chocolatey.org/" rel="noopener noreferrer"&gt;Chocolatey&lt;/a&gt; package manager. Run the commands below with admin access.
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;choco install yuniql -y
yuniql version
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;.NET Global Tool: Install the latest Yuniql CLI as a .NET global tool. It requires the .NET Core 3.0 SDK.
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dotnet tool install -g yuniql.cli
yuniql version
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;Azure DevOps Extension: The free extension is available on the &lt;a href="https://marketplace.visualstudio.com/items?itemName=rdagumampan.yuniql-azdevops-extensions" rel="noopener noreferrer"&gt;Azure Marketplace&lt;/a&gt;. Install it into your organization.&lt;/li&gt;
&lt;li&gt;Docker Image:
Get the official Docker image for Yuniql and run the migration from a containerized environment.
For more information, visit: &lt;a href="https://yuniql.io/docs/migrate-via-docker-container/" rel="noopener noreferrer"&gt;https://yuniql.io/docs/migrate-via-docker-container/&lt;/a&gt;
Docker Hub: &lt;a href="https://hub.docker.com/repository/docker/yuniql/yuniql" rel="noopener noreferrer"&gt;https://hub.docker.com/repository/docker/yuniql/yuniql&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For other methods, check this: &lt;a href="https://yuniql.io/docs/install-yuniql/" rel="noopener noreferrer"&gt;https://yuniql.io/docs/install-yuniql/&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Directory Structure and Execution Flow in Yuniql:
&lt;/h3&gt;

&lt;p&gt;After installation, running the command below scaffolds the directory structure used to stage migration and utility scripts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;yuniql init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is the default directory structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Db-name/
├─ _init/      -- It executes only once during first migration. It sets up objects and configuration.
├─ _pre/       -- It runs before every versioned migration for pre-check or environment setup.
├─ v0.00/      -- baseline version folder (initial schema and seed data)
├─ v1.00/      -- next version folder 
├─ _draft/     -- run every time after latest version. Typically for work in-progress scripts or temporary fixes.
├─ _post/      -- Runs at the end of migration. For cleanup or post deployment scripts.
└─ _erase/     -- Runs manual cleanup scripts for yuniql erase command for controlled db cleanup.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
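&lt;p&gt;For illustration, a version folder simply holds plain SQL files; a hypothetical v1.00 script might look like this (table and column names are made up):&lt;/p&gt;

```sql
-- v1.00/01-create-employee.sql
CREATE TABLE employee (
    id         SERIAL PRIMARY KEY,
    first_name VARCHAR(100) NOT NULL,
    updated_at TIMESTAMP DEFAULT NOW()
);
```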



&lt;h3&gt;
  
  
  Azure DevOps YAML Pipeline using Yuniql
&lt;/h3&gt;

&lt;p&gt;Below are the steps to set up an Azure DevOps YAML pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Yuniql workspace:&lt;/strong&gt; Run the command yuniql init in the directory the db migration will run from. This creates the Yuniql workspace (the folder structure above).&lt;br&gt;
&lt;strong&gt;2. Azure DevOps setup:&lt;/strong&gt; As mentioned above, acquire the free Yuniql extension from the Azure Marketplace and install it into your organization. &lt;/p&gt;
&lt;h4&gt;
  
  
  Pipeline:
&lt;/h4&gt;

&lt;p&gt;Here is a sample pipeline for a PostgreSQL db migration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- task: UseYuniqlCLI@1
    inputs:
      version: "latest"

- task: RunYuniqlCLI@1
    inputs:
      version: "latest"
      connectionString: "$(PostgresDBConnectionString)"
      workspacePath: "$(Build.SourcesDirectory)"
      targetPlatform: "postgresql"
      additionalArguments: "--meta-schema public --meta-table -yuniql_version --debug"

- task: VerifyYUNIQLCLI@1
    displayName: ‘Run verify task (no commit)'
    inputs:
      version: 'latest'
      connectionString: "$(PostgresDBConnectionString)"
      workspacePath: "$(Build.SourcesDirectory)"
      targetPlatform: “postgresql”
      additionalArguments: '--debug'

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here are the properties used in this pipeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;version&lt;/code&gt;: The Yuniql CLI version to use. If not mentioned, the latest version of the Yuniql CLI is installed.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;connectionString&lt;/code&gt;: The database connection string.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;workspacePath&lt;/code&gt;: The Yuniql workspace location from which the db migration will take place.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;targetPlatform&lt;/code&gt;: The target db platform; here it is PostgreSQL.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;additionalArguments&lt;/code&gt;: Any additional CLI arguments. Here three arguments are used.
&lt;code&gt;--meta-schema:&lt;/code&gt; The Postgres schema where the db migration version is logged.
&lt;code&gt;--meta-table:&lt;/code&gt; The table in which the db version is stored.
&lt;code&gt;--debug:&lt;/code&gt; Enables detailed trace messages.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here are the additional properties that can be used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;autoCreateDatabase&lt;/code&gt;: Yuniql will create and configure the database if this property is set to true.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tokenKeyValuePair&lt;/code&gt;: Passes custom tokens and values into migration scripts at runtime. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One more task is added here for verification. It performs an uncommitted migration simply to check that the migration scripts succeed before changes are applied.&lt;br&gt;
&lt;strong&gt;Note:&lt;/strong&gt; Instead of exposing the db connection string publicly, use Azure Key Vault to fetch the connection string and other required values.&lt;/p&gt;
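&lt;p&gt;As a sketch, the secret could be pulled with the Azure Key Vault task before the Yuniql steps run (the service connection and vault names below are placeholders):&lt;/p&gt;

```yaml
# Fetches secrets into pipeline variables before the migration tasks run,
# so $(PostgresDBConnectionString) resolves from Key Vault rather than a
# plain pipeline variable.
- task: AzureKeyVault@2
  inputs:
    azureSubscription: "my-service-connection"   # placeholder
    KeyVaultName: "my-key-vault"                 # placeholder
    SecretsFilter: "PostgresDBConnectionString"
```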

&lt;h4&gt;
  
  
  Practical considerations:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Keep v0.00 as a clean baseline. Write forward-only scripts.&lt;/li&gt;
&lt;li&gt;Use tokens (&lt;code&gt;tokenKeyValuePair&lt;/code&gt;) for environment-specific names.&lt;/li&gt;
&lt;li&gt;Bulk upload: use .csv files to seed master data. The separator and batch size can be configured. &lt;/li&gt;
&lt;li&gt;Tracing: Use the meta schema/table to log and trace migrations.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Rollback in Yuniql:
&lt;/h3&gt;

&lt;p&gt;Yuniql is a forward-only migration tool, so rollback is handled differently.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;There is no automatic rollback in Yuniql. The version is logged in a table in the database, so it must be deleted manually.&lt;/li&gt;
&lt;li&gt;Yuniql provides the &lt;code&gt;_erase&lt;/code&gt; folder. Place cleanup scripts in this folder and run the command below.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;yuniql erase --platform postgres \
-c "&amp;lt;connection-string&amp;gt;" \
--debug
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This approach is meant for non-production environments.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For production environments, create a reverse script for each version and run it manually. A common approach is to maintain a separate folder such as rollback/v1.00 with undo scripts that are executed by hand.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Limitations:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Plain SQL only:&lt;/strong&gt; Yuniql is intentionally designed around plain SQL in organized, versioned folders. Migrations driven by ORMs or native code require more manual effort. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Platform differences:&lt;/strong&gt; Yuniql supports multiple engines, but scripts must account for engine-specific syntax and batch behavior; they are not portable across engines.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manual rollback:&lt;/strong&gt; As mentioned earlier, Yuniql doesn’t support automatic rollback. Rollback strategies must be engineered during design.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No operational features:&lt;/strong&gt; Operational features such as health checks, pre-deployment validations, and tests are outside Yuniql’s scope; additional scripting is needed. &lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  My take on Yuniql:
&lt;/h3&gt;

&lt;p&gt;Even with these drawbacks, Yuniql is a great fit for plain-SQL migrations: its self-contained CLI means zero runtime dependencies, and integration with Azure Pipelines and Docker is straightforward thanks to the ready-made tasks and images.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>postgressql</category>
      <category>automation</category>
      <category>azurepipeline</category>
    </item>
    <item>
      <title>Data Ingestion using Logstash: PostgreSql to Elastic</title>
      <dc:creator>Mangesh Walimbe</dc:creator>
      <pubDate>Thu, 26 Jun 2025 17:36:02 +0000</pubDate>
      <link>https://dev.to/mangesh_walimbe/data-ingestion-using-logstash-postgresql-to-elastic-9im</link>
      <guid>https://dev.to/mangesh_walimbe/data-ingestion-using-logstash-postgresql-to-elastic-9im</guid>
      <description>&lt;h3&gt;
  
  
  What is Logstash?
&lt;/h3&gt;

&lt;p&gt;Logstash is an open-source data processing pipeline from Elastic. It is used to ingest, transform, and ship data to different destinations including Elasticsearch, Kafka, and flat files.&lt;/p&gt;

&lt;p&gt;A Logstash pipeline consists of three stages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Input: The data source from which data is collected for ingestion.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Filter: Transforms the data (cleanup, aggregation, etc.) using plugins like Grok, Mutate, and Date.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Output: The destination for the ingested data (Elasticsearch, flat files, a database, etc.).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Below are the prerequisites for sending data to Elastic using Logstash:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Logstash installed on the system, with the JDBC driver for Postgres.&lt;/li&gt;
&lt;li&gt;A Postgres database with a table or function to sync.&lt;/li&gt;
&lt;li&gt;A running Elasticsearch instance.
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Logstash Setup (for Windows):
&lt;/h3&gt;

&lt;p&gt;Below are the steps in brief to install and run Logstash locally. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Install Java:&lt;/strong&gt; &lt;br&gt;
Download a JDK package (Java 8 or later) from the &lt;a href="https://www.oracle.com/java/technologies/downloads/" rel="noopener noreferrer"&gt;official Oracle website&lt;/a&gt;. Once the download is complete, extract the files to a preferred location. &lt;br&gt;
Then an environment variable needs to be added so the system recognizes Java commands.&lt;br&gt;
Go to environment variables, add a new variable named JAVA_HOME pointing to the directory where the Java files are located, and append &lt;code&gt;%JAVA_HOME%\bin&lt;/code&gt; to the path.&lt;br&gt;
To verify the installation, open a command prompt and run the command below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;java -version
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If everything is set up correctly, it will show the Java version.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Install Logstash:&lt;/strong&gt;&lt;br&gt;
To install Logstash, download the package from the &lt;a href="https://www.elastic.co/downloads/logstash" rel="noopener noreferrer"&gt;official Elastic website&lt;/a&gt; and extract it to a preferred location.&lt;br&gt;
To test it locally, open a command prompt, go to the bin folder inside the Logstash folder, and run the command below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;logstash -e "input { stdin {} } output { stdout {} }" 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Logstash Ingestion Pipeline:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Install the required JDBC driver:&lt;/strong&gt;&lt;br&gt;
Download the Postgres driver from the &lt;a href="https://jdbc.postgresql.org/" rel="noopener noreferrer"&gt;official PostgreSQL website&lt;/a&gt;.&lt;br&gt;
Place the jar file in an accessible location. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Create the Logstash pipeline:&lt;/strong&gt;&lt;br&gt;
Here is the sample pipeline.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;input {
    jdbc {
        jdbc_driver_library =&amp;gt; "c:/logstash/jdbc/postgresql.jar"
        jdbc_driver_class =&amp;gt; "org.postgresql.Driver"
        jdbc_connection_string =&amp;gt; "${JDBC_HOST}"
        jdbc_user =&amp;gt; "${DB_USER}"
        jdbc_password =&amp;gt; "${DB_PWD}"
        jdbc_paging_enabled =&amp;gt; true
        jdbc_page_size =&amp;gt; 1000
        schedule =&amp;gt; "* * * * *"  # schedule to run every minute
        statement =&amp;gt; "SELECT * FROM employee WHERE updated_at &amp;gt; :sql_last_value"
        use_column_value =&amp;gt; true
        tracking_column =&amp;gt; "updated_at"
        tracking_column_type =&amp;gt; "timestamp"
        last_run_metadata_path =&amp;gt; "c:/logstash/employee.tracker"
    }
}

filter {
    mutate {
        remove_field =&amp;gt; ["date", "@timestamp", "host"]
        # rename the first_name column to the name field in the destination
        rename =&amp;gt; { "first_name" =&amp;gt; "name" }
    }
}

output {
    stdout { codec =&amp;gt; json_lines }
    elasticsearch {
        hosts =&amp;gt; ["http://localhost:9200"]
        index =&amp;gt; "my_table_index"
        custom_headers =&amp;gt; {
                "Authorization" =&amp;gt; "${AUTH_KEY}"
            }
        document_id =&amp;gt; "%{table_id}" # Unique identifier from the table
        timeout =&amp;gt; 120
    }
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pipeline above performs incremental ingestion: it tracks the last run and, on each scheduled run, ingests only the records changed since then.&lt;/p&gt;

&lt;p&gt;Here are the key concepts used:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Input:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;jdbc_driver_library&lt;/code&gt;: location where the jdbc driver file(.jar) is stored.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;jdbc_driver_class&lt;/code&gt;: the driver class being used.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;jdbc_connection_string&lt;/code&gt;: postgres db connection string&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;jdbc_user&lt;/code&gt;: Database username&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;jdbc_password&lt;/code&gt;: database password for the user&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;paging&lt;/code&gt;: In this pipeline, the data is sent in multiple pages with a page size of 1000. This improves the performance of the pipeline and helps track the number of records sent to Elasticsearch.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;schedule&lt;/code&gt;: The above pipeline is scheduled to run every minute. &lt;br&gt;
Here is the format for schedule.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;* * * * *
│ │ │ │ │
│ │ │ │ └─── Day of the week (0 - 7) (Sunday is both 0 and 7)
│ │ │ └───── Month (1 - 12)
│ │ └─────── Day of the month (1 - 31)
│ └───────── Hour (0 - 23)
└─────────── Minute (0 - 59)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;statement&lt;/code&gt;: The SQL statement the pipeline will execute. A complex statement can be saved in a separate .sql file whose path is given in &lt;code&gt;statement_filepath&lt;/code&gt; instead of using statement.
It is better to use a view or materialized view than a query with complex joins.&lt;/li&gt;
&lt;li&gt;The last section is for incremental ingestion.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;use_column_value =&amp;gt; true
tracking_column =&amp;gt; "updated_dt"
tracking_column_type =&amp;gt; "timestamp"
last_run_metadata_path =&amp;gt; "c:/project/logstash/date.tracker"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;use_column_value&lt;/code&gt; is set to true. It tells Logstash to track the actual value of the &lt;code&gt;updated_at&lt;/code&gt; column named in &lt;code&gt;tracking_column&lt;/code&gt;, instead of the time the query was last run. In this case, &lt;code&gt;:sql_last_value&lt;/code&gt; will hold the last &lt;code&gt;updated_at&lt;/code&gt; value.&lt;/p&gt;

&lt;p&gt;If it is set to false, Logstash will use the last query execution time for &lt;code&gt;:sql_last_value&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The last run's value is saved in the file given in &lt;code&gt;last_run_metadata_path&lt;/code&gt;; it is used to track where the previous pipeline run left off. &lt;/p&gt;
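&lt;p&gt;The tracking behaviour can be simulated in plain Python (illustrative only — Logstash handles this internally; the file and column names mirror the pipeline above):&lt;/p&gt;

```python
import json
import os
import tempfile

# Simulate :sql_last_value tracking with a metadata file.
tracker = os.path.join(tempfile.gettempdir(), "employee.tracker")

rows = [
    {"id": 1, "updated_at": "2024-01-01"},
    {"id": 2, "updated_at": "2024-03-01"},
]

def load_last_value():
    if os.path.exists(tracker):
        with open(tracker) as f:
            return json.load(f)["last_value"]
    return ""  # first run: take everything

def run_once():
    last = load_last_value()
    # Equivalent of: WHERE updated_at > :sql_last_value
    batch = [r for r in rows if r["updated_at"] > last]
    if batch:
        with open(tracker, "w") as f:
            json.dump({"last_value": max(r["updated_at"] for r in batch)}, f)
    return batch

if os.path.exists(tracker):
    os.remove(tracker)  # start from a clean first run
first = run_once()
second = run_once()  # nothing new since the last run
print(len(first), len(second))  # 2 0
```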

&lt;p&gt;&lt;strong&gt;Filter&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is an optional section to manipulate the data before sending it to destination.&lt;/p&gt;

&lt;p&gt;In the pipeline above, the date field is removed before ingestion, and the first_name field from the source is mapped to the name field in the destination.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This section defines the destination for the data: in this case the Elasticsearch endpoint, an authorization key if any, the Elastic index, and the &lt;code&gt;document_id&lt;/code&gt;. &lt;code&gt;document_id&lt;/code&gt; is the unique identifier of a document in the index. If this field is not set, Elasticsearch automatically assigns a unique identifier to each document. &lt;/p&gt;

&lt;p&gt;For incremental ingestion, it is recommended to define this field. During ingestion, Elasticsearch looks up this field in the index; if it matches an existing document, that document is updated in place. &lt;/p&gt;

&lt;p&gt;If the field is not defined, each run creates new documents in the index, resulting in duplicate records.&lt;/p&gt;
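&lt;p&gt;The effect of &lt;code&gt;document_id&lt;/code&gt; can be sketched with a dict standing in for the index (illustrative Python, not the Elasticsearch API):&lt;/p&gt;

```python
index = {}

def ingest(doc, document_id=None):
    """With an explicit _id, re-ingestion updates in place; without one,
    auto-assigned IDs create a new document every time."""
    key = document_id if document_id is not None else f"auto-{len(index)}"
    index[key] = doc

ingest({"name": "James"}, document_id="1")
ingest({"name": "James R."}, document_id="1")  # same id: updated, not duplicated
print(len(index))  # 1

ingest({"name": "Kate"})
ingest({"name": "Kate"})                       # no id: two separate documents
print(len(index))  # 3
```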

&lt;h3&gt;
  
  
  Run the pipeline
&lt;/h3&gt;

&lt;p&gt;To run this pipeline, open a command prompt, go to the Logstash folder, and run the command below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bin/logstash -f c:/logstash/sample_pipeline.conf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is the output of the pipeline.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbxubzpsd2ia0ftnkvabo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbxubzpsd2ia0ftnkvabo.png" alt="Logstash Output window" width="800" height="193"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Output from the Elasticsearch index:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "took": 1,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 3,
            "relation": "eq"
        },
        "max_score": 1.0,
        "hits": [
            {
                "_index": "testing",
                "_id": "1",
                "_score": 1.0,
                "_source": {
                    "name": "James",
                    "id": 1,
                    "last_name": "Smith",
                    "updated_dt": "2024-12-12T16:10:57.349Z",
                    "@version": "1",
                    "@timestamp": "2025-06-25T20:41:02.167442600Z"
                }
            },
            {
                "_index": "testing",
                "_id": "2",
                "_score": 1.0,
                "_source": {
                    "name": "John",
                    "id": 2,
                    "last_name": "Doe",
                    "updated_dt": "2024-12-12T16:10:57.349Z",
                    "@version": "1",
                    "@timestamp": "2025-06-25T20:41:02.169021400Z"
                }
            },
            {
                "_index": "testing",
                "_id": "3",
                "_score": 1.0,
                "_source": {
                    "name": "Kate",
                    "id": 3,
                    "last_name": "Williams",
                    "updated_dt": "2024-12-12T16:10:57.349Z",                    
                    "@version": "1",
                    "@timestamp": "2025-06-25T20:41:02.170098800Z"
                }
            }
        ]
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are a few advantages to this method.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Logstash is an open-source tool and easy to implement.&lt;/li&gt;
&lt;li&gt;There are over 200 plugins available for data transformation; using them, data can be parsed and transformed in filters.&lt;/li&gt;
&lt;li&gt;The architecture decouples the data source from Elasticsearch.&lt;/li&gt;
&lt;li&gt;It integrates seamlessly with Elasticsearch.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Although this is an open-source and simple method to implement, there are some disadvantages.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Latency: It is not ideal where very low latency or real-time data is required. As the pipeline grows, loading, transforming/filtering, and sending the data takes time.&lt;/li&gt;
&lt;li&gt;Error handling: Unless it is explicitly monitored, errors are difficult to track down and can result in dropped data.&lt;/li&gt;
&lt;li&gt;It can create duplicates if the pipeline is not defined properly.&lt;/li&gt;
&lt;li&gt;It takes longer to start than comparable tools.&lt;/li&gt;
&lt;li&gt;Its pipeline configuration files can become complex and difficult to maintain as they grow.&lt;/li&gt;
&lt;li&gt;Resource utilization: It can consume significant resources under heavy loads and with complex pipelines.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The pipeline above suits teams looking for a robust, centralized data-streaming pipeline; it is not ideal for real-time data shipping.&lt;/p&gt;

</description>
      <category>logstash</category>
      <category>elasticsearch</category>
      <category>dataengineering</category>
    </item>
  </channel>
</rss>
