<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: KiplangatJaphet</title>
    <description>The latest articles on DEV Community by KiplangatJaphet (@kiplangatjaphet).</description>
    <link>https://dev.to/kiplangatjaphet</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3520072%2F3f522f1a-e25a-4fca-9741-fe7174db69aa.png</url>
      <title>DEV Community: KiplangatJaphet</title>
      <link>https://dev.to/kiplangatjaphet</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kiplangatjaphet"/>
    <language>en</language>
    <item>
      <title>A Beginner’s Guide to Big Data Analytics with Apache Spark and PySpark</title>
      <dc:creator>KiplangatJaphet</dc:creator>
      <pubDate>Mon, 29 Sep 2025 21:49:45 +0000</pubDate>
      <link>https://dev.to/kiplangatjaphet/a-beginners-guide-to-big-data-analytics-with-apache-spark-and-pyspark-dpc</link>
      <guid>https://dev.to/kiplangatjaphet/a-beginners-guide-to-big-data-analytics-with-apache-spark-and-pyspark-dpc</guid>
      <description>&lt;h2&gt;
  
  
  What is Apache Spark?
&lt;/h2&gt;

&lt;p&gt;Apache Spark is an open-source, distributed processing system used for big data workloads. It utilizes in-memory caching, and optimized query execution for fast analytic queries against data of any size. It provides development APIs in Java, Scala, Python and R, and supports code reuse across multiple workloads—batch processing, interactive queries, real-time analytics, machine learning, and graph processing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is the history of Apache Spark?&lt;/strong&gt;&lt;br&gt;
Apache Spark started in 2009 as a research project at UC Berkeley’s AMPLab, a collaboration involving students, researchers, and faculty focused on data-intensive application domains. The goal of Spark was to create a new framework optimized for fast iterative processing, such as machine learning and interactive data analysis, while retaining the scalability and fault tolerance of Hadoop MapReduce. The first paper, “Spark: Cluster Computing with Working Sets”, was published in June 2010, and Spark was open sourced under a BSD license. In June 2013, Spark entered incubation status at the Apache Software Foundation (ASF), and was established as an Apache Top-Level Project in February 2014. Spark can run standalone, on Apache Mesos, or most frequently on Apache Hadoop. &lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;What is PySpark?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;PySpark is the Python API for Apache Spark, a powerful framework designed for distributed data processing. If you’ve ever worked with large datasets and found your programs running slowly, PySpark might be the solution you’ve been searching for. It allows you to process massive datasets across multiple computers at the same time, meaning your programs can handle more data in less time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features of PySpark&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Distributed Processing: Instead of relying on one computer, PySpark breaks up your data into smaller chunks and processes them on multiple machines simultaneously.&lt;/li&gt;
&lt;li&gt;In-Memory Processing: PySpark can store data in memory (RAM), making it much faster than traditional methods that often rely on slow disk access.&lt;/li&gt;
&lt;li&gt;Fault Tolerance: Even if one machine fails while processing data, PySpark can automatically recover, ensuring your data is safe and the job gets done.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Importance of Using PySpark&lt;/strong&gt;&lt;br&gt;
When a dataset grows beyond what a single machine can comfortably handle, PySpark lets you process it efficiently by splitting the work across multiple computers in a cluster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Common Use Cases&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Data Analysis: If you’re analyzing huge datasets (e.g., sales data, website logs), PySpark helps process that data quickly.&lt;/li&gt;
&lt;li&gt;Machine Learning: PySpark is often used to build models that predict trends or patterns from large datasets.&lt;/li&gt;
&lt;li&gt;Big Data Processing: Companies with tons of data (like social media platforms or e-commerce giants) use PySpark to keep things running smoothly.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  Apache Spark Architecture
&lt;/h2&gt;

&lt;p&gt;The Spark runtime consists of several key components that work together to execute distributed computations.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu4km862zdpqy1udbc8z1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu4km862zdpqy1udbc8z1.png" alt=" " width="800" height="345"&gt;&lt;/a&gt; &lt;br&gt;
Below are the functions of each component of Spark architecture. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Spark driver&lt;/strong&gt;&lt;br&gt;
The driver is the program or process responsible for coordinating the execution of the Spark application. It runs the main function and creates the SparkContext, which connects to the cluster manager.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Spark executors&lt;/strong&gt;&lt;br&gt;
Executors are worker processes responsible for executing tasks in Spark applications. They are launched on worker nodes and communicate with the driver program and cluster manager. Executors run tasks concurrently and store data in memory or disk for caching and intermediate storage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The cluster manager&lt;/strong&gt;&lt;br&gt;
The cluster manager is responsible for allocating resources and managing the cluster on which the Spark application runs. Spark supports various cluster managers like Apache Mesos, Hadoop YARN, and standalone cluster manager.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SparkContext&lt;/strong&gt;&lt;br&gt;
SparkContext is the entry point for any Spark functionality. It represents the connection to a Spark cluster and can be used to create RDDs (Resilient Distributed Datasets), accumulators, and broadcast variables. SparkContext also coordinates the execution of tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Task&lt;/strong&gt;&lt;br&gt;
A task is the smallest unit of work in Spark, representing a unit of computation that can be performed on a single partition of data. The driver program divides the Spark job into tasks and assigns them to the executor nodes for execution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Working of the Spark Architecture&lt;/strong&gt;&lt;br&gt;
When the driver program in the Apache Spark architecture executes, it calls the main program of the application and creates a SparkContext, which contains all of the basic functions. The Spark driver includes several other components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DAG Scheduler. &lt;/li&gt;
&lt;li&gt;Task Scheduler.&lt;/li&gt;
&lt;li&gt;Backend Scheduler.&lt;/li&gt;
&lt;li&gt;Block Manager.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These components translate user code into jobs that are executed on the cluster. Together, the Spark Driver and SparkContext oversee the entire job execution lifecycle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Installing Apache Spark and PySpark from the terminal&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Step 1: Install Java 17 (required for Spark 4.x)&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;openjdk-17-jdk &lt;span class="nt"&gt;-y&lt;/span&gt;

&lt;span class="c"&gt;#Verify Java installation&lt;/span&gt;
java &lt;span class="nt"&gt;-version&lt;/span&gt;   &lt;span class="c"&gt;# should print openjdk version "17.0.xx"&lt;/span&gt;

&lt;span class="c"&gt;# Step 2: Download Apache Spark 4.0.1 (built with Hadoop 3)&lt;/span&gt;
wget https://dlcdn.apache.org/spark/spark-4.0.1/spark-4.0.1-bin-hadoop3.tgz

&lt;span class="c"&gt;#: Extract the tarball&lt;/span&gt;
&lt;span class="nb"&gt;tar&lt;/span&gt; &lt;span class="nt"&gt;-xvzf&lt;/span&gt; spark-4.0.1-bin-hadoop3.tgz

&lt;span class="c"&gt;#: Remove the archive to save space&lt;/span&gt;
&lt;span class="nb"&gt;rm &lt;/span&gt;spark-4.0.1-bin-hadoop3.tgz

&lt;span class="c"&gt;#: Rename the extracted folder to something simpler&lt;/span&gt;
&lt;span class="nb"&gt;mv &lt;/span&gt;spark-4.0.1-bin-hadoop3 spark

&lt;span class="c"&gt;#: Navigate into the Spark installation directory&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;spark 

&lt;span class="c"&gt;#Step 3: Verify Python &lt;/span&gt;
python &lt;span class="nt"&gt;--version&lt;/span&gt;

&lt;span class="c"&gt;#Step 4: Set up Pyspark Environment &lt;/span&gt;
python &lt;span class="nt"&gt;-m&lt;/span&gt; venv sparkenv 

&lt;span class="c"&gt;#Activate the Environment:&lt;/span&gt;
&lt;span class="nb"&gt;source &lt;/span&gt;sparkenv/bin/activate

&lt;span class="c"&gt;#Install Pyspark&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;pyspark

&lt;span class="c"&gt;#Step 5: Running Pyspark with JupyterLab&lt;/span&gt;
&lt;span class="c"&gt;#Install Jupyter in your virtual environment&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;notebook ipykernel 

&lt;span class="c"&gt;#Register your environment &lt;/span&gt;
python &lt;span class="nt"&gt;-m&lt;/span&gt; ipykernel &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--user&lt;/span&gt; &lt;span class="nt"&gt;--name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sparkenv &lt;span class="nt"&gt;--display-name&lt;/span&gt; &lt;span class="s2"&gt;"Python (sparkenv)"&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Running PySpark Code&lt;/strong&gt; &lt;br&gt;
After a successful setup, initialize a PySpark session.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;

&lt;span class="c1"&gt;#Create a Spark session
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;appName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;restaurant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;Core Concepts of PySpark&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Resilient Distributed Datasets (RDDs)&lt;/strong&gt;&lt;br&gt;
An RDD (Resilient Distributed Dataset) is the fundamental data structure in Spark. RDDs are the elements that run and operate on multiple nodes to perform parallel processing on a cluster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Characteristics&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Immutable: Once created, RDDs can't be modified; transformations create new RDDs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Resilient: Can recover from node failures through lineage tracking (remembers the transformations used to build it) rather than data replication.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Distributed: Data is partitioned and processed in parallel across cluster nodes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Dataset: Holds your data like a large list or table.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Lazy Evaluation: Transformations are not executed immediately - they build up a computation graph that executes only when an action is called.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Creating an RDD&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;from pyspark.sql import SparkSession

&lt;span class="c"&gt;# Initialize SparkSession&lt;/span&gt;
sc &lt;span class="o"&gt;=&lt;/span&gt; SparkSession.builder &lt;span class="se"&gt;\&lt;/span&gt;
    .appName&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"RDD"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    .getOrCreate&lt;span class="o"&gt;()&lt;/span&gt; 

&lt;span class="c"&gt;#Initialize SparkContext&lt;/span&gt;
sc &lt;span class="o"&gt;=&lt;/span&gt; spark.sparkContext

&lt;span class="c"&gt;#Create RDD using parallelize&lt;/span&gt;
rdd &lt;span class="o"&gt;=&lt;/span&gt; spark.sparkContext.parallelize&lt;span class="o"&gt;([&lt;/span&gt;1, 2, 3, 4, 5, 6, 7, 8, 9, 10]&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="c"&gt;# takes the python list and changes it into a RDD so Spark can process it in parallel.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;RDD Transformations and Actions&lt;/strong&gt; &lt;br&gt;
There are two types of operations you can perform on an RDD: Transformations and Actions.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Transformations: These are lazy operations that define how the RDD should be transformed. Examples include map(), filter(), flatMap(), groupByKey(), reduceByKey(). These don’t execute right away; they build up a plan of what should happen.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Actions: These trigger the actual computation and return the result. Examples include collect(), count(), first(), take(), saveAsTextFile(). &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example of Transformation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#Create RDD using parallelize&lt;/span&gt;
rdd &lt;span class="o"&gt;=&lt;/span&gt; spark.sparkContext.parallelize&lt;span class="o"&gt;([&lt;/span&gt;1, 2, 3, 4, 5, 6, 7, 8, 9, 10]&lt;span class="o"&gt;)&lt;/span&gt; 

&lt;span class="c"&gt;#filter(): Keep only elements matching a condition&lt;/span&gt;
rdd.filter&lt;span class="o"&gt;(&lt;/span&gt;lambda x: x % 2 &lt;span class="o"&gt;==&lt;/span&gt; 1&lt;span class="o"&gt;)&lt;/span&gt;.collect&lt;span class="o"&gt;()&lt;/span&gt; 

&lt;span class="c"&gt;#flatMap(): Like map() but flattens lists&lt;/span&gt;
rdd2 &lt;span class="o"&gt;=&lt;/span&gt; spark.sparkContext.parallelize&lt;span class="o"&gt;([&lt;/span&gt;&lt;span class="s2"&gt;"hello world"&lt;/span&gt;, &lt;span class="s2"&gt;"hi spark"&lt;/span&gt;&lt;span class="o"&gt;])&lt;/span&gt;
rdd2.flatMap&lt;span class="o"&gt;(&lt;/span&gt;lambda line: line.split&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;" "&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;.collect&lt;span class="o"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Example of Action:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#Create RDD using parallelize&lt;/span&gt;
rdd &lt;span class="o"&gt;=&lt;/span&gt; spark.sparkContext.parallelize&lt;span class="o"&gt;([&lt;/span&gt;1, 2, 3, 4, 5, 6, 7, 8, 9, 10]&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;#Brings all elements from the distributed RDD back to the driver&lt;/span&gt;
rdd.collect&lt;span class="o"&gt;()&lt;/span&gt; 

&lt;span class="c"&gt;#Returns the first N elements from the RDD (here, 3 elements). &lt;/span&gt;
rdd.take&lt;span class="o"&gt;(&lt;/span&gt;5&lt;span class="o"&gt;)&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. DataFrames&lt;/strong&gt;&lt;br&gt;
A DataFrame is a distributed collection of data organized into named columns, similar to a table in a relational database. Like RDDs, DataFrames are immutable and distributed, but they add schema and column support for structured data processing. &lt;br&gt;
&lt;strong&gt;Why Use DataFrames Instead of RDDs?&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Optimized for Performance: DataFrames come with built-in optimizations that RDDs don’t have, which means operations run faster.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Schema Information: With DataFrames, you know the structure of your data (e.g., column names and types), which allows for more meaningful data manipulation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ease of Use: DataFrames allow you to perform SQL-like operations such as filtering, grouping, and aggregating data, which are more intuitive than RDD transformations.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When to Use DataFrames&lt;/strong&gt;&lt;br&gt;
Ideal for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Structured or semi-structured data (JSON, Parquet, CSV, databases)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Better performance requirements due to optimizations&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;SQL-like operations and complex queries&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;ETL pipelines with data from multiple sources&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When you need both programmatic and SQL access to the same data&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Working with data analysts who prefer SQL syntax &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Creating DataFrames&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# creating spark session&lt;/span&gt;
from pyspark.sql import SparkSession 

&lt;span class="c"&gt;# Initializing a Spark session&lt;/span&gt;
spark &lt;span class="o"&gt;=&lt;/span&gt; SparkSession.builder.appName&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"uber"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;.getOrCreate&lt;span class="o"&gt;()&lt;/span&gt; 

&lt;span class="c"&gt;#read CSV file into DataFrame&lt;/span&gt;
uber_df &lt;span class="o"&gt;=&lt;/span&gt; spark.read.csv&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"uber.csv"&lt;/span&gt;, &lt;span class="nv"&gt;header&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;True, &lt;span class="nv"&gt;inferSchema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;True&lt;span class="o"&gt;)&lt;/span&gt;
uber_df.show&lt;span class="o"&gt;(&lt;/span&gt;5 &lt;span class="o"&gt;)&lt;/span&gt; 

&lt;span class="c"&gt;#Check coloumns and their datatypes&lt;/span&gt;
uber_df.printSchema&lt;span class="o"&gt;()&lt;/span&gt; 

&lt;span class="c"&gt;#Check the number of rows and columns&lt;/span&gt;
uber_df.count&lt;span class="o"&gt;()&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Spark SQL&lt;/strong&gt;&lt;br&gt;
Spark SQL is a module of PySpark that lets you use SQL-like syntax to interact with DataFrames. It’s particularly useful when you need to query structured data. Whether you’re filtering, grouping, or joining data, you can use familiar SQL commands, just like you would in a traditional database.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why use Spark SQL?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Familiar SQL syntax.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Performance optimization using Catalyst.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Integration with BI tools.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Using SQL to Query a DataFrame&lt;/strong&gt;&lt;br&gt;
Let’s start by registering a DataFrame as a temporary table so we can query it using SQL.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;from pyspark.sql import SparkSession

&lt;span class="c"&gt;# create a Spark session&lt;/span&gt;
spark &lt;span class="o"&gt;=&lt;/span&gt; SparkSession.builder.appName&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"restuarant"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;.getOrCreate&lt;span class="o"&gt;()&lt;/span&gt; 

&lt;span class="c"&gt;#read CSV file into DataFrame&lt;/span&gt;
restaurant_df &lt;span class="o"&gt;=&lt;/span&gt; spark.read.csv&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"restaurant_orders.csv"&lt;/span&gt;, &lt;span class="nv"&gt;header&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;True, &lt;span class="nv"&gt;inferSchema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;True&lt;span class="o"&gt;)&lt;/span&gt;
restaurant_df.show&lt;span class="o"&gt;(&lt;/span&gt;5 &lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Register the DataFrame as a temporary view&lt;/span&gt;
restuarant_df.createOrReplaceTempView&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"restuarant_info"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Run a SQL query&lt;/span&gt;
spark.sql&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"SELECT * FROM restuarant_info"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;.show&lt;span class="o"&gt;()&lt;/span&gt; 

&lt;span class="c"&gt;# Stop the SparkSession&lt;/span&gt;
spark.stop&lt;span class="o"&gt;()&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
We explored the Apache Spark architecture in order to understand how to build big data applications efficiently. Its components (the driver, executors, cluster manager, and SparkContext) work together to make cluster computing accessible, and PySpark brings that power to Python. Spark computes results efficiently and remains popular for batch processing, interactive queries, real-time analytics, and machine learning.&lt;/p&gt;

</description>
      <category>analytics</category>
      <category>datascience</category>
      <category>python</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Apache Kafka Deep Dive: Core Concepts, Data Engineering Applications, and Real-World Production Practices</title>
      <dc:creator>KiplangatJaphet</dc:creator>
      <pubDate>Tue, 23 Sep 2025 20:53:17 +0000</pubDate>
      <link>https://dev.to/kiplangatjaphet/apache-kafka-deep-dive-core-concepts-data-engineering-applications-and-real-world-production-2mjg</link>
      <guid>https://dev.to/kiplangatjaphet/apache-kafka-deep-dive-core-concepts-data-engineering-applications-and-real-world-production-2mjg</guid>
      <description>&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;br&gt;
Apache Kafka has emerged as a cornerstone technology for building scalable, real-time data pipelines and event-driven architectures. Originally developed at LinkedIn and open-sourced in 2011, Kafka is a distributed streaming platform designed to handle massive volumes of data with low latency and high throughput. This article explores Kafka’s core concepts, its applications in data engineering, and best practices for running Kafka in production, with insights into how companies like Netflix, LinkedIn, and Uber leverage it. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is Apache Kafka?&lt;/strong&gt;&lt;br&gt;
Apache Kafka is an open-source distributed event streaming platform that serves as a robust system for building real-time data pipelines and streaming applications. It enables applications to publish and subscribe to streams of events, making it ideal for applications that need to process large volumes of data in real-time, such as data ingestion, real-time analytics, and event-driven architectures. Kafka's key features include high throughput, scalability, fault tolerance through data replication, and durable, ordered message storage within topics. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Apache Kafka as an event streaming platform&lt;/strong&gt;&lt;br&gt;
Kafka combines three key capabilities so you can implement your use cases for event streaming end-to-end with a single battle-tested solution:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;To publish (write) and subscribe to (read) streams of events, including continuous import/export of your data from other systems.&lt;/li&gt;
&lt;li&gt;To store streams of events durably and reliably for as long as you want.&lt;/li&gt;
&lt;li&gt;To process streams of events as they occur or retrospectively.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;And all this functionality is provided in a distributed, highly scalable, elastic, fault-tolerant, and secure manner. Kafka can be deployed on bare-metal hardware, virtual machines, and containers, and on-premises as well as in the cloud. You can choose between self-managing your Kafka environments and using fully managed services offered by a variety of vendors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How does Kafka work?&lt;/strong&gt;&lt;br&gt;
Kafka is a distributed system consisting of servers and clients that communicate via a high-performance TCP network protocol. It can be deployed on bare-metal hardware, virtual machines, and containers in on-premise as well as cloud environments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Servers&lt;/strong&gt;&lt;br&gt;
Kafka is run as a cluster of one or more servers that can span multiple datacenters or cloud regions. Some of these servers form the storage layer, called the brokers. Other servers run Kafka Connect to continuously import and export data as event streams to integrate Kafka with your existing systems such as relational databases as well as other Kafka clusters. To let you implement mission-critical use cases, a Kafka cluster is highly scalable and fault-tolerant: if any of its servers fails, the other servers will take over their work to ensure continuous operations without any data loss.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Clients&lt;/strong&gt;&lt;br&gt;
They allow you to write distributed applications and microservices that read, write, and process streams of events in parallel, at scale, and in a fault-tolerant manner even in the case of network problems or machine failures. Kafka ships with some such clients included, which are augmented by dozens of clients provided by the Kafka community: clients are available for Java and Scala including the higher-level Kafka Streams library, for Go, Python, C/C++, and many other programming languages as well as REST APIs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Components of kafka&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Producer -
Producers are applications or services that publish (write) messages into Kafka topics.
They decide which topic and partition each message goes to, either randomly, in round-robin fashion, or based on a key.&lt;/li&gt;
&lt;li&gt;ZooKeeper -
ZooKeeper is an open-source distributed coordination service that helps manage and synchronize large clusters of distributed systems by providing a reliable place where services can keep configuration, naming, synchronization, and group information.&lt;/li&gt;
&lt;li&gt;Consumer -
Consumers are applications that subscribe to (read) messages from Kafka topics.
Consumers exist in consumer groups to share the load of message consumption.&lt;/li&gt;
&lt;li&gt;Topic -
A topic is like a logical channel or category where messages are stored.
Producers write messages into topics, and consumers read from them.&lt;/li&gt;
&lt;li&gt;Partition -
Topics are split into partitions to allow parallelism and scalability.
Each partition is an ordered, immutable log of records.
Messages inside partitions are identified by a unique offset.&lt;/li&gt;
&lt;li&gt;Broker -
A broker is a Kafka server that stores and serves messages.
Acting as a central hub, the broker accepts messages from producers, assigns them unique offsets, and stores them durably on disk.&lt;/li&gt;
&lt;li&gt;Cluster -
A Kafka cluster is a group of brokers working together.
It ensures data replication, fault tolerance, and high availability.&lt;/li&gt;
&lt;li&gt;Offset -
An offset is a unique ID assigned to each message in a partition.
It helps consumers keep track of which messages have been read.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ofj7fjuqa4a4tn9v0on.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ofj7fjuqa4a4tn9v0on.png" alt=" " width="609" height="171"&gt;&lt;/a&gt;&lt;br&gt;
      The architecture of Kafka.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each Kafka broker can host multiple topics, and each topic is divided into multiple partitions for scalability and fault tolerance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk5ctz7v4eu7sj1jxt2s3.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk5ctz7v4eu7sj1jxt2s3.jpg" alt=" " width="800" height="439"&gt;&lt;/a&gt;&lt;br&gt;
  Kafka topic partitions layout.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Consumers use offsets to read messages sequentially from oldest to newest, and upon recovery from failure, resume from the last committed offset.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Set up Kafka on the Terminal&lt;/strong&gt;&lt;br&gt;
Let’s dive into the installation and running of Kafka directly from the terminal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Install Java&lt;/strong&gt;&lt;br&gt;
Kafka requires Java (JDK 11 or 17). Let’s install Java 11:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt update
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;openjdk-11-jdk &lt;span class="nt"&gt;-y&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Confirm Java Installation&lt;/p&gt;

&lt;p&gt;Verify that Java is installed correctly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;java &lt;span class="nt"&gt;-version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see output similar to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;openjdk version "11.0.xx" ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Download and Extract Kafka&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let’s download and set up Kafka 3.9.1 with Scala 2.13.&lt;/p&gt;

&lt;p&gt;Download Kafka&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;wget https://downloads.apache.org/kafka/3.9.1/kafka_2.13-3.9.1.tgz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Extract the downloaded file&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;tar&lt;/span&gt; &lt;span class="nt"&gt;-xvzf&lt;/span&gt; kafka_2.13-3.9.1.tgz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Delete the archive to free up space&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;rm &lt;/span&gt;kafka_2.13-3.9.1.tgz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Rename the extracted folder to something simpler&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mv &lt;/span&gt;kafka_2.13-3.9.1 kafka
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Change into the Kafka directory&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;kafka
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;&lt;strong&gt;3. Start ZooKeeper and Kafka&lt;/strong&gt;&lt;br&gt;
Start ZooKeeper (in one terminal):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bin/zookeeper-server-start.sh config/zookeeper.properties
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see output like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fig18hq3un12q8ineyath.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fig18hq3un12q8ineyath.png" alt="ZooKeeper starting" width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Start Kafka (in another terminal)&lt;/p&gt;

&lt;p&gt;Now open a &lt;strong&gt;second terminal&lt;/strong&gt;, navigate to the Kafka folder, and run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bin/kafka-server-start.sh config/server.properties
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see output like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbgtsx4c8bby03v4g3sn8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbgtsx4c8bby03v4g3sn8.png" alt=" " width="800" height="415"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;4. Create a Kafka Topic&lt;/strong&gt;&lt;br&gt;
Let’s create a topic called &lt;code&gt;exams&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bin/kafka-topics.sh &lt;span class="nt"&gt;--create&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--topic&lt;/span&gt; exams &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--bootstrap-server&lt;/span&gt; localhost:9092 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--partitions&lt;/span&gt; 1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--replication-factor&lt;/span&gt; 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
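&lt;p&gt;The same topic can also be created programmatically. Below is a minimal sketch assuming the third-party &lt;code&gt;kafka-python&lt;/code&gt; package (install it with &lt;code&gt;pip install kafka-python&lt;/code&gt;) and a broker on &lt;code&gt;localhost:9092&lt;/code&gt;; the settings mirror the CLI flags above.&lt;/p&gt;

```python
def topic_spec(name="exams", partitions=1, replication=1):
    """Mirror the CLI flags: --topic, --partitions, --replication-factor."""
    return {"name": name, "num_partitions": partitions, "replication_factor": replication}

def create_topic(spec, bootstrap="localhost:9092"):
    # kafka-python is imported lazily so topic_spec stays usable without the package.
    from kafka.admin import KafkaAdminClient, NewTopic  # pip install kafka-python
    admin = KafkaAdminClient(bootstrap_servers=bootstrap)
    admin.create_topics([NewTopic(**spec)])
    admin.close()

# Usage (requires a running broker):
#   create_topic(topic_spec())
```

&lt;p&gt;Running &lt;code&gt;create_topic&lt;/code&gt; twice raises an error for an already-existing topic, just like the CLI.&lt;/p&gt;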



&lt;p&gt;&lt;strong&gt;5. Start a Kafka Producer&lt;/strong&gt;&lt;br&gt;
Let’s send some messages to the &lt;code&gt;exams&lt;/code&gt; topic using Kafka’s built-in console producer.&lt;br&gt;
In a new terminal window, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bin/kafka-console-producer.sh &lt;span class="nt"&gt;--topic&lt;/span&gt; exams &lt;span class="nt"&gt;--bootstrap-server&lt;/span&gt; localhost:9092
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once it starts, you can type messages directly into the terminal and hit Enter to send each message to the Kafka topic.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6kbxfoyr18w470dkj4ll.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6kbxfoyr18w470dkj4ll.png" alt=" " width="800" height="205"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each line you type gets published to the &lt;code&gt;exams&lt;/code&gt; topic.&lt;/p&gt;
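&lt;p&gt;The console producer also has a programmatic counterpart. This sketch assumes the third-party &lt;code&gt;kafka-python&lt;/code&gt; package and a broker on &lt;code&gt;localhost:9092&lt;/code&gt;; it publishes JSON-encoded exam records to the same &lt;code&gt;exams&lt;/code&gt; topic.&lt;/p&gt;

```python
import json

def encode_exam(student, score):
    """Serialize one exam record to JSON bytes, the form Kafka expects for message values."""
    return json.dumps({"student": student, "score": score}).encode("utf-8")

def send_exams(records, bootstrap="localhost:9092", topic="exams"):
    # kafka-python is imported lazily so encode_exam stays usable without the package.
    from kafka import KafkaProducer  # pip install kafka-python
    producer = KafkaProducer(bootstrap_servers=bootstrap)
    for student, score in records:
        producer.send(topic, value=encode_exam(student, score))
    producer.flush()   # block until every buffered message is delivered
    producer.close()

# Usage (requires a running broker):
#   send_exams([("alice", 87), ("bob", 92)])
```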




&lt;p&gt;&lt;strong&gt;6. Start a Kafka Consumer&lt;/strong&gt;&lt;br&gt;
To read the messages you just sent, start a Kafka consumer that listens to the &lt;code&gt;exams&lt;/code&gt; topic.&lt;/p&gt;

&lt;p&gt;In another terminal window, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bin/kafka-console-consumer.sh &lt;span class="nt"&gt;--topic&lt;/span&gt; exams &lt;span class="nt"&gt;--from-beginning&lt;/span&gt; &lt;span class="nt"&gt;--bootstrap-server&lt;/span&gt; localhost:9092
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will display &lt;strong&gt;all messages&lt;/strong&gt; from the beginning of the topic — including the ones you just produced.&lt;/p&gt;

&lt;p&gt;Your output should look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzhxydgnjmhiiee9aq7ob.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzhxydgnjmhiiee9aq7ob.png" alt=" " width="800" height="210"&gt;&lt;/a&gt;&lt;/p&gt;
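&lt;p&gt;Likewise, the console consumer can be replaced with a small program. This sketch again assumes the &lt;code&gt;kafka-python&lt;/code&gt; package; setting &lt;code&gt;auto_offset_reset="earliest"&lt;/code&gt; plays the same role as &lt;code&gt;--from-beginning&lt;/code&gt; above.&lt;/p&gt;

```python
import json

def decode_exam(raw):
    """Inverse of the producer's encoding: JSON bytes back to a dict."""
    return json.loads(raw.decode("utf-8"))

def read_exams(bootstrap="localhost:9092", topic="exams"):
    from kafka import KafkaConsumer  # pip install kafka-python
    consumer = KafkaConsumer(
        topic,
        bootstrap_servers=bootstrap,
        auto_offset_reset="earliest",   # same effect as --from-beginning
        consumer_timeout_ms=5000,       # stop iterating after 5 s of silence
    )
    for message in consumer:
        print(decode_exam(message.value))
    consumer.close()

# Usage (requires a running broker with messages in the topic):
#   read_exams()
```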

&lt;p&gt;&lt;strong&gt;Use Cases&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Netflix&lt;/strong&gt;&lt;br&gt;
Netflix needs no introduction. One of the world’s most innovative and robust OTT platforms, it uses Apache Kafka in its Keystone pipeline project to push and receive notifications.&lt;/p&gt;

&lt;p&gt;Netflix runs two kinds of Kafka clusters: Fronting Kafka, used for data collection and buffering by producers, and Consumer Kafka, used for routing content to consumers.&lt;/p&gt;

&lt;p&gt;The volume of data processed at Netflix is enormous: 36 Kafka clusters (24 Fronting Kafka and 12 Consumer Kafka) handle almost 700 billion messages a day.&lt;/p&gt;

&lt;p&gt;Through the Keystone pipeline, Netflix has brought its data loss rate down to 0.01%, with Apache Kafka as a key driver of that reduction.&lt;/p&gt;

&lt;p&gt;Netflix plans to use Kafka version 0.9.0.1 to improve resource utilization and availability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Uber&lt;/strong&gt;&lt;br&gt;
A giant in the travel industry like Uber needs a system that is fault-tolerant and uncompromising about errors on many fronts.&lt;/p&gt;

&lt;p&gt;Uber uses Apache Kafka to run their driver injury protection program in more than 200 cities.&lt;/p&gt;

&lt;p&gt;Drivers registered with Uber pay a premium on every ride, and the program has worked successfully thanks to the scalability and robustness of Apache Kafka.&lt;/p&gt;

&lt;p&gt;It has achieved this success largely through non-blocking batch processing, which gives the Uber engineering team a steady throughput.&lt;/p&gt;

&lt;p&gt;Multiple retries have let the Uber team segment messages to achieve real-time process updates and flexibility.&lt;/p&gt;

&lt;p&gt;Uber plans to introduce a framework on top of Apache Kafka that improves uptime and lets the program grow and scale without eating into developer time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LinkedIn&lt;/strong&gt;&lt;br&gt;
LinkedIn, one of the world’s most prominent B2B social media platforms, handles well over a trillion messages per day.&lt;/p&gt;

&lt;p&gt;If the Netflix numbers seemed huge, this figure is mind-blowing: LinkedIn’s message volume has grown by over 1200x in the last few years.&lt;/p&gt;

&lt;p&gt;LinkedIn uses separate clusters for different applications, so that the failure of one application cannot harm the others sharing a cluster.&lt;/p&gt;

&lt;p&gt;Broker Kafka clusters at LinkedIn let them whitelist certain users for higher bandwidth, ensuring a seamless user experience.&lt;/p&gt;

&lt;p&gt;LinkedIn plans to reduce its data loss rate through MirrorMaker, which acts as the intermediary between Kafka clusters and topics.&lt;/p&gt;

&lt;p&gt;At present, there is a limit on message size of 1 MB, but through the Kafka ecosystem LinkedIn plans to let publishers and consumers send messages well over that limit in the future.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Messaging&lt;/strong&gt;&lt;br&gt;
Kafka works well as a replacement for a more traditional message broker. Message brokers are used for a variety of reasons (to decouple processing from data producers, to buffer unprocessed messages, etc.). Compared to most messaging systems, Kafka has better throughput, built-in partitioning, replication, and fault tolerance, which makes it a good solution for large-scale message-processing applications. Messaging uses are often comparatively low-throughput, but they may require low end-to-end latency and often depend on the strong durability guarantees Kafka provides.&lt;br&gt;
(&lt;a href="https://www.knowledgenile.com/blogs/apache-kafka-use-cases" rel="noopener noreferrer"&gt;https://www.knowledgenile.com/blogs/apache-kafka-use-cases&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resources&lt;/strong&gt;&lt;br&gt;
Apache Kafka - Fundamentals - &lt;a href="https://www.tutorialspoint.com/apache_kafka/apache_kafka_fundamentals.htm" rel="noopener noreferrer"&gt;https://www.tutorialspoint.com/apache_kafka/apache_kafka_fundamentals.htm&lt;/a&gt;&lt;br&gt;
Apache Kafka Documentation - &lt;a href="https://kafka.apache.org/documentation/#uses" rel="noopener noreferrer"&gt;https://kafka.apache.org/documentation/#uses&lt;/a&gt;&lt;br&gt;
Introduction to Apache Kafka - &lt;a href="https://kafka.apache.org/intro" rel="noopener noreferrer"&gt;https://kafka.apache.org/intro&lt;/a&gt;&lt;br&gt;
Best Apache Kafka Use Cases - &lt;a href="https://www.knowledgenile.com/blogs/apache-kafka-use-cases" rel="noopener noreferrer"&gt;https://www.knowledgenile.com/blogs/apache-kafka-use-cases&lt;/a&gt;&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>architecture</category>
      <category>backend</category>
      <category>kafka</category>
    </item>
  </channel>
</rss>
