<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Nithyalakshmi Kamalakkannan</title>
    <description>The latest articles on DEV Community by Nithyalakshmi Kamalakkannan (@ktnl).</description>
    <link>https://dev.to/ktnl</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F697502%2F7f8f6c78-540c-4d56-bbd3-a1b26f7d271c.png</url>
      <title>DEV Community: Nithyalakshmi Kamalakkannan</title>
      <link>https://dev.to/ktnl</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ktnl"/>
    <language>en</language>
    <item>
      <title>Part 8: Databricks Pipeline &amp; Dashboard</title>
      <dc:creator>Nithyalakshmi Kamalakkannan</dc:creator>
      <pubDate>Fri, 02 Jan 2026 11:00:32 +0000</pubDate>
      <link>https://dev.to/ktnl/part-8-databricks-pipeline-dashboard-b3h</link>
      <guid>https://dev.to/ktnl/part-8-databricks-pipeline-dashboard-b3h</guid>
      <description>&lt;h2&gt;
  
  
  Pipeline creation
&lt;/h2&gt;

&lt;p&gt;A Databricks Workflow is created with one task for each stage discussed in this blog series. The entire pipeline is orchestrated to stream and process data incrementally.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Bronze ingestion&lt;/li&gt;
&lt;li&gt;ZIP dimension build&lt;/li&gt;
&lt;li&gt;Silver enrichment&lt;/li&gt;
&lt;li&gt;Gold aggregation (both the tables)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Dependencies enforce the order automatically. If you are interested, you can also schedule the pipeline as needed with simple cron expressions!&lt;/p&gt;
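&lt;p&gt;The dependency wiring behaves like a topological ordering over the tasks. As a rough sketch (task names here are illustrative, not the exact Workflow task keys), the order the Workflow enforces can be checked in plain Python:&lt;/p&gt;

```python
from graphlib import TopologicalSorter

# Illustrative task graph mirroring the workflow:
# each task maps to the set of tasks it depends on.
tasks = {
    "bronze_ingestion": set(),
    "zip_dim_build": {"bronze_ingestion"},
    "silver_enrichment": {"bronze_ingestion", "zip_dim_build"},
    "gold_aggregation": {"silver_enrichment"},
}

# static_order() yields a valid execution order respecting every dependency.
order = list(TopologicalSorter(tasks).static_order())
print(order)
```

&lt;p&gt;Databricks resolves the same ordering from the dependencies you declare between tasks in the Workflow.&lt;/p&gt;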

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frl3b18uigjr8krk64qci.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frl3b18uigjr8krk64qci.png" alt=" " width="800" height="84"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy2ujw4z2exnyak1cjp62.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy2ujw4z2exnyak1cjp62.png" alt=" " width="800" height="424"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Dashboard Creation
&lt;/h2&gt;

&lt;p&gt;Queries on the Gold tables feed data to Databricks dashboards.&lt;/p&gt;

&lt;p&gt;In the Databricks workspace, create your own dashboard and add custom queries to provide a visual representation of business insights.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy2yzs9edyhc6g5jq4vgk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy2yzs9edyhc6g5jq4vgk.png" alt=" " width="800" height="585"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For example, to get the peak hours we add the query below as a dataset (from SQL) and create a tile in our dashboard to show the fetched results.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;SELECT&lt;br&gt;
  trip_hour,&lt;br&gt;
  SUM(total_trips) AS trips&lt;br&gt;
FROM nyc_taxi.gold.taxi_trip_metrics&lt;br&gt;
GROUP BY trip_hour&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And the result is,&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmsrc9b4upltetxxdpr0a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmsrc9b4upltetxxdpr0a.png" alt=" " width="628" height="452"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can keep adding tiles to beautify your dashboard!&lt;/p&gt;

&lt;p&gt;Dashboards update automatically when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;New files arrive&lt;/li&gt;
&lt;li&gt;Jobs rerun&lt;/li&gt;
&lt;li&gt;Late data is processed (within watermark)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To simulate new data arrival, we can add extra files to the DBFS input source directory.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy53t53qlzhswh0q5yohi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy53t53qlzhswh0q5yohi.png" alt=" " width="737" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can play with &lt;code&gt;tpep_pickup_datetime&lt;/code&gt; to see watermarks dropping late data in action.&lt;/p&gt;
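&lt;p&gt;To build intuition for the drop rule before experimenting, here is a minimal pure-Python sketch (not Spark code): an event is discarded once it trails the maximum event time seen so far by more than the 30-minute threshold.&lt;/p&gt;

```python
from datetime import datetime, timedelta

def filter_with_watermark(events, delay=timedelta(minutes=30)):
    """Keep events not older than (max event time seen so far minus delay).

    Pure-Python sketch of Spark's watermark drop rule; `events` is an
    iterable of datetimes in arrival order.
    """
    max_seen = None
    kept = []
    for ts in events:
        if max_seen is None or ts > max_seen:
            max_seen = ts
        if ts >= max_seen - delay:
            kept.append(ts)
    return kept

events = [
    datetime(2026, 1, 2, 10, 0),
    datetime(2026, 1, 2, 11, 0),
    datetime(2026, 1, 2, 10, 15),  # 45 minutes late: dropped
    datetime(2026, 1, 2, 10, 45),  # 15 minutes late: kept
]
print(filter_with_watermark(events))
```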

&lt;h2&gt;
  
  
  Messed up somewhere, or want to reset the state?
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Reprocessing Strategy&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;To reprocess everything:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Drop tables or schema&lt;/li&gt;
&lt;li&gt;Delete checkpoints&lt;/li&gt;
&lt;li&gt;Rerun workflow&lt;/li&gt;
&lt;/ul&gt;
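&lt;p&gt;As a sketch, the reset steps can be scripted. The schema names and checkpoint root below are the ones used in this series; on Databricks you would run the SQL via spark.sql and the deletions via dbutils.fs.rm. The helper itself is hypothetical, so adapt it to your layout.&lt;/p&gt;

```python
def reset_plan(catalog="nyc_taxi", layers=("bronze", "silver", "gold"),
               checkpoint_root="/Volumes/nyc_taxi/infra/checkpoints"):
    """Return the SQL statements and checkpoint paths for a full reset."""
    drops = [f"DROP SCHEMA IF EXISTS {catalog}.{layer} CASCADE" for layer in layers]
    checkpoints = [f"{checkpoint_root}/{layer}" for layer in layers]
    return drops, checkpoints

drops, checkpoints = reset_plan()
for stmt in drops:
    print(stmt)   # on Databricks: spark.sql(stmt)
for path in checkpoints:
    print(path)   # on Databricks: dbutils.fs.rm(path, recurse=True)
```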




&lt;p&gt;Hope you liked the series, please do share your feedback. &lt;br&gt;
The source code is available in the &lt;a href="https://github.com/ktnl97/taxi-trip-analysis/tree/main/Taxi%20data%20-%20workflow" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt; for reference.&lt;/p&gt;

&lt;p&gt;That's all for now. See you soon!&lt;/p&gt;

&lt;p&gt;Happy learning!&lt;/p&gt;

</description>
      <category>analytics</category>
      <category>dataengineering</category>
      <category>tutorial</category>
      <category>sql</category>
    </item>
    <item>
      <title>Part 7: Gold Layer – Metrics, Watermarks, and Aggregations</title>
      <dc:creator>Nithyalakshmi Kamalakkannan</dc:creator>
      <pubDate>Fri, 02 Jan 2026 10:51:04 +0000</pubDate>
      <link>https://dev.to/ktnl/part-7-gold-layer-metrics-watermarks-and-aggregations-1jcm</link>
      <guid>https://dev.to/ktnl/part-7-gold-layer-metrics-watermarks-and-aggregations-1jcm</guid>
      <description>&lt;p&gt;Gold tables answer business questions directly.&lt;/p&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Trips per hour by region&lt;/li&gt;
&lt;li&gt;Revenue per ZIP&lt;/li&gt;
&lt;li&gt;Average distance by time window&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Gold tables are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Aggregated&lt;/li&gt;
&lt;li&gt;Optimized&lt;/li&gt;
&lt;li&gt;Dashboard-ready&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Introducing Event Time &amp;amp; Watermarking
&lt;/h2&gt;

&lt;p&gt;Again, for the gold layer to handle late data, we add a watermark, this time combined with windowing to properly close the aggregations on events grouped by time.&lt;br&gt;
Here we tell Spark to wait 30 minutes past the latest event received before closing each open 1-hour window. A window's aggregated results are finalized once the watermark threshold (max &lt;code&gt;tpep_pickup_datetime&lt;/code&gt; received minus 30 minutes) becomes greater than the window close time.&lt;/p&gt;
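&lt;p&gt;A quick worked example of that close condition in plain Python (illustrative timestamps, not Spark code):&lt;/p&gt;

```python
from datetime import datetime, timedelta

# A 1-hour window covering 10:00-11:00 is finalized once the watermark
# (max tpep_pickup_datetime seen so far minus 30 minutes) passes the window end.
window_end = datetime(2026, 1, 2, 11, 0)
delay = timedelta(minutes=30)

def window_closed(max_event_time):
    # Watermark = latest event time seen minus the allowed lateness.
    return max_event_time - delay > window_end

print(window_closed(datetime(2026, 1, 2, 11, 15)))  # watermark 10:45, still open
print(window_closed(datetime(2026, 1, 2, 11, 31)))  # watermark 11:01, finalized
```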

&lt;blockquote&gt;
&lt;p&gt;from pyspark.sql.functions import *&lt;br&gt;
silver_df = spark.readStream.format("delta").table("nyc_taxi.silver.taxi_trips_enriched")&lt;br&gt;
gold_df = (&lt;br&gt;
    silver_df&lt;br&gt;
        .withWatermark("tpep_pickup_datetime", "30 minutes")&lt;br&gt;
        .groupBy(&lt;br&gt;
            window("tpep_pickup_datetime", "1 hour"),&lt;br&gt;
            "region"&lt;br&gt;
        )&lt;br&gt;
        .agg(&lt;br&gt;
            count("*").alias("trip_count"),&lt;br&gt;
            sum("fare_amount").alias("total_fare"),&lt;br&gt;
            avg("trip_distance").alias("avg_distance")&lt;br&gt;
        )&lt;br&gt;
)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Now, to stream it to gold delta tables.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;(&lt;br&gt;
    gold_df.writeStream.option('mergeSchema', 'true')&lt;br&gt;
        .trigger(availableNow=True)&lt;br&gt;
        .option("checkpointLocation", "/Volumes/nyc_taxi/infra/checkpoints/gold/taxi_metrics")&lt;br&gt;
        .outputMode("append")&lt;br&gt;
        .toTable("nyc_taxi.gold.taxi_metrics")&lt;br&gt;
)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;As mentioned, the gold layer answers business questions directly, and hence multiple views may be required. We will create one more view highlighting trip metrics: taxi_trip_metrics.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;from pyspark.sql.functions import *&lt;br&gt;
silver_stream = spark.readStream.format("delta").table("nyc_taxi.silver.taxi_trips_enriched")&lt;br&gt;
gold_stream = (&lt;br&gt;
    silver_stream&lt;br&gt;
    .withWatermark("tpep_pickup_datetime", "30 minutes")&lt;br&gt;
    .withColumn("trip_date", to_date("tpep_pickup_datetime"))&lt;br&gt;
    .withColumn("trip_hour", hour("tpep_pickup_datetime"))&lt;br&gt;
    .groupBy(&lt;br&gt;
        window("tpep_pickup_datetime", "1 hour"),&lt;br&gt;
        "trip_date",&lt;br&gt;
        "trip_hour",&lt;br&gt;
        "pickup_zip",&lt;br&gt;
        "region"&lt;br&gt;
    )&lt;br&gt;
    .agg(&lt;br&gt;
        count("*").alias("total_trips"),&lt;br&gt;
        sum("fare_amount").alias("total_revenue"),&lt;br&gt;
        avg("fare_amount").alias("avg_fare"),&lt;br&gt;
        avg("trip_distance").alias("avg_distance")&lt;br&gt;
    )&lt;br&gt;
)&lt;br&gt;
gold_stream.writeStream \&lt;br&gt;
    .format("delta") \&lt;br&gt;
    .trigger(availableNow=True) \&lt;br&gt;
    .option("checkpointLocation", "/Volumes/nyc_taxi/infra/checkpoints/gold/taxi_trip_metrics") \&lt;br&gt;
    .outputMode("append") \&lt;br&gt;
    .table("nyc_taxi.gold.taxi_trip_metrics")&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The data is now aggregated and available in gold delta tables to be used for inferring business insights!&lt;/p&gt;

&lt;p&gt;Happy learning!&lt;/p&gt;

</description>
      <category>analytics</category>
      <category>dataengineering</category>
      <category>tutorial</category>
      <category>python</category>
    </item>
    <item>
      <title>Part 6: Silver Layer – Cleansing, Enrichment, and Dimensions</title>
      <dc:creator>Nithyalakshmi Kamalakkannan</dc:creator>
      <pubDate>Fri, 02 Jan 2026 10:50:36 +0000</pubDate>
      <link>https://dev.to/ktnl/part-6-silver-layer-cleansing-enrichment-and-dimensions-ff3</link>
      <guid>https://dev.to/ktnl/part-6-silver-layer-cleansing-enrichment-and-dimensions-ff3</guid>
      <description>&lt;p&gt;The Silver layer converts raw events into analytics-ready records by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cleaning bad data&lt;/li&gt;
&lt;li&gt;Enforcing schema&lt;/li&gt;
&lt;li&gt;Adding business context&lt;/li&gt;
&lt;li&gt;Applying dimensional modeling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where value is created.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Cleansing and Type Enforcement
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Bronze must remain untouched&lt;/li&gt;
&lt;li&gt;Silver enforces correctness&lt;/li&gt;
&lt;li&gt;Errors are isolated from ingestion&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;silver_stream = (&lt;br&gt;
    spark.readStream&lt;br&gt;
        .format("delta")&lt;br&gt;
        .table("nyc_taxi.bronze.taxi_trips")&lt;br&gt;
        .withColumn(&lt;br&gt;
            "tpep_pickup_datetime",&lt;br&gt;
            to_timestamp("tpep_pickup_datetime")&lt;br&gt;
        )&lt;br&gt;
        .withColumn(&lt;br&gt;
            "fare_amount",&lt;br&gt;
            col("fare_amount").cast("double")&lt;br&gt;
        )&lt;br&gt;
        .filter(col("fare_amount") &amp;gt; 0)&lt;br&gt;
)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Using Broadcast joins
&lt;/h2&gt;

&lt;p&gt;To ensure we capture the required dimensional model, we need to make joins. But with computation distributed across executors, shuffling data among them is costly. In our case, the join is with zip_dim, a relatively small table. Hence, as a performance improvement, we use a broadcast join here. This can be seen in the screenshots attached below.&lt;/p&gt;

&lt;p&gt;Without Broadcast join&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqgd5uleq0vv48yr7sv1y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqgd5uleq0vv48yr7sv1y.png" alt=" " width="398" height="467"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With Broadcast join&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxi0wjxtfhmapc0a6889n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxi0wjxtfhmapc0a6889n.png" alt=" " width="398" height="467"&gt;&lt;/a&gt;&lt;/p&gt;
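&lt;p&gt;Conceptually, a broadcast join ships the small dimension table to every executor so the large fact table never shuffles. In pure-Python terms (illustrative data, not the real zip_dim), it is like enriching rows through an in-memory dictionary lookup:&lt;/p&gt;

```python
# Small dimension table, "broadcast" as an in-memory dict.
zip_dim = {10001: "Manhattan", 11201: "Brooklyn"}

# Fact rows stay where they are; each is enriched by a local lookup,
# which is what the broadcast hash join does on every executor.
trips = [{"pickup_zip": 10001, "fare": 12.5}, {"pickup_zip": 11201, "fare": 8.0}]
enriched = [{**t, "region": zip_dim.get(t["pickup_zip"])} for t in trips]
print(enriched)
```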

&lt;h2&gt;
  
  
  Adding watermarks
&lt;/h2&gt;

&lt;p&gt;We want real-time data to be processed incrementally, so we need to tell Spark when a set of events is ready to be processed, joined, and added to the sink for the next steps. Of course, either as the whole result or only the changeset!&lt;br&gt;
Thus, we have added a watermark asking Spark to wait for and accommodate data up to 30 minutes late.&lt;/p&gt;

&lt;p&gt;The final code for the silver layer is below.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;from pyspark.sql.functions import *&lt;br&gt;
from pyspark.sql.functions import broadcast&lt;br&gt;
bronze_stream = spark.readStream.table("nyc_taxi.bronze.taxi_trips")&lt;br&gt;
zip_dim = spark.read.table("nyc_taxi.raw.zip_dim")&lt;br&gt;
silver_df = (&lt;br&gt;
    bronze_stream&lt;br&gt;
        .withColumn(&lt;br&gt;
            "pickup_zip",&lt;br&gt;
            regexp_replace("pickup_zip", r"\.0$", "").cast("int")&lt;br&gt;
        )&lt;br&gt;
         .withColumn(&lt;br&gt;
            "tpep_pickup_datetime",&lt;br&gt;
            to_timestamp("tpep_pickup_datetime")&lt;br&gt;
        )&lt;br&gt;
        .withColumn(&lt;br&gt;
            "tpep_dropoff_datetime",&lt;br&gt;
            to_timestamp("tpep_dropoff_datetime")&lt;br&gt;
        )&lt;br&gt;
        .withWatermark("tpep_pickup_datetime", "30 minutes")&lt;br&gt;&lt;br&gt;
        .join(&lt;br&gt;
            broadcast(zip_dim),&lt;br&gt;
            bronze_stream.pickup_zip == zip_dim.zip_code,&lt;br&gt;
            "left"&lt;br&gt;
        )&lt;br&gt;
        .select(&lt;br&gt;
            "tpep_pickup_datetime",&lt;br&gt;
            "tpep_dropoff_datetime",&lt;br&gt;
            "trip_distance",&lt;br&gt;
            "fare_amount",&lt;br&gt;
            "pickup_zip",&lt;br&gt;
            "region",&lt;br&gt;
            "state"&lt;br&gt;
        )&lt;br&gt;
)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Now, we will stream it to the silver delta table with output mode append, so that only finalized (closed-window) results are added to the Silver delta lake sink.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;(&lt;br&gt;
    silver_df.writeStream&lt;br&gt;
        .format("delta")&lt;br&gt;
        .outputMode("append")&lt;br&gt;
        .option("checkpointLocation", "/Volumes/nyc_taxi/infra/checkpoints/silver")&lt;br&gt;
        .trigger(availableNow=True)&lt;br&gt;
        .toTable("nyc_taxi.silver.taxi_trips_enriched")&lt;br&gt;
)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The required cleansing and normalization has happened, and the data is now ready to be refined further to surface business insights.&lt;/p&gt;

&lt;p&gt;Happy learning!&lt;/p&gt;

</description>
      <category>analytics</category>
      <category>architecture</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Part 5: Building a ZIP Code Dimension Table</title>
      <dc:creator>Nithyalakshmi Kamalakkannan</dc:creator>
      <pubDate>Fri, 02 Jan 2026 10:49:47 +0000</pubDate>
      <link>https://dev.to/ktnl/part-5-building-a-zip-code-dimension-table-1him</link>
      <guid>https://dev.to/ktnl/part-5-building-a-zip-code-dimension-table-1him</guid>
      <description>&lt;h2&gt;
  
  
  Why? The Need for It!
&lt;/h2&gt;

&lt;p&gt;Fact tables (like taxi trips) are optimized for events:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pickup time&lt;/li&gt;
&lt;li&gt;Distance&lt;/li&gt;
&lt;li&gt;Fare&lt;/li&gt;
&lt;li&gt;Pickup ZIP&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But analytics teams ask questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Trips by region&lt;/li&gt;
&lt;li&gt;Revenue by state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Storing these attributes repeatedly in the fact table:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Increases storage&lt;/li&gt;
&lt;li&gt;Slows joins&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Breaking these attributes out into dimension tables is the best practice. In our project, the need to know the region for a pickup/dropoff ZIP code paves the way for creating the dimension table zip_dim.&lt;/p&gt;

&lt;p&gt;In real projects, ZIP metadata comes from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Census data&lt;/li&gt;
&lt;li&gt;Data exposed via APIs&lt;/li&gt;
&lt;li&gt;Internal reference tables&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For this project, we simulate it with hardcoded, range-based values.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Does the ZIP Dimension Belong?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Bronze: Raw ZIP values as they appear&lt;/li&gt;
&lt;li&gt;Silver: Create and maintain the ZIP dimension&lt;/li&gt;
&lt;li&gt;Gold: Join the ZIP dimension for analytics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even though ZIPs appear in Bronze data, the dimension itself is curated, so it belongs in Silver, not Bronze!&lt;br&gt;
We derive ZIPs from the Bronze Delta table, not directly from raw files.&lt;/p&gt;

&lt;h2&gt;
  
  
  Zip_dim builder
&lt;/h2&gt;

&lt;p&gt;Step 1: Create the schema for the zip_dim table&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;%sql&lt;br&gt;
CREATE SCHEMA IF NOT EXISTS nyc_taxi.raw;&lt;br&gt;
CREATE TABLE IF NOT EXISTS nyc_taxi.raw.zip_dim (&lt;br&gt;
    zip_code INT,&lt;br&gt;
    state STRING,&lt;br&gt;
    region STRING&lt;br&gt;
)&lt;br&gt;
USING DELTA;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Step 2: Read the unique and valid list of ZIPs (both pickup and dropoff) from the bronze data.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;from pyspark.sql.functions import *&lt;br&gt;
zip_stream = (&lt;br&gt;
    spark.readStream&lt;br&gt;
         .table("nyc_taxi.bronze.taxi_trips")&lt;br&gt;
         .selectExpr("pickup_zip as zip")&lt;br&gt;
         .union(&lt;br&gt;
             spark.readStream&lt;br&gt;
                  .table("nyc_taxi.bronze.taxi_trips")&lt;br&gt;
                  .selectExpr("dropoff_zip as zip")&lt;br&gt;
         )&lt;br&gt;
         .where("zip IS NOT NULL")&lt;br&gt;
         .dropDuplicates(["zip"])&lt;br&gt;
)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Step 3: Assign simulated metadata to the ZIP values, mimicking actual metadata seeding.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;def upsert_zip_dim(batch_df, batch_id):&lt;br&gt;
    batch_df.createOrReplaceTempView("zip_updates")&lt;br&gt;
    spark.sql("""&lt;br&gt;
      MERGE INTO nyc_taxi.raw.zip_dim t&lt;br&gt;
      USING (&lt;br&gt;
        SELECT&lt;br&gt;
          CAST(zip AS INT) AS zip_code,&lt;br&gt;
          CASE&lt;br&gt;
            WHEN zip BETWEEN 10001 AND 10282 THEN 'NY'&lt;br&gt;
            WHEN zip BETWEEN 11201 AND 11256 THEN 'US'&lt;br&gt;
            WHEN zip BETWEEN 10451 AND 10475 THEN 'IN'&lt;br&gt;
            WHEN zip BETWEEN 10301 AND 10314 THEN 'AD'&lt;br&gt;
            ELSE 'SA'&lt;br&gt;
          END AS state,&lt;br&gt;
          CASE&lt;br&gt;
            WHEN zip BETWEEN 10001 AND 10282 THEN 'Manhattan'&lt;br&gt;
            WHEN zip BETWEEN 11201 AND 11256 THEN 'Brooklyn'&lt;br&gt;
            WHEN zip BETWEEN 10451 AND 10475 THEN 'Bronx'&lt;br&gt;
            WHEN zip BETWEEN 10301 AND 10314 THEN 'Staten Island'&lt;br&gt;
            ELSE 'Queens'&lt;br&gt;
          END AS region&lt;br&gt;
        FROM zip_updates&lt;br&gt;
      ) s&lt;br&gt;
      ON t.zip_code = s.zip_code&lt;br&gt;
      WHEN NOT MATCHED THEN INSERT *&lt;br&gt;
    """)&lt;/p&gt;
&lt;/blockquote&gt;
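&lt;p&gt;The CASE expressions above can also be written as a small Python function, which is handy for unit-testing the simulated mapping before wiring it into the MERGE (the function name is illustrative):&lt;/p&gt;

```python
def zip_metadata(zip_code: int):
    """Mirror the SQL CASE logic: map a ZIP to its simulated (state, region)."""
    ranges = [
        ((10001, 10282), ("NY", "Manhattan")),
        ((11201, 11256), ("US", "Brooklyn")),
        ((10451, 10475), ("IN", "Bronx")),
        ((10301, 10314), ("AD", "Staten Island")),
    ]
    for (lo, hi), meta in ranges:
        if hi >= zip_code >= lo:
            return meta
    return ("SA", "Queens")  # ELSE branch of the CASE expressions

print(zip_metadata(10010))
print(zip_metadata(99999))
```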

&lt;p&gt;Step 4: Populate the nyc_taxi.raw.zip_dim delta table with ZIP metadata via batch processing.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;(&lt;br&gt;
    zip_stream.writeStream&lt;br&gt;
        .foreachBatch(upsert_zip_dim)&lt;br&gt;
        .option("checkpointLocation", "/Volumes/nyc_taxi/infra/checkpoints/raw/zip_dim_data")&lt;br&gt;
        .trigger(availableNow=True)&lt;br&gt;
        .start()&lt;br&gt;
)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The ZIP dimension table nyc_taxi.raw.zip_dim is now ready.&lt;/p&gt;

&lt;p&gt;Happy learning!&lt;/p&gt;

</description>
      <category>analytics</category>
      <category>database</category>
      <category>dataengineering</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Part 4: Building the Bronze Layer with Auto Loader and Delta Lake</title>
      <dc:creator>Nithyalakshmi Kamalakkannan</dc:creator>
      <pubDate>Fri, 02 Jan 2026 10:49:22 +0000</pubDate>
      <link>https://dev.to/ktnl/part-4-building-the-bronze-layer-with-auto-loader-and-delta-lake-31ih</link>
      <guid>https://dev.to/ktnl/part-4-building-the-bronze-layer-with-auto-loader-and-delta-lake-31ih</guid>
      <description>&lt;p&gt;The Bronze layer is the foundation of the entire streaming architecture. Its role is to ingest data exactly as it arrives and store it durably. Possibly by adding some timestamps on when the event arrived.&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating schemas and volumes
&lt;/h2&gt;

&lt;p&gt;We create the catalog, the bronze schema, and the volumes required to store Auto Loader metadata and the various checkpoints in Databricks.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;%sql&lt;br&gt;
CREATE CATALOG IF NOT EXISTS nyc_taxi;&lt;br&gt;
CREATE SCHEMA IF NOT EXISTS nyc_taxi.bronze;&lt;br&gt;
CREATE SCHEMA IF NOT EXISTS nyc_taxi.infra;&lt;br&gt;
CREATE VOLUME IF NOT EXISTS nyc_taxi.infra.autoloader;&lt;br&gt;
CREATE VOLUME IF NOT EXISTS nyc_taxi.infra.checkpoints;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Using Auto loader for Bronze ingestion
&lt;/h2&gt;

&lt;p&gt;Databricks Auto Loader is purpose-built for scalable file-based ingestion.&lt;br&gt;
In this project, Auto Loader continuously watches the directory created in Part 2 and processes only newly arrived files.&lt;/p&gt;

&lt;p&gt;We start by defining a streaming DataFrame using Auto Loader:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;bronze_stream = (&lt;br&gt;
    spark.readStream&lt;br&gt;
        .format("cloudFiles")&lt;br&gt;
        .option("cloudFiles.format", "json")&lt;br&gt;
        .option("cloudFiles.inferColumnTypes", "true")&lt;br&gt;
        .option("cloudFiles.schemaLocation", "/Volumes/nyc_taxi/infra/autoloader/metadata")&lt;br&gt;
        .load("/tmp/taxi_stream_input")&lt;br&gt;
)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;code&gt;cloudFiles.format&lt;/code&gt; - Specifies the underlying file format (json in this case).&lt;/p&gt;

&lt;p&gt;&lt;code&gt;cloudFiles.inferColumnTypes&lt;/code&gt; - Automatically infers data types instead of defaulting to strings.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;cloudFiles.schemaLocation&lt;/code&gt; - Stores inferred schemas and supports schema evolution across runs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Writing to Bronze Delta Tables
&lt;/h2&gt;

&lt;p&gt;Next, we write the streaming data into a Bronze Delta table.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;(&lt;br&gt;
    bronze_stream.writeStream&lt;br&gt;
        .format("delta")&lt;br&gt;
        .outputMode("append")&lt;br&gt;
        .option("checkpointLocation", "/Volumes/nyc_taxi/infra/checkpoints/bronze/taxi_trips")&lt;br&gt;
        .trigger(availableNow=True)&lt;br&gt;
        .toTable("nyc_taxi.bronze.taxi_trips")&lt;br&gt;
)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Using Delta tables in the Bronze layer provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ACID guarantees on streaming writes&lt;/li&gt;
&lt;li&gt;Schema enforcement and evolution&lt;/li&gt;
&lt;li&gt;Reliable checkpointing&lt;/li&gt;
&lt;li&gt;Compatibility with batch and streaming reads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This pipeline uses &lt;code&gt;.trigger(availableNow=True)&lt;/code&gt;, which processes all available files and stops automatically when finished. It can be scheduled to run periodically, reducing cost compared to always-on streaming.&lt;/p&gt;

&lt;p&gt;In practice, this behaves like incremental batch processing — ideal for cloud storage–based ingestion.&lt;/p&gt;

&lt;p&gt;The raw data is now ingested into bronze delta tables using Auto Loader and is ready to be refined further in the silver layer.&lt;/p&gt;

&lt;p&gt;Happy learning!&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>dataengineering</category>
      <category>tutorial</category>
      <category>sql</category>
    </item>
    <item>
      <title>Part 3: Simulating Real-Time Streaming Data Using Databricks Sample Datasets</title>
      <dc:creator>Nithyalakshmi Kamalakkannan</dc:creator>
      <pubDate>Fri, 02 Jan 2026 10:48:51 +0000</pubDate>
      <link>https://dev.to/ktnl/part-3-simulating-real-time-streaming-data-using-databricks-sample-datasets-5be3</link>
      <guid>https://dev.to/ktnl/part-3-simulating-real-time-streaming-data-using-databricks-sample-datasets-5be3</guid>
      <description>&lt;p&gt;We use the Databricks NYC Taxi sample dataset, available by default in Databricks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fag2rmw905ssdxvq6507j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fag2rmw905ssdxvq6507j.png" alt=" " width="800" height="467"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This dataset is ideal because it includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Event timestamps (tpep_pickup_datetime)&lt;/li&gt;
&lt;li&gt;Numeric measures (fare_amount, trip_distance)&lt;/li&gt;
&lt;li&gt;Location attributes (pickup_zip, dropoff_zip)&lt;/li&gt;
&lt;li&gt;Sufficient data volume to observe performance and shuffle behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Although the dataset is static, we will convert it into a streaming source.&lt;/p&gt;

&lt;h2&gt;
  
  
  Converting Static Data into a Streaming Source
&lt;/h2&gt;

&lt;p&gt;Step 1: Read the Sample Dataset&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;df = spark.table("samples.nyctaxi.trips")&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;At this point, the data is a normal batch DataFrame.&lt;/p&gt;

&lt;p&gt;Step 2: Write Data as JSON Files&lt;/p&gt;

&lt;p&gt;To simulate streaming input, we write the dataset as JSON files to a directory:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;(&lt;br&gt;
    df.write&lt;br&gt;
      .mode("overwrite")&lt;br&gt;
      .format("json")&lt;br&gt;
      .save("/tmp/taxi_stream_input")&lt;br&gt;
)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This writes files to DBFS (Databricks File System; think of it as virtual storage provided by Databricks), overwriting any files previously present in "/tmp/taxi_stream_input". Spark creates multiple JSON files, and each file represents a batch of incoming events.&lt;/p&gt;
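&lt;p&gt;Outside Databricks, you can mimic this file-based arrival pattern with the standard library alone: drop newline-delimited JSON files into a directory and treat each new file as a batch (the paths and field values here are illustrative):&lt;/p&gt;

```python
import json
import tempfile
from pathlib import Path

# Illustrative landing directory, standing in for /tmp/taxi_stream_input.
landing = Path(tempfile.mkdtemp())

batch = [
    {"tpep_pickup_datetime": "2026-01-02T10:00:00", "fare_amount": 12.5},
    {"tpep_pickup_datetime": "2026-01-02T10:05:00", "fare_amount": 8.0},
]

# Each file written here represents one batch of incoming events.
out = landing / "part-0001.json"
out.write_text("\n".join(json.dumps(row) for row in batch))

print(sorted(p.name for p in landing.glob("*.json")))
```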

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fviwi6c3x754dw4fp9lk2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fviwi6c3x754dw4fp9lk2.png" alt=" " width="800" height="313"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, the data is available as file storage for us to read and start the streaming!&lt;/p&gt;

&lt;p&gt;Happy learning!&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Part 2: Project Architecture</title>
      <dc:creator>Nithyalakshmi Kamalakkannan</dc:creator>
      <pubDate>Fri, 02 Jan 2026 10:48:26 +0000</pubDate>
      <link>https://dev.to/ktnl/part-2-project-architecture-1d2a</link>
      <guid>https://dev.to/ktnl/part-2-project-architecture-1d2a</guid>
      <description>&lt;p&gt;The goal is not just to “make streaming work”, but to design a maintainable and observable streaming platform.&lt;/p&gt;

&lt;p&gt;At a high level, the platform follows a Medallion Architecture, which organizes data into progressive layers of refinement:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bronze: Raw, append-only streaming ingestion&lt;/li&gt;
&lt;li&gt;Silver: Cleaned, enriched, normalized data&lt;/li&gt;
&lt;li&gt;Gold: Aggregated, business-ready metrics&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Architectural flow
&lt;/h2&gt;

&lt;p&gt;The project outlines an end-to-end real-time data pipeline built on Databricks, following the Medallion Architecture pattern. Each stage progressively refines data from raw events into business-ready insights.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fic3ickjvmgwgitukm5nq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fic3ickjvmgwgitukm5nq.png" alt=" " width="336" height="1050"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Databricks Sample Data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At the top of the pipeline, Databricks-provided sample datasets (in this case, NYC Taxi trip data) act as the data source. These datasets contain realistic event timestamps, numeric measures, and location attributes, making them suitable for simulating real-world streaming use cases without requiring external systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Simulated Streaming Input&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Because the sample data is static by default, it is first written incrementally as files into cloud storage (DBFS). This step simulates real-time data arrival, mimicking how production systems often receive data from upstream applications, IoT devices, or operational databases via files landing in object storage.&lt;/p&gt;

&lt;p&gt;New files arriving in this directory represent new streaming events.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqc7h1k3vev364yjl517p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqc7h1k3vev364yjl517p.png" alt=" " width="800" height="237"&gt;&lt;/a&gt;&lt;/p&gt;
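&lt;p&gt;The file-landing step above can be sketched in plain Python. This is an illustrative simulation, not the project's actual code: the function name, file naming, and batch size are my own choices; it simply writes records as small JSON-lines files, the way batches of events land in DBFS or object storage.&lt;/p&gt;

```python
import json
import time
from pathlib import Path

def land_batches(records, landing_dir, batch_size=2, delay_s=0.0):
    """Write records into landing_dir as small JSON-lines files.

    Illustrative only: each file that appears in the directory plays
    the role of a newly arrived batch of streaming events.
    """
    landing = Path(landing_dir)
    landing.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(0, len(records), batch_size):
        batch = records[i:i + batch_size]
        path = landing / f"batch_{i // batch_size:05d}.json"
        path.write_text("\n".join(json.dumps(r) for r in batch))
        paths.append(path)
        time.sleep(delay_s)  # optional pause to mimic events arriving over time
    return paths
```

A file-based streaming source would then discover these files incrementally, one batch at a time.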

&lt;p&gt;&lt;strong&gt;Auto Loader&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Databricks Auto Loader continuously monitors the input directory, efficiently detects newly arrived files, and provides schema inference and evolution.&lt;br&gt;
Auto Loader integrates natively with Spark Structured Streaming, allowing file-based ingestion to behave like a true streaming source.&lt;/p&gt;
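&lt;p&gt;A typical Auto Loader read-and-write looks like the following configuration sketch. It assumes an active Databricks &lt;code&gt;spark&lt;/code&gt; session; the paths, file format, and table name are placeholders, not taken from the project's code.&lt;/p&gt;

```python
# Illustrative Auto Loader ingestion into a Bronze table.
# All paths and the table name are placeholders.
bronze_stream = (
    spark.readStream
         .format("cloudFiles")                                   # Auto Loader source
         .option("cloudFiles.format", "json")                    # format of the landing files
         .option("cloudFiles.schemaLocation", "/tmp/uc/schema")  # where the inferred schema is tracked
         .load("/tmp/landing/trips")                             # monitored input directory
)

(bronze_stream.writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/uc/checkpoints/bronze")  # exactly-once bookkeeping
    .trigger(availableNow=True)                                  # process what has arrived, then stop
    .toTable("bronze_trips"))
```

The checkpoint location is what lets the stream resume incrementally instead of reprocessing old files.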

&lt;p&gt;&lt;strong&gt;Bronze Delta Tables (Raw Layer)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Bronze layer stores raw, append-only data exactly as it arrives from the source, with minimal transformation.&lt;br&gt;
This layer ensures that raw data is always preserved, enabling replay, debugging, and full reprocessing if needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Silver Delta Tables (Cleaned &amp;amp; Enriched Layer)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the Silver layer, data is cleansed, standardized, and enriched through steps such as&lt;br&gt;
date-type normalization, filtering out invalid or malformed records, and joining with dimension tables (for example, ZIP code to region mappings).&lt;/p&gt;

&lt;p&gt;Silver tables represent trusted, analytics-ready data that can be reused across multiple downstream use cases.&lt;/p&gt;
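&lt;p&gt;The cleansing-and-enrichment logic can be illustrated with a plain-Python sketch. The field names (&lt;code&gt;fare_amount&lt;/code&gt;, &lt;code&gt;pickup_zip&lt;/code&gt;) are assumptions based on the NYC Taxi dataset, not the project's actual schema, and in the real pipeline this work is done by Spark transformations rather than a Python loop.&lt;/p&gt;

```python
def to_silver(bronze_rows, zip_to_region):
    """Cleanse and enrich raw rows the way the Silver layer does (illustrative)."""
    silver = []
    for row in bronze_rows:
        fare = row.get("fare_amount")
        zip_code = row.get("pickup_zip")
        if fare is None or fare < 0:   # drop invalid or malformed records
            continue
        enriched = dict(row)
        # join with a ZIP-to-region dimension; unmatched ZIPs get a default
        enriched["region"] = zip_to_region.get(zip_code, "unknown")
        silver.append(enriched)
    return silver
```

The same shape of logic (filter, then dimension join) reappears no matter how large the data gets.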

&lt;p&gt;&lt;strong&gt;Gold Delta Tables (Business Layer)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Gold layer contains aggregated, business-focused datasets designed for analytics and reporting. For example,&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hourly trip counts by region&lt;/li&gt;
&lt;li&gt;Revenue metrics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This layer often uses event-time processing, windowed aggregations, and watermarking to handle late-arriving data while keeping state bounded.&lt;/p&gt;
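&lt;p&gt;To make the watermarking idea concrete, here is a minimal plain-Python sketch of event-time hourly counts with a watermark. The 30-minute delay and the region field are illustrative choices, not the project's configuration; Spark's &lt;code&gt;withWatermark&lt;/code&gt; does the same bookkeeping internally.&lt;/p&gt;

```python
from collections import defaultdict
from datetime import datetime, timedelta

def hourly_counts(events, watermark_delay=timedelta(minutes=30)):
    """Event-time hourly counts per region, with a watermark (illustrative).

    The watermark is the max event time seen so far minus a delay; events
    older than it are treated as too late and dropped, keeping state bounded.
    """
    counts = defaultdict(int)
    max_event_time = None
    for ts, region in events:
        if max_event_time is None or ts > max_event_time:
            max_event_time = ts
        if ts < max_event_time - watermark_delay:
            continue  # late beyond the watermark: discard
        window_start = ts.replace(minute=0, second=0, microsecond=0)
        counts[(window_start, region)] += 1
    return dict(counts)
```

Slightly late events (within the delay) still land in the correct hourly window; only events beyond the watermark are dropped.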

&lt;p&gt;&lt;strong&gt;Databricks SQL Dashboards&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Finally, Gold tables are consumed by Databricks SQL Dashboards. As new data flows through the pipeline, dashboards update automatically, closing the loop from raw events to actionable insights.&lt;/p&gt;

&lt;p&gt;Together, these components form a robust, scalable, and maintainable real-time data platform.&lt;/p&gt;

&lt;p&gt;Happy learning!&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>data</category>
      <category>architecture</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Part 1: Creating Databricks Workspace and Enabling Unity Catalog</title>
      <dc:creator>Nithyalakshmi Kamalakkannan</dc:creator>
      <pubDate>Fri, 02 Jan 2026 10:48:10 +0000</pubDate>
      <link>https://dev.to/ktnl/part-1-creating-databricks-workspace-and-enabling-unity-catalog-3e44</link>
      <guid>https://dev.to/ktnl/part-1-creating-databricks-workspace-and-enabling-unity-catalog-3e44</guid>
<description>&lt;p&gt;In Databricks, Unity Catalog provides a secure, governed foundation for our data platform by centralizing metadata, access control, and storage governance across workspaces.&lt;/p&gt;

&lt;p&gt;Unity Catalog acts like a control plane for modern Databricks platforms, offering the benefits below out of the box.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Centralized metastore for all tables and views&lt;/li&gt;
&lt;li&gt;Fine-grained access control (catalog, schema, table, column)&lt;/li&gt;
&lt;li&gt;Data lineage and auditing&lt;/li&gt;
&lt;li&gt;Secure multi-workspace governance&lt;/li&gt;
&lt;li&gt;Clear separation between compute and storage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;u&gt;Step 1: Create an Azure Databricks Workspace&lt;/u&gt;&lt;/p&gt;

&lt;p&gt;Login to your account in Azure Portal &amp;gt; Create a Resource &amp;gt; Search for Azure Databricks&lt;/p&gt;

&lt;p&gt;Provide the required details like resource group, workspace name, region, etc.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Step 2: Create Azure Data Lake Storage (ADLS Gen2)&lt;/u&gt;&lt;/p&gt;

&lt;p&gt;Unity Catalog requires a cloud storage location to store managed tables and metadata.&lt;/p&gt;

&lt;p&gt;Azure Portal &amp;gt; Create a Resource &amp;gt; Search for Storage account&lt;/p&gt;

&lt;p&gt;Create an ADLS Gen2 storage account with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hierarchical namespace enabled&lt;/li&gt;
&lt;li&gt;Secure networking (private endpoints if required)&lt;/li&gt;
&lt;li&gt;A container dedicated to analytics (e.g. datalake)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This storage will physically hold:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Parquet data files&lt;/li&gt;
&lt;li&gt;_delta_log transaction logs&lt;/li&gt;
&lt;li&gt;Deletion vectors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;u&gt;Step 3: Configure Access Using Azure Managed Identity or Service Principal&lt;/u&gt;&lt;/p&gt;

&lt;p&gt;Databricks must be granted secure access to ADLS.&lt;/p&gt;

&lt;p&gt;In the storage account, open Access Control (IAM) &amp;gt; Add role assignment, and grant the Storage Blob Data Contributor role to the Databricks managed identity (access connector) or service principal.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr8cc7pmjs350ugsnm6m7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr8cc7pmjs350ugsnm6m7.png" alt=" " width="800" height="264"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is required to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create Delta tables&lt;/li&gt;
&lt;li&gt;Manage _delta_log transactions&lt;/li&gt;
&lt;li&gt;Handle compaction and vacuum&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;u&gt;Step 4: Create the Unity Catalog Metastore&lt;/u&gt;&lt;/p&gt;

&lt;p&gt;In the Databricks Account Console:&lt;/p&gt;

&lt;p&gt;Navigate to Data &amp;gt; Metastores &amp;gt; Create a new Unity Catalog metastore&lt;/p&gt;

&lt;p&gt;Provide:&lt;/p&gt;

&lt;p&gt;Name (e.g. nyc_taxi_metastore)&lt;br&gt;
Region (must match storage)&lt;br&gt;
ADLS Gen2 storage root (e.g. &lt;code&gt;abfss://datalake@storageaccount.dfs.core.windows.net/uc&lt;/code&gt;)&lt;/p&gt;

&lt;p&gt;This location becomes the default storage root for managed tables.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmyaas6yb5hoxjjjbt3jc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmyaas6yb5hoxjjjbt3jc.png" alt=" " width="800" height="317"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Step 5: Attach the Metastore to the Databricks Workspace&lt;/u&gt;&lt;/p&gt;

&lt;p&gt;Once the metastore is created,&lt;/p&gt;

&lt;p&gt;Navigate to the metastore &amp;gt; Click Assign to workspace &amp;gt; Select the Databricks workspace created earlier&lt;/p&gt;

&lt;p&gt;With all this set up, our data platform foundation is now laid!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Points to remember&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All catalogs, schemas, and tables are governed centrally&lt;/li&gt;
&lt;li&gt;Multiple workspaces can share the same metastore, while one workspace cannot have multiple metastores.&lt;/li&gt;
&lt;li&gt;Unity Catalog is account-level, not workspace-level.&lt;/li&gt;
&lt;/ul&gt;
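&lt;p&gt;Once the metastore is attached, you can exercise the governance model from a notebook. A minimal sketch, assuming an active &lt;code&gt;spark&lt;/code&gt; session; the catalog, schema, and group names are illustrative placeholders, not from this setup:&lt;/p&gt;

```python
# Illustrative only: catalog/schema/group names are placeholders.
spark.sql("CREATE CATALOG IF NOT EXISTS nyc_taxi")
spark.sql("CREATE SCHEMA IF NOT EXISTS nyc_taxi.bronze")
# Unity Catalog privilege model: grant a group the right to use the catalog
spark.sql("GRANT USE CATALOG ON CATALOG nyc_taxi TO `data_engineers`")
```

Everything created this way is governed centrally, per the points above.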

&lt;p&gt;Alright! It’s time to get our hands dirty and do some Spark coding!&lt;/p&gt;

&lt;p&gt;Happy learning!&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>azure</category>
      <category>dataengineering</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>End-to-End Real-Time Data Engineering on Databricks Using Spark Structured Streaming and Delta Lake</title>
      <dc:creator>Nithyalakshmi Kamalakkannan</dc:creator>
      <pubDate>Fri, 02 Jan 2026 10:47:44 +0000</pubDate>
      <link>https://dev.to/ktnl/end-to-end-real-time-data-engineering-on-databricks-using-spark-structured-streaming-and-delta-lake-207k</link>
      <guid>https://dev.to/ktnl/end-to-end-real-time-data-engineering-on-databricks-using-spark-structured-streaming-and-delta-lake-207k</guid>
<description>&lt;p&gt;Simple batch processing and static dashboards have had their day!&lt;/p&gt;

&lt;p&gt;Data platforms must ingest continuously arriving data, gracefully handle late and out-of-order events, scale efficiently, and still deliver reliable, business-ready metrics in real or near-real time!&lt;/p&gt;

&lt;p&gt;In this blog series, we shall explore how to build an end-to-end real time streaming data platform on Databricks.&lt;/p&gt;

&lt;p&gt;As a newcomer to streaming systems, I have applied what I have learned about Spark Structured Streaming, Delta Lake, Auto Loader, and the Medallion Architecture to design and implement this solution. &lt;/p&gt;

&lt;p&gt;This will be a small, hands-on data engineering project to get practical experience on the Databricks platform, using the sample NYC Taxi Trips dataset. The intention is to have something to play around with and to apply what I have read in theory. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1zxlco4l7qjf9fz6qqxy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1zxlco4l7qjf9fz6qqxy.png" alt=" " width="800" height="515"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The project ingests data from file storage using Auto Loader into Bronze Delta tables, reads from bronze via Spark Structured Streaming, cleanses and normalizes the data into Silver Delta tables using spark, and applies aggregations to produce Gold Delta tables. The pipeline is orchestrated using Databricks Workflows, with insights visualized through dashboards built on queries against the Gold layer.&lt;/p&gt;

&lt;p&gt;I have primarily used Databricks serverless compute, so I did not explicitly create or manage clusters. Feel free to create your own clusters and run the same Spark workloads to gain deeper insight into execution behavior, resource utilization, and performance characteristics using the Spark UI.&lt;/p&gt;

&lt;p&gt;I have linked the source code Git repo in the last post of this series. Keep scrolling, and your feedback is most welcome. &lt;/p&gt;

&lt;p&gt;Happy learning!!&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>handson</category>
      <category>realtimeproject</category>
      <category>spark</category>
    </item>
    <item>
      <title>Kubernetes (K8s) Command Cheat Sheet</title>
      <dc:creator>Nithyalakshmi Kamalakkannan</dc:creator>
      <pubDate>Mon, 28 Apr 2025 03:35:18 +0000</pubDate>
      <link>https://dev.to/ktnl/kubernetes-k8s-command-cheat-sheet-291h</link>
      <guid>https://dev.to/ktnl/kubernetes-k8s-command-cheat-sheet-291h</guid>
      <description>&lt;p&gt;Whether you're wrangling microservices in production or just tired of Googling the same five kubectl commands, this blog is for you.&lt;/p&gt;

&lt;p&gt;We will go beyond the copy &amp;amp; paste to give you real command-line hands-on examples and some lighter dives to understand why things work the way they do.&lt;/p&gt;

&lt;p&gt;Let’s level up your K8s game. 🚀&lt;/p&gt;

&lt;h3&gt;
  
  
  Quick Refresher: What Kubernetes Is
&lt;/h3&gt;

&lt;p&gt;Kubernetes is a container orchestration system that helps you manage applications across clusters of machines. It handles their scheduling, scaling, networking, and rollouts. You tell it what you want, and it figures out how to get there! &lt;/p&gt;

&lt;p&gt;Before we jump into commands, it helps to know how kubectl is structured—it’ll make everything click faster, especially as you start scripting or working with multiple clusters. Please do read the other parts of this series to get a grasp of the underlying architecture.&lt;/p&gt;

&lt;p&gt;Enough of theory!&lt;br&gt;
Most kubectl commands follow this pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl [operation] [resource] [name] [flags]

For example:

kubectl get pods -n dev

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here’s what’s happening:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;get is the operation&lt;br&gt;
pods is the resource type&lt;br&gt;
-n dev tells kubectl to only look in the dev namespace&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Now, here's where flags come into play:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;-n &amp;lt;namespace&amp;gt; (or --namespace=&amp;lt;namespace&amp;gt;) lets you target a specific namespace.&lt;br&gt;
-A (short for --all-namespaces) will show results across every namespace.&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Compare these two:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get pods -n dev     # Just pods in the 'dev' namespace
kubectl get pods -A         # All pods in all namespaces
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Why this matters&lt;/em&gt;: Many kubectl commands default to the current namespace (often default), so if you don’t specify -n or set your namespace context, you might think things are missing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bonus Tip:&lt;/strong&gt; To avoid typing &lt;code&gt;-n&lt;/code&gt; all the time, you can set your namespace context like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl config set-context --current --namespace=dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now all kubectl commands will assume dev unless you override it.&lt;br&gt;
Try out the &lt;strong&gt;kubectx&lt;/strong&gt; and &lt;strong&gt;kubens&lt;/strong&gt; tools after checking support to your platform - they make life much easier for context and namespace switching and many more!&lt;/p&gt;

&lt;p&gt;Hey wait, context?? &lt;br&gt;
Relax! We will get to contexts in a few minutes!!&lt;/p&gt;

&lt;p&gt;Cool? Cool. Now let’s hit the terminal.&lt;/p&gt;
&lt;h4&gt;
  
  
  Essentials
&lt;/h4&gt;

&lt;p&gt;These are the bread-and-butter commands for interacting with your K8s cluster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get all                 # Get pods, services, deployments, etc.
kubectl get pods                # List all pods in the current namespace
kubectl describe pod &amp;lt;name&amp;gt;     # Detailed info about a pod
kubectl delete pod &amp;lt;name&amp;gt;       # Delete a pod
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, keep playing with permutation and combination for other resource types with these operations!&lt;/p&gt;

&lt;h4&gt;
  
  
  Multi-Cluster Management
&lt;/h4&gt;

&lt;p&gt;Working across multiple clusters? &lt;br&gt;
Here's where you will need kubectl config commands the most. These commands help you manage different contexts, namespaces, and clusters seamlessly.&lt;/p&gt;

&lt;p&gt;If you’re using Azure Kubernetes Service (AKS), you’ll need to configure your kubectl to authenticate and connect to the correct AKS cluster.&lt;/p&gt;

&lt;p&gt;A Kubernetes context is like a shortcut or profile that tells kubectl where to send commands (your cluster) and how to authenticate (user creds).&lt;/p&gt;

&lt;p&gt;Get a quick local setup for your projects -&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;az login  # Login to Azure
az account set -s "&amp;lt;subscription-id&amp;gt;"  # Set subscription context
az aks get-credentials --name &amp;lt;aks-cluster-name&amp;gt; --resource-group &amp;lt;resource-group&amp;gt;  # Get credentials for AKS
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;kubectl config&lt;/code&gt; Commands&lt;br&gt;
To manage multiple clusters, here are some useful commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# List all available clusters
kubectl config get-contexts             

# Get the current AKS cluster connected with kubectl (e.g., dev/uat/prod)
kubectl config current-context          

# Switch between clusters (dev/uat/prod)
kubectl config use-context &amp;lt;cluster-name&amp;gt;                     
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Deployments and Scaling
&lt;/h4&gt;

&lt;p&gt;Managing deployments and scaling is where Kubernetes really shines. Use the following commands to control your apps in the cluster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl create deployment &amp;lt;your-deployment&amp;gt; --image=nginx
kubectl expose deployment &amp;lt;your-deployment&amp;gt; --port=80 --type=NodePort
kubectl scale deployment &amp;lt;your-deployment&amp;gt; --replicas=3
kubectl rollout status deployment/&amp;lt;your-deployment&amp;gt;
kubectl rollout undo deployment/&amp;lt;your-deployment&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Logs, Exec, and Debugging
&lt;/h4&gt;

&lt;p&gt;When things go wrong, you need to dig deep. Here are some useful ways to get the logs, interact with your containers, and troubleshoot effectively.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl logs &amp;lt;pod-name&amp;gt; --tail &amp;lt;nr-of-lines&amp;gt;        # Shows specific number of lines of the log
kubectl logs &amp;lt;pod-name&amp;gt; | findstr &amp;lt;search string&amp;gt;   # Shows log lines matching the string (findstr is Windows; use grep on Linux/macOS)
kubectl logs &amp;lt;pod-name&amp;gt; --timestamps=true           # Logs of a specific pod with timestamps
kubectl logs &amp;lt;pod-name&amp;gt; --since=1h                  # Logs for a specific duration (1 hour here)
kubectl logs &amp;lt;pod-name&amp;gt; --follow                    # Continuously shows the logs (Ctrl + C to exit)
kubectl logs &amp;lt;pod-name&amp;gt; --previous                  # Logs for a previous instantiation of a container
kubectl logs &amp;lt;pod-name&amp;gt; &amp;gt; &amp;lt;log-file-name&amp;gt;           # Write logs to a file
kubectl logs &amp;lt;pod-name&amp;gt; -c &amp;lt;container-name&amp;gt;         # Logs from a specific container in a multi-container pod
kubectl logs -l app=&amp;lt;label-value&amp;gt;                   # Logs from all pods having a common label
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When debugging, use &lt;code&gt;--follow&lt;/code&gt; for real-time monitoring; &lt;code&gt;--previous&lt;/code&gt; helps track down issues after a pod has restarted due to an error.&lt;/p&gt;

&lt;h4&gt;
  
  
  Getting Inside a Pod
&lt;/h4&gt;

&lt;p&gt;Need to jump into a running pod? Here's the command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl exec -it &amp;lt;pod-name&amp;gt; cmd.exe 

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For Windows containers use &lt;code&gt;cmd.exe&lt;/code&gt;, otherwise use &lt;code&gt;/bin/bash&lt;/code&gt; or &lt;code&gt;sh&lt;/code&gt; for Linux containers based on your preference.&lt;/p&gt;

&lt;h4&gt;
  
  
  Metrics &amp;amp; Resource Monitoring
&lt;/h4&gt;

&lt;p&gt;Stay on top of your cluster’s health with kubectl top to see resource usage!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl top pod &amp;lt;pod-name&amp;gt; --containers         
kubectl top pod &amp;lt;pod-name&amp;gt; --sort-by=cpu        
kubectl top node &amp;lt;node-name&amp;gt;                    
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  ConfigMaps &amp;amp; Secrets
&lt;/h4&gt;

&lt;p&gt;Handle sensitive data with kubectl commands for ConfigMaps and Secrets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl create configmap my-config --from-literal=env=prod
kubectl get configmaps
kubectl describe configmap my-config

kubectl get secrets
kubectl describe secret my-secret
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Final Thoughts
&lt;/h3&gt;

&lt;p&gt;Kubernetes can feel like magic—until it breaks. Then it’s all about knowing the right commands, fast!&lt;/p&gt;

&lt;p&gt;This cheat sheet aims to bridge the gap between just getting things working by Googling and really understanding how the pieces fit together. The more you use these commands, the more comfortable and handy they become!&lt;/p&gt;

&lt;p&gt;Got a favorite kubectl command or trick to share? Drop it in the comments...&lt;/p&gt;

&lt;p&gt;Happy learning! &lt;/p&gt;

</description>
      <category>k8s</category>
      <category>cheatsheet</category>
      <category>devops</category>
    </item>
    <item>
      <title>Sneak peek into Alibaba Cloud</title>
      <dc:creator>Nithyalakshmi Kamalakkannan</dc:creator>
      <pubDate>Wed, 06 Jul 2022 11:59:49 +0000</pubDate>
      <link>https://dev.to/ktnl/sneak-peek-into-alibaba-cloud-1b5a</link>
      <guid>https://dev.to/ktnl/sneak-peek-into-alibaba-cloud-1b5a</guid>
      <description>&lt;p&gt;&lt;strong&gt;Alibaba Cloud&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The cloud computing service provider currently operates in 27 data center regions and 84 global availability zones. It is primarily focused on mainland China and other Asia-Pacific regions, with a smaller number of regions in the U.S. and the European Union.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gq1ATWJm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3p38xve82z5jg3teh8a6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gq1ATWJm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3p38xve82z5jg3teh8a6.png" alt="Image description" width="880" height="389"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Products &amp;amp; Services&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Alibaba provides an expanding range of high-performance cloud products including large-scale computing, storage resources, and Big Data processing capabilities for users around the world. Highlighting few of the most common services offered in compute, storage, database, networking aspects here. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compute&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Alibaba Cloud Elastic Compute Service (ECS) provides fast memory and the latest CPUs to power cloud applications and achieve faster results with low latency along with capability to scale up or down based on real-time demands.&lt;br&gt;
Using the next-generation virtualization technology independently developed by Alibaba Cloud, ECS Bare Metal Instance features both the elasticity of a virtual server and the high-performance and comprehensive features of a physical server. This enables you to retain the elasticity capability of common ECS while delivering the same user experience as physical servers.&lt;br&gt;
Alibaba Cloud Container Service for Kubernetes (ACK) integrates virtualization, storage, networking, and security capabilities to deploy applications in high-performance and scalable containers and provides full lifecycle management of enterprise-class containerized applications. Alibaba also offers Container Registry, a platform to manage images throughout the image life cycle with easy image permission management. This service simplifies the creation and maintenance of the image registry and supports image management in multiple regions. Combined with other cloud services such as Container Service, Container Registry provides an optimized solution for using Docker in the cloud.&lt;br&gt;
Alibaba Cloud Function Compute is a fully managed, event-driven compute service focused on writing and uploading code without having to manage infrastructure such as servers. No fees are incurred for up to 1,000,000 invocations and 400,000 CU-second compute resources per month.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Storage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;High volumes of any type of unstructured data, such as image, audio, and video files, can be stored in the cloud with encryption and high availability using Alibaba's Object Storage Service (OSS). These objects come with configurations that can be modified to meet region, access control, and storage class requirements. Alibaba offers four OSS storage classes: Standard, Infrequent Access (IA), Archive, and Cold Archive. &lt;br&gt;
Alibaba provides Elastic Block Storage (EBS) devices for ECS instances that come with low-latency storage, random read-write capabilities, and data persistence.&lt;br&gt;
The Apsara File Storage service provides network-attached storage (NAS) for ECS instances, Elastic High-Performance Computing instances, and Container Service for Kubernetes nodes. The distributed file system offers a maximum capacity of up to 10 PB and automatically scales as files are added or removed, with encryption of data at rest and in transit. It also offers shared access, high throughput, data replication, and backup, and can be accessed via standard file access protocols.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Ch4DKg8N--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rzma0xsqydi3hkvddaco.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ch4DKg8N--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rzma0xsqydi3hkvddaco.png" alt="Image description" width="880" height="242"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Database&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Alibaba offers a relational database service, compatible with MySQL, PostgreSQL and Oracle syntaxes with a 100 TB maximum storage capacity known as the ApsaraDB for PolarDB. This has low-latency physical replication, data backup and disaster recovery. Alibaba also offers PolarDB stacks, an on-premises database management appliance.&lt;br&gt;
The NoSQL database offering supports open source MongoDB protocol. ApsaraDB for MongoDB is a document database that features automatic monitoring and scalability with architecture configurations to enable standalone instances, replica set instances and sharded cluster instances.&lt;br&gt;
Alibaba Cloud offers a low-cost self-service database migration experience that supports homogeneous and heterogeneous migration smoothly from hundreds of GBs to multiple TBs with minimal business impact. With over 400,000 databases successfully migrated to Alibaba Cloud, the Database Architect Team has a proven record of making the migration process an efficient and hassle-free journey. Data Transmission Service (DTS) migrates and synchronizes data between data storage engines, such as relational databases, NoSQL, and OLAP, with just a few clicks.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Networking &amp;amp; CDN&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Server Load Balancer (SLB) distributes network traffic across groups of backend servers, improving service capability and application availability. It functions as a reverse proxy at Layer 7 (Application Load Balancer) and provides load balancing at Layer 4 (Classic Load Balancer).&lt;br&gt;
Network Intelligence Service (NIS) monitors the health status and performance of networks, performs diagnostics and troubleshoots issues, and analyzes and measures network traffic. &lt;br&gt;
Alibaba Cloud PrivateZone provides a private DNS service based on specified VPCs. PrivateZones resolve IP addresses and manage resources within the specified VPCs; these domain names cannot be accessed over the internet or by any resource outside the specified VPCs.&lt;/p&gt;

&lt;p&gt;The complete list of products and services can be found on their official product website. &lt;br&gt;
It provides a good amount of information about each service and guidance for getting started.&lt;/p&gt;

</description>
      <category>alibaba</category>
      <category>cloud</category>
      <category>beginners</category>
    </item>
    <item>
      <title>K8s Objects - Part 3 [Service]</title>
      <dc:creator>Nithyalakshmi Kamalakkannan</dc:creator>
      <pubDate>Tue, 17 May 2022 06:19:02 +0000</pubDate>
      <link>https://dev.to/ktnl/k8s-objects-part-3-service--3la4</link>
      <guid>https://dev.to/ktnl/k8s-objects-part-3-service--3la4</guid>
      <description>&lt;p&gt;Deployment ensures that the desired number of Pods are up and running with the desired configuration at any given point of time.&lt;/p&gt;

&lt;p&gt;But when a new Pod is added (due to scaling or version changes)... We know that a Pod has an IP to reach it, and we use port forwarding to reach it from the outside world. &lt;/p&gt;

&lt;p&gt;When a Pod changes, its IP changes along with it. Given this, maintaining your application with Pods whose IPs can frequently change is a challenge. To ensure seamless communication between your application and the outside world, the K8s Service comes as the saviour!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The K8s Service&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The K8s Service is a virtual component that consists of a set of iptables rules for the cluster.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It is used to expose Pods, instead of talking to the pods you end up talking with just the service.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The service takes the responsibility of routing the traffic and/or communicating with the Pods.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Linking the service with your K8s Deployment, ReplicaSet, or Pod is simple and consistent - as usual, the match labels do this for you :)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Types of Services&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There are four types of K8s Services.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;ClusterIP&lt;/li&gt;
&lt;li&gt;NodePort&lt;/li&gt;
&lt;li&gt;LoadBalancer&lt;/li&gt;
&lt;li&gt;ExternalName&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By default, K8s creates a ClusterIP type of service. We can build different kinds of services by specifying the type in the &lt;code&gt;spec.type&lt;/code&gt; field of the service configuration file.&lt;br&gt;
Let's explore them one by one!&lt;/p&gt;

&lt;p&gt;To be able to demo the LoadBalancer type of service, I have created a cluster in Azure and will create these services there.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ClusterIP&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Exposes your service within your cluster on a cluster-internal IP. Applications can interact with other applications internally using the ClusterIP.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The Service is unreachable from outside the cluster. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;This is the default ServiceType.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Demo time!&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;clusterIp-service-demo.yml&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KHCn5WFu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rjxa6yhtfmdbxypaq7tc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KHCn5WFu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rjxa6yhtfmdbxypaq7tc.png" alt="Image description" width="578" height="1236"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This configuration file creates a deployment managing three nginx Pods and exposes them under the ClusterIP service.&lt;/p&gt;
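&lt;p&gt;For reference, a minimal sketch of such a configuration (the resource names here are illustrative, not necessarily the ones in the screenshot):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 3                  # three nginx Pods
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx             # the label the Service matches on
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: nginx-clusterip-service
spec:
  type: ClusterIP              # the default; may be omitted
  selector:
    app: nginx                 # links the Service to the Pods above
  ports:
  - port: 80
    targetPort: 80
&lt;/code&gt;&lt;/pre&gt;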

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Rigj2Hs3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/d7650k7vpgdj2gevi8rs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Rigj2Hs3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/d7650k7vpgdj2gevi8rs.png" alt="Image description" width="880" height="357"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These Pods can be accessed by other applications within the cluster using the exposed cluster-internal IP &lt;code&gt;10.0.248.71&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NodePort&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Exposes your Service outside the cluster. This opens a static port on each node and maps it to the Pods, so the Service is accessible at NodeIP:NodePort.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;NodeIP is the IP address of your node, and NodePort is the port at which you choose to expose the Service, usually taken from the range 30000&amp;ndash;32767.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Demo Time!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Lz_og3Bq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3uw3sdb40z1875di7j20.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Lz_og3Bq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3uw3sdb40z1875di7j20.png" alt="Image description" width="578" height="1272"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Applying this configuration results in three more Pods.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vzIvvh02--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1y2cmzs7mrzzxcj1hd4x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vzIvvh02--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1y2cmzs7mrzzxcj1hd4x.png" alt="Image description" width="880" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These Pods are now exposed to the outside world through the NodePort service at NodeIP:30003 (30003 because it was specified in the configuration file; otherwise K8s randomly picks a port from the allowed range).&lt;/p&gt;
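&lt;p&gt;A sketch of the Service portion of such a NodePort configuration (names are illustrative; a Deployment like the earlier one would accompany it):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: Service
metadata:
  name: nginx-nodeport-service
spec:
  type: NodePort
  selector:
    app: nginx
  ports:
  - port: 80         # port exposed on the cluster-internal IP
    targetPort: 80   # port the Pods listen on
    nodePort: 30003  # static port opened on every node
&lt;/code&gt;&lt;/pre&gt;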

&lt;p&gt;&lt;strong&gt;LoadBalancer&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;This creates a load balancer in your cloud provider (AWS, GCP, Azure, etc.) and exposes our application to the Internet.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The cloud provider supplies the mechanism for routing the traffic to the Service.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Demo Time!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XqofeDKp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mhrubdewvy1mpint9esi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XqofeDKp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mhrubdewvy1mpint9esi.png" alt="Image description" width="632" height="1272"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Applying the configuration creates Pods exposed via the load balancer.&lt;/p&gt;
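&lt;p&gt;The Service portion of such a LoadBalancer configuration looks roughly like this (names are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: Service
metadata:
  name: nginx-loadbalancer-service
spec:
  type: LoadBalancer   # asks the cloud provider to provision a load balancer
  selector:
    app: nginx
  ports:
  - port: 80
    targetPort: 80
&lt;/code&gt;&lt;/pre&gt;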

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lnYyLQpB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7mhe40wfcb102fvcmpet.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lnYyLQpB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7mhe40wfcb102fvcmpet.png" alt="Image description" width="880" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, the application is accessible to the world at 20.198.164.24. (Don't try to access it; I will delete the service shortly to save my pennies :P)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zCIAvFDu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/y1ct7eudot39fhkybvpu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zCIAvFDu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/y1ct7eudot39fhkybvpu.png" alt="Image description" width="880" height="466"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ExternalName&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Maps the Service to the contents of the &lt;code&gt;externalName&lt;/code&gt; field. Accessing the Service from within your cluster redirects you to the externalName you have provided.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It is not tied to any typical selector labels. Rather, it returns a CNAME record pointing to the external server.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--E_q-3i7w--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jej9fm8oq2q5x7r8mxlw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--E_q-3i7w--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jej9fm8oq2q5x7r8mxlw.png" alt="Image description" width="568" height="514"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This creates a Service which, when accessed at &lt;code&gt;my-service.default.svc.cluster.local&lt;/code&gt;, resolves to the contents of &lt;code&gt;my.service.com&lt;/code&gt;.&lt;/p&gt;
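&lt;p&gt;Such an ExternalName Service can be sketched as follows (note there is no selector; the names match the ones used above):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  type: ExternalName
  externalName: my.service.com   # DNS name the CNAME record points to
&lt;/code&gt;&lt;/pre&gt;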

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nU34mb5m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xw3i7lu180g6ivqox5bd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nU34mb5m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xw3i7lu180g6ivqox5bd.png" alt="Image description" width="880" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hope this gives you an introduction to the K8s Service object. See you in the next blog!&lt;/p&gt;

&lt;p&gt;Happy learning! &lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
