<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Nithyalakshmi Kamalakkannan</title>
    <description>The latest articles on DEV Community by Nithyalakshmi Kamalakkannan (@ktnl).</description>
    <link>https://dev.to/ktnl</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F697502%2F7f8f6c78-540c-4d56-bbd3-a1b26f7d271c.png</url>
      <title>DEV Community: Nithyalakshmi Kamalakkannan</title>
      <link>https://dev.to/ktnl</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ktnl"/>
    <language>en</language>
    <item>
      <title>Part 8: Databricks Pipeline &amp; Dashboard</title>
      <dc:creator>Nithyalakshmi Kamalakkannan</dc:creator>
      <pubDate>Fri, 02 Jan 2026 11:00:32 +0000</pubDate>
      <link>https://dev.to/ktnl/part-8-databricks-pipeline-dashboard-b3h</link>
      <guid>https://dev.to/ktnl/part-8-databricks-pipeline-dashboard-b3h</guid>
      <description>&lt;h2&gt;
  
  
  Pipeline creation
&lt;/h2&gt;

&lt;p&gt;A Databricks Workflow is created with one task for each stage discussed in this blog series. The entire pipeline is orchestrated to stream and process data incrementally.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Bronze ingestion&lt;/li&gt;
&lt;li&gt;ZIP dimension build&lt;/li&gt;
&lt;li&gt;Silver enrichment&lt;/li&gt;
&lt;li&gt;Gold aggregation (both the tables)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Dependencies enforce the order automatically. If you are interested, you can also schedule the pipeline as needed with simple cron expressions!&lt;/p&gt;
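&lt;p&gt;The dependency wiring behaves like a topological ordering over the tasks. As a rough sketch (task names here are illustrative, not the exact Workflow task keys), the order the Workflow enforces can be checked in plain Python:&lt;/p&gt;

```python
from graphlib import TopologicalSorter

# Illustrative task graph mirroring the workflow:
# each task maps to the set of tasks it depends on.
tasks = {
    "bronze_ingestion": set(),
    "zip_dim_build": {"bronze_ingestion"},
    "silver_enrichment": {"bronze_ingestion", "zip_dim_build"},
    "gold_aggregation": {"silver_enrichment"},
}

# static_order() yields a valid execution order respecting every dependency.
order = list(TopologicalSorter(tasks).static_order())
print(order)
```

&lt;p&gt;Databricks resolves the same ordering from the dependencies you declare between tasks in the Workflow.&lt;/p&gt;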

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frl3b18uigjr8krk64qci.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frl3b18uigjr8krk64qci.png" alt=" " width="800" height="84"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy2ujw4z2exnyak1cjp62.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy2ujw4z2exnyak1cjp62.png" alt=" " width="800" height="424"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Dashboard Creation
&lt;/h2&gt;

&lt;p&gt;Queries on the Gold tables feed data to Databricks dashboards.&lt;/p&gt;

&lt;p&gt;In the Databricks workspace, create your own dashboard and add custom queries to provide a visual representation of business insights.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy2yzs9edyhc6g5jq4vgk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy2yzs9edyhc6g5jq4vgk.png" alt=" " width="800" height="585"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For example, to get the peak hours we add the query below as a dataset (from SQL) and create a tile in our dashboard to show the fetched results.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;SELECT&lt;br&gt;
  trip_hour,&lt;br&gt;
  SUM(total_trips) AS trips&lt;br&gt;
FROM nyc_taxi.gold.taxi_trip_metrics&lt;br&gt;
GROUP BY trip_hour&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And the result is,&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmsrc9b4upltetxxdpr0a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmsrc9b4upltetxxdpr0a.png" alt=" " width="628" height="452"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can keep adding tiles to beautify your dashboard!&lt;/p&gt;

&lt;p&gt;Dashboards update automatically when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;New files arrive&lt;/li&gt;
&lt;li&gt;Jobs rerun&lt;/li&gt;
&lt;li&gt;Late data is processed (within watermark)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To simulate new data arrival, we can add extra files to the DBFS input source directory.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy53t53qlzhswh0q5yohi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy53t53qlzhswh0q5yohi.png" alt=" " width="737" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can play with &lt;code&gt;tpep_pickup_datetime&lt;/code&gt; to see watermarks dropping late data in action.&lt;/p&gt;
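&lt;p&gt;To build intuition for the drop rule before experimenting, here is a minimal pure-Python sketch (not Spark code): an event is discarded once it trails the maximum event time seen so far by more than the 30-minute threshold.&lt;/p&gt;

```python
from datetime import datetime, timedelta

def filter_with_watermark(events, delay=timedelta(minutes=30)):
    """Keep events not older than (max event time seen so far minus delay).

    Pure-Python sketch of Spark's watermark drop rule; `events` is an
    iterable of datetimes in arrival order.
    """
    max_seen = None
    kept = []
    for ts in events:
        if max_seen is None or ts > max_seen:
            max_seen = ts
        if ts >= max_seen - delay:
            kept.append(ts)
    return kept

events = [
    datetime(2026, 1, 2, 10, 0),
    datetime(2026, 1, 2, 11, 0),
    datetime(2026, 1, 2, 10, 15),  # 45 minutes late: dropped
    datetime(2026, 1, 2, 10, 45),  # 15 minutes late: kept
]
print(filter_with_watermark(events))
```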

&lt;h2&gt;
  
  
  Messed up somewhere, or want to reset the state?
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Reprocessing Strategy&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;To reprocess everything:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Drop tables or schema&lt;/li&gt;
&lt;li&gt;Delete checkpoints&lt;/li&gt;
&lt;li&gt;Rerun workflow&lt;/li&gt;
&lt;/ul&gt;
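&lt;p&gt;As a sketch, the reset steps can be scripted. The schema names and checkpoint root below are the ones used in this series; on Databricks you would run the SQL via spark.sql and the deletions via dbutils.fs.rm. The helper itself is hypothetical, so adapt it to your layout.&lt;/p&gt;

```python
def reset_plan(catalog="nyc_taxi", layers=("bronze", "silver", "gold"),
               checkpoint_root="/Volumes/nyc_taxi/infra/checkpoints"):
    """Return the SQL statements and checkpoint paths for a full reset."""
    drops = [f"DROP SCHEMA IF EXISTS {catalog}.{layer} CASCADE" for layer in layers]
    checkpoints = [f"{checkpoint_root}/{layer}" for layer in layers]
    return drops, checkpoints

drops, checkpoints = reset_plan()
for stmt in drops:
    print(stmt)   # on Databricks: spark.sql(stmt)
for path in checkpoints:
    print(path)   # on Databricks: dbutils.fs.rm(path, recurse=True)
```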




&lt;p&gt;Hope you liked the series, please do share your feedback. &lt;br&gt;
The source code is available in the &lt;a href="https://github.com/ktnl97/taxi-trip-analysis/tree/main/Taxi%20data%20-%20workflow" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt; for reference.&lt;/p&gt;

&lt;p&gt;That's all for now. See you soon!&lt;/p&gt;

&lt;p&gt;Happy learning!&lt;/p&gt;

</description>
      <category>analytics</category>
      <category>dataengineering</category>
      <category>tutorial</category>
      <category>sql</category>
    </item>
    <item>
      <title>Part 7: Gold Layer – Metrics, Watermarks, and Aggregations</title>
      <dc:creator>Nithyalakshmi Kamalakkannan</dc:creator>
      <pubDate>Fri, 02 Jan 2026 10:51:04 +0000</pubDate>
      <link>https://dev.to/ktnl/part-7-gold-layer-metrics-watermarks-and-aggregations-1jcm</link>
      <guid>https://dev.to/ktnl/part-7-gold-layer-metrics-watermarks-and-aggregations-1jcm</guid>
      <description>&lt;p&gt;Gold tables answer business questions directly.&lt;/p&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Trips per hour by region&lt;/li&gt;
&lt;li&gt;Revenue per ZIP&lt;/li&gt;
&lt;li&gt;Average distance by time window&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Gold tables are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Aggregated&lt;/li&gt;
&lt;li&gt;Optimized&lt;/li&gt;
&lt;li&gt;Dashboard-ready&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Introducing Event Time &amp;amp; Watermarking
&lt;/h2&gt;

&lt;p&gt;Again, for the gold layer to handle late data, we add a watermark, this time combined with windowing to properly close the aggregations on events grouped by time.&lt;br&gt;
Here we tell Spark to wait 30 minutes past the latest event received before closing each open 1-hour window. A window's aggregated results are finalized once the watermark threshold (max &lt;code&gt;tpep_pickup_datetime&lt;/code&gt; received minus 30 minutes) becomes greater than the window close time.&lt;/p&gt;
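&lt;p&gt;A quick worked example of that close condition in plain Python (illustrative timestamps, not Spark code):&lt;/p&gt;

```python
from datetime import datetime, timedelta

# A 1-hour window covering 10:00-11:00 is finalized once the watermark
# (max tpep_pickup_datetime seen so far minus 30 minutes) passes the window end.
window_end = datetime(2026, 1, 2, 11, 0)
delay = timedelta(minutes=30)

def window_closed(max_event_time):
    # Watermark = latest event time seen minus the allowed lateness.
    return max_event_time - delay > window_end

print(window_closed(datetime(2026, 1, 2, 11, 15)))  # watermark 10:45, still open
print(window_closed(datetime(2026, 1, 2, 11, 31)))  # watermark 11:01, finalized
```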

&lt;blockquote&gt;
&lt;p&gt;from pyspark.sql.functions import *&lt;br&gt;
silver_df = spark.readStream.format("delta").table("nyc_taxi.silver.taxi_trips_enriched")&lt;br&gt;
gold_df = (&lt;br&gt;
    silver_df&lt;br&gt;
        .withWatermark("tpep_pickup_datetime", "30 minutes")&lt;br&gt;
        .groupBy(&lt;br&gt;
            window("tpep_pickup_datetime", "1 hour"),&lt;br&gt;
            "region"&lt;br&gt;
        )&lt;br&gt;
        .agg(&lt;br&gt;
            count("*").alias("trip_count"),&lt;br&gt;
            sum("fare_amount").alias("total_fare"),&lt;br&gt;
            avg("trip_distance").alias("avg_distance")&lt;br&gt;
        )&lt;br&gt;
)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Now, to stream it to gold delta tables.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;(&lt;br&gt;
    gold_df.writeStream.option('mergeSchema', 'true')&lt;br&gt;
        .trigger(availableNow=True)&lt;br&gt;
        .option("checkpointLocation", "/Volumes/nyc_taxi/infra/checkpoints/gold/taxi_metrics")&lt;br&gt;
        .outputMode("append")&lt;br&gt;
        .toTable("nyc_taxi.gold.taxi_metrics")&lt;br&gt;
)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;As mentioned, the gold layer answers business questions directly, and hence multiple views may be required. We will create one more view highlighting trip metrics: taxi_trip_metrics.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;from pyspark.sql.functions import *&lt;br&gt;
silver_stream = spark.readStream.format("delta").table("nyc_taxi.silver.taxi_trips_enriched")&lt;br&gt;
gold_stream = (&lt;br&gt;
    silver_stream&lt;br&gt;
    .withWatermark("tpep_pickup_datetime", "30 minutes")&lt;br&gt;
    .withColumn("trip_date", to_date("tpep_pickup_datetime"))&lt;br&gt;
    .withColumn("trip_hour", hour("tpep_pickup_datetime"))&lt;br&gt;
    .groupBy(&lt;br&gt;
        window("tpep_pickup_datetime", "1 hour"),&lt;br&gt;
        "trip_date",&lt;br&gt;
        "trip_hour",&lt;br&gt;
        "pickup_zip",&lt;br&gt;
        "region"&lt;br&gt;
    )&lt;br&gt;
    .agg(&lt;br&gt;
        count("*").alias("total_trips"),&lt;br&gt;
        sum("fare_amount").alias("total_revenue"),&lt;br&gt;
        avg("fare_amount").alias("avg_fare"),&lt;br&gt;
        avg("trip_distance").alias("avg_distance")&lt;br&gt;
    )&lt;br&gt;
)&lt;br&gt;
gold_stream.writeStream \&lt;br&gt;
    .format("delta") \&lt;br&gt;
    .trigger(availableNow=True) \&lt;br&gt;
    .option("checkpointLocation", "/Volumes/nyc_taxi/infra/checkpoints/gold/taxi_trip_metrics") \&lt;br&gt;
    .outputMode("append") \&lt;br&gt;
    .table("nyc_taxi.gold.taxi_trip_metrics")&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The data is now aggregated and available in gold delta tables to be used for inferring business insights!&lt;/p&gt;

&lt;p&gt;Happy learning!&lt;/p&gt;

</description>
      <category>analytics</category>
      <category>dataengineering</category>
      <category>tutorial</category>
      <category>python</category>
    </item>
    <item>
      <title>Part 6: Silver Layer – Cleansing, Enrichment, and Dimensions</title>
      <dc:creator>Nithyalakshmi Kamalakkannan</dc:creator>
      <pubDate>Fri, 02 Jan 2026 10:50:36 +0000</pubDate>
      <link>https://dev.to/ktnl/part-6-silver-layer-cleansing-enrichment-and-dimensions-ff3</link>
      <guid>https://dev.to/ktnl/part-6-silver-layer-cleansing-enrichment-and-dimensions-ff3</guid>
      <description>&lt;p&gt;The Silver layer converts raw events into analytics-ready records by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cleaning bad data&lt;/li&gt;
&lt;li&gt;Enforcing schema&lt;/li&gt;
&lt;li&gt;Adding business context&lt;/li&gt;
&lt;li&gt;Applying dimensional modeling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where value is created.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Cleansing and Type Enforcement
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Bronze must remain untouched&lt;/li&gt;
&lt;li&gt;Silver enforces correctness&lt;/li&gt;
&lt;li&gt;Errors are isolated from ingestion&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;silver_stream = (&lt;br&gt;
    spark.readStream&lt;br&gt;
        .format("delta")&lt;br&gt;
        .table("nyc_taxi.bronze.taxi_trips")&lt;br&gt;
        .withColumn(&lt;br&gt;
            "tpep_pickup_datetime",&lt;br&gt;
            to_timestamp("tpep_pickup_datetime")&lt;br&gt;
        )&lt;br&gt;
        .withColumn(&lt;br&gt;
            "fare_amount",&lt;br&gt;
            col("fare_amount").cast("double")&lt;br&gt;
        )&lt;br&gt;
        .filter(col("fare_amount") &amp;gt; 0)&lt;br&gt;
)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Using Broadcast joins
&lt;/h2&gt;

&lt;p&gt;To ensure we capture the required dimensional model, we need to make joins. But with computation distributed across executors, shuffling data among them is costly. In our case, the join is with zip_dim, a relatively small table. Hence, as a performance improvement, we use a broadcast join here. This can be seen in the screenshots attached below.&lt;/p&gt;

&lt;p&gt;Without Broadcast join&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqgd5uleq0vv48yr7sv1y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqgd5uleq0vv48yr7sv1y.png" alt=" " width="398" height="467"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With Broadcast join&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxi0wjxtfhmapc0a6889n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxi0wjxtfhmapc0a6889n.png" alt=" " width="398" height="467"&gt;&lt;/a&gt;&lt;/p&gt;
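&lt;p&gt;Conceptually, a broadcast join ships the small dimension table to every executor so the large fact table never shuffles. In pure-Python terms (illustrative data, not the real zip_dim), it is like enriching rows through an in-memory dictionary lookup:&lt;/p&gt;

```python
# Small dimension table, "broadcast" as an in-memory dict.
zip_dim = {10001: "Manhattan", 11201: "Brooklyn"}

# Fact rows stay where they are; each is enriched by a local lookup,
# which is what the broadcast hash join does on every executor.
trips = [{"pickup_zip": 10001, "fare": 12.5}, {"pickup_zip": 11201, "fare": 8.0}]
enriched = [{**t, "region": zip_dim.get(t["pickup_zip"])} for t in trips]
print(enriched)
```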

&lt;h2&gt;
  
  
  Adding watermarks
&lt;/h2&gt;

&lt;p&gt;We want real-time data to be processed incrementally, so we need to tell Spark when a set of events is ready to be processed, joined, and added to the sink for the next steps. Of course, either as the whole result or only the changeset!&lt;br&gt;
Thus, we have added a watermark asking Spark to wait for and accommodate data up to 30 minutes late.&lt;/p&gt;

&lt;p&gt;The final code for the silver layer is below.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;from pyspark.sql.functions import *&lt;br&gt;
from pyspark.sql.functions import broadcast&lt;br&gt;
bronze_stream = spark.readStream.table("nyc_taxi.bronze.taxi_trips")&lt;br&gt;
zip_dim = spark.read.table("nyc_taxi.raw.zip_dim")&lt;br&gt;
silver_df = (&lt;br&gt;
    bronze_stream&lt;br&gt;
        .withColumn(&lt;br&gt;
            "pickup_zip",&lt;br&gt;
            regexp_replace("pickup_zip", r"\.0$", "").cast("int")&lt;br&gt;
        )&lt;br&gt;
         .withColumn(&lt;br&gt;
            "tpep_pickup_datetime",&lt;br&gt;
            to_timestamp("tpep_pickup_datetime")&lt;br&gt;
        )&lt;br&gt;
        .withColumn(&lt;br&gt;
            "tpep_dropoff_datetime",&lt;br&gt;
            to_timestamp("tpep_dropoff_datetime")&lt;br&gt;
        )&lt;br&gt;
        .withWatermark("tpep_pickup_datetime", "30 minutes")&lt;br&gt;&lt;br&gt;
        .join(&lt;br&gt;
            broadcast(zip_dim),&lt;br&gt;
            bronze_stream.pickup_zip == zip_dim.zip_code,&lt;br&gt;
            "left"&lt;br&gt;
        )&lt;br&gt;
        .select(&lt;br&gt;
            "tpep_pickup_datetime",&lt;br&gt;
            "tpep_dropoff_datetime",&lt;br&gt;
            "trip_distance",&lt;br&gt;
            "fare_amount",&lt;br&gt;
            "pickup_zip",&lt;br&gt;
            "region",&lt;br&gt;
            "state"&lt;br&gt;
        )&lt;br&gt;
)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Now, we will stream it to the silver delta table with output mode append, so that only finalized (closed-window) results are added to the Silver delta lake sink.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;(&lt;br&gt;
    silver_df.writeStream&lt;br&gt;
        .format("delta")&lt;br&gt;
        .outputMode("append")&lt;br&gt;
        .option("checkpointLocation", "/Volumes/nyc_taxi/infra/checkpoints/silver")&lt;br&gt;
        .trigger(availableNow=True)&lt;br&gt;
        .toTable("nyc_taxi.silver.taxi_trips_enriched")&lt;br&gt;
)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The required cleansing and normalization has happened, and the data is now ready to be refined further to surface business insights.&lt;/p&gt;

&lt;p&gt;Happy learning!&lt;/p&gt;

</description>
      <category>analytics</category>
      <category>architecture</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Part 5: Building a ZIP Code Dimension Table</title>
      <dc:creator>Nithyalakshmi Kamalakkannan</dc:creator>
      <pubDate>Fri, 02 Jan 2026 10:49:47 +0000</pubDate>
      <link>https://dev.to/ktnl/part-5-building-a-zip-code-dimension-table-1him</link>
      <guid>https://dev.to/ktnl/part-5-building-a-zip-code-dimension-table-1him</guid>
      <description>&lt;h2&gt;
  
  
  Why? The Need for It!
&lt;/h2&gt;

&lt;p&gt;Fact tables (like taxi trips) are optimized for events:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pickup time&lt;/li&gt;
&lt;li&gt;Distance&lt;/li&gt;
&lt;li&gt;Fare&lt;/li&gt;
&lt;li&gt;Pickup ZIP&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But analytics teams ask questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Trips by region&lt;/li&gt;
&lt;li&gt;Revenue by state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Storing these attributes repeatedly in the fact table:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Increases storage&lt;/li&gt;
&lt;li&gt;Slows joins&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Breaking these attributes out into dimension tables is the best practice. In our project, the need to know the region for a pickup/dropoff ZIP code paves the way for creating the dimension table zip_dim.&lt;/p&gt;

&lt;p&gt;In real projects, ZIP metadata comes from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Census data&lt;/li&gt;
&lt;li&gt;Data exposed via APIs&lt;/li&gt;
&lt;li&gt;Internal reference tables&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For this project, we simulate it with hardcoded, range-based values.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Does the ZIP Dimension Belong?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Bronze: Raw ZIP values as they appear&lt;/li&gt;
&lt;li&gt;Silver: Create and maintain the ZIP dimension&lt;/li&gt;
&lt;li&gt;Gold: Join the ZIP dimension for analytics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even though ZIPs appear in Bronze data, the dimension itself is curated, so it belongs in Silver, not Bronze!&lt;br&gt;
We derive ZIPs from the Bronze Delta table, not directly from raw files.&lt;/p&gt;

&lt;h2&gt;
  
  
  Zip_dim builder
&lt;/h2&gt;

&lt;p&gt;Step 1: Create the schema for the zip_dim table&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;%sql&lt;br&gt;
CREATE SCHEMA IF NOT EXISTS nyc_taxi.raw;&lt;br&gt;
CREATE TABLE IF NOT EXISTS nyc_taxi.raw.zip_dim (&lt;br&gt;
    zip_code INT,&lt;br&gt;
    state STRING,&lt;br&gt;
    region STRING&lt;br&gt;
)&lt;br&gt;
USING DELTA;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Step 2: Read the unique and valid list of ZIPs (both pickup and dropoff) from the bronze data.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;from pyspark.sql.functions import *&lt;br&gt;
zip_stream = (&lt;br&gt;
    spark.readStream&lt;br&gt;
         .table("nyc_taxi.bronze.taxi_trips")&lt;br&gt;
         .selectExpr("pickup_zip as zip")&lt;br&gt;
         .union(&lt;br&gt;
             spark.readStream&lt;br&gt;
                  .table("nyc_taxi.bronze.taxi_trips")&lt;br&gt;
                  .selectExpr("dropoff_zip as zip")&lt;br&gt;
         )&lt;br&gt;
         .where("zip IS NOT NULL")&lt;br&gt;
         .dropDuplicates(["zip"])&lt;br&gt;
)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Step 3: Assign simulated metadata to the ZIP values, mimicking actual metadata seeding.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;def upsert_zip_dim(batch_df, batch_id):&lt;br&gt;
    batch_df.createOrReplaceTempView("zip_updates")&lt;br&gt;
    spark.sql("""&lt;br&gt;
      MERGE INTO nyc_taxi.raw.zip_dim t&lt;br&gt;
      USING (&lt;br&gt;
        SELECT&lt;br&gt;
          CAST(zip AS INT) AS zip_code,&lt;br&gt;
          CASE&lt;br&gt;
            WHEN zip BETWEEN 10001 AND 10282 THEN 'NY'&lt;br&gt;
            WHEN zip BETWEEN 11201 AND 11256 THEN 'US'&lt;br&gt;
            WHEN zip BETWEEN 10451 AND 10475 THEN 'IN'&lt;br&gt;
            WHEN zip BETWEEN 10301 AND 10314 THEN 'AD'&lt;br&gt;
            ELSE 'SA'&lt;br&gt;
          END AS state,&lt;br&gt;
          CASE&lt;br&gt;
            WHEN zip BETWEEN 10001 AND 10282 THEN 'Manhattan'&lt;br&gt;
            WHEN zip BETWEEN 11201 AND 11256 THEN 'Brooklyn'&lt;br&gt;
            WHEN zip BETWEEN 10451 AND 10475 THEN 'Bronx'&lt;br&gt;
            WHEN zip BETWEEN 10301 AND 10314 THEN 'Staten Island'&lt;br&gt;
            ELSE 'Queens'&lt;br&gt;
          END AS region&lt;br&gt;
        FROM zip_updates&lt;br&gt;
      ) s&lt;br&gt;
      ON t.zip_code = s.zip_code&lt;br&gt;
      WHEN NOT MATCHED THEN INSERT *&lt;br&gt;
    """)&lt;/p&gt;
&lt;/blockquote&gt;
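&lt;p&gt;The CASE expressions above can also be written as a small Python function, which is handy for unit-testing the simulated mapping before wiring it into the MERGE (the function name is illustrative):&lt;/p&gt;

```python
def zip_metadata(zip_code: int):
    """Mirror the SQL CASE logic: map a ZIP to its simulated (state, region)."""
    ranges = [
        ((10001, 10282), ("NY", "Manhattan")),
        ((11201, 11256), ("US", "Brooklyn")),
        ((10451, 10475), ("IN", "Bronx")),
        ((10301, 10314), ("AD", "Staten Island")),
    ]
    for (lo, hi), meta in ranges:
        if hi >= zip_code >= lo:
            return meta
    return ("SA", "Queens")  # ELSE branch of the CASE expressions

print(zip_metadata(10010))
print(zip_metadata(99999))
```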

&lt;p&gt;Step 4: Populate the nyc_taxi.raw.zip_dim delta table with ZIP metadata via batch processing.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;(&lt;br&gt;
    zip_stream.writeStream&lt;br&gt;
        .foreachBatch(upsert_zip_dim)&lt;br&gt;
        .option("checkpointLocation", "/Volumes/nyc_taxi/infra/checkpoints/raw/zip_dim_data")&lt;br&gt;
        .trigger(availableNow=True)&lt;br&gt;
        .start()&lt;br&gt;
)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The ZIP dimension table nyc_taxi.raw.zip_dim is now ready.&lt;/p&gt;

&lt;p&gt;Happy learning!&lt;/p&gt;

</description>
      <category>analytics</category>
      <category>database</category>
      <category>dataengineering</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Part 4: Building the Bronze Layer with Auto Loader and Delta Lake</title>
      <dc:creator>Nithyalakshmi Kamalakkannan</dc:creator>
      <pubDate>Fri, 02 Jan 2026 10:49:22 +0000</pubDate>
      <link>https://dev.to/ktnl/part-4-building-the-bronze-layer-with-auto-loader-and-delta-lake-31ih</link>
      <guid>https://dev.to/ktnl/part-4-building-the-bronze-layer-with-auto-loader-and-delta-lake-31ih</guid>
      <description>&lt;p&gt;The Bronze layer is the foundation of the entire streaming architecture. Its role is to ingest data exactly as it arrives and store it durably. Possibly by adding some timestamps on when the event arrived.&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating schemas and volumes
&lt;/h2&gt;

&lt;p&gt;We create the catalog, the bronze schema, and the volumes required to store Auto Loader metadata and the various checkpoints in Databricks.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;%sql&lt;br&gt;
CREATE CATALOG IF NOT EXISTS nyc_taxi;&lt;br&gt;
CREATE SCHEMA IF NOT EXISTS nyc_taxi.bronze;&lt;br&gt;
CREATE SCHEMA IF NOT EXISTS nyc_taxi.infra;&lt;br&gt;
CREATE VOLUME IF NOT EXISTS nyc_taxi.infra.autoloader;&lt;br&gt;
CREATE VOLUME IF NOT EXISTS nyc_taxi.infra.checkpoints;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Using Auto loader for Bronze ingestion
&lt;/h2&gt;

&lt;p&gt;Databricks Auto Loader is purpose-built for scalable file-based ingestion.&lt;br&gt;
In this project, Auto Loader continuously watches the directory created in Part 2 and processes only newly arrived files.&lt;/p&gt;

&lt;p&gt;We start by defining a streaming DataFrame using Auto Loader:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;bronze_stream = (&lt;br&gt;
    spark.readStream&lt;br&gt;
        .format("cloudFiles")&lt;br&gt;
        .option("cloudFiles.format", "json")&lt;br&gt;
        .option("cloudFiles.inferColumnTypes", "true")&lt;br&gt;
        .option("cloudFiles.schemaLocation", "/Volumes/nyc_taxi/infra/autoloader/metadata")&lt;br&gt;
        .load("/tmp/taxi_stream_input")&lt;br&gt;
)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;code&gt;cloudFiles.format&lt;/code&gt; - Specifies the underlying file format (json in this case).&lt;/p&gt;

&lt;p&gt;&lt;code&gt;cloudFiles.inferColumnTypes&lt;/code&gt; - Automatically infers data types instead of defaulting to strings.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;cloudFiles.schemaLocation&lt;/code&gt; - Stores inferred schemas and supports schema evolution across runs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Writing to Bronze Delta Tables
&lt;/h2&gt;

&lt;p&gt;Next, we write the streaming data into a Bronze Delta table.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;(&lt;br&gt;
    bronze_stream.writeStream&lt;br&gt;
        .format("delta")&lt;br&gt;
        .outputMode("append")&lt;br&gt;
        .option("checkpointLocation", "/Volumes/nyc_taxi/infra/checkpoints/bronze/taxi_trips")&lt;br&gt;
        .trigger(availableNow=True)&lt;br&gt;
        .toTable("nyc_taxi.bronze.taxi_trips")&lt;br&gt;
)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Using Delta tables in the Bronze layer provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ACID guarantees on streaming writes&lt;/li&gt;
&lt;li&gt;Schema enforcement and evolution&lt;/li&gt;
&lt;li&gt;Reliable checkpointing&lt;/li&gt;
&lt;li&gt;Compatibility with batch and streaming reads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This pipeline uses &lt;code&gt;.trigger(availableNow=True)&lt;/code&gt;, which processes all available files and stops automatically when finished. It can be scheduled to run periodically, reducing cost compared to always-on streaming.&lt;/p&gt;

&lt;p&gt;In practice, this behaves like incremental batch processing — ideal for cloud storage–based ingestion.&lt;/p&gt;

&lt;p&gt;The raw data is now ingested into bronze delta tables using Auto Loader and is ready to be refined further in the silver layer.&lt;/p&gt;

&lt;p&gt;Happy learning!&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>dataengineering</category>
      <category>tutorial</category>
      <category>sql</category>
    </item>
    <item>
      <title>Part 3: Simulating Real-Time Streaming Data Using Databricks Sample Datasets</title>
      <dc:creator>Nithyalakshmi Kamalakkannan</dc:creator>
      <pubDate>Fri, 02 Jan 2026 10:48:51 +0000</pubDate>
      <link>https://dev.to/ktnl/part-3-simulating-real-time-streaming-data-using-databricks-sample-datasets-5be3</link>
      <guid>https://dev.to/ktnl/part-3-simulating-real-time-streaming-data-using-databricks-sample-datasets-5be3</guid>
      <description>&lt;p&gt;We use the Databricks NYC Taxi sample dataset, available by default in Databricks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fag2rmw905ssdxvq6507j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fag2rmw905ssdxvq6507j.png" alt=" " width="800" height="467"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This dataset is ideal because it includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Event timestamps (tpep_pickup_datetime)&lt;/li&gt;
&lt;li&gt;Numeric measures (fare_amount, trip_distance)&lt;/li&gt;
&lt;li&gt;Location attributes (pickup_zip, dropoff_zip)&lt;/li&gt;
&lt;li&gt;Sufficient data volume to observe performance and shuffle behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Although the dataset is static, we will convert it into a streaming source.&lt;/p&gt;

&lt;h2&gt;
  
  
  Converting Static Data into a Streaming Source
&lt;/h2&gt;

&lt;p&gt;Step 1: Read the Sample Dataset&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;df = spark.table("samples.nyctaxi.trips")&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;At this point, the data is a normal batch DataFrame.&lt;/p&gt;

&lt;p&gt;Step 2: Write Data as JSON Files&lt;/p&gt;

&lt;p&gt;To simulate streaming input, we write the dataset as JSON files to a directory:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;(&lt;br&gt;
    df.write&lt;br&gt;
      .mode("overwrite")&lt;br&gt;
      .format("json")&lt;br&gt;
      .save("/tmp/taxi_stream_input")&lt;br&gt;
)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This writes files to DBFS (Databricks File System; think of it as virtual storage provided by Databricks), overwriting any files previously present in "/tmp/taxi_stream_input". Spark creates multiple JSON files, and each file represents a batch of incoming events.&lt;/p&gt;
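&lt;p&gt;Outside Databricks, you can mimic this file-based arrival pattern with the standard library alone: drop newline-delimited JSON files into a directory and treat each new file as a batch (the paths and field values here are illustrative):&lt;/p&gt;

```python
import json
import tempfile
from pathlib import Path

# Illustrative landing directory, standing in for /tmp/taxi_stream_input.
landing = Path(tempfile.mkdtemp())

batch = [
    {"tpep_pickup_datetime": "2026-01-02T10:00:00", "fare_amount": 12.5},
    {"tpep_pickup_datetime": "2026-01-02T10:05:00", "fare_amount": 8.0},
]

# Each file written here represents one batch of incoming events.
out = landing / "part-0001.json"
out.write_text("\n".join(json.dumps(row) for row in batch))

print(sorted(p.name for p in landing.glob("*.json")))
```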

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fviwi6c3x754dw4fp9lk2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fviwi6c3x754dw4fp9lk2.png" alt=" " width="800" height="313"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, the data is available as file storage for us to read and start the streaming!&lt;/p&gt;

&lt;p&gt;Happy learning!&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Part 2: Project Architecture</title>
      <dc:creator>Nithyalakshmi Kamalakkannan</dc:creator>
      <pubDate>Fri, 02 Jan 2026 10:48:26 +0000</pubDate>
      <link>https://dev.to/ktnl/part-2-project-architecture-1d2a</link>
      <guid>https://dev.to/ktnl/part-2-project-architecture-1d2a</guid>
      <description>&lt;p&gt;The goal is not just to “make streaming work”, but to design a maintainable and observable streaming platform.&lt;/p&gt;

&lt;p&gt;At a high level, the platform follows a Medallion Architecture, which organizes data into progressive layers of refinement:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bronze: Raw, append-only streaming ingestion&lt;/li&gt;
&lt;li&gt;Silver: Cleaned, enriched, normalized data&lt;/li&gt;
&lt;li&gt;Gold: Aggregated, business-ready metrics&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Architectural flow
&lt;/h2&gt;

&lt;p&gt;The project outlines an end-to-end real-time data pipeline built on Databricks, following the Medallion Architecture pattern. Each stage progressively refines data from raw events into business-ready insights.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fic3ickjvmgwgitukm5nq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fic3ickjvmgwgitukm5nq.png" alt=" " width="336" height="1050"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Databricks Sample Data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At the top of the pipeline, Databricks-provided sample datasets (in this case, NYC Taxi trip data) act as the data source. These datasets contain realistic event timestamps, numeric measures, and location attributes, making them suitable for simulating real-world streaming use cases without requiring external systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Simulated Streaming Input&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Because the sample data is static by default, it is first written incrementally as files into cloud storage (DBFS). This step simulates real-time data arrival, mimicking how production systems often receive data from upstream applications, IoT devices, or operational databases via files landing in object storage.&lt;/p&gt;

&lt;p&gt;New files arriving in this directory represent new streaming events.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqc7h1k3vev364yjl517p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqc7h1k3vev364yjl517p.png" alt=" " width="800" height="237"&gt;&lt;/a&gt;&lt;/p&gt;
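&lt;p&gt;The file-landing step above can be sketched in plain Python. This is an illustrative simulation, not the project's actual code: the function name, file naming, and batch size are my own choices; it simply writes records as small JSON-lines files, the way batches of events land in DBFS or object storage.&lt;/p&gt;

```python
import json
import time
from pathlib import Path

def land_batches(records, landing_dir, batch_size=2, delay_s=0.0):
    """Write records into landing_dir as small JSON-lines files.

    Illustrative only: each file that appears in the directory plays
    the role of a newly arrived batch of streaming events.
    """
    landing = Path(landing_dir)
    landing.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(0, len(records), batch_size):
        batch = records[i:i + batch_size]
        path = landing / f"batch_{i // batch_size:05d}.json"
        path.write_text("\n".join(json.dumps(r) for r in batch))
        paths.append(path)
        time.sleep(delay_s)  # optional pause to mimic events arriving over time
    return paths
```

A file-based streaming source would then discover these files incrementally, one batch at a time.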

&lt;p&gt;&lt;strong&gt;Auto Loader&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Databricks Auto Loader continuously monitors the input directory, efficiently detects newly arrived files, and provides schema inference and evolution.&lt;br&gt;
Auto Loader integrates natively with Spark Structured Streaming, allowing file-based ingestion to behave like a true streaming source.&lt;/p&gt;
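&lt;p&gt;A typical Auto Loader read-and-write looks like the following configuration sketch. It assumes an active Databricks &lt;code&gt;spark&lt;/code&gt; session; the paths, file format, and table name are placeholders, not taken from the project's code.&lt;/p&gt;

```python
# Illustrative Auto Loader ingestion into a Bronze table.
# All paths and the table name are placeholders.
bronze_stream = (
    spark.readStream
         .format("cloudFiles")                                   # Auto Loader source
         .option("cloudFiles.format", "json")                    # format of the landing files
         .option("cloudFiles.schemaLocation", "/tmp/uc/schema")  # where the inferred schema is tracked
         .load("/tmp/landing/trips")                             # monitored input directory
)

(bronze_stream.writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/uc/checkpoints/bronze")  # exactly-once bookkeeping
    .trigger(availableNow=True)                                  # process what has arrived, then stop
    .toTable("bronze_trips"))
```

The checkpoint location is what lets the stream resume incrementally instead of reprocessing old files.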

&lt;p&gt;&lt;strong&gt;Bronze Delta Tables (Raw Layer)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Bronze layer stores raw, append-only data exactly as it arrives from the source, with minimal transformation.&lt;br&gt;
This layer ensures that raw data is always preserved, enabling replay, debugging, and full reprocessing if needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Silver Delta Tables (Cleaned &amp;amp; Enriched Layer)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the Silver layer, data is cleansed, standardized, and enriched through steps such as&lt;br&gt;
date-type normalization, filtering out invalid or malformed records, and joining with dimension tables (for example, ZIP code to region mappings).&lt;/p&gt;

&lt;p&gt;Silver tables represent trusted, analytics-ready data that can be reused across multiple downstream use cases.&lt;/p&gt;
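&lt;p&gt;The cleansing-and-enrichment logic can be illustrated with a plain-Python sketch. The field names (&lt;code&gt;fare_amount&lt;/code&gt;, &lt;code&gt;pickup_zip&lt;/code&gt;) are assumptions based on the NYC Taxi dataset, not the project's actual schema, and in the real pipeline this work is done by Spark transformations rather than a Python loop.&lt;/p&gt;

```python
def to_silver(bronze_rows, zip_to_region):
    """Cleanse and enrich raw rows the way the Silver layer does (illustrative)."""
    silver = []
    for row in bronze_rows:
        fare = row.get("fare_amount")
        zip_code = row.get("pickup_zip")
        if fare is None or fare < 0:   # drop invalid or malformed records
            continue
        enriched = dict(row)
        # join with a ZIP-to-region dimension; unmatched ZIPs get a default
        enriched["region"] = zip_to_region.get(zip_code, "unknown")
        silver.append(enriched)
    return silver
```

The same shape of logic (filter, then dimension join) reappears no matter how large the data gets.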

&lt;p&gt;&lt;strong&gt;Gold Delta Tables (Business Layer)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Gold layer contains aggregated, business-focused datasets designed for analytics and reporting. For example,&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hourly trip counts by region&lt;/li&gt;
&lt;li&gt;Revenue metrics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This layer often uses event-time processing, windowed aggregations, and watermarking to handle late-arriving data while keeping state bounded.&lt;/p&gt;
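&lt;p&gt;To make the watermarking idea concrete, here is a minimal plain-Python sketch of event-time hourly counts with a watermark. The 30-minute delay and the region field are illustrative choices, not the project's configuration; Spark's &lt;code&gt;withWatermark&lt;/code&gt; does the same bookkeeping internally.&lt;/p&gt;

```python
from collections import defaultdict
from datetime import datetime, timedelta

def hourly_counts(events, watermark_delay=timedelta(minutes=30)):
    """Event-time hourly counts per region, with a watermark (illustrative).

    The watermark is the max event time seen so far minus a delay; events
    older than it are treated as too late and dropped, keeping state bounded.
    """
    counts = defaultdict(int)
    max_event_time = None
    for ts, region in events:
        if max_event_time is None or ts > max_event_time:
            max_event_time = ts
        if ts < max_event_time - watermark_delay:
            continue  # late beyond the watermark: discard
        window_start = ts.replace(minute=0, second=0, microsecond=0)
        counts[(window_start, region)] += 1
    return dict(counts)
```

Slightly late events (within the delay) still land in the correct hourly window; only events beyond the watermark are dropped.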

&lt;p&gt;&lt;strong&gt;Databricks SQL Dashboards&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Finally, Gold tables are consumed by Databricks SQL Dashboards. As new data flows through the pipeline, dashboards update automatically, closing the loop from raw events to actionable insights.&lt;/p&gt;

&lt;p&gt;Together, these components form a robust, scalable, and maintainable real-time data platform.&lt;/p&gt;

&lt;p&gt;Happy learning!&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>data</category>
      <category>architecture</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Part 1: Creating Databricks Workspace and Enabling Unity Catalog</title>
      <dc:creator>Nithyalakshmi Kamalakkannan</dc:creator>
      <pubDate>Fri, 02 Jan 2026 10:48:10 +0000</pubDate>
      <link>https://dev.to/ktnl/part-1-creating-databricks-workspace-and-enabling-unity-catalog-3e44</link>
      <guid>https://dev.to/ktnl/part-1-creating-databricks-workspace-and-enabling-unity-catalog-3e44</guid>
<description>&lt;p&gt;In Databricks, Unity Catalog provides a secure, governed foundation for our data platform by centralizing metadata, access control, and storage governance across workspaces.&lt;/p&gt;

&lt;p&gt;Unity Catalog acts like a control plane for modern Databricks platforms, offering the benefits below out of the box.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Centralized metastore for all tables and views&lt;/li&gt;
&lt;li&gt;Fine-grained access control (catalog, schema, table, column)&lt;/li&gt;
&lt;li&gt;Data lineage and auditing&lt;/li&gt;
&lt;li&gt;Secure multi-workspace governance&lt;/li&gt;
&lt;li&gt;Clear separation between compute and storage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;u&gt;Step 1: Create an Azure Databricks Workspace&lt;/u&gt;&lt;/p&gt;

&lt;p&gt;Login to your account in Azure Portal &amp;gt; Create a Resource &amp;gt; Search for Azure Databricks&lt;/p&gt;

&lt;p&gt;Provide the required details like resource group, workspace name, region, etc.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Step 2: Create Azure Data Lake Storage (ADLS Gen2)&lt;/u&gt;&lt;/p&gt;

&lt;p&gt;Unity Catalog requires a cloud storage location to store managed tables and metadata.&lt;/p&gt;

&lt;p&gt;Azure Portal &amp;gt; Create a Resource &amp;gt; Search for Storage account&lt;/p&gt;

&lt;p&gt;Create an ADLS Gen2 storage account with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hierarchical namespace enabled&lt;/li&gt;
&lt;li&gt;Secure networking (private endpoints if required)&lt;/li&gt;
&lt;li&gt;A container dedicated to analytics (e.g. datalake)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This storage will physically hold:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Parquet data files&lt;/li&gt;
&lt;li&gt;_delta_log transaction logs&lt;/li&gt;
&lt;li&gt;Deletion vectors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;u&gt;Step 3: Configure Access Using Azure Managed Identity or Service Principal&lt;/u&gt;&lt;/p&gt;

&lt;p&gt;Databricks must be granted secure access to ADLS.&lt;/p&gt;

&lt;p&gt;In the storage account, open Access Control (IAM) &amp;gt; Add role assignment, and grant the Storage Blob Data Contributor role to the Databricks managed identity (access connector) or service principal.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr8cc7pmjs350ugsnm6m7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr8cc7pmjs350ugsnm6m7.png" alt=" " width="800" height="264"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is required to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create Delta tables&lt;/li&gt;
&lt;li&gt;Manage _delta_log transactions&lt;/li&gt;
&lt;li&gt;Handle compaction and vacuum&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;u&gt;Step 4: Create the Unity Catalog Metastore&lt;/u&gt;&lt;/p&gt;

&lt;p&gt;In the Databricks Account Console:&lt;/p&gt;

&lt;p&gt;Navigate to Data &amp;gt; Metastores &amp;gt; Create a new Unity Catalog metastore&lt;/p&gt;

&lt;p&gt;Provide:&lt;/p&gt;

&lt;p&gt;Name (e.g. nyc_taxi_metastore)&lt;br&gt;
Region (must match storage)&lt;br&gt;
ADLS Gen2 storage root (e.g. &lt;code&gt;abfss://datalake@storageaccount.dfs.core.windows.net/uc&lt;/code&gt;)&lt;/p&gt;

&lt;p&gt;This location becomes the default storage root for managed tables.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmyaas6yb5hoxjjjbt3jc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmyaas6yb5hoxjjjbt3jc.png" alt=" " width="800" height="317"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Step 5: Attach the Metastore to the Databricks Workspace&lt;/u&gt;&lt;/p&gt;

&lt;p&gt;Once the metastore is created,&lt;/p&gt;

&lt;p&gt;Navigate to the metastore &amp;gt; Click Assign to workspace &amp;gt; Select the Databricks workspace created earlier&lt;/p&gt;

&lt;p&gt;With all this set up, our data platform foundation is now laid!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Points to remember&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All catalogs, schemas, and tables are governed centrally&lt;/li&gt;
&lt;li&gt;Multiple workspaces can share the same metastore, while one workspace cannot have multiple metastores.&lt;/li&gt;
&lt;li&gt;Unity Catalog is account-level, not workspace-level.&lt;/li&gt;
&lt;/ul&gt;
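&lt;p&gt;Once the metastore is attached, you can exercise the governance model from a notebook. A minimal sketch, assuming an active &lt;code&gt;spark&lt;/code&gt; session; the catalog, schema, and group names are illustrative placeholders, not from this setup:&lt;/p&gt;

```python
# Illustrative only: catalog/schema/group names are placeholders.
spark.sql("CREATE CATALOG IF NOT EXISTS nyc_taxi")
spark.sql("CREATE SCHEMA IF NOT EXISTS nyc_taxi.bronze")
# Unity Catalog privilege model: grant a group the right to use the catalog
spark.sql("GRANT USE CATALOG ON CATALOG nyc_taxi TO `data_engineers`")
```

Everything created this way is governed centrally, per the points above.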

&lt;p&gt;Alright! It’s time to get our hands dirty and do some Spark coding!&lt;/p&gt;

&lt;p&gt;Happy learning!&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>azure</category>
      <category>dataengineering</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>End-to-End Real-Time Data Engineering on Databricks Using Spark Structured Streaming and Delta Lake</title>
      <dc:creator>Nithyalakshmi Kamalakkannan</dc:creator>
      <pubDate>Fri, 02 Jan 2026 10:47:44 +0000</pubDate>
      <link>https://dev.to/ktnl/end-to-end-real-time-data-engineering-on-databricks-using-spark-structured-streaming-and-delta-lake-207k</link>
      <guid>https://dev.to/ktnl/end-to-end-real-time-data-engineering-on-databricks-using-spark-structured-streaming-and-delta-lake-207k</guid>
<description>&lt;p&gt;Simple batch processing and static dashboards have had their day!&lt;/p&gt;

&lt;p&gt;Data platforms must ingest continuously arriving data, gracefully handle late and out-of-order events, scale efficiently, and still deliver reliable, business-ready metrics in real or near-real time!&lt;/p&gt;

&lt;p&gt;In this blog series, we shall explore how to build an end-to-end real time streaming data platform on Databricks.&lt;/p&gt;

&lt;p&gt;As a newcomer to streaming systems, I have applied what I have learned about Spark Structured Streaming, Delta Lake, Auto Loader, and the Medallion Architecture to design and implement this solution. &lt;/p&gt;

&lt;p&gt;This will be a small, hands-on data engineering project to get practical experience on the Databricks platform, using the sample NYC Taxi Trips dataset. The intention is to have something to play around with and to apply what I have read in theory. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1zxlco4l7qjf9fz6qqxy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1zxlco4l7qjf9fz6qqxy.png" alt=" " width="800" height="515"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The project ingests data from file storage using Auto Loader into Bronze Delta tables, reads from bronze via Spark Structured Streaming, cleanses and normalizes the data into Silver Delta tables using spark, and applies aggregations to produce Gold Delta tables. The pipeline is orchestrated using Databricks Workflows, with insights visualized through dashboards built on queries against the Gold layer.&lt;/p&gt;

&lt;p&gt;I have primarily used Databricks serverless compute, so I did not explicitly create or manage clusters. Feel free to create your own clusters and run the same Spark workloads to gain deeper insight into execution behavior, resource utilization, and performance characteristics using the Spark UI.&lt;/p&gt;

&lt;p&gt;I have linked the source code Git repo in the last post of this series. Keep scrolling, and your feedback is most welcome. &lt;/p&gt;

&lt;p&gt;Happy learning!!&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>handson</category>
      <category>realtimeproject</category>
      <category>spark</category>
    </item>
    <item>
      <title>Kubernetes (K8s) Command Cheat Sheet</title>
      <dc:creator>Nithyalakshmi Kamalakkannan</dc:creator>
      <pubDate>Mon, 28 Apr 2025 03:35:18 +0000</pubDate>
      <link>https://dev.to/ktnl/kubernetes-k8s-command-cheat-sheet-291h</link>
      <guid>https://dev.to/ktnl/kubernetes-k8s-command-cheat-sheet-291h</guid>
      <description>&lt;p&gt;Whether you're wrangling microservices in production or just tired of Googling the same five kubectl commands, this blog is for you.&lt;/p&gt;

&lt;p&gt;We will go beyond the copy &amp;amp; paste to give you real command-line hands-on examples and some lighter dives to understand why things work the way they do.&lt;/p&gt;

&lt;p&gt;Let’s level up your K8s game. 🚀&lt;/p&gt;

&lt;h3&gt;
  
  
  Quick Refresher: What Kubernetes Is
&lt;/h3&gt;

&lt;p&gt;Kubernetes is a container orchestration system that helps you manage applications across clusters of machines. It handles their scheduling, scaling, networking, and rollouts. You tell it what you want, and it figures out how to get there! &lt;/p&gt;

&lt;p&gt;Before we jump into commands, it helps to know how kubectl is structured—it’ll make everything click faster, especially as you start scripting or working with multiple clusters. Please do read the other parts of this series to get a grasp of the underlying architecture.&lt;/p&gt;

&lt;p&gt;Enough of theory!&lt;br&gt;
Most kubectl commands follow this pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl [operation] [resource] [name] [flags]

For example:

kubectl get pods -n dev

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here’s what’s happening:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;get is the operation&lt;br&gt;
pods is the resource type&lt;br&gt;
-n dev tells kubectl to only look in the dev namespace&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Now, here's where flags come into play:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;-n &amp;lt;namespace&amp;gt; (or --namespace=&amp;lt;namespace&amp;gt;) lets you target a specific namespace.&lt;br&gt;
-A (short for --all-namespaces) will show results across every namespace.&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Compare these two:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get pods -n dev     # Just pods in the 'dev' namespace
kubectl get pods -A         # All pods in all namespaces
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Why this matters&lt;/em&gt;: Many kubectl commands default to the current namespace (often default), so if you don’t specify -n or set your namespace context, you might think things are missing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bonus Tip:&lt;/strong&gt; To avoid typing &lt;code&gt;-n&lt;/code&gt; all the time, you can set your namespace context like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl config set-context --current --namespace=dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now all kubectl commands will assume dev unless you override it.&lt;br&gt;
Try out the &lt;strong&gt;kubectx&lt;/strong&gt; and &lt;strong&gt;kubens&lt;/strong&gt; tools after checking support to your platform - they make life much easier for context and namespace switching and many more!&lt;/p&gt;

&lt;p&gt;Hey wait, context?? &lt;br&gt;
Relax! We will get to contexts in a few minutes!!&lt;/p&gt;

&lt;p&gt;Cool? Cool. Now let’s hit the terminal.&lt;/p&gt;
&lt;h4&gt;
  
  
  Essentials
&lt;/h4&gt;

&lt;p&gt;These are the bread-and-butter commands for interacting with your K8s cluster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get all                 # Get pods, services, deployments, etc.
kubectl get pods                # List all pods in the current namespace
kubectl describe pod &amp;lt;name&amp;gt;     # Detailed info about a pod
kubectl delete pod &amp;lt;name&amp;gt;       # Delete a pod
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, keep playing with permutation and combination for other resource types with these operations!&lt;/p&gt;

&lt;h4&gt;
  
  
  Multi-Cluster Management
&lt;/h4&gt;

&lt;p&gt;Working across multiple clusters? &lt;br&gt;
Here's where you will need kubectl config commands the most. These commands help you manage different contexts, namespaces, and clusters seamlessly.&lt;/p&gt;

&lt;p&gt;If you’re using Azure Kubernetes Service (AKS), you’ll need to configure your kubectl to authenticate and connect to the correct AKS cluster.&lt;/p&gt;

&lt;p&gt;A Kubernetes context is like a shortcut or profile that tells kubectl where to send commands (your cluster) and how to authenticate (user creds).&lt;/p&gt;

&lt;p&gt;Get a quick local setup for your projects -&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;az login  # Login to Azure
az account set -s "&amp;lt;subscription-id&amp;gt;"  # Set subscription context
az aks get-credentials --name &amp;lt;aks-cluster-name&amp;gt; --resource-group &amp;lt;resource-group&amp;gt;  # Get credentials for AKS
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;kubectl config&lt;/code&gt; Commands&lt;br&gt;
To manage multiple clusters, here are some useful commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# List all available clusters
kubectl config get-contexts             

# Get the current AKS cluster connected with kubectl (e.g., dev/uat/prod)
kubectl config current-context          

# Switch between clusters (dev/uat/prod)
kubectl config use-context &amp;lt;cluster-name&amp;gt;                     
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Deployments and Scaling
&lt;/h4&gt;

&lt;p&gt;Managing deployments and scaling is where Kubernetes really shines. Use the following commands to control your apps in the cluster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl create deployment &amp;lt;your-deployment&amp;gt; --image=nginx
kubectl expose deployment &amp;lt;your-deployment&amp;gt; --port=80 --type=NodePort
kubectl scale deployment &amp;lt;your-deployment&amp;gt; --replicas=3
kubectl rollout status deployment/&amp;lt;your-deployment&amp;gt;
kubectl rollout undo deployment/&amp;lt;your-deployment&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Logs, Exec, and Debugging
&lt;/h4&gt;

&lt;p&gt;When things go wrong, you need to dig deep. Here are some useful ways to get the logs, interact with your containers, and troubleshoot effectively.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl logs &amp;lt;pod-name&amp;gt; --tail &amp;lt;nr-of-lines&amp;gt;        # Shows specific number of lines of the log
kubectl logs &amp;lt;pod-name&amp;gt; | findstr &amp;lt;search string&amp;gt;   # Shows log lines matching the string (findstr is Windows; use grep on Linux/macOS)
kubectl logs &amp;lt;pod-name&amp;gt; --timestamps=true           # Logs of a specific pod with timestamps
kubectl logs &amp;lt;pod-name&amp;gt; --since=1h                  # Logs for a specific duration (1 hour here)
kubectl logs &amp;lt;pod-name&amp;gt; --follow                    # Continuously shows the logs (Ctrl + C to exit)
kubectl logs &amp;lt;pod-name&amp;gt; --previous                  # Logs for a previous instantiation of a container
kubectl logs &amp;lt;pod-name&amp;gt; &amp;gt; &amp;lt;log-file-name&amp;gt;           # Write logs to a file
kubectl logs &amp;lt;pod-name&amp;gt; -c &amp;lt;container-name&amp;gt;         # Logs from a specific container in a multi-container pod
kubectl logs -l app=&amp;lt;label-value&amp;gt;                   # Logs from all pods having a common label
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When debugging, use &lt;code&gt;--follow&lt;/code&gt; for real-time monitoring; &lt;code&gt;--previous&lt;/code&gt; helps track down issues after a pod has restarted due to an error.&lt;/p&gt;

&lt;h4&gt;
  
  
  Getting Inside a Pod
&lt;/h4&gt;

&lt;p&gt;Need to jump into a running pod? Here's the command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl exec -it &amp;lt;pod-name&amp;gt; cmd.exe 

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For Windows containers use &lt;code&gt;cmd.exe&lt;/code&gt;, otherwise use &lt;code&gt;/bin/bash&lt;/code&gt; or &lt;code&gt;sh&lt;/code&gt; for Linux containers based on your preference.&lt;/p&gt;

&lt;h4&gt;
  
  
  Metrics &amp;amp; Resource Monitoring
&lt;/h4&gt;

&lt;p&gt;Stay on top of your cluster’s health with kubectl top to see resource usage!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl top pod &amp;lt;pod-name&amp;gt; --containers         
kubectl top pod &amp;lt;pod-name&amp;gt; --sort-by=cpu        
kubectl top node &amp;lt;node-name&amp;gt;                    
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  ConfigMaps &amp;amp; Secrets
&lt;/h4&gt;

&lt;p&gt;Handle sensitive data with kubectl commands for ConfigMaps and Secrets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl create configmap my-config --from-literal=env=prod
kubectl get configmaps
kubectl describe configmap my-config

kubectl get secrets
kubectl describe secret my-secret
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Final Thoughts
&lt;/h3&gt;

&lt;p&gt;Kubernetes can feel like magic—until it breaks. Then it’s all about knowing the right commands, fast!&lt;/p&gt;

&lt;p&gt;This cheat sheet aims to bridge the gap between just getting things working by Googling and really understanding how the pieces fit together. The more you use these commands, the more comfortable and handy they become!&lt;/p&gt;

&lt;p&gt;Got a favorite kubectl command or trick to share? Drop it in the comments...&lt;/p&gt;

&lt;p&gt;Happy learning! &lt;/p&gt;

</description>
      <category>k8s</category>
      <category>cheatsheet</category>
      <category>devops</category>
    </item>
    <item>
      <title>Sneak peek into Alibaba Cloud</title>
      <dc:creator>Nithyalakshmi Kamalakkannan</dc:creator>
      <pubDate>Wed, 06 Jul 2022 11:59:49 +0000</pubDate>
      <link>https://dev.to/ktnl/sneak-peek-into-alibaba-cloud-1b5a</link>
      <guid>https://dev.to/ktnl/sneak-peek-into-alibaba-cloud-1b5a</guid>
      <description>&lt;p&gt;&lt;strong&gt;Alibaba Cloud&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The cloud computing service provider currently operates in 27 data center regions and 84 global availability zones. It is primarily focused on mainland China and other Asia-Pacific regions, with a smaller number of regions in the U.S. and the European Union.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gq1ATWJm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3p38xve82z5jg3teh8a6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gq1ATWJm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3p38xve82z5jg3teh8a6.png" alt="Image description" width="880" height="389"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Products &amp;amp; Services&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Alibaba provides an expanding range of high-performance cloud products including large-scale computing, storage resources, and Big Data processing capabilities for users around the world. Highlighting few of the most common services offered in compute, storage, database, networking aspects here. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compute&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Alibaba Cloud Elastic Compute Service (ECS) provides fast memory and the latest CPUs to power cloud applications and achieve faster results with low latency along with capability to scale up or down based on real-time demands.&lt;br&gt;
Using the next-generation virtualization technology independently developed by Alibaba Cloud, ECS Bare Metal Instance features both the elasticity of a virtual server and the high-performance and comprehensive features of a physical server. This enables you to retain the elasticity capability of common ECS while delivering the same user experience as physical servers.&lt;br&gt;
Alibaba Cloud Container Service for Kubernetes (ACK) integrates virtualization, storage, networking, and security capabilities to deploy applications in high-performance and scalable containers and provides full lifecycle management of enterprise-class containerized applications. Alibaba also offers Container Registry, a platform to manage images throughout the image life cycle with easy image permission management. This service simplifies the creation and maintenance of the image registry and supports image management in multiple regions. Combined with other cloud services such as Container Service, Container Registry provides an optimized solution for using Docker in the cloud.&lt;br&gt;
Alibaba Cloud Function Compute is a fully managed, event-driven compute service focused on writing and uploading code without having to manage infrastructure such as servers. No fees are incurred for up to 1,000,000 invocations and 400,000 CU-second compute resources per month.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Storage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;High volumes of any type of unstructured data, such as image, audio, and video files, can be stored in the cloud with encryption and high availability using Alibaba's Object Storage Service (OSS). These objects come with configurations that can be modified to meet region, access control, and storage class requirements. Alibaba offers four OSS storage classes: Standard, Infrequent Access (IA), Archive, and Cold Archive. &lt;br&gt;
Alibaba provides Elastic Block Storage (EBS) devices for ECS instances that come with low-latency storage, random read-write capabilities, and data persistence.&lt;br&gt;
The Apsara File Storage service provides network-attached storage (NAS) for ECS instances, Elastic High-Performance Computing instances, and Container Service for Kubernetes nodes. The distributed file system offers a maximum capacity of up to 10 PB and automatically scales as files are added or removed, with encryption of data at rest and in transit. It also offers shared access, high throughput, data replication, and backup, and can be accessed via standard file access protocols.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Ch4DKg8N--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rzma0xsqydi3hkvddaco.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ch4DKg8N--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rzma0xsqydi3hkvddaco.png" alt="Image description" width="880" height="242"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Database&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Alibaba offers a relational database service, compatible with MySQL, PostgreSQL and Oracle syntaxes with a 100 TB maximum storage capacity known as the ApsaraDB for PolarDB. This has low-latency physical replication, data backup and disaster recovery. Alibaba also offers PolarDB stacks, an on-premises database management appliance.&lt;br&gt;
The NoSQL database offering supports open source MongoDB protocol. ApsaraDB for MongoDB is a document database that features automatic monitoring and scalability with architecture configurations to enable standalone instances, replica set instances and sharded cluster instances.&lt;br&gt;
Alibaba Cloud offers a low-cost self-service database migration experience that supports homogeneous and heterogeneous migration smoothly from hundreds of GBs to multiple TBs with minimal business impact. With over 400,000 databases successfully migrated to Alibaba Cloud, the Database Architect Team has a proven record of making the migration process an efficient and hassle-free journey. Data Transmission Service (DTS) migrates and synchronizes data between data storage engines, such as relational databases, NoSQL, and OLAP, with just a few clicks.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Networking &amp;amp; CDN&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Server Load Balancer (SLB) distributes network traffic across groups of backend servers, improving service capability and application availability. It functions as a reverse proxy at Layer 7 (Application Load Balancer) and provides load balancing at Layer 4 (Classic Load Balancer).&lt;br&gt;
Network Intelligence Service (NIS) monitors the health status and performance of networks, performs diagnostics and troubleshoots issues, and analyzes and measures network traffic. &lt;br&gt;
Alibaba Cloud PrivateZone provides a private DNS service based on specified VPCs. PrivateZones resolve IP addresses and manage resources within the specified VPCs; these domain names cannot be accessed over the internet or by any resource outside the specified VPCs.&lt;/p&gt;

&lt;p&gt;The complete list of products and services can be found on their official product website. &lt;br&gt;
It provides a good amount of information about each service and guidance for getting started.&lt;/p&gt;

</description>
      <category>alibaba</category>
      <category>cloud</category>
      <category>beginners</category>
    </item>
    <item>
      <title>K8s Objects - Part 3 [Service]</title>
      <dc:creator>Nithyalakshmi Kamalakkannan</dc:creator>
      <pubDate>Tue, 17 May 2022 06:19:02 +0000</pubDate>
      <link>https://dev.to/ktnl/k8s-objects-part-3-service--3la4</link>
      <guid>https://dev.to/ktnl/k8s-objects-part-3-service--3la4</guid>
      <description>&lt;p&gt;Deployment ensures that the desired number of Pods are up and running with the desired configuration at any given point of time.&lt;/p&gt;

&lt;p&gt;But when a new Pod is added (due to scaling or version changes)... We know that a Pod has an IP to reach it, and we use port forwarding to reach it from the outside world. &lt;/p&gt;

&lt;p&gt;When a Pod changes, its IP changes along with it. Given this, maintaining your application with Pods whose IPs can frequently change is a challenge. To ensure seamless communication between your application and the outside world, the K8s Service comes as the saviour!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The K8s Service&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The K8s Service is a virtual component that consists of a set of iptables rules for the cluster.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It is used to expose Pods, instead of talking to the pods you end up talking with just the service.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The service takes the responsibility of routing the traffic and/or communicating with the Pods.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Linking the service with your K8s Deployment, ReplicaSet, or Pod is simple and consistent - as usual, the match labels do this for you :)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Types of Services&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There are four types of K8s Services.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;ClusterIP&lt;/li&gt;
&lt;li&gt;NodePort&lt;/li&gt;
&lt;li&gt;LoadBalancer&lt;/li&gt;
&lt;li&gt;ExternalName&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By default, K8s creates a ClusterIP type of service. We can build different kinds of services by specifying the type in the &lt;code&gt;spec.type&lt;/code&gt; field of the service configuration file.&lt;br&gt;
Let's explore them one by one!&lt;/p&gt;

&lt;p&gt;To be able to demo the LoadBalancer type of service, I have created a cluster in Azure and will create these services there.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ClusterIP&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Exposes your service within your cluster on a cluster-internal IP. Applications can interact with other applications internally using the ClusterIP.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The Service is unreachable from outside the cluster. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;This is the default ServiceType.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Demo time!&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;clusterIp-service-demo.yml&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KHCn5WFu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rjxa6yhtfmdbxypaq7tc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KHCn5WFu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rjxa6yhtfmdbxypaq7tc.png" alt="Image description" width="578" height="1236"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This configuration file creates a deployment managing three nginx Pods and exposes them under the ClusterIP service.&lt;/p&gt;
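&lt;p&gt;For reference, a minimal sketch of such a configuration (the resource names here are illustrative, not necessarily the ones in the screenshot):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 3                  # three nginx Pods
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx             # the label the Service matches on
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: nginx-clusterip-service
spec:
  type: ClusterIP              # the default; may be omitted
  selector:
    app: nginx                 # links the Service to the Pods above
  ports:
  - port: 80
    targetPort: 80
&lt;/code&gt;&lt;/pre&gt;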

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Rigj2Hs3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/d7650k7vpgdj2gevi8rs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Rigj2Hs3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/d7650k7vpgdj2gevi8rs.png" alt="Image description" width="880" height="357"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These Pods can be accessed by other applications within the cluster using the exposed cluster-internal IP &lt;code&gt;10.0.248.71&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NodePort&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Exposes your Service outside the cluster. This opens a static port on each node and maps it to the Pods, so the Service is accessible at NodeIP:NodePort.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;NodeIP is the IP address of your node, and NodePort is the port at which you choose to expose the Service, usually taken from the range 30000&amp;ndash;32767.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Demo Time!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Lz_og3Bq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3uw3sdb40z1875di7j20.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Lz_og3Bq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3uw3sdb40z1875di7j20.png" alt="Image description" width="578" height="1272"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Applying this configuration results in three more Pods.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vzIvvh02--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1y2cmzs7mrzzxcj1hd4x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vzIvvh02--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1y2cmzs7mrzzxcj1hd4x.png" alt="Image description" width="880" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These Pods are now exposed to the outside world through the NodePort service at NodeIP:30003 (30003 because it was specified in the configuration file; otherwise K8s randomly picks a port from the allowed range).&lt;/p&gt;
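&lt;p&gt;A sketch of the Service portion of such a NodePort configuration (names are illustrative; a Deployment like the earlier one would accompany it):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: Service
metadata:
  name: nginx-nodeport-service
spec:
  type: NodePort
  selector:
    app: nginx
  ports:
  - port: 80         # port exposed on the cluster-internal IP
    targetPort: 80   # port the Pods listen on
    nodePort: 30003  # static port opened on every node
&lt;/code&gt;&lt;/pre&gt;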

&lt;p&gt;&lt;strong&gt;LoadBalancer&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;This creates a load balancer in your cloud provider (AWS, GCP, Azure, etc.) and exposes our application to the Internet.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The cloud provider supplies the mechanism for routing the traffic to the Service.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Demo Time!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XqofeDKp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mhrubdewvy1mpint9esi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XqofeDKp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mhrubdewvy1mpint9esi.png" alt="Image description" width="632" height="1272"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Applying the configuration creates Pods exposed via the load balancer.&lt;/p&gt;
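&lt;p&gt;The Service portion of such a LoadBalancer configuration looks roughly like this (names are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: Service
metadata:
  name: nginx-loadbalancer-service
spec:
  type: LoadBalancer   # asks the cloud provider to provision a load balancer
  selector:
    app: nginx
  ports:
  - port: 80
    targetPort: 80
&lt;/code&gt;&lt;/pre&gt;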

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lnYyLQpB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7mhe40wfcb102fvcmpet.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lnYyLQpB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7mhe40wfcb102fvcmpet.png" alt="Image description" width="880" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, the application is accessible to the world at 20.198.164.24. (Don't try to access it; I will delete the service shortly to save my pennies :P)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zCIAvFDu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/y1ct7eudot39fhkybvpu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zCIAvFDu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/y1ct7eudot39fhkybvpu.png" alt="Image description" width="880" height="466"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ExternalName&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Maps the Service to the contents of the &lt;code&gt;externalName&lt;/code&gt; field. Accessing the Service from within your cluster redirects you to the externalName you have provided.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It is not tied to any typical selector labels. Rather, it returns a CNAME record pointing to the external server.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--E_q-3i7w--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jej9fm8oq2q5x7r8mxlw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--E_q-3i7w--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jej9fm8oq2q5x7r8mxlw.png" alt="Image description" width="568" height="514"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This creates a Service which, when accessed at &lt;code&gt;my-service.default.svc.cluster.local&lt;/code&gt;, resolves to the contents of &lt;code&gt;my.service.com&lt;/code&gt;.&lt;/p&gt;
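&lt;p&gt;Such an ExternalName Service can be sketched as follows (note there is no selector; the names match the ones used above):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  type: ExternalName
  externalName: my.service.com   # DNS name the CNAME record points to
&lt;/code&gt;&lt;/pre&gt;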

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nU34mb5m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xw3i7lu180g6ivqox5bd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nU34mb5m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xw3i7lu180g6ivqox5bd.png" alt="Image description" width="880" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hope this gives you an introduction to the K8s Service object. See you in the next blog!&lt;/p&gt;

&lt;p&gt;Happy learning! &lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
