<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Rubens Barbosa</title>
    <description>The latest articles on DEV Community by Rubens Barbosa (@rubnsbarbosa).</description>
    <link>https://dev.to/rubnsbarbosa</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F849597%2F371054c4-3dc2-4942-bf2e-18b42a7342fc.JPG</url>
      <title>DEV Community: Rubens Barbosa</title>
      <link>https://dev.to/rubnsbarbosa</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rubnsbarbosa"/>
    <language>en</language>
    <item>
      <title>Slowly Changing Dimensions (SCD)</title>
      <dc:creator>Rubens Barbosa</dc:creator>
      <pubDate>Sat, 12 Jul 2025 16:02:47 +0000</pubDate>
      <link>https://dev.to/rubnsbarbosa/slowly-changing-dimensions-scd-2a2e</link>
      <guid>https://dev.to/rubnsbarbosa/slowly-changing-dimensions-scd-2a2e</guid>
      <description>&lt;p&gt;Slowly Changing Dimensions (SCD) are a fundamental part of Dimensional Data Modeling, particularly in data warehousing and business intelligence. Before we delve into the details of SCD, it is helpful to focus on some fundamental concepts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is Data Modeling?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Data Modeling is the process of creating a visual representation/diagram of a data system, showing its entities and the relationships between them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is a dimension?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Dimensions are attributes of an entity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is an entity?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An entity is a representation of either a real-world object or an abstract concept.&lt;/p&gt;

&lt;p&gt;Example of entities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Real-World Objects:&lt;/strong&gt; customer, car, product etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Concepts:&lt;/strong&gt; course, sale, order etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Attributes of an entity&lt;/strong&gt; aka &lt;strong&gt;dimensions&lt;/strong&gt; are specific pieces of information about that entity. For example, a customer entity might have attributes such as: name, birthday, address, and phone number.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dimensions are categorized into two types
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft2bq7fcoh0dokksj9nd7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft2bq7fcoh0dokksj9nd7.png" alt="dimensions-types" width="800" height="714"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In data modeling, it is important to consider whether an attribute is fixed or slowly changing. Considering the above attributes of a customer entity, the birthday is a great example of a fixed dimension: no one can change their birthday. The phone number is an example of a slowly changing dimension, because the customer can change their phone number over time; the attribute is time-dependent, and it changes slowly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Slowly Changing Dimensions (SCD)
&lt;/h2&gt;

&lt;p&gt;Slowly Changing Dimensions (SCD) are a concept in data warehousing that refers to how the attributes of a given dimension table are managed when some records change slowly over time, often in an unpredictable manner. There are different types of SCDs, each with its own approach to handling changes.&lt;/p&gt;

&lt;p&gt;In Germany, Deutsche Post offers a service called &lt;em&gt;"Post Nachsendeauftrag"&lt;/em&gt; that forwards orders/letters after a relocation (from the old address to the new one). To understand the different types of SCDs, I will use this example: &lt;strong&gt;&lt;em&gt;a customer has moved and now needs to inform Deutsche Post of his new address in order to receive letters at his new address&lt;/em&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  SCD Type 0: Fixed dimensions
&lt;/h3&gt;

&lt;p&gt;No changes are allowed. The dimension attributes remain static and are never updated.&lt;/p&gt;

&lt;h3&gt;
  
  
  SCD Type 1: Overwrite the old value
&lt;/h3&gt;

&lt;p&gt;The old value of the dimension attribute is overwritten with the new value. This approach does not keep any history of changes. For example, if a customer moves to a new address, the old address is updated with the new address.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;address&lt;/th&gt;
&lt;th&gt;city&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Friedrich&lt;/td&gt;
&lt;td&gt;Goethestraße 1&lt;/td&gt;
&lt;td&gt;München&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Sophia&lt;/td&gt;
&lt;td&gt;Eiffestraße 12&lt;/td&gt;
&lt;td&gt;Hamburg&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If Friedrich moves to Ebertstraße 17 in Berlin, the Deutsche Post customer dimension table would be updated as follows:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;address&lt;/th&gt;
&lt;th&gt;city&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Friedrich&lt;/td&gt;
&lt;td&gt;Ebertstraße 17&lt;/td&gt;
&lt;td&gt;Berlin&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Sophia&lt;/td&gt;
&lt;td&gt;Eiffestraße 12&lt;/td&gt;
&lt;td&gt;Hamburg&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
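
&lt;p&gt;Since SCD logic usually lives in an ETL job, here is a minimal sketch of a Type 1 overwrite expressed as a Spark/Scala DataFrame transformation (Spark is covered elsewhere in this feed). It is only illustrative: the table is built inline from the example rows above, and in a real warehouse this would typically be an UPDATE or MERGE against the dimension table.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark: SparkSession = SparkSession.builder().master("local[*]").appName("scd1").getOrCreate()
import spark.implicits._

// The customer dimension from the example above
val customers = Seq(
  (1, "Friedrich", "Goethestraße 1", "München"),
  (2, "Sophia", "Eiffestraße 12", "Hamburg")
).toDF("customer_id", "name", "address", "city")

// SCD Type 1: overwrite the old values in place; no history is kept
val updated = customers
  .withColumn("address", when($"customer_id" === 1, lit("Ebertstraße 17")).otherwise($"address"))
  .withColumn("city", when($"customer_id" === 1, lit("Berlin")).otherwise($"city"))

updated.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;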

&lt;h3&gt;
  
  
  SCD Type 2: Adding a new row
&lt;/h3&gt;

&lt;p&gt;In Type 2 SCD, a new row is added to the dimension table to represent the new value, and the old row is marked as inactive or expired. This approach maintains a full history of changes. &lt;/p&gt;

&lt;p&gt;Using the same Deutsche Post customer dimension table, if Friedrich moves to Ebertstraße 17 in Berlin, a new row is added for Friedrich with the new address, and the old row is marked as inactive.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;id&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;address&lt;/th&gt;
&lt;th&gt;city&lt;/th&gt;
&lt;th&gt;active&lt;/th&gt;
&lt;th&gt;effective_date&lt;/th&gt;
&lt;th&gt;end_date&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Friedrich&lt;/td&gt;
&lt;td&gt;Goethestraße 1&lt;/td&gt;
&lt;td&gt;München&lt;/td&gt;
&lt;td&gt;N&lt;/td&gt;
&lt;td&gt;2020-01-01&lt;/td&gt;
&lt;td&gt;2025-03-13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Friedrich&lt;/td&gt;
&lt;td&gt;Ebertstraße 17&lt;/td&gt;
&lt;td&gt;Berlin&lt;/td&gt;
&lt;td&gt;Y&lt;/td&gt;
&lt;td&gt;2025-03-14&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Sophia&lt;/td&gt;
&lt;td&gt;Eiffestraße 12&lt;/td&gt;
&lt;td&gt;Hamburg&lt;/td&gt;
&lt;td&gt;Y&lt;/td&gt;
&lt;td&gt;2022-02-10&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Instead of using NULL to indicate that a record is currently active, some dimensional models use a date far enough in the future to mark that the record is still valid, for example '9999-12-31' for active records.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;id&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;address&lt;/th&gt;
&lt;th&gt;city&lt;/th&gt;
&lt;th&gt;active&lt;/th&gt;
&lt;th&gt;effective_date&lt;/th&gt;
&lt;th&gt;end_date&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Friedrich&lt;/td&gt;
&lt;td&gt;Goethestraße 1&lt;/td&gt;
&lt;td&gt;München&lt;/td&gt;
&lt;td&gt;N&lt;/td&gt;
&lt;td&gt;2020-01-01&lt;/td&gt;
&lt;td&gt;2025-03-13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Friedrich&lt;/td&gt;
&lt;td&gt;Ebertstraße 17&lt;/td&gt;
&lt;td&gt;Berlin&lt;/td&gt;
&lt;td&gt;Y&lt;/td&gt;
&lt;td&gt;2025-03-14&lt;/td&gt;
&lt;td&gt;9999-12-31&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Sophia&lt;/td&gt;
&lt;td&gt;Eiffestraße 12&lt;/td&gt;
&lt;td&gt;Hamburg&lt;/td&gt;
&lt;td&gt;Y&lt;/td&gt;
&lt;td&gt;2022-02-10&lt;/td&gt;
&lt;td&gt;9999-12-31&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;strong&gt;effective date&lt;/strong&gt; column indicates the date from which a particular record of a dimension becomes active or valid. It marks the beginning of the period during which the information in that record is considered current.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;end date&lt;/strong&gt; column indicates the date when a particular record of a dimension is no longer active or valid. It marks the end of the period during which the information in that record was considered valid. This column might also be called expiry date, effective end date, or similar names.&lt;/p&gt;
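
&lt;p&gt;In pipeline terms, a Type 2 update is an "expire and append" operation. Below is a minimal Spark/Scala sketch of that logic, using the '9999-12-31' convention from the second table. The dates are the illustrative ones from the example, and a production pipeline would more likely run a MERGE against the dimension table.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark: SparkSession = SparkSession.builder().master("local[*]").appName("scd2").getOrCreate()
import spark.implicits._

// Current state of the dimension, before Friedrich moves
val dim = Seq(
  (1, "Friedrich", "Goethestraße 1", "München", "Y", "2020-01-01", "9999-12-31"),
  (2, "Sophia", "Eiffestraße 12", "Hamburg", "Y", "2022-02-10", "9999-12-31")
).toDF("id", "name", "address", "city", "active", "effective_date", "end_date")

// The currently active row whose address is changing
val isChanging = ($"id" === 1).and($"active" === "Y")

// Step 1: expire the old row (close its validity window, mark it inactive)
val expired = dim
  .withColumn("end_date", when(isChanging, lit("2025-03-13")).otherwise($"end_date"))
  .withColumn("active", when(isChanging, lit("N")).otherwise($"active"))

// Step 2: append a new row carrying the new address
val newVersion = Seq(
  (1, "Friedrich", "Ebertstraße 17", "Berlin", "Y", "2025-03-14", "9999-12-31")
).toDF("id", "name", "address", "city", "active", "effective_date", "end_date")

expired.union(newVersion).show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;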

&lt;h3&gt;
  
  
  SCD Type 3: Adding a new column
&lt;/h3&gt;

&lt;p&gt;A new column is added to the dimension table to store the new value, while the old value is preserved in the original column. This approach maintains limited history, as only the previous value is preserved.&lt;/p&gt;

&lt;p&gt;If Friedrich moves to Ebertstraße 17 in Berlin, a new column is added to store the new address, and the old address is preserved in the original column.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;previous_address&lt;/th&gt;
&lt;th&gt;new_address&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Friedrich&lt;/td&gt;
&lt;td&gt;Goethestraße 1, München&lt;/td&gt;
&lt;td&gt;Ebertstraße 17, Berlin&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Sophia&lt;/td&gt;
&lt;td&gt;Eiffestraße 12, Hamburg&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
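
&lt;p&gt;A Type 3 change is just a column-level operation, as the following short sketch shows. It assumes the same SparkSession boilerplate and imports as the previous sketches.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;// Assumes the SparkSession, functions._ and spark.implicits._ from the sketches above
val customers3 = Seq(
  (1, "Friedrich", "Goethestraße 1, München"),
  (2, "Sophia", "Eiffestraße 12, Hamburg")
).toDF("customer_id", "name", "previous_address")

// SCD Type 3: add a column for the new value; only one level of history survives
val scd3 = customers3.withColumn(
  "new_address",
  when($"customer_id" === 1, lit("Ebertstraße 17, Berlin")).otherwise(lit(null).cast("string"))
)

scd3.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;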

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;There are other types of SCDs, but in this article I will only cover the ones above. Each type of SCD has its own advantages and disadvantages, and the choice of which type to use depends on the specific requirements of the data warehousing project. I would say SCD Type 2 is the most commonly used in data warehousing and business intelligence, because it stores the history of changes in dimension attributes, which is crucial for many analytical and reporting purposes.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"In a well-designed dimensional model, dimension tables have many columns or attributes. It is not uncommon for a dimension table to have 50 to 100 attributes. The power of the data warehouse is directly proportional to the quality and depth of the dimension attributes."&lt;/em&gt; (The Data Warehouse Toolkit)&lt;/p&gt;

</description>
      <category>datawarehouse</category>
      <category>dataengineering</category>
      <category>database</category>
    </item>
    <item>
      <title>Apache Spark 101</title>
      <dc:creator>Rubens Barbosa</dc:creator>
      <pubDate>Sat, 25 May 2024 20:47:37 +0000</pubDate>
      <link>https://dev.to/rubnsbarbosa/apache-spark-101-2p68</link>
      <guid>https://dev.to/rubnsbarbosa/apache-spark-101-2p68</guid>
      <description>&lt;p&gt;In order to understand Spark let's remember what was the scenario before its creation. A couple of years ago computers became faster every year through processor speed increases. This trend in hardware stopped around 2005 due to hard limits in heat dissipation. So, hardware engineers stopped making individual processors faster, and started adding &lt;strong&gt;parallel CPU cores all running at the same speed&lt;/strong&gt;. As a result of this change, applications needed to be modified to add parallelism in order to run faster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Google&lt;/strong&gt; wanted to run giant computations on high volumes of data across large clusters, because it was creating indexes of all the content of the web in order to identify the most important pages. So, Google &lt;strong&gt;designed MapReduce, a parallel data processing framework&lt;/strong&gt;, which enabled it to index the web.&lt;/p&gt;

&lt;p&gt;At that time, Hadoop MapReduce was the dominant parallel programming engine for clusters of thousands of nodes. So, why was Spark created? Well, &lt;strong&gt;the MapReduce engine made it challenging and inefficient to build large applications&lt;/strong&gt; that chained multiple MapReduce jobs together, which &lt;strong&gt;caused a lot of reading and writing to disk&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;To address this issue, the &lt;strong&gt;Spark&lt;/strong&gt; team first designed an API based on functional programming that could express multistep applications. The team then implemented this API over a new engine that &lt;strong&gt;could perform efficient, in-memory data sharing across computation steps&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is Apache Spark?
&lt;/h3&gt;

&lt;p&gt;Apache Spark is an open-source unified computing engine and a set of libraries for parallel data processing on computer clusters. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Spark is a fast engine for large-scale data processing&lt;/strong&gt;. Basically, the idea is that we can write code describing how we want to transform a huge amount of data, and Spark will figure out how to distribute that work across an entire cluster of computers, i.e., the driver sends tasks to workers, which run/process them in parallel. Apache Spark takes a massive data set and distributes the processing across an entire set of computers that work together in parallel at the same time.&lt;/p&gt;

&lt;p&gt;In a nutshell Spark can execute tasks on data across a cluster of computers. &lt;/p&gt;

&lt;p&gt;NOTE: Spark itself is written in Scala and runs on the Java Virtual Machine (JVM). Therefore, to run Spark either on your laptop or on a cluster, you need an installation of Java.&lt;/p&gt;

&lt;h3&gt;
  
  
  Architecture
&lt;/h3&gt;

&lt;p&gt;Spark application architecture at a high level&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg3rielawpqybv0nzq5ed.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg3rielawpqybv0nzq5ed.png" alt="spark-architecture" width="800" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Spark architecture consists of a driver process, executors, a cluster manager, and worker nodes. Apache Spark follows a master/worker architecture; it has a single master and any number of workers.&lt;/p&gt;

&lt;p&gt;There are some key components under the hood, such as: the Driver Program, Cluster Manager, Tasks, Partitions, Executors, and Worker Nodes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Spark APIs
&lt;/h3&gt;

&lt;p&gt;When working with Spark, we will come across different APIs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;RDD (Resilient Distributed Datasets) API&lt;/li&gt;
&lt;li&gt;DataFrame API&lt;/li&gt;
&lt;li&gt;Dataset API&lt;/li&gt;
&lt;li&gt;SQL API&lt;/li&gt;
&lt;li&gt;Structured Streaming API&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  RDDs
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;RDDs are distributed collections of objects that can be processed in parallel;&lt;/li&gt;
&lt;li&gt;RDDs support two types of operations: transformations (which produce a new RDD) and actions (which return a value to the driver program after running a computation on the dataset);&lt;/li&gt;
&lt;li&gt;RDDs provide low-level control over data flow and data processing/operations;&lt;/li&gt;
&lt;li&gt;RDDs are fault tolerant, automatically recovering data lost due to node failures using lineage information (data lineage is the process of tracking the flow of data over time);&lt;/li&gt;
&lt;li&gt;RDDs don’t infer the schema of the data; we need to specify it ourselves.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;RDD Scala example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.spark.rdd.RDD&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.spark.sql.SparkSession&lt;/span&gt;

&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;spark&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;SparkSession&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SparkSession&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;master&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"local[*]"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;appName&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"rdd"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;getOrCreate&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;// I wanna square everything&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;rdd&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;spark&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;sparkContext&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;parallelize&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
&lt;span class="c1"&gt;// we are creating a new RDD called squares&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;rddSquares&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;rdd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;map&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rddSquares&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;res = 1, 4, 9, 16&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The beauty of this example is that it could be distributed. If the RDD were really massive, Spark could split the processing up, handle the squaring of different chunks of the RDD on different nodes within our cluster, and send the results back to the driver script to give us the final answer we want.&lt;/p&gt;

&lt;p&gt;Another example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.spark.rdd.RDD&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.spark.sql.SparkSession&lt;/span&gt;

&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;spark&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;SparkSession&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SparkSession&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;master&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"local[*]"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;appName&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"rdd"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;getOrCreate&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;rddNums&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;RDD&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;Int&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;spark&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;sparkContext&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;parallelize&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;rddCollect&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Array&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;Int&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;rddNums&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;collect&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Action: RDD converted to Array[Int]"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's talk about the &lt;code&gt;rdd.collect()&lt;/code&gt; method in Apache Spark: it is a powerful and potentially problematic operation. It is used to retrieve the entire &lt;code&gt;rdd&lt;/code&gt; from the distributed environment back to the local driver program. Because the &lt;code&gt;collect()&lt;/code&gt; method requires the full dataset in memory, it carries significant risks and potential issues, especially when dealing with large datasets.&lt;/p&gt;

&lt;p&gt;Issues with &lt;code&gt;rdd.collect()&lt;/code&gt;&lt;br&gt;
&lt;strong&gt;memory overload&lt;/strong&gt; because it transfers all data from the distributed nodes to the driver node. If the dataset is large, this can cause the &lt;strong&gt;driver program to run out of memory&lt;/strong&gt; and crash, because it tries to fit the entire dataset into the limited memory of the driver node. Imagine calling &lt;code&gt;rdd.collect()&lt;/code&gt; on terabytes of data: it will try to bring all that data into the memory of a single machine, aka the driver, which is often impossible. So, in this scenario the job will almost certainly fail.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;network bottleneck&lt;/strong&gt; due to transferring large amounts of data over the network from the worker nodes to the driver node. This can lead to slow performance of the Spark job.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;reduced parallelism&lt;/strong&gt; one of the strengths of Spark is its ability to process data in parallel across a cluster; using &lt;code&gt;collect()&lt;/code&gt; negates this advantage by aggregating all the data onto a single node, losing the benefits of distributed processing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rules of thumb&lt;/strong&gt; avoid using &lt;code&gt;collect()&lt;/code&gt; as much as possible; its use should be approached with caution. There are some best practices: instead of collecting the entire dataset, use Spark actions such as &lt;code&gt;take(n)&lt;/code&gt;, &lt;code&gt;aggregate()&lt;/code&gt;, or &lt;code&gt;reduce()&lt;/code&gt; to perform computations on the data directly within the distributed environment. Also, persist intermediate results in memory or on disk using &lt;code&gt;persist()&lt;/code&gt; or &lt;code&gt;cache()&lt;/code&gt;.&lt;/p&gt;
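
&lt;p&gt;To make those rules concrete, here is a small sketch on a made-up dataset: &lt;code&gt;take(n)&lt;/code&gt; ships only a handful of elements to the driver, and &lt;code&gt;reduce()&lt;/code&gt; aggregates on the executors so that only the final value travels over the network.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder().master("local[*]").appName("actions").getOrCreate()

val rdd = spark.sparkContext.parallelize(1 to 1000)

// Only a small sample reaches the driver
val firstFive = rdd.take(5)

// The sum is computed on the executors; a single Int comes back
val total = rdd.reduce(_ + _)

// Keep a reused RDD in memory instead of recomputing it
rdd.cache()

println(s"first five: ${firstFive.mkString(", ")}, total: $total")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
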
&lt;h4&gt;
  
  
  DataFrame
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;A DataFrame is a distributed collection of rows under named columns (similar to a table in a relational database);&lt;/li&gt;
&lt;li&gt;Built on top of RDDs, it provides a higher-level abstraction for structured data;&lt;/li&gt;
&lt;li&gt;Simplifies data manipulation with a high-level API;&lt;/li&gt;
&lt;li&gt;Easily integrates with various data sources like JSON, CSV, Parquet, etc;&lt;/li&gt;
&lt;li&gt;It does not support compile-time safety, so the user is limited in case the structure of the data is not known.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The DataFrame API makes it easier to perform complex data processing tasks&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.spark.sql.SparkSession&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.spark.sql.functions._&lt;/span&gt;

&lt;span class="c1"&gt;// Initialize SparkSession&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;spark&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;SparkSession&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SparkSession&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;master&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"local[*]"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;appName&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"dataframe"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;getOrCreate&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;// Create DataFrame from CSV file&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;filePath&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"path/to/your/csvfile.csv"&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;df&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;spark&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;read&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"header"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"true"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"inferSchema"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"true"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;csv&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filePath&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// Show the first 5 rows&lt;/span&gt;
&lt;span class="nv"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;show&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
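
&lt;p&gt;For instance, continuing the example above and assuming the CSV contains hypothetical &lt;code&gt;city&lt;/code&gt; and &lt;code&gt;age&lt;/code&gt; columns, a filter/groupBy/aggregate pipeline is just a few method calls:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;// Continues from the df defined above; "city" and "age" are assumed columns
val perCity = df
  .filter(col("age") &amp;gt; 25)
  .groupBy(col("city"))
  .agg(avg(col("age")).alias("avg_age"), count(lit(1)).alias("clients"))
  .orderBy(desc("clients"))

perCity.show(5)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;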



&lt;h4&gt;
  
  
  Dataset
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Datasets are a distributed collection of data, combining the best features of RDDs and DataFrames;&lt;/li&gt;
&lt;li&gt;A Dataset is a strongly-typed, immutable collection of objects that are mapped to a relational schema;&lt;/li&gt;
&lt;li&gt;Ensures compile-time type safety and supports object-oriented programming paradigms;&lt;/li&gt;
&lt;li&gt;The main disadvantage of datasets is that they require typecasting into strings;&lt;/li&gt;
&lt;li&gt;We can use them for complex transformations on structured data where compile-time type checking is beneficial.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.spark.sql.&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="nc"&gt;Dataset&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;SparkSession&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Define the schema of our data&lt;/span&gt;
&lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Int&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// Initialize SparkSession&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;spark&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;SparkSession&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SparkSession&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;master&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"local[*]"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;appName&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"dataset"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;getOrCreate&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;spark.implicits._&lt;/span&gt;
&lt;span class="c1"&gt;// Create Dataset from a sequence of case class instances&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;data&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Seq&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
      &lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"John"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"München"&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt;
      &lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Jane"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Berlin"&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt;
      &lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Mike"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;35&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Frankfurt"&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt;
      &lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Sara"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Dachau"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;ds&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Dataset&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;Client&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;spark&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;createDataset&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// Show the content of the Dataset&lt;/span&gt;
&lt;span class="nv"&gt;ds&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;show&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using Datasets, we can benefit from the best features of RDDs and DataFrames: the type safety and object-oriented programming interface of RDDs, and the optimized execution and ease of use that the higher level of abstraction of DataFrames provides for working with structured data in Spark.&lt;/p&gt;
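
&lt;p&gt;For instance, continuing the example above, transformations on a &lt;code&gt;Dataset[Client]&lt;/code&gt; are checked against the case class at compile time, so a typo such as &lt;code&gt;_.agee&lt;/code&gt; would not even compile:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;// Typed, compile-time-checked transformations on the Dataset defined above
val adults: Dataset[Client] = ds.filter(_.age &amp;gt;= 28)
val names: Dataset[String] = adults.map(_.name)
names.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;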

&lt;h4&gt;
  
  
  SQL (via Spark SQL)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Allows users to run SQL queries directly on DataFrames or Datasets;&lt;/li&gt;
&lt;li&gt;Provides a way to query data using standard SQL syntax;&lt;/li&gt;
&lt;li&gt;Uses standard SQL, which is familiar to many data professionals;&lt;/li&gt;
&lt;li&gt;Queries return DataFrames, enabling further processing using the DataFrame API;&lt;/li&gt;
&lt;li&gt;Ad-hoc querying and data exploration.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.spark.sql.SparkSession&lt;/span&gt;

&lt;span class="c1"&gt;// Initialize SparkSession&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;spark&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;SparkSession&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SparkSession&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;master&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"local[*]"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;appName&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"sql"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;getOrCreate&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;// Create DataFrame from CSV file&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;filePath&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"path/to/your/csvfile.csv"&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;df&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;spark&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;read&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"header"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"true"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"inferSchema"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"true"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;csv&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filePath&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// Register the DF as a temp SQL view&lt;/span&gt;
&lt;span class="nv"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;createOrReplaceView&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"clients"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// Execute SQL queries &lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;allRowsDF&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;spark&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;sql&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"SELECT * FROM clients"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;allRowsDF&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;show&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using Spark SQL with Scala allows you to execute SQL queries directly on your data.&lt;/p&gt;
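
&lt;p&gt;And because &lt;code&gt;spark.sql&lt;/code&gt; returns a DataFrame, the result of one query can feed straight into further DataFrame calls. A short follow-up to the example above, again assuming the &lt;code&gt;clients&lt;/code&gt; view has a hypothetical &lt;code&gt;city&lt;/code&gt; column:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;// Aggregate with plain SQL, then keep working with the result as a DataFrame
val perCityDF = spark.sql("SELECT city, COUNT(*) AS clients FROM clients GROUP BY city")
perCityDF.orderBy(perCityDF("clients").desc).show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;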

&lt;h4&gt;
  
  
  Structured Streaming
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Built on the Spark SQL engine, it enables the same DataFrame and Dataset API to be used for stream processing;&lt;/li&gt;
&lt;li&gt;Uses the same API for batch and streaming data, simplifying the development process;&lt;/li&gt;
&lt;li&gt;Easy to use due to its high-level abstraction for defining streaming computations;&lt;/li&gt;
&lt;li&gt;Real-time data processing and analytics;&lt;/li&gt;
&lt;li&gt;Stream processing applications that require the same APIs and optimizations as batch processing.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.spark.sql.SparkSession&lt;/span&gt;

&lt;span class="c1"&gt;// Initialize SparkSession&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;spark&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;SparkSession&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SparkSession&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;master&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"local[*]"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;appName&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"StructuredStream"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;getOrCreate&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;kafkaStream&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;spark&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;readStream&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;format&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"kafka"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"kafka.bootstrap.servers"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"localhost:9092"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"subscribe"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"kafka_topic_name"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"startingOffsets"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"earliest"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;load&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;query&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kafkaStream&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;selectExpr&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"CAST(key AS STRING)"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"CAST(value AS STRING)"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;writeStream&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;outputMode&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"append"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;format&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"console"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;start&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;

&lt;span class="nv"&gt;query&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;awaitTermination&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, the code reads data from the Kafka topic. Then the key and value are written to the console in append output mode. The &lt;code&gt;awaitTermination&lt;/code&gt; method then blocks and waits for the started streaming query to terminate.&lt;/p&gt;

&lt;p&gt;Structured Streaming in Spark is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. It allows you to work with streaming data in the same way you would work with batch data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why should we use Spark?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Spark can run programs up to 100 times faster than Hadoop MapReduce when processing data in memory;&lt;/li&gt;
&lt;li&gt;It offers fast processing speed through in-memory caching and in-memory data processing;&lt;/li&gt;
&lt;li&gt;Spark is a very mature technology; it has been out for a while, so it is reliable at this point;&lt;/li&gt;
&lt;li&gt;Spark is not that hard to learn, and applications can be implemented in a variety of programming languages like Scala, Java, and Python;&lt;/li&gt;
&lt;li&gt;Spark bundles powerful libraries.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's all folks :)&lt;/p&gt;

</description>
      <category>apache</category>
      <category>spark</category>
      <category>scala</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Gradient Descent, an Optimization Method used in Machine Learning</title>
      <dc:creator>Rubens Barbosa</dc:creator>
      <pubDate>Tue, 17 May 2022 10:12:18 +0000</pubDate>
      <link>https://dev.to/rubnsbarbosa/gradient-descent-an-optimization-method-used-in-machine-learning-1njp</link>
      <guid>https://dev.to/rubnsbarbosa/gradient-descent-an-optimization-method-used-in-machine-learning-1njp</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;In this article we are going to use some concepts of Differential Calculus, mainly in partial derivative and chain rule.&lt;/em&gt; 📚&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fam0nr3qmoxd40omlis86.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fam0nr3qmoxd40omlis86.png" alt="function with minima and maxima locals" width="616" height="386"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Optimization
&lt;/h2&gt;

&lt;p&gt;A mathematical optimization problem, or just optimization problem, has only one goal: to find the best element from a set of candidates through a function known as cost function, loss function, or objective function.&lt;/p&gt;

&lt;p&gt;Mathematically, an unconstrained optimization problem with decision variables &lt;strong&gt;θ&lt;/strong&gt; and cost function L has the following form&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F69irtl9xku2e5z04b0kh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F69irtl9xku2e5z04b0kh.png" alt="min" width="227" height="63"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We are interested in finding a vector &lt;strong&gt;θ&lt;/strong&gt; that leads to the lowest value of the cost function. If it exists, this vector is called the optimal solution or global minimum and will be denoted by &lt;strong&gt;θ&lt;/strong&gt;*, as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4nucwmzjbhom31kg3air.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4nucwmzjbhom31kg3air.png" alt="min global" width="665" height="69"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note that, depending on the function, the equation above might not have an optimal solution. Sometimes the function may have more than one local minimum or maximum, like the graph presented at the top of this article. However, we want to find a global minimum. In this context, let’s assume that our problem has an optimal solution.&lt;/p&gt;

&lt;p&gt;Assuming L is differentiable and convex, a necessary and sufficient condition for a vector &lt;strong&gt;θ&lt;/strong&gt;* to be an optimal solution is&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fipo8drcjakbukbtpqo2g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fipo8drcjakbukbtpqo2g.png" alt="gradient" width="156" height="52"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;where ∇ is the gradient operator:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvbv19hfb7ssaa9pbo89j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvbv19hfb7ssaa9pbo89j.png" alt="gradient" width="225" height="145"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Concretely, if the function is strictly convex, &lt;strong&gt;θ&lt;/strong&gt;* is the unique optimal solution to the optimization problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost Function
&lt;/h2&gt;

&lt;p&gt;In Machine Learning it is common to compute the cost function in order to know how our model is performing, because our purpose is to find the best hypothesis. The cost function computes the error over the whole training data set. So, we want to minimize the cost function, i.e., minimize the error of the machine learning model.&lt;/p&gt;

&lt;p&gt;There are different types of cost functions, and we choose one depending on the model we are going to work with. In this article, we use the linear regression model, and our cost function is the mean squared error, as shown below&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsgx6r693gq925wmo59sb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsgx6r693gq925wmo59sb.png" alt="MSE" width="669" height="210"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Gradient Descent
&lt;/h2&gt;

&lt;p&gt;Gradient descent is probably one of the simplest and most widely used iterative algorithms for the optimization of continuous and differentiable functions. The basic idea behind the method is to begin from an initial point chosen randomly and improve it repeatedly, taking small steps in the opposite direction of the gradient at each iteration.&lt;/p&gt;

&lt;p&gt;That is, we start with an initial guess &lt;strong&gt;θ&lt;/strong&gt;(0) and at each iteration t = 0, 1, 2, . . . we compute the update rule:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc5h69nxv5wuyin7dmpha.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc5h69nxv5wuyin7dmpha.png" alt="gradient_descent" width="309" height="63"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The choice of the learning rate α is crucial for the convergence of the gradient descent. In practice, it is common to adopt a constant value for α.&lt;/p&gt;

&lt;p&gt;−∇L(&lt;strong&gt;θ&lt;/strong&gt;) is the direction in which L decreases at &lt;strong&gt;θ&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Applying gradient descent to the cost function
&lt;/h2&gt;

&lt;p&gt;We are going to use the linear regression model represented below&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4xb7f6cqoergx3d7w8xr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4xb7f6cqoergx3d7w8xr.png" alt=" " width="352" height="170"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Find the gradient vector of &lt;strong&gt;θ&lt;/strong&gt;(0)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fficdfpfwmgpk8pvnuxu3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fficdfpfwmgpk8pvnuxu3.png" alt="theta zero" width="598" height="637"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Find the gradient vector of &lt;strong&gt;θ&lt;/strong&gt;(1)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiva4q2xljyt9z3k9duzb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiva4q2xljyt9z3k9duzb.png" alt="theta one" width="584" height="638"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can visualize our cost function with the illustration below. Observe that our function is convex; thus, the problem always has a global minimum, our solution &lt;strong&gt;θ&lt;/strong&gt;*.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyz2hfr0vmyipbq2x03kq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyz2hfr0vmyipbq2x03kq.png" alt="convex cost function" width="616" height="386"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Gradient Descent Algorithm
&lt;/h2&gt;

&lt;p&gt;Below you can find the pseudocode of gradient descent&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fglwmt6xxh9oodh1x7ouu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fglwmt6xxh9oodh1x7ouu.png" alt="pseudocode" width="731" height="357"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this article, we learned the theory of gradient descent applied to the linear regression model. However, gradient descent can be used in other machine learning and deep learning algorithms. Something important for gradient descent is the choice of learning rate. Large values for the learning rate can result in overshooting; that is, if we update the parameters using large values for the learning rate, the function may fail to converge to the global minimum. On the other hand, with small values for the learning rate it would take too many iterations for the function to converge.&lt;/p&gt;

&lt;p&gt;Although I only talked about gradient descent, there are other versions of it, such as batch gradient descent, stochastic gradient descent, and mini-batch gradient descent. I would like you to read about these algorithms and try to understand when you might use each one of them.&lt;/p&gt;

&lt;p&gt;I hope you enjoyed reading this article. Thank you! 🙂&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Getting started with Apache Kafka using Python</title>
      <dc:creator>Rubens Barbosa</dc:creator>
      <pubDate>Sat, 07 May 2022 11:40:05 +0000</pubDate>
      <link>https://dev.to/rubnsbarbosa/getting-started-with-apache-kafka-using-python-36ko</link>
      <guid>https://dev.to/rubnsbarbosa/getting-started-with-apache-kafka-using-python-36ko</guid>
      <description>&lt;p&gt;Apache Kafka is a distributed streaming system that provide real-time access to the data. This system let us publish and subscribe to streams of data, store them, and process them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Message&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;The unit of data within Kafka is called a message. A message is simply an array of bytes. A message can have an optional bit of metadata, which is referred to as a key.&lt;/p&gt;

&lt;p&gt;For efficiency, messages are written into Kafka in batches. A batch is just a collection of messages, all of which are being produced to the same topic and partition.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Topics&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Messages in Kafka are categorized into topics. The closest analogies for a topic are a database table or a folder in a filesystem. Topics are additionally broken down into a number of partitions. Note that as a topic typically has multiple partitions, there is no guarantee of message time-ordering across the entire topic, just within a single partition.&lt;/p&gt;
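
&lt;p&gt;As a sketch using the kafka-python admin client (installed later in this article), a topic with multiple partitions can also be created programmatically; the topic name and counts are illustrative&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from kafka.admin import KafkaAdminClient, NewTopic

# three partitions let up to three consumers in a group read in parallel
admin = KafkaAdminClient(bootstrap_servers='localhost:9092')
admin.create_topics([NewTopic(name='orders', num_partitions=3, replication_factor=1)])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
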

&lt;p&gt;&lt;strong&gt;Producers and Consumers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Producers in Kafka are the client applications that create and send messages to topics. In some cases, the producer will direct messages to specific partitions. This is typically done using the message key and a partitioner that generates a hash of the key and maps it to a specific partition.&lt;/p&gt;
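
&lt;p&gt;For example, with the kafka-python client introduced later in this article, a producer can pass a key so that every message sharing that key is hashed to the same partition; the topic and key here are illustrative&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers='localhost:9092')
# all messages with key b'customer-42' land in the same partition
producer.send('first-topic', key=b'customer-42', value=b'order created')
producer.flush()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
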

&lt;p&gt;The consumer subscribes to one or more topics and reads the messages in the order in which they were produced. The consumer keeps track of which messages it has already consumed by storing the offset of messages. The offset is another bit of metadata, an integer value that continually increases, which Kafka adds to each message as it is produced. Each message in a given partition has a unique offset.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Brokers and Clusters&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A single Kafka server is called a broker. Depending on the specific hardware and its performance characteristics, a single broker can easily handle thousands of partitions and millions of messages per second. Kafka brokers are designed to operate as part of a cluster. Within a cluster of brokers, one broker will also function as the cluster controller.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retention&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A key feature of Apache Kafka is that of retention, which is the durable storage of messages for some period of time. Kafka brokers are configured with a default retention setting for topics, either retaining messages for some period of time (e.g., 7 days) or until the topic reaches a certain size in bytes (e.g., 1 GB). Once these limits are reached, messages are expired and deleted so that the retention configuration is a minimum amount of data available at any time. Individual topics can also be configured with their own retention settings so that messages are stored for only as long as they are useful.&lt;/p&gt;
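
&lt;p&gt;As an illustration, assuming the Homebrew installation used below (which puts Kafka's CLI tools on the PATH), a per-topic retention of 7 days (604800000 ms) could be set like this&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kafka-configs --bootstrap-server localhost:9092 \
--alter --entity-type topics --entity-name first-topic \
--add-config retention.ms=604800000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
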

&lt;p&gt;Now that we have an overview about Apache Kafka, let's install it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Installing Kafka&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I'll install Apache Kafka on macOS using Homebrew. To do so, I just need to type in my terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ brew install kafka
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Apache Kafka uses Zookeeper to store metadata about the Kafka cluster, as well as consumer client details. So, during the installation it will install Apache Zookeeper as well. We must already have Java installed on our machine.&lt;/p&gt;

&lt;p&gt;After installing Kafka we can see something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqrjovsdowsc4azkhfpxm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqrjovsdowsc4azkhfpxm.png" alt="Kafka-Logs" width="606" height="421"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Navigate to this directory in separate terminal sessions in order to execute Zookeeper and Kafka. The path might differ depending on your machine and OS.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cd /usr/local/opt/kafka/bin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;First, let's start Apache Zookeeper Server.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;zookeeper-server-start /usr/local/etc/kafka/zookeeper.properties
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, in another terminal session execute the command below&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kafka-server-start /usr/local/etc/kafka/server.properties
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All right, we have Apache Zookeeper and Apache Kafka running. What should we do now? Let's create a Kafka topic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Creation of Kafka Topic&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now, let's create a topic called first-topic in a new terminal session.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kafka-topics --create --topic first-topic \
--bootstrap-server localhost:9092 \
--replication-factor 1 --partitions 1

Created topic first-topic.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Producer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Produce messages to the &lt;strong&gt;first-topic&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kafka-console-producer --broker-list localhost:9092 \
--topic first-topic

&amp;gt;Sunday 1st May 2022
&amp;gt;Data Engineering    
&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Consumer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Consume messages from the &lt;strong&gt;first-topic&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kafka-console-consumer --bootstrap-server localhost:9092 \
--topic first-topic --from-beginning

Sunday 1st May 2022
Data Engineering 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;List Topics&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Listing all the Kafka topics in a cluster&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kafka-topics --list --bootstrap-server localhost:9092
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Delete Topic&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We might want to delete a specific topic&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kafka-topics --bootstrap-server localhost:9092 \
--delete --topic first-topic
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Producer &amp;amp; Consumer with Python
&lt;/h3&gt;

&lt;p&gt;Let's create a producer and a consumer using Python. First, we should create a virtual environment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ python3 -m venv env
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ python -m venv env
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Activate the virtual env in order to install the libraries&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ source env/bin/activate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's install the Python client for Apache Kafka and the Requests library&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ pip install kafka-python
$ pip install requests 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Python Producer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now, let's dive into our producer.py&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#!/usr/local/bin/python
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kafka&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;KafkaProducer&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timedelta&lt;/span&gt;

&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;basicConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INFO&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%(asctime)s - %(name)s - %(levelname)s - %(message)s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;producer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KafkaProducer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bootstrap_servers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;localhost:9092&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="c1"&gt;# get data from public API
&lt;/span&gt;    &lt;span class="n"&gt;date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;today&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;days&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;previous_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;year&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;month&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;day&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://indicadores.integrasus.saude.ce.gov.br/api/casos-coronavirus?dataInicio=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;previous_date&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;amp;dataFim=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;previous_date&lt;/span&gt;
    &lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;covid_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;covid_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;covid-topic&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;flush&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Python Consumer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Alright, let's have a look at our consumer.py&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#!/usr/local/bin/python
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kafka&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;KafkaConsumer&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;consumer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KafkaConsumer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;covid-topic&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;consumer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should have two terminal sessions open to run producer.py and consumer.py.&lt;/p&gt;
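
&lt;p&gt;Since the producer serialized each record with json.dumps, a variant of the consumer can decode the payload back into Python objects; a sketch, where auto_offset_reset makes it read the topic from the beginning and .offset shows the per-partition offset discussed earlier&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;#!/usr/local/bin/python
import json
from kafka import KafkaConsumer

if __name__ == "__main__":
    # decode each raw message value back into a Python dict
    consumer = KafkaConsumer('covid-topic',
                             auto_offset_reset='earliest',
                             value_deserializer=lambda m: json.loads(m.decode('utf-8')))
    for data in consumer:
        print(data.offset, data.value)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
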

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;We learned the main concepts of Apache Kafka: message/record, producers, consumers, topics, brokers and retention. An &lt;strong&gt;event&lt;/strong&gt; records the fact that "something happened". It is also called record or message. &lt;strong&gt;Producers&lt;/strong&gt; are those client applications that publish (write) events to Kafka, and &lt;strong&gt;consumers&lt;/strong&gt; are those that subscribe to (read and process) these events. Events are organized and durably stored in &lt;strong&gt;topics&lt;/strong&gt;. Very simplified, a topic is similar to a folder in a filesystem, and the events are the files in that folder.&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>programming</category>
      <category>python</category>
      <category>bash</category>
    </item>
    <item>
      <title>A Bunch of Linux Commands</title>
      <dc:creator>Rubens Barbosa</dc:creator>
      <pubDate>Sun, 01 May 2022 17:24:54 +0000</pubDate>
      <link>https://dev.to/rubnsbarbosa/a-bunch-of-linux-commands-4kn8</link>
      <guid>https://dev.to/rubnsbarbosa/a-bunch-of-linux-commands-4kn8</guid>
      <description>&lt;h2&gt;
  
  
  🚀 Summary
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Linux Overview&lt;/li&gt;
&lt;li&gt;System Information&lt;/li&gt;
&lt;li&gt;Files &amp;amp; Directory&lt;/li&gt;
&lt;li&gt;Compress &amp;amp; Extract Files&lt;/li&gt;
&lt;li&gt;Process Management&lt;/li&gt;
&lt;li&gt;File Permission&lt;/li&gt;
&lt;li&gt;Network&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🖥 Linux Overview
&lt;/h3&gt;

&lt;h4&gt;
  
  
  username@system_name:~$
&lt;/h4&gt;

&lt;h5&gt;
  
  
  The tilde (~) symbol stands for your home directory
&lt;/h5&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Directory&lt;/th&gt;
&lt;th&gt;Function&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;/&lt;/td&gt;
&lt;td&gt;begins the file system, called root&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;/home&lt;/td&gt;
&lt;td&gt;contains users' home directories&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;/bin&lt;/td&gt;
&lt;td&gt;all the standard commands and utility programs i.e. executable binaries such as cat, cp, ls, mv, ps, rm&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;/usr&lt;/td&gt;
&lt;td&gt;holds those files and commands used by the system&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;/var&lt;/td&gt;
&lt;td&gt;files that are expected to change in size and content (var stands for variable), such as mailbox files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;/dev&lt;/td&gt;
&lt;td&gt;file interfaces for devices such as the terminals and printers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;/etc&lt;/td&gt;
&lt;td&gt;is the home for system configuration files and any other system files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;/boot&lt;/td&gt;
&lt;td&gt;contains the few essential files needed to boot the system&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;/lib&lt;/td&gt;
&lt;td&gt;contains libraries (common code shared by applications and needed for them to run)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;/mnt&lt;/td&gt;
&lt;td&gt;it has been used since the early days of UNIX for temporarily mounting filesystems&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;/opt&lt;/td&gt;
&lt;td&gt;optional application software packages&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;/tmp&lt;/td&gt;
&lt;td&gt;temporary files; on some distributions erased across a reboot and/or may actually be a ramdisk in memory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;/sys&lt;/td&gt;
&lt;td&gt;virtual pseudo-filesystem giving information about the system and the hardware. Can be used to alter system parameters and for debugging purposes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;creating a user account&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ sudo useradd -m -c "Rubens Barbosa" -s /bin/bash rubnsbarbosa
$ sudo passwd rubnsbarbosa
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;check that the user was created correctly&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ grep rubnsbarbosa /etc/passwd /etc/group
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;connect as the new user&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ shh rubnsbarbosa@localhost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;remove the new user&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ sudo userdel -r rubnsbarbosa
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to check all users that have a home directory&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ls -l /home
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Keyboard Shortcuts
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;keyboard shortcut&lt;/th&gt;
&lt;th&gt;task&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;tab&lt;/td&gt;
&lt;td&gt;auto-completes files, directories, and binaries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ctrl + l&lt;/td&gt;
&lt;td&gt;clear the screen&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ctrl + a&lt;/td&gt;
&lt;td&gt;goes to the beginning of the line&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ctrl + e&lt;/td&gt;
&lt;td&gt;goes to the end of the line&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ctrl + d&lt;/td&gt;
&lt;td&gt;exits the current shell&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ctrl + z&lt;/td&gt;
&lt;td&gt;puts the current process into suspended background&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ctrl + c&lt;/td&gt;
&lt;td&gt;kill the current process&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ctrl + h&lt;/td&gt;
&lt;td&gt;works the same as backspace&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ctrl + w&lt;/td&gt;
&lt;td&gt;deletes the word before the cursor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ctrl + u&lt;/td&gt;
&lt;td&gt;deletes from beginning of line to cursor position&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  🧐 System Information
&lt;/h3&gt;

&lt;p&gt;the man pages, which are manuals for Linux commands, available from the command line interface (CLI)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;man &lt;span class="nb"&gt;ls&lt;/span&gt;  
&lt;span class="nv"&gt;$ &lt;/span&gt;man &lt;span class="nb"&gt;mkdir&lt;/span&gt;  
&lt;span class="nv"&gt;$ &lt;/span&gt;man &lt;span class="nb"&gt;rm&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;man &lt;span class="nb"&gt;grep&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;man patch
&lt;span class="nv"&gt;$ &lt;/span&gt;man diff
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;display a one-line manual page description of a command&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ whatis top
$ whatis mv
$ whatis nice
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;display the full path of the executable for a given command&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ which top
$ which grep
$ which nice
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;locate the binary, source, and manual page files for a command&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ whereis top
$ whereis grep
$ whereis nice
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;tell us your username&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;whoami&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;show the current date and time&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ date
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;show this month's calendar&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cal
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;show current uptime&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ uptime
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;tell us detailed information about the machine name, operating system and kernel&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;uname&lt;/span&gt; &lt;span class="nt"&gt;-a&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;display CPU information&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cat /proc/cpuinfo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;display memory information&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cat /proc/meminfo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;show the disk usage&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ df
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;show disk usage per directory&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ du
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;display memory and swap usage&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ free
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;display memory and swap usage in human format&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ free -h
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  ⚒️ Files &amp;amp; Directory
&lt;/h3&gt;

&lt;p&gt;clear terminal&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;clear  
&lt;span class="nv"&gt;$ &lt;/span&gt;ctrl + l
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;list the files and directories in the current directory&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ls
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;list content of current directory (display attributes such as owner, group owner, permissions)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ls -l
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;list hidden files and/or directories (hidden file begins with . dot sign)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ls -a
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;list everything of current directory (attributes such as owner, group owner, permissions) + hidden files&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ls -la
$ ls -al
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to know in which directory you're located (pwd stands for "print working directory")&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ pwd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to make directories e.g. mkdir foo will create a new directory or folder called "foo"&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ mkdir foo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to create any number of directories/folders simultaneously&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ mkdir foo bar foobar
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to remove or delete an empty directory/folder&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ rmdir foo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to remove or delete a file&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ rm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to remove or delete a directory and all of its contents recursively [f = force]&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ rm -r foo
$ rm -rf foo 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to change directories (moving through the file system)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to navigate into the root directory&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cd /
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to navigate to your home directory, use "cd"&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cd
$ cd ~
$ cd $HOME
$ cd /home_path/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to navigate up one directory level&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cd ..
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to navigate into the documents directory&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cd Documents/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to navigate into the documents directory and see which files and/or directories exists there&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cd Documents/{press tab twice}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to navigate into a directory whose name contains spaces&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cd 'best songs ever'
$ cd best\ songs\ ever
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to create a new empty file&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ touch file_name.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to create multiple empty files&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ touch foo.txt bar.txt foobar.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to execute several commands on the same line by separating them with a semicolon ;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ls ; date
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to see the contents of a file (the -n parameter numbers the output lines)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cat main.py
$ cat passwd.txt
$ cat -n song.txt 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;displays only the first 10 lines of the file&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ head foo.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;prints the first 'num' lines (given with -n) instead of the first 10 lines&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ head -n 5 foo.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;display only the last 10 lines of the file&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ tail foo.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;prints the last 'num' lines (given with -n) instead of the last 10 lines&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ tail -n 3 foo.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to move a file to a different location&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;$ mv [source-file] [destination-file]&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ mv foo.txt /home/ubuntu/script/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to move a file to your home directory (the terminal uses ~ as a shortcut for your home directory)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ mv foo.txt ~ 
$ mv foo.txt /home/ubuntu/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to copy a file from the current directory to a different one, the command below will make an exact copy of "foo.txt" file&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;$ cp [source-file] [destination-file]&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cp foo.txt /home/ubuntu/script/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to copy a directory, use "cp -r directory_name" (-r = recursively, i.e. copy the directory along with all its files and subdirectories)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cp -r bar /home/ubuntu/script/
$ cp -r 'best songs ever' /home/ubuntu/music/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to copy or move all your C or Python source code files to a given directory&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cp *.c algorithms
$ mv *.py algorithms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to find a file in the current working directory&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ find . -name hello_world.py
$ find . -name hello_world.py -print
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to find all files with the .py extension in the script directory&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ find script -name '*.py' -ls
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to locate other directories e.g. the command below will locate the script directory&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ find /home/ubuntu -name script -type d -print
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to find text in a file, i.e. the command will search through the file for the piece of text you are looking for&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ grep 'Hello' hello_world.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to print the history of commands executed in the terminal&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ history
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;reboots the system&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ reboot
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;shuts down the system&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ shutdown
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;shuts down the system by powering off&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ poweroff
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;brings the system down immediately&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ halt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;reboots the system by shutting it down completely and then restarting it&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ init 6
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;powers off the system using predefined scripts to synchronize and clean up the system prior to shutting down&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ init 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  📦 Compress &amp;amp; Extract Files
&lt;/h3&gt;

&lt;p&gt;The tar utility creates archives of files and directories, and was originally designed to create archives on tapes (the term "tar" stands for tape archive). The tar utility is ideal for making backups of your files, which can then be transferred over the Internet.&lt;/p&gt;

&lt;h4&gt;
  
  
  Archives using tar
&lt;/h4&gt;

&lt;p&gt;Syntax:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;tar [options] [archive-name.tar] [directory-or-file-name]&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Options&lt;/th&gt;
&lt;th&gt;Function&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;-c&lt;/td&gt;
&lt;td&gt;creates a new archive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;-x&lt;/td&gt;
&lt;td&gt;extract the archive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;-f&lt;/td&gt;
&lt;td&gt;specify an archive filename&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;-v&lt;/td&gt;
&lt;td&gt;verbosely display the .tar progress in the terminal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;-t&lt;/td&gt;
&lt;td&gt;lists files in archived file&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;-r&lt;/td&gt;
&lt;td&gt;appends files to an archive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;-u&lt;/td&gt;
&lt;td&gt;updates an archive with new files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;-w&lt;/td&gt;
&lt;td&gt;waits for a confirmation from the user before archiving each file&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;-z&lt;/td&gt;
&lt;td&gt;creates archived file using gzip&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;-j&lt;/td&gt;
&lt;td&gt;creates archived file using bzip2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;create a tar archive using option -cvf&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ tar -cvf foo-archive.tar foo.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;extract file from archive using option -xvf&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ tar -xvf foo-archive.tar
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;create a gzip tar archive using option -cvzf&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ tar -cvzf foo-archive.tar.gz foo.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;extract a gzip tar archive using option -xvzf&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ tar -xvzf foo-archive.tar.gz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;create a gzip tar archive with python files&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ tar -cvzf python-codes.tar.gz *.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;create a gzip tar archive file for a directory&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ tar -cvzf images-august-2021.tar.gz /home/ubuntu/images/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;create tar with bzip2 compression&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ tar -cvjf foo-archive.tar.bz2 foo.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;extract a tar using bzip2&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ tar -xvjf foo-archive.tar.bz2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;compress to file.gz&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ gzip foo-archive
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;decompress file.gz&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ gzip -d foo-archive.gz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  👨‍💻 Process Management
&lt;/h3&gt;

&lt;p&gt;A process, in simple terms, is an instance of a running program. Whenever we execute a command in Linux, it starts (creates) a new process. The Linux kernel tracks each process through an ID number known as the PID (Process ID). The kernel is the core of the Operating System (OS): it sits closest to the hardware, i.e. it is the lowest level of the OS, and it translates requests from applications into something the hardware can understand. The OS as a whole is the software package that also contains applications such as the user interface (shell, GUI, tools, etc.). Basically, the kernel is the layer between the hardware (the devices available in the computer) and the software (applications like gedit). Only the kernel provides low-level services such as:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;memory management&lt;/li&gt;
&lt;li&gt;network management&lt;/li&gt;
&lt;li&gt;device driver&lt;/li&gt;
&lt;li&gt;file management&lt;/li&gt;
&lt;li&gt;process management&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Types of Processes
&lt;/h3&gt;

&lt;p&gt;When we create a new process (run a command), there are two types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Foreground Processes: they run on the screen and need input from the user, for example office programs&lt;/li&gt;
&lt;li&gt;Background Processes: they run in the background and usually do not need user input, for example an antivirus&lt;/li&gt;
&lt;/ul&gt;
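
&lt;p&gt;for example, appending &amp;amp; to a command starts it as a background process, and jobs lists it (the job number and PID below are illustrative)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ sleep 100 &amp;amp;
[1] 4523
$ jobs
[1]+  Running                 sleep 100 &amp;amp;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
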

&lt;p&gt;to display the processes running in the current shell&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;all the currently running processes&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ps -A
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;processes associated with the current terminal session&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ps -T
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to check all the processes associated with a particular user&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ps -u rubnsbarbosa
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to check all the processes running under a user&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ps ux
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to check all the processes associated with a particular user group&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ps -fG root
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to display the processes running with full information&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ps -f
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to display the processes running on the system in the form of a tree&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ pstree
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;description of all the fields displayed by the &lt;strong&gt;ps -f&lt;/strong&gt; command&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Column&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;UID&lt;/td&gt;
&lt;td&gt;user id that this process belongs to&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PID&lt;/td&gt;
&lt;td&gt;process id&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PPID&lt;/td&gt;
&lt;td&gt;parent process id (the id of the process that started it)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C&lt;/td&gt;
&lt;td&gt;CPU utilization process&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;STIME&lt;/td&gt;
&lt;td&gt;process start time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TTY&lt;/td&gt;
&lt;td&gt;terminal type associated with the process&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TIME&lt;/td&gt;
&lt;td&gt;CPU time taken by the process&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CMD&lt;/td&gt;
&lt;td&gt;The command that started this process&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;display all running Linux processes&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ top
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;interactive process viewer&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ htop
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;display the list of available kill signals&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kill -l
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;kill the process with given pid (for example: I want to kill this 217956 PID)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kill 217956
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;if a process ignores a regular kill command, we can use &lt;strong&gt;kill -9&lt;/strong&gt; followed by the PID&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kill -9 217956
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to find the PID of a process&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ pidof Photoshop.exe
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;kill all the processes named proc&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ killal proc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;will kill all processes matching the pattern&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pkill pattern
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to check the nice value of a process (example: I'd like to find the entry for the terminal process)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ps -el | grep terminal
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;run a program with a modified scheduling priority, i.e. set the process's CPU priority. Note that a higher nice value means a lower priority, so the kernel allocates less CPU time to that process; negative nice values (which require root) give a process more CPU time&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ nice -10 gnome-terminal
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;changing the priority of a running process with PID 77982&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ renice -n 15 -p 77982
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;change the priority of all programs of a specific group&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ renice -n 10 -g 4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;suspend process running in foreground&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ctrl+z
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;list jobs table&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ jobs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;send stopped process to background&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ bg [job-num]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;brings process to foreground&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ fg [job-num]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  📝 File Permission
&lt;/h2&gt;

&lt;p&gt;Every file or directory within Linux has a set of permissions that control who may read, write and execute its contents. In Linux, a directory is just a special type of file. File ownership is an important component of Unix that provides a secure method for storing files. Every file in Linux has the following categories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Owner&lt;/strong&gt; − the name of the user that owns the file/directory;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Group&lt;/strong&gt; − the name of the group that has permissions on the file/directory;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Other&lt;/strong&gt; − everyone else; the permissions that all other users have on the file/directory.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Permissions
&lt;/h3&gt;

&lt;p&gt;Every file and directory in your Linux system has the following 3 permissions defined for each of the 3 categories discussed above.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Permission&lt;/th&gt;
&lt;th&gt;Abbreviation&lt;/th&gt;
&lt;th&gt;File&lt;/th&gt;
&lt;th&gt;Directory&lt;/th&gt;
&lt;th&gt;Octal Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;read&lt;/td&gt;
&lt;td&gt;r&lt;/td&gt;
&lt;td&gt;able to view the contents of a file&lt;/td&gt;
&lt;td&gt;able to list the files within the directory&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;write&lt;/td&gt;
&lt;td&gt;w&lt;/td&gt;
&lt;td&gt;able to modify the contents of a file&lt;/td&gt;
&lt;td&gt;able to add/delete files to/from directory&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;execute&lt;/td&gt;
&lt;td&gt;x&lt;/td&gt;
&lt;td&gt;able to run the file as an executable&lt;/td&gt;
&lt;td&gt;able to cd into the directory and access files&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Octal Notation Table&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Binary&lt;/th&gt;
&lt;th&gt;Octal Value&lt;/th&gt;
&lt;th&gt;Permission&lt;/th&gt;
&lt;th&gt;Representation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;000&lt;/td&gt;
&lt;td&gt;(0+0+0) = 0&lt;/td&gt;
&lt;td&gt;no permission&lt;/td&gt;
&lt;td&gt;---&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;001&lt;/td&gt;
&lt;td&gt;(0+0+1) = 1&lt;/td&gt;
&lt;td&gt;execute&lt;/td&gt;
&lt;td&gt;--x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;010&lt;/td&gt;
&lt;td&gt;(0+2+0) = 2&lt;/td&gt;
&lt;td&gt;write&lt;/td&gt;
&lt;td&gt;-w-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;011&lt;/td&gt;
&lt;td&gt;(0+2+1) = 3&lt;/td&gt;
&lt;td&gt;write + execute&lt;/td&gt;
&lt;td&gt;-wx&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;(4+0+0) = 4&lt;/td&gt;
&lt;td&gt;read&lt;/td&gt;
&lt;td&gt;r--&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;(4+0+1) = 5&lt;/td&gt;
&lt;td&gt;read + execute&lt;/td&gt;
&lt;td&gt;r-x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;110&lt;/td&gt;
&lt;td&gt;(4+2+0) = 6&lt;/td&gt;
&lt;td&gt;read + write&lt;/td&gt;
&lt;td&gt;rw-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;111&lt;/td&gt;
&lt;td&gt;(4+2+1) = 7&lt;/td&gt;
&lt;td&gt;read + write + execute&lt;/td&gt;
&lt;td&gt;rwx&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Whenever you use the &lt;strong&gt;ls -l&lt;/strong&gt; command, it displays information related to file permissions as follows&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2F1.bp.blogspot.com%2F-RzUG1frbLvw%2FXbVnX6AYBpI%2FAAAAAAAAbbM%2Fh7HpiDW-F8Emd2C0-dULpC9RzP4n8Dh1ACLcBGAsYHQ%2Fs1600%2Ffig_permissions_chmod%252Bcommand.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2F1.bp.blogspot.com%2F-RzUG1frbLvw%2FXbVnX6AYBpI%2FAAAAAAAAbbM%2Fh7HpiDW-F8Emd2C0-dULpC9RzP4n8Dh1ACLcBGAsYHQ%2Fs1600%2Ffig_permissions_chmod%252Bcommand.jpg" width="596" height="470"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The first character is called the &lt;strong&gt;file type&lt;/strong&gt;. An ordinary file is represented by a dash (-) and a directory is represented by a d.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Note: A dash (-) anywhere else in the permission set indicates no permission.&lt;/strong&gt; &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The 1st set of three characters is the &lt;strong&gt;owner's permissions&lt;/strong&gt;, shown in green.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The 2nd set of three characters is the &lt;strong&gt;group permissions&lt;/strong&gt;, shown in cyan.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The 3rd set of three characters is the permissions for all other users, shown in red.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The file permissions that are set depend on the type of file, e.g. a text file needs different permissions from a shell script because a text file doesn’t need the executable permission but a shell script does.&lt;/p&gt;

&lt;p&gt;examples of different types of permissions on files and directories:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-rwx------ this file is read/write/execute for the owner only
dr-xr-x--- this directory is read/execute for the owner and the group
-rwxr-xr-x this file is read/write/execute for the owner, and read/execute for the group and others
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Setting Permission
&lt;/h3&gt;

&lt;p&gt;In order to change file permissions we use &lt;strong&gt;chmod&lt;/strong&gt; command (&lt;strong&gt;ch&lt;/strong&gt;ange &lt;strong&gt;mod&lt;/strong&gt;e - changes permissions of a given file) followed by the octal values that reflect the permissions we want to set. To decide on the permissions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;work out what you want each category of user to be able to do and the appropriate octal value for this (see Octal Notation Table);&lt;/li&gt;
&lt;li&gt;take these 3 octal values and put them together to form a set which will be the permissions for that file.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The example below shows that if we want the &lt;strong&gt;user&lt;/strong&gt; to be able to read and write a file, but the &lt;strong&gt;group&lt;/strong&gt; and &lt;strong&gt;other&lt;/strong&gt; only to be able to read that file, then the permissions for this file would need to be set to 644.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;category&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;u&lt;/th&gt;
&lt;th&gt;g&lt;/th&gt;
&lt;th&gt;o&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;permission&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;r  w&lt;/td&gt;
&lt;td&gt;r&lt;/td&gt;
&lt;td&gt;r&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;value&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4 + 2&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;more examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;chmod 755 foo.txt&lt;/code&gt; (results in rwxr-xr-x)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;chmod 666 foo.txt&lt;/code&gt; (results in rw-rw-rw-)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;chmod 664 foo.txt&lt;/code&gt; (results in rw-rw-r--)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;chmod 700 foo.txt&lt;/code&gt; (results in rwx------)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;chmod 711 foo.txt&lt;/code&gt; (results in rwx--x--x)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;chmod 754 foo.txt&lt;/code&gt; (results in rwxr-xr--)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;chmod 000 foo.txt&lt;/code&gt; (results in ---------)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;chmod 777 foo.txt&lt;/code&gt; (results in rwxrwxrwx)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;chmod g+r foo.txt&lt;/code&gt; (adds read to group)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;chmod g-r foo.txt&lt;/code&gt; (removes read from group)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;chmod o+r foo.txt&lt;/code&gt; (adds read to others)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;chmod a-w foo.txt&lt;/code&gt; (removes write from all users)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We can also use chmod in symbolic mode:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Chmod operator&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;+&lt;/td&gt;
&lt;td&gt;adds a permission to a file/directory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;removes the permission from a file/directory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;=&lt;/td&gt;
&lt;td&gt;sets the designated permission(s)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
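
&lt;p&gt;for example, the = operator sets the permissions exactly as given (foo.txt is just a placeholder file):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ chmod u=rwx,g=rx,o= foo.txt
results in rwxr-x--- (owner rwx, group r-x, others none)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;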

&lt;p&gt;a shell script, or any other file that needs to be executable, should have a permission such as 711&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ chmod 711 foobar.sh

owner - read, write and execute
group - execute
other - execute
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;a text file doesn't need to be executable, so it should have a permission such as 644&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ chmod 644 foo.txt

owner - read and write
group - read
other - read
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  ⚡️ Network
&lt;/h3&gt;

&lt;p&gt;Let’s start with the most basic question: is our physical interface up? The &lt;em&gt;ip link show&lt;/em&gt; command tells us&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ip link show
1: lo: &amp;lt;LOOPBACK,UP,LOWER_UP&amp;gt; mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: &amp;lt;BROADCAST,MULTICAST&amp;gt; mtu 1500 qdisc pfifo_fast state DOWN mode DEFAULT group default qlen 1000
link/ether 52:54:00:82:d6:6e brd ff:ff:ff:ff:ff:ff
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;your interface might be disabled, so before checking cables you should bring the interface up&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ip link set eth0 up
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;the -br (brief) flag prints the output in a much more readable table format&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ip -br link show
lo UNKNOWN 00:00:00:00:00:00 &amp;lt;LOOPBACK,UP,LOWER_UP&amp;gt;
eth0 UP 52:54:00:82:d6:6e &amp;lt;BROADCAST,MULTICAST,UP,LOWER_UP&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;we can use the -s flag with the ip command to print additional statistics about an interface&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ip -s link show eth0
2: eth0: &amp;lt;BROADCAST,MULTICAST,UP,LOWER_UP&amp;gt; mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
link/ether 52:54:00:82:d6:6e brd ff:ff:ff:ff:ff:ff
RX: bytes packets errors dropped overrun mcast
34107919 5808 0 6 0 0
TX: bytes packets errors dropped carrier collsns
434573 4487 0 0 0 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can check the entries in our ARP table with the ip neighbor command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ip neighbor show
192.168.122.1 dev eth0 lladdr 52:54:00:11:23:84 REACHABLE
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that the gateway’s MAC address is populated (we’ll talk more about how to find your gateway in the next section). If there was a problem with ARP, then we would see a resolution failure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ip neighbor show
192.168.122.1 dev eth0 FAILED
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Linux caches the ARP entry for a period of time, so you may not be able to send traffic to your default gateway until the ARP entry for your gateway times out. For highly important systems, this result is undesirable. Luckily, you can manually delete an ARP entry, which will force a new ARP discovery process:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ip neighbor show
192.168.122.170 dev eth0 lladdr 52:54:00:04:2c:5d REACHABLE
192.168.122.1 dev eth0 lladdr 52:54:00:11:23:84 REACHABLE
# ip neighbor delete 192.168.122.170 dev eth0
# ip neighbor show
192.168.122.1 dev eth0 lladdr 52:54:00:11:23:84 REACHABLE
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





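&lt;p&gt;the same brief format also works for addresses, letting us quickly check the IPs assigned to each interface&lt;br&gt;
&lt;/p&gt;
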
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ip -br address show
lo UNKNOWN 127.0.0.1/8 ::1/128
eth0 UP 192.168.122.135/24 fe80::184e:a34d:1d37:441a/64 fe80::c52f:d96e:a4a2:743/64
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





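&lt;p&gt;once the interface is up and has an address, we can test connectivity to an external host with ping&lt;br&gt;
&lt;/p&gt;
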
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ping www.google.com
PING www.google.com (172.217.165.4) 56(84) bytes of data.
64 bytes from yyz12s06-in-f4.1e100.net (172.217.165.4): icmp_seq=1 ttl=54 time=12.5 ms
64 bytes from yyz12s06-in-f4.1e100.net (172.217.165.4): icmp_seq=2 ttl=54 time=12.6 ms
64 bytes from yyz12s06-in-f4.1e100.net (172.217.165.4): icmp_seq=3 ttl=54 time=12.5 ms
^C
--- www.google.com ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2002ms
rtt min/avg/max/mdev = 12.527/12.567/12.615/0.036 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can print the routing table using the ip route show command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ip route show
default via 192.168.122.1 dev eth0 proto dhcp metric 100
192.168.122.0/24 dev eth0 proto kernel scope link src 192.168.122.135 metric 100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
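
&lt;p&gt;to see which route the kernel would actually pick for a specific destination, we can ask it directly (the destination below is just an example; the output will look roughly like this)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ip route get 8.8.8.8
8.8.8.8 via 192.168.122.1 dev eth0 src 192.168.122.135
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;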



&lt;h3&gt;
  
  
  SSH
&lt;/h3&gt;
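
&lt;p&gt;log into a remote machine as a given user&lt;br&gt;
&lt;/p&gt;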



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ssh admin@192.168.1.113
$ ssh ubuntu@192.168.1.113
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;generate an SSH key pair&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ssh-keygen -t rsa
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;ssh authentication: you store your public key on the server and keep your private key with you. The private key is the most important piece, since it is what allows you to authenticate successfully with the server. So keep it secure; if you lose it, you lose access to the server.&lt;/p&gt;
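
&lt;p&gt;a common way to install your public key on the server is ssh-copy-id (the user and host below are just examples)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ssh-copy-id admin@192.168.1.113
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;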

&lt;p&gt;copy a file to a remote server&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ scp foo.txt admin@192.168.1.113:/home/admin/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
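
&lt;p&gt;and to copy in the opposite direction, from the remote server to the local machine&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ scp admin@192.168.1.113:/home/admin/foo.txt .
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;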



&lt;p&gt;display the current network interface configuration information&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ifconfig
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;display IP addresses and property information&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ip addr
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;List all of the route entries in the kernel&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ip route
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;display neighbour objects; also known as the ARP table for IPv4&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ip neigh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;lists all my connections and their DNS servers&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# systemd-resolve --status
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;allows you to test the IP-level connectivity of a given host on the network&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ping 192.168.2.32
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;display the route that a packet takes to reach the host; it also prints details about all the hops it visits&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# traceroute google.com
# traceroute 172.217.26.206
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>linux</category>
      <category>opensource</category>
      <category>bash</category>
      <category>python</category>
    </item>
    <item>
      <title>ETL with Spark on Azure Databricks and Azure Data Warehouse (Part 2)</title>
      <dc:creator>Rubens Barbosa</dc:creator>
      <pubDate>Sat, 30 Apr 2022 20:31:09 +0000</pubDate>
      <link>https://dev.to/rubnsbarbosa/etl-with-spark-on-azure-databricks-and-azure-data-warehouse-part-2-5cep</link>
      <guid>https://dev.to/rubnsbarbosa/etl-with-spark-on-azure-databricks-and-azure-data-warehouse-part-2-5cep</guid>
      <description>&lt;p&gt;Hey y'all, this is a continuation of the previous article. We already have data on Azure Data Lake Storage. Now, we will integrate it with Apache Spark on Azure Databricks to perform a small transformation on top of the JSON, and send the data to Azure SQL Data Warehouse. I'll try to be the most hands on as possible. Let's get started!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is Apache Spark?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Apache Spark is a framework for processing large-scale data, i.e., Big Data distributed across clusters. It is used for executing data engineering, data science, and machine learning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Overview&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The main abstraction Spark provides is the Resilient Distributed Dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. Thus, Spark runs multiple processes concurrently, in parallel, without them interfering with each other. RDDs can be created from text files, SQL databases, NoSQL databases, HDFS, cloud storage, and so on. The processing of RDDs is done entirely in memory.&lt;/p&gt;
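
&lt;p&gt;To make this concrete, here is a minimal sketch (assuming a SparkSession named spark, as Databricks notebooks provide):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# a minimal RDD sketch; `spark` is the SparkSession that Databricks notebooks provide
rdd = spark.sparkContext.parallelize([1, 2, 3, 4])

# transformations are lazy; nothing runs until an action is called
squared = rdd.map(lambda x: x * x)

# collect() is an action: the driver gathers the results from the workers
print(squared.collect())  # [1, 4, 9, 16]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;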

&lt;p&gt;At a high level, a Spark cluster has a driver node and several worker nodes. The driver node runs the main program, which holds all of the transformations that you want to apply to your data; these get sent out to the worker nodes, which execute the tasks and return the results to the driver node. This is the core engine of Spark; on top of it there are several library modules that allow developers to easily interact with the core engine. These libraries include: Spark SQL, Spark Streaming, MLlib, GraphX.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Databricks&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Databricks is a company founded by the creators of Apache Spark with the intention of making Apache Spark much easier to use. Databricks develops a web-based platform for working with Spark that provides automated cluster management and IPython-style notebooks.&lt;/p&gt;

&lt;p&gt;The Databricks workspace is the cloud-based environment in which you use Databricks; it includes the user interface, integrated storage, security settings, job scheduling, and, most importantly, notebooks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cluster&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As mentioned before, Spark is all about clusters. That's why we'll first create a cluster on Databricks. After launching the Azure Databricks workspace, go to Compute, then Create Cluster.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjzbcpyqr6eb01rwytd65.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjzbcpyqr6eb01rwytd65.png" alt="Create Cluster on Azure Databricks" width="800" height="415"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once we have our cluster running, we can create a notebook and start coding; I'll use PySpark. We will start building the ETL with PySpark on Azure Databricks. In the load phase we will write data to Azure SQL Data Warehouse, so we must already have our Data Warehouse deployed and its connection string at hand.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Creation of ETL with PySpark on Azure Databricks Notebook&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let's have a look.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extract&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;First of all, we need to extract the JSON file from Azure Data Lake Storage and read it into a DataFrame. After that, we will have the data in a PySpark DataFrame.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Title: ETL Spark: extract from Azure Data Lake Storage, and load to Azure SQL Data Warehouse
# Language: PySpark
# Author: Rubens Santos Barbosa
&lt;/span&gt;
&lt;span class="c1"&gt;# config the session using spark object and set the key from our azure data lake storage account
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fs.azure.account.key.YOUR_AZURE_DATA_LAKE_STORAGE_ACCOUNT_NAME.dfs.core.windows.net&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_AZURE_DATA_LAKE_ACCOUNT_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# abfss://AZURE_DL_CONTAINER_NAME@AZURE_DL_STORAGE_ACCOUNT_NAME.dfs.core.windows.net/DIRECTORY_CLIENT
&lt;/span&gt;&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ls&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;abfss://az-covid-data@engdatalake.dfs.core.windows.net/directory-covid19&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# path JSON file on azure data lake storage 
&lt;/span&gt;&lt;span class="n"&gt;covid_data_json&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;abfss://az-covid-data@engdatalake.dfs.core.windows.net/directory-covid19/covid-2022-4-21.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# read JSON file into DataFrame
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;multiline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;covid_data_json&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# PySpark print schema
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;printSchema&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkjctaiyyi2upr2mt3r0z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkjctaiyyi2upr2mt3r0z.png" alt="PySpark on Azure Databricks" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We might wanna see some content from our DataFrame.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# showing first 5 rows
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbe5vc25jz80gf1z8o4tp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbe5vc25jz80gf1z8o4tp.png" alt="Firts rows" width="800" height="452"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Transform&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We've just done the data extraction. Now, we will do a little transformation. Let's check whether there is missing data in the columns of our PySpark DataFrame.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# missing values in a specific column of pySpark dataframe
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bairroPaciente&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;isNull&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# count null value in every column
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\t&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;with null values: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;isNull&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffuory4y6ecquohajt1tg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffuory4y6ecquohajt1tg.png" alt="missing data" width="800" height="452"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We noticed that our PySpark DataFrame has 82 rows, and some columns have 81 null values. So, let's drop these columns.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# columns in pyspark dataframe to drop
&lt;/span&gt;&lt;span class="n"&gt;columns_to_drop&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;classificacaoEstadoSivep&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;comorbidadeAsmaSivep&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;comorbidadeDiabetesSivep&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;comorbidadeHematologiaSivep&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;comorbidadeImunodeficienciaSivep&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;comorbidadeNeurologiaSivep&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;comorbidadeObesidadeSivep&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;comorbidadePneumopatiaSivep&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;comorbidadePuerperaSivep&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;comorbidadeRenalSivep&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;comorbidadeSindromeDownSivep&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dataEntradaUtisSvep&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dataEvolucaoCasoSivep&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dataInternacaoSivep&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dataResultadoExame&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dataSolicitacaoExame&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;evolucaoCaso&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;idSivep&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;paisPaciente&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cnesNotificacaoEsus&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span 
class="s"&gt;comorbidadeCardiovascularSivep&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dataColetaExame&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;resultadoFinalExame&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tipoTesteExame&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# delete columns in pyspark dataframe
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;columns_to_drop&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F70uv0wi2whddwnyc67bu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F70uv0wi2whddwnyc67bu.png" alt="delete columns" width="800" height="452"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's display our data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1op9c8rru0q21fzhbqvr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1op9c8rru0q21fzhbqvr.png" alt="display data" width="800" height="452"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see above, the columns dataInicioSintomas and dataNotificacao are in timestamp format; I will transform them to date format in our PySpark DataFrame.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;to_date&lt;/span&gt;
&lt;span class="c1"&gt;# timestamp to date
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dataInicioSintomas&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;to_date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dataInicioSintomas&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dataNotificacao&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;to_date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dataNotificacao&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;

&lt;span class="nf"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnlf8ih9vmvt1qct3wwqw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnlf8ih9vmvt1qct3wwqw.png" alt="timestamp2date" width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Load&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We've done the data transformation. We will load this data into Azure SQL Data Warehouse.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# removing repeated rows
&lt;/span&gt;&lt;span class="n"&gt;distinctDF&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;distinct&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Distinct count: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;distinctDF&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
&lt;span class="nf"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;distinctDF&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Load PySpark DataFrame to Azure SQL Data Warehouse
&lt;/span&gt;&lt;span class="n"&gt;db_table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dbo.COVID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;sql_password&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_PASSWORD&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;jdbc_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jdbc:sqlserver://cosmos-database.database.windows.net:1433;database=cosmos-pool;user=rubnsbarbosa@cosmos-database;password=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;sql_password&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;distinctDF&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;jdbc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;jdbc_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;db_table&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx72fnliptz9w591tzss8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx72fnliptz9w591tzss8.png" alt="load" width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We've finished the ETL with PySpark on Azure Databricks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Azure SQL Data Warehouse&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before loading the PySpark DataFrame into Azure SQL Data Warehouse, we must have created our table in the SQL DW. So, we enter the query editor and create it. You can see the query I created below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;dbo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;COVID&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;bairroPaciente&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;254&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;codigoMunicipioPaciente&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;254&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;codigoPaciente&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;254&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;dataInicioSintomas&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;dataNotificacao&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;estadoPaciente&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;idadePaciente&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;municipioNotificacaoEsus&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;municipioPaciente&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;profissionalSaude&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;racaCorPaciente&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;sexoPaciente&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You might hit some firewall issues when you try to load the data; you just need to go into Firewalls and Virtual Networks [inside the SQL DW] and save the client IP address. Finally, let's see the data in our Azure SQL Data Warehouse.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fihmdlmhwbti7akjadgci.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fihmdlmhwbti7akjadgci.png" alt="dw" width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We have the second and last part of our project completed. We created an ETL using Spark on an Azure Databricks cluster. In the extraction phase we got data from Azure Data Lake Storage, then we performed a basic transformation, and the data was loaded into Azure SQL Data Warehouse as proposed.&lt;/p&gt;

</description>
      <category>spark</category>
      <category>databricks</category>
      <category>python</category>
      <category>azure</category>
    </item>
    <item>
      <title>ELT Data Pipeline with Kubernetes CronJob, Azure Data Lake, Azure Databricks (Part 1)</title>
      <dc:creator>Rubens Barbosa</dc:creator>
      <pubDate>Sun, 24 Apr 2022 14:35:29 +0000</pubDate>
      <link>https://dev.to/rubnsbarbosa/elt-data-pipeline-with-kubernetes-cronjob-azure-data-lake-azure-databricks-part-1-d58</link>
      <guid>https://dev.to/rubnsbarbosa/elt-data-pipeline-with-kubernetes-cronjob-azure-data-lake-azure-databricks-part-1-d58</guid>
      <description>&lt;p&gt;Hey world, the concept of ETL are far from new, but nowadays it is widely used in the industry. ETL stands for Extract, Transform, and Load. Okay, but what does that mean? The easiest way to understand how ETL works is to understand what happens in each step of the process. Let's dive into it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extract&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;During the extraction, raw data is moved from a structured or unstructured data pool to a staging data repository.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Transform&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The data source might have a different structure than the target destination, so we'll transform the data from the source schema to the destination schema.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Load&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this phase, we'll then load the transformed data into the data warehouse.&lt;/p&gt;

&lt;p&gt;A disadvantage of the ETL approach is that the transformation stage can take a long time. An alternative approach is extract, load, and transform (ELT). In ELT, the data is immediately extracted and loaded into a large data repository, such as Azure Data Lake Storage. We can begin transforming the data as soon as the load is complete.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hands on&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this first part I will show how to create an ELT. We'll extract Covid-19 data from a public API called IntegraSUS and load it into Azure Data Lake Storage. This ELT will be containerized on Azure Container Registry (ACR), and we will use Azure Kubernetes Service (AKS) to schedule our job on a K8s cluster to run daily.&lt;/p&gt;

&lt;p&gt;In the second part of this project, we will integrate the Azure Data Lake with Apache Spark on Azure Databricks to perform a small transformation on top of the files sent to the Data Lake and then we will store the result of the transformation in a Data Warehouse.&lt;/p&gt;

&lt;p&gt;We will learn how to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a Python ELT and load into Azure Data Lake;&lt;/li&gt;
&lt;li&gt;Create an Azure Container Registry and push images into it;&lt;/li&gt;
&lt;li&gt;Create an Azure Kubernetes Service;&lt;/li&gt;
&lt;li&gt;Deploy CronJob into Azure Kubernetes Cluster;&lt;/li&gt;
&lt;li&gt;Integrate Azure Data Lake with Apache Spark on Databricks;&lt;/li&gt;
&lt;li&gt;Transform data using PySpark on Azure Databricks&lt;/li&gt;
&lt;li&gt;Load new data into Data Warehouse.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The project code is available here: &lt;a href="https://github.com/rubnsbarbosa/elt2datalake" rel="noopener noreferrer"&gt;github repository&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Create a Python ELT&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the extraction phase we will get Covid-19 data for Fortaleza/Ceará/Brazil from a public API and store it in a JSON file. After that, we will load it into Azure Data Lake. You can see the project code below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#!/usr/local/bin/python
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;yaml&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timedelta&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.storage.filedatalake&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DataLakeServiceClient&lt;/span&gt;

&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;basicConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INFO&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%(asctime)s - %(name)s - %(levelname)s - %(message)s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;today&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;days&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;previous_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;year&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;month&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;day&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;EXTRACTING...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;#extract data from API
&lt;/span&gt;    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://indicadores.integrasus.saude.ce.gov.br/api/casos-coronavirus?dataInicio=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;previous_date&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;amp;dataFim=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;previous_date&lt;/span&gt;
    &lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;covid-data-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;previous_date&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ensure_ascii&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;THERE IS NOT DATA&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;initialize_storage_account&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;storage_account_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;storage_account_key&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;global&lt;/span&gt; &lt;span class="n"&gt;service_client&lt;/span&gt;
        &lt;span class="n"&gt;service_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DataLakeServiceClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;account_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;storage_account_name&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.dfs.core.windows.net&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;credential&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;storage_account_key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;EXCEPTION...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_file_system&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;container_name&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;global&lt;/span&gt; &lt;span class="n"&gt;file_system_client&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CREATING A CONTAINER NAMED AZ-COVID-DATA&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;file_system_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;service_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_file_system&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;container_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;EXCEPTION...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_directory&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CREATING A DIRECTORY NAMED DIRECTORY-COVID22&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;file_system_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_directory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;directory-covid19&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;EXCEPTION...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;upload_file_to_container_datalake&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;local_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;container_name&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;UPLOADING FILE TO AZURE DATA LAKE STORAGE...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;file_system_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;service_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_file_system_client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;container_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;directory_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;file_system_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_directory_client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;directory-covid19&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;file_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;directory_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_file_client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;covid-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;previous_date&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;local_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;file_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upload_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;overwrite&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;UPLOADED TO AZURE DATA LAKE&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;EXCEPTION...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_config&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;directory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dirname&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abspath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__file__&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;directory&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/config.yaml&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;yamlfile&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;yaml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;yamlfile&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Loader&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;yaml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FullLoader&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_config&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;initialize_storage_account&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AZURE_DL_STORAGE_ACCOUNT_NAME&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AZURE_DL_ACCOUNT_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="nf"&gt;create_file_system&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AZURE_DL_CONTAINER_NAME&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="nf"&gt;upload_file_to_container_datalake&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;covid-data-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;previous_date&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AZURE_DL_CONTAINER_NAME&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
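
&lt;p&gt;The script reads its credentials from a &lt;em&gt;config.yaml&lt;/em&gt; file sitting next to it. A minimal sketch of that file, with placeholder values for the account name and key (the container name matches the one the script creates), might look like this&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# config.yaml -- placeholder values, replace with your own storage account details
AZURE_DL_STORAGE_ACCOUNT_NAME: "mystorageaccount"
AZURE_DL_ACCOUNT_KEY: "your-storage-account-access-key"
AZURE_DL_CONTAINER_NAME: "az-covid-data"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
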



&lt;p&gt;&lt;strong&gt;2. Create a Docker image for our Python ELT&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We are going to build a Docker image for our ELT job and run it inside a container. Let's create a Dockerfile, which describes how the image is built. You can see the full list of instructions below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;python&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;3.9&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;buster&lt;/span&gt;

&lt;span class="n"&gt;WORKDIR&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;usr&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;

&lt;span class="n"&gt;COPY&lt;/span&gt; &lt;span class="n"&gt;requirements&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;txt&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;usr&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;
&lt;span class="n"&gt;RUN&lt;/span&gt; &lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="n"&gt;requirements&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;txt&lt;/span&gt;

&lt;span class="n"&gt;COPY&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;yaml&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;usr&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;

&lt;span class="n"&gt;COPY&lt;/span&gt; &lt;span class="n"&gt;el2datalake&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;usr&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;
&lt;span class="n"&gt;RUN&lt;/span&gt; &lt;span class="n"&gt;chmod&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="n"&gt;el2datalake&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;

&lt;span class="n"&gt;CMD&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./el2datalake.py&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
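
&lt;p&gt;Two details are worth noting here. The &lt;em&gt;CMD&lt;/em&gt; runs the script directly, which assumes &lt;em&gt;el2datalake.py&lt;/em&gt; starts with a shebang line such as &lt;em&gt;#!/usr/bin/env python3&lt;/em&gt; (that is also why we &lt;em&gt;chmod a+x&lt;/em&gt; it). And &lt;em&gt;requirements.txt&lt;/em&gt; must list the packages the script imports; a minimal sketch (the requests entry is an assumption for the extract step) could be&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;azure-storage-file-datalake
PyYAML
requests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
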



&lt;p&gt;We can build the Docker image with the &lt;em&gt;docker build&lt;/em&gt; command&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker build -t el2datalakejob .
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we can run our ELT job inside a container&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker run -it el2datalakejob:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Push the Docker image to Azure Container Registry&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Azure Container Registry hosts private Docker container images and allows us to build, store, and manage them. We are going to deploy an ACR instance and push our Docker image to it.&lt;/p&gt;

&lt;p&gt;To create any resource on Azure, we first need a resource group. We create one with the &lt;em&gt;az group create&lt;/em&gt; command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ az group create --name myResourceGroup --location westeurope
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once we have a resource group, we can create an Azure Container Registry with the &lt;em&gt;az acr create&lt;/em&gt; command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ az acr create \
  --resource-group myResourceGroup \
  --name azcrjobs \
  --sku Basic \
  --location westeurope
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
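
&lt;p&gt;The registry gets a login server name derived from its name; we can confirm it with&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ az acr show --name azcrjobs --query loginServer --output tsv
azcrjobs.azurecr.io
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
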



&lt;p&gt;Let's log in to Azure Container Registry&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ az acr login --name azcrjobs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's tag the image with the login server name azcrjobs.azurecr.io&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker tag el2datalakejob \
azcrjobs.azurecr.io/el2datalakejob:v1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Push the Docker image to ACR&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker push azcrjobs.azurecr.io/el2datalakejob:v1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
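
&lt;p&gt;To double-check that the image landed in the registry, we can list the repositories on our ACR instance&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ az acr repository list --name azcrjobs --output table
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
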



&lt;p&gt;Now that we have our ELT image on Azure Container Registry, let's move on to the next step.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Create and Deploy CronJobs on Azure Kubernetes Service&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Azure Kubernetes Service (AKS) lets us deploy and manage containerized applications with a fully managed Kubernetes service. Let’s create an AKS cluster with the &lt;em&gt;az aks create&lt;/em&gt; command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ az aks create \
  --resource-group myResourceGroup \
  --name az-aks-jobs \
  --node-count 1 \
  --attach-acr azcrjobs \
  --location westeurope
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To connect to the cluster from our local machine we use the Kubernetes client kubectl. Open a terminal and fetch the cluster credentials&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ az aks get-credentials --resource-group myResourceGroup \ --name az-aks-jobs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's check that our node is available on AKS&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kubectl get nodes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We start by creating a manifest file for our ELT CronJob.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: batch/v1
kind: CronJob
metadata:
  creationTimestamp: null
  name: k8sjob
spec:
  jobTemplate:
    metadata:
      creationTimestamp: null
      name: k8sjob
    spec:
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - image: azcrjobs.azurecr.io/el2datalakejob:v1
            imagePullPolicy: IfNotPresent
            name: k8sjob
            resources: {}
          restartPolicy: OnFailure
  schedule: '55 23 * * *'
status: {}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the manifest above we defined the cron expression used as the job's schedule: it is set to run every day at 23:55. We also set the name of the Docker image to be pulled from the container registry attached to the cluster.&lt;/p&gt;
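
&lt;p&gt;For reference, the five fields of the cron expression read as follows&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;55 23 * * *
|  |  | | |
|  |  | | +-- day of week (* = any)
|  |  | +---- month (* = any)
|  |  +------ day of month (* = any)
|  +--------- hour (23)
+------------ minute (55)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
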

&lt;p&gt;To deploy our job, we will use the &lt;em&gt;kubectl apply&lt;/em&gt; command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl apply -f job.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can view some details about the job with&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get cronjobs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
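
&lt;p&gt;Rather than waiting for 23:55, we can also trigger a one-off run from the CronJob to test it right away (the job name &lt;em&gt;k8sjob-manual&lt;/em&gt; below is an arbitrary choice)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kubectl create job --from=cronjob/k8sjob k8sjob-manual
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
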



&lt;p&gt;To retrieve cron job logs from Kubernetes, we can use the &lt;em&gt;kubectl logs&lt;/em&gt; command, but first we must get the pod name.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kubectl get pods
NAME                       READY   STATUS      RESTARTS   AGE
k8sjob-27513350--1-xnj8x   0/1     Completed   0          4m2s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Retrieve cron job logs from Kubernetes&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kubectl logs k8sjob-27513350--1-xnj8x
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
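
&lt;p&gt;If we ever need to pause the schedule without deleting the job, the CronJob spec has a suspend flag we can toggle&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kubectl patch cronjob k8sjob -p '{"spec":{"suspend":true}}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
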



&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Finally, the first stage of our project is complete: we now have Covid data landing in Azure Data Lake every day. In the next step, we will read this file from Azure Data Lake, process the data with Apache Spark on Azure Databricks, and make the result of that processing available in a Data Warehouse.&lt;/p&gt;

</description>
      <category>etl</category>
      <category>dataengineering</category>
      <category>python</category>
      <category>azure</category>
    </item>
  </channel>
</rss>
