<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Rubens Barbosa</title>
    <description>The latest articles on DEV Community by Rubens Barbosa (@rubnsbarbosa).</description>
    <link>https://dev.to/rubnsbarbosa</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F849597%2F371054c4-3dc2-4942-bf2e-18b42a7342fc.JPG</url>
      <title>DEV Community: Rubens Barbosa</title>
      <link>https://dev.to/rubnsbarbosa</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rubnsbarbosa"/>
    <language>en</language>
    <item>
      <title>Slowly Changing Dimensions (SCD)</title>
      <dc:creator>Rubens Barbosa</dc:creator>
      <pubDate>Sat, 12 Jul 2025 16:02:47 +0000</pubDate>
      <link>https://dev.to/rubnsbarbosa/slowly-changing-dimensions-scd-2a2e</link>
      <guid>https://dev.to/rubnsbarbosa/slowly-changing-dimensions-scd-2a2e</guid>
      <description>&lt;p&gt;Slowly Changing Dimensions (SCD) are a fundamental part of Dimensional Data Modeling, particularly in data warehousing and business intelligence. Before we delve into the details of SCD, it is helpful to focus on some fundamental concepts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is Data Modeling?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Data Modeling is the process of creating a visual representation/diagram of a data system, showing its entities and the relationships between them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is a dimension?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Dimensions are attributes of an entity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is an entity?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An entity is a representation of either a real-world object or an abstract concept.&lt;/p&gt;

&lt;p&gt;Example of entities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Real-World Objects:&lt;/strong&gt; customer, car, product etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Concepts:&lt;/strong&gt; course, sale, order etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Attributes of an entity&lt;/strong&gt; aka &lt;strong&gt;dimensions&lt;/strong&gt; are specific pieces of information about that entity. For example, a customer entity might have attributes such as: name, birthday, address, and phone number.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dimensions are categorized into two types
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft2bq7fcoh0dokksj9nd7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft2bq7fcoh0dokksj9nd7.png" alt="dimensions-types" width="800" height="714"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In data modeling, it is important to consider whether an attribute is fixed or slowly changing. Considering the above attributes of a customer entity, the birthday is a great example of a fixed dimension: no one can change their birthday. The phone number is an example of a slowly changing dimension, because the customer can change their phone number over time; the attribute is time-dependent, and it changes slowly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Slowly Changing Dimensions (SCD)
&lt;/h2&gt;

&lt;p&gt;Slowly Changing Dimensions (SCD) are a concept in data warehousing that refers to how the attributes of a given dimension table are managed when some records change slowly over time, often in an unpredictable manner. There are different types of SCDs, each with its own approach to handling changes.&lt;/p&gt;

&lt;p&gt;In Germany, Deutsche Post offers a service called &lt;em&gt;"Post Nachsendeauftrag"&lt;/em&gt; that forwards orders/letters after a relocation (from the old address to the new one). To understand the different types of SCDs, I will use this example: &lt;strong&gt;&lt;em&gt;a customer has moved and now needs to inform Deutsche Post of his new address in order to receive letters at his new address&lt;/em&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  SCD Type 0: Fixed dimensions
&lt;/h3&gt;

&lt;p&gt;No changes are allowed. The dimension attributes remain static and are never updated.&lt;/p&gt;

&lt;h3&gt;
  
  
  SCD Type 1: Overwrite the old value
&lt;/h3&gt;

&lt;p&gt;The old value of the dimension attribute is overwritten with the new value. This approach does not keep any history of changes. For example, if a customer moves to a new address, the old address is updated with the new address.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;address&lt;/th&gt;
&lt;th&gt;city&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Friedrich&lt;/td&gt;
&lt;td&gt;Goethestraße 1&lt;/td&gt;
&lt;td&gt;München&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Sophia&lt;/td&gt;
&lt;td&gt;Eiffestraße 12&lt;/td&gt;
&lt;td&gt;Hamburg&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If Friedrich moves to Ebertstraße 17 in Berlin, the Deutsche Post customer dimension table would be updated as follows:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;address&lt;/th&gt;
&lt;th&gt;city&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Friedrich&lt;/td&gt;
&lt;td&gt;Ebertstraße 17&lt;/td&gt;
&lt;td&gt;Berlin&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Sophia&lt;/td&gt;
&lt;td&gt;Eiffestraße 12&lt;/td&gt;
&lt;td&gt;Hamburg&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
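
&lt;p&gt;Since SCD logic usually lives in an ETL job, here is a minimal sketch of a Type 1 overwrite expressed as a Spark/Scala DataFrame transformation (Spark is covered elsewhere in this feed). It is only illustrative: the table is built inline from the example rows above, and in a real warehouse this would typically be an UPDATE or MERGE against the dimension table.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark: SparkSession = SparkSession.builder().master("local[*]").appName("scd1").getOrCreate()
import spark.implicits._

// The customer dimension from the example above
val customers = Seq(
  (1, "Friedrich", "Goethestraße 1", "München"),
  (2, "Sophia", "Eiffestraße 12", "Hamburg")
).toDF("customer_id", "name", "address", "city")

// SCD Type 1: overwrite the old values in place; no history is kept
val updated = customers
  .withColumn("address", when($"customer_id" === 1, lit("Ebertstraße 17")).otherwise($"address"))
  .withColumn("city", when($"customer_id" === 1, lit("Berlin")).otherwise($"city"))

updated.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;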

&lt;h3&gt;
  
  
  SCD Type 2: Adding a new row
&lt;/h3&gt;

&lt;p&gt;In Type 2 SCD, a new row is added to the dimension table to represent the new value, and the old row is marked as inactive or expired. This approach maintains a full history of changes. &lt;/p&gt;

&lt;p&gt;Using the same Deutsche Post customer dimension table, if Friedrich moves to Ebertstraße 17 in Berlin, a new row is added for Friedrich with the new address, and the old row is marked as inactive.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;id&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;address&lt;/th&gt;
&lt;th&gt;city&lt;/th&gt;
&lt;th&gt;active&lt;/th&gt;
&lt;th&gt;effective_date&lt;/th&gt;
&lt;th&gt;end_date&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Friedrich&lt;/td&gt;
&lt;td&gt;Goethestraße 1&lt;/td&gt;
&lt;td&gt;München&lt;/td&gt;
&lt;td&gt;N&lt;/td&gt;
&lt;td&gt;2020-01-01&lt;/td&gt;
&lt;td&gt;2025-03-13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Friedrich&lt;/td&gt;
&lt;td&gt;Ebertstraße 17&lt;/td&gt;
&lt;td&gt;Berlin&lt;/td&gt;
&lt;td&gt;Y&lt;/td&gt;
&lt;td&gt;2025-03-14&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Sophia&lt;/td&gt;
&lt;td&gt;Eiffestraße 12&lt;/td&gt;
&lt;td&gt;Hamburg&lt;/td&gt;
&lt;td&gt;Y&lt;/td&gt;
&lt;td&gt;2022-02-10&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Instead of using NULL to indicate that a record is currently active, some dimensional models use a date far enough in the future to mark that the record is still valid, for example '9999-12-31' for active records.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;id&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;address&lt;/th&gt;
&lt;th&gt;city&lt;/th&gt;
&lt;th&gt;active&lt;/th&gt;
&lt;th&gt;effective_date&lt;/th&gt;
&lt;th&gt;end_date&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Friedrich&lt;/td&gt;
&lt;td&gt;Goethestraße 1&lt;/td&gt;
&lt;td&gt;München&lt;/td&gt;
&lt;td&gt;N&lt;/td&gt;
&lt;td&gt;2020-01-01&lt;/td&gt;
&lt;td&gt;2025-03-13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Friedrich&lt;/td&gt;
&lt;td&gt;Ebertstraße 17&lt;/td&gt;
&lt;td&gt;Berlin&lt;/td&gt;
&lt;td&gt;Y&lt;/td&gt;
&lt;td&gt;2025-03-14&lt;/td&gt;
&lt;td&gt;9999-12-31&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Sophia&lt;/td&gt;
&lt;td&gt;Eiffestraße 12&lt;/td&gt;
&lt;td&gt;Hamburg&lt;/td&gt;
&lt;td&gt;Y&lt;/td&gt;
&lt;td&gt;2022-02-10&lt;/td&gt;
&lt;td&gt;9999-12-31&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;strong&gt;effective date&lt;/strong&gt; column indicates the date from which a particular record of a dimension becomes active or valid. It marks the beginning of the period during which the information in that record is considered current.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;end date&lt;/strong&gt; column indicates the date when a particular record of a dimension is no longer active or valid. It marks the end of the period during which the information in that record was considered valid. This column might also be called expiry date, effective end date, or similar names.&lt;/p&gt;
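
&lt;p&gt;In pipeline terms, a Type 2 update is an "expire and append" operation. Below is a minimal Spark/Scala sketch of that logic, using the '9999-12-31' convention from the second table. The dates are the illustrative ones from the example, and a production pipeline would more likely run a MERGE against the dimension table.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark: SparkSession = SparkSession.builder().master("local[*]").appName("scd2").getOrCreate()
import spark.implicits._

// Current state of the dimension, before Friedrich moves
val dim = Seq(
  (1, "Friedrich", "Goethestraße 1", "München", "Y", "2020-01-01", "9999-12-31"),
  (2, "Sophia", "Eiffestraße 12", "Hamburg", "Y", "2022-02-10", "9999-12-31")
).toDF("id", "name", "address", "city", "active", "effective_date", "end_date")

// The currently active row whose address is changing
val isChanging = ($"id" === 1).and($"active" === "Y")

// Step 1: expire the old row (close its validity window, mark it inactive)
val expired = dim
  .withColumn("end_date", when(isChanging, lit("2025-03-13")).otherwise($"end_date"))
  .withColumn("active", when(isChanging, lit("N")).otherwise($"active"))

// Step 2: append a new row carrying the new address
val newVersion = Seq(
  (1, "Friedrich", "Ebertstraße 17", "Berlin", "Y", "2025-03-14", "9999-12-31")
).toDF("id", "name", "address", "city", "active", "effective_date", "end_date")

expired.union(newVersion).show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;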

&lt;h3&gt;
  
  
  SCD Type 3: Adding a new column
&lt;/h3&gt;

&lt;p&gt;A new column is added to the dimension table to store the new value, while the old value is preserved in the original column. This approach maintains limited history, as only the previous value is preserved.&lt;/p&gt;

&lt;p&gt;If Friedrich moves to Ebertstraße 17 in Berlin, a new column is added to store the new address, and the old address is preserved in the original column.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;previous_address&lt;/th&gt;
&lt;th&gt;new_address&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Friedrich&lt;/td&gt;
&lt;td&gt;Goethestraße 1, München&lt;/td&gt;
&lt;td&gt;Ebertstraße 17, Berlin&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Sophia&lt;/td&gt;
&lt;td&gt;Eiffestraße 12, Hamburg&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
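
&lt;p&gt;A Type 3 change is just a column-level operation, as the following short sketch shows. It assumes the same SparkSession boilerplate and imports as the previous sketches.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;// Assumes the SparkSession, functions._ and spark.implicits._ from the sketches above
val customers3 = Seq(
  (1, "Friedrich", "Goethestraße 1, München"),
  (2, "Sophia", "Eiffestraße 12, Hamburg")
).toDF("customer_id", "name", "previous_address")

// SCD Type 3: add a column for the new value; only one level of history survives
val scd3 = customers3.withColumn(
  "new_address",
  when($"customer_id" === 1, lit("Ebertstraße 17, Berlin")).otherwise(lit(null).cast("string"))
)

scd3.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;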

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;There are other types of SCDs, but in this article I will only cover the ones above. Each type of SCD has its own advantages and disadvantages, and the choice of which type to use depends on the specific requirements of the data warehousing project. I would say SCD Type 2 is the most commonly used in data warehousing and business intelligence, because it stores the history of changes in dimension attributes, which is crucial for many analytical and reporting purposes.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"In a well-designed dimensional model, dimension tables have many columns or attributes. It is not uncommon for a dimension table to have 50 to 100 attributes. The power of the data warehouse is directly proportional to the quality and depth of the dimension attributes."&lt;/em&gt; (The Data Warehouse Toolkit)&lt;/p&gt;

</description>
      <category>datawarehouse</category>
      <category>dataengineering</category>
      <category>database</category>
    </item>
    <item>
      <title>Apache Spark 101</title>
      <dc:creator>Rubens Barbosa</dc:creator>
      <pubDate>Sat, 25 May 2024 20:47:37 +0000</pubDate>
      <link>https://dev.to/rubnsbarbosa/apache-spark-101-2p68</link>
      <guid>https://dev.to/rubnsbarbosa/apache-spark-101-2p68</guid>
      <description>&lt;p&gt;In order to understand Spark let's remember what was the scenario before its creation. A couple of years ago computers became faster every year through processor speed increases. This trend in hardware stopped around 2005 due to hard limits in heat dissipation. So, hardware engineers stopped making individual processors faster, and started adding &lt;strong&gt;parallel CPU cores all running at the same speed&lt;/strong&gt;. As a result of this change, applications needed to be modified to add parallelism in order to run faster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Google&lt;/strong&gt; wanted to run giant computations on high volumes of data across large clusters, because it was creating indexes of all the content of the web in order to identify the most important pages. So, Google &lt;strong&gt;designed MapReduce, a parallel data processing framework&lt;/strong&gt;, which enabled it to index the web.&lt;/p&gt;

&lt;p&gt;At that time, Hadoop MapReduce was the dominant parallel programming engine for clusters of thousands of nodes. So, why was Spark created? Well, &lt;strong&gt;the MapReduce engine made it challenging and inefficient to build large applications&lt;/strong&gt; that chained multiple MapReduce jobs together, which &lt;strong&gt;caused a lot of reading and writing to disk&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;To address this issue, the &lt;strong&gt;Spark&lt;/strong&gt; team first designed an API based on functional programming that could express multistep applications. The team then implemented this API over a new engine that &lt;strong&gt;could perform efficient, in-memory data sharing across computation steps&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is Apache Spark?
&lt;/h3&gt;

&lt;p&gt;Apache Spark is an open-source unified computing engine and a set of libraries for parallel data processing on computer clusters. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Spark is a fast engine for large-scale data processing&lt;/strong&gt;. Basically, the idea is that we can write code describing how we want to transform a huge amount of data, and Spark will figure out how to distribute that work across an entire cluster of computers, i.e., the driver sends tasks to workers, which run/process them in parallel. Apache Spark takes a massive data set and distributes the processing across an entire set of computers that work together in parallel at the same time.&lt;/p&gt;

&lt;p&gt;In a nutshell Spark can execute tasks on data across a cluster of computers. &lt;/p&gt;

&lt;p&gt;NOTE: Spark itself is written in Scala and runs on the Java Virtual Machine (JVM). Therefore, to run Spark either on your laptop or on a cluster, you need an installation of Java.&lt;/p&gt;

&lt;h3&gt;
  
  
  Architecture
&lt;/h3&gt;

&lt;p&gt;Spark application architecture at a high level&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg3rielawpqybv0nzq5ed.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg3rielawpqybv0nzq5ed.png" alt="spark-architecture" width="800" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Spark architecture consists of a driver process, executors, a cluster manager, and worker nodes. Apache Spark follows a master/worker architecture; it has a single master and any number of workers.&lt;/p&gt;

&lt;p&gt;There are some key components under the hood, such as: the Driver Program, Cluster Manager, Tasks, Partitions, Executors, and Worker Nodes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Spark APIs
&lt;/h3&gt;

&lt;p&gt;When working with Spark, we will come across different APIs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;RDD (Resilient Distributed Datasets) API&lt;/li&gt;
&lt;li&gt;DataFrame API&lt;/li&gt;
&lt;li&gt;Dataset API&lt;/li&gt;
&lt;li&gt;SQL API&lt;/li&gt;
&lt;li&gt;Structured Streaming API&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  RDDs
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;RDDs are distributed collections of objects that can be processed in parallel;&lt;/li&gt;
&lt;li&gt;RDDs support two types of operations: transformations (which produce a new RDD) and actions (which return a value to the driver program after running a computation on the dataset);&lt;/li&gt;
&lt;li&gt;RDDs provide low-level control over data flow and data processing/operations;&lt;/li&gt;
&lt;li&gt;RDDs are fault tolerant, automatically recovering data lost due to node failures using lineage information (data lineage is the process of tracking the flow of data over time);&lt;/li&gt;
&lt;li&gt;RDDs don’t infer the schema of the data; we need to specify it ourselves.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;RDD Scala example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.spark.rdd.RDD&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.spark.sql.SparkSession&lt;/span&gt;

&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;spark&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;SparkSession&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SparkSession&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;master&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"local[*]"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;appName&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"rdd"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;getOrCreate&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;// I wanna square everything&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;rdd&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;spark&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;sparkContext&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;parallelize&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
&lt;span class="c1"&gt;// we are creating a new RDD called squares&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;rddSquares&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;rdd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;map&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rddSquares&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;res = 1, 4, 9, 16&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The beauty of this example is that it could be distributed. If the RDD were really massive, Spark could split the processing up, handle the squaring of different chunks of the RDD on different nodes within our cluster, and send the results back to the driver script to give us the final answer we want.&lt;/p&gt;

&lt;p&gt;Another example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.spark.rdd.RDD&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.spark.sql.SparkSession&lt;/span&gt;

&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;spark&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;SparkSession&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SparkSession&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;master&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"local[*]"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;appName&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"rdd"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;getOrCreate&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;rddNums&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;RDD&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;Int&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;spark&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;sparkContext&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;parallelize&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;rddCollect&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Array&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;Int&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;rddNums&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;collect&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Action: RDD converted to Array[Int]"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's talk about the &lt;code&gt;rdd.collect()&lt;/code&gt; method in Apache Spark: it is a powerful and potentially problematic operation. It is used to retrieve the entire &lt;code&gt;rdd&lt;/code&gt; from the distributed environment back to the local driver program. Because the &lt;code&gt;collect()&lt;/code&gt; method requires the full dataset in memory, it carries significant risks and potential issues, especially when dealing with large datasets.&lt;/p&gt;

&lt;p&gt;Issues with &lt;code&gt;rdd.collect()&lt;/code&gt;&lt;br&gt;
&lt;strong&gt;memory overload&lt;/strong&gt; because it transfers all data from the distributed nodes to the driver node. If the dataset is large, this can cause the &lt;strong&gt;driver program to run out of memory&lt;/strong&gt; and crash, because it tries to fit the entire dataset into the limited memory of the driver node. Imagine calling &lt;code&gt;rdd.collect()&lt;/code&gt; on terabytes of data: it will try to bring all that data into the memory of a single machine, aka the driver, which is often impossible. So, in this scenario the job will almost certainly fail.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;network bottleneck&lt;/strong&gt; due to transferring large amounts of data over the network from the worker nodes to the driver node. This can lead to slow performance of the Spark job.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;reduced parallelism&lt;/strong&gt; one of the strengths of Spark is its ability to process data in parallel across a cluster; using &lt;code&gt;collect()&lt;/code&gt; negates this advantage by aggregating all the data onto a single node, losing the benefits of distributed processing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rules of thumb&lt;/strong&gt; avoid using &lt;code&gt;collect()&lt;/code&gt; as much as possible; its use should be approached with caution. There are some best practices: instead of collecting the entire dataset, use Spark actions such as &lt;code&gt;take(n)&lt;/code&gt;, &lt;code&gt;aggregate()&lt;/code&gt;, or &lt;code&gt;reduce()&lt;/code&gt; to perform computations on the data directly within the distributed environment. Also, persist intermediate results in memory or on disk using &lt;code&gt;persist()&lt;/code&gt; or &lt;code&gt;cache()&lt;/code&gt;.&lt;/p&gt;
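
&lt;p&gt;To make those rules concrete, here is a small sketch on a made-up dataset: &lt;code&gt;take(n)&lt;/code&gt; ships only a handful of elements to the driver, and &lt;code&gt;reduce()&lt;/code&gt; aggregates on the executors so that only the final value travels over the network.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder().master("local[*]").appName("actions").getOrCreate()

val rdd = spark.sparkContext.parallelize(1 to 1000)

// Only a small sample reaches the driver
val firstFive = rdd.take(5)

// The sum is computed on the executors; a single Int comes back
val total = rdd.reduce(_ + _)

// Keep a reused RDD in memory instead of recomputing it
rdd.cache()

println(s"first five: ${firstFive.mkString(", ")}, total: $total")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
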
&lt;h4&gt;
  
  
  DataFrame
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;A DataFrame is a distributed collection of rows under named columns (similar to a table in a relational database);&lt;/li&gt;
&lt;li&gt;Built on top of RDDs, it provides a higher-level abstraction for structured data;&lt;/li&gt;
&lt;li&gt;Simplifies data manipulation with a high-level API;&lt;/li&gt;
&lt;li&gt;Easily integrates with various data sources like JSON, CSV, Parquet, etc;&lt;/li&gt;
&lt;li&gt;It does not support compile-time safety, so the user is limited in case the structure of the data is not known.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The DataFrame API makes it easier to perform complex data processing tasks&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.spark.sql.SparkSession&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.spark.sql.functions._&lt;/span&gt;

&lt;span class="c1"&gt;// Initialize SparkSession&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;spark&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;SparkSession&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SparkSession&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;master&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"local[*]"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;appName&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"dataframe"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;getOrCreate&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;// Create DataFrame from CSV file&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;filePath&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"path/to/your/csvfile.csv"&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;df&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;spark&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;read&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"header"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"true"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"inferSchema"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"true"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;csv&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filePath&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// Show the first 5 rows&lt;/span&gt;
&lt;span class="nv"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;show&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
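
&lt;p&gt;For instance, continuing the example above and assuming the CSV contains hypothetical &lt;code&gt;city&lt;/code&gt; and &lt;code&gt;age&lt;/code&gt; columns, a filter/groupBy/aggregate pipeline is just a few method calls:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;// Continues from the df defined above; "city" and "age" are assumed columns
val perCity = df
  .filter(col("age") &amp;gt; 25)
  .groupBy(col("city"))
  .agg(avg(col("age")).alias("avg_age"), count(lit(1)).alias("clients"))
  .orderBy(desc("clients"))

perCity.show(5)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;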



&lt;h4&gt;
  
  
  Dataset
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Datasets are a distributed collection of data, combining the best features of RDDs and DataFrames;&lt;/li&gt;
&lt;li&gt;A Dataset is a strongly-typed, immutable collection of objects that are mapped to a relational schema;&lt;/li&gt;
&lt;li&gt;Ensures compile-time type safety and supports object-oriented programming paradigms;&lt;/li&gt;
&lt;li&gt;The main disadvantage of datasets is that they require typecasting into strings;&lt;/li&gt;
&lt;li&gt;We can use them for complex transformations on structured data where compile-time type checking is beneficial.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.spark.sql.&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="nc"&gt;Dataset&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;SparkSession&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Define the schema of our data&lt;/span&gt;
&lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Int&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// Initialize SparkSession&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;spark&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;SparkSession&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SparkSession&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;master&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"local[*]"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;appName&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"dataset"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;getOrCreate&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;spark.implicits._&lt;/span&gt;
&lt;span class="c1"&gt;// Create Dataset from a sequence of case class instances&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;data&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Seq&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
      &lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"John"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"München"&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt;
      &lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Jane"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Berlin"&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt;
      &lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Mike"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;35&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Frankfurt"&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt;
      &lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Sara"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Dachau"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;ds&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Dataset&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;Client&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;spark&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;createDataset&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// Show the content of the Dataset&lt;/span&gt;
&lt;span class="nv"&gt;ds&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;show&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using Datasets, we can benefit from the best features of RDDs and DataFrames: the type safety and object-oriented programming interface of RDDs, and the optimized execution and ease of use that the higher level of abstraction of DataFrames provides for working with structured data in Spark.&lt;/p&gt;
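
&lt;p&gt;For instance, continuing the example above, transformations on a &lt;code&gt;Dataset[Client]&lt;/code&gt; are checked against the case class at compile time, so a typo such as &lt;code&gt;_.agee&lt;/code&gt; would not even compile:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;// Typed, compile-time-checked transformations on the Dataset defined above
val adults: Dataset[Client] = ds.filter(_.age &amp;gt;= 28)
val names: Dataset[String] = adults.map(_.name)
names.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;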

&lt;h4&gt;
  
  
  SQL (via Spark SQL)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Allows users to run SQL queries directly on DataFrames or Datasets;&lt;/li&gt;
&lt;li&gt;Provides a way to query data using standard SQL syntax;&lt;/li&gt;
&lt;li&gt;Uses standard SQL, which is familiar to many data professionals;&lt;/li&gt;
&lt;li&gt;Queries return DataFrames, enabling further processing using the DataFrame API;&lt;/li&gt;
&lt;li&gt;Ad-hoc querying and data exploration.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.spark.sql.SparkSession&lt;/span&gt;

&lt;span class="c1"&gt;// Initialize SparkSession&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;spark&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;SparkSession&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SparkSession&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;master&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"local[*]"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;appName&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"sql"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;getOrCreate&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;// Create DataFrame from CSV file&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;filePath&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"path/to/your/csvfile.csv"&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;df&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;spark&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;read&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"header"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"true"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"inferSchema"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"true"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;csv&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filePath&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// Register the DF as a temp SQL view&lt;/span&gt;
&lt;span class="nv"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;createOrReplaceView&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"clients"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// Execute SQL queries &lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;allRowsDF&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;spark&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;sql&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"SELECT * FROM clients"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;allRowsDF&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;show&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using Spark SQL with Scala allows you to execute SQL queries directly on your data.&lt;/p&gt;
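
&lt;p&gt;And because &lt;code&gt;spark.sql&lt;/code&gt; returns a DataFrame, the result of one query can feed straight into further DataFrame calls. A short follow-up to the example above, again assuming the &lt;code&gt;clients&lt;/code&gt; view has a hypothetical &lt;code&gt;city&lt;/code&gt; column:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;// Aggregate with plain SQL, then keep working with the result as a DataFrame
val perCityDF = spark.sql("SELECT city, COUNT(*) AS clients FROM clients GROUP BY city")
perCityDF.orderBy(perCityDF("clients").desc).show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;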

&lt;h4&gt;
  
  
  Structured Streaming
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Built on the Spark SQL engine, it enables the same DataFrame and Dataset API to be used for stream processing;&lt;/li&gt;
&lt;li&gt;Uses the same API for batch and streaming data, simplifying the development process;&lt;/li&gt;
&lt;li&gt;Easy to use due to its high-level abstraction for defining streaming computations;&lt;/li&gt;
&lt;li&gt;Real-time data processing and analytics;&lt;/li&gt;
&lt;li&gt;Stream processing applications that require the same APIs and optimizations as batch processing.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.spark.sql.SparkSession&lt;/span&gt;

&lt;span class="c1"&gt;// Initialize SparkSession&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;spark&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;SparkSession&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SparkSession&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;master&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"local[*]"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;appName&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"StructuredStream"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;getOrCreate&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;kafkaStream&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;spark&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;readStream&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;format&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"kafka"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"kafka.bootstrap.servers"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"localhost:9092"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"subscribe"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"kafka_topic_name"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"startingOffsets"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"earliest"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;load&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;query&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kafkaStream&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;selectExpr&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"CAST(key AS STRING)"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"CAST(value AS STRING)"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;writeStream&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;outputMode&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"append"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;format&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"console"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;start&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;

&lt;span class="nv"&gt;query&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;awaitTermination&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, the code reads data from the Kafka topic. Then the key and value are written to the console in append output mode. The &lt;code&gt;awaitTermination&lt;/code&gt; method then blocks and waits for the started streaming query to terminate.&lt;/p&gt;

&lt;p&gt;Structured Streaming in Spark is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. It allows you to work with streaming data in the same way you would work with batch data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why should we use Spark?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Spark can run programs up to 100 times faster than Hadoop MapReduce when processing data in memory;&lt;/li&gt;
&lt;li&gt;It offers fast processing speed through in-memory caching and in-memory data processing;&lt;/li&gt;
&lt;li&gt;Spark is a very mature technology; it has been out for a while, so it is reliable at this point;&lt;/li&gt;
&lt;li&gt;Spark is not that hard to learn, and applications can be implemented in a variety of programming languages like Scala, Java, and Python;&lt;/li&gt;
&lt;li&gt;Spark bundles powerful libraries.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's all folks :)&lt;/p&gt;

</description>
      <category>apache</category>
      <category>spark</category>
      <category>scala</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Gradient Descent, an Optimization Method used in Machine Learning</title>
      <dc:creator>Rubens Barbosa</dc:creator>
      <pubDate>Tue, 17 May 2022 10:12:18 +0000</pubDate>
      <link>https://dev.to/rubnsbarbosa/gradient-descent-an-optimization-method-used-in-machine-learning-1njp</link>
      <guid>https://dev.to/rubnsbarbosa/gradient-descent-an-optimization-method-used-in-machine-learning-1njp</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;In this article we are going to use some concepts of Differential Calculus, mainly in partial derivative and chain rule.&lt;/em&gt; 📚&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fam0nr3qmoxd40omlis86.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fam0nr3qmoxd40omlis86.png" alt="function with minima and maxima locals" width="616" height="386"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Optimization
&lt;/h2&gt;

&lt;p&gt;A mathematical optimization problem, or just optimization problem, has only one goal: to find the best element from a set of candidates through a function known as cost function, loss function, or objective function.&lt;/p&gt;

&lt;p&gt;Mathematically, an unconstrained optimization problem with decision variables &lt;strong&gt;θ&lt;/strong&gt; and cost function L has the following form&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F69irtl9xku2e5z04b0kh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F69irtl9xku2e5z04b0kh.png" alt="min" width="227" height="63"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We are interested in finding a vector &lt;strong&gt;θ&lt;/strong&gt; that leads to the lowest value of the cost function. If it exists, this vector is called the optimal solution or global minimum and will be denoted by &lt;strong&gt;θ&lt;/strong&gt;*, as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4nucwmzjbhom31kg3air.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4nucwmzjbhom31kg3air.png" alt="min global" width="665" height="69"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note that, depending on the function, the equation above might not have an optimal solution. Sometimes the function may have more than one local minimum or maximum, like the graph presented at the top of this article. However, we want to find a global minimum. In this context, let’s assume that our problem has an optimal solution.&lt;/p&gt;

&lt;p&gt;Assuming L is differentiable and convex, a necessary and sufficient condition for a vector &lt;strong&gt;θ&lt;/strong&gt;* to be an optimal solution is&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fipo8drcjakbukbtpqo2g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fipo8drcjakbukbtpqo2g.png" alt="gradient" width="156" height="52"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;where ∇ is the gradient operator:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvbv19hfb7ssaa9pbo89j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvbv19hfb7ssaa9pbo89j.png" alt="gradient" width="225" height="145"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Concretely, if the function is strictly convex, &lt;strong&gt;θ&lt;/strong&gt;* is the unique optimal solution to the optimization problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost Function
&lt;/h2&gt;

&lt;p&gt;In Machine Learning it is common to compute the cost function in order to know how our model is performing, because our purpose is to find the best hypothesis. The cost function computes the error over the whole training data set. So, we want to minimize the cost function, i.e., minimize the error of the machine learning model.&lt;/p&gt;

&lt;p&gt;There are different types of cost functions, and we choose one depending on the model we are going to work with. In this article, we use the linear regression model, and our cost function is the mean squared error, as shown below&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsgx6r693gq925wmo59sb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsgx6r693gq925wmo59sb.png" alt="MSE" width="669" height="210"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Gradient Descent
&lt;/h2&gt;

&lt;p&gt;Gradient descent is probably one of the simplest and most widely used iterative algorithms for the optimization of continuous and differentiable functions. The basic idea behind the method is to begin from an initial point chosen randomly and improve it repeatedly, taking small steps in the opposite direction of the gradient at each iteration.&lt;/p&gt;

&lt;p&gt;That is, we start with an initial guess &lt;strong&gt;θ&lt;/strong&gt;(0) and at each iteration t = 0, 1, 2, . . . we compute the update rule:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc5h69nxv5wuyin7dmpha.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc5h69nxv5wuyin7dmpha.png" alt="gradient_descent" width="309" height="63"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The choice of the learning rate α is crucial for the convergence of the gradient descent. In practice, it is common to adopt a constant value for α.&lt;/p&gt;

&lt;p&gt;−∇L(&lt;strong&gt;θ&lt;/strong&gt;) is the direction in which L decreases at &lt;strong&gt;θ&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Applying gradient descent to the cost function
&lt;/h2&gt;

&lt;p&gt;We are going to use the linear regression model represented below&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4xb7f6cqoergx3d7w8xr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4xb7f6cqoergx3d7w8xr.png" alt=" " width="352" height="170"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Find the gradient vector of &lt;strong&gt;θ&lt;/strong&gt;(0)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fficdfpfwmgpk8pvnuxu3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fficdfpfwmgpk8pvnuxu3.png" alt="theta zero" width="598" height="637"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Find the gradient vector of &lt;strong&gt;θ&lt;/strong&gt;(1)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiva4q2xljyt9z3k9duzb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiva4q2xljyt9z3k9duzb.png" alt="theta one" width="584" height="638"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can visualize our cost function with the illustration below. Observe that our function is convex; thus, the problem always has a global minimum, our solution &lt;strong&gt;θ&lt;/strong&gt;*.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyz2hfr0vmyipbq2x03kq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyz2hfr0vmyipbq2x03kq.png" alt="convex cost function" width="616" height="386"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Gradient Descent Algorithm
&lt;/h2&gt;

&lt;p&gt;Below you can find the pseudocode of gradient descent&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fglwmt6xxh9oodh1x7ouu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fglwmt6xxh9oodh1x7ouu.png" alt="pseudocode" width="731" height="357"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this article, we learned the theory of gradient descent applied to the linear regression model. However, gradient descent can be used in other machine learning and deep learning algorithms. Something important for gradient descent is the choice of learning rate. Large values for the learning rate can result in overshooting; that is, if we update the parameters using large values for the learning rate, the function may fail to converge to the global minimum. On the other hand, with small values for the learning rate it would take too many iterations for the function to converge.&lt;/p&gt;

&lt;p&gt;Although I only talked about gradient descent, there are other versions of it, such as batch gradient descent, stochastic gradient descent, and mini-batch gradient descent. I would like you to read about these algorithms and try to understand when you might use each one of them.&lt;/p&gt;

&lt;p&gt;I hope you enjoyed reading this article. Thank you! 🙂&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Getting started with Apache Kafka using Python</title>
      <dc:creator>Rubens Barbosa</dc:creator>
      <pubDate>Sat, 07 May 2022 11:40:05 +0000</pubDate>
      <link>https://dev.to/rubnsbarbosa/getting-started-with-apache-kafka-using-python-36ko</link>
      <guid>https://dev.to/rubnsbarbosa/getting-started-with-apache-kafka-using-python-36ko</guid>
      <description>&lt;p&gt;Apache Kafka is a distributed streaming system that provide real-time access to the data. This system let us publish and subscribe to streams of data, store them, and process them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Message&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;The unit of data within Kafka is called a message. A message is simply an array of bytes. A message can have an optional bit of metadata, which is referred to as a key.&lt;/p&gt;

&lt;p&gt;For efficiency, messages are written into Kafka in batches. A batch is just a collection of messages, all of which are being produced to the same topic and partition.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Topics&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Messages in Kafka are categorized into topics. The closest analogies for a topic are a database table or a folder in a filesystem. Topics are additionally broken down into a number of partitions. Note that as a topic typically has multiple partitions, there is no guarantee of message time-ordering across the entire topic, just within a single partition.&lt;/p&gt;
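
&lt;p&gt;As a sketch using the kafka-python admin client (installed later in this article), a topic with multiple partitions can also be created programmatically; the topic name and counts are illustrative&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from kafka.admin import KafkaAdminClient, NewTopic

# three partitions let up to three consumers in a group read in parallel
admin = KafkaAdminClient(bootstrap_servers='localhost:9092')
admin.create_topics([NewTopic(name='orders', num_partitions=3, replication_factor=1)])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
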

&lt;p&gt;&lt;strong&gt;Producers and Consumers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Producers in Kafka are the client applications that create and send messages to topics. In some cases, the producer will direct messages to specific partitions. This is typically done using the message key and a partitioner that generates a hash of the key and maps it to a specific partition.&lt;/p&gt;
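
&lt;p&gt;For example, with the kafka-python client introduced later in this article, a producer can pass a key so that every message sharing that key is hashed to the same partition; the topic and key here are illustrative&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers='localhost:9092')
# all messages with key b'customer-42' land in the same partition
producer.send('first-topic', key=b'customer-42', value=b'order created')
producer.flush()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
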

&lt;p&gt;The consumer subscribes to one or more topics and reads the messages in the order in which they were produced. The consumer keeps track of which messages it has already consumed by storing the offset of messages. The offset is another bit of metadata, an integer value that continually increases, which Kafka adds to each message as it is produced. Each message in a given partition has a unique offset.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Brokers and Clusters&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A single Kafka server is called a broker. Depending on the specific hardware and its performance characteristics, a single broker can easily handle thousands of partitions and millions of messages per second. Kafka brokers are designed to operate as part of a cluster. Within a cluster of brokers, one broker will also function as the cluster controller.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retention&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A key feature of Apache Kafka is that of retention, which is the durable storage of messages for some period of time. Kafka brokers are configured with a default retention setting for topics, either retaining messages for some period of time (e.g., 7 days) or until the topic reaches a certain size in bytes (e.g., 1 GB). Once these limits are reached, messages are expired and deleted so that the retention configuration is a minimum amount of data available at any time. Individual topics can also be configured with their own retention settings so that messages are stored for only as long as they are useful.&lt;/p&gt;
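
&lt;p&gt;As an illustration, assuming the Homebrew installation used below (which puts Kafka's CLI tools on the PATH), a per-topic retention of 7 days (604800000 ms) could be set like this&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kafka-configs --bootstrap-server localhost:9092 \
--alter --entity-type topics --entity-name first-topic \
--add-config retention.ms=604800000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
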

&lt;p&gt;Now that we have an overview about Apache Kafka, let's install it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Installing Kafka&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I'll install Apache Kafka on macOS using Homebrew. To do so, I just need to type in my terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ brew install kafka
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Apache Kafka uses Zookeeper to store metadata about the Kafka cluster, as well as consumer client details. So, during the installation it will install Apache Zookeeper as well. We must already have Java installed on our machine.&lt;/p&gt;

&lt;p&gt;After installing Kafka we can see something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqrjovsdowsc4azkhfpxm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqrjovsdowsc4azkhfpxm.png" alt="Kafka-Logs" width="606" height="421"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Navigate to this directory in separate terminal sessions in order to execute Zookeeper and Kafka. The path might differ depending on your machine and OS.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cd /usr/local/opt/kafka/bin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;First, let's start Apache Zookeeper Server.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;zookeeper-server-start /usr/local/etc/kafka/zookeeper.properties
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, in another terminal session execute the command below&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kafka-server-start /usr/local/etc/kafka/server.properties
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All right, we have Apache Zookeeper and Apache Kafka running. What should we do now? Let's create a Kafka topic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Creation of Kafka Topic&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now, let's create a topic called first-topic in a new terminal session.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kafka-topics --create --topic first-topic \
--bootstrap-server localhost:9092 \
--replication-factor 1 --partitions 1

Created topic first-topic.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Producer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Produce messages to the &lt;strong&gt;first-topic&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kafka-console-producer --broker-list localhost:9092 \
--topic first-topic

&amp;gt;Sunday 1st May 2022
&amp;gt;Data Engineering    
&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Consumer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Consume messages from the &lt;strong&gt;first-topic&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kafka-console-consumer --bootstrap-server localhost:9092 \
--topic first-topic --from-beginning

Sunday 1st May 2022
Data Engineering 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;List Topics&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Listing all the Kafka topics in a cluster&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kafka-topics --list --bootstrap-server localhost:9092
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Delete Topic&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We might want to delete a specific topic&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kafka-topics --bootstrap-server localhost:9092 \
--delete --topic first-topic
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Producer &amp;amp; Consumer with Python
&lt;/h3&gt;

&lt;p&gt;Let's create a producer and a consumer using Python. First, we should create a virtual environment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ python3 -m venv env
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ python -m venv env
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Activate the virtual env in order to install the libraries&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ source env/bin/activate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's install the Python client for Apache Kafka and the Requests library&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ pip install kafka-python
$ pip install requests 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Python Producer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now, let's dive into our producer.py&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#!/usr/local/bin/python
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kafka&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;KafkaProducer&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timedelta&lt;/span&gt;

&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;basicConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INFO&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%(asctime)s - %(name)s - %(levelname)s - %(message)s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;producer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KafkaProducer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bootstrap_servers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;localhost:9092&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="c1"&gt;# get data from public API
&lt;/span&gt;    &lt;span class="n"&gt;date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;today&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;days&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;previous_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;year&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;month&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;day&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://indicadores.integrasus.saude.ce.gov.br/api/casos-coronavirus?dataInicio=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;previous_date&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;amp;dataFim=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;previous_date&lt;/span&gt;
    &lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;covid_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;covid_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;covid-topic&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;flush&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Python Consumer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Alright, let's have a look at our consumer.py&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#!/usr/local/bin/python
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kafka&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;KafkaConsumer&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;consumer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KafkaConsumer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;covid-topic&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;consumer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should have two terminal sessions open to run producer.py and consumer.py.&lt;/p&gt;
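
&lt;p&gt;Since the producer serialized each record with json.dumps, a variant of the consumer can decode the payload back into Python objects; a sketch, where auto_offset_reset makes it read the topic from the beginning and .offset shows the per-partition offset discussed earlier&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;#!/usr/local/bin/python
import json
from kafka import KafkaConsumer

if __name__ == "__main__":
    # decode each raw message value back into a Python dict
    consumer = KafkaConsumer('covid-topic',
                             auto_offset_reset='earliest',
                             value_deserializer=lambda m: json.loads(m.decode('utf-8')))
    for data in consumer:
        print(data.offset, data.value)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
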

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;We learned the main concepts of Apache Kafka: message/record, producers, consumers, topics, brokers and retention. An &lt;strong&gt;event&lt;/strong&gt; records the fact that "something happened". It is also called record or message. &lt;strong&gt;Producers&lt;/strong&gt; are those client applications that publish (write) events to Kafka, and &lt;strong&gt;consumers&lt;/strong&gt; are those that subscribe to (read and process) these events. Events are organized and durably stored in &lt;strong&gt;topics&lt;/strong&gt;. Very simplified, a topic is similar to a folder in a filesystem, and the events are the files in that folder.&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>programming</category>
      <category>python</category>
      <category>bash</category>
    </item>
    <item>
      <title>A Bunch of Linux Commands</title>
      <dc:creator>Rubens Barbosa</dc:creator>
      <pubDate>Sun, 01 May 2022 17:24:54 +0000</pubDate>
      <link>https://dev.to/rubnsbarbosa/a-bunch-of-linux-commands-4kn8</link>
      <guid>https://dev.to/rubnsbarbosa/a-bunch-of-linux-commands-4kn8</guid>
      <description>&lt;h2&gt;
  
  
  🚀 Summary
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Linux Overview&lt;/li&gt;
&lt;li&gt;System Information&lt;/li&gt;
&lt;li&gt;Files &amp;amp; Directory&lt;/li&gt;
&lt;li&gt;Compress &amp;amp; Extract Files&lt;/li&gt;
&lt;li&gt;Process Management&lt;/li&gt;
&lt;li&gt;File Permission&lt;/li&gt;
&lt;li&gt;Network&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🖥 Linux Overview
&lt;/h3&gt;

&lt;h4&gt;
  
  
  username@system_name:~$
&lt;/h4&gt;

&lt;h5&gt;
  
  
  The tilde (~) symbol stands for your home directory
&lt;/h5&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Directory&lt;/th&gt;
&lt;th&gt;Function&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;/&lt;/td&gt;
&lt;td&gt;begins the file system, called root&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;/home&lt;/td&gt;
&lt;td&gt;contains users' home directories&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;/bin&lt;/td&gt;
&lt;td&gt;all the standard commands and utility programs i.e. executable binaries such as cat, cp, ls, mv, ps, rm&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;/usr&lt;/td&gt;
&lt;td&gt;holds those files and commands used by the system&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;/var&lt;/td&gt;
&lt;td&gt;files that are expected to change in size and content (var stands for variable), such as mailbox files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;/dev&lt;/td&gt;
&lt;td&gt;file interfaces for devices such as the terminals and printers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;/etc&lt;/td&gt;
&lt;td&gt;is the home for system configuration files and any other system files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;/boot&lt;/td&gt;
&lt;td&gt;contains the few essential files needed to boot the system&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;/lib&lt;/td&gt;
&lt;td&gt;contains libraries (common code shared by applications and needed for them to run)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;/mnt&lt;/td&gt;
&lt;td&gt;it has been used since the early days of UNIX for temporarily mounting filesystems&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;/opt&lt;/td&gt;
&lt;td&gt;optional application software packages&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;/tmp&lt;/td&gt;
&lt;td&gt;temporary files; on some distributions erased across a reboot and/or may actually be a ramdisk in memory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;/sys&lt;/td&gt;
&lt;td&gt;virtual pseudo-filesystem giving information about the system and the hardware. Can be used to alter system parameters and for debugging purposes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;creating a user account&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ sudo useradd -m -c "Rubens Barbosa" -s /bin/bash rubnsbarbosa
$ sudo passwd rubnsbarbosa
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;check that the user was created correctly&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ grep rubnsbarbosa /etc/passwd /etc/group
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;connect as the new user&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ shh rubnsbarbosa@localhost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;remove the new user&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ sudo userdel -r rubnsbarbosa
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to check all users that have a home directory&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ls -l /home
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Keyboard Shortcuts
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;keyboard shortcut&lt;/th&gt;
&lt;th&gt;task&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;tab&lt;/td&gt;
&lt;td&gt;auto-completes files, directories, and binaries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ctrl + l&lt;/td&gt;
&lt;td&gt;clear the screen&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ctrl + a&lt;/td&gt;
&lt;td&gt;goes to the beginning of the line&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ctrl + e&lt;/td&gt;
&lt;td&gt;goes to the end of the line&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ctrl + d&lt;/td&gt;
&lt;td&gt;exits the current shell&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ctrl + z&lt;/td&gt;
&lt;td&gt;puts the current process into suspended background&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ctrl + c&lt;/td&gt;
&lt;td&gt;kill the current process&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ctrl + h&lt;/td&gt;
&lt;td&gt;works the same as backspace&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ctrl + w&lt;/td&gt;
&lt;td&gt;deletes the word before the cursor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ctrl + u&lt;/td&gt;
&lt;td&gt;deletes from beginning of line to cursor position&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  🧐 System Information
&lt;/h3&gt;

&lt;p&gt;the man pages, which are manuals for Linux commands, available from the command line interface (CLI)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;man &lt;span class="nb"&gt;ls&lt;/span&gt;  
&lt;span class="nv"&gt;$ &lt;/span&gt;man &lt;span class="nb"&gt;mkdir&lt;/span&gt;  
&lt;span class="nv"&gt;$ &lt;/span&gt;man &lt;span class="nb"&gt;rm&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;man &lt;span class="nb"&gt;grep&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;man patch
&lt;span class="nv"&gt;$ &lt;/span&gt;man diff
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;display a one-line manual page description of a command&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ whatis top
$ whatis mv
$ whatis nice
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;display the full path of the executable for a given command&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ which top
$ which grep
$ which nice
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;locate the binary, source, and manual page files for a command&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ whereis top
$ whereis grep
$ whereis nice
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;tell us your username&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;whoami&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;show the current date and time&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ date
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;show this month's calendar&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cal
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;show current uptime&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ uptime
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;tell us detailed information about the machine name, operating system and kernel&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;uname&lt;/span&gt; &lt;span class="nt"&gt;-a&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;display CPU information&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cat /proc/cpuinfo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;display memory information&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cat /proc/meminfo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;show the disk usage&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ df
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;show disk usage per directory&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ du
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;display memory and swap usage&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ free
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;display memory and swap usage in human format&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ free -h
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  ⚒️ Files &amp;amp; Directory
&lt;/h3&gt;

&lt;p&gt;clear terminal&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;clear  
&lt;span class="nv"&gt;$ &lt;/span&gt;ctrl + l
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;list the files and directories in the current directory&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ls
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;list content of current directory (display attributes such as owner, group owner, permissions)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ls -l
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;list hidden files and/or directories (hidden file begins with . dot sign)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ls -a
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;list everything of current directory (attributes such as owner, group owner, permissions) + hidden files&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ls -la
$ ls -al
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to know in which directory you're located (pwd stands for "print working directory")&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ pwd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to make directories e.g. mkdir foo will create a new directory or folder called "foo"&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ mkdir foo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to create any number of directories/folders simultaneously&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ mkdir foo bar foobar
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to remove or delete an empty directory/folder&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ rmdir foo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to remove or delete a file&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ rm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to remove or delete a directory and all of its contents recursively [f = force]&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ rm -r foo
$ rm -rf foo 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to change directories (moving through the file system)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to navigate into the root directory&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cd /
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to navigate to your home directory, use "cd"&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cd
$ cd ~
$ cd $HOME
$ cd /home_path/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to navigate up one directory level&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cd ..
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to navigate into the documents directory&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cd Documents/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to navigate into the documents directory and see which files and/or directories exists there&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cd Documents/{press tab twice}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to navigate into a directory whose name contains spaces&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cd 'best songs ever'
$ cd best\ songs\ ever
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to create a new empty file&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ touch file_name.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to create multiple empty files&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ touch foo.txt bar.txt foobar.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to execute several commands on the same line by separating them with a semicolon ;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ls ; date
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to see the contents of a file (the -n parameter numbers the output lines)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cat main.py
$ cat passwd.txt
$ cat -n song.txt 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;displays only the first 10 lines of the file&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ head foo.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;prints the first 'num' lines (given with -n) instead of the first 10 lines&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ head -n 5 foo.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;display only the last 10 lines of the file&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ tail foo.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;prints the last 'num' lines (given with -n) instead of the last 10 lines&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ tail -n 3 foo.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to move a file to a different location&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;$ mv [source-file] [destination-file]&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ mv foo.txt /home/ubuntu/script/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to move a file to your home directory (the terminal uses ~ as a shortcut for your home directory)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ mv foo.txt ~ 
$ mv foo.txt /home/ubuntu/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to copy a file from the current directory to a different one, the command below will make an exact copy of "foo.txt" file&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;$ cp [source-file] [destination-file]&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cp foo.txt /home/ubuntu/script/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to copy a directory, use "cp -r directory_name" (-r = recursively, i.e. copy the directory along with all its files and subdirectories)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cp -r bar /home/ubuntu/script/
$ cp -r 'best songs ever' /home/ubuntu/music/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to copy or move all your C or Python source code files to a given directory&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cp *.c algorithms
$ mv *.py algorithms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to find a file in the current working directory&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ find . -name hello_world.py
$ find . -name hello_world.py -print
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to find all files with the .py extension in the script directory&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ find script -name '*.py' -ls
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to locate other directories e.g. the command below will locate the script directory&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ find /home/ubuntu -name script -type d -print
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to find text in a file, i.e. the command will search through the file for the piece of text you are looking for&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ grep 'Hello' hello_world.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to print the history of commands executed in the terminal&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ history
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;reboots the system&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ reboot
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;shuts down the system&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ shutdown
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;shuts down the system by powering off&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ poweroff
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;brings the system down immediately&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ halt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;reboots the system by shutting it down completely and then restarting it&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ init 6
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;powers off the system using predefined scripts to synchronize and clean up the system prior to shutting down&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ init 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  📦 Compress &amp;amp; Extract Files
&lt;/h3&gt;

&lt;p&gt;The tar utility creates archives of files and directories, and was originally designed to create archives on tapes (the term "tar" stands for tape archive). The tar utility is ideal for making backups of your files, which can then be transferred over the Internet.&lt;/p&gt;

&lt;h4&gt;
  
  
  Archives using tar
&lt;/h4&gt;

&lt;p&gt;Syntax:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;tar [options] [archive-name.tar] [directory-or-file-name]&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Options&lt;/th&gt;
&lt;th&gt;Function&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;-c&lt;/td&gt;
&lt;td&gt;creates a new archive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;-x&lt;/td&gt;
&lt;td&gt;extract the archive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;-f&lt;/td&gt;
&lt;td&gt;specify an archive filename&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;-v&lt;/td&gt;
&lt;td&gt;verbosely display the .tar progress in the terminal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;-t&lt;/td&gt;
&lt;td&gt;lists files in archived file&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;-r&lt;/td&gt;
&lt;td&gt;appends files to an archive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;-u&lt;/td&gt;
&lt;td&gt;updates an archive with new files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;-w&lt;/td&gt;
&lt;td&gt;waits for a confirmation from the user before archiving each file&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;-z&lt;/td&gt;
&lt;td&gt;creates archived file using gzip&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;-j&lt;/td&gt;
&lt;td&gt;creates archived file using bzip2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;create a tar archive using option -cvf&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ tar -cvf foo-archive.tar foo.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;extract file from archive using option -xvf&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ tar -xvf foo-archive.tar
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;create a gzip tar archive using option -cvzf&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ tar -cvzf foo-archive.tar.gz foo.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;extract a gzip tar archive using option -xvzf&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ tar -xvzf foo-archive.tar.gz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;create a gzip tar archive with python files&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ tar -cvzf python-codes.tar.gz *.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;create a gzip tar archive file for a directory&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ tar -cvzf images-august-2021.tar.gz /home/ubuntu/images/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;create tar with bzip2 compression&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ tar -cvjf foo-archive.tar.bz2 foo.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;extract a tar using bzip2&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ tar -xvjf foo-archive.tar.bz2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;compress to file.gz&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ gzip foo-archive
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;decompress file.gz&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ gzip -d foo-archive.gz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  👨‍💻 Process Management
&lt;/h3&gt;

&lt;p&gt;A process, in simple terms, is an instance of a running program. Whenever we execute a command in Linux, it starts (creates) a new process. The Linux kernel tracks each process through an ID number known as the PID (Process ID). The kernel is the core of the Operating System (OS): it sits closest to the hardware, i.e. it is the lowest level of the OS, and it translates requests from applications into something the hardware can understand. The OS as a whole is the software package that also contains applications such as the user interface (shell, GUI, tools, etc.). Basically, the kernel is the layer between the hardware (the devices available in the computer) and the software (applications like gedit). Only the kernel provides low-level services such as:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;memory management&lt;/li&gt;
&lt;li&gt;network management&lt;/li&gt;
&lt;li&gt;device driver&lt;/li&gt;
&lt;li&gt;file management&lt;/li&gt;
&lt;li&gt;process management&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Types of Processes
&lt;/h3&gt;

&lt;p&gt;When we create a new process (run a command), there are two types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Foreground Processes: they run on the screen and need input from the user, for example office programs&lt;/li&gt;
&lt;li&gt;Background Processes: they run in the background and usually do not need user input, for example an antivirus&lt;/li&gt;
&lt;/ul&gt;
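
&lt;p&gt;for example, appending &amp;amp; to a command starts it as a background process, and jobs lists it (the job number and PID below are illustrative)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ sleep 100 &amp;amp;
[1] 4523
$ jobs
[1]+  Running                 sleep 100 &amp;amp;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
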

&lt;p&gt;to display the processes running in the current shell&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;all the currently running processes&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ps -A
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;processes associated with the current terminal session&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ps -T
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to check all the processes associated with a particular user&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ps -u rubnsbarbosa
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to check all the processes running under a user&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ps ux
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to check all the processes associated with a particular user group&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ps -fG root
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to display the processes running with full information&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ps -f
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to display the processes running on the system in the form of a tree&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ pstree
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;description of all the fields displayed by the &lt;strong&gt;ps -f&lt;/strong&gt; command&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Column&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;UID&lt;/td&gt;
&lt;td&gt;user id that this process belongs to&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PID&lt;/td&gt;
&lt;td&gt;process id&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PPID&lt;/td&gt;
&lt;td&gt;parent process id (the id of the process that started it)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C&lt;/td&gt;
&lt;td&gt;CPU utilization process&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;STIME&lt;/td&gt;
&lt;td&gt;process start time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TTY&lt;/td&gt;
&lt;td&gt;terminal type associated with the process&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TIME&lt;/td&gt;
&lt;td&gt;CPU time taken by the process&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CMD&lt;/td&gt;
&lt;td&gt;The command that started this process&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;display all running Linux processes&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ top
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;interactive process viewer&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ htop
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;display the list of available kill signals&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kill -l
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;kill the process with given pid (for example: I want to kill this 217956 PID)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kill 217956
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;if a process ignores a regular kill command, we can use &lt;strong&gt;kill -9&lt;/strong&gt; followed by the PID&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kill -9 217956
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to find the PID of a process&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ pidof Photoshop.exe
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;kill all the processes named proc&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ killal proc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;will kill all processes matching the pattern&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pkill pattern
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to check the nice value of a process (example: I'd like to find the entry for the terminal process)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ps -el | grep terminal
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;run a program with a modified scheduling priority, i.e. set the process's CPU priority. Note that a higher nice value means a lower priority, so the kernel allocates less CPU time to that process; negative nice values (which require root) give a process more CPU time&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ nice -10 gnome-terminal
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;changing the priority of a running process with PID 77982&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ renice -n 15 -p 77982
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;change the priority of all programs of a specific group&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ renice -n 10 -g 4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;suspend process running in foreground&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ctrl+z
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;list jobs table&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ jobs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;send stopped process to background&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ bg [job-num]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;brings process to foreground&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ fg [job-num]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  📝 File Permission
&lt;/h2&gt;

&lt;p&gt;Every file or directory within Linux has a set of permissions that control who may read, write and execute its contents. In Linux, a directory is just a special type of file. File ownership is an important component of Unix that provides a secure method for storing files. Every file in Linux has the following categories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Owner&lt;/strong&gt; − the name of the user that owns the file/directory;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Group&lt;/strong&gt; − the name of the group that has permissions on the file/directory;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Other&lt;/strong&gt; − everyone else; the permissions that all other users have on the file/directory.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Permissions
&lt;/h3&gt;

&lt;p&gt;Every file and directory in your Linux system has the following 3 permissions defined for each of the 3 categories discussed above.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Permission&lt;/th&gt;
&lt;th&gt;Abbreviation&lt;/th&gt;
&lt;th&gt;File&lt;/th&gt;
&lt;th&gt;Directory&lt;/th&gt;
&lt;th&gt;Octal Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;read&lt;/td&gt;
&lt;td&gt;r&lt;/td&gt;
&lt;td&gt;able to view the contents of a file&lt;/td&gt;
&lt;td&gt;able to list the files within the directory&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;write&lt;/td&gt;
&lt;td&gt;w&lt;/td&gt;
&lt;td&gt;able to modify the contents of a file&lt;/td&gt;
&lt;td&gt;able to add/delete files to/from directory&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;execute&lt;/td&gt;
&lt;td&gt;x&lt;/td&gt;
&lt;td&gt;able to run the file as an executable&lt;/td&gt;
&lt;td&gt;able to cd into the directory and access files&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Octal Notation Table&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Binary&lt;/th&gt;
&lt;th&gt;Octal Value&lt;/th&gt;
&lt;th&gt;Permission&lt;/th&gt;
&lt;th&gt;Representation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;000&lt;/td&gt;
&lt;td&gt;(0+0+0) = 0&lt;/td&gt;
&lt;td&gt;no permission&lt;/td&gt;
&lt;td&gt;---&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;001&lt;/td&gt;
&lt;td&gt;(0+0+1) = 1&lt;/td&gt;
&lt;td&gt;execute&lt;/td&gt;
&lt;td&gt;--x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;010&lt;/td&gt;
&lt;td&gt;(0+2+0) = 2&lt;/td&gt;
&lt;td&gt;write&lt;/td&gt;
&lt;td&gt;-w-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;011&lt;/td&gt;
&lt;td&gt;(0+2+1) = 3&lt;/td&gt;
&lt;td&gt;write + execute&lt;/td&gt;
&lt;td&gt;-wx&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;(4+0+0) = 4&lt;/td&gt;
&lt;td&gt;read&lt;/td&gt;
&lt;td&gt;r--&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;(4+0+1) = 5&lt;/td&gt;
&lt;td&gt;read + execute&lt;/td&gt;
&lt;td&gt;r-x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;110&lt;/td&gt;
&lt;td&gt;(4+2+0) = 6&lt;/td&gt;
&lt;td&gt;read + write&lt;/td&gt;
&lt;td&gt;rw-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;111&lt;/td&gt;
&lt;td&gt;(4+2+1) = 7&lt;/td&gt;
&lt;td&gt;read + write + execute&lt;/td&gt;
&lt;td&gt;rwx&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Whenever you use the &lt;strong&gt;ls -l&lt;/strong&gt; command, it displays information related to file permissions as follows&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2F1.bp.blogspot.com%2F-RzUG1frbLvw%2FXbVnX6AYBpI%2FAAAAAAAAbbM%2Fh7HpiDW-F8Emd2C0-dULpC9RzP4n8Dh1ACLcBGAsYHQ%2Fs1600%2Ffig_permissions_chmod%252Bcommand.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2F1.bp.blogspot.com%2F-RzUG1frbLvw%2FXbVnX6AYBpI%2FAAAAAAAAbbM%2Fh7HpiDW-F8Emd2C0-dULpC9RzP4n8Dh1ACLcBGAsYHQ%2Fs1600%2Ffig_permissions_chmod%252Bcommand.jpg" width="596" height="470"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The first character is called the &lt;strong&gt;file type&lt;/strong&gt;. An ordinary file is represented by a dash (-) and a directory is represented by a d.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Note: A dash (-) anywhere else in the permission set indicates no permission.&lt;/strong&gt; &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The 1st set of three characters is the &lt;strong&gt;owner's permissions&lt;/strong&gt;, shown in green.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The 2nd set of three characters is the &lt;strong&gt;group permissions&lt;/strong&gt;, shown in cyan.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The 3rd set of three characters is the permissions for all other users, shown in red.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The file permissions that are set depend on the type of file, e.g. a text file needs different permissions from a shell script because a text file doesn’t need the executable permission but a shell script does.&lt;/p&gt;

&lt;p&gt;examples of different types of permissions on files and directories:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-rwx------ this file is read/write/execute for the owner only
dr-xr-x--- this directory is read/execute for the owner and the group
-rwxr-xr-x this file is read/write/execute for the owner, and read/execute for the group and others
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Setting Permission
&lt;/h3&gt;

&lt;p&gt;In order to change file permissions we use &lt;strong&gt;chmod&lt;/strong&gt; command (&lt;strong&gt;ch&lt;/strong&gt;ange &lt;strong&gt;mod&lt;/strong&gt;e - changes permissions of a given file) followed by the octal values that reflect the permissions we want to set. To decide on the permissions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;work out what you want each category of user to be able to do and the appropriate octal value for this (see Octal Notation Table);&lt;/li&gt;
&lt;li&gt;take these 3 octal values and put them together to form a set which will be the permissions for that file.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The example below shows that if we want the &lt;strong&gt;user&lt;/strong&gt; to be able to read and write a file, but the &lt;strong&gt;group&lt;/strong&gt; and &lt;strong&gt;other&lt;/strong&gt; only to be able to read that file, then the permissions for this file would need to be set to 644.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;category&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;u&lt;/th&gt;
&lt;th&gt;g&lt;/th&gt;
&lt;th&gt;o&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;permission&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;r  w&lt;/td&gt;
&lt;td&gt;r&lt;/td&gt;
&lt;td&gt;r&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;value&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4 + 2&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;more examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;chmod 755 foo.txt&lt;/code&gt; (results in rwxr-xr-x)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;chmod 666 foo.txt&lt;/code&gt; (results in rw-rw-rw-)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;chmod 664 foo.txt&lt;/code&gt; (results in rw-rw-r--)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;chmod 700 foo.txt&lt;/code&gt; (results in rwx------)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;chmod 711 foo.txt&lt;/code&gt; (results in rwx--x--x)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;chmod 754 foo.txt&lt;/code&gt; (results in rwxr-xr--)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;chmod 000 foo.txt&lt;/code&gt; (results in ---------)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;chmod 777 foo.txt&lt;/code&gt; (results in rwxrwxrwx)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;chmod g+r foo.txt&lt;/code&gt; (adds read to group)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;chmod g-r foo.txt&lt;/code&gt; (removes read from group)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;chmod o+r foo.txt&lt;/code&gt; (adds read to others)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;chmod a-w foo.txt&lt;/code&gt; (removes write from all users)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We can also use chmod in symbolic mode:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Chmod operator&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;+&lt;/td&gt;
&lt;td&gt;adds a permission to a file/directory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;removes the permission from a file/directory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;=&lt;/td&gt;
&lt;td&gt;sets the designated permission(s)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
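
&lt;p&gt;for example, the = operator sets the permissions exactly as given (foo.txt is just a placeholder file):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ chmod u=rwx,g=rx,o= foo.txt
results in rwxr-x--- (owner rwx, group r-x, others none)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;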

&lt;p&gt;a shell script, or any other file that needs to be executable, should have a permission such as 711&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ chmod 711 foobar.sh

owner - read, write and execute
group - execute
other - execute
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;a text file doesn't need to be executable, so it should have a permission such as 644&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ chmod 644 foo.txt

owner - read and write
group - read
other - read
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  ⚡️ Network
&lt;/h3&gt;

&lt;p&gt;Let’s start with the most basic question: is our physical interface up? The &lt;em&gt;ip link show&lt;/em&gt; command tells us&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ip link show
1: lo: &amp;lt;LOOPBACK,UP,LOWER_UP&amp;gt; mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: &amp;lt;BROADCAST,MULTICAST&amp;gt; mtu 1500 qdisc pfifo_fast state DOWN mode DEFAULT group default qlen 1000
link/ether 52:54:00:82:d6:6e brd ff:ff:ff:ff:ff:ff
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;your interface might be disabled, so before checking cables you should bring the interface up&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ip link set eth0 up
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;the -br (brief) flag prints the output in a much more readable table format&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ip -br link show
lo UNKNOWN 00:00:00:00:00:00 &amp;lt;LOOPBACK,UP,LOWER_UP&amp;gt;
eth0 UP 52:54:00:82:d6:6e &amp;lt;BROADCAST,MULTICAST,UP,LOWER_UP&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;we can use the -s flag with the ip command to print additional statistics about an interface&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ip -s link show eth0
2: eth0: &amp;lt;BROADCAST,MULTICAST,UP,LOWER_UP&amp;gt; mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
link/ether 52:54:00:82:d6:6e brd ff:ff:ff:ff:ff:ff
RX: bytes packets errors dropped overrun mcast
34107919 5808 0 6 0 0
TX: bytes packets errors dropped carrier collsns
434573 4487 0 0 0 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can check the entries in our ARP table with the ip neighbor command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ip neighbor show
192.168.122.1 dev eth0 lladdr 52:54:00:11:23:84 REACHABLE
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that the gateway’s MAC address is populated (we’ll talk more about how to find your gateway in the next section). If there was a problem with ARP, then we would see a resolution failure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ip neighbor show
192.168.122.1 dev eth0 FAILED
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Linux caches the ARP entry for a period of time, so you may not be able to send traffic to your default gateway until the ARP entry for your gateway times out. For highly important systems, this result is undesirable. Luckily, you can manually delete an ARP entry, which will force a new ARP discovery process:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ip neighbor show
192.168.122.170 dev eth0 lladdr 52:54:00:04:2c:5d REACHABLE
192.168.122.1 dev eth0 lladdr 52:54:00:11:23:84 REACHABLE
# ip neighbor delete 192.168.122.170 dev eth0
# ip neighbor show
192.168.122.1 dev eth0 lladdr 52:54:00:11:23:84 REACHABLE
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





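&lt;p&gt;the same brief format also works for addresses, letting us quickly check the IPs assigned to each interface&lt;br&gt;
&lt;/p&gt;
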
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ip -br address show
lo UNKNOWN 127.0.0.1/8 ::1/128
eth0 UP 192.168.122.135/24 fe80::184e:a34d:1d37:441a/64 fe80::c52f:d96e:a4a2:743/64
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





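&lt;p&gt;once the interface is up and has an address, we can test connectivity to an external host with ping&lt;br&gt;
&lt;/p&gt;
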
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ping www.google.com
PING www.google.com (172.217.165.4) 56(84) bytes of data.
64 bytes from yyz12s06-in-f4.1e100.net (172.217.165.4): icmp_seq=1 ttl=54 time=12.5 ms
64 bytes from yyz12s06-in-f4.1e100.net (172.217.165.4): icmp_seq=2 ttl=54 time=12.6 ms
64 bytes from yyz12s06-in-f4.1e100.net (172.217.165.4): icmp_seq=3 ttl=54 time=12.5 ms
^C
--- www.google.com ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2002ms
rtt min/avg/max/mdev = 12.527/12.567/12.615/0.036 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can print the routing table using the ip route show command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ip route show
default via 192.168.122.1 dev eth0 proto dhcp metric 100
192.168.122.0/24 dev eth0 proto kernel scope link src 192.168.122.135 metric 100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
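
&lt;p&gt;to see which route the kernel would actually pick for a specific destination, we can ask it directly (the destination below is just an example; the output will look roughly like this)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ip route get 8.8.8.8
8.8.8.8 via 192.168.122.1 dev eth0 src 192.168.122.135
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;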



&lt;h3&gt;
  
  
  SSH
&lt;/h3&gt;
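
&lt;p&gt;log into a remote machine as a given user&lt;br&gt;
&lt;/p&gt;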



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ssh admin@192.168.1.113
$ ssh ubuntu@192.168.1.113
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;generate an SSH key pair&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ssh-keygen -t rsa
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;ssh authentication: you store your public key on the server and keep your private key with you. The private key is the most important piece, since it is what allows you to authenticate successfully with the server. So keep it secure; if you lose it, you lose access to the server.&lt;/p&gt;
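
&lt;p&gt;a common way to install your public key on the server is ssh-copy-id (the user and host below are just examples)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ssh-copy-id admin@192.168.1.113
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;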

&lt;p&gt;copy a file to a remote server&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ scp foo.txt admin@192.168.1.113:/home/admin/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
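
&lt;p&gt;and to copy in the opposite direction, from the remote server to the local machine&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ scp admin@192.168.1.113:/home/admin/foo.txt .
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;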



&lt;p&gt;display the current network interface configuration information&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ifconfig
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;display IP addresses and property information&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ip addr
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;List all of the route entries in the kernel&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ip route
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;display neighbour objects; also known as the ARP table for IPv4&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ip neigh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;lists all my connections and their DNS servers&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# systemd-resolve --status
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;allows you to test the IP-level connectivity of a given host on the network&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ping 192.168.2.32
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;display the route that a packet takes to reach the host; it also prints details about all the hops it visits&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# traceroute google.com
# traceroute 172.217.26.206
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>linux</category>
      <category>opensource</category>
      <category>bash</category>
      <category>python</category>
    </item>
    <item>
      <title>ETL with Spark on Azure Databricks and Azure Data Warehouse (Part 2)</title>
      <dc:creator>Rubens Barbosa</dc:creator>
      <pubDate>Sat, 30 Apr 2022 20:31:09 +0000</pubDate>
      <link>https://dev.to/rubnsbarbosa/etl-with-spark-on-azure-databricks-and-azure-data-warehouse-part-2-5cep</link>
      <guid>https://dev.to/rubnsbarbosa/etl-with-spark-on-azure-databricks-and-azure-data-warehouse-part-2-5cep</guid>
      <description>&lt;p&gt;Hey y'all, this is a continuation of the previous article. We already have data on Azure Data Lake Storage. Now, we will integrate it with Apache Spark on Azure Databricks to perform a small transformation on top of the JSON, and send the data to Azure SQL Data Warehouse. I'll try to be the most hands on as possible. Let's get started!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is Apache Spark?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Apache Spark is a framework for processing large-scale data, i.e., Big Data distributed across clusters. It is used for executing data engineering, data science, and machine learning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Overview&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The main abstraction Spark provides is the Resilient Distributed Dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. Thus, Spark runs multiple processes concurrently, in parallel, without them interfering with each other. RDDs can be created from text files, SQL databases, NoSQL databases, HDFS, cloud storage, and so on. The processing of RDDs is done entirely in memory.&lt;/p&gt;
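
&lt;p&gt;To make this concrete, here is a minimal sketch (assuming a SparkSession named spark, as Databricks notebooks provide):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# a minimal RDD sketch; `spark` is the SparkSession that Databricks notebooks provide
rdd = spark.sparkContext.parallelize([1, 2, 3, 4])

# transformations are lazy; nothing runs until an action is called
squared = rdd.map(lambda x: x * x)

# collect() is an action: the driver gathers the results from the workers
print(squared.collect())  # [1, 4, 9, 16]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;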

&lt;p&gt;At a high level, a Spark cluster has a driver node and several worker nodes. The driver node runs the main program, which holds all of the transformations that you want to apply to your data; these get sent out to the worker nodes, which execute the tasks and return the results to the driver node. This is the core engine of Spark; on top of it there are several library modules that allow developers to easily interact with the core engine. These libraries include: Spark SQL, Spark Streaming, MLlib, GraphX.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Databricks&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Databricks is a company founded by the creators of Apache Spark with the intention of making Apache Spark much easier to use. Databricks develops a web-based platform for working with Spark that provides automated cluster management and IPython-style notebooks.&lt;/p&gt;

&lt;p&gt;The Databricks workspace is the cloud-based environment in which you use Databricks; it includes the user interface, integrated storage, security settings, job scheduling, and, most importantly, notebooks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cluster&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As mentioned before, Spark is all about clusters. That's why we'll first create a cluster on Databricks. After launching the Azure Databricks workspace, go to Compute, then Create Cluster.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjzbcpyqr6eb01rwytd65.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjzbcpyqr6eb01rwytd65.png" alt="Create Cluster on Azure Databricks" width="800" height="415"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once we have our cluster running, we can create a notebook and start coding; I'll use PySpark. We will start building the ETL with PySpark on Azure Databricks. In the load phase we will write data to Azure SQL Data Warehouse, so we must already have our Data Warehouse deployed and its connection string at hand.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Creation of ETL with PySpark on Azure Databricks Notebook&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let's have a look.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extract&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;First of all, we need to extract the JSON file from Azure Data Lake Storage and read it into a DataFrame. After that, we will have the data in a PySpark DataFrame.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Title: ETL Spark: extract from Azure Data Lake Storage, and load to Azure SQL Data Warehouse
# Language: PySpark
# Author: Rubens Santos Barbosa
&lt;/span&gt;
&lt;span class="c1"&gt;# config the session using spark object and set the key from our azure data lake storage account
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fs.azure.account.key.YOUR_AZURE_DATA_LAKE_STORAGE_ACCOUNT_NAME.dfs.core.windows.net&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_AZURE_DATA_LAKE_ACCOUNT_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# abfss://AZURE_DL_CONTAINER_NAME@AZURE_DL_STORAGE_ACCOUNT_NAME.dfs.core.windows.net/DIRECTORY_CLIENT
&lt;/span&gt;&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ls&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;abfss://az-covid-data@engdatalake.dfs.core.windows.net/directory-covid19&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# path JSON file on azure data lake storage 
&lt;/span&gt;&lt;span class="n"&gt;covid_data_json&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;abfss://az-covid-data@engdatalake.dfs.core.windows.net/directory-covid19/covid-2022-4-21.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# read JSON file into DataFrame
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;multiline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;covid_data_json&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# PySpark print schema
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;printSchema&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkjctaiyyi2upr2mt3r0z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkjctaiyyi2upr2mt3r0z.png" alt="PySpark on Azure Databricks" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We might wanna see some content from our DataFrame.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# showing first 5 rows
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbe5vc25jz80gf1z8o4tp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbe5vc25jz80gf1z8o4tp.png" alt="Firts rows" width="800" height="452"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Transform&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We've just done the data extraction. Now, we will do a little transformation. Let's check whether there is missing data in the columns of our PySpark DataFrame.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# missing values in a specific column of pySpark dataframe
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bairroPaciente&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;isNull&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# count null value in every column
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\t&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;with null values: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;isNull&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffuory4y6ecquohajt1tg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffuory4y6ecquohajt1tg.png" alt="missing data" width="800" height="452"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We noticed that our PySpark DataFrame has 82 rows, and some columns have 81 null values. So, let's drop these columns.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# columns in pyspark dataframe to drop
&lt;/span&gt;&lt;span class="n"&gt;columns_to_drop&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;classificacaoEstadoSivep&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;comorbidadeAsmaSivep&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;comorbidadeDiabetesSivep&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;comorbidadeHematologiaSivep&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;comorbidadeImunodeficienciaSivep&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;comorbidadeNeurologiaSivep&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;comorbidadeObesidadeSivep&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;comorbidadePneumopatiaSivep&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;comorbidadePuerperaSivep&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;comorbidadeRenalSivep&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;comorbidadeSindromeDownSivep&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dataEntradaUtisSvep&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dataEvolucaoCasoSivep&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dataInternacaoSivep&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dataResultadoExame&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dataSolicitacaoExame&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;evolucaoCaso&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;idSivep&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;paisPaciente&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cnesNotificacaoEsus&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span 
class="s"&gt;comorbidadeCardiovascularSivep&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dataColetaExame&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;resultadoFinalExame&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tipoTesteExame&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# delete columns in pyspark dataframe
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;columns_to_drop&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F70uv0wi2whddwnyc67bu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F70uv0wi2whddwnyc67bu.png" alt="delete columns" width="800" height="452"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's display our data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1op9c8rru0q21fzhbqvr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1op9c8rru0q21fzhbqvr.png" alt="display data" width="800" height="452"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see above, the columns dataInicioSintomas and dataNotificacao are in timestamp format; I will transform them to date format in our PySpark DataFrame.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;to_date&lt;/span&gt;
&lt;span class="c1"&gt;# timestamp to date
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dataInicioSintomas&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;to_date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dataInicioSintomas&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dataNotificacao&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;to_date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dataNotificacao&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;

&lt;span class="nf"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnlf8ih9vmvt1qct3wwqw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnlf8ih9vmvt1qct3wwqw.png" alt="timestamp2date" width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Load&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We've done the data transformation. We will load this data into Azure SQL Data Warehouse.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# removing repeated rows
&lt;/span&gt;&lt;span class="n"&gt;distinctDF&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;distinct&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Distinct count: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;distinctDF&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
&lt;span class="nf"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;distinctDF&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Load PySpark DataFrame to Azure SQL Data Warehouse
&lt;/span&gt;&lt;span class="n"&gt;db_table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dbo.COVID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;sql_password&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_PASSWORD&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;jdbc_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jdbc:sqlserver://cosmos-database.database.windows.net:1433;database=cosmos-pool;user=rubnsbarbosa@cosmos-database;password=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;sql_password&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;distinctDF&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;jdbc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;jdbc_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;db_table&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx72fnliptz9w591tzss8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx72fnliptz9w591tzss8.png" alt="load" width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We've finished the ETL with PySpark on Azure Databricks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Azure SQL Data Warehouse&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before loading the PySpark DataFrame into Azure SQL Data Warehouse, we must have created our table in the SQL DW. So, we enter the query editor and create it. You can see the query I created below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;dbo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;COVID&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;bairroPaciente&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;254&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;codigoMunicipioPaciente&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;254&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;codigoPaciente&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;254&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;dataInicioSintomas&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;dataNotificacao&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;estadoPaciente&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;idadePaciente&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;municipioNotificacaoEsus&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;municipioPaciente&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;profissionalSaude&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;racaCorPaciente&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;sexoPaciente&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You might hit some firewall issues when you try to load the data; you just need to go into Firewalls and Virtual Networks [inside the SQL DW] and save the client IP address. Finally, let's see the data in our Azure SQL Data Warehouse.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fihmdlmhwbti7akjadgci.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fihmdlmhwbti7akjadgci.png" alt="dw" width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We have the second and last part of our project completed. We created an ETL using Spark on an Azure Databricks cluster. In the extraction phase we got data from Azure Data Lake Storage, then we performed a basic transformation, and the data was loaded into Azure SQL Data Warehouse as proposed.&lt;/p&gt;

</description>
      <category>spark</category>
      <category>databricks</category>
      <category>python</category>
      <category>azure</category>
    </item>
    <item>
      <title>ELT Data Pipeline with Kubernetes CronJob, Azure Data Lake, Azure Databricks (Part 1)</title>
      <dc:creator>Rubens Barbosa</dc:creator>
      <pubDate>Sun, 24 Apr 2022 14:35:29 +0000</pubDate>
      <link>https://dev.to/rubnsbarbosa/elt-data-pipeline-with-kubernetes-cronjob-azure-data-lake-azure-databricks-part-1-d58</link>
      <guid>https://dev.to/rubnsbarbosa/elt-data-pipeline-with-kubernetes-cronjob-azure-data-lake-azure-databricks-part-1-d58</guid>
      <description>&lt;p&gt;Hey world, the concept of ETL are far from new, but nowadays it is widely used in the industry. ETL stands for Extract, Transform, and Load. Okay, but what does that mean? The easiest way to understand how ETL works is to understand what happens in each step of the process. Let's dive into it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extract&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;During the extraction, raw data is moved from a structured or unstructured data pool to a staging data repository.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Transform&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The data source might have a different structure than the target destination, so we'll transform the data from the source schema to the destination schema.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Load&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this phase, we'll then load the transformed data into the data warehouse.&lt;/p&gt;

&lt;p&gt;A disadvantage of the ETL approach is that the transformation stage can take a long time. An alternative approach is extract, load, and transform (ELT). In ELT, the data is immediately extracted and loaded into a large data repository, such as Azure Data Lake Storage. We can begin transforming the data as soon as the load is complete.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hands on&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this first part I will show how to create an ELT. We'll extract Covid-19 data from a public API called IntegraSUS and load it into Azure Data Lake Storage. This ELT will be containerized on Azure Container Registry (ACR), and we will use Azure Kubernetes Service (AKS) to schedule our job on a K8s cluster to run daily.&lt;/p&gt;

&lt;p&gt;In the second part of this project, we will integrate the Azure Data Lake with Apache Spark on Azure Databricks to perform a small transformation on top of the files sent to the Data Lake and then we will store the result of the transformation in a Data Warehouse.&lt;/p&gt;

&lt;p&gt;We will learn how to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a Python ELT and load into Azure Data Lake;&lt;/li&gt;
&lt;li&gt;Create an Azure Container Registry and push images into it;&lt;/li&gt;
&lt;li&gt;Create an Azure Kubernetes Service;&lt;/li&gt;
&lt;li&gt;Deploy CronJob into Azure Kubernetes Cluster;&lt;/li&gt;
&lt;li&gt;Integrate Azure Data Lake with Apache Spark on Databricks;&lt;/li&gt;
&lt;li&gt;Transform data using PySpark on Azure Databricks&lt;/li&gt;
&lt;li&gt;Load new data into Data Warehouse.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The project code is available here: &lt;a href="https://github.com/rubnsbarbosa/elt2datalake" rel="noopener noreferrer"&gt;github repository&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Create a Python ELT&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the extraction phase we will get Covid-19 data for Fortaleza/Ceará/Brazil from a public API and store it in a JSON file. After that, we will load it into Azure Data Lake. You can see the project code below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#!/usr/local/bin/python
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;yaml&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timedelta&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.storage.filedatalake&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DataLakeServiceClient&lt;/span&gt;

&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;basicConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INFO&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%(asctime)s - %(name)s - %(levelname)s - %(message)s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;today&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;days&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;previous_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;year&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;month&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;day&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;EXTRACTING...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;#extract data from API
&lt;/span&gt;    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://indicadores.integrasus.saude.ce.gov.br/api/casos-coronavirus?dataInicio=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;previous_date&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;amp;dataFim=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;previous_date&lt;/span&gt;
    &lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;covid-data-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;previous_date&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ensure_ascii&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;THERE IS NOT DATA&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;initialize_storage_account&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;storage_account_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;storage_account_key&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;global&lt;/span&gt; &lt;span class="n"&gt;service_client&lt;/span&gt;
        &lt;span class="n"&gt;service_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DataLakeServiceClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;account_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;storage_account_name&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.dfs.core.windows.net&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;credential&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;storage_account_key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;EXCEPTION...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_file_system&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;container_name&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;global&lt;/span&gt; &lt;span class="n"&gt;file_system_client&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CREATING A CONTAINER NAMED AZ-COVID-DATA&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;file_system_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;service_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_file_system&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;container_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;EXCEPTION...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_directory&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CREATING A DIRECTORY NAMED DIRECTORY-COVID22&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;file_system_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_directory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;directory-covid19&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;EXCEPTION...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;upload_file_to_container_datalake&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;local_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;container_name&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;UPLOADING FILE TO AZURE DATA LAKE STORAGE...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;file_system_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;service_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_file_system_client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;container_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;directory_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;file_system_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_directory_client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;directory-covid19&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;file_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;directory_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_file_client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;covid-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;previous_date&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;local_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;file_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upload_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;overwrite&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;UPLOADED TO AZURE DATA LAKE&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;EXCEPTION...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_config&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;directory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dirname&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abspath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__file__&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;directory&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/config.yaml&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;yamlfile&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;yaml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;yamlfile&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Loader&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;yaml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FullLoader&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_config&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;initialize_storage_account&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AZURE_DL_STORAGE_ACCOUNT_NAME&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AZURE_DL_ACCOUNT_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="nf"&gt;create_file_system&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AZURE_DL_CONTAINER_NAME&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="nf"&gt;upload_file_to_container_datalake&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;covid-data-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;previous_date&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AZURE_DL_CONTAINER_NAME&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
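
&lt;p&gt;The script reads its credentials from a &lt;em&gt;config.yaml&lt;/em&gt; file sitting next to it. A minimal sketch of that file, with placeholder values for the account name and key (the container name matches the one the script creates), might look like this&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# config.yaml -- placeholder values, replace with your own storage account details
AZURE_DL_STORAGE_ACCOUNT_NAME: "mystorageaccount"
AZURE_DL_ACCOUNT_KEY: "your-storage-account-access-key"
AZURE_DL_CONTAINER_NAME: "az-covid-data"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
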



&lt;p&gt;&lt;strong&gt;2. Create a Docker image for our Python ELT&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We are going to build a Docker image for our ELT job and run it inside a container. Let's create a Dockerfile, which describes how the image is built. You can see the full list of instructions below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;python&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;3.9&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;buster&lt;/span&gt;

&lt;span class="n"&gt;WORKDIR&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;usr&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;

&lt;span class="n"&gt;COPY&lt;/span&gt; &lt;span class="n"&gt;requirements&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;txt&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;usr&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;
&lt;span class="n"&gt;RUN&lt;/span&gt; &lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="n"&gt;requirements&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;txt&lt;/span&gt;

&lt;span class="n"&gt;COPY&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;yaml&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;usr&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;

&lt;span class="n"&gt;COPY&lt;/span&gt; &lt;span class="n"&gt;el2datalake&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;usr&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;
&lt;span class="n"&gt;RUN&lt;/span&gt; &lt;span class="n"&gt;chmod&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="n"&gt;el2datalake&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;

&lt;span class="n"&gt;CMD&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./el2datalake.py&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
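
&lt;p&gt;Two details are worth noting here. The &lt;em&gt;CMD&lt;/em&gt; runs the script directly, which assumes &lt;em&gt;el2datalake.py&lt;/em&gt; starts with a shebang line such as &lt;em&gt;#!/usr/bin/env python3&lt;/em&gt; (that is also why we &lt;em&gt;chmod a+x&lt;/em&gt; it). And &lt;em&gt;requirements.txt&lt;/em&gt; must list the packages the script imports; a minimal sketch (the requests entry is an assumption for the extract step) could be&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;azure-storage-file-datalake
PyYAML
requests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
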



&lt;p&gt;We can build the Docker image with the &lt;em&gt;docker build&lt;/em&gt; command&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker build -t el2datalakejob .
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we can run our ELT job inside a container&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker run -it el2datalakejob:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Push the Docker image to Azure Container Registry&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Azure Container Registry hosts private Docker container images and allows us to build, store, and manage them. We are going to deploy an ACR instance and push our Docker image to it.&lt;/p&gt;

&lt;p&gt;To create any resource on Azure, we first need a resource group. We create one with the &lt;em&gt;az group create&lt;/em&gt; command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ az group create --name myResourceGroup --location westeurope
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once we have a resource group, we can create an Azure Container Registry with the &lt;em&gt;az acr create&lt;/em&gt; command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ az acr create \
  --resource-group myResourceGroup \
  --name azcrjobs \
  --sku Basic \
  --location westeurope
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
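
&lt;p&gt;The registry gets a login server name derived from its name; we can confirm it with&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ az acr show --name azcrjobs --query loginServer --output tsv
azcrjobs.azurecr.io
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
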



&lt;p&gt;Let's log in to Azure Container Registry&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ az acr login --name azcrjobs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's tag the image with the login server name azcrjobs.azurecr.io&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker tag el2datalakejob \
azcrjobs.azurecr.io/el2datalakejob:v1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Push the Docker image to ACR&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker push azcrjobs.azurecr.io/el2datalakejob:v1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
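
&lt;p&gt;To double-check that the image landed in the registry, we can list the repositories on our ACR instance&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ az acr repository list --name azcrjobs --output table
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
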



&lt;p&gt;Now that we have our ELT image on Azure Container Registry, let's move on to the next step.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Create and Deploy CronJobs on Azure Kubernetes Service&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Azure Kubernetes Service (AKS) lets us deploy and manage containerized applications with a fully managed Kubernetes service. Let’s create an AKS cluster with the &lt;em&gt;az aks create&lt;/em&gt; command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ az aks create \
  --resource-group myResourceGroup \
  --name az-aks-jobs \
  --node-count 1 \
  --attach-acr azcrjobs \
  --location westeurope
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To connect to the cluster from our local machine we use the Kubernetes client kubectl. Open a terminal and fetch the cluster credentials&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ az aks get-credentials --resource-group myResourceGroup \ --name az-aks-jobs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's check that our node is available on AKS&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kubectl get nodes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We start by creating a manifest file for our ELT CronJob.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: batch/v1
kind: CronJob
metadata:
  creationTimestamp: null
  name: k8sjob
spec:
  jobTemplate:
    metadata:
      creationTimestamp: null
      name: k8sjob
    spec:
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - image: azcrjobs.azurecr.io/el2datalakejob:v1
            imagePullPolicy: IfNotPresent
            name: k8sjob
            resources: {}
          restartPolicy: OnFailure
  schedule: '55 23 * * *'
status: {}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the manifest above we defined the cron expression used as the job's schedule: it is set to run every day at 23:55. We also set the name of the Docker image to be pulled from the container registry attached to the cluster.&lt;/p&gt;
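
&lt;p&gt;For reference, the five fields of the cron expression read as follows&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;55 23 * * *
|  |  | | |
|  |  | | +-- day of week (* = any)
|  |  | +---- month (* = any)
|  |  +------ day of month (* = any)
|  +--------- hour (23)
+------------ minute (55)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
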

&lt;p&gt;To deploy our job, we will use the &lt;em&gt;kubectl apply&lt;/em&gt; command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl apply -f job.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can view some details about the job with&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get cronjobs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
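
&lt;p&gt;Rather than waiting for 23:55, we can also trigger a one-off run from the CronJob to test it right away (the job name &lt;em&gt;k8sjob-manual&lt;/em&gt; below is an arbitrary choice)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kubectl create job --from=cronjob/k8sjob k8sjob-manual
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
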



&lt;p&gt;To retrieve cron job logs from Kubernetes, we can use the &lt;em&gt;kubectl logs&lt;/em&gt; command, but first we must get the pod name.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kubectl get pods
NAME                       READY   STATUS      RESTARTS   AGE
k8sjob-27513350--1-xnj8x   0/1     Completed   0          4m2s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Retrieve cron job logs from Kubernetes&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kubectl logs k8sjob-27513350--1-xnj8x
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
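
&lt;p&gt;If we ever need to pause the schedule without deleting the job, the CronJob spec has a suspend flag we can toggle&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kubectl patch cronjob k8sjob -p '{"spec":{"suspend":true}}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
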



&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Finally, the first stage of our project is complete: we now have Covid data landing in Azure Data Lake every day. In the next step, we will read this file from Azure Data Lake, process the data with Apache Spark on Azure Databricks, and make the result of that processing available in a Data Warehouse.&lt;/p&gt;

</description>
      <category>etl</category>
      <category>dataengineering</category>
      <category>python</category>
      <category>azure</category>
    </item>
  </channel>
</rss>
