<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Pradip Sodha</title>
    <description>The latest articles on DEV Community by Pradip Sodha (@sudo_pradip).</description>
    <link>https://dev.to/sudo_pradip</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1973306%2F60355057-a1c5-4704-982f-ad591289741a.png</url>
      <title>DEV Community: Pradip Sodha</title>
      <link>https://dev.to/sudo_pradip</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sudo_pradip"/>
    <language>en</language>
    <item>
      <title>Scala's Ignored Features</title>
      <dc:creator>Pradip Sodha</dc:creator>
      <pubDate>Mon, 07 Oct 2024 06:18:41 +0000</pubDate>
      <link>https://dev.to/sudo_pradip/scalas-ignored-features-1bmh</link>
      <guid>https://dev.to/sudo_pradip/scalas-ignored-features-1bmh</guid>
      <description>&lt;p&gt;Scala often flies under the radar, seen as an underrated language despite its elegant design. Many developers know how to use Scala, but not all fully grasp its core concepts—leaving some powerful features untouched. In this article, we'll explore some key features of Scala that developers frequently overlook, helping you unlock the language’s full potential.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. I’m Pure OOP&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Let’s start with a simple question: How is Scala pure OOP? Take a look at this expression:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;x&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You might think this is straightforward: there's a variable &lt;code&gt;x&lt;/code&gt;, an assignment operator &lt;code&gt;=&lt;/code&gt;, a plus operator &lt;code&gt;+&lt;/code&gt;, and two integer constants &lt;code&gt;1&lt;/code&gt; and &lt;code&gt;1&lt;/code&gt;. But how about this expression:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;x&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="o"&gt;+(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Surprised? This is equally valid in Scala because &lt;strong&gt;everything in Scala is an object&lt;/strong&gt;, even the numbers! What’s really happening here is that &lt;code&gt;1&lt;/code&gt; is an instance of the &lt;code&gt;Int&lt;/code&gt; class, and the &lt;code&gt;+&lt;/code&gt; method is called on it. The more you explore Scala, the more you'll see how object-oriented principles are embedded deeply within the language.&lt;/p&gt;
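&lt;p&gt;To make this concrete, here is a minimal, self-contained sketch showing that operator syntax and explicit method-call syntax are interchangeable:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;object PureOop extends App {
  // Every operator is a method on an object; both spellings are the same call.
  assert(1 + 1 == (1).+(1))

  // Even a range is built by calling a method on an Int object.
  assert((1 to 3) == (1).to(3))

  // And any method can be written infix, just like an operator.
  assert((2 max 3) == (2).max(3))
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;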




&lt;h3&gt;
  
  
  &lt;strong&gt;2. Free Will: Creative Variable Names&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Have you ever heard the rule that variable names must start with an alphabetic character or underscore and can’t contain special characters? That might be true in languages like C, but Scala breaks free from these traditional constraints.&lt;/p&gt;

&lt;p&gt;Consider this block of code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;ten&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="c1"&gt;//1&lt;/span&gt;
&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;_20&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt; &lt;span class="c1"&gt;//2&lt;/span&gt;
&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="k"&gt;#&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt; &lt;span class="c1"&gt;//3&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;`@40`&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt; &lt;span class="c1"&gt;//4&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;` `&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="c1"&gt;//5&lt;/span&gt;
&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="err"&gt;`&lt;/span&gt;&lt;span class="n"&gt;` = null //6
println { null } //7
println { `&lt;/span&gt; &lt;span class="err"&gt;`&lt;/span&gt; &lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="c1"&gt;//8&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
At first glance, you might think variables 5 and 7 are invalid, but in fact, the only invalid one here is line 6! Scala allows special characters and even spaces in variable names when enclosed in backticks. Line 8 will print `0`, and the empty space is a valid variable name.

---

### **3. Functional Meets OOP: Every Statement Returns Something**

In Scala, **everything is an expression**—even `if`-`else`, `match`, and loops. This makes functional programming more natural because these constructs always return a value, eliminating the need to declare mutable variables to store results.

For example:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;val result = if (condition) "yes" else "no"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
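&lt;p&gt;The same holds for &lt;code&gt;match&lt;/code&gt;: it evaluates directly to a value, so no mutable placeholder is needed. A minimal sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;val day = 3
// The whole match expression evaluates to "Wed" and is bound in one step.
val name = day match {
  case 1 =&amp;gt; "Mon"
  case 2 =&amp;gt; "Tue"
  case 3 =&amp;gt; "Wed"
  case _ =&amp;gt; "other"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;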

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Here, `if`-`else` returns a value just like a function would, reinforcing the functional paradigm. This feature paves the way for advanced techniques like **Higher-Order Functions** (HOF) and **Currying**.

---

### **4. Implicit Magic: Reducing Boilerplate**

Are you tired of passing the same context or parameter repeatedly across multiple functions? Scala’s **implicit** keyword can help reduce this redundancy.

Instead of explicitly passing variables around, you can declare an implicit value that will automatically be picked up by any function that expects it:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;implicit val context: Context = new Context()

def read(...)(implicit ctx: Context) = ...
def transform()(implicit ctx: Context) = ...
def write(...)(implicit ctx: Context) = ...

read(otherParam)
transform() // No need to pass `context` again
write(otherParam)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
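&lt;p&gt;The sketch above elides the bodies, so here is a minimal version that actually compiles; the &lt;code&gt;Context&lt;/code&gt; carrying a user name is hypothetical, purely for illustration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;object ImplicitDemo extends App {
  // Hypothetical context type, for illustration only.
  case class Context(user: String)

  def greet(msg: String)(implicit ctx: Context): String =
    s"$msg, ${ctx.user}"

  implicit val context: Context = Context("pradip")

  // The compiler supplies `context` automatically at the call site.
  println(greet("hello")) // hello, pradip
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;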

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
This makes your code more concise and expressive, while still ensuring that necessary values like contexts or configurations are passed correctly.

---

### **5. Lazy Evaluation: Only When You Need It**

You might know that `lazy` in Scala means delaying a variable’s evaluation until it’s first accessed. But what real-world scenarios benefit from `lazy`?

For example, imagine an **expensive computation** that shouldn’t be performed unless absolutely necessary:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;lazy val expensiveComputation = {
  println("Computing...")
  (1 to 1000000).sum
}

println("Before accessing lazy value")
// No computation happens yet

println(expensiveComputation)  // Now the computation occurs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
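&lt;p&gt;A related property worth knowing: a &lt;code&gt;lazy val&lt;/code&gt; is evaluated at most once and the result is cached, which a small counter can verify:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;object LazyOnce extends App {
  var evaluations = 0
  lazy val once = { evaluations += 1; 42 }

  assert(evaluations == 0) // nothing computed yet
  val a = once
  val b = once
  assert(evaluations == 1) // computed exactly once, then cached
  assert(a == 42 &amp;amp;&amp;amp; b == 42)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;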

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
This is particularly useful for **resource initialization** (e.g., database connections), **memoization** (caching the result of a computation), Conditional Initialization, Deferred Initialization in Multi-threaded Environments, Lazy Logging, Avoiding Circular Dependencies.

---

### **6. Understanding Nil, Null, None, Nothing, and Unit**

Scala has several ways to represent "nothing" or the absence of value, but each has its own distinct role:

- **`Nil`**: Represents an empty list (`List()`).
- **`Null`**: A subtype of all reference types (`AnyRef`), representing the absence of an object.
- **`None`**: A safer alternative to `null`, used within the `Option` type.
- **`Nothing`**: A subtype of every type, representing the absence of value or the bottom of the type hierarchy. Often used in functions that throw exceptions.
- **`Unit`**: Equivalent to `void` in Java, but carries a value (`()`).

Understanding these types is key to writing robust Scala code, especially when dealing with optional values and avoiding null pointer issues.

---

### **7. Tail Recursion: Efficient Recursion Without Stack Overflow**

Recursion is a natural solution for many problems but can lead to **stack overflow** issues. Scala solves this with **tail recursion**, which allows the compiler to optimize recursive calls into iterative loops.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;@annotation.tailrec
def factorial(n: Int, acc: Int = 1): Int = {
  if (n &amp;lt;= 1) acc
  else factorial(n - 1, n * acc)
}

println(factorial(5)) // Output: 120
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
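&lt;p&gt;To see why this matters, here is a sketch with a recursion depth that would overflow the JVM stack without the optimization (one million calls):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;object TailRecDepth extends App {
  // @annotation.tailrec makes the compiler verify the call is in tail
  // position and rewrite it into a loop, so depth is no longer a problem.
  @annotation.tailrec
  def sum(n: Long, acc: Long = 0L): Long =
    if (n == 0L) acc else sum(n - 1, acc + n)

  println(sum(1000000L)) // 500000500000
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;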

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
The Scala compiler transforms the tail-recursive function into a loop, preventing stack overflow and making recursion efficient.

---

### **8. Yield: Generating Collections from Loops**

In Scala, you can use `yield` in a `for` comprehension to generate a new collection from an existing one:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;val numbers = List(1, 2, 3, 4)
val doubled = for (n &amp;lt;- numbers) yield n * 2
println(doubled)  // Output: List(2, 4, 6, 8)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
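&lt;p&gt;A &lt;code&gt;for&lt;/code&gt; comprehension can also filter with a guard before yielding, keeping the transformation and the filter in one expression:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;val numbers = List(1, 2, 3, 4)
// Keep only even numbers, then double them, all in one expression.
val doubledEvens = for (n &amp;lt;- numbers if n % 2 == 0) yield n * 2
println(doubledEvens) // List(4, 8)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;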



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;


The power of `yield` lies in transforming collections in a concise and expressive way.

---

### **Conclusion**

Scala’s beauty lies in its elegance and the seamless integration of OOP and functional programming. Features like `lazy`, `implicit`, and `tail recursion` are just the tip of the iceberg. As you dive deeper, you’ll uncover more of Scala’s hidden gems and understand why this language stands out as a truly well-crafted tool for developers who aim to write clean, efficient, and expressive code.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
    </item>
    <item>
      <title>DBT vs. Data Engineers: A Love-Hate Saga!</title>
      <dc:creator>Pradip Sodha</dc:creator>
      <pubDate>Thu, 03 Oct 2024 13:33:01 +0000</pubDate>
      <link>https://dev.to/sudo_pradip/dbt-vs-data-engineers-a-love-hate-saga-218e</link>
      <guid>https://dev.to/sudo_pradip/dbt-vs-data-engineers-a-love-hate-saga-218e</guid>
      <description>&lt;p&gt;No doubt, &lt;a href="https://www.getdbt.com/blog/what-exactly-is-dbt" rel="noopener noreferrer"&gt;DBT&lt;/a&gt; (Data Build Tool) has introduced a whole new approach to writing transformations. Or should I say, the “correct” way to write them? But if you’ve been a data engineer for 5 to 10 years, DBT might feel...strange. Maybe even unnecessary or over-complicated. You’ve likely found comfort in the traditional ways of doing things—&lt;a href="https://docs.databricks.com/en/notebooks/index.html" rel="noopener noreferrer"&gt;Databricks notebooks&lt;/a&gt;, &lt;a href="https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-overview" rel="noopener noreferrer"&gt;Azure Data Flows&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/Stored_procedure" rel="noopener noreferrer"&gt;stored procedures&lt;/a&gt;, etc.—giving you more control over your work. Let’s explore why, from a data engineer’s perspective, DBT can feel like an alien language and what makes it a tough beast to tame.&lt;/p&gt;




&lt;h3&gt;
  
  
  1. Only SELECT Statements? Say What?!
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;DE:&lt;/strong&gt; If I open any DBT model file, all I see are SELECT statements. What if I want a good ol’ MERGE or INSERT statement? Oh wait, I have to check DBT docs for that? Really? I’ve mastered MERGE over the past decade, and now I need to look up docs like a newbie? And what if there’s a new DML feature? I have to wait for DBT to support it? Feels like I’m shackled! And don’t get me started on conditional updates or deletes—where do I even go to beg for those?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DBT:&lt;/strong&gt; I get it. You’re right, but let’s flip the script. You don’t need to worry about &lt;a href="https://en.wikipedia.org/wiki/Data_definition_language" rel="noopener noreferrer"&gt;DDL&lt;/a&gt; (create, alter) or &lt;a href="https://en.wikipedia.org/wiki/Data_manipulation_language" rel="noopener noreferrer"&gt;DML&lt;/a&gt; (insert, merge) anymore. Focus on your transformations; I’ll handle the nitty-gritty! By &lt;a href="https://en.wikipedia.org/wiki/Abstraction_(computer_science)" rel="noopener noreferrer"&gt;abstracting&lt;/a&gt; these tedious commands, you can scale easier and faster. And hey, if DBT doesn’t support a feature yet, get yourself a solid software engineer. Seriously, they’re the magic fix for everything. What are you waiting for?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Pro Tip: DBT’s secret sauce is a good software engineer. Get one who’s also a data engineer, and you’ll be flying in no time. Heaven awaits, trust me!&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  2. Jinja is... Well, Something
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;DE:&lt;/strong&gt; What is this &lt;a href="https://jinja.palletsprojects.com/en/3.1.x/" rel="noopener noreferrer"&gt;Jinja&lt;/a&gt; stuff? Honestly, it’s harder to read and write than the worst SQL I’ve ever seen. Sometimes, I dream of Jinja syntax haunting me like floating code fragments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DBT:&lt;/strong&gt; Oof, I hear you. We’ve all had second thoughts about using &lt;a href="https://github.com/dbt-labs/dbt-learn-jinja" rel="noopener noreferrer"&gt;Jinja&lt;/a&gt;, and managing it is a struggle. But hey, it’s a powerful templating engine! With its if-else and for-loops, you can write dynamic SQL that will take your transformations to the next level. Hang in there, my friend, the power will reveal itself!&lt;/p&gt;




&lt;h3&gt;
  
  
  3. Documentation Drama!
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;DE:&lt;/strong&gt; Sure, &lt;a href="https://docs.getdbt.com/docs/build/documentation" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; is great, but now my team relies on docs instead of SQL. The doc could say the Earth is flat, and they’d believe it, even if the query says it’s round! The problem? Docs are rarely up-to-date. And when you’ve got deadlines, there’s no time to update them. Worst of all, if there’s a mistake in the doc, who’s going to catch it? SQL errors get flagged, but docs? Good luck.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DBT:&lt;/strong&gt; Whoa, slow down. Docs are crucial! They bring stakeholders, downstream, and upstream users together. No more wading through murky SQL—everyone can just read the doc in a pretty UI! Sure, if the doc is wrong, that’s a problem, but you can automate doc checks using tools like dbt checkpoint. And let’s be honest, if your doc says the Earth is flat, that’s on you.&lt;/p&gt;




&lt;h3&gt;
  
  
  4. Dev Setup &amp;amp; Deployment: The Struggle is Real
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;DE:&lt;/strong&gt; Setting up a DBT project is an achievement in itself. Managing versioning, syntax, spacing, and linting across devs is a nightmare. If there’s an error, I’m diving through a swamp of logs. Plus, I have to set up a virtual environment, learn &lt;a href="https://en.wikipedia.org/wiki/Docker_(software)" rel="noopener noreferrer"&gt;Docker&lt;/a&gt;, and deploy to &lt;a href="https://aws.amazon.com/ecs/" rel="noopener noreferrer"&gt;AWS ECS&lt;/a&gt;. Seriously?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DBT:&lt;/strong&gt; I feel your pain, but have you heard of &lt;a href="https://www.getdbt.com/product/dbt-cloud" rel="noopener noreferrer"&gt;DBT Cloud&lt;/a&gt;? It solves all of these issues! Just give me your money, and I’ll make everything easier. Oh, and I’ve teamed up with Databricks—so now you can run &lt;a href="https://docs.databricks.com/en/jobs/how-to/use-dbt-in-workflows.html" rel="noopener noreferrer"&gt;DBT tasks&lt;/a&gt; in Databricks Workflows! It’s in the premium workspace though, so you’ll need to cough up a bit more.&lt;/p&gt;




&lt;h3&gt;
  
  
  5. DBT for Small Projects? Worth It?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;DE:&lt;/strong&gt; Why are even small projects pushing to use DBT? It’s expensive! Not just the learning curve, but hiring a software engineer who’s constantly complaining about everything costs a fortune. Seriously, the guy is impossible to work with!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DBT:&lt;/strong&gt; Listen, even if your project is small, one day you’ll be a &lt;a href="https://en.wikipedia.org/wiki/Unicorn_(finance)" rel="noopener noreferrer"&gt;unicorn&lt;/a&gt;. You need tools that can handle the pressure when you scale. But, if you want to keep costs down, you could try a &lt;a href="https://en.wikipedia.org/wiki/Minimalism" rel="noopener noreferrer"&gt;minimalist&lt;/a&gt; approach. Skip the fancy Jinja and stick to organized SQL files. Slowly adopt more DBT features as you grow. It’s like evolving from the Stone Age to the modern era—DBT is the future!&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Conclusion:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
DBT is a powerful tool, but it can be a bit of a pain for data engineers used to more traditional methods. It demands a lot—sometimes too much. But with the right mindset (and maybe a software engineer who knows what they’re doing), DBT can help you scale and succeed. Just... be ready for a few headaches along the way!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How to Add Reverse Proxy to Your Azure Web App</title>
      <dc:creator>Pradip Sodha</dc:creator>
      <pubDate>Thu, 05 Sep 2024 15:42:15 +0000</pubDate>
      <link>https://dev.to/sudo_pradip/how-to-add-reverse-proxy-to-your-azure-web-app-4cpn</link>
      <guid>https://dev.to/sudo_pradip/how-to-add-reverse-proxy-to-your-azure-web-app-4cpn</guid>
      <description>&lt;p&gt;While searching for a proper article on &lt;strong&gt;how to add a &lt;a href="https://en.wikipedia.org/wiki/Reverse_proxy" rel="noopener noreferrer"&gt;reverse proxy&lt;/a&gt; in &lt;a href="https://azure.microsoft.com/en-in/products/app-service" rel="noopener noreferrer"&gt;Azure Web App&lt;/a&gt;&lt;/strong&gt;, I couldn't find comprehensive documentation. So, here we are! In this article, we will explore how to add a reverse proxy to your Azure Web App, whether you're using &lt;a href="https://en.wikipedia.org/wiki/Node.js" rel="noopener noreferrer"&gt;Node.js&lt;/a&gt;, Java, &lt;a href="https://en.wikipedia.org/wiki/PHP" rel="noopener noreferrer"&gt;PHP&lt;/a&gt;, or &lt;a href="https://en.wikipedia.org/wiki/.NET_Framework" rel="noopener noreferrer"&gt;.NET&lt;/a&gt; as your runtime stack. This approach works seamlessly since Azure Web Apps are hosted on &lt;a href="https://en.wikipedia.org/wiki/Internet_Information_Services" rel="noopener noreferrer"&gt;IIS server&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is a Reverse Proxy?
&lt;/h3&gt;

&lt;p&gt;A reverse proxy is a method to forward incoming requests to another server. This setup is particularly useful in scenarios like having a frontend exposed to a public endpoint and a backend deployed on a private network. With a reverse proxy, you can route traffic from the public frontend to the private backend.&lt;/p&gt;

&lt;p&gt;One common use case is using an Azure Web App as the frontend and an &lt;a href="https://learn.microsoft.com/en-us/azure/azure-functions/functions-overview?pivots=programming-language-csharp" rel="noopener noreferrer"&gt;Azure Functions&lt;/a&gt; app serving as the API backend. Both may exist on the same &lt;a href="https://learn.microsoft.com/en-us/azure/virtual-network/virtual-networks-overview" rel="noopener noreferrer"&gt;private network&lt;/a&gt;, with the Web App connected to a gateway for public accessibility. Instead of deploying a new &lt;a href="https://learn.microsoft.com/en-us/azure/application-gateway/overview" rel="noopener noreferrer"&gt;Application Gateway&lt;/a&gt; (which can be costly), we can use the reverse proxy functionality within the Azure Web App to handle traffic.&lt;/p&gt;

&lt;h3&gt;
  
  
  Benefits of a Reverse Proxy
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Security and Anonymity&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SSL Termination&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Centralized Authentication&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Content Modification&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Let's Get Started
&lt;/h3&gt;

&lt;p&gt;Since Azure Web Apps use the &lt;strong&gt;IIS server&lt;/strong&gt;, we need to install a reverse proxy extension. Here's how:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1.&lt;/strong&gt; Open your Azure Web App in the &lt;strong&gt;Azure portal&lt;/strong&gt;, search for "Extensions," and click the "Add" button.&lt;/p&gt;

&lt;p&gt;We will be using &lt;a href="https://github.com/EelcoKoster/ReverseProxySiteExtension/tree/master" rel="noopener noreferrer"&gt;EelcoKoster's extension&lt;/a&gt;; if you are concerned about the T&amp;amp;C, read &lt;a href="https://www.nuget.org/policies/Terms" rel="noopener noreferrer"&gt;NuGet's T&amp;amp;C&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faob2swqn6cl7aotltbi6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faob2swqn6cl7aotltbi6.png" alt="Azure Portal &amp;gt; Web App &amp;gt; Extension" width="800" height="238"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2.&lt;/strong&gt; Search for "reverseproxy" and select "ReverseProxy(1.0.4) by Eelco Koster, Jerome Haltom." Click "Add."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F11lj0acz0wjw1mlmuw59.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F11lj0acz0wjw1mlmuw59.png" alt="Add Extension" width="719" height="884"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3.&lt;/strong&gt; Click on "Browse."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp93f2yo6v8krf48hy5n5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp93f2yo6v8krf48hy5n5.png" alt="Goto the extension setting" width="800" height="160"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4.&lt;/strong&gt; Add the reverse proxy rules and click "Save to web.config." Restart your web app afterward. For demonstration purposes, we'll use a public sample REST API (&lt;code&gt;https://api.restful-api.dev&lt;/code&gt;) as the redirect URL.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;rewrite&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;rules&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;rule&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"APIProxy"&lt;/span&gt; &lt;span class="na"&gt;stopProcessing=&lt;/span&gt;&lt;span class="s"&gt;"true"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="nt"&gt;&amp;lt;match&lt;/span&gt; &lt;span class="na"&gt;url=&lt;/span&gt;&lt;span class="s"&gt;"^api/?(.*)"&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
      &lt;span class="nt"&gt;&amp;lt;action&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"Rewrite"&lt;/span&gt; &lt;span class="na"&gt;url=&lt;/span&gt;&lt;span class="s"&gt;"https://api.restful-api.dev/{R:1}"&lt;/span&gt; &lt;span class="na"&gt;logRewrittenUrl=&lt;/span&gt;&lt;span class="s"&gt;"true"&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/rule&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/rules&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;outboundRules&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;rule&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"AddCORSHeaders"&lt;/span&gt; &lt;span class="na"&gt;preCondition=&lt;/span&gt;&lt;span class="s"&gt;"IsApiResponse"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="nt"&gt;&amp;lt;match&lt;/span&gt; &lt;span class="na"&gt;serverVariable=&lt;/span&gt;&lt;span class="s"&gt;"RESPONSE_Access-Control-Allow-Origin"&lt;/span&gt; &lt;span class="na"&gt;pattern=&lt;/span&gt;&lt;span class="s"&gt;".*"&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
      &lt;span class="nt"&gt;&amp;lt;action&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"Rewrite"&lt;/span&gt; &lt;span class="na"&gt;value=&lt;/span&gt;&lt;span class="s"&gt;"*"&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/rule&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;preConditions&amp;gt;&lt;/span&gt;
      &lt;span class="nt"&gt;&amp;lt;preCondition&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"IsApiResponse"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;add&lt;/span&gt; &lt;span class="na"&gt;input=&lt;/span&gt;&lt;span class="s"&gt;"{RESPONSE_Content-Type}"&lt;/span&gt; &lt;span class="na"&gt;pattern=&lt;/span&gt;&lt;span class="s"&gt;"^application/json"&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
      &lt;span class="nt"&gt;&amp;lt;/preCondition&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/preConditions&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/outboundRules&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/rewrite&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw71iye0hu32iq8tkal9r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw71iye0hu32iq8tkal9r.png" alt="web.config" width="800" height="423"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5.&lt;/strong&gt; Let's test: as you can see below, our web app's result and the public REST API's result are the same!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftkqb87jxwribe0x3c9bq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftkqb87jxwribe0x3c9bq.png" alt="our web app result" width="800" height="99"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzgj1ie9h08og3y3445hq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzgj1ie9h08og3y3445hq.png" alt="rest api result" width="800" height="124"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;By following the steps above, you've successfully set up a reverse proxy in your Azure Web App. This method provides a cost-effective and efficient way to route traffic while enhancing security and managing backend services privately.&lt;/p&gt;

</description>
      <category>azure</category>
      <category>webdev</category>
      <category>proxy</category>
      <category>networking</category>
    </item>
    <item>
      <title>Top 5 Things You Should Know About Spark</title>
      <dc:creator>Pradip Sodha</dc:creator>
      <pubDate>Thu, 29 Aug 2024 12:55:52 +0000</pubDate>
      <link>https://dev.to/sudo_pradip/top-5-things-you-should-know-about-spark-4kg3</link>
      <guid>https://dev.to/sudo_pradip/top-5-things-you-should-know-about-spark-4kg3</guid>
      <description>&lt;h2&gt;
  
  
  1. &lt;a href="https://spark.apache.org/docs/latest/sql-programming-guide.html#datasets-and-dataframes" rel="noopener noreferrer"&gt;Dataframe is a Dataset&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7bnfbwcexwkglilvhe5k.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7bnfbwcexwkglilvhe5k.jpg" alt="discovering that dataframe is a dataset by meme" width="500" height="635"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Try searching for a DataFrame API in &lt;a href="https://spark.apache.org/docs/latest/api/scala/org/apache/spark/index.html" rel="noopener noreferrer"&gt;Scala Spark documentation&lt;/a&gt; where all functions like withColumn, select, etc., are listed. Surprisingly, &lt;a href="https://spark.apache.org/docs/latest/api/scala/index.html?search=dataframe" rel="noopener noreferrer"&gt;you won't find&lt;/a&gt; it because a DataFrame is essentially a Dataset[&lt;a href="https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Row.html" rel="noopener noreferrer"&gt;Row&lt;/a&gt;]. So, you'll only find an API doc for &lt;a href="https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html" rel="noopener noreferrer"&gt;Dataset&lt;/a&gt;, as DataFrame is just an alias.&lt;/p&gt;

&lt;p&gt;Scala is a &lt;a href="https://en.wikipedia.org/wiki/Category:Statically_typed_programming_languages" rel="noopener noreferrer"&gt;statically typed language&lt;/a&gt;, yet in Spark the DataFrame is considered an &lt;a href="https://en.wikipedia.org/wiki/Talk%3ATyped_and_untyped_languages" rel="noopener noreferrer"&gt;untyped&lt;/a&gt; API, whereas Dataset is the typed one. Calling a DataFrame untyped is slightly misleading, though: its columns do have types, but Spark checks them only at runtime, whereas for a Dataset, type checking happens at compile time.&lt;/p&gt;
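&lt;p&gt;In fact, in Spark's source the alias is a single line in the &lt;code&gt;org.apache.spark.sql&lt;/code&gt; package object: &lt;code&gt;type DataFrame = Dataset[Row]&lt;/code&gt;. A plain-Scala sketch of how such an alias behaves (the &lt;code&gt;Dataset&lt;/code&gt; and &lt;code&gt;Row&lt;/code&gt; stand-ins below are illustrative, not Spark's real classes):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;object DataFrameAlias extends App {
  final case class Row(values: Seq[Any])    // stand-in for Spark's Row
  final class Dataset[T](val data: Seq[T])  // stand-in for Spark's Dataset
  type DataFrame = Dataset[Row]             // mirrors Spark's alias

  // A DataFrame *is* a Dataset[Row]; the two types are interchangeable.
  val df: DataFrame = new Dataset(Seq(Row(Seq(1, "a"))))
  val ds: Dataset[Row] = df
  assert(ds.data.length == 1)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;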




&lt;h2&gt;
  
  
  2. &lt;a href="https://spark.apache.org/docs/latest/sql-ref-syntax-qry-explain.html" rel="noopener noreferrer"&gt;Physical Plan&lt;/a&gt; is Too Abstract? Go Deeper
&lt;/h2&gt;

&lt;p&gt;Meet &lt;code&gt;df.queryExecution.debug.codegen&lt;/code&gt;. This is a valuable feature in Spark that provides the generated code, which is a close representation of what Spark will actually execute.&lt;/p&gt;

&lt;p&gt;Sometimes the Spark documentation is not enough, and black-box testing doesn't provide conclusive proof. The generated code shows you what Spark will actually run, which makes it a really handy tool. Yes, the code might seem cryptic, but thanks to AI, we can decode it.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;

&lt;span class="nv"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;queryExecution&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;debug&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;codegen&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fudbp1vmzmsefd16pilfw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fudbp1vmzmsefd16pilfw.png" alt="output of df.queryExecution.debug.codegen" width="800" height="643"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  3. &lt;a href="https://stackoverflow.com/questions/3554362/purpose-of-scalas-symbol" rel="noopener noreferrer"&gt;Symbol&lt;/a&gt; is a Simple Way to Refer to a Column
&lt;/h2&gt;

&lt;p&gt;There are five ways to refer to a column name (note that the single-quote &lt;code&gt;Symbol&lt;/code&gt; literal is deprecated in Scala 2.13 and dropped from Scala 3, where you would write &lt;code&gt;Symbol("columnName")&lt;/code&gt; instead):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;

&lt;span class="c1"&gt;//df is dataframe&lt;/span&gt;
&lt;span class="c1"&gt;//if column not exists then will throw error&lt;/span&gt;
&lt;span class="nf"&gt;df&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"columnName"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; 

&lt;span class="c1"&gt;//generic column&lt;/span&gt;
&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"columName"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;expr&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"columnName"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;//become easy to write expression &lt;/span&gt;
&lt;span class="c1"&gt;// $"colA" + $"colB"&lt;/span&gt;
&lt;span class="n"&gt;$&lt;/span&gt;&lt;span class="s"&gt;"columnName"&lt;/span&gt;

&lt;span class="c1"&gt;//Simplest way, which uses scala symbol feature &lt;/span&gt;
&lt;span class="nv"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;select&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;'columnName&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;h2&gt;
  
  
  4. Column's Nullable Property is Not a Constraint
&lt;/h2&gt;

&lt;p&gt;A DataFrame column has three properties: column name, data type, and a nullable flag. It's a common misconception that Spark enforces the nullable flag as a constraint, the way other databases do. In reality, it's just a flag the optimizer uses for better execution planning.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa9x0ydorvk3iijczii9g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa9x0ydorvk3iijczii9g.png" alt="shows nullable property" width="380" height="127"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For more details, check out &lt;a href="https://medium.com/p/1d1b7b042adb" rel="noopener noreferrer"&gt;https://medium.com/p/1d1b7b042adb&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Adding More Executors Doesn't Always Mean Faster Jobs
&lt;/h2&gt;

&lt;p&gt;While Spark's &lt;a href="https://spark.apache.org/docs/latest/cluster-overview.html" rel="noopener noreferrer"&gt;architecture&lt;/a&gt; supports horizontal scaling, simply increasing the number of executors to speed up slow jobs doesn't always yield the desired results. In many cases, this approach can backfire, leading to slower job performance and higher costs. Sometimes, jobs may run only slightly faster, but the increased resource usage can significantly raise costs.&lt;/p&gt;

&lt;p&gt;Finding the right balance of executor and core count is crucial for optimizing job performance while controlling costs. Factors such as shuffle partitions, number of cores, number of executors, source file or table size, number of files, scheduler mode, driver's capacity, and network latency all need to be considered. Everything should be in sync to achieve optimal performance. Be cautious about adding more executors, especially in scenarios involving skewed data, as this can exacerbate issues rather than solve them.&lt;/p&gt;

</description>
      <category>spark</category>
      <category>dataengineering</category>
      <category>development</category>
      <category>coding</category>
    </item>
    <item>
      <title>Avoid These Top 10 Mistakes When Using Apache Spark</title>
      <dc:creator>Pradip Sodha</dc:creator>
      <pubDate>Wed, 28 Aug 2024 09:05:06 +0000</pubDate>
      <link>https://dev.to/sudo_pradip/avoid-these-top-10-mistakes-when-using-apache-spark-366p</link>
      <guid>https://dev.to/sudo_pradip/avoid-these-top-10-mistakes-when-using-apache-spark-366p</guid>
      <description>&lt;p&gt;We all know how easy it is to overlook small parts of our code, especially when we have powerful tools like &lt;a href="https://spark.apache.org/" rel="noopener noreferrer"&gt;Apache Spark&lt;/a&gt; to handle the heavy lifting. Spark's core engine is great at optimizing our messy, complex code into a sleek, efficient &lt;a href="https://spark.apache.org/docs/latest/sql-ref-syntax-qry-explain.html" rel="noopener noreferrer"&gt;physical plan&lt;/a&gt;. But here's the catch: Spark isn't flawless. It's on a journey to perfection, sure, but it still has its limits. And Spark is upfront about those limitations, listing them out in the documentation (sometimes as little notes).&lt;/p&gt;

&lt;p&gt;But let’s be honest—how often do we skip the &lt;a href="https://spark.apache.org/docs/latest/api/scala/org/apache/spark/index.html?search=Data" rel="noopener noreferrer"&gt;docs&lt;/a&gt; and head straight to &lt;a href="https://stackoverflow.com/" rel="noopener noreferrer"&gt;Stack Overflow&lt;/a&gt; or &lt;a href="https://chat.openai.com/" rel="noopener noreferrer"&gt;ChatGPT&lt;/a&gt; for quick answers? I've been there too. The thing is, while these shortcuts can be useful, they don't always tell the whole story. So, if you're ready to dive in, let's talk about some common mistakes and how to avoid them. Stay with me; this is going to be a ride!&lt;/p&gt;




&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Mistake #1: Adding Columns the Wrong Way&lt;/li&gt;
&lt;li&gt;Mistake #2: Order of Narrow and Wide Transformation&lt;/li&gt;
&lt;li&gt;
Mistake #3: Overlooking Data Serialization Format &lt;/li&gt;
&lt;li&gt;Mistake #4: Not Using Parallel Listing on Input Paths&lt;/li&gt;
&lt;li&gt;Mistake #5: Ignoring Data Locality&lt;/li&gt;
&lt;li&gt;Mistake #6: Relying on Default Number of Shuffle Partitions&lt;/li&gt;
&lt;li&gt;Mistake #7: Overlooking Broadcast Join Thresholds&lt;/li&gt;
&lt;li&gt;Mistake #8: Relying on default storage level for Cache&lt;/li&gt;
&lt;li&gt;Mistake #9: Misconfiguring Spark Memory Settings&lt;/li&gt;
&lt;li&gt;Mistake #10: Relying Only on Cache and Persist&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Mistake #1: Adding Columns the Wrong Way
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;client:&lt;/em&gt; "Hey, can you add 5 columns? Make it quick, okay?"&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Developer:&lt;/em&gt; "Sure, I'll just use withColumn() in a loop 5 times!"&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Client:&lt;/em&gt; (Happy) "Great! Now, can you add 10 more columns? Make it quick, and keep the code short!"&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Developer:&lt;/em&gt; "No problem! I'll loop 15 times now."&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Spark:&lt;/em&gt; "Sorry, I can't optimize that."&lt;/p&gt;

&lt;p&gt;But wait—according to Spark's documentation... &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Don't use withColumn in loop&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxj4fj972phiugg2lk92o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxj4fj972phiugg2lk92o.png" alt="Apache Spark scala doc of withColumn with it's limitation highlighted" width="800" height="206"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Solution: &lt;a href="https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html#selectExpr(exprs:String*):org.apache.spark.sql.DataFrame" rel="noopener noreferrer"&gt;SelectExpr&lt;/a&gt; or &lt;a href="https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html#select(cols:org.apache.spark.sql.Column*):org.apache.spark.sql.DataFrame" rel="noopener noreferrer"&gt;Select&lt;/a&gt;&lt;br&gt;
here is a complete solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;addOrReplaceColumns&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;newColumns&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;List&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;Column&lt;/span&gt;&lt;span class="o"&gt;],&lt;/span&gt; &lt;span class="n"&gt;sourceColumns&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;List&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="o"&gt;])&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;List&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;Column&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;val&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columnsToBeReplace&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;newColumns&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;newColumns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;partition&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;column&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;sourceColumns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;contains&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;column&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;toString&lt;/span&gt;&lt;span class="o"&gt;()))&lt;/span&gt;
  &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;restOfColumns&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;sourceColumns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;diff&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;columnsToBeReplace&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;map&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;column&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;column&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;toString&lt;/span&gt;&lt;span class="o"&gt;())).&lt;/span&gt;&lt;span class="py"&gt;map&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

  &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columnsToBeReplace&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="n"&gt;newColumns&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="n"&gt;restOfColumns&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="py"&gt;toList&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Mistake #2: Order of &lt;a href="https://stackoverflow.com/questions/77156805/wide-and-narrow-transformations-in-spark" rel="noopener noreferrer"&gt;Narrow and Wide Transformation&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Normally we focus on business logic when developing a data solution, and it's common to ignore the order of narrow and wide transformations. However, Spark recommends grouping all the narrow transformations first, followed by the wide ones. For example, if you have&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="n"&gt;narrow&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wide&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;narrow&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;narrow&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wide&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;narrow&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;then try to rearrange it like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="n"&gt;narrow&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;narrow&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;narrow&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wide&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wide&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Spark can then optimize your code more effectively: for example, all the narrow transformations are pipelined into a single stage, and fewer shuffles are required.&lt;/p&gt;
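As a rough collections-based analogy (illustrative data, not Spark code), running the narrow filter before the wide groupBy means fewer records ever reach the expensive grouping step:

```scala
val events = Seq(("a", 1), ("b", 5), ("a", 7), ("c", 2), ("b", 9))

val totals = events
  .filter { case (_, v) => v > 2 }   // narrow: per-record, no shuffle in Spark
  .groupBy { case (k, _) => k }      // wide: would trigger a shuffle in Spark
  .map { case (k, vs) => k -> vs.map(_._2).sum }
```

Only three of the five records survive the filter, so the grouping (the shuffle, in Spark terms) handles less data.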




&lt;h2&gt;
  
  
  Mistake #3: Overlooking &lt;a href="https://spark.apache.org/docs/latest/tuning.html#data-serialization" rel="noopener noreferrer"&gt;Data Serialization Format&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;By default, Spark uses Java serialization, which is not the most efficient option. Switching to &lt;strong&gt;Kryo serialization&lt;/strong&gt; can lead to better performance, as it is faster and uses less memory. Use the following configuration to enable Kryo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="nv"&gt;spark&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;conf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;set&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"spark.serializer"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"org.apache.spark.serializer.KryoSerializer"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, Kryo does not support all Serializable types, and it requires you to register the classes you'll use in the program in advance for best performance.&lt;/p&gt;
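Class registration happens on the SparkConf before the session is created; a configuration sketch (MyEvent and MyKey are placeholder class names, not real types):

```scala
import org.apache.spark.SparkConf

// Registering classes lets Kryo write a small numeric ID instead of the
// full class name with every serialized object. MyEvent/MyKey are placeholders.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyEvent], classOf[MyKey]))
```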




&lt;h2&gt;
  
  
  Mistake #4: Not Using &lt;a href="https://spark.apache.org/docs/latest/tuning.html#parallel-listing-on-input-paths" rel="noopener noreferrer"&gt;Parallel Listing on Input Paths&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;When reading files from storage systems like Amazon S3, Azure Data Lake Storage (ADLS), or even local storage, Spark needs to list and find all matching files in the input directory before starting the next task. This &lt;strong&gt;listing process can become a bottleneck&lt;/strong&gt;, especially when dealing with large directories or a vast number of files. By default, Spark uses only a single thread to list files, which can significantly slow down the start of your job.&lt;/p&gt;

&lt;p&gt;To mitigate this, you can increase the number of threads used for listing files by setting the &lt;code&gt;spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads&lt;/code&gt; property. This allows Spark to parallelize the file listing process, speeding up the initialization phase of your job.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="nv"&gt;spark&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;conf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;set&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Mistake #5: Ignoring &lt;a href="https://spark.apache.org/docs/latest/tuning.html#data-locality" rel="noopener noreferrer"&gt;Data Locality&lt;/a&gt;
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Data locality significantly impacts the performance of Spark jobs.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When data and the code processing it are close together, computation is faster, as there is less need to move large chunks of data. Spark scheduling prioritizes data locality to minimize data movement, following levels of locality from best to worst: PROCESS_LOCAL (data and code in the same JVM), NODE_LOCAL (data on the same node), RACK_LOCAL (data on the same rack but different node), and ANY (data elsewhere on the network).&lt;/p&gt;

&lt;p&gt;Spark tries to schedule tasks at the highest locality level possible, but this isn't always feasible. If no idle executors have unprocessed data at the desired locality level, Spark can either wait for a busy executor to free up or fall back to a lower locality level by moving data to an idle executor. The time Spark waits before falling back can be adjusted using the spark.locality.wait settings. Adjusting these settings can help improve performance in scenarios with long-running tasks or when data locality is poor.&lt;/p&gt;

&lt;p&gt;With moderate data skew, a cluster with ample resources, or cached data (&lt;code&gt;.cache()&lt;/code&gt;), &lt;strong&gt;increasing the wait time is usually more beneficial&lt;/strong&gt; than falling back to a lower locality level.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spark.conf.set("spark.locality.wait", "10s")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
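The wait can also be tuned per locality level; each of these falls back to the global spark.locality.wait value by default (the durations below are illustrative):

```scala
// Per-level fallback timeouts (illustrative values)
spark.conf.set("spark.locality.wait.process", "5s")  // before giving up PROCESS_LOCAL
spark.conf.set("spark.locality.wait.node", "10s")    // before giving up NODE_LOCAL
spark.conf.set("spark.locality.wait.rack", "3s")     // before giving up RACK_LOCAL
```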






&lt;h2&gt;
  
  
  Mistake #6: Relying on Default Number of &lt;a href="https://community.databricks.com/t5/data-engineering/tuning-shuffle-partitions/td-p/22378" rel="noopener noreferrer"&gt;Shuffle Partitions&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;By default, Spark uses 200 partitions for shuffle operations (e.g., join, groupBy). This number might be too high or too low, depending on your dataset and cluster size.&lt;/p&gt;

&lt;p&gt;AQE (enabled by default since Databricks Runtime 7.3 LTS, and since Spark 3.2 in open-source Spark) adjusts the shuffle partition number automatically at each stage of the query, based on the size of the map-side shuffle output. &lt;/p&gt;

&lt;p&gt;Still, it's advisable to set the shuffle partition count explicitly before performing a wide transformation when you need precise control; if you're unsure, Spark recommends setting it to the total number of cores in your cluster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="nv"&gt;spark&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;conf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;set&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"spark.sql.shuffle.partitions"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"num_core_in_cluster"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And don't forget to tune &lt;code&gt;spark.default.parallelism&lt;/code&gt; accordingly as well, since it controls the default partition count for RDD operations.&lt;/p&gt;
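If you want a number rather than a guess, one common back-of-envelope heuristic (the figures below are illustrative) divides the expected shuffle data volume by a target partition size of roughly 100-200 MiB:

```scala
// Illustrative sizing: ~20 GiB of shuffle data, targeting ~200 MiB partitions
val shuffleDataMiB = 20 * 1024
val targetPartitionMiB = 200

val partitions = math.max(shuffleDataMiB / targetPartitionMiB, 1)
println(partitions)
```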




&lt;h2&gt;
  
  
  Mistake #7: Overlooking &lt;a href="https://sparktpoint.com/broadcast-join-in-spark/" rel="noopener noreferrer"&gt;Broadcast Join Thresholds&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Scenario:&lt;/p&gt;

&lt;p&gt;Developer: "I thought small lookup tables would be broadcasted automatically and my each of executors has 32GB of memory! Why are my joins so slow?"&lt;/p&gt;

&lt;p&gt;Spark: "Sorry, your lookup table is just above the default threshold."&lt;/p&gt;

&lt;p&gt;Broadcast joins can drastically speed up join operations when one of the tables is small enough to fit into memory on each worker node. However, if you don't adjust the broadcast join threshold, Spark might not broadcast tables that could be effectively broadcasted, leading to unnecessary shuffling.&lt;/p&gt;

&lt;p&gt;Solution:&lt;/p&gt;

&lt;p&gt;Adjust the broadcast join threshold using &lt;code&gt;spark.sql.autoBroadcastJoinThreshold&lt;/code&gt;. If your lookup table is slightly larger than the default 10MB limit, increase the threshold.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="nv"&gt;spark&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;conf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;set&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"spark.sql.autoBroadcastJoinThreshold"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;// 50MB&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;When setting the broadcast join threshold, don't base it only on executor memory. The driver loads the small table into memory first before distributing it to executors. Make sure the threshold is suitable for both driver and executor memory capacities to prevent memory issues and optimize performance.&lt;/p&gt;
&lt;/blockquote&gt;
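If raising the global threshold feels too blunt, you can instead force a broadcast for a single join with the broadcast() hint, which bypasses the threshold check entirely; a sketch (factDf, lookupDf, and the join key are placeholders):

```scala
import org.apache.spark.sql.functions.broadcast

// Hint Spark to broadcast lookupDf regardless of autoBroadcastJoinThreshold.
// factDf, lookupDf and "customer_id" are placeholder names.
val joined = factDf.join(broadcast(lookupDf), Seq("customer_id"))
```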




&lt;h2&gt;
  
  
  Mistake #8: Relying on default storage level for &lt;a href="https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html#cache():Dataset.this.type" rel="noopener noreferrer"&gt;Cache&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;It's crucial to select the appropriate storage level for caching and persisting data based on the type of executors in your cluster and your objectives. By understanding the trade-offs between speed, memory usage, and fault tolerance, you can tailor your Spark configuration to meet the specific needs of your application.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Executor Type&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Primary Objective&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Recommended Storage Level&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Description&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Alternative for Fault Tolerance&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Notes&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memory-Optimized&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fast access, low memory usage&lt;/td&gt;
&lt;td&gt;&lt;code&gt;MEMORY_ONLY_SER&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Stores RDD as serialized objects in memory. Balances speed and memory efficiency.&lt;/td&gt;
&lt;td&gt;&lt;code&gt;MEMORY_ONLY_SER_2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Use &lt;code&gt;MEMORY_ONLY&lt;/code&gt; if serialization overhead is not a concern.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;MEMORY_ONLY&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Stores RDD as deserialized objects in memory. Fastest access, highest memory usage.&lt;/td&gt;
&lt;td&gt;&lt;code&gt;MEMORY_ONLY_2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Use for small datasets that fit comfortably in memory.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CPU-Optimized&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Balanced memory and disk&lt;/td&gt;
&lt;td&gt;&lt;code&gt;MEMORY_AND_DISK_SER&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Serialized storage in memory, spills to disk if needed. Good for large datasets.&lt;/td&gt;
&lt;td&gt;&lt;code&gt;MEMORY_AND_DISK_SER_2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Preferred when memory is limited; avoids out-of-memory errors.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;MEMORY_AND_DISK&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Deserialized storage in memory, spills to disk. Faster access than &lt;code&gt;MEMORY_AND_DISK_SER&lt;/code&gt;.&lt;/td&gt;
&lt;td&gt;&lt;code&gt;MEMORY_AND_DISK_2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Use when memory can accommodate deserialized objects, with fallback to disk.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;General Purpose&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Flexibility, moderate size datasets&lt;/td&gt;
&lt;td&gt;&lt;code&gt;MEMORY_AND_DISK&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Deserialized in-memory, spills to disk. Good balance for general use cases.&lt;/td&gt;
&lt;td&gt;&lt;code&gt;MEMORY_AND_DISK_2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Good for mixed workloads; balances speed and fault tolerance.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;MEMORY_ONLY_SER&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Serialized in-memory storage. Optimized for memory efficiency and speed.&lt;/td&gt;
&lt;td&gt;&lt;code&gt;MEMORY_ONLY_SER_2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Suitable for datasets that fit well in memory after serialization.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Disk-Optimized&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Low memory, high fault tolerance&lt;/td&gt;
&lt;td&gt;&lt;code&gt;DISK_ONLY&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Stores RDD partitions only on disk. Minimizes memory usage but slowest access.&lt;/td&gt;
&lt;td&gt;&lt;code&gt;DISK_ONLY_2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Suitable for very large datasets where memory is a constraint.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;MEMORY_AND_DISK_SER&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Serialized storage in memory with spillover to disk. More efficient than deserialized.&lt;/td&gt;
&lt;td&gt;&lt;code&gt;MEMORY_AND_DISK_SER_2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Balances disk usage and memory efficiency.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The &lt;code&gt;_2&lt;/code&gt; options (e.g., &lt;code&gt;MEMORY_ONLY_2&lt;/code&gt;, &lt;code&gt;MEMORY_AND_DISK_2&lt;/code&gt;) are useful for scenarios where fault tolerance is crucial. They replicate data across two nodes, ensuring data is not lost if a node fails. This is particularly valuable in environments where reliability is prioritized over resource efficiency, such as production systems handling critical data or real-time processing pipelines.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;code&gt;_SER&lt;/code&gt; options (e.g., &lt;code&gt;MEMORY_AND_DISK_SER&lt;/code&gt;) store the RDD as serialized Java objects (one byte array per partition) in memory. This is more memory-efficient than &lt;code&gt;MEMORY_ONLY&lt;/code&gt;, but slower due to serialization/deserialization overhead.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
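Choosing a level from the table is then a one-liner: use persist(level) instead of cache(), which always uses the default storage level; a sketch (df is a placeholder DataFrame):

```scala
import org.apache.spark.storage.StorageLevel

// cache() is persist() with the default level; persist(level) lets you pick
// any level from the table above. df is a placeholder DataFrame.
val cached = df.persist(StorageLevel.MEMORY_AND_DISK_SER)
```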




&lt;h2&gt;
  
  
  Mistake #9: Misconfiguring Spark Memory Settings
&lt;/h2&gt;

&lt;p&gt;Scenario:&lt;/p&gt;

&lt;p&gt;Developer: "My Spark job keeps failing with out-of-memory errors. I gave it all the memory available!"&lt;/p&gt;

&lt;p&gt;Spark: "Memory isn't just for you; I need some for myself, too."&lt;/p&gt;

&lt;p&gt;Many users allocate almost all available memory to the executor heap space (spark.executor.memory) without considering Spark's overhead memory, causing frequent out-of-memory errors. Additionally, insufficient memory can lead to excessive garbage collection (GC) pauses, slowing down jobs.&lt;/p&gt;

&lt;p&gt;Solution:&lt;/p&gt;

&lt;p&gt;Properly configure memory settings by tuning &lt;code&gt;spark.executor.memory&lt;/code&gt; and &lt;code&gt;spark.executor.memoryOverhead&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nt"&gt;--conf&lt;/span&gt; spark.executor.memory&lt;span class="o"&gt;=&lt;/span&gt;4g &lt;span class="nt"&gt;--spark&lt;/span&gt;.executor.memoryOverhead&lt;span class="o"&gt;=&lt;/span&gt;512m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ensure you leave enough memory overhead to accommodate Spark's internal needs (shuffle, RDD storage, etc.). Typically, 10-15% of the total memory should be allocated as overhead.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;spark.memory.fraction&lt;/code&gt; expresses the size of the unified memory region M (used for execution and storage) as a fraction of (JVM heap space - 300MiB), with a default of 0.6. The rest of the space (40%) is reserved for user data structures, internal metadata in Spark, and safeguarding against OOM errors in the case of sparse and unusually large records.&lt;/p&gt;
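To make the numbers concrete, some purely illustrative arithmetic for a 4 GiB executor heap, combining the 10-15% overhead guideline with the default memory fraction:

```scala
val executorHeapMiB = 4 * 1024                        // spark.executor.memory = 4g

// 10-15% overhead guideline
val overheadLowMiB  = (executorHeapMiB * 0.10).toInt  // 409 MiB
val overheadHighMiB = (executorHeapMiB * 0.15).toInt  // 614 MiB

// Unified region M = (heap - 300 MiB) * spark.memory.fraction (default 0.6)
val unifiedRegionMiB = ((executorHeapMiB - 300) * 0.6).toInt  // 2277 MiB

println(s"overhead: $overheadLowMiB-$overheadHighMiB MiB, M: $unifiedRegionMiB MiB")
```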




&lt;h2&gt;
  
  
  Mistake #10: Relying Only on Cache and Persist
&lt;/h2&gt;

&lt;p&gt;Many Spark developers are familiar with the cache() and persist() methods for improving performance, but they often overlook the value of &lt;a href="https://stackoverflow.com/questions/35127720/what-is-the-difference-between-spark-checkpoint-and-persist-to-a-disk" rel="noopener noreferrer"&gt;checkpoint()&lt;/a&gt;. While cache() and persist() keep data in memory or on disk to speed up processing, a lost partition must still be recomputed from its full lineage after a failure. checkpoint(), on the other hand, saves the RDD to a reliable storage system and truncates the lineage, allowing for cheap recovery and simpler execution plans.&lt;/p&gt;

&lt;p&gt;Using checkpoint() not only ensures that your job can recover from failures but also helps Spark optimize the execution of other jobs that share the same lineage. This can lead to improved performance and resource utilization.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="nv"&gt;spark&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;sparkContext&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;setCheckpointDir&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"path/to/checkpoint/dir"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;checkpoint&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>dataengineering</category>
      <category>development</category>
      <category>coding</category>
      <category>dbt</category>
    </item>
    <item>
      <title>DBT and Software Engineering</title>
      <dc:creator>Pradip Sodha</dc:creator>
      <pubDate>Sat, 24 Aug 2024 14:27:57 +0000</pubDate>
      <link>https://dev.to/sudo_pradip/dbt-and-software-engineering-4006</link>
      <guid>https://dev.to/sudo_pradip/dbt-and-software-engineering-4006</guid>
      <description>&lt;p&gt;In recent years, the competition for data solution tools has heated up. While AWS, Azure, &lt;a href="https://cloud.google.com/free/" rel="noopener noreferrer"&gt;GCP&lt;/a&gt; and many more companies investing heavily into Data Engineering such as &lt;a href="https://aws.amazon.com/blogs/big-data/making-etl-easier-with-aws-glue-studio/" rel="noopener noreferrer"&gt;AWS Glue Studio&lt;/a&gt;, &lt;a href="https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-overview" rel="noopener noreferrer"&gt;Azure DataFlow&lt;/a&gt;, &lt;a href="https://cloud.google.com/data-fusion" rel="noopener noreferrer"&gt;GCP cloud data fusion&lt;/a&gt;. While most companies focusing on low-code and drang-n-drop path. However, &lt;a href="https://www.getdbt.com/" rel="noopener noreferrer"&gt;DBT&lt;/a&gt; (Data Build Tool) takes a different approach by embracing &lt;a href="https://en.wikipedia.org/wiki/List_of_software_development_philosophies" rel="noopener noreferrer"&gt;software engineering principles&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;Instead of opting for the easy path, DBT proposes the right way of doing things, grounded in sound engineering practices. We will eventually explore everything that sets DBT apart from these giants, but today let's dive into one part of it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Introduction&lt;/li&gt;
&lt;li&gt;Audience&lt;/li&gt;
&lt;li&gt;Software Engineering&lt;/li&gt;
&lt;li&gt;Limitations of Today's Data Pipelines&lt;/li&gt;
&lt;li&gt;DBT's Adherence to Software Engineering Practices&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In this post, we'll explore the &lt;a href="https://en.wikipedia.org/wiki/Software_engineering" rel="noopener noreferrer"&gt;Software Engineering&lt;/a&gt; methods used in DBT (Data Build Tool). A basic understanding of DBT's features from its &lt;a href="https://docs.getdbt.com/reference/references-overview" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; might suffice for contributing to a project, but it won't tell you why those features exist. &lt;/p&gt;

&lt;p&gt;So, why read this? We'll explain the software engineering methods DBT uses and why they matter; in short, we'll uncover the reasons behind DBT's features.&lt;/p&gt;

&lt;p&gt;That makes a difference, because knowing the reason for a feature and its potential is far more important than merely mastering it: if you violate the reason at a feature's core, the feature is effectively killed, reduced to just another workaround or patch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Audience
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://en.wikipedia.org/wiki/Data_engineering" rel="noopener noreferrer"&gt;Data Engineers&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Data_analysis" rel="noopener noreferrer"&gt;Data Analytics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Software_engineering" rel="noopener noreferrer"&gt;Software Engineers&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Software Engineering
&lt;/h2&gt;

&lt;p&gt;The realm of software engineering holds a vast history, having witnessed the contributions of numerous &lt;a href="https://en.wikipedia.org/wiki/List_of_computer_scientists" rel="noopener noreferrer"&gt;scientists&lt;/a&gt; and professionals. &lt;/p&gt;

&lt;p&gt;Their collective efforts have propelled software methodologies to new heights, constantly striving to surpass previous achievements while upholding an audacious spirit.&lt;/p&gt;

&lt;p&gt;Software engineering stands as the bedrock of modern technological advancements, weaving a rich tapestry of methodologies and&lt;br&gt;
practices that shape the way we design, develop, and maintain software systems. &lt;/p&gt;

&lt;p&gt;Its roots stretch back to the &lt;a href="https://en.wikipedia.org/wiki/History_of_software_engineering" rel="noopener noreferrer"&gt;mid-20th century&lt;/a&gt;, evolving from simple programming to a comprehensive discipline encompassing various principles, tools, and frameworks. &lt;/p&gt;

&lt;p&gt;Over the decades, software engineering has propelled innovations, enhancing reliability, scalability, and maintainability of systems across diverse industries.  &lt;/p&gt;




&lt;h2&gt;
  
  
  Limitations of Today's Data Pipelines
&lt;/h2&gt;

&lt;p&gt;In the realm of &lt;a href="https://en.wikipedia.org/wiki/Big_data" rel="noopener noreferrer"&gt;big data&lt;/a&gt;, the sophistication of data pipelines has surged, enabling the handling of massive datasets. &lt;/p&gt;

&lt;p&gt;However, conventional data pipelines often exhibit limitations. They are prone to complexities, becoming intricate webs of disparate scripts, SQL queries, and manual interventions. &lt;/p&gt;

&lt;p&gt;These pipelines lack standardization, making them difficult to maintain and comprehend. As the data grows, managing these pipelines becomes a daunting challenge, hindering scalability and agility.&lt;/p&gt;




&lt;h2&gt;
  
  
  DBT: A Solution Rooted in Software Engineering
&lt;/h2&gt;

&lt;p&gt;Enter DBT (Data Build Tool), a paradigm shift in the world of data engineering that embodies the core principles of software engineering.&lt;/p&gt;

&lt;p&gt;DBT redefines the way data pipelines are built and managed, aligning itself with established software engineering practices to tackle the challenges prevalent in traditional data pipelines.&lt;br&gt;
DBT stands as a revolutionary force in the domain of data transformation. &lt;/p&gt;

&lt;p&gt;It reimagines the handling of data by infusing principles of agility and discipline akin to those found in the software engineering realm. &lt;/p&gt;

&lt;p&gt;By treating data transformation as a form of software development,&lt;br&gt;
DBT enables the scalability and seamless management of significant data components, facilitating collaboration among large teams with&lt;br&gt;
unparalleled ease.&lt;/p&gt;




&lt;h2&gt;
  
  
  DBT's Adherence to Software Engineering Practices
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="https://en.wikipedia.org/wiki/Separation_of_concerns" rel="noopener noreferrer"&gt;Separation of Concerns&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DBT distinguishes between data transformation logic and &lt;a href="https://docs.getdbt.com/docs/build/models" rel="noopener noreferrer"&gt;data modeling&lt;/a&gt;, allowing for modularization and easier management. For
instance, SQL queries in DBT focus on transforming raw data, while models define the final structured datasets.&lt;/li&gt;
&lt;li&gt;DBT divides the usual data transformation into four parts: 1. Business logic (DQL), 2. &lt;a href="https://docs.getdbt.com/docs/build/materializations" rel="noopener noreferrer"&gt;Materialization&lt;/a&gt; (DDL &amp;amp; DML), 3. &lt;a href="https://docs.getdbt.com/docs/build/data-tests" rel="noopener noreferrer"&gt;Testing&lt;/a&gt;,
and 4. &lt;a href="https://docs.getdbt.com/docs/build/documentation" rel="noopener noreferrer"&gt;Documentation&lt;/a&gt;. These four areas can now be scaled and maintained independently, and each is easier to read. &lt;a href="https://www.getdbt.com/what-is-analytics-engineering" rel="noopener noreferrer"&gt;Analytics engineers&lt;/a&gt;
can focus on one concern at a time: for example, concentrating solely on business logic (just select statements) while writing
models, since storage, testing, and documentation each have their own dedicated section.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Benefits&lt;/em&gt;

&lt;ul&gt;
&lt;li&gt;Enhanced Maintainability&lt;/li&gt;
&lt;li&gt;Improved Reusability&lt;/li&gt;
&lt;li&gt;Better Collaboration&lt;/li&gt;
&lt;li&gt;Scalability and Flexibility&lt;/li&gt;
&lt;li&gt;Security and &lt;a href="https://en.wikipedia.org/wiki/Risk_management" rel="noopener noreferrer"&gt;Risk Mitigation&lt;/a&gt; (individual models can have their own access control and owner)&lt;/li&gt;
&lt;li&gt;Future-proofing&lt;/li&gt;
&lt;li&gt;Reduction of Complexity&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="https://en.wikipedia.org/wiki/Reusability" rel="noopener noreferrer"&gt;Reusability&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Just as software modules can be reused, DBT promotes reusable code blocks (&lt;a href="https://docs.getdbt.com/docs/build/jinja-macros" rel="noopener noreferrer"&gt;macros&lt;/a&gt;) and models. This allows data
engineers to build upon existing components, fostering efficiency and consistency. DBT also offers a rich set of packages that can be
imported and used directly in projects, allowing standard, well-tested expressions to be shared globally.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Benefits&lt;/em&gt;

&lt;ul&gt;
&lt;li&gt;Efficiency&lt;/li&gt;
&lt;li&gt;Consistency and Standardization&lt;/li&gt;
&lt;li&gt;Ease of Maintenance&lt;/li&gt;
&lt;li&gt;Cost-Effectiveness&lt;/li&gt;
&lt;li&gt;Facilitates Collaboration&lt;/li&gt;
&lt;li&gt;Future-Proofing&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="https://en.wikipedia.org/wiki/Unit_testing" rel="noopener noreferrer"&gt;Unit Testing&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Similar to software unit &lt;a href="https://docs.getdbt.com/docs/build/data-tests" rel="noopener noreferrer"&gt;tests&lt;/a&gt;, DBT enables data engineers to create tests that validate the accuracy of transformations,
ensuring data quality throughout the pipeline. You can test each individual transformation (a model) before any subsequent step runs.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Benefits&lt;/em&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Error Identification in Isolation&lt;/em&gt;: It allows the testing of individual components (units) of code in isolation, pinpointing errors or bugs specific to that unit. This facilitates easier debugging and troubleshooting.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Enhanced Code Quality&lt;/em&gt;: Unit tests enforce better coding practices by promoting modular and understandable code. Writing tests inherently requires breaking down functionalities into smaller, manageable units, leading to more maintainable and robust code.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Regression Prevention&lt;/em&gt;: Unit tests serve as a safety net. When modifications or updates are made, running unit tests ensures that existing functionalities are not negatively impacted, preventing unintended consequences through regression testing.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Facilitates Refactoring&lt;/em&gt;: Developers can confidently refactor or restructure code knowing that unit tests will quickly identify any potential issues. This flexibility encourages code improvements without the fear of breaking existing functionalities.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Improved Design and Documentation&lt;/em&gt;: Writing unit tests often necessitates clearer interfaces and more detailed documentation. This leads to better-designed APIs and clearer understanding of how code should be used.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Accelerates Development&lt;/em&gt;: Despite the initial time investment in writing tests, unit testing can speed up development by reducing time spent on debugging and rework. It aids in catching bugs early in the development cycle, saving time in the long run.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Supports Agile Development&lt;/em&gt;: Unit tests align well with agile methodologies by promoting frequent iterations and continuous integration. They facilitate a faster feedback loop, allowing developers to quickly verify changes.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Encourages Modular Development&lt;/em&gt;: Unit tests require breaking down functionalities into smaller units, promoting a modular approach to development. This modularity fosters reusability and simplifies integration.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Boosts Confidence in Code Changes&lt;/em&gt;: Unit tests provide confidence when making changes or additions to the codebase. Passing
tests indicate that the modified code behaves as expected, reducing the risk of introducing new bugs.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="https://en.wikipedia.org/wiki/Abstraction" rel="noopener noreferrer"&gt;Abstraction&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The abstraction principle involves concealing intricate underlying details while presenting a simplified and accessible
interface or representation. In DBT, for instance, model files encapsulate solely business logic, abstracting materialization and test
cases. This seemingly simple feature proves immensely helpful. It's akin to skimming a newspaper headline—if more details are needed,
delve deeper; if not, move swiftly to the next topic.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Benefits&lt;/em&gt;

&lt;ul&gt;
&lt;li&gt;Simplification of Complexity&lt;/li&gt;
&lt;li&gt;Enhanced Readability and Understandability&lt;/li&gt;
&lt;li&gt;Focus on Higher-Level Concepts&lt;/li&gt;
&lt;li&gt;Reduced Cognitive Load&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="https://en.wikipedia.org/wiki/Coupling" rel="noopener noreferrer"&gt;Coupling&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The coupling principle refers to the degree of interconnectedness or dependency between different components or modules within a
system. Lower coupling indicates a lesser degree of dependency, while higher coupling suggests a stronger interconnection between
components.&lt;/li&gt;
&lt;li&gt;In DBT, managing coupling means reducing dependencies between different parts of the data transformation process. Lower
coupling is desirable for several reasons: a change to one model is less likely to break the models that depend on it, and models can be developed, tested, and re-run independently.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="https://en.wikipedia.org/wiki/Software_documentation" rel="noopener noreferrer"&gt;Documentation&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DBT facilitates comprehensive &lt;a href="https://docs.getdbt.com/docs/build/documentation" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; for data models and transformations, akin to software documentation.
This documentation aids in understanding the data flow, enhancing collaboration and knowledge sharing.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="https://en.wikipedia.org/wiki/Deployment_environment" rel="noopener noreferrer"&gt;Environment Separation&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In the software world, it's common to use different environments like Development (Dev), User Acceptance Testing (UAT), and Production (Prod) to manage changes effectively and ensure stability. This practice, known as Environment Separation, helps isolate changes, allowing teams to test and validate new features or fixes in a controlled setting before exposing them to real users. &lt;/li&gt;
&lt;li&gt;It mitigates risks, ensures consistency, and facilitates compliance and security. Similarly, dbt (data build tool) seamlessly supports environment separation, allowing teams to define and manage different environments such as Dev, UAT, and Prod. This practice promotes better DataOps by ensuring that data transformations are thoroughly tested and validated before they impact production, improving reliability and reducing the risk of errors.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="https://en.wikipedia.org/wiki/Backward_compatibility" rel="noopener noreferrer"&gt;Backward Compatibility&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clients often provide new requirements, or we may discover more optimal ways to perform tasks. When this happens, we tend to modify our existing models or queries. However, in a large project, a single query might be relied upon by many clients, making it challenging to notify all teams of changes. &lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Additionally, new changes can sometimes introduce faults, which can disrupt data pipelines and violate one of the core principles of big data: availability.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;To address this, the software industry already employs strategies to manage such issues effectively. dbt (data build tool) supports different &lt;a href="https://docs.getdbt.com/docs/collaborate/govern/model-versions" rel="noopener noreferrer"&gt;model versions&lt;/a&gt;, allowing teams to maintain multiple versions, such as a pre-release version for testing and a stable version for production use. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;This versioning approach makes dbt highly adaptive, enabling teams to migrate to new versions at their own pace. Furthermore, dbt allows setting a deprecation period, specifying how long an old API version will be supported before it is phased out, aligning with the concept of a Deprecation Policy.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Benefits&lt;/em&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User Experience Stability&lt;/li&gt;
&lt;li&gt;Reduced Migration Costs&lt;/li&gt;
&lt;li&gt;Minimized Downtime&lt;/li&gt;
&lt;li&gt;Flexibility in Adopting Updates&lt;/li&gt;
&lt;li&gt;Encourages Innovation&lt;/li&gt;
&lt;li&gt;Risk Mitigation&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
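
&lt;p&gt;To make the separation of concerns concrete, here is a minimal sketch of a dbt model (the file, model, and column names are hypothetical): the &lt;code&gt;.sql&lt;/code&gt; file holds only business logic, while materialization, tests, and documentation live in configuration alongside it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- models/orders.sql: business logic only (just a SELECT)
{{ config(materialized='table') }}

select order_id, customer_id, amount
from {{ ref('stg_orders') }}

# models/schema.yml: tests and documentation, kept separate
version: 2
models:
  - name: orders
    description: "One row per order."
    columns:
      - name: order_id
        description: "Primary key."
        tests:
          - unique
          - not_null
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;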




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;DBT's fusion of software engineering principles with the domain of big data revolutionizes how data pipelines are conceived, constructed, and maintained. By embracing the tenets of software engineering, DBT addresses the shortcomings of traditional data pipelines, ushering in a new era of efficiency, reliability, and agility in data engineering. As software engineering continues to evolve, its synergy with big data technologies like DBT paves the way for more robust, scalable, and manageable data ecosystems.&lt;/p&gt;

</description>
      <category>dbt</category>
      <category>softwareengineering</category>
      <category>dataengineering</category>
      <category>learning</category>
    </item>
  </channel>
</rss>
