<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Pradip Sodha</title>
    <description>The latest articles on DEV Community by Pradip Sodha (@sudo_pradip).</description>
    <link>https://dev.to/sudo_pradip</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1973306%2F60355057-a1c5-4704-982f-ad591289741a.png</url>
      <title>DEV Community: Pradip Sodha</title>
      <link>https://dev.to/sudo_pradip</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sudo_pradip"/>
    <language>en</language>
    <item>
      <title>Scala's Ignored Features</title>
      <dc:creator>Pradip Sodha</dc:creator>
      <pubDate>Mon, 07 Oct 2024 06:18:41 +0000</pubDate>
      <link>https://dev.to/sudo_pradip/scalas-ignored-features-1bmh</link>
      <guid>https://dev.to/sudo_pradip/scalas-ignored-features-1bmh</guid>
      <description>&lt;p&gt;Scala often flies under the radar, seen as an underrated language despite its elegant design. Many developers know how to use Scala, but not all fully grasp its core concepts—leaving some powerful features untouched. In this article, we'll explore some key features of Scala that developers frequently overlook, helping you unlock the language’s full potential.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. I’m Pure OOP&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Let’s start with a simple question: How is Scala pure OOP? Take a look at this expression:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;x&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You might think this is straightforward: there's a variable &lt;code&gt;x&lt;/code&gt;, an assignment operator &lt;code&gt;=&lt;/code&gt;, a plus operator &lt;code&gt;+&lt;/code&gt;, and two integer constants &lt;code&gt;1&lt;/code&gt; and &lt;code&gt;1&lt;/code&gt;. But how about this expression:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;x&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="o"&gt;+(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Surprised? This is equally valid in Scala because &lt;strong&gt;everything in Scala is an object&lt;/strong&gt;, even the numbers! What’s really happening here is that &lt;code&gt;1&lt;/code&gt; is an instance of the &lt;code&gt;Int&lt;/code&gt; class, and the &lt;code&gt;+&lt;/code&gt; method is called on it. The more you explore Scala, the more you'll see how object-oriented principles are embedded deeply within the language.&lt;/p&gt;
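&lt;p&gt;To make this concrete, here is a minimal, self-contained sketch showing that operator syntax and explicit method-call syntax are interchangeable:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;object PureOop extends App {
  // Every operator is a method on an object; both spellings are the same call.
  assert(1 + 1 == (1).+(1))

  // Even a range is built by calling a method on an Int object.
  assert((1 to 3) == (1).to(3))

  // And any method can be written infix, just like an operator.
  assert((2 max 3) == (2).max(3))
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;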




&lt;h3&gt;
  
  
  &lt;strong&gt;2. Free Will: Creative Variable Names&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Have you ever heard the rule that variable names must start with an alphabetic character or underscore and can’t contain special characters? That might be true in languages like C, but Scala breaks free from these traditional constraints.&lt;/p&gt;

&lt;p&gt;Consider this block of code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;ten&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="c1"&gt;//1&lt;/span&gt;
&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;_20&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt; &lt;span class="c1"&gt;//2&lt;/span&gt;
&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="k"&gt;#&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt; &lt;span class="c1"&gt;//3&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;`@40`&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt; &lt;span class="c1"&gt;//4&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;` `&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="c1"&gt;//5&lt;/span&gt;
&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="err"&gt;`&lt;/span&gt;&lt;span class="n"&gt;` = null //6
println { null } //7
println { `&lt;/span&gt; &lt;span class="err"&gt;`&lt;/span&gt; &lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="c1"&gt;//8&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
At first glance, you might think variables 5 and 7 are invalid, but in fact, the only invalid one here is line 6! Scala allows special characters and even spaces in variable names when enclosed in backticks. Line 8 will print `0`, and the empty space is a valid variable name.

---

### **3. Functional Meets OOP: Every Statement Returns Something**

In Scala, **everything is an expression**—even `if`-`else`, `match`, and loops. This makes functional programming more natural because these constructs always return a value, eliminating the need to declare mutable variables to store results.

For example:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;val result = if (condition) "yes" else "no"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
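&lt;p&gt;The same holds for &lt;code&gt;match&lt;/code&gt;: it evaluates directly to a value, so no mutable placeholder is needed. A minimal sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;val day = 3
// The whole match expression evaluates to "Wed" and is bound in one step.
val name = day match {
  case 1 =&amp;gt; "Mon"
  case 2 =&amp;gt; "Tue"
  case 3 =&amp;gt; "Wed"
  case _ =&amp;gt; "other"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;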

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Here, `if`-`else` returns a value just like a function would, reinforcing the functional paradigm. This feature paves the way for advanced techniques like **Higher-Order Functions** (HOF) and **Currying**.

---

### **4. Implicit Magic: Reducing Boilerplate**

Are you tired of passing the same context or parameter repeatedly across multiple functions? Scala’s **implicit** keyword can help reduce this redundancy.

Instead of explicitly passing variables around, you can declare an implicit value that will automatically be picked up by any function that expects it:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;implicit val context: Context = new Context()

def read(...)(implicit ctx: Context) = ...
def transform()(implicit ctx: Context) = ...
def write(...)(implicit ctx: Context) = ...

read(otherParam)
transform() // No need to pass `context` again
write(otherParam)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
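&lt;p&gt;The sketch above elides the bodies, so here is a minimal version that actually compiles; the &lt;code&gt;Context&lt;/code&gt; carrying a user name is hypothetical, purely for illustration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;object ImplicitDemo extends App {
  // Hypothetical context type, for illustration only.
  case class Context(user: String)

  def greet(msg: String)(implicit ctx: Context): String =
    s"$msg, ${ctx.user}"

  implicit val context: Context = Context("pradip")

  // The compiler supplies `context` automatically at the call site.
  println(greet("hello")) // hello, pradip
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;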

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
This makes your code more concise and expressive, while still ensuring that necessary values like contexts or configurations are passed correctly.

---

### **5. Lazy Evaluation: Only When You Need It**

You might know that `lazy` in Scala means delaying a variable’s evaluation until it’s first accessed. But what real-world scenarios benefit from `lazy`?

For example, imagine an **expensive computation** that shouldn’t be performed unless absolutely necessary:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;lazy val expensiveComputation = {
  println("Computing...")
  (1 to 1000000).sum
}

println("Before accessing lazy value")
// No computation happens yet

println(expensiveComputation)  // Now the computation occurs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
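&lt;p&gt;A related property worth knowing: a &lt;code&gt;lazy val&lt;/code&gt; is evaluated at most once and the result is cached, which a small counter can verify:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;object LazyOnce extends App {
  var evaluations = 0
  lazy val once = { evaluations += 1; 42 }

  assert(evaluations == 0) // nothing computed yet
  val a = once
  val b = once
  assert(evaluations == 1) // computed exactly once, then cached
  assert(a == 42 &amp;amp;&amp;amp; b == 42)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;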

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
This is particularly useful for **resource initialization** (e.g., database connections), **memoization** (caching the result of a computation), Conditional Initialization, Deferred Initialization in Multi-threaded Environments, Lazy Logging, Avoiding Circular Dependencies.

---

### **6. Understanding Nil, Null, None, Nothing, and Unit**

Scala has several ways to represent "nothing" or the absence of value, but each has its own distinct role:

- **`Nil`**: Represents an empty list (`List()`).
- **`Null`**: A subtype of all reference types (`AnyRef`), representing the absence of an object.
- **`None`**: A safer alternative to `null`, used within the `Option` type.
- **`Nothing`**: A subtype of every type, representing the absence of value or the bottom of the type hierarchy. Often used in functions that throw exceptions.
- **`Unit`**: Equivalent to `void` in Java, but carries a value (`()`).

Understanding these types is key to writing robust Scala code, especially when dealing with optional values and avoiding null pointer issues.

---

### **7. Tail Recursion: Efficient Recursion Without Stack Overflow**

Recursion is a natural solution for many problems but can lead to **stack overflow** issues. Scala solves this with **tail recursion**, which allows the compiler to optimize recursive calls into iterative loops.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;@annotation.tailrec
def factorial(n: Int, acc: Int = 1): Int = {
  if (n &amp;lt;= 1) acc
  else factorial(n - 1, n * acc)
}

println(factorial(5)) // Output: 120
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
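&lt;p&gt;To see why this matters, here is a sketch with a recursion depth that would overflow the JVM stack without the optimization (one million calls):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;object TailRecDepth extends App {
  // @annotation.tailrec makes the compiler verify the call is in tail
  // position and rewrite it into a loop, so depth is no longer a problem.
  @annotation.tailrec
  def sum(n: Long, acc: Long = 0L): Long =
    if (n == 0L) acc else sum(n - 1, acc + n)

  println(sum(1000000L)) // 500000500000
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;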

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
The Scala compiler transforms the tail-recursive function into a loop, preventing stack overflow and making recursion efficient.

---

### **8. Yield: Generating Collections from Loops**

In Scala, you can use `yield` in a `for` comprehension to generate a new collection from an existing one:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;val numbers = List(1, 2, 3, 4)
val doubled = for (n &amp;lt;- numbers) yield n * 2
println(doubled)  // Output: List(2, 4, 6, 8)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
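&lt;p&gt;A &lt;code&gt;for&lt;/code&gt; comprehension can also filter with a guard before yielding, keeping the transformation and the filter in one expression:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;val numbers = List(1, 2, 3, 4)
// Keep only even numbers, then double them, all in one expression.
val doubledEvens = for (n &amp;lt;- numbers if n % 2 == 0) yield n * 2
println(doubledEvens) // List(4, 8)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;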



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;


The power of `yield` lies in transforming collections in a concise and expressive way.

---

### **Conclusion**

Scala’s beauty lies in its elegance and the seamless integration of OOP and functional programming. Features like `lazy`, `implicit`, and `tail recursion` are just the tip of the iceberg. As you dive deeper, you’ll uncover more of Scala’s hidden gems and understand why this language stands out as a truly well-crafted tool for developers who aim to write clean, efficient, and expressive code.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
    </item>
    <item>
      <title>DBT vs. Data Engineers: A Love-Hate Saga!</title>
      <dc:creator>Pradip Sodha</dc:creator>
      <pubDate>Thu, 03 Oct 2024 13:33:01 +0000</pubDate>
      <link>https://dev.to/sudo_pradip/dbt-vs-data-engineers-a-love-hate-saga-218e</link>
      <guid>https://dev.to/sudo_pradip/dbt-vs-data-engineers-a-love-hate-saga-218e</guid>
      <description>&lt;p&gt;No doubt, &lt;a href="https://www.getdbt.com/blog/what-exactly-is-dbt" rel="noopener noreferrer"&gt;DBT&lt;/a&gt; (Data Build Tool) has introduced a whole new approach to writing transformations. Or should I say, the “correct” way to write them? But if you’ve been a data engineer for 5 to 10 years, DBT might feel...strange. Maybe even unnecessary or over-complicated. You’ve likely found comfort in the traditional ways of doing things—&lt;a href="https://docs.databricks.com/en/notebooks/index.html" rel="noopener noreferrer"&gt;Databricks notebooks&lt;/a&gt;, &lt;a href="https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-overview" rel="noopener noreferrer"&gt;Azure Data Flows&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/Stored_procedure" rel="noopener noreferrer"&gt;stored procedures&lt;/a&gt;, etc.—giving you more control over your work. Let’s explore why, from a data engineer’s perspective, DBT can feel like an alien language and what makes it a tough beast to tame.&lt;/p&gt;




&lt;h3&gt;
  
  
  1. Only SELECT Statements? Say What?!
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;DE:&lt;/strong&gt; If I open any DBT model file, all I see are SELECT statements. What if I want a good ol’ MERGE or INSERT statement? Oh wait, I have to check DBT docs for that? Really? I’ve mastered MERGE over the past decade, and now I need to look up docs like a newbie? And what if there’s a new DML feature? I have to wait for DBT to support it? Feels like I’m shackled! And don’t get me started on conditional updates or deletes—where do I even go to beg for those?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DBT:&lt;/strong&gt; I get it. You’re right, but let’s flip the script. You don’t need to worry about &lt;a href="https://en.wikipedia.org/wiki/Data_definition_language" rel="noopener noreferrer"&gt;DDL&lt;/a&gt; (create, alter) or &lt;a href="https://en.wikipedia.org/wiki/Data_manipulation_language" rel="noopener noreferrer"&gt;DML&lt;/a&gt; (insert, merge) anymore. Focus on your transformations; I’ll handle the nitty-gritty! By &lt;a href="https://en.wikipedia.org/wiki/Abstraction_(computer_science)" rel="noopener noreferrer"&gt;abstracting&lt;/a&gt; these tedious commands, you can scale easier and faster. And hey, if DBT doesn’t support a feature yet, get yourself a solid software engineer. Seriously, they’re the magic fix for everything. What are you waiting for?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Pro Tip: DBT’s secret sauce is a good software engineer. Get one who’s also a data engineer, and you’ll be flying in no time. Heaven awaits, trust me!&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  2. Jinja is... Well, Something
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;DE:&lt;/strong&gt; What is this &lt;a href="https://jinja.palletsprojects.com/en/3.1.x/" rel="noopener noreferrer"&gt;Jinja&lt;/a&gt; stuff? Honestly, it’s harder to read and write than the worst SQL I’ve ever seen. Sometimes, I dream of Jinja syntax haunting me like floating code fragments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DBT:&lt;/strong&gt; Oof, I hear you. We’ve all had second thoughts about using &lt;a href="https://github.com/dbt-labs/dbt-learn-jinja" rel="noopener noreferrer"&gt;Jinja&lt;/a&gt;, and managing it is a struggle. But hey, it’s a powerful templating engine! With its if-else and for-loops, you can write dynamic SQL that will take your transformations to the next level. Hang in there, my friend, the power will reveal itself!&lt;/p&gt;




&lt;h3&gt;
  
  
  3. Documentation Drama!
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;DE:&lt;/strong&gt; Sure, &lt;a href="https://docs.getdbt.com/docs/build/documentation" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; is great, but now my team relies on docs instead of SQL. The doc could say the Earth is flat, and they’d believe it, even if the query says it’s round! The problem? Docs are rarely up-to-date. And when you’ve got deadlines, there’s no time to update them. Worst of all, if there’s a mistake in the doc, who’s going to catch it? SQL errors get flagged, but docs? Good luck.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DBT:&lt;/strong&gt; Whoa, slow down. Docs are crucial! They bring stakeholders, downstream, and upstream users together. No more wading through murky SQL—everyone can just read the doc in a pretty UI! Sure, if the doc is wrong, that’s a problem, but you can automate doc checks using tools like dbt checkpoint. And let’s be honest, if your doc says the Earth is flat, that’s on you.&lt;/p&gt;




&lt;h3&gt;
  
  
  4. Dev Setup &amp;amp; Deployment: The Struggle is Real
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;DE:&lt;/strong&gt; Setting up a DBT project is an achievement in itself. Managing versioning, syntax, spacing, and linting across devs is a nightmare. If there’s an error, I’m diving through a swamp of logs. Plus, I have to set up a virtual environment, learn &lt;a href="https://en.wikipedia.org/wiki/Docker_(software)" rel="noopener noreferrer"&gt;Docker&lt;/a&gt;, and deploy to &lt;a href="https://aws.amazon.com/ecs/" rel="noopener noreferrer"&gt;AWS ECS&lt;/a&gt;. Seriously?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DBT:&lt;/strong&gt; I feel your pain, but have you heard of &lt;a href="https://www.getdbt.com/product/dbt-cloud" rel="noopener noreferrer"&gt;DBT Cloud&lt;/a&gt;? It solves all of these issues! Just give me your money, and I’ll make everything easier. Oh, and I’ve teamed up with Databricks—so now you can run &lt;a href="https://docs.databricks.com/en/jobs/how-to/use-dbt-in-workflows.html" rel="noopener noreferrer"&gt;DBT tasks&lt;/a&gt; in Databricks Workflows! It’s in the premium workspace though, so you’ll need to cough up a bit more.&lt;/p&gt;




&lt;h3&gt;
  
  
  5. DBT for Small Projects? Worth It?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;DE:&lt;/strong&gt; Why are even small projects pushing to use DBT? It’s expensive! Not just the learning curve, but hiring a software engineer who’s constantly complaining about everything costs a fortune. Seriously, the guy is impossible to work with!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DBT:&lt;/strong&gt; Listen, even if your project is small, one day you’ll be a &lt;a href="https://en.wikipedia.org/wiki/Unicorn_(finance)" rel="noopener noreferrer"&gt;unicorn&lt;/a&gt;. You need tools that can handle the pressure when you scale. But, if you want to keep costs down, you could try a &lt;a href="https://en.wikipedia.org/wiki/Minimalism" rel="noopener noreferrer"&gt;minimalist&lt;/a&gt; approach. Skip the fancy Jinja and stick to organized SQL files. Slowly adopt more DBT features as you grow. It’s like evolving from the Stone Age to the modern era—DBT is the future!&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Conclusion:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
DBT is a powerful tool, but it can be a bit of a pain for data engineers used to more traditional methods. It demands a lot—sometimes too much. But with the right mindset (and maybe a software engineer who knows what they’re doing), DBT can help you scale and succeed. Just... be ready for a few headaches along the way!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How to Add Reverse Proxy to Your Azure Web App</title>
      <dc:creator>Pradip Sodha</dc:creator>
      <pubDate>Thu, 05 Sep 2024 15:42:15 +0000</pubDate>
      <link>https://dev.to/sudo_pradip/how-to-add-reverse-proxy-to-your-azure-web-app-4cpn</link>
      <guid>https://dev.to/sudo_pradip/how-to-add-reverse-proxy-to-your-azure-web-app-4cpn</guid>
      <description>&lt;p&gt;While searching for a proper article on &lt;strong&gt;how to add a &lt;a href="https://en.wikipedia.org/wiki/Reverse_proxy" rel="noopener noreferrer"&gt;reverse proxy&lt;/a&gt; in &lt;a href="https://azure.microsoft.com/en-in/products/app-service" rel="noopener noreferrer"&gt;Azure Web App&lt;/a&gt;&lt;/strong&gt;, I couldn't find comprehensive documentation. So, here we are! In this article, we will explore how to add a reverse proxy to your Azure Web App, whether you're using &lt;a href="https://en.wikipedia.org/wiki/Node.js" rel="noopener noreferrer"&gt;Node.js&lt;/a&gt;, Java, &lt;a href="https://en.wikipedia.org/wiki/PHP" rel="noopener noreferrer"&gt;PHP&lt;/a&gt;, or &lt;a href="https://en.wikipedia.org/wiki/.NET_Framework" rel="noopener noreferrer"&gt;.NET&lt;/a&gt; as your runtime stack. This approach works seamlessly since Azure Web Apps are hosted on &lt;a href="https://en.wikipedia.org/wiki/Internet_Information_Services" rel="noopener noreferrer"&gt;IIS server&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is a Reverse Proxy?
&lt;/h3&gt;

&lt;p&gt;A reverse proxy is a method to forward incoming requests to another server. This setup is particularly useful in scenarios like having a frontend exposed to a public endpoint and a backend deployed on a private network. With a reverse proxy, you can route traffic from the public frontend to the private backend.&lt;/p&gt;

&lt;p&gt;One common use case is using an Azure Web App as the frontend and an &lt;a href="https://learn.microsoft.com/en-us/azure/azure-functions/functions-overview?pivots=programming-language-csharp" rel="noopener noreferrer"&gt;Azure Functions&lt;/a&gt; app serving as the API backend. Both may exist on the same &lt;a href="https://learn.microsoft.com/en-us/azure/virtual-network/virtual-networks-overview" rel="noopener noreferrer"&gt;private network&lt;/a&gt;, with the Web App connected to a gateway for public accessibility. Instead of deploying a new &lt;a href="https://learn.microsoft.com/en-us/azure/application-gateway/overview" rel="noopener noreferrer"&gt;Application Gateway&lt;/a&gt; (which can be costly), we can use the reverse proxy functionality within the Azure Web App to handle traffic.&lt;/p&gt;

&lt;h3&gt;
  
  
  Benefits of a Reverse Proxy
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Security and Anonymity&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SSL Termination&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Centralized Authentication&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Content Modification&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Let's Get Started
&lt;/h3&gt;

&lt;p&gt;Since Azure Web Apps use the &lt;strong&gt;IIS server&lt;/strong&gt;, we need to install a reverse proxy extension. Here's how:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1.&lt;/strong&gt; Open your Azure Web App in the &lt;strong&gt;Azure portal&lt;/strong&gt;, search for "Extensions," and click the "Add" button.&lt;/p&gt;

&lt;p&gt;We will be using &lt;a href="https://github.com/EelcoKoster/ReverseProxySiteExtension/tree/master" rel="noopener noreferrer"&gt;EelcoKoster's extension&lt;/a&gt;; if you are concerned about the T&amp;amp;C, read &lt;a href="https://www.nuget.org/policies/Terms" rel="noopener noreferrer"&gt;NuGet's T&amp;amp;C&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faob2swqn6cl7aotltbi6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faob2swqn6cl7aotltbi6.png" alt="Azure Portal &amp;gt; Web App &amp;gt; Extension" width="800" height="238"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2.&lt;/strong&gt; Search for "reverseproxy" and select "ReverseProxy(1.0.4) by Eelco Koster, Jerome Haltom." Click "Add."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F11lj0acz0wjw1mlmuw59.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F11lj0acz0wjw1mlmuw59.png" alt="Add Extension" width="719" height="884"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3.&lt;/strong&gt; Click on "Browse."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp93f2yo6v8krf48hy5n5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp93f2yo6v8krf48hy5n5.png" alt="Goto the extension setting" width="800" height="160"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4.&lt;/strong&gt; Add the reverse proxy rules and click "Save to web.config." Restart your web app afterward. For demonstration purposes, we'll use a public sample REST API (&lt;code&gt;https://api.restful-api.dev&lt;/code&gt;) as the redirect URL.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;rewrite&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;rules&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;rule&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"APIProxy"&lt;/span&gt; &lt;span class="na"&gt;stopProcessing=&lt;/span&gt;&lt;span class="s"&gt;"true"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="nt"&gt;&amp;lt;match&lt;/span&gt; &lt;span class="na"&gt;url=&lt;/span&gt;&lt;span class="s"&gt;"^api/?(.*)"&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
      &lt;span class="nt"&gt;&amp;lt;action&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"Rewrite"&lt;/span&gt; &lt;span class="na"&gt;url=&lt;/span&gt;&lt;span class="s"&gt;"https://api.restful-api.dev/{R:1}"&lt;/span&gt; &lt;span class="na"&gt;logRewrittenUrl=&lt;/span&gt;&lt;span class="s"&gt;"true"&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/rule&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/rules&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;outboundRules&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;rule&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"AddCORSHeaders"&lt;/span&gt; &lt;span class="na"&gt;preCondition=&lt;/span&gt;&lt;span class="s"&gt;"IsApiResponse"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="nt"&gt;&amp;lt;match&lt;/span&gt; &lt;span class="na"&gt;serverVariable=&lt;/span&gt;&lt;span class="s"&gt;"RESPONSE_Access-Control-Allow-Origin"&lt;/span&gt; &lt;span class="na"&gt;pattern=&lt;/span&gt;&lt;span class="s"&gt;".*"&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
      &lt;span class="nt"&gt;&amp;lt;action&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"Rewrite"&lt;/span&gt; &lt;span class="na"&gt;value=&lt;/span&gt;&lt;span class="s"&gt;"*"&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/rule&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;preConditions&amp;gt;&lt;/span&gt;
      &lt;span class="nt"&gt;&amp;lt;preCondition&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"IsApiResponse"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;add&lt;/span&gt; &lt;span class="na"&gt;input=&lt;/span&gt;&lt;span class="s"&gt;"{RESPONSE_Content-Type}"&lt;/span&gt; &lt;span class="na"&gt;pattern=&lt;/span&gt;&lt;span class="s"&gt;"^application/json"&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
      &lt;span class="nt"&gt;&amp;lt;/preCondition&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/preConditions&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/outboundRules&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/rewrite&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw71iye0hu32iq8tkal9r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw71iye0hu32iq8tkal9r.png" alt="web.config" width="800" height="423"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5.&lt;/strong&gt; Let's test: as you can see below, our web app's result and the public REST API's result are the same!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftkqb87jxwribe0x3c9bq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftkqb87jxwribe0x3c9bq.png" alt="our web app result" width="800" height="99"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzgj1ie9h08og3y3445hq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzgj1ie9h08og3y3445hq.png" alt="rest api result" width="800" height="124"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;By following the steps above, you've successfully set up a reverse proxy in your Azure Web App. This method provides a cost-effective and efficient way to route traffic while enhancing security and managing backend services privately.&lt;/p&gt;

</description>
      <category>azure</category>
      <category>webdev</category>
      <category>proxy</category>
      <category>networking</category>
    </item>
    <item>
      <title>Top 5 Things You Should Know About Spark</title>
      <dc:creator>Pradip Sodha</dc:creator>
      <pubDate>Thu, 29 Aug 2024 12:55:52 +0000</pubDate>
      <link>https://dev.to/sudo_pradip/top-5-things-you-should-know-about-spark-4kg3</link>
      <guid>https://dev.to/sudo_pradip/top-5-things-you-should-know-about-spark-4kg3</guid>
      <description>&lt;h2&gt;
  
  
  1. &lt;a href="https://spark.apache.org/docs/latest/sql-programming-guide.html#datasets-and-dataframes" rel="noopener noreferrer"&gt;Dataframe is a Dataset&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7bnfbwcexwkglilvhe5k.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7bnfbwcexwkglilvhe5k.jpg" alt="discovering that dataframe is a dataset by meme" width="500" height="635"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Try searching for a DataFrame API in &lt;a href="https://spark.apache.org/docs/latest/api/scala/org/apache/spark/index.html" rel="noopener noreferrer"&gt;Scala Spark documentation&lt;/a&gt; where all functions like withColumn, select, etc., are listed. Surprisingly, &lt;a href="https://spark.apache.org/docs/latest/api/scala/index.html?search=dataframe" rel="noopener noreferrer"&gt;you won't find&lt;/a&gt; it because a DataFrame is essentially a Dataset[&lt;a href="https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Row.html" rel="noopener noreferrer"&gt;Row&lt;/a&gt;]. So, you'll only find an API doc for &lt;a href="https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html" rel="noopener noreferrer"&gt;Dataset&lt;/a&gt;, as DataFrame is just an alias.&lt;/p&gt;

&lt;p&gt;Scala is a &lt;a href="https://en.wikipedia.org/wiki/Category:Statically_typed_programming_languages" rel="noopener noreferrer"&gt;statically typed language&lt;/a&gt;, yet in Spark the DataFrame is considered an &lt;a href="https://en.wikipedia.org/wiki/Talk%3ATyped_and_untyped_languages" rel="noopener noreferrer"&gt;untyped&lt;/a&gt; API, whereas Dataset is the typed one. Calling a DataFrame untyped is slightly misleading, though: its columns do have types, but Spark checks them only at runtime, whereas for a Dataset, type checking happens at compile time.&lt;/p&gt;
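&lt;p&gt;In fact, in Spark's source the alias is a single line in the &lt;code&gt;org.apache.spark.sql&lt;/code&gt; package object: &lt;code&gt;type DataFrame = Dataset[Row]&lt;/code&gt;. A plain-Scala sketch of how such an alias behaves (the &lt;code&gt;Dataset&lt;/code&gt; and &lt;code&gt;Row&lt;/code&gt; stand-ins below are illustrative, not Spark's real classes):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;object DataFrameAlias extends App {
  final case class Row(values: Seq[Any])    // stand-in for Spark's Row
  final class Dataset[T](val data: Seq[T])  // stand-in for Spark's Dataset
  type DataFrame = Dataset[Row]             // mirrors Spark's alias

  // A DataFrame *is* a Dataset[Row]; the two types are interchangeable.
  val df: DataFrame = new Dataset(Seq(Row(Seq(1, "a"))))
  val ds: Dataset[Row] = df
  assert(ds.data.length == 1)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;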




&lt;h2&gt;
  
  
  2. &lt;a href="https://spark.apache.org/docs/latest/sql-ref-syntax-qry-explain.html" rel="noopener noreferrer"&gt;Physical Plan&lt;/a&gt; is Too Abstract? Go Deeper
&lt;/h2&gt;

&lt;p&gt;Meet &lt;code&gt;df.queryExecution.debug.codegen&lt;/code&gt;. This is a valuable feature in Spark that provides the generated code, which is a close representation of what Spark will actually execute.&lt;/p&gt;

&lt;p&gt;Sometimes the Spark documentation is not enough, and black-box testing doesn't provide conclusive proof. The generated code shows you what Spark will actually run, which makes it a really handy tool. Yes, the code might seem cryptic, but thanks to AI, we can decode it.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;

&lt;span class="nv"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;queryExecution&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;debug&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;codegen&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fudbp1vmzmsefd16pilfw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fudbp1vmzmsefd16pilfw.png" alt="output of df.queryExecution.debug.codegen" width="800" height="643"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  3. &lt;a href="https://stackoverflow.com/questions/3554362/purpose-of-scalas-symbol" rel="noopener noreferrer"&gt;Symbol&lt;/a&gt; is a Simple Way to Refer to a Column
&lt;/h2&gt;

&lt;p&gt;There are five ways to refer to a column name (note that the single-quote &lt;code&gt;Symbol&lt;/code&gt; literal is deprecated in Scala 2.13 and dropped from Scala 3, where you would write &lt;code&gt;Symbol("columnName")&lt;/code&gt; instead):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;

&lt;span class="c1"&gt;//df is dataframe&lt;/span&gt;
&lt;span class="c1"&gt;//if column not exists then will throw error&lt;/span&gt;
&lt;span class="nf"&gt;df&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"columnName"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; 

&lt;span class="c1"&gt;//generic column&lt;/span&gt;
&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"columName"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;expr&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"columnName"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;//become easy to write expression &lt;/span&gt;
&lt;span class="c1"&gt;// $"colA" + $"colB"&lt;/span&gt;
&lt;span class="n"&gt;$&lt;/span&gt;&lt;span class="s"&gt;"columnName"&lt;/span&gt;

&lt;span class="c1"&gt;//Simplest way, which uses scala symbol feature &lt;/span&gt;
&lt;span class="nv"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;select&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;'columnName&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;h2&gt;
  
  
  4. Column's Nullable Property is Not a Constraint
&lt;/h2&gt;

&lt;p&gt;A DataFrame column has three properties: column name, data type, and a nullable flag. It's a common misconception that Spark enforces the nullable flag as a constraint, the way other databases do. In reality, it's just a flag the optimizer uses for better execution planning.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa9x0ydorvk3iijczii9g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa9x0ydorvk3iijczii9g.png" alt="shows nullable property" width="380" height="127"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For more details, check out &lt;a href="https://medium.com/p/1d1b7b042adb" rel="noopener noreferrer"&gt;https://medium.com/p/1d1b7b042adb&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Adding More Executors Doesn't Always Mean Faster Jobs
&lt;/h2&gt;

&lt;p&gt;While Spark's &lt;a href="https://spark.apache.org/docs/latest/cluster-overview.html" rel="noopener noreferrer"&gt;architecture&lt;/a&gt; supports horizontal scaling, simply increasing the number of executors to speed up slow jobs doesn't always yield the desired results. In many cases, this approach can backfire, leading to slower job performance and higher costs. Sometimes, jobs may run only slightly faster, but the increased resource usage can significantly raise costs.&lt;/p&gt;

&lt;p&gt;Finding the right balance of executor and core count is crucial for optimizing job performance while controlling costs. Factors such as shuffle partitions, number of cores, number of executors, source file or table size, number of files, scheduler mode, driver's capacity, and network latency all need to be considered. Everything should be in sync to achieve optimal performance. Be cautious about adding more executors, especially in scenarios involving skewed data, as this can exacerbate issues rather than solve them.&lt;/p&gt;

</description>
      <category>spark</category>
      <category>dataengineering</category>
      <category>development</category>
      <category>coding</category>
    </item>
    <item>
      <title>Avoid These Top 10 Mistakes When Using Apache Spark</title>
      <dc:creator>Pradip Sodha</dc:creator>
      <pubDate>Wed, 28 Aug 2024 09:05:06 +0000</pubDate>
      <link>https://dev.to/sudo_pradip/avoid-these-top-10-mistakes-when-using-apache-spark-366p</link>
      <guid>https://dev.to/sudo_pradip/avoid-these-top-10-mistakes-when-using-apache-spark-366p</guid>
      <description>&lt;p&gt;We all know how easy it is to overlook small parts of our code, especially when we have powerful tools like &lt;a href="https://spark.apache.org/" rel="noopener noreferrer"&gt;Apache Spark&lt;/a&gt; to handle the heavy lifting. Spark's core engine is great at optimizing our messy, complex code into a sleek, efficient &lt;a href="https://spark.apache.org/docs/latest/sql-ref-syntax-qry-explain.html" rel="noopener noreferrer"&gt;physical plan&lt;/a&gt;. But here's the catch: Spark isn't flawless. It's on a journey to perfection, sure, but it still has its limits. And Spark is upfront about those limitations, listing them out in the documentation (sometimes as little notes).&lt;/p&gt;

&lt;p&gt;But let’s be honest—how often do we skip the &lt;a href="https://spark.apache.org/docs/latest/api/scala/org/apache/spark/index.html?search=Data" rel="noopener noreferrer"&gt;docs&lt;/a&gt; and head straight to &lt;a href="https://stackoverflow.com/" rel="noopener noreferrer"&gt;Stack Overflow&lt;/a&gt; or &lt;a href="https://chat.openai.com/" rel="noopener noreferrer"&gt;ChatGPT&lt;/a&gt; for quick answers? I've been there too. The thing is, while these shortcuts can be useful, they don't always tell the whole story. So, if you're ready to dive in, let's talk about some common mistakes and how to avoid them. Stay with me; this is going to be a ride!&lt;/p&gt;




&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Mistake #1: Adding Columns the Wrong Way&lt;/li&gt;
&lt;li&gt;Mistake #2: Order of Narrow and Wide Transformation&lt;/li&gt;
&lt;li&gt;
Mistake #3: Overlooking Data Serialization Format &lt;/li&gt;
&lt;li&gt;Mistake #4: Not Using Parallel Listing on Input Paths&lt;/li&gt;
&lt;li&gt;Mistake #5: Ignoring Data Locality&lt;/li&gt;
&lt;li&gt;Mistake #6: Relying on Default Number of Shuffle Partitions&lt;/li&gt;
&lt;li&gt;Mistake #7: Overlooking Broadcast Join Thresholds&lt;/li&gt;
&lt;li&gt;Mistake #8: Relying on default storage level for Cache&lt;/li&gt;
&lt;li&gt;Mistake #9: Misconfiguring Spark Memory Settings&lt;/li&gt;
&lt;li&gt;Mistake #10: Relying Only on Cache and Persist&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Mistake #1: Adding Columns the Wrong Way
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;client:&lt;/em&gt; "Hey, can you add 5 columns? Make it quick, okay?"&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Developer:&lt;/em&gt; "Sure, I'll just use withColumn() in a loop 5 times!"&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Client:&lt;/em&gt; (Happy) "Great! Now, can you add 10 more columns? Make it quick, and keep the code short!"&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Developer:&lt;/em&gt; "No problem! I'll loop 15 times now."&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Spark:&lt;/em&gt; "Sorry, I can't optimize that."&lt;/p&gt;

&lt;p&gt;But wait—according to Spark's documentation... &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Don't use withColumn in loop&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxj4fj972phiugg2lk92o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxj4fj972phiugg2lk92o.png" alt="Apache Spark scala doc of withColumn with it's limitation highlighted" width="800" height="206"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Solution: &lt;a href="https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html#selectExpr(exprs:String*):org.apache.spark.sql.DataFrame" rel="noopener noreferrer"&gt;SelectExpr&lt;/a&gt; or &lt;a href="https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html#select(cols:org.apache.spark.sql.Column*):org.apache.spark.sql.DataFrame" rel="noopener noreferrer"&gt;Select&lt;/a&gt;&lt;br&gt;
here is a complete solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;addOrReplaceColumns&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;newColumns&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;List&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;Column&lt;/span&gt;&lt;span class="o"&gt;],&lt;/span&gt; &lt;span class="n"&gt;sourceColumns&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;List&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="o"&gt;])&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;List&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;Column&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;val&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columnsToBeReplace&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;newColumns&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;newColumns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;partition&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;column&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;sourceColumns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;contains&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;column&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;toString&lt;/span&gt;&lt;span class="o"&gt;()))&lt;/span&gt;
  &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;restOfColumns&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;sourceColumns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;diff&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;columnsToBeReplace&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;map&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;column&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;column&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;toString&lt;/span&gt;&lt;span class="o"&gt;())).&lt;/span&gt;&lt;span class="py"&gt;map&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

  &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columnsToBeReplace&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="n"&gt;newColumns&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="n"&gt;restOfColumns&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="py"&gt;toList&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Mistake #2: Order of &lt;a href="https://stackoverflow.com/questions/77156805/wide-and-narrow-transformations-in-spark" rel="noopener noreferrer"&gt;Narrow and Wide Transformation&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Normally we focus on business logic when developing a data solution, and it's common to ignore the order of narrow and wide transformations. However, Spark recommends grouping all the narrow transformations first, followed by the wide ones. For example, if you have&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="n"&gt;narrow&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wide&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;narrow&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;narrow&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wide&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;narrow&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;then try to rearrange it like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="n"&gt;narrow&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;narrow&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;narrow&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wide&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wide&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Spark can then optimize your code more effectively: for example, all the narrow transformations are pipelined into a single stage, and fewer shuffles are required.&lt;/p&gt;
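As a rough collections-based analogy (illustrative data, not Spark code), running the narrow filter before the wide groupBy means fewer records ever reach the expensive grouping step:

```scala
val events = Seq(("a", 1), ("b", 5), ("a", 7), ("c", 2), ("b", 9))

val totals = events
  .filter { case (_, v) => v > 2 }   // narrow: per-record, no shuffle in Spark
  .groupBy { case (k, _) => k }      // wide: would trigger a shuffle in Spark
  .map { case (k, vs) => k -> vs.map(_._2).sum }
```

Only three of the five records survive the filter, so the grouping (the shuffle, in Spark terms) handles less data.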




&lt;h2&gt;
  
  
  Mistake #3: Overlooking &lt;a href="https://spark.apache.org/docs/latest/tuning.html#data-serialization" rel="noopener noreferrer"&gt;Data Serialization Format&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;By default, Spark uses Java serialization, which is not the most efficient option. Switching to &lt;strong&gt;Kryo serialization&lt;/strong&gt; can lead to better performance, as it is faster and uses less memory. Use the following configuration to enable Kryo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="nv"&gt;spark&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;conf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;set&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"spark.serializer"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"org.apache.spark.serializer.KryoSerializer"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, Kryo does not support all Serializable types, and it requires you to register the classes you'll use in the program in advance for best performance.&lt;/p&gt;
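Class registration happens on the SparkConf before the session is created; a configuration sketch (MyEvent and MyKey are placeholder class names, not real types):

```scala
import org.apache.spark.SparkConf

// Registering classes lets Kryo write a small numeric ID instead of the
// full class name with every serialized object. MyEvent/MyKey are placeholders.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyEvent], classOf[MyKey]))
```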




&lt;h2&gt;
  
  
  Mistake #4: Not Using &lt;a href="https://spark.apache.org/docs/latest/tuning.html#parallel-listing-on-input-paths" rel="noopener noreferrer"&gt;Parallel Listing on Input Paths&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;When reading files from storage systems like Amazon S3, Azure Data Lake Storage (ADLS), or even local storage, Spark needs to list and find all matching files in the input directory before starting the next task. This &lt;strong&gt;listing process can become a bottleneck&lt;/strong&gt;, especially when dealing with large directories or a vast number of files. By default, Spark uses only a single thread to list files, which can significantly slow down the start of your job.&lt;/p&gt;

&lt;p&gt;To mitigate this, you can increase the number of threads used for listing files by setting the &lt;code&gt;spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads&lt;/code&gt; property. This allows Spark to parallelize the file listing process, speeding up the initialization phase of your job.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="nv"&gt;spark&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;conf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;set&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Mistake #5: Ignoring &lt;a href="https://spark.apache.org/docs/latest/tuning.html#data-locality" rel="noopener noreferrer"&gt;Data Locality&lt;/a&gt;
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Data locality significantly impacts the performance of Spark jobs.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When data and the code processing it are close together, computation is faster, as there is less need to move large chunks of data. Spark scheduling prioritizes data locality to minimize data movement, following levels of locality from best to worst: PROCESS_LOCAL (data and code in the same JVM), NODE_LOCAL (data on the same node), RACK_LOCAL (data on the same rack but different node), and ANY (data elsewhere on the network).&lt;/p&gt;

&lt;p&gt;Spark tries to schedule tasks at the highest locality level possible, but this isn't always feasible. If no idle executors have unprocessed data at the desired locality level, Spark can either wait for a busy executor to free up or fall back to a lower locality level by moving data to an idle executor. The time Spark waits before falling back can be adjusted using the spark.locality.wait settings. Adjusting these settings can help improve performance in scenarios with long-running tasks or when data locality is poor.&lt;/p&gt;

&lt;p&gt;With moderate data skew, a cluster with ample resources, or cached data (&lt;code&gt;.cache()&lt;/code&gt;), &lt;strong&gt;increasing the wait time is usually more beneficial&lt;/strong&gt; than falling back to a lower locality level.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spark.conf.set("spark.locality.wait", "10s")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
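The wait can also be tuned per locality level; each of these falls back to the global spark.locality.wait value by default (the durations below are illustrative):

```scala
// Per-level fallback timeouts (illustrative values)
spark.conf.set("spark.locality.wait.process", "5s")  // before giving up PROCESS_LOCAL
spark.conf.set("spark.locality.wait.node", "10s")    // before giving up NODE_LOCAL
spark.conf.set("spark.locality.wait.rack", "3s")     // before giving up RACK_LOCAL
```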






&lt;h2&gt;
  
  
  Mistake #6: Relying on Default Number of &lt;a href="https://community.databricks.com/t5/data-engineering/tuning-shuffle-partitions/td-p/22378" rel="noopener noreferrer"&gt;Shuffle Partitions&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;By default, Spark uses 200 partitions for shuffle operations (e.g., join, groupBy). This number might be too high or too low, depending on your dataset and cluster size.&lt;/p&gt;

&lt;p&gt;AQE (enabled by default since Databricks Runtime 7.3 LTS, and since Spark 3.2 in open-source Spark) adjusts the shuffle partition number automatically at each stage of the query, based on the size of the map-side shuffle output. &lt;/p&gt;

&lt;p&gt;Still, it's advisable to set the shuffle partition count explicitly before performing a wide transformation when you need precise control; if you're unsure, Spark recommends setting it to the total number of cores in your cluster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="nv"&gt;spark&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;conf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;set&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"spark.sql.shuffle.partitions"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"num_core_in_cluster"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And don't forget to tune &lt;code&gt;spark.default.parallelism&lt;/code&gt; accordingly as well, since it controls the default partition count for RDD operations.&lt;/p&gt;
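If you want a number rather than a guess, one common back-of-envelope heuristic (the figures below are illustrative) divides the expected shuffle data volume by a target partition size of roughly 100-200 MiB:

```scala
// Illustrative sizing: ~20 GiB of shuffle data, targeting ~200 MiB partitions
val shuffleDataMiB = 20 * 1024
val targetPartitionMiB = 200

val partitions = math.max(shuffleDataMiB / targetPartitionMiB, 1)
println(partitions)
```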




&lt;h2&gt;
  
  
  Mistake #7: Overlooking &lt;a href="https://sparktpoint.com/broadcast-join-in-spark/" rel="noopener noreferrer"&gt;Broadcast Join Thresholds&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Scenario:&lt;/p&gt;

&lt;p&gt;Developer: "I thought small lookup tables would be broadcasted automatically and my each of executors has 32GB of memory! Why are my joins so slow?"&lt;/p&gt;

&lt;p&gt;Spark: "Sorry, your lookup table is just above the default threshold."&lt;/p&gt;

&lt;p&gt;Broadcast joins can drastically speed up join operations when one of the tables is small enough to fit into memory on each worker node. However, if you don't adjust the broadcast join threshold, Spark might not broadcast tables that could be effectively broadcasted, leading to unnecessary shuffling.&lt;/p&gt;

&lt;p&gt;Solution:&lt;/p&gt;

&lt;p&gt;Adjust the broadcast join threshold using &lt;code&gt;spark.sql.autoBroadcastJoinThreshold&lt;/code&gt;. If your lookup table is slightly larger than the default 10MB limit, increase the threshold.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="nv"&gt;spark&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;conf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;set&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"spark.sql.autoBroadcastJoinThreshold"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;// 50MB&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;When setting the broadcast join threshold, don't base it only on executor memory. The driver loads the small table into memory first before distributing it to executors. Make sure the threshold is suitable for both driver and executor memory capacities to prevent memory issues and optimize performance.&lt;/p&gt;
&lt;/blockquote&gt;
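If raising the global threshold feels too blunt, you can instead force a broadcast for a single join with the broadcast() hint, which bypasses the threshold check entirely; a sketch (factDf, lookupDf, and the join key are placeholders):

```scala
import org.apache.spark.sql.functions.broadcast

// Hint Spark to broadcast lookupDf regardless of autoBroadcastJoinThreshold.
// factDf, lookupDf and "customer_id" are placeholder names.
val joined = factDf.join(broadcast(lookupDf), Seq("customer_id"))
```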




&lt;h2&gt;
  
  
  Mistake #8: Relying on default storage level for &lt;a href="https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html#cache():Dataset.this.type" rel="noopener noreferrer"&gt;Cache&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;It's crucial to select the appropriate storage level for caching and persisting data based on the type of executors in your cluster and your objectives. By understanding the trade-offs between speed, memory usage, and fault tolerance, you can tailor your Spark configuration to meet the specific needs of your application.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Executor Type&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Primary Objective&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Recommended Storage Level&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Description&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Alternative for Fault Tolerance&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Notes&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memory-Optimized&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fast access, low memory usage&lt;/td&gt;
&lt;td&gt;&lt;code&gt;MEMORY_ONLY_SER&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Stores RDD as serialized objects in memory. Balances speed and memory efficiency.&lt;/td&gt;
&lt;td&gt;&lt;code&gt;MEMORY_ONLY_SER_2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Use &lt;code&gt;MEMORY_ONLY&lt;/code&gt; if serialization overhead is not a concern.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;MEMORY_ONLY&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Stores RDD as deserialized objects in memory. Fastest access, highest memory usage.&lt;/td&gt;
&lt;td&gt;&lt;code&gt;MEMORY_ONLY_2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Use for small datasets that fit comfortably in memory.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CPU-Optimized&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Balanced memory and disk&lt;/td&gt;
&lt;td&gt;&lt;code&gt;MEMORY_AND_DISK_SER&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Serialized storage in memory, spills to disk if needed. Good for large datasets.&lt;/td&gt;
&lt;td&gt;&lt;code&gt;MEMORY_AND_DISK_SER_2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Preferred when memory is limited; avoids out-of-memory errors.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;MEMORY_AND_DISK&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Deserialized storage in memory, spills to disk. Faster access than &lt;code&gt;MEMORY_AND_DISK_SER&lt;/code&gt;.&lt;/td&gt;
&lt;td&gt;&lt;code&gt;MEMORY_AND_DISK_2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Use when memory can accommodate deserialized objects, with fallback to disk.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;General Purpose&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Flexibility, moderate size datasets&lt;/td&gt;
&lt;td&gt;&lt;code&gt;MEMORY_AND_DISK&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Deserialized in-memory, spills to disk. Good balance for general use cases.&lt;/td&gt;
&lt;td&gt;&lt;code&gt;MEMORY_AND_DISK_2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Good for mixed workloads; balances speed and fault tolerance.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;MEMORY_ONLY_SER&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Serialized in-memory storage. Optimized for memory efficiency and speed.&lt;/td&gt;
&lt;td&gt;&lt;code&gt;MEMORY_ONLY_SER_2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Suitable for datasets that fit well in memory after serialization.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Disk-Optimized&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Low memory, high fault tolerance&lt;/td&gt;
&lt;td&gt;&lt;code&gt;DISK_ONLY&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Stores RDD partitions only on disk. Minimizes memory usage but slowest access.&lt;/td&gt;
&lt;td&gt;&lt;code&gt;DISK_ONLY_2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Suitable for very large datasets where memory is a constraint.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;MEMORY_AND_DISK_SER&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Serialized storage in memory with spillover to disk. More efficient than deserialized.&lt;/td&gt;
&lt;td&gt;&lt;code&gt;MEMORY_AND_DISK_SER_2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Balances disk usage and memory efficiency.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The &lt;code&gt;_2&lt;/code&gt; options (e.g., &lt;code&gt;MEMORY_ONLY_2&lt;/code&gt;, &lt;code&gt;MEMORY_AND_DISK_2&lt;/code&gt;) are useful for scenarios where fault tolerance is crucial. They replicate data across two nodes, ensuring data is not lost if a node fails. This is particularly valuable in environments where reliability is prioritized over resource efficiency, such as production systems handling critical data or real-time processing pipelines.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;code&gt;_SER&lt;/code&gt; options (e.g., &lt;code&gt;MEMORY_AND_DISK_SER&lt;/code&gt;) store the RDD as serialized Java objects (one byte array per partition) in memory. This is more memory-efficient than &lt;code&gt;MEMORY_ONLY&lt;/code&gt;, but slower due to serialization/deserialization overhead.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
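Choosing a level from the table is then a one-liner: use persist(level) instead of cache(), which always uses the default storage level; a sketch (df is a placeholder DataFrame):

```scala
import org.apache.spark.storage.StorageLevel

// cache() is persist() with the default level; persist(level) lets you pick
// any level from the table above. df is a placeholder DataFrame.
val cached = df.persist(StorageLevel.MEMORY_AND_DISK_SER)
```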




&lt;h2&gt;
  
  
  Mistake #9: Misconfiguring Spark Memory Settings
&lt;/h2&gt;

&lt;p&gt;Scenario:&lt;/p&gt;

&lt;p&gt;Developer: "My Spark job keeps failing with out-of-memory errors. I gave it all the memory available!"&lt;/p&gt;

&lt;p&gt;Spark: "Memory isn't just for you; I need some for myself, too."&lt;/p&gt;

&lt;p&gt;Many users allocate almost all available memory to the executor heap space (spark.executor.memory) without considering Spark's overhead memory, causing frequent out-of-memory errors. Additionally, insufficient memory can lead to excessive garbage collection (GC) pauses, slowing down jobs.&lt;/p&gt;

&lt;p&gt;Solution:&lt;/p&gt;

&lt;p&gt;Properly configure memory settings by tuning &lt;code&gt;spark.executor.memory&lt;/code&gt; and &lt;code&gt;spark.executor.memoryOverhead&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nt"&gt;--conf&lt;/span&gt; spark.executor.memory&lt;span class="o"&gt;=&lt;/span&gt;4g &lt;span class="nt"&gt;--spark&lt;/span&gt;.executor.memoryOverhead&lt;span class="o"&gt;=&lt;/span&gt;512m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ensure you leave enough memory overhead to accommodate Spark's internal needs (shuffle, RDD storage, etc.). Typically, 10-15% of the total memory should be allocated as overhead.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;spark.memory.fraction&lt;/code&gt; expresses the size of the unified memory region M (used for execution and storage) as a fraction of (JVM heap space - 300MiB), with a default of 0.6. The rest of the space (40%) is reserved for user data structures, internal metadata in Spark, and safeguarding against OOM errors in the case of sparse and unusually large records.&lt;/p&gt;
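To make the numbers concrete, some purely illustrative arithmetic for a 4 GiB executor heap, combining the 10-15% overhead guideline with the default memory fraction:

```scala
val executorHeapMiB = 4 * 1024                        // spark.executor.memory = 4g

// 10-15% overhead guideline
val overheadLowMiB  = (executorHeapMiB * 0.10).toInt  // 409 MiB
val overheadHighMiB = (executorHeapMiB * 0.15).toInt  // 614 MiB

// Unified region M = (heap - 300 MiB) * spark.memory.fraction (default 0.6)
val unifiedRegionMiB = ((executorHeapMiB - 300) * 0.6).toInt  // 2277 MiB

println(s"overhead: $overheadLowMiB-$overheadHighMiB MiB, M: $unifiedRegionMiB MiB")
```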




&lt;h2&gt;
  
  
  Mistake #10: Relying Only on Cache and Persist
&lt;/h2&gt;

&lt;p&gt;Many Spark developers are familiar with the cache() and persist() methods for improving performance, but they often overlook the value of &lt;a href="https://stackoverflow.com/questions/35127720/what-is-the-difference-between-spark-checkpoint-and-persist-to-a-disk" rel="noopener noreferrer"&gt;checkpoint()&lt;/a&gt;. While cache() and persist() keep data in memory or on disk to speed up processing, a lost partition must still be recomputed from its full lineage after a failure. checkpoint(), on the other hand, saves the RDD to a reliable storage system and truncates the lineage, allowing for cheap recovery and simpler execution plans.&lt;/p&gt;

&lt;p&gt;Using checkpoint() not only ensures that your job can recover from failures but also helps Spark optimize the execution of other jobs that share the same lineage. This can lead to improved performance and resource utilization.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="nv"&gt;spark&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;sparkContext&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;setCheckpointDir&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"path/to/checkpoint/dir"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;checkpoint&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>dataengineering</category>
      <category>development</category>
      <category>coding</category>
      <category>dbt</category>
    </item>
    <item>
      <title>DBT and Software Engineering</title>
      <dc:creator>Pradip Sodha</dc:creator>
      <pubDate>Sat, 24 Aug 2024 14:27:57 +0000</pubDate>
      <link>https://dev.to/sudo_pradip/dbt-and-software-engineering-4006</link>
      <guid>https://dev.to/sudo_pradip/dbt-and-software-engineering-4006</guid>
      <description>&lt;p&gt;In recent years, the competition for data solution tools has heated up. While AWS, Azure, &lt;a href="https://cloud.google.com/free/" rel="noopener noreferrer"&gt;GCP&lt;/a&gt; and many more companies investing heavily into Data Engineering such as &lt;a href="https://aws.amazon.com/blogs/big-data/making-etl-easier-with-aws-glue-studio/" rel="noopener noreferrer"&gt;AWS Glue Studio&lt;/a&gt;, &lt;a href="https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-overview" rel="noopener noreferrer"&gt;Azure DataFlow&lt;/a&gt;, &lt;a href="https://cloud.google.com/data-fusion" rel="noopener noreferrer"&gt;GCP cloud data fusion&lt;/a&gt;. While most companies focusing on low-code and drang-n-drop path. However, &lt;a href="https://www.getdbt.com/" rel="noopener noreferrer"&gt;DBT&lt;/a&gt; (Data Build Tool) takes a different approach by embracing &lt;a href="https://en.wikipedia.org/wiki/List_of_software_development_philosophies" rel="noopener noreferrer"&gt;software engineering principles&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;Instead of opting for the easy path, DBT proposes the right way of doing things, grounded in sound engineering practices. We will eventually explore everything that sets DBT apart from these giants, but today let's dive into one part of it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Introduction&lt;/li&gt;
&lt;li&gt;Audience&lt;/li&gt;
&lt;li&gt;Software Engineering&lt;/li&gt;
&lt;li&gt;Limitations of Today's Data Pipelines&lt;/li&gt;
&lt;li&gt;DBT's Adherence to Software Engineering Practices&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In this post, we'll explore the &lt;a href="https://en.wikipedia.org/wiki/Software_engineering" rel="noopener noreferrer"&gt;Software Engineering&lt;/a&gt; methods used in DBT (Data Build Tool). A basic understanding of DBT's features from its &lt;a href="https://docs.getdbt.com/reference/references-overview" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; might suffice for contributing to a project, but it won't tell you why those features exist. &lt;/p&gt;

&lt;p&gt;So, why read this? We'll explain the software engineering methods DBT uses and why they matter; in short, we'll uncover the reasons behind DBT's features.&lt;/p&gt;

&lt;p&gt;That makes a difference, because knowing the reason for a feature and its potential is far more important than merely mastering it: if you violate the reason at a feature's core, the feature is effectively killed, reduced to just another workaround or patch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Audience
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://en.wikipedia.org/wiki/Data_engineering" rel="noopener noreferrer"&gt;Data Engineers&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Data_analysis" rel="noopener noreferrer"&gt;Data Analytics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Software_engineering" rel="noopener noreferrer"&gt;Software Engineers&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Software Engineering
&lt;/h2&gt;

&lt;p&gt;The realm of software engineering holds a vast history, having witnessed the contributions of numerous &lt;a href="https://en.wikipedia.org/wiki/List_of_computer_scientists" rel="noopener noreferrer"&gt;scientists&lt;/a&gt; and professionals. &lt;/p&gt;

&lt;p&gt;Their collective efforts have propelled software methodologies to new heights, constantly striving to surpass previous achievements while upholding an audacious spirit.&lt;/p&gt;

&lt;p&gt;Software engineering stands as the bedrock of modern technological advancements, weaving a rich tapestry of methodologies and&lt;br&gt;
practices that shape the way we design, develop, and maintain software systems. &lt;/p&gt;

&lt;p&gt;Its roots stretch back to the &lt;a href="https://en.wikipedia.org/wiki/History_of_software_engineering" rel="noopener noreferrer"&gt;mid-20th century&lt;/a&gt;, evolving from simple programming to a comprehensive discipline encompassing various principles, tools, and frameworks. &lt;/p&gt;

&lt;p&gt;Over the decades, software engineering has propelled innovations, enhancing reliability, scalability, and maintainability of systems across diverse industries.  &lt;/p&gt;




&lt;h2&gt;
  
  
  Limitations of Today's Data Pipelines
&lt;/h2&gt;

&lt;p&gt;In the realm of &lt;a href="https://en.wikipedia.org/wiki/Big_data" rel="noopener noreferrer"&gt;big data&lt;/a&gt;, the sophistication of data pipelines has surged, enabling the handling of massive datasets. &lt;/p&gt;

&lt;p&gt;However, conventional data pipelines often exhibit limitations. They are prone to complexities, becoming intricate webs of disparate scripts, SQL queries, and manual interventions. &lt;/p&gt;

&lt;p&gt;These pipelines lack standardization, making them difficult to maintain and comprehend. As the data grows, managing these pipelines becomes a daunting challenge, hindering scalability and agility.&lt;/p&gt;




&lt;h2&gt;
  
  
  DBT: A Solution Rooted in Software Engineering
&lt;/h2&gt;

&lt;p&gt;Enter DBT (Data Build Tool), a paradigm shift in the world of data engineering that embodies the core principles of software engineering.&lt;/p&gt;

&lt;p&gt;DBT redefines the way data pipelines are built and managed, aligning itself with established software engineering practices to tackle the challenges prevalent in traditional data pipelines.&lt;br&gt;
DBT stands as a revolutionary force in the domain of data transformation. &lt;/p&gt;

&lt;p&gt;It reimagines the handling of data by infusing principles of agility and discipline akin to those found in the software engineering realm. &lt;/p&gt;

&lt;p&gt;By treating data transformation as a form of software development,&lt;br&gt;
DBT enables the scalability and seamless management of significant data components, facilitating collaboration among large teams with&lt;br&gt;
unparalleled ease.&lt;/p&gt;




&lt;h2&gt;
  
  
  DBT's Adherence to Software Engineering Practices
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="https://en.wikipedia.org/wiki/Separation_of_concerns" rel="noopener noreferrer"&gt;Separation of Concerns&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DBT distinguishes between data transformation logic and &lt;a href="https://docs.getdbt.com/docs/build/models" rel="noopener noreferrer"&gt;data modeling&lt;/a&gt;, allowing for modularization and easier management. For
instance, SQL queries in DBT focus on transforming raw data, while models define the final structured datasets.&lt;/li&gt;
&lt;li&gt;DBT divides the usual data transformation into four parts: 1. Business logic (DQL), 2. &lt;a href="https://docs.getdbt.com/docs/build/materializations" rel="noopener noreferrer"&gt;Materialization&lt;/a&gt; (DDL &amp;amp; DML), 3. &lt;a href="https://docs.getdbt.com/docs/build/data-tests" rel="noopener noreferrer"&gt;Testing&lt;/a&gt;,
and 4. &lt;a href="https://docs.getdbt.com/docs/build/documentation" rel="noopener noreferrer"&gt;Documentation&lt;/a&gt;. These four areas can now be scaled and maintained independently, and each is easier to read. &lt;a href="https://www.getdbt.com/what-is-analytics-engineering" rel="noopener noreferrer"&gt;Analytics engineers&lt;/a&gt;
can focus on one concern at a time: for example, concentrating solely on business logic (just select statements) while writing
models, since storage, testing, and documentation each have their own dedicated section.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Benefits&lt;/em&gt;

&lt;ul&gt;
&lt;li&gt;Enhanced Maintainability&lt;/li&gt;
&lt;li&gt;Improved Reusability&lt;/li&gt;
&lt;li&gt;Better Collaboration&lt;/li&gt;
&lt;li&gt;Scalability and Flexibility&lt;/li&gt;
&lt;li&gt;Security and &lt;a href="https://en.wikipedia.org/wiki/Risk_management" rel="noopener noreferrer"&gt;Risk Mitigation&lt;/a&gt; (individual models can have their own access control and owner)&lt;/li&gt;
&lt;li&gt;Future-proofing&lt;/li&gt;
&lt;li&gt;Reduction of Complexity&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="https://en.wikipedia.org/wiki/Reusability" rel="noopener noreferrer"&gt;Reusability&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Just as software modules can be reused, DBT promotes reusable code blocks (&lt;a href="https://docs.getdbt.com/docs/build/jinja-macros" rel="noopener noreferrer"&gt;macros&lt;/a&gt;) and models. This allows data
engineers to build upon existing components, fostering efficiency and consistency. DBT also offers a rich set of packages that can be
imported and used directly in projects, allowing standard, well-tested expressions to be shared globally.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Benefits&lt;/em&gt;

&lt;ul&gt;
&lt;li&gt;Efficiency&lt;/li&gt;
&lt;li&gt;Consistency and Standardization&lt;/li&gt;
&lt;li&gt;Ease of Maintenance&lt;/li&gt;
&lt;li&gt;Cost-Effectiveness&lt;/li&gt;
&lt;li&gt;Facilitates Collaboration&lt;/li&gt;
&lt;li&gt;Future-Proofing&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="https://en.wikipedia.org/wiki/Unit_testing" rel="noopener noreferrer"&gt;Unit Testing&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Similar to software unit &lt;a href="https://docs.getdbt.com/docs/build/data-tests" rel="noopener noreferrer"&gt;tests&lt;/a&gt;, DBT enables data engineers to create tests that validate the accuracy of transformations,
ensuring data quality throughout the pipeline. You can test each individual transformation (a model) before any subsequent step runs.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Benefits&lt;/em&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Error Identification in Isolation&lt;/em&gt;: It allows the testing of individual components (units) of code in isolation, pinpointing errors or bugs specific to that unit. This facilitates easier debugging and troubleshooting.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Enhanced Code Quality&lt;/em&gt;: Unit tests enforce better coding practices by promoting modular and understandable code. Writing tests inherently requires breaking down functionalities into smaller, manageable units, leading to more maintainable and robust code.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Regression Prevention&lt;/em&gt;: Unit tests serve as a safety net. When modifications or updates are made, running unit tests ensures that existing functionalities are not negatively impacted, preventing unintended consequences through regression testing.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Facilitates Refactoring&lt;/em&gt;: Developers can confidently refactor or restructure code knowing that unit tests will quickly identify any potential issues. This flexibility encourages code improvements without the fear of breaking existing functionalities.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Improved Design and Documentation&lt;/em&gt;: Writing unit tests often necessitates clearer interfaces and more detailed documentation. This leads to better-designed APIs and clearer understanding of how code should be used.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Accelerates Development&lt;/em&gt;: Despite the initial time investment in writing tests, unit testing can speed up development by reducing time spent on debugging and rework. It aids in catching bugs early in the development cycle, saving time in the long run.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Supports Agile Development&lt;/em&gt;: Unit tests align well with agile methodologies by promoting frequent iterations and continuous integration. They facilitate a faster feedback loop, allowing developers to quickly verify changes.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Encourages Modular Development&lt;/em&gt;: Unit tests require breaking down functionalities into smaller units, promoting a modular approach to development. This modularity fosters reusability and simplifies integration.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Boosts Confidence in Code Changes&lt;/em&gt;: Unit tests provide confidence when making changes or additions to the codebase. Passing
tests indicate that the modified code behaves as expected, reducing the risk of introducing new bugs.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="https://en.wikipedia.org/wiki/Abstraction" rel="noopener noreferrer"&gt;Abstraction&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The abstraction principle involves concealing intricate underlying details while presenting a simplified and accessible
interface or representation. In DBT, for instance, model files encapsulate solely business logic, abstracting materialization and test
cases. This seemingly simple feature proves immensely helpful. It's akin to skimming a newspaper headline—if more details are needed,
delve deeper; if not, move swiftly to the next topic.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Benefits&lt;/em&gt;

&lt;ul&gt;
&lt;li&gt;Simplification of Complexity&lt;/li&gt;
&lt;li&gt;Enhanced Readability and Understandability&lt;/li&gt;
&lt;li&gt;Focus on Higher-Level Concepts&lt;/li&gt;
&lt;li&gt;Reduced Cognitive Load&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="https://en.wikipedia.org/wiki/Coupling" rel="noopener noreferrer"&gt;Coupling&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The coupling principle refers to the degree of interconnectedness or dependency between different components or modules within a
system. Lower coupling indicates a lesser degree of dependency, while higher coupling suggests a stronger interconnection between
components.&lt;/li&gt;
&lt;li&gt;In DBT, managing coupling means reducing dependencies between different parts of the data transformation process. Lower
coupling is desirable for several reasons: a change to one model is less likely to break the models that depend on it, and models can be developed, tested, and re-run independently.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="https://en.wikipedia.org/wiki/Software_documentation" rel="noopener noreferrer"&gt;Documentation&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DBT facilitates comprehensive &lt;a href="https://docs.getdbt.com/docs/build/documentation" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; for data models and transformations, akin to software documentation.
This documentation aids in understanding the data flow, enhancing collaboration and knowledge sharing.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="https://en.wikipedia.org/wiki/Deployment_environment" rel="noopener noreferrer"&gt;Environment Separation&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In the software world, it's common to use different environments like Development (Dev), User Acceptance Testing (UAT), and Production (Prod) to manage changes effectively and ensure stability. This practice, known as Environment Separation, helps isolate changes, allowing teams to test and validate new features or fixes in a controlled setting before exposing them to real users. &lt;/li&gt;
&lt;li&gt;It mitigates risks, ensures consistency, and facilitates compliance and security. Similarly, dbt (data build tool) seamlessly supports environment separation, allowing teams to define and manage different environments such as Dev, UAT, and Prod. This practice promotes better DataOps by ensuring that data transformations are thoroughly tested and validated before they impact production, improving reliability and reducing the risk of errors.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="https://en.wikipedia.org/wiki/Backward_compatibility" rel="noopener noreferrer"&gt;Backward Compatibility&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clients often provide new requirements, or we may discover more optimal ways to perform tasks. When this happens, we tend to modify our existing models or queries. However, in a large project, a single query might be relied upon by many clients, making it challenging to notify all teams of changes. &lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Additionally, new changes can sometimes introduce faults, which can disrupt data pipelines and violate one of the core principles of big data: availability.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;To address this, the software industry already employs strategies to manage such issues effectively. dbt (data build tool) supports different &lt;a href="https://docs.getdbt.com/docs/collaborate/govern/model-versions" rel="noopener noreferrer"&gt;model versions&lt;/a&gt;, allowing teams to maintain multiple versions, such as a pre-release version for testing and a stable version for production use. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;This versioning approach makes dbt highly adaptive, enabling teams to migrate to new versions at their own pace. Furthermore, dbt allows setting a deprecation period, specifying how long an old API version will be supported before it is phased out, aligning with the concept of a Deprecation Policy.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Benefits&lt;/em&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User Experience Stability&lt;/li&gt;
&lt;li&gt;Reduced Migration Costs&lt;/li&gt;
&lt;li&gt;Minimized Downtime&lt;/li&gt;
&lt;li&gt;Flexibility in Adopting Updates&lt;/li&gt;
&lt;li&gt;Encourages Innovation&lt;/li&gt;
&lt;li&gt;Risk Mitigation&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
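
&lt;p&gt;To make the separation of concerns concrete, here is a minimal sketch of a dbt model (the file, model, and column names are hypothetical): the &lt;code&gt;.sql&lt;/code&gt; file holds only business logic, while materialization, tests, and documentation live in configuration alongside it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- models/orders.sql: business logic only (just a SELECT)
{{ config(materialized='table') }}

select order_id, customer_id, amount
from {{ ref('stg_orders') }}

# models/schema.yml: tests and documentation, kept separate
version: 2
models:
  - name: orders
    description: "One row per order."
    columns:
      - name: order_id
        description: "Primary key."
        tests:
          - unique
          - not_null
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;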




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;DBT's fusion of software engineering principles with the domain of big data revolutionizes how data pipelines are conceived, constructed, and maintained. By embracing the tenets of software engineering, DBT addresses the shortcomings of traditional data pipelines, ushering in a new era of efficiency, reliability, and agility in data engineering. As software engineering continues to evolve, its synergy with big data technologies like DBT paves the way for more robust, scalable, and manageable data ecosystems.&lt;/p&gt;

</description>
      <category>dbt</category>
      <category>softwareengineering</category>
      <category>dataengineering</category>
      <category>learning</category>
    </item>
  </channel>
</rss>
