Why I chose this topic: I’ve spent the last six years cleaning up "cloud-native" messes where companies burn through their annual data budget by Q3. Most articles suggest "turn off your clusters when not in use." That’s not engineering; that’s basic hygiene. I’m writing this because I’m tired of seeing engineers treat Databricks like a bottomless credit card while their CFO stares at the AWS/Azure bill with genuine, existential dread.
I remember getting a Slack ping at 8:00 AM on a Monday. It was from our Head of Infrastructure. The message was simple: "Our Databricks spend is up 40% month-over-month. Fix it, or we’re cutting the dev environment."
I spent the next 48 hours staring at the Databricks Usage Report. I found the culprit: a "standard" job that was running on a massive 8-node r6id.4xlarge cluster for a job that barely touched 50GB of data. It was like using a freight train to deliver a single pizza. We were paying for high-memory nodes that were sitting at 5% utilization, just waiting for the job to finish its shuffle phase.
We all want to believe we’re building efficient data pipelines, but in reality, we’re often just throwing CPU cycles at poorly optimized Spark plans because "it’s fast enough." When the bill hits, we blame the provider. The truth is, Databricks is an incredible platform, but it’s an expensive one if you treat it like a set-and-forget black box.
The real problem: The "General Purpose" Trap
The real problem isn't that Databricks is expensive; it’s that it’s too easy to configure. Most teams start by selecting a "General Purpose" cluster type and checking the "Autoscaling" box. That is the default path to bankruptcy.
We stop thinking about how Spark actually handles data. We ignore the shuffle. We ignore the cost difference between On-Demand and Spot instances. We treat the cluster as a static resource rather than a dynamic, ephemeral tool. If your job finishes in 20 minutes, why are you paying for a cluster that takes 5 minutes to spin up and 10 minutes to terminate? You’re paying for 35 minutes of compute to do 20 minutes of work.
Photo by Andre Taissin on Unsplash
Step 1: Kill On-Demand instances for non-critical jobs
Stop paying retail price for your ETL. If your job can handle a failure and a retry, you have no business running it on On-Demand instances. Spot instances can save you up to 80% on compute costs.
In your Job configuration, switch to "Spot" instances. Yes, they get reclaimed. That’s why you need to ensure your job is idempotent and that you’ve configured spark.databricks.clusterUsageTags correctly. If a node gets pulled, let the job retry. The cost savings will dwarf the occasional 15-minute delay from a restart.
{
"spark_conf": {
"spark.databricks.clusterUsageTags": "spot-optimized-job"
},
"node_type_id": "i3.xlarge",
"spot_bid_price_percent": 100
}
Step 2: Stop over-provisioning memory (The "I3" switch)
I see teams defaulting to r6id (memory-optimized) nodes for every single job. Unless you are doing massive, memory-heavy joins on every single task, you are wasting money.
Move your standard ETL pipelines to i3 or c5 family instances. i3 instances come with local NVMe storage, which is a godsend for Spark shuffle performance. By moving to i3 instances, you get faster local disk I/O, which speeds up your shuffle-heavy jobs, allowing you to use fewer nodes overall.
Step 3: Photon is not optional, it's mandatory
If you are running Databricks Runtime 10.4 or higher, you should be using Photon. Don't argue with me about "compatibility." If your code doesn't run on Photon, it’s because you’re using legacy UDFs that are inherently slow.
Rewrite those UDFs into Spark SQL or native DataFrame operations. Photon isn't just a "feature"—it’s a rewritten query engine in C++. It’s faster, which means your job finishes in less time. In Databricks pricing, time is money.
# Instead of a slow Python UDF, use a built-in SQL function
from pyspark.sql import functions as F
# Slow
df.withColumn("processed", F.udf(lambda x: x.upper()))
# Fast (Photon friendly)
df.withColumn("processed", F.upper(F.col("column_name")))
Step 4: Use Job Clusters, not All-Purpose Clusters
This is the most common mistake I see. Developers use All-Purpose clusters for their scheduled jobs because it’s "easier." An All-Purpose cluster is meant for interactive notebooks. It’s expensive, it stays alive longer, and it doesn’t scale down as aggressively as a Job Cluster.
Every production pipeline must use a Job Cluster. Job Clusters are cheaper per DBU, and they are tied strictly to the lifecycle of the task. When the task finishes, the cluster dies. End of story.
Photo by Ilya Pavlov on Unsplash
Lessons learned from production
-
Autoscaling is a lie if your minimum is too high. If you set your min-workers to 4, you’re paying for 4 nodes even if you only need 1. Set your
min_workersto 0 or 1 for non-critical tasks. -
The "Small File" problem is a hidden tax. If you have millions of tiny files, your metadata overhead will destroy your performance. Compact your data using
OPTIMIZEandZORDERfrequently. It costs compute to run, but it saves 10x in query time later. -
Cluster Tags are your best friend. If you don't know who is spending the money, you can't optimize it. Force every job to have a
departmentandprojecttag.
Production considerations
When you move to Spot instances and aggressive autoscaling, you have to handle failures. Make sure your pipeline uses Delta Lake. Delta’s ACID transactions mean that if a Spot instance is reclaimed in the middle of a write, your table isn't corrupted. It just rolls back.
Also, watch your max_workers. If you set it too high, you might hit your cloud provider's vCPU quota. I’ve seen pipelines fail on Monday mornings because the company hit their AWS account limit. Know your limits, and set your cluster bounds to stay under them.
Conclusion
Cost optimization isn't about cutting corners; it’s about aligning your resource consumption with the actual requirements of the workload. Start by killing your All-Purpose clusters, switching to Spot, and forcing your team to use native Spark functions over UDFs.
Try it: Go to your Databricks usage dashboard right now. Find the top 3 most expensive jobs. Convert them to Job Clusters using Spot instances with i3 nodes. See what happens to your bill next week. You might be surprised.
SEO keywords: Databricks cost optimization, Spark performance tuning, FinOps for data, AWS Databricks best practices
Tags: #databricks #dataengineering #finops #spark
Cover photo by Domaintechnik Ledl.net on Unsplash.
Top comments (0)