<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Asaph</title>
    <description>The latest articles on DEV Community by Asaph (@asaphtinoco).</description>
    <link>https://dev.to/asaphtinoco</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2980323%2F4d34a805-2209-4233-a9d7-d2057eb8d1fe.jpeg</url>
      <title>DEV Community: Asaph</title>
      <link>https://dev.to/asaphtinoco</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/asaphtinoco"/>
    <language>en</language>
    <item>
      <title>How Databricks Cluster Policies Can Enforce Good Behavior (and Save You Money)</title>
      <dc:creator>Asaph</dc:creator>
      <pubDate>Fri, 18 Apr 2025 12:32:35 +0000</pubDate>
      <link>https://dev.to/asaphtinoco/how-databricks-cluster-policies-can-enforce-good-behavior-and-save-you-money-4occ</link>
      <guid>https://dev.to/asaphtinoco/how-databricks-cluster-policies-can-enforce-good-behavior-and-save-you-money-4occ</guid>
      <description>&lt;p&gt;As Platform Engineers, we’re often tasked with enabling autonomy for data teams without letting costs spiral out of control. In the world of Databricks, &lt;strong&gt;cluster policies&lt;/strong&gt; are one of the most powerful (and underrated) tools to achieve that balance.&lt;/p&gt;

&lt;p&gt;In this post, I want to focus on a key capability of cluster policies: &lt;strong&gt;enforcing the separation between interactive (all-purpose) clusters and automated job clusters.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  💭 Why Should You Care?
&lt;/h2&gt;

&lt;p&gt;All-purpose clusters are intended for interactive, exploratory work like notebooks. But it’s tempting—and dangerous—to re-use them for scheduled jobs.&lt;/p&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❌ &lt;strong&gt;They stay alive longer&lt;/strong&gt;, often leading to idle resource costs.&lt;/li&gt;
&lt;li&gt;❌ &lt;strong&gt;They can lead to unexpected behaviors&lt;/strong&gt; when shared with multiple users and tasks.&lt;/li&gt;
&lt;li&gt;❌ &lt;strong&gt;They bypass the monitoring and tagging you may have set up for job clusters.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As a result, your costs can quietly creep up, and your architecture becomes harder to reason about.&lt;/p&gt;




&lt;h2&gt;
  
  
  🔧 Enforcing It in Cluster Policies
&lt;/h2&gt;

&lt;p&gt;Let’s say we want to ensure that &lt;strong&gt;only notebooks can run on a given cluster policy&lt;/strong&gt;—and prevent users from misusing it for scheduled jobs.&lt;/p&gt;

&lt;p&gt;Here’s the relevant part of the Databricks cluster policy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"workload_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fixed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"all-purpose"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"hidden"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"workload_type.clients.jobs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fixed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"workload_type.clients.notebooks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fixed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  🚫 What Does This Do?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;"workload_type.clients.jobs": false&lt;/code&gt; — Disables running scheduled jobs on this cluster.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;"workload_type.clients.notebooks": true&lt;/code&gt; — Ensures this is only for interactive work.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;"workload_type": "all-purpose"&lt;/code&gt; — Locks the cluster into the interactive compute mode.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Combined, this enforces strict usage boundaries and aligns well with the principle of &lt;strong&gt;using job clusters for automation and all-purpose clusters for ad hoc work.&lt;/strong&gt;&lt;/p&gt;
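
&lt;p&gt;To see the policy in action, a user (or an automation workflow) just references it when creating a cluster, and the fixed, hidden fields above are applied automatically. Below is a minimal sketch of a Clusters API request body; the cluster name, Spark version, and policy ID are illustrative placeholders, not values from a real workspace:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "cluster_name": "analytics-notebooks",
  "spark_version": "15.4.x-scala2.12",
  "node_type_id": "r6id.xlarge",
  "num_workers": 2,
  "autotermination_minutes": 30,
  "policy_id": "YOUR_POLICY_ID",
  "apply_policy_default_values": true
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Any attempt to override the fixed fields is rejected when the cluster is created or edited, and a scheduled job pointed at this cluster will fail because the jobs workload is disabled.&lt;/p&gt;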




&lt;h3&gt;
  
  
  💡 Bonus: Better Tagging and Cost Tracking
&lt;/h3&gt;

&lt;p&gt;You can take it further by &lt;strong&gt;enforcing tags&lt;/strong&gt;, such as &lt;code&gt;team&lt;/code&gt;, &lt;code&gt;owner&lt;/code&gt;, or &lt;code&gt;cost_center&lt;/code&gt;, using &lt;code&gt;custom_tags&lt;/code&gt;. This makes it easier to attribute cost to the right business unit or team:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"custom_tags"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fixed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"team"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"data-engineering"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"cost_center"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"data-platform"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  🧩 When to Use This Policy
&lt;/h3&gt;

&lt;p&gt;Use this policy when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You want to &lt;strong&gt;give notebook users a safe and cost-controlled environment&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;You want to &lt;strong&gt;prevent production jobs from piggybacking on shared clusters&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;You want &lt;strong&gt;better visibility and accountability&lt;/strong&gt; on interactive workloads.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  🚀 Wrap-Up
&lt;/h3&gt;

&lt;p&gt;Cluster policies aren’t just for compliance—they’re enablers for scale. By separating workloads, you encourage good behavior, save on compute costs, and make your platform more maintainable.&lt;/p&gt;

&lt;p&gt;Have you tried locking down workloads like this? Drop a comment and let’s share war stories. 👇&lt;/p&gt;

</description>
      <category>devops</category>
      <category>aws</category>
      <category>cloud</category>
      <category>databricks</category>
    </item>
    <item>
      <title>Cost Comparison: Databricks Cluster Jobs vs. SQL Warehouse for Batch Processing</title>
      <dc:creator>Asaph</dc:creator>
      <pubDate>Thu, 10 Apr 2025 06:58:33 +0000</pubDate>
      <link>https://dev.to/asaphtinoco/cost-comparison-databricks-cluster-jobs-vs-sql-warehouse-for-batch-processing-gjg</link>
      <guid>https://dev.to/asaphtinoco/cost-comparison-databricks-cluster-jobs-vs-sql-warehouse-for-batch-processing-gjg</guid>
      <description>&lt;h2&gt;
  
  
  1. Introduction
&lt;/h2&gt;

&lt;p&gt;Batch processing is a fundamental component of data engineering, allowing businesses to process large volumes of data efficiently. Databricks offers multiple compute options for batch workloads, but choosing the right one can significantly impact cost, performance, and overall efficiency.  &lt;/p&gt;

&lt;p&gt;Two common choices for running batch jobs in Databricks are &lt;strong&gt;Cluster Jobs&lt;/strong&gt; and &lt;strong&gt;SQL Warehouse&lt;/strong&gt;. While both options provide scalability and reliability, they come with different pricing models, resource allocations, and execution behaviors. Selecting the most cost-effective solution requires understanding their strengths and trade-offs.  &lt;/p&gt;

&lt;p&gt;This article compares &lt;strong&gt;Databricks Cluster Jobs&lt;/strong&gt; and &lt;strong&gt;SQL Warehouse&lt;/strong&gt;, evaluating their cost-effectiveness for batch processing. By the end, you’ll have a clearer understanding of which compute option best suits your workload and budget.  &lt;/p&gt;




&lt;h2&gt;
  
  
  2. Understanding Databricks Compute Options
&lt;/h2&gt;

&lt;p&gt;Databricks provides multiple compute options tailored for different workloads. Choosing the right option depends on factors such as workload type, scalability needs, and cost considerations.  &lt;/p&gt;

&lt;h3&gt;
  
  
  Databricks Cluster Jobs
&lt;/h3&gt;

&lt;p&gt;Databricks &lt;strong&gt;Cluster Jobs&lt;/strong&gt; run on standard compute clusters and are ideal for general-purpose batch processing, including ETL pipelines and machine learning workloads. Key features include:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Supports &lt;strong&gt;notebooks, scripts, and workflows&lt;/strong&gt;, making it flexible for various data processing tasks.
&lt;/li&gt;
&lt;li&gt;Can &lt;strong&gt;scale dynamically&lt;/strong&gt; with autoscaling, allowing efficient resource utilization.
&lt;/li&gt;
&lt;li&gt;Supports &lt;strong&gt;spot instances and cluster policies&lt;/strong&gt; to optimize cost and governance.
&lt;/li&gt;
&lt;/ul&gt;
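
&lt;p&gt;As a rough illustration of the last two points, a job cluster definition can combine autoscaling with spot instances and still be pinned to a cluster policy. The snippet below is only a sketch with illustrative values:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "new_cluster": {
    "spark_version": "15.4.x-scala2.12",
    "node_type_id": "r6id.xlarge",
    "autoscale": { "min_workers": 1, "max_workers": 4 },
    "aws_attributes": {
      "availability": "SPOT_WITH_FALLBACK",
      "first_on_demand": 1
    },
    "policy_id": "YOUR_POLICY_ID"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;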

&lt;h3&gt;
  
  
  SQL Warehouse
&lt;/h3&gt;

&lt;p&gt;Databricks &lt;strong&gt;SQL Warehouse&lt;/strong&gt; is designed specifically for SQL-based workloads and provides a managed compute layer optimized for querying large datasets. Key characteristics include:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Runs in three modes:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Serverless Mode&lt;/strong&gt; – Fully managed, Databricks handles infrastructure and auto-scaling.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pro Mode&lt;/strong&gt; – Uses dedicated clusters, giving more control over configurations.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Classic Mode&lt;/strong&gt; – The legacy option with manual cluster management, but fewer optimizations compared to Pro Mode.
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Ideal for &lt;strong&gt;analytics and reporting&lt;/strong&gt; workloads requiring high concurrency and fast query execution.
&lt;/li&gt;

&lt;li&gt;Cost is based on &lt;strong&gt;DBU (Databricks Unit) pricing&lt;/strong&gt; and the selected &lt;strong&gt;warehouse tier&lt;/strong&gt;, making it more predictable for SQL-heavy processing.
&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Understanding these compute options is crucial for selecting the most efficient and cost-effective solution for batch processing in Databricks.&lt;/p&gt;




&lt;h2&gt;
  
  
  Performance &amp;amp; Use Case Considerations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Databricks Cluster Jobs
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Best for &lt;strong&gt;complex ETL/ELT workflows&lt;/strong&gt; involving Python, Scala, or R.
&lt;/li&gt;
&lt;li&gt;Suitable for &lt;strong&gt;data engineering pipelines&lt;/strong&gt; requiring heavy transformations.
&lt;/li&gt;
&lt;li&gt;Provides &lt;strong&gt;more control&lt;/strong&gt; over tuning, caching, and parallel execution.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  SQL Warehouse
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Best for &lt;strong&gt;SQL-based transformations&lt;/strong&gt; (CTEs, aggregations, analytics).
&lt;/li&gt;
&lt;li&gt;Better suited for &lt;strong&gt;BI/analytics workloads&lt;/strong&gt; that require fast query performance.
&lt;/li&gt;
&lt;li&gt;Can be &lt;strong&gt;more expensive for long-running transformations&lt;/strong&gt; due to DBU-based pricing.
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Use Case: Evaluating Compute Options for Large SQL Queries
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Initial Considerations
&lt;/h3&gt;

&lt;p&gt;To evaluate the best compute option for running complex SQL queries in Databricks, we tested a query involving &lt;strong&gt;multiple joins&lt;/strong&gt; on tables exceeding &lt;strong&gt;1TB&lt;/strong&gt; in size. The query was initially executed using an &lt;strong&gt;X-Small Serverless SQL Warehouse&lt;/strong&gt; in &lt;strong&gt;Databricks&lt;/strong&gt;, with the following setup:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The query was run in a &lt;strong&gt;dedicated warehouse&lt;/strong&gt; to avoid bottlenecks from other workloads.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Only one node was used&lt;/strong&gt; (no auto-scaling).
&lt;/li&gt;
&lt;li&gt;Execution time: &lt;strong&gt;~10 minutes&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;Minimum time to terminate the instance after each execution: &lt;strong&gt;~5 minutes&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;This job runs &lt;strong&gt;once every two hours&lt;/strong&gt;, totaling &lt;strong&gt;12 executions per day&lt;/strong&gt;, equating to &lt;strong&gt;3 hours of Serverless Warehouse usage daily&lt;/strong&gt;.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What We Tested
&lt;/h3&gt;

&lt;p&gt;To compare performance and cost-effectiveness, we tested three compute options:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;X-Small Serverless SQL Warehouse&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Jobs Compute Without Photon&lt;/strong&gt; (r6id.xlarge, 16 CPUs, 128GB RAM, no autoscaling)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Jobs Compute With Photon&lt;/strong&gt; (r6id.xlarge, 16 CPUs, 128GB RAM, no autoscaling)
&lt;/li&gt;
&lt;/ol&gt;
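
&lt;p&gt;For reference, the two Jobs Compute options correspond to a job cluster definition along the lines of the sketch below; the Spark version and worker count are illustrative, and Photon is toggled through the &lt;code&gt;runtime_engine&lt;/code&gt; field (&lt;code&gt;PHOTON&lt;/code&gt; vs. &lt;code&gt;STANDARD&lt;/code&gt;):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "new_cluster": {
    "spark_version": "15.4.x-scala2.12",
    "node_type_id": "r6id.xlarge",
    "num_workers": 1,
    "runtime_engine": "PHOTON"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;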

&lt;h3&gt;
  
  
  Results
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;1) X-Small Serverless SQL Warehouse&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Execution Time:&lt;/strong&gt; ~10 minutes
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run Mode:&lt;/strong&gt; Executed directly in the Databricks console
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Minimum Warehouse Tear-down Time:&lt;/strong&gt; &lt;strong&gt;5 minutes&lt;/strong&gt; after query execution
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total Compute Time per Execution:&lt;/strong&gt; &lt;strong&gt;15 minutes&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;2) Jobs Compute Without Photon&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Instance Type:&lt;/strong&gt; r6id.xlarge (16 CPUs, 128GB RAM)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution Time:&lt;/strong&gt; ~23 minutes
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instance Setup Time:&lt;/strong&gt; ~5 minutes
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tear-down Time:&lt;/strong&gt; &lt;strong&gt;&amp;lt;1 minute&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total Compute Time per Execution:&lt;/strong&gt; &lt;strong&gt;~28 minutes&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;3) Jobs Compute With Photon&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Instance Type:&lt;/strong&gt; r6id.xlarge (16 CPUs, 128GB RAM)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution Time:&lt;/strong&gt; ~10 minutes
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instance Setup Time:&lt;/strong&gt; ~5 minutes
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tear-down Time:&lt;/strong&gt; &lt;strong&gt;&amp;lt;1 minute&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total Compute Time per Execution:&lt;/strong&gt; &lt;strong&gt;~15 minutes&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  AWS Prices
&lt;/h3&gt;

&lt;p&gt;These prices were calculated based on the time that the instances were up and running for both Jobs Compute options:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhakxgndwmq1va6n44y7f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhakxgndwmq1va6n44y7f.png" alt="AWS Prices for Engines" width="800" height="321"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Performance &amp;amp; Cost Comparison
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Engine&lt;/th&gt;
&lt;th&gt;Execution Time&lt;/th&gt;
&lt;th&gt;Databricks Price per Execution&lt;/th&gt;
&lt;th&gt;Executions per Day&lt;/th&gt;
&lt;th&gt;Days&lt;/th&gt;
&lt;th&gt;Monthly AWS Price&lt;/th&gt;
&lt;th&gt;Total Monthly Price&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Jobs Compute With Photon&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;15 minutes&lt;/td&gt;
&lt;td&gt;$0.475161&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;$108.86&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$279.92&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Jobs Compute Without Photon&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;27 minutes&lt;/td&gt;
&lt;td&gt;$0.56112&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;$217.73&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$419.73&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;X-Small Serverless SQL Warehouse&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;15 minutes&lt;/td&gt;
&lt;td&gt;$1.377465&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$495.89&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
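
&lt;p&gt;Reading the table: the monthly total combines the per-execution Databricks price (12 executions a day for 30 days) with the monthly AWS instance cost. For Jobs Compute With Photon, for example, $0.475161 × 12 × 30 ≈ $171.06 in Databricks charges plus $108.86 of AWS compute gives the $279.92 total; the Serverless option shows $0 on the AWS side because infrastructure costs are bundled into its Databricks price.&lt;/p&gt;
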
&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;This analysis provides insights into the cost and performance of different Databricks compute options for batch processing:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Jobs Compute With Photon&lt;/strong&gt; is the most cost-effective option, costing &lt;strong&gt;$279.92 per month&lt;/strong&gt; with a &lt;strong&gt;15-minute execution time&lt;/strong&gt;. Photon significantly improves performance while keeping costs lower than the other options.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Jobs Compute Without Photon&lt;/strong&gt; increases execution time to &lt;strong&gt;27 minutes&lt;/strong&gt; and has a higher total monthly cost of &lt;strong&gt;$419.73&lt;/strong&gt;. This is due to both Databricks execution pricing and additional AWS instance costs.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;X-Small Serverless SQL Warehouse&lt;/strong&gt; achieves the same &lt;strong&gt;15-minute execution time&lt;/strong&gt; as the Photon job but at a significantly higher cost of &lt;strong&gt;$495.89 per month&lt;/strong&gt; due to Databricks' serverless pricing model.
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  Key Takeaways
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Photon optimizations significantly lower costs and execution time, making Jobs Compute With Photon the best option for this workload.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Serverless SQL Warehouse, while convenient, is the most expensive option due to Databricks’ pricing model.&lt;/strong&gt; AWS costs are included in the Databricks pricing for Serverless, which explains the $0 AWS Monthly cost in the table.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Jobs Compute Without Photon is both slower and more expensive than the Photon version, making it the least efficient choice.&lt;/strong&gt; This conclusion is based on a single use case involving a large query with multiple joins; simpler queries were not tested here and should be evaluated separately.
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  When to Choose Each Option
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Use Serverless SQL Warehouse&lt;/strong&gt; if you need simplicity and minimal infrastructure management, as Databricks handles all scaling and maintenance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Jobs Compute With Photon&lt;/strong&gt; for cost-efficient, high-performance workloads, particularly if your SQL queries benefit from &lt;strong&gt;Photon's vectorized execution and query optimizations&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Jobs Compute Without Photon&lt;/strong&gt; only if your workload does not benefit from Photon optimizations or if specific constraints prevent you from using it.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This comparison highlights how &lt;strong&gt;choosing the right compute option can lead to significant cost savings without sacrificing performance&lt;/strong&gt;. 🚀  &lt;/p&gt;
&lt;h3&gt;
  
  
  Leveraging the Terraform Databricks Jobs Module by Cloudnx
&lt;/h3&gt;

&lt;p&gt;To streamline the deployment and management of Databricks jobs, I utilized the &lt;strong&gt;Terraform Databricks Jobs Module&lt;/strong&gt; developed by Cloudnx. This module significantly simplifies the process of provisioning Databricks jobs by offering a robust and flexible framework for defining job configurations, handling multiple notebook tasks, and setting up custom clusters.&lt;/p&gt;
&lt;h4&gt;
  
  
  Key Features of the Module:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Custom Cluster Support&lt;/strong&gt;: Allows precise control over cluster configurations, including instance types and Spark runtime versions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Task Management&lt;/strong&gt;: Supports multiple notebook tasks within a single job, enabling complex workflows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alerting and Notifications&lt;/strong&gt;: Provides options for email notifications on job failures to ensure timely responses.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Environment-Specific Configurations&lt;/strong&gt;: Facilitates deployment across different environments (e.g., dev, staging, production) with minimal changes.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  Why I Chose This Module:
&lt;/h4&gt;

&lt;p&gt;The module's modular design and compatibility with Terraform's infrastructure-as-code approach made it an ideal choice for automating Databricks workflows. By leveraging this module, I was able to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automate the creation of jobs tailored to my workload requirements.&lt;/li&gt;
&lt;li&gt;Maintain consistent configurations across environments.&lt;/li&gt;
&lt;li&gt;Reduce manual effort in managing Databricks resources.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  Example Usage:
&lt;/h4&gt;

&lt;p&gt;Below is an example of how I configured a Databricks job using this module:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"linked_accounts_job"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;providers&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;databricks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;workspace&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;databricks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;workspace&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="nx"&gt;source&lt;/span&gt;                   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"./module"&lt;/span&gt;
  &lt;span class="nx"&gt;job_description&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Job dedicated to run linked accounts"&lt;/span&gt;
  &lt;span class="nx"&gt;job_name&lt;/span&gt;                 &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"linked_account"&lt;/span&gt;
  &lt;span class="nx"&gt;cluster_identifier&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"analytics_cluster"&lt;/span&gt;
  &lt;span class="nx"&gt;notebook_paths&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nx"&gt;abspath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"${path.module}/query1.sql"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nx"&gt;abspath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"${path.module}/query2.sql"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nx"&gt;abspath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"${path.module}/query3.sql"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="nx"&gt;responsible_team&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"analytics"&lt;/span&gt;
  &lt;span class="nx"&gt;contact_email&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"anything@anything.com"&lt;/span&gt;
  &lt;span class="nx"&gt;deployment_environment&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"dev"&lt;/span&gt;
  &lt;span class="nx"&gt;spark_runtime_version&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;databricks_spark_version&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;this&lt;/span&gt;
  &lt;span class="nx"&gt;cluster_instance_type&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;databricks_node_type&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;node_memory_photon_large&lt;/span&gt;
  &lt;span class="nx"&gt;failure_alert_email&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;failure_alert_email&lt;/span&gt;
  &lt;span class="nx"&gt;databricks_notebook_path&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;databricks_notebook_path&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This configuration allowed me to quickly deploy a production-ready batch processing job while ensuring scalability and reliability.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/asaphtinoco/terraform-databricks-jobs" rel="noopener noreferrer"&gt;module's GitHub repository&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Optimizing Data Sync from PostgreSQL to Databricks with Fivetran</title>
      <dc:creator>Asaph</dc:creator>
      <pubDate>Thu, 27 Mar 2025 06:25:26 +0000</pubDate>
      <link>https://dev.to/asaphtinoco/optimizing-data-sync-from-postgresql-to-databricks-with-fivetran-5e9g</link>
      <guid>https://dev.to/asaphtinoco/optimizing-data-sync-from-postgresql-to-databricks-with-fivetran-5e9g</guid>
      <description>&lt;p&gt;In my previous work as a Platform Engineer for a remittance company, we faced a significant issue related to costs. Our Fivetran syncs were consuming an excessive amount of Databricks cluster resources, and as a result, we had to make a crucial decision to reduce operational costs in Databricks. As data volumes continued to grow, finding a solution that would maintain efficiency while lowering costs became a top priority.&lt;/p&gt;

&lt;p&gt;Fivetran offers an elegant way to sync data from various sources to cloud platforms like Databricks, especially when S3 is used as the destination. In our case, the majority of our source databases were PostgreSQL, which made choosing the right sync strategy even more important for keeping the process cost-efficient.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This article explores two approaches we considered to mitigate these issues and improve cost-efficiency while maintaining the integrity and accessibility of our data.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Two Approaches for Syncing Data from PostgreSQL to Databricks
&lt;/h2&gt;

&lt;p&gt;There are two primary strategies to sync data from PostgreSQL to Databricks using Fivetran: &lt;strong&gt;direct sync to Databricks&lt;/strong&gt; and &lt;strong&gt;syncing via S3 to create external tables in Databricks&lt;/strong&gt;. Let’s examine both approaches and how they impact cost, performance, and data management.&lt;/p&gt;




&lt;h3&gt;
  
  
  1. Direct Sync from PostgreSQL to Databricks
&lt;/h3&gt;

&lt;p&gt;In this first approach, Fivetran establishes a direct connection from PostgreSQL to Databricks, facilitating real-time data movement from the source into Databricks. This method is simple to set up and requires minimal ongoing management.&lt;/p&gt;

&lt;h4&gt;
  
  
  Advantages:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Simplicity&lt;/strong&gt;: With straightforward configuration, this method minimizes the need for extra processing steps.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Challenges:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;High Compute Costs&lt;/strong&gt;: Direct syncing incurs significant computational charges in Databricks, as every sync triggers cluster activity, which can quickly add up, especially with high-frequency or large data sets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limited Flexibility&lt;/strong&gt;: Sync frequency can be more rigid, which may lead to performance issues or inefficient cost management.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  2. Sync via S3 to Create External Tables in Databricks
&lt;/h3&gt;

&lt;p&gt;In this approach, Fivetran first syncs data into an S3 bucket, and from there, the S3 destination functionality is used to create external tables in Databricks. External tables allow Databricks to query data directly from S3 without ingesting it into the Databricks warehouse, reducing computational load significantly.&lt;/p&gt;

&lt;h4&gt;
  
  
  Advantages:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reduced Compute Costs&lt;/strong&gt;: By leveraging external tables, the data is stored in S3, and Databricks queries the data directly from there. This avoids the need to allocate compute resources for loading the data into Databricks, cutting down on cluster usage and costs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexible Data Access&lt;/strong&gt;: This method allows for easy querying of data stored in S3 using external tables in Databricks, providing more flexibility in your data pipeline.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Challenges:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Latency&lt;/strong&gt;: While external tables are cost-effective, they may introduce some latency compared to directly syncing data into Databricks, as Databricks queries external storage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Setup Complexity&lt;/strong&gt;: This approach requires additional setup steps, including configuring Fivetran to sync with S3 and setting up external tables in Databricks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Additionally, in this configuration, we used &lt;strong&gt;Delta Lake&lt;/strong&gt; as the data format for the S3 sync. Delta Lake enables ACID transactions and better data quality management, which keeps the data reliable while it is queried in Databricks. Because the tables are external rather than managed, their metadata is not automatically synced with Unity Catalog.&lt;/p&gt;

&lt;p&gt;To configure Fivetran for syncing data into S3 and using it as an external table source in Databricks, follow these steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;In the Fivetran console&lt;/strong&gt;, go to &lt;strong&gt;Destinations&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Choose S3 Data Lake&lt;/strong&gt; as your destination.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Table Format&lt;/strong&gt;: Be sure to select &lt;strong&gt;DELTA&lt;/strong&gt; as the table format to enable the benefits of Delta Lake’s transactional capabilities.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maintain Delta Tables in Databricks&lt;/strong&gt;: Enable the toggle for this option to ensure that Delta tables are maintained in Databricks as external tables, allowing for seamless querying and management.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Warehouse Configuration&lt;/strong&gt;: Fill in the details for the &lt;strong&gt;Databricks warehouse&lt;/strong&gt; you will be using for querying the external tables.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fca6kmt8k7jtpps4xq7sb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fca6kmt8k7jtpps4xq7sb.png" alt="Fivetran UI - S3 Destination" width="605" height="885"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These configurations ensure that your S3 data is optimized for querying in Databricks while maintaining cost efficiency by using external tables.&lt;/p&gt;




&lt;h2&gt;
  
  
  Choosing the Right Approach: Cost, Performance, and Data Freshness
&lt;/h2&gt;

&lt;p&gt;The decision between direct syncing and syncing through S3 depends largely on your organization's priorities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Minimizing Compute Costs&lt;/strong&gt;: If reducing compute usage in Databricks is a primary goal, syncing via S3 and using external tables is the better choice. This method significantly cuts down on the need for Databricks clusters while still providing easy access to data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Volume&lt;/strong&gt;: For large volumes of data, external tables offer a cost-effective solution, as they offload storage to S3, which is much more economical.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion: Optimizing Data Sync with Fivetran
&lt;/h2&gt;

&lt;p&gt;Fivetran provides two efficient ways to sync data from PostgreSQL to Databricks, each with its advantages and trade-offs. Direct syncing offers simplicity and real-time access but comes with high computational costs. On the other hand, syncing data to S3 and using external tables in Databricks reduces compute costs significantly, making it a more cost-effective option for larger data volumes. &lt;/p&gt;

&lt;p&gt;By considering the trade-offs between these two approaches, organizations can optimize their data pipelines, reduce costs, and maintain performance, ensuring they make the most out of their Databricks resources.&lt;/p&gt;

</description>
      <category>fivetran</category>
      <category>databricks</category>
      <category>postgres</category>
      <category>costoptmization</category>
    </item>
  </channel>
</rss>
