<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Asaph</title>
    <description>The latest articles on DEV Community by Asaph (@asaphtinoco).</description>
    <link>https://dev.to/asaphtinoco</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2980323%2F4d34a805-2209-4233-a9d7-d2057eb8d1fe.jpeg</url>
      <title>DEV Community: Asaph</title>
      <link>https://dev.to/asaphtinoco</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/asaphtinoco"/>
    <language>en</language>
    <item>
      <title>How Databricks Cluster Policies Can Enforce Good Behavior (and Save You Money)</title>
      <dc:creator>Asaph</dc:creator>
      <pubDate>Fri, 18 Apr 2025 12:32:35 +0000</pubDate>
      <link>https://dev.to/asaphtinoco/how-databricks-cluster-policies-can-enforce-good-behavior-and-save-you-money-4occ</link>
      <guid>https://dev.to/asaphtinoco/how-databricks-cluster-policies-can-enforce-good-behavior-and-save-you-money-4occ</guid>
      <description>&lt;p&gt;As Platform Engineers, we’re often tasked with enabling autonomy for data teams without letting costs spiral out of control. In the world of Databricks, &lt;strong&gt;cluster policies&lt;/strong&gt; are one of the most powerful (and underrated) tools to achieve that balance.&lt;/p&gt;

&lt;p&gt;In this post, I want to focus on a key capability of cluster policies: &lt;strong&gt;enforcing the separation between interactive (all-purpose) clusters and automated job clusters.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  💭 Why Should You Care?
&lt;/h2&gt;

&lt;p&gt;All-purpose clusters are intended for interactive, exploratory work like notebooks. But it’s tempting—and dangerous—to re-use them for scheduled jobs.&lt;/p&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❌ &lt;strong&gt;They stay alive longer&lt;/strong&gt;, often leading to idle resource costs.&lt;/li&gt;
&lt;li&gt;❌ &lt;strong&gt;They can lead to unexpected behaviors&lt;/strong&gt; when shared with multiple users and tasks.&lt;/li&gt;
&lt;li&gt;❌ &lt;strong&gt;They bypass the monitoring and tagging you may have set up for job clusters.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As a result, your costs can quietly creep up, and your architecture becomes harder to reason about.&lt;/p&gt;




&lt;h2&gt;
  
  
  🔧 Enforcing It in Cluster Policies
&lt;/h2&gt;

&lt;p&gt;Let’s say we want to ensure that &lt;strong&gt;only notebooks can run on a given cluster policy&lt;/strong&gt;—and prevent users from misusing it for scheduled jobs.&lt;/p&gt;

&lt;p&gt;Here’s the relevant part of the Databricks cluster policy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"workload_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fixed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"all-purpose"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"hidden"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"workload_type.clients.jobs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fixed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"workload_type.clients.notebooks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fixed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  🚫 What Does This Do?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;"workload_type.clients.jobs": false&lt;/code&gt; — Disables running scheduled jobs on this cluster.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;"workload_type.clients.notebooks": true&lt;/code&gt; — Ensures this is only for interactive work.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;"workload_type": "all-purpose"&lt;/code&gt; — Locks the cluster into the interactive compute mode.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Combined, this enforces strict usage boundaries and aligns well with the principle of &lt;strong&gt;using job clusters for automation and all-purpose clusters for ad hoc work.&lt;/strong&gt;&lt;/p&gt;
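
&lt;p&gt;To see the policy in action, a user (or an automation workflow) just references it when creating a cluster, and the fixed, hidden fields above are applied automatically. Below is a minimal sketch of a Clusters API request body; the cluster name, Spark version, and policy ID are illustrative placeholders, not values from a real workspace:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "cluster_name": "analytics-notebooks",
  "spark_version": "15.4.x-scala2.12",
  "node_type_id": "r6id.xlarge",
  "num_workers": 2,
  "autotermination_minutes": 30,
  "policy_id": "YOUR_POLICY_ID",
  "apply_policy_default_values": true
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Any attempt to override the fixed fields is rejected when the cluster is created or edited, and a scheduled job pointed at this cluster will fail because the jobs workload is disabled.&lt;/p&gt;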




&lt;h3&gt;
  
  
  💡 Bonus: Better Tagging and Cost Tracking
&lt;/h3&gt;

&lt;p&gt;You can take it further by &lt;strong&gt;enforcing tags&lt;/strong&gt;, such as &lt;code&gt;team&lt;/code&gt;, &lt;code&gt;owner&lt;/code&gt;, or &lt;code&gt;cost_center&lt;/code&gt;, using &lt;code&gt;custom_tags&lt;/code&gt;. This makes it easier to attribute cost to the right business unit or team:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"custom_tags"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fixed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"team"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"data-engineering"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"cost_center"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"data-platform"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  🧩 When to Use This Policy
&lt;/h3&gt;

&lt;p&gt;Use this policy when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You want to &lt;strong&gt;give notebook users a safe and cost-controlled environment&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;You want to &lt;strong&gt;prevent production jobs from piggybacking on shared clusters&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;You want &lt;strong&gt;better visibility and accountability&lt;/strong&gt; on interactive workloads.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  🚀 Wrap-Up
&lt;/h3&gt;

&lt;p&gt;Cluster policies aren’t just for compliance—they’re enablers for scale. By separating workloads, you encourage good behavior, save on compute costs, and make your platform more maintainable.&lt;/p&gt;

&lt;p&gt;Have you tried locking down workloads like this? Drop a comment and let’s share war stories. 👇&lt;/p&gt;

</description>
      <category>devops</category>
      <category>aws</category>
      <category>cloud</category>
      <category>databricks</category>
    </item>
    <item>
      <title>Cost Comparison: Databricks Cluster Jobs vs. SQL Warehouse for Batch Processing</title>
      <dc:creator>Asaph</dc:creator>
      <pubDate>Thu, 10 Apr 2025 06:58:33 +0000</pubDate>
      <link>https://dev.to/asaphtinoco/cost-comparison-databricks-cluster-jobs-vs-sql-warehouse-for-batch-processing-gjg</link>
      <guid>https://dev.to/asaphtinoco/cost-comparison-databricks-cluster-jobs-vs-sql-warehouse-for-batch-processing-gjg</guid>
      <description>&lt;h2&gt;
  
  
  1. Introduction
&lt;/h2&gt;

&lt;p&gt;Batch processing is a fundamental component of data engineering, allowing businesses to process large volumes of data efficiently. Databricks offers multiple compute options for batch workloads, but choosing the right one can significantly impact cost, performance, and overall efficiency.  &lt;/p&gt;

&lt;p&gt;Two common choices for running batch jobs in Databricks are &lt;strong&gt;Cluster Jobs&lt;/strong&gt; and &lt;strong&gt;SQL Warehouse&lt;/strong&gt;. While both options provide scalability and reliability, they come with different pricing models, resource allocations, and execution behaviors. Selecting the most cost-effective solution requires understanding their strengths and trade-offs.  &lt;/p&gt;

&lt;p&gt;This article compares &lt;strong&gt;Databricks Cluster Jobs&lt;/strong&gt; and &lt;strong&gt;SQL Warehouse&lt;/strong&gt;, evaluating their cost-effectiveness for batch processing. By the end, you’ll have a clearer understanding of which compute option best suits your workload and budget.  &lt;/p&gt;




&lt;h2&gt;
  
  
  2. Understanding Databricks Compute Options
&lt;/h2&gt;

&lt;p&gt;Databricks provides multiple compute options tailored for different workloads. Choosing the right option depends on factors such as workload type, scalability needs, and cost considerations.  &lt;/p&gt;

&lt;h3&gt;
  
  
  Databricks Cluster Jobs
&lt;/h3&gt;

&lt;p&gt;Databricks &lt;strong&gt;Cluster Jobs&lt;/strong&gt; run on standard compute clusters and are ideal for general-purpose batch processing, including ETL pipelines and machine learning workloads. Key features include:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Supports &lt;strong&gt;notebooks, scripts, and workflows&lt;/strong&gt;, making it flexible for various data processing tasks.
&lt;/li&gt;
&lt;li&gt;Can &lt;strong&gt;scale dynamically&lt;/strong&gt; with autoscaling, allowing efficient resource utilization.
&lt;/li&gt;
&lt;li&gt;Supports &lt;strong&gt;spot instances and cluster policies&lt;/strong&gt; to optimize cost and governance.
&lt;/li&gt;
&lt;/ul&gt;
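
&lt;p&gt;As a rough illustration of the last two points, a job cluster definition can combine autoscaling with spot instances and still be pinned to a cluster policy. The snippet below is only a sketch with illustrative values:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "new_cluster": {
    "spark_version": "15.4.x-scala2.12",
    "node_type_id": "r6id.xlarge",
    "autoscale": { "min_workers": 1, "max_workers": 4 },
    "aws_attributes": {
      "availability": "SPOT_WITH_FALLBACK",
      "first_on_demand": 1
    },
    "policy_id": "YOUR_POLICY_ID"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;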

&lt;h3&gt;
  
  
  SQL Warehouse
&lt;/h3&gt;

&lt;p&gt;Databricks &lt;strong&gt;SQL Warehouse&lt;/strong&gt; is designed specifically for SQL-based workloads and provides a managed compute layer optimized for querying large datasets. Key characteristics include:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Runs in three modes:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Serverless Mode&lt;/strong&gt; – Fully managed, Databricks handles infrastructure and auto-scaling.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pro Mode&lt;/strong&gt; – Uses dedicated clusters, giving more control over configurations.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Classic Mode&lt;/strong&gt; – The legacy option with manual cluster management, but fewer optimizations compared to Pro Mode.
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Ideal for &lt;strong&gt;analytics and reporting&lt;/strong&gt; workloads requiring high concurrency and fast query execution.
&lt;/li&gt;

&lt;li&gt;Cost is based on &lt;strong&gt;DBU (Databricks Unit) pricing&lt;/strong&gt; and the selected &lt;strong&gt;warehouse tier&lt;/strong&gt;, making it more predictable for SQL-heavy processing.
&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Understanding these compute options is crucial for selecting the most efficient and cost-effective solution for batch processing in Databricks.&lt;/p&gt;




&lt;h2&gt;
  
  
  Performance &amp;amp; Use Case Considerations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Databricks Cluster Jobs
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Best for &lt;strong&gt;complex ETL/ELT workflows&lt;/strong&gt; involving Python, Scala, or R.
&lt;/li&gt;
&lt;li&gt;Suitable for &lt;strong&gt;data engineering pipelines&lt;/strong&gt; requiring heavy transformations.
&lt;/li&gt;
&lt;li&gt;Provides &lt;strong&gt;more control&lt;/strong&gt; over tuning, caching, and parallel execution.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  SQL Warehouse
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Best for &lt;strong&gt;SQL-based transformations&lt;/strong&gt; (CTEs, aggregations, analytics).
&lt;/li&gt;
&lt;li&gt;Better suited for &lt;strong&gt;BI/analytics workloads&lt;/strong&gt; that require fast query performance.
&lt;/li&gt;
&lt;li&gt;Can be &lt;strong&gt;more expensive for long-running transformations&lt;/strong&gt; due to DBU-based pricing.
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Use Case: Evaluating Compute Options for Large SQL Queries
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Initial Considerations
&lt;/h3&gt;

&lt;p&gt;To evaluate the best compute option for running complex SQL queries in Databricks, we tested a query involving &lt;strong&gt;multiple joins&lt;/strong&gt; on tables exceeding &lt;strong&gt;1TB&lt;/strong&gt; in size. The query was initially executed using an &lt;strong&gt;X-Small Serverless SQL Warehouse&lt;/strong&gt; in &lt;strong&gt;Databricks&lt;/strong&gt;, with the following setup:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The query was run in a &lt;strong&gt;dedicated warehouse&lt;/strong&gt; to avoid bottlenecks from other workloads.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Only one node was used&lt;/strong&gt; (no auto-scaling).
&lt;/li&gt;
&lt;li&gt;Execution time: &lt;strong&gt;~10 minutes&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;Minimum time to terminate the instance after each execution: &lt;strong&gt;~5 minutes&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;This job runs &lt;strong&gt;once every two hours&lt;/strong&gt;, totaling &lt;strong&gt;12 executions per day&lt;/strong&gt;, equating to &lt;strong&gt;3 hours of Serverless Warehouse usage daily&lt;/strong&gt;.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What We Tested
&lt;/h3&gt;

&lt;p&gt;To compare performance and cost-effectiveness, we tested three compute options:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;X-Small Serverless SQL Warehouse&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Jobs Compute Without Photon&lt;/strong&gt; (r6id.xlarge, 16 CPUs, 128GB RAM, no autoscaling)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Jobs Compute With Photon&lt;/strong&gt; (r6id.xlarge, 16 CPUs, 128GB RAM, no autoscaling)
&lt;/li&gt;
&lt;/ol&gt;
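
&lt;p&gt;For reference, the two Jobs Compute options correspond to a job cluster definition along the lines of the sketch below; the Spark version and worker count are illustrative, and Photon is toggled through the &lt;code&gt;runtime_engine&lt;/code&gt; field (&lt;code&gt;PHOTON&lt;/code&gt; vs. &lt;code&gt;STANDARD&lt;/code&gt;):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "new_cluster": {
    "spark_version": "15.4.x-scala2.12",
    "node_type_id": "r6id.xlarge",
    "num_workers": 1,
    "runtime_engine": "PHOTON"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;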

&lt;h3&gt;
  
  
  Results
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;1) X-Small Serverless SQL Warehouse&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Execution Time:&lt;/strong&gt; ~10 minutes
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run Mode:&lt;/strong&gt; Executed directly in the Databricks console
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Minimum Warehouse Tear-down Time:&lt;/strong&gt; &lt;strong&gt;5 minutes&lt;/strong&gt; after query execution
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total Compute Time per Execution:&lt;/strong&gt; &lt;strong&gt;15 minutes&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;2) Jobs Compute Without Photon&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Instance Type:&lt;/strong&gt; r6id.xlarge (16 CPUs, 128GB RAM)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution Time:&lt;/strong&gt; ~23 minutes
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instance Setup Time:&lt;/strong&gt; ~5 minutes
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tear-down Time:&lt;/strong&gt; &lt;strong&gt;&amp;lt;1 minute&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total Compute Time per Execution:&lt;/strong&gt; &lt;strong&gt;~28 minutes&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;3) Jobs Compute With Photon&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Instance Type:&lt;/strong&gt; r6id.xlarge (16 CPUs, 128GB RAM)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution Time:&lt;/strong&gt; ~10 minutes
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instance Setup Time:&lt;/strong&gt; ~5 minutes
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tear-down Time:&lt;/strong&gt; &lt;strong&gt;&amp;lt;1 minute&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total Compute Time per Execution:&lt;/strong&gt; &lt;strong&gt;~15 minutes&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  AWS Prices
&lt;/h3&gt;

&lt;p&gt;These prices were calculated based on the time that the instances were up and running for both Jobs Compute options:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhakxgndwmq1va6n44y7f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhakxgndwmq1va6n44y7f.png" alt="AWS Prices for Engines" width="800" height="321"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Performance &amp;amp; Cost Comparison
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Engine&lt;/th&gt;
&lt;th&gt;Execution Time&lt;/th&gt;
&lt;th&gt;Databricks Price per Execution&lt;/th&gt;
&lt;th&gt;Executions per Day&lt;/th&gt;
&lt;th&gt;Days&lt;/th&gt;
&lt;th&gt;Monthly AWS Price&lt;/th&gt;
&lt;th&gt;Total Monthly Price&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Jobs Compute With Photon&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;15 minutes&lt;/td&gt;
&lt;td&gt;$0.475161&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;$108.86&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$279.92&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Jobs Compute Without Photon&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;27 minutes&lt;/td&gt;
&lt;td&gt;$0.56112&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;$217.73&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$419.73&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;X-Small Serverless SQL Warehouse&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;15 minutes&lt;/td&gt;
&lt;td&gt;$1.377465&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$495.89&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
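
&lt;p&gt;Reading the table: the monthly total combines the per-execution Databricks price (12 executions a day for 30 days) with the monthly AWS instance cost. For Jobs Compute With Photon, for example, $0.475161 × 12 × 30 ≈ $171.06 in Databricks charges plus $108.86 of AWS compute gives the $279.92 total; the Serverless option shows $0 on the AWS side because infrastructure costs are bundled into its Databricks price.&lt;/p&gt;
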
&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;This analysis provides insights into the cost and performance of different Databricks compute options for batch processing:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Jobs Compute With Photon&lt;/strong&gt; is the most cost-effective option, costing &lt;strong&gt;$279.92 per month&lt;/strong&gt; with a &lt;strong&gt;15-minute execution time&lt;/strong&gt;. Photon significantly improves performance while keeping costs lower than the other options.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Jobs Compute Without Photon&lt;/strong&gt; increases execution time to &lt;strong&gt;27 minutes&lt;/strong&gt; and has a higher total monthly cost of &lt;strong&gt;$419.73&lt;/strong&gt;. This is due to both Databricks execution pricing and additional AWS instance costs.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;X-Small Serverless SQL Warehouse&lt;/strong&gt; achieves the same &lt;strong&gt;15-minute execution time&lt;/strong&gt; as the Photon job but at a significantly higher cost of &lt;strong&gt;$495.89 per month&lt;/strong&gt; due to Databricks' serverless pricing model.
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  Key Takeaways
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Photon optimizations significantly lower costs and execution time, making Jobs Compute With Photon the best option for this workload.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Serverless SQL Warehouse, while convenient, is the most expensive option due to Databricks’ pricing model.&lt;/strong&gt; AWS costs are included in the Databricks pricing for Serverless, which explains the $0 AWS Monthly cost in the table.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Jobs Compute Without Photon is both slower and more expensive than the Photon version, making it the least efficient choice.&lt;/strong&gt; This conclusion is based on a single use case involving a large query with multiple joins; simpler queries were not tested here and should be evaluated separately.
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  When to Choose Each Option
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Use Serverless SQL Warehouse&lt;/strong&gt; if you need simplicity and minimal infrastructure management, as Databricks handles all scaling and maintenance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Jobs Compute With Photon&lt;/strong&gt; for cost-efficient, high-performance workloads, particularly if your SQL queries benefit from &lt;strong&gt;Photon's vectorized execution and query optimizations&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Jobs Compute Without Photon&lt;/strong&gt; only if your workload does not benefit from Photon optimizations or if specific constraints prevent you from using it.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This comparison highlights how &lt;strong&gt;choosing the right compute option can lead to significant cost savings without sacrificing performance&lt;/strong&gt;. 🚀  &lt;/p&gt;
&lt;h3&gt;
  
  
  Leveraging the Terraform Databricks Jobs Module by Cloudnx
&lt;/h3&gt;

&lt;p&gt;To streamline the deployment and management of Databricks jobs, I utilized the &lt;strong&gt;Terraform Databricks Jobs Module&lt;/strong&gt; developed by Cloudnx. This module significantly simplifies the process of provisioning Databricks jobs by offering a robust and flexible framework for defining job configurations, handling multiple notebook tasks, and setting up custom clusters.&lt;/p&gt;
&lt;h4&gt;
  
  
  Key Features of the Module:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Custom Cluster Support&lt;/strong&gt;: Allows precise control over cluster configurations, including instance types and Spark runtime versions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Task Management&lt;/strong&gt;: Supports multiple notebook tasks within a single job, enabling complex workflows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alerting and Notifications&lt;/strong&gt;: Provides options for email notifications on job failures to ensure timely responses.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Environment-Specific Configurations&lt;/strong&gt;: Facilitates deployment across different environments (e.g., dev, staging, production) with minimal changes.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  Why I Chose This Module:
&lt;/h4&gt;

&lt;p&gt;The module's modular design and compatibility with Terraform's infrastructure-as-code approach made it an ideal choice for automating Databricks workflows. By leveraging this module, I was able to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automate the creation of jobs tailored to my workload requirements.&lt;/li&gt;
&lt;li&gt;Maintain consistent configurations across environments.&lt;/li&gt;
&lt;li&gt;Reduce manual effort in managing Databricks resources.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  Example Usage:
&lt;/h4&gt;

&lt;p&gt;Below is an example of how I configured a Databricks job using this module:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"linked_accounts_job"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;providers&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;databricks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;workspace&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;databricks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;workspace&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="nx"&gt;source&lt;/span&gt;                   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"./module"&lt;/span&gt;
  &lt;span class="nx"&gt;job_description&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Job dedicated to run linked accounts"&lt;/span&gt;
  &lt;span class="nx"&gt;job_name&lt;/span&gt;                 &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"linked_account"&lt;/span&gt;
  &lt;span class="nx"&gt;cluster_identifier&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"analytics_cluster"&lt;/span&gt;
  &lt;span class="nx"&gt;notebook_paths&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nx"&gt;abspath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"${path.module}/query1.sql"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nx"&gt;abspath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"${path.module}/query2.sql"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nx"&gt;abspath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"${path.module}/query3.sql"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="nx"&gt;responsible_team&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"analytics"&lt;/span&gt;
  &lt;span class="nx"&gt;contact_email&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"anything@anything.com"&lt;/span&gt;
  &lt;span class="nx"&gt;deployment_environment&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"dev"&lt;/span&gt;
  &lt;span class="nx"&gt;spark_runtime_version&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;databricks_spark_version&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;this&lt;/span&gt;
  &lt;span class="nx"&gt;cluster_instance_type&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;databricks_node_type&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;node_memory_photon_large&lt;/span&gt;
  &lt;span class="nx"&gt;failure_alert_email&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;failure_alert_email&lt;/span&gt;
  &lt;span class="nx"&gt;databricks_notebook_path&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;databricks_notebook_path&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This configuration allowed me to quickly deploy a production-ready batch processing job while ensuring scalability and reliability.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/asaphtinoco/terraform-databricks-jobs" rel="noopener noreferrer"&gt;module's GitHub repository&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Optimizing Data Sync from PostgreSQL to Databricks with Fivetran</title>
      <dc:creator>Asaph</dc:creator>
      <pubDate>Thu, 27 Mar 2025 06:25:26 +0000</pubDate>
      <link>https://dev.to/asaphtinoco/optimizing-data-sync-from-postgresql-to-databricks-with-fivetran-5e9g</link>
      <guid>https://dev.to/asaphtinoco/optimizing-data-sync-from-postgresql-to-databricks-with-fivetran-5e9g</guid>
      <description>&lt;p&gt;In my previous work as a Platform Engineer for a remittance company, we faced a significant issue related to costs. Our Fivetran syncs were consuming an excessive amount of Databricks cluster resources, and as a result, we had to make a crucial decision to reduce operational costs in Databricks. As data volumes continued to grow, finding a solution that would maintain efficiency while lowering costs became a top priority.&lt;/p&gt;

&lt;p&gt;Fivetran offers an elegant way to sync data from various sources to cloud platforms like Databricks, especially when S3 is used as the destination. In our case, the majority of our source databases were PostgreSQL, which made choosing the right sync strategy even more important for keeping the process cost-efficient.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This article explores two approaches we considered to mitigate these issues and improve cost-efficiency while maintaining the integrity and accessibility of our data.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Two Approaches for Syncing Data from PostgreSQL to Databricks
&lt;/h2&gt;

&lt;p&gt;There are two primary strategies to sync data from PostgreSQL to Databricks using Fivetran: &lt;strong&gt;direct sync to Databricks&lt;/strong&gt; and &lt;strong&gt;syncing via S3 to create external tables in Databricks&lt;/strong&gt;. Let’s examine both approaches and how they impact cost, performance, and data management.&lt;/p&gt;




&lt;h3&gt;
  
  
  1. Direct Sync from PostgreSQL to Databricks
&lt;/h3&gt;

&lt;p&gt;In this first approach, Fivetran establishes a direct connection from PostgreSQL to Databricks, facilitating real-time data movement from the source into Databricks. This method is simple to set up and requires minimal ongoing management.&lt;/p&gt;

&lt;h4&gt;
  
  
  Advantages:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Simplicity&lt;/strong&gt;: With straightforward configuration, this method minimizes the need for extra processing steps.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Challenges:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;High Compute Costs&lt;/strong&gt;: Direct syncing incurs significant computational charges in Databricks, as every sync triggers cluster activity, which can quickly add up, especially with high-frequency or large data sets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limited Flexibility&lt;/strong&gt;: Sync frequency can be more rigid, which may lead to performance issues or inefficient cost management.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  2. Sync via S3 to Create External Tables in Databricks
&lt;/h3&gt;

&lt;p&gt;In this approach, Fivetran first syncs data into an S3 bucket, and from there, the S3 destination functionality is used to create external tables in Databricks. External tables allow Databricks to query data directly from S3 without ingesting it into the Databricks warehouse, reducing computational load significantly.&lt;/p&gt;

&lt;h4&gt;
  
  
  Advantages:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reduced Compute Costs&lt;/strong&gt;: By leveraging external tables, the data is stored in S3, and Databricks queries the data directly from there. This avoids the need to allocate compute resources for loading the data into Databricks, cutting down on cluster usage and costs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexible Data Access&lt;/strong&gt;: This method allows for easy querying of data stored in S3 using external tables in Databricks, providing more flexibility in your data pipeline.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Challenges:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Latency&lt;/strong&gt;: While external tables are cost-effective, they may introduce some latency compared to directly syncing data into Databricks, as Databricks queries external storage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Setup Complexity&lt;/strong&gt;: This approach requires additional setup steps, including configuring Fivetran to sync with S3 and setting up external tables in Databricks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Additionally, in this configuration, we used &lt;strong&gt;Delta Lake&lt;/strong&gt; as the data format for the S3 sync. Delta Lake enables ACID transactions and better data quality management, which keeps the data reliable while it is queried in Databricks. Because the tables are external rather than managed, their metadata is not automatically synced with Unity Catalog.&lt;/p&gt;

&lt;p&gt;To configure Fivetran for syncing data into S3 and using it as an external table source in Databricks, follow these steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;In the Fivetran console&lt;/strong&gt;, go to &lt;strong&gt;Destinations&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Choose S3 Data Lake&lt;/strong&gt; as your destination.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Table Format&lt;/strong&gt;: Be sure to select &lt;strong&gt;DELTA&lt;/strong&gt; as the table format to enable the benefits of Delta Lake’s transactional capabilities.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maintain Delta Tables in Databricks&lt;/strong&gt;: Enable the toggle for this option to ensure that Delta tables are maintained in Databricks as external tables, allowing for seamless querying and management.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Warehouse Configuration&lt;/strong&gt;: Fill in the details for the &lt;strong&gt;Databricks warehouse&lt;/strong&gt; you will be using for querying the external tables.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fca6kmt8k7jtpps4xq7sb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fca6kmt8k7jtpps4xq7sb.png" alt="Fivetran UI - S3 Destination" width="605" height="885"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These configurations ensure that your S3 data is optimized for querying in Databricks while maintaining cost efficiency by using external tables.&lt;/p&gt;




&lt;h2&gt;
  
  
  Choosing the Right Approach: Cost, Performance, and Data Freshness
&lt;/h2&gt;

&lt;p&gt;The decision between direct syncing and syncing through S3 depends largely on your organization's priorities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Minimizing Compute Costs&lt;/strong&gt;: If reducing compute usage in Databricks is a primary goal, syncing via S3 and using external tables is the better choice. This method significantly cuts down on the need for Databricks clusters while still providing easy access to data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Volume&lt;/strong&gt;: For large volumes of data, external tables offer a cost-effective solution, as they offload storage to S3, which is much more economical.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion: Optimizing Data Sync with Fivetran
&lt;/h2&gt;

&lt;p&gt;Fivetran provides two efficient ways to sync data from PostgreSQL to Databricks, each with its advantages and trade-offs. Direct syncing offers simplicity and real-time access but comes with high computational costs. On the other hand, syncing data to S3 and using external tables in Databricks reduces compute costs significantly, making it a more cost-effective option for larger data volumes. &lt;/p&gt;

&lt;p&gt;By considering the trade-offs between these two approaches, organizations can optimize their data pipelines, reduce costs, and maintain performance, ensuring they make the most out of their Databricks resources.&lt;/p&gt;

</description>
      <category>fivetran</category>
      <category>databricks</category>
      <category>postgres</category>
      <category>costoptmization</category>
    </item>
  </channel>
</rss>
