<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: databricks</title>
    <description>The latest articles tagged 'databricks' on DEV Community.</description>
    <link>https://dev.to/t/databricks</link>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tag/databricks"/>
    <language>en</language>
    <item>
      <title>Deep Dive: Personal Agents and Their Role in the A…</title>
      <dc:creator>Norvik Tech</dc:creator>
      <pubDate>Tue, 02 Jun 2026 18:06:48 +0000</pubDate>
      <link>https://dev.to/norviktech/deep-dive-personal-agents-and-their-role-in-the-a-12no</link>
      <guid>https://dev.to/norviktech/deep-dive-personal-agents-and-their-role-in-the-a-12no</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Originally published at &lt;a href="https://norvik.tech/en/news/agentes-personales-snowflake-databricks-2026" rel="noopener noreferrer"&gt;norvik.tech&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Explore how personal agents are transforming the AI stack with Snowflake and Databricks, and what it means for your tech strategy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding Personal Agents in the AI Landscape
&lt;/h2&gt;

&lt;p&gt;Personal agents represent a significant advancement in AI technology, acting as intermediaries that facilitate more intuitive interaction between users and complex data systems. These agents can automate tasks such as data analysis, insights generation, and decision support, enhancing productivity within organizations. According to a recent report, companies integrating personal agents into their workflows have experienced a &lt;strong&gt;25% increase&lt;/strong&gt; in efficiency, underscoring their potential value.&lt;/p&gt;


&lt;h3&gt;
  
  
  How Personal Agents Function
&lt;/h3&gt;

&lt;p&gt;Personal agents leverage natural language processing (NLP) and machine learning algorithms to understand user commands and retrieve relevant data. For instance, a user might ask their personal agent to generate a sales report, prompting the agent to query databases, analyze trends, and present findings in an easily digestible format. This automation significantly reduces the burden on data teams and accelerates decision-making processes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Technical Architecture of Personal Agents
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Core Components
&lt;/h3&gt;

&lt;p&gt;The architecture of personal agents typically includes several key components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data Ingestion Layer&lt;/strong&gt;: Collects data from various sources, including databases, APIs, and cloud storage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Processing Engine&lt;/strong&gt;: Analyzes the ingested data using algorithms that can learn from interactions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User Interface&lt;/strong&gt;: Provides a platform for users to interact with the agent, often through chat interfaces or dashboards.&lt;/li&gt;
&lt;/ul&gt;


&lt;h4&gt;
  
  
  Interaction Flow
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;User inputs a command via the interface.&lt;/li&gt;
&lt;li&gt;The agent processes the command, querying relevant datasets.&lt;/li&gt;
&lt;li&gt;Insights are generated and presented back to the user in real time. This streamlined process allows businesses to harness insights without extensive manual input.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Use Cases for Personal Agents
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Real-World Applications
&lt;/h3&gt;

&lt;p&gt;Personal agents find applications across various industries:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Retail&lt;/strong&gt;: Automating inventory management and customer interactions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Finance&lt;/strong&gt;: Streamlining reporting processes and risk analysis.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Healthcare&lt;/strong&gt;: Assisting in patient data management and appointment scheduling.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Specific Examples
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;A major retailer implemented a personal agent to automate customer service queries, resulting in a &lt;strong&gt;30% reduction&lt;/strong&gt; in response times.&lt;/li&gt;
&lt;li&gt;A financial institution used a personal agent for real-time risk assessment, leading to more informed investment decisions.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Implications for Business Strategy
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What Does This Mean for Your Business?
&lt;/h3&gt;

&lt;p&gt;For companies in Colombia, Spain, and LATAM, adopting personal agents can transform operational efficiency. The initial investment may be substantial, but the ROI is evident in the form of enhanced productivity and reduced labor costs.&lt;/p&gt;

&lt;h4&gt;
  
  
  Regional Considerations
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;In Colombia, where businesses are increasingly digitalizing, adopting such technology can set companies apart from competitors.&lt;/li&gt;
&lt;li&gt;Spanish firms may see significant benefits in customer engagement through personalized interactions driven by these agents.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Next Steps for Implementation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How to Get Started
&lt;/h3&gt;

&lt;p&gt;To effectively implement personal agents in your organization, consider the following steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Assess Your Current Data Infrastructure&lt;/strong&gt;: Identify gaps in your current processes that personal agents could fill.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pilot Program&lt;/strong&gt;: Start with a small-scale pilot to measure effectiveness and gather user feedback.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evaluate Performance&lt;/strong&gt;: Use metrics such as time savings and user satisfaction to gauge success before full-scale deployment.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By following these steps, your organization can smoothly transition into leveraging personal agents while minimizing risks associated with adoption.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;


&lt;h4&gt;
  
  
  What are personal agents and how do they work?
&lt;/h4&gt;

&lt;p&gt;Personal agents are tools that use artificial intelligence to automate tasks and make it easier for users to interact with data systems. They process natural-language commands to generate relevant reports and analyses.&lt;/p&gt;

&lt;h4&gt;
  
  
  Which industries can use these agents?
&lt;/h4&gt;

&lt;p&gt;They are used in industries such as retail, finance, and healthcare to improve operational efficiency and customer experience. Applications are broad and vary with each sector's needs.&lt;/p&gt;

&lt;h4&gt;
  
  
  What is the return on investment from implementing personal agents?
&lt;/h4&gt;

&lt;p&gt;Companies have reported significant improvements in operational efficiency and reductions in labor costs, which translates into a positive return on investment within a relatively short period.&lt;/p&gt;




&lt;h2&gt;
  
  
  Need Custom Software Solutions?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Norvik Tech&lt;/strong&gt; builds high-impact software for businesses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;consulting&lt;/li&gt;
&lt;li&gt;development&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 &lt;a href="https://norvik.tech" rel="noopener noreferrer"&gt;Visit norvik.tech&lt;/a&gt; to schedule a free consultation.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>personalagents</category>
      <category>snowflake</category>
      <category>databricks</category>
    </item>
    <item>
      <title>Why Your In-House Databricks Team Is Probably Losing You Money</title>
      <dc:creator>Lucy </dc:creator>
      <pubDate>Wed, 27 May 2026 10:44:20 +0000</pubDate>
      <link>https://dev.to/lucy1/why-your-in-house-databricks-team-is-probably-losing-you-money-35m9</link>
      <guid>https://dev.to/lucy1/why-your-in-house-databricks-team-is-probably-losing-you-money-35m9</guid>
      <description>&lt;p&gt;60% of enterprise AI projects get abandoned because of data readiness and infrastructure issues.&lt;/p&gt;

&lt;p&gt;Not because of bad ideas. Not because of wrong tooling. Because the foundation wasn't built right and by the time anyone noticed, the cost of fixing it was higher than starting over.&lt;/p&gt;

&lt;p&gt;If you're running Databricks in-house, there's a decent chance you're heading toward one of four failure modes. I've seen each of them play out, sometimes in the same org.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. The "unicorn engineer" job post
&lt;/h2&gt;

&lt;p&gt;You know the one. It asks for someone who can handle platform architecture, complex ETL pipeline design, MLOps, &lt;em&gt;and&lt;/em&gt; data governance. Maybe Unity Catalog experience preferred. Definitely Spark optimization. Oh, and some Python.&lt;/p&gt;

&lt;p&gt;That person doesn't exist. Or if they do, they're already at a FAANG and not answering your recruiter.&lt;/p&gt;

&lt;p&gt;What actually happens: you hire someone capable, and they spend most of their time on operational noise: manually partitioning tables, babysitting cluster configs, and debugging integration issues that have nothing to do with your actual data problems.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Databricks has gotten genuinely complex. Delta Lake, Lakeflow Declarative Pipelines, Unity Catalog: these aren't plug-and-play. A generalist data engineer in 2026 is not the same as a Databricks platform specialist.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A consulting partner brings people who've already built this across multiple clients. You're not buying hours. You're buying what they learned the hard way somewhere else (multi-cloud workspace topology, Liquid Clustering, private endpoint configs) without waiting for your team to acquire those scars.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. The cloud bill no one is watching
&lt;/h2&gt;

&lt;p&gt;Here's one I've seen kill otherwise solid data platforms quietly.&lt;/p&gt;

&lt;p&gt;In-house team gets the pipelines working. Everyone moves on. Nobody sets up auto-termination. Nobody enforces cluster policies. Clusters run indefinitely. Variable workloads stay on always-on compute when they should be hitting Serverless SQL.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Traditional In-House Setup] ---&amp;gt; Over-provisioned Clusters ---&amp;gt; High Idle Waste &amp;amp; Skyrocketing Bills
[Consulting-Led Framework] ---&amp;gt; Serverless SQL + Cluster Policies ---&amp;gt; Automated Auto-Termination &amp;amp; Controlled Spend
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The bill climbs slowly, and then suddenly it's a boardroom conversation.&lt;/p&gt;

&lt;p&gt;A proper FinOps setup isn't exciting work, but it has a direct, measurable line to your cloud costs. That means things like a mandatory &lt;code&gt;autotermination_minutes&lt;/code&gt; setting, enforced instance pool configs, and routing the right workloads away from always-on clusters. This is table stakes; it just often doesn't get done when you're underwater on pipeline work.&lt;/p&gt;
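
&lt;p&gt;As a rough illustration of what "mandatory" could look like, here is a minimal sketch that builds a cluster policy definition as a plain Python dict and prints it as JSON. It assumes the Databricks cluster policy format, where the Clusters API attribute is spelled &lt;code&gt;autotermination_minutes&lt;/code&gt;. The 30-minute value, the function name, and the pool ID are placeholders rather than recommendations.&lt;/p&gt;

```python
import json


def build_cluster_policy(max_idle_minutes=30, pool_id="example-pool-id"):
    """Return a cluster policy definition as a plain dict.

    Both default values are hypothetical placeholders; tune them
    to your own workloads before using anything like this.
    """
    return {
        # Fix the idle timeout so users cannot disable auto-termination.
        "autotermination_minutes": {
            "type": "fixed",
            "value": max_idle_minutes,
            "hidden": True,
        },
        # Pin clusters to a pre-approved instance pool (placeholder ID).
        "instance_pool_id": {
            "type": "fixed",
            "value": pool_id,
        },
    }


policy = build_cluster_policy()
print(json.dumps(policy, indent=2))
```

&lt;p&gt;The printed JSON is the kind of definition you would paste into a policy in the workspace UI or send through the policies API. Either way, the point is that idle shutdown stops being something each engineer has to remember.&lt;/p&gt;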

&lt;h2&gt;
  
  
  3. Governance that gets bolted on after the fact
&lt;/h2&gt;

&lt;p&gt;The pattern is almost universal:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Build the pipelines&lt;/li&gt;
&lt;li&gt;Ship the dashboards&lt;/li&gt;
&lt;li&gt;Deal with governance "later"&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By the time "later" arrives, you've got fragmented data silos, ML models stuck in sandbox environments, inconsistent access controls, and no data lineage. Then someone asks about compliance.&lt;/p&gt;

&lt;p&gt;Unity Catalog isn't an afterthought; it's the thing you configure &lt;em&gt;before&lt;/em&gt; the pipelines, not after. Role-based access controls, automated data quality monitoring, and end-to-end lineage tracking belong in the foundation. If they aren't there, your downstream reports are unreliable by design.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The uncomfortable truth:&lt;/strong&gt; A lot of teams treat governance like a documentation task. It's not. It's infrastructure.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  4. The hiring timeline nobody accounts for
&lt;/h2&gt;

&lt;p&gt;Realistic timeline from job post to a team that's onboarded, trained on Databricks, and actually productive:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6–9 months.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's not pessimism, that's just recruiting + onboarding + platform ramp-up. Most orgs don't factor this in when they're comparing in-house costs against consulting rates.&lt;/p&gt;

&lt;p&gt;A consulting firm gets there faster because they're not starting from scratch. Pre-built IaC templates, established Bronze/Silver/Gold ingestion patterns, CI/CD already wired up. Deployment that takes your internal team six months can happen in weeks.&lt;/p&gt;

&lt;p&gt;That gap matters if your competitors are already running predictive analytics in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  So what actually works?
&lt;/h2&gt;

&lt;p&gt;It's not a binary choice, and framing it that way is usually how you end up making the wrong call.&lt;/p&gt;

&lt;p&gt;The companies that handle this well use a hybrid model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bring in specialists&lt;/strong&gt; for the hard setup — architecture, Unity Catalog, cluster optimization, MLOps scaffolding&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep your internal team focused&lt;/strong&gt; on domain knowledge, custom data products, and the business problems that actually need context to solve&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your internal engineers understand your data, your customers, and your edge cases. That's valuable and hard to transfer. But asking them to also be platform infrastructure experts is how you end up with both things done poorly.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;th&gt;In-house default&lt;/th&gt;
&lt;th&gt;What fixes it&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Skill gaps&lt;/td&gt;
&lt;td&gt;Overhire, underdeliver&lt;/td&gt;
&lt;td&gt;Consulting for platform-specific work&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cloud costs&lt;/td&gt;
&lt;td&gt;Idle compute, no policies&lt;/td&gt;
&lt;td&gt;FinOps framework from day one&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Governance&lt;/td&gt;
&lt;td&gt;Bolted on later&lt;/td&gt;
&lt;td&gt;Unity Catalog before pipelines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Speed&lt;/td&gt;
&lt;td&gt;6–9 months to productivity&lt;/td&gt;
&lt;td&gt;Pre-built templates + IaC&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The architecture decisions you make in the first few months of a Databricks deployment are surprisingly hard to undo. Getting them right upfront — even with outside help — is almost always cheaper than refactoring a broken foundation at scale.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Have you gone through a Databricks migration or build-out? Curious what broke first — drop it in the comments.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>databricks</category>
      <category>dataengineering</category>
      <category>mlops</category>
      <category>cloudcosts</category>
    </item>
    <item>
      <title>Adeloop: Turning Semantic Data Models Into APIs for AI Agents</title>
      <dc:creator>Adeloop</dc:creator>
      <pubDate>Tue, 26 May 2026 11:12:19 +0000</pubDate>
      <link>https://dev.to/adeloop/adeloop-turning-semantic-data-models-into-apis-for-ai-agents-212c</link>
      <guid>https://dev.to/adeloop/adeloop-turning-semantic-data-models-into-apis-for-ai-agents-212c</guid>
      <description>&lt;p&gt;Most AI agents today can call APIs.&lt;/p&gt;

&lt;p&gt;But very few systems solve the real problem:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;how do you safely expose business data to AI agents without giving them raw database access?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That’s what we built in Adeloop.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introducing: Adeloop Agent Console API
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjpia2lwwx7ucn1j19iku.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjpia2lwwx7ucn1j19iku.png" alt=" " width="800" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Adeloop can now publish semantic domains as governed APIs for external AI agents and applications.&lt;/p&gt;

&lt;p&gt;The flow is simple:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Connect a warehouse, database, spreadsheet, or file source&lt;/li&gt;
&lt;li&gt;Turn tables into a semantic domain&lt;/li&gt;
&lt;li&gt;Publish selected domains&lt;/li&gt;
&lt;li&gt;Generate an API key&lt;/li&gt;
&lt;li&gt;Connect from ChatGPT, Claude, Cursor, n8n, Zapier, or your backend&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The important part:&lt;/p&gt;

&lt;p&gt;External AI agents never access raw SQL directly.&lt;/p&gt;

&lt;p&gt;Adeloop becomes the governed execution layer between AI and data.&lt;/p&gt;




&lt;h1&gt;
  
  
  Why This Matters
&lt;/h1&gt;

&lt;p&gt;Most “AI data chat” products are either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;unsafe SQL generators&lt;/li&gt;
&lt;li&gt;notebook wrappers&lt;/li&gt;
&lt;li&gt;or vector search over metadata&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That breaks quickly at scale.&lt;/p&gt;

&lt;p&gt;Instead, Adeloop uses:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;question
→ semantic routing
→ metric/dimension planning
→ bounded SQL compilation
→ source pushdown execution
→ governed JSON response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;queries stay close to the warehouse&lt;/li&gt;
&lt;li&gt;millions of rows are not pulled into Python&lt;/li&gt;
&lt;li&gt;agents receive structured JSON&lt;/li&gt;
&lt;li&gt;governance and rate limits stay enforced&lt;/li&gt;
&lt;/ul&gt;
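
&lt;p&gt;To make "bounded SQL compilation" a little more concrete, here is a minimal sketch of the general idea. It is not Adeloop's actual implementation. The domain definition, the table and column names, and the &lt;code&gt;compile_bounded_sql&lt;/code&gt; helper are all hypothetical. The caller only ever supplies metric and dimension names, which are looked up in a whitelist, and the row limit is clamped to a hard ceiling.&lt;/p&gt;

```python
# Hypothetical semantic domain: only these names can ever reach SQL.
SEMANTIC_DOMAIN = {
    "table": "orders",
    "metrics": {"total_revenue": "SUM(amount)"},
    "dimensions": {"customer": "customer_name"},
}
MAX_LIMIT = 100  # hard upper bound on returned rows


def compile_bounded_sql(metric, dimension, limit=10, domain=SEMANTIC_DOMAIN):
    """Compile a whitelisted metric/dimension pair into a single SELECT."""
    if metric not in domain["metrics"]:
        raise ValueError("unknown metric: " + metric)
    if dimension not in domain["dimensions"]:
        raise ValueError("unknown dimension: " + dimension)
    # Clamp the requested limit into the range 1..MAX_LIMIT.
    bounded_limit = max(1, min(int(limit), MAX_LIMIT))
    template = (
        "SELECT {dim} AS {dim_name}, {expr} AS {metric_name} "
        "FROM {table} GROUP BY {dim} "
        "ORDER BY {metric_name} DESC LIMIT {limit}"
    )
    return template.format(
        dim=domain["dimensions"][dimension],
        dim_name=dimension,
        expr=domain["metrics"][metric],
        metric_name=metric,
        table=domain["table"],
        limit=bounded_limit,
    )


sql = compile_bounded_sql("total_revenue", "customer", limit=10)
print(sql)
```

&lt;p&gt;Because free-form text from the agent never reaches the SQL string, an unknown name fails fast with an error instead of turning into an arbitrary query.&lt;/p&gt;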

&lt;p&gt;The default execution mode is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;semantic_sql_pushdown
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not Python.&lt;/p&gt;

&lt;p&gt;Not sandbox compute.&lt;/p&gt;

&lt;p&gt;Not “LLM writes random SQL”.&lt;/p&gt;




&lt;h1&gt;
  
  
  Example
&lt;/h1&gt;

&lt;p&gt;An external agent can ask:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"question"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Show top customers by revenue"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"limit"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb8h33cjmjlci1almwnnl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb8h33cjmjlci1almwnnl.png" alt=" " width="800" height="357"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Adeloop then:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;selects the semantic domain&lt;/li&gt;
&lt;li&gt;resolves semantic metrics/dimensions&lt;/li&gt;
&lt;li&gt;compiles safe SQL&lt;/li&gt;
&lt;li&gt;executes against Postgres/MySQL/Snowflake/etc&lt;/li&gt;
&lt;li&gt;returns answer + JSON + execution metadata&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"answer"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Top result is Acme with total_revenue = 124500"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"execution"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"mode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"semantic_sql_pushdown"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"engine"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"postgresql"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"sandboxUsed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
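
&lt;p&gt;On the consuming side, an agent could check the execution metadata before trusting a result. The sketch below parses the example response shown above and confirms that it ran as SQL pushdown without sandbox compute. The &lt;code&gt;ran_as_pushdown&lt;/code&gt; helper is hypothetical and is not part of any Adeloop client library.&lt;/p&gt;

```python
import json

# Example response copied from the article text.
RESPONSE_TEXT = """
{
  "answer": "Top result is Acme with total_revenue = 124500",
  "execution": {
    "mode": "semantic_sql_pushdown",
    "engine": "postgresql",
    "sandboxUsed": false
  }
}
"""


def ran_as_pushdown(response_text):
    """Return True if the response reports pushdown execution with no sandbox."""
    payload = json.loads(response_text)
    execution = payload.get("execution", {})
    return (
        execution.get("mode") == "semantic_sql_pushdown"
        and execution.get("sandboxUsed") is False
    )


print(ran_as_pushdown(RESPONSE_TEXT))
```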






&lt;h1&gt;
  
  
  MCP + OpenAPI Support
&lt;/h1&gt;

&lt;p&gt;We also added:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MCP-compatible JSON-RPC endpoint&lt;/li&gt;
&lt;li&gt;OpenAPI 3.1 action schema&lt;/li&gt;
&lt;li&gt;API key scopes&lt;/li&gt;
&lt;li&gt;usage logs&lt;/li&gt;
&lt;li&gt;semantic metadata endpoints&lt;/li&gt;
&lt;li&gt;deterministic domain routing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So tools like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;adeloopchat&lt;/li&gt;
&lt;li&gt;Claude tools&lt;/li&gt;
&lt;li&gt;Cursor&lt;/li&gt;
&lt;li&gt;n8n&lt;/li&gt;
&lt;li&gt;ChatGPT Actions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs37pxmh3pyjdyrzkmr9q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs37pxmh3pyjdyrzkmr9q.png" alt=" " width="799" height="355"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;can consume governed business data without direct warehouse access.&lt;/p&gt;




&lt;h1&gt;
  
  
  One Important Architecture Decision
&lt;/h1&gt;

&lt;p&gt;We intentionally did NOT add E2B/sandbox execution into the main API path.&lt;/p&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;p&gt;Because most business questions are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;aggregations&lt;/li&gt;
&lt;li&gt;grouped metrics&lt;/li&gt;
&lt;li&gt;dashboards&lt;/li&gt;
&lt;li&gt;top-N queries&lt;/li&gt;
&lt;li&gt;filters&lt;/li&gt;
&lt;li&gt;time-series analytics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those should execute through SQL pushdown near the data source.&lt;/p&gt;

&lt;p&gt;Python notebooks and sandbox compute belong later as async premium analysis jobs for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;forecasting&lt;/li&gt;
&lt;li&gt;anomaly detection&lt;/li&gt;
&lt;li&gt;ML&lt;/li&gt;
&lt;li&gt;simulations&lt;/li&gt;
&lt;li&gt;notebook/report generation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Normal analytics APIs should stay fast, deterministic, and scalable.&lt;/p&gt;




&lt;h1&gt;
  
  
  The Bigger Goal
&lt;/h1&gt;

&lt;p&gt;I think AI agents will need something equivalent to:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;a semantic execution layer for enterprise data&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Not just chat over databases.&lt;/p&gt;

&lt;p&gt;Something that handles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;governance&lt;/li&gt;
&lt;li&gt;semantic metrics&lt;/li&gt;
&lt;li&gt;execution planning&lt;/li&gt;
&lt;li&gt;safe query compilation&lt;/li&gt;
&lt;li&gt;federation&lt;/li&gt;
&lt;li&gt;caching&lt;/li&gt;
&lt;li&gt;observability&lt;/li&gt;
&lt;li&gt;API contracts for agents&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s the direction we’re building toward with Adeloop.&lt;/p&gt;

&lt;p&gt;Would love feedback from people building:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI agents&lt;/li&gt;
&lt;li&gt;semantic layers&lt;/li&gt;
&lt;li&gt;MCP tooling&lt;/li&gt;
&lt;li&gt;data infrastructure&lt;/li&gt;
&lt;li&gt;analytics engineering systems&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>semanticlayer</category>
      <category>snowflake</category>
      <category>databricks</category>
      <category>agents</category>
    </item>
    <item>
      <title>Databricks and FSx for ONTAP S3 Access Points — A Layer-by-Layer Validation of Observed Boundaries</title>
      <dc:creator>Yoshiki Fujiwara(藤原 善基)@AWS Community Builder</dc:creator>
      <pubDate>Sun, 24 May 2026 11:17:38 +0000</pubDate>
      <link>https://dev.to/aws-builders/databricks-and-fsx-for-ontap-s3-access-points-a-layer-by-layer-validation-of-observed-boundaries-p4d</link>
      <guid>https://dev.to/aws-builders/databricks-and-fsx-for-ontap-s3-access-points-a-layer-by-layer-validation-of-observed-boundaries-p4d</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Connecting Databricks to FSx for ONTAP S3 Access Points is significantly harder than Athena (&lt;a href="https://dev.to/aws-builders/query-nas-data-in-place-with-athena-and-fsx-for-ontap-s3-access-points-3lhh"&gt;Part 1&lt;/a&gt;). After testing every approach I could find — Unity Catalog External Locations, NFS mounts, Instance Profiles, multiple VPC configurations — here is what I found:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unity Catalog's session policy&lt;/strong&gt; initially blocked the FSx for ONTAP S3 AP ARN pattern → 403&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Setting the &lt;code&gt;access_point&lt;/code&gt; field&lt;/strong&gt; on the External Location partially resolves the session policy: explicit-path file read succeeds, but UC table creation, subdirectory listing, and write operations remain blocked — meaning UC governance features (lineage, tags, fine-grained access) cannot yet be applied&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NFS kernel mount&lt;/strong&gt; is blocked by seccomp by design (confirmed by Databricks Support)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instance Profile + boto3&lt;/strong&gt; works for direct S3 AP access (bypassing Unity Catalog)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spark read with explicit file path&lt;/strong&gt; works under UC governance — 1000 rows of sensor data readable with full schema inference, proving data access is possible even if table creation is blocked&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Quick Decision Guide:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Read-only SQL analytics on NAS data&lt;/strong&gt; → Use Athena (Part 1) or Snowflake External Table (&lt;a href="https://dev.to/aws-builders/snowflake-and-fsx-for-ontap-s3-access-points-from-access-denied-to-working-external-tables-9k8"&gt;Part 3&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Governed Databricks lakehouse on NAS data&lt;/strong&gt; → Stage via FPolicy → Lambda → S3 → Auto Loader → UC Managed Table&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exploratory PoC (time-limited)&lt;/strong&gt; → Instance Profile + boto3 with compensating controls&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This article is a layer-by-layer validation of observed integration boundaries between Databricks and FSx for ONTAP S3 Access Points. It is not an argument against Databricks. Databricks remains a strong platform for lakehouse, ML, and production Delta workloads. This article focuses narrowly on one integration boundary: direct access from Databricks to FSx for ONTAP S3 Access Points.&lt;/p&gt;

&lt;p&gt;This article documents the full troubleshooting journey, including the strace analysis that identified the root cause of NFS mount failures.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This article documents observed behavior in one validated environment. It should not be interpreted as a general compatibility statement for all Databricks configurations or future platform versions.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;GitHub Repository&lt;/strong&gt;: &lt;a href="https://github.com/Yoshiki0705/fsxn-lakehouse-integrations" rel="noopener noreferrer"&gt;fsxn-lakehouse-integrations&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you want to reproduce this validation, the repository's &lt;code&gt;integrations/databricks/&lt;/code&gt; directory contains environment setup notes, and &lt;code&gt;verification-pack/&lt;/code&gt; contains test templates and evidence recording formats. The verification pack is intentionally template-first, so validation runs can produce consistent, reviewable evidence across environments. Actual result files will be added as validation runs are completed.&lt;/p&gt;

&lt;p&gt;This article also includes a &lt;strong&gt;Snowflake ↔ Databricks concept mapping table&lt;/strong&gt; (showing which capabilities work on each platform) and an &lt;strong&gt;AI Readiness Score&lt;/strong&gt; to help teams quantitatively compare pattern options for FSx for ONTAP integration.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to Read This Article
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;This article is:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A reproduction-focused validation report&lt;/li&gt;
&lt;li&gt;Evidence from one environment (DBR 17.3 LTS, ap-northeast-1)&lt;/li&gt;
&lt;li&gt;A starting point for vendor confirmation and architecture discussion&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;This article is not:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A general compatibility statement&lt;/li&gt;
&lt;li&gt;A production certification&lt;/li&gt;
&lt;li&gt;A statement on behalf of Databricks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Read by role:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Databricks admin&lt;/strong&gt;: Unity Catalog External Location → Governance Impact Summary&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage engineer&lt;/strong&gt;: NFS Mount investigation → Evidence Matrix&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data engineer&lt;/strong&gt;: Instance Profile + boto3 → Next Validation Metrics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partner / SA&lt;/strong&gt;: Decision Matrix → Discovery Questions → Partner Conversation Guide&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Opening a support case&lt;/strong&gt;: Databricks Support Case Packet&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Prerequisite Concepts
&lt;/h2&gt;

&lt;p&gt;Before reading this article, it helps to understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unity Catalog Storage Credential&lt;/strong&gt; — an object that stores a reference to a cloud IAM role for accessing external storage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unity Catalog External Location&lt;/strong&gt; — maps a cloud storage path to a storage credential for governed access&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instance Profile on AWS&lt;/strong&gt; — an IAM role attached to an EC2 instance, providing credentials via IMDS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Databricks-managed VPC vs Customer-managed VPC&lt;/strong&gt; — whether Databricks or the customer controls the workspace network&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cluster access modes&lt;/strong&gt; — Standard (shared, multi-user with UC governance) and Dedicated (single-user with sudo access). Unity Catalog requires &lt;a href="https://docs.databricks.com/aws/en/compute/standard-overview" rel="noopener noreferrer"&gt;standard or dedicated access mode&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;S3 Access Point ARN vs S3 bucket ARN&lt;/strong&gt; — S3 AP uses &lt;code&gt;arn:aws:s3:&amp;lt;region&amp;gt;:&amp;lt;account&amp;gt;:accesspoint/&amp;lt;name&amp;gt;&lt;/code&gt;, not &lt;code&gt;arn:aws:s3:::&amp;lt;bucket&amp;gt;&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Driver vs executor behavior in Spark&lt;/strong&gt; — the driver orchestrates; executors run distributed tasks. Credentials and network paths may differ between them&lt;/li&gt;
&lt;/ul&gt;
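
&lt;p&gt;The ARN distinction in the list above can be made concrete with a small helper. This is a minimal sketch, not part of the validation code: the function name &lt;code&gt;classify_s3_arn&lt;/code&gt; is hypothetical, and it only inspects the colon-separated fields of the two ARN shapes named above (empty region and account for buckets, populated region and account plus an &lt;code&gt;accesspoint/&lt;/code&gt; resource for access points).&lt;/p&gt;

```python
def classify_s3_arn(arn: str) -> str:
    """Classify an S3 ARN as 'access_point', 'bucket', 'object', or 'unknown'.

    Hypothetical helper for illustration; it checks only the ARN shape.
    """
    # An ARN has six colon-separated fields:
    # arn:partition:service:region:account:resource
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "s3":
        return "unknown"
    region, account, resource = parts[3], parts[4], parts[5]

    # Access point ARNs carry a region and an account ID.
    if region and account and resource.startswith("accesspoint/"):
        return "access_point"

    # Bucket and object ARNs leave region and account empty.
    if not region and not account and resource:
        return "object" if "/" in resource else "bucket"

    return "unknown"


if __name__ == "__main__":
    # Placeholder account ID and names, not values from the test environment.
    print(classify_s3_arn("arn:aws:s3:ap-northeast-1:111122223333:accesspoint/example-ap"))
    print(classify_s3_arn("arn:aws:s3:::example-bucket"))
```

&lt;p&gt;A check like this is useful when reviewing IAM or session policies: a policy written only against bucket-shaped ARNs will not match an access point ARN.&lt;/p&gt;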




&lt;h2&gt;
  
  
  Key Concepts: Databricks Storage &amp;amp; Ingestion Architecture
&lt;/h2&gt;

&lt;p&gt;Understanding Databricks' storage and ingestion concepts is essential for evaluating FSx for ONTAP S3 AP integration boundaries.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Partner Quick Reference&lt;/strong&gt;: If your customer asks "Can Databricks read our NAS data via S3 Access Points?" — the answer is "partially, with limitations." File-level read works under UC governance, but table creation and directory listing are blocked. For governed analytics on NAS data today, recommend Snowflake External Table or Athena. For Databricks-specific workloads, recommend staged ingestion to S3 → UC Managed Table (see Recommended Architecture Pattern). If the customer already uses Databricks, the FPolicy → Lambda → S3 → Auto Loader pattern preserves full UC governance on ingested data. See Discovery Questions for customer qualification.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Storage Credential → External Location → External Table/Volume
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Storage Credential (IAM Role ARN + External ID)
    │
    └── External Location (cloud storage path + credential + access_point field)
            │
            ├── External Table (tabular data: Parquet, Delta, Iceberg)
            └── External Volume (non-tabular: images, documents, audio)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Concept&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;FSx S3 AP Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Storage Credential&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;IAM Role that Databricks assumes to access cloud storage. During AssumeRole, Databricks generates a session policy that restricts what the assumed session can do — even if the IAM role itself has broader permissions.&lt;/td&gt;
&lt;td&gt;✅ Created&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;External Location&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Maps S3 path to a Storage Credential; defines access boundary&lt;/td&gt;
&lt;td&gt;✅ Created (with &lt;code&gt;access_point&lt;/code&gt; field)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;External Table&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;UC-governed table whose data resides in External Location&lt;/td&gt;
&lt;td&gt;❌ CREATE TABLE blocked&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;External Volume&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;UC-governed volume for unstructured files in External Location&lt;/td&gt;
&lt;td&gt;❌ Blocked (same session policy issue)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;External Volume&lt;/strong&gt; is the Databricks equivalent of Snowflake's Directory Table — it provides governed access to non-tabular files (images, documents, audio, video). Since External Volume requires External Location creation with full subdirectory access, it is currently blocked by the same session policy limitation that blocks External Table creation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Auto Loader (Incremental Ingestion)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://docs.databricks.com/ingestion/auto-loader/index.html" rel="noopener noreferrer"&gt;Auto Loader&lt;/a&gt; is Databricks' equivalent of Snowflake's Snowpipe — it incrementally processes new files as they arrive in cloud storage.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;FSx S3 AP Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Directory Listing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Periodically lists directory to find new files&lt;/td&gt;
&lt;td&gt;⚠️ Requires External Location (blocked)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;File Notification&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Uses S3 Event Notifications + SQS for real-time detection&lt;/td&gt;
&lt;td&gt;❌ Not possible (FSx S3 AP doesn't support S3 Events)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Auto Loader supported formats&lt;/strong&gt; (8 formats): JSON, CSV, Parquet, Avro, ORC, XML, TEXT, BINARYFILE.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;FSx S3 AP latency context&lt;/strong&gt;: Even if Directory Listing mode were unblocked, FSx S3 AP ListObjectsV2 latency is significantly higher than native S3 (tens of seconds to minutes for large directories). This would impact Auto Loader polling intervals and new-file detection speed. Plan for minutes-level detection latency, not seconds.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Concept Mapping: Snowflake ↔ Databricks
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Snowflake Concept&lt;/th&gt;
&lt;th&gt;Databricks Equivalent&lt;/th&gt;
&lt;th&gt;FSx S3 AP (Snowflake)&lt;/th&gt;
&lt;th&gt;FSx S3 AP (Databricks)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Storage Integration&lt;/td&gt;
&lt;td&gt;Storage Credential&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;External Stage&lt;/td&gt;
&lt;td&gt;External Location&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅ (partial)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;External Table&lt;/td&gt;
&lt;td&gt;External Table&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌ Blocked&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Directory Table&lt;/td&gt;
&lt;td&gt;External Volume&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌ Blocked&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Snowpipe&lt;/td&gt;
&lt;td&gt;Auto Loader&lt;/td&gt;
&lt;td&gt;⚠️ (no S3 Events)&lt;/td&gt;
&lt;td&gt;❌ Blocked&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;COPY INTO&lt;/td&gt;
&lt;td&gt;COPY INTO / Auto Loader&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌ Blocked&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;AWS_ACCESS_POINT_ARN&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;access_point&lt;/code&gt; field&lt;/td&gt;
&lt;td&gt;✅ (resolves all)&lt;/td&gt;
&lt;td&gt;⚠️ (partial resolution)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cortex Search (RAG)&lt;/td&gt;
&lt;td&gt;Mosaic AI / MLflow&lt;/td&gt;
&lt;td&gt;✅ (via COPY INTO)&lt;/td&gt;
&lt;td&gt;⚠️ (boto3 + external)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Data Ingestion Alternatives for FSx for ONTAP (When Auto Loader Is Blocked)
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Throughput constraint&lt;/strong&gt;: All S3 AP operations are bounded by the FSx for ONTAP file system's provisioned throughput capacity (e.g., 128 MB/s in this validation environment). This throughput is shared with NFS/SMB workloads on the same file system. Plan ingestion windows and concurrent access accordingly.&lt;/p&gt;
&lt;/blockquote&gt;
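
&lt;p&gt;To size an ingestion window against that throughput ceiling, a back-of-the-envelope calculation is often enough. The sketch below is an assumption-laden lower bound: it treats 1 GB as 1024 MB, assumes the provisioned throughput is sustained for the whole transfer, and ignores per-object request latency. The function name and the &lt;code&gt;share&lt;/code&gt; parameter (the fraction of throughput left over after NFS/SMB workloads) are hypothetical.&lt;/p&gt;

```python
def ingestion_window_seconds(data_gb: float,
                             throughput_mb_s: float = 128.0,
                             share: float = 1.0) -> float:
    """Estimate the minimum time to move data_gb through the file system.

    throughput_mb_s: provisioned throughput (128 MB/s in this validation).
    share: assumed fraction of that throughput available to ingestion.
    """
    valid = throughput_mb_s > 0 and share > 0 and 1.0 >= share
    if not valid:
        raise ValueError("throughput must be positive and share must be in (0, 1]")
    data_mb = data_gb * 1024  # assumption: 1 GB = 1024 MB
    return data_mb / (throughput_mb_s * share)


if __name__ == "__main__":
    # 100 GB with the full 128 MB/s, then with half of it.
    print(ingestion_window_seconds(100))
    print(ingestion_window_seconds(100, share=0.5))
```

&lt;p&gt;Real transfers will take longer than this estimate, especially for many small files, so treat the result as a floor when scheduling ingestion.&lt;/p&gt;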

&lt;p&gt;Since Auto Loader requires External Location (currently blocked on FSx S3 AP), use these alternatives:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;th&gt;Governance&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;FPolicy → Lambda → S3 → Auto Loader&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;FPolicy detects file changes → Lambda copies to S3 → Auto Loader ingests&lt;/td&gt;
&lt;td&gt;Seconds&lt;/td&gt;
&lt;td&gt;✅ Full UC (on S3 copy)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AWS Glue ETL&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Glue job reads from FSx S3 AP → writes to S3/Delta&lt;/td&gt;
&lt;td&gt;Minutes&lt;/td&gt;
&lt;td&gt;AWS-side&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;EMR Serverless&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Spark job reads from FSx S3 AP → writes to S3/Delta&lt;/td&gt;
&lt;td&gt;Minutes&lt;/td&gt;
&lt;td&gt;AWS-side&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AWS DataSync&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Scheduled sync from FSx NFS → S3 bucket&lt;/td&gt;
&lt;td&gt;Minutes-Hours&lt;/td&gt;
&lt;td&gt;AWS-side&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SnapMirror to S3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;ONTAP-native replication to S3 bucket&lt;/td&gt;
&lt;td&gt;Minutes&lt;/td&gt;
&lt;td&gt;ONTAP-side&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;SnapMirror to S3 caveat&lt;/strong&gt;: Object metadata in SnapMirror S3 targets differs from NFS file metadata. Validate schema compatibility and file naming conventions before using SnapMirror S3 as an ingestion path for analytics engines.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Recommended production pattern:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FSx for ONTAP ──FPolicy──▶ Lambda ──▶ S3 Bucket ──▶ Auto Loader ──▶ Delta Table (UC governed)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
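
&lt;p&gt;The Auto Loader stage of this pattern runs against the staged S3 bucket, not against the FSx S3 AP. The sketch below builds the reader options for that stage. The helper name &lt;code&gt;autoloader_options&lt;/code&gt;, the bucket paths, and the table name in the usage comment are hypothetical; the &lt;code&gt;cloudFiles.*&lt;/code&gt; keys are standard Auto Loader options, and the format list mirrors the eight formats listed earlier.&lt;/p&gt;

```python
# Formats listed in this article as supported by Auto Loader.
SUPPORTED_FORMATS = {
    "json", "csv", "parquet", "avro", "orc", "xml", "text", "binaryFile",
}


def autoloader_options(file_format: str,
                       schema_location: str,
                       use_notifications: bool = False) -> dict:
    """Build the option map for spark.readStream.format("cloudFiles").

    use_notifications=False selects directory listing mode;
    True selects file notification mode on the staged bucket.
    """
    if file_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported Auto Loader format: {file_format}")
    return {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,
        "cloudFiles.useNotifications": str(use_notifications).lower(),
    }


# Usage on a Databricks cluster (not executed here; all names are placeholders):
#
# opts = autoloader_options("csv", "s3://staging-bucket/_schemas/sensors")
# (spark.readStream.format("cloudFiles")
#      .options(**opts)
#      .load("s3://staging-bucket/sensors/")
#      .writeStream
#      .option("checkpointLocation", "s3://staging-bucket/_checkpoints/sensors")
#      .trigger(availableNow=True)
#      .toTable("main.iot.sensors"))

if __name__ == "__main__":
    print(autoloader_options("csv", "s3://staging-bucket/_schemas/sensors"))
```

&lt;p&gt;Because the staged copy lives in a regular S3 bucket, file notification mode becomes an option again for this stage, which is not the case when reading the FSx S3 AP directly.&lt;/p&gt;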



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Iceberg interoperability note&lt;/strong&gt;: Once data is in UC as a managed Delta or Iceberg table, external engines can access it via &lt;a href="https://docs.databricks.com/aws/en/external-access/" rel="noopener noreferrer"&gt;UC's Iceberg REST Catalog&lt;/a&gt;, which lets Athena, EMR, and Trino query the same governed table without further data duplication. This makes the staged S3 → UC path a hub for multi-engine access.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  AI Readiness Score
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;Governance&lt;/th&gt;
&lt;th&gt;Performance&lt;/th&gt;
&lt;th&gt;AI Capability&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;th&gt;Operational Simplicity&lt;/th&gt;
&lt;th&gt;Overall&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Athena + FSx S3 AP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;★★★☆☆&lt;/td&gt;
&lt;td&gt;★★★★☆&lt;/td&gt;
&lt;td&gt;★☆☆☆☆ (SQL only)&lt;/td&gt;
&lt;td&gt;★★★★★&lt;/td&gt;
&lt;td&gt;★★★★★&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3.6&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Snowflake External Table&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;★★★★☆&lt;/td&gt;
&lt;td&gt;★★★☆☆&lt;/td&gt;
&lt;td&gt;★★★★☆ (Cortex AI)&lt;/td&gt;
&lt;td&gt;★★★★★&lt;/td&gt;
&lt;td&gt;★★★★☆&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4.0&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Staged to S3 → UC Table&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;★★★★★&lt;/td&gt;
&lt;td&gt;★★★★★&lt;/td&gt;
&lt;td&gt;★★★★★ (full Mosaic AI)&lt;/td&gt;
&lt;td&gt;★★☆☆☆&lt;/td&gt;
&lt;td&gt;★★☆☆☆&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3.8&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;boto3 PoC (Databricks)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;★☆☆☆☆&lt;/td&gt;
&lt;td&gt;★★☆☆☆&lt;/td&gt;
&lt;td&gt;★★★☆☆ (driver-only)&lt;/td&gt;
&lt;td&gt;★★★★★&lt;/td&gt;
&lt;td&gt;★★★☆☆&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.8&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Bedrock KB + FSx S3 AP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;★★★☆☆&lt;/td&gt;
&lt;td&gt;★★★★☆&lt;/td&gt;
&lt;td&gt;★★★★☆ (RAG)&lt;/td&gt;
&lt;td&gt;★★★★☆&lt;/td&gt;
&lt;td&gt;★★★★☆&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3.8&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Governance&lt;/strong&gt;: UC lineage, tags, masking, row filters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance&lt;/strong&gt;: Query latency, distributed processing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI Capability&lt;/strong&gt;: Breadth of AI/ML functions available&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt;: Storage efficiency, compute cost&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operational Simplicity&lt;/strong&gt;: Setup, maintenance, pipeline complexity&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Scoring methodology&lt;/strong&gt;: Each dimension was rated by the author based on validated evidence in this article series. This is not an official AWS assessment or certification. Scores reflect observed capabilities in one test environment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance note&lt;/strong&gt;: Performance scores reflect relative comparison within FSx S3 AP access patterns, not comparison with native S3 bucket performance. All patterns accessing FSx S3 AP have higher latency than equivalent native S3 operations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to use this score&lt;/strong&gt;: Use Overall score as a starting point for pattern selection. Scores ≥ 4.0 indicate strong fit for governed production workloads. Scores 3.5–3.9 indicate viable paths with trade-offs. Scores &amp;lt; 3.0 indicate PoC-only paths requiring compensating controls.&lt;/p&gt;
&lt;/blockquote&gt;
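
&lt;p&gt;The article does not state how the Overall column is derived, but every row in the table is consistent with an unweighted mean of the five star ratings, rounded to one decimal. The sketch below reproduces that arithmetic and applies the bands from the note above. Treat the formula as an inference, not as the author's stated method; the function names are hypothetical, and scores from 3.0 to 3.4 fall into a range the note does not classify.&lt;/p&gt;

```python
# Star counts per pattern, in table order:
# Governance, Performance, AI Capability, Cost, Operational Simplicity.
STARS = {
    "Athena + FSx S3 AP": (3, 4, 1, 5, 5),
    "Snowflake External Table": (4, 3, 4, 5, 4),
    "Staged to S3 → UC Table": (5, 5, 5, 2, 2),
    "boto3 PoC (Databricks)": (1, 2, 3, 5, 3),
    "Bedrock KB + FSx S3 AP": (3, 4, 4, 4, 4),
}


def overall(stars) -> float:
    """Unweighted mean of the five dimensions, rounded to one decimal."""
    return round(sum(stars) / len(stars), 1)


def band(score: float) -> str:
    """Map an Overall score to the bands described in the scoring note."""
    if score >= 4.0:
        return "strong fit for governed production workloads"
    if score >= 3.5:
        return "viable with trade-offs"
    if score >= 3.0:
        return "not classified in the scoring note"
    return "PoC-only, requires compensating controls"


if __name__ == "__main__":
    for pattern, stars in STARS.items():
        score = overall(stars)
        print(f"{pattern}: {score} ({band(score)})")
```

&lt;p&gt;If your priorities differ, for example governance matters more than cost, replacing the plain mean with a weighted one is a small change and may reorder the patterns.&lt;/p&gt;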

&lt;p&gt;&lt;strong&gt;When to choose which:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Choose &lt;strong&gt;Snowflake External Table&lt;/strong&gt; (4.0) when governed AI on NAS data without copying is the priority&lt;/li&gt;
&lt;li&gt;Choose &lt;strong&gt;Staged to S3 → UC Table&lt;/strong&gt; (3.8) when maximum Databricks performance and full Mosaic AI are required (accepts data duplication cost)&lt;/li&gt;
&lt;li&gt;Choose &lt;strong&gt;Bedrock KB&lt;/strong&gt; (3.8) when AWS-native RAG with zero-copy on FSx is the primary requirement&lt;/li&gt;
&lt;li&gt;Choose &lt;strong&gt;boto3 PoC&lt;/strong&gt; (2.8) only for time-limited exploration with explicit approval; with compensating controls (see Compensating Controls section), governance risk can be partially mitigated for PoC scope. Post-expiration actions must be defined: terminate cluster, remove instance profile, archive evidence.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Goal
&lt;/h2&gt;

&lt;p&gt;Process unstructured data (images, documents, audio) stored on FSx for ONTAP from Databricks — without copying data to S3. FSx for ONTAP S3 Access Points should make this possible by exposing NFS/SMB file data via S3 API.&lt;/p&gt;

&lt;p&gt;In &lt;a href="https://dev.to/aws-builders/query-nas-data-in-place-with-athena-and-fsx-for-ontap-s3-access-points-3lhh"&gt;Part 1&lt;/a&gt;, Athena worked cleanly in my validation using the &lt;a href="https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/tutorial-query-data-with-athena.html" rel="noopener noreferrer"&gt;official AWS tutorial pattern&lt;/a&gt;. Databricks, however, has multiple security layers that interact with S3 AP in unexpected ways.&lt;/p&gt;




&lt;h2&gt;
  
  
  Test Environment
&lt;/h2&gt;

&lt;p&gt;I tested across two workspace configurations:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Runtime scope&lt;/strong&gt;: Only DBR 17.3 LTS (Spark 4.0.0) was tested. This article does not cover DBR 16.x, 18.x, ML runtimes, GPU runtimes, serverless compute, or access modes beyond those listed in the test environment. Behavior may differ across runtime versions and compute types.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────────────┐
│ Workspace 1: Databricks-managed VPC                                 │
│ - VPC created and managed by Databricks                             │
│ - Limited network control                                           │
│ - VPC Peering to FSx for ONTAP VPC                                  │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│ Workspace 2: Customer-managed VPC (same VPC as FSx for ONTAP)       │
│ - Full network control                                              │
│ - Direct connectivity to FSx for ONTAP (no peering needed)          │
│ - NAT Gateway for Databricks control plane                          │
└─────────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cluster modes tested:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Standard (Shared Access)&lt;/li&gt;
&lt;li&gt;Dedicated (Single User) — provides sudo/root access&lt;/li&gt;
&lt;li&gt;Dedicated with Instance Profile&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All tests used DBR 17.3 LTS (Spark 4.0.0), ap-northeast-1.&lt;/p&gt;




&lt;h2&gt;
  
  
  Approach 1: Unity Catalog External Location
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Setup
&lt;/h3&gt;

&lt;p&gt;The Databricks-governed path for S3 data access is to create a Storage Credential and External Location. I tested whether the same pattern could work with an FSx for ONTAP S3 Access Point.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# What I expected to work
&lt;/span&gt;&lt;span class="n"&gt;files&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ls&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://&amp;lt;FSx-S3-AP-alias&amp;gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Error
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AccessDenied: User: arn:aws:sts::&amp;lt;ACCOUNT&amp;gt;:assumed-role/databricks-...-cross-account-role/
  databricks-unity-catalog-credential-&amp;lt;WORKSPACE_ID&amp;gt;
is not authorized to perform: s3:ListBucket on resource:
  "arn:aws:s3:&amp;lt;REGION&amp;gt;:&amp;lt;ACCOUNT&amp;gt;:accesspoint/&amp;lt;AP_NAME&amp;gt;"
because no session policy allows the s3:ListBucket action
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
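
&lt;p&gt;When recording evidence for errors like this one, it helps to pull out the denied action, the resource ARN, and whether the denial came from a session policy. The parser below is a minimal sketch that assumes the message wording shown above; the function name &lt;code&gt;parse_access_denied&lt;/code&gt; is hypothetical, and the sample message uses placeholder account and role names.&lt;/p&gt;

```python
import re

# Matches: not authorized to perform: ACTION on resource: "ARN"
_DENIED = re.compile(
    r'not authorized to perform:\s*(\S+)\s+on resource:\s*"([^"]+)"'
)


def parse_access_denied(message: str):
    """Extract action, resource, and denial source from an AccessDenied message.

    Returns None when the message does not follow the expected wording.
    """
    # Collapse line breaks and indentation so the pattern sees one line.
    flat = " ".join(message.split())
    match = _DENIED.search(flat)
    if match is None:
        return None
    return {
        "action": match.group(1),
        "resource": match.group(2),
        "session_policy_denial": "no session policy allows" in flat,
    }


if __name__ == "__main__":
    sample = (
        "AccessDenied: User: arn:aws:sts::111122223333:assumed-role/"
        "example-role/example-session\n"
        "is not authorized to perform: s3:ListBucket on resource:\n"
        '  "arn:aws:s3:ap-northeast-1:111122223333:accesspoint/example-ap"\n'
        "because no session policy allows the s3:ListBucket action"
    )
    print(parse_access_denied(sample))
```

&lt;p&gt;The &lt;code&gt;session_policy_denial&lt;/code&gt; flag is the useful part for triage: it separates denials caused by the generated session policy from denials caused by the IAM role or the access point policy.&lt;/p&gt;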



&lt;h3&gt;
  
  
  Observed Boundary
&lt;/h3&gt;

&lt;p&gt;Unity Catalog applies a &lt;strong&gt;session policy&lt;/strong&gt; when it calls &lt;code&gt;AssumeRole&lt;/code&gt;. The session's effective permissions are the intersection of the role's identity-based policies and the session policy, so even if the IAM role has &lt;code&gt;s3:*&lt;/code&gt; on &lt;code&gt;*&lt;/code&gt;, the assumed session can do only what the session policy also allows.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The evidence narrows the failure domain, but does not identify Databricks internal implementation details.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In this validation, the generated session policy allowed access to a standard S3 bucket path but not to the FSx for ONTAP S3 Access Point ARN pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;arn:aws:s3:::bucket-name       ✅ Allowed
arn:aws:s3:::bucket-name/*     ✅ Allowed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But FSx for ONTAP S3 AP uses a different ARN format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;arn:aws:s3:&amp;lt;region&amp;gt;:&amp;lt;account&amp;gt;:accesspoint/&amp;lt;name&amp;gt;    ❌ Not in session policy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
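
&lt;p&gt;The effect can be illustrated with a toy model of how a session policy narrows a role. This is a deliberately simplified sketch: it models only Allow statements with wildcard matching and ignores explicit Deny, conditions, and resource policies. The statement lists are hypothetical stand-ins shaped like the ARNs above, not the policy Databricks actually generates.&lt;/p&gt;

```python
from fnmatch import fnmatchcase


def _matches(statements, action: str, resource: str) -> bool:
    """True if any (action_pattern, resource_pattern) pair allows the request."""
    return any(
        fnmatchcase(action, action_pattern) and fnmatchcase(resource, resource_pattern)
        for action_pattern, resource_pattern in statements
    )


def is_allowed(role_policy, session_policy, action: str, resource: str) -> bool:
    """A request passes only if BOTH the role and the session policy allow it."""
    return (_matches(role_policy, action, resource)
            and _matches(session_policy, action, resource))


# Hypothetical policies for illustration only.
ROLE_POLICY = [("s3:*", "*")]
SESSION_POLICY = [
    ("s3:ListBucket", "arn:aws:s3:::bucket-name"),
    ("s3:GetObject", "arn:aws:s3:::bucket-name/*"),
]

if __name__ == "__main__":
    bucket = "arn:aws:s3:::bucket-name"
    access_point = "arn:aws:s3:ap-northeast-1:111122223333:accesspoint/example-ap"
    print(is_allowed(ROLE_POLICY, SESSION_POLICY, "s3:ListBucket", bucket))
    print(is_allowed(ROLE_POLICY, SESSION_POLICY, "s3:ListBucket", access_point))
```

&lt;p&gt;In this model the role alone would allow both requests, but the access point request fails because no session policy statement matches its ARN. That is the same shape as the error observed above.&lt;/p&gt;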



&lt;h3&gt;
  
  
  Proof
&lt;/h3&gt;

&lt;p&gt;The same IAM role works fine for regular S3 buckets through Unity Catalog:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# This works — regular S3 bucket
&lt;/span&gt;&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ls&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://my-workspace-storage-bucket/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# SUCCESS
&lt;/span&gt;
&lt;span class="c1"&gt;# This fails — FSx for ONTAP S3 Access Point
&lt;/span&gt;&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ls&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://&amp;lt;FSx-S3-AP-alias&amp;gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# AccessDenied: no session policy allows...
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Status
&lt;/h3&gt;

&lt;p&gt;In my initial validation, this behaved like a platform boundary in Unity Catalog's generated session policy. I opened a support case to confirm whether S3 Access Point ARN patterns can be supported for external locations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before (&lt;code&gt;access_point&lt;/code&gt; field not set)&lt;/strong&gt; — Unity Catalog session policy blocks all S3 AP operations:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhh05qqisia04lu739xp1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhh05qqisia04lu739xp1.png" alt="Session policy error before access_point field — UNAUTHORIZED_ACCESS on dbutils.fs.ls" width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Without the &lt;code&gt;access_point&lt;/code&gt; field, &lt;code&gt;dbutils.fs.ls&lt;/code&gt; on the S3 AP alias returns UNAUTHORIZED_ACCESS. The session policy only allows standard S3 bucket ARNs.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Update (2026-05-24): &lt;code&gt;access_point&lt;/code&gt; Field Resolves Session Policy
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Critical Update (2026-05-26)&lt;/strong&gt;: Databricks Support subsequently confirmed that the &lt;code&gt;access_point&lt;/code&gt; field was never released as a generally available feature and has been removed from documentation. The partial success described below is "a side effect of incomplete internal handling, not a supported code path." Unity Catalog External Locations do not currently support S3 Access Points. See the full support confirmation at the end of this section.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Databricks Support initially indicated (May 2026) that Unity Catalog External Locations accept an &lt;code&gt;access_point&lt;/code&gt; field. Setting this field includes the S3 AP ARN in the generated session policy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Configuration that works:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;External Location&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;URL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;s3://&amp;lt;FSx-S3-AP-alias&amp;gt;/&lt;/span&gt;
  &lt;span class="na"&gt;Credential&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;&amp;lt;storage-credential-name&amp;gt;&lt;/span&gt;
  &lt;span class="na"&gt;access_point&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;arn:aws:s3:&amp;lt;region&amp;gt;:&amp;lt;account&amp;gt;:accesspoint/&amp;lt;ap-name&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;API call to set the field:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; PATCH &lt;span class="se"&gt;\&lt;/span&gt;
  https://&amp;lt;workspace&amp;gt;/api/2.1/unity-catalog/external-locations/&amp;lt;location-name&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &amp;lt;token&amp;gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"access_point": "arn:aws:s3:&amp;lt;region&amp;gt;:&amp;lt;account&amp;gt;:accesspoint/&amp;lt;ap-name&amp;gt;"}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What now works under UC governance:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operation&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;dbutils.fs.ls("s3://&amp;lt;alias&amp;gt;/")&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Top-level listing (287 items)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;dbutils.fs.head("s3://&amp;lt;alias&amp;gt;/file.txt")&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Read file content&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;spark.read.text("s3://&amp;lt;alias&amp;gt;/file.txt")&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Spark read with explicit file path&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;spark.read.csv("s3://&amp;lt;alias&amp;gt;/path/to/file.csv")&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;1000 rows, schema inferred&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;After (&lt;code&gt;access_point&lt;/code&gt; field set)&lt;/strong&gt; — Top-level listing succeeds, 287 items visible:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff1q5n5m33qy9vg8zzijl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff1q5n5m33qy9vg8zzijl.png" alt="dbutils.fs.ls success — 287 items listed from FSx for ONTAP S3 AP" width="800" height="428"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;With the &lt;code&gt;access_point&lt;/code&gt; field configured, &lt;code&gt;dbutils.fs.ls&lt;/code&gt; at the top level returns 287 items from the FSx for ONTAP volume.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sensor data read via Spark&lt;/strong&gt; — 1000 rows with schema inference:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flydan8xi3arhr2jxb8r6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flydan8xi3arhr2jxb8r6.png" alt="Spark DataFrame reading sensor CSV from FSx for ONTAP S3 AP — 1000 rows" width="800" height="428"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;&lt;code&gt;spark.read.csv&lt;/code&gt; with explicit file path successfully reads 1000 sensor readings with full schema inference (timestamp, machine_id, temperature_c, vibration_mm_s, pressure_bar, rpm, status, location).&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What still does NOT work:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operation&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;th&gt;Error&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;dbutils.fs.ls("s3://&amp;lt;alias&amp;gt;/subdir/")&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;AccessDenied on getFileStatus&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;spark.read.load("s3://&amp;lt;alias&amp;gt;/subdir/")&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Forbidden (directory-level access)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CREATE TABLE LOCATION 's3://&amp;lt;alias&amp;gt;/...'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;UC_CLOUD_STORAGE_ACCESS_FAILURE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;dbutils.fs.cp&lt;/code&gt; (PutObject)&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;AccessDenied&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Remaining blockers&lt;/strong&gt; — Subdirectory listing and UC table creation fail:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsr0tsyysbj2v5argvh4j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsr0tsyysbj2v5argvh4j.png" alt="Subdirectory ls blocked and CREATE TABLE fails — UC governance cannot be applied" width="800" height="428"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Subdirectory &lt;code&gt;dbutils.fs.ls&lt;/code&gt; returns UNAUTHORIZED_ACCESS. &lt;code&gt;CREATE TABLE LOCATION&lt;/code&gt; fails with UC_CLOUD_STORAGE_ACCESS_FAILURE. Without a UC table, governance features (lineage, tags, fine-grained access control) cannot be applied.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6rkdqqy3hy548yy8s6mb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6rkdqqy3hy548yy8s6mb.png" alt="Summary of what works and what doesn't — governance impact" width="800" height="428"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Summary: Data is readable but not governable. The critical blocker is &lt;code&gt;CREATE TABLE LOCATION&lt;/code&gt; failure, which prevents Unity Catalog governance from being applied to the data.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key pattern&lt;/strong&gt;: File-level read operations succeed (GetObject with explicit key). Directory-level operations (ListObjectsV2 with prefix, HeadObject on prefix) fail for subdirectories. This suggests the session policy scopes ListObjectsV2 to the root prefix only.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Implication&lt;/strong&gt;: Explicit-path file read works, but without UC table creation, Unity Catalog governance features — lineage, fine-grained access control, governance tags, column masking, row filtering — cannot be applied. The data is technically readable through the External Location path but not registerable as a governed UC table. This limits the practical value for production governance use cases until the subdirectory listing and table creation issues are resolved.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Requirements for this path:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Customer-managed VPC workspace (same VPC as FSx for ONTAP)&lt;/li&gt;
&lt;li&gt;External Location with &lt;code&gt;access_point&lt;/code&gt; field set&lt;/li&gt;
&lt;li&gt;Storage Credential IAM role with S3 AP permissions&lt;/li&gt;
&lt;li&gt;NAT Gateway for control plane connectivity&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Approach 2: NFS Mount (Managed VPC)
&lt;/h2&gt;
&lt;h3&gt;
  
  
  The Idea
&lt;/h3&gt;

&lt;p&gt;If S3 AP doesn't work through Unity Catalog, mount the FSx for ONTAP volume directly via NFS.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Setup
&lt;/h3&gt;

&lt;p&gt;I created a VPC Peering connection between the Databricks-managed VPC and the FSx for ONTAP VPC, then updated the route tables and security groups.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Result
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;%sh
&lt;span class="nb"&gt;timeout &lt;/span&gt;3 bash &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s1"&gt;'echo &amp;gt; /dev/tcp/10.0.3.133/2049'&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"REACHABLE"&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"NOT REACHABLE"&lt;/span&gt;
&lt;span class="c"&gt;# NOT REACHABLE&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;NFS port (TCP 2049) is unreachable&lt;/strong&gt; from the Databricks-managed VPC, even with VPC Peering configured. On the customer-controlled side, route tables and the FSx for ONTAP security groups were configured to allow NFS. However, cluster-side egress remained governed by the Databricks-managed environment, and NFS egress was not permitted.&lt;/p&gt;
&lt;h3&gt;
  
  
  Lesson
&lt;/h3&gt;

&lt;p&gt;A Databricks-managed VPC gives you limited network control. The egress rules on cluster instances are managed by Databricks, not by customer-added security group rules.&lt;/p&gt;


&lt;h2&gt;
  
  
  Approach 3: NFS Mount (Customer-managed VPC)
&lt;/h2&gt;
&lt;h3&gt;
  
  
  The Setup
&lt;/h3&gt;

&lt;p&gt;Deployed a new workspace in the &lt;strong&gt;same VPC&lt;/strong&gt; as FSx for ONTAP. No peering needed — direct L3 connectivity.&lt;/p&gt;
&lt;h3&gt;
  
  
  Network Verification (All Pass)
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;%sh
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"TCP 2049 (NFS):"&lt;/span&gt;
&lt;span class="nb"&gt;timeout &lt;/span&gt;3 bash &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s1"&gt;'echo &amp;gt; /dev/tcp/10.0.3.133/2049'&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"REACHABLE"&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"TCP 111 (portmapper):"&lt;/span&gt;
&lt;span class="nb"&gt;timeout &lt;/span&gt;3 bash &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s1"&gt;'echo &amp;gt; /dev/tcp/10.0.3.133/111'&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"REACHABLE"&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"TCP 635 (mountd):"&lt;/span&gt;
&lt;span class="nb"&gt;timeout &lt;/span&gt;3 bash &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s1"&gt;'echo &amp;gt; /dev/tcp/10.0.3.133/635'&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"REACHABLE"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TCP 2049 (NFS): REACHABLE ✅
TCP 111 (portmapper): REACHABLE ✅
TCP 635 (mountd): REACHABLE ✅
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: The &lt;code&gt;/dev/tcp&lt;/code&gt; test confirms TCP reachability. NFSv3 mountd may use TCP or UDP depending on configuration. The exact transport should be validated with &lt;code&gt;rpcinfo&lt;/code&gt; if needed.&lt;/p&gt;
&lt;/blockquote&gt;
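
&lt;p&gt;The same TCP reachability check can be scripted from a Python notebook cell. The following is a minimal sketch, not the check used in this validation: the helper names &lt;code&gt;tcp_reachable&lt;/code&gt; and &lt;code&gt;report&lt;/code&gt; are hypothetical, and like &lt;code&gt;/dev/tcp&lt;/code&gt; it only tests TCP, so it says nothing about UDP transports.&lt;/p&gt;

```python
import socket

# Port labels mirror the shell check above; the dict name is hypothetical.
NFS_PORTS = {2049: "NFS", 111: "portmapper", 635: "mountd"}


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unroutable addresses.
        return False


def report(host, ports=NFS_PORTS, timeout=3.0):
    """Map each port to REACHABLE or NOT REACHABLE (TCP only)."""
    return {
        port: "REACHABLE" if tcp_reachable(host, port, timeout) else "NOT REACHABLE"
        for port in ports
    }
```

&lt;p&gt;Calling &lt;code&gt;report("10.0.3.133")&lt;/code&gt; from the cluster would return one status per port, which makes it easier to rerun the same check across workspaces.&lt;/p&gt;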
&lt;h3&gt;
  
  
  sudo Access (Dedicated Mode)
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;%sh
&lt;span class="nb"&gt;sudo whoami&lt;/span&gt;
&lt;span class="c"&gt;# root ✅&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  NFS Client Installation and Export Verification
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;%sh
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; nfs-common
showmount &lt;span class="nt"&gt;-e&lt;/span&gt; 10.0.3.133
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Export list for 10.0.3.133:
/vol1 (everyone) ✅
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Everything looks perfect. Network connected, root access available, NFS exports visible. Let's mount:&lt;/p&gt;
&lt;h3&gt;
  
  
  The Mount Attempt
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;%sh
&lt;span class="nb"&gt;sudo mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /mnt/fsxn
&lt;span class="nb"&gt;sudo &lt;/span&gt;mount &lt;span class="nt"&gt;-t&lt;/span&gt; nfs &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;nfsvers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3,nolock 10.0.3.133:/vol1 /mnt/fsxn
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mount.nfs: access denied by server while mounting 10.0.3.133:/vol1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Wait, what?&lt;/strong&gt; The server is showing the export to everyone, we have root access, the network is connected... why "access denied by server"?&lt;/p&gt;


&lt;h2&gt;
  
  
  The Investigation: Why NFS Mount Fails
&lt;/h2&gt;

&lt;p&gt;This is where it gets interesting. The error message says "access denied &lt;strong&gt;by server&lt;/strong&gt;" — but is it really the server?&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 1: Verify ONTAP Export Policy
&lt;/h3&gt;

&lt;p&gt;Via ONTAP REST API (accessible from the same cluster):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"rules"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"clients"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="nl"&gt;"match"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"0.0.0.0/0"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ro_rule"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"any"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"rw_rule"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"any"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"superuser"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"any"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"protocols"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"any"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The export policy is &lt;strong&gt;maximally permissive&lt;/strong&gt;: all clients, all protocols, read-write, superuser. The ONTAP export policy is not what is denying access.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Important&lt;/strong&gt;: This permissive export policy was used only to eliminate ONTAP export restrictions as a variable during troubleshooting. It is not a production recommendation. For production, restrict: client CIDR, protocol, read/write rule, superuser mapping, and volume/junction path scope.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  ONTAP Production Hardening Checklist
&lt;/h3&gt;

&lt;p&gt;For production deployments, harden the ONTAP configuration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Restrict export policy client CIDR to known analytics subnets only&lt;/li&gt;
&lt;li&gt;[ ] Avoid &lt;code&gt;rw=any&lt;/code&gt; and &lt;code&gt;superuser=any&lt;/code&gt; — use explicit security flavors&lt;/li&gt;
&lt;li&gt;[ ] Map S3 Access Point file system user to a least-privilege NAS user (not root/UID 0)&lt;/li&gt;
&lt;li&gt;[ ] Validate NFS/SMB ACL behavior when S3 AP is active&lt;/li&gt;
&lt;li&gt;[ ] Validate S3 API access against file-level permissions&lt;/li&gt;
&lt;li&gt;[ ] Capture ONTAP audit evidence where required (&lt;a href="https://docs.netapp.com/us-en/ontap/nas-audit/index.html" rel="noopener noreferrer"&gt;ONTAP FPolicy&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;[ ] Document junction path and volume scope&lt;/li&gt;
&lt;li&gt;[ ] Isolate analytics volumes from production NFS/SMB workloads if throughput contention is a concern&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 2: strace the mount command
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;%sh
&lt;span class="nb"&gt;sudo &lt;/span&gt;strace &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;trace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;mount mount &lt;span class="nt"&gt;-t&lt;/span&gt; nfs &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;nfsvers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3,nolock 10.0.3.133:/vol1 /mnt/fsxn 2&amp;gt;&amp;amp;1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;mount.nfs: trying 10.0.3.133 prog 100003 vers 3 prot TCP port 2049
mount.nfs: trying 10.0.3.133 prog 100005 vers 3 prot UDP port 635
mount&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"10.0.3.133:/vol1"&lt;/span&gt;, &lt;span class="s2"&gt;"/mnt/fsxn"&lt;/span&gt;, &lt;span class="s2"&gt;"nfs"&lt;/span&gt;, ...&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nt"&gt;-1&lt;/span&gt; EACCES &lt;span class="o"&gt;(&lt;/span&gt;Permission denied&lt;span class="o"&gt;)&lt;/span&gt;
mount.nfs: mount&lt;span class="o"&gt;(&lt;/span&gt;2&lt;span class="o"&gt;)&lt;/span&gt;: Permission denied
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



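&lt;p&gt;&lt;strong&gt;Key finding&lt;/strong&gt;: &lt;code&gt;mount.nfs&lt;/code&gt; successfully reaches both NFS (port 2049) and mountd (port 635), but the &lt;code&gt;mount()&lt;/code&gt; &lt;strong&gt;syscall&lt;/strong&gt; returns &lt;code&gt;EACCES&lt;/code&gt;. The denial surfaces at the local syscall boundary. On its own, strace cannot tell whether the kernel rejected the call locally or relayed a server refusal through the in-kernel NFS client, so the next steps test the server and the local runtime separately.&lt;/p&gt;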

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TCP/UDP note&lt;/strong&gt;: The initial reachability check used &lt;code&gt;/dev/tcp&lt;/code&gt;, confirming TCP reachability. During the actual mount attempt, &lt;code&gt;mount.nfs&lt;/code&gt; tried mountd over UDP as shown in the strace output. This is not a contradiction — NFSv3 mountd may use either transport. For production troubleshooting, use &lt;code&gt;rpcinfo&lt;/code&gt; and packet capture to confirm the actual protocol and port mapping.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Step 3: Manual NFS RPC Calls (User-space)
&lt;/h3&gt;

&lt;p&gt;To prove ONTAP is granting access, I performed manual NFS RPC calls using Python sockets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;struct&lt;/span&gt;

&lt;span class="c1"&gt;# MOUNT RPC (program 100005, version 3, procedure MNT)
# mount_rpc_packet: pre-built, XDR-encoded MNT call (construction not shown here)
&lt;/span&gt;&lt;span class="n"&gt;sock&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AF_INET&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SOCK_DGRAM&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;settimeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sendto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mount_rpc_packet&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;10.0.3.133&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;635&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;recv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Parse: status=0 (MNT3_OK), file_handle=44 bytes
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MOUNT RPC: SUCCESS ✅&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# NFS3 FSINFO, GETATTR, READDIRPLUS — all succeed
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NFS3 FSINFO: SUCCESS ✅&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NFS3 GETATTR: SUCCESS ✅&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NFS3 READDIRPLUS: SUCCESS ✅&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;All NFS operations succeed at user-space level.&lt;/strong&gt; ONTAP grants full access. The problem is not the server.&lt;/p&gt;
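
&lt;p&gt;The snippet above does not show how &lt;code&gt;mount_rpc_packet&lt;/code&gt; is built. As a rough sketch of what such a packet could look like, the code below encodes an ONC RPC v2 call to MOUNT v3 procedure MNT and parses the reply. It assumes &lt;code&gt;AUTH_NONE&lt;/code&gt; credentials, which the server may or may not accept depending on the export policy; the packet used in this validation may have used &lt;code&gt;AUTH_SYS&lt;/code&gt; instead. The function names are hypothetical.&lt;/p&gt;

```python
import struct

MOUNT_PROGRAM = 100005   # ONC RPC program number of the MOUNT protocol
MOUNT_VERSION = 3        # MOUNT v3 pairs with NFSv3
MOUNTPROC3_MNT = 1       # procedure number of MNT
AUTH_NONE = 0            # assumption: the server accepts AUTH_NONE for MNT


def xdr_string(value):
    """XDR-encode bytes: 4-byte big-endian length, data, zero padding to 4 bytes."""
    padding = (4 - len(value) % 4) % 4
    return struct.pack("!I", len(value)) + value + b"\x00" * padding


def build_mnt_call(export_path, xid=1):
    """Build an RPC v2 CALL for MOUNTPROC3_MNT, sized for one UDP datagram."""
    header = struct.pack(
        "!6I",
        xid,             # transaction id, echoed back in the reply
        0,               # msg_type: CALL
        2,               # RPC protocol version
        MOUNT_PROGRAM,
        MOUNT_VERSION,
        MOUNTPROC3_MNT,
    )
    credential = struct.pack("!2I", AUTH_NONE, 0)  # flavor, zero-length body
    verifier = struct.pack("!2I", AUTH_NONE, 0)
    return header + credential + verifier + xdr_string(export_path.encode())


def parse_mnt_reply(reply, xid=1):
    """Return (mountstat3, file_handle) from an accepted MNT reply."""
    got_xid, msg_type, reply_stat = struct.unpack_from("!3I", reply, 0)
    if (got_xid, msg_type, reply_stat) != (xid, 1, 0):
        raise ValueError("not an accepted reply to this call")
    _flavor, verf_len = struct.unpack_from("!2I", reply, 12)
    offset = 20 + verf_len + (4 - verf_len % 4) % 4  # skip the verifier body
    accept_stat, mount_stat = struct.unpack_from("!2I", reply, offset)
    if accept_stat != 0:
        raise ValueError("RPC call was not executed successfully")
    if mount_stat != 0:
        return mount_stat, b""  # server refused the mount; no handle returned
    (handle_len,) = struct.unpack_from("!I", reply, offset + 8)
    start = offset + 12
    return mount_stat, reply[start:start + handle_len]
```

&lt;p&gt;With these helpers, the datagram sent above would be &lt;code&gt;build_mnt_call("/vol1")&lt;/code&gt;, and a status of 0 (MNT3_OK) with a non-empty file handle from &lt;code&gt;parse_mnt_reply&lt;/code&gt; corresponds to the success case reported in this step.&lt;/p&gt;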

&lt;h3&gt;
  
  
  Step 4: tmpfs Mount Test
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;%sh
&lt;span class="nb"&gt;sudo mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /tmp/test_mount
&lt;span class="nb"&gt;sudo &lt;/span&gt;mount &lt;span class="nt"&gt;-t&lt;/span&gt; tmpfs tmpfs /tmp/test_mount &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"SUCCESS"&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"FAILED"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SUCCESS ✅
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;mount()&lt;/code&gt; syscall itself is allowed. Of the filesystem types tested, only NFS is blocked.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Seccomp Status
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;%sh
&lt;span class="nb"&gt;cat&lt;/span&gt; /proc/self/status | &lt;span class="nb"&gt;grep &lt;/span&gt;Seccomp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Seccomp:        2
Seccomp_filters:        1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



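&lt;p&gt;&lt;code&gt;Seccomp: 2&lt;/code&gt; means the process runs in seccomp filter mode (BPF), and &lt;code&gt;Seccomp_filters: 1&lt;/code&gt; means one filter is attached. This confirms that a filter is active; it does not show which syscalls or arguments the filter restricts.&lt;/p&gt;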
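
&lt;p&gt;To record this observation alongside other evidence, the status fields can be parsed in Python. This is a small sketch with hypothetical helper names; it assumes the Linux convention that mode 0 is disabled, 1 is strict, and 2 is filter, and it treats &lt;code&gt;Seccomp_filters&lt;/code&gt; as optional because older kernels do not report it.&lt;/p&gt;

```python
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def parse_seccomp(status_text):
    """Extract the Seccomp fields from the text of /proc/PID/status."""
    fields = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.startswith("Seccomp"):
            fields[key] = int(value.strip())
    mode = fields.get("Seccomp")
    return {
        "mode": mode,
        "mode_name": SECCOMP_MODES.get(mode, "unknown"),
        "filters": fields.get("Seccomp_filters"),  # None when not reported
    }


def current_seccomp(path="/proc/self/status"):
    """Read and parse the seccomp state of the calling process (Linux only)."""
    with open(path) as handle:
        return parse_seccomp(handle.read())
```

&lt;p&gt;Note that &lt;code&gt;current_seccomp()&lt;/code&gt; reports the state of the Python process that calls it, which may differ from a process started through &lt;code&gt;%sh&lt;/code&gt; or &lt;code&gt;sudo&lt;/code&gt;.&lt;/p&gt;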

&lt;h3&gt;
  
  
  The Conclusion
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────────┐
│ Evidence Chain:                                                 │
│                                                                 │
│ 1. Network connectivity      → ✅ All NFS ports reachable       │
│ 2. ONTAP export policy       → ✅ 0.0.0.0/0, rw=any, su=any     │
│ 3. NFS RPC (user-space)      → ✅ All operations succeed        │
│ 4. mount() with type="nfs"   → ❌ EACCES                        │
│ 5. mount() with type="tmpfs" → ✅ Success                       │
│ 6. Seccomp                   → Active (BPF filter mode)         │
│                                                                 │
│ Conclusion: The evidence points to a local platform security    │
│ boundary, likely seccomp filtering or an equivalent runtime     │
│ restriction, blocking the NFS mount path.                       │
└─────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



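&lt;p&gt;The error message "access denied &lt;strong&gt;by server&lt;/strong&gt;" is misleading. The &lt;code&gt;mount.nfs&lt;/code&gt; program reports the kernel's EACCES as a server-side denial, but strace shows the error coming back from the local &lt;code&gt;mount()&lt;/code&gt; syscall, and the user-space RPC calls show the server granting access: the denial appears to be local.&lt;/p&gt;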

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;If sharing this finding&lt;/strong&gt;: This is not a Databricks compatibility verdict. It is a layer-by-layer validation of observed boundaries in one environment (DBR 17.3 LTS, ap-northeast-1). Platform behavior may differ across runtime versions, access modes, and configurations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important&lt;/strong&gt;: Because Databricks does not publicly document this specific syscall/filesystem-type behavior, treat this as validation evidence rather than an official platform statement until confirmed by Databricks Support.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  All Mount Options Tested
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Options&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;-o nfsvers=3,nolock&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;access denied&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;-o nfsvers=4.1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;access denied&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;-o nfsvers=3,nolock,resvport&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;access denied&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;-o nfsvers=3,nolock,noresvport&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;access denied&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;-o sec=sys&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;access denied&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(no options)&lt;/td&gt;
&lt;td&gt;access denied&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;tmpfs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;SUCCESS&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Evidence Matrix
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Evidence&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;th&gt;Interpretation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Network&lt;/td&gt;
&lt;td&gt;TCP 2049 / TCP 111 / TCP 635 reachable&lt;/td&gt;
&lt;td&gt;✅ Pass&lt;/td&gt;
&lt;td&gt;Network path exists between cluster and FSx for ONTAP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ONTAP export&lt;/td&gt;
&lt;td&gt;Export policy allows 0.0.0.0/0, rw=any, su=any&lt;/td&gt;
&lt;td&gt;✅ Pass&lt;/td&gt;
&lt;td&gt;Export policy is not the blocker&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NFS server RPC&lt;/td&gt;
&lt;td&gt;MOUNT / FSINFO / GETATTR / READDIRPLUS succeed via user-space&lt;/td&gt;
&lt;td&gt;✅ Pass&lt;/td&gt;
&lt;td&gt;ONTAP grants NFS operations to this client&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local syscall&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;mount(type=nfs)&lt;/code&gt; returns EACCES&lt;/td&gt;
&lt;td&gt;❌ Fail&lt;/td&gt;
&lt;td&gt;Evidence points to a local runtime boundary affecting kernel NFS mount&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local syscall control&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;mount(type=tmpfs)&lt;/code&gt; succeeds&lt;/td&gt;
&lt;td&gt;✅ Pass&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;mount()&lt;/code&gt; syscall is not universally blocked&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Runtime security&lt;/td&gt;
&lt;td&gt;Seccomp mode 2 observed in the tested process context&lt;/td&gt;
&lt;td&gt;Observed&lt;/td&gt;
&lt;td&gt;Runtime filtering may restrict NFS-specific mount&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unity Catalog S3&lt;/td&gt;
&lt;td&gt;External Location test on S3 AP ARN → AccessDenied&lt;/td&gt;
&lt;td&gt;❌ Fail&lt;/td&gt;
&lt;td&gt;Session policy does not allow S3 AP ARN pattern&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Instance Profile S3&lt;/td&gt;
&lt;td&gt;boto3 GetObject on S3 AP → Success&lt;/td&gt;
&lt;td&gt;✅ Pass&lt;/td&gt;
&lt;td&gt;IAM role itself has correct permissions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;showmount -e&lt;/code&gt; validates NFS export visibility through mountd only. It does not guarantee that the local runtime allows the kernel NFS mount operation to complete, and it does not validate the file system user identity associated with the S3 Access Point. For S3 AP authorization, record the associated UNIX or Windows identity and verify file-level permissions separately, because these are independent authorization paths.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  FSx for ONTAP S3 AP Authorization Path
&lt;/h2&gt;

&lt;p&gt;FSx for ONTAP S3 Access Points use a &lt;a href="https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/s3-ap-manage-access-fsxn.html" rel="noopener noreferrer"&gt;dual-layer authorization model&lt;/a&gt; that combines AWS IAM permissions with file system-level permissions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1 — S3-side authorization:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;IAM identity-based policy (caller's permissions)&lt;/li&gt;
&lt;li&gt;S3 Access Point resource policy&lt;/li&gt;
&lt;li&gt;VPC endpoint policy (if applicable)&lt;/li&gt;
&lt;li&gt;SCP / RCP (if applicable)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Layer 2 — FSx for ONTAP-side authorization:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;File system user associated with the access point&lt;/li&gt;
&lt;li&gt;UNIX mode-bits / NFSv4 ACLs (for UNIX security style volumes)&lt;/li&gt;
&lt;li&gt;Windows ACLs (for NTFS security style volumes)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the Databricks validation, the failure occurs &lt;strong&gt;before Layer 2&lt;/strong&gt;: Unity Catalog's generated session policy appears to restrict the assumed role session at the S3 API level, so the request never reaches FSx for ONTAP-side authorization. The Instance Profile + boto3 path bypasses Unity Catalog's session policy, allowing both layers to be evaluated normally.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;For production, both layers must be configured with least-privilege. A permissive file system user (e.g., root / UID 0) combined with a broad IAM policy creates an overly permissive access path.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Approach 4: Instance Profile + boto3
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Setup
&lt;/h3&gt;

&lt;p&gt;Customer-managed VPC workspace, Dedicated cluster with an Instance Profile attached.&lt;/p&gt;

&lt;h3&gt;
  
  
  IMDS Access
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;urllib.request&lt;/span&gt;

&lt;span class="c1"&gt;# IMDSv2 token
&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;urllib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://169.254.169.254/latest/api/token&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;X-aws-ec2-metadata-token-ttl-seconds&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;21600&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PUT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;urllib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;urlopen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Token: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# ✅ Success
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Regular S3 Access
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="n"&gt;s3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ap-northeast-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;buckets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_buckets&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ListBuckets: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buckets&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Buckets&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; buckets&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# ✅ 58 buckets
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  FSx for ONTAP S3 AP Access
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_objects_v2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;Bucket&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;FSx-S3-AP-alias&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;MaxKeys&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Objects: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;KeyCount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# ✅ Works
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;This works.&lt;/strong&gt; Instance Profile credentials bypass Unity Catalog's session policy entirely. boto3 talks directly to the S3 API with the EC2 instance's IAM role.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Governance warning&lt;/strong&gt;&lt;br&gt;
Instance Profile + boto3 is a pragmatic workaround for PoC and controlled experiments. It bypasses Unity Catalog governance, including fine-grained access control, lineage, and centralized data access auditing. Do not treat this as a production lakehouse governance pattern without a separate security and compliance review. Databricks recommends &lt;a href="https://docs.databricks.com/en/connect/unity-catalog/storage-credentials.html" rel="noopener noreferrer"&gt;Unity Catalog external locations&lt;/a&gt; as the standard governed access mechanism.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scope note&lt;/strong&gt;&lt;br&gt;
The Instance Profile + boto3 sample above runs on the driver node only (single-node PoC pattern). Whether the same credential, network path, and concurrency behavior apply to Spark executors in a multi-node cluster requires separate validation.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Approach 5: S3 AP + Instance Profile (Managed VPC with VPC Peering)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Hypothesis
&lt;/h3&gt;

&lt;p&gt;If Instance Profile + boto3 works on a Customer-managed VPC (Approach 4), does it also work from a Databricks-managed VPC with VPC Peering to the FSx for ONTAP VPC? This would validate whether the S3 Gateway Endpoint in the Databricks-managed VPC can route S3 AP requests to the FSx for ONTAP backend.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Setup
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Databricks-managed VPC (&lt;code&gt;vpc-060209589cbe4c298&lt;/code&gt;, CIDR: 10.53.0.0/16)&lt;/li&gt;
&lt;li&gt;FSx for ONTAP VPC (&lt;code&gt;vpc-0ae01826f906191af&lt;/code&gt;, CIDR: 10.0.0.0/16)&lt;/li&gt;
&lt;li&gt;VPC Peering: &lt;code&gt;pcx-02167ddf900a30782&lt;/code&gt; (active)&lt;/li&gt;
&lt;li&gt;Route tables: updated in both directions&lt;/li&gt;
&lt;li&gt;FSx for ONTAP security group: allows all traffic (0.0.0.0/0)&lt;/li&gt;
&lt;li&gt;S3 Gateway Endpoint: &lt;code&gt;vpce-020b59ab4da0b44b8&lt;/code&gt; (full access policy)&lt;/li&gt;
&lt;li&gt;Cluster: m5.large × 3, DBR 17.3 LTS, Dedicated mode, Instance Profile attached&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Result
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"dns_resolution"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"success"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"ip"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"52.219.151.110"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"vpc_peering_443"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"success"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"result_code"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"vpc_peering_nfs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"success"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"result_code"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"s3_ap_access"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"success"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"error"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Read timeout"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"imds"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"success"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
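&lt;p&gt;The result above has the shape of a layered connectivity probe. The following is a minimal sketch of how such a probe could be written with the Python standard library. It is not the script used in this validation: the function names &lt;code&gt;dns_probe&lt;/code&gt; and &lt;code&gt;tcp_probe&lt;/code&gt; are hypothetical, and it assumes that &lt;code&gt;result_code&lt;/code&gt; is the return value of &lt;code&gt;socket.connect_ex&lt;/code&gt;. On Linux a timed-out &lt;code&gt;connect_ex&lt;/code&gt; typically returns errno 11, which would match the values reported, but the article does not state how they were produced.&lt;/p&gt;

```python
import socket


def dns_probe(hostname):
    """Resolve a hostname and report the first IPv4 address, if any."""
    try:
        return {"success": True, "ip": socket.gethostbyname(hostname)}
    except OSError as exc:
        return {"success": False, "error": str(exc)}


def tcp_probe(host, port, timeout=3.0):
    """Attempt a TCP connect; connect_ex returns 0 on success, an errno otherwise."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        try:
            code = sock.connect_ex((host, port))
        except OSError as exc:
            # Name-resolution failures are raised instead of returned as a code.
            return {"success": False, "error": str(exc)}
    if code == 0:
        return {"success": True}
    return {"success": False, "result_code": code}
```

&lt;p&gt;In a notebook, &lt;code&gt;dns_probe&lt;/code&gt; would be called with the S3 AP alias hostname, and &lt;code&gt;tcp_probe&lt;/code&gt; with the FSx for ONTAP management or NFS IP on ports 443 and 2049. Keeping each layer as a separate check makes it easier to tell a DNS failure from a blocked route.&lt;/p&gt;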



&lt;h3&gt;
  
  
  Analysis
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;th&gt;Interpretation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DNS resolution&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;S3 AP alias resolves to S3 endpoint IP (52.219.x.x)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VPC Peering (TCP 443)&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;FSx for ONTAP management IP unreachable — egress blocked&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VPC Peering (NFS 2049)&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;NFS port unreachable — egress blocked&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;S3 AP via S3 Gateway Endpoint&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Read timeout — S3 service reachable but FSx for ONTAP backend connection fails&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IMDS / Instance Profile&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Credentials available and valid&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Key finding&lt;/strong&gt;: Even with VPC Peering established, routes configured, and security groups permissive, the Databricks-managed VPC's egress restrictions block connectivity to the FSx for ONTAP backend. The S3 Gateway Endpoint routes requests to the S3 service, but FSx for ONTAP S3 AP requires the S3 service to reach the FSx for ONTAP file system — which is in a different VPC from the Databricks cluster. The S3 service-side routing to the FSx for ONTAP backend is not affected by customer-side VPC Peering.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Important&lt;/strong&gt;: This result suggests that FSx for ONTAP S3 AP access requires the requesting compute (here, the Databricks cluster) to be in the same VPC as the FSx for ONTAP file system, or to use a network configuration where the S3 service can reach the FSx for ONTAP backend. VPC Peering between the requester VPC and the FSx for ONTAP VPC does not help because S3 AP requests are routed through the S3 service, not directly to the FSx for ONTAP IP.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Lesson
&lt;/h3&gt;

&lt;p&gt;S3 AP requests do not traverse VPC Peering. They are routed through the S3 service endpoint. For FSx for ONTAP S3 AP to work, the S3 service must be able to reach the FSx for ONTAP file system's internal endpoint. This is handled by AWS internally when the request originates from the same region, but the Databricks-managed VPC's egress restrictions appear to interfere with this path.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Customer-managed VPC (same VPC as FSx for ONTAP) remains the only validated path for Instance Profile + boto3 access to FSx for ONTAP S3 AP from Databricks.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  IMDS Access Matrix
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Cluster Mode&lt;/th&gt;
&lt;th&gt;Workspace Type&lt;/th&gt;
&lt;th&gt;IMDS&lt;/th&gt;
&lt;th&gt;boto3 S3&lt;/th&gt;
&lt;th&gt;boto3 S3 AP&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Standard (Shared)&lt;/td&gt;
&lt;td&gt;Managed VPC&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dedicated&lt;/td&gt;
&lt;td&gt;Managed VPC&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dedicated&lt;/td&gt;
&lt;td&gt;Customer VPC&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dedicated + Instance Profile&lt;/td&gt;
&lt;td&gt;Managed VPC (VPC Peering)&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dedicated + Instance Profile&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Customer VPC&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;✅&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;✅&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;✅&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;Row 4 note: IMDS works and Instance Profile credentials are valid, but S3 AP access times out because the Databricks-managed VPC egress restrictions block FSx for ONTAP backend connectivity. The ⚠️ for regular S3 bucket access means it was not tested with a permissive policy (the AccessDenied response came from an intentionally scoped IAM policy, not from the network).&lt;/p&gt;
&lt;/blockquote&gt;

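&lt;p&gt;IMDS is blocked on every configuration without an explicitly registered Instance Profile. With Dedicated mode and a registered Instance Profile, IMDS is reachable on both workspace types, but only the Customer-managed VPC workspace also reached the FSx for ONTAP S3 AP.&lt;/p&gt;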
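&lt;p&gt;The IMDS column can be checked from a notebook without boto3. The sketch below uses the IMDSv2 token flow with the standard library and returns the attached role name, or &lt;code&gt;None&lt;/code&gt; when IMDS is blocked. The function name &lt;code&gt;imds_role_name&lt;/code&gt; and the &lt;code&gt;base_url&lt;/code&gt; parameter are hypothetical, and the article does not say which method was used to fill in the matrix.&lt;/p&gt;

```python
import urllib.error
import urllib.request

IMDS_BASE = "http://169.254.169.254"  # link-local EC2 metadata address


def imds_role_name(base_url=IMDS_BASE, timeout=2.0):
    """Return the instance-profile role name, or None if IMDS is unreachable."""
    # IMDS must never be reached through an HTTP proxy, so disable proxies.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    token_req = urllib.request.Request(
        base_url + "/latest/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    try:
        with opener.open(token_req, timeout=timeout) as resp:
            token = resp.read().decode()
        role_req = urllib.request.Request(
            base_url + "/latest/meta-data/iam/security-credentials/",
            headers={"X-aws-ec2-metadata-token": token},
        )
        with opener.open(role_req, timeout=timeout) as resp:
            return resp.read().decode().strip() or None
    except (urllib.error.URLError, OSError):
        # Timeouts, refused connections, and HTTP errors all count as blocked.
        return None
```

&lt;p&gt;A &lt;code&gt;None&lt;/code&gt; result corresponds to a ❌ in the IMDS column. A role name only shows that credentials are obtainable and says nothing about whether the S3 AP itself is reachable.&lt;/p&gt;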




&lt;h2&gt;
  
  
  Complete Results Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;th&gt;Blocker&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;UC External Location + dbutils.fs (without &lt;code&gt;access_point&lt;/code&gt; field)&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Generated session policy did not allow S3 AP ARN&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1b&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;UC External Location + &lt;code&gt;access_point&lt;/code&gt; field (file-level read)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;✅&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Top-level ls, head, spark.read with explicit path all work&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1c&lt;/td&gt;
&lt;td&gt;UC External Location + &lt;code&gt;access_point&lt;/code&gt; field (subdirectory ls)&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Prefix-based ListObjectsV2 still blocked for subdirectories&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1d&lt;/td&gt;
&lt;td&gt;UC External Location + CREATE TABLE LOCATION&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;UC_CLOUD_STORAGE_ACCESS_FAILURE during internal validation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;UC External Location + Spark read (directory)&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Same prefix-level access issue&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;NFS mount (Managed VPC, VPC Peering)&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Egress blocked (port 2049)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;NFS mount (Customer VPC, Dedicated)&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;NFS mount blocked by seccomp by design (confirmed by Databricks Support)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;boto3 (Managed VPC, no Instance Profile)&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;IMDS blocked&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;boto3 (Customer VPC, no Instance Profile)&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;IMDS blocked&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Instance Profile + boto3 (Customer VPC)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;✅&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Works (bypasses UC governance)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;NFS RPC user-space (Customer VPC)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;✅&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Works but impractical for production&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;No Isolation Shared mode&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Legacy access mode; not pursued&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;S3 AP + Instance Profile + boto3 (Managed VPC, VPC Peering)&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Managed VPC egress blocks FSx for ONTAP backend connectivity&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Governance Impact Summary
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Documentation status (Updated 2026-05-26)&lt;/strong&gt;: Databricks Support confirmed that the &lt;code&gt;access_point&lt;/code&gt; field was never released as GA and has been removed from documentation. Unity Catalog External Locations do not currently support S3 Access Points as storage targets. The partial success observed is a side effect, not a supported code path. Feature gap reported to UC engineering — no timeline available.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Access path&lt;/th&gt;
&lt;th&gt;Governance model&lt;/th&gt;
&lt;th&gt;Auditability&lt;/th&gt;
&lt;th&gt;Production suitability&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Unity Catalog External Location&lt;/td&gt;
&lt;td&gt;Centralized UC governance (fine-grained, lineage)&lt;/td&gt;
&lt;td&gt;High (if supported)&lt;/td&gt;
&lt;td&gt;Preferred, but blocked in this validation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Instance Profile + boto3&lt;/td&gt;
&lt;td&gt;EC2 IAM role based&lt;/td&gt;
&lt;td&gt;AWS-side logs possible if enabled; UC lineage not captured&lt;/td&gt;
&lt;td&gt;PoC only unless separately approved&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kernel NFS mount&lt;/td&gt;
&lt;td&gt;Filesystem / OS level&lt;/td&gt;
&lt;td&gt;Outside UC governance&lt;/td&gt;
&lt;td&gt;Not viable in this validation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;User-space NFS RPC&lt;/td&gt;
&lt;td&gt;Custom application path&lt;/td&gt;
&lt;td&gt;Custom logging required&lt;/td&gt;
&lt;td&gt;Experimental only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Athena + FSx for ONTAP S3 AP&lt;/td&gt;
&lt;td&gt;IAM / S3 AP / Athena workgroup&lt;/td&gt;
&lt;td&gt;AWS-side evidence possible&lt;/td&gt;
&lt;td&gt;Best current read-only SQL analytics fit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bedrock Knowledge Bases + FSx for ONTAP S3 AP&lt;/td&gt;
&lt;td&gt;IAM / S3 AP / Bedrock Knowledge Base role / guardrails where used&lt;/td&gt;
&lt;td&gt;AWS-side evidence possible&lt;/td&gt;
&lt;td&gt;AWS-documented RAG / GenAI path; validated with permission-aware retrieval in &lt;a href="https://dev.to/aws-builders/building-an-agentic-access-aware-rag-system-with-amazon-fsx-for-netapp-ontap-s3-vectors-and-s3-2b86"&gt;related series&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Glue / EMR Serverless + FSx for ONTAP S3 AP&lt;/td&gt;
&lt;td&gt;IAM / S3 AP / Glue / EMR job roles&lt;/td&gt;
&lt;td&gt;AWS-side evidence possible&lt;/td&gt;
&lt;td&gt;Validated ETL / Spark path in this broader series where verification-pack evidence is available; validate production write-back semantics separately&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;AWS-side audit events, such as CloudTrail data events where enabled and applicable, may show S3 API access by the instance profile, but they do not replace Unity Catalog lineage, table-level privileges, or centralized Databricks governance controls.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  MLOps Boundary
&lt;/h3&gt;

&lt;p&gt;Using boto3 to read objects from FSx for ONTAP S3 AP does not automatically make the downstream ML workflow governed.&lt;/p&gt;

&lt;p&gt;If the data retrieved via Instance Profile + boto3 is used for ML or GenAI:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Register derived datasets in governed storage (Unity Catalog managed location)&lt;/li&gt;
&lt;li&gt;Track experiments with &lt;a href="https://docs.databricks.com/en/mlflow/index.html" rel="noopener noreferrer"&gt;MLflow&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Register models in &lt;a href="https://docs.databricks.com/aws/en/catalog-explorer/explore-models" rel="noopener noreferrer"&gt;Unity Catalog&lt;/a&gt; where applicable&lt;/li&gt;
&lt;li&gt;Document source data access path (S3 AP alias, prefix, timestamp)&lt;/li&gt;
&lt;li&gt;Record whether training data lineage is captured or externalized&lt;/li&gt;
&lt;li&gt;Ensure the ML compute uses an access mode compatible with Unity Catalog governance&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Models in Unity Catalog provides centralized access control, auditing, lineage, and model discovery across workspaces. If the PoC data path bypasses UC, the model lifecycle should still be governed through the UC model registry.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  AI / RAG Data Readiness Checklist
&lt;/h3&gt;

&lt;p&gt;If the FSx for ONTAP S3 AP data is intended for AI, RAG, or GenAI pipelines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Are documents classified by sensitivity (PHI, PII, financial, internal, public)?&lt;/li&gt;
&lt;li&gt;[ ] Are file-level permissions preserved or re-modeled for the AI pipeline?&lt;/li&gt;
&lt;li&gt;[ ] Is metadata available for filtering and retrieval (file type, date, owner)?&lt;/li&gt;
&lt;li&gt;[ ] Is the freshness requirement defined (real-time, daily, weekly)?&lt;/li&gt;
&lt;li&gt;[ ] Is read-only access sufficient, or does the pipeline need write-back?&lt;/li&gt;
&lt;li&gt;[ ] Is human review required for generated output before downstream use?&lt;/li&gt;
&lt;li&gt;[ ] Is permission-aware retrieval required (user A sees only their authorized documents)?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;If permission-aware retrieval is required, define one of:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Enforce at source access path&lt;/strong&gt; — use per-user or per-group S3 Access Points with scoped file system users&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Re-model permissions in metadata index&lt;/strong&gt; — extract file-level ACLs into a searchable metadata store and filter at query time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Filter retrieval results by user/group claims&lt;/strong&gt; — apply post-retrieval filtering based on authenticated user identity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Do not proceed&lt;/strong&gt; until the authorization model is validated and approved by the security owner&lt;/li&gt;
&lt;/ul&gt;
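&lt;p&gt;As an illustration of the third option (filtering retrieval results by user or group claims), here is a minimal sketch. The &lt;code&gt;RetrievedChunk&lt;/code&gt; schema and the &lt;code&gt;filter_by_claims&lt;/code&gt; function are hypothetical. The sketch assumes each retrieved chunk already carries the groups allowed to read its source file, which in practice means extracting the ACLs at indexing time.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RetrievedChunk:
    """A retrieval hit plus the groups allowed to read its source file."""
    key: str
    text: str
    allowed_groups: frozenset = field(default_factory=frozenset)


def filter_by_claims(chunks, user_groups):
    """Keep only chunks whose ACL shares at least one group with the caller.

    Chunks with an empty ACL are dropped, so the filter denies by default.
    """
    claims = frozenset(user_groups)
    return [c for c in chunks if c.allowed_groups.intersection(claims)]
```

&lt;p&gt;Post-retrieval filtering is the simplest of the three options to add, but unauthorized text still reaches the retrieval layer before it is dropped. That is one reason the authorization model needs security-owner sign-off first.&lt;/p&gt;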

&lt;p&gt;&lt;strong&gt;Instance Profile + boto3 approval requirements&lt;/strong&gt; (for regulated workloads):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data owner approval&lt;/li&gt;
&lt;li&gt;Security owner approval&lt;/li&gt;
&lt;li&gt;Platform owner approval&lt;/li&gt;
&lt;li&gt;Compliance reviewer approval (if regulated data involved)&lt;/li&gt;
&lt;li&gt;Defined scope: allowed prefix, allowed operations, logging requirements, expiration date&lt;/li&gt;
&lt;li&gt;Approval record location (where the decision is stored)&lt;/li&gt;
&lt;li&gt;Review / expiration date (when the approval must be re-evaluated)&lt;/li&gt;
&lt;li&gt;Incident escalation contact&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;For regulated workloads, do not use Instance Profile + boto3 for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Patient-facing responses or clinical decision support&lt;/li&gt;
&lt;li&gt;Financial decision automation&lt;/li&gt;
&lt;li&gt;Unreviewed access to regulated datasets&lt;/li&gt;
&lt;li&gt;Writeback to source-controlled data locations&lt;/li&gt;
&lt;li&gt;Workloads requiring Unity Catalog lineage&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Decision Matrix
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Requirement&lt;/th&gt;
&lt;th&gt;Recommended path today&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;th&gt;Next validation action&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SQL query on structured files&lt;/td&gt;
&lt;td&gt;Athena + FSx for ONTAP S3 AP (Part 1)&lt;/td&gt;
&lt;td&gt;Verified, simple, governed&lt;/td&gt;
&lt;td&gt;Scale test with production data sizes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAG / GenAI over NAS documents&lt;/td&gt;
&lt;td&gt;Bedrock Knowledge Bases + FSx for ONTAP S3 AP&lt;/td&gt;
&lt;td&gt;&lt;a href="https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/tutorial-build-rag-with-bedrock.html" rel="noopener noreferrer"&gt;AWS-documented tutorial&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Validate retrieval accuracy, permission-aware filtering, and sync freshness&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ETL pipeline on NAS data&lt;/td&gt;
&lt;td&gt;Glue or EMR Serverless + FSx for ONTAP S3 AP&lt;/td&gt;
&lt;td&gt;Validated in this broader series where verification-pack evidence is available&lt;/td&gt;
&lt;td&gt;Validate throughput impact and production write-back semantics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Serverless file processing&lt;/td&gt;
&lt;td&gt;Lambda + FSx for ONTAP S3 AP&lt;/td&gt;
&lt;td&gt;&lt;a href="https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/tutorial-process-files-with-lambda.html" rel="noopener noreferrer"&gt;AWS-documented tutorial&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Validate concurrency and throughput for your workload&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Databricks governance with Unity Catalog&lt;/td&gt;
&lt;td&gt;Wait for platform support&lt;/td&gt;
&lt;td&gt;UC session policy currently blocks S3 AP ARN in my validation&lt;/td&gt;
&lt;td&gt;Monitor Databricks support case response&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Databricks unstructured data PoC&lt;/td&gt;
&lt;td&gt;Dedicated cluster + Instance Profile + boto3&lt;/td&gt;
&lt;td&gt;Works, but bypasses UC governance&lt;/td&gt;
&lt;td&gt;Validate executor-scale behavior separately&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Production Databricks lakehouse tables&lt;/td&gt;
&lt;td&gt;Use supported cloud storage (S3 bucket)&lt;/td&gt;
&lt;td&gt;Required for Delta write semantics&lt;/td&gt;
&lt;td&gt;N/A — use standard pattern&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Databricks distributed processing over FSx for ONTAP S3 AP&lt;/td&gt;
&lt;td&gt;Not validated yet&lt;/td&gt;
&lt;td&gt;Driver-only boto3 success does not prove executor-scale behavior&lt;/td&gt;
&lt;td&gt;Test with multi-node cluster and Spark mapPartitions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Enterprise read-only analytics&lt;/td&gt;
&lt;td&gt;Athena / Glue / EMR Serverless + FSx for ONTAP S3 AP&lt;/td&gt;
&lt;td&gt;Best current fit for AWS-native path&lt;/td&gt;
&lt;td&gt;Production workload isolation test&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Video streaming from NAS&lt;/td&gt;
&lt;td&gt;CloudFront + FSx for ONTAP S3 AP&lt;/td&gt;
&lt;td&gt;&lt;a href="https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/tutorial-stream-video-with-cloudfront.html" rel="noopener noreferrer"&gt;AWS-documented tutorial&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Validate caching and latency for your content&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;This article does not recommend bypassing Unity Catalog for production governed lakehouse workloads. The Instance Profile + boto3 path is documented because it worked in a controlled validation environment, not because it is the preferred governance model.&lt;/p&gt;
&lt;/blockquote&gt;
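&lt;p&gt;For the executor-scale test listed in the matrix, a per-partition probe could look like the sketch below. This was not run as part of this validation. &lt;code&gt;probe_partition&lt;/code&gt; and &lt;code&gt;client_factory&lt;/code&gt; are hypothetical names, and the Spark wiring in the trailing comment is an assumption about how it would be called on a Dedicated cluster.&lt;/p&gt;

```python
import socket


def probe_partition(keys, client_factory, bucket):
    """Run on one Spark partition: HEAD each key and yield (host, key, outcome).

    client_factory is called once per partition so that no client object
    has to be pickled and shipped from the driver.
    """
    client = client_factory()
    host = socket.gethostname()
    for key in keys:
        try:
            client.head_object(Bucket=bucket, Key=key)
            yield (host, key, "ok")
        except Exception as exc:
            # Record the failure type instead of aborting the whole task.
            yield (host, key, type(exc).__name__)


# Assumed wiring on a cluster (ALIAS is the S3 AP alias placeholder):
#   import boto3
#   rdd = spark.sparkContext.parallelize(keys, numSlices=8)
#   rows = rdd.mapPartitions(
#       lambda part: probe_partition(part, lambda: boto3.client("s3"), ALIAS)
#   ).collect()
```

&lt;p&gt;Including the hostname in each row shows whether requests actually ran on worker nodes rather than the driver, which is the question the driver-only sample leaves open.&lt;/p&gt;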




&lt;h2&gt;
  
  
  Architecture Decision Guidance
&lt;/h2&gt;

&lt;p&gt;Databricks remains the recommended platform for curated lakehouse workloads, governed Delta tables, ML pipelines, and multi-step data engineering. FSx for ONTAP S3 AP should be treated as a source integration boundary that may require staging, validation, or an alternate read path depending on governance requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Databricks when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data is already in supported object storage (S3 bucket)&lt;/li&gt;
&lt;li&gt;Delta Lake write semantics are required (INSERT, MERGE, OPTIMIZE, VACUUM)&lt;/li&gt;
&lt;li&gt;Unity Catalog lineage and fine-grained governance are mandatory&lt;/li&gt;
&lt;li&gt;Large-scale Spark processing is required&lt;/li&gt;
&lt;li&gt;ML/AI workloads need integrated compute&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use AWS-native services + FSx for ONTAP S3 AP when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The primary requirement is read-only SQL analytics over NAS data → &lt;a href="https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/tutorial-query-data-with-athena.html" rel="noopener noreferrer"&gt;Athena&lt;/a&gt; (validated in Part 1)&lt;/li&gt;
&lt;li&gt;RAG / GenAI over enterprise documents → &lt;a href="https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/tutorial-build-rag-with-bedrock.html" rel="noopener noreferrer"&gt;Bedrock Knowledge Bases&lt;/a&gt; (AWS-documented path)&lt;/li&gt;
&lt;li&gt;ETL pipelines reading/transforming NAS data → &lt;a href="https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/tutorial-transform-data-with-glue.html" rel="noopener noreferrer"&gt;Glue&lt;/a&gt; (validated in this broader series where verification-pack evidence is available)&lt;/li&gt;
&lt;li&gt;Spark-scale processing without persistent clusters → &lt;a href="https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/tutorial-run-spark-with-emr-serverless.html" rel="noopener noreferrer"&gt;EMR Serverless&lt;/a&gt; (validated in this broader series where verification-pack evidence is available)&lt;/li&gt;
&lt;li&gt;Serverless file processing (thumbnails, text extraction, transcription) → &lt;a href="https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/tutorial-process-files-with-lambda.html" rel="noopener noreferrer"&gt;Lambda&lt;/a&gt; (AWS-documented path)&lt;/li&gt;
&lt;li&gt;Video streaming from NAS → &lt;a href="https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/tutorial-stream-video-with-cloudfront.html" rel="noopener noreferrer"&gt;CloudFront&lt;/a&gt; (AWS-documented path)&lt;/li&gt;
&lt;li&gt;External partner file exchange → &lt;a href="https://docs.aws.amazon.com/transfer/latest/userguide/fsx-s3-access-points.html" rel="noopener noreferrer"&gt;Transfer Family&lt;/a&gt; (AWS-documented path)&lt;/li&gt;
&lt;li&gt;BI and AI-assisted analytics → QuickSight candidate path, typically via Athena or Glue Catalog&lt;/li&gt;
&lt;li&gt;Source data copy should be minimized&lt;/li&gt;
&lt;li&gt;Workload isolation and governance can be validated with AWS-side controls&lt;/li&gt;
&lt;li&gt;Serverless, pay-per-query or pay-per-invocation cost model is preferred&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use controlled boto3 PoC only when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The workload is exploratory and time-limited&lt;/li&gt;
&lt;li&gt;Unity Catalog lineage is not required for the PoC scope&lt;/li&gt;
&lt;li&gt;Explicit approval is obtained from data owner, security owner, and platform owner&lt;/li&gt;
&lt;li&gt;Compensating controls are defined and documented&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  FSx for ONTAP Sizing Considerations
&lt;/h3&gt;

&lt;p&gt;Before selecting an analytics engine, validate FSx for ONTAP-side capacity:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Throughput capacity&lt;/strong&gt; — S3 API throughput is bounded by the FSx for ONTAP file system's provisioned throughput&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expected S3 API request rate&lt;/strong&gt; — high-frequency small object reads may hit IOPS limits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File count and average object size&lt;/strong&gt; — large directories with many small files may increase listing latency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prefix layout&lt;/strong&gt; — flat vs hierarchical prefix design affects listing performance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NFS/SMB production workload window&lt;/strong&gt; — analytics queries share throughput with existing file workloads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Snapshot / backup / replication schedule&lt;/strong&gt; — SnapMirror and backup operations consume throughput&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Isolation strategy&lt;/strong&gt; — consider a dedicated volume or SVM for analytics access to avoid contention&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Delta Lake production workloads require more than object read access. They require validated behavior for transaction log writes, atomic commit assumptions, concurrent writers, checkpointing, recovery, and lifecycle operations. This article does not validate FSx for ONTAP S3 AP for Delta write-path semantics.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Compensating Controls for Controlled boto3 PoC
&lt;/h2&gt;

&lt;p&gt;If Instance Profile + boto3 is approved for a controlled PoC, define:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dedicated cluster only (no shared compute)&lt;/li&gt;
&lt;li&gt;Single-purpose instance profile (not reused across workloads)&lt;/li&gt;
&lt;li&gt;Least-privilege S3 Access Point policy (specific prefix only)&lt;/li&gt;
&lt;li&gt;Read-only permissions by default&lt;/li&gt;
&lt;li&gt;Allowed prefix list (explicitly documented)&lt;/li&gt;
&lt;li&gt;CloudTrail data event coverage where enabled and applicable&lt;/li&gt;
&lt;li&gt;Notebook/job owner (named individual)&lt;/li&gt;
&lt;li&gt;Approval expiration date&lt;/li&gt;
&lt;li&gt;No production writeback&lt;/li&gt;
&lt;li&gt;No regulated data unless separately approved with compensating controls&lt;/li&gt;
&lt;/ul&gt;
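&lt;p&gt;The least-privilege, read-only, and allowed-prefix items can be captured in a single IAM policy document. The sketch below builds one for a single prefix. It uses the standard S3 Access Point ARN layout (&lt;code&gt;.../accesspoint/NAME/object/PREFIX*&lt;/code&gt;) and the &lt;code&gt;s3:prefix&lt;/code&gt; condition key. The function name is hypothetical, and whether every statement behaves identically for an access point attached to FSx for ONTAP is an assumption to verify against the AWS documentation.&lt;/p&gt;

```python
def read_only_prefix_policy(access_point_arn, prefix):
    """Build an IAM policy document allowing list and get under one prefix only."""
    prefix = prefix.strip("/") + "/"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListApprovedPrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": access_point_arn,
                # Listing is limited to keys under the approved prefix.
                "Condition": {"StringLike": {"s3:prefix": [prefix + "*"]}},
            },
            {
                "Sid": "ReadApprovedPrefix",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": access_point_arn + "/object/" + prefix + "*",
            },
        ],
    }
```

&lt;p&gt;No write actions appear in the document, so read-only is the default. Passing the result to &lt;code&gt;json.dumps&lt;/code&gt; produces text that can be stored with the approval record.&lt;/p&gt;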

&lt;p&gt;&lt;strong&gt;Recommended Databricks-side controls:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Restrict instance profile usage to an approved group via workspace admin settings&lt;/li&gt;
&lt;li&gt;Enforce dedicated access mode through cluster policy&lt;/li&gt;
&lt;li&gt;Restrict cluster creation permissions to approved users&lt;/li&gt;
&lt;li&gt;Tag PoC clusters with owner, approval ID, and expiration date&lt;/li&gt;
&lt;li&gt;Disable or terminate clusters after approval expiration&lt;/li&gt;
&lt;li&gt;Review workspace audit logs for cluster and instance profile usage&lt;/li&gt;
&lt;/ul&gt;
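&lt;p&gt;Tagging and expiration only help if something checks them. The sketch below flags clusters whose expiration tag is missing, malformed, or in the past. It assumes cluster records shaped like Databricks Clusters API output (&lt;code&gt;cluster_name&lt;/code&gt;, &lt;code&gt;custom_tags&lt;/code&gt;) and a hypothetical tag key &lt;code&gt;expiration&lt;/code&gt; holding an ISO date. Fetching the records and terminating clusters are left out.&lt;/p&gt;

```python
from datetime import date


def expired_poc_clusters(clusters, today):
    """Return names of clusters whose expiration tag is missing, malformed, or past."""
    flagged = []
    for cluster in clusters:
        raw = cluster.get("custom_tags", {}).get("expiration")
        try:
            expires = date.fromisoformat(raw)
        except (TypeError, ValueError):
            # Fail closed: an untagged or badly tagged cluster counts as expired.
            flagged.append(cluster["cluster_name"])
            continue
        if today > expires:
            flagged.append(cluster["cluster_name"])
    return flagged
```

&lt;p&gt;Treating a missing tag as expired keeps an untagged cluster from outliving its approval unnoticed.&lt;/p&gt;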

&lt;p&gt;&lt;strong&gt;Post-expiration mandatory actions:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Terminate all PoC clusters using the instance profile&lt;/li&gt;
&lt;li&gt;Remove the instance profile from workspace admin settings&lt;/li&gt;
&lt;li&gt;Archive all evidence (notebooks, logs, results) to approved storage&lt;/li&gt;
&lt;li&gt;Update approval record with completion date and findings&lt;/li&gt;
&lt;li&gt;Confirm no residual access paths remain (audit workspace settings)&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Data Protection Considerations
&lt;/h2&gt;

&lt;p&gt;FSx for ONTAP S3 AP exposes access to file data; it does not replace ONTAP volume-level protection. When analytics workloads access source data via S3 AP, validate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Snapshot schedule impact&lt;/strong&gt; — analytics reads do not conflict with scheduled snapshots, but heavy write-back could&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SnapMirror replication policy&lt;/strong&gt; — source volume replication continues regardless of S3 AP access&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backup window vs analytics query window&lt;/strong&gt; — concurrent backup and analytics may compete for throughput&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Write-back isolation&lt;/strong&gt; — analytics results should be written to a separate volume or prefix, not the source-of-record volume&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recovery behavior&lt;/strong&gt; — if analytics workload reads during a failover event, understand the RPO/RTO implications&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;ONTAP S3 NAS bucket data is protected by volume-level SnapMirror asynchronous replication, not by S3-level replication. Plan DR at the volume level.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Discovery Questions for Partners
&lt;/h2&gt;

&lt;p&gt;When a customer asks about Databricks + FSx for ONTAP S3 Access Points:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Are the target files currently stored on NFS, SMB, or both?&lt;/li&gt;
&lt;li&gt;Is the workload read-only analytics, unstructured object processing, or Delta write?&lt;/li&gt;
&lt;li&gt;Is Unity Catalog lineage mandatory for this use case?&lt;/li&gt;
&lt;li&gt;Is this a regulated dataset (PHI, PII, financial)?&lt;/li&gt;
&lt;li&gt;Can the PoC run with a dedicated instance profile and limited prefix?&lt;/li&gt;
&lt;li&gt;What is the required concurrency and data size?&lt;/li&gt;
&lt;li&gt;Is executor-scale Spark processing required, or is driver-only sufficient?&lt;/li&gt;
&lt;li&gt;What rollback action is acceptable if FSx for ONTAP throughput impact is observed?&lt;/li&gt;
&lt;li&gt;Who approves non-Unity Catalog access paths?&lt;/li&gt;
&lt;li&gt;What evidence is required for security review?&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Troubleshooting Playbook
&lt;/h2&gt;

&lt;p&gt;When Databricks access to FSx for ONTAP S3 AP fails, isolate one layer at a time:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;IAM&lt;/strong&gt; — Can the instance profile call &lt;code&gt;s3:ListBucket&lt;/code&gt; on the S3 AP ARN? Can it call &lt;code&gt;s3:GetObject&lt;/code&gt;?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unity Catalog&lt;/strong&gt; — Does the same role work for a standard S3 bucket? Does it fail only for the FSx for ONTAP S3 AP ARN?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network&lt;/strong&gt; — Is the workspace customer-managed or Databricks-managed? Can the cluster reach NFS TCP 2049? Are route tables and security groups correct?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NFS server&lt;/strong&gt; — Does &lt;code&gt;showmount -e&lt;/code&gt; work? Does the ONTAP export policy allow the client?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local runtime&lt;/strong&gt; — Does &lt;code&gt;strace&lt;/code&gt; show &lt;code&gt;mount()&lt;/code&gt; returning &lt;code&gt;EACCES&lt;/code&gt;? Does &lt;code&gt;tmpfs&lt;/code&gt; mount succeed? Does user-space NFS RPC succeed?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workaround&lt;/strong&gt; — Does Dedicated + Instance Profile + boto3 work? Is bypassing Unity Catalog acceptable for this PoC?&lt;/li&gt;
&lt;/ol&gt;
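
&lt;p&gt;The IAM step of the playbook can be sketched as a small probe that tries a list call and then a read, and records the outcome of each separately. This is a sketch under stated assumptions: &lt;code&gt;probe_access_point&lt;/code&gt; is a hypothetical helper name, and the client passed in is assumed to behave like a boto3 S3 client. Errors are caught broadly on purpose so that the full error text can be kept as evidence.&lt;/p&gt;

```python
def probe_access_point(client, access_point_arn, prefix=""):
    """Try ListObjectsV2, then GetObject, and record the outcome of each call.

    `client` is expected to behave like a boto3 S3 client.
    """
    results = {"list_objects_v2": None, "get_object": None}
    first_key = None
    try:
        # boto3 accepts an access point ARN (or alias) in the Bucket parameter.
        page = client.list_objects_v2(
            Bucket=access_point_arn, Prefix=prefix, MaxKeys=1
        )
        contents = page.get("Contents", [])
        first_key = contents[0]["Key"] if contents else None
        results["list_objects_v2"] = "ok"
    except Exception as exc:  # broad on purpose: keep the error text as evidence
        results["list_objects_v2"] = f"{type(exc).__name__}: {exc}"
    if first_key is None:
        # Listing failed or the prefix is empty, so there is nothing to read.
        results["get_object"] = "skipped (no key available)"
        return results
    try:
        body = client.get_object(Bucket=access_point_arn, Key=first_key)["Body"]
        body.read(1)  # read a single byte to confirm data actually flows
        results["get_object"] = "ok"
    except Exception as exc:
        results["get_object"] = f"{type(exc).__name__}: {exc}"
    return results


# On a cluster with the instance profile attached (requires boto3):
#   import boto3
#   print(probe_access_point(boto3.client("s3"), YOUR_S3_AP_ARN, "poc/"))
```

&lt;p&gt;Running the same probe against a regular S3 bucket with the same role gives the comparison that the Unity Catalog step asks for.&lt;/p&gt;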




&lt;h2&gt;
  
  
  Known Failure Signatures
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Symptom&lt;/th&gt;
&lt;th&gt;Likely layer&lt;/th&gt;
&lt;th&gt;Next step&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;no session policy allows s3:ListBucket&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Unity Catalog session policy&lt;/td&gt;
&lt;td&gt;Compare regular S3 bucket vs FSx for ONTAP S3 AP with the same role&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TCP 2049 unreachable&lt;/td&gt;
&lt;td&gt;Network / managed VPC boundary&lt;/td&gt;
&lt;td&gt;Test from customer-managed VPC&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;mount.nfs: access denied by server&lt;/code&gt; with &lt;code&gt;mount()&lt;/code&gt; EACCES in strace&lt;/td&gt;
&lt;td&gt;Local runtime restriction&lt;/td&gt;
&lt;td&gt;Capture strace and &lt;code&gt;/proc/self/status&lt;/code&gt; seccomp output&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;boto3 &lt;code&gt;NoCredentialsError&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Instance profile / IMDS blocked&lt;/td&gt;
&lt;td&gt;Verify cluster mode is Dedicated and instance profile is registered&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;boto3 &lt;code&gt;ReadTimeoutError&lt;/code&gt; on S3 AP&lt;/td&gt;
&lt;td&gt;FSx for ONTAP backend or VPC endpoint routing&lt;/td&gt;
&lt;td&gt;Test with a fresh SVM/volume to isolate; check FSx for ONTAP CPU utilization&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;boto3 &lt;code&gt;ReadTimeoutError&lt;/code&gt; on S3 AP from Managed VPC (IMDS works)&lt;/td&gt;
&lt;td&gt;Managed VPC egress restriction blocking FSx for ONTAP backend&lt;/td&gt;
&lt;td&gt;Deploy in Customer-managed VPC (same VPC as FSx for ONTAP); VPC Peering does not resolve this&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Driver-only boto3 works, but Spark job fails&lt;/td&gt;
&lt;td&gt;Executor credential/network path&lt;/td&gt;
&lt;td&gt;Validate credentials, routing, and concurrency from executors separately&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
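
&lt;p&gt;For the local runtime row, the seccomp mode can be read from &lt;code&gt;/proc/self/status&lt;/code&gt; inside a notebook cell. Below is a minimal sketch with hypothetical helper names. It assumes a Linux runtime and the standard &lt;code&gt;Seccomp:&lt;/code&gt; field, where 0 means disabled, 1 strict, and 2 filter.&lt;/p&gt;

```python
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def parse_seccomp_mode(status_text):
    """Return (mode_number, label) from the text of a /proc status file, or None."""
    for line in status_text.splitlines():
        # Match the exact field; "Seccomp_filters:" is a different line.
        if line.startswith("Seccomp:"):
            mode = int(line.split(":", 1)[1].strip())
            return mode, SECCOMP_MODES.get(mode, "unknown")
    return None


def read_seccomp_mode(path="/proc/self/status"):
    """Read the status file of the current process (Linux only)."""
    with open(path, encoding="utf-8") as handle:
        return parse_seccomp_mode(handle.read())


# In a notebook cell on the cluster:
#   print(read_seccomp_mode())
```

&lt;p&gt;A result of mode 2 only shows that a filter is installed for the process. It does not show which syscalls the filter rejects, so it should be captured together with the strace output rather than instead of it.&lt;/p&gt;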




&lt;h2&gt;
  
  
  What This Article Does Not Conclude
&lt;/h2&gt;

&lt;p&gt;This article does not conclude that Databricks can never support FSx for ONTAP S3 AP. It documents the behavior observed in one validated environment and identifies the platform boundaries that need vendor confirmation or additional support.&lt;/p&gt;




&lt;h2&gt;
  
  
  What to Tell Stakeholders
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Current recommendation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;strong&gt;AWS-documented native service paths&lt;/strong&gt; where they match the workload: Athena for SQL, Bedrock Knowledge Bases for RAG/GenAI, Glue or EMR Serverless for ETL/Spark, Lambda for serverless file processing, CloudFront for streaming, and Transfer Family for partner file exchange&lt;/li&gt;
&lt;li&gt;Treat Athena as the validated read-oriented SQL path in Part 1. Treat Glue / EMR Serverless as validated ETL / Spark paths only where corresponding &lt;a href="https://github.com/Yoshiki0705/fsxn-lakehouse-integrations/tree/main/verification-pack" rel="noopener noreferrer"&gt;verification-pack&lt;/a&gt; evidence is available.&lt;/li&gt;
&lt;li&gt;Treat Bedrock Knowledge Bases, Lambda (file processing), CloudFront, and Transfer Family as AWS-documented candidate paths that still require workload-specific validation&lt;/li&gt;
&lt;li&gt;Use &lt;strong&gt;Databricks + Instance Profile + boto3&lt;/strong&gt; only for controlled PoC or unstructured data experiments&lt;/li&gt;
&lt;li&gt;Do not position Unity Catalog + FSx for ONTAP S3 AP as production-ready until the session policy supports S3 Access Point ARN patterns&lt;/li&gt;
&lt;li&gt;Do not rely on kernel NFS mounts inside Databricks until the platform explicitly supports this path&lt;/li&gt;
&lt;li&gt;For Delta Lake production tables, continue to use supported object storage patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This validation should be used to guide architecture selection, not to disqualify Databricks from lakehouse workloads.&lt;/p&gt;

&lt;p&gt;This validation should not be used to compare AWS-native services and Databricks as competing platforms. AWS-native services (Athena, Bedrock, Glue, EMR Serverless, Lambda) each have &lt;a href="https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/using-access-points-with-aws-services.html" rel="noopener noreferrer"&gt;AWS-documented integration paths&lt;/a&gt; with FSx for ONTAP S3 AP — some validated in this series, others requiring workload-specific validation. Databricks is strong for governed lakehouse, Delta, ML, and production-scale data engineering workloads. The right choice depends on the access pattern, governance requirement, and workload type.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key contributions of this validation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Identified the root cause of NFS mount failure (seccomp BPF filter, not server-side denial) via strace analysis&lt;/li&gt;
&lt;li&gt;Discovered the &lt;code&gt;access_point&lt;/code&gt; field on External Location (via Databricks Support) that partially resolves the session policy&lt;/li&gt;
&lt;li&gt;Demonstrated that file-level read under UC governance is possible in this environment (1,000 rows read with schema inference)&lt;/li&gt;
&lt;li&gt;Mapped the complete evidence chain: network → ONTAP → NFS RPC → kernel → seccomp&lt;/li&gt;
&lt;li&gt;Established that Customer-managed VPC (same VPC as FSx for ONTAP) is the only validated network path&lt;/li&gt;
&lt;li&gt;Provided a reusable troubleshooting playbook for future S3 AP integration attempts&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. "S3-compatible" ≠ "works everywhere S3 works"
&lt;/h3&gt;

&lt;p&gt;FSx for ONTAP S3 AP is S3-compatible at the API level, but platform security layers (session policies, VPC restrictions) may not recognize the ARN format. S3 API compatibility and platform-integrated S3 governance are different things.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Error messages can be misleading
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;mount.nfs: access denied by server&lt;/code&gt; made me spend hours checking ONTAP export policies. The real issue was a local runtime restriction. Always use &lt;code&gt;strace&lt;/code&gt; when mount fails unexpectedly.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Platform security boundaries are not always documented
&lt;/h3&gt;

&lt;p&gt;You discover these boundaries by hitting them. The troubleshooting playbook above can save you time.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Customer-managed VPC is essential for storage integration
&lt;/h3&gt;

&lt;p&gt;If you need to connect Databricks to anything beyond standard S3 buckets, deploy in a Customer-managed VPC. A Databricks-managed VPC gives you less control over cluster networking than a Customer-managed VPC does.&lt;/p&gt;

&lt;p&gt;This was further confirmed by testing S3 AP access from a Databricks-managed VPC with VPC Peering: even with VPC Peering active, routes configured, security groups permissive, and an S3 Gateway Endpoint present, S3 AP requests to FSx for ONTAP timed out. The Databricks-managed VPC egress restrictions block not only direct IP communication but also S3 AP backend connectivity.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;S3 AP routing note&lt;/strong&gt;: S3 AP requests are routed through the S3 service endpoint, not directly to the FSx for ONTAP IP. VPC Peering between the requester VPC and the FSx for ONTAP VPC does not help because the S3 service needs internal connectivity to the FSx for ONTAP file system. Customer-managed VPC (same VPC as FSx for ONTAP) is the only validated path.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Databricks Control Plane (SaaS)
        ^
        | NAT Gateway (required outbound)
        |
Databricks Cluster ENI (Customer VPC, private subnet)
        |
        | Private VPC routing (no internet required)
        v
FSx for ONTAP ENI / SVM (same VPC, private subnet)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the Databricks Support Case Packet, include network evidence: cluster subnet ID, FSx for ONTAP subnet ID, route table IDs, security group rules, and DNS resolution for FSx for ONTAP endpoint.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Instance Profile is a pragmatic PoC workaround
&lt;/h3&gt;

&lt;p&gt;Use Instance Profile + boto3 as a controlled PoC workaround. Do not use it as a substitute for Unity Catalog governance without a formal security review.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Always isolate variables when troubleshooting
&lt;/h3&gt;

&lt;p&gt;When FSx for ONTAP S3 AP wasn't responding, I created a new SVM and volume to isolate the issue. This confirmed the problem was SVM-specific rather than a platform-wide limitation.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. Negative validation creates value
&lt;/h3&gt;

&lt;p&gt;A failed integration path can still create value when it prevents the wrong production architecture. This validation helps teams avoid assuming S3 API compatibility equals platform governance compatibility, choose the right engine for the right access pattern, and reduce time spent on ambiguous troubleshooting.&lt;/p&gt;




&lt;h2&gt;
  
  
  Databricks Support Case Packet
&lt;/h2&gt;

&lt;p&gt;If you open a support case with Databricks, include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Workspace type: Databricks-managed VPC or customer-managed VPC&lt;/li&gt;
&lt;li&gt;Cluster access mode and DBR version&lt;/li&gt;
&lt;li&gt;IAM role / instance profile configuration&lt;/li&gt;
&lt;li&gt;Unity Catalog storage credential and external location configuration&lt;/li&gt;
&lt;li&gt;Full AccessDenied error message (including the ARN and "no session policy" text)&lt;/li&gt;
&lt;li&gt;S3 AP ARN and alias format&lt;/li&gt;
&lt;li&gt;Network test results for NFS ports (TCP 2049, TCP 111, TCP 635)&lt;/li&gt;
&lt;li&gt;strace output showing &lt;code&gt;mount()&lt;/code&gt; EACCES&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/proc/self/status&lt;/code&gt; showing seccomp mode&lt;/li&gt;
&lt;li&gt;User-space NFS RPC success evidence (if applicable)&lt;/li&gt;
&lt;li&gt;Instance Profile boto3 success evidence (if applicable)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;showmount -e&lt;/code&gt; output (confirms export visibility)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tmpfs&lt;/code&gt; mount success evidence (proves mount syscall itself is allowed)&lt;/li&gt;
&lt;/ul&gt;
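
&lt;p&gt;The network test results for the NFS ports can be collected from a notebook with the standard library alone. The sketch below uses hypothetical helper names and a placeholder host name. It attempts one TCP connection per port and keeps the error text, which helps distinguish a refused connection from a timeout.&lt;/p&gt;

```python
import socket

# Ports named in the checklist above (commonly NFS, portmapper, and mountd).
NFS_PORTS = (2049, 111, 635)


def tcp_reachable(host, port, timeout=3.0):
    """Return (True, "connected") or (False, error text) for one connect attempt."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except OSError as exc:  # covers refusal, timeout, and DNS resolution errors
        return False, f"{type(exc).__name__}: {exc}"


def check_ports(host, ports=NFS_PORTS, timeout=3.0):
    """Map each port to its connect result so the output can be pasted into a case."""
    return {port: tcp_reachable(host, port, timeout) for port in ports}


# Example with a placeholder host name (replace with your SVM NFS endpoint):
#   print(check_ports("svm-nfs.example.internal"))
```

&lt;p&gt;A successful TCP connection only shows that routing and security groups allow the traffic. It says nothing about the export policy or the local mount restriction, which are separate layers in the playbook.&lt;/p&gt;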




&lt;h2&gt;
  
  
  Use Case Fit Matrix
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;When this article says "validated in this broader series," it refers to evidence captured in the linked verification-pack or related articles, not to Databricks-specific validation in this Part 2 article.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use case&lt;/th&gt;
&lt;th&gt;Best current path&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SQL analytics on structured NAS files&lt;/td&gt;
&lt;td&gt;Athena + FSx for ONTAP S3 AP&lt;/td&gt;
&lt;td&gt;Verified read-oriented path with AWS-side governance controls, serverless&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Enterprise IT RAG over documents&lt;/td&gt;
&lt;td&gt;Bedrock Knowledge Bases + FSx for ONTAP S3 AP&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/tutorial-build-rag-with-bedrock.html" rel="noopener noreferrer"&gt;AWS-documented tutorial&lt;/a&gt;; also validated in &lt;a href="https://dev.to/aws-builders/building-an-agentic-access-aware-rag-system-with-amazon-fsx-for-netapp-ontap-s3-vectors-and-s3-2b86"&gt;related series&lt;/a&gt; with permission-aware retrieval&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ETL / data transformation&lt;/td&gt;
&lt;td&gt;Glue or EMR Serverless + FSx for ONTAP S3 AP&lt;/td&gt;
&lt;td&gt;Validated in this broader series where &lt;a href="https://github.com/Yoshiki0705/fsxn-lakehouse-integrations/tree/main/verification-pack" rel="noopener noreferrer"&gt;verification-pack&lt;/a&gt; evidence is available; validate production write-back semantics separately&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Serverless file processing (thumbnails, OCR, transcription)&lt;/td&gt;
&lt;td&gt;Lambda + FSx for ONTAP S3 AP&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/tutorial-process-files-with-lambda.html" rel="noopener noreferrer"&gt;AWS-documented tutorial&lt;/a&gt;; validate for your workload&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Large-scale Spark ETL&lt;/td&gt;
&lt;td&gt;EMR Serverless + FSx for ONTAP S3 AP or standard S3 bucket&lt;/td&gt;
&lt;td&gt;Validated in this broader series where verification-pack evidence is available; Databricks executor-scale access is not validated on S3 AP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Production Delta Lake tables&lt;/td&gt;
&lt;td&gt;Supported object storage (S3 bucket)&lt;/td&gt;
&lt;td&gt;Required for Delta write semantics and UC governance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unstructured data experimentation (Databricks)&lt;/td&gt;
&lt;td&gt;Instance Profile + boto3 PoC&lt;/td&gt;
&lt;td&gt;Works in driver-only pattern, needs governance review&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Video streaming from NAS&lt;/td&gt;
&lt;td&gt;CloudFront + FSx for ONTAP S3 AP&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/tutorial-stream-video-with-cloudfront.html" rel="noopener noreferrer"&gt;AWS-documented tutorial&lt;/a&gt;; validate caching, latency, and file size for your content&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;External partner file exchange&lt;/td&gt;
&lt;td&gt;Transfer Family + FSx for ONTAP S3 AP&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://docs.aws.amazon.com/transfer/latest/userguide/fsx-s3-access-points.html" rel="noopener noreferrer"&gt;AWS-documented path&lt;/a&gt;; also validated in &lt;a href="https://dev.to/aws-builders/smart-routing-transfer-family-ingestion-and-voice-chat-permission-aware-rag-v42-3iml"&gt;related series&lt;/a&gt;; validate file operation limitations (rename, append, upload size)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lightweight serverless analytics&lt;/td&gt;
&lt;td&gt;DuckDB Lambda + FSx for ONTAP S3 AP&lt;/td&gt;
&lt;td&gt;Planned Part 4 validation; candidate for lightweight, low-idle-cost analytics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BI / dashboarding over NAS data&lt;/td&gt;
&lt;td&gt;Candidate: QuickSight via Athena or Glue Catalog&lt;/td&gt;
&lt;td&gt;AWS positions BI as a candidate use case; validate whether access path is Athena-backed or catalog-mediated&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Cost Model Considerations
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Engine&lt;/th&gt;
&lt;th&gt;Primary cost driver&lt;/th&gt;
&lt;th&gt;Best fit&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Athena&lt;/td&gt;
&lt;td&gt;Data scanned (per TB)&lt;/td&gt;
&lt;td&gt;Occasional SQL queries, serverless&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bedrock Knowledge Bases&lt;/td&gt;
&lt;td&gt;Model invocation + embedding + retrieval&lt;/td&gt;
&lt;td&gt;RAG / GenAI over enterprise documents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Glue&lt;/td&gt;
&lt;td&gt;DPU-hours&lt;/td&gt;
&lt;td&gt;ETL pipelines, data transformation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Databricks&lt;/td&gt;
&lt;td&gt;DBU + cloud compute instance hours&lt;/td&gt;
&lt;td&gt;Lakehouse pipelines, ML, Delta workloads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EMR Serverless&lt;/td&gt;
&lt;td&gt;vCPU / memory × runtime duration&lt;/td&gt;
&lt;td&gt;Spark ETL without persistent clusters&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lambda + DuckDB&lt;/td&gt;
&lt;td&gt;Invocation duration × memory&lt;/td&gt;
&lt;td&gt;Lightweight serverless analytics, event-driven&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CloudFront&lt;/td&gt;
&lt;td&gt;Data transfer + requests&lt;/td&gt;
&lt;td&gt;Video/media streaming from NAS&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;Cost comparison is not the focus of this article. Each engine has a fundamentally different pricing model. Databricks provides &lt;a href="https://docs.databricks.com/aws/en/admin/clusters/policy-definition.html" rel="noopener noreferrer"&gt;compute policies&lt;/a&gt; to control cluster creation, instance types, auto-termination, and cost-related attributes. For cost optimization, evaluate based on workload pattern (interactive vs batch, frequency, data volume) rather than unit price alone.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Partner / Customer Conversation Guide
&lt;/h2&gt;

&lt;p&gt;If a customer asks whether Databricks can directly process FSx for ONTAP S3 Access Point data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AWS-native service paths&lt;/strong&gt; such as Athena, Bedrock Knowledge Bases, Glue, EMR Serverless, Lambda, CloudFront, and Transfer Family have &lt;a href="https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/using-access-points-with-aws-services.html" rel="noopener noreferrer"&gt;AWS-documented integration patterns&lt;/a&gt; with FSx for ONTAP S3 AP. In this series, Athena (Part 1), Glue, and EMR Serverless have been validated; the other paths should be validated per workload, Region, IAM model, FSx for ONTAP-side authorization, and governance requirement.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Databricks Unity Catalog&lt;/strong&gt; integration requires vendor confirmation for S3 Access Point ARN handling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instance Profile + boto3&lt;/strong&gt; can be used for controlled PoC experiments, but it bypasses Unity Catalog governance and is classified as a &lt;a href="https://docs.databricks.com/en/admin/sql/data-access-configuration.html" rel="noopener noreferrer"&gt;legacy data access pattern&lt;/a&gt; by Databricks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production Delta Lake workloads&lt;/strong&gt; should continue to use supported object storage patterns&lt;/li&gt;
&lt;li&gt;Any Databricks integration should be validated per workspace type, cluster mode, runtime version, IAM path, and governance requirement&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Next Validation Metrics
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Current blocker&lt;/strong&gt;: Executor-scale validation requires a Customer-managed VPC workspace (same VPC as FSx for ONTAP). The Databricks-managed VPC workspace was tested with VPC Peering and Instance Profile (2026-05-24); S3 AP access timed out due to managed VPC egress restrictions. Creation of a Customer-managed VPC workspace is pending resolution of a Databricks support ticket.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For executor-scale validation (not yet performed):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Object listing latency per executor&lt;/li&gt;
&lt;li&gt;Total objects processed across cluster&lt;/li&gt;
&lt;li&gt;Per-executor success/failure rate&lt;/li&gt;
&lt;li&gt;Throughput per executor&lt;/li&gt;
&lt;li&gt;Retry count and S3 API error rate&lt;/li&gt;
&lt;li&gt;FSx for ONTAP throughput utilization during distributed access&lt;/li&gt;
&lt;li&gt;Cost per processed GB&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Driver-only boto3 success is not sufficient for Spark workloads. The next validation should run boto3 calls from executors using &lt;code&gt;mapPartitions&lt;/code&gt; and compare credential, routing, latency, and error behavior across workers.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Executor-scale validation should not only test success/failure. It should capture per-executor latency, retry count, error code, and object count so that routing and concurrency behavior can be reviewed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benchmark run guidance:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cold run: at least 1 (first access after cluster start, no metadata cache)&lt;/li&gt;
&lt;li&gt;Warm metadata run: at least 1 (after initial listing populates metadata cache)&lt;/li&gt;
&lt;li&gt;Repeated run: at least 3 (steady-state measurement)&lt;/li&gt;
&lt;li&gt;Report: p50, p90, p95, p99 latency, plus average, min, max, and outliers&lt;/li&gt;
&lt;li&gt;Include: object count, average object size, prefix depth, concurrent executor count&lt;/li&gt;
&lt;li&gt;Include: FSx for ONTAP throughput utilization during test window&lt;/li&gt;
&lt;li&gt;Note: S3 AP via FSx for ONTAP may exhibit metadata warm-up effects and prefix layout sensitivity. Cold vs warm differences should be documented explicitly.&lt;/li&gt;
&lt;/ul&gt;
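
&lt;p&gt;The latency report described above can be produced with the Python standard library. This is a minimal sketch with a hypothetical function name. It uses &lt;code&gt;statistics.quantiles&lt;/code&gt; with the inclusive method, which is one of several reasonable percentile definitions and not necessarily the one your monitoring tools use.&lt;/p&gt;

```python
import statistics


def summarize_latencies(latencies_ms):
    """Summarize per-call latencies (milliseconds) for one benchmark phase."""
    if len(latencies_ms) in (0, 1):
        raise ValueError("need at least two samples to compute percentiles")
    # n=100 yields 99 cut points; index k-1 holds the k-th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "count": len(latencies_ms),
        "min": min(latencies_ms),
        "max": max(latencies_ms),
        "avg": statistics.fmean(latencies_ms),
        "p50": cuts[49],
        "p90": cuts[89],
        "p95": cuts[94],
        "p99": cuts[98],
    }


# Keep cold, warm, and repeated runs in separate summaries, for example:
#   report = {phase: summarize_latencies(values) for phase, values in runs.items()}
```

&lt;p&gt;With only a handful of samples the upper percentiles are interpolated and sit close to the maximum, so p95 and p99 become meaningful only once each phase has collected many individual calls.&lt;/p&gt;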

&lt;p&gt;&lt;strong&gt;Additional FSx for ONTAP metrics to capture:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;FSx for ONTAP throughput utilization (% of provisioned capacity)&lt;/li&gt;
&lt;li&gt;FSx for ONTAP CPU utilization&lt;/li&gt;
&lt;li&gt;Network throughput (inbound/outbound)&lt;/li&gt;
&lt;li&gt;S3 API request count by operation (List, Get, Head)&lt;/li&gt;
&lt;li&gt;File count per prefix&lt;/li&gt;
&lt;li&gt;Average object size&lt;/li&gt;
&lt;li&gt;NFS/SMB latency during concurrent S3 API reads (contention indicator)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Expected output format (JSONL per executor):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"executor_host"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ip-10-0-xx-yy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"partition_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"operation"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"list_objects_v2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"success"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"latency_ms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;183&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"objects_seen"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"error_code"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
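
&lt;p&gt;A per-partition probe that emits records in the shape above could look like the following sketch. It has not been run at executor scale (that validation is still pending), and &lt;code&gt;probe_partition&lt;/code&gt;, &lt;code&gt;make_client&lt;/code&gt;, and &lt;code&gt;AP_ARN&lt;/code&gt; are hypothetical names. The client factory is called inside the function so that credentials are resolved on the executor rather than on the driver.&lt;/p&gt;

```python
import json
import socket
import time


def probe_partition(partition_id, rows, make_client, bucket, prefix=""):
    """Run one ListObjectsV2 call from wherever this partition executes.

    Yields a single JSON line using the field names from the record above.
    `make_client` is a zero-argument callable returning a boto3-like S3 client.
    """
    for _ in rows:  # drain the input partition; its contents are not used
        pass
    record = {
        "executor_host": socket.gethostname(),
        "partition_id": partition_id,
        "operation": "list_objects_v2",
        "status": "success",
        "latency_ms": None,
        "objects_seen": 0,
        "error_code": None,
    }
    started = time.perf_counter()
    try:
        page = make_client().list_objects_v2(Bucket=bucket, Prefix=prefix)
        record["objects_seen"] = page.get("KeyCount", len(page.get("Contents", [])))
    except Exception as exc:  # keep the failure as data instead of failing the job
        record["status"] = "error"
        response = getattr(exc, "response", None) or {}
        record["error_code"] = response.get("Error", {}).get("Code", type(exc).__name__)
    record["latency_ms"] = round((time.perf_counter() - started) * 1000)
    yield json.dumps(record)


# Usage on a cluster (assumes a SparkContext `sc`, boto3, and your own AP_ARN):
#   import boto3
#   lines = sc.parallelize(range(8), 8).mapPartitionsWithIndex(
#       lambda i, rows: probe_partition(i, rows, lambda: boto3.client("s3"), AP_ARN, "poc/")
#   ).collect()
```

&lt;p&gt;Recording failures as rows instead of raising keeps the Spark job alive, so one blocked executor does not hide the results from the others.&lt;/p&gt;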






&lt;h2&gt;
  
  
  Adoption Success Metrics
&lt;/h2&gt;

&lt;p&gt;For a controlled Databricks + FSx for ONTAP S3 AP PoC, define success criteria beyond technical pass/fail:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Baseline metrics (capture before validation):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Average search/access time (minutes) for target documents&lt;/li&gt;
&lt;li&gt;Monthly document access count via current path&lt;/li&gt;
&lt;li&gt;Current copy pipeline runtime (if applicable)&lt;/li&gt;
&lt;li&gt;Current data freshness lag (hours)&lt;/li&gt;
&lt;li&gt;Current support ticket count related to data access&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;PoC outcome metrics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Number of target datasets evaluated&lt;/li&gt;
&lt;li&gt;Number of successful read operations&lt;/li&gt;
&lt;li&gt;Number of governance exceptions required&lt;/li&gt;
&lt;li&gt;Time to first successful access&lt;/li&gt;
&lt;li&gt;Number of support issues raised&lt;/li&gt;
&lt;li&gt;Whether the customer selected Athena, Databricks, or another engine after validation&lt;/li&gt;
&lt;li&gt;Decision outcome: proceed / adjust / stop&lt;/li&gt;
&lt;li&gt;Time saved by early boundary identification (vs discovering in production)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Stop criteria:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No measurable business value after validation period&lt;/li&gt;
&lt;li&gt;Governance exception required for production path with no compensating control available&lt;/li&gt;
&lt;li&gt;Executor-scale validation fails with unacceptable error rate (define threshold before PoC)&lt;/li&gt;
&lt;li&gt;FSx for ONTAP workload impact exceeds approved threshold (e.g., throughput utilization &amp;gt; 80%)&lt;/li&gt;
&lt;li&gt;Vendor confirmation indicates unsupported path with no roadmap commitment&lt;/li&gt;
&lt;li&gt;Security review rejects the access path without remediation option&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Series Evaluation Criteria
&lt;/h2&gt;

&lt;p&gt;Across this series, each engine is evaluated by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read-path compatibility&lt;/li&gt;
&lt;li&gt;Write-path compatibility&lt;/li&gt;
&lt;li&gt;Governance model&lt;/li&gt;
&lt;li&gt;Operational impact&lt;/li&gt;
&lt;li&gt;Performance evidence&lt;/li&gt;
&lt;li&gt;Production readiness gap&lt;/li&gt;
&lt;li&gt;Best-fit use case&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Well-Architected Mapping
&lt;/h3&gt;

&lt;p&gt;These criteria align with the &lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/analytics-lens/analytics-lens.html" rel="noopener noreferrer"&gt;AWS Well-Architected Data Analytics Lens&lt;/a&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pillar&lt;/th&gt;
&lt;th&gt;Evaluation focus in this series&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Security&lt;/td&gt;
&lt;td&gt;Governance model, IAM/AP policy, audit evidence, session policy behavior&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reliability&lt;/td&gt;
&lt;td&gt;Failure modes, rollback path, support case evidence, DR considerations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Performance Efficiency&lt;/td&gt;
&lt;td&gt;Throughput, executor-scale behavior, FSx for ONTAP utilization, latency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost Optimization&lt;/td&gt;
&lt;td&gt;Engine-specific cost model, idle cost, cost per processed GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operational Excellence&lt;/td&gt;
&lt;td&gt;Runbook, evidence template, support packet, monitoring&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Business Value of Negative Validation
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Negative validation is not failure. It is risk reduction.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A failed integration path can still create value when it prevents the wrong production architecture. This validation helps teams:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Avoid assuming&lt;/strong&gt; S3 API compatibility equals platform governance compatibility&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Choose the right engine&lt;/strong&gt; for the right access pattern (Athena for read-only SQL, Databricks for lakehouse/ML)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Identify early&lt;/strong&gt; when vendor confirmation is required before committing architecture&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reduce time&lt;/strong&gt; spent on ambiguous troubleshooting by providing reproducible evidence&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prevent wasted PoC investment&lt;/strong&gt; by documenting boundaries before production design&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enable informed conversations&lt;/strong&gt; with vendors, partners, and security reviewers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For enterprise customers, early boundary identification can save weeks of engineering time and prevent costly architecture rework after production deployment.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Series index:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://dev.to/aws-builders/query-nas-data-in-place-with-athena-and-fsx-for-ontap-s3-access-points-3lhh"&gt;Part 1: Athena — Query NAS Data In Place&lt;/a&gt; (validated read-oriented path, 9/9 negative tests pass)&lt;/li&gt;
&lt;li&gt;Part 2: Databricks (this article) — session policy deep dive&lt;/li&gt;
&lt;li&gt;Part 3: Snowflake — LIST Works, SELECT Doesn't (same session policy pattern)&lt;/li&gt;
&lt;li&gt;Part 4: DuckDB Lambda — lightweight serverless analytics validation&lt;/li&gt;
&lt;li&gt;Part 5: EMR Spark — read-write ETL pipeline (coming soon)&lt;/li&gt;
&lt;li&gt;Part 6: Redshift Spectrum — DWH meets NAS data (coming soon)&lt;/li&gt;
&lt;li&gt;Part 7: Trino — open-source SQL on NAS data (coming soon)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Open items:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Support cases: Waiting for Databricks response on session policy and NFS mount questions&lt;/li&gt;
&lt;li&gt;FUSE NFS client: Investigating whether a user-space NFS client can bypass the runtime restriction&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Caution on FUSE/user-space NFS&lt;/strong&gt;: FUSE or user-space NFS clients should be treated as experimental only. They require separate validation for POSIX semantics, caching behavior, consistency, performance, failure recovery, and vendor supportability. Do not treat user-space NFS RPC success as a production workaround.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Related series by the same author&lt;/strong&gt; (FSx for ONTAP S3 Access Points with other AWS services):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://dev.to/aws-builders/building-an-agentic-access-aware-rag-system-with-amazon-fsx-for-netapp-ontap-s3-vectors-and-s3-2b86"&gt;Building an Agentic Access-Aware RAG System with Amazon FSx for NetApp ONTAP, S3 Vectors, and S3 Access Points&lt;/a&gt; — Bedrock Knowledge Bases + permission-aware retrieval (&lt;a href="https://github.com/Yoshiki0705/FSx-for-ONTAP-Agentic-Access-Aware-RAG" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/aws-builders/fsx-for-ontap-s3-access-points-as-a-serverless-automation-boundary-ai-data-pipelines-ili"&gt;FSx for ONTAP S3 Access Points as a Serverless Automation Boundary — AI Data Pipelines, Volume-Level SnapMirror DR, and Capacity Guardrails&lt;/a&gt; — Lambda, Bedrock, SageMaker, 17 industry use cases (&lt;a href="https://github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/aws-builders/smart-routing-transfer-family-ingestion-and-voice-chat-permission-aware-rag-v42-3iml"&gt;Smart Routing, Transfer Family Ingestion, and Voice Chat — Permission-Aware RAG v4.2&lt;/a&gt; — Transfer Family + SFTP ingestion for RAG pipeline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;ONTAP S3 Multiprotocol vs FSx for ONTAP S3 Access Points:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.netapp.com/us-en/ontap/s3-multiprotocol/" rel="noopener noreferrer"&gt;ONTAP S3 multiprotocol&lt;/a&gt; (ONTAP 9.12.1+): S3 NAS bucket model on ONTAP SVM, enabling S3 clients to access NAS data directly on the ONTAP cluster&lt;/li&gt;
&lt;li&gt;FSx for ONTAP S3 Access Points: AWS-managed S3 Access Point endpoint attached to FSx for ONTAP volume, integrating with AWS IAM, VPC, and S3-compatible services&lt;/li&gt;
&lt;li&gt;Both expose NAS data via S3-style access, but the authorization path, service integration, and operational model differ. This article focuses on FSx for ONTAP S3 Access Points.&lt;/li&gt;
&lt;/ul&gt;




&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/about-aws/whats-new/2025/12/amazon-fsx-netapp-ontap-s3-access/" rel="noopener noreferrer"&gt;AWS What's New: Amazon FSx for NetApp ONTAP now supports Amazon S3 access (Dec 2, 2025)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/s3-access-points.html" rel="noopener noreferrer"&gt;FSx for ONTAP S3 Access Points documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/tutorial-query-data-with-athena.html" rel="noopener noreferrer"&gt;AWS Tutorial: Query files with Athena&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.databricks.com/en/security/network/classic/customer-managed-vpc.html" rel="noopener noreferrer"&gt;Databricks Customer-managed VPC&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.databricks.com/en/connect/storage/tutorial-s3-instance-profile.html" rel="noopener noreferrer"&gt;Databricks Instance Profiles&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.databricks.com/en/connect/unity-catalog/storage-credentials.html" rel="noopener noreferrer"&gt;Databricks Unity Catalog External Locations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Yoshiki0705/fsxn-lakehouse-integrations" rel="noopener noreferrer"&gt;GitHub: fsxn-lakehouse-integrations&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;This article is part of the "FSx for ONTAP S3 Access Points × Lakehouse Deep Dive" series. All tests were performed on a real AWS environment with FSx for ONTAP (ONTAP 9.17.1, ap-northeast-1) and Databricks (DBR 17.3 LTS, Premium tier) in May 2026.&lt;/em&gt;&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Scope reminder&lt;/strong&gt;: This article documents observed behavior in one validated environment. It does not validate production readiness, distributed executor-scale processing, or all Databricks runtime versions. Terminology uses "observed in this environment" rather than "unsupported" or "incompatible" — platform behavior may change with future updates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Future updates&lt;/strong&gt;: If Databricks platform behavior changes or vendor confirmation becomes available, this article should be updated with the new validation result rather than treated as a permanent compatibility statement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disclaimer&lt;/strong&gt;: This article is an independent validation report and does not represent Databricks, AWS, or NetApp official guidance. Product behavior, support status, and platform capabilities may change. Always validate in your own environment and consult vendor documentation and support channels.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>aws</category>
      <category>databricks</category>
      <category>amazonfsxfornetappontap</category>
      <category>lakehouse</category>
    </item>
    <item>
      <title>What Does a Databricks Consulting Partner Actually Do? (An Enterprise Buyer's Guide)</title>
      <dc:creator>Lucy </dc:creator>
      <pubDate>Wed, 20 May 2026 09:26:49 +0000</pubDate>
      <link>https://dev.to/lucy1/what-does-a-databricks-consulting-partner-actually-do-an-enterprise-buyers-guide-168m</link>
      <guid>https://dev.to/lucy1/what-does-a-databricks-consulting-partner-actually-do-an-enterprise-buyers-guide-168m</guid>
      <description>&lt;p&gt;You've probably sat through at least one vendor call where someone said &lt;br&gt;
"end-to-end Databricks implementation" three times in ten minutes and still left with no idea what they'd actually &lt;em&gt;do&lt;/em&gt; after signing.&lt;/p&gt;

&lt;p&gt;That's the problem with how most &lt;strong&gt;Databricks consulting services&lt;/strong&gt; are sold. The language is polished. The decks look great. But the specifics? Suspiciously vague.&lt;/p&gt;

&lt;p&gt;So let's just say the quiet part out loud: here's what a real partner does, week by week, and what separates a genuinely good one from a well-branded generalist.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 4 Things a Databricks Partner Is Actually Responsible For
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Architecture First, Not Notebooks First
&lt;/h3&gt;

&lt;p&gt;The first red flag? A partner who opens a Databricks workspace before they've audited your current data estate.&lt;/p&gt;

&lt;p&gt;A good one starts by understanding what you already have: your sources, your pipelines, your governance gaps, and where money is quietly leaking. Only then do they design an environment that fits your workloads.&lt;/p&gt;

&lt;p&gt;In practice, that means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Choosing the right cloud (AWS, Azure, or GCP) based on your existing 
infrastructure, not on what the partner is most comfortable with&lt;/li&gt;
&lt;li&gt;Designing a medallion architecture (Bronze → Silver → Gold) with your 
actual data volumes in mind&lt;/li&gt;
&lt;li&gt;Standing up Unity Catalog for governance from day one, not as an afterthought 
six months later when things get messy&lt;/li&gt;
&lt;/ul&gt;
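
&lt;p&gt;To make the Bronze → Silver → Gold idea concrete, here is a minimal plain-Python sketch of the three layers. It is not Databricks or Delta Lake code: the sample rows, the field names, and the &lt;code&gt;to_silver&lt;/code&gt; and &lt;code&gt;to_gold&lt;/code&gt; helpers are all made up for illustration, and a real implementation would use Spark tables rather than lists of dicts.&lt;/p&gt;

```python
from collections import defaultdict

# Hypothetical raw events as they might land in a Bronze layer (unvalidated strings).
bronze = [
    {"order_id": "1", "region": "EU", "amount": "19.90"},
    {"order_id": "2", "region": "US", "amount": "not-a-number"},
    {"order_id": "3", "region": "EU", "amount": "5.10"},
]


def to_silver(rows):
    """Type and validate Bronze rows; quarantine anything that fails."""
    clean, quarantined = [], []
    for row in rows:
        try:
            clean.append({
                "order_id": int(row["order_id"]),
                "region": row["region"],
                "amount": float(row["amount"]),
            })
        except (KeyError, ValueError):
            quarantined.append(row)
    return clean, quarantined


def to_gold(rows):
    """Aggregate Silver rows into a business-level table: revenue per region."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return {region: round(total, 2) for region, total in totals.items()}


silver, rejected = to_silver(bronze)
gold = to_gold(silver)
print(gold, "quarantined:", len(rejected))
```

&lt;p&gt;The point of the layering is that bad records are set aside at the Silver step instead of silently reaching the Gold tables your dashboards read from.&lt;/p&gt;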

&lt;h3&gt;
  
  
  2. Pipeline Engineering: The Real Heavy Lifting
&lt;/h3&gt;

&lt;p&gt;Most enterprise data sits across five different places: a legacy ERP, a couple of SaaS tools, some flat files someone's been emailing around, and a Snowflake instance that half the team has forgotten the password to.&lt;/p&gt;

&lt;p&gt;A Databricks partner consolidates this by building Delta Live Tables pipelines or custom Spark jobs that handle schema evolution, bad data, and SLA expectations. Not "it works on my machine" pipelines. Production-grade ones.&lt;/p&gt;

&lt;p&gt;If you're coming from Hadoop or an aging data warehouse, this is where 90% of the real effort lives. It's also where you'll quickly learn whether your partner has actually done this before or just watched the conference talk.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Cost and Performance: Ongoing, Not Optional
&lt;/h3&gt;

&lt;p&gt;Here's something vendors rarely lead with: Databricks compute costs can spiral fast if nobody's actively managing them.&lt;/p&gt;

&lt;p&gt;A partner worth keeping around puts in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Auto-scaling cluster policies so you're not paying for idle compute at 2am&lt;/li&gt;
&lt;li&gt;Photon engine tuning for SQL-heavy workloads&lt;/li&gt;
&lt;li&gt;Cost dashboards that map spend to actual business units, so finance 
stops asking you to explain the cloud bill&lt;/li&gt;
&lt;/ul&gt;
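
&lt;p&gt;As a rough illustration of the first and third bullets, the sketch below builds a cluster policy definition as a Python dict and checks a proposed cluster against it. The attribute paths and rule types follow the Databricks cluster policy definition format as I understand it; the specific limits, the &lt;code&gt;business_unit&lt;/code&gt; tag, and the &lt;code&gt;violations&lt;/code&gt; helper are assumptions for this example. The helper is a local toy check, not a Databricks API.&lt;/p&gt;

```python
import json

# Illustrative policy definition; every limit and tag value here is an assumption.
policy_definition = {
    "autoscale.min_workers": {"type": "fixed", "value": 1},
    "autoscale.max_workers": {"type": "range", "maxValue": 8},
    "autotermination_minutes": {"type": "range", "minValue": 10, "maxValue": 30},
    "custom_tags.business_unit": {"type": "allowlist", "values": ["finance", "marketing"]},
}


def violations(cluster, policy):
    """Toy local check of a flattened cluster spec against the policy rules."""
    problems = []
    for path, rule in policy.items():
        value = cluster.get(path)
        if rule["type"] == "fixed" and value != rule["value"]:
            problems.append(path)
        elif rule["type"] == "range":
            low = rule.get("minValue", value)
            high = rule.get("maxValue", value)
            if value is None or value > high or low > value:
                problems.append(path)
        elif rule["type"] == "allowlist" and value not in rule["values"]:
            problems.append(path)
    return problems


proposed = {
    "autoscale.min_workers": 1,
    "autoscale.max_workers": 20,  # exceeds the illustrative cap of 8
    "autotermination_minutes": 20,
    "custom_tags.business_unit": "finance",
}

policy_json = json.dumps(policy_definition, indent=2)  # payload you would review or submit
print(violations(proposed, policy_definition))
```

&lt;p&gt;Requiring a business-unit tag at cluster creation is what makes the cost dashboard possible later: spend can only be mapped to a team if every cluster carries the tag.&lt;/p&gt;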

&lt;p&gt;This isn't a one-time setup. It's a habit. If a partner treats it as a &lt;br&gt;
checkbox, your AWS invoice will tell you eventually.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. ML and AI Enablement: When You're Ready to Go Beyond Dashboards
&lt;/h3&gt;

&lt;p&gt;A lot of enterprise teams reach a point where SQL dashboards aren't enough. They want predictions, recommendations, and anomaly detection: actual ML in production.&lt;/p&gt;

&lt;p&gt;A Databricks partner with real ML capability sets up MLflow for experiment tracking, builds feature pipelines through Feature Store, and helps your data science team stop rebuilding infrastructure every time they want to ship a model.&lt;/p&gt;

&lt;p&gt;This is genuinely where the Databricks ecosystem shines and where the right partner can save months of engineering time.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Actually Vet a Databricks Partner (Beyond the Sales Deck)
&lt;/h2&gt;

&lt;p&gt;Most of this won't be on their website. You have to ask.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check for Databricks certification at the engineer level&lt;/strong&gt;, not just a partner tier badge. Certified Data Engineer Associate or Professional means someone on their team has passed a hands-on technical exam. That's meaningful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ask for vertical-specific references.&lt;/strong&gt; A partner who's built lakehouse pipelines for a D2C brand thinks about schema design very differently than one who's only done banking compliance reporting. Generic case studies are a yellow flag.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pin down the post-go-live model.&lt;/strong&gt; Ask: &lt;em&gt;"What does month three with your team look like?"&lt;/em&gt; If the answer is vague or pivots back to the onboarding process, they're not thinking past the implementation phase.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Confirm you own the code.&lt;/strong&gt; Sounds obvious. Isn't always. Any partner who builds undocumented pipelines or ties you to proprietary tooling is creating dependency, not capability. Get this in writing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Timing Matters More Than Most People Think
&lt;/h2&gt;

&lt;p&gt;The best moment to bring in a Databricks partner is before your data &lt;br&gt;
team has built workarounds they're now defending as architecture.&lt;/p&gt;

&lt;p&gt;Before ad-hoc notebooks become your production pipeline. Before cluster &lt;br&gt;
policies are an afterthought. Before your engineers are spending more time firefighting than building.&lt;/p&gt;

&lt;p&gt;If AI and ML use cases are on your roadmap alongside the data modernization work (and they probably should be), it's worth reading &lt;a href="https://dev.to/lucy1/why-mid-market-enterprises-need-an-ai-consulting-partner-before-2027-g50"&gt;why mid-market enterprises are moving on AI consulting partnerships before 2027&lt;/a&gt;. The timelines are more connected than most teams realize.&lt;/p&gt;

&lt;h2&gt;
  
  
  One Last Thing: Good Partners Ask Uncomfortable Questions
&lt;/h2&gt;

&lt;p&gt;The best Databricks consulting services engagement you'll ever have won't start with a proposal. It'll start with questions that make you think.&lt;/p&gt;

&lt;p&gt;Things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;"What does 'data-ready' actually mean for your business in 12 months?"&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"Who currently owns data quality decisions and what happens when 
something breaks?"&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"What's the real blocker for your team right now? skills, tooling, 
or architecture?"&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If a vendor skips all of that and jumps to pricing, pay attention to &lt;br&gt;
that instinct telling you something's off.&lt;/p&gt;

&lt;p&gt;For a grounded look at what structured &lt;a href="https://www.lucentinnovation.com/services/databricks-consulting" rel="noopener noreferrer"&gt;Databricks consulting services&lt;/a&gt; actually cover, including certifications, engagement models, and specific deliverables, that page is a solid benchmark before your next vendor call.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Evaluating Databricks partners? Drop the questions you're struggling to get straight answers on in the comments. Happy to help you cut through the noise.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>databricks</category>
      <category>dataengineering</category>
      <category>databricksconsulting</category>
      <category>databrickspartners</category>
    </item>
    <item>
      <title>What Are Machine Learning Models? Types - Databricks</title>
      <dc:creator>Jose Francisco Bustamante Ocampo</dc:creator>
      <pubDate>Sat, 16 May 2026 15:38:55 +0000</pubDate>
      <link>https://dev.to/jose_franciscobustamante/cosa-sono-i-modelli-di-apprendimento-automatico-tipi-databricks-3llg</link>
      <guid>https://dev.to/jose_franciscobustamante/cosa-sono-i-modelli-di-apprendimento-automatico-tipi-databricks-3llg</guid>
      <description>&lt;h1&gt;
  
  
  What Are Machine Learning Models? Types - Databricks
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Breaking AI news from &lt;strong&gt;Google News: Machine Learning (IT)&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What Happened
&lt;/h2&gt;

&lt;p&gt;📰 Google News: Machine Learning (IT) is reporting on this story. This is an AI development worth watching closely.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why It Matters
&lt;/h2&gt;

&lt;p&gt;This story could have significant implications for the global community following AI trends.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;📌 Reported by &lt;strong&gt;Google News: Machine Learning (IT)&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;📌 Category: &lt;strong&gt;ai&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;📌 &lt;a href="https://news.google.com/rss/articles/CBMidkFVX3lxTE9yY0FWeWFaelBDdGc0WFdBd3o1UUlLVm1VWHAtSDB4Sjd2eDVjbFYxb0MwdURSaW1qMnhZWVVQQWViTHVwZXNxYlhVRWhfNE5DYW9rLVppeUotTURERmNYQl9ES3RGcEhmc3BJc1RscFFkTVBzdEE?oc=5" rel="noopener noreferrer"&gt;Read full story →&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;a href="https://t.me/GlobalWFeed" rel="noopener noreferrer"&gt;&lt;strong&gt;Follow GlobalWFeed on Telegram →&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;🤖 Automatically published by Global Feed Bot&lt;/em&gt;&lt;/p&gt;

</description>
      <category>cosa</category>
      <category>tipi</category>
      <category>databricks</category>
    </item>
    <item>
      <title>Lakebase, Meet PDB: The "Third-Generation" Database Oracle Shipped in 2013</title>
      <dc:creator>Rick Houlihan</dc:creator>
      <pubDate>Mon, 11 May 2026 19:32:27 +0000</pubDate>
      <link>https://dev.to/rick_houlihan_cf110dba340/lakebase-meet-pdb-the-third-generation-database-oracle-shipped-in-2013-4l8b</link>
      <guid>https://dev.to/rick_houlihan_cf110dba340/lakebase-meet-pdb-the-third-generation-database-oracle-shipped-in-2013-4l8b</guid>
      <description>&lt;p&gt;&lt;strong&gt;By Rick Houlihan &amp;amp; Patrick Meredith&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Databricks named the right problem. Their answer is a credible execution of an idea Oracle Multitenant solved a decade earlier — and as it turns out, the gap they think they've found in Oracle was only one PL/SQL package away from closing.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Pitch That Started This
&lt;/h2&gt;

&lt;p&gt;A colleague forwarded me the Databricks blog post the other day. Opening line:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"In our previous blog, we introduced Lakebase, the third-generation database architecture that fundamentally separates storage and compute."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;— Databricks, &lt;em&gt;"How agentic software development will change databases"&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So, like what Oracle did 12 years ago.&lt;/p&gt;

&lt;p&gt;I'm being a little snide. Bear with me — there's a real article underneath. The blog is a thoughtful read about how AI agents are changing database workloads, and most of the diagnosis is right. Their telemetry is interesting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;"In Databricks's Lakebase service, AI agents now create roughly 4x more databases than human users."&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"[O]n average, each database project has ~10 branches and some databases with nested branches reaching depths of over 500 iterations…"&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"[F]or about half of these agentic applications, the database compute lifetime is less than 10 seconds."&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last number is real. Agents don't behave like humans. They generate variants by the dozen, run them in parallel, evaluate against an eval set, keep the winner, throw away the losers. Evolutionary development. The economics break down completely on a database that costs $200/month per instance with a five-minute provisioning cycle.&lt;/p&gt;
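
&lt;p&gt;A quick back-of-the-envelope calculation shows how badly those economics break. The $200/month price, the five-minute provisioning cycle, and the 10-second lifetime come from the text above; the 30-day month, the variant count of 12, and the idea of prorating the monthly price per second are assumptions added purely for illustration.&lt;/p&gt;

```python
MONTHLY_COST_USD = 200.0            # per-instance price quoted above
PROVISIONING_SECONDS = 5 * 60       # five-minute provisioning cycle quoted above
COMPUTE_LIFETIME_SECONDS = 10       # upper bound on lifetime for about half of agentic apps
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: 30-day month
VARIANTS = 12                       # assumption: one dozen parallel variants

# How many times longer the agent waits for provisioning than it actually computes.
wait_ratio = PROVISIONING_SECONDS / COMPUTE_LIFETIME_SECONDS

# Cost if every variant needs its own always-on instance for the month.
always_on_cost = VARIANTS * MONTHLY_COST_USD

# Cost if the same price were prorated to the seconds actually used (hypothetical billing).
used_fraction = COMPUTE_LIFETIME_SECONDS / SECONDS_PER_MONTH
prorated_cost = VARIANTS * MONTHLY_COST_USD * used_fraction

print(f"provisioning takes {wait_ratio:.0f}x the compute lifetime")
print(f"always-on: ${always_on_cost:,.2f}/month, prorated: ${prorated_cost:.4f}")
```

&lt;p&gt;Under these assumptions, provisioning takes 30 times longer than the work itself, and a dozen always-on instances cost $2,400 a month to deliver less than a cent of prorated compute.&lt;/p&gt;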

&lt;p&gt;So Databricks is right about the problem. They're right that databases need a branching primitive. They're right that storage and compute need to scale independently. They're right that the always-on cost floor doesn't survive contact with agents.&lt;/p&gt;

&lt;p&gt;This article is not about whether they're wrong on the diagnosis.&lt;/p&gt;

&lt;p&gt;It's about whether &lt;strong&gt;their answer is novel&lt;/strong&gt; — and what the architecture-correct version looks like. Because Oracle has been shipping the same primitive in the engine since July 2013, and a small Python + PL/SQL wrapper is all that separates it from the developer experience Databricks just announced.&lt;/p&gt;

&lt;p&gt;Patrick and I thought it was worth writing this down.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Lakebase Actually Is
&lt;/h2&gt;

&lt;p&gt;Spoiler: it's Neon.&lt;/p&gt;

&lt;p&gt;Databricks announced its agreement to acquire Neon on May 14, 2025. The press release didn't disclose a price (industry reporting put it at roughly $1 billion), but it did volunteer a useful telemetry data point: &lt;em&gt;"over 80 percent of the databases provisioned on Neon were created automatically by AI agents rather than by humans."&lt;/em&gt; That number is also the reason this acquisition happened — Neon, founded in 2021 by Postgres committers, had built a serverless Postgres architecture that AI agents could actually afford to use: stateless compute nodes, a Paxos-based safekeeper quorum holding WAL, and a pageserver materializing pages on demand from object storage. Branches were stamped as metadata pointers at a moment in WAL history; copy-on-write at the storage layer made divergence cheap.&lt;/p&gt;
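
&lt;p&gt;The "branch as a metadata pointer" idea is easier to see in code. The toy below is not Neon's implementation: the &lt;code&gt;Branch&lt;/code&gt; class, its methods, and the use of a list index as the LSN are all hypothetical simplifications. It only shows why creating a branch is cheap (nothing is copied) and why divergence costs only what each branch writes afterward.&lt;/p&gt;

```python
class Branch:
    """A branch is a pointer (parent, LSN) plus its own append-only log."""

    def __init__(self, parent=None, parent_lsn=0):
        self.parent = parent
        self.parent_lsn = parent_lsn  # how much of the parent's log is visible here
        self.log = []                 # (key, value) records written on this branch

    def write(self, key, value):
        self.log.append((key, value))

    def branch(self):
        # Creating a branch copies no data: it only records the current LSN.
        return Branch(parent=self, parent_lsn=len(self.log))

    def read(self, key, upto=None):
        # Newest matching record wins; fall back to the parent's history
        # as it stood at the branch point.
        records = self.log if upto is None else self.log[:upto]
        for k, v in reversed(records):
            if k == key:
                return v
        if self.parent is not None:
            return self.parent.read(key, upto=self.parent_lsn)
        return None


main = Branch()
main.write("plan", "v1")
experiment = main.branch()            # branch point: after "v1"
experiment.write("plan", "v2-candidate")
main.write("plan", "v1.1")            # later write on main, invisible to the branch

print(main.read("plan"), experiment.read("plan"))
```

&lt;p&gt;Throwing away a losing variant is equally cheap in this model: drop the branch object and its private log, and the parent history is untouched.&lt;/p&gt;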

&lt;p&gt;That architecture is good engineering. It's also exactly what Databricks now ships as Lakebase. Their own architecture deep-dive opens with:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"In the lakebase architecture, your compute is stateless. It does not rely on a local data directory. Instead, it streams WAL to a Paxos-based quorum of safekeepers."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;— Databricks, &lt;em&gt;"How lakebase architecture delivers 5x faster Postgres writes"&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The same post describes how, when Postgres compute requests a page from storage, the pageserver &lt;em&gt;"reconstructs it by finding the most recent materialized image of that page and replaying any WAL deltas on top."&lt;/em&gt; If you've read Neon's published architecture overview, this is familiar vocabulary — stateless compute → safekeepers → pageserver → object storage — because it &lt;em&gt;is&lt;/em&gt; Neon's architecture. Lakebase is Neon with a Databricks brand on top.&lt;/p&gt;
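
&lt;p&gt;The quoted reconstruction step, "most recent materialized image plus WAL deltas on top," can be sketched in a few lines. This is a deliberately simplified model, not pageserver code: pages are plain dicts, a delta is a single key assignment, and the &lt;code&gt;reconstruct&lt;/code&gt; function and its sample data are invented for this example.&lt;/p&gt;

```python
def reconstruct(images, deltas, lsn):
    """Return the page as of `lsn`.

    images: list of (lsn, page_dict) sorted by ascending LSN
    deltas: list of (lsn, key, value) sorted by ascending LSN
    """
    base_lsn, page = 0, {}
    # Find the most recent materialized image at or before the requested LSN.
    for image_lsn, image in images:
        if lsn >= image_lsn:
            base_lsn, page = image_lsn, dict(image)
    # Replay only the deltas newer than that image, up to the requested LSN.
    for delta_lsn, key, value in deltas:
        if delta_lsn > base_lsn and lsn >= delta_lsn:
            page[key] = value
    return page


images = [(10, {"a": 1}), (30, {"a": 3, "b": 2})]
deltas = [(5, "a", 0), (20, "b", 9), (40, "a", 4)]

print(reconstruct(images, deltas, 25))
print(reconstruct(images, deltas, 45))
```

&lt;p&gt;Because any LSN can be reconstructed this way, a branch does not need its own copy of the page: it needs only the LSN it was created at.&lt;/p&gt;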

&lt;p&gt;To be clear: that's not a problem. Neon is good engineering. Acquiring it and integrating it with the lakehouse is a perfectly defensible product move — buying a four-year-old startup whose technology already solves the agent-economics problem is faster than building one yourself. Nobody should be mad about an acquisition.&lt;/p&gt;

&lt;p&gt;The problem is the next thing Databricks did, which was call a four-year-old Postgres-branching architecture &lt;em&gt;"the third-generation database architecture that fundamentally separates storage and compute."&lt;/em&gt; That's a marketing claim, not an architectural one, and it has two specific issues. First, "third generation" implies a chronology — first generation was monolithic, second was something, this is the third — and Databricks has never been particularly clear about what the second generation was, which is convenient because any honest answer would include systems older than Lakebase that already do what Lakebase does. Second, the &lt;em&gt;"fundamentally separates storage and compute"&lt;/em&gt; phrasing treats compute/storage separation as a 2025 innovation, which is awkward when Snowflake shipped that architecture commercially in 2014 and Oracle shipped a multitenant variant of it in July 2013.&lt;/p&gt;

&lt;p&gt;"Third generation" sells better than "we acquired a 2021 startup six months ago, here's what they built." It also doesn't survive a history check.&lt;/p&gt;

&lt;p&gt;That's the next section.&lt;/p&gt;

&lt;h2&gt;
  
  
  The "Third-Generation" Sleight of Hand
&lt;/h2&gt;

&lt;p&gt;Same Databricks blog post — &lt;em&gt;"A New Era of Databases: Lakebase,"&lt;/em&gt; June 12, 2025 — one "Database Architecture Evolution" section, three generations laid out in sequence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Generation 1 — the monoliths:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Examples: MySQL, Postgres, classic Oracle"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Database systems started as absolute monoliths."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Generation 2 — proprietary loose coupling:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Examples: Aurora, Oracle Exadata"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"As cloud infrastructure improved, vendors physically separated storage from compute, moving storage into proprietary backend tiers."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Same Oracle. Two generations. One page apart. Pick one.&lt;/p&gt;

&lt;p&gt;I'll be charitable and assume the intended argument was &lt;em&gt;"early Oracle was a monolith, modern Oracle isn't."&lt;/em&gt; Fine. Then "modern" deserves a timeline.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Year&lt;/th&gt;
&lt;th&gt;System&lt;/th&gt;
&lt;th&gt;What was separated&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2001&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Oracle Real Application Clusters (RAC)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multiple compute nodes against a single shared SAN/NAS storage substrate (Oracle 9i)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2008&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Oracle Exadata v1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Database servers vs. intelligent storage cells with predicate offload (Smart Scan), GA September 2008&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2010&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Google Dremel / BigQuery&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Disaggregated storage and compute, columnar — VLDB 2010 paper&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;July 1, 2013&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Oracle Database 12c / Multitenant&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;CREATE PLUGGABLE DATABASE … FROM … SNAPSHOT COPY&lt;/code&gt; ships in the engine&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2014&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Snowflake (GA)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Three-layer cloud-native: storage / virtual warehouses / cloud services&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Nov 2014 / Jul 2015&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Amazon Aurora&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Compute decoupled from a 6-way replicated storage layer across 3 AZs (preview Nov 2014, GA July 2015)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2021&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Neon (founded)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Postgres-specific WAL-level disaggregation with branching&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;May 14, 2025&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Lakebase&lt;/strong&gt; = Databricks acquires Neon&lt;/td&gt;
&lt;td&gt;Neon's architecture wrapped around open lake storage&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftn7q2ruc2ujku70azhz2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftn7q2ruc2ujku70azhz2.png" alt=" " width="800" height="354"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Storage and compute have been separated in production databases for 25 years. Across two paradigms, five vendors, and at minimum seven shipping systems before Lakebase showed up. "Third generation" isn't an architectural claim. It's a marketing label that requires the reader to forget about Oracle RAC, Exadata, Dremel, Multitenant, Snowflake, Aurora, and Neon in roughly that order.&lt;/p&gt;

&lt;p&gt;So what's actually new in Lakebase? The same blog is honest about this if you read past the generation label:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Like Gen 2, it separates compute from storage, but with a critical difference: both the storage infrastructure and the data formats are completely open."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Translation: Gen 2 already separated storage from compute. Their own text concedes the point. The Gen 3 differentiator they're actually claiming is &lt;em&gt;open data formats&lt;/em&gt;. We'll dismantle that claim in Section 10 — short version, "open formats" turns out to do less work than the marketing suggests once you ask which formats, governed by whom, queryable how. But file the claim for now.&lt;/p&gt;

&lt;p&gt;The other thing the launch blog flags as Gen 3 distinctive is branching:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Databases can be branched and cloned the way developers branch code."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Branching as a developer-experience primitive is a fair thing to call out — it genuinely changes how AI agents and dev workflows interact with databases, and we conceded that point in Section 1. Branching as a &lt;em&gt;database-engine&lt;/em&gt; primitive, though, has shipped in Oracle Multitenant since July 1, 2013, with documented syntax, multiple supported storage substrates, and a hard limit four to eight times higher than Lakebase's. Which is the next section.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Third-generation database architecture? We're on our fifth." - Patrick Meredith&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  PDB Snapshot Copy: The Branching Primitive Oracle Has Shipped Since 2013
&lt;/h2&gt;

&lt;p&gt;The syntax is one statement:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;PLUGGABLE&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;my_experiment_branch&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;base_experiment_pdb&lt;/span&gt;
  &lt;span class="n"&gt;SNAPSHOT&lt;/span&gt; &lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
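
&lt;p&gt;Since agents would issue this statement programmatically, here is a minimal Python sketch of generating it safely. The &lt;code&gt;snapshot_copy_ddl&lt;/code&gt; function and its identifier check are hypothetical helpers written for this example; it only builds the statement text and does not connect to a database. PDB names cannot be passed as bind variables in DDL, which is why the sketch validates them before interpolating.&lt;/p&gt;

```python
import re

# Conservative check for simple unquoted identifiers (letters, digits, _, $, #).
_IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9_$#]*")


def snapshot_copy_ddl(branch_name, source_pdb):
    """Build the CREATE PLUGGABLE DATABASE ... SNAPSHOT COPY statement as text."""
    for name in (branch_name, source_pdb):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"unsafe PDB name: {name!r}")
    return (
        f"CREATE PLUGGABLE DATABASE {branch_name}\n"
        f"  FROM {source_pdb}\n"
        f"  SNAPSHOT COPY"
    )


print(snapshot_copy_ddl("my_experiment_branch", "base_experiment_pdb"))
```

&lt;p&gt;The trailing semicolon is left off on purpose, on the assumption that the statement will be sent through a driver that expects a single statement without one.&lt;/p&gt;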



&lt;p&gt;The Oracle 19c SQL Reference describes what happens underneath:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"The &lt;code&gt;SNAPSHOT COPY&lt;/code&gt; clause instructs the database to clone the source PDB using storage snapshots. This reduces the time required to create the clone because the database does not need to make a complete copy of the source data files."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;— &lt;em&gt;&lt;a href="https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/CREATE-PLUGGABLE-DATABASE.html" rel="noopener noreferrer"&gt;Oracle Database 19c SQL Language Reference: CREATE PLUGGABLE DATABASE&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;What "storage snapshots" means depends on the substrate. The same reference is explicit: with &lt;code&gt;CLONEDB=FALSE&lt;/code&gt;, &lt;em&gt;"the underlying file system for the source PDB's files must support storage snapshots. Such file systems include Oracle Automatic Storage Management Cluster File System (Oracle ACFS) and Direct NFS Client storage."&lt;/em&gt; With &lt;code&gt;CLONEDB=TRUE&lt;/code&gt;, &lt;em&gt;"the underlying file system for the source PDB's files can be any local file system, network file system (NFS), or clustered file system that has Direct NFS enabled. However, the source PDB must remain in open read-only mode as long as any clones exist."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;So:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Storage substrate&lt;/th&gt;
&lt;th&gt;Snapshot mechanism&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Oracle ACFS&lt;/td&gt;
&lt;td&gt;Copy-on-write storage snapshots&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;CLONEDB=FALSE&lt;/code&gt; path&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Direct NFS Client (dNFS)&lt;/td&gt;
&lt;td&gt;Copy-on-write storage snapshots on snapshot-capable NFS array&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;CLONEDB=FALSE&lt;/code&gt; path&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Exadata sparse disk groups&lt;/td&gt;
&lt;td&gt;Copy-on-write&lt;/td&gt;
&lt;td&gt;Source PDB must be read-only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Standard FS + &lt;code&gt;CLONEDB=TRUE&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;dNFS sparse files over NFS&lt;/td&gt;
&lt;td&gt;Source PDB must remain open read-only while clones exist&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Exascale (23ai+)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Redirect-on-write&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;"created quickly, consume little storage space upon initial creation, and can be created in practically unlimited numbers"&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Note the precision on "redirect-on-write" — that's Oracle's official term &lt;em&gt;only&lt;/em&gt; for Exascale snapshots in 23ai+. Older substrates use copy-on-write semantics. Per the &lt;a href="https://docs.oracle.com/en/learn/exadb-xs-pdb-snapshot/index.html" rel="noopener noreferrer"&gt;Exadata Database Service on Exascale Infrastructure documentation&lt;/a&gt;: &lt;em&gt;"These PDB snapshots leverage Exascale redirect-on-write technology so that they are created quickly, consume little storage space upon initial creation, and can be created in practically unlimited numbers."&lt;/em&gt; The distinction matters if you're going to argue with someone about it.&lt;/p&gt;

&lt;p&gt;Sibling features in the Multitenant family:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PDB Snapshot Carousel&lt;/strong&gt; (introduced in &lt;strong&gt;18c&lt;/strong&gt;, not 19c — common citation error). Per &lt;a href="https://oracle-base.com/articles/18c/multitenant-pdb-snapshot-carousel-18c" rel="noopener noreferrer"&gt;oracle-base.com&lt;/a&gt;: &lt;em&gt;"Oracle 18c introduced the concept of a snapshot carousel, which is a series of point-in-time copies, or snapshots, of a PDB."&lt;/em&gt; Default 8 snapshots, hard cap at 8 via &lt;code&gt;MAX_PDB_SNAPSHOTS&lt;/code&gt;. Oldest is overwritten when full. Useful for short-horizon point-in-time recovery without the overhead of full backups.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Refreshable Clones.&lt;/strong&gt; Physically full copies with incremental redo apply. Different beast from snapshot copies (full storage cost, but ongoing sync from source). Convertible one-way to a regular PDB.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PDB density.&lt;/strong&gt; Up to &lt;strong&gt;4098 PDBs per CDB&lt;/strong&gt; on Enterprise Edition with Multitenant licensing — the &lt;a href="https://docs.oracle.com/en/database/oracle/oracle-database/19/refrn/MAX_PDBS.html" rel="noopener noreferrer"&gt;&lt;code&gt;MAX_PDBS&lt;/code&gt; reference&lt;/a&gt; lists possible values of &lt;code&gt;5&lt;/code&gt;, &lt;code&gt;254&lt;/code&gt;, or &lt;code&gt;4098&lt;/code&gt; by edition (Standard/Express, Standard Edition 2, Enterprise Edition respectively).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now compare ceilings:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Platform&lt;/th&gt;
&lt;th&gt;Branch limit&lt;/th&gt;
&lt;th&gt;Branch depth&lt;/th&gt;
&lt;th&gt;Cross-region&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AWS Aurora&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;15 copy-on-write clones per source; 16th becomes a full copy&lt;/td&gt;
&lt;td&gt;No explicit depth ceiling, but each level re-consumes the 15 budget&lt;/td&gt;
&lt;td&gt;&lt;em&gt;"You can't create a clone in a different AWS Region from the source Aurora DB cluster"&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Lakebase&lt;/strong&gt; (Databricks doc)&lt;/td&gt;
&lt;td&gt;500 per project; &lt;strong&gt;only 10 unarchived (active) at once&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Hundreds nested (per their telemetry)&lt;/td&gt;
&lt;td&gt;Per region&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Oracle Multitenant&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Up to 4098 PDBs per CDB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No documented depth limit&lt;/td&gt;
&lt;td&gt;RAC + Data Guard, cross-region via Active Data Guard&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Lakebase's 500-per-project ceiling is generous compared to Aurora's 15. Oracle's 4098 is roughly eight times Lakebase's 500. And Lakebase has another hard cap that doesn't appear in the cloning side of the comparison: it allows only &lt;strong&gt;10 unarchived (active) branches at once&lt;/strong&gt;. Oracle has no equivalent active-cap; you tune branch density via Resource Manager based on your workload, which is the subject of the next section.&lt;/p&gt;
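&lt;p&gt;For a sense of scale, the ratios between the three ceilings work out as follows. This is plain arithmetic on the numbers quoted in the table, not a measurement; the &lt;code&gt;ratio&lt;/code&gt; helper is just an illustration.&lt;/p&gt;

```python
# Ceilings quoted in the comparison table above.
limits = {"Aurora": 15, "Lakebase": 500, "Oracle Multitenant": 4098}


def ratio(bigger, smaller):
    """How many times larger one platform's ceiling is than another's."""
    return round(limits[bigger] / limits[smaller], 1)


print(ratio("Lakebase", "Aurora"))              # 33.3
print(ratio("Oracle Multitenant", "Lakebase"))  # 8.2
```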

&lt;p&gt;This primitive shipped on July 1, 2013, in Oracle Database 12c. Twelve years before Lakebase. In the database engine, not in a wrapper. With a single SQL statement, documented in the official SQL Language Reference. There is no Postgres extension here. There is no separate page server, no Paxos quorum, no $1B acquisition. It's just &lt;code&gt;CREATE PLUGGABLE DATABASE … SNAPSHOT COPY&lt;/code&gt;, and it has been since the series finale of Breaking Bad.&lt;/p&gt;
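&lt;p&gt;Because DDL cannot take bind variables, anything that assembles this statement from caller input has to validate the identifiers itself. A minimal sketch, assuming names are restricted to simple unquoted Oracle identifiers; the function name &lt;code&gt;snapshot_copy_ddl&lt;/code&gt; is hypothetical and not part of any Oracle API:&lt;/p&gt;

```python
import re

# Simple unquoted Oracle identifier: starts with a letter, then letters,
# digits, _, $ or #. Oracle's length limit for PDB names is not checked here.
_SIMPLE_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_$#]*")


def snapshot_copy_ddl(branch, parent):
    """Build the CREATE PLUGGABLE DATABASE ... SNAPSHOT COPY statement."""
    for name in (branch, parent):
        if not _SIMPLE_NAME.fullmatch(name):
            raise ValueError(f"not a simple unquoted Oracle name: {name!r}")
    return (
        f"CREATE PLUGGABLE DATABASE {branch.upper()} "
        f"FROM {parent.upper()} SNAPSHOT COPY"
    )


print(snapshot_copy_ddl("agent_rag_042", "golden_master"))
# CREATE PLUGGABLE DATABASE AGENT_RAG_042 FROM GOLDEN_MASTER SNAPSHOT COPY
```

&lt;p&gt;Rejecting anything outside the simple-name pattern is the cheapest way to keep a string like &lt;code&gt;x; DROP ...&lt;/code&gt; out of the statement.&lt;/p&gt;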

&lt;h2&gt;
  
  
  The Compute Story Most People Get Wrong
&lt;/h2&gt;

&lt;p&gt;A note on this section: the structural argument here came from Patrick during a Slack thread when he challenged me on the scale-to-zero comparison. I had it wrong initially. Here's the correct read, in his voice.&lt;/p&gt;

&lt;p&gt;The naive comparison says Lakebase wins on scale-to-zero because branches scale individually to zero compute when idle. Oracle, the story goes, is "always on" — fixed ECPUs allocated to the ADB instance, multiple PDBs sharing the pool, no per-branch zero-cost dormancy.&lt;/p&gt;

&lt;p&gt;That comparison gets the shape right and the conclusion wrong.&lt;/p&gt;

&lt;p&gt;Yes, in Autonomous Database Serverless, ECPUs are allocated at the instance level, not per PDB. Yes, Snapshot Copy PDB branches inside an ADB share that pool. The naive read says: "uh-oh, no isolation, abandoned branches will eat compute." The correct read is: &lt;strong&gt;abandoned branches in a shared pool consume nothing by construction&lt;/strong&gt; — because they aren't reserving anything.&lt;/p&gt;

&lt;p&gt;Walk through the mechanics:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Closed PDBs consume zero CPU and zero shadow processes.&lt;/strong&gt; Run &lt;code&gt;ALTER PLUGGABLE DATABASE foo CLOSE IMMEDIATE;&lt;/code&gt; and the branch is dormant. The 26ai SQL Reference describes the semantics: &lt;em&gt;"the PDB equivalent of the SQL*Plus &lt;code&gt;SHUTDOWN&lt;/code&gt; command with the immediate mode."&lt;/em&gt; Metadata stays in the dictionary; nothing else stays resident.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Idle open PDBs consume near-zero.&lt;/strong&gt; Just metadata pages.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Active PDBs draw from the shared pool.&lt;/strong&gt; That pool auto-scales: per the Oracle docs, &lt;em&gt;"with compute auto scaling enabled the database can use up to three times more CPU and IO resources than specified by the number of ECPUs."&lt;/em&gt; You pay for the burst when it happens, not when it doesn't.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource Manager governs the priority.&lt;/strong&gt; CPU shares, &lt;code&gt;MAX_IOPS&lt;/code&gt;, &lt;code&gt;MAX_MBPS&lt;/code&gt;, sessions, parallel servers, per-PDB &lt;code&gt;SGA_TARGET&lt;/code&gt; and &lt;code&gt;PGA_AGGREGATE_LIMIT&lt;/code&gt;. You decide which branches get more pool when contended.&lt;/li&gt;
&lt;li&gt;
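&lt;strong&gt;&lt;code&gt;V$PDBS&lt;/code&gt; and the Resource Manager view &lt;code&gt;V$RSRCPDBMETRIC&lt;/code&gt; expose per-branch state and consumption&lt;/strong&gt; (with &lt;code&gt;V$RESOURCE_LIMIT&lt;/code&gt; covering session and process limits), so a supervisor process can watch and auto-suspend.&lt;/li&gt;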
&lt;/ol&gt;
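&lt;p&gt;The first and last steps above can be sketched as a small planning function: given the supervisor's view of each branch, emit &lt;code&gt;CLOSE IMMEDIATE&lt;/code&gt; for the ones that have sat idle too long. This is an illustration only. &lt;code&gt;BranchStatus&lt;/code&gt;, &lt;code&gt;plan_closes&lt;/code&gt;, and the &lt;code&gt;idle_minutes&lt;/code&gt; field are hypothetical, and idle time is assumed to be tracked by the supervisor itself rather than read from a dictionary view.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class BranchStatus:
    name: str            # PDB name, as it would appear in V$PDBS.NAME
    open_mode: str       # e.g. "READ WRITE", or "MOUNTED" for a closed PDB
    idle_minutes: float  # assumption: tracked by the supervisor itself


def plan_closes(branches, close_idle_after_minutes=60):
    """Return CLOSE IMMEDIATE statements for open branches idle past the threshold."""
    statements = []
    for b in branches:
        is_open = b.open_mode != "MOUNTED"
        if is_open and b.idle_minutes >= close_idle_after_minutes:
            statements.append(
                f"ALTER PLUGGABLE DATABASE {b.name} CLOSE IMMEDIATE"
            )
    return statements


fleet = [
    BranchStatus("AGENT_RAG_041", "READ WRITE", 5),
    BranchStatus("AGENT_RAG_042", "READ WRITE", 180),
    BranchStatus("AGENT_RAG_043", "MOUNTED", 900),
]
print(plan_closes(fleet))
# ['ALTER PLUGGABLE DATABASE AGENT_RAG_042 CLOSE IMMEDIATE']
```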

&lt;p&gt;So what's the real difference? Lakebase per-DB scale-to-zero with cold-start latency on resume. Oracle shared elastic pool with no cold start.&lt;/p&gt;

&lt;p&gt;For an agentic workflow, where the supervisor might wake an "abandoned" branch tomorrow to revisit a hypothesis it shelved today, &lt;strong&gt;the no-cold-start property matters.&lt;/strong&gt; The branch has been consuming nothing; the moment it gets a connection, it's responsive within milliseconds because the compute pool is already warm. Lakebase, by design, has to spin compute back up.&lt;/p&gt;

&lt;p&gt;Which means the elasticity scoreboard most people read off the spec sheet — &lt;em&gt;"Lakebase: scale-to-zero ✅ / Oracle: shared pool ❌"&lt;/em&gt; — is solving the same problem two different ways and pretending one wins. Different shape. Same economics for abandoned experiments. Faster wakeup on Oracle when the agent comes back.&lt;/p&gt;

&lt;p&gt;Sharing compute between PDBs isn't a bug. It means abandoned branches aren't wasting compute, period.&lt;/p&gt;

&lt;p&gt;Or as I put it in Slack when this came up: &lt;em&gt;"What we want is exactly what we already have. The compute is scaled. Abandoned branches contribute nothing."&lt;/em&gt; That's the architecture.&lt;/p&gt;

&lt;p&gt;— Patrick&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hard Limits
&lt;/h2&gt;

&lt;p&gt;Side-by-side, with citations where the limits are documented:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Lakebase&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Oracle Multitenant + ADB&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total branches&lt;/td&gt;
&lt;td&gt;500 / project (&lt;a href="https://www.databricks.com/blog/database-branching-postgres-git-style-workflows-databricks-lakebase" rel="noopener noreferrer"&gt;Databricks doc&lt;/a&gt;)&lt;/td&gt;
&lt;td&gt;Up to 4098 / CDB (&lt;a href="https://docs.oracle.com/en/database/oracle/oracle-database/19/refrn/MAX_PDBS.html" rel="noopener noreferrer"&gt;MAX_PDBS&lt;/a&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Active branches&lt;/td&gt;
&lt;td&gt;10 (hard cap)&lt;/td&gt;
&lt;td&gt;No hard cap; tuned via Resource Manager&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Branch creation speed&lt;/td&gt;
&lt;td&gt;Instant (metadata + COW)&lt;/td&gt;
&lt;td&gt;Near-instant on snapshot-capable storage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cold-start on resume&lt;/td&gt;
&lt;td&gt;Sub-second to multi-second&lt;/td&gt;
&lt;td&gt;None — shared pool&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ACID&lt;/td&gt;
&lt;td&gt;Postgres MVCC&lt;/td&gt;
&lt;td&gt;Full ACID, RAC, Active Data Guard&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Failover behavior&lt;/td&gt;
&lt;td&gt;Postgres-standard (kills in-flight)&lt;/td&gt;
&lt;td&gt;Transparent Application Continuity — in-flight transaction replay&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vector search&lt;/td&gt;
&lt;td&gt;Postgres extension&lt;/td&gt;
&lt;td&gt;In-engine, optimized by the decades-old CBO&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;JSON&lt;/td&gt;
&lt;td&gt;jsonb (sequential traversal)&lt;/td&gt;
&lt;td&gt;OSON binary, hash-indexed O(1) field access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Graph&lt;/td&gt;
&lt;td&gt;Postgres extension&lt;/td&gt;
&lt;td&gt;SQL/PGQ, in-engine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-modal queries (vector + JSON + graph + relational)&lt;/td&gt;
&lt;td&gt;Limited by extension boundaries&lt;/td&gt;
&lt;td&gt;Single transaction, single query plan&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Open data format&lt;/td&gt;
&lt;td&gt;"Postgres page on S3" (Postgres-only readable)&lt;/td&gt;
&lt;td&gt;OSON + Iceberg + Parquet + Mongo wire + native SQL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mongo wire compatibility&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Yes (Oracle MongoDB API)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Lakebase wins on developer-experience polish today.&lt;/strong&gt; The branching UX is wired into the product, the CLI is published, the dashboard renders branch trees. Credit where due — that's a real product investment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Oracle wins on every limit that matters once you stop counting GitHub stars.&lt;/strong&gt; Density (4098 vs 500). Active concurrency (no cap vs 10). ACID. Failover that doesn't kill your transactions. Vector + JSON + graph + spatial + relational in one query plan optimized by decades of CBO development. Mongo wire compatibility, for the developers who already wrote against MongoDB and don't want to rewrite their app to evaluate a database.&lt;/p&gt;

&lt;p&gt;The DX gap is real. It's also the easiest gap to close, which is the next section.&lt;/p&gt;

&lt;h2&gt;
  
  
  The DX Gap, And Why It's Trivial to Close
&lt;/h2&gt;

&lt;p&gt;Patrick said it best in the original Slack thread: &lt;em&gt;"We probably should develop a lightweight external API too. That should be extremely simple — it's all external to the database."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;He was right and he's already shipped it.&lt;/p&gt;

&lt;p&gt;The DX gap is real. There is no &lt;code&gt;pdb branch my-experiment&lt;/code&gt; command in stock Oracle. Lakebase has a polished branching UX with a published CLI, a dashboard, and &lt;code&gt;git&lt;/code&gt;-shaped semantics. We're not going to pretend otherwise.&lt;/p&gt;

&lt;p&gt;But this is a wrapper-shaped problem, not a kernel-shaped problem. Patrick built the wrapper:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://github.com/pmeredit/pdb-branch" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;code&gt;pmeredit/pdb-branch&lt;/code&gt;&lt;/strong&gt;&lt;/a&gt; — &lt;em&gt;"a small multi-language library over a shared PL/SQL package for making Oracle PDB snapshot copies feel like cheap database branches for agentic workflow experiments."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Python, Node.js, Rust, and Java bindings, plus a Rust-built &lt;code&gt;pdb&lt;/code&gt; CLI. Releases alongside this article.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The architecture is small enough to fit on a napkin:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;PDB_BRANCH&lt;/code&gt; PL/SQL package&lt;/strong&gt; — installed and upgraded automatically by the language binding at startup. Wraps &lt;code&gt;CREATE PLUGGABLE DATABASE … SNAPSHOT COPY&lt;/code&gt; with idempotent lifecycle DDL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Three control tables in &lt;code&gt;CDB$ROOT&lt;/code&gt;:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;PDB_BRANCH_BRANCHES&lt;/code&gt; — branch registry (name, parent, state, expiration, score)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;PDB_BRANCH_EVENTS&lt;/code&gt; — audit log of branch lifecycle events&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;PDB_BRANCH_PROFILES&lt;/code&gt; — branch-to-Resource-Manager-profile mapping&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;&lt;code&gt;BranchClient&lt;/code&gt; wrappers in four languages&lt;/strong&gt; — Python over &lt;code&gt;python-oracledb&lt;/code&gt;, Node.js over &lt;code&gt;oracledb&lt;/code&gt;, Rust over the ODPI-C-based &lt;code&gt;oracle&lt;/code&gt; crate (with a pure-Rust &lt;code&gt;oracle-rs&lt;/code&gt; path for non-SYSDBA work), and Java. One PL/SQL contract, four idiomatic surfaces.&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;A &lt;code&gt;pdb&lt;/code&gt; Rust CLI&lt;/strong&gt; — &lt;code&gt;bin/pdb&lt;/code&gt; wraps the Rust binding so callers don't need to know Cargo's &lt;code&gt;target/&lt;/code&gt; layout. &lt;code&gt;git branch&lt;/code&gt;-shaped commands, &lt;code&gt;.pdbprofile&lt;/code&gt; TOML config, and per-flag environment-variable overrides.&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Optional Resource Manager profiles:&lt;/strong&gt; &lt;code&gt;PDB_BRANCH_ACTIVE&lt;/code&gt;, &lt;code&gt;PDB_BRANCH_IDLE&lt;/code&gt;, &lt;code&gt;PDB_BRANCH_BACKGROUND&lt;/code&gt;.&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Two ways to drive it. The library surface (Python shown; Node/Rust/Java are equivalent):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pdb_branch&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BranchClient&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BranchClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;connection&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# auto-installs/upgrades PL/SQL package
&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_branch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AGENT_RAG_042&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;from_pdb&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GOLDEN_MASTER&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;notes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;try smaller chunk size and rerank before answer synthesis&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;record_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AGENT_RAG_042&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.91&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;notes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eval: qa_regression_v3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;promote&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AGENT_RAG_042&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;notes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;winner for current retrieval policy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cleanup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;close_idle_after_minutes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;drop_expired&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or, at the shell, the same workflow via the &lt;code&gt;pdb&lt;/code&gt; CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bin/pdb init &lt;span class="nt"&gt;--dsn&lt;/span&gt; localhost:1521/FREE &lt;span class="nt"&gt;--user&lt;/span&gt; sys &lt;span class="nt"&gt;--password&lt;/span&gt; ... &lt;span class="nt"&gt;--from&lt;/span&gt; FREEPDB1
bin/pdb branch AGENT_RAG_042 &lt;span class="nt"&gt;--notes&lt;/span&gt; &lt;span class="s2"&gt;"try smaller chunk size and rerank"&lt;/span&gt;
bin/pdb score   AGENT_RAG_042 0.91 &lt;span class="nt"&gt;--notes&lt;/span&gt; &lt;span class="s2"&gt;"eval: qa_regression_v3"&lt;/span&gt;
bin/pdb promote AGENT_RAG_042
bin/pdb branch &lt;span class="nt"&gt;-d&lt;/span&gt; AGENT_RAG_042
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;bin/pdb init&lt;/code&gt; writes a &lt;code&gt;.pdbprofile&lt;/code&gt; so the daily commands stay short. The CLI also accepts environment-variable overrides and flag overrides — flags beat env vars beat &lt;code&gt;.pdbprofile&lt;/code&gt; beat local defaults.&lt;/p&gt;
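&lt;p&gt;That precedence order is a layered lookup, which Python's standard &lt;code&gt;collections.ChainMap&lt;/code&gt; models directly. The sketch below is not the CLI's actual implementation (the CLI is written in Rust); the key names and values are made up for illustration.&lt;/p&gt;

```python
from collections import ChainMap


def resolve_settings(flags, env, profile, defaults):
    """Merge four layers; the earliest layer that defines a key wins."""
    # Drop unset (None) entries so an omitted flag does not shadow lower layers.
    layers = [
        {k: v for k, v in layer.items() if v is not None}
        for layer in (flags, env, profile, defaults)
    ]
    return dict(ChainMap(*layers))


settings = resolve_settings(
    flags={"dsn": None, "user": "sys"},                   # hypothetical CLI flags
    env={"dsn": "db.example.internal:1521/FREE"},         # hypothetical env override
    profile={"dsn": "localhost:1521/FREE", "from": "FREEPDB1"},
    defaults={"user": "system"},
)
print(settings["dsn"], settings["user"], settings["from"])
# db.example.internal:1521/FREE sys FREEPDB1
```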

&lt;p&gt;That's the entire developer experience. Branch, score, promote, reap. The argument that Oracle "doesn't have &lt;code&gt;git branch&lt;/code&gt; for databases" was true a week ago. Today there's a CLI in the repo, an integration test that runs it against an Oracle Free container in CI, and a Rust binary you can drop in your &lt;code&gt;$PATH&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One architectural point worth elevating: the two-connection security model.&lt;/strong&gt; The agent never gets &lt;code&gt;SYSDBA&lt;/code&gt;. There are two distinct connections:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Control-plane connection&lt;/strong&gt; — trusted orchestration code → &lt;code&gt;CDB$ROOT&lt;/code&gt; as &lt;code&gt;SYSDBA&lt;/code&gt; → uses &lt;code&gt;BranchClient&lt;/code&gt; to create, open, close, and drop PDB branches.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workload connection&lt;/strong&gt; — the agent → branch PDB → normal application user → ordinary SQL against branch-local data.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The agent receives only a DSN to its assigned branch and standard application credentials. It cannot create branches, drop branches, or escape its sandbox. Lakebase has nothing analogous in its branching API today; the agent-vs-supervisor security boundary is enforced at the cloud-IAM layer rather than in the database itself, and that's a category weaker than separation of concerns enforced inside the engine.&lt;/p&gt;
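&lt;p&gt;A rough sketch of what the supervisor hands to an agent, assuming the branch is reachable through a service named after the PDB. &lt;code&gt;AgentHandoff&lt;/code&gt; and &lt;code&gt;handoff_for&lt;/code&gt; are hypothetical names, not part of &lt;code&gt;pdb-branch&lt;/code&gt;, and the account check is only a tripwire: the real boundary is the privileges the database grants to the application user.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentHandoff:
    """Everything an agent needs, and nothing from the control plane."""
    dsn: str
    user: str
    password: str


def handoff_for(branch, host, port, app_user, app_password):
    """Build workload-connection details for one branch PDB."""
    # Tripwire only: refuse to pass administrative accounts to an agent.
    if app_user.strip().lower() in {"sys", "system"}:
        raise ValueError("agents must not receive administrative accounts")
    # Assumption: the branch's service name matches the PDB name.
    return AgentHandoff(f"{host}:{port}/{branch}", app_user, app_password)


handoff = handoff_for("AGENT_RAG_042", "localhost", 1521, "app_user", "secret")
print(handoff.dsn)
# localhost:1521/AGENT_RAG_042
```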

&lt;p&gt;&lt;strong&gt;Snapshot-copy fallback is engineered, not aspirational.&lt;/strong&gt; When the library requests &lt;code&gt;SNAPSHOT COPY&lt;/code&gt; and the underlying storage rejects it — Oracle Free's container filesystem returns &lt;code&gt;ORA-17525&lt;/code&gt; / &lt;code&gt;ORA-65169&lt;/code&gt;, for instance — the library transparently retries as a full clone, records a &lt;code&gt;SNAPSHOT_COPY_FALLBACK&lt;/code&gt; row in &lt;code&gt;PDB_BRANCH_EVENTS&lt;/code&gt;, and (in the Python binding) emits a &lt;code&gt;SnapshotCopyFallbackWarning&lt;/code&gt;. Correctness is preserved on substrates that can't sparse-clone; the events table makes it visible when that happened so capacity planning isn't a guessing game.&lt;/p&gt;
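&lt;p&gt;The control flow described above looks roughly like this. It is a sketch of the pattern, not the library's code: &lt;code&gt;OracleError&lt;/code&gt;, &lt;code&gt;create_branch_with_fallback&lt;/code&gt;, and the injected &lt;code&gt;execute&lt;/code&gt; / &lt;code&gt;record_event&lt;/code&gt; callables are stand-ins, and the warning class is redefined locally so the example runs without the package or a database.&lt;/p&gt;

```python
import warnings

# ORA codes the article names as snapshot-copy rejections on Oracle Free.
SNAPSHOT_REJECTED = {17525, 65169}


class OracleError(Exception):
    """Stand-in for a driver error that exposes the ORA- code."""

    def __init__(self, code):
        super().__init__(f"ORA-{code:05d}")
        self.code = code


class SnapshotCopyFallbackWarning(UserWarning):
    """Local stand-in for the warning the Python binding emits."""


def create_branch_with_fallback(execute, record_event, branch, parent):
    """Try SNAPSHOT COPY first; fall back to a full clone if storage rejects it."""
    base = f"CREATE PLUGGABLE DATABASE {branch} FROM {parent}"
    try:
        execute(base + " SNAPSHOT COPY")
        return "SNAPSHOT_COPY"
    except OracleError as exc:
        if exc.code not in SNAPSHOT_REJECTED:
            raise  # unrelated failure: do not mask it
    execute(base)  # retry as a full clone
    record_event(branch, "SNAPSHOT_COPY_FALLBACK")
    warnings.warn(f"{branch}: created as a full clone", SnapshotCopyFallbackWarning)
    return "FULL_CLONE"


events = []


def fake_execute(sql):
    # Simulates storage that cannot snapshot, as on the Oracle Free image.
    if sql.endswith("SNAPSHOT COPY"):
        raise OracleError(17525)


mode = create_branch_with_fallback(
    fake_execute,
    lambda branch, kind: events.append((branch, kind)),
    "AGENT_RAG_042",
    "GOLDEN_MASTER",
)
print(mode, events)
# FULL_CLONE [('AGENT_RAG_042', 'SNAPSHOT_COPY_FALLBACK')]
```

&lt;p&gt;Only the two snapshot-related codes trigger the retry; any other error propagates unchanged, so a privilege or naming problem is never silently converted into a full clone.&lt;/p&gt;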

&lt;p&gt;&lt;strong&gt;Free deployment path:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Oracle Database 23ai/26ai Free Docker image&lt;/strong&gt;: &lt;code&gt;container-registry.oracle.com/database/free&lt;/code&gt;. CDB service &lt;code&gt;FREE&lt;/code&gt;, default PDB &lt;code&gt;FREEPDB1&lt;/code&gt;. Multiple branch PDBs supported. The Free image's container filesystem doesn't support storage snapshots, so &lt;code&gt;snapshot_copy=True&lt;/code&gt; is downgraded to a full clone via the fallback path above (logged and warned about, not silent), which means 10–30 branches are realistic on a laptop, not hundreds. &lt;strong&gt;$0 cost forever&lt;/strong&gt;, and the Oracle Free integration tests in the repo run the Python, Node.js, Rust, Java, and CLI surfaces against this image in CI.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-managed CDB on 19c+ with snapshot-capable storage&lt;/strong&gt; — production target. ACFS, dNFS, Exadata sparse, or Exascale. Branch DDL uses Oracle Managed Files via &lt;code&gt;CREATE_FILE_DEST&lt;/code&gt;, preferring &lt;code&gt;DB_CREATE_FILE_DEST&lt;/code&gt; when set and otherwise deriving a destination from the parent PDB's datafile directory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ADB Serverless / Always Free is explicitly NOT a v1 target.&lt;/strong&gt; ADB application connections land in an existing PDB, not in &lt;code&gt;CDB$ROOT&lt;/code&gt;, so they cannot run PDB branch DDL. A real architectural constraint of ADB's tenancy model, not a &lt;code&gt;pdb-branch&lt;/code&gt; limitation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The README is upfront about v1 boundaries: the idempotent installer doesn't migrate destructive schema changes yet; PL/SQL identifiers are restricted to simple unquoted Oracle names; and promotion is metadata-only, with scaling and export workflows left to deployment-specific adapters. That's a reasonable v1 scope.&lt;/p&gt;

&lt;p&gt;The article is the "why." The repo is the "how." They land together, today.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Agentic Workflow on Oracle
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr7ppv8v23xv2yy5zmx5w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr7ppv8v23xv2yy5zmx5w.png" alt=" " width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The lifecycle Patrick described in our Slack thread, mapped to the actual &lt;code&gt;pdb-branch&lt;/code&gt; API:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 1 — heavy experimentation.&lt;/strong&gt; The supervisor holds the SYSDBA control-plane connection and spins up branches:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;hypothesis&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;hypotheses&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;branches&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_branch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AGENT_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;hypothesis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;from_pdb&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GOLDEN_MASTER&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;notes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;hypothesis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;branches&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_profile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AGENT_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;hypothesis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PDB_BRANCH_ACTIVE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each agent receives a DSN to its assigned branch plus an app-user credential. Agents do not see &lt;code&gt;CDB$ROOT&lt;/code&gt;. They run their experiments — vector queries, JSON queries, SQL, whatever the eval needs — against ordinary Oracle PDBs. Once the branch PDB is open there is no special "branch query mode": the branch is just an isolated Oracle PDB service.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 2 — evaluate.&lt;/strong&gt; Supervisor logs scores back to &lt;code&gt;PDB_BRANCH_BRANCHES&lt;/code&gt; as agents finish:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;branches&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;record_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AGENT_RAG_042&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.91&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;notes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eval: qa_regression_v3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The supervisor process can watch &lt;code&gt;V$PDBS&lt;/code&gt; (open mode, last open time, total size) for liveness, &lt;code&gt;V$RSRCPDBMETRIC&lt;/code&gt; (per-PDB CPU and I/O draw) for resource consumption, and &lt;code&gt;V$RESOURCE_LIMIT&lt;/code&gt; for session and process limits.&lt;/p&gt;
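&lt;p&gt;A minimal sketch of the liveness half of that watch loop, written against a generic DB-API cursor. &lt;code&gt;list_branches&lt;/code&gt; and the &lt;code&gt;AGENT_&lt;/code&gt; prefix convention are assumptions for illustration; &lt;code&gt;FakeCursor&lt;/code&gt; exists only so the example runs without a database, and in practice the cursor would come from the control-plane connection.&lt;/p&gt;

```python
def list_branches(cursor, prefix="AGENT_"):
    """Return name, open mode, open time and size for PDBs named with the prefix."""
    cursor.execute(
        "SELECT name, open_mode, open_time, total_size FROM v$pdbs ORDER BY name"
    )
    columns = ("name", "open_mode", "open_time", "total_size")
    rows = [dict(zip(columns, row)) for row in cursor.fetchall()]
    # Filter in Python: "_" is a wildcard in SQL LIKE, so a prefix match there
    # would need escaping.
    return [r for r in rows if r["name"].startswith(prefix)]


class FakeCursor:
    """Stand-in for a DB-API cursor so the sketch runs without a database."""

    def __init__(self, rows):
        self._rows = rows

    def execute(self, sql):
        self.sql = sql

    def fetchall(self):
        return self._rows


cursor = FakeCursor([
    ("AGENT_RAG_042", "READ WRITE", None, 1024),
    ("FREEPDB1", "READ WRITE", None, 4096),
])
print([b["name"] for b in list_branches(cursor)])
# ['AGENT_RAG_042']
```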

&lt;p&gt;&lt;strong&gt;Phase 3 — promote and reap.&lt;/strong&gt; Winners stay active. Losers get downgraded or closed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;branches&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;promote&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AGENT_RAG_042&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;notes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;winner for current retrieval policy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;branches&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cleanup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;close_idle_after_minutes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;drop_expired&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;cleanup&lt;/code&gt; is the auto-suspend / auto-drop primitive. In production you don't run this from the supervisor; you schedule &lt;code&gt;PDB_BRANCH.CLEANUP&lt;/code&gt; from &lt;code&gt;DBMS_SCHEDULER&lt;/code&gt; so the orchestration code doesn't need to babysit branch lifecycle.&lt;/p&gt;

&lt;p&gt;Behind those method calls, the SQL is exactly what you'd expect:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;PLUGGABLE&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;AGENT_RAG_042&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;GOLDEN_MASTER&lt;/span&gt; &lt;span class="n"&gt;SNAPSHOT&lt;/span&gt; &lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="n"&gt;PLUGGABLE&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;AGENT_RAG_042&lt;/span&gt; &lt;span class="k"&gt;OPEN&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="n"&gt;PLUGGABLE&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;AGENT_RAG_042&lt;/span&gt;
    &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;DB_PERFORMANCE_PROFILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'PDB_BRANCH_ACTIVE'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;PDB_BRANCH_BRANCHES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;PARENT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;STATE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOTES&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CREATED&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'AGENT_RAG_042'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'GOLDEN_MASTER'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'ACTIVE'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;'try smaller chunk size...'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SYSTIMESTAMP&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;PDB_BRANCH_EVENTS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BRANCH_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;EVENT_TYPE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;DETAILS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;EVENT_TIME&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'AGENT_RAG_042'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'CREATED'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'{"from":"GOLDEN_MASTER"}'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SYSTIMESTAMP&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Five statements. The three DDL statements each commit implicitly in Oracle, so only the two bookkeeping inserts share a transaction. The branch is live. An agent connects to &lt;code&gt;AGENT_RAG_042&lt;/code&gt; as &lt;code&gt;app_user&lt;/code&gt; and runs its experiment.&lt;/p&gt;

&lt;p&gt;This is what Databricks calls evolutionary algorithms in the database. It's the right framing. The substrate has been Oracle for a decade; what was missing was the wrapper that makes it feel like git. Each language binding is roughly one module long, the Rust &lt;code&gt;pdb&lt;/code&gt; CLI is one binary, and they all sit on top of one shared PL/SQL package. The whole DX gap was about that much code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost Reality
&lt;/h2&gt;

&lt;p&gt;Both platforms have real costs and real free entry points. Skipping the marketing-deck pricing slide and going straight to what an engineer would actually pay:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Workload pattern&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Lakebase&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Oracle ADB Serverless 2 ECPU&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;50 mostly-idle branches, occasional bursts&lt;/td&gt;
&lt;td&gt;$80–$150/mo&lt;/td&gt;
&lt;td&gt;$190–$290/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100+ branches, high density&lt;/td&gt;
&lt;td&gt;Hits the 10-active wall&lt;/td&gt;
&lt;td&gt;Scales naturally to thousands&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sustained 8+ hr/day activity&lt;/td&gt;
&lt;td&gt;Capacity-unit cost climbs&lt;/td&gt;
&lt;td&gt;Cheaper at sustained load&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage at scale&lt;/td&gt;
&lt;td&gt;$0.345 / GB-month&lt;/td&gt;
&lt;td&gt;~$0.024 / GB-month (≈15× cheaper)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Free for prototyping&lt;/td&gt;
&lt;td&gt;Always Free tier (limited)&lt;/td&gt;
&lt;td&gt;Free Docker image: $0 forever&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These are public list prices as of mid-2026, picked from each vendor's published rates. Run the numbers for your workload.&lt;/p&gt;
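&lt;p&gt;As a rough way to run the storage numbers, the sketch below plugs the two per-GB rates from the table into a few data volumes. The volumes are hypothetical and the Oracle rate is marked approximate in the table, so treat the output as an order-of-magnitude check. From the quoted rates the ratio works out to about 14.4, which the table rounds to ≈15×.&lt;/p&gt;

```python
# Storage list prices quoted in the table above (USD per GB-month).
LAKEBASE_PER_GB_MONTH = 0.345
ORACLE_PER_GB_MONTH = 0.024   # the table marks this figure as approximate


def monthly_storage_cost(gigabytes, rate_per_gb_month):
    """Monthly storage cost in USD for a given volume and rate."""
    return gigabytes * rate_per_gb_month


ratio = LAKEBASE_PER_GB_MONTH / ORACLE_PER_GB_MONTH
print(f"rate ratio: {ratio:.1f}x")

# Hypothetical data volumes, chosen only for illustration.
for gb in (100, 1_000, 10_000):
    lakebase = monthly_storage_cost(gb, LAKEBASE_PER_GB_MONTH)
    oracle = monthly_storage_cost(gb, ORACLE_PER_GB_MONTH)
    print(f"{gb:>6} GB  lakebase ${lakebase:,.2f}  "
          f"oracle ${oracle:,.2f}  diff ${lakebase - oracle:,.2f}")
```

&lt;p&gt;Storage is only one line of the bill. The compute rows in the table still need the same treatment for your own activity pattern.&lt;/p&gt;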

&lt;p&gt;&lt;strong&gt;The honest read:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lakebase wins on bursty, mostly-idle floors with light data.&lt;/strong&gt; That's the optimization point of per-DB scale-to-zero, and they do it well.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Oracle wins on density, sustained activity, and storage at scale.&lt;/strong&gt; When agents are actually doing work, the shared-pool model delivers more compute per dollar. When experiment data grows, the storage cost differential alone (~15×) can dominate the total.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Oracle Free Docker is genuinely free.&lt;/strong&gt; No cloud signup, no credit card, no quotas. Patrick's &lt;code&gt;pdb-branch&lt;/code&gt; README documents this as the recommended local prototyping path.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the compute story restated as economics. Per-DB scale-to-zero looks cheap when nothing is running. Shared elastic pool is cheaper when anything is running. Pick the model that matches your workload, not the marketing scoreboard.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's &lt;em&gt;Actually&lt;/em&gt; New About Lakebase
&lt;/h2&gt;

&lt;p&gt;Worth giving Databricks an honest hearing. The "third-generation" framing collapses the moment you check the dates. What about their other claim — that in Lakebase &lt;em&gt;"both the storage infrastructure and the data formats are completely open"&lt;/em&gt;?&lt;/p&gt;

&lt;p&gt;That one survives partway and dies in the details.&lt;/p&gt;

&lt;p&gt;The operational store in Lakebase is &lt;strong&gt;Postgres page format on cloud object storage.&lt;/strong&gt; That's what they mean by "open storage infrastructure." But Postgres' on-disk page layout is a physical storage format, not a portable interchange format. The only thing that can &lt;em&gt;read&lt;/em&gt; a Postgres page file is the Postgres engine. Calling that "open" because the Postgres source code is open is a category error. By that logic, MongoDB's BSON is "open" because the spec is published.&lt;/p&gt;

&lt;p&gt;The other openness claim — that the same data is queryable as Iceberg by external analytical engines — is true. But the Iceberg view isn't the operational store. It's a &lt;strong&gt;separate projection layer&lt;/strong&gt; (the "Mooncake" bridge — Databricks' OLTP-to-lakehouse export pipeline). Iceberg files are derived from the operational Postgres pages, not the same bytes.&lt;/p&gt;

&lt;p&gt;Which means Lakebase's actual architecture is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A &lt;strong&gt;canonical store&lt;/strong&gt; in Postgres-only page format. Closed to anything that isn't Postgres.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;projected shape&lt;/strong&gt; in Iceberg, exported to make the data analytically accessible.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's exactly &lt;strong&gt;canonical form + projected shape.&lt;/strong&gt; It's the architecture pattern I've been calling Unified Model Theory for the last two years. Databricks reinvented UMT, called the closed canonical store "open," and called the projection layer "openness."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpq0c9bw7zp6k2jz9n8xs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpq0c9bw7zp6k2jz9n8xs.png" alt=" " width="800" height="411"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Oracle's answer to "open data" is the converged engine itself: same canonical store, multiple shapes natively in the engine — SQL, JSON Duality Views, Property Graph, Vector, Spatial, Full-Text Search, Mongo wire protocol, OSON serialization out, Iceberg/Parquet for analytics. No bridge layer required. The cost-based optimizer sees all the modalities in a single query plan.&lt;/p&gt;

&lt;p&gt;The architecture-correct way to expose canonical data through multiple shapes is to do it in the engine. That is what Oracle has been shipping for 40 years and what UMT formalizes. Databricks' Lakebase + Mooncake architecture is one valid implementation pattern of the same idea, with two extra hops and a new vocabulary.&lt;/p&gt;

&lt;p&gt;What's actually new in Lakebase isn't the architecture. It's the &lt;strong&gt;packaging&lt;/strong&gt; — a polished branching UX wired into a data lake brand and a billion dollars of marketing oxygen. That's a real product investment and a credible push into a market segment Oracle has under-marketed. Credit where due.&lt;/p&gt;

&lt;p&gt;It's just not "third-generation database architecture." It's first-generation Postgres branching with a second-generation marketing department.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Take
&lt;/h2&gt;

&lt;p&gt;Three things to land:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Agents do need branching.&lt;/strong&gt; Databricks' diagnosis is correct, and the agentic future they describe is real. Database branching is the missing primitive for evolutionary development. Cost floors do break the economics. Storage and compute do need to scale independently. Credit where due.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Lakebase is competent execution of an idea Oracle Multitenant solved in 2013.&lt;/strong&gt; Neon is good engineering. Lakebase is Neon plus a brand and a UX layer. That's fine — but it isn't "third generation." It's a four-year-old Postgres-branching architecture, recently acquired and rebranded.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The architecture-correct version exists today.&lt;/strong&gt; Full ACID. Up to 4098 branches per CDB. Vector, graph, JSON, spatial, full-text — single engine, single transaction, single query plan optimized by 40 years of cost-based optimizer development. Transparent Application Continuity replays in-flight transactions across failover. The two-connection security model keeps agents out of &lt;code&gt;CDB$ROOT&lt;/code&gt; by construction.&lt;/p&gt;

&lt;p&gt;The only real gap was developer experience. Patrick's &lt;a href="https://github.com/pmeredit/pdb-branch" rel="noopener noreferrer"&gt;&lt;code&gt;pdb-branch&lt;/code&gt;&lt;/a&gt; closes it. &lt;strong&gt;Today.&lt;/strong&gt; A Python client, a PL/SQL package, three control tables, and a sane API. Branch, score, promote, reap.&lt;/p&gt;

&lt;p&gt;Stop reinventing 2013. Build the wrapper. Ship.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Third-generation database architecture? We're on our fifth.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;— Rick &amp;amp; Patrick&lt;/p&gt;




&lt;h2&gt;
  
  
  Citations
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Databricks (primary subject):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"How agentic software development will change databases" — &lt;a href="https://www.databricks.com/blog/how-agentic-software-development-will-change-databases" rel="noopener noreferrer"&gt;https://www.databricks.com/blog/how-agentic-software-development-will-change-databases&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;"A New Era of Databases: Lakebase" (June 12, 2025) — &lt;a href="https://www.databricks.com/blog/what-is-a-lakebase" rel="noopener noreferrer"&gt;https://www.databricks.com/blog/what-is-a-lakebase&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;"How lakebase architecture delivers 5x faster Postgres writes" — &lt;a href="https://www.databricks.com/blog/how-lakebase-architecture-delivers-5x-faster-postgres-writes" rel="noopener noreferrer"&gt;https://www.databricks.com/blog/how-lakebase-architecture-delivers-5x-faster-postgres-writes&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;"Database Branching in Postgres: Git-Style Workflows" — &lt;a href="https://www.databricks.com/blog/database-branching-postgres-git-style-workflows-databricks-lakebase" rel="noopener noreferrer"&gt;https://www.databricks.com/blog/database-branching-postgres-git-style-workflows-databricks-lakebase&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;"Databricks Agrees to Acquire Neon" press release (May 14, 2025) — &lt;a href="https://www.databricks.com/company/newsroom/press-releases/databricks-agrees-acquire-neon-help-developers-deliver-ai-systems" rel="noopener noreferrer"&gt;https://www.databricks.com/company/newsroom/press-releases/databricks-agrees-acquire-neon-help-developers-deliver-ai-systems&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Oracle Database documentation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;12c Multitenant Concepts — &lt;a href="https://docs.oracle.com/database/121/CNCPT/cdbovrvw.htm" rel="noopener noreferrer"&gt;https://docs.oracle.com/database/121/CNCPT/cdbovrvw.htm&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;19c CREATE PLUGGABLE DATABASE — &lt;a href="https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/CREATE-PLUGGABLE-DATABASE.html" rel="noopener noreferrer"&gt;https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/CREATE-PLUGGABLE-DATABASE.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;19c Cloning a PDB — &lt;a href="https://docs.oracle.com/en/database/oracle/oracle-database/19/multi/cloning-a-pdb.html" rel="noopener noreferrer"&gt;https://docs.oracle.com/en/database/oracle/oracle-database/19/multi/cloning-a-pdb.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;19c Administering a PDB Snapshot Carousel — &lt;a href="https://docs.oracle.com/en/database/oracle/oracle-database/19/multi/administering-pdb-snapshots.html" rel="noopener noreferrer"&gt;https://docs.oracle.com/en/database/oracle/oracle-database/19/multi/administering-pdb-snapshots.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;19c MAX_PDBS reference — &lt;a href="https://docs.oracle.com/en/database/oracle/oracle-database/19/refrn/MAX_PDBS.html" rel="noopener noreferrer"&gt;https://docs.oracle.com/en/database/oracle/oracle-database/19/refrn/MAX_PDBS.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;21c V$PDBS reference — &lt;a href="https://docs.oracle.com/en/database/oracle/oracle-database/21/refrn/V-PDBS.html" rel="noopener noreferrer"&gt;https://docs.oracle.com/en/database/oracle/oracle-database/21/refrn/V-PDBS.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;26ai ALTER PLUGGABLE DATABASE — &lt;a href="https://docs.oracle.com/en/database/oracle/oracle-database/26/sqlrf/ALTER-PLUGGABLE-DATABASE.html" rel="noopener noreferrer"&gt;https://docs.oracle.com/en/database/oracle/oracle-database/26/sqlrf/ALTER-PLUGGABLE-DATABASE.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Resource Manager for PDBs (19c) — &lt;a href="https://docs.oracle.com/en/database/oracle/oracle-database/19/multi/using-oracle-resource-manager-for-pdbs-with-sql-plus.html" rel="noopener noreferrer"&gt;https://docs.oracle.com/en/database/oracle/oracle-database/19/multi/using-oracle-resource-manager-for-pdbs-with-sql-plus.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;ADB Compute Models (ECPU/OCPU) — &lt;a href="https://docs.oracle.com/en/cloud/paas/autonomous-database/serverless/adbsb/autonomous-compute-models.html" rel="noopener noreferrer"&gt;https://docs.oracle.com/en/cloud/paas/autonomous-database/serverless/adbsb/autonomous-compute-models.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;ADB Auto-Scale 3× — &lt;a href="https://docs.oracle.com/en-us/iaas/autonomous-database-serverless/doc/autonomous-auto-scale.html" rel="noopener noreferrer"&gt;https://docs.oracle.com/en-us/iaas/autonomous-database-serverless/doc/autonomous-auto-scale.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;PDB Snapshots on Exadata Exascale (23ai+) — &lt;a href="https://docs.oracle.com/en/learn/exadb-xs-pdb-snapshot/index.html" rel="noopener noreferrer"&gt;https://docs.oracle.com/en/learn/exadb-xs-pdb-snapshot/index.html&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Historical context:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dremel 2020 retrospective (VLDB) — &lt;a href="https://www.vldb.org/pvldb/vol13/p3461-melnik.pdf" rel="noopener noreferrer"&gt;https://www.vldb.org/pvldb/vol13/p3461-melnik.pdf&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Aurora 10-year retrospective — &lt;a href="https://aws.amazon.com/blogs/aws/celebrating-10-years-of-amazon-aurora-innovation/" rel="noopener noreferrer"&gt;https://aws.amazon.com/blogs/aws/celebrating-10-years-of-amazon-aurora-innovation/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Aurora cloning hard limits — &lt;a href="https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Aurora.Managing.Clone.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Aurora.Managing.Clone.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Snowflake architecture — &lt;a href="https://docs.snowflake.com/en/user-guide/intro-key-concepts" rel="noopener noreferrer"&gt;https://docs.snowflake.com/en/user-guide/intro-key-concepts&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Oracle 18c PDB Snapshot Carousel introduction — &lt;a href="https://oracle-base.com/articles/18c/multitenant-pdb-snapshot-carousel-18c" rel="noopener noreferrer"&gt;https://oracle-base.com/articles/18c/multitenant-pdb-snapshot-carousel-18c&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Neon / Postgres branching:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Neon architecture overview — &lt;a href="https://neon.com/docs/introduction/architecture-overview" rel="noopener noreferrer"&gt;https://neon.com/docs/introduction/architecture-overview&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Neon branching docs — &lt;a href="https://neon.com/docs/introduction/branching" rel="noopener noreferrer"&gt;https://neon.com/docs/introduction/branching&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;TechTarget on Databricks/Neon acquisition — &lt;a href="https://www.techtarget.com/searchdatamanagement/news/366623864/Databricks-adds-Postgres-database-with-1B-Neon-acquisition" rel="noopener noreferrer"&gt;https://www.techtarget.com/searchdatamanagement/news/366623864/Databricks-adds-Postgres-database-with-1B-Neon-acquisition&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Companion repository:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;pmeredit/pdb-branch — &lt;a href="https://github.com/pmeredit/pdb-branch" rel="noopener noreferrer"&gt;https://github.com/pmeredit/pdb-branch&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>database</category>
      <category>lakebase</category>
      <category>databricks</category>
      <category>oracle</category>
    </item>
    <item>
      <title>The Silent Bug That Exposed All Tenant Data in Databricks Unity Catalog</title>
      <dc:creator>spkibe</dc:creator>
      <pubDate>Mon, 11 May 2026 07:39:39 +0000</pubDate>
      <link>https://dev.to/spkibe/the-silent-bug-that-exposed-all-tenant-data-in-databricks-unity-catalog-4egj</link>
      <guid>https://dev.to/spkibe/the-silent-bug-that-exposed-all-tenant-data-in-databricks-unity-catalog-4egj</guid>
      <description>&lt;p&gt;We were building a multi-tenant data platform on Databricks. Multiple organisations sharing the same physical tables — each one should see only their own rows. Standard stuff.&lt;br&gt;
We implemented it using Unity Catalog's row-level security and column masking. The functions compiled. The filter showed as applied in &lt;em&gt;DESCRIBE EXTENDED&lt;/em&gt;. Every test from the admin account looked perfect.&lt;br&gt;
Then we logged in as a real tenant user.&lt;br&gt;
They could see every tenant's data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Row-Level Security and Column Masking Actually Do&lt;/strong&gt;&lt;br&gt;
Before getting to the bug, a quick primer on how Unity Catalog security works — because understanding the mechanism is what makes the bug obvious in hindsight.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Row-Level Security — Row Filters&lt;/strong&gt;&lt;br&gt;
A row filter is a SQL function you attach to a table. Unity Catalog calls it automatically on every query, passing the value of a specified column from each row. If the function returns TRUE, the row is shown. If it returns FALSE, the row is completely hidden — not counted, not visible, not even hinted at.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Attach a row filter to a table&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;my_catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;my_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;my_table&lt;/span&gt;
  &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="k"&gt;ROW&lt;/span&gt; &lt;span class="n"&gt;FILTER&lt;/span&gt; &lt;span class="n"&gt;my_catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;governance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;filter_by_tenant&lt;/span&gt;
  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TENANT_KEY&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The user never writes a WHERE clause for this. They cannot remove it. It fires invisibly on every query from every tool — SQL editor, notebook, BI connection, API call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Column-Level Masking — Column Masks&lt;/strong&gt;&lt;br&gt;
A column mask is a SQL function attached to a specific column. Instead of hiding rows, it transforms values at query time. The row is visible but sensitive fields are replaced, generalized, or redacted based on who is asking.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Attach a column mask&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;my_catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;my_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;my_table&lt;/span&gt;
  &lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;FIRST_NAME&lt;/span&gt;
  &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;MASK&lt;/span&gt; &lt;span class="n"&gt;my_catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;governance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mask_name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The same SELECT returns different values depending on the user's group membership:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbzb8qp7f6hh80ns693av.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbzb8qp7f6hh80ns693av.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One table. One query. Different results per role. Platform-enforced.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why This Matters&lt;/strong&gt;&lt;br&gt;
The old approach (dynamic views, one per tenant per role) requires you to trust that every developer always queries the right view, that views stay in sync with schema changes, and that no one ever accidentally gets direct table access. Unity Catalog removes all of that trust dependency. Security is attached to the table itself and enforced by the platform on every query, not left to whichever view a caller happens to use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Bug&lt;/strong&gt;&lt;br&gt;
Here is the row filter function we wrote:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;FUNCTION&lt;/span&gt;
&lt;span class="n"&gt;my_catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;governance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;filter_by_tenant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tenant_key&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;RETURNS&lt;/span&gt; &lt;span class="nb"&gt;BOOLEAN&lt;/span&gt;
&lt;span class="k"&gt;RETURN&lt;/span&gt;
  &lt;span class="n"&gt;IS_ACCOUNT_GROUP_MEMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'admin_group'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;OR&lt;/span&gt;
  &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;my_catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;governance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tenant_group_mapping&lt;/span&gt; &lt;span class="n"&gt;tgm&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;IS_ACCOUNT_GROUP_MEMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tgm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;group_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tgm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tenant_key&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tenant_key&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



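&lt;p&gt;Read it carefully.&lt;br&gt;
The function parameter is named &lt;code&gt;tenant_key&lt;/code&gt;.&lt;br&gt;
The mapping table column is also named &lt;code&gt;tenant_key&lt;/code&gt;.&lt;br&gt;
In the WHERE clause:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;AND CAST(tgm.tenant_key AS BIGINT) = tenant_key&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;SQL sees two references to &lt;code&gt;tenant_key&lt;/code&gt;. Inside the subquery the column wins over the parameter, so both resolve to the table column &lt;code&gt;tgm.tenant_key&lt;/code&gt;. The function parameter is completely ignored.&lt;/p&gt;

&lt;p&gt;The comparison becomes:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;tgm.tenant_key = tgm.tenant_key&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;That is true for every mapping row with a non-null key, so the EXISTS succeeds for any user who belongs to at least one mapped group, whichever tenant the row being filtered belongs to.&lt;/p&gt;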
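&lt;p&gt;The effect of the collision can be reproduced outside Databricks. The sketch below is a plain Python model, not Unity Catalog code: &lt;code&gt;MAPPING&lt;/code&gt;, the group names, and both helper functions are hypothetical stand-ins. One function mirrors the filter as the engine resolved it, the other mirrors what was intended.&lt;/p&gt;

```python
# Hypothetical mapping rows standing in for tenant_group_mapping.
MAPPING = [
    {"group_name": "tenant_1_users", "tenant_key": 1},
    {"group_name": "tenant_2_users", "tenant_key": 2},
    {"group_name": "tenant_3_users", "tenant_key": 3},
]


def filter_as_written(tenant_key, user_groups):
    """Mirror of the buggy filter: the inner tenant_key resolves to the column."""
    if "admin_group" in user_groups:
        return True
    return any(
        row["group_name"] in user_groups
        and row["tenant_key"] == row["tenant_key"]   # column compared with itself
        for row in MAPPING
    )


def filter_as_intended(tenant_key, user_groups):
    """What was meant: compare the column with the function argument."""
    if "admin_group" in user_groups:
        return True
    return any(
        row["group_name"] in user_groups and row["tenant_key"] == tenant_key
        for row in MAPPING
    )


tenant_3_user = {"tenant_3_users"}
as_written = [filter_as_written(k, tenant_3_user) for k in (1, 2, 3)]
as_intended = [filter_as_intended(k, tenant_3_user) for k in (1, 2, 3)]
print("as written: ", as_written)
print("as intended:", as_intended)
```

&lt;p&gt;In the first function the argument is never read, which is exactly the symptom: the answer depends only on group membership, not on the row's tenant.&lt;/p&gt;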

&lt;p&gt;&lt;strong&gt;Why It Was So Hard to Spot&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;No error was thrown.
The function compiled without warnings. Unity Catalog reported it as valid SQL.&lt;/li&gt;
&lt;li&gt;DESCRIBE EXTENDED showed the filter was applied.
&lt;code&gt;Row Filter: my_catalog.governance.filter_by_tenant(TENANT_KEY)&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Everything looked correct at the metadata level. The filter was attached. The problem was invisible in the schema description.&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Admin tests passed.
Our initial testing was done from an admin account. The admin bypass (IS_ACCOUNT_GROUP_MEMBER('admin_group')) fires before the EXISTS check, so it returned TRUE for the correct reason. We never noticed the EXISTS was broken.&lt;/li&gt;
&lt;li&gt;The function fails open, not closed.
The filter never errored. The tautology simply evaluated to TRUE, so the filter showed rows rather than blocking them. A broken filter that silently shows everything is much harder to detect than a broken filter that throws an error.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The Diagnosis&lt;/strong&gt;&lt;br&gt;
The key test was running the filter function directly as the tenant user:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Run as the tenant user, not the admin&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="n"&gt;my_catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;governance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;filter_by_tenant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;can_see_tenant_1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;my_catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;governance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;filter_by_tenant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;can_see_tenant_2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;my_catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;governance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;filter_by_tenant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;can_see_tenant_3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;can_see_tenant_1 = true
can_see_tenant_2 = true
can_see_tenant_3 = true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A user who should only see tenant 3 could see all three. The function was returning true everywhere regardless of tenant key. That confirmed the EXISTS logic was broken — and pointed directly to the parameter name collision.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Fix — Rename the Parameter&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;FUNCTION&lt;/span&gt;
&lt;span class="n"&gt;my_catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;governance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;filter_by_tenant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p_tenant_key&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;RETURNS&lt;/span&gt; &lt;span class="nb"&gt;BOOLEAN&lt;/span&gt;
&lt;span class="k"&gt;RETURN&lt;/span&gt;
  &lt;span class="k"&gt;CASE&lt;/span&gt;
    &lt;span class="c1"&gt;-- Null tenant keys are always hidden&lt;/span&gt;
    &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;p_tenant_key&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="k"&gt;FALSE&lt;/span&gt;

    &lt;span class="c1"&gt;-- Admin bypass&lt;/span&gt;
    &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;IS_ACCOUNT_GROUP_MEMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'admin_group'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;

    &lt;span class="c1"&gt;-- Tenant check — p_tenant_key is the parameter&lt;/span&gt;
    &lt;span class="c1"&gt;-- tgm.tenant_key is the table column&lt;/span&gt;
    &lt;span class="c1"&gt;-- SQL can now distinguish between them&lt;/span&gt;
    &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
      &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;my_catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;governance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tenant_group_mapping&lt;/span&gt; &lt;span class="n"&gt;tgm&lt;/span&gt;
      &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;IS_ACCOUNT_GROUP_MEMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tgm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;group_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tgm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tenant_key&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p_tenant_key&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;

    &lt;span class="c1"&gt;-- Explicit deny — everything else sees zero rows&lt;/span&gt;
    &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="k"&gt;FALSE&lt;/span&gt;
  &lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two changes:&lt;/p&gt;

&lt;p&gt;Parameter renamed from tenant_key to p_tenant_key — eliminates the name collision&lt;br&gt;
CASE structure with explicit ELSE FALSE — makes the deny-by-default behaviour visible and intentional&lt;/p&gt;

&lt;p&gt;After recreating the function and reapplying the row filter, the same test returned:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;can_see_tenant_1 = false
can_see_tenant_2 = false
can_see_tenant_3 = true

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Drop and Reapply After Fixing&lt;/strong&gt;&lt;br&gt;
Updating the function is not enough on its own. You also need to drop and reapply the row filter so the table picks up the new function definition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;my_catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;my_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;my_table&lt;/span&gt;
  &lt;span class="k"&gt;DROP&lt;/span&gt; &lt;span class="k"&gt;ROW&lt;/span&gt; &lt;span class="n"&gt;FILTER&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;my_catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;my_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;my_table&lt;/span&gt;
  &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="k"&gt;ROW&lt;/span&gt; &lt;span class="n"&gt;FILTER&lt;/span&gt; &lt;span class="n"&gt;my_catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;governance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;filter_by_tenant&lt;/span&gt;
  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TENANT_KEY&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
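
&lt;p&gt;If several tables share the same filter, the two statements can be generated rather than typed by hand. This is a rough sketch: the helper name &lt;code&gt;reapply_row_filter_sql&lt;/code&gt; and the table list are hypothetical, and it only builds the SQL text shown above.&lt;/p&gt;

```python
def reapply_row_filter_sql(table, filter_function, column):
    """Return the DROP and SET statements for one table, in execution order."""
    return [
        f"ALTER TABLE {table} DROP ROW FILTER",
        f"ALTER TABLE {table} SET ROW FILTER {filter_function} ON ({column})",
    ]


# Hypothetical list of tables that share the same filter function.
tables = [
    "my_catalog.my_schema.my_table",
    "my_catalog.my_schema.members",
]

statements = []
for table in tables:
    statements.extend(
        reapply_row_filter_sql(table, "my_catalog.governance.filter_by_tenant", "TENANT_KEY")
    )

# Print the statements so they can be reviewed before anything is executed.
for statement in statements:
    print(statement + ";")
```

&lt;p&gt;In a notebook, each reviewed statement could then be passed to &lt;code&gt;spark.sql()&lt;/code&gt;. Printing first keeps the change visible before it touches a governed table.&lt;/p&gt;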



&lt;p&gt;&lt;strong&gt;The Column Masking Side&lt;/strong&gt;&lt;br&gt;
For completeness — column masking uses the same pattern and has the same naming risk. Here is what a safe masking function looks like with the p_ prefix convention applied:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;FUNCTION&lt;/span&gt;
&lt;span class="n"&gt;my_catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;governance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mask_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p_name&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;RETURNS&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;
&lt;span class="k"&gt;RETURN&lt;/span&gt; &lt;span class="k"&gt;CASE&lt;/span&gt;
  &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;IS_ACCOUNT_GROUP_MEMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'full_access_group'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="n"&gt;p_name&lt;/span&gt;
  &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;IS_ACCOUNT_GROUP_MEMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'admin_group'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;       &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="n"&gt;p_name&lt;/span&gt;
  &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;IS_ACCOUNT_GROUP_MEMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'partial_access_group'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="n"&gt;CONCAT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;LEFT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;'***'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="s1"&gt;'#### MASKED ####'&lt;/span&gt;
&lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Apply it inline at table creation to avoid broken dependencies later:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;my_catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;my_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;members&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;MEMBER_KEY&lt;/span&gt;   &lt;span class="nb"&gt;BIGINT&lt;/span&gt;  &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;TENANT_KEY&lt;/span&gt;   &lt;span class="nb"&gt;BIGINT&lt;/span&gt;  &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;FIRST_NAME&lt;/span&gt;   &lt;span class="n"&gt;STRING&lt;/span&gt;  &lt;span class="n"&gt;MASK&lt;/span&gt; &lt;span class="n"&gt;my_catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;governance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mask_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;LAST_NAME&lt;/span&gt;    &lt;span class="n"&gt;STRING&lt;/span&gt;  &lt;span class="n"&gt;MASK&lt;/span&gt; &lt;span class="n"&gt;my_catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;governance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mask_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;DATE_OF_BIRTH&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;   &lt;span class="n"&gt;MASK&lt;/span&gt; &lt;span class="n"&gt;my_catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;governance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mask_dob&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;DELTA&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Row filter applied separately&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;my_catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;my_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;members&lt;/span&gt;
  &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="k"&gt;ROW&lt;/span&gt; &lt;span class="n"&gt;FILTER&lt;/span&gt; &lt;span class="n"&gt;my_catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;governance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;filter_by_tenant&lt;/span&gt;
  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TENANT_KEY&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Because the masks are declared inline in the CREATE TABLE statement, rerunning that statement after a DROP TABLE restores them. The row filter is applied separately, so it is not restored. Always reapply it after recreating a table.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Rule&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Never name a row filter function parameter the same as a column in any table the function queries.&lt;/p&gt;

&lt;p&gt;Prefix all function parameters with p_. It is two characters. It prevents this entire class of silent security failure.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;filter_by_tenant(tenant_key BIGINT)   ← dangerous
filter_by_tenant(p_tenant_key BIGINT) ← safe
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
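
&lt;p&gt;The rule is easy to check mechanically. The sketch below is one possible lint, not part of any Databricks tooling: the &lt;code&gt;colliding_parameters&lt;/code&gt; helper is hypothetical, it handles only simple signatures without parenthesised types such as DECIMAL(10,2), and it compares names case-insensitively because the examples in this article mix TENANT_KEY and tenant_key.&lt;/p&gt;

```python
import re

# Captures the text between the first pair of parentheses in a signature.
SIGNATURE = re.compile(r"\(([^)]*)\)")


def colliding_parameters(signature, column_names):
    """Return parameter names that match a column name, ignoring case."""
    match = SIGNATURE.search(signature)
    if match is None:
        return []
    columns = {name.lower() for name in column_names}
    collisions = []
    for declaration in match.group(1).split(","):
        parts = declaration.split()  # e.g. ["tenant_key", "BIGINT"]
        if parts and parts[0].lower() in columns:
            collisions.append(parts[0])
    return collisions


# Columns of the mapping table that the filter function queries.
mapping_columns = ["group_name", "tenant_key"]

dangerous = colliding_parameters("filter_by_tenant(tenant_key BIGINT)", mapping_columns)
safe = colliding_parameters("filter_by_tenant(p_tenant_key BIGINT)", mapping_columns)

print(dangerous)  # ['tenant_key']
print(safe)       # []
```

&lt;p&gt;A check like this could run in code review against the columns of every table a filter function reads, so a collision is caught before the function is attached to anything.&lt;/p&gt;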



&lt;p&gt;&lt;strong&gt;Full Verification Checklist&lt;/strong&gt;&lt;br&gt;
Run these in order before trusting any row filter in production:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- 1. Confirm groups are account-level (not workspace-level)&lt;/span&gt;
&lt;span class="c1"&gt;--    Run as the target user:&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;IS_ACCOUNT_GROUP_MEMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'your_tenant_group'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;-- Expected: true&lt;/span&gt;

&lt;span class="c1"&gt;-- 2. Confirm filter function returns correct values per tenant&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="n"&gt;my_catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;governance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;filter_by_tenant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;t1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;my_catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;governance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;filter_by_tenant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;t2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;my_catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;governance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;filter_by_tenant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;t3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- Expected: false, false, true (for a tenant 3 user)&lt;/span&gt;

&lt;span class="c1"&gt;-- 3. Confirm filter is attached to the table&lt;/span&gt;
&lt;span class="k"&gt;DESCRIBE&lt;/span&gt; &lt;span class="n"&gt;EXTENDED&lt;/span&gt; &lt;span class="n"&gt;my_catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;my_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;my_table&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- Look for: Row Filter: my_catalog.governance.filter_by_tenant(TENANT_KEY)&lt;/span&gt;

&lt;span class="c1"&gt;-- 4. Confirm mapping table has correct data&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;my_catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;governance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tenant_group_mapping&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- 5. Confirm the EXISTS subquery works correctly&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;my_catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;governance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tenant_group_mapping&lt;/span&gt; &lt;span class="n"&gt;tgm&lt;/span&gt;
  &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;IS_ACCOUNT_GROUP_MEMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tgm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;group_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;tgm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tenant_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;exists_result&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- Expected: true (for tenant 3 user)&lt;/span&gt;

&lt;span class="c1"&gt;-- 6. Run query as target user and confirm only their rows appear&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;TENANT_KEY&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;my_catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;my_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;my_table&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;TENANT_KEY&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- Expected: only their tenant_key in results&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Other Gotchas We Hit Along the Way&lt;/strong&gt;&lt;br&gt;
While we are here — these are the other issues that burned us during the same implementation:&lt;br&gt;
Workspace groups vs account groups. &lt;em&gt;IS_ACCOUNT_GROUP_MEMBER()&lt;/em&gt; only recognises account-level groups created in the Databricks Account Console, not workspace-level groups. A workspace group always returns false. This one caused hours of confusion.&lt;br&gt;
Cluster identity. Notebooks attached to a cluster run queries as the cluster owner's identity, not the logged-in user. &lt;em&gt;IS_ACCOUNT_GROUP_MEMBER()&lt;/em&gt; checks the cluster owner's groups. Switch to a SQL Warehouse, which always evaluates as the logged-in user.&lt;br&gt;
Broken dependencies after catalog deletion. Column masks hold references to functions by their fully-qualified path. Delete the catalog containing a masking function without first dropping the masks, and every table with that mask becomes unqueryable with &lt;em&gt;UC_DEPENDENCY_DOES_NOT_EXIST&lt;/em&gt;. Always drop masks before dropping catalogs.&lt;br&gt;
Row filter lost after DROP TABLE. When you drop and recreate a table, inline column masks are preserved in the CREATE TABLE statement. Row filters are not. Always reapply &lt;em&gt;ALTER TABLE SET ROW FILTER&lt;/em&gt; after recreating any filtered table.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Summary&lt;/strong&gt;&lt;br&gt;
Unity Catalog row-level security and column masking are genuinely powerful. One filter function and one masking function replace hundreds of views, a duplicate encrypted schema, and developer-discipline-as-security-policy.&lt;br&gt;
But the parameter name collision bug is subtle enough that it will catch you if you are not looking for it. The function looks right. It compiles cleanly. It attaches without errors. And it silently hands every user a complete view of every tenant's data.&lt;br&gt;
Prefix your parameters. Always.&lt;/p&gt;

</description>
      <category>databricks</category>
      <category>datamasking</category>
      <category>dataengineering</category>
      <category>unitycatalog</category>
    </item>
    <item>
      <title>Building Your First Data Warehouse in Databricks — End to End 🎉</title>
      <dc:creator>Qvfagundes</dc:creator>
      <pubDate>Mon, 11 May 2026 03:00:00 +0000</pubDate>
      <link>https://dev.to/vf-insights/building-your-first-data-warehouse-in-databricks-end-to-end-49ln</link>
      <guid>https://dev.to/vf-insights/building-your-first-data-warehouse-in-databricks-end-to-end-49ln</guid>
      <description>&lt;h1&gt;
  
  
  Building Your First Data Warehouse in Databricks — End to End 🎉
&lt;/h1&gt;

&lt;p&gt;This is it. The article the entire series has been building toward.&lt;/p&gt;

&lt;p&gt;We've covered Databricks fundamentals, Apache Spark, Delta Lake, DBFS, DataFrames, SQL, and the Medallion Architecture. Now we wire everything together into a real, working data warehouse — from raw data ingestion all the way to queryable Gold tables.&lt;/p&gt;

&lt;p&gt;By the end of this article you'll have a functioning Lakehouse with Bronze, Silver, and Gold layers, a database registered in the Databricks catalog, and the ability to query your warehouse like a real data engineer.&lt;/p&gt;

&lt;p&gt;Let's build it.&lt;/p&gt;




&lt;h2&gt;
  
  
  What We're Building
&lt;/h2&gt;

&lt;p&gt;We'll build a &lt;strong&gt;Sales Data Warehouse&lt;/strong&gt; using a publicly available dataset. Here's the full architecture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CSV Files (raw sales data)
        ↓
   🥉 BRONZE
   bronze.sales_raw
   Raw Delta table, append-only
        ↓
   🥈 SILVER
   silver.sales
   Cleaned, deduplicated, enriched
        ↓
   🥇 GOLD
   gold.monthly_revenue     — Revenue by region and month
   gold.product_performance — Top products by sales volume
   gold.customer_segments   — Customers segmented by spend tier
        ↓
   SQL queries / BI tool
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 0: The Dataset
&lt;/h2&gt;

&lt;p&gt;We'll use the &lt;strong&gt;Online Retail dataset&lt;/strong&gt; — a real e-commerce transaction dataset available in Databricks sample data.&lt;/p&gt;

&lt;p&gt;It contains about 540,000 rows of UK retail transactions (541,909 in the ingestion run below) with these columns:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Column&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;InvoiceNo&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;String&lt;/td&gt;
&lt;td&gt;Order ID&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;StockCode&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;String&lt;/td&gt;
&lt;td&gt;Product code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Description&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;String&lt;/td&gt;
&lt;td&gt;Product name&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Quantity&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Integer&lt;/td&gt;
&lt;td&gt;Units ordered&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;InvoiceDate&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;String&lt;/td&gt;
&lt;td&gt;Order date and time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;UnitPrice&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Double&lt;/td&gt;
&lt;td&gt;Price per unit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CustomerID&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Double&lt;/td&gt;
&lt;td&gt;Customer identifier&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Country&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;String&lt;/td&gt;
&lt;td&gt;Customer country&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Step 1: Set Up Your Databases
&lt;/h2&gt;

&lt;p&gt;Start a new notebook. This will be your &lt;strong&gt;setup notebook&lt;/strong&gt; — run it once to create the structure.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# notebook: 00_setup
&lt;/span&gt;
&lt;span class="c1"&gt;# Create the three layer databases
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CREATE DATABASE IF NOT EXISTS bronze&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CREATE DATABASE IF NOT EXISTS silver&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CREATE DATABASE IF NOT EXISTS gold&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create the mount point directories
&lt;/span&gt;&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mkdirs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/mnt/warehouse/bronze/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mkdirs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/mnt/warehouse/silver/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mkdirs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/mnt/warehouse/gold/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✅ Databases and directories created.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now check the Databricks &lt;strong&gt;Data&lt;/strong&gt; tab — you should see three new databases: &lt;code&gt;bronze&lt;/code&gt;, &lt;code&gt;silver&lt;/code&gt;, and &lt;code&gt;gold&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 2: Bronze — Ingest Raw Data
&lt;/h2&gt;

&lt;p&gt;Create a new notebook: &lt;code&gt;01_bronze_ingestion&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# notebook: 01_bronze_ingestion
&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;current_timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_file_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lit&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Starting Bronze ingestion...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# -------------------------------------------------------
# Read the raw CSV from Databricks sample datasets
# -------------------------------------------------------
&lt;/span&gt;&lt;span class="n"&gt;raw_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/databricks-datasets/online_retail/data-001/data.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;inferSchema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Raw rows ingested: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;raw_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;raw_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;printSchema&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# -------------------------------------------------------
# Add Bronze metadata columns
# -------------------------------------------------------
&lt;/span&gt;&lt;span class="n"&gt;bronze_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;raw_df&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_ingested_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;current_timestamp&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_source_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;input_file_name&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_source_system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;online_retail_csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# -------------------------------------------------------
# Write to Bronze Delta table
# -------------------------------------------------------
&lt;/span&gt;&lt;span class="n"&gt;bronze_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwriteSchema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/mnt/warehouse/bronze/sales_raw/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Register in catalog
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    CREATE TABLE IF NOT EXISTS bronze.sales_raw
    USING DELTA
    LOCATION &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/mnt/warehouse/bronze/sales_raw/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Quick validation
&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/mnt/warehouse/bronze/sales_raw/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✅ Bronze table written. Total rows: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the cell. You should see output similar to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Raw rows ingested: 541,909
✅ Bronze table written. Total rows: 541,909
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's peek at what we landed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bronze.sales_raw&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll see messy data: nulls in &lt;code&gt;CustomerID&lt;/code&gt;, negative quantities (returns), and zero-price rows. That's fine. Bronze captures reality; Silver fixes it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 3: Silver — Clean and Enrich
&lt;/h2&gt;

&lt;p&gt;Create a new notebook: &lt;code&gt;02_silver_transformation&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# notebook: 02_silver_transformation
&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;upper&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;round&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;to_timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;year&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;month&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;when&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;current_timestamp&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Starting Silver transformation...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# -------------------------------------------------------
# Read from Bronze
# -------------------------------------------------------
&lt;/span&gt;&lt;span class="n"&gt;bronze&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bronze.sales_raw&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bronze rows: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bronze&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# -------------------------------------------------------
# Cleaning rules
# -------------------------------------------------------
&lt;/span&gt;&lt;span class="n"&gt;silver&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;bronze&lt;/span&gt;
    &lt;span class="c1"&gt;# 1. Drop rows with null CustomerID (anonymous sessions)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CustomerID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="c1"&gt;# 2. Drop duplicates on InvoiceNo + StockCode&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dropDuplicates&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;InvoiceNo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;StockCode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="c1"&gt;# 3. Remove returns (negative quantities) and zero-price items&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Quantity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;UnitPrice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# 4. Cast and clean types&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CustomerID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CustomerID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;integer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;InvoiceDate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;to_timestamp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;InvoiceDate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;M/d/yyyy H:mm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;UnitPrice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;UnitPrice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="c1"&gt;# 5. Derive new columns&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TotalAmount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Quantity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;UnitPrice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;upper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Year&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;year&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;InvoiceDate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Month&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;month&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;InvoiceDate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tier&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nf"&gt;when&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TotalAmount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;High Value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;when&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TotalAmount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Mid Value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;otherwise&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Low Value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# 6. Rename to snake_case&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumnRenamed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;InvoiceNo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;invoice_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumnRenamed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;StockCode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;product_code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumnRenamed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;product_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumnRenamed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Quantity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quantity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumnRenamed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;InvoiceDate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;invoice_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumnRenamed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;UnitPrice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unit_price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumnRenamed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CustomerID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumnRenamed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Country&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;country&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumnRenamed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TotalAmount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumnRenamed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Year&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;year&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumnRenamed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Month&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;month&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumnRenamed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tier&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tier&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# 7. Drop Bronze metadata&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_ingested_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_source_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_source_system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# 8. Add Silver metadata&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_processed_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;current_timestamp&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Silver rows after cleaning: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;silver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# -------------------------------------------------------
# Write to Silver Delta table
# -------------------------------------------------------
&lt;/span&gt;&lt;span class="n"&gt;silver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwriteSchema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;partitionBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;year&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;month&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/mnt/warehouse/silver/sales/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    CREATE TABLE IF NOT EXISTS silver.sales
    USING DELTA
    LOCATION &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/mnt/warehouse/silver/sales/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✅ Silver table written.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;silver.sales&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Bronze rows: 541,909
Silver rows after cleaning: 397,924
✅ Silver table written.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



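&lt;p&gt;We dropped ~144,000 rows (541,909 - 397,924 = 143,985): rows with a null &lt;code&gt;CustomerID&lt;/code&gt;, duplicate invoice lines, returns, and zero-price items. What remains is clean, trusted data.&lt;/p&gt;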
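
&lt;p&gt;To see how each cleaning rule contributes to that drop, here is a minimal plain-Python sketch of the same four filters applied in the same order. The six rows and the &lt;code&gt;drop_duplicates&lt;/code&gt; helper are made up for illustration and are not part of the pipeline; on the real table you would compare &lt;code&gt;count()&lt;/code&gt; before and after each step instead.&lt;/p&gt;

```python
# Plain-Python sketch (no Spark needed). The rows are hypothetical and are
# not taken from the real dataset; only the column names match the tutorial.
rows = [
    {"InvoiceNo": "A1", "StockCode": "P1", "CustomerID": 17850, "Quantity": 6, "UnitPrice": 2.55},
    {"InvoiceNo": "A1", "StockCode": "P1", "CustomerID": 17850, "Quantity": 6, "UnitPrice": 2.55},  # duplicate line
    {"InvoiceNo": "A2", "StockCode": "P2", "CustomerID": None, "Quantity": 1, "UnitPrice": 4.25},   # null customer
    {"InvoiceNo": "C3", "StockCode": "P3", "CustomerID": 13047, "Quantity": -2, "UnitPrice": 1.65}, # return
    {"InvoiceNo": "A4", "StockCode": "P4", "CustomerID": 12583, "Quantity": 3, "UnitPrice": 0.0},   # zero price
    {"InvoiceNo": "A5", "StockCode": "P5", "CustomerID": 12583, "Quantity": 10, "UnitPrice": 0.85},
]


def drop_duplicates(data, keys):
    """Keep the first row seen for each combination of the given keys."""
    seen = set()
    kept = []
    for row in data:
        key = tuple(row[k] for k in keys)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


# Same rules, same order as the Silver notebook (rules 1 to 3).
rules = [
    ("null CustomerID", lambda d: [r for r in d if r["CustomerID"] is not None]),
    ("duplicate InvoiceNo + StockCode", lambda d: drop_duplicates(d, ["InvoiceNo", "StockCode"])),
    ("non-positive Quantity", lambda d: [r for r in d if r["Quantity"] > 0]),
    ("non-positive UnitPrice", lambda d: [r for r in d if r["UnitPrice"] > 0]),
]

funnel = []      # (rule name, rows removed by that rule)
current = rows
for name, rule in rules:
    after = rule(current)
    funnel.append((name, len(current) - len(after)))
    current = after

for name, dropped in funnel:
    print(f"{name}: dropped {dropped}")
print(f"rows kept: {len(current)} of {len(rows)}")
```

&lt;p&gt;Order matters here: a row that is both a duplicate and a return is only counted under the first rule that removes it, so the per-rule numbers depend on the sequence in which the rules run.&lt;/p&gt;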




&lt;h2&gt;
  
  
  Step 4: Gold — Build Business Tables
&lt;/h2&gt;

&lt;p&gt;Create a new notebook: &lt;code&gt;03_gold_aggregations&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;We'll build three Gold tables.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gold Table 1: Monthly Revenue by Country
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# notebook: 03_gold_aggregations
&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;countDistinct&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;round&lt;/span&gt;

&lt;span class="n"&gt;silver&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;silver.sales&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# -------------------------------------------------------
# Gold 1: Monthly Revenue by Country
# -------------------------------------------------------
&lt;/span&gt;&lt;span class="n"&gt;monthly_revenue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;silver&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;year&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;month&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;country&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_revenue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;invoice_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_orders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;avg_order_value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nf"&gt;countDistinct&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unique_customers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;orderBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;year&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;month&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_revenue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ascending&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;monthly_revenue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwriteSchema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/mnt/warehouse/gold/monthly_revenue/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    CREATE TABLE IF NOT EXISTS gold.monthly_revenue
    USING DELTA
    LOCATION &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/mnt/warehouse/gold/monthly_revenue/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✅ gold.monthly_revenue written.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;monthly_revenue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Gold Table 2: Product Performance
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# -------------------------------------------------------
# Gold 2: Product Performance
# -------------------------------------------------------
&lt;/span&gt;&lt;span class="n"&gt;product_performance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;silver&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;product_code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;product_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_revenue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quantity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;units_sold&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;invoice_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;times_ordered&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nf"&gt;countDistinct&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unique_buyers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unit_price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;avg_unit_price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;orderBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_revenue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ascending&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;product_performance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwriteSchema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/mnt/warehouse/gold/product_performance/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    CREATE TABLE IF NOT EXISTS gold.product_performance
    USING DELTA
    LOCATION &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/mnt/warehouse/gold/product_performance/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✅ gold.product_performance written.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;product_performance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Gold Table 3: Customer Segments
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# -------------------------------------------------------
# Gold 3: Customer Segments
# -------------------------------------------------------
&lt;/span&gt;&lt;span class="n"&gt;customer_segments&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;silver&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;country&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lifetime_value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="c1"&gt;# Count distinct invoices so an invoice with several product lines is one order
&lt;/span&gt;        &lt;span class="nf"&gt;countDistinct&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;invoice_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_orders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;countDistinct&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;invoice_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;avg_order_value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nf"&gt;countDistinct&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;product_code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unique_products_bought&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;segment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nf"&gt;when&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lifetime_value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VIP&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;when&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lifetime_value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Loyal&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;when&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lifetime_value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Regular&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;otherwise&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Occasional&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;orderBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lifetime_value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ascending&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;customer_segments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwriteSchema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/mnt/warehouse/gold/customer_segments/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    CREATE TABLE IF NOT EXISTS gold.customer_segments
    USING DELTA
    LOCATION &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/mnt/warehouse/gold/customer_segments/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✅ gold.customer_segments written.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_segments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
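
&lt;p&gt;The segment boundaries are the part of this table most likely to be changed later, so it can help to check them outside Spark first. The following is a minimal plain-Python sketch of the same threshold logic as the &lt;code&gt;when&lt;/code&gt;/&lt;code&gt;otherwise&lt;/code&gt; chain above. The names &lt;code&gt;SEGMENT_THRESHOLDS&lt;/code&gt; and &lt;code&gt;assign_segment&lt;/code&gt; are hypothetical and are not part of the notebook.&lt;/p&gt;

```python
# Hypothetical helper: mirrors the when/otherwise chain in plain Python
# so the boundaries can be checked without a Spark session.
SEGMENT_THRESHOLDS = [
    (5000, "VIP"),
    (1000, "Loyal"),
    (200, "Regular"),
]


def assign_segment(lifetime_value):
    """Return the segment label for a customer's lifetime value."""
    # In Spark, a comparison against null does not match any when() branch,
    # so a null lifetime_value falls through to otherwise("Occasional").
    if lifetime_value is None:
        return "Occasional"
    # Thresholds are checked from highest to lowest; the first match wins.
    for minimum, label in SEGMENT_THRESHOLDS:
        if lifetime_value >= minimum:
            return label
    return "Occasional"


# Illustrative values chosen to sit on and just below each boundary.
for value in (5000, 4999.99, 1000, 200, 199.99, None):
    print(value, assign_segment(value))
```

&lt;p&gt;Because each threshold is inclusive, a customer with a lifetime value of exactly 5000 lands in VIP, and one just below it lands in Loyal.&lt;/p&gt;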






&lt;h2&gt;
  
  
  Step 5: Query Your Data Warehouse
&lt;/h2&gt;

&lt;p&gt;Open the &lt;strong&gt;SQL Editor&lt;/strong&gt; in Databricks. Your warehouse is live. Start querying.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- What were the top 5 revenue months?&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="nb"&gt;year&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;month&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_revenue&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;     &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;monthly_revenue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_orders&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;      &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;monthly_orders&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unique_customers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;monthly_customers&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;gold&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;monthly_revenue&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="nb"&gt;year&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;month&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;monthly_revenue&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- What are the top 10 best-selling products?&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;product_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;total_revenue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;units_sold&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;unique_buyers&lt;/span&gt;
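&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;gold&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_performance&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;total_revenue&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;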
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- How are customers distributed by segment?&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;segment&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                   &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;customer_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lifetime_value&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_lifetime_value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_orders&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_orders&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;gold&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_segments&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;segment&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;avg_lifetime_value&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Which countries generate the most revenue?&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_revenue&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_revenue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_orders&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;             &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_orders&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;gold&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;monthly_revenue&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;total_revenue&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You're querying a real data warehouse. Built by you. From scratch.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 6: Validate Your Warehouse
&lt;/h2&gt;

&lt;p&gt;Good data engineers always validate. Run these checks before calling it done:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# notebook: 04_validation
&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;=== DATA WAREHOUSE VALIDATION ===&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Row counts across layers
&lt;/span&gt;&lt;span class="n"&gt;bronze_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bronze.sales_raw&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;silver_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;silver.sales&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🥉 Bronze rows:  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bronze_count&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Guard against an empty Bronze table to avoid a ZeroDivisionError
&lt;/span&gt;&lt;span class="n"&gt;retention&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;silver_count&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;bronze_count&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;bronze_count&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🥈 Silver rows:  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;silver_count&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;  (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;retention&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; of bronze)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Gold table counts
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;table&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gold.monthly_revenue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gold.product_performance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gold.customer_segments&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🥇 &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; rows&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Null checks on Silver
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;sum&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;spark_sum&lt;/span&gt;

&lt;span class="n"&gt;silver&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;silver.sales&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;null_counts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;silver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="nf"&gt;spark_sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;isNull&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;int&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;invoice_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;invoice_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Null counts on critical Silver columns:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;null_counts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Revenue sanity check
&lt;/span&gt;&lt;span class="n"&gt;total_revenue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;silver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sum&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;collect&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Total Silver revenue: £&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;total_revenue&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;✅ Validation complete.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 7: Optimize Your Tables
&lt;/h2&gt;

&lt;p&gt;Now that everything is built, run maintenance on your Gold tables for faster queries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="k"&gt;sql&lt;/span&gt;

&lt;span class="c1"&gt;-- Compact small files&lt;/span&gt;
&lt;span class="n"&gt;OPTIMIZE&lt;/span&gt; &lt;span class="n"&gt;gold&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;monthly_revenue&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;OPTIMIZE&lt;/span&gt; &lt;span class="n"&gt;gold&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_performance&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;OPTIMIZE&lt;/span&gt; &lt;span class="n"&gt;gold&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_segments&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Speed up common filter patterns&lt;/span&gt;
&lt;span class="n"&gt;OPTIMIZE&lt;/span&gt; &lt;span class="n"&gt;gold&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;monthly_revenue&lt;/span&gt;     &lt;span class="n"&gt;ZORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;year&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;month&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;OPTIMIZE&lt;/span&gt; &lt;span class="n"&gt;gold&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_performance&lt;/span&gt; &lt;span class="n"&gt;ZORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_revenue&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;OPTIMIZE&lt;/span&gt; &lt;span class="n"&gt;gold&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_segments&lt;/span&gt;   &lt;span class="n"&gt;ZORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;segment&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What You've Built
&lt;/h2&gt;

&lt;p&gt;Let's look at the complete picture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;📁 Databases created:
   bronze / silver / gold

📄 Tables created:
   bronze.sales_raw          — 541,909 rows  (raw, as-is)
   silver.sales              — 397,924 rows  (clean, enriched)
   gold.monthly_revenue      — aggregated by year/month/country
   gold.product_performance  — aggregated by product
   gold.customer_segments    — aggregated by customer

🏗️ Architecture:
   Medallion (Bronze → Silver → Gold)
   All tables in Delta format
   Silver partitioned by year/month
   Gold tables OPTIMIZE'd with ZORDER

🔍 Queryable via:
   Databricks SQL Editor
   Any BI tool via JDBC/ODBC connector
   Databricks notebooks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Where to Go From Here
&lt;/h2&gt;

&lt;p&gt;You've built your first data warehouse in Databricks. Here's what to explore next:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Orchestration&lt;/strong&gt;: Take your four notebooks and wire them into a &lt;strong&gt;Databricks Workflow&lt;/strong&gt;, a pipeline that runs Bronze → Silver → Gold automatically on a schedule or trigger.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Incremental loads&lt;/strong&gt;: Update the Bronze ingestion to load only new files, and update Silver to use &lt;strong&gt;MERGE&lt;/strong&gt; instead of overwrite — real production pipelines are incremental.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unity Catalog&lt;/strong&gt;: In production Databricks, Unity Catalog provides centralized access control, data lineage, and governance across all your tables.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Databricks SQL Warehouses&lt;/strong&gt;: Connect Power BI, Tableau, or Looker directly to your Gold tables via a SQL Warehouse endpoint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;dbt on Databricks&lt;/strong&gt;: Use dbt to manage your Silver and Gold transformations with version control, testing, and documentation built in.&lt;/p&gt;




&lt;h2&gt;
  
  
  Series Complete 🎉
&lt;/h2&gt;

&lt;p&gt;You went from zero to a working data warehouse in Databricks. That's not a small thing.&lt;/p&gt;

</description>
      <category>databricks</category>
      <category>ai</category>
      <category>dataengineering</category>
      <category>datascience</category>
    </item>
    <item>
      <title>The observability gap for data science and analytics agents</title>
      <dc:creator>Raluca Crisan</dc:creator>
      <pubDate>Sun, 10 May 2026 11:05:31 +0000</pubDate>
      <link>https://dev.to/rraluca07/the-observability-gap-for-data-science-and-analytics-agents-3cnd</link>
      <guid>https://dev.to/rraluca07/the-observability-gap-for-data-science-and-analytics-agents-3cnd</guid>
      <description>&lt;p&gt;Databricks and similar enterprise data platforms have spent a great deal of time and effort making their product suites robust, with observability and tracing built in. Unsurprisingly, this is needed for enterprise support, especially in regulated sectors. But for the specific case of sophisticated data science and analytics agents, there is a gap in the observability suite, not just for Databricks but across analytics and data science agent providers large and small.&lt;/p&gt;

&lt;p&gt;In the case of Databricks, even with notebooks as a primary user interface, the offerings across data lineage, data management and MLflow no doubt give a high level of control and tracing. However, both large vendors like Databricks and Snowflake and smaller suppliers of analytics and data science agents share an observability gap. The gap is inherent to coding agent architectures and does not apply equally to all agents. A text-to-SQL assistant can be wrong in an ‘obvious’ way: the result makes no sense. A multi-step Python or Spark pipeline produced by an agent is different. Even when written by a human, pipeline logic is hard to unpick given the endless combinations of joins, data issues and data characteristics. This problem doesn’t go away when an agent is involved. For example, Genie can plan a solution, run code, use cell outputs to improve results, and fix errors automatically. The question is what, beyond the initial reasoning and the final artifact, can be inspected in this instance, and what can be logged reliably (that is, deterministically rather than probabilistically).&lt;/p&gt;

&lt;p&gt;To achieve their objectives, these more sophisticated data science and analytics agents need to create relatively complex multi-step pipelines. Past the initial data retrieval and the final storage step, the pipelines themselves are just arbitrary code. When such scripts are written by humans, observability for them is served by a whole segment of MLOps companies and tools, including Databricks’ own MLflow. But it is unclear what observability exists when this code is produced by agents, short of asking the agent itself to instrument the code (probabilistically), which somewhat defeats the purpose of observability in the first place.&lt;/p&gt;

&lt;p&gt;We’ve now narrowed the observability gap from the broader data platform context to a specific area: the ‘executed pipeline code’ part of the workflow of these more sophisticated analytics and data science agents. My first question was whether MLflow or another ‘off-the-shelf’ tool in the ecosystem can fill this gap directly. For why OpenTelemetry is not enough here, please see the previous blog post.&lt;/p&gt;

&lt;p&gt;Unsurprisingly, MLflow is heading in the direction of more granular instrumentation with the least amount of effort on anyone’s part, human or agent. For classic ML, a single &lt;code&gt;mlflow.autolog()&lt;/code&gt; call can automatically capture params, metrics, models, datasets, and artifacts around supported training APIs. For GenAI and agent workflows, one-line tracing primitives like &lt;code&gt;@mlflow.trace&lt;/code&gt;, &lt;code&gt;mlflow.trace(...)&lt;/code&gt;, and &lt;code&gt;mlflow.start_span()&lt;/code&gt; add function- and block-level visibility, including parent-child relationships, inputs, outputs, exceptions, and execution time.&lt;/p&gt;

&lt;p&gt;My initial experiments with deterministically instrumenting agent-created code with MLflow allowed me to track the models as experiments, which was a good step in the right direction 👍. But I cannot track data transformations, with MLflow or with anything else I’m familiar with.&lt;br&gt;
Tracking with autolog was the better option for me, rather than the tracing functions, because I’m not really tracking the agent; I’m trying to track what happens in the code produced by the agent when it runs. Below is an example of basic tracking:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg5x6sk8daei5jef0jrng.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg5x6sk8daei5jef0jrng.png" alt=" " width="800" height="514"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The gap is tracking what actually happens inside the pipeline outside the model itself: all the data operations for which no observability is present. In other use cases the code itself is the best evidence. But for pipeline-type structures, where outcomes are heavily influenced by the particulars of the data, the code is not enough: observability is needed on both the code and its runtime execution. For these data science and analytics agents, the code they produce (outside the model itself) is currently a black box. Below is an example table of interim artifacts (made using &lt;a href="https://docs.etiq.ai/" rel="noopener noreferrer"&gt;Etiq&lt;/a&gt;), which tooling like MLflow does not currently capture for agent-written code.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F69bbs07fybmr2c7vco5q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F69bbs07fybmr2c7vco5q.png" alt=" " width="512" height="329"&gt;&lt;/a&gt;&lt;/p&gt;
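
&lt;p&gt;To make the idea of deterministic, runtime observability on data operations concrete, here is a minimal sketch in plain Python. It is not how Etiq or MLflow work; the &lt;code&gt;observed&lt;/code&gt; decorator, the &lt;code&gt;STEP_LOG&lt;/code&gt; list and the two toy steps are all hypothetical names invented for illustration. The point is only that row counts and output columns are recorded by a wrapper, not by asking the agent to describe what it did.&lt;/p&gt;

```python
import functools

# Hypothetical in-memory log; a real setup would write to a tracking store.
STEP_LOG = []


def observed(step_name):
    """Record row counts and output columns around a pipeline step."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(rows, *args, **kwargs):
            result = func(rows, *args, **kwargs)
            STEP_LOG.append({
                "step": step_name,
                "rows_in": len(rows),
                "rows_out": len(result),
                "columns_out": sorted(result[0].keys()) if result else [],
            })
            return result
        return wrapper
    return decorator


@observed("drop_missing_amount")
def drop_missing_amount(rows):
    # Toy transformation: discard rows with no amount.
    return [r for r in rows if r.get("amount") is not None]


@observed("add_amount_with_tax")
def add_amount_with_tax(rows, rate=0.2):
    # Toy transformation: derive a new column from an existing one.
    return [{**r, "amount_with_tax": round(r["amount"] * (1 + rate), 2)} for r in rows]


data = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": None},
    {"id": 3, "amount": 5.0},
]
cleaned = add_amount_with_tax(drop_missing_amount(data))

# Each entry shows what a step did to the data at run time.
for entry in STEP_LOG:
    print(entry)
```

&lt;p&gt;Even this crude log would show that one row was dropped in the first step, which is the kind of interim fact that neither the final notebook nor the model-level tracking reveals on its own.&lt;/p&gt;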

&lt;p&gt;In this space we have been conditioned to believe that observability matters at all costs. However, given how coding agents are perceived in the market, I feel an argument has to be made for why it really matters in this instance.&lt;br&gt;
First, it’s about auditability. Not everyone cares about this, and not everyone should. But in regulated sectors like finance or healthcare it matters. For model validation in finance, for example, the data lineage documentation required involves more than what gets stored in Unity Catalog, Delta Lake or MLflow model tracking, useful as all of these components are. This type of use case needs to reflect the transformations that happen in the code itself once executed, and teams currently do this manually. At the moment, the use of semiautonomous coding agents for these use cases is minimal, but that is not where the enterprise stack is heading.&lt;/p&gt;

&lt;p&gt;Second, observability for these more sophisticated agents touches on other related risks, such as reproducibility, error propagation across longer pipelines, and general control issues for agent-generated code.&lt;br&gt;
Without observability, it is harder to track ‘semantic mistakes’ the agent might make, such as not using the correct metric definition, or applying the analysis or model to the wrong population. A bad transformation early in the pipeline affects everything downstream. I’m not sure exactly what level of observability is needed to mitigate these potential issues, but without any we would certainly struggle.&lt;/p&gt;

&lt;p&gt;Reproducibility is another area that does require some level of observability: if transformation execution is not observable, the final notebook may not be a faithful record of the run that produced the result. Similarly, we would struggle to compare agent runs over time (or rather without observability we would struggle more).&lt;/p&gt;

&lt;p&gt;The key argument for in-depth observability on agent-generated code is enterprise-level control, especially for regulated sectors. Usage of these sophisticated data science and analytics agents in regulated sectors might be small to begin with relative to the size of the overall data platform offering. However, as Databricks and other large enterprise data platforms feel the pressure from coding agents and foundation models, there just aren’t that many avenues left to go into. If Databricks’ long-term position is around providing the governed system in which semiautonomous enterprise agents can actually run, then any observability gap will prove problematic.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>observability</category>
      <category>agents</category>
      <category>databricks</category>
    </item>
    <item>
      <title>How to Choose the Right Databricks Consulting Firm: 7 Things Enterprises Get Wrong</title>
      <dc:creator>Lucy </dc:creator>
      <pubDate>Thu, 07 May 2026 13:14:35 +0000</pubDate>
      <link>https://dev.to/lucy1/how-to-choose-the-right-databricks-consulting-firm-7-things-enterprises-get-wrong-541</link>
      <guid>https://dev.to/lucy1/how-to-choose-the-right-databricks-consulting-firm-7-things-enterprises-get-wrong-541</guid>
      <description>&lt;p&gt;We've seen this more times than we'd like. A company drops serious money on a Databricks engagement, and nine months later they've got a half-migrated lakehouse, a Unity Catalog nobody's actually managing, and a "knowledge transfer session" that transferred nothing except a Confluence link nobody bookmarked. Picking the wrong Databricks consultants is painful. And it's almost always avoidable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Here's where enterprises consistently go wrong.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Treating Certifications Like a Proxy for Skill
&lt;/h2&gt;

&lt;p&gt;Databricks certs test whether someone read the documentation. They don't test what happens when a Delta Lake merge tanks a production cluster on a Friday night. Ask for specifics. What Spark executor errors have they actually debugged? How did they fix Z-ordering that was slowing down query performance instead of helping it? If they can't walk you through a real incident, the cert doesn't tell you much.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Not Pushing Hard on Unity Catalog
&lt;/h2&gt;

&lt;p&gt;This is the one where vague answers hide the most risk. Unity Catalog is now central to how governance actually works on Databricks — metastore structure, cross-workspace data sharing, attribute-based access control. Ask how they've handled multi-business-unit deployments. Ask what breaks when you try to share data across workspaces without planning the catalog hierarchy first. The consultants who've actually done it won't need to think long before answering.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Assuming Spark Experience Transfers Cleanly
&lt;/h2&gt;

&lt;p&gt;It doesn't. A strong Spark engineer isn't automatically a strong Databricks engineer. Photon engine tuning, Delta Live Tables pipeline architecture, Databricks Asset Bundles — these require platform-specific knowledge that general Spark work doesn't build. We've brought in Spark-heavy consultants who struggled with DLT and had never touched Databricks Workflows outside a tutorial. Ask for specific project examples, not credential claims.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Skipping the MLflow Conversation Entirely
&lt;/h2&gt;

&lt;p&gt;If any ML workloads are in scope and the consulting firm can't speak clearly about MLflow model registry promotion, experiment tracking strategy, or Feature Store integration — that's worth noting. A lot of firms pitch ML capabilities because the market asks for them, not because they've built production ML systems on Databricks. You can usually tell within five minutes of asking detailed questions.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Underestimating Migration Complexity
&lt;/h2&gt;

&lt;p&gt;This is where most projects actually fall apart. Moving off Hive metastores, Teradata, or on-prem Hadoop into Databricks involves decisions that compound quickly — schema evolution handling, ACID conflicts when porting existing workloads to Delta, incremental vs. full-load tradeoffs that aren't obvious until you're mid-migration. Any Databricks consultants who promise a smooth lift-and-shift haven't run one before. Push for specifics on how they've handled schema drift and what their rollback strategy looks like.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Not Locking In a Cost Governance Plan From Day One
&lt;/h2&gt;

&lt;p&gt;Cluster policy design, autoscaling rules, Spot instance configuration — these aren't details to figure out after the platform is running. We've seen companies end up paying three times what their workloads should cost because nobody set up a governance framework before the first jobs started running. If cost optimization isn't a named deliverable in the initial scope, ask why not.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. Accepting Documentation That Shows Up at the End
&lt;/h2&gt;

&lt;p&gt;Most firms hand over a Confluence export at project close and call it knowledge transfer. Real handoff means annotated notebooks, runbooks your team can actually follow, and live walkthroughs of your Workflows and scheduling logic while the consultants are still around to answer questions. If this isn't written into the engagement scope from the start, don't expect it to happen.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://www.lucentinnovation.com/services/databricks-consulting" rel="noopener noreferrer"&gt;Databricks consultants&lt;/a&gt; worth hiring aren't the ones with the most case studies on their homepage. They're the ones who can tell you what went wrong on a project and what they learned from it. If you're in the middle of evaluating options right now, you can see how we think about Databricks consulting, including how we scope engagements to avoid exactly these problems.&lt;/p&gt;

</description>
      <category>databricks</category>
      <category>dataengineering</category>
      <category>cloudcomputing</category>
      <category>databricksconsultingfirm</category>
    </item>
    <item>
      <title>How Databricks Genie Turns Plain English Into SQL Code</title>
      <dc:creator>Lucy </dc:creator>
      <pubDate>Thu, 07 May 2026 09:51:42 +0000</pubDate>
      <link>https://dev.to/lucy1/how-databricks-genie-turns-plain-english-into-sql-code-3fa9</link>
      <guid>https://dev.to/lucy1/how-databricks-genie-turns-plain-english-into-sql-code-3fa9</guid>
      <description>&lt;p&gt;If you have spent time working inside a data team, you already know how a typical Tuesday looks.&lt;/p&gt;

&lt;p&gt;A message comes in from the sales manager. Then one from finance. Then someone from the product team who just needs "a quick number." Before 10 AM, your backlog is three queries deep. None of them are complicated on their own. But together they eat up the hours you were planning to use on the pipeline work that actually needed you.&lt;/p&gt;

&lt;p&gt;This is not a small problem. Research from &lt;a href="https://medium.com/wrenai/leveraging-ai-to-handle-ad-hoc-data-requests-across-teams-0a3db3ae9f2c" rel="noopener noreferrer"&gt;Wren AI&lt;/a&gt; found that data analysts in fast-paced industries spend 50 to 70 percent of their time handling ad-hoc data requests. And as &lt;a href="https://www.owox.com/blog/articles/analysts-guide-managing-one-off-ad-hoc-requests" rel="noopener noreferrer"&gt;OWOX&lt;/a&gt; points out, each one-off request keeps analysts stuck in reactive mode instead of doing the forward-looking work that actually moves the business.&lt;/p&gt;

&lt;p&gt;Databricks built &lt;a href="https://www.databricks.com/product/business-intelligence/genie" rel="noopener noreferrer"&gt;AI/BI Genie&lt;/a&gt; to take a serious chunk of that workload off the data team. And based on how it works under the hood, it is worth understanding before you dismiss it as just another chatbot.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is Databricks Genie?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.databricks.com/blog/aibi-genie-now-generally-available" rel="noopener noreferrer"&gt;AI/BI Genie&lt;/a&gt; is a conversational analytics tool built directly into the Databricks platform. It became Generally Available in June 2025 and is free for all Databricks SQL customers with no extra license needed.&lt;/p&gt;

&lt;p&gt;The idea is simple on the surface. A business user types a question in plain English. Genie writes the SQL, runs it, and returns a table of results along with a chart and a plain-language summary.&lt;/p&gt;

&lt;p&gt;But what makes it different from the dozen other "ask your data a question" tools out there is what happens behind that simple interface.&lt;/p&gt;




&lt;h2&gt;
  
  
  How Genie Actually Works: The Compound AI System
&lt;/h2&gt;

&lt;p&gt;Genie is not just one model reading your question and guessing. &lt;a href="https://www.datacamp.com/tutorial/databricks-genie" rel="noopener noreferrer"&gt;DataCamp's deep dive into the architecture&lt;/a&gt; describes it as a compound AI system, which means it uses a chain of specialized agents working together.&lt;/p&gt;

&lt;p&gt;Here is the rough breakdown of what happens when someone asks a question:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;An &lt;strong&gt;intent parsing agent&lt;/strong&gt; figures out what the user is really asking, including the metric, the time range, the filters, and the aggregation type.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;planner agent&lt;/strong&gt; breaks multi-step questions into an ordered execution plan.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;retriever agent&lt;/strong&gt; finds the right tables, columns, and example queries to ground the request in your actual data.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;SQL generation agent&lt;/strong&gt; turns the plan into a real, executable SQL query.&lt;/li&gt;
&lt;li&gt;The query runs against your Databricks SQL warehouse.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;verifier&lt;/strong&gt; checks the result. If something looks off, it can trigger a re-run or ask the user to clarify.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;summarizer&lt;/strong&gt; writes a plain-language takeaway and picks the right visualization.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That is a lot of steps happening in seconds. And the reason this matters is that a simple single-model text-to-SQL approach fails a lot in production. Genie's multi-agent design is specifically built to reduce that failure rate.&lt;/p&gt;
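
&lt;p&gt;The generate, run, verify and retry part of that chain can be sketched as a toy loop. This is not Genie's implementation, which is not public at this level of detail; &lt;code&gt;answer_question&lt;/code&gt; and the three &lt;code&gt;fake_*&lt;/code&gt; stand-ins are hypothetical, and the "warehouse" is just a dictionary. It only illustrates why a verifier step lets a system recover from a first query that a single-pass model would have returned as-is.&lt;/p&gt;

```python
def answer_question(question, generate_sql, run_query, verify, max_attempts=2):
    """Generate SQL, run it, and retry with feedback when verification fails."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        sql = generate_sql(question, feedback)
        rows = run_query(sql)
        ok, feedback = verify(question, rows)
        if ok:
            return {"sql": sql, "rows": rows, "attempts": attempt}
    # Out of attempts: hand back to the user instead of guessing.
    return {"sql": None, "rows": [], "attempts": max_attempts, "needs_clarification": True}


def fake_generate_sql(question, feedback):
    # Stand-in for a SQL generation step. The first attempt filters on a
    # region value that does not exist in the data.
    if feedback is None:
        return "SELECT rep FROM sales WHERE region = 'APAC'"
    return "SELECT rep FROM sales WHERE region = 'Asia'"


# Stand-in for a SQL warehouse: canned results keyed by query text.
FAKE_RESULTS = {
    "SELECT rep FROM sales WHERE region = 'APAC'": [],
    "SELECT rep FROM sales WHERE region = 'Asia'": [("Mei",)],
}


def fake_run_query(sql):
    return FAKE_RESULTS.get(sql, [])


def fake_verify(question, rows):
    # Stand-in for a verifier: an empty result is treated as suspicious.
    if rows:
        return True, None
    return False, "query returned no rows; check filter values"


result = answer_question(
    "Who are the sales reps in Asia?",
    fake_generate_sql,
    fake_run_query,
    fake_verify,
)
print(result)
```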




&lt;h2&gt;
  
  
  Genie Spaces: Where the Real Setup Happens
&lt;/h2&gt;

&lt;p&gt;The part most articles skip over is what makes Genie useful versus what makes it unreliable. That difference comes down to how well a &lt;strong&gt;Genie Space&lt;/strong&gt; is configured.&lt;/p&gt;

&lt;p&gt;According to the &lt;a href="https://docs.databricks.com/aws/en/genie/" rel="noopener noreferrer"&gt;official Databricks documentation&lt;/a&gt;, a Genie Space is where a domain expert, such as a data analyst, sets up the context that Genie works from. This includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which tables and views Genie can access&lt;/li&gt;
&lt;li&gt;How business terms are defined ("active user" means X, "net revenue" means column Y)&lt;/li&gt;
&lt;li&gt;Example queries that show Genie how to handle common question patterns&lt;/li&gt;
&lt;li&gt;Text instructions for edge cases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This setup matters more than most people expect. Genie uses the names and descriptions from annotated tables and columns to convert natural language questions into equivalent SQL queries. If your column is named &lt;code&gt;amt_net_rev_adj&lt;/code&gt; with no description, Genie will guess. If it is named &lt;code&gt;adjusted_net_revenue&lt;/code&gt; and described clearly, Genie has the context it needs.&lt;/p&gt;

&lt;p&gt;You can build different Genie Spaces for different teams. One for finance. One for sales. One for operations. Each one has its own tables, its own vocabulary, and its own guardrails. This keeps a sales rep from accidentally querying financial tables they should not see, and it keeps Genie focused on the questions that actually matter to each group.&lt;/p&gt;




&lt;h2&gt;
  
  
  Security and Governance Are Built In, Not Bolted On
&lt;/h2&gt;

&lt;p&gt;One worry that comes up every time you let non-technical users query data directly is access control. What happens if someone asks a question that would return data they are not supposed to see?&lt;/p&gt;

&lt;p&gt;Genie handles this through Unity Catalog, which is Databricks' governance layer. According to the &lt;a href="https://docs.databricks.com/aws/en/genie/" rel="noopener noreferrer"&gt;Databricks Genie documentation&lt;/a&gt;, each user's own Unity Catalog data permissions are applied to the query results. Row filters and column masks are automatically enforced per user. If a user does not have SELECT access to a table, they will not see results from that table, even if they ask Genie a question that would normally involve it.&lt;/p&gt;

&lt;p&gt;This is not a new access control layer you have to build. It extends the permissions your team already set up in Unity Catalog. That makes the conversation with your security and compliance teams a lot shorter.&lt;/p&gt;




&lt;h2&gt;
  
  
  Benchmarking: The Step Most Teams Skip
&lt;/h2&gt;

&lt;p&gt;This is where a lot of Genie rollouts go wrong.&lt;/p&gt;

&lt;p&gt;A team sets up a Genie Space, tries a few questions manually, gets answers that look right, and rolls it out to the business team. Then an executive asks something the space was not tested on, gets a weird result, and suddenly nobody trusts Genie anymore.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://www.databricks.com/blog/aibi-genie-now-generally-available" rel="noopener noreferrer"&gt;Databricks team is direct about this&lt;/a&gt;: any AI effort should start with an evaluation phase. Failure to do so means failure in production.&lt;/p&gt;

&lt;p&gt;Genie has a built-in benchmarking tool for exactly this reason. You write a list of test questions that represent the real questions users will ask. You add the correct SQL answer for each one. Genie runs its own queries and compares the results to yours.&lt;/p&gt;

&lt;p&gt;According to &lt;a href="https://www.databricks.com/blog/how-build-production-ready-genie-spaces-and-build-trust-along-way" rel="noopener noreferrer"&gt;Databricks' production readiness guide&lt;/a&gt;, the typical expectation is that Genie benchmarks should be above 80 percent accuracy before you move on to user acceptance testing. They also recommend adding two to four different phrasings of the same question, because users will not always ask the same question the same way.&lt;/p&gt;
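
&lt;p&gt;The arithmetic behind that threshold is simple enough to sketch. The snippet below is a hypothetical stand-alone scorer, not the built-in Genie benchmarking tool: each case pairs the rows returned by a trusted query with the rows returned by a generated one, and the case structure, field names and sample data are made up for illustration. Two phrasings of the same question appear as separate cases, following the advice above.&lt;/p&gt;

```python
TARGET_ACCURACY = 0.8  # the "above 80 percent" bar mentioned above


def normalize(rows):
    """Treat a result set as an order-insensitive collection of rows."""
    return sorted(tuple(row) for row in rows)


def benchmark_accuracy(cases):
    """Return the fraction of cases whose generated rows match the expected rows."""
    if not cases:
        return 0.0
    passed = sum(
        1 for case in cases
        if normalize(case["generated"]) == normalize(case["expected"])
    )
    return passed / len(cases)


cases = [
    {
        "question": "Total revenue by country",
        "expected": [("UK", 100), ("FR", 40)],
        "generated": [("FR", 40), ("UK", 100)],  # same rows, different order
    },
    {
        "question": "How much revenue did each country bring in?",
        "expected": [("UK", 100), ("FR", 40)],
        "generated": [("UK", 100), ("FR", 40)],
    },
    {
        "question": "Best sales rep in Asia",
        "expected": [("Mei",)],
        "generated": [("Arun",)],  # wrong answer
    },
]

accuracy = benchmark_accuracy(cases)
ready_for_uat = accuracy > TARGET_ACCURACY
print(f"accuracy={accuracy:.0%}, ready_for_uat={ready_for_uat}")
```

&lt;p&gt;Comparing result rows instead of SQL text is a deliberate choice in this sketch: two differently written queries can both be correct, so only the returned data is checked.&lt;/p&gt;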

&lt;p&gt;There is also an "Ask for Review" feature. If a user gets an answer they are not sure about, they can flag it. A space admin gets notified, reviews the SQL, and corrects it if needed. The user gets notified once the answer is verified. This feedback loop is how Genie gets better over time instead of drifting.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://www.databricks.com/blog/whats-new-aibi-october-2025-roundup" rel="noopener noreferrer"&gt;October 2025 release notes&lt;/a&gt; also added a "Knowledge Extraction" feature. When a user gives a thumbs up to a generated query, Genie analyzes that interaction and proposes knowledge snippets such as metric definitions or filter patterns that the space admin can approve and add to the knowledge store.&lt;/p&gt;

&lt;p&gt;That is a real improvement over tools that treat every question as if it is the first one.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Good SQL Schema Documentation Does for Genie
&lt;/h2&gt;

&lt;p&gt;This is worth its own section because it surprises a lot of engineers.&lt;/p&gt;

&lt;p&gt;When you first set up a Genie Space, you will quickly discover that the quality of Genie's answers is almost entirely dependent on how well your tables and columns are documented. This is not a new idea. Good data teams have always known that schema documentation matters. Genie just makes that documentation pay off in a way that is immediately visible to everyone, not just other engineers.&lt;/p&gt;

&lt;p&gt;Here is a practical example from the &lt;a href="https://www.databricks.com/blog/building-confidence-your-genie-space-benchmarks-and-ask-review" rel="noopener noreferrer"&gt;Databricks benchmarking blog&lt;/a&gt;. One team wanted Genie to calculate the "best sales rep in Asia." Genie kept failing that question. The fix was not a model update. It was adding a single example SQL query to the instructions page showing exactly how to calculate that metric. After that, Genie answered it correctly every time.&lt;/p&gt;

&lt;p&gt;That is the pattern you will see over and over. The fix is almost never "change the model." It is "give Genie more context about what the question actually means."&lt;/p&gt;




&lt;h2&gt;
  
  
  Genie Code: Writing Dashboards With Natural Language
&lt;/h2&gt;

&lt;p&gt;One feature that deserves more attention is Genie Code.&lt;/p&gt;

&lt;p&gt;When you create an AI/BI Dashboard in Databricks, it automatically creates a companion Genie Space. But Genie Code goes a step further. It lets you write and edit the actual SQL and Python cells in your dashboard notebooks using natural language prompts.&lt;/p&gt;

&lt;p&gt;Instead of writing a complex window function from scratch, you describe what you want in plain English and Genie writes the code. You review it, tweak it if needed, and move on. This is especially useful for analysts who know what they want but do not always remember the exact SQL syntax for a specific aggregation or join pattern.&lt;/p&gt;

&lt;p&gt;This is part of the same thinking that drives tools like GitHub Copilot, but scoped specifically to the Databricks analytics environment with all the governance context already built in.&lt;/p&gt;




&lt;h2&gt;
  
  
  Who Benefits and How
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://www.databricks.com/blog/next-generation-databricks-genie" rel="noopener noreferrer"&gt;next-generation Genie announcement&lt;/a&gt; gives a sense of how widely teams are using Genie. Customers created over 1.5 million Genie Spaces in 2026 alone. That adoption happened because different roles found different value in the same tool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Business analysts and managers&lt;/strong&gt; stop waiting. A question that used to take the data team two days to answer now takes thirty seconds. This is the most visible benefit, and it is the one that wins over internal champions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data engineers&lt;/strong&gt; get time back. As &lt;a href="https://www.sigmacomputing.com/blog/how-to-implement-ad-hoc-reporting-without-driving-your-data-department-crazy" rel="noopener noreferrer"&gt;Sigma Computing writes&lt;/a&gt;, the BI bottleneck is not just stressful. It also delays decisions that need to be made quickly. When business users can self-serve the common questions, data engineers can stay focused on the work that actually requires an engineer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data analysts&lt;/strong&gt; turn their existing knowledge into a reusable asset. They set up the Genie Space once, document it well, add example queries, and the business team can self-serve on top of that work without sending messages every time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Executives&lt;/strong&gt; get faster decisions. Questions that need a quick answer before a meeting get an answer before the meeting.&lt;/p&gt;




&lt;h2&gt;
  
  
  Embedding Genie Outside of Databricks
&lt;/h2&gt;

&lt;p&gt;One of the more practical things in the latest release is that Genie does not have to live only inside the Databricks workspace.&lt;/p&gt;

&lt;p&gt;Using the Genie Conversation APIs, developers can embed Genie into Slack, Microsoft Teams, or custom internal applications. A sales team that never opens Databricks can ask questions directly from Slack and get back a chart and a summary without leaving the tool they already work in.&lt;/p&gt;
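
&lt;p&gt;As a rough sketch of the first step a Slack bot or internal app might take, the snippet below builds a request that starts a Genie conversation. The endpoint path and the &lt;code&gt;content&lt;/code&gt; field are assumptions based on the public Genie API reference and should be confirmed there. The host, space ID, and token are placeholders. The request is built but not sent.&lt;/p&gt;

```python
import json
import urllib.request


def build_start_conversation_request(host, space_id, token, question):
    """Build (but do not send) a request that starts a Genie conversation."""
    # The endpoint path is an assumption; confirm it in the Genie API reference.
    url = f"https://{host}/api/2.0/genie/spaces/{space_id}/start-conversation"
    body = json.dumps({"content": question}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder values; none of these identify a real workspace.
request = build_start_conversation_request(
    host="example.cloud.databricks.com",
    space_id="SPACE_ID",
    token="TOKEN",
    question="Which region had the highest revenue last quarter?",
)
print(request.get_method(), request.full_url)
```

&lt;p&gt;A real integration would send the request, then poll the returned message until Genie finishes, and would read the token from a secret store rather than from source code.&lt;/p&gt;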

&lt;p&gt;The latest version of Genie also connects to enterprise knowledge sources like Google Drive and SharePoint, according to the &lt;a href="https://www.databricks.com/blog/next-generation-databricks-genie" rel="noopener noreferrer"&gt;next-gen Genie release post&lt;/a&gt;. This means Genie can now blend structured data from your Delta tables with unstructured content from documents to answer questions that used to require a human to piece together.&lt;/p&gt;




&lt;h2&gt;
  
  
  How This Connects to Broader AI Agent Work on Databricks
&lt;/h2&gt;

&lt;p&gt;Genie is a great starting point, but it is part of a larger picture on the Databricks platform.&lt;/p&gt;

&lt;p&gt;Once teams get comfortable with Genie handling their self-serve analytics layer, the next question that usually comes up is: what about workflows that go beyond answering questions? What about agents that can take action, run multi-step reasoning tasks, or be deployed as part of a production application?&lt;/p&gt;

&lt;p&gt;That is where the Mosaic AI Agent Framework comes in. If you are thinking ahead to that kind of work, it is worth reading about how &lt;a href="https://www.lucentinnovation.com/resources/it-insights/mosaic-ai-agent-framework" rel="noopener noreferrer"&gt;Mosaic AI handles evaluation, governance, and production deployment for AI agents on Databricks&lt;/a&gt;. The evaluation mindset is the same. The MLflow tracing and Unity Catalog governance carry over. But the scope is broader.&lt;/p&gt;




&lt;h2&gt;
  
  
  What You Need to Make Genie Work in Production
&lt;/h2&gt;

&lt;p&gt;To be direct: setting up Genie is easy. Getting it to work well in production takes real work.&lt;/p&gt;

&lt;p&gt;Here is what consistently makes the difference:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Clean, well-described tables.&lt;/strong&gt; Column names and descriptions need to match how your business teams actually talk. If marketing calls something "activation rate" and your table calls it &lt;code&gt;usr_actv_rt_wk&lt;/code&gt;, Genie will have trouble making that connection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real example queries.&lt;/strong&gt; The example queries in a Genie Space teach Genie how to handle your organization's specific metric logic. The more representative they are, the better Genie handles questions it has never seen before.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A benchmark set before launch.&lt;/strong&gt; According to &lt;a href="https://www.databricks.com/blog/how-build-production-ready-genie-spaces-and-build-trust-along-way" rel="noopener noreferrer"&gt;Databricks' own best practices&lt;/a&gt;, most Genie Spaces should reach above 80 percent benchmark accuracy before they go to user testing. That bar exists for a reason. If a space misses it, users lose trust quickly, and that trust is hard to rebuild.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Someone who owns the space long term.&lt;/strong&gt; Genie Spaces need a person responsible for reviewing flagged responses, updating example queries as data changes, and approving knowledge snippets from user feedback. Without that owner, quality drifts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Proper Unity Catalog setup.&lt;/strong&gt; If your tables are not already in Unity Catalog with access controls in place, that needs to happen first. Genie's governance layer depends on it.&lt;/p&gt;
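
&lt;p&gt;The 80 percent benchmark bar mentioned above can be turned into a simple gate. This is a minimal sketch, not Genie's built-in benchmark feature: the questions and pass/fail results are made up, and it assumes you have already recorded whether each benchmark answer matched the expected result.&lt;/p&gt;

```python
# Hypothetical benchmark results: question text mapped to whether Genie's
# answer matched the expected result.
benchmark_results = {
    "Total revenue by region last quarter": True,
    "Best sales rep in Asia": False,
    "Weekly activation rate for May": True,
    "Top ten customers by order count": False,
    "Average order value by channel": True,
}

TARGET_ACCURACY = 0.80  # the 80 percent bar cited above


def benchmark_accuracy(results):
    """Return the share of benchmark questions answered correctly."""
    if not results:
        raise ValueError("benchmark set is empty")
    return sum(results.values()) / len(results)


def failed_questions(results):
    """List the questions that still need better context or example SQL."""
    return [question for question, passed in results.items() if not passed]


accuracy = benchmark_accuracy(benchmark_results)
ready_for_user_testing = accuracy > TARGET_ACCURACY

print(f"accuracy: {accuracy:.0%}, ready: {ready_for_user_testing}")
for question in failed_questions(benchmark_results):
    print("needs work:", question)
```

&lt;p&gt;The list of failed questions is the useful part. Each one points at a missing description, instruction, or example query.&lt;/p&gt;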

&lt;p&gt;A lot of teams underestimate how much foundational data engineering work feeds into a good Genie rollout. If your team is already stretched thin on that infrastructure layer, it can make sense to bring in specialized help. That is why some teams choose to &lt;a href="https://www.lucentinnovation.com/specialists/hire-data-engineers" rel="noopener noreferrer"&gt;hire experienced data engineers&lt;/a&gt; who already understand how the Databricks ecosystem fits together, rather than trying to figure it out while also building the Genie Space.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where to Start
&lt;/h2&gt;

&lt;p&gt;If you already have a Databricks SQL workspace, you can create a Genie Space today. No extra license. No new tool to install.&lt;/p&gt;

&lt;p&gt;Start small. Pick one team, one topic, and a focused set of tables. Write clear column descriptions. Add ten to fifteen example queries that cover the most common patterns. Build a benchmark test set before you open it to users. Then release it to a small group and watch what they ask.&lt;/p&gt;

&lt;p&gt;The questions that Genie cannot answer well are your roadmap for improving the space. That feedback loop of questions, failures, and fixes is how good Genie Spaces are built over time. It is the same loop that any good data product depends on. Genie just makes each iteration faster and more visible.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;Genie is not magic. It is a well-engineered system that works best when the data behind it is clean, documented, and governed correctly.&lt;/p&gt;

&lt;p&gt;The teams that get the most out of it are the ones that treat the Genie Space setup like they treat any other production data product. That means documentation, testing, ownership, and a willingness to iterate based on real user feedback.&lt;/p&gt;

&lt;p&gt;That is not a high bar. It is the same bar good data teams already hold themselves to. Genie just gives them a way to deliver the output of that work directly to the people who need it, without requiring a SQL ticket for every question.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Have you set up a Genie Space yet? What was the hardest part of the setup? Drop a comment. Real-world experience from different environments is always useful.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Sources Referenced&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.databricks.com/product/business-intelligence/genie" rel="noopener noreferrer"&gt;Databricks AI/BI Genie Product Page&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.databricks.com/blog/aibi-genie-now-generally-available" rel="noopener noreferrer"&gt;AI/BI Genie Generally Available Announcement&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.databricks.com/blog/next-generation-databricks-genie" rel="noopener noreferrer"&gt;Next Generation of Databricks Genie&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.databricks.com/aws/en/genie/benchmarks" rel="noopener noreferrer"&gt;Genie Benchmarks Documentation (AWS)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.databricks.com/blog/building-confidence-your-genie-space-benchmarks-and-ask-review" rel="noopener noreferrer"&gt;Building Confidence With Benchmarks and Ask for Review&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.databricks.com/blog/how-build-production-ready-genie-spaces-and-build-trust-along-way" rel="noopener noreferrer"&gt;How to Build Production-Ready Genie Spaces&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.databricks.com/blog/whats-new-aibi-october-2025-roundup" rel="noopener noreferrer"&gt;What's New in AI/BI, October 2025&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.databricks.com/aws/en/genie/" rel="noopener noreferrer"&gt;What Is a Genie Space, Official Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.datacamp.com/tutorial/databricks-genie" rel="noopener noreferrer"&gt;DataCamp: Databricks Genie Tutorial&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/wrenai/leveraging-ai-to-handle-ad-hoc-data-requests-across-teams-0a3db3ae9f2c" rel="noopener noreferrer"&gt;Wren AI: Leveraging AI for Ad-Hoc Requests&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.owox.com/blog/articles/analysts-guide-managing-one-off-ad-hoc-requests" rel="noopener noreferrer"&gt;OWOX: Analyst's Guide to Ad-Hoc Requests&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.sigmacomputing.com/blog/how-to-implement-ad-hoc-reporting-without-driving-your-data-department-crazy" rel="noopener noreferrer"&gt;Sigma Computing: Ad-Hoc Reporting Without Burnout&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.lucentinnovation.com/resources/it-insights/mosaic-ai-agent-framework" rel="noopener noreferrer"&gt;Mosaic AI Agent Framework on Databricks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.lucentinnovation.com/specialists/hire-data-engineers" rel="noopener noreferrer"&gt;Hire Data Engineers&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>databricks</category>
      <category>dataengineering</category>
      <category>sql</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
