DEV Community: databricks

Deep Dive: Personal Agents and Their Role in the A…

Norvik Tech — Tue, 02 Jun 2026 18:06:48 +0000

Originally published at norvik.tech

Introduction

Explore how personal agents are transforming the AI stack with Snowflake and Databricks, and what it means for your tech strategy.

Understanding Personal Agents in the AI Landscape

Personal agents represent a significant advancement in AI technology, acting as intermediaries that facilitate more intuitive interaction between users and complex data systems. These agents can automate tasks such as data analysis, insights generation, and decision support, enhancing productivity within organizations. According to a recent report, companies integrating personal agents into their workflows have experienced a 25% increase in efficiency, underscoring their potential value.

How Personal Agents Function

Personal agents leverage natural language processing (NLP) and machine learning algorithms to understand user commands and retrieve relevant data. For instance, a user might ask their personal agent to generate a sales report, prompting the agent to query databases, analyze trends, and present findings in an easily digestible format. This automation significantly reduces the burden on data teams and accelerates decision-making processes.

Technical Architecture of Personal Agents

Core Components

The architecture of personal agents typically includes several key components:

Data Ingestion Layer: Collects data from various sources, including databases, APIs, and cloud storage.
Processing Engine: Analyzes the ingested data using algorithms that can learn from interactions.
User Interface: Provides a platform for users to interact with the agent, often through chat interfaces or dashboards.

Interaction Flow

User inputs a command via the interface.
The agent processes the command, querying relevant datasets.
Insights are generated and presented back to the user in real time.

This streamlined process allows businesses to harness insights without extensive manual input.
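The interaction flow above can be sketched in a few lines of Python. This is only an illustration, not how any specific product works: a keyword lookup stands in for real NLP, and the `sales` table, the `INTENTS` mapping, and the `handle_command` function are all hypothetical names.

```python
import sqlite3

# Hypothetical intent table: a keyword lookup stands in for a real NLP model.
INTENTS = {
    "sales report": (
        "SELECT region, SUM(amount) AS total FROM sales "
        "GROUP BY region ORDER BY total DESC"
    ),
}

def handle_command(conn, command):
    """Map a user command to a predefined query and summarize the rows."""
    text = command.lower()
    for phrase, sql in INTENTS.items():
        if phrase in text:
            rows = conn.execute(sql).fetchall()
            return [f"{region}: {total:.2f}" for region, total in rows]
    return ["Sorry, I don't know how to answer that yet."]

def demo_connection():
    """Build an in-memory database with made-up sample data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("north", 120.0), ("south", 80.0), ("north", 30.0)],
    )
    return conn
```

Calling `handle_command(demo_connection(), "Generate a sales report")` walks through all three steps: the command is received, a dataset is queried, and a short summary comes back.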

Use Cases for Personal Agents

Real-World Applications

Personal agents find applications across various industries:

Retail: Automating inventory management and customer interactions.
Finance: Streamlining reporting processes and risk analysis.
Healthcare: Assisting in patient data management and appointment scheduling.

Specific Examples

A major retailer implemented a personal agent to automate customer service queries, resulting in a 30% reduction in response times.
A financial institution used a personal agent for real-time risk assessment, leading to more informed investment decisions.

Implications for Business Strategy

What Does This Mean for Your Business?

For companies in Colombia, Spain, and LATAM, adopting personal agents can transform operational efficiency. The initial investment may be substantial, but the ROI is evident in the form of enhanced productivity and reduced labor costs.

Regional Considerations

In Colombia, where businesses are increasingly digitalizing, adopting such technology can set companies apart from competitors.
Spanish firms may see significant benefits in customer engagement through personalized interactions driven by these agents.

Next Steps for Implementation

How to Get Started

To effectively implement personal agents in your organization, consider the following steps:

Assess Your Current Data Infrastructure: Identify gaps in your current processes that personal agents could fill.
Pilot Program: Start with a small-scale pilot to measure effectiveness and gather user feedback.
Evaluate Performance: Use metrics such as time savings and user satisfaction to gauge success before full-scale deployment.

By following these steps, your organization can smoothly transition into leveraging personal agents while minimizing risks associated with adoption.

Frequently Asked Questions

What are personal agents and how do they work?

Personal agents are tools that use artificial intelligence to automate tasks and make it easier for users to interact with data systems. They process natural-language commands to generate relevant reports and analyses.

Which industries can use these agents?

They are used in industries such as retail, finance, and healthcare to improve operational efficiency and customer experience. Applications are broad and vary with each sector's needs.

What is the return on investment from implementing personal agents?

Companies have reported significant improvements in operational efficiency and lower labor costs, which translates into a positive return on investment within a relatively short period.

Why Your In-House Databricks Team Is Probably Losing You Money

Lucy — Wed, 27 May 2026 10:44:20 +0000

60% of enterprise AI projects get abandoned because of data readiness and infrastructure issues.

Not because of bad ideas. Not because of wrong tooling. Because the foundation wasn't built right and by the time anyone noticed, the cost of fixing it was higher than starting over.

If you're running Databricks in-house, there's a decent chance you're heading toward one of four failure modes. I've seen each of them play out, sometimes in the same org.

1. The "unicorn engineer" job post

You know the one. It asks for someone who can handle platform architecture, complex ETL pipeline design, MLOps, and data governance. Maybe Unity Catalog experience preferred. Definitely Spark optimization. Oh, and some Python.

That person doesn't exist. Or if they do, they're already at a FAANG and not answering your recruiter.

What actually happens: you hire someone capable, and they spend most of their time on operational noise like manually partitioning tables, babysitting cluster configs, and debugging integration issues that have nothing to do with your actual data problems.

Databricks has gotten genuinely complex. Delta Lake, Lakeflow Declarative Pipelines, Unity Catalog: these aren't plug-and-play. A generalist data engineer in 2026 is not the same as a Databricks platform specialist.

A consulting partner brings people who've already built this across multiple clients. You're not buying hours. You're buying what they learned the hard way somewhere else (multi-cloud workspace topology, Liquid Clustering, private endpoint configs) without waiting for your team to acquire those scars.

2. The cloud bill no one is watching

Here's one I've seen kill otherwise solid data platforms quietly.

In-house team gets the pipelines working. Everyone moves on. Nobody sets up auto-termination. Nobody enforces cluster policies. Clusters run indefinitely. Variable workloads stay on always-on compute when they should be hitting Serverless SQL.

[Traditional In-House Setup] ---> Over-provisioned Clusters ---> High Idle Waste & Skyrocketing Bills
[Consulting-Led Framework] ---> Serverless SQL + Cluster Policies ---> Automated Auto-Termination & Controlled Spend

The bill climbs slowly, and then suddenly it's a boardroom conversation.

A proper FinOps setup isn't exciting work, but it has a direct, measurable line to your cloud costs: mandatory autotermination_minutes, enforced instance pool configs, and routing the right workloads away from always-on clusters. This is table stakes. It just often doesn't get done when you're underwater on pipeline work.
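To make the auto-termination point concrete, here is a minimal sketch of a cluster policy definition plus a local check against it. The Clusters API spells the setting `autotermination_minutes`, and policy definitions use rule types such as `fixed` and `range`. The `violations` helper is my own illustration of what the policy would reject, not Databricks' enforcement logic, and the pool ID is a placeholder.

```python
import json

# Policy definition in the shape Databricks cluster policies use:
# attribute path -> rule. The pool ID below is a placeholder.
POLICY = {
    "autotermination_minutes": {
        "type": "range", "minValue": 10, "maxValue": 60, "defaultValue": 30,
    },
    "instance_pool_id": {"type": "fixed", "value": "pool-placeholder-id"},
}

def violations(cluster_spec, policy=POLICY):
    """Return human-readable policy violations for a cluster spec (local check only)."""
    problems = []
    for attr, rule in policy.items():
        value = cluster_spec.get(attr)
        if rule["type"] == "fixed" and value != rule["value"]:
            problems.append(f"{attr} must equal {rule['value']!r}, got {value!r}")
        elif rule["type"] == "range":
            low, high = rule["minValue"], rule["maxValue"]
            if value is None or not (low <= value <= high):
                problems.append(f"{attr} must be between {low} and {high}, got {value!r}")
    return problems

# The JSON string is what you would paste into a policy definition.
POLICY_JSON = json.dumps(POLICY, indent=2)
```

A value of 0 disables auto-termination in the Clusters API, so a lower bound above zero is what actually stops clusters from running indefinitely.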

3. Governance that gets bolted on after the fact

The pattern is almost universal:

Build the pipelines
Ship the dashboards
Deal with governance "later"

By the time "later" arrives, you've got fragmented data silos, ML models stuck in sandbox environments, inconsistent access controls, and no data lineage. Then someone asks about compliance.

Unity Catalog isn't an afterthought, it's the thing you configure before the pipelines, not after. Role-based access controls, automated data quality monitoring, end-to-end lineage tracking. If these aren't in the foundation, your downstream reports are unreliable by design.

The uncomfortable truth: A lot of teams treat governance like a documentation task. It's not. It's infrastructure.

4. The hiring timeline nobody accounts for

Realistic timeline from job post to a team that's onboarded, trained on Databricks, and actually productive:

6–9 months.

That's not pessimism, that's just recruiting + onboarding + platform ramp-up. Most orgs don't factor this in when they're comparing in-house costs against consulting rates.

A consulting firm gets there faster because they're not starting from scratch. Pre-built IaC templates, established Bronze/Silver/Gold ingestion patterns, CI/CD already wired up. Deployment that takes your internal team six months can happen in weeks.

That gap matters if your competitors are already running predictive analytics in production.

So what actually works?

It's not a binary choice, and framing it that way is usually how you end up making the wrong call.

The companies that handle this well use a hybrid model:

Bring in specialists for the hard setup — architecture, Unity Catalog, cluster optimization, MLOps scaffolding
Keep the internal team focused on domain knowledge, custom data products, and the business problems that actually need context to solve

Your internal engineers understand your data, your customers, and your edge cases. That's valuable and hard to transfer. But asking them to also be platform infrastructure experts is how you end up with both things done poorly.

TL;DR

Problem	In-house default	What fixes it
Skill gaps	Overhire, underdeliver	Consulting for platform-specific work
Cloud costs	Idle compute, no policies	FinOps framework from day one
Governance	Bolted on later	Unity Catalog before pipelines
Speed	6–9 months to productivity	Pre-built templates + IaC

The architecture decisions you make in the first few months of a Databricks deployment are surprisingly hard to undo. Getting them right upfront — even with outside help — is almost always cheaper than refactoring a broken foundation at scale.

Have you gone through a Databricks migration or build-out? Curious what broke first — drop it in the comments.

Adeloop: Turning Semantic Data Models Into APIs for AI Agents

Adeloop — Tue, 26 May 2026 11:12:19 +0000

Most AI agents today can call APIs.

But very few systems solve the real problem:

how do you safely expose business data to AI agents without giving them raw database access?

That’s what we built in Adeloop.

Introducing: Adeloop Agent Console API

Adeloop can now publish semantic domains as governed APIs for external AI agents and applications.

The flow is simple:

Connect a warehouse, database, spreadsheet, or file source
Turn tables into a semantic domain
Publish selected domains
Generate an API key
Connect from ChatGPT, Claude, Cursor, n8n, Zapier, or your backend

The important part:

External AI agents never access raw SQL directly.

Adeloop becomes the governed execution layer between AI and data.

Why This Matters

Most “AI data chat” products are either:

unsafe SQL generators
notebook wrappers
or vector search over metadata

That breaks quickly at scale.

Instead, Adeloop uses:

question
→ semantic routing
→ metric/dimension planning
→ bounded SQL compilation
→ source pushdown execution
→ governed JSON response

This means:

queries stay close to the warehouse
millions of rows are not pulled into Python
agents receive structured JSON
governance and rate limits stay enforced

The default execution mode is:

semantic_sql_pushdown

Not Python.

Not sandbox compute.

Not “LLM writes random SQL”.
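As a rough illustration of what "bounded SQL compilation" can mean, here is a minimal sketch. It is not Adeloop's implementation: the `DOMAIN` structure, the `MAX_LIMIT` cap, and the function names are all assumptions, and SQLite stands in for the warehouse. The idea is that the agent only picks names from a whitelist, and those names are the only thing that can reach the SQL text.

```python
import sqlite3

# Hypothetical semantic domain: only these metrics and dimensions can reach SQL.
DOMAIN = {
    "table": "orders",
    "metrics": {"total_revenue": "SUM(amount)"},
    "dimensions": {"customer": "customer_name"},
}
MAX_LIMIT = 100  # hard cap, regardless of what the agent asks for

def compile_query(metric, dimension, limit):
    """Compile a plan into SQL using whitelisted expressions only."""
    if metric not in DOMAIN["metrics"] or dimension not in DOMAIN["dimensions"]:
        raise ValueError("unknown metric or dimension")
    bounded = max(1, min(int(limit), MAX_LIMIT))
    sql = (
        f"SELECT {DOMAIN['dimensions'][dimension]} AS {dimension}, "
        f"{DOMAIN['metrics'][metric]} AS {metric} "
        f"FROM {DOMAIN['table']} GROUP BY 1 ORDER BY 2 DESC LIMIT ?"
    )
    return sql, (bounded,)

def run(conn, metric, dimension, limit):
    """Push the query down to the source and return JSON-ready rows."""
    sql, params = compile_query(metric, dimension, limit)
    cursor = conn.execute(sql, params)
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]
```

Because free text from the agent never gets interpolated into the query, an input like `"1; DROP TABLE orders"` fails the whitelist check instead of reaching the database.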

Example

An external agent can ask:

{
  "question": "Show top customers by revenue",
  "limit": 10
}

Adeloop then:

selects the semantic domain
resolves semantic metrics/dimensions
compiles safe SQL
executes against Postgres/MySQL/Snowflake/etc
returns answer + JSON + execution metadata

Example response:

{
  "answer": "Top result is Acme with total_revenue = 124500",
  "execution": {
    "mode": "semantic_sql_pushdown",
    "engine": "postgresql",
    "sandboxUsed": false
  }
}

MCP + OpenAPI Support

We also added:

MCP-compatible JSON-RPC endpoint
OpenAPI 3.1 action schema
API key scopes
usage logs
semantic metadata endpoints
deterministic domain routing

So tools like:

adeloopchat
Claude tools
Cursor
n8n
ChatGPT Actions

can consume governed business data without direct warehouse access.
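For readers who haven't worked with MCP, a tool invocation is a JSON-RPC 2.0 request with the method `tools/call`. The sketch below only builds that envelope. The tool name `query_domain` is made up for illustration, and the endpoint URL and auth header are left out because this post doesn't specify them.

```python
import json

def build_tool_call(request_id, tool_name, arguments):
    """Build a JSON-RPC 2.0 request for MCP's tools/call method."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }

# "query_domain" is a hypothetical tool name; the arguments mirror the
# example request shown earlier in this post.
payload = json.dumps(
    build_tool_call(
        1,
        "query_domain",
        {"question": "Show top customers by revenue", "limit": 10},
    )
)
```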

One Important Architecture Decision

We intentionally did NOT add E2B/sandbox execution into the main API path.

Why?

Because most business questions are:

aggregations
grouped metrics
dashboards
top-N queries
filters
time-series analytics

Those should execute through SQL pushdown near the data source.

Python notebooks and sandbox compute belong later as async premium analysis jobs for:

forecasting
anomaly detection
ML
simulations
notebook/report generation

Normal analytics APIs should stay fast, deterministic, and scalable.

The Bigger Goal

I think AI agents will need something equivalent to:

a semantic execution layer for enterprise data

Not just chat over databases.

Something that handles:

governance
semantic metrics
execution planning
safe query compilation
federation
caching
observability
API contracts for agents

That’s the direction we’re building toward with Adeloop.

Would love feedback from people building:

AI agents
semantic layers
MCP tooling
data infrastructure
analytics engineering systems

Databricks and FSx for ONTAP S3 Access Points — A Layer-by-Layer Validation of Observed Boundaries

Yoshiki Fujiwara(藤原善基)@AWS Community Builder — Sun, 24 May 2026 11:17:38 +0000

TL;DR

Connecting Databricks to FSx for ONTAP S3 Access Points is significantly harder than Athena (Part 1). After testing every approach I could find — Unity Catalog External Locations, NFS mounts, Instance Profiles, multiple VPC configurations — here is what I found:

Unity Catalog's session policy initially blocked the FSx for ONTAP S3 AP ARN pattern → 403
Setting the access_point field on the External Location partially resolves the session policy: explicit-path file read succeeds, but UC table creation, subdirectory listing, and write operations remain blocked — meaning UC governance features (lineage, tags, fine-grained access) cannot yet be applied
NFS kernel mount is blocked by seccomp by design (confirmed by Databricks Support)
Instance Profile + boto3 works for direct S3 AP access (bypassing Unity Catalog)
Spark read with explicit file path works under UC governance — 1000 rows of sensor data readable with full schema inference, proving data access is possible even if table creation is blocked

Quick Decision Guide:

Read-only SQL analytics on NAS data → Use Athena (Part 1) or Snowflake External Table (Part 3)
Governed Databricks lakehouse on NAS data → Stage via FPolicy → Lambda → S3 → Auto Loader → UC Managed Table
Exploratory PoC (time-limited) → Instance Profile + boto3 with compensating controls

This article is a layer-by-layer validation of observed integration boundaries between Databricks and FSx for ONTAP S3 Access Points. It is not an argument against Databricks. Databricks remains a strong platform for lakehouse, ML, and production Delta workloads. This article focuses narrowly on one integration boundary: direct access from Databricks to FSx for ONTAP S3 Access Points.

This article documents the full troubleshooting journey, including the strace analysis that identified the root cause of NFS mount failures.

This article documents observed behavior in one validated environment. It should not be interpreted as a general compatibility statement for all Databricks configurations or future platform versions.

GitHub Repository: fsxn-lakehouse-integrations

If you want to reproduce this validation, the repository's integrations/databricks/ directory contains environment setup notes, and verification-pack/ contains test templates and evidence recording formats. The verification pack is template-first by design, so validation runs can produce consistent, reviewable evidence across environments. Actual result files will be added as validation runs are completed.

This article also includes a Snowflake ↔ Databricks concept mapping table (showing which capabilities work on each platform) and an AI Readiness Score to help teams quantitatively compare pattern options for FSx for ONTAP integration.

How to Read This Article

This article is:

A reproduction-focused validation report
Evidence from one environment (DBR 17.3 LTS, ap-northeast-1)
A starting point for vendor confirmation and architecture discussion

This article is not:

A general compatibility statement
A production certification
A statement on behalf of Databricks

Read by role:

Databricks admin: Unity Catalog External Location → Governance Impact Summary
Storage engineer: NFS Mount investigation → Evidence Matrix
Data engineer: Instance Profile + boto3 → Next Validation Metrics
Partner / SA: Decision Matrix → Discovery Questions → Partner Conversation Guide
Opening a support case: Databricks Support Case Packet

Prerequisite Concepts

Before reading this article, it helps to understand:

Unity Catalog Storage Credential — an object that stores a reference to a cloud IAM role for accessing external storage
Unity Catalog External Location — maps a cloud storage path to a storage credential for governed access
Instance Profile on AWS — an IAM role attached to an EC2 instance, providing credentials via IMDS
Databricks-managed VPC vs Customer-managed VPC — whether Databricks or the customer controls the workspace network
Cluster access modes — Standard (shared, multi-user with UC governance) and Dedicated (single-user with sudo access). Unity Catalog requires standard or dedicated access mode
S3 Access Point ARN vs S3 bucket ARN — S3 AP uses arn:aws:s3:<region>:<account>:accesspoint/<name>, not arn:aws:s3:::<bucket>
Driver vs executor behavior in Spark — the driver orchestrates; executors run distributed tasks. Credentials and network paths may differ between them

Key Concepts: Databricks Storage & Ingestion Architecture

Understanding Databricks' storage and ingestion concepts is essential for evaluating FSx for ONTAP S3 AP integration boundaries.

Partner Quick Reference: If your customer asks "Can Databricks read our NAS data via S3 Access Points?" — the answer is "partially, with limitations." File-level read works under UC governance, but table creation and directory listing are blocked. For governed analytics on NAS data today, recommend Snowflake External Table or Athena. For Databricks-specific workloads, recommend staged ingestion to S3 → UC Managed Table (see Recommended Architecture Pattern). If the customer already uses Databricks, the FPolicy → Lambda → S3 → Auto Loader pattern preserves full UC governance on ingested data. See Discovery Questions for customer qualification.

Storage Credential → External Location → External Table/Volume

Storage Credential (IAM Role ARN + External ID)
    │
    └── External Location (cloud storage path + credential + access_point field)
            │
            ├── External Table (tabular data: Parquet, Delta, Iceberg)
            └── External Volume (non-tabular: images, documents, audio)

Concept	Description	FSx S3 AP Status
Storage Credential	IAM Role that Databricks assumes to access cloud storage. During AssumeRole, Databricks generates a session policy that restricts what the assumed session can do — even if the IAM role itself has broader permissions.	✅ Created
External Location	Maps S3 path to a Storage Credential; defines access boundary	✅ Created (with `access_point` field)
External Table	UC-governed table whose data resides in External Location	❌ CREATE TABLE blocked
External Volume	UC-governed volume for unstructured files in External Location	❌ Blocked (same session policy issue)

External Volume is the Databricks equivalent of Snowflake's Directory Table — it provides governed access to non-tabular files (images, documents, audio, video). Since External Volume requires External Location creation with full subdirectory access, it is currently blocked by the same session policy limitation that blocks External Table creation.

Auto Loader (Incremental Ingestion)

Auto Loader is Databricks' equivalent of Snowflake's Snowpipe — it incrementally processes new files as they arrive in cloud storage.

Mode	Description	FSx S3 AP Status
Directory Listing	Periodically lists directory to find new files	⚠️ Requires External Location (blocked)
File Notification	Uses S3 Event Notifications + SQS for real-time detection	❌ Not possible (FSx S3 AP doesn't support S3 Events)

Auto Loader supported formats (8 formats): JSON, CSV, Parquet, Avro, ORC, XML, TEXT, BINARYFILE.

FSx S3 AP latency context: Even if Directory Listing mode were unblocked, FSx S3 AP ListObjectsV2 latency is significantly higher than native S3 (tens of seconds to minutes for large directories). This would impact Auto Loader polling intervals and new-file detection speed. Plan for minutes-level detection latency, not seconds.
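For reference, this is roughly what an Auto Loader stream in directory listing mode looks like. It is a minimal sketch that assumes a Databricks runtime where a `spark` session exists. The function name and both paths are placeholders, and given the boundaries described above the source path would be a staged S3 bucket, not the FSx S3 AP alias.

```python
def build_autoloader_stream(spark, source_path, schema_location, file_format="json"):
    """Return a streaming DataFrame that uses Auto Loader in directory listing mode."""
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", file_format)
        # Directory listing is the default; set explicitly to show the choice.
        .option("cloudFiles.useNotifications", "false")
        # Auto Loader stores the inferred schema at this location.
        .option("cloudFiles.schemaLocation", schema_location)
        .load(source_path)
    )
```

In a notebook this would be called as `build_autoloader_stream(spark, "s3://<staging-bucket>/landing/", "s3://<staging-bucket>/_schemas/")` and followed by a `writeStream` to a Delta table.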

Concept Mapping: Snowflake ↔ Databricks

Snowflake Concept	Databricks Equivalent	FSx S3 AP (Snowflake)	FSx S3 AP (Databricks)
Storage Integration	Storage Credential	✅	✅
External Stage	External Location	✅	✅ (partial)
External Table	External Table	✅	❌ Blocked
Directory Table	External Volume	✅	❌ Blocked
Snowpipe	Auto Loader	⚠️ (no S3 Events)	❌ Blocked
COPY INTO	COPY INTO / Auto Loader	✅	❌ Blocked
`AWS_ACCESS_POINT_ARN`	`access_point` field	✅ (resolves all)	⚠️ (partial resolution)
Cortex Search (RAG)	Mosaic AI / MLflow	✅ (via COPY INTO)	⚠️ (boto3 + external)

Data Ingestion Alternatives for FSx for ONTAP (When Auto Loader Is Blocked)

Throughput constraint: All S3 AP operations are bounded by the FSx for ONTAP file system's provisioned throughput capacity (e.g., 128 MB/s in this validation environment). This throughput is shared with NFS/SMB workloads on the same file system. Plan ingestion windows and concurrent access accordingly.

Since Auto Loader requires External Location (currently blocked on FSx S3 AP), use these alternatives:

Method	Description	Latency	Governance
FPolicy → Lambda → S3 → Auto Loader	FPolicy detects file changes → Lambda copies to S3 → Auto Loader ingests	Seconds	✅ Full UC (on S3 copy)
AWS Glue ETL	Glue job reads from FSx S3 AP → writes to S3/Delta	Minutes	AWS-side
EMR Serverless	Spark job reads from FSx S3 AP → writes to S3/Delta	Minutes	AWS-side
AWS DataSync	Scheduled sync from FSx NFS → S3 bucket	Minutes-Hours	AWS-side
SnapMirror to S3	ONTAP-native replication to S3 bucket	Minutes	ONTAP-side

SnapMirror to S3 caveat: Object metadata in SnapMirror S3 targets differs from NFS file metadata. Validate schema compatibility and file naming conventions before using SnapMirror S3 as an ingestion path for analytics engines.

Recommended production pattern:

FSx for ONTAP ──FPolicy──▶ Lambda ──▶ S3 Bucket ──▶ Auto Loader ──▶ Delta Table (UC governed)
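The Lambda step in this pattern can be sketched as follows. Treat it as an assumption-heavy outline: the event field `key`, the environment variable names, and the read-whole-object approach (fine for small files only) are mine, not taken from the repository. The one part that is standard S3 behavior is passing the access point alias as `Bucket`.

```python
import os

def handler(event, context=None, s3_client=None):
    """Copy one changed file from the FSx S3 Access Point to a staging bucket."""
    if s3_client is None:
        import boto3  # imported lazily so the function can be tested with a stub
        s3_client = boto3.client("s3")
    # Placeholder defaults; set real values via Lambda environment variables.
    source = os.environ.get("SOURCE_AP_ALIAS", "<FSx-S3-AP-alias>")
    target = os.environ.get("STAGING_BUCKET", "<staging-bucket>")
    key = event["key"]  # hypothetical event field carrying the changed file's key
    body = s3_client.get_object(Bucket=source, Key=key)["Body"].read()
    s3_client.put_object(Bucket=target, Key=key, Body=body)
    return {"copied": key, "bytes": len(body)}
```

Auto Loader then watches the staging bucket, which is a native S3 bucket and therefore works with a regular External Location.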

Iceberg interoperability note: Once data is in UC as a managed Delta or Iceberg table, external engines can access it via UC's Iceberg REST Catalog, so Athena, EMR, and Trino can query the same governed table without data duplication. This makes the staged S3 → UC path a hub for multi-engine access.

AI Readiness Score

Pattern	Governance	Performance	AI Capability	Cost	Operational Simplicity	Overall
Athena + FSx S3 AP	★★★☆☆	★★★★☆	★☆☆☆☆ (SQL only)	★★★★★	★★★★★	3.6
Snowflake External Table	★★★★☆	★★★☆☆	★★★★☆ (Cortex AI)	★★★★★	★★★★☆	4.0
Staged to S3 → UC Table	★★★★★	★★★★★	★★★★★ (full Mosaic AI)	★★☆☆☆	★★☆☆☆	3.8
boto3 PoC (Databricks)	★☆☆☆☆	★★☆☆☆	★★★☆☆ (driver-only)	★★★★★	★★★☆☆	2.8
Bedrock KB + FSx S3 AP	★★★☆☆	★★★★☆	★★★★☆ (RAG)	★★★★☆	★★★★☆	3.8

Governance: UC lineage, tags, masking, row filters
Performance: Query latency, distributed processing
AI Capability: Breadth of AI/ML functions available
Cost: Storage efficiency, compute cost
Operational Simplicity: Setup, maintenance, pipeline complexity

Scoring methodology: Each dimension rated by the author based on validated evidence in this article series. This is not an official AWS assessment or certification. Scores reflect observed capabilities in one test environment.

Performance note: Performance scores reflect relative comparison within FSx S3 AP access patterns, not comparison with native S3 bucket performance. All patterns accessing FSx S3 AP have higher latency than equivalent native S3 operations.

How to use this score: Use Overall score as a starting point for pattern selection. Scores ≥ 4.0 indicate strong fit for governed production workloads. Scores 3.5–3.9 indicate viable paths with trade-offs. Scores < 3.0 indicate PoC-only paths requiring compensating controls.
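The Overall column matches the unweighted mean of the five star ratings in each row. The scoring methodology does not state a formula, so treating it as a plain average is an inference from the numbers. A short sketch to recompute the scores and apply the bands described above:

```python
# Star ratings transcribed from the table, in column order:
# governance, performance, AI capability, cost, operational simplicity.
RATINGS = {
    "Athena + FSx S3 AP": (3, 4, 1, 5, 5),
    "Snowflake External Table": (4, 3, 4, 5, 4),
    "Staged to S3 -> UC Table": (5, 5, 5, 2, 2),
    "boto3 PoC (Databricks)": (1, 2, 3, 5, 3),
    "Bedrock KB + FSx S3 AP": (3, 4, 4, 4, 4),
}

def overall(stars):
    """Unweighted mean of the five dimensions, rounded to one decimal."""
    return round(sum(stars) / len(stars), 1)

def tier(score):
    """Map a score to the bands described in the text."""
    if score >= 4.0:
        return "strong fit"
    if 3.5 <= score <= 3.9:
        return "viable with trade-offs"
    if score < 3.0:
        return "PoC only"
    return "not classified"  # scores from 3.0 to 3.4 have no stated band

for name, stars in RATINGS.items():
    print(f"{name}: {overall(stars)} ({tier(overall(stars))})")
```

If your priorities differ, swapping the plain mean for a weighted one (for example, doubling governance) is a one-line change in `overall`.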

When to choose which:

Choose Snowflake External Table (4.0) when governed AI on NAS data without copying is the priority
Choose Staged to S3 → UC Table (3.8) when maximum Databricks performance and full Mosaic AI are required (accepts data duplication cost)
Choose Bedrock KB (3.8) when AWS-native RAG with zero-copy on FSx is the primary requirement
Choose boto3 PoC (2.8) only for time-limited exploration with explicit approval; with compensating controls (see Compensating Controls section), governance risk can be partially mitigated for PoC scope. Post-expiration actions must be defined: terminate cluster, remove instance profile, archive evidence.

The Goal

Process unstructured data (images, documents, audio) stored on FSx for ONTAP from Databricks — without copying data to S3. FSx for ONTAP S3 Access Points should make this possible by exposing NFS/SMB file data via S3 API.

In Part 1, Athena worked cleanly in my validation using the official AWS tutorial pattern. Databricks, however, has multiple security layers that interact with S3 AP in unexpected ways.

Test Environment

I tested across two workspace configurations:

Runtime scope: Only DBR 17.3 LTS (Spark 4.0.0) was tested. This article does not compare DBR 16.x, 18.x, ML runtimes, GPU runtimes, or serverless compute, nor access modes beyond those listed in the test environment. Runtime-level behavior may differ across versions and compute types.

┌─────────────────────────────────────────────────────────────────────┐
│ Workspace 1: Databricks-managed VPC                                 │
│ - VPC created and managed by Databricks                             │
│ - Limited network control                                           │
│ - VPC Peering to FSx for ONTAP VPC                                  │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│ Workspace 2: Customer-managed VPC (same VPC as FSx for ONTAP)       │
│ - Full network control                                              │
│ - Direct connectivity to FSx for ONTAP (no peering needed)          │
│ - NAT Gateway for Databricks control plane                          │
└─────────────────────────────────────────────────────────────────────┘

Cluster modes tested:

Standard (Shared Access)
Dedicated (Single User) — provides sudo/root access
Dedicated with Instance Profile

All tests used DBR 17.3 LTS (Spark 4.0.0), ap-northeast-1.

Approach 1: Unity Catalog External Location

The Setup

The Databricks-governed path for S3 data access is to create a Storage Credential and External Location. I tested whether the same pattern could work with an FSx for ONTAP S3 Access Point.

# What I expected to work
files = dbutils.fs.ls("s3://<FSx-S3-AP-alias>/")

The Error

AccessDenied: User: arn:aws:sts::<ACCOUNT>:assumed-role/databricks-...-cross-account-role/
  databricks-unity-catalog-credential-<WORKSPACE_ID>
is not authorized to perform: s3:ListBucket on resource:
  "arn:aws:s3:<REGION>:<ACCOUNT>:accesspoint/<AP_NAME>"
because no session policy allows the s3:ListBucket action

Observed Boundary

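Unity Catalog applies a session policy when it calls AssumeRole. This session policy acts like a permissions boundary: the session's effective permissions are the intersection of the IAM role's policies and the session policy, so even if the IAM role has s3:* on *, the session policy restricts what the assumed session can do.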

The evidence narrows the failure domain, but does not identify Databricks internal implementation details.

In this validation, the generated session policy behavior allowed access to a standard S3 bucket path but did not allow the FSx for ONTAP S3 Access Point ARN pattern:

arn:aws:s3:::bucket-name       ✅ Allowed
arn:aws:s3:::bucket-name/*     ✅ Allowed

But FSx for ONTAP S3 AP uses a different ARN format:

arn:aws:s3:<region>:<account>:accesspoint/<name>    ❌ Not in session policy
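The intersection behavior can be modeled in a few lines. This is a toy evaluator, not IAM's real engine (no Deny statements, no conditions), and the `SESSION` policy below is an assumed shape: the actual policy Databricks generates is not visible from the outside. The access point ARN is a placeholder.

```python
from fnmatch import fnmatchcase

def allowed(action, resource, statements):
    """True if any Allow statement matches the action and resource (no Deny handling)."""
    return any(
        fnmatchcase(action, act) and fnmatchcase(resource, res)
        for stmt in statements
        for act in stmt["actions"]
        for res in stmt["resources"]
    )

def effective(action, resource, role_policy, session_policy):
    """A request must be allowed by both the role policy and the session policy."""
    return allowed(action, resource, role_policy) and allowed(action, resource, session_policy)

# Role policy as described in the text: s3:* on *.
ROLE = [{"actions": ["s3:*"], "resources": ["*"]}]
# Assumed shape of the generated session policy: bucket ARNs only.
SESSION = [{
    "actions": ["s3:ListBucket", "s3:GetObject"],
    "resources": ["arn:aws:s3:::bucket-name", "arn:aws:s3:::bucket-name/*"],
}]

BUCKET_ARN = "arn:aws:s3:::bucket-name"
AP_ARN = "arn:aws:s3:ap-northeast-1:111122223333:accesspoint/example-ap"  # placeholder
```

With these inputs, `effective("s3:ListBucket", BUCKET_ARN, ROLE, SESSION)` is true while the same call with `AP_ARN` is false, even though the role policy alone allows both. That mirrors the "no session policy allows the s3:ListBucket action" wording in the error.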

Proof

The same IAM role works fine for regular S3 buckets through Unity Catalog:

# This works — regular S3 bucket
dbutils.fs.ls("s3://my-workspace-storage-bucket/")
# SUCCESS

# This fails — FSx for ONTAP S3 Access Point
dbutils.fs.ls("s3://<FSx-S3-AP-alias>/")
# AccessDenied: no session policy allows...

Status

In my initial validation, this behaved like a platform boundary in Unity Catalog's generated session policy. I opened a support case to confirm whether S3 Access Point ARN patterns can be supported for external locations.

Before (access_point field not set) — Unity Catalog session policy blocks all S3 AP operations:

Without the access_point field, dbutils.fs.ls on the S3 AP alias returns UNAUTHORIZED_ACCESS. The session policy only allows standard S3 bucket ARNs.

Update (2026-05-24): `access_point` Field Partially Resolves Session Policy

Critical Update (2026-05-26): Databricks Support subsequently confirmed that the access_point field was never released as a generally available feature and has been removed from documentation. The partial success described below is "a side effect of incomplete internal handling, not a supported code path." Unity Catalog External Locations do not currently support S3 Access Points. See the full support confirmation at the end of this section.

Databricks Support initially indicated (May 2026) that Unity Catalog External Locations accept an access_point field. Setting this field includes the S3 AP ARN in the generated session policy.

Configuration that works:

External Location:
  URL: s3://<FSx-S3-AP-alias>/
  Credential: <storage-credential-name>
  access_point: arn:aws:s3:<region>:<account>:accesspoint/<ap-name>

API call to set the field:

curl -X PATCH \
  https://<workspace>/api/2.1/unity-catalog/external-locations/<location-name> \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{"access_point": "arn:aws:s3:<region>:<account>:accesspoint/<ap-name>"}'

What now works under UC governance:

Operation	Result	Notes
`dbutils.fs.ls("s3://<alias>/")`	✅	Top-level listing (287 items)
`dbutils.fs.head("s3://<alias>/file.txt")`	✅	Read file content
`spark.read.text("s3://<alias>/file.txt")`	✅	Spark read with explicit file path
`spark.read.csv("s3://<alias>/path/to/file.csv")`	✅	1000 rows, schema inferred

After (access_point field set) — Top-level listing succeeds, 287 items visible:

With the access_point field configured, dbutils.fs.ls at the top level returns 287 items from the FSx for ONTAP volume.

Sensor data read via Spark — 1000 rows with schema inference:

spark.read.csv with explicit file path successfully reads 1000 sensor readings with full schema inference (timestamp, machine_id, temperature_c, vibration_mm_s, pressure_bar, rpm, status, location).

What still does NOT work:

Operation	Result	Error
`dbutils.fs.ls("s3://<alias>/subdir/")`	❌	AccessDenied on getFileStatus
`spark.read.load("s3://<alias>/subdir/")`	❌	Forbidden (directory-level access)
`CREATE TABLE LOCATION 's3://<alias>/...'`	❌	UC_CLOUD_STORAGE_ACCESS_FAILURE
`dbutils.fs.cp` (PutObject)	❌	AccessDenied

Remaining blockers — Subdirectory listing and UC table creation fail:

Subdirectory dbutils.fs.ls returns UNAUTHORIZED_ACCESS. CREATE TABLE LOCATION fails with UC_CLOUD_STORAGE_ACCESS_FAILURE. Without a UC table, governance features (lineage, tags, fine-grained access control) cannot be applied.

Summary: Data is readable but not governable. The critical blocker is CREATE TABLE LOCATION failure, which prevents Unity Catalog governance from being applied to the data.

Key pattern: File-level read operations succeed (GetObject with explicit key). Directory-level operations (ListObjectsV2 with prefix, HeadObject on prefix) fail for subdirectories. This suggests the session policy scopes ListObjectsV2 to the root prefix only.

Implication: Explicit-path file read works, but without UC table creation, Unity Catalog governance features — lineage, fine-grained access control, governance tags, column masking, row filtering — cannot be applied. The data is technically readable through the External Location path but not registerable as a governed UC table. This limits the practical value for production governance use cases until the subdirectory listing and table creation issues are resolved.

Requirements for this path:

Customer-managed VPC workspace (same VPC as FSx for ONTAP)
External Location with access_point field set
Storage Credential IAM role with S3 AP permissions
NAT Gateway for control plane connectivity

Approach 2: NFS Mount (Managed VPC)

The Idea

If S3 AP doesn't work through Unity Catalog, mount the FSx for ONTAP volume directly via NFS.

The Setup

Created VPC Peering between the Databricks-managed VPC and the FSx for ONTAP VPC, then updated route tables and security groups.

The Result

%sh
timeout 3 bash -c 'echo > /dev/tcp/10.0.3.133/2049' && echo "REACHABLE" || echo "NOT REACHABLE"
# NOT REACHABLE

NFS port (TCP 2049) is unreachable from Databricks-managed VPC, even with VPC Peering configured. From the customer-controlled routing perspective, route tables and FSx for ONTAP-side security groups were configured to allow NFS. However, cluster-side egress remained governed by the Databricks-managed environment, and NFS egress was not permitted.

Lesson

A Databricks-managed VPC gives you limited network control. The egress rules on cluster instances are managed by Databricks, not by customer-added security group rules.

Approach 3: NFS Mount (Customer-managed VPC)

The Setup

Deployed a new workspace in the same VPC as FSx for ONTAP. No peering needed — direct L3 connectivity.

Network Verification (All Pass)

%sh
echo "TCP 2049 (NFS):"
timeout 3 bash -c 'echo > /dev/tcp/10.0.3.133/2049' && echo "REACHABLE"
echo "TCP 111 (portmapper):"
timeout 3 bash -c 'echo > /dev/tcp/10.0.3.133/111' && echo "REACHABLE"
echo "TCP 635 (mountd):"
timeout 3 bash -c 'echo > /dev/tcp/10.0.3.133/635' && echo "REACHABLE"

TCP 2049 (NFS): REACHABLE ✅
TCP 111 (portmapper): REACHABLE ✅
TCP 635 (mountd): REACHABLE ✅

Note: The /dev/tcp test confirms TCP reachability. NFSv3 mountd may use TCP or UDP depending on configuration. The exact transport should be validated with rpcinfo if needed.
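
The same TCP reachability check can be run from a Python cell instead of %sh. This is a minimal sketch using only the standard library; the function name tcp_reachable is an illustrative choice, and like the /dev/tcp test it says nothing about UDP transports.

```python
import socket

def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False

# Hypothetical usage with the ports checked above
# (replace the address with your own NFS endpoint):
# for name, port in {"NFS": 2049, "portmapper": 111, "mountd": 635}.items():
#     state = "REACHABLE" if tcp_reachable("10.0.3.133", port) else "NOT REACHABLE"
#     print(f"TCP {port} ({name}): {state}")
```

Returning a boolean rather than printing makes it easy to collect the results into a single evidence record per cluster.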

sudo Access (Dedicated Mode)

%sh
sudo whoami
# root ✅

NFS Client Installation and Export Verification

%sh
sudo apt-get install -y nfs-common
showmount -e 10.0.3.133

Export list for 10.0.3.133:
/vol1 (everyone) ✅

Everything looks perfect. Network connected, root access available, NFS exports visible. Let's mount:

The Mount Attempt

%sh
sudo mkdir -p /mnt/fsxn
sudo mount -t nfs -o nfsvers=3,nolock 10.0.3.133:/vol1 /mnt/fsxn

mount.nfs: access denied by server while mounting 10.0.3.133:/vol1

Wait, what? The server is showing the export to everyone, we have root access, the network is connected... why "access denied by server"?

The Investigation: Why NFS Mount Fails

This is where it gets interesting. The error message says "access denied by server" — but is it really the server?

Step 1: Verify ONTAP Export Policy

Via ONTAP REST API (accessible from the same cluster):

{
  "rules": [{
    "clients": [{"match": "0.0.0.0/0"}],
    "ro_rule": ["any"],
    "rw_rule": ["any"],
    "superuser": ["any"],
    "protocols": ["any"]
  }]
}

The export policy is maximally permissive — all clients, all protocols, read-write, superuser. ONTAP is not denying access.

Important: This permissive export policy was used only to eliminate ONTAP export restrictions as a variable during troubleshooting. It is not a production recommendation. For production, restrict: client CIDR, protocol, read/write rule, superuser mapping, and volume/junction path scope.

ONTAP Production Hardening Checklist

For production deployments, harden the ONTAP configuration:

[ ] Restrict export policy client CIDR to known analytics subnets only
[ ] Avoid rw=any and superuser=any — use explicit security flavors
[ ] Map S3 Access Point file system user to a least-privilege NAS user (not root/UID 0)
[ ] Validate NFS/SMB ACL behavior when S3 AP is active
[ ] Validate S3 API access against file-level permissions
[ ] Capture ONTAP audit evidence where required (ONTAP FPolicy)
[ ] Document junction path and volume scope
[ ] Isolate analytics volumes from production NFS/SMB workloads if throughput contention is a concern
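
As a starting point for the first two checklist items, the sketch below flags permissive rules in an export policy document shaped like the JSON shown in Step 1. The function name, the hardened example values, and the 10.0.3.0/24 CIDR are illustrative assumptions, not a recommendation for any specific environment.

```python
import json

def find_permissive_rules(policy: dict) -> list[str]:
    """List export-policy settings that should not reach production."""
    findings = []
    for i, rule in enumerate(policy.get("rules", [])):
        clients = [c.get("match") for c in rule.get("clients", [])]
        if "0.0.0.0/0" in clients:
            findings.append(f"rule {i}: clients match 0.0.0.0/0")
        for field in ("ro_rule", "rw_rule", "superuser", "protocols"):
            if "any" in rule.get(field, []):
                findings.append(f"rule {i}: {field} is 'any'")
    return findings

# The troubleshooting policy from Step 1
troubleshooting = json.loads("""
{"rules": [{"clients": [{"match": "0.0.0.0/0"}], "ro_rule": ["any"],
            "rw_rule": ["any"], "superuser": ["any"], "protocols": ["any"]}]}
""")

# A hypothetical hardened variant (values are placeholders)
hardened = {"rules": [{"clients": [{"match": "10.0.3.0/24"}], "ro_rule": ["sys"],
                       "rw_rule": ["never"], "superuser": ["none"],
                       "protocols": ["nfs3"]}]}

for finding in find_permissive_rules(troubleshooting):
    print(finding)
```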

Step 2: strace the mount command

%sh
sudo strace -f -e trace=mount mount -t nfs -o nfsvers=3,nolock 10.0.3.133:/vol1 /mnt/fsxn 2>&1

mount.nfs: trying 10.0.3.133 prog 100003 vers 3 prot TCP port 2049
mount.nfs: trying 10.0.3.133 prog 100005 vers 3 prot UDP port 635
mount("10.0.3.133:/vol1", "/mnt/fsxn", "nfs", ...) = -1 EACCES (Permission denied)
mount.nfs: mount(2): Permission denied

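Key finding: mount.nfs successfully contacts both NFS (port 2049) and mountd (port 635), but the mount() syscall returns EACCES. The failure is reported by the local mount() syscall. Because the kernel NFS client performs its own MOUNT request during mount(), strace alone cannot show whether the kernel or the server originated the denial; the following steps narrow that down.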

TCP/UDP note: The initial reachability check used /dev/tcp, confirming TCP reachability. During the actual mount attempt, mount.nfs tried mountd over UDP as shown in the strace output. This is not a contradiction — NFSv3 mountd may use either transport. For production troubleshooting, use rpcinfo and packet capture to confirm the actual protocol and port mapping.

Step 3: Manual NFS RPC Calls (User-space)

To prove ONTAP is granting access, I performed manual NFS RPC calls using Python sockets:

import socket
import struct  # used when building the RPC packet (construction omitted here)

# mount_rpc_packet: a pre-built ONC RPC CALL message for
# MOUNT (program 100005, version 3, procedure MNT)
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)  # mountd over UDP
sock.settimeout(5)
sock.sendto(mount_rpc_packet, ("10.0.3.133", 635))
response = sock.recv(4096)
sock.close()
# Parse: status=0 (MNT3_OK), file_handle=44 bytes
print("MOUNT RPC: SUCCESS ✅")

# NFS3 FSINFO, GETATTR, READDIRPLUS (request code omitted): all succeed
print("NFS3 FSINFO: SUCCESS ✅")
print("NFS3 GETATTR: SUCCESS ✅")
print("NFS3 READDIRPLUS: SUCCESS ✅")

All NFS operations succeed at user-space level. ONTAP grants full access. The problem is not the server.
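
For readers who want to reproduce the user-space MOUNT call, here is one way the packet could be built and the reply parsed. It is a sketch based on the ONC RPC and MOUNT v3 wire formats, not the exact code used in this validation; the helper names, the AUTH_SYS identity (uid 0, gid 0), and the machine name are assumptions.

```python
import struct

def xdr_opaque(data: bytes) -> bytes:
    """XDR variable-length opaque: 4-byte length, data, zero padding to 4 bytes."""
    pad = (4 - len(data) % 4) % 4
    return struct.pack(">I", len(data)) + data + b"\x00" * pad

def build_mnt_call(xid: int, export_path: str, machine: str = "client") -> bytes:
    """ONC RPC CALL for MOUNT v3, procedure MNT (prog 100005, vers 3, proc 1)."""
    # AUTH_SYS body: stamp, machine name, uid, gid, auxiliary gid count
    cred = struct.pack(">I", 0) + xdr_opaque(machine.encode()) + struct.pack(">III", 0, 0, 0)
    header = struct.pack(">IIIIII", xid, 0, 2, 100005, 3, 1)  # xid, CALL, rpcvers, prog, vers, proc
    auth = struct.pack(">I", 1) + xdr_opaque(cred)             # flavor AUTH_SYS = 1
    verf = struct.pack(">II", 0, 0)                            # AUTH_NONE verifier
    return header + auth + verf + xdr_opaque(export_path.encode())

def parse_mnt_reply(reply: bytes, xid: int) -> tuple[int, bytes]:
    """Return (mountstat3, file_handle) from an accepted MNT reply."""
    rxid, msg_type, reply_stat = struct.unpack_from(">III", reply, 0)
    if rxid != xid or msg_type != 1 or reply_stat != 0:
        raise ValueError("not an accepted reply for this call")
    _flavor, verf_len = struct.unpack_from(">II", reply, 12)
    offset = 20 + verf_len + (4 - verf_len % 4) % 4
    (accept_stat,) = struct.unpack_from(">I", reply, offset)
    if accept_stat != 0:
        raise ValueError(f"RPC accept_stat={accept_stat}")
    (status,) = struct.unpack_from(">I", reply, offset + 4)
    if status != 0:
        return status, b""          # e.g. 13 = MNT3ERR_ACCES
    (fh_len,) = struct.unpack_from(">I", reply, offset + 8)
    return status, reply[offset + 12: offset + 12 + fh_len]

# Hypothetical usage over UDP, matching the transport shown above:
# sock.sendto(build_mnt_call(0x1234, "/vol1"), ("10.0.3.133", 635))
# status, handle = parse_mnt_reply(sock.recv(4096), 0x1234)
```

Checking the returned status, rather than printing success unconditionally, is what turns the call into evidence: status 0 with a non-empty file handle means the server granted the mount request.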

Step 4: tmpfs Mount Test

%sh
sudo mount -t tmpfs tmpfs /tmp/test_mount && echo "SUCCESS" || echo "FAILED"

SUCCESS ✅

The mount() syscall itself is allowed. Of the filesystem types tested, only NFS is blocked.

Step 5: Seccomp Status

%sh
cat /proc/self/status | grep Seccomp

Seccomp:        2
Seccomp_filters:        1

Seccomp: 2 = BPF filter mode active.
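
To capture this as structured evidence, the status fields can be parsed in Python. The sketch below maps the numeric mode to the names documented for /proc/[pid]/status (0 disabled, 1 strict, 2 filter); the function name and return shape are illustrative.

```python
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}

def parse_seccomp(status_text: str) -> dict:
    """Extract Seccomp fields from the contents of /proc/<pid>/status."""
    fields = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in ("Seccomp", "Seccomp_filters"):
            fields[key] = int(value.strip())
    mode = fields.get("Seccomp")
    return {
        "mode": mode,
        "mode_name": SECCOMP_MODES.get(mode, "unknown"),
        "filters": fields.get("Seccomp_filters"),  # None on kernels without this field
    }

sample = "Seccomp:        2\nSeccomp_filters:        1\n"
print(parse_seccomp(sample))

# On the cluster itself:
# with open("/proc/self/status") as f:
#     print(parse_seccomp(f.read()))
```

Note that the result describes the process that reads the file. A notebook Python process and a %sh subprocess may not share the same filter set, so record which context produced the evidence.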

The Conclusion

┌─────────────────────────────────────────────────────────────────┐
│ Evidence Chain:                                                 │
│                                                                 │
│ 1. Network connectivity      → ✅ All NFS ports reachable       │
│ 2. ONTAP export policy       → ✅ 0.0.0.0/0, rw=any, su=any     │
│ 3. NFS RPC (user-space)      → ✅ All operations succeed        │
│ 4. mount() with type="nfs"   → ❌ EACCES                        │
│ 5. mount() with type="tmpfs" → ✅ Success                       │
│ 6. Seccomp                   → Active (BPF filter mode)         │
│                                                                 │
│ Conclusion: The evidence points to a local platform security    │
│ boundary, likely seccomp filtering or an equivalent runtime     │
│ restriction, blocking the NFS mount path.                       │
└─────────────────────────────────────────────────────────────────┘

The error message "access denied by server" is misleading. The mount.nfs program reports the kernel's EACCES as a server-side denial, but the strace output, the successful user-space RPC calls, and the tmpfs control together indicate that the denial is local.

If sharing this finding: This is not a Databricks compatibility verdict. It is a layer-by-layer validation of observed boundaries in one environment (DBR 17.3 LTS, ap-northeast-1). Platform behavior may differ across runtime versions, access modes, and configurations.

Important: Because Databricks does not publicly document this specific syscall/filesystem-type behavior, treat this as validation evidence rather than an official platform statement until confirmed by Databricks Support.

All Mount Options Tested

Options	Result
`-o nfsvers=3,nolock`	access denied
`-o nfsvers=4.1`	access denied
`-o nfsvers=3,nolock,resvport`	access denied
`-o nfsvers=3,nolock,noresvport`	access denied
`-o sec=sys`	access denied
(no options)	access denied
tmpfs	SUCCESS

Evidence Matrix

Layer	Evidence	Result	Interpretation
Network	TCP 2049 / TCP 111 / TCP 635 reachable	✅ Pass	Network path exists between cluster and FSx for ONTAP
ONTAP export	Export policy allows 0.0.0.0/0, rw=any, su=any	✅ Pass	Export policy is not the blocker
NFS server RPC	MOUNT / FSINFO / GETATTR / READDIRPLUS succeed via user-space	✅ Pass	ONTAP grants NFS operations to this client
Local syscall	`mount(type=nfs)` returns EACCES	❌ Fail	Evidence points to a local runtime boundary affecting kernel NFS mount
Local syscall control	`mount(type=tmpfs)` succeeds	✅ Pass	`mount()` syscall is not universally blocked
Runtime security	Seccomp mode 2 observed in the tested process context	Observed	Runtime filtering may restrict NFS-specific mount
Unity Catalog S3	External Location test on S3 AP ARN → AccessDenied	❌ Fail	Session policy does not allow S3 AP ARN pattern
Instance Profile S3	boto3 GetObject on S3 AP → Success	✅ Pass	IAM role itself has correct permissions

showmount -e confirms only that the export is visible through mountd. It does not guarantee that the local runtime allows the kernel NFS mount operation to complete, and it does not validate the file system user identity associated with the S3 Access Point. For S3 AP authorization, record the associated UNIX or Windows identity and verify file-level permissions separately; these are independent authorization paths.

FSx for ONTAP S3 AP Authorization Path

FSx for ONTAP S3 Access Points use a dual-layer authorization model that combines AWS IAM permissions with file system-level permissions:

Layer 1 — S3-side authorization:

IAM identity-based policy (caller's permissions)
S3 Access Point resource policy
VPC endpoint policy (if applicable)
SCP / RCP (if applicable)

Layer 2 — FSx for ONTAP-side authorization:

File system user associated with the access point
UNIX mode-bits / NFSv4 ACLs (for UNIX security style volumes)
Windows ACLs (for NTFS security style volumes)

In the Databricks validation, the failure occurs before Layer 2 — Unity Catalog's generated session policy restricts the assumed role session at the S3 API level, preventing the request from reaching FSx for ONTAP-side authorization. The Instance Profile + boto3 path bypasses Unity Catalog's session policy, allowing both layers to be evaluated normally.

For production, both layers must be configured with least-privilege. A permissive file system user (e.g., root / UID 0) combined with a broad IAM policy creates an overly permissive access path.
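
The two-layer model can be pictured as a short-circuit evaluation. The following is a toy model for reasoning about where a request stops; the class, field names, and check order are assumptions made for illustration and do not describe how AWS or Databricks implement evaluation internally.

```python
from dataclasses import dataclass

@dataclass
class AccessRequest:
    session_policy_allows: bool  # e.g. a UC-generated session policy (True if none applies)
    iam_policy_allows: bool      # caller's identity-based policy
    ap_policy_allows: bool       # S3 Access Point resource policy
    fs_user_allows: bool         # file system user vs. mode bits / ACLs

def evaluate(req: AccessRequest) -> str:
    """Report the first layer that denies the request, or 'allowed'."""
    layer1 = [
        ("session policy", req.session_policy_allows),
        ("IAM identity policy", req.iam_policy_allows),
        ("access point policy", req.ap_policy_allows),
    ]
    for name, allowed in layer1:
        if not allowed:
            return f"denied at Layer 1 ({name})"
    if not req.fs_user_allows:
        return "denied at Layer 2 (file system permissions)"
    return "allowed"

# The two Databricks paths discussed above, expressed in the toy model
uc_path = AccessRequest(False, True, True, True)
instance_profile_path = AccessRequest(True, True, True, True)
print(evaluate(uc_path))
print(evaluate(instance_profile_path))
```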

Approach 4: Instance Profile + boto3

The Setup

Customer-managed VPC workspace, Dedicated cluster with an Instance Profile attached.

IMDS Access

import urllib.request

# Request an IMDSv2 session token (PUT with a TTL header)
req = urllib.request.Request(
    "http://169.254.169.254/latest/api/token",
    headers={"X-aws-ec2-metadata-token-ttl-seconds": "21600"},
    method="PUT"
)
token = urllib.request.urlopen(req, timeout=2).read().decode()
print(f"Token: {token[:20]}...")  # ✅ Success
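
A natural follow-up check is to confirm which IAM role the instance profile exposes. The sketch below wraps the two IMDSv2 requests in small helpers; the helper names are illustrative, while the endpoint paths and header names are the standard IMDSv2 ones.

```python
import urllib.request

IMDS = "http://169.254.169.254/latest"

def token_request(ttl_seconds: int = 21600) -> urllib.request.Request:
    """PUT request that obtains an IMDSv2 session token."""
    return urllib.request.Request(
        f"{IMDS}/api/token",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
        method="PUT",
    )

def metadata_request(token: str, path: str) -> urllib.request.Request:
    """GET request for a metadata path, authenticated with the session token."""
    return urllib.request.Request(
        f"{IMDS}/meta-data/{path}",
        headers={"X-aws-ec2-metadata-token": token},
    )

# On a cluster with an instance profile attached:
# token = urllib.request.urlopen(token_request(), timeout=2).read().decode()
# role = urllib.request.urlopen(
#     metadata_request(token, "iam/security-credentials/"), timeout=2
# ).read().decode()
# print(f"Instance profile role: {role}")
```

Printing the role name rather than any part of the token keeps credentials out of notebook output.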

Regular S3 Access

import boto3
s3 = boto3.client("s3", region_name="ap-northeast-1")
buckets = s3.list_buckets()
print(f"ListBuckets: {len(buckets['Buckets'])} buckets")  # ✅ 58 buckets

FSx for ONTAP S3 AP Access

response = s3.list_objects_v2(
    Bucket="<FSx-S3-AP-alias>",
    MaxKeys=10
)
print(f"Objects: {response['KeyCount']}")  # ✅ Works

This works. Instance Profile credentials bypass Unity Catalog's session policy entirely. boto3 talks directly to the S3 API with the EC2 instance's IAM role.
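
For anything beyond a ten-key smoke test, a paginated listing with explicit timeouts is easier to reason about. This is a sketch: list_keys is a hypothetical helper that accepts any client exposing get_paginator, and the timeout and retry values in the usage comment are illustrative, not tuned recommendations.

```python
def list_keys(s3_client, bucket: str, prefix: str = "", limit: int = 100) -> list[str]:
    """Collect up to `limit` object keys under `prefix` using ListObjectsV2 pagination."""
    keys = []
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])
            if len(keys) >= limit:
                return keys
    return keys

# Hypothetical usage on the driver node:
# import boto3
# from botocore.config import Config
# s3 = boto3.client(
#     "s3",
#     region_name="ap-northeast-1",
#     config=Config(connect_timeout=5, read_timeout=15, retries={"max_attempts": 2}),
# )
# print(list_keys(s3, "<FSx-S3-AP-alias>", limit=10))
```

Short timeouts make a backend connectivity problem fail fast instead of hanging the notebook cell.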

Governance warning
Instance Profile + boto3 is a pragmatic workaround for PoC and controlled experiments. It bypasses Unity Catalog governance, including fine-grained access control, lineage, and centralized data access auditing. Do not treat this as a production lakehouse governance pattern without a separate security and compliance review. Databricks recommends Unity Catalog external locations as the standard governed access mechanism.

Scope note
The Instance Profile + boto3 sample above runs on the driver node only (single-node PoC pattern). Whether the same credential, network path, and concurrency behavior applies to Spark executors in a multi-node cluster requires separate validation.
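
One way to approach that separate validation is to run a small probe inside mapPartitions so each executor creates its own client and reports what it sees. The sketch below was not run as part of this article; probe_partition and make_client are hypothetical names, and the probe reads each object in full, so point it at small test files.

```python
def probe_partition(keys, make_client, bucket):
    """Runs on an executor: GET each key and report success or the error type."""
    client = make_client()  # one client per partition, created on the executor
    for key in keys:
        try:
            body = client.get_object(Bucket=bucket, Key=key)["Body"].read()
            yield (key, "ok", len(body))
        except Exception as exc:  # report per key rather than failing the task
            yield (key, type(exc).__name__, None)

# Hypothetical Spark usage on a multi-node cluster:
# import boto3
# def make_client():
#     return boto3.client("s3", region_name="ap-northeast-1")
# results = (
#     spark.sparkContext.parallelize(keys, numSlices=8)
#     .mapPartitions(lambda it: probe_partition(it, make_client, "<FSx-S3-AP-alias>"))
#     .collect()
# )
```

Collecting the exception type per key separates credential failures from timeouts, which matters because the two point to different layers.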

Approach 5: S3 AP + Instance Profile (Managed VPC with VPC Peering)

The Hypothesis

If Instance Profile + boto3 works on a Customer-managed VPC (Approach 4), does it also work from a Databricks-managed VPC with VPC Peering to the FSx for ONTAP VPC? This would validate whether the S3 Gateway Endpoint in the Databricks-managed VPC can route S3 AP requests to the FSx for ONTAP backend.

The Setup

Databricks-managed VPC (vpc-060209589cbe4c298, CIDR: 10.53.0.0/16)
FSx for ONTAP VPC (vpc-0ae01826f906191af, CIDR: 10.0.0.0/16)
VPC Peering: pcx-02167ddf900a30782 (active)
Route tables: updated in both directions
FSx for ONTAP security group: allows all traffic (0.0.0.0/0)
S3 Gateway Endpoint: vpce-020b59ab4da0b44b8 (full access policy)
Cluster: m5.large × 3, DBR 17.3 LTS, Dedicated mode, Instance Profile attached

The Result

{
  "dns_resolution": {"success": true, "ip": "52.219.151.110"},
  "vpc_peering_443": {"success": false, "result_code": 11},
  "vpc_peering_nfs": {"success": false, "result_code": 11},
  "s3_ap_access": {"success": false, "error": "Read timeout"},
  "imds": {"success": true}
}

Analysis

Layer	Result	Interpretation
DNS resolution	✅	S3 AP alias resolves to S3 endpoint IP (52.219.x.x)
VPC Peering (TCP 443)	❌	FSx for ONTAP management IP unreachable — egress blocked
VPC Peering (NFS 2049)	❌	NFS port unreachable — egress blocked
S3 AP via S3 Gateway Endpoint	❌	Read timeout — S3 service reachable but FSx for ONTAP backend connection fails
IMDS / Instance Profile	✅	Credentials available and valid

Key finding: Even with VPC Peering established, routes configured, and security groups permissive, the Databricks-managed VPC's egress restrictions block connectivity to the FSx for ONTAP backend. The S3 Gateway Endpoint routes requests to the S3 service, but FSx for ONTAP S3 AP requires the S3 service to reach the FSx for ONTAP file system — which is in a different VPC from the Databricks cluster. The S3 service-side routing to the FSx for ONTAP backend is not affected by customer-side VPC Peering.

Important: This result suggests that, in this environment, FSx for ONTAP S3 AP access requires the requesting cluster to be in the same VPC as the FSx for ONTAP file system, or to use a network configuration where the S3 service can reach the FSx for ONTAP backend. VPC Peering between the requester VPC and the FSx for ONTAP VPC does not help because S3 AP requests are routed through the S3 service, not directly to the FSx for ONTAP IP.

Lesson

S3 AP requests do not traverse VPC Peering. They are routed through the S3 service endpoint. For FSx for ONTAP S3 AP to work, the S3 service must be able to reach the FSx for ONTAP file system's internal endpoint. This is handled by AWS internally when the request originates from the same region, but the Databricks-managed VPC's egress restrictions appear to interfere with this path.

Customer-managed VPC (same VPC as FSx for ONTAP) remains the only validated path for Instance Profile + boto3 access to FSx for ONTAP S3 AP from Databricks.

IMDS Access Matrix

Cluster Mode	Workspace Type	IMDS	boto3 S3	boto3 S3 AP
Standard (Shared)	Managed VPC	❌	❌	❌
Dedicated	Managed VPC	❌	❌	❌
Dedicated	Customer VPC	❌	❌	❌
Dedicated + Instance Profile	Managed VPC (VPC Peering)	✅	⚠️	❌
Dedicated + Instance Profile	Customer VPC	✅	✅	✅

Row 4 note: IMDS works and Instance Profile credentials are valid, but S3 AP access times out because the Databricks-managed VPC egress restrictions block FSx for ONTAP backend connectivity. Regular S3 bucket access was not tested with a permissive policy (AccessDenied was due to intentionally scoped IAM policy, not network).

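IMDS is blocked on all configurations except Dedicated mode with an explicitly registered Instance Profile. With an Instance Profile attached, IMDS works on both workspace types, but S3 AP access succeeded only on the Customer-managed VPC workspace.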

Complete Results Summary

#	Approach	Result	Blocker
1	UC External Location + dbutils.fs (without `access_point` field)	❌	Generated session policy did not allow S3 AP ARN
1b	UC External Location + `access_point` field (file-level read)	✅	Top-level ls, head, spark.read with explicit path all work
1c	UC External Location + `access_point` field (subdirectory ls)	❌	Prefix-based ListObjectsV2 still blocked for subdirectories
1d	UC External Location + CREATE TABLE LOCATION	❌	UC_CLOUD_STORAGE_ACCESS_FAILURE during internal validation
2	UC External Location + Spark read (directory)	❌	Same prefix-level access issue
3	NFS mount (Managed VPC, VPC Peering)	❌	Egress blocked (port 2049)
4	NFS mount (Customer VPC, Dedicated)	❌	NFS mount blocked by seccomp by design (confirmed by Databricks Support)
5	boto3 (Managed VPC, no Instance Profile)	❌	IMDS blocked
6	boto3 (Customer VPC, no Instance Profile)	❌	IMDS blocked
7	Instance Profile + boto3 (Customer VPC)	✅	Works (bypasses UC governance)
8	NFS RPC user-space (Customer VPC)	✅	Works but impractical for production
9	No Isolation Shared mode	❌	Legacy access mode; not pursued
10	S3 AP + Instance Profile + boto3 (Managed VPC, VPC Peering)	❌	Managed VPC egress blocks FSx for ONTAP backend connectivity

Governance Impact Summary

Documentation status (Updated 2026-05-26): Databricks Support confirmed that the access_point field was never released as GA and has been removed from documentation. Unity Catalog External Locations do not currently support S3 Access Points as storage targets. The partial success observed is a side effect, not a supported code path. Feature gap reported to UC engineering — no timeline available.

Access path	Governance model	Auditability	Production suitability
Unity Catalog External Location	Centralized UC governance (fine-grained, lineage)	High (if supported)	Preferred, but blocked in this validation
Instance Profile + boto3	EC2 IAM role based	AWS-side logs possible if enabled; UC lineage not captured	PoC only unless separately approved
Kernel NFS mount	Filesystem / OS level	Outside UC governance	Not viable in this validation
User-space NFS RPC	Custom application path	Custom logging required	Experimental only
Athena + FSx for ONTAP S3 AP	IAM / S3 AP / Athena workgroup	AWS-side evidence possible	Best current read-only SQL analytics fit
Bedrock Knowledge Bases + FSx for ONTAP S3 AP	IAM / S3 AP / Bedrock Knowledge Base role / guardrails where used	AWS-side evidence possible	AWS-documented RAG / GenAI path; validated with permission-aware retrieval in related series
Glue / EMR Serverless + FSx for ONTAP S3 AP	IAM / S3 AP / Glue / EMR job roles	AWS-side evidence possible	Validated ETL / Spark path in this broader series where verification-pack evidence is available; validate production write-back semantics separately

AWS-side audit events, such as CloudTrail data events where enabled and applicable, may show S3 API access by the instance profile, but they do not replace Unity Catalog lineage, table-level privileges, or centralized Databricks governance controls.

MLOps Boundary

Using boto3 to read objects from FSx for ONTAP S3 AP does not automatically make the downstream ML workflow governed.

If the data retrieved via Instance Profile + boto3 is used for ML or GenAI:

Register derived datasets in governed storage (Unity Catalog managed location)
Track experiments with MLflow
Register models in Unity Catalog where applicable
Document source data access path (S3 AP alias, prefix, timestamp)
Record whether training data lineage is captured or externalized
Ensure the ML compute uses an access mode compatible with Unity Catalog governance

Models in Unity Catalog provides centralized access control, auditing, lineage, and model discovery across workspaces. If the PoC data path bypasses UC, the model lifecycle should still be governed through the Unity Catalog model registry.

AI / RAG Data Readiness Checklist

If the FSx for ONTAP S3 AP data is intended for AI, RAG, or GenAI pipelines:

[ ] Are documents classified by sensitivity (PHI, PII, financial, internal, public)?
[ ] Are file-level permissions preserved or re-modeled for the AI pipeline?
[ ] Is metadata available for filtering and retrieval (file type, date, owner)?
[ ] Is the freshness requirement defined (real-time, daily, weekly)?
[ ] Is read-only access sufficient, or does the pipeline need write-back?
[ ] Is human review required for generated output before downstream use?
[ ] Is permission-aware retrieval required (user A sees only their authorized documents)?

If permission-aware retrieval is required, define one of:

Enforce at source access path — use per-user or per-group S3 Access Points with scoped file system users
Re-model permissions in metadata index — extract file-level ACLs into a searchable metadata store and filter at query time
Filter retrieval results by user/group claims — apply post-retrieval filtering based on authenticated user identity
Do not proceed until the authorization model is validated and approved by the security owner
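
The third option, filtering retrieval results by user or group claims, can be sketched as follows. The allowed_groups metadata field and the function name are hypothetical; a real pipeline would populate that field from the file-level ACLs and would need its own review.

```python
def filter_by_claims(results: list[dict], user_groups: set[str]) -> list[dict]:
    """Keep only documents whose allowed_groups intersect the caller's groups.

    Documents without permission metadata are dropped (default deny).
    """
    return [
        doc for doc in results
        if user_groups & set(doc.get("allowed_groups", ()))
    ]

retrieved = [
    {"id": 1, "allowed_groups": ["finance"]},
    {"id": 2, "allowed_groups": ["hr"]},
    {"id": 3},  # no permission metadata
]
print([doc["id"] for doc in filter_by_claims(retrieved, {"finance"})])
```

Post-retrieval filtering is the simplest of the three options to add, but the retriever still touches documents the user cannot see, which is why the source-enforced option is stronger for regulated data.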

Instance Profile + boto3 approval requirements (for regulated workloads):

Data owner approval
Security owner approval
Platform owner approval
Compliance reviewer approval (if regulated data involved)
Defined: allowed prefix, allowed operations, logging requirements, expiration date
Approval record location (where the decision is stored)
Review / expiration date (when the approval must be re-evaluated)
Incident escalation contact

For regulated workloads, do not use Instance Profile + boto3 for:

Patient-facing responses or clinical decision support
Financial decision automation
Unreviewed access to regulated datasets
Writeback to source-controlled data locations
Workloads requiring Unity Catalog lineage

Decision Matrix

Requirement	Recommended path today	Notes	Next validation action
SQL query on structured files	Athena + FSx for ONTAP S3 AP (Part 1)	Verified, simple, governed	Scale test with production data sizes
RAG / GenAI over NAS documents	Bedrock Knowledge Bases + FSx for ONTAP S3 AP	AWS-documented tutorial	Validate retrieval accuracy, permission-aware filtering, and sync freshness
ETL pipeline on NAS data	Glue or EMR Serverless + FSx for ONTAP S3 AP	Validated in this broader series where verification-pack evidence is available	Validate throughput impact and production write-back semantics
Serverless file processing	Lambda + FSx for ONTAP S3 AP	AWS-documented tutorial	Validate concurrency and throughput for your workload
Databricks governance with Unity Catalog	Wait for platform support	UC session policy currently blocks S3 AP ARN in my validation	Monitor Databricks support case response
Databricks unstructured data PoC	Dedicated cluster + Instance Profile + boto3	Works, but bypasses UC governance	Validate executor-scale behavior separately
Production Databricks lakehouse tables	Use supported cloud storage (S3 bucket)	Required for Delta write semantics	N/A — use standard pattern
Databricks distributed processing over FSx for ONTAP S3 AP	Not validated yet	Driver-only boto3 success does not prove executor-scale behavior	Test with multi-node cluster and Spark mapPartitions
Enterprise read-only analytics	Athena / Glue / EMR Serverless / FSx for ONTAP S3 AP	Best current fit for AWS-native path	Production workload isolation test
Video streaming from NAS	CloudFront + FSx for ONTAP S3 AP	AWS-documented tutorial	Validate caching and latency for your content

This article does not recommend bypassing Unity Catalog for production governed lakehouse workloads. The Instance Profile + boto3 path is documented because it worked in a controlled validation environment, not because it is the preferred governance model.

Architecture Decision Guidance

Databricks remains the recommended platform for curated lakehouse workloads, governed Delta tables, ML pipelines, and multi-step data engineering. FSx for ONTAP S3 AP should be treated as a source integration boundary that may require staging, validation, or an alternate read path depending on governance requirements.

Use Databricks when:

Data is already in supported object storage (S3 bucket)
Delta Lake write semantics are required (INSERT, MERGE, OPTIMIZE, VACUUM)
Unity Catalog lineage and fine-grained governance are mandatory
Large-scale Spark processing is required
ML/AI workloads need integrated compute

Use AWS-native services + FSx for ONTAP S3 AP when:

The primary requirement is read-only SQL analytics over NAS data → Athena (validated in Part 1)
RAG / GenAI over enterprise documents → Bedrock Knowledge Bases (AWS-documented path)
ETL pipelines reading/transforming NAS data → Glue (validated in this broader series where verification-pack evidence is available)
Spark-scale processing without persistent clusters → EMR Serverless (validated in this broader series where verification-pack evidence is available)
Serverless file processing (thumbnails, text extraction, transcription) → Lambda (AWS-documented path)
Video streaming from NAS → CloudFront (AWS-documented path)
External partner file exchange → Transfer Family (AWS-documented path)
BI and AI-assisted analytics → QuickSight candidate path, typically via Athena or Glue Catalog
Source data copy should be minimized
Workload isolation and governance can be validated with AWS-side controls
Serverless, pay-per-query or pay-per-invocation cost model is preferred

Use controlled boto3 PoC only when:

The workload is exploratory and time-limited
Unity Catalog lineage is not required for the PoC scope
Explicit approval is obtained from data owner, security owner, and platform owner
Compensating controls are defined and documented

FSx for ONTAP Sizing Considerations

Before selecting an analytics engine, validate FSx for ONTAP-side capacity:

Throughput capacity — S3 API throughput is bounded by the FSx for ONTAP file system's provisioned throughput
Expected S3 API request rate — high-frequency small object reads may hit IOPS limits
File count and average object size — large directories with many small files may increase listing latency
Prefix layout — flat vs hierarchical prefix design affects listing performance
NFS/SMB production workload window — analytics queries share throughput with existing file workloads
Snapshot / backup / replication schedule — SnapMirror and backup operations consume throughput
Isolation strategy — consider a dedicated volume or SVM for analytics access to avoid contention
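
A quick lower-bound estimate helps frame the throughput discussion. The sketch below assumes a single full scan bounded only by provisioned throughput; the dataset size, throughput figure, and share left for analytics are made-up inputs, and the estimate ignores caching, per-request overhead, and IOPS limits.

```python
def min_scan_seconds(dataset_gib: float, throughput_mb_per_s: float,
                     analytics_share: float = 1.0) -> float:
    """Lower bound on the time to read the dataset once.

    analytics_share is the fraction of provisioned throughput left for
    analytics after existing NFS/SMB and replication traffic.
    """
    dataset_mb = dataset_gib * 1024  # treats MB/s as MiB/s for a rough estimate
    return dataset_mb / (throughput_mb_per_s * analytics_share)

# Hypothetical inputs: 500 GiB dataset, 512 MB/s file system,
# half of the throughput available to analytics
seconds = min_scan_seconds(500, 512, analytics_share=0.5)
print(f"{seconds:.0f} s (about {seconds / 60:.0f} min)")
```

If the lower bound already exceeds the acceptable query window, a dedicated volume or a staged copy is worth evaluating before choosing an engine.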

Delta Lake production workloads require more than object read access. They require validated behavior for transaction log writes, atomic commit assumptions, concurrent writers, checkpointing, recovery, and lifecycle operations. This article does not validate FSx for ONTAP S3 AP for Delta write-path semantics.

Compensating Controls for Controlled boto3 PoC

If Instance Profile + boto3 is approved for a controlled PoC, define:

Dedicated cluster only (no shared compute)
Single-purpose instance profile (not reused across workloads)
Least-privilege S3 Access Point policy (specific prefix only)
Read-only permissions by default
Allowed prefix list (explicitly documented)
CloudTrail data event coverage where enabled and applicable
Notebook/job owner (named individual)
Approval expiration date
No production writeback
No regulated data unless separately approved with compensating controls
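
The least-privilege and allowed-prefix items can be expressed as an IAM policy document. The generator below is a sketch: the function name, the example ARN, and the prefix are placeholders, and the ARN format follows the access point pattern shown earlier in this article. Validate the resulting policy against your own access point before relying on it.

```python
import json

def read_only_ap_policy(ap_arn: str, prefix: str) -> dict:
    """Build a read-only IAM policy scoped to one prefix of an S3 Access Point."""
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListApprovedPrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": ap_arn,
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                "Sid": "ReadApprovedPrefix",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"{ap_arn}/object/{prefix}/*",
            },
        ],
    }

# Placeholder account ID, access point name, and prefix
policy = read_only_ap_policy(
    "arn:aws:s3:ap-northeast-1:111122223333:accesspoint/example-ap",
    "poc/sensor-data/",
)
print(json.dumps(policy, indent=2))
```

The policy contains no write actions, which matches the "read-only by default" and "no production writeback" controls above.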

Recommended Databricks-side controls:

Restrict instance profile usage to an approved group via workspace admin settings
Enforce dedicated access mode through cluster policy
Restrict cluster creation permissions to approved users
Tag PoC clusters with owner, approval ID, and expiration date
Disable or terminate clusters after approval expiration
Review workspace audit logs for cluster and instance profile usage
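
The tagging and expiration controls lend themselves to a simple automated check. The sketch below evaluates a cluster's tags against an approval window; the tag keys (owner, approval_id, expires_on) and the status strings are hypothetical conventions, and fetching the tags from the workspace is left out.

```python
from datetime import date

REQUIRED_TAGS = ("owner", "approval_id", "expires_on")  # hypothetical tag keys

def poc_cluster_status(tags: dict, today: date) -> str:
    """Classify a PoC cluster from its tags: approved, expired, or non-compliant."""
    missing = [tag for tag in REQUIRED_TAGS if not tags.get(tag)]
    if missing:
        return "non-compliant: missing " + ", ".join(missing)
    if date.fromisoformat(tags["expires_on"]) < today:
        return "expired: terminate cluster"
    return "approved"

example_tags = {"owner": "jane.doe", "approval_id": "POC-001", "expires_on": "2026-06-30"}
print(poc_cluster_status(example_tags, date(2026, 6, 1)))
print(poc_cluster_status(example_tags, date(2026, 7, 1)))
```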

Post-expiration mandatory actions:

Terminate all PoC clusters using the instance profile
Remove the instance profile from workspace admin settings
Archive all evidence (notebooks, logs, results) to approved storage
Update approval record with completion date and findings
Confirm no residual access paths remain (audit workspace settings)

Data Protection Considerations

FSx for ONTAP S3 AP exposes access to file data; it does not replace ONTAP volume-level protection. When analytics workloads access source data via S3 AP, validate:

Snapshot schedule impact — analytics reads do not conflict with scheduled snapshots, but heavy write-back could
SnapMirror replication policy — source volume replication continues regardless of S3 AP access
Backup window vs analytics query window — concurrent backup and analytics may compete for throughput
Write-back isolation — analytics results should be written to a separate volume or prefix, not the source-of-record volume
Recovery behavior — if analytics workload reads during a failover event, understand the RPO/RTO implications

ONTAP S3 NAS bucket data is protected by volume-level SnapMirror asynchronous replication, not by S3-level replication. Plan DR at the volume level.

Discovery Questions for Partners

When a customer asks about Databricks + FSx for ONTAP S3 Access Points:

Are the target files currently stored on NFS, SMB, or both?
Is the workload read-only analytics, unstructured object processing, or Delta write?
Is Unity Catalog lineage mandatory for this use case?
Is this a regulated dataset (PHI, PII, financial)?
Can the PoC run with a dedicated instance profile and limited prefix?
What is the required concurrency and data size?
Is executor-scale Spark processing required, or is driver-only sufficient?
What rollback action is acceptable if FSx for ONTAP throughput impact is observed?
Who approves non-Unity Catalog access paths?
What evidence is required for security review?

Troubleshooting Playbook

When Databricks access to FSx for ONTAP S3 AP fails, isolate one layer at a time:

IAM — Can the instance profile call s3:ListBucket on the S3 AP ARN? Can it call s3:GetObject?
Unity Catalog — Does the same role work for a standard S3 bucket? Does it fail only for the FSx for ONTAP S3 AP ARN?
Network — Is the workspace customer-managed or Databricks-managed? Can the cluster reach NFS TCP 2049? Are route tables and security groups correct?
NFS server — Does showmount -e work? Does the ONTAP export policy allow the client?
Local runtime — Does strace show mount() returning EACCES? Does tmpfs mount succeed? Does user-space NFS RPC succeed?
Workaround — Does Dedicated + Instance Profile + boto3 work? Is bypassing Unity Catalog acceptable for this PoC?

Known Failure Signatures

Symptom	Likely layer	Next step
`no session policy allows s3:ListBucket`	Unity Catalog session policy	Compare regular S3 bucket vs FSx for ONTAP S3 AP with the same role
TCP 2049 unreachable	Network / managed VPC boundary	Test from customer-managed VPC
`mount.nfs: access denied by server` with `mount()` EACCES in strace	Local runtime restriction	Capture strace and `/proc/self/status` seccomp output
boto3 `NoCredentialsError`	Instance profile / IMDS blocked	Verify cluster mode is Dedicated and instance profile is registered
boto3 `ReadTimeoutError` on S3 AP	FSx for ONTAP backend or VPC endpoint routing	Test with a fresh SVM/volume to isolate; check FSx for ONTAP CPU utilization
boto3 `ReadTimeoutError` on S3 AP from Managed VPC (IMDS works)	Managed VPC egress restriction blocking FSx for ONTAP backend	Deploy in Customer-managed VPC (same VPC as FSx for ONTAP); VPC Peering does not resolve this
Driver-only boto3 works, but Spark job fails	Executor credential/network path	Validate credentials, routing, and concurrency from executors separately
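The table above can also be encoded as a small lookup so that log or traceback text is triaged consistently. The sketch below is illustrative only: it covers the four rows that carry a literal error string, matches by plain substring, and the names `Signature`, `SIGNATURES`, and `classify` are hypothetical. It assumes the input is full log or traceback text, since an exception's message alone may not contain its class name.

```python
from typing import NamedTuple, Optional


class Signature(NamedTuple):
    needle: str      # substring to look for in the error text
    layer: str       # likely layer, copied from the table
    next_step: str   # suggested next step, copied from the table


SIGNATURES = (
    Signature("no session policy allows s3:ListBucket",
              "Unity Catalog session policy",
              "Compare regular S3 bucket vs FSx for ONTAP S3 AP with the same role"),
    # The table pairs this message with mount() EACCES in strace output.
    Signature("mount.nfs: access denied by server",
              "Local runtime restriction",
              "Capture strace and /proc/self/status seccomp output"),
    Signature("NoCredentialsError",
              "Instance profile / IMDS blocked",
              "Verify cluster mode is Dedicated and instance profile is registered"),
    Signature("ReadTimeoutError",
              "FSx for ONTAP backend or VPC endpoint routing",
              "Test with a fresh SVM/volume to isolate; check FSx for ONTAP CPU utilization"),
)


def classify(error_text: str) -> Optional[Signature]:
    """Return the first matching signature, or None if nothing matches."""
    lowered = error_text.lower()
    for sig in SIGNATURES:
        if sig.needle.lower() in lowered:
            return sig
    return None
```

A `None` result means the symptom is not in the table and the layer-by-layer playbook still applies.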

What This Article Does Not Conclude

This article does not conclude that Databricks cannot ever support FSx for ONTAP S3 AP. It documents the behavior observed in one validated environment and identifies the platform boundaries that need vendor confirmation or additional support.

What to Tell Stakeholders

Current recommendation:

Use AWS-documented native service paths where they match the workload: Athena for SQL, Bedrock Knowledge Bases for RAG/GenAI, Glue or EMR Serverless for ETL/Spark, Lambda for serverless file processing, CloudFront for streaming, and Transfer Family for partner file exchange
Treat Athena as the validated read-oriented SQL path in Part 1. Treat Glue / EMR Serverless as validated ETL / Spark paths only where corresponding verification-pack evidence is available.
Treat Bedrock Knowledge Bases, Lambda (file processing), CloudFront, and Transfer Family as AWS-documented candidate paths that still require workload-specific validation
Use Databricks + Instance Profile + boto3 only for controlled PoC or unstructured data experiments
Do not position Unity Catalog + FSx for ONTAP S3 AP as production-ready until the session policy supports S3 Access Point ARN patterns
Do not rely on kernel NFS mounts inside Databricks until the platform explicitly supports this path
For Delta Lake production tables, continue to use supported object storage patterns

This validation should be used to guide architecture selection, not to disqualify Databricks from lakehouse workloads.

This validation should not be used to compare AWS-native services and Databricks as competing platforms. AWS-native services (Athena, Bedrock, Glue, EMR Serverless, Lambda) each have AWS-documented integration paths with FSx for ONTAP S3 AP — some validated in this series, others requiring workload-specific validation. Databricks is strong for governed lakehouse, Delta, ML, and production-scale data engineering workloads. The right choice depends on the access pattern, governance requirement, and workload type.

Key contributions of this validation:

Identified the root cause of NFS mount failure (seccomp BPF filter, not server-side denial) via strace analysis
Discovered the access_point field on External Location (via Databricks Support) that partially resolves the session policy
Proved that file-level read under UC governance is possible (1000 rows, schema inference)
Mapped the complete evidence chain: network → ONTAP → NFS RPC → kernel → seccomp
Established that Customer-managed VPC (same VPC as FSx) is the only validated network path
Provided a reusable troubleshooting playbook for future S3 AP integration attempts

Lessons Learned

1. "S3-compatible" ≠ "works everywhere S3 works"

FSx for ONTAP S3 AP is S3-compatible at the API level, but platform security layers (session policies, VPC restrictions) may not recognize the ARN format. S3 API compatibility and platform-integrated S3 governance are different things.

2. Error messages can be misleading

`mount.nfs: access denied by server` made me spend hours checking ONTAP export policies. The real issue was a local runtime restriction. Always use strace when mount fails unexpectedly.

3. Platform security boundaries are not always documented

You discover these boundaries by hitting them. The troubleshooting playbook above can save you time.

4. Customer-managed VPC is essential for storage integration

If you need to connect Databricks to anything beyond standard S3 buckets, deploy in a Customer-managed VPC. Databricks-managed VPC provides limited customer control over cluster networking compared with a customer-managed VPC.

This was further confirmed by testing S3 AP access from a Databricks-managed VPC with VPC Peering: even with VPC Peering active, routes configured, security groups permissive, and an S3 Gateway Endpoint present, S3 AP requests to FSx for ONTAP timed out. The Databricks-managed VPC egress restrictions block not only direct IP communication but also S3 AP backend connectivity.

S3 AP routing note: S3 AP requests are routed through the S3 service endpoint, not directly to the FSx for ONTAP IP. VPC Peering between the requester VPC and the FSx for ONTAP VPC does not help because the S3 service needs internal connectivity to the FSx for ONTAP file system. Customer-managed VPC (same VPC as FSx for ONTAP) is the only validated path.

Databricks Control Plane (SaaS)
        ^
        | NAT Gateway (required outbound)
        |
Databricks Cluster ENI (Customer VPC, private subnet)
        |
        | Private VPC routing (no internet required)
        v
FSx for ONTAP ENI / SVM (same VPC, private subnet)

For the Databricks Support Case Packet, include network evidence: cluster subnet ID, FSx for ONTAP subnet ID, route table IDs, security group rules, and DNS resolution for FSx for ONTAP endpoint.

5. Instance Profile is a pragmatic PoC workaround

Use Instance Profile + boto3 as a controlled PoC workaround. Do not use it as a substitute for Unity Catalog governance without a formal security review.
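A driver-only check of this workaround might look like the sketch below. It assumes an S3 client created elsewhere (for example `boto3.client("s3")` on a Dedicated cluster with a registered instance profile); the function name, the access point ARN, and the sample key are placeholders. The client is passed in so the logic can be exercised without AWS access.

```python
from typing import Any, Dict


def check_s3_ap_access(s3_client: Any, access_point_arn: str, sample_key: str) -> Dict[str, str]:
    """Try ListObjectsV2 and GetObject against the access point; report per-call status."""
    calls = {
        # list_objects_v2 is authorized by s3:ListBucket, get_object by s3:GetObject.
        "s3:ListBucket": lambda: s3_client.list_objects_v2(Bucket=access_point_arn, MaxKeys=1),
        "s3:GetObject": lambda: s3_client.get_object(Bucket=access_point_arn, Key=sample_key),
    }
    results: Dict[str, str] = {}
    for action, call in calls.items():
        try:
            call()
            results[action] = "ok"
        except Exception as exc:
            # botocore's ClientError carries the service error code in exc.response;
            # other exceptions (timeouts, missing credentials) fall back to the class name.
            code = getattr(exc, "response", {}).get("Error", {}).get("Code")
            results[action] = code or type(exc).__name__
    return results
```

The returned dictionary gives one line of evidence per action. It does not replace a governance review, because this path still bypasses Unity Catalog.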

6. Always isolate variables when troubleshooting

When FSx for ONTAP S3 AP wasn't responding, I created a new SVM and volume to isolate the issue. This confirmed the problem was SVM-specific rather than a platform-wide limitation.

7. Negative validation creates value

A failed integration path can still create value when it prevents the wrong production architecture. This validation helps teams avoid assuming S3 API compatibility equals platform governance compatibility, choose the right engine for the right access pattern, and reduce time spent on ambiguous troubleshooting.

Databricks Support Case Packet

If you open a support case with Databricks, include:

Workspace type: Databricks-managed VPC or customer-managed VPC
Cluster access mode and DBR version
IAM role / instance profile configuration
Unity Catalog storage credential and external location configuration
Full AccessDenied error message (including the ARN and "no session policy" text)
S3 AP ARN and alias format
Network test results for NFS ports (TCP 2049, TCP 111, TCP 635)
strace output showing mount() EACCES
/proc/self/status showing seccomp mode
User-space NFS RPC success evidence (if applicable)
Instance Profile boto3 success evidence (if applicable)
showmount -e output (confirms export visibility)
tmpfs mount success evidence (proves mount syscall itself is allowed)
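The seccomp evidence can be captured programmatically. On Linux, `/proc/self/status` contains a `Seccomp:` field whose value is 0 (disabled), 1 (strict), or 2 (filter). The sketch below parses that field from the file's text; the function names are illustrative, and the mode reported in any given Databricks runtime is something to observe, not something this sketch asserts.

```python
from typing import Optional

# Values of the Seccomp field in /proc/<pid>/status on Linux.
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def parse_seccomp_mode(status_text: str) -> Optional[int]:
    """Return the integer after 'Seccomp:' or None if the field is absent."""
    for line in status_text.splitlines():
        # 'Seccomp_filters:' is a different field and is deliberately not matched.
        if line.startswith("Seccomp:"):
            return int(line.split(":", 1)[1].strip())
    return None


def describe_seccomp(status_text: str) -> str:
    """Produce a one-line summary suitable for a support case packet."""
    mode = parse_seccomp_mode(status_text)
    if mode is None:
        return "Seccomp field not present"
    return f"Seccomp mode {mode} ({SECCOMP_MODES.get(mode, 'unknown')})"
```

On the cluster, the input would come from `open("/proc/self/status").read()`; attach the raw file contents alongside the summary line.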

Use Case Fit Matrix

When this article says "validated in this broader series," it refers to evidence captured in the linked verification-pack or related articles, not to Databricks-specific validation in this Part 2 article.

Use case	Best current path	Why
SQL analytics on structured NAS files	Athena + FSx for ONTAP S3 AP	Verified read-oriented path with AWS-side governance controls, serverless
Enterprise IT RAG over documents	Bedrock Knowledge Bases + FSx for ONTAP S3 AP	AWS-documented tutorial; also validated in related series with permission-aware retrieval
ETL / data transformation	Glue or EMR Serverless + FSx for ONTAP S3 AP	Validated in this broader series where verification-pack evidence is available; validate production write-back semantics separately
Serverless file processing (thumbnails, OCR, transcription)	Lambda + FSx for ONTAP S3 AP	AWS-documented tutorial; validate for your workload
Large-scale Spark ETL	EMR Serverless + FSx for ONTAP S3 AP or standard S3 bucket	Validated in this series; Databricks executor-scale not validated on S3 AP
Production Delta Lake tables	Supported object storage (S3 bucket)	Required for Delta write semantics and UC governance
Unstructured data experimentation (Databricks)	Instance Profile + boto3 PoC	Works in driver-only pattern, needs governance review
Video streaming from NAS	CloudFront + FSx for ONTAP S3 AP	AWS-documented tutorial; validate caching, latency, and file size for your content
External partner file exchange	Transfer Family + FSx for ONTAP S3 AP	AWS-documented path; also validated in related series; validate file operation limitations (rename, append, upload size)
Lightweight serverless analytics	DuckDB Lambda + FSx for ONTAP S3 AP	Planned Part 4 validation; candidate for lightweight, low-idle-cost analytics
BI / dashboarding over NAS data	Candidate: QuickSight via Athena or Glue Catalog	AWS positions BI as a candidate use case; validate whether access path is Athena-backed or catalog-mediated

Cost Model Considerations

Engine	Primary cost driver	Best fit
Athena	Data scanned (per TB)	Occasional SQL queries, serverless
Bedrock Knowledge Bases	Model invocation + embedding + retrieval	RAG / GenAI over enterprise documents
Glue	DPU-hours	ETL pipelines, data transformation
Databricks	DBU + cloud compute instance hours	Lakehouse pipelines, ML, Delta workloads
EMR Serverless	vCPU / memory × runtime duration	Spark ETL without persistent clusters
Lambda + DuckDB	Invocation duration × memory	Lightweight serverless analytics, event-driven
CloudFront	Data transfer + requests	Video/media streaming from NAS

Cost comparison is not the focus of this article. Each engine has a fundamentally different pricing model. Databricks provides compute policies to control cluster creation, instance types, auto-termination, and cost-related attributes. For cost optimization, evaluate based on workload pattern (interactive vs batch, frequency, data volume) rather than unit price alone.

Partner / Customer Conversation Guide

If a customer asks whether Databricks can directly process FSx for ONTAP S3 Access Point data:

AWS-native service paths such as Athena, Bedrock Knowledge Bases, Glue, EMR Serverless, Lambda, CloudFront, and Transfer Family have AWS-documented integration patterns with FSx for ONTAP S3 AP. In this series, Athena (Part 1), Glue, and EMR Serverless have been validated; the other paths should be validated per workload, Region, IAM model, FSx for ONTAP-side authorization, and governance requirement.
Databricks Unity Catalog integration requires vendor confirmation for S3 Access Point ARN handling
Instance Profile + boto3 can be used for controlled PoC experiments, but it bypasses Unity Catalog governance and is classified as a legacy data access pattern by Databricks
Production Delta Lake workloads should continue to use supported object storage patterns
Any Databricks integration should be validated per workspace type, cluster mode, runtime version, IAM path, and governance requirement

Next Validation Metrics

Current blocker: Executor-scale validation requires a Customer-managed VPC workspace (same VPC as FSx for ONTAP). The Databricks-managed VPC workspace was tested with VPC Peering and Instance Profile (2026-05-24); S3 AP access timed out due to managed VPC egress restrictions. Creation of a Customer-managed VPC workspace is pending resolution of a Databricks support ticket.

For executor-scale validation (not yet performed):

Object listing latency per executor
Total objects processed across cluster
Per-executor success/failure rate
Throughput per executor
Retry count and S3 API error rate
FSx for ONTAP throughput utilization during distributed access
Cost per processed GB

Driver-only boto3 success is not sufficient for Spark workloads. The next validation should run boto3 calls from executors using mapPartitions and compare credential, routing, latency, and error behavior across workers.

Executor-scale validation should not only test success/failure. It should capture per-executor latency, retry count, error code, and object count so that routing and concurrency behavior can be reviewed.

Benchmark run guidance:

Cold run: at least 1 (first access after cluster start, no metadata cache)
Warm metadata run: at least 1 (after initial listing populates metadata cache)
Repeated run: at least 3 (steady-state measurement)
Report: p50, p90, p95, p99 latency, plus average, min, max, and outliers
Include: object count, average object size, prefix depth, concurrent executor count
Include: FSx for ONTAP throughput utilization during test window
Note: S3 AP via FSx for ONTAP may exhibit metadata warm-up effects and prefix layout sensitivity. Cold vs warm differences should be documented explicitly.
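The reporting line in the guidance above can be produced with a few lines of standard-library Python. This sketch uses the nearest-rank percentile definition, which is one of several common definitions and an assumption here; outlier detection is left out, and the function names are illustrative.

```python
import math
from typing import Dict, Sequence


def nearest_rank(sorted_values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the value at position ceil(pct * n / 100), 1-based."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize_latency(latencies_ms: Sequence[float]) -> Dict[str, float]:
    """Return p50/p90/p95/p99 plus average, min, and max for one run."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    summary = {f"p{p}": nearest_rank(ordered, p) for p in (50, 90, 95, 99)}
    summary["avg"] = sum(ordered) / len(ordered)
    summary["min"] = ordered[0]
    summary["max"] = ordered[-1]
    return summary
```

Summarize cold, warm-metadata, and repeated runs separately, so that warm-up effects are not averaged away.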

Additional FSx for ONTAP metrics to capture:

FSx for ONTAP throughput utilization (% of provisioned capacity)
FSx for ONTAP CPU utilization
Network throughput (inbound/outbound)
S3 API request count by operation (List, Get, Head)
File count per prefix
Average object size
NFS/SMB latency during concurrent S3 API reads (contention indicator)

Expected output format (JSONL per executor):

{"executor_host": "ip-10-0-xx-yy", "partition_id": 3, "operation": "list_objects_v2", "status": "success", "latency_ms": 183, "objects_seen": 100, "error_code": null}
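A per-partition probe that emits records in this shape could be sketched as follows. The function name `probe_partition` and the injected `list_objects` callable are hypothetical. In a real run, the callable would wrap a boto3 `list_objects_v2` call created inside the executor and return the number of objects seen, and the probe would be passed to PySpark's `mapPartitionsWithIndex` so that each record carries its partition ID. This validation has not been performed; the sketch only shows the record-keeping.

```python
import json
import socket
import time
from typing import Callable, Iterable, Iterator


def probe_partition(
    partition_id: int,
    prefixes: Iterable[str],
    list_objects: Callable[[str], int],
) -> Iterator[str]:
    """Yield one JSONL record per prefix, in the format shown above."""
    host = socket.gethostname()
    for prefix in prefixes:
        started = time.perf_counter()
        status, objects_seen, error_code = "success", 0, None
        try:
            objects_seen = list_objects(prefix)
        except Exception as exc:
            # Record the failure instead of failing the whole Spark task.
            status = "failure"
            error_code = type(exc).__name__
        latency_ms = round((time.perf_counter() - started) * 1000)
        yield json.dumps({
            "executor_host": host,
            "partition_id": partition_id,
            "operation": "list_objects_v2",
            "status": status,
            "latency_ms": latency_ms,
            "objects_seen": objects_seen,
            "error_code": error_code,
        })
```

Injecting the listing function keeps the probe testable on a laptop with a stub, and keeps client creation on the executor side, where the credential and routing behavior under test actually occurs.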

Adoption Success Metrics

For a controlled Databricks + FSx for ONTAP S3 AP PoC, define success criteria beyond technical pass/fail:

Baseline metrics (capture before validation):

Average search/access time (minutes) for target documents
Monthly document access count via current path
Current copy pipeline runtime (if applicable)
Current data freshness lag (hours)
Current support ticket count related to data access

PoC outcome metrics:

Number of target datasets evaluated
Number of successful read operations
Number of governance exceptions required
Time to first successful access
Number of support issues raised
Whether the customer selected Athena, Databricks, or another engine after validation
Decision outcome: proceed / adjust / stop
Time saved by early boundary identification (vs discovering in production)

Stop criteria:

No measurable business value after validation period
Governance exception required for production path with no compensating control available
Executor-scale validation fails with unacceptable error rate (define threshold before PoC)
FSx for ONTAP workload impact exceeds approved threshold (e.g., throughput utilization > 80%)
Vendor confirmation indicates unsupported path with no roadmap commitment
Security review rejects the access path without remediation option

Series Evaluation Criteria

Across this series, each engine is evaluated by:

Read-path compatibility
Write-path compatibility
Governance model
Operational impact
Performance evidence
Production readiness gap
Best-fit use case

Well-Architected Mapping

These criteria align with the AWS Well-Architected Data Analytics Lens:

Pillar	Evaluation focus in this series
Security	Governance model, IAM/AP policy, audit evidence, session policy behavior
Reliability	Failure modes, rollback path, support case evidence, DR considerations
Performance Efficiency	Throughput, executor-scale behavior, FSx for ONTAP utilization, latency
Cost Optimization	Engine-specific cost model, idle cost, cost per processed GB
Operational Excellence	Runbook, evidence template, support packet, monitoring

Business Value of Negative Validation

Negative validation is not failure. It is risk reduction.

A failed integration path can still create value when it prevents the wrong production architecture. This validation helps teams:

Avoid assuming S3 API compatibility equals platform governance compatibility
Choose the right engine for the right access pattern (Athena for read-only SQL, Databricks for lakehouse/ML)
Identify early when vendor confirmation is required before committing architecture
Reduce time spent on ambiguous troubleshooting by providing reproducible evidence
Prevent wasted PoC investment by documenting boundaries before production design
Enable informed conversations with vendors, partners, and security reviewers

For enterprise customers, early boundary identification can save weeks of engineering time and prevent costly architecture rework after production deployment.

What's Next

Series index:

Part 1: Athena — Query NAS Data In Place (validated read-oriented path, 9/9 negative tests pass)
Part 2: Databricks (this article) — session policy deep dive
Part 3: Snowflake — LIST Works, SELECT Doesn't (same session policy pattern)
Part 4: DuckDB Lambda — lightweight serverless analytics validation
Part 5: EMR Spark — read-write ETL pipeline (coming soon)
Part 6: Redshift Spectrum — DWH meets NAS data (coming soon)
Part 7: Trino — open-source SQL on NAS data (coming soon)

Open items:

Support cases: Waiting for Databricks response on session policy and NFS mount questions
FUSE NFS client: Investigating whether a user-space NFS client can bypass the runtime restriction

Caution on FUSE/user-space NFS: FUSE or user-space NFS clients should be treated as experimental only. They require separate validation for POSIX semantics, caching behavior, consistency, performance, failure recovery, and vendor supportability. Do not treat user-space NFS RPC success as a production workaround.

References

Related series by the same author (FSx for ONTAP S3 Access Points with other AWS services):

Building an Agentic Access-Aware RAG System with Amazon FSx for NetApp ONTAP, S3 Vectors, and S3 Access Points — Bedrock Knowledge Bases + permission-aware retrieval (GitHub)
FSx for ONTAP S3 Access Points as a Serverless Automation Boundary — AI Data Pipelines, Volume-Level SnapMirror DR, and Capacity Guardrails — Lambda, Bedrock, SageMaker, 17 industry use cases (GitHub)
Smart Routing, Transfer Family Ingestion, and Voice Chat — Permission-Aware RAG v4.2 — Transfer Family + SFTP ingestion for RAG pipeline

ONTAP S3 Multiprotocol vs FSx for ONTAP S3 Access Points:

ONTAP S3 multiprotocol (ONTAP 9.12.1+): S3 NAS bucket model on ONTAP SVM, enabling S3 clients to access NAS data directly on the ONTAP cluster
FSx for ONTAP S3 Access Points: AWS-managed S3 Access Point endpoint attached to FSx for ONTAP volume, integrating with AWS IAM, VPC, and S3-compatible services
Both expose NAS data via S3-style access, but the authorization path, service integration, and operational model differ. This article focuses on FSx for ONTAP S3 Access Points.

This article is part of the "FSx for ONTAP S3 Access Points × Lakehouse Deep Dive" series. All tests were performed in a real AWS environment with FSx for ONTAP (ONTAP 9.17.1, ap-northeast-1) and Databricks (DBR 17.3 LTS, Premium tier) in May 2026.

Scope reminder: This article documents observed behavior in one validated environment. It does not validate production readiness, distributed executor-scale processing, or all Databricks runtime versions. Terminology uses "observed in this environment" rather than "unsupported" or "incompatible" — platform behavior may change with future updates.

Future updates: If Databricks platform behavior changes or vendor confirmation becomes available, this article should be updated with the new validation result rather than treated as a permanent compatibility statement.

Disclaimer: This article is an independent validation report and does not represent Databricks, AWS, or NetApp official guidance. Product behavior, support status, and platform capabilities may change. Always validate in your own environment and consult vendor documentation and support channels.

What Does a Databricks Consulting Partner Actually Do? (An Enterprise Buyer's Guide)

Lucy — Wed, 20 May 2026 09:26:49 +0000

You've probably sat through at least one vendor call where someone said "end-to-end Databricks implementation" three times in ten minutes and still left with no idea what they'd actually do after signing.

That's the problem with how most Databricks consulting services are sold. The language is polished. The decks look great. But the specifics? Suspiciously vague.

So let's just say the quiet part out loud: here's what a real partner does, week by week, and what separates a genuinely good one from a well-branded generalist.

The 4 Things a Databricks Partner Is Actually Responsible For

1. Architecture First, Not Notebooks First

The first red flag? A partner who opens a Databricks workspace before they've audited your current data estate.

A good one starts by understanding what you already have: your sources, your pipelines, your governance gaps, and where money is quietly leaking. Only then do they design an environment that fits your workloads.

In practice, that means:

Choosing the right cloud (AWS, Azure, or GCP) based on your existing infrastructure, not on what the partner is most comfortable with
Designing a medallion architecture (Bronze → Silver → Gold) with your actual data volumes in mind
Standing up Unity Catalog for governance from day one, not as an afterthought six months later when things get messy

2. Pipeline Engineering: The Real Heavy Lifting

Most enterprise data sits across five different places: a legacy ERP, a couple of SaaS tools, some flat files someone's been emailing around, and a Snowflake instance that half the team has forgotten the password to.

A Databricks partner consolidates this: building Delta Live Tables pipelines or custom Spark jobs that handle schema evolution, bad data, and SLA expectations. Not "it works on my machine" pipelines. Production-grade ones.

If you're coming from Hadoop or an aging data warehouse, this is where 90% of the real effort lives. It's also where you'll quickly learn whether your partner has actually done this before or just watched the conference talk.

3. Cost and Performance: Ongoing, Not Optional

Here's something vendors rarely lead with: Databricks compute costs can spiral fast if nobody's actively managing them.

A partner worth keeping around puts in:

Auto-scaling cluster policies so you're not paying for idle compute at 2am
Photon engine tuning for SQL-heavy workloads
Cost dashboards that map spend to actual business units, so finance stops asking you to explain the cloud bill

This isn't a one-time setup. It's a habit. If a partner treats it as a checkbox, your AWS invoice will tell you eventually.

4. ML and AI Enablement: When You're Ready to Go Beyond Dashboards

A lot of enterprise teams reach a point where SQL dashboards aren't enough. They want predictions, recommendations, and anomaly detection: actual ML in production.

A Databricks partner with real ML capability sets up MLflow for experiment tracking, builds feature pipelines through Feature Store, and helps your data science team stop rebuilding infrastructure every time they want to ship a model.

This is genuinely where the Databricks ecosystem shines, and where the right partner can save months of engineering time.

How to Actually Vet a Databricks Partner (Beyond the Sales Deck)

Most of this won't be on their website. You have to ask.

Check for Databricks certification at the engineer level, not just a partner tier badge. Certified Data Engineer Associate or Professional means someone on their team has passed a hands-on technical exam. That's meaningful.

Ask for vertical-specific references. A partner who's built lakehouse pipelines for a D2C brand thinks about schema design very differently than one who's only done banking compliance reporting. Generic case studies are a yellow flag.

Pin down the post-go-live model. Ask: "What does month three with your team look like?" If the answer is vague or pivots back to the onboarding process, they're not thinking past the implementation phase.

Confirm you own the code. Sounds obvious. Isn't always. Any partner who builds undocumented pipelines or ties you to proprietary tooling is creating dependency, not capability. Get this in writing.

Timing Matters More Than Most People Think

The best moment to bring in a Databricks partner is before your data team has built workarounds they're now defending as architecture.

Before ad-hoc notebooks become your production pipeline. Before cluster policies are an afterthought. Before your engineers are spending more time firefighting than building.

If AI and ML use cases are on your roadmap alongside the data modernization work (and they probably should be), it's worth reading why mid-market enterprises are moving on AI consulting partnerships before 2027. The timelines are more connected than most teams realize.

One Last Thing: Good Partners Ask Uncomfortable Questions

The best Databricks consulting services engagement you'll ever have won't start with a proposal. It'll start with questions that make you think.

Things like:

"What does 'data-ready' actually mean for your business in 12 months?"
"Who currently owns data quality decisions and what happens when something breaks?"
"What's the real blocker for your team right now: skills, tooling, or architecture?"

If a vendor skips all of that and jumps to pricing, pay attention to that instinct telling you something's off.

A grounded look at what structured Databricks consulting services actually cover (certifications, engagement models, and specific deliverables) is a solid benchmark before your next vendor call.

Evaluating Databricks partners? Drop the questions you're struggling to get straight answers on in the comments. Happy to help you cut through the noise.

What Are Machine Learning Models? Types - Databricks

Jose Francisco Bustamante Ocampo — Sat, 16 May 2026 15:38:55 +0000

TL;DR: Breaking AI news from Google News: Machine Learning (IT).

What Happened

📰 Google News: Machine Learning (IT) is reporting on this story. This is an AI development worth watching closely.

Why It Matters

This story could have significant implications for the global community following AI trends.

Key Takeaways

📌 Reported by Google News: Machine Learning (IT)
📌 Category: ai

🤖 Automatically published by Global Feed Bot

Lakebase, Meet PDB: The "Third-Generation" Database Oracle Shipped in 2013

Rick Houlihan — Mon, 11 May 2026 19:32:27 +0000

By Rick Houlihan & Patrick Meredith

Databricks named the right problem. Their answer is a credible execution of an idea Oracle Multitenant solved a decade earlier — and as it turns out, the gap they think they've found in Oracle was only one PL/SQL package away from closing.

The Pitch That Started This

A colleague forwarded me the Databricks blog post the other day. Opening line:

"In our previous blog, we introduced Lakebase, the third-generation database architecture that fundamentally separates storage and compute."

— Databricks, "How agentic software development will change databases"

So, like what Oracle did 12 years ago.

I'm being a little snide. Bear with me — there's a real article underneath. The blog is a thoughtful read about how AI agents are changing database workloads, and most of the diagnosis is right. Their telemetry is interesting:

"In Databricks's Lakebase service, AI agents now create roughly 4x more databases than human users."
"[O]n average, each database project has ~10 branches and some databases with nested branches reaching depths of over 500 iterations…"
"[F]or about half of these agentic applications, the database compute lifetime is less than 10 seconds."

That last number is real. Agents don't behave like humans. They generate variants by the dozen, run them in parallel, evaluate against an eval set, keep the winner, throw away the losers. Evolutionary development. The economics break down completely on a database that costs $200/month per instance with a five-minute provisioning cycle.

So Databricks is right about the problem. They're right that databases need a branching primitive. They're right that storage and compute need to scale independently. They're right that the always-on cost floor doesn't survive contact with agents.

This article is not about whether they're wrong on the diagnosis.

It's about whether their answer is novel — and what the architecture-correct version looks like. Because Oracle has been shipping the same primitive in the engine since July 2013, and a small Python + PL/SQL wrapper is all that separates it from the developer experience Databricks just announced.

Patrick and I thought it was worth writing this down.

What Lakebase Actually Is

Spoiler: it's Neon.

Databricks announced its agreement to acquire Neon on May 14, 2025. The press release didn't disclose a price (industry reporting put it at roughly $1 billion), but it did volunteer a useful telemetry data point: "over 80 percent of the databases provisioned on Neon were created automatically by AI agents rather than by humans." That number is also the reason this acquisition happened — Neon, founded in 2021 by Postgres committers, had built a serverless Postgres architecture that AI agents could actually afford to use: stateless compute nodes, a Paxos-based safekeeper quorum holding WAL, and a pageserver materializing pages on demand from object storage. Branches were stamped as metadata pointers at a moment in WAL history; copy-on-write at the storage layer made divergence cheap.

That architecture is good engineering. It's also exactly what Databricks now ships as Lakebase. Their own architecture deep-dive opens with:

"In the lakebase architecture, your compute is stateless. It does not rely on a local data directory. Instead, it streams WAL to a Paxos-based quorum of safekeepers."

— Databricks, "How lakebase architecture delivers 5x faster Postgres writes"

The same post describes how, when Postgres compute requests a page from storage, the pageserver "reconstructs it by finding the most recent materialized image of that page and replaying any WAL deltas on top." If you've read Neon's published architecture overview, this is familiar vocabulary — stateless compute → safekeepers → pageserver → object storage — because it is Neon's architecture. Lakebase is Neon with a Databricks brand on top.

To be clear: that's not a problem. Neon is good engineering. Acquiring it and integrating it with the lakehouse is a perfectly defensible product move — buying a four-year-old startup whose technology already solves the agent-economics problem is faster than building one yourself. Nobody should be mad about an acquisition.

The problem is the next thing Databricks did, which was call a four-year-old Postgres-branching architecture "the third-generation database architecture that fundamentally separates storage and compute." That's a marketing claim, not an architectural one, and it has two specific issues. First, "third generation" implies a chronology — first generation was monolithic, second was something, this is the third — and Databricks has never been particularly clear about what the second generation was, which is convenient because any honest answer would include systems older than Lakebase that already do what Lakebase does. Second, the "fundamentally separates storage and compute" phrasing treats compute/storage separation as a 2025 innovation, which is awkward when Snowflake shipped that architecture commercially in 2014 and Oracle shipped a multitenant variant of it in July 2013.

"Third generation" sells better than "we acquired a 2021 startup six months ago, here's what they built." It also doesn't survive a history check.

That's the next section.

The "Third-Generation" Sleight of Hand

Same Databricks blog post — "A New Era of Databases: Lakebase," June 12, 2025 — one "Database Architecture Evolution" section, three generations laid out in sequence.

Generation 1 — the monoliths:

"Examples: MySQL, Postgres, classic Oracle"

"Database systems started as absolute monoliths."

Generation 2 — proprietary loose coupling:

"Examples: Aurora, Oracle Exadata"

"As cloud infrastructure improved, vendors physically separated storage from compute, moving storage into proprietary backend tiers."

Same Oracle. Two generations. One page apart. Pick one.

I'll be charitable and assume the intended argument was "early Oracle was a monolith, modern Oracle isn't." Fine. Then "modern" deserves a timeline.

Year	System	What was separated
2001	Oracle Real Application Clusters (RAC)	Multiple compute nodes against a single shared SAN/NAS storage substrate (Oracle 9i)
2008	Oracle Exadata v1	Database servers vs. intelligent storage cells with predicate offload (Smart Scan), GA September 2008
2010	Google Dremel / BigQuery	Disaggregated storage and compute, columnar — VLDB 2010 paper
July 1, 2013	Oracle Database 12c / Multitenant	`CREATE PLUGGABLE DATABASE … FROM … SNAPSHOT COPY` ships in the engine
2014	Snowflake (GA)	Three-layer cloud-native: storage / virtual warehouses / cloud services
Nov 2014 / Jul 2015	Amazon Aurora	Compute decoupled from a 6-way replicated storage layer across 3 AZs (preview Nov 2014, GA July 2015)
2021	Neon (founded)	Postgres-specific WAL-level disaggregation with branching
May 14, 2025	Lakebase = Databricks acquires Neon	Neon's architecture wrapped around open lake storage

Storage and compute have been separated in production databases for 25 years. Across two paradigms, five vendors, and at minimum seven shipping systems before Lakebase showed up. "Third generation" isn't an architectural claim. It's a marketing label that requires the reader to forget about Oracle RAC, Exadata, Dremel, Multitenant, Snowflake, Aurora, and Neon in roughly that order.

So what's actually new in Lakebase? The same blog is honest about this if you read past the generation label:

"Like Gen 2, it separates compute from storage, but with a critical difference: both the storage infrastructure and the data formats are completely open."

Translation: Gen 2 already separated storage from compute. Their own text concedes the point. The Gen 3 differentiator they're actually claiming is open data formats. We'll dismantle that claim in Section 10 — short version, "open formats" turns out to do less work than the marketing suggests once you ask which formats, governed by whom, queryable how. But file the claim for now.

The other thing the launch blog flags as Gen 3 distinctive is branching:

"Databases can be branched and cloned the way developers branch code."

Branching as a developer-experience primitive is a fair thing to call out — it genuinely changes how AI agents and dev workflows interact with databases, and we conceded that point in Section 1. Branching as a database-engine primitive, though, has shipped in Oracle Multitenant since July 1, 2013, with documented syntax, multiple supported storage substrates, and a hard limit four to eight times higher than Lakebase's. Which is the next section.

"Third-generation database architecture? We're on our fifth." - Patrick Meredith

PDB Snapshot Copy: The Branching Primitive Oracle Has Shipped Since 2013

The syntax is one statement:

CREATE PLUGGABLE DATABASE my_experiment_branch
  FROM base_experiment_pdb
  SNAPSHOT COPY;

The Oracle 19c SQL Reference describes what happens underneath:

"The SNAPSHOT COPY clause instructs the database to clone the source PDB using storage snapshots. This reduces the time required to create the clone because the database does not need to make a complete copy of the source data files."

— Oracle Database 19c SQL Language Reference: CREATE PLUGGABLE DATABASE

What "storage snapshots" means depends on the substrate. The same reference is explicit: with CLONEDB=FALSE, "the underlying file system for the source PDB's files must support storage snapshots. Such file systems include Oracle Automatic Storage Management Cluster File System (Oracle ACFS) and Direct NFS Client storage." With CLONEDB=TRUE, "the underlying file system for the source PDB's files can be any local file system, network file system (NFS), or clustered file system that has Direct NFS enabled. However, the source PDB must remain in open read-only mode as long as any clones exist."

So:

Storage substrate	Snapshot mechanism	Notes
Oracle ACFS	Copy-on-write storage snapshots	`CLONEDB=FALSE` path
Direct NFS Client (dNFS)	Copy-on-write storage snapshots on snapshot-capable NFS array	`CLONEDB=FALSE` path
Exadata sparse disk groups	Copy-on-write	Source PDB must be read-only
Standard FS + `CLONEDB=TRUE`	dNFS sparse files over NFS	Source PDB must remain open read-only while clones exist
Exascale (23ai+)	Redirect-on-write	"created quickly, consume little storage space upon initial creation, and can be created in practically unlimited numbers"

Note the precision on "redirect-on-write" — that's Oracle's official term only for Exascale snapshots in 23ai+. Older substrates use copy-on-write semantics. Per the Exadata Database Service on Exascale Infrastructure documentation: "These PDB snapshots leverage Exascale redirect-on-write technology so that they are created quickly, consume little storage space upon initial creation, and can be created in practically unlimited numbers." The distinction matters if you're going to argue with someone about it.
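For readers who want the copy-on-write versus redirect-on-write distinction in concrete terms, here is a toy model of the two generic techniques. It is a sketch of the textbook behavior only; the class names and block layout are invented, and nothing here reflects Oracle's actual on-disk implementation.

```python
class CowVolume:
    """Copy-on-write: preserve the old block for the snapshot, then overwrite in place."""

    def __init__(self, blocks):
        self.live = list(blocks)
        self.snapshot_saved = {}   # block index -> preserved pre-write content
        self.physical_writes = 0

    def write(self, idx, data):
        if idx not in self.snapshot_saved:
            self.snapshot_saved[idx] = self.live[idx]  # copy the old block aside first
            self.physical_writes += 1
        self.live[idx] = data                          # then overwrite in place
        self.physical_writes += 1

    def read_snapshot(self, idx):
        return self.snapshot_saved.get(idx, self.live[idx])


class RowVolume:
    """Redirect-on-write: new data lands in a fresh block; only a pointer moves."""

    def __init__(self, blocks):
        self.store = list(blocks)                   # append-only physical blocks
        self.live_map = list(range(len(blocks)))    # logical index -> physical index
        self.snapshot_map = list(self.live_map)     # frozen at snapshot time
        self.physical_writes = 0

    def write(self, idx, data):
        self.store.append(data)                     # the write goes to a new location
        self.live_map[idx] = len(self.store) - 1    # redirect the live pointer
        self.physical_writes += 1

    def read_live(self, idx):
        return self.store[self.live_map[idx]]

    def read_snapshot(self, idx):
        return self.store[self.snapshot_map[idx]]


blocks = ["a", "b", "c"]
cow, row = CowVolume(blocks), RowVolume(blocks)
cow.write(1, "B")
row.write(1, "B")
```

Both snapshots still read the old block, but the first overwrite costs two physical writes under copy-on-write and one under redirect-on-write. That write amplification is the practical reason the terminology is worth getting right.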

Sibling features in the Multitenant family:

PDB Snapshot Carousel (introduced in 18c, not 19c — common citation error). Per oracle-base.com: "Oracle 18c introduced the concept of a snapshot carousel, which is a series of point-in-time copies, or snapshots, of a PDB." Default 8 snapshots, hard cap at 8 via MAX_PDB_SNAPSHOTS. Oldest is overwritten when full. Useful for short-horizon point-in-time recovery without the overhead of full backups.
Refreshable Clones. Physically full copies with incremental redo apply. Different beast from snapshot copies (full storage cost, but ongoing sync from source). Convertible one-way to a regular PDB.
PDB density. Up to 4098 PDBs per CDB on Enterprise Edition with Multitenant licensing — the MAX_PDBS reference lists possible values of 5, 254, or 4098 by edition (Standard/Express, Standard Edition 2, Enterprise Edition respectively).
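The carousel's "oldest is overwritten when full" behavior is a fixed-size ring buffer. A minimal sketch, assuming nothing beyond the cap of 8 described above (the snapshot names are made up):

```python
from collections import deque

MAX_PDB_SNAPSHOTS = 8  # carousel cap described above

carousel = deque(maxlen=MAX_PDB_SNAPSHOTS)
for n in range(1, 11):              # take ten snapshots in a row
    carousel.append(f"snap_{n:02d}")

oldest, newest = carousel[0], carousel[-1]
```

After ten snapshots the first two have been pushed out, which is why the carousel suits short-horizon recovery rather than long-term retention.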

Now compare ceilings:

Platform	Branch limit	Branch depth	Cross-region
AWS Aurora	15 copy-on-write clones per source; 16th becomes a full copy	No explicit depth ceiling, but each level re-consumes the 15 budget	"You can't create a clone in a different AWS Region from the source Aurora DB cluster"
Lakebase (Databricks doc)	500 per project; only 10 unarchived (active) at once	Hundreds nested (per their telemetry)	Per region
Oracle Multitenant	Up to 4098 PDBs per CDB	No documented depth limit	RAC + Data Guard, cross-region via Active Data Guard

Lakebase's 500-per-project ceiling is generous compared to Aurora's 15. Oracle's 4098 is roughly eight times Lakebase's 500. And Lakebase has a second hard cap on top of the total: only 10 branches can be unarchived (active) at once. Oracle has no equivalent active cap; you tune branch density via Resource Manager based on your workload, which is the next section.

This primitive shipped on July 1, 2013, in Oracle Database 12c. Twelve years before Lakebase. In the database engine, not in a wrapper. With a single SQL statement, documented in the official SQL Language Reference. There is no Postgres extension here. There is no separate page server, no Paxos quorum, no $1B acquisition. It's just CREATE PLUGGABLE DATABASE … SNAPSHOT COPY, and it has been since before the series finale of Breaking Bad.

The Compute Story Most People Get Wrong

A note on this section: the structural argument here came from Patrick during a Slack thread when he challenged me on the scale-to-zero comparison. I had it wrong initially. Here's the correct read, in his voice.

The naive comparison says Lakebase wins on scale-to-zero because branches scale individually to zero compute when idle. Oracle, the story goes, is "always on" — fixed ECPUs allocated to the ADB instance, multiple PDBs sharing the pool, no per-branch zero-cost dormancy.

That comparison gets the shape right and the conclusion wrong.

Yes, in Autonomous Database Serverless, ECPUs are allocated at the instance level, not per PDB. Yes, Snapshot Copy PDB branches inside an ADB share that pool. The naive read says: "uh-oh, no isolation, abandoned branches will eat compute." The correct read is: abandoned branches in a shared pool consume nothing by construction — because they aren't reserving anything.

Walk through the mechanics:

Closed PDBs consume zero CPU and zero shadow processes. ALTER PLUGGABLE DATABASE foo CLOSE IMMEDIATE; and the branch is dormant. The 26ai SQL Reference describes the semantics: "the PDB equivalent of the SQL*Plus SHUTDOWN command with the immediate mode." Metadata stays in the dictionary; nothing else stays resident.
Idle open PDBs consume near-zero. Just metadata pages.
Active PDBs draw from the shared pool. That pool auto-scales: per the Oracle docs, "with compute auto scaling enabled the database can use up to three times more CPU and IO resources than specified by the number of ECPUs." You pay for the burst when it happens, not when it doesn't.
Resource Manager governs the priority. CPU shares, MAX_IOPS, MAX_MBPS, sessions, parallel servers, per-PDB SGA_TARGET and PGA_AGGREGATE_LIMIT. You decide which branches get more pool when contended.
V$PDBS exposes each branch's open mode and size, and V$RSRCPDBMETRIC exposes per-branch CPU and I/O consumption, so a supervisor process can watch and auto-suspend.
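The "watch and auto-suspend" step can be sketched as a pure decision function. This is an illustration, not code from any Oracle tool: `BranchStatus`, `branches_to_close`, and the `last_activity` field are hypothetical, and the supervisor is assumed to fill them from V$PDBS plus its own bookkeeping.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class BranchStatus:
    name: str
    open_mode: str            # e.g. "READ WRITE" or "MOUNTED", as V$PDBS reports it
    last_activity: datetime   # hypothetical: tracked by the supervisor itself


def branches_to_close(statuses, now, idle_after=timedelta(minutes=60), protected=()):
    """Return names of open, unprotected branches idle for at least `idle_after`."""
    return [
        s.name
        for s in statuses
        if s.open_mode == "READ WRITE"
        and s.name not in protected
        and now - s.last_activity >= idle_after
    ]


def close_statements(names):
    # Same statement form as quoted above; names must already be validated identifiers.
    return [f"ALTER PLUGGABLE DATABASE {n} CLOSE IMMEDIATE" for n in names]


now = datetime(2026, 6, 1, 12, 0)
statuses = [
    BranchStatus("AGENT_A", "READ WRITE", now - timedelta(hours=2)),
    BranchStatus("AGENT_B", "READ WRITE", now - timedelta(minutes=5)),
    BranchStatus("AGENT_C", "MOUNTED", now - timedelta(days=1)),        # already closed
    BranchStatus("GOLDEN_MASTER", "READ WRITE", now - timedelta(days=1)),
]
idle = branches_to_close(statuses, now, protected=("GOLDEN_MASTER",))
ddl = close_statements(idle)
```

Keeping the decision separate from the database call makes the policy easy to test without a running instance.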

So what's the real difference? Lakebase gives you per-database scale-to-zero, with cold-start latency on resume. Oracle gives you a shared elastic pool with no cold start.

For an agentic workflow, where the supervisor might wake an "abandoned" branch tomorrow to revisit a hypothesis it shelved today, the no-cold-start property matters. The branch has been consuming nothing; the moment it gets a connection, it's responsive within milliseconds because the compute pool is already warm. Lakebase, by design, has to spin compute back up.

Which means the elasticity scoreboard most people read off the spec sheet — "Lakebase: scale-to-zero ✅ / Oracle: shared pool ❌" — is solving the same problem two different ways and pretending one wins. Different shape. Same economics for abandoned experiments. Faster wakeup on Oracle when the agent comes back.

Sharing compute between PDBs isn't a bug. It means abandoned branches aren't wasting compute, period.

Or as I put it in Slack when this came up: "What we want is exactly what we already have. The compute is scaled. Abandoned branches contribute nothing." That's the architecture.

— Patrick

The Hard Limits

Side-by-side, with citations on every claim:

Capability	Lakebase	Oracle Multitenant + ADB
Total branches	500 / project (Databricks doc)	Up to 4098 / CDB (MAX_PDBS)
Active branches	10 (hard cap)	No hard cap; tuned via Resource Manager
Branch creation speed	Instant (metadata + COW)	Near-instant on snapshot-capable storage
Cold-start on resume	Sub-second to multi-second	None — shared pool
ACID	Postgres MVCC	Full ACID, RAC, Active Data Guard
Failover behavior	Postgres-standard (kills in-flight)	Transparent Application Continuity — in-flight transaction replay
Vector search	Postgres extension	In-engine, optimized by 40-year-old CBO
JSON	jsonb (sequential traversal)	OSON binary, hash-indexed O(1) field access
Graph	Postgres extension	SQL/PGQ, in-engine
Cross-modal queries (vector + JSON + graph + relational)	Limited by extension boundaries	Single transaction, single query plan
Open data format	"Postgres page on S3" (Postgres-only readable)	OSON + Iceberg + Parquet + Mongo wire + native SQL
Mongo wire compatibility	None	Yes (Oracle MongoDB API)

Lakebase wins on developer-experience polish today. The branching UX is wired into the product, the CLI is published, the dashboard renders branch trees. Credit where due — that's a real product investment.

Oracle wins on every limit that matters once you stop counting GitHub stars. Density (4098 vs 500). Active concurrency (no cap vs 10). ACID. Failover that doesn't kill your transactions. Vector + JSON + graph + spatial + relational in one query plan optimized by 40 years of CBO development. Mongo wire compatibility, for the developers who already wrote against MongoDB and don't want to rewrite their app to evaluate a database.

The DX gap is real. It's also the easiest gap to close, which is the next section.

The DX Gap, And Why It's Trivial to Close

Patrick said it best in the original Slack thread: "We probably should develop a lightweight external API too. That should be extremely simple — it's all external to the database."

He was right and he's already shipped it.

The DX gap is real. There is no pdb branch my-experiment command in stock Oracle. Lakebase has a polished branching UX with a published CLI, a dashboard, and git-shaped semantics. We're not going to pretend otherwise.

But this is a wrapper-shaped problem, not a kernel-shaped problem. Patrick built the wrapper:

pmeredit/pdb-branch — "a small multi-language library over a shared PL/SQL package for making Oracle PDB snapshot copies feel like cheap database branches for agentic workflow experiments."

Python, Node.js, Rust, and Java bindings, plus a Rust-built pdb CLI. Releases alongside this article.

The architecture is small enough to fit on a napkin:

PDB_BRANCH PL/SQL package — installed and upgraded automatically by the language binding at startup. Wraps CREATE PLUGGABLE DATABASE … SNAPSHOT COPY with idempotent lifecycle DDL.
Three control tables in CDB$ROOT:
- PDB_BRANCH_BRANCHES — branch registry (name, parent, state, expiration, score)
- PDB_BRANCH_EVENTS — audit log of branch lifecycle events
- PDB_BRANCH_PROFILES — branch-to-Resource-Manager-profile mapping
BranchClient wrappers in four languages — Python over python-oracledb, Node.js over oracledb, Rust over the ODPI-C-based oracle crate (with a pure-Rust oracle-rs path for non-SYSDBA work), and Java. One PL/SQL contract, four idiomatic surfaces.
A pdb Rust CLI — bin/pdb wraps the Rust binding so callers don't need to know Cargo's target/ layout. git branch-shaped commands, .pdbprofile TOML config, and per-flag environment-variable overrides.
Optional Resource Manager profiles: PDB_BRANCH_ACTIVE, PDB_BRANCH_IDLE, PDB_BRANCH_BACKGROUND.

Two ways to drive it. The library surface (Python shown; Node/Rust/Java are equivalents):

from pdb_branch import BranchClient

client = BranchClient(connection)  # auto-installs/upgrades PL/SQL package

client.create_branch(
    "AGENT_RAG_042",
    from_pdb="GOLDEN_MASTER",
    notes="try smaller chunk size and rerank before answer synthesis",
)
client.record_score("AGENT_RAG_042", 0.91, notes="eval: qa_regression_v3")
client.promote("AGENT_RAG_042", notes="winner for current retrieval policy")
client.cleanup(close_idle_after_minutes=60, drop_expired=True)

Or, at the shell, the same workflow via the pdb CLI:

bin/pdb init --dsn localhost:1521/FREE --user sys --password ... --from FREEPDB1
bin/pdb branch AGENT_RAG_042 --notes "try smaller chunk size and rerank"
bin/pdb score   AGENT_RAG_042 0.91 --notes "eval: qa_regression_v3"
bin/pdb promote AGENT_RAG_042
bin/pdb branch -d AGENT_RAG_042

bin/pdb init writes a .pdbprofile so the daily commands stay short. The CLI also accepts environment-variable overrides and flag overrides — flags beat env vars beat .pdbprofile beat local defaults.
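The precedence chain is easy to model with the standard library. This sketch is not the CLI's implementation (which is Rust), and the setting names `dsn`, `user`, and `from_pdb` are assumptions chosen to mirror the flags shown above.

```python
from collections import ChainMap


def drop_unset(mapping):
    """Ignore keys whose value is None so a missing flag doesn't mask a lower layer."""
    return {k: v for k, v in mapping.items() if v is not None}


def resolve_settings(flags, env, profile, defaults):
    """Earlier maps win: flags > env vars > .pdbprofile > local defaults."""
    return ChainMap(drop_unset(flags), drop_unset(env), drop_unset(profile), defaults)


settings = resolve_settings(
    flags={"dsn": None, "from_pdb": "GOLDEN_MASTER"},     # only --from was passed
    env={"dsn": "dbhost:1521/FREE"},                       # hypothetical env override
    profile={"dsn": "localhost:1521/FREE", "from_pdb": "FREEPDB1"},
    defaults={"dsn": "localhost:1521/FREE", "user": "sys", "from_pdb": "FREEPDB1"},
)
```

Here `settings["dsn"]` comes from the environment, `settings["from_pdb"]` from the flag, and `settings["user"]` falls through to the default.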

That's the entire developer experience. Branch, score, promote, reap. The argument that Oracle "doesn't have git branch for databases" was true a week ago. Today there's a CLI in the repo, an integration test that runs it against an Oracle Free container in CI, and a Rust binary you can drop in your $PATH.

One architectural point worth elevating: the two-connection security model. The agent never gets SYSDBA. There are two distinct connections:

Control-plane connection — trusted orchestration code → CDB$ROOT as SYSDBA → uses BranchClient to create, open, close, and drop PDB branches.
Workload connection — the agent → branch PDB → normal application user → ordinary SQL against branch-local data.

The agent receives only a DSN to its assigned branch and standard application credentials. It cannot create branches, drop branches, or escape its sandbox. Lakebase has nothing analogous in its branching API today; the agent-vs-supervisor security boundary is enforced at the cloud-IAM layer rather than in the database itself, and that's a category weaker than separation of concerns enforced inside the engine.

Snapshot-copy fallback is engineered, not aspirational. When the library requests SNAPSHOT COPY and the underlying storage rejects it — Oracle Free's container filesystem returns ORA-17525 / ORA-65169, for instance — the library transparently retries as a full clone, records a SNAPSHOT_COPY_FALLBACK row in PDB_BRANCH_EVENTS, and (in the Python binding) emits a SnapshotCopyFallbackWarning. Correctness is preserved on substrates that can't sparse-clone; the events table makes it visible when that happened so capacity planning isn't a guessing game.
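The fallback pattern itself is small. The following is a sketch of the general shape only, not the library's code: `SnapshotCopyUnsupported`, `SnapshotFallbackWarning`, `run_ddl`, and `record_event` are hypothetical stand-ins, and a real implementation would inspect the driver's error code rather than a custom exception.

```python
import warnings


class SnapshotCopyUnsupported(Exception):
    """Hypothetical stand-in for a driver error carrying ORA-17525 or ORA-65169."""


class SnapshotFallbackWarning(UserWarning):
    """Hypothetical; deliberately not the library's own warning class."""


def create_branch(name, parent, run_ddl, record_event):
    """Try a snapshot copy first; fall back to a full clone if storage refuses."""
    base = f"CREATE PLUGGABLE DATABASE {name} FROM {parent}"
    try:
        run_ddl(base + " SNAPSHOT COPY")
        return "snapshot"
    except SnapshotCopyUnsupported as exc:
        record_event(name, "SNAPSHOT_COPY_FALLBACK", str(exc))   # make it visible
        warnings.warn(f"{name}: snapshot copy unavailable, using full clone",
                      SnapshotFallbackWarning, stacklevel=2)
        run_ddl(base)            # full clone: same statement without SNAPSHOT COPY
        return "full"


events, executed = [], []


def fake_run_ddl(sql):
    # Simulates storage that cannot snapshot, like the Free container filesystem.
    if sql.endswith("SNAPSHOT COPY"):
        raise SnapshotCopyUnsupported("ORA-17525")
    executed.append(sql)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    mode = create_branch("AGENT_RAG_042", "GOLDEN_MASTER",
                         fake_run_ddl, lambda *e: events.append(e))
```

The important design choice is that the downgrade is recorded and warned about instead of swallowed, so a laptop run and a production run cannot be confused after the fact.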

Free deployment path:

Oracle Database 23ai/26ai Free Docker image: container-registry.oracle.com/database/free. CDB service FREE, default PDB FREEPDB1. Multiple branch PDBs supported. The Free image's container filesystem doesn't support storage snapshots, so snapshot_copy=True is transparently downgraded to a full clone via the fallback path above, with the event row and warning to show for it. That means 10–30 branches are realistic on a laptop, not hundreds. $0 cost forever, and the Oracle Free integration tests in the repo run the Python, Node.js, Rust, Java, and CLI surfaces against this image in CI.
Self-managed CDB on 19c+ with snapshot-capable storage — production target. ACFS, dNFS, Exadata sparse, or Exascale. Branch DDL uses Oracle Managed Files via CREATE_FILE_DEST, preferring DB_CREATE_FILE_DEST when set and otherwise deriving a destination from the parent PDB's datafile directory.
ADB Serverless / Always Free is explicitly NOT a v1 target. ADB application connections land in an existing PDB, not in CDB$ROOT, so they cannot run PDB branch DDL. A real architectural constraint of ADB's tenancy model, not a pdb-branch limitation.

The README is explicit about v1 boundaries: the idempotent installer doesn't migrate destructive schema changes yet; PL/SQL identifiers are restricted to simple unquoted Oracle names; promotion is metadata-only, with scaling and export workflows left to deployment-specific adapters. That's an honest v1 scope.

The article is the "why." The repo is the "how." They land together, today.

The Agentic Workflow on Oracle

The lifecycle Patrick described in our Slack thread, mapped to the actual pdb-branch API:

Phase 1 — heavy experimentation. The supervisor holds the SYSDBA control-plane connection and spins up branches:

for hypothesis in hypotheses:
    branches.create_branch(
        f"AGENT_{hypothesis.id}",
        from_pdb="GOLDEN_MASTER",
        notes=hypothesis.description,
    )
    branches.set_profile(f"AGENT_{hypothesis.id}", "PDB_BRANCH_ACTIVE")

Each agent receives a DSN to its assigned branch plus an app-user credential. Agents do not see CDB$ROOT. They run their experiments — vector queries, JSON queries, SQL, whatever the eval needs — against ordinary Oracle PDBs. Once the branch PDB is open there is no special "branch query mode": the branch is just an isolated Oracle PDB service.

Phase 2 — evaluate. Supervisor logs scores back to PDB_BRANCH_BRANCHES as agents finish:

branches.record_score("AGENT_RAG_042", 0.91, notes="eval: qa_regression_v3")

The supervisor process can watch V$PDBS (open mode, last open time, total size) and V$RSRCPDBMETRIC (per-PDB CPU and I/O draw) for liveness and resource consumption.

Phase 3 — promote and reap. Winners stay active. Losers get downgraded or closed:

branches.promote("AGENT_RAG_042", notes="winner for current retrieval policy")
branches.cleanup(close_idle_after_minutes=60, drop_expired=True)

cleanup is the auto-suspend / auto-drop primitive. In production you don't run this from the supervisor; you schedule PDB_BRANCH.CLEANUP from DBMS_SCHEDULER so the orchestration code doesn't need to babysit branch lifecycle.

Behind the Phase 1 calls (create_branch and set_profile), the SQL is roughly what you'd expect:

CREATE PLUGGABLE DATABASE AGENT_RAG_042
    FROM GOLDEN_MASTER SNAPSHOT COPY;

ALTER PLUGGABLE DATABASE AGENT_RAG_042 OPEN;

ALTER PLUGGABLE DATABASE AGENT_RAG_042
    SET DB_PERFORMANCE_PROFILE='PDB_BRANCH_ACTIVE';

INSERT INTO PDB_BRANCH_BRANCHES (NAME, PARENT, STATE, NOTES, CREATED)
VALUES ('AGENT_RAG_042', 'GOLDEN_MASTER', 'ACTIVE',
        'try smaller chunk size...', SYSTIMESTAMP);

INSERT INTO PDB_BRANCH_EVENTS (BRANCH_NAME, EVENT_TYPE, DETAILS, EVENT_TIME)
VALUES ('AGENT_RAG_042', 'CREATED', '{"from":"GOLDEN_MASTER"}', SYSTIMESTAMP);

Five statements: three DDL, which Oracle commits implicitly, and two bookkeeping inserts that commit together. The branch is live. An agent connects to AGENT_RAG_042 as app_user and runs its experiment.

This is what Databricks calls evolutionary algorithms in the database. It's the right framing. The substrate has been Oracle for a decade; what was missing was the wrapper that makes it feel like git. Each language binding is roughly one module long, the Rust pdb CLI is one binary, and they all sit on top of one shared PL/SQL package. The whole DX gap was about that much code.

Cost Reality

Both platforms have real costs and real free entry points. Skipping the marketing-deck pricing slide and going straight to what an engineer would actually pay:

Workload pattern	Lakebase	Oracle ADB Serverless 2 ECPU
50 mostly-idle branches, occasional bursts	$80–$150/mo	$190–$290/mo
100+ branches, high density	Hits the 10-active wall	Scales naturally to thousands
Sustained 8+ hr/day activity	Capacity-unit cost climbs	Cheaper at sustained load
Storage at scale	$0.345 / GB-month	~$0.024 / GB-month (≈15× cheaper)
Free for prototyping	Always Free tier (limited)	Free Docker image: $0 forever

These are public list prices as of mid-2026, picked from each vendor's published rates. Run the numbers for your workload.
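Running the storage numbers is a two-line calculation. The sketch below uses only the two per-GB rates from the table; the 2 TB data volume is an assumption for illustration, and it ignores any block sharing between branches.

```python
LAKEBASE_STORAGE_PER_GB = 0.345   # USD per GB-month, from the table above
ORACLE_STORAGE_PER_GB = 0.024     # USD per GB-month (approximate), from the table above


def monthly_storage_cost(gb, rate_per_gb):
    return gb * rate_per_gb


def storage_gap(gb):
    """Return (lakebase_usd, oracle_usd, ratio) for a given stored volume."""
    lakebase = monthly_storage_cost(gb, LAKEBASE_STORAGE_PER_GB)
    oracle = monthly_storage_cost(gb, ORACLE_STORAGE_PER_GB)
    return lakebase, oracle, lakebase / oracle


lakebase_usd, oracle_usd, ratio = storage_gap(2_000)   # assumed: 2 TB of experiment data
```

At these rates the gap is about $690 versus $48 per month for 2 TB, a ratio of roughly 14.4, which the table rounds to ≈15×. The ratio is independent of volume; the absolute dollar gap is what grows.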

The honest read:

Lakebase wins on bursty, mostly-idle floors with light data. That's the optimization point of per-DB scale-to-zero, and they do it well.
Oracle wins on density, sustained activity, and storage at scale. When agents are actually doing work, the shared-pool model delivers more compute per dollar. When experiment data grows, the storage cost differential alone (~15×) can dominate the total.
Oracle Free Docker is genuinely free. No cloud signup, no credit card, no quotas. Patrick's pdb-branch README documents this as the recommended local prototyping path.

This is the compute story restated as economics. Per-DB scale-to-zero looks cheap when nothing is running. Shared elastic pool is cheaper when anything is running. Pick the model that matches your workload, not the marketing scoreboard.

What's Actually New About Lakebase

Worth giving Databricks an honest hearing. The "third-generation" framing collapses the moment you check the dates. What about their other claim — that in Lakebase "both the storage infrastructure and the data formats are completely open"?

That one survives partway and dies in the details.

The operational store in Lakebase is Postgres page format on cloud object storage. That's what they mean by "open storage infrastructure." But Postgres' on-disk page layout is a physical storage format, not a portable interchange format. The only thing that can read a Postgres page file is the Postgres engine. Calling that "open" because the Postgres source code is open is a category error. By that logic, MongoDB's BSON is "open" because the spec is published.

The other openness claim — that the same data is queryable as Iceberg by external analytical engines — is true. But the Iceberg view isn't the operational store. It's a separate projection layer (the "Mooncake" bridge — Databricks' OLTP-to-lakehouse export pipeline). Iceberg files are derived from the operational Postgres pages, not the same bytes.

Which means Lakebase's actual architecture is:

A canonical store in Postgres-only page format. Closed to anything that isn't Postgres.
A projected shape in Iceberg, exported to make the data analytically accessible.

That's exactly canonical form + projected shape. It's the architecture pattern I've been calling Unified Model Theory for the last two years. Databricks reinvented UMT, called the closed canonical store "open," and called the projection layer "openness."

Oracle's answer to "open data" is the converged engine itself: same canonical store, multiple shapes natively in the engine — SQL, JSON Duality Views, Property Graph, Vector, Spatial, Full-Text Search, Mongo wire protocol, OSON serialization out, Iceberg/Parquet for analytics. No bridge layer required. The cost-based optimizer sees all the modalities in a single query plan.

The architecture-correct way to expose canonical data through multiple shapes is to do it in the engine. That is what Oracle has been shipping for 40 years and what UMT formalizes. Databricks' Lakebase + Mooncake architecture is one valid implementation pattern of the same idea, with two extra hops and a new vocabulary.

What's actually new in Lakebase isn't the architecture. It's the packaging — a polished branching UX wired into a data lake brand and a billion dollars of marketing oxygen. That's a real product investment and a credible push into a market segment Oracle has under-marketed. Credit where due.

It's just not "third-generation database architecture." It's first-generation Postgres branching with a second-generation marketing department.

The Real Take

Three things to land:

1. Agents do need branching. Databricks' diagnosis is correct, and the agentic future they describe is real. Database branching is the missing primitive for evolutionary development. Cost floors do break the economics. Storage and compute do need to scale independently. Credit where due.

2. Lakebase is competent execution of an idea Oracle Multitenant solved in 2013. Neon is good engineering. Lakebase is Neon plus a brand and a UX layer. That's fine — but it isn't "third generation." It's a four-year-old Postgres-branching architecture, recently acquired and rebranded.

3. The architecture-correct version exists today. Full ACID. Up to 4098 branches per CDB. Vector, graph, JSON, spatial, full-text — single engine, single transaction, single query plan optimized by 40 years of cost-based optimizer development. Transparent Application Continuity replays in-flight transactions across failover. The two-connection security model keeps agents out of CDB$ROOT by construction.

The only real gap was developer experience. Patrick's pdb-branch closes it. Today. Four language bindings, a CLI, a PL/SQL package, three control tables, and a sane API. Branch, score, promote, reap.

Stop reinventing 2013. Build the wrapper. Ship.

Third-generation database architecture? We're on our fifth.

— Rick & Patrick

Citations

Databricks (primary subject):

"How agentic software development will change databases" — https://www.databricks.com/blog/how-agentic-software-development-will-change-databases
"A New Era of Databases: Lakebase" (June 12, 2025) — https://www.databricks.com/blog/what-is-a-lakebase
"How lakebase architecture delivers 5x faster Postgres writes" — https://www.databricks.com/blog/how-lakebase-architecture-delivers-5x-faster-postgres-writes
"Database Branching in Postgres: Git-Style Workflows" — https://www.databricks.com/blog/database-branching-postgres-git-style-workflows-databricks-lakebase
"Databricks Agrees to Acquire Neon" press release (May 14, 2025) — https://www.databricks.com/company/newsroom/press-releases/databricks-agrees-acquire-neon-help-developers-deliver-ai-systems

Oracle Database documentation:

12c Multitenant Concepts — https://docs.oracle.com/database/121/CNCPT/cdbovrvw.htm
19c CREATE PLUGGABLE DATABASE — https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/CREATE-PLUGGABLE-DATABASE.html
19c Cloning a PDB — https://docs.oracle.com/en/database/oracle/oracle-database/19/multi/cloning-a-pdb.html
19c Administering a PDB Snapshot Carousel — https://docs.oracle.com/en/database/oracle/oracle-database/19/multi/administering-pdb-snapshots.html
19c MAX_PDBS reference — https://docs.oracle.com/en/database/oracle/oracle-database/19/refrn/MAX_PDBS.html
21c V$PDBS reference — https://docs.oracle.com/en/database/oracle/oracle-database/21/refrn/V-PDBS.html
26c ALTER PLUGGABLE DATABASE — https://docs.oracle.com/en/database/oracle/oracle-database/26/sqlrf/ALTER-PLUGGABLE-DATABASE.html
Resource Manager for PDBs (19c) — https://docs.oracle.com/en/database/oracle/oracle-database/19/multi/using-oracle-resource-manager-for-pdbs-with-sql-plus.html
ADB Compute Models (ECPU/OCPU) — https://docs.oracle.com/en/cloud/paas/autonomous-database/serverless/adbsb/autonomous-compute-models.html
ADB Auto-Scale 3× — https://docs.oracle.com/en-us/iaas/autonomous-database-serverless/doc/autonomous-auto-scale.html
PDB Snapshots on Exadata Exascale (23ai+) — https://docs.oracle.com/en/learn/exadb-xs-pdb-snapshot/index.html

Historical context:

Dremel 2020 retrospective (VLDB) — https://www.vldb.org/pvldb/vol13/p3461-melnik.pdf
Aurora 10-year retrospective — https://aws.amazon.com/blogs/aws/celebrating-10-years-of-amazon-aurora-innovation/
Aurora cloning hard limits — https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Aurora.Managing.Clone.html
Snowflake architecture — https://docs.snowflake.com/en/user-guide/intro-key-concepts
Oracle 18c PDB Snapshot Carousel introduction — https://oracle-base.com/articles/18c/multitenant-pdb-snapshot-carousel-18c

Neon / Postgres branching:

Neon architecture overview — https://neon.com/docs/introduction/architecture-overview
Neon branching docs — https://neon.com/docs/introduction/branching
TechTarget on Databricks/Neon acquisition — https://www.techtarget.com/searchdatamanagement/news/366623864/Databricks-adds-Postgres-database-with-1B-Neon-acquisition

Companion repository:

pmeredit/pdb-branch — https://github.com/pmeredit/pdb-branch

The Silent Bug That Exposed All Tenant Data in Databricks Unity Catalog

spkibe — Mon, 11 May 2026 07:39:39 +0000

We were building a multi-tenant data platform on Databricks. Multiple organisations sharing the same physical tables — each one should see only their own rows. Standard stuff.
We implemented it using Unity Catalog's row-level security and column masking. The functions compiled. The filter showed as applied in DESCRIBE EXTENDED. Every test from the admin account looked perfect.
Then we logged in as a real tenant user.
They could see every tenant's data.

What Row-Level Security and Column Masking Actually Do
Before getting to the bug, a quick primer on how Unity Catalog security works — because understanding the mechanism is what makes the bug obvious in hindsight.

Row-Level Security — Row Filters
A row filter is a SQL function you attach to a table. Unity Catalog calls it automatically on every query, passing the value of a specified column from each row. If the function returns TRUE, the row is shown. If it returns FALSE, the row is completely hidden — not counted, not visible, not even hinted at.

-- Attach a row filter to a table
ALTER TABLE my_catalog.my_schema.my_table
  SET ROW FILTER my_catalog.governance.filter_by_tenant
  ON (TENANT_KEY);

The user never writes a WHERE clause for this. They cannot remove it. It fires invisibly on every query from every tool — SQL editor, notebook, BI connection, API call.

Column-Level Masking — Column Masks
A column mask is a SQL function attached to a specific column. Instead of hiding rows, it transforms values at query time. The row is visible but sensitive fields are replaced, generalized, or redacted based on who is asking.

-- Attach a column mask
ALTER TABLE my_catalog.my_schema.my_table
  ALTER COLUMN FIRST_NAME
  SET MASK my_catalog.governance.mask_name;

The same SELECT returns different values depending on the user's group membership.

One table. One query. Different results per role. Platform-enforced.
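
As a rough illustration of that behaviour, here is a plain-Python model of a name mask. It mirrors the branch order of the mask_name SQL function shown later in this article, but the Python function, its `groups` argument, and the sample name are hypothetical stand-ins for Unity Catalog's group lookup, not a Databricks API.

```python
def mask_name(name: str, groups: set[str]) -> str:
    """Model of a column mask: same stored value, different output per caller."""
    if "full_access_group" in groups or "admin_group" in groups:
        return name                  # privileged callers see the real value
    if "partial_access_group" in groups:
        return name[:1] + "***"      # partial access: first letter only
    return "#### MASKED ####"        # everyone else: fully redacted


# One stored value, three callers with different (hypothetical) group sets.
stored = "Alice"
full = mask_name(stored, {"full_access_group"})
partial = mask_name(stored, {"partial_access_group"})
nobody = mask_name(stored, set())
print(full, partial, nobody)
```

The point of the sketch is only that the caller's group membership, not the query text, decides what comes back.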

Why This Matters
The old approach — dynamic views, one per tenant per role — requires you to trust that every developer always queries the right view, that views stay in sync with schema changes, and that no one ever accidentally gets direct table access. Unity Catalog removes all of that trust dependency. Security lives at the storage engine layer, not the SQL layer.

The Bug
Here is the row filter function we wrote:

CREATE OR REPLACE FUNCTION
my_catalog.governance.filter_by_tenant(tenant_key BIGINT)
RETURNS BOOLEAN
RETURN
  IS_ACCOUNT_GROUP_MEMBER('admin_group')
  OR
  EXISTS (
    SELECT 1
    FROM my_catalog.governance.tenant_group_mapping tgm
    WHERE IS_ACCOUNT_GROUP_MEMBER(tgm.group_name)
      AND CAST(tgm.tenant_key AS BIGINT) = tenant_key
  );

Read it carefully.
The function parameter is named tenant_key.
The mapping table column is also named tenant_key.
In the WHERE clause:

AND CAST(tgm.tenant_key AS BIGINT) = tenant_key
SQL sees two references to tenant_key. It resolves both as the table column tgm.tenant_key. The function parameter is completely ignored.

The comparison becomes:

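tgm.tenant_key = tgm.tenant_key

That is true for every mapping row with a non-null key, so EXISTS succeeds as soon as the user belongs to any mapped group, whichever tenant's row is being checked.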
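
The same scoping rule can be reproduced outside Databricks. The sketch below uses Python's built-in sqlite3 module as a stand-in: it is not Unity Catalog, and the two tables, the group name, and the correlated-subquery form are assumptions made for illustration. In SQLite, an unqualified column name inside a subquery binds to the innermost table that has it, which produces the same always-true comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE members (member_key INTEGER, tenant_key INTEGER);
    CREATE TABLE tenant_group_mapping (group_name TEXT, tenant_key INTEGER);
    INSERT INTO members VALUES (10, 1), (20, 2), (30, 3);
    INSERT INTO tenant_group_mapping VALUES ('tenant_3_group', 3);
""")

# The unqualified tenant_key binds to tgm.tenant_key (innermost scope),
# so the comparison is effectively tgm.tenant_key = tgm.tenant_key.
buggy_sql = """
    SELECT m.tenant_key FROM members m
    WHERE EXISTS (
        SELECT 1 FROM tenant_group_mapping tgm
        WHERE tgm.group_name = ?
          AND tgm.tenant_key = tenant_key
    )
    ORDER BY m.tenant_key
"""

# Qualifying the outer reference removes the collision.
fixed_sql = buggy_sql.replace("= tenant_key", "= m.tenant_key")

user_group = "tenant_3_group"  # hypothetical group of the querying user
visible_buggy = [row[0] for row in conn.execute(buggy_sql, (user_group,))]
visible_fixed = [row[0] for row in conn.execute(fixed_sql, (user_group,))]
print(visible_buggy, visible_fixed)
```

Neither statement raises an error or a warning, which is what makes this class of bug easy to miss.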

Why It Was So Hard to Spot

No error was thrown. The function compiled without warnings. Unity Catalog reported it as valid SQL.
DESCRIBE EXTENDED showed the filter was applied. Row Filter: my_catalog.governance.filter_by_tenant(TENANT_KEY)

Everything looked correct at the metadata level. The filter was attached. The problem was invisible in the schema description.

Admin tests passed. Our initial testing was done from an admin account. The admin bypass (IS_ACCOUNT_GROUP_MEMBER('admin_group')) fires before the EXISTS check, so it returned TRUE for the correct reason. We never noticed the EXISTS was broken.
The function fails open, not closed. Nothing errored: Unity Catalog evaluated the filter exactly as written, and the predicate as written was true for every row a mapped user queried. A broken filter that silently shows everything is much harder to detect than a broken filter that throws an error.

The Diagnosis
The key test was running the filter function directly as the tenant user:

-- Run as the tenant user, not the admin
SELECT
  my_catalog.governance.filter_by_tenant(1) AS can_see_tenant_1,
  my_catalog.governance.filter_by_tenant(2) AS can_see_tenant_2,
  my_catalog.governance.filter_by_tenant(3) AS can_see_tenant_3;

Result:

can_see_tenant_1 = true
can_see_tenant_2 = true
can_see_tenant_3 = true

A user who should only see tenant 3 could see all three. The function was returning true everywhere regardless of tenant key. That confirmed the EXISTS logic was broken — and pointed directly to the parameter name collision.

The Fix — Rename the Parameter

CREATE OR REPLACE FUNCTION
my_catalog.governance.filter_by_tenant(p_tenant_key BIGINT)
RETURNS BOOLEAN
RETURN
  CASE
    -- Null tenant keys are always hidden
    WHEN p_tenant_key IS NULL THEN FALSE

    -- Admin bypass
    WHEN IS_ACCOUNT_GROUP_MEMBER('admin_group') THEN TRUE

    -- Tenant check — p_tenant_key is the parameter
    -- tgm.tenant_key is the table column
    -- SQL can now distinguish between them
    WHEN EXISTS (
      SELECT 1
      FROM my_catalog.governance.tenant_group_mapping tgm
      WHERE IS_ACCOUNT_GROUP_MEMBER(tgm.group_name)
        AND CAST(tgm.tenant_key AS BIGINT) = p_tenant_key
    ) THEN TRUE

    -- Explicit deny — everything else sees zero rows
    ELSE FALSE
  END;

Two changes:

Parameter renamed from tenant_key to p_tenant_key — eliminates the name collision
CASE structure with explicit ELSE FALSE — makes the deny-by-default behaviour visible and intentional

After recreating the function and reapplying the row filter, the same test returned:

can_see_tenant_1 = false
can_see_tenant_2 = false
can_see_tenant_3 = true
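
To make the deny-by-default ordering easy to reason about, here is a small plain-Python model of the fixed CASE expression. The mapping list and the `user_groups` argument are hypothetical stand-ins for the tenant_group_mapping table and IS_ACCOUNT_GROUP_MEMBER(); this is a sketch of the logic, not code that runs on Databricks.

```python
from typing import Optional

# Hypothetical stand-in for governance.tenant_group_mapping: (group, tenant key).
TENANT_GROUP_MAPPING = [
    ("tenant_1_group", 1),
    ("tenant_2_group", 2),
    ("tenant_3_group", 3),
]


def filter_by_tenant(p_tenant_key: Optional[int], user_groups: set[str]) -> bool:
    """Model of the fixed row filter: every path that is not an explicit allow denies."""
    if p_tenant_key is None:
        return False                 # null tenant keys are always hidden
    if "admin_group" in user_groups:
        return True                  # admin bypass
    # Tenant check: the user must be in a group mapped to this exact key.
    return any(
        group in user_groups and mapped_key == p_tenant_key
        for group, mapped_key in TENANT_GROUP_MAPPING
    )


tenant_3_user = {"tenant_3_group"}
results = [filter_by_tenant(key, tenant_3_user) for key in (1, 2, 3)]
print(results)
```

Because the parameter has its own name, there is no way for the comparison to collapse into a column compared with itself.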

Drop and Reapply After Fixing
Updating the function is not enough on its own. You also need to drop and reapply the row filter so the table picks up the new function definition:

ALTER TABLE my_catalog.my_schema.my_table
  DROP ROW FILTER;

ALTER TABLE my_catalog.my_schema.my_table
  SET ROW FILTER my_catalog.governance.filter_by_tenant
  ON (TENANT_KEY);

The Column Masking Side
For completeness — column masking uses the same pattern and has the same naming risk. Here is what a safe masking function looks like with the p_ prefix convention applied:

CREATE OR REPLACE FUNCTION
my_catalog.governance.mask_name(p_name STRING)
RETURNS STRING
RETURN CASE
  WHEN IS_ACCOUNT_GROUP_MEMBER('full_access_group') THEN p_name
  WHEN IS_ACCOUNT_GROUP_MEMBER('admin_group')       THEN p_name
  WHEN IS_ACCOUNT_GROUP_MEMBER('partial_access_group')
    THEN CONCAT(LEFT(p_name, 1), '***')
  ELSE '#### MASKED ####'
END;

Apply it inline at table creation to avoid broken dependencies later:

CREATE TABLE IF NOT EXISTS my_catalog.my_schema.members
(
    MEMBER_KEY   BIGINT  NOT NULL,
    TENANT_KEY   BIGINT  NOT NULL,
    FIRST_NAME   STRING  MASK my_catalog.governance.mask_name,
    LAST_NAME    STRING  MASK my_catalog.governance.mask_name,
    DATE_OF_BIRTH DATE   MASK my_catalog.governance.mask_dob
)
USING DELTA;

-- Row filter applied separately
ALTER TABLE my_catalog.my_schema.members
  SET ROW FILTER my_catalog.governance.filter_by_tenant
  ON (TENANT_KEY);

Declaring masks inline means they survive DROP TABLE / CREATE TABLE cycles. The row filter does not — always reapply it after recreating a table.

The Rule

Never name a row filter function parameter the same as a column in any table the function queries.

Prefix all function parameters with p_. It is two characters. It prevents this entire class of silent security failure.

filter_by_tenant(tenant_key BIGINT)   ← dangerous
filter_by_tenant(p_tenant_key BIGINT) ← safe

Full Verification Checklist
Run these in order before trusting any row filter in production:

-- 1. Confirm groups are account-level (not workspace-level)
--    Run as the target user:
SELECT IS_ACCOUNT_GROUP_MEMBER('your_tenant_group');
-- Expected: true

-- 2. Confirm filter function returns correct values per tenant
SELECT
  my_catalog.governance.filter_by_tenant(1) AS t1,
  my_catalog.governance.filter_by_tenant(2) AS t2,
  my_catalog.governance.filter_by_tenant(3) AS t3;
-- Expected: false, false, true (for a tenant 3 user)

-- 3. Confirm filter is attached to the table
DESCRIBE EXTENDED my_catalog.my_schema.my_table;
-- Look for: Row Filter: my_catalog.governance.filter_by_tenant(TENANT_KEY)

-- 4. Confirm mapping table has correct data
SELECT * FROM my_catalog.governance.tenant_group_mapping;

-- 5. Confirm the EXISTS subquery works correctly
SELECT EXISTS (
  SELECT 1
  FROM my_catalog.governance.tenant_group_mapping tgm
  WHERE IS_ACCOUNT_GROUP_MEMBER(tgm.group_name)
    AND tgm.tenant_key = 3
) AS exists_result;
-- Expected: true (for tenant 3 user)

-- 6. Run query as target user and confirm only their rows appear
SELECT COUNT(*), TENANT_KEY
FROM my_catalog.my_schema.my_table
GROUP BY TENANT_KEY;
-- Expected: only their tenant_key in results

Other Gotchas We Hit Along the Way
While we are here — these are the other issues that burned us during the same implementation:
Workspace groups vs account groups. IS_ACCOUNT_GROUP_MEMBER() only recognises account-level groups created in the Databricks Account Console, not workspace-level groups. A workspace group always returns false. This one caused hours of confusion.
Cluster identity. Notebooks attached to a cluster run queries as the cluster owner's identity, not the logged-in user. IS_ACCOUNT_GROUP_MEMBER() checks the cluster owner's groups. Switch to a SQL Warehouse, which always evaluates per the logged-in user.
Broken dependencies after catalog deletion. Column masks hold references to functions by their fully-qualified path. Delete the catalog containing a masking function without first dropping the masks, and every table with that mask becomes unqueryable with UC_DEPENDENCY_DOES_NOT_EXIST. Always drop masks before dropping catalogs.
Row filter lost after DROP TABLE. When you drop and recreate a table, inline column masks are preserved in the CREATE TABLE statement. Row filters are not. Always reapply ALTER TABLE SET ROW FILTER after recreating any filtered table.

Summary
Unity Catalog row-level security and column masking are genuinely powerful. One filter function and one masking function replace hundreds of views, a duplicate encrypted schema, and developer-discipline-as-security-policy.
But the parameter name collision bug is subtle enough that it will catch you if you are not looking for it. The function looks right. It compiles cleanly. It attaches without errors. And it silently hands every user a complete view of every tenant's data.
Prefix your parameters. Always.

Building Your First Data Warehouse in Databricks — End to End 🎉

Qvfagundes — Mon, 11 May 2026 03:00:00 +0000

This is it. The article the entire series has been building toward.

We've covered Databricks fundamentals, Apache Spark, Delta Lake, DBFS, DataFrames, SQL, and the Medallion Architecture. Now we wire everything together into a real, working data warehouse — from raw data ingestion all the way to queryable Gold tables.

By the end of this article you'll have a functioning Lakehouse with Bronze, Silver, and Gold layers, a database registered in the Databricks catalog, and the ability to query your warehouse like a real data engineer.

Let's build it.

What We're Building

We'll build a Sales Data Warehouse using a publicly available dataset. Here's the full architecture:

CSV Files (raw sales data)
        ↓
   🥉 BRONZE
   bronze.sales_raw
   Raw Delta table, append-only
        ↓
   🥈 SILVER
   silver.sales
   Cleaned, deduplicated, enriched
        ↓
   🥇 GOLD
   gold.monthly_revenue     — Revenue by country and month
   gold.product_performance — Top products by sales volume
   gold.customer_segments   — Customers segmented by spend tier
        ↓
   SQL queries / BI tool

Step 0: The Dataset

We'll use the Online Retail dataset — a real e-commerce transaction dataset available in Databricks sample data.

It contains ~500,000 rows of UK retail transactions with these columns:

Column	Type	Description
`InvoiceNo`	String	Order ID
`StockCode`	String	Product code
`Description`	String	Product name
`Quantity`	Integer	Units ordered
`InvoiceDate`	String	Order date and time
`UnitPrice`	Double	Price per unit
`CustomerID`	Double	Customer identifier
`Country`	String	Customer country

Step 1: Set Up Your Databases

Start a new notebook. This will be your setup notebook — run it once to create the structure.

# notebook: 00_setup

# Create the three layer databases
spark.sql("CREATE DATABASE IF NOT EXISTS bronze")
spark.sql("CREATE DATABASE IF NOT EXISTS silver")
spark.sql("CREATE DATABASE IF NOT EXISTS gold")

# Create the mount point directories
dbutils.fs.mkdirs("/mnt/warehouse/bronze/")
dbutils.fs.mkdirs("/mnt/warehouse/silver/")
dbutils.fs.mkdirs("/mnt/warehouse/gold/")

print("✅ Databases and directories created.")

Now check the Databricks Data tab — you should see three new databases: bronze, silver, and gold.

Step 2: Bronze — Ingest Raw Data

Create a new notebook: 01_bronze_ingestion

# notebook: 01_bronze_ingestion

from pyspark.sql.functions import current_timestamp, input_file_name, lit

print("Starting Bronze ingestion...")

# -------------------------------------------------------
# Read the raw CSV from Databricks sample datasets
# -------------------------------------------------------
raw_df = spark.read.csv(
    "/databricks-datasets/online_retail/data-001/data.csv",
    header=True,
    inferSchema=True
)

print(f"Raw rows ingested: {raw_df.count():,}")
raw_df.printSchema()

# -------------------------------------------------------
# Add Bronze metadata columns
# -------------------------------------------------------
bronze_df = raw_df \
    .withColumn("_ingested_at", current_timestamp()) \
    .withColumn("_source_file", input_file_name()) \
    .withColumn("_source_system", lit("online_retail_csv"))

# -------------------------------------------------------
# Write to Bronze Delta table
# -------------------------------------------------------
bronze_df.write \
    .format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .save("/mnt/warehouse/bronze/sales_raw/")

# Register in catalog
spark.sql("""
    CREATE TABLE IF NOT EXISTS bronze.sales_raw
    USING DELTA
    LOCATION '/mnt/warehouse/bronze/sales_raw/'
""")

# Quick validation
count = spark.read.format("delta").load("/mnt/warehouse/bronze/sales_raw/").count()
print(f"✅ Bronze table written. Total rows: {count:,}")

Run the cell. You should see output similar to:

Raw rows ingested: 541,909
✅ Bronze table written. Total rows: 541,909

Let's peek at what we landed:

display(spark.read.table("bronze.sales_raw").limit(10))

You'll see messy data — nulls in CustomerID, negative quantities (returns), zero-price rows. That's fine. Bronze captures reality. Silver fixes it.

Step 3: Silver — Clean and Enrich

Create a new notebook: 02_silver_transformation

# notebook: 02_silver_transformation

from pyspark.sql.functions import (
    col, upper, trim, round, to_timestamp,
    year, month, when, current_timestamp
)

print("Starting Silver transformation...")

# -------------------------------------------------------
# Read from Bronze
# -------------------------------------------------------
bronze = spark.read.table("bronze.sales_raw")
print(f"Bronze rows: {bronze.count():,}")

# -------------------------------------------------------
# Cleaning rules
# -------------------------------------------------------
silver = (
    bronze
    # 1. Drop rows with null CustomerID (anonymous sessions)
    .dropna(subset=["CustomerID"])
    # 2. Drop duplicates on InvoiceNo + StockCode
    .dropDuplicates(["InvoiceNo", "StockCode"])
    # 3. Remove returns (negative quantities) and zero-price items
    .filter(col("Quantity") > 0)
    .filter(col("UnitPrice") > 0)
    # 4. Cast and clean types
    .withColumn("CustomerID", col("CustomerID").cast("integer"))
    .withColumn("InvoiceDate", to_timestamp(col("InvoiceDate"), "M/d/yyyy H:mm"))
    .withColumn("UnitPrice", round(col("UnitPrice"), 2))
    # 5. Derive new columns
    .withColumn("TotalAmount", round(col("Quantity") * col("UnitPrice"), 2))
    .withColumn("Description", upper(trim(col("Description"))))
    .withColumn("Year", year(col("InvoiceDate")))
    .withColumn("Month", month(col("InvoiceDate")))
    .withColumn("Tier",
        when(col("TotalAmount") >= 500, "High Value")
        .when(col("TotalAmount") >= 100, "Mid Value")
        .otherwise("Low Value")
    )
    # 6. Rename to snake_case
    .withColumnRenamed("InvoiceNo",   "invoice_id")
    .withColumnRenamed("StockCode",   "product_code")
    .withColumnRenamed("Description", "product_name")
    .withColumnRenamed("Quantity",    "quantity")
    .withColumnRenamed("InvoiceDate", "invoice_date")
    .withColumnRenamed("UnitPrice",   "unit_price")
    .withColumnRenamed("CustomerID",  "customer_id")
    .withColumnRenamed("Country",     "country")
    .withColumnRenamed("TotalAmount", "total_amount")
    .withColumnRenamed("Year",        "year")
    .withColumnRenamed("Month",       "month")
    .withColumnRenamed("Tier",        "tier")
    # 7. Drop Bronze metadata
    .drop("_ingested_at", "_source_file", "_source_system")
    # 8. Add Silver metadata
    .withColumn("_processed_at", current_timestamp())
)

print(f"Silver rows after cleaning: {silver.count():,}")

# -------------------------------------------------------
# Write to Silver Delta table
# -------------------------------------------------------
silver.write \
    .format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .partitionBy("year", "month") \
    .save("/mnt/warehouse/silver/sales/")

spark.sql("""
    CREATE TABLE IF NOT EXISTS silver.sales
    USING DELTA
    LOCATION '/mnt/warehouse/silver/sales/'
""")

print(f"✅ Silver table written.")
display(spark.read.table("silver.sales").limit(5))

Expected output:

Bronze rows: 541,909
Silver rows after cleaning: 397,924
✅ Silver table written.

We dropped ~144,000 rows: null customer IDs, duplicates, returns, and zero-price items. What remains is clean, trusted data.
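
If you want to sanity-check the cleaning rules without a cluster, here is a plain-Python sketch that applies rules 1 to 3 and the tier logic to a handful of made-up rows. The sample rows and the helper names are hypothetical. One simplification to be aware of: Spark's dropDuplicates does not promise which duplicate survives, while this sketch always keeps the first.

```python
def tier_for(total_amount: float) -> str:
    """Same thresholds as the Tier column in the Silver notebook."""
    if total_amount >= 500:
        return "High Value"
    if total_amount >= 100:
        return "Mid Value"
    return "Low Value"


def clean(rows):
    seen, cleaned = set(), []
    for r in rows:
        if r["CustomerID"] is None:                    # rule 1: anonymous sessions
            continue
        key = (r["InvoiceNo"], r["StockCode"])
        if key in seen:                                # rule 2: duplicates
            continue
        seen.add(key)
        if r["Quantity"] <= 0 or r["UnitPrice"] <= 0:  # rule 3: returns, zero price
            continue
        total = round(r["Quantity"] * r["UnitPrice"], 2)
        cleaned.append({
            "invoice_id": r["InvoiceNo"],
            "product_code": r["StockCode"],
            "customer_id": int(r["CustomerID"]),
            "total_amount": total,
            "tier": tier_for(total),
        })
    return cleaned


def row(invoice, stock, qty, price, customer):
    return {"InvoiceNo": invoice, "StockCode": stock, "Quantity": qty,
            "UnitPrice": price, "CustomerID": customer}


raw_rows = [
    row("A1", "P1", 2, 3.50, 17850.0),
    row("A1", "P1", 2, 3.50, 17850.0),   # duplicate of the first row
    row("A2", "P2", -1, 4.25, 13047.0),  # return
    row("A3", "P3", 5, 0.0, 13047.0),    # zero price
    row("A4", "P4", 1, 9.99, None),      # anonymous session
    row("A5", "P5", 60, 2.00, 12583.0),
]
cleaned = clean(raw_rows)
print(cleaned)
```

Six rows go in and two come out, one for each reason a row can survive every rule.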

Step 4: Gold — Build Business Tables

Create a new notebook: 03_gold_aggregations

We'll build three Gold tables.

Gold Table 1: Monthly Revenue by Country

# notebook: 03_gold_aggregations

# col and when are needed by the customer segments table further down
from pyspark.sql.functions import sum, count, avg, countDistinct, round, col, when

silver = spark.read.table("silver.sales")

# -------------------------------------------------------
# Gold 1: Monthly Revenue by Country
# -------------------------------------------------------
monthly_revenue = silver \
    .groupBy("year", "month", "country") \
    .agg(
        round(sum("total_amount"), 2).alias("total_revenue"),
        count("invoice_id").alias("total_orders"),
        round(avg("total_amount"), 2).alias("avg_order_value"),
        countDistinct("customer_id").alias("unique_customers")
    ) \
    .orderBy("year", "month", "total_revenue", ascending=[True, True, False])

monthly_revenue.write \
    .format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .save("/mnt/warehouse/gold/monthly_revenue/")

spark.sql("""
    CREATE TABLE IF NOT EXISTS gold.monthly_revenue
    USING DELTA
    LOCATION '/mnt/warehouse/gold/monthly_revenue/'
""")

print("✅ gold.monthly_revenue written.")
display(monthly_revenue.limit(10))

Gold Table 2: Product Performance

# -------------------------------------------------------
# Gold 2: Product Performance
# -------------------------------------------------------
product_performance = silver \
    .groupBy("product_code", "product_name") \
    .agg(
        round(sum("total_amount"), 2).alias("total_revenue"),
        sum("quantity").alias("units_sold"),
        count("invoice_id").alias("times_ordered"),
        countDistinct("customer_id").alias("unique_buyers"),
        round(avg("unit_price"), 2).alias("avg_unit_price")
    ) \
    .orderBy("total_revenue", ascending=False)

product_performance.write \
    .format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .save("/mnt/warehouse/gold/product_performance/")

spark.sql("""
    CREATE TABLE IF NOT EXISTS gold.product_performance
    USING DELTA
    LOCATION '/mnt/warehouse/gold/product_performance/'
""")

print("✅ gold.product_performance written.")
display(product_performance.limit(10))

Gold Table 3: Customer Segments

# -------------------------------------------------------
# Gold 3: Customer Segments
# -------------------------------------------------------
customer_segments = silver \
    .groupBy("customer_id", "country") \
    .agg(
        round(sum("total_amount"), 2).alias("lifetime_value"),
        count("invoice_id").alias("total_orders"),
        round(avg("total_amount"), 2).alias("avg_order_value"),
        countDistinct("product_code").alias("unique_products_bought")
    ) \
    .withColumn("segment",
        when(col("lifetime_value") >= 5000, "VIP")
        .when(col("lifetime_value") >= 1000, "Loyal")
        .when(col("lifetime_value") >= 200,  "Regular")
        .otherwise("Occasional")
    ) \
    .orderBy("lifetime_value", ascending=False)

customer_segments.write \
    .format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .save("/mnt/warehouse/gold/customer_segments/")

spark.sql("""
    CREATE TABLE IF NOT EXISTS gold.customer_segments
    USING DELTA
    LOCATION '/mnt/warehouse/gold/customer_segments/'
""")

print("✅ gold.customer_segments written.")
display(customer_segments.limit(10))
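
The segment thresholds are easy to get wrong at the boundaries, so here is a plain-Python sketch of the same rule that you can test locally. The sample purchases are made up, and for brevity the sketch groups by customer only, whereas the Gold table groups by customer and country.

```python
from collections import defaultdict


def segment_for(lifetime_value: float) -> str:
    """Same thresholds as the segment column in gold.customer_segments."""
    if lifetime_value >= 5000:
        return "VIP"
    if lifetime_value >= 1000:
        return "Loyal"
    if lifetime_value >= 200:
        return "Regular"
    return "Occasional"


# Hypothetical (customer_id, total_amount) line items.
purchases = [(1, 4800.0), (1, 250.0), (2, 1000.0), (3, 199.99)]

lifetime = defaultdict(float)
for customer_id, amount in purchases:
    lifetime[customer_id] += amount

segments = {cid: segment_for(round(value, 2)) for cid, value in lifetime.items()}
print(segments)
```

Each threshold is inclusive, so a customer at exactly 1000 lands in Loyal and one at 199.99 stays Occasional.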

Step 5: Query Your Data Warehouse

Open the SQL Editor in Databricks. Your warehouse is live. Start querying.

-- What were the top 5 revenue months?
SELECT
    year,
    month,
    SUM(total_revenue)     AS monthly_revenue,
    SUM(total_orders)      AS monthly_orders,
    SUM(unique_customers)  AS monthly_customers  -- a customer active in two countries is counted twice
FROM gold.monthly_revenue
GROUP BY year, month
ORDER BY monthly_revenue DESC
LIMIT 5;

-- What are the top 10 best-selling products?
SELECT
    product_name,
    total_revenue,
    units_sold,
    unique_buyers
FROM gold.product_performance
ORDER BY total_revenue DESC
LIMIT 10;

-- How are customers distributed by segment?
SELECT
    segment,
    COUNT(*)                   AS customer_count,
    ROUND(AVG(lifetime_value), 2) AS avg_lifetime_value,
    ROUND(AVG(total_orders), 1)   AS avg_orders
FROM gold.customer_segments
GROUP BY segment
ORDER BY avg_lifetime_value DESC;

-- Which countries generate the most revenue?
SELECT
    country,
    ROUND(SUM(total_revenue), 2) AS total_revenue,
    SUM(total_orders)             AS total_orders
FROM gold.monthly_revenue
GROUP BY country
ORDER BY total_revenue DESC
LIMIT 10;

You're querying a real data warehouse. Built by you. From scratch.

Step 6: Validate Your Warehouse

Good data engineers always validate. Run these checks before calling it done:

# notebook: 04_validation

print("=== DATA WAREHOUSE VALIDATION ===\n")

# Row counts across layers
bronze_count = spark.read.table("bronze.sales_raw").count()
silver_count = spark.read.table("silver.sales").count()

print(f"🥉 Bronze rows:  {bronze_count:>10,}")
print(f"🥈 Silver rows:  {silver_count:>10,}  ({silver_count/bronze_count:.1%} of bronze)")
print()

# Gold table counts
for table in ["gold.monthly_revenue", "gold.product_performance", "gold.customer_segments"]:
    count = spark.table(table).count()
    print(f"🥇 {table}: {count:,} rows")

print()

# Null checks on Silver
from pyspark.sql.functions import col, sum as spark_sum

silver = spark.read.table("silver.sales")
null_counts = silver.select([
    spark_sum(col(c).isNull().cast("int")).alias(c)
    for c in ["invoice_id", "customer_id", "total_amount", "invoice_date"]
])

print("Null counts on critical Silver columns:")
display(null_counts)

# Revenue sanity check
total_revenue = silver.agg({"total_amount": "sum"}).collect()[0][0]
print(f"\nTotal Silver revenue: £{total_revenue:,.2f}")
print("\n✅ Validation complete.")

Step 7: Optimize Your Tables

Now that everything is built, run maintenance on your Gold tables for faster queries:

%sql

-- Compact small files
OPTIMIZE gold.monthly_revenue;
OPTIMIZE gold.product_performance;
OPTIMIZE gold.customer_segments;

-- Speed up common filter patterns
OPTIMIZE gold.monthly_revenue     ZORDER BY (year, month, country);
OPTIMIZE gold.product_performance ZORDER BY (total_revenue);
OPTIMIZE gold.customer_segments   ZORDER BY (segment, country);

What You've Built

Let's look at the complete picture:

📁 Databases created:
   bronze / silver / gold

📄 Tables created:
   bronze.sales_raw          — 541,909 rows  (raw, as-is)
   silver.sales              — 397,924 rows  (clean, enriched)
   gold.monthly_revenue      — aggregated by year/month/country
   gold.product_performance  — aggregated by product
   gold.customer_segments    — aggregated by customer

🏗️ Architecture:
   Medallion (Bronze → Silver → Gold)
   All tables in Delta format
   Silver partitioned by year/month
   Gold tables OPTIMIZE'd with ZORDER

🔍 Queryable via:
   Databricks SQL Editor
   Any BI tool via JDBC/ODBC connector
   Databricks notebooks

Where to Go From Here

You've built your first data warehouse in Databricks. Here's what to explore next:

Orchestration: Take your four notebooks and wire them into a Databricks Workflow, a pipeline that runs Bronze → Silver → Gold automatically on a schedule or trigger.

Incremental loads: Update the Bronze ingestion to load only new files, and update Silver to use MERGE instead of overwrite — real production pipelines are incremental.

Unity Catalog: In production Databricks, Unity Catalog provides centralized access control, data lineage, and governance across all your tables.

Databricks SQL Warehouses: Connect Power BI, Tableau, or Looker directly to your Gold tables via a SQL Warehouse endpoint.

dbt on Databricks: Use dbt to manage your Silver and Gold transformations with version control, testing, and documentation built in.

Series Complete 🎉

You went from zero to a working data warehouse in Databricks. That's not a small thing.

The observability gap for data science and analytics agents

Raluca Crisan — Sun, 10 May 2026 11:05:31 +0000

Databricks and similar enterprise data platforms have spent a great deal of time and effort fool-proofing their product suites with relevant observability and tracing. Unsurprisingly, this is needed as part of enterprise support, especially in regulated sectors. But for the specific case of sophisticated data science and analytics agents, there is a gap in the observability suite, not just for Databricks but across analytics and data science agent providers big and small.

In the case of Databricks, even with notebooks as the primary user interface, the level of control and tracing is no doubt high, given its offerings across data lineage, data management, and MLflow. However, large vendors like Databricks and Snowflake and smaller suppliers of analytics and data science agents share an observability gap. The gap is inherent to coding agent architectures and does not apply equally to all agents. A text-to-SQL assistant can be wrong in an ‘obvious’ way: the result makes no sense. A multi-step Python or Spark pipeline produced by an agent is different. Even when written by a human, pipeline logic is hard to unpick given the endless combinations of joins, data issues, and data characteristics. This problem doesn’t go away when an agent is involved. For example, Genie can plan a solution, run code, use cell outputs to improve results, and fix errors automatically. The question is what, beyond the initial reasoning and the final artifact, can be inspected in this instance, and what can be logged reliably rather than probabilistically.

To achieve their objectives, these more sophisticated data science and analytics agents need to create relatively complex multi-step pipelines. Past the initial data retrieval and the final storage step, the pipelines themselves are just arbitrary code. Observability for this type of script, when written by humans, is served by a whole segment of companies and tools in the MLOps space, including Databricks’ own MLflow. But it is unclear what observability exists when the code is produced by agents, short of asking the agent itself to instrument the code (probabilistically), which somewhat defeats the purpose of observability in the first place.

Having narrowed the observability gap from the bigger data platform context to a specific area, the ‘executed pipeline code’ part of these more sophisticated analytics and data science agents’ workflows, my first question was whether MLflow or a different ‘off-the-shelf’ tool in the ecosystem can fill this gap directly. For why OpenTelemetry is not enough here, please see the previous blog post.

Unsurprisingly, MLflow is heading in the direction of more granular instrumentation with the least amount of effort on anyone’s part, human or agent. For classic ML, a single mlflow.autolog() call can automatically capture params, metrics, models, datasets, and artifacts around supported training APIs, while for GenAI and agent workflows, one-line tracing primitives like @mlflow.trace, mlflow.trace(...), and mlflow.start_span() add function- and block-level visibility, including parent-child relationships, inputs, outputs, exceptions, and execution time.
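
To make concrete what function-level tracing captures, here is a minimal standard-library sketch of a tracing decorator. It is not MLflow’s implementation and does not use the MLflow API; the `trace` decorator, the `SPANS` list, and the two sample functions are hypothetical names chosen for illustration. It records the same kinds of fields listed above: parent-child relationship, inputs, output, exception, and execution time.

```python
import functools
import time

SPANS = []    # finished spans, in completion order
_stack = []   # currently open spans, innermost last


def trace(func):
    """Record name, parent, inputs, output or exception, and duration of each call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        span = {
            "name": func.__name__,
            "parent": _stack[-1]["name"] if _stack else None,
            "inputs": {"args": args, "kwargs": kwargs},
            "output": None,
            "error": None,
        }
        _stack.append(span)
        start = time.perf_counter()
        try:
            span["output"] = func(*args, **kwargs)
            return span["output"]
        except Exception as exc:
            span["error"] = repr(exc)
            raise
        finally:
            span["duration_s"] = time.perf_counter() - start
            _stack.pop()
            SPANS.append(span)
    return wrapper


@trace
def load(n):
    return list(range(n))


@trace
def pipeline(n):
    return sum(load(n))


total = pipeline(4)
print(total, [(s["name"], s["parent"]) for s in SPANS])
```

Note what this kind of span does not contain: anything about the shape or content of the data between steps, which is the gap discussed next.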

My initial experiments with deterministically instrumenting agent-created code with MLflow let me track the models as experiments, which was a good step in the right direction 👍. But I cannot track data transformations, with MLflow or with anything else that I’m familiar with.
Tracking with autolog was the better option for me, rather than the tracing functions, because I’m not really tracking the agent; I’m trying to track what happens in the code produced by the agent when it runs. Below is some basic example tracking:

The gap is tracking what actually happens inside the pipeline outside the model itself: all the data operations for which no observability is present. In other use cases the code is the best evidence. But for pipeline-type structures, where outcomes are heavily influenced by the particulars of the data, the code is not enough: observability is needed on both the code and its runtime execution. For these data science and analytics agents, the code they produce (outside the model itself) is currently a black box. Below is an example table of interim artifacts (made using Etiq), which tooling like MLflow does not currently capture for agent-written code.
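
As a rough idea of what recording interim artifacts could look like, here is a minimal standard-library sketch. It is not how Etiq or MLflow works; the `run_with_artifacts` harness, the step functions, and the sample data are hypothetical, and it assumes each pipeline step is a function over a list of dicts. The recorder sits outside the steps, so what gets logged does not depend on the agent choosing to instrument its own code.

```python
def run_with_artifacts(rows, steps):
    """Run each step in order and record row and null counts around it."""
    artifacts = []
    for step in steps:
        rows_in = len(rows)
        rows = step(rows)
        artifacts.append({
            "step": step.__name__,
            "rows_in": rows_in,
            "rows_out": len(rows),
            "nulls_out": sum(1 for r in rows for v in r.values() if v is None),
        })
    return rows, artifacts


# Two steps standing in for agent-written transformations.
def drop_missing_age(rows):
    return [r for r in rows if r["age"] is not None]


def keep_adults(rows):
    return [r for r in rows if r["age"] >= 18]


data = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},
    {"id": 3, "age": 15},
    {"id": 4, "age": 51},
]
result, artifacts = run_with_artifacts(data, [drop_missing_age, keep_adults])
for entry in artifacts:
    print(entry)
```

Even this crude record shows where rows disappear, which the final notebook and the model metrics alone would not.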

In this space we were brainwashed to believe that observability matters at all costs. However, given the perception of coding agents in the market, I feel an argument has to be made for why it really matters in this instance.
First, it’s about auditability. Truly, not everyone cares about this and not everyone should. But in regulated sectors like finance or healthcare it matters. For model validation in finance, for example, the data lineage documentation required involves more than what gets stored in Unity Catalog, Delta Lake, or MLflow model tracking, all of which are useful components. This type of use case needs to reflect the transformations that happen in the code itself once executed, and teams currently document this manually. At the moment, the use of semi-autonomous coding agents for these use cases is minimal, but that is not where the enterprise stack is heading.

Second, observability for these more sophisticated agents touches other related risks, such as reproducibility, error propagation across longer pipelines, and general control issues for agent-generated code.
Without observability, it is harder to track ‘semantic mistakes’ the agent might make, such as using the wrong metric definition or applying the analysis or model to the wrong population. A bad transformation early in the pipeline affects everything downstream. I’m not sure exactly what level of observability is needed to mitigate these issues, but without any we would certainly struggle.

Reproducibility is another area that does require some level of observability: if transformation execution is not observable, the final notebook may not be a faithful record of the run that produced the result. Similarly, we would struggle to compare agent runs over time (or rather without observability we would struggle more).

The key argument for in-depth observability on agent-generated code is enterprise-level control, especially for regulated sectors. Usage of these sophisticated data science and analytics agents in regulated sectors might be small to begin with relative to the size of the overall data platform offering. However, as Databricks and other large enterprise data platforms feel the pressure from coding agents and foundation models, there just aren’t that many avenues left to go into. If Databricks’ long-term position is to provide the governed system in which semi-autonomous enterprise agents can actually run, then any observability gap will prove problematic.

How to Choose the Right Databricks Consulting Firm: 7 Things Enterprises Get Wrong

Lucy — Thu, 07 May 2026 13:14:35 +0000

We've seen this more times than we'd like. A company drops serious money on a Databricks engagement, and nine months later they've got a half-migrated lakehouse, a Unity Catalog nobody's actually managing, and a "knowledge transfer session" that transferred nothing except a Confluence link nobody bookmarked. Picking the wrong Databricks consultants is painful. And it's almost always avoidable.

Here's where enterprises consistently go wrong.

1. Treating Certifications Like a Proxy for Skill

Databricks certs test whether someone read the documentation. They don't test what happens when a Delta Lake merge tanks a production cluster on a Friday night. Ask for specifics. What Spark executor errors have they actually debugged? How did they fix Z-ordering that was slowing down query performance instead of helping it? If they can't walk you through a real incident, the cert doesn't tell you much.

2. Not Pushing Hard on Unity Catalog

This is the one where vague answers hide the most risk. Unity Catalog is now central to how governance actually works on Databricks — metastore structure, cross-workspace data sharing, attribute-based access control. Ask how they've handled multi-business-unit deployments. Ask what breaks when you try to share data across workspaces without planning the catalog hierarchy first. The consultants who've actually done it won't need to think long before answering.

3. Assuming Spark Experience Transfers Cleanly

It doesn't. A strong Spark engineer isn't automatically a strong Databricks engineer. Photon engine tuning, Delta Live Tables pipeline architecture, Databricks Asset Bundles — these require platform-specific knowledge that general Spark work doesn't build. We've brought in Spark-heavy consultants who struggled with DLT and had never touched Databricks Workflows outside a tutorial. Ask for specific project examples, not credential claims.

4. Skipping the MLflow Conversation Entirely

If any ML workloads are in scope and the consulting firm can't speak clearly about MLflow model registry promotion, experiment tracking strategy, or Feature Store integration — that's worth noting. A lot of firms pitch ML capabilities because the market asks for them, not because they've built production ML systems on Databricks. You can usually tell within five minutes of asking detailed questions.

5. Underestimating Migration Complexity

This is where most projects actually fall apart. Moving off Hive metastores, Teradata, or on-prem Hadoop into Databricks involves decisions that compound quickly — schema evolution handling, ACID conflicts when porting existing workloads to Delta, incremental vs. full-load tradeoffs that aren't obvious until you're mid-migration. Any Databricks consultants who promise a smooth lift-and-shift haven't run one before. Push for specifics on how they've handled schema drift and what their rollback strategy looks like.

6. Not Locking In a Cost Governance Plan From Day One

Cluster policy design, autoscaling rules, Spot instance configuration — these aren't details to figure out after the platform is running. We've seen companies end up paying three times what their workloads should cost because nobody set up a governance framework before the first jobs started running. If cost optimization isn't a named deliverable in the initial scope, ask why not.
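
To make that concrete, here is a minimal sketch of what a cost guardrail can look like: a cluster policy definition built as a Python dictionary and serialized to JSON. The limits (ten workers, thirty idle minutes) are placeholder values for illustration, not recommendations, and the attribute names should be checked against the current Databricks cluster policy reference for your cloud.

```python
import json

# Hypothetical limits chosen for illustration only.
policy_definition = {
    # Cap autoscaling so a runaway job cannot grow without bound.
    "autoscale.max_workers": {"type": "range", "maxValue": 10, "defaultValue": 4},
    # Force idle clusters to shut down after a fixed period.
    "autotermination_minutes": {"type": "fixed", "value": 30},
    # On AWS, prefer Spot capacity and fall back to on-demand.
    "aws_attributes.availability": {"type": "fixed", "value": "SPOT_WITH_FALLBACK"},
}

policy_json = json.dumps(policy_definition, indent=2)
print(policy_json)
```

A definition like this only helps if it exists before the first jobs run, which is why it belongs in the initial scope.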

7. Accepting Documentation That Shows Up at the End

Most firms hand over a Confluence export at project close and call it knowledge transfer. Real handoff means annotated notebooks, runbooks your team can actually follow, and live walkthroughs of your Workflows and scheduling logic while the consultants are still around to answer questions. If this isn't written into the engagement scope from the start, don't expect it to happen.

The Databricks consultants worth hiring aren't the ones with the most case studies on their homepage. They're the ones who can tell you what went wrong on a project and what they learned from it. If you're in the middle of evaluating options right now, you can see how we think about Databricks consulting, including how we scope engagements to avoid exactly these problems.

How Databricks Genie Turns Plain English Into SQL Code

Lucy — Thu, 07 May 2026 09:51:42 +0000

If you have spent time working inside a data team, you already know how a typical Tuesday looks.

A message comes in from the sales manager. Then one from finance. Then someone from the product team who just needs "a quick number." Before 10 AM, your backlog is three queries deep. None of them are complicated on their own. But together they eat up the hours you were planning to use on the pipeline work that actually needed you.

This is not a small problem. Research from Wren AI found that data analysts in fast-paced industries spend as much as 50 to 70 percent of their time handling ad-hoc data requests. And as OWOX points out, each one-off request keeps analysts stuck in reactive mode instead of doing the forward-looking work that actually moves the business.

Databricks built AI/BI Genie to take a serious chunk of that workload off the data team. And based on how it works under the hood, it is worth understanding before you dismiss it as just another chatbot.

What Is Databricks Genie?

AI/BI Genie is a conversational analytics tool built directly into the Databricks platform. It became Generally Available in June 2025 and is free for all Databricks SQL customers with no extra license needed.

The idea is simple on the surface. A business user types a question in plain English. Genie writes the SQL, runs it, and returns a table of results along with a chart and a plain-language summary.

But what makes it different from the dozen other "ask your data a question" tools out there is what happens behind that simple interface.

How Genie Actually Works: The Compound AI System

Genie is not just one model reading your question and guessing. DataCamp's deep dive into the architecture describes it as a compound AI system, which means it uses a chain of specialized agents working together.

Here is the rough breakdown of what happens when someone asks a question:

An intent parsing agent figures out what the user is really asking, including the metric, the time range, the filters, and the aggregation type.
A planner agent breaks multi-step questions into an ordered execution plan.
A retriever agent finds the right tables, columns, and example queries to ground the request in your actual data.
A SQL generation agent turns the plan into a real, executable SQL query.
The query runs against your Databricks SQL warehouse.
A verifier checks the result. If something looks off, it can trigger a re-run or ask the user to clarify.
A summarizer writes a plain-language takeaway and picks the right visualization.

That is a lot of steps happening in seconds. And the reason this matters is that a simple single-model text-to-SQL approach fails a lot in production. Genie's multi-agent design is specifically built to reduce that failure rate.
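
To make the chain above easier to picture, here is a toy sketch in Python. It is not Genie's implementation, and every name in it is hypothetical. Each stage is a stub that passes a shared state along, which is the general shape of a compound system with a verify-and-retry step.

```python
from dataclasses import dataclass, field


@dataclass
class State:
    question: str
    intent: dict = field(default_factory=dict)
    plan: list = field(default_factory=list)
    context: dict = field(default_factory=dict)
    sql: str = ""
    rows: list = field(default_factory=list)
    summary: str = ""


def parse_intent(state):
    # Stub: a real parser would extract metric, time range, and filters.
    state.intent = {"metric": "revenue", "group_by": "region"}
    return state


def make_plan(state):
    state.plan = [f"aggregate {state.intent['metric']} by {state.intent['group_by']}"]
    return state


def retrieve(state):
    # Stub: pretend the retriever found one relevant table.
    state.context = {"table": "sales", "columns": ["region", "revenue"]}
    return state


def generate_sql(state):
    table = state.context["table"]
    state.sql = f"SELECT region, SUM(revenue) AS revenue FROM {table} GROUP BY region"
    return state


def execute(state, run_query):
    state.rows = run_query(state.sql)
    return state


def verify(state):
    # Stub check: an empty result is treated as suspicious.
    return len(state.rows) > 0


def summarize(state):
    top = max(state.rows, key=lambda row: row[1])
    state.summary = f"{top[0]} has the highest revenue ({top[1]})."
    return state


def answer(question, run_query, max_attempts=2):
    state = State(question)
    for stage in (parse_intent, make_plan, retrieve):
        state = stage(state)
    for _ in range(max_attempts):
        # A real system would revise the plan between attempts; this stub just re-runs.
        state = execute(generate_sql(state), run_query)
        if verify(state):
            return summarize(state)
    state.summary = "Could you clarify the question?"
    return state


def fake_warehouse(sql):
    # Stands in for a SQL warehouse; ignores the SQL and returns fixed rows.
    return [("EMEA", 120), ("APAC", 95)]


result = answer("Which region has the most revenue?", fake_warehouse)
print(result.summary)
```

The useful part of the design is the loop at the end: a result that fails verification never reaches the user as an answer, it either gets retried or turns into a clarifying question.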

Genie Spaces: Where the Real Setup Happens

The part most articles skip over is what makes Genie useful versus what makes it unreliable. That difference comes down to how well a Genie Space is configured.

According to the official Databricks documentation, a Genie Space is where a domain expert, such as a data analyst, sets up the context that Genie works from. This includes:

Which tables and views Genie can access
How business terms are defined ("active user" means X, "net revenue" means column Y)
Example queries that show Genie how to handle common question patterns
Text instructions for edge cases

This setup matters more than most people expect. Genie uses the names and descriptions from annotated tables and columns to convert natural language questions into equivalent SQL queries. If your column is named amt_net_rev_adj with no description, Genie will guess. If it is named adjusted_net_revenue and described clearly, Genie has the context it needs.
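
One quick way to find the columns where Genie would have to guess is to audit your metadata before building the space. The sketch below uses hypothetical helpers, not a Databricks API. It takes column names and comments as plain Python data (something you might export from your catalog's column metadata), flags the columns with no description, and builds an ALTER TABLE statement to fix one. The table name and description text are made up for illustration.

```python
def columns_needing_descriptions(columns):
    """Return names of columns whose comment is missing or blank.

    `columns` is a list of (column_name, comment) pairs; comment may be None.
    """
    return [name for name, comment in columns if not (comment or "").strip()]


def comment_statement(table, column, description):
    """Build a SQL statement that sets a column comment.

    Assumes backslash escaping for quotes inside the string literal.
    """
    escaped = description.replace("\\", "\\\\").replace("'", "\\'")
    return f"ALTER TABLE {table} ALTER COLUMN {column} COMMENT '{escaped}'"


columns = [
    ("amt_net_rev_adj", None),
    ("adjusted_net_revenue", "Net revenue after adjustments"),
    ("region", "  "),
]

missing = columns_needing_descriptions(columns)
print(missing)
print(comment_statement("main.sales.orders", "amt_net_rev_adj", "Adjusted net revenue"))
```

Running a check like this across every table in a planned Genie Space gives you a to-do list of descriptions to write before anyone asks a question.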

You can build different Genie Spaces for different teams. One for finance. One for sales. One for operations. Each one has its own tables, its own vocabulary, and its own guardrails. This keeps a sales rep from accidentally querying financial tables they should not see, and it keeps Genie focused on the questions that actually matter to each group.

Security and Governance Are Built In, Not Bolted On

One worry that comes up every time you let non-technical users query data directly is access control. What happens if someone asks a question that would return data they are not supposed to see?

Genie handles this through Unity Catalog, which is Databricks' governance layer. According to the Databricks Genie documentation, each user's own Unity Catalog data permissions are applied to the query results. Row filters and column masks are automatically enforced per user. If a user does not have SELECT access to a table, they will not see results from that table, even if they ask Genie a question that would normally involve it.

This is not a new access control layer you have to build. It extends the permissions your team already set up in Unity Catalog. That makes the conversation with your security and compliance teams a lot shorter.
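
To see what per-user enforcement means in practice, here is a toy model in plain Python. It does not call Unity Catalog, and the users, groups, and rules are all made up. What it illustrates is that the same query returns different rows and values depending on who runs it, because the filter and mask are attached to the data rather than to the question.

```python
ROWS = [
    {"region": "EMEA", "customer": "Acme", "revenue": 120},
    {"region": "APAC", "customer": "Globex", "revenue": 95},
]

# Hypothetical users and the groups they belong to.
USER_GROUPS = {"ana": {"emea_sales"}, "raj": {"finance"}}


def row_filter(user, row):
    """Finance sees every row; regional sales sees only its own region."""
    groups = USER_GROUPS.get(user, set())
    return "finance" in groups or f"{row['region'].lower()}_sales" in groups


def mask_revenue(user, value):
    """Only finance sees the raw revenue figure; everyone else gets None."""
    return value if "finance" in USER_GROUPS.get(user, set()) else None


def query_as(user):
    """Apply the row filter, then the column mask, for the given user."""
    return [
        {**row, "revenue": mask_revenue(user, row["revenue"])}
        for row in ROWS
        if row_filter(user, row)
    ]


print(query_as("ana"))
print(query_as("raj"))
```

In this model a user with no group memberships gets an empty result rather than an error, which mirrors the idea that people simply do not see data they lack access to.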

Benchmarking: The Step Most Teams Skip

This is where a lot of Genie rollouts go wrong.

A team sets up a Genie Space, tries a few questions manually, gets answers that look right, and rolls it out to the business team. Then an executive asks something the space was not tested on, gets a weird result, and suddenly nobody trusts Genie anymore.

The Databricks team is direct about this: any AI effort should start with an evaluation phase. Failure to do so means failure in production.

Genie has a built-in benchmarking tool for exactly this reason. You write a list of test questions that represent the real questions users will ask. You add the correct SQL answer for each one. Genie runs its own queries and compares the results to yours.

According to Databricks' production readiness guide, the typical expectation is that Genie benchmarks should be above 80 percent accuracy before you move on to user acceptance testing. They also recommend adding two to four different phrasings of the same question, because users will not always ask the same question the same way.
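
The comparison logic behind a benchmark is simple enough to sketch. The example below is not how the built-in tool is implemented. It uses an in-memory SQLite database as a stand-in for a SQL warehouse, with made-up table contents, and scores each generated query by whether it returns the same rows as the reference query.

```python
import sqlite3


def run(conn, sql):
    """Run a query and return its rows sorted, so row order does not affect the comparison."""
    return sorted(conn.execute(sql).fetchall())


def benchmark_accuracy(conn, cases):
    """Fraction of cases whose generated SQL returns the same rows as the reference SQL.

    `cases` is a list of (generated_sql, reference_sql) pairs.
    """
    passed = 0
    for generated_sql, reference_sql in cases:
        try:
            if run(conn, generated_sql) == run(conn, reference_sql):
                passed += 1
        except sqlite3.Error:
            pass  # a query that fails to run counts as a miss
    return passed / len(cases)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (rep TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("Kim", "Asia", 300.0), ("Lee", "Asia", 450.0), ("Roy", "EMEA", 500.0)],
)

reference = "SELECT rep FROM sales WHERE region = 'Asia' ORDER BY amount DESC LIMIT 1"
cases = [
    # Different SQL, same answer: counts as a pass.
    (
        "SELECT rep FROM sales WHERE region = 'Asia' AND amount = "
        "(SELECT MAX(amount) FROM sales WHERE region = 'Asia')",
        reference,
    ),
    # Missing the region filter: returns a different rep, counts as a miss.
    ("SELECT rep FROM sales ORDER BY amount DESC LIMIT 1", reference),
]

accuracy = benchmark_accuracy(conn, cases)
print(f"accuracy = {accuracy:.0%}, above the 80 percent bar: {accuracy > 0.8}")
```

Comparing result sets rather than SQL text is the important choice here: two queries can be written very differently and still be equally correct.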

There is also an "Ask for Review" feature. If a user gets an answer they are not sure about, they can flag it. A space admin gets notified, reviews the SQL, and corrects it if needed. The user gets notified once the answer is verified. This feedback loop is how Genie gets better over time instead of drifting.

The October 2025 release notes also added a "Knowledge Extraction" feature. When a user gives a thumbs up to a generated query, Genie analyzes that interaction and proposes knowledge snippets such as metric definitions or filter patterns that the space admin can approve and add to the knowledge store.

That is a real improvement over tools that treat every question as if it is the first one.

What Good SQL Schema Documentation Does for Genie

This is worth its own section because it surprises a lot of engineers.

When you first set up a Genie Space, you will quickly discover that the quality of Genie's answers is almost entirely dependent on how well your tables and columns are documented. This is not a new idea. Good data teams have always known that schema documentation matters. Genie just makes that documentation pay off in a way that is immediately visible to everyone, not just other engineers.

Here is a practical example from the Databricks benchmarking blog. One team wanted Genie to calculate the "best sales rep in Asia." Genie kept failing that question. The fix was not a model update. It was adding a single example SQL query to the instructions page showing exactly how to calculate that metric. After that, Genie answered it correctly every time.

That is the pattern you will see over and over. The fix is almost never "change the model." It is "give Genie more context about what the question actually means."

Genie Code: Writing Dashboards With Natural Language

One feature that deserves more attention is Genie Code.

When you create an AI/BI Dashboard in Databricks, it automatically creates a companion Genie Space. But Genie Code goes a step further. It lets you write and edit the actual SQL and Python cells in your dashboard notebooks using natural language prompts.

Instead of writing a complex window function from scratch, you describe what you want in plain English and Genie writes the code. You review it, tweak it if needed, and move on. This is especially useful for analysts who know what they want but do not always remember the exact SQL syntax for a specific aggregation or join pattern.

This is part of the same thinking that drives tools like GitHub Copilot, but scoped specifically to the Databricks analytics environment with all the governance context already built in.

Who Benefits and How

The next-generation Genie announcement points to something real in how teams are using this. Customers created over 1.5 million Genie Spaces in 2026 alone. That adoption happened because different roles found different value in the same tool.

Business analysts and managers stop waiting. A question that used to take two days to get answered from the data team now takes thirty seconds. This is the most visible benefit, and it is the one that gets internal champions bought in.

Data engineers get time back. As Sigma Computing writes, the BI bottleneck is not just stressful, it also delays decisions that need to be made quickly. When business users can self-serve the common questions, data engineers can stay focused on the work that actually requires an engineer.

Data analysts turn their existing knowledge into a reusable asset. They set up the Genie Space once, document it well, add example queries, and the business team can self-serve on top of that work without sending messages every time.

Executives get faster decisions. Questions that need a quick answer before a meeting get an answer before the meeting.

Embedding Genie Outside of Databricks

One of the more practical things in the latest release is that Genie does not have to live only inside the Databricks workspace.

Using the Genie Conversation APIs, developers can embed Genie into Slack, Microsoft Teams, or custom internal applications. A sales team that never opens Databricks can ask questions directly from Slack and get back a chart and a summary without leaving the tool they already work in.
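
As a rough sketch of what the integration side looks like, the snippet below builds (but does not send) the HTTP request that starts a conversation. The endpoint path and the "content" field reflect my understanding of the public REST API and should be verified against the current Genie Conversation API reference. The workspace URL, token, and space ID are placeholders.

```python
import json
import urllib.request


def build_start_conversation_request(host, token, space_id, question):
    """Build the POST request that starts a Genie conversation, without sending it."""
    url = f"{host.rstrip('/')}/api/2.0/genie/spaces/{space_id}/start-conversation"
    body = json.dumps({"content": question}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


req = build_start_conversation_request(
    host="https://example.cloud.databricks.com",  # placeholder workspace URL
    token="dapi-REDACTED",                        # placeholder token
    space_id="SPACE_ID",                          # placeholder space ID
    question="What were total sales last week?",
)
print(req.get_method(), req.full_url)
```

A Slack or Teams bot would send this request, poll the resulting message until it finishes, and then post the returned text and query result back into the channel.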

The latest version of Genie also connects to enterprise knowledge sources like Google Drive and SharePoint, according to the next-gen Genie release post. This means Genie can now blend structured data from your Delta tables with unstructured content from documents to answer questions that used to require a human to piece together.

How This Connects to Broader AI Agent Work on Databricks

Genie is a great starting point, but it is part of a larger picture on the Databricks platform.

Once teams get comfortable with Genie handling their self-serve analytics layer, the next question that usually comes up is: what about workflows that go beyond answering questions? What about agents that can take action, run multi-step reasoning tasks, or be deployed as part of a production application?

That is where the Mosaic AI Agent Framework comes in. If you are thinking ahead to that kind of work, it is worth reading about how Mosaic AI handles evaluation, governance, and production deployment for AI agents on Databricks. The evaluation mindset is the same. The MLflow tracing and Unity Catalog governance carry over. But the scope is broader.

What You Need to Make Genie Work in Production

To be direct: setting up Genie is easy. Getting it to work well in production takes real work.

Here is what consistently makes the difference:

Clean, well-described tables. Column names and descriptions need to match how your business teams actually talk. If marketing calls something "activation rate" and your table calls it usr_actv_rt_wk, Genie will have trouble making that connection.

Real example queries. The example queries in a Genie Space teach Genie how to handle your organization's specific metric logic. The more representative they are, the better Genie handles questions it has never seen before.

A benchmark set before launch. According to Databricks' own best practices, most Genie Spaces should reach above 80 percent benchmark accuracy before they go to user testing. That bar exists for a reason. Missing it means users lose trust quickly, and that trust is hard to rebuild.

Someone who owns the space long term. Genie Spaces need a person responsible for reviewing flagged responses, updating example queries as data changes, and approving knowledge snippets from user feedback. Without that owner, quality drifts.

Proper Unity Catalog setup. If your tables are not already in Unity Catalog with access controls in place, that needs to happen first. Genie's governance layer depends on it.

A lot of teams underestimate how much foundational data engineering work feeds into a good Genie rollout. If your team is already stretched thin on that infrastructure layer, it can make sense to bring in specialized help. That is why some teams choose to hire experienced data engineers who already understand how the Databricks ecosystem fits together, rather than trying to figure it out while also building the Genie Space.

Where to Start

If you already have a Databricks SQL workspace, you can create a Genie Space today. No extra license. No new tool to install.

Start small. Pick one team, one topic, and a focused set of tables. Write clear column descriptions. Add ten to fifteen example queries that cover the most common patterns. Build a benchmark test set before you open it to users. Then release it to a small group and watch what they ask.

The questions that Genie cannot answer well are your roadmap for improving the space. That feedback loop (questions, failures, fixes) is how good Genie Spaces are built over time. It is the same loop that any good data product depends on. Genie just makes each iteration faster and more visible.

Final Thought

Genie is not magic. It is a well-engineered system that works best when the data behind it is clean, documented, and governed correctly.

The teams that get the most out of it are the ones that treat the Genie Space setup like they treat any other production data product. That means documentation, testing, ownership, and a willingness to iterate based on real user feedback.

That is not a high bar. It is the same bar good data teams already hold themselves to. Genie just gives them a way to deliver the output of that work directly to the people who need it, without requiring a SQL ticket for every question.

Have you set up a Genie Space yet? What was the hardest part of the setup? Drop a comment. Real-world experience from different environments is always useful.
