# Databricks File System (DBFS) + Connecting to Cloud Storage
You've got a cluster running and notebooks ready to go. Now the real question: where does your data actually live?
In this article we'll cover the Databricks File System (DBFS), how it relates to your cloud storage buckets, and how to connect Databricks to AWS S3, Azure Data Lake Storage, and Google Cloud Storage.
## What DBFS Is and How It Works
DBFS (Databricks File System) is a distributed file system that is mounted into every Databricks workspace and accessible from every cluster.
When you interact with files using paths like `/dbfs/...` or `dbfs:/...` in your notebooks, you're using DBFS.
Here's the key thing to understand: DBFS is not a storage system itself. It's a layer that abstracts access to the actual storage underneath.
```
Your Code (notebook)
        ↓
     DBFS API
        ↓
┌─────────────────────────────────────┐
│           Actual Storage            │
│ ├── DBFS Root (cloud object store)  │
│ ├── Mounted S3 buckets              │
│ ├── Mounted ADLS containers         │
│ └── Mounted GCS buckets             │
└─────────────────────────────────────┘
```
In Community Edition, the DBFS root is Databricks-managed storage. In a full cloud deployment, the DBFS root is backed by an object storage bucket in your own cloud account (S3, ADLS, or GCS).
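The two path forms matter in practice: `dbfs:/...` paths are understood by Spark and `dbutils`, while `/dbfs/...` is the local (FUSE) view of the same files for plain Python file APIs, on cluster types where the FUSE mount is available. As a sketch, the mapping between the two is just a prefix rewrite — `to_fuse_path` here is a hypothetical helper, not a Databricks API:

```python
def to_fuse_path(dbfs_uri: str) -> str:
    """Convert a dbfs:/ URI to the equivalent /dbfs/ local (FUSE) path."""
    prefix = "dbfs:/"
    if dbfs_uri.startswith(prefix):
        return "/dbfs/" + dbfs_uri[len(prefix):].lstrip("/")
    return dbfs_uri  # already a local-style path; leave untouched

# The same file, seen through both APIs (on a cluster with the FUSE mount):
# spark.read.text("dbfs:/tmp/hello.txt")               # Spark path
# open(to_fuse_path("dbfs:/tmp/hello.txt")).read()     # local Python file API
```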
## DBFS vs Cloud Object Storage
This is where people get confused. Let's clear it up:
| | DBFS | Cloud Object Storage (S3/ADLS/GCS) |
|---|---|---|
| What it is | A virtual file system layer | The actual storage infrastructure |
| Who manages it | Databricks | You (in your cloud account) |
| Access method | `dbfs:/` paths in notebooks | Direct SDK or mounted via DBFS |
| Persistence | Tied to your workspace | Independent, survives workspace deletion |
| Best for | Temp files, libraries, samples | Production data |
The golden rule: your production data should always live in your own cloud storage — not in the DBFS root managed by Databricks. If your workspace gets deleted or recreated, you don't want your data to disappear with it.
DBFS is great for:
- Storing temporary files during processing
- Installing libraries on clusters
- Accessing Databricks sample datasets (`/databricks-datasets/`)
Your actual data (raw files, Delta tables, outputs) should live in S3, ADLS, or GCS — and be mounted into Databricks via DBFS mounts.
## Navigating DBFS in a Notebook
You can interact with DBFS directly from notebooks using `dbutils.fs`:

```python
# List files in a directory
dbutils.fs.ls("dbfs:/")

# List Databricks sample datasets
dbutils.fs.ls("/databricks-datasets/")

# Create a directory
dbutils.fs.mkdirs("/tmp/my-project/")

# Copy a file
dbutils.fs.cp("/databricks-datasets/airlines/part-00000", "/tmp/airlines-sample.csv")

# Delete a file or directory
dbutils.fs.rm("/tmp/airlines-sample.csv")

# Check if a path exists (ls raises if the path is missing)
try:
    dbutils.fs.ls("/tmp/my-project/")
    print("Path exists")
except Exception:
    print("Path does not exist")
```
You can also use the `%fs` magic command for quick navigation:

```
%fs ls /databricks-datasets/
%fs ls /tmp/
```
## Reading Files from DBFS in a Notebook
Once files are in DBFS (or mounted from cloud storage), reading them is straightforward:
```python
# Read a CSV file
df = spark.read.csv("/databricks-datasets/airlines/part-00000", header=True, inferSchema=True)
df.show(5)

# Read a JSON file
df = spark.read.json("/tmp/my-data/events.json")

# Read a Parquet file (preferred format for performance)
df = spark.read.parquet("/mnt/mydata/transactions/")

# Read with an explicit schema (faster and safer than inferSchema)
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("amount", DoubleType(), True)
])
df = spark.read.csv("/mnt/mydata/sales.csv", schema=schema, header=True)
```
💡 Always prefer explicit schemas over `inferSchema=True` in production. `inferSchema` reads the data twice (once to guess types, once to load it) and can get types wrong.
## Mounting Cloud Storage to Databricks
Mounting means creating a shortcut in DBFS that points to a folder in your cloud storage. Once mounted, you access your cloud data using simple `/mnt/...` paths instead of long cloud-specific URIs.

Without mount: `s3://my-company-bucket/raw/sales/2024/`
With mount: `/mnt/raw/sales/2024/`

Much cleaner — and your notebooks don't need to know which cloud you're on.
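Conceptually, a mount is just a prefix rewrite: the cloud URI prefix is swapped for the mount point. A minimal sketch of that idea (`to_mount_path` and the bucket name are illustrative, not a Databricks API):

```python
def to_mount_path(uri: str, mounts: dict) -> str:
    """Rewrite a cloud URI to its mount path if a known mount covers it (illustrative)."""
    for source, mount_point in mounts.items():
        if uri.startswith(source):
            return mount_point.rstrip("/") + "/" + uri[len(source):]
    return uri  # no mount covers this URI

# Hypothetical bucket mounted at /mnt/
mounts = {"s3://my-company-bucket/": "/mnt/"}
print(to_mount_path("s3://my-company-bucket/raw/sales/2024/", mounts))
# → /mnt/raw/sales/2024/
```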
### Mounting AWS S3
Step 1 — Create an IAM role or access key with read/write access to your S3 bucket.
Step 2 — Mount using an instance profile (recommended for production on AWS):

```python
dbutils.fs.mount(
    source = "s3a://your-bucket-name/",
    mount_point = "/mnt/s3-data",
    extra_configs = {"fs.s3a.aws.credentials.provider":
                     "com.amazonaws.auth.InstanceProfileCredentialsProvider"}
)
```
Step 3 — Or mount with access keys stored in Databricks Secrets (never hardcode credentials in notebooks):

```python
dbutils.fs.mount(
    source = "s3a://your-bucket-name/",
    mount_point = "/mnt/s3-data",
    extra_configs = {
        "fs.s3a.access.key": dbutils.secrets.get(scope="aws", key="access-key"),
        "fs.s3a.secret.key": dbutils.secrets.get(scope="aws", key="secret-key")
    }
)
```
### Mounting Azure Data Lake Storage (ADLS Gen2)

```python
dbutils.fs.mount(
    source = "abfss://your-container@your-account.dfs.core.windows.net/",
    mount_point = "/mnt/adls-data",
    extra_configs = {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id":
            dbutils.secrets.get(scope="azure", key="client-id"),
        "fs.azure.account.oauth2.client.secret":
            dbutils.secrets.get(scope="azure", key="client-secret"),
        "fs.azure.account.oauth2.client.endpoint":
            "https://login.microsoftonline.com/your-tenant-id/oauth2/token"
    }
)
```
### Mounting Google Cloud Storage (GCS)

```python
dbutils.fs.mount(
    source = "gs://your-bucket-name/",
    mount_point = "/mnt/gcs-data",
    extra_configs = {
        "fs.gs.auth.service.account.enable": "true",
        "google.cloud.auth.service.account.json.keyfile":
            dbutils.secrets.get(scope="gcp", key="service-account-key")
    }
)
```
### Managing Your Mounts

```python
# List all current mounts
dbutils.fs.mounts()

# Check a specific mount
dbutils.fs.ls("/mnt/s3-data/")

# Unmount when no longer needed
dbutils.fs.unmount("/mnt/s3-data")
```
💡 Mounts persist across cluster restarts — you only need to create them once per workspace. A common pattern is to have a `setup` notebook that creates all mounts, which you run once when setting up a new workspace.
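Such a setup notebook should be safe to re-run, since mounting an already-mounted path raises an error. One way to sketch that, using the real `dbutils.fs.mounts()` / `dbutils.fs.mount()` calls inside a hypothetical `mount_if_missing` helper:

```python
def mount_if_missing(dbutils, source: str, mount_point: str, extra_configs=None) -> bool:
    """Mount cloud storage only if mount_point isn't already mounted.
    Returns True if a new mount was created. (Hypothetical helper, not a Databricks API.)"""
    existing = {m.mountPoint for m in dbutils.fs.mounts()}
    if mount_point in existing:
        print(f"{mount_point} is already mounted, skipping")
        return False
    dbutils.fs.mount(source=source, mount_point=mount_point,
                     extra_configs=extra_configs or {})
    return True

# In a setup notebook you'd call it once per mount, e.g.:
# mount_if_missing(dbutils, "s3a://your-bucket-name/", "/mnt/s3-data", {...})
```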
## Best Practices for Data Organization
Before you start dumping data into your storage, take 5 minutes to plan your folder structure. You'll thank yourself later.
A clean, consistent structure to follow:
```
/mnt/
├── raw/          ← Bronze layer: raw files exactly as received
│   ├── sales/
│   │   ├── 2024/01/
│   │   └── 2024/02/
│   └── customers/
│
├── processed/    ← Silver layer: cleaned Delta tables
│   ├── sales/
│   └── customers/
│
└── curated/      ← Gold layer: aggregated, business-ready tables
    ├── daily_revenue/
    └── customer_summary/
```
This maps directly to the Medallion Architecture we'll build in articles 9 and 10.
A few rules worth following from day one:
**Use Parquet or Delta format** — avoid CSV and JSON for anything beyond ingestion. Parquet is columnar, compressed, and much faster to query.
**Partition large tables by date** — if you have years of data, partition by year/month so queries only scan what they need:

```
/mnt/processed/sales/year=2024/month=01/
/mnt/processed/sales/year=2024/month=02/
```
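You don't build these directories by hand — with Spark, `partitionBy` creates the Hive-style `key=value` layout for you. A sketch of the naming convention itself, with a purely illustrative `partition_path` helper (the write call in the comment assumes a `df` and paths from this article's examples):

```python
# With Spark, the partition directories are created on write:
#   df.write.format("delta").partitionBy("year", "month").save("/mnt/processed/sales/")
def partition_path(base: str, year: int, month: int) -> str:
    """Build a Hive-style partition path matching the layout above (illustrative)."""
    return f"{base.rstrip('/')}/year={year:04d}/month={month:02d}/"

print(partition_path("/mnt/processed/sales", 2024, 1))
# → /mnt/processed/sales/year=2024/month=01/
```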
**Never store raw and processed data in the same folder** — keep layers separated. Mixing them leads to confusion and accidental overwrites.

**Use lowercase, hyphens or underscores** — avoid spaces and special characters in folder names. They cause problems across different tools and operating systems.
## Wrapping Up
Here's what to take away from this article:
- DBFS is a virtual file system layer — it abstracts access to real storage underneath
- Your production data should live in your own cloud storage (S3, ADLS, GCS), not in the DBFS root
- Use `dbutils.fs` to navigate and manage files in DBFS from notebooks
- Mounting creates a `/mnt/...` shortcut to your cloud storage — much cleaner than raw cloud URIs
- Plan your folder structure from the start: raw → processed → curated maps to Bronze → Silver → Gold
- Always use Databricks Secrets for credentials — never hardcode them in notebooks
In the next article we get into the heart of the work: DataFrames and SQL in Databricks — reading, writing, and transforming data.