# Databricks File System (DBFS) + Connecting to Cloud Storage
You've got a cluster running and notebooks ready to go. Now the real question: where does your data actually live?
In this article we'll cover the Databricks File System (DBFS), how it relates to your cloud storage buckets, and how to connect Databricks to AWS S3, Azure Data Lake Storage, and Google Cloud Storage.
## What DBFS Is and How It Works
DBFS (Databricks File System) is a distributed file system that is mounted into every Databricks workspace and accessible from every cluster.
When you interact with files using paths like `/dbfs/...` or `dbfs:/...` in your notebooks, you're using DBFS.
Here's the key thing to understand: DBFS is not a storage system itself. It's a layer that abstracts access to the actual storage underneath.
```
Your Code (notebook)
        ↓
     DBFS API
        ↓
┌─────────────────────────────────────┐
│           Actual Storage            │
│ ├── DBFS Root (cloud object store)  │
│ ├── Mounted S3 buckets              │
│ ├── Mounted ADLS containers         │
│ └── Mounted GCS buckets             │
└─────────────────────────────────────┘
```
In Community Edition, the DBFS root is Databricks-managed storage. In a full cloud deployment, the DBFS root is backed by an object storage bucket in your own cloud account (S3, ADLS, or GCS).
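The two path forms matter in practice: `dbfs:/...` paths are understood by Spark and `dbutils`, while `/dbfs/...` is the local (FUSE) view of the same files for plain Python file APIs, on cluster types where the FUSE mount is available. As a sketch, the mapping between the two is just a prefix rewrite — `to_fuse_path` here is a hypothetical helper, not a Databricks API:

```python
def to_fuse_path(dbfs_uri: str) -> str:
    """Convert a dbfs:/ URI to the equivalent /dbfs/ local (FUSE) path."""
    prefix = "dbfs:/"
    if dbfs_uri.startswith(prefix):
        return "/dbfs/" + dbfs_uri[len(prefix):].lstrip("/")
    return dbfs_uri  # already a local-style path; leave untouched

# The same file, seen through both APIs (on a cluster with the FUSE mount):
# spark.read.text("dbfs:/tmp/hello.txt")               # Spark path
# open(to_fuse_path("dbfs:/tmp/hello.txt")).read()     # local Python file API
```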
## DBFS vs Cloud Object Storage
This is where people get confused. Let's clear it up:
| | DBFS | Cloud Object Storage (S3/ADLS/GCS) |
|---|---|---|
| What it is | A virtual file system layer | The actual storage infrastructure |
| Who manages it | Databricks | You (in your cloud account) |
| Access method | `dbfs:/` paths in notebooks | Direct SDK or mounted via DBFS |
| Persistence | Tied to your workspace | Independent, survives workspace deletion |
| Best for | Temp files, libraries, samples | Production data |
The golden rule: your production data should always live in your own cloud storage — not in the DBFS root managed by Databricks. If your workspace gets deleted or recreated, you don't want your data to disappear with it.
DBFS is great for:
- Storing temporary files during processing
- Installing libraries on clusters
- Accessing Databricks sample datasets (`/databricks-datasets/`)
Your actual data (raw files, Delta tables, outputs) should live in S3, ADLS, or GCS — and be mounted into Databricks via DBFS mounts.
## Navigating DBFS in a Notebook
You can interact with DBFS directly from notebooks using `dbutils.fs`:

```python
# List files in a directory
dbutils.fs.ls("dbfs:/")

# List Databricks sample datasets
dbutils.fs.ls("/databricks-datasets/")

# Create a directory
dbutils.fs.mkdirs("/tmp/my-project/")

# Copy a file
dbutils.fs.cp("/databricks-datasets/airlines/part-00000", "/tmp/airlines-sample.csv")

# Delete a file or directory
dbutils.fs.rm("/tmp/airlines-sample.csv")

# Check if a path exists (ls raises if the path is missing)
try:
    dbutils.fs.ls("/tmp/my-project/")
    print("Path exists")
except Exception:
    print("Path does not exist")
```
You can also use the `%fs` magic command for quick navigation:

```
%fs ls /databricks-datasets/
%fs ls /tmp/
```
## Reading Files from DBFS in a Notebook
Once files are in DBFS (or mounted from cloud storage), reading them is straightforward:
```python
# Read a CSV file
df = spark.read.csv("/databricks-datasets/airlines/part-00000", header=True, inferSchema=True)
df.show(5)

# Read a JSON file
df = spark.read.json("/tmp/my-data/events.json")

# Read a Parquet file (preferred format for performance)
df = spark.read.parquet("/mnt/mydata/transactions/")

# Read with an explicit schema (faster and safer than inferSchema)
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("amount", DoubleType(), True)
])
df = spark.read.csv("/mnt/mydata/sales.csv", schema=schema, header=True)
```
💡 Always prefer explicit schemas over `inferSchema=True` in production. `inferSchema` reads the data twice (once to guess types, once to load it) and can get types wrong.
## Mounting Cloud Storage to Databricks
Mounting means creating a shortcut in DBFS that points to a folder in your cloud storage. Once mounted, you access your cloud data using simple `/mnt/...` paths instead of long cloud-specific URIs.

Without mount: `s3://my-company-bucket/raw/sales/2024/`
With mount: `/mnt/raw/sales/2024/`

Much cleaner — and your notebooks don't need to know which cloud you're on.
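Conceptually, a mount is just a prefix rewrite: the cloud URI prefix is swapped for the mount point. A minimal sketch of that idea (`to_mount_path` and the bucket name are illustrative, not a Databricks API):

```python
def to_mount_path(uri: str, mounts: dict) -> str:
    """Rewrite a cloud URI to its mount path if a known mount covers it (illustrative)."""
    for source, mount_point in mounts.items():
        if uri.startswith(source):
            return mount_point.rstrip("/") + "/" + uri[len(source):]
    return uri  # no mount covers this URI

# Hypothetical bucket mounted at /mnt/
mounts = {"s3://my-company-bucket/": "/mnt/"}
print(to_mount_path("s3://my-company-bucket/raw/sales/2024/", mounts))
# → /mnt/raw/sales/2024/
```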
### Mounting AWS S3
Step 1 — Create an IAM role or access key with read/write access to your S3 bucket.
Step 2 — Mount using an instance profile (recommended for production on AWS):

```python
dbutils.fs.mount(
    source = "s3a://your-bucket-name/",
    mount_point = "/mnt/s3-data",
    extra_configs = {"fs.s3a.aws.credentials.provider":
                     "com.amazonaws.auth.InstanceProfileCredentialsProvider"}
)
```
Step 3 — Or mount with access keys stored in Databricks Secrets (never hardcode credentials in notebooks):

```python
dbutils.fs.mount(
    source = "s3a://your-bucket-name/",
    mount_point = "/mnt/s3-data",
    extra_configs = {
        "fs.s3a.access.key": dbutils.secrets.get(scope="aws", key="access-key"),
        "fs.s3a.secret.key": dbutils.secrets.get(scope="aws", key="secret-key")
    }
)
```
### Mounting Azure Data Lake Storage (ADLS Gen2)

```python
dbutils.fs.mount(
    source = "abfss://your-container@your-account.dfs.core.windows.net/",
    mount_point = "/mnt/adls-data",
    extra_configs = {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id":
            dbutils.secrets.get(scope="azure", key="client-id"),
        "fs.azure.account.oauth2.client.secret":
            dbutils.secrets.get(scope="azure", key="client-secret"),
        "fs.azure.account.oauth2.client.endpoint":
            "https://login.microsoftonline.com/your-tenant-id/oauth2/token"
    }
)
```
### Mounting Google Cloud Storage (GCS)

```python
dbutils.fs.mount(
    source = "gs://your-bucket-name/",
    mount_point = "/mnt/gcs-data",
    extra_configs = {
        "fs.gs.auth.service.account.enable": "true",
        "google.cloud.auth.service.account.json.keyfile":
            dbutils.secrets.get(scope="gcp", key="service-account-key")
    }
)
```
### Managing Your Mounts

```python
# List all current mounts
dbutils.fs.mounts()

# Check a specific mount
dbutils.fs.ls("/mnt/s3-data/")

# Unmount when no longer needed
dbutils.fs.unmount("/mnt/s3-data")
```
💡 Mounts persist across cluster restarts — you only need to create them once per workspace. A common pattern is to have a `setup` notebook that creates all mounts, which you run once when setting up a new workspace.
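Such a setup notebook should be safe to re-run, since mounting an already-mounted path raises an error. One way to sketch that, using the real `dbutils.fs.mounts()` / `dbutils.fs.mount()` calls inside a hypothetical `mount_if_missing` helper:

```python
def mount_if_missing(dbutils, source: str, mount_point: str, extra_configs=None) -> bool:
    """Mount cloud storage only if mount_point isn't already mounted.
    Returns True if a new mount was created. (Hypothetical helper, not a Databricks API.)"""
    existing = {m.mountPoint for m in dbutils.fs.mounts()}
    if mount_point in existing:
        print(f"{mount_point} is already mounted, skipping")
        return False
    dbutils.fs.mount(source=source, mount_point=mount_point,
                     extra_configs=extra_configs or {})
    return True

# In a setup notebook you'd call it once per mount, e.g.:
# mount_if_missing(dbutils, "s3a://your-bucket-name/", "/mnt/s3-data", {...})
```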
## Best Practices for Data Organization
Before you start dumping data into your storage, take 5 minutes to plan your folder structure. You'll thank yourself later.
A clean, consistent structure to follow:
```
/mnt/
├── raw/          ← Bronze layer: raw files exactly as received
│   ├── sales/
│   │   ├── 2024/01/
│   │   └── 2024/02/
│   └── customers/
│
├── processed/    ← Silver layer: cleaned Delta tables
│   ├── sales/
│   └── customers/
│
└── curated/      ← Gold layer: aggregated, business-ready tables
    ├── daily_revenue/
    └── customer_summary/
```
This maps directly to the Medallion Architecture we'll build in articles 9 and 10.
A few rules worth following from day one:
**Use Parquet or Delta format** — avoid CSV and JSON for anything beyond ingestion. Parquet is columnar, compressed, and much faster to query.
**Partition large tables by date** — if you have years of data, partition by year/month so queries only scan what they need:

```
/mnt/processed/sales/year=2024/month=01/
/mnt/processed/sales/year=2024/month=02/
```
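You don't build these directories by hand — with Spark, `partitionBy` creates the Hive-style `key=value` layout for you. A sketch of the naming convention itself, with a purely illustrative `partition_path` helper (the write call in the comment assumes a `df` and paths from this article's examples):

```python
# With Spark, the partition directories are created on write:
#   df.write.format("delta").partitionBy("year", "month").save("/mnt/processed/sales/")
def partition_path(base: str, year: int, month: int) -> str:
    """Build a Hive-style partition path matching the layout above (illustrative)."""
    return f"{base.rstrip('/')}/year={year:04d}/month={month:02d}/"

print(partition_path("/mnt/processed/sales", 2024, 1))
# → /mnt/processed/sales/year=2024/month=01/
```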
**Never store raw and processed data in the same folder** — keep layers separated. Mixing them leads to confusion and accidental overwrites.

**Use lowercase, hyphens or underscores** — avoid spaces and special characters in folder names. They cause problems across different tools and operating systems.
## Wrapping Up
Here's what to take away from this article:
- DBFS is a virtual file system layer — it abstracts access to real storage underneath
- Your production data should live in your own cloud storage (S3, ADLS, GCS), not in the DBFS root
- Use `dbutils.fs` to navigate and manage files in DBFS from notebooks
- Mounting creates a `/mnt/...` shortcut to your cloud storage — much cleaner than raw cloud URIs
- Plan your folder structure from the start: raw → processed → curated maps to Bronze → Silver → Gold
- Always use Databricks Secrets for credentials — never hardcode them in notebooks
In the next article we get into the heart of the work: DataFrames and SQL in Databricks — reading, writing, and transforming data.