A cluster is a set of computers that Databricks provisions to run your code.
When you write Python or SQL in a notebook, it doesn't run on your laptop. It runs on these machines — in the cloud — powered by Apache Spark.
No cluster = nowhere to run your code. It's that simple.
Think of a cluster like a rented server room. You define how powerful it should be, spin it up, use it, and shut it down. With Databricks, all the infrastructure management is handled for you.
The Anatomy of a Cluster
A Databricks cluster has two types of nodes:
┌─────────────────────────────────────┐
│              CLUSTER                │
│                                     │
│   ┌──────────────┐                  │
│   │ Driver Node  │ ← Your code      │
│   │ (1 machine)  │   runs here      │
│   └──────┬───────┘   first          │
│          │                          │
│   ┌──────▼───────┐                  │
│   │ Worker Nodes │ ← Heavy lifting  │
│   │ (1-N machines│   done here      │
│   └──────────────┘                  │
└─────────────────────────────────────┘
Driver node: Coordinates the work. Receives your code, builds the execution plan, and distributes tasks to workers. There's always exactly one.
Worker nodes: Execute the actual data processing in parallel. You can have anywhere from zero (a single-node cluster) to hundreds of workers, depending on your needs.
💡 In Community Edition, you always get a single-node cluster — driver only, no workers. It's enough for learning but not for production-scale data.
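The driver/worker division of labor can be sketched in plain Python (this is not Spark — just an analogy): one coordinator splits the data into partitions, parallel workers each process a slice, and the coordinator combines the partial results.

```python
# Conceptual sketch only: the "driver" splits the job, "workers" run in
# parallel, the driver combines the results. Spark does this across machines.
from concurrent.futures import ThreadPoolExecutor

def process_partition(partition):
    # A worker's share of the job — here, just summing its slice
    return sum(partition)

data = list(range(100))
partitions = [data[i::4] for i in range(4)]   # driver splits work into 4 parts

with ThreadPoolExecutor(max_workers=4) as pool:   # the "worker nodes"
    partial_sums = list(pool.map(process_partition, partitions))

total = sum(partial_sums)   # driver combines the partial results
print(total)  # 4950
```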
All-Purpose vs Job Clusters
There are two types of clusters in Databricks, and picking the right one matters:
| | All-Purpose Cluster | Job Cluster |
|---|---|---|
| Use case | Interactive development | Running automated pipelines |
| Lifecycle | You start/stop manually | Starts and stops per job run |
| Cost | More expensive (sits idle) | Cheaper (only runs when needed) |
| Best for | Notebooks, exploration, debugging | Scheduled workflows in production |
Rule of thumb: Use all-purpose clusters while you're developing. Switch to job clusters when you automate.
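To make the lifecycle difference concrete, here is a hedged sketch of a Databricks Jobs API request body. The job name, notebook path, and node type below are hypothetical; the point is the `new_cluster` field — the cluster is declared inside the job itself, so it is created for the run and terminated afterwards.

```python
# Hedged sketch of a Jobs API request body (names are hypothetical).
# A job cluster lives inside the job definition: created per run, then gone.
job_spec = {
    "name": "nightly-sales-pipeline",
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Repos/pipelines/ingest"},
            "new_cluster": {                    # job cluster: exists only for this run
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "Standard_DS3_v2",
                "num_workers": 2,
            },
        }
    ],
}
```

Compare this with an all-purpose cluster, which you create once under Compute and attach notebooks to interactively.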
Cluster Configuration: What Actually Matters
When creating a cluster, you'll face a wall of settings. Here's what to actually pay attention to:
Databricks Runtime Version
This is the software stack that runs on your cluster — Spark version, Python version, and pre-installed libraries.
Example: 13.3 LTS (Spark 3.4.1, Scala 2.12)
- LTS (Long Term Support): Stable, well-tested. Use this for production.
- Latest: Newest features but less tested. Fine for experimentation.
- ML Runtime: Comes with ML libraries (TensorFlow, PyTorch, scikit-learn) pre-installed.
💡 When in doubt, pick the most recent LTS version.
Node Type
The size of each machine. More CPU and RAM = faster processing = higher cost.
For learning: pick the smallest available option. You won't notice the difference on small datasets.
Autoscaling
When enabled, Databricks automatically adds or removes worker nodes based on workload.
Min workers: 2
Max workers: 8
Great for production — cost-efficient and handles variable loads. For development, just set a fixed number to avoid surprises on your bill.
Auto Termination
Automatically shuts down the cluster after N minutes of inactivity. Always enable this. Forgetting to shut down a cluster is the fastest way to burn through credits.
Terminate after: 30 minutes of inactivity
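The settings above map onto fields in the Databricks Clusters API. Here is a hedged sketch of a create-cluster request body — the field names follow the Clusters API, while the cluster name and node type are illustrative:

```python
# Hedged sketch of a Clusters API create request body covering the settings
# discussed above. Cluster name and node type are hypothetical examples.
cluster_config = {
    "cluster_name": "dev-exploration",
    "spark_version": "13.3.x-scala2.12",                 # a recent LTS runtime
    "node_type_id": "Standard_DS3_v2",                   # smallest node that fits
    "autoscale": {"min_workers": 2, "max_workers": 8},   # or "num_workers" for fixed size
    "autotermination_minutes": 30,                       # always set this
}
```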
Creating and Attaching a Cluster
Create:
- Go to Compute → Create Cluster
- Set a name, runtime version, and node type
- Enable auto-termination (30–60 min recommended)
- Click Create Cluster
Attach to a notebook:
- Open your notebook
- In the top bar, click the cluster dropdown (it'll say "Detached")
- Select your running cluster
That's it. Your notebook is now connected and ready to run code.
Part 2: Notebooks
What is a Databricks Notebook?
A Databricks notebook is an interactive document where you write and execute code, see outputs, and document your work — all in one place.
If you've used Jupyter notebooks, the concept is identical. If you haven't, think of it as a document where each section (called a cell) is a block of runnable code.
Notebooks are where you'll spend most of your time in Databricks: exploring data, writing transformations, debugging pipelines, and building out logic before automating it.
Cell Types
Every cell in a notebook has a type. The default type is set when you create the notebook (Python, SQL, Scala, or R), but you can override it per cell using magic commands.
| Magic Command | What it does |
|---|---|
| %python | Run Python code |
| %sql | Run SQL and display results as a table |
| %scala | Run Scala code |
| %r | Run R code |
| %md | Write Markdown (documentation) |
| %sh | Run shell commands |
| %fs | Interact with the Databricks file system |
| %run | Run another notebook inline |
Example of mixing languages in one notebook:
# This cell is Python (default)
df = spark.read.csv("/databricks-datasets/airlines/part-00000")
df.printSchema()
%sql
-- the %sql magic command switches this cell to SQL
SELECT origin, COUNT(*) as total_flights
FROM airlines
GROUP BY origin
ORDER BY total_flights DESC
LIMIT 10
%md
## Results
The table above shows the top 10 busiest origin airports in the dataset.
Useful Keyboard Shortcuts
| Action | Shortcut |
|---|---|
| Run current cell and move to next | Shift + Enter |
| Run current cell and stay | Ctrl + Enter |
| Add cell below | B (in command mode) |
| Add cell above | A (in command mode) |
| Delete cell | D D (double tap D) |
| Enter command mode | Esc |
| Enter edit mode | Enter |
| Auto-complete | Tab |
| Comment/uncomment | Ctrl + / |
💡 Press Esc to enter command mode (navigate between cells), Enter to go back to edit mode (type inside a cell). Same as Jupyter.
Notebook Best Practices
A few habits that will make your notebooks cleaner and easier to maintain:
1. Always start with a description cell
Use %md to explain what the notebook does, who wrote it, and when.
%md
## Bronze Ingestion — Sales Data
Ingests raw sales CSV files from ADLS into the Bronze Delta table.
**Author**: Your Name
**Last updated**: 2024-01-15
2. One notebook, one responsibility
Don't build your entire pipeline in a single notebook. Split by concern: one for ingestion, one for transformation, one for aggregation.
3. Use display() instead of show()
df.show() prints plain text. display(df) renders an interactive table with sorting, filtering, and chart options.
# Meh
df.show()
# Better
display(df)
4. Clear outputs before committing
If you're storing notebooks in Git, clear all cell outputs first. Output blobs make diffs unreadable.
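Databricks Repos can also commit notebooks as plain source files, which sidesteps the problem entirely. If you work with .ipynb exports instead, outputs can be stripped with a short stdlib script — a sketch, not an official tool:

```python
import json

def clear_outputs(nb):
    """Remove outputs and execution counts from a Jupyter-format notebook dict."""
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return nb

# Tiny in-memory example; a real script would json.load(...) a .ipynb file
nb = {"cells": [{"cell_type": "code", "source": "1 + 1",
                 "outputs": [{"data": {"text/plain": "2"}}],
                 "execution_count": 3}]}
cleaned = clear_outputs(nb)
print(json.dumps(cleaned["cells"][0]["outputs"]))  # prints "[]"
```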
The dbutils Library
One more thing worth knowing: dbutils is a Databricks utility library available in every notebook. It gives you quick access to common operations:
# List files in DBFS
dbutils.fs.ls("/databricks-datasets/")
# Mount cloud storage
dbutils.fs.mount(
source = "wasbs://container@account.blob.core.windows.net",
mount_point = "/mnt/mydata"
)
# Pass variables between notebooks
dbutils.widgets.text("input_date", "2024-01-01")
input_date = dbutils.widgets.get("input_date")
# Exit a notebook and return a value (used in Workflows)
dbutils.notebook.exit("success")
You'll use dbutils constantly — especially dbutils.fs for navigating the file system and dbutils.notebook for chaining notebooks together in pipelines.
Wrapping Up
Here's what to take away from this article:
- A cluster is the compute infrastructure that runs your code — always a driver node, optionally worker nodes
- All-purpose clusters are for development; job clusters are for production pipelines
- Key config settings: Runtime version, node type, autoscaling, and auto-termination
- A notebook is your interactive coding environment — cells can mix Python, SQL, Scala, and Markdown
- display() beats show(), and dbutils is your best friend for file system and notebook operations
In the next article, we go deeper under the hood: Apache Spark in plain English — the engine that makes everything you've set up actually fast and scalable.