DEV Community

Vinicius Fagundes

Clusters & Notebooks: Your Databricks Workspace Explained

A cluster is a set of computers that Databricks provisions to run your code.

When you write Python or SQL in a notebook, it doesn't run on your laptop. It runs on these machines — in the cloud — powered by Apache Spark.

No cluster = nowhere to run your code. It's that simple.

Think of a cluster like a rented server room. You define how powerful it should be, spin it up, use it, and shut it down. With Databricks, all the infrastructure management is handled for you.


The Anatomy of a Cluster

A Databricks cluster has two types of nodes:

┌─────────────────────────────────────┐
│           CLUSTER                   │
│                                     │
│   ┌──────────────┐                  │
│   │  Driver Node │  ← Your code     │
│   │  (1 machine) │    runs here     │
│   └──────┬───────┘    first         │
│          │                          │
│   ┌──────▼───────┐                  │
│   │ Worker Nodes │  ← Heavy lifting │
│   │(1-N machines)│    done here     │
│   └──────────────┘                  │
└─────────────────────────────────────┘

Driver node: Coordinates the work. Receives your code, builds the execution plan, and distributes tasks to workers. There's always exactly one.

Worker nodes: Execute the actual data processing in parallel. You can have zero (single-node) to hundreds of workers depending on your needs.

💡 In Community Edition, you always get a single-node cluster — driver only, no workers. It's enough for learning but not for production-scale data.
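To make the driver/worker split concrete, here is a loose analogy in plain Python (no Spark involved): the main thread plays the driver, splitting the work into partitions and combining results, while a thread pool plays the workers. Spark's real value is that its "pool" spans many machines, but the coordination pattern is the same.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    # Each "worker" handles one partition of the data.
    return sum(chunk)

def driver_sum(data, n_workers=4):
    # The "driver" splits the data into partitions...
    size = max(1, len(data) // n_workers)
    partitions = [data[i:i + size] for i in range(0, len(data), size)]
    # ...hands one task per partition to the workers...
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = pool.map(partial_sum, partitions)
    # ...and combines the partial results into the final answer.
    return sum(partials)

print(driver_sum(list(range(1_000_000))))  # same result as sum(range(1_000_000))
```

In real Spark the "partitions" live in distributed memory across the worker nodes, and the driver only ever sees the plan and the combined result.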


All-Purpose vs Job Clusters

There are two types of clusters in Databricks, and picking the right one matters:

| | All-Purpose Cluster | Job Cluster |
| --- | --- | --- |
| Use case | Interactive development | Running automated pipelines |
| Lifecycle | You start/stop manually | Starts and stops per job run |
| Cost | More expensive (sits idle) | Cheaper (only runs when needed) |
| Best for | Notebooks, exploration, debugging | Scheduled workflows in production |

Rule of thumb: Use all-purpose clusters while you're developing. Switch to job clusters when you automate.
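When you do automate, a job cluster is declared inline in the job definition rather than created by hand. A hedged sketch of what that looks like as a Jobs API payload (the notebook path and node type are hypothetical placeholders; real values depend on your workspace and cloud provider):

```python
import json

job_spec = {
    "name": "daily-sales-ingestion",
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Repos/me/ingest_sales"},
            # "new_cluster" is what makes this a job cluster: it is created
            # for this run and terminated as soon as the run finishes.
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "Standard_DS3_v2",
                "num_workers": 2,
            },
        }
    ],
}

print(json.dumps(job_spec, indent=2))
```

The key design point: because the cluster spec lives inside the job, every run gets a fresh, identically configured cluster and you never pay for idle time between runs.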


Cluster Configuration: What Actually Matters

When creating a cluster, you'll face a wall of settings. Here's what to actually pay attention to:

Databricks Runtime Version

This is the software stack that runs on your cluster — Spark version, Python version, and pre-installed libraries.

Example: 13.3 LTS (Spark 3.4.1, Scala 2.12)
  • LTS (Long Term Support): Stable, well-tested. Use this for production.
  • Latest: Newest features but less tested. Fine for experimentation.
  • ML Runtime: Comes with ML libraries (TensorFlow, PyTorch, scikit-learn) pre-installed.

💡 When in doubt, pick the most recent LTS version.

Node Type

The size of each machine. More CPU and RAM = faster processing = higher cost.

For learning: pick the smallest available option. You won't notice the difference on small datasets.

Autoscaling

When enabled, Databricks automatically adds or removes worker nodes based on workload.

Min workers: 2
Max workers: 8

Great for production — cost-efficient and handles variable loads. For development, just set a fixed number to avoid surprises on your bill.
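In a cluster spec (e.g., via the Clusters API), autoscaling is expressed as a min/max range instead of a fixed worker count. A small sketch, with a hypothetical node type:

```python
# Sketch of an autoscaling cluster spec (node type is a placeholder).
cluster_spec = {
    "cluster_name": "dev-cluster",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    # Instead of a fixed num_workers, give Databricks a range to scale within:
    "autoscale": {"min_workers": 2, "max_workers": 8},
}

print(cluster_spec["autoscale"])  # → {'min_workers': 2, 'max_workers': 8}
```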

Auto Termination

Automatically shuts down the cluster after N minutes of inactivity. Always enable this. Forgetting to shut down a cluster is the fastest way to burn through credits.

Terminate after: 30 minutes of inactivity
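To see why this matters, a quick back-of-envelope calculation with made-up example rates (a DBU is Databricks' billing unit; real rates vary by cloud, pricing tier, and node type):

```python
DBU_PER_HOUR = 2.0    # hypothetical: the cluster consumes 2 DBUs per hour
PRICE_PER_DBU = 0.40  # hypothetical: $0.40 per DBU

def idle_cost(hours_idle):
    """Dollars burned by a cluster left running idle for `hours_idle` hours."""
    return hours_idle * DBU_PER_HOUR * PRICE_PER_DBU

# A cluster forgotten over a weekend (~60 hours):
print(f"${idle_cost(60):.2f}")  # → $48.00
```

With a 30-minute auto-termination, the same forgotten cluster costs cents instead of tens of dollars; on larger node types the gap grows fast.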

Creating and Attaching a Cluster

Create:

  1. Go to Compute → Create Cluster
  2. Set a name, runtime version, and node type
  3. Enable auto-termination (30–60 min recommended)
  4. Click Create Cluster

Attach to a notebook:

  1. Open your notebook
  2. In the top bar, click the cluster dropdown (it'll say "Detached")
  3. Select your running cluster

That's it. Your notebook is now connected and ready to run code.


Part 2: Notebooks

What is a Databricks Notebook?

A Databricks notebook is an interactive document where you write and execute code, see outputs, and document your work — all in one place.

If you've used Jupyter notebooks, the concept is identical. If you haven't, think of it as a document where each section (called a cell) is a block of runnable code.

Notebooks are where you'll spend most of your time in Databricks: exploring data, writing transformations, debugging pipelines, and building out logic before automating it.


Cell Types

Every cell in a notebook has a type. The default type is set when you create the notebook (Python, SQL, Scala, or R), but you can override it per cell using magic commands.

| Magic Command | What it does |
| --- | --- |
| %python | Run Python code |
| %sql | Run SQL and display results as a table |
| %scala | Run Scala code |
| %r | Run R code |
| %md | Write Markdown (documentation) |
| %sh | Run shell commands |
| %fs | Interact with the Databricks file system |
| %run | Run another notebook inline |

Example of mixing languages in one notebook:

# This cell is Python (default)
df = spark.read.csv("/databricks-datasets/airlines/part-00000", header=True)
df.printSchema()

# Register the DataFrame as a temp view so SQL cells can query it by name
df.createOrReplaceTempView("airlines")
%sql
-- The %sql magic command (first line of the cell) switches this cell to SQL
SELECT origin, COUNT(*) as total_flights
FROM airlines
GROUP BY origin
ORDER BY total_flights DESC
LIMIT 10
%md
## Results
The table above shows the top 10 busiest origin airports in the dataset.

Useful Keyboard Shortcuts

| Action | Shortcut |
| --- | --- |
| Run current cell and move to next | Shift + Enter |
| Run current cell and stay | Ctrl + Enter |
| Add cell below | B (in command mode) |
| Add cell above | A (in command mode) |
| Delete cell | D D (press D twice) |
| Enter command mode | Esc |
| Enter edit mode | Enter |
| Auto-complete | Tab |
| Comment/uncomment | Ctrl + / |

💡 Press Esc to enter command mode (navigate between cells), Enter to go back to edit mode (type inside a cell). Same as Jupyter.


Notebook Best Practices

A few habits that will make your notebooks cleaner and easier to maintain:

1. Always start with a description cell
Use %md to explain what the notebook does, who wrote it, and when.

%md
## Bronze Ingestion — Sales Data
Ingests raw sales CSV files from ADLS into the Bronze Delta table.

**Author**: Your Name  
**Last updated**: 2024-01-15

2. One notebook, one responsibility
Don't build your entire pipeline in a single notebook. Split by concern: one for ingestion, one for transformation, one for aggregation.

3. Use display() instead of show()
df.show() prints plain text. display(df) renders an interactive table with sorting, filtering, and chart options.

# Meh
df.show()

# Better
display(df)

4. Clear outputs before committing
If you're storing notebooks in Git, clear all cell outputs first. Output blobs make diffs unreadable.


The dbutils Library

One more thing worth knowing: dbutils is a Databricks utility library available in every notebook. It gives you quick access to common operations:

# List files in DBFS
dbutils.fs.ls("/databricks-datasets/")

# Mount cloud storage (auth credentials via extra_configs omitted for brevity)
dbutils.fs.mount(
  source = "wasbs://container@account.blob.core.windows.net",
  mount_point = "/mnt/mydata"
)

# Pass variables between notebooks
dbutils.widgets.text("input_date", "2024-01-01")
input_date = dbutils.widgets.get("input_date")

# Exit a notebook and return a value (used in Workflows)
dbutils.notebook.exit("success")

You'll use dbutils constantly — especially dbutils.fs for navigating the file system and dbutils.notebook for chaining notebooks together in pipelines.


Wrapping Up

Here's what to take away from this article:

  • A cluster is the compute infrastructure that runs your code — always a driver node, optionally worker nodes
  • All-purpose clusters are for development; job clusters are for production pipelines
  • Key config settings: Runtime version, node type, autoscaling, and auto-termination
  • A notebook is your interactive coding environment — cells can mix Python, SQL, Scala, and Markdown
  • display() beats show(), and dbutils is your best friend for file system and notebook operations

In the next article, we go deeper under the hood: Apache Spark in plain English — the engine that makes everything you've set up actually fast and scalable.
