A cluster is a set of computers that Databricks provisions to run your code.
When you write Python or SQL in a notebook, it doesn't run on your laptop. It runs on these machines — in the cloud — powered by Apache Spark.
No cluster = nowhere to run your code. It's that simple.
Think of a cluster like a rented server room. You define how powerful it should be, spin it up, use it, and shut it down. With Databricks, all the infrastructure management is handled for you.
The Anatomy of a Cluster
A Databricks cluster has two types of nodes:
┌─────────────────────────────────────┐
│              CLUSTER                │
│                                     │
│   ┌──────────────┐                  │
│   │ Driver Node  │ ← Your code      │
│   │ (1 machine)  │   runs here      │
│   └──────┬───────┘   first          │
│          │                          │
│   ┌──────▼───────┐                  │
│   │ Worker Nodes │ ← Heavy lifting  │
│   │ (1-N machines│   done here      │
│   └──────────────┘                  │
└─────────────────────────────────────┘
Driver node: Coordinates the work. Receives your code, builds the execution plan, and distributes tasks to workers. There's always exactly one.
Worker nodes: Execute the actual data processing in parallel. You can have anywhere from zero (a single-node cluster) to hundreds of workers, depending on your needs.
💡 In Community Edition, you always get a single-node cluster — driver only, no workers. It's enough for learning but not for production-scale data.
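The driver/worker division of labor can be sketched in plain Python (this is not Spark — just an analogy): one coordinator splits the data into partitions, parallel workers each process a slice, and the coordinator combines the partial results.

```python
# Conceptual sketch only: the "driver" splits the job, "workers" run in
# parallel, the driver combines the results. Spark does this across machines.
from concurrent.futures import ThreadPoolExecutor

def process_partition(partition):
    # A worker's share of the job — here, just summing its slice
    return sum(partition)

data = list(range(100))
partitions = [data[i::4] for i in range(4)]   # driver splits work into 4 parts

with ThreadPoolExecutor(max_workers=4) as pool:   # the "worker nodes"
    partial_sums = list(pool.map(process_partition, partitions))

total = sum(partial_sums)   # driver combines the partial results
print(total)  # 4950
```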
All-Purpose vs Job Clusters
There are two types of clusters in Databricks, and picking the right one matters:
| | All-Purpose Cluster | Job Cluster |
|---|---|---|
| Use case | Interactive development | Running automated pipelines |
| Lifecycle | You start/stop manually | Starts and stops per job run |
| Cost | More expensive (sits idle) | Cheaper (only runs when needed) |
| Best for | Notebooks, exploration, debugging | Scheduled workflows in production |
Rule of thumb: Use all-purpose clusters while you're developing. Switch to job clusters when you automate.
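To make the lifecycle difference concrete, here is a hedged sketch of a Databricks Jobs API request body. The job name, notebook path, and node type below are hypothetical; the point is the `new_cluster` field — the cluster is declared inside the job itself, so it is created for the run and terminated afterwards.

```python
# Hedged sketch of a Jobs API request body (names are hypothetical).
# A job cluster lives inside the job definition: created per run, then gone.
job_spec = {
    "name": "nightly-sales-pipeline",
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Repos/pipelines/ingest"},
            "new_cluster": {                    # job cluster: exists only for this run
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "Standard_DS3_v2",
                "num_workers": 2,
            },
        }
    ],
}
```

Compare this with an all-purpose cluster, which you create once under Compute and attach notebooks to interactively.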
Cluster Configuration: What Actually Matters
When creating a cluster, you'll face a wall of settings. Here's what to actually pay attention to:
Databricks Runtime Version
This is the software stack that runs on your cluster — Spark version, Python version, and pre-installed libraries.
Example: 13.3 LTS (Spark 3.4.1, Scala 2.12)
- LTS (Long Term Support): Stable, well-tested. Use this for production.
- Latest: Newest features but less tested. Fine for experimentation.
- ML Runtime: Comes with ML libraries (TensorFlow, PyTorch, scikit-learn) pre-installed.
💡 When in doubt, pick the most recent LTS version.
Node Type
The size of each machine. More CPU and RAM = faster processing = higher cost.
For learning: pick the smallest available option. You won't notice the difference on small datasets.
Autoscaling
When enabled, Databricks automatically adds or removes worker nodes based on workload.
Min workers: 2
Max workers: 8
Great for production — cost-efficient and handles variable loads. For development, just set a fixed number to avoid surprises on your bill.
Auto Termination
Automatically shuts down the cluster after N minutes of inactivity. Always enable this. Forgetting to shut down a cluster is the fastest way to burn through credits.
Terminate after: 30 minutes of inactivity
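The settings above map onto fields in the Databricks Clusters API. Here is a hedged sketch of a create-cluster request body — the field names follow the Clusters API, while the cluster name and node type are illustrative:

```python
# Hedged sketch of a Clusters API create request body covering the settings
# discussed above. Cluster name and node type are hypothetical examples.
cluster_config = {
    "cluster_name": "dev-exploration",
    "spark_version": "13.3.x-scala2.12",                 # a recent LTS runtime
    "node_type_id": "Standard_DS3_v2",                   # smallest node that fits
    "autoscale": {"min_workers": 2, "max_workers": 8},   # or "num_workers" for fixed size
    "autotermination_minutes": 30,                       # always set this
}
```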
Creating and Attaching a Cluster
Create:
- Go to Compute → Create Cluster
- Set a name, runtime version, and node type
- Enable auto-termination (30–60 min recommended)
- Click Create Cluster
Attach to a notebook:
- Open your notebook
- In the top bar, click the cluster dropdown (it'll say "Detached")
- Select your running cluster
That's it. Your notebook is now connected and ready to run code.
Part 2: Notebooks
What is a Databricks Notebook?
A Databricks notebook is an interactive document where you write and execute code, see outputs, and document your work — all in one place.
If you've used Jupyter notebooks, the concept is identical. If you haven't, think of it as a document where each section (called a cell) is a block of runnable code.
Notebooks are where you'll spend most of your time in Databricks: exploring data, writing transformations, debugging pipelines, and building out logic before automating it.
Cell Types
Every cell in a notebook has a type. The default type is set when you create the notebook (Python, SQL, Scala, or R), but you can override it per cell using magic commands.
| Magic Command | What it does |
|---|---|
| %python | Run Python code |
| %sql | Run SQL and display results as a table |
| %scala | Run Scala code |
| %r | Run R code |
| %md | Write Markdown (documentation) |
| %sh | Run shell commands |
| %fs | Interact with the Databricks file system |
| %run | Run another notebook inline |
Example of mixing languages in one notebook:
# This cell is Python (default)
df = spark.read.csv("/databricks-datasets/airlines/part-00000")
df.printSchema()
%sql
-- the %sql magic command switches this cell to SQL
SELECT origin, COUNT(*) as total_flights
FROM airlines
GROUP BY origin
ORDER BY total_flights DESC
LIMIT 10
%md
## Results
The table above shows the top 10 busiest origin airports in the dataset.
Useful Keyboard Shortcuts
| Action | Shortcut |
|---|---|
| Run current cell and move to next | Shift + Enter |
| Run current cell and stay | Ctrl + Enter |
| Add cell below | B (in command mode) |
| Add cell above | A (in command mode) |
| Delete cell | D D (double tap D) |
| Enter command mode | Esc |
| Enter edit mode | Enter |
| Auto-complete | Tab |
| Comment/uncomment | Ctrl + / |
💡 Press Esc to enter command mode (navigate between cells), Enter to go back to edit mode (type inside a cell). Same as Jupyter.
Notebook Best Practices
A few habits that will make your notebooks cleaner and easier to maintain:
1. Always start with a description cell
Use %md to explain what the notebook does, who wrote it, and when.
%md
## Bronze Ingestion — Sales Data
Ingests raw sales CSV files from ADLS into the Bronze Delta table.
**Author**: Your Name
**Last updated**: 2024-01-15
2. One notebook, one responsibility
Don't build your entire pipeline in a single notebook. Split by concern: one for ingestion, one for transformation, one for aggregation.
3. Use display() instead of show()
df.show() prints plain text. display(df) renders an interactive table with sorting, filtering, and chart options.
# Meh
df.show()
# Better
display(df)
4. Clear outputs before committing
If you're storing notebooks in Git, clear all cell outputs first. Output blobs make diffs unreadable.
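Databricks Repos can also commit notebooks as plain source files, which sidesteps the problem entirely. If you work with .ipynb exports instead, outputs can be stripped with a short stdlib script — a sketch, not an official tool:

```python
import json

def clear_outputs(nb):
    """Remove outputs and execution counts from a Jupyter-format notebook dict."""
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return nb

# Tiny in-memory example; a real script would json.load(...) a .ipynb file
nb = {"cells": [{"cell_type": "code", "source": "1 + 1",
                 "outputs": [{"data": {"text/plain": "2"}}],
                 "execution_count": 3}]}
cleaned = clear_outputs(nb)
print(json.dumps(cleaned["cells"][0]["outputs"]))  # prints "[]"
```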
The dbutils Library
One more thing worth knowing: dbutils is a Databricks utility library available in every notebook. It gives you quick access to common operations:
# List files in DBFS
dbutils.fs.ls("/databricks-datasets/")
# Mount cloud storage
dbutils.fs.mount(
source = "wasbs://container@account.blob.core.windows.net",
mount_point = "/mnt/mydata"
)
# Pass variables between notebooks
dbutils.widgets.text("input_date", "2024-01-01")
input_date = dbutils.widgets.get("input_date")
# Exit a notebook and return a value (used in Workflows)
dbutils.notebook.exit("success")
You'll use dbutils constantly — especially dbutils.fs for navigating the file system and dbutils.notebook for chaining notebooks together in pipelines.
Wrapping Up
Here's what to take away from this article:
- A cluster is the compute infrastructure that runs your code — always a driver node, optionally worker nodes
- All-purpose clusters are for development; job clusters are for production pipelines
- Key config settings: Runtime version, node type, autoscaling, and auto-termination
- A notebook is your interactive coding environment — cells can mix Python, SQL, Scala, and Markdown
- display() beats show(), and dbutils is your best friend for file system and notebook operations
In the next article, we go deeper under the hood: Apache Spark in plain English — the engine that makes everything you've set up actually fast and scalable.