Open any data science tutorial online.
Not a Python script. Not a GitHub repo. A tutorial with explanation, code, output, visualizations, and commentary all woven together in one document.
That document is a Jupyter Notebook.
The best ML papers show their experiments in notebooks. Kaggle grandmasters share their approach in notebooks. Every data science course teaches in notebooks. When you get a data science job, your analysis will probably live in a notebook before it becomes a production script.
Learning Jupyter is not optional. It is the environment where the real work happens.
What Jupyter Actually Is
A Jupyter Notebook is a document that contains a mix of live code, formatted text, equations, and visualizations. You run individual cells instead of entire scripts. Each cell's output appears immediately below it.
The killer feature is this: you can run cell 1, see the output, think about it, then write and run cell 2 based on what you learned. Analysis becomes interactive and iterative instead of run-the-whole-script-and-scroll-through-output.
The second killer feature: the document tells a story. Code plus explanation plus output plus charts, all in sequence, all in one file you can share.
Installation and Launch
pip install jupyter notebook
Launch from your project directory:
jupyter notebook
Your browser opens automatically at http://localhost:8888. You see a file browser. Click "New" → "Python 3 (ipykernel)" to create a new notebook.
Or install JupyterLab, the more modern interface:
pip install jupyterlab
jupyter lab
JupyterLab looks like an IDE with a file browser on the left, multiple tabs, and a cleaner interface. Either works. JupyterLab is preferred for serious work.
Anatomy of a Notebook
Every notebook is made of cells. Three types.
Code cells run Python (or R, Julia, SQL if configured). Output appears immediately below.
Markdown cells render formatted text. Headers, bold, italics, bullet lists, links, equations. These are where you explain what your code does and what your findings mean.
Raw cells contain unformatted text that Jupyter leaves alone. Used rarely.
The cell type dropdown in the toolbar switches between cell types. Or, in command mode, press M to convert a cell to Markdown and Y to convert it back to code.
Essential Keyboard Shortcuts
Jupyter has two modes. Edit mode is when you are typing inside a cell. Command mode is when no cell is being edited. Press Escape to enter command mode. Press Enter to enter edit mode.
In command mode:
A → insert cell Above
B → insert cell Below
D D → delete cell (press D twice)
M → convert to Markdown
Y → convert to Code
Z → undo cell deletion
Shift + M → merge selected cells
In edit mode:
Shift + Enter → run cell, move to next
Ctrl + Enter → run cell, stay on it
Alt + Enter → run cell, insert new below
Tab → autocomplete
Shift + Tab → show function signature/docstring
Ctrl + / → comment/uncomment line
Learn these. Reaching for the mouse to run cells is slow. Keyboard-driven Jupyter is fast.
Your First Real Notebook
Here is how a data analysis notebook actually looks. Every section has a purpose.
# Titanic Survival Analysis
**Dataset:** Titanic passenger data, 891 rows
**Question:** Which factors most strongly predicted survival?
**Author:** Your Name | Date: 2024-03-01
# Cell 1: Imports - always first
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style="darkgrid")
print("Libraries loaded")
Output:
Libraries loaded
## 1. Load and Inspect Data
# Cell 2: Load
df = pd.read_csv("titanic.csv")
print(f"Shape: {df.shape}")
df.head()
# Cell 3: Missing values
df.isnull().sum()[df.isnull().sum() > 0]
**Findings:** Age is missing 177 values (19.9%). Cabin is missing 687 (77%).
We will impute age with median and drop Cabin.
# Cell 4: Clean
df["Age"] = df["Age"].fillna(df["Age"].median())
df = df.drop(columns=["Cabin"])
print("Cleaning done. Missing values:", df.isnull().sum().sum())
## 2. Survival by Gender
# Cell 5: Gender analysis
survival_by_sex = df.groupby("Sex")["Survived"].mean() * 100
print(survival_by_sex.round(1))
fig, ax = plt.subplots(figsize=(7, 4))
survival_by_sex.plot(kind="bar", color=["coral", "steelblue"], ax=ax)
ax.set_title("Survival Rate by Gender")
ax.set_ylabel("Survival Rate (%)")
ax.set_xlabel("")
ax.tick_params(axis="x", rotation=0)
plt.tight_layout()
plt.show()
This is what a proper notebook looks like. Markdown explains the context. Code does the work. Output shows the result. Markdown interprets the finding. The reader follows your thinking, not just your code.
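One refinement worth knowing for the cleaning cell above: filling with a single global median is the simplest fix, but median age differs by passenger class, so a slightly more careful version imputes within groups. A sketch of that alternative (illustrative data, not part of the analysis above):

```python
import pandas as pd

# Toy data: one missing Age per passenger class
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Age":    [38.0, None, 29.0, None, 22.0, None],
})

# Fill each missing Age with the median of that passenger's class
df["Age"] = df["Age"].fillna(df.groupby("Pclass")["Age"].transform("median"))
print(df["Age"].tolist())  # → [38.0, 38.0, 29.0, 29.0, 22.0, 22.0]
```

Whether the extra step is worth it depends on the analysis; for a first exploratory pass, the global median is fine.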
The Variable State Problem
This is the most important thing to understand about Jupyter.
The kernel maintains state between cells. Every variable you define stays in memory until you restart the kernel or explicitly delete it. This creates a trap.
# Cell 1: Run this
x = 10
print(x)
# Cell 2: Run this
x = 20
print(x)
# Cell 3: Now go back and re-run Cell 1
# x becomes 10 again in memory, but Cell 2's output still reads 20.
# The notebook now displays results that no longer match the
# kernel's actual state.
This causes subtle bugs. Your notebook can show outputs that no longer match the current state of the variables. The number next to each cell shows execution order. [1], [2], [3] in sequence is safe. [3], [1], [2] means cells ran out of order and state may be inconsistent.
The fix: Kernel → Restart and Run All before sharing any notebook. This resets all variables and reruns every cell in order from top to bottom. If the notebook fails, it has a bug. A notebook that only works when cells are run in a specific manual order is broken.
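An .ipynb file is plain JSON, so you can sanity-check execution order without even opening Jupyter. A minimal sketch (the function name is my own) that flags code cells whose execution counts are out of sequence:

```python
import json

def out_of_order_cells(path):
    """Return True if code cells were executed out of top-to-bottom order."""
    with open(path) as f:
        nb = json.load(f)
    counts = [c.get("execution_count") for c in nb["cells"]
              if c["cell_type"] == "code" and c.get("execution_count")]
    return counts != sorted(counts)

# A notebook whose counts read [3, 1, 2] was run out of order,
# and its outputs may not match the code you see.
```

Restart and Run All remains the real fix; a check like this just tells you which notebooks need it.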
Magic Commands
Jupyter has magic commands that extend Python with useful utilities.
%time df.groupby("Sex")["Survived"].mean()
Times a single expression.
%%timeit
result = df.groupby(["Sex", "Pclass"])["Survived"].mean()
Runs an entire cell many times and reports the mean and standard deviation of the timings.
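The timing magics only exist inside IPython/Jupyter. Once code moves into a plain script, the standard-library timeit module gives you the same measurement:

```python
import timeit

# Total seconds for 1000 runs, divided down to seconds per run
seconds = timeit.timeit("sum(range(1000))", number=1000) / 1000
print(f"{seconds * 1e6:.1f} microseconds per run")
```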
%matplotlib inline
Makes matplotlib plots appear inline inside the notebook instead of in a popup window. Inline rendering is the default in modern Jupyter, but setting it explicitly once at the top is harmless.
%who
Lists all variables currently in memory.
%run script.py
Runs an external Python script and makes its variables available in the notebook.
%%bash
ls -la
git log --oneline -5
Runs the cell contents as shell commands. Incredibly useful for running terminal commands without leaving the notebook.
?pd.DataFrame.groupby
Shows documentation for any function. Double question mark ?? shows the source code.
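Like the timing magics, ? and ?? only work inside IPython/Jupyter. In plain Python, the standard library's inspect module covers the same ground:

```python
import inspect
import json

# Rough equivalent of ?json.dumps: signature and docstring
print(inspect.signature(json.dumps))
print(inspect.getdoc(json.dumps).splitlines()[0])

# Rough equivalent of ??json.dumps: the source code itself
source = inspect.getsource(json.dumps)
print(source.splitlines()[0])  # the "def dumps(...)" line
```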
Organizing a Professional Notebook
Structure every notebook the same way. Consistency makes notebooks readable and maintainable.
1. Title and description (Markdown)
2. Imports (single code cell, all imports together)
3. Configuration and constants (file paths, parameters)
4. Data loading
5. Data inspection
6. Data cleaning
7. Exploratory analysis (multiple sections)
8. Modeling or main analysis
9. Results and interpretation
10. Conclusions (Markdown summary)
Each section starts with a Markdown header. Code cells within a section do one thing. If a code cell is doing five different things, split it.
Keep cells short. A 50-line code cell should be three cells. Short cells are easier to debug because when an error occurs you know exactly which part failed.
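Step 3 of that outline, the configuration cell, deserves a concrete example. All paths and tunable values live in one place near the top, so nothing is hard-coded deep in the analysis (the names below are illustrative):

```python
from pathlib import Path

# Cell: configuration and constants
DATA_PATH = Path("data") / "titanic.csv"   # single source of truth for file locations
RANDOM_SEED = 42                           # reproducible sampling and model runs
FIGURE_SIZE = (7, 4)                       # consistent plot dimensions
AGE_IMPUTE_STRATEGY = "median"

print(f"Reading data from: {DATA_PATH}")
```

When a path or parameter changes, you edit one cell and Restart and Run All, instead of hunting through the whole notebook.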
Exporting Notebooks
jupyter nbconvert --to html notebook.ipynb
jupyter nbconvert --to pdf notebook.ipynb
jupyter nbconvert --to script notebook.ipynb
jupyter nbconvert --to markdown notebook.ipynb
HTML export produces a self-contained file you can email or post online. PDF requires LaTeX. Script converts to a .py file, removing markdown and keeping only code.
For sharing analysis with non-technical stakeholders, HTML export is ideal: they see the full notebook in any browser, and adding the --no-input flag hides the code cells so only the text, output, and charts remain.
nbstripout: Clean Notebooks Before Committing
Never commit notebooks with output to Git. Output can contain sensitive data, large images, or paths specific to your machine. It also creates enormous diffs that obscure what actually changed.
pip install nbstripout
nbstripout --install
After installation, nbstripout automatically strips output from notebooks before every git commit. Your colleagues get clean notebooks with no output. When they open the notebook and run it, they generate their own output.
This is a professional habit. Install it on every machine you use.
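There is no magic involved: a notebook's outputs are just fields in its JSON. This sketch is my own minimal illustration of what stripping amounts to, not nbstripout's actual code:

```python
import json

def strip_outputs(nb):
    """Remove outputs and execution counts from a notebook dict."""
    for cell in nb["cells"]:
        if cell["cell_type"] == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return nb

nb = {"cells": [{"cell_type": "code", "execution_count": 5,
                 "outputs": [{"text": "Libraries loaded"}]}]}
stripped = strip_outputs(nb)
print(stripped["cells"][0]["outputs"])  # → []
```

nbstripout does this for you as a git filter, which is why installing it once per machine is enough.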
Notebook Extensions Worth Installing
pip install jupyter_contrib_nbextensions
jupyter contrib nbextension install --user
Then open the Nbextensions tab in Jupyter. One caveat: these extensions target the classic Notebook interface and do not work in JupyterLab or Notebook 7. Useful ones to enable:
Table of Contents (2): generates a clickable table of contents from markdown headers. Essential for long notebooks.
ExecuteTime: shows how long each cell took to run.
Collapsible Headings: lets you collapse sections to focus on what you are working on.
Variable Inspector: shows all current variables, their types, and values. Like a live spreadsheet of your memory state.
Jupyter vs Scripts: When to Use Each
Use a notebook when: exploring new data, prototyping analysis, presenting findings, teaching a concept, writing a report with code embedded.
Use a Python script when: building a reusable module, running scheduled jobs, deploying to production, writing code that will be imported by other code.
The workflow is: explore and prototype in a notebook, extract clean reusable functions into scripts, import those scripts in future notebooks. The notebook is the thinking environment. The script is the production code.
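That extraction step might look like this: a cleaning function that started life as a notebook cell moves into a module (cleaning.py and clean_titanic are hypothetical names), and future notebooks simply import it:

```python
import pandas as pd

# cleaning.py -- logic extracted from the exploration notebook
def clean_titanic(df: pd.DataFrame) -> pd.DataFrame:
    """Impute Age with the median and drop the mostly-empty Cabin column."""
    df = df.copy()  # never mutate the caller's DataFrame
    df["Age"] = df["Age"].fillna(df["Age"].median())
    if "Cabin" in df.columns:
        df = df.drop(columns=["Cabin"])
    return df

# In a future notebook:
#   from cleaning import clean_titanic
#   df = clean_titanic(pd.read_csv("titanic.csv"))
```

Because the notebook now only imports the function, Restart and Run All still exercises the full pipeline, while the logic itself can be unit-tested like any other module.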
A Resource Worth Knowing
Jake VanderPlas wrote the entire Python Data Science Handbook as a series of Jupyter Notebooks, available free at github.com/jakevdp/PythonDataScienceHandbook. Every concept taught through interactive notebooks you can clone and run. The notebooks on NumPy, Pandas, Matplotlib, and scikit-learn are the best introductory material on those libraries that exists. His approach to structuring analysis notebooks has influenced how thousands of practitioners work.
Search "Jake VanderPlas Python Data Science Handbook" and the GitHub repo comes right up.
Try This
Create a Jupyter notebook called titanic_analysis.ipynb.
It must have all of the following:
A proper title cell in Markdown with your name and the date.
All imports in a single cell at the top.
At least five section headers using Markdown ##.
Between each code section, a Markdown cell that explains what you found and what you will do next.
Code that loads the Titanic dataset, cleans it, and produces at least four visualizations: survival by gender, survival by class, age distribution by survival status, fare distribution with a reference line at the median.
A conclusions section in Markdown at the bottom summarizing three specific findings from your analysis.
Before sharing, run Kernel → Restart and Run All. Fix any errors until it runs clean from top to bottom.
Export it as HTML. Open the HTML in your browser. That is what anyone you share it with will see.
What's Next
Jupyter runs locally. Google Colab runs in the cloud and gives you free GPU access. For everything in this series that requires serious compute, like training neural networks or running large experiments, Colab is where it happens. That is the last post in Phase 5.
After that, Phase 6 starts. Machine learning. The thing this whole series has been building toward.