DEV Community: Eve Loraine Nuñal

Building a Live F1 Dashboard Using OpenF1 and Streamlit

Eve Loraine Nuñal — Mon, 20 Apr 2026 18:29:13 +0000

Formula 1 generates massive amounts of telemetry data during each race weekend — lap times, sector speeds, tire compounds, and pit stop durations. For developers who are also F1 enthusiasts, accessing this data and building visualizations is an exciting way to combine technical skills with a passion for the sport.

In this article, we'll walk through building a complete F1 dashboard using Streamlit for the user interface, Plotly for interactive visualizations, and the OpenF1 API as our data source. The final dashboard allows users to explore race sessions, compare driver lap times, and analyze pit stop performance across the 2025 and 2026 seasons.

Tech Stack Overview

Before diving into the code, let's understand the key technologies:

Technology	Purpose
Streamlit	Python framework that turns data scripts into web apps with minimal code
Plotly Express & Graph Objects	Creates interactive, browser-based charts
Pandas	Data manipulation and DataFrame operations
Requests	HTTP client for API calls
OpenF1 API	Free, real-time F1 data endpoint

Getting Started

Prerequisites

Before running the dashboard, ensure you have Python 3.8 or higher installed on your system. You can check your Python version with:

python --version

Step 1: Clone the Repository

The complete source code is available on GitHub. Clone the repository to your local machine:

git clone https://github.com/e-raine/F1-Dashboard-Using-Openf1-and-Streamlit.git
cd F1-Dashboard-Using-Openf1-and-Streamlit

Alternatively, if you prefer not to use Git, you can download the repository as a ZIP file from https://github.com/e-raine/F1-Dashboard-Using-Openf1-and-Streamlit and extract it.

Step 2: Create a Virtual Environment (Recommended)

Creating a virtual environment isolates the project dependencies and prevents conflicts with other Python projects:

On macOS/Linux:

python -m venv venv
source venv/bin/activate

On Windows:

python -m venv venv
venv\Scripts\activate

You'll know the virtual environment is active when you see (venv) at the beginning of your terminal prompt.

Step 3: Install Required Packages

Install all necessary packages using pip.

pip install streamlit pandas plotly requests

Step 4: Verify Installation

Test that everything installed correctly by importing each package:

python -c "import streamlit; import pandas; import plotly; import requests; print('All packages installed successfully!')"

Step 5: Run the Dashboard

Launch the Streamlit application:

streamlit run app.py

Your default browser should automatically open to http://localhost:8501. If it doesn't, manually navigate to that address.

Understanding the OpenF1 API

OpenF1 is a community-driven, open-source API that provides live and historical Formula 1 data. It's completely free and requires no authentication — you can start making requests immediately. The API is organized into several endpoints:

Endpoint	Data Provided
`/meetings`	Race weekend information (location, country, circuit)
`/sessions`	Individual sessions (practice, qualifying, race)
`/drivers`	Driver details (name, number, team)
`/laps`	Lap-by-lap timing data
`/pit`	Pit stop durations and lap numbers
`/car_data`	Telemetry (speed, throttle, brake, gear)
`/weather`	Track temperature, air temperature, rainfall

The base URL is https://api.openf1.org/v1. You can filter any endpoint by adding query parameters like ?session_key=123&driver_number=44.

No API key needed – OpenF1 is open to everyone. You can even test endpoints directly in your browser. For example:
https://api.openf1.org/v1/meetings?year=2025
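
To make the filtering rule concrete, here is a minimal sketch that assembles request URLs from the base URL and a set of query parameters using only the standard library. The helper name build_url is hypothetical and is not part of the dashboard's code; the session_key and driver_number values are the placeholders used above.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.openf1.org/v1"


def build_url(endpoint, **filters):
    """Return the full URL for an endpoint plus optional query filters.

    Hypothetical helper for illustration; not part of the dashboard.
    """
    url = f"{BASE_URL}/{endpoint.lstrip('/')}"
    if filters:
        # urlencode escapes the values and joins the pairs with '&'
        url += "?" + urlencode(filters)
    return url


print(build_url("meetings", year=2025))
print(build_url("laps", session_key=123, driver_number=44))
```

In the dashboard itself this step is handled by the params argument of requests.get(), which performs the same encoding.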

Project Structure

The dashboard is organized into logical sections:

Configuration & Setup — Page settings and title
Data Fetching Layer — Reusable API wrapper with caching
Sidebar Controls — Season selection (2025/2026)
Race Selection — Chronological dropdown with formatted dates
Tabbed Views — Drivers, lap times, and pit stops

Let's examine each component in detail.

1. Streamlit Page Configuration

st.set_page_config(
    page_title="My F1 Dashboard",
    page_icon="🏎️",
    layout="wide"
)

The set_page_config() call must be the first Streamlit command. Here we're setting:

A browser tab title
A favicon emoji
Wide layout — essential for data dashboards with multiple columns

2. Building a Robust API Fetching Function

The fetch_data() function is the backbone of the application:

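@st.cache_data(ttl=300)
def fetch_data(endpoint, params=None):
    BASE_URL = "https://api.openf1.org/v1"
    try:
        # timeout=10 stops the app from hanging on a slow API
        response = requests.get(f"{BASE_URL}/{endpoint}", params=params, timeout=10)
        response.raise_for_status()  # turn 4xx/5xx responses into exceptions
        return pd.DataFrame(response.json())
    except requests.exceptions.RequestException as exc:
        # Covers connection errors, timeouts, and HTTP error codes
        st.error(f"Could not load '{endpoint}' from OpenF1: {exc}")
        return pd.DataFrame()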

Key Design Decisions:

@st.cache_data(ttl=300): This decorator is critical. Without caching, every user interaction would trigger a new API request. The ttl=300 parameter keeps results for 5 minutes, balancing freshness with performance. Streamlit keys the cache on the function's arguments, so each distinct combination of endpoint and params gets its own cached entry, and a call with new arguments fetches fresh data.

Error handling for API problems: Network errors, timeouts, and HTTP error codes are caught. Instead of crashing the app, we show a friendly error message and return an empty DataFrame. The timeout=10 argument prevents the app from hanging indefinitely if the API is slow.

Empty DataFrame fallback: Returning pd.DataFrame() instead of None or raising an exception means the rest of the code can rely on .empty checks without try/except blocks.

3. Handling the 2026 Calendar Changes

Real-world APIs require handling edge cases. For the 2026 season, the Bahrain and Saudi Arabian Grands Prix were cancelled:

if year == 2026:
    cancelled_gps = ["Bahrain", "Saudi Arabia"]
    sessions = sessions[~sessions["country_name"].isin(cancelled_gps)]

This uses boolean indexing with the ~ (NOT) operator to filter out cancelled races. The sidebar also displays a warning to users about this calendar change.

4. Session State for Season Selection

Streamlit reruns the entire script on every interaction, so ordinary variables are reset each time. To persist state across reruns, we use st.session_state:

if "selected_year" not in st.session_state:
    st.session_state.selected_year = 2026
if select_2025:
    st.session_state.selected_year = 2025

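The select_2025 flag checked above is the return value of a sidebar button, so the buttons must be created earlier in the script than that check. A two-column layout gives a clean UI for season switching: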

col1, col2 = st.sidebar.columns(2)
with col1:
    select_2025 = st.button("🏁 2025 Season", use_container_width=True)

5. Race Selection with Formatted Dates

The race dropdown combines multiple data points into readable options:

race_options = [f"{formatted_dates[i]} - {race_names[i]} ({race_circuits[i]})"
                for i in range(len(race_names))]

This produces options like "Mar 16 - Australia (Albert Park)" — much more user-friendly than showing raw session keys.

The date formatting handles API inconsistencies:

date_obj = datetime.fromisoformat(date.replace('Z', '+00:00'))
formatted_dates.append(date_obj.strftime("%b %d"))

The 'Z' in ISO timestamps indicates UTC. Python versions before 3.11 do not accept that suffix in fromisoformat(), so replacing it with '+00:00' keeps the code working on older interpreters.
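
The two lines above can be wrapped into a small function to see the conversion in isolation. This is a sketch: the function name format_race_date is hypothetical, and the timestamps are made-up samples rather than real API output.

```python
from datetime import datetime


def format_race_date(date_str):
    """Turn an ISO 8601 timestamp into a short label such as 'Mar 16'."""
    # Older Python versions reject a trailing 'Z', so spell out the UTC offset
    date_obj = datetime.fromisoformat(date_str.replace("Z", "+00:00"))
    return date_obj.strftime("%b %d")


# Made-up sample timestamps, with and without the 'Z' suffix
print(format_race_date("2025-03-16T04:00:00Z"))
print(format_race_date("2025-12-07T13:00:00+00:00"))
```

Note that %b produces the month abbreviation for the active locale, so the labels are English only under the default locale.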

6. Driver Display with Team Colors

In the Drivers tab, we create a responsive grid using columns and HTML:

cols = st.columns(4)
for index, (_, driver) in enumerate(drivers.iterrows()):
    col = cols[index % 4]
    team_color = driver.get("team_colour", "CCCCCC")
    if not team_color.startswith("#"):
        team_color = f"#{team_color}"

The modulo operator (index % 4) distributes drivers across four columns. Team colors from the API sometimes lack the # prefix, so we normalize them before injecting into HTML.
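
One case the snippet above does not cover is a missing color value, which pandas may represent as None or NaN instead of a string. A slightly more defensive normalizer could look like the sketch below; the function name normalize_team_color is hypothetical and the color values are made-up samples.

```python
def normalize_team_color(value, default="#CCCCCC"):
    """Return a CSS hex color with a leading '#', or the default."""
    # Missing values can arrive as None or NaN (a float), so accept only strings
    if not isinstance(value, str) or not value.strip():
        return default
    color = value.strip()
    return color if color.startswith("#") else f"#{color}"


print(normalize_team_color("FF8000"))      # prefix added
print(normalize_team_color("#FF8000"))     # already normalized
print(normalize_team_color(float("nan")))  # falls back to the default
```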

The HTML card uses inline styles for a clean, color-coded border:

st.markdown(f"""
<div style="border-left: 4px solid {team_color}; padding: 10px; ...">
    <strong>{driver['full_name']}</strong><br>
    <small>{driver['team_name']}</small>
</div>
""", unsafe_allow_html=True)

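Security note: unsafe_allow_html=True is lower risk here because the HTML template is generated by our own code rather than typed in by users. The interpolated values still come from a third-party API, so escaping them before rendering is a sensible precaution.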
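
As a hedged illustration of that precaution, the sketch below escapes every API-supplied value with the standard library's html.escape() before placing it in the markup. The function name driver_card_html is hypothetical, and only the two style properties visible in the snippet above are reproduced.

```python
import html


def driver_card_html(full_name, team_name, team_color):
    """Build the card markup with API-supplied text escaped."""
    # html.escape converts &, <, > and quotes so they render as plain text
    safe_name = html.escape(str(full_name))
    safe_team = html.escape(str(team_name))
    safe_color = html.escape(str(team_color))
    return (
        f'<div style="border-left: 4px solid {safe_color}; padding: 10px;">'
        f"<strong>{safe_name}</strong><br>"
        f"<small>{safe_team}</small>"
        "</div>"
    )


# A deliberately hostile, made-up name shows the effect of escaping
print(driver_card_html("<script>alert(1)</script>", "Team & Co", "#FF8000"))
```

The returned string can then be passed to st.markdown(..., unsafe_allow_html=True) as before.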

7. Lap Time Visualization with Plotly

The lap time comparison tab is the most technically interesting section. It:

Fetches driver data to get driver numbers and team colors
Identifies default drivers (Russell and Antonelli) using string matching
Fetches lap data for each selected driver
Creates an overlay chart with team-colored lines

Finding Default Drivers:

george_russell = drivers[drivers["full_name"].str.contains("Russell", case=False, na=False)]

Using .str.contains() with case=False provides flexible matching. The na=False parameter treats null names as non-matches, so the resulting mask contains only True and False and can be used directly for filtering.

Building the Multi-Trace Chart:

fig = go.Figure()
for driver_num in selected_drivers:
    laps = fetch_data("laps", {
        "session_key": session_key,
        "driver_number": driver_num
    })
    fig.add_trace(go.Scatter(
        x=laps["lap_number"],
        y=laps["lap_duration"],
        mode="lines+markers",
        name=driver_acronyms.get(driver_num),
        line=dict(color=f"#{team_color}", width=2)
    ))

The chart uses lines+markers mode to show both the trend (lap time evolution) and individual data points (each lap). The hovermode="x unified" setting, applied with fig.update_layout(), combines the tooltips of all traces at the same lap number into one.

Data Cleaning:

laps["lap_duration"] = pd.to_numeric(laps["lap_duration"], errors="coerce")
laps = laps.dropna(subset=["lap_duration"])

The API might return strings or invalid values. errors="coerce" converts unparseable values to NaN, which we then drop. This prevents chart-breaking errors.
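
To see what the two cleaning lines do, here is a small self-contained sketch on made-up rows. The values are invented for illustration and are not real OpenF1 output.

```python
import pandas as pd

# Made-up sample rows mixing strings, a missing value, and a float
laps = pd.DataFrame({
    "lap_number": [1, 2, 3, 4],
    "lap_duration": ["91.532", None, "n/a", 90.871],
})

# Unparseable entries become NaN instead of raising an error
laps["lap_duration"] = pd.to_numeric(laps["lap_duration"], errors="coerce")
# Drop laps without a usable duration
laps = laps.dropna(subset=["lap_duration"])

print(laps)
```

Only laps 1 and 4 survive, and the column ends up with a numeric dtype that Plotly can draw directly.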

8. Pit Stop Analysis

The pit stop tab demonstrates a different chart type — a bar chart using Plotly Express:

pit_stops_with_names = pit_stops.merge(
    drivers[["driver_number", "full_name"]],
    on="driver_number"
)

fig = px.bar(
    pit_stops_with_names,
    x="full_name",
    y="pit_duration",
    text="pit_duration"
)

The .merge() operation enriches pit stop data with driver names. The text parameter adds data labels directly on the bars, with formatting customized in update_traces():

fig.update_traces(texttemplate='%{text:.2f}s', textposition='outside')

Finding the fastest pit stop uses idxmin() — a vectorized operation that's more efficient than sorting:

fastest_idx = pit_stops_with_names["pit_duration"].idxmin()
fastest = pit_stops_with_names.loc[fastest_idx]
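
The merge and idxmin() steps can be tried end to end on a tiny example. This is a sketch with made-up data and placeholder driver names; the guard is an addition of this sketch, since idxmin() may raise when the column is empty or contains no valid numbers.

```python
import pandas as pd

# Made-up sample data with placeholder driver names
pit_stops = pd.DataFrame({
    "driver_number": [1, 2, 1],
    "lap_number": [14, 15, 38],
    "pit_duration": [23.4, 22.1, 24.0],
})
drivers = pd.DataFrame({
    "driver_number": [1, 2],
    "full_name": ["Driver A", "Driver B"],
})

merged = pit_stops.merge(drivers[["driver_number", "full_name"]], on="driver_number")

if merged["pit_duration"].notna().any():
    # idxmin returns the index label of the smallest value, skipping NaN
    fastest = merged.loc[merged["pit_duration"].idxmin()]
    print(f"Fastest stop: {fastest['full_name']}, "
          f"{fastest['pit_duration']:.1f}s on lap {fastest['lap_number']}")
else:
    fastest = None
    print("No pit stop durations available")
```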

Error Handling & User Feedback

Throughout the dashboard, we provide clear feedback when data is unavailable:

if sessions.empty:
    st.warning(f"No race sessions found for {selected_year}")
    st.stop()

The st.stop() method halts execution gracefully, preventing downstream errors from accessing empty DataFrames.

API-Specific Error Handling

The fetch_data() function catches three common failure modes:

Network errors – The API server might be down or unreachable.
Timeout errors – The API didn't respond within 10 seconds.
HTTP errors – The API returned a status code like 404 (not found) or 500 (server error).

In each case, the user sees a clear error message in the Streamlit UI, and the dashboard continues to function for other endpoints.

For upcoming races, we show a contextual message:

if session_date > now:
    st.info("📅 **Upcoming Race** - Data will be available after the session")
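
The comparison session_date > now only works when both values are either timezone-aware or both naive; mixing the two raises a TypeError. A minimal sketch of one way to keep them consistent follows. The function name is_upcoming is hypothetical, and treating offset-free timestamps as UTC is an assumption of this sketch.

```python
from datetime import datetime, timezone


def is_upcoming(date_str, now=None):
    """Return True when the session start lies in the future."""
    session_date = datetime.fromisoformat(date_str.replace("Z", "+00:00"))
    if session_date.tzinfo is None:
        # Assumption: timestamps without an offset are UTC
        session_date = session_date.replace(tzinfo=timezone.utc)
    # Use an aware 'now' so the comparison never mixes naive and aware values
    now = now or datetime.now(timezone.utc)
    return session_date > now


# A fixed reference time keeps the example deterministic
reference = datetime(2026, 1, 1, tzinfo=timezone.utc)
print(is_upcoming("2026-03-08T04:00:00Z", now=reference))
print(is_upcoming("2025-03-16T04:00:00+00:00", now=reference))
```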

Performance Considerations (For Beginners)

When building a dashboard that fetches live data, performance matters. Here's why each design choice was made:

1. Why caching matters

Every time you interact with a Streamlit app (clicking a button, selecting a dropdown), the entire script reruns from top to bottom. Without caching, that means every click would send a fresh request to the OpenF1 API. With @st.cache_data, the first request fetches data, and subsequent reruns use the cached result until it expires (5 minutes).

2. TTL (Time To Live) explained

ttl=300 means "cache this data for 300 seconds (5 minutes)". Why 5 minutes? F1 sessions are long (90+ minutes), but lap times don't change after a session ends. For live sessions, 5 minutes is a reasonable balance — you won't see updates faster than that, but you also won't hammer the API with requests every second.
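
For readers who want to see the expiry idea in isolation, here is a toy time-based cache in plain Python. It is not how Streamlit implements st.cache_data; the class name TTLCache and the fake clock are inventions of this sketch that only mimic the behaviour the ttl argument controls.

```python
import time


class TTLCache:
    """Tiny time-based cache, for illustration only."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # key -> (expiry time, value)

    def get_or_compute(self, key, compute):
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: reuse the cached value
        value = compute()  # missing or expired: compute again
        self._store[key] = (now + self.ttl, value)
        return value


# A fake clock lets us jump forward in time without waiting
fake_time = [0.0]
calls = []
cache = TTLCache(ttl_seconds=300, clock=lambda: fake_time[0])


def fake_fetch():
    calls.append(1)  # count how often the "API" is actually hit
    return "lap data"


cache.get_or_compute("laps", fake_fetch)  # first call: computes
fake_time[0] = 299
cache.get_or_compute("laps", fake_fetch)  # within 5 minutes: cached
fake_time[0] = 301
cache.get_or_compute("laps", fake_fetch)  # expired: computes again
print(len(calls))  # 2
```

Three lookups result in only two fetches, which is the saving the decorator provides on every rerun inside the 5-minute window.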

3. Selective data fetching

The dashboard does not fetch lap data for every driver on the grid at once. Instead, it only fetches lap data for the drivers you explicitly select in the multi-select box. This reduces network traffic and memory usage.

4. Vectorization vs. loops

Pandas operations like idxmin() and .isin() run in compiled code and are typically much faster than Python loops over rows. With only a few dozen pit stops per race the difference is small in absolute terms, but the vectorized call is also shorter and easier to read than a manual loop.

5. Session state keeps the selected season

Without st.session_state, the season chosen in the sidebar would fall back to its default on the next rerun, because a button only returns True on the run in which it is clicked. Session state preserves the selection across reruns, and because fetch_data() is cached by its arguments, an unchanged season and race do not trigger new API calls.

6. Empty DataFrame pattern

Returning an empty DataFrame (pd.DataFrame()) instead of None allows the rest of the code to use .empty checks. This keeps the calling code simpler than wrapping every API call in try/except.

Possible Improvements to the Dashboard

Here are several ways you could extend this project:

Feature	Implementation Approach
Qualifying comparison	Add a tab for Q1/Q2/Q3 sector times
Driver head-to-head	Lap time delta chart between two drivers
Tire strategy analysis	Fetch stint data from the API's `stints` endpoint
Track map visualization	Use `session_key` with telemetry endpoints
Live timing	Lower the cache TTL and add auto-refresh
Weather overlay	Fetch weather data from OpenF1's `weather` endpoint

Troubleshooting Common Issues

Issue	Solution
"Module not found" errors	Ensure you've activated the virtual environment and installed the packages from Step 3 (`pip install streamlit pandas plotly requests`)
Empty charts or no data	Some sessions (especially upcoming races) haven't occurred yet. Try a completed race from 2025
API rate limiting	OpenF1 is generous, but if you encounter limits, increase `ttl` to 600 seconds and avoid rapid manual refreshes
Port 8501 is busy	Run `streamlit run app.py --server.port 8502` to use a different port
Git clone permission denied	Use HTTPS instead of SSH: `git clone https://github.com/e-raine/F1-Dashboard-Using-Openf1-and-Streamlit.git`
API connection errors	Check your internet connection. OpenF1 is publicly hosted; if it's down, the dashboard will show error messages gracefully
SSL certificate errors	Update your Python environment: `pip install --upgrade certifi`

Conclusion

This dashboard demonstrates how to build an application by combining Streamlit's reactive framework, Plotly's interactive charts, and the OpenF1 API. The key architectural patterns worth remembering are: cache API calls with TTL-based expiration to avoid rate limiting and improve performance, use Streamlit's session state to persist user selections across rerenders, provide graceful fallbacks for missing or incomplete data, format data at display time rather than storage time to keep your raw data clean, and always give users context about upcoming or ongoing sessions instead of leaving them staring at empty charts. Whether you're building this for your own F1 analytics or as a portfolio project, these same patterns scale cleanly to any real-time sports data API — from MotoGP to the NBA.

Resources

GitHub Repository: https://github.com/e-raine/F1-Dashboard-Using-Openf1-and-Streamlit
OpenF1 API Documentation: https://docs.openf1.org
Streamlit Documentation: https://docs.streamlit.io
Plotly Python Documentation: https://plotly.com/python

How to Get Started with Scikit-Learn: A Beginner-Friendly Guide to Machine Learning in Python

Eve Loraine Nuñal — Thu, 24 Apr 2025 12:53:33 +0000

Co-authored with @marverickdev

Do you want to get started with machine learning, but you do not know where to start? Do you want to take advantage of the data manipulation capabilities of Python and make your own ML model locally? Well, there is a Python library designed to do just that, which is being used by startups and companies alike, and the name is Scikit-Learn!

What is Scikit-Learn, exactly?

Scikit-learn, also known as sklearn, is a leading machine learning library for Python that provides fundamental tools for both beginners and experienced developers to use for model training, data analysis, and statistical modeling. It includes essential modules for classification, regression, clustering, dimensionality reduction, model selection, and preprocessing. Its model selection tools include cross-validation methods like KFold and cross_val_score, hyperparameter search techniques such as GridSearchCV and RandomizedSearchCV, and utilities for scoring, validation curves, and data splitting.

As is the case with most Python libraries, it is open-source and free-to-use, making it easily accessible by anyone willing to learn machine learning, and it is built upon other open-source libraries within Python, like SciPy for advanced scientific operations, NumPy for efficient numerical computations, Matplotlib for data visualization, and Cython for increased efficiency and speed, similar to that of C/C++.

Why Use Scikit-Learn?

Without libraries like scikit-learn, diving into machine learning would feel a lot like trying to bake a cake from scratch without a recipe: messy, time-consuming, and probably a little burnt. Scikit-learn hands you a ready-made toolkit packed with reliable, beginner-friendly tools for everything from classification and regression to clustering and dimensionality reduction. It’s well-documented and super popular in both classrooms and companies, like Spotify, AWS, J.P. Morgan, and Evernote, meaning there’s always someone who’s faced the same problem you’re tackling. And because it’s actively maintained, you’re not stuck using outdated methods; you get access to the latest techniques without the hassle.

From a developer’s point of view, scikit-learn is like having a set of interchangeable LEGO bricks. Its consistent, clean interface means you don’t have to memorize a million different function names for different algorithms. Whether you’re using a decision tree or a support vector machine, you’ll be calling familiar functions like fit(), predict(), and score(). This makes experimenting way smoother, leaving developers free to focus on building smarter models, rather than wrestling with complicated code. Plus, its active community means plenty of tutorials, updates, and fixes are always within reach.

Scikit-Learn vs. TensorFlow vs. PyTorch

Scikit-Learn, TensorFlow, and PyTorch are three of the most widely used libraries in machine learning and deep learning, each serving different purposes and catering to distinct workflows.

Scikit-Learn is the go-to library for classical machine learning tasks, offering a simple and consistent API for algorithms like linear regression, support vector machines (SVMs), and random forests. It excels in handling small-to-medium-sized structured datasets (e.g., CSV files) and is built on NumPy and SciPy, making it efficient for CPU-based computations. However, it lacks native GPU support and is not designed for deep learning—though it does include a basic multi-layer perceptron (MLP) for simple neural networks. Scikit-Learn is ideal for tasks like customer segmentation, fraud detection, and traditional predictive modeling where deep learning is unnecessary.

TensorFlow, developed by Google, is a powerful framework for deep learning, particularly suited for large-scale neural network training and deployment. Its high-level Keras API simplifies model building, while its low-level operations allow for fine-grained control. TensorFlow supports distributed training, making it a strong choice for production environments, and it integrates well with mobile (LiteRT) and web deployment (TensorFlow.js). It is widely used in industry for applications like image recognition, natural language processing (NLP), and recommender systems. While it has a steeper learning curve than Scikit-Learn, its robustness and scalability make it a favorite for production-grade deep learning.

PyTorch, developed by Meta (Facebook), is the preferred framework for research and rapid prototyping in deep learning. Its dynamic computation graph (eager execution) allows for more intuitive debugging and flexibility, making it popular in academia and cutting-edge research. PyTorch’s Pythonic design and strong GPU acceleration (via CUDA) enable quick experimentation with novel architectures like transformers, generative adversarial networks (GANs), and reinforcement learning models. While historically lagging behind TensorFlow in deployment tools, PyTorch has improved significantly with TorchScript and ONNX support, narrowing the gap. Researchers and startups often favor PyTorch for its ease of use and dynamic nature.

Choosing the Right Tool

Use Scikit-Learn for classical ML tasks where deep learning is overkill.
Use TensorFlow for scalable deep learning in production, especially when deployment is a priority.
Use PyTorch for research, experimentation, and when flexibility in model design is crucial.

Getting Started with Scikit-Learn

Creating a Virtual Environment (Optional)

Before we can go ahead with the installation, it is recommended to create a virtual environment for Python so that the installation is isolated to the project. To do so, type this command in your preferred IDE’s terminal (We will be using VS Code for this guide):

python -m venv .venv

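This step is optional when you are just getting started with Python and Scikit-Learn, but it keeps this project's packages from causing unexpected errors in your other Python projects. If you create the environment, activate it before installing anything: source .venv/bin/activate on macOS/Linux, or .venv\Scripts\activate on Windows.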

Installation

To install Scikit-Learn into your project, enter this command in the terminal:

python -m pip install scikit-learn

Using python -m pip ensures that the package is installed into the Python environment currently selected in your terminal, which is helpful in VS Code where several interpreters may be available.

Importing Scikit-Learn Modules

You can import different parts of Scikit-Learn depending on what you need.

Here’s an example:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

Basic Workflow in Scikit-Learn

📌Import Necessary Modules
Use modules that are important for your project/use case (e.g. Logistic Regression)

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

📌Load and Prepare Data
Import the dataset for training

# Example with built-in dataset
from sklearn.datasets import load_iris
data = load_iris()
X = data.data  # Features
y = data.target  # Target variable

📌Split Data into Training and Test Sets
Split the dataset into a training subset used to fit the model and a test subset held back to evaluate it

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

📌Preprocess the Data
Preprocessing transforms the features so the model can learn from them; here, StandardScaler rescales each feature to zero mean and unit variance

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)  # Note: Only transform the test set

📌Choose and Train a Model
Choose the model that you imported in the first step

model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

📌Make Predictions (if learning is supervised)
A supervised model learns from labeled examples; once trained, it can predict labels for data it has not seen

y_pred = model.predict(X_test)

📌Evaluate the Model
Compare the predictions with the true test labels to measure the model's accuracy

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

How to use Scikit-Learn

Scikit-learn follows a consistent API pattern across all algorithms:

Import the appropriate model
Instantiate the model with parameters
Fit the model to your data
Predict or evaluate with new data
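
The four steps above can be sketched with three different algorithms to show that the calling code stays the same. The choice of models and the dictionary layout are assumptions of this sketch, not a recommendation from the original workflow.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Three different algorithms, one identical calling pattern
models = {
    "k-NN": KNeighborsClassifier(n_neighbors=3),
    "Decision tree": DecisionTreeClassifier(random_state=42),
    "SVM": SVC(),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                 # fit on the training data
    scores[name] = model.score(X_test, y_test)  # mean accuracy on the test data
    print(f"{name}: {scores[name]:.2f}")
```

Swapping in another classifier only requires changing the dictionary entry; the loop body does not change.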

Example 1: Classification (Iris Dataset)

Classifies iris flowers into species based on their petal and sepal length and width

# Importing required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# 1. Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# 2. Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=42
)

# 3. Initialize and train KNN classifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# 4. Make predictions and evaluate model
predictions = knn.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, predictions):.2f}")

Output (from matplotlib.pyplot)

Output (from Principal Component Analysis [PCA])

Example 2: Regression (Diabetes)

Fits a linear regression that relates a single patient feature (body mass index, column 2 of the dataset) to a quantitative measure of diabetes progression one year after baseline

# Importing required libraries
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

# 1. Load and prepare dataset
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
diabetes_X = diabetes_X[:, np.newaxis, 2]  # Use only one feature

# 2. Split data into training and test sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
diabetes_y_train = diabetes_y[:-20]
diabetes_y_test = diabetes_y[-20:]

# 3. Create and train linear regression model
regr = linear_model.LinearRegression()
regr.fit(diabetes_X_train, diabetes_y_train)

# 4. Make predictions and evaluate model
diabetes_y_pred = regr.predict(diabetes_X_test)
print("Coefficients: \n", regr.coef_)
print("Mean squared error: %.2f" % mean_squared_error(diabetes_y_test, diabetes_y_pred))
print("Coefficient of determination: %.2f" % r2_score(diabetes_y_test, diabetes_y_pred))

# 5. Visualize results
plt.scatter(diabetes_X_test, diabetes_y_test, color="black")
plt.plot(diabetes_X_test, diabetes_y_pred, color="blue", linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()

Output (from matplotlib)

Example 3: Clustering (K-Means)

Groups similar data points into clusters and locates the center (mean) of each cluster

# Importing required libraries
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# 1. Create synthetic data
X, y = make_blobs(n_samples=300, centers=4, random_state=42)

# 2. Create and fit model
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)  # fixed seed makes the clusters reproducible
kmeans.fit(X)

# 3. Get cluster assignments
labels = kmeans.labels_

# 4. Visualization
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=200, c='red', marker='X')
plt.show()

Output (from matplotlib)

Tips for Effective Use

Always split your data into training and test sets
Preprocess your data appropriately (scaling, encoding, etc.)
Start with simple models before trying complex ones
Use cross-validation to evaluate model performance
Explore the extensive documentation for each algorithm's parameters
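
Two of these tips, preprocessing and cross-validation, combine naturally. The sketch below is one possible way to do it, assuming the Wine dataset and a k-NN classifier as stand-ins; wrapping the scaler in a pipeline means it is refitted on each training fold rather than on the full dataset.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# The pipeline refits the scaler on each training fold only,
# so no information from the validation fold leaks into scaling
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))

# 5-fold cross-validation; for classifiers the default score is accuracy
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"Fold accuracies: {scores.round(2)}")
print(f"Mean accuracy: {scores.mean():.2f}")
```

Averaging over several folds gives a steadier estimate than a single train/test split, at the cost of training the model once per fold.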

Other Examples of Models in Scikit-Learn

1️⃣Decision Tree

What it does: It makes predictions by learning simple rules from data.
When to use: When you need an interpretable model (keep in mind that deep trees can overfit).

# Import required libraries
import numpy as np
from sklearn.tree import DecisionTreeRegressor
import matplotlib.pyplot as plt

# 1. Generate synthetic data (sine wave with noise)
rng = np.random.RandomState(1)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel()
y[::5] += 3 * (0.5 - rng.rand(16))  # Add noise to every 5th point

# 2. Initialize and train models
regr_shallow = DecisionTreeRegressor(max_depth=2)  # Simple model
regr_deep = DecisionTreeRegressor(max_depth=5)     # Complex model
regr_shallow.fit(X, y)
regr_deep.fit(X, y)

# 3. Create test data and predictions
X_test = np.arange(0.0, 5.0, 0.01)[:, np.newaxis]
y_shallow = regr_shallow.predict(X_test)
y_deep = regr_deep.predict(X_test)

# 4. Visualize results
plt.figure(figsize=(10, 6))
plt.scatter(X, y, s=20, edgecolor="black", c="darkorange", label="Data")
plt.plot(X_test, y_shallow, color="cornflowerblue", label="max_depth=2", linewidth=2)
plt.plot(X_test, y_deep, color="yellowgreen", label="max_depth=5", linewidth=2)
plt.xlabel("X")
plt.ylabel("y")
plt.title("Decision Tree Regression: Effect of Tree Depth")
plt.legend()
plt.show()

Output (from matplotlib)

2️⃣Random Forest

What it does: Builds an ensemble of decision trees, which is usually more accurate and less prone to overfitting than a single tree.
When to use: When you want a robust, "just works" model.

# Import required libraries
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# 1. Load the diabetes dataset
data = load_diabetes()
X = data.data  # Features (medical measurements)
y = data.target  # Target variable (disease progression score)

# 2. Split data into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Train a Random Forest Regressor
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# 4. Predict on the test set and evaluate performance
y_pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))  # Lower is better
print("R² Score:", r2_score(y_test, y_pred))       # Closer to 1 is better

# 5. Display feature importances (which variables matter most?)
importances = model.feature_importances_
for i, (feature, importance) in enumerate(zip(data.feature_names, importances)):
    print(f"{i+1}. {feature}: {importance:.4f}")

Output (from terminal)

3️⃣k-Nearest Neighbors (k-NN)

What it does: Predicts based on the closest training examples (fit() only stores the training data; no parameters are learned).
When to use: For small datasets or when local patterns matter.

# Import required libraries
from sklearn.datasets import load_wine
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report

# 1. Load the Wine dataset
wine = load_wine()
X = wine.data  # Features (chemical properties)
y = wine.target  # Target variable (wine class: 0, 1, or 2)

# 2. Split data into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Standardize features (critical for distance-based models like KNN)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # Fit on train, transform train
X_test = scaler.transform(X_test)        # Transform test (no fitting)

# 4. Train a KNN classifier with 3 neighbors
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

# 5. Predict and evaluate performance
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Output (from terminal)

Conclusion

As you can see, Scikit-Learn has many uses and applications that are critical for our daily lives, many of which we are not aware of. From research and predictive analytics to natural language processing and recommendation algorithms, all can be easily done with the help of this Python library. So, if you are looking for a great starting point in the realm of machine learning, consider going deep into Scikit-Learn for your first ML project.

If you want to learn more about Scikit-Learn, you can refer to the official user guide for further details.

References / Materials:

https://www.ibm.com/think/topics/scikit-learn
https://scikit-learn.org/stable/testimonials/testimonials.html
https://www.analyticsvidhya.com/blog/2021/07/15-most-important-features-of-scikit-learn/
https://scikit-learn.org/1.5/auto_examples/linear_model/plot_ols.html