Yosuke Hanaoka

Machine Learning Model Deployment as a Web App using Streamlit

Introduction

A machine learning model is essentially a set of rules or mechanisms used to make predictions or find patterns in data. To put it very simply (and at the risk of oversimplification), a trendline calculated using the least squares method in Excel is also a model. However, models used in real applications are not that simple; they often involve more complex equations and algorithms, not just simple equations.

In this post, I’m going to start by building a very simple machine learning model and releasing it as a very simple web app to get a feel for the process.

Here, I’ll focus only on the process, not the ML model itself. Also, I’ll use Streamlit and Streamlit Community Cloud to easily release Python web applications.

TL;DR:

Using scikit-learn, a popular Python library for machine learning, you can train a model on your data with just a few lines of code for simple tasks. The model can then be saved as a reusable file with joblib. The saved model can be imported and loaded like a regular Python library in a web application, allowing the app to make predictions using the trained model!

App URL: https://yh-machine-learning.streamlit.app/
GitHub: https://github.com/yoshan0921/yh-machine-learning.git

Technology Stack

  • Python
  • Streamlit: For creating the web application interface.
  • scikit-learn: For loading and using the pre-trained Random Forest model.
  • NumPy & Pandas: For data manipulation and processing.
  • Matplotlib & Seaborn: For generating visualizations.

What I Made

This app allows you to examine predictions made by a random forest model trained on the Palmer Penguins dataset. (See the end of this article for more details on the training data.)

Specifically, the model predicts penguin species based on a variety of features, including island, bill length, bill depth, flipper length, body mass, and sex. Users can navigate the app to see how different features affect the model's predictions.

  • Prediction Screen

  • Learning Data/Visualization Screen

Development Step 1 - Creating the Model

Step 1.1 Import Libraries

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import joblib

pandas is a Python library specialized in data manipulation and analysis. It supports data loading, preprocessing, and structuring using DataFrames, preparing data for machine learning models.
sklearn is a comprehensive Python library for machine learning that provides tools for training and evaluating models. In this post, I will build a model using a learning method called Random Forest.
joblib is a Python library that helps save and load Python objects, such as machine learning models, very efficiently.

Step 1.2 Read Data

df = pd.read_csv("./dataset/penguins_cleaned.csv")
X_raw = df.drop("species", axis=1)
y_raw = df.species

Load the dataset (training data) and separate it into features (X) and the target variable (y).
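If you want to confirm the file loaded as expected, a quick inspection like the following helps (the column names in the comments assume the cleaned Palmer Penguins CSV used here):

print(df.shape)                      # (rows, columns)
print(df.columns.tolist())           # expected: species, island, bill_length_mm, ...
print(df["species"].value_counts())  # class balance across the three species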

Step 1.3 Encode the Categorical Variables

encode = ["island", "sex"]
X_encoded = pd.get_dummies(X_raw, columns=encode)

target_mapper = {"Adelie": 0, "Chinstrap": 1, "Gentoo": 2}
y_encoded = y_raw.apply(lambda x: target_mapper[x])

The categorical variables are converted into a numerical format using one-hot encoding (X_encoded). For example, if “island” contains the categories “Biscoe”, “Dream”, and “Torgersen”, a new column is created for each (island_Biscoe, island_Dream, island_Torgersen). The same is done for sex. If the original data is “Biscoe,” the island_Biscoe column will be set to 1 and the others to 0.
The target variable species is mapped to numerical values (y_encoded).
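To see the encoding concretely, here is a minimal, self-contained example with toy values (not the real dataset):

import pandas as pd

toy = pd.DataFrame({"island": ["Biscoe", "Dream"], "sex": ["male", "female"]})
print(pd.get_dummies(toy, columns=["island", "sex"]))
#    island_Biscoe  island_Dream  sex_female  sex_male
# 0           True         False       False      True
# 1          False          True        True     False
# (recent pandas versions return booleans; older versions return 0/1 integers)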

Step 1.4 Split the Dataset

x_train, x_test, y_train, y_test = train_test_split(
    X_encoded, y_encoded, test_size=0.3, random_state=1
)

To evaluate a model, you need to measure its performance on data that was not used for training. A 7:3 train/test split is a widely used convention in machine learning.
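If you want the species proportions preserved in both splits, train_test_split also accepts a stratify argument; a minimal variant of the call above:

x_train, x_test, y_train, y_test = train_test_split(
    X_encoded, y_encoded, test_size=0.3, random_state=1, stratify=y_encoded
)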

Step 1.5 Train a Random Forest Model

clf = RandomForestClassifier()
clf.fit(x_train, y_train)

The fit method is used to train the model.
x_train holds the training data for the explanatory variables, and y_train holds the corresponding target values.
Calling this method stores the model trained on that data in clf.
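Beyond plain accuracy, scikit-learn's classification_report gives per-class precision and recall, which is a quick way to spot a species the model struggles with (a small sketch, using the class names from target_mapper):

from sklearn.metrics import classification_report

y_pred = clf.predict(x_test)
print(classification_report(y_test, y_pred, target_names=["Adelie", "Chinstrap", "Gentoo"]))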

Step 1.6 Save the Model

joblib.dump(clf, "penguin_classifier_model.pkl")

joblib.dump() is a function for saving Python objects in binary format. By saving the model in this format, the model can be loaded from a file and used as-is without having to be trained again.
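As a quick sanity check that the file round-trips correctly, you can load it back and confirm the predictions match the in-memory model:

loaded_clf = joblib.load("penguin_classifier_model.pkl")
assert (loaded_clf.predict(x_test) == clf.predict(x_test)).all()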

Sample Code

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import joblib
# Read the data file (Using data from the Data Professor)
df = pd.read_csv("./dataset/penguins_cleaned.csv")
# Define features and targets
X_raw = df.drop("species", axis=1)
print(X_raw)
y_raw = df.species
print(y_raw)
# Data encoding
encode = ["island", "sex"]
X_encoded = pd.get_dummies(X_raw, columns=encode)
target_mapper = {"Adelie": 0, "Chinstrap": 1, "Gentoo": 2}
y_encoded = y_raw.apply(lambda x: target_mapper[x])
# Split the data into training and test sets
x_train, x_test, y_train, y_test = train_test_split(
    X_encoded, y_encoded, test_size=0.3, random_state=1
)
# Train the ML model
clf = RandomForestClassifier()
clf.fit(x_train, y_train)
# Display the accuracy of the model
print(f"Train accuracy: {accuracy_score(y_train, clf.predict(x_train))}")
print(f"Test accuracy: {accuracy_score(y_test, clf.predict(x_test))}")
# Save the model
joblib.dump(clf, "penguin_classifier_model.pkl")
print("Model creation completed!")

Development Step 2 - Building the Web App and Integrating the Model

Step 2.1 Import Libraries

import streamlit as st
import numpy as np
import pandas as pd
import joblib

streamlit is a Python library that makes it easy to create and share custom web applications for machine learning and data science projects.
numpy is a fundamental Python library for numerical computing. It provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays efficiently.

Step 2.2 Retrieve and Encode Input Data

data = {
    "island": island,
    "bill_length_mm": bill_length_mm,
    "bill_depth_mm": bill_depth_mm,
    "flipper_length_mm": flipper_length_mm,
    "body_mass_g": body_mass_g,
    "sex": sex,
}
input_df = pd.DataFrame(data, index=[0])

encode = ["island", "sex"]
input_encoded_df = pd.get_dummies(input_df, columns=encode)

Input values are retrieved from the input form created by Streamlit, and categorical variables are encoded using the same rules as when the model was created. Note that the column order must also match the order used during training; if it differs, an error will occur when the model makes a prediction.
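If you prefer something more compact than adding missing columns one by one, pandas' reindex can enforce both the column set and the order in a single call (a small sketch; expected_columns is the training-time column list shown in the full sample below):

# Align the one-hot encoded input with the training-time column order;
# dummy columns absent from this single-row input are filled with 0.
input_encoded_df = input_encoded_df.reindex(columns=expected_columns, fill_value=0)

With recent scikit-learn versions, a model fitted on a DataFrame also exposes the training column order as clf.feature_names_in_, which can serve as expected_columns.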

Step2.3 Load the Model

clf = joblib.load("penguin_classifier_model.pkl")

"penguin_classifier_model.pkl" is the file where the previously saved model is stored. This file contains a trained RandomForestClassifier in binary format. Running this code loads the model into clf, allowing you to use it for predictions and evaluations on new data.

Step 2.4 Perform Prediction

prediction = clf.predict(input_encoded_df)
prediction_proba = clf.predict_proba(input_encoded_df)

clf.predict(input_encoded_df): Uses the trained model to predict the class for the new encoded input data, storing the result in prediction.
clf.predict_proba(input_encoded_df): Calculates the probability for each class, storing the results in prediction_proba.
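Note that predict returns the encoded integer label, so it has to be mapped back to a species name for display; one way, mirroring target_mapper from Step 1.3:

species = ["Adelie", "Chinstrap", "Gentoo"]   # index matches target_mapper
predicted_species = species[int(prediction[0])]
proba_percent = prediction_proba[0] * 100     # per-class probabilities as percentages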

Sample Code

import streamlit as st
import numpy as np
import pandas as pd
import joblib
# Create a bordered container for the input form
container = st.container(border=True)
# Input parameter fields
container.header("Input features")
sex = container.selectbox("Sex", ("male", "female"))
island = container.selectbox("Island", ("Biscoe", "Dream", "Torgersen"))
bill_length_mm = container.slider("Bill length (mm)", 32.1, 59.6, 43.9)
bill_depth_mm = container.slider("Bill depth (mm)", 13.1, 21.5, 17.2)
flipper_length_mm = container.slider("Flipper length (mm)", 172.0, 231.0, 201.0)
body_mass_g = container.slider("Body mass (g)", 2700.0, 6300.0, 4207.0)
# Create a DataFrame for the input features
data = {
    "island": island,
    "bill_length_mm": bill_length_mm,
    "bill_depth_mm": bill_depth_mm,
    "flipper_length_mm": flipper_length_mm,
    "body_mass_g": body_mass_g,
    "sex": sex,
}
input_df = pd.DataFrame(data, index=[0])
# Encode the categorical variables
encode = ["island", "sex"]
input_encoded_df = pd.get_dummies(input_df, columns=encode)
# Ensure all dummy variables used during model training are present, in this order
expected_columns = [
    "bill_length_mm",
    "bill_depth_mm",
    "flipper_length_mm",
    "body_mass_g",
    "island_Biscoe",
    "island_Dream",
    "island_Torgersen",
    "sex_female",
    "sex_male",
]
# Add missing dummy columns as False (i.e., 0)
for col in expected_columns:
    if col not in input_encoded_df.columns:
        input_encoded_df[col] = False
# Reorder the columns in line with expected_columns
input_encoded_df = input_encoded_df[expected_columns]
# Load the model
clf = joblib.load("penguin_classifier_model.pkl")
# Execute prediction
prediction = clf.predict(input_encoded_df)
prediction_proba = clf.predict_proba(input_encoded_df) * 100  # convert to %
# Display the prediction result
st.write("## 🐧Prediction results")
penguins_species = np.array(["Adelie", "Chinstrap", "Gentoo"])
st.success(str(penguins_species[prediction][0]))
# Display the prediction probabilities as progress bars
df_prediction_proba = pd.DataFrame(prediction_proba, columns=["Adelie", "Chinstrap", "Gentoo"])
st.dataframe(
    df_prediction_proba,
    column_config={
        "Adelie": st.column_config.ProgressColumn(
            "Adelie", format="%d %%", min_value=0, max_value=100
        ),
        "Chinstrap": st.column_config.ProgressColumn(
            "Chinstrap", format="%d %%", min_value=0, max_value=100
        ),
        "Gentoo": st.column_config.ProgressColumn(
            "Gentoo", format="%d %%", min_value=0, max_value=100
        ),
    },
    hide_index=True,
    width=704,
)

Step 3. Deploy

Streamlit Community Cloud

You can publish your application on the Internet by accessing Streamlit Community Cloud (https://streamlit.io/cloud) and specifying the URL of your GitHub repository.
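Note that Streamlit Community Cloud installs your app's dependencies from a requirements.txt in the repository root, so the file should list everything the app imports. For this app, something like the following (unpinned here for brevity; in practice, pin the versions you tested with):

streamlit
scikit-learn
pandas
numpy
joblib
matplotlib
seaborn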

About the Dataset


Artwork by @allison_horst (https://github.com/allisonhorst)

The model is trained using the Palmer Penguins dataset, a widely recognized dataset for practicing machine learning techniques. This dataset provides information on three penguin species (Adelie, Chinstrap, and Gentoo) from the Palmer Archipelago in Antarctica. Key columns include:

  • Species: The species of the penguin (Adelie, Chinstrap, Gentoo); this is the prediction target.
  • Island: The specific island where the penguin was observed (Biscoe, Dream, Torgersen).
  • Bill Length: The length of the penguin's bill (mm).
  • Bill Depth: The depth of the penguin's bill (mm).
  • Flipper Length: The length of the penguin's flipper (mm).
  • Body Mass: The mass of the penguin (g).
  • Sex: The sex of the penguin (male or female).

This dataset is sourced from Kaggle, and it can be accessed here. The diversity in features makes it an excellent choice for building a classification model and understanding the importance of each feature in species prediction.

