<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Dare Johnson</title>
    <description>The latest articles on DEV Community by Dare Johnson (@dare_johnson).</description>
    <link>https://dev.to/dare_johnson</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1138243%2F2a60ecff-b36f-4d85-a401-70b3a2092736.png</url>
      <title>DEV Community: Dare Johnson</title>
      <link>https://dev.to/dare_johnson</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dare_johnson"/>
    <language>en</language>
    <item>
      <title>How To Implement Code Modularity in Data Science and Machine Learning Projects</title>
      <dc:creator>Dare Johnson</dc:creator>
      <pubDate>Sat, 10 Feb 2024 08:23:36 +0000</pubDate>
      <link>https://dev.to/dare_johnson/how-to-implement-code-modularity-in-data-science-and-machine-learning-projects-1fgm</link>
      <guid>https://dev.to/dare_johnson/how-to-implement-code-modularity-in-data-science-and-machine-learning-projects-1fgm</guid>
      <description>&lt;p&gt;In the data science field where things evolve fast and codes can soon become obsolete, the ability to know how to structure projects can not be overstated.  As data projects become complex, adopting a modular code approach emerges as a strategic imperative. Unfortunately, many new learners in the field of data science aren’t taught this modular approach to structuring data science scripts. The reason partly being that many data boot camps or data schools don’t include an end-to-end approach or method, which uses modularization of different stages as the industry standards.&lt;/p&gt;

&lt;p&gt;This guide examines the substantial advantages of structuring data science projects using modular code approaches, highlights the inherent limitations of conventional tools such as Jupyter Notebooks, presents a recommended project structure, and provides a comprehensive demonstration utilizing the iconic Iris dataset.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Outline
&lt;/h3&gt;

&lt;p&gt;Code Modularity&lt;br&gt;
Advantages of Modular Code in Data Science Projects&lt;br&gt;
Limitations of Jupyter Notebook&lt;br&gt;
Common Machine Learning Lifecycle&lt;br&gt;
Demonstration Using The Iris Dataset&lt;br&gt;
Conclusion&lt;br&gt;
&lt;br&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Code Modularity&lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;The practice of breaking down a program into separate components (more appropriately, modules, in programming parlance), where each component or module is responsible for a specific functionality, is known as code modularity. In essence, when applied to a data science problem, code modularity is all about breaking down the different stages of a data project into separate modules or scripts.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;
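&lt;p&gt;As a minimal sketch (the module and function names here are illustrative, not the scripts from the demo later in this article), logic that might otherwise sit in one long notebook can be reorganized into small, single-purpose functions, each of which would live in its own module in a real project:&lt;/p&gt;

```python
# Illustrative sketch: in a real project, each function below would live
# in its own module (e.g. data_loading.py, data_preprocessing.py).

def load_data():
    # stand-in for reading from a file, API, or database
    return [5.1, 4.9, 6.2, 5.8]

def preprocess(values):
    # one module, one responsibility: here, simple min-max scaling
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def summarize(values):
    # another single responsibility: reporting
    return sum(values) / len(values)

if __name__ == "__main__":
    data = load_data()
    scaled = preprocess(data)
    print(f"mean of scaled data: {summarize(scaled):.2f}")
```

&lt;p&gt;Each function can now be tested, reused, and maintained on its own, which is the essence of the approach demonstrated with real modules below.&lt;/p&gt;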

&lt;h3&gt;
  
  
  Advantages of Modular Code&lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Code Organization and Readability&lt;/strong&gt;&lt;br&gt;
The strategic compartmentalization of code into modules fosters a sense of clarity, disentangling intricate project structures. This approach heightens the readability of the codebase, enabling a more lucid understanding of individual components.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Code Reusability&lt;/strong&gt;&lt;br&gt;
Modules serve as reusable building blocks that transcend project boundaries. Thoughtfully designed modules can be repurposed, eliminating redundancy and expediting the development lifecycle.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Collaboration and Teamwork&lt;/strong&gt;&lt;br&gt;
A modular structure paves the way for seamless collaboration among team members. Each module can be independently developed, tested, and maintained, allowing concurrent progress on different aspects of the project.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Maintainability and Debugging&lt;/strong&gt;&lt;br&gt;
Debugging becomes a more straightforward endeavor as issues are confined within specific modules. Changes made to one module can be contained and tested, reducing unintended side effects throughout the codebase.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Limitations of Jupyter Notebook&lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Generally, data scientists love the Jupyter Notebook, myself included. The Jupyter Notebook arguably inspired &lt;a href="https://colab.research.google.com/" rel="noopener noreferrer"&gt;Google Colab&lt;/a&gt; and &lt;a href="https://deepnote.com/" rel="noopener noreferrer"&gt;Deepnote&lt;/a&gt;, and it is so popular that several IDEs (VS Code, JetBrains DataSpell, etc.) support it.&lt;/p&gt;

&lt;p&gt;While the Jupyter Notebook offers interactivity and is widely accepted and adopted, it carries certain limitations that hinder efficient project organization, among which are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Monolithic Structure&lt;/strong&gt;&lt;br&gt;
This describes a situation where the code for everything from data ingestion through model deployment (or any other steps in the data science process) lives in one codebase, for instance a single Jupyter Notebook.&lt;br&gt;&lt;br&gt;
One major disadvantage is that it makes the code more difficult to maintain and debug. Since all of the code is in a single notebook, it can be challenging to find and fix errors, or to update the code without affecting other parts of the notebook.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Version Control and Collaboration&lt;/strong&gt;&lt;br&gt;
Steps of different stages in data science projects are repetitive and iterative, thus it becomes very necessary to have a version control system in place to keep track of changes over time, and when necessary to revert to previous versions if something goes wrong. While version control systems such as Git can also track changes to the code inside a Jupyter Notebook, it can be more challenging to track changes to the codebase. In addition, separating codes into modules not only encourages easy maintainability and reuse, it also fosters collaboration, because different engineers can work on different versions and modules concurrently.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Therefore, at best, Jupyter Notebooks are most useful for exploratory data analysis (EDA) and prototyping.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Typical Stages of Machine Learning Project&lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Depending on the complexity of the project, most data science/ML projects include some or all of these stages: &lt;strong&gt;data ingestion and loading&lt;/strong&gt;, &lt;strong&gt;data preprocessing&lt;/strong&gt;, &lt;strong&gt;data exploration&lt;/strong&gt;, &lt;strong&gt;feature engineering and feature selection&lt;/strong&gt;, &lt;strong&gt;model training and model evaluation&lt;/strong&gt;, &lt;strong&gt;model deployment&lt;/strong&gt;, etc. &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Ingestion and Loading&lt;/strong&gt;: The process of importing data from one or more sources to a target location, either for storage or immediate use. Common data sources include APIs, databases/data warehouses, etc.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Preprocessing&lt;/strong&gt;: A critical step in every data science project that involves cleaning and transforming data into formats suitable for analysis.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Exploration and Visualization&lt;/strong&gt;: The process of examining data to understand it by summarizing it, and identifying patterns and relationships, sometimes with the aid of visualizations. It is a critical step that reveals potential concerns such as missing values, outliers, etc.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Feature Engineering and Selection&lt;/strong&gt;: Feature engineering involves crafting or transforming features or attributes (e.g. log transforms, binning, etc.) with the aim of improving the performance of machine learning models; feature selection, on the other hand, aims to identify and remove irrelevant and redundant features that do not contribute to model accuracy, based on certain rules.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Model Training and Evaluation&lt;/strong&gt;: Model training makes the machine learning algorithm learn patterns from the training data in order to make predictions, while (model) evaluation assesses the performance of the model on test data using a particular evaluation setup. The goal of the former is to minimize errors between predictions and actual outcomes, while the latter measures the performance of the model on unseen data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Model Deployment&lt;/strong&gt;: The process of making a trained model available to end users in production, for prediction on new data.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Demonstration Using The Iris Dataset&lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;To elucidate the potency of modularity, let's apply this approach to the popular Iris dataset.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Load data&lt;/strong&gt;&lt;br&gt;
We will leverage Scikit-learn's built-in functionality to load the Iris dataset. The version of the Iris dataset is tied to the version of Scikit-learn installed on your machine; this project uses Scikit-learn 1.3.0. To check your own Scikit-learn version, execute the following code in either a Python terminal or a Jupyter notebook: &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

import sklearn
print(sklearn.__version__)


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;To load the dataset, use the code snippet below and save it as data_loading.py in a directory. We will save the rest of the scripts in this article in the same directory. &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

# data_loading.py
from sklearn.datasets import load_iris

# create the function to load the data
def load_data():
    iris = load_iris()
    data = iris.data
    target = iris.target
    return data, target


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;To see the output, add the following code at the end of data_loading.py and run the script.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

data, target = load_data()
print("Data:")
print(data)
print("Target:")
print(target)


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;To run data_loading.py, open your command line or terminal, navigate to the directory where you saved the script, and type “python data_loading.py”. After running it, the output should look as below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmnmvwqrmzsg1mwq8h4ag.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmnmvwqrmzsg1mwq8h4ag.JPG" alt="Data"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feip7z6bd7i6bqrmxbd0a.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feip7z6bd7i6bqrmxbd0a.JPG" alt="Target"&gt;&lt;/a&gt;&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Data preprocessing&lt;/strong&gt;&lt;br&gt;
The only preprocessing step here is to split our data into training and test sets. Save the script as data_preprocessing.py.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

# data_preprocessing.py
from data_loading import load_data # to access the data and target from data_loading.py file
from sklearn.model_selection import train_test_split

def preprocess_data(data, target):
    X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=42)
    return X_train, X_test, y_train, y_test


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Running the code below as part of data_preprocessing.py will print out X_train, X_test, y_train and y_test:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

data, target = load_data()
X_train, X_test, y_train, y_test = preprocess_data(data, target)
print("X_train:")
print(X_train)
print("X_test:")
print(X_test)
print("y_train:")
print(y_train)
print("y_test:")
print(y_test)


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Output for y_train and y_test is as below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnvyvxn91dd2ucdwf0854.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnvyvxn91dd2ucdwf0854.JPG" alt="y_train y_test"&gt;&lt;/a&gt;&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Feature engineering/selection&lt;/strong&gt; &lt;br&gt;
These modules/stages can be extended based on specific project requirements. Because our demo data (the Iris dataset) is small, these stages are not necessary here.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;
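&lt;p&gt;For completeness, had the project required this stage, a hypothetical feature_engineering.py might look like the sketch below. The derived petal-area feature is our own illustrative choice and is not used elsewhere in this article:&lt;/p&gt;

```python
# feature_engineering.py (hypothetical sketch; not needed for the Iris demo)
import numpy as np

def add_petal_area(data):
    # Iris columns: sepal length, sepal width, petal length, petal width.
    # Derive a new feature: petal area = petal length * petal width.
    petal_area = data[:, 2] * data[:, 3]
    return np.column_stack([data, petal_area])
```

&lt;p&gt;Such a module would slot between data_preprocessing.py and model_training.py in the same import chain as the other scripts.&lt;/p&gt;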

&lt;p&gt;&lt;strong&gt;4. Model training&lt;/strong&gt;&lt;br&gt;
We will train a Random Forest classifier on the Iris dataset. Save the script as model_training.py in the same directory as before.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

# model_training.py
from data_loading import load_data # to access the data and target from data_loading.py file
from data_preprocessing import preprocess_data # to access the function from data_preprocessing.py file
from sklearn.ensemble import RandomForestClassifier 

# create a function to fit the algorithm to learn from the data
def train_model(X_train, y_train):
    model = RandomForestClassifier(random_state=42)
    model.fit(X_train, y_train)
    return model

data, target = load_data()
X_train, X_test, y_train, y_test = preprocess_data(data, target)
model = train_model(X_train, y_train)

print(model)


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The output here is simply the model itself, i.e. &lt;/p&gt;

&lt;p&gt;&lt;code&gt;RandomForestClassifier(random_state=42)&lt;/code&gt;&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Model evaluation&lt;/strong&gt;&lt;br&gt;
Here, we will evaluate our model and save the file as model_evaluation.py.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

# model_evaluation.py
from data_loading import load_data # from data_loading.py file
from data_preprocessing import preprocess_data # from data_preprocessing.py file
from model_training import train_model # from model_training.py file
from sklearn.metrics import accuracy_score

def evaluate_model(model, X_test, y_test):
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    return accuracy


data, target = load_data()
X_train, X_test, y_train, y_test = preprocess_data(data, target)
model = train_model(X_train, y_train)
accuracy = evaluate_model(model, X_test, y_test)
print(f"The accuracy of the model is {accuracy:.2f}")


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The code above outputs the accuracy of the model as 1.00.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2yklyq10twm14kaayij4.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2yklyq10twm14kaayij4.JPG" alt="Accuracy"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In a real situation, an accuracy of 100% (1.00) would be cause for concern; it would most often suggest “overfitting”: a situation where the model has learned the training data so well that it performs poorly on new data. In our case, however, the data is deliberately small and clean, because the purpose here is to demonstrate code modularity, not to discuss overfitting and accuracy.&lt;/p&gt;

&lt;p&gt;Having said that, let’s consider what the hyperparameter tuning code would look like, even though our model has already achieved a perfect score.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Hyperparameter tuning&lt;/strong&gt;&lt;br&gt;
We will fine-tune the model's hyperparameters using GridSearchCV and save the script as hyperparameter_tuning.py with the code below:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

# hyperparameter_tuning.py
from data_loading import load_data # from data_loading.py file
from data_preprocessing import preprocess_data # from data_preprocessing.py file
from model_training import train_model # from model_training.py file

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier 


def tune_hyperparameters(X_train, y_train):
    param_grid = {
        # tuning just 3:
        'n_estimators': [50, 100, 200],
        'max_depth': [None, 10, 20],
        'min_samples_split': [2, 5, 10]
    }

    grid_search = GridSearchCV(RandomForestClassifier(random_state=42),
                               param_grid, cv=3, verbose=1)
    grid_search.fit(X_train, y_train)
    return grid_search.best_params_

data, target = load_data()
X_train, X_test, y_train, y_test = preprocess_data(data, target)
best_params = tune_hyperparameters(X_train, y_train)
print(f"The best hyperparameters are {best_params}")



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The output of hyperparameter tuning:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhckqig847qr0wtyj0y0s.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhckqig847qr0wtyj0y0s.JPG" alt="Hyperparameter"&gt;&lt;/a&gt;&lt;br&gt;
&lt;br&gt;&lt;/p&gt;
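&lt;p&gt;A natural follow-up, not part of the original scripts, is to refit the model with the tuned values and re-evaluate. Here is a sketch, with illustrative best_params values standing in for the grid search output:&lt;/p&gt;

```python
# Hypothetical follow-up: refit with the tuned hyperparameters.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

data, target = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    data, target, test_size=0.2, random_state=42)

# illustrative values standing in for grid_search.best_params_
best_params = {"n_estimators": 50, "max_depth": None, "min_samples_split": 2}

# unpack the tuned values straight into the estimator
tuned_model = RandomForestClassifier(random_state=42, **best_params)
tuned_model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, tuned_model.predict(X_test))
print(f"Tuned model accuracy: {accuracy:.2f}")
```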

&lt;p&gt;&lt;strong&gt;7. The main script&lt;/strong&gt;&lt;br&gt;
Now, having had the separate files (.py) for all the stages or steps of building our model, the main.py script (below) orchestrates the entire workflow from data ingestion to hyperparameter tuning using the modular components:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

# main.py
import data_loading
import data_preprocessing
import hyperparameter_tuning
import model_training
import model_evaluation

if __name__ == "__main__":
    # Load Data
    data, target = data_loading.load_data()

    # Preprocess Data
    X_train, X_test, y_train, y_test = data_preprocessing.preprocess_data(data, target)

    # Hyperparameter Tuning
    best_params = hyperparameter_tuning.tune_hyperparameters(X_train, y_train)
    print("Best hyperparameters:", best_params)

    # Train Model
    model = model_training.train_model(X_train, y_train)

    # Evaluate Model
    accuracy = model_evaluation.evaluate_model(model, X_test, y_test)
    print("Model accuracy:", accuracy)


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The output will include the individual output of each of the separate modules, culminating in the accuracy of the model.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion&lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;In the expansive space of data science projects, embracing a modular code approach unlocks an array of benefits, elevating project organization, reusability, collaboration, and maintainability to new heights. The Iris dataset used for illustration may be small and clean, but the idea carries over. Generally, when code (a .py or .r file) is methodically separated into well-defined modules, data scientists lay a solid foundation for efficient development and streamlined project management. Although Jupyter notebooks have their merits, they grapple with limitations around code organization, version control, and modularity.&lt;br&gt;
As a promising extension, there is a follow-up &lt;a href="https://dev.to/dare_johnson/how-to-streamline-machine-learning-projects-with-scikit-learn-pipeline-29ej"&gt;post&lt;/a&gt; that incorporates the Scikit-learn Pipeline module to further enhance the project's modular development experience. The next article examines the advantages of Pipelines, spotlighting how they automate and standardize data processing and modeling steps. &lt;a href="https://dev.to/dare_johnson/how-to-streamline-machine-learning-projects-with-scikit-learn-pipeline-29ej"&gt;Click&lt;/a&gt; to read.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>scikitlearn</category>
      <category>python</category>
    </item>
    <item>
      <title>How to Streamline Machine Learning Projects with Scikit-learn Pipeline</title>
      <dc:creator>Dare Johnson</dc:creator>
      <pubDate>Sat, 10 Feb 2024 08:23:03 +0000</pubDate>
      <link>https://dev.to/dare_johnson/how-to-streamline-machine-learning-projects-with-scikit-learn-pipeline-29ej</link>
      <guid>https://dev.to/dare_johnson/how-to-streamline-machine-learning-projects-with-scikit-learn-pipeline-29ej</guid>
      <description>&lt;p&gt;This is a follow-up tutorial. We will go through how to use the Scikit Learn Pipeline module in addition to modularization. If you need to go through the previous tutorial which is on code modularization in data science, check &lt;a href="https://dev.to/dare_johnson/how-to-implement-code-modularity-in-data-science-and-machine-learning-projects-1fgm"&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Just for a recap: the first tutorial introduced the industry standard of separating the different stages of a machine learning project into modular scripts, where each script handles a different process (data loading, data preprocessing, model selection, etc.).&lt;/p&gt;

&lt;h3&gt;
  
  
  Table of Contents
&lt;/h3&gt;



&lt;ol&gt;
  &lt;li&gt;Scikit-learn Pipeline&lt;/li&gt;
  &lt;li&gt;Advantages of Scikit-learn Pipeline&lt;/li&gt;
  &lt;li&gt;
Illustration of Scikit-learn Pipeline
    &lt;ul&gt;
      &lt;li&gt;
About the dataset
&lt;/li&gt;
      &lt;li&gt;Step 1: Load the dataset
&lt;/li&gt;
      &lt;li&gt;Step 2: Preprocess Pipeline
&lt;/li&gt;
      &lt;li&gt;Step 3: Training Pipeline
&lt;/li&gt;
      &lt;li&gt;Step 4: Hyperparameter Pipeline
&lt;/li&gt;
      &lt;li&gt;
Serialize the model
&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The advantages of script modularization include code organization, code reusability, collaboration and teamwork, maintainability, and debugging.&lt;/p&gt;

&lt;p&gt;Now, let's talk about the Scikit-learn Pipeline module briefly.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Scikit-learn Pipeline&lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;A Scikit-learn (Sklearn) pipeline is a powerful tool for streamlining, simplifying, and organizing machine learning workflows. It's essentially a way to automate a sequence of data processing and modeling steps into a single, cohesive unit. &lt;/p&gt;

&lt;p&gt;A Pipeline allows chaining multiple data processing and modeling steps into a single, unified object.&lt;br&gt;
Modular Script + Pipeline Implementation = Best Industry Practices&lt;br&gt;
(The essence of this tutorial is to show how to use the Sklearn Pipeline module within modular scripts for an even more streamlined, industry-standard workflow.)&lt;/p&gt;
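&lt;p&gt;A minimal illustration of that chaining, using toy data rather than the Tips dataset we use later (the scaler and classifier choices here are just examples):&lt;/p&gt;

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=100, n_features=4, random_state=42)

# one object: scaling and classification, fitted and applied as a unit
pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),
])
pipe.fit(X, y)
print(f"Training accuracy: {pipe.score(X, y):.2f}")
```

&lt;p&gt;Calling fit once runs every step in order; calling predict or score pushes new data through the same sequence automatically.&lt;/p&gt;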

&lt;p&gt;Here's a list of aspects of the machine learning process where Scikit-learn Pipeline can be used:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Data Preprocessing&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Imputing missing values.&lt;/li&gt;
&lt;li&gt;Scaling and standardizing features.&lt;/li&gt;
&lt;li&gt;Encoding categorical variables.&lt;/li&gt;
&lt;li&gt;Handling outliers.

&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Feature Engineering&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Creating new features or transforming existing ones.&lt;/li&gt;
&lt;li&gt;Applying dimensionality reduction techniques (e.g., PCA)

&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Model Training and Evaluation&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Constructing a sequence of data preprocessing and modeling steps.&lt;/li&gt;
&lt;li&gt;Cross-validation and evaluation.

&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. Hyperparameter Tuning&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Using Grid Search or Random Search to find optimal hyperparameters.

&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;5. Model Selection&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Comparing different models with the same preprocessing steps.&lt;/li&gt;
&lt;li&gt;Selecting the best model based on cross-validation results.

&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;6. Prediction and Inference&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Applying the entire preprocessing and modeling pipeline to new data.

&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;7. Deployment and Production&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Packaging the entire pipeline into a deployable unit.

&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;8. Pipeline Serialization and Persistence&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Saving the entire pipeline to disk for future use.

&lt;/li&gt;
&lt;/ul&gt;
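&lt;p&gt;The last point can be sketched with joblib, which ships as a Scikit-learn dependency and is a common choice for persisting fitted pipelines (the filename and toy pipeline are illustrative):&lt;/p&gt;

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=100, n_features=4, random_state=0)
pipe = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())])
pipe.fit(X, y)

# persist the whole fitted pipeline, then reload it for later predictions
joblib.dump(pipe, "demo_pipeline.joblib")
restored = joblib.load("demo_pipeline.joblib")
print(f"Restored pipeline score: {restored.score(X, y):.2f}")
```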

&lt;p&gt;Before demonstrating a few of these steps with a sample dataset, it is worth mentioning that the general outlook of Scikit-learn Pipeline loosely tends to follow this pattern:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Importing Necessary Libraries&lt;/li&gt;
&lt;li&gt;Creating Individual Transformers and Estimators&lt;/li&gt;
&lt;li&gt;Constructing the Pipeline&lt;/li&gt;
&lt;li&gt;Fitting and Transforming Data&lt;/li&gt;
&lt;li&gt;Fitting the Final Estimator

&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Advantages of Using Scikit-learn Pipeline&lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Don’t forget: the purpose of our previous article was to demonstrate modular code in data science problems. In this follow-up, we add further enhancements to make the code more robust through the use of Pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Scikit-learn Pipeline offers a range of advantages&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Simplicity and Readability of Code&lt;/strong&gt;: This cannot be overemphasized. Sequences of data processing steps can be organized into a single, structured unit of code, as we will soon see with an example, making the code readable and easily maintainable. Assembling multiple steps (transformers and an estimator) in this way results in a seamless workflow.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Prevention of Data Leakage&lt;/strong&gt;: A Scikit-learn Pipeline helps prevent data leakage. Data leakage occurs when a machine learning model gets access to information during training that it shouldn't have: an unintended exposure or mixing of information between the training and test datasets. As a result, model performance will look excellent during training but degrade on new, unseen data. Using a Scikit-learn Pipeline is one way to avoid data leakage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ease of Deployment and Productionization&lt;/strong&gt;: Pipelines also ease the transition from model development to model deployment (productionization). The Pipeline ensures that all preprocessing steps and model training are applied consistently, in the same order, and encapsulates the entire workflow from preprocessing to deployment.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
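&lt;p&gt;The data-leakage point can be made concrete: when a Pipeline is passed to cross-validation, the scaler is re-fitted on each training fold only, so no statistics from the held-out fold leak into preprocessing (toy data; the estimator choices are illustrative):&lt;/p&gt;

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=150, n_features=5, random_state=42)

pipe = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())])

# cross_val_score clones and refits the whole pipeline per fold, so
# StandardScaler learns its mean and std from the training fold alone
scores = cross_val_score(pipe, X, y, cv=5)
print(f"Mean CV accuracy: {scores.mean():.2f}")
```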
&lt;h3&gt;
  
  
  Illustration of Scikit-learn Pipeline with Tips Dataset&lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;For illustration, we will use the Tips dataset that ships with seaborn (a simple dataset) to demo a few of these steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Data Preprocessing (Imputation, Scaling, Encoding)&lt;/li&gt;
&lt;li&gt;Model Training and Evaluation (including Cross-Validation)&lt;/li&gt;
&lt;li&gt;Hyperparameter Tuning (GridSearchCV)&lt;/li&gt;
&lt;li&gt;Prediction and Inference&lt;/li&gt;
&lt;/ol&gt;
&lt;h4&gt;
  
  
  About the dataset&lt;a&gt;&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;A few things about the data we will be making use of:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;The Tips dataset is part of the Seaborn data repository which serves as an illustrative example within the Seaborn package.&lt;/li&gt;
  &lt;li&gt;You can easily load the Tips dataset using the sns.load_dataset("tips") command.&lt;/li&gt;
  &lt;li&gt;The dataset contains information related to restaurant tips and comes with these variables/features:
    &lt;ul&gt;
      &lt;li&gt;total_bill: The total bill amount.&lt;/li&gt;
      &lt;li&gt;tip: The tip amount.&lt;/li&gt;
      &lt;li&gt;sex: The sex of the bill payer.&lt;/li&gt;
      &lt;li&gt;smoker: Whether the customer is a smoker.&lt;/li&gt;
      &lt;li&gt;day: The day of the week.&lt;/li&gt;
      &lt;li&gt;time: Whether the meal was during lunch or dinner.&lt;/li&gt;
      &lt;li&gt;size: The size of the dining party.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We want to turn this dataset into a binary classification problem: given the features, we will predict whether the tip is greater than 5. If the tip is greater than 5, the target (y) we want to predict will be 1; otherwise, y will be 0.&lt;br&gt;
Let’s get to it.&lt;/p&gt;
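&lt;p&gt;That target construction can be sketched as follows (using a few hand-made rows shaped like the Tips dataset, so the snippet stands alone):&lt;/p&gt;

```python
import pandas as pd

# a few illustrative rows shaped like the Tips dataset
df = pd.DataFrame({
    "total_bill": [16.99, 24.59, 48.27, 10.34],
    "tip": [1.01, 3.61, 6.73, 1.66],
    "size": [2, 4, 4, 3],
})

# binary target: 1 if the tip is greater than 5, else 0
y = (df["tip"] > 5).astype(int)
X = df.drop(columns=["tip"])
print(y.tolist())
```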
&lt;h4&gt;
  
  
  Step 1: Load the dataset&lt;a&gt;&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;We will load the dataset described above with the code snippet below and save it as &lt;strong&gt;load_data.py&lt;/strong&gt;. We aim to use both modular code and the Scikit-learn Pipeline to show how to approach machine learning problems.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import seaborn as sns #import the seaborn library

#load the tips dataset from seaborn library
def load_tips_data():
    df = sns.load_dataset('tips')
    return df

if __name__ == "__main__":
    tips_data = load_tips_data()
    print(tips_data.head())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flmtdndlvdka30cpy0rlj.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flmtdndlvdka30cpy0rlj.JPG" alt="load_output" width="443" height="129"&gt;&lt;/a&gt;&lt;br&gt;
&lt;br&gt;&lt;/p&gt;
&lt;h4&gt;Step 2: Preprocessing Pipeline&lt;/h4&gt;

&lt;p&gt;At this stage, we will apply the preprocessing Pipeline to the data previously loaded in load_data.py. We will save the script as &lt;strong&gt;preprocess_data.py&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from load_data import load_tips_data  # from the load_data.py script

class PreprocessingTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.numeric_features = ['total_bill', 'size']
        self.categorical_features = ['sex', 'smoker', 'day', 'time']

        self.numeric_preprocessor = Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='mean')),
            ('scaler', StandardScaler())
        ])

        self.categorical_preprocessor = Pipeline(steps=[
            ('imputer', SimpleImputer(fill_value='missing', strategy='constant')),
            ('onehot', OneHotEncoder(handle_unknown='ignore'))
        ])

        self.preprocessor = ColumnTransformer(transformers=[
            ('numeric', self.numeric_preprocessor, self.numeric_features),
            ('categorical', self.categorical_preprocessor, self.categorical_features)
        ])

    def fit(self, X, y=None):
        self.preprocessor.fit(X, y)
        return self

    def transform(self, X):
        return self.preprocessor.transform(X)

if __name__ == "__main__":
    # Guarding the demo code keeps it from running when this class is imported
    # by the later scripts
    preprocessor_instance = PreprocessingTransformer()

    # Load the data (load_tips_data comes from the first script)
    tips_data = load_tips_data()
    X = tips_data.drop('tip', axis=1)
    y = (tips_data['tip'] &amp;gt; 5).astype(int)

    # Fit the preprocessor &amp;amp; transform the data
    preprocessor_instance.fit(X, y)
    transformed_data = preprocessor_instance.transform(X)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let us review what our code has done to the Tips dataset so far. The preprocessing steps are as follows:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
&lt;strong&gt;Feature Selection&lt;/strong&gt;: The features are divided into numeric (total_bill, size) and categorical (sex, smoker, day, time) features.&lt;/li&gt;
  &lt;li&gt;
&lt;strong&gt;Numeric Preprocessing&lt;/strong&gt;: The numeric features are preprocessed using a pipeline that includes:
  &lt;ul&gt;
    &lt;li&gt;Imputation: Missing values in the numeric features are replaced with the mean value of the respective feature.&lt;/li&gt;
    &lt;li&gt;Scaling: The numeric features are scaled to have zero mean and unit variance. This is done using the StandardScaler, which standardizes features by removing the mean and scaling to unit variance.&lt;/li&gt;
  &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;
&lt;strong&gt;Categorical Preprocessing&lt;/strong&gt;: The categorical features are preprocessed using a pipeline that includes:
  &lt;ul&gt;
    &lt;li&gt;Imputation: Missing values in the categorical features are replaced with the constant string ‘missing’.&lt;/li&gt;
    &lt;li&gt;One-Hot Encoding: The categorical features are one-hot encoded, meaning each unique value in each categorical feature becomes its own binary feature.&lt;/li&gt;
  &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;
&lt;strong&gt;Column Transformation&lt;/strong&gt;: The ColumnTransformer applies the appropriate preprocessing pipeline to each subset of features (numeric and categorical).&lt;/li&gt;
  &lt;li&gt;
&lt;strong&gt;Fitting and Transforming&lt;/strong&gt;: The preprocessing transformer is fitted to the data, learning any parameters necessary for imputation and scaling. It then transforms the data according to the fitted parameters.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To see the output, we can add &lt;em&gt;print(transformed_data)&lt;/em&gt;; the output looks as below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Firfjxzt3tva741yqaayx.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Firfjxzt3tva741yqaayx.JPG" alt="Processed Output" width="510" height="232"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The output of the code is a NumPy array (transformed_data) representing the preprocessed features of the dataset. The data has been scaled and encoded, which is why it no longer looks as easily interpretable as the original Tips dataset we began with.&lt;/p&gt;

&lt;p&gt;Let’s move on to training on the preprocessed data. (If we were dealing with other datasets that required more preprocessing steps, the code above would have to be made more inclusive to accommodate all the necessary steps.)&lt;br&gt;
&lt;br&gt;&lt;/p&gt;
&lt;h4&gt;Step 3: Training Pipeline&lt;/h4&gt;

&lt;p&gt;In this third step, we will introduce the training Pipeline, which takes the two previous scripts as inputs. We will save this as &lt;strong&gt;train_model.py&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from load_data import load_tips_data # accessing the load_data script
from preprocess_data import PreprocessingTransformer # accessing the preprocess_data script
from sklearn.pipeline import Pipeline

class TrainAndEvaluateModelTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y):
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

        self.model = RandomForestClassifier(random_state=42)
        self.model.fit(X_train, y_train)

        return self

    def transform(self, X):
        y_pred = self.model.predict(X)
        return y_pred

if __name__ == "__main__":
    tips_data = load_tips_data()  # Load and preprocess data
    preprocessor = PreprocessingTransformer()  # Create an instance without passing data
    preprocessed_data = preprocessor.fit_transform(tips_data)  # Fit and transform
    X = preprocessed_data
    y = (tips_data['tip'] &amp;gt; 5).astype(int)

    # Create a pipeline with TrainAndEvaluateModelTransformer
    model_pipeline = Pipeline([('model', TrainAndEvaluateModelTransformer())]) 

    # Fit and evaluate the model
    # (note: the accuracy here is computed over the full dataset, which
    # includes the rows the model was trained on, so it is optimistic)
    y_pred = model_pipeline.fit_transform(X, y)
    accuracy = accuracy_score(y, y_pred)
    print(f"Model Accuracy: {accuracy}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is the summary of what happened in the train_model script.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Create Model Pipeline&lt;/strong&gt;: A pipeline is created with TrainAndEvaluateModelTransformer. This transformer splits the data into training and testing sets, trains a RandomForestClassifier on the training data, and then uses this trained model to make predictions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fit and Evaluate the Model&lt;/strong&gt;: The model pipeline is fitted to the data and used to make predictions. The accuracy of these predictions is then calculated and printed to the console.
The output of this script will be the accuracy of the model, which is the proportion of correct predictions made by the model. 
This will be printed to the console as a decimal between 0 and 1, with 1 indicating perfect accuracy.&lt;/li&gt;
&lt;/ol&gt;
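&lt;p&gt;For reference, &lt;em&gt;accuracy_score&lt;/em&gt; is simply the fraction of predictions that match the true labels; a quick sketch with hand-made labels:&lt;/p&gt;

```python
from sklearn.metrics import accuracy_score

# Accuracy = correct predictions / total predictions
y_true = [1, 0, 1, 0]
y_pred = [1, 0, 0, 0]  # one wrong prediction out of four
print(accuracy_score(y_true, y_pred))  # 0.75
```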

&lt;p&gt;Let’s take a look at the hyperparameter tuning Pipeline. &lt;br&gt;
&lt;br&gt;&lt;/p&gt;
&lt;h4&gt;Step 4: Hyperparameter Pipeline&lt;/h4&gt;

&lt;p&gt;Even though our model accuracy was not bad in the last step, this hyperparameter stage demonstrates how we can also use a Pipeline for hyperparameter tuning. We will save the code as &lt;strong&gt;hyperparameter_tuning.py&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from load_data import load_tips_data
from preprocess_data import PreprocessingTransformer  # Import the PreprocessingTransformer
from sklearn.pipeline import Pipeline

if __name__ == "__main__":
    # Load and preprocess data
    tips_data = load_tips_data()
    X = tips_data.drop('tip', axis=1)
    y = (tips_data['tip'] &amp;gt; 5).astype(int)

    # Create a pipeline for preprocessing and model training
    model_pipeline = Pipeline([
        ('preprocessor', PreprocessingTransformer()),  # Use the custom preprocessing transformer
        ('model', RandomForestClassifier(random_state=42))
    ])

    # Define hyperparameter grid
    param_grid = {
        'model__n_estimators': [50, 100, 150],
        'model__max_depth': [None, 10, 20],
        'model__min_samples_split': [2, 5, 10]
    }

    # Perform GridSearchCV
    grid_search = GridSearchCV(model_pipeline, param_grid, cv=5)
    grid_search.fit(X, y)

    # Get the best tuned model
    best_tuned_model = grid_search.best_estimator_

    print("Best Hyperparameters:", best_tuned_model.named_steps['model'].get_params())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Briefly, the hyperparameter Pipeline code defines the &lt;strong&gt;hyperparameter grid&lt;/strong&gt; (different values for several hyperparameters), &lt;strong&gt;performs the grid search&lt;/strong&gt; (trying the combinations of hyperparameters and scoring each), &lt;strong&gt;gets the best model&lt;/strong&gt; with the best-scoring hyperparameter combination, and lastly &lt;strong&gt;prints out the best hyperparameters&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This is our output, the best hyperparameters, as printed by the last line of code:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foffllt7hpb9rit2hyiwn.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foffllt7hpb9rit2hyiwn.JPG" alt="hyperparameter tuning" width="695" height="115"&gt;&lt;/a&gt;&lt;br&gt;
&lt;br&gt;&lt;/p&gt;
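&lt;p&gt;A side note: &lt;em&gt;get_params()&lt;/em&gt; prints every parameter of the model, tuned or not. If you only want the winning values from the grid, &lt;em&gt;best_params_&lt;/em&gt; is narrower. A minimal sketch on a small synthetic dataset (not the Tips data):&lt;/p&gt;

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic dataset so the search finishes quickly
X, y = make_classification(n_samples=60, n_features=5, random_state=42)

grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid={"n_estimators": [10, 20]}, cv=3)
grid.fit(X, y)

# Only the hyperparameters that were actually tuned
print(grid.best_params_)
```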
&lt;h4&gt;Step 5: Serialize the model&lt;/h4&gt;

&lt;p&gt;Finally, this whole modular workflow can be serialized into a pickled file, ready to be used by another program (in production). We will save the script as &lt;strong&gt;serialized_model.py&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import joblib # for saving and loading Python objects to/from disk
from load_data import load_tips_data
from preprocess_data import PreprocessingTransformer
from train_model import TrainAndEvaluateModelTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

def serialize_model(model, filename):
    joblib.dump(model, filename)
    print(f"Model serialized and saved as '{filename}'")

if __name__ == "__main__":
    data = load_tips_data()
    X = PreprocessingTransformer().fit_transform(data)
    y = (data['tip'] &amp;gt; 5).astype(int)
    trained_model = TrainAndEvaluateModelTransformer().fit(X, y)  # baseline (untuned) model
    tuned_model = GridSearchCV(RandomForestClassifier(random_state=42), param_grid={
        'n_estimators': [50, 100, 150],
        'max_depth': [None, 10, 20],
        'min_samples_split': [2, 5, 10]
    }, cv=5).fit(X, y)
    serialize_model(tuned_model, "best_model.pkl")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This last script will take a bit longer to finish executing because it involves all the steps (the Pipelines) and then the serializing (“pickling”) of the best model into a file.&lt;br&gt;
The output will be a print statement indicating that the model has been serialized and saved as ‘&lt;strong&gt;best_model.pkl&lt;/strong&gt;’. We won’t see the model itself in the console, but a new file named ‘&lt;strong&gt;best_model.pkl&lt;/strong&gt;’ will be created in our current working directory; this file contains our trained and tuned model. (To load the model back into memory, we can use &lt;strong&gt;joblib.load('best_model.pkl')&lt;/strong&gt;, which returns the model object.)&lt;br&gt;
&lt;br&gt;&lt;/p&gt;
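&lt;p&gt;As a minimal sketch of the dump/load round trip (a toy model on synthetic data, not our tuned pipeline), note that the restored object behaves exactly like the original:&lt;/p&gt;

```python
import os
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Train a small toy model
X, y = make_classification(n_samples=30, n_features=4, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

joblib.dump(model, "demo_model.pkl")      # serialize ("pickle") to disk
restored = joblib.load("demo_model.pkl")  # load it back into memory

# The restored model makes the same predictions as the original
same = (restored.predict(X) == model.predict(X)).all()
print(same)  # True
os.remove("demo_model.pkl")               # clean up the demo file
```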

&lt;h3&gt;Conclusion&lt;/h3&gt;

&lt;p&gt;In conclusion, we have shown how to use modular scripts in conjunction with the Scikit-learn Pipeline module (and its classes) - going beyond what is demonstrated in many online courses, where the Jupyter Notebook is portrayed as the only and best way to do data science projects. Machine learning projects can be better streamlined by following the industry's best approaches.&lt;br&gt;
The code, the approaches, and the structures used here are not cast in stone; depending on the aim, nature, and goal of a data science/machine learning project, flexibility is always very much allowed.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>scikitlearn</category>
      <category>python</category>
    </item>
    <item>
      <title>User-Defined Functions in SQL: Expanding Your Database Toolkit</title>
      <dc:creator>Dare Johnson</dc:creator>
      <pubDate>Fri, 06 Oct 2023 15:42:01 +0000</pubDate>
      <link>https://dev.to/dare_johnson/user-defined-functions-in-sql-expanding-your-database-toolkit-lfh</link>
      <guid>https://dev.to/dare_johnson/user-defined-functions-in-sql-expanding-your-database-toolkit-lfh</guid>
      <description>&lt;p&gt;Why would you need to be able to create your own functions (UDF) in SQL? What is a typical function in SQL, and by extension, &lt;strong&gt;User-Defined Functions&lt;/strong&gt; (UDF)?&lt;/p&gt;

&lt;p&gt;Whether you are a seasoned SQL developer or an occasional SQL user, the importance of knowing how to create your own functions can't be overstated.&lt;br&gt;
Aside from writing complex queries and designing databases (tables, schemas, etc.), SQL developers also go a bit further into creating stored procedures and even complex functions to ease their work.&lt;/p&gt;

&lt;p&gt;To avoid having to define all the very basics, I will assume the reader is at least an intermediate SQL user. But a beginner SQL analyst reading this will still gain one thing or the other. By the time you reach the last paragraph of this article, you will know how to create a &lt;strong&gt;User-Defined Function&lt;/strong&gt; (UDF), and some more.&lt;br&gt;
&lt;br&gt;&lt;br&gt;
A SQL &lt;strong&gt;function&lt;/strong&gt; is a type of routine (program) that can be created to perform various tasks.&lt;/p&gt;

&lt;p&gt;Broadly speaking, there are two types of functions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Built-in Functions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;User-Defined Functions (UDFs)&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;While Built-in Functions come in different kinds (Aggregate, Ranking, String, Date/DateTime, etc.), UDFs also have their own types, as we will see shortly.&lt;/p&gt;

&lt;p&gt;With that said, let's get practical.&lt;/p&gt;

&lt;p&gt;I will be using &lt;a href="https://www.microsoft.com/en-us/sql-server/sql-server-downloads" rel="noopener noreferrer"&gt;Microsoft SQL Server&lt;/a&gt; to do the demo in this article, while also using some data readily available online - &lt;a href="https://github.com/Microsoft/sql-server-samples/releases/download/adventureworks/AdventureWorksDW2014.bak" rel="noopener noreferrer"&gt;AdventureWorksDW2014&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Question&lt;/strong&gt;:&lt;br&gt;
Create a User-Defined Function to calculate the total sum of sales for any specified product between two dates.&lt;/p&gt;

&lt;p&gt;This is a fairly simple one, but we have to start somewhere.&lt;/p&gt;

&lt;p&gt;Writing this as a function, it would look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE FUNCTION dbo.CalcProductTotalSales
(@ProductKey INT, @start_date DATETIME, @end_date DATETIME)
RETURNS DECIMAL(18, 2)
AS
BEGIN
    -- declared with the same type as the RETURNS clause
    DECLARE @TotalSales DECIMAL(18, 2);

    SELECT @TotalSales = SUM(UnitPrice)
    FROM [dbo].[FactInternetSales]
    WHERE ProductKey = @ProductKey
    AND OrderDate BETWEEN @start_date AND @end_date;

    RETURN @TotalSales;
END;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Sincerely speaking, it may not look so attractive, right?&lt;br&gt;
Let me break it down: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Between BEGIN and END is the actual SELECT statement which does the analysis;&lt;/li&gt;
&lt;li&gt;The "CREATE" keyword is common in SQL; it is used to create objects such as TABLES, DATABASES, VIEWS, etc. Here we used it to create a FUNCTION, thus the FUNCTION keyword; and the function name is &lt;strong&gt;dbo.CalcProductTotalSales&lt;/strong&gt;;
&lt;/li&gt;
&lt;li&gt;The parentheses that follow the function name specify the parameters to be made use of in the body of the function (between the BEGIN and END keywords);&lt;/li&gt;
&lt;li&gt;Each "@" in the function marks an individual variable or parameter needed within the function; the "@" symbol is used to introduce variables in SQL Server. &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once the function above is executed within the SQL Server query editor (query window), it is stored away, as seen below, in the Programmability folder within the database: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwgxpmlbglua777kzmbvj.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwgxpmlbglua777kzmbvj.jpg" alt="Function saved"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The implication of this is that even if &lt;strong&gt;SQL Server Management Studio&lt;/strong&gt; (SSMS) is shut down, the function remains permanent and can be accessed again the next time you restart your server.&lt;/p&gt;

&lt;p&gt;But then, how do we use it?&lt;br&gt;
Simply use the usual &lt;strong&gt;SELECT&lt;/strong&gt; statement along with the function name, supplying the parameters (productkey, start_date and end_date), as below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT dbo.CalcProductTotalSales(310, '2010-12-29 00:00:00.000', '2014-01-28 00:00:00.000');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you recall the question for which we created the User-Defined Function (to calculate the total sum of sales for any specified product between two dates), you would know that, even without a function, we could easily write a plain SELECT query to carry out our analysis. In fact, the SELECT statement below does the same thing as our function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT SUM(UnitPrice) as total_revenue
FROM [dbo].[FactInternetSales]
WHERE ProductKey = 310
AND OrderDate BETWEEN '2010-12-29 00:00:00.000' AND '2014-01-28 00:00:00.000';

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flhzrg8oyip6ceb02bb6l.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flhzrg8oyip6ceb02bb6l.jpg" alt="Same result"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But then, why would we create a function for an obvious SQL problem which needs a mere SELECT statement to solve?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;An obvious reason is that if there is no SQL built-in function readily available for a desired analysis, then a User-Defined Function (UDF) can answer the call. (The UDF example given above is a simple one, used for illustration purposes.)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Secondly, if we need to keep reusing a particular query in future analyses, we can make a function of it. (SQL VIEWs and Stored Procedures share a similar application, among many others.)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Lastly, UDFs help to simplify development by encapsulating (complex) business logic and making it easily available for reuse.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The advantages of using functions include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Code reusability &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Simplicity&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Consistency&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Modularity&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Having introduced quite a straightforward function to prove a point, let's see more reasons why you will someday need to create your own function - if you never have before!&lt;/p&gt;

&lt;p&gt;Do you recall the CAST() function? &lt;/p&gt;

&lt;p&gt;This very function converts one data type to another. The reason you don't get to create it afresh any time you need it is that it comes in handy, already built into SQL, and all you need to do is call it as and when needed.&lt;br&gt;
If you were to create a CAST function which can convert CHAR (string) to INT (integer) alone, error handling not factored in, it would look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE FUNCTION dbo.CastAsInt(@value NVARCHAR(MAX))
RETURNS INT
AS
BEGIN
    DECLARE @result INT;

    SELECT @result = CAST(@value AS INT);

    RETURN @result;
END;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
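&lt;p&gt;Once created, this function is called like any built-in scalar function:&lt;/p&gt;

```sql
SELECT dbo.CastAsInt('123') AS result;  -- returns 123
```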



&lt;p&gt;Now, because someday you will need a function within SQL which is not natively available (built-in), it is essential to master how to craft your own UDFs, User-Defined Functions. &lt;br&gt;
The creators of SQL knew it would be impossible to build in every function one could ever think of, so a template for constructing one yourself is provided.&lt;/p&gt;

&lt;p&gt;Let us assume that we want to be able to calculate a person's age given the person's date of birth: that is, a formula which can calculate age from a date of birth. &lt;br&gt;
There is no built-in function in SQL Server for this. The closest is using a combination of &lt;strong&gt;DATEDIFF&lt;/strong&gt; and &lt;strong&gt;GETDATE()&lt;/strong&gt; to calculate the difference in YEAR.&lt;/p&gt;
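&lt;p&gt;One caveat about DATEDIFF on its own: it counts the number of YEAR boundaries crossed, not the number of full years elapsed, which is why the CASE adjustment in the function further down is needed:&lt;/p&gt;

```sql
-- DATEDIFF counts calendar-year boundaries crossed, not full years,
-- so a single day of difference can still report one "year":
SELECT DATEDIFF(YEAR, '2000-12-31', '2001-01-01');  -- returns 1
```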

&lt;p&gt;(There is the &lt;strong&gt;AGE()&lt;/strong&gt; function to calculate the same in &lt;a href="https://www.postgresql.org/" rel="noopener noreferrer"&gt;&lt;strong&gt;PostgreSQL&lt;/strong&gt;&lt;/a&gt;. &lt;a href="https://www.mysql.com/" rel="noopener noreferrer"&gt;&lt;strong&gt;MySQL&lt;/strong&gt;&lt;/a&gt; also has the &lt;strong&gt;TIMESTAMPDIFF()&lt;/strong&gt; function, though it is not that straightforward either)  &lt;/p&gt;

&lt;p&gt;Let's see how to create a permanent function which can be used again and again in SQL Server; we will call it &lt;strong&gt;dbo.CalculateAge&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE FUNCTION dbo.CalculateAge(@dob DATE)
RETURNS INT
AS
BEGIN
    DECLARE @age INT;

    SELECT @age = DATEDIFF(YEAR, @dob, GETDATE()) - 
                  CASE 
                      WHEN MONTH(@dob) &amp;gt; MONTH(GETDATE()) OR (MONTH(@dob) = MONTH(GETDATE()) AND DAY(@dob) &amp;gt; DAY(GETDATE()))
                      THEN 1 
                      ELSE 0 
                  END;

    RETURN @age;
END;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No more, no less.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NB&lt;/strong&gt;: "&lt;strong&gt;dbo&lt;/strong&gt;" is the schema name, which differentiates the function from objects in any other schema. It could be another name, but "&lt;strong&gt;dbo&lt;/strong&gt;" (database owner) is the default schema under which the database/table was created. &lt;/p&gt;

&lt;p&gt;If you supply any column or value (of date/datetime data type) to the function &lt;strong&gt;dbo.CalculateAge&lt;/strong&gt;, it gives you the age on the go, without having to write the whole DATEDIFF and GETDATE combination again.&lt;br&gt;
For instance,&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi006w1943yvjsxnhj4jx.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi006w1943yvjsxnhj4jx.jpg" alt="Customer's Age"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What other function can you think of in SQL (SQL Server) which is not built-in? There can be many, depending on the use case.&lt;/p&gt;

&lt;p&gt;&lt;br&gt;&lt;br&gt;
Having said all this and explained why a SQL developer should be able to create a UDF personally, let's briefly look at how to create one, making use of the template already provided in SSMS.&lt;br&gt;
But first, it is worth mentioning that there are three types of SQL User-Defined Functions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Scalar Functions (return a single data value of the type defined in the RETURNS clause)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Multi-Statement Table-Valued Functions (build up and return a table data type in the function body)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Inline Table-Valued Functions (return a table data type from a single SELECT statement)&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
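&lt;p&gt;To make the table-valued variants concrete, here is a minimal sketch of an inline table-valued function. (The name &lt;strong&gt;dbo.GetProductSales&lt;/strong&gt; is our own, for illustration; it assumes the same FactInternetSales table used earlier.)&lt;/p&gt;

```sql
-- An inline table-valued function: the body is a single SELECT
CREATE FUNCTION dbo.GetProductSales (@ProductKey INT)
RETURNS TABLE
AS
RETURN
(
    SELECT OrderDate, UnitPrice
    FROM [dbo].[FactInternetSales]
    WHERE ProductKey = @ProductKey
);
```

&lt;p&gt;Unlike a scalar function, a table-valued function is queried like a table: &lt;strong&gt;SELECT * FROM dbo.GetProductSales(310);&lt;/strong&gt;&lt;/p&gt;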

&lt;p&gt;&lt;br&gt;&lt;br&gt;
To create any type of function using the templates inside SSMS, &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Right click the query editor and click &lt;strong&gt;Insert Snippet&lt;/strong&gt;: &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy8uvyfpzuuk1z9to7pcw.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy8uvyfpzuuk1z9to7pcw.jpg" alt="Function"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Click on &lt;strong&gt;Function&lt;/strong&gt;:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F47ymo3elpblx19kz8jxh.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F47ymo3elpblx19kz8jxh.jpg" alt="Function2"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Choose the type of function you desire to create based on what it returns (scalar, table)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjhh4wsiriqzu5lzly20t.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjhh4wsiriqzu5lzly20t.jpg" alt="Function3"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Choosing the first option (&lt;strong&gt;Create Inline Table Function&lt;/strong&gt;):&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb3n8g8tj2dpqrx0i80rs.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb3n8g8tj2dpqrx0i80rs.jpg" alt="Function4"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There you have the template. From the template, you can build your desired function.&lt;br&gt;
(All the examples shown earlier are of the Scalar type - because each function returns a single value) &lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final Note &amp;amp; Considerations&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There are a few considerations to keep in mind when creating User-Defined Functions (UDFs) in SQL Server:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;UDFs can’t be used to perform actions that modify the database state.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Error handling is restricted in a UDF. A UDF doesn’t support TRY...CATCH, @@ERROR or RAISERROR.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;UDFs can’t call a stored procedure, but can call an extended stored procedure. Likewise, UDFs can’t return multiple result sets; use a stored procedure if you need to return multiple result sets.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;UDFs can be nested; that is, one user-defined function can call another. The nesting level is incremented when the called function starts execution, and decremented when the called function finishes execution. In addition, user-defined functions can be nested up to 32 levels.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There you go! As you explore more of the power of SQL for data sleuthing, don't forget to consider what typical UDFs can do to aid your analyses.&lt;/p&gt;

</description>
      <category>sqlserver</category>
      <category>rdbms</category>
      <category>datascience</category>
      <category>sql</category>
    </item>
  </channel>
</rss>
