<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Chris</title>
    <description>The latest articles on DEV Community by Chris (@chris22ozor).</description>
    <link>https://dev.to/chris22ozor</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F800929%2F8c4214d3-de58-4fea-9602-8d5d666f1985.png</url>
      <title>DEV Community: Chris</title>
      <link>https://dev.to/chris22ozor</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/chris22ozor"/>
    <language>en</language>
    <item>
      <title>Evaluating A Machine Learning Classification Model</title>
      <dc:creator>Chris</dc:creator>
      <pubDate>Sun, 15 Sep 2024 19:44:00 +0000</pubDate>
      <link>https://dev.to/chris22ozor/evaluating-a-machine-learning-classification-model-7m1</link>
      <guid>https://dev.to/chris22ozor/evaluating-a-machine-learning-classification-model-7m1</guid>
      <description>&lt;p&gt;In machine learning (ml), model evaluation is a crucial step to understand how well your model performs on unseen data. This is essential because, while a model might perform well on the training dataset, its ability to generalize to new data determines its true value. Evaluating classification models involves several methods and metrics, each designed to give insights into different aspects of the model's performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Purpose of Model Evaluation
&lt;/h2&gt;

&lt;p&gt;The main goal of model evaluation is to assess the quality of machine learning predictions and ensure that the model performs well on data it has not seen before (generalization). By evaluating ML models, we can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Determine the effectiveness of the model in making accurate predictions.&lt;/li&gt;
&lt;li&gt;Compare different models to choose the best one for a particular problem.&lt;/li&gt;
&lt;li&gt;Fine-tune models to improve their performance by adjusting parameters or features.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Common Evaluation Procedures
&lt;/h2&gt;

&lt;p&gt;There are several procedures for evaluating machine learning models:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Training and testing on the same data&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rewards overly complex models that "overfit" the training data and won't necessarily generalize.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Train/test split&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Split the dataset into two pieces, so that the model can be trained and tested on different data&lt;/li&gt;
&lt;li&gt;Better estimate of out-of-sample performance, but still a "high variance" estimate&lt;/li&gt;
&lt;li&gt;Useful due to its speed, simplicity, and flexibility&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;K-fold cross-validation&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Systematically create "K" train/test splits and average the results together&lt;/li&gt;
&lt;li&gt;Even better estimate of out-of-sample performance&lt;/li&gt;
&lt;li&gt;Runs "K" times slower than train/test split&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We can deduce from the above evaluation procedures that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Training and testing on the same data is a classic cause of overfitting in which you build an overly complex model that won't generalize to new data and that is not actually useful.&lt;/li&gt;
&lt;li&gt;Train/test split provides a much better estimate of out-of-sample performance.&lt;/li&gt;
&lt;li&gt;K-fold cross-validation does better by systematically creating “K” train/test splits and averaging the results together.&lt;/li&gt;
&lt;/ul&gt;
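
&lt;p&gt;The procedures above can be compared side by side with scikit-learn. This is a minimal sketch on a synthetic dataset (the article's own examples use the Pima data introduced below):&lt;/p&gt;

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

# Synthetic stand-in data for illustration
X, y = make_classification(n_samples=500, random_state=42)
model = LogisticRegression(max_iter=1000)

# Procedure 2: a single train/test split - fast, but a high-variance estimate
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
split_score = model.fit(X_train, y_train).score(X_test, y_test)

# Procedure 3: K-fold cross-validation - K splits averaged for a steadier estimate
cv_scores = cross_val_score(model, X, y, cv=5)
print(split_score, cv_scores.mean())
```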

&lt;h2&gt;
  
  
  The Choice of a Model Evaluation Metric.
&lt;/h2&gt;

&lt;p&gt;The choice of a model evaluation metric depends on the specific machine learning problem you are solving. For classification problems, you can start with classification accuracy, but it has its own limitations. In this guide, we will discuss those limitations and then focus on other important classification evaluation metrics.&lt;/p&gt;

&lt;h2&gt;
  
  
  Classification Accuracy and Its Limitations.
&lt;/h2&gt;

&lt;p&gt;Accuracy is one of the simplest and most commonly used evaluation metrics, represented by the percentage of correct predictions made by the model. However, accuracy has its limitations, especially when dealing with imbalanced datasets, where one class is significantly more frequent than others. In such cases, a model might achieve high accuracy simply by always predicting the majority class, without actually learning meaningful patterns.&lt;/p&gt;

&lt;p&gt;We've chosen the &lt;a href="https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database" rel="noopener noreferrer"&gt;Pima Indians Diabetes&lt;/a&gt; dataset for this tutorial, which includes the health data and diabetes status of 768 patients. Let's read the data and print the first 5 rows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhjjyp9v1u7f3rwmf422r.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhjjyp9v1u7f3rwmf422r.jpg" alt="Image description" width="800" height="337"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The label column indicates 1 if the patient has diabetes and 0 if the patient doesn't have diabetes. With this data, we intend to answer the question:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Question: Can we predict the diabetes status of a patient given their health measurements?&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With this in mind, let's define our feature matrix X and response vector y. We use train_test_split to split X and y into training and testing sets.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faht2mi7o6upa89d1chlz.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faht2mi7o6upa89d1chlz.jpg" alt="Image description" width="681" height="154"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1b67wxmz27ssgwp3ysu1.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1b67wxmz27ssgwp3ysu1.jpg" alt="Image description" width="800" height="102"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, we train a logistic regression model on the training set. During the fit step, the logreg model object learns the relationship between X_train and y_train. Finally, we make class predictions for the testing set.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmpapun68y5keeurc7vli.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmpapun68y5keeurc7vli.jpg" alt="Image description" width="678" height="170"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgi23tclswuqouojhbc4c.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgi23tclswuqouojhbc4c.jpg" alt="Image description" width="562" height="100"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now that we've made predictions for the testing set, we can calculate the classification accuracy, which is simply the percentage of correct predictions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxtlvibaf2pwigu4140ld.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxtlvibaf2pwigu4140ld.jpg" alt="Image description" width="624" height="202"&gt;&lt;/a&gt;&lt;/p&gt;
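
&lt;p&gt;The workflow in the screenshots above can be sketched end to end. This is a minimal, self-contained illustration: a tiny inlined sample stands in for the full Pima CSV, and the column names and values here are assumptions.&lt;/p&gt;

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

# Tiny illustrative sample; the real dataset has 768 rows and more columns
pima = pd.DataFrame({
    'glucose': [148, 85, 183, 89, 137, 116, 78, 115, 197, 125, 110, 168],
    'bmi':     [33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3, 30.5, 37.6, 27.1, 38.0],
    'age':     [50, 31, 32, 21, 33, 30, 26, 29, 53, 51, 41, 34],
    'label':   [1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1],
})
X = pima[['glucose', 'bmi', 'age']]  # feature matrix
y = pima['label']                    # response vector
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)     # learn the X_train / y_train relationship
y_pred = logreg.predict(X_test)  # class predictions for the testing set
acc = metrics.accuracy_score(y_test, y_pred)
print(acc)
```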

&lt;p&gt;However, anytime you use classification accuracy as your evaluation metric, it is important to compare it with &lt;em&gt;&lt;strong&gt;null accuracy&lt;/strong&gt;&lt;/em&gt;, which is the accuracy that could be achieved by always predicting the most frequent class.&lt;/p&gt;

&lt;h2&gt;
  
  
  Null Accuracy
&lt;/h2&gt;

&lt;p&gt;The null accuracy answers the question: &lt;strong&gt;&lt;em&gt;if my model were to predict the predominant class 100 percent of the time, how often would it be correct?&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the scenario above, 32% of the y_test values are 1 (ones). In other words, a dumb model that always predicts that the patient doesn't have diabetes would be right 68% of the time (the zeros). This provides a baseline against which we can measure our logistic regression model.&lt;/p&gt;

&lt;p&gt;When we compare the null accuracy of 68% with the model accuracy of 69%, our model doesn't look very good. This demonstrates one weakness of &lt;em&gt;classification accuracy&lt;/em&gt; as a model evaluation metric: it doesn't tell us anything about the underlying distribution of the testing set.&lt;/p&gt;

&lt;p&gt;Let's look at the calculation of the null accuracy.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F26yh8lwu8anqhbhts3lk.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F26yh8lwu8anqhbhts3lk.jpg" alt="Image description" width="800" height="371"&gt;&lt;/a&gt;&lt;/p&gt;
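
&lt;p&gt;The null-accuracy calculation can be sketched in plain Python. The y_test counts below are hypothetical, chosen to mirror the roughly 68/32 split described above:&lt;/p&gt;

```python
from collections import Counter

# Hypothetical testing labels: 130 zeros and 62 ones (192 total)
y_test = [0] * 130 + [1] * 62

# Null accuracy: the share of the most frequent class
null_accuracy = max(Counter(y_test).values()) / len(y_test)
print(round(null_accuracy, 2))  # 0.68
```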

&lt;p&gt;&lt;strong&gt;In summary:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Classification accuracy is the easiest classification metric to understand.&lt;/li&gt;
&lt;li&gt;But it does not tell you the underlying distribution of response values or predictions.&lt;/li&gt;
&lt;li&gt;And it does not tell you what "types" of errors your classifier is making, which is why it is good to evaluate your models using the &lt;strong&gt;&lt;em&gt;confusion matrix&lt;/em&gt;&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Understanding the Confusion Matrix and Its Advantages.
&lt;/h2&gt;

&lt;p&gt;The confusion matrix is a table that describes the performance of a classification model. It provides a more detailed breakdown of a model’s performance, showing how often predictions fall into each category. It consists of four outcomes: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;True Positives (TP):&lt;/strong&gt; we &lt;em&gt;correctly&lt;/em&gt; predicted that they &lt;em&gt;do&lt;/em&gt; have diabetes; when both the actual and predicted values are 1&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;True Negatives (TN):&lt;/strong&gt; we &lt;em&gt;correctly&lt;/em&gt; predicted that they &lt;em&gt;don't&lt;/em&gt; have diabetes; when both the actual and predicted values are 0&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;False Positives (FP):&lt;/strong&gt; we &lt;em&gt;incorrectly&lt;/em&gt; predicted that they &lt;em&gt;do&lt;/em&gt; have diabetes; when the actual value is 0 but the predicted value is 1 (a "Type I error")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;False Negatives (FN):&lt;/strong&gt; we &lt;em&gt;incorrectly&lt;/em&gt; predicted that they &lt;em&gt;don't&lt;/em&gt; have diabetes; when the actual value is 1 but the predicted value is 0 (a "Type II error")&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By using a confusion matrix, you can compute more nuanced metrics like precision, recall, and F1-score, which provide a clearer picture of a model’s performance in the presence of class imbalances.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyko2z8j8ov39wd5uw04z.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyko2z8j8ov39wd5uw04z.jpg" alt="Image description" width="800" height="327"&gt;&lt;/a&gt;&lt;/p&gt;
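
&lt;p&gt;A minimal sketch of extracting the four outcomes with scikit-learn (the actual and predicted label lists below are hypothetical):&lt;/p&gt;

```python
from sklearn import metrics

# Hypothetical actual and predicted labels for illustration
y_test = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]

# Rows are actual classes, columns are predicted classes (0 first, then 1)
cm = metrics.confusion_matrix(y_test, y_pred)
TN, FP, FN, TP = cm.ravel()
print(TN, FP, FN, TP)  # 5 1 2 2
```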


&lt;h2&gt;
  
  
  Metrics Computed from a Confusion Matrix
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Recall (Sensitivity):&lt;/strong&gt; The proportion of correctly predicted positive cases out of all actual positives. It measures the model’s ability to identify positive cases. In other words: when the actual value is positive, how often is the prediction correct?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How "sensitive" is the classifier to detecting positive instances?&lt;/li&gt;
&lt;li&gt;Also known as "True Positive Rate" or "Recall"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Specificity:&lt;/strong&gt; The proportion of correctly predicted negative cases out of all actual negatives. In other words: when the actual value is negative, how often is the prediction correct?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How "specific" (or "selective") is the classifier in predicting positive instances?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;False Positive Rate:&lt;/strong&gt; When the actual value is negative, how often is the prediction incorrect?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Precision:&lt;/strong&gt; The proportion of correctly predicted positive cases out of all predicted positives. It indicates the accuracy of positive predictions. In other words: when a positive value is predicted, how often is the prediction correct?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How "precise" is the classifier when predicting positive instances?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Many other metrics can be computed: F1 score, Matthews correlation coefficient, etc.&lt;/p&gt;
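
&lt;p&gt;Sensitivity, specificity, and precision can be sketched directly from the four confusion-matrix counts (the counts below are hypothetical):&lt;/p&gt;

```python
# Hypothetical confusion-matrix counts for illustration
TP, TN, FP, FN = 62, 80, 20, 30

sensitivity = TP / (TP + FN)  # recall / true positive rate
specificity = TN / (TN + FP)
precision = TP / (TP + FP)
false_positive_rate = FP / (FP + TN)
print(round(sensitivity, 2), round(specificity, 2), round(precision, 2))
```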

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In conclusion, the choice of metric depends on your specific business objective. However, the confusion matrix gives you a more complete picture of how your classifier is performing. It also allows you to compute various classification metrics, and these metrics can guide your model selection.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>ai</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>SQL Window Functions</title>
      <dc:creator>Chris</dc:creator>
      <pubDate>Thu, 28 Sep 2023 10:38:58 +0000</pubDate>
      <link>https://dev.to/chris22ozor/sql-window-functions-c98</link>
      <guid>https://dev.to/chris22ozor/sql-window-functions-c98</guid>
      <description>&lt;p&gt;In this article, we discuss what window functions are and how they help you do your job as a data analyst or specialist.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Outline:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What Are Window Functions?&lt;/li&gt;
&lt;li&gt;Types Of Window Functions&lt;/li&gt;
&lt;li&gt;Case study example&lt;/li&gt;
&lt;li&gt;How are Window Functions Useful and who should use them?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What Are Window Functions?&lt;/strong&gt;&lt;br&gt;
Window functions are a type of SQL function that lets you perform calculations on different groups of rows. The term window describes the set of rows on which the function operates. This window is defined using the OVER clause and can be based on various criteria, such as partitioning and ordering.&lt;/p&gt;

&lt;p&gt;Here’s what the window_function syntax looks like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Sun19uR1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w3tkhcu4t0m2m09x9ey2.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Sun19uR1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w3tkhcu4t0m2m09x9ey2.PNG" alt="Image description" width="464" height="212"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Image credit: Database Star&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The OVER keyword:&lt;/strong&gt; indicates that this is to be used as a window function.&lt;br&gt;
&lt;strong&gt;The PARTITION BY clause:&lt;/strong&gt; lets you define the window of data to look at.&lt;br&gt;
&lt;strong&gt;The ORDER BY clause:&lt;/strong&gt; defines the order in which the function will run on the data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Types of Window Functions&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Window functions are divided into four main groups:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Ranking Functions – Assign a rank to each row in the dataset, e.g. ROW_NUMBER(), RANK(), DENSE_RANK(), NTILE(), PERCENT_RANK(), etc.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Analytic Functions – Allow you to access the values of the previous or following rows (in relation to the current row). They can also return the first or last value and divide rows into (close to) equal groups, e.g. LEAD(), LAG(), FIRST_VALUE(), LAST_VALUE(), etc.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Aggregate Functions – Regular aggregate functions can be used as window functions. Most commonly, you can count the values in the window, sum them, or find their average, minimum, and maximum values. Since these are window functions, you can do this using one or several grouping criteria, e.g. MAX(), MIN(), AVG(), SUM(), COUNT(), etc.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Distribution Functions – Calculate the cumulative or relative rank of a value within the dataset, e.g. CUME_DIST().&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The Data for the Examples&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let’s say we want to calculate the ranking of employees based on their salaries within different departments. Consider the following sample database table named "Employee":&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;employee_id&lt;/strong&gt; – The ID of the employee and the primary key (PK) of the table.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Employee_Name&lt;/strong&gt; – The name of the employee.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Department&lt;/strong&gt; – The department of the employee in the organization.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Salary&lt;/strong&gt; – The amount the employee earns in a month.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--23z_CEUo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ikgibe13pin0rm3za59d.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--23z_CEUo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ikgibe13pin0rm3za59d.PNG" alt="Image description" width="558" height="326"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, let's calculate the rank of employees within each department based on their salaries. We'll use the RANK() function for this example: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AqaQw7wt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/oevaezwiao7je3ezj0qd.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AqaQw7wt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/oevaezwiao7je3ezj0qd.PNG" alt="Image description" width="606" height="189"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this SQL query, we select the employee's name, department, and salary, and calculate the rank of employees within each department based on their salary.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The PARTITION BY&lt;/strong&gt; clause divides the result set into partitions by the "Department" column. This means that ranking will restart for each department.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The ORDER BY&lt;/strong&gt; clause sorts the employees within each department by salary in descending order.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The RANK()&lt;/strong&gt; function assigns a rank to each employee based on their salary within their respective department.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result of this query would look like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BbZRNxAo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dnuof0ql9hs38md7z58q.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BbZRNxAo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dnuof0ql9hs38md7z58q.PNG" alt="Image description" width="357" height="173"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this result, you can see that within each department, employees are ranked based on their salaries, with the highest salary receiving rank 1. The RANK() function handles ties by assigning the same rank to employees with equal salaries (as seen with Alice and Bob in the IT department or Cathy and John in the HR department).&lt;/p&gt;
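
&lt;p&gt;The example above can be reproduced end to end with SQLite's window-function support (available since SQLite 3.25) via Python's built-in sqlite3 module. The salary figures here are assumptions chosen to recreate the ties the article describes:&lt;/p&gt;

```python
import sqlite3

# In-memory database; salaries are illustrative, with ties for
# Alice/Bob (IT) and Cathy/John (HR) as in the article
conn = sqlite3.connect(':memory:')
conn.execute("""CREATE TABLE Employee (
    employee_id INTEGER PRIMARY KEY,
    Employee_Name TEXT, Department TEXT, Salary INTEGER)""")
conn.executemany("INSERT INTO Employee VALUES (?, ?, ?, ?)", [
    (1, 'Alice', 'IT', 9000), (2, 'Bob', 'IT', 9000), (3, 'Carol', 'IT', 7500),
    (4, 'Cathy', 'HR', 6000), (5, 'John', 'HR', 6000), (6, 'Dave', 'HR', 5000),
])

# RANK() restarts per department; tied salaries share a rank,
# and the next rank is skipped
rows = conn.execute("""
    SELECT Employee_Name, Department, Salary,
           RANK() OVER (PARTITION BY Department ORDER BY Salary DESC) AS salary_rank
    FROM Employee
""").fetchall()
for row in rows:
    print(row)
```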

&lt;p&gt;&lt;strong&gt;How Are SQL Window Functions Helpful?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Knowing SQL window functions makes writing complex reports easier. In business cases, they can help you rank data, analyze time series data, and make time-period comparisons (i.e. year-to-year, quarter-to-quarter, month-to-month). They are perfect for all data analysts who work with SQL. If you want to become an SQL expert and take your reporting to a new level, then window functions are for you.&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>sql</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>PRESENTING TO A NON-TECHNICAL AUDIENCE AS DATA EXPERTS</title>
      <dc:creator>Chris</dc:creator>
      <pubDate>Mon, 31 Jan 2022 10:36:21 +0000</pubDate>
      <link>https://dev.to/chris22ozor/presenting-to-a-non-technical-audience-1p84</link>
      <guid>https://dev.to/chris22ozor/presenting-to-a-non-technical-audience-1p84</guid>
      <description>&lt;p&gt;Communicating with non-technical stakeholders can present a challenge when discussing a proposed business solution or the application and deployment of a new technology model for a business problem. This is a common struggle for everyone from the CDO (Chief Data Officer) to junior data professionals when it comes to clearly communicating complex technical concepts to executive leadership, internal and external clients, and customers alike. It may even feel like you’re speaking two completely different languages as you get into more in-depth discussions.&lt;/p&gt;

&lt;p&gt;The data science workflow and project lifecycle is a daunting process filled with many technical comments, language, and constructs in the build-up process, leaving the non-technical stakeholders with no clear understanding of the impact of the project. It is clear to note that the essence and priority of the entire journey is the communication of the impact of the findings and solutions to management executives, internal or external customers, and stakeholders.&lt;/p&gt;

&lt;p&gt;In my experience, most data scientists focus only on developing and deploying models and machine learning solutions, giving little to no attention to strategies for communicating their results to a non-technical audience. This creates a wide gap in the minds of the audience as to what the impact of the solution is.&lt;/p&gt;

&lt;p&gt;No management executive or stakeholder cares about the fancifulness or technicality of your model or code. What they are interested in is how your proposed model or solution translates to business impact or value.&lt;/p&gt;

&lt;p&gt;As technical experts and data leaders, we should endeavor to simplify complex comments and concepts into understandable components for the non-technical audience to comprehend. By this, we make it easier for the non-technical stakeholders to understand tech on a high level. This exerts a degree of leadership and influence within the context and domain.&lt;/p&gt;

&lt;p&gt;When communicating your findings, it’s got to do a lot more with your target audience. A few approaches to guide us to communicate our results properly can be found below.&lt;/p&gt;

&lt;p&gt;1) Know your audience: This is the first of many steps to communicating effectively. Learning who your audience is will ensure you come up with a structure and choose an appropriate approach to driving your message home as data leaders. Some notable points here include:&lt;br&gt;
i) Finding out the audience's backgrounds and expertise&lt;br&gt;
ii) Gauging their level of knowledge on the subject matter beforehand&lt;br&gt;
iii) What does the audience need to know?&lt;br&gt;
iv) How will the audience use the information?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4oilob8kyfxx9y393upz.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4oilob8kyfxx9y393upz.PNG" alt="Image description" width="522" height="264"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) Use visual aids (visualization software):&lt;br&gt;
As the saying goes, a picture is worth a thousand words. Human beings generally grasp concepts in pictures much faster than they can digest words. Using visuals to communicate thoughts and processes facilitates engagement and can make it easier for a non-technical audience to understand concepts at a very high level. Create an illustration of data flow in the form of directional arrows, and connect the boxes to visualize integrated systems. Some visualization software out there includes Plotly, Power BI, Tableau, PowerPoint, etc., depending on your need.&lt;/p&gt;

&lt;p&gt;Also, using visual storytelling helps to effectively communicate technical results to non-technical audiences.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu1i10fdewfawv06rh4a6.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu1i10fdewfawv06rh4a6.PNG" alt="Image description" width="641" height="372"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3)  Choose your communication channel:&lt;/p&gt;

&lt;p&gt;Effective engagement with non-technical stakeholders entails choosing the right platform to communicate technical results and findings. Meeting rooms should be serene and conducive, and it's advisable to have a smaller group for proper engagement and interaction.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fss5pgggh8r7v3gr653lo.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fss5pgggh8r7v3gr653lo.PNG" alt="Image description" width="495" height="298"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Communication is a two-way street: listening and giving feedback matter as much as speaking. It’s not always about articulating your personal views and ideas; listening and receiving feedback are just as important. Clear communication is the key to both understanding and explaining thoughts and processes to a non-technical audience. By maintaining simple language and following a straightforward explanation strategy, it becomes easier for non-technical stakeholders to understand technical concepts at a high level.&lt;/p&gt;

&lt;p&gt;In conclusion, clear communication leads to clear business impacts and better Data and Technical Leaders.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>ai</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
