DEV Community: ML_Concepts

Introduction to Java

ML_Concepts — Thu, 23 Feb 2023 05:13:01 +0000

Java is a high-level programming language that is widely used for developing a variety of applications, from desktop to mobile to web applications. It is an object-oriented language that is known for its portability, security, and robustness.

Setting up the environment in Java involves a few steps:

Download and install the Java Development Kit (JDK) from the official website.

Set up the environment variables in your operating system, which are required for running Java applications. This includes setting the PATH variable to include the directory where the Java compiler and runtime are installed.

Choose an Integrated Development Environment (IDE) such as Eclipse, NetBeans, or IntelliJ IDEA to write and run Java code.

Once the environment is set up, you can start writing Java code using the basic syntax of the language. Here are a few important concepts to keep in mind:

Java is case-sensitive, so be sure to use proper capitalization when naming classes, variables, and methods.

Java programs are written in classes, which contain methods that perform specific tasks.

The main() method is the entry point of any Java program and is where the program starts executing.

Java uses curly braces to denote code blocks, and statements must end with a semicolon.

Java has several built-in data types, including integers, floats, doubles, booleans, and characters.

Java supports control structures such as if-else statements, for loops, while loops, and switch statements.

By mastering these basic concepts, you can start writing simple Java programs and gradually move on to more complex applications. To learn about these concepts in more detail, read our original articles on ml-concepts.com.

Java Basic Syntax

Java basic syntax includes the following components:

Comments: Single-line comments start with double forward slashes (//) and continue until the end of the line.

// This is a single-line comment

Multi-line comments start with a forward slash followed by an asterisk (/*) and end with an asterisk followed by a forward slash (*/).

/* This is a multi-line comment that can span multiple lines */

Identifiers: Identifiers are used to name variables, methods, and classes in Java. They can include letters, digits, underscores, and dollar signs, but cannot start with a digit.

int age;
double salary;
String firstName;

Keywords: Keywords are reserved words in Java, such as class, public, and static, that have a specific meaning and cannot be used as identifiers.

Variables: Variables are used to store values in Java. They must be declared with a data type before they can be used.

Data Types: Java supports several primitive data types, including int, float, double, boolean, and char.

Operators: Operators are used to perform operations on variables and values in Java.

For a more detailed explanation, visit our original article on ml-concepts.com.

In conclusion, Java is a powerful and widely used programming language for building a variety of applications, from mobile apps to enterprise systems. In this introduction to Java, we covered setting up the environment and the basics of Java syntax, including comments, identifiers, keywords, variables, data types, operators, and control structures. Whether you are just starting out with programming or looking to expand your skills, Java is a great language to learn.

Thank you for reading this article on Java programming. I hope you found it informative and helpful in your learning journey. If you have any doubts or questions, please feel free to ask and I’ll be happy to help you.

Python OOP: Harnessing the Power of Classes and Objects

ML_Concepts — Tue, 14 Feb 2023 14:14:51 +0000

Python OOP, or Object-Oriented Programming, is a programming paradigm that emphasizes the use of objects to represent and interact with data and functionality. In Python, objects are created by defining classes, which are essentially blueprints for creating objects.
Classes define the attributes and methods of objects. Attributes are the characteristics or data that an object has, while methods are the functions or operations that an object can perform. When you create an object from a class, the object inherits the attributes and methods of the class.
One of the main benefits of using OOP in Python is that it allows for code reusability and modularity. By encapsulating data and functionality within objects, you can easily reuse and modify them without affecting the rest of the code.
In Python, you can define classes using the class keyword, followed by the name of the class and a colon. The body of the class contains the attributes and methods of the class. For example:
```python
class Person:
    def __init__(self, name, age):
        # Store the values passed in as attributes of the new object.
        self.name = name
        self.age = age

    def greet(self):
        print(f"Hello, my name is {self.name} and I am {self.age} years old.")
```

In this example, we've defined a Person class with two attributes (name and age) and one method (greet). The __init__ method is a special method that is called when an object is created from the class. It initializes the attributes of the object with the values passed as arguments.
To create an object from the Person class, you can call the class as if it were a function:
person1 = Person("Alice", 25)

This creates a Person object with the name "Alice" and age 25. To access the attributes and methods of the object, you can use dot notation:

print(person1.name)  # prints: Alice
person1.greet()      # prints: Hello, my name is Alice and I am 25 years old.

For a more detailed explanation of Python OOP (Object-Oriented Programming), check our original article, Python OOP (Object Oriented Programming).

Python Classes and Objects

In Python, a class is a blueprint or a template for creating objects. Objects are instances of a class and represent a specific entity that has its own properties (attributes) and methods. Here is an example of a class definition in Python:
```python
class Car:
    def __init__(self, make, model, year):
        # Set the initial values of the object's attributes.
        self.make = make
        self.model = model
        self.year = year

    def start(self):
        print(f"The {self.year} {self.make} {self.model} is starting.")
```

In this example, the Car class has three attributes (make, model, and year) and one method (start). The __init__ method is a special method that gets called when an object of the class is created. It sets the initial values of the object's attributes.
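As a small sketch of how such a class might be used, the snippet below repeats the Car definition so that it runs on its own, then creates one object and calls its method. The variable name my_car and the specific car values are assumptions chosen only for illustration.

```python
class Car:
    def __init__(self, make, model, year):
        self.make = make
        self.model = model
        self.year = year

    def start(self):
        print(f"The {self.year} {self.make} {self.model} is starting.")


# Example values chosen only for illustration.
my_car = Car("Toyota", "Corolla", 2020)
my_car.start()       # prints: The 2020 Toyota Corolla is starting.
print(my_car.model)  # prints: Corolla
```

Each object keeps its own attribute values, so a second Car created with different arguments would print a different message from the same start method.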

Overall, Python classes and objects provide a powerful way to encapsulate data and behavior into reusable and modular entities. By creating classes and objects, you can write code that is easier to read, maintain, and modify. For a more detailed explanation, check our original article, Python Classes and Objects.

Decimal Scaling - Another Data Normalization Technique?

ML_Concepts — Tue, 17 Jan 2023 03:57:36 +0000

Note: The original version of this article, on decimal scaling in data mining, is available on ml-concepts.com.

Decimal scaling is a pre-processing technique used in machine learning to scale the values of input features. This technique aims to bring the values of the features within a certain range, typically between -1 and 1, to make them more manageable for the machine learning model. This can be especially useful when the values of the input features have a wide range, as this can cause problems for certain types of models, such as neural networks.

The technique involves multiplying each value in the feature set by a power of 10, written 10^n, where n is a positive or negative integer chosen from the magnitude of the largest value. This changes the scale of the values to make them more consistent and easier to work with. For example, if a feature has values that range from 0 to 1,000,000, multiplying by 10^-6 will bring the values down to a range of 0 to 1. Similarly, if a feature has values that range from 0 to 0.0001, multiplying by 10^4 will bring the values up to a range of 0 to 1.
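The scaling step can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name decimal_scale is made up here, and it assumes the common textbook convention of dividing by 10^j, where j is the smallest non-negative integer that makes the largest absolute scaled value strictly less than 1. Under that convention a maximum of exactly 1,000,000 would be divided by 10^7 rather than 10^6, and values that are already below 1 are left unchanged.

```python
def decimal_scale(values):
    """Divide every value by 10**j, with j the smallest non-negative
    integer that makes the largest absolute scaled value less than 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10**j >= 1:
        j += 1
    return [v / 10**j for v in values], j


# Made-up feature values for illustration.
scaled, j = decimal_scale([-986, 917, 120])
print(j)       # 3
print(scaled)  # [-0.986, 0.917, 0.12]
```

Returning j alongside the scaled values makes it possible to apply the same power of 10 to new data later.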

There are several benefits to using decimal scaling in machine learning. One of the most important is that it can help to improve the performance of the model. This is because many machine learning algorithms, such as neural networks, are sensitive to the scale of the input features. If the values of an attribute are too large or too small, this can cause problems with the training and evaluation of the model. By scaling the values of the features, decimal scaling can help to ensure that the model is working with a consistent and manageable set of input data.

Another possible benefit of decimal scaling is a reduced risk of overfitting. Overfitting is a common problem in machine learning, and it occurs when the model is too closely fit to the training data, resulting in poor generalization performance. Scaling does not add or remove information, but keeping features on comparable scales can make training more stable and lets regularization treat the features evenly, which may make the model less sensitive to slight variations in the input data.

However, it is important to note that decimal scaling should be used with caution. It does not apply to all datasets and models. For example, if the data is already in a consistent range, there is no need to perform the scaling. Additionally, the scaling factor should be chosen carefully, as an inappropriate scaling factor can lead to poor performance of the model. It is also important to note that decimal scaling should only be applied to the input features, and not the output variable.

In conclusion, decimal scaling is a preprocessing technique used in machine learning to scale the values of input features. The goal of this technique is to bring the values of the features within a certain range, typically between -1 and 1, to make them more manageable for the machine learning model. This technique can be especially useful when the values of the input features have a wide range, as this can cause problems for certain types of models. However, it is important to use decimal scaling with caution and it should be applied only if it is necessary.
If you liked this article then check out our other articles on Normalization in machine-learning models.

Min-Max-Normalization

Z-Score Normalization

Summary
In this article, I tried to explain decimal scaling in simple terms. If you have any questions about the post, please put them in the comment section and I will do my best to answer them.

Mean, Median, and Mode, Now with Python..!

ML_Concepts — Fri, 13 Jan 2023 11:07:21 +0000

Introduction

Note: The original version of this article, on measures of central tendency, is available on ml-concepts.com.

Central tendency is a statistical concept that describes the central or typical value of a dataset. In other words, it provides a single value that represents the center or middle of a dataset. There are three main measures of central tendency: mean, median, and mode. Each of these measures provides a different perspective on the center of a dataset, and they are often used in combination to gain a better understanding of the data.

Mean

The mean, also known as the average, is calculated by summing all of the values in a dataset and dividing by the number of values. In Python, you can calculate the mean of a list of numbers using the mean() function from the statistics module. For example, the following code calculates the mean of a list of numbers.

from statistics import mean

numbers = [1, 2, 3, 4, 5]
print(mean(numbers))

Median

The median is the middle value of a dataset when it is ordered from least to greatest. If the dataset has an odd number of values, the median is the middle value. If the dataset has an even number of values, the median is the average of the two middle values. In Python, you can calculate the median of a list of numbers using the median() function from the statistics module. For example, the following code calculates the median of a list of numbers.

from statistics import median

numbers = [1, 2, 3, 4, 5]
print(median(numbers))

Mode

The mode is the value that appears most frequently in a dataset. A dataset can have one mode, multiple modes, or no mode at all. In Python, you can calculate the mode of a list of numbers using the mode() function from the statistics module. In Python 3.8 and later, mode() returns the first mode encountered when several values tie, and multimode() returns all of them. For example, the following code calculates the mode of a list of numbers.

from statistics import mode

numbers = [1, 2, 3, 4, 5, 2]
print(mode(numbers))

Keep in mind that the mode is rarely informative for continuous data, where exact values seldom repeat. It makes the most sense for discrete or categorical data.

The mode is the only one of these measures that also applies to non-numerical data. For such data, you can use the Counter class from Python's built-in collections module to count the occurrences of each element and then pick the most frequent one.
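The Counter approach mentioned above might look like the following sketch. The list of colors is made-up data, and the variable names are chosen only for this example. The last line uses statistics.multimode (available from Python 3.8) as a cross-check.

```python
from collections import Counter
from statistics import multimode

# Made-up categorical data for illustration.
colors = ["red", "blue", "red", "green", "blue"]

counts = Counter(colors)        # occurrences of each element
highest = max(counts.values())  # the largest count
modes = [value for value, n in counts.items() if n == highest]

print(modes)              # ['red', 'blue']
print(multimode(colors))  # ['red', 'blue']
```

Collecting every value that reaches the highest count, rather than only the first, handles datasets with more than one mode.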

It's important to note that central tendency measures are not always appropriate for all datasets. For example, if a dataset has a large number of outliers, the mean will be heavily influenced by these outliers and may not accurately represent the center of the dataset. In such cases, the median may be a more appropriate measure of central tendency. Additionally, if a dataset has multiple modes, it may be difficult to determine which mode is the most important or relevant.
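The effect of an outlier on the mean and the median can be seen in a short example. The numbers below are made up for illustration, with 1000 playing the role of the outlier.

```python
from statistics import mean, median

# Hypothetical values; the last entry of with_outlier is the outlier.
typical = [30, 32, 35, 38]
with_outlier = typical + [1000]

print(mean(typical), median(typical))            # 33.75 33.5
print(mean(with_outlier), median(with_outlier))  # 227 35
```

One extreme value moves the mean from about 34 to 227, while the median shifts only from 33.5 to 35.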

Another important concept related to central tendency is skewness. Skewness refers to the asymmetry of a dataset. In a symmetric distribution with a single peak, the mean, median, and mode are all equal. In a positively skewed dataset, the mean is typically greater than the median, and the mode is the smallest of the three. In a negatively skewed dataset, the mean is typically less than the median, and the mode is the largest of the three.

In conclusion, central tendency measures provide a single value that represents the center or middle of a dataset. Mean, median, and mode are the most commonly used measures of central tendency. Mean is the average of a dataset, the median is the middle value, and mode is the value that appears most frequently.

Summary

In this article, I tried to explain measures for central tendencies in simple terms. If you have any questions about the post, please put them in the comment section, and I will do my best to answer them.

What Is Pruning In Decision Tree?

ML_Concepts — Thu, 12 Jan 2023 07:10:00 +0000

Note: The original version of this article, on pruning to reduce overfitting in machine learning, is available on ml-concepts.com.

Pruning is a technique used to reduce the size of a decision tree by removing branches that do not contribute significantly to the accuracy of the tree. The goal of pruning is to improve the generalization performance of the tree by reducing overfitting.

There are several methods for pruning decision trees, but one common approach is reduced error pruning. This method begins by constructing a complete decision tree using a training dataset, and then iteratively removing branches from the tree and evaluating the impact on the accuracy of the tree using a validation dataset.

The basic idea behind reduced error pruning is to remove a branch if the accuracy of the tree does not decrease significantly as a result of the removal. This is done by comparing the accuracy of the tree before and after the removal of each branch, using a validation dataset. If the accuracy of the tree does not decrease significantly, then the branch is removed.
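The procedure described above can be sketched on a tiny hand-written tree. Everything in this example is an assumption made for illustration: the tree is a nested dict with hypothetical keys (feature, threshold, majority, left, right), leaves are plain labels, and a branch is removed only when validation accuracy does not drop at all, which is a stricter rule than "does not decrease significantly".

```python
def predict(node, x):
    """Follow the splits until a leaf (a plain label) is reached."""
    while isinstance(node, dict):
        branch = "left" if x[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node


def reduced_error_prune(node, validation):
    """Prune bottom-up, judging each node on the validation rows that reach it."""
    if not isinstance(node, dict):
        return node
    left_rows = [(x, y) for x, y in validation if x[node["feature"]] <= node["threshold"]]
    right_rows = [(x, y) for x, y in validation if x[node["feature"]] > node["threshold"]]
    pruned = dict(node,
                  left=reduced_error_prune(node["left"], left_rows),
                  right=reduced_error_prune(node["right"], right_rows))
    subtree_correct = sum(predict(pruned, x) == y for x, y in validation)
    leaf_correct = sum(node["majority"] == y for _, y in validation)
    # Replace the subtree with a leaf when that does not lose validation accuracy.
    return node["majority"] if leaf_correct >= subtree_correct else pruned


# Hypothetical toy tree and validation rows, for illustration only.
tree = {"feature": 0, "threshold": 5, "majority": "no",
        "left": "no",
        "right": {"feature": 1, "threshold": 2, "majority": "yes",
                  "left": "yes", "right": "no"}}
validation = [((3, 1), "no"), ((7, 1), "yes"), ((8, 3), "yes"), ((9, 4), "yes")]

pruned_tree = reduced_error_prune(tree, validation)
print(pruned_tree["right"])  # yes  (the inner split was replaced by a leaf)
```

Here the inner split gets only one of its three validation rows right, while a single "yes" leaf gets all three, so the branch is removed. The root split is kept because replacing it would lower validation accuracy.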

Another pruning method is cost complexity pruning. It uses a regularization parameter to control the trade-off between the complexity of the tree and the training error. A regularization term that grows with the size of the tree is added to the classification error, and the tree is pruned so that this regularized error is minimized.
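The trade-off can be made concrete with a little arithmetic. The sketch below assumes that tree complexity is measured by the number of leaves and that the regularized error is the classification error plus alpha times that number. The two candidate trees and their error values are made-up numbers, and the names cost_complexity and pick are introduced here only for illustration.

```python
def cost_complexity(error, n_leaves, alpha):
    """Regularized error: classification error plus alpha times tree size."""
    return error + alpha * n_leaves


# Hypothetical candidates as (error, number of leaves); the numbers are made up.
full_tree = (0.10, 8)
small_tree = (0.12, 3)


def pick(alpha):
    """Return the candidate with the lower regularized error."""
    return min((full_tree, small_tree), key=lambda t: cost_complexity(*t, alpha))


print(pick(0.001))  # (0.1, 8): a small penalty favors the larger tree
print(pick(0.01))   # (0.12, 3): a larger penalty favors the pruned tree
```

Raising alpha makes every extra leaf more expensive, so the preferred tree gets smaller as alpha grows.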

Another popular method is Minimum Description Length (MDL) pruning. It is based on the MDL principle from information theory, which states that a good model is one that has the shortest description length. The description counts both the cost of encoding the tree itself (the complexity penalty) and the cost of encoding the data the tree fails to explain (the data-fitting penalty), and model selection balances the two. In practice, MDL pruning cuts branches that do not contribute much to the accuracy of the tree, looking for the smallest subtree that approximates the full tree while still achieving similar accuracy.

The Iterative Dichotomiser 3 (ID3) algorithm is often mentioned in this context, although it is a tree-building algorithm rather than a pruning method. It uses the concept of entropy to decide which feature to split on: the feature with the highest information gain is chosen as the splitting feature. To prune a tree built this way, a held-out set is used to evaluate the accuracy of the tree after pruning. If the accuracy does not decrease significantly, the tree is pruned.
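The entropy and information gain that ID3 uses to choose a split, as mentioned above, can be computed with the standard formulas. This is a minimal sketch: the function names and the toy labels are assumptions made for the example, and a candidate split is represented simply as a list of label groups.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(labels, groups):
    """Entropy before the split minus the weighted entropy after it."""
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - remainder


# Made-up labels and two candidate splits, for illustration only.
labels = ["yes", "yes", "no", "no"]
perfect_split = [["yes", "yes"], ["no", "no"]]
useless_split = [["yes", "no"], ["yes", "no"]]

print(information_gain(labels, perfect_split))  # 1.0
print(information_gain(labels, useless_split))  # 0.0
```

A split that separates the classes completely recovers the full 1 bit of entropy, while a split that leaves each group as mixed as before gains nothing.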

In conclusion, pruning is an important technique for improving the generalization performance of decision trees by reducing overfitting. There are several methods for pruning decision trees, each with its own strengths and weaknesses. Choosing the right method for a given problem depends on the characteristics of the dataset and the specific requirements of the problem.

Summary
In this article, I tried to explain the pruning of decision trees in simple terms. If you have any questions about the post, please put them in the comment section, and I will do my best to answer them.

Using K-Fold Cross Validation in Machine Learning

ML_Concepts — Thu, 01 Dec 2022 12:00:13 +0000

First of all, what is cross-validation?
Note: The original version of this article, on cross-validation, is available on ml-concepts.com.

Cross-validation is one of the most commonly used techniques for testing the performance of a machine learning model. It is a resampling technique used to assess a model when we have limited data. To perform cross-validation, we set aside a portion of the original dataset that is not shown to the model during training, and use it later to assess the model’s performance.

What is the need for cross-validation?

In a typical model training setup, we use a train-test split so that the training and testing datasets are kept separate for a reliable model assessment. Depending on the size of the dataset, we split it 80/20 or 70/30.

With a single train-test split, the measured accuracy fluctuates: it depends on which records happen to land in the test set, so it does not settle on one firm value. Because of this, we cannot be sure about our model or its accuracy, and the model is not yet fit to be used for our business problem.

Cross-validation helps prevent this. There are different techniques for cross-validating a model, but all of them follow a similar procedure:

1. Partition the original dataset into two sections: one for training, the other for testing.

2. Train the model on the training set.

3. Test the model on the test set.

4. Repeat steps 1 to 3 several times. The number of repetitions depends on the CV (cross-validation) technique that you are using.

There are different types of cross-validation, but k-fold cross-validation is the most popular one, and it is the one we discuss here.

K-fold cross-validation splits the dataset into K partitions of roughly equal size, called folds. In each trial, the model is trained on K-1 folds, and the remaining fold is used as the test set.

For instance, consider a dataset with 1,000 records on which we want to perform K-fold cross-validation. With K=5, there will be 5 folds and 5 trials, and on this basis we will measure the model’s accuracy.

Each fold will have n/K samples in its test dataset

Where,

n = number of records in the original dataset

K = number of folds

Therefore, a dataset of 1,000 records split into 5 folds gives n/K = 1000/5, i.e. 200 records as test samples and the remaining 800 records as training samples in each trial.

Assume the first trial, K1, yields accuracy A1.

The second trial, K2, uses the same 1,000 records, but this time the 200 test records are entirely different from the previous ones. That is, there is no overlap between the test samples of one trial and those of any other. A2 is the accuracy for K2. The procedure repeats until it reaches K=5, the number we chose initially.

Assume that A1, A2, A3, A4, and A5 are the accuracies of the K1, K2, K3, K4, and K5 folds respectively.

The overall accuracy is then the mean of the K per-fold accuracies. Unlike the accuracy from a single train-test split, this value does not fluctuate with the choice of split.
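The way the 1,000 records are divided in each trial can be sketched without any library. The generator below is a simplified illustration, and its name k_fold_splits is made up for this example. It assumes the records are used in their original order, whereas in practice the data is usually shuffled first.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


for trial, (train, test) in enumerate(k_fold_splits(1000, 5), start=1):
    print(f"K{trial}: {len(train)} training records, test records {test[0]} to {test[-1]}")
```

Each record appears in exactly one test fold, which is why the test samples of different trials never overlap.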

In addition, you can be more confident about your model because you have the minimum, maximum, and average accuracy across the K folds. In our example, suppose the K3 fold gives the highest accuracy, A3, and the K5 fold gives the lowest, A5, and let A be the mean of the accuracies from all K folds.

With this information you can state with confidence that the model has a minimum accuracy of A5, a maximum accuracy of A3, and an average accuracy of A. This is more reliable for business problems, because the model’s performance is now described by firm values rather than by the fluctuating result of a single split.

Note that cross-validation gives you not just an estimate of your model’s performance, but also a measure of how precise this estimate is.
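Summarizing the per-fold accuracies might look like the sketch below. The five accuracy values are placeholders invented for this example, not results from a real model. They are arranged so that K3 is the highest and K5 the lowest, as in the example above. The standard deviation is one possible way to express how precise the estimate is.

```python
from statistics import mean, stdev


def summarize(accuracies):
    """Return the minimum, maximum, mean, and sample standard deviation."""
    return {
        "min": min(accuracies),
        "max": max(accuracies),
        "mean": mean(accuracies),
        "stdev": stdev(accuracies),
    }


# Placeholder accuracies for folds K1..K5; these are not real results.
fold_accuracies = [0.81, 0.84, 0.90, 0.83, 0.78]
summary = summarize(fold_accuracies)
print(summary["min"], summary["max"])  # 0.78 0.9
print(round(summary["mean"], 3))       # 0.832
```

A small standard deviation means the folds agree closely, so the mean accuracy is a more trustworthy estimate.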

How to Determine the value of K

An important step in the process is deciding the value of K. Choosing it well should help you build models with low bias. Commonly, K is set to 5 or 10. For instance, in scikit-learn the default value of K is 5, which gives 5 folds.

One limitation of k-fold cross-validation is that it does not work well on imbalanced datasets, for which we have stratified cross-validation.

If you liked this article then check out our other articles on validation techniques:

Overfitting and underfitting

Reduce overfitting: Feature reduction and Dropouts

Pruning to Reduce Overfitting

Reduce Overfitting: Using Regularization

Summary

In this article, I tried to explain K-fold cross-validation in simple terms. If you have any questions about the post, please ask them in the comment section and I will do my best to answer them.

Min-Max Normalization

ML_Concepts — Wed, 30 Nov 2022 11:46:55 +0000

One of the most important transformations you need to apply to your data is feature scaling. With few exceptions, Machine Learning algorithms don’t perform well when the input numerical attributes have very different scales. Consider the case for the housing data: the total number of rooms ranges from about 6 to 39,320, while the median incomes only range from 0 to 25. Note that scaling the target values is generally not required.

There are two common ways to get all attributes to have the same scale: min-max scaling and standardization.

Min-max scaling shifts and rescales values so that they end up ranging from 0 to 1. We do this by subtracting the min value and dividing by the max minus the min. Scikit-Learn provides a transformer called MinMaxScaler for this. It has a feature_range hyperparameter that lets you change the range if you don’t want 0–1 for some reason.

Min-max normalization is one of the most popular ways to normalize data.

For every feature,

the minimum value of that feature gets transformed into a 0,
the maximum value gets transformed into a 1,
and every other value gets transformed into a value between 0 and 1.

The formula is:

X_scaled = (X - X_min) / (X_max - X_min)

Where,

X_scaled: new, scaled value of feature X.

X: old value of feature X.

X_min: minimum value of feature X.

X_max: maximum value of feature X.
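Written as code, the min-max calculation for a single feature is a one-line list comprehension. This is a plain-Python sketch rather than the scikit-learn transformer. The function name and the ages list are made up for illustration, and it assumes the feature is not constant (otherwise the denominator would be zero).

```python
def min_max_scale(values):
    """Rescale values so the minimum maps to 0 and the maximum to 1."""
    x_min, x_max = min(values), max(values)
    return [(x - x_min) / (x_max - x_min) for x in values]


ages = [20, 30, 40, 60]  # made-up example values
print(min_max_scale(ages))  # [0.0, 0.25, 0.5, 1.0]
```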

Let us consider one example to make the method clear. We have a dataset containing some features, which is shown below in the figure.

As we can see, the Age and Estimated Salary features are on very different scales. Feeding this type of data to the model will result in poor performance, and the model will fail in the real world. That’s why feature scaling is a must, and here we are talking about min-max scaling.

After applying scikit-learn's scaling, let’s see the difference between the Age and Estimated Salary features before and after scaling.

As you can see this technique enables us to interpret the data easily. There are no large numbers, only concise data that do not require further transformation and can be used in the decision-making process immediately.

Min-max normalization has one fairly significant downside: it does not handle outliers very well. For example, if you have 99 values between 0 and 40, and one value is 100, then the 99 values will all be transformed into a value between 0 and 0.4.

That data is just as squished as before!
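The outlier effect described above is easy to reproduce. In this sketch, the 99 evenly spaced values between 0 and 40 and the single outlier at 100 are generated just for the example, and min_max_scale is a hypothetical helper written in plain Python.

```python
def min_max_scale(values):
    """Rescale values so the minimum maps to 0 and the maximum to 1."""
    x_min, x_max = min(values), max(values)
    return [(x - x_min) / (x_max - x_min) for x in values]


# 99 made-up values spread evenly from 0 to 40, plus one outlier at 100.
values = [i * 40 / 98 for i in range(99)] + [100]
scaled = min_max_scale(values)

print(max(scaled[:-1]))  # 0.4: the 99 ordinary values share 40% of the range
print(scaled[-1])        # 1.0: the outlier takes the top of the range
```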

Take a look at the image below to see an example of this.

After normalizing, look at the diagram below: normalization fixed the squishing problem on the y-axis, but the x-axis is still problematic. The point in orange is an outlier, which the min-max normalizer doesn’t handle.

You can normalize your dataset using the scikit-learn object MinMaxScaler.

Good practice usage with the MinMaxScaler and other scaling techniques is as follows.

Fit the scaler using available training data. For normalization, this means the training data will be used to estimate the minimum and maximum observable values. This is done by calling the fit() function.
Apply the scale to training data. This means you can use the normalized data to train your model. This is done by calling the transform() function.
Apply the scale to data going forward. This means you can prepare new data in the future on which you want to make predictions.
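The three steps above can be illustrated with a small stand-in class. SimpleMinMaxScaler is a hypothetical name written for this sketch; it only mimics the fit/transform pattern for a single feature and is not the scikit-learn implementation. The training and new values are made up.

```python
class SimpleMinMaxScaler:
    """Hypothetical stand-in that mimics the fit/transform pattern for one feature."""

    def fit(self, values):
        # Estimate the minimum and maximum from the training data only.
        self.min_ = min(values)
        self.max_ = max(values)
        return self

    def transform(self, values):
        # Reuse the training minimum and maximum for any data passed in.
        return [(x - self.min_) / (self.max_ - self.min_) for x in values]


train = [10, 20, 30, 50]  # made-up training values
new_data = [20, 60]       # made-up future values

scaler = SimpleMinMaxScaler().fit(train)
print(scaler.transform(train))     # [0.0, 0.25, 0.5, 1.0]
print(scaler.transform(new_data))  # [0.25, 1.25]
```

Because the minimum and maximum come from the training data, a new value larger than anything seen in training (60 here) is scaled to a number above 1.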

We have already covered min-max normalization in detail on our website. Please use the below link to go there.

https://ml-concepts.com/2021/10/08/min-max-normalization/

See also our articles on overfitting and underfitting in machine learning, Z-score normalization, and embedded methods and lasso regression.

Summary
In this article, I tried to explain MinMax Normalization in simple terms. If you have any questions related to the post, put them in the comment section and I will do my best to answer them.

Overfitting and Underfitting in Machine Learning

ML_Concepts — Wed, 30 Nov 2022 04:16:41 +0000

Introduction

A common danger in machine learning is overfitting: producing a model that performs well on training data but generalizes very poorly to new, unseen data such as a test set. This could involve learning noise in the data or learning to identify specific outputs rather than whatever factors are actually predictive of the desired outcome.

The other side is underfitting: producing a model that doesn’t perform well even on training data. This mainly happens when there are too few features or too little training data, or when the model tries to fit a linear relationship to data that is nonlinear.

What is Model Fitting?

Model fitting is a measure of how well a machine learning model generalizes to data similar to that on which it was trained. The generalization of a model to new data is ultimately what allows us to use machine learning algorithms every day to make predictions and classify data. A well-fitted model accurately approximates the output when it is given inputs it has not seen before. Fitting a model is the process of adjusting its parameters in order to improve its accuracy.

Understanding model fit is important for understanding the root cause of poor model accuracy. In fact, overfitting and underfitting are the two biggest causes of the poor performance of machine learning algorithms. Hence, model fitting is the essence of machine learning. If our model doesn’t fit our data correctly, the outcomes it produces will not be accurate enough to be useful for practical decision-making.

Overfitting the Training Data

Say you are visiting a foreign country and the taxi driver rips you off. You might be tempted to say that all taxi drivers in that country are thieves. Overgeneralizing is something that we humans do all too often, and unfortunately, machines can fall into the same trap if we are not careful. In Machine Learning this is called overfitting: it means that the model performs well on the training data, but it does not generalize well.

Fig. shows an example of a high-degree polynomial life satisfaction model that strongly overfits the training data. Even though it performs much better on the training data than the simple linear model, would you really trust its predictions?

Complex models such as deep neural networks can detect subtle patterns in the data, but if the training set is noisy, or if it is too small (which introduces sampling noise), then the model is likely to detect patterns in the noise itself. Obviously, these patterns will not generalize to new instances. For example, say you feed your life satisfaction model many more attributes, including uninformative ones such as the country’s name. In that case, a complex model may detect patterns like the fact that all countries in the training data with a w in their name have a life satisfaction greater than 7: New Zealand (7.3), Norway (7.4), Sweden (7.2), and Switzerland (7.5).

How confident are you that the W-satisfaction rule generalizes to Rwanda or Zimbabwe? Obviously, this pattern occurred in the training data by pure chance, but the model has no way to tell whether a pattern is real or simply the result of noise in the data.

Overfitting happens when the model is too complex relative to the amount and noisiness of the training data. The possible solutions are:

To simplify the model by selecting one with fewer parameters (e.g., a linear model rather than a high-degree polynomial model), by reducing the number of attributes in the training data, or by constraining the model.
To gather more training data.
To reduce the noise in the training data (e.g., fix data errors and remove outliers).

Underfitting the Training Data

Underfitting is the opposite of overfitting: it occurs when your model is too simple to learn the underlying structure of the data. For example, a linear model of life satisfaction is prone to underfit; reality is just more complex than the model, so its predictions are bound to be inaccurate, even on the training examples.

The main options to fix this problem are:

Selecting a more powerful model, with more parameters.
Feeding better features to the learning algorithm (feature engineering).
Reducing the constraints on the model.

Another way of thinking about the overfitting problem is as a trade-off between bias and variance. Both are measures of what would happen if you were to retrain your model many times on different sets of training data (from the same larger population).

For example, the degree 0 model in “Overfitting and Underfitting” will make a lot of mistakes for pretty much any training set (drawn from the same population), which means that it has a high bias.

However, any two randomly chosen training sets should give pretty similar models (since any two randomly chosen training sets should have pretty similar average values). So, we say that it has a low variance. High bias and low variance typically correspond to underfitting.

On the other hand, the degree 9 model fits the training set perfectly. It has very low bias but very high variance (since any two training sets would likely give rise to very different models). This corresponds to overfitting.

If your model has high bias (which means it performs poorly even on your training data), then one thing to try is adding more features. Going from the degree 0 model in “Overfitting and Underfitting” to the degree 1 model was a big improvement. If your model has high variance, then you can similarly remove features. But another solution is to obtain more data (if you can).

For additional information on model fitting, visit ML Concepts, where we have already covered this concept in detail:

Everything you need to know about Model Fitting in Machine Learning

Summary

In this article, I tried to explain overfitting and underfitting in simple terms. If you have any questions related to the post, put them in the comment section and I will do my best to answer them.

Z-Score Normalization

ML_Concepts — Tue, 29 Nov 2022 12:36:24 +0000

What is Normalization?

Let's first understand what data normalization means, and then we will come to Z-score normalization. Normalization is a feature scaling technique that is applied to our features before feeding them to a machine learning model.

Because data is collected from a variety of sources, its features often come on different scales. It therefore becomes necessary to bring all those features to one common scale before feeding the data to the model; otherwise it can lead to a poorly built model that is of little use in a real-world scenario.

Data normalization consists of remodeling numeric columns to a standard scale, and it is generally considered part of preparing clean data. It is a preprocessing technique in machine learning. Z-score normalization, which is what a standard scaler performs, is one of its types; others include min-max scaling. Here we are going to look at Z-Score Normalization.

In short, feature scaling is a technique to standardize the independent features present in the data to a common scale.

What is z-score normalization?

Z-score normalization is also referred to as zero-mean normalization or standardization, and the z-score itself is also called the standard score. It should not be confused with the Altman Z-score, which comes from a model developed by Edward Altman to estimate the chances of a public company going bankrupt and is a different measure. Z-Score helps in the normalization of data.

If we normalize the data into a simpler form with the help of z-score normalization, it becomes much easier to understand and compare. It is also a normalization strategy that handles outliers better than range-based scaling such as min-max. In this technique, values are normalized based on the mean and standard deviation of the data.

The essence of this technique is the transformation of the values to a common scale where the mean equals zero and the standard deviation is one. Technically, a z-score measures how many standard deviations a value lies below or above the mean. Standardization, or z-score normalization, is less affected by outliers than range-based scaling because there is no predefined range for the transformed features, although extreme values still influence the mean and standard deviation.

We use the following formula to perform a z-score normalization on every value in a dataset:

z = (x – μ) / σ

where,

x: Original value.
μ: Mean of data.
σ: Standard deviation of data.
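
As a small sketch of how this formula can be applied column by column, the example below uses NumPy on a made-up table with two features on very different scales. The function name `z_score_normalize` and the sample values are assumptions for illustration; the standard deviation used here is the population one (NumPy's default).

```python
import numpy as np


def z_score_normalize(data):
    """Scale each column to mean 0 and standard deviation 1 using z = (x - mean) / std."""
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)
    std = data.std(axis=0)  # population standard deviation (ddof=0)
    if np.any(std == 0):
        raise ValueError("a column with zero standard deviation cannot be z-scored")
    return (data - mean) / std


# Made-up features: age (years) and income (dollars).
features = np.array([
    [25.0, 30_000.0],
    [35.0, 52_000.0],
    [45.0, 61_000.0],
    [55.0, 98_000.0],
])

normalized = z_score_normalize(features)
print(normalized)
print("column means:", normalized.mean(axis=0))
print("column stds: ", normalized.std(axis=0))
```

After the transformation both columns have a mean of about 0 and a standard deviation of about 1, so neither feature dominates just because of its units.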

A z score represents the number of standard deviations a value (x) is above or below the mean of a set of numbers when the data are normally distributed. Using z scores allows the translation of a value’s raw distance from the mean into units of standard deviations.

If a z score is negative, the raw value (x) is below the mean.

If the z score is positive, the raw value (x) is above the mean.

For example, for a data set that is normally distributed with a mean of 60 and a standard deviation of 10, suppose a statistician wants to determine the z score for a value of 80. This value (x = 80) is 20 units above the mean, so the z value is:

z = (x – μ) / σ

= (80 – 60) / 10

= 2.00

This z score signifies that the raw score of 80 is two standard deviations above the mean. How is this z score interpreted? The empirical rule states that about 95% of all values are within two standard deviations of the mean if the data are approximately normally distributed.

What is meant by the Empirical Rule?

The empirical rule is an important rule of thumb that is used to state the approximate percentage of values that lie within a given number of standard deviations from the mean of a set of data if the data are normally distributed.

For a normal distribution, it is estimated that:

About 68% of the data points lie within +/- 1 standard deviation of the mean.

About 95% of the data points lie within +/- 2 standard deviations of the mean.

About 99.7% of the data points lie within +/- 3 standard deviations of the mean.
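
These percentages can be checked numerically. For a normal distribution, the fraction of values within k standard deviations of the mean equals erf(k / √2); the short sketch below evaluates it with Python's standard library (the helper name `fraction_within` is just for this example).

```python
import math


def fraction_within(k):
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within +/- {k} standard deviations: {fraction_within(k):.1%}")
```

The results are roughly 68.3%, 95.4%, and 99.7%, which the empirical rule rounds to 68%, 95%, and 99.7%.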

So, the z score is the number of standard deviations that a value, x, is above or below the mean. If the value of x is less than the mean, the z score is negative; if the value of x is more than the mean, the z score is positive; and if the value of x equals the mean, the associated z score is zero.

This formula allows the conversion of the distance of any x value from its mean into standard deviation units. A standard z-score table can be used to find probabilities for any normal curve problem that has been converted to z-scores.

Another Example

Suppose the scores for a certain exam are normally distributed with a mean of 80 and a standard deviation of 4. Find the z-score for an exam score of 87.

We can use the following steps to calculate the z-score:

The mean is μ = 80
The standard deviation is σ = 4
The individual value we’re interested in is X = 87
Thus, z = (X – μ) / σ = (87 – 80) /4 = 1.75
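
The same calculation, together with the z-table lookup mentioned earlier, can be sketched in a few lines of Python. This assumes Python 3.8 or later, where `statistics.NormalDist` is available; its `cdf` method plays the role of the printed z-table.

```python
from statistics import NormalDist

mean, std_dev, score = 80, 4, 87

# z = (X - mean) / standard deviation
z = (score - mean) / std_dev

# Cumulative probability of the standard normal distribution at z,
# i.e. the value a z-table would give for this z-score.
fraction_below = NormalDist().cdf(z)

print(f"z = {z}")
print(f"fraction of scores below {score}: {fraction_below:.4f}")
```

With z = 1.75, about 96% of exam scores are expected to fall below 87 if the scores really are normally distributed.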

Here are some important facts about z-scores.

A positive z-score says the data point is above average.
A negative z-score says the data point is below average.
A z-score close to 0 says the data point is close to average.
A data point can be considered unusual if its z-score is above 3 or below -3.
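
The last fact is often used as a simple outlier check. Below is a minimal sketch with made-up measurements, where one value is far from the rest; the threshold of 3 comes from the rule above, everything else is assumed for illustration.

```python
import numpy as np

# Made-up measurements: 19 values near 50 and one suspicious value.
values = np.array([48, 50, 52, 49, 51, 50, 47, 53, 50, 49,
                   51, 50, 52, 48, 50, 49, 51, 50, 50, 120], dtype=float)

# z-score of every value relative to the whole data set.
z_scores = (values - values.mean()) / values.std()

# Keep only the values whose z-score is above 3 or below -3.
unusual = values[np.abs(z_scores) > 3]
print(unusual)
```

Only the value 120 is flagged here. Keep in mind that an extreme value also pulls the mean and standard deviation toward itself, so this check works best when the data set is reasonably large.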

Advantages of z-score normalization

It allows a data administrator to understand the probability of a score occurring within the normal distribution of the data.
The z-score enables a data administrator to compare two different scores that are from different normal distributions of the data.
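
As a quick illustration of the second advantage, suppose a student has scores from two exams that were graded on different distributions. All the numbers below are hypothetical and chosen only to show the comparison.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / std_dev


# Hypothetical exams: the raw history score is higher than the raw math score.
z_math = z_score(85, mean=70, std_dev=10)
z_history = z_score(90, mean=85, std_dev=5)

print(f"math z-score:    {z_math}")
print(f"history z-score: {z_history}")
```

Even though 90 is the higher raw score, the math result (z = 1.5) is further above its class average than the history result (z = 1.0).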

We have already covered z-score normalization in detail on our website. For additional information, visit the following link:
https://ml-concepts.com/2021/10/08/z-score-normalization/

Summary

In this article, I tried to explain Z score Normalization in simple terms. If you have any questions related to the post, put them in the comment section and I will do my best to answer them.

References

The following tutorials provide additional information on different normalization techniques.

https://en.wikipedia.org/wiki/Standard_score

https://www.codecademy.com/article/normalization

https://www.statology.org/standardization-vs-normalization