Machine learning is concerned with developing algorithms that automatically extract information from a dataset. 'Automatically' here means that once an algorithm is applied to a dataset, the model performs the extraction without manual intervention. Data is said to be at the core of machine learning: a model, once developed, is trained on data so that it is exposed to a variety of situations and can generate explicit results. Training, or 'learning', takes place through the automatic identification of patterns in a certain proportion of the exposed data. Once training is complete, i.e. the error function has been minimised, the model is tested on the remaining portion of the data. Data is therefore divided into two categories: training data and testing data.
To gain insight into how learning and training occur within a model, it helps to understand the mathematics behind each step of the process.
In part 1 of this series, Probability and Statistics for Machine Learning are discussed.
Probability is a field of mathematics consisting of tools to quantify the uncertainty of events.
When talking about probability we often come across the term uncertainty. While uncertainty in real life is usually seen as something negative, it is perfectly normal in mathematics and programming. This is because programs are built on assumptions that are not guaranteed to hold, which is why errors are corrected repeatedly, cycle after cycle, behind the scenes. Probability serves as a tool that provides the logic to quantify and manage this uncertainty.
The probability of an event is calculated by the following method –

P(event) = (number of favourable outcomes) / (total number of possible outcomes)
The assigned probability is a fractional value, always in the range between 0 and 1. If events are termed equally likely, the probability of occurrence is 1 divided by the total number of possible outcomes.
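As a minimal sketch, the equally-likely rule can be computed directly (a fair six-sided die is a hypothetical example, not from the text above):

```python
# Hypothetical example: a fair six-sided die with equally likely outcomes.
outcomes = [1, 2, 3, 4, 5, 6]

# P(rolling a 3) = 1 / total possible outcomes
p_three = 1 / len(outcomes)
print('P(3) = %.3f' % p_three)  # → P(3) = 0.167

# P(rolling an even number) = favourable outcomes / total outcomes
favourable = [o for o in outcomes if o % 2 == 0]
p_even = len(favourable) / len(outcomes)
print('P(even) = %.1f' % p_even)  # → P(even) = 0.5
```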
Frequentist probability, or frequentism, is an interpretation of probability. It defines an event's probability as the limit of its relative frequency over many trials. Probabilities can, in principle, be found by a repeatable objective process, which suits the statistical requirements of many experimental researchers.
In this approach, events are observed and counted; the frequencies obtained then provide the basis for directly calculating a probability, hence the name "frequentist."
The frequentist interpretation of probability requires a well-defined random experiment. The set of all possible outcomes of that random experiment is the sample space of the experiment, and a subset of this sample space is termed an event.
The core concept of frequentist probability is straightforward. For any given event, only one of two possibilities applies: the event either occurs or it does not. By counting the frequencies of both outcomes, the relative frequency of the event is taken as its probability. As the number of trials increases, the change in the relative frequency diminishes. The interpretation does not claim to capture every connotation of 'probable' in colloquial speech; it provides distinct guidance for practical experiments, but whether this guidance is useful or misleading has long been a matter of controversy.
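The convergence of the relative frequency can be illustrated with a simple coin-flip simulation (a sketch using NumPy; the seed and trial counts are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate fair coin flips and track the relative frequency of heads.
for n in [100, 10_000, 1_000_000]:
    flips = rng.integers(0, 2, size=n)  # 1 = heads, 0 = tails
    rel_freq = flips.mean()
    print('n=%8d  relative frequency of heads = %.4f' % (n, rel_freq))
# As n grows, the relative frequency settles near the true probability 0.5.
```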
As Leonid Pekelis wrote in an Optimizely article,
Frequentist arguments are more counter-factual in nature, and resemble the type of logic that lawyers use in court. Most of us learn frequentist statistics in entry-level statistics courses. A t-test, where we ask, “Is this variation different from the control?” is a basic building block of this approach.
Bayesian probability is the interpretation of the concept of probability in which, instead of the frequency or propensity of some phenomenon, probability is interpreted as a reasonable expectation representing a state of knowledge, or as the quantification of a personal belief. It rests on Bayes' theorem –

P(A|B) = P(B|A) P(A) / P(B)
where A and B are events, P(A|B) is the conditional probability that event A occurs given that event B has already occurred (P(B|A) has the same meaning but with the roles of A and B reversed) and P(A) and P(B) are the marginal probabilities of event A and event B occurring respectively.
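A worked numeric example shows how the formula combines these quantities (the diagnostic-test numbers below are hypothetical):

```python
# Hypothetical diagnostic-test example of Bayes' theorem.
# A = patient has the disease, B = test is positive.
p_a = 0.01              # P(A): prevalence of the disease
p_b_given_a = 0.95      # P(B|A): sensitivity of the test
p_b_given_not_a = 0.05  # P(B|not A): false-positive rate

# Marginal probability of a positive test, P(B)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print('P(disease | positive test) = %.3f' % p_a_given_b)  # → 0.161
```

Despite the test's high sensitivity, the low prior P(A) keeps the posterior small, which is exactly the kind of update Bayes' theorem formalises.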
Bayesian probability enables reasoning with hypotheses. The major difference between Bayesian inference and frequentist inference is that in Bayesian inference a probability is assigned to the hypothesis obtained from the analysis, whereas in frequentist inference the hypothesis is tested without being assigned a probability.
Bayesian probability works by specifying a prior probability. This specified probability is then updated to a posterior probability in the light of new data.
Standard steps and formulae are employed while implementing Bayesian Probability. The steps undertaken are –
Random variables are used to model all sources of uncertainty, which can be of varied types. Statistical uncertainty, also called aleatoric uncertainty, represents unknowns that differ each time the same experiment is run. Systematic uncertainty, also called epistemic uncertainty, arises from a lack of data or from errors in measurement, for instance when the model neglects certain aspects or when a certain proportion of the data has been deliberately masked.
Determination of the prior probability distribution using the information at hand.
Sequential application of Bayes' formula as additional data becomes accessible.
Calculation and analysis of the posterior distribution using Bayes' formula. This posterior distribution then becomes the next prior distribution, and so on.
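The steps above can be sketched with a conjugate Beta-Binomial update for a coin's heads probability (a standard textbook setting; the observed batches below are made up):

```python
# Sequential Bayesian updating of a coin's probability of heads.
# With a Beta(a, b) prior and coin-flip data, the posterior is again a Beta
# distribution, so each posterior becomes the prior for the next batch.
a, b = 1.0, 1.0  # uniform Beta(1, 1) prior

batches = [[1, 1, 0], [1, 0, 1, 1], [0, 1]]  # 1 = heads (made-up data)
for batch in batches:
    heads = sum(batch)
    tails = len(batch) - heads
    a, b = a + heads, b + tails  # posterior becomes the next prior
    print('posterior Beta(%.0f, %.0f), mean = %.3f' % (a, b, a / (a + b)))
```

Each print line shows the belief about the coin being revised as new flips arrive, exactly the prior-to-posterior cycle described above.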
As Boundless Rationality wrote,
A fundamental aspect of Bayesian inference is updating your beliefs in light of new evidence. Essentially, you start out with a prior belief and then update it in light of new evidence. An important aspect of this prior belief is your degree of confidence in it.
Probability is widely employed in machine learning. Techniques such as classification predictive modelling make use of probability on a large scale. Modelling a problem as the prediction of class membership simplifies the problem and enables the model to learn effectively. The model can capture ambiguities accurately, letting the user interpret the probability that a certain event will or will not occur within the domain. A basic understanding of probability, its characteristics, and its interpretations simplifies the prediction step, and predicted probabilities can be further refined using probability calibration to make better predictions.
Linear regression is a machine learning technique that operates by minimising the mean squared error of the model's predictions, while logistic regression minimises the negative log of the predicted probabilities. In both techniques, probability plays an active role.
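As a sketch of the two loss functions just mentioned, computed on made-up predictions:

```python
import numpy as np

# Mean squared error, minimised by linear regression (made-up values).
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.8, 5.3, 2.2, 7.1])
mse = np.mean((y_true - y_pred) ** 2)
print('MSE = %.4f' % mse)

# Negative log likelihood (log loss), minimised by logistic regression.
labels = np.array([1, 0, 1, 1])
probs = np.array([0.9, 0.2, 0.8, 0.6])  # predicted P(label = 1)
log_loss = -np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))
print('Log loss = %.4f' % log_loss)
```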
Frameworks such as Maximum Likelihood Estimation (MLE), used for estimating model parameters, form a general probabilistic modelling framework that reduces the divergence between the observed probability distribution and the predicted one. This divergence is mainly measured during the training phase of the model using entropy. Differences and entropy between distributions are concepts rooted directly in probability theory.
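MLE can be sketched with a toy coin example (made-up flips): for a Bernoulli likelihood, the estimate of the heads probability that maximises the likelihood is simply the sample proportion.

```python
import numpy as np

# Toy MLE example: estimate a coin's heads probability from flips.
flips = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])  # made-up data, 1 = heads

# The Bernoulli log likelihood
#   L(p) = sum(x * log(p) + (1 - x) * log(1 - p))
# is maximised at p = mean(x), the sample proportion of heads.
p_mle = flips.mean()
print('MLE estimate of P(heads) = %.1f' % p_mle)  # → 0.7

# Numerical check: the log likelihood at p_mle beats nearby candidates.
def log_likelihood(p):
    return np.sum(flips * np.log(p) + (1 - flips) * np.log(1 - p))

assert log_likelihood(p_mle) > log_likelihood(0.5)
assert log_likelihood(p_mle) > log_likelihood(0.9)
```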
Statistics is a field of mathematics that is a mandatory prerequisite for understanding machine learning and its related concepts and techniques.
Statistical methods enable accurate selection of training data and testing data, as mentioned previously. Statistical machine learning techniques include the following –
Outliers are observations that deviate from the other observations in the dataset. In other words, outliers are points that do not fit the observation set for varied reasons such as measurement error, variable measurement, sampling errors, or data processing errors. Two types of outliers exist: univariate and multivariate. Univariate outliers, as the name suggests, are one-dimensional and can be found by merely looking at the distribution of data points in a single feature space. Multivariate outliers are n-dimensional and cannot feasibly be found by observation alone; this is where the need to train a model for value prediction arises. Depending on the setting, outliers can be point outliers, contextual outliers, or collective outliers. A point outlier is a single data point that lies far from the rest of the data. A contextual outlier can arise from noise, for example a background signal during speech recognition. The special property of collective outliers is that they can indicate the discovery of an entirely new phenomenon within the dataset. Probabilistic and statistical (parametric) modelling is one of the techniques employed for outlier detection.
Z-score – the standard score of an observation is a metric that indicates how many standard deviations a data point lies from the mean of the sample, assuming a Gaussian distribution. Python libraries such as SciPy and scikit-learn contain functions that are easy to use with Pandas and NumPy. However, the Z-score is not recommended when the distribution cannot be assumed to be parametric.
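A minimal Z-score outlier check, assuming roughly Gaussian data (the sample is generated, and the threshold of 3 is a common convention rather than a fixed rule; `scipy.stats.zscore` computes the same quantity):

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up Gaussian sample with one obvious outlier appended.
data = np.append(rng.normal(loc=50, scale=5, size=100), 95.0)

# Z-score: number of standard deviations each point lies from the mean.
z_scores = (data - data.mean()) / data.std()

threshold = 3  # common convention for flagging outliers
outliers = data[np.abs(z_scores) > threshold]
print('Outliers:', outliers)  # the appended value 95.0 should be flagged
```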
A technique that effectively enables the user to obtain information about the dataset from the statistics of a subset of the sample space, eliminating the need to analyse every individual.
The figure above illustrates the process of Data Sampling which is useful in drawing conclusions about the sample space from the sample.
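The idea can be sketched by estimating a population statistic from a random subset (the population below is made up):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population of 1,000,000 measurements.
population = rng.normal(loc=170, scale=10, size=1_000_000)

# Draw a simple random sample and use it to estimate the population mean,
# avoiding the need to examine every individual.
sample = rng.choice(population, size=1_000, replace=False)
print('Population mean: %.2f' % population.mean())
print('Sample estimate: %.2f' % sample.mean())
```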
The Gaussian distribution function is a continuous function that approximates the exact binomial distribution of events. It is the most common distribution for independent, randomly generated variables and takes the shape of a bell curve. The graph of the Gaussian distribution is characterised by two parameters: the mean and the standard deviation. The mean is the maximum point of the graph, about which the curve is symmetric; the standard deviation determines the amount of spread around the mean.
```python
# calculate summary stats
from numpy.random import seed
from numpy.random import randn
from numpy import mean
from numpy import var
from numpy import std
seed(1)
# generate univariate observations
data = 5 * randn(10000) + 50
# calculate statistics
print('Mean: %.3f' % mean(data))
print('Variance: %.3f' % var(data))
print('Standard Deviation: %.3f' % std(data))
```
Hypothesis testing is a method of statistical inference employed to determine a conclusion from competing hypotheses. A null hypothesis and an alternative hypothesis are put forward for the probability distribution of the data.
If, given that the null hypothesis is true, the sample has a probability of occurrence less than a predefined threshold, the difference between the null hypothesis and the sample is statistically significant. This leads to rejection of the null hypothesis and acceptance of the alternative hypothesis. Two kinds of mistake can occur in this decision, the Type 1 error and the Type 2 error, and their rates are controlled through significance levels.
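A minimal two-sample t-test with SciPy illustrates the procedure (the samples are made up, and 0.05 is the conventional significance level):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two made-up samples drawn from populations with different means.
control = rng.normal(loc=50, scale=5, size=500)
variation = rng.normal(loc=52, scale=5, size=500)

# Null hypothesis: the two population means are equal.
t_stat, p_value = stats.ttest_ind(control, variation)
print('t = %.3f, p = %.4f' % (t_stat, p_value))

alpha = 0.05  # significance level controlling the Type 1 error rate
if p_value < alpha:
    print('Reject the null hypothesis: the difference is significant.')
else:
    print('Fail to reject the null hypothesis.')
```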
In Part 1 of Mathematics for Machine Learning, we saw how probability and statistics go hand in hand in the application of machine learning concepts. We walked through the important aspects of probability and statistics and analysed their applications in detail.
In the next part of this series, Linear Algebra and Optimization for Machine Learning will be discussed.