DEV Community

Alexey Tukalo


9 months of Machine Learning and beyond: Machine Learning A-Z

Introduction

Before I properly started studying machine learning last summer, I had already purchased several machine learning courses on Udemy. The most basic of those courses was Machine Learning A-Z: AI, Python & R, so it became my starting point. This course served as a perfect introduction to the field, covering a wide range of classical machine learning techniques and some deep learning.

Course Impression

Typically, as programmers, we work with structured data. However, the world is inherently messy. Machine learning proves to be an invaluable tool for dealing with unstructured information. I was very impressed by the course because it introduced a whole new world of approaches that felt like gaining a superpower.

Course Content

The course explains the machine learning process step by step. The initial, crucial stage of the process is data preprocessing, which happens even before any algorithms can be applied.

Preprocessing of data

Preprocessing begins with data splitting. It is common to divide a dataset into three parts: training, validation, and test sets. The training set is used to train the model, the validation set helps assess overfitting during training, and the test set is used to evaluate the model’s performance after training.
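To make the three-way split concrete, here is a minimal sketch in plain Python. The function name `split_dataset`, the 70/15/15 proportions, and the fixed seed are my own assumptions for illustration, not something the course prescribes.

```python
import random

def split_dataset(rows, val_fraction=0.15, test_fraction=0.15, seed=42):
    """Shuffle rows and split them into training, validation, and test lists."""
    shuffled = list(rows)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

# Hypothetical dataset: 100 rows identified by their index.
train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))
```

Shuffling before slicing matters when the rows are stored in some meaningful order, for example sorted by date or by class.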

Handling missing data is another critical aspect. Depending on the situation and the amount of data missing, there are two primary options:

  • Imputing missing values based on other data points
  • Removing rows with missing data entirely
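Both options can be sketched in a few lines. The rows, the column names, and the choice of the column mean as the imputed value are hypothetical; other strategies such as the median are equally possible.

```python
from statistics import mean

# Hypothetical rows; None marks a missing value.
rows = [
    {"age": 25, "salary": 50000},
    {"age": None, "salary": 62000},
    {"age": 40, "salary": None},
    {"age": 31, "salary": 56000},
]

def drop_incomplete(rows):
    """Option 1: remove every row that has at least one missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]

def impute_mean(rows):
    """Option 2: replace a missing value with the mean of its column."""
    columns = list(rows[0].keys())
    means = {c: mean(r[c] for r in rows if r[c] is not None) for c in columns}
    return [
        {c: (r[c] if r[c] is not None else means[c]) for c in columns}
        for r in rows
    ]

print(drop_incomplete(rows))
print(impute_mean(rows))
```

Dropping rows is simpler but loses half of this tiny dataset, which is why imputation tends to be preferred when data is scarce.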

Moreover, it is often important to perform feature scaling, because some machine learning algorithms are sensitive to the scale of the input data. For instance, algorithms that compute distances between data points, like K-Nearest Neighbors (K-NN), will be biased towards variables with a larger scale if the data is not adjusted to compensate for this. Feature scaling helps ensure that all independent variables contribute equally to the analysis. This can be done through methods like normalization or standardization. Normalization rescales features to a fixed range, usually from 0 to 1. Standardization adjusts all features to have a mean of 0 and a standard deviation of 1.
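The two scaling methods boil down to simple formulas. The sketch below applies them to one hypothetical column of salaries; it assumes the column is not constant, otherwise both functions would divide by zero.

```python
from statistics import mean, pstdev

def normalize(values):
    """Min-max normalization: rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Standardization: shift to mean 0 and scale to standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

salaries = [30000, 45000, 60000, 90000]  # hypothetical feature column
print(normalize(salaries))
print(standardize(salaries))
```

In a real pipeline the minimum, maximum, mean, and standard deviation would normally be computed on the training set only and then reused for the validation and test sets.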

These preprocessing steps are necessary to create robust machine learning models that perform well in real-world scenarios.

Classic Machine Learning Models

Regression

Regression models are a type of statistical tool used for predicting a continuous outcome based on one or more input variables. They are fundamental for forecasting and determining the strength of relationships between variables. These models work by creating an equation that best fits the observed data. I already had some experience with regression models, especially Linear Regression, from the statistics courses I took years ago.

Polynomial Regression extends linear regression by adding terms with powers greater than one. This allows the model to fit a wider range of data shapes, capturing more complex relationships between variables. However, higher-degree polynomials can lead to overfitting, where the model fits the training data too closely and performs poorly on unseen data. This occurs because the model learns noise from the training data, mistaking it for actual relationships.

Next, the course introduces Support Vector Regression (SVR), a powerful model that can encapsulate non-linear relationships with a lower risk of overfitting and can model exponential relationships. The main goal of SVR is to create a prediction line that fits most of the data points as closely as possible while also trying to keep the line as smooth and flat as possible. In other words, SVR tries to strike a balance between closely following the training data and avoiding overly complex models that might not work well on new, unseen data. It does this by allowing for a small margin of error, within which deviations are acceptable. This makes SVR a robust choice for predicting continuous values, especially when the data is complex or has a lot of variability.

After that, Decision Trees and Random Forests are introduced. Typically known for classification, these techniques are also applicable in regression settings. The course explains how these models can predict an output based on decision rules inferred from the data features. Decision Trees and Random Forests create models based on a series of binary decisions from the features within the dataset. This approach can lead to models that fit well on training data but fail to generalize to new data, because the learned splits are piecewise rules that don’t necessarily capture the underlying mathematical relationships between variables. Random Forests reduce this risk by averaging many trees, but they still cannot extrapolate beyond the range of the training data.

On the other hand, methods like SVR and Polynomial Regression aim to identify the mathematical relationships inherent in the data. For example, SVR tries to fit the best possible curve within a certain margin of error, and polynomial regression can model relationships that follow a polynomial equation. If the true relationship between the variables is mathematical, these methods are likely to perform better with less risk of overfitting. This ability to uncover and leverage mathematical relationships makes SVR, Linear, and Polynomial Regression more robust for predicting outcomes where the underlying data relationships are strong and clear.

Model Selection in Regression

The section on regression wraps up with strategies for choosing the best model. Experimenting with different approaches and evaluating their performance on test data is recommended, since experimentation is still the only way to select a truly optimal model.

Classification

Classification involves predicting a categorical response based on input variables.

Logistic Regression, despite its name, is a basic classification technique, ideal for binary classification problems. It is used to predict outcomes that have two possible states, e.g., yes/no or true/false. It works by modelling the probability of the default class, usually labeled 1, as a function of the input features. Logistic regression applies the logistic function to the output of a linear equation, producing a probability score between 0 and 1. This model is robust, straightforward, and efficient for binary classification problems.
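The prediction step of logistic regression is short enough to write out by hand. In this sketch the weights, the bias, and the 0.5 decision threshold are hypothetical values chosen for illustration; in practice the weights and bias are learned from training data.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of class 1: the logistic function of a linear equation."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

# Hypothetical, already-trained parameters for a model with two features.
weights, bias = [0.8, -0.4], -1.0
p = predict_proba([3.0, 1.0], weights, bias)
label = 1 if p >= 0.5 else 0
print(p, label)
```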

The next model in the course is K-Nearest Neighbors (K-NN). It classifies a data point based on how its nearest neighbors are classified, typically by a majority vote among the K closest training points. It can handle multi-class problems and more complex decision boundaries.
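Here is a minimal sketch of the K-NN idea, assuming Euclidean distance and a simple majority vote. The training points and the labels "a" and "b" are made up for the example.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training set with two well-separated classes.
points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(points, labels, (1.5, 1.5)))
print(knn_predict(points, labels, (6.5, 6.5)))
```

Because the prediction depends directly on distances, this is exactly the kind of algorithm that benefits from feature scaling.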

The course also covers Support Vector Machines (SVM) for classification, explaining the use of different kernels to handle linear and non-linear classification. A Support Vector Machine constructs a hyperplane in a multidimensional space to separate different classes. SVM performs well in high-dimensional spaces. It is versatile due to its ability to use different kernel functions to make the decision boundary more adaptable to the data. For example, linear kernels are great for linearly separable data, while radial basis function (RBF) kernels can capture non-linear relationships.

Clustering

Classification and clustering are both methods of organizing data but serve different purposes. Classification is a supervised learning approach where the model is trained on labeled data. This means the model learns from examples that already have an assigned category or class. Its task is to predict the category for new data based on what it has learned. For example, a classification model might determine whether emails are spam or not spam based on training with a dataset of emails labeled accordingly.

Clustering, on the other hand, is an unsupervised learning technique that involves grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups. It’s used when we don’t have predefined labels for data. The model itself discovers the inherent groupings in the data. An example of clustering might be segmenting customers into groups based on purchasing behavior without prior knowledge of the different customer types.

Both methods are fundamental in data analysis:

  • Classification uses labeled data for predictive modeling.
  • Clustering helps to discover hidden patterns in data.

Clustering Techniques

K-Means is a popular clustering technique that partitions data into K distinct, non-overlapping clusters based on their features. The process involves randomly initializing K points as cluster centers and assigning each data point to the nearest cluster based on Euclidean distance. The cluster centers are then recalculated as the mean of the assigned points, and this process repeats until the centroids stabilize and no longer move significantly. This method is particularly effective for large datasets and is widely used due to its simplicity and efficiency. K-Means works best with data where the clusters are spherical and evenly sized, making it less effective with complex cluster shapes.
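The loop described above (assign points to the nearest center, recompute the centers, repeat) can be sketched as follows. The function name `k_means`, the seed, the iteration cap, and the sample points are assumptions for the example; initial centers are picked from the data points themselves.

```python
import math
import random

def k_means(points, k, iterations=100, seed=0):
    """Plain K-Means on a list of coordinate tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial cluster centers
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # centroids stabilized
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical data: two compact groups far away from each other.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = k_means(points, 2)
print(centroids)
```

On less tidy data the result depends on the random initialization, so it is common to run the algorithm several times and keep the best result.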

Hierarchical Clustering, unlike K-Means, does not require the number of clusters to be specified in advance. It builds a hierarchy of clusters either by a divisive method or an agglomerative method.

In the agglomerative approach, each data point starts as its own cluster, and pairs of clusters are merged as one moves up the hierarchy. The process continues until all points are merged into a single cluster at the top of the hierarchy. This method is beneficial for identifying the level of similarity between data points and is visually represented using a dendrogram, which can help determine the number of clusters by cutting the dendrogram at a suitable level.

The divisive method of hierarchical clustering, also known as top-down clustering, starts with all observations in a single cluster and progressively splits the cluster into smaller ones. This approach begins at the top of the hierarchy and works its way down, making it conceptually straightforward: every split is designed to create the most distinct and coherent clusters possible at each level of division.

In practice, the divisive method involves examining the cluster at each step and choosing the best point to split it. This involves measuring the distance between observations within the cluster and identifying the largest distance as the point to divide. The process continues recursively, splitting each subsequent cluster until each observation is its own cluster or until a specified number of clusters is reached. It is generally more computationally intensive than the agglomerative approach, as it requires a global view of the data at each split, making it less commonly used in very large datasets.

Hierarchical clustering is particularly useful for smaller datasets or when the relationships between data points need to be closely examined, such as in biological sciences or when clustering historical data.

Deep Learning Models

Deep learning is a subset of machine learning that employs neural networks with many layers. It is significantly different from classical machine learning techniques. While classical machine learning focuses on features that are often manually selected and engineered, deep learning aims to train neural networks to learn the features themselves. The models automate feature extraction by building complex patterns from simpler ones. This makes deep learning exceptionally powerful for tasks such as image and speech recognition, where the input data is high-dimensional and the relationships within the data are complex. However, training deep learning models requires vast amounts of data.

Artificial Neural Network

A fundamental element of deep learning is the feed-forward densely connected neural network, or Artificial Neural Network (ANN). In these networks, neurons are arranged in layers, with the first layer taking the input data and the last layer producing output. Each neuron in one layer connects to every neuron in the next layer, making the network "densely connected." These neurons have weights and biases that adjust as the network learns from data during the training process. The output of each neuron is calculated by a nonlinear activation function, which introduces the ability to capture nonlinear relationships in the data.

In ANNs, each layer of neurons can be represented by a weight matrix and a bias vector. Data is propagated forward through these layers using matrix multiplication. The output of each layer is calculated by multiplying the input data by the weight matrix and then adding the bias term. This output then passes through an activation function before it is sent to the next layer.
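A forward pass through a tiny network can be sketched without any libraries. The layer sizes, weights, and biases below are arbitrary numbers I picked for the example, and the convention that `weights[j][i]` connects input `i` to neuron `j` is an assumption of this sketch.

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense_forward(inputs, weights, biases, activation):
    """One dense layer: activation(W @ x + b), with one row of W per neuron."""
    return [
        activation(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]

x = [1.0, 2.0]
# Hidden layer: 3 neurons, each with 2 input weights (hypothetical values).
hidden = dense_forward(
    x, [[0.5, -1.0], [1.0, 1.0], [-0.5, 0.25]], [0.0, -1.0, 0.5], relu
)
# Output layer: 1 neuron reading the 3 hidden activations.
output = dense_forward(hidden, [[1.0, 0.5, -1.0]], [0.0], sigmoid)
print(hidden, output)
```

Real frameworks do the same computation with optimized matrix operations over whole batches of inputs.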

The activation function is crucial because it introduces non-linearity into the model, allowing the network to learn and model complex, non-linear relationships in the data. Without non-linear activation functions, the network, regardless of its depth, would still behave just like a single-layer perceptron, which can only learn linear boundaries.

Convolutional Neural Network

An alternative to basic ANNs is the Convolutional Neural Network (CNN). Unlike densely connected networks where every input is connected to each neuron, CNNs operate over volumes of pixels and use filters to create feature maps that summarize the presence of detected features in the input, such as edges in images. This makes CNNs highly efficient for tasks that involve spatial hierarchies, as they reduce the number of parameters needed, reducing the computational burden.

Convolutional Neural Networks are specialized kinds of neural networks for processing data that has a grid-like topology, such as images. CNNs use filters that perform convolution operations as the filter slides over the input to create a feature map that summarizes the presence of detected features in the input. This makes them exceptionally efficient for image-related tasks.

CNNs leverage the mathematical operation of convolution, a fundamental technique in digital signal processing (DSP). In the context of DSP, convolution is used to alter a signal with a filter, extracting important features. Similarly, in CNNs, convolution involves applying a filter over an image to produce a feature map. This process effectively allows the network to detect similarities or specific features in the image that correspond to the filter. For example, a filter might learn to detect edges or specific shapes.
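To show how a filter produces a feature map, here is a small sketch that slides a 2x2 filter over a tiny image. The image and the hand-written vertical-edge filter are hypothetical; in a CNN the filter values are learned. Note that, like most deep learning libraries, this sketch does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image without padding and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

# Dark left half, bright right half: one vertical edge in the middle.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
vertical_edge = [[-1, 1], [-1, 1]]
feature_map = convolve2d(image, vertical_edge)
print(feature_map)
```

The feature map is non-zero only where the window covers the dark-to-bright transition, which is what "detecting an edge" means here.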

As the input image is processed through successive convolutional layers, the CNN uses multiple filters at each layer to search for increasingly complex patterns. The first layer may detect simple edges or textures, while deeper layers can recognize more complex features like parts of objects or entire objects.

Gradient Descent and Training Neural Networks

Gradient descent is a fundamental optimization algorithm used in training neural networks and other machine learning models. It works by iteratively adjusting the model's parameters to minimize the loss function, which measures how well the model's predictions match the actual data. In each step, the algorithm computes the gradient of the loss function with respect to the model parameters, and moves the parameters in the direction that reduces the loss.
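As an illustration of the update rule, the sketch below fits a line y = w*x + b by gradient descent on the mean squared error. The data, the learning rate, and the number of steps are assumptions chosen so that this toy problem converges.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w * x + b by minimizing the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2 / n * sum(errors)
        # Move the parameters against the gradient to reduce the loss.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1
w, b = gradient_descent(xs, ys)
print(w, b)
```

With a learning rate that is too large the parameters would overshoot and diverge, and with one that is too small the same number of steps would not be enough.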

Backpropagation is the technique used to compute these gradients efficiently in neural networks. It involves two phases:

  • A forward pass, where input data is passed through the network to generate predictions.
  • A backward pass, where the gradient of the loss function is computed based on the prediction. It is later propagated back through the network to update the weights.

This process leverages the chain rule of calculus to compute gradients, ensuring that each weight is adjusted in proportion to its contribution to the overall error. Together, Gradient Descent and Backpropagation enable neural networks to learn from data by iteratively improving their accuracy.

The Loss Functions

Loss functions play a critical role in guiding the training process. A loss function, also known as a cost function or error function, quantifies the difference between the predicted outputs of the network and the actual target values. This metric provides a concrete measure of how well the network is performing. The goal of training is to minimize this loss, thereby optimizing the model's parameters.

Commonly used loss functions in ANNs vary depending on the specific type of task:

  • For regression tasks, where the goal is to predict continuous values, the Mean Squared Error (MSE) loss is frequently used. MSE calculates the average of the squares of the differences between the predicted and actual values, penalizing larger errors more severely.
  • For classification tasks, where the output is a class label, Cross-Entropy Loss is commonly employed. This loss function measures the dissimilarity between the true label distribution and the predictions provided by the model.
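Both loss functions from the list above fit in a few lines. The example values are made up, and the small `eps` guard against log(0) is an implementation detail of this sketch.

```python
import math

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def cross_entropy(true_dist, pred_dist, eps=1e-12):
    """Cross-entropy between a true label distribution and predicted probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(true_dist, pred_dist))

# Regression: errors of 1 and 2 give (1 + 4) / 2.
mse = mean_squared_error([3.0, 5.0], [2.0, 7.0])

# Classification: the true class is the second of three (one-hot encoded).
confident = cross_entropy([0, 1, 0], [0.1, 0.8, 0.1])
unsure = cross_entropy([0, 1, 0], [0.4, 0.2, 0.4])
print(mse, confident, unsure)
```

The prediction that puts more probability on the true class gets the lower loss, which is the behavior training tries to encourage.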

The Vanishing Gradient Problem and ReLU

One significant challenge when building deep neural networks is the vanishing gradient problem. As gradients are propagated back through many layers, they can become so small that the weights of the early layers barely change, which stops the network from learning effectively.

This issue is particularly prominent with sigmoid or tanh activation functions, whose derivatives become small for inputs of large magnitude. To mitigate this, deep learning has adopted the Rectified Linear Unit (ReLU) activation function. ReLU is defined as ReLU(x) = max(0, x), where x represents the input to a neuron. Its gradient is exactly 1 for positive inputs, which helps maintain a stronger gradient during training and allows deeper networks to learn effectively. This simplicity, together with the ability to introduce nonlinearity without shrinking the gradient of active neurons, makes ReLU a popular choice in deep learning architectures.
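A rough back-of-the-envelope sketch shows why the choice of activation matters. It multiplies the activation derivative across 20 layers and deliberately ignores the weights, which is a simplification of real backpropagation; the depth of 20 is an arbitrary assumption.

```python
import math

def sigmoid_grad(z):
    """Derivative of the sigmoid; its maximum value is 0.25, reached at z = 0."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0

layers = 20
# Best case for the sigmoid: every neuron sits at z = 0.
sigmoid_chain = sigmoid_grad(0.0) ** layers
# Active ReLU neurons pass the gradient through unchanged.
relu_chain = relu_grad(1.0) ** layers
print(sigmoid_chain, relu_chain)
```

Even in its best case the sigmoid shrinks the gradient by a factor of four per layer, while an active ReLU leaves it untouched.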

Specialized Machine Learning Techniques

The course progressed into a variety of more specialized machine learning techniques, each tailored to specific applications and domains.

Natural Language Processing

Natural Language Processing (NLP) involves the application of computational techniques to the analysis and synthesis of natural language and speech. One of the main challenges in using machine learning for NLP is that text data is inherently unstructured and high-dimensional. Text must be converted into a numerical format that machine learning algorithms can process, a task complicated by the nuances of language such as syntax, semantics, and context.

The Bag of Words

The Bag of Words (BoW) model addresses this by transforming text into fixed-length vectors by counting how frequently each word appears in a document, ignoring the order and context of words. This method simplifies text data, making it manageable for basic machine learning models and serving as a foundational technique for text classification tasks, such as spam detection or sentiment analysis. However, the simplicity of the BoW model and its disregard for word order and semantic context limit its effectiveness for more complex language tasks.
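A minimal Bag of Words sketch, assuming lowercase conversion and whitespace tokenization (real pipelines usually also remove punctuation and stop words). The two example documents are made up.

```python
from collections import Counter

def bag_of_words(documents):
    """Turn documents into count vectors over a shared, sorted vocabulary."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One position per vocabulary word; absent words count as 0.
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

docs = ["free prize free money", "meeting about money"]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)
print(vectors)
```

Every document ends up with a vector of the same length, no matter how long the text is, which is what makes it usable as input for the classifiers discussed earlier.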

Reinforcement Learning with UCB and Thompson Sampling

Reinforcement Learning (RL) is a type of machine learning where an agent learns to make decisions by interacting with an environment. It differs from supervised learning in that correct input/output pairs are never presented, nor are sub-optimal actions explicitly corrected. Instead, the agent's strategy evolves by balancing exploration (trying new things) and exploitation (using known information) in its decision-making.

The agent takes actions based on a policy, receives feedback through rewards or punishments, and updates its policy to maximize long-term rewards. Two notable strategies in RL that help manage the exploration-exploitation dilemma are the Upper Confidence Bound (UCB) and Thompson Sampling.

UCB is an algorithm that prioritizes exploration by selecting actions that have either high rewards or have not been tried often. The idea is to balance the known rewards with the potential of finding higher rewards in lesser-tried actions. UCB does this by constructing confidence bounds around the estimates of action rewards and choosing the action with the highest upper confidence bound. This approach systematically reduces uncertainty and improves decision-making over time.
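The idea can be sketched on a simulated multi-armed bandit. I am assuming the common UCB1 form of the confidence bound, sqrt(2 * ln(t) / n), which may differ in its constant from the formula used in the course; the click-through rates of the three hypothetical "ads" are invented.

```python
import math
import random

def ucb_select(counts, reward_sums, round_number):
    """Pick the arm with the highest upper confidence bound (UCB1 variant)."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm  # try every arm once before comparing bounds
    scores = [
        reward_sums[arm] / counts[arm]
        + math.sqrt(2 * math.log(round_number) / counts[arm])
        for arm in range(len(counts))
    ]
    return scores.index(max(scores))

def run_ucb(true_rates, rounds=5000, seed=0):
    """Simulate arms that pay 1 with the given (hidden) probabilities."""
    rng = random.Random(seed)
    counts = [0] * len(true_rates)
    reward_sums = [0.0] * len(true_rates)
    for t in range(1, rounds + 1):
        arm = ucb_select(counts, reward_sums, t)
        reward = 1.0 if rng.random() < true_rates[arm] else 0.0
        counts[arm] += 1
        reward_sums[arm] += reward
    return counts

counts = run_ucb([0.1, 0.3, 0.7])
print(counts)
```

The bonus term shrinks as an arm is tried more often, so rarely tried arms keep getting occasional chances while the best arm collects most of the pulls.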

Thompson Sampling takes a Bayesian approach to the exploration-exploitation problem. It involves sampling from the posterior distributions of the rewards for each action and selecting the action with the highest sample. This method allows for a more probabilistic exploration based on the known performance of actions, dynamically balancing between exploring new actions and exploiting the known ones based on their reward probability distributions.
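For rewards that are either 0 or 1, Thompson Sampling is often implemented with a Beta posterior per action. The sketch below makes that assumption, starts from a uniform Beta(1, 1) prior, and reuses invented success rates for three hypothetical actions.

```python
import random

def run_thompson(true_rates, rounds=5000, seed=0):
    """Thompson Sampling for 0/1 rewards with a Beta posterior per arm."""
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    for _ in range(rounds):
        # Draw one plausible success rate per arm from its posterior.
        samples = [
            rng.betavariate(s + 1, f + 1)
            for s, f in zip(successes, failures)
        ]
        arm = samples.index(max(samples))
        # Observe a simulated reward and update that arm's posterior.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return [s + f for s, f in zip(successes, failures)]

pulls = run_thompson([0.1, 0.3, 0.7])
print(pulls)
```

Arms with little data have wide posteriors and therefore sometimes produce the highest sample, which is how exploration happens without any explicit bonus term.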

Both UCB and Thompson Sampling are powerful techniques in situations where the learning environment is initially unknown to the agent, allowing for systematic exploration and optimized learning based on the feedback received from the environment. These methods are particularly useful in real-time decision-making scenarios like A/B testing or network routing.

Dimensionality Reduction Techniques

Principal Component Analysis (PCA) is a statistical technique used for dimensionality reduction while preserving as much variance as possible. It works by identifying so-called principal components - the directions along which the variance of the data is maximized. It reduces the dimension of the data by transforming the original variables into a new set of orthogonal variables. Because of this orthogonality, the new variables are uncorrelated with each other, and the first few of them account for most of the variance in the data. This is particularly useful in reducing the number of variables in data while maintaining the relationships that contribute most to its variance. By transforming the data into a new set of dimensions with reduced complexity, PCA helps in visualizing high-dimensional data, speeding up learning algorithms, and removing noise.

Linear Discriminant Analysis (LDA), on the other hand, is also a dimensionality reduction technique but focuses more on maximizing the separability among known categories. It tries to model the difference between the classes of data. LDA achieves this by finding a linear combination of features that separates classes. The resulting combination can be used as a linear classifier or for dimensionality reduction before later classification.

PCA and LDA serve different purposes:

  • PCA is unsupervised, focusing on variance in the data.
  • LDA is supervised, focusing on maximizing class separability.

Modern Model Selection and Boosting Techniques

The latter part of the course explores advanced model selection strategies and introduces boosting. Boosting works by combining multiple weak learners into a stronger model in a sequential manner. Each learner in the sequence focuses on the errors made by the previous one, gradually improving the model's accuracy. The learners are usually simple models like decision trees, and each one contributes incrementally to the final decision, making the ensemble stronger than any individual model alone.
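The sequential idea can be sketched with one-split decision trees ("stumps") on a one-dimensional regression problem. This is a simplified form of gradient boosting for squared error, not how XGBoost is implemented; the data, the learning rate, and the number of rounds are assumptions.

```python
def fit_stump(xs, residuals):
    """Weak learner: the single threshold split with the lowest squared error."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [r for x, r in zip(xs, residuals) if x < threshold]
        right = [r for x, r in zip(xs, residuals) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left) + sum(
            (r - right_mean) ** 2 for r in right
        )
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]

def boost(xs, ys, rounds=50, learning_rate=0.3):
    """Each new stump is fitted to the errors left by the previous ones."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        predictions = [
            p + learning_rate * (left_value if x < threshold else right_value)
            for x, p in zip(xs, predictions)
        ]
    return base, stumps, predictions

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [x ** 2 for x in xs]  # hypothetical non-linear target
base, stumps, predictions = boost(xs, ys)
print(predictions)
```

A single stump can only output two values, yet the sum of many small corrections follows the curve much more closely than the constant baseline.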

Extreme Gradient Boosting

One of the most popular implementations of this technique is Extreme Gradient Boosting (XGBoost), which stands out due to its efficiency and effectiveness across a wide range of predictive modeling tasks.

Conclusion

The "Machine Learning A-Z: AI, Python & R" course is a great starting point for anyone interested in machine learning. It covers a lot of important topics and gives a broad overview, but it’s just the beginning.

Finishing this course won’t make you an expert ready for a specialized machine learning job right away. Instead, think of it as a first step. It helps you understand the basics and shows you what parts of machine learning might be most interesting to you.
