<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Pol Monroig Company</title>
    <description>The latest articles on DEV Community by Pol Monroig Company (@polcompany).</description>
    <link>https://dev.to/polcompany</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F331892%2F9c0ec902-e212-4d38-8e0c-9d9f695e7c4e.jpg</url>
      <title>DEV Community: Pol Monroig Company</title>
      <link>https://dev.to/polcompany</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/polcompany"/>
    <language>en</language>
    <item>
      <title>The Perfect Activation</title>
      <dc:creator>Pol Monroig Company</dc:creator>
      <pubDate>Thu, 20 Aug 2020 17:02:03 +0000</pubDate>
      <link>https://dev.to/polcompany/the-perfect-activation-33l6</link>
      <guid>https://dev.to/polcompany/the-perfect-activation-33l6</guid>
      <description>&lt;p&gt;It might be too bold to call an activation function perfect, given that the &lt;strong&gt;No Free Lunch Theorem&lt;/strong&gt; of machine learning states that there is no universally perfect machine learning algorithm. Nevertheless, as misleading as the title can be,  I will try to summarize the most widely used activation functions and describe their main differences. &lt;/p&gt;

&lt;h1&gt;
  
  
  Linear (identity)
&lt;/h1&gt;

&lt;p&gt;The linear activation function is essentially no activation at all. &lt;br&gt;
&lt;strong&gt;Overhead:&lt;/strong&gt; fastest, no computation at all &lt;br&gt;
&lt;strong&gt;Performance:&lt;/strong&gt; bad, since it does not apply a non-linear transformation &lt;br&gt;
&lt;strong&gt;Advantages:&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Differentiable at all points &lt;/li&gt;
&lt;li&gt;Fast execution &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Common issues:&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Does not provide any non-linear output. &lt;/li&gt;
&lt;/ul&gt;
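&lt;p&gt;&lt;em&gt;A minimal from-scratch sketch of the identity activation and its gradient (illustrative only; frameworks expose equivalents, e.g. torch.nn.Identity in PyTorch):&lt;/em&gt;&lt;/p&gt;

```python
def linear(x):
    # f(x) = x: the input passes through unchanged
    return x

def linear_grad(x):
    # The derivative is 1 everywhere, so it is differentiable at all points
    return 1.0

# A stack of purely linear layers collapses into a single linear map,
# which is why this "activation" adds no modelling power.
```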
&lt;h1&gt;
  
  
  Sigmoid
&lt;/h1&gt;

&lt;p&gt;The Sigmoid activation function is one of the oldest ones. Initially designed to mimic the activations of neurons in the brain, it has been shown to perform poorly in the hidden layers of artificial neural networks; nevertheless, it is commonly used as a classifier output to transform raw outputs into class probabilities.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Uses:&lt;/strong&gt; it is commonly used in the output layer of binary classification where we need a probability value between 0 and 1. &lt;br&gt;
&lt;strong&gt;Overhead:&lt;/strong&gt; very expensive because of the exponential term. &lt;br&gt;
&lt;strong&gt;Performance:&lt;/strong&gt; bad on hidden layers, mostly used on output layers &lt;br&gt;
&lt;strong&gt;Advantages:&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Outputs are between 0 and 1, that means that values won't explode. &lt;/li&gt;
&lt;li&gt;It is differentiable at every point. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Common issues:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Outputs are between 0 and 1, that means outputs might saturate. &lt;/li&gt;
&lt;li&gt;Vanishing gradients are possible. &lt;/li&gt;
&lt;li&gt;Outputs are always positive (zero-centered functions help achieve faster convergence). &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Pytorch 
&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Sigmoid&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; 
&lt;span class="c1"&gt;# Tensorflow 
&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;activations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sigmoid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
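&lt;p&gt;&lt;em&gt;For intuition, the formula behind those calls, sketched in plain Python (illustrative only, not a replacement for the framework versions):&lt;/em&gt;&lt;/p&gt;

```python
import math

def sigmoid(x):
    # sigmoid(x) = 1 / (1 + e^(-x)); the exponential is what makes it
    # comparatively expensive to evaluate
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(0.0))   # 0.5
print(sigmoid(10.0))  # ~0.99995: the function saturates, and its
                      # gradient vanishes for large positive or negative x
```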



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F1mt954pdqqsoha16ear1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F1mt954pdqqsoha16ear1.png" alt="Alt Text" width="800" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Softmax
&lt;/h1&gt;

&lt;p&gt;A generalization of the Sigmoid function to more than one class, it transforms the outputs into a distribution of class probabilities. Used in multiclass classification. &lt;br&gt;
&lt;strong&gt;Uses:&lt;/strong&gt; used in the output layer of a multiclass neural network. &lt;br&gt;
&lt;strong&gt;Overhead:&lt;/strong&gt; similar to Sigmoid, with extra cost from normalizing over all the inputs.&lt;br&gt;
&lt;strong&gt;Performance:&lt;/strong&gt; bad on hidden layers, mostly used on output layers &lt;br&gt;
&lt;strong&gt;Advantages:&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Unlike Sigmoid, it ensures that the outputs form a normalized probability distribution: each is between 0 and 1, and they sum to 1&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Common issues:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Same as Sigmoid.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Pytorch 
&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Softmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;...)&lt;/span&gt; 
&lt;span class="c1"&gt;# Tensorflow 
&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;activations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;softmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
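&lt;p&gt;&lt;em&gt;The normalization behind those calls, sketched in plain Python (the max-subtraction is the standard numerical-stability trick):&lt;/em&gt;&lt;/p&gt;

```python
import math

def softmax(logits):
    # Subtract the max logit for numerical stability, exponentiate,
    # then normalize so the outputs sum to 1
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # a valid probability distribution:
                                  # each value in (0, 1), summing to 1
```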



&lt;h1&gt;
  
  
  Hyperbolic Tangent
&lt;/h1&gt;

&lt;p&gt;The Tanh function has the same shape as Sigmoid; in fact it is a scaled and shifted version of it (tanh(x) = 2 * sigmoid(2x) - 1), and it works better in most cases. &lt;br&gt;
&lt;strong&gt;Uses:&lt;/strong&gt;  generally used in hidden layers as it outputs between -1 and 1, thus creating normalized outputs, making learning faster. &lt;br&gt;
&lt;strong&gt;Overhead:&lt;/strong&gt; very expensive, since it uses an exponential term. &lt;br&gt;
&lt;strong&gt;Performance:&lt;/strong&gt; similar to Sigmoid but with some added benefits&lt;br&gt;
&lt;strong&gt;Advantages:&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Outputs are between -1 and 1, that means that values won't explode. &lt;/li&gt;
&lt;li&gt;It is differentiable at every point.
&lt;/li&gt;
&lt;li&gt;It is zero-centered, unlike Sigmoid. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Common issues:&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vanishing gradients. &lt;/li&gt;
&lt;li&gt;Gradient saturation. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Pytorch 
&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Tanh&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; 
&lt;span class="c1"&gt;# Tensorflow 
&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;activations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tanh&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
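&lt;p&gt;&lt;em&gt;A quick sketch of the relationship to Sigmoid mentioned above:&lt;/em&gt;&lt;/p&gt;

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# tanh is a rescaled, recentered sigmoid: tanh(x) = 2*sigmoid(2x) - 1,
# which moves the output range from (0, 1) to the zero-centered (-1, 1)
x = 0.7
assert 1e-12 > abs(math.tanh(x) - (2.0 * sigmoid(2.0 * x) - 1.0))
```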



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fuz2khss64owahi5vkohr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fuz2khss64owahi5vkohr.png" alt="Alt Text" width="800" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  ReLU
&lt;/h1&gt;

&lt;p&gt;ReLU, also called the rectified linear unit, is one of the most commonly used activations, both for its computational efficiency and its great performance. Multiple variations have been created to improve its flaws. &lt;br&gt;
&lt;strong&gt;Uses:&lt;/strong&gt; typically used in hidden layers, as it provides better performance than Tanh and Sigmoid and is computationally faster. &lt;br&gt;
&lt;strong&gt;Overhead:&lt;/strong&gt; Almost none, extremely fast. &lt;br&gt;
&lt;strong&gt;Performance:&lt;/strong&gt; great performance, recommended for most cases. &lt;br&gt;
&lt;strong&gt;Advantages:&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Adds non-linearity to the network. &lt;/li&gt;
&lt;li&gt;Does not suffer from vanishing gradients for positive inputs. &lt;/li&gt;
&lt;li&gt;Does not saturate for positive inputs. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Common issues:&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It suffers from dying ReLU&lt;/li&gt;
&lt;li&gt;Not differentiable at x = 0 &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Pytorch 
&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ReLU&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; 
&lt;span class="c1"&gt;# Tensorflow 
&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;activations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;relu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
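&lt;p&gt;&lt;em&gt;The function and its gradient, sketched in plain Python, which also makes the two issues above concrete:&lt;/em&gt;&lt;/p&gt;

```python
def relu(x):
    # f(x) = max(0, x): identity for positive inputs, zero otherwise
    return max(0.0, x)

def relu_grad(x):
    # 1 for positive inputs, 0 for negative ones; at exactly 0 the function
    # is not differentiable, so implementations simply pick a convention
    return 1.0 if x > 0 else 0.0

# "Dying ReLU": a unit whose inputs stay negative gets zero gradient
# and can stop learning entirely.
```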



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fpnwwr2cs5ftohlpfua94.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fpnwwr2cs5ftohlpfua94.png" alt="Alt Text" width="800" height="443"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Leaky ReLU
&lt;/h1&gt;

&lt;p&gt;ReLU suffers from the dying ReLU problem, where negative inputs are clamped to 0 and the corresponding neurons can stop learning. Leaky ReLU diminishes the problem by replacing the 0 output with a very small negative slope. &lt;br&gt;
&lt;strong&gt;Uses:&lt;/strong&gt; used in hidden layers. &lt;br&gt;
&lt;strong&gt;Overhead:&lt;/strong&gt; same as ReLU&lt;br&gt;
&lt;strong&gt;Performance:&lt;/strong&gt; great performance if the hyperparameter is chosen correctly &lt;br&gt;
&lt;strong&gt;Advantages:&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Similar to ReLU and fixes dying ReLU. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Common issues:&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;New hyperparameter to tune. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Pytorch 
&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;LeakyReLU&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;negative_slope&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;...)&lt;/span&gt; 
&lt;span class="c1"&gt;# Tensorflow 
&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;LeakyReLU&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;...)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
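&lt;p&gt;&lt;em&gt;The definition behind those calls, sketched in plain Python:&lt;/em&gt;&lt;/p&gt;

```python
def leaky_relu(x, negative_slope=0.01):
    # Negative inputs are scaled by a small slope instead of being zeroed,
    # so the gradient never fully dies; the slope is a hyperparameter
    # (0.01 is a common default)
    return x if x > 0 else negative_slope * x
```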



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fnsb3l2b2pyntx7k1pcv8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fnsb3l2b2pyntx7k1pcv8.png" alt="Alt Text" width="800" height="443"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Parametric ReLU
&lt;/h1&gt;

&lt;p&gt;Takes the same idea as Leaky ReLU, but instead of predefining the leak hyperparameter, it is added as a parameter that is learned during training. &lt;br&gt;
&lt;strong&gt;Uses:&lt;/strong&gt; used in hidden layers. &lt;br&gt;
&lt;strong&gt;Overhead:&lt;/strong&gt; a new parameter must be learned for each PReLU in the network. &lt;br&gt;
&lt;strong&gt;Performance:&lt;/strong&gt; similar to Leaky ReLU, without the need to hand-tune the slope &lt;br&gt;
&lt;strong&gt;Advantages:&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Removes the need to tune the slope hyperparameter by hand &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Common issues:&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The learned parameter is not guaranteed to be optimal, and it increases the overhead, so you might as well try a few slopes yourself with Leaky ReLU. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Pytorch 
&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;PReLU&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; 
&lt;span class="c1"&gt;# Tensorflow 
&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;PReLU&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fnsb3l2b2pyntx7k1pcv8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fnsb3l2b2pyntx7k1pcv8.png" alt="Alt Text" width="800" height="443"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  ELU
&lt;/h1&gt;

&lt;p&gt;The ELU was introduced as another alternative to fix the issues that you can encounter with ReLU. &lt;br&gt;
&lt;strong&gt;Uses:&lt;/strong&gt; used in hidden layers&lt;br&gt;
&lt;strong&gt;Overhead:&lt;/strong&gt; computationally expensive, since it uses an exponential term&lt;br&gt;
&lt;strong&gt;Performance:&lt;/strong&gt; often comparable to or better than ReLU, at a higher computational cost &lt;br&gt;
&lt;strong&gt;Advantages:&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Similar to ReLU. &lt;/li&gt;
&lt;li&gt;Produces negative outputs. &lt;/li&gt;
&lt;li&gt;Bends smoothly, unlike Leaky ReLU. &lt;/li&gt;
&lt;li&gt;Differentiable at x = 0 &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Common issues:&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Additional hyperparameter &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Pytorch 
&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ELU&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; 
&lt;span class="c1"&gt;# Tensorflow 
&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;activations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;elu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fgc9skcsszpl77o5oxqhm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fgc9skcsszpl77o5oxqhm.png" alt="Alt Text" width="800" height="443"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Other alternatives
&lt;/h1&gt;

&lt;p&gt;There are too many activation functions to cover them all in a single post. Here are some others: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SeLU &lt;/li&gt;
&lt;li&gt;GeLU &lt;/li&gt;
&lt;li&gt;CeLU &lt;/li&gt;
&lt;li&gt;Swish &lt;/li&gt;
&lt;li&gt;Mish &lt;/li&gt;
&lt;li&gt;Softplus &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Note: if a function's name ends with LU, it is usually a ReLU variant.&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Summary
&lt;/h1&gt;

&lt;p&gt;So... having so many choices, which activation should we use? As a &lt;strong&gt;rule of thumb&lt;/strong&gt; you should always try ReLU first in the hidden layers, as it has great performance with minimal computational overhead. After that (if you have enough computing power) you might want to try some of the more complex variations of ReLU or similar alternatives. I would never recommend using Sigmoid, Tanh or Softmax for any hidden layer. Sigmoid and Softmax should be used whenever we want probability outputs for a classification task. Finally, with the current progress and research in deep learning and AI, new and better functions will surely appear, so keep an eye out. &lt;/p&gt;

&lt;p&gt;Remember to &lt;strong&gt;always try and experiment&lt;/strong&gt;; you never know which function will work better for a specific task.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>math</category>
    </item>
    <item>
      <title>Making algorithms work together</title>
      <dc:creator>Pol Monroig Company</dc:creator>
      <pubDate>Thu, 13 Aug 2020 07:40:30 +0000</pubDate>
      <link>https://dev.to/polcompany/making-algorithms-work-together-2m97</link>
      <guid>https://dev.to/polcompany/making-algorithms-work-together-2m97</guid>
      <description>&lt;p&gt;In every machine learning application you may have found that every machine learning algorithm you try has its flaws. As the no free lunch theorem states &lt;em&gt;there is no single learning algorithm in any domain always induces the most accurate learner&lt;/em&gt;. Learning is ill-posed since we do not have all the data and each algorithm converges for a specific subset of the original data. What if there was a way to combine the knowledge of different algorithms?(Spoiler: there is!)&lt;/p&gt;

&lt;h1&gt;
  
  
  Combining base-learners
&lt;/h1&gt;

&lt;p&gt;The idea is to have multiple base-learners, each an expert in a subset of the dataset and accurate on it. This way the learners complement one another. There are two aspects we must consider: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We must choose base-learners that &lt;strong&gt;complement&lt;/strong&gt; each other (otherwise we would have unnecessary redundancy). Put differently, their errors must &lt;strong&gt;not be correlated&lt;/strong&gt;. &lt;/li&gt;
&lt;li&gt;Each learner must aim for the highest accuracy, and we must find a way to &lt;strong&gt;combine the outputs&lt;/strong&gt; of the learners to achieve it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fype32gssjine5bwok0v0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fype32gssjine5bwok0v0.png" alt="Alt Text" width="800" height="642"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Diversifying learners
&lt;/h1&gt;

&lt;p&gt;There are multiple ways in which we can diversify each learning algorithm. For instance, we can use &lt;strong&gt;different algorithms&lt;/strong&gt;, we can also try similar algorithms but with &lt;strong&gt;different hyperparameters&lt;/strong&gt; or each algorithm could be trained with a &lt;strong&gt;different data subset&lt;/strong&gt;. &lt;/p&gt;

&lt;h3&gt;
  
  
  Voting
&lt;/h3&gt;

&lt;p&gt;Voting is the simplest technique: the outputs of multiple classifiers are combined using a simple &lt;strong&gt;combination rule&lt;/strong&gt; (e.g. sum, product, median, weighted sum, maximum, minimum). The main idea behind this method is that we create a voting scheme over high-variance, low-bias models, so the bias remains small while the variance is reduced. An analogy: when we have multiple noisy samples of the same data, combining them makes it much easier to sort out the noise.&lt;/p&gt;
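&lt;p&gt;&lt;em&gt;A toy majority-vote sketch in plain Python (real projects would typically reach for a library implementation such as scikit-learn's VotingClassifier):&lt;/em&gt;&lt;/p&gt;

```python
from collections import Counter

def majority_vote(predictions):
    # One predicted label per base-learner for the same sample;
    # the most common label wins
    return Counter(predictions).most_common(1)[0][0]

# Three noisy learners: the single mistake is outvoted
combined = majority_vote(["A", "B", "A"])  # "A"
```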

&lt;h3&gt;
  
  
  Bagging
&lt;/h3&gt;

&lt;p&gt;This is a variant of the voting method. The difference is that each base-learner is trained on a different training set. To generate each set we draw instances randomly with replacement, so some instances may be repeated while others are not used at all. &lt;/p&gt;
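&lt;p&gt;&lt;em&gt;The sampling step, sketched in plain Python:&lt;/em&gt;&lt;/p&gt;

```python
import random

def bootstrap_sample(dataset, rng):
    # Draw len(dataset) instances with replacement: some instances appear
    # several times, others not at all (on average ~37% are left out)
    return [rng.choice(dataset) for _ in dataset]

data = list(range(10))
sample = bootstrap_sample(data, random.Random(0))  # one base-learner's set
```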

&lt;h3&gt;
  
  
  Boosting
&lt;/h3&gt;

&lt;p&gt;This type of ensemble works by using simple learners that actively learn from the mistakes of the previous learners; the most common and effective boosting algorithm is &lt;strong&gt;AdaBoost&lt;/strong&gt;. For AdaBoost to work, the learners must be weak, otherwise we run a high risk of overfitting. AdaBoost combines the votes using weights proportional to the accuracy of each base-learner on the training set. &lt;/p&gt;
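&lt;p&gt;&lt;em&gt;A sketch of AdaBoost's weighted vote only (the reweighting of training instances is omitted); the alpha formula is the standard one, and labels are assumed to be -1/+1:&lt;/em&gt;&lt;/p&gt;

```python
import math

def adaboost_vote(learner_outputs, accuracies):
    # Each weak learner votes -1 or +1; its vote is weighted by
    # alpha = 0.5 * ln(acc / (1 - acc)), so more accurate learners
    # count more (assumes 0 < acc < 1)
    score = 0.0
    for y, acc in zip(learner_outputs, accuracies):
        alpha = 0.5 * math.log(acc / (1.0 - acc))
        score += alpha * y
    return 1 if score >= 0 else -1
```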

&lt;h3&gt;
  
  
  Stacking
&lt;/h3&gt;

&lt;p&gt;Stacking is a mechanism in which multiple base-learners are trained on the training dataset. After that, a new training set is created from the predictions of each base-learner. Finally, another learner (called the mixer) learns from the predictions of the other algorithms. &lt;/p&gt;
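&lt;p&gt;&lt;em&gt;The two stages, sketched in plain Python (the base-learners below are hypothetical threshold rules, for illustration only):&lt;/em&gt;&lt;/p&gt;

```python
# Stage 1: each base-learner predicts on its own; stage 2: a mixer learns
# from those predictions instead of the raw data.
def base_predictions(learners, sample):
    return [learner(sample) for learner in learners]

# Hypothetical base-learners (simple threshold rules)
learners = [lambda x: x > 0, lambda x: x > 1, lambda x: x > -1]

# The mixer's training set: one row of base predictions per original sample
meta_features = [base_predictions(learners, x) for x in (-2.0, 0.5, 3.0)]
```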

&lt;h3&gt;
  
  
  Cascading
&lt;/h3&gt;

&lt;p&gt;This is based on a sequence of classifiers in which each learner is more complex than the previous one, and thus generalizes less. We first make a prediction with a simple learner, which tries to generalize as well as it can; if its confidence is not sufficient, we continue with the next learner, and so on until the prediction confidence of some learner is high enough. A common approach is to use a weak algorithm such as linear regression as the first learner and non-parametric approaches for subsequent learners. The idea behind this method is that most samples can be explained with simple rules, and the few exceptions are explained with more complex learners. &lt;/p&gt;
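&lt;p&gt;&lt;em&gt;A minimal cascade sketch in plain Python; the two learners below are hypothetical and each returns a (label, confidence) pair:&lt;/em&gt;&lt;/p&gt;

```python
def cascade_predict(learners, sample, threshold=0.9):
    # learners are ordered from simple to complex; stop at the first
    # learner that is confident enough about its prediction
    for learner in learners:
        label, confidence = learner(sample)
        if confidence >= threshold:
            return label
    return label  # fall back to the last (most complex) learner's answer

# Hypothetical two-stage cascade: a cheap rule, then an "expensive" model
learners = [
    lambda x: ("spam", 0.95) if x > 10 else ("ham", 0.6),
    lambda x: ("spam" if x > 3 else "ham", 0.99),
]
```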

&lt;p&gt;&lt;em&gt;Note: Combining learners is not always the best approach. It increases training time and adds execution overhead, so make sure it is worth the effort.&lt;/em&gt; &lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>algorithms</category>
    </item>
    <item>
      <title>Decision trees uncovered</title>
      <dc:creator>Pol Monroig Company</dc:creator>
      <pubDate>Thu, 06 Aug 2020 09:14:04 +0000</pubDate>
      <link>https://dev.to/polcompany/decision-trees-uncovered-5907</link>
      <guid>https://dev.to/polcompany/decision-trees-uncovered-5907</guid>
      <description>&lt;p&gt;If you are a computer scientist I am sure you agree with me when I say that trees are everywhere. And I mean everywhere! It is extremely common to use trees as a basic data structure to improve and define new algorithms in all sorts of domains. Machine learning is no different; decision trees are one of the most used nonparametric methods, it can be used for both classification and regression.&lt;/p&gt;

&lt;p&gt;Decision trees are hierarchical models that work by splitting the input space into smaller regions. A tree is composed of &lt;strong&gt;internal decision nodes&lt;/strong&gt; and &lt;strong&gt;terminal leaves&lt;/strong&gt;. Internal decision nodes implement a &lt;strong&gt;test function&lt;/strong&gt;: given a set of input variables (the most common approach is to use &lt;strong&gt;univariate trees&lt;/strong&gt;, that is, trees that test only one variable at a given node), it returns a discrete output indicating which child node to visit next. Terminal nodes correspond to predictions; in classification the output might be the corresponding class, and in regression a specific numerical value. A great advantage of decision trees is that they can work with categorical values directly. &lt;/p&gt;

&lt;p&gt;For example, in the following tree, we might want to classify patients that required treatment versus patients that did not. Each node makes a decision based on a simple rule, and at each terminal node we have the final prediction. The Gini index is a measure of how impure the node is. If the impurity is equal to 0.0, we cannot split any further because we have reached &lt;strong&gt;maximum purity&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F5safv416ocbxb77a5d68.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F5safv416ocbxb77a5d68.png" alt="Alt Text" width="800" height="529"&gt;&lt;/a&gt;&lt;/p&gt;
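&lt;p&gt;&lt;em&gt;The Gini index in the figure can be computed from the class counts at a node (a small sketch):&lt;/em&gt;&lt;/p&gt;

```python
def gini(class_counts):
    # Gini impurity: 1 - sum of squared class proportions
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

print(gini([50, 50]))   # 0.5: maximally impure two-class node
print(gini([100, 0]))   # 0.0: a pure node, nothing left to split
```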

&lt;p&gt;One of the perks of decision trees, compared to other machine learning algorithms, is that they are extremely easy to understand and have a high degree of interpretability. Just by reading the tree, you can make decisions yourself. On the other hand, decision trees are very sensitive to small variations in the training data, so it is usually recommended to apply an ensemble method. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: In fact, a decision tree can be transformed into a series of rules that can then be used in a rule-based language such as Prolog.&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Dimensionality reduction
&lt;/h1&gt;

&lt;p&gt;The job of classification and regression trees (&lt;strong&gt;CART&lt;/strong&gt;) is to predict an output based on the possible variables that the input might have; &lt;strong&gt;higher nodes&lt;/strong&gt; tend to split on more important features and &lt;strong&gt;lower nodes&lt;/strong&gt; on less important ones. That is why decision trees are commonly used as a dimensionality reduction technique: by running the CART algorithm you get the importance of each feature for free!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fig3jqq7bb9219t09qk1y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fig3jqq7bb9219t09qk1y.png" alt="Alt Text" width="593" height="411"&gt;&lt;/a&gt;&lt;/p&gt;
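&lt;p&gt;&lt;em&gt;A sketch of how such importances are typically derived: each feature accumulates the impurity decrease of the nodes that split on it, weighted by node size (the splits below are hypothetical):&lt;/em&gt;&lt;/p&gt;

```python
def feature_importances(splits):
    # splits: (feature_name, impurity_decrease, n_samples) per internal
    # node; importance = total weighted impurity decrease per feature,
    # normalized so the importances sum to 1
    totals = {}
    for feature, decrease, n in splits:
        totals[feature] = totals.get(feature, 0.0) + decrease * n
    norm = sum(totals.values())
    return {feature: value / norm for feature, value in totals.items()}

# Hypothetical tree: "age" splits near the root, "weight" further down
importances = feature_importances([("age", 0.3, 100), ("weight", 0.1, 40)])
```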

&lt;h1&gt;
  
  
  Error measures
&lt;/h1&gt;

&lt;p&gt;As with any machine learning model, we must make sure to use a correct error function. It has been shown that any of the following error functions tends to perform well: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MSE&lt;/strong&gt; (regression): one of the most common error functions in machine learning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Entropy&lt;/strong&gt; (classification): entropy works by measuring the number of bits needed to encode a class code, based on its probability of occurrence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gini index&lt;/strong&gt; (classification): Slightly faster impurity measure than entropy, it tends to isolate the most frequent class in its own branch of the tree, while entropy produces more balanced branches.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Note: Error functions on classification trees are also called impurity measures.&lt;/em&gt; &lt;/p&gt;
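&lt;p&gt;&lt;em&gt;The entropy measure described above, sketched in plain Python:&lt;/em&gt;&lt;/p&gt;

```python
import math

def entropy(class_counts):
    # Bits needed to encode a class given its probability of occurrence:
    # -sum(p * log2(p)) over the non-empty classes
    total = sum(class_counts)
    probs = [count / total for count in class_counts if count]
    return sum(-p * math.log2(p) for p in probs)

print(entropy([50, 50]))   # 1.0: one full bit of uncertainty
print(entropy([100, 0]))   # 0.0: a pure node has no uncertainty
```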

&lt;h1&gt;
  
  
  Boosting trees
&lt;/h1&gt;

&lt;p&gt;Decision trees are very good estimators, but sometimes they can perform poorly. Fortunately, there are many ensemble methods to boost their performance.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Random forests&lt;/strong&gt;: bagging/pasting method that works by training multiple decision trees, each with a subset of the dataset. Finally, each tree makes a prediction and they are all aggregated into a single final prediction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AdaBoost&lt;/strong&gt;: a first base classifier is trained and used to make predictions. Then a second classifier is trained with extra focus on the instances the first one misclassified. This continues until there are no more classifiers to train.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stacking&lt;/strong&gt;: works by creating a voting mechanism between different classifiers and training a blending classifier on the predictions of the other classifiers, instead of on the data directly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;The following image represents a stacking ensemble:&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fw9zysh39d8gckg2bfbif.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fw9zysh39d8gckg2bfbif.png" alt="Alt Text" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>decisiontrees</category>
      <category>algorithms</category>
    </item>
    <item>
      <title>When accuracy is not enough...</title>
      <dc:creator>Pol Monroig Company</dc:creator>
      <pubDate>Wed, 29 Jul 2020 09:35:49 +0000</pubDate>
      <link>https://dev.to/polcompany/when-accuracy-is-not-enough-2hej</link>
      <guid>https://dev.to/polcompany/when-accuracy-is-not-enough-2hej</guid>
<description>&lt;p&gt;The task of classification existed long before the invention of machine learning. A recurring problem when working with any learning algorithm is choosing an &lt;strong&gt;error function&lt;/strong&gt; that determines whether the algorithm is good enough, and classification algorithms are no different.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Folg2547hcyx52meiff0l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Folg2547hcyx52meiff0l.png" alt="Alt Text" width="636" height="282"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One of the most used metrics applied in these algorithms is the &lt;strong&gt;accuracy metric&lt;/strong&gt;; based on the total number of samples and the predictions made, we return the &lt;strong&gt;percentage of samples&lt;/strong&gt; that were correctly classified. But this metric does not always work so well; imagine that we have a total of 1000 samples, and an algorithm called &lt;em&gt;DummyAlgorithm&lt;/em&gt; that tries to classify them into two different classes (A and B). Unfortunately, DummyAlgorithm does not know anything about the data distribution; as a result, it always tells us that a given sample is of type A. Now imagine that all the samples are of class A (you might see where I'm going). In this case, it is easy to see that even though DummyAlgorithm has a 100% accuracy rate, it is not a very good algorithm.&lt;/p&gt;
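&lt;p&gt;&lt;em&gt;A minimal sketch of the DummyAlgorithm scenario above (plain Python, not a real classifier):&lt;/em&gt;&lt;/p&gt;

```python
def accuracy(predictions, labels):
    """Fraction of samples whose prediction matches the true label."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

labels = ["A"] * 1000            # every sample really is of class A
predictions = ["A"] * 1000       # DummyAlgorithm always answers "A"
print(accuracy(predictions, labels))  # 1.0 -- a perfect score for a useless model
```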

&lt;p&gt;In this post, we'll learn how to complement the accuracy metric with other machine learning tools that do take the problem described above into account, and we'll see some methods to avoid it.&lt;/p&gt;

&lt;h1&gt;
  
  
  Definitions
&lt;/h1&gt;

&lt;p&gt;Before going any further, let's define some basic concepts. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Accuracy:&lt;/strong&gt; metric that returns the percentage of correctly classified samples in a dataset &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;True Positives:&lt;/strong&gt; samples that were correctly classified with their respective positive class &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;True Negatives:&lt;/strong&gt; samples that  were correctly classified with their respective negative class&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;False Positives:&lt;/strong&gt; samples that were classified as positives but were negatives &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;False Negatives:&lt;/strong&gt; samples that were classified as negatives but were positives &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Precision:&lt;/strong&gt; accuracy of the positive predictions (TP / (TP + FP))&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recall:&lt;/strong&gt; ratio of positive instances that are correctly classified (TP / (TP + FN))&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: when we talk about positives/negatives, we are talking about a specific class&lt;/em&gt;&lt;/p&gt;
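&lt;p&gt;&lt;em&gt;The definitions above can be sketched directly in plain Python (toy data, with "red" as a hypothetical positive class):&lt;/em&gt;&lt;/p&gt;

```python
def precision_recall(predictions, labels, positive="red"):
    """Compute precision = TP / (TP + FP) and recall = TP / (TP + FN)."""
    pairs = list(zip(predictions, labels))
    tp = sum(p == positive and y == positive for p, y in pairs)
    fp = sum(p == positive and y != positive for p, y in pairs)
    fn = sum(p != positive and y == positive for p, y in pairs)
    return tp / (tp + fp), tp / (tp + fn)

# One correct positive, one missed positive, one false positive
print(precision_recall(["red", "blue", "red"], ["red", "red", "blue"]))  # (0.5, 0.5)
```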

&lt;h1&gt;
  
  
  Confusion Matrix
&lt;/h1&gt;

&lt;p&gt;The confusion matrix splits predictions into each of the four possible categories. It can also be generalized to multiclass classification. In the following example we are making a binary classification that separates red dots from dots of other colors. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fh7seoqf0t45tjtw12hmz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fh7seoqf0t45tjtw12hmz.png" alt="Alt Text" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;
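&lt;p&gt;&lt;em&gt;A minimal sketch of how the four counts are collected into a 2x2 confusion matrix (plain Python, hypothetical labels):&lt;/em&gt;&lt;/p&gt;

```python
def confusion_matrix(predictions, labels, positive="red"):
    """Return [[TN, FP], [FN, TP]] for a binary problem."""
    tp = fp = fn = tn = 0
    for p, y in zip(predictions, labels):
        if y == positive:
            tp += p == positive   # correctly caught positive
            fn += p != positive   # missed positive
        else:
            fp += p == positive   # false alarm
            tn += p != positive   # correctly rejected negative
    return [[tn, fp], [fn, tp]]

print(confusion_matrix(["red", "red", "blue", "blue"], ["red", "red", "red", "blue"]))
```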

&lt;h1&gt;
  
  
  Precision vs recall tradeoff
&lt;/h1&gt;

&lt;p&gt;As with other metrics, the classifier has to decide whether it favors better precision or better recall. &lt;em&gt;Sometimes you care more about precision than you care about recall&lt;/em&gt;. For example, if you wish to detect safe-for-work posts in a social network, you would probably prefer a classifier that rejects many good posts (low recall) but keeps only safe ones (high precision). On the other hand, suppose you train a classifier to detect shoplifters; it is probably better for the classifier to have the highest possible recall (the security system will get some false alerts, but almost all shoplifters will get caught). &lt;/p&gt;

&lt;p&gt;Based on this tradeoff we can define a curve called the &lt;strong&gt;precision/recall curve&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ft16pwipspe97wd6jei4t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ft16pwipspe97wd6jei4t.png" alt="Alt Text" width="800" height="385"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  ROC curve
&lt;/h1&gt;

&lt;p&gt;The ROC curve (receiver operating characteristic curve) is a very common tool used with binary classifiers. It is very similar to the precision/recall curve, but it plots the &lt;strong&gt;true positive rate&lt;/strong&gt; against the &lt;strong&gt;false positive rate&lt;/strong&gt;. One way to compare classifiers is to measure the &lt;strong&gt;area under the curve&lt;/strong&gt; (AUC). A perfect classifier will have an AUC equal to 1. A purely random classifier will have a ROC AUC equal to 0.5. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fbo9nz5shg2bs9nttsogk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fbo9nz5shg2bs9nttsogk.png" alt="Alt Text" width="800" height="524"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As the ROC curve and the precision/recall curve are very similar, it might be difficult to choose between them. A common approach is to use the precision/recall curve whenever the positive class is rare or when you care more about the false positives than the false negatives, and the ROC curve otherwise. &lt;/p&gt;
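&lt;p&gt;&lt;em&gt;As a sketch, the ROC AUC can be computed without plotting anything: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one (ties count as half):&lt;/em&gt;&lt;/p&gt;

```python
def roc_auc(scores, labels):
    """Rank-based AUC: fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(roc_auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))  # 1.0: a perfect classifier
print(roc_auc([0.9, 0.1, 0.8, 0.2], [1, 0, 0, 1]))  # 0.75: one pair is misranked
```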

&lt;h1&gt;
  
  
  Solutions
&lt;/h1&gt;

&lt;p&gt;The accuracy problem essentially happens when the data the model is trained and evaluated on is imbalanced. There are several approaches to solve this issue. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If you have a lot of training data, you can discard some of it to create a more balanced dataset (undersampling). Since your model might generalize worse with less data, this approach should be reserved for special cases.&lt;/li&gt;
&lt;li&gt;Use a data augmentation technique to increase the data available.&lt;/li&gt;
&lt;li&gt;Use a resampling technique in which you enlarge the training data by repeating existing samples of the minority class (oversampling), useful if the data augmentation approach is too complicated.&lt;/li&gt;
&lt;/ul&gt;
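&lt;p&gt;&lt;em&gt;A minimal sketch of the resampling idea (random oversampling of the minority classes, plain Python):&lt;/em&gt;&lt;/p&gt;

```python
import random

def oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until every class
    is as large as the biggest one."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        balanced = xs + [rng.choice(xs) for _ in range(target - len(xs))]
        out_x.extend(balanced)
        out_y.extend([y] * len(balanced))
    return out_x, out_y

# Class "B" has one sample; after resampling both classes have three
xs, ys = oversample([1, 2, 3, 4], ["A", "A", "A", "B"])
print(ys.count("A"), ys.count("B"))  # 3 3
```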

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>metrics</category>
    </item>
    <item>
      <title>Optimal neural networks</title>
      <dc:creator>Pol Monroig Company</dc:creator>
      <pubDate>Fri, 24 Jul 2020 10:42:59 +0000</pubDate>
      <link>https://dev.to/polcompany/optimal-neural-networks-2l93</link>
      <guid>https://dev.to/polcompany/optimal-neural-networks-2l93</guid>
<description>&lt;p&gt;Like everything in this world, finding the right path to a high-end goal can become tedious if you don't have the right tools. Each objective and environment has different requirements and must be treated differently. Traveling is a good example: taking a car to the grocery store might be the fastest and most comfortable way to get there, but if we want to travel abroad it is probably a better idea to get on an airplane (unless you are one of those who love driving for hours). &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F7emue40akmlpyeqe2c52.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F7emue40akmlpyeqe2c52.jpg" alt="Alt Text" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But we are not here to talk about the different types of transportation; we are here to talk about how to improve the training of your neural networks and how to choose the best optimizer based on the memory it uses, its complexity, and its speed. &lt;/p&gt;

&lt;h1&gt;
  
  
  Different optimizers
&lt;/h1&gt;

&lt;p&gt;Training a deep neural network can be very slow, but there are multiple ways to improve the speed of convergence. By improving the learning rule of the optimizer we can make the network learn faster (at some computational and memory cost). &lt;/p&gt;

&lt;h3&gt;
  
  
  Simple optimizer SGD
&lt;/h3&gt;

&lt;p&gt;The simplest optimizer out there is the Stochastic Gradient Descent (SGD) optimizer, which works by computing the gradient of the error through backpropagation and updating the corresponding weights, scaled by the learning rate. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Speed&lt;/strong&gt;: because it is the most basic implementation it is the fastest &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory&lt;/strong&gt;: it is also the one that uses the least memory, since it only needs to save the gradient of each weight for backpropagation. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance&lt;/strong&gt;: it has a very slow convergence but generalizes better than most methods. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Usage&lt;/strong&gt;: this optimizer can be used in PyTorch by providing the model's parameters (weights) and the learning rate; the rest of the parameters are optional.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;optim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SGD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="o"&gt;=&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;required&lt;/span&gt; &lt;span class="n"&gt;parameter&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;momentum&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dampening&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weight_decay&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nesterov&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
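&lt;p&gt;&lt;em&gt;Under the hood, the update rule is simple; here is a rough pure-Python sketch of one SGD step (not PyTorch's actual implementation):&lt;/em&gt;&lt;/p&gt;

```python
def sgd_step(weights, gradients, lr=0.1):
    """Vanilla SGD: move each weight against its gradient, scaled by the learning rate."""
    return [w - lr * g for w, g in zip(weights, gradients)]

print(sgd_step([1.0, 2.0], [0.5, -1.0]))  # [0.95, 2.1]
```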



&lt;h3&gt;
  
  
  Momentum optimization
&lt;/h3&gt;

&lt;p&gt;Momentum optimization is a variant of SGD that incorporates the previous update into the current one, as if the updates carried &lt;strong&gt;momentum&lt;/strong&gt;. This momentum has a smoothing effect on training. The momentum value is usually between 0.5 and 1.0. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Speed&lt;/strong&gt;: very fast since it only has an additional multiplication. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory&lt;/strong&gt;: this optimization requires a memory increase, since it needs to store the previous update for each weight. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance&lt;/strong&gt;: very useful since it provides an averaging and smooth effect in the trajectory during convergence. It promotes a faster convergence and helps roll past local optima.  It almost always goes faster than SGD. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Usage&lt;/strong&gt;: to activate the momentum, you need to specify its value through the momentum parameter.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;optim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SGD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="o"&gt;=&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;required&lt;/span&gt; &lt;span class="n"&gt;parameter&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;momentum&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dampening&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weight_decay&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nesterov&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
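&lt;p&gt;&lt;em&gt;The momentum rule itself can be sketched in a few lines of pure Python (a simplification of what the optimizer does internally):&lt;/em&gt;&lt;/p&gt;

```python
def momentum_step(weights, gradients, velocity, lr=0.1, beta=0.9):
    """Fold the previous update (velocity) into the current one before applying it."""
    velocity = [beta * v - lr * g for v, g in zip(velocity, gradients)]
    weights = [w + v for w, v in zip(weights, velocity)]
    return weights, velocity

w, v = momentum_step([1.0], [1.0], [0.0])   # first step behaves like plain SGD
w, v = momentum_step(w, [1.0], v)           # second step is larger: momentum builds up
print(w)
```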



&lt;h3&gt;
  
  
  Nesterov accelerated gradient
&lt;/h3&gt;

&lt;p&gt;A variant of momentum optimization was proposed in which, instead of measuring the gradient at the current position, we measure it slightly ahead, in the direction of the momentum. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Speed&lt;/strong&gt;: an additional sum must be done to apply the momentum to the parameter. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory&lt;/strong&gt;: no extra memory is used in this case. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance&lt;/strong&gt;: it usually works better than simple momentum since the momentum vector points towards the optimum. In general, it converges faster than the original momentum since we are promoting the movement towards a specific direction. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Usage&lt;/strong&gt;: to apply the use of Nesterov we must set the Nesterov flag to true and add some momentum to the optimizer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;optim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SGD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="o"&gt;=&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;required&lt;/span&gt; &lt;span class="n"&gt;parameter&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;momentum&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dampening&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weight_decay&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nesterov&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Adagrad
&lt;/h3&gt;

&lt;p&gt;Adagrad stands for &lt;strong&gt;adaptive gradient&lt;/strong&gt;, and it works by adapting the learning rate of each parameter depending on where we are in the error landscape: it scales the learning rate down faster along steep dimensions, so the updates point more directly toward the optimum. A benefit of using this optimizer is that we don't need to concern ourselves as much with tuning the learning rate manually. The learning rate adapts based on all the gradients accumulated during training. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Speed&lt;/strong&gt;: it is slower, since each update must scale every gradient by the accumulated statistics. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory&lt;/strong&gt;: it requires additional memory to store the accumulated squared gradients for each parameter. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance&lt;/strong&gt;: in general, it works well for simple quadratic problems, but it often stops too early when training neural networks, since the learning rate gets scaled down so much that it never reaches the minimum. It is not recommended for neural networks, but it may be efficient for simpler problems. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Usage&lt;/strong&gt;: Adagrad can be used by providing the default parameters.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;optim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Adagrad&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lr_decay&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weight_decay&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;initial_accumulator_value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;eps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1e-10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  RMSprop
&lt;/h3&gt;

&lt;p&gt;This is a variant of the Adagrad algorithm that fixes its early-stopping issue. It does so by accumulating only the gradients from the most recent iterations, using an exponentially decaying average.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Speed&lt;/strong&gt;: it is very similar to Adagrad&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory&lt;/strong&gt;: it uses the same memory as Adagrad&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance&lt;/strong&gt;: it converges much faster than Adagrad and does not stop before reaching a minimum. It was widely used by machine learning researchers until Adam came out. It does not perform very well on very simple problems. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Usage&lt;/strong&gt;: you might notice a new hyperparameter (alpha, the decay rate), but the default values usually work well; this technique can also be combined with momentum.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;optim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;RMSprop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;eps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1e-08&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weight_decay&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;momentum&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;centered&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Adam
&lt;/h3&gt;

&lt;p&gt;Adam, which stands for adaptive moment estimation, is a relatively recent gradient descent optimization method. It is a mix between momentum optimization and RMSProp. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Speed&lt;/strong&gt;: the most expensive per step, since it combines two methods. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory&lt;/strong&gt;: the same as RMSprop&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance&lt;/strong&gt;: it usually performs better than RMSprop, since it is a combination of techniques designed to converge faster on the training data. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Usage&lt;/strong&gt;: Adam can be used perfectly well with the default parameters; it is even recommended to leave the learning rate as it is, since this adaptive method updates it automatically.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;optim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Adam&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.001&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;betas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.999&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;eps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1e-08&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weight_decay&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amsgrad&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
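&lt;p&gt;&lt;em&gt;As a sketch of how the two methods combine, here is one Adam update for a single weight in pure Python (a simplification of the real optimizer):&lt;/em&gt;&lt;/p&gt;

```python
def adam_step(w, g, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: a momentum-style first moment (m) and an
    RMSprop-style second moment (v), both bias-corrected."""
    m = b1 * m + (1 - b1) * g        # decaying average of gradients (momentum part)
    v = b2 * v + (1 - b2) * g * g    # decaying average of squared gradients (RMSprop part)
    m_hat = m / (1 - b1 ** t)        # bias correction for the first few steps
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (v_hat ** 0.5 + eps), m, v

w, m, v = adam_step(1.0, 1.0, 0.0, 0.0, t=1)
print(round(w, 6))  # 0.999: the first step has size ~lr regardless of gradient scale
```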



&lt;h1&gt;
  
  
  Summary
&lt;/h1&gt;

&lt;p&gt;In the end, which optimization algorithm should you use? It depends. Adaptive algorithms are becoming very popular nowadays, but they require more &lt;strong&gt;computational power&lt;/strong&gt; and, most of the time, more &lt;strong&gt;memory&lt;/strong&gt;. Simple SGD has been shown to reach better results on the validation set, as it tends to generalize better; adaptive algorithms seem to optimize the training set too much, ending up with high variance and overfitting the data. The problem with SGD is that it might take a long time to reach a minimum, so the total computational resources needed are much higher than those needed by adaptive optimizers. So in the end, if you have a lot of computing resources you should consider using SGD with momentum, as it tends to generalize better. On the other hand, if your resources, especially &lt;strong&gt;time resources&lt;/strong&gt;, are limited, Adam is your best choice.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>algorithms</category>
    </item>
    <item>
      <title>Classifying machine learning models</title>
      <dc:creator>Pol Monroig Company</dc:creator>
      <pubDate>Wed, 15 Jul 2020 20:15:58 +0000</pubDate>
      <link>https://dev.to/polcompany/classifying-machine-learning-models-5ba8</link>
      <guid>https://dev.to/polcompany/classifying-machine-learning-models-5ba8</guid>
<description>&lt;p&gt;Machine learning models come in all shapes and sizes, and it might be difficult to create a classification between them, but there is one characteristic inherent in all models that separates them into two types: parametric models vs. non-parametric models.&lt;/p&gt;

&lt;h1&gt;
  
  
  Parametric models
&lt;/h1&gt;

&lt;p&gt;Parametric models are all the ones that give results based on a &lt;strong&gt;set of parameters&lt;/strong&gt;; each parameter is responsible for one or more features and affects each one differently. The most basic example of a parametric model is &lt;strong&gt;linear regression&lt;/strong&gt;, where each parameter is multiplied by a feature linearly. Parametric models try to find the probability distribution of the training data and approximate it using a set of parameters. For parametric models to work we usually assume that the data is drawn from a &lt;strong&gt;probability distribution&lt;/strong&gt; of known form. The &lt;strong&gt;advantage&lt;/strong&gt; of this type of model is that it reduces the problem of estimating a probability density, discriminant, or regression function to estimating a small number of parameters. Its &lt;strong&gt;disadvantage&lt;/strong&gt; is that the distribution assumption may not hold, and when it doesn't, the error can be large.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F0oou3acs6cdq5ur3ggf5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F0oou3acs6cdq5ur3ggf5.png" alt="Alt Text" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: sometimes a better approach is to use semi-parametric methods, which mix different distributions.&lt;/p&gt;

&lt;h1&gt;
  
  
  Non-parametric models
&lt;/h1&gt;

&lt;p&gt;We usually use a non-parametric approach for density estimation, classification, outlier detection, and regression. With this type of model we assume that &lt;strong&gt;similar inputs have similar outputs&lt;/strong&gt;. Based on past data, the algorithm finds similar instances, interpolates their values, and returns a result. For this to work, non-parametric models require two basic things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A history of all the seen data (usually O(n) &lt;strong&gt;space and time complexity&lt;/strong&gt;).&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;distance measure&lt;/strong&gt; to compare different instances and assign a similarity level (e.g. Euclidean or Mahalanobis distance).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F9r5cesqhilqeiztcxxfx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F9r5cesqhilqeiztcxxfx.png" alt="Alt Text" width="591" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The linear complexity is the bottleneck of this method, since the training set is usually much bigger than the set of parameters needed to model the problem.&lt;/p&gt;
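&lt;p&gt;&lt;em&gt;The two requirements above can be sketched with the simplest non-parametric model, a 1-nearest-neighbor classifier (pure Python, toy points):&lt;/em&gt;&lt;/p&gt;

```python
import math

def nearest_neighbor_predict(history, query):
    """Scan the whole stored history -- O(n) -- and answer with the label of
    the closest stored point under Euclidean distance."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    _, label = min(history, key=lambda item: dist(item[0], query))
    return label

history = [((0.0, 0.0), "A"), ((5.0, 5.0), "B")]  # all the seen data
print(nearest_neighbor_predict(history, (1.0, 1.0)))  # label of the closer point, "A"
```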

&lt;h1&gt;
  
  
  Differences
&lt;/h1&gt;

&lt;p&gt;As a summary, parametric models are &lt;em&gt;faster&lt;/em&gt;, &lt;em&gt;lighter&lt;/em&gt;, and &lt;em&gt;simpler&lt;/em&gt;, so they tend to produce less variance error. On the other hand, non-parametric models remember all the training instances and are a more &lt;em&gt;powerful&lt;/em&gt; approach, although a much &lt;em&gt;slower&lt;/em&gt; one, and we need to be careful to prevent overfitting. For example, when using decision trees, a good &lt;strong&gt;regularization method&lt;/strong&gt; is to use random forests, since the results are interpolated between different trees. &lt;/p&gt;

&lt;h1&gt;
  
  
  Examples
&lt;/h1&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parametric&lt;/th&gt;
&lt;th&gt;Non-parametric&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Linear regression&lt;/td&gt;
&lt;td&gt;Histogram estimator&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Neural networks&lt;/td&gt;
&lt;td&gt;Kernel estimator&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bayes' estimator&lt;/td&gt;
&lt;td&gt;K-NN&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Support vector machines&lt;/td&gt;
&lt;td&gt;Decision trees&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Linear Discriminant Analysis&lt;/td&gt;
&lt;td&gt;Random forests&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

</description>
      <category>machinelearning</category>
      <category>algorithms</category>
    </item>
    <item>
      <title>Subset selection: reducing dimensions</title>
      <dc:creator>Pol Monroig Company</dc:creator>
      <pubDate>Wed, 08 Jul 2020 10:58:17 +0000</pubDate>
      <link>https://dev.to/polcompany/subset-selection-reducing-dimensions-1e9b</link>
      <guid>https://dev.to/polcompany/subset-selection-reducing-dimensions-1e9b</guid>
<description>&lt;p&gt;As most data scientists know, dimensionality is a &lt;strong&gt;curse&lt;/strong&gt;, although it is not only the number of dimensions that matters but also the quality of the features along each dimension. &lt;strong&gt;Dimensionality reduction&lt;/strong&gt; is a set of techniques that try to transform the input space into a space with fewer dimensions while keeping the meaning and value of the features. In this post, we will walk through a &lt;strong&gt;greedy&lt;/strong&gt; algorithm (greedy in the sense that it does not guarantee finding the optimal answer) that generates a selection of features that tries to minimize the model's error.  &lt;/p&gt;

&lt;h1&gt;
  
  
  Feature selection vs Feature extraction
&lt;/h1&gt;

&lt;p&gt;Dimensionality reduction algorithms can be classified as feature selection methods or feature extraction methods. Feature selection methods are interested in reducing the number of initial features to the ones that give us the most information. On the other hand, feature extraction methods are interested in finding a new set of features, different from the initial ones, and with fewer dimensions. &lt;/p&gt;

&lt;h1&gt;
  
  
  Subset selection
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Subset selection&lt;/strong&gt; is a feature selection algorithm that comes in two variants: forward selection and backward selection. Both methods consist in finding a subset of the initial features with the fewest dimensions that contribute most to accuracy. A naive approach would be to try all 2^n possible subsets, but if the number of dimensions is large this would take forever. Instead, based on a &lt;strong&gt;heuristic&lt;/strong&gt; (error) function, we add or remove features one at a time. The performance of subset selection depends highly on the model we choose and on our selection criterion. &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fvwty7xzh1ja2k6qdf6zf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fvwty7xzh1ja2k6qdf6zf.png" alt="Alt Text" width="800" height="255"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Forward selection
&lt;/h3&gt;

&lt;p&gt;In forward selection we start with an empty set of features. For each feature that is not yet in the set, we train the model with it and test its performance; we then select the feature that yields the &lt;strong&gt;least amount of error&lt;/strong&gt;. We continue adding features until the error is low enough or until we have selected a given proportion of the total features. &lt;/p&gt;
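
&lt;p&gt;For example, a minimal sketch of forward selection in Python could look like this; here &lt;code&gt;train_and_score&lt;/code&gt; is a placeholder error function (a plain least-squares fit), and in practice you would swap in your own model and validation error:&lt;/p&gt;

```python
# A minimal sketch of greedy forward selection. The error function is a
# stand-in: squared error of a least-squares fit on the chosen columns.
import numpy as np

def train_and_score(X, y, features):
    # Fit a linear model on the selected columns and return its squared error.
    A = X[:, features]
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residual = y - A @ coef
    return float(residual @ residual)

def forward_selection(X, y, max_features):
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) != max_features:
        # Try each candidate feature and keep the one with the least error.
        best = min(remaining, key=lambda f: train_and_score(X, y, selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected
```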

&lt;h3&gt;
  
  
  Backward selection
&lt;/h3&gt;

&lt;p&gt;Backward selection works the same way as forward selection, but instead of starting with an empty set and adding features one by one, we start with the full set and remove them one by one: at each step we drop the feature that contributes &lt;strong&gt;the most error&lt;/strong&gt;.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>algorithms</category>
    </item>
    <item>
      <title>The triple tradeoff</title>
      <dc:creator>Pol Monroig Company</dc:creator>
      <pubDate>Wed, 01 Jul 2020 17:34:11 +0000</pubDate>
      <link>https://dev.to/polcompany/the-triple-tradeoff-38eo</link>
      <guid>https://dev.to/polcompany/the-triple-tradeoff-38eo</guid>
      <description>&lt;p&gt;In the machine learning community two concepts pop up very often: &lt;em&gt;bias&lt;/em&gt; and &lt;em&gt;variance&lt;/em&gt;. Together they describe the quality of a model in terms of how it fits the training data and how it adapts to new data. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fmxywvpqsdm960grgqqix.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fmxywvpqsdm960grgqqix.png" alt="Alt Text" width="800" height="426"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Bias&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;It is associated with simple, rigid, constrained models; scarce or poor-quality data; underfitting.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Models with high bias tend to miss relevant patterns in the data they are modeling, which causes them to underfit. This usually happens when the model is too simple and does not have enough capacity. It can also happen when the data, although simple enough for the model in principle, contains so much noise that a more complex model is required. &lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Variance&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;It is associated with complex, flexible, adaptable models; overfitting.&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;Machine learning models with high variance are very sensitive: they latch onto any small noise and fluctuation in the data. In effect they memorize the training set (they are the worst generalizers). These models are too complex and overfit the training dataset. &lt;/p&gt;

&lt;p&gt;All algorithms that learn from data face a tradeoff between variance and bias; the best model minimizes both, that is, it is complex enough to reduce the training loss and simple enough to generalize properly to new, unseen samples. This tradeoff can be extended into a three-way one, &lt;strong&gt;the triple tradeoff&lt;/strong&gt; (Dietterich 2003). &lt;/p&gt;

&lt;h2&gt;
  
  
  1. The complexity of the model
&lt;/h2&gt;

&lt;p&gt;The first part of the tradeoff concerns the complexity/capacity of the model: it has to be complex enough to see through the noise and learn the underlying representation. For example, if we take random samples from the linear function &lt;code&gt;f(x) = x + noise(x)&lt;/code&gt;, the best model for this kind of data is linear regression, and we need to be careful not to use an overly complex model such as a polynomial or a neural network. &lt;em&gt;As the complexity of the model increases, the generalization error decreases&lt;/em&gt;, but as the bias/variance tradeoff shows, only up to a point: the &lt;strong&gt;Optimal Model Complexity&lt;/strong&gt;. &lt;/p&gt;
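
&lt;p&gt;As a small illustration, we can fit polynomials of a few different degrees to noisy linear data and compare their errors on held-out points (the degrees and noise level here are arbitrary choices, not prescriptions):&lt;/p&gt;

```python
# Illustration of the complexity tradeoff: fit polynomials of increasing
# degree to noisy linear data and measure the error on held-out points.
import numpy as np

rng = np.random.default_rng(42)
x_train = np.linspace(0.0, 1.0, 20)
y_train = x_train + 0.1 * rng.normal(size=x_train.size)
x_val = np.linspace(0.05, 0.95, 50)
y_val = x_val + 0.1 * rng.normal(size=x_val.size)

def val_error(degree):
    # Fit on the training points, evaluate on the validation points.
    coefs = np.polyfit(x_train, y_train, degree)
    pred = np.polyval(coefs, x_val)
    return float(np.mean((pred - y_val) ** 2))

# Compare a simple, a moderate, and a needlessly complex model.
errors = {d: val_error(d) for d in (1, 3, 9)}
```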

&lt;h2&gt;
  
  
  2. Amount of training data
&lt;/h2&gt;

&lt;p&gt;The amount of training data is very important: the more data we have, the more we can learn about the distribution of the samples. &lt;em&gt;As the amount of training data increases, the generalization error decreases.&lt;/em&gt; This holds only up to a point, because noise in the dataset can confuse the model. &lt;/p&gt;

&lt;h2&gt;
  
  
  3. The generalization error on new data
&lt;/h2&gt;

&lt;p&gt;The generalization error decreases as we keep the other two parameters in check. The capacity to generalize is the most important property of a machine learning model: when we learn from data, we want the model to apply its knowledge to new situations. If a model works perfectly on the training data but generalizes badly, it is useless and we might not even bother working with it.&lt;/p&gt;

</description>
      <category>machinelearning</category>
    </item>
    <item>
      <title>The world of the autoencoders</title>
      <dc:creator>Pol Monroig Company</dc:creator>
      <pubDate>Wed, 24 Jun 2020 17:21:12 +0000</pubDate>
      <link>https://dev.to/polcompany/the-world-of-the-autoencoders-4l34</link>
      <guid>https://dev.to/polcompany/the-world-of-the-autoencoders-4l34</guid>
      <description>&lt;p&gt;Autoencoders are a special kind of neural network used in unsupervised machine learning. These networks try to learn an efficient representation of a dataset; these representations are called &lt;em&gt;codings&lt;/em&gt; and differ from the original input in that they usually have a much smaller dimensionality. Autoencoders are used in all kinds of tasks, such as music, text, and image generation. They can also be used for dimensionality reduction or collaborative filtering recommendation systems. Autoencoders are essentially very powerful &lt;strong&gt;feature detectors&lt;/strong&gt;. One of their most remarkable traits is that the inputs and the targets are the same: both are the features. For this reason, it is very common to see symmetrical autoencoders with tied weights. &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Frutt9zciloexteel71pm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Frutt9zciloexteel71pm.png" alt="Alt Text" width="420" height="677"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This type of network can be tricky to train and is very prone to overfitting; as a result, many different variations have been developed. &lt;/p&gt;
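&lt;p&gt;To make the tied-weights idea concrete, here is a minimal, untrained sketch in plain numpy (sigmoid activations and random initialization are my choices for illustration): the decoder simply reuses the transpose of the encoder's weight matrix.&lt;/p&gt;

```python
# A tiny symmetrical autoencoder with tied weights: the decoder reuses the
# transpose of the encoder's weight matrix. Untrained, for illustration only.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

input_dim, coding_dim = 8, 3          # codings have fewer dimensions
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(coding_dim, input_dim))

def encode(x):
    return sigmoid(W @ x)             # shape (input_dim,) to (coding_dim,)

def decode(h):
    return sigmoid(W.T @ h)           # tied weights: W.T, not a second matrix

x = rng.normal(size=input_dim)
reconstruction = decode(encode(x))
```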
&lt;h1&gt;
  
  
  Denoising
&lt;/h1&gt;

&lt;p&gt;One of the simplest alternatives is the &lt;strong&gt;Denoising Autoencoder&lt;/strong&gt;: a plain autoencoder, but with some kind of &lt;strong&gt;noise&lt;/strong&gt; added to the input data. The network then has to work harder to filter the data and reconstruct the clean input, at the cost of introducing some randomness into training. &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ffclz3b3o1kdwz2j58ci5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ffclz3b3o1kdwz2j58ci5.png" alt="Alt Text" width="800" height="718"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1&gt;
  
  
  Variational
&lt;/h1&gt;

&lt;p&gt;This category of network is very similar to denoising networks, but these are probabilistic neural networks, so they introduce randomness even after training; this is due to their internal structure. Variational autoencoders apply a &lt;strong&gt;gaussian&lt;/strong&gt; transformation to the codings, which enables the autoencoder to generate new dataset samples; this makes it a &lt;strong&gt;generative neural network&lt;/strong&gt;.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Frkzcxgo84z2h4zc8duw1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Frkzcxgo84z2h4zc8duw1.png" alt="Alt Text" width="800" height="861"&gt;&lt;/a&gt; &lt;/p&gt;
&lt;h1&gt;
  
  
  Generative adversarial networks
&lt;/h1&gt;

&lt;p&gt;Generative adversarial networks, or GANs for short, have been getting a lot of attention since their introduction in 2014. They are famously unstable to train but have been shown to produce very convincing generated data. This variation works by creating two networks: a discriminator and a generator. The generator works like any other autoencoder, trying to generate new data, while the discriminator works like a classification network that tries to distinguish images from the original dataset from fake images produced by the generator. &lt;/p&gt;
&lt;h1&gt;
  
  
  Other
&lt;/h1&gt;

&lt;p&gt;There are a lot of autoencoders out there, and there is surely one that fits your needs. Other examples include: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sparse AE&lt;/li&gt;
&lt;li&gt;Contractive AE &lt;/li&gt;
&lt;li&gt;Stacked convolutional AE&lt;/li&gt;
&lt;li&gt;Winner-Take-All AE &lt;/li&gt;
&lt;li&gt;Generative stochastic network &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As a personal favorite, I really like this NVIDIA generative autoencoder. &lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/9QuDh3W3lOY"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

</description>
      <category>machinelearning</category>
    </item>
    <item>
      <title>5 git commands that will save your life</title>
      <dc:creator>Pol Monroig Company</dc:creator>
      <pubDate>Wed, 17 Jun 2020 06:58:45 +0000</pubDate>
      <link>https://dev.to/polcompany/5-git-commands-that-will-save-your-life-5hck</link>
      <guid>https://dev.to/polcompany/5-git-commands-that-will-save-your-life-5hck</guid>
      <description>&lt;h1&gt;
  
  
  1. Amend mistakes
&lt;/h1&gt;

&lt;p&gt;We've all been there: you're in a rush for your next deadline and you have to deliver your application. As always, you have git to back you up; literally, it backs up and versions your data. Unfortunately, you've been taking git lightly, and you have committed something horrible, something that was not supposed to be saved. What will you do? You don't have spare time to ask a question on Stack Overflow... Suddenly you remember a great command, a command that can discard all your changes. &lt;br&gt;
&lt;code&gt;git reset --hard &amp;lt;Commit&amp;gt;&lt;/code&gt;&lt;br&gt;
With this command you remove your commit and move back to a selected one. But &lt;strong&gt;beware&lt;/strong&gt;! It will also delete any uncommitted changes. &lt;br&gt;
So what happens if you don't want to &lt;strong&gt;lose any changes&lt;/strong&gt;? Well git comes to the rescue again with &lt;br&gt;
&lt;code&gt;git reset --soft HEAD~1&lt;/code&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  2. Find where you are
&lt;/h1&gt;

&lt;p&gt;Have you ever been working on something but have so &lt;strong&gt;many files&lt;/strong&gt; that it's impossible to distinguish between what is useful and what must be deleted? Have you ever wondered where you are? Or who you are? It turns out git can answer the first two questions. One of the most useful and simplest commands you'll ever use is &lt;br&gt;
&lt;code&gt;git status&lt;/code&gt;&lt;br&gt;
This command helps you visualize your staged files and your uncommitted changes, and gives you an idea of what you have done since the last commit. &lt;/p&gt;

&lt;h1&gt;
  
  
  3. Go back in time
&lt;/h1&gt;

&lt;p&gt;I have always wanted to go back in time, or at least as long as I can come back to the present later. A lot of people don't know it, but git is a time machine: not one of those fancy time machines that can send you to the Jurassic period, but one that can go back through your code. &lt;br&gt;
&lt;code&gt;git checkout &amp;lt;Commit&amp;gt;&lt;/code&gt;&lt;br&gt;
When you check out a commit you are essentially traveling to the specific point in time where you wrote that code. You can open any file and navigate through the folders as if you were there. Awesome, right? If you pass a branch name instead of a commit you can switch between branches, but that isn't as fun. &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F6z7ol8h757hf5cyv4hk3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F6z7ol8h757hf5cyv4hk3.png" alt="Alt Text" width="800" height="216"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  4. Finding differences
&lt;/h1&gt;

&lt;p&gt;Traveling in time is amazing, but sometimes it is a little too much; perhaps you just want to see a specific file, or wish to compare the same file between different commits. &lt;br&gt;
&lt;code&gt;git diff &amp;lt;OldCommit&amp;gt;...&amp;lt;NewCommit&amp;gt;&lt;/code&gt;&lt;br&gt;
&lt;code&gt;git diff &amp;lt;OldCommit&amp;gt;...&amp;lt;NewCommit&amp;gt; -- &amp;lt;file&amp;gt;&lt;/code&gt;&lt;br&gt;
This command does precisely that: it takes two commits and compares their &lt;strong&gt;differences&lt;/strong&gt;. You can also restrict the comparison to a specific file if that is what you want. &lt;/p&gt;

&lt;h1&gt;
  
  
  5. How to read your log?
&lt;/h1&gt;

&lt;p&gt;Having a log is very important: it contains information on any changes you have made, why you made them, who made them, and when. That is why git has its own log, although learning how to write your log correctly is as important as knowing how to read it. &lt;br&gt;
Most people use &lt;code&gt;git log&lt;/code&gt;, but it is so simple, so vague, let's add some decoration... &lt;br&gt;
With &lt;code&gt;git log -n &amp;lt;integer&amp;gt;&lt;/code&gt; you can view only a specified number of commits; really helpful if you don't want to see them all. &lt;br&gt;
&lt;code&gt;git log --since=&amp;lt;Date&amp;gt; --until=&amp;lt;Date&amp;gt;&lt;/code&gt; &lt;br&gt;
filters the commits between two dates. Finally, a very interesting alternative is &lt;br&gt;
&lt;code&gt;git log --grep=&amp;lt;Keyword&amp;gt;&lt;/code&gt;&lt;br&gt;
This command filters the commits whose messages contain a specific keyword. &lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Git is an immensely powerful tool with amazing commands, and I have just shown you a few. I hope you learned at least one new command today. &lt;/p&gt;

</description>
      <category>git</category>
      <category>github</category>
      <category>bash</category>
    </item>
    <item>
      <title>A tip on unstructured data in AI</title>
      <dc:creator>Pol Monroig Company</dc:creator>
      <pubDate>Mon, 11 May 2020 11:24:04 +0000</pubDate>
      <link>https://dev.to/polcompany/getting-a-24oo</link>
      <guid>https://dev.to/polcompany/getting-a-24oo</guid>
      <description>&lt;p&gt;Training neural networks is not as hard as it was 40 years ago; nevertheless, it is as much an art as it is a science, so practice makes the master. Deep learning is a very powerful tool for someone who wants to apply AI to unstructured data, but &lt;strong&gt;unstructured data&lt;/strong&gt; has the tendency to be, well, unstructured: every piece of data might have a different size and shape. Examples include texts, songs, and images. As a programmer, you need to face this difficulty and find solutions. In this post, I'll show you two ways to handle &lt;strong&gt;variable-shaped&lt;/strong&gt; input when training neural networks. &lt;/p&gt;

&lt;h1&gt;
  
  
  Padding
&lt;/h1&gt;

&lt;p&gt;The simplest way is to pad every input to the same size. It is straightforward: you only need to find the &lt;strong&gt;biggest&lt;/strong&gt; tensor in a batch of data and pad every other tensor in the batch to that size. Hmm, that does not seem very efficient: you are wasting a lot of space on padding, and you are making training harder for the neural network.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ftp80ai1ngz5s4nygomq6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ftp80ai1ngz5s4nygomq6.png" alt="Alt Text" width="761" height="596"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Bucketing
&lt;/h1&gt;

&lt;p&gt;A better way is to &lt;strong&gt;sort&lt;/strong&gt; the data by length in ascending order and create batches that minimize the padding between tensors. This makes training faster and wastes less space. You might still encounter an over-padded batch, but it is definitely better than the naive solution. One drawback is that the batches you train with will always be the same, which might cause &lt;strong&gt;overfitting&lt;/strong&gt;, but it shouldn't be much of an issue. &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F72roth8c94mmvyl9blmw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F72roth8c94mmvyl9blmw.png" alt="Alt Text" width="761" height="596"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;This is a simple tip I wanted to share for deep learning enthusiasts. Comment your favorite way to handle variable-shaped data! &lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>deeplearnin</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>How to run anything in your web server?</title>
      <dc:creator>Pol Monroig Company</dc:creator>
      <pubDate>Fri, 01 May 2020 08:16:24 +0000</pubDate>
      <link>https://dev.to/polcompany/how-to-run-anything-in-your-web-server-4im8</link>
      <guid>https://dev.to/polcompany/how-to-run-anything-in-your-web-server-4im8</guid>
      <description>&lt;p&gt;Web applications are all around us; they allow us to communicate in unimaginable ways. Best of all, it is straightforward to set up your own web site and post any content you like. But what happens when you wish to run more complex stuff on your web server? Maybe you need a fast application that replies to millions of requests, maybe you are just curious and wish to test yourself in the web jungle, maybe you are in quarantine and don't have anything better to do. Whatever the reason (I don't judge you), this is the post for you: here you'll learn the basics of CGI programming.  &lt;/p&gt;

&lt;h1&gt;
  
  
  Common Gateway Interface
&lt;/h1&gt;

&lt;p&gt;The Common Gateway Interface, or CGI, is the most common way to run scripts in the cloud. Normally you send &lt;strong&gt;POST&lt;/strong&gt;/&lt;strong&gt;GET&lt;/strong&gt; requests to your web server and get an HTML response that is displayed by your browser. When you configure your server with CGI, a request will instead execute a specific file on your server. This file can be anything from a &lt;strong&gt;Python&lt;/strong&gt; script to a &lt;strong&gt;C++&lt;/strong&gt; binary (as long as your server can run it). The script produces the response, of course; in fact, you can run an entire web page on CGI. Although you can create every component of your page manually, it is easier to use a &lt;strong&gt;framework&lt;/strong&gt; (e.g. Pistache or Wt for C++). &lt;/p&gt;

&lt;h1&gt;
  
  
  Simple setup with Apache
&lt;/h1&gt;

&lt;p&gt;Before sending an HTTP request to the file we want to execute, we need to set up the web server; for this, we'll use Apache. &lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1 Locate your CGI directory
&lt;/h3&gt;

&lt;p&gt;Scripts can only be executed from a specific directory; otherwise, you could run anything... Thus, the first thing we need to do is locate it, by looking in one of the following configuration files: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;/etc/apache2/httpd.conf&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;/etc/apache2/apache2.conf&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;/etc/httpd/httpd.conf&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;/etc/httpd/conf/httpd.conf&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Inside your config file, you'll need to find the &lt;strong&gt;ScriptAlias&lt;/strong&gt; directive; it defines the location of your CGI files (you can change it if you want). &lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2 Writing your CGI program
&lt;/h3&gt;

&lt;p&gt;There are many ways to write a CGI program, but remember that it needs to deliver a response to the user: the response is the text the program writes to &lt;strong&gt;standard output&lt;/strong&gt;. A simple program would be the following&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#!/dir/to/python/env
# script.py / script.cgi 
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-type: text/html&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# write text content type(e.g. text/plain, image/gif)
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;h1&amp;gt;This is a text header&amp;lt;/h1&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# write the content 
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3 Set permissions
&lt;/h3&gt;

&lt;p&gt;For Apache to execute the script, you first need to give it execute permissions: &lt;br&gt;
&lt;code&gt;chmod a+x script.py&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4 Test it
&lt;/h3&gt;

&lt;p&gt;Now, to try your new CGI script, it is as easy as sending an HTTP request; for simplicity you can use &lt;strong&gt;curl&lt;/strong&gt;. &lt;br&gt;
&lt;code&gt;curl example.com/cgi-bin/script.py&lt;/code&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Extra
&lt;/h1&gt;

&lt;p&gt;There are two things you may be wondering by now. First of all, how can I read the data that was sent in the request from my Python file? The data sent with a GET request is saved in &lt;strong&gt;environment variables&lt;/strong&gt; such as QUERY_STRING, while the body of a POST request arrives on standard input (its size is given in CONTENT_LENGTH), so it is as simple as reading them. Another thing that can easily be spotted is that every time a user sends a request a new process is created on the server, so under heavy load the server can easily collapse. An alternative to CGI is FastCGI, which keeps a single long-running process that handles the requests. A better idea would be to code your own web API, but that is beyond the scope of this post!&lt;/p&gt;
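
&lt;p&gt;For instance, a hypothetical CGI script (not part of the setup above) could read the GET parameters from &lt;code&gt;QUERY_STRING&lt;/code&gt; like this:&lt;/p&gt;

```python
#!/usr/bin/env python3
# Hypothetical CGI script: echoes back the query-string parameters.
import os
from urllib.parse import parse_qs

def handle_request(environ):
    # GET parameters arrive in the QUERY_STRING environment variable.
    params = parse_qs(environ.get("QUERY_STRING", ""))
    body = ", ".join(f"{k}={v[0]}" for k, v in params.items())
    # The first line printed is the content-type header, then a blank line.
    return "Content-type: text/plain\n\n" + body

if __name__ == "__main__":
    print(handle_request(os.environ))
```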

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;At this point you should have successfully executed your app in the cloud; if that was not the case, contact me and I'll help you as best I can. &lt;/p&gt;

</description>
      <category>backend</category>
      <category>webdev</category>
      <category>linux</category>
    </item>
  </channel>
</rss>
