<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: AiFlow</title>
    <description>The latest articles on DEV Community by AiFlow (@aiflowltd).</description>
    <link>https://dev.to/aiflowltd</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F786647%2Fefb764f3-9930-4562-a2dd-47daf3570dc1.jpg</url>
      <title>DEV Community: AiFlow</title>
      <link>https://dev.to/aiflowltd</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/aiflowltd"/>
    <language>en</language>
    <item>
      <title>Volume Visualization - Same data, different perspectives</title>
      <dc:creator>AiFlow</dc:creator>
      <pubDate>Wed, 02 Mar 2022 15:07:45 +0000</pubDate>
      <link>https://dev.to/aiflowltd/volume-visualization-same-data-different-perspectives-29ib</link>
      <guid>https://dev.to/aiflowltd/volume-visualization-same-data-different-perspectives-29ib</guid>
      <description>&lt;p&gt;You've probably gone through an airport security check and saw the staff looking through your stuff on a screen, or have probably done an MRI that allowed a doctor to 'cut slices' through your body and check your health. Volume data is everywhere, from the human body to backpacks and closed boxes and the technology has evolved so that we can analyse their contents quickly, without opening them.&lt;/p&gt;

&lt;p&gt;From a mathematical point of view, a volume is a 3D matrix of numbers, which can represent various things, such as density, intensity, or heat. Based on that abstract data, one can extract features and build useful visualizations.&lt;/p&gt;

&lt;p&gt;This post is just an example of how the same data can be represented in many ways, serving different purposes. The data used is a scan of a fish, where the values in the volume represent intensities. The technique used to extract the actual 2D picture is Ray Casting, where a ray is shot through each of the pixels and a value is computed along the ray's intersection with the volume.&lt;/p&gt;

&lt;p&gt;The example below is obtained using Maximum Intensity Projection, where the pixel value is proportional to the highest value encountered along the ray. This is why the bone structure is slightly more visible than the soft tissue.&lt;/p&gt;
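&lt;p&gt;As a rough sketch of the idea (not the code behind these figures), Maximum Intensity Projection can be implemented by stepping along each ray and keeping the largest sample. The volume layout, function names, and numbers below are hypothetical:&lt;/p&gt;

```python
# Maximum Intensity Projection along a single ray (illustrative sketch).
# Assumes the volume is a nested list indexed as volume[z][y][x]; the ray is
# given by a start point and a direction stepped in voxel-sized increments.

def mip_along_ray(volume, start, direction, num_samples, step=1.0):
    """Return the maximum intensity sampled along one ray."""
    best = 0.0
    x, y, z = start
    dx, dy, dz = direction
    for _ in range(num_samples):
        xi, yi, zi = int(round(x)), int(round(y)), int(round(z))
        # Only sample positions that fall inside the volume.
        if 0 <= zi < len(volume) and 0 <= yi < len(volume[0]) and 0 <= xi < len(volume[0][0]):
            best = max(best, volume[zi][yi][xi])
        x, y, z = x + dx * step, y + dy * step, z + dz * step
    return best

# A tiny 2x2x2 volume: the ray picks up the brightest voxel along its path.
vol = [[[0.1, 0.2], [0.3, 0.9]], [[0.4, 0.5], [0.6, 0.7]]]
print(mip_along_ray(vol, (0, 0, 0), (1, 1, 1), 3))  # 0.7
```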

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--F-w3azcm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yf04qkg0uaqg7r2td6nq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--F-w3azcm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yf04qkg0uaqg7r2td6nq.png" alt="Maximum intensity projection" width="880" height="507"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A much simpler approach is an ISO projection, similar to the elevation contours on a map: define a threshold value and return the first position along the ray whose intensity exceeds it. It allows you to get a sense of the object's area. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wtl6bVSC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pffg8sxewwxiy2gd025z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wtl6bVSC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pffg8sxewwxiy2gd025z.png" alt="ISO" width="880" height="503"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Building on the previous technique, a projection can be made to look more realistic by adding a light source and, thus, shading. The model below is called Phong Shading and splits the lighting into three components: the ambient term is the static base color, the diffuse term adds shading based on surface orientation, and the specular term makes the object shine, giving it a more plastic look.&lt;/p&gt;
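&lt;p&gt;The three Phong terms can be sketched for a single surface point as below; the coefficients (ka, kd, ks) and the shininess exponent are made-up illustration values, not the ones used for the figures:&lt;/p&gt;

```python
# A minimal sketch of the Phong model at one surface point, one light source.
import math

def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def phong_intensity(normal, light_dir, view_dir, ka=0.1, kd=0.7, ks=0.2, shininess=10):
    """ambient + diffuse + specular; all direction vectors are normalized."""
    n, l, v = normalize(normal), normalize(light_dir), normalize(view_dir)
    ndotl = dot(n, l)
    diffuse = max(0.0, ndotl)
    # Reflect the light direction around the normal: r = 2(n.l)n - l
    r = tuple(2 * ndotl * nc - lc for nc, lc in zip(n, l))
    specular = max(0.0, dot(r, v)) ** shininess
    return ka + kd * diffuse + ks * specular

# Light and viewer directly above a surface facing up: full diffuse + specular.
print(phong_intensity((0, 0, 1), (0, 0, 1), (0, 0, 1)))
```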

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6GrruajW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2d3brduwd5qhcor80qnq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6GrruajW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2d3brduwd5qhcor80qnq.png" alt="ISO + shading" width="880" height="505"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Moving to a different approach, one can compose the values along the ray based on their opacity. Compared to Maximum Intensity Projection, this preserves more detail, providing better separation between hard and soft tissue in this particular example.&lt;/p&gt;
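&lt;p&gt;Composing along the ray is typically done front-to-back, accumulating color and opacity until the ray becomes nearly opaque. A minimal sketch, with hypothetical sample values:&lt;/p&gt;

```python
# Front-to-back compositing along one ray (scalar "color" for simplicity).
def composite(samples):
    """samples: list of (color, opacity) pairs, ordered front to back."""
    color, alpha = 0.0, 0.0
    for c, a in samples:
        color += (1.0 - alpha) * a * c   # new sample, attenuated by what is in front
        alpha += (1.0 - alpha) * a       # accumulate opacity
        if alpha >= 0.99:                # early ray termination: nearly opaque
            break
    return color, alpha

# A faint soft-tissue sample in front of a bright bone sample.
print(composite([(0.3, 0.2), (0.9, 0.8)]))
```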

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--K8vcUJAD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b9a2iyf7u72uqf0f6c99.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--K8vcUJAD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b9a2iyf7u72uqf0f6c99.png" alt="Compositing" width="880" height="504"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Given the volume, you can also compute the gradient at every point, which enables you to define a transfer function. With it, you can choose to see only parts of the volume, such as regions where the gradient is large, thereby extracting the skeleton.&lt;/p&gt;
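&lt;p&gt;The gradient at an interior voxel is commonly estimated with central differences; a sketch (volume layout and names are hypothetical):&lt;/p&gt;

```python
# Central-difference gradient magnitude at an interior voxel of volume[z][y][x].
def gradient_magnitude(volume, x, y, z):
    gx = (volume[z][y][x + 1] - volume[z][y][x - 1]) / 2.0
    gy = (volume[z][y + 1][x] - volume[z][y - 1][x]) / 2.0
    gz = (volume[z + 1][y][x] - volume[z - 1][y][x]) / 2.0
    return (gx * gx + gy * gy + gz * gz) ** 0.5

# A 3x3x3 volume whose intensity increases only along x: gradient magnitude 1.
vol = [[[x for x in range(3)] for y in range(3)] for z in range(3)]
print(gradient_magnitude(vol, 1, 1, 1))  # 1.0
```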

&lt;p&gt;Besides the gradient, many other values can be computed, such as centrality measures. Using those, visualizations can be created that show structure or clusters in the data, highlighted with different colors.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0pPttbqU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/duvsodry5hpl51f7w8b6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0pPttbqU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/duvsodry5hpl51f7w8b6.png" alt="2D" width="880" height="510"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Building on the previous idea, you can define multiple regions in the transfer function, so that the tissue types are visually separated with different colors for easier analysis, as in the picture below. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Sm3w5Khj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lj2ubkveovm5ccn58bt2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Sm3w5Khj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lj2ubkveovm5ccn58bt2.png" alt="Image description" width="880" height="506"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Moving on from the fish example, this technique can be used in many other fields, especially medicine and transportation safety. Using the 2D transfer function, airport staff can easily detect sharp objects in your backpack and can thus prevent unwanted events.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1KLxkDZT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/l20w1wp4bfvfiw9al76a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1KLxkDZT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/l20w1wp4bfvfiw9al76a.png" alt="Backpack scan" width="880" height="493"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Similarly, parts of the human body can be better analyzed using this approach, leading to a better learning experience for students and a more accurate diagnosis for doctors.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--u4hMMEOj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rls51wgd4qgawzulr0h9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--u4hMMEOj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rls51wgd4qgawzulr0h9.png" alt="Image description" width="880" height="495"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;At &lt;a href="https://aiflow.ltd/?utm_source=devto&amp;amp;utm_campaign=2mar2022-volumevisualization"&gt;aiflow.ltd&lt;/a&gt;, we try our best to obtain results as fast and as efficient as possible, to make sure we find you get the most of your data. If you’re curious to find out more, subscribe to our &lt;a href="https://www.aiflow.ltd/console"&gt;newsletter&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>startup</category>
      <category>visualization</category>
    </item>
    <item>
      <title>MapReduce - when you need more juice to process your data</title>
      <dc:creator>AiFlow</dc:creator>
      <pubDate>Wed, 23 Feb 2022 15:13:32 +0000</pubDate>
      <link>https://dev.to/aiflowltd/mapreduce-when-you-need-more-juice-to-process-your-data-5b4p</link>
      <guid>https://dev.to/aiflowltd/mapreduce-when-you-need-more-juice-to-process-your-data-5b4p</guid>
      <description>&lt;p&gt;Most of the websites you can find on the internet today are collecting some kind of data for later processing and information extraction. They usually do this to learn customer behavior, analyze logs, create better ad targeting or simply improve UI/UX wherever possible. But what if the data collected is larger than what a machine can process in a day, year, or even decade? This is where &lt;a href="https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf" rel="noopener noreferrer"&gt;MapReduce&lt;/a&gt; comes into play.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is MapReduce?
&lt;/h2&gt;

&lt;p&gt;MapReduce is a programming paradigm for processing and generating big data sets with a parallel, distributed algorithm on a cluster of machines. Usually, a MapReduce program is composed of a &lt;strong&gt;map&lt;/strong&gt; method which performs filtering and sorting, and a &lt;strong&gt;reduce&lt;/strong&gt; method, which performs a summary operation over the output of the map function (we will discuss both &lt;strong&gt;map&lt;/strong&gt; and &lt;strong&gt;reduce&lt;/strong&gt; more in-depth later in the article).  The underlying idea behind MapReduce is very similar to the &lt;code&gt;divide and conquer&lt;/code&gt; approach. In fact, the key contributions of this data processing framework are not the map and reduce functions, but the scalability and fault-tolerance achieved for a variety of applications by optimizing the execution engine.&lt;/p&gt;

&lt;h2&gt;
  
  
  How does it work?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmkdjpzrao8kaemhiuue7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmkdjpzrao8kaemhiuue7.png" alt="map-reduce-working"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As already highlighted, the bread and butter of MapReduce are the two functions, &lt;strong&gt;map&lt;/strong&gt; and &lt;strong&gt;reduce&lt;/strong&gt;. They are sequenced one after the other. &lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Map&lt;/strong&gt; function takes input from the disk as &lt;strong&gt;(key, value)&lt;/strong&gt; pairs, processes them, and produces another set of intermediate &lt;strong&gt;(key, value)&lt;/strong&gt; pairs as output. The input data is first split into smaller blocks, and each block is then assigned to a mapper for processing. The number of mappers and the block sizes are determined by the MapReduce framework depending on the input data and the memory available on each mapper server.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Reduce&lt;/strong&gt; function also takes inputs as &lt;strong&gt;(key, value)&lt;/strong&gt; pairs and produces &lt;strong&gt;(key, value)&lt;/strong&gt; pairs as output. After all the mappers complete processing, the framework shuffles and sorts the results before passing them to the reducers. A reducer cannot start while a mapper is still processing. All the map output values that have the same key are assigned to a single reducer, which then aggregates the values for that key.&lt;br&gt;
Mappers and Reducers are servers managed by the MapReduce framework; they run the Map and Reduce functions respectively. It doesn't matter if these are the same or different servers.&lt;/p&gt;

&lt;p&gt;To summarize how the MapReduce framework works conceptually, the map and reduce functions are supplied by the user and have the following associated types:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;map    (k1, v1)  -&amp;gt;  list(k2, v2)
reduce (k2, list(v2))  -&amp;gt;  list(v2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Other MapReduce processes
&lt;/h2&gt;

&lt;p&gt;Other than Map and Reduce, there are two other intermediate steps, which can either be controlled by the user or managed by the MapReduce framework.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Combine&lt;/strong&gt; is an optional process. The combiner is a reducer that runs individually on each mapper server. It reduces the data on each mapper to a simpler form before passing it downstream, making shuffling and sorting easier since there is less data to work with. Often the reducer itself can be used as a combiner, but a standalone combiner can be implemented as well if needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Partition&lt;/strong&gt; is the process that translates the &lt;strong&gt;(key, value)&lt;/strong&gt; pairs resulting from mappers to another set of &lt;strong&gt;(key, value)&lt;/strong&gt; pairs to feed into the reducer. It decides how the data has to be presented and assigned to a particular reducer.&lt;/p&gt;
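&lt;p&gt;A common default partitioning scheme, sketched below, hashes the key modulo the number of reducers, so that all pairs with the same key land on the same reducer (a stable hash keeps the assignment reproducible across runs):&lt;/p&gt;

```python
# Hash partitioner sketch: map an intermediate key to a reducer index.
import hashlib

def partition(key, num_reducers):
    # md5 is stable across processes, unlike Python's built-in hash().
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_reducers

print(partition("apple", 4), partition("apple", 4))  # same key, same reducer
```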

&lt;h2&gt;
  
  
  Example
&lt;/h2&gt;

&lt;p&gt;As an example use case of MapReduce, we'll consider the problem of counting the number of occurrences of each word in a large collection of documents. The user pseudo-code for implementing this in MapReduce would be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt; &lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="c1"&gt;// key: document name&lt;/span&gt;
  &lt;span class="c1"&gt;// value: document contents&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="nx"&gt;each&lt;/span&gt; &lt;span class="nx"&gt;word&lt;/span&gt; &lt;span class="nx"&gt;W&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nc"&gt;EmitIntermediate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="nf"&gt;reduce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt; &lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;List&lt;/span&gt; &lt;span class="nx"&gt;values&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="c1"&gt;// key: a word&lt;/span&gt;
  &lt;span class="c1"&gt;// values: a list of counts&lt;/span&gt;
  &lt;span class="nx"&gt;int&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="nx"&gt;each&lt;/span&gt; &lt;span class="nx"&gt;v&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;values&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nf"&gt;parseInt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="nc"&gt;EmitIntermediate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;toString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;em&gt;map&lt;/em&gt; function emits each word plus an associated count of occurrences (in our case, just "1"). The &lt;em&gt;reduce&lt;/em&gt; function sums together all the counts emitted for a particular word. Apart from these two function implementations, the user only has to specify a &lt;em&gt;config&lt;/em&gt; object with the names of the input and output files plus optional tuning parameters, and the MapReduce framework can start processing data.&lt;/p&gt;
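&lt;p&gt;The word-count job above can be simulated in-memory to see the map, shuffle, and reduce phases end to end. This is only a single-process sketch of the paradigm, not the distributed framework itself:&lt;/p&gt;

```python
# In-memory simulation of the word-count MapReduce job.
from collections import defaultdict

def map_fn(doc_name, contents):
    # Emit (word, 1) for each word in the document.
    for word in contents.split():
        yield word, 1

def reduce_fn(word, counts):
    # Sum all counts emitted for this word.
    yield word, sum(counts)

def map_reduce(inputs, map_fn, reduce_fn):
    # Map phase: produce intermediate pairs.
    intermediate = defaultdict(list)
    for key, value in inputs:
        for k2, v2 in map_fn(key, value):
            intermediate[k2].append(v2)   # shuffle: group values by key
    # Reduce phase: aggregate each key's values (sorted, as the framework does).
    output = {}
    for k2, values in sorted(intermediate.items()):
        for k3, v3 in reduce_fn(k2, values):
            output[k3] = v3
    return output

docs = [("doc1", "the cat sat"), ("doc2", "the cat ran")]
print(map_reduce(docs, map_fn, reduce_fn))
```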

&lt;p&gt;&lt;strong&gt;Other examples&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Distributed Grep:&lt;/strong&gt; The map function emits a line if it matches a supplied pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Count of URL Access Frequency:&lt;/strong&gt; The map function processes logs of web page requests and outputs &lt;strong&gt;(URL, 1)&lt;/strong&gt;. The reducer function adds together all values for each URL and emits &lt;strong&gt;(URL, total count)&lt;/strong&gt; pairs.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
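&lt;p&gt;For instance, Distributed Grep needs almost no code: the map function filters, and reduce is the identity. A self-contained sketch (the pattern and log lines are made up):&lt;/p&gt;

```python
# Distributed Grep expressed as map and reduce functions (sketch).
import re

def grep_map(file_name, line):
    # Emit the line itself as the key if it matches the pattern.
    if re.search(r"error", line):   # hypothetical pattern
        yield line, ""

def grep_reduce(line, values):
    yield line   # identity: pass matching lines through unchanged

lines = ["all good", "disk error on /dev/sda", "error: timeout"]
matches = [k for i, l in enumerate(lines) for k, _ in grep_map(f"log:{i}", l)]
print(matches)
```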




&lt;p&gt;This was a brief introduction to how the MapReduce framework works. At &lt;a href="https://aiflow.ltd/?utm_source=devto&amp;amp;utm_campaign=23feb2022-mapreduce" rel="noopener noreferrer"&gt;aiflow.ltd&lt;/a&gt;, we try our best to obtain results as quickly and efficiently as possible, to make sure you get the most out of your data. If you’re curious to find out more, subscribe to our &lt;a href="https://www.aiflow.ltd/console" rel="noopener noreferrer"&gt;newsletter&lt;/a&gt;.&lt;br&gt;
&lt;strong&gt;References:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.researchgate.net/publication/221399923_An_Architecture_for_Distributed_High_Performance_Video_Processing_in_the_Cloud" rel="noopener noreferrer"&gt;https://www.researchgate.net/publication/221399923_An_Architecture_for_Distributed_High_Performance_Video_Processing_in_the_Cloud&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Examples and explanations from: &lt;a href="https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf" rel="noopener noreferrer"&gt;https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>startup</category>
      <category>mapreduce</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Word2Vec - from words to numbers</title>
      <dc:creator>AiFlow</dc:creator>
      <pubDate>Wed, 16 Feb 2022 11:28:04 +0000</pubDate>
      <link>https://dev.to/aiflowltd/word2vec-from-words-to-numbers-1h83</link>
      <guid>https://dev.to/aiflowltd/word2vec-from-words-to-numbers-1h83</guid>
      <description>&lt;p&gt;People write books, blog posts, articles and send thousands of messages daily. They are able to transfer a lot of knowledge and meaning through text. What if we can transfer this knowledge to a computer and enable it to understand the semantics behind a sentence? By doing so, we enable computers to learn and understand text, draw conclusions and even generate sequences that actually make sense.&lt;/p&gt;

&lt;p&gt;The reason to do so might not be clear from the beginning. Why would we want to create some sort of encodings that computers understand? Do you remember your recent Google search that kindly offered you some autocomplete options or the chatbot you chatted with the last time you made a restaurant reservation? These might be some of the many reasons to look into a way to convert text to formats computers would understand.&lt;/p&gt;




&lt;p&gt;Since computers don't understand words, but only numbers, there must be some way to convert a sentence into a list of numbers. One can simply transform "I think I understand machine learning" to the list [0, 1, 0, 2, 3, 4], based on the index of the first occurrence of each word in the sentence. While this is a valid input for a computer, it does not offer much insight into the actual semantics of the sentence. In the absence of structure, computers are not able to learn representations and patterns from the data. There must be a way to encode a given corpus of text into a sequence of n-dimensional vectors, one for each word, while preserving the meaning and patterns we see in the texts we read in our daily life.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Cells that fire together wire together" - Donald Hebb&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You've probably heard this quote a few times by now, and it applies perfectly to the task of understanding natural language. You have probably noticed (if not, empirical studies clearly have) that some words appear together in sentences far more often than others. As a shallow example, "cat" and "dog" probably appear together more often than "cat" and "coffee" do. This suggests that, given a center word, we can predict the likelihood of other words appearing in its context. We can learn a representation that puts "cat" closer to "dog" than to "coffee" in our n-dimensional space. More formally, the Euclidean distance between the representations of "cat" and "dog" will be shorter than the distance between "cat" and "coffee".&lt;/p&gt;

&lt;p&gt;This leads us to what is known as the skip-gram model, where we consider n context words around a center one. Using this idea, a training set is built from tuples of the form (A, B), where A and B are one-hot encoded vectors. Briefly: given a vocabulary V, where A has id a and B has id b, the one-hot encoded vector of A is a list of zeros of the size of V, with a 1 only at position a. The same goes for B; the 1 simply signals the presence of that particular word in the vector.&lt;/p&gt;

&lt;p&gt;Given this long list of (center, context) word pairs, we train a shallow neural network with an autoencoder-like bottleneck that, given the center word, should accurately return a list with the probability of each word in the vocabulary appearing in its surroundings. By shrinking the network in the middle, we create an embedding layer. This forces the network to learn the 'essence' of every word, compressing its meaning and semantics into a fixed-size embedding vector, small enough that we can use it for future tasks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iMK6-CXi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/t4at7565gjrrwuhn9v0w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iMK6-CXi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/t4at7565gjrrwuhn9v0w.png" alt="Image description" width="880" height="464"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The model can be represented very simply by two matrices: the encoder matrix, called W1, and the decoder matrix, called W2. If X is the one-hot encoded version of a word, its word2vec encoding is the product X*W1. Given the encoding E of some word, the likelihood of every other word appearing in its context is obtained from the product E*W2 (passed through a softmax), turning the encoding into a set of probabilities over the vocabulary.&lt;/p&gt;
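&lt;p&gt;The forward pass through those two matrices can be sketched in plain Python. The weights below are made-up toy numbers (a trained model learns W1 and W2 from data, and a real implementation would use numpy):&lt;/p&gt;

```python
# Forward pass of the two-matrix word2vec model on a toy vocabulary of 3 words.
import math

def mat_vec(matrix, vec):
    # matrix is a list of rows; returns the row-vector product vec * matrix.
    cols = len(matrix[0])
    return [sum(vec[i] * matrix[i][j] for i in range(len(vec))) for j in range(cols)]

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

W1 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # encoder: |V| x d  (3 x 2)
W2 = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]     # decoder: d x |V|  (2 x 3)

x = [1, 0, 0]                            # one-hot for word 0
embedding = mat_vec(W1, x)               # E = X * W1
probs = softmax(mat_vec(W2, embedding))  # context distribution = softmax(E * W2)
print(embedding, [round(p, 3) for p in probs])
```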

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LwC3BSZY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w0pkuuortvh24pbcd8ir.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LwC3BSZY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w0pkuuortvh24pbcd8ir.png" alt="Image description" width="880" height="386"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The model is conventionally trained using backpropagation, learning at every step to assign greater probabilities to the context words of a given center word than to other random words in the vocabulary.&lt;/p&gt;

&lt;p&gt;The technique described above is a simple yet smart way of transforming words into computable structures, such as numeric arrays. The results we obtain might look like random numbers at first sight, but when used properly, they reveal various interesting things. Without any additional processing, we can find associations in the n-dimensional space that are simple for us humans but not so obvious for computers. For example, we can ask the general question: What is to X as A is to B? and the model will give some interesting outputs:&lt;/p&gt;
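&lt;p&gt;Such analogy questions are answered with simple vector arithmetic on the embeddings: compute the offset from A to B and apply it to X, then find the nearest word. The 2-d embeddings below are made-up numbers chosen to make the geometry obvious:&lt;/p&gt;

```python
# Analogy by vector arithmetic on toy 2-d embeddings.
def add(a, b): return [x + y for x, y in zip(a, b)]
def sub(a, b): return [x - y for x, y in zip(a, b)]
def dist(a, b): return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

emb = {
    "man":   [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king":  [3.0, 0.0],
    "queen": [3.0, 1.0],
}

# What is to "king" as "woman" is to "man"?  target = king + (woman - man)
target = add(emb["king"], sub(emb["woman"], emb["man"]))
answer = min((w for w in emb if w != "king"), key=lambda w: dist(emb[w], target))
print(answer)  # queen
```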

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZAba4l-U--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/96dpeccpjglqlueias3q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZAba4l-U--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/96dpeccpjglqlueias3q.png" alt="Image description" width="880" height="87"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--G7kSM7eq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/i68gfgxkoe9mxj09gk07.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--G7kSM7eq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/i68gfgxkoe9mxj09gk07.png" alt="Image description" width="880" height="92"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4fYzZWSD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cz2wq4xdja017wmeecdp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4fYzZWSD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cz2wq4xdja017wmeecdp.png" alt="Image description" width="880" height="95"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once we have a smart way of embedding words while capturing their semantics and logic, a wide variety of tasks becomes possible, since text data is all around us. We can summarize long texts, do machine translation, create our own Google Translate and, most interestingly, generate text given a large enough dataset to learn from.&lt;/p&gt;




&lt;p&gt;At &lt;a href="https://aiflow.ltd/?utm_source=devto&amp;amp;utm_campaign=16feb2022-wordtovec"&gt;aiflow.ltd&lt;/a&gt;, we handle the computations so you can do Machine Learning without the hustle of understanding the math concepts behind it. Subscribe to the &lt;a href="https://www.aiflow.ltd/console"&gt;newsletter&lt;/a&gt; to find out more!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>startup</category>
      <category>nlp</category>
    </item>
    <item>
      <title>The Exploration-Exploitation dilemma: you miss all the shots you don’t take</title>
      <dc:creator>AiFlow</dc:creator>
      <pubDate>Wed, 09 Feb 2022 14:34:45 +0000</pubDate>
      <link>https://dev.to/aiflowltd/the-exploration-exploitation-dilemma-you-miss-all-the-shots-you-dont-take-1d3m</link>
      <guid>https://dev.to/aiflowltd/the-exploration-exploitation-dilemma-you-miss-all-the-shots-you-dont-take-1d3m</guid>
      <description>&lt;p&gt;In Machine Learning as well as in real life, there might be some patterns that bring wealth and others that don’t. As individuals, we tend to stay away from the latter and lean towards the former. Often, people get comfortable enough with one specific pattern that goes well that they do not bother to explore other variations. This is an ongoing issue in all kinds of Machine Learning frameworks. They have to leverage both the power of mining known well-performing regions (exploitation), but also the diversity that looking for better alternatives yields (exploration).&lt;/p&gt;

&lt;h3&gt;
  
  
  Problem definition
&lt;/h3&gt;

&lt;p&gt;The exploration-exploitation dilemma is a general problem that can be encountered whenever there is a feedback loop between data gathering and decision-making. That is, whenever a model transitions from spectator to actor in the data-collection process, this problem may arise. &lt;/p&gt;

&lt;p&gt;To better understand this idea, let’s look at two example models:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A model that predicts whether an image contains a cat or a dog. In this case, all the data required is collected, then the model is trained. After training, the model is used in one way or another to make predictions without any more collection of data, so no more exploration is needed.&lt;/li&gt;
&lt;li&gt;A model that predicts click-through rate (CTR) for some ads. In this case, the model is trained with some initial data and then is continuously updated whenever a user interacts with an ad. This model becomes an actor in the process of data collection so the trade-offs between exploration-exploitation need to be carefully considered.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Before going forward it’s worth mentioning some real-life examples of the exploration-exploitation dilemma:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Going to your favorite restaurant, or trying a new one?&lt;/li&gt;
&lt;li&gt;Keep your current job or hunt around?&lt;/li&gt;
&lt;li&gt;Take the normal route home or try another?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Cwp0FU0C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ob27iiqwn5ynvyr76c4r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Cwp0FU0C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ob27iiqwn5ynvyr76c4r.png" alt="Robot making decision" width="880" height="534"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Handling the dilemma
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Greedy approach&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As the name states, this approach simply consists of taking the best action with respect to the actor’s current knowledge. As we can deduce, this strategy means full exploitation; the drawback is that we do no exploration at all, so we may keep making sub-optimal decisions because we never get the chance to explore the problem space.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ɛ-greedy approach&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Ɛ-greedy algorithm is a simple, yet very effective variation of the greedy approach. It consists of choosing the best action with respect to the current knowledge with probability (1-Ɛ), which is exploitation, or a completely random action with probability Ɛ, which is exploration. The Ɛ parameter balances exploration against exploitation: a high value of Ɛ yields a more explorative approach, while a low value emphasizes exploitation. This approach is widely used in Reinforcement Learning, for example to enhance the Q-learning algorithm.&lt;/p&gt;
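&lt;p&gt;As a minimal sketch in Python (the per-action value estimates and the incremental update rule below are one common choice, not the only one), the Ɛ-greedy decision and the bookkeeping that follows an observed reward could look like this:&lt;/p&gt;

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Pick an action: explore with probability epsilon, otherwise exploit."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))  # exploration: random action
    # exploitation: action with the highest estimated value
    return max(range(len(q_values)), key=q_values.__getitem__)

def update_estimate(q_values, counts, action, reward):
    """Incremental running-average update of the chosen action's estimate."""
    counts[action] += 1
    q_values[action] += (reward - q_values[action]) / counts[action]
```

&lt;p&gt;With Ɛ = 0 this reduces to the purely greedy strategy above; with Ɛ = 1 it never exploits at all.&lt;/p&gt;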

&lt;p&gt;&lt;strong&gt;Ɛ-decaying greedy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It’s good that we introduced exploration into our approach, but ideally we would stop exploring once we have found the best strategy. In practice, however, it’s very difficult to pinpoint when that happens, so to approximate this behavior we can choose an initial value for the Ɛ parameter and then slowly decrease it as we explore the problem space. The decay rate depends heavily on the problem we are trying to solve, so one needs to find the value that yields the best results.&lt;/p&gt;
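&lt;p&gt;One possible decay schedule, sketched below: the exponential form and the floor value are hypothetical choices for illustration, and in practice both would be tuned per problem:&lt;/p&gt;

```python
def decayed_epsilon(epsilon_start, decay_rate, step, epsilon_min=0.01):
    """Exponentially decay epsilon per step, floored at epsilon_min."""
    return max(epsilon_min, epsilon_start * decay_rate ** step)
```

&lt;p&gt;Early on the agent explores heavily; as its estimates mature, Ɛ approaches the floor and the agent mostly exploits.&lt;/p&gt;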




&lt;p&gt;Those were just a few insights into the Exploration-Exploitation dilemma. At &lt;a href="https://www.aiflow.ltd"&gt;aiflow.ltd&lt;/a&gt;, we try to automate as many processes as possible, to make sure we find the best balance between Exploration and Exploitation. If you’re curious to find out more, subscribe to our newsletter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;References&lt;/strong&gt;: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/the-exploration-exploitation-dilemma-f5622fbe1e82"&gt;https://towardsdatascience.com/the-exploration-exploitation-dilemma-f5622fbe1e82&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://web.stanford.edu/~bvr/pubs/TS_Tutorial.pdf"&gt;https://web.stanford.edu/~bvr/pubs/TS_Tutorial.pdf&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>startup</category>
      <category>datascience</category>
      <category>explorationexploitation</category>
    </item>
    <item>
      <title>Evolutionary algorithms - AI that just becomes better</title>
      <dc:creator>AiFlow</dc:creator>
      <pubDate>Wed, 02 Feb 2022 15:24:18 +0000</pubDate>
      <link>https://dev.to/aiflowltd/evolutionary-algorithms-ai-that-just-becomes-better-18ka</link>
      <guid>https://dev.to/aiflowltd/evolutionary-algorithms-ai-that-just-becomes-better-18ka</guid>
      <description>&lt;p&gt;Have you ever wondered how some algorithms just become better the more you use them, for example, better social media recommendations or highly personalized YouTube feeds? Today we’re looking at one of the many approaches one can take to achieve this: Evolutionary algorithms.&lt;/p&gt;

&lt;p&gt;Evolutionary algorithms are inspired by the Darwinian theory that evolution is driven by a set of primal operations: survival of the fittest, genome crossover, and random mutation as a result of adapting to the environment. They also rely on the fact that small improvements in the genomes of individuals can, over time, through survival of the fittest, lead to great advancements of the species. Being fit in such a population is highly subjective and is a function of the environment of the specific species. Gradually eliminating the less fit members of the population, while allowing, with greater probability, the fittest to reproduce, will, over a number of epochs (the time unit in evolutionary algorithms), lead to an overall fitter population.&lt;/p&gt;

&lt;p&gt;Using evolutionary algorithms in automated machine learning leverages the power of both evolution and statistics: they mimic the trial and error performed by data scientists when approaching a project, and they follow the statistically fitter individuals, always pursuing the search in places where something is known to be found. They also do not hesitate to explore new paths from time to time, leading to the well-known exploration-exploitation dilemma, which we’ll explain in depth in the following weeks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cwaeziYm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/samn41z6hn6awpa78q8c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cwaeziYm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/samn41z6hn6awpa78q8c.png" alt="Evolutionary algorithms" width="487" height="315"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Looking at this flow of evolutionary algorithms, the approach seems natural, since, as human beings, we are accustomed to it. The flow is a simple way of searching through the tightest corners of the search space and discovering untried configurations. Given enough time and the right crossover, mutation, and selection logic, evolutionary algorithms are able to converge to global optima and yield the desired result.&lt;/p&gt;

&lt;p&gt;Using the concepts of evolutionary algorithms in automated machine learning, one can search through the configuration space more efficiently, finding gradually better methods. Genomes can be considered as different configurations and their fitness is some metric on the dataset, such as classification accuracy. Data engineering is also subject to evolutionary optimizations since the techniques of generating features can be combined in various ways in order to extract the most valuable information from the dataset. Both feature engineering and model training using evolutionary algorithms can lead to interesting results, subject to further analysis by data scientists or &lt;a href="https://www.aiflow.ltd/"&gt;AutoML tools&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;As also mentioned previously in this section, Evolutionary Algorithms have some basic operations that guarantee the coverage of the search space provided enough time, namely random initialization, selection, cross-over, and mutation. Let’s see how those 4 steps help find better neural network architectures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Random initialization&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Neural networks have a few parameters available for tuning, some of them being the number of layers and the count of neurons in each, activation functions, dropout rates, and the learning rate. During the initialization phase, the specifications of each neural network that is being built are randomly selected from the available pool, be it a list of predefined choices or a continuous interval. The randomness provides diversity in the population.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Selection&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At each epoch, two chromosomes (in this case, neural networks) are randomly selected for crossover. Multiple selection procedures exist, each with empirically proven good performance. To name a few: random selection picks a random chromosome, tournament selection samples a k-sized random sub-population and returns the best from it, and roulette selection picks individuals with a probability directly proportional to their fitness.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;CrossOver&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After 2 chromosomes are sampled from the population, a cross-over (also known as an XO operator) is performed, producing an offspring. For example, for each property of the neural network, the offspring either inherits it from one of its parents with a given probability or receives a combination of both (e.g. the average of the parents’ learning rates).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Mutation&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After the offspring is generated, the mutation operator M is applied in order to slightly transform its properties. The transformation M(off) yields a mutated offspring. For example, the mutation can consist of random perturbations of continuous parameters (increasing or decreasing the learning rate) or random switches in discrete ones (selecting a different activation function).&lt;/p&gt;
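&lt;p&gt;The four operators above can be sketched in a few lines of Python. The search space, gene names, and probabilities below are hypothetical, chosen only to illustrate the flow of evolving network configurations:&lt;/p&gt;

```python
import random

# Hypothetical search space for a small neural network
SPACE = {
    "n_layers": [1, 2, 3, 4],
    "activation": ["relu", "tanh", "sigmoid"],
    "learning_rate": (1e-4, 1e-1),  # continuous interval
}

def random_individual():
    """Random initialization: sample each gene from its pool or interval."""
    return {
        "n_layers": random.choice(SPACE["n_layers"]),
        "activation": random.choice(SPACE["activation"]),
        "learning_rate": random.uniform(*SPACE["learning_rate"]),
    }

def tournament_select(population, fitness, k=3):
    """Selection: sample k individuals and keep the fittest."""
    return max(random.sample(population, k), key=fitness)

def crossover(a, b):
    """XO: discrete genes come from one parent, continuous ones are averaged."""
    return {
        "n_layers": random.choice([a["n_layers"], b["n_layers"]]),
        "activation": random.choice([a["activation"], b["activation"]]),
        "learning_rate": (a["learning_rate"] + b["learning_rate"]) / 2,
    }

def mutate(ind, p=0.1):
    """Mutation: discrete switch or continuous perturbation, each with prob. p."""
    child = dict(ind)
    if random.random() < p:
        child["activation"] = random.choice(SPACE["activation"])
    if random.random() < p:
        child["learning_rate"] *= random.uniform(0.5, 2.0)
    return child
```

&lt;p&gt;A full run would repeat select, crossover, and mutate for a number of epochs, with fitness measured by training each candidate network and scoring it on a validation set.&lt;/p&gt;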




&lt;p&gt;That was a short introduction to how AI algorithms can become better without much human input. At &lt;a href="https://www.aiflow.ltd/"&gt;AI Flow&lt;/a&gt; we take evolutionary algorithms one step further and automate the whole flow from data to deployed models. Check it out!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>startup</category>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>How to visualize data</title>
      <dc:creator>AiFlow</dc:creator>
      <pubDate>Wed, 26 Jan 2022 15:05:28 +0000</pubDate>
      <link>https://dev.to/aiflowltd/how-to-visualize-data-2462</link>
      <guid>https://dev.to/aiflowltd/how-to-visualize-data-2462</guid>
      <description>&lt;p&gt;As discussed in last &lt;a href="https://dev.to/aiflowltd/basic-data-preparation-for-machine-learning-1f7c"&gt;week’s article&lt;/a&gt;, data is the core of every learning algorithm and we need lots of it to create a good intelligent product, but most of the time the type of algorithm we are going to use highly depends on what kind of information we are dealing with. To get some sense of the information we are working with, we use data visualization techniques. &lt;/p&gt;

&lt;h3&gt;
  
  
  What is data visualization?
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gWtsSSVk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/81filq26dzpdga3k81kg.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gWtsSSVk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/81filq26dzpdga3k81kg.jpeg" alt="Time Series" width="880" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Data visualization refers to an efficient graphical representation of data or information, for example, taking a spreadsheet’s content and converting it into a bar or line chart. It is a particularly efficient way of communicating when the information we are dealing with is numerous or complex, as for example, a time series.&lt;/p&gt;

&lt;p&gt;From a formal point of view, the representation can be considered a mapping between the original data (usually numeric) and graphic elements, for example, lines or points in a chart. The mapping determines how the attributes of these elements vary according to the data; in a bar chart, for instance, the length of a bar is mapped to the magnitude of a variable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why do we visualize data?
&lt;/h3&gt;

&lt;p&gt;To determine the best learning algorithm for our problem, we need to understand our data. Most of the time it’s hard to have an intuition about the data we are working with, and some algorithms only work on specific datasets. For example, a linear classifier won’t work well on a dataset that is not linearly separable.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--E5qNpMsc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1l6awif0sv3y2lw8jo7n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--E5qNpMsc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1l6awif0sv3y2lw8jo7n.png" alt="Linearly vs non linearly separable" width="880" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You have likely heard the old saying: &lt;em&gt;a picture is worth a thousand words,&lt;/em&gt; but sometimes in the field of machine learning it’s hard to find a compelling visualization for your data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Visualization methods
&lt;/h3&gt;

&lt;p&gt;As we humans cannot visualize in more than 3 dimensions (although some mathematicians can gain intuition in 4 dimensions), we have to reduce the dimensions of our dataset so we can visualize it properly. Two of the main methods to reduce dimensions are &lt;strong&gt;Principal Component Analysis (PCA)&lt;/strong&gt; and &lt;strong&gt;t-Distributed Stochastic Neighbor Embedding (t-SNE)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Principal Component Analysis (PCA)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;PCA is a dimensionality reduction method that is often used to reduce the dimensionality of large data sets, by transforming a large set of variables into a smaller one that still contains most of the information in the large set.&lt;/p&gt;

&lt;p&gt;The main steps of PCA are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;calculate the mean of each column&lt;/li&gt;
&lt;li&gt;center the value in each column by subtracting the mean column value&lt;/li&gt;
&lt;li&gt;calculate covariance matrix of centered matrix&lt;/li&gt;
&lt;li&gt;calculate eigendecomposition of the covariance&lt;/li&gt;
&lt;/ul&gt;
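&lt;p&gt;For intuition, here is a sketch of those steps for the simplest non-trivial case, 2-D points, in plain Python (the eigenvalues of a symmetric 2x2 covariance matrix have a closed form via the quadratic formula):&lt;/p&gt;

```python
def pca_2d(points):
    """PCA on 2-D points: center, covariance, eigendecomposition."""
    n = len(points)
    # 1) mean of each column
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # 2) center each column by subtracting its mean
    centered = [(x - mx, y - my) for x, y in points]
    # 3) sample covariance matrix [[cxx, cxy], [cxy, cyy]]
    cxx = sum(x * x for x, _ in centered) / (n - 1)
    cyy = sum(y * y for _, y in centered) / (n - 1)
    cxy = sum(x * y for x, y in centered) / (n - 1)
    # 4) eigenvalues of the symmetric 2x2 covariance, largest first
    tr, det = cxx + cyy, cxx * cyy - cxy * cxy
    disc = (tr * tr / 4 - det) ** 0.5
    return tr / 2 + disc, tr / 2 - disc
```

&lt;p&gt;For points lying exactly on a line, all the variance falls on the first principal component and the second eigenvalue is zero, so the data can be projected to one dimension without losing information.&lt;/p&gt;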

&lt;p&gt;So, PCA tries to reduce the number of variables in a dataset while preserving as much information as possible. The main downside of PCA is that it is a linear method: it works well when the important structure in multidimensional data is linear, but if the dataset is not, PCA will often lose a lot of information.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--U_zyd_jt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/azk0uhttx4i7lr8sda8s.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--U_zyd_jt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/azk0uhttx4i7lr8sda8s.gif" alt="PCA evolution" width="880" height="352"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;t-Distributed Stochastic Neighbor Embedding (t-SNE)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The main difference between t-SNE and PCA is that t-SNE is a non-linear dimensionality reduction algorithm.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It takes a set of points in a high-dimensional space and converts them into a low-dimensional one.&lt;/li&gt;
&lt;li&gt;It is a non-linear method and adapts to the underlying data, performing different transformations in different regions.&lt;/li&gt;
&lt;li&gt;It’s incredibly flexible and often finds a structure where other dimensionality reduction algorithms can’t.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YOP_1odB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yzpqdeh8n9amu1h9sb7u.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YOP_1odB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yzpqdeh8n9amu1h9sb7u.gif" alt="t-SNE" width="880" height="271"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;These were just a few insights into data visualization. At &lt;a href="http://aiflow.ltd"&gt;aiflow.ltd&lt;/a&gt;, we automatically create visualizations for you, to make sure you get a sense of your data. If you’re curious to find out more, subscribe to our newsletter and see our other articles.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;References&lt;/strong&gt;: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://distill.pub/2016/misread-tsne/"&gt;https://distill.pub/2016/misread-tsne/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://builtin.com/data-science/step-step-explanation-principal-component-analysis"&gt;https://builtin.com/data-science/step-step-explanation-principal-component-analysis&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>datascience</category>
      <category>startup</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Basic data preparation for Machine Learning</title>
      <dc:creator>AiFlow</dc:creator>
      <pubDate>Wed, 19 Jan 2022 16:37:29 +0000</pubDate>
      <link>https://dev.to/aiflowltd/basic-data-preparation-for-machine-learning-1f7c</link>
      <guid>https://dev.to/aiflowltd/basic-data-preparation-for-machine-learning-1f7c</guid>
      <description>&lt;p&gt;The very core of every learning algorithm is data. The more, the better. Experiments show that for a learning algorithm to reach its full potential, the data that we feed to it must be as qualitative as it is quantitative. To achieve state of the art results in data science projects, the main material, namely data, has to be ready to be shaped and moulded as our particular situation demands. Algorithms that accept data as raw and unprocessed as it is are scarce and often fail to leverage the full potential of machine learning and of the dataset itself.&lt;/p&gt;

&lt;p&gt;The road between raw data and the actual training of one model is far from straight and often requires various techniques of data processing to reveal insights and to emphasize certain distributions of the features.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fajht5lt6qd58uei6bzhh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fajht5lt6qd58uei6bzhh.png" alt="Demo dataset"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Take for example this dataset from the real estate business. It is by no means an easy job to accurately predict the final acquisition price of a house from the raw data alone. There is no way a generic algorithm could make a difference between ‘1Story’ and ‘2Story’ in the HouseStyle column, or reconcile two different value scales, like the year in the YearBuilt column and the mark in the OverallQual column.&lt;/p&gt;

&lt;p&gt;Data preparation has the duty of building an adequate value distribution for each column so that a generic algorithm could learn features from it.&lt;br&gt;
Some examples are rescaling the values, turning text information into categorical or extracting tokens from continuous string values, like product descriptions. This post will give some insights into how to transform raw data into formats that make the most out of it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Handling missing data
&lt;/h3&gt;

&lt;p&gt;As data-producing sources are rarely perfect, raw datasets have missing values. Generic algorithms cannot handle such cases, and replacing the missing fields with random values opens the possibility of obtaining any random output, so methods have been developed to replace them while keeping the data distribution in place, unmodified.&lt;/p&gt;

&lt;p&gt;The basic solution is dropping the rows or columns that contain an excessive amount of missing fields, since replacing all the empty fields with the same default value might bias the model rather than create valuable insights. You might imagine that this approach is not optimal, since we don’t want to delete data that might prove itself valuable. Instead, setting missing values to the median of the column has, experimentally, provided good results, since it keeps a similar data distribution.&lt;/p&gt;
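&lt;p&gt;A minimal sketch of median imputation for a single column, assuming missing fields are represented as None:&lt;/p&gt;

```python
import statistics

def impute_median(column):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in column if v is not None]
    med = statistics.median(observed)
    return [med if v is None else v for v in column]
```

&lt;p&gt;Because the median is robust to extreme values, this keeps the column’s distribution closer to the original than filling with an arbitrary default would.&lt;/p&gt;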

&lt;h3&gt;
  
  
  Handling outliers
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo43h6zqo6hpmr5s2di98.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo43h6zqo6hpmr5s2di98.png" alt="Outliers"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Outliers are data points that lie far away from the majority of data samples in the geometrical space of the distribution. Since these observations are far from the mean, they can influence the learning algorithm in an unwanted way, biasing it towards the tails of the distribution. A common way to handle outliers is Outlier Capping, which limits the range, casting a value X in the range [ m - std * delta , m + std * delta ], where m is the median value of the distribution, std the standard deviation and delta an arbitrarily chosen scale factor. This is how we would write this more formally: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwxq096ujr7hdar1p03y0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwxq096ujr7hdar1p03y0.png" alt="Outliers math equation"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating polynomial features
&lt;/h3&gt;

&lt;p&gt;It is often the case in machine learning that a feature is not linearly separable. Although more complex algorithms cope with non-linearly separable search spaces, they might sacrifice accuracy to cover all the nonlinearities. Thus, creating polynomial features can help learning algorithms separate the search space with more ease, yielding better results in the end.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuvyn3mjlz13qdkckcd5b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuvyn3mjlz13qdkckcd5b.png" alt="Polynomial features"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Generating the second degree polynomial of feature 1 and adding it to the dataset yields a better representation of the geometrical data space, thus making it easier to be split. Although this is a shallow example, it clearly illustrates the importance of polynomial features in machine learning. On a large scale, polynomials of multi-feature combinations are taken into consideration for generating even more insights from the data.&lt;/p&gt;
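&lt;p&gt;A minimal sketch of generating polynomial features for a tabular dataset, where each row is a plain list of numeric features:&lt;/p&gt;

```python
def add_polynomial_features(rows, degree=2):
    """Append the powers 2..degree of every feature to each row."""
    return [
        row + [x ** d for d in range(2, degree + 1) for x in row]
        for row in rows
    ]
```

&lt;p&gt;A fuller treatment would also include cross-terms (products of different features), which is what libraries typically generate; the sketch above covers only the single-feature powers discussed here.&lt;/p&gt;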




&lt;p&gt;Those were just a few data processing steps to consider. At &lt;a href="https://www.aiflow.ltd/" rel="noopener noreferrer"&gt;aiflow.ltd&lt;/a&gt;, we automatically process the data with many more steps, to make sure the prediction quality of our automated algorithms is the best we can achieve. If you’re curious to find out more, subscribe to our newsletter on &lt;a href="https://www.aiflow.ltd/" rel="noopener noreferrer"&gt;aiflow.ltd&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>cloud</category>
      <category>machinelearning</category>
      <category>startup</category>
    </item>
    <item>
      <title>Reinforcement Learning: make machines learn like humans</title>
      <dc:creator>AiFlow</dc:creator>
      <pubDate>Wed, 12 Jan 2022 15:12:43 +0000</pubDate>
      <link>https://dev.to/aiflowltd/reinforcement-learning-make-machines-learn-like-humans-1pok</link>
      <guid>https://dev.to/aiflowltd/reinforcement-learning-make-machines-learn-like-humans-1pok</guid>
      <description>&lt;p&gt;Since the beginning of computing, mathematicians and computer scientists were concerned about how to create programs or machines capable of learning in the same way as humans do, that is learning from interaction with our environment, which is a foundational idea underlying nearly all theories of learning and intelligence.&lt;/p&gt;

&lt;p&gt;One compelling example of such learning in computer science is &lt;a href="https://deepmind.com/research/case-studies/alphago-the-story-so-far"&gt;AlphaGo&lt;/a&gt;, the first computer program to defeat a professional human Go player and the first to defeat a Go world champion. The main technique used for creating and training AlphaGo is called Reinforcement Learning. &lt;/p&gt;

&lt;p&gt;Reinforcement learning (RL) deals with the problem of how an autonomous agent that perceives and acts in an environment can learn to select optimal actions to achieve its goals. RL is used in many practical problems, such as controlling autonomous robots, finding solutions to optimization problems (for example, operations in factories), or playing board games. In all these problems, the agent must learn to choose optimal actions through the reinforcements received after interacting with its environment: it performs actions and receives rewards (or reinforcements) in the form of numerical values that evaluate how good the selected actions were. The agent simply has a given goal to achieve and must learn how to achieve it by trial-and-error interactions with the environment. In short, RL is learning how to map situations to actions so as to maximize the cumulative reward received when starting from some initial state and proceeding to a final state.&lt;/p&gt;

&lt;p&gt;A general RL task is characterized by four components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The environment state space S represents all possible states of the agent in the environment. For example, in a world represented as a grid, every cell is a state. &lt;/li&gt;
&lt;li&gt;The action space A consists of all actions that the learning agent can perform in the environment. &lt;/li&gt;
&lt;li&gt;The transition function δ specifies the non-deterministic behavior of the environment (i.e. the possibly random outcomes of taking each action in any state). &lt;/li&gt;
&lt;li&gt;The last component of the RL task is the reinforcement (reward) function which defines the possible reward of taking an action in a particular state.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So, to put it simply, the agent’s task in a reinforcement learning scenario is to learn an optimal policy that maximizes the expected sum of the delayed rewards for all states in S.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3AF7sK9t--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/52lxr9heij11nhh51kdd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3AF7sK9t--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/52lxr9heij11nhh51kdd.png" alt="RL ENV" width="700" height="208"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One of the most widely used RL algorithms is the Q-learning algorithm. &lt;/p&gt;

&lt;p&gt;In this algorithm, the agent learns an action-value function (Q) that gives the expected utility of taking a given action in a given state, without needing a model of its environment. The training process works as follows: over several training episodes, the agent tries (possibly optimal) candidate solution paths from the initial state to a final state. After performing an action in its environment, the agent receives a reward and updates its Q-value estimates according to Bellman’s equation, where Q(s, a) denotes the estimate of the Q-value associated with state s and action a, α represents the learning rate, and γ is the discount factor for future rewards. Of course, RL problems can get very complicated when there are many states and actions to choose from; in those cases, Q-learning alone is far from enough to solve the problem.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--goeRNjYs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pnv4ew3xpjpsmvqn0cb8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--goeRNjYs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pnv4ew3xpjpsmvqn0cb8.png" alt="Bellman's EQ" width="880" height="60"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Summing up, this post aims to provide a brief introduction to the field of Reinforcement Learning, pointing out the general framework for solving RL tasks. This is a very exciting field of Machine Learning because it’s the closest we can get to human behavior at the moment.&lt;/p&gt;

&lt;p&gt;At AI Flow, we handle the complicated math stuff, so you can create more intelligent products. Subscribe and get one month free when we launch the beta, at &lt;a href="https://www.aiflow.ltd/"&gt;aiflow.ltd&lt;/a&gt;. Also don’t forget to check our &lt;a href="https://bit.ly/3FjTg7R"&gt;Product Hunt campaign&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Automated Machine Learning</title>
      <dc:creator>AiFlow</dc:creator>
      <pubDate>Wed, 05 Jan 2022 16:46:18 +0000</pubDate>
      <link>https://dev.to/aiflowltd/automated-machine-learning-3j3e</link>
      <guid>https://dev.to/aiflowltd/automated-machine-learning-3j3e</guid>
      <description>&lt;p&gt;As the global data quantity already follows an exponential trend, machine learning has become present in every application, creating a great demand for general know-how, be it data scientists or computer scientists with related knowledge. Currently, the demand for work to be done surpasses the offer of such professionals, thus automatic solutions have to be found.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At &lt;a href="https://www.aiflow.ltd/"&gt;aiflow&lt;/a&gt;, we develop solutions that would enable any software engineer to use AI in their projects in a simple way, without needing to have AI specific knowledge. They will use their data, we’ll automatically train and deploy accurate models, and return the API keys for further predictions. Subscribe to the newsletter and get one month free worth $200 when we launch the beta.&lt;/p&gt;

&lt;p&gt;The classical machine learning process involves a few steps that have become standard nowadays, namely:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Data engineering&lt;/li&gt;
&lt;li&gt;Model selection&lt;/li&gt;
&lt;li&gt;Hyperparameter tuning&lt;/li&gt;
&lt;li&gt;The actual model training&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Due to the highly repetitive, trial-and-error nature of these tasks, automation can play a big role in optimizing the time spent on them. Automated Machine Learning helps the process by adding different optimization techniques that let data scientists be more productive and achieve similar or better results in a shorter period of time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--n-I8HxxI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/udihqgwkbzdf7kztrffg.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--n-I8HxxI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/udihqgwkbzdf7kztrffg.jpg" alt="Data has a better idea" width="880" height="525"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since every new product or proposed algorithm has to solve a real problem in order to be successful, the main goal of Automated Machine Learning is to save time and reduce human error. In the classical machine learning flow, everything starts with raw data. As the data-producing sources are not perfect and are often not optimized for analytics, the datasets they produce are far from ready to be fed to a learning algorithm. This creates the need for data cleaning and engineering, which usually takes more time than expected.&lt;/p&gt;

&lt;p&gt;Given that the performance of any learning algorithm (in terms of better scores) depends mostly on the data it receives and less on later optimizations, the feature engineering step has to be consistent and well planned in order to achieve top results.&lt;br&gt;
Right after the data engineering phase comes model selection, which is again subject to trial and error. The fact that some tasks have well-known model types that work well only reduces the search space; there is still plenty of room to choose between different approaches. The last step, once a model is chosen, is hyperparameter optimization, which leaves several degrees of freedom to settle before training starts.&lt;/p&gt;

&lt;p&gt;Having walked through just three major steps of a classical machine learning pipeline (raw data to trained model), there are clearly places where automation can help: automatically iterating over the search space can yield better, previously unexplored configurations of the data engineering stage, and can produce surprising model configurations that work better than expected. Statistics show that the time spent on data preparation and hyperparameter selection can be as high as 80% of total project time; the rest is spent on the largely automatic training of the model.&lt;br&gt;
With so much time to be saved, it is worth researching better alternatives to the whole machine learning flow rather than simply continuing with the classical approach. Good data scientists are not only scarce but also expensive nowadays, so finding a near-optimal automated alternative is a priority.&lt;/p&gt;

&lt;p&gt;Automated Machine Learning is clearly an optimization challenge. There is no single well-known algorithm or one-size-fits-all approach that solves every learning problem; various configurations have to be searched to decide which one performs best. Given that data scientists are dealing not with a finite, continuous search space but with an infinite, non-convex one, heuristics have to be developed to find a configuration close to the best solution.&lt;/p&gt;

&lt;p&gt;Since this landscape, known theoretically as the search space, is infinite, heuristics have to be developed with the goal of finding a reasonably good optimum in the least amount of time. These algorithms come in different forms, from conventional methods like grid or random search to more advanced ones, namely Bayesian optimization, reinforcement learning, and nature-inspired evolutionary algorithms. As AutoML is a relatively young research field, the first two have already been used with decent results; the last, being a more natural way of converging toward global optima, is an alternative worth exploring.&lt;/p&gt;
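&lt;p&gt;As a baseline for comparison, here is what the simplest of those conventional methods, random search, looks like. The two-dimensional hyperparameter space and the objective function are illustrative assumptions standing in for a real train-and-validate loop.&lt;/p&gt;

```python
# Illustrative random search over a hypothetical hyperparameter space.
import random

random.seed(42)

def validation_error(lr, depth):
    # Stand-in for training a model and measuring its validation error;
    # the true optimum here is lr=0.1, depth=6.
    return (lr - 0.1) ** 2 + ((depth - 6) ** 2) * 0.01

best = None
for _ in range(200):
    cfg = (random.uniform(0.001, 1.0), random.randint(1, 12))
    err = validation_error(*cfg)
    if best is None or best[0] > err:
        best = (err, cfg)

print("best error %.4f at lr=%.3f, depth=%d" % (best[0], best[1][0], best[1][1]))
```

&lt;p&gt;Random search treats every sample independently; the smarter methods named above differ precisely in that they use past evaluations to decide where to look next.&lt;/p&gt;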

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--i0uT4or---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/drugj9w78l912tkezaqr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--i0uT4or---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/drugj9w78l912tkezaqr.png" alt="Evolutionary algorithms" width="880" height="587"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Evolutionary Algorithms are inspired by the Darwinian theory that evolution proceeds through a set of primal operations such as survival of the fittest, genome crossover, and random mutation in response to the environment. They also rely on the fact that small improvements in individuals' genomes can, over time and through survival of the fittest, lead to great advancements of the species. Fitness in such a population is highly subjective and is a function of the environment of the particular species; in general, a fitter individual does better in its environment and at the tasks it is supposed to carry out. Gradually eliminating less fit members from the population while letting the fittest reproduce with greater probability will, over a number of iterations (also known as epochs), lead to an overall fitter population.&lt;/p&gt;

&lt;p&gt;Using the concepts of evolutionary algorithms in Automated Machine Learning, one can search the configuration space more efficiently, finding gradually better methods. Genomes can be treated as different configurations, and their fitness as some metric on the dataset, combined with metrics from the training phase such as the convergence rate or how quickly the loss drops over time. Data engineering is also a candidate for evolutionary optimization, since feature-generation techniques can be combined in various ways to extract the most valuable information from the dataset. Both feature engineering and model training driven by evolutionary algorithms can lead to interesting results, subject to further analysis by data scientists or artificial intelligence enthusiasts.&lt;/p&gt;
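&lt;p&gt;The genome/fitness/selection loop described above can be sketched in a few dozen lines. This is a toy illustration, not our actual search: the genome (learning rate and hidden-unit count) and the fitness function are hypothetical stand-ins for a trained model's quality.&lt;/p&gt;

```python
# Minimal evolutionary search over hyperparameter "genomes".
import random

random.seed(0)

def fitness(genome):
    # Higher is better; this stand-in peaks at lr=0.1 with 64 hidden units.
    lr, units = genome
    return -((lr - 0.1) ** 2) - ((units - 64) ** 2) * 1e-4

def mutate(genome):
    # Random mutation: small perturbations of each gene
    lr, units = genome
    return (max(1e-4, lr + random.gauss(0, 0.02)),
            max(1, units + random.randint(-8, 8)))

def crossover(a, b):
    # Genome crossover: learning rate from one parent, units from the other
    return (a[0], b[1])

population = [(random.uniform(0.001, 1.0), random.randint(4, 256))
              for _ in range(20)]

for epoch in range(50):
    # Survival of the fittest: keep the top half of the population
    population.sort(key=fitness, reverse=True)
    survivors = population[:10]
    # Reproduction: fit parents produce mutated offspring
    children = [mutate(crossover(random.choice(survivors),
                                 random.choice(survivors)))
                for _ in range(10)]
    population = survivors + children

best = max(population, key=fitness)
print("best genome:", best)
```

&lt;p&gt;In a real AutoML setting, evaluating the fitness function means training a candidate model, which is why keeping the population small and the selection pressure high matters so much.&lt;/p&gt;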

&lt;p&gt;To prove the aforementioned concept, at AI Flow we have developed an Automated Machine Learning pipeline that automatically iterates over the data engineering steps, finds good neural network architectures, sets their hyperparameters optimally and, finally, yields the trained model.&lt;/p&gt;

&lt;p&gt;The whole idea was to create a framework that could receive any dataset, regardless of its feature types, and yield a trained model that could then be reused for further predictions through scalable, plug-and-play APIs.&lt;/p&gt;

&lt;p&gt;Not only do Evolutionary Algorithms applied to AutoML yield peculiar yet well-performing neural network configurations, they also compare favorably with existing frameworks such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Auto Keras&lt;/li&gt;
&lt;li&gt;MLBox&lt;/li&gt;
&lt;li&gt;TPOT&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Summing up, the aim of this post is to provide a brief introduction to the field of AutoML, emphasizing its importance for analysis and time-saving purposes. This is exactly what we want to achieve at &lt;a href="https://www.aiflow.ltd/"&gt;aiflow.ltd&lt;/a&gt;, making it easy for software engineers to convert data into value through low/no-code solutions. Subscribe and get one month free when we launch the beta, at &lt;a href="https://www.aiflow.ltd/"&gt;aiflow.ltd&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>cloud</category>
      <category>startup</category>
    </item>
  </channel>
</rss>
