Introduction to Automated Machine Learning in Python with AutoGOAL


AutoGOAL is a novel Python framework for Automated Machine Learning, also known as AutoML.

What is AutoML

AutoML is an exciting new field of machine learning that attempts to bridge the gap between highly complex machine learning techniques and non-experts. In other words, it aims to lower the entry barrier to machine learning for those of us who don't have the time or resources to learn all the intricacies of each algorithm but who still need to solve real problems.

There are a lot of flavours of AutoML, but from a very pragmatic point of view, you can think of it as the design of high-level tools that automate most of the machine learning process, from data preprocessing to model selection and parameter tuning. The underlying problem is that even though machine learning is very promising, getting a real machine learning algorithm to work with real data beyond academic examples is hard: you have to prepare the data, select one algorithm (or family of algorithms) and possibly tune a bunch of very specific parameters, like regularization factors, number of neurons in a neural network layer, activation functions, and whatnot. There are simply too many options, and it requires a non-trivial level of expertise to even understand what they mean and, worse, how they impact the final performance of your model.

Ideally, getting machine learning to work should be as easy as:

from machine_learning import BlackMagic

algorithm = BlackMagic()
algorithm.learn(my_data)    # freshly taken from your DB
algorithm.predict(new_data) # maybe even from the users?

Unfortunately, current machine learning tools are far from this ideal, but AutoML researchers are trying to get there. For this reason, there is a lot of academic research as well as buzz around AutoML right now.
If you want a (quite technical) introduction, the AutoML book is a great resource.

Actually, next Saturday, July 18th, our team will be presenting AutoGOAL's first iteration at the AutoML Workshop co-located with the International Conference on Machine Learning (ICML), one of the top academic conferences in machine learning.

However, even though the field is young, there are plenty of awesome AutoML tools already out there that you can use today. The most useful ones, at least from a new developer's perspective, are the ones that give you black-box machine learning solutions.

If you've heard of AutoML before in the open-source world, chances are you've heard of AutoSklearn, AutoWeka or AutoKeras. These are wonderful tools which, as their names might hint, act as wrappers on top of very well-known machine learning libraries to give you something like that ideal black-box algorithm. If you need out-of-the-box machine learning solutions today, by all means, go look at these tools.

The AutoGOAL approach to AutoML

AutoGOAL is a new library in this world that tries to appeal to both high-level (non-expert) users and low-level (expert) users.

For example, with AutoGOAL you can do something like this (for real):

from autogoal.ml import AutoML

X, y = load_my_data()  # hypothetical helper: load your own dataset here

automl = AutoML()
automl.fit(X, y)

Cool, isn't it? AutoGOAL will automatically search through a vast collection of different algorithms (things like logistic regression, decision trees, some neural networks) and find an optimal (or at least good enough) solution within specified time and memory constraints. This is no silver bullet, though: there are a lot of restrictions on what X and y must be. Still, it is a step closer to that ideal.

This is the high-level API, a black-box AutoML class that works with many different problem types, from image classification to entity recognition in text. Under the hood, AutoGOAL actually has wrappers to hundreds of different algorithms from sklearn, gensim, nltk, pytorch, keras, spacy, and more. This is the first difference between AutoGOAL and most other similar tools: AutoGOAL doesn't really know about any specific API or library, nor does it implement any machine learning itself. It's a thin collection of wrappers compatible with virtually anything that even resembles a machine learning algorithm.

So, if you install AutoGOAL now (pip install autogoal) you will get only this thin layer. You still have to install sklearn, and/or keras, and/or the other libraries yourself, and AutoGOAL will then discover those libraries and automatically use them. We are continuously adding new wrappers (opencv is coming soon, for example).
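To picture what "discovering" installed libraries can mean, here is a minimal standard-library sketch. It only checks which optional backends are importable. The `OPTIONAL_BACKENDS` tuple and the `discover_backends` helper are hypothetical names made up for this illustration, and this is not AutoGOAL's actual discovery code.

```python
from importlib.util import find_spec

# Hypothetical list of optional backends, written with their import names
# (for example, pytorch is imported as "torch").
OPTIONAL_BACKENDS = ("sklearn", "keras", "gensim", "nltk", "torch", "spacy")


def discover_backends(names=OPTIONAL_BACKENDS):
    """Map each top-level module name to whether it can be imported."""
    return {name: find_spec(name) is not None for name in names}


available = discover_backends()
for name, installed in available.items():
    print(f"{name:8} {'found' if installed else 'missing'}")
```

A tool built this way can register wrappers only for the libraries that are actually present, which is why the base install can stay small.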

Low-level API for fine control

There are many details to AutoGOAL, but the main building blocks are based on the concept of defining classes or methods with type annotations that indicate the space of parameter values. For example, suppose you want to try a logistic regression from sklearn on some dataset. Here is some basic code to instantiate and evaluate it on some random data:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(random_state=0)  # Fixed seed for reproducibility

from sklearn.linear_model import LogisticRegression

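def evaluate(estimator, iters=30):
    """Return the mean test accuracy over `iters` random train/test splits."""
    scores = []

    for _ in range(iters):
        # Each iteration draws a new random 75/25 split of the data.
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
        estimator.fit(X_train, y_train)
        scores.append(estimator.score(X_test, y_test))

    return sum(scores) / len(scores)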

lr = LogisticRegression()
score = evaluate(lr)  # around 0.83

So far so good, but maybe we could do better with a different set of parameters. Logistic regression has at least two parameters that heavily influence its performance: the penalty function and the regularization strength.
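For contrast, this is roughly what tuning those two parameters by hand looks like: a hand-written grid over both. The candidate values are arbitrary choices for illustration, and sklearn's `cross_val_score` stands in for the `evaluate` function to keep the sketch self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(random_state=0)

# Arbitrary candidate values, chosen only for illustration.
penalties = ("l1", "l2")
strengths = (0.1, 1.0, 10.0)

results = {}
for penalty in penalties:
    for C in strengths:
        # liblinear supports both the l1 and l2 penalties.
        model = LogisticRegression(penalty=penalty, C=C, solver="liblinear")
        # Mean accuracy over 5 cross-validation folds.
        results[(penalty, C)] = cross_val_score(model, X, y, cv=5).mean()

best_params = max(results, key=results.get)
print(best_params, results[best_params])
```

Every extra parameter adds another nested loop, and continuous values such as C have to be discretized by hand.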

Instead of writing a loop through a bunch of different parameters, we can use AutoGOAL to automatically explore the space of possible combinations. We can do this with the class-based API by providing annotations for the parameters we want to explore.

from autogoal.grammar import Continuous, Categorical

class LR(LogisticRegression):
    def __init__(self, penalty: Categorical("l1", "l2"), C: Continuous(0.1, 10)):
        super().__init__(penalty=penalty, C=C, solver="liblinear")

The penalty: Categorical("l1", "l2") annotation tells AutoGOAL that for this class the parameter penalty can take values from a list of predefined values. Likewise, the C: Continuous(0.1, 10) annotation indicates that the parameter C can take a float value in a specified range.
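Annotations like these are ordinary Python objects attached to the function signature, so any tool can read them back. The sketch below shows the mechanism with the standard `inspect` module. The `Categorical` and `Continuous` classes here are simplified stand-ins written for this example (not AutoGOAL's own implementations), and `Model` and `describe_space` are hypothetical names.

```python
import inspect


class Categorical:
    """Stand-in annotation: one value out of a fixed set of options."""

    def __init__(self, *options):
        self.options = options


class Continuous:
    """Stand-in annotation: a float within [min, max]."""

    def __init__(self, min, max):
        self.min = min
        self.max = max


class Model:
    def __init__(self, penalty: Categorical("l1", "l2"), C: Continuous(0.1, 10)):
        self.penalty = penalty
        self.C = C


def describe_space(cls):
    """Collect the annotation object of every annotated __init__ parameter."""
    params = inspect.signature(cls.__init__).parameters
    return {
        name: p.annotation
        for name, p in params.items()
        if p.annotation is not inspect.Parameter.empty
    }


space = describe_space(Model)
print(space["penalty"].options)        # ('l1', 'l2')
print(space["C"].min, space["C"].max)  # 0.1 10
```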

Now we will use AutoGOAL to automatically generate different instances of our LR class. We achieve this by building a context-free grammar that describes all possible instances.

from autogoal.grammar import generate_cfg

grammar = generate_cfg(LR)
print(grammar)

This is the output for print(grammar):

<LR>         := LR (penalty=<LR_penalty>, C=<LR_C>)
<LR_penalty> := categorical (options=['l1', 'l2'])
<LR_C>       := continuous (min=0.1, max=10)

Basically, AutoGOAL introspects the type annotations and builds a grammar that describes the space of all possible instances of the LR class. We can now use AutoGOAL to search for the best instance, which will automatically try many different combinations of parameters intelligently (technically, it is using a probabilistic variant of an evolutionary algorithm called Grammatical Evolution).

from autogoal.search import PESearch

optimizer = PESearch(grammar, evaluate)
best, fn = optimizer.run(1000)

After a few iterations, best will be the best instance of LR and fn will be the actual value of evaluate for that instance.
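To get an intuition for what a probabilistic, population-based search does, here is a deliberately simplified sketch: sample candidates from a probability model, keep the best ones, and nudge the model toward them. It is not Grammatical Evolution and not AutoGOAL's implementation. The `toy_score` objective and every name in it are made up for illustration.

```python
import random


def toy_score(penalty, C):
    """Made-up objective: prefers 'l2' and values of C close to 3."""
    return (1.0 if penalty == "l2" else 0.5) - abs(C - 3.0) / 10.0


def probabilistic_search(score, iterations=50, pop_size=20, learning=0.2, seed=0):
    rng = random.Random(seed)
    p_l2 = 0.5             # probability of sampling penalty="l2"
    low, high = 0.1, 10.0  # current sampling range for C
    best, best_fn = None, float("-inf")

    for _ in range(iterations):
        # Sample a population from the current probability model.
        population = [
            ("l2" if rng.random() < p_l2 else "l1", rng.uniform(low, high))
            for _ in range(pop_size)
        ]
        ranked = sorted(population, key=lambda s: score(*s), reverse=True)
        elite = ranked[: pop_size // 4]

        # Remember the best solution seen so far.
        top_fn = score(*ranked[0])
        if top_fn > best_fn:
            best, best_fn = ranked[0], top_fn

        # Shift the model toward the elite samples.
        elite_l2 = sum(p == "l2" for p, _ in elite) / len(elite)
        p_l2 = (1 - learning) * p_l2 + learning * elite_l2
        low = (1 - learning) * low + learning * min(c for _, c in elite)
        high = (1 - learning) * high + learning * max(c for _, c in elite)

    return best, best_fn


best, fn = probabilistic_search(toy_score)
print(best, fn)
```

The return value mirrors the `best, fn` pair above: the best candidate found and its score.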

What's next

Now, this can go much deeper. You can define a full hierarchy of classes, with parameters that are instances of other classes (even recursively instances of itself) and AutoGOAL will infer a proper grammar for that space and optimize in it.
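To make the idea of nested search spaces concrete, here is a small sketch of a recursive sampler: when a parameter is annotated with a class, the sampler builds an instance of that class by sampling its parameters in turn. The `Categorical` and `Continuous` classes are simplified stand-ins, and `Scaler`, `Classifier`, `Pipeline`, and `sample` are hypothetical names. None of this is AutoGOAL's own code, and self-recursive classes are left out for brevity.

```python
import inspect
import random


class Categorical:
    """Stand-in annotation: one value out of a fixed set of options."""

    def __init__(self, *options):
        self.options = options


class Continuous:
    """Stand-in annotation: a float within [min, max]."""

    def __init__(self, min, max):
        self.min, self.max = min, max


class Scaler:
    def __init__(self, with_mean: Categorical(True, False)):
        self.with_mean = with_mean


class Classifier:
    def __init__(self, C: Continuous(0.1, 10)):
        self.C = C


class Pipeline:
    # Parameters annotated with a class are sampled recursively.
    def __init__(self, scaler: Scaler, classifier: Classifier):
        self.scaler = scaler
        self.classifier = classifier


def sample(annotation, rng):
    """Draw one random value (or instance) for the given annotation."""
    if annotation is inspect.Parameter.empty:
        raise TypeError("every parameter needs an annotation")
    if isinstance(annotation, Categorical):
        return rng.choice(annotation.options)
    if isinstance(annotation, Continuous):
        return rng.uniform(annotation.min, annotation.max)
    if inspect.isclass(annotation):
        params = inspect.signature(annotation.__init__).parameters
        kwargs = {
            name: sample(p.annotation, rng)
            for name, p in params.items()
            if name != "self"
        }
        return annotation(**kwargs)
    raise TypeError(f"unsupported annotation: {annotation!r}")


pipeline = sample(Pipeline, random.Random(0))
print(pipeline.scaler.with_mean, pipeline.classifier.C)
```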

As long as you can define your problem as a search for the best program (i.e., instances of classes with parameters) as measured by some function, AutoGOAL can help you out. In the docs you can find much more complex examples, both on state-of-the-art academic datasets and on problems that are not even related to machine learning.

AutoGOAL is still in alpha and in active development. If you need a production-ready AutoML framework, there are alternatives with out-of-the-box solutions. But if you want to tinker around, AutoGOAL provides a great level of expressiveness and requires very little code. You can find it on GitHub and on Docker Hub, preloaded with a bunch of machine learning libraries and GPU support.