<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Guillaume Chevalier</title>
    <description>The latest articles on DEV Community by Guillaume Chevalier (@guillaumechevalier).</description>
    <link>https://dev.to/guillaumechevalier</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F298269%2Fd8df91bb-53e5-4c14-8b82-88341b166de8.jpeg</url>
      <title>DEV Community: Guillaume Chevalier</title>
      <link>https://dev.to/guillaumechevalier</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/guillaumechevalier"/>
    <language>en</language>
    <item>
      <title>How to unit test machine learning code?</title>
      <dc:creator>Guillaume Chevalier</dc:creator>
      <pubDate>Wed, 25 Aug 2021 20:12:56 +0000</pubDate>
      <link>https://dev.to/neuraxio/how-to-unit-test-machine-learning-code-48a1</link>
      <guid>https://dev.to/neuraxio/how-to-unit-test-machine-learning-code-48a1</guid>
      <description>&lt;p&gt;Why are unit tests important? Why is testing important? How to do it for machine learning code? Those are questions I will answer. &lt;/p&gt;

&lt;p&gt;I suggest that you grab a good coffee while you read what follows. If you write AI code at Neuraxio, or if you write AI code using software that Neuraxio distributed, this article is especially important for you to grasp what's going on with the testing and how it works.&lt;/p&gt;

&lt;h2&gt;
  
  
  The testing pyramid
&lt;/h2&gt;

&lt;p&gt;Have you ever heard of the testing pyramid? Martin Fowler has a nice article on this topic &lt;a href="https://martinfowler.com/articles/practical-test-pyramid.html"&gt;here&lt;/a&gt;. To summarize: you should have LOTS OF small "unit" tests that test small components of your software, a FEW medium-sized "integration" tests (which will probably test your service application layer), and VERY FEW "end-to-end" (E2E) tests that test the whole thing at once (probably through your AI backend's REST API) with a real and complete use case, to see if everything works together. It makes a pyramid: unit tests at the bottom, E2E tests at the top.&lt;/p&gt;

&lt;p&gt;Why these different quantities of tests at these granularities? So we have a pyramid of tests like this:&lt;/p&gt;

&lt;p&gt;/__ VERY FEW end-to-end (large) tests; __\&lt;/p&gt;

&lt;p&gt;/______ FEW integration (medium) tests; ______\&lt;/p&gt;

&lt;p&gt;&lt;em&gt;/___________ LOTS OF unit (small) tests. ___________\&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Note that integration tests are sometimes also called acceptance tests. The terminology may differ depending on where you work. I personally prefer the term acceptance test, so as to refer to the business acceptance of a test case: an acceptance test case is like a business requirement written into code.  &lt;/p&gt;

&lt;p&gt;Suppose that in your daily work routine, you edit some code to fix a bug, measure something, or introduce a new feature. You will change something thinking that it helps. Since you are not perfect and make mistakes from time to time, bugs will eventually slip in. How often has your code worked on the first try?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Without tests at all: you will catch the bug two weeks later and probably have no clue where it is or how to fix it. The cost of fixing it will be 10x what it would have been had you caught it when you wrote the code.&lt;/li&gt;
&lt;li&gt;With large &amp;amp; medium tests but no unit tests: you will know instantly upon making the change that something is wrong, but you won't know exactly where it is in your code. The cost of fixing it will be 3x what it would be if unit tests pointed you to the spot.&lt;/li&gt;
&lt;li&gt;With small unit tests: not only will you know instantly that you have a bug upon making the change and running the tests, but chances are, if your unit tests give you good code coverage (say 80%), that a unit test covers the piece of software you've just modified, and you'll know instantly and exactly where the bug is and why.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To sum up: unit testing gives you, and especially your team, considerable speed. Rare are the programmers who like being stuck debugging software. Cut debugging time with unit tests, and not only will everyone be happy, everyone will also code faster.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Understanding code is by far the activity at which professional developers spend most of their time."  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Source: &lt;a href="https://blog.codinghorror.com/when-understanding-means-rewriting/"&gt;https://blog.codinghorror.com/when-understanding-means-rewriting/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Unit tests
&lt;/h2&gt;

&lt;p&gt;A unit test has 3 parts, they are called the &lt;a href="https://medium.com/@pjbgf/title-testing-code-ocd-and-the-aaa-pattern-df453975ab80"&gt;AAA steps of a unit test&lt;/a&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Arrange: create variables or constants that will be used in the next Act phase. If the variables are created in many tests, they can be extracted at the top of the test file or elsewhere to limit code duplication (or the test can even be parametrized as in the second image example later on). &lt;/li&gt;
&lt;li&gt;Act: call your code to test using the variables or constants set up just above in the Arrange, and receive a result. &lt;/li&gt;
&lt;li&gt;Assert: verify that the result you obtained in the Act phase matches what you'd expect. &lt;/li&gt;
&lt;/ol&gt;
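&lt;p&gt;As a minimal illustration, the three AAA steps could look like this in a Python test for a hypothetical preprocessing function (the function and names are made up for this sketch, not taken from any linked code file):&lt;/p&gt;

```python
# A hypothetical sketch of the three AAA steps in a small ML-flavored unit
# test: the scaling function and its test are illustrative only.

def scale_min_max(values):
    # The unit under test: scales a list of numbers into the [0, 1] range.
    lowest, highest = min(values), max(values)
    return [(v - lowest) / (highest - lowest) for v in values]

def test_scale_min_max_maps_values_into_unit_interval():
    # Arrange: a small hand-designed data sample.
    values = [10.0, 20.0, 30.0, 40.0]

    # Act: call the code under test and receive a result.
    result = scale_min_max(values)

    # Assert: verify that the result matches what we expect.
    assert min(result) == 0.0
    assert max(result) == 1.0
```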

&lt;p&gt;Example #1 of the AAA in a ML unit test:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jKdTMVuU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.shopify.com/s/files/1/0277/9958/4838/files/AAA-unit-test.jpg%3Fv%3D1629920287" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jKdTMVuU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.shopify.com/s/files/1/0277/9958/4838/files/AAA-unit-test.jpg%3Fv%3D1629920287" alt="Unit Testing in Python with PyTest. AAA: Arrange, Act, Assert steps for Machine Learning."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[&lt;a href="https://github.com/Neuraxio/Neuraxle/blob/796f6243e9fd3ab8ba674d44ef655240f48f1edd/testing/steps/test_choose_one_or_many_steps_of.py#L419"&gt;Click here to read whole original code file for the code above&lt;/a&gt;]  &lt;/p&gt;

&lt;p&gt;See how the test is first set up (arranged) at the beginning? The test above is even further set up using an argument in the test function, meaning that it can be run again and again with different arguments using PyTest's parametrize. &lt;a href="https://github.com/Neuraxio/Neuraxle/blob/3091bbc175be33b3b415386cf8a4dc79ce48d368/testing/steps/test_features.py#L81"&gt;Here is a good example of a well-parametrized unit test that also makes use of the AAA&lt;/a&gt;.  &lt;/p&gt;
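&lt;p&gt;For reference, a parametrized PyTest test could be sketched like this (the tokenizer function is a made-up example, not from the linked file):&lt;/p&gt;

```python
# A sketch of PyTest parametrization: the same AAA test body is run once per
# argument set. The count_tokens function here is hypothetical.
import pytest

def count_tokens(sentence):
    # The unit under test: a naive whitespace tokenizer's token count.
    return len(sentence.split())

@pytest.mark.parametrize("sentence, expected_count", [
    ("hello world", 2),
    ("a b c d", 4),
    ("single", 1),
])
def test_count_tokens(sentence, expected_count):
    # Arrange is handled by parametrize; then Act and Assert.
    result = count_tokens(sentence)
    assert result == expected_count
```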

&lt;p&gt;Example #2 of the AAA in a ML unit test:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SAeL1_cA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.shopify.com/s/files/1/0277/9958/4838/files/AAA-unit-test-2.jpg%3Fv%3D1629920345" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SAeL1_cA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.shopify.com/s/files/1/0277/9958/4838/files/AAA-unit-test-2.jpg%3Fv%3D1629920345" alt="PyTest fixture @pytest.mark.parametrize(...) Unit Test in Python"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[&lt;a href="https://github.com/Neuraxio/Neuraxle/blob/167d2e741a233a6019f6f89767d0cfd08cd0ff72/testing/test_pipeline.py#L45"&gt;click here to read whole original code file for the code above&lt;/a&gt;]  &lt;/p&gt;

&lt;p&gt;In the test above, written by Alexandre Brillant, we also see the AAA. First, we arrange: we create an ML pipeline, data inputs (X), and expected outputs (y). Then, we act: we call "fit_transform" on the pipeline to get a prediction result (Y). Then, we assert: we check that the prediction result matches what we expected (y==Y). That is a very simple, almost naive test, yet it can catch many bugs.  &lt;/p&gt;

&lt;p&gt;Unit tests in ML rarely use lots of data. Most of the time, they use small hand-designed data samples, just enough to check that the code runs and behaves as expected.  &lt;/p&gt;

&lt;p&gt;You'd then use medium-sized fake (or sometimes real) datasets in acceptance tests (medium integration tests), and real data in the end-to-end tests.  &lt;/p&gt;

&lt;p&gt;Sometimes, a unit test will test more than one thing. For instance, you'll test two things because your "Arrange" part uses something else. Hopefully, this something else was already tested individually with another test. Sometimes you can use what are called "mocks" or "stubs" to ensure you don't test two things in the same test. Mocking is a bit more advanced and more common in Java than in Python; you can read about mocks and stubs &lt;a href="https://en.wikipedia.org/wiki/Mock_object"&gt;here&lt;/a&gt;. Personally, I often prefer writing stubs rather than mocks, as stubs feel more straightforward to reuse across many different unit tests.&lt;/p&gt;
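&lt;p&gt;To illustrate, a hand-written stub could look like this (the classes and functions are hypothetical, just to show the idea of replacing a real dependency with a predictable fake):&lt;/p&gt;

```python
# A sketch of using a hand-written stub to isolate the unit under test:
# the real model is replaced by a stub with a fixed, predictable output.

class ModelStub:
    # Stands in for a real trained model, so the test exercises only the
    # surrounding logic, not the model itself.
    def predict(self, data_inputs):
        return [0 for _ in data_inputs]

def count_positive_predictions(model, data_inputs):
    # The unit under test: post-processing logic around a model.
    return sum(1 for p in model.predict(data_inputs) if p > 0)

def test_count_positive_predictions_with_stub():
    # Arrange: inject the stub instead of a real model (dependency inversion).
    model = ModelStub()

    # Act.
    result = count_positive_predictions(model, [1.0, 2.0, 3.0])

    # Assert: the stub always predicts 0, so no positives are counted.
    assert result == 0
```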

&lt;h2&gt;
  
  
  The TDD loop
&lt;/h2&gt;

&lt;p&gt;It naturally emerges that someone who writes unit tests will follow this 3-step loop:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;RED: write a unit test that fails; &lt;/li&gt;
&lt;li&gt;GREEN: make the test pass by writing proper code; &lt;/li&gt;
&lt;li&gt;BLUE: refactor the code by cleaning a bit what you've just written before moving on.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Therefore, by coding software that you test, you will do cycles of 1, 2, 3, 1, 2, 3... and so forth. It is strongly recommended to start by writing the test that fails (red). Why? Because it puts you in the shoes of someone using the code you are about to write, and makes you think about your code's API or public function design first. Plus, as per the &lt;a href="https://www.umaneo.com/post/the-solid-principles-applied-to-machine-learning"&gt;SOLID principles (applied to Machine Learning)&lt;/a&gt;, it will help you respect the DIP (&lt;a href="https://en.wikipedia.org/wiki/Dependency_inversion_principle"&gt;Dependency Inversion Principle&lt;/a&gt;): you will probably set something up in your test (in the first AAA phase, "Arrange") and then pass it to the class under test. This effectively applies dependency inversion to your code, by creating things outside and passing them around as arguments instead of creating them inside the classes you test.  &lt;/p&gt;

&lt;p&gt;Obviously, while doing the TDD loop, you'll often re-run your whole unit test suite to ensure you didn't break anything elsewhere in the codebase.  &lt;/p&gt;

&lt;h2&gt;
  
  
  The ATDD loop
&lt;/h2&gt;

&lt;p&gt;The ATDD loop is an improvement on the TDD loop. It is summarized as follows:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;ATDD: Write an acceptance test first, and then do many TDD loops to fulfill this acceptance test.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Why do so? Well, acceptance tests are medium-sized tests, compared to our small unit tests. If you write tests first, you'll probably want to start with an acceptance test (a medium test case); then, within this "medium TDD" acceptance loop, you'll encounter lots of smaller parts to solve, for which you'll do lots of "small TDD" unit test loops.  &lt;/p&gt;

&lt;p&gt;So the ATDD loop looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Acceptance RED: write an acceptance test that fails; &lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Acceptance GREEN: make the test pass by writing proper code, through nested TDD loops: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Unit RED: write a unit test that fails;&lt;/li&gt;
&lt;li&gt;Unit GREEN: make the test pass by writing proper code;&lt;/li&gt;
&lt;li&gt;Unit BLUE: refactor the code by cleaning a bit what you've just written before moving on;&lt;/li&gt;
&lt;li&gt;[Continue TDD loops as long as required to solve the acceptance test...]&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Acceptance BLUE: refactor the code by cleaning a bit what you've just written before moving on.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Other tests
&lt;/h2&gt;

&lt;p&gt;Of course, there are more types of tests. Some people do "border" tests, "database" tests, cloud "environment" tests, "uptime" tests (as in SLAs with uptime guarantees), and more. But the three types of tests presented above (E2E, acceptance/integration, unit) are the real deal for coding proper enterprise software.  &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"One difference between a smart programmer and a professional programmer is that the professional understands that clarity is king. Professionals use their powers for good and write code that others can understand."&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Source: Robert C. Martin, &lt;a href="https://www.amazon.ca/Clean-Code-Handbook-Software-Craftsmanship/dp/0132350882"&gt;Clean Code: A Handbook of Agile Software Craftsmanship&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Neuraxio AI Programmer
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Did you like this article? It is part of our &lt;em&gt;Neuraxio AI Programmer&lt;/em&gt; email series. Register below, or by clicking &lt;a href="https://www.members.neuraxio.com/ai-programmer-training"&gt;here&lt;/a&gt;.&lt;/strong&gt; &lt;/p&gt;

</description>
    </item>
    <item>
      <title>What's Wrong with Scikit-Learn Pipelines?</title>
      <dc:creator>Guillaume Chevalier</dc:creator>
      <pubDate>Wed, 25 Aug 2021 16:35:00 +0000</pubDate>
      <link>https://dev.to/neuraxio/what-s-wrong-with-scikit-learn-pipelines-5371</link>
      <guid>https://dev.to/neuraxio/what-s-wrong-with-scikit-learn-pipelines-5371</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Scikit-Learn’s “pipe and filter” design pattern is simply beautiful. But how to use it for Deep Learning, AutoML, and complex production-level pipelines?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Scikit-Learn had its first release in 2007, in the &lt;a href="https://github.com/guillaume-chevalier/Awesome-Deep-Learning-Resources#trends"&gt;pre-deep-learning era&lt;/a&gt;. It is one of the most widely known and adopted machine learning libraries, and it is still growing. On top of that, it uses the Pipe and Filter design pattern as its software architectural style; that, added to the fact that it provides algorithms ready for use, is what makes Scikit-Learn so fabulous. However, it has massive issues when it comes to the following, which we should already be able to do in 2021:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automatic Machine Learning (AutoML),&lt;/li&gt;
&lt;li&gt;Deep Learning Pipelines,&lt;/li&gt;
&lt;li&gt;More complex Machine Learning pipelines.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s first clarify what’s missing exactly, and then let’s see how we solved each of those problems by building new design patterns on top of the ones Scikit-Learn already uses.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;TL;DR: How could things work so that we can do everything in the above list with the Pipe and Filter design pattern / architectural style that is characteristic of Scikit-Learn? The API must be redesigned to include broader functionality, such as allowing the definition of hyperparameter spaces, and allowing more comprehensive object lifecycle &amp;amp; data flow functionality in the steps of a pipeline. We coded a solution: &lt;a href="https://github.com/Neuraxio/Neuraxle"&gt;Neuraxle&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Don’t get me wrong, I used to love Scikit-Learn, and I still love to use it. It offers useful features such as the ability to define pipelines with a panoply of premade machine learning algorithms. However, there are serious problems that simply couldn’t be foreseen in 2007, when deep learning wasn’t a thing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problems
&lt;/h2&gt;

&lt;p&gt;Some of the problems are highlighted by the top core developer of Scikit-Learn himself at a SciPy conference. He calls for new libraries to solve those problems instead of solving them within Scikit-Learn:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Source: &lt;strong&gt;the top core developer of Scikit-Learn himself&lt;/strong&gt; - Andreas C. Müller @ SciPy Conference&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Inability to Reasonably do Automatic Machine Learning (AutoML)
&lt;/h3&gt;

&lt;p&gt;In Scikit-Learn, the hyperparameters and the search space of the models are awkwardly defined.&lt;/p&gt;

&lt;p&gt;Think of built-in hyperparameter spaces and AutoML algorithms. With Scikit-Learn, although a pipeline step can have hyperparameters, those hyperparameters don’t each have a hyperparameter distribution.&lt;/p&gt;

&lt;p&gt;It’d be really good to have &lt;code&gt;get_hyperparams_space&lt;/code&gt; as well as &lt;code&gt;get_params&lt;/code&gt; in Scikit-Learn, for instance.&lt;/p&gt;
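&lt;p&gt;To sketch the idea, a step could expose its search space next to its current values, something like this (a hypothetical API, not Scikit-Learn’s nor Neuraxle’s exact one; the class and distribution names are illustrative):&lt;/p&gt;

```python
# A hypothetical sketch of what such an API could look like: each pipeline
# step exposes a hyperparameter space alongside its current hyperparameter
# values. Class and method names here are illustrative.
import random

class Uniform:
    # A toy hyperparameter distribution that AutoML code could sample from.
    def __init__(self, low, high):
        self.low, self.high = low, high

    def rvs(self):
        # Sample one value from the distribution (scipy-style naming).
        return random.uniform(self.low, self.high)

class StepWithHyperparamsSpace:
    def get_params(self):
        # The current values, as Scikit-Learn already provides.
        return {"learning_rate": 0.01}

    def get_hyperparams_space(self):
        # The missing piece: a distribution to search over for each value.
        return {"learning_rate": Uniform(0.0001, 0.1)}

# An AutoML loop could then sample candidate hyperparameters:
step = StepWithHyperparamsSpace()
space = step.get_hyperparams_space()
candidate = {name: dist.rvs() for name, dist in space.items()}
```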

&lt;p&gt;This inability to define distributions for hyperparameters is the root of many of Scikit-Learn’s limitations with regard to AutoML, and there are more technical limitations out there regarding constructor arguments of pipeline steps and nested pipelines.&lt;/p&gt;

&lt;h3&gt;
  
  
  Inability to Reasonably do Deep Learning Pipelines
&lt;/h3&gt;

&lt;p&gt;Think about the following features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;train-only behavior:

&lt;ul&gt;
&lt;li&gt;mini-batching (partial fits),&lt;/li&gt;
&lt;li&gt;repeating epochs during train,&lt;/li&gt;
&lt;li&gt;shuffling the data,&lt;/li&gt;
&lt;li&gt;oversampling / undersampling,&lt;/li&gt;
&lt;li&gt;data augmentation,&lt;/li&gt;
&lt;li&gt;adding noise to data,&lt;/li&gt;
&lt;li&gt;curriculum learning,&lt;/li&gt;
&lt;li&gt;online learning&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;test-only behavior:

&lt;ul&gt;
&lt;li&gt;disabling regularization techniques,&lt;/li&gt;
&lt;li&gt;freezing parameters&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;mutating behavior:

&lt;ul&gt;
&lt;li&gt;multiple or changing input placeholders,&lt;/li&gt;
&lt;li&gt;multiple or changing output heads,&lt;/li&gt;
&lt;li&gt;multi-task learning,&lt;/li&gt;
&lt;li&gt;unsupervised pre-training before supervised learning,&lt;/li&gt;
&lt;li&gt;fine-tuning&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;having evaluation strategies that work with the mini-batching and all of the aforementioned things.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Scikit-Learn does almost none of the above, and hardly allows it, as its API is too strict and wasn’t built with those considerations in mind: for instance, they are mostly missing from the original &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html"&gt;Scikit-Learn Pipeline&lt;/a&gt;. Yet, all of those things are required for Deep Learning algorithms to be trained (and thereafter deployed).&lt;/p&gt;

&lt;p&gt;Plus, Scikit-Learn lacks features for proper serialization, and it lacks compatibility with Deep Learning frameworks (i.e., TensorFlow, Keras, PyTorch, Poutyne). It also fails to provide &lt;a href="https://programmingwithmosh.com/javascript/react-lifecycle-methods/"&gt;lifecycle methods&lt;/a&gt; to manage resources and GPU memory allocation. Think of lifecycle methods as the methods each object has: &lt;code&gt;__init__&lt;/code&gt;, &lt;code&gt;fit&lt;/code&gt;, &lt;code&gt;transform&lt;/code&gt;. For instance, picture also adding &lt;code&gt;setup&lt;/code&gt;, &lt;code&gt;teardown&lt;/code&gt;, &lt;code&gt;mutate&lt;/code&gt;, &lt;code&gt;introspect&lt;/code&gt;, &lt;code&gt;save&lt;/code&gt;, &lt;code&gt;load&lt;/code&gt;, and more, to manage the events in the life of each algorithm’s object in a pipeline.&lt;/p&gt;
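&lt;p&gt;As a rough sketch of what such lifecycle methods could look like on a pipeline step (illustrative code, not Neuraxle’s actual API):&lt;/p&gt;

```python
# An illustrative sketch of a pipeline step with lifecycle methods beyond
# __init__/fit/transform, for managing resources such as GPU memory.

class LifecycleStep:
    def __init__(self, hyperparams=None):
        # Keep __init__ lightweight: no heavy resources allocated here.
        self.hyperparams = hyperparams or {}
        self.resource = None

    def setup(self):
        # Allocate resources (e.g. GPU memory, sessions) lazily.
        self.resource = "allocated"
        return self

    def fit(self, data_inputs, expected_outputs):
        # Learn from data; a real step would update model parameters here.
        return self

    def transform(self, data_inputs):
        # Apply the learned transformation; identity in this sketch.
        return data_inputs

    def teardown(self):
        # Free the resources at the end of the step's life.
        self.resource = None
        return self

step = LifecycleStep().setup()
outputs = step.fit([1, 2], [1, 2]).transform([1, 2])
step.teardown()
```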

&lt;p&gt;You’d also want some pipeline steps to be able to manipulate labels, for instance in the case of an autoregressive autoencoder where some “X” data is extracted to “y” data during the fitting phase only, or in the case of applying a one-hot encoder to the labels to feed them as integers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Not ready for Production nor for Complex Pipelines
&lt;/h3&gt;

&lt;p&gt;Parallelism and serialization are convoluted in Scikit-Learn: they are hard, not to say broken. When some steps of your pipeline import libraries coded in C++, those objects aren’t always serializable, and the usual way of saving in Scikit-Learn, which is to use the joblib serialization library, doesn’t work.&lt;/p&gt;

&lt;p&gt;Also, when you build pipelines that are meant to run in production, there are more things you’ll want to add on top of the previous ones. Think about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;nested pipelines,&lt;/li&gt;
&lt;li&gt;funky multimodal data,&lt;/li&gt;
&lt;li&gt;parallelism and scaling on multiple cores,&lt;/li&gt;
&lt;li&gt;parallelism and scaling on multiple machines,&lt;/li&gt;
&lt;li&gt;cloud computing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Shortly put: it’s hard to code &lt;a href="https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/metaestimators.py"&gt;Metaestimators&lt;/a&gt; using Scikit-Learn’s base classes. Metaestimators are algorithms that wrap other algorithms in a pipeline to change the behavior of the wrapped algorithm (e.g., the &lt;a href="https://en.wikipedia.org/wiki/Decorator_pattern#Python"&gt;decorator design pattern&lt;/a&gt;). Examples of metaestimators:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a &lt;a href="https://www.neuraxle.org/stable/api/neuraxle.metaopt.random.html#neuraxle.metaopt.random.RandomSearch"&gt;&lt;code&gt;RandomSearch&lt;/code&gt;&lt;/a&gt; holds another step to optimize. A &lt;code&gt;RandomSearch&lt;/code&gt; is itself also a step.&lt;/li&gt;
&lt;li&gt;a &lt;a href="https://www.neuraxle.org/stable/api/neuraxle.pipeline.html#neuraxle.pipeline.Pipeline"&gt;&lt;code&gt;Pipeline&lt;/code&gt;&lt;/a&gt; holds several other steps. A &lt;code&gt;Pipeline&lt;/code&gt; is itself also a step (as it can be used inside other pipelines: nested pipelines).&lt;/li&gt;
&lt;li&gt;a &lt;a href="https://www.neuraxle.org/stable/api/neuraxle.steps.loop.html#neuraxle.steps.loop.ForEachDataInput"&gt;&lt;code&gt;ForEachDataInputs&lt;/code&gt;&lt;/a&gt; holds another step. A &lt;code&gt;ForEachDataInputs&lt;/code&gt; is itself also a step (as it is a replacement of one to just change dimensionality of the data, such as adapting a 2D step to 3D data by wrapping it).&lt;/li&gt;
&lt;li&gt;an &lt;a href="https://www.neuraxle.org/stable/api/neuraxle.steps.flow.html#neuraxle.steps.flow.ExpandDim"&gt;&lt;code&gt;ExpandDim&lt;/code&gt;&lt;/a&gt; holds another step. An &lt;code&gt;ExpandDim&lt;/code&gt; is itself also a step (inversely to the &lt;code&gt;ForEachDataInputs&lt;/code&gt;, it augments the dimensionality instead of lowering it).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Metaestimators are crucial for advanced features. For instance, a &lt;code&gt;ParallelTransform&lt;/code&gt; step could wrap a step to dispatch computations across different threads. A &lt;code&gt;ClusteringWrapper&lt;/code&gt; could dispatch the computations of the step it wraps to different worker computers within a pipeline. Upon receiving a batch of data, a &lt;code&gt;ClusteringWrapper&lt;/code&gt; would first send the step to the workers (if it wasn’t already sent) and then a subset of the data to each worker. A pipeline is itself a metaestimator, as it contains many different steps. There are many metaestimators out there. We also call them “meta steps”, as a synonym.&lt;/p&gt;
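&lt;p&gt;A minimal, hypothetical metaestimator could be sketched like this (the class names are illustrative, in the spirit of the &lt;code&gt;ForEachDataInputs&lt;/code&gt; example above):&lt;/p&gt;

```python
# A minimal sketch of a metaestimator: a step wrapping another step to change
# its behavior (decorator pattern). Both class names are illustrative.

class MultiplyByTwo:
    # A plain step that works on 1D data.
    def transform(self, data_inputs):
        return [x * 2 for x in data_inputs]

class ForEachWrapper:
    # A metaestimator: applies the wrapped 1D step to each row of 2D data,
    # adapting the step to higher-dimensional data without changing it.
    def __init__(self, wrapped):
        self.wrapped = wrapped

    def transform(self, data_inputs):
        return [self.wrapped.transform(row) for row in data_inputs]

step = ForEachWrapper(MultiplyByTwo())
result = step.transform([[1, 2], [3, 4]])  # [[2, 4], [6, 8]]
```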

&lt;h2&gt;
  
  
  Solutions that we’ve Found to Those Scikit-Learn’s Problems
&lt;/h2&gt;

&lt;p&gt;For sure, Scikit-Learn is very convenient and well-built. However, it needs a refresh. Here are our solutions with Neuraxle to make Scikit-Learn fresh and usable within modern computing projects!&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.neuraxle.org/stable/scikit-learn_problems_solutions.html#inability-to-reasonably-do-automatic-machine-learning-automl"&gt;Inability to Reasonably do Automatic Machine Learning (AutoML)&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.neuraxle.org/stable/scikit-learn_problems_solutions.html#problem-defining-the-search-space-hyperparameter-distributions"&gt;Problem: Defining the Search Space (Hyperparameter Distributions)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.neuraxle.org/stable/scikit-learn_problems_solutions.html#problem-defining-hyperparameters-in-the-constructor-is-limiting"&gt;Problem: Defining Hyperparameters in the Constructor is Limiting&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.neuraxle.org/stable/scikit-learn_problems_solutions.html#problem-different-train-and-test-behavior"&gt;Problem: Different Train and Test Behavior&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.neuraxle.org/stable/scikit-learn_problems_solutions.html#problem-you-trained-a-pipeline-and-you-want-feedback-statistics-on-its-learning"&gt;Problem: You trained a Pipeline and You Want Feedback on its Learning.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.neuraxle.org/stable/scikit-learn_problems_solutions.html#inability-to-reasonably-do-deep-learning-pipelines"&gt;Inability to Reasonably do Deep Learning Pipelines&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.neuraxle.org/stable/scikit-learn_problems_solutions.html#problem-scikit-learn-hardly-allows-for-mini-batch-gradient-descent-incremental-fit"&gt;Problem: Scikit-Learn Hardly Allows for Mini-Batch Gradient Descent (Incremental Fit)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.neuraxle.org/stable/scikit-learn_problems_solutions.html#problem-initializing-the-pipeline-and-deallocating-resources"&gt;Problem: Initializing the Pipeline and Deallocating Resources&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.neuraxle.org/stable/scikit-learn_problems_solutions.html#problem-it-is-difficult-to-use-other-deep-learning-dl-libraries-in-scikit-learn"&gt;Problem: It is Difficult to Use Other Deep Learning (DL) Libraries in Scikit-Learn&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.neuraxle.org/stable/scikit-learn_problems_solutions.html#problem-the-ability-to-transform-output-labels"&gt;Problem: The Ability to Transform Output Labels&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.neuraxle.org/stable/scikit-learn_problems_solutions.html#not-ready-for-production-nor-for-complex-pipelines"&gt;Not ready for Production nor for Complex Pipelines&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.neuraxle.org/stable/scikit-learn_problems_solutions.html#problem-processing-3d-4d-or-nd-data-in-your-pipeline-with-steps-made-for-lower-dimensionnal-data"&gt;Problem: Processing 3D, 4D, or ND Data in your Pipeline with Steps Made for Lower-Dimensionnal Data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.neuraxle.org/stable/scikit-learn_problems_solutions.html#problem-modify-a-pipeline-along-the-way-such-as-for-pre-training-or-fine-tuning"&gt;Problem: Modify a Pipeline Along the Way, such as for Pre-Training or Fine-Tuning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.neuraxle.org/stable/scikit-learn_problems_solutions.html#problem-getting-model-attributes-from-scikit-learn-pipeline"&gt;Problem: Getting Model Attributes from Scikit-Learn Pipeline&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.neuraxle.org/stable/scikit-learn_problems_solutions.html#problem-you-can-t-parallelize-nor-save-pipelines-using-steps-that-can-t-be-serialized-as-is-by-joblib"&gt;Problem: You can’t Parallelize nor Save Pipelines Using Steps that Can’t be Serialized “as-is” by Joblib&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Unfortunately, most Machine Learning pipelines and frameworks, such as Scikit-Learn, fail at combining Deep Learning algorithms within &lt;a href="https://www.neuraxio.com/en/blog/neuraxle/2019/10/26/neat-machine-learning-pipelines.html"&gt;neat pipeline abstractions&lt;/a&gt; allowing for clean code, automatic machine learning, parallelism &amp;amp; cluster computing, and deployment in production. Scikit-Learn has those nice pipeline abstractions already, but it lacks the features to do AutoML, deep learning pipelines, and more complex pipelines such as for deploying to production.&lt;/p&gt;

&lt;p&gt;Fortunately, we found some design patterns and solutions that allow all the techniques we named to work together within a pipeline, making things easy for coders, and bringing concepts from recent frontend frameworks (e.g., component lifecycle) into machine learning pipelines with the right abstractions, allowing for more possibilities such as better memory management, serialization, and mutating dynamic pipelines. We also break past Scikit-Learn and Python’s parallelism limitations with a neat trick, allowing straightforward parallelization and serialization of pipelines for deployment in production.&lt;/p&gt;

&lt;p&gt;We’re glad we’ve found a clean way to solve the most widespread problems related to machine learning pipelines, and we hope that our solutions will benefit many machine learning projects, including projects that must actually be deployed to production.&lt;/p&gt;

&lt;p&gt;If you liked this reading, &lt;a href="https://www.neuraxio.com/en/blog/neuraxle/index.html"&gt;subscribe to Neuraxio’s updates&lt;/a&gt; to be kept in the loop! Also thanks to the &lt;a href="https://www.dotlayer.org/"&gt;Dot-Layer (.Layer)&lt;/a&gt; organization’s blog committee and administrators for their generous peer-review of the present article.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Our top learning resources for AI programmers</title>
      <dc:creator>Guillaume Chevalier</dc:creator>
      <pubDate>Fri, 13 Aug 2021 19:14:54 +0000</pubDate>
      <link>https://dev.to/guillaumechevalier/our-top-learning-resources-for-ai-programmers-4lc7</link>
      <guid>https://dev.to/guillaumechevalier/our-top-learning-resources-for-ai-programmers-4lc7</guid>
      <description>&lt;p&gt;You are an Artificial Intelligence (AI) programmer and you'd like to learn how to program well as we do at Neuraxio? &lt;/p&gt;

&lt;p&gt;Lucky you, we've launched a series of curated resources to help you get better and to work like a pro in Machine Learning (ML) projects. &lt;/p&gt;

&lt;p&gt;You'll learn: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How to apply the SOLID principles of Clean Code to your ML code; &lt;/li&gt;
&lt;li&gt;Package design techniques to structure your code; &lt;/li&gt;
&lt;li&gt;How to properly balance the use of Jupyter Notebooks vs. an IDE in an ML project; &lt;/li&gt;
&lt;li&gt;How to structure your backend software architecture code for a viable production-ready project that will live through time; &lt;/li&gt;
&lt;li&gt;The structure of Clean Machine Learning code that yields results with AutoML; &lt;/li&gt;
&lt;li&gt;Links to the top online classes that we recommend to our programmers internally to get up to date with Machine Learning; &lt;/li&gt;
&lt;li&gt;Top 5 documentation pages that you should read to grasp how Neuraxle works; &lt;/li&gt;
&lt;li&gt;How to program robust software using unit tests; &lt;/li&gt;
&lt;li&gt;Framework design patterns you can use to design your own frameworks; &lt;/li&gt;
&lt;li&gt;How to do business with Neuraxio by working for us or referring us AI clients; &lt;/li&gt;
&lt;li&gt;And more! 
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you successfully pass the quiz sent to you at the end of this training, you'll have the option to purchase a certificate showcasing your skills for $17 CAD. This certificate can be displayed on LinkedIn as "Neuraxio AI Programmer". &lt;/p&gt;

&lt;p&gt;Remember, you're only one ML project away from achieving success.&lt;br&gt;&lt;br&gt;
And it starts here and now. &lt;/p&gt;

&lt;p&gt;Register &lt;a href="https://www.neuraxio.com/blogs/news/our-top-learning-resources-for-ai-programmers"&gt;here&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>AI technologies for eCommerce - The Commerce Show</title>
      <dc:creator>Guillaume Chevalier</dc:creator>
      <pubDate>Tue, 03 Aug 2021 19:20:45 +0000</pubDate>
      <link>https://dev.to/guillaumechevalier/ai-technologies-for-ecommerce-the-commerce-show-47o2</link>
      <guid>https://dev.to/guillaumechevalier/ai-technologies-for-ecommerce-the-commerce-show-47o2</guid>
      <description>&lt;p&gt;This is a podcast episode from The Commerce Show. &lt;/p&gt;

&lt;p&gt;Original description from The Commerce Show: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In this episode, we are talking about AI technologies for eCommerce. Guillaume Chevalier has been working in artificial intelligence for over 7 years now and has been involved in over 57 machine learning projects.   &lt;/p&gt;

&lt;p&gt;We cover several applications of AI in the eCommerce industry such as Personalized shopping experience, Sales/Inventory forecasting, Automated customer service/chatbots, Visual search and powerful synonyms search, Price optimization, Understanding customers better (persona) and Recommendation algorithms.   &lt;/p&gt;

&lt;p&gt;After the first 30 minutes on eCommerce, we dig a bit deeper into AI from a developer's point of view. Guillaume explains why he decided to develop Neuraxle, an AI framework for machine learning projects, over the years. Guillaume also gives tips on « How to start and plan an AI transformation for a non-tech business. »   &lt;/p&gt;

&lt;p&gt;This podcast is amazing and you’ll discover Guillaume's passion for eCommerce, ML (machine learning) and NLP (natural language processing). &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;Listen to the full podcast:&lt;/em&gt;&lt;/p&gt;



&lt;p&gt;&lt;a href="https://thecommerceshow.com/episodes/ep-13-guillaume-chevalier-from-neuraxio-ai-technologies-for-ecommerce-the-commerce-show-tKAR98Gl"&gt;https://thecommerceshow.com/episodes/ep-13-guillaume-chevalier-from-neuraxio-ai-technologies-for-ecommerce-the-commerce-show-tKAR98Gl&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;Follow &lt;a href="https://thecommerceshow.com/episodes/ep-13-guillaume-chevalier-from-neuraxio-ai-technologies-for-ecommerce-the-commerce-show-tKAR98Gl"&gt;The Commerce Show&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>What's wrong with Scikit-Learn.</title>
      <dc:creator>Guillaume Chevalier</dc:creator>
      <pubDate>Fri, 03 Jan 2020 16:35:31 +0000</pubDate>
      <link>https://dev.to/neuraxio/what-s-wrong-with-scikit-learn-4p3a</link>
      <guid>https://dev.to/neuraxio/what-s-wrong-with-scikit-learn-4p3a</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ISYmHqTn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.neuraxio.com/en/blog/assets/sklearn-broken.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ISYmHqTn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.neuraxio.com/en/blog/assets/sklearn-broken.jpg" alt="What's wrong with Scikit-Learn."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Scikit-Learn’s “pipe and filter” design pattern is simply beautiful. But how to use it for Deep Learning, AutoML, and complex production-level pipelines?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Scikit-Learn had its first release in 2007, in the &lt;a href="https://github.com/guillaume-chevalier/Awesome-Deep-Learning-Resources#trends"&gt;pre-deep-learning era&lt;/a&gt;. Yet it is one of the best-known and most widely adopted machine learning libraries, and it is still growing. On top of that, it uses the Pipe and Filter design pattern as a software architectural style - it’s what makes scikit-learn so fabulous, in addition to the fact that it provides algorithms ready for use. However, it has massive issues when it comes to doing the following, which we should already be able to do in 2020:&lt;/p&gt;
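&lt;p&gt;To picture the pattern, the pipe-and-filter style can be sketched in a few lines of plain Python (a toy illustration, not Scikit-Learn’s actual implementation): each filter fits on the data, transforms it, and hands the result to the next filter in the pipe.&lt;/p&gt;

```python
class Pipeline:
    """Toy pipe-and-filter sketch: each step transforms the data
    and passes its output to the next step in the chain."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, data):
        for step in self.steps:
            step.fit(data)
            data = step.transform(data)
        return self

    def transform(self, data):
        for step in self.steps:
            data = step.transform(data)
        return data


class ToyMinMaxScaler:
    """Toy filter: rescales a list of numbers to the [0, 1] range."""

    def fit(self, data):
        self.low, self.high = min(data), max(data)
        return self

    def transform(self, data):
        span = (self.high - self.low) or 1.0
        return [(x - self.low) / span for x in data]


pipeline = Pipeline([ToyMinMaxScaler()])
pipeline.fit([2.0, 4.0, 6.0])
print(pipeline.transform([2.0, 4.0, 6.0]))  # [0.0, 0.5, 1.0]
```

&lt;p&gt;Each step only knows its own input and output, which is what makes the filters composable and swappable.&lt;/p&gt;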

&lt;ul&gt;
&lt;li&gt;Automatic Machine Learning (AutoML),&lt;/li&gt;
&lt;li&gt;Deep Learning Pipelines,&lt;/li&gt;
&lt;li&gt;More complex Machine Learning pipelines.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s first clarify what’s missing exactly, and then see how we solved each of those problems by building new design patterns on top of the ones scikit-learn already uses.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;TL;DR: How could things be made to work so that everything in the above list is possible with the Pipe and Filter design pattern / architectural style that is characteristic of Scikit-Learn?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Don’t get me wrong, I used to love Scikit-Learn, and I still love to use it. It is a nice status quo: it offers useful features such as the ability to define pipelines with a panoply of premade machine learning algorithms. However, there are serious problems that simply couldn’t be foreseen back in 2007.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problems
&lt;/h2&gt;

&lt;p&gt;Some of these problems are highlighted by the creator of scikit-learn himself in one of his conference talks, where he suggests that new libraries should solve those problems instead of building the solutions into scikit-learn:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/embed/Wy6EKjJT79M?start=1361&amp;amp;end=1528"&gt;https://www.youtube.com/embed/Wy6EKjJT79M?start=1361&amp;amp;amp;end=1528&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Source: &lt;strong&gt;the creator of scikit-learn himself&lt;/strong&gt; - Andreas Mueller @ SciPy Conference&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Inability to Reasonably do Automatic Machine Learning (AutoML)
&lt;/h3&gt;

&lt;p&gt;In scikit-learn, the hyperparameters and the search space of the models are awkwardly defined.&lt;/p&gt;

&lt;p&gt;Think of built-in hyperparameter spaces and AutoML algorithms. With scikit-learn, a pipeline step can have hyperparameters, but each of them doesn’t have a hyperparameter distribution attached to it.&lt;/p&gt;

&lt;p&gt;It’d be really good to have &lt;code&gt;get_hyperparams_space&lt;/code&gt; as well as &lt;code&gt;get_params&lt;/code&gt; in scikit-learn, for instance.&lt;/p&gt;

&lt;p&gt;This lack of hyperparameter distribution definitions is at the root of many of scikit-learn’s limitations, and there are further technical limitations regarding constructor arguments of pipeline steps and nested pipelines.&lt;/p&gt;
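&lt;p&gt;To illustrate the idea, here is a hedged sketch (hypothetical class and method names, not a real scikit-learn or Neuraxle API) of a step that declares a distribution for each of its hyperparameters, which is exactly what an AutoML loop needs in order to sample trials:&lt;/p&gt;

```python
import random


class Uniform:
    """Toy hyperparameter distribution that can be sampled from."""

    def __init__(self, low, high):
        self.low, self.high = low, high

    def sample(self):
        return random.uniform(self.low, self.high)


class DecisionStump:
    """Hypothetical pipeline step declaring a hyperparameter space
    alongside its current hyperparameter values."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold

    def get_params(self):
        # Current values, as scikit-learn already provides.
        return {"threshold": self.threshold}

    def get_hyperparams_space(self):
        # Each hyperparameter gets a whole distribution, not just a
        # value: an AutoML algorithm can then draw new trials from it.
        return {"threshold": Uniform(0.0, 1.0)}


step = DecisionStump()
sampled = step.get_hyperparams_space()["threshold"].sample()
print(step.get_params())  # {'threshold': 0.5}
print(round(sampled, 3))  # a random value drawn from [0.0, 1.0]
```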

&lt;h3&gt;
  
  
  Inability to Reasonably do Deep Learning Pipelines
&lt;/h3&gt;

&lt;p&gt;Think about mini-batching, repeating epochs during training, train/test modes of steps, pipelines that mutate in shape to change data sources and data structures amidst the training (e.g.: unsupervised pre-training before supervised learning), and evaluation strategies that work with mini-batching and all of the aforementioned things. Mini-batching also implies that the steps of a pipeline must support having “fit” called many times in a row on subsets of the data, which isn’t the standard in scikit-learn. None of that is available within a &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html"&gt;Scikit-Learn Pipeline&lt;/a&gt;, yet all of those things are required for Deep Learning algorithms to be trained and deployed.&lt;/p&gt;
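&lt;p&gt;To make that requirement concrete, here is a toy sketch (plain Python, not Scikit-Learn code) of a step whose “fit” is called repeatedly, once per mini-batch, incrementally refining its state: the kind of behavior a standard Scikit-Learn Pipeline doesn’t orchestrate for you.&lt;/p&gt;

```python
class OnlineMean:
    """Toy incremental model: maintains a running mean across fits."""

    def __init__(self):
        self.total, self.count = 0.0, 0

    def fit(self, batch):
        # "fit" may be called many times in a row, each time on a
        # mini-batch of the data, refining the model incrementally.
        self.total += sum(batch)
        self.count += len(batch)
        return self

    @property
    def mean(self):
        return self.total / self.count


def minibatches(data, batch_size):
    """Yield successive mini-batches of the data."""
    for i in range(0, len(data), batch_size):
        yield data[i:i + batch_size]


model = OnlineMean()
for batch in minibatches([1.0, 2.0, 3.0, 4.0], batch_size=2):
    model.fit(batch)
print(model.mean)  # 2.5
```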

&lt;p&gt;Plus, Scikit-Learn lacks compatibility with Deep Learning frameworks (i.e.: TensorFlow, Keras, PyTorch, Poutyne). It also doesn’t provide lifecycle methods to manage resources and GPU memory allocation, for instance.&lt;/p&gt;

&lt;p&gt;You’d also want some pipeline steps to be able to manipulate labels, for instance in the case of an autoregressive autoencoder where some “X” data is extracted into “y” data during the fitting phase only, or in the case of applying a one-hot encoder to the labels to be able to feed them as integers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Not ready for Production nor for Complex Pipelines
&lt;/h3&gt;

&lt;p&gt;Parallelism and serialization are convoluted in scikit-learn: it’s hard, not to say broken. When some steps of your pipeline import libraries coded in C++ whose objects aren’t serializable, the usual way of saving things in Scikit-Learn doesn’t work.&lt;/p&gt;

&lt;p&gt;Also, when you build pipelines meant to be deployed in production, there are more things you’ll want to add on top of the previous ones. Think about nested pipelines, funky multimodal data, parallelism and scaling, and cloud computing.&lt;/p&gt;

&lt;p&gt;Shortly put: with Scikit-Learn, it’s hard to code &lt;a href="https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/metaestimators.py"&gt;Metaestimators&lt;/a&gt;. Metaestimators are algorithms that wrap other algorithms in a pipeline so as to change the function of the wrapped algorithm (decorator design pattern).&lt;/p&gt;

&lt;p&gt;Metaestimators are crucial for advanced features. For instance, a ParallelTransform step could wrap a step to dispatch computations across different threads. A ClusteringWrapper could dispatch computations of the step it wraps to different worker computers within a pipeline by first sending the step to the workers and then the data as it comes. A pipeline is itself a metaestimator, as it contains many different steps. There are many metaestimators out there. Here, we name those “meta steps” for simplicity.&lt;/p&gt;
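&lt;p&gt;As an illustration of the decorator idea, here is a toy sketch (plain Python, hypothetical class names, not Neuraxle or scikit-learn code) of a meta step wrapping another step while keeping its interface, so it can stand anywhere in a pipeline that the wrapped step could:&lt;/p&gt;

```python
class MetaStep:
    """Sketch of a metaestimator (decorator pattern): it wraps another
    step and can change its behavior while keeping the same interface."""

    def __init__(self, wrapped):
        self.wrapped = wrapped

    def fit(self, data):
        self.wrapped.fit(data)
        return self

    def transform(self, data):
        return self.wrapped.transform(data)


class CountTransformCalls(MetaStep):
    """Hypothetical meta step: counts calls to the step it wraps.
    A ParallelTransform or ClusteringWrapper would instead dispatch
    the wrapped step's work to threads or to worker machines."""

    def __init__(self, wrapped):
        super().__init__(wrapped)
        self.calls = 0

    def transform(self, data):
        self.calls += 1
        return self.wrapped.transform(data)


class Doubler:
    """Toy inner step."""

    def fit(self, data):
        return self

    def transform(self, data):
        return [2 * x for x in data]


step = CountTransformCalls(Doubler())
print(step.transform([1, 2]))  # [2, 4]
print(step.calls)  # 1
```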

&lt;h2&gt;
  
  
  Solutions that we’ve Found to Those Scikit-Learn’s Problems
&lt;/h2&gt;

&lt;p&gt;For sure, Scikit-Learn is very convenient and well-built. However, it needs a refresh. Here are our solutions to make scikit-learn fresh and usable within modern computing projects!&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.neuraxle.org/stable/scikit-learn_problems_solutions.html#inability-to-reasonably-do-automatic-machine-learning-automl"&gt;Inability to Reasonably do Automatic Machine Learning (AutoML)&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.neuraxle.org/stable/scikit-learn_problems_solutions.html#problem-defining-the-search-space-hyperparameter-distributions"&gt;Problem: Defining the Search Space (Hyperparameter Distributions)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.neuraxle.org/stable/scikit-learn_problems_solutions.html#problem-defining-hyperparameters-in-the-constructor-is-limiting"&gt;Problem: Defining Hyperparameters in the Constructor is Limiting&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.neuraxle.org/stable/scikit-learn_problems_solutions.html#problem-different-train-and-test-behavior"&gt;Problem: Different Train and Test Behavior&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.neuraxle.org/stable/scikit-learn_problems_solutions.html#problem-you-trained-a-pipeline-and-you-want-feedback-statistics-on-its-learning"&gt;Problem: You trained a Pipeline and You Want Feedback on its Learning.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.neuraxle.org/stable/scikit-learn_problems_solutions.html#inability-to-reasonably-do-deep-learning-pipelines"&gt;Inability to Reasonably do Deep Learning Pipelines&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.neuraxle.org/stable/scikit-learn_problems_solutions.html#problem-scikit-learn-hardly-allows-for-mini-batch-gradient-descent-incremental-fit"&gt;Problem: Scikit-Learn Hardly Allows for Mini-Batch Gradient Descent (Incremental Fit)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.neuraxle.org/stable/scikit-learn_problems_solutions.html#problem-initializing-the-pipeline-and-deallocating-resources"&gt;Problem: Initializing the Pipeline and Deallocating Resources&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.neuraxle.org/stable/scikit-learn_problems_solutions.html#problem-it-is-difficult-to-use-other-deep-learning-dl-libraries-in-scikit-learn"&gt;Problem: It is Difficult to Use Other Deep Learning (DL) Libraries in Scikit-Learn&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.neuraxle.org/stable/scikit-learn_problems_solutions.html#problem-the-ability-to-transform-output-labels"&gt;Problem: The Ability to Transform Output Labels&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.neuraxle.org/stable/scikit-learn_problems_solutions.html#not-ready-for-production-nor-for-complex-pipelines"&gt;Not ready for Production nor for Complex Pipelines&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.neuraxle.org/stable/scikit-learn_problems_solutions.html#problem-processing-3d-4d-or-nd-data-in-your-pipeline-with-steps-made-for-lower-dimensionnal-data"&gt;Problem: Processing 3D, 4D, or ND Data in your Pipeline with Steps Made for Lower-Dimensionnal Data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.neuraxle.org/stable/scikit-learn_problems_solutions.html#problem-modify-a-pipeline-along-the-way-such-as-for-pre-training-or-fine-tuning"&gt;Problem: Modify a Pipeline Along the Way, such as for Pre-Training or Fine-Tuning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.neuraxle.org/stable/scikit-learn_problems_solutions.html#problem-getting-model-attributes-from-scikit-learn-pipeline"&gt;Problem: Getting Model Attributes from Scikit-Learn Pipeline&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.neuraxle.org/stable/scikit-learn_problems_solutions.html#problem-you-can-t-parallelize-nor-save-pipelines-using-steps-that-can-t-be-serialized-as-is-by-joblib"&gt;Problem: You can’t Parallelize nor Save Pipelines Using Steps that Can’t be Serialized “as-is” by Joblib&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Unfortunately, most Machine Learning pipelines and frameworks, such as scikit-learn, fail at combining Deep Learning algorithms within &lt;a href="https://www.neuraxio.com/en/blog/neuraxle/2019/10/26/neat-machine-learning-pipelines.html"&gt;neat pipeline abstractions&lt;/a&gt; allowing for clean code, automatic machine learning, parallelism &amp;amp; cluster computing, and deployment in production. Scikit-learn already has those nice pipeline abstractions, but they lack some of the features necessary for AutoML, deep learning pipelines, and more complex pipelines such as those deployed to production.&lt;/p&gt;

&lt;p&gt;Fortunately, we found design patterns and solutions that combine the best of all worlds, making things easy for coders and bringing concepts from recent frontend frameworks (e.g.: component lifecycle) into machine learning pipelines with the right abstractions, allowing for more possibilities. We also break past scikit-learn and Python’s parallelism limitations with a neat trick, allowing easier parallelization and serialization of pipelines for deployment in production, as well as enabling complex mutating pipelines for unsupervised pre-training and fine-tuning.&lt;/p&gt;

&lt;p&gt;We’re glad we’ve found a clean way to solve the most widespread problems out there, and we hope our findings will prove useful to many machine learning projects, including ones that can actually be deployed to production.&lt;/p&gt;

&lt;p&gt;If you liked this reading, &lt;a href="https://www.neuraxio.com/en/blog/neuraxle/index.html"&gt;subscribe to Neuraxio’s updates&lt;/a&gt; to be kept in the loop!&lt;/p&gt;

</description>
      <category>neuraxle</category>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>cleancode</category>
    </item>
    <item>
      <title>Why Deep Learning has a Bright Future</title>
      <dc:creator>Guillaume Chevalier</dc:creator>
      <pubDate>Sun, 29 Dec 2019 14:37:43 +0000</pubDate>
      <link>https://dev.to/neuraxio/why-deep-learning-has-a-bright-future-3gh</link>
      <guid>https://dev.to/neuraxio/why-deep-learning-has-a-bright-future-3gh</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7bhV8B2i--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.neuraxio.com/en/blog/assets/big-data-in-the-cloud.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7bhV8B2i--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.neuraxio.com/en/blog/assets/big-data-in-the-cloud.jpg" alt="Why Deep Learning has a Bright Future"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Would you like to see the future? This post aims at predicting what will happen to the field of Deep Learning. Scroll on.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Microprocessor Trends
&lt;/h2&gt;

&lt;p&gt;Who doesn’t like to see the real cause of trends?&lt;/p&gt;

&lt;h3&gt;
  
  
  “Get Twice the Power at a Constant Price Every 18 months”
&lt;/h3&gt;

&lt;p&gt;Some people have said that Moore’s Law is coming to an end. One version of this law states that every 18 months, computers gain 2x the computing power they had before, at a constant price. However, as seen on the chart below, improvements in computing seem to have come to a halt between 2000 and 2010.&lt;/p&gt;

&lt;h3&gt;
  
  
  See the Moore’s Law Graph
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NrvKNZ0F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://raw.githubusercontent.com/guillaume-chevalier/microprocessor-trend-data/7bbd582ba1376015f6cf24498f46db62811a2919/42yrs/42-years-processor-trend.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NrvKNZ0F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://raw.githubusercontent.com/guillaume-chevalier/microprocessor-trend-data/7bbd582ba1376015f6cf24498f46db62811a2919/42yrs/42-years-processor-trend.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  But the Growth Stalled…
&lt;/h3&gt;

&lt;p&gt;This halt comes from the fact that we’re reaching the minimum size of transistors, an essential component of CPUs. Making them smaller than this limit introduces computing errors, because of quantum behavior. Quantum computing will be a good thing, but it won’t replace the function of classical computers as we know them today.&lt;/p&gt;

&lt;h3&gt;
  
  
  Faith isn’t lost: invest in parallel computing
&lt;/h3&gt;

&lt;p&gt;In another respect, Moore’s Law isn’t broken yet: the number of transistors we can stack in parallel keeps growing. This means we can still speed up computing by doing parallel processing. In simpler words: having more cores. GPUs are growing in this direction: GPUs with 2000 cores are already fairly common in the computing world.&lt;/p&gt;

&lt;h3&gt;
  
  
  That means Deep Learning is a good bet
&lt;/h3&gt;

&lt;p&gt;Luckily for Deep Learning, it is built on matrix multiplications. This means that deep learning algorithms can be massively parallelized, and will profit from future improvements in what remains of Moore’s Law.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/guillaume-chevalier/Awesome-Deep-Learning-Resources"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--J_zX0q0y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/guillaume-chevalier/Awesome-Deep-Learning-Resources/raw/master/google_trends.png" alt=""&gt;&lt;/a&gt;See also: &lt;a href="https://github.com/guillaume-chevalier/Awesome-Deep-Learning-Resources"&gt;Awesome Deep Learning Resources&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The AI Singularity in 2029
&lt;/h2&gt;

&lt;h3&gt;
  
  
  A prediction by Ray Kurzweil
&lt;/h3&gt;

&lt;p&gt;Ray Kurzweil predicts that the singularity will happen in 2029. That is, as he defines it, the moment when a $1000 computer will contain as much computing power as the human brain. He is confident that this will happen, and he insists that what needs to be worked on to reach true singularity is better algorithms.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gP5H2Z-s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.neuraxio.com/en/blog/assets/bright-blue-clouds.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gP5H2Z-s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.neuraxio.com/en/blog/assets/bright-blue-clouds.jpg" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  “We’re limited by the algorithms we use”
&lt;/h3&gt;

&lt;p&gt;So we’d mostly be limited by not having found the best mathematical formulas yet. Until then, for learning to properly take place, one needs to feed a lot of data to deep learning algorithms.&lt;/p&gt;

&lt;p&gt;We, at Neuraxio, predict that Deep Learning algorithms built for time series processing will be a very good foundation to build upon to get closer to &lt;a href="https://guillaume-chevalier.com/limits-of-deep-learning-and-its-future/"&gt;where the future of deep learning is headed&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Big Data and AI
&lt;/h2&gt;

&lt;p&gt;Yes, this keyword is so 2014. It still remains relevant.&lt;/p&gt;

&lt;h3&gt;
  
  
  “90% of existing data was created in the last 2 years”
&lt;/h3&gt;

&lt;p&gt;It is reported by IBM New Vantage that 90% of financial data was accumulated in the past 2 years. That’s a lot. At this rate of growth, we’ll be able to feed deep learning algorithms ever more abundantly.&lt;/p&gt;

&lt;h3&gt;
  
  
  “By 2020, 37% of the information will have a potential for analysis”
&lt;/h3&gt;

&lt;p&gt;That is what The Guardian reports, according to big data statistics from IDC. In contrast, only 0.5% of all data was analyzed in 2012, according to the same source. Information is becoming more and more structured, and organizations are now more aware of tools to analyze their data. This means that deep learning algorithms will soon have easier access to the data, whether it is stored locally or in the cloud.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7bhV8B2i--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.neuraxio.com/en/blog/assets/big-data-in-the-cloud.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7bhV8B2i--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.neuraxio.com/en/blog/assets/big-data-in-the-cloud.jpg" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  It’s about intelligence.
&lt;/h2&gt;

&lt;p&gt;It’s about what defines us humans compared to all previous species: our intelligence.&lt;/p&gt;

&lt;p&gt;The key to intelligence and cognition is a very interesting subject to explore and is not yet well understood. Technologies related to this field are promising and, simply put, interesting. Many are driven by passion.&lt;/p&gt;

&lt;p&gt;On top of that, deep learning algorithms may use Quantum Computing and will apply to &lt;a href="https://guillaume-chevalier.com/random-thoughts-on-brain-computer-interfaces-productivity-and-privacy/"&gt;machine-brain interfaces in the future&lt;/a&gt;. Trend stacking at its finest: a recipe for success is to align as many stars as possible while working on practical matters.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ywr09K4I--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.neuraxio.com/en/blog/assets/passion-sky-clouds.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ywr09K4I--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.neuraxio.com/en/blog/assets/passion-sky-clouds.jpg" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;First, Moore’s Law and computing trends indicate that more and more things will be parallelized. Deep Learning will exploit that.&lt;/p&gt;

&lt;p&gt;Second, the AI singularity is predicted to happen in 2029 according to Ray Kurzweil. Advancing Deep Learning research is a way to get there, to reap the rewards and do good.&lt;/p&gt;

&lt;p&gt;Third, data doesn’t sleep. More and more data is accumulated every day. Deep Learning will exploit that.&lt;/p&gt;

&lt;p&gt;Finally, deep learning is about intelligence. It is about technology, it is about the brain, it is about learning, it is about what defines humans compared to all previous species: their intelligence. Curious people will know their way around deep learning.&lt;/p&gt;

&lt;p&gt;If you liked this article, consider following us for more!&lt;/p&gt;

</description>
      <category>deeplearning</category>
    </item>
    <item>
      <title>A Rant on Kaggle Competition Code (and Most Research Code)</title>
      <dc:creator>Guillaume Chevalier</dc:creator>
      <pubDate>Thu, 26 Dec 2019 05:25:23 +0000</pubDate>
      <link>https://dev.to/neuraxio/a-rant-on-kaggle-competition-code-and-most-research-code-3i4a</link>
      <guid>https://dev.to/neuraxio/a-rant-on-kaggle-competition-code-and-most-research-code-3i4a</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.neuraxio.com%2Fen%2Fblog%2Fassets%2Fbroken_automl_code.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.neuraxio.com%2Fen%2Fblog%2Fassets%2Fbroken_automl_code.jpg" alt="A Rant on Kaggle Competition Code (and Most Research Code)"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Machine Learning competition &amp;amp; research code sucks. What to do about it?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Having used code from Kaggle competitions a few times already, we realized it isn’t full of rainbows, unicorns, and leprechauns. It’s rather like Frankenstein’s monster: a work made of parts of other works, glued together and badly integrated. Machine Learning competition code in general, as well as machine learning research code, suffers from deep architectural issues. What to do about it? Using neat design patterns is the solution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bad Design Patterns.
&lt;/h2&gt;

&lt;p&gt;It’s common to see code coming from Kaggle competitions that doesn’t have the right abstractions for later deploying the pipeline to production - and for a logical reason: kagglers have no incentive to prepare code for production, as they only need to win the competition by generating good results. The same goes for most research code behind published papers: often, just beating the benchmarks is all that is sought, and the code is ditched once the paper is published. Moreover, many machine learning coders didn’t learn to code properly in the first place, which makes things harder.&lt;/p&gt;

&lt;p&gt;Here are a few examples of bad patterns we’ve seen:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Coding a pipeline as a bunch of small manual “main” files to be executed one by one in a certain order, in parallel or one after the other. Yes, we’ve seen that many times.&lt;/li&gt;
&lt;li&gt;Forcing disk persistence between the aforementioned small main files, which strongly couples the code to the storage and makes it impossible to run the code without saving to disk. Yes, many times.&lt;/li&gt;
&lt;li&gt;Making the disk persistence mechanism different for each of those small “main” files. For example, using a mix of JSON, then CSV, then Pickles, and sometimes HDF5 or raw numpy array dumps. Or even worse: mixing many databases and bigger frameworks instead of keeping it simple and writing to disk. Yikes! Keep it simple!&lt;/li&gt;
&lt;li&gt;Providing no instructions whatsoever on how to run things in the right order. The bullcrap is left as an exercise for the reader.&lt;/li&gt;
&lt;li&gt;Having no unit tests, or unit tests that do test the algorithm, but that also require writing to disk or using what was already written to disk. Ugh. And by the time you execute that untested code again, you end up with an updated dependency for which no installation version was pinned, and nothing works as it did anymore.&lt;/li&gt;
&lt;/ul&gt;
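&lt;p&gt;By contrast, the first two bad patterns above have a simple cure: one entry point that chains the steps in memory, with persistence injected as an optional concern rather than hard-coded between steps. A minimal sketch of the idea (hypothetical helper names, plain Python):&lt;/p&gt;

```python
def run_pipeline(data, steps, save=None):
    """Single entry point: runs every step in order, in memory.
    Persistence is an injected, optional concern (the `save` callback),
    not a mandatory write to disk between steps."""
    for step in steps:
        data = step(data)
        if save is not None:
            save(step.__name__, data)
    return data


def clean(rows):
    """Toy step: strips surrounding whitespace."""
    return [r.strip() for r in rows]


def lowercase(rows):
    """Toy step: normalizes case."""
    return [r.lower() for r in rows]


result = run_pipeline(["  Hello ", "WORLD"], steps=[clean, lowercase])
print(result)  # ['hello', 'world']
```

&lt;p&gt;Run order is explicit in one place, and the pipeline works with or without saving anything to disk.&lt;/p&gt;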

&lt;h2&gt;
  
  
  Pointing to Some Examples
&lt;/h2&gt;

&lt;p&gt;These bad patterns don’t only apply to code written in programming competition environments (such as &lt;a href="https://github.com/guillaume-chevalier/CSGames-2019-AI" rel="noopener noreferrer"&gt;this code&lt;/a&gt; of mine written in a rush - yes, I can do it too when unavoidably &lt;a href="https://www.amazon.ca/Clean-Coder-Conduct-Professional-Programmers/dp/0137081073" rel="noopener noreferrer"&gt;pressured&lt;/a&gt;). Here are some examples of code with checkpoints using the disks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Most &lt;a href="http://blog.kaggle.com/category/winners-interviews/" rel="noopener noreferrer"&gt;winning Kaggle competition&lt;/a&gt; code. We’ve dug into such code many times, and we never once found the proper abstractions there.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/google-research/bert" rel="noopener noreferrer"&gt;BERT&lt;/a&gt;. Bear with me - just try to refactor “&lt;a href="https://github.com/google-research/bert/blob/master/run_squad.py" rel="noopener noreferrer"&gt;run_squad.py&lt;/a&gt;” for a second, and you’ll realize that every level of abstraction are coupled together. To name a few, the console argument parsing logic is mixed up at the same level of the model definition logic, full of global flag variables. Not only that, the model definition logic is mixed in all of this, along with the data loading and saving logic that uses the cloud, in one huge file of more than 1k lines of code in a small project.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/facebookresearch/fastText/" rel="noopener noreferrer"&gt;FastText&lt;/a&gt;. The Python API is made for loading text files from disks, training on that, and dumping on disk the result. Couldn’t dumping on disk and using text files as training input be optional?&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Using Competition Code?
&lt;/h2&gt;

&lt;p&gt;Although companies can sometimes draw inspiration from code on Kaggle, I’d advise them to code their own production-proof pipelines, as taking such competition code as-is is risky. There is a saying that competition code is the worst code for companies to use, and even that the people winning competitions are the worst ones to hire - because they write poor code.&lt;/p&gt;

&lt;p&gt;I wouldn’t go that far (as I myself regularly earn podiums in coding competitions) - it’s rather that competition code is written without thinking of the future, since the goal is to win. Ironically, it’s at that moment that reading &lt;a href="https://www.quora.com/What-is-your-review-of-Clean-Code-A-Handbook-of-Agile-Software-Craftsmanship-Robert-C-Martin/answer/Guillaume-Chevalier-2" rel="noopener noreferrer"&gt;Clean Code&lt;/a&gt; and &lt;a href="https://www.amazon.ca/Clean-Coder-Conduct-Professional-Programmers/dp/0137081073" rel="noopener noreferrer"&gt;Clean Coder&lt;/a&gt; becomes important. Using good pipeline abstractions helps machine learning projects survive.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solving the Problems
&lt;/h2&gt;

&lt;p&gt;Here are the things you want when building a machine learning pipeline that is meant to be shipped to production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You ideally want a pipeline that can process your data by calling just one function, not lots of small executable files. &lt;a href="https://www.neuraxle.org/stable/api/neuraxle.checkpoints.html" rel="noopener noreferrer"&gt;You might have some caching enabled&lt;/a&gt; if things are too slow to run, but keep caching as minimal as possible, and your &lt;a href="https://www.neuraxle.org/stable/api/neuraxle.steps.caching.html" rel="noopener noreferrer"&gt;caching might not be checkpoints exactly&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;The ability to simply skip data checkpoints between pipeline steps. You want to be able to &lt;a href="https://www.neuraxle.org/stable/examples/auto_ml_checkpoint.html#sphx-glr-examples-auto-ml-checkpoint-py" rel="noopener noreferrer"&gt;deactivate all your pipeline’s checkpoints easily&lt;/a&gt;. Checkpoints are good for training the model, debugging it, and actively coding the pipeline, but in production they’re just heavy, and they must be easy to disable.&lt;/li&gt;
&lt;li&gt;The ability to &lt;a href="https://github.com/Neuraxio/Neuraxle" rel="noopener noreferrer"&gt;scale your ML pipeline on a cluster of machines&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Finally, you want the whole thing to be robust to errors and to make good predictions. &lt;a href="https://github.com/Neuraxio/Neuraxle" rel="noopener noreferrer"&gt;Automatic Machine Learning&lt;/a&gt; can help here.&lt;/li&gt;
&lt;/ul&gt;
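&lt;p&gt;The second point above - checkpoints that are easy to disable - can be sketched in a few lines of plain Python. This is a hypothetical illustration, not Neuraxle’s actual API: the &lt;code&gt;CHECKPOINTS_ENABLED&lt;/code&gt; flag, the &lt;code&gt;checkpointed&lt;/code&gt; decorator, and the &lt;code&gt;normalize&lt;/code&gt; step are all invented names for this example.&lt;/p&gt;

```python
import hashlib
import os
import pickle
import tempfile

# Hypothetical sketch (not the Neuraxle API): cache a pipeline step's
# output on disk, with one global switch so checkpoints are trivial to
# disable in production.

CHECKPOINTS_ENABLED = True  # flip to False when deploying

def checkpointed(step):
    def wrapper(data):
        if not CHECKPOINTS_ENABLED:
            return step(data)  # production path: no disk I/O at all
        # Key the checkpoint on the step's name and its input data.
        key = hashlib.sha256(pickle.dumps((step.__name__, data))).hexdigest()
        path = os.path.join(tempfile.gettempdir(), key + ".ckpt")
        if os.path.exists(path):
            with open(path, "rb") as f:
                return pickle.load(f)  # reuse the saved intermediate result
        result = step(data)
        with open(path, "wb") as f:
            pickle.dump(result, f)
        return result
    return wrapper

@checkpointed
def normalize(values):
    top = max(values)
    return [v / top for v in values]

print(normalize([1, 2, 4]))  # [0.25, 0.5, 1.0]
```

&lt;p&gt;The point of the design is that the step’s logic never touches the disk itself: checkpointing is layered on from the outside, so removing it is a one-line change rather than a refactoring.&lt;/p&gt;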

&lt;p&gt;And if your goal is instead to keep doing competitions, please at least note that I started winning more competitions &lt;strong&gt;after&lt;/strong&gt; learning how to write clean code. So the solutions above apply in competitions as well: you’ll have more mental clarity and, as a result, more speed too, even if designing and thinking about your code beforehand seems to take a lot of precious time. The investment pays off even in the short run. Now that we’ve built a framework that eases writing clean pipelines, we have a hard time picturing ever going back to our previous habits. Clean must be the new habit.&lt;/p&gt;

&lt;p&gt;In all cases, using good patterns and good practices will almost always save time, even in the short or medium term - for instance, &lt;a href="https://www.neuraxle.org/stable/examples/boston_housing_regression_with_model_stacking.html#sphx-glr-examples-boston-housing-regression-with-model-stacking-py" rel="noopener noreferrer"&gt;using the pipe and filter design pattern with Neuraxle&lt;/a&gt;.&lt;/p&gt;
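&lt;p&gt;For readers unfamiliar with pipe and filter, here is a minimal sketch of the pattern in plain Python. The &lt;code&gt;Pipeline&lt;/code&gt; class and &lt;code&gt;transform&lt;/code&gt; method below are illustrative stand-ins inspired by the pattern Neuraxle implements, not Neuraxle’s actual classes.&lt;/p&gt;

```python
# Minimal pipe-and-filter sketch: each step (filter) transforms the data
# and passes it along the pipe to the next step.

class Pipeline:
    def __init__(self, steps):
        self.steps = steps  # each step is a callable filter

    def transform(self, data):
        for step in self.steps:
            data = step(data)  # output of one filter feeds the next
        return data

pipeline = Pipeline([
    lambda texts: [t.lower() for t in texts],       # lowercase
    lambda texts: [t.split() for t in texts],       # tokenize
    lambda token_lists: [len(t) for t in token_lists],  # count tokens
])
print(pipeline.transform(["Hello World", "Clean Pipelines Win"]))  # [2, 3]
```

&lt;p&gt;Because each filter only sees its input and produces its output, steps can be swapped, tested, and cached independently - exactly the flexibility that monolithic scripts like the ones criticized above lack.&lt;/p&gt;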

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;It’s hard to write good code when pressured by the deadline of a competition, with no incentive to build reusable and deployable code. We created Neuraxle to make the good abstractions easy to use even when in a rush. As a result, it’s worthwhile to refactor competition code into Neuraxle code, and it’s a good idea to write your future code using a framework like Neuraxle.&lt;/p&gt;

&lt;p&gt;The future is now. If you’d like to support Neuraxle, we’d be glad to have you &lt;a href="mailto:hello@neuraxio.com"&gt;get in touch with us&lt;/a&gt;. You can also &lt;a href="https://www.neuraxio.com/en/blog/neuraxle/index.html" rel="noopener noreferrer"&gt;register to our updates and follow us&lt;/a&gt;. Cheers!&lt;/p&gt;

</description>
      <category>cleancode</category>
    </item>
    <item>
      <title>How to Code Neat Machine Learning Pipelines?</title>
      <dc:creator>Guillaume Chevalier</dc:creator>
      <pubDate>Sun, 22 Dec 2019 03:15:35 +0000</pubDate>
      <link>https://dev.to/neuraxio/how-to-code-neat-machine-learning-pipelines-475i</link>
      <guid>https://dev.to/neuraxio/how-to-code-neat-machine-learning-pipelines-475i</guid>
      <description>&lt;p&gt;Have you ever coded an ML pipeline which was taking a lot of time to run? &lt;/p&gt;

&lt;p&gt;Or worse: have you ever gotten to the point where you needed to save intermediate parts of the pipeline to disk as checkpoints, just to be able to focus on one step at a time? &lt;/p&gt;

&lt;p&gt;Or even worse: have you ever tried to refactor such poorly-written machine learning code to put it to production, and it took you months? &lt;/p&gt;

&lt;p&gt;Well, we’ve all been there after working on machine learning pipelines for long enough. &lt;/p&gt;

&lt;p&gt;So how should we build a good pipeline that will give us flexibility and the ability to easily refactor the code to put it in production later?&lt;/p&gt;

&lt;p&gt;Read the full post here: &lt;br&gt;
&lt;a href="https://www.neuraxio.com/en/blog/neuraxle/2019/10/26/neat-machine-learning-pipelines.html"&gt;https://www.neuraxio.com/en/blog/neuraxle/2019/10/26/neat-machine-learning-pipelines.html&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
      <category>architecture</category>
    </item>
  </channel>
</rss>
