<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kamil A. Kaczmarek</title>
    <description>The latest articles on DEV Community by Kamil A. Kaczmarek (@kamil_k7k).</description>
    <link>https://dev.to/kamil_k7k</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F430709%2F4a7f6a2d-c7f8-487f-928e-58d4b75b6195.png</url>
      <title>DEV Community: Kamil A. Kaczmarek</title>
      <link>https://dev.to/kamil_k7k</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kamil_k7k"/>
    <language>en</language>
    <item>
      <title>Random Forest Regression: When Does It Fail and Why?</title>
      <dc:creator>Kamil A. Kaczmarek</dc:creator>
      <pubDate>Tue, 18 Aug 2020 22:33:26 +0000</pubDate>
      <link>https://dev.to/kamil_k7k/random-forest-regression-when-does-it-fail-and-why-19i8</link>
      <guid>https://dev.to/kamil_k7k/random-forest-regression-when-does-it-fail-and-why-19i8</guid>
      <description>&lt;p&gt;&lt;em&gt;This article was originally written by &lt;a href="https://www.linkedin.com/in/mwitiderrick/"&gt;Derrick Mwiti&lt;/a&gt; and posted on the &lt;a href="https://neptune.ai/blog/random-forest-regression-when-does-it-fail-and-why?utm_source=devto&amp;amp;utm_medium=crosspost&amp;amp;utm_campaign=blog-random-forest-regression-when-does-it-fail-and-why"&gt;Neptune blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;In this article, we’ll look at a major problem with using Random Forest for regression, which is &lt;strong&gt;extrapolation&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;We’ll cover the following items:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Random Forest Regression vs Linear Regression&lt;/li&gt;
&lt;li&gt;Random Forest Regression Extrapolation Problem&lt;/li&gt;
&lt;li&gt;Potential solutions&lt;/li&gt;
&lt;li&gt;Should you use Random Forest for Regression?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s dive in.&lt;/p&gt;




&lt;h1&gt;Random Forest Regression vs Linear Regression&lt;/h1&gt;

&lt;p&gt;Random Forest Regression is quite a robust algorithm; however, the question is: &lt;strong&gt;should you use it for regression?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Why not use Linear Regression instead? The function in a Linear Regression can easily be written as &lt;code&gt;y = mx + c&lt;/code&gt;, while a complex Random Forest Regression seems like a black box that can’t easily be represented as a function. &lt;/p&gt;

&lt;p&gt;Generally, Random Forests produce better results, work well on large datasets, and can handle missing data by creating estimates for it. However, they pose a major challenge: they can’t extrapolate beyond the range of the training data. We’ll dive deeper into this challenge in a minute. &lt;/p&gt;

&lt;h3&gt;Decision Tree Regression&lt;/h3&gt;

&lt;p&gt;Decision Trees are great for obtaining non-linear relationships between input features and the target variable.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;inner working of a Decision Tree can be thought of as a bunch of if-else conditions.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It starts at the very top with one node. This node then splits into a left and right node — decision nodes. These nodes then split into their respective right and left nodes. &lt;/p&gt;

&lt;p&gt;The bottom-most nodes are referred to as leaves or terminal nodes.&lt;/p&gt;

&lt;p&gt;The value in a leaf is usually the mean of the observations that fall within that specific region. For instance, in the right-most leaf node below, 552.889 is the average of the 5 samples.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--O8Ww2-hK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://lh5.googleusercontent.com/d5QSiPN3bIKdLCWkdOlnRLpLleMMo5ut904gtGUBP3Q3244u1BVMHgqkcXeEo9HtLoRU6agt--Y_U_aG1Oxosf7voq9YBcxJOIQ6cW2YiSmQZ2zLLZO-CcVsK46powAHxlPrzoDC" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--O8Ww2-hK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://lh5.googleusercontent.com/d5QSiPN3bIKdLCWkdOlnRLpLleMMo5ut904gtGUBP3Q3244u1BVMHgqkcXeEo9HtLoRU6agt--Y_U_aG1Oxosf7voq9YBcxJOIQ6cW2YiSmQZ2zLLZO-CcVsK46powAHxlPrzoDC" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How far this splitting goes is known as the depth of the tree, and it is one of the hyperparameters that can be tuned. The maximum depth of the tree is specified to prevent the tree from becoming too deep — a scenario that leads to overfitting. &lt;/p&gt;
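&lt;p&gt;A minimal sketch of both ideas, assuming scikit-learn’s &lt;code&gt;DecisionTreeRegressor&lt;/code&gt; and toy data rather than the diamonds dataset: the depth of the splitting is capped by &lt;code&gt;max_depth&lt;/code&gt;, and each leaf predicts the mean of the training samples that land in it.&lt;/p&gt;

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# toy non-linear data standing in for the diamonds features
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 10, 200)).reshape(-1, 1)
y = np.sin(X).ravel()

# max_depth caps how far the splitting goes, guarding against overfitting
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
print("depth:", tree.get_depth())  # never exceeds 3

# each leaf predicts the mean of the training samples that fall into it
leaf_ids = tree.apply(X)
leaf_mean = y[leaf_ids == leaf_ids[0]].mean()
print(np.isclose(leaf_mean, tree.predict(X[:1])[0]))  # True
```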

&lt;h3&gt;Random Forest Regression&lt;/h3&gt;

&lt;p&gt;Random Forest is an ensemble of decision trees. This is to say that many trees, constructed in a certain “random” way, form a Random Forest. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each tree is created from a different sample of rows and at each node, a different sample of features is selected for splitting. &lt;/li&gt;
&lt;li&gt;Each of the trees makes its own individual prediction. &lt;/li&gt;
&lt;li&gt;These predictions are then averaged to produce a single result. &lt;/li&gt;
&lt;/ul&gt;
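&lt;p&gt;These steps can be verified in a few lines, assuming scikit-learn’s &lt;code&gt;RandomForestRegressor&lt;/code&gt; on toy data: the forest’s prediction equals the average of the individual trees’ predictions.&lt;/p&gt;

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, (200, 2))   # each tree sees a bootstrap sample of rows
y = 2 * X[:, 0] + np.sin(X[:, 1])

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# the forest's prediction is the average of the individual trees' predictions
per_tree = np.stack([tree.predict(X[:5]) for tree in forest.estimators_])
print(np.allclose(per_tree.mean(axis=0), forest.predict(X[:5])))  # True
```

This is exactly the averaging step described in the list above.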

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UaAg2DP7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://lh5.googleusercontent.com/ZzEagZv-a3KFCSr610aEvXIPTRTWV_cFdgpXoDFlj_r7A8ex5L0aE33aLuQptfngJLiT5xX3yk8LwGAyvOVY9rIsqe1ZZmIvg71yQIdXuVxuPpgmvm85aAxmP32M-ODq6_E3uxrV" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UaAg2DP7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://lh5.googleusercontent.com/ZzEagZv-a3KFCSr610aEvXIPTRTWV_cFdgpXoDFlj_r7A8ex5L0aE33aLuQptfngJLiT5xX3yk8LwGAyvOVY9rIsqe1ZZmIvg71yQIdXuVxuPpgmvm85aAxmP32M-ODq6_E3uxrV" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;source: &lt;a href="https://commons.wikimedia.org/wiki/File:Random_forest_diagram_complete.png"&gt;Wikimedia&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This averaging makes a Random Forest better than a single Decision Tree: it improves accuracy and reduces overfitting. &lt;/p&gt;

&lt;p&gt;A prediction from the Random Forest Regressor is an average of the predictions produced by the trees in the forest. &lt;/p&gt;

&lt;h3&gt;Example of trained Linear Regression and Random Forest&lt;/h3&gt;

&lt;p&gt;To dive in further, let’s look at an example of a Linear Regression and a Random Forest Regression. For this, we’ll apply the Linear Regression and a Random Forest Regression to the same dataset and compare the results. &lt;/p&gt;

&lt;p&gt;Let’s take this example dataset, where you should predict the price of diamonds based on other features like carat, depth, table, x, y and z. Here is the distribution of price:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YgjJiOEf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://lh5.googleusercontent.com/6XreM_NDuvsLtTK-nyeaT2kEbD6-ENHcc5DCbjM4_n04kVXQIzPwU_-rb18OKK_uZfqRQfq7x1CjcJxSJxdENi5KVVaVVuiUZZ3ahOxqWGI96UaEUr955FGw8Tri0KSam_31FvHS" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YgjJiOEf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://lh5.googleusercontent.com/6XreM_NDuvsLtTK-nyeaT2kEbD6-ENHcc5DCbjM4_n04kVXQIzPwU_-rb18OKK_uZfqRQfq7x1CjcJxSJxdENi5KVVaVVuiUZZ3ahOxqWGI96UaEUr955FGw8Tri0KSam_31FvHS" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UVYxvrU---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://lh3.googleusercontent.com/FQ3TkJhrVdNFzo40aYyHgO2bwXYAbD7kgX6EHi_bvH08kbGEcu55OKNGMKCU5_iSWEaCW71oq5ysyyWFCC0rw0Ahi-miGxWvyFe_TwP2Msjo1E1iOHcdMygFmLKirbuOA7IS8alq" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UVYxvrU---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://lh3.googleusercontent.com/FQ3TkJhrVdNFzo40aYyHgO2bwXYAbD7kgX6EHi_bvH08kbGEcu55OKNGMKCU5_iSWEaCW71oq5ysyyWFCC0rw0Ahi-miGxWvyFe_TwP2Msjo1E1iOHcdMygFmLKirbuOA7IS8alq" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see that the &lt;strong&gt;price ranges from 326 to 18823.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let’s train the Linear Regression model and run predictions on the validation set.&lt;/p&gt;

&lt;p&gt;The distribution of predicted prices is the following:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FGXeTfP4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://lh3.googleusercontent.com/K_XtAJSvytdOYMrU6gqmhAPzYwhYd7-PKxfegUceQP3pj8ePXcPDK9QtOVwjeT8SbfQr3RCDG8aKst02qGnvfQ49C8FC1nyx0joQddgEgIlUal4o9Gpgcn7CKfxvhW5NM-Nx51Oq" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FGXeTfP4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://lh3.googleusercontent.com/K_XtAJSvytdOYMrU6gqmhAPzYwhYd7-PKxfegUceQP3pj8ePXcPDK9QtOVwjeT8SbfQr3RCDG8aKst02qGnvfQ49C8FC1nyx0joQddgEgIlUal4o9Gpgcn7CKfxvhW5NM-Nx51Oq" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kLhzpidA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://lh6.googleusercontent.com/bYkQceurBuITEmOiGaaRY6e0W8DItepqjEWP3QCVS-Wn-UP0EaXr3hjVOSX0-1NtcZ_oxECfSNOSZ_UH5_FWOgc31W0axILap77zFFo3ciEhECzL3At-tX4fgjO69P3GhYjQkmlZ" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kLhzpidA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://lh6.googleusercontent.com/bYkQceurBuITEmOiGaaRY6e0W8DItepqjEWP3QCVS-Wn-UP0EaXr3hjVOSX0-1NtcZ_oxECfSNOSZ_UH5_FWOgc31W0axILap77zFFo3ciEhECzL3At-tX4fgjO69P3GhYjQkmlZ" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Predicted prices are clearly outside the range of values of “price” seen in the training dataset.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A Linear Regression model, just like the name suggests, fits a linear model to the data. A simple way to think about it is in the form of &lt;code&gt;y = mx + c&lt;/code&gt;. Since it fits a linear model, it can produce values outside the training set range during prediction. &lt;strong&gt;It is able to extrapolate based on the data.&lt;/strong&gt;&lt;/p&gt;
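&lt;p&gt;A minimal way to see this, assuming scikit-learn and a synthetic linear dataset in place of the diamonds data:&lt;/p&gt;

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# train on a purely linear relationship, y = 3x + 2, with x between 0 and 10
X = np.arange(0, 10, 0.5).reshape(-1, 1)
y = 3 * X.ravel() + 2

lin = LinearRegression().fit(X, y)
pred = lin.predict([[20.0]])[0]  # a query far beyond the training range
print(pred)  # about 62.0, well above any target seen during training
```

Since the true relationship is linear, the model recovers the slope and intercept and happily predicts beyond the training range.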

&lt;p&gt;Let’s now look at the results obtained from a Random Forest Regressor using the same dataset.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fTMXkp8y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://lh6.googleusercontent.com/93-amJ52_eGP2poE6wMoMG4RWkvH5Jw-w7fwSfq0unViy0TttQy8pUSHzzAKv9nSrqjU2LJqrht18N4tdRnylHQ5w_A4MGGkAiXzMmL1N_Jd1JuecRTH7d71oav74iv-Ca1Ysxt7" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fTMXkp8y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://lh6.googleusercontent.com/93-amJ52_eGP2poE6wMoMG4RWkvH5Jw-w7fwSfq0unViy0TttQy8pUSHzzAKv9nSrqjU2LJqrht18N4tdRnylHQ5w_A4MGGkAiXzMmL1N_Jd1JuecRTH7d71oav74iv-Ca1Ysxt7" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SH294VC0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://lh3.googleusercontent.com/cirOkDSeWWMEd947nOxhcQf7jE5nxpiIDd4SsXkTUmigVvCS9-wVenoAdS34EmPomFyR75WsJbwnRkPudVgMY8NUTsk2ahwsJxVdtQmMQxGP_FkyZYhlgDVi4HRT09Yq1ixuuFMK" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SH294VC0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://lh3.googleusercontent.com/cirOkDSeWWMEd947nOxhcQf7jE5nxpiIDd4SsXkTUmigVvCS9-wVenoAdS34EmPomFyR75WsJbwnRkPudVgMY8NUTsk2ahwsJxVdtQmMQxGP_FkyZYhlgDVi4HRT09Yq1ixuuFMK" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These values are clearly &lt;strong&gt;within the range of 326 to 18823&lt;/strong&gt;, just like in our training set. There are no values outside that range. &lt;strong&gt;Random Forest cannot extrapolate.&lt;/strong&gt;&lt;/p&gt;
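&lt;p&gt;You can reproduce this behaviour on synthetic data, assuming scikit-learn (the experiments above use the diamonds dataset): a forest trained on a perfect linear trend still refuses to predict beyond the targets it has seen.&lt;/p&gt;

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# a perfectly linear relationship, y = 3x + 2, with x between 0 and 10
X = np.arange(0, 10, 0.5).reshape(-1, 1)
y = 3 * X.ravel() + 2

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
pred = forest.predict([[20.0]])[0]  # a query far beyond the training range

# the prediction never leaves the range of targets seen during training
print(np.less_equal(pred, y.max()) and np.greater_equal(pred, y.min()))  # True
```

The forest can only average leaf means learned from the training targets, so its answer for x = 20 is pinned near the top of the training range.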




&lt;h1&gt;Extrapolation Problem&lt;/h1&gt;

&lt;p&gt;As you have seen above, when using a Random Forest Regressor, the predicted values are never outside the training set values for the target variable.&lt;/p&gt;

&lt;p&gt;If you look at the prediction values, they will look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PwOlqJ92--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://lh3.googleusercontent.com/aTc9sANC3x-luOlDmSVMy9rUkSp6K1JMYgODnWl_2iPaAfPgk-ee8Sm2orKIxl-LDnVss8u11_IxgpuLuFhBF_4yOcwl2LwsDXJ2xHHQZS_DUghDK-jU2kX1-tgX3s24WZz-euja" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PwOlqJ92--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://lh3.googleusercontent.com/aTc9sANC3x-luOlDmSVMy9rUkSp6K1JMYgODnWl_2iPaAfPgk-ee8Sm2orKIxl-LDnVss8u11_IxgpuLuFhBF_4yOcwl2LwsDXJ2xHHQZS_DUghDK-jU2kX1-tgX3s24WZz-euja" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://www.researchgate.net/publication/327298817_Random_forest_as_a_generic_framework_for_predictive_modeling_of_spatial_and_spatio-temporal_variables"&gt;source&lt;/a&gt;: Hengl, Tomislav et. al “Random forest as a generic framework for predictive modeling of spatial and spatio-temporal variables”. PeerJ. 6. e5518. 10.7717/peerj.5518.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Wondering why?&lt;/p&gt;

&lt;p&gt;Let’s explore that phenomenon here. The data used above has the following columns for predicting the price: carat, depth, table, x, y, and z.&lt;/p&gt;

&lt;p&gt;The diagram below shows one decision tree from the Random Forest Regressor. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nPakNEmN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://lh4.googleusercontent.com/geF5UxG5sX2r4OSXvoRW0An6lsrucPUdOzC3_qP2yoVNreLRhx-Q6XOZmEgGiQknDAlh1cIxbubJISwxy2of_QM-KVBvBEWA6PzOcICek3Ol9wqoLb8NWbvbG_lUzn-okQmX1UiP" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nPakNEmN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://lh4.googleusercontent.com/geF5UxG5sX2r4OSXvoRW0An6lsrucPUdOzC3_qP2yoVNreLRhx-Q6XOZmEgGiQknDAlh1cIxbubJISwxy2of_QM-KVBvBEWA6PzOcICek3Ol9wqoLb8NWbvbG_lUzn-okQmX1UiP" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s zoom in on a smaller section of this tree. For example, there are 4 samples with depth &amp;lt;= 62.75, x &amp;lt;= 5.545, carat &amp;lt;= 0.905, and z &amp;lt;= 3.915. The price predicted for these is 2775.75, which is the mean of these four samples. Therefore, &lt;strong&gt;any observation in the test set that falls into this leaf will be predicted as 2775.75.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--anoiuth0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://lh4.googleusercontent.com/tpwTzOrAuYhpaWu-sybNBCQkFBdaTZhP845A-YhNs_4fTBV-2b4FcAcuMIUqKycwP5RbT7vmPGEhXNm3AdeGVwyhPfDS_yZyO15pRhlZaslocc8scoasQtdsDtXbTP_gd9z4FweF" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--anoiuth0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://lh4.googleusercontent.com/tpwTzOrAuYhpaWu-sybNBCQkFBdaTZhP845A-YhNs_4fTBV-2b4FcAcuMIUqKycwP5RbT7vmPGEhXNm3AdeGVwyhPfDS_yZyO15pRhlZaslocc8scoasQtdsDtXbTP_gd9z4FweF" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is to say that when the Random Forest Regressor is tasked with predicting values not previously seen, it will always predict an average of the values seen previously. Obviously, the average of a sample cannot fall outside the highest and lowest values in the sample. &lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Random Forest Regressor is unable to discover trends that would enable it to extrapolate values that fall outside the training set.&lt;/strong&gt; When faced with such a scenario, the regressor assumes that the prediction will fall close to the maximum value in the training set. The figure above illustrates that.&lt;/p&gt;

&lt;h3&gt;Potential Solutions&lt;/h3&gt;

&lt;p&gt;Ok, so how can you deal with this extrapolation problem?&lt;/p&gt;

&lt;p&gt;There are a couple of options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use a linear model, such as SVM regression, Linear Regression, etc.&lt;/li&gt;
&lt;li&gt;Build a deep learning model, because neural nets are able to extrapolate (they are basically stacked linear regression models on steroids).&lt;/li&gt;
&lt;li&gt;Combine predictors using stacking. For example, you can create a stacking regressor using a linear model and a Random Forest Regressor.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use modified versions of Random Forest&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
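&lt;p&gt;For the stacking option, here is a sketch using scikit-learn’s &lt;code&gt;StackingRegressor&lt;/code&gt; (an illustration on toy data, not a recipe from the article): the linear base model supplies a global trend that the forest alone cannot extrapolate.&lt;/p&gt;

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression

X = np.arange(0, 10, 0.25).reshape(-1, 1)
y = 3 * X.ravel() + 2

# the linear base model supplies a global trend; the forest adds flexibility
stack = StackingRegressor(
    estimators=[
        ("linear", LinearRegression()),
        ("forest", RandomForestRegressor(n_estimators=50, random_state=0)),
    ],
    final_estimator=LinearRegression(),
).fit(X, y)

print(stack.predict([[20.0]])[0])  # combined prediction at an unseen point
```

Whether the stack extrapolates well depends on how the final estimator weights the two base models, so treat this purely as a starting point.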

&lt;p&gt;One such extension is &lt;a href="https://arxiv.org/pdf/1904.10416.pdf"&gt;Regression-Enhanced Random Forests&lt;/a&gt; (RERFs). The authors of this paper propose a technique that borrows the strengths of penalized parametric regression to give better results on extrapolation problems.&lt;/p&gt;

&lt;p&gt;Specifically, there are two steps to the process:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;run Lasso before Random Forest, &lt;/li&gt;
&lt;li&gt;train a Random Forest on the residuals from Lasso. &lt;/li&gt;
&lt;/ul&gt;
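&lt;p&gt;The two-step recipe can be sketched as follows, assuming scikit-learn; the exact penalties and tuning in the RERF paper differ, so this is only an illustration:&lt;/p&gt;

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, (300, 3))
y = 2 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(0, 0.1, 300)

# step 1: a penalized linear model (Lasso) captures the global trend
lasso = Lasso(alpha=0.1).fit(X, y)
residuals = y - lasso.predict(X)

# step 2: a Random Forest models what the linear part left unexplained
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, residuals)

def rerf_predict(X_new):
    # final prediction: linear trend plus the forest's correction
    return lasso.predict(X_new) + forest.predict(X_new)

print(rerf_predict(X[:3]).shape)  # (3,)
```

The linear part extrapolates the global trend, while the forest only adjusts within the range of residuals it has seen.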

&lt;p&gt;Since Random Forest is a fully nonparametric predictive algorithm, it may not efficiently incorporate known relationships between the response and the predictors (the response values being the observed values Y1, …, Yn from the training data). RERFs are able to incorporate such known relationships, which is another benefit of using Regression-Enhanced Random Forests for regression problems. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sZ-2e2wx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://lh6.googleusercontent.com/dbxTPJM7BudrFSYY4lSDneKHYwxkt5XvXgRRU-3FMAzDflmOr96tVdcBW6idIjj556UqOwqLHOkPVF9Fm_pOPWPrxnjI4U8kAoOXFXB9nHWNepcK_ZjQtXfdIFAmRszn-0lF94MO" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sZ-2e2wx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://lh6.googleusercontent.com/dbxTPJM7BudrFSYY4lSDneKHYwxkt5XvXgRRU-3FMAzDflmOr96tVdcBW6idIjj556UqOwqLHOkPVF9Fm_pOPWPrxnjI4U8kAoOXFXB9nHWNepcK_ZjQtXfdIFAmRszn-0lF94MO" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://arxiv.org/abs/1904.10416"&gt;source&lt;/a&gt;: Haozhe Zhang et. al 2019, Regression-Enhanced Random Forests&lt;/em&gt;&lt;/p&gt;




&lt;h1&gt;Final Thoughts&lt;/h1&gt;

&lt;p&gt;At this point, you might be wondering whether or not you should use a Random Forest for regression problems.&lt;/p&gt;

&lt;p&gt;Let’s look at that. &lt;/p&gt;

&lt;h3&gt;When to use it&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;When the data has a non-linear trend and extrapolation outside the training data is not important.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;When not to use it&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;When your data is in time series form. Time series problems require identifying a growing or decreasing trend, which a Random Forest Regressor will not be able to formulate. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Hopefully, this article gave you some background into the inner workings of Random Forest Regression.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This article was originally written by &lt;a href="https://www.linkedin.com/in/mwitiderrick/"&gt;Derrick Mwiti&lt;/a&gt; and posted on the &lt;a href="https://neptune.ai/blog/random-forest-regression-when-does-it-fail-and-why?utm_source=devto&amp;amp;utm_medium=crosspost&amp;amp;utm_campaign=blog-random-forest-regression-when-does-it-fail-and-why"&gt;Neptune blog&lt;/a&gt; where you can find more in-depth articles for machine learning practitioners.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>datascience</category>
      <category>regression</category>
    </item>
    <item>
      <title>The Best Tools, Libraries, Frameworks and Methodologies that Machine Learning Teams Use – Things We Learned from 41 ML Startups</title>
      <dc:creator>Kamil A. Kaczmarek</dc:creator>
      <pubDate>Thu, 30 Jul 2020 16:02:24 +0000</pubDate>
      <link>https://dev.to/kamil_k7k/the-best-tools-libraries-frameworks-and-methodologies-that-machine-learning-teams-use-things-we-learned-from-41-ml-startups-2ep8</link>
      <guid>https://dev.to/kamil_k7k/the-best-tools-libraries-frameworks-and-methodologies-that-machine-learning-teams-use-things-we-learned-from-41-ml-startups-2ep8</guid>
      <description>&lt;p&gt;&lt;em&gt;This article was originally written by &lt;a href="https://www.linkedin.com/in/jakub-czakon-2b797b69/" rel="noopener noreferrer"&gt;Jakub Czakon&lt;/a&gt; posted on the Neptune blog&lt;/em&gt;. &lt;/p&gt;




&lt;p&gt;Setting up a good tool stack for your Machine Learning team is important to work efficiently and be able to focus on delivering results. If you work at a startup, you know that it is especially important to set up an environment that can grow with your team, the needs of your users, and the rapidly evolving ML landscape. &lt;/p&gt;

&lt;p&gt;We wondered: &lt;strong&gt;“What are the best tools, libraries and frameworks that ML startups use to tackle this challenge?”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;And to answer that question we asked &lt;strong&gt;41 Machine Learning startups&lt;/strong&gt; from all over the world.&lt;/p&gt;

&lt;p&gt;The result?&lt;/p&gt;

&lt;p&gt;A ton of great advice that we grouped into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Methodology&lt;/li&gt;
&lt;li&gt;Software development setup&lt;/li&gt;
&lt;li&gt;Machine Learning frameworks&lt;/li&gt;
&lt;li&gt;MLOps&lt;/li&gt;
&lt;li&gt;Unexpected 🙂&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Read on to figure out what will work for your machine learning team.&lt;/p&gt;

&lt;h1&gt;Good methodology is the key&lt;/h1&gt;

&lt;p&gt;Tools are only as strong as the methodology that employs them. &lt;/p&gt;

&lt;p&gt;If you run around training models on some randomly acquired data and deploy whatever model you can get your hands on, sooner or later there will be trouble 🙂&lt;/p&gt;

&lt;p&gt;Kai Mildenberger from &lt;a href="https://psyml.co/" rel="noopener noreferrer"&gt;psyML&lt;/a&gt; says that: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;To us, the careful versioning of all the training and testing data is probably the most essential tool/methodology. We expect that to remain one of the most key elements in our toolbox, even as all of the techniques and mathematical models iterate forever. A second aspect might be to be extremely hypothesis driven. We use that as the single most important methodology to develop models.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I think having a strong understanding of what you want to use your tools for (and that you actually need them) is the very first step. &lt;/p&gt;

&lt;p&gt;That said it is important to know what is out there and what people in similar situations use successfully. &lt;/p&gt;

&lt;p&gt;Let’s dive right into that!&lt;/p&gt;

&lt;h1&gt;Software development tooling is the backbone of ML teams&lt;/h1&gt;

&lt;p&gt;The development environment is the foundation of every team’s workflow. So it was very interesting to learn what tools companies around the world consider the best in this area.&lt;br&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi2.wp.com%2Fneptune.ai%2Fwp-content%2Fuploads%2FGIF-software-development.gif%3Fw%3D1200%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi2.wp.com%2Fneptune.ai%2Fwp-content%2Fuploads%2FGIF-software-development.gif%3Fw%3D1200%26ssl%3D1" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Source: giphy.com&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;ML teams use various tools as an IDE. Many teams like &lt;a href="https://simplereport.ca/" rel="noopener noreferrer"&gt;SimpleReport&lt;/a&gt; and &lt;a href="https://www.hypergiant.com/" rel="noopener noreferrer"&gt;Hypergiant&lt;/a&gt; use Jupyter Notebooks and Jupyter Lab with its ecosystem of NB Extensions. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Jupyter Notebook is very useful for quick experiments and visualization, especially when exchanging ideas between multiple team members. Because we use Tensorflow, Google Colab is a natural extension to share our code more easily.” – says Wenxi Chen from Juji.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Various flavours of Jupyter have been mentioned as well. Deepnote (a hosted Jupyter Notebook solution) is “loved for their ML stuff” by the team of Intersect Labs while Google Colab “is a natural extension to share our code more easily” for the &lt;a href="https://juji.io/" rel="noopener noreferrer"&gt;Juji&lt;/a&gt; team.&lt;/p&gt;

&lt;p&gt;Others choose more standard software development IDEs. Among those, PyCharm, touted by Or Izchak from &lt;a href="https://www.hotelmize.com/" rel="noopener noreferrer"&gt;Hotelmize&lt;/a&gt; as “the best Python IDE”, and Visual Studio Code, used by &lt;a href="https://www.scanta.io/" rel="noopener noreferrer"&gt;Scanta&lt;/a&gt; for its “ease of connectivity with Azure and many ML-based extensions provided”, were mentioned the most.&lt;/p&gt;

&lt;p&gt;For teams that use R language like SimpleReport, RStudio was a clear winner when it comes to the IDE of choice. As Kenton White from &lt;a href="https://advancedsymbolics.com/" rel="noopener noreferrer"&gt;Advanced Symbolics&lt;/a&gt; mentions:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We mostly use R + RStudio for analysis and model building.  The workhorse for our AI modeling is VARX for time series forecasts.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When it comes to code versioning, GitHub is a clear favourite. As Daniel Hanchen from &lt;a href="https://daniel3112.typeform.com/to/K84Qu0" rel="noopener noreferrer"&gt;Umbra AI&lt;/a&gt; mentions:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Github (now free for all teams!!) with its super robust version control system and easy repository sharing functionality is super useful for most ML teams.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Among the most popular languages we have Python, R, and, interestingly, &lt;a href="https://clojure.org/" rel="noopener noreferrer"&gt;Clojure&lt;/a&gt;, mentioned by Wenxi Chen from &lt;a href="https://juji.io/" rel="noopener noreferrer"&gt;Juji&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As for the environment/infrastructure setup, notable mentions from ML startups are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“AWS as the platform for deployment” &lt;a href="https://simplereport.ca/" rel="noopener noreferrer"&gt;(Simple Report)&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;“Anaconda serves as our goto tool for running ML experiments due to its &lt;em&gt;live code&lt;/em&gt; feature wherein it can be used to combine software code, computational output, explanatory text, and multimedia resources in a single document.” (&lt;a href="http://www.scanta.io/" rel="noopener noreferrer"&gt;Scanta&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;“Redis dominates as an in-memory data structure store due to its support for different kinds of abstract data structures, such as strings, lists, maps, sets, sorted sets, HyperLogLogs, bitmaps, streams, and spatial indexes.” (&lt;a href="https://www.scanta.io/" rel="noopener noreferrer"&gt;Scanta&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;“&lt;a href="https://www.snowflake.com/" rel="noopener noreferrer"&gt;Snowflake&lt;/a&gt; and Amazon S3 for data storage.” (&lt;a href="https://www.hypergiant.com/" rel="noopener noreferrer"&gt;Hypergiant&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;“Spark-pyspark – very simple api for distributing job to work on big data.” (&lt;a href="https://www.hotelmize.com/" rel="noopener noreferrer"&gt;Hotelmize&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;Sooo many Machine Learning Frameworks&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi2.wp.com%2Fneptune.ai%2Fwp-content%2Fuploads%2FGIF-Choices.gif%3Fw%3D1200%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi2.wp.com%2Fneptune.ai%2Fwp-content%2Fuploads%2FGIF-Choices.gif%3Fw%3D1200%26ssl%3D1" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Source: giphy.com&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;An integrated development environment is crucial, but one needs a good ML framework on top of that to transform the vision into a project. The range of tools pointed out by the startups is quite diverse here. &lt;/p&gt;

&lt;p&gt;For playing with tabular data, Pandas was mentioned the most. &lt;/p&gt;

&lt;p&gt;An additional benefit of using Pandas, mentioned by Nemo D’Qrill, the CEO of &lt;a href="https://sigmapolaris.com/" rel="noopener noreferrer"&gt;Sigma Polaris&lt;/a&gt;, is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I'd say that Pandas is probably one of the most valuable tools, in particular when working in collaboration with external developers on various projects. Having all data files in the form of data frames, across teams and individual developers, makes for a much smoother collaboration and unnecessary hassle.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;An interesting library mentioned by a software developer from &lt;a href="https://www.hotelmize.com/" rel="noopener noreferrer"&gt;Hotelmize&lt;/a&gt; was &lt;a href="https://github.com/dovpanda-dev/dovpanda" rel="noopener noreferrer"&gt;dovpanda&lt;/a&gt; – a Python extension library for pandas which gives you insights on your pandas code and data while you work.&lt;/p&gt;

&lt;p&gt;When it comes to visualization, matplotlib is used the most, by the likes of &lt;a href="https://www.trustium.com/" rel="noopener noreferrer"&gt;Trustium&lt;/a&gt;, &lt;a href="https://www.hotelmize.com/" rel="noopener noreferrer"&gt;Hotelmize&lt;/a&gt;, &lt;a href="https://www.hypergiant.com/" rel="noopener noreferrer"&gt;Hypergiant&lt;/a&gt; and others. &lt;/p&gt;

&lt;p&gt;Plotly was also a common choice. As developers from &lt;a href="https://www.wordnerds.ai/" rel="noopener noreferrer"&gt;Wordnerds&lt;/a&gt; explain, it is used “for great visualisations to make data understandable and look good”. Dash, a tool for building interactive dashboards on top of Plotly charts, was recommended by Theodoros Giannakopoulos from &lt;a href="https://behavioralsignals.com/" rel="noopener noreferrer"&gt;Behavioral Signals&lt;/a&gt; for ML teams that need to present their analytical results in a nice, user-friendly manner. &lt;/p&gt;

&lt;p&gt;For more standard machine learning problems most teams like &lt;a href="https://www.wordnerds.ai/" rel="noopener noreferrer"&gt;Wordnerds&lt;/a&gt;, &lt;a href="https://www.sensitrust.io/" rel="noopener noreferrer"&gt;Sensitrust&lt;/a&gt; or &lt;a href="https://behavioralsignals.com/" rel="noopener noreferrer"&gt;Behavioral Signals&lt;/a&gt; use Scikit-Learn. ML team from &lt;a href="https://www.ischoolconnect.com/" rel="noopener noreferrer"&gt;iSchoolConnect&lt;/a&gt; explains why it is such a great tool:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;It is one of the most popular toolkits used by machine learning researchers, engineers, and developers. The ease with which you can get what you want is amazing! From feature engineering to interpretability, scikit-learn provides you with every functionality.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Truth be told, Pandas and Scikit-learn are really the workhorses of ML teams all over the world.&lt;/p&gt;

&lt;p&gt;As Michael Phillips, Data Scientist from &lt;a href="https://numer.ai/" rel="noopener noreferrer"&gt;Numerai&lt;/a&gt; says:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Modern Python libraries like Pandas and Scikit-learn have 99% of the tools that an ML team needs to excel. Though simple, these tools have extraordinary power in the hands of an experienced data scientist.&lt;/p&gt;
&lt;/blockquote&gt;
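&lt;p&gt;To make this concrete, here is a minimal sketch of that Pandas + Scikit-learn workflow. The data and column names are purely illustrative; the tiny synthetic dataset is only there to make the snippet self-contained:&lt;/p&gt;

```python
# A hedged sketch of a typical pandas + scikit-learn loop:
# feature engineering in pandas, modelling in scikit-learn.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Synthetic, illustrative data -- replace with your own frame.
df = pd.DataFrame({
    "clicks": [1, 5, 2, 8, 0, 7, 3, 9],
    "minutes": [10.0, 3.5, 8.0, 1.0, 12.0, 2.0, 9.5, 0.5],
    "converted": [0, 1, 0, 1, 0, 1, 0, 1],
})
# Feature engineering with pandas ...
df["clicks_per_minute"] = df["clicks"] / (df["minutes"] + 1e-9)

X = df[["clicks", "minutes", "clicks_per_minute"]]
y = df["converted"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# ... and modelling with scikit-learn.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```

&lt;p&gt;The same fitted pipeline object can later be serialized and reused for inference, which is part of why this duo travels so well between teams.&lt;/p&gt;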

&lt;p&gt;In my opinion, while this may be true for the general ML team population, in the case of ML startups a lot of work goes into state-of-the-art methods, which usually means deep learning models.&lt;/p&gt;

&lt;p&gt;When it comes to general deep learning frameworks we had many different opinions.&lt;/p&gt;

&lt;p&gt;Many teams like &lt;a href="https://www.wordnerds.ai/" rel="noopener noreferrer"&gt;Wordnerds&lt;/a&gt; and &lt;a href="https://behavioralsignals.com/" rel="noopener noreferrer"&gt;Behavioral Signals&lt;/a&gt; choose PyTorch.&lt;/p&gt;

&lt;p&gt;The team of ML experts from &lt;a href="https://www.ischoolconnect.com/" rel="noopener noreferrer"&gt;iSchoolConnect&lt;/a&gt; tells us why so many ML practitioners and researchers choose PyTorch.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If you want to go deep into the waters, PyTorch is the right tool for you! Initially, it will take time to get accustomed to it, but once you get comfortable with it there is nothing like it! The library is even optimized for quickly training and evaluating your ML models.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But it is still TensorFlow and Keras that lead in popularity.&lt;/p&gt;

&lt;p&gt;Most teams like Strayos and &lt;a href="https://repetere.ai/" rel="noopener noreferrer"&gt;Repetere&lt;/a&gt; choose them as their ML development framework. Cedar Milazzo from &lt;a href="https://www.trustium.com/" rel="noopener noreferrer"&gt;Trustium&lt;/a&gt; said:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Tensorflow, of course. Especially with 2.0! Eager execution was what TF really needed and now it’s here. I should note that when I say “tensorflow” I mean “tensorflow + keras” since keras is now built into TF.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It’s also important to mention that you don’t have to choose one framework and exclude others.&lt;/p&gt;

&lt;p&gt;For example, &lt;a href="https://melodia.io/" rel="noopener noreferrer"&gt;Melodia&lt;/a&gt;’s Founder, Omid Aryan said that:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The tools that have been most beneficial to us are TensorFlow, PyTorch, and Python’s old scikit-learn tools.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;There are some popular frameworks for more specialized applications.&lt;/p&gt;

&lt;p&gt;In Natural Language Processing we’ve heard:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“&lt;a href="https://huggingface.co/" rel="noopener noreferrer"&gt;Huggingface&lt;/a&gt;: it’s the most advanced and highest performance NLP library ever created. It’s the first of its kind in that researchers are directly contributing to a highly scalable NLP library. It separates itself from other similar tools by having production level tools available a few months after a newer model is published” says Ben Lamm, the CEO of &lt;a href="https://www.hypergiant.com/" rel="noopener noreferrer"&gt;Hypergiant&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;“Spacy is a very cool natural language toolkit. NLTK is by far the most popular and I certainly use it, but spacy does lots of things NLTK can’t do so well, such as stemming and dependency parsing.” mentions Cedar Milazzo, the CEO of &lt;a href="https://www.trustium.com/" rel="noopener noreferrer"&gt;Trustium&lt;/a&gt;.
&lt;/li&gt;
&lt;li&gt;“Gensim is good for word vectors and document vectors too, and I believe it isn’t so popular.” adds Cedar Milazzo.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In Computer Vision:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“&lt;a href="https://opencv.org/" rel="noopener noreferrer"&gt;OpenCV&lt;/a&gt; is indispensable for computer vision work” for &lt;a href="https://www.hypergiant.com/" rel="noopener noreferrer"&gt;Hypergiant&lt;/a&gt;. Their CEO says *“It’s a classic CV ensemble of methods from the 1960s until 2014 that are useful pre and post processing and can work well in scenarios where a neural network would be overkill.” *&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It’s also worth noting that not every team implements deep learning models themselves.&lt;/p&gt;

&lt;p&gt;As Iuliia Gribanova and Lance Seidman from &lt;a href="https://munchron.com/" rel="noopener noreferrer"&gt;Munchron&lt;/a&gt; say, there are now API services where you can outsource some (or all) of the work:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Google ML kit is currently one of the best easy-to-entry tools that lets mobile developers easily embed ML API services like face recognition, image labeling, and other items that Google offers into an Android or iOS App. But additionally, you can also bring in your own TF (TensorFlow) lite models to run experiments and then bring them into production using Google’s ML Kit.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I think it’s important to mention that you can’t always choose the latest and greatest libraries; often the tool stack gets handed to you when you join the team.&lt;/p&gt;

&lt;p&gt;As Naureen Mahmood from &lt;a href="https://www.meshcapade.com/" rel="noopener noreferrer"&gt;Meshcapade&lt;/a&gt; shared:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“In the past, some important autodiff libraries that have made it possible for us to run multiple joint optimizations, and in doing so helped us build some of the core tech we still use today, are Chumpy &amp;amp; OpenDR. Now there are fancier and faster ones out there, like PyTorch and TensorFlow.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When it comes to model deployment Patricia Thaine from &lt;a href="https://www.private-ai.ca/" rel="noopener noreferrer"&gt;Private AI&lt;/a&gt; mentions &lt;em&gt;“tflite, flask, tfjs and coreml”&lt;/em&gt; as their frameworks of choice. She also suggests that visualizing models is very important to them and they are using &lt;a href="https://github.com/lutzroeder/netron" rel="noopener noreferrer"&gt;Netron&lt;/a&gt; for that.&lt;/p&gt;

&lt;p&gt;But there are tools beyond frameworks that can help ML teams deliver real value quickly. &lt;/p&gt;

&lt;p&gt;This is where MLOps comes in.&lt;/p&gt;

&lt;h1&gt;
  
  
  MLOps is becoming more important for machine learning startups
&lt;/h1&gt;

&lt;p&gt;You may be wondering what MLOps is or why you should care.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi1.wp.com%2Fneptune.ai%2Fwp-content%2Fuploads%2FGIF-What-did-you-say-1.gif%3Fw%3D1200%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi1.wp.com%2Fneptune.ai%2Fwp-content%2Fuploads%2FGIF-What-did-you-say-1.gif%3Fw%3D1200%26ssl%3D1" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Source: giphy.com&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The term alludes to DevOps and describes tools used for operationalization of machine learning activities.&lt;/p&gt;

&lt;p&gt;Jean-Christophe Petkovich, CTO at &lt;a href="https://acerta.ca/" rel="noopener noreferrer"&gt;Acerta&lt;/a&gt;, provided us with an extremely thorough explanation of how their ML team approaches MLOps. It was so good that I decided to share it (almost) in full:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I think most of the interesting tools that are going to see broader adoption in 2020 are centered around MLOps. There was a big push to build those tools last year, and this year we’re going to find out who the winners will be. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;For me, MLflow seems to be in the lead for tracking experiments, artifacts, and outcomes. A lot of what we’ve built internally for this purpose are extensions to the functionality of MLflow to incorporate more data tracking similar to how DVC tracks data.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The other big names in MLOps are Kubeflow, Airflow and TFX with Apache Beam—all tools designed for capturing data science workflows and pipelines end-to-end.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;There are several ingredients for a complete MLOps system:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;You need to be able to build model artifacts that contain all the information needed to preprocess your data and generate a result.&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Once you can build model artifacts, you have to be able to track the code that builds them, and the data they were trained and tested on.&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;You need to keep track of how all three of these things, the models, their code, and their data, are related.&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Once you can track all these things, you can also mark them ready for staging, and production, and run them through a CI/CD process.&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Finally, to actually deploy them at the end of that process, you need some way to spin up a service based on that model artifact.&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;When it comes to tracking, MLflow is our pick. It’s tried-and-true at Acerta, as several of our employees already used it as part of their personal workflows, and now it’s the de facto tracking tool for our data scientists.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;For tracking data pipelines or workflows themselves, we are currently developing against Kubeflow since we’re already on Kubernetes making deployment a breeze, and our internal model pipelining infrastructure meshes well with the Kubeflow component concept.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;On top of all of this MLOps development, there’s a shift toward building feature stores—basically specialized data lakes for storing  preprocessed data in various forms—but I haven’t seen any serious contenders that really stand out yet.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;These are all tools that need to be in place—I know a lot of places are doing their own home-baked solutions to this problem, but I think this year we’re going to see a lot more standardization around machine learning applications.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Emily Kruger from &lt;a href="https://kaskada.com/" rel="noopener noreferrer"&gt;Kaskada&lt;/a&gt;, which, incidentally, is a startup building a feature store solution 🙂 adds:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The most useful tools from our perspective are feature stores, automated deployment pipelines, and experimentation platforms. All these tools address challenges with MLOps, which is an important emerging space for data teams, especially those running ML models in production and at scale.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;OK, so in light of this, what are other teams using to solve these problems?&lt;/p&gt;

&lt;p&gt;Some teams prefer end-to-end platforms, others create everything in-house. Many teams are somewhere in between with a mix of some specific tools and home-grown solutions.&lt;/p&gt;

&lt;p&gt;In terms of larger platforms, two names that were mentioned often were:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Amazon SageMaker, which according to the ML team from &lt;a href="https://vcv.ai/" rel="noopener noreferrer"&gt;VCV&lt;/a&gt; &lt;em&gt;“has a variety of tools for distributed collaboration”&lt;/em&gt; and which &lt;a href="https://simplereport.ca/" rel="noopener noreferrer"&gt;SimpleReport&lt;/a&gt; chose as their deployment platform.&lt;/li&gt;
&lt;li&gt;Azure, which as the &lt;a href="https://www.scanta.io/" rel="noopener noreferrer"&gt;Scanta&lt;/a&gt; team tells us &lt;em&gt;“serves as a way to build, train, and deploy our Machine Learning applications as well as it helps in adding intelligence in our applications via their Language, Vision, and Speech recognition support. Azure has been our choice of IaaS due to rapid deployments and low-cost Virtual Machines.”&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When it comes to experiment tracking, ML startups use various options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Strayos uses Comet ML “for model collaboration and results sharing”.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.hotelmize.com/" rel="noopener noreferrer"&gt;Hotelmize&lt;/a&gt; and others are going with tensorboard which “is the best tool to visualize your model behavior, specially for neural network models.”&lt;/li&gt;
&lt;li&gt;“MLflow seems to be in the lead for tracking experiments, artifacts, and outcomes,” as Jean-Christophe Petkovich, CTO at &lt;a href="https://acerta.ca/" rel="noopener noreferrer"&gt;Acerta&lt;/a&gt;, mentioned before.&lt;/li&gt;
&lt;li&gt;Other teams like &lt;a href="https://repetere.ai/" rel="noopener noreferrer"&gt;Repetere&lt;/a&gt; try to keep it simple and say that &lt;em&gt;”Our tooling is very simple, we use tensorflow and s3 to version model artifacts for analysis”.&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;
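&lt;p&gt;Under the hood, all of these trackers record a similar run object. The toy sketch below is not any specific tool's API (MLflow, Neptune and Comet each have their own); it just shows the kind of params-and-metrics record such tools persist for every run:&lt;/p&gt;

```python
# A toy, illustrative experiment "tracker": real tools add UIs, artifact
# storage and collaboration on top of a record much like this one.
import json
import time
import uuid

class RunTracker:
    def __init__(self, experiment):
        self.record = {
            "run_id": uuid.uuid4().hex,
            "experiment": experiment,
            "started_at": time.time(),
            "params": {},
            "metrics": [],
        }

    def log_param(self, name, value):
        self.record["params"][name] = value

    def log_metric(self, name, value, step):
        self.record["metrics"].append({"name": name, "value": value, "step": step})

    def save(self, path):
        # Persist the run so it can be compared against other runs later.
        with open(path, "w") as f:
            json.dump(self.record, f, indent=2)

run = RunTracker(experiment="lr-sweep")
run.log_param("lr", 0.01)
for step in range(3):
    run.log_metric("loss", 1.0 / (step + 1), step=step)
run.save("run.json")
```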

&lt;p&gt;Typically, experiment tracking tools keep track of metrics and hyperparameters but as James Kaplan from &lt;a href="https://meetkai.com/" rel="noopener noreferrer"&gt;MeetKai&lt;/a&gt; points out:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“The most useful types of ML tools for us are anything that helps with dealing with model regressions caused by everything except the model architecture. Most of these are tools we have built ourselves, but I assume there are many existing options out there. We like to look at confusion matrices that can be visually diff’d under scenarios such as:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;new data added to the training set (and the provenance of said data)&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;quantization configurations&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;pruning/distillation&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;We have found that being able to track performance across new data additions is far more important than being able to just track performance across hyperparameters of the model itself. This is especially so when datasets grow/change far faster than model configurations.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Speaking of pruning/distillation, Malte Pietsch, Co-Founder of deepset, explains that:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We see an increasing need for tools that help us profile &amp;amp; optimize models in terms of speed and hardware utilization. With the growing size of NLP models, it becomes increasingly important to make training and inference more efficient. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;While we are still looking for the ideal tooling here, we found pytest-benchmark, NVIDIA’s Nsight Systems and kernprof quite helpful.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Another interesting tool for benchmarking training/inference is &lt;a href="https://mlperf.org/" rel="noopener noreferrer"&gt;MLPerf&lt;/a&gt; suggested by Anton Lokhmotov from &lt;a href="http://dividiti.com/" rel="noopener noreferrer"&gt;Dividiti&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Experimenting with models is undoubtedly very important, but putting models in front of end-users is where the magic happens (for most of us). On that front, Rosa Lin from &lt;a href="https://tolstoy.ai/" rel="noopener noreferrer"&gt;Tolstoy&lt;/a&gt; mentioned using streamlit.io, which is a “great tool for building ML model web apps easily.”&lt;/p&gt;

&lt;p&gt;A valuable word of warning when it comes to using ML-focused solutions comes from Gianvito Pio, Co-Founder of &lt;a href="https://www.sensitrust.io/" rel="noopener noreferrer"&gt;Sensitrust&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“There are also tools like KNIME and Orange that allow you to design an entire pipeline in a drag-and-drop fashion, as well as AutoML tools (see AutoWEKA, auto-sklearn and JADBio) that will automatically select the most appropriate model for a specific task.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;However, in my opinion, strong expertise in the Machine Learning and AI areas is still necessary. Even the “best, automated” tool can be misused without a good background in the field.”&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Unexpected
&lt;/h1&gt;

&lt;p&gt;OK, when I started working on this, some answers like PyTorch, Pandas or Jupyter Lab were what I expected. &lt;/p&gt;

&lt;p&gt;But one answer we received was really out-of-the-box.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi2.wp.com%2Fneptune.ai%2Fwp-content%2Fuploads%2FGIF-what.gif%3Fw%3D1200%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi2.wp.com%2Fneptune.ai%2Fwp-content%2Fuploads%2FGIF-what.gif%3Fw%3D1200%26ssl%3D1" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Source: giphy.com&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It put all the other things in perspective and made me think that perhaps we should take a step back and take a look at the larger picture.&lt;/p&gt;

&lt;p&gt;Christopher Penn from &lt;a href="https://www.trustinsights.ai/" rel="noopener noreferrer"&gt;Trust Insights&lt;/a&gt; suggested that ML teams should use a rather interesting “tool”:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Wetware – the hardware and software combination that sits between your ears – is the most important, most useful, most powerful machine learning tool you have. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;Far, FAR too many people are hoping AI is a magic wand that solves everything with little to no human input. The reverse is true; AI requires more management and scrutiny than ever, because we lack so much visibility into complex models.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Interpretability and explainability are the greatest challenges we face right now, in the wake of massive scandals about bias and discrimination. And AI vendors make this worse by focusing on post hoc explanations of models instead of building the expensive but worthwhile interpretations and checkpoints into models.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;So, wetware – the human in the loop – is the most useful tool in 2020 and for the foreseeable future.”&lt;/em&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  Our perspective
&lt;/h1&gt;

&lt;p&gt;Since we are building tools for ML teams and some of our customers are AI startups, I think it makes sense to give you our perspective.&lt;/p&gt;

&lt;p&gt;So we see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A lot of teams use the Jupyter ecosystem for exploration and PyCharm/VSCode for development&lt;/li&gt;
&lt;li&gt;For deep learning, people are using TensorFlow, Keras and PyTorch. Notably, we see more and more people using &lt;a href="https://neptune.ai/blog/model-training-libraries-pytorch-ecosystem?utm_source=devto&amp;amp;utm_medium=crosspost&amp;amp;utm_campaign=blog-tools-libraries-frameworks-methodologies-ml-startups-roundup" rel="noopener noreferrer"&gt;high-level PyTorch training libraries like Lightning, Ignite, Catalyst, fastai and Skorch&lt;/a&gt;,&lt;/li&gt;
&lt;li&gt;For visual exploration, people are using matplotlib, plotly, altair and hiplot (hyperparameter visualizations)&lt;/li&gt;
&lt;li&gt;For running hyperparameter sweeps and general run orchestration some &lt;a href="https://medium.com/ynap-tech/part-ii-artificial-intelligence-successfully-navigating-from-experimentation-to-business-value-b37ddf75332c" rel="noopener noreferrer"&gt;teams like YNAP choose AWS SageMaker&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;For experiment tracking we see open-source packages like TensorBoard, MLflow and Sacred &lt;a href="https://docs.neptune.ai/integrations/introduction.html?utm_source=devto&amp;amp;utm_medium=crosspost&amp;amp;utm_campaign=blog-tools-libraries-frameworks-methodologies-ml-startups-roundup" rel="noopener noreferrer"&gt;(Neptune integrates with all of them)&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;… and since those are our customers, naturally they use neptune-notebooks for tracking explorations in Jupyter notebooks and Neptune for experiment tracking and organization of their machine learning projects. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi0.wp.com%2Fneptune.ai%2Fwp-content%2Fuploads%2FNeptune_ai-Infographic-1-1.png%3Fw%3D800%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi0.wp.com%2Fneptune.ai%2Fwp-content%2Fuploads%2FNeptune_ai-Infographic-1-1.png%3Fw%3D800%26ssl%3D1" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;This article was originally written by &lt;a href="https://www.linkedin.com/in/jakub-czakon-2b797b69/" rel="noopener noreferrer"&gt;Jakub Czakon&lt;/a&gt; and posted on the &lt;a href="https://neptune.ai/blog/tools-libraries-frameworks-methodologies-ml-startups-roundup?utm_source=devto&amp;amp;utm_medium=crosspost&amp;amp;utm_campaign=blog-tools-libraries-frameworks-methodologies-ml-startups-roundup" rel="noopener noreferrer"&gt;Neptune blog&lt;/a&gt;. You can find more in-depth articles for machine learning practitioners there.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>startup</category>
      <category>datascience</category>
    </item>
    <item>
      <title>How to Do Data Exploration for Image Segmentation and Object Detection (Things I Had to Learn the Hard Way)</title>
      <dc:creator>Kamil A. Kaczmarek</dc:creator>
      <pubDate>Tue, 28 Jul 2020 12:29:05 +0000</pubDate>
      <link>https://dev.to/kamil_k7k/how-to-do-data-exploration-for-image-segmentation-and-object-detection-things-i-had-to-learn-the-hard-way-1067</link>
      <guid>https://dev.to/kamil_k7k/how-to-do-data-exploration-for-image-segmentation-and-object-detection-things-i-had-to-learn-the-hard-way-1067</guid>
      <description>&lt;p&gt;&lt;em&gt;This article was originally written by &lt;a href="https://www.linkedin.com/in/jakubcieslik/" rel="noopener noreferrer"&gt;Jakub Cieślik&lt;/a&gt; and posted on the &lt;a href="https://neptune.ai/blog/data-exploration-for-image-segmentation-and-object-detection?utm_source=devto&amp;amp;utm_medium=crosspost&amp;amp;utm_campaign=blog-data-exploration-for-image-segmentation-and-object-detection" rel="noopener noreferrer"&gt;Neptune blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;I've been working with object detection and image segmentation problems for many years. An important realization I made is that people don't put the same amount of effort and emphasis on data exploration and results analysis as they normally would in any other non-image machine learning project.&lt;/p&gt;

&lt;p&gt;Why is it so?&lt;/p&gt;

&lt;p&gt;I believe there are two major reasons for it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;People don't understand object detection and image segmentation models in depth&lt;/strong&gt; and treat them as black boxes, in which case they don't even know what to look at or what the assumptions are.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It can be quite tedious from a technical&lt;/strong&gt; point of view as we don't have good image data exploration tools.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In my opinion, image datasets are not really an exception: understanding how to adjust the system to match our data is a critical step to success.&lt;/p&gt;

&lt;p&gt;In this article I will share with you how I approach data exploration for image segmentation and object detection problems. Specifically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why you should care about image and object dimensions,&lt;/li&gt;
&lt;li&gt;Why small objects can be problematic for many deep learning architectures,&lt;/li&gt;
&lt;li&gt;Why tackling class imbalances can be quite hard,&lt;/li&gt;
&lt;li&gt;Why a good visualization is worth a thousand metrics,&lt;/li&gt;
&lt;li&gt;The pitfalls of data augmentation.&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  The need for data exploration for image segmentation and object detection
&lt;/h1&gt;

&lt;p&gt;Data exploration is key to a lot of machine learning processes. That said, when it comes to object detection and image segmentation datasets there is &lt;strong&gt;no straightforward way to systematically do data exploration&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;There are multiple things that distinguish working with regular image datasets from object and segmentation ones:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The label is strongly bound to the image. Suddenly you have to be careful about whatever you do to your images, as it can break the image-label mapping.&lt;/li&gt;
&lt;li&gt;There are usually many more labels per image.&lt;/li&gt;
&lt;li&gt;There are many more hyperparameters to tune (especially if you train on your custom datasets)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This makes evaluation, results exploration and error analysis much harder. You will also find that choosing a single performance measure for your system can be quite tricky - in that case manual exploration might still be a critical step.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Quality and Common Problems
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;first thing you should do&lt;/strong&gt; when working on any machine learning problem (image segmentation, object detection included) &lt;strong&gt;is assessing quality and understanding your data.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Common &lt;strong&gt;data problems when training&lt;/strong&gt; Object Detection and Image Segmentation models include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Image dimensions and aspect ratios (especially dealing with extreme values)&lt;/li&gt;
&lt;li&gt;Labels composition - imbalances, bounding box sizes, aspect ratios (for instance a lot of small objects)&lt;/li&gt;
&lt;li&gt;Data preparation not suitable for your dataset.&lt;/li&gt;
&lt;li&gt;Modelling approach not aligned with the data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those will be &lt;strong&gt;especially important if you train on custom datasets that are significantly different from typical benchmark datasets such as COCO&lt;/strong&gt;. In the next chapters, I will show you how to spot the problems I mentioned and how to address them.&lt;/p&gt;

&lt;h3&gt;
  
  
  General Data Quality
&lt;/h3&gt;

&lt;p&gt;This one is simple and rather obvious; this step would also be the same for all image problems, not just object detection or image segmentation. What we need to do here is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;get the general feel of a dataset and inspect it visually.&lt;/li&gt;
&lt;li&gt;make sure it's not corrupt and does not contain any obvious artifacts (for instance black-only images)&lt;/li&gt;
&lt;li&gt;make sure that &lt;strong&gt;all&lt;/strong&gt; the files are readable - you don't want to find that out in the middle of your training.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;My tip here is to visualize as many pictures as possible. There are multiple ways of doing this. Depending on the size of the dataset, some might be more suitable than others.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Plot them in a Jupyter notebook using matplotlib.&lt;/li&gt;
&lt;li&gt;Use dedicated tooling like Google Facets to explore image data (&lt;a href="https://pair-code.github.io/facets/" rel="noopener noreferrer"&gt;https://pair-code.github.io/facets/&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use HTML rendering to visualize and explore&lt;/strong&gt; in a notebook.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'm a huge fan of the last option; it works great in Jupyter notebooks (even for thousands of pictures at the same time!). Try doing that with matplotlib. What's more, you can install a hover-zoom extension that will allow you to zoom in on individual pictures to inspect them in high resolution.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F1%2ACpgadKPbWUX7TF7PfkMhiw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F1%2ACpgadKPbWUX7TF7PfkMhiw.png" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
Fig 1. 500 COCO pictures visualized using HTML-rendered thumbnails&lt;/p&gt;
&lt;h3&gt;
  
  
  Image sizes and aspect ratios
&lt;/h3&gt;

&lt;p&gt;In the real world, datasets are unlikely to contain images of the same sizes and aspect ratios. &lt;strong&gt;Inspecting basic dataset&lt;/strong&gt; statistics such as aspect ratios, image widths and heights &lt;strong&gt;will help you make important decisions&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can you, and should you, do destructive resizing? (Destructive means resizing that changes the aspect ratio.)&lt;/li&gt;
&lt;li&gt;For non-destructive resizing what should be your desired output resolution and amount of padding?&lt;/li&gt;
&lt;li&gt;Deep learning models might have hyperparameters you have to tune depending on the above (for instance anchor sizes and ratios), or they might even have strong requirements when it comes to minimum input image size.&lt;/li&gt;
&lt;/ul&gt;
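&lt;p&gt;A quick, assumption-laden sketch of how such an inspection might look: the (width, height) pairs could come from Pillow's &lt;code&gt;Image.open(path).size&lt;/code&gt;, a COCO annotation file, or anywhere else, and the 0.7/1.5 "typical" band below is just an example threshold:&lt;/p&gt;

```python
# Sketch: bucket a dataset's (width, height) pairs by aspect ratio.
# The lo/hi thresholds are illustrative, not a standard.
from collections import Counter

def aspect_ratio_buckets(sizes, lo=0.7, hi=1.5):
    """Count images whose aspect ratio is below lo, above hi, or in between."""
    buckets = Counter()
    for w, h in sizes:
        ar = w / h
        if lo > ar:
            buckets["narrow"] += 1
        elif ar > hi:
            buckets["wide"] += 1
        else:
            buckets["typical"] += 1
    return buckets

def size_summary(sizes):
    """Min/max width and height, to spot extreme values early."""
    widths = [w for w, _ in sizes]
    heights = [h for _, h in sizes]
    return {
        "min_w": min(widths), "max_w": max(widths),
        "min_h": min(heights), "max_h": max(heights),
    }

sizes = [(640, 480), (1920, 400), (400, 1200)]  # illustrative values
print(aspect_ratio_buckets(sizes))
print(size_summary(sizes))
```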

&lt;p&gt;&lt;a href="https://d2l.ai/chapter_computer-vision/anchor.html" rel="noopener noreferrer"&gt;Good resources about anchors&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A special case would be if your dataset consists of images that are really big (4K+), which is not that unusual in satellite imagery or some medical modalities. For most cutting-edge models in 2020, you will not be able to fit even a single 4K image per (server-grade) GPU due to memory constraints. In that case, you need to figure out what will realistically be useful for your DL algorithms.&lt;/p&gt;

&lt;p&gt;Two approaches I have seen are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Training your model on image patches (randomly selected during training or extracted before training)&lt;/li&gt;
&lt;li&gt;Resizing the entire dataset upfront to avoid doing this every time you load your data.&lt;/li&gt;
&lt;/ul&gt;
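&lt;p&gt;The patch-based route can be as simple as sampling a random crop window each time an image is loaded. The sketch below only computes the crop box; applying it (and, crucially, cropping the labels to match) is left to your image library of choice, e.g. Pillow's &lt;code&gt;Image.crop&lt;/code&gt;. The 512 px patch size is an arbitrary example:&lt;/p&gt;

```python
# Sketch: pick a random patch window inside a (possibly huge) image.
# Only coordinates are computed here; boxes/masks must be cropped to match.
import random

def random_patch_box(img_w, img_h, patch=512):
    """Return a (left, top, right, bottom) crop box fully inside the image."""
    pw = min(patch, img_w)   # clamp for images smaller than the patch
    ph = min(patch, img_h)
    left = random.randint(0, img_w - pw)
    top = random.randint(0, img_h - ph)
    return (left, top, left + pw, top + ph)

# e.g. a crop from a large 4096x2160 image
print(random_patch_box(4096, 2160))
```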

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F0%2AYpRbjlwF9S9SRHyS" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F0%2AYpRbjlwF9S9SRHyS" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Fig 2. Histogram of image aspect ratios in the COCO dataset&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In general, I would expect most datasets to fall into one of three categories.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Uniformly distributed, where most of the images have the same dimensions&lt;/strong&gt; - here the only decision you will have to make is how much to resize (if at all). This will mainly depend on object areas, sizes and aspect ratios.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slightly bimodal distribution, but with most of the images in the aspect ratio range of (0.7 ... 1.5)&lt;/strong&gt;, similar to the COCO dataset. I believe other "natural-looking" datasets would follow a similar distribution - for those types of datasets you should be fine going with a non-destructive resize -&amp;gt; pad approach. Padding will be necessary, but to a degree that is manageable and will not blow up the size of the dataset too much.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dataset with a lot of extreme values&lt;/strong&gt; (very wide images mixed with very narrow ones) - this case is much trickier, and there are more advanced techniques to avoid excessive padding. You might consider sampling batches of images based on the aspect ratio. Remember that this can introduce a bias into your sampling process - so make sure it is acceptable or weak enough to ignore.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The mmdetection framework supports this out of the box by implementing a GroupSampler that samples based on aspect ratios.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F0%2AaQX-gxuLbxXHMZpx" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F0%2AaQX-gxuLbxXHMZpx" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Fig 3 and 4. Example images (resized and padded) with extreme aspect ratios from the COCO dataset&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Label (objects) sizes and dimensions
&lt;/h3&gt;

&lt;p&gt;Here we start looking at our targets (labels). In particular, &lt;strong&gt;we are interested in knowing how the sizes and aspect ratios are distributed.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Why is this important?&lt;/p&gt;

&lt;p&gt;Depending on your modelling approach &lt;strong&gt;most of the frameworks will have design limitations&lt;/strong&gt;. As I mentioned earlier, those models are designed to perform well on benchmark datasets. If for whatever reason your data is different, training them might be impossible. Let's have a look at a &lt;a href="https://github.com/facebookresearch/detectron2/blob/master/configs/Base-RetinaNet.yaml#L8" rel="noopener noreferrer"&gt;default config for Retinanet from detectron2&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ANCHOR_GENERATOR&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;SIZES&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="err"&gt;!!&lt;/span&gt;&lt;span class="n"&gt;python&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nb"&gt;object&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nb"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nb"&gt;eval&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[[x, x * 2**(1.0/3), x * 2**(2.0/3) ] for x in [32, 64, 128, 256, 512 ]]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What you can see there is that &lt;strong&gt;for different feature maps the anchors we generate will have a certain size range:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;for instance, if your dataset contains only really big objects, it might be possible to simplify the model a lot;&lt;/li&gt;
&lt;li&gt;on the other hand, if you have small images with small objects (for instance 10x10px), given this config it can happen that you will not be able to train the model.&lt;/li&gt;
&lt;/ul&gt;
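&lt;p&gt;The SIZES entry above is just an evaluated Python list comprehension. Expanding it in plain Python makes the per-feature-map size ranges explicit (this is only a sketch of the config expression, not of detectron2's actual anchor generation):&lt;/p&gt;

```python
# Expand the SIZES expression from the RetinaNet config above.
# Each inner list holds the three anchor sizes used on one feature map:
# the base size x, plus x scaled by 2**(1/3) and 2**(2/3).
sizes = [[x, x * 2 ** (1.0 / 3), x * 2 ** (2.0 / 3)] for x in [32, 64, 128, 256, 512]]

for level, group in enumerate(sizes):
    print(f"feature map {level}: anchor sizes {[round(s, 1) for s in group]}")

# The smallest anchor is 32px, so e.g. a dataset full of ~10px
# objects would struggle to match any anchor under this default config.
```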

&lt;p&gt;The most important things to consider when it comes to box or mask dimensions are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Aspect ratios&lt;/li&gt;
&lt;li&gt;Size (Area)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F0%2AxqQm9g2b5CNOFsU9" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F0%2AxqQm9g2b5CNOFsU9" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Fig 5. aspect ratio of bounding boxes in the coco dataset&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The tail of this distribution (Fig 5.) is quite long. There will be instances with extreme aspect ratios. Depending on the use case and dataset, it might or might not be fine to ignore them - this should be further inspected.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F0%2AaBtoIxqKAC_XmbtU" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F0%2AaBtoIxqKAC_XmbtU" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Fig 6. Mean area of bounding box per category&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is especially true for anchor-based models (most object detection / image segmentation models), where there is a step of matching ground truth labels with predefined anchor boxes (a.k.a. prior boxes).&lt;/p&gt;

&lt;p&gt;Remember that you control how those prior boxes are generated with hyperparameters like the number of boxes, their aspect ratio, and size. Not surprisingly, you need to make sure those settings are aligned with your dataset distributions and expectations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F0%2AFWNx20uSDKvlazjn" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F0%2AFWNx20uSDKvlazjn" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Fig 7. The Image shows &lt;a href="https://d2l.ai/chapter_computer-vision/anchor.html" rel="noopener noreferrer"&gt;anchor boxes&lt;/a&gt; at different scales and aspect ratios.&lt;/em&gt;&lt;/p&gt;
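&lt;p&gt;To make the size / aspect-ratio parameterization concrete, here is a minimal sketch (a hypothetical helper, not the API of any particular framework) of how prior box shapes are typically generated - one box of area size^2 for every (size, aspect ratio) pair:&lt;/p&gt;

```python
import math

def anchor_shapes(sizes, aspect_ratios):
    """Return (width, height) pairs: one box of area size**2 for every
    (size, aspect_ratio) combination, where aspect_ratio = width / height."""
    shapes = []
    for s in sizes:
        for r in aspect_ratios:
            shapes.append((s * math.sqrt(r), s / math.sqrt(r)))
    return shapes

# 3 sizes x 3 ratios = 9 anchor shapes per feature map location.
for w, h in anchor_shapes([32, 64, 128], [0.5, 1.0, 2.0]):
    print(f"{w:6.1f} x {h:6.1f}  (AR = {w / h:.2f})")
```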

&lt;p&gt;An important thing to &lt;strong&gt;keep in mind is that labels will be transformed together with the image.&lt;/strong&gt; So if you are making an image smaller during a preprocessing step, the absolute size of the ROIs will also shrink.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you feel that object size might be an issue in your problem and you don't want to enlarge the images too much&lt;/strong&gt; (for instance to keep the desired performance or memory footprint) you can &lt;strong&gt;try to solve it with a Crop -&amp;gt; Resize approach.&lt;/strong&gt; Keep in mind that this can be quite tricky (you need to handle what happens if you cut through a bounding box or segmentation mask).&lt;/p&gt;

&lt;p&gt;Big objects, on the other hand, are usually not problematic from a modelling perspective (although you still have to make sure that they will be matched with anchors). The problem with them is more indirect: essentially, &lt;strong&gt;the more big objects a class has, the more likely it is that it will be underrepresented in the dataset&lt;/strong&gt;. Most of the time the average area of objects in a given class will be inversely proportional to the (label) count.&lt;/p&gt;
&lt;h3&gt;
  
  
  Partially labeled data
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;When creating and labeling an image detection dataset, missing annotations are potentially a huge issue.&lt;/strong&gt; The worst scenario is when you have false negatives already in your ground truth - essentially, you did not annotate objects even though they are present in the dataset.&lt;/p&gt;

&lt;p&gt;In most of the modeling approaches, everything that was not labeled or did not match with an anchor is considered background. This means that &lt;strong&gt;it will generate conflicting signals that will hurt the learning process a LOT.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is also a reason why you can't really mix datasets with non-overlapping classes and train one model (there are some ways to mix datasets, though - for instance by soft labeling one dataset with a model trained on another one).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F0%2AANNWz_TVrivgN7eb" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F0%2AANNWz_TVrivgN7eb" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Fig 8. Shows the problem of mixing datasets - notice for example that on the right image a person is not labeled. One way to solve this problem is to soft label the dataset with a model trained on the other one. &lt;a href="https://arxiv.org/pdf/1812.02611.pdf" rel="noopener noreferrer"&gt;Source&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Imbalances
&lt;/h3&gt;

&lt;p&gt;Class imbalances can be a bit of a problem when it comes to object detection. Normally, in image classification for example, one can easily oversample or downsample the dataset and control each class's contribution to the loss.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F0%2AEBRaALlTewzmco2u" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F0%2AEBRaALlTewzmco2u" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Fig 9. Object counts per class&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You can imagine this is more challenging when you have an object detection dataset with co-occurring classes, since you can't really drop some of the labels (because you would send mixed signals as to what the background is).&lt;/p&gt;

&lt;p&gt;In that case you end up with the same problem as shown in the partially labeled data paragraph. Once you start resampling on an image level, you have to be aware of the fact that multiple classes will be upsampled at the same time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You may want to try other solutions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Adding weights to the loss (making the contributions of some boxes or pixels higher)&lt;/li&gt;
&lt;li&gt;Preprocessing your data differently: for example you could do some custom cropping that rebalances the dataset on the object level&lt;/li&gt;
&lt;/ul&gt;
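&lt;p&gt;As a concrete illustration of the loss-weighting idea, inverse-frequency weights can be derived directly from per-class object counts (a sketch with made-up counts; the normalization scheme is a design choice):&lt;/p&gt;

```python
# Hypothetical per-class object counts (e.g. read off a histogram
# like the one in Fig 9) -- the numbers here are made up.
counts = {"person": 26000, "car": 4300, "stop sign": 180}

total = sum(counts.values())
n_classes = len(counts)

# Inverse-frequency weights: a class appearing with the average
# frequency (total / n_classes objects) gets weight 1.0.
weights = {c: total / (n_classes * n) for c, n in counts.items()}

for c, w in weights.items():
    print(f"{c:>10}: weight {w:.2f}")
# Rare classes ("stop sign") end up with much larger weights than
# frequent ones ("person"); these can then scale per-box or
# per-pixel terms in the loss.
```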
&lt;h2&gt;
  
  
  Understanding augmentation and preprocessing sequences
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Preprocessing and data augmentation is an integral part of any computer vision system.&lt;/strong&gt; If you do it well you can gain a lot but if you screw up it can really cost you.&lt;/p&gt;

&lt;p&gt;Data augmentation is by far the most important and widely used regularization technique (in image segmentation / object detection ).&lt;/p&gt;

&lt;p&gt;Applying it to object detection and segmentation problems is more challenging than in simple image classification because some transformations (like rotation, or crop) need to be applied not only to the source image but also to the target (masks or bounding boxes). Common transformations that require a target transform include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Affine transformations,&lt;/li&gt;
&lt;li&gt;Cropping,&lt;/li&gt;
&lt;li&gt;Distortions,&lt;/li&gt;
&lt;li&gt;Scaling,&lt;/li&gt;
&lt;li&gt;Rotations&lt;/li&gt;
&lt;li&gt;and many more.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is crucial to do data exploration on batches of augmented images and targets to avoid costly mistakes (dropping bounding boxes, etc).&lt;/p&gt;
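&lt;p&gt;To make the image-plus-target coupling concrete, here is a minimal sketch (plain Python with a hypothetical helper name - libraries like Albumentations handle this for you) of how a horizontal flip must also remap COCO-style xywh bounding boxes:&lt;/p&gt;

```python
def hflip_boxes(boxes, image_width):
    """Horizontally flip COCO-style [x, y, w, h] boxes.

    Only the top-left x changes: the box's old right edge (x + w),
    mirrored around the image center, becomes the new left edge."""
    return [[image_width - (x + w), y, w, h] for x, y, w, h in boxes]

# A 100px-wide image with one box hugging the left border:
print(hflip_boxes([[0, 10, 30, 40]], image_width=100))  # [[70, 10, 30, 40]]
# Flipping twice is the identity -- a cheap sanity check for any
# target transform you write.
```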

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Basic augmentations are a part of deep learning frameworks like PyTorch or Tensorflow but if you need more advanced functionalities you need to use one of the augmentation libraries available in the python ecosystem. My recommendations are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/albumentations-team/albumentations" rel="noopener noreferrer"&gt;Albumentations&lt;/a&gt; (I'll use it in this post)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://imgaug.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;Imgaug&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/mdbloice/Augmentor" rel="noopener noreferrer"&gt;Augmentor&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  The minimal preprocessing setup
&lt;/h2&gt;

&lt;p&gt;Whenever I'm building a new system I want to keep it very basic on the preprocessing and augmentation level to minimize the risk of introducing bugs early on. &lt;strong&gt;The basic principles I would recommend you follow are:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Disable augmentation&lt;/li&gt;
&lt;li&gt;Avoid destructive resizing&lt;/li&gt;
&lt;li&gt;Always inspect the outputs visually&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's continue our COCO example. From the previous steps we know that the majority of our images have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;an aspect ratio (width / height) of about 1.5,&lt;/li&gt;
&lt;li&gt;an average width of about 600 and an average height of about 500.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Setting the averages as our basic preprocessing resize values seems to be a reasonable thing to do, unless there is a strong requirement on the model side to have bigger pictures. For instance, a resnet50 backbone model has a minimum size requirement of 32×32 (this is related to the number of downsampling layers).&lt;/p&gt;

&lt;p&gt;In Albumentations the basic setup implementation will look something like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LongestMaxSize(avg_height) - this will rescale the image based on the longest side preserving the aspect ratio&lt;/li&gt;
&lt;li&gt;PadIfNeeded(avg_height, avg_width, border_mode=0, value=0) - this will 0-pad the image up to the target height and width (border_mode=0 is OpenCV's constant-fill mode)&lt;/li&gt;
&lt;/ul&gt;
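&lt;p&gt;The two steps boil down to simple arithmetic. Here is a sketch (plain Python, not the Albumentations implementation) of the resulting image and padding sizes, assuming our 500/600 targets:&lt;/p&gt;

```python
def resize_and_pad_dims(width, height, max_size=500, pad_w=600, pad_h=500):
    """Return ((resized_w, resized_h), (total_pad_w, total_pad_h)) for a
    LongestMaxSize -> PadIfNeeded pipeline: scale so the longest side
    equals max_size (preserving aspect ratio), then 0-pad up to pad_w x pad_h."""
    scale = max_size / max(width, height)
    rw, rh = round(width * scale), round(height * scale)
    return (rw, rh), (pad_w - rw, pad_h - rh)

# A wide 1200x400 image: scaled down to 500x167, then padded by
# 100px horizontally and 333px vertically in total.
print(resize_and_pad_dims(1200, 400))  # ((500, 167), (100, 333))
```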

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F0%2AF-TsZa44JUjzwTUl" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F0%2AF-TsZa44JUjzwTUl" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Fig 10&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh5.googleusercontent.com%2FXp0bPjh77PzyMWkyIuvnBsp_polDV9thCR8tvhwXcKNcPPaVrX0ndh6nUtwSfoNFr4n7tfumQLr5JIMESj-f91wnjldR7WY1IiUvnWDRanLqjeGO44nsJ_rDiqcVUUWoRYhiCr22" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh5.googleusercontent.com%2FXp0bPjh77PzyMWkyIuvnBsp_polDV9thCR8tvhwXcKNcPPaVrX0ndh6nUtwSfoNFr4n7tfumQLr5JIMESj-f91wnjldR7WY1IiUvnWDRanLqjeGO44nsJ_rDiqcVUUWoRYhiCr22" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Fig 11&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Fig 10 and 11. MaxSize-&amp;gt;Pad output for two pictures with drastically different aspect ratios&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As you can see in figures 10 and 11, the preprocessing results in an image of 500×600 with reasonable 0-padding for both pictures.&lt;/p&gt;

&lt;p&gt;When you use padding there are many options for how to fill the empty space. In the basic setup I suggest that you go with the default constant 0 value.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When you experiment with more advanced methods like reflection padding, always explore your augmentations visually.&lt;/strong&gt; Remember that you are running the risk of introducing false negatives, especially in object detection problems (reflecting an object without having a label for it).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F0%2AHzE9VQ0vUXqZz1KB" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F0%2AHzE9VQ0vUXqZz1KB" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Fig 12. Notice how reflection-padding creates false negative errors in our annotations. The cat's reflection (top of the picture) has no label!&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Augmentation - Rotations
&lt;/h3&gt;

&lt;p&gt;Rotations are powerful and useful augmentations but they should be used with caution. Have a look at fig 13. below which was generated using a Rotate(45)-&amp;gt;Resize-&amp;gt;Pad pipeline.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F0%2AyXdtvMq2JmL2Ingj" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F0%2AyXdtvMq2JmL2Ingj" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Fig 13. Rotations can be harmful to your bounding box labels&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The problem is that if we use standard bounding boxes (without an angle parameter), covering a rotated object can be less efficient (box-area to object-area will increase). &lt;strong&gt;This happens during rotation augmentations and it can harm the data.&lt;/strong&gt; Notice that we have also introduced false positive labels in the top left corner. This is because we crop-rotated the image.&lt;/p&gt;

&lt;p&gt;My recommendation is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You might want to give up on rotations if you have a lot of objects with aspect ratios far from one.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Another thing you can consider is using 90, 180, 270 degree non-cropping rotations (if they make sense for your problem) - they will not destroy any bounding boxes.&lt;/p&gt;
&lt;h3&gt;
  
  
  Augmentations - Key takeaways
&lt;/h3&gt;

&lt;p&gt;As you see, spatial transforms can be quite tricky and a lot of unexpected things can happen (especially for object detection problems).&lt;/p&gt;

&lt;p&gt;So if you decide to use those spatial augmentations make sure to do some data exploration and visually inspect your data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Do you really need spatial augmentations? I believe that in many scenarios you will not need them - as usual, keep things simple and gradually add complexity.&lt;/p&gt;

&lt;p&gt;From my experience, a good starting point (without spatial transforms) for natural-looking datasets (similar to COCO) is the following pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;transforms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nc"&gt;LongestMaxSize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;HorizontalFlip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;PadIfNeeded&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;600&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;border_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;JpegCompression&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;quality_lower&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;70&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quality_upper&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;RandomBrightnessContrast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;Cutout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_h_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_w_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Of course things like max_size or cutout sizes are arbitrary and have to be adjusted.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F0%2AMnvctm1YC5A9r5Wq" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F0%2AMnvctm1YC5A9r5Wq" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fig 14. Augmentation results with cutout, jpeg compression and contrast/brightness adjustments&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best Practice:&lt;/strong&gt;&lt;br&gt;
One thing I did not mention yet that I feel is pretty important: &lt;strong&gt;Always load the whole dataset (together with your preprocessing and augmentation pipeline).&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;%%&lt;/span&gt;&lt;span class="n"&gt;timeit&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data_loader&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;pass&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Two lines of code that will save you a lot of time.&lt;/strong&gt; First of all, you will understand what the overhead of the data loading is and if you see a clear performance bottleneck you might consider fixing it right away. More importantly, you will catch potential issues with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;corrupted files,&lt;/li&gt;
&lt;li&gt;labels that can't be transformed, etc.,&lt;/li&gt;
&lt;li&gt;anything fishy that can interrupt training down the line.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Results understanding
&lt;/h2&gt;

&lt;p&gt;Inspecting model results and performing error analysis can be a tricky process for these types of problems. A single metric rarely tells you the whole story, and even if you do have one, interpreting it can be relatively hard.&lt;/p&gt;

&lt;p&gt;Let's have a look at the official COCO challenge and how the evaluation process looks there (all the results I will be showing are for a Mask R-CNN model with a resnet50 backbone).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F0%2A0tn5D-W79GPdXdWK" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F0%2A0tn5D-W79GPdXdWK" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Fig 15. Coco evaluation output&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It returns the &lt;a href="https://medium.com/@jonathan_hui/map-mean-average-precision-for-object-detection-45c121a31173" rel="noopener noreferrer"&gt;AP&lt;/a&gt; and AR for various groups of observations partitioned by IOU (Intersection over Union of predictions and ground truth) and Area. So even the official COCO evaluation is not just one metric and there is a good reason for it.&lt;/p&gt;

&lt;p&gt;Let's focus on the IoU=0.50:0.95 notation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What this means is the following: AP and AR are calculated as the average of precisions and recalls computed at different IoU settings (from 0.5 to 0.95 with a 0.05 step)&lt;/strong&gt;. What we gain here is a more robust evaluation process - a model will only score high if it's pretty good at both localizing and classifying.&lt;/p&gt;
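&lt;p&gt;The IoU those thresholds are applied to is itself simple to compute. A minimal sketch for axis-aligned boxes in corner (xyxy) format:&lt;/p&gt;

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# The IoU=0.50:0.95 thresholds used by the COCO evaluation:
thresholds = [0.5 + 0.05 * i for i in range(10)]

# A detection shifted 25px against a 100x100 ground truth box:
score = iou([0, 0, 100, 100], [25, 0, 125, 100])
print(score)  # 0.6 -- a match at the looser thresholds, a miss at the stricter ones
```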

&lt;p&gt;Of course, your problem and dataset might be different. Maybe you need an extremely accurate detector - in that case, choosing AP@0.90IoU might be a good idea.&lt;/p&gt;

&lt;p&gt;The downside (of the coco eval tool) is that by default all the values are averaged for all the classes and all images. This might be fine in a competition-like setup where we want to evaluate the models on all the classes but &lt;strong&gt;in real-life situations where you train models on custom datasets (often with fewer classes) you really want to know how your model performs on a per-class basis&lt;/strong&gt;. Looking at per-class metrics is extremely valuable, as it might give you important insights:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;help you compose a new dataset better&lt;/li&gt;
&lt;li&gt;make better decisions when it comes to data augmentation, data sampling etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F0%2AyHvhOTpID76kxXsr" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F0%2AyHvhOTpID76kxXsr" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Fig 16. Per class AP&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Figure 16 gives you a lot of useful information. There are a few things you might consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add more data to low performing classes&lt;/li&gt;
&lt;li&gt;For classes that score well, maybe you can consider downsampling them to speed up the training and maybe help with the performance of other less frequent classes.&lt;/li&gt;
&lt;li&gt;Spot any obvious correlations for instance classes with small objects performing poorly.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Visualizing results
&lt;/h3&gt;

&lt;p&gt;Ok, so if looking at single metrics is not enough what should you do?&lt;/p&gt;

&lt;p&gt;I would definitely suggest spending some time on manual results exploration - in combination with the hard metrics from the previous analysis, visualizations will help you get the big picture.&lt;/p&gt;

&lt;p&gt;Since exploring predictions of image detection and image segmentation models can get quite messy I would suggest you do it step by step. On the gif below I show how this can be done using the coco inspector tool.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://lh3.googleusercontent.com/Qbepbcs4jQzdyaFjnxEEj3Ostz2P76UTjgXGMOmjxpk77EvLHyGEJVrlJtKzsZRcAQLbZsRdGMJFih27VXMzF9iXiMt_6t0pODXGZF5fJ26Yma3M-c3urYa90ZTzMtD6NqAbJRGB" rel="noopener noreferrer"&gt;gif available here&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the gif we can see how all the important information is visualized:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Red masks - predictions&lt;/li&gt;
&lt;li&gt;Orange masks - overlap of predictions and ground truth masks&lt;/li&gt;
&lt;li&gt;Green masks - ground truth&lt;/li&gt;
&lt;li&gt;Dashed bounding boxes - false positives (predictions without a match)&lt;/li&gt;
&lt;li&gt;Orange boxes - true positives (predictions matched with ground truth)&lt;/li&gt;
&lt;li&gt;Green boxes - ground truth&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Results understanding - per image scores
&lt;/h3&gt;

&lt;p&gt;By looking at the hard metrics and inspecting images visually we most likely have a pretty good idea of what's going on. But looking at results of random images (or grouped by class) is likely not an optimal way of doing this. &lt;strong&gt;If you want to really dive in and spot edge cases of your model, I suggest calculating per image metrics&lt;/strong&gt; (for instance AP or Recall).&lt;/p&gt;

&lt;p&gt;Below is an example of an image I found by doing exactly that.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F0%2ALEfTyeknmjLbGW_h" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F0%2ALEfTyeknmjLbGW_h" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Fig 18. Image with a very low AP score&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the example above (Fig 18.) we can see two false positive stop sign predictions - from that we can deduce that our model understands what a stop sign is but not what other traffic signs are.&lt;/p&gt;

&lt;p&gt;Perhaps we can add new classes to our dataset or use our "stop sign detector" to label other traffic signs and then create a new "traffic sign" label to overcome this problem.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F0%2ASPzPxcmCWvQVWbuU" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F0%2ASPzPxcmCWvQVWbuU" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Fig 19. Example of an image with a good score &amp;gt; 0.5 AP&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sometimes we will also learn that our model is doing better than it would seem from the scores alone.&lt;/strong&gt; That's also useful information - for instance, in the example above our model detected a keyboard on the laptop, but this is actually not labeled in the original dataset.&lt;/p&gt;
&lt;h2&gt;
  
  
  COCO format
&lt;/h2&gt;

&lt;p&gt;The way a COCO dataset is organized can be a bit intimidating at first.&lt;/p&gt;

&lt;p&gt;It consists of a set of dictionaries mapping from one to another. It's also intended to be used together with the pycocotools library, which builds a rather confusing API on top of the dataset metadata file.&lt;/p&gt;

&lt;p&gt;Nonetheless, &lt;strong&gt;the COCO dataset (and the COCO format) has become a standard way of organizing object detection and image segmentation datasets.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In COCO we follow the &lt;strong&gt;xywh&lt;/strong&gt; convention for bounding box encodings, or as I like to call it, tlwh: &lt;strong&gt;(top-left-width-height)&lt;/strong&gt;. That way you cannot confuse it with, for instance, cwh: &lt;strong&gt;(center-point, w, h)&lt;/strong&gt;. Mask labels (segmentations) are run-length encoded &lt;a href="https://www.kaggle.com/c/data-science-bowl-2018/overview/evaluation" rel="noopener noreferrer"&gt;(RLE explanation)&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F0%2ALuK6g3Oqw1ttGtou" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F0%2ALuK6g3Oqw1ttGtou" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Fig 20. The COCO dataset annotations format&lt;/em&gt;&lt;/p&gt;
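&lt;p&gt;Stripped down to its essential fields, a ground truth annotations file is just three cross-referencing lists; the skeleton below uses illustrative ids and values:&lt;/p&gt;

```python
import json

# Minimal skeleton of a COCO ground truth annotations file.
# The lists reference each other through ids: each annotation points
# to an image via image_id and to a category via category_id.
coco_gt = {
    "images": [
        {"id": 1, "file_name": "000000000139.jpg", "width": 640, "height": 426}
    ],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,       # refers to images[].id
            "category_id": 18,   # refers to categories[].id
            "bbox": [73.0, 41.0, 120.0, 80.0],  # xywh convention
            "area": 9600.0,
            "iscrowd": 0,
        }
    ],
    "categories": [
        {"id": 18, "name": "dog", "supercategory": "animal"}
    ],
}

with open("ground_truth_annotations.json", "w") as f:
    json.dump(coco_gt, f)
```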

&lt;p&gt;There are still very important advantages to having a widely adopted standard:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Labeling tools and services export and import COCO-like datasets.&lt;/li&gt;
&lt;li&gt;The evaluation and scoring code (used for the COCO competition) is pretty well optimized and battle-tested.&lt;/li&gt;
&lt;li&gt;Multiple open-source datasets follow it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the previous paragraph, I used the COCO eval functionality, which is another benefit of following the COCO standard. To take advantage of it, you need to format your predictions in the same way as your COCO dataset is constructed; then calculating metrics is as simple as calling &lt;code&gt;COCOeval(gt_dataset, pred_dataset)&lt;/code&gt;.&lt;/p&gt;
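&lt;p&gt;For reference, a sketch of what that looks like: each detection is one dictionary in the standard COCO results format, and the ids must match the ground truth dataset. The values below are made up, and the commented-out evaluation part assumes pycocotools is installed:&lt;/p&gt;

```python
import json

# One entry per detection, in the standard COCO results format.
predictions = [
    {"image_id": 1, "category_id": 18,
     "bbox": [75.0, 40.0, 118.0, 82.0],  # xywh convention
     "score": 0.93},
    {"image_id": 1, "category_id": 73,
     "bbox": [12.0, 300.0, 60.0, 25.0],
     "score": 0.51},
]

with open("predictions.json", "w") as f:
    json.dump(predictions, f)

# With pycocotools, evaluation then boils down to:
#   from pycocotools.coco import COCO
#   from pycocotools.cocoeval import COCOeval
#   gt = COCO("ground_truth_annotations.json")
#   dt = gt.loadRes("predictions.json")
#   ev = COCOeval(gt, dt, iouType="bbox")
#   ev.evaluate(); ev.accumulate(); ev.summarize()
```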


&lt;h1&gt;
  
  
  COCO dataset explorer
&lt;/h1&gt;

&lt;p&gt;In order to streamline the process of data and results exploration (especially for object detection), I wrote a tool that operates on COCO datasets.&lt;/p&gt;

&lt;p&gt;Essentially, you provide it with the ground truth dataset and (optionally) the predictions dataset, and it will do the rest for you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Calculate most of the metrics I presented in this post&lt;/li&gt;
&lt;li&gt;Easily visualize the dataset's ground truths and predictions&lt;/li&gt;
&lt;li&gt;Inspect COCO metrics and per-class AP metrics&lt;/li&gt;
&lt;li&gt;Inspect per-image scores&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F1%2Ap-IhXKE8GGdSWpSaiMf30g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F1%2Ap-IhXKE8GGdSWpSaiMf30g.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To use COCO dataset explorer tool you need to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clone the project &lt;a href="https://github.com/i008/COCO-dataset-explorer" rel="noopener noreferrer"&gt;repository&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/i008/COCO-dataset-explorer.git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;Download the example data I used, or use your own data in the COCO format:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://drive.google.com/file/d/1wxIagenNdCt_qphEe8gZYK7H2_to9QXl/view" rel="noopener noreferrer"&gt;Example COCO format dataset with predictions&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you downloaded the example data, you will need to extract it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;tar&lt;/span&gt; &lt;span class="nt"&gt;-xvf&lt;/span&gt; coco_data.tar
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should have the following directory structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;COCO-dataset-explorer
    |coco_data
        |images
            |000000000139.jpg
            |000000000285.jpg
            |000000000632.jpg
            |...
        |ground_truth_annotations.json
        |predictions.json
|coco_explorer.py
|Dockerfile
|environment.yml
|...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Set up the environment with all the dependencies:&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;conda &lt;span class="nb"&gt;env &lt;/span&gt;update&lt;span class="p"&gt;;&lt;/span&gt;
conda activate cocoexplorer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Run the Streamlit app, specifying files with ground truth and predictions in the COCO format, and the image directory:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;streamlit run coco_explorer.py &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--coco_train&lt;/span&gt; coco_data/ground_truth_annotations.json &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--coco_predictions&lt;/span&gt; coco_data/predictions.json  &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--images_path&lt;/span&gt; coco_data/images/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; You can also run this with docker:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;docker run &lt;span class="nt"&gt;-p&lt;/span&gt; 8501:8501 &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;pwd&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;/coco_data:/coco_data i008/coco_explorer  &lt;span class="se"&gt;\&lt;/span&gt;
    streamlit run  coco_explorer.py &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--coco_train&lt;/span&gt; /coco_data/ground_truth_annotations.json &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--coco_predictions&lt;/span&gt; /coco_data/predictions.json  &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--images_path&lt;/span&gt; /coco_data/images/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Explore the dataset in the browser. By default, it will run on &lt;a href="http://localhost:8501/" rel="noopener noreferrer"&gt;http://localhost:8501/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  Final words
&lt;/h1&gt;

&lt;p&gt;I hope that with this post I convinced you that data exploration in object detection and image segmentation is as important as in any other branch of machine learning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I'm confident that the effort we make at this stage of the project pays off in the long run.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The knowledge we gather allows us to make better-informed modeling decisions, avoid multiple training pitfalls, and gain more confidence in the training process and in the predictions our model produces.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This article was originally written by &lt;a href="https://www.linkedin.com/in/jakubcieslik/" rel="noopener noreferrer"&gt;Jakub Cieślik&lt;/a&gt; and posted on the &lt;a href="https://neptune.ai/blog/data-exploration-for-image-segmentation-and-object-detection?utm_source=devto&amp;amp;utm_medium=crosspost&amp;amp;utm_campaign=blog-data-exploration-for-image-segmentation-and-object-detection" rel="noopener noreferrer"&gt;Neptune blog&lt;/a&gt;. You can find more in-depth articles for machine learning practitioners there.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>computerscience</category>
      <category>python</category>
    </item>
    <item>
      <title>The Best NLP/NLU Papers from the ICLR 2020 Conference</title>
      <dc:creator>Kamil A. Kaczmarek</dc:creator>
      <pubDate>Fri, 24 Jul 2020 16:17:23 +0000</pubDate>
      <link>https://dev.to/kamil_k7k/the-best-nlp-nlu-papers-from-the-iclr-2020-conference-3ipg</link>
      <guid>https://dev.to/kamil_k7k/the-best-nlp-nlu-papers-from-the-iclr-2020-conference-3ipg</guid>
      <description>&lt;p&gt;&lt;em&gt;This article was originally posted on the &lt;a href="https://neptune.ai/blog/iclr-2020-nlp-nlu?utm_source=hashnode&amp;amp;utm_medium=crosspost&amp;amp;utm_campaign=blog-iclr-2020-nlp-nlu" rel="noopener noreferrer"&gt;Neptune blog&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;The International Conference on Learning Representations &lt;strong&gt;(ICLR)&lt;/strong&gt; took place last week, and I had the pleasure of participating in it. ICLR is an event dedicated to &lt;strong&gt;research on all aspects of representation learning, commonly known as deep learning&lt;/strong&gt;. This year the event was a bit different, as it went virtual due to the coronavirus pandemic. However, the online format didn't change the great atmosphere of the event. It was engaging and interactive, and attracted 5600 attendees (twice as many as last year). If you're interested in what the organizers think about the unusual online arrangement of the conference, you can read about it here.&lt;/p&gt;

&lt;p&gt;Over 1300 speakers presented many interesting papers, so I decided to create a series of blog posts summarizing the best of them in four main areas: deep learning, reinforcement learning, generative modeling, and NLP/NLU.&lt;/p&gt;

&lt;p&gt;This is the last post of the series, in which I want to share the &lt;strong&gt;10 best Natural Language Processing/Understanding contributions from ICLR&lt;/strong&gt;.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;ALBERT: A Lite BERT for Self-supervised Learning of Language Representations&lt;/li&gt;
&lt;li&gt;A Mutual Information Maximization Perspective of Language Representation Learning&lt;/li&gt;
&lt;li&gt;Mogrifier LSTM&lt;/li&gt;
&lt;li&gt;High Fidelity Speech Synthesis with Adversarial Networks&lt;/li&gt;
&lt;li&gt;Reformer: The Efficient Transformer&lt;/li&gt;
&lt;li&gt;DeFINE: Deep Factorized Input Token Embeddings for Neural Sequence Modeling&lt;/li&gt;
&lt;li&gt;Depth-Adaptive Transformer&lt;/li&gt;
&lt;li&gt;On Identifiability in Transformers&lt;/li&gt;
&lt;li&gt;Mirror-Generative Neural Machine Translation&lt;/li&gt;
&lt;li&gt;FreeLB: Enhanced Adversarial Training for Natural Language Understanding&lt;/li&gt;
&lt;/ol&gt;




&lt;h1&gt;
  
  
  Best Natural Language Processing/Understanding Papers
&lt;/h1&gt;

&lt;h3&gt;
  
  
  1. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
&lt;/h3&gt;

&lt;p&gt;A new pretraining method that establishes new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks while having fewer parameters compared to BERT-large.&lt;br&gt;
&lt;em&gt;(TL;DR, from &lt;a href="https://openreview.net/group?id=ICLR.cc/2020/Conference" rel="noopener noreferrer"&gt;OpenReview.net&lt;/a&gt;)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;| &lt;a href="https://openreview.net/forum?id=H1eA7AEtvS" rel="noopener noreferrer"&gt;Paper&lt;/a&gt; | &lt;a href="https://github.com/google-research/ALBERT" rel="noopener noreferrer"&gt;Code&lt;/a&gt; |&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F0%2AchjpPFJWawOzxmXZ" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F0%2AchjpPFJWawOzxmXZ" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
The L2 distances and cosine similarity (in terms of degree) of the input and output embedding of each layer for BERT-large and ALBERT-large.&lt;br&gt;
&lt;em&gt;(source: Fig 1, from the paper)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F450%2F0%2A2UyoFHZCX4RIx7Ik" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F450%2F0%2A2UyoFHZCX4RIx7Ik" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First author: Zhenzhong Lan&lt;br&gt;
| &lt;a href="https://www.linkedin.com/in/zhenzhong-lan/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; |&lt;/p&gt;




&lt;h3&gt;
  
  
  2. A Mutual Information Maximization Perspective of Language Representation Learning
&lt;/h3&gt;

&lt;p&gt;Word representation is a common task in NLP. Here, the authors formulate a new framework that combines classical word embedding techniques (like Skip-gram) with more modern approaches based on contextual embeddings (BERT, XLNet).&lt;/p&gt;

&lt;p&gt;| &lt;a href="https://openreview.net/forum?id=Syx79eBKwr" rel="noopener noreferrer"&gt;Paper&lt;/a&gt; |&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F0%2Ar7dNdOiu73fGJlLH" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F0%2Ar7dNdOiu73fGJlLH" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
The left plot shows F1 scores of BERT-NCE and INFOWORD as we increase the percentage of training examples on SQuAD (dev). The right plot shows F1 scores of INFOWORD on SQuAD (dev) as a function of λDIM.&lt;br&gt;
&lt;em&gt;(source: Fig 1, from the paper)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F450%2F0%2ADivJX3GGqbeAf-fC" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F450%2F0%2ADivJX3GGqbeAf-fC" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First author: Lingpeng Kong&lt;br&gt;
| &lt;a href="https://twitter.com/ikekong?lang=en" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt; | &lt;a href="https://ikekonglp.github.io/" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; | &lt;a href="(https://ikekonglp.github.io/)"&gt;Website&lt;/a&gt; |&lt;/p&gt;




&lt;h3&gt;
  
  
  3. Mogrifier LSTM
&lt;/h3&gt;

&lt;p&gt;An LSTM extension with state-of-the-art language modelling results.&lt;br&gt;
&lt;em&gt;(TL;DR, from &lt;a href="https://openreview.net/group?id=ICLR.cc/2020/Conference" rel="noopener noreferrer"&gt;OpenReview.net&lt;/a&gt;)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;| &lt;a href="https://openreview.net/forum?id=SJe5P6EYvS" rel="noopener noreferrer"&gt;Paper&lt;/a&gt; |&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F0%2Az0B2sm3LF_SMEezI" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F0%2Az0B2sm3LF_SMEezI" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
Mogrifier with 5 rounds of updates. The previous state h&lt;sub&gt;0&lt;/sub&gt; = h&lt;sub&gt;prev&lt;/sub&gt; is transformed linearly (dashed arrows), fed through a sigmoid, and gates x&lt;sub&gt;-1&lt;/sub&gt; = x in an elementwise manner, producing x&lt;sub&gt;1&lt;/sub&gt;. Conversely, the linearly transformed x&lt;sub&gt;1&lt;/sub&gt; gates h&lt;sub&gt;0&lt;/sub&gt; and produces h&lt;sub&gt;2&lt;/sub&gt;. After a number of repetitions of this mutual gating cycle, the last values of the h and x sequences are fed to an LSTM cell. The prev subscript of h is omitted to reduce clutter.&lt;br&gt;
&lt;em&gt;(source: Fig 1, from the paper)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F450%2F0%2AGijr_t1eTrT6quVZ" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F450%2F0%2AGijr_t1eTrT6quVZ" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First author: Gábor Melis&lt;br&gt;
&lt;a href="https://twitter.com/gabormelis" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt; | &lt;a href="https://www.linkedin.com/in/melisgabor/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; | &lt;a href="https://github.com/melisgl" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; | &lt;a href="http://quotenil.com/" rel="noopener noreferrer"&gt;Website&lt;/a&gt; |&lt;/p&gt;




&lt;h3&gt;
  
  
  4. High Fidelity Speech Synthesis with Adversarial Networks
&lt;/h3&gt;

&lt;p&gt;We introduce GAN-TTS, a Generative Adversarial Network for Text-to-Speech, which achieves Mean Opinion Score (MOS) 4.2.&lt;br&gt;
&lt;em&gt;(TL;DR, from &lt;a href="https://openreview.net/group?id=ICLR.cc/2020/Conference" rel="noopener noreferrer"&gt;OpenReview.net&lt;/a&gt;)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;| &lt;a href="https://openreview.net/forum?id=r1gfQgSFDr" rel="noopener noreferrer"&gt;Paper&lt;/a&gt; | &lt;a href="https://github.com/mbinkowski/DeepSpeechDistances" rel="noopener noreferrer"&gt;Code&lt;/a&gt; |&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F0%2AfQ7K26DNWdOL7Tm-" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F0%2AfQ7K26DNWdOL7Tm-" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
Residual blocks used in the model. Convolutional layers have the same number of input and output channels and no dilation unless stated otherwise. h - hidden layer representation, l - linguistic features, z - noise vector, m - channel multiplier, m = 2 for downsampling blocks (i.e. if their downsample factor is greater than 1) and m = 1 otherwise; M - G's input channels, M = 2N in blocks 3, 6, 7, and M = N otherwise; size refers to kernel size.&lt;br&gt;
&lt;em&gt;(source: Fig 1, from the paper)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F450%2F1%2AymJxmlBbiuvd4p2rgRT0Nw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F450%2F1%2AymJxmlBbiuvd4p2rgRT0Nw.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First author: Mikołaj Bińkowski&lt;br&gt;
| &lt;a href="https://www.linkedin.com/in/mikolaj-binkowski/?originalSubdomain=uk" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; | &lt;a href="https://github.com/mbinkowski" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; |&lt;/p&gt;




&lt;h3&gt;
  
  
  5. Reformer: The Efficient Transformer
&lt;/h3&gt;

&lt;p&gt;Efficient Transformer with locality-sensitive hashing and reversible layers.&lt;br&gt;
&lt;em&gt;(TL;DR, from &lt;a href="https://openreview.net/group?id=ICLR.cc/2020/Conference" rel="noopener noreferrer"&gt;OpenReview.net&lt;/a&gt;)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;| &lt;a href="https://openreview.net/forum?id=rkgNKkHtvB" rel="noopener noreferrer"&gt;Paper&lt;/a&gt; | &lt;a href="https://github.com/google/trax/tree/master/trax/models/reformer" rel="noopener noreferrer"&gt;Code&lt;/a&gt; |&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F0%2A06wKAXu043gtGfXU" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F0%2A06wKAXu043gtGfXU" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;An angular locality sensitive hash uses random rotations of spherically projected points to establish buckets by an argmax over signed axes projections. In this highly simplified 2D depiction, two points x and y are unlikely to share the same hash buckets (above) for the three different angular hashes unless their spherical projections are close to one another (below).&lt;br&gt;
&lt;em&gt;(source: Fig 1, from the paper)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Main authors&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F450%2F0%2AjmzbXWOxRg0k1VOw" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F450%2F0%2AjmzbXWOxRg0k1VOw" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Nikita Kitaev&lt;br&gt;
| &lt;a href="https://www.linkedin.com/in/nikitakitaev/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; | &lt;a href="https://github.com/nikitakit" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; | &lt;a href="https://kitaev.io/" rel="noopener noreferrer"&gt;Website&lt;/a&gt; |&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F450%2F0%2Aw1zDSuhVvz7FET4n" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F450%2F0%2Aw1zDSuhVvz7FET4n" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Łukasz Kaiser&lt;br&gt;
| &lt;a href="https://twitter.com/lukaszkaiser?lang=en" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt; | &lt;a href="https://www.linkedin.com/in/lukaszkaiser/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; | &lt;a href="https://github.com/lukaszkaiser" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; |&lt;/p&gt;




&lt;h3&gt;
  
  
  6. DeFINE: Deep Factorized Input Token Embeddings for Neural Sequence Modeling
&lt;/h3&gt;

&lt;p&gt;DeFINE uses a deep, hierarchical, sparse network with new skip connections to learn better word embeddings efficiently.&lt;br&gt;
&lt;em&gt;(TL;DR, from &lt;a href="https://openreview.net/group?id=ICLR.cc/2020/Conference" rel="noopener noreferrer"&gt;OpenReview.net&lt;/a&gt;)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;| &lt;a href="https://openreview.net/forum?id=rJeXS04FPH" rel="noopener noreferrer"&gt;Paper&lt;/a&gt; |&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F0%2A3if_ghSXQ-Or99FN" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F0%2A3if_ghSXQ-Or99FN" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
With DeFINE, Transformer-XL learns input (embedding) and output (classification) representations in low n-dimensional space rather than high m-dimensional space, thus reducing parameters significantly while having a minimal impact on the performance.&lt;br&gt;
&lt;em&gt;(source: Fig 1, from the paper)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F450%2F1%2A_YYFhHPem3SvDfKNz8zUzA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F450%2F1%2A_YYFhHPem3SvDfKNz8zUzA.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First author: Sachin Mehta&lt;br&gt;
| &lt;a href="https://twitter.com/sacmehtauw" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt; | &lt;a href="https://www.linkedin.com/in/sachinmehtangb/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; | &lt;a href="https://github.com/sacmehta" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; | &lt;a href="https://sacmehta.github.io/" rel="noopener noreferrer"&gt;Website&lt;/a&gt; |&lt;/p&gt;




&lt;h3&gt;
  
  
  7. Depth-Adaptive Transformer
&lt;/h3&gt;

&lt;p&gt;Sequence model that dynamically adjusts the amount of computation for each input.&lt;br&gt;
&lt;em&gt;(TL;DR, from &lt;a href="https://openreview.net/group?id=ICLR.cc/2020/Conference" rel="noopener noreferrer"&gt;OpenReview.net&lt;/a&gt;)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;| &lt;a href="https://openreview.net/forum?id=SJg7KhVKPH" rel="noopener noreferrer"&gt;Paper&lt;/a&gt; |&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F0%2AB6ozj_uaztXK5Lmh" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F0%2AB6ozj_uaztXK5Lmh" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Training regimes for decoder networks able to emit outputs at any layer. Aligned training optimizes all output classifiers C&lt;sub&gt;n&lt;/sub&gt; simultaneously, assuming all previous hidden states for the current layer are available. Mixed training samples M paths of random exits at which the model is assumed to have exited; missing previous hidden states are copied from below.&lt;br&gt;
&lt;em&gt;(source: Fig 1, from the paper)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F450%2F0%2AaeJhXYq_e--ECoKH" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F450%2F0%2AaeJhXYq_e--ECoKH" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First author: Maha Elbayad&lt;br&gt;
| &lt;a href="https://twitter.com/melbayad" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt; | &lt;a href="https://www.linkedin.com/in/elbayadm/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; | &lt;a href="https://github.com/elbayadm" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; | &lt;a href="http://elbayadm.github.io/" rel="noopener noreferrer"&gt;Website&lt;/a&gt; |&lt;/p&gt;




&lt;h3&gt;
  
  
  8. On Identifiability in Transformers
&lt;/h3&gt;

&lt;p&gt;We investigate the identifiability and interpretability of attention distributions and tokens within contextual embeddings in the self-attention based BERT model.&lt;br&gt;
&lt;em&gt;(TL;DR, from &lt;a href="https://openreview.net/group?id=ICLR.cc/2020/Conference" rel="noopener noreferrer"&gt;OpenReview.net&lt;/a&gt;)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;| &lt;a href="https://openreview.net/forum?id=BJg1f6EFDB" rel="noopener noreferrer"&gt;Paper&lt;/a&gt; |&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F0%2AVO5UxKp6p02Jr3bD" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F0%2AVO5UxKp6p02Jr3bD" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
(a) Each point represents the Pearson correlation coefficient of effective attention and raw attention as a function of token length. (b) Raw attention vs. (c) effective attention, where each point represents the average (effective) attention of a given head to a token type.&lt;br&gt;
&lt;em&gt;(source: Fig 1, from the paper)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F450%2F0%2A3Tq8rFaA96Buo4r0" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F450%2F0%2A3Tq8rFaA96Buo4r0" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First author: Gino Brunner&lt;br&gt;
| &lt;a href="https://twitter.com/ginozkz" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt; | &lt;a href="https://www.linkedin.com/in/gino-brunner-7a3a6582/?originalSubdomain=ch" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; | &lt;a href="https://disco.ethz.ch/members/brunnegi" rel="noopener noreferrer"&gt;Website&lt;/a&gt; |&lt;/p&gt;




&lt;h3&gt;
  
  
  9. Mirror-Generative Neural Machine Translation
&lt;/h3&gt;

&lt;p&gt;Translation approaches known as Neural Machine Translation (NMT) models depend on the availability of large parallel corpora, constructed as language pairs. Here, a new method is proposed for translating in both directions using generative neural machine translation.&lt;/p&gt;

&lt;p&gt;| &lt;a href="https://openreview.net/forum?id=HkxQRTNYPH" rel="noopener noreferrer"&gt;Paper&lt;/a&gt; |&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F0%2A3L5JLcbCNSBF6xLU" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F0%2A3L5JLcbCNSBF6xLU" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The graphical model of MGNMT.&lt;br&gt;
&lt;em&gt;(source: Fig 1, from the paper)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F450%2F0%2AufVZqYFwXaAB4bOX" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F450%2F0%2AufVZqYFwXaAB4bOX" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First author: Zaixiang Zheng&lt;br&gt;
| &lt;a href="https://twitter.com/zaixiang93" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt; | &lt;a href="https://zhengzx-nlp.github.io/" rel="noopener noreferrer"&gt;Website&lt;/a&gt; |&lt;/p&gt;




&lt;h3&gt;
  
  
  10. FreeLB: Enhanced Adversarial Training for Natural Language Understanding
&lt;/h3&gt;

&lt;p&gt;Here, the authors propose a new algorithm, called FreeLB, which formulates a novel approach to adversarial training of language models.&lt;/p&gt;

&lt;p&gt;| &lt;a href="https://openreview.net/forum?id=BygzbyHFvB" rel="noopener noreferrer"&gt;Paper&lt;/a&gt; | &lt;a href="https://github.com/zhuchen03/FreeLB" rel="noopener noreferrer"&gt;Code&lt;/a&gt; |&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F0%2ABYo9D2wKf_Pdl3_Y" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F0%2ABYo9D2wKf_Pdl3_Y" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
The algorithm's pseudo-code.&lt;br&gt;
&lt;em&gt;(source: Fig 1, from the paper)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F450%2F1%2A7XGs1UhLdiOuQ6Lc93Uubg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F450%2F1%2A7XGs1UhLdiOuQ6Lc93Uubg.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First author: Chen Zhu&lt;br&gt;
| &lt;a href="https://www.linkedin.com/in/zhuchen917/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; | &lt;a href="https://github.com/zhuchen03" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; | &lt;a href="http://www.cs.umd.edu/~chenzhu/" rel="noopener noreferrer"&gt;Website&lt;/a&gt; |&lt;/p&gt;




&lt;h1&gt;
  
  
  Summary
&lt;/h1&gt;

&lt;p&gt;The depth and breadth of the ICLR publications are quite inspiring. This post focuses on "Natural Language Processing", one of the main areas discussed at the conference. According to &lt;a href="https://www.analyticsvidhya.com/blog/2020/05/key-takeaways-iclr-2020/" rel="noopener noreferrer"&gt;this analysis&lt;/a&gt;, these areas include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Deep learning&lt;/li&gt;
&lt;li&gt;Reinforcement learning&lt;/li&gt;
&lt;li&gt;Generative models&lt;/li&gt;
&lt;li&gt;Natural Language Processing/Understanding&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In order to create a more complete overview of the top papers at ICLR, we have built a series of posts, each focused on one of the topics mentioned above. This is the last one, so you may want to check out the others for the full picture.&lt;/p&gt;

&lt;p&gt;We would be happy to extend our list, so feel free to share other interesting NLP/NLU papers with us.&lt;/p&gt;

&lt;p&gt;In the meantime - happy reading!&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This article was originally posted on the &lt;a href="https://neptune.ai/blog" rel="noopener noreferrer"&gt;Neptune blog&lt;/a&gt; where you can find more in-depth articles for machine learning practitioners.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>python</category>
    </item>
    <item>
      <title>Tabular Data Binary Classification: All Tips and Tricks from 5 Kaggle Competitions</title>
      <dc:creator>Kamil A. Kaczmarek</dc:creator>
      <pubDate>Wed, 22 Jul 2020 09:37:19 +0000</pubDate>
      <link>https://dev.to/kamil_k7k/tabular-data-binary-classification-all-tips-and-tricks-from-5-kaggle-competitions-1aim</link>
      <guid>https://dev.to/kamil_k7k/tabular-data-binary-classification-all-tips-and-tricks-from-5-kaggle-competitions-1aim</guid>
      <description>&lt;p&gt;&lt;em&gt;This article was originally written by &lt;a href="https://www.linkedin.com/in/shahules/"&gt;Shahul Es&lt;/a&gt; and posted on the &lt;a href="https://neptune.ai/blog/tabular-data-binary-classification-tips-and-tricks-from-5-kaggle-competitions?utm_source=devto&amp;amp;utm_medium=crosspost&amp;amp;utm_campaign=blog-tabular-data-binary-classification-tips-and-tricks-from-5-kaggle-competitions"&gt;Neptune blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;In this article, I will discuss some great tips and tricks to improve the performance of your structured data binary classification model. These tricks are obtained from solutions of some of Kaggle’s top tabular data competitions. Without further ado, let’s begin.&lt;/p&gt;

&lt;p&gt;These are the five competitions that I have gone through to create this article:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.kaggle.com/c/home-credit-default-risk/"&gt;Home credit default risk&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.kaggle.com/c/santander-customer-transaction-prediction/notebooks"&gt;Santander Customer Transaction Prediction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.kaggle.com/c/vsb-power-line-fault-detection/overview/evaluation"&gt;VSB Power Line Fault Detection&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.kaggle.com/c/microsoft-malware-prediction/overview"&gt;Microsoft Malware Prediction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.kaggle.com/c/ieee-fraud-detection/overview/evaluation/"&gt;IEEE-CIS Fraud Detection&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Dealing with larger datasets
&lt;/h1&gt;

&lt;p&gt;One issue you might face in any machine learning competition is the size of your dataset. If your data is large (3 GB+ is already a lot for Kaggle kernels and basic laptops), you may find it difficult to load and process it with limited resources. Here are links to some articles and kernels that I have found useful in such situations.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Faster &lt;a href="https://www.kaggle.com/c/home-credit-default-risk/discussion/59575"&gt;data loading with pandas&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Data compression techniques to &lt;a href="https://www.kaggle.com/nickycan/compress-70-of-dataset"&gt;reduce the size of data by 70%&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Optimize the memory by &lt;a href="https://www.kaggle.com/shrutimechlearn/large-data-loading-trick-with-ms-malware-data"&gt;reducing the size of some attributes&lt;/a&gt;. &lt;/li&gt;
&lt;li&gt;Use open-source libraries such as &lt;a href="https://www.kaggle.com/yuliagm/how-to-work-with-big-datasets-on-16g-ram-dask"&gt;Dask to read and manipulate the data&lt;/a&gt;, it performs parallel computing and saves up memory space. &lt;/li&gt;
&lt;li&gt;Use &lt;a href="https://github.com/rapidsai/cudf"&gt;cudf&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Convert data to &lt;a href="https://arrow.apache.org/docs/python/parquet.html"&gt;parquet&lt;/a&gt; format.&lt;/li&gt;
&lt;li&gt;Converting data to &lt;a href="https://medium.com/@snehotosh.banerjee/feather-a-fast-on-disk-format-for-r-and-python-data-frames-de33d0516b03"&gt;feather&lt;/a&gt; format.&lt;/li&gt;
&lt;li&gt;Reducing memory usage for &lt;a href="https://www.kaggle.com/mjbahmani/reducing-memory-size-for-ieee"&gt;optimizing RAM&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
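&lt;p&gt;To make the "reduce the size of some attributes" trick concrete, here is a minimal sketch of the idea behind the memory-reduction kernels linked above (standard library only; the helper name is illustrative, real kernels operate on pandas dtypes):&lt;/p&gt;

```python
# Sketch of the dtype-downcasting idea: inspect a column's value range
# and pick the narrowest signed integer type that can hold it, instead
# of the 8-byte int64 pandas uses by default.
INT_RANGES = [
    ("int8", -2**7, 2**7 - 1),
    ("int16", -2**15, 2**15 - 1),
    ("int32", -2**31, 2**31 - 1),
    ("int64", -2**63, 2**63 - 1),
]

def smallest_int_dtype(values):
    """Return the narrowest signed integer dtype name that fits all values."""
    lo, hi = min(values), max(values)
    for name, dtype_min, dtype_max in INT_RANGES:
        if dtype_min <= lo and hi <= dtype_max:
            return name
    raise ValueError("values exceed the int64 range")

print(smallest_int_dtype([0, 3, 120]))   # int8: 1 byte per entry instead of 8
print(smallest_int_dtype([0, 40_000]))   # int32
```

&lt;p&gt;With pandas you would apply the same logic per column via &lt;code&gt;astype&lt;/code&gt;, which is exactly what the linked kernels automate.&lt;/p&gt;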

&lt;h1&gt;
  
  
  Data exploration
&lt;/h1&gt;

&lt;p&gt;Data exploration always helps to better understand the data and gain insights from it. Before starting to develop machine learning models, top competitors always do a lot of exploratory data analysis. This helps with feature engineering and data cleaning.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;EDA for Microsoft &lt;a href="https://www.kaggle.com/youhanlee/my-eda-i-want-to-see-all"&gt;malware detection&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Time Series &lt;a href="https://www.kaggle.com/cdeotte/time-split-validation-malware-0-68"&gt;EDA for malware detection&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Complete &lt;a href="https://www.kaggle.com/codename007/home-credit-complete-eda-feature-importance"&gt;EDA for home credit loan prediction&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Complete &lt;a href="https://www.kaggle.com/gpreda/santander-eda-and-prediction"&gt;EDA for Santader prediction&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;EDA for &lt;a href="https://www.kaggle.com/go1dfish/basic-eda"&gt;VSB Power Line Fault Detection&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Data preparation
&lt;/h1&gt;

&lt;p&gt;After data exploration, the first thing to do is to use those insights to prepare the data, tackling issues like class imbalance, encoding of categorical data, and so on. Let’s see the methods used to do it.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Methods to &lt;a href="https://www.kaggle.com/shahules/tackling-class-imbalance"&gt;tackle class imbalance&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Data augmentation by &lt;a href="https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/"&gt;Synthetic Minority Oversampling Technique&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/"&gt;Fast inplace shuffle for augmentation&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Finding &lt;a href="https://www.kaggle.com/yag320/list-of-fake-samples-and-public-private-lb-split"&gt;synthetic samples in the dataset&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.kaggle.com/jackvial/dwt-signal-denoising"&gt;Signal denoising&lt;/a&gt; used in signal processing competitions.&lt;/li&gt;
&lt;li&gt;Finding &lt;a href="https://www.kaggle.com/jpmiller/patterns-of-missing-data"&gt;patterns of missing data&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Methods to handle &lt;a href="https://towardsdatascience.com/6-different-ways-to-compensate-for-missing-values-data-imputation-with-examples-6022d9ca0779"&gt;missing data&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;An overview of various &lt;a href="https://www.kaggle.com/shahules/an-overview-of-encoding-techniques"&gt;encoding techniques for categorical data&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Building &lt;a href="https://www.kaggle.com/c/home-credit-default-risk/discussion/64598"&gt;model to predict missing values&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Random &lt;a href="https://www.kaggle.com/brandenkmurray/randomly-shuffled-data-also-work"&gt;shuffling of data&lt;/a&gt; to create new synthetic training set.&lt;/li&gt;
&lt;/ul&gt;
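&lt;p&gt;Besides resampling tricks like SMOTE, a common way to tackle class imbalance is to reweight the loss by inverse class frequency. A minimal sketch (the formula matches scikit-learn's "balanced" heuristic; the helper name is ours):&lt;/p&gt;

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency weights: weight[c] = n_samples / (n_classes * count[c]).
    Rarer classes get larger weights, so the loss pays more attention to
    minority-class mistakes."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * cnt) for c, cnt in counts.items()}

# 90/10 imbalance: the minority class is weighted 9x the majority class.
weights = class_weights([0] * 90 + [1] * 10)
print(weights)
```

&lt;p&gt;Most gradient-boosting libraries accept such weights directly, e.g. via LightGBM's &lt;code&gt;class_weight&lt;/code&gt; or &lt;code&gt;scale_pos_weight&lt;/code&gt; parameters.&lt;/p&gt;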

&lt;h1&gt;
  
  
  Feature engineering
&lt;/h1&gt;

&lt;p&gt;Next, you can check the most popular features and feature engineering techniques used in these top Kaggle competitions. The feature engineering part varies from problem to problem, depending on the domain.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Target &lt;a href="https://medium.com/@pouryaayria/k-fold-target-encoding-dfe9a594874b"&gt;encoding cross validation&lt;/a&gt; for better encoding.&lt;/li&gt;
&lt;li&gt;Entity embedding to &lt;a href="https://www.kaggle.com/abhishek/entity-embeddings-to-handle-categories"&gt;handle categories&lt;/a&gt;. &lt;/li&gt;
&lt;li&gt;Encoding &lt;a href="https://www.kaggle.com/avanwyk/encoding-cyclical-features-for-deep-learning"&gt;cyclic features for deep learning&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Manual &lt;a href="https://www.kaggle.com/willkoehrsen/introduction-to-manual-feature-engineering"&gt;feature engineering methods&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Automated feature engineering techniques &lt;a href="https://www.kaggle.com/willkoehrsen/automated-feature-engineering-basics"&gt;using featuretools&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Top hand-crafted features used in &lt;a href="https://www.kaggle.com/sanderf/7th-place-solution-microsoft-malware-prediction"&gt;Microsoft malware detection&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Denoising NN for &lt;a href="https://towardsdatascience.com/applied-deep-learning-part-3-autoencoders-1c083af4d798"&gt;feature extraction&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Feature engineering &lt;a href="https://www.kaggle.com/cdeotte/rapids-feature-engineering-fraud-0-96/"&gt;using RAPIDS framework&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Things to remember while processing &lt;a href="https://www.kaggle.com/c/ieee-fraud-detection/discussion/108575"&gt;features using LGBM&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.kaggle.com/c/home-credit-default-risk/discussion/64593"&gt;Lag features and moving averages&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://medium.com/machine-learning-researcher/dimensionality-reduction-pca-and-lda-6be91734f567"&gt;Principal component analysis&lt;/a&gt; for dimensionality reduction.&lt;/li&gt;
&lt;li&gt;LDA for &lt;a href="https://medium.com/machine-learning-researcher/dimensionality-reduction-pca-and-lda-6be91734f567"&gt;dimensionality reduction&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Best hand-crafted LGBM features for &lt;a href="https://www.kaggle.com/c/microsoft-malware-prediction/discussion/85157"&gt;Microsoft malware detection&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Generating &lt;a href="https://www.kaggle.com/philippsinger/frequency-features-without-test-data-information"&gt;frequency features&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Dropping variables with &lt;a href="https://www.kaggle.com/bogorodvo/lightgbm-baseline-model-using-sparse-matrix"&gt;different train and test distribution&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.kaggle.com/c/home-credit-default-risk/discussion/64593"&gt;Aggregate time series features&lt;/a&gt; for home credit competition.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.kaggle.com/c/home-credit-default-risk/discussion/64593"&gt;Time Series features&lt;/a&gt; used in home credit default risk.&lt;/li&gt;
&lt;li&gt;Scale, standardize, and &lt;a href="https://towardsdatascience.com/scale-standardize-or-normalize-with-scikit-learn-6ccc7d176a02"&gt;normalize with sklearn&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Handcrafted features for &lt;a href="https://www.kaggle.com/c/home-credit-default-risk/discussion/57750"&gt;Home default risk competition&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Hand-crafted &lt;a href="https://www.kaggle.com/c/santander-customer-transaction-prediction/discussion/89070"&gt;features used in Santander Transaction Prediction&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
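&lt;p&gt;Target encoding with cross-validation (the first bullet above) is easy to get wrong in a way that leaks labels. A minimal, pandas-free sketch of the out-of-fold idea, with deterministic round-robin folds for illustration:&lt;/p&gt;

```python
def kfold_target_encode(categories, targets, n_folds=5, prior=0.5):
    """Out-of-fold target encoding: each row's category is replaced by the
    mean target computed on the *other* folds, so a row never sees its own
    label through the encoding."""
    n = len(categories)
    fold_of = [i % n_folds for i in range(n)]  # deterministic folds for the demo
    encoded = [prior] * n                      # prior covers unseen categories
    for fold in range(n_folds):
        sums, counts = {}, {}
        for i in range(n):                     # fit on out-of-fold rows only
            if fold_of[i] != fold:
                c = categories[i]
                sums[c] = sums.get(c, 0.0) + targets[i]
                counts[c] = counts.get(c, 0) + 1
        for i in range(n):                     # encode the held-out rows
            if fold_of[i] == fold and categories[i] in counts:
                c = categories[i]
                encoded[i] = sums[c] / counts[c]
    return encoded

cats = ["a", "a", "a", "a", "b", "b", "b", "b"]
ys   = [1, 1, 0, 0, 1, 0, 0, 0]
print(kfold_target_encode(cats, ys, n_folds=2))
```

&lt;p&gt;In practice you would also shuffle before splitting and smooth rare categories toward the global mean, but the leak-prevention structure stays the same.&lt;/p&gt;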

&lt;h1&gt;
  
  
  Feature selection
&lt;/h1&gt;

&lt;p&gt;After generating many features from your data, you need to decide which of them to use in order to get maximum performance out of your model. This step also includes identifying the impact each feature has on your model. Let’s see some of the most popular feature selection methods.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Six ways to do &lt;a href="https://www.kaggle.com/sz8416/6-ways-for-feature-selection"&gt;feature selection using sklearn&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.kaggle.com/c/ieee-fraud-detection/discussion/107877#latest-635386"&gt;Permutation feature importance&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.kaggle.com/tunguz/adversarial-ieee/"&gt;Adversarial feature validation&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Feature selection using &lt;a href="https://www.kaggle.com/ogrellier/feature-selection-with-null-importances"&gt;null importance&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Tree explainer using &lt;a href="https://github.com/slundberg/shap"&gt;SHAP&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;DeepNN explainer using &lt;a href="https://github.com/slundberg/shap"&gt;SHAP&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
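&lt;p&gt;The intuition behind permutation feature importance fits in a few lines: shuffle one column at a time and measure how much the score drops. A standard-library sketch, with a toy scoring function standing in for a fitted model:&lt;/p&gt;

```python
import random

def permutation_importance(score_fn, X, n_repeats=5, seed=0):
    """Shuffle one column at a time and record the average drop in score.
    Columns the model relies on cause large drops; irrelevant columns
    cause none. `score_fn(X)` stands in for scoring a fitted model."""
    rng = random.Random(seed)
    base_score = score_fn(X)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            col = [row[j] for row in X]
            rng.shuffle(col)
            X_perm = [row[:j] + [col[i]] + row[j + 1:] for i, row in enumerate(X)]
            drops.append(base_score - score_fn(X_perm))
        importances.append(sum(drops) / n_repeats)
    return importances

# Toy setup: the "model" predicts y from column 0; column 1 is constant.
X = [[float(i), 7.0] for i in range(20)]
y = [row[0] for row in X]
score = lambda M: -sum((m[0] - yi) ** 2 for m, yi in zip(M, y)) / len(M)

imp = permutation_importance(score, X)
print(imp)  # column 0 gets a large importance; the constant column gets zero
```

&lt;p&gt;On real estimators, scikit-learn's &lt;code&gt;sklearn.inspection.permutation_importance&lt;/code&gt; implements the same idea properly.&lt;/p&gt;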

&lt;h1&gt;
  
  
  Modeling
&lt;/h1&gt;

&lt;p&gt;After handcrafting and selecting your features, you should choose the right machine learning algorithm to make your predictions. Here is a collection of some of the most-used ML models in structured data classification challenges.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html"&gt;Random forest classifier&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;XGBoost : &lt;a href="https://xgboost.readthedocs.io/en/latest/"&gt;Gradient boosted decision trees&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://lightgbm.readthedocs.io/en/latest/"&gt;LightGBM&lt;/a&gt; for distributed and faster training.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://catboost.ai/docs/concepts/about.html"&gt;CatBoost&lt;/a&gt; to handle categorical data.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.kaggle.com/cdeotte/modified-naive-bayes-santander-0-899"&gt;Naive Bayes&lt;/a&gt; classifier.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.kaggle.com/blackblitz/gaussian-naive-bayes"&gt;Gaussian Naive Bayes&lt;/a&gt; model.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.kaggle.com/nawidsayed/lightgbm-and-cnn-3rd-place-solution/notebook"&gt;LGBM + CNN  model&lt;/a&gt; used in 3rd place solution of Santander Customer Transaction Prediction&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.kaggle.com/mathormad/knowledge-distillation-with-nn-rankgauss"&gt;Knowledge distillation&lt;/a&gt; in Neural Network.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://medium.com/@dhirajreddy13/factorization-machines-and-follow-the-regression-leader-for-dummies-7657652dce69"&gt;Follow the regularized leader&lt;/a&gt; method.&lt;/li&gt;
&lt;li&gt;Comparison between &lt;a href="https://www.kaggle.com/c/home-credit-default-risk/discussion/60921"&gt;LGB boosting methods&lt;/a&gt; (goss, gbdt and dart).&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.kaggle.com/abazdyrev/keras-nn-focal-loss-experiments"&gt;NN + focal loss experiment&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Keras &lt;a href="https://www.kaggle.com/ryches/keras-nn-starter-w-time-series-split"&gt;NN with timeseries splitter&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;5th place &lt;a href="https://www.kaggle.com/c/santander-customer-transaction-prediction/discussion/88929"&gt;NN architecture with code&lt;/a&gt; for Santander Transaction prediction.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Hyperparameter tuning
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;LGBM &lt;a href="https://www.kaggle.com/mlisovyi/lightgbm-hyperparameter-optimisation-lb-0-761"&gt;hyperparameter tuning&lt;/a&gt; methods.&lt;/li&gt;
&lt;li&gt;Automated &lt;a href="https://www.kaggle.com/willkoehrsen/automated-model-tuning"&gt;model tuning&lt;/a&gt; methods.&lt;/li&gt;
&lt;li&gt;Parameter tuning with &lt;a href="https://www.kaggle.com/bigironsphere/parameter-tuning-in-one-function-with-hyperopt"&gt;hyperopt&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="http://krasserm.github.io/2018/03/21/bayesian-optimization/"&gt;Bayesian optimization&lt;/a&gt; for hyperparameter tuning.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.kaggle.com/nicapotato/gpyopt-hyperparameter-optimisation-gpu-lgbm"&gt;Gpyopt Hyperparameter Optimisation&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
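&lt;p&gt;Before reaching for Bayesian optimization, plain random search is a surprisingly strong baseline and fits in a few lines. A sketch with a toy objective standing in for cross-validated model performance (the parameter grid below is purely illustrative):&lt;/p&gt;

```python
import random

def random_search(objective, space, n_trials=100, seed=42):
    """Sample hyperparameters from `space` (name -> candidate values),
    evaluate each draw, and keep the best-scoring configuration."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: rng.choice(values) for name, values in space.items()}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy objective standing in for, say, cross-validated AUC of an LGBM model.
space = {
    "num_leaves": [15, 31, 63, 127],
    "learning_rate": [0.3, 0.1, 0.05, 0.01],
}
objective = lambda p: -abs(p["num_leaves"] - 63) / 63 - abs(p["learning_rate"] - 0.05)

best, score = random_search(objective, space)
print(best, score)
```

&lt;p&gt;Libraries like hyperopt and GPyOpt replace the uniform sampling above with a model of the objective, which pays off when each trial is expensive.&lt;/p&gt;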

&lt;h1&gt;
  
  
  Evaluation
&lt;/h1&gt;

&lt;p&gt;Choosing a suitable validation strategy is very important to avoid huge shake-ups, or poor performance of the model on the private test set. &lt;/p&gt;

&lt;p&gt;The traditional 80:20 split doesn’t work in many cases. Cross-validation usually estimates model performance better than a single train-validation split. &lt;/p&gt;

&lt;p&gt;There are different variations of KFold cross-validation, such as group k-fold, and the right one should be chosen accordingly.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://machinelearningmastery.com/k-fold-cross-validation/"&gt;K-fold cross-validation&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html"&gt;Stratified KFold cross-validation&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GroupKFold.html"&gt;Group KFold&lt;/a&gt;&lt;/li&gt;
&lt;li&gt; &lt;a href="https://neptune.ai/blog/tabular-data-binary-classification-tips-and-tricks-from-5-kaggle-competitions"&gt;Adversarial validation&lt;/a&gt; to check if train and test distributions are similar or not.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.kaggle.com/cdeotte/time-split-validation-malware-0-68"&gt;Time Series split&lt;/a&gt; validation.&lt;/li&gt;
&lt;li&gt;Extensive &lt;a href="https://www.kaggle.com/mpearmain/extended-timeseriessplitter"&gt;time series splitter&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
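&lt;p&gt;The point of stratified K-fold is easiest to see in a small sketch: assign samples to folds class by class, so every fold keeps the overall class ratio (standard library only; scikit-learn's &lt;code&gt;StratifiedKFold&lt;/code&gt; does this properly, with optional shuffling):&lt;/p&gt;

```python
from collections import defaultdict

def stratified_folds(labels, n_folds=5):
    """Round-robin the indices of each class across folds, so every fold
    preserves the overall class ratio (important for imbalanced data)."""
    folds = [[] for _ in range(n_folds)]
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)
    return folds

labels = [0] * 80 + [1] * 20            # 80/20 imbalance
folds = stratified_folds(labels)
# every fold holds 16 negatives and 4 positives, matching the 80/20 ratio
print([sum(labels[i] for i in f) for f in folds])
```

&lt;p&gt;A plain, unstratified split on the same data could easily leave a fold with almost no positives, which makes metrics like AUC unstable.&lt;/p&gt;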

&lt;p&gt;&lt;em&gt;Note:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;There are various metrics that you can use to evaluate the performance of your tabular models. A bunch of useful &lt;a href="https://neptune.ai/blog/evaluation-metrics-binary-classification?utm_source=hackernoon&amp;amp;utm_medium=crosspost&amp;amp;utm_campaign=blog-tabular-data-binary-classification-tips-and-tricks-from-5-kaggle-competitions"&gt;classification metrics are listed and explained here&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Other training tricks
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.kaggle.com/vinhnguyen/gpu-acceleration-for-lightgbm"&gt;GPU acceleration&lt;/a&gt; for LGBM.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/discussion/89498"&gt;Use the GPU efficiently&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/discussion/89498"&gt;Free keras memory&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://machinelearningmastery.com/save-load-keras-deep-learning-models/"&gt;Save and load models&lt;/a&gt; to save runtime and memory.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Ensemble
&lt;/h1&gt;

&lt;p&gt;In a competitive environment, you won’t get to the top of the leaderboard without ensembling. Selecting the appropriate ensembling/stacking method is very important to get the maximum performance out of your models. &lt;/p&gt;

&lt;p&gt;Let’s see some of the popular ensembling techniques used in kaggle competitions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://machinelearningmastery.com/weighted-average-ensemble-for-deep-learning-neural-networks/"&gt;Weighted average ensemble&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://machinelearningmastery.com/stacking-ensemble-for-deep-learning-neural-networks/"&gt;Stacked generalization ensemble&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/discussion/52224"&gt;Out of folds predictions&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.kaggle.com/suicaokhoailang/blending-with-linear-regression-0-688-lb"&gt;Blending with linear regression&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Use &lt;a href="https://github.com/optuna/optuna"&gt;optuna&lt;/a&gt; to determine blending weights.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://medium.com/data-design/reaching-the-depths-of-power-geometric-ensembling-when-targeting-the-auc-metric-2f356ea3250e"&gt;Power average ensemble&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/discussion/100661"&gt;Power 3.5 blending strategy&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.kaggle.com/c/microsoft-malware-prediction/discussion/80368#478088"&gt;Blending diverse models&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Different &lt;a href="https://www.kaggle.com/stocks/stacking-higher-and-higher"&gt;stacking approaches&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.kaggle.com/lucaskg/20th-solution-part-2-auc-weight-optimization"&gt;AUC weight optimization&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.kaggle.com/paulorzp/gmean-of-low-correlation-lb-0-952x"&gt;Geometric mean&lt;/a&gt; for low correlation predictions.&lt;/li&gt;
&lt;li&gt;Weighted &lt;a href="https://www.kaggle.com/shaz13/magic-of-weighted-average-rank-0-80/input"&gt;rank average&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
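&lt;p&gt;Rank averaging is worth a closer look, because it explains why blending works even when models output scores on different scales: AUC only depends on ordering, so raw scores can be replaced by normalised ranks before averaging. A standard-library sketch (ties are ignored for brevity):&lt;/p&gt;

```python
def rank_average(predictions, weights=None):
    """Weighted rank averaging: map each model's scores to ranks in [0, 1],
    then take a weighted mean. Safe for blending models whose raw outputs
    live on different scales, as long as the metric is rank-based (AUC)."""
    n_models, n = len(predictions), len(predictions[0])
    weights = weights or [1.0] * n_models
    total = sum(weights)
    blended = [0.0] * n
    for preds, w in zip(predictions, weights):
        order = sorted(range(n), key=lambda i: preds[i])
        for rank, i in enumerate(order):
            blended[i] += (w / total) * (rank / (n - 1))
    return blended

model_a = [0.10, 0.40, 0.90, 0.20]   # calibrated probabilities
model_b = [2.0, 9.0, 30.0, 1.0]      # uncalibrated margins from another model
print(rank_average([model_a, model_b]))
```

&lt;p&gt;A plain weighted average of the two lists above would be dominated by model_b's larger numbers; rank averaging treats both models equally.&lt;/p&gt;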

&lt;h1&gt;
  
  
  Final thoughts
&lt;/h1&gt;

&lt;p&gt;In this article, you saw many popular and effective ways to improve the performance of your tabular data binary classification model. Hopefully, you will find them useful in your projects.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This article was originally written by &lt;a href="https://www.linkedin.com/in/shahules/"&gt;Shahul Es&lt;/a&gt; and posted on the &lt;a href="https://neptune.ai/blog/tabular-data-binary-classification-tips-and-tricks-from-5-kaggle-competitions?utm_source=devto&amp;amp;utm_medium=crosspost&amp;amp;utm_campaign=blog-tabular-data-binary-classification-tips-and-tricks-from-5-kaggle-competitions"&gt;Neptune blog&lt;/a&gt;, where you can find more in-depth articles for machine learning practitioners.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>python</category>
    </item>
    <item>
      <title>How to Keep Track of PyTorch Lightning Experiments with Neptune</title>
      <dc:creator>Kamil A. Kaczmarek</dc:creator>
      <pubDate>Thu, 16 Jul 2020 14:57:37 +0000</pubDate>
      <link>https://dev.to/kamil_k7k/how-to-keep-track-of-pytorch-lightning-experiments-with-neptune-13h3</link>
      <guid>https://dev.to/kamil_k7k/how-to-keep-track-of-pytorch-lightning-experiments-with-neptune-13h3</guid>
      <description>&lt;p&gt;&lt;em&gt;This article was originally written by &lt;a href="https://www.linkedin.com/in/jakub-czakon-2b797b69/"&gt;Jakub Czakon&lt;/a&gt; and posted on the &lt;a href="https://neptune.ai/blog/pytorch-lightning-neptune-integration?utm_source=devto&amp;amp;utm_medium=crosspost&amp;amp;utm_campaign=blog-pytorch-lightning-neptune-integration"&gt;Neptune blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Working with PyTorch Lightning and wondering which logger should you choose to keep track of your experiments?&lt;/p&gt;

&lt;p&gt;Thinking of using PyTorch Lightning to structure your Deep Learning code and wouldn't mind learning about its logging functionality?&lt;/p&gt;

&lt;p&gt;Didn't know that Lightning has a pretty awesome Neptune integration?&lt;/p&gt;

&lt;p&gt;This article is (very likely) for you.&lt;/p&gt;

&lt;h1&gt;
  
  
  Why PyTorch Lightning and Neptune?
&lt;/h1&gt;

&lt;p&gt;If you've never heard of it, &lt;a href="https://github.com/PyTorchLightning/pytorch-lightning"&gt;PyTorch Lightning&lt;/a&gt; is a very lightweight wrapper on top of PyTorch, which is more like a coding standard than a framework. The format allows you to get rid of a ton of boilerplate code while keeping it easy to follow.&lt;/p&gt;

&lt;p&gt;The result is a framework that gives researchers, students, and production teams the ultimate flexibility to try crazy ideas without having to learn yet another framework while automating away all the engineering details.&lt;/p&gt;

&lt;p&gt;Some great features that you can get out-of-the-box are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Train on CPU, GPU or TPUs without changing your code,&lt;/li&gt;
&lt;li&gt;Trivial multi-GPU and multi-node training,&lt;/li&gt;
&lt;li&gt;Trivial 16-bit precision support,&lt;/li&gt;
&lt;li&gt;Built-in performance profiler (&lt;code&gt;Trainer(profile=True)&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;and a &lt;a href="https://pytorch-lightning.readthedocs.io/en/latest/"&gt;ton of other great functionalities&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;But with this great power of running experiments easily and flexibility in tweaking anything you want comes a problem.&lt;/p&gt;

&lt;p&gt;How to keep track of all the changes like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;losses and metrics,&lt;/li&gt;
&lt;li&gt;hyperparameters&lt;/li&gt;
&lt;li&gt;model binaries&lt;/li&gt;
&lt;li&gt;validation predictions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;and other things that will help you organize your experimentation process?&lt;/p&gt;

&lt;p&gt;Fortunately, PyTorch lightning gives you an option to easily connect loggers to the &lt;code&gt;pl.Trainer&lt;/code&gt; and one of the supported loggers that can track all of the things mentioned before (and many others) is the &lt;code&gt;NeptuneLogger&lt;/code&gt; which saves your experiments in… you guessed it &lt;a href="https://docs.neptune.ai/integrations/pytorch_lightning.html?utm_source=devto&amp;amp;utm_medium=crosspost&amp;amp;utm_campaign=blog-pytorch-lightning-neptune-integration"&gt;Neptune&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Neptune not only tracks your experiment artifacts but also:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;lets you monitor everything live,&lt;/li&gt;
&lt;li&gt;gives you a nice UI where you can filter, group and compare various experiment runs,&lt;/li&gt;
&lt;li&gt;lets you access the experiment data you logged programmatically, from a Python script or Jupyter Notebook.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The best part is that this integration really is trivial to use.&lt;/p&gt;

&lt;p&gt;Let me show you how it looks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;:&lt;br&gt;
You can also check out this &lt;a href="https://colab.research.google.com/github/neptune-ai/neptune-colab-examples/blob/master/pytorch_lightning-integration.ipynb"&gt;colab notebook&lt;/a&gt; and play with the examples we will talk about.&lt;/p&gt;
&lt;h1&gt;
  
  
  Basic Integration
&lt;/h1&gt;

&lt;p&gt;In the simplest case you just create the &lt;code&gt;NeptuneLogger&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pytorch_lightning.logging.neptune&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;NeptuneLogger&lt;/span&gt;
&lt;span class="n"&gt;neptune_logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;NeptuneLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"ANONYMOUS"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;project_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"shared/pytorch-lightning-integration"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;and pass it to the logger argument of &lt;code&gt;Trainer&lt;/code&gt; and fit your model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pytorch_lightning&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Trainer&lt;/span&gt;
&lt;span class="n"&gt;trainer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Trainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;neptune_logger&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;By doing so you get your:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Metrics and losses logged and charts created,&lt;/li&gt;
&lt;li&gt;Hyperparameters saved (if defined via lightning &lt;code&gt;hparams&lt;/code&gt;),&lt;/li&gt;
&lt;li&gt;Hardware utilization logged&lt;/li&gt;
&lt;li&gt;Git info and execution script logged&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Check out &lt;a href="https://ui.neptune.ai/shared/pytorch-lightning-integration/e/PYTOR-121/details?utm_source=medium&amp;amp;utm_medium=crosspost&amp;amp;utm_campaign=blog-pytorch-lightning-neptune-integration&amp;amp;utm_source=devto&amp;amp;utm_medium=crosspost&amp;amp;utm_campaign=blog-pytorch-lightning-neptune-integration"&gt;this experiment&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--a6bW2jJq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/600/1%2A5zDMu_mVK1sZ3eXSXHHZ4A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--a6bW2jJq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/600/1%2A5zDMu_mVK1sZ3eXSXHHZ4A.png" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
You can monitor your experiments, compare them, and share them with others.&lt;br&gt;
Not too bad for a 4-liner.&lt;br&gt;
But with just a bit more effort you can get a lot more.&lt;/p&gt;
&lt;h1&gt;
  
  
  Advanced Options
&lt;/h1&gt;

&lt;p&gt;Neptune gives you a lot of customization options and you can simply log more experiment-specific things, like image predictions, model weights, performance charts and more.&lt;/p&gt;

&lt;p&gt;All of that functionality is available for Lightning users and in the next sections I will show you how to leverage Neptune to the fullest.&lt;/p&gt;
&lt;h3&gt;
  
  
  Logging extra information at NeptuneLogger creation
&lt;/h3&gt;

&lt;p&gt;When you are creating the logger you can log additional useful information:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;code: snapshot scripts, jupyter notebooks, config files, and more&lt;/li&gt;
&lt;li&gt;hyperparameters: log learning rate, number of epochs and other things (if you are using the lightning &lt;code&gt;hparams&lt;/code&gt; object it will be logged automatically)&lt;/li&gt;
&lt;li&gt;properties: log data locations, data versions, or other things&lt;/li&gt;
&lt;li&gt;tags: add tags like "resnet50" or "no-augmentation" to organize your runs.&lt;/li&gt;
&lt;li&gt;name: every experiment deserves a meaningful name, so let's not use "default" every time, shall we? 🙂&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Just pass this information to your logger:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;neptune_logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;NeptuneLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"ANONYMOUS"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;project_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"shared/pytorch-lightning-integration"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;experiment_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"default"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Optional,
&lt;/span&gt;    &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"max_epochs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s"&gt;"batch_size"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;  &lt;span class="c1"&gt;# Optional,
&lt;/span&gt;    &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"pytorch-lightning"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"mlp"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# Optional,
&lt;/span&gt;    &lt;span class="n"&gt;upload_source_files&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"**/*.py"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"*.yaml"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# Optional,
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;… and proceed as before to get an &lt;a href="https://ui.neptune.ai/shared/pytorch-lightning-integration/experiments?viewId=1e01a374-00a6-4cef-af20-ea99e1fc9fab&amp;amp;utm_source=devto&amp;amp;utm_medium=crosspost&amp;amp;utm_campaign=blog-pytorch-lightning-neptune-integration"&gt;organized dashboard like this one&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3rWlCfS1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/600/0%2AwJnwGnRmFKRXUSEw" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3rWlCfS1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/600/0%2AwJnwGnRmFKRXUSEw" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Logging extra things during training
&lt;/h3&gt;

&lt;p&gt;A lot of interesting information can be logged during training.&lt;/p&gt;

&lt;p&gt;You may be interested in monitoring things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;model predictions after each epoch (think prediction masks or overlaid bounding boxes)&lt;/li&gt;
&lt;li&gt;diagnostic charts like ROC AUC curve or Confusion Matrix&lt;/li&gt;
&lt;li&gt;model checkpoints, or other objects&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is really simple. Just go to your &lt;code&gt;LightningModule&lt;/code&gt; and call methods of the Neptune experiment available as &lt;code&gt;self.logger.experiment&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;For example, we can log histograms of losses after each epoch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CoolSystem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LightningModule&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;validation_end&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# OPTIONAL
&lt;/span&gt;        &lt;span class="n"&gt;avg_loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stack&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'val_loss'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;tensorboard_logs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'val_loss'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;avg_loss&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;# log debugging images like histogram of losses
&lt;/span&gt;        &lt;span class="n"&gt;fig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;losses&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stack&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'val_loss'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;numpy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;losses&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;experiment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'loss_histograms'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;close&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'avg_val_loss'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;avg_loss&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'log'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tensorboard_logs&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://ui.neptune.ai/shared/pytorch-lightning-integration/e/PYTOR-119/logs?utm_source=devto&amp;amp;utm_medium=crosspost&amp;amp;utm_campaign=blog-pytorch-lightning-neptune-integration"&gt;Explore them for yourself&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_bEcaG0j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/600/0%2A1UzbqF71meU9Oqlk" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_bEcaG0j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/600/0%2A1UzbqF71meU9Oqlk" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Other things you may want to log during training are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;self.logger.experiment.log_metric&lt;/code&gt; # log custom metrics&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;self.logger.experiment.log_text&lt;/code&gt; # log text values&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;self.logger.experiment.log_artifact&lt;/code&gt; # log files&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;self.logger.experiment.log_image&lt;/code&gt; # log images, charts&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;self.logger.experiment.set_property&lt;/code&gt; # add key:value pairs&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;self.logger.experiment.append_tag&lt;/code&gt; # add tags for organization&lt;/li&gt;
&lt;/ul&gt;
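&lt;p&gt;As a quick offline illustration of the call pattern (not the real client — &lt;code&gt;FakeExperiment&lt;/code&gt; below is a hypothetical stand-in that only records calls), here is how a few of these methods might be used:&lt;/p&gt;

```python
# Hypothetical stand-in for the Neptune experiment object, used only to
# show the call pattern offline; in a LightningModule you would call
# the same methods on self.logger.experiment instead.
class FakeExperiment:
    def __init__(self):
        self.metrics = {}     # metric name -> list of logged values
        self.properties = {}  # key:value pairs
        self.tags = []        # organizational tags

    def log_metric(self, name, value):
        self.metrics.setdefault(name, []).append(value)

    def set_property(self, key, value):
        self.properties[key] = value

    def append_tag(self, tag):
        self.tags.append(tag)

experiment = FakeExperiment()
experiment.log_metric('custom_f1', 0.87)       # a custom metric
experiment.set_property('data_version', 'v2')  # a key:value property
experiment.append_tag('baseline')              # a tag for organization

print(experiment.metrics['custom_f1'])  # [0.87]
```

&lt;p&gt;With the real client, swapping &lt;code&gt;experiment&lt;/code&gt; for &lt;code&gt;self.logger.experiment&lt;/code&gt; inside your &lt;code&gt;LightningModule&lt;/code&gt; should give the same calls against Neptune.&lt;/p&gt;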

&lt;p&gt;Pretty cool, right?&lt;/p&gt;

&lt;p&gt;But … that is not all you can do!&lt;/p&gt;

&lt;h3&gt;
  
  
  Logging things after training has finished
&lt;/h3&gt;

&lt;p&gt;Tracking your experiment doesn't have to stop when your &lt;code&gt;.fit()&lt;/code&gt; loop ends.&lt;/p&gt;

&lt;p&gt;You may want to track the metrics from &lt;code&gt;trainer.test(model)&lt;/code&gt;, or calculate some additional validation metrics and log them.&lt;/p&gt;

&lt;p&gt;To do that you just need to tell &lt;code&gt;NeptuneLogger&lt;/code&gt; not to close after fit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;neptune_logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;NeptuneLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"ANONYMOUS"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;project_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"shared/pytorch-lightning-integration"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;close_after_fit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;… and you can keep logging 🙂&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test metrics:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Additional (external) metrics:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;accuracy_score&lt;/span&gt;
&lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="n"&gt;accuracy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;accuracy_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;neptune_logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;experiment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_metric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'test_accuracy'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;accuracy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Performance charts on test set:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;scikitplot.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;plot_confusion_matrix&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;
&lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplots&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;plot_confusion_matrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;neptune_logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;experiment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'confusion_matrix'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The whole model checkpoints directory:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;neptune_logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;experiment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_artifact&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'my/checkpoints'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
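&lt;p&gt;If you would rather upload checkpoint files one by one (say, to control their names individually), a small helper can walk the directory. The sketch below runs against a hypothetical &lt;code&gt;FakeExperiment&lt;/code&gt; stand-in so it works without a Neptune connection; &lt;code&gt;log_checkpoint_files&lt;/code&gt; is my name, not part of any API:&lt;/p&gt;

```python
import os
import tempfile

# Hypothetical stand-in that records log_artifact calls instead of
# uploading; with the real client you would pass
# neptune_logger.experiment here instead.
class FakeExperiment:
    def __init__(self):
        self.artifacts = []

    def log_artifact(self, path):
        self.artifacts.append(path)

def log_checkpoint_files(experiment, checkpoints_dir):
    """Log every file under checkpoints_dir as a separate artifact."""
    for root, _, files in os.walk(checkpoints_dir):
        for name in sorted(files):
            experiment.log_artifact(os.path.join(root, name))

# Demo on a throwaway directory holding two dummy checkpoint files.
with tempfile.TemporaryDirectory() as tmp:
    for fname in ('epoch=3.ckpt', 'epoch=7.ckpt'):
        open(os.path.join(tmp, fname), 'w').close()
    exp = FakeExperiment()
    log_checkpoint_files(exp, tmp)

print(len(exp.artifacts))  # 2
```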



&lt;p&gt;&lt;a href="https://ui.neptune.ai/shared/pytorch-lightning-integration/e/PYTOR-119/logs?utm_source=devto&amp;amp;utm_medium=crosspost&amp;amp;utm_campaign=blog-pytorch-lightning-neptune-integration"&gt;Go to this experiment&lt;/a&gt; to see how those objects are logged:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cI2O-jVZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/600/1%2ABtuhJUdbldY6b_0WFPT5lA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cI2O-jVZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/600/1%2ABtuhJUdbldY6b_0WFPT5lA.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But … there is even more!&lt;/p&gt;

&lt;p&gt;Neptune lets you fetch experiments after training.&lt;/p&gt;

&lt;p&gt;Let me show you how.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fetching your experiment information directly into your notebooks
&lt;/h3&gt;

&lt;p&gt;You can fetch experiments after they have finished, analyze the results, and update metrics, artifacts, or other things if you want to.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;neptune&lt;/span&gt;

&lt;span class="n"&gt;project&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;neptune&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'shared/pytorch-lightning-integration'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_leaderboard&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;For example, let's fetch the experiments dashboard into a pandas DataFrame, or visualize it with HiPlot via the &lt;a href="https://docs.neptune.ai/integrations/hiplot.html?utm_source=devto&amp;amp;utm_medium=crosspost&amp;amp;utm_campaign=blog-pytorch-lightning-neptune-integration"&gt;Neptune HiPlot integration&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;neptunecontrib.viz&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;make_parallel_coordinates_plot&lt;/span&gt;

&lt;span class="n"&gt;make_parallel_coordinates_plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
           &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'train_loss'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'val_loss'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'test_accuracy'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
           &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'max_epochs'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'batch_size'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'lr'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--q6w4etk5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/600/1%2Axi1VBLUTgN4enzBQphYgQA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--q6w4etk5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/600/1%2Axi1VBLUTgN4enzBQphYgQA.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;or fetch a single experiment and update it with some external metric calculated after training:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;exp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_experiments&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'PYTOR-63'&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_metric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'some_external_metric'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.92&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
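&lt;p&gt;Since &lt;code&gt;get_leaderboard()&lt;/code&gt; returns a pandas DataFrame, you can slice and sort it like any other table. The sketch below uses a hand-built DataFrame; the column names &lt;code&gt;channel_val_loss&lt;/code&gt; and &lt;code&gt;parameter_lr&lt;/code&gt; are assumptions for illustration and may not match your project's columns exactly:&lt;/p&gt;

```python
import pandas as pd

# Hand-built stand-in for the DataFrame returned by
# project.get_leaderboard(); the column names are illustrative only.
leaderboard = pd.DataFrame({
    'id': ['PYTOR-61', 'PYTOR-62', 'PYTOR-63'],
    'channel_val_loss': [0.41, 0.35, 0.38],
    'parameter_lr': [0.02, 0.01, 0.005],
})

# Pick the experiment with the lowest validation loss.
best = leaderboard.loc[leaderboard['channel_val_loss'].idxmin()]
print(best['id'])  # PYTOR-62
```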



&lt;p&gt;As you can see, there are a lot of things you can log to Neptune from PyTorch Lightning.&lt;/p&gt;

&lt;p&gt;If you want to go deeper into this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.neptune.ai/integrations/pytorch_lightning.html?utm_source=devto&amp;amp;utm_medium=crosspost&amp;amp;utm_campaign=blog-pytorch-lightning-neptune-integration"&gt;read the integration docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://neptune.ai/?utm_source=devto&amp;amp;utm_medium=crosspost&amp;amp;utm_campaign=blog-pytorch-lightning-neptune-integration"&gt;go check out Neptune&lt;/a&gt; to see other things it can do,&lt;/li&gt;
&lt;li&gt;&lt;a href="https://colab.research.google.com/github/neptune-ai/neptune-colab-examples/blob/master/pytorch_lightning-integration.ipynb"&gt;try out Lightning + Neptune on colab&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Final Thoughts
&lt;/h1&gt;

&lt;p&gt;PyTorch Lightning is a great library that helps you with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;organizing your deep learning code to make it easily understandable to other people,&lt;/li&gt;
&lt;li&gt;outsourcing development boilerplate to a team of seasoned engineers,&lt;/li&gt;
&lt;li&gt;accessing a lot of state-of-the-art functionalities with almost no changes to your code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With Neptune integration, you get some additional things for free:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;you can monitor and keep track of your deep learning experiments&lt;/li&gt;
&lt;li&gt;you can share your research with other people easily&lt;/li&gt;
&lt;li&gt;you and your team can access experiment metadata and collaborate more efficiently.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Hopefully, with all that power, you will know exactly what you (and other people) tried, and your deep learning research will move at lightning speed 🙂&lt;/p&gt;

&lt;h1&gt;
  
  
  Bonus: Full PyTorch Lightning tracking script
&lt;/h1&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--upgrade&lt;/span&gt; torch pytorch-lightning &lt;span class="se"&gt;\&lt;/span&gt;
    neptune-client neptune-contrib[viz] &lt;span class="se"&gt;\&lt;/span&gt;
    matplotlib scikit-plot 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;





&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;os&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;torch.nn&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;functional&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;torch.utils.data&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DataLoader&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;torchvision.datasets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MNIST&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;torchvision&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;transforms&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pytorch_lightning&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;

&lt;span class="n"&gt;MAX_EPOCHS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;
&lt;span class="n"&gt;LR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.02&lt;/span&gt;
&lt;span class="n"&gt;BATCHSIZE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;
&lt;span class="n"&gt;CHECKPOINTS_DIR&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'my_models/checkpoints'&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CoolSystem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LightningModule&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nb"&gt;super&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CoolSystem&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="c1"&gt;# not the best model...
&lt;/span&gt;        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;l1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;28&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;l1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;training_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch_idx&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# REQUIRED
&lt;/span&gt;        &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;
        &lt;span class="n"&gt;y_hat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cross_entropy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_hat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;tensorboard_logs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'train_loss'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'loss'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'log'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tensorboard_logs&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;validation_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch_idx&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# OPTIONAL
&lt;/span&gt;        &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;
        &lt;span class="n"&gt;y_hat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'val_loss'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cross_entropy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_hat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;validation_end&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# OPTIONAL
&lt;/span&gt;        &lt;span class="n"&gt;avg_loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stack&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'val_loss'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;tensorboard_logs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'val_loss'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;avg_loss&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="n"&gt;fig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;losses&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stack&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'val_loss'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;numpy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;losses&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;experiment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'loss_histograms'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'avg_val_loss'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;avg_loss&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'log'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tensorboard_logs&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch_idx&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# OPTIONAL
&lt;/span&gt;        &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;
        &lt;span class="n"&gt;y_hat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'test_loss'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cross_entropy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_hat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_end&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# OPTIONAL
&lt;/span&gt;        &lt;span class="n"&gt;avg_loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stack&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'test_loss'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;tensorboard_logs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'test_loss'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;avg_loss&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'avg_test_loss'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;avg_loss&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'log'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tensorboard_logs&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;configure_optimizers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# REQUIRED
&lt;/span&gt;        &lt;span class="c1"&gt;# can return multiple optimizers and learning_rate schedulers
&lt;/span&gt;        &lt;span class="c1"&gt;# (LBFGS it is automatically supported, no need for closure function)
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;optim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Adam&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;LR&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data_loader&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;train_dataloader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# REQUIRED
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;DataLoader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MNIST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;getcwd&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;download&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;transforms&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ToTensor&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;BATCHSIZE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data_loader&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;val_dataloader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# OPTIONAL
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;DataLoader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MNIST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;getcwd&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;download&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;transforms&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ToTensor&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;BATCHSIZE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data_loader&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_dataloader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# OPTIONAL
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;DataLoader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MNIST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;getcwd&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;download&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;transforms&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ToTensor&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;BATCHSIZE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pytorch_lightning.loggers.neptune&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;NeptuneLogger&lt;/span&gt;

&lt;span class="n"&gt;neptune_logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;NeptuneLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"ANONYMOUS"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;project_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"shared/pytorch-lightning-integration"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;close_after_fit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;experiment_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"default"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Optional,
&lt;/span&gt;    &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"max_epochs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;MAX_EPOCHS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s"&gt;"batch_size"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;BATCHSIZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s"&gt;"lr"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;LR&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="c1"&gt;# Optional,
&lt;/span&gt;    &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"pytorch-lightning"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"mlp"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;upload_source_files&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'*.py'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s"&gt;'*.yaml'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;upload_stderr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;upload_stdout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model_checkpoint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;callbacks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ModelCheckpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filepath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;CHECKPOINTS_DIR&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pytorch_lightning&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Trainer&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CoolSystem&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;trainer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Trainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;MAX_EPOCHS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                  &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;neptune_logger&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                  &lt;span class="n"&gt;checkpoint_callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model_checkpoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Get predictions on external test
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;freeze&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;test_loader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DataLoader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MNIST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;getcwd&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;download&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;transforms&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ToTensor&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[],[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_loader&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;y_hat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;argmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;cpu&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;detach&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;numpy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cpu&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;detach&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;numpy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_hat&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_loader&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;
&lt;span class="n"&gt;y_true&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hstack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;y_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hstack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Log additional metrics
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;accuracy_score&lt;/span&gt;

&lt;span class="n"&gt;accuracy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;accuracy_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;neptune_logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;experiment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_metric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'test_accuracy'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;accuracy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Log charts
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;scikitplot.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;plot_confusion_matrix&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;

&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplots&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;plot_confusion_matrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;neptune_logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;experiment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'confusion_matrix'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Save checkpoints folder
&lt;/span&gt;&lt;span class="n"&gt;neptune_logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;experiment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_artifact&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CHECKPOINTS_DIR&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# You can stop the experiment
&lt;/span&gt;&lt;span class="n"&gt;neptune_logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;experiment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;






&lt;p&gt;&lt;em&gt;This article was originally written by &lt;a href="https://www.linkedin.com/in/jakub-czakon-2b797b69/"&gt;Jakub Czakon&lt;/a&gt; and posted on the &lt;a href="https://neptune.ai/blog/pytorch-lightning-neptune-integration?utm_source=devto&amp;amp;utm_medium=crosspost&amp;amp;utm_campaign=blog-pytorch-lightning-neptune-integration"&gt;Neptune blog&lt;/a&gt;. You can find there more in-depth articles for machine learning practitioners.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Understanding LightGBM Parameters (and How to Tune Them)</title>
      <dc:creator>Kamil A. Kaczmarek</dc:creator>
      <pubDate>Tue, 14 Jul 2020 21:13:08 +0000</pubDate>
      <link>https://dev.to/kamil_k7k/understanding-lightgbm-parameters-and-how-to-tune-them-14n0</link>
      <guid>https://dev.to/kamil_k7k/understanding-lightgbm-parameters-and-how-to-tune-them-14n0</guid>
      <description>&lt;p&gt;&lt;em&gt;This article was originally written by &lt;a href="https://www.linkedin.com/in/mjbahmani/"&gt;MJ Bahmani&lt;/a&gt; and posted on the &lt;a href="https://neptune.ai/blog/lightgbm-parameters-guide?utm_source=devto&amp;amp;utm_medium=crosspost&amp;amp;utm_campaign=blog-lightgbm-parameters-guide"&gt;Neptune blog.&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;I've been using &lt;a href="https://github.com/microsoft/LightGBM/tree/master/python-package"&gt;lightGBM&lt;/a&gt; for a while now. It's been my go-to algorithm for most tabular data problems. The list of awesome features is long and I suggest that you take a look if you haven't already.&lt;br&gt;
But I was always interested in understanding which parameters have the biggest impact on performance and how I should tune lightGBM parameters to get the most out of it.&lt;br&gt;
I figured I should do some research, understand more about lightGBM parameters… and share my journey.&lt;br&gt;
Specifically I:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Took a deep-dive into &lt;a href="https://lightgbm.readthedocs.io/en/latest/index.html"&gt;LightGBM's documentation&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Went through Laurae's articles &lt;a href="https://sites.google.com/view/lauraepp/parameters"&gt;Lauraepp: xgboost / LightGBM parameters&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Looked into the &lt;a href="https://github.com/microsoft/LightGBM"&gt;LightGBM GitHub Repository&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Ran some experiments myself&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As I was doing that I gained a lot more knowledge about lightGBM parameters. My hope is that after reading this article you will be able to answer the following questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which Gradient Boosting methods are implemented in LightGBM and what are their differences?&lt;/li&gt;
&lt;li&gt;Which parameters are important in general?&lt;/li&gt;
&lt;li&gt;Which regularization parameters need to be tuned?&lt;/li&gt;
&lt;li&gt;How to tune lightGBM parameters in python?&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;
  
  
  Gradient Boosting methods
&lt;/h1&gt;

&lt;p&gt;With LightGBM you can run different types of Gradient Boosting methods. You have: GBDT, DART, and GOSS which can be specified with the "boosting" parameter.&lt;br&gt;
In the next sections, I will explain and compare these methods with each other.&lt;/p&gt;
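To make that concrete, here is a minimal sketch of how each method is selected via the "boosting" key of the params dict passed to lightgbm.train(). The auxiliary values (drop_rate, top_rate, other_rate) are real LightGBM parameter names, but the numbers are illustrative assumptions, not tuned recommendations.

```python
# The boosting variant is chosen with the "boosting" key in the params
# dict that lightgbm.train() receives.

gbdt_params = {"objective": "binary", "boosting": "gbdt"}  # classic GBDT

dart_params = {"objective": "binary", "boosting": "dart",
               "drop_rate": 0.1}  # fraction of trees dropped per iteration

goss_params = {"objective": "binary", "boosting": "goss",
               "top_rate": 0.2,    # fraction of large-gradient rows to keep
               "other_rate": 0.1}  # fraction of small-gradient rows to sample
```

Each dict would then be passed as the first argument of lightgbm.train(params, train_set, ...).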
&lt;h3&gt;
  
  
  lgbm gbdt (gradient boosted decision trees)
&lt;/h3&gt;

&lt;p&gt;This method is the traditional Gradient Boosting Decision Tree that was first suggested in this article, and it is the algorithm behind some great libraries like XGBoost and pGBRT.&lt;br&gt;
These days gbdt is widely used because of its accuracy, efficiency, and stability. You probably know that gbdt is an ensemble model of decision trees, but what does that mean exactly?&lt;br&gt;
Let me give you the gist.&lt;br&gt;
It is based on three important principles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Weak learners (decision trees)&lt;/li&gt;
&lt;li&gt;Gradient Optimization&lt;/li&gt;
&lt;li&gt;Boosting Technique&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So in the gbdt method we have a lot of decision trees (weak learners). Those trees are built sequentially:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the first tree learns how to fit the target variable&lt;/li&gt;
&lt;li&gt;the second tree learns how to fit the residual (the difference between the predictions of the first tree and the ground truth)&lt;/li&gt;
&lt;li&gt;the third tree learns how to fit the residuals of the second tree, and so on.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All those trees are trained by propagating the gradients of errors throughout the system.&lt;br&gt;
The main drawback of gbdt is that finding the best split points in each tree node is a time- and memory-consuming operation; other boosting methods try to tackle that problem.&lt;/p&gt;
&lt;h3&gt;
  
  
  dart gradient boosting
&lt;/h3&gt;

&lt;p&gt;In this outstanding paper, you can learn all about DART gradient boosting, a method that uses dropout (standard in neural networks) to improve model regularization and deal with some other less-obvious problems.&lt;br&gt;
Namely, gbdt suffers from over-specialization, which means that trees added at later iterations tend to impact the predictions of only a few instances and make a negligible contribution to the remaining instances. Adding dropout makes it harder for trees at later iterations to specialize on those few samples, and hence improves performance.&lt;/p&gt;
&lt;h3&gt;
  
  
  lgbm goss (Gradient-based One-Side Sampling)
&lt;/h3&gt;

&lt;p&gt;In fact, the most important reason for naming this method lightgbm is its use of &lt;a href="https://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree.pdf"&gt;the Goss&lt;/a&gt; method, proposed in this paper. Goss is the newer and lighter gbdt implementation (hence "light" gbm).&lt;br&gt;
The standard gbdt is reliable, but it is not fast enough on large datasets, so goss proposes a gradient-based sampling method to avoid searching the whole search space. The intuition is that a data instance with a small gradient is already well-trained, while an instance with a large gradient still needs more training. So we have &lt;strong&gt;two sides&lt;/strong&gt; here: data instances with large gradients and data instances with small gradients. Goss keeps all the data with large gradients and takes a random sample &lt;strong&gt;(that's why it is called One-Side Sampling)&lt;/strong&gt; of the data with small gradients. This makes the search space smaller, so goss can converge faster. For more insight into goss, you can check this &lt;a href="https://towardsdatascience.com/what-makes-lightgbm-lightning-fast-a27cf0d9785e"&gt;blog post&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let's put those differences in a table:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JgfRmcSF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/600/1%2ASTNImQL15FVgTJ-ieySrYw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JgfRmcSF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/600/1%2ASTNImQL15FVgTJ-ieySrYw.png" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Note&lt;/strong&gt;: If you set boosting to rf, the lightgbm algorithm behaves as a random forest, not boosted trees! According to the documentation, to use rf you must set bagging_fraction and feature_fraction to values smaller than 1.&lt;/p&gt;
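As a sketch of that note, a random-forest-style configuration might look like the following; the fractions are illustrative values, chosen only to satisfy the documented constraint that both must be below 1.

```python
# Parameters that switch LightGBM into random-forest mode.
rf_params = {
    "boosting": "rf",
    "bagging_fraction": 0.8,  # row subsampling, must be below 1.0 for rf
    "bagging_freq": 1,        # perform bagging at every iteration
    "feature_fraction": 0.8,  # column subsampling, must be below 1.0 for rf
}
```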
&lt;h1&gt;
  
  
  Regularization
&lt;/h1&gt;

&lt;p&gt;In this section, I will cover some important regularization parameters of lightgbm. Obviously, those are the parameters that you need to tune to fight overfitting.&lt;br&gt;
You should be aware that for small datasets (&amp;lt;10000 records) lightGBM may not be the best choice. Tuning lightgbm parameters may not help you there.&lt;br&gt;
In addition, lightgbm uses the &lt;a href="https://lightgbm.readthedocs.io/en/latest/Features.html#leaf-wise-best-first-tree-growth"&gt;leaf-wise&lt;/a&gt; tree growth algorithm, while XGBoost uses depth-wise tree growth. The leaf-wise method allows the trees to converge faster, but the chance of over-fitting increases.&lt;br&gt;
This talk from one of the PyData conferences may give you more insight into Xgboost and Lightgbm. It's worth watching!&lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/5CWwwtEM2TA"&gt;
&lt;/iframe&gt;
&lt;br&gt;
&lt;strong&gt;Note&lt;/strong&gt;: If someone asks you what the main difference between LightGBM and XGBoost is, you can easily say: they differ in how they are implemented.&lt;/p&gt;

&lt;p&gt;According to &lt;a href="https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html#deal-with-over-fitting"&gt;lightGBM documentation&lt;/a&gt;, when facing  overfitting you may want to do the following parameter tuning: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use small max_bin&lt;/li&gt;
&lt;li&gt;Use small num_leaves&lt;/li&gt;
&lt;li&gt;Use min_data_in_leaf and min_sum_hessian_in_leaf&lt;/li&gt;
&lt;li&gt;Use bagging by setting bagging_fraction and bagging_freq&lt;/li&gt;
&lt;li&gt;Use feature sub-sampling by setting feature_fraction&lt;/li&gt;
&lt;li&gt;Use bigger training data&lt;/li&gt;
&lt;li&gt;Try lambda_l1, lambda_l2 and min_gain_to_split for regularization&lt;/li&gt;
&lt;li&gt;Try max_depth to avoid growing deep trees&lt;/li&gt;
&lt;/ul&gt;
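&lt;p&gt;Taken together, the checklist above maps to a single parameter dictionary. Here is a sketch with purely illustrative values (assumptions on my part, not recommendations); tune each of them on your own validation data:&lt;/p&gt;

```python
# Illustrative anti-overfitting settings only -- every value here is an
# assumption to show the shape of the dictionary, not a recommendation.
anti_overfit_params = {
    'max_bin': 127,             # smaller max_bin -> coarser histograms
    'num_leaves': 31,           # keep well below 2**max_depth
    'max_depth': 7,             # cap the depth of each tree
    'min_data_in_leaf': 50,     # require more samples per leaf
    'bagging_fraction': 0.8,    # row subsampling ...
    'bagging_freq': 5,          # ... performed every 5 iterations
    'feature_fraction': 0.8,    # column subsampling per tree
    'lambda_l1': 0.1,           # L1 regularization
    'lambda_l2': 0.1,           # L2 regularization
    'min_gain_to_split': 0.01,  # minimum gain required to make a split
}
```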

&lt;p&gt;In the following sections, I will explain each of those parameters in a bit more detail.&lt;/p&gt;
&lt;h3&gt;
  
  
  lambda_l1
&lt;/h3&gt;

&lt;p&gt;Lambda_l1 (and lambda_l2) control L1/L2 regularization and, along with min_gain_to_split, are used to combat overfitting. I highly recommend using parameter tuning (explored in a later section) to figure out the best values for these parameters.&lt;/p&gt;
&lt;h3&gt;
  
  
  num_leaves
&lt;/h3&gt;

&lt;p&gt;Surely &lt;strong&gt;num_leaves&lt;/strong&gt; is one of the most important parameters that control the &lt;strong&gt;complexity&lt;/strong&gt; of the model. With it, you set the maximum number of leaves each weak learner has. A large num_leaves increases accuracy on the training set but also the chance of getting hurt by overfitting. According to the documentation, one simple rule of thumb is &lt;strong&gt;num_leaves = 2^(max_depth)&lt;/strong&gt;; however, considering that in lightgbm a leaf-wise tree is deeper than a level-wise tree, you need to be careful about overfitting!&lt;br&gt;
&lt;strong&gt;As a result, it is necessary to tune num_leaves together with max_depth.&lt;/strong&gt;&lt;/p&gt;
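&lt;p&gt;A minimal sketch of that relationship: start from the 2^(max_depth) upper bound and back off by some factor. The shrink factor below is an arbitrary illustrative choice of mine, not an official recommendation:&lt;/p&gt;

```python
def suggested_num_leaves(max_depth, shrink=0.7):
    # A leaf-wise tree can use up to 2**max_depth leaves; using that
    # many usually overfits, so back off by an illustrative factor.
    return max(2, int((2 ** max_depth) * shrink))

print(suggested_num_leaves(5))  # 22, well below the 32-leaf upper bound
```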

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3A8G3t_T--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/600/0%2AAPaeoCx2c_0z-VaR" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3A8G3t_T--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/600/0%2AAPaeoCx2c_0z-VaR" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--W_zd0LJ---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/600/0%2AxELcdiOJlE8vhJnO" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--W_zd0LJ---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/600/0%2AxELcdiOJlE8vhJnO" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
Photo on &lt;a href="https://lightgbm.readthedocs.io/en/latest/Features.html"&gt;lightgbm documentation&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  subsample
&lt;/h3&gt;

&lt;p&gt;With &lt;a href="https://lightgbm.readthedocs.io/en/latest/Parameters.html#learning-control-parameters"&gt;subsample&lt;/a&gt; (or bagging_fraction) you can specify the percentage of rows used per tree-building iteration. That means some rows will be randomly selected for fitting each learner (tree). This improves generalization and also the speed of training.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CRrO50hn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/600/0%2A6QxzzJiv3nHjQIC7" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CRrO50hn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/600/0%2A6QxzzJiv3nHjQIC7" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I suggest using smaller subsample values for the baseline models and increasing this value later, when you are done with other experiments (different feature selections, different tree architectures).&lt;/p&gt;
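&lt;p&gt;One detail worth knowing from the LightGBM parameter docs: bagging_fraction only takes effect when bagging_freq is greater than 0. A sketch with illustrative values:&lt;/p&gt;

```python
# Values are illustrative only. Per the LightGBM parameter docs,
# bagging_fraction takes effect only when bagging_freq > 0.
bagging_params = {
    'bagging_fraction': 0.5,  # use 50% of the rows for each tree
    'bagging_freq': 1,        # re-sample the rows every iteration
    'bagging_seed': 42,       # fix the seed for reproducibility
}
```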
&lt;h3&gt;
  
  
  feature_fraction
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://lightgbm.readthedocs.io/en/latest/Parameters.html#learning-control-parameters"&gt;Feature fraction&lt;/a&gt; or sub_feature deals with column sampling, LightGBM will randomly select a subset of features on each iteration (tree). For example, if you set it to 0.6, LightGBM will select 60% of features before training each tree.&lt;br&gt;
There are two usage for this feature:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can be used to speed up training&lt;/li&gt;
&lt;li&gt;Can be used to deal with overfitting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bDuz3aY_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/600/0%2AqGw4lx7pINrkVNYX" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bDuz3aY_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/600/0%2AqGw4lx7pINrkVNYX" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  max_depth
&lt;/h3&gt;

&lt;p&gt;This parameter controls the max depth of each trained tree and will have an impact on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The best value for the num_leaves parameter&lt;/li&gt;
&lt;li&gt;Model Performance&lt;/li&gt;
&lt;li&gt;Training Time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Pay attention: if you use a large value of max_depth, your model will likely &lt;strong&gt;overfit&lt;/strong&gt; the train set.&lt;/p&gt;
&lt;h3&gt;
  
  
  max_bin
&lt;/h3&gt;

&lt;p&gt;Binning is a technique for representing data in a discrete view (histogram). Lightgbm uses a histogram-based algorithm to find the optimal split point while creating a weak learner. Therefore, each continuous numeric feature (e.g. the number of views for a video) should be split into discrete bins.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Ga8_saQL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/600/0%2AoMI3hXxBdqCj2Cgb" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ga8_saQL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/600/0%2AoMI3hXxBdqCj2Cgb" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
The photo on &lt;a href="https://mlexplained.com/2018/01/05/lightgbm-and-xgboost-explained/"&gt;LightGBM and XGBoost Explained&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Also, in this &lt;a href="https://github.com/huanzhang12/lightgbm-gpu"&gt;GitHub repo&lt;/a&gt;, you can find some comprehensive experiments that thoroughly explain the effect of changing max_bin on CPU and GPU.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qoT7d_VL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://lh5.googleusercontent.com/znWhx-b6ov_c8rjk3HTPoyWjIJP0M-1owWsPT_xH6OnVl02o5vxQmdvtruQiRZBVUm0bWUIoUAw1lrYbaN3KGsYsVKC8Sya6YePyiWNDtFBNNBZUSYfZJf3Zp9V8mM4XIkIGVI9C" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qoT7d_VL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://lh5.googleusercontent.com/znWhx-b6ov_c8rjk3HTPoyWjIJP0M-1owWsPT_xH6OnVl02o5vxQmdvtruQiRZBVUm0bWUIoUAw1lrYbaN3KGsYsVKC8Sya6YePyiWNDtFBNNBZUSYfZJf3Zp9V8mM4XIkIGVI9C" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
Clock time after 500 iterations - &lt;a href="https://github.com/huanzhang12/lightgbm-gpu"&gt;GitHub repo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you set max_bin to 255, it means each feature can have at most 255 unique bin values. A small max_bin gives faster speed, while a larger value improves accuracy.&lt;/p&gt;
&lt;h1&gt;
  
  
  Training parameters
&lt;/h1&gt;

&lt;p&gt;Training time! When you train your model with lightgbm, some typical issues that may come up are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Training is a time-consuming process&lt;/li&gt;
&lt;li&gt;Dealing with Computational Complexity (CPU/GPU RAM constraints)&lt;/li&gt;
&lt;li&gt;Dealing with categorical features&lt;/li&gt;
&lt;li&gt;Having an unbalanced dataset&lt;/li&gt;
&lt;li&gt;The need for custom metrics&lt;/li&gt;
&lt;li&gt;Adjustments that need to be made for Classification or Regression problems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this section, we will try to explain those points in detail.&lt;/p&gt;
&lt;h3&gt;
  
  
  num_iterations
&lt;/h3&gt;

&lt;p&gt;Num_iterations specifies the number of boosting iterations (trees to build). The more trees you build the more accurate your model can be at the cost of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Longer training time&lt;/li&gt;
&lt;li&gt;Higher chance of overfitting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Start with a lower number of trees to build a baseline and increase it later when you want to squeeze the last % out of your model.&lt;br&gt;
It is recommended to use a smaller &lt;strong&gt;learning_rate&lt;/strong&gt; with a larger &lt;strong&gt;num_iterations&lt;/strong&gt;. Also, you should use early_stopping_rounds if you go for higher num_iterations, to stop your training when it is no longer learning anything useful.&lt;/p&gt;
&lt;h3&gt;
  
  
  early_stopping_rounds
&lt;/h3&gt;

&lt;p&gt;This parameter will stop training if the validation metric has not improved in the last early stopping rounds. It should be defined together with the number &lt;strong&gt;of iterations&lt;/strong&gt;. If you set it too large, you increase the chance of &lt;strong&gt;overfitting&lt;/strong&gt; (but your model can be better).&lt;br&gt;
The rule of thumb is to set it to 10% of your num_iterations.&lt;/p&gt;
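&lt;p&gt;The idea behind early stopping can be sketched in a few lines of plain python. This is only an illustration of the logic, not lightgbm's actual implementation:&lt;/p&gt;

```python
def rounds_since_best(scores):
    # index of the best (highest) validation score seen so far
    best_idx = max(range(len(scores)), key=lambda i: scores[i])
    return len(scores) - 1 - best_idx

def should_stop(scores, patience):
    """Stop when the validation metric (higher is better here) has not
    improved for `patience` consecutive rounds -- the same idea that
    early_stopping_rounds implements inside lightgbm."""
    return rounds_since_best(scores) >= patience

history = [0.60, 0.70, 0.71, 0.70, 0.69, 0.68]
print(should_stop(history, 3))  # True: no improvement for 3 rounds
```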
&lt;h3&gt;
  
  
  lightgbm categorical_feature
&lt;/h3&gt;

&lt;p&gt;One of the advantages of using lightgbm is that it can handle categorical features very well. Yes, this algorithm is very powerful, but you have to be careful about how to use its parameters. lightgbm uses a special &lt;strong&gt;&lt;a href="https://lightgbm.readthedocs.io/en/latest/Features.html#optimal-split-for-categorical-features"&gt;integer-encoded&lt;/a&gt;&lt;/strong&gt; method (proposed by &lt;a href="http://www.csiss.org/SPACE/workshops/2004/SAC/files/fisher.pdf"&gt;Fisher&lt;/a&gt;) for handling categorical features.&lt;br&gt;
Experiments show that this method brings better performance than the often-used &lt;strong&gt;one-hot encoding&lt;/strong&gt;.&lt;br&gt;
The default value is "auto", which means: let lightgbm decide, i.e. lightgbm will infer which features are categorical.&lt;br&gt;
It doesn't always work well (some experiments show why &lt;a href="https://www.kaggle.com/mlisovyi/beware-of-categorical-features-in-lgbm"&gt;here&lt;/a&gt; and &lt;a href="https://www.kaggle.com/c/home-credit-default-risk/discussion/58950"&gt;here&lt;/a&gt;), and I highly recommend setting categorical features manually, simply with this code:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;cat_col = dataset_name.select_dtypes('object').columns.tolist()&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;But what happens behind the scenes, and how does lightgbm deal with categorical features?&lt;br&gt;
According to the &lt;a href="https://lightgbm.readthedocs.io/en/latest/Features.html#optimal-split-for-categorical-features"&gt;documentation&lt;/a&gt; of lightgbm, we know that tree learners cannot work well with the one-hot encoding method because they grow deep, unbalanced trees. In the proposed alternative method, splits over categorical features are constructed optimally. For example, for one feature with k different categories there are 2^(k-1) - 1 possible partitions, and with the Fisher method that can improve to &lt;strong&gt;k * log(k)&lt;/strong&gt; by finding the best split on the sorted histogram of values in the categorical feature.&lt;/p&gt;
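&lt;p&gt;That arithmetic is easy to check. The sketch below only illustrates the counts involved; it is not lightgbm's actual implementation:&lt;/p&gt;

```python
import math

def onehot_partitions(k):
    # number of ways to split k categories into two non-empty groups
    return 2 ** (k - 1) - 1

def fisher_scan_cost(k):
    # Fisher's approach scans a sorted histogram of the k categories,
    # which is roughly k * log(k) work instead of exponential
    return k * math.log2(k)

print(onehot_partitions(20))  # 524287 candidate partitions
print(fisher_scan_cost(20))   # roughly 86 -- vastly cheaper
```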
&lt;h3&gt;
  
  
  lightgbm is_unbalance vs scale_pos_weight
&lt;/h3&gt;

&lt;p&gt;One of the problems you may face in binary classification is how to deal with unbalanced datasets. Obviously, you need to balance positive/negative samples, but how exactly can you do that in lightgbm?&lt;br&gt;
There are two parameters in lightgbm that allow you to deal with this issue, &lt;strong&gt;is_unbalance and scale_pos_weight&lt;/strong&gt;, but what is the difference between them and how should you use them?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When you set is_unbalance: True, the algorithm will try to automatically balance the weight of the dominated label (using the pos/neg fraction in the train set)&lt;/li&gt;
&lt;li&gt;If you want to change &lt;strong&gt;scale_pos_weight&lt;/strong&gt; (by default 1, which means both positive and negative labels are assumed equal) in the case of an unbalanced dataset, you can use the following formula (based on this issue on the lightgbm repository) to set it correctly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;scale_pos_weight = number of negative samples / number of positive samples&lt;/strong&gt;&lt;/p&gt;
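&lt;p&gt;As a sketch, computing that ratio from a binary 0/1 label vector looks like this (the function name is mine):&lt;/p&gt;

```python
def suggested_scale_pos_weight(labels):
    # scale_pos_weight = number of negative samples / number of positive samples
    pos = sum(1 for v in labels if v == 1)
    neg = len(labels) - pos
    return neg / pos

print(suggested_scale_pos_weight([0] * 90 + [1] * 10))  # 9.0
```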
&lt;h3&gt;
  
  
  lgbm feval
&lt;/h3&gt;

&lt;p&gt;Sometimes you want to define a custom evaluation function to measure the performance of your model; for that, you need to create a "feval" function.&lt;br&gt;
A &lt;strong&gt;feval function&lt;/strong&gt; should accept two parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;preds&lt;/li&gt;
&lt;li&gt;train_data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;and return&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;eval_name&lt;/li&gt;
&lt;li&gt;eval_result&lt;/li&gt;
&lt;li&gt;is_higher_better&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's create a custom metric function step by step.&lt;br&gt;
First, define a separate python function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;feval_func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;preds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
   &lt;span class="c1"&gt;# Define a formula that evaluates the results
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'feval_func_name'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;eval_result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Use this function as a parameter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Start training...'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;lgb_train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lgb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;(...,&lt;/span&gt; 
                      &lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                      &lt;span class="n"&gt;feval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;feval_func&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: to use a feval function instead of a metric, you should set the metric parameter to "None".&lt;/p&gt;
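&lt;p&gt;For instance, a feval that reports RMSE could look like the sketch below. The name feval_rmse is my own; the only lightgbm API it relies on is train_data.get_label(), which returns the label array of a Dataset:&lt;/p&gt;

```python
import math

def feval_rmse(preds, train_data):
    # train_data.get_label() is the lightgbm Dataset accessor for labels
    labels = train_data.get_label()
    mse = sum((p - y) ** 2 for p, y in zip(preds, labels)) / len(labels)
    # return (eval_name, eval_result, is_higher_better);
    # RMSE is better when lower, hence False
    return 'rmse', math.sqrt(mse), False
```

You would then pass feval=feval_rmse to lgb.train, remembering to set metric to "None".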

&lt;h3&gt;
  
  
  classification params vs regression params
&lt;/h3&gt;

&lt;p&gt;Most of the things I mentioned before are true both for classification and regression, but there are things that need to be adjusted.&lt;br&gt;
Specifically, you should:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LB2IaRhS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/600/1%2A5oqCyXngI_R7Sv45dTz8mg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LB2IaRhS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/600/1%2A5oqCyXngI_R7Sv45dTz8mg.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1&gt;
  
  
  The most important lightgbm parameters
&lt;/h1&gt;

&lt;p&gt;We have reviewed and learned a bit about lightgbm parameters in the previous sections, but no boosted trees article would be complete without mentioning the incredible benchmarks from Laurae 🙂&lt;br&gt;
There you can learn about the best default parameters for many problems, both for lightGBM and XGBoost.&lt;br&gt;
You can check it out here, but some of the most important takeaways are:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cWu7ZYjs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/600/1%2A336N5hnBXXJjgIkEqixiVw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cWu7ZYjs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/600/1%2A336N5hnBXXJjgIkEqixiVw.png" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--owgXkYhK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/600/1%2AUCIwYfqRrFjqNirO8wF6vw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--owgXkYhK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/600/1%2AUCIwYfqRrFjqNirO8wF6vw.png" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--n2aRF2oq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/600/1%2AFiuB05sQADOiyke8w6GhQQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--n2aRF2oq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/600/1%2AFiuB05sQADOiyke8w6GhQQ.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: You should never take any parameter value for granted; adjust it based on your problem. That said, these parameters are a great starting point for your hyperparameter tuning algorithms.&lt;/p&gt;
&lt;h1&gt;
  
  
  Lightgbm parameter tuning example in python (lightgbm tuning)
&lt;/h1&gt;

&lt;p&gt;Finally, after the explanation of all important parameters, it is time to perform some experiments!&lt;/p&gt;

&lt;p&gt;I will use one of the popular Kaggle competitions: &lt;a href="https://www.kaggle.com/c/santander-customer-transaction-prediction/data"&gt;Santander Customer Transaction Prediction&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I will use this article which explains &lt;a href="https://neptune.ai/blog/hyperparameter-tuning-on-any-python-script"&gt;how to run hyperparameter tuning in Python&lt;/a&gt; on any script.&lt;/p&gt;

&lt;p&gt;Worth a read!&lt;/p&gt;

&lt;p&gt;Before we start, one important question! &lt;strong&gt;What parameters should we tune?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pay attention to the problem you want to solve; for instance, the Santander dataset is &lt;strong&gt;highly imbalanced&lt;/strong&gt;, and you should consider that in your tuning! &lt;a href="https://github.com/Laurae2"&gt;Laurae2&lt;/a&gt;, one of the contributors to lightgbm, explained this well here.&lt;/li&gt;
&lt;li&gt;Some parameters are interdependent and must be adjusted together or tuned one by one. For instance, min_data_in_leaf depends on the number of training samples and num_leaves.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: It's a good idea to create two dictionaries of hyperparameters: one containing the parameters and values that you don't want to tune, the other containing the parameters and value ranges that you do want to tune.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;SEARCH_PARAMS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'learning_rate'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                 &lt;span class="s"&gt;'max_depth'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                 &lt;span class="s"&gt;'num_leaves'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                 &lt;span class="s"&gt;'feature_fraction'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                 &lt;span class="s"&gt;'subsample'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;FIXED_PARAMS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'objective'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'binary'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="s"&gt;'metric'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'auc'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="s"&gt;'is_unbalance'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="s"&gt;'boosting'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;'gbdt'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="s"&gt;'num_boost_round'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="s"&gt;'early_stopping_rounds'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;By doing that you keep your baseline values separated from the search space!&lt;br&gt;
Now, here's what we'll do.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First, we generate the code in the &lt;a href="https://ui.neptune.ai/mjbahmani/LightGBM-hyperparameters/experiments?viewId=standard-view&amp;amp;utm_source=devto&amp;amp;utm_medium=crosspost&amp;amp;utm_campaign=blog-lightgbm-parameters-guide"&gt;Notebook&lt;/a&gt;. It is public and you can download it.&lt;/li&gt;
&lt;li&gt;Second, we track the result of each experiment on &lt;a href="https://ui.neptune.ai/mjbahmani/LightGBM-hyperparameters/e/LGB-7/logs?utm_source=devto&amp;amp;utm_medium=crosspost&amp;amp;utm_campaign=blog-lightgbm-parameters-guide"&gt;Neptune.ai&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4TPihQb7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/600/1%2AkzhhCnNtRMhzMmZa3N_jCA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4TPihQb7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/600/1%2AkzhhCnNtRMhzMmZa3N_jCA.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  analysis of results
&lt;/h3&gt;

&lt;p&gt;If you have checked the previous section, you've noticed that I've done more than 14 different experiments on the dataset. Here I explain how to tune the values of the hyperparameters step by step.&lt;br&gt;
Create the baseline training code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;roc_auc_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;roc_curve&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;neptunecontrib.monitoring.skopt&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sk_utils&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;lightgbm&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;lgb&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;neptune&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;skopt&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;sys&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;os&lt;/span&gt;

&lt;span class="n"&gt;SEARCH_PARAMS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'learning_rate'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s"&gt;'max_depth'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s"&gt;'num_leaves'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s"&gt;'feature_fraction'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s"&gt;'subsample'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;FIXED_PARAMS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'objective'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'binary'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
             &lt;span class="s"&gt;'metric'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'auc'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
             &lt;span class="s"&gt;'is_unbalance'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
             &lt;span class="s"&gt;'bagging_freq'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
             &lt;span class="s"&gt;'boosting'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;'dart'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
             &lt;span class="s"&gt;'num_boost_round'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
             &lt;span class="s"&gt;'early_stopping_rounds'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;train_evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;search_params&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
   &lt;span class="c1"&gt;# you can download the dataset from this link(https://www.kaggle.com/c/santander-customer-transaction-prediction/data)
&lt;/span&gt;   &lt;span class="c1"&gt;# import Dataset to play with it
&lt;/span&gt;   &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"sample_train.csv"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s"&gt;'ID_code'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'target'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'target'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
   &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_valid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_valid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1234&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="n"&gt;train_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lgb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="n"&gt;valid_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lgb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_valid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;y_valid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reference&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;train_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

   &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'metric'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;FIXED_PARAMS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'metric'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
             &lt;span class="s"&gt;'objective'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;FIXED_PARAMS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'objective'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
             &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;search_params&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

   &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lgb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                     
                     &lt;span class="n"&gt;valid_sets&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;valid_data&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                     &lt;span class="n"&gt;num_boost_round&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;FIXED_PARAMS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'num_boost_round'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                     &lt;span class="n"&gt;early_stopping_rounds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;FIXED_PARAMS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'early_stopping_rounds'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                     &lt;span class="n"&gt;valid_names&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'valid'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
   &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_score&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'valid'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s"&gt;'auc'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
   &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Use the hyperparameter optimization library of your choice (for example, scikit-optimize):&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;neptune&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'mjbahmani/LightGBM-hyperparameters'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;neptune&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create_experiment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'lgb-tuning_final'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;upload_source_files&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'*.*'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                              &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'lgb-tuning'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'dart'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SEARCH_PARAMS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;SPACE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
   &lt;span class="n"&gt;skopt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;space&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Real&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'learning_rate'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prior&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'log-uniform'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
   &lt;span class="n"&gt;skopt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;space&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Integer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'max_depth'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
   &lt;span class="n"&gt;skopt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;space&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Integer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'num_leaves'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
   &lt;span class="n"&gt;skopt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;space&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Real&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'feature_fraction'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prior&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'uniform'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
   &lt;span class="n"&gt;skopt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;space&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Real&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'subsample'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prior&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'uniform'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;skopt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;use_named_args&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SPACE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;objective&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
   &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;train_evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;monitor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sk_utils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NeptuneMonitor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;skopt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;forest_minimize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;objective&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SPACE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                                &lt;span class="n"&gt;n_calls&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_random_starts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                                &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;monitor&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;sk_utils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_results&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;neptune&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
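&lt;p&gt;Once &lt;em&gt;forest_minimize&lt;/em&gt; returns, the best score and parameters can be read back from the result object. A minimal helper sketch (the &lt;em&gt;best_from_results&lt;/em&gt; name is mine, not from the original post; it assumes skopt's standard result layout, where &lt;em&gt;results.fun&lt;/em&gt; is the minimized value and &lt;em&gt;results.x&lt;/em&gt; the best point, and that each dimension in SPACE carries a &lt;em&gt;.name&lt;/em&gt;):&lt;/p&gt;

```python
# Sketch: pair skopt's best point back with the search-space names.
def best_from_results(results, space):
    # The objective above returns -AUC (skopt minimizes), so negate back.
    best_score = -results.fun
    best_params = {dim.name: value for dim, value in zip(space, results.x)}
    return best_score, best_params

# Usage, continuing the snippet above:
# best_auc, best_params = best_from_results(results, SPACE)
```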



&lt;p&gt;Try different configurations and track your results in &lt;a href="https://ui.neptune.ai/mjbahmani/LightGBM-hyperparameters/experiments?viewId=standard-view&amp;amp;utm_source=devto&amp;amp;utm_medium=crosspost&amp;amp;utm_campaign=blog-lightgbm-parameters-guide"&gt;Neptune&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8mzZuL9j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/600/1%2APesKsNfCJS62g_em1tHpKw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8mzZuL9j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/600/1%2APesKsNfCJS62g_em1tHpKw.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally, the following table shows how the parameter values changed during tuning.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--H5K1LeOZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/600/1%2AGy7-P7XDNaOB5zEIHKY3Iw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--H5K1LeOZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/600/1%2AGy7-P7XDNaOB5zEIHKY3Iw.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Final Thoughts
&lt;/h1&gt;

&lt;p&gt;Long story short, you learned:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what the main LightGBM parameters are,&lt;/li&gt;
&lt;li&gt;how to create custom metrics with the feval function,&lt;/li&gt;
&lt;li&gt;what good default values are for the major parameters,&lt;/li&gt;
&lt;li&gt;and saw an example of how to tune LightGBM parameters to improve model performance.&lt;/li&gt;
&lt;/ul&gt;
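&lt;p&gt;The custom-metric bullet above can be sketched as follows. This is an illustrative example (the &lt;em&gt;mae_feval&lt;/em&gt; name and the toy MAE metric are mine, not from the original post): a feval callable receives the raw predictions and the Dataset, and must return a &lt;em&gt;(name, value, is_higher_better)&lt;/em&gt; tuple.&lt;/p&gt;

```python
import numpy as np

# Illustrative custom metric for LightGBM's `feval` hook: the callable
# gets (preds, dataset) and returns (name, value, is_higher_better).
def mae_feval(preds, dataset):
    labels = dataset.get_label()
    return 'custom_mae', float(np.mean(np.abs(labels - preds))), False

# It would then be passed to training as, e.g.:
# model = lgb.train(params, train_data, valid_sets=[valid_data], feval=mae_feval)
```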

&lt;p&gt;And some other things 🙂 For more detailed information, please refer to the resources.&lt;/p&gt;

&lt;h1&gt;
  
  
  Resources
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;Laurae's extensive guide with good defaults, etc.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/microsoft/LightGBM/tree/master/python-package"&gt;https://github.com/microsoft/LightGBM/tree/master/python-package&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://lightgbm.readthedocs.io/en/latest/index.html"&gt;https://lightgbm.readthedocs.io/en/latest/index.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree.pdf"&gt;https://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree.pdf&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://statweb.stanford.edu/%7Ejhf/ftp/trebst.pdf"&gt;https://statweb.stanford.edu/~jhf/ftp/trebst.pdf&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;This article was originally written by &lt;a href="https://www.linkedin.com/in/mjbahmani/"&gt;MJ Bahmani&lt;/a&gt; and posted on the &lt;a href="https://neptune.ai/blog/lightgbm-parameters-guide?utm_source=devto&amp;amp;utm_medium=crosspost&amp;amp;utm_campaign=blog-lightgbm-parameters-guide"&gt;Neptune blog&lt;/a&gt;. You can find there more in-depth articles for machine learning practitioners.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>python</category>
    </item>
  </channel>
</rss>
