What is Scikit-Learn?

Introduction

Here is a brief description of the research paper “Scikit-learn: Machine Learning in Python”, published in the “Journal of Machine Learning Research”. The paper explains the brilliance of Scikit-learn or, as it is more commonly known, Sklearn. It starts with what Scikit-learn is, then covers the vision of the project, how it was compiled, and some tradeoffs that come with its ease of use. I will start at the beginning and walk you through these topics.

What is it?

Sklearn is a set of tools for the Python programming language that perform “supervised” and “unsupervised” machine learning. The difference between these two types of machine learning is simply whether or not the data is labeled: supervised learning works with labeled data and unsupervised learning with unlabeled data. Normally, performing machine learning tasks would require specialized computer science knowledge, and that is where the Scikit-learn library comes in handy: it was designed for ease of use and to be accessible to anyone who wants to use it, in response to the growing need for statistical data analysis. To achieve this goal Sklearn does several things. First, it is distributed under a BSD (Berkeley Software Distribution) license, a free software license that imposes minimal restrictions on the use and distribution of the software. Second, it incorporates compiled code for efficiency, so the performance-critical parts run as machine code instead of being interpreted line by line in Python. Third, it depends on only two Python packages, NumPy and SciPy, to facilitate easy distribution, whereas similar toolboxes depend on many more. Fourth, it uses imperative programming, in which statements change a program's state, instead of data-flow programming, in which operations are arranged as a graph that the data has to flow through. Lastly, in the background it uses bindings to two C++ libraries, LibSVM and LibLinear, which provide reference implementations of support vector machines (classification models that, in their basic form, separate two groups) and generalized linear models.
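
To make the labeled versus unlabeled distinction concrete, here is a minimal sketch. The toy data and the choice of LogisticRegression and KMeans are my own assumptions for illustration, not examples taken from the paper, and the import paths are the ones used by recent Scikit-learn releases.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Made-up toy data: two well-separated groups of 2-D points.
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.3],
              [5.0, 5.1], [5.2, 4.9], [4.9, 5.3]])
y = np.array([0, 0, 0, 1, 1, 1])  # labels, used only by the supervised model

# Supervised: the estimator is given both the data and the labels.
clf = LogisticRegression()
clf.fit(X, y)
supervised_pred = clf.predict(np.array([[0.1, 0.1], [5.1, 5.0]]))

# Unsupervised: the estimator is given only the data and groups it on its own.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
km.fit(X)
cluster_ids = km.labels_  # cluster numbers are arbitrary, not the labels in y
```

Note that the only difference in how the two models are called is whether y is passed to fit.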

Vision of the Project

The primary vision for Sklearn was not to have the most features but to provide solid implementations, with consistent naming for functions and parameters that follows Python coding guidelines and NumPy-style documentation. To keep it easy to use, Sklearn tries to keep the number of distinct objects to a minimum and avoids framework code, so that it can be adapted to people's needs. In the same spirit of adaptability, development is based on community tools such as GitHub and public mailing lists, which encourages the community to contribute. The project also provides a ∼300 page user guide, class references, a tutorial, installation instructions, and more than 60 examples, some featuring real-world applications, while trying to minimize machine learning jargon and still maintain precision about the algorithms it uses.

How Is It Compiled?

Objects in Scikit-learn are specified by their interface rather than by inheritance, which makes them easy to combine. The two most important objects are the estimator, which is the central object, and the cross-validation iterator. The estimator implements a fit method that accepts an array of input data and, optionally, an array of labels for supervised learning. A supervised estimator can also implement a predict method, which returns predicted values for the data passed in. Some estimators, called transformers, also have a transform method that returns a modified version of the input data. The other main object, the cross-validation iterator, produces pairs of training and testing indices that split the input data, so that the model the estimator fits can be validated on data it was not trained on.
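
The fit, predict and transform methods and the cross-validation iterator can be sketched as follows. This is only an illustration under my own assumptions: the random data, the choice of PCA, SVC and KFold, and the import paths (which are those of recent Scikit-learn releases and may differ from the version described in the paper).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import KFold
from sklearn.svm import SVC

# Made-up data: 40 samples with 4 features; the label depends on feature 0.
rng = np.random.RandomState(0)
X = rng.randn(40, 4)
y = (X[:, 0] > 0).astype(int)

# A transformer: fit learns from the data, transform returns modified data.
pca = PCA(n_components=2)
pca.fit(X)
X_reduced = pca.transform(X)  # 4 features reduced to 2

# A cross-validation iterator: yields pairs of train and test indices.
cv = KFold(n_splits=4, shuffle=True, random_state=0)
fold_scores = []
for train_idx, test_idx in cv.split(X):
    # A supervised estimator: fit takes data and labels, predict returns labels.
    model = SVC(kernel="linear")
    model.fit(X[train_idx], y[train_idx])
    predictions = model.predict(X[test_idx])
    fold_scores.append(np.mean(predictions == y[test_idx]))
```

Because every estimator exposes the same methods, SVC could be swapped for another classifier without changing the loop.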

Tradeoffs

Due to some of its design choices, Sklearn gives up some speed on very large data sets. One example is the Elastic Net, a regularized regression method. Sklearn's coordinate descent implementation gives great results on small to medium amounts of data, performing on par with highly optimized alternatives, but its performance on very large problems is limited because it does not use the Karush–Kuhn–Tucker conditions to define an active set, a shortcut that would let it skip much of the work. The k-nearest-neighbors classifier likewise does well on small and medium data sets: it builds a ball tree of the samples to speed up the search, but it switches to a brute-force search when the data has many dimensions, where that is the more efficient option. Other models involve similar tradeoffs between ease of use and processing time. Overall, Scikit-learn is an excellent tool for small or medium data sets, but if you need to process huge datasets it is probably best to use a more specialized set of tools.
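
As a rough illustration of the k-nearest-neighbors point, the sketch below fits the same classifier with the two search strategies mentioned above. The random data and parameter values are my own assumptions. Both strategies are exact searches, so on this toy data they are expected to return the same predictions, and the choice mainly affects running time.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up data: 200 training points and 50 query points in 3 dimensions.
rng = np.random.RandomState(0)
X_train = rng.rand(200, 3)
y_train = (X_train.sum(axis=1) > 1.5).astype(int)
X_query = rng.rand(50, 3)

# Fit the same model once per neighbor-search strategy.
predictions = {}
for algorithm in ("brute", "ball_tree"):
    knn = KNeighborsClassifier(n_neighbors=5, algorithm=algorithm)
    knn.fit(X_train, y_train)
    predictions[algorithm] = knn.predict(X_query)

# Compare the answers of the two strategies.
same_answers = np.array_equal(predictions["brute"], predictions["ball_tree"])
```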

Conclusion

For most people trying to use machine learning in Python, Scikit-learn is the perfect tool, but if you are trying to process very large amounts of data you may want to find a specialized toolkit. The greatness of Sklearn comes from its task-oriented interface, which makes it easy to compare different methods of machine learning, and from the wide variety of learning algorithms, both supervised and unsupervised, that it provides.
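
Here is a small sketch of what comparing methods can look like in practice. The dataset (the iris data bundled with Scikit-learn), the three models and their settings are my own picks for illustration, and the import paths are those of recent releases.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Three different learning methods behind the same interface.
candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=5),
    "support vector machine": SVC(kernel="rbf"),
}

# Score each one with 5-fold cross-validation and keep the mean accuracy.
mean_scores = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X, y, cv=5)
    mean_scores[name] = scores.mean()

for name, score in mean_scores.items():
    print(f"{name}: {score:.3f}")
```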

Reference

Pedregosa, Fabian, et al. “Scikit-Learn: Machine Learning in Python.” Journal of Machine Learning Research 12, Oct. 2011, www.jmlr.org/papers/volume12/pedregosa11a/pedregosa11a.pdf.