Classification, along with regression, is one of the two main tasks of supervised learning in Machine Learning. It involves associating each piece of data with a label from a set of possible labels (or categories). In the simplest cases, there are only two categories, known as binary or binomial classification. Otherwise, it is multi-class classification, also called multiclass classification.
The categories must be determined before any form of learning. Additionally, the data used for learning must all receive a label, allowing the expected response to be known: this is supervised learning. While in the majority of cases each piece of data will be associated with only one class, there are some particular cases, such as:
- Multilabel classification: each piece of data can be associated with multiple classes.
- Object detection in images: this involves not determining the class of an image as a whole but recognizing the different objects present and their respective positions.
- Image segmentation: a specific case of detection, it involves indicating for each pixel to which class it belongs. Pixels of the same class are generally associated with the same color.
Practical Examples
Although the most common example in the literature is determining whether an image contains a cat or a dog, classification is used in many more realistic and useful everyday domains:
In industry: it allows determining whether a product has defects or if a part needs replacement (predictive maintenance).
On e-commerce sites: it can automatically associate a category with a product based on its description or determine if a product is fraudulent.
In security: it can determine if there is fraud, if an email is spam, or if a site is potentially dangerous.
With connected objects: it can determine if the monitored element is in a normal state or not, and therefore if intervention is needed or if there is a risk of future problems, such as a pending avalanche.
Many problems can thus be reduced to a classification problem.
Specific Data Preparation
The classification task does not impose constraints on the explanatory variables, although some algorithms may have additional requirements. However, the target variable, which corresponds to the class to be predicted, must necessarily be a categorical variable (ordinal or nominal). Additionally, in practice, the number of classes must remain small compared to the number of examples. Indeed, since classification is based on statistics on existing data, it is important to have many examples of each class.
There is no rule to determine in advance the ideal number of cases per class. This will depend heavily on the algorithm, the proximity of data within the same class, and the distance between data associated with different classes.
Top comments (0)