DEV Community

Carlos Almonte

Feature Selection

Sometimes there is a lot of 'noise' in the data. By noise I mean columns that are not relevant to the target variable. There are methods that estimate the impact of each column on the outcome variable, so that you can keep only the columns whose impact is high enough.
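As a rough illustration of scoring each column on its own, here is a minimal sketch that uses the absolute Pearson correlation with the target as the "impact" measure. Correlation is just one possible choice, and the helper names (`pearson`, `select_by_correlation`), the 0.5 threshold, and the toy values are all made up for this example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)


def select_by_correlation(columns, target, threshold=0.5):
    """Keep the columns whose |correlation| with the target meets the threshold."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    selected = [name for name, score in scores.items() if score >= threshold]
    return selected, scores


# Hypothetical toy data: 1 means the user posted, 0 means they did not.
columns = {
    "amount-of-posts-seen-before": [1, 2, 3, 4, 5, 6],
    "time-of-day": [9, 21, 9, 21, 9, 21],
}
posted = [0, 0, 0, 1, 1, 1]

selected, scores = select_by_correlation(columns, posted)
print(selected)
```

With these toy values only "amount-of-posts-seen-before" clears the threshold. Keep in mind that a per-column score like this looks at one column at a time, so it cannot tell you how columns behave together.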

There are also methods that iterate through the columns and identify not only which individual columns are relevant but also which combinations of columns affect the target variable the most. For example, suppose the question is "what is the likelihood of a user posting on social media?" and the columns are "amount-of-posts-seen-before", "internet-connection-quality", "followings-activity", and "time-of-day". Let's say "amount-of-posts-seen-before", "internet-connection-quality", and "followings-activity" are determined to be the relevant columns. However, it could be that the subset of "amount-of-posts-seen-before" and "internet-connection-quality" alone has more impact on the target variable than all of the relevant columns together. In that case, after feature selection, the data that is kept would be the columns "amount-of-posts-seen-before" and "internet-connection-quality", along with the target variable.
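To make the idea of comparing combinations concrete, here is a small sketch of an exhaustive subset search. It scores every combination of columns with leave-one-out accuracy of a 1-nearest-neighbour classifier and keeps the best one. The scoring method, the function names, and the toy values are assumptions chosen for illustration (the data is constructed so that the outcome depends on two columns jointly), not a specific method from any library.

```python
from itertools import combinations


def loo_1nn_accuracy(rows, labels, subset):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier
    that only looks at the columns in `subset`."""
    correct = 0
    for i, row in enumerate(rows):
        # Nearest *other* row by squared Euclidean distance (ties: lowest index).
        nearest = min(
            (j for j in range(len(rows)) if j != i),
            key=lambda j: sum((row[c] - rows[j][c]) ** 2 for c in subset),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(rows)


def best_subset(rows, labels, feature_names):
    """Try every non-empty combination of columns and return the best-scoring one.
    On ties, the smaller (earlier) subset wins."""
    best, best_score = (), -1.0
    for size in range(1, len(feature_names) + 1):
        for subset in combinations(feature_names, size):
            score = loo_1nn_accuracy(rows, labels, subset)
            if score > best_score:
                best, best_score = subset, score
    return best, best_score


# Hypothetical toy data; the values are invented for this example.
names = [
    "amount-of-posts-seen-before",
    "internet-connection-quality",
    "followings-activity",
]
values = [
    (0, 0, 0), (0, 0, 5), (0, 1, 5), (0, 1, 0),
    (1, 0, 0), (1, 0, 5), (1, 1, 5), (1, 1, 0),
]
labels = [0, 0, 1, 1, 1, 1, 0, 0]  # 1 means the user posted
rows = [dict(zip(names, v)) for v in values]

subset, score = best_subset(rows, labels, names)
print(subset, score)
```

On this toy data the pair "amount-of-posts-seen-before" and "internet-connection-quality" scores higher than all three columns together. Trying every combination gets expensive quickly as the number of columns grows, so this brute-force approach only suits small column counts.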
