General confusion related to Feature Selection

#featureselection #machinelearning #python #scikitlearn

Should I do Feature Selection on the entire dataset?

The answer is NO.

The reason is that it introduces bias and data leakage. The TEST data must stay completely unseen and be used only to assess the performance of the machine learning model. If we perform Feature Selection on the entire dataset, that no longer holds.

The model gains an unfair advantage because the features were selected using all the samples, including the test samples.

When should we do the feature selection?

First, split your data into Train and Test data.
Then, do the feature selection on the Training data only.
Once the features are selected on the Training data, train your model on them.
Finally, select the same features from the Testing data and perform the prediction.
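The four steps above can be sketched roughly as follows. This is a minimal illustration, not a definitive recipe: the synthetic dataset from make_regression, the choice of LinearRegression, and k=2 are all assumptions made for the example.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Assumption: synthetic data standing in for a real dataset.
X, y = make_regression(n_samples=200, n_features=10, n_informative=2,
                       noise=5.0, random_state=0)

# Step 1: split first, so the test rows never influence the selection.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Step 2: fit the selector on the training data only.
selector = SelectKBest(f_regression, k=2)
X_train_sel = selector.fit_transform(X_train, y_train)

# Step 3: train the model on the selected training features.
model = LinearRegression().fit(X_train_sel, y_train)

# Step 4: keep the same columns in the test data, then predict and score.
X_test_sel = selector.transform(X_test)
predictions = model.predict(X_test_sel)
test_r2 = model.score(X_test_sel, y_test)
print("Test R^2:", test_r2)
```

Note that the selector is only ever fitted once, on the training split; the test split is passed through transform, never fit.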

How is feature selection affected when K Fold Cross Validation is used?

The order remains the same: first split, then do the Feature Selection, and repeat this inside every fold.

"CV methods are proven to be unbiased only if all the various aspects of classifier training takes place inside the CV loop. This means that all aspects of training a classifier e.g. feature selection, classifier type selection and classifier parameter tuning takes place on the data not left out during each CV loop. It has been shown that violating this principle in some ways can result in very biased estimates of the true error. "
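To get a feel for how large this bias can be, here is a small experiment on pure noise, where no model should beat chance. It is only a sketch under stated assumptions: the data is random, the sizes (60 samples, 5000 features, k=20) are arbitrary, and LogisticRegression is just a convenient classifier. The "leaky" variant selects features on all the data before cross validating; the "honest" variant puts the selector inside a Pipeline so it is refitted on each training fold.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Assumption: random features and labels that are unrelated to them.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5000))
y = np.repeat([0, 1], 30)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Leaky: the selector sees every sample, including future test folds.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_score = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y, cv=cv).mean()

# Honest: the selector is part of the model and is refitted per fold.
honest_model = make_pipeline(
    SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000))
honest_score = cross_val_score(honest_model, X, y, cv=cv).mean()

print("Leaky CV accuracy: ", leaky_score)
print("Honest CV accuracy:", honest_score)
```

Since the labels carry no signal, the honest estimate should hover around chance level, while the leaky one will typically look much better than it has any right to.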

The right way to Cross Validate with feature selection

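import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import KFold

# x and y are NumPy arrays and clf is any scikit-learn regressor,
# all assumed to be defined earlier.
scores = []

for train, test in KFold(n_splits=5).split(x):
    xtrain, xtest, ytrain, ytest = x[train], x[test], y[train], y[test]

    # Select the features using the training fold only.
    b = SelectKBest(f_regression, k=2)
    b.fit(xtrain, ytrain)
    xtrain = xtrain[:, b.get_support()]
    xtest = xtest[:, b.get_support()]

    # Train on the selected features and score on the held-out fold.
    clf.fit(xtrain, ytrain)
    scores.append(clf.score(xtest, ytest))

    # Plot predictions against observations for this fold.
    yp = clf.predict(xtest)
    plt.plot(yp, ytest, 'o')
    plt.plot(ytest, ytest, 'r-')

plt.xlabel("Predicted")
plt.ylabel("Observed")

print("CV Score is ", np.mean(scores))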
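The same per-fold pattern can also be written more compactly by chaining the selector and the model in a scikit-learn Pipeline and handing it to cross_val_score, which refits the whole chain on each training fold. The sketch below assumes synthetic data from make_regression and a LinearRegression model; neither comes from the loop above.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Assumption: synthetic data standing in for x and y.
X, y = make_regression(n_samples=200, n_features=10, n_informative=2,
                       noise=5.0, random_state=0)

# Selection and regression form one estimator, so feature selection
# is repeated inside every CV fold automatically.
pipe = make_pipeline(SelectKBest(f_regression, k=2), LinearRegression())

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv)

print("CV Score is ", scores.mean())
```

The practical benefit is that it becomes hard to leak by accident: there is no separate selection step that could be run on the full dataset.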

Should I do Feature encoding, such as One hot or Ordinal encoding, before or after the Feature Selection?

Do the Feature encoding before the Feature selection. The intuition is that the model consumes the encoded features, so their importance should be measured in the same form in which the model will use them.
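As a rough illustration of encoding first and selecting second, the sketch below one-hot encodes two made-up categorical columns and then runs SelectKBest on the encoded columns. The column names ("color", "size"), the toy data, the chi2 score function, and k=2 are all hypothetical choices for this example. Both the encoder and the selector are fitted on the training split only.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Assumption: toy categorical data; the target depends only on color.
rng = np.random.default_rng(0)
color = rng.choice(["red", "green", "blue"], size=200)
size = rng.choice(["S", "M", "L"], size=200)
X = np.column_stack([color, size])
y = (color == "red").astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Encode first, fitting the encoder on the training data only.
encoder = OneHotEncoder(handle_unknown="ignore")
X_train_enc = encoder.fit_transform(X_train)
X_test_enc = encoder.transform(X_test)

# Then select among the encoded columns, again using training data only.
selector = SelectKBest(chi2, k=2)
X_train_sel = selector.fit_transform(X_train_enc, y_train)
X_test_sel = selector.transform(X_test_enc)

# Readable names for the encoded columns, in the encoder's output order.
encoded_names = [f"{col}={cat}"
                 for col, cats in zip(["color", "size"], encoder.categories_)
                 for cat in cats]
selected_names = [name for name, keep
                  in zip(encoded_names, selector.get_support()) if keep]
print("Selected encoded features:", selected_names)
```

Selecting after encoding means each one-hot column is scored individually, which is the form the model will actually see.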
