DEV Community

daud99

General confusion related to Feature Selection

Should I do Feature Selection on the entire dataset?

The answer is NO.

The reason is that doing so introduces bias and data leakage. The test data must remain completely unseen; it exists only to assess the performance of our machine learning model. If we perform feature selection on the entire dataset, this no longer holds, because the test samples have already influenced which features were chosen.

The model then has an unfair advantage, because the features were selected using all the samples, including the ones it is later evaluated on.
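
To make the leakage concrete, here is a minimal sketch on purely random data, where no feature carries any real signal. The dataset sizes, the choice of `f_classif`, `k=20` and `LogisticRegression` are all assumptions picked for illustration, not something taken from the references.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 50 samples, 5000 pure-noise features,
# and balanced labels that have nothing to do with the features.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5000))
y = np.array([0, 1] * 25)

# Leaky: features are picked using ALL samples, then cross-validated.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_score = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y, cv=5
).mean()

# Honest: the selector is part of the pipeline, so it is refit
# on the training portion of every fold.
pipe = make_pipeline(
    SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000)
)
honest_score = cross_val_score(pipe, X, y, cv=5).mean()

print("leaky CV accuracy: ", leaky_score)
print("honest CV accuracy:", honest_score)
```

Since the labels are unrelated to the features, an honest estimate should hover around chance, while the leaky one tends to look far better than it deserves.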

When should we do the feature selection?

  1. First, split your data into training and test sets.
  2. Then, perform feature selection on the training data only.
  3. Once feature selection is done on the training data, train your model.
  4. Finally, select the same features from the test data and make predictions.
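
The four steps above could look roughly like this with scikit-learn. This is only a sketch: the synthetic dataset from `make_regression`, the `LinearRegression` model and `k=3` are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical data: 200 samples, 10 features, 3 of them informative.
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0
)

# 1. Split first.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 2. Fit the selector on the training data only.
selector = SelectKBest(f_regression, k=3).fit(X_train, y_train)
X_train_sel = selector.transform(X_train)

# 3. Train the model on the selected training features.
model = LinearRegression().fit(X_train_sel, y_train)

# 4. Reuse the already-fitted selector to pick the same columns from the test data.
X_test_sel = selector.transform(X_test)
test_r2 = model.score(X_test_sel, y_test)
print("Test R^2:", test_r2)
```

Note that the selector is never fitted on `X_test`; it is only used to transform it.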

How is feature selection affected when K-Fold Cross-Validation is used?

The order remains the same: split first, then do the feature selection. With K-Fold, that means the selection is repeated inside every fold, using only that fold's training portion.

"CV methods are proven to be unbiased only if all the various aspects of classifier training takes place inside the CV loop. This means that all aspects of training a classifier e.g. feature selection, classifier type selection and classifier parameter tuning takes place on the data not left out during each CV loop. It has been shown that violating this principle in some ways can result in very biased estimates of the true error. "

The right way to Cross Validate with feature selection

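import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# x (2-D NumPy feature array) and y (1-D target array) are assumed to be loaded already.
clf = LinearRegression()  # assumption: the original snippet does not define clf
scores = []

for train, test in KFold(n_splits=5).split(x):
    xtrain, xtest, ytrain, ytest = x[train], x[test], y[train], y[test]

    # Fit the selector on the training fold only
    b = SelectKBest(f_regression, k=2)
    b.fit(xtrain, ytrain)
    # Keep the same selected columns in the training and test folds
    xtrain = xtrain[:, b.get_support()]
    xtest = xtest[:, b.get_support()]

    clf.fit(xtrain, ytrain)
    scores.append(clf.score(xtest, ytest))

    # Plot predictions against observed values for this fold
    yp = clf.predict(xtest)
    plt.plot(yp, ytest, 'o')
    plt.plot(ytest, ytest, 'r-')

plt.xlabel("Predicted")
plt.ylabel("Observed")

print("CV Score is ", np.mean(scores))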
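
The same idea can also be written more compactly by putting the selector and the model into a scikit-learn `Pipeline` and letting `cross_val_score` run the loop. This is a sketch of an alternative, not the method used in the references; the synthetic data, `LinearRegression` and the shuffled `KFold` are assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline

# Hypothetical data: 100 samples, 20 features, 2 of them informative.
X, y = make_regression(
    n_samples=100, n_features=20, n_informative=2, noise=10.0, random_state=0
)

# The selector is a pipeline step, so it is refit on the
# training portion of each fold and never sees the held-out fold.
pipe = Pipeline([
    ("select", SelectKBest(f_regression, k=2)),
    ("model", LinearRegression()),
])

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv)
print("CV score is", scores.mean())
```

The advantage of this form is that it is hard to leak by accident: anything that learns from the data lives inside the pipeline.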

Should I do Feature encoding, such as One-hot or Ordinal encoding, before or after Feature Selection?

Feature encoding should be done before feature selection. The intuition is that the model will consume the encoded features, so we should measure each feature's importance in the same form in which the model will actually use it.

References

  1. https://stackoverflow.com/questions/56308116/should-feature-selection-be-done-before-train-test-split-or-after
  2. https://www.nodalpoint.com/not-perform-feature-selection/
  3. https://nbviewer.org/github/cs109/content/blob/master/lec_10_cross_val.ipynb
  4. https://stats.stackexchange.com/questions/64825/should-feature-selection-be-performed-only-on-training-data-or-all-data
  5. https://followthedata.wordpress.com/2013/10/30/the-importance-of-proper-cross-validation-and-experimental-design/
  6. https://datascience.stackexchange.com/questions/95071/should-i-do-one-hot-encoding-before-feature-selection-and-how-should-i-perform-f
  7. https://stats.stackexchange.com/questions/440372/feature-selection-before-or-after-encoding
