Abdul Rehman

Small Datasets and K-Fold Cross-Validation Code in Python

Machine learning sounds interesting, but as a beginner it can seem really hard to dive in. There are tons of tools and libraries to get started with, such as TensorFlow, PyTorch, or the older but powerful Python library scikit-learn.

If you are an embedded systems developer like me, you may feel that most tutorials on the internet assume we all have a supercomputer and can train whatever we want. Google Colab makes life a bit easier, but being able to work at our own pace is still something many of us crave.

I want to start playing with machine learning algorithms without preparing datasets with tons of images and then waiting weeks for a model to train. So a small dataset is always my first choice for rapidly prototyping a proof of concept.

That said, learning from small datasets is still an active research topic that is discussed in many research papers. Today we are going to talk about a few machine learning models that are well suited to such problems.

  1. Naive Bayes Classifier
  2. Random forest or Decision trees
  3. KNN Classifier

But because the dataset is limited, even the models mentioned above can quickly start to overfit, so make sure to cross-validate them.
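As a rough illustration, the three model families above can be cross-validated side by side. This is only a sketch: the Iris dataset (150 samples) stands in for "a small dataset", and the hyperparameters (`n_estimators=50`, `n_neighbors=5`) are arbitrary choices made for this example, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Assumption: Iris is used as a stand-in small dataset (150 rows, 4 features).
X, y = load_iris(return_X_y=True)

# Assumption: hyperparameters are arbitrary illustrative values.
models = {
    "Naive Bayes": GaussianNB(),
    "Random forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

mean_scores = {}
for name, model in models.items():
    # With an integer cv and a classifier, scikit-learn uses stratified k-fold.
    scores = cross_val_score(model, X, y, cv=5)
    mean_scores[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```

Looking at the spread across folds, and not only the mean, gives a feel for how much a score depends on which samples happened to land in the test fold.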

Cross Validation

Cross-validation is primarily used in applied machine learning to evaluate how well a model performs on unseen data. In other words, it estimates the model's general performance when making predictions on data that was not used during training. It is also frequently used to compare and select a model for a given predictive modeling problem.
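To see why scoring on held-out data matters, here is a minimal sketch that compares accuracy on the training folds with accuracy on the held-out folds. The dataset (Iris) and the model (an unpruned decision tree) are assumptions picked for the example, because such a tree tends to memorize its training data.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Assumption: Iris is used as a stand-in small dataset.
X, y = load_iris(return_X_y=True)

# An unpruned decision tree can fit its training data almost perfectly.
tree = DecisionTreeClassifier(random_state=0)

# return_train_score=True also reports the score on each training split.
results = cross_validate(tree, X, y, cv=5, return_train_score=True)

train_acc = results["train_score"].mean()
test_acc = results["test_score"].mean()
print(f"Mean accuracy on training folds: {train_acc:.3f}")
print(f"Mean accuracy on held-out folds: {test_acc:.3f}")
```

The gap between the two numbers is a quick, informal indicator of overfitting; the held-out figure is the one to trust as an estimate of real-world performance.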

K-Fold Cross Validation

The k-fold algorithm is simple to understand and implement. Here is a Python implementation of k-fold cross-validation using the scikit-learn library.

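from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Small synthetic regression dataset (a placeholder; substitute your own X and y)
X, y = make_regression(n_samples=50, n_features=3, noise=10.0, random_state=0)

# Create a Linear Regression model
model = LinearRegression()

# Create the k-fold cross validation object; shuffle so folds do not depend on row order
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Loop through each split of the data
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Fit the model on the training data
    model.fit(X_train, y_train)

    # Evaluate the model on the test data (R^2 for regressors)
    score = model.score(X_test, y_test)
    print("Fold score: ", score)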

This code uses k-fold cross-validation with k=5, meaning it divides the data into 5 subsets. The linear regression model is trained on four subsets and tested on the remaining one, and the process is repeated so that each subset is used for testing exactly once. For a regressor such as LinearRegression, the score variable holds the coefficient of determination (R²) on the test fold, not a classification accuracy.

You can change the value of n_splits to set the number of folds you want to use. You can also replace the model (LinearRegression) with any other scikit-learn estimator.
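As a sketch of what swapping the model and the fold count can look like, the snippet below wraps the loop in a small helper. The helper name `kfold_r2`, the synthetic data from `make_regression`, and the choice of `Ridge` as the alternative model are all assumptions made for this example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score

# Assumption: synthetic data, used only for illustration.
X, y = make_regression(n_samples=60, n_features=3, noise=10.0, random_state=0)


def kfold_r2(model, X, y, n_splits):
    """Hypothetical helper: per-fold R^2 scores using shuffled k-fold CV."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    # For regressors, cross_val_score defaults to the estimator's R^2 score.
    return cross_val_score(model, X, y, cv=kf)


results = {}
for n_splits in (3, 5, 10):
    for model in (LinearRegression(), Ridge(alpha=1.0)):
        scores = kfold_r2(model, X, y, n_splits)
        results[(type(model).__name__, n_splits)] = scores
        print(f"{type(model).__name__:>16} k={n_splits:<2} mean R^2 = {scores.mean():.3f}")
```

With a small dataset, a larger k leaves more data for training in each round but makes every test fold smaller, so individual fold scores tend to become noisier.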

Following is the general process:

- Shuffle the dataset randomly.
- Split the dataset into k groups.
- For each group:
    -- Use that group as the holdout (test) data set.
    -- Use the remaining groups as the training data set.
    -- Fit a model on the training set, then evaluate it on the test set.
    -- Discard the model and keep the evaluation score.
- Summarize the model's skill using the collected evaluation scores, for example their mean and standard deviation.
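The general process can also be written out by hand with NumPy, which makes each step visible. This is a minimal sketch, not how scikit-learn implements KFold internally; the dataset (Iris), the model (KNN with `n_neighbors=5`), and the seed are assumptions for the example.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Assumption: Iris and KNN are stand-ins chosen for illustration.
X, y = load_iris(return_X_y=True)
k = 5

# Step 1: shuffle the dataset (here, by shuffling the row indices).
rng = np.random.default_rng(seed=0)
indices = rng.permutation(len(X))

# Step 2: split the shuffled indices into k groups of near-equal size.
folds = np.array_split(indices, k)

scores = []
base_model = KNeighborsClassifier(n_neighbors=5)
for i, test_idx in enumerate(folds):
    # Step 3: one group is the holdout set, the rest form the training set.
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])

    # Fit a fresh, unfitted copy of the model and evaluate it on the holdout set.
    model = clone(base_model)
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))
    # The fitted model is not kept; only its score survives the iteration.

# Step 4: summarize the collected scores.
print(f"Accuracy per fold: {np.round(scores, 3)}")
print(f"Mean: {np.mean(scores):.3f}, std: {np.std(scores):.3f}")
```

In practice, `KFold` or `cross_val_score` from scikit-learn does the same bookkeeping with less code; the manual version is mainly useful for understanding what happens in each round.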
