A Synthetic Dataset for Predicting the Probability That a Senior Student Goes to College

myxzlpltk — Mon, 23 May 2022 10:29:05 +0000

I'm back, devs. Today I want to share a synthetic dataset that I created a few days ago. I already uploaded it to Kaggle, where you can access it here: https://www.kaggle.com/datasets/saddamazyazy/go-to-college-dataset

The data was created using make_classification from the sklearn package, with a little touch of clustering added to make categorical features. Basically, the data has 2 labels across 1000 rows and 11 columns (10 features plus the label). Here is the code!

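from sklearn.datasets import make_classification

# 10 numeric features, 8 of which carry information about the label
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=8,
    random_state=42,
)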
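
To get from the two arrays to a table with 11 columns, the features and the label can be combined into a pandas DataFrame. This is only a minimal sketch: the column names feature_0 to feature_9 and label are placeholders assumed for illustration, not the names used in the published dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification

# Same call as above: 10 features, 8 of them informative
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=8,
    random_state=42,
)

# Placeholder column names (assumption, not the published dataset's names)
columns = [f"feature_{i}" for i in range(X.shape[1])]
df = pd.DataFrame(X, columns=columns)
df["label"] = y

print(df.shape)                    # (1000, 11)
print(df["label"].value_counts())  # counts for the two classes
```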

After that, I look at the correlation matrix to see how the variables correlate with each other.
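
A correlation matrix like this can be computed with pandas. The sketch below assumes the generated arrays were put into a DataFrame with placeholder column names (feature_0 to feature_9 and label), which may differ from what was actually used.

```python
import pandas as pd
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=8, random_state=42
)
# Placeholder column names, assumed for illustration
df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(10)])
df["label"] = y

# Pearson correlation between every pair of columns (11 x 11 matrix)
corr = df.corr()

# Correlation of each feature with the label, strongest first
label_corr = corr["label"].drop("label")
order = label_corr.abs().sort_values(ascending=False).index
label_corr = label_corr[order]
print(label_corr.round(2))
```

Values near +1 or -1 mean a strong relationship with the label, and values near 0 mean almost none.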

Some variables have a positive or negative correlation, while others have a value close to zero, meaning no correlation. With 10 variables, I have to design features that match a research paper, so I need to see what correlates and what does not.

Based on those correlations, I can cluster some features together with the label. The clustering is usually done in 2D (one feature plus the label). Because the clustering underfits, some points will not end up in the cluster that matches their true label. This is what gives the data some variation.

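from sklearn.cluster import KMeans

# Cluster the numeric feature together with the label, then name the two clusters
df['school_accreditation'] = KMeans(2, random_state=42).fit_predict(df[['school_accreditation', 'label']])
df['school_accreditation'] = df['school_accreditation'].replace({0: 'B', 1: 'A'})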

I personally use K-Means to cluster these numbers.
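
To see how much variation the clustering introduces, the new category can be cross-tabulated against the label. This is a rough sketch under some assumptions: feature_0 is a placeholder for the numeric column being converted, the column name category is made up, and n_init=10 is set explicitly so the behavior does not depend on the scikit-learn version.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=8, random_state=42
)
# "feature_0" is a placeholder for the numeric column being converted
df = pd.DataFrame({"feature_0": X[:, 0], "label": y})

# Cluster the (feature, label) pairs into two groups
clusters = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(
    df[["feature_0", "label"]]
)
# Cluster ids are arbitrary, so which id becomes "A" or "B" is a choice
df["category"] = pd.Series(clusters, index=df.index).map({0: "B", 1: "A"})

# Each cell counts the rows with that category and that label
table = pd.crosstab(df["category"], df["label"])
print(table)
```

If every category row had all of its counts under a single label, the new feature would just be a copy of the label. Counts spread over both labels are the variation mentioned above.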

Face Mask Detection With ResNet50 and SVM + Decision Tree

myxzlpltk — Fri, 15 Apr 2022 13:38:20 +0000

Welcome! This post is a quick explanation of how I built a mask detection app using ResNet50 as a feature extractor and a stacking ensemble of Support Vector Machine (SVM) + Decision Tree as the classifier.

As a tribute to fellow researchers, this app is based on the research paper "A hybrid deep transfer learning model with machine learning methods for face mask detection in the era of the COVID-19 pandemic" by Mohamed Loey et al.

Table of contents:

Dataset Retrieval
Preprocessing
Feature Extraction
Split Dataset
Define Model Classifier
Tuning Model
Create Final Model
Deploy Real App

Dataset Retrieval

This application uses a dataset from Kaggle. The dataset contains 853 images belonging to 3 classes, along with their bounding boxes in the PASCAL VOC format. The classes are with_mask, without_mask, and mask_weared_incorrect. I only use the with_mask and without_mask labels. Check out the image sample below.

You can access the dataset at the URL below.
https://www.kaggle.com/datasets/andrewmvd/face-mask-detection

Preprocessing

Preprocessing starts by cropping the face area based on the bounding box information. First, read all XML files and image files from the dataset folder.

import os

img_names = []
xml_names = []
for dirname, _, filenames in os.walk('./face-mask-detection'):
  for filename in filenames:
    # Annotation files end in .xml; everything else is treated as an image
    if filename.endswith(".xml"):
      xml_names.append(filename)
    else:
      img_names.append(filename)

print(len(img_names), "images")

Then crop each face from the images using its bounding box, and read its label.

import xmltodict
from matplotlib import pyplot as plt
from skimage.io import imread

path_annotations = "face-mask-detection/annotations/"
path_images = "face-mask-detection/images/"

class_names = ['with_mask', 'without_mask']
images = []
target = []

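def crop_bounding_box(img, bnd):
  # Read the coordinates by key so the result does not depend on XML element order
  x1, y1, x2, y2 = (int(bnd[k]) for k in ("xmin", "ymin", "xmax", "ymax"))
  # Crop the face and keep only the first three channels (drops alpha if present)
  return img[y1:y2, x1:x2, :3].copy()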

for img_name in img_names:
  # The annotation file shares the image's base name
  with open(path_annotations+img_name[:-4]+".xml") as fd:
    doc = xmltodict.parse(fd.read())

  img = imread(path_images+img_name)
  objects = doc["annotation"]["object"]
  # xmltodict returns a dict for a single object and a list for several
  if not isinstance(objects, list):
    objects = [objects]
  for obj in objects:
    # Skip classes that are not used (mask_weared_incorrect)
    if obj["name"] not in class_names:
      continue
    images.append(crop_bounding_box(img, obj["bndbox"]))
    target.append(obj["name"])
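
For readers who have not seen a PASCAL VOC annotation, here is a small self-contained sketch of its layout and how the labels and boxes can be read. Note that it swaps in xml.etree.ElementTree from the standard library instead of xmltodict, and that the XML content and the helper name read_boxes are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up annotation in the PASCAL VOC layout (values are illustrative only)
SAMPLE_XML = """
<annotation>
  <filename>example.png</filename>
  <object>
    <name>with_mask</name>
    <bndbox><xmin>10</xmin><ymin>20</ymin><xmax>50</xmax><ymax>80</ymax></bndbox>
  </object>
  <object>
    <name>mask_weared_incorrect</name>
    <bndbox><xmin>60</xmin><ymin>15</ymin><xmax>90</xmax><ymax>70</ymax></bndbox>
  </object>
</annotation>
"""

class_names = ["with_mask", "without_mask"]

def read_boxes(xml_text):
    """Return (label, (xmin, ymin, xmax, ymax)) for every kept object."""
    root = ET.fromstring(xml_text.strip())
    boxes = []
    # findall always returns a list, so one object and many objects look the same
    for obj in root.findall("object"):
        name = obj.findtext("name")
        if name not in class_names:
            continue  # skip classes that are not used in this post
        bnd = obj.find("bndbox")
        coords = tuple(int(bnd.findtext(k)) for k in ("xmin", "ymin", "xmax", "ymax"))
        boxes.append((name, coords))
    return boxes

print(read_boxes(SAMPLE_XML))  # [('with_mask', (10, 20, 50, 80))]
```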

Based on the labels, this dataset consists of 3232 faces with masks and 717 faces without masks.

Preprocessing also includes resizing every crop to 128x128 and normalizing it with the ImageNet mean and standard deviation, since ResNet50 was pretrained on ImageNet.

import torch

from torchvision import transforms

# Define preprocessing
preprocess = transforms.Compose([
  transforms.ToPILImage(),
  transforms.Resize((128, 128)),
  transforms.ToTensor(),
  transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])

# Apply preprocess
image_tensor = torch.stack([preprocess(image) for image in images])
image_tensor.shape
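
To make the normalization step less of a black box, here is what Normalize does to a single image, written with plain NumPy. It is only a sketch of the arithmetic: the random array stands in for one image after ToTensor(), and it is not torchvision's actual implementation.

```python
import numpy as np

# ImageNet channel statistics, same values as in the transform above
mean = np.array([0.485, 0.456, 0.406]).reshape(3, 1, 1)
std = np.array([0.229, 0.224, 0.225]).reshape(3, 1, 1)

# Stand-in for one image after ToTensor(): shape (C, H, W), values in [0, 1)
rng = np.random.default_rng(0)
image = rng.random((3, 128, 128))

# Normalize subtracts the mean and divides by the std, channel by channel
normalized = (image - mean) / std

print(normalized.shape)  # (3, 128, 128)
```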

Feature Extraction

Feature extraction turns each image into a vector of numbers, using spatial (convolutional) operations to capture the information that represents a label. In this application, I use ResNet50 as a feature extractor. The last layer of ResNet50, which is a fully connected layer with 1,000 neurons, needs to be removed.

from torchvision import models

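# Download the pretrained model
# (torchvision 0.13+ prefers weights=models.ResNet50_Weights.IMAGENET1K_V1 over pretrained=True)
resnet = models.resnet50(pretrained=True)
# Drop the final fully connected layer, keeping everything up to global average pooling
resnet = torch.nn.Sequential(*(list(resnet.children())[:-1]))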

To freeze and keep the convolutional part of ResNet50 fixed, I need to set requires_grad to False.

for param in resnet.parameters():
    param.requires_grad = False

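I also need to call eval() to switch ResNet50 to inference mode. In this mode, batch normalization uses the running statistics learned during pretraining instead of statistics computed from the current batch. Without it, the extracted features would depend on the batch, which would interfere with model accuracy. This makes sure ResNet50 only acts as a feature extractor.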

resnet.eval()
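
The difference between the two modes is easier to see with a toy example. The sketch below is a simplified NumPy version of batch normalization (no learned scale or shift), written only to illustrate the idea. It is not PyTorch's implementation, and the function name batch_norm and all values are assumptions.

```python
import numpy as np

def batch_norm(x, running_mean, running_var, training, eps=1e-5):
    """Simplified batch normalization over the batch axis (no scale/shift)."""
    if training:
        # Training mode: statistics come from the current batch
        mean, var = x.mean(axis=0), x.var(axis=0)
    else:
        # Inference mode: fixed statistics learned during training
        mean, var = running_mean, running_var
    return (x - mean) / np.sqrt(var + eps)

running_mean = np.zeros(4)
running_var = np.ones(4)

# The same sample placed in two different batches
rng = np.random.default_rng(0)
sample = rng.normal(size=(1, 4))
batch_a = np.vstack([sample, rng.normal(size=(7, 4))])
batch_b = np.vstack([sample, rng.normal(loc=5.0, size=(7, 4))])

# Training mode: the sample's output changes with the rest of the batch
train_a = batch_norm(batch_a, running_mean, running_var, training=True)[0]
train_b = batch_norm(batch_b, running_mean, running_var, training=True)[0]
print(np.allclose(train_a, train_b))  # False

# Inference mode: the output depends only on the sample itself
eval_a = batch_norm(batch_a, running_mean, running_var, training=False)[0]
eval_b = batch_norm(batch_b, running_mean, running_var, training=False)[0]
print(np.allclose(eval_a, eval_b))  # True
```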

The last step is to apply ResNet50 to extract the features. ResNet50 returns a vector with 2048 features for each image.

import numpy as np

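result = np.empty((len(image_tensor), 2048))
# no_grad skips gradient tracking, which saves memory during inference
with torch.no_grad():
  for i, data in enumerate(image_tensor):
    output = resnet(data.unsqueeze(0))  # add a batch dimension: (1, 3, 128, 128)
    output = torch.flatten(output, 1)   # (1, 2048, 1, 1) -> (1, 2048)
    result[i] = output[0].numpy()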

Split Dataset

To check whether the model overfits, I split the data into 70% train data and 30% test data. Train data is used to train the model, and test data is held out to test or validate it.

from sklearn.model_selection import train_test_split

X, y = result, np.array(target)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print("Training data\n", np.asarray(np.unique(y_train, return_counts=True)).T)
print("Test data\n", np.asarray(np.unique(y_test, return_counts=True)).T)
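
Because the classes are imbalanced (3232 vs 717), one option is a stratified split, which keeps the class ratio the same in both subsets. The post does not use this, so treat the sketch below as an optional variation. The labels are a synthetic stand-in that only reuses the class counts reported above, and the features are dummy values.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the class counts reported above (dummy features)
y = np.array(["with_mask"] * 3232 + ["without_mask"] * 717)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y keeps the class ratio the same in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

for name, labels in [("train", y_train), ("test", y_test)]:
    share = np.mean(labels == "without_mask")
    print(f"{name}: {len(labels)} samples, {share:.3f} without_mask")
```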

Define Model Classifier

As I teased before, the proposed model is a stacking classifier (an ensemble method) that uses SVM and a decision tree as base learners, with logistic regression as the final estimator. In short, ensemble methods create multiple models and then combine them to produce improved results. They often produce more accurate solutions than a single model would.

from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression

clf = StackingClassifier(
    estimators=[('svm', SVC(random_state=42)),
                ('tree', DecisionTreeClassifier(random_state=42))],
    final_estimator=LogisticRegression(random_state=42),
    n_jobs=-1)
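
To show roughly what stacking does under the hood, here is a hand-rolled sketch: the out-of-fold outputs of the base models become the input features of the final estimator. This is a simplification for illustration, not scikit-learn's exact implementation, and it runs on synthetic data instead of the mask features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the extracted features (not the mask dataset)
X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

svm = SVC(random_state=42)
tree = DecisionTreeClassifier(random_state=42)

# 1. Out-of-fold outputs of each base model become the meta-features
meta_train = np.column_stack([
    cross_val_predict(svm, X_train, y_train, cv=5, method="decision_function"),
    cross_val_predict(tree, X_train, y_train, cv=5, method="predict_proba")[:, 1],
])

# 2. The final estimator learns how to combine the base models
final = LogisticRegression(random_state=42).fit(meta_train, y_train)

# 3. The base models are fit on the full training set for prediction time
svm.fit(X_train, y_train)
tree.fit(X_train, y_train)
meta_test = np.column_stack([
    svm.decision_function(X_test),
    tree.predict_proba(X_test)[:, 1],
])
print("Hand-rolled stacking accuracy:", final.score(meta_test, y_test))
```

Using out-of-fold predictions in step 1 matters: if the base models predicted on the same rows they were trained on, the final estimator would learn from overly optimistic inputs.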

Tuning Model

Tuning is the process of maximizing a model's performance without overfitting or creating too high of a variance. In machine learning, this is accomplished by selecting appropriate "hyperparameters". You can use whatever tuning method you want, but here is mine: a grid search with cross-validation.

from sklearn.model_selection import GridSearchCV

param_grid = {
    'svm__C': [1.6, 1.7, 1.8],
    'svm__kernel': ['rbf'],
    'tree__criterion': ['entropy'],
    'tree__max_depth': [9, 10, 11],
    'final_estimator__C': [1.3, 1.4, 1.5]
}

grid = GridSearchCV(
    estimator=clf,
    param_grid=param_grid,
    scoring='accuracy',
    n_jobs=-1)

grid.fit(X_train, y_train)

print('Best parameters: %s' % grid.best_params_)
print('Accuracy: %.2f' % grid.best_score_)

Based on the tuning process, the best hyperparameters are:

Best parameters: {'final_estimator__C': 1.3, 'svm__C': 1.6, 'svm__kernel': 'rbf', 'tree__criterion': 'entropy', 'tree__max_depth': 11}
Accuracy: 0.98
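
A grid search tries every combination in the grid, so it helps to know how many there are before starting. The sketch below counts them with ParameterGrid. The fold count of 5 assumes GridSearchCV's default, since cv is not set above.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'svm__C': [1.6, 1.7, 1.8],
    'svm__kernel': ['rbf'],
    'tree__criterion': ['entropy'],
    'tree__max_depth': [9, 10, 11],
    'final_estimator__C': [1.3, 1.4, 1.5],
}

candidates = list(ParameterGrid(param_grid))
n_folds = 5  # GridSearchCV's default when cv is not set

print(len(candidates))            # 27 combinations
print(len(candidates) * n_folds)  # 135 fits, plus one final refit
```

Each of those fits trains a full stacking classifier, which itself cross-validates the base models, so the search can take a while on 2048 features.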

Create Final Model

Finally, I can create a final model with the best hyperparameters. I hope this model will not overfit.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

final_clf = StackingClassifier(
    estimators=[('svm', SVC(C=1.6, kernel='rbf', random_state=42)),
                ('tree', DecisionTreeClassifier(criterion='entropy', max_depth=11, random_state=42))],
    final_estimator=LogisticRegression(C=1.3, random_state=42),
    n_jobs=-1)

final_clf.fit(X_train, y_train)
y_pred = final_clf.predict(X_test)

print('Accuracy score : ', accuracy_score(y_test, y_pred))
print('Precision score : ', precision_score(y_test, y_pred, average='weighted'))
print('Recall score : ', recall_score(y_test, y_pred, average='weighted'))
print('F1 score : ', f1_score(y_test, y_pred, average='weighted'))

I evaluate the model on the test data using accuracy, precision, recall, and F1 score. The results are:

Accuracy score :  0.9721518987341772
Precision score :  0.9719379890530496
Recall score :  0.9721518987341772
F1 score :  0.9717932606523529

Looks pretty good! Check out this confusion matrix. If it's biased, please comment 😁.
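
For anyone who wants to build a confusion matrix themselves, here is a minimal sketch with scikit-learn. The labels below are toy values for illustration only, not the results from this post. In practice they would be replaced by y_test and y_pred.

```python
from sklearn.metrics import classification_report, confusion_matrix

labels = ["with_mask", "without_mask"]

# Toy labels for illustration only (not the results from this post)
y_true = ["with_mask"] * 6 + ["without_mask"] * 4
y_hat = ["with_mask"] * 5 + ["without_mask"] + ["without_mask"] * 3 + ["with_mask"]

# Rows are true classes, columns are predicted classes, in the order of `labels`
cm = confusion_matrix(y_true, y_hat, labels=labels)
print(cm)
# [[5 1]
#  [1 3]]

# Per-class precision and recall show whether the minority class is neglected
print(classification_report(y_true, y_hat, labels=labels))
```

With imbalanced classes, the per-class recall for without_mask is worth checking, because a high overall accuracy can hide mistakes on the smaller class.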

Deploy Real App

This step is not required. But if you are interested, you must export the model first. Only the trained stacking classifier needs to be exported, so that you can load it again in another program.

import pickle

pkl_filename = 'face_mask_detection.pkl'
with open(pkl_filename, 'wb') as file:
  pickle.dump(final_clf, file)
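
Loading the model back in another program is the mirror image of saving it. The sketch below is self-contained, so it trains a small stand-in model and uses a temporary folder. In the real app the file would be face_mask_detection.pkl and the model would be final_clf.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small stand-in model; in the post this would be final_clf
X, y = make_classification(n_samples=100, n_features=5, random_state=42)
model = LogisticRegression(random_state=42).fit(X, y)

# Save to a temporary folder so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, "model.pkl")
    with open(pkl_path, "wb") as file:
        pickle.dump(model, file)

    # In the other program: load the model back and use it as usual
    with open(pkl_path, "rb") as file:
        loaded = pickle.load(file)

same = np.array_equal(model.predict(X), loaded.predict(X))
print(same)  # True
```

Two things to keep in mind: load the file with the same scikit-learn version that saved it, and only unpickle files you trust, because pickle can run arbitrary code while loading.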

The process is simple, but first check out the diagram below.

The important thing to remember is that you need to implement your own face detection model and crop the detected faces. For my example program, check out my GitHub repository.

DEV Community: myxzlpltk