DEV Community: Vijeth-Rai

Steel Defect Detection

Vijeth-Rai — Mon, 09 Nov 2020 03:46:11 +0000

Steel

Steel is one of the most important building materials of modern times. Steel buildings are resistant to natural and man-made wear
which has made the material ubiquitous around the world. To help make production of steel more efficient, this case study will help
identify defects.

The production process of flat sheet steel is especially delicate. From heating and rolling, to drying and cutting, several machines touch flat steel by the time it’s ready to ship.

Severstal

Severstal is leading the charge in efficient steel mining and production. The company recently created the country’s largest
industrial data lake, with petabytes of data that were previously discarded.

Severstal is looking to machine learning to improve automation, increase efficiency, and maintain high quality in their production. Severstal uses images from high frequency cameras to power a defect detection algorithm.

Business Problem

In manufacturing, one of the main problems is detecting faulty parts. Faulty parts that are detected get recycled, but those that go undetected can put customers in danger and damage the company's reputation.

The goal of the project is to help engineers improve the algorithm by localizing and classifying surface defects on a steel sheet. If successful, it will help keep manufacturing standards for steel high.

Dataset

The dataset is provided by Severstal, one of the leading steel manufacturers in the world. It contains 3 features: ImageId, ClassId and Encoded Pixels.

EDA

ClassId takes four unique values, one for each type of defect. The defects may be present individually or simultaneously in an image.
There are 7095 observations in the train dataset. There are no missing values.
The segments for each defect class are encoded into a single row, even if there are several non-contiguous defect locations on an image. The segments are stored in the form of encoded pixels.

Check for imbalance

First, we need to determine whether the given data is balanced or not. A simple plot of ClassId vs counts is shown below.

Here we observe that the defect type 3 is more dominant than any other defect. Defect 2 is the least occurring defect. There is class imbalance.
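The counts behind such a plot can be obtained with a simple value count. This is a minimal sketch on a tiny made-up table: the column names `ImageId` and `ClassId` come from the dataset description, while the rows and the name `train_df` are invented for illustration.

```python
import pandas as pd

# Hypothetical rows standing in for the training table (not real data).
train_df = pd.DataFrame({
    "ImageId": ["a.jpg", "b.jpg", "c.jpg", "c.jpg", "d.jpg"],
    "ClassId": [3, 3, 1, 3, 4],
})

# Number of rows (defect segments) per class, ordered by class id.
class_counts = train_df["ClassId"].value_counts().sort_index()
print(class_counts)

# Share of each class, useful for judging how strong the imbalance is.
class_share = class_counts / class_counts.sum()
print(class_share)
```

Plotting `class_counts` as a bar chart would give the kind of figure described above.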

Check for defect overlap

Now we will check whether the input image contains more than one defect simultaneously.

We can see that most observations have only one type of defect. Some have two defects simultaneously. There are no observations with three or more defects simultaneously.
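One way to run this check, sketched on invented rows, is to count the distinct classes per image and then count how many images fall into each bucket. Only the column names are taken from the dataset description; `train_df` and its contents are assumptions.

```python
import pandas as pd

# Hypothetical rows standing in for the training table (not real data).
train_df = pd.DataFrame({
    "ImageId": ["a.jpg", "b.jpg", "c.jpg", "c.jpg", "d.jpg"],
    "ClassId": [3, 3, 1, 3, 4],
})

# Number of distinct defect classes present in each image.
defects_per_image = train_df.groupby("ImageId")["ClassId"].nunique()

# How many images contain 1, 2, ... defect types.
overlap_counts = defects_per_image.value_counts().sort_index()
print(overlap_counts)
```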

Encoded pixels to masks

Encoded pixels hold the information about which pixels have defects. The string is a sequence of pairs: a pixel index followed by a count. Each pair means that the pixels from the pixel index up to, but not including, the pixel index + count are defective. The pattern repeats until all defective pixels are encoded.

The following function converts the encoded pixels to masks:

import numpy as np

def masks(encoded_pixels):
    # flat mask for a 256 x 1600 image, filled column by column
    mask = np.zeros(256 * 1600, dtype=np.int8)
    pre_mask = np.asarray(encoded_pixels.split(), dtype=int)
    # even positions hold the pixel indices, odd positions hold the counts
    pixels, counts = pre_mask[0::2], pre_mask[1::2]
    for pixel, count in zip(pixels, counts):
        # the pixel index is used as a 0-based offset into the flat mask
        mask[pixel:pixel + count] = 1
    # the encoding runs down the columns, so reshape to (1600, 256) and transpose
    return mask.reshape(1600, 256).T

Plotting few datapoints

Here we will visualize what the input images and target masks look like. I have given each type of defect a different colour so that it is easier to differentiate between defects.

Defect type 1

Defect type 2

Defect type 3

Defect type 4

Two defects simultaneously

From the above visualizations we can observe that defect type 4 is very distinct from the other defects. Defect type 1 is difficult to see. Defect types 2 and 3 look similar. Due to the imbalance, type 2 defects may be classified as type 3 by the model if the data is not balanced.

Feature Engineering

The classes are heavily imbalanced. Also, the dataset is small. Therefore, I augmented the images to compensate for the class imbalance and to increase the amount of training data.

Class 3 was not augmented because it already has enough data. Class 1 and 2 were augmented such that their data increased by 5 times. Since Class 4 had the least amount of data available, I augmented it 8 times to compensate for imbalance.
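The post does not say which transformations were used, so the following is only a sketch of the general idea, assuming simple flips: whatever is done to an image must be done identically to its mask, otherwise the labels no longer line up. The helper name `flip_pair` and the toy arrays are hypothetical.

```python
import numpy as np

def flip_pair(image, mask, horizontal=True):
    """Apply the same flip to an image and its mask so they stay aligned."""
    axis = 1 if horizontal else 0
    return np.flip(image, axis=axis), np.flip(mask, axis=axis)

# Toy 2x3 grayscale image with a one-pixel defect in the top-left corner.
image = np.arange(6).reshape(2, 3)
mask = np.zeros((2, 3), dtype=np.int8)
mask[0, 0] = 1

aug_image, aug_mask = flip_pair(image, mask, horizontal=True)
# The defect pixel moves to the top-right corner in both arrays.
print(aug_image)
print(aug_mask)
```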

In this step, I augmented both the images and their masks, and then saved the images. The mask values were converted back into pixel encoding using the function below.

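import numpy as np

def rle(img):
    # flatten column by column, the same order the masks are decoded in
    pixels = img.T.flatten()
    # pad with zeros so that runs touching the borders are detected
    pixels = np.concatenate([[0], pixels, [0]])
    # positions where the value changes: run starts and run ends alternate
    runs = np.where(pixels[1:] != pixels[:-1])[0]
    # turn every run end into a run length
    runs[1::2] -= runs[::2]
    return ' '.join(str(x) for x in runs)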
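A quick way to gain confidence in an encoder and decoder pair is a round trip on a small mask. The sketch below restates both functions in a self-contained form with the image size as parameters; the names `encode_rle` and `decode_rle` are hypothetical, and run starts are treated as 0-based offsets, as in the functions above.

```python
import numpy as np

def encode_rle(mask):
    """Encode a binary mask as 'start length' pairs in column-major order."""
    pixels = np.concatenate([[0], mask.T.flatten(), [0]])
    runs = np.where(pixels[1:] != pixels[:-1])[0]
    runs[1::2] -= runs[::2]
    return " ".join(str(x) for x in runs)

def decode_rle(encoded, height, width):
    """Decode 'start length' pairs back into a (height, width) binary mask."""
    flat = np.zeros(height * width, dtype=np.int8)
    values = np.asarray(encoded.split(), dtype=int)
    for start, length in zip(values[0::2], values[1::2]):
        flat[start:start + length] = 1
    return flat.reshape(width, height).T

mask = np.zeros((4, 5), dtype=np.int8)
mask[1:3, 0] = 1   # a run of two pixels inside the first column
mask[3, 4] = 1     # the very last pixel in column-major order

encoded = encode_rle(mask)
restored = decode_rle(encoded, 4, 5)
print(encoded)
print(np.array_equal(restored, mask))
```

If the decoded mask differs from the original, the two functions disagree on pixel order or on where indexing starts.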

The total number of images was increased from about 7000 to about 13000.

Modelling

Performance Metric

I have chosen the Dice coefficient as the performance metric because the resulting masks need to have both good precision and good recall.
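For binary masks the Dice coefficient is twice the overlap divided by the total number of positive pixels in both masks, which is the same quantity as the F1 score of the positive pixels. A minimal NumPy sketch follows; the `smooth` term is a common convention that is assumed here, not something stated in this post.

```python
import numpy as np

def dice_coefficient(y_true, y_pred, smooth=1.0):
    """Dice = 2 * overlap / (size of truth + size of prediction).

    smooth avoids a division by zero when both masks are empty.
    """
    y_true = np.asarray(y_true, dtype=np.float64).ravel()
    y_pred = np.asarray(y_pred, dtype=np.float64).ravel()
    intersection = np.sum(y_true * y_pred)
    return (2.0 * intersection + smooth) / (np.sum(y_true) + np.sum(y_pred) + smooth)

true_mask = np.array([[1, 1, 0], [0, 0, 0]])
pred_mask = np.array([[1, 0, 0], [0, 1, 0]])

# One overlapping pixel, two positive pixels in each mask.
score = dice_coefficient(true_mask, pred_mask, smooth=0.0)
print(score)
```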

Loss function

I have used binary cross-entropy as the loss function. It treats each class independently, so the evidence that a pixel belongs to one class does not influence the decision for another class. This matters because some images contain more than one class.
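To make the independence concrete, here is a small NumPy sketch of binary cross-entropy evaluated on one pixel with four class channels. The function and the example probabilities are illustrative assumptions, not the training code used in this project.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over all pixels and class channels."""
    y_true = np.asarray(y_true, dtype=np.float64)
    # Clip probabilities so that log(0) never occurs.
    y_prob = np.clip(np.asarray(y_prob, dtype=np.float64), eps, 1.0 - eps)
    losses = -(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))
    return losses.mean()

# One pixel, four class channels: classes 3 and 4 are both present.
y_true = np.array([0, 0, 1, 1])
confident = np.array([0.01, 0.01, 0.99, 0.99])
unsure = np.array([0.5, 0.5, 0.5, 0.5])

print(binary_crossentropy(y_true, confident))
print(binary_crossentropy(y_true, unsure))
```

Each channel contributes its own term, so two classes can both be predicted as present without competing, which a softmax over the channels would not allow.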

Residual Unet

ResUNet is a semantic segmentation model inspired by deep residual learning and UNet. The architecture takes advantage of both residual and UNet models.

This combination brings two benefits: 1) the residual unit eases training of the network; 2) the skip connections within a residual unit and between the low and high levels of the network facilitate information propagation without degradation, making it possible to design a neural network with far fewer parameters that can still achieve comparable or even better performance on semantic segmentation.

Paper: https://arxiv.org/pdf/1711.10684.pdf

Architecture

The network comprises three parts: encoding, bridge and decoding. The first part encodes the input image into compact representations. The last part recovers the representations to a pixel-wise categorization, i.e. semantic segmentation. The middle part serves as a bridge connecting the encoding and decoding paths. All three parts are built with residual units, each consisting of two 3 × 3 convolution blocks and an identity mapping. Each convolution block includes a BN layer, a ReLU activation layer and a convolutional layer. The identity mapping connects the input and output of the unit.

Result

The best loss was 0.0513, reached after 15 epochs; the loss did not improve further after that.

Unet

The architecture contains two paths. The first path is the contraction path (also called the encoder), which is used to capture the context in the image. The encoder is just a traditional stack of convolutional and max pooling layers. The second path is the symmetric expanding path (also called the decoder), which is used to enable precise localization using transposed convolutions.

Thus it is an end-to-end fully convolutional network (FCN), i.e. it contains only convolutional layers and no dense layers, which is why it can accept images of any size.

Result

The validation loss reached its minimum of 0.0125 after 23 epochs. The real vs predicted masks are shown below.

The predicted masks appear more precise than the manually labelled masks.

Scope for Improvement

A new model could be built based on Hierarchical Multi-Scale Attention for Semantic Segmentation. This model might be able to give even better results than Unet.
Paper: https://arxiv.org/abs/2005.10821

Solving the Stripper-Well problem

Vijeth-Rai — Sat, 17 Oct 2020 03:53:53 +0000

Stripper Well:

A stripper well is a low-yield oil well. These wells have low operational costs and qualify for tax breaks, which makes them attractive from a business point of view. Almost 80% of the oil wells in the US are stripper wells.

Business problem:

A stripper well has a lot of mechanical components, and breakdowns are more frequent than in other wells. A breakdown occurs either at the surface level or at the down-hole level.

From a business perspective, time is more valuable than small overhead costs. It is very inefficient to send the repair team without knowing where the failure has occurred. Therefore it becomes a necessity to create an algorithm which predicts where the failure has occurred.

Project Goal:

The goal of the project is to predict whether the mechanical component has failed at the surface level or at the down-hole level. This information can be used to send the repair team to address the failure at the right level and save valuable time.

Dataset:

The dataset is provided by ConocoPhillips. The dataset has 107 features which are taken from the sensors that collect a variety of information at surface and bottom levels.

EDA:

Check for imbalance:

First, we need to determine whether the given data is balanced or not. A simple plot of target variable vs counts is shown below.

It is clear that there is a big imbalance in the dataset. The imbalance ratio is 59:1.

Check for NaN values:

The data given has many missing values.

Features and NaN values observation:

There are 162 features with less than 25% missing values
There are 4 features with 25%-75% missing values
There are 6 features with greater than 75% missing values
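Buckets like these can be computed from the fraction of missing values per column. The sketch below uses an invented frame; the column names and values are assumptions chosen so that each bucket holds one feature.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: names and values are invented for illustration.
df = pd.DataFrame({
    "f_low": [1.0, 2.0, 3.0, np.nan, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],  # 10% missing
    "f_mid": [1.0, np.nan, np.nan, np.nan, 5.0] * 2,                    # 60% missing
    "f_high": [np.nan] * 9 + [1.0],                                    # 90% missing
})

# Fraction of missing values per feature.
nan_fraction = df.isna().mean()

low = nan_fraction[nan_fraction < 0.25].index.tolist()
mid = nan_fraction[(nan_fraction >= 0.25) & (nan_fraction <= 0.75)].index.tolist()
high = nan_fraction[nan_fraction > 0.75].index.tolist()

print(len(low), len(mid), len(high))
```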

Check for outliers

The outliers are detected using the interquartile range (IQR) rule. The 25th and 75th percentiles are computed, and their difference (the IQR) is multiplied by 1.5. Any value that lies more than this amount above the 75th percentile or below the 25th percentile is classified as an outlier.

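import numpy as np
from tqdm import tqdm

out_count = []  # number of outliers per feature
for col in tqdm(train.columns):
    Q25 = np.percentile(train[col], 25)
    Q75 = np.percentile(train[col], 75)
    IQR = (Q75 - Q25) * 1.5  # 1.5 times the interquartile range
    UL = Q75 + IQR  # upper limit
    LL = Q25 - IQR  # lower limit
    # count the values that fall outside the limits
    out = int(((train[col] > UL) | (train[col] < LL)).sum())
    out_count.append(out)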

There are 6 features with outliers.

Check for correlation

Checking for correlation between features is necessary to get the best fit from the models.

The graph shows many bright spots, which means many features are correlated with each other.

Feature Engineering:

NaN values relation with target:

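There are many NaN values, and there is a chance that their presence is related to the target variable. We will check the relation using the Hamming distance.
For each feature, NaN values are set to 0 and all other values are set to 1. We then compare this indicator with the target values row by row. The code below counts the rows where the two agree, which is the complement of the Hamming distance; since both extremes are of interest, either count works.
If the count is around 1000, the indicator is almost always the opposite of the target, so the NaN pattern of the feature is important. If the count is around 59000, the indicator almost always matches the target, so the feature is important as well.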

from tqdm import tqdm

y = train.target
hamming_distance = []

for col in tqdm(train.columns):
    # indicator: 1 where the value is present, 0 where it is NaN
    series = train[col].notna().astype(int)
    # number of rows where the indicator equals the target
    hamming_distance.append(int((series == y).sum()))

save_cols = []

for i, distance in enumerate(hamming_distance):
    # keep features whose indicator almost always differs from or matches the target;
    # the first two columns are skipped
    if (distance < 1500 or distance > 58500) and i > 1:
        save_cols.append(train.columns[i])

save_cols contains the names of the features whose NaN pattern is informative. For each of them we create a new feature that records whether the value was missing.

train['new_val1'] = np.where(train['sensor1_measure'].isnull(), 0, 1)

Now that the information in the NaN patterns has been captured in new features, we can remove all features with more than 25% NaN values and mean-impute the remaining missing values.
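The order of operations matters here: the indicator has to be created before the sparse column is dropped. A minimal pandas sketch of the three steps follows, using an invented two-column frame; the column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data, not taken from the competition files.
df = pd.DataFrame({
    "sensor_a": [1.0, np.nan, 3.0, 4.0, 5.0],        # 20% missing: kept and imputed
    "sensor_b": [np.nan, np.nan, np.nan, 1.0, 2.0],  # 60% missing: dropped
})

# Keep an indicator first, so the missingness signal survives the drop.
df["sensor_b_present"] = np.where(df["sensor_b"].isnull(), 0, 1)

# Drop features with more than 25% missing values.
nan_fraction = df.isna().mean()
df = df.drop(columns=nan_fraction[nan_fraction > 0.25].index)

# Fill the remaining gaps with the column mean.
df = df.fillna(df.mean())
print(df)
```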

Removing highly correlated features:

If a pair of features has a correlation above 0.85, one feature of the pair is removed. This leads to cleaner data with less redundancy, which suits boosting algorithms well.

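threshold = 0.85
# Assumption: mat is the correlation matrix of train, e.g. mat = train.corr()

# collect index pairs of distinct features whose absolute correlation
# exceeds the threshold, skipping the first row and column
indeces = []
for i in range(1, len(mat)):
    for j in range(1, len(mat)):
        if i != j and abs(mat.iloc[i, j]) > threshold:
            indeces.append((i, j))

# decide which feature of each pair to keep and which to drop
save = []
delete = []
for a, b in indeces:
    if a in save:
        if b not in delete:
            delete.append(b)
    elif a in delete:
        if b not in save and b not in delete:
            delete.append(b)
    elif b in delete:
        delete.append(a)
    else:
        save.append(a)
        delete.append(b)

# map the indices to column names and drop those columns
names = [train.columns[i] for i in delete]
train = train.drop(columns=names)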
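For comparison, a more compact variant is sketched below. It is a different selection rule from the save/delete bookkeeping above and may drop different columns: it looks only at the strict upper triangle of the correlation matrix and drops the later feature of every pair above the threshold. The function name `drop_correlated` and the synthetic data are assumptions.

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.85):
    """Drop the later feature of every pair whose absolute correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is considered once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Synthetic example: a_copy is almost a rescaled copy of a, b is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
data = pd.DataFrame({
    "a": a,
    "a_copy": a * 2.0 + 0.01 * rng.normal(size=200),
    "b": b,
})

reduced, dropped = drop_correlated(data)
print(dropped)
```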

The heatmap after removing highly correlated features is shown below.

Scaling:

I have scaled the data so that it can be directly used on any model.

Modelling

Performance metric:

I am using F1 score as my performance metric. False positives and false negatives carry equal weight, since either one wastes valuable time for the business.
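The results below report a macro F1 score next to precision and recall, so it helps to see how these relate. The sketch computes precision, recall and F1 for the positive class, plus the macro F1 as the plain average of the per-class F1 scores. The function name and the labels are invented for illustration.

```python
def f1_scores(y_true, y_pred):
    """Return precision, recall and F1 of class 1, and the macro F1 over classes 0 and 1."""
    def prf(label):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1

    precision, recall, f1_pos = prf(1)
    _, _, f1_neg = prf(0)
    return precision, recall, f1_pos, (f1_pos + f1_neg) / 2

# Made-up labels: 6 negatives and 4 positives.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

precision, recall, f1_pos, macro_f1 = f1_scores(y_true, y_pred)
print(precision, recall, f1_pos, macro_f1)
```

Because the macro average gives the majority class the same weight as the minority class, it can sit well above the F1 of the positive class on imbalanced data.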

Random Forest:

I have hypertuned a Random Forest model over the parameters below.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# weight the rare positive class by the 59:1 imbalance ratio
RF = RandomForestClassifier(n_jobs=-1, class_weight={0: 1, 1: 59})
param_grid = {'n_estimators': [200, 500, 1000, 2000],
              'max_depth': [5, 10, 20]}
CV_RF = GridSearchCV(estimator=RF, param_grid=param_grid,
                     cv=5, scoring='f1', verbose=1)

The best parameters were found to be max_depth=20 and n_estimators=1000.

The performance of the above model on test data is as follows:

Macro F1 score = 0.89.
Precision = 0.90
Recall = 0.70
77 points out of 16001 points are misclassified.

XGBoost:

The XGBoost model is hypertuned over the parameters below.

import xgboost as xgb
from sklearn.model_selection import RandomizedSearchCV

param_grid = {
    'max_depth': [10, 20],
    'learning_rate': [0.1],
    'subsample': [0.5, 1],
    'colsample_bytree': [0.7, 1],
    'colsample_bylevel': [0.7, 1.0],
    'min_child_weight': [3.0, 5.0],
    'gamma': [1.0],
    'reg_lambda': [1.0],
    'n_estimators': [100, 200, 500, 1000, 2000]}

# scale_pos_weight mirrors the 59:1 imbalance ratio
xgb_model = xgb.XGBClassifier(n_jobs=-1, scale_pos_weight=59)

randomized_search = RandomizedSearchCV(xgb_model, param_grid, n_iter=30,
                                       n_jobs=-1, cv=5, scoring='f1', random_state=42)

The best parameters are:

{'colsample_bylevel': 0.7,
 'colsample_bytree': 0.7,
 'gamma': 1.0,
 'learning_rate': 0.1,
 'max_depth': 10,
 'min_child_weight': 5.0,
 'n_estimators': 1000,
 'reg_lambda': 1.0,
 'subsample': 0.5}

The performance of the above model on the test data is as follows:

Macro F1 score = 0.92
Precision = 0.84
Recall = 0.84
66 out of 16001 points are misclassified.

AdaBoost

The AdaBoost model is hypertuned for n_estimators using GridSearchCV. The best n_estimators value was found to be 500.

The performance of the above model on the test data is as follows:

Macro F1 score = 0.89
Precision = 0.92
Recall = 0.68
75 out of 16001 points are misclassified.

Decision Tree

The model is hypertuned for max_depth. The best depth is found to be 100.

The performance of the above model on the test data is as follows:

Macro F1 score = 0.80
Precision = 0.58
Recall = 0.66
166 out of 16001 points are misclassified.

Custom Ensemble 1

In this custom ensemble, I have split the train data into two equal parts X1 and X2.

X1, X2, y1, y2 = train_test_split(X_train, y_train, test_size=0.5, random_state=42, stratify=y_train)

Four random subsets d1, d2, d3 and d4, each a stratified 25% sample, are then drawn from the first part. Each subset is drawn independently, so the subsets can overlap.

_, d1, _, dy1 = train_test_split(X1, y1, test_size=0.25, stratify=y1)
_, d2, _, dy2 = train_test_split(X1, y1, test_size=0.25, stratify=y1)
_, d3, _, dy3 = train_test_split(X1, y1, test_size=0.25, stratify=y1)
_, d4, _, dy4 = train_test_split(X1, y1, test_size=0.25, stratify=y1)

Each of these four parts is used to train one of the previously hypertuned models.


RF_final = RandomForestClassifier(max_depth=20,n_estimators=1000,class_weight={0:1,1:59})
RF_final.fit(d1,dy1)
xgb_model = xgb.XGBClassifier(n_jobs=-1,scale_pos_weight=59, colsample_bylevel=0.7,colsample_bytree=1,gamma=1.0,learning_rate=0.1,max_depth=20,min_child_weight=5.0,n_estimators=1000,reg_lambda=1.0,subsample=0.5)
xgb_model.fit(d2, dy2)
AdaB = AdaBoostClassifier(n_estimators=500, random_state=0)
AdaB.fit(d3, dy3)
DTC = DecisionTreeClassifier(class_weight={0:1,1:59},max_depth=100)
DTC.fit(d4, dy4)

Now we use the X2 data to predict the target with each of these models and combine the predictions into a single dataframe.

pred1 = RF_final.predict(X2)
pred2 = xgb_model.predict(X2)
pred3 = AdaB.predict(X2)
pred4 = DTC.predict(X2)

df_list = [pred1,pred2,pred3,pred4]

df = pd.DataFrame(df_list)
df = df.transpose()

This dataframe is used to train a hypertuned XGBoost model.

xgb_model_F = xgb.XGBClassifier(n_jobs=-1,scale_pos_weight=59, colsample_bylevel=0.7,colsample_bytree=1,gamma=1.0,learning_rate=0.1,max_depth=20,min_child_weight=5.0,n_estimators=1000,reg_lambda=1.0,subsample=0.5)
xgb_model_F.fit(df, y2)

The performance of the above model on the test data is as follows:

Macro F1 score = 0.79
Precision = 0.47
Recall = 0.82
221 out of 16001 points are misclassified.

Custom Ensemble 2

In this custom ensemble, I have split the train data into two equal parts X1 and X2.

X1, X2, y1, y2 = train_test_split(X_train, y_train, test_size=0.5, random_state=42, stratify=y_train)

The previous four hypertuned models are each trained on X1.

RF_final = RandomForestClassifier(max_depth=20,n_estimators=1000,class_weight={0:1,1:59})
RF_final.fit(X1,y1)
xgb_model = xgb.XGBClassifier(n_jobs=-1,scale_pos_weight=59, colsample_bylevel=0.7,colsample_bytree=1,gamma=1.0,learning_rate=0.1,max_depth=20,min_child_weight=5.0,n_estimators=1000,reg_lambda=1.0,subsample=0.5)
xgb_model.fit(X1,y1)
AdaB = AdaBoostClassifier(n_estimators=500, random_state=0)
AdaB.fit(X1,y1)
DTC = DecisionTreeClassifier(class_weight={0:1,1:59},max_depth=100)
DTC.fit(X1,y1)

Now we use the X2 data to predict the target with each of these models and combine the predictions into a single dataframe.

pred1 = RF_final.predict(X2)
pred2 = xgb_model.predict(X2)
pred3 = AdaB.predict(X2)
pred4 = DTC.predict(X2)

df_list = [pred1,pred2,pred3,pred4]

df = pd.DataFrame(df_list)
df = df.transpose()

This dataframe is used to train a hypertuned XGBoost model.

xgb_model_F = xgb.XGBClassifier(n_jobs=-1,scale_pos_weight=59, colsample_bylevel=0.7,colsample_bytree=1,gamma=1.0,learning_rate=0.1,max_depth=20,min_child_weight=5.0,n_estimators=1000,reg_lambda=1.0,subsample=0.5)
xgb_model_F.fit(df, y2)

The performance of the above model on the test data is as follows:

Macro F1 score = 0.82
Precision = 0.54
Recall = 0.84
174 out of 16001 points are misclassified.

Custom Ensemble 3

In this custom ensemble, I have split the train data into two equal parts X1 and X2.

X1, X2, y1, y2 = train_test_split(df_train, dfy_train, test_size=0.5, random_state=42, stratify=dfy_train)

n random subsets d1, d2, d3 ... dn, each a stratified 25% sample, are then drawn from the first part. Each subset is drawn independently, so the subsets can overlap.

for i in range(n_estimators):
    _, d1, _, dy1 = train_test_split(X1, y1, test_size=0.25, stratify=y1,random_state=i)

Each of these n parts is used to train a hypertuned decision tree.

for i in range(n_estimators):
    _, d1, _, dy1 = train_test_split(X1, y1, test_size=0.25, stratify=y1,random_state=i)
    DTC = DecisionTreeClassifier(max_depth=50)
    DTC.fit(d1, dy1)

Now we use the X2 data to predict the target with each of these models and combine the predictions into a single dataframe.
This dataframe is used to train a hypertuned XGBoost model.

xgb_model = xgb.XGBClassifier(n_jobs=-1,scale_pos_weight=59, colsample_bylevel=0.7,colsample_bytree=1,gamma=1.0,learning_rate=0.1,max_depth=20,min_child_weight=5.0,n_estimators=500,reg_lambda=1.0,subsample=0.5)
xgb_model.fit(new_df, y2)

The best hyperparameters are found to be decision tree max_depth = 50 and n_estimators = 3.

The performance of the above model on the test data is as follows:

Macro F1 score = 0.83
Precision = 0.43
Recall = 0.78
249 out of 16001 points are misclassified.

ANN

The network has one hidden dense layer with 50 units and ReLU activation. The output layer is dense with 2 units and sigmoid activation. The Adam optimizer is used with the binary_crossentropy loss function and a learning rate of 0.01.

new_df = pd.DataFrame(preds)
new_df_test = pd.DataFrame(pred_test)
new_df = new_df.transpose()
new_df_test = new_df_test.transpose()

The performance of the above model on the test data is as follows:

Macro F1 score = 0.78
Precision = 0.58
Recall = 0.85
153 out of 16001 points are misclassified.

Failed FE and Models

I had deleted all the features with more than 20% missing data and median-imputed the other features with missing values. With this approach I lost the valuable features whose NaN values are related to the target variable. Hypertuned XGBoost on this data gave a very bad result of 0.68.
Custom ensemble 1 didn't work well because of the lack of training data. To overcome this, I created custom ensemble 2, but there was not much increase in performance.
Custom ensemble 3 was created so that any number of n_estimators can be used, but the drawback was that I could use only one type of model for all estimators. With only decision trees in the ensemble, I got an F1 score 3% greater than that of the decision tree model. Recreating custom ensemble 3 with XGBoost models might give even better results than the XGBoost model, but this would require greater computational power.

Conclusion

New features have a great impact on the performance of the model. The XGBoost model works very well, with a macro F1 score of 0.92.

Scope for improvement: Custom Ensemble 3 can be recreated using XGBoost instead of decision trees, which might result in better performance. This would require more computational power.