Stripper Well:
A stripper well is a low-yield oil well. These wells have low operating costs and qualify for tax breaks, which makes them attractive from a business point of view. Almost 80% of the oil wells in the US are stripper wells.
Business problem:
The well has many mechanical components and breaks down more often than other wells. A breakdown occurs either at the surface level or at the down-hole level.
From a business perspective, time is more valuable than small overhead costs. It is very inefficient to send the repair team without knowing where the failure has occurred. It is therefore necessary to build a model that predicts where the failure has occurred.
Project Goal:
The goal of the project is to predict whether a mechanical component has failed at the surface level or at the down-hole level. This information can be used to send the repair team to the right level and save valuable time.
Dataset:
The dataset is provided by ConocoPhillips. The dataset has 107 features, taken from sensors that collect a variety of information at the surface and bottom levels.
EDA:
Check for imbalance:
First, we need to determine whether the given data is balanced. A simple plot of target classes versus counts is shown below.
It is clear that there is a large imbalance in the dataset. The imbalance ratio is 59:1.
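As a rough illustration of how such a ratio can be computed, the sketch below counts the labels in a small made-up target list and divides the majority count by the minority count. The `labels` values and the `imbalance_ratio` helper are hypothetical and only mirror the 59:1 ratio reported above.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return the majority-class count divided by the minority-class count."""
    counts = Counter(labels)
    majority = max(counts.values())
    minority = min(counts.values())
    return majority / minority


# Hypothetical toy target: 59 negatives for every positive.
labels = [0] * 590 + [1] * 10
ratio = imbalance_ratio(labels)
print(f"imbalance ratio = {ratio:.0f}:1")
```

The same ratio is what later appears as the positive class weight in the models.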
Check for NaN values:
The data given has many missing values.
Features and NaN values observation:
- There are 162 features with less than 25% missing values
- There are 4 features with 25%-75% missing values
- There are 6 features with greater than 75% missing values
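A minimal sketch of how these buckets could be counted is shown below. It assumes the features are held in a numeric NumPy array; the `missing_buckets` helper and the tiny array `X` are made up for illustration and are not the real sensor data.

```python
import numpy as np


def missing_buckets(X):
    """Count columns by fraction of NaN values: <25%, 25%-75%, >75%."""
    frac = np.isnan(X).mean(axis=0)  # fraction of NaN per column
    low = int((frac < 0.25).sum())
    mid = int(((frac >= 0.25) & (frac <= 0.75)).sum())
    high = int((frac > 0.75).sum())
    return low, mid, high


# Hypothetical 4-row, 3-column array standing in for the sensor data.
X = np.array([
    [1.0, np.nan, np.nan],
    [2.0, np.nan, np.nan],
    [3.0, 5.0, np.nan],
    [4.0, 6.0, np.nan],
])
buckets = missing_buckets(X)
print(buckets)  # one column in each bucket for this toy array
```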
Check for outliers
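Outliers are detected with the interquartile range (IQR) rule. For each feature, the 25th and 75th percentiles (Q25 and Q75) are computed, and their difference, the IQR, is multiplied by 1.5. Any value above Q75 + 1.5*IQR or below Q25 - 1.5*IQR is counted as an outlier.
import numpy as np
from tqdm import tqdm

out_count = []  # number of outliers found in each feature
for col in tqdm(train.columns):
    # np.percentile returns NaN if the column still contains NaN values;
    # np.nanpercentile can be used instead in that case
    Q25 = np.percentile(train[col], 25)
    Q75 = np.percentile(train[col], 75)
    IQR = (Q75 - Q25) * 1.5  # scaled IQR used as the margin
    UL = Q75 + IQR  # upper limit
    LL = Q25 - IQR  # lower limit
    out = 0
    for value in train[col]:
        if value > UL or value < LL:
            out += 1
    out_count.append(out)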
There are 6 features with outliers.
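The same rule can also be written without the inner Python loop. The sketch below is one possible vectorized version, not the code used for the results above; `count_outliers` and the `sample` values are hypothetical, and `np.nanpercentile` is used so that missing values are ignored.

```python
import numpy as np


def count_outliers(values, k=1.5):
    """Count values outside [Q25 - k*IQR, Q75 + k*IQR], ignoring NaN."""
    values = np.asarray(values, dtype=float)
    q25, q75 = np.nanpercentile(values, [25, 75])
    margin = k * (q75 - q25)
    lower, upper = q25 - margin, q75 + margin
    # Comparisons with NaN are False, so missing values are never counted.
    return int(np.sum((values < lower) | (values > upper)))


# Hypothetical feature with one extreme reading and one missing value.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 100, np.nan]
n_out = count_outliers(sample)
print(n_out)
```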
Check for correlation
Checking for correlation between features is necessary to get a good fit from the models.
The graph shows many bright spots, which means that many features are correlated with each other.
Feature Engineering:
NaN values relation with target:
There are many NaN values, and the fact that a value is missing may itself be related to the target variable. We will check this relation using the Hamming distance.
For each feature, NaN values are set to 0 and all other values are set to 1. We then calculate the Hamming distance between this indicator and the target values.
If the distance is around 1000, the feature's NaN pattern is treated as important. The same applies if the distance is around 59000.
y = train.target
hamming_distance = []
for i in tqdm(range(len(train.columns))):
    diff = 0
    series = np.zeros(len(train))
    for j in range(len(train)):
        # indicator: 1 where the value is present, 0 where it is NaN
        if not np.isnan(train.iloc[j, i]):
            series[j] = 1
        # note: this counts the positions where the indicator equals the target
        if series[j] - y[j] == 0:
            diff += 1
    hamming_distance.append(diff)

save_cols = []
for i in range(len(hamming_distance)):
    if hamming_distance[i] < 1500 or hamming_distance[i] > 58500:
        # the first two columns are skipped
        if i > 1:
            save_cols.append(train.columns[i])
save_cols contains the names of the features whose NaN pattern is valuable. We therefore create new indicator features from the NaN values of these features.
train['new_val1'] = np.where(train['sensor1_measure'].isnull(), 0, 1)
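For reference, the comparison between a missingness indicator and the target can be written in a vectorized way. The sketch below counts the positions where the two differ, which is the usual definition of Hamming distance; `missingness_hamming`, `target` and the toy columns are hypothetical names and values.

```python
import numpy as np


def missingness_hamming(column, target):
    """Hamming distance between the 'value present' indicator and the target."""
    present = (~np.isnan(column)).astype(int)  # 1 = value present, 0 = NaN
    return int(np.sum(present != np.asarray(target)))


# Hypothetical toy data: the first column is present only when the target is 1,
# the second column is present only when the target is 0.
target = np.array([0, 0, 0, 1, 0, 1])
tracks_target = np.array([np.nan, np.nan, np.nan, 3.2, np.nan, 1.7])
mirrors_target = np.array([1.0, 2.0, 3.0, np.nan, 4.0, np.nan])

dist_tracks = missingness_hamming(tracks_target, target)
dist_mirrors = missingness_hamming(mirrors_target, target)
print(dist_tracks, dist_mirrors)
```

One caveat with any such check: a column with no missing values at all has an all-ones indicator, so its distance to the target simply equals the number of negative rows. It may be worth confirming that a flagged column actually contains NaN values.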
Now that the information in the NaN values has been captured in new features, we can remove all features with more than 25% NaN values and mean-impute the remaining features with missing values.
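The drop-then-impute step could look roughly like the sketch below. It assumes a numeric NumPy array; `drop_and_impute` and the toy array are made up for illustration. In practice the column means would normally be computed on the training data and reused for the test data.

```python
import numpy as np


def drop_and_impute(X, max_missing=0.25):
    """Drop columns with more than max_missing NaN, mean-impute the rest."""
    frac = np.isnan(X).mean(axis=0)
    keep = frac <= max_missing
    X_kept = X[:, keep].copy()
    col_means = np.nanmean(X_kept, axis=0)
    # Replace every remaining NaN with the mean of its own column.
    rows, cols = np.where(np.isnan(X_kept))
    X_kept[rows, cols] = col_means[cols]
    return X_kept, keep


# Hypothetical data: column 1 is 60% missing, column 2 is 20% missing.
X = np.array([
    [1.0, np.nan, 10.0],
    [2.0, np.nan, np.nan],
    [3.0, 7.0, 30.0],
    [4.0, np.nan, 40.0],
    [5.0, 9.0, 20.0],
])
X_clean, kept = drop_and_impute(X)
print(kept, X_clean[1])
```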
Removing highly correlated features:
If a pair of features has a correlation above 0.85, one feature from the pair is removed. This leads to cleaner data, which should help the boosting algorithms.
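# mat is assumed to be the feature correlation matrix (for example train.corr())
# and threshold the cut-off of 0.85 described above
indeces = []
for i in range(len(mat)):
    for j in range(len(mat)):
        # skip the first row/column and the diagonal
        if i != 0 and j != 0 and i != j:
            if abs(mat.iloc[i, j]) > threshold:
                indeces.append((i, j))

save = []    # column positions to keep
delete = []  # column positions to remove
for i in indeces:
    if i[0] in save:
        if i[1] not in delete:
            delete.append(i[1])
    elif i[0] in delete:
        if i[1] not in save:
            if i[1] not in delete:
                delete.append(i[1])
    elif i[1] in delete:
        delete.append(i[0])
    else:
        save.append(i[0])
        delete.append(i[1])

# translate column positions into column names
names = []
for i in delete:
    names.append(train.columns[i])
indeces = []
for i in names:
    train = train.drop(columns=i)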
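A simpler greedy variant of the same idea is sketched below. It is not the save/delete bookkeeping used above: it walks the columns in order and drops a column only if it is strongly correlated with a column that has already been kept. The function name and the synthetic columns are assumptions for illustration.

```python
import numpy as np


def correlated_columns_to_drop(X, threshold=0.85):
    """Greedy pass: keep a column unless it correlates strongly with a kept one."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept, dropped = [], []
    for j in range(corr.shape[0]):
        if any(corr[j, k] > threshold for k in kept):
            dropped.append(j)
        else:
            kept.append(j)
    return dropped


rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = 2 * a + 0.01 * rng.normal(size=200)  # nearly a rescaled copy of a
c = rng.normal(size=200)                 # generated independently of a and b
X = np.column_stack([a, b, c])
to_drop = correlated_columns_to_drop(X)
print(to_drop)
```

With this rule, one feature from every highly correlated group always survives, which is the property the pairwise removal is aiming for.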
The heatmap after removing highly correlated features is shown below.
Scaling:
I have scaled the data so that it can be directly used on any model.
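The post does not say which scaler was used. Assuming standardization (zero mean, unit variance per feature), a minimal sketch could look like this. The statistics are computed on the training data only and then reused for the test data; `fit_standardizer`, `standardize` and the toy arrays are hypothetical.

```python
import numpy as np


def fit_standardizer(X_train):
    """Compute per-column mean and standard deviation on the training data."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return mean, std


def standardize(X, mean, std):
    """Apply previously fitted statistics to any array with the same columns."""
    return (X - mean) / std


# Hypothetical data: the second column is constant.
X_train = np.array([[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]])
X_test = np.array([[4.0, 10.0]])

mean, std = fit_standardizer(X_train)
X_train_scaled = standardize(X_train, mean, std)
X_test_scaled = standardize(X_test, mean, std)
print(X_test_scaled)
```

Tree-based models such as Random Forest and XGBoost are largely insensitive to this kind of scaling, while the neural network typically benefits from it.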
Modelling
Performance metric:
I use the F1 score as the performance metric. False positives and false negatives carry equal weight, since either one wastes valuable time for the business.
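To make the metric concrete, the sketch below computes the F1 score of one class from confusion counts and then the macro F1 as the plain average over both classes. The counts are made up for illustration and are not taken from the results in this post.

```python
def f1_from_counts(tp, fp, fn):
    """F1 score of one class from its true/false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(tp, fp, fn, tn):
    """Unweighted average of the F1 scores of the positive and negative class."""
    f1_pos = f1_from_counts(tp, fp, fn)
    # For the negative class the roles of the counts are mirrored:
    # true negatives act as hits, fn as false alarms, fp as misses.
    f1_neg = f1_from_counts(tn, fn, fp)
    return (f1_pos + f1_neg) / 2


# Hypothetical confusion counts for an imbalanced test set.
score = macro_f1(tp=80, fp=20, fn=20, tn=880)
print(round(score, 3))
```

Because the majority class is usually predicted almost perfectly, the macro F1 sits well above the F1 of the minority class. For example, if the precision of 0.90 and recall of 0.70 reported for Random Forest below refer to the failure class, its F1 is about 0.79, while the macro F1 is 0.89.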
Random Forest:
I tuned the hyperparameters of a Random Forest model over the grid below.
RF = RandomForestClassifier(n_jobs=-1, class_weight={0: 1, 1: 59})
param_grid = {'n_estimators': [200, 500, 1000, 2000],
              'max_depth': [5, 10, 20]}
CV_RF = GridSearchCV(estimator=RF, param_grid=param_grid,
                     cv=5, scoring='f1', verbose=1)
The best parameters were found to be max_depth=20 and n_estimators=1000.
The performance of the above model on the test data is as follows:
- Macro F1 score = 0.89
- Precision = 0.90
- Recall = 0.70
- 77 points out of 16001 points are misclassified.
XGBoost:
The XGBoost model is tuned over the parameters below.
param_grid = {
    'max_depth': [10, 20],
    'learning_rate': [0.1],
    'subsample': [0.5, 1],
    'colsample_bytree': [0.7, 1],
    'colsample_bylevel': [0.7, 1.0],
    'min_child_weight': [3.0, 5.0],
    'gamma': [1.0],
    'reg_lambda': [1.0],
    'n_estimators': [100, 200, 500, 1000, 2000]}
xgb_model = xgb.XGBClassifier(n_jobs=-1, scale_pos_weight=59)
randomized_search = RandomizedSearchCV(xgb_model, param_grid, n_iter=30,
                                       n_jobs=-1, cv=5, scoring='f1', random_state=42)
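A quick count shows why a randomized search is a reasonable choice for this grid. The sketch below multiplies the number of options per parameter and compares an exhaustive search with the 30 sampled candidates; the variable names `n_combinations`, `n_fits_random` and `n_fits_full` are only for illustration.

```python
from math import prod

param_grid = {
    'max_depth': [10, 20],
    'learning_rate': [0.1],
    'subsample': [0.5, 1],
    'colsample_bytree': [0.7, 1],
    'colsample_bylevel': [0.7, 1.0],
    'min_child_weight': [3.0, 5.0],
    'gamma': [1.0],
    'reg_lambda': [1.0],
    'n_estimators': [100, 200, 500, 1000, 2000],
}

# Total number of distinct parameter combinations in the grid.
n_combinations = prod(len(values) for values in param_grid.values())

n_iter, cv = 30, 5
n_fits_random = n_iter * cv         # model fits with the randomized search
n_fits_full = n_combinations * cv   # model fits an exhaustive search would need
print(n_combinations, n_fits_random, n_fits_full)
```

The randomized search therefore tries fewer than a fifth of the possible combinations, which keeps the tuning time manageable for models with up to 2000 trees.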
The best parameters are:
{'colsample_bylevel': 0.7,
 'colsample_bytree': 0.7,
 'gamma': 1.0,
 'learning_rate': 0.1,
 'max_depth': 10,
 'min_child_weight': 5.0,
 'n_estimators': 1000,
 'reg_lambda': 1.0,
 'subsample': 0.5}
The performance of the above model on the test data is as follows:
- Macro F1 score = 0.92
- Precision = 0.84
- Recall = 0.84
- 66 out of 16001 points are misclassified.
AdaBoost
The AdaBoost model is tuned for n_estimators using GridSearchCV. The best value of n_estimators was found to be 500.
The performance of the above model on the test data is as follows:
- Macro F1 score = 0.89
- Precision = 0.92
- Recall = 0.68
- 75 out of 16001 points are misclassified.
Decision Tree
The model is tuned for max_depth. The best depth was found to be 100.
The performance of the above model on the test data is as follows:
- Macro F1 score = 0.80
- Precision = 0.58
- Recall = 0.66
- 166 out of 16001 points are misclassified.
Custom Ensemble 1
In this custom ensemble,
- I have split the train data into two equal parts X1 and X2.
X1, X2, y1, y2 = train_test_split(X_train, y_train, test_size=0.5, random_state=42, stratify=y_train)
- Four samples d1, d2, d3 and d4 are then drawn from the first part. Each one is an independent stratified 25% sample of X1, so the same row can appear in more than one sample.
_, d1, _, dy1 = train_test_split(X1, y1, test_size=0.25, stratify=y1)
_, d2, _, dy2 = train_test_split(X1, y1, test_size=0.25, stratify=y1)
_, d3, _, dy3 = train_test_split(X1, y1, test_size=0.25, stratify=y1)
_, d4, _, dy4 = train_test_split(X1, y1, test_size=0.25, stratify=y1)
- Each of the four previously tuned models is trained on one of these samples.
RF_final = RandomForestClassifier(max_depth=20,n_estimators=1000,class_weight={0:1,1:59})
RF_final.fit(d1,dy1)
xgb_model = xgb.XGBClassifier(n_jobs=-1,scale_pos_weight=59, colsample_bylevel=0.7,colsample_bytree=1,gamma=1.0,learning_rate=0.1,max_depth=20,min_child_weight=5.0,n_estimators=1000,reg_lambda=1.0,subsample=0.5)
xgb_model.fit(d2, dy2)
AdaB = AdaBoostClassifier(n_estimators=500, random_state=0)
AdaB.fit(d3, dy3)
DTC = DecisionTreeClassifier(class_weight={0:1,1:59},max_depth=100)
DTC.fit(d4, dy4)
- Now we use the X2 data to predict the target with each of these models and combine the predictions into a single dataframe.
pred1 = RF_final.predict(X2)
pred2 = xgb_model.predict(X2)
pred3 = AdaB.predict(X2)
pred4 = DTC.predict(X2)
df_list = [pred1,pred2,pred3,pred4]
df = pd.DataFrame(df_list)
df = df.transpose()
- This dataframe is used to train a tuned XGBoost model.
xgb_model_F = xgb.XGBClassifier(n_jobs=-1,scale_pos_weight=59, colsample_bylevel=0.7,colsample_bytree=1,gamma=1.0,learning_rate=0.1,max_depth=20,min_child_weight=5.0,n_estimators=1000,reg_lambda=1.0,subsample=0.5)
xgb_model_F.fit(df, y2)
The performance of the above model on the test data is as follows:
- Macro F1 score = 0.79
- Precision = 0.47
- Recall = 0.82
- 221 out of 16001 points are misclassified.
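One property of the four `train_test_split` calls in this ensemble is worth spelling out: each call draws its 25% sample without repeating rows, but the calls are independent of each other, so the samples overlap and part of X1 is never used. The sketch below simulates this with plain random draws. It ignores stratification, and `n_rows` is a hypothetical size for X1.

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows = 1000                # hypothetical number of rows in X1
sample_size = n_rows // 4    # corresponds to test_size=0.25

# Four independent draws; rows are not repeated inside a single draw.
samples = [rng.choice(n_rows, size=sample_size, replace=False) for _ in range(4)]

no_repeats_within = all(len(set(s.tolist())) == sample_size for s in samples)

# Fraction of X1 rows that appear in at least one of the four samples.
union = set().union(*[set(s.tolist()) for s in samples])
coverage = len(union) / n_rows
print(no_repeats_within, round(coverage, 2))
```

In expectation the four samples cover about 1 - 0.75^4, or roughly 68%, of X1, and each base model sees only a quarter of it.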
Custom Ensemble 2
In this custom ensemble,
- I have split the train data into two equal parts X1 and X2.
X1, X2, y1, y2 = train_test_split(X_train, y_train, test_size=0.5, random_state=42, stratify=y_train)
- Each of the four previously tuned models is trained on X1.
RF_final = RandomForestClassifier(max_depth=20,n_estimators=1000,class_weight={0:1,1:59})
RF_final.fit(X1,y1)
xgb_model = xgb.XGBClassifier(n_jobs=-1,scale_pos_weight=59, colsample_bylevel=0.7,colsample_bytree=1,gamma=1.0,learning_rate=0.1,max_depth=20,min_child_weight=5.0,n_estimators=1000,reg_lambda=1.0,subsample=0.5)
xgb_model.fit(X1,y1)
AdaB = AdaBoostClassifier(n_estimators=500, random_state=0)
AdaB.fit(X1,y1)
DTC = DecisionTreeClassifier(class_weight={0:1,1:59},max_depth=100)
DTC.fit(X1,y1)
- Now we use the X2 data to predict the target with each of these models and combine the predictions into a single dataframe.
pred1 = RF_final.predict(X2)
pred2 = xgb_model.predict(X2)
pred3 = AdaB.predict(X2)
pred4 = DTC.predict(X2)
df_list = [pred1,pred2,pred3,pred4]
df = pd.DataFrame(df_list)
df = df.transpose()
- This dataframe is used to train a tuned XGBoost model.
xgb_model_F = xgb.XGBClassifier(n_jobs=-1,scale_pos_weight=59, colsample_bylevel=0.7,colsample_bytree=1,gamma=1.0,learning_rate=0.1,max_depth=20,min_child_weight=5.0,n_estimators=1000,reg_lambda=1.0,subsample=0.5)
xgb_model_F.fit(df, y2)
The performance of the above model on the test data is as follows:
- Macro F1 score = 0.82
- Precision = 0.54
- Recall = 0.84
- 174 out of 16001 points are misclassified.
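The whole stacking flow, including the prediction step on unseen data, can be sketched end to end as follows. This is only an illustration under several assumptions: the data is synthetic (`make_classification`), the base models use small untuned settings, XGBoost is left out of the base models, and a `LogisticRegression` is swapped in as the meta-model instead of XGBoost to keep the example light. The `meta_features` helper is a hypothetical name.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the sensor dataset.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Split the training data into two halves, as in the steps above.
X1, X2, y1, y2 = train_test_split(
    X_train, y_train, test_size=0.5, stratify=y_train, random_state=42)

# Base models are all trained on the first half.
base_models = [
    RandomForestClassifier(n_estimators=50, random_state=0),
    AdaBoostClassifier(n_estimators=50, random_state=0),
    DecisionTreeClassifier(max_depth=5, random_state=0),
]
for model in base_models:
    model.fit(X1, y1)


def meta_features(models, X):
    """One column of predictions per base model."""
    return np.column_stack([m.predict(X) for m in models])


# The meta-model is trained on base-model predictions for the second half.
meta_model = LogisticRegression()
meta_model.fit(meta_features(base_models, X2), y2)

# At test time the data goes through the base models first, then the meta-model.
test_pred = meta_model.predict(meta_features(base_models, X_test))
print(test_pred.shape)
```

Note that the meta-model only ever sees as many features as there are base models, each holding a 0/1 prediction, so there is a limit to how much it can add beyond the best base model. Feeding it predicted probabilities instead of hard labels is a common variation.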
Custom Ensemble 3
In this custom ensemble,
- I have split the train data into two equal parts X1 and X2.
X1, X2, y1, y2 = train_test_split(df_train, dfy_train, test_size=0.5, random_state=42, stratify=dfy_train)
- n samples d1, d2, d3...dn are then drawn from the first part. Each one is an independent stratified 25% sample of X1, so the samples can overlap.
- A tuned decision tree is trained on each of these n samples.
for i in range(n_estimators):
    _, d1, _, dy1 = train_test_split(X1, y1, test_size=0.25, stratify=y1, random_state=i)
    DTC = DecisionTreeClassifier(max_depth=50)
    DTC.fit(d1, dy1)
- Now we use the X2 data to predict the target with each of these models and combine the predictions into a single dataframe.
- This dataframe is used to train a tuned XGBoost model.
xgb_model = xgb.XGBClassifier(n_jobs=-1,scale_pos_weight=59, colsample_bylevel=0.7,colsample_bytree=1,gamma=1.0,learning_rate=0.1,max_depth=20,min_child_weight=5.0,n_estimators=500,reg_lambda=1.0,subsample=0.5)
xgb_model.fit(new_df, y2)
The best hyperparameters were found to be a decision tree max_depth of 50 and n_estimators = 3.
The performance of the above model on the test data is as follows:
- Macro F1 score = 0.83
- Precision = 0.43
- Recall = 0.78
- 249 out of 16001 points are misclassified.
ANN
The network has one hidden dense layer with 50 units and ReLU activation. The output layer is a dense layer with 2 units and sigmoid activation. The Adam optimizer is used with the binary_crossentropy loss function and a learning rate of 0.01.
new_df = pd.DataFrame(preds)
new_df_test = pd.DataFrame(pred_test)
new_df = new_df.transpose()
new_df_test = new_df_test.transpose()
The performance of the above model on the test data is as follows:
- Macro F1 score = 0.78
- Precision = 0.58
- Recall = 0.85
- 153 out of 16001 points are misclassified.
Failed FE and Models
Initially I deleted all the features with more than 20% missing data and median-imputed the remaining features with missing values. With this approach I lost the valuable features whose NaN values are related to the target variable. A tuned XGBoost model on this data gave a very poor score of 0.68.
Custom Ensemble 1 did not work well because of the lack of training data for each base model. To overcome this, I created Custom Ensemble 2, but the performance did not improve much.
Custom Ensemble 3 was created so that any number of estimators can be used, with the drawback that I could use only one type of model for all estimators. With only decision trees in the ensemble, the macro F1 score was 0.83, compared to 0.80 for the single decision tree model. Recreating Custom Ensemble 3 with XGBoost as the base model might give better results than the single XGBoost model, but this would require more computational power.
Conclusion
The new features have a large impact on model performance. The XGBoost model performs best, with a macro F1 score of 0.92.
Scope for improvement: Custom Ensemble 3 can be recreated using XGBoost instead of decision trees, which might result in better performance. This would require more computational power.