Problem Formulation
This project tackles a problem peculiar to the automotive industry. The business problem hinges on leveraging predictive analytics and machine learning techniques to anticipate vehicle component failures before they happen, which would help reduce downtime, enhance safety, and lower maintenance costs.
The business scenario is divided into three facets: first, predicting the insurance coverage for a vehicle based on specific characteristics; second, predicting the type of failure; and third, assigning vehicles to groups based on shared characteristics. The first facet matters because vehicles are expensive, and insurance protects both customers and dealerships by mitigating financial losses from accidents and component failures. The second facet enhances operational efficiency by minimizing unplanned downtime: maintenance activities can be scheduled before a failure occurs, ensuring continuous operation. It also reduces costs by addressing problems before they compound, since compounded damage and unforeseen, imminent repairs are usually far more expensive.
Finally, the safety of the vehicles' users is enhanced by ensuring that all components function efficiently and effectively. Achieving this builds customer satisfaction by giving buyers confidence that vehicles purchased from company XYZ are well-maintained and reliable.
Data Collection and Preparation
The dataset used for this scenario is artificial; I sourced a sample dataset from Kaggle. Ideally, such data would be generated by sensors installed in the vehicle: transmission sensors, brake sensors, engine sensors, environmental sensors, and so on. Engine sensors, for example, can monitor and report temperature, pressure, and oil levels.
The data attributes of the dataset include Vehicle_Model, Mileage, Maintenance_History, Reported_Issues, Odometer_Reading, Insurance_Premium, Tire_Condition, Battery_Status, and Need_Maintenance, among others. Ten of the attributes are descriptive data types (eight nominal and two ordinal), and ten are numerical (a mix of discrete and continuous).
However, while loading the dataset in Python, I encountered an issue: I was not able to read the CSV file, so I re-saved it with an Excel file extension (.xlsx).
Data Cleaning
After importing my dataset into Google Colab, I imported the pandas and NumPy libraries and loaded the data using pandas. My file was originally a CSV, but because it would not load, I changed it to an Excel file (.xlsx). To work with this format, I read the sheet into a DataFrame by calling `pd.DataFrame(data)` and assigning the result to a variable `df`.
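As a minimal sketch of that loading step (the file name here is an assumption for illustration, not the exact notebook code):

```python
import pandas as pd

# Load the Excel workbook; the CSV would not parse, so the file was
# re-saved with an .xlsx extension (file name assumed).
data = pd.read_excel("vehicle_maintenance_dataset.xlsx")

# pd.read_excel already returns a DataFrame, so this wrapping step,
# described in the text, is harmless but optional.
df = pd.DataFrame(data)
print(df.head())
```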
Furthermore, I cleaned the dataset, since datasets can be expected to contain null values, duplicated rows and columns, and so on. I checked for missing values using the `df.isna()` / `df.isnull()` methods and found several, indicated as `True` in the output shown in the image. I then summed all the missing values across the 26 columns, which came to 312, as shown in the image.
I replaced these missing values using the forward-fill method, sketched below.
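A rough sketch of the missing-value check and fill, assuming the `df` loaded above:

```python
# Count missing values per column, then in total (312 in the text).
print(df.isna().sum())
print(df.isna().sum().sum())

# Forward fill: propagate the last valid observation down each column.
df = df.ffill()
```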
Afterward, I used the `df.describe()` method (rather than `df.info()`, which serves a different purpose) to get a brief statistical overview of the vehicle dataset. Specifically: count shows the number of values (301); unique shows the number of distinct values in each column; top shows the most frequent value in each categorical column; and freq shows how often that most common value occurs in each of the 26 columns, as stated in the image.
Moving forward, I used the `df.info()` method to display a summary of the vehicle dataset. As the image shows, the index indicates 301 entries, the dtype is object, and memory usage is 63.5+ KB. I then applied `df.duplicated()` to check for duplicates in the dataset, and fortunately the output was False throughout, meaning there were no duplicates.
For completeness, I also demonstrated the `drop_duplicates()` method, which removes duplicate rows should any appear, and the `dropna()` method, which removes all remaining NaN values from the dataset.
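A compact sketch of these inspection and de-duplication steps:

```python
# Statistical overview: count, unique, top, and freq for each column.
print(df.describe(include="all"))

# Structural summary: entries, dtypes, and memory usage.
df.info()

# Check for duplicate rows (none were found in the text), then remove
# duplicates and any rows still containing NaN values.
print(df.duplicated().any())
df = df.drop_duplicates()
df = df.dropna()
```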
Exploratory Data Analysis
I performed an exploratory data analysis (EDA) on the vehicle maintenance dataset for the automotive manufacturer's predictive maintenance project. I used functions such as `df.describe()` and `df.info()` to get a statistical summary and a summary of the data types, and I performed some data type conversions.
Firstly, I converted the date columns of the dataset, Last Service Date and Warranty Expiry Date, to the datetime format. I also converted columns such as Vehicle Model, Maintenance History, Fuel Type, Transmission Type, Owner Type, Tire Condition, Brake Condition, and Battery Status to the category data type. There are several reasons for converting to a category data type, including memory efficiency, machine learning preprocessing, and data integrity. The most relevant for this project is that it enhances the interpretability of the visualizations: with categorical dtypes, the plots come out clear, distinct, and comprehensible. It also helped with the encoding process, since a random forest classifier, one of the models used here to classify whether a vehicle needs maintenance, cannot handle raw categorical variables directly; converting these columns to the category type made them straightforward to encode for the model. Finally, I converted several columns to numeric types: Reported Issues, Vehicle Age, Engine Size, Odometer Reading, Insurance Premium, Service History, Accident History, Fuel Efficiency, and Need Maintenance (IBM Knowledge Center, 2024; Brownlee, 2024).
I converted these columns to numeric types to ensure accuracy and compatibility with the machine learning algorithms, in this case regression (IBM Knowledge Center, 2024; Brownlee, 2024).
After the conversions, I displayed the new dataset and saved it into a new Excel file named cleaned_vehicle_maintenance_dataset.xlsx.
Below are the images of the code snippets and their results.
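As a hedged sketch of those conversions and the save step (the underscore-style column names are assumed from the attribute list and may differ in the actual file):

```python
import pandas as pd

date_cols = ["Last_Service_Date", "Warranty_Expiry_Date"]
cat_cols = ["Vehicle_Model", "Maintenance_History", "Fuel_Type",
            "Transmission_Type", "Owner_Type", "Tire_Condition",
            "Brake_Condition", "Battery_Status"]
num_cols = ["Reported_Issues", "Vehicle_Age", "Engine_Size",
            "Odometer_Reading", "Insurance_Premium", "Service_History",
            "Accident_History", "Fuel_Efficiency", "Need_Maintenance"]

# Dates to datetime, descriptive columns to category, the rest to numeric.
for col in date_cols:
    df[col] = pd.to_datetime(df[col])
df[cat_cols] = df[cat_cols].astype("category")
df[num_cols] = df[num_cols].apply(pd.to_numeric, errors="coerce")

# Save the cleaned dataset to a new Excel file.
df.to_excel("cleaned_vehicle_maintenance_dataset.xlsx", index=False)
```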
Data Visualization
Some fascinating insights can be drawn from the vehicle maintenance dataset.
Firstly, I visualized all the numerical columns of the vehicle maintenance dataset by plotting each as a histogram:

- **Mileage** shows a relatively uniform distribution, indicating that the vehicles span a wide range of mileage values.
- **Reported Issues** displays discrete values between 0 and 5 at roughly equal frequencies, implying varied reported-issue counts.
- **Vehicle Age** is nearly uniform across ages, implying a varied range of vehicle ages in the dataset.
- **Engine Size** is concentrated on a few discrete values, suggesting most vehicles fall into a few specific engine size categories.
- **Odometer Reading** is relatively uniform, showing a wide range of odometer readings.
- **Insurance Premium** is relatively uniform, implying a diverse range of premiums.
- **Service History** takes discrete values from 0 to 10 at nearly equal frequency.
- **Accident History** shows discrete values, suggesting accidents are concentrated in a few specific count categories.
- **Fuel Efficiency** is concentrated around certain values, indicating most vehicles fall into a few specific efficiency categories.
- **Need Maintenance** shows a binary distribution, with far more vehicles flagged as needing maintenance (1) than not (0), consistent with the class counts reported later.

To summarize: features like Mileage, Odometer Reading, and Insurance Premium have relatively uniform distributions, indicating a diverse dataset with no dominant range of values. Engine Size and Fuel Efficiency concentrate on certain values. Reported Issues, Service History, Accident History, and Need Maintenance take discrete values, which is useful for classification models such as a random forest classifier.
Moreover, I observed that features such as Mileage, Odometer Reading, and Insurance Premium are suitable for regression models, since they are continuous and uniformly distributed, and so could help predict values such as the insurance premium. Features such as Reported Issues, Service History, and Need Maintenance, being discrete, fit the classification model (the random forest classifier) for predicting maintenance needs.
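A small sketch of the histogram grid, reusing `num_cols` from the conversion sketch above:

```python
import matplotlib.pyplot as plt

# One histogram per numerical column, arranged in a grid.
df[num_cols].hist(bins=20, figsize=(14, 10))
plt.tight_layout()
plt.show()
```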
Secondly, I visualized the categorical columns of the vehicle maintenance dataset, plotting each as a bar chart, to uncover insights that would assist during the modeling phase:

- **Vehicle Model**: the dataset includes different vehicle types, each with a similar count, indicating a balanced dataset with respect to vehicle type.
- **Maintenance History**: vehicles are categorized as good, average, or poor, and the counts are fairly evenly distributed among these categories.
- **Fuel Type**: the dataset includes electric, diesel, and petrol vehicles, with nearly identical counts for each, indicating an even distribution.
- **Transmission Type**: the counts are very close, with a slight difference favoring one type over the other.
- **Owner Type**: the counts are evenly distributed among the different owner types.
- **Tire Condition**: the counts for each condition are almost identical, indicating an even distribution across the dataset.
- **Brake Condition**: quite similar to tire condition; the counts for each category are nearly equal.
- **Battery Status**: the counts are evenly spread across the different categories.
- **Need Maintenance**: indicates whether a vehicle needs maintenance (binary 1 and 0); most vehicles are flagged as needing maintenance, with a smaller count not needing it.
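A sketch of those bar charts, reusing `cat_cols` from the conversion sketch:

```python
import matplotlib.pyplot as plt

# One bar chart of value counts per categorical column.
fig, axes = plt.subplots(2, 4, figsize=(16, 8))
for ax, col in zip(axes.ravel(), cat_cols):
    df[col].value_counts().plot(kind="bar", ax=ax, title=col)
plt.tight_layout()
plt.show()
```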
Thirdly, I plotted a correlation matrix as a heatmap for the numeric columns of the vehicle maintenance dataset to derive some insights:

- **Reported Issues and Need Maintenance** show a moderate positive correlation between the number of reported issues and the need for maintenance.
- **Service History and Need Maintenance** show a weak positive correlation.
- **Accident History and Need Maintenance** show a very weak positive correlation.
- **Mileage** shows a very weak correlation with the other variables.
- **Vehicle Age** correlates very weakly with the other variables, implying that age alone is not a strong predictor of the other factors in the dataset.
- **Odometer Reading and Mileage** show a very weak positive correlation, a slight relationship, as odometer readings generally reflect total distance traveled.
- **Insurance Premium** correlates very weakly with the other variables, indicating that insurance cost is not strongly related to the other factors in the dataset.
- **Engine Size** has a very weak correlation with all other variables, indicating minimal linear relationships in this dataset.
- **Fuel Efficiency** shows a very weak correlation with the other variables, suggesting it has no strong linear relationship with them.
Below are the code snippets for the exploratory data analysis and their visualizations.
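As one hedged example, the heatmap could be produced like this:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise correlations of the numeric columns, annotated on a heatmap.
corr = df[num_cols].corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation matrix of numeric features")
plt.show()
```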
Model Selection and Implementation
LINEAR REGRESSION
In this project, I implemented linear regression to predict the insurance premium, an attribute that helps prioritize vehicles needing more attention because they are at higher risk; in other words, it helps optimize the allocation of maintenance resources. The dependent (target) variable is the insurance premium, and the independent variables are all the numerical columns except Insurance Premium: mileage, reported issues, vehicle age, engine size, odometer reading, service history, accident history, fuel efficiency, and need maintenance. The evaluation metrics are the mean squared error (MSE) and R-squared. The resulting model performs poorly, with a high MSE of 52,685,368.84 and a very low R-squared of 0.000010238. This implies that the chosen features do not adequately explain the variation in insurance premiums, making the model unreliable for prioritizing vehicles for coverage. For this reason, feature selection (or hyperparameter tuning) will be applied to try to improve the model's performance.
FEATURE SELECTION
Feature selection is the process of selecting the most relevant features for the model. After performing recursive feature elimination (RFE) on the linear regression model, the output was as follows. The array [5 8 6 1 2 3 9 7 4] gives the ranking of each feature, where lower values mean more important features: feature 4 is ranked 1, feature 5 is ranked 2, feature 6 is ranked 3, feature 9 is ranked 4, feature 1 is ranked 5, feature 3 is ranked 6, feature 8 is ranked 7, feature 2 is ranked 8, and feature 7 is ranked 9. The RFE run suggests that using only the top-ranked feature is ideal, meaning the model performs best with a single feature, and it selected Engine Size as the most important.
Next, I trained and evaluated the linear regression model using only Engine Size. The results are as follows: mean squared error of 52,693,064.09, root mean squared error of 7,258.999, R-squared of -0.000136, cross-validation RMSE of 7,223.231, and a coefficient for Engine Size of -0.132491.
To interpret the results: the mean squared error is large, meaning the predictions deviate significantly from the actual values, so the model does not perform well even after feature selection. The root mean squared error of roughly 7,259 likewise indicates substantial prediction errors. The negative R-squared implies the model performs worse than simply predicting the mean of the target variable, indicating a poor fit. The cross-validation RMSE shows consistent performance across different subsets of the data, but its high value again points to poor model performance. The coefficient means that for each unit increase in engine size, the predicted insurance premium decreases by approximately 0.132491 units. Overall, the model's performance is very poor.
To conclude, even after feature selection, the model did not perform as expected.
Below are the code snippets for the linear regression and the feature selection
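A hedged sketch of the regression and RFE steps (column names, split parameters, and the CV fold count are assumptions; `num_cols` comes from the conversion sketch above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_selection import RFE
from sklearn.metrics import mean_squared_error, r2_score

features = [c for c in num_cols if c != "Insurance_Premium"]
X, y = df[features], df["Insurance_Premium"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Baseline model on all numerical features.
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, pred))
print("R^2:", r2_score(y_test, pred))

# Recursive feature elimination down to a single feature.
rfe = RFE(LinearRegression(), n_features_to_select=1).fit(X_train, y_train)
print("Rankings:", rfe.ranking_)  # lower rank = more important
best = [f for f, keep in zip(features, rfe.support_) if keep]
print("Selected:", best)          # Engine Size, per the text

# Retrain on the selected feature and cross-validate the RMSE.
model1 = LinearRegression().fit(X_train[best], y_train)
pred1 = model1.predict(X_test[best])
print("RMSE:", np.sqrt(mean_squared_error(y_test, pred1)))
cv_rmse = -cross_val_score(LinearRegression(), X[best], y, cv=5,
                           scoring="neg_root_mean_squared_error").mean()
print("CV RMSE:", cv_rmse)
print("Coefficient:", model1.coef_[0])
```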
RANDOM FOREST CLASSIFIER (CLASSIFICATION)
Also in this project, I implemented a random forest classifier to predict whether a vehicle component needs maintenance, based on various features in the vehicle maintenance dataset. I used relevant features such as vehicle age, maintenance history, and reported issues; the dependent (target) variable is need_maintenance, and the independent variables are the remaining feature columns. The metrics used to evaluate the model are accuracy, precision, recall, and F1-score. The accuracy of 1.0 indicates that the model correctly classified every instance in the test set. A precision of 1.0 means every vehicle predicted to need maintenance actually did: the ratio of true positives to all predicted positives is perfect. A recall of 1.0 for both classes means the model identified all instances that truly need maintenance. An F1-score of 1.0, being the harmonic mean of precision and recall, confirms both are perfect. The support values show the test set contained 1,915 instances of class 0 and 8,085 instances of class 1 (0 meaning not needing maintenance, 1 meaning needing maintenance). In practical terms, with perfect recall and precision the model identifies all vehicles needing maintenance without misclassifying any that don't, so resources are not wasted on unnecessary maintenance and every vehicle requiring attention is promptly addressed.
In summary, the random forest classifier model is highly effective for classifying whether a vehicle needs maintenance. The scores also indicate the model will be a reliable tool for this vehicle maintenance system because it supports the objective of the project by anticipating vehicle component failures and optimizing maintenance operations. Since the model functions perfectly, there would be no need for hyperparameter tuning or feature selection.
Below are the code snippets for the random forest classifier(classification)
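A minimal sketch of the classification step (the encoding choice here, integer category codes, is an assumption; the original notebook may have encoded the categorical columns differently):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Encode the category-typed columns as integer codes for the model.
X = df[cat_cols].apply(lambda s: s.cat.codes)
y = df["Need_Maintenance"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Accuracy, precision, recall, F1, and support per class.
print(classification_report(y_test, clf.predict(X_test)))
```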
K-MEANS CLUSTERING (CLUSTERING)
Lastly, I implemented k-means clustering in this project to segment the vehicle maintenance data into clusters based on various numerical features, such as vehicle age and engine size. To determine the optimal number of clusters, I used the elbow method: k-means is run on the dataset for a range of values of k (the number of clusters), and for each k the sum of squared errors (SSE) is calculated. The SSE measures the compactness of the clusters; a lower value indicates that the clusters are dense and well defined (Towards Data Science).
With regard to this project, the SSE plot displays different values of k. From the curve below, the optimal number of clusters is around 3, because that is where the SSE curve begins to flatten out: adding more clusters beyond this point yields diminishing returns in reducing SSE. After determining the optimal number of clusters (3), k-means was applied to the scaled dataset to assign a cluster label to each data point. These clusters can help segment the vehicles into groups based on their maintenance needs and characteristics. Furthermore, I used PCA (Principal Component Analysis) to visualize the clusters formed by k-means and gain insight into the underlying patterns in the vehicle maintenance data. PCA reduces the data to two dimensions, making the complex high-dimensional data plottable: the plot shows the data points projected onto two principal components (PCA1 and PCA2), with colors distinguishing the clusters or vehicle groups. Cluster 0 is blue, cluster 1 is orange, and cluster 2 is green. The data points within cluster 0 share similar characteristics, suggesting they might have similar maintenance needs. This clustering can help identify patterns that indicate higher maintenance requirements and help prioritize maintenance efforts: vehicles in such a cluster can be predicted to be at higher risk and scheduled for maintenance as soon as possible.
Below are the code snippets for the normalization, the k means clustering as well as the visualizations
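A hedged sketch of the scaling, elbow curve, clustering, and PCA plot (the k range and feature list are assumptions):

```python
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Standardize the numerical features before clustering.
X_scaled = StandardScaler().fit_transform(df[num_cols])

# Elbow method: SSE (inertia) for k = 1..10.
ks = range(1, 11)
sse = [KMeans(n_clusters=k, n_init=10, random_state=42)
       .fit(X_scaled).inertia_ for k in ks]
plt.plot(ks, sse, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("SSE (inertia)")
plt.show()

# Fit with the chosen k = 3, then project to 2D with PCA for plotting.
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_scaled)
pcs = PCA(n_components=2).fit_transform(X_scaled)
plt.scatter(pcs[:, 0], pcs[:, 1], c=labels, cmap="viridis", s=10)
plt.xlabel("PCA1")
plt.ylabel("PCA2")
plt.title("K-means clusters (k = 3)")
plt.show()
```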
Model Evaluation
LINEAR REGRESSION
The mean squared error (MSE) for predicting insurance premiums was 52,685,368.84, indicating significant prediction error. The R-squared of 0.000010238 is very low, suggesting that the chosen features poorly explain the variation in insurance premiums.
The linear regression model for predicting insurance premiums performed poorly with a high mean squared error and an insignificant R-squared value. This implies that the selected numerical features do not sufficiently capture the variability in insurance premiums. To conclude, the model is unreliable for selecting the coverage for vehicles.
RANDOM FOREST CLASSIFIER (CLASSIFICATION)
The model attained an accuracy of 1.0, with perfect precision and recall both for the vehicles needing maintenance and for those that don't.
The random forest classifier effectively classified whether vehicle components needed maintenance, indicating a perfect performance with precision at 1.0, recall at 1.0, and accuracy at 1.0. This means that the model correctly identifies all vehicles requiring maintenance without misclassifying any non-maintenance vehicles. This accuracy ensures efficient allocation of maintenance resources, ensuring timely servicing of vehicles at risk.
K-MEANS CLUSTERING (CLUSTERING)
Firstly, the elbow method based on the SSE determined three clusters as displayed in the visualization. Afterward, the PCA visualized the clusters in 2D, displaying distinct groupings based on vehicle maintenance characteristics.
K-means clustering segmented the vehicle maintenance data into three distinct clusters, each representing a group of vehicles with similar maintenance needs. This segmentation helps identify patterns indicating higher maintenance requirements, facilitating proactive maintenance scheduling. The PCA visualization aids in understanding the clustering structure, revealing potential insights into which vehicle groups may require immediate attention.
Below are the results of each model.
COMPARISON OF MODELS AND IMPLICATIONS
Linear Regression and Random Forest Classifier
The linear regression model grappled with predictive accuracy and explanatory power, while the random forest classifier achieved perfect classification metrics. This indicates the advantage of ensemble methods like random forests over traditional regression when the outcome of interest is categorical.
Random Forest Classifier and K-means Clustering
The random forest classifier directly addresses classification needs for maintenance urgency, while the K-means clustering provides insight into broader patterns and grouping within the data, supporting strategic maintenance planning.
BUSINESS IMPLICATIONS
For the first facet and its corresponding model: had the linear regression model for predicting insurance premiums worked as expected, it would enable precise cost estimation for insurance coverage based on vehicle risk factors. It would aid in choosing insurance plans that balance coverage and cost, reducing the financial risks associated with vehicle accidents or failures.
Furthermore, the second facet and its corresponding model (random forest classifier accurately predicting maintenance needs) ensures proactive maintenance scheduling. This reduces downtime, optimizes vehicle availability, and supports continuous production operations.
Lastly, the third facet and its corresponding model: k-means clustering segments vehicles based on operational and performance characteristics, which helps allocate maintenance resources efficiently by grouping vehicles with similar needs together. It ensures resources are focused where they are most needed, optimizing maintenance efforts and reducing overall maintenance costs.
Conclusion and Recommendation
To conclude, this project implemented predictive analytics through three machine learning models: linear regression, a random forest classifier, and k-means clustering. These models provided valuable insights into the vehicle maintenance system.
The linear regression model did not effectively predict the insurance premium, which is essential for selecting suitable insurance coverage. Even after feature selection, the results showed the model remained ineffective at predicting the insurance premium.
The Random forest classifier model effectively predicted whether vehicle components needed maintenance, as the result indicates a perfect performance with precision at 1.0, recall at 1.0, and accuracy at 1.0.
The k-means clustering model worked effectively, as displayed in the visualization: it distinctly grouped vehicles based on their maintenance characteristics.
Recommendation
As a recommendation, I would enhance the feature engineering by refining the feature selection for the linear regression model to improve its predictive accuracy for the insurance premium, and I would refine the dataset to make sure it contains variables relevant to the problem being addressed.
Also, I would incorporate continuous model evaluation and improvement. In effect, I would encourage periodic retraining of machine learning models with updated data to ensure that they remain robust and accurate over time.
Lastly, I would ensure data integrity, consistency, and accessibility across all systems. This would maintain high-quality data inputs for accurate predictive modeling.
Project Reflection
My overall reflection on this project revolves around the problems I encountered, from finding a dataset to loading it and curating models to solve the problem.
First and foremost, locating a dataset that comprehensively covered the project's problem, a vehicle maintenance system, was challenging. The available datasets often lacked depth, necessitating extensive searches.
Next, I encountered technical difficulties while loading the CSV file into the Google Colab environment and had to change the file extension to .xlsx, an Excel file.
I identified and engineered features from the dataset, changing the data types of some columns to numerical or categorical. This required a solid understanding of the domain and iterative experimentation to capture the most predictive variables.
In curating a model, I encountered a hurdle when optimizing the performance of the linear regression model. Even after performing feature selection, the model failed to perform as expected.
BIBLIOGRAPHY
- Coursera (n.d.) Data Analytics Certificate Course. https://www.coursera.org/courses/data-analytics
- IBM. Predictive Analytics. https://www.ibm.com/topics/predictive-analytics (Accessed 15 June 2024)
- JavaTpoint. Data Analytics Tutorial. https://www.javatpoint.com/data-analytics-tutorial (Accessed 15 June 2024)
- Simplilearn. Exploratory Data Analysis (EDA). https://www.simplilearn.com/tutorials/data-analytics-tutorial/exploratory-data-analysis (Accessed 15 June 2024)
- IBM Knowledge Center. Introduction to Data Types and Field Properties. https://www.ibm.com/docs/en/db2/11.5?topic=elements-introduction-data-types-field-properties (Accessed 17 June 2024)
- Brownlee, J. Data Preparation for Machine Learning. https://machinelearningmastery.com/data-preparation-for-machine-learning/ (Accessed 17 June 2024)
- Montgomery, D. C., Peck, E. A. and Vining, G. G. (2012) Introduction to Linear Regression Analysis. 5th edn. Hoboken, NJ: Wiley.
- Breiman, L. (2001) Random forests. Machine Learning, 45(1), 5-32.
- Lloyd, S. (1982) Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2), 129-137.
- Towards Data Science. Introduction to K-means Clustering and the Elbow Method. https://towardsdatascience.com/introduction-to-k-means-clustering-and-the-elbow-method-72a6231c6a92 (Accessed 24 June 2024)