<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Gerhardt</title>
    <description>The latest articles on DEV Community by Gerhardt (@ludwig023).</description>
    <link>https://dev.to/ludwig023</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1259535%2F1535626c-78b6-4659-89de-7d1973c7f586.jpeg</url>
      <title>DEV Community: Gerhardt</title>
      <link>https://dev.to/ludwig023</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ludwig023"/>
    <language>en</language>
    <item>
      <title>Retail Company Data Analytics (Predicting Future Sales)</title>
      <dc:creator>Gerhardt</dc:creator>
      <pubDate>Mon, 01 Jul 2024 23:01:00 +0000</pubDate>
      <link>https://dev.to/ludwig023/retail-company-data-analytics-predicting-future-sales-73k</link>
      <guid>https://dev.to/ludwig023/retail-company-data-analytics-predicting-future-sales-73k</guid>
      <description>&lt;p&gt;In this project, I analyzed a dataset containing sales data for a retail company using linear algebra concepts. Moreover, I use linear algebra techniques to identify trends and patterns in sales figures. Apply algorithms, possibly regression analysis, to predict future sales based on historical data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DATA ACQUISITION AND PREPARATION&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The retail store dataset was curated personally around the following data attributes: Date, Sales Amount, Number of Products Sold, Marketing Expenditure, and Region. Sample datasets from Kaggle served as a template. The dataset is an Excel file containing 8 columns and 500 rows. The data under the customer ID attribute is descriptive, specifically nominal, as is the data under the date, location, and product_category_preferences attributes. The number of products sold and frequency attributes hold discrete data, while sales amount and marketing expenditure hold continuous data; both of these are subsets of numerical data.&lt;/p&gt;

&lt;p&gt;Below is an image showing the retail dataset in xlsx format.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjwki50ln45jetyaugu6d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjwki50ln45jetyaugu6d.png" alt="Image description" width="800" height="509"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Below is an image showing how the retail dataset was loaded into a dataframe.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ac344tami0jwb85pjso.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ac344tami0jwb85pjso.png" alt="Image description" width="800" height="586"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Cleaning&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After importing the retail dataset into Google Colab, I imported the pandas library as pd. This enabled me to load the dataset into a dataframe, stored in the variable 'data'.&lt;br&gt;
I then cleaned the retail dataset, a necessary step that ensures the data is of sufficient quality and consistency for analysis. I checked that the column names were correct using 'data.columns'. Afterwards, I converted the sales column to numeric by removing the dollar sign; the dollar sign attached to the figures makes them strings.&lt;br&gt;
Before performing this cleaning, I used 'data.info()' to inspect the retail dataset: it showed 500 entries, the data type of each column, and a memory usage of 31.4+ KB.&lt;br&gt;
After that, I converted the date column to datetime format. I also checked for missing values using data.isnull(); the result indicated that there were none. As an extra safeguard, I replaced any missing values with the column mean using the fillna method.&lt;/p&gt;
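
&lt;p&gt;The cleaning described above can be sketched as follows. This is a minimal illustration on a toy dataframe rather than the project's actual notebook; the column names and sample values are assumptions.&lt;/p&gt;

```python
import pandas as pd

# Toy stand-in for the retail dataset (the real one was loaded from an
# Excel file with pd.read_excel; these column names are assumed).
data = pd.DataFrame({
    "Date": ["2024-01-01", "2024-01-02", "2024-01-03"],
    "Sales Amount": ["$49.90", "$55.10", None],
    "Number of Products Sold": [7, 9, 8],
})

# Strip the dollar sign so the sales figures become numeric, not strings.
data["Sales Amount"] = (
    data["Sales Amount"].str.replace("$", "", regex=False).astype(float)
)

# Convert the date column to datetime.
data["Date"] = pd.to_datetime(data["Date"])

# Check for missing values, then fill any with the column mean.
missing_per_column = data.isnull().sum()
data["Sales Amount"] = data["Sales Amount"].fillna(data["Sales Amount"].mean())
```

&lt;p&gt;On this toy frame, the single missing sales value is replaced by the mean of the other two.&lt;/p&gt;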

&lt;p&gt;Below are images of the data cleaning steps performed on the retail dataset.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8h9qll7k4bn77luxpobd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8h9qll7k4bn77luxpobd.png" alt="Image description" width="800" height="498"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftn52xlcnfyafoq958nlv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftn52xlcnfyafoq958nlv.png" alt="Image description" width="800" height="199"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Exploratory Data Analysis&lt;/strong&gt;&lt;br&gt;
Moving forward, I performed exploratory data analysis (EDA) on the retail dataset, data generated by a retail company. For the first part of the EDA I used the data.describe() and data.info() methods. The data.describe() method derives descriptive statistics from the dataset, while data.info() reports information about the dataframe, including the index dtype, columns, non-null counts, and memory usage (pandas, 2023). I also performed some data type conversions.&lt;br&gt;
First, I converted the sales column to numeric by removing the dollar sign. Next, I converted the date column to datetime format.&lt;br&gt;
The data.describe() output shows the count, mean, min, max, percentiles, and standard deviation of each column. All columns have 500 non-null entries, indicating no missing values. For example, the mean of the Number of Products Sold column is 7.68 and that of Sales is 49.91; these and more descriptive statistics are displayed in the images below.&lt;br&gt;
The data.info() output indicates that the dataframe uses 31.4 KB of memory and lists the dtype of each column, as shown in the images below.&lt;/p&gt;
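
&lt;p&gt;For reference, the two summary calls can be reproduced on any dataframe like so (the toy columns below are assumptions, not the project data):&lt;/p&gt;

```python
import io
import pandas as pd

# Toy stand-in for the cleaned retail dataframe.
df = pd.DataFrame({
    "Sales": [49.9, 55.1, 52.5, 48.0],
    "Number of Products Sold": [7, 9, 8, 6],
})

# describe(): count, mean, std, min, quartiles and max per numeric column.
stats = df.describe()

# info(): index dtype, column dtypes, non-null counts and memory usage.
buffer = io.StringIO()
df.info(buf=buffer)
summary_text = buffer.getvalue()
```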

&lt;p&gt;&lt;strong&gt;Data Visualization&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For the next aspect of the EDA, I visualized a histogram of each numeric column as well as a pairplot to see the relationships between the numeric variables.&lt;/p&gt;
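
&lt;p&gt;A minimal sketch of that visualization step, using pandas and matplotlib (seaborn's pairplot is a common alternative for the pairwise view; the columns here are assumed toy data):&lt;/p&gt;

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import pandas as pd
from pandas.plotting import scatter_matrix

# Toy stand-in for the numeric columns of the retail dataframe.
df = pd.DataFrame({
    "Sales": [49.9, 55.1, 52.5, 48.0, 60.2],
    "Marketing Expenditure": [10.0, 12.5, 11.0, 9.5, 14.0],
})

# One histogram per numeric column.
hist_axes = df.hist(figsize=(8, 4))

# Pairwise scatter plots between the numeric variables.
pair_axes = scatter_matrix(df, figsize=(6, 6))
```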

&lt;p&gt;Below are images showing the code snippets of the Exploratory Data Analysis.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyl8x1n179s3imysnsbzb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyl8x1n179s3imysnsbzb.png" alt="Image description" width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftmsc1hx3ccb47m3swax2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftmsc1hx3ccb47m3swax2.png" alt="Image description" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0r6xnr9d0yyvhfzw1i61.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0r6xnr9d0yyvhfzw1i61.png" alt="Image description" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq7fyp2ebxmle3mlc8mfr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq7fyp2ebxmle3mlc8mfr.png" alt="Image description" width="800" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpfxkwt0hcqcfh4p611kk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpfxkwt0hcqcfh4p611kk.png" alt="Image description" width="800" height="452"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Linear Algebra &amp;amp; Model Training&lt;/strong&gt;&lt;br&gt;
In this sales prediction project, I applied a couple of linear algebra concepts. Linear algebra is a branch of mathematics concerned with solving systems of linear equations in a finite number of unknowns (Schilling, Nachtergaele, and Lankham, n.d.). The concepts implemented are the covariance matrix and singular value decomposition. A covariance matrix represents the covariance of each pair of variables in multivariable data (Builtin, 2023). The singular value decomposition of a matrix factors it into the product of three matrices, A = UDV^T, where the columns of U and V are orthonormal and D is diagonal with positive real entries (Guruswami, n.d.).&lt;br&gt;
For this project, I converted the sales column to numeric because the attached dollar sign made it a string; it serves as the target variable. I created a variable X (in effect a new dataframe) containing all columns except those I dropped, and a target variable y containing the values of the sales column. Afterwards, I calculated the covariance matrix with cov_matrix = np.cov(X.T), using the NumPy library; the .T transposes the dataframe so that each row represents a variable. The covariance matrix measures how much each pair of features in X varies together.&lt;br&gt;
The singular value decomposition (SVD) is performed on the features of X and returns three matrices: U, a unitary matrix of left singular vectors; S, the non-negative singular values in decreasing order; and Vt, a unitary matrix of right singular vectors.&lt;/p&gt;
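
&lt;p&gt;The covariance matrix and SVD steps look roughly like this with NumPy (X below is random stand-in data, not the actual retail features):&lt;/p&gt;

```python
import numpy as np

# Stand-in feature matrix: 500 observations, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))

# Covariance matrix; .T transposes so each row represents one variable.
cov_matrix = np.cov(X.T)

# Singular value decomposition: U holds the left singular vectors, S the
# singular values in decreasing order, Vt the right singular vectors.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
```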

&lt;p&gt;Below are the code snippets for the covariance matrix and the SVD.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv1nc9yqq9zu4i0a5vx14.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv1nc9yqq9zu4i0a5vx14.png" alt="Image description" width="800" height="479"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F52hl8tl10n2kitt0wbog.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F52hl8tl10n2kitt0wbog.png" alt="Image description" width="800" height="485"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model Training&lt;/strong&gt;&lt;br&gt;
The machine learning model used to predict future sales is linear regression, a model in which independent variables are used to predict a continuous dependent variable.&lt;br&gt;
First, I split the retail dataset into training and testing sets. I then extracted the index of the marketing expenditure column to identify it among the features, created a list of feature columns, and dropped the sales column. A simple linear regression was then performed with marketing expenditure as the only predictor.&lt;br&gt;
Afterward, a multiple linear regression was performed using all features. Multiple linear regression is a statistical technique that models the relationship between a dependent variable and two or more independent variables (Frost, 2023), in effect an extension of simple linear regression.&lt;/p&gt;
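
&lt;p&gt;A sketch of the two models with scikit-learn, on synthetic data (the project used the real retail features; the generated X and y here are assumptions for illustration):&lt;/p&gt;

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in: column 0 plays the role of marketing expenditure.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=500)  # "sales" target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Simple linear regression: marketing expenditure as the only predictor.
simple = LinearRegression().fit(X_train[:, [0]], y_train)

# Multiple linear regression: all features as predictors.
multiple = LinearRegression().fit(X_train, y_train)
```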

&lt;p&gt;Below are the code snippets of the linear regression model and the multiple linear regression model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fswdoajoyruuqxlmqosz3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fswdoajoyruuqxlmqosz3.png" alt="Image description" width="800" height="490"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evaluation Metrics&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The R-squared for the simple regression (-0.005806533949882953) and for the multiple linear regression (0.008004018071779306) indicate that the models explain virtually none of the variability of the response data around the mean. Moreover, the negative value suggests that the simple model performs worse than a horizontal line at the mean of the dependent variable. The RMSE for the multiple linear regression is 29.246565814382784; this value (29.25) means that, on average, the model's predictions are off by 29.25. The coefficients of the multiple linear regression are -0.21182487 for feature 1, -0.00075948 for feature 2, and -0.27559952 for feature 3. Each value represents the change in the dependent variable for a unit change in the respective independent variable, and the negative coefficients indicate an inverse relationship between each feature and the target variable. The intercept of the multiple linear regression is 55.55450162269854; this value (55.55) represents the expected value of the dependent variable when all independent variables are zero.&lt;/p&gt;
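
&lt;p&gt;The metrics quoted above can be computed as follows (the y values below are made-up illustrations, not the project's actual predictions):&lt;/p&gt;

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true vs. predicted sales, purely to show the metric calls.
y_true = np.array([50.0, 55.0, 48.0, 60.0])
y_pred = np.array([52.0, 54.0, 50.0, 58.0])

r2 = r2_score(y_true, y_pred)                       # coefficient of determination
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # root mean squared error
```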

&lt;p&gt;&lt;strong&gt;Business Implication&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The results of the evaluation metrics have implications for the retail company's sales forecasting. The low and negative R-squared values indicate that the models do not fit the data well and have poor predictive capability, suggesting that the chosen features are not good predictors of the target variable; in effect, feature engineering will be required. The high RMSE confirms that the models' predictions are inaccurate.&lt;br&gt;
The models therefore need improvement: more relevant data should be collected, and feature selection should be applied so that the best features are chosen for the models.&lt;br&gt;
As it stands, the model is not reliable for predicting future sales for the retail company.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In conclusion, the sales prediction analysis for the retail company was carried out using linear algebra concepts, namely the covariance matrix and singular value decomposition. These concepts provided valuable insight before model training: the covariance matrix helped me understand which variables to include in the linear regression and multiple linear regression models based on their relationships with the target variable. In simple terms, the covariance matrix identified which variables have the strongest relationships with sales.&lt;br&gt;
The SVD helped reduce dimensionality and multicollinearity, improving the stability and performance of the regression models by focusing on the most significant components.&lt;br&gt;
The linear regression models provided some insight into the relationships between various factors and their impact on sales performance. After modeling, performance was checked using the R-squared values for accuracy and the root mean squared error (RMSE). The coefficients indicated the direction and magnitude of the relationships between the independent variables and the dependent variable, with negative coefficients indicating an inverse relationship.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Project Reflection&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This reflection on the sales prediction project covers some challenges I encountered, including anomalies in the dataset that did not suit the linear algebra concepts (covariance matrix and SVD), as well as the insights I derived and directions for future prediction.&lt;br&gt;
The dataset was fairly good, but it had anomalies that would have complicated the analysis: the sales column was recognized as a string because of the dollar sign attached to it. I had to strip the dollar sign from the figures to make the column numeric, since it was needed for the covariance matrix and SVD.&lt;br&gt;
The results of the linear and multiple regression models indicated that they are not good predictors of future sales. Thus, feature selection or hyperparameter tuning is essential to ensure the right columns are selected for training and testing the model.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Predictive Maintenance of vehicles in the Automotive Industry</title>
      <dc:creator>Gerhardt</dc:creator>
      <pubDate>Wed, 26 Jun 2024 13:33:05 +0000</pubDate>
      <link>https://dev.to/ludwig023/predictive-maintenance-of-vehicles-in-the-automotive-industry-46d5</link>
      <guid>https://dev.to/ludwig023/predictive-maintenance-of-vehicles-in-the-automotive-industry-46d5</guid>
      <description>&lt;p&gt;*&lt;em&gt;Problem Formulation *&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I delve into a problem peculiar to the automotive industry. The business problem hinges on leveraging predictive analytics and machine learning techniques to anticipate vehicle component failures before they happen, which would reduce downtime, enhance safety, and lower maintenance costs.&lt;br&gt;
The business scenario is divided into three facets: first, predict the insurance coverage for a vehicle based on specific characteristics; second, predict the type of failure; and lastly, apportion vehicles into groups based on certain characteristics. The first facet is essential because vehicles can be quite expensive and require insurance to protect both customers and dealerships; this coverage helps mitigate financial losses from accidents and vehicle part failure. The second facet enhances operational efficiency by minimizing unplanned downtime, scheduling maintenance activities before a failure occurs, and ensuring continuous production. It also reduces costs by addressing problems before they compound; the cost of compounded damage and unforeseen repairs is immensely reduced.&lt;br&gt;
Finally, the safety of vehicle users is enhanced by ensuring that all components function efficiently and effectively. Achieving this leads to customer satisfaction, giving customers the impression that vehicles purchased from company XYZ are well maintained and reliable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Collection and Preparation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The dataset used for this scenario is artificial; I resorted to Kaggle for a sample dataset. Ideally, data would be generated by sensors installed in the vehicle, such as transmission sensors, brake sensors, engine sensors, and environmental sensors. Engine sensors, for instance, can monitor and report temperature, pressure, oil levels, and so on.&lt;br&gt;
The data attributes of the dataset include Vehicle_Model, Mileage, Maintenance_History, Reported_Issues, Odometer_Reading, Insurance_Premium, Tire_Condition, Battery_Status, and Need_Maintenance. Among these, the first ten attributes hold descriptive data types, of which eight are nominal and two ordinal. A further ten attributes are numerical data types, with the second four being discrete and the last four continuous.&lt;/p&gt;

&lt;p&gt;However, while loading my dataset in Python, I encountered an issue: I was not able to read the CSV file, so I saved it with an Excel file extension (xlsx).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foa7p5i20n2gqp2faiw3n.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foa7p5i20n2gqp2faiw3n.jpeg" alt="Image description" width="800" height="515"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Cleaning&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After importing my dataset into Google Colab, I imported the pandas and NumPy libraries and loaded the data using pandas. Initially, my file was a CSV, but I had difficulty loading it, so I changed it to an Excel file (xlsx). To work with this format, I converted the sheet to a dataframe by creating a pandas instance, stored in the variable 'df', via 'pd.DataFrame(data)'.&lt;br&gt;
Furthermore, I cleaned the dataset, since datasets commonly contain null values, duplicated columns and rows, and so on. I checked for missing values using the 'df.isna()'/'df.isnull()' methods and found several, indicated as 'True' in the image. I then summed all the missing values in the dataset, which amounted to 312 across the 26 columns, as shown in the image.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3tpdwci8hser8aiajne1.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3tpdwci8hser8aiajne1.jpeg" alt="Image description" width="800" height="546"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ov5yiefdyvwd270k0xk.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ov5yiefdyvwd270k0xk.jpeg" alt="Image description" width="800" height="557"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, I replaced the missing values using the forward-fill method, as shown in the image.&lt;br&gt;
Afterward, I used the 'df.describe()' method to get a brief statistical overview of the vehicle dataset: the count, showing 301 values; the unique count of elements per column; the top, showing the most frequent categorical value; and freq, the frequency of that most common value in each of the 26 columns, as stated in the image.&lt;br&gt;
Moving forward, I used the 'df.info()' method to display a summary of the vehicle dataset: 301 index entries, a dtype of object, and memory usage of 63.5+ KB. I applied the 'df.duplicated()' method to find out whether there were duplicates in my dataset, and fortunately the output indicated False, meaning no duplicates.&lt;br&gt;
I also demonstrated the drop_duplicates method, used to remove duplicate values should the need arise, and the dropna method, which removes all 'NaN' values from the dataset.&lt;/p&gt;
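
&lt;p&gt;The missing-value and duplicate handling above can be sketched like this (toy columns and values assumed, not the real vehicle dataset):&lt;/p&gt;

```python
import pandas as pd

# Small stand-in for the vehicle maintenance dataframe.
df = pd.DataFrame({
    "Vehicle_Model": ["Truck", "SUV", "SUV", None],
    "Mileage": [42000, None, 35500, 61000],
})

# Total count of missing values across the whole frame.
missing_total = df.isnull().sum().sum()

# Forward-fill missing values, then drop any rows still containing NaN.
df = df.ffill().dropna()

# Flag duplicated rows and remove them, should any exist.
dupes = df.duplicated()
df = df.drop_duplicates()
```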

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpx2y0740l8mtvr6y9psg.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpx2y0740l8mtvr6y9psg.jpeg" alt="Image description" width="800" height="562"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Exploratory Data Analysis&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I performed exploratory data analysis on the vehicle maintenance dataset for the predictive maintenance project. I used functions such as df.describe() and df.info() to get a statistical summary and a summary of the data types, and I performed some data type conversions.&lt;br&gt;
First, I converted the date columns of the dataset, Last Service Date and Warranty Expiry Date, to datetime format. I also converted columns such as Vehicle Model, Maintenance History, Fuel Type, Transmission Type, Owner Type, Tire Condition, Brake Condition, and Battery Status to the category data type. There are several reasons for this conversion, including memory efficiency, machine learning preprocessing, and data integrity. The most relevant for this project is to enhance the interpretability of the visualizations: categorical dtypes produce clear, distinct, and comprehensible plots. The conversion also helped with the encoding process, since a random forest classifier, one of the models used to classify whether a vehicle needs maintenance, cannot handle categorical variables directly. Finally, I converted several columns to numeric: Reported Issues, Vehicle Age, Engine Size, Odometer Reading, Insurance Premium, Service History, Accident History, Fuel Efficiency, and Need Maintenance (IBM Knowledge Center, 2024; Brownlee, 2024).&lt;br&gt;
I converted these columns to ensure accuracy and compatibility with the machine learning algorithms, in this case regression (IBM Knowledge Center, 2024; Brownlee, 2024).&lt;br&gt;
After the conversions, I displayed the new dataset and saved it to a new Excel file named cleaned_vehicle_maintenance_dataset.xlsx.&lt;/p&gt;
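
&lt;p&gt;The data type conversions can be illustrated on a tiny assumed subset of the columns:&lt;/p&gt;

```python
import pandas as pd

# Illustrative subset of the vehicle dataset (values are made up).
df = pd.DataFrame({
    "Last_Service_Date": ["2023-05-01", "2023-08-15"],
    "Fuel_Type": ["Diesel", "Petrol"],
    "Odometer_Reading": ["42000", "61000"],
})

# Date columns to datetime format.
df["Last_Service_Date"] = pd.to_datetime(df["Last_Service_Date"])

# Descriptive columns to the memory-efficient category dtype.
df["Fuel_Type"] = df["Fuel_Type"].astype("category")

# Numeric-looking strings to real numbers for the models.
df["Odometer_Reading"] = pd.to_numeric(df["Odometer_Reading"])

# Save the converted frame, as in the project (commented out here).
# df.to_excel("cleaned_vehicle_maintenance_dataset.xlsx", index=False)
```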

&lt;p&gt;Below are the images of the code snippets and their results.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F648qq4d80hltti3pkrdc.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F648qq4d80hltti3pkrdc.jpeg" alt="Image description" width="800" height="487"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0r9eknxvol7c4icfoyxy.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0r9eknxvol7c4icfoyxy.jpeg" alt="Image description" width="800" height="1008"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F92g8nh6l8i0eby55qe2l.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F92g8nh6l8i0eby55qe2l.jpeg" alt="Image description" width="800" height="507"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Visualization&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The vehicle maintenance data analysis has some fascinating insights that can be drawn from its dataset.&lt;br&gt;
Firstly, I visualized all the numerical columns of the vehicle maintenance dataset by plotting each on the histogram. The values under the data attribute mileage show that there is a relatively uniform distribution which indicates that the vehicles in the dataset have a wide range of values in the mileage column. For the reported issues data attributes, the histogram visualization displays that discrete values between 0 and 5 are equally frequent. It implies that the values of reported issues vary. Vehicle age shows that the distribution is nearly uniform across different ages, which implies a varied range amongst the vehicle ages in the dataset. The histogram of the Engine size attribute shows that certain discrete values are concentrated, which gives the impression that most vehicles fall into a few specific engine size categories. The distribution of the Odometer reading is relatively uniform, showing a wide range of odometer readings. The Insurance premium histogram indicates a relatively uniform distribution implying a diverse range of insurance premiums. The data in the service history shows a discrete value from 0 to 10, with nearly equal frequency. However, accident history shows discrete values, suggesting that the accident history is distributed across a few specific efficiency categories. In fuel efficiency, there is a concentration around certain values, indicating that most vehicles fall into a few specific efficiency categories. Lastly, the histogram of need maintenance shows a binary distribution with a large number of vehicles not needing maintenance (0) compared to those that do (1). To summarize my observation, features like Mileage, Odometer reading, and Insurance premium have a relatively uniform distribution that indicates a diverse dataset with no preeminent range of values. Features such as Engine size and fuel Efficiency show that certain values are concentrated. 
Also, features such as Reported issues, Service history, Accident history, and Need maintenance show discrete values, which are useful for classification models (Random Forest Classifier).&lt;br&gt;
Moreover, I observed that continuous, uniformly distributed features such as Mileage and Odometer reading are suitable inputs for regression models, for example to predict a continuous target like the Insurance premium. Additionally, discrete features such as Reported issues, Service history, and Accident history are a good fit for a classification model (random forest classifier) predicting maintenance needs.&lt;br&gt;
Secondly, I visualized the categorical columns of the vehicle maintenance dataset to discover insights that would assist me during the modeling phase, plotting each categorical column on a bar chart. Initially, I discovered that the distribution of the vehicle model attribute includes different types of vehicles, each with a similar count, indicating a balanced dataset with respect to vehicle type. Under the maintenance history distribution, the vehicles are categorized by their maintenance history as good, average, or poor, and the counts are fairly evenly distributed among these categories, with each maintenance history type having a similar number of vehicles. For the distribution of fuel type, the dataset includes vehicles with different fuel types (electric, diesel, and petrol), and the counts for each fuel type are nearly identical, indicating an even distribution. The distribution of transmission type shows counts that are very close, with a slight difference favoring one type over the other. For the distribution of owner type, the counts are evenly distributed among the different owner types. For tire condition, the counts for each condition are almost identical, indicating an even distribution across the dataset, and the distribution of brake condition is quite similar, with the counts for each category nearly equal. For battery status, the distribution is evenly spread among the different categories. Lastly, the distribution of need maintenance indicates whether a vehicle needs maintenance or not (in binary, 1 and 0); most vehicles are flagged as needing maintenance, with a smaller count not needing it. &lt;br&gt;
Thirdly, I plotted a correlation matrix as a heatmap for the numeric columns of the vehicle maintenance dataset to derive some insights. Reported issues and need maintenance show a moderate positive correlation, while service history and need maintenance show only a weak positive correlation. Accident history and need maintenance show a very weak positive correlation. Mileage has a very weak correlation with every other variable, and the same is true of vehicle age, implying that age alone is not a strong predictor of the other factors in the vehicle maintenance dataset. Odometer reading and mileage show a slight positive relationship, as odometer readings generally reflect the total distance traveled. Insurance premium correlates very weakly with the other variables, indicating that insurance cost is not strongly related to the other factors in the dataset. Likewise, engine size and fuel efficiency each show very weak correlations with all other variables, suggesting minimal linear relationships within this dataset.&lt;/p&gt;
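&lt;p&gt;The EDA steps above can be sketched in Python. This is a minimal illustration on synthetic stand-in data (the column names here are hypothetical, chosen to mirror the attributes described), not the actual notebook:&lt;/p&gt;

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen; use the default backend in Colab
import matplotlib.pyplot as plt

# Synthetic stand-in for the vehicle maintenance dataset
# (hypothetical column names mirroring the attributes discussed above).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Mileage": rng.uniform(1_000, 200_000, 500),
    "Vehicle_Age": rng.integers(1, 15, 500),
    "Reported_Issues": rng.integers(0, 6, 500),
    "Need_Maintenance": rng.integers(0, 2, 500),
    "Fuel_Type": rng.choice(["Petrol", "Diesel", "Electric"], 500),
})

# Histograms of every numeric column
df.select_dtypes(include="number").hist(bins=20, figsize=(10, 8))
plt.tight_layout()

# Bar chart of a categorical column
df["Fuel_Type"].value_counts().plot(kind="bar")

# Correlation matrix of the numeric columns, drawn as a heatmap
corr = df.select_dtypes(include="number").corr()
plt.matshow(corr)
plt.colorbar()
```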

&lt;p&gt;Below are the code snippets for the exploratory data analysis and their visualizations. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fen51edz6zlyxloogqhlz.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fen51edz6zlyxloogqhlz.jpeg" alt="Image description" width="800" height="1022"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbi8s4a1a0pij577xekth.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbi8s4a1a0pij577xekth.jpeg" alt="Image description" width="800" height="496"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3wjul7b2jnt39ot9h6ya.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3wjul7b2jnt39ot9h6ya.jpeg" alt="Image description" width="800" height="1016"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model Selection and Implementation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;LINEAR REGRESSION&lt;br&gt;
In this project, I implemented linear regression to predict the insurance premium, an attribute that helps prioritize higher-risk vehicles needing more attention and thereby optimizes the allocation of maintenance resources. The dependent variable is the insurance premium, while the independent variables are all the numerical columns except the insurance premium itself: mileage, reported issues, vehicle age, engine size, odometer reading, service history, accident history, fuel efficiency, and need maintenance. The evaluation metrics are the mean squared error (MSE) and R-squared. The current linear regression model performs poorly, indicated by a high MSE of 52,685,368.84 and a very low R-squared of 0.000010238. This implies that the chosen features do not adequately explain the variation in insurance premiums, making the model unreliable for prioritizing vehicle coverage. For this reason, hyperparameter tuning or feature selection will be applied to try to improve the performance and efficiency of the model.&lt;/p&gt;
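&lt;p&gt;A minimal sketch of this regression setup with scikit-learn, using randomly generated stand-in features and premiums rather than the real dataset:&lt;/p&gt;

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Stand-in data: nine numeric features (as in the project) and a
# synthetic insurance premium target.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 9))
y = rng.normal(loc=30_000, scale=7_000, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)  # high MSE => large prediction errors
r2 = r2_score(y_test, y_pred)             # near zero => features explain little
```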

&lt;p&gt;FEATURE SELECTION&lt;br&gt;
Feature selection is the process of selecting the most relevant features for the model. [10] After performing feature selection (recursive feature elimination) on the linear regression model, the output is as follows. The array [5 8 6 1 2 3 9 7 4] gives the ranking of each feature, where lower values mean more important features: feature 4 is ranked 1, feature 5 is ranked 2, feature 6 is ranked 3, feature 9 is ranked 4, feature 1 is ranked 5, feature 3 is ranked 6, feature 8 is ranked 7, feature 2 is ranked 8, and feature 7 is ranked 9. Recursive feature elimination suggests that using only the single top-ranked feature is ideal, implying that the model performs best with just that feature. According to the ranking, ‘Engine Size’ was selected as the most important.&lt;br&gt;
Next, I trained and evaluated the linear regression model using only ‘Engine size’. The results are as follows: mean squared error 52,693,064.088, root mean squared error 7,258.999, R-squared -0.000136, cross-validation RMSE 7,223.231, and the coefficient of the feature ‘Engine size’ is -0.132491.&lt;br&gt;
To interpret the results: the mean squared error is large, suggesting that the predictions deviate significantly from the actual values, so the model does not perform well even after feature selection. The root mean squared error of about 7,259 indicates substantial errors in the predictions. The negative R-squared implies that the model performs worse than simply predicting the mean of the target variable, indicating a poor fit. The cross-validation RMSE indicates consistent performance across different subsets of the data; however, its high value again points to poor model performance. The feature coefficient means that for each unit increase in engine size, the predicted insurance premium decreases by approximately 0.132491 units. Overall, the performance of the model remains very poor.&lt;br&gt;
To conclude, even after feature selection, the model did not perform as expected. &lt;/p&gt;
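&lt;p&gt;The feature selection step above can be sketched with scikit-learn's RFE (recursive feature elimination), again on synthetic stand-in data:&lt;/p&gt;

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 9))   # nine numeric features, as in the project
y = rng.normal(size=500)

# Recursively eliminate features until one remains; ranking_ assigns
# rank 1 to the most important feature and higher ranks to the rest.
rfe = RFE(estimator=LinearRegression(), n_features_to_select=1).fit(X, y)
ranking = rfe.ranking_
best_feature = int(np.argmin(ranking))  # index of the rank-1 feature
```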

&lt;p&gt;Below are the code snippets for the linear regression and the feature selection.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm7boydbwlgjpy603nx2s.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm7boydbwlgjpy603nx2s.jpeg" alt="Image description" width="800" height="1016"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzothgcrxdsdy89q2ebdx.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzothgcrxdsdy89q2ebdx.jpeg" alt="Image description" width="800" height="1016"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RANDOM FOREST CLASSIFIER (CLASSIFICATION)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Also in this project, I implemented a random forest classifier to classify whether a vehicle needs maintenance based on various features in the vehicle maintenance dataset. I used relevant features such as vehicle age, maintenance history, reported issues, etc.; the target variable is need_maintenance. In other words, the independent variables are all the relevant columns except need_maintenance, while the dependent variable is need_maintenance. To evaluate the model, I considered metrics such as accuracy, precision, recall, and F1-score. The accuracy of 1.0 indicates that the model correctly classifies all instances in the test set. Precision of 1.0 means every positive prediction was actually positive, and recall of 1.0 for both classes means the model identified all instances that truly need maintenance. An F1-score of 1.0, being the harmonic mean of precision and recall, confirms perfect performance on this test set. Moreover, the support (the number of actual occurrences of each class in the test set) shows 1,915 instances of class 0 and 8,085 instances of class 1, where 0 means not needing maintenance and 1 means needing maintenance. With such high recall and precision, the model correctly identifies all vehicles needing maintenance and does not misclassify any vehicle that doesn't need it. This ensures that resources are not wasted on unnecessary maintenance and that all vehicles requiring attention are promptly addressed.&lt;br&gt;
In summary, the random forest classifier model is highly effective for classifying whether a vehicle needs maintenance. The scores also indicate the model will be a reliable tool for this vehicle maintenance system because it supports the objective of the project by anticipating vehicle component failures and optimizing maintenance operations. Since the model functions perfectly, there would be no need for hyperparameter tuning or feature selection.&lt;/p&gt;
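&lt;p&gt;A compact sketch of the classification step, on synthetic stand-in data with a learnable signal (the real project uses the dataset's features and need_maintenance as the target):&lt;/p&gt;

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Stand-in features and a binary target derived from them, so the
# classifier has real signal to learn (mimicking need_maintenance).
rng = np.random.default_rng(1)
X = rng.normal(size=(1_000, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y)

clf = RandomForestClassifier(n_estimators=100, random_state=1)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

acc = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)  # precision/recall/F1 per class
```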

&lt;p&gt;Below are the code snippets for the random forest classifier (classification).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvunakbt09rfh9x18jt1o.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvunakbt09rfh9x18jt1o.jpeg" alt="Image description" width="800" height="504"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;K-MEANS CLUSTERING (CLUSTERING)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Lastly, I implemented k-means clustering in this project to segment the vehicle maintenance data into clusters based on various numerical features, such as vehicle age, engine size, etc. To determine the optimal number of clusters, I utilized the elbow method: K-means is run on the dataset for a range of values of k (the number of clusters), and for each value of k the sum of squared errors (SSE) is calculated. The SSE measures the compactness of the clusters, where a lower value indicates that the clusters are dense and well-defined (Towards Data Science).&lt;br&gt;
With regard to this project, the sum of squared errors (SSE) plot displays the SSE for different values of k (number of clusters). From the curve below, the optimal number of clusters is around 3, because that is where the SSE curve begins to flatten out, indicating that adding more clusters beyond this point yields diminishing returns in terms of reducing SSE. After the optimal number of clusters (3) was determined, K-means was applied to the scaled dataset to assign a cluster label to each data point. These clusters can help segment the vehicles into groups based on their maintenance needs and characteristics. Furthermore, I utilized PCA (Principal Component Analysis) to visualize the clusters formed by K-means and gain insight into the underlying patterns in the vehicle maintenance data. PCA reduces the data to two dimensions, making complex high-dimensional data easier to visualize. In other words, the PCA plot shows the data points reduced to two principal components (PCA1 and PCA2), with colors distinguishing the clusters: cluster 0 is blue, cluster 1 is orange, and cluster 2 is green. The data points in cluster 0 share similar characteristics, suggesting they might have similar vehicle maintenance needs. This clustering can help identify patterns that indicate higher maintenance requirements and can help prioritize maintenance efforts, meaning vehicles in a particular cluster can be flagged as higher risk and scheduled for maintenance as soon as possible.&lt;/p&gt;
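&lt;p&gt;The normalization, elbow method, k-means fit, and PCA projection can be sketched as follows; synthetic blobs stand in for the scaled numeric vehicle features:&lt;/p&gt;

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Three synthetic blobs stand in for the numeric vehicle features.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 4)) for c in (0, 5, 10)])
X_scaled = StandardScaler().fit_transform(X)

# Elbow method: compute the SSE (inertia) for a range of k values.
sse = [KMeans(n_clusters=k, n_init=10, random_state=2).fit(X_scaled).inertia_
       for k in range(1, 9)]

# Fit the chosen k = 3 and assign a cluster label to each data point.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=2).fit(X_scaled)
labels = kmeans.labels_

# Reduce to two principal components for a 2-D cluster plot.
X_2d = PCA(n_components=2).fit_transform(X_scaled)
```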

&lt;p&gt;Below are the code snippets for the normalization and the k-means clustering, as well as the visualizations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6qe4ja6vzngt51ipx3wv.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6qe4ja6vzngt51ipx3wv.jpeg" alt="Image description" width="800" height="440"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjuu9kddrdaj4zs3hftt8.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjuu9kddrdaj4zs3hftt8.jpeg" alt="Image description" width="800" height="1014"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcka81kph2uboqxv7547f.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcka81kph2uboqxv7547f.jpeg" alt="Image description" width="800" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model Evaluation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LINEAR REGRESSION&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The mean squared error (MSE) for predicting insurance premiums was 52,685,368.84, indicating significant prediction error. The R-squared of 0.000010238 is very low, suggesting that the chosen features poorly explain the variation in insurance premiums.&lt;br&gt;
The linear regression model for predicting insurance premiums therefore performed poorly, with a high mean squared error and a negligible R-squared value. This implies that the selected numerical features do not sufficiently capture the variability in insurance premiums. To conclude, the model is unreliable for selecting vehicle coverage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RANDOM FOREST CLASSIFIER (CLASSIFICATION)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The model attained an accuracy of 1.0, with perfect precision and recall both for the vehicles needing maintenance and for those that don’t.&lt;br&gt;
The random forest classifier effectively classified whether vehicle components needed maintenance, indicating a perfect performance with precision at 1.0, recall at 1.0, and accuracy at 1.0. This means that the model correctly identifies all vehicles requiring maintenance without misclassifying any non-maintenance vehicles. This accuracy ensures efficient allocation of maintenance resources, ensuring timely servicing of vehicles at risk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;K-MEANS CLUSTERING (CLUSTERING)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Firstly, the elbow method based on the SSE determined three clusters as displayed in the visualization. Afterward, the PCA visualized the clusters in 2D, displaying distinct groupings based on vehicle maintenance characteristics.&lt;br&gt;
K-means clustering segmented the vehicle maintenance data into 3 distinct clusters, each representing a group of vehicles with similar maintenance needs. This segmentation helps identify patterns indicating higher maintenance requirements, facilitating proactive maintenance scheduling. The PCA visualization aids in understanding the clustering structure, revealing insights into the vehicle groups that may require immediate attention.&lt;/p&gt;

&lt;p&gt;Below are the results of each model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F54xbjqmevstun9vvnu4v.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F54xbjqmevstun9vvnu4v.jpeg" alt="Image description" width="800" height="781"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmvgtnz2oplxlka26eva4.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmvgtnz2oplxlka26eva4.jpeg" alt="Image description" width="800" height="314"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;COMPARISON OF MODELS AND IMPLICATIONS&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Linear Regression and Random Forest Classifier&lt;/strong&gt;&lt;br&gt;
The linear regression grappled with predictive accuracy and explanatory power, while the random forest classifier performed well with perfect classification metrics. This indicates the superiority of ensemble methods like random forest for classification tasks over traditional regression when dealing with categorical outcomes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Random Forest Classifier and K-means Clustering&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The random forest classifier directly addresses classification needs for maintenance urgency, while the K-means clustering provides insight into broader patterns and grouping within the data, supporting strategic maintenance planning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BUSINESS IMPLICATION&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For the first facet and its corresponding model: had the linear regression model for predicting insurance premiums worked as expected, it would enable precise cost estimation for insurance coverage based on vehicle risk factors. It would aid in choosing insurance plans that balance coverage and cost, reducing the financial risks associated with vehicle accidents or failures.&lt;br&gt;
Furthermore, the second facet and its corresponding model (the random forest classifier accurately predicting maintenance needs) ensures proactive maintenance scheduling. This reduces downtime, optimizes vehicle availability, and supports continuous operations.&lt;br&gt;
Lastly, the third facet and its corresponding model (k-means clustering segmenting vehicles based on operational and performance characteristics) helps allocate maintenance resources efficiently by grouping vehicles with similar needs together. It ensures resources are focused where they are most needed, optimizing maintenance efforts and reducing overall maintenance costs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion and Recommendation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To conclude, this project implemented predictive analytics through three machine learning models, namely linear regression, a random forest classifier, and k-means clustering. These models provided valuable insights into the vehicle maintenance system.&lt;br&gt;
The linear regression model did not effectively predict the insurance premium, which is essential for the selection of suitable insurance coverage. Even after the feature selection, the results showed that the model is not effective in predicting the insurance premium.&lt;br&gt;
The Random forest classifier model effectively predicted whether vehicle components needed maintenance, as the result indicates a perfect performance with precision at 1.0, recall at 1.0, and accuracy at 1.0.&lt;br&gt;
The K-means clustering model worked effectively, as displayed in the visualization: it distinctly grouped the vehicles based on their maintenance characteristics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recommendation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As a recommendation, I would enhance the feature engineering by refining the feature selection for the linear regression model to improve its predictive accuracy for the insurance premium. I would fine-tune my dataset to make sure it entails relevant variables for the problem to be addressed.&lt;br&gt;
Also, I would incorporate continuous model evaluation and improvement. In effect, I would encourage periodic retraining of machine learning models with updated data to ensure that they remain robust and accurate over time.&lt;br&gt;
Lastly, I would ensure data integrity, consistency, and accessibility across all systems. This would maintain high-quality data inputs for accurate predictive modeling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Project Reflection&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Overall reflection on my project would revolve around the problem I encountered from finding a dataset to loading the dataset and curating a model to solve the problem.&lt;br&gt;
First and foremost, locating the perfect dataset that comprehensively covered the project’s problem, and vehicle maintenance system was challenging. The available dataset often lacked depth, necessitating extensive searches.&lt;br&gt;
Next, I encountered technical difficulties while loading the CSV file into the Google Colab environment; I had to change the file extension to xlsx, an Excel file format.&lt;br&gt;
I identified and engineered features from the dataset. I changed the data types of some of the columns to both numerical and categorical. It required a deep understanding of the domain and iterative experimentation to capture the most predictive variables.&lt;br&gt;
In curating a model, I encountered a hurdle when optimizing the performance of the linear regression model. Even after performing feature selection, the model failed to perform as expected.&lt;/p&gt;
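&lt;p&gt;The dtype conversions mentioned above look roughly like this in pandas (hypothetical columns; the real notebook works on the full spreadsheet):&lt;/p&gt;

```python
import pandas as pd

# Hypothetical raw columns, loaded from the spreadsheet as strings.
df = pd.DataFrame({
    "Mileage": ["12000", "54000", "88000"],
    "Fuel_Type": ["Petrol", "Diesel", "Petrol"],
})

# Cast numeric-looking strings to numbers, and label-like columns
# to pandas' 'category' dtype, as in the feature-engineering step.
df["Mileage"] = pd.to_numeric(df["Mileage"])
df["Fuel_Type"] = df["Fuel_Type"].astype("category")
```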

&lt;p&gt;&lt;strong&gt;BIBLIOGRAPHY&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Coursera. (n.d.). Data Analytics Certificate Course. Retrieved from &lt;a href="https://www.coursera.org/courses/data-analytics" rel="noopener noreferrer"&gt;https://www.coursera.org/courses/data-analytics&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;IBM Predictive Analytics. &lt;a href="https://www.ibm.com/topics/predictive-analytics" rel="noopener noreferrer"&gt;https://www.ibm.com/topics/predictive-analytics&lt;/a&gt; (Accessed on 15 June 2024)&lt;/li&gt;
&lt;li&gt;JavaTpoint Data Analytics Tutorial. &lt;a href="https://www.javatpoint.com/data-analytics-tutorial" rel="noopener noreferrer"&gt;https://www.javatpoint.com/data-analytics-tutorial&lt;/a&gt; (Accessed on 15 June 2024)&lt;/li&gt;
&lt;li&gt;Simplilearn Exploratory Data Analysis (EDA). &lt;a href="https://www.simplilearn.com/tutorials/data-analytics-tutorial/exploratory-data-analysis" rel="noopener noreferrer"&gt;https://www.simplilearn.com/tutorials/data-analytics-tutorial/exploratory-data-analysis&lt;/a&gt; (Accessed on 15 June 2024)&lt;/li&gt;
&lt;li&gt;IBM Knowledge Center. Introduction to Data Types and Field Properties. &lt;a href="https://www.ibm.com/docs/en/db2/11.5?topic=elements-introduction-data-types-field-properties" rel="noopener noreferrer"&gt;https://www.ibm.com/docs/en/db2/11.5?topic=elements-introduction-data-types-field-properties&lt;/a&gt; (Accessed on 17 June 2024).&lt;/li&gt;
&lt;li&gt;Brownlee, J. Data Preparation for Machine Learning. &lt;a href="https://machinelearningmastery.com/data-preparation-for-machine-learning/" rel="noopener noreferrer"&gt;https://machinelearningmastery.com/data-preparation-for-machine-learning/&lt;/a&gt; (Accessed on 17 June 2024)&lt;/li&gt;
&lt;li&gt;Montgomery, D. C., Peck, E. A. and Vining, G. G. (2012) Introduction to Linear Regression Analysis. 5th edn. Hoboken, NJ: Wiley&lt;/li&gt;
&lt;li&gt;Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.&lt;/li&gt;
&lt;li&gt;Lloyd, S. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2), 129-137.&lt;/li&gt;
&lt;li&gt;Towards Data Science. Introduction to K-means Clustering and the Elbow Method. &lt;a href="https://towardsdatascience.com/introduction-to-k-means-clustering-and-the-elbow-method-72a6231c6a92" rel="noopener noreferrer"&gt;https://towardsdatascience.com/introduction-to-k-means-clustering-and-the-elbow-method-72a6231c6a92&lt;/a&gt; (Accessed on 24 June 2024)&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    <item>
      <title>Social Buzz Data Analytics and Visualisation Project</title>
      <dc:creator>Gerhardt</dc:creator>
      <pubDate>Thu, 18 Jan 2024 20:10:04 +0000</pubDate>
      <link>https://dev.to/ludwig023/accenture-data-analytics-and-visualisation-project-1l2n</link>
      <guid>https://dev.to/ludwig023/accenture-data-analytics-and-visualisation-project-1l2n</guid>
      <description>&lt;p&gt;INTRODUCTION&lt;/p&gt;

&lt;p&gt;In this project, I carried out an extensive analysis for Social Buzz, a social media and content creation company that has scaled quicker than anticipated and needs the help of an advisory firm to oversee its scaling process effectively.&lt;/p&gt;

&lt;p&gt;Due to the huge amount of data they create, collect, and analyse, they are now willing to bring in external expertise to help with its management.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DATA PREPARATION AND CLEANING&lt;/strong&gt;&lt;br&gt;
Upon reading the brief from Social Buzz, I understood the requirements to be delivered for this project: an audit of big data practices, recommendations for the IPO, and an analysis of popular content.&lt;/p&gt;

&lt;p&gt;I was provided with 7 datasets and a data model.&lt;br&gt;
So, firstly I used the data model to identify which datasets would be required to answer the business question, which is to figure out the top 5 categories with the largest popularity.&lt;br&gt;
Below is the image of the data model:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1jwjb78nwhxhrvzh4dl6.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1jwjb78nwhxhrvzh4dl6.jpeg" alt="Image description" width="800" height="532"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, I needed to ensure that the data was clean and ready for analysis.&lt;br&gt;
To clean the data, I removed rows that were irrelevant to the analysis or had missing values, changed the data type of some values within a column, and removed columns that were not relevant to this task. Afterwards, I joined the relevant columns from the Content dataset, and then the Reaction Types dataset, using the VLOOKUP formula.&lt;br&gt;
Here are images of the datasets in Microsoft Excel:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzlem51ljar8wtk8h0v5z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzlem51ljar8wtk8h0v5z.png" alt="Image description" width="800" height="600"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz367e2srqjk9xstx0sc4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz367e2srqjk9xstx0sc4.png" alt="Image description" width="800" height="600"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4vg9slt8bfpn5ixonecf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4vg9slt8bfpn5ixonecf.png" alt="Image description" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ANALYSIS&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The analysis was done to gain insights, answer some imperative questions, and build an understanding of the dataset.&lt;br&gt;
This was done with the help of clear visualizations.&lt;br&gt;
Before that, I’d like to reiterate the problem and what the analysis should find.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;PROBLEM&lt;/em&gt;&lt;br&gt;
Over 100,000 posts per day&lt;br&gt;
36,500,000 pieces of content per year&lt;/p&gt;
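&lt;p&gt;The yearly figure follows directly from the daily rate, which a one-line check confirms:&lt;/p&gt;

```python
posts_per_day = 100_000
posts_per_year = posts_per_day * 365
print(posts_per_year)  # 36500000 pieces of content per year
```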

&lt;p&gt;So the analysis should help us find Social Buzz's top 5 most popular categories of content.&lt;/p&gt;

&lt;p&gt;Below are images of the insights: the top five content categories and the top five categories by aggregate popularity score.&lt;/p&gt;
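&lt;p&gt;The ranking shown in these charts comes down to summing a popularity score per category and keeping the five largest totals. A minimal pandas sketch of that aggregation, using toy scores and category names rather than the real figures:&lt;/p&gt;

```python
import pandas as pd

# Toy reaction-score data; the real workbook aggregates a popularity
# score per category before ranking.
scores = pd.DataFrame({
    "category": ["studying", "healthy eating", "food", "technology",
                 "animals", "travel", "studying", "food"],
    "score": [70, 65, 60, 40, 35, 30, 30, 15],
})

# Sum scores per category, then keep the five largest aggregates
top5 = (scores.groupby("category")["score"]
              .sum()
              .nlargest(5))
print(top5)
```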

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjo5pwxmdame181dco7kh.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjo5pwxmdame181dco7kh.jpeg" alt="Image description" width="800" height="340"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm7c6mh4rmmvjpr5pljz6.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm7c6mh4rmmvjpr5pljz6.jpeg" alt="Image description" width="800" height="412"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2rn925dltqatyd2qmn4g.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2rn925dltqatyd2qmn4g.jpeg" alt="Image description" width="800" height="388"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;ANALYSIS - Studying and healthy eating are the two most common categories by aggregate score.&lt;br&gt;
This shows that people prioritize "cognitive" and "longevity" content the most.&lt;/p&gt;

&lt;p&gt;INSIGHTS - Food is the most common theme among the top 5, with&lt;br&gt;
"studying" ranking the highest. You could use this insight to create a campaign and partner with cognitive and mental development brands to boost user engagement.&lt;/p&gt;

&lt;p&gt;NEXT STEPS - This ad-hoc analysis is insightful, but it's time to take this analysis into larger-scale production for a real-time understanding of your business.&lt;/p&gt;

&lt;p&gt;This concludes the analysis, and I appreciate you taking the time to read it.&lt;/p&gt;

</description>
      <category>forage</category>
      <category>excel</category>
      <category>accenture</category>
      <category>analytics</category>
    </item>
  </channel>
</rss>
