
Siddhesh Shankar


EDA, Feature Engineering and ML Model Creation - 100,000 UK Used Car Data Set (Kaggle)

Introduction:

Learning the concepts of Exploratory Data Analysis and Machine Learning, as well as the life-cycle of a Data Science project, has not only helped me gain knowledge but also improved my ability to interpret data correctly.
Beyond that, it has helped me think rationally about how the process works: how the data is collected, processed, and analyzed to extract the insights that become crucial.

In this article, I talk about the Exploratory Data Analysis, Feature Engineering, and model creation that I carried out on the 100,000 UK Used Car data set on Kaggle.

UK Audi Used Car

Description of the data set:

We all know how people upgrade themselves by buying a new car; car ownership has been booming over the last decade. Since the data set covers used cars from the United Kingdom, it is worth noting that over 1.63 million new cars were registered there in 2020. For some people a car is not only for travelling but also a status upgrade.
This data set contains used-car listings from many car companies in the United Kingdom (UK). We chose to work with the Audi cars data.

Here's the link to my Kaggle notebook:

Audi Data Set- 96 % Accurate Model Creation & EDA - Kaggle

Let's start the project:

We always need to know what our data set is and which libraries the project needs. This is the first step towards "glory".

So, we import the libraries:

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # For visualizations and graph creations
%matplotlib inline
import seaborn as sns # For advanced visualizations and graph creations
from sklearn.preprocessing import LabelEncoder # For Feature Engineering Method - Label Encoding
from sklearn.preprocessing import MinMaxScaler # For Normalization
from sklearn.model_selection import train_test_split # For Splitting the data into train data and test data
from sklearn.ensemble import RandomForestRegressor # For Creation of Random Forest Regressor Model
from sklearn.linear_model import LinearRegression # For Creation of Linear Regression Model
from catboost import CatBoostRegressor # For Creation of CatBoost Regressor Model

# Libraries for calculating the metrics of the models we create:

from sklearn.metrics import mean_squared_error 
from sklearn.metrics import r2_score

After importing the libraries, we import the data set:

audidata = pd.read_csv("../input/used-car-dataset-ford-and-mercedes/audi.csv")
audidata.head()

We imported the Audi data records with the .read_csv() function of the Pandas library.

With the help of the .info() function we can get information about the columns in the data set: their data types, non-null counts, and so on.

audidata.info()
audidata.isna().sum()
audidata.shape

The data set contains 10,668 entries/records with 9 columns. So, the columns and their descriptions are as follows:

  • model --> Model Name.
  • year --> The registration year of the car.
  • price --> The price at which the used car will be sold.
  • transmission --> The transmission type i.e. manual, automatic or semi-automatic.
  • mileage --> The miles the used car has been driven.
  • fuelType --> The fuel type of the car i.e. petrol, diesel or hybrid.
  • tax --> The tax that will be applied on the selling price of that used car.
  • mpg --> The miles-per-gallon ratio, telling us how many miles the car can drive per gallon of fuel.
  • engineSize --> The engine size of the used car.
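
For a quick numerical summary of these columns (count, mean, quartiles, min/max), Pandas' .describe() is handy:

audidata.describe()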

There are no null records, which means that one type of noise is absent from this data set.

Outlier Removal:

Although we know there are no NA values/records in the data set, that doesn't mean the data contains no noisy data points. So removing outliers becomes very necessary. With the help of boxplots, outlier detection was done.

a_clean = audidata.copy() # working copy for cleaning (assumed; this step isn't shown in the article)
box1 = sns.boxplot(x = 'mileage', data = a_clean)

Boxplot showing Outliers in Mileage Column
We can see that one car has been driven more than 300,000 miles. This is an outlier, so we removed it by restricting the mileage data points to the range 0-200,000 miles.

a_clean = a_clean[a_clean['mileage'] < 200000]
print('We removed {} outliers!'.format(len(audidata) - len(a_clean)))

Boxplot showing Outliers in Mileage Column after cleaning
Removed Outlier Count

Similarly, we followed the same procedure for the other numerical columns.

box1 = sns.boxplot(x = 'tax', data = a_clean)

Boxplot showing Outliers in Tax Column
We can observe that there are outliers in the ranges £0-£100 and £500-£600. We removed these outliers from the tax column.

a_clean = a_clean[a_clean['tax'] < 500]
print('We removed {} outliers!'.format(len(audidata) - len(a_clean)))
box1 = sns.boxplot(x = 'tax', data = a_clean)

After cleaning the outliers, this is the boxplot for the tax column.
Boxplot showing outliers in Tax Column after Cleaning
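
Rather than judging every boxplot by eye, the same idea can be automated with the interquartile-range (IQR) rule. Below is a minimal sketch; the 1.5 × IQR cutoff is the usual convention, not the thresholds the notebook actually used:

# Keep only rows whose value lies within [Q1 - k*IQR, Q3 + k*IQR].
def remove_iqr_outliers(df, column, k = 1.5):
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return df[(df[column] >= lower) & (df[column] <= upper)]

# Example usage on the mpg column:
a_clean = remove_iqr_outliers(a_clean, 'mpg')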

After removing all the outliers from the numerical columns, we dive straight into Exploratory Data Analysis (EDA).

Exploratory Data Analysis (EDA):

To start with the EDA, we plotted a heatmap using the Seaborn library. We passed the correlation matrix of the cleaned data to the heatmap and used 'Reds' as the color mapping.

sns.heatmap(a_clean.corr(numeric_only = True), cmap = "Reds", annot = True) # numeric_only is needed on newer Pandas while string columns are still present
plt.title("Correlation HeatMap/ Matrix")
plt.show()

The .corr() function of the Pandas library uses Pearson's correlation as the default method. But we can of course change the method by passing its name like this:

a_clean.corr(method = 'kendall')

From the Correlation Matrix we get the following information:

  • There is a negative correlation between price and mileage. A car that has been driven more has a higher mileage and therefore a lower price, since the car is more used.

  • There is a negative correlation between mpg (miles per gallon) and price. Sports cars have a lower miles-per-gallon ratio whereas normal cars have a higher one, so the sports models are priced higher and the normal models lower.

  • There is a positive correlation between the price of the car and its engine size. People tend to pay more for cars with a larger engine.

  • There is a small positive correlation between tax and the price of the car. Cars with higher taxes on them are costlier: Total Price = Selling Price + VAT (tax applied).

After taking insights from the correlation matrix, we can also use histplots to check the skewness of the data. We can draw all six plots in a grid of two rows and three columns using plt.subplots(figsize = (12,10), nrows = 2, ncols = 3).

fig, axes = plt.subplots(figsize = (12,10), nrows = 2, ncols = 3)
sns.histplot(a_clean["year"], ax = axes[0,0])
sns.histplot(a_clean["mileage"], ax = axes[0,1])
sns.histplot(a_clean["tax"], ax = axes[0,2])
sns.histplot(a_clean["mpg"], ax = axes[1,0])
sns.histplot(a_clean["engineSize"], ax = axes[1,1])
sns.histplot(a_clean["price"], ax = axes[1,2])
plt.show()

Sub Plots (HistPlots)

  • The year column is left-skewed: most of the cars are from 2015 to 2020, with a tail of older cars.
  • The mileage column is right-skewed: most of the listed cars cluster at lower mileages, with a long tail of heavily driven cars.
  • For the engineSize column, most used cars have an engine size between 1.5 L and 2 L.
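
The skew direction can also be confirmed numerically with Pandas' .skew(), which returns a positive value for a right-skewed (long right tail) distribution and a negative value for a left-skewed one:

# Print the sample skewness of each numeric column.
for col in ['year', 'mileage', 'tax', 'mpg', 'engineSize', 'price']:
    print(f"{col}: skewness = {a_clean[col].skew():.2f}")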

We can explore the categorical column 'transmission' using Seaborn's countplot like this:

sns.countplot(x = "transmission", data = a_clean)
plt.title("Transmission Types")
plt.show()

Countplot showing transmission column
This countplot shows that around 4,000+ of the listed cars in the UK have a manual transmission, around 2,500+ an automatic transmission, and around 3,500+ a semi-automatic transmission.

Now to get the unique Model names that have been listed:

print(a_clean['model'].unique())

Unique Car Model Names Listed

Now, some very important insights were drawn from this data set.
Firstly,

sns.lineplot(x = "year", y = "tax", data = a_clean)
plt.title("The Taxes applied on the car based on the Number of Years old")
plt.show()

This lineplot shows the tax applied on the listed price of the used cars over the years.
Lineplot Year vs Tax

  • From this lineplot we can see that at least £150 of tax is applied to cars that are relatively new, i.e. 1-2 years old. In the UK, every car must pay road tax irrespective of whether it is used or new. There are some deviations in the taxes for cars of a particular age, from which we can form a hypothesis:
  • The taxes vary because of the type of the car as well: SUVs and sedans will have more tax applied to them.

Secondly, we plotted price vs the number of years old:

sns.lineplot(x = "year", y = "price", data = a_clean)
plt.title("The Price based on the number of Years old")
plt.show()

  • From this lineplot we can see that relatively new cars have higher prices, which is obvious because they have travelled less distance. But there are some deviations in the prices of cars that are around 4-5 years old: the sellers have tried to maximize their profit, and the buyers haven't seen through it logically and mathematically.

LinePlot showing Price vs year

With this we completed the Exploratory Data Analysis. Plenty of other plots and insights were taken from this data set, but I have mentioned the important ones above. After this, we went directly into Feature Engineering.

Feature Engineering:

We can see that this data set has a lot of categorical columns. We had to convert those categories into numbers so that only numbers are passed for computation during model creation.
So, we applied the concept of Label Encoding.
Label Encoding is an encoding method that converts the categorical values in a column into numbers such as 0, 1, 2, ..., one number per category.
We applied it to the categorical columns one by one:

encoder = LabelEncoder()
a_clean['model'] = encoder.fit_transform(a_clean['model'])
model_mapping = {index : label for index, label in enumerate(encoder.classes_)}
model_mapping

Label Encoding done on Model Column
So, after applying this to all the categorical columns, the result was this:
Label Encoding Result
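
For completeness, here is a minimal sketch of the same step written as a loop over the three categorical columns named in the data-set description (a sketch, not the notebook's exact code):

# Encode each categorical column with its own LabelEncoder and keep the
# fitted encoders so the label-to-number mappings can be recovered later.
encoders = {}
for col in ['model', 'transmission', 'fuelType']:
    enc = LabelEncoder()
    a_clean[col] = enc.fit_transform(a_clean[col])
    encoders[col] = enc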
Before applying the machine learning algorithms to this data set, a final touch was required so that all the numerical values would lie between 0 and 1. For this we used the MinMax Scaler, which transforms each feature into a given range via x' = (x - min) / (max - min). This is essentially normalization of the data set.
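
The article never shows the feature/target split that this step assumes, so here is a minimal sketch, assuming price is the target and every other (now numeric) column is a feature:

# Assumed split (not shown in the original notebook excerpt):
x = a_clean.drop('price', axis = 1)  # feature matrix
y = a_clean['price']                 # target: the selling price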

scaler = MinMaxScaler(copy = True, feature_range = (0,1))
X = scaler.fit_transform(x)
X[:10]

MinMax Scaler

Now, the best part of the project we all are waiting for.

Model Creation:

We split the data into two parts. The first part was used to train our machine learning models; the other part was used to test them and see how accurate they are.

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.35, random_state = 0) # X is the scaled feature matrix from the MinMaxScaler step
print("Shape of the x_train: ", x_train.shape)
print("Shape of the x_test: ", x_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of the y_test: ", y_test.shape)

We split the data into a 65% train set and a 35% test set.
We built three machine learning models:

Linear Regression Model:

First, a Linear Regression model, which turned out not to be very accurate.

LinearRegressionModel = LinearRegression(fit_intercept = True, normalize = True, copy_X = True, n_jobs = -1) # note: normalize was removed in scikit-learn 1.2; on newer versions, scale the features beforehand (as we did with MinMaxScaler)
LinearRegressionModel.fit(x_train, y_train)


print('Linear Regression Train Score is : ' , LinearRegressionModel.score(x_train, y_train))
print('Linear Regression Test Score is : ' , LinearRegressionModel.score(x_test, y_test))

print('----------------------------------------------------')
y_pred = LinearRegressionModel.predict(x_test)
print('Predicted Value for Linear Regression is : ' , y_pred[:10])
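
Since mean_squared_error was imported at the start but not used above, here is a quick sketch of reporting RMSE alongside the score, reusing y_test and y_pred from the block above:

# RMSE is in the same units as the target (price), which makes it easier
# to interpret than R^2 alone.
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print('Linear Regression RMSE is : ', rmse)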

We assumed a linear relationship between the target variable y (the price) and the feature variables x that affect it. We then applied Linear Regression, setting the parameters like this:

  • copy_X = True --> Set to True because we didn't want the values of X to be overwritten.
  • n_jobs = -1 --> The number of jobs used for the computation; -1 means all available processors are used.
  • fit_intercept = True --> We wanted the model to calculate the intercept, hence True.
  • normalize = True --> We wanted the regressors to be normalized before the regression was fitted (this parameter only has an effect when fit_intercept is True).

As we can see, we got a test score (R²) of only 80.236%.

There is a visible difference between the actual listed car price and the price predicted by the Linear Regression model, since the model's score is just 80.23%.
We weren't satisfied with this accuracy, so we decided to try a different model.

Random Forest Regressor Model:

This model performed better in terms of train as well as test prediction accuracy.

RandomForestRegressorModel = RandomForestRegressor(n_estimators=100,max_depth=11, random_state=33)
RandomForestRegressorModel.fit(x_train, y_train)

print('Random Forest Regressor Train Score is : ' , RandomForestRegressorModel.score(x_train, y_train))
print('Random Forest Regressor Test Score is : ' , RandomForestRegressorModel.score(x_test, y_test))
print('Random Forest Regressor No. of features are : ' , RandomForestRegressorModel.n_features_in_) # n_features_ was renamed to n_features_in_ in newer scikit-learn
print('----------------------------------------------------')

y_pred = RandomForestRegressorModel.predict(x_test)
print('Predicted Value for Random Forest Regressor is : ' , y_pred[:10])

We then applied the Random Forest Regressor model, setting the parameters like this:

  • n_estimators = 100 --> The number of trees in the forest; we used 100.
  • max_depth = 11 --> The maximum depth of each tree; we used 11.
  • random_state = 33 --> Controls both the randomness of the bootstrapping of the samples used when building trees and the sampling of the features considered at each split; we used 33.

A significant improvement over the Linear Regression model's accuracy: we achieved around 95.56% on the test data.
We can observe that there is very little visible difference between the actual listed price of the car and the price predicted by the Random Forest Regressor model, since it is 95.56% accurate.
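
As a side note, a fitted random forest also exposes feature_importances_, which shows which columns drive the predictions. A minimal sketch, assuming the feature/target split sketched earlier (price as target):

# Pair each importance with its column name and sort descending.
importances = pd.Series(RandomForestRegressorModel.feature_importances_,
                        index = a_clean.drop('price', axis = 1).columns)
print(importances.sort_values(ascending = False))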

CatBoost Regressor Model by Yandex:

catModel = CatBoostRegressor(verbose = 0, random_state = 33)
catModel.fit(x_train, y_train)
y_pred = catModel.predict(x_test)
r2 = r2_score(y_test, y_pred) # note: true values first, then predictions
print(f'CatBoost Regressor Model by Yandex r2 score : {r2:0.5f}')

We then applied the CatBoost Regressor model, setting the parameters like this:

  • verbose = 0 --> Suppresses the loss value printed at every training iteration.
  • random_state = 33 --> Fixes the random seed so the training is reproducible.

We achieved a whopping 96.002% (R² score) on the test data.

pricePredicted = pd.DataFrame({'Actual Price': y_test, 'Predicted Price': y_pred})
pricePredicted = pricePredicted.reset_index()
pricePredicted.head(5)


We can observe that there is very little visible difference between the actual listed price of the car and the price predicted by the CatBoost Regressor model, since it is 96% accurate.

Conclusion:

This was part of my Data Science Specialization project.
I also wanted to apply hyperparameter tuning, which would have improved the models' accuracy; I will apply hyperparameter-tuning concepts in another project soon.
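
For reference, here is a minimal sketch of what such tuning could look like with scikit-learn's GridSearchCV; the parameter grid below is purely illustrative, not a tuned result:

from sklearn.model_selection import GridSearchCV

# Illustrative grid around the values used above (an assumption).
param_grid = {
    'n_estimators': [100, 200, 500],
    'max_depth': [7, 11, 15],
}
search = GridSearchCV(RandomForestRegressor(random_state = 33),
                      param_grid, scoring = 'r2', cv = 5, n_jobs = -1)
search.fit(x_train, y_train)
print(search.best_params_, search.best_score_)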

Here is my LinkedIn Profile:
Siddhesh Shankar

Here is my GitHub Repository link:
100,000 UK Used Car Data Set

I hope that you enjoyed the code walkthrough and explanation.
Thanks for reading; you can reach me by email at barnali.siddhesh@gmail.com.
