Everlynn
Predicting Used Car Prices

Introduction

Welcome to week two of my 52-week blog challenge. Find the week one blog article here.

Today I will take you through this prediction project I have been working on.

Let's jump right in.

Context

Photo of a used car lot by Haidan on Unsplash

There is always a huge demand for used cars in developing economies such as Kenya and India. As sales of new cars have slowed in recent years, the pre-owned car market has continued to grow and is now larger than the new car market. Cars4U is a budding Indian tech start-up that aims to find a good strategy in this market.

Unlike new cars, where price and supply are fairly deterministic and managed by OEMs (Original Equipment Manufacturers), with dealership-level discounts coming into play only at the last stage of the customer journey, used cars are very different beasts, with huge uncertainty in both pricing and supply. Keeping this in mind, the pricing scheme of these used cars becomes important for growing in the market.

So how can a data scientist help the business streamline its pricing model? Well, you have to come up with a pricing model that can effectively predict the price of used cars and can help the business in devising profitable strategies using differential pricing. For example, if the business knows the market price, it will never sell anything below it.

Objectives

By the end of this blog, you will be able to:

  1. Explore and visualize the data.

  2. Build a model to predict the prices of the used cars.

  3. Generate a set of insights and recommendations that will help the business.

  4. Come up with an effective and easy-to-understand data story that will inform the business.

  5. Answer the key business question: "Which factors would affect the price of used cars?"

Data Dictionary

Photo by Romain Vignes on Unsplash

First a brief description of what a data dictionary is...

A data dictionary is a collection of names, definitions, and attributes about data elements that are used to explain what all the variable names and values in a dataset mean.

For this particular dataset, the dictionary is:

S.No.: Serial Number

Name: Name of the car which includes Brand name and Model name

Location: The location in which the car is being sold or is available for purchase (Cities)

Year: Manufacturing year of the car

Kilometers_Driven: The total kilometers driven by the previous owner(s)

Fuel_Type: The type of fuel used by the car. (Petrol, Diesel, Electric, CNG, LPG)

Transmission: The type of transmission used by the car. (Automatic / Manual)

Owner: Type of ownership

Mileage: The standard mileage offered by the car company in kmpl or km/kg

Engine: The displacement volume of the engine in CC.

Power: The maximum power of the engine in bhp.

Seats: The number of seats in the car.

New_Price: The price of a new car of the same model, in INR Lakhs (1 Lakh = INR 100,000)

Price: The price of the used car, in INR Lakhs (Target Variable)

Problem Formulation

You are trying to predict a quantity, therefore you have a regression problem, unlike a classification problem that predicts a label.
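As a minimal sketch of what that distinction means in code (the numbers below are made up purely for illustration, not drawn from this dataset), a regression model outputs a continuous quantity rather than a class label:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: price (in Lakhs) falls linearly with car age in years.
X = np.array([[1], [3], [5], [7]])   # feature: car age
y = np.array([9.0, 7.0, 5.0, 3.0])   # target: price

reg = LinearRegression().fit(X, y)

# A regression model predicts a number (here, a price), not a label.
print(round(reg.predict(np.array([[4]]))[0], 2))
```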

Solution

Step by step
Photo by Kaja Kadlecova on Unsplash

Due to the lengthy nature of this particular project post, I will divide it into a three-part article miniseries.

This first part will cover:

  1. Extraction, Transformation, and Loading (ETL) of the data.
  2. Exploratory Data Analysis (EDA)

Let me take you through the first portion of this solution to the business case.

1. First, you import the necessary libraries.

import warnings                                                  # Used to ignore the warning given as output of the code
warnings.filterwarnings('ignore')

import numpy as np                                               # Basic libraries of python for numeric and dataframe computations
import pandas as pd

import matplotlib.pyplot as plt                                  # Basic library for data visualization
import seaborn as sns                                            # Slightly advanced library for data visualization

from sklearn.model_selection import train_test_split             # Used to split the data into train and test sets.

from sklearn.linear_model import LinearRegression, Ridge, Lasso  # Import methods to build linear model for statistical analysis and prediction

from sklearn.tree import DecisionTreeRegressor                   # Import methods to build decision trees.
from sklearn.ensemble import RandomForestRegressor               # Import methods to build Random Forest.

from sklearn import metrics                                      # Metrics to evaluate the model

from sklearn.model_selection import GridSearchCV                 # For tuning the model

Remove the limit from the number of displayed columns and rows. (This step is optional)
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 200)

2. Now you explore the data (Extract, Transform, and Load - ETL)

Loading the data

Loading the data into Python to explore and understand it.

df = pd.read_csv("used_cars_data.csv")
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns.")  # f-string

df.head(10)  # displays the first ten rows

Result

There are 7253 rows and 14 columns.
S.No.   Name    Location    Year    Kilometers_Driven   Fuel_Type   Transmission    Owner_Type  Mileage Engine  Power   Seats   New_Price   Price
0   0   Maruti Wagon R LXI CNG  Mumbai  2010    72000   CNG Manual  First   26.6 km/kg  998 CC  58.16 bhp   5.0 NaN 1.75
1   1   Hyundai Creta 1.6 CRDi SX Option    Pune    2015    41000   Diesel  Manual  First   19.67 kmpl  1582 CC 126.2 bhp   5.0 NaN 12.50
2   2   Honda Jazz V    Chennai 2011    46000   Petrol  Manual  First   18.2 kmpl   1199 CC 88.7 bhp    5.0 8.61 Lakh   4.50
3   3   Maruti Ertiga VDI   Chennai 2012    87000   Diesel  Manual  First   20.77 kmpl  1248 CC 88.76 bhp   7.0 NaN 6.00
4   4   Audi A4 New 2.0 TDI Multitronic Coimbatore  2013    40670   Diesel  Automatic   Second  15.2 kmpl   1968 CC 140.8 bhp   5.0 NaN 17.74
5   5   Hyundai EON LPG Era Plus Option Hyderabad   2012    75000   LPG Manual  First   21.1 km/kg  814 CC  55.2 bhp    5.0 NaN 2.35
6   6   Nissan Micra Diesel XV  Jaipur  2013    86999   Diesel  Manual  First   23.08 kmpl  1461 CC 63.1 bhp    5.0 NaN 3.50
7   7   Toyota Innova Crysta 2.8 GX AT 8S   Mumbai  2016    36000   Diesel  Automatic   First   11.36 kmpl  2755 CC 171.5 bhp   8.0 21 Lakh 17.50
8   8   Volkswagen Vento Diesel Comfortline Pune    2013    64430   Diesel  Manual  First   20.54 kmpl  1598 CC 103.6 bhp   5.0 NaN 5.20
9   9   Tata Indica Vista Quadrajet LS  Chennai 2012    65932   Diesel  Manual  Second  

What you learn from the above:

  • S.No. is just an index for the data entry. In all likelihood, this column will not be a significant factor in determining the price of the car. Having said that, there are instances where the index of the data entry contains information about the time factor (an entry with a smaller index corresponds to data entered years ago).

Now check the info of the data.

df.info()

What you learn from the above:

  • Mileage, Engine, Power and New_Price are objects when they should ideally be numerical. To be able to get summary statistics for these columns, you will have to process them first.
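As an aside, a vectorized alternative for this kind of extraction (a sketch of one possible approach, not the method used in this post) is pandas' `str.extract` combined with `to_numeric`:

```python
import pandas as pd

# A sketch: pull the leading number out of strings like "26.6 km/kg"
# or "998 CC" in one vectorized pass (toy values shown).
s = pd.Series(["26.6 km/kg", "998 CC", None])
numeric = pd.to_numeric(s.str.extract(r"^(\d+\.?\d*)", expand=False))
print(numeric.tolist())
```

Missing values pass through as NaN, which mirrors the behavior of the loop-based processing below.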

Processing Columns

Process 'Mileage', 'Engine', 'Power' and 'New_Price' and extract numerical values from them.

1. Mileage

You have car mileage in two units, kmpl and km/kg.

After a quick internet search, it is clear that these two units are used for cars of two different fuel types.

kmpl - kilometres per litre - is used for petrol and diesel cars.

km/kg - kilometres per kilogram - is used for CNG and LPG-based engines.

You have the variable Fuel_Type in the data.
Check whether these observations hold true in the data as well.

# Create 2 new columns after splitting the mileage values.
km_per_unit_fuel = []
mileage_unit = []

for observation in df["Mileage"]:
    if isinstance(observation, str):
        if (
            observation.split(" ")[0]
            .replace(".", "", 1)
            .isdigit()  # First element should be numeric
            and " " in observation  # Space between numeric and unit
            and (
                observation.split(" ")[1]
                == "kmpl"  # Units are limited to "kmpl" and "km/kg"
                or observation.split(" ")[1] == "km/kg"
            )
        ):
            km_per_unit_fuel.append(float(observation.split(" ")[0]))
            mileage_unit.append(observation.split(" ")[1])
        else:
            # To detect if there are any observations in the column that do not follow
            # The expected format [number + ' ' + 'kmpl' or 'km/kg']
            print(
                "The data needs further processing. All values are not similar ",
                observation,
            )
    else:
        # If there are any missing values in the mileage column,
        # We add corresponding missing values to the 2 new columns
        km_per_unit_fuel.append(np.nan)
        mileage_unit.append(np.nan)
# No print output from the function above. The values are all in the expected format or NaNs
# Add the new columns to the data
df["km_per_unit_fuel"] = km_per_unit_fuel
df["mileage_unit"] = mileage_unit

# Checking the new dataframe
df.head(5)  # looks good!
# Check if the units correspond to the fuel types as expected.
df.groupby(by = ["Fuel_Type", "mileage_unit"]).size()

Result
Fuel_Type mileage_unit
CNG km/kg 62
Diesel kmpl 3852
LPG km/kg 12
Petrol kmpl 3325
dtype: int64

  • As expected, km/kg is for CNG/LPG cars and kmpl is for Petrol and Diesel cars.
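The same check can also be written as a crosstab, which lays the counts out as a table; sketched here on a small made-up frame mirroring the two columns:

```python
import pandas as pd

# Toy frame standing in for the Fuel_Type / mileage_unit columns.
df_toy = pd.DataFrame({
    "Fuel_Type": ["CNG", "Diesel", "Petrol", "LPG", "Diesel"],
    "mileage_unit": ["km/kg", "kmpl", "kmpl", "km/kg", "kmpl"],
})

# pd.crosstab gives the same pairing counts as groupby(...).size(),
# laid out with fuel types as rows and units as columns.
table = pd.crosstab(df_toy["Fuel_Type"], df_toy["mileage_unit"])
print(table)
```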
2. Engine

The data dictionary suggests that Engine indicates the displacement volume of the engine in CC. You will make sure that all the observations follow the same format - [numeric + " " + "CC"] and create a new numeric column from this column.

This time, use a regex to make all the necessary checks.

Regular Expressions, also known as “regex”, are used to match strings of text such as particular characters, words, or patterns of characters. It means that you can match and extract any string pattern from the text with the help of regular expressions.
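A quick illustration of `re.match` with the kind of pattern used below (one or more digits, an optional decimal part, a space, then the unit):

```python
import re

# Pattern: digits, optional decimal part, a space, then "CC".
pattern = r"^\d+(\.\d+)? CC$"

print(bool(re.match(pattern, "998 CC")))    # expected format -> True
print(bool(re.match(pattern, "998CC")))     # missing space -> False
print(bool(re.match(pattern, "null CC")))   # not numeric -> False
```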

# re module provides support for regular expressions
import re

# Create a new column after splitting the engine values.
engine_num = []

# Regex for the [numeric + " " + "CC"] format (raw string avoids escape warnings)
regex_engine = r"^\d+(\.\d+)? CC$"

for observation in df["Engine"]:
    if isinstance(observation, str):
        if re.match(regex_engine, observation):
            engine_num.append(float(observation.split(" ")[0]))
        else:
            # To detect if there are any observations in the column that do not follow the [numeric + " " + "CC"] format
            print(
                "The data needs further processing. All values are not similar ",
                observation,
            )
    else:
        # If there are any missing values in the engine column, we add missing values to the new column
        engine_num.append(np.nan)
# No print output from the function above. The values are all in the same format - [numeric + " " + "CC"] OR NaNs
# Add the new column to the data
df["engine_num"] = engine_num

# Checking the new dataframe
df.head(5)
3. Power

The data dictionary suggests that Power indicates the maximum power of the engine in bhp. You will make sure that all the observations follow the same format - [numeric + " " + "bhp"] and create a new numeric column from this column, like you did for Engine.

# Create a new column after splitting the power values
power_num = []

# Regex for the [numeric + " " + "bhp"] format (raw string avoids escape warnings)
regex_power = r"^\d+(\.\d+)? bhp$"

for observation in df["Power"]:
    if isinstance(observation, str):
        if re.match(regex_power, observation):
            power_num.append(float(observation.split(" ")[0]))
        else:
            # To detect if there are any observations in the column that do not follow
            # the [numeric + " " + "bhp"] format that we see in the sample output
            print(
                "The data needs further processing. All values are not similar ",
                observation,
            )
    else:
        # If there are any missing values in the power column, we add missing values to the new column
        power_num.append(np.nan)
  • You can see that some null values in the Power column exist as the string 'null bhp'. Let us replace these with NaNs.
power_num = []

for observation in df["Power"]:
    if isinstance(observation, str):
        if re.match(regex_power, observation):
            power_num.append(float(observation.split(" ")[0]))
        else:
            power_num.append(np.nan)
    else:
        # If there are any missing values in the power column, we add missing values to the new column
        power_num.append(np.nan)

# Add the new column to the data
df["power_num"] = power_num

# Checking the new dataframe
df.head(10)  # Looks good now
4. New_Price

You know that New_Price is the price of a new car of the same model in INR Lakhs (1 Lakh = 100,000).

This column clearly has a lot of missing values. You will impute the missing values later. For now you will only extract the numeric values from this column.

# Create a new column after splitting the New_Price values.
new_price_num = []

# Regex for the [numeric + " " + "Lakh"] format (note: this reuses the variable name from above)
regex_power = r"^\d+(\.\d+)? Lakh$"

for observation in df["New_Price"]:
    if isinstance(observation, str):
        if re.match(regex_power, observation):
            new_price_num.append(float(observation.split(" ")[0]))
        else:
            # To detect if there are any observations in the column that do not follow
            # the [numeric + " " + "Lakh"] format that we see in the sample output
            print(
                "The data needs further processing. All values are not similar ",
                observation,
            )
    else:
        # If there are any missing values in the New_Price column, we add missing values to the new column
        new_price_num.append(np.nan)

Running this, you will see that not all values are in Lakhs; a few observations are in Crores as well.

Convert these to Lakhs (1 Crore = 100 Lakhs).

new_price_num = []

# Lakh-format regex (same pattern as above, repeated so this snippet is self-contained)
regex_power = r"^\d+(\.\d+)? Lakh$"

for observation in df["New_Price"]:
    if isinstance(observation, str):
        if re.match(regex_power, observation):
            new_price_num.append(float(observation.split(" ")[0]))
        else:
            # The remaining values are in Crores; convert them to Lakhs
            new_price_num.append(float(observation.split(" ")[0]) * 100)
    else:
        # If there are any missing values in the New_Price column, we add missing values to the new column
        new_price_num.append(np.nan)

# Add the new column to the data
df["new_price_num"] = new_price_num

# Checking the new dataframe
df.head(5)  # Looks ok

3. Feature Engineering

The Name column in the current format might not be very useful in our analysis. Since the name contains both the brand name and the model name of the vehicle, the column would have too many unique values to be useful in prediction.

df["Name"].nunique()

Results
2041

  • With 2041 unique names, car names are not going to be great predictors of the price in our current data. But you can process this column to extract important information and see if that reduces the number of levels for this information.
1. Car Brand Name
# Extract Brand Names
df["Brand"] = df["Name"].apply(lambda x: x.split(" ")[0].lower())

# Check the data
df["Brand"].value_counts()
plt.figure(figsize = (15, 7))

sns.countplot(y = "Brand", data = df, order = df["Brand"].value_counts().index)

Resulting visualization.

A count plot showing Maruti as the most popular car brand name

2. Car Model Name
# Extract Model Names
df["Model"] = df["Name"].apply(lambda x: x.split(" ")[1].lower())

# Check the data
df["Model"].value_counts()
plt.figure(figsize = (15, 7))

sns.countplot(y = "Model", data = df, order = df["Model"].value_counts().index[0:30])

A count plot that shows Swift as the most popular car model name

  • It is clear from the above charts that the dataset contains used cars from luxury as well as budget-friendly brands.

  • You can create a new variable using this information. You will bin all the cars into 3 categories:

Budget-Friendly
Mid Range
Luxury Cars
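Once the price data is cleaned up, this binning could be sketched with `pd.cut`. The bin edges (in Lakhs) and the brand prices below are illustrative assumptions, not values derived from the dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical mean new-car prices (in Lakhs) for a few brands.
brand_price = pd.Series([4.5, 12.0, 48.3], index=["maruti", "toyota", "porsche"])

# Illustrative bin edges: under 10 Lakhs, 10-30 Lakhs, above 30 Lakhs.
car_category = pd.cut(
    brand_price,
    bins=[0, 10, 30, np.inf],
    labels=["Budget-Friendly", "Mid Range", "Luxury Cars"],
)
print(car_category.tolist())
```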

3. Car_category
df.groupby(["Brand"])["Price"].mean().sort_values(ascending = False)

Output
Brand
lamborghini 120.000000
bentley 59.000000
porsche 48.348333
land 39.259500
jaguar 37.632250
mini 26.896923
mercedes-benz 26.809874
audi 25.537712
bmw 25.243146
volvo 18.802857
jeep 18.718667
isuzu 14.696667
toyota 11.580024
mitsubishi 11.058889
force 9.333333
mahindra 8.045919
skoda 7.559075
ford 6.889400
renault 5.799034
honda 5.411743
hyundai 5.343433
volkswagen 5.307270
nissan 4.738352
maruti 4.517267
tata 3.562849
fiat 3.269286
datsun 3.049231
chevrolet 3.044463
smart 3.000000
ambassador 1.350000
hindustan NaN
opelcorsa NaN
Name: Price, dtype: float64

The output is very close to expectations (domain knowledge) in terms of brand ordering. The mean price of a used Lamborghini is 120 Lakhs, and the other luxury brands follow in descending order.

Towards the bottom end you have the more budget-friendly brands.

You can see that there is some missingness in the data. You could come back to creating this variable once you have removed the missingness from the data.

Exploratory Data Analysis

# Basic summary stats - Numeric variables
df.describe().T

Output

Image showing the summary statistics table of the data

Observations:

  1. S.No. has no interpretation here, but as discussed earlier, drop it only after having looked at the initial linear model.
  2. Kilometers_Driven values have an incredibly high range. You should check a few of the extreme values to get a sense of the data.
  3. The minimum and maximum number of seats in the car also warrant a quick check. On average a car seems to have 5 seats, which is expected.
  4. You have used cars being sold at less than a lakh rupees and as high as 160 Lakhs, as you saw for Lamborghini earlier. You might have to drop some of these outliers to build a robust model.
  5. The minimum Mileage being 0 is also concerning; you'll have to check what is going on there.
  6. Engine and Power mean and median values are not very different. Only someone with more domain knowledge would be able to comment further on these attributes.
  7. The New_Price range seems right. You have both budget-friendly Maruti cars and Lamborghinis in stock. The mean being twice the median suggests that there are only a few very high-range brands, which again makes sense.
# Check Kilometers_Driven extreme values
df.sort_values(by = ["Kilometers_Driven"], ascending = False).head(10)
  • It looks like the first row here is a data entry error. A car manufactured as recently as 2017 having been driven 6,500,000 km is almost impossible.

  • The other observations that follow are also on a higher end. There is a good chance that these are outliers. You'll look at this further while doing the univariate analysis.

# Check Kilometers_Driven Extreme values
df.sort_values(by = ["Kilometers_Driven"], ascending = True).head(10)
  • Looking at the Year, New_Price, and Price columns, these entries seem feasible.

  • 1000 might be the default value in this case. Quite a few cars having driven exactly 1000 km is suspicious.
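A quick way to quantify that suspicion is to count how many rows sit at exactly 1000 km; sketched here on a toy column standing in for df["Kilometers_Driven"]:

```python
import pandas as pd

# Toy stand-in for df["Kilometers_Driven"] (values made up).
km = pd.Series([1000, 41000, 1000, 72000, 1000])

# A spike at exactly 1000 would support the "default value" suspicion.
suspicious = (km == 1000).sum()
print(suspicious)  # -> 3
```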

# Check seats extreme values
df.sort_values(by = ["Seats"], ascending = True).head(5)

An Audi A4 having 0 seats is a data entry error. This column requires some outlier treatment, or you can treat Seats == 0 as a missing value. Overall, there doesn't seem to be much to be concerned about here.

# Let us check if we have a similar car in our dataset.
df[df["Name"].str.startswith("Audi A4")]

Looks like an Audi A4 typically has 5 seats.

# Let us replace #seats in row index 3999 from 0 to 5
df.loc[3999, "Seats"] = 5.0


# Check seats extreme values
df.sort_values(by = ["Seats"], ascending = False).head(5)


A Toyota Qualis has 10 seats and so does a Tata Sumo. No data entry error here.

# Check Mileage - km_per_unit_fuel extreme values
df.sort_values(by = ["km_per_unit_fuel"], ascending = True).head(10)

You will have to treat Mileage = 0 as missing values.
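Marking those zeros as missing can be sketched with `replace`; shown on a toy series here, while on the real frame it would target df["km_per_unit_fuel"]:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the km_per_unit_fuel column (values made up).
mileage = pd.Series([26.6, 0.0, 19.67, 0.0])

# A mileage of 0 is physically meaningless, so mark it as missing.
mileage = mileage.replace(0.0, np.nan)
print(int(mileage.isnull().sum()))  # -> 2
```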

# Check Mileage - km_per_unit_fuel extreme values
df.sort_values(by = ["km_per_unit_fuel"], ascending = False).head(10)

Maruti Wagon R and Maruti Alto CNG versions are budget-friendly cars with high mileage, so these data points are fine.

# Looking at value counts for non-numeric features

num_to_display = 10  # Defining this up here so it's easy to change later

for colname in df.dtypes[df.dtypes == "object"].index:
    val_counts = df[colname].value_counts(dropna = False)  # Will also show the NA counts

    print(val_counts[:num_to_display])

    if len(val_counts) > num_to_display:
        print(f"Only displaying first {num_to_display} of {len(val_counts)} values.")
    print("\n\n")  # Just for more space in between

Since you haven't dropped the original columns that you processed, you have a few redundant outputs here.
You had checked cars of different Fuel_Type earlier, but you did not encounter the 2 electric cars. Let us check why.

df.loc[df["Fuel_Type"] == "Electric"]

Mileage values for these cars are NaN, which is why you did not encounter them earlier with groupby.

Electric cars are very new in the market and very rare in our dataset. You can consider dropping these two observations if they turn out to be outliers later. There is a good chance that you will not be able to create a good price prediction model for electric cars, with the currently available data.

4. Missing Values

Before you start looking at the individual distributions and interactions, let's quickly check the missingness in the data.

# Checking missing values in the dataset
df.isnull().sum()
  • 2 Electric car variants don't have entries for Mileage.

  • Engine displacement information is missing for 46 observations, and maximum power is missing for 175 entries.

  • Information about the number of seats is not available for 53 entries.

  • New_Price, as you saw earlier, has a huge missing count: it is missing for 6,247 entries. You'll have to see if there is a pattern here, and explore whether you can impute these values or should drop the column altogether.

  • Price is also missing for 1,234 entries. Since Price is the response variable that you want to predict, you will have to drop these rows when you build a model; they cannot help in modeling or model evaluation. But while you are analyzing the distributions and doing missing value imputations, you will keep using information from these rows.
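One way to frame the impute-or-drop decision is the missing fraction per column; on the real frame this is just df.isnull().mean(), sketched here on made-up data:

```python
import numpy as np
import pandas as pd

# Toy frame: one column mostly present, one mostly missing.
df_toy = pd.DataFrame({
    "Seats": [5.0, np.nan, 5.0, 7.0],
    "New_Price": [np.nan, np.nan, np.nan, 8.61],
})

# Fraction of missing values per column; a very high fraction
# (like New_Price here) argues for dropping rather than imputing.
missing_frac = df_toy.isnull().mean()
print(missing_frac["New_Price"])  # -> 0.75
```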

# Drop the redundant columns.
df.drop(
    columns=["Mileage", "mileage_unit", "Engine", "Power", "New_Price"], inplace = True
)

To be continued...

Photo by Reuben Juarez on Unsplash

You have come to the end of part one. Part two will cover data visualization, bivariate distributions, and correlations between variables.

Here is the link to the source code.

Stay tuned! Like, save, and share your comments. Happy coding.
