<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Data Stories</title>
    <description>The latest articles on DEV Community by Data Stories (@data_stories).</description>
    <link>https://dev.to/data_stories</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F949037%2F31873213-e724-41d5-a9a2-fe7a88ba2d4a.jpg</url>
      <title>DEV Community: Data Stories</title>
      <link>https://dev.to/data_stories</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/data_stories"/>
    <language>en</language>
    <item>
      <title>Predicting Used Car Prices</title>
      <dc:creator>Data Stories</dc:creator>
      <pubDate>Thu, 08 Jun 2023 19:51:05 +0000</pubDate>
      <link>https://dev.to/data_stories/predicting-used-cars-prices-33kj</link>
      <guid>https://dev.to/data_stories/predicting-used-cars-prices-33kj</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Welcome to week two of my 52-week blog challenge. Find the week one blog article &lt;a href="https://dev.to/evedevtech/visualizing-temperature-variation-a-climate-spiral--35jg"&gt;here&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;Today I will take you through this prediction project I have been working on. &lt;/p&gt;

&lt;p&gt;Let's jump right in.&lt;/p&gt;

&lt;h2&gt;
  
  
  Context
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhlvs8x7b29xft68qw7iz.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhlvs8x7b29xft68qw7iz.jpg" alt="Photo of a used cars lot" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Photo by &lt;a href="https://unsplash.com/ko/@hydngallery?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Haidan&lt;/a&gt; on &lt;a href="https://unsplash.com/s/photos/car-lot?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There is always a huge demand for used cars in developing economies such as Kenya and India. As new-car sales have slowed in recent years, the pre-owned car market has continued to grow and is now larger than the new-car market. Cars4U is a budding Indian tech start-up that aims to find a good strategy in this market. &lt;/p&gt;

&lt;p&gt;Unlike new cars, where price and supply are fairly deterministic and managed by OEMs (Original Equipment Manufacturers), except for dealership-level discounts that come into play only at the last stage of the customer journey, used cars are very different beasts, with huge uncertainty in both pricing and supply. With this in mind, the pricing scheme of these used cars becomes important for growing in the market.&lt;/p&gt;

&lt;p&gt;So how can a data scientist help the business streamline its pricing? You have to come up with a model that can effectively predict the price of used cars and help the business devise profitable strategies using differential pricing. For example, if the business knows the market price, it will never sell anything below it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Objectives
&lt;/h2&gt;

&lt;p&gt;By the end of this blog, you will be able to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Explore and visualize the data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Build a model to predict the prices of the used cars.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Generate a set of insights and recommendations that will help the business.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Come up with an effective and easy-to-understand data story that will inform the business.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Answer the key business question: "Which factors would affect the price of used cars?"&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Data Dictionary
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj3nuod9ej8srooqph5lv.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj3nuod9ej8srooqph5lv.jpg" alt="Image description" width="800" height="531"&gt;&lt;/a&gt;&lt;br&gt;
Photo by &lt;a href="https://unsplash.com/@rvignes?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Romain Vignes&lt;/a&gt; on &lt;a href="https://unsplash.com/s/photos/dictionary?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First, a brief description of what a data dictionary is...&lt;/p&gt;

&lt;p&gt;A data dictionary is a collection of names, definitions, and attributes about data elements that are used to explain what all the variable names and values in a dataset mean.&lt;/p&gt;

&lt;p&gt;For this particular dataset, the dictionary is:&lt;/p&gt;

&lt;p&gt;S.No.: Serial Number&lt;/p&gt;

&lt;p&gt;Name: Name of the car which includes Brand name and Model name&lt;/p&gt;

&lt;p&gt;Location: The location in which the car is being sold or is available for purchase (Cities)&lt;/p&gt;

&lt;p&gt;Year: Manufacturing year of the car&lt;/p&gt;

&lt;p&gt;Kilometers_Driven: The total kilometers driven by the previous owner(s), in km.&lt;/p&gt;

&lt;p&gt;Fuel_Type: The type of fuel used by the car. (Petrol, Diesel, Electric, CNG, LPG)&lt;/p&gt;

&lt;p&gt;Transmission: The type of transmission used by the car. (Automatic / Manual)&lt;/p&gt;

&lt;p&gt;Owner: Type of ownership&lt;/p&gt;

&lt;p&gt;Mileage: The standard mileage offered by the car company in kmpl or km/kg&lt;/p&gt;

&lt;p&gt;Engine: The displacement volume of the engine in CC.&lt;/p&gt;

&lt;p&gt;Power: The maximum power of the engine in bhp.&lt;/p&gt;

&lt;p&gt;Seats: The number of seats in the car.&lt;/p&gt;

&lt;p&gt;New_Price: The price of a new car of the same model, in INR Lakhs (1 Lakh = INR 100,000)&lt;/p&gt;

&lt;p&gt;Price: The price of the used car, in INR Lakhs (Target Variable)&lt;/p&gt;
&lt;h2&gt;
  
  
  Problem Formulation
&lt;/h2&gt;

&lt;p&gt;You are trying to predict a quantity; therefore, you have a regression problem, unlike a classification problem, which predicts a label.&lt;/p&gt;
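&lt;p&gt;To make the distinction concrete, here is a minimal sketch on toy data (not the project's dataset): the regressor predicts a quantity, the classifier predicts a label.&lt;/p&gt;

```python
# Toy illustration of regression vs. classification, not the project's code.
from sklearn.linear_model import LinearRegression, LogisticRegression

X = [[1], [2], [3], [4]]          # e.g. car age in years
y_price = [10.0, 8.0, 6.5, 5.0]   # a quantity, so this is regression
y_cheap = [0, 0, 1, 1]            # a label, so this is classification

reg = LinearRegression().fit(X, y_price)
clf = LogisticRegression().fit(X, y_cheap)

print(reg.predict([[5]]))  # a continuous number
print(clf.predict([[5]]))  # a discrete class label
```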
&lt;h2&gt;
  
  
  Solution
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxew7xtbyuqz0g4wjn5jx.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxew7xtbyuqz0g4wjn5jx.jpg" alt="Step by step" width="800" height="534"&gt;&lt;/a&gt;&lt;br&gt;
Photo by &lt;a href="https://unsplash.com/@websbykaja?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Kaja Kadlecova&lt;/a&gt; on &lt;a href="https://unsplash.com/s/photos/continue?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Due to the lengthy nature of this particular project post, I will divide it into a three-part article miniseries.&lt;/p&gt;

&lt;p&gt;This first part will cover:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Extraction, Transformation, and Loading (ETL) of the data&lt;/li&gt;
&lt;li&gt;Exploratory Data Analysis (EDA)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let me take you through the first portion of this solution to the business case.&lt;/p&gt;
&lt;h3&gt;
  
  
  1. First, you import the necessary libraries.
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import warnings                                                  # Used to ignore the warning given as output of the code
warnings.filterwarnings('ignore')

import numpy as np                                               # Basic libraries of python for numeric and dataframe computations
import pandas as pd

import matplotlib.pyplot as plt                                  # Basic library for data visualization
import seaborn as sns                                            # Slightly advanced library for data visualization

from sklearn.model_selection import train_test_split             # Used to split the data into train and test sets.

from sklearn.linear_model import LinearRegression, Ridge, Lasso  # Import methods to build linear model for statistical analysis and prediction

from sklearn.tree import DecisionTreeRegressor                   # Import methods to build decision trees.
from sklearn.ensemble import RandomForestRegressor               # Import methods to build Random Forest.

from sklearn import metrics                                      # Metrics to evaluate the model

from sklearn.model_selection import GridSearchCV                 # For tuning the model
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Remove the limit from the number of displayed columns and rows. (This step is optional)&lt;br&gt;
&lt;code&gt;pd.set_option("display.max_columns", None)&lt;br&gt;
pd.set_option("display.max_rows", 200)&lt;/code&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  2. Now you explore the data (Extract, Transform, and Load: ETL)
&lt;/h3&gt;
&lt;h4&gt;
  
  
  Loading the data
&lt;/h4&gt;

&lt;p&gt;Loading the data into Python to explore and understand it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = pd.read_csv("used_cars_data.csv")
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns.")  # f-string

df.head(10)  # displays the first ten rows
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;There are 7253 rows and 14 columns.
S.No.   Name    Location    Year    Kilometers_Driven   Fuel_Type   Transmission    Owner_Type  Mileage Engine  Power   Seats   New_Price   Price
0   0   Maruti Wagon R LXI CNG  Mumbai  2010    72000   CNG Manual  First   26.6 km/kg  998 CC  58.16 bhp   5.0 NaN 1.75
1   1   Hyundai Creta 1.6 CRDi SX Option    Pune    2015    41000   Diesel  Manual  First   19.67 kmpl  1582 CC 126.2 bhp   5.0 NaN 12.50
2   2   Honda Jazz V    Chennai 2011    46000   Petrol  Manual  First   18.2 kmpl   1199 CC 88.7 bhp    5.0 8.61 Lakh   4.50
3   3   Maruti Ertiga VDI   Chennai 2012    87000   Diesel  Manual  First   20.77 kmpl  1248 CC 88.76 bhp   7.0 NaN 6.00
4   4   Audi A4 New 2.0 TDI Multitronic Coimbatore  2013    40670   Diesel  Automatic   Second  15.2 kmpl   1968 CC 140.8 bhp   5.0 NaN 17.74
5   5   Hyundai EON LPG Era Plus Option Hyderabad   2012    75000   LPG Manual  First   21.1 km/kg  814 CC  55.2 bhp    5.0 NaN 2.35
6   6   Nissan Micra Diesel XV  Jaipur  2013    86999   Diesel  Manual  First   23.08 kmpl  1461 CC 63.1 bhp    5.0 NaN 3.50
7   7   Toyota Innova Crysta 2.8 GX AT 8S   Mumbai  2016    36000   Diesel  Automatic   First   11.36 kmpl  2755 CC 171.5 bhp   8.0 21 Lakh 17.50
8   8   Volkswagen Vento Diesel Comfortline Pune    2013    64430   Diesel  Manual  First   20.54 kmpl  1598 CC 103.6 bhp   5.0 NaN 5.20
9   9   Tata Indica Vista Quadrajet LS  Chennai 2012    65932   Diesel  Manual  Second  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What you learn from the above is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;S.No. is just an index for the data entry. In all likelihood, this column will not be a significant factor in determining the price of the car. Having said that, there are instances where the index of the data entry contains information about the time factor (an entry with a smaller index corresponds to data entered years ago).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now check the info of the data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.info()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What you learn from the above is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mileage, Engine, Power and New_Price are objects when they should ideally be numerical. To be able to get summary statistics for these columns, you will have to process them first.&lt;/li&gt;
&lt;/ul&gt;
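&lt;p&gt;The idea behind that processing, sketched here on toy data (the post walks through explicit loops below), is to strip the unit suffix and coerce what remains to a number:&lt;/p&gt;

```python
import pandas as pd

# Toy sketch: pull the leading number out of a "value + unit" string column,
# then coerce it to float; missing values pass through as NaN.
sample = pd.DataFrame({"Mileage": ["26.6 km/kg", "19.67 kmpl", None]})

sample["km_per_unit_fuel"] = pd.to_numeric(
    sample["Mileage"].str.extract(r"^(\d+(?:\.\d+)?)\s", expand=False)
)
print(sample["km_per_unit_fuel"].tolist())  # [26.6, 19.67, nan]
```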

&lt;h4&gt;
  
  
  Processing Columns
&lt;/h4&gt;

&lt;p&gt;Process 'Mileage', 'Engine', 'Power' and 'New_Price' and extract numerical values from them.&lt;/p&gt;

&lt;h5&gt;
  
  
  1. Mileage
&lt;/h5&gt;

&lt;p&gt;You have car mileage in two units, kmpl and km/kg.&lt;/p&gt;

&lt;p&gt;After a quick internet search, it is clear that these two units are used for cars of two different fuel types.&lt;/p&gt;

&lt;p&gt;kmpl - kilometers per litre - is used for petrol and diesel cars.&lt;/p&gt;

&lt;p&gt;km/kg - kilometers per kg - is used for CNG- and LPG-based engines.&lt;/p&gt;

&lt;p&gt;You have the variable Fuel_Type in the data.&lt;br&gt;
Check whether these observations also hold true in this dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create 2 new columns after splitting the mileage values.
km_per_unit_fuel = []
mileage_unit = []

for observation in df["Mileage"]:
    if isinstance(observation, str):
        if (
            observation.split(" ")[0]
            .replace(".", "", 1)
            .isdigit()  # First element should be numeric
            and " " in observation  # Space between numeric and unit
            and (
                observation.split(" ")[1]
                == "kmpl"  # Units are limited to "kmpl" and "km/kg"
                or observation.split(" ")[1] == "km/kg"
            )
        ):
            km_per_unit_fuel.append(float(observation.split(" ")[0]))
            mileage_unit.append(observation.split(" ")[1])
        else:
            # To detect if there are any observations in the column that do not follow
            # The expected format [number + ' ' + 'kmpl' or 'km/kg']
            print(
                "The data needs further processing. All values are not similar ",
                observation,
            )
    else:
        # If there are any missing values in the mileage column,
        # We add corresponding missing values to the 2 new columns
        km_per_unit_fuel.append(np.nan)
        mileage_unit.append(np.nan)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# No print output from the function above. The values are all in the expected format or NaNs
# Add the new columns to the data
df["km_per_unit_fuel"] = km_per_unit_fuel
df["mileage_unit"] = mileage_unit

# Checking the new dataframe
df.head(5)  # looks good!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Check if the units correspond to the fuel types as expected.
df.groupby(by = ["Fuel_Type", "mileage_unit"]).size()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result &lt;br&gt;
&lt;code&gt;Fuel_Type  mileage_unit&lt;br&gt;
CNG        km/kg             62&lt;br&gt;
Diesel     kmpl            3852&lt;br&gt;
LPG        km/kg             12&lt;br&gt;
Petrol     kmpl            3325&lt;br&gt;
dtype: int64&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;As expected, km/kg is for CNG/LPG cars and kmpl is for Petrol and Diesel cars.&lt;/li&gt;
&lt;/ul&gt;
&lt;h5&gt;
  
  
  2. Engine
&lt;/h5&gt;

&lt;p&gt;The data dictionary suggests that Engine indicates the displacement volume of the engine in CC. You will make sure that all the observations follow the same format - [numeric + " " + "CC"] and create a new numeric column from this column.&lt;/p&gt;

&lt;p&gt;This time, use a regex to make all the necessary checks.&lt;/p&gt;

&lt;p&gt;Regular Expressions, also known as “regex”, are used to match strings of text such as particular characters, words, or patterns of characters. It means that you can match and extract any string pattern from the text with the help of regular expressions.&lt;br&gt;
&lt;/p&gt;
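&lt;p&gt;As a quick illustration of the pattern used in the next block: a number (optionally with a decimal part), a space, and "CC", anchored to the whole string.&lt;/p&gt;

```python
import re

# The same pattern the loop below uses for the Engine column
pattern = r"^\d+(\.\d+)? CC$"

print(bool(re.match(pattern, "998 CC")))     # True
print(bool(re.match(pattern, "1582.5 CC")))  # True
print(bool(re.match(pattern, "null CC")))    # False
```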

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# re module provides support for regular expressions
import re

# Create a new column after splitting the engine values.
engine_num = []

# Regex for numeric + " " + "CC"  format
regex_engine = r"^\d+(\.\d+)? CC$"

for observation in df["Engine"]:
    if isinstance(observation, str):
        if re.match(regex_engine, observation):
            engine_num.append(float(observation.split(" ")[0]))
        else:
            # To detect if there are any observations in the column that do not follow [numeric + " " + "CC"]  format
            print(
                "The data needs further processing. All values are not similar ",
                observation,
            )
    else:
        # If there are any missing values in the engine column, we add missing values to the new column
        engine_num.append(np.nan)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# No print output from the function above. The values are all in the same format - [numeric + " " + "CC"] OR NaNs
# Add the new column to the data
df["engine_num"] = engine_num

# Checking the new dataframe
df.head(5)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h5&gt;
  
  
  3. Power
&lt;/h5&gt;

&lt;p&gt;The data dictionary suggests that Power indicates the maximum power of the engine in bhp. You will make sure that all the observations follow the same format - [numeric + " " + "bhp"] and create a new numeric column from this column, like you did for Engine.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create a new column after splitting the power values
power_num = []

# Regex for numeric + " " + "bhp"  format
regex_power = r"^\d+(\.\d+)? bhp$"

for observation in df["Power"]:
    if isinstance(observation, str):
        if re.match(regex_power, observation):
            power_num.append(float(observation.split(" ")[0]))
        else:
            # To detect if there are any observations in the column that do not follow [numeric + " " + "bhp"]  format
            # That we see in the sample output
            print(
                "The data needs further processing. All values are not similar ",
                observation,
            )
    else:
        # If there are any missing values in the power column, we add missing values to the new column
        power_num.append(np.nan)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;You can see that some null values in the Power column exist as the string 'null bhp'. Let us replace these with NaNs.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ower_num = []

for observation in df["Power"]:
    if isinstance(observation, str):
        if re.match(regex_power, observation):
            power_num.append(float(observation.split(" ")[0]))
        else:
            power_num.append(np.nan)
    else:
        # If there are any missing values in the power column, we add missing values to the new column
        power_num.append(np.nan)

# Add the new column to the data
df["power_num"] = power_num

# Checking the new dataframe
df.head(10)  # Looks good now
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h5&gt;
  
  
  4. New_Price
&lt;/h5&gt;

&lt;p&gt;You know that New_Price is the price of a new car of the same model, in INR Lakhs (1 Lakh = 100,000).&lt;/p&gt;

&lt;p&gt;This column clearly has a lot of missing values. You will impute the missing values later. For now you will only extract the numeric values from this column.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create a new column after splitting the New_Price values.
new_price_num = []

# Regex for numeric + " " + "Lakh"  format
regex_power = r"^\d+(\.\d+)? Lakh$"  # note: reuses the variable name from the Power section

for observation in df["New_Price"]:
    if isinstance(observation, str):
        if re.match(regex_power, observation):
            new_price_num.append(float(observation.split(" ")[0]))
        else:
            # To detect if there are any observations in the column that do not follow [numeric + " " + "Lakh"]  format
            # That we see in the sample output
            print(
                "The data needs further processing. All values are not similar ",
                observation,
            )
    else:
        # If there are any missing values in the New_Price column, we add missing values to the new column
        new_price_num.append(np.nan)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You will see that not all values are in Lakhs. There are a few observations in Crores as well.&lt;/p&gt;

&lt;p&gt;Convert these to Lakhs (1 Crore = 100 Lakh).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;new_price_num = []

for observation in df["New_Price"]:
    if isinstance(observation, str):
        if re.match(regex_power, observation):
            new_price_num.append(float(observation.split(" ")[0]))
        else:
            # Converting values in Crore to lakhs
            new_price_num.append(float(observation.split(" ")[0]) * 100)
    else:
        # If there are any missing values in the New_Price column, we add missing values to the new column
        new_price_num.append(np.nan)

# Add the new column to the data
df["new_price_num"] = new_price_num

# Checking the new dataframe
df.head(5)  # Looks ok
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Feature Engineering
&lt;/h3&gt;

&lt;p&gt;The Name column in the current format might not be very useful in our analysis. Since the name contains both the brand name and the model name of the vehicle, the column would have too many unique values to be useful in prediction.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df["Name"].nunique()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result&lt;br&gt;
&lt;code&gt;2041&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;With 2041 unique names, car names are not going to be great predictors of price in the current data. But you can process this column to extract important information and see whether that reduces the number of levels.&lt;/li&gt;
&lt;/ul&gt;
&lt;h5&gt;
  
  
  1. Car Brand Name
&lt;/h5&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Extract Brand Names
df["Brand"] = df["Name"].apply(lambda x: x.split(" ")[0].lower())

# Check the data
df["Brand"].value_counts()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plt.figure(figsize = (15, 7))

sns.countplot(y = "Brand", data = df, order = df["Brand"].value_counts().index)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Resulting visualization.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3w7x050b7bl75zo3gohq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3w7x050b7bl75zo3gohq.png" alt="A count plot showing Maruti as the most popular car brand name " width="800" height="358"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A count plot showing Maruti as the most popular car brand name&lt;/p&gt;
&lt;h5&gt;
  
  
  2. Car Model Name
&lt;/h5&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Extract Model Names
df["Model"] = df["Name"].apply(lambda x: x.split(" ")[1].lower())

# Check the data
df["Model"].value_counts()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plt.figure(figsize = (15, 7))

sns.countplot(y = "Model", data = df, order = df["Model"].value_counts().index[0:30])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhuc24xvlmtqazttr7xy6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhuc24xvlmtqazttr7xy6.png" alt="A count plot that shows swift as the most popular car model name   " width="800" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A count plot that shows swift as the most popular car model name.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;It is clear from the above charts that the dataset contains used cars from luxury as well as budget-friendly brands.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You can create a new variable using this information by binning all the cars into three categories:&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;ul&gt;
&lt;li&gt;Budget-Friendly&lt;/li&gt;
&lt;li&gt;Mid Range&lt;/li&gt;
&lt;li&gt;Luxury Cars&lt;/li&gt;
&lt;/ul&gt;
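&lt;p&gt;The actual binning is done later in the series, once missing values are handled, but a minimal sketch with &lt;code&gt;pd.cut&lt;/code&gt; (the bin edges here are illustrative, not the project's final ones) looks like this:&lt;/p&gt;

```python
import pandas as pd

# Toy sketch: bucket cars by new price in Lakhs into three categories.
# The bin edges (10 and 25 Lakhs) are illustrative assumptions.
prices = pd.Series([3.5, 8.6, 21.0, 120.0])

car_category = pd.cut(
    prices,
    bins=[0, 10, 25, float("inf")],
    labels=["Budget-Friendly", "Mid Range", "Luxury Cars"],
)
print(car_category.tolist())
```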
&lt;h5&gt;
  
  
  3. Car_category
&lt;/h5&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.groupby(["Brand"])["Price"].mean().sort_values(ascending = False)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Output &lt;br&gt;
&lt;code&gt;Brand&lt;br&gt;
lamborghini      120.000000&lt;br&gt;
bentley           59.000000&lt;br&gt;
porsche           48.348333&lt;br&gt;
land              39.259500&lt;br&gt;
jaguar            37.632250&lt;br&gt;
mini              26.896923&lt;br&gt;
mercedes-benz     26.809874&lt;br&gt;
audi              25.537712&lt;br&gt;
bmw               25.243146&lt;br&gt;
volvo             18.802857&lt;br&gt;
jeep              18.718667&lt;br&gt;
isuzu             14.696667&lt;br&gt;
toyota            11.580024&lt;br&gt;
mitsubishi        11.058889&lt;br&gt;
force              9.333333&lt;br&gt;
mahindra           8.045919&lt;br&gt;
skoda              7.559075&lt;br&gt;
ford               6.889400&lt;br&gt;
renault            5.799034&lt;br&gt;
honda              5.411743&lt;br&gt;
hyundai            5.343433&lt;br&gt;
volkswagen         5.307270&lt;br&gt;
nissan             4.738352&lt;br&gt;
maruti             4.517267&lt;br&gt;
tata               3.562849&lt;br&gt;
fiat               3.269286&lt;br&gt;
datsun             3.049231&lt;br&gt;
chevrolet          3.044463&lt;br&gt;
smart              3.000000&lt;br&gt;
ambassador         1.350000&lt;br&gt;
hindustan               NaN&lt;br&gt;
opelcorsa               NaN&lt;br&gt;
Name: Price, dtype: float64&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The output is very close to expectation (domain knowledge) in terms of brand ordering. The mean price of a used Lamborghini is 120 Lakhs, and the other luxury brands follow in descending order.&lt;/p&gt;

&lt;p&gt;Towards the bottom end you have the more budget-friendly brands.&lt;/p&gt;

&lt;p&gt;You can see that there is some missingness in the data. You can come back to creating this variable once you have removed the missingness from the data.&lt;/p&gt;
&lt;h2&gt;
  
  
  Exploratory Data Analysis
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Basic summary stats - Numeric variables
df.describe().T
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Output&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz1nemmyvcvmzt1bg7ipa.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz1nemmyvcvmzt1bg7ipa.PNG" alt="Image showing the summary statistics table of the data" width="583" height="251"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Observations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;S.No. has no interpretation here, but as discussed earlier, drop it only after having looked at the initial linear model.&lt;/li&gt;
&lt;li&gt;Kilometers_Driven values have an incredibly high range. You should check a few of the extreme values to get a sense of the data.&lt;/li&gt;
&lt;li&gt;The minimum and maximum number of seats in the car also warrant a quick check. On average a car seems to have 5 seats, which seems right.&lt;/li&gt;
&lt;li&gt;You have used cars being sold at less than a Lakh rupees and as high as 160 Lakhs, as you saw for the Lamborghini earlier. You might have to drop some of these outliers to build a robust model.&lt;/li&gt;
&lt;li&gt;The minimum Mileage being 0 is also concerning; you'll have to check what is going on.&lt;/li&gt;
&lt;li&gt;Engine and Power mean and median values are not very different. Only someone with more domain knowledge would be able to comment further on these attributes.&lt;/li&gt;
&lt;li&gt;The New_Price range seems right. You have both budget-friendly Maruti cars and Lamborghinis in stock. The mean being twice the median suggests that there are only a few very high-priced brands, which again makes sense.
&lt;/li&gt;
&lt;/ol&gt;
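&lt;p&gt;One common way to flag such extreme values is the 1.5 * IQR rule. The sketch below uses toy numbers and is not necessarily the treatment used later in the series:&lt;/p&gt;

```python
import pandas as pd

# Generic outlier-flagging sketch using the 1.5 * IQR rule on toy data
km = pd.Series([41000, 46000, 72000, 87000, 6500000])

q1, q3 = km.quantile(0.25), km.quantile(0.75)
iqr = q3 - q1
upper = q3 + 1.5 * iqr

print(km[km.gt(upper)].tolist())  # [6500000]
```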
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Check Kilometers_Driven extreme values
df.sort_values(by = ["Kilometers_Driven"], ascending = False).head(10)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;&lt;p&gt;It looks like the first row here is a data entry error. A car manufactured as recently as 2017 having been driven 6,500,000 km is almost impossible.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The observations that follow are also on the higher end. There is a good chance that these are outliers. You'll look at this further while doing the univariate analysis.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Check Kilometers_Driven Extreme values
df.sort_values(by = ["Kilometers_Driven"], ascending = True).head(10)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;&lt;p&gt;After looking at the Year, New_Price, and Price columns, these entries seem feasible.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;1000 might be a default value here. Quite a few cars having been driven exactly 1,000 km is suspicious.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
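&lt;p&gt;A quick way to quantify that suspicion is to count how many listings sit at exactly 1,000 km; on the real dataframe this would be &lt;code&gt;df["Kilometers_Driven"].eq(1000).sum()&lt;/code&gt;. On toy data:&lt;/p&gt;

```python
import pandas as pd

# Count entries at exactly the suspected default value of 1000 km
km = pd.Series([1000, 1000, 350, 1000, 620])
print(km.eq(1000).sum())  # 3
```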
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Check seats extreme values
df.sort_values(by = ["Seats"], ascending = True).head(5)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;An Audi A4 having 0 seats is a data entry error. This column requires some outlier treatment, or you can treat Seats == 0 as a missing value. Overall, there doesn't seem to be much to be concerned about here.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Let us check if we have a similar car in our dataset.
df[df["Name"].str.startswith("Audi A4")]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Looks like an Audi A4 typically has 5 seats.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Let us replace #seats in row index 3999 form 0 to 5
df.loc[3999, "Seats"] = 5.0


# Check seats extreme values
df.sort_values(by = ["Seats"], ascending = False).head(5)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A Toyota Qualis has 10 seats and so does a Tata Sumo. No data entry error here.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Check Mileage - km_per_unit_fuel extreme values
df.sort_values(by = ["km_per_unit_fuel"], ascending = True).head(10)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You will have to treat Mileage = 0 as missing values.&lt;br&gt;
&lt;/p&gt;
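&lt;p&gt;A minimal sketch of that treatment, on stand-in values (the column name follows the post; the data itself is made up): replacing the sentinel zeros with NaN lets them flow into the imputation step later.&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Hypothetical sketch: treat the impossible Mileage == 0 readings as missing
# rather than as real data, so they can be imputed later.
df = pd.DataFrame({"km_per_unit_fuel": [0.0, 18.9, 0.0, 23.1]})

df["km_per_unit_fuel"] = df["km_per_unit_fuel"].replace(0.0, np.nan)
print(df["km_per_unit_fuel"].isna().sum())  # → 2
```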

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Check Mileage - km_per_unit_fuel extreme values
df.sort_values(by = ["km_per_unit_fuel"], ascending = False).head(10)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Maruti Wagon R and Maruti Alto CNG versions are budget-friendly cars with high mileage, so these data points are fine.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Looking at value counts for non-numeric features

num_to_display = 10  # Defining this up here so it's easy to change later

for colname in df.dtypes[df.dtypes == "object"].index:
    val_counts = df[colname].value_counts(dropna = False)  # Will also show the NA counts

    print(val_counts[:num_to_display])

    if len(val_counts) &amp;gt; num_to_display:
        print(f"Only displaying first {num_to_display} of {len(val_counts)} values.")
    print("\n\n")  # Just for more space in between
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since you haven't dropped the original columns that you processed, there are a few redundant outputs here.&lt;br&gt;
You checked cars of different Fuel_Type earlier but did not encounter the 2 electric cars. Let us check why.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.loc[df["Fuel_Type"] == "Electric"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Mileage values for these cars are NaN, which is why you did not encounter them earlier with groupby.&lt;/p&gt;

&lt;p&gt;Electric cars are very new in the market and very rare in our dataset. You can consider dropping these two observations if they turn out to be outliers later. There is a good chance that you will not be able to create a good price prediction model for electric cars, with the currently available data.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Missing Values
&lt;/h3&gt;

&lt;p&gt;Before you start looking at the individual distributions and interactions, let's quickly check the missingness in the data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Checking missing values in the dataset
df.isnull().sum()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;2 Electric car variants don't have entries for Mileage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Engine displacement is missing for 46 observations, and maximum power is missing for 175 entries.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Information about the number of seats is not available for 53 entries.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Price is also missing for 1234 entries. Since Price is the response variable you want to predict, you will have to drop these rows when you build a model; they cannot help with modeling or model evaluation. While analyzing distributions and imputing missing values, though, you will keep using the information in these rows.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;New Price, as you saw earlier, has a huge missing count: 6247 entries. You'll have to see if there is a pattern here, and explore whether you can impute these values or should drop the column altogether.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
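&lt;p&gt;For the imputation step, one common approach is to fill a missing value from similar cars. Here is a sketch on a toy frame; grouping by Name is an illustrative assumption, since the post does not specify its imputation strategy.&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Hypothetical imputation sketch: fill missing Seats with the median
# Seats of cars sharing the same model name (toy data).
df = pd.DataFrame({
    "Name": ["Audi A4", "Audi A4", "Audi A4", "Tata Sumo"],
    "Seats": [5.0, np.nan, 5.0, 10.0],
})

df["Seats"] = df.groupby("Name")["Seats"].transform(lambda s: s.fillna(s.median()))
print(df["Seats"].tolist())  # → [5.0, 5.0, 5.0, 10.0]
```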

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Drop the redundant columns.
df.drop(
    columns=["Mileage", "mileage_unit", "Engine", "Power", "New_Price"], inplace = True
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F79llkace50lr7feq1o10.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F79llkace50lr7feq1o10.jpg" alt="To be continued....." width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Photo by &lt;a href="https://unsplash.com/@sunnystate?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Reuben Juarez&lt;/a&gt; on &lt;a href="https://unsplash.com/s/photos/continue?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You have come to the end of part one. The &lt;a href="https://dev.to/evedevtech/predicting-used-cars-prices-part-two-22hh-temp-slug-5145758?preview=b02f5e1dfaaefd165765996ed3fe888b4d847544aa34615e3dab523f1a7334d70f8c4f4bc7346eb23797b7255f7880dd79e9893b01700c72c0ac0cd7"&gt;part two&lt;/a&gt; post will cover data visualization, bivariate distributions, and correlations between variables.&lt;/p&gt;

&lt;p&gt;Here is the &lt;a href="https://github.com/Eve-dev-tech/Predicting-used-car-prices" rel="noopener noreferrer"&gt;link&lt;/a&gt; to the source code.&lt;/p&gt;

&lt;p&gt;Stay tuned! Like, save, and share your comments. Happy coding.&lt;/p&gt;

</description>
      <category>python</category>
      <category>beginners</category>
      <category>prediction</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Visualizing Temperature Variation; A Climate Spiral .</title>
      <dc:creator>Data Stories</dc:creator>
      <pubDate>Tue, 06 Jun 2023 17:19:51 +0000</pubDate>
      <link>https://dev.to/data_stories/visualizing-temperature-variation-a-climate-spiral--35jg</link>
      <guid>https://dev.to/data_stories/visualizing-temperature-variation-a-climate-spiral--35jg</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fobi1qz53p681s2wxx2ot.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fobi1qz53p681s2wxx2ot.jpg" alt="A photo depicting crossword tiles with the words" width="800" height="600"&gt;&lt;/a&gt;&lt;br&gt;
Photo by &lt;a href="https://unsplash.com/@brett_jordan?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Brett Jordan&lt;/a&gt; on &lt;a href="https://unsplash.com/s/photos/begin?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Welcome to the first article of my 52-week blog challenge. I will be covering technical and descriptive articles in the field of data science and artificial intelligence.&lt;/p&gt;

&lt;p&gt;Let's jump right into the definitions first.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://en.wikipedia.org/wiki/Temperature" rel="noopener noreferrer"&gt;Temperature&lt;/a&gt;&lt;/strong&gt; - a physical quantity that expresses the perception of hotness and coldness, measured on a scale.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.merriam-webster.com/dictionary/variation" rel="noopener noreferrer"&gt;&lt;strong&gt;Variation&lt;/strong&gt;&lt;/a&gt; - the extent to which something differs from another.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So....&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Temperature variation is the measure of the difference in temperature in a specific area over a particular range of time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Febnw7bwly5b2hj0dnf47.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Febnw7bwly5b2hj0dnf47.jpg" alt="Image showing variation in temperature across the world" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Goals
&lt;/h2&gt;

&lt;p&gt;The goal of this project is to create an animated spiral of Kenya's variation in temperature from 1991 to 2016.&lt;/p&gt;

&lt;p&gt;By the end of this blog post you will have learned:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Exploratory data analysis - ETL (Extraction, Transformation and Loading of data)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data Visualization&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Generation of a &lt;a href="https://www.socialpilot.co/social-media-terms/gif" rel="noopener noreferrer"&gt;GIF &lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Reporting and presenting the data's story after transforming it from data to information and insights.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  Why?
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Descriptive analysis&lt;/strong&gt;- It will describe the current situation on the ground.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Informed decision making&lt;/strong&gt;-The insight will help with making informed decisions in climate policy-making.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Disaster preparedness&lt;/strong&gt;-The visualization can help show early signs of unusual temperature spikes that could help prepare better for them.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://climate.nasa.gov/climate_resources/300/video-climate-spiral-1880-2022/" rel="noopener noreferrer"&gt;Ed Hawkins&lt;/a&gt;, a climate scientist, unveiled an animated visualization in 2017 that captivated the world. This visualization showed the deviations in the global average temperature from 1850 to 2017. It was re-shared millions of times over Twitter and Facebook and a version of it was even shown at the opening ceremony for the Rio Olympics.&lt;/p&gt;

&lt;p&gt;This animation is created with the help of &lt;a href="https://www.dataquest.io/blog/climate-temperature-spirals-python/" rel="noopener noreferrer"&gt;this Dataquest tutorial&lt;/a&gt; written by Srini Kadamati.&lt;/p&gt;

&lt;p&gt;Historical weather data was retrieved from &lt;a href="https://africaopendata.org/dataset/kenya-climate-data-1991-2016" rel="noopener noreferrer"&gt;Africa Open Data&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The data was collected for the climate knowledge portal by &lt;a href="https://www.worldbank.org/en/home" rel="noopener noreferrer"&gt;the World Bank&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Building the spiral visualization.
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. ETL (Extraction, Transformation and Loading of data)
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#importing libraries we'll use 
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
import matplotlib.animation as animation


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#reading the temperature file into a pandas dataframe
temp_data = pd.read_csv(
    "temp data.csv",
    delim_whitespace=True,
    usecols=[0, 1],
    header=None)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Let's take a quick look at the data frame and some properties of the data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;temp_data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
0   1
0   Year,Month  Average,Temperature
1   1991,Jan    Average,25.1631
2   1991,Feb    Average,26.0839
3   1991,Mar    Average,26.2236
4   1991,Apr    Average,25.5812
... ... ...
308 2016,Aug    Average,24.0942
309 2016,Sep    Average,24.437
310 2016,Oct    Average,26.0317
311 2016,Nov    Average,25.5692
312 2016,Dec    Average,25.7401
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;temp_data.describe()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
0   1
count   313 313
unique  313 313
top Year,Month  Average,Temperature
freq    1   1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From the results you get, check if there is a need to make it more readable.&lt;/p&gt;

&lt;p&gt;With this particular case, you need to separate year, month, and average temperature.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;temp_data[['Year', 'Month']] = temp_data['Year'].str.split(',', expand=True)

temp_data[['Average', 'Temparature']] = temp_data['Average'].str.split(',', expand=True)
temp_data.head()


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
0   1   Year    Month   Average Temperature Temparature
0   Year,Month  Average,Temperature Year    Month   Average Average,Temperature Temperature
1   1991,Jan    Average,25.1631 1991    Jan Average Average,25.1631 25.1631
2   1991,Feb    Average,26.0839 1991    Feb Average Average,26.0839 26.0839
3   1991,Mar    Average,26.2236 1991    Mar Average Average,26.2236 26.2236
4   1991,Apr    Average,25.5812 1991    Apr Average Average,25.5812 25.5812
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is best practice to drop the columns that are repetitive.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;temp_data_1 = temp_data.drop(temp_data.columns[[0, 1, 4, 5]], axis=1)
temp_data_1

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Year    Month   Temparature
0   Year    Month   Temperature
1   1991    Jan 25.1631
2   1991    Feb 26.0839
3   1991    Mar 26.2236
4   1991    Apr 25.5812
... ... ... ...
308 2016    Aug 24.0942
309 2016    Sep 24.437
310 2016    Oct 26.0317
311 2016    Nov 25.5692
312 2016    Dec 25.7401
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let's get to know the &lt;a href="https://docs.python.org/3/library/datatypes.html" rel="noopener noreferrer"&gt;data types&lt;/a&gt; in the data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#getting to know what data types my data frame has
temp_data_2.dtypes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Year           object
Month          object
Temparature    object
dtype: object
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All the columns are of object (string) type.&lt;br&gt;
You need to convert the temperature column from object to float; that is the only way you can perform mathematical operations on it and plot it on a numeric scale.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;temp_data_2['Temparature'] = temp_data_2['Temparature'].astype(str).astype(float)

#view data types of each column
temp_data_2.dtypes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Year            object
Month           object
Temparature    float64
dtype: object
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you will write a function that converts month names to numbers, using the &lt;a href="https://docs.python.org/3/library/datetime.html" rel="noopener noreferrer"&gt;datetime Python library&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Define a function to convert month names to numbers
def month_string_to_number(string):
    dt = datetime.strptime(string, "%b")
    return dt.month
## Apply the function to the month column to convert to numbers
temp_data_2['month_number'] = temp_data_2['Month'].apply(month_string_to_number)

temp_data_2.head(20)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    Year    Month   Temparature month_number
1   1991.0  Jan 25.1631 1
2   1991.0  Feb 26.0839 2
3   1991.0  Mar 26.2236 3
4   1991.0  Apr 25.5812 4
5   1991.0  May 24.6618 5
6   1991.0  Jun 23.9439 6
7   1991.0  Jul 22.9982 7
8   1991.0  Aug 23.0391 8
9   1991.0  Sep 23.9423 9
10  1991.0  Oct 25.5236 10
11  1991.0  Nov 24.5875 11
12  1991.0  Dec 24.7398 12
13  1992.0  Jan 24.4359 1
14  1992.0  Feb 26.2892 2
15  1992.0  Mar 26.5409 3
16  1992.0  Apr 26.0819 4
17  1992.0  May 24.7852 5
18  1992.0  Jun 24.0563 6
19  1992.0  Jul 22.8377 7
20  1992.0  Aug 22.7902 8
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is best practice to drop the unnecessary month name column.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;temp_data_2 = temp_data_2.drop('Month', axis=1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Checking for null or missing values is very important in the ETL process.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;temp_data_2.isnull().sum()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Year            0
Temparature     0
month_number    0
dtype: int64
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are no missing values in this data.&lt;/p&gt;

&lt;p&gt;Now you find the mean of the temperature column and subtract the mean from each individual value in the column. This will help you find the temperature variation of every month against the year's mean temperature. This is a sort of &lt;a href="https://en.wikipedia.org/wiki/Normalization_(statistics)" rel="noopener noreferrer"&gt;normalization of data&lt;/a&gt;.&lt;/p&gt;
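&lt;p&gt;That normalization step can be sketched like this; the monthly values below are stand-ins, not the actual dataset.&lt;/p&gt;

```python
import pandas as pd

# Sketch of the normalization described above, on stand-in monthly values:
# subtract the column mean so each entry becomes a deviation from the mean.
temps = pd.Series([25.1631, 26.0839, 26.2236, 25.5812])

anomaly = temps - temps.mean()
print(anomaly.round(4))
# By construction, the anomalies sum to (approximately) zero.
```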

&lt;h3&gt;
  
  
  2. Visualizing the data.
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Cartesian versus polar coordinate system&lt;/em&gt;&lt;br&gt;
There are a few key phases to recreating Ed's GIF:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;learning how to plot on a polar coordinate system&lt;/li&gt;
&lt;li&gt;transforming the data for polar visualization&lt;/li&gt;
&lt;li&gt;customizing the aesthetics of the plot&lt;/li&gt;
&lt;li&gt;stepping through the visualization year by year and turning the plot into a GIF&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  - Preparing data for polar plotting
&lt;/h4&gt;

&lt;p&gt;You need to subset the data by year and use the following coordinates:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;r&lt;/strong&gt;: the temperature value for a given month, adjusted to contain no negative values.&lt;br&gt;
Matplotlib can plot negative values, but not in the way you want here: you want -0.1 to be closer to the center than 0.1, which isn't the default matplotlib behavior.&lt;br&gt;
You also want to leave some space around the origin of the plot for displaying the year as text.&lt;br&gt;
&lt;strong&gt;theta&lt;/strong&gt;: 12 equally spaced angle values covering the circle from 0 to 2*pi, one per month.&lt;/p&gt;
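&lt;p&gt;A quick note on generating those angles: np.linspace includes its endpoint by default, and 2*pi is the same polar angle as 0, so passing endpoint=False is one way to get 12 distinct month positions.&lt;/p&gt;

```python
import numpy as np

# One angle per month. endpoint=False keeps December from landing on
# the same angle as January (2*pi wraps around to 0 on a polar plot).
theta = np.linspace(0, 2 * np.pi, 12, endpoint=False)

print(len(theta))  # 12 angles, evenly spaced by 2*pi/12
```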

&lt;p&gt;You'll start with how to plot just the data for the year 1991 in matplotlib, then scale up to all years.&lt;/p&gt;

&lt;p&gt;To generate a matplotlib Axes object that uses the polar system, you need to set the projection parameter to "polar" when creating it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fig = plt.figure(figsize=(8,8))
ax1 = plt.subplot(111, projection='polar')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp7kzp8k1q0kjqo4y39kz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp7kzp8k1q0kjqo4y39kz.png" alt="a matplotlib Axes object that uses the polar system," width="699" height="688"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To adjust the data to contain no negative temperature values, you need to first calculate the minimum temperature value:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;temp_data_2['Temparature'].min()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-2.3378881410256405
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll add 2 to all temperature values, so they'll all be positive but there's still some space reserved around the origin for displaying text.&lt;/p&gt;

&lt;p&gt;Note: adjust this offset according to your own data's minimum temperature.&lt;/p&gt;

&lt;p&gt;You'll also generate 12 evenly spaced angle values starting at 0 and covering the circle up to (but not including) 2*pi, so that December doesn't land on the same angle as January:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# returns a boolean Series that selects only the rows 
#where the Year column is equal to 1991.
hc_1991 = temp_data_2[temp_data_2['Year'] == 1991]
#the code creates a new figure with 
#the plt.figure() function and sets the size of the figure to be 8 inches by 8 inches with figsize=(8,8).
fig = plt.figure(figsize=(8,8))
ax1 = plt.subplot(111, projection='polar')
r = hc_1991['Temparature'] + 2
theta = np.linspace(0, 2*np.pi, 12)
# Plot the data on the polar axes
ax1.plot(theta, r)

# hide all of the tick labels for both axes 
ax1.axes.get_yaxis().set_ticklabels([])
ax1.axes.get_xaxis().set_ticklabels([])
#Background color within the polar plot to be black, and the color surrounding the polar plot to be gray.
#I can use
#fig.set_facecolor() to set the foreground color and Axes.set_axis_bgcolor() to set the background color of the plot:
fig.set_facecolor("#323331")
ax1.set_facecolor('#000100')
#add the title and labels
ax1.set_ylabel('Temperature')
ax1.set_title("Kenya's Temperature Change (1991-2016)", color='white', fontdict={'fontsize': 30})
# Display the plot
plt.show()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Plotting the remaining years&lt;br&gt;
To plot the spirals for the remaining years, you need to repeat what you just did, but for all of the years in the dataset. The one tweak to make here is to manually set the axis limit for r (or y, in matplotlib terms). This is because matplotlib scales the size of the plot automatically based on the data used, which is why, in the last step, the data for just 1991 was displayed at the edge of the plotting area. You'll calculate the maximum temperature value in the entire dataset and add a generous amount of padding (to match what Ed did).&lt;/p&gt;
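&lt;p&gt;The limit computation described here can be sketched as follows; the values below are stand-ins, whereas in the post this would come from the adjusted temp_data_2['Temparature'] column.&lt;/p&gt;

```python
import pandas as pd

# Sketch of the manual axis limit described above: take the maximum adjusted
# temperature and pad it so the outermost spiral sits inside the frame.
# (Stand-in values; the +1.0 padding amount is an assumption.)
adjusted_max = pd.Series([27.1, 28.9, 28.3]).max()
y_limit = adjusted_max + 1.0

print(y_limit)
# later: ax1.set_ylim(0, y_limit)
```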

&lt;p&gt;Now, you can use a for loop to generate the rest of the data. You'll leave out the code that generates the center text for now (otherwise each year will generate text at the same point and it'll be very messy):&lt;/p&gt;

&lt;p&gt;You will use the color (or c) parameter when calling the Axes.plot() method, drawing colors from a colormap such as plt.cm.viridis(index).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ig = plt.figure(figsize=(14,14))
ax1 = plt.subplot(111, projection='polar')

# hide all of the tick labels for both axes 
ax1.axes.get_yaxis().set_ticklabels([])
ax1.axes.get_xaxis().set_ticklabels([])

#fig.set_facecolor() to set the foreground color and Axes.set_axis_bgcolor() to set the background color of the plot:
fig.set_facecolor("#323331")
#ax1.set_ylim(0, 3.25)


theta = np.linspace(0, 2*np.pi, 12)


ax1.set_title("Kenya's Temperature Change (1991-2016)", color='white', fontdict={'fontsize': 30})
ax1.set_facecolor('#000100')

years = temp_data_2['Year'].unique()

for index,Year in enumerate(years):
  r=temp_data_2.loc[temp_data_2["Year"]== Year,"Temparature"]+2
  ax1.plot(theta,r,c=plt.cm.viridis(index*2))
plt.show()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Adding temperature rings&lt;br&gt;
At this stage, the viewer can't actually understand the underlying data at all; there is no indication of temperature values in the visualization.&lt;br&gt;
Next, you will add temperature rings at 0.0, 0.5, 1.0, 1.5, and 2.0 degrees Celsius.&lt;/p&gt;

&lt;p&gt;Finally, generating the GIF animation&lt;br&gt;
Now you're ready to generate a GIF animation from the plot. An animation is a series of images displayed in rapid succession. You'll use the matplotlib.animation.FuncAnimation function to help with this. To take advantage of this function, you need to write code that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;defines the base plot appearance and properties&lt;/li&gt;
&lt;li&gt;updates the plot between frames with new data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You'll use the following required parameters when calling FuncAnimation():&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fig: the matplotlib Figure object&lt;/li&gt;
&lt;li&gt;func: the update function that's called between each frame&lt;/li&gt;
&lt;li&gt;frames: the number of frames (you want one for each year)&lt;/li&gt;
&lt;li&gt;interval: the number of milliseconds each frame is displayed (there are 1000 milliseconds in a second)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;FuncAnimation() returns a matplotlib.animation.FuncAnimation object, which has a save() method you can use to write the animation to a GIF file.&lt;/p&gt;

&lt;p&gt;The code block below shows all these above steps added to produce a GIF.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from mpl_toolkits.mplot3d import Axes3D 
months=["Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"]
fig=plt.figure(figsize=(15,15))
ax1=plt.subplot(111,projection="polar")

ax1.plot(full_circle_thetas, blue_one_radii, c='blue')
ax1.plot(full_circle_thetas, red_one_radii, c='red')
ax1.plot(full_circle_thetas, red_two_radii, c='red')
ax1.plot(full_circle_thetas, red_three_radii, c='red')
ax1.plot(full_circle_thetas, red_four_radii, c='red')

#fig.set_facecolor() to set the foreground color and Axes.set_axis_bgcolor() to set the background color of the plot:
fig.set_facecolor("#323331")
#ax1.set_ylim(0, 3.25)

ax1.text(np.pi/2, 1.0, "0.0 C", color="blue", ha='center', fontdict={'fontsize': 20})
ax1.text(np.pi/2, 2.0, "0.5 C", color="red", ha='center', fontdict={'fontsize': 20})
ax1.text(np.pi/2, 2.5, "1.0 C", color="red", ha='center', fontdict={'fontsize': 20})
ax1.text(np.pi/2, 3.0, "1.5 C", color="red", ha='center', fontdict={'fontsize': 20})
ax1.text(np.pi/2, 3.5, "2.0 C", color="red", ha='center', fontdict={'fontsize': 20})


ax1.set_xticks([])
ax1.set_yticks([])
ax1.set_xticklabels([])
ax1.set_yticklabels([])


theta = np.linspace(0, 2*np.pi, 12)


ax1.set_title("Kenya's Temperature Change Spiral (1991-2016)", color='white', fontdict={'fontsize': 30})
ax1.set_facecolor('#000100')

years = temp_data_2['Year'].unique()

fig.text(0.78,0,"Kenya Temperature data",color="white",fontsize=20)
fig.text(0.05,0.02,"Everlynn Muthoni; Data Stories",color="white",fontsize=20)
fig.text(0.05,0,"Inspired by Ed Hawkins's 2017 Visualization",color="white",fontsize=15)

#add months ring
months_angles= np.linspace((np.pi/2)+(2*np.pi),np.pi/2,13)
for i,month in enumerate(months):
  ax1.text(months_angles[i],5.0,month,color="white",fontsize=15,ha="center")

#for index,Year in enumerate(years):
  #r=temp_data_2.loc[temp_data_2["Year"]== Year,"Temparature"]+2
  #ax1.plot(theta,r,c=plt.cm.viridis(index*15))

def update(i):
    # Remove the last year text at the center
    for txt in ax1.texts:
      if(txt.get_position()==(0,0)):
        txt.set_visible(False)
    # Specify how we want the plot to change in each frame.
    # We need to unravel the for loop we had earlier.
    Year = years[i]
    r = temp_data_2[temp_data_2['Year'] == Year]['Temparature'] + 2
    ax1.plot(theta, r, c=plt.cm.viridis(i*30))
    ax1.text(0,0,Year,fontsize=20,color="white",ha="center")
    return ax1

anim = animation.FuncAnimation(fig, update, frames=len(years), interval=10)


ffmpeg_writer = animation.FFMpegWriter();

anim.save("Spiral.gif", writer = 'pillow', fps = 5, dpi=100);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Final result:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fra0tpk6xxo6seg3uppg2.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fra0tpk6xxo6seg3uppg2.gif" alt="final gif visualization of Kenya's temperature data " width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3. The story our data visualization tells.
&lt;/h3&gt;

&lt;p&gt;So... from the analysis and visualization, the following insights can be deduced:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Since 1991, the temperature variation has been gradually increasing between February and June, with the highest variation occurring mostly between June and July.&lt;/li&gt;
&lt;li&gt;High temperature variation occurs mostly during the first half of the year.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And that's it. Congrats, you have successfully visualized temperature data using a climate spiral!&lt;/p&gt;

&lt;p&gt;Click &lt;a href="https://github.com/Eve-dev-tech/Kenya-tempearature-spiral" rel="noopener noreferrer"&gt;here&lt;/a&gt; if you'd like to check out the source code.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Recommendations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;For a better 3D visualization, explore the project using &lt;a href="https://www.mathworks.com/products/matlab.html" rel="noopener noreferrer"&gt;MATLAB&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For more current descriptive analysis, look for a dataset with more recent observations.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Like, subscribe, and share your thoughts with me. Bye, and happy coding!&lt;/p&gt;

</description>
      <category>visualization</category>
      <category>python</category>
      <category>climate</category>
      <category>beginners</category>
    </item>
    <item>
      <title>MySQL Error 2003 (HY000): Can't connect to MySQL server on 'localhost:3306' (10061)</title>
      <dc:creator>Data Stories</dc:creator>
      <pubDate>Sat, 20 May 2023 03:11:52 +0000</pubDate>
      <link>https://dev.to/data_stories/mysql-error-2003-hy000-cant-connect-to-mysql-server-on-localhost3306-10061-1m8f</link>
      <guid>https://dev.to/data_stories/mysql-error-2003-hy000-cant-connect-to-mysql-server-on-localhost3306-10061-1m8f</guid>
      <description>&lt;p&gt;So we have finished &lt;a href="https://dev.tourl"&gt; installing MySQL&lt;/a&gt; and we want to start it at the command line. There are times you may come across the 2003(HYOOO) error or the MySQL Command Line Client may disappear after it prompts you for your password. Do not fret, this is easily fixed.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to fix it
&lt;/h2&gt;

&lt;p&gt;Please see the steps below to fix the problem.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Log in to your system as an administrator.&lt;/li&gt;
&lt;li&gt;Open Task Manager.&lt;/li&gt;
&lt;li&gt;Go to the Services tab.&lt;/li&gt;
&lt;li&gt;Look for your MySQL service; it will show as stopped.&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Right-click the MySQL service and select Start. Give it a few moments, and the status will read Running. See the image below for reference.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feqyr9ry0qlt0ybyxpobl.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feqyr9ry0qlt0ybyxpobl.PNG" alt="Image showing how to start MySQL server from task manager" width="623" height="418"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Please recheck that MySQL service has been started successfully.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If the MySQL service does not start this way, you are probably not logged in with an administrator account, and you will receive the error below.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft8ispccvum2ggnkvidhn.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft8ispccvum2ggnkvidhn.PNG" alt="Image showing error because f not running ad admin" width="337" height="202"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Follow the steps below to start the MySQL service from the Administrative Tools panel.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to Control Panel &amp;gt; All Control Panel Items &amp;gt; Administrative Tools.&lt;/li&gt;
&lt;li&gt;Go to Services.&lt;/li&gt;
&lt;li&gt;Select your MySQL service (for example, MySQL56) under the Name column.&lt;/li&gt;
&lt;li&gt;Click the Start link in the left panel to start the MySQL service.&lt;/li&gt;
&lt;li&gt;Once the MySQL service has started, open the MySQL Command Line Client.&lt;/li&gt;
&lt;li&gt;Enter your MySQL password.&lt;/li&gt;
&lt;li&gt;After the MySQL server starts successfully, you will see the screen below.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3dq90im8ujaa2ppy6jh5.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3dq90im8ujaa2ppy6jh5.PNG" alt="Login sucess" width="800" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Login success! Your MySQL server is now ready to connect!&lt;/p&gt;
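If you prefer the command line over the GUI, the same service can be started from an elevated Command Prompt. This is a minimal sketch, assuming a Windows install where the service is named MySQL56; the name varies by version (for example MySQL80), so check the Services list for yours:

```shell
rem Run these in a Command Prompt opened "as Administrator".
rem Replace MySQL56 with your actual service name (for example, MySQL80).

rem Show the current state of the service:
sc query MySQL56

rem Start the service:
net start MySQL56
```

If the prompt is not elevated, `net start` fails with "System error 5. Access is denied.", which is the command-line equivalent of the error shown above.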

&lt;p&gt;Thanks for reading this post. Please let me know if your problem has been resolved. Like, comment and share. &lt;/p&gt;

&lt;p&gt;Finally, if you want to learn how to install MySQL server from scratch, please check &lt;a href="https://dev.tourl"&gt;this blog post&lt;/a&gt; in my Learning SQL series.&lt;/p&gt;

</description>
      <category>mysql</category>
      <category>sql</category>
      <category>help</category>
      <category>errors</category>
    </item>
    <item>
      <title>Exploratory Data Analysis on Diabetes dataset with Python.</title>
      <dc:creator>Data Stories</dc:creator>
      <pubDate>Sun, 06 Nov 2022 14:42:09 +0000</pubDate>
      <link>https://dev.to/data_stories/exploratory-data-analysis-on-diabetes-dataset-with-python-2ofe</link>
      <guid>https://dev.to/data_stories/exploratory-data-analysis-on-diabetes-dataset-with-python-2ofe</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmju9qtfh2xwxkd3r61sn.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmju9qtfh2xwxkd3r61sn.jpg" alt="Exploratory Data Analysis(EDA)" width="300" height="225"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction.
&lt;/h2&gt;

&lt;p&gt;Let's start with understanding what exploratory data analysis (EDA) is. It is an approach to analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods. Simply put, it is the process of investigating data. This blog is a guide to understanding EDA with an example dataset. &lt;/p&gt;

&lt;h2&gt;
  
  
  Intuition
&lt;/h2&gt;

&lt;p&gt;Before we know how, we should first understand why. Why perform EDA at all? Imagine you and your friends decide to go on a vacation to a beach destination none of you has been to. At first, you're all stuck; you don't know where to begin. Being a good planner, the first question you would ask is: what are the best beach destinations? The next natural question would be: what is our budget? You would then ask what accommodations are available in that area, and finally you'd check the ratings and reviews of the hotel you plan to stay at.&lt;/p&gt;

&lt;p&gt;Whatever investigating measures you would take before finally booking your stay at your destination, is nothing but what data scientists in their lingo call Exploratory Data Analysis.&lt;/p&gt;

&lt;p&gt;EDA is all about making sense of the data at hand before getting your hands dirty with it.&lt;/p&gt;

&lt;h2&gt;
  
  
  EDA explained using a sample data set:
&lt;/h2&gt;

&lt;p&gt;To share my understanding of EDA concepts and techniques, I'll use the Pima Indians diabetes data set as an example. Some years ago, research was done on a Native American tribe called the Pima (also known as the Pima Indians). It is this research data we will be using.&lt;/p&gt;

&lt;p&gt;First, a little background on diabetes. Diabetes is one of the most frequent diseases worldwide, and the number of diabetic patients has been growing over the years. The main cause of diabetes remains unknown, yet scientists believe that both genetic factors and lifestyle play a major role.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Our Data dictionary:&lt;/strong&gt;&lt;br&gt;
Below is the attribute information:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pregnancies: Number of times pregnant&lt;/li&gt;
&lt;li&gt;Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test&lt;/li&gt;
&lt;li&gt;Blood pressure: Diastolic blood pressure (mm Hg)&lt;/li&gt;
&lt;li&gt;SkinThickness: Triceps skinfold thickness (mm)&lt;/li&gt;
&lt;li&gt;Insulin: 2-Hour serum insulin (mu U/ml) test&lt;/li&gt;
&lt;li&gt;BMI: Body mass index (weight in kg/(height in m)^2)&lt;/li&gt;
&lt;li&gt;DiabetesPedigreeFunction: A function that scores the likelihood of diabetes based on family history&lt;/li&gt;
&lt;li&gt;Age: Age in years&lt;/li&gt;
&lt;li&gt;Outcome: Class variable (0: the person is not diabetic or 1: the person is diabetic)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now that we understand a little about our data set and the goal of the analysis ( to understand the patterns and trends of diabetes among the Pima Indians population), let's get right into the analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The analysis&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To start with, I imported the necessary libraries ( pandas, NumPy, matplotlib, and seaborn).&lt;/p&gt;

&lt;p&gt;Note: Whatever inferences and insights I could extract, I've listed as bullet points; comments in the code start with #.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
import numpy as np  # library used for working with arrays
import pandas as pd # library used for data manipulation and analysis

import seaborn as sns # library for visualization
import matplotlib.pyplot as plt # library for visualization
%matplotlib inline


# to suppress warnings
import warnings
warnings.filterwarnings('ignore')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Reading the given dataset&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#read csv dataset

pima = pd.read_csv("diabetes.csv") # loads the csv file into a DataFrame
pima
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;Pregnancies&lt;/th&gt;
      &lt;th&gt;Glucose&lt;/th&gt;
      &lt;th&gt;BloodPressure&lt;/th&gt;
      &lt;th&gt;SkinThickness&lt;/th&gt;
      &lt;th&gt;Insulin&lt;/th&gt;
      &lt;th&gt;BMI&lt;/th&gt;
      &lt;th&gt;DiabetesPedigreeFunction&lt;/th&gt;
      &lt;th&gt;Age&lt;/th&gt;
      &lt;th&gt;Outcome&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;0&lt;/th&gt;
      &lt;td&gt;6&lt;/td&gt;
      &lt;td&gt;148&lt;/td&gt;
      &lt;td&gt;72&lt;/td&gt;
      &lt;td&gt;35&lt;/td&gt;
      &lt;td&gt;79&lt;/td&gt;
      &lt;td&gt;33.6&lt;/td&gt;
      &lt;td&gt;0.627&lt;/td&gt;
      &lt;td&gt;50&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1&lt;/th&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;85&lt;/td&gt;
      &lt;td&gt;66&lt;/td&gt;
      &lt;td&gt;29&lt;/td&gt;
      &lt;td&gt;79&lt;/td&gt;
      &lt;td&gt;26.6&lt;/td&gt;
      &lt;td&gt;0.351&lt;/td&gt;
      &lt;td&gt;31&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;2&lt;/th&gt;
      &lt;td&gt;8&lt;/td&gt;
      &lt;td&gt;183&lt;/td&gt;
      &lt;td&gt;64&lt;/td&gt;
      &lt;td&gt;20&lt;/td&gt;
      &lt;td&gt;79&lt;/td&gt;
      &lt;td&gt;23.3&lt;/td&gt;
      &lt;td&gt;0.672&lt;/td&gt;
      &lt;td&gt;32&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;3&lt;/th&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;89&lt;/td&gt;
      &lt;td&gt;66&lt;/td&gt;
      &lt;td&gt;23&lt;/td&gt;
      &lt;td&gt;94&lt;/td&gt;
      &lt;td&gt;28.1&lt;/td&gt;
      &lt;td&gt;0.167&lt;/td&gt;
      &lt;td&gt;21&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;4&lt;/th&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;137&lt;/td&gt;
      &lt;td&gt;40&lt;/td&gt;
      &lt;td&gt;35&lt;/td&gt;
      &lt;td&gt;168&lt;/td&gt;
      &lt;td&gt;43.1&lt;/td&gt;
      &lt;td&gt;2.288&lt;/td&gt;
      &lt;td&gt;33&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;...&lt;/th&gt;
      &lt;td&gt;...&lt;/td&gt;
      &lt;td&gt;...&lt;/td&gt;
      &lt;td&gt;...&lt;/td&gt;
      &lt;td&gt;...&lt;/td&gt;
      &lt;td&gt;...&lt;/td&gt;
      &lt;td&gt;...&lt;/td&gt;
      &lt;td&gt;...&lt;/td&gt;
      &lt;td&gt;...&lt;/td&gt;
      &lt;td&gt;...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;763&lt;/th&gt;
      &lt;td&gt;10&lt;/td&gt;
      &lt;td&gt;101&lt;/td&gt;
      &lt;td&gt;76&lt;/td&gt;
      &lt;td&gt;48&lt;/td&gt;
      &lt;td&gt;180&lt;/td&gt;
      &lt;td&gt;32.9&lt;/td&gt;
      &lt;td&gt;0.171&lt;/td&gt;
      &lt;td&gt;63&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;764&lt;/th&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;122&lt;/td&gt;
      &lt;td&gt;70&lt;/td&gt;
      &lt;td&gt;27&lt;/td&gt;
      &lt;td&gt;79&lt;/td&gt;
      &lt;td&gt;36.8&lt;/td&gt;
      &lt;td&gt;0.340&lt;/td&gt;
      &lt;td&gt;27&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;765&lt;/th&gt;
      &lt;td&gt;5&lt;/td&gt;
      &lt;td&gt;121&lt;/td&gt;
      &lt;td&gt;72&lt;/td&gt;
      &lt;td&gt;23&lt;/td&gt;
      &lt;td&gt;112&lt;/td&gt;
      &lt;td&gt;26.2&lt;/td&gt;
      &lt;td&gt;0.245&lt;/td&gt;
      &lt;td&gt;30&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;766&lt;/th&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;126&lt;/td&gt;
      &lt;td&gt;60&lt;/td&gt;
      &lt;td&gt;20&lt;/td&gt;
      &lt;td&gt;79&lt;/td&gt;
      &lt;td&gt;30.1&lt;/td&gt;
      &lt;td&gt;0.349&lt;/td&gt;
      &lt;td&gt;47&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;767&lt;/th&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;93&lt;/td&gt;
      &lt;td&gt;70&lt;/td&gt;
      &lt;td&gt;31&lt;/td&gt;
      &lt;td&gt;79&lt;/td&gt;
      &lt;td&gt;30.4&lt;/td&gt;
      &lt;td&gt;0.315&lt;/td&gt;
      &lt;td&gt;23&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Let's find the number of columns&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# finds the number of columns in the dataset
total_cols=len(pima.axes[1])
print("Number of Columns: "+str(total_cols))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;pre&gt;Number of Columns: 9
&lt;/pre&gt;
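Counting columns with `len(pima.axes[1])` works, but `DataFrame.shape` returns the row and column counts in one call. A small sketch with a toy DataFrame (the `demo` frame here is illustrative, not the real Pima data):

```python
import pandas as pd

# A tiny stand-in DataFrame (not the real Pima data).
demo = pd.DataFrame({
    "Glucose": [148, 85, 183],
    "BMI": [33.6, 26.6, 23.3],
    "Outcome": [1, 0, 1],
})

n_rows, n_cols = demo.shape  # (rows, columns) in one call
print("Number of Rows:", n_rows)     # 3
print("Number of Columns:", n_cols)  # 3
```

On the actual dataset, `pima.shape` would give `(768, 9)`.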

&lt;p&gt;&lt;strong&gt;Let's show the first 10 records of the dataset.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pima.head(10)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;Pregnancies&lt;/th&gt;
      &lt;th&gt;Glucose&lt;/th&gt;
      &lt;th&gt;BloodPressure&lt;/th&gt;
      &lt;th&gt;SkinThickness&lt;/th&gt;
      &lt;th&gt;Insulin&lt;/th&gt;
      &lt;th&gt;BMI&lt;/th&gt;
      &lt;th&gt;DiabetesPedigreeFunction&lt;/th&gt;
      &lt;th&gt;Age&lt;/th&gt;
      &lt;th&gt;Outcome&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;0&lt;/th&gt;
      &lt;td&gt;6&lt;/td&gt;
      &lt;td&gt;148&lt;/td&gt;
      &lt;td&gt;72&lt;/td&gt;
      &lt;td&gt;35&lt;/td&gt;
      &lt;td&gt;79&lt;/td&gt;
      &lt;td&gt;33.600000&lt;/td&gt;
      &lt;td&gt;0.627&lt;/td&gt;
      &lt;td&gt;50&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1&lt;/th&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;85&lt;/td&gt;
      &lt;td&gt;66&lt;/td&gt;
      &lt;td&gt;29&lt;/td&gt;
      &lt;td&gt;79&lt;/td&gt;
      &lt;td&gt;26.600000&lt;/td&gt;
      &lt;td&gt;0.351&lt;/td&gt;
      &lt;td&gt;31&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;2&lt;/th&gt;
      &lt;td&gt;8&lt;/td&gt;
      &lt;td&gt;183&lt;/td&gt;
      &lt;td&gt;64&lt;/td&gt;
      &lt;td&gt;20&lt;/td&gt;
      &lt;td&gt;79&lt;/td&gt;
      &lt;td&gt;23.300000&lt;/td&gt;
      &lt;td&gt;0.672&lt;/td&gt;
      &lt;td&gt;32&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;3&lt;/th&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;89&lt;/td&gt;
      &lt;td&gt;66&lt;/td&gt;
      &lt;td&gt;23&lt;/td&gt;
      &lt;td&gt;94&lt;/td&gt;
      &lt;td&gt;28.100000&lt;/td&gt;
      &lt;td&gt;0.167&lt;/td&gt;
      &lt;td&gt;21&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;4&lt;/th&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;137&lt;/td&gt;
      &lt;td&gt;40&lt;/td&gt;
      &lt;td&gt;35&lt;/td&gt;
      &lt;td&gt;168&lt;/td&gt;
      &lt;td&gt;43.100000&lt;/td&gt;
      &lt;td&gt;2.288&lt;/td&gt;
      &lt;td&gt;33&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;5&lt;/th&gt;
      &lt;td&gt;5&lt;/td&gt;
      &lt;td&gt;116&lt;/td&gt;
      &lt;td&gt;74&lt;/td&gt;
      &lt;td&gt;20&lt;/td&gt;
      &lt;td&gt;79&lt;/td&gt;
      &lt;td&gt;25.600000&lt;/td&gt;
      &lt;td&gt;0.201&lt;/td&gt;
      &lt;td&gt;30&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;6&lt;/th&gt;
      &lt;td&gt;3&lt;/td&gt;
      &lt;td&gt;78&lt;/td&gt;
      &lt;td&gt;50&lt;/td&gt;
      &lt;td&gt;32&lt;/td&gt;
      &lt;td&gt;88&lt;/td&gt;
      &lt;td&gt;31.000000&lt;/td&gt;
      &lt;td&gt;0.248&lt;/td&gt;
      &lt;td&gt;26&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;7&lt;/th&gt;
      &lt;td&gt;10&lt;/td&gt;
      &lt;td&gt;115&lt;/td&gt;
      &lt;td&gt;69&lt;/td&gt;
      &lt;td&gt;20&lt;/td&gt;
      &lt;td&gt;79&lt;/td&gt;
      &lt;td&gt;35.300000&lt;/td&gt;
      &lt;td&gt;0.134&lt;/td&gt;
      &lt;td&gt;29&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;8&lt;/th&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;197&lt;/td&gt;
      &lt;td&gt;70&lt;/td&gt;
      &lt;td&gt;45&lt;/td&gt;
      &lt;td&gt;543&lt;/td&gt;
      &lt;td&gt;30.500000&lt;/td&gt;
      &lt;td&gt;0.158&lt;/td&gt;
      &lt;td&gt;53&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;9&lt;/th&gt;
      &lt;td&gt;8&lt;/td&gt;
      &lt;td&gt;125&lt;/td&gt;
      &lt;td&gt;96&lt;/td&gt;
      &lt;td&gt;20&lt;/td&gt;
      &lt;td&gt;79&lt;/td&gt;
      &lt;td&gt;31.992578&lt;/td&gt;
      &lt;td&gt;0.232&lt;/td&gt;
      &lt;td&gt;54&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Finding the number of rows in the dataset.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# finds the number of rows in the dataset
total_rows=len(pima.axes[0])
print("Number of Rows: "+str(total_rows))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;pre&gt;Number of Rows: 768
&lt;/pre&gt;

&lt;p&gt;Now let us understand the &lt;strong&gt;dimensions of the dataset.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print('The dimension of the DataFrame is: ', pima.ndim)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;pre&gt;The dimension of the DataFrame is:  2
&lt;/pre&gt;

&lt;ul&gt;
&lt;li&gt;Note: The Pandas dataframe.ndim property returns the dimension of a series or a DataFrame. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For all kinds of dataframes and series, it returns 1 for a series, which consists only of rows, and 2 for a DataFrame, which holds two-dimensional data.&lt;/p&gt;
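The distinction is easy to see on toy objects (these two are illustrative, not the Pima data):

```python
import pandas as pd

s = pd.Series([1, 2, 3])                       # one-dimensional: just rows
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})  # two-dimensional: rows x columns

print(s.ndim)   # 1
print(df.ndim)  # 2
```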

&lt;p&gt;&lt;strong&gt;The size of the dataset.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pima.size
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;pre&gt;6912&lt;/pre&gt;

&lt;ul&gt;
&lt;li&gt;Note: In Python Pandas, the dataframe.size property is used to display the size of Pandas DataFrame. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It returns the size of the DataFrame or a series which is equivalent to the total number of elements. &lt;/p&gt;

&lt;p&gt;If I want to calculate the size of the series, it will return the number of rows. In the case of a DataFrame, it will return the rows multiplied by the columns.&lt;/p&gt;
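The rows-times-columns rule can be checked directly on a toy frame (illustrative, not the Pima data):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})  # 3 rows x 2 columns

print(df["a"].size)  # a single column is a Series: size = number of rows -> 3
print(df.size)       # whole DataFrame: rows * columns -> 6
```

This is why `pima.size` is 6912: 768 rows times 9 columns.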

&lt;p&gt;Let us now find out the &lt;strong&gt;data types&lt;/strong&gt; of all variables in the dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#The info() function is used to print a concise summary of a DataFrame. 
#This method prints information about a DataFrame including the index dtype and column dtypes, non-null values and memory usage.

pima.info()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;pre&gt;&amp;lt;class 'pandas.core.frame.DataFrame'&amp;gt;
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
&lt;/pre&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;There are 768 entries&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;There are 2 float64 columns and 7 int64 columns&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now let us &lt;strong&gt;check for missing values.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#functions that return a boolean value indicating whether the passed in argument value is in fact missing data.
# this is an example of chaining methods 

pima.isnull().values.any()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;pre&gt;False&lt;/pre&gt;

&lt;ul&gt;
&lt;li&gt;Pandas refers to what most developers know as null values as missing data. Within pandas, a missing value is denoted by NaN.
&lt;/li&gt;
&lt;/ul&gt;
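To see the chained check in action, here is a toy frame with one NaN deliberately planted (illustrative, not the Pima data), along with `isnull().sum()`, which counts missing values per column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Glucose": [148, np.nan, 183],
                   "BMI": [33.6, 26.6, 23.3]})

print(bool(df.isnull().values.any()))  # True: at least one missing value exists
print(df.isnull().sum())               # per-column count of missing values
```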

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#it can also output if there is any missing values each of the columns

pima.isnull().any()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;pre&gt;Pregnancies                 False
Glucose                     False
BloodPressure               False
SkinThickness               False
Insulin                     False
BMI                         False
DiabetesPedigreeFunction    False
Age                         False
Outcome                     False
dtype: bool&lt;/pre&gt;

&lt;ul&gt;
&lt;li&gt;We can then conclude there are no missing values in the dataset.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Statistical summary
&lt;/h2&gt;

&lt;p&gt;Now let us produce a statistical summary of the data: summary statistics for all variables except 'Outcome', which is our output variable.&lt;/p&gt;

&lt;p&gt;Summary statistics are descriptive statistics: they summarize the central tendency, dispersion, and shape of a dataset's distribution, excluding NaN values.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# excludes the Outcome column

pima.iloc[:,0:8].describe()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;Pregnancies&lt;/th&gt;
      &lt;th&gt;Glucose&lt;/th&gt;
      &lt;th&gt;BloodPressure&lt;/th&gt;
      &lt;th&gt;SkinThickness&lt;/th&gt;
      &lt;th&gt;Insulin&lt;/th&gt;
      &lt;th&gt;BMI&lt;/th&gt;
      &lt;th&gt;DiabetesPedigreeFunction&lt;/th&gt;
      &lt;th&gt;Age&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;count&lt;/th&gt;
      &lt;td&gt;768.000000&lt;/td&gt;
      &lt;td&gt;768.000000&lt;/td&gt;
      &lt;td&gt;768.000000&lt;/td&gt;
      &lt;td&gt;768.000000&lt;/td&gt;
      &lt;td&gt;768.000000&lt;/td&gt;
      &lt;td&gt;768.000000&lt;/td&gt;
      &lt;td&gt;768.000000&lt;/td&gt;
      &lt;td&gt;768.000000&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;mean&lt;/th&gt;
      &lt;td&gt;3.845052&lt;/td&gt;
      &lt;td&gt;121.675781&lt;/td&gt;
      &lt;td&gt;72.250000&lt;/td&gt;
      &lt;td&gt;26.447917&lt;/td&gt;
      &lt;td&gt;118.270833&lt;/td&gt;
      &lt;td&gt;32.450805&lt;/td&gt;
      &lt;td&gt;0.471876&lt;/td&gt;
      &lt;td&gt;33.240885&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;std&lt;/th&gt;
      &lt;td&gt;3.369578&lt;/td&gt;
      &lt;td&gt;30.436252&lt;/td&gt;
      &lt;td&gt;12.117203&lt;/td&gt;
      &lt;td&gt;9.733872&lt;/td&gt;
      &lt;td&gt;93.243829&lt;/td&gt;
      &lt;td&gt;6.875374&lt;/td&gt;
      &lt;td&gt;0.331329&lt;/td&gt;
      &lt;td&gt;11.760232&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;min&lt;/th&gt;
      &lt;td&gt;0.000000&lt;/td&gt;
      &lt;td&gt;44.000000&lt;/td&gt;
      &lt;td&gt;24.000000&lt;/td&gt;
      &lt;td&gt;7.000000&lt;/td&gt;
      &lt;td&gt;14.000000&lt;/td&gt;
      &lt;td&gt;18.200000&lt;/td&gt;
      &lt;td&gt;0.078000&lt;/td&gt;
      &lt;td&gt;21.000000&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;25%&lt;/th&gt;
      &lt;td&gt;1.000000&lt;/td&gt;
      &lt;td&gt;99.750000&lt;/td&gt;
      &lt;td&gt;64.000000&lt;/td&gt;
      &lt;td&gt;20.000000&lt;/td&gt;
      &lt;td&gt;79.000000&lt;/td&gt;
      &lt;td&gt;27.500000&lt;/td&gt;
      &lt;td&gt;0.243750&lt;/td&gt;
      &lt;td&gt;24.000000&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;50%&lt;/th&gt;
      &lt;td&gt;3.000000&lt;/td&gt;
      &lt;td&gt;117.000000&lt;/td&gt;
      &lt;td&gt;72.000000&lt;/td&gt;
      &lt;td&gt;23.000000&lt;/td&gt;
      &lt;td&gt;79.000000&lt;/td&gt;
      &lt;td&gt;32.000000&lt;/td&gt;
      &lt;td&gt;0.372500&lt;/td&gt;
      &lt;td&gt;29.000000&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;75%&lt;/th&gt;
      &lt;td&gt;6.000000&lt;/td&gt;
      &lt;td&gt;140.250000&lt;/td&gt;
      &lt;td&gt;80.000000&lt;/td&gt;
      &lt;td&gt;32.000000&lt;/td&gt;
      &lt;td&gt;127.250000&lt;/td&gt;
      &lt;td&gt;36.600000&lt;/td&gt;
      &lt;td&gt;0.626250&lt;/td&gt;
      &lt;td&gt;41.000000&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;max&lt;/th&gt;
      &lt;td&gt;17.000000&lt;/td&gt;
      &lt;td&gt;199.000000&lt;/td&gt;
      &lt;td&gt;122.000000&lt;/td&gt;
      &lt;td&gt;99.000000&lt;/td&gt;
      &lt;td&gt;846.000000&lt;/td&gt;
      &lt;td&gt;67.100000&lt;/td&gt;
      &lt;td&gt;2.420000&lt;/td&gt;
      &lt;td&gt;81.000000&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;From the results we can make out a few insights:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The pregnancy numbers appear to be normally distributed, whereas the others seem to be right-skewed. (The mean and standard deviation of Pregnancies are more or less the same, unlike the other variables.)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The highest glucose level is 199, the highest number of pregnancies is 17, and the highest BMI is 67.1.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now to the fun part.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Visualization&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Plotting a distribution plot for the variable 'BloodPressure'.&lt;/p&gt;

&lt;p&gt;The displot() function is used to visualize the distribution of a univariate variable. With kind='kde' it uses matplotlib to plot a kernel density estimate (KDE) of the distribution.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sns.displot(pima['BloodPressure'], kind='kde')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hplo1dmvr7t5nlwn01em.png" alt="KDE plot of the Blood Pressure levels"&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We can interpret from the plot above that blood pressure lies between 60 and 80 for a large number of observations, i.e. most people's blood pressure ranges from 60 to 80.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What is the BMI of the person with the highest glucose level?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The max() method finds the highest value.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pima[pima['Glucose']==pima['Glucose'].max()]['BMI']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;pre&gt;661    42.9
Name: BMI, dtype: float64&lt;/pre&gt;

&lt;ul&gt;
&lt;li&gt;The person with the highest glucose level (at row index 661) has a BMI of 42.9.&lt;/li&gt;
&lt;/ul&gt;
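The same lookup can be sketched on a toy frame (illustrative, not the Pima data), alongside the equivalent `idxmax()` idiom, which returns the index label of the maximum:

```python
import pandas as pd

# Toy data: row label 1 holds the highest glucose value.
df = pd.DataFrame({"Glucose": [148, 199, 85],
                   "BMI": [33.6, 42.9, 26.6]})

# Boolean-mask version used in the post:
bmi_of_max = df[df["Glucose"] == df["Glucose"].max()]["BMI"]

# Equivalent one-liner: idxmax() gives the row label of the maximum.
print(df.loc[df["Glucose"].idxmax(), "BMI"])  # 42.9
```

Note that in the post's output, 661 is the index label of the matching row, not a glucose value.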

&lt;p&gt;&lt;strong&gt;Finding Measures of Central Tendency (the mean, median, and mode)&lt;/strong&gt;&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;m1 = pima['BMI'].mean()  # mean
print(m1)
m2 = pima['BMI'].median()  # median
print(m2)
m3 = pima['BMI'].mode()[0]  # mode
print(m3)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;pre&gt;32.45080515543619
32.0
32.0
&lt;/pre&gt;

&lt;ul&gt;
&lt;li&gt;The mean, median, and mode (measures of central tendency) are approximately equal, suggesting the BMI distribution is fairly symmetric&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How many women's glucose levels are above the mean glucose level?&lt;/strong&gt;&lt;br&gt;
The mean() method finds the mean of all numerical values in a series or column.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pima[pima['Glucose']&amp;gt;pima['Glucose'].mean()].shape[0]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;pre&gt;343&lt;/pre&gt;

&lt;ul&gt;
&lt;li&gt;343 women have glucose levels above the mean glucose level of 121.68&lt;/li&gt;
&lt;/ul&gt;
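The mask-then-count pattern is easy to verify on a handful of toy values (illustrative, not the Pima data):

```python
import pandas as pd

glucose = pd.Series([148, 85, 183, 89, 137])  # toy values
mean_glucose = glucose.mean()                 # (148+85+183+89+137)/5 = 128.4

# The boolean mask keeps only values above the mean; shape[0] counts them.
above_mean_count = glucose[glucose > mean_glucose].shape[0]
print(above_mean_count)  # 3 (148, 183 and 137)
```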

&lt;p&gt;Let us count the number of women that have their 'BloodPressure' equal to the median of 'BloodPressure' and their 'BMI' less than the median of 'BMI'&lt;/p&gt;

&lt;p&gt;The result is saved into a new dataframe, pima1.&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pima1 = pima[(pima['BloodPressure']==pima['BloodPressure'].median()) &amp;amp; (pima['BMI']&amp;lt;pima['BMI'].median())]
number_of_women=len(pima1.axes[0])
print("Number of women:" +str(number_of_women))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;pre&gt;Number of women:22
&lt;/pre&gt;

&lt;p&gt;&lt;strong&gt;Getting a pairwise distribution between Glucose, Skin thickness and Diabetes pedigree function.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The pair plot gives a pairwise distribution of variables in the dataset. pairplot() function creates a matrix such that each grid shows the relationship between a pair of variables. On the diagonal axes, a plot shows the univariate distribution of each variable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sns.pairplot(data=pima,vars=['Glucose', 'SkinThickness', 'DiabetesPedigreeFunction'], hue = 'Outcome')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8m0ms66j807i09zbu1xj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8m0ms66j807i09zbu1xj.png" alt="A pair plot" width="591" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Studying the correlation between glucose and insulin using a Scatter Plot.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A scatter plot is a set of points plotted on horizontal and vertical axes. It can be used to study the correlation between two variables and to detect extreme data points.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sns.scatterplot(x='Glucose',y='Insulin',data=pima)
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd0xpq40tklzrbuy5qm1h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd0xpq40tklzrbuy5qm1h.png" alt="The scatter plot" width="389" height="262"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The scatter plot suggests that, for most observations, an increase in glucose produces relatively little change in insulin levels. In a few cases, however, insulin rises sharply with glucose; these points are probably outliers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let us explore the possibility of outliers using the Box Plot.&lt;/p&gt;

&lt;p&gt;A boxplot visualizes the five-number summary of a variable and gives information about outliers in the data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plt.boxplot(pima['Age'])

plt.title('Boxplot of Age')
plt.ylabel('Age')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2mlt3k55mpvlymfcs6gt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2mlt3k55mpvlymfcs6gt.png" alt="Boxplot" width="382" height="264"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The box plot shows outliers as points above the upper whisker.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let us now try to understand the number of women in different age groups, split by whether or not they have diabetes. We will use histograms for this.&lt;/p&gt;

&lt;p&gt;A histogram is used to display the distribution and spread of the continuous variable. One axis represents the range of the variable and the other axis shows the frequency of the data points.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Understanding the number of women in different age groups with diabetes.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plt.hist(pima[pima['Outcome']==1]['Age'], bins = 5)
plt.title('Distribution of Age for Women who have Diabetes')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fanuik5r0pbdgi01zkg6g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fanuik5r0pbdgi01zkg6g.png" alt="A histogram of women with diabetes" width="382" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Most of the women with diabetes are between the ages of 22 and 30.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The number of women with diabetes decreases as age increases.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Understanding the number of women in different age groups without diabetes.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plt.hist(pima[pima['Outcome']==0]['Age'], bins = 5)
plt.title('Distribution of Age for Women who do not have Diabetes')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fagtfj02p238iorhjmoxc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fagtfj02p238iorhjmoxc.png" alt="A histogram of women without diabetes" width="393" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Most of the women without diabetes are between the ages of 22 and 33.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The 22 to 35 age range therefore contains both the largest number of women with diabetes and the largest number without it.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What is the Interquartile Range of all the variables?&lt;/strong&gt;&lt;br&gt;
The IQR, or Inter-Quartile Range, is a statistical measure of the variability in a given dataset.&lt;/p&gt;

&lt;p&gt;It tells us the range inside which the bulk of our data lies.&lt;/p&gt;

&lt;p&gt;It is calculated by taking the difference between the third quartile and the first quartile of a dataset.&lt;/p&gt;

&lt;p&gt;Why? The IQR is commonly used to filter outliers: extreme values that lie far from the regular observations and that may arise from variability in measurement or from experimental error.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Q1 = pima.quantile(0.25)
Q3 = pima.quantile(0.75)
IQR = Q3 - Q1
print(IQR)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;pre&gt;Pregnancies                  5.0000
Glucose                     40.5000
BloodPressure               16.0000
SkinThickness               12.0000
Insulin                     48.2500
BMI                          9.1000
DiabetesPedigreeFunction     0.3825
Age                         17.0000
Outcome                      1.0000
dtype: float64
&lt;/pre&gt;
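&lt;p&gt;As a side note, the IQR pairs naturally with the common "1.5 &amp;times; IQR" rule of thumb for flagging outliers. The sketch below is illustrative only: the helper name and the toy ages are made up, not drawn from the pima data.&lt;/p&gt;

```python
import pandas as pd

# Hedged sketch: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR],
# the common 1.5 x IQR rule of thumb.
def iqr_outliers(s: pd.Series) -> pd.Series:
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    # boolean mask: True where the value falls outside the whiskers
    return s.lt(q1 - 1.5 * iqr) | s.gt(q3 + 1.5 * iqr)

ages = pd.Series([22, 25, 28, 31, 33, 41, 45, 81])
print(ages[iqr_outliers(ages)].tolist())  # [81] -- the extreme age is flagged
```

&lt;p&gt;Applied to a column such as pima['Age'], the same mask would let us inspect or drop the rows the box plot marked as outliers.&lt;/p&gt;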

&lt;p&gt;&lt;strong&gt;And finally, let us find and visualize the correlation between all variables.&lt;/strong&gt;&lt;br&gt;
Correlation is a statistic that measures the degree to which two variables move with each other.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;corr_matrix = pima.iloc[:,0:8].corr()

corr_matrix
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;Pregnancies&lt;/th&gt;
      &lt;th&gt;Glucose&lt;/th&gt;
      &lt;th&gt;BloodPressure&lt;/th&gt;
      &lt;th&gt;SkinThickness&lt;/th&gt;
      &lt;th&gt;Insulin&lt;/th&gt;
      &lt;th&gt;BMI&lt;/th&gt;
      &lt;th&gt;DiabetesPedigreeFunction&lt;/th&gt;
      &lt;th&gt;Age&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;Pregnancies&lt;/th&gt;
      &lt;td&gt;1.000000&lt;/td&gt;
      &lt;td&gt;0.128022&lt;/td&gt;
      &lt;td&gt;0.208987&lt;/td&gt;
      &lt;td&gt;0.009393&lt;/td&gt;
      &lt;td&gt;-0.018780&lt;/td&gt;
      &lt;td&gt;0.021546&lt;/td&gt;
      &lt;td&gt;-0.033523&lt;/td&gt;
      &lt;td&gt;0.544341&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;Glucose&lt;/th&gt;
      &lt;td&gt;0.128022&lt;/td&gt;
      &lt;td&gt;1.000000&lt;/td&gt;
      &lt;td&gt;0.219765&lt;/td&gt;
      &lt;td&gt;0.158060&lt;/td&gt;
      &lt;td&gt;0.396137&lt;/td&gt;
      &lt;td&gt;0.231464&lt;/td&gt;
      &lt;td&gt;0.137158&lt;/td&gt;
      &lt;td&gt;0.266673&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;BloodPressure&lt;/th&gt;
      &lt;td&gt;0.208987&lt;/td&gt;
      &lt;td&gt;0.219765&lt;/td&gt;
      &lt;td&gt;1.000000&lt;/td&gt;
      &lt;td&gt;0.130403&lt;/td&gt;
      &lt;td&gt;0.010492&lt;/td&gt;
      &lt;td&gt;0.281222&lt;/td&gt;
      &lt;td&gt;0.000471&lt;/td&gt;
      &lt;td&gt;0.326791&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;SkinThickness&lt;/th&gt;
      &lt;td&gt;0.009393&lt;/td&gt;
      &lt;td&gt;0.158060&lt;/td&gt;
      &lt;td&gt;0.130403&lt;/td&gt;
      &lt;td&gt;1.000000&lt;/td&gt;
      &lt;td&gt;0.245410&lt;/td&gt;
      &lt;td&gt;0.532552&lt;/td&gt;
      &lt;td&gt;0.157196&lt;/td&gt;
      &lt;td&gt;0.020582&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;Insulin&lt;/th&gt;
      &lt;td&gt;-0.018780&lt;/td&gt;
      &lt;td&gt;0.396137&lt;/td&gt;
      &lt;td&gt;0.010492&lt;/td&gt;
      &lt;td&gt;0.245410&lt;/td&gt;
      &lt;td&gt;1.000000&lt;/td&gt;
      &lt;td&gt;0.189919&lt;/td&gt;
      &lt;td&gt;0.158243&lt;/td&gt;
      &lt;td&gt;0.037676&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;BMI&lt;/th&gt;
      &lt;td&gt;0.021546&lt;/td&gt;
      &lt;td&gt;0.231464&lt;/td&gt;
      &lt;td&gt;0.281222&lt;/td&gt;
      &lt;td&gt;0.532552&lt;/td&gt;
      &lt;td&gt;0.189919&lt;/td&gt;
      &lt;td&gt;1.000000&lt;/td&gt;
      &lt;td&gt;0.153508&lt;/td&gt;
      &lt;td&gt;0.025748&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;DiabetesPedigreeFunction&lt;/th&gt;
      &lt;td&gt;-0.033523&lt;/td&gt;
      &lt;td&gt;0.137158&lt;/td&gt;
      &lt;td&gt;0.000471&lt;/td&gt;
      &lt;td&gt;0.157196&lt;/td&gt;
      &lt;td&gt;0.158243&lt;/td&gt;
      &lt;td&gt;0.153508&lt;/td&gt;
      &lt;td&gt;1.000000&lt;/td&gt;
      &lt;td&gt;0.033561&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;Age&lt;/th&gt;
      &lt;td&gt;0.544341&lt;/td&gt;
      &lt;td&gt;0.266673&lt;/td&gt;
      &lt;td&gt;0.326791&lt;/td&gt;
      &lt;td&gt;0.020582&lt;/td&gt;
      &lt;td&gt;0.037676&lt;/td&gt;
      &lt;td&gt;0.025748&lt;/td&gt;
      &lt;td&gt;0.033561&lt;/td&gt;
      &lt;td&gt;1.000000&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
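&lt;p&gt;To make the entries of the table above concrete, here is a small illustrative computation of Pearson's r for two toy arrays (the values are made up, not taken from the pima data). A result near 1 means the two series rise together.&lt;/p&gt;

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])  # roughly 2*x, so strongly correlated

# np.corrcoef returns the full correlation matrix; [0, 1] is r(x, y)
r = np.corrcoef(x, y)[0, 1]
print(round(r, 3))  # 0.999 -- very close to 1
```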

&lt;p&gt;&lt;strong&gt;Now let us visualize using a heatmap.&lt;/strong&gt;&lt;br&gt;
A heatmap is a two-dimensional graphical representation of data in which the individual values contained in a matrix are represented as colors. Each square in the heatmap shows the correlation between the variables on its axes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# 'annot=True' returns the correlation values
plt.figure(figsize=(8,8))
sns.heatmap(corr_matrix, annot = True)

# display the plot
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ycu0zn4xkyk7m3bsmyom.png" class="article-body-image-wrapper"&gt;&lt;img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ycu0zn4xkyk7m3bsmyom.png" alt="A heatmap showing the correlation between the independent variables"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Note: the closer a correlation is to 1, the more positively correlated the two variables are; as one increases, so does the other, and the closer to 1, the stronger the relationship. A correlation close to -1 is similar, except that one variable decreases as the other increases.&lt;/li&gt;
&lt;li&gt;Age and Pregnancies are positively correlated.&lt;/li&gt;
&lt;li&gt;Glucose and Insulin are positively correlated.&lt;/li&gt;
&lt;li&gt;SkinThickness and BMI are positively correlated.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This marks the end of our exhaustive EDA. Tell me what you think, and drop your comments in the comment section. Bye.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>codenewbie</category>
      <category>hacktoberfest</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Data Stories.</title>
      <dc:creator>Data Stories</dc:creator>
      <pubDate>Sat, 05 Nov 2022 08:14:15 +0000</pubDate>
      <link>https://dev.to/data_stories/data-stories-4o6h</link>
      <guid>https://dev.to/data_stories/data-stories-4o6h</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxql6k219s0r9nag5t4na.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxql6k219s0r9nag5t4na.jpg" alt="Image description" width="385" height="211"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When someone hears the words "data science," they often assume the job is entirely analytical. Though this is true, I like to think of it as a storytelling field. After all, who doesn't love a good story? When a data analyst is given a dataset to analyze, we often try to find patterns, trends, and anomalies within it and use that information to make business decisions or predict future data. It would be fair to view this as a pure analysis job, but a better way to think of it is to picture the information as a story your data is trying to tell.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiuznipksdjx4v5gjwa3b.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiuznipksdjx4v5gjwa3b.jpg" alt="Image description" width="291" height="203"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's take my case as an example. One of the first exciting stories I heard from my Mum when I was a kid was the story of Icarus and Daedalus, the one that gave rise to the idiom "Don't fly too close to the sun." I remember the first time she launched into the story. I couldn't quite fathom how it would end. As I was introduced to the fantastical characters and the unique circumstances they found themselves in, I imagined all the scenarios that could have resulted in their final fate. As the story unfolded, it became gradually clear what fate would befall them. In the end, Icarus plummeted down to earth as his father, Daedalus, watched. It was clear that the main lesson was the value of listening to our elders.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjyvsox70ux5wxd6z0lgb.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjyvsox70ux5wxd6z0lgb.jpg" alt="Image description" width="230" height="225"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You might ask, how does the story tie in with data science? Well, let's equate our dataset to the story before we heard it. Both are a mystery: we have no idea what the dataset wants to tell us. But as the story and the dataset unfold (in our case, as we do exploratory and descriptive analysis), we start to get a clearer picture. At the end, we come to understand the patterns (our characters' decisions), the features (the environmental circumstances our characters find themselves in), and the information the dataset gives us (the lesson learned from our characters' story).&lt;/p&gt;

&lt;p&gt;Looking for your data's story is a valuable skill every data scientist must work on improving, and the way to do that is through visualizations. As data scientists, we need to develop skills from broad and various fields. We need some business knowledge, some maths, statistics, and programming. But, I would argue that learning to tell a story with your data is an essential skill as well.&lt;/p&gt;

&lt;p&gt;In summary, if you are a good storyteller and can create efficient visualizations, you will uncover your data's story, present that story effectively to your clients, and prove the value of your work.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>analytics</category>
      <category>python</category>
    </item>
  </channel>
</rss>
