<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Data Stories</title>
    <description>The latest articles on DEV Community by Data Stories (@data_stories).</description>
    <link>https://dev.to/data_stories</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F949037%2F31873213-e724-41d5-a9a2-fe7a88ba2d4a.jpg</url>
      <title>DEV Community: Data Stories</title>
      <link>https://dev.to/data_stories</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/data_stories"/>
    <language>en</language>
    <item>
      <title>Predicting Used Car Prices</title>
      <dc:creator>Data Stories</dc:creator>
      <pubDate>Thu, 08 Jun 2023 19:51:05 +0000</pubDate>
      <link>https://dev.to/data_stories/predicting-used-cars-prices-33kj</link>
      <guid>https://dev.to/data_stories/predicting-used-cars-prices-33kj</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Welcome to week two of my 52-week blog challenge. Find the week one blog article &lt;a href="https://dev.to/evedevtech/visualizing-temperature-variation-a-climate-spiral--35jg"&gt;here&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;Today I will take you through this prediction project I have been working on. &lt;/p&gt;

&lt;p&gt;Let's jump right in.&lt;/p&gt;

&lt;h2&gt;
  
  
  Context
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhlvs8x7b29xft68qw7iz.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhlvs8x7b29xft68qw7iz.jpg" alt="Photo of a used cars lot" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Photo by &lt;a href="https://unsplash.com/ko/@hydngallery?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Haidan&lt;/a&gt; on &lt;a href="https://unsplash.com/s/photos/car-lot?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There is always a huge demand for used cars in developing economies such as Kenya and India. As new-car sales have slowed in recent years, the pre-owned car market has continued to grow and is now larger than the new-car market. Cars4U is a budding Indian tech start-up that aims to find a good strategy in this market. &lt;/p&gt;

&lt;p&gt;Unlike new cars, where price and supply are fairly deterministic and managed by OEMs (Original Equipment Manufacturers), except for dealership-level discounts that come into play only at the last stage of the customer journey, used cars are very different beasts, with huge uncertainty in both pricing and supply. With this in mind, the pricing scheme of these used cars becomes important for growing in the market.&lt;/p&gt;

&lt;p&gt;So how can a data scientist help the business streamline its pricing? You have to come up with a model that can effectively predict the price of used cars and help the business devise profitable strategies using differential pricing. For example, if the business knows the market price, it will never sell anything below it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Objectives
&lt;/h2&gt;

&lt;p&gt;By the end of this blog, you will be able to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Explore and visualize the data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Build a model to predict the prices of the used cars.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Generate a set of insights and recommendations that will help the business.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Come up with an effective and easy-to-understand data story that will inform the business.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Answer the key business question: "Which factors would affect the price of used cars?"&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Data Dictionary
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj3nuod9ej8srooqph5lv.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj3nuod9ej8srooqph5lv.jpg" alt="Image description" width="800" height="531"&gt;&lt;/a&gt;&lt;br&gt;
Photo by &lt;a href="https://unsplash.com/@rvignes?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Romain Vignes&lt;/a&gt; on &lt;a href="https://unsplash.com/s/photos/dictionary?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First, a brief description of what a data dictionary is...&lt;/p&gt;

&lt;p&gt;A data dictionary is a collection of names, definitions, and attributes about data elements that are used to explain what all the variable names and values in a dataset mean.&lt;/p&gt;

&lt;p&gt;For this particular dataset, the dictionary is:&lt;/p&gt;

&lt;p&gt;S.No.: Serial Number&lt;/p&gt;

&lt;p&gt;Name: Name of the car which includes Brand name and Model name&lt;/p&gt;

&lt;p&gt;Location: The location in which the car is being sold or is available for purchase (Cities)&lt;/p&gt;

&lt;p&gt;Year: Manufacturing year of the car&lt;/p&gt;

&lt;p&gt;Kilometers_Driven: The total kilometers driven by the previous owner(s), in km.&lt;/p&gt;

&lt;p&gt;Fuel_Type: The type of fuel used by the car. (Petrol, Diesel, Electric, CNG, LPG)&lt;/p&gt;

&lt;p&gt;Transmission: The type of transmission used by the car. (Automatic / Manual)&lt;/p&gt;

&lt;p&gt;Owner: Type of ownership&lt;/p&gt;

&lt;p&gt;Mileage: The standard mileage offered by the car company in kmpl or km/kg&lt;/p&gt;

&lt;p&gt;Engine: The displacement volume of the engine in CC.&lt;/p&gt;

&lt;p&gt;Power: The maximum power of the engine in bhp.&lt;/p&gt;

&lt;p&gt;Seats: The number of seats in the car.&lt;/p&gt;

&lt;p&gt;New_Price: The price of a new car of the same model, in INR Lakhs (1 Lakh = INR 100,000)&lt;/p&gt;

&lt;p&gt;Price: The price of the used car, in INR Lakhs (Target Variable)&lt;/p&gt;
&lt;h2&gt;
  
  
  Problem Formulation
&lt;/h2&gt;

&lt;p&gt;You are trying to predict a quantity; therefore, you have a regression problem, unlike a classification problem, which predicts a label.&lt;/p&gt;
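&lt;p&gt;To make the distinction concrete, here is a minimal sketch on toy data (not the project's dataset): the regressor predicts a quantity, the classifier predicts a label.&lt;/p&gt;

```python
# Toy illustration of regression vs. classification, not the project's code.
from sklearn.linear_model import LinearRegression, LogisticRegression

X = [[1], [2], [3], [4]]          # e.g. car age in years
y_price = [10.0, 8.0, 6.5, 5.0]   # a quantity, so this is regression
y_cheap = [0, 0, 1, 1]            # a label, so this is classification

reg = LinearRegression().fit(X, y_price)
clf = LogisticRegression().fit(X, y_cheap)

print(reg.predict([[5]]))  # a continuous number
print(clf.predict([[5]]))  # a discrete class label
```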
&lt;h2&gt;
  
  
  Solution
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxew7xtbyuqz0g4wjn5jx.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxew7xtbyuqz0g4wjn5jx.jpg" alt="Step by step" width="800" height="534"&gt;&lt;/a&gt;&lt;br&gt;
Photo by &lt;a href="https://unsplash.com/@websbykaja?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Kaja Kadlecova&lt;/a&gt; on &lt;a href="https://unsplash.com/s/photos/continue?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Due to the lengthy nature of this particular project post, I will divide it into a three-part article miniseries.&lt;/p&gt;

&lt;p&gt;This first part will cover:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Extraction, Transformation, and Loading (ETL) of the data&lt;/li&gt;
&lt;li&gt;Exploratory Data Analysis (EDA)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let me take you through the first portion of this solution to the business case.&lt;/p&gt;
&lt;h3&gt;
  
  
  1. First, you import the necessary libraries.
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import warnings                                                  # Used to ignore the warning given as output of the code
warnings.filterwarnings('ignore')

import numpy as np                                               # Basic libraries of python for numeric and dataframe computations
import pandas as pd

import matplotlib.pyplot as plt                                  # Basic library for data visualization
import seaborn as sns                                            # Slightly advanced library for data visualization

from sklearn.model_selection import train_test_split             # Used to split the data into train and test sets.

from sklearn.linear_model import LinearRegression, Ridge, Lasso  # Import methods to build linear model for statistical analysis and prediction

from sklearn.tree import DecisionTreeRegressor                   # Import methods to build decision trees.
from sklearn.ensemble import RandomForestRegressor               # Import methods to build Random Forest.

from sklearn import metrics                                      # Metrics to evaluate the model

from sklearn.model_selection import GridSearchCV                 # For tuning the model
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Remove the limit from the number of displayed columns and rows. (This step is optional)&lt;br&gt;
&lt;code&gt;pd.set_option("display.max_columns", None)&lt;br&gt;
pd.set_option("display.max_rows", 200)&lt;/code&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  2. Now you explore the data (Extract, Transform, and Load: ETL)
&lt;/h3&gt;
&lt;h4&gt;
  
  
  Loading the data
&lt;/h4&gt;

&lt;p&gt;Loading the data into Python to explore and understand it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = pd.read_csv("used_cars_data.csv")
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns.")  # f-string

df.head(10)  # displays the first ten rows
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;There are 7253 rows and 14 columns.
S.No.   Name    Location    Year    Kilometers_Driven   Fuel_Type   Transmission    Owner_Type  Mileage Engine  Power   Seats   New_Price   Price
0   0   Maruti Wagon R LXI CNG  Mumbai  2010    72000   CNG Manual  First   26.6 km/kg  998 CC  58.16 bhp   5.0 NaN 1.75
1   1   Hyundai Creta 1.6 CRDi SX Option    Pune    2015    41000   Diesel  Manual  First   19.67 kmpl  1582 CC 126.2 bhp   5.0 NaN 12.50
2   2   Honda Jazz V    Chennai 2011    46000   Petrol  Manual  First   18.2 kmpl   1199 CC 88.7 bhp    5.0 8.61 Lakh   4.50
3   3   Maruti Ertiga VDI   Chennai 2012    87000   Diesel  Manual  First   20.77 kmpl  1248 CC 88.76 bhp   7.0 NaN 6.00
4   4   Audi A4 New 2.0 TDI Multitronic Coimbatore  2013    40670   Diesel  Automatic   Second  15.2 kmpl   1968 CC 140.8 bhp   5.0 NaN 17.74
5   5   Hyundai EON LPG Era Plus Option Hyderabad   2012    75000   LPG Manual  First   21.1 km/kg  814 CC  55.2 bhp    5.0 NaN 2.35
6   6   Nissan Micra Diesel XV  Jaipur  2013    86999   Diesel  Manual  First   23.08 kmpl  1461 CC 63.1 bhp    5.0 NaN 3.50
7   7   Toyota Innova Crysta 2.8 GX AT 8S   Mumbai  2016    36000   Diesel  Automatic   First   11.36 kmpl  2755 CC 171.5 bhp   8.0 21 Lakh 17.50
8   8   Volkswagen Vento Diesel Comfortline Pune    2013    64430   Diesel  Manual  First   20.54 kmpl  1598 CC 103.6 bhp   5.0 NaN 5.20
9   9   Tata Indica Vista Quadrajet LS  Chennai 2012    65932   Diesel  Manual  Second  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What you learn from the above is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;S.No. is just an index for the data entry. In all likelihood, this column will not be a significant factor in determining the price of the car. Having said that, there are instances where the index of the data entry contains information about the time factor (an entry with a smaller index corresponds to data entered years ago).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now check the info of the data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.info()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What you learn from the above is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mileage, Engine, Power and New_Price are objects when they should ideally be numerical. To be able to get summary statistics for these columns, you will have to process them first.&lt;/li&gt;
&lt;/ul&gt;
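&lt;p&gt;The idea behind that processing, sketched here on toy data (the post walks through explicit loops below), is to strip the unit suffix and coerce what remains to a number:&lt;/p&gt;

```python
import pandas as pd

# Toy sketch: pull the leading number out of a "value + unit" string column,
# then coerce it to float; missing values pass through as NaN.
sample = pd.DataFrame({"Mileage": ["26.6 km/kg", "19.67 kmpl", None]})

sample["km_per_unit_fuel"] = pd.to_numeric(
    sample["Mileage"].str.extract(r"^(\d+(?:\.\d+)?)\s", expand=False)
)
print(sample["km_per_unit_fuel"].tolist())  # [26.6, 19.67, nan]
```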

&lt;h4&gt;
  
  
  Processing Columns
&lt;/h4&gt;

&lt;p&gt;Process 'Mileage', 'Engine', 'Power' and 'New_Price' and extract numerical values from them.&lt;/p&gt;

&lt;h5&gt;
  
  
  1. Mileage
&lt;/h5&gt;

&lt;p&gt;You have car mileage in two units, kmpl and km/kg.&lt;/p&gt;

&lt;p&gt;After a quick internet search, it is clear that these two units are used for cars of two different fuel types.&lt;/p&gt;

&lt;p&gt;kmpl - kilometers per litre - is used for petrol and diesel cars.&lt;/p&gt;

&lt;p&gt;km/kg - kilometers per kg - is used for CNG- and LPG-based engines.&lt;/p&gt;

&lt;p&gt;You have the variable Fuel_Type in the data.&lt;br&gt;
Check whether these observations also hold true in this dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create 2 new columns after splitting the mileage values.
km_per_unit_fuel = []
mileage_unit = []

for observation in df["Mileage"]:
    if isinstance(observation, str):
        if (
            observation.split(" ")[0]
            .replace(".", "", 1)
            .isdigit()  # First element should be numeric
            and " " in observation  # Space between numeric and unit
            and (
                observation.split(" ")[1]
                == "kmpl"  # Units are limited to "kmpl" and "km/kg"
                or observation.split(" ")[1] == "km/kg"
            )
        ):
            km_per_unit_fuel.append(float(observation.split(" ")[0]))
            mileage_unit.append(observation.split(" ")[1])
        else:
            # To detect if there are any observations in the column that do not follow
            # The expected format [number + ' ' + 'kmpl' or 'km/kg']
            print(
                "The data needs further processing. All values are not similar ",
                observation,
            )
    else:
        # If there are any missing values in the mileage column,
        # We add corresponding missing values to the 2 new columns
        km_per_unit_fuel.append(np.nan)
        mileage_unit.append(np.nan)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# No print output from the function above. The values are all in the expected format or NaNs
# Add the new columns to the data
df["km_per_unit_fuel"] = km_per_unit_fuel
df["mileage_unit"] = mileage_unit

# Checking the new dataframe
df.head(5)  # looks good!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Check if the units correspond to the fuel types as expected.
df.groupby(by = ["Fuel_Type", "mileage_unit"]).size()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result &lt;br&gt;
&lt;code&gt;Fuel_Type  mileage_unit&lt;br&gt;
CNG        km/kg             62&lt;br&gt;
Diesel     kmpl            3852&lt;br&gt;
LPG        km/kg             12&lt;br&gt;
Petrol     kmpl            3325&lt;br&gt;
dtype: int64&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;As expected, km/kg is for CNG/LPG cars and kmpl is for Petrol and Diesel cars.&lt;/li&gt;
&lt;/ul&gt;
&lt;h5&gt;
  
  
  2. Engine
&lt;/h5&gt;

&lt;p&gt;The data dictionary suggests that Engine indicates the displacement volume of the engine in CC. You will make sure that all the observations follow the same format - [numeric + " " + "CC"] and create a new numeric column from this column.&lt;/p&gt;

&lt;p&gt;This time, use a regex to make all the necessary checks.&lt;/p&gt;

&lt;p&gt;Regular Expressions, also known as “regex”, are used to match strings of text such as particular characters, words, or patterns of characters. It means that you can match and extract any string pattern from the text with the help of regular expressions.&lt;br&gt;
&lt;/p&gt;
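&lt;p&gt;As a quick illustration of the pattern used in the next block: a number (optionally with a decimal part), a space, and "CC", anchored to the whole string.&lt;/p&gt;

```python
import re

# The same pattern the loop below uses for the Engine column
pattern = r"^\d+(\.\d+)? CC$"

print(bool(re.match(pattern, "998 CC")))     # True
print(bool(re.match(pattern, "1582.5 CC")))  # True
print(bool(re.match(pattern, "null CC")))    # False
```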

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# re module provides support for regular expressions
import re

# Create a new column after splitting the engine values.
engine_num = []

# Regex for numeric + " " + "CC"  format
regex_engine = r"^\d+(\.\d+)? CC$"

for observation in df["Engine"]:
    if isinstance(observation, str):
        if re.match(regex_engine, observation):
            engine_num.append(float(observation.split(" ")[0]))
        else:
            # To detect if there are any observations in the column that do not follow [numeric + " " + "CC"]  format
            print(
                "The data needs further processing. All values are not similar ",
                observation,
            )
    else:
        # If there are any missing values in the engine column, we add missing values to the new column
        engine_num.append(np.nan)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# No print output from the function above. The values are all in the same format - [numeric + " " + "CC"] OR NaNs
# Add the new column to the data
df["engine_num"] = engine_num

# Checking the new dataframe
df.head(5)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h5&gt;
  
  
  3. Power
&lt;/h5&gt;

&lt;p&gt;The data dictionary suggests that Power indicates the maximum power of the engine in bhp. You will make sure that all the observations follow the same format - [numeric + " " + "bhp"] and create a new numeric column from this column, like you did for Engine.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create a new column after splitting the power values
power_num = []

# Regex for numeric + " " + "bhp"  format
regex_power = r"^\d+(\.\d+)? bhp$"

for observation in df["Power"]:
    if isinstance(observation, str):
        if re.match(regex_power, observation):
            power_num.append(float(observation.split(" ")[0]))
        else:
            # To detect if there are any observations in the column that do not follow [numeric + " " + "bhp"]  format
            # That we see in the sample output
            print(
                "The data needs further processing. All values are not similar ",
                observation,
            )
    else:
        # If there are any missing values in the power column, we add missing values to the new column
        power_num.append(np.nan)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;You can see that some null values in the Power column exist as the string 'null bhp'. Let us replace these with NaNs.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ower_num = []

for observation in df["Power"]:
    if isinstance(observation, str):
        if re.match(regex_power, observation):
            power_num.append(float(observation.split(" ")[0]))
        else:
            power_num.append(np.nan)
    else:
        # If there are any missing values in the power column, we add missing values to the new column
        power_num.append(np.nan)

# Add the new column to the data
df["power_num"] = power_num

# Checking the new dataframe
df.head(10)  # Looks good now
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h5&gt;
  
  
  4. New_Price
&lt;/h5&gt;

&lt;p&gt;You know that New_Price is the price of a new car of the same model, in INR Lakhs (1 Lakh = 100,000).&lt;/p&gt;

&lt;p&gt;This column clearly has a lot of missing values. You will impute the missing values later. For now you will only extract the numeric values from this column.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create a new column after splitting the New_Price values.
new_price_num = []

# Regex for numeric + " " + "Lakh"  format
regex_power = r"^\d+(\.\d+)? Lakh$"  # note: reuses the variable name from the Power section

for observation in df["New_Price"]:
    if isinstance(observation, str):
        if re.match(regex_power, observation):
            new_price_num.append(float(observation.split(" ")[0]))
        else:
            # To detect if there are any observations in the column that do not follow [numeric + " " + "Lakh"]  format
            # That we see in the sample output
            print(
                "The data needs further processing. All values are not similar ",
                observation,
            )
    else:
        # If there are any missing values in the New_Price column, we add missing values to the new column
        new_price_num.append(np.nan)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You will see that not all values are in Lakhs. There are a few observations in Crores as well.&lt;/p&gt;

&lt;p&gt;Convert these to Lakhs (1 Crore = 100 Lakh).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;new_price_num = []

for observation in df["New_Price"]:
    if isinstance(observation, str):
        if re.match(regex_power, observation):
            new_price_num.append(float(observation.split(" ")[0]))
        else:
            # Converting values in Crore to lakhs
            new_price_num.append(float(observation.split(" ")[0]) * 100)
    else:
        # If there are any missing values in the New_Price column, we add missing values to the new column
        new_price_num.append(np.nan)

# Add the new column to the data
df["new_price_num"] = new_price_num

# Checking the new dataframe
df.head(5)  # Looks ok
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Feature Engineering
&lt;/h3&gt;

&lt;p&gt;The Name column in the current format might not be very useful in our analysis. Since the name contains both the brand name and the model name of the vehicle, the column would have too many unique values to be useful in prediction.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df["Name"].nunique()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result&lt;br&gt;
&lt;code&gt;2041&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;With 2041 unique names, car names are not going to be great predictors of price in the current data. But you can process this column to extract important information and see whether that reduces the number of levels.&lt;/li&gt;
&lt;/ul&gt;
&lt;h5&gt;
  
  
  1. Car Brand Name
&lt;/h5&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Extract Brand Names
df["Brand"] = df["Name"].apply(lambda x: x.split(" ")[0].lower())

# Check the data
df["Brand"].value_counts()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plt.figure(figsize = (15, 7))

sns.countplot(y = "Brand", data = df, order = df["Brand"].value_counts().index)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Resulting visualization.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3w7x050b7bl75zo3gohq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3w7x050b7bl75zo3gohq.png" alt="A count plot showing Maruti as the most popular car brand name " width="800" height="358"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A count plot showing Maruti as the most popular car brand name&lt;/p&gt;
&lt;h5&gt;
  
  
  2. Car Model Name
&lt;/h5&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Extract Model Names
df["Model"] = df["Name"].apply(lambda x: x.split(" ")[1].lower())

# Check the data
df["Model"].value_counts()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plt.figure(figsize = (15, 7))

sns.countplot(y = "Model", data = df, order = df["Model"].value_counts().index[0:30])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhuc24xvlmtqazttr7xy6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhuc24xvlmtqazttr7xy6.png" alt="A count plot that shows swift as the most popular car model name   " width="800" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A count plot that shows swift as the most popular car model name.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;It is clear from the above charts that the dataset contains used cars from luxury as well as budget-friendly brands.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You can create a new variable using this information by binning all the cars into three categories:&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;ul&gt;
&lt;li&gt;Budget-Friendly&lt;/li&gt;
&lt;li&gt;Mid Range&lt;/li&gt;
&lt;li&gt;Luxury Cars&lt;/li&gt;
&lt;/ul&gt;
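&lt;p&gt;The actual binning is done later in the series, once missing values are handled, but a minimal sketch with &lt;code&gt;pd.cut&lt;/code&gt; (the bin edges here are illustrative, not the project's final ones) looks like this:&lt;/p&gt;

```python
import pandas as pd

# Toy sketch: bucket cars by new price in Lakhs into three categories.
# The bin edges (10 and 25 Lakhs) are illustrative assumptions.
prices = pd.Series([3.5, 8.6, 21.0, 120.0])

car_category = pd.cut(
    prices,
    bins=[0, 10, 25, float("inf")],
    labels=["Budget-Friendly", "Mid Range", "Luxury Cars"],
)
print(car_category.tolist())
```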
&lt;h5&gt;
  
  
  3. Car_category
&lt;/h5&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.groupby(["Brand"])["Price"].mean().sort_values(ascending = False)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Output &lt;br&gt;
&lt;code&gt;Brand&lt;br&gt;
lamborghini      120.000000&lt;br&gt;
bentley           59.000000&lt;br&gt;
porsche           48.348333&lt;br&gt;
land              39.259500&lt;br&gt;
jaguar            37.632250&lt;br&gt;
mini              26.896923&lt;br&gt;
mercedes-benz     26.809874&lt;br&gt;
audi              25.537712&lt;br&gt;
bmw               25.243146&lt;br&gt;
volvo             18.802857&lt;br&gt;
jeep              18.718667&lt;br&gt;
isuzu             14.696667&lt;br&gt;
toyota            11.580024&lt;br&gt;
mitsubishi        11.058889&lt;br&gt;
force              9.333333&lt;br&gt;
mahindra           8.045919&lt;br&gt;
skoda              7.559075&lt;br&gt;
ford               6.889400&lt;br&gt;
renault            5.799034&lt;br&gt;
honda              5.411743&lt;br&gt;
hyundai            5.343433&lt;br&gt;
volkswagen         5.307270&lt;br&gt;
nissan             4.738352&lt;br&gt;
maruti             4.517267&lt;br&gt;
tata               3.562849&lt;br&gt;
fiat               3.269286&lt;br&gt;
datsun             3.049231&lt;br&gt;
chevrolet          3.044463&lt;br&gt;
smart              3.000000&lt;br&gt;
ambassador         1.350000&lt;br&gt;
hindustan               NaN&lt;br&gt;
opelcorsa               NaN&lt;br&gt;
Name: Price, dtype: float64&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The output is very close to expectation (domain knowledge) in terms of brand ordering. The mean price of a used Lamborghini is 120 Lakhs, and the other luxury brands follow in descending order.&lt;/p&gt;

&lt;p&gt;Towards the bottom end you have the more budget-friendly brands.&lt;/p&gt;

&lt;p&gt;You can see that there is some missingness in the data. You can come back to creating this variable once you have removed the missingness from the data.&lt;/p&gt;
&lt;h2&gt;
  
  
  Exploratory Data Analysis
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Basic summary stats - Numeric variables
df.describe().T
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Output&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz1nemmyvcvmzt1bg7ipa.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz1nemmyvcvmzt1bg7ipa.PNG" alt="Image showing the summary statistics table of the data" width="583" height="251"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Observations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;S.No. has no interpretation here, but as discussed earlier, drop it only after having looked at the initial linear model.&lt;/li&gt;
&lt;li&gt;Kilometers_Driven values have an incredibly high range. You should check a few of the extreme values to get a sense of the data.&lt;/li&gt;
&lt;li&gt;The minimum and maximum number of seats in the car also warrant a quick check. On average a car seems to have 5 seats, which seems right.&lt;/li&gt;
&lt;li&gt;You have used cars being sold at less than a Lakh rupees and as high as 160 Lakhs, as you saw for the Lamborghini earlier. You might have to drop some of these outliers to build a robust model.&lt;/li&gt;
&lt;li&gt;The minimum Mileage being 0 is also concerning; you'll have to check what is going on.&lt;/li&gt;
&lt;li&gt;Engine and Power mean and median values are not very different. Only someone with more domain knowledge would be able to comment further on these attributes.&lt;/li&gt;
&lt;li&gt;The New_Price range seems right. You have both budget-friendly Maruti cars and Lamborghinis in stock. The mean being twice the median suggests that there are only a few very high-priced brands, which again makes sense.
&lt;/li&gt;
&lt;/ol&gt;
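&lt;p&gt;One common way to flag such extreme values is the 1.5 * IQR rule. The sketch below uses toy numbers and is not necessarily the treatment used later in the series:&lt;/p&gt;

```python
import pandas as pd

# Generic outlier-flagging sketch using the 1.5 * IQR rule on toy data
km = pd.Series([41000, 46000, 72000, 87000, 6500000])

q1, q3 = km.quantile(0.25), km.quantile(0.75)
iqr = q3 - q1
upper = q3 + 1.5 * iqr

print(km[km.gt(upper)].tolist())  # [6500000]
```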
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Check Kilometers_Driven extreme values
df.sort_values(by = ["Kilometers_Driven"], ascending = False).head(10)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;&lt;p&gt;It looks like the first row here is a data entry error. A car manufactured as recently as 2017 having been driven 6,500,000 km is almost impossible.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The observations that follow are also on the higher end. There is a good chance that these are outliers. You'll look at this further while doing the univariate analysis.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Check Kilometers_Driven Extreme values
df.sort_values(by = ["Kilometers_Driven"], ascending = True).head(10)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;&lt;p&gt;After looking at the Year, New_Price, and Price columns, these entries seem feasible.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;1000 might be a default value here. Quite a few cars having been driven exactly 1,000 km is suspicious.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
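&lt;p&gt;A quick way to quantify that suspicion is to count how many listings sit at exactly 1,000 km; on the real dataframe this would be &lt;code&gt;df["Kilometers_Driven"].eq(1000).sum()&lt;/code&gt;. On toy data:&lt;/p&gt;

```python
import pandas as pd

# Count entries at exactly the suspected default value of 1000 km
km = pd.Series([1000, 1000, 350, 1000, 620])
print(km.eq(1000).sum())  # 3
```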
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Check seats extreme values
df.sort_values(by = ["Seats"], ascending = True).head(5)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;An Audi A4 having 0 seats is a data entry error. This column requires some outlier treatment, or you can treat Seats == 0 as a missing value. Overall, there doesn't seem to be much to be concerned about here.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Let us check if we have a similar car in our dataset.
df[df["Name"].str.startswith("Audi A4")]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Looks like an Audi A4 typically has 5 seats.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Let us replace #seats in row index 3999 form 0 to 5
df.loc[3999, "Seats"] = 5.0


# Check seats extreme values
df.sort_values(by = ["Seats"], ascending = False).head(5)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A Toyota Qualis has 10 seats and so does a Tata Sumo. No data entry error here.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Check Mileage - km_per_unit_fuel extreme values
df.sort_values(by = ["km_per_unit_fuel"], ascending = True).head(10)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You will have to treat Mileage = 0 as missing values.&lt;br&gt;
&lt;/p&gt;
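&lt;p&gt;A minimal sketch of that treatment, on stand-in values (the column name follows the post; the data itself is made up): replacing the sentinel zeros with NaN lets them flow into the imputation step later.&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Hypothetical sketch: treat the impossible Mileage == 0 readings as missing
# rather than as real data, so they can be imputed later.
df = pd.DataFrame({"km_per_unit_fuel": [0.0, 18.9, 0.0, 23.1]})

df["km_per_unit_fuel"] = df["km_per_unit_fuel"].replace(0.0, np.nan)
print(df["km_per_unit_fuel"].isna().sum())  # → 2
```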

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Check Mileage - km_per_unit_fuel extreme values
df.sort_values(by = ["km_per_unit_fuel"], ascending = False).head(10)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Maruti Wagon R and Maruti Alto CNG versions are budget-friendly cars with high mileage, so these data points are fine.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Looking at value counts for non-numeric features

num_to_display = 10  # Defining this up here so it's easy to change later

for colname in df.dtypes[df.dtypes == "object"].index:
    val_counts = df[colname].value_counts(dropna = False)  # Will also show the NA counts

    print(val_counts[:num_to_display])

    if len(val_counts) &amp;gt; num_to_display:
        print(f"Only displaying first {num_to_display} of {len(val_counts)} values.")
    print("\n\n")  # Just for more space in between
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since you haven't dropped the original columns that you processed, there are a few redundant outputs here.&lt;br&gt;
You checked cars of different Fuel_Type earlier but did not encounter the 2 electric cars. Let us check why.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.loc[df["Fuel_Type"] == "Electric"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Mileage values for these cars are NaN, which is why you did not encounter them earlier with groupby.&lt;/p&gt;

&lt;p&gt;Electric cars are very new in the market and very rare in our dataset. You can consider dropping these two observations if they turn out to be outliers later. There is a good chance that you will not be able to create a good price prediction model for electric cars, with the currently available data.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Missing Values
&lt;/h3&gt;

&lt;p&gt;Before you start looking at the individual distributions and interactions, let's quickly check the missingness in the data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Checking missing values in the dataset
df.isnull().sum()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;2 Electric car variants don't have entries for Mileage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Engine displacement is missing for 46 observations, and maximum power is missing for 175 entries.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Information about the number of seats is not available for 53 entries.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Price is also missing for 1234 entries. Since Price is the response variable you want to predict, you will have to drop these rows when you build a model; they cannot help with modeling or model evaluation. While analyzing distributions and imputing missing values, though, you will keep using the information in these rows.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;New Price, as you saw earlier, has a huge missing count: 6247 entries. You'll have to see if there is a pattern here, and explore whether you can impute these values or should drop the column altogether.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
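&lt;p&gt;For the imputation step, one common approach is to fill a missing value from similar cars. Here is a sketch on a toy frame; grouping by Name is an illustrative assumption, since the post does not specify its imputation strategy.&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Hypothetical imputation sketch: fill missing Seats with the median
# Seats of cars sharing the same model name (toy data).
df = pd.DataFrame({
    "Name": ["Audi A4", "Audi A4", "Audi A4", "Tata Sumo"],
    "Seats": [5.0, np.nan, 5.0, 10.0],
})

df["Seats"] = df.groupby("Name")["Seats"].transform(lambda s: s.fillna(s.median()))
print(df["Seats"].tolist())  # → [5.0, 5.0, 5.0, 10.0]
```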

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Drop the redundant columns.
df.drop(
    columns=["Mileage", "mileage_unit", "Engine", "Power", "New_Price"], inplace = True
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F79llkace50lr7feq1o10.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F79llkace50lr7feq1o10.jpg" alt="To be continued....." width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Photo by &lt;a href="https://unsplash.com/@sunnystate?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Reuben Juarez&lt;/a&gt; on &lt;a href="https://unsplash.com/s/photos/continue?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You have come to the end of part one. The &lt;a href="https://dev.to/evedevtech/predicting-used-cars-prices-part-two-22hh-temp-slug-5145758?preview=b02f5e1dfaaefd165765996ed3fe888b4d847544aa34615e3dab523f1a7334d70f8c4f4bc7346eb23797b7255f7880dd79e9893b01700c72c0ac0cd7"&gt;part two&lt;/a&gt; post will cover data visualization, bivariate distributions, and correlations between variables.&lt;/p&gt;

&lt;p&gt;Here is the &lt;a href="https://github.com/Eve-dev-tech/Predicting-used-car-prices" rel="noopener noreferrer"&gt;link&lt;/a&gt; to the source code.&lt;/p&gt;

&lt;p&gt;Stay tuned! Like, save, and share your comments. Happy coding.&lt;/p&gt;

</description>
      <category>python</category>
      <category>beginners</category>
      <category>prediction</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Visualizing Temperature Variation; A Climate Spiral .</title>
      <dc:creator>Data Stories</dc:creator>
      <pubDate>Tue, 06 Jun 2023 17:19:51 +0000</pubDate>
      <link>https://dev.to/data_stories/visualizing-temperature-variation-a-climate-spiral--35jg</link>
      <guid>https://dev.to/data_stories/visualizing-temperature-variation-a-climate-spiral--35jg</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fobi1qz53p681s2wxx2ot.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fobi1qz53p681s2wxx2ot.jpg" alt="A photo depicting crossword tiles with the words" width="800" height="600"&gt;&lt;/a&gt;&lt;br&gt;
Photo by &lt;a href="https://unsplash.com/@brett_jordan?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Brett Jordan&lt;/a&gt; on &lt;a href="https://unsplash.com/s/photos/begin?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Welcome to the first article of my 52-week blog challenge. I will be covering technical and descriptive articles in the field of data science and artificial intelligence.&lt;/p&gt;

&lt;p&gt;Let's jump right into the definitions first.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://en.wikipedia.org/wiki/Temperature" rel="noopener noreferrer"&gt;Temperature&lt;/a&gt;&lt;/strong&gt; - a physical quantity that expresses the perception of hotness and coldness, measured on a scale.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.merriam-webster.com/dictionary/variation" rel="noopener noreferrer"&gt;&lt;strong&gt;Variation&lt;/strong&gt;&lt;/a&gt; - the extent to which something differs from another.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So....&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Temperature variation is the measure of the difference in temperature in a specific area over a particular range of time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Febnw7bwly5b2hj0dnf47.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Febnw7bwly5b2hj0dnf47.jpg" alt="Image showing variation in temperature across the world" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Goals
&lt;/h2&gt;

&lt;p&gt;The goal of this project is to create an animated spiral of Kenya's variation in temperature from 1991 to 2016.&lt;/p&gt;

&lt;p&gt;By the end of this blog post you will have learned:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Exploratory data analysis - ETL (Extraction, Transformation and Loading of data)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data Visualization&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Generation of a &lt;a href="https://www.socialpilot.co/social-media-terms/gif" rel="noopener noreferrer"&gt;GIF &lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Reporting and presenting the data's story after transforming it from data to information and insights.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  Why?
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Descriptive analysis&lt;/strong&gt;- It will describe the current situation on the ground.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Informed decision making&lt;/strong&gt;-The insight will help with making informed decisions in climate policy-making.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Disaster preparedness&lt;/strong&gt;-The visualization can help show early signs of unusual temperature spikes that could help prepare better for them.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://climate.nasa.gov/climate_resources/300/video-climate-spiral-1880-2022/" rel="noopener noreferrer"&gt;Ed Hawkins&lt;/a&gt;, a climate scientist, unveiled an animated visualization in 2017 that captivated the world. This visualization showed the deviations in the global average temperature from 1850 to 2017. It was re-shared millions of times over Twitter and Facebook and a version of it was even shown at the opening ceremony for the Rio Olympics.&lt;/p&gt;

&lt;p&gt;This animation is created with the help of &lt;a href="https://www.dataquest.io/blog/climate-temperature-spirals-python/" rel="noopener noreferrer"&gt;this Dataquest tutorial&lt;/a&gt; written by Srini Kadamati.&lt;/p&gt;

&lt;p&gt;Historical weather data was retrieved from &lt;a href="https://africaopendata.org/dataset/kenya-climate-data-1991-2016" rel="noopener noreferrer"&gt;Africa Open Data&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The data was collected for the climate knowledge portal by &lt;a href="https://www.worldbank.org/en/home" rel="noopener noreferrer"&gt;the World Bank&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Building the spiral visualization.
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. ETL (Extraction, Transformation and Loading of data)
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#importing libraries we'll use 
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
import matplotlib.animation as animation


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#reading the temperature file into a pandas dataframe
temp_data = pd.read_csv(
    "temp data.csv",
    delim_whitespace=True,
    usecols=[0, 1],
    header=None)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Let's take a quick look at the data frame and some properties of the data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;temp_data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
0   1
0   Year,Month  Average,Temperature
1   1991,Jan    Average,25.1631
2   1991,Feb    Average,26.0839
3   1991,Mar    Average,26.2236
4   1991,Apr    Average,25.5812
... ... ...
308 2016,Aug    Average,24.0942
309 2016,Sep    Average,24.437
310 2016,Oct    Average,26.0317
311 2016,Nov    Average,25.5692
312 2016,Dec    Average,25.7401
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;temp_data.describe()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
0   1
count   313 313
unique  313 313
top Year,Month  Average,Temperature
freq    1   1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From the results you get, check if there is a need to make it more readable.&lt;/p&gt;

&lt;p&gt;With this particular case, you need to separate year, month, and average temperature.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;temp_data[['Year', 'Month']] = temp_data['Year'].str.split(',', expand=True)

temp_data[['Average', 'Temparature']] = temp_data['Average'].str.split(',', expand=True)
temp_data.head()


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
0   1   Year    Month   Average Temperature Temparature
0   Year,Month  Average,Temperature Year    Month   Average Average,Temperature Temperature
1   1991,Jan    Average,25.1631 1991    Jan Average Average,25.1631 25.1631
2   1991,Feb    Average,26.0839 1991    Feb Average Average,26.0839 26.0839
3   1991,Mar    Average,26.2236 1991    Mar Average Average,26.2236 26.2236
4   1991,Apr    Average,25.5812 1991    Apr Average Average,25.5812 25.5812
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is best practice to drop the columns that are repetitive.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;temp_data_1 = temp_data.drop(temp_data.columns[[0, 1, 4, 5]], axis=1)
temp_data_1

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Year    Month   Temparature
0   Year    Month   Temperature
1   1991    Jan 25.1631
2   1991    Feb 26.0839
3   1991    Mar 26.2236
4   1991    Apr 25.5812
... ... ... ...
308 2016    Aug 24.0942
309 2016    Sep 24.437
310 2016    Oct 26.0317
311 2016    Nov 25.5692
312 2016    Dec 25.7401
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let's get to know the &lt;a href="https://docs.python.org/3/library/datatypes.html" rel="noopener noreferrer"&gt;data types&lt;/a&gt; in the data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#getting to know what data types my data frame has
temp_data_2.dtypes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Year           object
Month          object
Temparature    object
dtype: object
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All the columns are of object (string) type.&lt;br&gt;
You need to convert the temperature column from object to float; that is the only way you can perform mathematical operations on it and plot it on a numeric scale.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;temp_data_2['Temparature'] = temp_data_2['Temparature'].astype(str).astype(float)

#view data types of each column
temp_data_2.dtypes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Year            object
Month           object
Temparature    float64
dtype: object
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you will write a function that converts month names to numbers, using the &lt;a href="https://docs.python.org/3/library/datetime.html" rel="noopener noreferrer"&gt;datetime Python library&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Define a function to convert month names to numbers
def month_string_to_number(string):
    dt = datetime.strptime(string, "%b")
    return dt.month
## Apply the function to the month column to convert to numbers
temp_data_2['month_number'] = temp_data_2['Month'].apply(month_string_to_number)

temp_data_2.head(20)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    Year    Month   Temparature month_number
1   1991.0  Jan 25.1631 1
2   1991.0  Feb 26.0839 2
3   1991.0  Mar 26.2236 3
4   1991.0  Apr 25.5812 4
5   1991.0  May 24.6618 5
6   1991.0  Jun 23.9439 6
7   1991.0  Jul 22.9982 7
8   1991.0  Aug 23.0391 8
9   1991.0  Sep 23.9423 9
10  1991.0  Oct 25.5236 10
11  1991.0  Nov 24.5875 11
12  1991.0  Dec 24.7398 12
13  1992.0  Jan 24.4359 1
14  1992.0  Feb 26.2892 2
15  1992.0  Mar 26.5409 3
16  1992.0  Apr 26.0819 4
17  1992.0  May 24.7852 5
18  1992.0  Jun 24.0563 6
19  1992.0  Jul 22.8377 7
20  1992.0  Aug 22.7902 8
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is best practice to drop the unnecessary month name column.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;temp_data_2 = temp_data_2.drop('Month', axis=1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Checking for null or missing values is very important in the ETL process.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;temp_data_2.isnull().sum()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Year            0
Temparature     0
month_number    0
dtype: int64
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are no missing values in this data.&lt;/p&gt;

&lt;p&gt;Now you find the mean of the temperature column and subtract the mean from each individual value in the column. This will help you find the temperature variation of every month against the year's mean temperature. This is a sort of &lt;a href="https://en.wikipedia.org/wiki/Normalization_(statistics)" rel="noopener noreferrer"&gt;normalization of data&lt;/a&gt;.&lt;/p&gt;
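&lt;p&gt;That normalization step can be sketched like this; the monthly values below are stand-ins, not the actual dataset.&lt;/p&gt;

```python
import pandas as pd

# Sketch of the normalization described above, on stand-in monthly values:
# subtract the column mean so each entry becomes a deviation from the mean.
temps = pd.Series([25.1631, 26.0839, 26.2236, 25.5812])

anomaly = temps - temps.mean()
print(anomaly.round(4))
# By construction, the anomalies sum to (approximately) zero.
```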

&lt;h3&gt;
  
  
  2. Visualizing the data.
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Cartesian versus polar coordinate system&lt;/em&gt;&lt;br&gt;
There are a few key phases to recreating Ed's GIF:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;learning how to plot on a polar coordinate system&lt;/li&gt;
&lt;li&gt;transforming the data for polar visualization&lt;/li&gt;
&lt;li&gt;customizing the aesthetics of the plot&lt;/li&gt;
&lt;li&gt;stepping through the visualization year by year and turning the plot into a GIF&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  - Preparing data for polar plotting
&lt;/h4&gt;

&lt;p&gt;You need to subset the data by year and use the following coordinates:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;r&lt;/strong&gt;: the temperature value for a given month, adjusted to contain no negative values.&lt;br&gt;
Matplotlib can plot negative values, but not in the way you want here: you want -0.1 to be closer to the center than 0.1, which isn't the default matplotlib behavior.&lt;br&gt;
You also want to leave some space around the origin of the plot for displaying the year as text.&lt;br&gt;
&lt;strong&gt;theta&lt;/strong&gt;: 12 equally spaced angle values covering the circle from 0 to 2*pi, one per month.&lt;/p&gt;
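&lt;p&gt;A quick note on generating those angles: np.linspace includes its endpoint by default, and 2*pi is the same polar angle as 0, so passing endpoint=False is one way to get 12 distinct month positions.&lt;/p&gt;

```python
import numpy as np

# One angle per month. endpoint=False keeps December from landing on
# the same angle as January (2*pi wraps around to 0 on a polar plot).
theta = np.linspace(0, 2 * np.pi, 12, endpoint=False)

print(len(theta))  # 12 angles, evenly spaced by 2*pi/12
```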

&lt;p&gt;You'll start with how to plot just the data for the year 1991 in matplotlib, then scale up to all years.&lt;/p&gt;

&lt;p&gt;To generate a matplotlib Axes object that uses the polar system, you need to set the projection parameter to "polar" when creating it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fig = plt.figure(figsize=(8,8))
ax1 = plt.subplot(111, projection='polar')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp7kzp8k1q0kjqo4y39kz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp7kzp8k1q0kjqo4y39kz.png" alt="a matplotlib Axes object that uses the polar system," width="699" height="688"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To adjust the data to contain no negative temperature values, you need to first calculate the minimum temperature value:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;temp_data_2['Temparature'].min()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-2.3378881410256405
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll add 2 to all temperature values, so they'll all be positive but there's still some space reserved around the origin for displaying text.&lt;/p&gt;

&lt;p&gt;Note: adjust this offset according to your own data's minimum temperature.&lt;/p&gt;

&lt;p&gt;You'll also generate 12 evenly spaced angle values starting at 0 and covering the circle up to (but not including) 2*pi, so that December doesn't land on the same angle as January:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# returns a boolean Series that selects only the rows 
#where the Year column is equal to 1991.
hc_1991 = temp_data_2[temp_data_2['Year'] == 1991]
#the code creates a new figure with 
#the plt.figure() function and sets the size of the figure to be 8 inches by 8 inches with figsize=(8,8).
fig = plt.figure(figsize=(8,8))
ax1 = plt.subplot(111, projection='polar')
r = hc_1991['Temparature'] + 2
theta = np.linspace(0, 2*np.pi, 12)
# Plot the data on the polar axes
ax1.plot(theta, r)

# hide all of the tick labels for both axes 
ax1.axes.get_yaxis().set_ticklabels([])
ax1.axes.get_xaxis().set_ticklabels([])
#Background color within the polar plot to be black, and the color surrounding the polar plot to be gray.
#I can use
#fig.set_facecolor() to set the foreground color and Axes.set_axis_bgcolor() to set the background color of the plot:
fig.set_facecolor("#323331")
ax1.set_facecolor('#000100')
#add the title and labels
ax1.set_ylabel('Temperature')
ax1.set_title("Kenya's Temperature Change (1991-2016)", color='white', fontdict={'fontsize': 30})
# Display the plot
plt.show()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Plotting the remaining years&lt;br&gt;
To plot the spirals for the remaining years, you need to repeat what you just did, but for all of the years in the dataset. The one tweak to make here is to manually set the axis limit for r (or y, in matplotlib terms). This is because matplotlib scales the size of the plot automatically based on the data used, which is why, in the last step, the data for just 1991 was displayed at the edge of the plotting area. You'll calculate the maximum temperature value in the entire dataset and add a generous amount of padding (to match what Ed did).&lt;/p&gt;
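&lt;p&gt;The limit computation described here can be sketched as follows; the values below are stand-ins, whereas in the post this would come from the adjusted temp_data_2['Temparature'] column.&lt;/p&gt;

```python
import pandas as pd

# Sketch of the manual axis limit described above: take the maximum adjusted
# temperature and pad it so the outermost spiral sits inside the frame.
# (Stand-in values; the +1.0 padding amount is an assumption.)
adjusted_max = pd.Series([27.1, 28.9, 28.3]).max()
y_limit = adjusted_max + 1.0

print(y_limit)
# later: ax1.set_ylim(0, y_limit)
```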

&lt;p&gt;Now, you can use a for loop to generate the rest of the data. You'll leave out the code that generates the center text for now (otherwise each year will generate text at the same point and it'll be very messy):&lt;/p&gt;

&lt;p&gt;You will use the color (or c) parameter when calling the Axes.plot() method, drawing colors from a colormap such as plt.cm.viridis(index).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ig = plt.figure(figsize=(14,14))
ax1 = plt.subplot(111, projection='polar')

# hide all of the tick labels for both axes 
ax1.axes.get_yaxis().set_ticklabels([])
ax1.axes.get_xaxis().set_ticklabels([])

#fig.set_facecolor() to set the foreground color and Axes.set_axis_bgcolor() to set the background color of the plot:
fig.set_facecolor("#323331")
#ax1.set_ylim(0, 3.25)


theta = np.linspace(0, 2*np.pi, 12)


ax1.set_title("Kenya's Temperature Change (1991-2016)", color='white', fontdict={'fontsize': 30})
ax1.set_facecolor('#000100')

years = temp_data_2['Year'].unique()

for index,Year in enumerate(years):
  r=temp_data_2.loc[temp_data_2["Year"]== Year,"Temparature"]+2
  ax1.plot(theta,r,c=plt.cm.viridis(index*2))
plt.show()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Adding temperature rings&lt;br&gt;
At this stage, the viewer can't actually understand the underlying data at all; there is no indication of temperature values in the visualization.&lt;br&gt;
Next, you will add temperature rings at 0.0, 0.5, 1.0, 1.5, and 2.0 degrees Celsius.&lt;/p&gt;

&lt;p&gt;Finally, generating the GIF animation&lt;br&gt;
Now you're ready to generate a GIF animation from the plot. An animation is a series of images displayed in rapid succession. You'll use the matplotlib.animation.FuncAnimation function to help with this. To take advantage of this function, you need to write code that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;defines the base plot appearance and properties&lt;/li&gt;
&lt;li&gt;updates the plot between frames with new data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You'll use the following required parameters when calling FuncAnimation():&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fig: the matplotlib Figure object&lt;/li&gt;
&lt;li&gt;func: the update function that's called between each frame&lt;/li&gt;
&lt;li&gt;frames: the number of frames (you want one for each year)&lt;/li&gt;
&lt;li&gt;interval: the number of milliseconds each frame is displayed (there are 1000 milliseconds in a second)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;FuncAnimation() returns a matplotlib.animation.FuncAnimation object, which has a save() method you can use to write the animation to a GIF file.&lt;/p&gt;

&lt;p&gt;The code block below shows all these above steps added to produce a GIF.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from mpl_toolkits.mplot3d import Axes3D 
months=["Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"]
fig=plt.figure(figsize=(15,15))
ax1=plt.subplot(111,projection="polar")

ax1.plot(full_circle_thetas, blue_one_radii, c='blue')
ax1.plot(full_circle_thetas, red_one_radii, c='red')
ax1.plot(full_circle_thetas, red_two_radii, c='red')
ax1.plot(full_circle_thetas, red_three_radii, c='red')
ax1.plot(full_circle_thetas, red_four_radii, c='red')

#fig.set_facecolor() to set the foreground color and Axes.set_axis_bgcolor() to set the background color of the plot:
fig.set_facecolor("#323331")
#ax1.set_ylim(0, 3.25)

ax1.text(np.pi/2, 1.0, "0.0 C", color="blue", ha='center', fontdict={'fontsize': 20})
ax1.text(np.pi/2, 2.0, "0.5 C", color="red", ha='center', fontdict={'fontsize': 20})
ax1.text(np.pi/2, 2.5, "1.0 C", color="red", ha='center', fontdict={'fontsize': 20})
ax1.text(np.pi/2, 3.0, "1.5 C", color="red", ha='center', fontdict={'fontsize': 20})
ax1.text(np.pi/2, 3.5, "2.0 C", color="red", ha='center', fontdict={'fontsize': 20})


ax1.set_xticks([])
ax1.set_yticks([])
ax1.set_xticklabels([])
ax1.set_yticklabels([])


theta = np.linspace(0, 2*np.pi, 12)


ax1.set_title("Kenya's Temperature Change Spiral (1991-2016)", color='white', fontdict={'fontsize': 30})
ax1.set_facecolor('#000100')

years = temp_data_2['Year'].unique()

fig.text(0.78,0,"Kenya Temperature data",color="white",fontsize=20)
fig.text(0.05,0.02,"Everlynn Muthoni; Data Stories",color="white",fontsize=20)
fig.text(0.05,0,"Inspired by Ed Hawkins's 2017 Visualization",color="white",fontsize=15)

#add months ring
months_angles= np.linspace((np.pi/2)+(2*np.pi),np.pi/2,13)
for i,month in enumerate(months):
  ax1.text(months_angles[i],5.0,month,color="white",fontsize=15,ha="center")

#for index,Year in enumerate(years):
  #r=temp_data_2.loc[temp_data_2["Year"]== Year,"Temparature"]+2
  #ax1.plot(theta,r,c=plt.cm.viridis(index*15))

def update(i):
    # Remove the last year text at the center
    for txt in ax1.texts:
      if(txt.get_position()==(0,0)):
        txt.set_visible(False)
    # Specify how we want the plot to change in each frame.
    # We need to unravel the for loop we had earlier.
    Year = years[i]
    r = temp_data_2[temp_data_2['Year'] == Year]['Temparature'] + 2
    ax1.plot(theta, r, c=plt.cm.viridis(i*30))
    ax1.text(0,0,Year,fontsize=20,color="white",ha="center")
    return ax1

anim = animation.FuncAnimation(fig, update, frames=len(years), interval=10)


ffmpeg_writer = animation.FFMpegWriter();

anim.save("Spiral.gif", writer = 'pillow', fps = 5, dpi=100);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Final result:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fra0tpk6xxo6seg3uppg2.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fra0tpk6xxo6seg3uppg2.gif" alt="final gif visualization of Kenya's temperature data " width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3. The story our data visualization tells.
&lt;/h3&gt;

&lt;p&gt;So... from the analysis and visualization, the following insights can be deduced:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Since 1991, the temperature variation has been gradually increasing between February and June, with the highest variation occurring mostly between June and July.&lt;/li&gt;
&lt;li&gt;High temperature variation occurs mostly during the first half of the year.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And that's it. Congrats, you have successfully visualized temperature data using a climate spiral!&lt;/p&gt;

&lt;p&gt;Click &lt;a href="https://github.com/Eve-dev-tech/Kenya-tempearature-spiral" rel="noopener noreferrer"&gt;here&lt;/a&gt; if you'd like to check out the source code.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Recommendations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;For a better 3D visualization, explore the project using &lt;a href="https://www.mathworks.com/products/matlab.html" rel="noopener noreferrer"&gt;MATLAB&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For more current descriptive analysis, look for a dataset with more recent observations.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Like, subscribe, and share your thoughts with me. Bye, and happy coding!&lt;/p&gt;

</description>
      <category>visualization</category>
      <category>python</category>
      <category>climate</category>
      <category>beginners</category>
    </item>
    <item>
      <title>MySQL Error 2003 (HY000): Can't connect to MySQL server on 'localhost:3306' (10061)</title>
      <dc:creator>Data Stories</dc:creator>
      <pubDate>Sat, 20 May 2023 03:11:52 +0000</pubDate>
      <link>https://dev.to/data_stories/mysql-error-2003-hy000-cant-connect-to-mysql-server-on-localhost3306-10061-1m8f</link>
      <guid>https://dev.to/data_stories/mysql-error-2003-hy000-cant-connect-to-mysql-server-on-localhost3306-10061-1m8f</guid>
      <description>&lt;p&gt;So we have finished &lt;a href="https://dev.tourl"&gt; installing MySQL&lt;/a&gt; and we want to start it at the command line. There are times you may come across the 2003(HYOOO) error or the MySQL Command Line Client may disappear after it prompts you for your password. Do not fret, this is easily fixed.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to fix it
&lt;/h2&gt;

&lt;p&gt;Please see the steps below to fix the problem.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Log in to your system as an administrator.&lt;/li&gt;
&lt;li&gt;Open Task Manager.&lt;/li&gt;
&lt;li&gt;Go to the Services tab.&lt;/li&gt;
&lt;li&gt;Look for your MySQL service; it will show as stopped.&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Right-click the MySQL service and select Start. Give it a few moments, and the status will read Running. See the image below for reference.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feqyr9ry0qlt0ybyxpobl.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feqyr9ry0qlt0ybyxpobl.PNG" alt="Image showing how to start MySQL server from task manager" width="623" height="418"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Please recheck that MySQL service has been started successfully.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If the MySQL service does not start this way, you are probably not logged in with an administrator account, and you will receive the error below.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft8ispccvum2ggnkvidhn.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft8ispccvum2ggnkvidhn.PNG" alt="Image showing error because f not running ad admin" width="337" height="202"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Follow the steps below to start the MySQL service from the Administrative Tools panel.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to Control Panel &amp;gt; All Control Panel Items &amp;gt; Administrative Tools.&lt;/li&gt;
&lt;li&gt;Go to Services.&lt;/li&gt;
&lt;li&gt;Select your MySQL service (for example, MySQL56) under the Name column.&lt;/li&gt;
&lt;li&gt;Click the Start link in the left panel to start the MySQL service.&lt;/li&gt;
&lt;li&gt;Once the MySQL service has started, open the MySQL Command Line Client.&lt;/li&gt;
&lt;li&gt;Enter your MySQL password.&lt;/li&gt;
&lt;li&gt;After the MySQL server starts successfully, you will see the screen below.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3dq90im8ujaa2ppy6jh5.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3dq90im8ujaa2ppy6jh5.PNG" alt="Login sucess" width="800" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Login success! Your MySQL server is now ready to connect!&lt;/p&gt;
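If you prefer the command line over the GUI, the same service can be started from an elevated Command Prompt. This is a minimal sketch, assuming a Windows install where the service is named MySQL56; the name varies by version (for example MySQL80), so check the Services list for yours:

```shell
rem Run these in a Command Prompt opened "as Administrator".
rem Replace MySQL56 with your actual service name (for example, MySQL80).

rem Show the current state of the service:
sc query MySQL56

rem Start the service:
net start MySQL56
```

If the prompt is not elevated, `net start` fails with "System error 5. Access is denied.", which is the command-line equivalent of the error shown above.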

&lt;p&gt;Thanks for reading this post. Please let me know if your problem has been resolved. Like, comment and share. &lt;/p&gt;

&lt;p&gt;Finally, if you want to learn how to install MySQL server from scratch, please check &lt;a href="https://dev.tourl"&gt;this blog post&lt;/a&gt; in my Learning SQL series.&lt;/p&gt;

</description>
      <category>mysql</category>
      <category>sql</category>
      <category>help</category>
      <category>errors</category>
    </item>
    <item>
      <title>Exploratory Data Analysis on Diabetes dataset with Python.</title>
      <dc:creator>Data Stories</dc:creator>
      <pubDate>Sun, 06 Nov 2022 14:42:09 +0000</pubDate>
      <link>https://dev.to/data_stories/exploratory-data-analysis-on-diabetes-dataset-with-python-2ofe</link>
      <guid>https://dev.to/data_stories/exploratory-data-analysis-on-diabetes-dataset-with-python-2ofe</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmju9qtfh2xwxkd3r61sn.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmju9qtfh2xwxkd3r61sn.jpg" alt="Exploratory Data Analysis(EDA)" width="300" height="225"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction.
&lt;/h2&gt;

&lt;p&gt;Let's start with understanding what exploratory data analysis (EDA) is. It is an approach to analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods. Simply put, it is the process of investigating data. This blog is a guide to understanding EDA with an example dataset. &lt;/p&gt;

&lt;h2&gt;
  
  
  Intuition
&lt;/h2&gt;

&lt;p&gt;Before we know how, we should first understand why. Why perform EDA at all? Imagine you and your friends decide to go on a vacation to a beach destination none of you has been to. At first, you're all stuck; you don't know where to begin. Being a good planner, the first question you would ask is: what are the best beach destinations? The next natural question would be: what is our budget? You would then ask what accommodations are available in that area, and finally you'd check the ratings and reviews of the hotel you plan to stay at.&lt;/p&gt;

&lt;p&gt;Whatever investigating measures you would take before finally booking your stay at your destination, is nothing but what data scientists in their lingo call Exploratory Data Analysis.&lt;/p&gt;

&lt;p&gt;EDA is all about making sense of the data at hand before getting your hands dirty with it.&lt;/p&gt;

&lt;h2&gt;
  
  
  EDA explained using a sample data set:
&lt;/h2&gt;

&lt;p&gt;To share my understanding of EDA concepts and techniques, I'll use the Pima Indians diabetes data set as an example. Some years ago, research was done on a Native American tribe called the Pima (also known as the Pima Indians). It is this research data we will be using.&lt;/p&gt;

&lt;p&gt;First, a little background on diabetes. Diabetes is one of the most frequent diseases worldwide, and the number of diabetic patients has been growing over the years. The main cause of diabetes remains unknown, yet scientists believe that both genetic factors and lifestyle play a major role.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Our Data dictionary:&lt;/strong&gt;&lt;br&gt;
Below is the attribute information:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pregnancies: Number of times pregnant&lt;/li&gt;
&lt;li&gt;Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test&lt;/li&gt;
&lt;li&gt;Blood pressure: Diastolic blood pressure (mm Hg)&lt;/li&gt;
&lt;li&gt;SkinThickness: Triceps skinfold thickness (mm)&lt;/li&gt;
&lt;li&gt;Insulin: 2-Hour serum insulin (mu U/ml) test&lt;/li&gt;
&lt;li&gt;BMI: Body mass index (weight in kg/(height in m)^2)&lt;/li&gt;
&lt;li&gt;DiabetesPedigreeFunction: A function that scores the likelihood of diabetes based on family history&lt;/li&gt;
&lt;li&gt;Age: Age in years&lt;/li&gt;
&lt;li&gt;Outcome: Class variable (0: the person is not diabetic or 1: the person is diabetic)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now that we understand a little about our data set and the goal of the analysis ( to understand the patterns and trends of diabetes among the Pima Indians population), let's get right into the analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The analysis&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To start with, I imported the necessary libraries ( pandas, NumPy, matplotlib, and seaborn).&lt;/p&gt;

&lt;p&gt;Note: Whatever inferences and insights I could extract, I've listed as bullet points; comments in the code start with #.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
import numpy as np  # library used for working with arrays
import pandas as pd # library used for data manipulation and analysis

import seaborn as sns # library for visualization
import matplotlib.pyplot as plt # library for visualization
%matplotlib inline


# to suppress warnings
import warnings
warnings.filterwarnings('ignore')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Reading the given dataset&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#read csv dataset

pima = pd.read_csv("diabetes.csv") # loads the csv file into a DataFrame
pima
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;Pregnancies&lt;/th&gt;
      &lt;th&gt;Glucose&lt;/th&gt;
      &lt;th&gt;BloodPressure&lt;/th&gt;
      &lt;th&gt;SkinThickness&lt;/th&gt;
      &lt;th&gt;Insulin&lt;/th&gt;
      &lt;th&gt;BMI&lt;/th&gt;
      &lt;th&gt;DiabetesPedigreeFunction&lt;/th&gt;
      &lt;th&gt;Age&lt;/th&gt;
      &lt;th&gt;Outcome&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;0&lt;/th&gt;
      &lt;td&gt;6&lt;/td&gt;
      &lt;td&gt;148&lt;/td&gt;
      &lt;td&gt;72&lt;/td&gt;
      &lt;td&gt;35&lt;/td&gt;
      &lt;td&gt;79&lt;/td&gt;
      &lt;td&gt;33.6&lt;/td&gt;
      &lt;td&gt;0.627&lt;/td&gt;
      &lt;td&gt;50&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1&lt;/th&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;85&lt;/td&gt;
      &lt;td&gt;66&lt;/td&gt;
      &lt;td&gt;29&lt;/td&gt;
      &lt;td&gt;79&lt;/td&gt;
      &lt;td&gt;26.6&lt;/td&gt;
      &lt;td&gt;0.351&lt;/td&gt;
      &lt;td&gt;31&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;2&lt;/th&gt;
      &lt;td&gt;8&lt;/td&gt;
      &lt;td&gt;183&lt;/td&gt;
      &lt;td&gt;64&lt;/td&gt;
      &lt;td&gt;20&lt;/td&gt;
      &lt;td&gt;79&lt;/td&gt;
      &lt;td&gt;23.3&lt;/td&gt;
      &lt;td&gt;0.672&lt;/td&gt;
      &lt;td&gt;32&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;3&lt;/th&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;89&lt;/td&gt;
      &lt;td&gt;66&lt;/td&gt;
      &lt;td&gt;23&lt;/td&gt;
      &lt;td&gt;94&lt;/td&gt;
      &lt;td&gt;28.1&lt;/td&gt;
      &lt;td&gt;0.167&lt;/td&gt;
      &lt;td&gt;21&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;4&lt;/th&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;137&lt;/td&gt;
      &lt;td&gt;40&lt;/td&gt;
      &lt;td&gt;35&lt;/td&gt;
      &lt;td&gt;168&lt;/td&gt;
      &lt;td&gt;43.1&lt;/td&gt;
      &lt;td&gt;2.288&lt;/td&gt;
      &lt;td&gt;33&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;...&lt;/th&gt;
      &lt;td&gt;...&lt;/td&gt;
      &lt;td&gt;...&lt;/td&gt;
      &lt;td&gt;...&lt;/td&gt;
      &lt;td&gt;...&lt;/td&gt;
      &lt;td&gt;...&lt;/td&gt;
      &lt;td&gt;...&lt;/td&gt;
      &lt;td&gt;...&lt;/td&gt;
      &lt;td&gt;...&lt;/td&gt;
      &lt;td&gt;...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;763&lt;/th&gt;
      &lt;td&gt;10&lt;/td&gt;
      &lt;td&gt;101&lt;/td&gt;
      &lt;td&gt;76&lt;/td&gt;
      &lt;td&gt;48&lt;/td&gt;
      &lt;td&gt;180&lt;/td&gt;
      &lt;td&gt;32.9&lt;/td&gt;
      &lt;td&gt;0.171&lt;/td&gt;
      &lt;td&gt;63&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;764&lt;/th&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;122&lt;/td&gt;
      &lt;td&gt;70&lt;/td&gt;
      &lt;td&gt;27&lt;/td&gt;
      &lt;td&gt;79&lt;/td&gt;
      &lt;td&gt;36.8&lt;/td&gt;
      &lt;td&gt;0.340&lt;/td&gt;
      &lt;td&gt;27&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;765&lt;/th&gt;
      &lt;td&gt;5&lt;/td&gt;
      &lt;td&gt;121&lt;/td&gt;
      &lt;td&gt;72&lt;/td&gt;
      &lt;td&gt;23&lt;/td&gt;
      &lt;td&gt;112&lt;/td&gt;
      &lt;td&gt;26.2&lt;/td&gt;
      &lt;td&gt;0.245&lt;/td&gt;
      &lt;td&gt;30&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;766&lt;/th&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;126&lt;/td&gt;
      &lt;td&gt;60&lt;/td&gt;
      &lt;td&gt;20&lt;/td&gt;
      &lt;td&gt;79&lt;/td&gt;
      &lt;td&gt;30.1&lt;/td&gt;
      &lt;td&gt;0.349&lt;/td&gt;
      &lt;td&gt;47&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;767&lt;/th&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;93&lt;/td&gt;
      &lt;td&gt;70&lt;/td&gt;
      &lt;td&gt;31&lt;/td&gt;
      &lt;td&gt;79&lt;/td&gt;
      &lt;td&gt;30.4&lt;/td&gt;
      &lt;td&gt;0.315&lt;/td&gt;
      &lt;td&gt;23&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Let's find the number of columns&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# finds the number of columns in the dataset
total_cols=len(pima.axes[1])
print("Number of Columns: "+str(total_cols))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;pre&gt;Number of Columns: 9
&lt;/pre&gt;
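Counting columns with `len(pima.axes[1])` works, but `DataFrame.shape` returns the row and column counts in one call. A small sketch with a toy DataFrame (the `demo` frame here is illustrative, not the real Pima data):

```python
import pandas as pd

# A tiny stand-in DataFrame (not the real Pima data).
demo = pd.DataFrame({
    "Glucose": [148, 85, 183],
    "BMI": [33.6, 26.6, 23.3],
    "Outcome": [1, 0, 1],
})

n_rows, n_cols = demo.shape  # (rows, columns) in one call
print("Number of Rows:", n_rows)     # 3
print("Number of Columns:", n_cols)  # 3
```

On the actual dataset, `pima.shape` would give `(768, 9)`.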

&lt;p&gt;&lt;strong&gt;Let's show the first 10 records of the dataset.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pima.head(10)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;Pregnancies&lt;/th&gt;
      &lt;th&gt;Glucose&lt;/th&gt;
      &lt;th&gt;BloodPressure&lt;/th&gt;
      &lt;th&gt;SkinThickness&lt;/th&gt;
      &lt;th&gt;Insulin&lt;/th&gt;
      &lt;th&gt;BMI&lt;/th&gt;
      &lt;th&gt;DiabetesPedigreeFunction&lt;/th&gt;
      &lt;th&gt;Age&lt;/th&gt;
      &lt;th&gt;Outcome&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;0&lt;/th&gt;
      &lt;td&gt;6&lt;/td&gt;
      &lt;td&gt;148&lt;/td&gt;
      &lt;td&gt;72&lt;/td&gt;
      &lt;td&gt;35&lt;/td&gt;
      &lt;td&gt;79&lt;/td&gt;
      &lt;td&gt;33.600000&lt;/td&gt;
      &lt;td&gt;0.627&lt;/td&gt;
      &lt;td&gt;50&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1&lt;/th&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;85&lt;/td&gt;
      &lt;td&gt;66&lt;/td&gt;
      &lt;td&gt;29&lt;/td&gt;
      &lt;td&gt;79&lt;/td&gt;
      &lt;td&gt;26.600000&lt;/td&gt;
      &lt;td&gt;0.351&lt;/td&gt;
      &lt;td&gt;31&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;2&lt;/th&gt;
      &lt;td&gt;8&lt;/td&gt;
      &lt;td&gt;183&lt;/td&gt;
      &lt;td&gt;64&lt;/td&gt;
      &lt;td&gt;20&lt;/td&gt;
      &lt;td&gt;79&lt;/td&gt;
      &lt;td&gt;23.300000&lt;/td&gt;
      &lt;td&gt;0.672&lt;/td&gt;
      &lt;td&gt;32&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;3&lt;/th&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;89&lt;/td&gt;
      &lt;td&gt;66&lt;/td&gt;
      &lt;td&gt;23&lt;/td&gt;
      &lt;td&gt;94&lt;/td&gt;
      &lt;td&gt;28.100000&lt;/td&gt;
      &lt;td&gt;0.167&lt;/td&gt;
      &lt;td&gt;21&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;4&lt;/th&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;137&lt;/td&gt;
      &lt;td&gt;40&lt;/td&gt;
      &lt;td&gt;35&lt;/td&gt;
      &lt;td&gt;168&lt;/td&gt;
      &lt;td&gt;43.100000&lt;/td&gt;
      &lt;td&gt;2.288&lt;/td&gt;
      &lt;td&gt;33&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;5&lt;/th&gt;
      &lt;td&gt;5&lt;/td&gt;
      &lt;td&gt;116&lt;/td&gt;
      &lt;td&gt;74&lt;/td&gt;
      &lt;td&gt;20&lt;/td&gt;
      &lt;td&gt;79&lt;/td&gt;
      &lt;td&gt;25.600000&lt;/td&gt;
      &lt;td&gt;0.201&lt;/td&gt;
      &lt;td&gt;30&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;6&lt;/th&gt;
      &lt;td&gt;3&lt;/td&gt;
      &lt;td&gt;78&lt;/td&gt;
      &lt;td&gt;50&lt;/td&gt;
      &lt;td&gt;32&lt;/td&gt;
      &lt;td&gt;88&lt;/td&gt;
      &lt;td&gt;31.000000&lt;/td&gt;
      &lt;td&gt;0.248&lt;/td&gt;
      &lt;td&gt;26&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;7&lt;/th&gt;
      &lt;td&gt;10&lt;/td&gt;
      &lt;td&gt;115&lt;/td&gt;
      &lt;td&gt;69&lt;/td&gt;
      &lt;td&gt;20&lt;/td&gt;
      &lt;td&gt;79&lt;/td&gt;
      &lt;td&gt;35.300000&lt;/td&gt;
      &lt;td&gt;0.134&lt;/td&gt;
      &lt;td&gt;29&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;8&lt;/th&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;197&lt;/td&gt;
      &lt;td&gt;70&lt;/td&gt;
      &lt;td&gt;45&lt;/td&gt;
      &lt;td&gt;543&lt;/td&gt;
      &lt;td&gt;30.500000&lt;/td&gt;
      &lt;td&gt;0.158&lt;/td&gt;
      &lt;td&gt;53&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;9&lt;/th&gt;
      &lt;td&gt;8&lt;/td&gt;
      &lt;td&gt;125&lt;/td&gt;
      &lt;td&gt;96&lt;/td&gt;
      &lt;td&gt;20&lt;/td&gt;
      &lt;td&gt;79&lt;/td&gt;
      &lt;td&gt;31.992578&lt;/td&gt;
      &lt;td&gt;0.232&lt;/td&gt;
      &lt;td&gt;54&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Finding the number of rows in the dataset.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# finds the number of rows in the dataset
total_rows=len(pima.axes[0])
print("Number of Rows: "+str(total_rows))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;pre&gt;Number of Rows: 768
&lt;/pre&gt;

&lt;p&gt;Now let us understand the &lt;strong&gt;dimensions of the dataset.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print('The dimension of the DataFrame is: ', pima.ndim)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;pre&gt;The dimension of the DataFrame is:  2
&lt;/pre&gt;

&lt;ul&gt;
&lt;li&gt;Note: The Pandas dataframe.ndim property returns the dimension of a series or a DataFrame. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For all kinds of dataframes and series, it returns 1 for a series, which consists only of rows, and 2 for a DataFrame, which holds two-dimensional data.&lt;/p&gt;
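The distinction is easy to see on toy objects (these two are illustrative, not the Pima data):

```python
import pandas as pd

s = pd.Series([1, 2, 3])                       # one-dimensional: just rows
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})  # two-dimensional: rows x columns

print(s.ndim)   # 1
print(df.ndim)  # 2
```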

&lt;p&gt;&lt;strong&gt;The size of the dataset.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pima.size
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;pre&gt;6912&lt;/pre&gt;

&lt;ul&gt;
&lt;li&gt;Note: In Python Pandas, the dataframe.size property is used to display the size of Pandas DataFrame. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It returns the size of the DataFrame or a series which is equivalent to the total number of elements. &lt;/p&gt;

&lt;p&gt;If I want to calculate the size of the series, it will return the number of rows. In the case of a DataFrame, it will return the rows multiplied by the columns.&lt;/p&gt;
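The rows-times-columns rule can be checked directly on a toy frame (illustrative, not the Pima data):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})  # 3 rows x 2 columns

print(df["a"].size)  # a single column is a Series: size = number of rows -> 3
print(df.size)       # whole DataFrame: rows * columns -> 6
```

This is why `pima.size` is 6912: 768 rows times 9 columns.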

&lt;p&gt;Let us now find out the &lt;strong&gt;data types&lt;/strong&gt; of all variables in the dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#The info() function is used to print a concise summary of a DataFrame. 
#This method prints information about a DataFrame including the index dtype and column dtypes, non-null values and memory usage.

pima.info()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;pre&gt;&amp;lt;class 'pandas.core.frame.DataFrame'&amp;gt;
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
&lt;/pre&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;There are 768 entries&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;There are 2 float64 columns and 7 int64 columns&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now let us &lt;strong&gt;check for missing values.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#functions that return a boolean value indicating whether the passed in argument value is in fact missing data.
# this is an example of chaining methods 

pima.isnull().values.any()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;pre&gt;False&lt;/pre&gt;

&lt;ul&gt;
&lt;li&gt;Pandas refers to what most developers know as null values as missing data. Within pandas, a missing value is denoted by NaN.
&lt;/li&gt;
&lt;/ul&gt;
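To see the chained check in action, here is a toy frame with one NaN deliberately planted (illustrative, not the Pima data), along with `isnull().sum()`, which counts missing values per column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Glucose": [148, np.nan, 183],
                   "BMI": [33.6, 26.6, 23.3]})

print(bool(df.isnull().values.any()))  # True: at least one missing value exists
print(df.isnull().sum())               # per-column count of missing values
```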

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#it can also output if there is any missing values each of the columns

pima.isnull().any()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;pre&gt;Pregnancies                 False
Glucose                     False
BloodPressure               False
SkinThickness               False
Insulin                     False
BMI                         False
DiabetesPedigreeFunction    False
Age                         False
Outcome                     False
dtype: bool&lt;/pre&gt;

&lt;ul&gt;
&lt;li&gt;We can then conclude there are no missing values in the dataset.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Statistical summary
&lt;/h2&gt;

&lt;p&gt;Now let us produce a statistical summary of the data: summary statistics for all variables except 'Outcome', which is our output variable.&lt;/p&gt;

&lt;p&gt;Summary statistics are descriptive statistics: they summarize the central tendency, dispersion, and shape of a dataset's distribution, excluding NaN values.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# excludes the Outcome column

pima.iloc[:,0:8].describe()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;Pregnancies&lt;/th&gt;
      &lt;th&gt;Glucose&lt;/th&gt;
      &lt;th&gt;BloodPressure&lt;/th&gt;
      &lt;th&gt;SkinThickness&lt;/th&gt;
      &lt;th&gt;Insulin&lt;/th&gt;
      &lt;th&gt;BMI&lt;/th&gt;
      &lt;th&gt;DiabetesPedigreeFunction&lt;/th&gt;
      &lt;th&gt;Age&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;count&lt;/th&gt;
      &lt;td&gt;768.000000&lt;/td&gt;
      &lt;td&gt;768.000000&lt;/td&gt;
      &lt;td&gt;768.000000&lt;/td&gt;
      &lt;td&gt;768.000000&lt;/td&gt;
      &lt;td&gt;768.000000&lt;/td&gt;
      &lt;td&gt;768.000000&lt;/td&gt;
      &lt;td&gt;768.000000&lt;/td&gt;
      &lt;td&gt;768.000000&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;mean&lt;/th&gt;
      &lt;td&gt;3.845052&lt;/td&gt;
      &lt;td&gt;121.675781&lt;/td&gt;
      &lt;td&gt;72.250000&lt;/td&gt;
      &lt;td&gt;26.447917&lt;/td&gt;
      &lt;td&gt;118.270833&lt;/td&gt;
      &lt;td&gt;32.450805&lt;/td&gt;
      &lt;td&gt;0.471876&lt;/td&gt;
      &lt;td&gt;33.240885&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;std&lt;/th&gt;
      &lt;td&gt;3.369578&lt;/td&gt;
      &lt;td&gt;30.436252&lt;/td&gt;
      &lt;td&gt;12.117203&lt;/td&gt;
      &lt;td&gt;9.733872&lt;/td&gt;
      &lt;td&gt;93.243829&lt;/td&gt;
      &lt;td&gt;6.875374&lt;/td&gt;
      &lt;td&gt;0.331329&lt;/td&gt;
      &lt;td&gt;11.760232&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;min&lt;/th&gt;
      &lt;td&gt;0.000000&lt;/td&gt;
      &lt;td&gt;44.000000&lt;/td&gt;
      &lt;td&gt;24.000000&lt;/td&gt;
      &lt;td&gt;7.000000&lt;/td&gt;
      &lt;td&gt;14.000000&lt;/td&gt;
      &lt;td&gt;18.200000&lt;/td&gt;
      &lt;td&gt;0.078000&lt;/td&gt;
      &lt;td&gt;21.000000&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;25%&lt;/th&gt;
      &lt;td&gt;1.000000&lt;/td&gt;
      &lt;td&gt;99.750000&lt;/td&gt;
      &lt;td&gt;64.000000&lt;/td&gt;
      &lt;td&gt;20.000000&lt;/td&gt;
      &lt;td&gt;79.000000&lt;/td&gt;
      &lt;td&gt;27.500000&lt;/td&gt;
      &lt;td&gt;0.243750&lt;/td&gt;
      &lt;td&gt;24.000000&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;50%&lt;/th&gt;
      &lt;td&gt;3.000000&lt;/td&gt;
      &lt;td&gt;117.000000&lt;/td&gt;
      &lt;td&gt;72.000000&lt;/td&gt;
      &lt;td&gt;23.000000&lt;/td&gt;
      &lt;td&gt;79.000000&lt;/td&gt;
      &lt;td&gt;32.000000&lt;/td&gt;
      &lt;td&gt;0.372500&lt;/td&gt;
      &lt;td&gt;29.000000&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;75%&lt;/th&gt;
      &lt;td&gt;6.000000&lt;/td&gt;
      &lt;td&gt;140.250000&lt;/td&gt;
      &lt;td&gt;80.000000&lt;/td&gt;
      &lt;td&gt;32.000000&lt;/td&gt;
      &lt;td&gt;127.250000&lt;/td&gt;
      &lt;td&gt;36.600000&lt;/td&gt;
      &lt;td&gt;0.626250&lt;/td&gt;
      &lt;td&gt;41.000000&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;max&lt;/th&gt;
      &lt;td&gt;17.000000&lt;/td&gt;
      &lt;td&gt;199.000000&lt;/td&gt;
      &lt;td&gt;122.000000&lt;/td&gt;
      &lt;td&gt;99.000000&lt;/td&gt;
      &lt;td&gt;846.000000&lt;/td&gt;
      &lt;td&gt;67.100000&lt;/td&gt;
      &lt;td&gt;2.420000&lt;/td&gt;
      &lt;td&gt;81.000000&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;From the results we can make out a few insights:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The pregnancy numbers appear to be normally distributed, whereas the others seem to be right-skewed. (The mean and standard deviation of Pregnancies are more or less the same, unlike the other variables.)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The highest glucose level is 199, the highest number of pregnancies is 17, and the highest BMI is 67.1.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now to the fun part.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Visualization&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Plotting a distribution plot for the variable 'BloodPressure'.&lt;/p&gt;

&lt;p&gt;The displot() function is used to visualize the distribution of a univariate variable. With kind='kde' it uses matplotlib to plot a kernel density estimate (KDE) of the distribution.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sns.displot(pima['BloodPressure'], kind='kde')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hplo1dmvr7t5nlwn01em.png" alt="KDE plot of the Blood Pressure levels"&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We can interpret from the plot above that blood pressure lies between 60 and 80 for a large number of observations, i.e. most people's blood pressure ranges from 60 to 80.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What is the BMI of the person with the highest glucose level?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The max() method finds the highest value.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pima[pima['Glucose']==pima['Glucose'].max()]['BMI']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;pre&gt;661    42.9
Name: BMI, dtype: float64&lt;/pre&gt;

&lt;ul&gt;
&lt;li&gt;The person with the highest glucose level (at row index 661) has a BMI of 42.9.&lt;/li&gt;
&lt;/ul&gt;
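The same lookup can be sketched on a toy frame (illustrative, not the Pima data), alongside the equivalent `idxmax()` idiom, which returns the index label of the maximum:

```python
import pandas as pd

# Toy data: row label 1 holds the highest glucose value.
df = pd.DataFrame({"Glucose": [148, 199, 85],
                   "BMI": [33.6, 42.9, 26.6]})

# Boolean-mask version used in the post:
bmi_of_max = df[df["Glucose"] == df["Glucose"].max()]["BMI"]

# Equivalent one-liner: idxmax() gives the row label of the maximum.
print(df.loc[df["Glucose"].idxmax(), "BMI"])  # 42.9
```

Note that in the post's output, 661 is the index label of the matching row, not a glucose value.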

&lt;p&gt;&lt;strong&gt;Finding Measures of Central Tendency (the mean, median, and mode)&lt;/strong&gt;&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;m1 = pima['BMI'].mean()  # mean
print(m1)
m2 = pima['BMI'].median()  # median
print(m2)
m3 = pima['BMI'].mode()[0]  # mode
print(m3)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;pre&gt;32.45080515543619
32.0
32.0
&lt;/pre&gt;

&lt;ul&gt;
&lt;li&gt;The mean, median, and mode (measures of central tendency) are approximately equal, suggesting the BMI distribution is fairly symmetric&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How many women's glucose levels are above the mean glucose level?&lt;/strong&gt;&lt;br&gt;
The mean() method finds the mean of all numerical values in a series or column.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pima[pima['Glucose']&amp;gt;pima['Glucose'].mean()].shape[0]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;pre&gt;343&lt;/pre&gt;

&lt;ul&gt;
&lt;li&gt;343 women have glucose levels above the mean glucose level of 121.68&lt;/li&gt;
&lt;/ul&gt;
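The mask-then-count pattern is easy to verify on a handful of toy values (illustrative, not the Pima data):

```python
import pandas as pd

glucose = pd.Series([148, 85, 183, 89, 137])  # toy values
mean_glucose = glucose.mean()                 # (148+85+183+89+137)/5 = 128.4

# The boolean mask keeps only values above the mean; shape[0] counts them.
above_mean_count = glucose[glucose > mean_glucose].shape[0]
print(above_mean_count)  # 3 (148, 183 and 137)
```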

&lt;p&gt;Let us count the number of women that have their 'BloodPressure' equal to the median of 'BloodPressure' and their 'BMI' less than the median of 'BMI'&lt;/p&gt;

&lt;p&gt;The result is saved into a new dataframe, pima1.&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pima1 = pima[(pima['BloodPressure']==pima['BloodPressure'].median()) &amp;amp; (pima['BMI']&amp;lt;pima['BMI'].median())]
number_of_women=len(pima1.axes[0])
print("Number of women:" +str(number_of_women))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;pre&gt;Number of women:22
&lt;/pre&gt;

&lt;p&gt;&lt;strong&gt;Getting a pairwise distribution between Glucose, Skin thickness and Diabetes pedigree function.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The pair plot gives a pairwise distribution of variables in the dataset. pairplot() function creates a matrix such that each grid shows the relationship between a pair of variables. On the diagonal axes, a plot shows the univariate distribution of each variable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sns.pairplot(data=pima,vars=['Glucose', 'SkinThickness', 'DiabetesPedigreeFunction'], hue = 'Outcome')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8m0ms66j807i09zbu1xj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8m0ms66j807i09zbu1xj.png" alt="A pair plot" width="591" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Studying the correlation between glucose and insulin using a Scatter Plot.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A scatter plot is a set of points plotted on horizontal and vertical axes. It can be used to study the correlation between two variables and to detect extreme data points.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sns.scatterplot(x='Glucose',y='Insulin',data=pima)
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd0xpq40tklzrbuy5qm1h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd0xpq40tklzrbuy5qm1h.png" alt="The scatter plot" width="389" height="262"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The scatter plot suggests that, for most observations, an increase in glucose produces relatively little change in insulin levels. In a few cases, however, insulin rises sharply with glucose; these points are probably outliers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let us explore the possibility of outliers using the Box Plot.&lt;/p&gt;

&lt;p&gt;A boxplot visualizes the five-number summary of a variable and gives information about outliers in the data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plt.boxplot(pima['Age'])

plt.title('Boxplot of Age')
plt.ylabel('Age')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2mlt3k55mpvlymfcs6gt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2mlt3k55mpvlymfcs6gt.png" alt="Boxplot" width="382" height="264"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The box plot shows outliers as points above the upper whisker.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let us now try to understand the number of women in different age groups, split by whether or not they have diabetes. We will use histograms for this.&lt;/p&gt;

&lt;p&gt;A histogram is used to display the distribution and spread of the continuous variable. One axis represents the range of the variable and the other axis shows the frequency of the data points.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Understanding the number of women in different age groups with diabetes.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plt.hist(pima[pima['Outcome']==1]['Age'], bins = 5)
plt.title('Distribution of Age for Women who have Diabetes')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fanuik5r0pbdgi01zkg6g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fanuik5r0pbdgi01zkg6g.png" alt="A histogram of women with diabetes" width="382" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Most of the women with diabetes are between the ages of 22 and 30.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The number of women with diabetes decreases as age increases.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Understanding the number of women in different age groups without diabetes.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plt.hist(pima[pima['Outcome']==0]['Age'], bins = 5)
plt.title('Distribution of Age for Women who do not have Diabetes')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fagtfj02p238iorhjmoxc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fagtfj02p238iorhjmoxc.png" alt="A histogram of women without diabetes" width="393" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Most of the women without diabetes are between the ages of 22 and 33.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The 22 to 35 age range therefore contains both the largest number of women with diabetes and the largest number without it.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What is the Interquartile Range of all the variables?&lt;/strong&gt;&lt;br&gt;
The IQR, or Inter-Quartile Range, is a statistical measure of the variability in a given dataset.&lt;/p&gt;

&lt;p&gt;It tells us the range inside which the bulk of our data lies.&lt;/p&gt;

&lt;p&gt;It is calculated by taking the difference between the third quartile and the first quartile of a dataset.&lt;/p&gt;

&lt;p&gt;Why? The IQR is commonly used to filter outliers: extreme values that lie far from the regular observations and that may arise from variability in measurement or from experimental error.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Q1 = pima.quantile(0.25)
Q3 = pima.quantile(0.75)
IQR = Q3 - Q1
print(IQR)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;pre&gt;Pregnancies                  5.0000
Glucose                     40.5000
BloodPressure               16.0000
SkinThickness               12.0000
Insulin                     48.2500
BMI                          9.1000
DiabetesPedigreeFunction     0.3825
Age                         17.0000
Outcome                      1.0000
dtype: float64
&lt;/pre&gt;
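&lt;p&gt;As a side note, the IQR pairs naturally with the common "1.5 &amp;times; IQR" rule of thumb for flagging outliers. The sketch below is illustrative only: the helper name and the toy ages are made up, not drawn from the pima data.&lt;/p&gt;

```python
import pandas as pd

# Hedged sketch: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR],
# the common 1.5 x IQR rule of thumb.
def iqr_outliers(s: pd.Series) -> pd.Series:
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    # boolean mask: True where the value falls outside the whiskers
    return s.lt(q1 - 1.5 * iqr) | s.gt(q3 + 1.5 * iqr)

ages = pd.Series([22, 25, 28, 31, 33, 41, 45, 81])
print(ages[iqr_outliers(ages)].tolist())  # [81] -- the extreme age is flagged
```

&lt;p&gt;Applied to a column such as pima['Age'], the same mask would let us inspect or drop the rows the box plot marked as outliers.&lt;/p&gt;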

&lt;p&gt;&lt;strong&gt;And finally, let us find and visualize the correlation between all variables.&lt;/strong&gt;&lt;br&gt;
Correlation is a statistic that measures the degree to which two variables move with each other.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;corr_matrix = pima.iloc[:,0:8].corr()

corr_matrix
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;Pregnancies&lt;/th&gt;
      &lt;th&gt;Glucose&lt;/th&gt;
      &lt;th&gt;BloodPressure&lt;/th&gt;
      &lt;th&gt;SkinThickness&lt;/th&gt;
      &lt;th&gt;Insulin&lt;/th&gt;
      &lt;th&gt;BMI&lt;/th&gt;
      &lt;th&gt;DiabetesPedigreeFunction&lt;/th&gt;
      &lt;th&gt;Age&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;Pregnancies&lt;/th&gt;
      &lt;td&gt;1.000000&lt;/td&gt;
      &lt;td&gt;0.128022&lt;/td&gt;
      &lt;td&gt;0.208987&lt;/td&gt;
      &lt;td&gt;0.009393&lt;/td&gt;
      &lt;td&gt;-0.018780&lt;/td&gt;
      &lt;td&gt;0.021546&lt;/td&gt;
      &lt;td&gt;-0.033523&lt;/td&gt;
      &lt;td&gt;0.544341&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;Glucose&lt;/th&gt;
      &lt;td&gt;0.128022&lt;/td&gt;
      &lt;td&gt;1.000000&lt;/td&gt;
      &lt;td&gt;0.219765&lt;/td&gt;
      &lt;td&gt;0.158060&lt;/td&gt;
      &lt;td&gt;0.396137&lt;/td&gt;
      &lt;td&gt;0.231464&lt;/td&gt;
      &lt;td&gt;0.137158&lt;/td&gt;
      &lt;td&gt;0.266673&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;BloodPressure&lt;/th&gt;
      &lt;td&gt;0.208987&lt;/td&gt;
      &lt;td&gt;0.219765&lt;/td&gt;
      &lt;td&gt;1.000000&lt;/td&gt;
      &lt;td&gt;0.130403&lt;/td&gt;
      &lt;td&gt;0.010492&lt;/td&gt;
      &lt;td&gt;0.281222&lt;/td&gt;
      &lt;td&gt;0.000471&lt;/td&gt;
      &lt;td&gt;0.326791&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;SkinThickness&lt;/th&gt;
      &lt;td&gt;0.009393&lt;/td&gt;
      &lt;td&gt;0.158060&lt;/td&gt;
      &lt;td&gt;0.130403&lt;/td&gt;
      &lt;td&gt;1.000000&lt;/td&gt;
      &lt;td&gt;0.245410&lt;/td&gt;
      &lt;td&gt;0.532552&lt;/td&gt;
      &lt;td&gt;0.157196&lt;/td&gt;
      &lt;td&gt;0.020582&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;Insulin&lt;/th&gt;
      &lt;td&gt;-0.018780&lt;/td&gt;
      &lt;td&gt;0.396137&lt;/td&gt;
      &lt;td&gt;0.010492&lt;/td&gt;
      &lt;td&gt;0.245410&lt;/td&gt;
      &lt;td&gt;1.000000&lt;/td&gt;
      &lt;td&gt;0.189919&lt;/td&gt;
      &lt;td&gt;0.158243&lt;/td&gt;
      &lt;td&gt;0.037676&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;BMI&lt;/th&gt;
      &lt;td&gt;0.021546&lt;/td&gt;
      &lt;td&gt;0.231464&lt;/td&gt;
      &lt;td&gt;0.281222&lt;/td&gt;
      &lt;td&gt;0.532552&lt;/td&gt;
      &lt;td&gt;0.189919&lt;/td&gt;
      &lt;td&gt;1.000000&lt;/td&gt;
      &lt;td&gt;0.153508&lt;/td&gt;
      &lt;td&gt;0.025748&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;DiabetesPedigreeFunction&lt;/th&gt;
      &lt;td&gt;-0.033523&lt;/td&gt;
      &lt;td&gt;0.137158&lt;/td&gt;
      &lt;td&gt;0.000471&lt;/td&gt;
      &lt;td&gt;0.157196&lt;/td&gt;
      &lt;td&gt;0.158243&lt;/td&gt;
      &lt;td&gt;0.153508&lt;/td&gt;
      &lt;td&gt;1.000000&lt;/td&gt;
      &lt;td&gt;0.033561&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;Age&lt;/th&gt;
      &lt;td&gt;0.544341&lt;/td&gt;
      &lt;td&gt;0.266673&lt;/td&gt;
      &lt;td&gt;0.326791&lt;/td&gt;
      &lt;td&gt;0.020582&lt;/td&gt;
      &lt;td&gt;0.037676&lt;/td&gt;
      &lt;td&gt;0.025748&lt;/td&gt;
      &lt;td&gt;0.033561&lt;/td&gt;
      &lt;td&gt;1.000000&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
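&lt;p&gt;To make the entries of the table above concrete, here is a small illustrative computation of Pearson's r for two toy arrays (the values are made up, not taken from the pima data). A result near 1 means the two series rise together.&lt;/p&gt;

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])  # roughly 2*x, so strongly correlated

# np.corrcoef returns the full correlation matrix; [0, 1] is r(x, y)
r = np.corrcoef(x, y)[0, 1]
print(round(r, 3))  # 0.999 -- very close to 1
```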

&lt;p&gt;&lt;strong&gt;Now let us visualize using a heatmap.&lt;/strong&gt;&lt;br&gt;
A heatmap is a two-dimensional graphical representation of data in which the individual values contained in a matrix are represented as colors. Each square in the heatmap shows the correlation between the variables on its axes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# 'annot=True' returns the correlation values
plt.figure(figsize=(8,8))
sns.heatmap(corr_matrix, annot = True)

# display the plot
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ycu0zn4xkyk7m3bsmyom.png" class="article-body-image-wrapper"&gt;&lt;img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ycu0zn4xkyk7m3bsmyom.png" alt="A heatmap showing the correlation between the independent variables"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Note: the closer a correlation is to 1, the more positively correlated the two variables are; as one increases, so does the other, and the closer to 1, the stronger the relationship. A correlation close to -1 is similar, except that one variable decreases as the other increases.&lt;/li&gt;
&lt;li&gt;Age and Pregnancies are positively correlated.&lt;/li&gt;
&lt;li&gt;Glucose and Insulin are positively correlated.&lt;/li&gt;
&lt;li&gt;SkinThickness and BMI are positively correlated.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This marks the end of our exhaustive EDA. Tell me what you think, and drop your comments in the comment section. Bye.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>codenewbie</category>
      <category>hacktoberfest</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Data Stories.</title>
      <dc:creator>Data Stories</dc:creator>
      <pubDate>Sat, 05 Nov 2022 08:14:15 +0000</pubDate>
      <link>https://dev.to/data_stories/data-stories-4o6h</link>
      <guid>https://dev.to/data_stories/data-stories-4o6h</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxql6k219s0r9nag5t4na.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxql6k219s0r9nag5t4na.jpg" alt="Image description" width="385" height="211"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When someone hears the words "data science," they often assume the job is entirely analytical. Though this is true, I like to think of it as a storytelling field. After all, who doesn't love a good story? When a data analyst is given a dataset to analyze, we often try to find patterns, trends, and anomalies within it and use that information to make business decisions or predict future data. It would be fair to view this as a pure analysis job, but a better way to think of it is to picture the information as a story your data is trying to tell.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiuznipksdjx4v5gjwa3b.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiuznipksdjx4v5gjwa3b.jpg" alt="Image description" width="291" height="203"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's take my case as an example. One of the first exciting stories I heard from my Mum when I was a kid was the story of Icarus and Daedalus, the one that gave rise to the idiom "Don't fly too close to the sun." I remember the first time she launched into the story. I couldn't quite fathom how it would end. As I was introduced to the fantastical characters and the unique circumstances they found themselves in, I imagined all the scenarios that could have resulted in their final fate. As the story unfolded, it became gradually clear what fate would befall them. In the end, Icarus plummeted down to earth as his father, Daedalus, watched. It was clear that the main lesson was the value of listening to our elders.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjyvsox70ux5wxd6z0lgb.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjyvsox70ux5wxd6z0lgb.jpg" alt="Image description" width="230" height="225"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You might ask, how does the story tie in with data science? Well, let's equate our dataset to the story before we heard it. Both are a mystery: we have no idea what the dataset wants to tell us. But as the story and the dataset unfold (in our case, as we do exploratory and descriptive analysis), we start to get a clearer picture. At the end, we come to understand the patterns (our characters' decisions), the features (the environmental circumstances our characters find themselves in), and the information the dataset gives us (the lesson learned from our characters' story).&lt;/p&gt;

&lt;p&gt;Looking for your data's story is a valuable skill every data scientist must work on improving, and the way to do that is through visualizations. As data scientists, we need to develop skills from broad and various fields. We need some business knowledge, some maths, statistics, and programming. But, I would argue that learning to tell a story with your data is an essential skill as well.&lt;/p&gt;

&lt;p&gt;In summary, if you are a good storyteller and can create efficient visualizations, you will uncover your data's story, present that story effectively to your clients, and prove the value of your work.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>analytics</category>
      <category>python</category>
    </item>
  </channel>
</rss>
