Week 2 DSEA X Lux Academy project on Exploratory Data Analysis, aimed at uncovering interesting patterns, insights, and potential anomalies within the weather dataset used for the Week 1 assignment.
1. Data Overview and Cleaning
The first step of any data overview and cleaning is preparing your data by ensuring it is imported correctly, in the right format and in the right place. It is then in order to check the dimensions of the data using the `.shape` attribute, and the characteristics of the data, i.e. the data types and any missing values for each column, using the `.info()` method. It is also important to check for duplicated rows using the `.duplicated()` method as part of cleaning. Finally, use the `.head()` method to look at the first five (or any specified number of) rows of your dataset, as shown in the sketch below.
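A minimal sketch of these initial checks, assuming the weather data has already been loaded into a dataframe named `df` (the file name below is a placeholder):

```python
import pandas as pd

# Load the dataset (file name is an assumption for illustration)
df = pd.read_csv("weather.csv")

# Dimensions of the dataset: (rows, columns)
print(df.shape)

# Data types and non-null counts for each column
df.info()

# Count of missing values per column
print(df.isnull().sum())

# Number of fully duplicated rows
print(df.duplicated().sum())

# First five rows of the dataset
print(df.head())
```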
In the case that there are any missing values, two actions can be taken:
- Drop the missing values using the `.dropna(inplace=True)` method. The `inplace=True` ensures the original dataframe is updated without making a copy; if we omit it, or pass `inplace=False`, pandas keeps the default behaviour and returns a new dataframe, leaving the original unchanged.
- Fill the missing values using the `.fillna()` method, which replaces the missing values with the mean, mode, median or any other specified value.
It is always advisable to use your own judgement to determine whether to drop or to fill the missing values by looking at the impact the action will have on the dataset.
As for the duplicated rows, it is advisable to drop them using the `.drop_duplicates()` method. Duplicates can skew statistical measures and compromise data integrity, leading to misleading analyses and visualizations. A quick sketch of handling both issues follows.
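A minimal sketch, again assuming the dataframe is named `df` and that `'Temperature'` is one of its numeric columns (a hypothetical column name):

```python
# Option 1: drop rows that contain any missing values, updating df in place
df.dropna(inplace=True)

# Option 2: fill missing values in a specific column with its mean
# (the column name 'Temperature' is an assumption for illustration)
df['Temperature'] = df['Temperature'].fillna(df['Temperature'].mean())

# Drop fully duplicated rows, keeping the first occurrence
df = df.drop_duplicates()
```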
2. Statistical Summary
A statistical summary is a set of statistical calculations that characterize the values in a dataset. It gives a broader perspective on and understanding of the dataset, helps uncover relationships between different variables, and highlights the variables that matter most for the problem at hand.
The two main types of summary statistics used here are measures of central tendency and measures of dispersion.
(i) Measures of Central Tendency
These are measures that indicate the approximate center of a dataset. The main measures of central tendency are the mean, mode and median, computed in pandas as sketched after the definitions below.
- mean: also referred to as the arithmetic average, this is the sum of all values in a dataset divided by the number of values in the dataset.
- mode: this is the most repeated element in a dataset, i.e. the value with the highest frequency.
- median: this is the middle element of a dataset when arranged in ascending or descending order. If the dataset has an even number of values, the median is the average of the two middle values.
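In pandas, these can be computed directly on a column; a small sketch, assuming a numeric column named `'Temperature'` (a hypothetical name):

```python
mean_value = df['Temperature'].mean()          # arithmetic average
mode_value = df['Temperature'].mode().iloc[0]  # most frequent value (mode() returns a Series)
median_value = df['Temperature'].median()      # middle value of the sorted column

print(mean_value, mode_value, median_value)
```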
(ii) Measures of Dispersion
These are measures that describe the variability of a dataset, i.e. how spread out its values are. The main measures of dispersion are the range, variance, standard deviation, interquartile range and mean absolute deviation. The following explains each measure and how to compute it in pandas; a combined sketch follows the list.
- range: the difference between the highest and lowest values in a dataset. In pandas: `value_range = df['column_name'].max() - df['column_name'].min()`
- variance: a measure of variability, i.e. how far each data point is from the mean, in squared units. In pandas: `variance = df['column_name'].var()`
- standard deviation: a measure of how spread out the data is around the mean. In pandas: `standard_deviation = df['column_name'].std()`
- interquartile range: a measure of the spread of the middle 50% of the dataset. It is the difference between the 75th and 25th percentiles, which eliminates the effect of outliers. In pandas: `interquartile_range = df['column_name'].quantile(0.75) - df['column_name'].quantile(0.25)`
- mean absolute deviation: the average distance between each data point and the mean of the dataset. Older pandas versions provided `df['column_name'].mad()`, but this was removed in pandas 2.0; it can be computed manually as `(df['column_name'] - df['column_name'].mean()).abs().mean()`
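Putting these together, a small sketch that computes all the dispersion measures on a single hypothetical column named `'Temperature'`:

```python
col = df['Temperature']  # hypothetical numeric column

value_range = col.max() - col.min()
variance = col.var()                      # sample variance (ddof=1 by default)
standard_deviation = col.std()            # sample standard deviation
interquartile_range = col.quantile(0.75) - col.quantile(0.25)
mean_absolute_deviation = (col - col.mean()).abs().mean()  # replacement for the removed .mad()

print(value_range, variance, standard_deviation,
      interquartile_range, mean_absolute_deviation)
```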
Rather than calculating all the summary statistics individually, another way is to use the `.describe()` method in pandas. Calling it on a dataframe automatically computes basic statistics for all the numerical columns, skipping any NaN values. Passing the optional `include='all'` adds the 'unique', 'top' and 'freq' results on top of the basic statistics; the 'top' result is the mode.
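A quick sketch of both ways to call it:

```python
# Summary statistics for the numeric columns only
print(df.describe())

# Include all columns: adds unique, top (mode) and freq for non-numeric columns
print(df.describe(include='all'))
```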
3. Data Visualization
Data visualization is the art of presenting information and data graphically, using visual elements such as charts, graphs and plots, in order to reveal trends, patterns and correlations in the data. The aim is to make data easier to interpret and understand, supporting better decision making and the derivation of insights.
In Python, data visualization is made possible through libraries such as Matplotlib, Seaborn, Plotly, pandas plotting, Altair, Bokeh, etc. Here's a quick run through some of them:
1. Matplotlib
Matplotlib is a Python library for creating static, animated and interactive visualizations. It covers basic graph plotting such as line charts, bar graphs, box plots, histograms, scatter plots, etc., works well with data arrays and dataframes, is highly customizable, and pairs well with Pandas and NumPy for Exploratory Data Analysis.
To be able to use Matplotlib, the following have to be fulfilled:
(i) Installation:
Use the command `pip install matplotlib`
(ii) Usage:
Ensure you import as follows: `import matplotlib.pyplot as plt`
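As a small sketch, a histogram of a hypothetical 'Temperature' column from the weather dataframe:

```python
import matplotlib.pyplot as plt

# Histogram of a numeric column (the column name is an assumption for illustration)
plt.figure(figsize=(8, 5))
plt.hist(df['Temperature'], bins=20, color='steelblue', edgecolor='black')
plt.title('Distribution of Temperature')
plt.xlabel('Temperature')
plt.ylabel('Frequency')
plt.show()
```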
2. Seaborn
Seaborn is a data visualization library, built on top of Matplotlib, used for making statistical graphs in Python. It produces complex visualizations with fewer commands and ships with more built-in themes than Matplotlib, which makes it feel more organized and functional; it also operates on the dataset as a whole (e.g. entire dataframes) rather than individual arrays. Seaborn supports plots including but not limited to box plots, scatter plots, regression plots, histograms and more.
For use cases, ensure the following are fulfilled:
(i) Installation
Use the command `pip install seaborn`
(ii) Usage
Import as follows: `import seaborn as sns`
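A short sketch, assuming the same dataframe `df` and hypothetical numeric columns 'Temperature' and 'Humidity':

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Box plot to spot outliers in a numeric column (column names are assumptions)
sns.boxplot(x=df['Temperature'])
plt.title('Temperature box plot')
plt.show()

# Scatter plot with a fitted regression line between two numeric columns
sns.regplot(x='Temperature', y='Humidity', data=df)
plt.title('Temperature vs Humidity')
plt.show()
```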
3. Plotly
Plotly is a Python data visualization library prized for its ability to create interactive plots and dashboards, which makes results easy to understand. Some of its best features are hover tooltips that help detect outliers or other anomalies in a large dataset, support for a wide range of plots including line charts, scatter plots, bar charts and more, extensive options for customization and formatting, and the Dash framework for building interactive web applications.
Important use cases to fulfill:
(i) Installation
Use the command `pip install plotly`
(ii) Usage
Import as follows: `import plotly.express as px`
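A minimal sketch using Plotly Express, again with hypothetical column names from the weather dataframe:

```python
import plotly.express as px

# Interactive scatter plot; hovering over points shows their values
# (column names 'Temperature' and 'Humidity' are assumptions for illustration)
fig = px.scatter(df, x='Temperature', y='Humidity', title='Temperature vs Humidity')
fig.show()
```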
Conclusion
The Week 2 DSEA X Lux Academy Project on Exploratory Data Analysis (EDA) has provided a thorough understanding of the weather dataset used in Week 1. Through the systematic process of data cleaning, statistical summary, and visualization, the project has successfully uncovered valuable patterns, insights, and potential anomalies within the data.
Data cleaning ensured the dataset's integrity by handling missing values, duplicates, and inconsistencies. The statistical summary provided a deeper understanding of the dataset's central tendencies and dispersion, which are crucial for interpreting weather patterns. Visualization, using tools like Matplotlib, Seaborn, and Plotly, brought these insights to life, making it easier to identify trends, correlations, and outliers.
Overall, this analysis sets a strong foundation for further exploration and more advanced modeling, helping to derive actionable insights from the weather data and supporting better decision-making.