In this article, we will explore various ways of plotting and analyzing numerical and categorical variables in a bivariate and multivariate manner.
Exploratory data analysis in many modeling projects (whether in data science or in research) involves examining correlation among predictors and between predictors and a target variable.
Variables X and Y (each with measured data) are said to be positively correlated if high values of X go with high values of Y, and low values of X go with low values of Y.
We use Pearson's Correlation Coefficient as a de-facto method for computing correlation among numerical variables.
In mathematical terms, Pearson's correlation coefficient is the covariance of X and Y divided by the product of their standard deviations:

r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √( Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² )
The correlation coefficient always lies between +1 (perfect positive correlation) and –1 (perfect negative correlation); 0 indicates no correlation.
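The formula above can be computed directly to confirm it matches NumPy's built-in implementation. This is a minimal sketch with made-up numbers, not the wine dataset:

```python
import numpy as np

# Hypothetical sample data (not from the wine dataset)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.9])

# Pearson's r: covariance of x and y divided by the
# product of their standard deviations
r = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2)
)

# The same value via NumPy's correlation matrix
r_np = np.corrcoef(x, y)[0, 1]
```

Since y rises almost perfectly in step with x, both computations give a value very close to +1.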
import pandas as pd

# Reading the data
data = pd.read_csv('winequality-red.csv', sep=';')

# The pandas DataFrame has a .corr() method to compute a correlation table
data.iloc[:, :-1].corr()
The output of this code is a square matrix with each numerical variable's correlation computed against every other variable in the data.
For visualization purposes, we use seaborn's heatmap for better inferences and data storytelling.
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 7))
sns.heatmap(data.iloc[:, :-1].corr(), vmin=-1, vmax=1,
            cmap=sns.diverging_palette(20, 220, as_cmap=True))
Like the mean and standard deviation, the correlation coefficient is sensitive to outliers in the data.
Note: Statisticians have devised various other correlation coefficients, such as Spearman's rho and Kendall's tau.
These are correlation coefficients based on the rank of the data. Since they work with ranks rather than values, these estimates are robust to outliers and can handle certain types of nonlinearities.
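The difference in robustness is easy to see with pandas' `Series.corr`, which accepts a `method` argument. In this sketch (made-up numbers, not the datasets used in this article), a single extreme outlier drags Pearson's coefficient negative, while the rank-based estimates still report the positive trend in the other nine points:

```python
import pandas as pd

# Hypothetical data: a clean positive relationship plus one extreme outlier
x = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
y = pd.Series([2, 4, 6, 8, 10, 12, 14, 16, 18, -50])

pearson = x.corr(y)                       # dominated by the outlier
spearman = x.corr(y, method='spearman')   # uses ranks only
kendall = x.corr(y, method='kendall')     # uses pair orderings only
```

Here `pearson` comes out strongly negative even though nine of the ten points lie on a perfectly increasing line, whereas `spearman` and `kendall` stay positive.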
The standard way to visualize the relationship between two measured data variables is with a scatterplot. The x-axis represents one variable and the y-axis another, and each point on the graph is a record.
ax = data.plot.scatter(x='citric acid', y='fixed acidity', figsize=(10, 6))
ax.axhline(0, color='grey', lw=1)
ax.axvline(0, color='grey', lw=1)
ax.grid()
The plot shows a fairly strong positive relationship between citric acid and fixed acidity: records with higher citric acid tend to have higher fixed acidity. (Note that correlation alone does not establish that one variable causes the other.)
Familiar estimators like mean and variance look at variables one at a time (univariate analysis).
In this section, we look at additional estimates and plots, and at more than two variables (multivariate analysis).
Scatterplots are fine when there is a relatively small number of data values.
For data sets with hundreds of thousands or millions of records, a scatterplot will be too dense, so we need a different way to visualize the relationship.
Rather than plotting points, which would appear as a monolithic dark cloud, we group the records into hexagonal bins and plot each hexagon with a color indicating the number of records in that bin.
In Python, hexagonal binning plots are readily available through the pandas plotting method hexbin:
ax = data.plot.hexbin(x='citric acid', y='fixed acidity', gridsize=40, figsize=(10, 6))
ax.set_xlabel('Citric Acid')
ax.set_ylabel('Fixed Acidity')
Another way to analyze dense data is to plot density contours. In Python, seaborn provides the kdeplot method:
# Newer versions of seaborn require x and y as keyword arguments
plt.figure(figsize=(10, 10))
sns.kdeplot(x=data['citric acid'], y=data['fixed acidity'])
A useful way to summarize two categorical variables is a contingency table - a table of counts by category.
Contingency tables can look only at counts, or they can also include column and total percentages.
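Both forms are easy to produce with `pd.crosstab`, which takes a `normalize` argument for percentages. This is a minimal sketch on a tiny hypothetical frame standing in for the adult dataset:

```python
import pandas as pd

# Hypothetical stand-in for the adult dataset
df = pd.DataFrame({
    'education': ['HS-grad', 'Bachelors', 'HS-grad',
                  'Masters', 'Bachelors', 'HS-grad'],
    'sex': ['Male', 'Female', 'Female', 'Male', 'Male', 'Male'],
})

# Raw counts by category, with row and column totals ('All')
counts = pd.crosstab(df['education'], df['sex'], margins=True)

# Column percentages: each column sums to 1
col_pct = pd.crosstab(df['education'], df['sex'], normalize='columns')
```

Passing `normalize='index'` or `normalize='all'` instead yields row percentages or percentages of the grand total.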
The pivot_table method creates a pivot table in Python. The aggfunc argument lets us compute the counts.
crosstab = adult_data.pivot_table(index='education', columns='sex',
                                  aggfunc=len, margins=True)
df = crosstab.copy()
df.loc[:, 'workclass']
For the two categorical variables, education and sex, this computes a contingency table. Because no values column was specified, pivot_table aggregates every remaining column of the data, so we select the counts computed under the 'workclass' column for the output.
Boxplots are a simple way to visually compare the distributions of a numeric variable grouped according to a categorical variable.
The pandas boxplot method takes a by argument that splits the data set into groups and creates individual boxplots.
ax = adult_data.boxplot(by = 'race', column = 'hours-per-week', figsize=(10,10))
In the visualization above, the data is grouped by the categorical variable race and plotted against hours-per-week.
A violin plot is an enhancement to the boxplot and plots the density estimate with the density on the y-axis.
The density is mirrored and flipped over, and the resulting shape is filled in, creating an image resembling a violin.
The advantage of a violin plot is that it can show nuances in the distribution that aren't perceptible in a boxplot.
plt.figure(figsize=(10, 5))
sns.violinplot(x=adult_data['race'], y=adult_data['hours-per-week'], inner='quartile')
This creates a violin plot with the same grouping as the boxplot above.
Exploratory data analysis (EDA) laid a foundation for the field of data science. The key idea of EDA is that the first and most important step in any project based on data is to look at the data. By summarizing and visualizing the data, you can gain valuable intuition and understanding of the project.