In this article, we look at ways to plot and analyze numerical and categorical variables in bivariate and multivariate settings.

### Correlation

Exploratory data analysis in many modeling projects (whether in data science or in research) involves examining correlation among predictors and between predictors and a target variable.

*Variables X and Y (each with measured data) are said to be positively correlated if high values of X go with high values of Y, and low values of X go with low values of Y.*

We use *Pearson's correlation coefficient* as the de facto method for computing correlation among numerical variables.

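Its formula is:

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n - 1)\, s_x s_y}$$

where $\bar{x}$ and $\bar{y}$ are the sample means and $s_x$ and $s_y$ are the sample standard deviations of the two variables.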

The correlation coefficient always lies between +1 (perfect positive correlation) and –1 (perfect negative correlation); 0 indicates no linear correlation.
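As a minimal sketch of the definition, the coefficient can be computed by hand with NumPy and checked against `np.corrcoef`. The arrays `x` and `y` and the helper `pearson_r` are made up for illustration and are not part of the wine data.

```python
import numpy as np

# Toy data for illustration only (not from the wine data set)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

def pearson_r(x, y):
    """Pearson's r: summed products of deviations from the means,
    divided by (n - 1) times the two sample standard deviations."""
    n = len(x)
    dev_products = (x - x.mean()) * (y - y.mean())
    return dev_products.sum() / ((n - 1) * x.std(ddof=1) * y.std(ddof=1))

r_manual = pearson_r(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]  # off-diagonal entry of the 2 x 2 matrix
print(round(r_manual, 4), round(r_numpy, 4))
```

Both values should agree, since `np.corrcoef` computes the same quantity.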

```
import pandas as pd

# Read the data (the file is semicolon-separated)
data = pd.read_csv('winequality-red.csv', sep=';')
# The DataFrame .corr() method computes a correlation table;
# iloc[:, :-1] leaves out the last column
data.iloc[:, :-1].corr()
```

The output of the preceding code is a square matrix in which each numerical variable's correlation is computed against every other variable in the data.

To make the table easier to read and present, we plot it with seaborn's *heatmap*.

```
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(10, 7))
sns.heatmap(data.iloc[:, :-1].corr(), vmin=-1, vmax=1, cmap=sns.diverging_palette(20, 220, as_cmap=True))
```

Like the mean and standard deviation, the correlation coefficient is sensitive to outliers in the data.

Note: Statisticians have devised other correlation coefficients, such as *Spearman’s rho* and *Kendall’s tau*.

These are correlation coefficients based on the rank of the data. Since they work with ranks rather than values, these estimates are robust to outliers and can handle certain types of nonlinearities.
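A small sketch of this difference, on made-up data: `y` rises steadily with `x` except for one wild value. Spearman's rho is computed here as Pearson's r on the ranks, which is its definition; the series `x` and `y` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only: y rises with x, except for one extreme outlier
x = pd.Series(np.arange(1, 11, dtype=float))
y = x * 2.0
y.iloc[-1] = -100.0  # a single wild value in the last record

pearson = x.corr(y)  # Series.corr defaults to Pearson
# Spearman's rho is Pearson's r computed on the ranks of the data
spearman = x.rank().corr(y.rank())

print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")
```

In this toy case the single outlier is enough to turn Pearson's coefficient negative, while the rank-based estimate stays positive.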

#### Scatterplots

The standard way to visualize the relationship between two measured data variables is with a scatterplot. The x-axis represents one variable and the y-axis another, and each point on the graph is a record.

```
ax = data.plot.scatter(x = 'citric acid', y = 'fixed acidity', figsize = (10,6))
ax.axhline(0, color='grey', lw=1)
ax.axvline(0, color='grey', lw=1)
ax.grid()
```

The plot shows a positive relationship between *citric acid* and *fixed acidity*: wines with more citric acid tend to have higher fixed acidity. A scatterplot shows association only, so it does not by itself establish that one causes the other.

### Exploring Two or More Variables

Familiar estimators like mean and variance look at variables one at a time (univariate analysis).

In this section, we look at additional estimates and plots, and at more than two variables (multivariate analysis).

#### Hexagonal Binning and Contours

Scatterplots are fine when there is a relatively small number of data values.

For data sets with hundreds of thousands or millions of records, a scatterplot will be too dense, so we need a different way to visualize the relationship.

Rather than plotting individual points, which would appear as a monolithic dark cloud, we group the records into hexagonal bins and color each hexagon by the number of records it contains.

In Python, hexagonal binning plots are readily available through the pandas DataFrame method *hexbin*.

```
ax = data.plot.hexbin(x = 'citric acid', y = 'fixed acidity', gridsize= 40, figsize = (10,6))
ax.set_xlabel('Citric Acid')
ax.set_ylabel('Fixed Acidity')
```
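The counting step behind a binned plot can be sketched with NumPy. Note that `np.histogram2d` uses rectangular cells rather than hexagons, so this is an analogue of the idea, not what *hexbin* does internally; the data are synthetic and only for illustration.

```python
import numpy as np

# Synthetic data for illustration only: 200,000 correlated points
rng = np.random.default_rng(42)
x = rng.normal(size=200_000)
y = 0.6 * x + rng.normal(scale=0.8, size=200_000)

# Count how many records fall into each cell of a 40 x 40 grid
counts, x_edges, y_edges = np.histogram2d(x, y, bins=40)

print(counts.shape)       # one count per cell
print(int(counts.sum()))  # every record lands in exactly one cell
print(int(counts.max()))  # the densest cell, which a plot would color most strongly
```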

Another way to explore dense data is to plot density contours. In Python, *seaborn* provides the *kdeplot* function for this.

```
plt.figure(figsize=(10,10))
# Recent seaborn versions expect x and y as keyword arguments
sns.kdeplot(data=data, x='citric acid', y='fixed acidity')
```

#### Two Categorical Variables

A useful way to summarize two categorical variables is a contingency table - *a table of counts by category.*

Contingency tables can look only at counts, or they can also include column and total percentages.

In pandas, the *pivot_table* method builds such a table, and its *aggfunc* argument lets us compute the counts.

```
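# adult_data is assumed to be a DataFrame that has already been loaded
# Count the rows for each education/sex pair; margins=True adds the 'All' totals
crosstab = adult_data.pivot_table(index='education', columns='sex', aggfunc=lambda x: len(x), margins=True)
df = crosstab.copy()
# The counts are repeated for every remaining column, so display just one of them
df.loc[:, 'workclass']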
```

For the two categorical variables *education* and *sex*, this gives a contingency table of counts.

Because *pivot_table* computes the counts for every remaining column, we select the *workclass* column to display a single copy of them.
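pandas also offers `pd.crosstab`, which builds a contingency table directly from two columns. Below is a minimal sketch showing both the counts and the percentages mentioned earlier; the `toy` frame is a hypothetical stand-in with made-up records, not the adult data.

```python
import pandas as pd

# Toy records for illustration only (not the adult data set)
toy = pd.DataFrame({
    "education": ["Bachelors", "HS-grad", "HS-grad", "Masters", "Bachelors", "HS-grad"],
    "sex": ["Female", "Male", "Female", "Male", "Male", "Male"],
})

# Counts by category, with row and column totals in the "All" margins
counts = pd.crosstab(toy["education"], toy["sex"], margins=True)

# Share of each sex within every education level (each row sums to 1)
row_pct = pd.crosstab(toy["education"], toy["sex"], normalize="index")

print(counts)
print(row_pct.round(2))
```

Using `normalize="columns"` or `normalize="all"` instead would give column or overall percentages.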

#### Categorical and Numerical Variables

*Boxplots* are a simple way to visually compare the distributions of a numeric variable grouped according to a categorical variable.

The *pandas* boxplot method takes a *by* argument that splits the data set into groups and creates an individual boxplot for each.

```
ax = adult_data.boxplot(by = 'race', column = 'hours-per-week', figsize=(10,10))
```

In the visualization above, the data are grouped by the categorical variable *race* and plotted against *hours-per-week*.
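The numbers a boxplot draws can also be read off directly. The sketch below computes per-group quartiles on a made-up frame; `toy`, `group` and `hours` are hypothetical names standing in for the real columns.

```python
import pandas as pd

# Toy records for illustration only (not the adult data set)
toy = pd.DataFrame({
    "group": ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"],
    "hours": [20, 35, 40, 40, 60, 38, 40, 40, 45, 50],
})

# Quartiles per group: the box spans the 25th to 75th percentile,
# and the line inside the box marks the median
quartiles = toy.groupby("group")["hours"].quantile([0.25, 0.5, 0.75]).unstack()
quartiles.columns = ["q1", "median", "q3"]
quartiles["iqr"] = quartiles["q3"] - quartiles["q1"]

print(quartiles)
```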

A **violin plot** is an enhancement to the boxplot and plots the density estimate with the density on the y-axis.

The density is mirrored and flipped over, and the resulting shape is filled in, creating an image resembling a violin.

*The advantage of a violin plot is that it can show nuances in the distribution that aren’t perceptible in a boxplot.*
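To see the kind of nuance meant here, a density estimate can be sketched with NumPy alone. The helper `gaussian_kde` below is a hand-written illustration (not seaborn's implementation), the bandwidth is an arbitrary choice, and the two-cluster data are synthetic.

```python
import numpy as np

def gaussian_kde(sample, grid, bandwidth):
    """Average of Gaussian bumps, one centred on each observation."""
    z = (grid[:, None] - sample[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

# Synthetic data for illustration only: two clusters of weekly hours
rng = np.random.default_rng(0)
hours = np.concatenate([rng.normal(20, 3, 500), rng.normal(45, 3, 500)])

grid = np.linspace(0, 65, 261)                      # points at which to evaluate the density
density = gaussian_kde(hours, grid, bandwidth=2.0)  # bandwidth chosen arbitrarily

# The median falls between the two clusters, where almost no data lies;
# the density is close to zero there and peaks near 20 and 45
median = np.median(hours)
density_at_median = density[np.argmin(np.abs(grid - median))]
print(round(float(median), 1), float(density_at_median), float(density.max()))
```

A boxplot of these values would centre its box on a region with almost no observations, while the mirrored density of a violin plot would show the two separate clusters.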

```
plt.figure(figsize=(10,5))
sns.violinplot(x= adult_data['race'], y= adult_data['hours-per-week'], inner = 'quartile')
```

We created a violin plot of the same variables as the boxplot above.

#### Closing Remarks

*Exploratory data analysis (EDA)* set a foundation for the field of data science. The key idea of EDA is that the first and most important step in any project based on data is to look at the data. By summarizing and visualizing the data, you can gain valuable intuition and understanding of the project.

Fin.
