Samuel Kalu

Exploratory Data Analysis on the Iris Flower Dataset

Motivation

This is my submission for stage zero of the HNG 11 internship. I am currently exploring the field of data analysis in depth, and I believe this internship gives me the opportunity to learn and grow in this field.

To know more:

Observations at first glance

At first glance, the Iris flower dataset comprises 150 samples with four features each (sepal length, sepal width, petal length, and petal width), distributed across three species (Iris-setosa, Iris-versicolor, and Iris-virginica) with 50 samples per species.
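The counts above are easy to check in code. Here is a minimal sketch, assuming the copy of the dataset bundled with scikit-learn (the post itself may have used a CSV file instead). The helper name `load_iris_frame` is made up for illustration, and scikit-learn labels the species `setosa`, `versicolor`, and `virginica` without the `Iris-` prefix.

```python
import pandas as pd
from sklearn.datasets import load_iris


def load_iris_frame() -> pd.DataFrame:
    """Return the Iris data with a readable `species` column (hypothetical helper)."""
    bunch = load_iris(as_frame=True)
    df = bunch.frame.copy()
    # `target` holds integer codes 0-2; map them to the species names.
    names = {code: str(name) for code, name in enumerate(bunch.target_names)}
    df["species"] = df["target"].map(names)
    return df.drop(columns="target")


df = load_iris_frame()
print(df.shape)                      # (rows, columns): four features plus species
print(df["species"].value_counts())  # samples per species
print(df.isna().sum())               # missing values per column
```

If the shape and the per-species counts match the description above, the data loaded as expected.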


Exploratory Data Analysis

[Image: pairplot of the four Iris features, colored by species]

The pairplot above summarizes how each of the 4 features is distributed, and how each pair of features relates, with respect to the target variable (the species).

From the pairplot we can infer the following.

The pairplot of the Iris dataset provides a visual summary of the relationships between the four features (sepal length, sepal width, petal length, and petal width) for the three Iris species: setosa, versicolor, and virginica. Here are some detailed observations:

  1. Species Separation:

    • Iris-setosa: This species is distinctly separated from the other two species in almost all pairwise comparisons. The petal length and petal width features are particularly effective in distinguishing Iris-setosa, as the points representing this species form a distinct cluster in the lower left corner in the petal length vs. petal width plot.
    • Iris-versicolor and Iris-virginica: These two species overlap more but show some degree of separation. The petal length and petal width features again provide good separation, with Iris-versicolor generally having smaller petal measurements compared to Iris-virginica. However, there is still some overlap between these two species in the middle range of the feature values.
  2. Feature Distributions:

    • The diagonal plots show the kernel density estimates (KDE) for each feature within each species. These plots reveal that the distribution of each feature varies significantly between species. For example, Iris-setosa has a much narrower and distinct distribution for petal length and petal width compared to the other two species.
    • Sepal length and sepal width have more overlapping distributions, especially between Iris-versicolor and Iris-virginica, making them less effective for classification on their own.
  3. Inter-feature Relationships:

    • There is a noticeable positive correlation between petal length and petal width across all species, particularly within Iris-versicolor and Iris-virginica.
    • Sepal length and petal length also exhibit a positive correlation, especially for Iris-versicolor and Iris-virginica, while Iris-setosa remains distinctly separated.
    • Sepal width shows a weaker correlation with other features compared to the petal measurements.
  4. Within-Species Variability:

    • Iris-setosa shows low variability in petal measurements, which are consistently small.
    • Both Iris-versicolor and Iris-virginica exhibit more variability in their petal measurements, with Iris-virginica generally showing the largest measurements.
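The separation and variability observations above can also be checked numerically. The sketch below assumes scikit-learn's bundled dataset and its column names (for example `petal length (cm)`); it summarizes the petal measurements per species and tests whether Iris-setosa's petal lengths overlap with the other two species at all.

```python
import pandas as pd
from sklearn.datasets import load_iris

bunch = load_iris(as_frame=True)
df = bunch.frame.copy()
names = {code: str(name) for code, name in enumerate(bunch.target_names)}
df["species"] = df["target"].map(names)
df = df.drop(columns="target")

petal_cols = ["petal length (cm)", "petal width (cm)"]

# Per-species range, centre, and spread of the petal measurements.
summary = df.groupby("species")[petal_cols].agg(["min", "max", "mean", "std"])
print(summary.round(2))

# Is the largest setosa petal shorter than the smallest petal of the other species?
setosa_max = df.loc[df["species"] == "setosa", "petal length (cm)"].max()
others_min = df.loc[df["species"] != "setosa", "petal length (cm)"].min()
print("setosa separated on petal length:", setosa_max < others_min)
```

A small `std` for setosa next to larger values for the other two species corresponds to the low within-species variability described above.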

Correlation

[Image: correlation matrix heatmap of the Iris dataset]

The correlation matrix heatmap of the Iris dataset reveals the relationships between the features. Sepal length shows a strong positive correlation with petal length (0.87) and petal width (0.82). Petal length and petal width are highly correlated (0.96), indicating that as petal length increases, petal width also tends to increase. Sepal width, on the other hand, has a weak negative correlation with sepal length (-0.12) and moderate negative correlations with petal length (-0.43) and petal width (-0.37). These values suggest that the two petal measurements are the most strongly interrelated pair, that sepal length tracks both petal measurements closely, and that sepal width is the feature least correlated with the others.
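The matrix behind a heatmap like this one can be computed with pandas. This is a sketch assuming scikit-learn's bundled dataset and Pearson correlation over all 150 samples pooled together; the post does not say which method was used, but these settings should give values close to the ones quoted above.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Keep only the four numeric features; the species label is not needed here.
features = load_iris(as_frame=True).data

# DataFrame.corr() computes pairwise Pearson correlation by default.
corr = features.corr()
print(corr.round(2))
```

Passing `corr` to `seaborn.heatmap(corr, annot=True)` would draw the matrix as an annotated heatmap.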

Thanks so much for reading😊, Cya👋.
