gerry leo nugroho

The Gemika's Magical Guide to Sorting Hogwarts Students using the Decision Tree Algorithm (Part #6)

6. Visualizing Data with Charts


Our previous quest to unlock the secrets of sorting at Hogwarts is well underway! We've gathered our essential spellbooks (Python libraries) and mended the forgetful pages (filled in missing data). Now, it's time to unleash the true power of data science – the magic of data visualization! 🪄

Imagine Professor Dumbledore himself, his eyes twinkling with wisdom, holding a magical artifact – a shimmering chart. This isn't your ordinary piece of parchment, mind you! It's a canvas upon which raw data is transformed into a breathtaking spectacle, revealing hidden patterns and trends just like a Marauder's Map unveils secret passages.

6.1 Distribution of Students Across Houses

Now that we've filled those forgetful pages in our book, it's time to delve deeper into the fascinating world of Hogwarts houses! Remember how Harry, Ron, and Hermione were sorted into their houses based on their unique talents and personalities? Well, we're about to embark on a similar quest, using a magical tool called Matplotlib to create a visual map of how the Hogwarts students are distributed across their houses. ✨

With a wave of our metaphorical wand (or a line of Python code!), Matplotlib will conjure a magnificent bar chart. Think of it like a giant sorting hat, but instead of a tear on its brim, this hat boasts colorful bars that reach for the ceiling. Each bar represents a Hogwarts house – Gryffindor, Ravenclaw, Hufflepuff, and Slytherin. 🪄

# Importing visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Setting the aesthetic style for our plots
sns.set(style="whitegrid")

# Visualizing the distribution of students across houses
plt.figure(figsize=(15, 10))
sns.countplot(x='house', data=hogwarts_df, hue='house', legend=False)
plt.title('Distribution of Students Across Houses')
plt.xlabel('House')
plt.ylabel('Students')
plt.show()

Visualizing the distribution of students across houses
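Before trusting the height of a bar, it can help to peek at the numbers behind it. The sketch below is a minimal illustration, assuming a tiny made-up table (`sample_df` is a hypothetical stand-in for `hogwarts_df`, and its rows are invented), of how pandas tallies the same counts that the count plot draws.

```python
import pandas as pd

# Hypothetical stand-in for hogwarts_df; only the 'house' column matters here
sample_df = pd.DataFrame({
    "house": ["Gryffindor", "Gryffindor", "Slytherin",
              "Ravenclaw", "Hufflepuff", "Gryffindor"]
})

# value_counts() tallies the rows per category, which is what each bar's height shows
house_counts = sample_df["house"].value_counts()
print(house_counts)

# normalize=True returns each house's share of all students instead of a raw count
house_share = sample_df["house"].value_counts(normalize=True)
print(house_share.round(2))
```

Running the same two calls on `hogwarts_df["house"]` should give the numbers to compare against the chart.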


6.2 Distribution of Students Across Houses (With a Twist)

But this isn't just any ordinary painting. We're going to use the magic of data to bring our picture to life. With a flick of our wand (or a click of a mouse), we'll transform cold numbers into a vibrant tapestry that tells a tale as enchanting as any fairy tale. This time, let's add a twist to the spell: we'll print the exact student count on top of each bar, so the chart becomes more informative. 💫

# Importing visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Setting the aesthetic style for our plots
sns.set(style="whitegrid")

# Visualizing the distribution of students across houses
plt.figure(figsize=(15, 10))
ax = sns.countplot(x='house', data=hogwarts_df, hue='house', legend=False)

# Adding numerical information on top of each bar
for p in ax.patches:
    ax.annotate(f'{int(p.get_height())}', 
                (p.get_x() + p.get_width() / 2., p.get_height()), 
                ha='center', va='bottom', 
                fontsize=12, color='black', 
                xytext=(0, 5),  # Offset the text slightly above the bar
                textcoords='offset points')

plt.title('Distribution of Students Across Houses')
plt.xlabel('House')
plt.ylabel('Students')
plt.show()

Distribution of Students Across Houses (With a bit of twist)
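The annotation loop above works, but there may be a shorter incantation. As a hedged alternative, the sketch below uses `Axes.bar_label` (available in Matplotlib 3.4 and later, and the same helper section 6.4 calls through `plt.bar_label`). It draws the bars with plain Matplotlib rather than Seaborn, and `sample_df` is a hypothetical stand-in with invented rows.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for hogwarts_df
sample_df = pd.DataFrame({
    "house": ["Gryffindor", "Gryffindor", "Slytherin", "Ravenclaw"]
})
counts = sample_df["house"].value_counts()

fig, ax = plt.subplots(figsize=(6, 4))
bars = ax.bar(counts.index, counts.values)

# bar_label writes each bar's height above it; padding is the gap in points
labels = ax.bar_label(bars, padding=5)

ax.set_title("Distribution of Students Across Houses")
ax.set_xlabel("House")
ax.set_ylabel("Students")
```

In an interactive session, finishing with `plt.show()` displays the figure. The trade-off is less control over each label than the explicit loop gives, in exchange for a single call.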


6.3 Visualizing Age Distribution

But what about the ages of our students? Fear not, for we have another spell, the histogram. This spell stacks students into towers by age, and the smooth curve drawn over them (the kernel density estimate, switched on with kde=True) traces the overall shape of the distribution. It's like the four houses competing to build the tallest tower, only this time the towers are age groups. ⚔️

# Visualizing the age distribution
plt.figure(figsize=(10, 6))
sns.histplot(hogwarts_df['age'], kde=True, color='blue')
plt.title('Age Distribution of Hogwarts Students')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

Visualizing the age distribution
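If we also want to compare boys and girls at each age, a cross-tabulation is one way to do it. The sketch below is a minimal example under stated assumptions: `sample_df` is a hypothetical stand-in for `hogwarts_df` with invented rows, and only the `age` and `gender` columns are used.

```python
import pandas as pd

# Hypothetical stand-in for hogwarts_df
sample_df = pd.DataFrame({
    "age":    [11, 11, 11, 12, 12, 13],
    "gender": ["Male", "Female", "Male", "Female", "Female", "Male"],
})

# Rows are ages, columns are genders, cells are student counts
age_by_gender = pd.crosstab(sample_df["age"], sample_df["gender"])
print(age_by_gender)

# To draw side-by-side bars from this table (requires Matplotlib):
# age_by_gender.plot(kind="bar")
```

With Seaborn, `sns.countplot(x='age', hue='gender', data=hogwarts_df)` should produce a similar side-by-side chart directly from the raw rows.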


6.4 Visualizing Relationships Between Features

Next, we weave a more intricate spell, exploring the relationships between different features in our dataset. For instance, does a student’s heritage influence their choice of pet, or is there a connection between a student’s age and the type of wand they use? We'll begin with one such pairing: a student's house and their choice of pet. This step is like exploring the Forbidden Forest, uncovering the connections and mysteries that lie within.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Path to your dataset
dataset_path = 'data/hogwarts-students-02.csv'

# Reading the dataset
hogwarts_df = pd.read_csv(dataset_path)

# Plotting the number of students per house, split by choice of pet
plt.figure(figsize=(10, 5))
sns.countplot(x='house', hue='pet', data=hogwarts_df, palette='viridis')

# Add data labels (student counts) on top of each bar
for container in plt.gca().containers:
    plt.bar_label(container)

plt.title('Relationship between "House" and "Choice of Pet"')
plt.xlabel('House')
plt.ylabel('Number of Students')
plt.legend(title='Pet Type')
plt.show()

Visualizing Relationships Features

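Through this visualization, we might discover that one house has a penchant for owls while another prefers cats. Swapping x='house' for x='blood_status' would ask the same question of heritage, for example whether Muggle-born students favor owls while Pure-bloods prefer cats. These insights are akin to understanding the habits of magical creatures, revealing the subtle nuances that define the Hogwarts community.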
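To check a hunch about heritage and pets with numbers rather than by eye, a normalized cross-tabulation is one option. The sketch below assumes a tiny invented table (`sample_df` is hypothetical, not the real dataset) and shows, for each blood status, the share of students keeping each pet.

```python
import pandas as pd

# Hypothetical stand-in for hogwarts_df; the rows are invented for illustration
sample_df = pd.DataFrame({
    "blood_status": ["Muggle-born", "Muggle-born", "Pure-blood",
                     "Pure-blood", "Half-blood", "Half-blood"],
    "pet":          ["Owl", "Cat", "Cat", "Cat", "Owl", "Owl"],
})

# normalize="index" makes every row sum to 1, so groups of different sizes
# can be compared fairly
pet_share = pd.crosstab(
    sample_df["blood_status"], sample_df["pet"], normalize="index"
)
print(pet_share.round(2))
```

Shares are easier to compare than raw counts when one group is much larger than another, which is likely in a dataset of only a few dozen students.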


6.5 Summarizing the Data


This summary provides key statistics such as the mean, median (the 50% row), and standard deviation of numerical columns, and the number of unique values and the most frequent value for categorical columns. For instance, we might find that the most common house is Gryffindor, or that the average age of students is about 15 years.

summary = hogwarts_df.describe(include='all')
print(summary)
        Unnamed: 0          name gender        age   origin specialty  \
count    52.000000            52     52  52.000000       52        52   
unique         NaN            52      2        NaN        9        24   
top            NaN  Harry Potter   Male        NaN  England    Charms   
freq           NaN             1     27        NaN       35         7   
mean     25.500000           NaN    NaN  14.942308      NaN       NaN   
std      15.154757           NaN    NaN   2.492447      NaN       NaN   
min       0.000000           NaN    NaN  11.000000      NaN       NaN   
25%      12.750000           NaN    NaN  13.250000      NaN       NaN   
50%      25.500000           NaN    NaN  16.000000      NaN       NaN   
75%      38.250000           NaN    NaN  17.000000      NaN       NaN   
max      51.000000           NaN    NaN  18.000000      NaN       NaN   

             house blood_status  pet wand_type       patronus  \
count           52           52   52        52             52   
unique           6            4    9        28             15   
top     Gryffindor   Half-blood  Owl       Ash  Non-corporeal   
freq            18           25   36         4             36   
mean           NaN          NaN  NaN       NaN            NaN   
std            NaN          NaN  NaN       NaN            NaN   
min            NaN          NaN  NaN       NaN            NaN   
25%            NaN          NaN  NaN       NaN            NaN   
50%            NaN          NaN  NaN       NaN            NaN   
75%            NaN          NaN  NaN       NaN            NaN   
max            NaN          NaN  NaN       NaN            NaN   

       quidditch_position  boggart favorite_class  house_points  
count                  52       52             52     52.000000  
unique                  5       11             21           NaN  
top                Seeker  Failure         Charms           NaN  
freq                   47       40              9           NaN  
mean                  NaN      NaN            NaN    119.200000  
std                   NaN      NaN            NaN     53.057128  
min                   NaN      NaN            NaN     10.000000  
25%                   NaN      NaN            NaN     77.500000  
50%                   NaN      NaN            NaN    119.600000  
75%                   NaN      NaN            NaN    160.000000  
max                   NaN      NaN            NaN    200.000000  

6.5.1 Summary of the Results


  1. Count: The number of non-null values in each column.
  2. Unique: The number of unique values in each column.
  3. Top: The most frequent value in each column.
  4. Freq: The number of times the most frequent value appears.
  5. Mean: The arithmetic mean of the values in each column.
  6. Std: The standard deviation of the values in each column.
  7. Min: The minimum value in each column.
  8. 25%: The 25th percentile (lower quartile) of the values in each column.
  9. 50%: The 50th percentile (median) of the values in each column.
  10. 75%: The 75th percentile (upper quartile) of the values in each column.
  11. Max: The maximum value in each column.
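To see these rows come from somewhere, the sketch below runs `describe()` on two tiny hand-made columns and compares a few rows with the direct pandas calls. The values are invented purely for illustration.

```python
import pandas as pd

# Invented values, small enough to verify by hand
ages = pd.Series([11, 12, 14, 16, 17], name="age")
houses = pd.Series(
    ["Gryffindor", "Slytherin", "Gryffindor", "Ravenclaw", "Gryffindor"],
    name="house",
)

num_summary = ages.describe()    # count, mean, std, min, 25%, 50%, 75%, max
cat_summary = houses.describe()  # count, unique, top, freq

print(num_summary)
print(cat_summary)

# The 50% row is the median, and std is the sample standard deviation
print(num_summary["50%"] == ages.median())
print(num_summary["std"] == ages.std())
```

Numerical and text columns get different rows, which is why the combined `describe(include='all')` table is full of NaN: each column only fills in the statistics that apply to its type.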

6.5.2 Key Observations


  1. Age: The mean age is 14.942308, with a standard deviation of 2.492447. The age range is from 11 to 18.
  2. Gender: There are only two unique values: Male and Female.
  3. Origin: There are nine unique values, with England being the most frequent.
  4. Specialty: There are 24 unique values, with Charms being the most frequent.
  5. House: There are six unique values, with Gryffindor being the most frequent.
  6. Blood Status: There are four unique values, with Half-blood being the most frequent.
  7. Pet: There are nine unique values, with Owl being the most frequent.
  8. Wand Type: There are 28 unique values, with Ash being the most frequent.
  9. Patronus: There are 15 unique values, with Non-corporeal being the most frequent.
  10. Quidditch Position: There are five unique values, with Seeker being the most frequent.
  11. Boggart: There are 11 unique values, with Failure being the most frequent.
  12. Favorite Class: There are 21 unique values, with Charms being the most frequent.
  13. House Points: The mean is 119.200000, with a standard deviation of 53.057128. The range is from 10 to 200.

6.5.3 Insights


  • Age Distribution: Ages run from 11 to 18, and the middle half of students fall between roughly 13 and 17 years old (the 25th and 75th percentiles).
  • Gender: The dataset is close to balanced, with slightly more males (27 of 52 students).
  • Specialty and House: Charms is the most common specialty (7 students) and Gryffindor the most common house (18 students), though neither accounts for a majority.
  • Blood Status: Half-blood is the largest group (25 of 52), just under half of the students.
  • Pet and Wand Type: Owls are by far the most common pet (36 of 52). Ash is the most frequent wand wood but appears only 4 times across 28 types, so no single wand type dominates.
  • Patronus: Most students (36 of 52) have a Non-corporeal patronus.
  • Quidditch Position: Nearly every student (47 of 52) is listed as a Seeker.
  • Boggart and Favorite Class: Failure is the most common boggart (40 of 52), and Charms is the most popular favorite class (9 students).
  • House Points: House points range from 10 to 200 with a mean of about 119, so students in this dataset have varying levels of achievement and participation.

These insights can help you better understand the characteristics of the students in the Hogwarts dataset.
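A "most frequent" value is not always a majority, so it is worth dividing freq by count before drawing conclusions. The sketch below does that arithmetic for a handful of columns, using the top and freq values copied from the describe() output above (count is 52 for every column).

```python
# (top value, freq) pairs copied from the describe() output above
top_values = {
    "gender": ("Male", 27),
    "house": ("Gryffindor", 18),
    "blood_status": ("Half-blood", 25),
    "pet": ("Owl", 36),
    "wand_type": ("Ash", 4),
    "quidditch_position": ("Seeker", 47),
}
total_students = 52

# Share of all students that hold the most frequent value in each column
shares = {col: freq / total_students for col, (_, freq) in top_values.items()}

for col, (top, freq) in top_values.items():
    print(f"{col:<20} {top:<12} {freq:>2}/{total_students} = {shares[col]:.1%}")
```

A top value held by nine students in ten (Seeker) tells a very different story from one held by fewer than one in ten (Ash), even though both appear in the same "top" row.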


6.6 Correlation Matrix


Finally, we perform statistical analysis to quantify relationships and trends within our data. This step is akin to Snape carefully measuring potion ingredients to ensure the perfect brew. The correlation matrix and its visualization show us how different features relate to each other. For example, we might find a strong correlation between age and year at Hogwarts, as expected. Understanding these relationships helps us build more accurate models and make informed predictions.


# Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Loading the dataset
dataset_path = 'data/hogwarts-students-02.csv'  # Path to our dataset
hogwarts_df = pd.read_csv(dataset_path)

# Displaying the first few rows to understand the structure of the dataset
print(hogwarts_df.head())

# Checking the data types of each column to identify numerical and categorical data
print(hogwarts_df.dtypes)

# Selecting only numerical columns for correlation matrix
numerical_df = hogwarts_df.select_dtypes(include=[np.number])

# Calculating the correlation matrix using only numerical data
correlation_matrix = numerical_df.corr()
print(correlation_matrix)

# Visualizing the correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Matrix of Hogwarts Student Features')
plt.show()
               name  gender  age   origin                      specialty  \
0      Harry Potter    Male   11  England  Defense Against the Dark Arts   
1  Hermione Granger  Female   11  England                Transfiguration   
2       Ron Weasley    Male   11  England                          Chess   
3      Draco Malfoy    Male   11  England                        Potions   
4     Luna Lovegood  Female   11  Ireland                      Creatures   

        house blood_status  pet wand_type              patronus  \
0  Gryffindor   Half-blood  Owl     Holly                  Stag   
1  Gryffindor  Muggle-born  Cat      Vine                 Otter   
2  Gryffindor   Pure-blood  Rat       Ash  Jack Russell Terrier   
3   Slytherin   Pure-blood  Owl  Hawthorn         Non-corporeal   
4   Ravenclaw   Half-blood  Owl       Fir                  Hare   

  quidditch_position         boggart                 favorite_class  \
0             Seeker        Dementor  Defense Against the Dark Arts   
1             Seeker         Failure                     Arithmancy   
2             Keeper          Spider                         Charms   
3             Seeker  Lord Voldemort                        Potions   
4             Seeker      Her mother                      Creatures   

   house_points  
0         150.0  
1         200.0  
2          50.0  
3         100.0  
4         120.0  
name                   object
gender                 object
age                     int64
origin                 object
specialty              object
house                  object
blood_status           object
pet                    object
wand_type              object
patronus               object
quidditch_position     object
boggart                object
favorite_class         object
house_points          float64
dtype: object
                   age  house_points
age           1.000000      0.315227
house_points  0.315227      1.000000

correlation-matrix-age-house-points.png

The output above shows the correlation coefficient between the age and house_points columns, the two numerical columns selected for the matrix. Here’s a breakdown of what these results imply.

6.6.1 Correlation Coefficients Interpretation

A correlation coefficient is like a magical measuring tape, helping us understand how closely two things are linked. It's a number between -1 and 1, and the closer it is to either end, the stronger the connection. Think of it as a magical spell that reveals hidden relationships!

                   age  house_points
age           1.000000      0.315227
house_points  0.315227      1.000000

A positive correlation is like a friendship charm; as one thing increases, so does the other. For instance, if height and weight have a strong positive correlation, taller students tend to weigh more. On the other hand, a negative correlation is like a mischievous pixie's prank; as one thing increases, the other decreases. If hours of sleep and tiredness have a strong negative correlation, those who sleep more tend to be less tired.
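For the curious, here is what the measuring tape does under the hood. The sketch below computes the Pearson correlation coefficient by hand (the default method of pandas' `corr()`) and checks it against pandas. The two short series are invented for illustration and are not taken from the Hogwarts dataset.

```python
import numpy as np
import pandas as pd

# Invented values for illustration only
age = pd.Series([11, 12, 13, 14, 15, 16])
points = pd.Series([50, 80, 60, 120, 90, 150])

# Step 1: how far is each value from its column's mean?
age_dev = age - age.mean()
points_dev = points - points.mean()

# Step 2: Pearson's r = sum of paired deviations, scaled by both spreads
r_manual = (age_dev * points_dev).sum() / np.sqrt(
    (age_dev ** 2).sum() * (points_dev ** 2).sum()
)

# Step 3: pandas should give the same number
r_pandas = age.corr(points)

print(r_manual, r_pandas)
```

The scaling in step 2 is what keeps the result between -1 and 1, whatever units the two columns are measured in.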

6.6.2 Correlation Value Analysis

The correlation coefficient between age and house_points is 0.315227. This value indicates a positive correlation between the two variables. In general, correlation coefficients range from -1 to 1:

  • 1 indicates a perfect positive correlation.
  • 0 indicates no correlation.
  • -1 indicates a perfect negative correlation.

6.6.3 Strength of the Correlation

A correlation of 0.315 suggests a weak to moderate positive correlation. This means that as the age of the students increases, their house points tend to increase as well, but the relationship is not very strong.
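One hedged way to put "not very strong" into numbers is to square the coefficient. In a simple straight-line fit between two variables, r² is the share of the variation in one that the other accounts for. The sketch below applies that to the value reported above.

```python
# Correlation between age and house_points, as reported above
r = 0.315227

# In a simple linear fit, r squared is the share of variance explained
r_squared = r ** 2

print(f"r = {r:.3f}, r^2 = {r_squared:.3f}")
# prints: r = 0.315, r^2 = 0.099
```

So under that straight-line assumption, age would account for roughly a tenth of the variation in house points, leaving most of it to other factors.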

6.6.4 Implications

  • Age and Performance: The positive correlation may imply that older students tend to accumulate more house points. This could be due to increased experience, maturity, or participation in activities that earn house points.
  • Further Investigation Needed: While there is a correlation, it does not imply causation. Other factors could be influencing both age and house points, such as the year of study, involvement in extracurricular activities, or differences in house dynamics.
  • Potential Analysis: Further analysis could involve looking at other variables (like specialty or house) to see if they mediate or moderate the relationship between age and house points.
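As a first step toward that further analysis, the correlation can be computed separately for each house. The sketch below assumes an invented table (`sample_df` is hypothetical, built so the two houses pull in opposite directions) to show how a weak overall correlation can hide very different patterns inside each group.

```python
import pandas as pd

# Invented data: points rise with age in one house and fall in the other
sample_df = pd.DataFrame({
    "house": ["Gryffindor"] * 4 + ["Slytherin"] * 4,
    "age": [11, 13, 15, 17, 11, 13, 15, 17],
    "house_points": [50, 90, 120, 160, 140, 100, 90, 60],
})

# Correlation across all students, ignoring house
overall_r = sample_df["age"].corr(sample_df["house_points"])

# Correlation within each house
per_house_r = {
    house: group["age"].corr(group["house_points"])
    for house, group in sample_df.groupby("house")
}

print(f"overall: {overall_r:.2f}")
for house, r in per_house_r.items():
    print(f"{house}: {r:.2f}")
```

With only 52 students in the real dataset, each house would contain just a handful of rows, so per-house coefficients there should be read as hints rather than firm findings.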

6.6.5 Correlation Coefficients Summary

In summary, the correlation analysis indicates a weak to moderate positive relationship between age and house points among Hogwarts students. While older students may tend to earn more points, further analysis is necessary to understand the underlying factors contributing to this correlation.

But beware, young wizard! Correlation doesn't imply causation. Just because two things move together doesn't mean one causes the other. It's like finding a lost sock and a lucky penny on the same day; they turned up together, but neither one caused the other. 🪄✨


6.7 Gemika's Pop-Up Quiz: Visualizing Data with Charts 🧙‍♂️🪄


And now, dear reader, my son Gemika Haziq Nugroho appears with a twinkle in his eye and a quiz in hand. Are you ready to test your knowledge and prove your mastery of data exploration?

  1. Which magical Python libraries did we use to create our visualizations?
  2. Which statistic tells you the number of times the most frequent value appears?
  3. What can be implied from the "Blood Status" insight?

Answer these questions with confidence, and you will demonstrate your prowess in the art of data exploration. With our dataset now fully explored and understood, we are ready to embark on the next phase of our magical journey. Onward to deeper discoveries and greater insights! 🌟✨🧙‍♂️

