Data Analysis with Python: Spotify Songs Dataset
In data science, loading a dataset and exploring it are two of the first tasks you perform. Depending on the information you need to obtain, other tasks will follow.
Before starting a data analysis, it helps to know the steps involved. They are usually carried out in the following order:
Loading data (dataset).
Exploratory data analysis.
Data preparation and preprocessing.
Data visualization.
Machine learning model generation.
Machine learning model training.
Predictive model definition.
Evaluation of the trained model with reserved data.
In the exercise explained below, I only want to obtain information about Spotify songs. This is a brief analysis written in Python; if you want to see the complete exercise, you can download it from the repository on GitHub.
⚠️ Before Starting
Before starting a data analysis, it's very important to define the information you need to obtain, because without a clear objective, you won't have a starting point.
Loading the Dataset
The dataset (MostStreamedSpotifySongs2024.csv) consists of several columns that reference the main streaming music platforms. In this case, I only want to explore Spotify data. The information I want to know is the following:
- Songs by year
- Song percentage: Explicit VS Non-Explicit
- Most listened to songs by year with and without explicit content
- Song with the most streams
Importing the Libraries
The pandas, NumPy, Matplotlib, and Seaborn libraries make the work much easier thanks to the large number of methods they offer.
# Data manipulation with DataFrames.
import pandas as pd
# Numerical operations and array handling.
import numpy as np
# Chart creation.
import matplotlib.pyplot as plt
# Advanced statistical visualization.
import seaborn as sns
# Display charts in Jupyter notebook.
%matplotlib inline
Reading the File
In this exercise, there is a single CSV file encoded in ISO-8859-1. Some files contain special characters, so it's important to specify the encoding to avoid reading errors.
# Reading the file, the encoding is ISO-8859-1
file_path = 'MostStreamedSpotifySongs2024.csv'
data = pd.read_csv(file_path, encoding='ISO-8859-1')
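To illustrate why the encoding matters, here is a minimal sketch with a made-up two-row file held in memory (the sample bytes are hypothetical, not taken from the real dataset). Read with the default UTF-8 codec, the accented character cannot be decoded; read with the declared encoding, it loads correctly.

```python
import io

import pandas as pd

# Hypothetical sample encoded as ISO-8859-1 (not part of the real dataset).
raw_bytes = "Track,Artist\nHalo,Beyoncé\n".encode("ISO-8859-1")

# With the default UTF-8 codec, the byte that encodes "é" is invalid.
try:
    pd.read_csv(io.BytesIO(raw_bytes))
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Declaring the actual encoding decodes the accented character.
sample = pd.read_csv(io.BytesIO(raw_bytes), encoding="ISO-8859-1")

print(f"UTF-8 read failed: {utf8_failed}")
print(sample)
```

If you don't know a file's encoding in advance, a failed UTF-8 read like this one is usually the first hint that you need to pass `encoding` explicitly.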
Visualizing the Data Table
Once the data is loaded, take a first look at the information it contains. By default, the head() method displays the first five rows of the DataFrame.
# View the first five rows of the table
data.head()
Dataset Dimensions
Knowing the dimensions of the dataset helps understand the amount of data you'll be working with.
# Dataset dimensions
print(f'Dataset size: {data.shape}')
DataFrame Observation
Before starting data cleaning, you need to check whether any data is missing. The info() method lists each column with its data type and its number of non-null values.
# Column names, data types, and non-null counts
data.info()
Null Data and Duplicate Data
After observing that some columns have missing data, the next step is to count the null values and the duplicate rows. To get each total, chain the sum() method onto isnull() and duplicated().
# Sum of null values
data.isnull().sum()
# Sum of duplicate records
data.duplicated().sum()
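To make the output of these two calls easier to read, here is a small sketch on a hypothetical four-row DataFrame (the values are illustrative only). It also shows the share of missing values per column, which can be more telling than the raw count.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the values are not from the real file.
sample = pd.DataFrame({
    "Track": ["A", "B", "B", "C"],
    "Artist": ["X", "Y", "Y", np.nan],
})

# Number of missing values in each column.
null_counts = sample.isnull().sum()

# Percentage of missing values in each column.
null_share = sample.isnull().mean() * 100

# Number of rows that repeat an earlier row exactly.
duplicate_count = sample.duplicated().sum()

print(null_counts)
print(null_share)
print(f"Duplicate rows: {duplicate_count}")
```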
Data Cleaning
The following cleaning steps are needed to obtain a clean, consistent dataset.
Duplicate Rows
The drop_duplicates() method is used to remove duplicate data.
# Find all duplicate records
duplicated_rows = data[data.duplicated()]
# Display duplicate records
print(duplicated_rows)
# Remove duplicate rows
print(f'Dataset size before removing duplicate rows: {data.shape}')
data.drop_duplicates(inplace=True)
print(f'Dataset size after removing duplicate rows: {data.shape}')
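Two details of these methods are worth a quick sketch on hypothetical data: `duplicated()` flags only the later copies by default, and `drop_duplicates()` returns a new DataFrame when `inplace=True` is omitted.

```python
import pandas as pd

# Hypothetical sample with one repeated row.
sample = pd.DataFrame({
    "Track": ["A", "B", "B"],
    "Artist": ["X", "Y", "Y"],
})

# keep=False flags every copy of a repeated row, not only the later ones.
all_copies = sample[sample.duplicated(keep=False)]

# Without inplace=True, the original DataFrame is left untouched.
deduplicated = sample.drop_duplicates().reset_index(drop=True)

print(all_copies)
print(f"Original: {sample.shape}, deduplicated: {deduplicated.shape}")
```

Working on a copy like this can be handy while you are still deciding which rows to remove.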
Null Rows
The first step is to filter the rows where Artist is null and remove them.
# Filter rows where 'Artist' is null
null_artists = data[data['Artist'].isnull()]
# Display the indices of rows with null values in 'Artist'
print("\nIndices of artists that are null:")
print(null_artists.index.tolist())
# Remove null artists
print(f"Number of null artists before removing them: {data['Artist'].isnull().sum()}")
data.dropna(subset=['Artist'], inplace=True)
print(f"Number of null artists after removing them: {data['Artist'].isnull().sum()}")
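As a small illustration of what `subset` does, the sketch below uses a hypothetical DataFrame with a made-up `Popularity` column. Only rows with a null Artist are removed; nulls in other columns are kept.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; 'Popularity' is an illustrative column name.
sample = pd.DataFrame({
    "Track": ["A", "B", "C"],
    "Artist": ["X", np.nan, "Z"],
    "Popularity": [np.nan, 70.0, 65.0],
})

# Drop only the rows where 'Artist' is missing.
cleaned = sample.dropna(subset=["Artist"])

print(cleaned)
```

Without `subset`, `dropna()` would also remove the first row, because it has a null in another column.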
Transforming the Data
Since the objective of the analysis is to explore only Spotify data, the columns corresponding to other music platforms are removed.
# Remove columns that are not considered for the main objective
# Define the list of columns to remove
columns_to_drop = [
'YouTube Views', 'YouTube Likes', 'TikTok Posts', 'TikTok Likes', 'TikTok Views',
'YouTube Playlist Reach', 'Apple Music Playlist Count', 'AirPlay Spins', 'SiriusXM Spins',
'Deezer Playlist Count', 'Deezer Playlist Reach', 'Amazon Playlist Count', 'Pandora Streams',
'Pandora Track Stations', 'Soundcloud Streams', 'Shazam Counts', 'TIDAL Popularity'
]
# Remove the columns
data.drop(columns=columns_to_drop, inplace=True)
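If you prefer not to type every column name, one possible alternative is to build the list from platform prefixes. This is only a sketch on a hypothetical empty DataFrame, and the prefix tuple is my own assumption; columns such as 'AirPlay Spins' or 'Shazam Counts' would need their own prefixes.

```python
import pandas as pd

# Hypothetical DataFrame with a few of the column names used above.
sample = pd.DataFrame(columns=[
    "Track", "Artist", "Spotify Streams",
    "YouTube Views", "TikTok Likes", "Deezer Playlist Count",
])

# Assumed prefixes of the platforms to discard.
other_platforms = ("YouTube", "TikTok", "Deezer")

# str.startswith accepts a tuple and matches any of its entries.
columns_to_drop = [col for col in sample.columns if col.startswith(other_platforms)]

spotify_only = sample.drop(columns=columns_to_drop)
print(list(spotify_only.columns))
```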
Data Visualization
After performing the data loading, cleaning, and transformation processes, the next step is to visualize the information requested by the exercise.
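The charts in this section group songs by a Year column, which the steps shown so far do not create. Assuming the CSV provides a 'Release Date' column with month/day/year strings (an assumption on my part, so check your own file), a sketch like the following could derive it. The sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical sample; the 'Release Date' name and format are assumptions.
sample = pd.DataFrame({
    "Track": ["A", "B", "C"],
    "Release Date": ["4/26/2024", "11/3/2019", "not a date"],
})

# Unparseable values become NaT instead of raising an error.
release_dates = pd.to_datetime(
    sample["Release Date"], format="%m/%d/%Y", errors="coerce"
)

# The nullable Int64 type keeps years as integers even when some are missing.
sample["Year"] = release_dates.dt.year.astype("Int64")

print(sample)
```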
Songs by Year
# Count the number of songs by year
songs_by_year = data['Year'].value_counts().sort_index()
# Create the chart
plt.figure(figsize=(10, 6))
songs_by_year.plot(kind='bar', color='skyblue')
plt.title('Number of Songs by Year')
plt.xlabel('Year')
plt.ylabel('Number of Songs')
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='-', alpha=0.7)
# Display the chart
plt.tight_layout()
plt.show()
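The `sort_index()` call in the code above is what puts the bars in chronological order. A quick sketch on a hypothetical Series shows the difference, since `value_counts()` alone orders the result by frequency.

```python
import pandas as pd

# Hypothetical release years.
years = pd.Series([2023, 2021, 2023, 2022, 2023, 2021], name="Year")

# Ordered from the most frequent year to the least frequent one.
by_frequency = years.value_counts()

# Ordered chronologically by the year label.
by_year = years.value_counts().sort_index()

print(by_frequency.index.tolist())
print(by_year.index.tolist())
```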
Song Percentage: Explicit vs Non-Explicit
# Total songs with explicit lyrics
# Count the number of occurrences of 0 and 1
value_counts = data['Explicit Track'].value_counts()
# Map binary values to explicit labels
labels = ['Explicit', 'Non-Explicit']
sizes = [value_counts.get(1, 0), value_counts.get(0, 0)]
# Create the pie chart
plt.figure(figsize=(4, 8))
plt.pie(sizes, labels=labels, autopct='%1.1f%%', colors=['skyblue', 'salmon'])
plt.title('Song Distribution: Explicit vs Non-Explicit')
# Display the chart
plt.show()
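If you only need the percentages and not the chart, `value_counts(normalize=True)` returns them directly. The sketch below uses hypothetical flags and assumes, as the code above does, that 1 means explicit and 0 means non-explicit.

```python
import pandas as pd

# Hypothetical flags: 1 = explicit, 0 = non-explicit.
explicit_flags = pd.Series([1, 0, 0, 1, 0], name="Explicit Track")

# normalize=True returns proportions instead of counts.
shares = explicit_flags.value_counts(normalize=True).mul(100).round(1)

# Replace the binary values with readable labels.
shares.index = shares.index.map({1: "Explicit", 0: "Non-Explicit"})

print(shares)
```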
Most Listened to Songs by Year with and without Explicit Content
# Filter explicit and non-explicit songs
explicit_data = data[data['Explicit Track'] == 1]
no_explicit_data = data[data['Explicit Track'] == 0]
# Group by year with explicit content and without explicit content
explicit_track = explicit_data.groupby('Year')['Track'].count().reset_index()
no_explicit_track = no_explicit_data.groupby('Year')['Track'].count().reset_index()
# Rename columns to unify the DataFrame
explicit_track.rename(columns={'Track': 'Count'}, inplace=True)
explicit_track['Explicit'] = 'Yes'
no_explicit_track.rename(columns={'Track': 'Count'}, inplace=True)
no_explicit_track['Explicit'] = 'No'
# Merge the two DataFrames
data_combined = pd.concat([explicit_track, no_explicit_track])
# Create the chart using Seaborn
plt.figure(figsize=(12, 6))
sns.set_style("whitegrid")
# Create bar chart
sns.barplot(data=data_combined, x='Year', y='Count', hue='Explicit')
# Add title and labels
plt.title('Songs by Year According to Their Content')
plt.xlabel('Year')
plt.ylabel('Number of Songs')
# Display the chart
plt.show()
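The filter, group, rename, and concat steps above can also be written more compactly with `pd.crosstab`. This is an alternative sketch on hypothetical data, not the method used in the exercise; it should produce the same kind of long table that `sns.barplot` expects.

```python
import pandas as pd

# Hypothetical sample with the two columns the chart relies on.
sample = pd.DataFrame({
    "Year": [2022, 2022, 2023, 2023, 2023],
    "Explicit Track": [1, 0, 1, 1, 0],
})

# One row per year, one column per explicit flag.
counts = pd.crosstab(sample["Year"], sample["Explicit Track"])
counts = counts.rename(columns={0: "No", 1: "Yes"})

# Long format: one row per (Year, Explicit) pair with its count.
long_counts = counts.reset_index().melt(
    id_vars="Year", var_name="Explicit", value_name="Count"
)

print(counts)
print(long_counts)
```

One difference to keep in mind: `crosstab` fills combinations that never occur with 0, whereas the groupby approach simply omits them.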
Song with the Most Streams
# Identify the row with the most listened to song
most_listened_song = data.loc[data['Spotify Streams'].idxmax()]
print(f"The song with the most streams is '{most_listened_song['Track']}' by {most_listened_song['Artist']} with {most_listened_song['Spotify Streams']} streams.")
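Keep in mind that `idxmax()` only finds the largest value if the column is numeric. If 'Spotify Streams' was read as text with thousands separators (an assumption about this CSV that you can check with `data.info()`), it would need converting first. A minimal sketch on hypothetical values:

```python
import pandas as pd

# Hypothetical sample with stream counts stored as text.
sample = pd.DataFrame({
    "Track": ["A", "B", "C"],
    "Spotify Streams": ["999,000", "1,250,000,000", None],
})

# Remove the separators, then convert; invalid values become NaN.
sample["Spotify Streams"] = pd.to_numeric(
    sample["Spotify Streams"].str.replace(",", "", regex=False),
    errors="coerce",
)

# idxmax() skips NaN and returns the index label of the largest value.
top_song = sample.loc[sample["Spotify Streams"].idxmax()]

print(f"{top_song['Track']}: {top_song['Spotify Streams']:,.0f} streams")
```

Another option is to pass `thousands=','` to `pd.read_csv()` so that such columns are parsed as numbers at load time.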
Conclusions
After exploring and visualizing the data of the most listened to songs on Spotify in 2024, I've drawn the following insights.
In the chart of songs by year according to their content, you can observe an increase in the number of songs with explicit content from 2015 onwards. This increase may be due to the following factors:
- Increase in new artists who use more explicit language.
- Emergence or fusion of new musical styles.
- Reflections of society in song lyrics with advocacy motives.
- Other reasons.
Another result that I found curious is that the song with the most plays is one of my favorites, yet it is not the one with the highest score. This raises the question: what is the key to a song's success?
Want to explore the project further?
💻 Check out the project on GitHub: music-data-analysis
Watch my favorite, most-streamed song on YouTube: Watch on YouTube
I hope this article has been useful to you.