<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Dilan Bosire</title>
    <description>The latest articles on DEV Community by Dilan Bosire (@dilan_bosire).</description>
    <link>https://dev.to/dilan_bosire</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3393966%2Fed153893-5657-48b0-b507-099cebed4933.png</url>
      <title>DEV Community: Dilan Bosire</title>
      <link>https://dev.to/dilan_bosire</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dilan_bosire"/>
    <language>en</language>
    <item>
      <title>Using Clustering to Group Songs by Tempo, Energy, and Vocals</title>
      <dc:creator>Dilan Bosire</dc:creator>
      <pubDate>Wed, 21 Jan 2026 09:13:28 +0000</pubDate>
      <link>https://dev.to/dilan_bosire/using-clustering-to-group-songs-by-tempo-energy-and-vocals-1505</link>
      <guid>https://dev.to/dilan_bosire/using-clustering-to-group-songs-by-tempo-energy-and-vocals-1505</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;With the rapid expansion of digital music libraries and streaming platforms, organizing and understanding large collections of songs has become increasingly important. As music datasets grow into the thousands or even millions of tracks, manual categorization becomes impractical. Clustering—an unsupervised machine learning technique—offers an effective solution by grouping songs based on shared characteristics without relying on predefined labels.&lt;/p&gt;

&lt;p&gt;This article explores how clustering can be applied to a dataset of 1,000 songs using three key audio features: tempo, energy level, and vocal presence. It also discusses the types of song groupings that are likely to emerge from such an analysis and their real-world applications.&lt;/p&gt;




&lt;h2&gt;
  
  
  Understanding the Key Features
&lt;/h2&gt;

&lt;p&gt;Before applying clustering techniques, it is essential to understand the features used to represent each song:&lt;/p&gt;

&lt;h3&gt;
  
  
  Tempo
&lt;/h3&gt;

&lt;p&gt;Tempo refers to the speed of a song, measured in beats per minute (BPM). It plays a crucial role in defining the pace and mood of a track, distinguishing fast-paced dance songs from slower, more relaxed compositions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Energy Level
&lt;/h3&gt;

&lt;p&gt;Energy is a numerical representation of a song’s intensity and activity. It is often derived from attributes such as loudness, rhythm strength, and dynamic range. High-energy songs tend to feel lively and powerful, while low-energy songs are calmer and more subdued.&lt;/p&gt;

&lt;h3&gt;
  
  
  Vocal Presence
&lt;/h3&gt;

&lt;p&gt;Vocal presence measures the dominance of vocals in a track. This feature may be represented as a continuous scale (from low to high vocal intensity) or as a binary indicator distinguishing vocal tracks from instrumental ones.&lt;/p&gt;

&lt;p&gt;Together, these features capture both the rhythmic and expressive elements of music, making them ideal for clustering songs by mood, style, and listening context.&lt;/p&gt;




&lt;h2&gt;
  
  
  Applying Clustering Techniques
&lt;/h2&gt;

&lt;p&gt;To cluster the 1,000-song dataset effectively, the following steps are typically followed:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Data Preprocessing
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Normalize or standardize tempo, energy, and vocal features to ensure that no single attribute dominates the clustering process.&lt;/li&gt;
&lt;li&gt;Handle missing or noisy data to improve the accuracy and reliability of the results.&lt;/li&gt;
&lt;/ul&gt;
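&lt;p&gt;As a brief sketch of this step (the feature values below are illustrative, not from a real dataset), standardization with scikit-learn might look like this:&lt;/p&gt;

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative feature matrix: [tempo (BPM), energy (0-1), vocal presence (0-1)]
X = np.array([
    [128.0, 0.9, 0.8],
    [ 70.0, 0.2, 0.1],
    [100.0, 0.5, 0.6],
])

# Standardize so tempo's large BPM values do not dominate the distance metric
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0).round(6))  # each column now has mean 0
print(X_scaled.std(axis=0).round(6))   # and unit variance
```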

&lt;h3&gt;
  
  
  2. Choosing a Clustering Algorithm
&lt;/h3&gt;

&lt;p&gt;Several clustering algorithms are well suited for music data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;K-Means Clustering&lt;/strong&gt;&lt;br&gt;
A popular and efficient algorithm that partitions songs into a predefined number of clusters based on similarity.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hierarchical Clustering&lt;/strong&gt;&lt;br&gt;
Useful for exploring relationships between clusters and identifying subgroups within broader musical categories.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;DBSCAN&lt;/strong&gt;&lt;br&gt;
Effective for detecting outliers or niche music styles that do not fit well into larger clusters.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
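&lt;p&gt;For instance, K-Means from scikit-learn could be applied to standardized features along these lines (the two synthetic "song" groups below are purely illustrative):&lt;/p&gt;

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical standardized features: [tempo, energy, vocal presence]
rng = np.random.default_rng(42)
fast = rng.normal([1.0, 1.0, 1.0], 0.1, size=(20, 3))    # high-tempo, high-energy songs
slow = rng.normal([-1.0, -1.0, -1.0], 0.1, size=(20, 3))  # low-tempo, low-energy songs
X = np.vstack([fast, slow])

# Partition the songs into a predefined number of clusters
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment for each song
print(kmeans.cluster_centers_)  # average feature profile of each cluster
```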

&lt;h3&gt;
  
  
  3. Determining the Optimal Number of Clusters
&lt;/h3&gt;

&lt;p&gt;Techniques such as the &lt;strong&gt;elbow method&lt;/strong&gt; and the &lt;strong&gt;silhouette score&lt;/strong&gt; are commonly used to identify the most appropriate number of clusters.&lt;/p&gt;
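&lt;p&gt;A sketch of the silhouette approach on synthetic data (the cluster centers are made up for illustration):&lt;/p&gt;

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Three well-separated synthetic clusters in (tempo, energy, vocals) space
centers = ([0, 0, 0], [3, 3, 3], [6, 0, 6])
X = np.vstack([rng.normal(c, 0.1, size=(30, 3)) for c in centers])

# Score each candidate k; a higher silhouette means better-separated clusters
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # the silhouette peaks at the true number of clusters here
```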

&lt;h3&gt;
  
  
  4. Cluster Interpretation
&lt;/h3&gt;

&lt;p&gt;Once clustering is complete, the average tempo, energy, and vocal values of each cluster are analyzed to understand the musical characteristics of each group.&lt;/p&gt;
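&lt;p&gt;One common way to carry out this interpretation step (assuming a pandas DataFrame holding the features and an assigned cluster label; the values are illustrative):&lt;/p&gt;

```python
import pandas as pd

# Hypothetical clustering output: audio features plus an assigned cluster label
df = pd.DataFrame({
    "tempo":   [128, 130, 72, 70, 100],
    "energy":  [0.90, 0.85, 0.20, 0.15, 0.50],
    "vocals":  [0.80, 0.70, 0.90, 0.85, 0.10],
    "cluster": [0, 0, 1, 1, 2],
})

# Average tempo, energy, and vocal values per cluster
profile = df.groupby("cluster").mean()
print(profile)
```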




&lt;h2&gt;
  
  
  Expected Song Groupings
&lt;/h2&gt;

&lt;p&gt;Based on tempo, energy, and vocal presence, several natural clusters are likely to emerge:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. High-Tempo, High-Energy, Vocal-Heavy Songs
&lt;/h3&gt;

&lt;p&gt;These clusters typically include pop, EDM, dance, and upbeat hip-hop tracks. They are well suited for workouts, parties, and energetic environments.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. High-Tempo, High-Energy, Instrumental Songs
&lt;/h3&gt;

&lt;p&gt;Often composed of electronic or instrumental dance music, these tracks are commonly used for gaming, background music, or focus-driven activities.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Medium-Tempo, Medium-Energy, Vocal-Focused Songs
&lt;/h3&gt;

&lt;p&gt;This group includes mainstream pop, rock, and alternative music, making it ideal for casual listening and radio play.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Low-Tempo, Low-Energy, Vocal-Heavy Songs
&lt;/h3&gt;

&lt;p&gt;Ballads, acoustic tracks, and emotionally expressive songs fall into this category and are often associated with relaxation or reflection.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Low-Tempo, Low-Energy, Instrumental Songs
&lt;/h3&gt;

&lt;p&gt;Ambient, classical, and lo-fi music typically form this cluster, commonly used for studying, meditation, or background ambiance.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Outlier or Niche Clusters
&lt;/h3&gt;

&lt;p&gt;These include experimental tracks with unusual tempos or mixed energy levels. While they may not align with common listening patterns, they represent unique artistic styles.&lt;/p&gt;




&lt;h2&gt;
  
  
  Practical Applications
&lt;/h2&gt;

&lt;p&gt;Clustering songs based on audio features has several real-world applications:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Music Recommendation Systems&lt;/strong&gt;&lt;br&gt;
Improves personalized recommendations by grouping similar songs together.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Playlist Curation&lt;/strong&gt;&lt;br&gt;
Helps create playlists tailored to specific moods, activities, or environments.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Music Analysis and Discovery&lt;/strong&gt;&lt;br&gt;
Enables artists, producers, and analysts to understand musical trends and listener preferences.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Market Segmentation&lt;/strong&gt;&lt;br&gt;
Allows streaming platforms to better target different listener groups.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Clustering is a robust, data-driven method for grouping songs by tempo, energy, and vocal presence. When unsupervised learning techniques are applied to a dataset of 1,000 songs, meaningful and intuitive groups emerge that reflect common listening moods and styles. These clusters not only improve music discovery and recommendation systems but also deepen our understanding of musical patterns and listener behavior.&lt;/p&gt;




</description>
      <category>algorithms</category>
      <category>data</category>
      <category>datascience</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Understanding Parametric and Non-Parametric Tests in Statistics</title>
      <dc:creator>Dilan Bosire</dc:creator>
      <pubDate>Sun, 12 Oct 2025 10:02:12 +0000</pubDate>
      <link>https://dev.to/dilan_bosire/understanding-parametric-and-non-parametric-tests-in-statistics-56mj</link>
      <guid>https://dev.to/dilan_bosire/understanding-parametric-and-non-parametric-tests-in-statistics-56mj</guid>
      <description>&lt;h3&gt;
  
  
  &lt;strong&gt;Introduction&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In statistics, data rarely come in one form or follow one rule. Sometimes we have neat, normally distributed data; other times, our data are messy, skewed, or come from small samples. Because of this, researchers rely on two main types of tests to analyze data — &lt;strong&gt;parametric&lt;/strong&gt; and &lt;strong&gt;non-parametric tests&lt;/strong&gt;. Knowing the difference between them and when to use each is essential for drawing accurate conclusions from research.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;What Are Parametric Tests?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Parametric tests&lt;/strong&gt; are statistical tests that make specific assumptions about the population data. The most important assumption is that the data follow a &lt;strong&gt;normal distribution&lt;/strong&gt;. These tests also assume that the data are measured on an &lt;strong&gt;interval or ratio scale&lt;/strong&gt; (meaning they have meaningful numerical values and equal spacing between them).&lt;/p&gt;

&lt;p&gt;Common examples of parametric tests include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;t-test&lt;/strong&gt; – compares the means of two groups.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ANOVA (Analysis of Variance)&lt;/strong&gt; – compares means across three or more groups.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pearson’s correlation&lt;/strong&gt; – measures the strength and direction of a linear relationship between two continuous variables.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because parametric tests rely on assumptions about the data, they tend to be &lt;strong&gt;more powerful&lt;/strong&gt; when those assumptions are met. This means they’re better at detecting true differences or relationships.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;br&gt;
Imagine a researcher comparing average blood pressure between two groups of adults. If the data are normally distributed and measured on a ratio scale, a &lt;strong&gt;t-test&lt;/strong&gt; would be the appropriate parametric choice.&lt;/p&gt;
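&lt;p&gt;A minimal sketch of this scenario with SciPy, using synthetic blood-pressure values rather than real measurements:&lt;/p&gt;

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Synthetic, roughly normal blood-pressure readings (mmHg) for two groups
group_a = rng.normal(120, 10, size=40)
group_b = rng.normal(128, 10, size=40)

# Independent-samples t-test comparing the two group means
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(round(t_stat, 3), round(p_value, 4))
```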




&lt;h3&gt;
  
  
  &lt;strong&gt;What Are Non-Parametric Tests?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Non-parametric tests&lt;/strong&gt;, on the other hand, do not rely on strict assumptions about the data’s distribution. They’re often called &lt;strong&gt;distribution-free tests&lt;/strong&gt; because they can be used when data don’t follow a normal distribution, when sample sizes are small, or when data are ranked or ordinal rather than numerical.&lt;/p&gt;

&lt;p&gt;Common examples include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mann–Whitney U test&lt;/strong&gt; – compares two independent groups (used instead of a t-test).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kruskal–Wallis test&lt;/strong&gt; – compares more than two groups (used instead of ANOVA).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spearman’s rank correlation&lt;/strong&gt; – measures the relationship between two ranked variables.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Non-parametric tests are especially useful when dealing with &lt;strong&gt;non-normal&lt;/strong&gt;, &lt;strong&gt;skewed&lt;/strong&gt;, or &lt;strong&gt;ordinal&lt;/strong&gt; data, such as survey responses or rankings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;br&gt;
If a researcher wanted to compare satisfaction levels between two hospitals using survey scores on a scale of 1–5, a &lt;strong&gt;Mann–Whitney U test&lt;/strong&gt; would be more appropriate than a t-test because the data are ordinal and may not be normally distributed.&lt;/p&gt;
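&lt;p&gt;A minimal sketch of this comparison with SciPy (the satisfaction scores are hypothetical):&lt;/p&gt;

```python
from scipy import stats

# Hypothetical 1-5 satisfaction scores from two hospitals (ordinal data)
hospital_a = [5, 4, 4, 5, 3, 4, 5, 4]
hospital_b = [2, 3, 1, 2, 3, 2, 4, 2]

# Rank-based comparison that does not assume normality
u_stat, p_value = stats.mannwhitneyu(hospital_a, hospital_b, alternative="two-sided")
print(u_stat, round(p_value, 4))
```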




&lt;h3&gt;
  
  
  &lt;strong&gt;Key Differences Between Parametric and Non-Parametric Tests&lt;/strong&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Aspect&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Parametric Tests&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Non-Parametric Tests&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Assumptions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Require normal distribution, equal variances&lt;/td&gt;
&lt;td&gt;No strict distribution assumptions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Type&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Interval or ratio data&lt;/td&gt;
&lt;td&gt;Ordinal or ranked data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Statistical Power&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Higher when assumptions are met&lt;/td&gt;
&lt;td&gt;Lower but more flexible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Examples&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;t-test, ANOVA, Pearson’s correlation&lt;/td&gt;
&lt;td&gt;Mann-Whitney U, Kruskal-Wallis, Spearman’s correlation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;When to Use&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Data are normally distributed and continuous&lt;/td&gt;
&lt;td&gt;Data are not normal, small sample size, or ordinal&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
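&lt;p&gt;The "When to Use" row can be turned into a simple, purely illustrative heuristic: check normality first, then pick the test. This is a sketch, not a substitute for examining the data properly:&lt;/p&gt;

```python
from scipy import stats

def compare_groups(a, b, alpha=0.05):
    """Illustrative heuristic: use a t-test only if both samples pass a
    Shapiro-Wilk normality check; otherwise fall back to Mann-Whitney U."""
    normal = all(stats.shapiro(g).pvalue > alpha for g in (a, b))
    if normal:
        return "t-test", stats.ttest_ind(a, b).pvalue
    return "Mann-Whitney U", stats.mannwhitneyu(a, b, alternative="two-sided").pvalue

# Hypothetical continuous measurements from two groups
a = [12.1, 11.8, 12.5, 12.0, 11.9, 12.2, 12.4, 11.7]
b = [13.0, 12.8, 13.3, 12.9, 13.1, 12.7, 13.2, 12.6]
print(compare_groups(a, b))
```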




&lt;h3&gt;
  
  
  &lt;strong&gt;Why Are They Important?&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Choosing the Right Test Prevents Errors&lt;/strong&gt;&lt;br&gt;
Using the wrong type of test can lead to misleading conclusions. For example, using a t-test on non-normal data could make the results unreliable. Knowing which test fits your data helps ensure accuracy.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;They Reflect the Nature of the Data&lt;/strong&gt;&lt;br&gt;
Parametric and non-parametric tests acknowledge that data vary in quality, type, and distribution. By choosing the right test, researchers respect the data’s characteristics and avoid forcing it into the wrong model.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;They Complement Each Other&lt;/strong&gt;&lt;br&gt;
These two types of tests aren’t rivals—they work together. Parametric tests are ideal when data meet assumptions, while non-parametric tests are lifesavers when those assumptions are violated.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;They Support Evidence-Based Decisions&lt;/strong&gt;&lt;br&gt;
Whether in healthcare, business, or social sciences, selecting the right statistical test ensures that decisions are grounded in reliable evidence rather than chance.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Both parametric and non-parametric tests are essential tools in statistical analysis. Parametric tests are more powerful when data follow expected patterns, while non-parametric tests offer flexibility for real-world data that don’t fit neatly into assumptions. In practice, skilled researchers understand when to apply each test, ensuring their findings are accurate, fair, and meaningful. Ultimately, the choice between the two depends on one simple principle — always let the data guide the method.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;References&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Gravetter, F. J., &amp;amp; Wallnau, L. B. (2021). &lt;em&gt;Statistics for the behavioral sciences&lt;/em&gt; (11th ed.). Cengage Learning.&lt;/p&gt;

&lt;p&gt;Lane, D. M. (2020). &lt;em&gt;Introduction to statistics online edition&lt;/em&gt;. Rice University. &lt;a href="https://onlinestatbook.com" rel="noopener noreferrer"&gt;https://onlinestatbook.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Urdan, T. C. (2017). &lt;em&gt;Statistics in plain English&lt;/em&gt; (4th ed.). Routledge.&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>datascience</category>
      <category>science</category>
    </item>
    <item>
      <title>Understanding Degrees of Freedom: Why They Matter in Statistics</title>
      <dc:creator>Dilan Bosire</dc:creator>
      <pubDate>Sun, 12 Oct 2025 09:36:56 +0000</pubDate>
      <link>https://dev.to/dilan_bosire/understanding-degrees-of-freedom-why-they-matter-in-statistics-438o</link>
      <guid>https://dev.to/dilan_bosire/understanding-degrees-of-freedom-why-they-matter-in-statistics-438o</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;If you’ve ever taken a statistics class, you’ve probably come across the term degrees of freedom. At first, it can sound confusing or overly technical—but it’s actually a simple and important idea. Degrees of freedom play a key role in how we calculate statistics, interpret results, and make decisions based on data. Understanding what they are and why they matter can help make sense of various statistical tests, including t-tests, ANOVA, and the chi-square test.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Are Degrees of Freedom?
&lt;/h2&gt;

&lt;p&gt;In simple terms, degrees of freedom (df) tell us how many numbers in a statistical calculation are free to vary. They represent the number of independent pieces of information we have when estimating something from a sample.&lt;/p&gt;

&lt;p&gt;Here’s an easy way to think about it:&lt;br&gt;
Imagine you have a sample of five numbers. You can pick any four of them freely, but once you know the average (mean) of the group, the fifth number can’t just be anything—it has to fit the mean. That means you only have four values that can truly vary, giving you 4 degrees of freedom (calculated as n - 1).&lt;/p&gt;

&lt;p&gt;In general, every time we estimate a parameter (like the mean) from data, we lose one degree of freedom. This adjustment helps keep our calculations accurate.&lt;/p&gt;
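&lt;p&gt;The five-number example above can be checked directly:&lt;/p&gt;

```python
# With the mean fixed, only n - 1 of the n sample values are free to vary
values_free = [4, 8, 6, 5]   # choose any four numbers freely
mean = 6                     # the sample mean is already known
n = 5

# The fifth value is forced: the total must equal n * mean
fifth = n * mean - sum(values_free)
print(fifth)  # 30 - 23 = 7
```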

&lt;h3&gt;
  
  
  &lt;strong&gt;Degrees of Freedom in Common Statistical Tests&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;t-Test&lt;/strong&gt;&lt;br&gt;
When we compare means using a t-test, the degrees of freedom depend on how many observations we have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For a &lt;strong&gt;one-sample t-test&lt;/strong&gt;: df = n - 1&lt;/li&gt;
&lt;li&gt;For a &lt;strong&gt;two-sample t-test&lt;/strong&gt; (assuming equal variances): df = n_1 + n_2 - 2&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The degrees of freedom tell us which &lt;em&gt;t&lt;/em&gt;-distribution to use when finding critical values or &lt;em&gt;p&lt;/em&gt;-values.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Chi-Square Test&lt;/strong&gt;&lt;br&gt;
For a &lt;strong&gt;goodness-of-fit&lt;/strong&gt; test, degrees of freedom are based on the number of categories: df = k - 1. For a &lt;strong&gt;test of independence&lt;/strong&gt; using a contingency table, df = (r - 1)(c - 1), where &lt;em&gt;r&lt;/em&gt; is the number of rows and &lt;em&gt;c&lt;/em&gt; is the number of columns.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;ANOVA (Analysis of Variance)&lt;/strong&gt;&lt;br&gt;
ANOVA divides degrees of freedom into two parts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Between groups: &lt;em&gt;k - 1&lt;/em&gt; (where &lt;em&gt;k&lt;/em&gt; is the number of groups)&lt;/li&gt;
&lt;li&gt;Within groups: &lt;em&gt;N - k&lt;/em&gt; (where &lt;em&gt;N&lt;/em&gt; is the total number of observations)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These help determine whether the differences between group means are statistically significant.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
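&lt;p&gt;To see how degrees of freedom select the distribution in practice, here is a short SciPy sketch of two-tailed critical t-values:&lt;/p&gt;

```python
from scipy import stats

# Two-tailed critical t-values at alpha = 0.05 depend on degrees of freedom
for df in (4, 9, 29, 999):
    t_crit = stats.t.ppf(0.975, df)
    print(df, round(t_crit, 3))

# As df grows, the critical value shrinks toward the normal value of 1.96
```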




&lt;h3&gt;
  
  
  &lt;strong&gt;Why Are Degrees of Freedom Important?&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;They Adjust for Sample Size&lt;/strong&gt;&lt;br&gt;
Degrees of freedom make sure our statistical results reflect that we’re working with a sample, not an entire population. This adjustment helps keep our estimates more accurate.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;They Shape Statistical Distributions&lt;/strong&gt;&lt;br&gt;
Many statistical distributions—like the &lt;em&gt;t&lt;/em&gt;-distribution and chi-square distribution—change shape depending on their degrees of freedom. For example, when df is small, the &lt;em&gt;t&lt;/em&gt;-distribution is wider and has heavier tails, but as df increases, it starts to look like the normal distribution.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;They Ensure Accurate Results&lt;/strong&gt;&lt;br&gt;
Using the correct degrees of freedom means that our &lt;em&gt;p&lt;/em&gt;-values and confidence intervals are reliable. Getting them wrong could make us think results are significant when they’re not.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;They Account for Constraints in the Data&lt;/strong&gt;&lt;br&gt;
Every time we calculate something like a mean or variance, we add a restriction on how the data can vary. Degrees of freedom adjust for these restrictions, keeping our calculations honest and precise.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Degrees of freedom may sound like a small technical detail, but they’re essential to accurate and meaningful statistical analysis. They help us understand how much information our data really gives us and ensure that the tests we use are fair and reliable. In short, degrees of freedom are what make our statistics trustworthy—they balance the freedom of data with the limits of estimation.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;References&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Gravetter, F. J., &amp;amp; Wallnau, L. B. (2021). &lt;em&gt;Statistics for the behavioral sciences&lt;/em&gt; (11th ed.). Cengage Learning.&lt;/p&gt;

&lt;p&gt;Lane, D. M. (2020). &lt;em&gt;Introduction to statistics online edition&lt;/em&gt;. Rice University. &lt;a href="https://onlinestatbook.com" rel="noopener noreferrer"&gt;https://onlinestatbook.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Urdan, T. C. (2017). &lt;em&gt;Statistics in plain English&lt;/em&gt; (4th ed.). Routledge.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>beginners</category>
      <category>science</category>
      <category>learning</category>
    </item>
    <item>
      <title>Similarities Between a Stored Procedure in SQL and a Function in Python</title>
      <dc:creator>Dilan Bosire</dc:creator>
      <pubDate>Wed, 10 Sep 2025 09:03:48 +0000</pubDate>
      <link>https://dev.to/dilan_bosire/similarities-between-a-stored-procedure-in-sql-and-a-function-in-python-11ko</link>
      <guid>https://dev.to/dilan_bosire/similarities-between-a-stored-procedure-in-sql-and-a-function-in-python-11ko</guid>
      <description>&lt;p&gt;Programming and database development share overlapping concepts, even though they operate in distinct environments. A SQL stored procedure and a Python function may appear unrelated, one is in a database, the other in an application, but both are reusable blocks of logic that perform tasks efficiently. Let’s look at their similarities.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Encapsulation of Logic
&lt;/h2&gt;

&lt;p&gt;Stored procedures and Python functions allow developers to define a set of instructions once and reuse them as needed, eliminating the need to rewrite code multiple times.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SQL Example:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE PROCEDURE get_all_customers
AS
BEGIN
    SELECT * FROM Customers;
END;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Python Example:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def get_all_customers(customers):
    return customers

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  2. Reusability
&lt;/h2&gt;

&lt;p&gt;Reusability is a key advantage in programming. Once created, SQL stored procedures and Python functions can be executed multiple times without needing to be redefined. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In SQL, use &lt;code&gt;EXEC ProcedureName&lt;/code&gt; to run a stored procedure. &lt;/li&gt;
&lt;li&gt;In Python, call a function with &lt;code&gt;function_name()&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This saves time, reduces redundancy, and minimizes errors.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Parameters and Arguments
&lt;/h2&gt;

&lt;p&gt;Stored procedures and Python functions can accept input parameters to modify their behavior.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SQL Example with Parameter:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We are creating a stored procedure that takes a department name as a parameter and returns the employees in that department.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE PROCEDURE GetEmployeesByDepartment @DeptName NVARCHAR(50)
AS
BEGIN
    SELECT employee_id, name, department
    FROM Employees
    WHERE department = @DeptName;
END;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;How to run it (passing an argument):&lt;br&gt;
&lt;code&gt;EXEC GetEmployeesByDepartment 'HR';&lt;br&gt;
&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Python Example with Argument:&lt;br&gt;
In Python, we create a function that takes a department name as input and returns the employees from that department.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def get_employees_by_department(employees, dept_name):
    return [emp for emp in employees if emp["department"] == dept_name]

# Example data
employees = [
    {"id": 1, "name": "Alice", "department": "HR"},
    {"id": 2, "name": "Bob", "department": "IT"},
    {"id": 3, "name": "Charlie", "department": "HR"}
]

# Call function (passing argument)
print(get_employees_by_department(employees, "HR"))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[{'id': 1, 'name': 'Alice', 'department': 'HR'},
 {'id': 3, 'name': 'Charlie', 'department': 'HR'}]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  4. Return Results
&lt;/h2&gt;

&lt;p&gt;Both can return results.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stored procedures typically return result sets (such as query rows) or output parameters. &lt;/li&gt;
&lt;li&gt;Python functions return values with the return keyword.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SQL: Returning employee rows.&lt;/li&gt;
&lt;li&gt;Python: Returning a filtered list of employees.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  5. Improved Maintainability
&lt;/h2&gt;

&lt;p&gt;Both approaches enhance code maintainability. When logic changes, you only need to update the procedure or function once, rather than modifying multiple queries or code snippets throughout the system.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Support for Control Flow
&lt;/h2&gt;

&lt;p&gt;Python’s control flow is more expressive overall, but stored procedures also support conditional logic (IF, CASE) and loops (WHILE), while Python functions use the language’s standard control-flow structures (if, for, while).&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;SQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;IF EXISTS (SELECT * FROM Employees WHERE Department = 'HR')
    PRINT 'HR Department found';

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if any(emp["department"] == "HR" for emp in employees):
    print("HR Department found")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both check if at least one HR employee exists and then display a message.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;A stored procedure in SQL and a function in Python share several similarities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Both encapsulate logic into reusable blocks.&lt;/li&gt;
&lt;li&gt;Both accept parameters and return results.&lt;/li&gt;
&lt;li&gt;Both support control flow and enhance code maintainability.&lt;/li&gt;
&lt;li&gt;Both improve efficiency and consistency in programming.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key difference is that stored procedures run in a database engine, while Python functions run in an application. However, they both aim to organize logic effectively and efficiently for reuse.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Understanding the Difference Between Subquery, CTE, and Stored Procedure</title>
      <dc:creator>Dilan Bosire</dc:creator>
      <pubDate>Wed, 10 Sep 2025 08:21:15 +0000</pubDate>
      <link>https://dev.to/dilan_bosire/understanding-the-difference-between-subquery-cte-and-stored-procedure-3mg5</link>
      <guid>https://dev.to/dilan_bosire/understanding-the-difference-between-subquery-cte-and-stored-procedure-3mg5</guid>
      <description>&lt;p&gt;In SQL and database programming, developers have several tools for organizing and optimizing queries. Among these are subqueries, Common Table Expressions (CTEs), and stored procedures. While they can sometimes be used to achieve similar goals, each serves a different purpose and has unique strengths. Let’s break down the differences.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Subquery
&lt;/h2&gt;

&lt;p&gt;A subquery is a query nested inside another query. It is often used to filter, aggregate, or transform data before the main query executes. Subqueries can appear in SELECT, FROM, WHERE, or HAVING clauses.&lt;/p&gt;

&lt;h2&gt;
  
  
  Example:
&lt;/h2&gt;

&lt;p&gt;Suppose you have a table called employees with name and salary columns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT name, salary
FROM employees
WHERE salary &amp;gt; (SELECT AVG(salary) FROM employees);

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, the subquery calculates the average salary, and the outer query filters employees based on that value.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Characteristics:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Runs each time it is called in the outer query.&lt;/li&gt;
&lt;li&gt;Can return a single value, multiple rows, or a table.&lt;/li&gt;
&lt;li&gt;Great for quick inline logic, but overuse can harm performance.&lt;/li&gt;
&lt;/ul&gt;
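&lt;p&gt;The same subquery can be tried end to end with SQLite's in-memory database (the employee rows are illustrative):&lt;/p&gt;

```python
import sqlite3

# In-memory demo of the subquery above, with illustrative employee rows
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employees (name TEXT, salary REAL)")
con.executemany("INSERT INTO employees VALUES (?, ?)",
                [("Ann", 90000), ("Ben", 50000), ("Cara", 70000)])

# The inner query computes the average salary; the outer query filters on it
rows = con.execute(
    "SELECT name, salary FROM employees "
    "WHERE salary > (SELECT AVG(salary) FROM employees)"
).fetchall()
print(rows)  # the average is 70000, so only Ann qualifies
```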

&lt;h2&gt;
  
  
  2. Common Table Expression (CTE)
&lt;/h2&gt;

&lt;p&gt;A Common Table Expression (CTE) is a temporary result set created with the WITH clause. It can be referenced within a query, enhances readability, and can be reused, helping to simplify complex queries.&lt;/p&gt;

&lt;h2&gt;
  
  
  Example:
&lt;/h2&gt;

&lt;p&gt;Suppose you have tables called orders and customers with a common column customer_id, and you want to rank customers by their total order quantities:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;with customer_rankings as (
    select orders.customer_id,sum(orders.quantity)as total_quantity,
        row_number() over (order by sum(orders.quantity)desc) as rankings
    from orders
    group by orders.customer_id)
select customers.first_name,customers.second_name,customer_rankings.total_quantity ,customer_rankings.rankings
from customers join customer_rankings on customers.customer_id= customer_rankings.customer_id
order by customer_rankings.total_quantity desc;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Key Characteristics:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Improves SQL readability and modularity.&lt;/li&gt;
&lt;li&gt;Supports recursion for hierarchical data, such as organizational charts.&lt;/li&gt;
&lt;li&gt;Exists only during query execution.&lt;/li&gt;
&lt;li&gt;Cannot be reused across sessions without rewriting.&lt;/li&gt;
&lt;/ul&gt;
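
&lt;p&gt;The recursion mentioned above can be sketched as follows, assuming a hypothetical employees table with employee_id and manager_id columns (some dialects, such as SQL Server, omit the RECURSIVE keyword). The anchor member selects the top of the hierarchy, and the recursive member walks down one level at a time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WITH RECURSIVE reports AS (
    SELECT employee_id, manager_id, 1 AS level
    FROM employees
    WHERE manager_id IS NULL
    UNION ALL
    SELECT e.employee_id, e.manager_id, r.level + 1
    FROM employees e
    JOIN reports r ON e.manager_id = r.employee_id
)
SELECT employee_id, level
FROM reports;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;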

&lt;h2&gt;
  
  
  3. Stored Procedure
&lt;/h2&gt;

&lt;p&gt;A stored procedure is a set of SQL statements, possibly with procedural logic, saved in the database for repeated use. Unlike subqueries or CTEs, it operates independently and can be called multiple times.&lt;/p&gt;

&lt;h2&gt;
  
  
  Example:
&lt;/h2&gt;

&lt;p&gt;If you have a table called employees with employee_id, name, and department columns, and you want a stored procedure that shows all employees in the HR department:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE PROCEDURE GetHREmployees
AS
BEGIN
    SELECT employee_id, name, department
    FROM employees
    WHERE department = 'HR';
END;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Key Characteristics:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Stored and run on the database server.&lt;/li&gt;
&lt;li&gt;Supports parameters, control-of-flow logic, error handling, and transactions.&lt;/li&gt;
&lt;li&gt;Improves performance by minimizing network traffic—only the procedure call is sent.&lt;/li&gt;
&lt;li&gt;Enhances security by controlling access with permissions.&lt;/li&gt;
&lt;/ul&gt;
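
&lt;p&gt;Parameters are what make a procedure reusable across inputs. As a sketch in SQL Server syntax (matching the example above), a hypothetical GetEmployeesByDepartment generalizes the HR-only procedure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE PROCEDURE GetEmployeesByDepartment @dept VARCHAR(50)
AS
BEGIN
    SELECT employee_id, name, department
    FROM employees
    WHERE department = @dept;
END;

-- Call it with a parameter value:
EXEC GetEmployeesByDepartment @dept = 'HR';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;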

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Use subqueries for inline filtering or calculations in another query.&lt;/li&gt;
&lt;li&gt;Use Common Table Expressions (CTEs) for better readability, recursive queries, and simplifying complex logic.&lt;/li&gt;
&lt;li&gt;Use stored procedures for reusable, parameterized, and secure database logic.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These tools give SQL developers the flexibility to manage tasks from simple lookups to complex data processing.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Excel’s Strengths and Weaknesses in Predictive Analysis and Its Role in Data-Driven Business Decisions</title>
      <dc:creator>Dilan Bosire</dc:creator>
      <pubDate>Sun, 10 Aug 2025 10:54:36 +0000</pubDate>
      <link>https://dev.to/dilan_bosire/excels-strengths-and-weaknesses-in-predictive-analysis-and-its-role-in-data-driven-business-2189</link>
      <guid>https://dev.to/dilan_bosire/excels-strengths-and-weaknesses-in-predictive-analysis-and-its-role-in-data-driven-business-2189</guid>
      <description>&lt;p&gt;When you think of business tools, Microsoft Excel is probably one of the first that comes to mind. For decades, it has been the go-to tool for organizing numbers, crunching data, and making sense of information. But as businesses become more data-driven and predictive analysis grows in importance, it’s worth asking: How well does Excel hold up? Let’s take a closer look at where Excel shines, where it falls short, and how it fits into today’s decision-making landscape.&lt;/p&gt;

&lt;h2&gt;
  
  
  Excel’s Strengths in Predictive Analysis
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Widely Accessible &amp;amp; Familiar: Most professionals know Excel, making it easy to adopt for predictive tasks.&lt;/li&gt;
&lt;li&gt;Built-in Statistical Tools: Offers basic predictive modeling (e.g., regression, forecasting) without coding.&lt;/li&gt;
&lt;li&gt;What-If &amp;amp; Scenario Analysis: Tools like Goal Seek and Scenario Manager help explore different outcomes.&lt;/li&gt;
&lt;li&gt;Data Visualization: Strong charting and dashboard features for communicating trends and forecasts.&lt;/li&gt;
&lt;li&gt;Enhanced Data Modeling: Power Query and Power Pivot allow for more complex data handling.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Excel’s Weaknesses in Predictive Analysis
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Limited Advanced Analytics: Lacks support for advanced machine learning techniques.&lt;/li&gt;
&lt;li&gt;Scalability Issues: Struggles with huge datasets; performance drops or crashes.&lt;/li&gt;
&lt;li&gt;Error-Prone: Manual data entry and formulas can lead to mistakes.&lt;/li&gt;
&lt;li&gt;Collaboration Challenges: Weak version control and simultaneous editing issues.&lt;/li&gt;
&lt;li&gt;Limited Automation: Less suited for automated, repeatable workflows compared to coding languages.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Excel’s Role in Data-Driven Business Decisions
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Quick Prototyping: Great for rapidly testing ideas and building initial models.&lt;/li&gt;
&lt;li&gt;Cost-Effective for SMBs: Meets basic analysis needs for small and medium businesses.&lt;/li&gt;
&lt;li&gt;Bridges Business &amp;amp; Data Teams: Makes data accessible and understandable for both technical and non-technical users.&lt;/li&gt;
&lt;li&gt;Complements Other Tools: Often used alongside advanced analytics platforms for visualization and reporting.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Excel is a valuable and accessible tool for early-stage analysis, prototyping, and reporting; however, it has limitations when it comes to handling complex models and large datasets. It works best as a complementary tool rather than a replacement for advanced analytics platforms.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How to Install and Set Up PostgreSQL on a Linux Server</title>
      <dc:creator>Dilan Bosire</dc:creator>
      <pubDate>Fri, 01 Aug 2025 11:48:40 +0000</pubDate>
      <link>https://dev.to/dilan_bosire/how-to-install-and-set-up-postgresql-on-a-linux-server-17m8</link>
      <guid>https://dev.to/dilan_bosire/how-to-install-and-set-up-postgresql-on-a-linux-server-17m8</guid>
      <description>&lt;p&gt;In this article, I will show you how to install and set up PostgreSQL on a Linux server.&lt;/p&gt;

&lt;p&gt;Make sure you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A Linux server&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Internet access to install packages&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 1: Log into your Ubuntu server
&lt;/h2&gt;

&lt;p&gt;you can use the command,&lt;br&gt;
&lt;code&gt;ssh username@server_ip_&lt;/code&gt;&lt;br&gt;
Then you'll be prompted to enter your password.&lt;br&gt;
If the password entered is correct then you are in the server.&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 2: Install PostgreSQL
&lt;/h2&gt;

&lt;p&gt;Search &lt;a href="https://www.postgresql.org/download/" rel="noopener noreferrer"&gt;https://www.postgresql.org/download/&lt;/a&gt; to see the download versions of available.&lt;br&gt;
Select the one for linux and ubuntu.&lt;br&gt;
You'll get the image below.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgxjnp6njibcmso53qio8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgxjnp6njibcmso53qio8.png" alt=" " width="800" height="224"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Copy and paste the commands below into your Ubuntu server to add the PostgreSQL apt repository and install PostgreSQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt install curl ca-certificates
sudo install -d /usr/share/postgresql-common/pgdg
sudo curl -o /usr/share/postgresql-common/pgdg/apt.postgresql.org.asc --fail https://www.postgresql.org/media/keys/ACCC4CF8.asc
. /etc/os-release
sudo sh -c "echo 'deb [signed-by=/usr/share/postgresql-common/pgdg/apt.postgresql.org.asc] https://apt.postgresql.org/pub/repos/apt $VERSION_CODENAME-pgdg main' &amp;gt; /etc/apt/sources.list.d/pgdg.list"
sudo apt update
sudo apt -y install postgresql-16
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 3: Connect to PostgreSQL
&lt;/h2&gt;

&lt;p&gt;PostgreSQL creates a system user named postgres. You can switch to this user to open the PostgreSQL prompt:&lt;br&gt;
&lt;code&gt;sudo -i -u postgres&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Then type &lt;code&gt;psql&lt;/code&gt;.&lt;br&gt;
You should now be at the PostgreSQL prompt:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;postgres=#&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;To exit the prompt, type:&lt;br&gt;
&lt;code&gt;\q&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Then return to your regular user:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;exit&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Create a New Database and User
&lt;/h2&gt;

&lt;p&gt;If you want to create a new PostgreSQL user and database:&lt;/p&gt;

&lt;p&gt;Switch to the postgres user:&lt;br&gt;
&lt;code&gt;sudo -i -u postgres&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Create a new user:&lt;br&gt;
&lt;code&gt;createuser --interactive&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;It will prompt you for a username and whether the new user should be a superuser.&lt;/p&gt;

&lt;p&gt;Create a new database:&lt;br&gt;
&lt;code&gt;createdb database_name&lt;/code&gt;&lt;/p&gt;
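
&lt;p&gt;You can also do both steps from the psql prompt instead of the shell. As a sketch, assuming hypothetical names app_user and app_db:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE ROLE app_user WITH LOGIN PASSWORD 'change_me';
CREATE DATABASE app_db OWNER app_user;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;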

&lt;p&gt;Now you can connect to the database and add tables and data. From the psql prompt, connect with:&lt;br&gt;
&lt;code&gt;\c database_name&lt;/code&gt;&lt;br&gt;
From there you can create your tables and insert values.&lt;/p&gt;

&lt;p&gt;Check PostgreSQL's status:&lt;br&gt;
&lt;code&gt;sudo systemctl status postgresql&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;To stop PostgreSQL:&lt;br&gt;
&lt;code&gt;sudo systemctl stop postgresql&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;To start PostgreSQL:&lt;br&gt;
&lt;code&gt;sudo systemctl start postgresql&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;To restart PostgreSQL:&lt;br&gt;
&lt;code&gt;sudo systemctl restart postgresql&lt;/code&gt;&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
