DEV Community: Keffas Mutethia Nyamu

Supervised Learning and the Role of Classification in Machine Learning

Keffas Mutethia Nyamu — Mon, 25 Aug 2025 18:09:28 +0000

Supervised learning is one of the most powerful approaches in machine learning. At its core, it is about learning from examples. The model is trained on data where both the inputs and the correct outputs (labels) are provided. By studying these patterns, the model learns how to map inputs to outputs. Once trained, it can make predictions on unseen data—a process that powers many of the intelligent systems we use every day.

How Classification Works

Within supervised learning, classification focuses on predicting discrete categories. Unlike regression, which forecasts continuous values, classification assigns inputs to predefined groups.

A familiar example is spam detection: every email is classified as either “spam” or “not spam.” The algorithm learns from past emails—those labeled as spam or safe—and applies those lessons to new ones. Essentially, classification is about defining boundaries in data space, separating one class from another.

Models Commonly Used in Classification

Over time, many models have been developed for classification, each with unique strengths:

Logistic Regression – A straightforward yet effective method, especially for problems with linear relationships.
Decision Trees – Easy to interpret and explain, as they split data into simple decision rules.
Random Forests & Gradient Boosting (XGBoost, LightGBM, CatBoost) – Ensemble methods that combine multiple models to achieve strong performance.
Naïve Bayes – Particularly useful in text classification tasks such as spam filtering.
k-Nearest Neighbors (k-NN) – A simple technique that classifies based on the closest data points.
Support Vector Machines (SVMs) – Effective when classes are clearly separated in high-dimensional space.
Neural Networks – Capable of handling complex and unstructured data, such as images and audio.

The choice of model often depends on the problem, the dataset, and the balance between interpretability and accuracy.

Personal Insights

What stands out most about classification is its wide range of applications. From detecting fraud to predicting customer churn, it enables businesses and organizations to turn raw data into actionable insights.

However, I’ve also learned that success in classification rarely comes from the algorithm alone. Data preparation—handling missing values, feature selection, and balancing class distributions—often has a greater impact than the choice of model itself. Good data leads to good results.

Another key takeaway is that accuracy isn’t everything. In real-world applications, interpretability, fairness, and reliability matter just as much as performance metrics. For instance, in healthcare, a highly accurate but opaque model may not be as valuable as a slightly less accurate model that clinicians can understand and trust.

Challenges in Classification

Working with classification problems presents a set of challenges that often shape the outcome:

Imbalanced Classes – Many datasets are skewed, with one class dominating. Models can easily become biased toward the majority class unless techniques like resampling or adjusted class weights are applied.
Overfitting – Complex models can perform well on training data but fail to generalize. Careful validation and regularization are key.
Data Quality – Noisy, incomplete, or irrelevant features can reduce model performance significantly. Preprocessing and feature engineering are essential steps.
Evolving Data – Patterns change over time. Fraudsters adapt, customers shift behavior, and models need to be retrained to stay effective.

Conclusion

Classification is one of the most practical and impactful tasks in supervised learning. It transforms raw data into meaningful categories, enabling informed decisions across industries. While the choice of algorithm is important, the real success often lies in understanding the data, addressing challenges like imbalance and overfitting, and ensuring the solutions are trustworthy and explainable.

At its best, classification is not just about predicting labels—it is about solving real problems and unlocking value from data.

Premier League 2024–2025: Teams Most Likely to Win — Based on Win Probability

Keffas Mutethia Nyamu — Thu, 31 Jul 2025 16:23:58 +0000

As the curtain falls on the 2024–2025 Premier League season, an analysis of team performance based on win probability provides insight into the clubs most likely to dominate in future campaigns. This article uses empirical win rates, calculated from the number of matches won out of 38, to determine each team’s likelihood of winning.

How Win Probability Was Calculated

Win probability was derived using the formula:

Win Probability = Wins ÷ Matches Played

Each Premier League team played 38 matches this season. The number of victories per team was divided by 38 to compute a win rate (expressed as a decimal), representing their likelihood of winning a match.

Teams with the Highest Win Probabilities

1. Liverpool FC – 0.658

Liverpool emerged as the season’s most dominant side, winning 25 matches. Their high win probability of 65.8% highlights consistent performance, likely driven by tactical depth, squad fitness, and solid management.

2. Manchester City FC – 0.553

Manchester City followed with a 55.3% win rate (21 wins). While slightly below their usual standard, City remained a formidable side and continue to pose a serious title threat.

3. Arsenal FC, Chelsea FC & Newcastle United FC – 0.526

All three clubs secured 20 wins, translating to a 52.6% win probability.

Arsenal’s youthful energy
Chelsea’s tactical rebuilding
Newcastle’s resurgence under Eddie Howe

Each contributed to strong campaigns.

4. Nottingham Forest FC & Aston Villa FC – 0.500

With 19 wins each, these two clubs hit a 50% win rate, suggesting they are now among the league’s upper mid-table performers.

Aston Villa, in particular, continued their rise with disciplined, attacking football.

Mid-Table Contenders

Clubs such as Brighton, Brentford, Fulham, and Bournemouth had win probabilities ranging from 0.395 to 0.421, showing competitiveness but room for growth.

Underperformers

At the bottom of the win probability table:

Manchester United, West Ham, Everton, Tottenham — all posted a 0.289 win rate (11 wins), a surprising dip for clubs with top-six ambitions.
Leicester City (0.158)
Ipswich Town (0.105)
Southampton (0.053)

These clubs struggled heavily, signaling potential relegation threats or serious structural issues.

Final Thoughts

While win probability does not account for draws or goal difference, it offers a clear view of consistent winners across the season.

Based on current form:

Liverpool, Manchester City, and Arsenal remain the most likely to win in future fixtures — barring transfers or managerial changes.
Clubs like Nottingham Forest and Aston Villa are also worth watching, as their 50% win rate suggests growing strength.
On the flip side, traditional powerhouses such as Manchester United and Tottenham may need major overhauls to return to form.

Measures of Central Tendency and Their Importance in Data Science

Keffas Mutethia Nyamu — Tue, 22 Jul 2025 10:46:53 +0000

In any data analysis or statistical endeavour, understanding the behaviour of a dataset is fundamental. One of the primary ways to summarise and interpret data is through measures of central tendency, which identify the central or typical value around which data points cluster. The three main measures of central tendency are mean, median, and mode.

1. The Mean

The mean, commonly known as the average, is calculated by summing all values in a dataset and dividing by the number of values. For example, if data represents the ages of participants in a survey, the mean gives the general age around which most participants’ ages are spread.

Advantages: Uses all data points, making it highly representative when there are no extreme outliers.
Limitations: Sensitive to outliers, which can distort the mean away from the true typical value in skewed distributions.

2. The Median

The median is the middle value in an ordered dataset. If the dataset has an odd number of observations, it is the exact middle; if even, it is the average of the two central numbers.

Advantages: Robust to outliers and skewed data. For instance, in income data where a few very high incomes inflate the mean, the median provides a better measure of typical income.
Limitations: Does not use all data points directly, only their order.

3. The Mode

The mode is the value that occurs most frequently in a dataset. In categorical data, it is particularly useful as the mean or median cannot be computed.

Advantages: The only measure that can be used for nominal data and indicates the most common category or value.
Limitations: A dataset can be bimodal or multimodal (having more than one mode), and in some cases, no mode exists if all values are unique.

Why Are Measures of Central Tendency Important in Data Science?

a. Summarising Data

Large datasets can be overwhelming. Measures of central tendency simplify these datasets by providing a single value that summarises the general tendency, making it easier to interpret results and communicate findings to stakeholders.

b. Facilitating Comparison

Central tendency measures enable comparison between different groups or datasets. For example, comparing the mean sales of two products quickly informs decision-makers about relative performance.

c. Supporting Model Building

Many machine learning algorithms, such as K-Means clustering or Gaussian Naïve Bayes, rely on assumptions involving central tendencies. For instance:

K-Means seeks to minimise distances from cluster means.
Gaussian distributions use means and variances to define probability densities.

d. Identifying Skewness and Outliers

Analysing differences between the mean and median provides insights into data distribution and potential outliers, guiding preprocessing decisions like transformation or outlier treatment before model training.

e. Informing Business Decisions

Data science projects often aim at actionable business insights. For example:

Identifying the average time customers spend on a website informs user experience optimisation.
Knowing the modal product size bought assists inventory decisions.

Conclusion

Measures of central tendency are fundamental tools in data science for summarising data, supporting modelling, and driving informed decisions. While they provide critical insights into data behaviour, it is essential to use them alongside measures of dispersion and data visualisation to gain a comprehensive understanding of datasets before implementing analytical or predictive solutions.

# Understanding Relationships in Power BI

Keffas Mutethia Nyamu — Mon, 14 Jul 2025 12:58:34 +0000

Relationships are a core concept in Power BI, allowing users to connect multiple tables to build meaningful and accurate reports. Without relationships, data analysis would require manually merging tables, leading to inefficiency, duplication, and potential errors.

What Are Relationships?

In Power BI, a relationship defines how two tables are connected through common fields. For example, a Sales table containing ProductID can connect to a Products table with product details. This connection enables you to analyse sales by product name, category, or other attributes without combining tables into a single flat file.

Types of Relationships

One common type is One-to-Many (1:*), where one record in a table relates to multiple records in another. For instance, each product in the Products table can appear many times in the Sales table.

Many-to-One (*:1) is essentially the reverse view, where multiple records in one table relate back to one record in another.

Another type is Many-to-Many (:), used when both tables can have multiple matching rows. This relationship type helps in complex models but should be used cautiously, as it can create ambiguity in calculations.

Relationships also have filter directions. A single directional filter means data filters from one table to another, which is ideal for most scenarios. Bidirectional filters allow filters to flow both ways between tables. They are useful in specific analysis, such as role-playing dimensions, but can impact performance and produce unexpected results if used incorrectly.

Why Are Relationships Important?

Relationships allow seamless combination of data from multiple tables. They support advanced DAX calculations that involve related tables and ensure data integrity in visuals and aggregations.

Using relationships also reduces duplication by enabling you to build a star schema, where dimension tables (e.g. Products, Customers) connect to fact tables (e.g. Sales, Transactions). This improves performance and simplifies data models.

Creating Relationships

Power BI can automatically detect relationships based on similar column names and data types. However, it is good practice to review these auto-created relationships for accuracy.

To create relationships manually:

Go to Model view.
Drag a field from one table to its matching field in another table.
Set the cardinality (1:, *:1, or *:).
Choose the appropriate cross-filter direction based on your analysis needs.

Best Practices

Always check cardinality and filter direction when creating relationships.
Avoid using Many-to-Many relationships unless necessary.
Prefer single directional filters for clarity and performance.
Design your data model as a star schema wherever possible.
Ensure fields used for relationships are clean, consistent, and free of duplicates on the ‘one’ side.

Common Pitfalls

Incorrect cardinality can lead to inaccurate totals in visuals. Unintended bidirectional filters may create ambiguity, while missing relationships can cause blank visuals or errors in DAX measures.

Conclusion

Mastering relationships in Power BI is fundamental to building robust and accurate data models. By understanding types of relationships, how to create them, and applying best practices, you will enhance your data analysis capabilities and create impactful reports that drive strategic decisions.

How Excel is Used in Real-World Data Analysis

Keffas Mutethia Nyamu — Wed, 11 Jun 2025 09:20:11 +0000

One of the most widely used tools in data analysis is Microsoft Excel. It does not matter whether you are a small business owner or work on a global enterprise, Excel offers you impressive features that enable users to organize, interpret, and act on data. The in-depth look at Excel this week demonstrated its real-world applicability regarding informed, data-driven decisions. Data Analysis Real-world applications of Excel are much more than a mere spreadsheet program.
Three essential applications of Excel in the real-world data analysis are as follows:

Business Decision-Making Organizations use Excel to monitor key performance metrics, make sales projections, and document business processes. The use of dashboards and charts created in Excel ensures that the leaders are able to understand the trends and make strategic decisions within a short period.
Financial Reporting The world of finance cannot run without Excel in budgeting, financial modeling, and creating reports. It enables analysts to cope with a huge amount of data, do calculations, and be accurate in tracking profits and losses.
Marketing Performance Analysis Excel helps marketing teams to track the campaign rates, customer activity, and return on investment (ROI). Equipped with organized data and interactive graphics, marketers have the chance to evaluate and optimize their tactics.

The Features and Formulas of Excel I have Learned so far

The topic of this week, concentrated on Excel, presented some important tools and formulas that complement data analysis in the real world:

• SUM () Function: It is a basic function that helps to add up row or column totals- it is best at adding up sales numbers, time costs, or poll results.
• IF () Statement: Allows logical decision-making on a dataset. As an example, it may be applied to mark customers as inactive or active depending on the purchase history.
• Pivot Tables: An efficient summary of a huge amount of data. They enable users to see totals, averages, or counts broken down by any variables of interest such as region, month, or product line.
• VLOOKUP () and HLOOKUP (): The two look-up functions make it easy to retrieve data in big tables. VLOOKUP and HLOOKUP are important functions that search vertically and horizontally respectively, necessary in finding the matching data between spreadsheets.
• INDEX () and MATCH (): This is a powerful combination that can extend the look-up. INDEX retrieves the value of a cell at a given position and MATCH determines that the position given some criteria is more flexible and powerful than VLOOKUP in most cases.
In conclusion, learning about Excel through and through has changed my perception of data. Learning Excel at a more advanced level has been a revelation as what appeared to be mere numbers. It stops being the instrument of number storage only- it becomes the medium of problem-solving, thinking, and telling stories with data. This capability to find the meaning in a set of data and to transform it into intelligent and practical implications has changed my way of thinking concerning the purpose of data in decision-making. Excel has also helped me realize that some of the most basic tools in the right hands can become a powerful force. I have shifted my attitude toward data as a mere piece of information but as a strategic asset that could be used to shape the result and encourage innovation.