<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: MakenaKinyua</title>
    <description>The latest articles on DEV Community by MakenaKinyua (@makenakinyua).</description>
    <link>https://dev.to/makenakinyua</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3254097%2F2008d211-26ab-46ff-bb3f-697703323023.jpg</url>
      <title>DEV Community: MakenaKinyua</title>
      <link>https://dev.to/makenakinyua</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/makenakinyua"/>
    <language>en</language>
    <item>
      <title>Visualizing the Africa Energy Project</title>
      <dc:creator>MakenaKinyua</dc:creator>
      <pubDate>Mon, 27 Oct 2025 14:20:45 +0000</pubDate>
      <link>https://dev.to/makenakinyua/visualizing-the-africa-energy-project-25ic</link>
      <guid>https://dev.to/makenakinyua/visualizing-the-africa-energy-project-25ic</guid>
      <description>&lt;p&gt;There are so many untold stories that lie hidden in data that can be useful to us in one way or another. I worked on the Africa Energy Project, which is a data extraction project that involves getting data on energy indicators for all African countries. &lt;/p&gt;

&lt;p&gt;The project goes beyond data extraction into visualization, which is an important aspect. Data visualization helps others understand the data; in this case, it is useful for policy making in such an important sector.&lt;/p&gt;

&lt;p&gt;I recommend you start &lt;a href="https://dev.to/makenakinyua/web-scrapping-project-3bnb"&gt;here&lt;/a&gt;, which covers the web scraping stage, before making your way to visualization. &lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;u&gt; The Process...
&lt;/u&gt;
&lt;/h3&gt;



&lt;p&gt;Before getting into visualization, several questions became points of interest that could be answered by the data. &lt;/p&gt;

&lt;p&gt;i. What does the performance across indicator topics look like?&lt;br&gt;
ii. How do the energy scores differ across regions?&lt;br&gt;
iii. Which are the best and worst performing countries?&lt;br&gt;
iv. What is the trend in scores across the years? &lt;br&gt;
v. What is the performance of individual countries? &lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;u&gt; The Visualizations
&lt;/u&gt;
&lt;/h3&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ooadyvh9irodyntpyiz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ooadyvh9irodyntpyiz.png" alt=" " width="800" height="448"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The report contains slicers, KPI cards, a pie chart, a bar chart and column charts. &lt;/p&gt;

&lt;p&gt;The slicers are meant to provide an interactive way for the user to filter data.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The year slicer is in the form of a slider, which enables the user to view trends between selected years. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The second is for the indicator topic; one can select multiple indicator topics and see how the visuals change with each selection.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The last slicer is for the units. This one in particular is important because the scores are measured in different units. Filtering out scores by unit gives a more informed picture of what is happening.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The pie chart represents the share by region. It reveals that North Africa is more developed in electricity access, supply and technical aspects compared to other parts of Africa.&lt;/p&gt;

&lt;p&gt;The bar chart represents the top and bottom (n) countries and is responsive to the slicers. This shines a light on high performing and low performing countries, which informs policy making.&lt;/p&gt;

&lt;p&gt;The column chart shows the trend in total and average scores over the years.&lt;/p&gt;

&lt;p&gt;The report has a second page which is used to filter further for individual countries and the indicator topics. It uses a table to show the sum and average score for all indicators under a specific indicator topic for a chosen country. &lt;/p&gt;

&lt;p&gt;The visualization process paints a clear picture of the information once hidden in the data. Policy makers can reach various recommendations as a result of such visualizations.&lt;/p&gt;

&lt;p&gt;You can view the full report &lt;a href="https://github.com/MakenaKinyua/Energy-Data-Project" rel="noopener noreferrer"&gt;here&lt;/a&gt; and interact with the visuals!&lt;/p&gt;

</description>
      <category>analytics</category>
      <category>datascience</category>
      <category>powerbi</category>
      <category>data</category>
    </item>
    <item>
      <title>Web Scrapping Project</title>
      <dc:creator>MakenaKinyua</dc:creator>
      <pubDate>Thu, 23 Oct 2025 15:08:54 +0000</pubDate>
      <link>https://dev.to/makenakinyua/web-scrapping-project-3bnb</link>
      <guid>https://dev.to/makenakinyua/web-scrapping-project-3bnb</guid>
      <description>&lt;p&gt;Getting data from modern websites is not the same as it used to be- today, most websites render their data dynamically making it hard for traditional web scrapping tools to obtain any data. &lt;/p&gt;

&lt;p&gt;In this project, the Africa Energy Project, we are going to use different tools to obtain data about energy indicators across 54 African countries for the years 2000 - 2022 from the Africa Energy Portal.&lt;/p&gt;

&lt;p&gt;The project features a web scraper that extracts JSON data from API network responses.&lt;/p&gt;

&lt;h3&gt;
  
  
  Project Overview
&lt;/h3&gt;

&lt;p&gt;The Africa Energy Portal is a dynamic webpage that contains information about energy indicators across 54 African countries. The indicators are energy access, supply and technical aspects related to energy.&lt;/p&gt;

&lt;p&gt;The indicators are further broken down into sub-sectors such as 'Population access to electricity-National', which shows the percentage of people with access to electricity at the national level.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Process
&lt;/h3&gt;

&lt;p&gt;The project uses the following technologies:&lt;/p&gt;

&lt;p&gt;a. Python for developing the web scrapping logic&lt;br&gt;
b. Selenium for browser automation and capturing network responses&lt;br&gt;
c. Pandas for data handling and manipulation&lt;br&gt;
d. MongoDB which is a NOSQL database for storing and querying the data&lt;/p&gt;

&lt;p&gt;The scraper utilizes Selenium to automate browser interactions such as loading the page and selecting all required themes, years and countries for precise data extraction. &lt;/p&gt;

&lt;h3&gt;
  
  
  The results
&lt;/h3&gt;

&lt;p&gt;The scraper obtains all the selected fields of the data, i.e.:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;id, name, score, unit, region name, indicator topic, indicator source, indicator name, indicator group, year, url&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The data is extracted in JSON format and appended to a list before it is flattened and converted to CSV format.&lt;/p&gt;
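
&lt;p&gt;As a rough sketch of that flattening step, pandas can normalize nested JSON records into a flat table before writing the CSV. The records below are hypothetical and only mimic the shape of an API response:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

# Hypothetical records mimicking a nested API response
records = [
    {"id": 1, "name": "Kenya", "score": 75.0, "year": 2022,
     "indicator": {"topic": "Access", "name": "Population access to electricity-National"}},
    {"id": 2, "name": "Ghana", "score": 85.1, "year": 2022,
     "indicator": {"topic": "Access", "name": "Population access to electricity-National"}},
]

# Flatten the nested "indicator" object into dotted columns,
# then write the table out as CSV
df = pd.json_normalize(records)
df.to_csv("energy_data.csv", index=False)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;&lt;code&gt;json_normalize&lt;/code&gt; turns the nested object into columns like &lt;code&gt;indicator.topic&lt;/code&gt;, so the CSV stays one row per record.&lt;/p&gt;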




&lt;p&gt;Working on this project has been nothing short of a learning experience, from understanding the project, to learning different ways to execute it, to the implementation itself.&lt;/p&gt;

&lt;p&gt;You can check out the project on &lt;a href="https://github.com/MakenaKinyua/Energy-Data-Project" rel="noopener noreferrer"&gt;Github&lt;/a&gt; and feel free to reach out for inquiries or collaboration!&lt;/p&gt;

</description>
      <category>python</category>
      <category>selenium</category>
      <category>webscraping</category>
      <category>mongodb</category>
    </item>
    <item>
      <title>RAGs for beginners</title>
      <dc:creator>MakenaKinyua</dc:creator>
      <pubDate>Sat, 20 Sep 2025 07:35:21 +0000</pubDate>
      <link>https://dev.to/makenakinyua/rags-for-beginners-2d02</link>
      <guid>https://dev.to/makenakinyua/rags-for-beginners-2d02</guid>
      <description>&lt;p&gt;Advancement in Artificial Intelligence has made it possible for humans to interact with AI in different ways such as holding conversations that make sense for example using LLMs like GPT-4. &lt;/p&gt;

&lt;p&gt;Despite the strides achieved, one thing that can be frustrating when dealing with LLMs is querying and getting responses that make no sense. For instance, in the medical field, you can ask about the mode of action of a new drug in the market. Instead of answering correctly, the model gives you information that is outdated or made up. &lt;/p&gt;

&lt;p&gt;This is one of the limitations of traditional LLMs: they have knowledge gaps because their training data is frozen at a point in time. LLMs also tend to provide generic answers and they hallucinate, answering questions out of context. To solve this, RAG systems were introduced.&lt;/p&gt;

&lt;h3&gt;
  
  
  RAG
&lt;/h3&gt;

&lt;p&gt;RAG refers to Retrieval-Augmented Generation systems which provide support to LLMs. RAG systems work by enabling the LLMs to access information from an external database. As a result, the models are able to generate responses that have more context and are up-to-date thus addressing the limitations of traditional Large Language Models.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does it work?
&lt;/h3&gt;

&lt;p&gt;A simple RAG system flow has six stages: data extraction, chunking extracted data, embedding the chunks, storing the data in a vector database, retrieval when a user queries the database and generating a response. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fricnil2h9ndfkvak0nml.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fricnil2h9ndfkvak0nml.png" alt=" " width="750" height="355"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  i) Data collection
&lt;/h4&gt;

&lt;p&gt;This is the first stage which involves extracting data from a source, such as a pdf, that will be fed into the model. After extraction, the data is loaded ready to go to the next step.&lt;/p&gt;

&lt;h4&gt;
  
  
  ii) Chunking
&lt;/h4&gt;

&lt;p&gt;Data is extracted as a single string. Chunking involves breaking the data down into chunks using text splitters such as the &lt;code&gt;RecursiveCharacterTextSplitter&lt;/code&gt;. Define the chunk size (the number of characters in each chunk) and the chunk overlap (how many characters consecutive chunks share) in order to preserve context.&lt;/p&gt;
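
&lt;p&gt;As a minimal pure-Python sketch of the idea (not the &lt;code&gt;RecursiveCharacterTextSplitter&lt;/code&gt; itself), fixed-size chunking with overlap can look like this; the sizes are illustrative only:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def chunk_text(text, chunk_size=20, chunk_overlap=5):
    # Step forward by chunk_size minus chunk_overlap so that
    # neighbouring chunks share a few characters of context
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("Retrieval-Augmented Generation grounds an LLM in external data.")
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Each chunk here is at most 20 characters long, and consecutive chunks share their last and first 5 characters.&lt;/p&gt;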

&lt;h4&gt;
  
  
  iii) Embedding
&lt;/h4&gt;

&lt;p&gt;The chunks are not interpretable by the machine in their extracted state. Thus an embedding model is used to transform the characters into vector representations which are understood by the machine. There are different embedding models from &lt;code&gt;SentenceTransformers&lt;/code&gt; that can be used based on your need.&lt;/p&gt;

&lt;h4&gt;
  
  
  iv) Vector Database
&lt;/h4&gt;

&lt;p&gt;Once the chunks have been embedded, they need to be stored in a vector database such as &lt;code&gt;Chroma&lt;/code&gt;. The vector database stores the embeddings as vectors, making it easier to conduct a similarity search when queried.&lt;/p&gt;

&lt;h4&gt;
  
  
  v) Retrieval
&lt;/h4&gt;

&lt;p&gt;The user interacts with the LLM at this point by querying it. The query is embedded into a vector, then a similarity search occurs in the database, comparing the vector of the query to what is stored in the vector database. The most relevant chunks are retrieved and passed on to the next step.&lt;/p&gt;
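
&lt;p&gt;A toy sketch of the retrieval step: with made-up three-dimensional "embeddings" (a real system would use an embedding model and far more dimensions), cosine similarity picks the stored chunk closest to the query vector:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

# Made-up vectors standing in for real embeddings
chunks = ["solar capacity grew", "the cat sat on the mat", "wind power output rose"]
vectors = np.array([[0.9, 0.1, 0.2], [0.0, 1.0, 0.1], [0.8, 0.0, 0.4]])
query = np.array([0.9, 0.1, 0.2])

# Cosine similarity between the query and every stored vector
sims = vectors @ query / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))
best_chunk = chunks[int(np.argmax(sims))]
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;The chunk whose vector points in nearly the same direction as the query wins, regardless of vector length.&lt;/p&gt;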

&lt;h4&gt;
  
  
  vi) Generating a response
&lt;/h4&gt;

&lt;p&gt;After retrieving information from the database, a response is generated to the user based on the information available to the system. &lt;/p&gt;

&lt;p&gt;RAG systems can be used in various ways such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Providing customer support that is helpful to the users rather than frustrating.&lt;/li&gt;
&lt;li&gt;In market research where it has access to a vast pool of data. &lt;/li&gt;
&lt;li&gt;Useful in recommendation systems.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is incredible how much incorporating RAG into an LLM improves the response quality of the model as well as the satisfaction of the user. In our ever-growing tech space, it will be great to see how RAG systems are improved upon and incorporated more into our day-to-day activities.&lt;/p&gt;

</description>
      <category>deeplearning</category>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Unsupervised Machine Learning</title>
      <dc:creator>MakenaKinyua</dc:creator>
      <pubDate>Sun, 31 Aug 2025 17:06:46 +0000</pubDate>
      <link>https://dev.to/makenakinyua/unsupervised-machine-learning-kjo</link>
      <guid>https://dev.to/makenakinyua/unsupervised-machine-learning-kjo</guid>
      <description>&lt;p&gt;Machine learning is one of the core concepts of data science which forms the foundation for AI. &lt;/p&gt;

&lt;p&gt;Have you ever wondered how recommendation systems such as YouTube's work? How are they able to recommend just the right content? A big part of the answer is &lt;strong&gt;unsupervised machine learning&lt;/strong&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Unsupervised Machine Learning
&lt;/h4&gt;

&lt;p&gt;It is a type of machine learning where the model is fed raw, unstructured data without any labels. &lt;/p&gt;

&lt;p&gt;The model then learns and makes sense of the data by discovering patterns and relationships on its own.&lt;/p&gt;

&lt;h4&gt;
  
  
  How does it work?
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;For it to learn the patterns and relationships, unsupervised machine learning depends heavily on mathematical concepts.&lt;/li&gt;
&lt;li&gt;Data points with similar features are grouped together each in its own group. &lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Models of Unsupervised Machine Learning
&lt;/h4&gt;

&lt;p&gt;There are two main families of unsupervised machine learning models: clustering and dimensionality reduction. &lt;/p&gt;

&lt;h4&gt;
  
  
  i. Clustering
&lt;/h4&gt;

&lt;p&gt;Just as the name suggests, clustering involves grouping data into different clusters such that data points in the same cluster have very similar features while data points in different clusters are very different. &lt;/p&gt;

&lt;p&gt;There are different algorithms used in clustering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;K-means&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In K-means clustering, the user defines the number of desired clusters (K) that the algorithm is supposed to form. &lt;/p&gt;

&lt;p&gt;Distance metrics such as the Euclidean distance and Manhattan distance come into play: the algorithm measures the distance of each data point from the centroids and assigns the point to the cluster of its nearest centroid.&lt;/p&gt;
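
&lt;p&gt;The assignment step of K-means can be sketched in a few lines of NumPy; the points and centroids below are made up for illustration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

# Toy 2-D data points and two illustrative centroids
points = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9]])
centroids = np.array([[1.0, 1.0], [5.0, 5.0]])

# Euclidean distance from every point to every centroid,
# then assign each point to its nearest centroid
dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
labels = np.argmin(dists, axis=1)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Full K-means then recomputes each centroid as the mean of its assigned points and repeats until the assignments stop changing.&lt;/p&gt;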

&lt;ul&gt;
&lt;li&gt;Hierarchical clustering&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It involves forming a hierarchy of data points, thus creating a tree of clusters. There are two types of hierarchical clustering: agglomerative, which is a bottom-up approach, and divisive, which is a top-down approach.&lt;/p&gt;

&lt;h4&gt;
  
  
  ii. Dimensionality Reduction
&lt;/h4&gt;

&lt;p&gt;At times, we encounter datasets that have very many features, some of which add no meaningful value to the data we are trying to make sense of.&lt;/p&gt;

&lt;p&gt;In such a case, we use dimensionality reduction which works by reducing the number of variables while preserving key information. The model filters through the noise and gets rid of the unnecessary dimensions.&lt;/p&gt;
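
&lt;p&gt;One common dimensionality reduction technique is principal component analysis (PCA). A minimal NumPy sketch on synthetic data, where the two features are almost perfectly correlated, shows a single component preserving nearly all the variance:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

# Synthetic data: the second feature is almost exactly twice the first
rng = np.random.default_rng(0)
x = rng.normal(size=100)
data = np.column_stack([x, 2 * x + rng.normal(scale=0.01, size=100)])

# Centre the data, then project onto the first principal component
centred = data - data.mean(axis=0)
u, s, vt = np.linalg.svd(centred, full_matrices=False)
reduced = centred @ vt[:1].T          # 100 points, now 1 dimension
explained = s[0]**2 / (s**2).sum()    # share of variance kept
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Here one dimension keeps well over 99% of the variance, which is exactly the "filter through the noise" behaviour described above.&lt;/p&gt;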

&lt;blockquote&gt;
&lt;p&gt;My thoughts:&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;There is so much to learn from unsupervised machine learning models which exist to help us understand data beyond what is obvious to the human eye.&lt;/p&gt;

&lt;p&gt;Its ability to recognize patterns and relationships makes it powerful to use because real world data is messy and noisy!&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>data</category>
      <category>machinelearning</category>
      <category>ai</category>
    </item>
    <item>
      <title>Classification in Machine Learning.</title>
      <dc:creator>MakenaKinyua</dc:creator>
      <pubDate>Mon, 25 Aug 2025 18:54:42 +0000</pubDate>
      <link>https://dev.to/makenakinyua/classification-in-machine-learning-43b7</link>
      <guid>https://dev.to/makenakinyua/classification-in-machine-learning-43b7</guid>
      <description>&lt;p&gt;Classification takes many forms in our day to day activities. It can take a simple form in spam filters, trends to a complex form like image search.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is Classification?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Classification&lt;/strong&gt; is a type of Supervised Machine Learning that is used to predict categories of data points.&lt;/p&gt;

&lt;p&gt;Classification works by looking at the characteristics or features of data points and putting them into categories based on the similarities depending on the model chosen. &lt;/p&gt;

&lt;p&gt;There are different algorithms used in classification:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Logistic regression &lt;/li&gt;
&lt;li&gt;K-Nearest Neighbors&lt;/li&gt;
&lt;li&gt;Random forest&lt;/li&gt;
&lt;li&gt;Decision trees, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A good example of classification in action is your email service provider predicting whether an email is spam or not spam.&lt;/p&gt;

&lt;p&gt;How is Gmail/Outlook able to achieve that? Is there a trade off that occurs behind the scenes that guides decision making?&lt;/p&gt;

&lt;h4&gt;
  
  
  Metrics of Evaluation🧮
&lt;/h4&gt;

&lt;p&gt;These are the decision making guidelines that determine how well our model is doing.&lt;/p&gt;

&lt;p&gt;Scenario:&lt;/p&gt;

&lt;p&gt;Predicting spam and legitimate emails. Our model has 200 data points of which 180 are legitimate and 20 are spam. Our model predicts that 150 are legitimate and 50 are spam. There are 4 possible outcomes from the prediction:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Predicted Positive&lt;/th&gt;
&lt;th&gt;Predicted Negative&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Actual Positive&lt;/td&gt;
&lt;td&gt;True Positive&lt;/td&gt;
&lt;td&gt;False Negative&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Actual Negative&lt;/td&gt;
&lt;td&gt;False Positive&lt;/td&gt;
&lt;td&gt;True Negative&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  a. Accuracy
&lt;/h4&gt;

&lt;p&gt;Accuracy tells us how many correct predictions our model has made, whether positive or negative, out of all predictions. It is important because it summarizes how often our model classifies the categories correctly.&lt;/p&gt;

&lt;h4&gt;
  
  
  b. Precision
&lt;/h4&gt;

&lt;p&gt;Precision tells us more about the positive predictions: how many of the predicted positives are actually positive? For instance, our model predicted 50 positives but could have flagged some emails as spam that were not.&lt;/p&gt;

&lt;h4&gt;
  
  
  c. Recall
&lt;/h4&gt;

&lt;p&gt;Recall shows us how many of the actual positives were captured by our model. The model can predict positives, but how many of the true positives was it able to capture?&lt;/p&gt;

&lt;h4&gt;
  
  
  d. F1 Score
&lt;/h4&gt;

&lt;p&gt;The F1 score is the harmonic mean of precision and recall, balancing the two.&lt;/p&gt;
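
&lt;p&gt;To make the four metrics concrete, here is a small calculation using hypothetical confusion-matrix counts consistent with the scenario above (20 actual spam, 50 predicted spam); the split of 18 true positives and 32 false positives is an assumption for illustration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Hypothetical counts, with spam as the positive class;
# the totals match the 200-email scenario above
tp, fp, fn, tn = 18, 32, 2, 148

accuracy = (tp + tn) / (tp + fp + fn + tn)          # all correct / all predictions
precision = tp / (tp + fp)                          # true positives / predicted positives
recall = tp / (tp + fn)                             # true positives / actual positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Notice how the healthy-looking accuracy (0.83) hides the weak precision (0.36): most of the flagged emails were not actually spam.&lt;/p&gt;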

&lt;blockquote&gt;
&lt;p&gt;Okay, but why does any of this matter?📍&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Classification is used in many fields and thus the models need to suit the different scenarios as best as possible. The metrics are important because they guide us on the decisions we make based on what our models predict. &lt;/p&gt;

&lt;p&gt;The trade-off occurs where the model predicts more positives and, in the process, captures false positives which are treated as points of interest. On the other hand, the model may predict fewer positives but risks missing some of the actual positives. &lt;/p&gt;

&lt;p&gt;The implications of favoring one metric can be missing out on important data points or capturing a lot more that may not necessarily be accurate. Thus the need to balance the metrics. &lt;/p&gt;

&lt;p&gt;Classification is thought provoking because the process of model design and decision making goes beyond just the code. It is a deeper process that needs you to factor in what level of correctness you desire and understand what that means to the problem that is at hand. &lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>analytics</category>
      <category>data</category>
    </item>
    <item>
      <title>Type i and Type ii errors</title>
      <dc:creator>MakenaKinyua</dc:creator>
      <pubDate>Tue, 12 Aug 2025 11:16:25 +0000</pubDate>
      <link>https://dev.to/makenakinyua/type-i-and-type-ii-errors-5bl5</link>
      <guid>https://dev.to/makenakinyua/type-i-and-type-ii-errors-5bl5</guid>
      <description>&lt;h3&gt;
  
  
  &lt;u&gt; Errors in Hypothesis testing🧮
&lt;/u&gt;
&lt;/h3&gt;



&lt;p&gt;In hypothesis testing, we conduct statistical tests in order to decide whether to reject the null hypothesis at a specific level of significance. &lt;/p&gt;

&lt;p&gt;We start by setting the null and alternative hypotheses. The alternative hypothesis is the effect we suspect from an experiment, and the null hypothesis is its opposite: the default of no effect.&lt;/p&gt;

&lt;p&gt;In the process, there is always a chance of encountering errors when it comes to rejecting or failing to reject the null hypothesis. With these errors, there is a question of how to balance the errors and what we are willing to trade off.&lt;/p&gt;

&lt;p&gt;There are two types of errors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Type I error - occurs when we &lt;strong&gt;reject&lt;/strong&gt; the null hypothesis when we should not have rejected it. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Type II error - occurs when we &lt;strong&gt;fail to reject&lt;/strong&gt; the null hypothesis when we should have rejected it.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
&lt;u&gt; Medical Dilemma: A Cancer Scenario🩺
&lt;/u&gt;
&lt;/h4&gt;

&lt;p&gt;A patient walks into a hospital with several signs and symptoms. The doctor suspects that the patient's symptoms are consistent with cancer.&lt;br&gt;
In this case:&lt;br&gt;
&lt;em&gt;H0&lt;/em&gt; : The patient does not have cancer&lt;br&gt;
&lt;em&gt;H1&lt;/em&gt; : The patient has cancer &lt;/p&gt;

&lt;p&gt;Is it 'better' for a Type I or a Type II error to occur? Where do we trade off between the two, and how do we decide which one is better? &lt;/p&gt;

&lt;h4&gt;
  
  
&lt;u&gt;Type I error&lt;/u&gt;
&lt;/h4&gt;

&lt;p&gt;A Type I error would be that we reject the null hypothesis (that the patient does not have cancer) when we should not have rejected it. &lt;/p&gt;

&lt;p&gt;The implication:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The patient is in turn put on cancer treatments such as chemotherapy, a draining treatment,  when they are in fact healthy. &lt;/li&gt;
&lt;li&gt;It leads to physical, mental and financial drain on the patient. &lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
&lt;u&gt;Type II error&lt;/u&gt;
&lt;/h4&gt;

&lt;p&gt;A Type II error would mean that we fail to reject the null hypothesis (that the patient does not have cancer) when we should have rejected it. &lt;/p&gt;

&lt;p&gt;The implication:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The patient would be sent home thinking that they are not sick when in fact, they are sick.&lt;/li&gt;
&lt;li&gt;They do not receive any sort of care and might end up having a sudden decline in health which might lead to death. &lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;u&gt;Reflection📑&lt;/u&gt;
&lt;/h3&gt;

&lt;p&gt;So which one is more acceptable in this case: a false positive or a false negative? How can we balance the two? &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A Type I error is reduced by setting a stricter significance level (alpha) for the hypothesis test.&lt;/li&gt;
&lt;li&gt;A Type II error is reduced by increasing the statistical power, for example by using a larger sample size.&lt;/li&gt;
&lt;/ul&gt;
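
&lt;p&gt;The meaning of alpha can be checked by simulation. In this sketch every sample truly comes from the null distribution, so any rejection by the z-test is a Type I error, and the rejection rate lands near the chosen alpha of 0.05:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

# 20000 simulated experiments where the null hypothesis is TRUE:
# each sample of size 30 is drawn from a standard Normal population
rng = np.random.default_rng(42)
samples = rng.normal(size=(20000, 30))

# Two-sided z-test per experiment; 1.96 is the 5% critical value
z = samples.mean(axis=1) * np.sqrt(30)
rejected = np.greater(np.abs(z), 1.96)
type_i_rate = rejected.mean()
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Lowering alpha shrinks this rate, but at the cost of more Type II errors, which is exactly the balance discussed above.&lt;/p&gt;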

&lt;p&gt;It is important to strike a good balance between the two because both extremes can be dangerous.&lt;br&gt;
I say we need to consider our priorities, our morality, the cost of each type of error and the effects in the long run. What are your thoughts?&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>data</category>
      <category>learning</category>
      <category>statistics</category>
    </item>
    <item>
      <title>Calculating win probabilities of the EPL.</title>
      <dc:creator>MakenaKinyua</dc:creator>
      <pubDate>Thu, 31 Jul 2025 17:12:20 +0000</pubDate>
      <link>https://dev.to/makenakinyua/calculating-win-probabilities-of-the-epl-3pjf</link>
      <guid>https://dev.to/makenakinyua/calculating-win-probabilities-of-the-epl-3pjf</guid>
      <description>&lt;p&gt;The English Premier League is about to resume for the next season and I hope all fans are ready for it! This a simple experiment to calculate and visualize win probabilities ; as a bernoulli distribution and binomial distribution using python.&lt;/p&gt;

&lt;p&gt;Data from the 2024/2025 season was obtained from the &lt;a href="https://www.football-data.org" rel="noopener noreferrer"&gt;Football Data Org API&lt;/a&gt;, an API for football, and I used several Python libraries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt; 
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt; 
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;scipy&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;stats&lt;/span&gt; 
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;seaborn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After obtaining the data, it was converted from JSON to a pandas DataFrame for wrangling and visualization.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;1. Defining a function to calculate probabilities.&lt;/u&gt;&lt;br&gt;
The defined function has two objectives: calculate the win, draw and loss probabilities, and calculate the binomial probability of each team winning the same number of games based on the number of games it won.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5dbeenv6f4oyvoiip7gf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5dbeenv6f4oyvoiip7gf.png" alt="Function to calculate Probability" width="800" height="217"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;i. The first part of the function&lt;br&gt;
Calculates the win, loss and draw probabilities by dividing the number of games with each outcome by the total games. This gives us the probability of each outcome for the individual teams at any point during the season. It can be likened to a Bernoulli distribution, which models the probability of a success, i.e. the probability of a win or no win.&lt;/p&gt;

&lt;p&gt;ii. Second part of the function&lt;br&gt;
Calculates the binomial probabilities using the scipy python library. We use the &lt;code&gt;stats.binom.pmf&lt;/code&gt; which takes in the arguments (k, n, p) where;&lt;/p&gt;

&lt;p&gt;k - number of successes which is number of games won&lt;br&gt;
n - total games played&lt;br&gt;
p - probability of a win&lt;/p&gt;

&lt;p&gt;The binomial probabilities are interpreted as the probability of the team having the same number of wins for the next season.&lt;/p&gt;
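
&lt;p&gt;As a quick illustration of the call (the numbers here are made up, not taken from the actual season data), consider a team that won 25 of its 38 matches, with p estimated from those same wins:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from scipy import stats

# Illustrative values: k wins out of n matches, p estimated as k/n
k, n = 25, 38
p = k / n

# Probability of exactly k wins again in n matches at win rate p
prob = stats.binom.pmf(k, n, p)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Because p is estimated from the season being repeated, the pmf peaks at k, yet the probability of hitting exactly that win count again is still well below 1.&lt;/p&gt;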

&lt;p&gt;&lt;u&gt;2. Visualizing the results&lt;/u&gt;&lt;br&gt;
From the results, I noticed the differences in team positions as a result of the calculated probabilities. I created a plot of both the win rate which is in orange and the win probability in blue just to help me understand the analysis. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz0sg6et13cblg45myxda.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz0sg6et13cblg45myxda.png" alt="Team Positions" width="800" height="313"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;Based on this, we see that Liverpool FC is most likely to be at the top of the table, followed by Manchester City FC and Chelsea FC. The three bottom-most teams have a higher probability of winning the same number of games than their rate of winning games. They suffer the penalty of relegation into a lower competition.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Conclusion&lt;/u&gt;&lt;br&gt;
Working on this was interesting and I got to learn a lot through my trials and errors. There is so much that goes into predicting football outcome probabilities such as form, stage, players etc. I can't wait to explore these variables for a more informed prediction. As for now, I stand with Manchester United FC.&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>statistics</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Mean, Median and Mode in Statistics</title>
      <dc:creator>MakenaKinyua</dc:creator>
      <pubDate>Tue, 22 Jul 2025 17:09:05 +0000</pubDate>
      <link>https://dev.to/makenakinyua/mean-median-and-mode-in-statistics-1clm</link>
      <guid>https://dev.to/makenakinyua/mean-median-and-mode-in-statistics-1clm</guid>
      <description>&lt;p&gt;Statistics, as a whole, is one of the subjects I enjoy most as a data scientist. In this article, we explore measures of central tendency which are part of the fundamentals of statistics and get to understand how they are used.&lt;/p&gt;

&lt;p&gt;Measures of central tendency are values used to summarize data in order to understand how the data is distributed. They include the mean, median and mode.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;i) Mean&lt;/u&gt;&lt;br&gt;
It is the average value of a given data set and is obtained by adding all the values of the data and dividing the result by the number of values in the data set.&lt;/p&gt;

&lt;p&gt;The mean is used when you want to see where the average value of the data set lies, which helps you understand the nature of the distribution. It is also used to fill in missing values in a data set where the distribution is symmetric and has no outliers.&lt;/p&gt;

&lt;p&gt;Calculating the mean with the Python library NumPy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
num = [2,2,3,4,8,5]
mean = np.mean(num)
print(mean)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;u&gt;ii) Median&lt;/u&gt;&lt;br&gt;
This is the midpoint of the data set: the values are arranged in ascending or descending order and the middle value is taken. When there is an even number of values, the median is the average of the two middle values. In a symmetric distribution, the median usually equals or is close to the mean. The median is also used to fill missing values, particularly where the data is skewed or contains outliers, since it is less affected by extreme values than the mean.&lt;/p&gt;

&lt;p&gt;Calculating the median with the Python library NumPy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
num = [2,2,3,4,8,5]
median = np.median(num)
print(median)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;u&gt;iii) Mode&lt;/u&gt;&lt;br&gt;
It refers to the most frequently occurring value in a data set. The mode is also used to replace missing values, depending on how many times it appears and the nature of the distribution.&lt;/p&gt;

&lt;p&gt;Calculating the mode with the Python library statistics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import statistics
num = [2,2,3,4,8,5]
mode = statistics.mode(num)
print(mode)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
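&lt;p&gt;One caveat worth knowing: if a data set has more than one most-frequent value, &lt;code&gt;statistics.mode&lt;/code&gt; only returns the first one it encounters. As a small illustrative sketch, &lt;code&gt;statistics.multimode&lt;/code&gt; (Python 3.8+) returns all of the tied values instead:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import statistics

# 2 and 3 both appear twice, so this data set is bimodal.
num = [2, 2, 3, 3, 4]
print(statistics.mode(num))       # 2  (first mode encountered)
print(statistics.multimode(num))  # [2, 3]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;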


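&lt;p&gt;Since all three measures can be used to fill in missing values, here is a minimal sketch of that use case with some hypothetical sensor readings, using Python's built-in &lt;code&gt;statistics&lt;/code&gt; module:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import statistics

# Hypothetical readings; None marks a missing value.
readings = [12.0, None, 14.5, 13.0, None, 12.5]

# Compute the mean of the values that are present.
present = [x for x in readings if x is not None]
fill = statistics.mean(present)  # 13.0

# Replace each missing value with the mean.
filled = [fill if x is None else x for x in readings]
print(filled)  # [12.0, 13.0, 14.5, 13.0, 13.0, 12.5]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For skewed data or data with outliers, the same sketch works with &lt;code&gt;statistics.median&lt;/code&gt; swapped in for the mean.&lt;/p&gt;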

&lt;p&gt;In conclusion, the measures of central tendency are fundamental when exploring your data and can tell you a great deal about it. I hope this article has helped shed some light on the measures of central tendency and their importance!&lt;/p&gt;

</description>
      <category>statistics</category>
      <category>datascience</category>
      <category>python</category>
      <category>learning</category>
    </item>
    <item>
      <title>Relationships, in Power BI</title>
      <dc:creator>MakenaKinyua</dc:creator>
      <pubDate>Sun, 22 Jun 2025 15:45:04 +0000</pubDate>
      <link>https://dev.to/makenakinyua/relationships-in-power-bi-4acj</link>
      <guid>https://dev.to/makenakinyua/relationships-in-power-bi-4acj</guid>
<description>&lt;p&gt;Power BI is quite an interesting data tool, and one of the concepts I have enjoyed working with is relationships. A relationship simply refers to a connection made between two tables.&lt;/p&gt;

&lt;p&gt;A relationship is created by joining two separate tables on a selected column that holds the same information in both. Typically, one of the tables is the fact table and the other is the dimension table. The primary key in one table is the unique identifier that maps onto the foreign key in the other table.&lt;/p&gt;

&lt;p&gt;There are four types of relationships:&lt;/p&gt;

&lt;p&gt;i)&lt;u&gt;One to One&lt;/u&gt;&lt;br&gt;
This is the simplest type of relationship. Values from the selected column of the first table match perfectly with values from the second table. All the values appear only once in each table hence a one-to-one relationship.&lt;/p&gt;

&lt;p&gt;Consider two tables: one with sales dates and another with calendar dates. Each sales date matches exactly one date in the calendar table, which creates a connection between the two tables based on these columns.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feusifo8v3l1fu7swrwl9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feusifo8v3l1fu7swrwl9.png" alt="One-to-one relationship" width="517" height="1035"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;ii)&lt;u&gt;One to Many&lt;/u&gt;&lt;br&gt;
For this type of relationship, values in the first table match multiple values on the second table. &lt;/p&gt;

&lt;p&gt;When you have a customer table with a customer ID and a sales table with a customer ID, one customer ID can match rows in the sales table multiple times. This is because one customer can have multiple sales, creating a one-to-many relationship.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs93bpizm1nn9cdt5bpux.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs93bpizm1nn9cdt5bpux.png" alt="One-to-many relationship" width="518" height="1035"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;iii)&lt;u&gt;Many to One&lt;/u&gt;&lt;br&gt;
This is basically the reverse of the one-to-many relationship; multiple values in the first table match one value in the second table.&lt;/p&gt;

&lt;p&gt;For example, consider a table with sales information and another with product information. The common column between them is the product ID. In this case, multiple sales are made of one product, so multiple values in the sales table match one value in the product table.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr3lewhr6j0aj0r7fpi33.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr3lewhr6j0aj0r7fpi33.png" alt="Many-to-one relationship" width="519" height="1038"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;iv)&lt;u&gt;Many to Many&lt;/u&gt;&lt;br&gt;
In this type of relationship, multiple values from the first table match with multiple values from the second table.&lt;/p&gt;
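&lt;p&gt;Outside Power BI, the same ideas apply when joining tables in code. As an illustrative sketch with made-up customer and sales tables, pandas' &lt;code&gt;merge&lt;/code&gt; can even enforce the relationship type through its &lt;code&gt;validate&lt;/code&gt; parameter:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

# Hypothetical dimension table: customer_id is the primary key (unique).
customers = pd.DataFrame({"customer_id": [1, 2], "name": ["Ann", "Ben"]})

# Hypothetical fact table: customer_id is the foreign key and may repeat.
sales = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [100, 250, 80]})

# validate="one_to_many" raises a MergeError if customer_id is not
# unique on the left side -- the same guarantee a Power BI
# one-to-many relationship gives.
joined = customers.merge(sales, on="customer_id", validate="one_to_many")
print(joined)  # three rows: Ann appears twice, once per sale
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;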

&lt;p&gt;Another thing about relationships is the cross-filter direction. This refers to the direction in which a filter applied to one table in a relationship affects the related table. &lt;br&gt;
a) Single cross-filter direction - filters flow from one table to the other&lt;br&gt;
b) Both cross-filter direction - filters flow in both directions&lt;/p&gt;

&lt;p&gt;Relationships in Power BI have made it easy to understand, explore and join tables for easier visualization of data, and I can't wait to learn more!&lt;/p&gt;

</description>
      <category>powerbi</category>
      <category>analytics</category>
      <category>data</category>
    </item>
    <item>
      <title>Excel, an analysis.</title>
      <dc:creator>MakenaKinyua</dc:creator>
      <pubDate>Wed, 11 Jun 2025 13:43:17 +0000</pubDate>
      <link>https://dev.to/makenakinyua/excel-an-analysis-3i6n</link>
      <guid>https://dev.to/makenakinyua/excel-an-analysis-3i6n</guid>
<description>&lt;p&gt;I have learnt and built with Microsoft Excel, and it has been an interesting experience. Excel is among the most important data tools for any analyst; it is used for the analysis, storage and visualization of data.&lt;/p&gt;

&lt;p&gt;Excel uses formulas and functions that enable the analysis of data in order to capture useful insights. In real-world data, such insights are very important and are used to make data-driven decisions. Some examples of Excel in the real world include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Preparing Financial Statements&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Excel can be used to prepare financial statements such as balance sheets and profit and loss statements amongst others. These are used in understanding business performance.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Budgeting Process&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Excel is used to create dynamic budgets. Given the expected expenses and income, Excel uses different formulas and functions to update the budget with every input, compute the given values and deliver an outcome.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Financial Planning&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Excel provides analysts with a platform to give context to performance indicators that help with long-term financial planning.&lt;/p&gt;

&lt;h3&gt;
  
  
  Features And Formulas
&lt;/h3&gt;

&lt;p&gt;Some of the interesting things I have learnt include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Data Validation&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is a feature that allows only specific data to be input in a selected range of cells. For example, a marital status column can be restricted to: single, married, divorced and widowed. This ensures that nothing outside that selection is entered in the specified cells.&lt;/p&gt;
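&lt;p&gt;The same rule can be sketched outside Excel. As a small illustrative example (the allowed categories are the ones above), a validation check in Python might look like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Allowed categories, mirroring the Excel data validation list.
ALLOWED = {"single", "married", "divorced", "widowed"}

def is_valid(value):
    """Accept a value only if it is in the allowed list."""
    return value.strip().lower() in ALLOWED

print(is_valid("Married"))  # True
print(is_valid("engaged"))  # False
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;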

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Conditional Formatting&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is used to create simple visuals of the data using icons, data bars and colors, among others. In a sales column, it can be used to see high, moderate and low sales depending on the rules that have been set.  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Filtering&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is used to view your data based on different categories. I love this because it is very helpful when you are exploring your data in order to understand what you are working with.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reflection
&lt;/h3&gt;

&lt;p&gt;Learning Excel has made me appreciate the importance and power of data. Good data brings endless possibilities in the amount of information that can be obtained from it, and it is beautiful to behold how much sense can be made from a bunch of numbers and words.&lt;/p&gt;

</description>
      <category>excel</category>
      <category>datascience</category>
      <category>analytics</category>
      <category>beginners</category>
    </item>
    <item>
      <title>[Boost]</title>
      <dc:creator>MakenaKinyua</dc:creator>
      <pubDate>Mon, 09 Jun 2025 15:51:44 +0000</pubDate>
      <link>https://dev.to/makenakinyua/excel-1c9h</link>
      <guid>https://dev.to/makenakinyua/excel-1c9h</guid>
      <description></description>
    </item>
  </channel>
</rss>
