<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: anangwemike</title>
    <description>The latest articles on DEV Community by anangwemike (@anangwemike).</description>
    <link>https://dev.to/anangwemike</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3257506%2F2e78c569-bfb9-4a08-8612-c7b3131a718b.jpg</url>
      <title>DEV Community: anangwemike</title>
      <link>https://dev.to/anangwemike</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/anangwemike"/>
    <language>en</language>
    <item>
      <title>Building a Clean Energy Data Pipeline for Africa (from raw CSVs to MongoDB)</title>
      <dc:creator>anangwemike</dc:creator>
      <pubDate>Sat, 11 Oct 2025 10:45:30 +0000</pubDate>
      <link>https://dev.to/anangwemike/building-a-clean-energy-data-pipeline-for-africa-from-raw-csvs-to-mongodb-23l2</link>
      <guid>https://dev.to/anangwemike/building-a-clean-energy-data-pipeline-for-africa-from-raw-csvs-to-mongodb-23l2</guid>
      <description>&lt;p&gt;In order to get accurate policy analysis, research and innovation across the continent, it is necessary to have access to accurate energy data. I worked on a data extraction and structuring project, where I built a clean pipeline to process multiple energy datasets.&lt;/p&gt;

&lt;h2&gt;🎯 Project Goal&lt;/h2&gt;

&lt;p&gt;To collect, clean, format and upload Africa's energy-related datasets into a centralized MongoDB database, ready for analytics dashboards, automation and future API integration.&lt;/p&gt;

&lt;h2&gt;What was achieved&lt;/h2&gt;

&lt;p&gt;a. File Structure and Raw Data Review&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;I gathered over 30 CSV energy files covering access rates, generation, imports, exports, renewables, consumption trends and installed capacity.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Designed a consistent file-naming convention, prefixing every cleaned file with &lt;strong&gt;formatted_&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;b. Data Cleaning &amp;amp; Extraction&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Standardized missing values, column casing and data types.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ensured uniform schema across different energy indicators to enable integration later.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
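&lt;p&gt;A minimal pandas sketch of those cleaning steps (the column names and values below are illustrative, not the project's actual data):&lt;/p&gt;

```python
import pandas as pd

# Illustrative raw data; the project's real columns differ.
raw = pd.DataFrame({
    "Country ": ["Kenya", "ghana", None],
    "Access_Rate": ["75.4", "n/a", "61.2"],
})

# Standardize column casing and strip stray whitespace.
raw.columns = [c.strip().lower() for c in raw.columns]

# Normalize missing-value markers, then coerce data types.
clean = raw.replace("n/a", pd.NA)
clean["access_rate"] = pd.to_numeric(clean["access_rate"], errors="coerce")
clean["country"] = clean["country"].str.title()
```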

&lt;p&gt;c. Master Dataset Creation&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Merged individual metric files into a single master dataset &lt;strong&gt;master_energy_dataset.csv&lt;/strong&gt; for centralized analytics.&lt;/li&gt;
&lt;/ul&gt;
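&lt;p&gt;Conceptually, the merge step looks something like the sketch below; two tiny stand-in files replace the project's 30+ CSVs, and the shared schema is an assumption:&lt;/p&gt;

```python
import glob

import pandas as pd

# Two tiny stand-in files; the real project combined 30+ formatted CSVs.
pd.DataFrame({"country": ["Kenya"], "year": [2022], "value": [75.4]}).to_csv(
    "formatted_access.csv", index=False)
pd.DataFrame({"country": ["Kenya"], "year": [2022], "value": [3.1]}).to_csv(
    "formatted_imports.csv", index=False)

# Combine every formatted_*.csv into one master dataset, keeping provenance.
frames = []
for path in sorted(glob.glob("formatted_*.csv")):
    df = pd.read_csv(path)
    df["source_file"] = path  # record which file each row came from
    frames.append(df)

master = pd.concat(frames, ignore_index=True)
master.to_csv("master_energy_dataset.csv", index=False)
```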

&lt;p&gt;d. MongoDB Integration&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Connected to MongoDB Atlas using a connection URI.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Built a production-ready Python script to:&lt;br&gt;
i.   Loop through all formatted CSV files.&lt;br&gt;
ii.  Create separate collections automatically.&lt;br&gt;
iii. Insert clean data using &lt;strong&gt;insert_many()&lt;/strong&gt; with dedupe logic and a consistent schema.&lt;br&gt;
iv.  Upload &lt;strong&gt;master_energy_dataset.csv&lt;/strong&gt; as a unified collection.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
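&lt;p&gt;A rough PyMongo sketch of that upload loop; the &lt;strong&gt;formatted_*.csv&lt;/strong&gt; naming pattern, the database name and the dedupe check are all simplified assumptions, not the project's exact script:&lt;/p&gt;

```python
import glob

import pandas as pd


def collection_name(path):
    # "formatted_access_rates.csv" becomes collection "access_rates"
    return path.removeprefix("formatted_").removesuffix(".csv")


def upload_formatted_csvs(uri, db_name="africa_energy"):
    """Sketch of the upload loop; db_name and the dedupe rule are assumptions."""
    from pymongo import MongoClient  # imported here so the sketch stays importable
    db = MongoClient(uri)[db_name]
    for path in sorted(glob.glob("formatted_*.csv")):
        records = pd.read_csv(path).to_dict("records")
        coll = db[collection_name(path)]
        # Naive dedupe: skip documents that already exist verbatim.
        fresh = [r for r in records if coll.count_documents(r, limit=1) == 0]
        if fresh:
            coll.insert_many(fresh)
```

&lt;p&gt;In practice a unique index on the key fields would make the dedupe atomic; a per-record &lt;strong&gt;count_documents&lt;/strong&gt; check is only workable for small batches.&lt;/p&gt;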

&lt;p&gt;e. GitHub &amp;amp; Version Control&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Initialized Git repository.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Committed dataset extraction notebook &lt;strong&gt;EDA.ipynb&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;🛠 Tech Stack Used&lt;/h2&gt;

&lt;p&gt;Data Cleaning - Pandas&lt;br&gt;
Storage - MongoDB Atlas (NoSQL)&lt;br&gt;
Upload Logic - PyMongo + automation script&lt;br&gt;
Version Control - Git + GitHub&lt;br&gt;
Future Plans - FastAPI + dashboard integration&lt;/p&gt;

&lt;p&gt;This project marks the first building block of a scalable African Energy Data Platform. Clean, well-structured, accessible data is the foundation - and now that foundation exists.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>analytics</category>
      <category>data</category>
      <category>mongodb</category>
    </item>
    <item>
      <title>Cracking the Code of Classification: How Machines Learn to Label</title>
      <dc:creator>anangwemike</dc:creator>
      <pubDate>Sun, 24 Aug 2025 17:26:29 +0000</pubDate>
      <link>https://dev.to/anangwemike/cracking-the-code-of-classification-how-machines-learn-to-label-3698</link>
      <guid>https://dev.to/anangwemike/cracking-the-code-of-classification-how-machines-learn-to-label-3698</guid>
      <description>&lt;p&gt;Supervised learning is one of the most widely used branches of machine learning. It's termed '&lt;strong&gt;supervised&lt;/strong&gt;' since the algorithm learns from a labeled dataset(each training example comes with an input and the correct output). The model discovers patterns in the dataset so that it can predict labels for new, unseen inputs.&lt;/p&gt;

&lt;p&gt;Within supervised learning, classification refers to predicting a discrete label or category: for example, predicting whether an email is spam, or classifying customer feedback as positive, neutral or negative. The process typically involves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Feature extraction – representing data in terms of numerical inputs (e.g., word counts, image pixels, statistical measures).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Model training – fitting a classification model on the labeled dataset.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Prediction – applying the trained model to new data to assign class labels.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Evaluation – assessing performance with metrics like accuracy, precision, recall, and F1-score.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
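&lt;p&gt;The four steps above can be sketched with scikit-learn; the synthetic dataset here stands in for real extracted features:&lt;/p&gt;

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Feature extraction: synthetic numeric features stand in for word counts etc.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression().fit(X_train, y_train)  # model training
predictions = model.predict(X_test)                 # prediction on new data
accuracy = accuracy_score(y_test, predictions)      # evaluation
```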

&lt;p&gt;There are several algorithms used for classification tasks, each with unique strengths:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Logistic Regression – A simple yet powerful linear model for binary classification, widely used for problems like credit scoring and churn prediction.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Decision Trees – Models that split data into rules, offering interpretability and flexibility but prone to overfitting.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Random Forests – An ensemble of decision trees that reduces overfitting and improves accuracy.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Support Vector Machines (SVMs) – Effective for complex decision boundaries, especially in high-dimensional data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Naïve Bayes – Based on probability theory, efficient for text classification problems like spam filtering.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;K-Nearest Neighbors (KNN) – A distance-based method that classifies based on the majority label of nearby points.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Neural Networks – Especially deep learning models, capable of handling complex classification tasks such as image and speech recognition.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From my experience, classification is often the first step into applied machine learning. It feels intuitive because humans naturally think in categories: “safe or unsafe,” “yes or no,” “type A or type B.” What fascinates me is how simple models like logistic regression can still perform remarkably well in practical problems, while more advanced models like random forests or neural networks can handle much greater complexity.&lt;/p&gt;

&lt;p&gt;I’ve also noticed that interpretability is just as important as accuracy. Stakeholders often ask why a model made a particular prediction, which makes models like decision trees and logistic regression highly valuable in certain industries.&lt;/p&gt;

&lt;p&gt;Working with classification has not been without difficulties. Some of the challenges include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Imbalanced datasets – When one class has far more samples than others, models often become biased toward the majority class.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Overfitting – Models like decision trees may memorize training data rather than learning general patterns, leading to poor performance on test data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Feature selection – Choosing the right features significantly impacts model accuracy. Irrelevant or noisy features often degrade performance.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
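&lt;p&gt;For the imbalance problem in particular, one common first remedy is class weighting; a small scikit-learn sketch on synthetic 9:1 data:&lt;/p&gt;

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# A 9:1 imbalanced dataset, so the minority class (label 1) is easy to miss.
X, y = make_classification(n_samples=400, weights=[0.9, 0.1], random_state=1)

plain = LogisticRegression().fit(X, y)
weighted = LogisticRegression(class_weight="balanced").fit(X, y)

# Recall on the minority class typically improves with class weighting.
plain_recall = recall_score(y, plain.predict(X))
weighted_recall = recall_score(y, weighted.predict(X))
```

&lt;p&gt;The weighted model trades some overall accuracy for better minority-class recall, which is usually the right exchange when the minority class is the one that matters.&lt;/p&gt;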

&lt;p&gt;Despite these challenges, I find classification an exciting area of machine learning because it blends mathematics, data insights, and problem-solving into practical applications that impact real-world decisions.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Striking a balance between Type I and Type II Errors: A medical case study on disease diagnosis</title>
      <dc:creator>anangwemike</dc:creator>
      <pubDate>Thu, 07 Aug 2025 13:23:38 +0000</pubDate>
      <link>https://dev.to/anangwemike/striking-a-balance-between-type-i-and-type-ii-errors-a-medical-case-study-on-disease-diagnosis-2ak2</link>
      <guid>https://dev.to/anangwemike/striking-a-balance-between-type-i-and-type-ii-errors-a-medical-case-study-on-disease-diagnosis-2ak2</guid>
      <description>&lt;p&gt;When it comes to statistical testing, more so in life-critical fields like &lt;strong&gt;medicine&lt;/strong&gt; balancing &lt;strong&gt;Type I&lt;/strong&gt; and &lt;strong&gt;Type II&lt;/strong&gt; errors could quite literally be the difference between life and death. Using a practical scenario, I will attempt to explain where and why there is a trade off between the two errors as well as the implications to healthcare, patients and stakeholders.&lt;/p&gt;

&lt;p&gt;Imagine a diagnostic test for a serious yet treatable ailment, say early-stage pancreatic cancer. It is a rare disease, but early detection can greatly increase survival rates. Suppose a doctor is evaluating a new screening test for this cancer. The test is not perfect and can return both false positives and false negatives. The big question: would you prefer to &lt;strong&gt;throw caution to the wind&lt;/strong&gt; and detect more cases even though some could be false alarms, or &lt;strong&gt;err on the side of certainty&lt;/strong&gt; and only diagnose when you are very sure, even if you might miss some real cases?&lt;/p&gt;

&lt;h2&gt;&lt;strong&gt;Hypothesis Parameters&lt;/strong&gt;&lt;/h2&gt;

&lt;p&gt;In this case, our &lt;strong&gt;Null Hypothesis (H₀)&lt;/strong&gt; would be &lt;strong&gt;the patient does not have pancreatic cancer&lt;/strong&gt;, and the &lt;strong&gt;Alternative Hypothesis (H₁)&lt;/strong&gt; would be &lt;strong&gt;the patient has pancreatic cancer&lt;/strong&gt;. Our predefined significance level here is &lt;strong&gt;α = 0.05&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;&lt;strong&gt;❗ Type I Error (False Positive)&lt;/strong&gt;&lt;/h2&gt;

&lt;p&gt;Here we would reject the null hypothesis when it is in fact true: the test reports that the patient has pancreatic cancer when in reality they do not. Possible consequences of this scenario would be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The patient would undergo unnecessary stress and psychological trauma on receiving the diagnosis.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;There would be costly follow up procedures like biopsies and CT scans in an attempt to determine the severity of the ailment.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The patient could develop side effects from the unnecessary treatments as cancer treatment takes a toll on the body.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The patient might also lose confidence in medical screening after noticing that skipped or missed treatment sessions caused no physical harm.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;&lt;strong&gt;Type II Error (False Negative)&lt;/strong&gt;&lt;/h2&gt;

&lt;p&gt;Here we would fail to reject the null hypothesis when the alternative is true: the test reports that the patient does not have pancreatic cancer when in fact they do. Possible consequences of this scenario would be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The diagnosis is delayed, allowing the cancer to progress to a later stage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Because the diagnosis comes only after the patient's health has deteriorated, survival chances are lower and treatment costs are higher.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;There is a real possibility of irreversible harm and ultimately death if the cancer is detected in the later stages.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;&lt;strong&gt;⚖️ The Trade-Off: What's Worse?&lt;/strong&gt;&lt;/h2&gt;

&lt;p&gt;Finding a balance can be tricky, since a &lt;strong&gt;low Type I error rate&lt;/strong&gt; (fewer false positives) comes at the cost of a &lt;strong&gt;higher Type II error rate&lt;/strong&gt;, and vice versa. In this setting a Type II error is &lt;strong&gt;far more dangerous&lt;/strong&gt; than a Type I error. Therefore, the test's decision threshold should be set for &lt;strong&gt;maximum sensitivity&lt;/strong&gt;, accepting a higher false positive rate (Type I) in order to catch as many true cases as possible.&lt;/p&gt;
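&lt;p&gt;A toy numeric sketch of this trade-off (the scores below are invented for illustration, not clinical data):&lt;/p&gt;

```python
# Hypothetical test scores for patients who do and do not have the disease.
sick_scores = [0.9, 0.7, 0.55, 0.4]    # true cases
healthy_scores = [0.6, 0.3, 0.2, 0.1]  # non-cases

def rates(threshold):
    # Sensitivity: share of true cases flagged (1 minus the Type II error rate).
    # False-positive rate: share of non-cases flagged (the Type I error rate).
    sensitivity = sum(s >= threshold for s in sick_scores) / len(sick_scores)
    fpr = sum(s >= threshold for s in healthy_scores) / len(healthy_scores)
    return sensitivity, fpr

strict = rates(0.65)   # high threshold: no false alarms, but misses cases
lenient = rates(0.35)  # low threshold: catches every case, more false alarms
```

&lt;p&gt;Lowering the threshold from 0.65 to 0.35 lifts sensitivity from 0.5 to 1.0 while the false-positive rate rises from 0.0 to 0.25 — exactly the exchange a maximum-sensitivity screening test makes.&lt;/p&gt;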

&lt;p&gt;In conclusion, in high stakes medical applications it is ethically justifiable to &lt;strong&gt;accept a higher Type I error rate&lt;/strong&gt; in order to minimize the risk of a &lt;strong&gt;Type II error&lt;/strong&gt;. The cost of missing a real diagnosis can be catastrophic while false positives can be managed through additional testing and monitoring.  &lt;/p&gt;

</description>
    </item>
    <item>
      <title>Stats Don’t Lie: Predicting the Premier League’s Top Winners</title>
      <dc:creator>anangwemike</dc:creator>
      <pubDate>Sun, 27 Jul 2025 13:54:12 +0000</pubDate>
      <link>https://dev.to/anangwemike/stats-dont-lie-predicting-the-premier-leagues-top-winners-kmk</link>
      <guid>https://dev.to/anangwemike/stats-dont-lie-predicting-the-premier-leagues-top-winners-kmk</guid>
      <description>&lt;p&gt;To determine the likelihood of Premier League teams winning matches in the 2024-2025 season, I analyzed match data from &lt;strong&gt;Football-Data.org API&lt;/strong&gt;. I then filtered out only &lt;strong&gt;completed matches&lt;/strong&gt; and calculated win probabilities using the formula:&lt;br&gt;
&lt;strong&gt;Win Probability (%) = (Number of Wins / Total Games Played) × 100&lt;/strong&gt;&lt;br&gt;
The result gives the percentage of games a team has won, giving a clear indication of the form and competitiveness&lt;/p&gt;
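&lt;p&gt;The formula translates directly into code (the example figures are illustrative, not the article's actual API results):&lt;/p&gt;

```python
def win_probability(wins, games_played):
    """Win Probability (%) = (Number of Wins / Total Games Played) x 100."""
    return round(wins / games_played * 100)

# e.g. a hypothetical team winning 24 of 38 completed matches
example = win_probability(24, 38)  # 63
```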

&lt;p&gt;🏆 Top Contenders: Who Will Likely Win the Most Games?&lt;br&gt;
Using the calculated win probabilities, teams most likely to win games throughout the season are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Liverpool FC&lt;br&gt;
With a win probability of 63%, Liverpool top the charts, showing tactical discipline, pressing intensity and consistent attacking prowess. With formidable performances both home and away, Liverpool are expected to maintain their push for the title.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Manchester City FC&lt;br&gt;
With a win probability of 59%, Manchester City come a close second. Known for dominance in possession and strategic flexibility, they remain strong favorites to win their remaining games.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Arsenal FC&lt;br&gt;
Arsenal sit third with a win probability of 56%. Their youthful yet mature squad, structured defense and pressing game keep them near the top of the table.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Dark horses and mid-table strength&lt;br&gt;
Newcastle United, Brentford and Brighton also posted win probabilities above 40%, showing their consistency and ability to challenge bigger clubs. They might not win the title, but on their day they can upset a title contender and shake up the race for the top four.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;⚠️ Teams Struggling to Win&lt;br&gt;
Clubs like Southampton FC, Ipswich Town FC and Leicester FC have the lowest win probabilities (below 20%). This suggests they are likely to struggle unless they significantly improve squad depth, tactics or morale. If this trend continues, they face a real danger of being dragged into a relegation battle.&lt;/p&gt;

&lt;p&gt;Beyond rankings, these win probabilities also reflect current form, tactical execution and consistency. High win-probability teams usually score more early goals, maintain better possession and tend to have strong defensive records. On the other hand, a low win probability can point to a leaky defense, poor finishing or tactical instability.&lt;/p&gt;

&lt;p&gt;Statistics may not guarantee future outcomes, but teams with a higher win probability based on the data are statistically more likely to keep winning. Keeping an eye on these metrics throughout the season can help analysts and fans make smarter decisions.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>UNDERSTANDING RELATIONSHIPS IN POWER BI</title>
      <dc:creator>anangwemike</dc:creator>
      <pubDate>Mon, 16 Jun 2025 11:35:11 +0000</pubDate>
      <link>https://dev.to/anangwemike/understanding-relationships-in-power-bi-2176</link>
      <guid>https://dev.to/anangwemike/understanding-relationships-in-power-bi-2176</guid>
      <description>&lt;p&gt;Relationships form an integral part of the data modeling process when using Power BI. They simplify the process of intertwining different tables of data, enabling them to work together seamlessly in visuals, calculations and reports. Gaining a clear understanding of these relationships is crucial to create accurate, efficient and scalable business intelligence solutions.&lt;/p&gt;

&lt;p&gt;Before delving deeper into the topic, one must understand what a relationship in Power BI actually is. Simply put, it defines how two tables are linked to each other through one or more columns. The linking columns must contain matching data values, such as ‘&lt;strong&gt;Customer ID&lt;/strong&gt;’ in both a ‘&lt;strong&gt;Customers&lt;/strong&gt;’ table and a ‘&lt;strong&gt;Sales&lt;/strong&gt;’ table. Once a relationship is formed between the two tables, Power BI can automatically combine and analyze data from both sources as though they were one.&lt;/p&gt;

&lt;p&gt;Types of relationships in Power BI can be grouped into three categories, namely:&lt;br&gt;
• &lt;strong&gt;One-to-many (1:*)&lt;/strong&gt; – The most common and widely used type, where one record in a table, for instance a customer, is related to multiple records in another table, like transactions.&lt;br&gt;
• &lt;strong&gt;Many-to-one (*:1)&lt;/strong&gt; – Similar to one-to-many but in reverse; that is, multiple records in one table are related to one record in another table.&lt;br&gt;
• &lt;strong&gt;Many-to-many (*:*)&lt;/strong&gt; – A more complex type that occurs when both tables have non-unique values. Power BI handles this using intermediate relationship tables or composite models.&lt;/p&gt;

&lt;p&gt;Relationships in Power BI primarily allow for data from multiple tables to be analyzed together without redundancy or manual combination. They have several key functions, namely;&lt;br&gt;
• &lt;strong&gt;Data Integrity&lt;/strong&gt; – They help maintain consistent data across visuals by matching values appropriately.&lt;br&gt;
• &lt;strong&gt;Simplified Reporting&lt;/strong&gt; – One can drag fields from different tables into visuals and Power BI will intelligently understand how to join them.&lt;br&gt;
• &lt;strong&gt;Accurate Calculations&lt;/strong&gt; – Measures like total sales, average profit or customer count depend on the underlying relationships to deliver correct results.&lt;br&gt;
• &lt;strong&gt;Improved Performance&lt;/strong&gt; – Instead of merging large tables with huge datasets, relationships allow for separate but connected data models, thus improving efficiency.&lt;/p&gt;

&lt;p&gt;In real world scenarios, relationships are used in nearly every Power BI solution in business, government or research settings. A few practical examples are:&lt;br&gt;
• &lt;strong&gt;Sales and Customer Analysis&lt;/strong&gt; – An organization can analyze sales trends by region, product or customer segment by linking a &lt;strong&gt;Sales&lt;/strong&gt; table to &lt;strong&gt;Products&lt;/strong&gt;, &lt;strong&gt;Customers&lt;/strong&gt; and &lt;strong&gt;Regions&lt;/strong&gt; tables. This synergy allows for metrics like sales by customer type or average order size per region to be seen.&lt;br&gt;
• &lt;strong&gt;Healthcare Reporting&lt;/strong&gt; – Hospitals can use Power BI to connect &lt;strong&gt;Patient Records&lt;/strong&gt;, with &lt;strong&gt;Lab Results&lt;/strong&gt;, &lt;strong&gt;Diagnoses&lt;/strong&gt; and &lt;strong&gt;Treatment Plans&lt;/strong&gt;. Relationships allow for a unified and simplified view of a patient’s journey across different departments.&lt;br&gt;
• &lt;strong&gt;Education Dashboards&lt;/strong&gt; – A university can link &lt;strong&gt;Students&lt;/strong&gt;, &lt;strong&gt;Courses&lt;/strong&gt; and &lt;strong&gt;Performance Data&lt;/strong&gt;. Relationships can analyze course outcomes based on demographics or teaching staff.&lt;br&gt;
• &lt;strong&gt;Financial Reporting&lt;/strong&gt; – In finance, linking &lt;strong&gt;Budget&lt;/strong&gt; tables with &lt;strong&gt;Actual Expenditure&lt;/strong&gt; and &lt;strong&gt;Forecast&lt;/strong&gt; tables allows for variance analysis and KPI monitoring in a clear, concise and scalable way.&lt;/p&gt;

&lt;p&gt;In conclusion, relationships in Power BI serve as an invisible thread binding data together into a coherent and interactive model. They are vital for accurate analysis, powerful reporting and a seamless user experience. Whether you are building dashboards for sales performance or government service delivery, mastering relationships is the bedrock of effective data storytelling in Power BI.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>EXCEL IN THE REAL WORLD</title>
      <dc:creator>anangwemike</dc:creator>
      <pubDate>Wed, 11 Jun 2025 08:30:15 +0000</pubDate>
      <link>https://dev.to/anangwemike/excel-in-the-real-world-1dl9</link>
      <guid>https://dev.to/anangwemike/excel-in-the-real-world-1dl9</guid>
      <description>&lt;p&gt;Microsoft Excel is not just a simple spreadsheet tool; but rather a powerful and widely used tool for data analysis in real life applications. With practical applications in engineering, finance, marketing, healthcare or education, Excel plays a pivotal role in helping organizations transform raw, complex data to a format that is easily understandable which in turn improves company insights and decision-making processes.&lt;/p&gt;

&lt;p&gt;One of the most common uses of Excel is the organization of raw data. Hospitals, schools, banks and other institutions face a similar challenge: they all deal with large datasets, from patient records and sales transactions to employee and student registration records. Excel structures this data in tables, allowing for easier sorting, filtering and navigation. Data cleaning features like Find &amp;amp; Replace, Remove Duplicates and Text to Columns allow one to prepare data for analysis quickly and accurately.&lt;/p&gt;

&lt;p&gt;The built-in formulas and functions make Excel an ideal platform for performing calculations and statistical analysis. Whether it's basic arithmetic or advanced statistical operations like standard deviation or regression analysis, Excel fully supports data-driven insights. Functions like SUMIFS, AVERAGEIF, VLOOKUP and IF statements make it possible to uncover trends, compare variables and build decision-making logic.&lt;/p&gt;

&lt;p&gt;Charts and graphs are essential tools in data analysis that help break down complex and large datasets into a simple, easy-to-navigate visual format. Excel offers a wide variety of chart types: pie charts, column charts, scatter plots and many more. With a few clicks, one can turn massive amounts of data into visual insights that communicate findings easily to others. Dashboards combining multiple charts, tables and slicers can be built in Excel to create interactive reports for investors and shareholders.&lt;/p&gt;

&lt;p&gt;Excel is highly useful in forecasting trends, analyzing customer behavior, evaluating financial performance and monitoring business KPIs. For instance, marketing teams can track campaign performance and ROI, operations teams can analyze inventory levels and delivery efficiency. Financial analysts use Excel for budgeting, forecasting and investment analysis.&lt;/p&gt;

&lt;p&gt;In conclusion, Excel remains a foundational, easily accessible and flexible tool for real-world data analysis. Its wide range of applications, from data organization and formula-based analysis to visualization and interactivity, makes it valuable across many industries. More specialized tools like Power BI, SQL or Python exist for large-scale datasets, but Excel remains the go-to starting point for professionals who need to make sense of data and drive informed decisions.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
