<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Josiah Nyamai</title>
    <description>The latest articles on DEV Community by Josiah Nyamai (@joe_siah).</description>
    <link>https://dev.to/joe_siah</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2401081%2F894303aa-35ec-4fc5-a7e4-5949619d800a.jpg</url>
      <title>DEV Community: Josiah Nyamai</title>
      <link>https://dev.to/joe_siah</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/joe_siah"/>
    <language>en</language>
    <item>
      <title>Supervised Learning — The Heart of Modern AI</title>
      <dc:creator>Josiah Nyamai</dc:creator>
      <pubDate>Mon, 25 Aug 2025 05:29:44 +0000</pubDate>
      <link>https://dev.to/joe_siah/supervised-learning-the-heart-of-modern-ai-gpe</link>
      <guid>https://dev.to/joe_siah/supervised-learning-the-heart-of-modern-ai-gpe</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;“If data is the new oil, supervised learning is the engine that refines it.”&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Artificial Intelligence and Machine Learning (AI/ML) are transforming industries — from healthcare to finance to entertainment. At the core of most of these intelligent systems lies a foundational technique called Supervised Learning.&lt;/p&gt;

&lt;p&gt;Whether you’re a data scientist in training, a software developer branching into ML, or just curious about how machines “learn,” this guide is for you. We’ll explore what supervised learning is, how it works, common algorithms, real-world use-cases, and even write a little code.&lt;/p&gt;

&lt;h1&gt;
  
  
  📌 What Is Supervised Learning?
&lt;/h1&gt;

&lt;p&gt;Supervised learning is a type of machine learning where the model is trained using labeled data.&lt;/p&gt;

&lt;p&gt;That means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;You give the algorithm input data (X) and the correct output (y).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The algorithm tries to learn the mapping between inputs and outputs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Once trained, it can predict the output for new, unseen inputs.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;📦 Think of it like this:&lt;br&gt;
You’re the teacher. You give the model a bunch of math problems (inputs) and answers (labels). Over time, the model learns how to solve similar problems on its own.&lt;/p&gt;
&lt;h1&gt;
  
  
  🎯 The Goal
&lt;/h1&gt;

&lt;p&gt;The goal of supervised learning is to minimize the error between the predicted output and the actual (true) output. It does this by adjusting internal parameters (called weights) during training.&lt;/p&gt;
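
&lt;p&gt;To make "adjusting weights" concrete, here is a minimal NumPy sketch (the toy data and learning rate are invented for illustration) that fits a one-feature linear model by gradient descent on squared error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

# Toy data: one input feature (X) and the true outputs (y)
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.1, 5.9, 8.2])

w, b = 0.0, 0.0   # the internal parameters ("weights") to learn
lr = 0.01         # learning rate

for epoch in range(1000):
    y_pred = w * X + b    # current predictions
    error = y_pred - y    # how far off we are
    # Gradients of mean squared error with respect to w and b
    w -= lr * 2 * np.mean(error * X)
    b -= lr * 2 * np.mean(error)

print(f"Learned w={w:.2f}, b={b:.2f}")  # approaches w ≈ 2, b ≈ 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;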
&lt;h2&gt;
  
  
  🧩 Types of Supervised Learning
&lt;/h2&gt;

&lt;p&gt;There are two major branches:&lt;/p&gt;
&lt;h3&gt;
  
  
  1️⃣ Regression
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Output: Continuous values (e.g., real numbers)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Goal: Predict “how much” or “how many”&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Predicting house prices 🏠&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Forecasting stock prices 📈&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Estimating temperature 🌡️&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🧮 Output Example: y = 250,000 (price in USD)&lt;/p&gt;
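
&lt;p&gt;A quick hedged illustration of producing such a continuous output with scikit-learn (the sizes and prices are invented):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.linear_model import LinearRegression

# Invented data: house size in square feet vs. price in USD
sizes = [[800], [1000], [1500], [2000]]
prices = [120_000, 150_000, 210_000, 275_000]

model = LinearRegression()
model.fit(sizes, prices)

# Predict the price of an unseen 1,800 sq ft house
print(model.predict([[1800]]))  # a continuous value near 250,000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;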
&lt;h3&gt;
  
  
  2️⃣ Classification
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Output: Discrete categories or classes&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Goal: Predict “which class” an input belongs to&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Spam or Not Spam 📧&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cat vs. Dog 🐱🐶&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Disease diagnosis (positive/negative) 🧬&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🎯 Output Example: y = "Spam"&lt;/p&gt;
&lt;h1&gt;
  
  
  🧠 How Does It Work? (Step-by-Step)
&lt;/h1&gt;

&lt;p&gt;Here’s the general pipeline of supervised learning:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Collect Data&lt;br&gt;
Gather labeled examples: each has input features (X) and a known label (y).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Split the Data&lt;br&gt;
Training set (usually ~70–80%)&lt;br&gt;
Test set (~20–30%)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Choose an Algorithm&lt;br&gt;
Decide what type of model you want to train (e.g., Linear Regression, Decision Tree, etc.).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Train the Model&lt;br&gt;
Feed the training data into the algorithm so it learns patterns.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Evaluate the Model&lt;br&gt;
Test the model on unseen data and measure performance using metrics like accuracy, precision, recall, RMSE, etc.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tune &amp;amp; Improve&lt;br&gt;
Adjust parameters, try different algorithms, add more data, etc.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  🧮 Common Algorithms in Supervised Learning
&lt;/h2&gt;

&lt;p&gt;Here are some popular supervised learning algorithms:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Algorithm&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Use-case Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Linear Regression&lt;/td&gt;
&lt;td&gt;Regression&lt;/td&gt;
&lt;td&gt;Predicting house prices&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Logistic Regression&lt;/td&gt;
&lt;td&gt;Classification&lt;/td&gt;
&lt;td&gt;Spam detection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Decision Trees&lt;/td&gt;
&lt;td&gt;Both&lt;/td&gt;
&lt;td&gt;Customer segmentation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Random Forest&lt;/td&gt;
&lt;td&gt;Both&lt;/td&gt;
&lt;td&gt;Credit scoring&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Support Vector Machines (SVM)&lt;/td&gt;
&lt;td&gt;Both&lt;/td&gt;
&lt;td&gt;Face recognition&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;K-Nearest Neighbors (KNN)&lt;/td&gt;
&lt;td&gt;Both&lt;/td&gt;
&lt;td&gt;Medical diagnosis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gradient Boosting (XGBoost, LightGBM)&lt;/td&gt;
&lt;td&gt;Both&lt;/td&gt;
&lt;td&gt;Fraud detection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Neural Networks&lt;/td&gt;
&lt;td&gt;Both&lt;/td&gt;
&lt;td&gt;Image classification, speech analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h2&gt;
  
  
  🧪 Quick Python Example
&lt;/h2&gt;

&lt;p&gt;Let’s use a simple example: predicting whether a person will buy a product based on their age and income.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Step 1: Sample dataset
data = pd.DataFrame({
    'age': [22, 25, 47, 52, 46, 56, 55, 60],
    'income': [15000, 29000, 48000, 60000, 52000, 65000, 58000, 72000],
    'buy': ['No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes']
})

X = data[['age', 'income']]
y = data['buy']

# Step 2: Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Step 3: Train model
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

# Step 4: Predict
y_pred = clf.predict(X_test)

# Step 5: Evaluate
print("Predictions:", y_pred)
print("Accuracy:", accuracy_score(y_test, y_pred))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
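
&lt;p&gt;One caveat: with only eight rows, a 25% test split leaves just two test samples, so the printed accuracy is illustrative rather than meaningful. Real projects need far more data before these metrics say much.&lt;/p&gt;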



&lt;h2&gt;
  
  
  📊 How Do We Measure Performance?
&lt;/h2&gt;

&lt;p&gt;Metrics depend on the task:&lt;/p&gt;

&lt;h3&gt;
  
  
  ✅ For Classification:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Accuracy – % of correct predictions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Precision – Of the predicted positives, how many were correct?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Recall – Of the actual positives, how many were found?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;F1-score – Balance between precision &amp;amp; recall&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Confusion Matrix – Table showing TP, FP, TN, FN&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
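
&lt;p&gt;Each of these is a one-liner in scikit-learn. A minimal sketch, reusing y_test and y_pred from the decision-tree example above (pos_label names the class treated as "positive"):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

print("Accuracy: ", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, pos_label='Yes'))
print("Recall:   ", recall_score(y_test, y_pred, pos_label='Yes'))
print("F1-score: ", f1_score(y_test, y_pred, pos_label='Yes'))
print("Confusion matrix:")
print(confusion_matrix(y_test, y_pred))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;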

&lt;h3&gt;
  
  
  📈 For Regression:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Mean Squared Error (MSE)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Root Mean Squared Error (RMSE)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Mean Absolute Error (MAE)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;R² Score (Coefficient of Determination)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
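
&lt;p&gt;These are equally short in scikit-learn; a sketch with invented true and predicted values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = [250_000, 180_000, 320_000]   # invented actual prices
y_hat = [240_000, 200_000, 310_000]    # invented predictions

mse = mean_squared_error(y_true, y_hat)
print("MSE: ", mse)
print("RMSE:", np.sqrt(mse))  # back in the target's units
print("MAE: ", mean_absolute_error(y_true, y_hat))
print("R²:  ", r2_score(y_true, y_hat))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;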

&lt;h2&gt;
  
  
  🏭 Real-World Applications
&lt;/h2&gt;

&lt;p&gt;Supervised learning is used virtually everywhere:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Domain&lt;/th&gt;
&lt;th&gt;Application Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Healthcare&lt;/td&gt;
&lt;td&gt;Disease prediction, drug response modeling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Finance&lt;/td&gt;
&lt;td&gt;Credit scoring, fraud detection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Marketing&lt;/td&gt;
&lt;td&gt;Customer churn prediction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Retail&lt;/td&gt;
&lt;td&gt;Product recommendation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agriculture&lt;/td&gt;
&lt;td&gt;Crop disease classification&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Transportation&lt;/td&gt;
&lt;td&gt;Traffic flow prediction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Email&lt;/td&gt;
&lt;td&gt;Spam detection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NLP&lt;/td&gt;
&lt;td&gt;Sentiment analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  ⚠️ Challenges &amp;amp; Limitations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Need for labeled data: Labeled data is often expensive or time-consuming to get.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Overfitting: Model memorizes training data but fails on new data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Bias in data: Garbage in, garbage out — biased data leads to biased models.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Computational cost: Some algorithms are slow with large datasets.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  ✅ Tips for Success
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;🧹 Clean your data. Missing values, duplicates, and wrong types can ruin your model.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;📊 Explore your data using visualizations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;📦 Use scikit-learn or other libraries to avoid reinventing the wheel.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;🧪 Experiment! Try multiple algorithms and compare results.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;⚖️ Balance your dataset when classes are imbalanced (especially in classification).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🧠 TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Supervised learning uses labeled data to train models to predict outcomes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It has two main branches: Regression (continuous outputs) and Classification (categorical outputs).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It's used in almost every industry today.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;With Python and scikit-learn, you can build supervised models in just a few lines.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🙌 Conclusion
&lt;/h2&gt;

&lt;p&gt;Supervised learning is the bread and butter of modern AI. From predicting your next Netflix show to detecting credit card fraud, it’s the quiet workhorse behind the scenes.&lt;/p&gt;

&lt;p&gt;If you're starting out in machine learning, mastering supervised learning is non-negotiable. Once you understand the concepts and build a few models, you’ll unlock a whole new world of intelligent applications.&lt;/p&gt;

&lt;p&gt;Happy learning, and may your loss always go down 📉 and your accuracy go up 📈!&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>python</category>
      <category>supervisedlearning</category>
    </item>
    <item>
      <title>From Messy to Meaningful: Cleaning and ETL on a Real-World Cancer Lab Dataset</title>
      <dc:creator>Josiah Nyamai</dc:creator>
      <pubDate>Mon, 28 Jul 2025 10:34:46 +0000</pubDate>
      <link>https://dev.to/joe_siah/from-messy-to-meaningful-cleaning-and-etl-on-a-real-world-cancer-lab-dataset-1680</link>
      <guid>https://dev.to/joe_siah/from-messy-to-meaningful-cleaning-and-etl-on-a-real-world-cancer-lab-dataset-1680</guid>
      <description>&lt;p&gt;Real-world datasets rarely come neat and tidy. During a recent technical assessment for a Data Engineering &amp;amp; AI position, I was challenged to transform a messy, inconsistent cancer diagnostics dataset into clean, structured data ready for analysis and potential machine learning applications. This article walks through the steps I took to clean the data, standardize values, and build an ETL pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  🧪 The Dataset
&lt;/h2&gt;

&lt;p&gt;The dataset, sourced from Kaggle, simulates test orders in a cancer diagnostics lab. It includes records across departments like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Histology&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cytology&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Haematology&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Immunohistochemistry&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But just like in the real world, the data was messy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Missing values&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Invalid or mixed date formats&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Typos and inconsistent labels&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Non-numeric prices&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ambiguous values (e.g., “N/A”, “Unknown”)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🔍 Step 1: Data Cleaning &amp;amp; Standardization
&lt;/h2&gt;

&lt;p&gt;I started by downloading the dataset directly from Kaggle using the kagglehub library, then loaded it into a pandas DataFrame for inspection.&lt;/p&gt;

&lt;h3&gt;
  
  
  ✅ Importing and Inspecting the Data
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os

import kagglehub
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Download the dataset and keep the returned directory,
# instead of hard-coding a machine-specific cache path
path = kagglehub.dataset_download("eustusmurea/labtest-dataset")
print("Path to dataset files:", path)

data = pd.read_csv(os.path.join(path, "messy_cancer_lab_dataset.csv"))
data.head()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gave me a first look at the structure of the dataset — including messy values, inconsistent formats, and missing data — and laid the groundwork for cleaning and transformation.&lt;/p&gt;

&lt;h3&gt;
  
  
  📅 Fixing Date Columns
&lt;/h3&gt;

&lt;p&gt;Many rows had invalid or mixed date formats. I converted them using errors='coerce', which sets invalid entries to NaT.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;date_columns = ['creation_date', 'signout_date']
for col in date_columns:
    data[col] = pd.to_datetime(data[col], format='mixed', errors='coerce')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Then I filled missing dates with a safe placeholder: 2020-01-01.&lt;/li&gt;
&lt;/ul&gt;
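
&lt;p&gt;In sketch form, that fill is one line per column (assuming the placeholder date above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Replace NaT entries with the safe placeholder date
for col in date_columns:
    data[col] = data[col].fillna(pd.Timestamp('2020-01-01'))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;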

&lt;h3&gt;
  
  
  💵 Cleaning the Price Column
&lt;/h3&gt;

&lt;p&gt;Prices came in multiple formats like "KES 2,500" or "ksh3000". I created a custom function to clean these values and convert them into floats.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def clean_price(value):
    # Strip currency markers ("ksh", "KES") and thousands separators,
    # then coerce to float; anything unparseable becomes None
    if pd.isnull(value):
        return None
    value = str(value).lower().replace('ksh', '').replace('kes', '').replace(',', '').strip()
    try:
        return float(value)
    except ValueError:
        return None

data['price'] = data['price'].apply(clean_price)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;After confirming there were no outliers via a boxplot, I filled missing values with the mean.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sns.boxplot(y='price', data=data)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data['price'] = data['price'].fillna(data['price'].mean())
data.info()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  🧹 Filling Missing Object Values
&lt;/h3&gt;

&lt;p&gt;For other columns with object dtype (e.g., text columns), I replaced nulls with "Unknown":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for col in data.select_dtypes(include='object').columns:
    data[col] = data[col].fillna('Unknown')

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  🏷️ Standardizing Text Columns
&lt;/h3&gt;

&lt;p&gt;Inconsistent labels like "St. Marys" and "st. mary's oncology" needed normalization. I used .replace() and string functions for standardization.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Normalize whitespace and casing first, then map known variants
# (running .str.title() after the replace would mangle apostrophes:
# "St. Mary's Oncology" would become "St. Mary'S Oncology").
data['facility'] = data['facility'].str.strip().str.title()
data['facility'] = data['facility'].replace({
    'St. Marys': "St. Mary's Oncology",
    "St. Mary'S Oncology": "St. Mary's Oncology"
})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;I applied similar methods to clean up categories, tests, and other text columns.&lt;/li&gt;
&lt;/ul&gt;
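
&lt;p&gt;In sketch form, the same strip/title pattern extends naturally (the column names here mirror the lab_tests schema shown later in this post):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Apply the same normalization to other text columns
for col in ['test_category', 'sub_category', 'service']:
    data[col] = data[col].str.strip().str.title()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;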

&lt;h2&gt;
  
  
  ⚙️ Step 2: ETL Pipeline
&lt;/h2&gt;

&lt;p&gt;After transforming the raw data, I moved on to building the ETL pipeline. The final step involved loading the clean dataset into a PostgreSQL database — making it ready for querying, reporting, or integration with BI tools.&lt;/p&gt;

&lt;h3&gt;
  
  
  🛢️ Load: Saving to PostgreSQL
&lt;/h3&gt;

&lt;p&gt;🔹 Step 1: Export Cleaned Data to CSV&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data.to_csv("Cleaned_dataset.csv", index=False)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Establish PostgreSQL Connection
&lt;/h3&gt;

&lt;p&gt;I used SQLAlchemy and psycopg2 to securely connect to a PostgreSQL database using environment variables for credentials.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from dotenv import load_dotenv
import os
import psycopg2
import pandas as pd
from sqlalchemy import create_engine

# Load .env variables
load_dotenv()

# 1. Connect to Postgres using psycopg2
connection = psycopg2.connect(  
    host=os.getenv("db_host"),
    database=os.getenv("db_name"),
    user=os.getenv("db_user"),
    password=os.getenv("db_pass"),
    port=os.getenv("db_port")   
)
cursor = connection.cursor()

# Create a table in the database
create_table_query = """
CREATE TABLE IF NOT EXISTS lab_tests (
    order_id TEXT PRIMARY KEY,
    signout_date DATE,
    lab_number TEXT,
    assigned_to_pathologist TEXT,
    patient_name TEXT,
    facility TEXT,
    test_category TEXT,
    test TEXT,
    sub_category TEXT,
    receiving_centers TEXT,
    processing_centers TEXT,
    creation_date DATE,
    creation_year TEXT,
    creation_weekday TEXT,
    service TEXT,
    price FLOAT,
    payment_method TEXT,
    delay_days INT
);
"""
cursor.execute(create_table_query)
connection.commit()

# 2. Create SQLAlchemy engine and use pandas.to_sql()
db_url = f"postgresql://{os.getenv('db_user')}:{os.getenv('db_pass')}@" \
         f"{os.getenv('db_host')}:{os.getenv('db_port')}/{os.getenv('db_name')}"

engine = create_engine(db_url)

# Load DataFrame into PostgreSQL; 'append' keeps the explicit schema
# created above (if_exists='replace' would drop and recreate the table
# from inferred dtypes, discarding the primary key)
data.to_sql('lab_tests', engine, if_exists='append', index=False)

print("Data successfully loaded into PostgreSQL database.")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🤖 Bonus: ML Use Case Proposal
&lt;/h2&gt;

&lt;p&gt;As part of the assessment, I proposed an ML task:&lt;/p&gt;

&lt;h3&gt;
  
  
  🎯 Predicting Lab Test Delays
&lt;/h3&gt;

&lt;p&gt;Target: Days between creation_date and signout_date&lt;/p&gt;

&lt;p&gt;Features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Facility&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Test category&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Price&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Creation weekday/month&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Models: Linear Regression, XGBoost&lt;br&gt;
Metrics: MAE, RMSE, R²&lt;/p&gt;
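
&lt;p&gt;A hedged baseline sketch of that task, assuming the cleaned data DataFrame and the column names from the lab_tests schema above (the preprocessing choices here are illustrative, not part of the assessment):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# One-hot encode the categorical features; keep price numeric
features = pd.get_dummies(data[['facility', 'test_category', 'creation_weekday']])
features['price'] = data['price']
target = data['delay_days']  # assumed already computed and numeric

X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

print("MAE: ", mean_absolute_error(y_test, pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, pred)))
print("R²:  ", r2_score(y_test, pred))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;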

&lt;h2&gt;
  
  
  📌 GitHub Repository
&lt;/h2&gt;

&lt;p&gt;You can find the full code, cleaned dataset, and pipeline on GitHub:&lt;/p&gt;

&lt;h3&gt;
  
  
  🔗 &lt;a href="https://github.com/josiahnyamai/AI-Data-Engineering-Project" rel="noopener noreferrer"&gt;GitHub – Cancer Lab ETL &amp;amp; Cleaning Project&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Feel free to clone it, run it, or build on top of it.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>📊 Understanding Measures of Central Tendency and Their Importance in Data Science</title>
      <dc:creator>Josiah Nyamai</dc:creator>
      <pubDate>Sat, 26 Jul 2025 06:57:17 +0000</pubDate>
      <link>https://dev.to/joe_siah/understanding-measures-of-central-tendency-and-their-importance-in-data-science-242n</link>
      <guid>https://dev.to/joe_siah/understanding-measures-of-central-tendency-and-their-importance-in-data-science-242n</guid>
      <description>&lt;p&gt;When diving into the world of data science, one of the first statistical concepts you encounter is measures of central tendency. These foundational tools help data scientists make sense of data by summarizing it with a single representative value. Whether you’re building machine learning models, cleaning data, or generating business insights, understanding central tendency is crucial.&lt;/p&gt;

&lt;p&gt;In this article, we’ll explore what measures of central tendency are, how they work, and why they are vital in data science.&lt;/p&gt;

&lt;h2&gt;
  
  
  🔍 What Are Measures of Central Tendency?
&lt;/h2&gt;

&lt;p&gt;Measures of central tendency are statistical metrics used to determine the center or typical value of a dataset. They give a sense of where data points tend to cluster and are useful for summarizing large datasets in a meaningful way.&lt;/p&gt;

&lt;p&gt;There are three main measures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Mean – The arithmetic average&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Median – The middle value&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Mode – The most frequently occurring value&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s break each one down.&lt;/p&gt;

&lt;h3&gt;
  
  
  📐 1. Mean (Arithmetic Average)
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Formula:
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Mean = ∑𝑥/𝑛
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Where:
&lt;/h4&gt;

&lt;p&gt;∑x = sum of all values&lt;br&gt;
n = number of values&lt;/p&gt;

&lt;p&gt;The mean is widely used due to its simplicity and mathematical properties. However, it can be sensitive to outliers.&lt;/p&gt;
&lt;h4&gt;
  
  
  Example:
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Consider the ages: [22, 24, 25, 23, 100]
Mean = (22 + 24 + 25 + 23 + 100) / 5 = 38.8
In this case, the mean is skewed by the outlier (100).
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  🔢 2. Median (Middle Value)
&lt;/h3&gt;

&lt;p&gt;The median is the middle number when the dataset is ordered. If there’s an even number of values, it’s the average of the two middle numbers.&lt;/p&gt;
&lt;h4&gt;
  
  
  Example:
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[22, 23, 24, 25, 100]
Median = 24
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The median is more robust to outliers and gives a better sense of central location in skewed data.&lt;/p&gt;
&lt;h3&gt;
  
  
  🔁 3. Mode (Most Frequent Value)
&lt;/h3&gt;

&lt;p&gt;The mode represents the value that occurs most often in the dataset.&lt;/p&gt;
&lt;h4&gt;
  
  
  Example:
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[22, 23, 23, 24, 25]
Mode = 23
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Mode is especially useful for categorical data, where mean and median may not be applicable.&lt;/p&gt;
&lt;h2&gt;
  
  
  📌 Why Are Measures of Central Tendency Important in Data Science?
&lt;/h2&gt;

&lt;p&gt;Understanding and correctly applying these measures can lead to better decisions, cleaner data, and more accurate models. Here's how they impact the data science workflow:&lt;/p&gt;
&lt;h3&gt;
  
  
  🧹 1. Data Cleaning &amp;amp; Preprocessing
&lt;/h3&gt;

&lt;p&gt;Missing data is a common issue in real-world datasets. Measures of central tendency are often used to impute missing values. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Use the mean to fill in missing numerical values.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use the mode to impute missing categorical data.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This ensures the dataset remains useful and statistically representative.&lt;/p&gt;
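
&lt;p&gt;In pandas this imputation is a couple of lines; a minimal sketch with a hypothetical toy DataFrame (the column names are invented):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

# Hypothetical toy frame: 'age' is numeric, 'city' is categorical
df = pd.DataFrame({
    'age': [22, None, 25, 23],
    'city': ['Nairobi', 'Mombasa', None, 'Nairobi']
})

df['age'] = df['age'].fillna(df['age'].mean())        # mean for numbers
df['city'] = df['city'].fillna(df['city'].mode()[0])  # mode for categories
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;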
&lt;h3&gt;
  
  
  📊 2. Exploratory Data Analysis (EDA)
&lt;/h3&gt;

&lt;p&gt;During EDA, understanding the distribution of data is key. Central tendency measures help:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Detect skewness&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Identify outliers&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Compare different features&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A quick look at mean and median can reveal whether data is symmetrical, left-skewed, or right-skewed, helping you choose the right transformation methods.&lt;/p&gt;
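
&lt;p&gt;A quick sketch of that check, reusing the ages from the earlier example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

ages = pd.Series([22, 24, 25, 23, 100])

# When the mean sits well above the median, the data is right-skewed
print("Mean:  ", ages.mean())    # 38.8
print("Median:", ages.median())  # 24.0
print("Mean minus median:", ages.mean() - ages.median())  # positive here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;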
&lt;h3&gt;
  
  
  📈 3. Feature Engineering
&lt;/h3&gt;

&lt;p&gt;When engineering new features for machine learning models, it’s common to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Normalize data using the mean&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Replace noisy data with the median&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Create flags or indicators based on the mode&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These practices help improve model accuracy and interpretability.&lt;/p&gt;
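
&lt;p&gt;For instance, z-score scaling is built directly from the mean (a generic sketch, reusing the toy df from the imputation example above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Z-score: center on the mean, scale by the standard deviation
df['age_scaled'] = (df['age'] - df['age'].mean()) / df['age'].std()
print(df[['age', 'age_scaled']])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;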
&lt;h3&gt;
  
  
  🤖 4. Model Evaluation &amp;amp; Bias Detection
&lt;/h3&gt;

&lt;p&gt;Understanding the central tendencies of predictions and actual values can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Reveal systematic bias&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Help diagnose model drift&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Support performance comparison across different segments&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For instance, if the mean prediction is significantly higher than the mean actual, your model may be overestimating.&lt;/p&gt;
&lt;h3&gt;
  
  
  🧠 5. Communicating Insights
&lt;/h3&gt;

&lt;p&gt;Data scientists are often required to present their findings to non-technical stakeholders. Measures of central tendency are intuitive, easy to understand, and widely accepted in business environments.&lt;/p&gt;
&lt;h4&gt;
  
  
  Example:
&lt;/h4&gt;

&lt;p&gt;“The average customer age is 34”&lt;br&gt;
is much more digestible than&lt;br&gt;
“Customer age is normally distributed with a standard deviation of 6.2”.&lt;/p&gt;
&lt;h2&gt;
  
  
  🚫 Common Pitfalls
&lt;/h2&gt;

&lt;p&gt;While useful, measures of central tendency can be misleading if not used appropriately:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Skewed Distributions: In heavily skewed data, mean might not represent the "center" accurately. Prefer the median.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Multimodal Data: If a dataset has multiple peaks, relying on a single mode can oversimplify.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Outliers: Outliers can distort the mean drastically. Always visualize your data before relying solely on these measures.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  🛠 Tools in Python
&lt;/h2&gt;

&lt;p&gt;In Python, libraries like pandas and numpy make it easy to calculate these measures:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

data = pd.Series([22, 24, 25, 23, 100])

mean = data.mean()
median = data.median()
mode = data.mode().values[0]

print(f"Mean: {mean}")
print(f"Median: {median}")
print(f"Mode: {mode}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🧩 Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Measures of central tendency are simple yet powerful tools in a data scientist's toolkit. Whether you're just getting started or working on advanced models, they provide essential insights into your data's structure and behavior.&lt;/p&gt;

&lt;p&gt;Always pair these statistics with data visualization (like histograms or box plots) for a more complete understanding. The better you understand your data's center, the more informed your analyses and decisions will be.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>statistics</category>
      <category>python</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>🧑‍💻 Web Scraping Indeed Remote Jobs and Storing in PostgreSQL</title>
      <dc:creator>Josiah Nyamai</dc:creator>
      <pubDate>Tue, 17 Jun 2025 13:31:31 +0000</pubDate>
      <link>https://dev.to/joe_siah/web-scraping-indeed-remote-jobs-and-storing-in-postgresql-35ca</link>
      <guid>https://dev.to/joe_siah/web-scraping-indeed-remote-jobs-and-storing-in-postgresql-35ca</guid>
      <description>&lt;h2&gt;
  
  
  📌 Project Overview
&lt;/h2&gt;

&lt;p&gt;In this project, I built a Python-based pipeline to automate the collection of remote job listings from Indeed. The project involved using Selenium and BeautifulSoup for web scraping, cleaning the extracted data with Pandas, and storing the final structured dataset into a PostgreSQL database for analysis or reporting.&lt;/p&gt;

&lt;p&gt;This article walks you through each step of the pipeline, including:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Setting up the environment&lt;/li&gt;
&lt;li&gt;Automating job searches with Selenium&lt;/li&gt;
&lt;li&gt;Parsing job data using BeautifulSoup&lt;/li&gt;
&lt;li&gt;Data cleaning and transformation&lt;/li&gt;
&lt;li&gt;Storing the final dataset into PostgreSQL&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  🛠️ Step 1: Setting Up the Environment
&lt;/h2&gt;

&lt;p&gt;First, I installed the necessary libraries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;!pip install selenium beautifulsoup4 pandas psycopg2-binary

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I also downloaded the appropriate WebDriver (e.g., ChromeDriver) and ensured it was added to the system PATH.&lt;/p&gt;
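
&lt;p&gt;If the driver is not on the PATH, Selenium 4 can be pointed at it explicitly; a small sketch (the path below is a placeholder):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Placeholder path; adjust to wherever chromedriver lives on your machine
service = Service("/path/to/chromedriver")
driver = webdriver.Chrome(service=service)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;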

&lt;h2&gt;
  
  
  🌐 Step 2: Navigating to the Website with Selenium
&lt;/h2&gt;

&lt;p&gt;Using Selenium, I navigated to the Indeed search page and triggered a search for remote jobs in tech-related fields:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome()
driver.get("https://www.indeed.com")

# Search for remote jobs
search_job = driver.find_element(By.NAME, "q")
search_job.send_keys("Data Analyst Remote")
search_job.submit()

time.sleep(5)  # Wait for the results to load

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🧩 Step 3: Parsing the HTML with BeautifulSoup
&lt;/h2&gt;

&lt;p&gt;After the search results loaded, I passed the page source to BeautifulSoup to extract job details like title, company, location, and summary:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from bs4 import BeautifulSoup

soup = BeautifulSoup(driver.page_source, "html.parser")
job_cards = soup.find_all("div", class_="job_seen_beacon")

jobs = []

def grab(card, tag, cls):
    # Guard each lookup: sponsored or malformed cards can miss a field,
    # and calling .text on None would raise an AttributeError
    el = card.find(tag, class_=cls)
    return el.text.strip() if el else ""

for card in job_cards:
    title = grab(card, "h2", "jobTitle")
    company = grab(card, "span", "companyName")
    location = grab(card, "div", "companyLocation")
    summary = grab(card, "div", "job-snippet")
    jobs.append([title, company, location, summary])

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once done, I closed the browser:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;driver.quit()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  📊 Step 4: Storing Data in a DataFrame and Cleaning It
&lt;/h2&gt;

&lt;p&gt;I converted the list of jobs into a Pandas DataFrame and performed light cleaning:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

df = pd.DataFrame(jobs, columns=["Job Title", "Company", "Location", "Summary"])
df.drop_duplicates(inplace=True)
df.head()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🗃️ Step 5: Storing the Data in PostgreSQL
&lt;/h2&gt;

&lt;p&gt;Finally, I connected to a PostgreSQL database using psycopg2 and inserted the data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import psycopg2

conn = psycopg2.connect(
    host="localhost",
    database="job_scraper",
    user="your_username",
    password="your_password"
)
cur = conn.cursor()

cur.execute("""
CREATE TABLE IF NOT EXISTS remote_jobs (
    id SERIAL PRIMARY KEY,
    job_title TEXT,
    company TEXT,
    location TEXT,
    summary TEXT
)
""")

for index, row in df.iterrows():
    cur.execute("""
        INSERT INTO remote_jobs (job_title, company, location, summary)
        VALUES (%s, %s, %s, %s)
    """, (row['Job Title'], row['Company'], row['Location'], row['Summary']))

conn.commit()
cur.close()
conn.close()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  📈 Conclusion
&lt;/h2&gt;

&lt;p&gt;This project demonstrates how web scraping can automate data collection from job platforms and store insights in a relational database for deeper analysis. Whether you're building a job analytics dashboard or tracking market demand, this approach scales well and integrates with tools like Power BI or Tableau.&lt;/p&gt;

&lt;h2&gt;
  
  
  🔗 Explore the Full Code on GitHub
&lt;/h2&gt;

&lt;p&gt;Want to see the full source code for this project, including the complete Jupyter Notebook, scraping logic, and PostgreSQL integration?&lt;/p&gt;

&lt;p&gt;👉 Check it out here: &lt;a href="https://github.com/josiahnyamai/Scrapping-Indeed-Jobs-" rel="noopener noreferrer"&gt;GitHub Repository - Scraping Indeed Remote Jobs&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Feel free to clone it, star it ⭐, or fork it and customize for your own job data project!&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>data</category>
      <category>python</category>
    </item>
    <item>
      <title>How Excel is Used in Real-World Data Analysis</title>
      <dc:creator>Josiah Nyamai</dc:creator>
      <pubDate>Tue, 10 Jun 2025 06:38:56 +0000</pubDate>
      <link>https://dev.to/joe_siah/how-excel-is-used-in-real-world-data-analysis-3n0e</link>
      <guid>https://dev.to/joe_siah/how-excel-is-used-in-real-world-data-analysis-3n0e</guid>
<description>&lt;p&gt;As a data analyst, returning to Excel during the first week of the LuxDevHQ Data Analytics course has reminded me just how powerful and flexible the tool actually is. Far more than a simple spreadsheet program, Excel remains an indispensable platform for real-world data analysis across a wide range of industries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Excel: What is it?&lt;/strong&gt;&lt;br&gt;
Microsoft Excel is a spreadsheet program used to organize, analyze, and present data. It enables users to build dashboards, make data-driven decisions, run computations, and produce charts. Its intuitive interface and extensive functionality make it a preferred tool for analysts, finance professionals, marketers, and decision-makers worldwide.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How Excel is used in Data Analysis in Everyday Situations&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Business Decision-Making:&lt;br&gt;
Companies use Excel to analyze sales trends, monitor stock, and understand consumer behavior. By organizing data into tables and charts, business leaders can spot patterns and make informed decisions that drive the company forward.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Financial Reporting:&lt;br&gt;
Budgeting, forecasting, and creating financial reports are critical in finance and accounting. With built-in financial formulas and pivot tables, financial analysts can summarize large datasets to examine performance and plan ahead efficiently.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Marketing Performance Measurement:&lt;br&gt;
Marketing teams rely on Excel to track campaign performance, measure ROI, and analyze customer engagement. Excel’s charting tools and conditional formatting make it easier to interpret data visually and spot trends over time.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Key Excel Features and Formulas I’ve Learned&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;SUM Function:&lt;br&gt;
Adds the values in a range, useful for totaling sales, expenses, or any other numeric data. For example, =SUM(B2:B10) quickly calculates the total revenue in a column of sales.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;IF Function:&lt;br&gt;
A logical function that returns one of two values depending on whether a condition is met. For example, =IF(C2&amp;gt;10000, "High", "Low") classifies sales performance against a threshold.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Pivot Tables:&lt;br&gt;
Pivot tables allow dynamic summarization of large amounts of data. I learned to sort, filter, and analyze data in just a few clicks, a real time-saver for pulling insights without typing out complicated formulas.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;My Reflection&lt;/strong&gt;&lt;br&gt;
Although I’ve used Excel extensively in my data analysis work, this week has deepened my appreciation for its capabilities. It’s not just a tool for organizing data — it’s a strategic asset for uncovering insights, validating assumptions, and driving decisions. Revisiting key features with a fresh perspective reminded me how crucial Excel is for building strong analytical foundations, even in complex data environments.&lt;/p&gt;

&lt;p&gt;This experience has only served to convince me that a mastery of the basics — such as those presented by Excel — is critical for every data professional, regardless of how sophisticated their toolset becomes.&lt;/p&gt;

</description>
      <category>analytics</category>
      <category>data</category>
    </item>
  </channel>
</rss>
