<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Maureen Muthoni</title>
    <description>The latest articles on DEV Community by Maureen Muthoni (@maureenmuthonihue).</description>
    <link>https://dev.to/maureenmuthonihue</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3506197%2Fdeabd19c-e523-4314-8472-0f61bc48a204.jpg</url>
      <title>DEV Community: Maureen Muthoni</title>
      <link>https://dev.to/maureenmuthonihue</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/maureenmuthonihue"/>
    <language>en</language>
    <item>
      <title>Building a Smart Travel Assistant with RAG: A Journey Through Kenya's Tourism Landscape</title>
      <dc:creator>Maureen Muthoni</dc:creator>
      <pubDate>Sat, 11 Apr 2026 09:20:32 +0000</pubDate>
      <link>https://dev.to/maureenmuthonihue/building-a-smart-travel-assistant-with-rag-a-journey-through-kenyas-tourism-landscape-2oeb</link>
      <guid>https://dev.to/maureenmuthonihue/building-a-smart-travel-assistant-with-rag-a-journey-through-kenyas-tourism-landscape-2oeb</guid>
      <description>&lt;h2&gt;
  
  
  How I Built an AI-Powered Q&amp;amp;A System
&lt;/h2&gt;

&lt;p&gt;Have you ever wished you could ask specific questions about a travel destination and get accurate, sourced answers? That's precisely what I set out to build, and in this article I'll walk you through creating a Retrieval-Augmented Generation (RAG) system for Kenya's tourism industry.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: AI That Makes Things Up
&lt;/h2&gt;

&lt;p&gt;Large Language Models (LLMs) are impressive, but they have a fatal flaw: they confidently generate information that sounds right but might be completely wrong. Ask ChatGPT about the best time to visit the Maasai Mara, and it might give you a reasonable answer, or it might hallucinate facts about wildebeest migration patterns.&lt;/p&gt;

&lt;p&gt;This is where RAG comes in. Instead of relying on what the AI "thinks" it knows, we give it a library of trusted documents and teach it to search through them before answering. Think of it as moving from a student who wings their exam to one who brings a cheat sheet with verified facts.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We're Building
&lt;/h2&gt;

&lt;p&gt;Our system ingests PDF documents about Kenyan tourism destinations (Maasai Mara, Mombasa, Mount Kenya, etc.) and provides a REST API where users can ask questions like the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"What wildlife can I see at Maasai Mara?"&lt;/li&gt;
&lt;li&gt;"What are the best beaches in Mombasa?"&lt;/li&gt;
&lt;li&gt;"How difficult is it to climb Mount Kenya?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The system will:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Search through the PDF documents for relevant information&lt;/li&gt;
&lt;li&gt;Extract the most pertinent passages&lt;/li&gt;
&lt;li&gt;Use an LLM to generate a natural language answer based only on those passages&lt;/li&gt;
&lt;li&gt;Return the sources so users can verify the information&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Tech Stack
&lt;/h2&gt;

&lt;p&gt;Here's what we're using and why:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;FastAPI&lt;/strong&gt;: Lightning-fast Python web framework, perfect for building APIs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sentence Transformers&lt;/strong&gt;: Converts text to embeddings (fancy math that makes similar text have similar numbers)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ChromaDB&lt;/strong&gt;: Stores and searches through those embeddings efficiently&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Groq&lt;/strong&gt;: Blazingly fast LLM inference (seriously, it's ridiculously fast)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;pypdf&lt;/strong&gt;: Extracts text from PDF documents&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Architecture: The 30,000-Foot View
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PDFs → Text Extraction → Chunking → Embeddings → Vector Database
                                                        ↓
User Query → Embedding → Similarity Search → Context → LLM → Answer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We have two main pipelines:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Ingestion Pipeline&lt;/strong&gt; (run once): Takes PDFs, breaks them into chunks, converts chunks to vectors, stores in a database.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query Pipeline&lt;/strong&gt; (run every query): Takes question, converts to vector, finds similar chunks, sends to LLM for an answer.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Step 1: Document Ingestion — Teaching the System to Read
&lt;/h2&gt;

&lt;p&gt;Let's start with the ingestion script. This is where the magic of preparing our knowledge base happens.&lt;/p&gt;

&lt;h3&gt;
  
  
  Extracting Text from PDFs
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pypdf&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PdfReader&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;reader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PdfReader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;page_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extract_text&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;page_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;page_text&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Simple enough: we read each page and concatenate the text. But here's the thing: PDFs are notoriously tricky. Some have scanned images (which need OCR), some have weird encodings, and some have tables that don't extract well. For this project, I assumed clean, text-based PDFs. In production, you'd want more robust error handling.&lt;/p&gt;
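
&lt;p&gt;To give a rough idea of what that could look like, here's a minimal sketch (not from the project code) of a more defensive version that skips unreadable pages and unopenable files instead of crashing the whole ingestion run:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pypdf import PdfReader

# Sketch: a more defensive variant of extract_text (not from the original project)
def extract_text_safe(path):
    try:
        reader = PdfReader(path)
    except Exception as exc:
        print(f"Skipping {path}: could not open PDF ({exc})")
        return ""

    pages = []
    for i, page in enumerate(reader.pages):
        try:
            page_text = page.extract_text()
        except Exception as exc:
            print(f"Skipping page {i} of {path}: {exc}")
            continue
        if page_text and page_text.strip():
            pages.append(page_text)

    return "\n".join(pages)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;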

&lt;h3&gt;
  
  
  The Chunking Strategy: Why Size Matters
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;chunk_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;words&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;words&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;words&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why chunk at all? LLMs have context windows, and we can't feed them entire books. More importantly, smaller chunks mean more precise retrieval. If your document chunk is an entire chapter about Mombasa and someone asks about beaches, you'll retrieve all of Mombasa's beaches, hotels, restaurants and history. That's too much noise.&lt;/p&gt;

&lt;p&gt;I chose 300 words per chunk through experimentation. Too small (100 words) and you lose context. Too large (1000 words) and your retrieval becomes imprecise.&lt;/p&gt;

&lt;h3&gt;
  
  
  Embeddings
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sentence_transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SentenceTransformer&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SentenceTransformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BAAI/bge-small-en-v1.5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;normalize_embeddings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;norms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keepdims&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;norms&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's where things get interesting. Embeddings convert text into high dimensional vectors (arrays of numbers). Similar text gets similar vectors. "The lion roared" and "The big cat made a loud sound" will have vectors that are close together in this mathematical space.&lt;/p&gt;

&lt;p&gt;I chose &lt;code&gt;BAAI/bge-small-en-v1.5&lt;/code&gt; because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It's small (133M parameters), which means fast inference&lt;/li&gt;
&lt;li&gt;It's good at semantic search tasks&lt;/li&gt;
&lt;li&gt;It's actively maintained and well documented&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The normalization step is crucial. It converts vectors to unit length, which makes cosine similarity (how ChromaDB compares vectors) equivalent to the dot product, and the dot product is faster to compute.&lt;/p&gt;
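
&lt;p&gt;To make both ideas concrete (similar sentences get nearby vectors, and unit-length vectors can be compared with a plain dot product), here's a tiny standalone demo, separate from the project code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np
from sentence_transformers import SentenceTransformer

# Standalone demo (not project code): similar sentences get nearby vectors
model = SentenceTransformer("BAAI/bge-small-en-v1.5")

sentences = ["The lion roared", "The big cat made a loud sound"]
emb = model.encode(sentences)                            # shape (2, embedding_dim)
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)   # normalize to unit length

# For unit-length vectors, the dot product equals cosine similarity
similarity = float(np.dot(emb[0], emb[1]))
print(f"cosine similarity: {similarity:.3f}")            # closer to 1.0 means more similar
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;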

&lt;h3&gt;
  
  
  Storing Everything in ChromaDB
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;chromadb&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chromadb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;PersistentClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./chromadb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;collection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_or_create_collection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;travel_and_tourism&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Multi PDF Tourism documents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;all_chunks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;all_embeddings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;all_ids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;metadatas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;all_metadatas&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;ChromaDB is a vector database designed for this exact use case. It:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stores embeddings efficiently&lt;/li&gt;
&lt;li&gt;Provides fast similarity search&lt;/li&gt;
&lt;li&gt;Persists data to disk&lt;/li&gt;
&lt;li&gt;Has a simple Python API&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;PersistentClient&lt;/code&gt; means our vectors survive restarts. We don't have to re-embed all our documents every time we start the server.&lt;/p&gt;
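
&lt;p&gt;A quick way to confirm that persistence is to open the store again in a fresh process and count what's already there, something like this sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import chromadb

# Sketch: re-open the same on-disk store, e.g. after a restart
client = chromadb.PersistentClient(path="./chromadb")
collection = client.get_or_create_collection(name="travel_and_tourism")
print(collection.count(), "chunks already stored")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;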

&lt;h2&gt;
  
  
  Step 2: The Query Pipeline
&lt;/h2&gt;

&lt;p&gt;Now for the fun part: answering questions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Converting Questions to Vectors
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;query_embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;query_embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;normalize_embeddings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_embedding&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We use the same embedding model we used for documents. This is critical. If you embed documents with Model A and queries with Model B, the vector spaces won't align.&lt;/p&gt;

&lt;h3&gt;
  
  
  Similarity Search
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;query_embeddings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query_embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;n_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;documents&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;metadatas&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metadatas&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;ChromaDB finds the 3 most similar document chunks to our query. How does it know what's similar? It computes the distance between the query vector and every document vector, then returns the closest ones.&lt;/p&gt;

&lt;p&gt;Why 3? Another Goldilocks number. Too few (1) and you might miss important context. Too many (10) and you'll include irrelevant information that confuses the LLM. I tested several values and found 3 provided the best balance.&lt;/p&gt;
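
&lt;p&gt;If you want to see how close those matches actually are, ChromaDB can also return the distances alongside the documents. Here's a small diagnostic sketch (reusing the &lt;code&gt;collection&lt;/code&gt; and &lt;code&gt;query_embedding&lt;/code&gt; from above) that can help when tuning &lt;code&gt;n_results&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: include distances to inspect how close each match really is
results = collection.query(
    query_embeddings=query_embedding,
    n_results=3,
    include=["documents", "metadatas", "distances"],
)

for doc, meta, dist in zip(
    results["documents"][0], results["metadatas"][0], results["distances"][0]
):
    print(f"{meta['source']}  distance={dist:.3f}")
    print(doc[:120])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;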

&lt;h3&gt;
  
  
  The LLM
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;groq&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Groq&lt;/span&gt;

&lt;span class="n"&gt;groq_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Groq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GROQ_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;groq_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;meta-llama/llama-4-scout-17b-16e-instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Answer only using provided context&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Context:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;Question:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is where RAG shines. We give the LLM:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A system instruction: "Only use the provided context" (reducing hallucinations)&lt;/li&gt;
&lt;li&gt;The retrieved context&lt;/li&gt;
&lt;li&gt;The user's question&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The &lt;code&gt;temperature=0&lt;/code&gt; setting makes the model deterministic; the same input always produces the same output. This is crucial for reliability.&lt;/p&gt;

&lt;p&gt;Why Groq? Speed. Seriously, it's fast. What takes OpenAI 3-4 seconds, Groq does in under a second. For user-facing applications, this matters.&lt;/p&gt;

&lt;h3&gt;
  
  
  Source Attribution
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;sources&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;meta&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;metadatas&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sources&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We return the source PDFs used to generate the answer. This serves two purposes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Users can verify the information&lt;/li&gt;
&lt;li&gt;It builds trust in the system&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Step 3: The FastAPI Application
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;HTTPException&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi.middleware.cors&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CORSMiddleware&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Travel and Tourism&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_middleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;CORSMiddleware&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;allow_origins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;allow_methods&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;allow_headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@app.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/ask&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;QuestionResponse&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ask_question&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;QuestionRequest&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sources&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;ask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;QuestionResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sources&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sources&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;HTTPException&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;detail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;FastAPI gives us:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automatic API documentation (visit &lt;code&gt;/docs&lt;/code&gt; to see it)&lt;/li&gt;
&lt;li&gt;Request validation via Pydantic models&lt;/li&gt;
&lt;li&gt;Type hints that actually work&lt;/li&gt;
&lt;li&gt;Easy async support (though we're not using it here)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The CORS middleware allows frontend applications from any origin to call our API. In production, you'd restrict this to your specific domain.&lt;/p&gt;
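
&lt;p&gt;That restriction is a one-line change to the middleware setup above; the domain below is just a placeholder:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;app.add_middleware(
    CORSMiddleware,
    allow_origins=["https://your-frontend-domain.com"],  # placeholder, swap in your real frontend origin
    allow_methods=["POST"],
    allow_headers=["*"],
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;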

&lt;h2&gt;
  
  
  The Results: Does It Actually Work?
&lt;/h2&gt;

&lt;p&gt;Let's test it:&lt;/p&gt;
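
&lt;p&gt;You can use the interactive &lt;code&gt;/docs&lt;/code&gt; page, or a small client script like this sketch (it assumes the server is running locally on port 8000):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests

# Sketch: assumes the API is running locally on port 8000
resp = requests.post(
    "http://localhost:8000/ask",
    json={"question": "What wildlife can I see at Maasai Mara?"},
)
resp.raise_for_status()

data = resp.json()
print(data["answer"])
print("Sources:", data["sources"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;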

&lt;p&gt;&lt;strong&gt;Query&lt;/strong&gt;: "What wildlife can I see at Maasai Mara?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Response&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"answer"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"At Maasai Mara, you can see the Big Five: lions, elephants, leopards, rhinos, and buffalo. The park is famous for the annual wildebeest migration between July and October, where millions of wildebeest, zebras, and gazelles cross the Mara River. You can also spot cheetahs, hyenas, giraffes, hippos, crocodiles, and over 450 bird species."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"sources"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Maasai_Mara.pdf"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Beautiful. The answer is specific, accurate, and sourced.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bigger Picture: Why RAG Matters
&lt;/h2&gt;

&lt;p&gt;RAG represents a fundamental shift in how we build AI applications. Instead of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fine-tuning models (expensive, time-consuming, static)&lt;/li&gt;
&lt;li&gt;Relying on model knowledge (outdated, prone to hallucination)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use any LLM as a reasoning engine&lt;/li&gt;
&lt;li&gt;Plug in our own knowledge dynamically&lt;/li&gt;
&lt;li&gt;Update information without retraining&lt;/li&gt;
&lt;li&gt;Provide source attribution for trust&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This pattern works for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Customer support bots trained on company documentation&lt;/li&gt;
&lt;li&gt;Legal research tools searching case law&lt;/li&gt;
&lt;li&gt;Medical assistants referencing clinical guidelines&lt;/li&gt;
&lt;li&gt;Internal knowledge bases for enterprises&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Building this RAG system taught me that the real challenge isn't the AI; it's the data pipeline, retrieval strategy, and user experience. The LLM is just the final step that ties everything together.&lt;/p&gt;

&lt;p&gt;RAG won't solve all AI problems. But for question-answering over documents, it's incredibly powerful. And as embedding models improve, vector databases get faster, and LLMs become more capable, RAG systems will only get better.&lt;/p&gt;

&lt;h3&gt;
  
  
  Code Snippets
&lt;/h3&gt;

&lt;p&gt;All code in this article is available in my GitHub repository: &lt;a href="https://github.com/maureenmuthoni-hue/Travel_and_Tourism_RAG_System" rel="noopener noreferrer"&gt;https://github.com/maureenmuthoni-hue/Travel_and_Tourism_RAG_System&lt;/a&gt;. Feel free to star, fork, and adapt it for your own projects!&lt;/p&gt;

</description>
      <category>rag</category>
      <category>beginners</category>
      <category>python</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Customer Lifetime Value (CLV) Prediction with Machine Learning</title>
      <dc:creator>Maureen Muthoni</dc:creator>
      <pubDate>Mon, 23 Feb 2026 20:00:50 +0000</pubDate>
      <link>https://dev.to/maureenmuthonihue/customer-lifetime-value-clv-prediction-with-machine-learning-4545</link>
      <guid>https://dev.to/maureenmuthonihue/customer-lifetime-value-clv-prediction-with-machine-learning-4545</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Customer acquisition is expensive. But do you know which customers will actually generate long-term revenue? That’s where &lt;strong&gt;Customer Lifetime Value (CLV)&lt;/strong&gt; comes in.&lt;/p&gt;

&lt;p&gt;Instead of focusing on one-off transactions, CLV estimates the total revenue a business expects from a customer over their entire relationship.&lt;/p&gt;

&lt;p&gt;In this project, I built an end-to-end CLV prediction model and then deployed it as a production-ready API.&lt;/p&gt;

&lt;p&gt;In this article, we’ll cover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Business problem
&lt;/li&gt;
&lt;li&gt;Data preprocessing
&lt;/li&gt;
&lt;li&gt;Model development
&lt;/li&gt;
&lt;li&gt;Model evaluation
&lt;/li&gt;
&lt;li&gt;Model deployment with FastAPI
&lt;/li&gt;
&lt;li&gt;Production-ready setup
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Business Problem
&lt;/h3&gt;

&lt;p&gt;Businesses want to answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which customers are most valuable?&lt;/li&gt;
&lt;li&gt;Who should receive retention incentives?&lt;/li&gt;
&lt;li&gt;Where should marketing budgets be allocated?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Predicting CLV helps with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt; Customer segmentation
&lt;/li&gt;
&lt;li&gt; Revenue forecasting
&lt;/li&gt;
&lt;li&gt; Budget optimization
&lt;/li&gt;
&lt;li&gt; Retention strategies
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is a regression problem since CLV is a continuous value.&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 1: Data Preprocessing
&lt;/h4&gt;

&lt;p&gt;The dataset includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Purchase frequency
&lt;/li&gt;
&lt;li&gt;Recency
&lt;/li&gt;
&lt;li&gt;Average transaction value
&lt;/li&gt;
&lt;li&gt;Tenure
&lt;/li&gt;
&lt;li&gt;Demographic features
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Data Preparation
&lt;/h4&gt;

&lt;p&gt;Before training any model, we need to separate our features from the target variable. In this case, CLV is what we're trying to predict, and everything else in the dataset serves as input:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CLV&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CLV&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We also check for missing values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isnull&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Clean data is non-negotiable. Missing values can silently corrupt a model's performance if left unaddressed.&lt;/p&gt;
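
&lt;p&gt;If the check does turn up gaps, two common options look like this (a sketch, reusing the &lt;code&gt;x&lt;/code&gt; and &lt;code&gt;y&lt;/code&gt; from above; which option is appropriate depends on how much data is missing and why):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Option 1: drop rows with any missing values (fine when only a few rows are affected)
x_clean = x.dropna()
y_clean = y.loc[x_clean.index]

# Option 2: fill numeric gaps with the column median instead (keeps every row)
x_filled = x.fillna(x.median(numeric_only=True))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;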

&lt;h4&gt;
  
  
  Splitting the Dataset
&lt;/h4&gt;

&lt;p&gt;We divide the data into training and testing sets, with 80% for training and 20% for evaluating performance on unseen data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;

&lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Setting &lt;code&gt;random_state=42&lt;/code&gt; ensures reproducibility, so results remain consistent across runs.&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 2: Model development
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Linear Regression&lt;/strong&gt;&lt;br&gt;
We start with linear regression, a simple but interpretable baseline. It assumes a linear relationship between the features and the target, making it fast to train and easy to explain to stakeholders.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.linear_model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LinearRegression&lt;/span&gt;

&lt;span class="n"&gt;Linear&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LinearRegression&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Random Forest Regressor&lt;/strong&gt;&lt;br&gt;
Next, we train a Random Forest, an ensemble method that builds 200 decision trees and averages their predictions. This approach is more robust to non-linear patterns in the data and tends to outperform linear models on complex, real-world datasets.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.ensemble&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RandomForestRegressor&lt;/span&gt;

&lt;span class="n"&gt;rf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RandomForestRegressor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_estimators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;rf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;random_prediction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Step 3: Model evaluation
&lt;/h4&gt;

&lt;p&gt;We evaluate both models using Root Mean Squared Error (RMSE) and R² Score. RMSE tells us the average prediction error in the same units as CLV, while R² tells us how much of the variance in CLV our model explains (1.0 = perfect, 0 = no better than guessing the mean).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mean_squared_error&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r2_score&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sqrt&lt;/span&gt;

&lt;span class="n"&gt;RMSE_linear&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;mean_squared_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Predictions&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;r2_linear&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;r2_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Predictions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;RMSE_tree&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;mean_squared_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_prediction&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;r2_tree&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;r2_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_prediction&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;RMSE_linear: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;RMSE_linear&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;r2_linear:   &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r2_linear&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;RMSE_tree:   &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;RMSE_tree&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;r2_tree:     &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r2_tree&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In most real-world CLV scenarios, the Random Forest will outperform Linear Regression due to its ability to capture complex, non-linear relationships between customer features and lifetime value.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Saving the Model&lt;/strong&gt;&lt;br&gt;
Once we're satisfied with model performance, we persist the trained model and feature schema using joblib so they can be reloaded later without retraining:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;joblib&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CLV_model.joblib&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;feature_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;modelfeatures.joblib&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Saving the feature set alongside the model is a great practice. It documents exactly what columns and structure the model expects at inference time, which prevents subtle bugs when deploying.&lt;/p&gt;
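
&lt;p&gt;At inference time, that saved feature list lets you validate incoming records and force the exact column order the model was trained on. A minimal sketch (the helper function here is illustrative, not part of the project code):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import joblib
import pandas as pd

model = joblib.load('CLV_model.joblib')
feature_name = joblib.load('modelfeatures.joblib')

# Illustrative helper (not from the original project)
def predict_clv(record):
    """Check an incoming record against the saved schema, then predict."""
    missing = [col for col in feature_name if col not in record]
    if missing:
        raise ValueError(f"Missing required features: {missing}")
    # reindex enforces the exact column order used during training
    row = pd.DataFrame([record]).reindex(columns=feature_name)
    return float(model.predict(row)[0])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;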

&lt;h3&gt;
  
  
  Step 4: Model deployment with FastAPI
&lt;/h3&gt;

&lt;p&gt;Training a model is only half the work. To put it into production, you need an API that other systems can call. Here's how to build a simple REST endpoint using FastAPI:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Install Dependencies&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;fastapi&lt;/span&gt; &lt;span class="n"&gt;uvicorn&lt;/span&gt; &lt;span class="n"&gt;joblib&lt;/span&gt; &lt;span class="n"&gt;scikit&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;learn&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Create the API&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt; 
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;joblib&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt; 

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Customer Lifetime Value Prediction API&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Load the saved model and feature schema
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CLV_model.joblib&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;feature_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;modelfeatures.joblib&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Define the input schema (adjust fields to match your actual dataset columns)
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CLVinput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;Customer_Age&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;Annual_Income&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
    &lt;span class="n"&gt;Tenure_Months&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;Monthly_Spend&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
    &lt;span class="n"&gt;Visits_Per_Month&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;Avg_Basket_Value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
    &lt;span class="n"&gt;Support_Tickets&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;

&lt;span class="nd"&gt;@app.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;health_check&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;API is running&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nd"&gt;@app.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/predict-CLV&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;predict_CLV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;CLVinput&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([[&lt;/span&gt;&lt;span class="nf"&gt;getattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;feature_name&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt;
    &lt;span class="n"&gt;prediction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;predicted_CLV&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prediction&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Run the Server Locally&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;uvicorn&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="nb"&gt;reload&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your API will be live at &lt;a href="http://localhost:8000" rel="noopener noreferrer"&gt;http://localhost:8000&lt;/a&gt;. You can test it at &lt;a href="http://localhost:8000/docs" rel="noopener noreferrer"&gt;http://localhost:8000/docs&lt;/a&gt;. FastAPI generates interactive API documentation automatically.&lt;/p&gt;
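
&lt;p&gt;You can also call it programmatically. Here's a quick sketch using the &lt;code&gt;requests&lt;/code&gt; library; the field values are made-up examples matching the &lt;code&gt;CLVinput&lt;/code&gt; schema above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests

# Made-up example values matching the CLVinput schema
payload = {
    "Customer_Age": 34,
    "Annual_Income": 52000.0,
    "Tenure_Months": 18,
    "Monthly_Spend": 240.0,
    "Visits_Per_Month": 6,
    "Avg_Basket_Value": 40.0,
    "Support_Tickets": 1,
}

resp = requests.post("http://localhost:8000/predict-CLV", json=payload)
resp.raise_for_status()
print(resp.json())   # returns the predicted_CLV value
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;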

&lt;p&gt;&lt;strong&gt;4. Deploy to the Cloud&lt;/strong&gt;&lt;br&gt;
For production, deploy the API to a cloud provider. Here's a quick overview:&lt;br&gt;
&lt;strong&gt;Railway or Render (simplest):&lt;/strong&gt; Push your code to GitHub and connect the repo. Both platforms auto-detect Python apps and handle deployment with minimal configuration. Add a requirements.txt file:&lt;br&gt;
&lt;code&gt;&lt;br&gt;
fastapi&lt;br&gt;
uvicorn&lt;br&gt;
joblib&lt;br&gt;
scikit-learn&lt;br&gt;
pandas&lt;br&gt;
&lt;/code&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Summary
&lt;/h4&gt;

&lt;p&gt;Here's the end-to-end workflow we covered:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Load and explore the customer dataset&lt;/li&gt;
&lt;li&gt;Prepare features by separating inputs from the CLV target&lt;/li&gt;
&lt;li&gt;Train two models, Linear Regression and Random Forest, and compare them using RMSE and R²&lt;/li&gt;
&lt;li&gt;Save the best model using joblib&lt;/li&gt;
&lt;li&gt;Deploy via FastAPI with a /predict-CLV endpoint that accepts customer data and returns a CLV estimate&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Predicting Customer Lifetime Value turns raw customer data into a strategic business asset. With a deployed model, your sales and marketing teams can make real-time decisions based on predicted value, not just historical behaviour.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>datascience</category>
      <category>fastapi</category>
    </item>
    <item>
      <title>How Statistics Can Be Used to Drive Business Decisions</title>
      <dc:creator>Maureen Muthoni</dc:creator>
      <pubDate>Fri, 06 Feb 2026 17:22:11 +0000</pubDate>
      <link>https://dev.to/maureenmuthonihue/how-statistics-can-be-used-to-drive-business-decisions-631</link>
      <guid>https://dev.to/maureenmuthonihue/how-statistics-can-be-used-to-drive-business-decisions-631</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In today's competitive business landscape, intuition is no longer sufficient for making critical decisions. Companies that leverage statistical analysis to inform their strategies consistently outperform those that rely on experience or instinct. &lt;/p&gt;

&lt;p&gt;The story that follows demonstrates how a systematic statistical approach, from descriptive analytics to hypothesis testing, can provide clear, evidence-based answers to complex business questions. More importantly, it shows how understanding statistical concepts like effect size, statistical power, and potential errors can prevent costly mistakes and unlock growth opportunities.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Business Problem
&lt;/h3&gt;

&lt;p&gt;A retail company operating both online and physical stores wanted to answer three key questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How are sales performing over time?&lt;/li&gt;
&lt;li&gt;How reliable are insights drawn from the data?&lt;/li&gt;
&lt;li&gt;Does running a marketing campaign actually increase revenue per transaction?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The company had three years of transaction data, including revenue, store type, region, and whether a marketing campaign was used. The goal was to use statistics to support decision-making.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step by Step Statistical Analysis
&lt;/h3&gt;

&lt;p&gt;Before testing anything, you need to know what your data looks like. This is called descriptive statistics.&lt;br&gt;
What We Calculated:&lt;br&gt;
&lt;strong&gt;Central Tendency (The "Average")&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mean revenue: 8,272 per transaction&lt;/li&gt;
&lt;li&gt;Median revenue: 7,723 per transaction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The mean is higher than the median, which tells us some transactions are very high (outliers). The median is often more "typical".&lt;/p&gt;
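&lt;p&gt;For readers following along in Python, here is a minimal pandas sketch of these descriptive statistics. It uses synthetic, right-skewed revenue data because the original dataset is not shared in the post.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np
import pandas as pd

# Synthetic stand-in for the transaction data (one value per transaction)
rng = np.random.default_rng(42)
revenue = pd.Series(rng.lognormal(mean=9, sigma=0.4, size=1_000), name="revenue")

print(revenue.mean())    # central tendency: mean
print(revenue.median())  # median; lower than the mean for right-skewed data
print(revenue.skew())    # positive skewness: a few very large transactions
print(revenue.kurt())    # kurtosis: heaviness of the tails
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;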

&lt;p&gt;&lt;strong&gt;Distribution Shape&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Skewness and kurtosis show that most transactions are low to moderate, but there are some very high transactions pulling the average up. The distribution looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fspvwex2dsjg73em4wjov.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fspvwex2dsjg73em4wjov.png" alt="Shape Distribution" width="800" height="652"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Visualize the Data
&lt;/h3&gt;

&lt;p&gt;Numbers are important, but pictures tell stories. We created four key visualizations:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Revenue Over Time (Line Chart)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3fkgup4fcjp8lvdu8k12.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3fkgup4fcjp8lvdu8k12.png" alt="Revenue Over Time" width="800" height="596"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We found that revenue has seasonal peaks and valleys: December is high (holidays), and January is low (post-holiday slump).&lt;br&gt;
Why this matters: if we only compared December to January, we'd think campaigns work miracles, but it might just be Christmas shopping.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Revenue by Store Type (Bar Chart)&lt;/strong&gt;&lt;br&gt;
Online transactions are actually more valuable on average, even though physical stores sell more volume.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuu140ox73bx0y7rb95tu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuu140ox73bx0y7rb95tu.png" alt="Bar Chart" width="800" height="721"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Revenue by Region (Box Plot)&lt;/strong&gt;&lt;br&gt;
Nairobi: Highest median revenue but most variable&lt;br&gt;
Rift Valley: Most consistent (narrow box)&lt;br&gt;
Western &amp;amp; Coast: Lower median but good campaign response&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Business Insight:&lt;/strong&gt; One marketing strategy won't fit all regions. We need customisation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feofqnpif0ka61ywr4q5c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feofqnpif0ka61ywr4q5c.png" alt="Box Chart" width="800" height="607"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Units Sold vs. Revenue (Scatter Plot)&lt;/strong&gt;&lt;br&gt;
This showed that campaigns don't just increase volume, they increase the value per unit sold. Customers buy more expensive items during campaigns.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqe6mhltf6z5s6sr8uafl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqe6mhltf6z5s6sr8uafl.png" alt="Scatter Plot" width="800" height="619"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Check for Bias (Sampling)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;TYPES OF BIAS:&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;1. SELECTION BIAS&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Urban areas systematically differ from rural areas&lt;/li&gt;
&lt;li&gt;Higher income, different shopping behaviors&lt;/li&gt;
&lt;li&gt;Better infrastructure and internet connectivity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. GEOGRAPHIC BIAS&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rural regions completely excluded&lt;/li&gt;
&lt;li&gt;Cannot generalize findings to entire market&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. SOCIOECONOMIC BIAS&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Urban customers have different purchasing power&lt;/li&gt;
&lt;li&gt;Product preferences may differ&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;BUSINESS IMPACT&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Revenue estimates would be overstated&lt;/li&gt;
&lt;li&gt;Marketing effectiveness could be overestimated&lt;/li&gt;
&lt;li&gt;Regional strategy would be incomplete&lt;/li&gt;
&lt;li&gt;Expansion decisions would lack empirical foundation&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;RECOMMENDED SAMPLING METHOD:&lt;/strong&gt;&lt;br&gt;
   STRATIFIED RANDOM SAMPLING:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Divide population into strata (regions, store types)&lt;/li&gt;
&lt;li&gt;Randomly sample proportionally from each stratum&lt;/li&gt;
&lt;li&gt;Ensures all segments are represented&lt;/li&gt;
&lt;li&gt;Maintains natural population distribution&lt;/li&gt;
&lt;li&gt;Allows both overall and stratum-specific analysis&lt;/li&gt;
&lt;/ol&gt;
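&lt;p&gt;As a rough illustration of the stratified approach above, here is a minimal pandas sketch. The table and column names (region, store_type) are assumptions for the example, not the company's actual schema.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np
import pandas as pd

# Hypothetical transaction table standing in for the company's full data
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "region": rng.choice(["Nairobi", "Coast", "Rift Valley", "Western"], size=1_000),
    "store_type": rng.choice(["online", "physical"], size=1_000),
    "revenue": rng.lognormal(mean=9, sigma=0.4, size=1_000),
})

# Sample 10% from every (region, store_type) stratum so each segment stays represented
sample = (
    df.groupby(["region", "store_type"], group_keys=False)
      .apply(lambda g: g.sample(frac=0.1, random_state=42))
)
print(sample["region"].value_counts(normalize=True))  # proportions mirror the population
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;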

&lt;h3&gt;
  
  
  Apply Statistical Theorems
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Law of Large Numbers&lt;/strong&gt;&lt;br&gt;
We tested sample sizes from 10 to 1,000 transactions:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fktww0auusdgrg4k4xpfu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fktww0auusdgrg4k4xpfu.png" alt="Law of Large Numbers" width="800" height="604"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Central Limit Theorem&lt;/strong&gt;&lt;br&gt;
Even though individual transactions are all over the place (skewed distribution), when we take many samples and average them, the averages form a nice, normal bell curve.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcmnsf7rw71a02kafmbho.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcmnsf7rw71a02kafmbho.png" alt="CLT" width="800" height="641"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Hypothesis Testing
&lt;/h3&gt;

&lt;p&gt;A key business question examined was:&lt;br&gt;
Does running a marketing campaign increase average revenue per transaction?&lt;/p&gt;

&lt;p&gt;A one-tailed independent samples t-test was conducted to compare revenues from transactions with and without a marketing campaign.&lt;/p&gt;
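&lt;p&gt;The test itself is straightforward to reproduce with scipy. This is a minimal sketch with synthetic revenue arrays standing in for the campaign and non-campaign groups, since the real data is not shown in the post.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Synthetic stand-ins for the two groups of transaction revenues
campaign = rng.lognormal(mean=9.05, sigma=0.4, size=500)
no_campaign = rng.lognormal(mean=9.00, sigma=0.4, size=500)

# One-tailed independent samples t-test: is mean campaign revenue greater?
t_stat, p_value = stats.ttest_ind(campaign, no_campaign, alternative="greater")
print(t_stat, p_value)  # reject the null hypothesis if p_value &lt; 0.05
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;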

&lt;p&gt;The results showed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A large t-statistic&lt;/li&gt;
&lt;li&gt;A p-value far below the 5% significance level&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This led to rejecting the null hypothesis and concluding that marketing campaigns significantly increase average revenue per transaction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Business implication:&lt;/strong&gt;&lt;br&gt;
Statistical testing provides objective evidence to support or challenge strategic initiatives, reducing reliance on intuition alone.&lt;/p&gt;

&lt;h3&gt;
  
  
  Errors and Interpretation
&lt;/h3&gt;

&lt;p&gt;Statistical decisions are subject to error:&lt;br&gt;
A Type I error would mean concluding the campaign works when it does not, leading to wasted marketing budgets.&lt;br&gt;
A Type II error would mean failing to detect a real effect, causing missed revenue opportunities.&lt;br&gt;
Recognizing these risks allows businesses to balance caution with opportunity.&lt;/p&gt;

&lt;p&gt;In this case, a Type II error is arguably worse because:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Lost revenue is permanent &lt;/li&gt;
&lt;li&gt;Competitors gain market share&lt;/li&gt;
&lt;li&gt;Recovery is expensive &lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Effect Size and Power
&lt;/h3&gt;

&lt;p&gt;Although the campaign effect was statistically significant, the calculated Cohen’s d indicated a small to medium effect size. This means that while the campaign works, its impact per transaction is modest.&lt;/p&gt;
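&lt;p&gt;Cohen's d itself is just the difference in group means divided by the pooled standard deviation. A minimal sketch, again using synthetic revenue arrays:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def cohens_d(a, b):
    """Difference in means divided by the pooled standard deviation."""
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

rng = np.random.default_rng(1)
campaign = rng.lognormal(mean=9.05, sigma=0.4, size=500)
no_campaign = rng.lognormal(mean=9.00, sigma=0.4, size=500)

# Conventional benchmarks: ~0.2 small, ~0.5 medium, ~0.8 large
print(cohens_d(campaign, no_campaign))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;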

&lt;p&gt;A statistically non-significant result could still matter in practice if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The effect is small but consistent&lt;/li&gt;
&lt;li&gt;The business operates at large scale&lt;/li&gt;
&lt;li&gt;The sample size is insufficient&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Business implication:&lt;/strong&gt;&lt;br&gt;
Statistical significance should be interpreted alongside effect size and business context. Collecting more data can improve confidence and guide optimization rather than abandonment of strategies.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;This case study illustrates that statistics is far more than an academic exercise. When applied correctly, it enables businesses to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Understand performance realistically&lt;/li&gt;
&lt;li&gt;Measure risk and variability&lt;/li&gt;
&lt;li&gt;Test strategic decisions objectively&lt;/li&gt;
&lt;li&gt;Avoid costly cognitive and sampling biases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By integrating descriptive statistics, visualization, sampling theory, probability laws, and hypothesis testing, organizations can make evidence-based decisions that are both statistically sound and commercially meaningful.&lt;/p&gt;

&lt;p&gt;In an increasingly competitive environment, businesses that leverage statistics effectively gain a decisive advantage not by predicting the future perfectly, but by making better decisions under uncertainty.&lt;/p&gt;

</description>
      <category>statistics</category>
      <category>discuss</category>
      <category>machinelearning</category>
      <category>learning</category>
    </item>
    <item>
      <title>Ridge Regression vs Lasso Regression</title>
      <dc:creator>Maureen Muthoni</dc:creator>
      <pubDate>Tue, 03 Feb 2026 20:02:36 +0000</pubDate>
      <link>https://dev.to/maureenmuthonihue/ridge-regression-vs-lasso-regression-108c</link>
      <guid>https://dev.to/maureenmuthonihue/ridge-regression-vs-lasso-regression-108c</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Linear regression stands as one of the most fundamental tools in a data scientist's toolkit. At its core lies Ordinary Least Squares (OLS), a method that estimates model parameters by minimizing the sum of squared differences between predicted and actual values. In many real-world problems, such as house price prediction, datasets often contain many features, correlated variables, and noisy inputs. In such cases, traditional OLS regression becomes unstable and prone to overfitting. To solve these challenges, regularization techniques are used. The two most important regularization-based models are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ridge Regression (L2 Regularization)&lt;/li&gt;
&lt;li&gt;Lasso Regression (L1 Regularization)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Ordinary Least Squares (OLS)
&lt;/h3&gt;

&lt;p&gt;Ordinary Least Squares estimates model parameters by minimizing the sum of squared residuals between predicted and actual values:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RSS = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;where ŷᵢ represents the predicted prices.&lt;br&gt;
OLS works well for small, clean datasets, but struggles when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;There are many features&lt;/li&gt;
&lt;li&gt;Features are highly correlated (multicollinearity)&lt;/li&gt;
&lt;li&gt;Data contains noise&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This leads to overfitting, where the model performs well on training data but poorly on unseen data.&lt;/p&gt;
&lt;h3&gt;
  
  
  Regularization in Linear Regression
&lt;/h3&gt;

&lt;p&gt;By including a penalty term in the loss function, regularisation addresses overfitting by effectively charging the model for complexity. The model now has to weigh accuracy against simplicity rather than just minimising error. Large coefficients are discouraged by this penalty, resulting in models that perform better when applied to new data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;General form: Loss = Error + Penalty&lt;/strong&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Ridge Regression (L2 Regularization)
&lt;/h3&gt;

&lt;p&gt;Ridge regression modifies the OLS loss function by adding an L2 penalty term proportional to the sum of squared coefficients.&lt;/p&gt;

&lt;p&gt;Ridge Regression Loss Function:&lt;br&gt;
&lt;strong&gt;Minimize: RSS + λΣβⱼ² = Σ(yᵢ - ŷᵢ)² + λ(β₁² + β₂² + ... + βₚ²)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;λ (lambda) = regularization parameter (λ ≥ 0)&lt;/li&gt;
&lt;li&gt;The penalty term is the sum of squared coefficients&lt;/li&gt;
&lt;li&gt;Note: The intercept β₀ is typically not penalised.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Conceptual Effect&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Shrinks coefficients smoothly&lt;/li&gt;
&lt;li&gt;Reduces model variance&lt;/li&gt;
&lt;li&gt;Keeps all features&lt;/li&gt;
&lt;li&gt;Handles multicollinearity well&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key Property&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Ridge regression does not perform feature selection because coefficients are reduced but never become exactly zero.&lt;br&gt;
&lt;strong&gt;Python Example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.linear_model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Ridge&lt;/span&gt;

&lt;span class="n"&gt;ridge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Ridge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ridge&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train_scaled&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;y_pred_ridge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ridge&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test_scaled&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Lasso Regression (L1 Regularization)
&lt;/h3&gt;

&lt;p&gt;Lasso takes a different approach through L1 regularization. Its loss function penalizes the sum of absolute coefficient values rather than squared values.&lt;/p&gt;

&lt;p&gt;Lasso Regression Loss Function:&lt;br&gt;
&lt;strong&gt;Minimize: RSS + λΣ|βⱼ| = Σ(yᵢ - ŷᵢ)² + λ(|β₁| + |β₂| + ... + |βₚ|)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Where:&lt;br&gt;
The penalty term is the sum of absolute values of coefficients&lt;br&gt;
λ controls the strength of regularization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conceptual Effect&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Creates sparse models&lt;/li&gt;
&lt;li&gt;Forces some coefficients to exactly zero&lt;/li&gt;
&lt;li&gt;Automatically removes weak features&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key Property&lt;/strong&gt;&lt;br&gt;
Lasso performs feature selection, producing simpler and more interpretable models.&lt;br&gt;
&lt;strong&gt;Python Example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.linear_model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Lasso&lt;/span&gt;

&lt;span class="n"&gt;lasso&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Lasso&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;lasso&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train_scaled&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;y_pred_lasso&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lasso&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test_scaled&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Comparing Ridge and Lasso
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Feature Selection Capability&lt;/strong&gt;&lt;br&gt;
Ridge retains all features with shrunken coefficients, while Lasso performs automatic selection by zeroing out irrelevant features.&lt;br&gt;
&lt;strong&gt;2. Coefficient Behavior with Correlated Features&lt;/strong&gt;&lt;br&gt;
When size (sq ft) and number of rooms correlate at r = 0.85:&lt;/p&gt;

&lt;p&gt;Ridge: Size = $120/sq ft, Rooms = $8,000/room (both moderate)&lt;br&gt;
Lasso: Size = $180/sq ft, Rooms = $0 (picks one, drops other)&lt;/p&gt;

&lt;p&gt;Ridge distributes weight smoothly; Lasso makes discrete choices.&lt;br&gt;
&lt;strong&gt;3. Model Interpretability&lt;/strong&gt;&lt;br&gt;
Ridge model: "Price depends on all 10 factors with varying importance."&lt;br&gt;
Lasso model: "Price primarily depends on size, location, and age, other factors don't matter."&lt;br&gt;
Lasso produces simpler, more explainable models for stakeholders.&lt;/p&gt;
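&lt;p&gt;One practical way to see this behaviour is to put the fitted coefficients side by side. Here is a minimal sketch on synthetic data with two strongly correlated features plus an irrelevant one; the feature names are placeholders for illustration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge, Lasso

# Synthetic data: "rooms" is strongly correlated with "size"; "noise_feature" is irrelevant
rng = np.random.default_rng(0)
size = rng.normal(0, 1, 500)
rooms = 0.85 * size + 0.5 * rng.normal(0, 1, 500)
noise_feature = rng.normal(0, 1, 500)
X = np.column_stack([size, rooms, noise_feature])
y = 3 * size + 1 * rooms + rng.normal(0, 1, 500)

coefs = pd.DataFrame({
    "ridge": Ridge(alpha=1.0).fit(X, y).coef_,
    "lasso": Lasso(alpha=0.1).fit(X, y).coef_,
}, index=["size", "rooms", "noise_feature"])
print(coefs)  # Ridge spreads weight across the correlated pair; Lasso tends to zero out weak features
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;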
&lt;h3&gt;
  
  
  Application Scenario: House Price Prediction
&lt;/h3&gt;

&lt;p&gt;Suppose your dataset includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;House size&lt;/li&gt;
&lt;li&gt;Number of bedrooms&lt;/li&gt;
&lt;li&gt;Distance to the city&lt;/li&gt;
&lt;li&gt;Number of nearby schools&lt;/li&gt;
&lt;li&gt;Several noisy or weak features&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When to use Ridge&lt;/strong&gt;&lt;br&gt;
Choose Ridge if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Most features likely influence price&lt;/li&gt;
&lt;li&gt;Multicollinearity exists&lt;/li&gt;
&lt;li&gt;You want stable predictions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When to use Lasso&lt;/strong&gt;&lt;br&gt;
Choose Lasso if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Only a few features truly matter&lt;/li&gt;
&lt;li&gt;Many variables add noise&lt;/li&gt;
&lt;li&gt;Interpretability is important&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Python Implementation&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Data Preparation&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.preprocessing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StandardScaler&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.linear_model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LinearRegression&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Ridge&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Lasso&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mean_squared_error&lt;/span&gt;


&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;size&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bedrooms&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;distance_city&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;schools_nearby&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;noise_feature&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;scaler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StandardScaler&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;X_train_scaled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scaler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit_transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;X_test_scaled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scaler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;OLS Model&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LinearRegression&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;ols&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train_scaled&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;y_pred_ols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ols&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test_scaled&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;mean_squared_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred_ols&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Ridge Regression&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ridge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Ridge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ridge&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train_scaled&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;y_pred_ridge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ridge&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test_scaled&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;mean_squared_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred_ridge&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lasso Regression&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;lasso&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Lasso&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;lasso&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train_scaled&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;y_pred_lasso&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lasso&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test_scaled&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;mean_squared_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred_lasso&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Choosing the Right Model for House Prices&lt;/strong&gt;&lt;br&gt;
If all features contribute meaningfully (e.g., size, bedrooms, schools, distance):&lt;br&gt;
Ridge Regression is preferred.&lt;br&gt;
If only a few features are truly important and others add noise:&lt;br&gt;
Lasso Regression is more suitable due to its feature selection capability.&lt;/p&gt;

&lt;h3&gt;
  
  
  Model Evaluation and Overfitting Detection
&lt;/h3&gt;

&lt;p&gt;Overfitting can be detected by comparing training and testing performance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High training score but low test score indicates overfitting&lt;/li&gt;
&lt;li&gt;Similar training and test scores suggest good generalization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Residual analysis also plays a key role. Residuals should be randomly distributed; visible patterns may indicate missing variables or non-linear relationships.&lt;/p&gt;
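&lt;p&gt;A quick way to check this in practice is to compare the model's R² on the training set and on the test set, reusing the scaled splits and fitted models from the snippets above.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Uses ridge, X_train_scaled, X_test_scaled, y_train, y_test defined in the earlier snippets
train_score = ridge.score(X_train_scaled, y_train)
test_score = ridge.score(X_test_scaled, y_test)
print(f"Train R²: {train_score:.3f}  Test R²: {test_score:.3f}")
# A big gap (high train, low test) suggests overfitting; similar values suggest good generalization
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;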

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;OLS is simple but prone to overfitting in complex datasets. Ridge and Lasso regression introduce regularization to improve stability and generalization. Ridge is best when all features matter, while Lasso is preferred for sparse, interpretable models. Understanding when and how to apply these techniques is essential for both exams and real-world machine learning problems.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>programming</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Building an Effective Power BI Dashboard: Connection, Cleaning, Modeling &amp; Design</title>
      <dc:creator>Maureen Muthoni</dc:creator>
      <pubDate>Wed, 10 Dec 2025 11:24:30 +0000</pubDate>
      <link>https://dev.to/maureenmuthonihue/building-an-effective-power-bi-dashboard-connection-cleaning-modeling-design-1k0d</link>
      <guid>https://dev.to/maureenmuthonihue/building-an-effective-power-bi-dashboard-connection-cleaning-modeling-design-1k0d</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Our hospital management system began with five interconnected tables storing operational data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Patients - Demographics and contact information&lt;/li&gt;
&lt;li&gt;Doctors - Provider profiles and specializations&lt;/li&gt;
&lt;li&gt;Appointments - Scheduling and visit records&lt;/li&gt;
&lt;li&gt;Admissions - Inpatient stay information&lt;/li&gt;
&lt;li&gt;Bills - Financial transactions and payment tracking&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Integration Challenge
&lt;/h3&gt;

&lt;p&gt;Raw data rarely tells a complete story. Our appointment records contained timestamps but lacked easy date grouping. Status fields had inconsistent formatting ("cancelled" vs "Cancelled" vs "CANCELLED"). Most critically, connecting appointment data to patient demographics and doctor specializations required three separate table joins.&lt;br&gt;
For financial analysis, understanding a patient's complete billing history meant traversing from patients → admissions → bills, aggregating along the way.&lt;br&gt;
&lt;strong&gt;The Solution&lt;/strong&gt;: Build staging views, making the data consistently accessible for all downstream analysis &lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;View 1: Appointments_Enriched (Operational Hub)&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Combines the most frequently accessed data points&lt;/li&gt;
&lt;li&gt;Eliminates repetitive join logic across reports&lt;/li&gt;
&lt;li&gt;Maintains real-time accuracy (dynamic view updates automatically)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqgr5v0nh5g3nwxfayh3y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqgr5v0nh5g3nwxfayh3y.png" alt="Appointment_Enriched" width="800" height="692"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Key Decision: Used inner joins to ensure data integrity. Appointments without valid patient/doctor references are excluded, preventing corrupt data from polluting reports.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;View 2: Patient_Balances (Financial Lens)&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Enables quick identification of collection priorities&lt;br&gt;
Supports cash flow forecasting and bad debt analysis&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyra9gwenxn7l46cncg4k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyra9gwenxn7l46cncg4k.png" alt="Patient_Balances" width="735" height="399"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Key Decision: Aggregated at "PatientID" level rather than admission level. &lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;View 3: Doctor_Monthly_Metrics (Performance Tracker)&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvf6ar8fbuyrfu04tgpig.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvf6ar8fbuyrfu04tgpig.png" alt="Doctor monthly metrics" width="800" height="519"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this post, I’ll walk through the key steps I followed: connecting to the data, cleaning it, modelling it, and designing a clear dashboard, highlighting the practical decisions that helped shape the final report.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Connecting to the Data
&lt;/h3&gt;

&lt;p&gt;The project began by bringing multiple data sources into Power BI Desktop. These included structured tables that contained records, lookup information, and date-related fields. Using Power BI’s Get Data interface, the sources were imported in Import Mode from a PostgreSQL database.&lt;br&gt;
Once the tables were loaded, I confirmed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data types were detected correctly&lt;/li&gt;
&lt;li&gt;Column names were consistent&lt;/li&gt;
&lt;li&gt;Tables aligned logically (fact vs. dimension)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This initial step set the foundation for all transformations and modelling work.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Cleaning &amp;amp; Transforming the Data
&lt;/h3&gt;

&lt;p&gt;Most of the data preparation happened in Power Query, where I carried out cleaning tasks before loading tables into the model.&lt;/p&gt;

&lt;p&gt;Key cleaning steps included:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Renaming columns to clear, readable names.&lt;/li&gt;
&lt;li&gt;Merging columns, such as first and last name into full names.&lt;/li&gt;
&lt;li&gt;Fixing incorrect data types (e.g., numbers stored as text, date/time inconsistencies).&lt;/li&gt;
&lt;li&gt;Normalizing categories so that values followed a consistent naming convention.&lt;/li&gt;
&lt;li&gt;Filtering out invalid records and replacing missing values that could distort metrics.&lt;/li&gt;
&lt;li&gt;Removing duplicates to avoid inflated totals.&lt;/li&gt;
&lt;li&gt;Trimming extra spaces from text fields.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These cleaning procedures ensured that the dataset was accurate, consistent, and analytics ready.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Data Modelling Choices
&lt;/h3&gt;

&lt;p&gt;I used a star schema design to keep the model simple, efficient, and easy to scale.&lt;br&gt;
The model included:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fact tables holding transactional or event level records.&lt;/li&gt;
&lt;li&gt;Dimension tables for people, categories, products, locations, and calendar data.&lt;/li&gt;
&lt;li&gt;A dedicated Date table, enabling accurate time-based analysis.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Relationships were kept single-directional except where needed for specific behaviours, and unused columns were removed to keep the model lean.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpoi6iu7kkne0djwq4xwc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpoi6iu7kkne0djwq4xwc.png" alt="Schema" width="800" height="379"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Dashboard Design &amp;amp; Visual Layout
&lt;/h3&gt;

&lt;p&gt;With the data clean and model optimized, I designed an interactive dashboard intended to provide both summaries and insights.&lt;br&gt;
The dashboard included:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;KPI cards showing totals and key performance indicators&lt;/li&gt;
&lt;li&gt;Trend charts to show how activity changed over time&lt;/li&gt;
&lt;li&gt;Category comparisons using bar and column charts&lt;/li&gt;
&lt;li&gt;Detailed tables for users who want record-level detail&lt;/li&gt;
&lt;li&gt;Slicers and filters for month, doctor's name, and specialization&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5. Final Dashboard
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Home screen with KPIs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxjsp4kh7iuy59d3zzk5w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxjsp4kh7iuy59d3zzk5w.png" alt="KPIs" width="800" height="108"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Trend charts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwx4ybx5saf6a43mn1x69.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwx4ybx5saf6a43mn1x69.png" alt="Charts" width="800" height="365"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Category breakdowns&lt;/li&gt;
&lt;li&gt;Detailed tables&lt;/li&gt;
&lt;li&gt;Filters panel&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ke9z36yllbcvox15yqg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ke9z36yllbcvox15yqg.png" alt="Filters" width="175" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dashboards&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fexeipsxedt55vqmi53qh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fexeipsxedt55vqmi53qh.png" alt="DB_1" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg5gjtwoagkiio8bxr5nj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg5gjtwoagkiio8bxr5nj.png" alt="DB_2" width="800" height="448"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frqrby2mf0t9o1iamqnff.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frqrby2mf0t9o1iamqnff.png" alt="DB_3" width="800" height="448"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdfiuynzeakpqllr2mati.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdfiuynzeakpqllr2mati.png" alt="DB_4" width="800" height="438"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This Power BI project successfully transforms raw data into clear, actionable insights. Through careful data cleaning, a well structured star schema model, and thoughtfully designed visuals, the dashboard provides users with an intuitive way to explore trends and compare performance. The result is a clean, interactive, and reliable report that supports quick understanding and informed decision making.&lt;/p&gt;

&lt;p&gt;This was done as a group.&lt;br&gt;
&lt;strong&gt;co- authors:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. - Hilda Chepkirui&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;2. - Asha Siyat&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;3. - Saciid Shaakaal&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;4. - Samuel Irungu&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>discuss</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Connecting PostgreSQL to Power BI</title>
      <dc:creator>Maureen Muthoni</dc:creator>
      <pubDate>Sun, 23 Nov 2025 18:57:53 +0000</pubDate>
      <link>https://dev.to/maureenmuthonihue/connecting-postgresql-to-power-bi-l36</link>
      <guid>https://dev.to/maureenmuthonihue/connecting-postgresql-to-power-bi-l36</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;Power BI is one of the most popular business intelligence tools for data visualization and analytics. Combined with PostgreSQL, a powerful open-source relational database, you can create dashboards and reports. This guide will walk you through connecting PostgreSQL to Power BI using two approaches: a local PostgreSQL installation and Aiven's cloud-hosted PostgreSQL service.&lt;/p&gt;

&lt;h2&gt;
  
  
  Connecting Local PostgreSQL to Power BI
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1. Installation
&lt;/h3&gt;

&lt;p&gt;Download PostgreSQL from the official PostgreSQL website and follow the installation process. Download Power BI from the Microsoft Store. During installation, note your user password and port number. If yours is a local installation, the default port number is 5432.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2. Preparing your database
&lt;/h3&gt;

&lt;p&gt;Ensure your database contains the data you want to visualize, and make sure your PostgreSQL server is active.&lt;br&gt;
Typical default settings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Host&lt;/strong&gt;: localhost&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Port&lt;/strong&gt;: 5432&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Default database&lt;/strong&gt;: postgres&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Username&lt;/strong&gt;: postgres&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then test the connection.&lt;/p&gt;
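&lt;p&gt;If you want to verify the connection outside Power BI first, a minimal Python sketch using the psycopg2 package (installed separately, e.g. with pip install psycopg2-binary) can help; this step is optional and not part of Power BI itself.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import psycopg2

# Adjust these values to match your local PostgreSQL installation
conn = psycopg2.connect(
    host="localhost",
    port=5432,
    dbname="postgres",
    user="postgres",
    password="your_password_here",
)
with conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone())
conn.close()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;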

&lt;h3&gt;
  
  
  Step 3. Connect PostgreSQL to Power BI.
&lt;/h3&gt;

&lt;p&gt;Open Power BI and follow these steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Click on "Get Data" on the home ribbon.&lt;/li&gt;
&lt;li&gt;In the Get Data window, navigate to More &amp;gt; Database &amp;gt; PostgreSQL database.&lt;/li&gt;
&lt;li&gt;Click "Connect".&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fur2dvyh14kesty2x4av6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fur2dvyh14kesty2x4av6.png" alt="Postgres Database" width="800" height="781"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4. Enter local connection details.
&lt;/h3&gt;

&lt;p&gt;Fill the dialog:&lt;br&gt;
&lt;strong&gt;Server&lt;/strong&gt;: localhost:5432&lt;br&gt;
&lt;strong&gt;Database&lt;/strong&gt;: postgres (or name of your DB)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fffa2zw7e1x1ps90tshcg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fffa2zw7e1x1ps90tshcg.png" alt="Dialog Box" width="800" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click "OK"&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5. Enter Credentials.
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Username&lt;/strong&gt;: your PostgreSQL username&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Password&lt;/strong&gt;: your password&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Select&lt;/strong&gt; “Use Encrypted Connection” if available&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Click&lt;/strong&gt; "Connect".&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 6. Load Data.
&lt;/h3&gt;

&lt;p&gt;The Navigator window will display all available tables and views in your database. Select the tables you want to work with by checking the boxes next to them. You can preview the data by clicking on each table name.&lt;/p&gt;

&lt;h2&gt;
  
  
  Connecting Aiven PostgreSQL to Power BI.
&lt;/h2&gt;

&lt;p&gt;Aiven is a cloud-based platform that provides fully managed services for open-source data technologies like databases and streaming services.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1. Set up Aiven PostgreSQL.
&lt;/h3&gt;

&lt;p&gt;If you don't have an Aiven account, sign up at aiven.io. Aiven offers a free trial to test their services.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a new PostgreSQL service:&lt;/li&gt;
&lt;li&gt;Log into the Aiven console&lt;/li&gt;
&lt;li&gt;Click "Create a new service"&lt;/li&gt;
&lt;li&gt;Select "PostgreSQL" as the service type&lt;/li&gt;
&lt;li&gt;Choose your cloud provider and region&lt;/li&gt;
&lt;li&gt;Select a service plan based on your needs&lt;/li&gt;
&lt;li&gt;Name your service and click "Create service"&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Step 2. Retrieve Connection Information.
&lt;/h3&gt;

&lt;p&gt;In the Aiven console, click on your PostgreSQL service to view its details. You'll find the connection information in the "Overview" tab:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Host&lt;/strong&gt;: The service URI host (usually in the format service-name-project-name.aivencloud.com)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Port&lt;/strong&gt;: The port number assigned to your service (shown in the console; Aiven typically uses a non-default port)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User&lt;/strong&gt;: Default is "avnadmin"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Password&lt;/strong&gt;: Click the eye icon to reveal the password&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Database&lt;/strong&gt;: Default is "defaultdb"&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Step 3. Download the CA certificate.
&lt;/h3&gt;

&lt;p&gt;Aiven enforces SSL connections for security. Download the CA certificate from the Aiven console:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In your service overview, find the "Download CA cert" button&lt;/li&gt;
&lt;li&gt;Save the certificate file to a known location on your computer&lt;/li&gt;
&lt;li&gt;Note the file path for later use&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 4. Connect Power Bi to PostgreSQL.
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Click "Get Data" from the Home ribbon&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Select "PostgreSQL database" and click "Connect"&lt;br&gt;
In the connection dialog, enter:&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Server&lt;/strong&gt;: Your Aiven host address followed by the port number (host:port).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Database&lt;/strong&gt;: Your database name ("defaultdb").&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F54j72t3y6ppcv03v33zh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F54j72t3y6ppcv03v33zh.png" alt="Postgres connection 2" width="800" height="415"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5. Authentication.
&lt;/h3&gt;

&lt;p&gt;Enter the username and password from your Aiven service.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 6. Load data and transform data.
&lt;/h3&gt;

&lt;p&gt;Just like with local PostgreSQL, the Navigator window will show your available tables and views. Select the data you need and click "Load" or "Transform Data" to begin working with your Aiven hosted data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Connecting PostgreSQL to Power BI whether running locally or hosted on Aiven is simple once the correct drivers and SSL configurations are in place.&lt;br&gt;
Local PostgreSQL connects using localhost and standard credentials.&lt;br&gt;
Aiven PostgreSQL requires SSL certificates and cloud connection parameters.&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>database</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Excel in the Era of Power BI &amp; Python</title>
      <dc:creator>Maureen Muthoni</dc:creator>
      <pubDate>Sat, 04 Oct 2025 06:51:51 +0000</pubDate>
      <link>https://dev.to/maureenmuthonihue/excel-in-the-era-of-power-bi-python-o1a</link>
      <guid>https://dev.to/maureenmuthonihue/excel-in-the-era-of-power-bi-python-o1a</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;Whenever I receive a messy dataset, I always reach for Excel first, not Power BI or Python.&lt;/p&gt;

&lt;p&gt;Excel remains the go-to first option for operational analytics. It handles quick analysis, data entry, and rapid data cleaning.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Excel Shines
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Excel is ubiquitous: almost everyone knows how to use it, and it is available almost everywhere.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;No coding: it does not require a programming language, which can be a barrier for some users.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Rapid data cleaning: it is quick for cleaning data and data entry, without requiring complex operations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Interactive dashboards: it is easy to build dashboards, KPIs, and dynamic visuals in Excel.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Self-contained: no dependencies, environments, or deployment needed.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Power BI
&lt;/h2&gt;

&lt;p&gt;Power BI is mostly better for automated reporting dashboards, real-time data connections, and interactive visualizations.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Scalability: it handles large data with ease and connects to live data sources.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Visual polish: its visuals are sleek and interactive, with modern charts and drill-through, and it is great for storytelling.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;DAX: a powerful modelling language that unlocks advanced metrics and time intelligence.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Python
&lt;/h2&gt;

&lt;p&gt;Python performs complex analysis, automation, machine learning and data engineering. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Automation: it builds repeatable pipelines for cleaning and transformation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Advanced analytics: it enables techniques like regression, clustering, and forecasting.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Integration: it connects to APIs, databases, and cloud services.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Practical Stack
&lt;/h3&gt;

&lt;p&gt;Here's how I view modern analytics:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5nrr44zdxg1k63oeio9l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5nrr44zdxg1k63oeio9l.png" alt="Data Flowchart" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each tool has its strengths, and together they form a flexible workflow.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Power BI and Python are taking over repeatable automated reports, large databases, complex transformations, and production dashboards, but Excel still thrives in business work where flexibility matters more than scalability.&lt;/p&gt;

&lt;p&gt;Most organisations use all three tools, but many analysts still prototype in Excel before building something more formal.&lt;/p&gt;

&lt;p&gt;Excel also continues to evolve: it now includes Python integration, Power Query for ETL tasks, and other sophisticated functions. Rather than becoming obsolete, it has become part of an integrated toolkit.&lt;/p&gt;

&lt;p&gt;Excel isn't going anywhere; it's still the fastest way to clean, model, and explore data.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>discuss</category>
      <category>analytics</category>
      <category>luxdev</category>
    </item>
  </channel>
</rss>
