<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Maureen Mukami</title>
    <description>The latest articles on DEV Community by Maureen Mukami (@maureen_mukami_4268d10eac).</description>
    <link>https://dev.to/maureen_mukami_4268d10eac</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3252374%2Fcf28f28a-cb32-462a-a923-746b5237d175.PNG</url>
      <title>DEV Community: Maureen Mukami</title>
      <link>https://dev.to/maureen_mukami_4268d10eac</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/maureen_mukami_4268d10eac"/>
    <language>en</language>
    <item>
      <title>RAGs for Dummies: The Game-Changing Power of RAG</title>
      <dc:creator>Maureen Mukami</dc:creator>
      <pubDate>Sun, 21 Sep 2025 17:01:06 +0000</pubDate>
      <link>https://dev.to/maureen_mukami_4268d10eac/rags-for-dummies-the-game-changing-power-of-rag-2j86</link>
      <guid>https://dev.to/maureen_mukami_4268d10eac/rags-for-dummies-the-game-changing-power-of-rag-2j86</guid>
      <description>&lt;p&gt;&lt;strong&gt;What is RAG?&lt;/strong&gt;&lt;br&gt;
RAG stands for Retrieval-Augmented Generation. It’s a powerful framework that combines large language models (LLMs) such as GPT-4, LLaMA, or Mistral with external knowledge sources like company documents, databases, or the web. If we think of an LLM as a smart student with a huge memory, then RAG is like giving that student access to a library full of the latest books and notes. This way, they don’t just rely on what they memorized years ago; they can also look things up before answering.&lt;br&gt;
LLMs are great, but they come with limitations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Limited knowledge: they only know what they were trained on, so recent updates or niche details may be missing.&lt;/li&gt;
&lt;li&gt;Hallucinations: they sometimes confidently make things up, even though the answers sound convincing.&lt;/li&gt;
&lt;li&gt;Generic answers: without specific context, they may respond vaguely.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;RAG solves these problems by letting the model fetch real information first, then generate an informed answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Steps for setting up a RAG framework&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;&lt;em&gt;1. Data Collection&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
This step gathers the knowledge your system will rely on, such as company policies, SOPs, HR manuals, FAQs, product guides, business reports, or even external sources like research papers and regulations. The data can come in different formats including PDFs, Word docs, spreadsheets, databases, or web pages. It’s important to collect only relevant information, clean and update it regularly, organize it by category, and secure sensitive files. For example, a phone company might collect manuals, warranty documents, and troubleshooting guides to help a RAG bot provide accurate customer support.&lt;br&gt;
&lt;strong&gt;&lt;em&gt;2. Data Chunking&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
This process entails splitting large documents into smaller, manageable sections (e.g., 500–1000 words) so the system can retrieve only the most relevant parts. To avoid cutting off important context mid-sentence or paragraph, chunks are often created with a slight overlap (like 50–100 words) between them. This overlap ensures smoother continuity, prevents loss of meaning, and improves the quality of answers generated by the LLM.&lt;br&gt;
&lt;strong&gt;&lt;em&gt;3. Document Embeddings&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
This step converts text chunks into numerical vectors that capture their meaning, making them searchable by similarity rather than exact words. These embeddings are stored in a vector database (like FAISS, Pinecone, or Chroma), which allows the retriever to quickly find the most relevant chunks when a user asks a question. When generating embeddings, you can pass kwargs (keyword arguments) such as model name, batch size, or normalization options to fine-tune how the embeddings are created. This step is crucial because high-quality embeddings directly determine how accurately your RAG system retrieves and ranks information.&lt;br&gt;
&lt;strong&gt;&lt;em&gt;4. User Queries&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
The user queries are transformed into embeddings so they can be compared with document embeddings in the vector database. The retriever then selects the most relevant chunks based on similarity, often using parameters like k to control how many results are returned. These chunks are combined with the query and sent to the LLM, which generates a context-aware answer.&lt;br&gt;
&lt;strong&gt;&lt;em&gt;5. Generate a Response&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
The query and the selected chunks are passed into the LLM, which uses both to craft an informed and accurate answer. You can fine-tune this step with parameters like temperature (controls creativity), max_tokens (limits response length), and other kwargs for custom behavior. This ensures the final response is not only factually grounded but also clear, coherent, and aligned with the user’s needs.&lt;/p&gt;
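&lt;p&gt;The five steps above can be sketched end to end in plain Python. This is a minimal illustration, not a real implementation: the bag-of-words “embedding” and the prompt-only final step stand in for a real embedding model, vector database, and LLM, and the document text is invented for the example.&lt;/p&gt;

```python
import math
from collections import Counter

def chunk(text, size=40, overlap=10):
    # split into word windows of `size` words, sharing `overlap` words
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def embed(text):
    # toy "embedding": bag-of-words counts instead of a neural model
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = ["The X200 phone ships with a 2-year warranty covering "
        "manufacturing defects. Water damage is not covered by the warranty."]

# steps 1-3: collect, chunk, and "embed" the documents into an index
index = [(c, embed(c)) for d in docs for c in chunk(d, size=12, overlap=4)]

def retrieve(query, k=2):
    # step 4: embed the query and rank chunks by similarity, keeping top k
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [c for c, _ in ranked[:k]]

def answer(query):
    # step 5: combine retrieved context with the query into an LLM prompt
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(answer("What is the warranty on the X200?"))
```

&lt;p&gt;A production pipeline has the same shape: a chunker, an embedding model, a similarity search over a vector store, and a final LLM call that receives the retrieved context.&lt;/p&gt;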

&lt;p&gt;&lt;strong&gt;Real-World Power of RAG&lt;/strong&gt;&lt;br&gt;
i. Customer Support&lt;br&gt;
RAGs answer customer questions with precision.&lt;br&gt;
For example, if a customer asks, “What’s the warranty on Model X200?” → RAG fetches the warranty policy and answers: “The X200 has a 2-year warranty covering manufacturing defects.”&lt;br&gt;
ii. Market Research&lt;br&gt;
RAGs summarize customer reviews, social media, or industry reports.&lt;br&gt;
For example, “What are customers saying about our new app update?” → RAG analyzes feedback and gives a sentiment breakdown.&lt;br&gt;
iii. Content Generation&lt;br&gt;
RAGs automatically create product descriptions, wikis, or reports.&lt;br&gt;
An example is when RAG generates a sales report by pulling the latest figures from company databases.&lt;br&gt;
iv. Data Analysis &amp;amp; Business Intelligence&lt;br&gt;
RAGs extract insights from huge datasets.&lt;br&gt;
For example, “What were the top 5 reasons for customer complaints last quarter?” → RAG scans logs and summarizes findings.&lt;br&gt;
v. Knowledge Management&lt;br&gt;
RAGs make company policies and procedures easy to access.&lt;br&gt;
For example new employees can ask, “What’s the leave policy?” and instantly get the official HR response.&lt;/p&gt;

&lt;p&gt;In conclusion, RAG is a game changer: it keeps responses accurate and current, reduces hallucinations from LLMs, makes AI useful for real-world business problems, and is flexible, working with cloud APIs (like GPT-4, Claude) or local open-source models (like Mistral, LLaMA).&lt;br&gt;
At its core, RAG = LLM + Your Data = Smart, Reliable Assistant.&lt;br&gt;
It gives AI the brains of a language model plus the memory of a search engine, making it one of the most powerful tools for businesses, researchers, and everyday users alike.&lt;/p&gt;

</description>
      <category>rag</category>
      <category>vectordatabase</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Clustering as a Method of Unveiling Hidden Patterns in Data</title>
      <dc:creator>Maureen Mukami</dc:creator>
      <pubDate>Fri, 05 Sep 2025 13:04:29 +0000</pubDate>
      <link>https://dev.to/maureen_mukami_4268d10eac/clustering-as-a-method-of-unveiling-hidden-patterns-in-data-3c09</link>
      <guid>https://dev.to/maureen_mukami_4268d10eac/clustering-as-a-method-of-unveiling-hidden-patterns-in-data-3c09</guid>
      <description>&lt;p&gt;Unsupervised learning is a type of machine learning that deals with unlabeled data. While supervised learning relies on labeled data to make predictions, unsupervised learning works with data that has no predefined labels or outputs. This makes it particularly powerful for uncovering hidden patterns, relationships, and structures in data without human intervention. Unsupervised learning algorithms do not rely on direct input-to-output mappings. Instead, they autonomously explore data to find meaningful organization. Over the years, these algorithms have become increasingly efficient in discovering the underlying structures of complex, unlabeled datasets.&lt;br&gt;
&lt;em&gt;How Does Unsupervised Learning Work?&lt;/em&gt;&lt;br&gt;
Unsupervised learning works by:&lt;br&gt;
-Analyzing unlabeled data to identify similarities, differences, and relationships.&lt;br&gt;
-Grouping or transforming data into structures that highlight hidden patterns.&lt;br&gt;
-Providing insights that may not be obvious through human observation.&lt;br&gt;
&lt;strong&gt;&lt;em&gt;Main Models in Unsupervised Learning&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
There are three primary methods in unsupervised learning:&lt;br&gt;
&lt;em&gt;Clustering&lt;/em&gt;&lt;br&gt;
Involves grouping untagged data based on similarities and differences.&lt;br&gt;
Items in the same group (cluster) share common properties, while items in different groups are dissimilar.&lt;br&gt;
&lt;em&gt;Association Rules&lt;/em&gt;&lt;br&gt;
A rule-based approach for discovering interesting relationships between features in a dataset.&lt;br&gt;
Uses statistical measures (support, confidence, lift) to identify strong associations.&lt;br&gt;
&lt;em&gt;Dimensionality Reduction&lt;/em&gt;&lt;br&gt;
Transforms data from high-dimensional spaces into low-dimensional spaces without losing important information. Common techniques: PCA (Principal Component Analysis), t-SNE. Useful for visualization, noise reduction, and computational efficiency.&lt;br&gt;
&lt;strong&gt;Clustering&lt;/strong&gt;&lt;br&gt;
Clustering is the most widely applied technique in unsupervised learning.&lt;br&gt;
Clustering is the process of organizing data into groups so that objects within the same group (cluster) are more similar to each other than to those in other groups. It is often used to reveal natural structures within datasets where no prior labels exist.&lt;br&gt;
Clustering answers the question:&lt;br&gt;
&lt;em&gt;“Which data points naturally belong together?”&lt;/em&gt;&lt;br&gt;
Common Clustering Algorithms include:&lt;br&gt;
&lt;strong&gt;&lt;em&gt;a. K-Means Clustering&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
K-Means clustering divides data into k clusters, where k is predefined. Each cluster is represented by a centroid, and data points are assigned to the nearest centroid. It is an iterative algorithm that creates non-overlapping clusters, meaning each instance in the dataset belongs to exactly one cluster.&lt;br&gt;
Pros: Efficient on large datasets.&lt;br&gt;
Cons: Requires pre-selecting the number of clusters (k) and is sensitive to outliers.&lt;br&gt;
&lt;strong&gt;&lt;em&gt;b. Hierarchical Clustering&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Hierarchical clustering builds a hierarchy of clusters, visualized through a dendrogram. There are two types:&lt;br&gt;
i. Agglomerative (Bottom-Up): starts with each point as its own cluster and merges them step by step.&lt;br&gt;
ii. Divisive (Top-Down): starts with one cluster and splits it recursively.&lt;br&gt;
Pros: No need to specify the number of clusters beforehand.&lt;br&gt;
Cons: Computationally expensive for large datasets.&lt;br&gt;
&lt;strong&gt;&lt;em&gt;c. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
DBSCAN groups points that are closely packed together and labels isolated points as noise. Unlike K-Means, the number of clusters does not need to be specified.&lt;br&gt;
Pros: Can detect arbitrarily shaped clusters and handle outliers.&lt;br&gt;
Cons: Performance drops with clusters of varying densities.&lt;br&gt;
&lt;strong&gt;&lt;em&gt;d. Mean Shift Clustering&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Mean Shift clustering identifies clusters by shifting data points toward areas of higher density. It automatically determines the number of clusters based on the data distribution.&lt;br&gt;
Pros: No need for k value.&lt;br&gt;
Cons: Can be computationally heavy.&lt;br&gt;
&lt;strong&gt;&lt;em&gt;Applications of Clustering&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Customer Segmentation – grouping customers based on behavior for personalized marketing.&lt;br&gt;
Anomaly Detection – spotting unusual data points such as fraudulent transactions.&lt;br&gt;
Document/Text Clustering – organizing news articles, research papers, or emails into categories.&lt;br&gt;
Image Segmentation – dividing images into regions for medical or computer vision tasks.&lt;br&gt;
Recommendation Systems – grouping users with similar preferences for product or content suggestions.&lt;/p&gt;
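&lt;p&gt;To make the K-Means idea above concrete, here is a tiny from-scratch version of Lloyd’s algorithm on invented 2-D points. It shows the assign-then-update loop; a real project would reach for a library implementation such as scikit-learn’s KMeans.&lt;/p&gt;

```python
import random

def kmeans(points, k, iters=20, seed=0):
    random.seed(seed)
    centroids = random.sample(points, k)  # pick k starting centroids
    clusters = []
    for _ in range(iters):
        # assignment step: every point joins its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: (p[0] - centroids[i][0]) ** 2
                                  + (p[1] - centroids[i][1]) ** 2)
            clusters[j].append(p)
        # update step: each centroid moves to the mean of its cluster
        for j, cl in enumerate(clusters):
            if cl:
                centroids[j] = (sum(p[0] for p in cl) / len(cl),
                                sum(p[1] for p in cl) / len(cl))
    return centroids, clusters

# two obvious blobs: one around (0, 0) and one around (10, 10)
pts = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
centroids, clusters = kmeans(pts, k=2)
print(centroids)
```

&lt;p&gt;On this toy data the two centroids settle near the centers of the two blobs, which is exactly the non-overlapping partition described above.&lt;/p&gt;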

&lt;p&gt;My key insight is that, unlike supervised learning, where models rely on labeled data and clear input-output mappings, unsupervised learning, and clustering in particular, thrives in situations where labels are absent. To me, this is its greatest strength because most real-world data is unstructured and unlabeled. I find clustering especially valuable because it reveals hidden structures without requiring human guidance. For example, while supervised models can classify emails as spam or not spam, clustering can go further by discovering new, previously unseen patterns in user behavior or anomalies that no one thought to label. That said, I also recognize the challenges. Unlike supervised learning, the results of clustering are not always straightforward to evaluate. Determining the right number of clusters, the right algorithm, or even whether the clusters are meaningful can sometimes feel subjective. In my view, this makes clustering as much an art as it is a science.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>A Simple Guide to Classification in Machine Learning</title>
      <dc:creator>Maureen Mukami</dc:creator>
      <pubDate>Mon, 25 Aug 2025 09:36:05 +0000</pubDate>
      <link>https://dev.to/maureen_mukami_4268d10eac/a-simple-guide-to-classification-in-machine-learning-37aj</link>
      <guid>https://dev.to/maureen_mukami_4268d10eac/a-simple-guide-to-classification-in-machine-learning-37aj</guid>
<description>&lt;p&gt;Supervised learning is a machine learning approach where models are trained with labeled data. The data is split into training sets, which teach the model, and testing sets, which check the model’s accuracy. The goal is for predictions to match the true outcomes as closely as possible.&lt;br&gt;
&lt;strong&gt;&lt;em&gt;How Classification Works&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Classification works by teaching a model how to recognize patterns using examples that already have answers. Classification deals with discrete data, assigning items to specific categories. It falls into three types: &lt;strong&gt;Binary classification&lt;/strong&gt;, where data belongs to one of two classes, such as “pass or fail” or “true or false”; &lt;strong&gt;Multiclass classification&lt;/strong&gt;, where data falls into one class among many, for example identifying different species of plants; and &lt;strong&gt;Multilabel classification&lt;/strong&gt;, where data can belong to multiple classes at once, such as classifying a movie as both “comedy” and “romance” (a romcom).&lt;br&gt;
Different models which are different algorithms or techniques that are applied to sort data into categories (classes) include:&lt;br&gt;
&lt;strong&gt;Logistic Regression&lt;/strong&gt;, which predicts probabilities using a sigmoid function.&lt;br&gt;
&lt;strong&gt;Support Vector Machines (SVMs)&lt;/strong&gt;, which find the best boundary between classes.&lt;br&gt;
&lt;strong&gt;Decision Trees&lt;/strong&gt; which split data into branches for decisions.&lt;br&gt;
&lt;strong&gt;Random Forests&lt;/strong&gt; which combine multiple trees for better accuracy.&lt;br&gt;
&lt;strong&gt;Naive Bayes&lt;/strong&gt; which uses probability with independence assumptions.&lt;br&gt;
&lt;strong&gt;KNN&lt;/strong&gt; which classifies based on nearest neighbors.&lt;br&gt;
&lt;strong&gt;Neural Networks&lt;/strong&gt; which use layers of neurons to learn complex patterns.&lt;/p&gt;
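&lt;p&gt;As a concrete taste of one of these models, here is a from-scratch k-nearest-neighbours classifier. The two-class training points and labels are made up for illustration; the logic is simply “find the k closest labeled examples and take a majority vote.”&lt;/p&gt;

```python
from collections import Counter

def knn_predict(train, query, k=3):
    # squared Euclidean distance from a training point to the query
    def dist2(p):
        return (p[0] - query[0]) ** 2 + (p[1] - query[1]) ** 2
    # keep the k closest points and take a majority vote of their labels
    nearest = sorted(train, key=lambda item: dist2(item[0]))[:k]
    votes = Counter(label for point, label in nearest)
    return votes.most_common(1)[0][0]

train = [((1, 1), "fail"), ((1, 2), "fail"), ((2, 1), "fail"),
         ((8, 8), "pass"), ((8, 9), "pass"), ((9, 8), "pass")]
print(knn_predict(train, (2, 2)))  # near the "fail" group, prints "fail"
print(knn_predict(train, (9, 9)))  # near the "pass" group, prints "pass"
```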

&lt;p&gt;From my perspective, classification makes handling large datasets easier by grouping and labeling information in meaningful ways. This allows people to analyze patterns more effectively and make smart, data-driven decisions. What excites me most is the impact of classification in real life. For example, advanced systems can help detect diseases early in healthcare, or spot fraudulent transactions in banking, which protects people’s money. Classification also powers personalization—recommendation systems suggest movies, music, and products that match individual preferences, creating unique user experiences. I also see classification as an assistant for decision-making. Instead of replacing human judgment, it supports people by providing fast, evidence-based insights. That combination of speed and accuracy is transformative across industries.&lt;/p&gt;

&lt;p&gt;While I’ve struggled to fully grasp some algorithms at first, constant practice with Python has made concepts clearer. I’m also learning that success is not only about knowing algorithms but also about choosing the right one for the right problem.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Lupus and the Trade-off between Type I and Type II Errors in Diagnosis</title>
      <dc:creator>Maureen Mukami</dc:creator>
      <pubDate>Wed, 13 Aug 2025 10:15:13 +0000</pubDate>
      <link>https://dev.to/maureen_mukami_4268d10eac/lupus-and-the-trade-off-between-type-i-and-type-ii-errors-in-diagnosis-4n7f</link>
      <guid>https://dev.to/maureen_mukami_4268d10eac/lupus-and-the-trade-off-between-type-i-and-type-ii-errors-in-diagnosis-4n7f</guid>
      <description>&lt;p&gt;Lupus, also known as systemic lupus erythematosus (SLE), is a chronic autoimmune disease in which the immune system mistakenly attacks healthy tissues and organs. The exact cause is unknown, but potential triggers include genetics, hormones (it is more common in women of childbearing age), and environmental factors such as sunlight, infections, stress, and certain medications. While there is no cure, treatment can control symptoms and prevent organ damage. &lt;br&gt;
In hypothesis testing, the process of diagnosing lupus can be expressed as:&lt;br&gt;
  • &lt;strong&gt;&lt;em&gt;H₀ (Null hypothesis):&lt;/em&gt;&lt;/strong&gt; The patient does not have lupus.&lt;br&gt;
  • &lt;strong&gt;&lt;em&gt;H₁ (Alternative hypothesis)&lt;/em&gt;&lt;/strong&gt;: The patient has lupus.&lt;br&gt;
In medical diagnosis, two types of errors can occur:&lt;br&gt;
&lt;strong&gt;&lt;em&gt;Type I error (false positive)&lt;/em&gt;&lt;/strong&gt; happens when someone is diagnosed with lupus even though they don’t actually have it. This can lead to unnecessary treatment, which may harm the body. Medicines used for lupus are often strong, so they can overwork the liver as it processes them, increasing the risk of liver damage. The kidneys can also be strained from filtering drugs that aren’t needed. Strong medications like steroids may weaken the immune system, making infections more likely, and long-term use can cause bone loss, weight gain, high blood pressure, or diabetes. Other drugs may damage the eyes (retinopathy), and anti-inflammatory medicines can cause stomach ulcers or bleeding.&lt;br&gt;
&lt;strong&gt;&lt;em&gt;Type II error (false negative)&lt;/em&gt;&lt;/strong&gt; is when someone actually has lupus but is told they don’t. Without proper treatment, the disease keeps progressing and can cause serious organ damage. Kidneys may develop lupus nephritis, leading to kidney failure. The heart can become inflamed or develop a higher risk of heart attack. The lungs may be damaged or inflamed, causing breathing problems. Lupus can also affect the brain and nervous system, leading to seizures, strokes, memory loss, or mood changes. It may damage the blood and bone marrow, causing anemia or clotting problems. On the outside, it can cause ongoing skin rashes or ulcers, and in the joints, it can lead to painful, chronic arthritis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trade-off between Type I and Type II Errors in Lupus Diagnosis&lt;/strong&gt;&lt;br&gt;
In medicine, doctors must balance the risk of over-diagnosing (Type I error) with the risk of missing the diagnosis (Type II error). While Type I errors can result in unpleasant side effects, financial costs, and temporary health risks from unnecessary treatment, they are often reversible. A patient can return for follow-up visits, seek a second opinion, and stop treatment if the misdiagnosis is identified. Type II errors, however, can be far more dangerous in the case of lupus. Because the disease may progress silently, patients might feel healthy while lupus continues to damage vital organs such as the kidneys, heart, lungs, and brain. By the time the correct diagnosis is made, the damage could be irreversible, leading to lifelong disability or even death. For this reason, in diseases like lupus where early treatment is critical, it is generally safer to minimize Type II errors even if that means accepting a slightly higher risk of Type I errors. In other words, it is often better to risk over-treating a healthy patient than to miss lupus in someone who truly has it.&lt;/p&gt;
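&lt;p&gt;The trade-off can be made concrete with a toy calculation. Suppose a patient is flagged as having lupus when a single screening score reaches a chosen threshold; all numbers below are invented, and real diagnosis combines many criteria. Lowering the threshold exchanges Type II errors for Type I errors.&lt;/p&gt;

```python
from bisect import bisect_left

# hypothetical screening scores (sorted); every value is invented
healthy = [2, 3, 3, 4, 5, 5, 6, 7]   # truly do not have lupus (H0 true)
lupus = [5, 6, 7, 7, 8, 9, 9, 10]    # truly have lupus (H1 true)

def error_rates(threshold):
    # Type I (false positive): healthy patients scoring at or above threshold
    fp = len(healthy) - bisect_left(healthy, threshold)
    # Type II (false negative): lupus patients scoring below threshold
    fn = bisect_left(lupus, threshold)
    return fp / len(healthy), fn / len(lupus)

for t in (8, 5):
    fp_rate, fn_rate = error_rates(t)
    print(f"threshold {t}: Type I rate {fp_rate:.2f}, Type II rate {fn_rate:.2f}")
```

&lt;p&gt;With the strict threshold no healthy patient is over-treated but half the lupus patients are missed; with the lenient threshold no lupus patient is missed at the cost of flagging half the healthy ones, which is the direction the article argues for.&lt;/p&gt;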

</description>
    </item>
    <item>
      <title>Who Has the Upper Hand? Data-Driven Prediction for the 2025/2026 EPL Title Race</title>
      <dc:creator>Maureen Mukami</dc:creator>
      <pubDate>Thu, 31 Jul 2025 11:15:29 +0000</pubDate>
      <link>https://dev.to/maureen_mukami_4268d10eac/who-has-the-upper-hand-data-driven-prediction-for-the-20252026-epl-title-race-134c</link>
      <guid>https://dev.to/maureen_mukami_4268d10eac/who-has-the-upper-hand-data-driven-prediction-for-the-20252026-epl-title-race-134c</guid>
      <description>&lt;p&gt;Predicting the winner of the English Premier League (EPL) is always a complex challenge, given the intense competition and frequent surprises that define the league. Yet, by applying a data-driven approach to team performance across the last two seasons, a strong case can be made for Liverpool FC as the top contender for the 2025/2026 title.&lt;br&gt;
This prediction is based on an analysis of match outcomes from the 2023/2024 and 2024/2025 seasons, specifically looking at wins, draws, and losses for each team. Using the standard EPL point system (3 points for a win, 1 point for a draw, and 0 for a loss), I calculated the total points each team accumulated. I then determined each team's performance efficiency by dividing their total earned points by the maximum possible points (calculated as matches played × 3 × total number of teams). This yielded a probability score reflecting each team's ability to consistently secure points.&lt;br&gt;
According to this analysis, Liverpool FC emerged with a probability score of 0.036404, the highest among all EPL teams over the two-year period. While the figure may appear modest on its own, its value lies in how it compares with the rest of the league. Liverpool's score positions them at the top of the performance index, highlighting their consistent form and competitive edge.&lt;br&gt;
What does this mean going into the new season? Simply put, Liverpool has shown the strongest efficiency in turning matches into points across two full seasons. Their consistency, combined with tactical discipline, squad depth, and a winning mindset, makes them statistically the most likely team to lift the 2025/2026 trophy. In a sport where emotions, media narratives, and unpredictable transfers often influence expectations, a data-led forecast offers a refreshing, objective lens. As the EPL gears up for another thrilling season, the numbers suggest one thing clearly: Liverpool FC is the team to watch.&lt;/p&gt;
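&lt;p&gt;The scoring described above can be sketched in a few lines. The win/draw/loss counts below are invented placeholders, not the real two-season figures; only the method (3/1/0 points, then points divided by matches × 3 × number of teams) mirrors the article.&lt;/p&gt;

```python
# wins, draws, losses over two seasons; the counts are made up for the sketch
records = {
    "Liverpool": (48, 18, 10),
    "Arsenal": (46, 16, 14),
    "Man City": (45, 17, 14),
}
num_teams = 20  # EPL teams per season

def probability_score(wins, draws, losses):
    matches = wins + draws + losses
    points = 3 * wins + 1 * draws  # standard 3/1/0 point system
    # maximum possible points, per the article: matches x 3 x number of teams
    return points / (matches * 3 * num_teams)

scores = {team: probability_score(*wdl) for team, wdl in records.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 6))
```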

</description>
    </item>
    <item>
      <title>Measures of central tendencies</title>
      <dc:creator>Maureen Mukami</dc:creator>
      <pubDate>Thu, 24 Jul 2025 14:30:06 +0000</pubDate>
      <link>https://dev.to/maureen_mukami_4268d10eac/measures-of-central-tendencies-41g7</link>
      <guid>https://dev.to/maureen_mukami_4268d10eac/measures-of-central-tendencies-41g7</guid>
<description>&lt;p&gt;In the world of data analysis, making sense of large volumes of information is crucial. One of the foundational concepts that enable this is measures of central tendency. These are statistical tools used to describe the center point or typical value of a dataset, helping analysts and data scientists summarize data in a meaningful way. The three most common measures are the mean, median, and mode, each serving a unique purpose depending on the data context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;The Mean: The Arithmetic Average&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
The mean, often referred to as the average, is calculated by summing all values in a dataset and dividing by the total number of values. It is widely used due to its simplicity and interpretability. However, one of its limitations is its sensitivity to extreme values, also known as outliers. In skewed datasets, even a single unusually high or low value can distort the mean, making it less representative of the overall data distribution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;The Median: The Middle Ground&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
The median represents the middle value in a sorted dataset. If the number of observations is even, the median is the average of the two central values. One of the key advantages of the median is its resistance to outliers. It offers a better sense of central location in skewed datasets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Mode: The Most Frequent Value&lt;/strong&gt;&lt;br&gt;
The mode is the value that appears most frequently in a dataset. Unlike the mean and median, the mode is particularly useful for categorical data, where understanding the most common category is important. A dataset can be unimodal, bimodal, multimodal, or even have no mode if all values are unique.&lt;/p&gt;
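&lt;p&gt;All three measures are one line each with Python’s standard statistics module. The small spending dataset below is made up and includes one deliberate outlier to show how the mean and median diverge.&lt;/p&gt;

```python
import statistics

# made-up customer spending, with one extreme value (300) as an outlier
spend = [20, 25, 25, 30, 35, 40, 300]

mean = statistics.mean(spend)      # pulled upward by the outlier
median = statistics.median(spend)  # middle value, resistant to the outlier
mode = statistics.mode(spend)      # most frequent value

print(mean, median, mode)
```

&lt;p&gt;Here the single 300 drags the mean to roughly 68 while the median stays at 30, a gap that, as noted below, is itself a useful outlier signal.&lt;/p&gt;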

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Why Measures of Central Tendency Matter in Data Science&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
In data science, understanding the central tendency of a dataset is more than just a basic statistical exercise. It has broad applications across different stages of data analysis and model development.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Data Summarization and Exploration&lt;br&gt;
During exploratory data analysis (EDA), central tendency offers a quick overview of the data, allowing analysts to identify trends and patterns without diving into every individual data point.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Understanding Data Distributions&lt;br&gt;
The relationship between the mean, median, and mode provides insights into the shape of the distribution. In normally distributed data, all three measures are close. In skewed data, significant differences can highlight asymmetry, helping determine the appropriate statistical techniques or transformations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Outlier Detection&lt;br&gt;
A large gap between the mean and median can suggest the presence of outliers—unusual data points that may impact analysis or model accuracy.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Feature Engineering and Preprocessing&lt;br&gt;
These measures are often used to fill in missing values or create derived features. They also guide data transformation decisions, especially when preparing data for algorithms that assume normality.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Communication and Reporting&lt;br&gt;
Finally, central tendency measures make it easier to communicate findings to stakeholders. Saying “the average customer spends $50” is more impactful and digestible than listing all individual transactions.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    <item>
      <title>Relationships in Power BI</title>
      <dc:creator>Maureen Mukami</dc:creator>
      <pubDate>Sun, 22 Jun 2025 06:13:43 +0000</pubDate>
      <link>https://dev.to/maureen_mukami_4268d10eac/relationships-in-power-bi-44a4</link>
      <guid>https://dev.to/maureen_mukami_4268d10eac/relationships-in-power-bi-44a4</guid>
      <description>&lt;p&gt;Power BI is a powerful tool for data analysis and visualization, but its real strength lies in how it connects different tables through relationships. These relationships allow you to combine data, filter across tables, and create interactive reports with ease.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;📘 What Are Relationships?&lt;/em&gt;&lt;br&gt;
In Power BI, relationships refer to the logical connections between two or more tables based on a common column. These connections enable Power BI to combine data from multiple tables and perform calculations across them, just like joins in traditional databases.&lt;br&gt;
Relationships determine how data from different sources interacts by ensuring that slicers, filters, and visuals all work harmoniously when pulling from multiple tables.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;🧩 How Relationships Are Categorized&lt;/em&gt;&lt;br&gt;
In Power BI, every relationship between tables is defined by two key properties:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Cardinality&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cross-filter direction&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;1️⃣ Cardinality&lt;/em&gt;&lt;br&gt;
Cardinality refers to the nature of the relationship between two tables specifically, how many unique values in one column relate to values in another column. Power BI supports four types of cardinality:&lt;/p&gt;

&lt;p&gt;a. One to Many (1:*)&lt;br&gt;
This is the most common relationship type. One record in the first table is related to multiple records in the second table.&lt;br&gt;
Example: One product in the Products table can appear in many rows in the Sales table.&lt;/p&gt;

&lt;p&gt;b. Many to One (*:1)&lt;br&gt;
This is essentially the reverse of one-to-many. Multiple records in the first table relate to one record in the second table.&lt;br&gt;
Example: Many sales entries can point back to one customer in the Customers table.&lt;/p&gt;

&lt;p&gt;c. One to One (1:1)&lt;br&gt;
Each record in the first table corresponds to exactly one matching record in the second table, and vice versa.&lt;br&gt;
Example: Each employee in an Employees table has one matching record in a Payroll table.&lt;/p&gt;

&lt;p&gt;d. Many to Many (*:*)&lt;br&gt;
This relationship exists when multiple values in one table relate to multiple values in another table; it is often used for complex models.&lt;br&gt;
Example: A student can enroll in many courses, and each course can have many students.&lt;/p&gt;
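&lt;p&gt;Power BI resolves these relationships for you, but the underlying idea is the same as a lookup join. As a rough sketch with invented tables, a one-to-many (1:*) relationship means every row on the “many” side looks up exactly one row on the “one” side:&lt;/p&gt;

```python
# "one" side: a Products lookup table with unique keys
products = {"P1": "Phone", "P2": "Laptop"}
# "many" side: a Sales table where product keys repeat
sales = [("P1", 300), ("P1", 320), ("P2", 900)]

# each sales row resolves to exactly one product, like a 1:* relationship
report = [(products[pid], amount) for pid, amount in sales]
print(report)  # [('Phone', 300), ('Phone', 320), ('Laptop', 900)]
```

&lt;p&gt;Filtering on a product here naturally narrows the matching sales rows, which is the single-direction filter flow described next.&lt;/p&gt;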

&lt;p&gt;&lt;em&gt;2️⃣ Cross-filter Direction&lt;/em&gt;&lt;br&gt;
Cross-filter direction defines how filters flow between the related tables when interacting with visuals.&lt;/p&gt;

&lt;p&gt;a. Single Direction&lt;br&gt;
Filters flow from one table to the other only. Recommended for simpler models for better performance.&lt;br&gt;
Example: Filters flow from the Date table to the Sales table, but not the other way around.&lt;br&gt;
b. Both Directions (Bi-directional)&lt;br&gt;
Filters can flow in both directions between tables. Useful in complex scenarios like many-to-many relationships or when multiple slicers are involved. Should be used cautiously to avoid ambiguous relationships and performance issues.&lt;/p&gt;

&lt;p&gt;As I conclude, I’d like to emphasize that defining relationships correctly in Power BI is critical to building a robust and accurate data model. By understanding cardinality and cross-filter direction, you can ensure your data behaves as expected across dashboards and reports. Take the time to plan your relationships carefully: they are the backbone of meaningful, interactive insights in Power BI.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How Excel is Used in Real-World Data Analysis</title>
      <dc:creator>Maureen Mukami</dc:creator>
      <pubDate>Wed, 11 Jun 2025 07:42:51 +0000</pubDate>
      <link>https://dev.to/maureen_mukami_4268d10eac/how-excel-is-used-in-real-world-data-analysis-6c4</link>
      <guid>https://dev.to/maureen_mukami_4268d10eac/how-excel-is-used-in-real-world-data-analysis-6c4</guid>
<description>&lt;p&gt;&lt;strong&gt;Excel&lt;/strong&gt; is a powerful spreadsheet software that allows users to input, organize, and manipulate data using rows and columns. While it's commonly associated with basic calculations, Excel is widely used in real-world data analysis across many industries such as manufacturing, healthcare, logistics, and transportation.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Real-World Applications of Excel&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
One key area where Excel is used is in &lt;strong&gt;sales analysis&lt;/strong&gt;. Businesses rely on it to monitor product sales across different regions, track performance over time, and identify best-selling items. With charts and PivotTables, trends become visible, helping guide better business decisions.&lt;/p&gt;

&lt;p&gt;Excel is also used in &lt;strong&gt;inventory management&lt;/strong&gt;, where it helps companies track stock levels and monitor reorder points. Using simple formulas and conditional formatting, it becomes easy to highlight items that are running low or overstocked.&lt;/p&gt;

&lt;p&gt;Another common use is in &lt;strong&gt;customer data analysis&lt;/strong&gt;. Excel enables organizations to study customer demographics, purchase behavior, and satisfaction scores. This helps tailor marketing strategies and improve customer experiences.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Excel Features and Formulas&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
One useful feature I’ve explored is &lt;strong&gt;conditional formatting&lt;/strong&gt;. It automatically highlights cells based on specific criteria, making trends and outliers stand out at a glance. For instance, low sales figures can be marked in red, while high performers are shown in green.&lt;/p&gt;
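&lt;p&gt;At its core, a conditional-formatting rule is just a threshold test applied to each cell. A small Python sketch of the red/green rule described above (the threshold values are hypothetical):&lt;/p&gt;

```python
# Mimic a conditional-formatting rule: tag each sales figure as "red"
# (below a low threshold), "green" (at or above a high threshold),
# or "none" (in between). Thresholds are invented for illustration.
def tag(value, low=100, high=500):
    if value < low:
        return "red"
    if value >= high:
        return "green"
    return "none"

print([tag(v) for v in [50, 300, 700]])  # ['red', 'none', 'green']
```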

&lt;p&gt;Another powerful tool is the &lt;strong&gt;PivotTable&lt;/strong&gt;. This feature allows one to summarize and analyze large datasets quickly. It’s incredibly helpful for grouping and comparing data, like total sales by region or by product, without writing complex formulas.&lt;/p&gt;
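&lt;p&gt;Under the hood, a PivotTable that sums sales by region is performing a group-and-aggregate. A rough Python equivalent, with invented sales rows, looks like this:&lt;/p&gt;

```python
from collections import defaultdict

# Hypothetical sales rows; a PivotTable summing Sales by Region
# performs a grouped aggregation like this one.
rows = [
    {"Region": "East", "Sales": 200},
    {"Region": "West", "Sales": 150},
    {"Region": "East", "Sales": 100},
]

totals = defaultdict(int)
for r in rows:
    totals[r["Region"]] += r["Sales"]

print(dict(totals))  # {'East': 300, 'West': 150}
```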

&lt;p&gt;I also learned the &lt;strong&gt;VLOOKUP&lt;/strong&gt; function, which allows you to search for a value in one table and return related information from another column. This is especially useful when working with multiple datasets, like matching customer names with their corresponding purchase history.&lt;/p&gt;
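&lt;p&gt;VLOOKUP’s exact-match behavior maps naturally onto a key-to-row lookup. Here is a small Python sketch of that idea (the customer data and the &lt;code&gt;vlookup&lt;/code&gt; helper are made up for illustration):&lt;/p&gt;

```python
# Exact-match VLOOKUP is a key -> row lookup: build an index on the
# lookup column once, then fetch the requested related column.
purchases = [
    ("Alice", "2025-01-05", 40),
    ("Bob",   "2025-02-10", 25),
]
index = {name: (date, amount) for name, date, amount in purchases}

def vlookup(key, col):
    """Return column `col` (1-based, counting after the key column),
    or None when the key is missing (Excel would show #N/A)."""
    row = index.get(key)
    return row[col - 1] if row else None

print(vlookup("Alice", 2))  # 40
print(vlookup("Carol", 2))  # None
```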

&lt;p&gt;&lt;em&gt;&lt;strong&gt;My Reflection&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
Learning Excel has completely transformed the way I perceive data. I used to think that data analysis required advanced tools or programming knowledge. However, I've come to realize that Excel is a powerful and versatile tool capable of handling a wide range of data tasks such as cleaning, organizing, analyzing and visualizing.&lt;/p&gt;

&lt;p&gt;What impressed me most is Excel’s flexibility. It enables me to carry out various functions in one platform, whether it's performing calculations, identifying trends, or generating visual insights. As data continues to play a critical role in decision-making across all industries, mastering Excel provides me with a strong foundation as I advance in my data science journey.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
