DEV Community: Sharon Otieno

Retrieval Augmented Generation (RAG) for Dummies

Sharon Otieno — Sun, 14 Sep 2025 10:54:23 +0000

Retrieval-Augmented Generation (RAG) is a technique that optimises an AI model’s performance by connecting it with external knowledge bases. RAG then helps large language models (LLMs) provide more accurate, relevant, and high-quality responses. Let’s put things into perspective.

Imagine someone asked you which planet has the most moons. Thinking back to what you learned when you were younger, you might say that Jupiter has 88 moons (for example). However, this response is flawed: the information is out of date, and you have no credible source to back it up. LLMs have similar challenges.

Now, let’s assume that you had first gone and looked up the question on a reputable source, such as NASA or the European Space Agency. You would most likely have found that Saturn has the most moons, a count that keeps changing as scientists discover new ones.

Where exactly does retrieval augmentation come in?

So let's say you were using ChatGPT. Rather than relying solely on what the model learned during training, a RAG system would first consult a ‘content store’ that holds reliable data and retrieve the content relevant to the query. So, rather than stating that the answer is Jupiter, the model would use the retrieved content to answer that the correct planet is Saturn.
Rather than having to retrain the model, all you have to do is update the store, so the model always works from up-to-date information. This approach lowers the risk of hallucination, allows access to up-to-date information, and improves trust in the model.

Summary of How RAG Works
User submits a prompt ↠ Retriever searches a knowledge base for relevant documents or data (retrieval) ↠ The system combines the original prompt with the retrieved content (augmentation) ↠ Generator (LLM) produces a response based on this enriched input (generation) ↠ Output is returned to the user.
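
To make the flow above concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the three-sentence `CONTENT_STORE`, the word-overlap `retrieve` function (real systems usually use embeddings and a vector index), and the `generate` stub, which only stands in for a call to an actual LLM.

```python
import re

# Hypothetical content store; in practice this would be a document database or vector index.
CONTENT_STORE = [
    "Saturn has the most confirmed moons of any planet in the Solar System.",
    "Jupiter is the largest planet in the Solar System.",
    "Linear regression models the relationship between variables.",
]


def tokenize(text):
    """Lower-case the text and return its set of words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(query, store, k=1):
    """Retrieval: return the k documents sharing the most words with the query."""
    query_words = tokenize(query)
    ranked = sorted(store, key=lambda doc: len(query_words & tokenize(doc)), reverse=True)
    return ranked[:k]


def augment(query, documents):
    """Augmentation: combine the original prompt with the retrieved content."""
    context = "\n".join(f"- {doc}" for doc in documents)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


def generate(prompt):
    """Generation: placeholder for the LLM call, which is not implemented here."""
    return f"[LLM response to a {len(prompt)}-character prompt]"


def rag_answer(query, store=CONTENT_STORE):
    """Run the retrieve -> augment -> generate steps in order."""
    documents = retrieve(query, store)
    prompt = augment(query, documents)
    return generate(prompt)


question = "Which planet has the most moons?"
print(augment(question, retrieve(question, CONTENT_STORE)))
```

Updating the answer only requires editing `CONTENT_STORE`; nothing about the (stubbed) model changes, which is the point of the technique.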

Becoming a Data Analyst: My Linear Regression Story

Sharon Otieno — Sat, 23 Aug 2025 14:29:25 +0000

Transitioning from environmental science into the world of data and machine learning has undoubtedly been a challenging journey, especially because I have always been intimidated by the complexities of mathematical models. Despite all this, I have gradually learned the importance of breaking down concepts and ultimately using them to uncover patterns and make predictions.

Linear Regression

A core area of machine learning is linear regression. It entails modeling the relationship between variables to predict an outcome. Say, for instance, you wanted to predict house prices based on the size of each house. Normally, a house's price increases as its size increases. Therefore, by plotting the relationship between the square footage and the price of each house, linear regression helps you find a line that estimates how much the price rises with each additional square foot.

Use Cases

While I initially found it challenging to grasp these ideas, over time, I learned just how strongly variables influence the prediction outcome. Consider these cases:

Estimating salary based on years of experience.
Predicting water demand based on population growth.
Predicting academic performance from attendance rates.

The Modelling Process

In each scenario, you first have to define your features and the target variable. In the water/population growth example, the target variable (y) is water demand, while the independent variable (x) is the population size.
Our assumption is that as the population increases, water demand also rises.
We would then visualize the relationship, train a linear regression model, interpret coefficients, and use the insights acquired to predict future water demand based on the projected population growth.
Ultimately, we would evaluate how effective and accurate the model is in providing valuable recommendations.
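
The process above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population and demand figures are made up, the helper names `fit_line` and `r_squared` are my own, and the fit uses the standard closed-form least-squares formulas rather than any particular library.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Share of the variance in y that the fitted line explains."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Made-up illustrative data, not real measurements.
population = [10, 20, 30, 40, 50]        # feature x: thousands of people
demand = [2.1, 3.9, 6.2, 7.8, 10.0]      # target y: million litres per day

slope, intercept = fit_line(population, demand)
score = r_squared(population, demand, slope, intercept)

# Interpret the coefficient: extra demand per additional thousand people.
print(f"slope={slope:.3f}, intercept={intercept:.3f}, R^2={score:.3f}")

# Predict demand for a projected population of 60 thousand.
projected = intercept + slope * 60
print(f"predicted demand at 60k people: {projected:.2f}")
```

In a real project you would also hold back some data for testing, so that the evaluation step measures how the model performs on values it has not seen.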

Final Words

Overall, my journey with linear regression has made me realize that despite one's anxiety, one can use mathematical tools to unlock insights, guide real-world decisions, and engage in data-driven processes.

How I Calculated Premier League Win Probabilities Using Python

Sharon Otieno — Thu, 31 Jul 2025 19:23:01 +0000

Imagine you wanted to make an informed guess on which football teams are likely to win in the Premier League, based on last season's performance. Well, this is what I set out to do.

Extracting Data From the API

The first step I took was to access the football-data.org API website, which would be the source of my data. Now, what exactly is an API? Think of an API as a bridge between a website and your code. While the website has the data you need, you can use your computer to make requests from the site. The two systems then interact, exchange data, and deliver responses.

Once you access the website, you need to create an account, typically with your email address, to which an authentication code is sent. This code allows you to access and use the data. With the code at hand, you can then set up the necessary Python libraries. I used json (to load the data), requests (to call the API), and pandas (to work with tables); json ships with Python, while requests and pandas need to be installed. The next step I took was to set the URL to match the data I wanted. Since I wanted the latest data, I filtered by year, setting it to 2024. I then made the request, and the data was fetched successfully.
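
A request along these lines might look like the sketch below. The endpoint path, the `X-Auth-Token` header name, and the `season` parameter are my assumptions about the football-data.org v4 API, so check them against the official documentation; the function names `build_request` and `fetch_matches` are hypothetical.

```python
BASE_URL = "https://api.football-data.org/v4"  # assumed base URL


def build_request(token, season=2024, competition="PL"):
    """Assemble the URL, headers, and query parameters without sending anything."""
    url = f"{BASE_URL}/competitions/{competition}/matches"
    headers = {"X-Auth-Token": token}   # assumed header name for the emailed code
    params = {"season": season}         # assumed filter for the season's starting year
    return url, headers, params


def fetch_matches(token, season=2024):
    """Send the request and return the decoded JSON as a Python dictionary."""
    import requests  # third-party; install with: pip install requests

    url, headers, params = build_request(token, season)
    response = requests.get(url, headers=headers, params=params, timeout=30)
    response.raise_for_status()  # stop early on an authentication or rate-limit error
    return response.json()


# Example (needs a real token and network access):
# data = fetch_matches("YOUR_TOKEN_HERE")
```

Keeping `build_request` separate makes it easy to inspect exactly what will be sent before spending one of your API calls.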

Manipulating the Data

After fetching the data, the next task is to read it. The line data = json.load(f) allowed me to convert the saved JSON file into a Python dictionary. With this, I was then able to prepare the data for further analysis. The first thing I did was extract the relevant fields I needed, which included the home and away teams, the date each match was played, and the teams' respective scores.

The next thing I did was create a function that would identify the winning team of each match based on the scores. This showed me, for instance, that when Manchester United FC played against Fulham FC, it scored one goal. The next step entailed aggregating the appearances and wins to see how many games each team played and how many it won.
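
Here is a rough sketch of those steps. It uses plain Python instead of pandas so that it runs without extra installs, and the sample matches are invented placeholders ("Team A", "Team B", "Team C"), not real results. The key names (`utcDate`, `homeTeam`, `awayTeam`, `score.fullTime.home`, `score.fullTime.away`) are my assumption about the shape of the API's response.

```python
from collections import Counter

# Invented sample shaped like the assumed API response; not real match results.
sample = {
    "matches": [
        {"utcDate": "2024-08-16T19:00:00Z", "homeTeam": {"name": "Team A"},
         "awayTeam": {"name": "Team B"}, "score": {"fullTime": {"home": 2, "away": 1}}},
        {"utcDate": "2024-08-17T14:00:00Z", "homeTeam": {"name": "Team B"},
         "awayTeam": {"name": "Team C"}, "score": {"fullTime": {"home": 0, "away": 0}}},
        {"utcDate": "2024-08-24T14:00:00Z", "homeTeam": {"name": "Team C"},
         "awayTeam": {"name": "Team A"}, "score": {"fullTime": {"home": 1, "away": 3}}},
    ]
}


def extract_fields(match):
    """Keep only the date, the two teams, and their full-time scores."""
    return {
        "date": match["utcDate"][:10],
        "home_team": match["homeTeam"]["name"],
        "away_team": match["awayTeam"]["name"],
        "home_score": match["score"]["fullTime"]["home"],
        "away_score": match["score"]["fullTime"]["away"],
    }


def winner(row):
    """Return the winning team's name, or None for a draw."""
    if row["home_score"] > row["away_score"]:
        return row["home_team"]
    if row["away_score"] > row["home_score"]:
        return row["away_team"]
    return None


rows = [extract_fields(match) for match in sample["matches"]]

appearances = Counter()  # matches played per team
wins = Counter()         # matches won per team
for row in rows:
    appearances.update([row["home_team"], row["away_team"]])
    winning_team = winner(row)
    if winning_team is not None:
        wins[winning_team] += 1

print(dict(appearances))
print(dict(wins))
```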

Getting the Win Probabilities

My final step was to divide each team's total wins by the total matches it played, giving the share of matches it won last season, which I used as an estimate of its win probability for this season. From this, I learned that Liverpool FC had a total of 25 wins, the highest of all. This gave it the highest estimated probability of winning in the coming Premier League season.
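
The calculation itself is a single division per team. In the sketch below, Liverpool's 25 wins come from the result above; the 38 matches played is my assumption (a full Premier League season), "Team B" is an invented comparison, and `win_probabilities` is a hypothetical helper name.

```python
def win_probabilities(wins, played):
    """Wins divided by matches played, for every team that played at least once."""
    return {
        team: wins.get(team, 0) / matches
        for team, matches in played.items()
        if matches > 0
    }


wins = {"Liverpool FC": 25, "Team B": 10}      # "Team B" is an invented example
played = {"Liverpool FC": 38, "Team B": 38}    # assumes a full 38-match season

probabilities = win_probabilities(wins, played)

# List teams from most to least likely to win a given match.
for team, probability in sorted(probabilities.items(), key=lambda item: item[1], reverse=True):
    print(f"{team}: {probability:.2f}")
```

Note that this is last season's win rate used as a rough estimate; it does not account for transfers, form, or opponent strength.
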
This was quite an interesting project!

Why Central Tendency Rules in the World of Data Science

Sharon Otieno — Sun, 20 Jul 2025 13:20:07 +0000

As a data scientist, one is expected to collect, organize, summarize, analyze, and draw inferences from data. This is where statistical processes come in handy.

What is Central Tendency?

A measure of central tendency is a statistic that represents the typical value or central point of a dataset. With these measures, one can provide an accurate description of the data they are interacting with. The three main measures of central tendency are the mean, the median, and the mode.

Mean (μ): This is the most commonly used measure. It is the arithmetic average of the values. Although the mean is a good representative of data, it is sensitive to outliers, especially when working with a small sample size.
Median: The median refers to the middle value in a data set, when all the items are arranged in either ascending or descending order. While the median is easy to compute and is not distorted by skewed data, its disadvantages are that it does not use all the information available and is harder to use in further mathematical calculations.
Mode: The mode refers to the most common value within a set of data. Even though it can be calculated easily and is the only measure that can be used with data on a nominal scale, the mode is rarely used in further statistical analysis since it is not algebraically defined.

Which Measure is Best to Use?

If working with nominal data, one is not able to calculate the mean or the median. It is therefore best to use the mode. With ordinal data, the median can also be used.
Assuming you have quantitative data, it is best to use either the mean or the median. If the data is skewed or has an outlier, one should opt for the median.
In every other circumstance, one can use the mean, especially since it uses every value in the dataset and minimises the sum of squared deviations from the data.
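
A quick way to see these rules in action is Python's built-in `statistics` module. The numbers below are made up for illustration; the point is how one outlier moves the mean while the median and mode stay put.

```python
from statistics import mean, median, mode

# Made-up monthly salaries (in thousands), without and with one outlier.
salaries = [30, 32, 35, 35, 38]
with_outlier = salaries + [300]

print(mean(salaries), median(salaries), mode(salaries))
print(round(mean(with_outlier), 2), median(with_outlier), mode(with_outlier))

# Nominal data: only the mode is meaningful.
favourite_colours = ["red", "blue", "red", "green"]
print(mode(favourite_colours))
```

The single value of 300 more than doubles the mean, which is why the median is the safer summary for skewed data.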

Discovering the Power of Excel (As an Absolute Beginner!)

Sharon Otieno — Sun, 08 Jun 2025 12:37:00 +0000

Although many people have heard of Excel, they tend to underestimate it, thinking it is only a tool to enter data. However, it is much more than a set of rows and columns.
In simple terms, Excel is a tool used to organize data, perform calculations, and derive valuable insights. The following examples will better portray how versatile and valuable Excel is in the real world.

Sales Performance Analysis: A business can use Excel to track sales performance, such as by assessing whether female customers purchased more products than male customers. Such insights may then influence business decisions, such as whether to adjust inventory depending on demand.
Human Resource Management: An organization can use Excel to keep records of its employees, such as their hire dates and the departments they work in. Such information can then be used to automate salary calculations, analyze performance over time, and evaluate trends, such as gender-based pay differences.
Tracking Academic Performance: A school may opt to use Excel to keep records of students’ demographic information (such as gender and age), as well as test scores. Using the available information, the school can, for instance, identify learners with learning challenges and offer them additional support.

What Features Have I Been Exposed To?

From my first week of learning about Excel, there are certain features and formulas I have been exposed to, and have grown to appreciate:

The “IF” Statement: In the sales performance example, let’s say the business wanted to assess whether each sale was of high value or not. It could use the formula =IF(D2>15000, "High Value", "Low Value"), which labels a sale "High Value" when the amount in cell D2 is greater than 15,000 and "Low Value" otherwise. Filled down the column, the formula would flag every high-value and low-value sale.
Conditional Formatting: If an organization wanted to identify employees who, for instance, earned the lowest amount of money, it could use conditional formatting, such as color scales or data bars, to spot them, especially when the data set is large.
Data Validation: Normally, the data validation tool is used to restrict the kind of information that can be entered into the cells of a given column. So, if the school wanted to restrict academic performance to either poor, good, or excellent, the tool would prevent unrelated entries from being made, therefore ensuring consistency.
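
For readers who also write code, the logic behind two of these features can be mirrored in Python. This is only an analogy under my own assumptions: `classify_sale` imitates the IF formula above, and `validate_grade` imitates a data validation list; neither is how Excel works internally, and both names are hypothetical.

```python
def classify_sale(amount, threshold=15000):
    """Same rule as =IF(D2>15000, "High Value", "Low Value")."""
    return "High Value" if amount > threshold else "Low Value"


ALLOWED_GRADES = {"Poor", "Good", "Excellent"}


def validate_grade(value):
    """Reject any entry outside the allowed list, like a data validation rule."""
    if value not in ALLOWED_GRADES:
        raise ValueError(f"{value!r} is not one of {sorted(ALLOWED_GRADES)}")
    return value


print(classify_sale(20000))
print(classify_sale(15000))
print(validate_grade("Good"))
```

A sale of exactly 15,000 counts as "Low Value" because the comparison is strictly greater than, just as in the Excel formula.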

My Personal Reflection

Before interacting with Excel, I only viewed data as something that you simply entered into rows and columns. It did not appear as something I was able to interact with. However, my view of data has completely changed.

I recognize and appreciate that every piece of data in Excel has the potential to tell a story. Using certain tools, such as pivot tables, one can recognize patterns, discrepancies, and similarities. This data then has the potential to generate actionable insights. Now, more than ever, I am quite excited about using Excel to explore data further.