<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Purity Kihoro</title>
    <description>The latest articles on DEV Community by Purity Kihoro (@purity_kihoro).</description>
    <link>https://dev.to/purity_kihoro</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2223061%2F41296bb6-719f-48ec-810c-4d291ab301e5.png</url>
      <title>DEV Community: Purity Kihoro</title>
      <link>https://dev.to/purity_kihoro</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/purity_kihoro"/>
    <language>en</language>
    <item>
      <title>How Analysts Translate Messy Data, DAX, and Dashboards into Action Using Power BI.</title>
      <dc:creator>Purity Kihoro</dc:creator>
      <pubDate>Mon, 09 Feb 2026 02:39:59 +0000</pubDate>
      <link>https://dev.to/purity_kihoro/how-analysts-translate-messy-data-dax-and-dashboards-into-action-using-power-bi-1b6f</link>
      <guid>https://dev.to/purity_kihoro/how-analysts-translate-messy-data-dax-and-dashboards-into-action-using-power-bi-1b6f</guid>
      <description>&lt;p&gt;An analyst's main goal is to spot trends, performance and communicate insights faster from provided data. This makes them seek a software that supports data cleaning, create impactful visuals and present the data in a way that actually tells a story and understand the data better. Hence, Power Bi. &lt;br&gt;
How do analysts use Power Bi?&lt;/p&gt;
&lt;h2&gt;
  
  
  1. Get the data
&lt;/h2&gt;

&lt;p&gt;First, the analyst sources the data. It may come from the company’s database, from Excel or CSV files, from POS machines or even from surveys.&lt;/p&gt;
&lt;h2&gt;
  
  
  2. Power Query in Power BI
&lt;/h2&gt;

&lt;p&gt;They then transform the data in Power Query, where they get to clean it. Clean it? Is data dirty? Yes! Raw data collected from a source is often very messy. Data can be messy in a number of ways, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Missing records&lt;/em&gt;&lt;/strong&gt;: these may be resolved by filling in the missing values, for example with "Unknown" for text columns and null for numeric columns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Duplicate values&lt;/em&gt;&lt;/strong&gt;: Analysts remove the duplicate rows to ensure accuracy in the data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Inconsistent data&lt;/em&gt;&lt;/strong&gt;: for example, a city column may contain both NRB and Nairobi. These values refer to the same city, but to the software they appear as different cities. The analyst standardizes the data so that all values follow a consistent format.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Wide tables&lt;/em&gt;&lt;/strong&gt;: Analysts unpivot the columns to normalize the data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Wrong data types&lt;/em&gt;&lt;/strong&gt;: The data types of the columns are adjusted in the Transform tab.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  3. Use DAX for Calculations
&lt;/h2&gt;

&lt;p&gt;To create insightful reports, analysts use DAX functions to perform dynamic calculations on the data. For example, SUM() adds up all the values in a column. Create a new measure:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Total sales= Sum(Sales[Amount])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;What if the result you want requires multiple columns, such as quantity multiplied by price? Then you would use SUMX(), which evaluates an expression row by row and sums the results.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Total Revenue = SUMX(Sales,Sales[Quantity]*Sales[Price])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What about calculating KPIs? Then I would use CALCULATE(), which allows me to evaluate an expression under complex filters.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Total Sweater sales 2024= CALCULATE(SUM(Sales[Amount]), 
Sales[Product] = "Sweater", YEAR(Sales[Date]) = 2024)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
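&lt;p&gt;Time-based KPIs follow the same pattern. As a sketch (it assumes a Total Sales measure and a model with a proper date table named 'Date'), a year-over-year growth measure combines CALCULATE() with DATEADD() and DIVIDE():&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Sales YoY % =
VAR PriorYear = CALCULATE([Total Sales], DATEADD('Date'[Date], -1, YEAR))
RETURN DIVIDE([Total Sales] - PriorYear, PriorYear)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;DIVIDE() is used instead of the / operator so the measure returns blank rather than an error when there are no prior-year sales.&lt;/p&gt;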



&lt;h2&gt;
  
  
  4. Create Visuals
&lt;/h2&gt;

&lt;p&gt;Analysts then build visualizations from the measures they have created. There are many types of visualizations in Power BI, and they can be arranged into reports or dashboards.&lt;br&gt;
&lt;strong&gt;Reports&lt;/strong&gt; are used for detailed analysis and can have multiple pages and visuals, while &lt;strong&gt;Dashboards&lt;/strong&gt; are mostly used for summaries and therefore contain only the most important insights.&lt;br&gt;
For the measures created above, the most suitable visual would be a &lt;strong&gt;&lt;em&gt;Card&lt;/em&gt;&lt;/strong&gt;, which displays a single important value.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjmz0utgqdugcfec4sy4s.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjmz0utgqdugcfec4sy4s.JPG" alt=" " width="800" height="425"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Other popular visuals include:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bar/Column charts&lt;/strong&gt;: These are used for comparing values in different categories.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pie/Donut charts&lt;/strong&gt;: Used for showing the percentage of a value to a whole.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slicers&lt;/strong&gt;: For filtering the visuals on the dashboard.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Map visuals&lt;/strong&gt;: Ideal for geographic analytics either filled or bubble map.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5eic2u358lkeuwvljocc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5eic2u358lkeuwvljocc.png" alt=" " width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Present Findings
&lt;/h2&gt;

&lt;p&gt;An analyst then publishes their findings for executives. This can be done using the Power BI Service, Microsoft's cloud platform for sharing reports securely (companies that need to host reports on their own servers can use Power BI Report Server instead). There is also a mobile app that allows one to view reports on the go.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Power BI has proven to be an essential tool for every data specialist. It can assist in converting chaotic data into compelling insights. To tell stories with data, mastering Power BI is non-negotiable for a data analyst.&lt;/p&gt;

</description>
      <category>powerbi</category>
      <category>datascience</category>
      <category>analytics</category>
      <category>dashboards</category>
    </item>
    <item>
      <title>Schemas and data modelling in Power BI</title>
      <dc:creator>Purity Kihoro</dc:creator>
      <pubDate>Mon, 02 Feb 2026 14:00:49 +0000</pubDate>
      <link>https://dev.to/purity_kihoro/schemas-and-data-modelling-in-power-bi-38cj</link>
      <guid>https://dev.to/purity_kihoro/schemas-and-data-modelling-in-power-bi-38cj</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;Schema and Data Modelling.&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;A schema is a logical structure that defines how data is organized within a database. Database schemas provide a logical blueprint for data storage and organization, supporting user accessibility, scalability and data integrity. This blueprint includes logical constraints such as table names, fields, data types and the relationships between these entities. The schema does not contain the actual data itself, but rather provides the structure that the data must conform to.&lt;/p&gt;

&lt;p&gt;Schemas commonly use visual representations to communicate the architecture of the database, becoming the foundation for an organization’s data management discipline. The process of designing these schemas is known as &lt;strong&gt;data modelling&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A data model is a diagram that visually represents a conceptual framework for organizing, defining, and showing the relationships between data elements. This visual method helps clarify complex connections between various data points, simplifying the design of efficient and well-structured databases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Most common schemas used in Power BI&lt;/strong&gt;&lt;br&gt;
There are two main schemas used in Power BI: the star schema and the snowflake schema.&lt;/p&gt;

&lt;h2&gt;
  
  
  Star Schema
&lt;/h2&gt;

&lt;p&gt;A star schema is a type of schema where a single central fact table is surrounded by multiple dimension tables. The fact table contains the measurable facts of the data model, while the dimension tables contain its descriptive properties. The resulting diagram resembles a star, and the Power BI engine works best with this shape.&lt;br&gt;
Businesses can utilize a star schema to manage and organize large datasets based on two primary principles: facts and dimensions.&lt;br&gt;
&lt;strong&gt;Facts:&lt;/strong&gt; The center of the structure, providing measurement-based pieces of data. Examples are the number of transactions, website clicks, or total purchases.&lt;br&gt;
&lt;strong&gt;Dimensions:&lt;/strong&gt; Provide additional information about the fact, such as which customer made the purchase, where they made it from, and what product they bought.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why is star preferred?&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Easier to understand: Dimensions can be used to slice and dice the facts.&lt;/li&gt;
&lt;li&gt;Better performance: Fewer joins and shorter paths between tables mean queries run faster.&lt;/li&gt;
&lt;li&gt;Scalable: It is easier to add new dimension tables.&lt;/li&gt;
&lt;/ol&gt;
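&lt;p&gt;Slicing a fact by a dimension is straightforward in DAX. As a sketch (the names FactSales, DimCustomer and Region are hypothetical), a measure can filter the fact table through a related dimension:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;West Region Sales = CALCULATE(SUM(FactSales[Amount]), DimCustomer[Region] = "West")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because the dimension table has a one-to-many relationship with the fact table, the filter on DimCustomer[Region] propagates automatically to FactSales.&lt;/p&gt;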

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7hzynm800rocaiktzchk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7hzynm800rocaiktzchk.png" alt=" " width="300" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Snowflake Schema&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;A snowflake schema is a type of schema that extends the star schema by normalizing the dimension tables. The dimension tables are broken down further into sub-dimension tables, creating a more complex structure. For example, a Product dimension may be divided further into Category and Subcategory tables attached to it.&lt;br&gt;
&lt;strong&gt;Advantage&lt;/strong&gt;: Beneficial for reducing data redundancy in complex scenarios.&lt;br&gt;
&lt;strong&gt;Disadvantage&lt;/strong&gt;: Slower performance due to increased table joins.&lt;/p&gt;

</description>
      <category>newbie</category>
      <category>powerbi</category>
      <category>visualization</category>
    </item>
    <item>
      <title>Introduction to Linux for Data Engineers, Beginner Friendly Approach</title>
      <dc:creator>Purity Kihoro</dc:creator>
      <pubDate>Fri, 30 Jan 2026 04:02:56 +0000</pubDate>
      <link>https://dev.to/purity_kihoro/introduction-to-linux-for-data-engineers-beginner-friendly-approach-267a</link>
      <guid>https://dev.to/purity_kihoro/introduction-to-linux-for-data-engineers-beginner-friendly-approach-267a</guid>
      <description>&lt;h2&gt;
  
  
  Why is Linux important for data engineers
&lt;/h2&gt;

&lt;p&gt;Linux is an open-source operating system that is highly customizable, so it can meet the specific needs of different professionals such as data engineers. It is a very efficient and secure platform. Data engineers deal with extracting, transforming and loading very large volumes of data, and they prefer a Linux terminal for the following reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Compatibility with Data Engineering Tools:&lt;/strong&gt; Tools such as Hadoop (storing and processing large datasets), Kafka (real-time data streaming) and Docker (creating, deploying and running containerized applications) all run seamlessly on Linux.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security and Stability:&lt;/strong&gt; Linux is built with security in mind and is very reliable for handling sensitive data. Its open-source nature allows it to be regularly updated with security patches by developers around the world.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability and Flexibility:&lt;/strong&gt; Data engineers work with data that is ever growing in volume; to keep up with demand, Linux is very good at offering the processing power and speed needed to build workflows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Command Line Interface:&lt;/strong&gt; Data engineers work on the Linux CLI as it ensures efficient, high-speed processing and provides powerful automation capabilities. The CLI is also used to manage remote servers and computers using tools such as SSH.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Basic Linux commands
&lt;/h2&gt;

&lt;p&gt;mkdir : Creates a new directory.&lt;br&gt;
cd : Changes to the specified directory.&lt;br&gt;
ls : Lists files and directories in the current directory.&lt;br&gt;
mv : Moves or renames a file or directory from a source to a destination.&lt;br&gt;
cp : Copies a source file or directory to a destination.&lt;br&gt;
rm : Deletes files and directories.&lt;br&gt;
touch : Creates an empty file or updates its modification time.&lt;br&gt;
clear : Clears the terminal screen.&lt;br&gt;
ssh user@host : Connects to a remote server.&lt;/p&gt;
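&lt;p&gt;A short terminal session ties these commands together (the file and directory names are made up for illustration):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mkdir project            # create a working directory
cd project               # move into it
touch notes.txt          # create an empty file
cp notes.txt backup.txt  # copy it
mv backup.txt old.txt    # rename the copy
ls                       # list the contents: notes.txt  old.txt
rm old.txt               # delete the copy
clear                    # clean up the screen
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;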

&lt;p&gt;&lt;strong&gt;Text Editors in the Linux Terminal&lt;/strong&gt;&lt;br&gt;
Two common editors are Vi and Nano.&lt;br&gt;
&lt;strong&gt;&lt;em&gt;Vi&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Vi is a text editor that divides the editing process into different modes. It has three key modes: Normal mode for navigating and issuing commands, Insert mode for typing text, and Command-line mode for actions such as saving and quitting. This modal approach allows for fast and efficient text manipulation, making it a favorite of many seasoned developers and engineers.&lt;/p&gt;
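&lt;p&gt;A minimal Vi session, as a sketch, shows how the modes fit together:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vi notes.txt     # open the file; Vi starts in Normal mode
i                # switch to Insert mode and type your text
Esc              # return to Normal mode
:wq              # enter Command-line mode, write the file and quit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;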

</description>
      <category>linux</category>
      <category>dataengineering</category>
      <category>datascience</category>
    </item>
    <item>
      <title>What concepts should I master to be a Data Engineer?</title>
      <dc:creator>Purity Kihoro</dc:creator>
      <pubDate>Sun, 10 Aug 2025 21:02:08 +0000</pubDate>
      <link>https://dev.to/purity_kihoro/what-concepts-should-i-master-to-be-a-data-engineer-1chp</link>
      <guid>https://dev.to/purity_kihoro/what-concepts-should-i-master-to-be-a-data-engineer-1chp</guid>
      <description>&lt;p&gt;As a new data engineering student, there are a number of concepts that you need to grasp. The concepts will guide you in knowing exactly what to learn in respect to data engineering. So create a notion page and gather all resources available to be able to track your progress while learning.&lt;br&gt;
&lt;strong&gt;i)    Batch Verses Streaming Ingestion.&lt;/strong&gt;&lt;br&gt;
A Data Engineer implements the ETL (Extract Transform and Load) process for their organizations. In extracting the data, they should have identified the sources of these data and have a procedure for collecting the data. Data ingestion is the procedure that the data engineer takes to collect the data from all the different sources and organize it in a way that it can be processed for their specific organization.&lt;br&gt;
Batch Ingestion is when the data is collected over a period of time for example, a minute, a week or a month, once it has all being gathered, it is then processed all together at the same time. It is suitable for when dealing with very large datasets. For example collecting all the sales data of an Ecommerce store after a day.&lt;br&gt;
Stream ingestion is when data is processed as soon as it is collected. In stream ingestion, the data is processed instantly. It is highly recommended for critical data that requires immediate decision making. For example, you can use stream ingestion when dealing with fraud detection systems to identify the fraud as soon as it happens.&lt;br&gt;
&lt;strong&gt;ii)   (CDC) Change Data Capture&lt;/strong&gt;&lt;br&gt;
This is a technique used to ensure that all the records in a database are synchronized across the entire database in real-time. If and when a change is made to a record in a database, then these changes are integrated across the entire database resulting in data with low latency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;iii)  Idempotency&lt;/strong&gt;&lt;br&gt;
This is the property that repeating a process several times does not change the result. A data engineer performs the ETL process where the data extracted may come from CSV files. Implementing idempotency ensures that no matter how many times the same data is loaded, the output does not change: instead of producing duplicates, the pipeline recognizes that the data has already been loaded, avoiding data inconsistencies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;iv)   OLAP vs OLTP&lt;/strong&gt;&lt;br&gt;
OLAP (Online Analytical Processing) is a data processing system that performs multidimensional analysis on large amounts of data at high speed. It is best used for analytical reporting such as financial analysis or forecasting future sales.&lt;br&gt;
OLTP (Online Transactional Processing) is the process that powers most online transactions recorded into a database. These transactions are recorded in real time and can be performed by multiple users concurrently; examples include online bank transactions and flight bookings.&lt;br&gt;
Most organizations use OLAP and OLTP together, as both contribute data that is necessary for the growth of the organization.&lt;br&gt;
&lt;strong&gt;v)     Columnar vs Row-based Storage&lt;/strong&gt;&lt;br&gt;
Columnar databases organize data by field, making calculations and aggregations over a column easier. They allow for efficient data retrieval and analysis, as a query only pulls the columns it requires.&lt;br&gt;
Row-based databases organize data by row, making it easier to read and write whole records. They are used mostly for transactional systems that perform frequent CRUD (Create, Read, Update and Delete) operations.&lt;br&gt;
&lt;strong&gt;vi)   Data Partitioning&lt;/strong&gt;&lt;br&gt;
This is the process of subdividing large volumes of data into smaller, more manageable datasets using a suitable criterion, often a column. The smaller datasets created are known as partitions. Partitioning allows for efficient data filtering, so identifying the right column to partition on is key; a column with a manageable number of distinct values, such as a date or region, works well.&lt;br&gt;
&lt;strong&gt;vii)  ETL vs ELT&lt;/strong&gt;&lt;br&gt;
ETL, the core job of a data engineer, stands for Extract, Transform and Load. Extracting is the process of getting the data from different sources. Transforming involves cleaning and enriching the data, for example by standardizing field names and formats. Loading is placing the transformed data into the tools used by the data analysts.&lt;br&gt;
In ETL, the data is transformed (often using SQL) before it is loaded into the DBMS used by the analysts, while in ELT the raw data is loaded into a data warehouse first and then transformed there, usually with SQL.&lt;/p&gt;
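&lt;p&gt;Partitioning by a date column can be sketched in PostgreSQL's declarative partitioning syntax (the table and column names are hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- parent table partitioned by sale date
CREATE TABLE sales (
    sale_id  TEXT,
    amount   NUMERIC,
    sold_at  DATE
) PARTITION BY RANGE (sold_at);

-- one partition per month; queries that filter on sold_at
-- only scan the partitions they need
CREATE TABLE sales_2024_01 PARTITION OF sales
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;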

</description>
    </item>
    <item>
      <title>MASTERING DATA ANALYTICS: The Ultimate Guide To Data Analytics</title>
      <dc:creator>Purity Kihoro</dc:creator>
      <pubDate>Thu, 17 Oct 2024 03:56:48 +0000</pubDate>
      <link>https://dev.to/purity_kihoro/mastering-data-analytics-the-ultimate-guide-to-data-analytics-1hhe</link>
      <guid>https://dev.to/purity_kihoro/mastering-data-analytics-the-ultimate-guide-to-data-analytics-1hhe</guid>
      <description>&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before explaining the techniques and tools involved in data analytics, we first need to understand what data analytics is all about. A data analyst is the person who conducts data analytics for a company.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is data analytics?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the process of collecting data from a specific field or subject, then cleaning and transforming it into visualizations that can be used to come up with solutions to problems within that field.&lt;/p&gt;

&lt;p&gt;The process involves identifying trends and patterns within the data. For example, when analyzing data for an e-commerce site, one can derive the time when most customers are on the site.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who is a data analyst?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the professional who conducts the collection of data, cleans the data and finally transforms the data using a number of techniques and tools. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tools used in data analytics&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There are a number of tools that a data analyst has to master. They include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Spreadsheets&lt;/strong&gt; i.e &lt;em&gt;Excel&lt;/em&gt; or &lt;em&gt;Google Sheet&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Visualization tools&lt;/strong&gt; such as &lt;em&gt;Tableau&lt;/em&gt; and &lt;em&gt;Power BI&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Querying tools&lt;/strong&gt; such as &lt;em&gt;Python&lt;/em&gt;, &lt;em&gt;SQL&lt;/em&gt; and &lt;em&gt;R&lt;/em&gt; &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Big data processing tools&lt;/strong&gt; such as &lt;em&gt;Hive&lt;/em&gt; and &lt;em&gt;Hadoop&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Access and extraction tools&lt;/strong&gt; such as &lt;em&gt;Data Lakes&lt;/em&gt;, &lt;em&gt;Data Pipelines&lt;/em&gt; and &lt;em&gt;Data Warehouses&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;It is important to master at least one tool for a specific task instead of learning all the tools that conduct the same task. For example, under the spreadsheets, master only one either Excel or Google Sheets. This is because the principles found under one tool are likely the same for the other tool.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fundamental skills of a data analyst&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Statistical skills&lt;/strong&gt;. Analysis involves finding relationships within data, so statistics such as the mean, mode and correlation are calculated.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analytical skills&lt;/strong&gt;. An analyst should have the ability to look at a dataset and figure out possible queries to create that are relevant to it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data visualization skills&lt;/strong&gt;. While presenting their findings, an analyst should know the best type of graph to use in order to present the data in the most appealing format.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Problem solving skills&lt;/strong&gt;. After analyzing the data, an analyst should have the ability to figure out which problems can be solved using the findings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Project management skills&lt;/strong&gt;. An analyst should be organized throughout the whole process, from gathering the data to presenting it. Every step should be well documented to support the accuracy of the findings.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>analytics</category>
      <category>data</category>
      <category>analyst</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
