DEV Community: Kendi

Why Statistics Is the Backbone of Data Science

Kendi — Fri, 26 Jun 2026 05:55:28 +0000

One of the biggest surprises I had while learning data science was realizing that Python isn't the hard part.

You can learn Python in a few weeks. You can become comfortable with pandas pretty quickly. You can even train a machine learning model by following a tutorial.

But none of that means you understand your data.

That's where statistics comes in.

A lot of beginners (myself included) focus on learning tools first because they're exciting. New libraries, dashboards, machine learning models. Statistics often feels like something you can come back to later.

In reality, it's the opposite.

Statistics isn't a side topic in data science. It's the reason the tools work in the first place.

Data only becomes useful when you understand what it represents

Before running any analysis, the first question isn't "Which model should I use?"

It's "What kind of data am I looking at?"

Broadly speaking, data falls into two groups.

Numerical data consists of values you can measure or count. Sales, age, height, temperature.

Categorical data represents labels or groups. Blood type, product category, education level.

That distinction matters more than most beginners realize.

For example, calculating the average blood type doesn't make sense. Education levels have an order, but treating them as though the gap between each level is identical can also lead to misleading conclusions.

Python won't stop you from making those mistakes.

Statistics teaches you when a calculation actually makes sense.
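Here is a minimal sketch of the education-level trap, using only Python's standard library. The numeric codes and the sample respondents are made up for illustration.

```python
from statistics import mean, median, mode

# Hypothetical encoding of an ordered category (assumed for illustration).
education_codes = {"Primary": 1, "Secondary": 2, "Degree": 3}
respondents = ["Primary", "Degree", "Degree", "Secondary", "Degree"]

coded = [education_codes[level] for level in respondents]

# Python happily averages the codes, but 2.4 is not an education level,
# and the value depends entirely on the arbitrary numbers chosen above.
misleading_average = mean(coded)

# The mode works for any categorical data.
most_common_level = mode(respondents)

# The median only needs an ordering, so it is safe for ordered categories.
code_to_level = {code: level for level, code in education_codes.items()}
middle_level = code_to_level[median(coded)]

print(misleading_average, most_common_level, middle_level)
```

The code runs without complaint either way. Knowing which of the three numbers means something is the statistics part.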

A single number rarely tells the whole story

Imagine two classes that both have an average score of 50.

At first glance, you'd think they performed similarly.

class_A = [48, 49, 50, 51, 52]
class_B = [10, 30, 50, 70, 90]

Both classes have exactly the same mean.

But they're clearly very different.

In Class A, almost everyone performed similarly.

In Class B, performance varied dramatically.

That's why summary statistics come in pairs.

Measures like the mean, median, and mode tell you where the center of the data lies.

Measures like standard deviation, variance, range, and IQR tell you how spread out the data is.

Looking at only one is like reading only half the sentence.
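To make the two-class comparison concrete, here is a small sketch that computes one measure of center and two measures of spread for each class. It uses the population standard deviation (`pstdev`), which is my choice for the example rather than something the scores dictate.

```python
from statistics import mean, pstdev

class_A = [48, 49, 50, 51, 52]
class_B = [10, 30, 50, 70, 90]

summary = {}
for name, scores in [("A", class_A), ("B", class_B)]:
    summary[name] = {
        "mean": mean(scores),                 # where the data sits
        "std": round(pstdev(scores), 2),      # typical distance from the mean
        "range": max(scores) - min(scores),   # gap between the extremes
    }

for name, stats in summary.items():
    print(f"Class {name}: {stats}")
```

Both classes report a mean of 50, but the standard deviation is about 1.41 for Class A and about 28.28 for Class B, and the range is 4 versus 80.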

The median isn't a backup plan

When people first learn statistics, they often think:

"Use the mean whenever possible. If it doesn't work, use the median."

That's not really how it works.

The mean uses every value in the dataset, which makes it powerful—but also sensitive to extreme values.

Imagine 99 people earn KES 30,000 each month, while one person earns KES 10 million.

The average income jumps to KES 129,700, more than four times what 99 of the 100 people actually earn.

The median ignores those extremes and simply finds the middle value.

Sometimes that's a much better description of what's "typical."

Choosing between the mean and median isn't about memorizing rules.

It's about understanding your data.
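The income example is easy to check for yourself. This sketch builds the 100 incomes described above and compares the two measures.

```python
from statistics import mean, median

# 99 people earning KES 30,000 and one person earning KES 10 million.
incomes = [30_000] * 99 + [10_000_000]

mean_income = mean(incomes)      # pulled upward by the single extreme value
median_income = median(incomes)  # the middle of the sorted list

print(f"Mean:   KES {mean_income:,.0f}")
print(f"Median: KES {median_income:,.0f}")
```

The mean comes out at KES 129,700 while the median stays at KES 30,000, which is what almost everyone in the group actually earns.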

Outliers aren't always mistakes

One of the first instincts many people have is to delete values that look unusual.

Sometimes that's the right decision.

If someone accidentally entered 250 instead of 25, that's probably a data entry error.

But sometimes the unusual value is exactly what you're looking for.

If you're building a fraud detection system, the suspicious transactions are the most valuable observations in your dataset.

Statistics gives us a systematic way to flag potential outliers using the IQR rule.

Lower fence = Q1 − (1.5 × IQR)

Upper fence = Q3 + (1.5 × IQR)

Anything outside those boundaries is flagged for investigation.

Notice the wording.

Flagged—not automatically deleted.

Statistics helps identify unusual observations.

Context tells you what to do with them.
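As a rough sketch, the IQR rule looks like this in Python. The ages are made up, and `method="inclusive"` is just one of several quartile conventions; other conventions give slightly different fences.

```python
from statistics import quantiles

# Made-up sample; 250 is the kind of value a typo for 25 would produce.
ages = [22, 25, 27, 29, 31, 33, 35, 250]

# quantiles() with n=4 returns the three cut points Q1, Q2 (median), Q3.
q1, _, q3 = quantiles(ages, n=4, method="inclusive")
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Flag for investigation; nothing is deleted here.
flagged = [x for x in ages if x < lower_fence or x > upper_fence]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"Fences: {lower_fence} to {upper_fence}")
print(f"Flagged: {flagged}")
```

The list of flagged values is where the investigation starts, not where it ends.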

Machine learning doesn't replace statistics

Every machine learning algorithm makes assumptions.

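Linear regression assumes a linear relationship between the predictors and the outcome, and its usual confidence intervals and p-values assume residuals that are roughly normally distributed.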

Naive Bayes assumes features are conditionally independent.

K-Means works best when clusters are reasonably compact and roughly spherical.

If those assumptions don't hold, your model may still produce predictions.

They just won't be reliable.

Understanding statistics helps you know when to trust a model—and when not to.
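One simple way to see a broken assumption is to look at the residuals. The sketch below fits a straight line by ordinary least squares to data that is actually curved. The data is invented for the example, and the fit is written out by hand so nothing beyond plain Python is needed.

```python
xs = [0, 1, 2, 3, 4]
ys = [x ** 2 for x in xs]  # the true relationship is curved, not linear

# Ordinary least squares for a single predictor.
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    / sum((x - mean_x) ** 2 for x in xs)
)
intercept = mean_y - slope * mean_x

# Residual = actual value minus the line's prediction.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

print(f"Fitted line: y = {intercept} + {slope}x")
print(f"Residuals: {residuals}")
```

The model still produces a line and still makes predictions. But the residuals are positive at both ends and negative in the middle, a U-shape that says the linearity assumption does not hold.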

The part of data science that doesn't become obsolete

Libraries change.

Frameworks change.

The code you write today may look outdated in a few years.

But the important questions stay the same.

What does this distribution tell me?
Is this difference meaningful or just random variation?
Is this outlier important or just an error?
Can I trust this result?

Those are statistical questions.

And they're the questions that separate someone who can write code from someone who can genuinely analyze data.

If you're starting your journey into data science, don't treat statistics as something to learn later.

It's the foundation that makes everything else make sense.

Power BI Data Modeling Explained Simply: Joins, Relationships, and Schemas

Kendi — Sat, 28 Mar 2026 20:43:42 +0000

Here is something nobody tells you when you start learning Power BI: the visuals are the easy part. Anyone can drag a bar chart onto a canvas. What separates a report that works from one that lies to you is what happens before you touch a single visual — the data model.

Get the model right and your numbers are accurate, your reports are fast, and your DAX measures are simple. Get it wrong and no amount of formatting or fancy visuals will fix it. You will just have a beautifully designed wrong answer.

This article covers everything you need to build models that actually work.

What is Data Modeling?

Data modeling is the process of organizing your tables and defining how they connect to each other so that analysis is accurate, efficient, and scalable.

In Power BI specifically, your data model determines three things:

Whether your calculations are correct
How fast your report runs
How easily someone can explore the data

A weak model produces incorrect aggregations, duplicate counts, and reports that take forever to load. A strong model makes all of that disappear.

Part 1: SQL Joins — Combining Tables in Power Query

Joins are how you physically combine two tables into one based on a shared column called a key. In Power BI, joins happen in Power Query during data preparation — before anything reaches your model.

Let us use a consistent example throughout. You have two tables:

Staff Table
| StaffID | StaffName |
|---------|-----------|
| S01 | James |
| S02 | Amina |
| S03 | Peter |

Training Table
| StaffID | Course |
|---------|--------|
| S01 | Excel |
| S02 | Power BI |
| S04 | SQL |

Notice: Peter (S03) has no training record. The SQL course (S04) has no matching staff member. This mismatch is exactly what each join type handles differently.

INNER JOIN — Only What Matches in Both

Returns rows that have a match in both tables. Anything without a match on either side is excluded.

Result:
| StaffID | StaffName | Course |
|---------|-----------|--------|
| S01 | James | Excel |
| S02 | Amina | Power BI |

Peter is excluded — no training record. The SQL course is excluded — no matching staff.

Use case: A clean attendance report showing only staff with confirmed training records.

LEFT JOIN — Keep Everything on the Left

Returns all rows from the left table. Rows from the right table are included only where a match exists. No match means null.

Result:
| StaffID | StaffName | Course |
|---------|-----------|--------|
| S01 | James | Excel |
| S02 | Amina | Power BI |
| S03 | Peter | null |

Peter stays — but his Course is null because no training record exists for him.

Use case: All staff members, flagging those who have not yet completed any training.

RIGHT JOIN — Keep Everything on the Right

Mirror of the LEFT JOIN. All rows from the right table are kept. Left table rows are included only where a match exists.

Result:
| StaffID | StaffName | Course |
|---------|-----------|--------|
| S01 | James | Excel |
| S02 | Amina | Power BI |
| S04 | null | SQL |

The SQL course stays — but StaffName is null because S04 does not exist in the Staff table.

Use case: All training courses offered, identifying any assigned to staff members who no longer exist in the system.

FULL OUTER JOIN — Everything from Both Tables

Returns every row from both tables. Where there is no match, nulls fill the gap on the missing side. Nothing is excluded.

Result:
| StaffID | StaffName | Course |
|---------|-----------|--------|
| S01 | James | Excel |
| S02 | Amina | Power BI |
| S03 | Peter | null |
| S04 | null | SQL |

Use case: A full audit comparing your HR system against your training system — surfacing every mismatch in both directions at once.
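The four join kinds covered so far can be sketched in plain Python using the Staff and Training tables from the example. This is an illustration of the logic only, not how Power Query implements merges. The `merge` helper is hypothetical, and it assumes each StaffID appears at most once per table, which holds here but not in general.

```python
staff = {"S01": "James", "S02": "Amina", "S03": "Peter"}      # StaffID -> StaffName
training = {"S01": "Excel", "S02": "Power BI", "S04": "SQL"}  # StaffID -> Course

def merge(left, right, how):
    """Return (StaffID, StaffName, Course) rows; None stands in for null."""
    if how == "inner":
        keys = left.keys() & right.keys()   # keys present in both tables
    elif how == "left":
        keys = left.keys()                  # every key from the left table
    elif how == "right":
        keys = right.keys()                 # every key from the right table
    elif how == "full":
        keys = left.keys() | right.keys()   # every key from either table
    else:
        raise ValueError(f"unknown join kind: {how}")
    # dict.get returns None when a key is missing on that side.
    return [(key, left.get(key), right.get(key)) for key in sorted(keys)]

for how in ("inner", "left", "right", "full"):
    print(how, merge(staff, training, how))
```

The only thing that changes between join kinds is which set of keys survives. The nulls fall out of whichever side has no row for a surviving key.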

LEFT ANTI JOIN — Only the Unmatched Left Rows

Returns only rows from the left table that have no match in the right table. In a sense, it is the opposite of an INNER JOIN.

Result:
| StaffID | StaffName |
|---------|-----------|
| S03 | Peter |

Only Peter — the staff member with no training record.

Use case: Finding staff members who have not attended any training. A compliance check.

RIGHT ANTI JOIN — Only the Unmatched Right Rows

Returns only rows from the right table that have no match in the left table.

Result:
| StaffID | Course |
|---------|--------|
| S04 | SQL |

Only the SQL course — assigned to a StaffID that does not exist.

Use case: Identifying orphaned records in a system — training records pointing to staff members who no longer exist.
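The two anti joins come down to a key-membership test. Here is a short sketch with the same Staff and Training data, again only to illustrate the logic rather than Power Query's implementation.

```python
staff = {"S01": "James", "S02": "Amina", "S03": "Peter"}      # StaffID -> StaffName
training = {"S01": "Excel", "S02": "Power BI", "S04": "SQL"}  # StaffID -> Course

# Left anti: staff rows whose StaffID never appears in the training table.
left_anti = {sid: name for sid, name in staff.items() if sid not in training}

# Right anti: training rows whose StaffID never appears in the staff table.
right_anti = {sid: course for sid, course in training.items() if sid not in staff}

print(left_anti)
print(right_anti)
```

Unlike the other join kinds, an anti join brings in no columns from the other table. That table is used only to decide which rows to keep.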

How to Create Joins in Power Query

Go to Home tab → Transform Data to open Power Query Editor
Select the table you want as your left table
Click Home tab → Merge Queries
Select the second table from the dropdown
Click the matching column in each table to set the key
Choose your Join Kind from the dropdown
Click OK
Click the expand icon on the new merged column to select which columns to bring in

Join Summary Table

| Join Type | Keeps from Left | Keeps from Right | Best For |
|-----------|-----------------|------------------|----------|
| Inner | Matching only | Matching only | Clean matched data |
| Left | All rows | Matching only | Enrich left, flag gaps |
| Right | Matching only | All rows | Enrich right, flag gaps |
| Full Outer | All rows | All rows | Full system audit |
| Left Anti | Unmatched only | Nothing | Find missing references |
| Right Anti | Nothing | Unmatched only | Find orphaned records |

Part 2: Power BI Relationships — Joins vs Relationships

This distinction trips up almost everyone starting out.

A join physically merges two tables into one. The result is a single flat table. You use this during data preparation in Power Query.

A relationship keeps tables separate and creates a logical filter connection between them. The data stays in its own table. Power BI uses the relationship to know how filters should flow when someone interacts with a report.

In most professional Power BI models, relationships handle the core structure. Joins are reserved for specific data preparation steps where you genuinely need to flatten or enrich a table before loading it into the model.

| Aspect | Joins | Relationships |
|--------|-------|---------------|
| Where | Power Query | Model View |
| Tables | Physically combined | Stay separate |
| Performance | Heavier | More efficient |
| Flexibility | Static | Dynamic |
| Use for | Data preparation | Analysis |
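To picture what "tables stay separate" means, here is a loose Python analogy. The department and employee rows are made up, and `filter_employees` is a hypothetical helper standing in for what a slicer does. It is a mental model, not a description of Power BI's engine.

```python
# The "one" side: each department appears once.
departments = [
    {"DeptID": "D1", "DeptName": "Finance"},
    {"DeptID": "D2", "DeptName": "IT"},
]

# The "many" side: a department ID can repeat across rows.
employees = [
    {"Name": "James", "DeptID": "D1"},
    {"Name": "Amina", "DeptID": "D2"},
    {"Name": "Peter", "DeptID": "D2"},
]

def filter_employees(dept_name):
    """Mimic a slicer on DeptName: the filter travels across the shared key."""
    selected_ids = {d["DeptID"] for d in departments if d["DeptName"] == dept_name}
    return [e for e in employees if e["DeptID"] in selected_ids]

print(filter_employees("IT"))
```

Neither table was merged or changed. The selection on one table simply narrowed down the rows of the other through the key column, and a different selection would narrow it differently.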

How to Create Relationships in Power BI

Method 1 — Drag and drop in Model View:

Click the Model View icon on the left sidebar
Find the key column in one table
Drag it onto the matching key column in the other table
A relationship line connects the two tables

Method 2 — Manage Relationships:

Go to Modeling tab → Manage Relationships
Click New
Select both tables and their matching columns
Set cardinality and cross-filter direction
Click OK

Cardinality — What Kind of Relationship Is It?

One-to-Many (1:M) — Use This as Your Default

One row in the first table matches many rows in the second.

Example: One Department can have many Employees. The Departments table lists each department once. The Employees table has that department ID repeated across many rows.

This is the most common, most efficient, and most reliable relationship type in Power BI. Build your model around 1:M relationships wherever possible. In Model View it shows as 1 on one side and * on the other.

Many-to-Many (M:M) — Use With Real Caution

Many rows in both tables can match each other.

Example: Doctors and Hospitals. One doctor works at multiple hospitals. One hospital employs multiple doctors.

Power BI supports M:M natively but it introduces ambiguous filtering and can produce incorrect totals. Where possible, resolve M:M by introducing a bridge table — a third table that breaks the relationship into two clean 1:M relationships.
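A bridge table is easier to understand once you see one. In this sketch the doctor and hospital names are invented, and the bridge is just a list of (doctor, hospital) pairs.

```python
doctors = {"DR1": "Dr. Otieno", "DR2": "Dr. Wanjiru"}      # DoctorID -> name
hospitals = {"H1": "City Hospital", "H2": "Coast Hospital"}  # HospitalID -> name

# Bridge table: one row per doctor-hospital pairing.
# Doctors -> bridge is 1:M, and Hospitals -> bridge is 1:M.
doctor_hospital = [("DR1", "H1"), ("DR1", "H2"), ("DR2", "H1")]

def hospitals_for(doctor_id):
    return [hospitals[h] for d, h in doctor_hospital if d == doctor_id]

def doctors_at(hospital_id):
    return [doctors[d] for d, h in doctor_hospital if h == hospital_id]

print(hospitals_for("DR1"))
print(doctors_at("H1"))
```

Each doctor and each hospital still appears exactly once in its own table. All of the many-to-many detail lives in the bridge.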

One-to-One (1:1) — Rare and Usually Unnecessary

Each row in one table matches exactly one row in the other.

If you have a 1:1 relationship, ask yourself honestly whether these tables should just be merged. They usually should.
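Cardinality follows from whether the key column is unique on each side. The sketch below is a hypothetical check you could run on exported key columns before building a relationship. The IDs are made up, and this is not a Power BI API.

```python
from collections import Counter

department_ids = ["D1", "D2", "D3"]                 # key column in Departments
employee_dept_ids = ["D1", "D2", "D2", "D3", "D3"]  # matching column in Employees

def is_unique(values):
    """True when no value appears more than once."""
    return all(count == 1 for count in Counter(values).values())

def cardinality(left_keys, right_keys):
    left_unique = is_unique(left_keys)
    right_unique = is_unique(right_keys)
    if left_unique and right_unique:
        return "1:1"
    if left_unique:
        return "1:M"
    if right_unique:
        return "M:1"
    return "M:M"

print(cardinality(department_ids, employee_dept_ids))
```

If a column you expected to be the "one" side turns out to contain duplicates, that is usually a data quality problem to fix upstream rather than a reason to accept M:M.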

Active vs Inactive Relationships

Power BI allows only one active relationship between any two tables. Active relationships are what visuals and DAX measures use by default. Additional relationships between the same tables must be inactive — shown as dashed lines in Model View.

When do you need inactive relationships?

A Date table connected to a fact table with multiple date columns is the classic case. Say your Orders table has OrderDate, ShipDate, and DeliveryDate. You can only have one active relationship to your Date table. The others are inactive.

To use an inactive relationship in a measure, activate it temporarily with USERELATIONSHIP:

Sales by Ship Date = CALCULATE(
    SUM(Orders[Revenue]),
    -- 'Calendar' is quoted so it is not confused with the CALENDAR function
    USERELATIONSHIP(Orders[ShipDate], 'Calendar'[Date])
)

This measure calculates revenue filtered by ShipDate instead of the default OrderDate — without permanently changing the model.
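If the idea of switching relationships feels abstract, this Python analogy may help. The orders are invented, and `revenue_by` is a hypothetical helper: the same revenue is grouped through whichever date column you ask for, with OrderDate as the default.

```python
from collections import defaultdict

orders = [
    {"OrderDate": "2026-03-01", "ShipDate": "2026-03-02", "Revenue": 85000},
    {"OrderDate": "2026-03-01", "ShipDate": "2026-03-03", "Revenue": 42000},
    {"OrderDate": "2026-03-02", "ShipDate": "2026-03-03", "Revenue": 10000},
]

def revenue_by(date_column="OrderDate"):
    """Total revenue per date, grouped through the chosen date column."""
    totals = defaultdict(int)
    for order in orders:
        totals[order[date_column]] += order["Revenue"]
    return dict(totals)

print(revenue_by())            # default path, like the active relationship
print(revenue_by("ShipDate"))  # alternative path, chosen for this one call
```

The grand total is the same in both cases. Only the way it is split across dates changes, and the default is untouched after the call.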

Cross-Filter Direction — Which Way Do Filters Flow?

Single direction: Filters flow from the "one" side to the "many" side only. This is the default and the right choice for most relationships.

Both directions (bidirectional): Filters flow both ways simultaneously.

Bidirectional sounds helpful but it creates real problems — ambiguous filter paths, performance degradation, and measures that produce incorrect results in ways that are very hard to diagnose. Start with single direction always. Only consider bidirectional after testing confirms you genuinely need it and the behavior is correct.

Part 3: Fact Tables and Dimension Tables

Every professional data model is built around this distinction.

Fact Tables

A fact table stores measurable events — things that happened. Each row is one transaction or event.

Contains numbers you want to measure: revenue, quantity, duration, count
Contains foreign keys that point to dimension tables
Typically long — many rows
Typically narrow — few columns

Examples: Hospital admissions, e-commerce orders, bank transactions, flight bookings

Dimension Tables

A dimension table stores descriptive context about those events — the who, what, where, and when.

Contains descriptive attributes: names, categories, locations, dates
Contains a primary key that the fact table references
Typically short — few rows
Typically wide — many columns

Examples: Patients, Products, Branches, Calendar

The practical rule: Your slicers and filters come from dimension tables. Your aggregations and measures come from fact tables. Keep them separate.

Part 4: Schemas — The Overall Shape of Your Model

Star Schema — Start Here Every Time

A central fact table connects directly to surrounding dimension tables. One hop from any dimension to the fact. No chains.

              [Calendar]
                  |
[Customer] — [Sales Fact] — [Product]
                  |
              [Branch]

Clean, fast, easy to maintain. Power BI's calculation engine is optimized for this structure. Your DAX measures will typically be simpler and your reports faster on a star schema than on other structures. This is the recommended starting point for almost every Power BI model.
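Here is a tiny star schema sketched in Python to show the "one hop" idea. The products, branches, and sales rows are made up, and `revenue_by` is a hypothetical helper, not anything Power BI exposes.

```python
from collections import defaultdict

# Dimension tables: descriptive attributes, one row per key.
products = {
    "P1": {"ProductName": "Laptop", "Category": "Electronics"},
    "P2": {"ProductName": "Desk", "Category": "Furniture"},
}
branches = {"B1": {"City": "Nairobi"}, "B2": {"City": "Mombasa"}}

# Fact table: one row per sale, holding keys and numbers only.
sales_fact = [
    {"ProductID": "P1", "BranchID": "B1", "Revenue": 85000},
    {"ProductID": "P2", "BranchID": "B1", "Revenue": 15000},
    {"ProductID": "P1", "BranchID": "B2", "Revenue": 42000},
]

def revenue_by(dimension, key_column, attribute):
    """Sum revenue grouped by an attribute that is one lookup away from the fact."""
    totals = defaultdict(int)
    for row in sales_fact:
        totals[dimension[row[key_column]][attribute]] += row["Revenue"]
    return dict(totals)

print(revenue_by(products, "ProductID", "Category"))
print(revenue_by(branches, "BranchID", "City"))
```

Every grouping needs exactly one lookup from the fact row to a dimension row. That single hop is what keeps both the model and the measures simple.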

Snowflake Schema — Normalized but Complex

Dimension tables are broken into sub-dimensions connected in chains.

[Region] ← [Branch] — [Sales Fact] — [Product] → [Category] → [Department]

Instead of a single Branch dimension with Region included as a column, Region becomes its own separate table connected to Branch.

This reduces data redundancy — useful in enterprise data warehouses where storage and integrity matter. In Power BI reporting however, the additional complexity adds joins, slows queries, and makes the model harder to navigate. If your source data arrives in a snowflake structure, consider flattening dimension chains in Power Query before loading into the model.
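Flattening a dimension chain means copying the parent attribute onto the child rows. In Power Query you would do this with a merge. The sketch below shows the same idea in Python with invented branch and region names.

```python
# Snowflake shape: Branch points to Region through RegionID.
regions = {"R1": "Nairobi Region", "R2": "Coast Region"}
branches = [
    {"BranchID": "B1", "BranchName": "Westlands", "RegionID": "R1"},
    {"BranchID": "B2", "BranchName": "Nyali", "RegionID": "R2"},
]

# Star shape: Region becomes a plain column on the Branch dimension.
flat_branches = [
    {
        "BranchID": branch["BranchID"],
        "BranchName": branch["BranchName"],
        "Region": regions[branch["RegionID"]],
    }
    for branch in branches
]

for row in flat_branches:
    print(row)
```

The region name is now repeated on every branch row. Dimension tables are small, so that redundancy is usually a fair price for a simpler model.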

Flat Table (Denormalized) — Simple but Limited

Everything — facts and dimensions — in a single table. No relationships needed.

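| OrderDate | CustomerName | City | ProductName | Category | Revenue |
|-----------|--------------|------|-------------|----------|---------|
| 01/03/2026 | James | Nairobi | Laptop | Electronics | 85000 |
| 02/03/2026 | Amina | Mombasa | Phone | Electronics | 42000 |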

Fast to build, easy to understand. But "Electronics" is repeated in every electronics row. Change a category name and you are updating thousands of cells. Performance degrades quickly as row count grows.

Use flat tables for: Quick one-off analysis, very small datasets, early prototyping before building a proper model. Avoid them for anything that will be maintained, updated, or scaled.
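Moving from a flat table toward a proper model means pulling repeated descriptions out into their own table. This sketch splits a product dimension out of the two example rows. The generated ProductID values are hypothetical surrogate keys, and the customer columns are left in place to keep the example short.

```python
flat_rows = [
    {"OrderDate": "01/03/2026", "CustomerName": "James", "City": "Nairobi",
     "ProductName": "Laptop", "Category": "Electronics", "Revenue": 85000},
    {"OrderDate": "02/03/2026", "CustomerName": "Amina", "City": "Mombasa",
     "ProductName": "Phone", "Category": "Electronics", "Revenue": 42000},
]

product_dim = {}  # ProductName -> {"ProductID", "Category"}
sales_fact = []

for row in flat_rows:
    name = row["ProductName"]
    if name not in product_dim:
        # Assign a made-up surrogate key the first time a product is seen.
        product_dim[name] = {
            "ProductID": f"P{len(product_dim) + 1}",
            "Category": row["Category"],
        }
    sales_fact.append({
        "OrderDate": row["OrderDate"],
        "CustomerName": row["CustomerName"],
        "City": row["City"],
        "ProductID": product_dim[name]["ProductID"],  # key replaces the description
        "Revenue": row["Revenue"],
    })

print(product_dim)
print(sales_fact)
```

After the split, a category name lives in one place per product. Renaming it becomes a single edit in the dimension instead of an update across every matching fact row.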

Schema Comparison

| Schema | Structure | Performance | Complexity | Best For |
|--------|-----------|-------------|------------|----------|
| Star | Fact + direct dimensions | Excellent | Low | Most Power BI reporting |
| Snowflake | Fact + chained dimensions | Good | Medium | Enterprise warehouses |
| Flat | Single table | Poor at scale | Very low | Small, simple, one-off |

Part 5: Role-Playing Dimensions

A role-playing dimension is one dimension table used multiple times in the same model, each time in a different context.

The Date table is the most common example. A Hospital fact table might have AdmissionDate, DischargeDate, and SurgeryDate — all referencing the same Calendar dimension but each representing a different point in time.

Solution 1 — One active, rest inactive:
Create one active relationship between Calendar and AdmissionDate. Create inactive relationships to DischargeDate and SurgeryDate. Use USERELATIONSHIP in DAX when you need the inactive ones.

Discharges This Month = CALCULATE(
    COUNT(Admissions[PatientID]),
    -- 'Calendar' is quoted so it is not confused with the CALENDAR function
    USERELATIONSHIP(Admissions[DischargeDate], 'Calendar'[Date])
)

Solution 2 — Duplicate the dimension table:
Create three separate Calendar tables — AdmissionCalendar, DischargeCalendar, SurgeryCalendar — each with its own active relationship. More relationships to maintain but no inactive relationship complexity in DAX.

For most scenarios, Solution 1 is cleaner.

Part 6: Common Modeling Issues

Many-to-Many without a bridge table
Connecting two fact tables directly produces unreliable totals. Fix by introducing a shared dimension table that both relate to through 1:M relationships.

Bidirectional filtering everywhere
Slows the model and creates ambiguous results. Default to single direction. Only use Both direction after deliberate testing.

Circular relationships
Table A → Table B → Table C → Table A creates a loop Power BI cannot resolve. Fix by identifying and removing the redundant relationship.

No dedicated Date table
Using date columns from fact tables directly can break time intelligence functions, which expect a continuous range of dates with no gaps. Always create a continuous Date dimension table, mark it as a Date table in Power BI (right-click the table in Model View → Mark as Date Table), and use it as the single source for all date filtering.

Loading unnecessary columns
Every column you load into the model consumes memory. In Power Query, remove columns you will not use before loading. Keep the model lean.
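The "continuous Date table" from the list above is simply one row per calendar day with no gaps. In Power BI you would normally build it in DAX or Power Query. The Python sketch below only shows the shape of the result, and the column names are my own choice.

```python
from datetime import date, timedelta

def build_date_table(start, end):
    """Return one row per calendar day from start to end, inclusive."""
    rows = []
    current = start
    while current <= end:
        rows.append({
            "Date": current,
            "Year": current.year,
            "Month": current.month,
            "Day": current.day,
        })
        current += timedelta(days=1)
    return rows

calendar_rows = build_date_table(date(2026, 1, 1), date(2026, 12, 31))

print(len(calendar_rows))
print(calendar_rows[0], calendar_rows[-1])
```

A fact table's date column usually skips days on which nothing happened. A table built this way has every day present, whether or not anything happened on it.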

The Recommended Workflow

Step 1 — Load your data into Power Query

Step 2 — Clean and prepare using joins and transformations in Power Query

Step 3 — Build the model in Model View using 1:M relationships in a star schema

Step 4 — Create a Date table and mark it appropriately

Step 5 — Write DAX measures on top of the clean model

Following this sequence means your measures are built on a solid foundation. Skipping to DAX before the model is right is the most common reason Power BI reports produce numbers nobody trusts.

Final Thought

The most important insight in data modeling is also the simplest one: keep facts and dimensions separate, connect them with 1:M relationships, and structure them as a star.

Everything else in this article — the join types, cardinality options, schema variations, role-playing dimensions — is either building on that foundation or explaining what happens when you deviate from it.

Master the star schema with clean 1:M relationships first. You will handle the large majority of real-world Power BI modeling scenarios with that alone.

Part of my data science learning journey.

How Excel is Used in Real-World Data Analysis

Kendi — Wed, 25 Mar 2026 13:10:38 +0000

Excel is a spreadsheet application by Microsoft that organizes data into a grid of rows and columns. Each box in that grid is called a cell, and each cell can hold a number, text, a date, or a formula that calculates something automatically.

On top of that grid, Excel gives you:

Hundreds of built-in functions for math, text, dates, and logic
Power Query for importing and transforming data from any source
PivotTables that summarize thousands of rows in seconds
Charts and visualizations that turn numbers into stories
Data validation to control what gets entered into cells
Conditional formatting that highlights patterns automatically

Excel is used in finance, healthcare, logistics, marketing, research, and government — basically anywhere data exists and decisions need to be made. It sits at a sweet spot between accessibility and depth. A beginner can use it on day one. An expert can still find new things to learn after years.

How I Used Excel in a Real Project

For a recent project I analyzed product performance data from Jumia — one of Africa's largest e-commerce platforms — covering product prices, discounts, customer reviews, and ratings. Here is exactly how I approached it.

1. Make a Copy of Your Original Data

Before touching anything, I created a copy of the raw data and kept the original sheet untouched. Data cleaning is destructive — you delete rows, overwrite values, change formats. Without a backup, there is no going back.

2. Understand Your Data Before Changing Anything

I used CTRL + END to jump to the last used cell and see how many rows and columns I had, then scrolled through to spot obvious problems: price columns with "KSh" symbols, ratings stored as text like "4.5 out of 5", negative review counts, and blank cells scattered throughout.

Understanding your data first tells you exactly what needs fixing and in what order. It prevents you from applying the wrong solution to the wrong problem.

3. Data Cleaning — Where the Real Work Happens

Data cleaning is unglamorous but it is the foundation everything else stands on. In professional data work it is often said to take 60 to 80 percent of total project time. The key steps I worked through were:

Removing duplicates — ensuring each product appeared only once using Data tab → Remove Duplicates
Fixing data types — converting price columns from text to numbers, extracting numeric ratings from text like "4.5 out of 5"
Handling missing values — replacing blank cells with the column average rather than deleting rows and losing data
Standardizing text — making sure category values like "Nairobi" and "nairobi" were consistent using PROPER, UPPER, and LOWER functions

None of this is exciting. All of it is essential.
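For comparison, the same four cleaning steps can be sketched in Python. This is not what I ran in the project, which was done in Excel. The rows, column names, and the choice to treat products with the same name as duplicates are all assumptions for the example.

```python
import re
from statistics import mean

raw_rows = [
    {"product": "phone case", "price": "KSh 1,200", "rating": "4.5 out of 5"},
    {"product": "Phone Case", "price": "KSh 1,200", "rating": "4.5 out of 5"},
    {"product": "usb cable", "price": "KSh 350", "rating": ""},
    {"product": "power bank", "price": "KSh 2,500", "rating": "3.5 out of 5"},
]

def parse_price(text):
    """'KSh 1,200' -> 1200.0 by dropping everything except digits and dots."""
    return float(re.sub(r"[^\d.]", "", text))

def parse_rating(text):
    """'4.5 out of 5' -> 4.5; blank text -> None."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)", text)
    return float(match.group(1)) if match else None

cleaned, seen = [], set()
for row in raw_rows:
    name = row["product"].strip().title()  # standardize text, like PROPER
    if name in seen:                       # remove duplicates
        continue
    seen.add(name)
    cleaned.append({
        "product": name,
        "price": parse_price(row["price"]),     # fix data types
        "rating": parse_rating(row["rating"]),
    })

# Handle missing values: fill blanks with the average of the known ratings.
average_rating = mean(r["rating"] for r in cleaned if r["rating"] is not None)
for r in cleaned:
    if r["rating"] is None:
        r["rating"] = average_rating

for r in cleaned:
    print(r)
```

Notice the order: text is standardized before duplicates are removed. Otherwise "phone case" and "Phone Case" would count as two different products.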

The Formulas That Fascinated Me

XLOOKUP — The Lookup Function Excel Should Have Had From the Start

Before XLOOKUP, the standard way to look up data was VLOOKUP, and it had serious limitations. It could only return values from columns to the right of the lookup column, broke silently when columns were inserted, and required hardcoded column numbers that made formulas fragile.

XLOOKUP fixes all of that:

=XLOOKUP(lookup_value, lookup_range, return_range, if_not_found)

It searches in any direction, handles missing values natively through if_not_found, takes a return range instead of a hardcoded column number, and can return multiple columns at once. One formula replaces what used to require complex workarounds.

The one caveat: XLOOKUP is only available in Excel 2021 and later, and in Microsoft 365. If you are on an older version, INDEX/MATCH is the next best alternative.
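If you think in code, XLOOKUP's default exact-match behaviour can be approximated by a short Python function. This is an analogy with made-up product data. It ignores XLOOKUP's optional match and search modes, and it returns None rather than an #N/A error when if_not_found is omitted.

```python
product_ids = ["P001", "P002", "P003"]
product_names = ["Phone Case", "USB Cable", "Power Bank"]
prices = [1200, 350, 2500]

def xlookup(lookup_value, lookup_range, return_range, if_not_found=None):
    """Return the item of return_range aligned with the first exact match."""
    for candidate, result in zip(lookup_range, return_range):
        if candidate == lookup_value:
            return result
    return if_not_found

print(xlookup("P002", product_ids, prices))
print(xlookup("Power Bank", product_names, product_ids))  # lookup column sits to the right
print(xlookup("P999", product_ids, prices, "Not found"))
```

Because the lookup range and the return range are passed separately, it makes no difference which one sits to the left of the other.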

IFS — Replacing Messy Nested IFs

IFS checks multiple conditions in sequence and returns the first match. It replaced what would otherwise be deeply nested IF statements that are almost impossible to read or debug.

For example, classifying products by rating:

=IFS(F2>=4.5, "Excellent", F2>=3, "Average", F2<3, "Poor")

Clean, readable, and easy to update. Nesting IF statements inside each other to do the same thing is a maintenance nightmare.
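The rating formula reads almost the same as a Python function that checks conditions in order and returns at the first match. The function name is my own.

```python
def classify_rating(rating):
    """Mirror the IFS formula: the first condition that is true wins."""
    if rating >= 4.5:
        return "Excellent"
    if rating >= 3:
        return "Average"
    return "Poor"  # everything below 3

for value in (4.8, 4.5, 3.0, 2.9):
    print(value, classify_rating(value))
```

Order matters in both versions. If the `>= 3` check came first, a 4.8 rating would be labelled "Average" and the "Excellent" branch would never run.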

PivotTables — Summarizing Data Without a Single Formula

A PivotTable takes your raw data and lets you summarize it any way you want — by category, by date, by product — without writing a single formula. You drag fields into rows, columns, and values, and Excel does the rest.

One tip that makes PivotTables significantly more reliable: convert your clean data into an Excel Table first using CTRL + T. Tables expand automatically when new rows are added, so your PivotTable always captures the full dataset on refresh.
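Under the hood, a basic PivotTable is a group-and-aggregate operation. Here is a small Python sketch of a pivot with Category in the rows and a count and sum of Price in the values. The rows are made up.

```python
from collections import defaultdict

products = [
    {"Category": "Electronics", "Price": 85000},
    {"Category": "Electronics", "Price": 42000},
    {"Category": "Fashion", "Price": 1500},
]

# One bucket per category, like one row per item in the Rows area.
pivot = defaultdict(lambda: {"Count": 0, "Total": 0})
for row in products:
    bucket = pivot[row["Category"]]
    bucket["Count"] += 1
    bucket["Total"] += row["Price"]

for category, bucket in sorted(pivot.items()):
    average = bucket["Total"] / bucket["Count"]
    print(f"{category}: count={bucket['Count']}, total={bucket['Total']}, average={average}")
```

Dragging a different field into Rows corresponds to changing the key used for the buckets. Changing the Values setting corresponds to changing what gets accumulated.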

The Dashboard — Making Data Usable for Everyone

The final step was bringing everything together in an interactive dashboard — a single sheet where any business stakeholder could explore the data without touching the underlying numbers. KPI cards at the top, charts in the middle, and slicers that filter every chart simultaneously with a single click.

This is the difference between data analysis and data communication. The numbers only matter if the right people can read them.

Conclusion

The formulas are learnable. The functions have documentation. What nobody tells you is that the hardest part of data analysis is not the technical side — it is the discipline.

The discipline to make a backup before touching anything. The discipline to understand your data before changing it. The discipline to clean thoroughly before analyzing. The discipline to question your results before presenting them.

Excel gave me powerful tools. But the two steps that made the biggest difference were the simplest ones — making a copy of the original data and actually reading through it before writing a single formula.

Written as part of a data science learning program at LuxDev HQ.