DEV Community: Ian Macharia Mwangi

Understanding Degrees of Freedom and Their Importance in Statistics

Ian Macharia Mwangi — Fri, 07 Nov 2025 19:33:31 +0000

When I first started learning statistics, the term “degrees of freedom” (df) felt mysterious — it popped up in formulas for t-tests, chi-square tests, ANOVA, and even in regression analysis, yet no one seemed to clearly explain what it actually meant.
After diving deeper, I realized that degrees of freedom are not just a mathematical artifact — they reflect how much information we truly have available to estimate something.

Let’s unpack what that means and why it matters.

What Are Degrees of Freedom?

In simple terms, degrees of freedom represent the number of independent values in a dataset that are free to vary when estimating a statistical parameter.

Think of it as:

“How many data points can change without breaking the rules imposed by the estimation process?”

Formula:
df = n − k
(where n = number of observations, k = number of estimated parameters or constraints)

Example: Understanding Through Intuition

Suppose you have three numbers; call them x1, x2, and x3.

You know their mean is 10.

This constraint means that:

Equation:
(x1 + x2 + x3) / 3 = 10
→ x1 + x2 + x3 = 30

Now, if you pick any two of these numbers freely (say x1 = 8 and x2 = 11), then x3 is no longer free — it must be 11 to make the total 30.

So even though there are three values, only two can vary freely.

Hence, the degrees of freedom = 3 − 1 = 2.

This is why, when calculating sample variance, we divide by (n − 1) instead of n — one degree of freedom is lost because we already used the sample mean to estimate the center of the data.
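
To make the arithmetic concrete, here is a minimal Python sketch of the three-number example. The variable names are my own, and the check against statistics.variance is only there to show that the standard library also divides by n − 1.

```python
import statistics

# Three values constrained to have mean 10, as in the example above.
n = 3
target_mean = 10

x1, x2 = 8, 11                      # chosen freely
x3 = n * target_mean - (x1 + x2)    # forced by the constraint
data = [x1, x2, x3]

mean = sum(data) / n
sum_sq_dev = sum((x - mean) ** 2 for x in data)

biased_var = sum_sq_dev / n          # divides by n
sample_var = sum_sq_dev / (n - 1)    # divides by df = n - 1

print(x3, biased_var, sample_var)    # 11 2.0 3.0
print(statistics.variance(data) == sample_var)   # True: it also uses n - 1
```

Dividing by n gives 2.0, while dividing by the degrees of freedom gives 3.0, which is the value the standard library reports as the sample variance.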

Degrees of Freedom in Common Statistical Tests

Degrees of freedom appear in almost every statistical test because they determine the shape of the underlying probability distribution used for inference.

Let’s look at a few examples.

a. t-Test

For a one-sample t-test, the formula for the t-statistic is:

Equation:
t = (x̄ − μ₀) / (s / √n)

Where:

x̄ = sample mean

μ₀ = hypothesized population mean

s = sample standard deviation

n = sample size

Because one parameter (the mean) is estimated, the degrees of freedom = n − 1.

The t-distribution’s shape changes with df:

With small df (e.g., 5 or 10), it has heavier tails (more uncertainty).

As df increases, it approaches the normal distribution.
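
As a rough illustration, the one-sample t-statistic and its degrees of freedom can be computed with the standard library alone. The sample values and the hypothesized mean below are made up for the example.

```python
import math
import statistics

sample = [12, 9, 11, 10, 13]   # hypothetical measurements
mu_0 = 10                      # hypothesized population mean

n = len(sample)
x_bar = statistics.mean(sample)
s = statistics.stdev(sample)   # sample standard deviation (divides by n - 1)

t_stat = (x_bar - mu_0) / (s / math.sqrt(n))
df = n - 1                     # one df is spent estimating the mean

print(f"t = {t_stat:.3f}, df = {df}")   # t = 1.414, df = 4
```

With only 4 degrees of freedom, this t value would be compared against a heavy-tailed t-distribution rather than the normal curve.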

b. ANOVA (Analysis of Variance)

In ANOVA, degrees of freedom partition total variability into between-group and within-group components.

Equations:
Total df = N − 1
Between-groups df = k − 1
Within-groups df = N − k

Where:

N = total number of observations

k = number of groups

These df values are used to compute F-statistics, which test whether group means differ significantly.
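
Here is a small sketch of how the df partition feeds the F-statistic in a one-way ANOVA. The three groups are hypothetical data, and the sums of squares are computed by hand rather than through a statistics library.

```python
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]   # hypothetical data, three groups

k = len(groups)
N = sum(len(g) for g in groups)
grand_mean = sum(sum(g) for g in groups) / N

df_between = k - 1
df_within = N - k
df_total = N - 1

# Variability of group means around the grand mean
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
# Variability of observations around their own group mean
ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

# Each sum of squares is divided by its own df before forming the ratio
f_stat = (ss_between / df_between) / (ss_within / df_within)
print(df_between, df_within, df_total, round(f_stat, 3))
```

Note that df_between + df_within equals df_total (2 + 6 = 8 here), mirroring the way the sums of squares add up.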

c. Chi-Square Tests

In a chi-square goodness-of-fit test, the degrees of freedom equal:

Equation:
df = k − 1

where k = number of categories (one degree is lost because probabilities must sum to 1).

In a chi-square test of independence,
Equation:
df = (rows − 1) × (columns − 1)
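
Both df rules can be checked with a few lines of Python. This is a sketch on a made-up 2×2 table, and the chi-square statistic is computed by hand from expected counts.

```python
observed = [[10, 20],
            [30, 40]]   # hypothetical 2x2 contingency table

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # Expected count under independence
        expected = row_totals[i] * col_totals[j] / total
        chi2 += (obs - expected) ** 2 / expected

df_independence = (len(row_totals) - 1) * (len(col_totals) - 1)
df_goodness_of_fit = 6 - 1   # e.g. a six-sided die: k = 6 categories

print(round(chi2, 4), df_independence, df_goodness_of_fit)
```

In the 2×2 case, once the row and column totals are fixed, a single cell determines the other three, which is why only 1 degree of freedom remains.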

d. Regression Analysis

In regression, degrees of freedom are divided between model parameters and residuals.

Equations:
Regression df (Model df) = number of predictors
Residual df = n − k − 1

Where:

n = number of data points

k = number of predictors (excluding the intercept)

Residual df measures how much information is left to estimate the variance of errors.
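
The sketch below fits a simple linear regression by hand on hypothetical data and uses the residual df to estimate the error variance. With one predictor, k = 1, so residual df = n − 2.

```python
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]   # hypothetical data
n, k = len(x), 1      # one predictor

x_bar, y_bar = sum(x) / n, sum(y) / n
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar

# Sum of squared residuals around the fitted line
sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

residual_df = n - k - 1                # slope and intercept each cost one df
residual_variance = sse / residual_df  # estimate of the error variance

print(residual_df, round(residual_variance, 3))
```

Dividing the squared residuals by n instead of n − k − 1 would understate the error variance, for the same reason the sample variance divides by n − 1.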

Why Degrees of Freedom Matter

Degrees of freedom are not just a technical detail — they directly affect statistical accuracy and inference.

Here’s why they’re important:

They Control the Shape of Sampling Distributions

Every inferential test — t, F, or chi-square — depends on a specific distribution that changes shape with df.
Fewer degrees of freedom → heavier tails (in the t-distribution, for example) → more uncertainty → harder to achieve statistical significance.

They Reflect How Much Information You Actually Have

Even if you have 100 data points, if your model estimates 10 parameters, you only have 90 degrees of freedom left to estimate variability.
That’s why overparameterized models (too many predictors for the amount of data) lose statistical power.

They Determine Confidence and Reliability

Lower df means less reliable estimates: confidence intervals widen, the same test statistic yields a larger p-value, and results are less stable.
In essence, df quantifies the balance between data richness and model complexity.

A Real-Life Analogy

Imagine you’re organizing a team photo.
You have 10 people (data points) to arrange, but 1 spot is fixed (constraint).
You can freely move only 9 people around.

That’s 9 degrees of freedom: the number of people you can still place freely once the constraint is in effect.

The same concept applies to statistics — every estimated parameter reduces flexibility, just like every fixed position limits how freely the group can move.

Conclusion

Degrees of freedom are a way of measuring flexibility in your data.
They tell you how many independent pieces of information remain once your model has made certain assumptions or used certain estimates.
In short:

The more parameters you estimate, the fewer degrees of freedom you have left — and the more cautious you should be when trusting your results.

Understanding degrees of freedom helps you interpret why statistical tests behave the way they do, and it’s one of the keys to moving from just running analyses to truly understanding them.

Similarities Between a Stored Procedure in SQL and a Function in Python

Ian Macharia Mwangi — Wed, 05 Nov 2025 09:37:11 +0000

Although SQL and Python belong to very different ecosystems — one for database management and the other for general-purpose programming — their core design principles overlap when it comes to modularizing and reusing logic.

Both stored procedures in SQL and functions in Python serve as reusable, encapsulated units of code designed to perform a task efficiently.

Let’s break down the main similarities:

Encapsulation of Logic

Both structures allow you to group multiple steps or statements into a single callable unit.
This helps you avoid repeating logic and keeps your codebase organized.

SQL Stored Procedure:

CREATE PROCEDURE GetEmployeesByDept
    @DeptID INT
AS
BEGIN
    SELECT EmployeeName, Salary
    FROM Employees
    WHERE DepartmentID = @DeptID;
END;

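Python Function:

def get_employees_by_dept(dept_id):
    # Bind dept_id as a parameter instead of formatting it into the SQL (avoids SQL injection)
    query = "SELECT EmployeeName, Salary FROM Employees WHERE DepartmentID = ?"
    # Imagine this is sent to the database and results are returned;
    # execute_query is a placeholder helper, and the "?" placeholder style depends on the driver
    return execute_query(query, (dept_id,))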

In both cases, you’ve encapsulated a reusable block that performs a defined action.

Parameters and Arguments

Both can accept parameters (inputs) that alter their behavior without changing the internal logic.

In SQL: parameters are declared with @ and passed during execution (EXEC GetEmployeesByDept @DeptID = 2;)

In Python: parameters are passed as function arguments (get_employees_by_dept(2))

This design makes both highly flexible and reusable across different contexts.

Reusable and Maintainable

Once defined, both can be reused multiple times in different scripts, queries, or applications.

Stored procedures can be called from different SQL scripts or even external applications.

Python functions can be imported and reused across multiple modules.

Both help enforce the DRY principle — Don’t Repeat Yourself.

Structured Flow Control

Both support conditional logic and loops, allowing more complex, programmatic behavior.

SQL Example:

CREATE PROCEDURE UpdateSalary
    @EmpID INT,
    @IncreasePercent DECIMAL(5,2)
AS
BEGIN
    IF @IncreasePercent > 0
        UPDATE Employees
        SET Salary = Salary + (Salary * @IncreasePercent / 100)
        WHERE EmployeeID = @EmpID;
END;
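
A Python counterpart might look like the sketch below. It assumes an in-memory SQLite database as a stand-in for the real server, and update_salary is my own illustrative name rather than part of any library.

```python
import sqlite3

def update_salary(conn, emp_id, increase_percent):
    """Raise one employee's salary, but only for a positive percentage."""
    if increase_percent > 0:
        conn.execute(
            "UPDATE Employees SET Salary = Salary + (Salary * ? / 100) "
            "WHERE EmployeeID = ?",
            (increase_percent, emp_id),
        )
        conn.commit()

# Stand-in database with a single hypothetical employee
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employees (EmployeeID INTEGER PRIMARY KEY, Salary REAL)")
conn.execute("INSERT INTO Employees VALUES (1, 50000)")

update_salary(conn, 1, 10)    # applied: 50000 -> 55000
update_salary(conn, 1, -5)    # ignored by the guard

new_salary = conn.execute(
    "SELECT Salary FROM Employees WHERE EmployeeID = 1"
).fetchone()[0]
print(new_salary)   # 55000.0
```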

The syntax differs, but both follow the same logical pattern.

Return Values (or Results)

A stored procedure can return a result set or output parameters, while a Python function returns a value or object.

SQL: SELECT statements inside a stored procedure return results to the caller.

Python: return sends back data to the calling code.

Example:

SQL

EXEC GetHighEarners @MinSalary = 80000;

Python

result = get_high_earners(80000)

Both act as black boxes — you provide input and get output.
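
For illustration, get_high_earners could be sketched as follows. This version takes an explicit connection argument (unlike the one-argument call above) so it can run against an in-memory SQLite table with made-up rows.

```python
import sqlite3

def get_high_earners(conn, min_salary):
    """Return (EmployeeName, Salary) rows above min_salary as a list."""
    cursor = conn.execute(
        "SELECT EmployeeName, Salary FROM Employees WHERE Salary > ? "
        "ORDER BY Salary DESC",
        (min_salary,),
    )
    return cursor.fetchall()

# Hypothetical data in a throwaway database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employees (EmployeeName TEXT, Salary INTEGER)")
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?)",
    [("Amina", 95000), ("Brian", 70000), ("Chen", 82000)],
)

result = get_high_earners(conn, 80000)
print(result)   # [('Amina', 95000), ('Chen', 82000)]
```

The caller only sees the input (a salary threshold) and the output (a list of rows), which is the black-box behavior described here.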

Security and Access Control

In SQL, stored procedures help control access to underlying tables by granting users permission to run procedures instead of querying tables directly.
In Python, while not about access control in the same sense, functions can hide implementation details, exposing only necessary interfaces.

In both cases, this encapsulation helps with abstraction and safety.

Performance Benefits

In engines such as SQL Server, a stored procedure’s execution plan is compiled on first run and then cached, which can make repeated calls faster.

Python functions can also improve performance indirectly — by reducing redundant code and centralizing logic that might otherwise run inefficiently in multiple places.

While the underlying performance mechanics differ, the intent — optimization through reuse — is shared.

In conclusion, at a conceptual level, a stored procedure in SQL is to a database what a function is to a Python program — both encapsulate logic, promote reusability, and enhance maintainability.

When I first started bridging SQL with Python (for analytics and ETL work), recognizing these parallels made it much easier to design systems where each layer handled what it’s best at:

SQL for data manipulation and set-based logic,

Python for orchestration, computation, and automation.

Understanding the Difference Between Subquery, CTE, and Stored Procedure

Ian Macharia Mwangi — Wed, 05 Nov 2025 09:31:26 +0000

When working with SQL, one of the key skills that separates intermediate developers from advanced database professionals is knowing how and when to use different query structures.

Three of the most commonly misunderstood SQL components are subqueries, Common Table Expressions (CTEs), and stored procedures. While they might seem similar at first glance — all allow you to organize or modularize your logic — they serve different purposes and have different performance implications.

In this article, I’ll break down each concept, highlight where it shines, and show some real-world examples from my own experience.

Subquery — A Query Within a Query

A subquery is simply a query nested inside another SQL statement. It’s often used to filter, calculate, or compare values dynamically.

Example
SELECT 
    EmployeeName,
    Salary
FROM Employees
WHERE Salary > (
    SELECT AVG(Salary) FROM Employees
);

In this example, the inner query calculates the average salary, and the outer query selects only employees earning more than that.

This pattern is great for short, inline calculations or when you don’t want to repeat the same logic elsewhere.
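
If you want to try the query without a database server, here is a small sketch using Python's built-in sqlite3 module. The table contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employees (EmployeeName TEXT, Salary INTEGER)")
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?)",
    [("Amina", 90000), ("Brian", 60000), ("Chen", 75000)],
)

# The inner SELECT computes the average (75000); the outer query filters on it
above_average = conn.execute(
    """
    SELECT EmployeeName, Salary
    FROM Employees
    WHERE Salary > (SELECT AVG(Salary) FROM Employees)
    """
).fetchall()
print(above_average)   # [('Amina', 90000)]
```

Chen earns exactly the average, so the strict > comparison leaves only Amina in the result.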

When to Use a Subquery

When you need a quick, one-off result from another table or calculation.

When joining tables would make the main query unnecessarily complex.

When readability and simplicity are more important than reusability.

Limitations

Correlated subqueries (those that reference columns from the outer query) can be less efficient than joins in some databases because they may be re-evaluated for each outer row, depending on the optimizer.

They can’t always be reused elsewhere in the same query.

Deeply nested subqueries can make debugging painful.

CTE (Common Table Expression) — A Temporary Result Set

A CTE, standardized in SQL:1999, available in SQL Server since 2005, and supported by most modern databases, allows you to define a temporary, named result set that can be referenced within the same statement.

Example

WITH DepartmentAverage AS (
    SELECT DepartmentID, AVG(Salary) AS AvgSalary
    FROM Employees
    GROUP BY DepartmentID
)
SELECT 
    e.EmployeeName, 
    e.DepartmentID,
    e.Salary,
    d.AvgSalary
FROM Employees e
JOIN DepartmentAverage d 
    ON e.DepartmentID = d.DepartmentID
WHERE e.Salary > d.AvgSalary;

Here, the CTE DepartmentAverage calculates each department’s average salary once, and the outer query uses it to find above-average earners.

This approach is more readable and maintainable than embedding multiple subqueries.
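
The same CTE can be run end to end with Python's sqlite3 module. This is a sketch on four made-up employees in two departments.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employees (EmployeeName TEXT, DepartmentID INTEGER, Salary INTEGER)"
)
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?)",
    [("Amina", 1, 90000), ("Brian", 1, 60000),
     ("Chen", 2, 75000), ("Dara", 2, 85000)],
)

rows = conn.execute(
    """
    WITH DepartmentAverage AS (
        SELECT DepartmentID, AVG(Salary) AS AvgSalary
        FROM Employees
        GROUP BY DepartmentID
    )
    SELECT e.EmployeeName, e.DepartmentID, e.Salary, d.AvgSalary
    FROM Employees e
    JOIN DepartmentAverage d ON e.DepartmentID = d.DepartmentID
    WHERE e.Salary > d.AvgSalary
    ORDER BY e.EmployeeName
    """
).fetchall()

names = [row[0] for row in rows]
print(names)   # ['Amina', 'Dara']
```

Department 1 averages 75000 and department 2 averages 80000, so one employee from each department clears their own department's bar.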

When to Use a CTE

When you want to organize complex queries into readable logical blocks.

When you need to reference the same derived result multiple times within one query.

When working with recursive queries (e.g., hierarchical data like org charts).
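
The recursive case is the hardest to picture, so here is a minimal org-chart sketch in SQLite via Python. The Staff table and its rows are hypothetical. SQLite and PostgreSQL spell this WITH RECURSIVE, while SQL Server uses plain WITH.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Staff (EmployeeID INTEGER, EmployeeName TEXT, ManagerID INTEGER)"
)
conn.executemany(
    "INSERT INTO Staff VALUES (?, ?, ?)",
    [(1, "Amina", None), (2, "Brian", 1), (3, "Chen", 2), (4, "Dara", 1)],
)

org_chart = conn.execute(
    """
    WITH RECURSIVE Chain(EmployeeID, EmployeeName, Level) AS (
        -- Anchor: the person with no manager
        SELECT EmployeeID, EmployeeName, 0 FROM Staff WHERE ManagerID IS NULL
        UNION ALL
        -- Recursive step: everyone reporting to someone already in Chain
        SELECT s.EmployeeID, s.EmployeeName, c.Level + 1
        FROM Staff s JOIN Chain c ON s.ManagerID = c.EmployeeID
    )
    SELECT EmployeeName, Level FROM Chain ORDER BY Level, EmployeeName
    """
).fetchall()
print(org_chart)   # [('Amina', 0), ('Brian', 1), ('Dara', 1), ('Chen', 2)]
```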

Limitations

CTEs exist only during the execution of the query; they are not stored permanently.

They don’t automatically improve performance (they’re mainly about readability and structure).

For very large datasets, a temporary table might perform better.

Stored Procedure — Precompiled Logic on the Server

A stored procedure is a named, reusable block of SQL logic stored in the database. Many engines compile its execution plan on first use and cache it for later calls.
Unlike subqueries or CTEs, which are part of a single query, a stored procedure is a stand-alone database object.

Example

CREATE PROCEDURE GetHighEarners
    @MinSalary INT
AS
BEGIN
    SELECT EmployeeName, Salary, DepartmentID
    FROM Employees
    WHERE Salary > @MinSalary;
END;

EXEC GetHighEarners @MinSalary = 80000;

This approach is efficient and scalable, especially for complex business logic that runs frequently.

When to Use a Stored Procedure

When you need to encapsulate logic and reuse it across multiple applications.

When you want frequently run logic to benefit from cached execution plans.

When enforcing security or access control — users can execute procedures without seeing the underlying tables.

When building ETL processes, batch jobs, or automated reporting.

Limitations

Stored procedures require database-level maintenance — not ideal for ad-hoc queries.

They can become hard to version-control if business logic changes often.

They’re specific to a database engine (e.g., T-SQL for SQL Server, PL/pgSQL for PostgreSQL).

My Take — Choosing the Right One

In practice, I use all three, depending on the situation:

Subqueries for quick filters or checks.

CTEs when readability matters and I want to logically break a big query into parts.

Stored procedures when I’m implementing repeatable or parameterized business logic.

There’s no universal “best” — it depends on whether you’re optimizing for performance, maintainability, or reuse.

The key is to understand how each behaves under the hood so you can choose the right tool for the job.

Final Thoughts

Think of it like this:

A subquery is a snippet of logic.

A CTE is a named snippet.

A stored procedure is a function you can call again and again.

Mastering when to use each will make your SQL code not only faster but also more elegant and maintainable.

Excel’s Strengths and Weaknesses in Predictive Analysis and Its Role in Data-Driven Business Decisions

Ian Macharia Mwangi — Sun, 10 Aug 2025 17:53:54 +0000

Introduction

When it comes to business tools, Microsoft Excel is almost legendary. From finance to marketing, small startups to multinational corporations, it’s hard to find a professional who hasn’t used it in some form. Over the years, Excel has evolved from a simple spreadsheet program into a versatile analytical tool—capable of handling everything from basic data organization to complex forecasting models.

In my research, I found that while Excel can absolutely support predictive analysis and guide data-driven business decisions, it’s not without its pitfalls. Let’s explore both sides of the coin.

Strengths of Excel in Predictive Analysis

1. Accessibility and Familiarity

One of Excel’s biggest strengths is its universal familiarity. Most professionals already have some level of comfort using it, which means predictive models can be implemented without extensive training. Plus, it’s included in Microsoft 365 (formerly Office 365) subscriptions, so the barrier to entry is low.

2. Built-in Analytical Tools

Excel offers functions like FORECAST.LINEAR and TREND, as well as the Analysis ToolPak add-in for regression analysis. Combined with pivot tables, charts, and conditional formatting, these let users quickly turn raw data into meaningful trends and forecasts.
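
To show what a linear forecast of this kind computes, here is a Python sketch of a least-squares straight-line prediction. The function name forecast_linear and its argument order are my own choices to echo the Excel function, and the sales figures are made up.

```python
def forecast_linear(x, known_ys, known_xs):
    """Least-squares straight-line prediction at x."""
    n = len(known_xs)
    x_bar = sum(known_xs) / n
    y_bar = sum(known_ys) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(known_xs, known_ys))
             / sum((xi - x_bar) ** 2 for xi in known_xs))
    intercept = y_bar - slope * x_bar
    return intercept + slope * x

monthly_sales = [100, 120, 140, 160]   # hypothetical figures for months 1-4
prediction = forecast_linear(5, monthly_sales, [1, 2, 3, 4])
print(prediction)
```

Sales grow by 20 per month in this toy series, so the forecast for month 5 comes out at 180.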

3. Flexibility

Unlike specialized analytics software, Excel isn’t locked into a specific type of analysis. You can customize formulas, link datasets, and even use VBA (Visual Basic for Applications) to automate repetitive predictive tasks.

4. Integration with Other Tools

Excel plays nicely with other platforms—importing from databases, APIs, and CSV files is straightforward. This makes it easier to feed historical data into predictive models.

Weaknesses of Excel in Predictive Analysis

1. Scalability Issues

Excel works well for small to medium-sized datasets, but a worksheet is capped at 1,048,576 rows, and with very large tables or real-time data streams it often becomes sluggish or prone to crashing well before that limit.

2. Limited Advanced Modeling

While Excel handles basic forecasting well, it’s not designed for advanced machine learning models or AI-driven predictions. Specialized tools like Python (with scikit-learn) or R are better suited for that.

3. Error Sensitivity

Human error in formula entry, data input, or referencing can lead to inaccurate predictions. In predictive analysis, even small mistakes can cause significant misdirection in decision-making.

4. Collaboration Limitations

Although Excel Online and cloud storage have improved collaboration, version control issues can still arise, especially in predictive models that require constant updates.

Role of Excel in Data-Driven Business Decisions

Despite its limitations, Excel remains a cornerstone in business decision-making. Here’s why:

Quick Prototyping: Before committing resources to complex analytics platforms, businesses can use Excel to create quick, cost-effective predictive models.
Decision Support: Forecasting sales, estimating demand, and budgeting can be done efficiently in Excel, helping leaders make informed choices.
Data Visualization: Through charts, dashboards, and conditional formatting, decision-makers can quickly grasp trends and patterns.
Accessibility Across Departments: From finance teams tracking cash flow to marketing teams analyzing campaign performance, Excel’s familiarity ensures that insights are understandable company-wide.

Conclusion

Excel is not a magic wand for predictive analysis—it’s a versatile but limited tool. It shines in its accessibility, flexibility, and ability to bridge the gap between raw data and actionable insights. However, for large-scale, high-complexity predictive modeling, businesses may need to integrate Excel with more specialized tools.

The real power lies in knowing when Excel is “good enough” and when it’s time to scale up. Used wisely, it can still be a trusted ally in making smart, data-driven business decisions.