Michelle Njuguna

Hypothesis Testing.

Introduction

Have you ever disliked a dish but still had to eat it sometimes? Most people dislike vegetables and would prefer not to eat them, even though they are really good for our bodies. Now imagine machine learning and AI as a body, and guess what the vegetables are: mathematics, specifically statistics.

I know statistics might not be the most exciting part of AI for many people, but it's the foundation of how machines learn and process data. I will do my best to help you understand statistics and how its major concepts work so that, as we code, we actually know what we are doing.

Hypothesis Testing

I believe hypothesis testing is the backbone of all models. It helps determine whether the effect you're measuring is statistically significant, and whether the data support the claim that your solution actually works.

From Wikipedia, hypothesis testing is a type of statistical analysis in which you put your assumptions about a population parameter to the test. It is used to assess the relationship between two statistical variables.

Example: In my intro, I made the assumption that most people do not like eating vegetables. I could test that claim by collecting and evaluating a representative sample from the population under study.

Importance of Hypothesis Testing

  1. Avoiding Misleading Conclusions and Making Smart Decisions.

Imagine you're an oncologist with a new method for testing for cancer. Two patients, Patient A and Patient B, come in for testing. Patient A is actually healthy, but your test incorrectly diagnoses them with cancer: this is a Type I error (false positive). Patient B, on the other hand, does have cancer, but your test fails to detect it, and you send them home thinking they're fine: this is a Type II error (false negative). Hypothesis testing gives you a way to quantify how likely each kind of error is, so you can decide how much evidence you need before acting.

  2. Optimizing Business Tactics.

Hypothesis testing is invaluable for trying out new ideas and strategies before fully committing to them. Example: checking whether investing in a model that predicts early signs of cancer would bring a health institution more patients who want to be screened, and hence new clients.

Hypothesis Testing Formula

Z = ( x̅ – μ0 ) / (σ /√n)

x̅ = sample mean,
μ0 = hypothesized population mean (the value stated in the null hypothesis),
σ = population standard deviation,
n = sample size.
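
As a quick illustration of the formula above, here is a minimal Python sketch. The function name `z_statistic` and the numbers in the example are made up for illustration; they are not from any real data set.

```python
import math


def z_statistic(sample_mean, hypothesized_mean, population_sd, n):
    """Z = (sample mean - hypothesized mean) / (sigma / sqrt(n))."""
    standard_error = population_sd / math.sqrt(n)
    return (sample_mean - hypothesized_mean) / standard_error


# Hypothetical numbers: 36 measurements averaging 103, tested against
# a hypothesized mean of 100 with a known standard deviation of 12.
z = z_statistic(sample_mean=103, hypothesized_mean=100, population_sd=12, n=36)
print(round(z, 2))  # 1.5
```

The sample mean here sits 1.5 standard errors above the hypothesized mean. Whether that counts as "significant" depends on the significance level you choose.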

1. Formulating a Hypothesis

Null Hypothesis (H0): This is the default assumption that there is no effect or difference.
Alternative Hypothesis (Ha): This is the hypothesis that there is an effect or difference.

Choose the Significance Level (α)
The significance level, often denoted by alpha (α), is the probability of rejecting the null hypothesis when it is true. Common choices for α are 0.05 (5%), 0.01 (1%), and 0.10 (10%).

The significance level (α) directly controls the probability of a Type I error. For a fixed sample size, decreasing α reduces the chance of Type I errors but increases the risk of Type II errors.

To balance these errors:

  1. Adjust the significance level based on the consequences of each error type.
  2. Increase sample size to improve the power of the test.
  3. Use one-tailed tests when the direction of the effect is specified in advance.
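
To see the trade-off in numbers, here is a rough sketch that computes the power (1 minus the Type II error rate) of an upper-tailed z-test. It assumes a known standard deviation and a true mean that really is larger than the hypothesized one; the function name `one_sided_z_power` and the effect size, standard deviation, and sample sizes are hypothetical.

```python
import math
from statistics import NormalDist


def one_sided_z_power(alpha, effect, sd, n):
    """Power of an upper-tailed z-test when the true mean exceeds mu0 by `effect`."""
    std_normal = NormalDist()
    critical_z = std_normal.inv_cdf(1 - alpha)  # rejection threshold under H0
    shift = effect * math.sqrt(n) / sd          # how far the true mean moves Z
    return 1 - std_normal.cdf(critical_z - shift)


# Illustrative settings only: true effect of 2 units, sd of 10.
for alpha in (0.10, 0.05, 0.01):
    for n in (25, 100):
        power = one_sided_z_power(alpha, effect=2.0, sd=10.0, n=n)
        print(f"alpha={alpha:.2f} n={n:3d} power={power:.3f} type_II={1 - power:.3f}")
```

Running it should show the pattern described above: at the same sample size a smaller α gives lower power (more Type II errors), while a larger sample raises power at every α.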

2. Select the Appropriate Test
Choose a statistical test based on the type of data and the hypothesis. Common tests include t-tests, chi-square tests, ANOVA, and regression analysis. The selection depends on data type, distribution, sample size, and whether the hypothesis is one-tailed or two-tailed.

3. Collect Data
Gather the data that will be analyzed in the test. This data should be representative of the population.

4. Calculate the Test Statistic
Based on the collected data and the chosen test, calculate a test statistic that reflects how much the observed data deviates from the null hypothesis.

5. Determine the p-value
The p-value is the probability of observing test results at least as extreme as the results observed, assuming the null hypothesis is correct.

6. Make a Decision
Compare the p-value to the chosen significance level:
If the p-value ≤ α: Reject the null hypothesis, suggesting sufficient evidence in the data supports the alternative hypothesis.
If the p-value > α: Do not reject the null hypothesis, suggesting insufficient evidence to support the alternative hypothesis.

7. Report the Results
Present the findings from the hypothesis test, including the test statistic, p-value, and the conclusion about the hypotheses.
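
The last few steps (p-value, decision, report) can be sketched in a few lines of Python. This assumes a two-tailed z-test, so the p-value comes from the standard normal distribution; the helper names `two_sided_p_value` and `decide` and the test statistic of 1.5 are illustrative only.

```python
from statistics import NormalDist


def two_sided_p_value(z):
    """Probability of a |Z| at least this large if the null hypothesis is true."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


def decide(p_value, alpha=0.05):
    """Apply the decision rule: reject H0 when the p-value is at most alpha."""
    return "reject H0" if p_value <= alpha else "fail to reject H0"


z = 1.5  # hypothetical test statistic
p = two_sided_p_value(z)
print(f"z = {z:.2f}, p-value = {p:.4f}, decision at alpha = 0.05: {decide(p)}")
```

With these numbers the p-value is about 0.13, which is above 0.05, so the null hypothesis is not rejected. For other tests (t, chi-square, ANOVA) the same decision rule applies, but the p-value comes from a different distribution.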

Types of Hypothesis Testing

  1. Z Test
    It usually checks whether a mean equals a hypothesized value, or whether two means are the same (the null hypothesis). A z-test applies when the population standard deviation is known; as a rule of thumb, the sample should also contain 30 or more data points.

  2. T Test
    Compares the means of two groups to determine whether they differ, or whether a procedure or treatment affects the population of interest. Unlike the z-test, it does not require the population standard deviation to be known.

  3. Chi-Square
    The chi-square test works on categorical variables from a random sample and checks how well the observed results fit the expected ones. Its fundamental premise is to compare the observed values in your data to the values that would be expected if the null hypothesis were true.

  4. ANOVA
    Analysis of Variance (ANOVA) is a statistical method used to compare the means of three or more groups. It's particularly useful when you want to see whether there are significant differences between multiple groups.
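
To make two of the tests above more concrete, here is a small sketch that computes the chi-square statistic and the one-way ANOVA F statistic by hand. The function names and the data are made up for illustration. Turning either statistic into a p-value needs the chi-square or F distribution, which a statistics library would normally provide and which is left out here.

```python
from statistics import mean


def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def one_way_anova_f(groups):
    """F = (between-group mean square) / (within-group mean square)."""
    all_values = [x for group in groups for x in group]
    grand_mean = mean(all_values)
    k = len(groups)            # number of groups
    n_total = len(all_values)  # total number of observations
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Hypothetical counts for three categories, expected to be equally likely.
chi2 = chi_square_statistic(observed=[18, 22, 20], expected=[20, 20, 20])

# Hypothetical measurements from three groups.
f_stat = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])

print(f"chi-square = {chi2:.2f}, F = {f_stat:.2f}")
```

A small chi-square value means the observed counts sit close to what the null hypothesis predicts, while a large F value means the group means differ a lot compared to the spread inside each group.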

Modern Approaches to Hypothesis Testing
In addition to traditional hypothesis testing methods, there are several modern approaches:

  1. Permutation or randomization tests
    These tests involve randomly shuffling the observed data many times to create a distribution of possible outcomes under the null hypothesis. They are particularly useful when dealing with small sample sizes or when the assumptions of parametric tests are not met.

  2. Bootstrapping
    Bootstrapping is a resampling technique that involves repeatedly sampling with replacement from the original dataset. It can be used to estimate the sampling distribution of a statistic and construct confidence intervals.

  3. Jackknife
    Jackknife is a resampling technique that recomputes a statistic while leaving out one observation at a time. It is especially useful for bias and variance estimation.

  4. Monte Carlo simulation
    Monte Carlo methods use repeated random sampling to obtain numerical results. In hypothesis testing, they can be used to estimate p-values for complex statistical models or when analytical solutions are difficult to obtain.
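
Here is a minimal sketch of two of the approaches above, a permutation test and a bootstrap confidence interval, using only Python's standard library. The function names, the data, the number of resamples, and the fixed seeds are all assumptions made for illustration. The permutation test samples random shuffles rather than listing every one, so it is also a small Monte Carlo estimate of the p-value.

```python
import random


def _mean(values):
    return sum(values) / len(values)


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Estimate a two-sided p-value for a difference in means by shuffling labels."""
    rng = random.Random(seed)
    observed = abs(_mean(group_a) - _mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign observations to groups at random
        diff = abs(_mean(pooled[:n_a]) - _mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # The +1 terms count the observed arrangement itself, so p is never 0.
    return (extreme + 1) / (n_permutations + 1)


def bootstrap_mean_ci(sample, confidence=0.95, n_resamples=10_000, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    means = sorted(
        _mean(rng.choices(sample, k=len(sample)))  # resample with replacement
        for _ in range(n_resamples)
    )
    lower_index = int((1 - confidence) / 2 * n_resamples)
    upper_index = int((1 + confidence) / 2 * n_resamples) - 1
    return means[lower_index], means[upper_index]


# Hypothetical data for two groups.
group_a = [10, 11, 12, 13, 14]
group_b = [20, 21, 22, 23, 24]
print("permutation p-value:", permutation_p_value(group_a, group_b))
print("bootstrap 95% CI for mean of group_a:", bootstrap_mean_ci(group_a))
```

Neither function assumes the data follow a normal distribution, which is why these methods are handy when the assumptions of the z-test or t-test are in doubt.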

Conclusion
While statistical findings are important, it's also crucial to apply common sense when working with hypothesis testing. The practicality of your ideas will lead to better tests and more reliable conclusions. I do hope I have helped you like vegetables a little bit. Remember, with statistics, practice makes perfect, so keep solving as many problems as you can.
