Shlok Kumar

True Error vs Sample Error

In machine learning and statistics, understanding the concepts of true error and sample error is crucial for evaluating the performance of models. These errors help us assess how well our models generalize from training data to unseen data. Let’s delve into these concepts and see how they differ.

True Error

True error refers to the probability that a hypothesis will misclassify a single randomly drawn sample from the entire population. The population, in this context, includes all potential data points that the model might encounter.

For a given hypothesis h(x) and the actual target function f(x), the true error can be expressed as:

T.E. = P[f(x) ≠ h(x)]

This indicates the likelihood that the model’s predictions do not match the true values.
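Because the whole population is usually unavailable, the true error can only be approximated in practice. A minimal sketch, assuming a hypothetical target f(x) and hypothesis h(x) that disagree on part of the input space, estimates it by drawing a very large sample from an assumed population distribution:

```python
import numpy as np

# Hypothetical target function: labels x as 1 when x > 0.5
def f(x):
    return (x > 0.5).astype(int)

# Hypothetical hypothesis: a slightly shifted threshold, so it
# disagrees with f exactly on the region (0.5, 0.6]
def h(x):
    return (x > 0.6).astype(int)

# Approximate the true error by sampling heavily from the (assumed)
# population distribution, here uniform on [0, 1)
rng = np.random.default_rng(42)
population = rng.uniform(0, 1, 1_000_000)
true_error = np.mean(f(population) != h(population))

print(true_error)  # close to 0.1, the probability mass of (0.5, 0.6]
```

In this toy setup the true error is known analytically (0.1), which is exactly what makes it a toy: for a real model the population distribution is unknown, so the true error must be estimated from samples instead.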

Sample Error

Sample error, on the other hand, measures the proportion of misclassified examples within a specific sample. It is calculated based on the data points that were used to train or test the model. The formula for sample error is:

S.E. = Number of misclassified instances / Total number of instances

Alternatively, it can also be expressed in terms of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN):

S.E. = (FP + FN) / (TP + FP + FN + TN)

Or simply:

S.E. = 1 - Accuracy

For example, if a hypothesis misclassifies 7 out of 33 examples, the sample error would be:

S.E. = 7 / 33 ≈ 0.21
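The same numbers can be reproduced in code. A short sketch with hypothetical labels, constructed so that exactly 7 of the 33 predictions are wrong:

```python
import numpy as np

# Hypothetical ground truth and predictions for 33 examples
y_true = np.array([1] * 20 + [0] * 13)
y_pred = y_true.copy()
y_pred[:7] = 1 - y_pred[:7]  # flip 7 labels to create misclassifications

sample_error = np.mean(y_true != y_pred)  # fraction misclassified
accuracy = np.mean(y_true == y_pred)

print(round(sample_error, 2))                  # 0.21
print(np.isclose(sample_error, 1 - accuracy))  # True: S.E. = 1 - Accuracy
```

Counting mismatches with `np.mean` gives the misclassification fraction directly, and the last line confirms the S.E. = 1 - Accuracy identity numerically.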

Bias & Variance

Bias

Bias measures the difference between the average prediction of a model and the actual value. High bias typically indicates that a model is too simplistic and is likely to underfit the data.

Bias = E[h(x)] - f(x)

Variance

Variance assesses how much the model's predictions vary for different training sets. A high-variance model is overly complex and can lead to overfitting.

Var(X) = E[(X - E[X])²]
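Both quantities can be measured empirically by refitting the same model on many independently drawn training sets and watching how its prediction at a fixed point behaves. A sketch under assumed conditions (target sin(2πx), Gaussian noise, and a deliberately too-simple degree-1 polynomial fit):

```python
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    return np.sin(2 * np.pi * x)  # assumed true function f(x)

x_test = 0.3
predictions = []
for _ in range(500):
    # Fresh training set each round: 20 noisy samples of the target
    x_train = rng.uniform(0, 1, 20)
    y_train = target(x_train) + rng.normal(0, 0.1, 20)
    # A straight line is too simple for a sine wave: a high-bias model
    coeffs = np.polyfit(x_train, y_train, deg=1)
    predictions.append(np.polyval(coeffs, x_test))

predictions = np.array(predictions)
bias = predictions.mean() - target(x_test)  # E[h(x)] - f(x)
variance = predictions.var()                # E[(h(x) - E[h(x)])^2]

print(f"bias = {bias:.3f}, variance = {variance:.4f}")
```

The linear model shows a large negative bias at this point but only a small variance; raising the polynomial degree would shrink the bias and inflate the variance, which is the usual trade-off.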

Confidence Interval

Calculating the true error directly is rarely possible, since it would require evaluating the hypothesis on the entire population. Instead, it is estimated from the sample error using a confidence interval. The process involves:

  1. Randomly drawing n samples from the population (where n > 30).
  2. Calculating the sample error for these samples.

The formula for estimating the true error based on the sample error is:

T.E. = S.E. ± z_s * √(S.E.(1 - S.E.) / n)

Where z_s is the z-score corresponding to the desired confidence level.
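This formula is straightforward to implement. A small sketch (the helper `true_error_interval` is illustrative, not a library function), using SciPy only to obtain the z-score:

```python
import math
from scipy.stats import norm

def true_error_interval(sample_error, n, confidence=0.95):
    """Normal-approximation confidence interval for the true error,
    given a sample error measured on n examples (valid for n > 30)."""
    z = norm.ppf(1 - (1 - confidence) / 2)  # two-sided z-score
    margin = z * math.sqrt(sample_error * (1 - sample_error) / n)
    return sample_error - margin, sample_error + margin

# Example from above: 7 misclassifications out of 33 examples
low, high = true_error_interval(7 / 33, 33, confidence=0.95)
print(f"({low:.3f}, {high:.3f})")  # prints (0.073, 0.352)
```

With only 33 examples the interval is wide; the margin shrinks with 1/√n, so quadrupling the sample size roughly halves it.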

Example Code for Confidence Interval Estimation

Here's how you can compute confidence intervals in Python with SciPy. Note that this example builds intervals for the mean of a numeric sample rather than for a classification error, but the same `st.norm.interval` call applies once you supply the statistic and its standard error:

# Imports
import numpy as np
import scipy.stats as st

# Define sample data: 10,000 integers drawn uniformly from [10, 30)
np.random.seed(0)
data = np.random.randint(10, 30, 10000)

# Confidence levels to evaluate. The keyword for this argument was
# renamed from `alpha` to `confidence` in SciPy 1.9, so pass it
# positionally to stay compatible with both.
confidence_levels = [0.90, 0.95, 0.99, 0.995]
for level in confidence_levels:
    low, high = st.norm.interval(level, loc=np.mean(data), scale=st.sem(data))
    print(f"{level:.1%}: ({low:.2f}, {high:.2f})")

Confidence Interval Output

This code prints one interval per confidence level, and higher confidence levels produce wider intervals, since a wider range is needed to cover more of the sampling distribution. With 10,000 samples the intervals are narrow: at 95% confidence the half-width is about 1.96 × SEM ≈ 0.11 around a sample mean near 19.5.

True Error vs Sample Error Summary

| True Error | Sample Error |
| --- | --- |
| Probability of misclassification over the entire population. | Fraction of misclassified instances within a specific sample. |
| Estimates error across the whole population. | Assesses error on the sample data only. |
| Difficult to calculate directly; usually estimated via confidence intervals. | Easy to calculate directly from the sample. |
| Can be distorted by poor data collection or bias. | Can be affected by selection or non-response errors. |

Understanding true error and sample error is essential for building robust machine learning models. By estimating these errors, you can make informed decisions about model performance and improve the predictive capabilities of your algorithms.

For more content, follow me at — https://linktr.ee/shlokkumar2303
