Outlier Detection Made Simple: 3 Effective Methods Every Data Scientist Should Know

If you're working with real-world data, you're going to run into outliers. They're the weird values that sit miles away from the rest. Maybe a customer spent $10,000 when the average order is $50. Or a sensor glitched and logged -9999. 

These values distort your stats and make your experiment's conclusion unreliable. And because so many decisions ride on means (A/B tests, pricing, forecasting), ignoring outliers can seriously mess with your results. 

That's the danger. Outliers don't just skew your charts. They throw off everything: confidence intervals, p-values, whether you ship a feature or kill it. If your decisions rely on the mean, you'd better know what's hiding in the tails.

The good news? You don't need advanced stats to fix outliers. A few clean lines of code and some common sense go a long way.

Framing the Problem

Say you're comparing two groups in an experiment. Group A has an average order value of $10; Group B is at $12. It sounds like the test group is doing better, but both groups include outliers. These extreme values skew the mean and standard deviation, making the difference between $10 and $12 harder to trust.

Distribution of control (A) and test groups (B) with visible outliers

Here's how to generate a synthetic version of this problem:

import numpy as np

N = 1000

# Group A: mean 10, with some large outliers
x1 = np.concatenate((
    np.random.normal(10, 3, N),            # Normal distribution
    10 * np.random.random_sample(50) + 20  # Outliers: values between 20 and 30
))

# Group B: mean 12, with some moderate outliers
x2 = np.concatenate((
    np.random.normal(12, 3, N),            # Normal distribution
    4 * np.random.random_sample(50) + 1    # Outliers: values between 1 and 5
))
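
As a quick sanity check (not part of the original snippet), printing the raw means shows how the injected outliers drag them away from 10 and 12. Exact values vary from run to run:

print(f"Group A raw mean: {np.mean(x1):.2f}")  # pulled above 10 by the 20-30 outliers
print(f"Group B raw mean: {np.mean(x2):.2f}")  # pulled below 12 by the 1-5 outliers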

Method 1: Trim the Tails

A quick way to fix outliers is to cut off the extremes: remove the lowest 5% and highest 5% of values. Sure, you lose some data, but you're getting rid of the weirdest 10% that usually don't add value anyway.

Here's how to do it:

low = np.percentile(x, 5)
high = np.percentile(x, 95)
x_clean = [i for i in x if low < i < high]

Done. Now your averages won't be dragged off by those few extreme values. It's a blunt but effective method, perfect for a fast cleanup.
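
As a quick check on the synthetic Group A from the setup above (a sketch, not part of the original code), compare the mean before and after trimming; the trimmed value lands noticeably closer to 10:

low, high = np.percentile(x1, 5), np.percentile(x1, 95)
x1_trimmed = [i for i in x1 if low < i < high]
print(f"raw mean: {np.mean(x1):.2f}, trimmed mean: {np.mean(x1_trimmed):.2f}")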

Distribution of control (A) and test groups (B) after trimming the tails

Method 2: Use IQR Bands

Another approach is to exclude values outside a range based on the interquartile range (IQR), the distance between the 25th and 75th percentiles. Specifically, you drop anything below the 25th percentile (Q1) minus 1.5 times the IQR, or above the 75th percentile (Q3) plus 1.5 times the IQR, the classic Tukey fences.

This usually removes only a small fraction of the data, just the points outside the fences, but it tightens the distribution and improves the accuracy of your estimates. It's a solid way to filter out extreme values without throwing away too much information.

Here's how to do it:

Q1 = np.percentile(x, 25)
Q3 = np.percentile(x, 75)
IQR = Q3 - Q1
low = Q1 - 1.5 * IQR
high = Q3 + 1.5 * IQR
x_clean = [i for i in x if low < i < high]
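
To see how these fences behave on the synthetic Group A from earlier (a sketch; counts vary per run), you can check how many points get dropped and where the mean lands:

Q1, Q3 = np.percentile(x1, 25), np.percentile(x1, 75)
IQR = Q3 - Q1
x1_iqr = [i for i in x1 if Q1 - 1.5 * IQR < i < Q3 + 1.5 * IQR]
print(f"removed {len(x1) - len(x1_iqr)} of {len(x1)} points, new mean: {np.mean(x1_iqr):.2f}")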

Distribution of control (A) and test groups (B) after applying IQR Bands

Method 3: Bootstrap

Sometimes, the smartest move is not to remove anything at all. Instead, use bootstrapping to measure how much the noise actually matters: resample your data with replacement many times, calculate the mean of each resample, and look at the distribution of those means. Its spread tells you how stable your estimate really is, and its percentiles give you a confidence interval for the mean, outliers and all.

For datasets with unavoidable outliers (like revenue or user behavior), bootstrapping gives you a better sense of the "typical" outcome, without deciding what to keep or toss. It's computationally cheap and surprisingly powerful.

Here's how to apply bootstrapping to your data:

def bootstrap_mean(x, n=1000):
    return np.mean([np.mean(np.random.choice(x, size=len(x), replace=True)) for _ in range(n)])
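
Applied to the two synthetic groups, and extended with a percentile interval so you get a range rather than a single number (bootstrap_ci is a hypothetical helper, not part of the original snippet):

def bootstrap_ci(x, n=1000, alpha=0.05):
    # Percentile bootstrap: interval spanning the middle (1 - alpha) of the resampled means
    means = [np.mean(np.random.choice(x, size=len(x), replace=True)) for _ in range(n)]
    return np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])

print(f"Group A: mean {bootstrap_mean(x1):.2f}, 95% CI {bootstrap_ci(x1)}")
print(f"Group B: mean {bootstrap_mean(x2):.2f}, 95% CI {bootstrap_ci(x2)}")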

Distribution of control (A) and test groups (B) after Bootstrapping

Which One Should You Use?

  • Trim the tails: Fast, simple, aggressive. Use it when you know your extremes are garbage or need a fast clean-up. 
  • IQR method: Balanced, statistically sound. Use it when you want a stats-defensible way to filter noise without cutting too deep into the data.
  • Bootstrap: No filtering, better central estimates. Use it when removing values isn't an option or when your data naturally includes rare-but-legit extremes.

Don't overthink it. Try all three methods and compare the averages and variances. You'll quickly see what gives you the most stable, trustworthy result. It's not about perfection, it's about using the right tool for the job.
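
Here's a minimal comparison sketch for Group A, assuming x1 and bootstrap_mean from the earlier snippets are already defined (trim_tails and iqr_filter are just illustrative helper names):

def trim_tails(x, lower=5, upper=95):
    low, high = np.percentile(x, lower), np.percentile(x, upper)
    return [i for i in x if low < i < high]

def iqr_filter(x, k=1.5):
    q1, q3 = np.percentile(x, 25), np.percentile(x, 75)
    iqr = q3 - q1
    return [i for i in x if q1 - k * iqr < i < q3 + k * iqr]

for name, values in [("raw", x1), ("trimmed", trim_tails(x1)), ("IQR", iqr_filter(x1))]:
    print(f"{name:8s} mean={np.mean(values):.2f}  var={np.var(values):.2f}")
print(f"bootstrap mean={bootstrap_mean(x1):.2f}")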

Common Pitfalls to Avoid

This is where people mess up: they blindly delete anything that looks weird. Don't do that. Always check what you're cutting. That $10,000 order might be rare - but legit. Dropping outliers without context can erase real signals or create blind spots.

If you automatically delete anything that doesn't fit your expectations, you risk filtering out the exact thing you should be investigating. 

Outliers can be early warnings, new trends, or edge cases that turn into product ideas. Treat them like clues, not trash.

Also, stop relying only on the mean. It's fragile. Just a few outliers can throw it off completely. Use the median or a trimmed mean when things look messy. And always - seriously, always - plot your data first. A quick histogram or boxplot will keep you from making dumb assumptions.
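
A quick sketch of those checks, assuming matplotlib and SciPy are available (neither is used in the snippets above):

import matplotlib.pyplot as plt
from scipy.stats import trim_mean

print("median:", np.median(x1))
print("10% trimmed mean:", trim_mean(x1, 0.1))  # cuts 10% from each tail before averaging

# Look before you filter: histogram and boxplot of the raw data
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(x1, bins=50)
ax1.set_title("Histogram of Group A")
ax2.boxplot(x1, vert=False)
ax2.set_title("Boxplot of Group A")
plt.show()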

Final Thought

Outliers aren't the problem. Misreading them is. Sometimes they're garbage. Other times, they're signal you didn't expect. Your job isn't to blindly cut them, it's to figure out what they actually mean.

Use the methods above when you need clean, reliable data. But don't ignore what the outliers might be trying to tell you.

They could be pointing to a bug… Or to your next big opportunity.

Your job is to know the difference.

Struggling to grow your audience as a Tech Professional?
The Tech Audience Accelerator is the go-to newsletter for tech creators serious about growing their audience. You’ll get the proven frameworks, templates, and tactics behind my 30M+ impressions (and counting).

The Tech Audience Accelerator | Paolo Perrone | Substack (techaudienceaccelerator.substack.com)

