Your Outlier Detection is Lying to You

#ai #machinelearning #datascience #python

Why DBSCAN breaks in high dimensions and what to do instead

You tuned epsilon to 1.5 because it felt reasonable. Here is what that decision actually means. On a dataset with 16 features, shifting epsilon from 1.0 to 2.0 changes your outlier rate from 60.31% to 2.35%. Same data. Same algorithm. One parameter moved by a single unit. These are not numbers from a toy dataset: they come from roughly a decade of real Australian weather records, about 145,000 observations across 16 numerical meteorological variables.

If someone asked you to justify eps=1.5 in a production review, what would you say?

The Setup

The dataset is the Australian weather observations from the Bureau of Meteorology, publicly available on Kaggle. It contains daily measurements from 49 stations across the country: temperature, rainfall, wind speed, pressure, humidity. Real data, messy data, with missing values and a distribution that does not care about your assumptions.

The preprocessing is standard. Select numerical columns, impute missing values with the column median, and scale everything with StandardScaler. Sixteen features survive the selection.

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

df = pd.read_csv("weatherAUS.csv")

num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
imputer = SimpleImputer(strategy='median')
df_num_imputed = pd.DataFrame(
    imputer.fit_transform(df[num_cols]), columns=num_cols
)

scaler = StandardScaler()
df_scaled = scaler.fit_transform(df_num_imputed)

print(f"Total rows: {len(df_scaled)} | Dimensions: {len(num_cols)}")
# Total rows: 145460 | Dimensions: 16

Nothing unusual so far. This is the pipeline you have probably written a dozen times. The problem starts at the next step.

Why DBSCAN Cannot Handle This

DBSCAN labels a point as noise when two things are true: it is not a core point (fewer than min_samples points, itself included, lie within a radius of epsilon), and it does not fall within epsilon of any core point. The logic is intuitive in two or three dimensions. In sixteen dimensions it stops making geometric sense.
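
To make the noise rule concrete, here is a minimal brute-force sketch in NumPy. The function name `dbscan_noise_mask` and the toy data are made up for illustration, and this is not how scikit-learn implements DBSCAN internally. It only reproduces the rule that decides which points end up labeled -1.

```python
import numpy as np

def dbscan_noise_mask(X, eps, min_samples):
    """Boolean mask of the points DBSCAN's rule would treat as noise.

    Illustrative O(n^2) version for small arrays only.
    """
    # Pairwise Euclidean distances between all points.
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    within_eps = dists <= eps                  # each point counts itself
    is_core = within_eps.sum(axis=1) >= min_samples
    # A non-core point becomes a border point if any core point is within eps.
    near_core = (within_eps & is_core[None, :]).any(axis=1)
    return ~is_core & ~near_core

# Toy data: a tight cluster of five points plus one far-away point.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.05, 0.05],
              [5.0, 5.0]])
noise = dbscan_noise_mask(X, eps=0.5, min_samples=4)
print(noise)
```

Only the isolated point at (5, 5) is flagged. Everything in the rule hinges on what "within eps" means, which is exactly the part that degrades as dimensions grow.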

The reason is the curse of dimensionality. As dimensions increase, Euclidean distances between points concentrate. Under fairly general conditions, the ratio between the maximum and minimum pairwise distance shrinks toward one. In practice this means that in a high-dimensional space all points start to look roughly equidistant from each other. The notion of a dense neighborhood that DBSCAN relies on becomes increasingly difficult to define, and the choice of epsilon loses its geometric interpretation.
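
You can watch the concentration effect on synthetic data. The sketch below draws independent standard-normal points (an assumption chosen for simplicity, not the weather data) and compares the largest and smallest pairwise distance as the dimension grows. Real, correlated features usually concentrate more slowly than this idealized case, so treat it as an illustration of the trend rather than a prediction for your dataset.

```python
import numpy as np

def relative_contrast(n_points, n_dims, rng):
    """Ratio of the largest to the smallest pairwise Euclidean distance."""
    X = rng.standard_normal((n_points, n_dims))
    # Squared distances via |a - b|^2 = |a|^2 + |b|^2 - 2 a.b
    sq_norms = (X ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    iu = np.triu_indices(n_points, k=1)        # each pair once, no self-pairs
    dists = np.sqrt(np.clip(sq_dists[iu], 0.0, None))
    return dists.max() / dists.min()

rng = np.random.default_rng(0)
contrasts = {d: relative_contrast(500, d, rng) for d in (2, 16, 256)}
for d, ratio in contrasts.items():
    print(f"dims={d:>3}: max/min distance ratio = {ratio:.1f}")
```

The ratio falls sharply as dimensions are added. When the nearest and farthest neighbors are almost the same distance away, a small change in a radius sweeps across a large share of the data at once.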

from sklearn.cluster import DBSCAN

eps_values = [1.0, 1.5, 2.0]
outlier_counts = []

for eps in eps_values:
    dbscan = DBSCAN(eps=eps, min_samples=4, n_jobs=-1)
    labels = dbscan.fit_predict(df_scaled)
    n_outliers = np.sum(labels == -1)
    pct = (n_outliers / len(df_scaled)) * 100
    outlier_counts.append(pct)
    print(f"DBSCAN eps={eps}: {n_outliers} outlier ({pct:.2f}%)")

# Output:
# DBSCAN eps=1.0: 87720 outlier (60.31%)
# DBSCAN eps=1.5: 18166 outlier (12.49%)
# DBSCAN eps=2.0:  3423 outlier  (2.35%)

That swing is the structural problem. You are not making a calibration decision. You are making an arbitrary choice that determines whether your pipeline discards roughly 87,700 rows or 3,400 rows, and you have no principled way to defend either number.

The Paradigm Shift: Isolation Over Distance

Isolation Forest does not use distances. It builds an ensemble of random trees, and at each node of each tree it randomly selects a feature and a split value within that feature's observed range. A point is considered anomalous if it gets isolated near the root of the trees, meaning very few splits were needed to separate it from the rest of the data.

This matters because anomalies are by definition rare and different. A truly anomalous point sits in a sparse region of the feature space and is easy to isolate with just a few random cuts. A normal point lives in a dense cluster and requires many cuts to separate. The algorithm exploits this structural property without ever computing a distance.

The practical consequence is that Isolation Forest is far less exposed to the concentration of distances that undermines DBSCAN in high dimensions. Each split operates on a single feature, so the method never compares points through a full 16-dimensional distance. It is not immune to high-dimensional noise, but it does not degrade in the same catastrophic way.
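
The isolation idea fits in a few lines. The sketch below is a simplified stand-in for a single isolation tree path, not scikit-learn's implementation: the helper `isolation_depth` and the synthetic data (256 standard-normal inliers plus one point placed at 8.0 on every feature) are assumptions made for illustration. It counts how many random single-feature cuts it takes before a given row is alone.

```python
import numpy as np

def isolation_depth(X, idx, rng, max_depth=50):
    """Count random axis-aligned splits until row `idx` is isolated."""
    rows = np.arange(len(X))
    for depth in range(max_depth):
        if len(rows) <= 1:
            return depth
        feature = rng.integers(X.shape[1])
        values = X[rows, feature]
        lo, hi = values.min(), values.max()
        if lo == hi:
            continue                           # nothing to split on; wasted draw
        split = rng.uniform(lo, hi)
        # Keep only the side of the cut that contains the target row.
        if X[idx, feature] < split:
            rows = rows[values < split]
        else:
            rows = rows[values >= split]
    return max_depth

rng = np.random.default_rng(0)
inliers = rng.standard_normal((256, 16))
outlier = np.full((1, 16), 8.0)                # extreme on every feature
X = np.vstack([inliers, outlier])
outlier_idx = len(X) - 1

outlier_depth = np.mean([isolation_depth(X, outlier_idx, rng) for _ in range(200)])
inlier_depth = np.mean([isolation_depth(X, rng.integers(256), rng) for _ in range(200)])
print(f"mean depth, outlier: {outlier_depth:.1f} | random inlier: {inlier_depth:.1f}")
```

The extreme point is cut off after a handful of splits, while a typical inlier needs many more. No distance is ever computed, which is the property the real algorithm exploits.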

from sklearn.ensemble import IsolationForest

# For meteorological data, ~5% of anomalous events is a reasonable estimate
# based on domain knowledge. This is not a magic number: it is a claim
# you can argue in front of a domain expert.
CONTAMINATION = 0.05

iso = IsolationForest(contamination=CONTAMINATION, random_state=42, n_jobs=-1)
iso.fit(df_scaled)

anomaly_scores = iso.decision_function(df_scaled)
predictions = iso.predict(df_scaled)

df['Anomaly_Score'] = anomaly_scores
df['Is_Anomaly'] = (predictions == -1)

Notice what changed conceptually. With DBSCAN you were choosing a geometric radius with no interpretable meaning in 16 dimensions. With Isolation Forest you are choosing a contamination rate, a domain assumption you can state explicitly. You can argue that you expect approximately 5 percent of these observations to be genuine meteorological anomalies. That is a claim you can bring to a domain expert or a code reviewer. An epsilon of 1.5 is not.

The Sensitivity Problem Has Not Disappeared

One caveat deserves honesty. Isolation Forest does not eliminate parameter sensitivity. It relocates it to a space where the sensitivity is at least interpretable.

print("--- Threshold sensitivity in Isolation Forest ---")
for threshold in [-0.10, -0.05, 0.00, 0.05]:
    n = np.sum(anomaly_scores < threshold)
    print(f"  Threshold {threshold:+.2f}: {n} outlier ({(n/len(df))*100:.2f}%)")

# Output:
#   Threshold -0.10:   123 outlier (0.08%)
#   Threshold -0.05:  1405 outlier (0.97%)
#   Threshold +0.00:  7273 outlier (5.00%)
#   Threshold +0.05: 28844 outlier (19.83%)

The range from 123 to 28,844 outliers is still dramatic. The difference from the DBSCAN case is that each of these thresholds maps to a falsifiable claim about the data. Cutting at a threshold of 0.00 corresponds to your 5 percent contamination assumption. Cutting at a lower threshold means you only want to remove the most extreme fractions of a percent. You can debate those percentages with domain knowledge. You cannot debate what a geometric radius means in a 16-dimensional standardized feature space because it does not mean anything you can explain to another human being.
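
Why does a threshold of 0.00 line up with the contamination rate? According to the scikit-learn documentation, `decision_function` is `score_samples` minus `offset_`, and when contamination is a number the offset is chosen so that the expected share of training samples falls below zero. The sketch below reproduces that idea with plain NumPy on synthetic scores. The score distribution is invented for illustration, and the percentile step is my reading of how the offset is derived, so verify it against the library version you run.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for raw anomaly scores (lower = more anomalous); purely synthetic.
raw_scores = rng.normal(loc=-0.45, scale=0.05, size=10_000)

contamination = 0.05
# Shift the scores so the chosen quantile lands at zero.
offset = np.percentile(raw_scores, 100.0 * contamination)
shifted = raw_scores - offset

flagged_fraction = np.mean(shifted < 0)
print(f"Flagged at threshold 0.00: {flagged_fraction:.2%}")
```

Any other threshold is just a different quantile of the same score distribution, which is why each row of the sensitivity table can be restated as "the most anomalous X percent of observations".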

What the Algorithm Actually Found

The real test of an unsupervised method with no ground truth is whether its outputs make sense to a domain expert. Look at the top anomalies Isolation Forest flagged.

cols_to_show = ['Date', 'Location', 'Rainfall', 'MaxTemp', 'WindGustSpeed', 'Anomaly_Score']
top_5 = df.sort_values('Anomaly_Score').head(5)
print(top_5[cols_to_show].to_string(index=False))

# Output:
#       Date  Location  Rainfall  MaxTemp  WindGustSpeed  Anomaly_Score
# 2011-02-15    Darwin     132.6     24.8           98.0      -0.154950
# 2015-12-24    Darwin     122.8     27.0           80.0      -0.151782
# 2014-01-01   Woomera       0.0     46.8           74.0      -0.151524
# 2011-02-16    Darwin     367.6     25.6           83.0      -0.143283
# 2009-12-12    Darwin     141.2     26.1           94.0      -0.135914

Darwin in February 2011 is not a statistical artifact. That is Cyclone Carlos, which produced record-breaking precipitation across the Northern Territory, with Darwin International Airport recording its highest 24-hour rainfall total in history. Woomera with 46.8 degrees Celsius and wind gusts of 74 km/h is a documented extreme heat event in one of Australia's most arid regions.

The algorithm did not know any of this. It learned the typical joint distribution across 16 variables and flagged the points that were hardest to explain given that distribution. The fact that those points correspond to historically documented extreme events is as close to external validation as you can get without labeled ground truth.

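# Refit DBSCAN at eps=1.5: the loop above kept only the percentages, not the labels.
labels_15 = DBSCAN(eps=1.5, min_samples=4, n_jobs=-1).fit_predict(df_scaled)
dbscan_outliers = int(np.sum(labels_15 == -1))
iso_outliers = int(np.sum(df['Is_Anomaly']))
print(f"DBSCAN (eps=1.5)          -> {dbscan_outliers} outlier ({(dbscan_outliers/len(df))*100:.2f}%)")
print(f"Isolation Forest (c=0.05) -> {iso_outliers} outlier ({(iso_outliers/len(df))*100:.2f}%)")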

# Output:
# DBSCAN (eps=1.5)          -> 18166 outlier (12.49%)
# Isolation Forest (c=0.05) ->  7273 outlier  (5.00%)

The two methods flag totals that differ by roughly 11,000 rows on the same dataset, and the counts alone do not tell you how many of the flagged rows they share. Without ground truth labels you cannot say with certainty which one is right. What you can say is which one gives you a number you can stand behind in front of another human being.
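
Comparing totals hides how much the two sets of flagged rows actually overlap. If you want that number, a small helper like the one below does it. The name `compare_flags` and the eight-row masks are hypothetical; in the pipeline above you would pass the DBSCAN noise mask (labels equal to -1) and `df['Is_Anomaly']` instead.

```python
import numpy as np

def compare_flags(mask_a, mask_b):
    """Summarize how two boolean outlier masks over the same rows overlap."""
    mask_a = np.asarray(mask_a, dtype=bool)
    mask_b = np.asarray(mask_b, dtype=bool)
    both = int(np.sum(mask_a & mask_b))
    only_a = int(np.sum(mask_a & ~mask_b))
    only_b = int(np.sum(~mask_a & mask_b))
    union = both + only_a + only_b
    jaccard = both / union if union else 1.0   # two empty masks agree fully
    return {"both": both, "only_a": only_a, "only_b": only_b, "jaccard": jaccard}

# Hypothetical masks for eight rows.
dbscan_flags = np.array([1, 1, 1, 0, 0, 0, 1, 0], dtype=bool)
iso_flags = np.array([1, 0, 0, 0, 1, 0, 1, 0], dtype=bool)
summary = compare_flags(dbscan_flags, iso_flags)
print(summary)
```

A low Jaccard score would mean the methods are not just disagreeing on how many rows to flag but on which rows are unusual in the first place.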

Every anomaly detection method requires a human to make a threshold decision. DBSCAN forces you to make that decision in a geometric space that loses interpretability as dimensions grow. Isolation Forest forces you to make it in the space of contamination rates, which is a domain question with a domain answer.

In production you will always be asked to justify your choices. The question is whether you want to justify a geometric radius in a 16-dimensional standardized space or whether you want to justify what proportion of your data you believe to be genuinely anomalous.

One of those conversations is possible. The other is not.