Tinkwell: Anomaly Detector using Randomized PCA

#iot #programming #python #machinelearning

In earlier posts, I introduced a language-agnostic, firmware-less approach to IoT that sidesteps many traditional complications. I've since been building a reference implementation in C# called Tinkwell (a blend of Tinker and Well). The project is still evolving, but I'm convinced it has potential beyond its original scope.

In this post, I'll walk through how to integrate a Python-based anomaly detection system with Tinkwell using the most straightforward tool available: the Tinkwell Command Line Interface (tw). If you were writing a proper runner in Python, you'd communicate with the services directly over gRPC, but for a quick test or for prototyping an algorithm, the CLI is faster and easier.

To keep things focused, the code shown here is heavily abridged. For the full working version, head over to the GitHub repository.

Introduction to Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a statistical technique that transforms a set of possibly correlated variables into a smaller set of new variables called principal components, which are linearly uncorrelated. The first component captures as much of the variance in the data as it can, and each subsequent component captures as much of the remaining variance as possible while staying orthogonal to (and therefore uncorrelated with) the ones before it.

In even simpler terms: when your data has a bunch of features, PCA figures out which combinations of them explain the most variation. It then turns those into new features that summarize your data without all the extra clutter. This helps when you want to make the data easier to visualize or feed into a machine learning model without losing too much of what makes it useful.
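To make this concrete, here is a minimal sketch of PCA computed directly with NumPy's SVD instead of a library class. The data is synthetic and the variable names are made up for illustration: there are three features, and the third is simply the sum of the first two, so two components should carry essentially all of the variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: the third feature is fully determined by the first two.
a = rng.normal(size=500)
b = rng.normal(size=500)
X = np.column_stack([a, b, a + b])

# PCA via SVD of the mean-centered data.
X_centered = X - X.mean(axis=0)
_, singular_values, components = np.linalg.svd(X_centered, full_matrices=False)

# Fraction of the total variance captured by each principal component.
explained_variance_ratio = singular_values**2 / np.sum(singular_values**2)

# Coordinates of every sample in the new (principal component) basis.
scores = X_centered @ components.T

print("explained variance ratio:", np.round(explained_variance_ratio, 3))
print("covariance of the scores:\n", np.round(np.cov(scores, rowvar=False), 3))
```

The third ratio should come out at practically zero, and the covariance matrix of the scores should be diagonal, which is what "linearly uncorrelated" means in practice.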

Randomized PCA

Randomized PCA is based on randomized algorithms for matrix decomposition. Instead of directly computing the full covariance matrix and solving for eigenvectors (which can be slow for big datasets), it approximates the principal components using a random projection method.

This gives you an efficient way to estimate the top k principal components of a data matrix without needing to compute the full decomposition, especially useful when the matrix is huge or sparse.
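The idea can be sketched in a few lines of NumPy. This is a simplified illustration in the spirit of the randomized range finder described by Halko et al., not the code scikit-learn actually runs; the function name `randomized_pca` and its parameters (`oversample`, `n_iter`, `seed`) are hypothetical.

```python
import numpy as np

def randomized_pca(X, k, oversample=10, n_iter=4, seed=0):
    """Approximate the top-k principal directions of X (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    X_centered = X - X.mean(axis=0)
    n_features = X_centered.shape[1]
    sketch_size = min(k + oversample, n_features)

    # 1. Random projection: sample the range of the data with a Gaussian matrix.
    Y = X_centered @ rng.standard_normal((n_features, sketch_size))

    # 2. A few power iterations sharpen the estimate; QR keeps them stable.
    for _ in range(n_iter):
        Q, _unused = np.linalg.qr(Y)
        Z, _unused = np.linalg.qr(X_centered.T @ Q)
        Y = X_centered @ Z

    # 3. Orthonormal basis of the sampled range, then an exact SVD of a small matrix.
    Q, _unused = np.linalg.qr(Y)
    B = Q.T @ X_centered
    _, singular_values, Vt = np.linalg.svd(B, full_matrices=False)
    return Vt[:k], singular_values[:k]

# Synthetic test matrix: rank-3 structure plus a little noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 50)) + 0.01 * rng.normal(size=(500, 50))

components, approx_singular_values = randomized_pca(X, k=3)
exact_singular_values = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)[:3]

print("approximate:", np.round(approx_singular_values, 3))
print("exact:      ", np.round(exact_singular_values, 3))
```

The expensive decomposition is only ever applied to the small matrix `B`, whose row count depends on `k` rather than on the number of samples. That is where the savings come from on large inputs.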

Why PCA is a Good Fit for Anomaly Detection

PCA is particularly well-suited for anomaly detection in multivariate time series data (like our measures) for several reasons:

Dimensionality Reduction: Real-world systems often involve many correlated sensors or measures. PCA can reduce this high-dimensional data into a lower-dimensional subspace, making the anomaly detection task more manageable and computationally efficient.
Normal Behavior Modeling: PCA effectively captures the "normal" operating patterns and correlations within the data. When the system behaves normally, its data points will lie close to the subspace spanned by the principal components.
Reconstruction Error as Anomaly Score: Anomalies often represent deviations from these normal patterns. When an anomalous data point is projected onto the PCA subspace and then reconstructed back to the original dimension, the difference between the original and reconstructed point (the "reconstruction error") will be significantly larger than for normal data points. This error serves as an effective anomaly score.
Unsupervised Learning: PCA is an unsupervised learning technique, meaning it does not require labeled anomaly data for training. It learns the normal behavior from the available data, which is often abundant, and then flags anything that deviates significantly from this learned normal.
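
As a small illustration of the reconstruction error idea, the sketch below fits a one-component PCA (via NumPy's SVD) to synthetic two-feature data in which the second feature is roughly twice the first. The data, the helper name `reconstruction_error`, and the two test points are all assumptions made for the example. Note that the anomalous point stays within the range of each individual feature: only the relationship between the features is broken.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Normal" behavior: the second feature is roughly twice the first.
x1 = rng.uniform(0.0, 2.0, size=200)
x2 = 2.0 * x1 + rng.normal(scale=0.05, size=200)
training = np.column_stack([x1, x2])

# One-component PCA via SVD: the mean plus the dominant direction.
mean = training.mean(axis=0)
_, _, Vt = np.linalg.svd(training - mean, full_matrices=False)
direction = Vt[:1]  # shape (1, 2)

def reconstruction_error(sample):
    """Distance between a sample and its projection onto the PCA subspace."""
    centered = np.asarray(sample, dtype=float) - mean
    reconstructed = (centered @ direction.T) @ direction
    return float(np.linalg.norm(centered - reconstructed))

normal_error = reconstruction_error([0.5, 1.0])   # respects x2 = 2 * x1
anomaly_error = reconstruction_error([1.5, 1.0])  # each value in range, correlation broken

print(f"normal: {normal_error:.3f}, anomaly: {anomaly_error:.3f}")
```

A simple per-measure minimum/maximum check would accept both points; the reconstruction error is what separates them.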

Tinkwell Configuration

This is the ensemble file we need to run this example:

compose service orchestrator "Tinkwell.Orchestrator.dll"
compose service store "Tinkwell.Store.dll"
compose service events "Tinkwell.EventsGateway.dll"
compose agent reducer "Tinkwell.Reducer.dll" { path: "./measures.twm" }
compose agent reactor "Tinkwell.Reactor.dll" { path: "./measures.twm" }
compose agent watchdog "Tinkwell.Watchdog.dll"

This is the configuration for the measures we're going to use:

measure voltage {
    type: "ElectricPotential"
    unit: "Volt"
    expression: "5"
    minimum: 1
    maximum: 50
}

measure current {
    type: "ElectricCurrent"
    unit: "Ampere"
    expression: "2"
    minimum: 0
    maximum: 10
}

measure power {
    type: "Power"
    unit: "Watt"
    expression: "voltage * current"
    minimum: 0
    maximum: 500

    signal high_load {
      when: "power > 100"
    }
}

The Python Code

This example consists of three main scripts:

anomaly_detector.py: detects anomalies in the input data and saves a CSV file with the inputs and its findings. It uses tw measures subscribe to listen for changes and processes the new values as they arrive.
feed_synthetic_data.py: generates synthetic data to test the anomaly detector. It uses tw measures write to update the Store.
plot_measures.py: plots the results saved in the CSV file exported by anomaly_detector.py.
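
Driving a CLI from Python usually comes down to spawning the process and reading its standard output line by line. The sketch below shows that pattern with a hypothetical helper, `stream_lines`. The exact arguments and output format of tw measures subscribe are not covered in this post, so the example runs a stand-in command that just prints two lines; parsing the real output is left to the scripts in the repository.

```python
import subprocess
import sys

def stream_lines(command):
    """Run a command and yield its non-empty stdout lines as they are produced."""
    with subprocess.Popen(command, stdout=subprocess.PIPE, text=True, bufsize=1) as process:
        for line in process.stdout:
            line = line.strip()
            if line:
                yield line

# Stand-in for the real subscription command (an assumption for this demo only):
# a child Python process that prints two lines and exits.
stand_in_command = [sys.executable, "-c", "print('sample line 1'); print('sample line 2')"]

received = list(stream_lines(stand_in_command))
print(received)
```

Because `stream_lines` is a generator, a long-running subscription can be consumed in a plain `for` loop without waiting for the process to exit.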

It is not shown in this example, but you could feed the anomaly detection results back into Tinkwell using tw events publish.

There are many parameters to configure, so please read the README.md file carefully.

The anomaly detection process in this script leverages the reconstruction error property of PCA:

Data Normalization: Each incoming measure value is first normalized using its pre-determined minimum and maximum values. This ensures that all measures contribute equally to the PCA, regardless of their original scale.
PCA Model Training:
- The script collects a PCA_BUFFER_SIZE number of normalized data samples. Each sample is a vector representing the current values of all subscribed measures.
- A Randomized PCA model is then trained on this buffer of "normal" data.
- After training, the model can project new data points onto its principal components and reconstruct them.
Reconstruction Error Calculation: For each data sample in the training buffer, its reconstruction error is calculated. This error is the Euclidean distance (or L2 norm) between the original data point and its reconstructed version after being projected onto the PCA subspace and then inverse-transformed.
Anomaly Threshold Determination: A statistical threshold for anomaly detection is established from the distribution of these reconstruction errors. Specifically, the ANOMALY_THRESHOLD_PERCENTILE (e.g., 99th percentile) of the reconstruction errors from the training data is chosen as the threshold. This means that ANOMALY_THRESHOLD_PERCENTILE% of the "normal" training data will have a reconstruction error below this threshold.
Real-time Anomaly Detection:
- As new measure data arrives, it is normalized and formed into a new sample vector.
- This new sample is then passed through the trained PCA model to calculate its reconstruction error.
- If the calculated reconstruction error for the new sample exceeds the pre-determined anomaly threshold, the sample is flagged as an anomaly.
- When an anomaly is detected, the script prints detailed information, including the raw and normalized values, the reconstruction error, and the anomaly threshold, to help in understanding the nature of the deviation.
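
The normalization step is not part of the detector class, so here is a minimal sketch of it, assuming plain min-max scaling. The names `MEASURE_RANGES`, `normalize` and `build_sample` are illustrative and not taken from the repository; the ranges are the minimum and maximum values declared in the measures configuration.

```python
# Ranges copied from the measure definitions shown earlier.
MEASURE_RANGES = {
    "voltage": (1.0, 50.0),
    "current": (0.0, 10.0),
    "power": (0.0, 500.0),
}

def normalize(name, value):
    """Min-max scale a raw value using the measure's declared range."""
    minimum, maximum = MEASURE_RANGES[name]
    return (value - minimum) / (maximum - minimum)

def build_sample(raw_values):
    """Turn a dict of raw readings into a vector with a fixed feature order."""
    return [normalize(name, raw_values[name]) for name in MEASURE_RANGES]

# The constant expressions from the configuration: 5 V, 2 A, 5 * 2 = 10 W.
sample = build_sample({"voltage": 5.0, "current": 2.0, "power": 10.0})
print([round(x, 4) for x in sample])  # prints [0.0816, 0.2, 0.02]
```

In this sketch, values outside the declared range simply map outside [0, 1]; whether to clip them is a design choice the post does not cover.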

The code for the PCA detector is fairly small:

import numpy as np
from sklearn.decomposition import PCA

class PcaAnomalyDetector:
    def __init__(self, n_components, anomaly_threshold_percentile):
        self.n_components = n_components
        self.anomaly_threshold_percentile = anomaly_threshold_percentile
        self.pca_model = None
        self.anomaly_threshold = None

    def train(self, data_buffer):
        # Convert once; this also makes the emptiness check safe for NumPy arrays.
        data = np.asarray(data_buffer, dtype=float)
        if data.size == 0:
            raise ValueError("Data buffer cannot be empty for training.")

        self.pca_model = PCA(n_components=self.n_components, svd_solver='randomized')
        self.pca_model.fit(data)

        # Project onto the principal subspace and back, then take the per-sample L2 error.
        reconstructed_data = self.pca_model.inverse_transform(self.pca_model.transform(data))
        reconstruction_errors = np.linalg.norm(data - reconstructed_data, axis=1)

        # The chosen percentile of the training errors becomes the anomaly threshold.
        self.anomaly_threshold = np.percentile(reconstruction_errors, self.anomaly_threshold_percentile)
        return self.anomaly_threshold

    def detect(self, sample):
        if self.pca_model is None or self.anomaly_threshold is None:
            raise RuntimeError("PCA model not trained. Call train() first.")

        current_sample_np = np.array([sample])
        transformed_sample = self.pca_model.transform(current_sample_np)
        reconstructed_sample = self.pca_model.inverse_transform(transformed_sample)
        current_reconstruction_error = np.linalg.norm(current_sample_np - reconstructed_sample)

        is_anomaly = current_reconstruction_error > self.anomaly_threshold
        return is_anomaly, current_reconstruction_error, reconstructed_sample[0]

    def is_trained(self):
        return self.pca_model is not None and self.anomaly_threshold is not None

To run this example (after you have completed the initial setup):

Start Tinkwell and wait for the initialization to complete.
In a new terminal run python feed_synthetic_data.py to start pushing our synthetic test data into the system.
In a new terminal run python anomaly_detector.py to start analyzing the data.
In a new terminal run python plot_measures.py to visualize the data as they change and to see where the anomalies have been detected.

The end result will look more or less like this:

Note that the first 120 samples have been used to train the model.

Tuning the Parameters

Number of components to keep: too few and you'll miss key patterns in the data; too many and you bring back noise and irrelevant details. You can set this value by hand, or keep adding components until they explain, say, 95% of the variance (sklearn can do this for you!).
Anomaly threshold percentile: tune this by checking how well it separates known normal data from known abnormal data. Start somewhere around 95–99% and adjust.
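
One caveat when selecting components by explained variance: scikit-learn documents a fractional n_components for svd_solver='full', not for the randomized solver. A possible workaround, sketched below on synthetic data, is to let the full solver choose the count once and then reuse that number with the randomized solver. The data and variable names here are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 5 observed features driven by 2 hidden factors plus noise.
latent = rng.normal(size=(300, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.05 * rng.normal(size=(300, 5))

# Step 1: let the full solver pick the number of components for 95% variance.
selector = PCA(n_components=0.95, svd_solver="full").fit(X)
k = int(selector.n_components_)

# Step 2: reuse that count with the randomized solver.
model = PCA(n_components=k, svd_solver="randomized", random_state=0).fit(X)

print("components kept:", k)
print("variance explained:", round(float(model.explained_variance_ratio_.sum()), 4))
```

The full decomposition in step 1 only has to run once on a representative training buffer, so its cost is usually negligible compared with the ongoing detection.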

References

PCA Basics:
- Jolliffe, I. T. (2002). Principal Component Analysis. Springer.
Randomized PCA:
- Halko, N., Martinsson, P. G., & Tropp, J. A. (2011). Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2), 217-288.
PCA for Anomaly Detection:
- Shyu, M. L., Chen, S. C., Sarinnapakorn, M., & Chang, L. (2003). A novel anomaly detection scheme based on principal component classifier. Proceedings of the IEEE International Conference on Data Mining (ICDM), 2003, 347-354.
- Wang, S., & Ma, J. (2018). Anomaly detection based on PCA and reconstruction error. Journal of Physics: Conference Series, 1087(6), 062029.