<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Chloe Caron</title>
    <description>The latest articles on DEV Community by Chloe Caron (@chloecaroneng).</description>
    <link>https://dev.to/chloecaroneng</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1253431%2F6786d2da-40e1-4a3e-a6d5-4bdb61fb59f3.jpg</url>
      <title>DEV Community: Chloe Caron</title>
      <link>https://dev.to/chloecaroneng</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/chloecaroneng"/>
    <language>en</language>
    <item>
      <title>How to build an Anomaly Detector using BigQuery</title>
      <dc:creator>Chloe Caron</dc:creator>
      <pubDate>Thu, 18 Jan 2024 12:23:52 +0000</pubDate>
      <link>https://dev.to/chloecaroneng/how-to-build-an-anomaly-detector-using-bigquery-274e</link>
      <guid>https://dev.to/chloecaroneng/how-to-build-an-anomaly-detector-using-bigquery-274e</guid>
      <description>&lt;p&gt;Bad data quality can arise in any type of data, be it numerical, textual or other. As we saw in the &lt;a href="https://blog.theodo.com/2024/01/anomaly-detection-llm/" rel="noopener noreferrer"&gt;last article&lt;/a&gt; of this series, LLMs like OpenAI are quite effective at detecting anomalies in textual data. However, the OpenAI anomaly detector really struggled with numerical data, reaching an accuracy of 68% even after applying multiple forms of prompt engineering (compared to ~100% for textual data). &lt;/p&gt;

&lt;p&gt;This is to be expected since OpenAI is a large &lt;strong&gt;language&lt;/strong&gt; model and isn’t trained on mathematical concepts. What to do? Numerical data makes up a non-negligible share of global data. Although exact statistics are hard to come by, it would not be unreasonable to say that more of the world’s data is numerical than textual: scientific, financial, market and consumer, sensor, and healthcare data are just a few of the endless sources.&lt;/p&gt;

&lt;p&gt;So if we can’t use LLMs to build our anomaly detector, what do we use? &lt;a href="https://cloud.google.com/bigquery/docs/bqml-introduction" rel="noopener noreferrer"&gt;BigQuery ML&lt;/a&gt; is one solution to our problem! BigQuery ML has an inbuilt &lt;a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-detect-anomalies" rel="noopener noreferrer"&gt;‘anomaly detector’&lt;/a&gt; function which we can use to identify anomalies in our data. Let’s give it a shot and work through this together.&lt;/p&gt;

&lt;h2&gt;Part 1: Choosing and preparing a model&lt;/h2&gt;

&lt;p&gt;Before we try to detect any anomalies in our data, we need to create a model for our anomaly detector to compare against, i.e. does your dataset match what the model predicted?&lt;/p&gt;

&lt;p&gt;Choosing the correct model depends on your situation. To help understand the differences between the types of models and choose the right one, I &lt;a href="https://x.com/ChloeCaronEng/status/1742169481080680795?s=20" rel="noopener noreferrer"&gt;built this decision tree&lt;/a&gt;: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.theodo.com%2Fstatic%2F00bdab1954040f0590c082f4d05318ca%2F536c7%2Fmodel-decision-tree.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.theodo.com%2Fstatic%2F00bdab1954040f0590c082f4d05318ca%2F536c7%2Fmodel-decision-tree.png" alt="Decision tree to select the appropriate model"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you remember from my &lt;a href="https://blog.theodo.com/2024/01/anomaly-detection-llm/" rel="noopener noreferrer"&gt;previous article&lt;/a&gt;, we are using data from the &lt;strong&gt;&lt;a href="https://www.kaggle.com/datasets/albertovidalrod/electricity-consumption-uk-20092022" rel="noopener noreferrer"&gt;Electricity Consumption UK 2009-2023 dataset&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The data we have is a time series and, for our case scenario, we’ll keep it simple and not look at cross dependencies between variables (although this would be an interesting experiment to try). Our logical choice is, therefore, to use &lt;strong&gt;ARIMA_PLUS&lt;/strong&gt; as our model type. If you would like to know more about the ARIMA_PLUS model, don’t hesitate to have a look at &lt;a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create-time-series" rel="noopener noreferrer"&gt;BigQuery’s documentation&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;Going back to our dataset, you’ll notice that we have a lot of data columns, i.e. for a single timestamp we are collecting multiple types of data (e.g. wind generation, solar generation). These can’t all fit into a single model: an ARIMA_PLUS model can only point to a single data column and be trained on that column. We’ll therefore need to create a separate model for each data column that we have. &lt;/p&gt;
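Concretely, a model for one column looks like the following sketch; the project, dataset and table names, as well as the `wind` and `time_stamp` columns, are placeholders you would swap for your own:

```sql
-- Sketch: one ARIMA_PLUS model trained on a single data column.
-- `project-name`, `dataset-name`, `wind` and `time_stamp` are placeholders.
CREATE OR REPLACE MODEL `project-name.dataset-name.data_arima_model_wind`
OPTIONS(
  MODEL_TYPE = 'ARIMA_PLUS',
  TIME_SERIES_DATA_COL = 'wind',
  TIME_SERIES_TIMESTAMP_COL = 'time_stamp'
) AS
SELECT time_stamp, wind
FROM `project-name.dataset-name.table-with-data`;
```

The looping script below simply generates one such statement per data column.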

&lt;p&gt;If you haven’t worked with BigQuery before, have a look at some of the tutorials online. &lt;a href="https://blog.coupler.io/bigquery-tutorial/" rel="noopener noreferrer"&gt;This&lt;/a&gt; one is a relatively good start; to follow these steps you only need to know how to upload data from a CSV (or another format) and how to run a query.&lt;/p&gt;

&lt;p&gt;Okay, great! So we know that we’ll be using the ARIMA_PLUS model, we know what data we want to use and we know that we’ll need a model per column. Let’s see this in action as a query.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;DECLARE&lt;/span&gt; &lt;span class="n"&gt;column_names&lt;/span&gt; &lt;span class="n"&gt;ARRAY&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;DECLARE&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;DECLARE&lt;/span&gt; &lt;span class="n"&gt;query_string&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Get column names into an array&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;column_names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;ARRAY_AGG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;COLUMN_NAME&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;`project-name.dataset-name.INFORMATION_SCHEMA.COLUMNS`&lt;/span&gt;
  &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;TABLE_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'name_of_your_table_with_data'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;COLUMN_NAME&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s1"&gt;'name_of_your_column_with_your_timestamp'&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Loop through columns and create ARIMA_PLUS models&lt;/span&gt;
&lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="k"&gt;row&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;UNNEST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;column_names&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;DO&lt;/span&gt;
    &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;column_name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;query_string&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt;
      &lt;span class="s1"&gt;'CREATE OR REPLACE MODEL `dataset-name.data_arima_model_'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;'` '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
      &lt;span class="s1"&gt;'OPTIONS(MODEL_TYPE="ARIMA_PLUS", TIME_SERIES_DATA_COL="'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;'", TIME_SERIES_TIMESTAMP_COL="name_of_your_column_with_your_timestamp") AS '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
      &lt;span class="s1"&gt;'SELECT name_of_your_column_with_your_timestamp, '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;' FROM `project-name.dataset-name.table-with-data`'&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;EXECUTE&lt;/span&gt; &lt;span class="k"&gt;IMMEDIATE&lt;/span&gt; &lt;span class="n"&gt;query_string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="k"&gt;FOR&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s break this down. In the first section, we are creating an array of the column names (remember to replace &lt;code&gt;project-name&lt;/code&gt;, &lt;code&gt;dataset-name&lt;/code&gt;, &lt;code&gt;name_of_your_table_with_data&lt;/code&gt;, &lt;code&gt;table-with-data&lt;/code&gt; and &lt;code&gt;name_of_your_column_with_your_timestamp&lt;/code&gt; with your own values). This array is used in the next section where we loop through them and create an individual ARIMA_PLUS model for each data column. &lt;/p&gt;

&lt;p&gt;Great, we have our models! Once the query has run successfully, you’ll be able to see the models in the Explorer tab within the GCP console:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.theodo.com%2Fstatic%2F03aa6493e3b94642d7c44ee5985038ab%2F2e237%2Fmodels-in-bigquery.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.theodo.com%2Fstatic%2F03aa6493e3b94642d7c44ee5985038ab%2F2e237%2Fmodels-in-bigquery.png" alt="Results of creating the models for the dataset"&gt;&lt;/a&gt;&lt;/p&gt;
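If you prefer to check from SQL rather than the console, BigQuery ML can also report on a fitted ARIMA_PLUS model via `ML.ARIMA_EVALUATE`; the model name here is hypothetical, following the `data_arima_model_<column>` naming scheme used above:

```sql
-- Inspect one of the generated models (hypothetical name following
-- the `data_arima_model_<column>` convention from the creation script).
SELECT *
FROM ML.ARIMA_EVALUATE(MODEL `dataset-name.data_arima_model_wind`);
```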

&lt;h3&gt;(Optional) Formatting your timestamp&lt;/h3&gt;

&lt;p&gt;If you use the same dataset that I did (&lt;a href="https://www.kaggle.com/datasets/albertovidalrod/electricity-consumption-uk-20092022" rel="noopener noreferrer"&gt;Electricity Consumption UK 2009-2023 dataset&lt;/a&gt;), you’ll notice that you don’t quite get a proper timestamp. Instead you have &lt;code&gt;SETTLEMENT_DATE&lt;/code&gt; and &lt;code&gt;SETTLEMENT_PERIOD&lt;/code&gt; to determine which half-hour period of the day you are in. As is, this will cause an error when you run your model, since the &lt;code&gt;TIME_SERIES_TIMESTAMP_COL&lt;/code&gt; that you try to use will not be a timestamp. &lt;/p&gt;

&lt;p&gt;But, no worries, I have a script to solve that! &lt;/p&gt;

&lt;p&gt;Once you’ve imported your data, run the following script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Step 1: Create a new table with the desired structure and transformed data&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="nv"&gt;`project-name.dataset-name.&amp;lt;table-with-data&amp;gt;_new`&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;EXCEPT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SETTLEMENT_DATE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SETTLEMENT_PERIOD&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;-- Exclude the columns you want to remove&lt;/span&gt;
  &lt;span class="nb"&gt;DATETIME&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CONCAT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;FORMAT_DATE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'%F'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SETTLEMENT_DATE&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;'T'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="k"&gt;CASE&lt;/span&gt; 
      &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;SETTLEMENT_PERIOD&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;48&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'00:00:00'&lt;/span&gt;
      &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;SETTLEMENT_PERIOD&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="n"&gt;CONCAT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'0'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;FLOOR&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;SETTLEMENT_PERIOD&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;':'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;LPAD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;MOD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SETTLEMENT_PERIOD&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'0'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;':00'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="n"&gt;CONCAT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;FLOOR&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;SETTLEMENT_PERIOD&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;':'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;LPAD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;MOD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SETTLEMENT_PERIOD&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'0'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;':00'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;time_stamp&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;`project-name.dataset-name.table-with-data`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Step 2: Drop the original table&lt;/span&gt;
&lt;span class="k"&gt;DROP&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="nv"&gt;`project-name.dataset-name.table-with-data`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Step 3: Rename the new table to the original table name&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="nv"&gt;`project-name.dataset-name.&amp;lt;table-with-data&amp;gt;_new`&lt;/span&gt; &lt;span class="k"&gt;RENAME&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="nv"&gt;`table-with-data`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is where the magic of SQL comes in so I would highly recommend looking over this script to make sure you understand each stage.&lt;/p&gt;
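To see what the CASE expression in the script produces, here is a standalone spot-check over a few settlement periods. It uses `DIV` for integer division, which is equivalent to the FLOOR-of-a-division in the script, and folds the two padding branches into one `LPAD`:

```sql
-- Spot-check the period-to-time mapping for a few settlement periods.
-- Period 1 maps to '00:30:00', period 3 to '01:30:00',
-- and period 48 wraps around to '00:00:00'.
SELECT
  p AS settlement_period,
  CASE
    WHEN p = 48 THEN '00:00:00'
    ELSE CONCAT(
      LPAD(CAST(DIV(p, 2) AS STRING), 2, '0'), ':',
      LPAD(CAST(MOD(p, 2) * 30 AS STRING), 2, '0'), ':00')
  END AS derived_time
FROM UNNEST([1, 3, 48]) AS p;
```

Note that each half-hour period is mapped to the time at which it ends, which is why period 48 (23:30–00:00) maps to midnight.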

&lt;h2&gt;Part 2: Setting up our results table&lt;/h2&gt;

&lt;p&gt;Now that our models have been generated, we are nearly ready to run our anomaly detector. But what’s the point in running the experiment if we can’t see the results? And what is the easiest and most pragmatic way to visualise them? We’re in the data world, so we could talk about all sorts of beautiful visualisations, but let’s keep it simple by storing our results in a table. First, let’s prepare the table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;DECLARE&lt;/span&gt; &lt;span class="n"&gt;dynamic_sql&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;DECLARE&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- ... the code to set the column_names, seen above, is needed again&lt;/span&gt;

&lt;span class="c1"&gt;-- Start building the dynamic SQL to create the table of results&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;dynamic_sql&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'CREATE OR REPLACE TABLE `project-name.dataset-name.anomaly_results` (time_stamp TIMESTAMP, '&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Loop through the column names&lt;/span&gt;
&lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="k"&gt;row&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;UNNEST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;column_names&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;DO&lt;/span&gt;
    &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;column_name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;dynamic_sql&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dynamic_sql&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;' FLOAT64 DEFAULT NULL, is_anomaly_'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;' BOOL DEFAULT FALSE, prob_'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;' FLOAT64 DEFAULT NULL, '&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="k"&gt;FOR&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Remove the trailing comma and close the SQL statement&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;dynamic_sql&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SUBSTR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dynamic_sql&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;LENGTH&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dynamic_sql&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;')'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Execute the dynamic SQL to create the table to store results&lt;/span&gt;
&lt;span class="k"&gt;EXECUTE&lt;/span&gt; &lt;span class="k"&gt;IMMEDIATE&lt;/span&gt; &lt;span class="n"&gt;dynamic_sql&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this snippet of code, we are creating an empty table with three columns for each of the original data columns: the original data (float64), whether or not the value is an anomaly &lt;code&gt;is_anomaly_&amp;lt;column-name&amp;gt;&lt;/code&gt; (bool), and the probability of it being an anomaly &lt;code&gt;prob_&amp;lt;column-name&amp;gt;&lt;/code&gt; (float64), which will come in useful for debugging. The default values are important here. As we run our anomaly detector against each model in turn, we will incrementally add values to the rows of our results table. Since we’ll be filling in each row in a piecewise manner, we need those default values as placeholders so that creating a new row with partial data doesn’t raise an error.&lt;/p&gt;
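For illustration, with a single hypothetical column `wind`, the dynamic SQL assembled above would expand to something like:

```sql
-- What the generated statement looks like for one hypothetical column `wind`.
CREATE OR REPLACE TABLE `project-name.dataset-name.anomaly_results` (
  time_stamp TIMESTAMP,
  wind FLOAT64 DEFAULT NULL,
  is_anomaly_wind BOOL DEFAULT FALSE,
  prob_wind FLOAT64 DEFAULT NULL
)
```

With several data columns, the loop simply appends another trio of `&lt;column&gt;`, `is_anomaly_&lt;column&gt;` and `prob_&lt;column&gt;` definitions for each one.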

&lt;h2&gt;Part 3: Running the anomaly detector&lt;/h2&gt;

&lt;p&gt;Now that we have the model and we have a place to store our results, we can run our anomaly detector.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;DECLARE&lt;/span&gt; &lt;span class="n"&gt;anomaly_column_name&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;DECLARE&lt;/span&gt; &lt;span class="n"&gt;anomaly_prob_column_name&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;DECLARE&lt;/span&gt; &lt;span class="n"&gt;model_name&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- ... the code to set the column_names, seen above, is needed again&lt;/span&gt;

&lt;span class="c1"&gt;-- Find anomalies in the data&lt;/span&gt;
&lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="k"&gt;row&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;UNNEST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;column_names&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;DO&lt;/span&gt;
  &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;column_name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;anomaly_column_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'is_anomaly_'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;model_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'dataset-name.data_arima_model_'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;anomaly_prob_column_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'prob_'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;query_string&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt;
      &lt;span class="s1"&gt;'MERGE `project-name.dataset-name.anomaly_results` AS target '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
      &lt;span class="s1"&gt;'USING ('&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
      &lt;span class="s1"&gt;'  SELECT TIMESTAMP AS time_stamp, '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;', is_anomaly AS '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;anomaly_column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;', anomaly_probability AS '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;anomaly_prob_column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;  
      &lt;span class="s1"&gt;'  FROM ML.DETECT_ANOMALIES('&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
      &lt;span class="s1"&gt;'    MODEL `'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;model_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;'`, STRUCT(0.95 AS anomaly_prob_threshold)'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
      &lt;span class="s1"&gt;'  )'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
      &lt;span class="s1"&gt;') AS source '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
      &lt;span class="s1"&gt;'ON target.time_stamp = source.time_stamp '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
      &lt;span class="s1"&gt;'WHEN MATCHED THEN '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
      &lt;span class="s1"&gt;'  UPDATE SET '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
      &lt;span class="s1"&gt;'    '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;' = source.'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;', '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
      &lt;span class="s1"&gt;'    '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;anomaly_column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;' = source.'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;anomaly_column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;', '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
      &lt;span class="s1"&gt;'    '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;anomaly_prob_column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;' = source.'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;anomaly_prob_column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;' '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
      &lt;span class="s1"&gt;'WHEN NOT MATCHED THEN '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
      &lt;span class="s1"&gt;'  INSERT '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
      &lt;span class="s1"&gt;'    ('&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;', '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;anomaly_column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;', '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;anomaly_prob_column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;', time_stamp) '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
      &lt;span class="s1"&gt;'  VALUES '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
      &lt;span class="s1"&gt;'    (source.'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;', source.'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;anomaly_column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;', source.'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;anomaly_prob_column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;', source.time_stamp)'&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="c1"&gt;-- Execute the query&lt;/span&gt;
  &lt;span class="k"&gt;EXECUTE&lt;/span&gt; &lt;span class="k"&gt;IMMEDIATE&lt;/span&gt; &lt;span class="n"&gt;query_string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="k"&gt;FOR&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This looks like a very complicated statement, so let’s break it down again. The first part is &lt;code&gt;MERGE&lt;/code&gt;, which will insert, update or delete a row depending on which &lt;code&gt;WHEN&lt;/code&gt; clause matches. Here we are checking whether the &lt;code&gt;time_stamp&lt;/code&gt; in our results table matches the one in the data we are processing. The &lt;code&gt;time_stamp&lt;/code&gt; was the best choice for deciding between an update and an insert since it is unique in our case.&lt;/p&gt;

&lt;p&gt;Now that we’ve looked at the &lt;code&gt;MERGE&lt;/code&gt; and &lt;code&gt;WHEN&lt;/code&gt; statements, let’s look at the &lt;code&gt;USING&lt;/code&gt; component, which is where we actually run our anomaly detector. I won’t focus on the query syntax as it closely resembles standard SQL. The main part of the statement is &lt;code&gt;ML.DETECT_ANOMALIES&lt;/code&gt;, where we specify the model (which we created earlier) and the probability threshold. This threshold determines what the model will mark as an anomaly: when the script runs, each piece of data is assigned a probability of being an anomaly. With a low threshold, we are being more conservative and risk-averse, classifying data as an anomaly even when the system is only partially sure. With a much higher threshold, we only classify data as an anomaly when the system is near certain (i.e. assigns a high probability of it being an anomaly).&lt;/p&gt;
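The effect of the threshold can be sketched in a few lines of Python (illustrative only; the function name is mine, but the field name `anomaly_probability` matches the column `ML.DETECT_ANOMALIES` returns):

```python
def flag_anomalies(rows, threshold=0.95):
    """Mark a row as an anomaly when its anomaly probability exceeds
    the threshold, mirroring the role of `anomaly_prob_threshold`."""
    return [
        {**row, "is_anomaly": row["anomaly_probability"] > threshold}
        for row in rows
    ]
```

Lowering the threshold flags more rows; raising it flags only the near-certain ones.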

&lt;p&gt;And that’s it! You’ve managed to run your anomaly detector. 🥳&lt;/p&gt;

&lt;h2&gt;
  
  
  How good is BigQuery as an anomaly detector?
&lt;/h2&gt;

&lt;p&gt;After running the scripts above, you should see that the &lt;code&gt;anomaly_results&lt;/code&gt; table has been created with our results. It will look a little like this: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.theodo.com%2Fstatic%2Fb885193e0a6c71d8b87f5e32f4653bde%2F536c7%2Fanomaly-detection-results.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.theodo.com%2Fstatic%2Fb885193e0a6c71d8b87f5e32f4653bde%2F536c7%2Fanomaly-detection-results.png" alt="Results of the anomaly detection"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For this test, I used 28 lines of data and introduced 5 anomalies into the data that I imported. What you’ll notice is that BigQuery is very good at picking up the intended anomalies. However, we have a different problem this time. Instead of having issues with picking up the anomalies (which we saw when using an LLM as an anomaly detector in &lt;a href="https://blog.theodo.com/2024/01/anomaly-detection-llm/" rel="noopener noreferrer"&gt;this article&lt;/a&gt;), BigQuery has a much bigger issue with false positives. In this first experiment, there were 21 false positives! False positives create a different kind of issue. Although less severe than missing an actual anomaly (which would lead to bad data), false positives can trigger incorrect alerts and require a manual verification process, which would hinder an automated system. So what can we do?&lt;/p&gt;
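To make "catches everything but is noisy" concrete, here is the precision/recall arithmetic for this first run (28 rows, all 5 intended anomalies caught, 21 false positives); a small sketch with a helper name of my choosing:

```python
def detection_metrics(true_positives, false_positives, false_negatives):
    """Precision: share of flagged rows that are real anomalies.
    Recall: share of real anomalies that were flagged."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall

# First experiment: 5 true positives, 21 false positives, 0 missed.
precision, recall = detection_metrics(5, 21, 0)
```

Recall is perfect (1.0), but precision is only 5/26, roughly 19%, which is exactly the false-positive problem described above.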

&lt;h3&gt;
  
  
  1) Increasing the anomaly probability threshold
&lt;/h3&gt;

&lt;p&gt;When setting up the anomaly detector, there was a short introduction to the probability threshold, i.e. the point at which we consider a data point an anomaly. By raising this threshold, we are essentially saying that we want the system to be more certain a data point is an anomaly before marking it as such. However, there is a warning associated with this: the higher you raise the threshold, the higher the risk that you miss an anomaly the system is less certain about.&lt;/p&gt;

&lt;p&gt;In practice, when raising the threshold from 0.95 to 0.99999, I was pleasantly surprised to find that BigQuery still picked up 100% of the intended anomalies. The number of false positives dropped to 6 (-71%), which is a big step in the right direction! &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.theodo.com%2Fstatic%2F4963360242718ec098d61ce36a1f0c88%2F536c7%2Fincreased-threshold-results.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.theodo.com%2Fstatic%2F4963360242718ec098d61ce36a1f0c88%2F536c7%2Fincreased-threshold-results.png" alt="Effect of increasing the anomaly probability threshold"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2) Using a separate and more extensive training set
&lt;/h3&gt;

&lt;p&gt;The bigger the training set, the better it performs, right? Not quite in this case. Going with this initial thought process, I trialled using a separate and more extensive training data set to produce the model. &lt;/p&gt;

&lt;p&gt;Looking at our code above, it is relatively straightforward to change the dataset the model is trained on: you just have to update &lt;code&gt;project-name.dataset-name.table-with-data&lt;/code&gt; to point to the table with the new data you would like to train the model on. &lt;/p&gt;

&lt;p&gt;Next, you’ll also have to update the anomaly detector. As you may have noticed, in the code above we don’t specify the data on which the anomaly detection is performed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="s1"&gt;'SELECT TIMESTAMP AS time_stamp, '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;', is_anomaly AS '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;anomaly_column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;', anomaly_probability AS '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;anomaly_prob_column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;  
&lt;span class="s1"&gt;'  FROM ML.DETECT_ANOMALIES('&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
&lt;span class="s1"&gt;'    MODEL `'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;model_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;'`, STRUCT(0.95 AS anomaly_prob_threshold)'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
&lt;span class="s1"&gt;'  )'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is because, implicitly, BigQuery detects anomalies on the same dataset that was used to train the model. To detect anomalies on a new dataset, you’ll have to modify the line above to have:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="s1"&gt;'SELECT TIMESTAMP AS time_stamp, '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;', is_anomaly AS '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;anomaly_column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;', anomaly_probability AS '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;anomaly_prob_column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;  
&lt;span class="s1"&gt;'  FROM ML.DETECT_ANOMALIES('&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
&lt;span class="s1"&gt;'    MODEL `'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;model_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;'`, STRUCT(0.95 AS anomaly_prob_threshold), (SELECT * FROM `dataset-name.table-with-data-to-analyse`)'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
&lt;span class="s1"&gt;'  )'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;where &lt;code&gt;table-with-data-to-analyse&lt;/code&gt; is the name of the table which has the anomalies you would like to detect. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Word of warning:&lt;/strong&gt; the timestamp of your data is important when doing this! The data you want to analyse must have a timestamp that is in the future compared to your training data. This makes sense, as you use past data to predict future trends; it is a design choice made by the BigQuery ML team. If you are curious about the reasoning behind this, &lt;a href="https://www.googlecloudcommunity.com/gc/Data-Analytics/ML-Detect-Anomalies-returns-null-values-when-using-data-from-a/m-p/694853" rel="noopener noreferrer"&gt;here&lt;/a&gt; is a discussion from within the Google Cloud Community space with more details. &lt;/p&gt;
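A quick sanity check before running the detector can save you a confusing empty result. This is a sketch of my own (field name assumed to match the `time_stamp` column used throughout):

```python
def check_timestamp_ordering(training_rows, analysis_rows, ts_field="time_stamp"):
    """Return True when every row to analyse is strictly later than all
    of the training data, which is what detecting anomalies on a
    separate dataset expects."""
    latest_training = max(row[ts_field] for row in training_rows)
    earliest_analysis = min(row[ts_field] for row in analysis_rows)
    return earliest_analysis > latest_training
```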

&lt;p&gt;Doing this did not improve the number of false positives. In fact, it actually increased the number of false positives from 6 (achieved by raising the threshold) to 20! There are lots of potential reasons behind this. Overfitting is a common cause and reflects the importance of tuning your model. Before jumping into this, let’s have a look at the current state of our results. Once again, all the intended anomalies were detected, so let’s just look at the number of false positives:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.theodo.com%2Fstatic%2Fa718ae1a41cc26c6db3a9fc31802b162%2F536c7%2Fusing-training-set-results.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.theodo.com%2Fstatic%2Fa718ae1a41cc26c6db3a9fc31802b162%2F536c7%2Fusing-training-set-results.png" alt="Effect of using training data on the number of false positives"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3) Tuning our model using Non-Seasonal Order terms
&lt;/h3&gt;

&lt;p&gt;The non-seasonal ARIMA model we are using is a combination of differencing, autoregression and a moving average model. When running the model above, by default, we are using &lt;code&gt;AUTO_ARIMA = True&lt;/code&gt;. What this means is that the training automatically finds the best non-seasonal order values. There are three values that need to be set:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;p&lt;/strong&gt; - the order of the autoregressive part, a value between 0 and 5&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;d&lt;/strong&gt; - the degree of differencing, a value between 0 and 2&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;q&lt;/strong&gt; - the order of the moving average part, a value between 0 and 5&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To understand more about these values, don’t hesitate to have a look at &lt;a href="https://otexts.com/fpp2/non-seasonal-arima.html" rel="noopener noreferrer"&gt;this documentation&lt;/a&gt;. &lt;/p&gt;
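The &lt;strong&gt;d&lt;/strong&gt; term, for instance, is just repeated differencing of the series. A minimal sketch of the transformation it stands for (my own helper, not BigQuery code):

```python
def difference(series, d=1):
    """Apply d-th order differencing: each pass replaces the series
    with the gaps between consecutive values (the 'd' in ARIMA(p, d, q)).
    Differencing removes trends so the remaining AR (p) and MA (q)
    terms can model a stationary series."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series
```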

&lt;p&gt;In code, we’ll need to switch &lt;code&gt;AUTO_ARIMA&lt;/code&gt; to be &lt;code&gt;False&lt;/code&gt; before we give the new values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="s1"&gt;'CREATE OR REPLACE MODEL `dataset-name.data_arima_model_'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;'` '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
&lt;span class="s1"&gt;'OPTIONS(MODEL_TYPE="ARIMA_PLUS", TIME_SERIES_DATA_COL="'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;'", TIME_SERIES_TIMESTAMP_COL="name_of_your_column_with_your_timestamp", AUTO_ARIMA=False, NON_SEASONAL_ORDER=(&amp;lt;p&amp;gt;, &amp;lt;d&amp;gt;, &amp;lt;q&amp;gt;)) AS '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
&lt;span class="s1"&gt;'SELECT name_of_your_column_with_your_timestamp, '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;' FROM `project-name.dataset-name.table-with-data`'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When experimenting with a smaller dataset (to understand the relative impact of the variation of the terms), this is what we see:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.theodo.com%2Fstatic%2F07a61e3879de4d6161a8459d1497102f%2F0d390%2Fnon-seasonal-order-terms-results.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.theodo.com%2Fstatic%2F07a61e3879de4d6161a8459d1497102f%2F0d390%2Fnon-seasonal-order-terms-results.png" alt="Effect of non-seasonal order terms on number of false positives"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What we are most interested in is the combination which resulted in the lowest number of false positives. When bringing this back to our larger dataset, some of these combinations didn’t perform as well as in the smaller experiment, but the best combination of values was &lt;code&gt;p=0&lt;/code&gt;, &lt;code&gt;d=0&lt;/code&gt; and &lt;code&gt;q=1&lt;/code&gt;, which corresponds to a moving average model. In our particular case, this provided the best results, but you should always fine-tune these values for your application as this combination will not always be the best one! &lt;/p&gt;

&lt;p&gt;Using this method, the number of false positives dropped to only &lt;strong&gt;1 false positive&lt;/strong&gt; (when using our optimisation from part 1 but not 2) and 9 false positives (when using both part 1 and 2).&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Overall, when using BigQuery for numerical data, the algorithm is able to pick up all intended anomalies with relative ease. The key concern is the number of false positives. In itself, this isn’t a terrible issue to have, as you would rather over-estimate than under-estimate the number of anomalies when improving your data quality. However, it does make things like automation more difficult, especially if you need to manually validate that a data point is not an anomaly. &lt;/p&gt;

&lt;p&gt;By experimenting, we found that the best way to improve our utilisation of the BigQuery ML anomaly detector is by combining an increase in the threshold of what is considered an anomaly and fine-tuning the non-seasonal order terms. This resulted in a &lt;strong&gt;95% decrease&lt;/strong&gt; in the number of false positives.&lt;/p&gt;

&lt;p&gt;If you liked this article, don’t hesitate to check out the first article in this anomaly detector series: &lt;a href="https://blog.theodo.com/2024/01/anomaly-detection-llm/" rel="noopener noreferrer"&gt;Step-by-Step Guide to building an Anomaly Detector using a LLM&lt;/a&gt; and have a look at &lt;strong&gt;&lt;a href="https://twitter.com/ChloeCaronEng" rel="noopener noreferrer"&gt;@ChloeCaronEng&lt;/a&gt;&lt;/strong&gt; for more exploratory journeys!&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>datascience</category>
      <category>bigquery</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Step-by-Step Guide to building an Anomaly Detector using a LLM</title>
      <dc:creator>Chloe Caron</dc:creator>
      <pubDate>Wed, 10 Jan 2024 12:36:59 +0000</pubDate>
      <link>https://dev.to/chloecaroneng/step-by-step-guide-to-building-an-anomaly-detector-using-a-llm-522n</link>
      <guid>https://dev.to/chloecaroneng/step-by-step-guide-to-building-an-anomaly-detector-using-a-llm-522n</guid>
      <description>&lt;p&gt;You’ve probably heard about the importance of data quality being shouted from every rooftop. Bad data is a recipe for disaster. Certain companies have specialised in finding anomalies in your data and flagging it, much like &lt;a href="https://www.siffletdata.com/"&gt;Sifflet&lt;/a&gt; and &lt;a href="https://www.elementary-data.com/"&gt;Elementary&lt;/a&gt;. But what if we wanted to build an anomaly detector which works for any type of data, structured or unstructured, textual or numeric, include it in our data pipeline and do automated transformation steps… the possibilities are endless! &lt;/p&gt;

&lt;p&gt;I’ve created a series of articles following the journey of building an anomaly detector for various situations, featuring OpenAI and BigQuery. In this first article, we’ll focus on building an anomaly detector using a LLM (Large Language Model), more specifically OpenAI.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is an anomaly detector?
&lt;/h2&gt;

&lt;p&gt;In simple words, an anomaly detector identifies &lt;a href="https://blog.theodo.com/2023/12/avoid-revenue-hit-from-bad-or-no-data/"&gt;bad quality data&lt;/a&gt;. In more technical terms, anomaly detection is used to identify significant deviations from the normal behavioural pattern. By identifying these faults, you can take action against potential sources of bad data and ensure that your data use remains un-skewed, robust and defect-free.&lt;/p&gt;
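As a toy illustration of "significant deviation from the normal pattern", here is the simplest deviation-based rule, a z-score check (my own sketch, not the LLM or BigQuery approach explored in this series):

```python
from statistics import mean, stdev

def zscore_anomalies(values, threshold=3.0):
    """Flag the indices of values sitting more than `threshold`
    standard deviations away from the mean of the series."""
    mu, sigma = mean(values), stdev(values)
    return [i for i, v in enumerate(values) if abs(v - mu) > threshold * sigma]
```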

&lt;h2&gt;
  
  
  Why use LLMs like OpenAI?
&lt;/h2&gt;

&lt;p&gt;We want our anomaly detector to be as flexible as possible and to be able to identify anomalies no matter how structured the data is, how it is formatted, where it originates from, etc. It may be rare for you to have this number of uncertainties, but when you do, a conventional approach to anomaly detection may be more difficult to apply. Conventional approaches include using models like auto-encoders, ARIMA (auto-regressive integrated moving average), etc. Another point is - curiosity! LLMs are becoming more and more popular, but how efficient are they at more complex tasks such as anomaly detection and at what cost? That’s what we are about to find out.&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 1: Building a basic OpenAI anomaly detection
&lt;/h2&gt;

&lt;p&gt;To start, make sure that you have everything set up to use OpenAI. You can find more information on how to get started in &lt;a href="https://medium.com/data-professor/beginners-guide-to-openai-api-a0420bc58ee5"&gt;this article&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;To test that everything is working, we can try out some relatively simple code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;completion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-3.5-turbo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a data analyser. You spot any anomaly in the data received.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Here is the data input I have: {&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;: 1, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1946-01-03&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cost&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;3.0&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;}, {&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;: 2, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1852-03-04&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cost&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;3.0&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;}, {&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;: 2, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span 
class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1852-03-04&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cost&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;-1.0&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When running this, OpenAI does relatively well. It can pick up on issues like the one above, where the cost is a negative value. But this seems too easy; what if we give it more complex data? &lt;/p&gt;

&lt;p&gt;As soon as you start to look at more lines of data with more complex anomalies, accuracy starts to drop. For this experiment, let’s use data from the &lt;a href="https://www.kaggle.com/datasets/albertovidalrod/electricity-consumption-uk-20092022"&gt;Electricity Consumption UK 2009-2023 dataset&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We can test the anomaly detector using different scenarios, varying the following parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Model&lt;/strong&gt; used: either GPT 3.5 Turbo 1106 or GPT 4 1106 Preview&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Number of anomalies&lt;/strong&gt;: 0, 1, or 2&lt;/li&gt;
&lt;li&gt;Number of lines of &lt;strong&gt;training data&lt;/strong&gt;: 0, 10, 50 or 100&lt;/li&gt;
&lt;li&gt;Number of lines of &lt;strong&gt;test data&lt;/strong&gt;: 4, 21 or 80&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In a surprisingly high number of cases, OpenAI could find none of the anomalies! While a lot of the experiments gave a low accuracy, there are some combinations that shone through. The key statistics are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPT 4 tended to perform much better than GPT 3.5&lt;/li&gt;
&lt;li&gt;The more anomalies you have, the more difficult it was for the model to find them&lt;/li&gt;
&lt;li&gt;The number of lines of data didn’t have a significant impact on the accuracy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To give you an idea of the outcome, here are the combinations which had a non-zero accuracy:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GGgzjM2A--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.theodo.com/static/bb6bffa66023b96242fa8b4b7f8be2ab/536c7/openai-base-anomaly-detector-experiment.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GGgzjM2A--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.theodo.com/static/bb6bffa66023b96242fa8b4b7f8be2ab/536c7/openai-base-anomaly-detector-experiment.png" alt="Results from experiment using a simple anomaly detector" width="800" height="1279"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The results aren’t great but we are far from done! As is, we wouldn’t be able to use OpenAI as an anomaly detector. Our experiments are on a small scale. As the dataset we analyse grows along with the number of anomalies, our accuracy will very much suffer. Keep in mind that throughout this article we are testing our anomaly detector with simple datasets and results may differ depending on the dataset you use and the type of anomalies you have.&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 2: Using Prompt Engineering to boost our accuracy from 32% to 64%
&lt;/h2&gt;

&lt;p&gt;The idea behind prompt engineering is to construct the queries given to the language models to optimise their performance. This helps to guide them to generate the desired output by fine-tuning their response. There is a plethora of research papers out there on different forms of prompt engineering. &lt;a href="http://DAIR.AI"&gt;DAIR.AI&lt;/a&gt; published a &lt;a href="https://www.promptingguide.ai/"&gt;guide on prompt engineering&lt;/a&gt; that you might find useful to get started.&lt;/p&gt;

&lt;p&gt;After trying a variety of different prompt engineering methods, there are 3 key techniques which stood out. &lt;/p&gt;

&lt;h3&gt;
  
  
  Chain of Thought
&lt;/h3&gt;

&lt;p&gt;When using Chain-of-Thought (CoT) prompting, we are encouraging the model to break down complex processes or “thoughts” into smaller steps before giving a response. CoT can be combined with other prompting techniques (e.g. few-shot as we’ll see later), but we can also apply it on its own. The key is to get the model to think step by step. A way to do this is by adding the following to your prompt:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Let’s think step by step&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;By doing so, you are encouraging your model to break down its thinking. So what does our solution look like now?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;csv&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Step 1: Consider you have a CSV, convert the data into a stringified JSON
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;reader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;data_with_anomaly&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path/to/file.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 2: Determine the messages to send to the model
&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a data analyser which spots any anomaly 
                                        in the data received. You will be given data in the form 
                                        of a CSV. There can be no anomaly but there can also be 
                                        multiple anomalies. Let’s think step by step. First work out 
                                        the schema of the data you receive. Then compare the data you 
                                        have to the schema you determined. Don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t decide what is an 
                                        anomaly until you have figured out the schema.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Here is the data to analyse, what are the anomalies? Please give me the line number with the anomaly. Make sure to remember on which line of the CSV the anomaly was (ignore the first line since these are the column titles): &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
      &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;data_with_anomaly&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Step 3: Get the response from the model
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# Add the model you would like to use, e.g. gpt-3.5-turbo-1106 or gpt-4-1106-preview
&lt;/span&gt;      &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 4: Show the anomaly
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To keep our experiments fair, we will fix certain parameters: when applying the different prompt engineering techniques, we will keep the number of anomalies in our data at 2 and use 20 lines of data in the CSV. These numbers were chosen arbitrarily.&lt;/p&gt;
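&lt;p&gt;For illustration, the test files could be generated with something along these lines (a minimal sketch; the file name, column names, value ranges and anomaly type here are assumptions, not the exact scripts used in these experiments):&lt;/p&gt;

```python
import csv
import random

def make_dataset(path, rows=20):
    # Hypothetical numerical dataset, loosely mirroring the demand data above
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["ND", "TSD", "ENGLAND_WALES_DEMAND"])
        for _ in range(rows):
            writer.writerow([random.randint(20000, 30000),
                             random.randint(24244, 48195),
                             random.randint(15000, 25000)])

def inject_anomalies(path, count=2):
    # Overwrite one cell in `count` distinct rows with an out-of-range value
    with open(path) as f:
        data = [row for row in csv.reader(f)]
    target_rows = random.sample(range(1, len(data)), count)  # skip the header row
    for r in target_rows:
        data[r][random.randrange(len(data[r]))] = "-1"
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(data)
    return sorted(target_rows)  # ground truth to score the detector against

make_dataset("experiment.csv")
anomaly_rows = inject_anomalies("experiment.csv")
```

&lt;p&gt;Keeping the returned row numbers around is what lets us check the detector’s answers later on.&lt;/p&gt;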

&lt;p&gt;By applying CoT to the anomaly detector, GPT 4 performed &lt;strong&gt;8% better&lt;/strong&gt;, while GPT 3.5 did not show any significant change.&lt;/p&gt;

&lt;h3&gt;
  Few-Shot Prompting
&lt;/h3&gt;

&lt;p&gt;With few-shot prompting, we provide examples of how the model should respond to the prompts we give it. This is valuable for more complex tasks such as anomaly detection, since the examples demonstrate to the model what a good response looks like for a given prompt. &lt;/p&gt;

&lt;p&gt;In our case, we can give the model example data with an anomaly and then give it the expected response with the correct anomalies being identified. Let’s give it a shot!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;csv&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;reader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;data_with_anomaly&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path/to/file.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 1: Extract data from three CSVs with example data inside them
&lt;/span&gt;&lt;span class="n"&gt;data_with_anomaly_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bad_data_example_1.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;data_with_anomaly_2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bad_data_example_2.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;data_with_no_anomaly&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data_with_no_anomaly.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 2: Define the anomalies present in each file with the reasoning behind it
&lt;/span&gt;&lt;span class="n"&gt;expected_response_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Taking my time to look through the data, I noticed the following:
1. In row 1, the value for &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ND&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; is zero. In all the other rows, the &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ND&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; value is non-zero. This is an anomaly.
2. In row 3, the value for &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ENGLAND_WALES_DEMAND&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; is a negative value. In all the other rows, this is a positive value. This is an anomaly.
3. In row 6, the value for &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;EMBEDDED_SOLAR_CAPACITY&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; is missing. In all the other rows, this value is present. This is an anomaly.
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="n"&gt;expected_response_2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Taking my time to look through the data, I noticed the following:
1. In row 2, the value for &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;TSD&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; is 100. In all the other rows, this value varies between 24244 and 48195. Since this value is very out of range, it is an anomaly.
2. In row 3, the value for &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;EMBEDDED_WIND_GENERATION&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; is 10. In all the other rows, this value varies between 24244 and 48195. Since this value is very out of range, it is an anomaly.
3. In row 7, the value for &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;EMBEDDED_SOLAR_GENERATION&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; is a negative value. In all the other rows, this is a positive value. This is an anomaly.
4. In row 10, the value for &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ENGLAND_WALES_DEMAND&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; and &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;EMBEDDED_WIND_GENERATION&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; are missing. In all the other rows, this value is present. This is an anomaly.
5. In row 12, the value for &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;NON_BM_STOR&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; is not zero. In all the other rows, this value is zero. This is an anomaly.
6. In row 15, the value for &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;MOYLE_FLOW&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; is zero. In all the other rows, this value is not zero. This is an anomaly.
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="n"&gt;expected_response_no_anomaly&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;After comparing the values of each row to each other, all the data seems to be consistent with each other, I cannot find an anomaly.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Step 3: Let us adapt the messages we send to the model with this information
&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a data analyser which spots any anomaly 
                                        in the data received. You will be given data in the form 
                                        of a CSV. There can be no anomaly but there can also be 
                                        multiple anomalies.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Here is the data to analyse: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;data_with_anomaly_1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;expected_response_1&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Here is the data to analyse: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;data_with_no_anomaly&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;expected_response_no_anomaly&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Here is the data to analyse, what are the anomalies? Please give me the line number with the anomaly. Make sure to remember on which line of the CSV the anomaly was (ignore the first line since these are the column titles): &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
      &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;data_with_anomaly&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Step 4: As before, get the response from the model
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# Add the model you would like to use, e.g. gpt-3.5-turbo-1106 or gpt-4-1106-preview
&lt;/span&gt;      &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 5: As before, show the anomaly
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Try giving this a run and you’ll notice that few-shot prompting leads to a massive jump when using GPT 4: the accuracy of the anomaly detector goes &lt;strong&gt;from 32% to 56%&lt;/strong&gt;! Unfortunately, it stays relatively constant for GPT 3.5. &lt;/p&gt;

&lt;h3&gt;
  Self-reflection &amp;amp; multi-step
&lt;/h3&gt;

&lt;p&gt;For our last experiment with prompt engineering, we’re going to try splitting our model into multiple steps and pushing it to self-reflect on its findings. &lt;/p&gt;

&lt;p&gt;We are going to split our detector into three steps, each with its own responsibility. The first step aims to find the anomalies, similar to what we saw above. The second step triggers a self-reflection: it presents the previous answer back to the model and asks it “Are you sure?”. The last step converts the response to JSON, standardising how the anomalies are reported so that we can use the response moving forward.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cuKbBGwr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.theodo.com/static/be4431d08ff68ff5568a844029eb97ec/92338/multi-step-diagram.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cuKbBGwr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.theodo.com/static/be4431d08ff68ff5568a844029eb97ec/92338/multi-step-diagram.png" alt="Diagram showing the split in the steps" width="800" height="935"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s see what this looks like in code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# All the code above still applies
&lt;/span&gt;&lt;span class="bp"&gt;...&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;  &lt;span class="c1"&gt;# needed for json.dumps in the conversion step below&lt;/span&gt;

&lt;span class="n"&gt;first_answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;

&lt;span class="c1"&gt;# Step 1: Append our previous answer to our messages
&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;first_answer&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;# Step 2: Add some self reflection
&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Are you sure? How confident are you? Take a deep breathe and tell me the anomalies when you are more sure.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; 

&lt;span class="c1"&gt;# Step 3: Run the model
&lt;/span&gt;&lt;span class="n"&gt;completion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# Add the model you would like to use, e.g. gpt-3.5-turbo-1106 or gpt-4-1106-preview
&lt;/span&gt;        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response_with_self_reflection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;completion&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;

&lt;span class="c1"&gt;# Step 4: Convert your answer to a JSON response
&lt;/span&gt;&lt;span class="n"&gt;json_structure&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anomaly_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;number of anomalies found in the text, of type &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;int&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anomalies&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;the row number where the anomaly was found. Ignore the first row of the CSV since it has the column names. Of type &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;int&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;why this is considered an anomaly, of type &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;json_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-3.5-turbo-1106&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# Can use a simpler model to do the conversion 
&lt;/span&gt;    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You convert a text response into a json object. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I want you to only give responses in the form &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json_structure&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Convert this response to a json object: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response_with_self_reflection&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;response_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json_object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 5: Display the results
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json_response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Voila! You’ve now got few-shot learning integrated along with self-reflection &amp;amp; multi-step. When running the experiment without few-shot (i.e. only using self-reflection &amp;amp; multi-step), GPT 4 saw an improvement from 32% to 60% (unfortunately, GPT 3.5 actually saw a drop in performance). &lt;/p&gt;

&lt;p&gt;By combining CoT, few-shot, self-reflection &amp;amp; multi-step, the overall performance of &lt;strong&gt;GPT 4 goes from 32% to 68%&lt;/strong&gt;! On the other hand, the performance of GPT 3.5 doesn’t really see any significant improvement. If we focus on GPT 4, here is a summary of how prompt engineering impacted the accuracy of our anomaly detector:&lt;/p&gt;
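&lt;p&gt;For reference, accuracy here can be scored by comparing the rows the model flags against the rows we actually corrupted. A minimal sketch, assuming the JSON output format from the multi-step code above (the example model outputs and ground-truth rows below are hypothetical):&lt;/p&gt;

```python
import json

def score_run(json_text, true_rows):
    # A run counts as correct only if the flagged rows match the ground truth exactly
    found = {anomaly["id"] for anomaly in json.loads(json_text)["anomalies"]}
    return found == set(true_rows)

# Hypothetical model outputs paired with the rows we actually corrupted
runs = [
    ('{"anomaly_count": 2, "anomalies": [{"id": 3, "description": "negative value"},'
     ' {"id": 7, "description": "missing value"}]}', [3, 7]),
    ('{"anomaly_count": 1, "anomalies": [{"id": 5, "description": "zero value"}]}', [5, 9]),
]
accuracy = sum(score_run(text, truth) for text, truth in runs) / len(runs)
print(f"{accuracy:.0%}")  # → 50%
```

&lt;p&gt;Repeating this over many generated datasets gives an accuracy percentage like the ones quoted above.&lt;/p&gt;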

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZtjcFjbS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.theodo.com/static/132a9aad70cb8b67a13a0aa73cb5e905/b1001/results-summary.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZtjcFjbS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.theodo.com/static/132a9aad70cb8b67a13a0aa73cb5e905/b1001/results-summary.png" alt="Summary of the results when applying prompt engineering" width="800" height="998"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Pretty good! But 68% accuracy is not amazing, and GPT 3.5 seems to really struggle to improve its accuracy as an anomaly detector. Why is this the case? If we take a closer look at our data, we see that we are using numerical data:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--npidWIBT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.theodo.com/static/1bcf510bb0f49b2ee14f9ed76c9e23a6/536c7/example-numerical-data.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--npidWIBT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.theodo.com/static/1bcf510bb0f49b2ee14f9ed76c9e23a6/536c7/example-numerical-data.png" alt="Example of numerical data" width="800" height="354"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A quick search will show that the GPT models are bad at maths. This is to be expected: they are language models and aren’t trained on mathematical concepts.&lt;/p&gt;

&lt;p&gt;We can test this theory by swapping our numerical data for text-based data.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does data type impact the accuracy of our anomaly detector?
&lt;/h3&gt;

&lt;p&gt;Let’s try to swap out our data for movie reviews instead.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Vr1-Lxhh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.theodo.com/static/49fdbbd1a10785befad2ce575132370d/536c7/example-text-data-with-anomalies.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Vr1-Lxhh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.theodo.com/static/49fdbbd1a10785befad2ce575132370d/536c7/example-text-data-with-anomalies.png" alt="Example of text data with anomalies" width="800" height="232"&gt;&lt;/a&gt;&lt;/p&gt;
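&lt;p&gt;To make the setup concrete, here is a hedged sketch of what text rows with injected anomalies could look like, together with a crude heuristic check. The reviews and the digit/empty-string heuristic below are made up for illustration; they are not the article’s actual dataset or detection method:&lt;/p&gt;

```python
# Illustrative rows only: the real movie-review dataset differs.
movie_reviews = [
    {"id": 1, "review": "A heartfelt drama with a stunning final act."},
    {"id": 2, "review": "Slow start, but the cast carries it through."},
    {"id": 3, "review": "qwrtz 9981 lorem"},  # garbled entry (anomaly)
    {"id": 4, "review": ""},                  # empty review (anomaly)
]

# Crude placeholder heuristic: flag empty reviews or ones containing digits.
anomalous_ids = [
    row["id"]
    for row in movie_reviews
    if not row["review"].strip() or any(ch.isdigit() for ch in row["review"])
]
```

&lt;p&gt;The LLM is asked to flag rows like 3 and 4, which read as obviously broken to a language model even though no numerical reasoning is involved.&lt;/p&gt;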

&lt;p&gt;When running our anomaly detector on this data (using the prompt engineering techniques above), we see that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPT 3.5 has an accuracy which jumps from 16% (numerical data) to &lt;strong&gt;78%&lt;/strong&gt; (text-based data)&lt;/li&gt;
&lt;li&gt;GPT 4 has an accuracy which jumps from 68% (numerical) to &lt;strong&gt;~100%&lt;/strong&gt; (text-based)&lt;/li&gt;
&lt;/ul&gt;
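&lt;p&gt;For reference, accuracy figures like these can be computed by comparing the detector’s flags against hand-labelled ground truth. A minimal sketch, with hypothetical flags for a ten-row sample:&lt;/p&gt;

```python
# Sketch: accuracy is the fraction of rows where the detector's verdict
# matches the hand-written label. The flags below are made up.
def accuracy(predicted_flags, true_flags):
    """Fraction of rows where the prediction matches the label."""
    matches = sum(p == t for p, t in zip(predicted_flags, true_flags))
    return matches / len(true_flags)

# True means "flagged as an anomaly".
predicted = [True, False, False, True, False, False, True, False, False, False]
labels    = [True, False, False, False, False, False, True, False, True, False]

print(accuracy(predicted, labels))  # 0.8 on this made-up sample
```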

&lt;p&gt;This is an incredible jump and reflects how much the data type matters: choose your model based on the kind of data your application handles. &lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;OpenAI can be used to build an effective anomaly detector for text-based data. Our model can be optimised using techniques such as prompt engineering. On the other hand, while we can improve the performance of our anomaly detector for numerical data, it is still relatively limited. So what can we use? Keep an eye out for the next article on using BigQuery for anomaly detection on numerical data!&lt;/p&gt;

&lt;p&gt;This article is part of a series on the journey of building an anomaly detector. I share the progress of the journey regularly on X (formerly Twitter), have a look at &lt;a href="https://twitter.com/ChloeCaronEng"&gt;@ChloeCaronEng&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>data</category>
      <category>openai</category>
      <category>llm</category>
      <category>dataquality</category>
    </item>
  </channel>
</rss>
