<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Chloe Caron</title>
    <description>The latest articles on DEV Community by Chloe Caron (@chloecaroneng).</description>
    <link>https://dev.to/chloecaroneng</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1253431%2F6786d2da-40e1-4a3e-a6d5-4bdb61fb59f3.jpg</url>
      <title>DEV Community: Chloe Caron</title>
      <link>https://dev.to/chloecaroneng</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/chloecaroneng"/>
    <language>en</language>
    <item>
      <title>How to build an Anomaly Detector using BigQuery</title>
      <dc:creator>Chloe Caron</dc:creator>
      <pubDate>Thu, 18 Jan 2024 12:23:52 +0000</pubDate>
      <link>https://dev.to/chloecaroneng/how-to-build-an-anomaly-detector-using-bigquery-274e</link>
      <guid>https://dev.to/chloecaroneng/how-to-build-an-anomaly-detector-using-bigquery-274e</guid>
      <description>&lt;p&gt;Bad data quality can arise in any type of data, be it numerical, textual or other. As we saw in the &lt;a href="https://blog.theodo.com/2024/01/anomaly-detection-llm/" rel="noopener noreferrer"&gt;last article&lt;/a&gt; of this series, LLMs like OpenAI are quite effective at detecting anomalies in textual data. However, the OpenAI anomaly detector really struggled with numerical data, reaching an accuracy of 68% even after applying multiple forms of prompt engineering (compared to ~100% for textual data). &lt;/p&gt;

&lt;p&gt;This is to be expected since OpenAI is a large &lt;strong&gt;language&lt;/strong&gt; model and isn’t trained on mathematical concepts. What to do? Numerical data makes up a non-negligible share of global data. Although exact statistics are hard to come by, it would not be unreasonable to say that more of the world’s data is numerical than textual: scientific, financial, market and consumer, sensor, and healthcare data are just a few of the endless sources.&lt;/p&gt;

&lt;p&gt;So if we can’t use LLMs to build our anomaly detector, what do we use? &lt;a href="https://cloud.google.com/bigquery/docs/bqml-introduction" rel="noopener noreferrer"&gt;BigQuery ML&lt;/a&gt; is one solution to our problem! BigQuery ML has an inbuilt &lt;a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-detect-anomalies" rel="noopener noreferrer"&gt;‘anomaly detector’&lt;/a&gt; function which we can use to identify anomalies in our data. Let’s give it a shot and work through this together.&lt;/p&gt;

&lt;h2&gt;Part 1: Choosing and preparing a model&lt;/h2&gt;

&lt;p&gt;Before we try to detect any anomalies in our data, we need to create a model for our anomaly detector to compare against, i.e. does your dataset match what the model predicted?&lt;/p&gt;

&lt;p&gt;Choosing the correct model depends on your situation. To help understand the differences between the types of models and choose the right one, I &lt;a href="https://x.com/ChloeCaronEng/status/1742169481080680795?s=20" rel="noopener noreferrer"&gt;built this decision tree&lt;/a&gt;: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.theodo.com%2Fstatic%2F00bdab1954040f0590c082f4d05318ca%2F536c7%2Fmodel-decision-tree.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.theodo.com%2Fstatic%2F00bdab1954040f0590c082f4d05318ca%2F536c7%2Fmodel-decision-tree.png" alt="Decision tree to select the appropriate model"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you remember from my &lt;a href="https://blog.theodo.com/2024/01/anomaly-detection-llm/" rel="noopener noreferrer"&gt;previous article&lt;/a&gt;, we are using data from the &lt;strong&gt;&lt;a href="https://www.kaggle.com/datasets/albertovidalrod/electricity-consumption-uk-20092022" rel="noopener noreferrer"&gt;Electricity Consumption UK 2009-2023 dataset&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The data we have is a time series and, for our case scenario, we’ll keep it simple and not look at cross dependencies between variables (although this would be an interesting experiment to try). Our logical choice is, therefore, to use &lt;strong&gt;ARIMA_PLUS&lt;/strong&gt; as our model type. If you would like to know more about the ARIMA_PLUS model, don’t hesitate to have a look at &lt;a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create-time-series" rel="noopener noreferrer"&gt;BigQuery’s documentation&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;Going back to our dataset, you’ll notice that we have a lot of data columns, i.e. for a single timestamp we are collecting multiple types of data (e.g. wind generation, solar generation). These can’t all fit into a single model: an ARIMA_PLUS model can only point to a single data column and be trained on that column. We’ll therefore need to create a separate model for each data column that we have. &lt;/p&gt;
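Concretely, a model for one column looks like the following sketch; the project, dataset and table names, as well as the `wind` and `time_stamp` columns, are placeholders you would swap for your own:

```sql
-- Sketch: one ARIMA_PLUS model trained on a single data column.
-- `project-name`, `dataset-name`, `wind` and `time_stamp` are placeholders.
CREATE OR REPLACE MODEL `project-name.dataset-name.data_arima_model_wind`
OPTIONS(
  MODEL_TYPE = 'ARIMA_PLUS',
  TIME_SERIES_DATA_COL = 'wind',
  TIME_SERIES_TIMESTAMP_COL = 'time_stamp'
) AS
SELECT time_stamp, wind
FROM `project-name.dataset-name.table-with-data`;
```

The looping script below simply generates one such statement per data column.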

&lt;p&gt;If you haven’t worked with BigQuery before, have a look at some of the tutorials online. &lt;a href="https://blog.coupler.io/bigquery-tutorial/" rel="noopener noreferrer"&gt;This&lt;/a&gt; one is a relatively good start; to follow these steps you only need to know how to upload data from a CSV (or another format) and how to run a query.&lt;/p&gt;

&lt;p&gt;Okay, great! So we know that we’ll be using the ARIMA_PLUS model, we know what data we want to use and we know that we’ll need a model per column. Let’s see this in action as a query.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;DECLARE&lt;/span&gt; &lt;span class="n"&gt;column_names&lt;/span&gt; &lt;span class="n"&gt;ARRAY&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;DECLARE&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;DECLARE&lt;/span&gt; &lt;span class="n"&gt;query_string&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Get column names into an array&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;column_names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;ARRAY_AGG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;COLUMN_NAME&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;`project-name.dataset-name.INFORMATION_SCHEMA.COLUMNS`&lt;/span&gt;
  &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;TABLE_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'name_of_your_table_with_data'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;COLUMN_NAME&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s1"&gt;'name_of_your_column_with_your_timestamp'&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Loop through columns and create ARIMA_PLUS models&lt;/span&gt;
&lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="k"&gt;row&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;UNNEST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;column_names&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;DO&lt;/span&gt;
    &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;column_name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;query_string&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt;
      &lt;span class="s1"&gt;'CREATE OR REPLACE MODEL `dataset-name.data_arima_model_'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;'` '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
      &lt;span class="s1"&gt;'OPTIONS(MODEL_TYPE="ARIMA_PLUS", TIME_SERIES_DATA_COL="'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;'", TIME_SERIES_TIMESTAMP_COL="name_of_your_column_with_your_timestamp") AS '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
      &lt;span class="s1"&gt;'SELECT name_of_your_column_with_your_timestamp, '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;' FROM `project-name.dataset-name.table-with-data`'&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;EXECUTE&lt;/span&gt; &lt;span class="k"&gt;IMMEDIATE&lt;/span&gt; &lt;span class="n"&gt;query_string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="k"&gt;FOR&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s break this down. In the first section, we are creating an array of the column names (remember to replace &lt;code&gt;project-name&lt;/code&gt;, &lt;code&gt;dataset-name&lt;/code&gt;, &lt;code&gt;name_of_your_table_with_data&lt;/code&gt;, &lt;code&gt;table-with-data&lt;/code&gt; and &lt;code&gt;name_of_your_column_with_your_timestamp&lt;/code&gt; with your own values). This array is used in the next section where we loop through them and create an individual ARIMA_PLUS model for each data column. &lt;/p&gt;

&lt;p&gt;Great, we have our models! Once the query has run successfully, you’ll be able to see the models in the Explorer tab within the GCP console:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.theodo.com%2Fstatic%2F03aa6493e3b94642d7c44ee5985038ab%2F2e237%2Fmodels-in-bigquery.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.theodo.com%2Fstatic%2F03aa6493e3b94642d7c44ee5985038ab%2F2e237%2Fmodels-in-bigquery.png" alt="Results of creating the models for the dataset"&gt;&lt;/a&gt;&lt;/p&gt;
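If you prefer to check from SQL rather than the console, BigQuery ML can also report on a fitted ARIMA_PLUS model via `ML.ARIMA_EVALUATE`; the model name here is hypothetical, following the `data_arima_model_<column>` naming scheme used above:

```sql
-- Inspect one of the generated models (hypothetical name following
-- the `data_arima_model_<column>` convention from the creation script).
SELECT *
FROM ML.ARIMA_EVALUATE(MODEL `dataset-name.data_arima_model_wind`);
```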

&lt;h3&gt;(Optional) Formatting your timestamp&lt;/h3&gt;

&lt;p&gt;If you use the same dataset that I did (&lt;a href="https://www.kaggle.com/datasets/albertovidalrod/electricity-consumption-uk-20092022" rel="noopener noreferrer"&gt;Electricity Consumption UK 2009-2023 dataset&lt;/a&gt;), you’ll notice that you don’t quite get a proper timestamp. Instead you have &lt;code&gt;SETTLEMENT_DATE&lt;/code&gt; and &lt;code&gt;SETTLEMENT_PERIOD&lt;/code&gt; to determine which half-hour period of the day you are in. As is, this will cause an error when you run your model, since the &lt;code&gt;TIME_SERIES_TIMESTAMP_COL&lt;/code&gt; that you try to use will not be a timestamp. &lt;/p&gt;

&lt;p&gt;But, no worries, I have a script to solve that! &lt;/p&gt;

&lt;p&gt;Once you’ve imported your data, run the following script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Step 1: Create a new table with the desired structure and transformed data&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="nv"&gt;`project-name.dataset-name.&amp;lt;table-with-data&amp;gt;_new`&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;EXCEPT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SETTLEMENT_DATE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SETTLEMENT_PERIOD&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;-- Exclude the columns you want to remove&lt;/span&gt;
  &lt;span class="nb"&gt;DATETIME&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CONCAT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;FORMAT_DATE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'%F'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SETTLEMENT_DATE&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;'T'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="k"&gt;CASE&lt;/span&gt; 
      &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;SETTLEMENT_PERIOD&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;48&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'00:00:00'&lt;/span&gt;
      &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;SETTLEMENT_PERIOD&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="n"&gt;CONCAT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'0'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;FLOOR&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;SETTLEMENT_PERIOD&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;':'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;LPAD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;MOD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SETTLEMENT_PERIOD&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'0'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;':00'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="n"&gt;CONCAT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;FLOOR&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;SETTLEMENT_PERIOD&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;':'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;LPAD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;MOD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SETTLEMENT_PERIOD&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'0'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;':00'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;time_stamp&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;`project-name.dataset-name.table-with-data`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Step 2: Drop the original table&lt;/span&gt;
&lt;span class="k"&gt;DROP&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="nv"&gt;`project-name.dataset-name.table-with-data`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Step 3: Rename the new table to the original table name&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="nv"&gt;`project-name.dataset-name.&amp;lt;table-with-data&amp;gt;_new`&lt;/span&gt; &lt;span class="k"&gt;RENAME&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="nv"&gt;`table-with-data`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is where the magic of SQL comes in so I would highly recommend looking over this script to make sure you understand each stage.&lt;/p&gt;
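To see what the CASE expression in the script produces, here is a standalone spot-check over a few settlement periods. It uses `DIV` for integer division, which is equivalent to the FLOOR-of-a-division in the script, and folds the two padding branches into one `LPAD`:

```sql
-- Spot-check the period-to-time mapping for a few settlement periods.
-- Period 1 maps to '00:30:00', period 3 to '01:30:00',
-- and period 48 wraps around to '00:00:00'.
SELECT
  p AS settlement_period,
  CASE
    WHEN p = 48 THEN '00:00:00'
    ELSE CONCAT(
      LPAD(CAST(DIV(p, 2) AS STRING), 2, '0'), ':',
      LPAD(CAST(MOD(p, 2) * 30 AS STRING), 2, '0'), ':00')
  END AS derived_time
FROM UNNEST([1, 3, 48]) AS p;
```

Note that each half-hour period is mapped to the time at which it ends, which is why period 48 (23:30–00:00) maps to midnight.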

&lt;h2&gt;Part 2: Setting up our results table&lt;/h2&gt;

&lt;p&gt;Now that our models have been generated, we are nearly ready to run our anomaly detector. But what’s the point in running the experiment if we can’t see the results? And what is the easiest and most pragmatic way to visualise them? We’re in the data world, so we could talk about all sorts of beautiful visualisations, but let’s keep it simple by storing our results in a table. First, let’s prepare the table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;DECLARE&lt;/span&gt; &lt;span class="n"&gt;dynamic_sql&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;DECLARE&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- ... the code to set the column_names, seen above, is needed again&lt;/span&gt;

&lt;span class="c1"&gt;-- Start building the dynamic SQL to create the table of results&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;dynamic_sql&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'CREATE OR REPLACE TABLE `project-name.dataset-name.anomaly_results` (time_stamp TIMESTAMP, '&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Loop through the column names&lt;/span&gt;
&lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="k"&gt;row&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;UNNEST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;column_names&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;DO&lt;/span&gt;
    &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;column_name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;dynamic_sql&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dynamic_sql&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;' FLOAT64 DEFAULT NULL, is_anomaly_'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;' BOOL DEFAULT FALSE, prob_'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;' FLOAT64 DEFAULT NULL, '&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="k"&gt;FOR&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Remove the trailing comma and close the SQL statement&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;dynamic_sql&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SUBSTR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dynamic_sql&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;LENGTH&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dynamic_sql&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;')'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Execute the dynamic SQL to create the table to store results&lt;/span&gt;
&lt;span class="k"&gt;EXECUTE&lt;/span&gt; &lt;span class="k"&gt;IMMEDIATE&lt;/span&gt; &lt;span class="n"&gt;dynamic_sql&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this snippet of code, we are creating an empty table with three columns for each of the original data columns: the original data (float64), whether or not the value is an anomaly &lt;code&gt;is_anomaly_&amp;lt;column-name&amp;gt;&lt;/code&gt; (bool), and the probability of it being an anomaly &lt;code&gt;prob_&amp;lt;column-name&amp;gt;&lt;/code&gt; (float64), which will come in useful for debugging. The default values are important here. As we run our anomaly detector against each model in turn, we will incrementally add values to the rows of our results table. Since we’ll be filling in each row in a piecewise manner, we need those default values as placeholders so that creating a new row with partial data doesn’t raise an error.&lt;/p&gt;
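For illustration, with a single hypothetical column `wind`, the dynamic SQL assembled above would expand to something like:

```sql
-- What the generated statement looks like for one hypothetical column `wind`.
CREATE OR REPLACE TABLE `project-name.dataset-name.anomaly_results` (
  time_stamp TIMESTAMP,
  wind FLOAT64 DEFAULT NULL,
  is_anomaly_wind BOOL DEFAULT FALSE,
  prob_wind FLOAT64 DEFAULT NULL
)
```

With several data columns, the loop simply appends another trio of `&lt;column&gt;`, `is_anomaly_&lt;column&gt;` and `prob_&lt;column&gt;` definitions for each one.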

&lt;h2&gt;Part 3: Running the anomaly detector&lt;/h2&gt;

&lt;p&gt;Now that we have the model and we have a place to store our results, we can run our anomaly detector.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;DECLARE&lt;/span&gt; &lt;span class="n"&gt;anomaly_column_name&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;DECLARE&lt;/span&gt; &lt;span class="n"&gt;anomaly_prob_column_name&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;DECLARE&lt;/span&gt; &lt;span class="n"&gt;model_name&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- ... the code to set the column_names, seen above, is needed again&lt;/span&gt;

&lt;span class="c1"&gt;-- Find anomalies in the data&lt;/span&gt;
&lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="k"&gt;row&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;UNNEST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;column_names&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;DO&lt;/span&gt;
  &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;column_name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;anomaly_column_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'is_anomaly_'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;model_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'dataset-name.data_arima_model_'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;anomaly_prob_column_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'prob_'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;query_string&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt;
      &lt;span class="s1"&gt;'MERGE `project-name.dataset-name.anomaly_results` AS target '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
      &lt;span class="s1"&gt;'USING ('&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
      &lt;span class="s1"&gt;'  SELECT TIMESTAMP AS time_stamp, '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;', is_anomaly AS '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;anomaly_column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;', anomaly_probability AS '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;anomaly_prob_column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;  
      &lt;span class="s1"&gt;'  FROM ML.DETECT_ANOMALIES('&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
      &lt;span class="s1"&gt;'    MODEL `'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;model_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;'`, STRUCT(0.95 AS anomaly_prob_threshold)'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
      &lt;span class="s1"&gt;'  )'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
      &lt;span class="s1"&gt;') AS source '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
      &lt;span class="s1"&gt;'ON target.time_stamp = source.time_stamp '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
      &lt;span class="s1"&gt;'WHEN MATCHED THEN '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
      &lt;span class="s1"&gt;'  UPDATE SET '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
      &lt;span class="s1"&gt;'    '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;' = source.'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;', '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
      &lt;span class="s1"&gt;'    '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;anomaly_column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;' = source.'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;anomaly_column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;', '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
      &lt;span class="s1"&gt;'    '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;anomaly_prob_column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;' = source.'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;anomaly_prob_column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;' '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
      &lt;span class="s1"&gt;'WHEN NOT MATCHED THEN '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
      &lt;span class="s1"&gt;'  INSERT '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
      &lt;span class="s1"&gt;'    ('&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;', '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;anomaly_column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;', '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;anomaly_prob_column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;', time_stamp) '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
      &lt;span class="s1"&gt;'  VALUES '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
      &lt;span class="s1"&gt;'    (source.'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;', source.'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;anomaly_column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;', source.'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;anomaly_prob_column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;', source.time_stamp)'&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="c1"&gt;-- Execute the query&lt;/span&gt;
  &lt;span class="k"&gt;EXECUTE&lt;/span&gt; &lt;span class="k"&gt;IMMEDIATE&lt;/span&gt; &lt;span class="n"&gt;query_string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="k"&gt;FOR&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This looks like a very complicated statement, so let’s break it down again. The first part is &lt;code&gt;MERGE&lt;/code&gt;, which will insert, update or delete a row depending on which &lt;code&gt;WHEN&lt;/code&gt; clause matches. Here we are checking whether the &lt;code&gt;time_stamp&lt;/code&gt; in our results table matches the one in the data we are processing. The &lt;code&gt;time_stamp&lt;/code&gt; was the best choice for deciding between an update and an insert since it is unique in our case.&lt;/p&gt;

&lt;p&gt;Now that we’ve looked at the &lt;code&gt;MERGE&lt;/code&gt; and &lt;code&gt;WHEN&lt;/code&gt; statements, let’s look at the &lt;code&gt;USING&lt;/code&gt; component, which is where we actually run our anomaly detector. I won’t focus on the query syntax as it closely resembles standard SQL. The main part of the statement is &lt;code&gt;ML.DETECT_ANOMALIES&lt;/code&gt;, where we specify the model (which we created earlier) and the probability threshold. This threshold determines what the model will mark as an anomaly: when the script runs, each piece of data is assigned a probability of being an anomaly. With a low threshold, we are being more conservative and risk-averse, classifying data as an anomaly even when the system is only partially sure. With a much higher threshold, we only classify data as an anomaly when the system is near certain (i.e. assigns a high probability of it being an anomaly).&lt;/p&gt;
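The effect of the threshold can be sketched in a few lines of Python (illustrative only; the function name is mine, but the field name `anomaly_probability` matches the column `ML.DETECT_ANOMALIES` returns):

```python
def flag_anomalies(rows, threshold=0.95):
    """Mark a row as an anomaly when its anomaly probability exceeds
    the threshold, mirroring the role of `anomaly_prob_threshold`."""
    return [
        {**row, "is_anomaly": row["anomaly_probability"] > threshold}
        for row in rows
    ]
```

Lowering the threshold flags more rows; raising it flags only the near-certain ones.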

&lt;p&gt;And that’s it! You’ve managed to run your anomaly detector. 🥳&lt;/p&gt;

&lt;h2&gt;
  
  
  How good is BigQuery as an anomaly detector?
&lt;/h2&gt;

&lt;p&gt;After running the scripts above, you should see that the &lt;code&gt;anomaly_results&lt;/code&gt; table has been created with our results. It will look a little like this: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.theodo.com%2Fstatic%2Fb885193e0a6c71d8b87f5e32f4653bde%2F536c7%2Fanomaly-detection-results.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.theodo.com%2Fstatic%2Fb885193e0a6c71d8b87f5e32f4653bde%2F536c7%2Fanomaly-detection-results.png" alt="Results of the anomaly detection"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For this test, I used 28 lines of data and introduced 5 anomalies into the data that I imported. What you’ll notice is that BigQuery is very good at picking up the intended anomalies. However, we have a different problem this time. Instead of having issues with picking up the anomalies (which we saw when using an LLM as an anomaly detector in &lt;a href="https://blog.theodo.com/2024/01/anomaly-detection-llm/" rel="noopener noreferrer"&gt;this article&lt;/a&gt;), BigQuery has a much bigger issue with false positives. In this first experiment, there were 21 false positives! False positives create a different kind of issue. Although less severe than missing an actual anomaly (which would lead to bad data), false positives can trigger incorrect alerts and require a manual verification process, which would hinder an automated system. So what can we do?&lt;/p&gt;
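To make "catches everything but is noisy" concrete, here is the precision/recall arithmetic for this first run (28 rows, all 5 intended anomalies caught, 21 false positives); a small sketch with a helper name of my choosing:

```python
def detection_metrics(true_positives, false_positives, false_negatives):
    """Precision: share of flagged rows that are real anomalies.
    Recall: share of real anomalies that were flagged."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall

# First experiment: 5 true positives, 21 false positives, 0 missed.
precision, recall = detection_metrics(5, 21, 0)
```

Recall is perfect (1.0), but precision is only 5/26, roughly 19%, which is exactly the false-positive problem described above.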

&lt;h3&gt;
  
  
  1) Increasing the anomaly probability threshold
&lt;/h3&gt;

&lt;p&gt;When setting up the anomaly detector, there was a short introduction to the probability threshold, i.e. the point at which we consider a data point an anomaly. By raising this threshold, we are essentially saying that we want the system to be more certain a data point is an anomaly before marking it as such. However, there is a warning associated with this: the higher you raise the threshold, the higher the risk that you miss an anomaly the system is less certain about.&lt;/p&gt;

&lt;p&gt;In practice, when raising the threshold from 0.95 to 0.99999, I was pleasantly surprised to find that BigQuery still picked up 100% of the intended anomalies. The number of false positives dropped to 6 (-71%), which is a big step in the right direction! &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.theodo.com%2Fstatic%2F4963360242718ec098d61ce36a1f0c88%2F536c7%2Fincreased-threshold-results.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.theodo.com%2Fstatic%2F4963360242718ec098d61ce36a1f0c88%2F536c7%2Fincreased-threshold-results.png" alt="Effect of increasing the anomaly probability threshold"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2) Using a separate and more extensive training set
&lt;/h3&gt;

&lt;p&gt;The bigger the training set, the better it performs, right? Not quite in this case. Going with this initial thought process, I trialled using a separate and more extensive training data set to produce the model. &lt;/p&gt;

&lt;p&gt;Looking at our code above, it is relatively straightforward to change the dataset the model is trained on: you just have to update &lt;code&gt;project-name.dataset-name.table-with-data&lt;/code&gt; to point to the table with the new data you would like to train the model on. &lt;/p&gt;

&lt;p&gt;Next, you’ll also have to update the anomaly detector. As you may have noticed, in the code above we don’t specify the data on which the anomaly detection is performed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="s1"&gt;'SELECT TIMESTAMP AS time_stamp, '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;', is_anomaly AS '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;anomaly_column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;', anomaly_probability AS '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;anomaly_prob_column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;  
&lt;span class="s1"&gt;'  FROM ML.DETECT_ANOMALIES('&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
&lt;span class="s1"&gt;'    MODEL `'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;model_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;'`, STRUCT(0.95 AS anomaly_prob_threshold)'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
&lt;span class="s1"&gt;'  )'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is because, implicitly, BigQuery detects anomalies on the same dataset that was used to train the model. To detect anomalies on a new dataset, you’ll have to modify the line above to have:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="s1"&gt;'SELECT TIMESTAMP AS time_stamp, '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;', is_anomaly AS '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;anomaly_column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;', anomaly_probability AS '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;anomaly_prob_column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;  
&lt;span class="s1"&gt;'  FROM ML.DETECT_ANOMALIES('&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
&lt;span class="s1"&gt;'    MODEL `'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;model_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;'`, STRUCT(0.95 AS anomaly_prob_threshold), (SELECT * FROM `dataset-name.table-with-data-to-analyse`)'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
&lt;span class="s1"&gt;'  )'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;where &lt;code&gt;table-with-data-to-analyse&lt;/code&gt; is the name of the table which has the anomalies you would like to detect. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Word of warning:&lt;/strong&gt; the timestamp of your data is important when doing this! The data you want to analyse must have a timestamp that is in the future compared to your training data. This makes sense, as you use past data to predict future trends; it is a design choice made by the BigQuery ML team. If you are curious about the reasoning behind this, &lt;a href="https://www.googlecloudcommunity.com/gc/Data-Analytics/ML-Detect-Anomalies-returns-null-values-when-using-data-from-a/m-p/694853" rel="noopener noreferrer"&gt;here&lt;/a&gt; is a discussion from within the Google Cloud Community space with more details. &lt;/p&gt;
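A quick sanity check before running the detector can save you a confusing empty result. This is a sketch of my own (field name assumed to match the `time_stamp` column used throughout):

```python
def check_timestamp_ordering(training_rows, analysis_rows, ts_field="time_stamp"):
    """Return True when every row to analyse is strictly later than all
    of the training data, which is what detecting anomalies on a
    separate dataset expects."""
    latest_training = max(row[ts_field] for row in training_rows)
    earliest_analysis = min(row[ts_field] for row in analysis_rows)
    return earliest_analysis > latest_training
```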

&lt;p&gt;Doing this did not improve the number of false positives. In fact, it actually increased the number of false positives from 6 (achieved by raising the threshold) to 20! There are lots of potential reasons behind this. Overfitting is a common cause and reflects the importance of tuning your model. Before jumping into this, let’s have a look at the current state of our results. Once again, all the intended anomalies were detected, so let’s just look at the number of false positives:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.theodo.com%2Fstatic%2Fa718ae1a41cc26c6db3a9fc31802b162%2F536c7%2Fusing-training-set-results.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.theodo.com%2Fstatic%2Fa718ae1a41cc26c6db3a9fc31802b162%2F536c7%2Fusing-training-set-results.png" alt="Effect of using training data on the number of false positives"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3) Tuning our model using Non-Seasonal Order terms
&lt;/h3&gt;

&lt;p&gt;The non-seasonal ARIMA model we are using is a combination of differencing, autoregression and a moving average model. When running the model above, by default, we are using &lt;code&gt;AUTO_ARIMA = True&lt;/code&gt;. What this means is that the training automatically finds the best non-seasonal order values. There are three values that need to be set:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;p&lt;/strong&gt; - the order of the autoregressive part, a value between 0 and 5&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;d&lt;/strong&gt; - the degree of differencing, a value between 0 and 2&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;q&lt;/strong&gt; - the order of the moving average part, a value between 0 and 5&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To understand more about these values, don’t hesitate to have a look at &lt;a href="https://otexts.com/fpp2/non-seasonal-arima.html" rel="noopener noreferrer"&gt;this documentation&lt;/a&gt;. &lt;/p&gt;
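The &lt;strong&gt;d&lt;/strong&gt; term, for instance, is just repeated differencing of the series. A minimal sketch of the transformation it stands for (my own helper, not BigQuery code):

```python
def difference(series, d=1):
    """Apply d-th order differencing: each pass replaces the series
    with the gaps between consecutive values (the 'd' in ARIMA(p, d, q)).
    Differencing removes trends so the remaining AR (p) and MA (q)
    terms can model a stationary series."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series
```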

&lt;p&gt;In code, we’ll need to switch &lt;code&gt;AUTO_ARIMA&lt;/code&gt; to be &lt;code&gt;False&lt;/code&gt; before we give the new values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="s1"&gt;'CREATE OR REPLACE MODEL `dataset-name.data_arima_model_'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;'` '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
&lt;span class="s1"&gt;'OPTIONS(MODEL_TYPE="ARIMA_PLUS", TIME_SERIES_DATA_COL="'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;'", TIME_SERIES_TIMESTAMP_COL="name_of_your_column_with_your_timestamp", AUTO_ARIMA=False, NON_SEASONAL_ORDER=(&amp;lt;p&amp;gt;, &amp;lt;d&amp;gt;, &amp;lt;q&amp;gt;)) AS '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
&lt;span class="s1"&gt;'SELECT name_of_your_column_with_your_timestamp, '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;' FROM `project-name.dataset-name.table-with-data`'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When experimenting with a smaller dataset (to understand the relative impact of the variation of the terms), this is what we see:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.theodo.com%2Fstatic%2F07a61e3879de4d6161a8459d1497102f%2F0d390%2Fnon-seasonal-order-terms-results.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.theodo.com%2Fstatic%2F07a61e3879de4d6161a8459d1497102f%2F0d390%2Fnon-seasonal-order-terms-results.png" alt="Effect of non-seasonal order terms on number of false positives"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What we are most interested in is the combination which resulted in the lowest number of false positives. When bringing this back to our larger dataset, some of these combinations didn’t perform as well as in the smaller experiment, but the best combination of values was &lt;code&gt;p=0&lt;/code&gt;, &lt;code&gt;d=0&lt;/code&gt; and &lt;code&gt;q=1&lt;/code&gt;, which corresponds to a moving average model. In our particular case, this provided the best results, but you should always fine-tune these values for your application as this combination will not always be the best one! &lt;/p&gt;

&lt;p&gt;Using this method, the number of false positives dropped to only &lt;strong&gt;1 false positive&lt;/strong&gt; (when using our optimisation from part 1 but not 2) and 9 false positives (when using both part 1 and 2).&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Overall, when using BigQuery for numerical data, the algorithm is able to pick up all intended anomalies with relative ease. The key concern is the number of false positives. In itself, this isn’t a terrible issue to have, as you would rather over-estimate than under-estimate the number of anomalies when improving your data quality. However, it does make things like automation more difficult, especially if you need to manually validate that a data point is not an anomaly. &lt;/p&gt;

&lt;p&gt;By experimenting, we found that the best way to improve our utilisation of the BigQuery ML anomaly detector is by combining an increase in the threshold of what is considered an anomaly and fine-tuning the non-seasonal order terms. This resulted in a &lt;strong&gt;95% decrease&lt;/strong&gt; in the number of false positives.&lt;/p&gt;

&lt;p&gt;If you liked this article, don’t hesitate to check out the first article in this anomaly detector series: &lt;a href="https://blog.theodo.com/2024/01/anomaly-detection-llm/" rel="noopener noreferrer"&gt;Step-by-Step Guide to building an Anomaly Detector using a LLM&lt;/a&gt; and have a look at &lt;strong&gt;&lt;a href="https://twitter.com/ChloeCaronEng" rel="noopener noreferrer"&gt;@ChloeCaronEng&lt;/a&gt;&lt;/strong&gt; for more exploratory journeys!&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>datascience</category>
      <category>bigquery</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Step-by-Step Guide to building an Anomaly Detector using a LLM</title>
      <dc:creator>Chloe Caron</dc:creator>
      <pubDate>Wed, 10 Jan 2024 12:36:59 +0000</pubDate>
      <link>https://dev.to/chloecaroneng/step-by-step-guide-to-building-an-anomaly-detector-using-a-llm-522n</link>
      <guid>https://dev.to/chloecaroneng/step-by-step-guide-to-building-an-anomaly-detector-using-a-llm-522n</guid>
      <description>&lt;p&gt;You’ve probably heard about the importance of data quality being shouted from every rooftop. Bad data is a recipe for disaster. Certain companies have specialised in finding anomalies in your data and flagging it, much like &lt;a href="https://www.siffletdata.com/"&gt;Sifflet&lt;/a&gt; and &lt;a href="https://www.elementary-data.com/"&gt;Elementary&lt;/a&gt;. But what if we wanted to build an anomaly detector which works for any type of data, structured or unstructured, textual or numeric, include it in our data pipeline and do automated transformation steps… the possibilities are endless! &lt;/p&gt;

&lt;p&gt;I’ve created a series of articles following the journey of building an anomaly detector for various situations, featuring OpenAI and BigQuery. In this first article, we’ll focus on building an anomaly detector using a LLM (Large Language Model), more specifically OpenAI.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is an anomaly detector?
&lt;/h2&gt;

&lt;p&gt;In simple words, an anomaly detector identifies &lt;a href="https://blog.theodo.com/2023/12/avoid-revenue-hit-from-bad-or-no-data/"&gt;bad quality data&lt;/a&gt;. In more technical terms, anomaly detection is used to identify significant deviations from the normal behavioural pattern. By identifying these faults, you can take action against potential sources of bad data and ensure that your data use remains un-skewed, robust and defect-free.&lt;/p&gt;
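As a toy illustration of "significant deviation from the normal pattern", here is the simplest deviation-based rule, a z-score check (my own sketch, not the LLM or BigQuery approach explored in this series):

```python
from statistics import mean, stdev

def zscore_anomalies(values, threshold=3.0):
    """Flag the indices of values sitting more than `threshold`
    standard deviations away from the mean of the series."""
    mu, sigma = mean(values), stdev(values)
    return [i for i, v in enumerate(values) if abs(v - mu) > threshold * sigma]
```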

&lt;h2&gt;
  
  
  Why use LLMs like OpenAI?
&lt;/h2&gt;

&lt;p&gt;We want our anomaly detector to be as flexible as possible and to be able to identify anomalies no matter how structured the data is, how it is formatted, where it originates from, etc. It may be rare for you to have this number of uncertainties, but when you do, a conventional approach to anomaly detection may be more difficult to apply. Conventional approaches include using models like auto-encoders, ARIMA (auto-regressive integrated moving average), etc. Another point is - curiosity! LLMs are becoming more and more popular, but how efficient are they at more complex tasks such as anomaly detection and at what cost? That’s what we are about to find out.&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 1: Building a basic OpenAI anomaly detection
&lt;/h2&gt;

&lt;p&gt;To start, make sure that you have everything set up to use OpenAI. You can find more information on how to get started in &lt;a href="https://medium.com/data-professor/beginners-guide-to-openai-api-a0420bc58ee5"&gt;this article&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;To test that everything is working, we can try out some relatively simple code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;completion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-3.5-turbo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a data analyser. You spot any anomaly in the data received.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Here is the data input I have: {&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;: 1, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1946-01-03&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cost&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;3.0&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;}, {&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;: 2, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1852-03-04&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cost&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;3.0&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;}, {&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;: 2, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span 
class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1852-03-04&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cost&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;-1.0&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When running this, OpenAI does relatively well. It can pick up on issues like the one above, where the cost is a negative value. But this seems too easy; what if we give it more complex data? &lt;/p&gt;

&lt;p&gt;As soon as you start to look at more lines of data with more complex anomalies, accuracy starts to drop. For this experiment, let’s use data from the &lt;a href="https://www.kaggle.com/datasets/albertovidalrod/electricity-consumption-uk-20092022"&gt;Electricity Consumption UK 2009-2023 dataset&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We can test the anomaly detector using different scenarios, varying the following parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Model&lt;/strong&gt; used: either GPT 3.5 Turbo 1106 or GPT 4 1106 Preview&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Number of anomalies&lt;/strong&gt;: 0, 1, or 2&lt;/li&gt;
&lt;li&gt;Number of lines of &lt;strong&gt;training data&lt;/strong&gt;: 0, 10, 50 or 100&lt;/li&gt;
&lt;li&gt;Number of lines of &lt;strong&gt;test data&lt;/strong&gt;: 4, 21 or 80&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In a surprisingly high number of cases, OpenAI could find none of the anomalies! While a lot of the experiments gave a low accuracy, there are some combinations that shone through. The key statistics are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPT 4 tended to perform much better than GPT 3.5&lt;/li&gt;
&lt;li&gt;The more anomalies you have, the more difficult it was for the model to find them&lt;/li&gt;
&lt;li&gt;The number of lines of data didn’t have a significant impact on the accuracy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To give you an idea of the outcome, here are the combinations which had a non-zero accuracy:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GGgzjM2A--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.theodo.com/static/bb6bffa66023b96242fa8b4b7f8be2ab/536c7/openai-base-anomaly-detector-experiment.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GGgzjM2A--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.theodo.com/static/bb6bffa66023b96242fa8b4b7f8be2ab/536c7/openai-base-anomaly-detector-experiment.png" alt="Results from experiment using a simple anomaly detector" width="800" height="1279"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The results aren’t great but we are far from done! As is, we wouldn’t be able to use OpenAI as an anomaly detector. Our experiments are on a small scale. As the dataset we analyse grows along with the number of anomalies, our accuracy will very much suffer. Keep in mind that throughout this article we are testing our anomaly detector with simple datasets and results may differ depending on the dataset you use and the type of anomalies you have.&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 2: Using Prompt Engineering to boost our accuracy from 32% to 64%
&lt;/h2&gt;

&lt;p&gt;The idea behind prompt engineering is to construct the queries given to the language models to optimise their performance. This helps to guide them to generate the desired output by fine-tuning their response. There is a plethora of research papers out there on different forms of prompt engineering. &lt;a href="http://DAIR.AI"&gt;DAIR.AI&lt;/a&gt; published a &lt;a href="https://www.promptingguide.ai/"&gt;guide on prompt engineering&lt;/a&gt; that you might find useful to get started.&lt;/p&gt;

&lt;p&gt;After trying a variety of different prompt engineering methods, there are 3 key techniques which stood out. &lt;/p&gt;

&lt;h3&gt;
  
  
  Chain of Thought
&lt;/h3&gt;

&lt;p&gt;When using Chain-of-Thought (CoT) prompting, we are encouraging the model to break down complex processes or “thoughts” into smaller steps before giving a response. CoT can be combined with other prompting techniques (e.g. few-shot as we’ll see later), but we can also apply it on its own. The key is to get the model to think step by step. A way to do this is by adding the following to your prompt:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Let’s think step by step&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;By doing so, you are encouraging your model to break down its thinking. So what does our solution look like now?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;csv&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Step 1: Consider you have a CSV, convert the data into a stringified JSON
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;reader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;data_with_anomaly&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path/to/file.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 2: Determine the messages to send to the model
&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a data analyser which spots any anomaly 
                                        in the data received. You will be given data in the form 
                                        of a CSV. There can be no anomaly but there can also be 
                                        multiple anomalies. Let’s think step by step. First work out 
                                        the schema of the data you receive. Then compare the data you 
                                        have to the schema you determined. Don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t decide what is an 
                                        anomaly until you have figured out the schema.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Here is the data to analyse, what are the anomalies? Please give me the line number with the anomaly. Make sure to remember on which line of the CSV the anomaly was (ignore the first line since these are the column titles): &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
      &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;data_with_anomaly&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Step 3: Get the response from the model
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# Add the model you would like to use, e.g. gpt-3.5-turbo-1106 or gpt-4-1106-preview
&lt;/span&gt;      &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 4: Show the anomaly
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To keep our experiments fair, we will fix certain parameters: when applying the different prompt engineering techniques, we will keep the number of anomalies in our data at 2 and use 20 lines of data in the CSV. These numbers were chosen arbitrarily.&lt;/p&gt;
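&lt;p&gt;For illustration, the test files could be generated with something along these lines (a minimal sketch; the file name, column names, value ranges and anomaly type here are assumptions, not the exact scripts used in these experiments):&lt;/p&gt;

```python
import csv
import random

def make_dataset(path, rows=20):
    # Hypothetical numerical dataset, loosely mirroring the demand data above
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["ND", "TSD", "ENGLAND_WALES_DEMAND"])
        for _ in range(rows):
            writer.writerow([random.randint(20000, 30000),
                             random.randint(24244, 48195),
                             random.randint(15000, 25000)])

def inject_anomalies(path, count=2):
    # Overwrite one cell in `count` distinct rows with an out-of-range value
    with open(path) as f:
        data = [row for row in csv.reader(f)]
    target_rows = random.sample(range(1, len(data)), count)  # skip the header row
    for r in target_rows:
        data[r][random.randrange(len(data[r]))] = "-1"
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(data)
    return sorted(target_rows)  # ground truth to score the detector against

make_dataset("experiment.csv")
anomaly_rows = inject_anomalies("experiment.csv")
```

&lt;p&gt;Keeping the returned row numbers around is what lets us check the detector’s answers later on.&lt;/p&gt;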

&lt;p&gt;By applying CoT to the anomaly detector, GPT 4 performed &lt;strong&gt;8% better&lt;/strong&gt;, while GPT 3.5 did not show any significant change.&lt;/p&gt;

&lt;h3&gt;
  Few-Shot Prompting
&lt;/h3&gt;

&lt;p&gt;With few-shot prompting, we provide examples of how the model should respond to the prompts we give it. This is valuable for more complex tasks such as anomaly detection, since the examples demonstrate to the model what a good response looks like for a given prompt. &lt;/p&gt;

&lt;p&gt;In our case, we can give the model example data with an anomaly and then give it the expected response with the correct anomalies being identified. Let’s give it a shot!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;csv&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;reader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;data_with_anomaly&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path/to/file.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 1: Extract data from three CSVs with example data inside them
&lt;/span&gt;&lt;span class="n"&gt;data_with_anomaly_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bad_data_example_1.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;data_with_anomaly_2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bad_data_example_2.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;data_with_no_anomaly&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data_with_no_anomaly.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 2: Define the anomalies present in each file with the reasoning behind it
&lt;/span&gt;&lt;span class="n"&gt;expected_response_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Taking my time to look through the data, I noticed the following:
1. In row 1, the value for &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ND&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; is zero. In all the other rows, the &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ND&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; value is non-zero. This is an anomaly.
2. In row 3, the value for &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ENGLAND_WALES_DEMAND&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; is a negative value. In all the other rows, this is a positive value. This is an anomaly.
3. In row 6, the value for &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;EMBEDDED_SOLAR_CAPACITY&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; is missing. In all the other rows, this value is present. This is an anomaly.
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="n"&gt;expected_response_2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Taking my time to look through the data, I noticed the following:
1. In row 2, the value for &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;TSD&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; is 100. In all the other rows, this value varies between 24244 and 48195. Since this value is very out of range, it is an anomaly.
2. In row 3, the value for &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;EMBEDDED_WIND_GENERATION&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; is 10. In all the other rows, this value varies between 24244 and 48195. Since this value is very out of range, it is an anomaly.
3. In row 7, the value for &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;EMBEDDED_SOLAR_GENERATION&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; is a negative value. In all the other rows, this is a positive value. This is an anomaly.
4. In row 10, the value for &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ENGLAND_WALES_DEMAND&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; and &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;EMBEDDED_WIND_GENERATION&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; are missing. In all the other rows, this value is present. This is an anomaly.
5. In row 12, the value for &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;NON_BM_STOR&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; is not zero. In all the other rows, this value is zero. This is an anomaly.
6. In row 15, the value for &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;MOYLE_FLOW&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; is zero. In all the other rows, this value is not zero. This is an anomaly.
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="n"&gt;expected_response_no_anomaly&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;After comparing the values of each row to each other, all the data seems to be consistent with each other, I cannot find an anomaly.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Step 3: Let us adapt the messages we send to the model with this information
&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a data analyser which spots any anomaly 
                                        in the data received. You will be given data in the form 
                                        of a CSV. There can be no anomaly but there can also be 
                                        multiple anomalies.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Here is the data to analyse: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;data_with_anomaly_1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;expected_response_1&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Here is the data to analyse: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;data_with_no_anomaly&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;expected_response_no_anomaly&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Here is the data to analyse, what are the anomalies? Please give me the line number with the anomaly. Make sure to remember on which line of the CSV the anomaly was (ignore the first line since these are the column titles): &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
      &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;data_with_anomaly&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Step 4: As before, get the response from the model
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# Add the model you would like to use, e.g. gpt-3.5-turbo-1106 or gpt-4-1106-preview
&lt;/span&gt;      &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 5: As before, show the anomaly
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Try giving this a run and you’ll notice that few-shot prompting leads to a massive jump when using GPT 4: the accuracy of the anomaly detector goes &lt;strong&gt;from 32% to 56%&lt;/strong&gt;! Unfortunately, it stays relatively constant for GPT 3.5. &lt;/p&gt;

&lt;h3&gt;
  Self-reflection &amp;amp; multi-step
&lt;/h3&gt;

&lt;p&gt;For our last experiment with prompt engineering, we’re going to try splitting our model into multiple steps and pushing it to self-reflect on its findings. &lt;/p&gt;

&lt;p&gt;We are going to split our detector into three steps, each with its own responsibility. The first step aims to find the anomalies, similar to what we saw above. The second step triggers a self-reflection: it presents the previous answer back to the model and asks it “Are you sure?”. The last step converts the response to JSON, standardising how the anomalies are reported so that we can use the response moving forward.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cuKbBGwr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.theodo.com/static/be4431d08ff68ff5568a844029eb97ec/92338/multi-step-diagram.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cuKbBGwr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.theodo.com/static/be4431d08ff68ff5568a844029eb97ec/92338/multi-step-diagram.png" alt="Diagram showing the split in the steps" width="800" height="935"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s see what this looks like in code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# All the code above still applies
&lt;/span&gt;&lt;span class="bp"&gt;...&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;  &lt;span class="c1"&gt;# needed for json.dumps in the conversion step below&lt;/span&gt;

&lt;span class="n"&gt;first_answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;

&lt;span class="c1"&gt;# Step 1: Append our previous answer to our messages
&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;first_answer&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;# Step 2: Add some self reflection
&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Are you sure? How confident are you? Take a deep breathe and tell me the anomalies when you are more sure.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; 

&lt;span class="c1"&gt;# Step 3: Run the model
&lt;/span&gt;&lt;span class="n"&gt;completion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# Add the model you would like to use, e.g. gpt-3.5-turbo-1106 or gpt-4-1106-preview
&lt;/span&gt;        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response_with_self_reflection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;completion&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;

&lt;span class="c1"&gt;# Step 4: Convert your answer to a JSON response
&lt;/span&gt;&lt;span class="n"&gt;json_structure&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anomaly_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;number of anomalies found in the text, of type &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;int&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anomalies&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;the row number where the anomaly was found. Ignore the first row of the CSV since it has the column names. Of type &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;int&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;why this is considered an anomaly, of type &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;json_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-3.5-turbo-1106&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# Can use a simpler model to do the conversion 
&lt;/span&gt;    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You convert a text response into a json object. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I want you to only give responses in the form &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json_structure&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Convert this response to a json object: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response_with_self_reflection&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;response_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json_object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 5: Display the results
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json_response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Voila! You’ve now got few-shot learning integrated along with self-reflection &amp;amp; multi-step. When running the experiment without few-shot (i.e. only using self-reflection &amp;amp; multi-step), GPT 4 saw an improvement from 32% to 60% (unfortunately, GPT 3.5 actually saw a drop in performance). &lt;/p&gt;

&lt;p&gt;By combining CoT, few-shot, self-reflection &amp;amp; multi-step, the overall performance of &lt;strong&gt;GPT 4 goes from 32% to 68%&lt;/strong&gt;! On the other hand, the performance of GPT 3.5 doesn’t really see any significant improvement. If we focus on GPT 4, here is a summary of how prompt engineering impacted the accuracy of our anomaly detector:&lt;/p&gt;
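&lt;p&gt;For reference, accuracy here can be scored by comparing the rows the model flags against the rows we actually corrupted. A minimal sketch, assuming the JSON output format from the multi-step code above (the example model outputs and ground-truth rows below are hypothetical):&lt;/p&gt;

```python
import json

def score_run(json_text, true_rows):
    # A run counts as correct only if the flagged rows match the ground truth exactly
    found = {anomaly["id"] for anomaly in json.loads(json_text)["anomalies"]}
    return found == set(true_rows)

# Hypothetical model outputs paired with the rows we actually corrupted
runs = [
    ('{"anomaly_count": 2, "anomalies": [{"id": 3, "description": "negative value"},'
     ' {"id": 7, "description": "missing value"}]}', [3, 7]),
    ('{"anomaly_count": 1, "anomalies": [{"id": 5, "description": "zero value"}]}', [5, 9]),
]
accuracy = sum(score_run(text, truth) for text, truth in runs) / len(runs)
print(f"{accuracy:.0%}")  # → 50%
```

&lt;p&gt;Repeating this over many generated datasets gives an accuracy percentage like the ones quoted above.&lt;/p&gt;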

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZtjcFjbS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.theodo.com/static/132a9aad70cb8b67a13a0aa73cb5e905/b1001/results-summary.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZtjcFjbS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.theodo.com/static/132a9aad70cb8b67a13a0aa73cb5e905/b1001/results-summary.png" alt="Summary of the results when applying prompt engineering" width="800" height="998"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Pretty good! But 68% accuracy is not amazing, and GPT 3.5 seems to really struggle to improve its accuracy as an anomaly detector. Why is this the case? If we take a closer look at our data, we see that we are using numerical data:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--npidWIBT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.theodo.com/static/1bcf510bb0f49b2ee14f9ed76c9e23a6/536c7/example-numerical-data.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--npidWIBT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.theodo.com/static/1bcf510bb0f49b2ee14f9ed76c9e23a6/536c7/example-numerical-data.png" alt="Example of numerical data" width="800" height="354"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A quick search will show that the GPT models are bad at maths. This is to be expected: they are language models and aren’t trained on mathematical concepts.&lt;/p&gt;

&lt;p&gt;We can test this theory by swapping our numerical data for text-based data.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does data type impact the accuracy of our anomaly detector?
&lt;/h3&gt;

&lt;p&gt;Let’s try to swap out our data for movie reviews instead.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Vr1-Lxhh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.theodo.com/static/49fdbbd1a10785befad2ce575132370d/536c7/example-text-data-with-anomalies.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Vr1-Lxhh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.theodo.com/static/49fdbbd1a10785befad2ce575132370d/536c7/example-text-data-with-anomalies.png" alt="Example of text data with anomalies" width="800" height="232"&gt;&lt;/a&gt;&lt;/p&gt;
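&lt;p&gt;To make the setup concrete, here is a hedged sketch of what text rows with injected anomalies could look like, together with a crude heuristic check. The reviews and the digit/empty-string heuristic below are made up for illustration; they are not the article’s actual dataset or detection method:&lt;/p&gt;

```python
# Illustrative rows only: the real movie-review dataset differs.
movie_reviews = [
    {"id": 1, "review": "A heartfelt drama with a stunning final act."},
    {"id": 2, "review": "Slow start, but the cast carries it through."},
    {"id": 3, "review": "qwrtz 9981 lorem"},  # garbled entry (anomaly)
    {"id": 4, "review": ""},                  # empty review (anomaly)
]

# Crude placeholder heuristic: flag empty reviews or ones containing digits.
anomalous_ids = [
    row["id"]
    for row in movie_reviews
    if not row["review"].strip() or any(ch.isdigit() for ch in row["review"])
]
```

&lt;p&gt;The LLM is asked to flag rows like 3 and 4, which read as obviously broken to a language model even though no numerical reasoning is involved.&lt;/p&gt;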

&lt;p&gt;When running our anomaly detector on this data (using the prompt engineering techniques above), we see that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPT 3.5 has an accuracy which jumps from 16% (numerical data) to &lt;strong&gt;78%&lt;/strong&gt; (text-based data)&lt;/li&gt;
&lt;li&gt;GPT 4 has an accuracy which jumps from 68% (numerical) to &lt;strong&gt;~100%&lt;/strong&gt; (text-based)&lt;/li&gt;
&lt;/ul&gt;
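&lt;p&gt;For reference, accuracy figures like these can be computed by comparing the detector’s flags against hand-labelled ground truth. A minimal sketch, with hypothetical flags for a ten-row sample:&lt;/p&gt;

```python
# Sketch: accuracy is the fraction of rows where the detector's verdict
# matches the hand-written label. The flags below are made up.
def accuracy(predicted_flags, true_flags):
    """Fraction of rows where the prediction matches the label."""
    matches = sum(p == t for p, t in zip(predicted_flags, true_flags))
    return matches / len(true_flags)

# True means "flagged as an anomaly".
predicted = [True, False, False, True, False, False, True, False, False, False]
labels    = [True, False, False, False, False, False, True, False, True, False]

print(accuracy(predicted, labels))  # 0.8 on this made-up sample
```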

&lt;p&gt;This is an incredible jump and reflects how much the data type matters: choose your model based on the kind of data your application handles. &lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;OpenAI can be used to build an effective anomaly detector for text-based data. Our model can be optimised using techniques such as prompt engineering. On the other hand, while we can improve the performance of our anomaly detector for numerical data, it is still relatively limited. So what can we use? Keep an eye out for the next article on using BigQuery for anomaly detection on numerical data!&lt;/p&gt;

&lt;p&gt;This article is part of a series on the journey of building an anomaly detector. I share the progress of the journey regularly on X (formerly Twitter), have a look at &lt;a href="https://twitter.com/ChloeCaronEng"&gt;@ChloeCaronEng&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>data</category>
      <category>openai</category>
      <category>llm</category>
      <category>dataquality</category>
    </item>
  </channel>
</rss>
