<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Priscilla Parodi</title>
    <description>The latest articles on DEV Community by Priscilla Parodi (@priscilla_parodi).</description>
    <link>https://dev.to/priscilla_parodi</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F672257%2F113fbaee-42e5-4b6b-8283-eab4eff00e7e.jpeg</url>
      <title>DEV Community: Priscilla Parodi</title>
      <link>https://dev.to/priscilla_parodi</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/priscilla_parodi"/>
    <language>en</language>
    <item>
      <title>NLP and Elastic: Getting started</title>
      <dc:creator>Priscilla Parodi</dc:creator>
      <pubDate>Thu, 02 Jun 2022 21:13:34 +0000</pubDate>
      <link>https://dev.to/elastic/natural-language-processing-15fj</link>
      <guid>https://dev.to/elastic/natural-language-processing-15fj</guid>
      <description>&lt;p&gt;| &lt;a href="https://dev.to/priscilla_parodi/a-guide-to-machine-learning-ai-with-the-elastic-stack-1dkl"&gt;Menu&lt;/a&gt; | &lt;a href="https://dev.to/elastic/nlp-handson-4f90"&gt;Next Post: NLP HandsOn&lt;/a&gt; |&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Natural language processing (NLP)&lt;/strong&gt; is the branch of artificial intelligence (AI) that focuses on understanding human language as closely as possible to human interpretation, combining computational linguistics with statistical, machine learning and deep learning models.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwy77qmvnlsb6oyavpdze.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwy77qmvnlsb6oyavpdze.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Some examples of NLP tasks:&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Named entity recognition&lt;/strong&gt; is a type of information extraction, identifying words or phrases as entities.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcquav4p0h8gheba0zhwv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcquav4p0h8gheba0zhwv.png" alt="Image description"&gt;&lt;/a&gt;(&lt;a href="https://huggingface.co/dslim/bert-base-NER" rel="noopener noreferrer"&gt;&lt;em&gt;model used&lt;/em&gt;&lt;/a&gt;)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sentiment analysis&lt;/strong&gt; is a type of text classification, attempting to extract subjective emotions from text.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3oseiqe4zrzrbvpywsa3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3oseiqe4zrzrbvpywsa3.png" alt="Image description"&gt;&lt;/a&gt;(&lt;a href="https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english" rel="noopener noreferrer"&gt;&lt;em&gt;model used&lt;/em&gt;&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;There are many more NLP tasks that can be useful, depending on your use case.&lt;/p&gt;

&lt;h2&gt;BERT&lt;/h2&gt;

&lt;p&gt;In 2018, Google open-sourced a new technique for pre-training NLP models called &lt;a href="https://www.youtube.com/watch?v=2lR8Fzays4I" rel="noopener noreferrer"&gt;BERT&lt;/a&gt; (Bidirectional Encoder Representations from Transformers).&lt;/p&gt;

&lt;p&gt;BERT relies on “transfer learning”: linguistic representations are pre-trained once and then reused for downstream tasks. Pre-training refers to how BERT was first trained with unsupervised learning on a large volume of plain text, drawn from the BooksCorpus (800 million words) and English Wikipedia (2,500 million words); earlier models typically required manually labeled data.&lt;/p&gt;

&lt;p&gt;BERT was pre-trained on two tasks: masked language modeling (15% of tokens were masked and BERT was trained to predict them from context) and next sentence prediction (BERT was trained to predict whether a candidate next sentence actually follows the first sentence). With this understanding, BERT can be fine-tuned for many other NLP tasks with relatively little additional effort.&lt;/p&gt;
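&lt;p&gt;&lt;em&gt;To make the masked-language-modeling idea concrete, here is a minimal, self-contained sketch of the masking step (illustrative only; BERT's actual procedure also replaces some chosen tokens with random words or leaves them unchanged):&lt;/em&gt;&lt;/p&gt;

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    # Pick ~15% of positions and hide them behind a [MASK] token;
    # the hidden words become the labels the model learns to predict.
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = masked[pos]
        masked[pos] = "[MASK]"
    return masked, labels

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, labels = mask_tokens(tokens)
```

&lt;p&gt;&lt;em&gt;A model trained this way has to use the surrounding context to recover each hidden word, which is what forces it to learn contextual representations.&lt;/em&gt;&lt;/p&gt;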

&lt;p&gt;By capturing intent and context, and not just keywords, such models can understand text in a way that is much closer to how humans understand it.&lt;/p&gt;

&lt;h2&gt;NLP with Elastic&lt;/h2&gt;

&lt;p&gt;To enable NLP tasks, Elastic supports the PyTorch library, one of the most popular machine learning libraries, for models that use the same tokenizer as BERT and are built on neural network designs such as the Transformer architecture that BERT uses.&lt;/p&gt;

&lt;p&gt;In general, any trained model that has a supported architecture is deployable in Elasticsearch, including BERT and variants.&lt;/p&gt;

&lt;p&gt;These models are grouped by NLP task. &lt;a href="https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-overview.html" rel="noopener noreferrer"&gt;Currently&lt;/a&gt;, the supported tasks are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-extract-info.html" rel="noopener noreferrer"&gt;Extract information&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Named entity recognition&lt;br&gt;
Fill-mask&lt;br&gt;
Question answering&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-classify-text.html" rel="noopener noreferrer"&gt;Classify text&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Language identification&lt;br&gt;
Text classification&lt;br&gt;
Zero-shot text classification&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-search-compare.html" rel="noopener noreferrer"&gt;Search and compare text&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Text embedding&lt;br&gt;
Text similarity&lt;/p&gt;

&lt;p&gt;As in the cases of &lt;a href="https://dev.to/priscilla_parodi/elastic-data-frame-classification-analysis-1b3f"&gt;classification&lt;/a&gt; and &lt;a href="https://dev.to/priscilla_parodi/elastic-data-frame-regression-analysis-2ge2"&gt;regression&lt;/a&gt;, when a &lt;a href="https://dev.to/priscilla_parodi/trained-models-for-supervised-learning-154n"&gt;trained model&lt;/a&gt; is imported you can use it to make predictions (&lt;a href="https://dev.to/priscilla_parodi/elastic-data-frame-inference-processor-handson-3392"&gt;inference&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: For NLP tasks you must choose and deploy a third-party NLP model. The exception is language identification, for which a trained model, &lt;code&gt;lang_ident_model_1&lt;/code&gt;, is already provided in the cluster.&lt;/em&gt;&lt;/p&gt;
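&lt;p&gt;&lt;em&gt;For example, a quick way to try the built-in language identification model is to simulate an ingest pipeline with an inference processor (a sketch; by default &lt;code&gt;lang_ident_model_1&lt;/code&gt; reads the &lt;code&gt;text&lt;/code&gt; field):&lt;/em&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      { "inference": { "model_id": "lang_ident_model_1" } }
    ]
  },
  "docs": [
    { "_source": { "text": "Bonjour tout le monde" } }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;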
&lt;h2&gt;NLP with Elastic Solutions&lt;/h2&gt;

&lt;p&gt;There are many possible use cases for adding NLP capabilities to your Elastic project; here are some examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Security&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Spam detection: Text classification capabilities are useful for scanning emails for language that often indicates spam, allowing content to be blocked or deleted and preventing malicious emails from reaching users.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8t4dljvgvf6yapd9ug52.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8t4dljvgvf6yapd9ug52.png" alt="Image description"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PUT spam-detection/_doc/1
{
  "email subject": "Camera - You are awarded a SiPix Digital Camera! Call 09061221066. Delivery within 28 days.",
  "is_spam": true
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Enterprise Search&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Analysis of unstructured text: Entity recognition is useful for structuring text data, adding new fields to your documents and allowing you to analyze more of your data and extract even more valuable insights.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fksptdxozxqx09ugsh3vk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fksptdxozxqx09ugsh3vk.png" alt="Image description"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PUT /source-index
{
  "mappings": {
    "properties": {
      "input":    { "type": "text" }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PUT /new-index
{
  "mappings": {
    "properties": {
      "input":    { "type": "text" },  
      "organization":  { "type": "keyword"  }, 
      "location":   { "type": "keyword"  }     
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
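&lt;p&gt;&lt;em&gt;To connect the two indices, one possible approach (a sketch; the model ID is a placeholder, and &lt;code&gt;field_map&lt;/code&gt; tells the processor that the document's &lt;code&gt;input&lt;/code&gt; field is the model's expected &lt;code&gt;text_field&lt;/code&gt;) is to create an ingest pipeline with an inference processor and reindex through it:&lt;/em&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PUT _ingest/pipeline/ner-pipeline
{
  "processors": [
    {
      "inference": {
        "model_id": "&amp;lt;model_id&amp;gt;",
        "field_map": { "input": "text_field" }
      }
    }
  ]
}

POST _reindex
{
  "source": { "index": "source-index" },
  "dest": { "index": "new-index", "pipeline": "ner-pipeline" }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;em&gt;The entities the model extracts can then be copied into the &lt;code&gt;organization&lt;/code&gt; and &lt;code&gt;location&lt;/code&gt; fields, for example with a set or script processor in the same pipeline.&lt;/em&gt;&lt;/p&gt;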



&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Observability&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Service request and incident data: Extracting meaning from operational data, including ticket resolution comments, allows you to not only generate alerts during incidents, but also go further by observing your application, predicting behavior, and having more data to improve ticket resolution time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhwxuc8kv6g1ur3lbvki9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhwxuc8kv6g1ur3lbvki9.png" alt="Image description"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
  "_source": {
    "support_ticket_id": 119237,
    "customer_id": 283823,
    "timestamp": "2021-06-06T17:23:02.770Z",
    "text_field": "Response to the case was fast and problem was solved after first response, did not need to provide any additional info.",
    "ml": {
      "inference": {
        "predicted_value": "positive",
        "prediction_probability": 0.9499962712516151,
        "model_id": "heBERT_sentiment_analysis"
      }
    }
  }
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;NLP HandsOn&lt;/h2&gt;

&lt;p&gt;Now, let's proceed with an &lt;strong&gt;&lt;a href="https://dev.to/elastic/nlp-handson-4f90"&gt;end-to-end example&lt;/a&gt;&lt;/strong&gt;! To prepare for the &lt;a href="https://dev.to/elastic/nlp-handson-4f90"&gt;NLP HandsOn&lt;/a&gt;, we will need an Elasticsearch cluster running at least version 8.0 with an ML node. If you haven't created your &lt;a href="https://dev.to/priscilla_parodi/handson-setup-elastic-cloud-4j8p"&gt;Elastic Cloud Trial&lt;/a&gt; yet, now is the time.&lt;/p&gt;

&lt;p&gt;| &lt;a href="https://dev.to/priscilla_parodi/a-guide-to-machine-learning-ai-with-the-elastic-stack-1dkl"&gt;Menu&lt;/a&gt; | &lt;a href="https://dev.to/elastic/nlp-handson-4f90"&gt;Next Post: NLP HandsOn&lt;/a&gt; |&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;This post is part of a series that covers &lt;a href="https://dev.to/priscilla_parodi/a-guide-to-machine-learning-ai-with-the-elastic-stack-1dkl"&gt;Artificial Intelligence with a focus on Elastic's (Creators of Elasticsearch, Kibana, Logstash and Beats) Machine Learning solution&lt;/a&gt;, aiming to introduce and exemplify the possibilities and options available, in addition to addressing the context and usability.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>nlp</category>
      <category>tutorial</category>
      <category>elasticsearch</category>
    </item>
    <item>
      <title>NLP HandsOn</title>
      <dc:creator>Priscilla Parodi</dc:creator>
      <pubDate>Thu, 02 Jun 2022 21:13:24 +0000</pubDate>
      <link>https://dev.to/elastic/nlp-handson-4f90</link>
      <guid>https://dev.to/elastic/nlp-handson-4f90</guid>
      <description>&lt;p&gt;| &lt;a href="https://dev.to/priscilla_parodi/a-guide-to-machine-learning-ai-with-the-elastic-stack-1dkl"&gt;Menu&lt;/a&gt; | &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Note: This HandsOn assumes that you have already followed the step-by-step Setup of your &lt;a href="https://dev.to/priscilla_parodi/handson-setup-elastic-cloud-4j8p"&gt;Elastic Cloud Trial account&lt;/a&gt;, and also that you have read the blog &lt;a href="https://dev.to/elastic/natural-language-processing-15fj"&gt;NLP and Elastic: Getting started&lt;/a&gt;.&lt;br&gt;
Config: To prepare for the NLP HandsOn, we will need an Elasticsearch cluster running at least version 8.0 with an ML node.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To start using NLP in your Stack you will need a model, so the first thing we need to do is upload one into the cluster.&lt;/p&gt;

&lt;p&gt;In our &lt;a href="https://www.elastic.co/guide/en/elasticsearch/client/eland/current/overview.html" rel="noopener noreferrer"&gt;eland&lt;/a&gt; library, a Python Elasticsearch client for exploring and analyzing data in Elasticsearch, we have some simple methods and scripts that allow you to upload models from local disk, or to pull models down from the Hugging Face model hub.&lt;/p&gt;

&lt;p&gt;Once a model is uploaded into the cluster, you can allocate it to specific ML nodes; when allocation is complete, the model is ready for inference.&lt;/p&gt;

&lt;p&gt;Eland can be installed from &lt;a href="https://pypi.org/project/eland" rel="noopener noreferrer"&gt;PyPI&lt;/a&gt; via pip.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before you go any further, make sure you have Python installed.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can check this by running:&lt;/p&gt;

&lt;p&gt;Unix/macOS&lt;br&gt;
&lt;code&gt;python3 --version&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;You should get some output like:&lt;br&gt;
&lt;code&gt;Python 3.8.8&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Additionally, you’ll need to make sure you have &lt;a href="https://packaging.python.org/en/latest/key_projects/#pip" rel="noopener noreferrer"&gt;pip&lt;/a&gt; available.&lt;/p&gt;

&lt;p&gt;You can check this by running:&lt;/p&gt;

&lt;p&gt;Unix/macOS&lt;br&gt;
&lt;code&gt;python3 -m pip --version&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;You should get some output like:&lt;br&gt;
&lt;code&gt;pip 21.0.1 from …&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;If you installed Python from source, with an installer from &lt;a href="//python.org"&gt;python.org&lt;/a&gt;, or via &lt;a href="https://brew.sh/" rel="noopener noreferrer"&gt;Homebrew&lt;/a&gt;, you should already have pip.&lt;/p&gt;

&lt;p&gt;If you don't have Python and pip installed, &lt;a href="https://packaging.python.org/en/latest/tutorials/installing-packages/" rel="noopener noreferrer"&gt;install them first&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;With that, Eland can be installed from &lt;a href="https://pypi.org/project/eland/" rel="noopener noreferrer"&gt;PyPI&lt;/a&gt; via pip:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;$ python3 -m pip install eland&lt;/code&gt;&lt;/p&gt;
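&lt;p&gt;&lt;em&gt;Note: depending on your eland version, the NLP model import tooling may require extra dependencies (PyTorch and the Hugging Face libraries). If the &lt;code&gt;eland_import_hub_model&lt;/code&gt; command used later is not available, installing eland with its PyTorch extras may help:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;$ python3 -m pip install 'eland[pytorch]'&lt;/code&gt;&lt;/p&gt;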
&lt;h2&gt;Getting started&lt;/h2&gt;

&lt;p&gt;To interact with your cluster through the &lt;a href="https://eland.readthedocs.io/en/v8.1.0/reference/index.html" rel="noopener noreferrer"&gt;API&lt;/a&gt;, we will need your Elasticsearch cluster endpoint information.&lt;/p&gt;

&lt;p&gt;The endpoint looks like:&lt;br&gt;
&lt;code&gt;https://&amp;lt;user&amp;gt;:&amp;lt;password&amp;gt;@&amp;lt;hostname&amp;gt;:&amp;lt;port&amp;gt;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;To find your endpoint information, open your &lt;a href="https://cloud.elastic.co/deployments" rel="noopener noreferrer"&gt;deployment&lt;/a&gt; settings by clicking the gear icon.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe6ynpjou5lw8xx076ibo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe6ynpjou5lw8xx076ibo.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Copy your Elasticsearch endpoint as in the image below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fir6l9yqmk71ktw0vm9lb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fir6l9yqmk71ktw0vm9lb.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: If you want to try out examples with your own cluster, remember to include your endpoint URLs and authentication details.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Now add the username and password so your request can be authenticated; your endpoint will look like this:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;https://elastic:123456789@00c1f8.es.uscentral1.gcp.cloud.es.io:9243&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;username:&lt;/strong&gt; &lt;code&gt;elastic&lt;/code&gt; is a built-in superuser that grants full access to cluster management and data indices.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;password:&lt;/strong&gt; If you don't have your password, you will need to reset it and &lt;a href="https://www.elastic.co/guide/en/cloud-enterprise/2.3/ece-password-reset-elastic.html" rel="noopener noreferrer"&gt;generate a new password&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Copy your endpoint, you'll need it later.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In parallel, let's locate the first model to import.&lt;/p&gt;

&lt;p&gt;We will import the model from &lt;a href="https://huggingface.co" rel="noopener noreferrer"&gt;Hugging Face&lt;/a&gt;, an AI community for building, training, and deploying open source machine learning models.&lt;/p&gt;

&lt;p&gt;In this demo we will use an arbitrary sentiment analysis model, but feel free to import whichever model you want to use. You can read more details about &lt;a href="https://huggingface.co/bhadresh-savani/distilbert-base-uncased-emotion" rel="noopener noreferrer"&gt;this model&lt;/a&gt; on the Hugging Face webpage.&lt;/p&gt;

&lt;p&gt;Copy the model name as in the image below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fswxdak4ytf3gf7bule28.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fswxdak4ytf3gf7bule28.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now that we have all the necessary information &lt;strong&gt;(the Elasticsearch cluster endpoint and the name of the model we want to import)&lt;/strong&gt;, let's proceed by importing the model:&lt;/p&gt;

&lt;p&gt;Open your terminal and update the following command with your endpoint and model name:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;eland_import_hub_model --url https://&amp;lt;user&amp;gt;:&amp;lt;password&amp;gt;@&amp;lt;hostname&amp;gt;:&amp;lt;port&amp;gt; \
--hub-model-id &amp;lt;model_name&amp;gt; \
--task-type &amp;lt;task_type&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this case we are importing the &lt;code&gt;bhadresh-savani/distilbert-base-uncased-emotion&lt;/code&gt; model to run the &lt;code&gt;text_classification&lt;/code&gt; task.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;In the Hugging Face filters you will be able to see the task of each model. Supported values are fill_mask, ner, question_answering, text_classification, text_embedding, and zero_shot_classification.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxswewkqgmqtbtkah593y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxswewkqgmqtbtkah593y.png" alt="Image description"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;eland_import_hub_model --url https://elastic:&amp;lt;password&amp;gt;@&amp;lt;hostname&amp;gt;:&amp;lt;port&amp;gt; \
--hub-model-id bhadresh-savani/distilbert-base-uncased-emotion \
--task-type text_classification
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You will see the model being loaded directly from the Hugging Face model hub and then imported into Elasticsearch.&lt;/p&gt;

&lt;p&gt;Wait for the process to end.&lt;/p&gt;

&lt;p&gt;Let's check if the model was imported.&lt;/p&gt;

&lt;p&gt;Click &lt;code&gt;Machine Learning&lt;/code&gt; in your Kibana menu.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Froba7x9q593t01lyh6v4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Froba7x9q593t01lyh6v4.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Under model management click &lt;code&gt;Trained Models&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx7qnt7a2v27hh4jc4t18.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx7qnt7a2v27hh4jc4t18.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Your model should appear on this list, as shown in the image below; if it does not, check whether any error message appeared during the previous step.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdh6kvjkr8kk43hnwlsa1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdh6kvjkr8kk43hnwlsa1.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If your model is on this list, it was imported successfully, but you still need to start the deployment. To do this, in the last column under &lt;code&gt;Actions&lt;/code&gt;, click &lt;code&gt;Start deployment&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm6qwqlvv5uw9qz1teddq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm6qwqlvv5uw9qz1teddq.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After deploying, the &lt;code&gt;State&lt;/code&gt; column will show &lt;code&gt;started&lt;/code&gt; and under &lt;code&gt;Actions&lt;/code&gt; the &lt;code&gt;Start deployment&lt;/code&gt; option will be disabled, which means the deployment is complete.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi1yhto34cp8myjidoev4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi1yhto34cp8myjidoev4.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's test our model!&lt;/p&gt;

&lt;p&gt;Copy your model ID:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnzbk607jepdmbaaeilr5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnzbk607jepdmbaaeilr5.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In Kibana's menu, click &lt;code&gt;Dev Tools&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flfizr6wogdbj9qzto8xs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flfizr6wogdbj9qzto8xs.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This UI provides a console for interacting with the Elasticsearch REST API.&lt;/p&gt;

&lt;p&gt;We will use the &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/infer-trained-model-deployment.html#infer-trained-model-deployment-request" rel="noopener noreferrer"&gt;infer trained model deployment API&lt;/a&gt; to evaluate this model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;POST _ml/trained_models/&amp;lt;model_id&amp;gt;/deployment/_infer
{
  "docs": [{ "text_field": "&amp;lt;input&amp;gt;" }]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This &lt;code&gt;POST&lt;/code&gt; request contains a &lt;code&gt;docs&lt;/code&gt; array with a field matching your configured trained model input; typically the field name is &lt;code&gt;text_field&lt;/code&gt;. The &lt;code&gt;text_field&lt;/code&gt; value is the text you want to run inference on.&lt;/p&gt;

&lt;p&gt;In our case it will be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;POST _ml/trained_models/bhadresh-savani__distilbert-base-uncased-emotion/deployment/_infer
{
  "docs": [{ "text_field": "Elastic is the perfect platform for knowledgebase NLP applications" }]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here the &lt;code&gt;model_id&lt;/code&gt; is &lt;code&gt;bhadresh-savani__distilbert-base-uncased-emotion&lt;/code&gt; and the value I am using as a test is &lt;code&gt;Elastic is the perfect platform for knowledgebase NLP applications&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Click the play button to send the request:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbto5nhhhn6pee4b646fg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbto5nhhhn6pee4b646fg.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this case the predicted sentiment is "joy".&lt;/p&gt;

&lt;p&gt;That's it, the model is working. 🚀&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: You can run more tests to determine if this model works for what you need.&lt;/em&gt;&lt;/p&gt;
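&lt;p&gt;&lt;em&gt;If you are scripting such tests rather than using Dev Tools, a small helper can pull the prediction out of the response body. The response shape below is an abridged assumption for illustration; check the actual response from your cluster, as field names can vary across versions:&lt;/em&gt;&lt;/p&gt;

```python
# Abridged example of an _infer response body (an assumption for
# illustration; verify against your cluster/version).
response = {
    "inference_results": [
        {"predicted_value": "joy", "prediction_probability": 0.965}
    ]
}

def top_prediction(body):
    # Return (label, probability) for the first inference result.
    result = body["inference_results"][0]
    return result["predicted_value"], result["prediction_probability"]

label, probability = top_prediction(response)
```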

&lt;p&gt;To get all the statistics for your model, you can use the &lt;code&gt;_stats&lt;/code&gt; API:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;GET _ml/trained_models/&amp;lt;model_id&amp;gt;/_stats&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;Part 2: How to run this model on data being ingested&lt;/h2&gt;

&lt;p&gt;To do this, let's start by importing a .csv file into Elasticsearch, so we can run the model while the data is being ingested.&lt;/p&gt;

&lt;p&gt;It's interesting to run this kind of analysis on arbitrary real-world text, and tweets are a good use case.&lt;/p&gt;

&lt;p&gt;Recently Elon Musk announced his interest in buying Twitter, but before that he was famously active on the platform. As we have a sentiment analysis model, let's proceed with analyzing a sample of Elon's tweets.&lt;/p&gt;

&lt;p&gt;I found this dataset on &lt;a href="https://www.kaggle.com/datasets/kulgen/elon-musks-tweets?resource=download" rel="noopener noreferrer"&gt;Kaggle&lt;/a&gt;, which is a good website for locating datasets.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: We don't have a huge amount of data, only 172 KB of tweets posted between November 16, 2012 and September 29, 2017, but as this is not a research paper, that is not a problem.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Feel free to use whatever data you prefer, or even the Twitter API.&lt;/p&gt;

&lt;p&gt;Let's download this file:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwow2ov4dx39uf5dv1v6h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwow2ov4dx39uf5dv1v6h.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And import it into Elasticsearch.&lt;/p&gt;

&lt;p&gt;There are different ways to do this, but since this is a small .csv file, we can use the &lt;code&gt;Upload a file&lt;/code&gt; integration.&lt;/p&gt;

&lt;p&gt;In the Kibana menu, click &lt;code&gt;Integrations&lt;/code&gt;; you will see a list of the integrations available for collecting data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ka2xdgnc58lw05io3pn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ka2xdgnc58lw05io3pn.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Search for &lt;code&gt;Upload a file&lt;/code&gt; as in the image below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx7gsre12r11ikirgdbpf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx7gsre12r11ikirgdbpf.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And then click &lt;code&gt;Select or drag and drop a file&lt;/code&gt; and choose the CSV file you downloaded earlier, in our case &lt;code&gt;data_elonmusk.csv&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;You will see something similar to the image below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxd9e2q37zy997gil6cda.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxd9e2q37zy997gil6cda.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click &lt;code&gt;Override settings&lt;/code&gt; to rename the Tweet column to &lt;code&gt;text_field&lt;/code&gt;. As explained before, there needs to be a field that matches your configured trained model input, which is typically called &lt;code&gt;text_field&lt;/code&gt;. With this, the model will be able to identify the field to be analyzed.&lt;/p&gt;

&lt;p&gt;Rename the Tweet column/field to &lt;code&gt;text_field&lt;/code&gt;. Click &lt;code&gt;Apply&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdsjvrcfrx86olzftw2gt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdsjvrcfrx86olzftw2gt.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After the page loads, click &lt;code&gt;Import&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl40dc6yrilyp7sq0mrez.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl40dc6yrilyp7sq0mrez.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And then click &lt;code&gt;Advanced&lt;/code&gt; to edit the import process settings.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkxuvs4xzkqfde9c3n6an.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkxuvs4xzkqfde9c3n6an.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The import process has several steps:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Processing file&lt;/strong&gt; - Turning the data into NDJSON documents so they can be ingested using the bulk API&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Creating index&lt;/strong&gt; - Creating the index using the settings and mappings objects&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Creating ingest pipeline&lt;/strong&gt; - Creating the ingest pipeline using the ingest pipeline object&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Uploading data&lt;/strong&gt; - Loading data into the new Elasticsearch index&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Creating a data view&lt;/strong&gt; (index pattern) - Creating a Kibana index pattern (if the user has opted to create one)&lt;/p&gt;

&lt;p&gt;As you can see, the &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/csv-processor.html#csv-processor" rel="noopener noreferrer"&gt;CSV processor&lt;/a&gt; is being used in the ingest pipeline to import your documents.&lt;/p&gt;
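&lt;p&gt;The generated pipeline entry for that processor looks roughly like the sketch below; the source field and column names shown here (&lt;code&gt;message&lt;/code&gt;, &lt;code&gt;text_field&lt;/code&gt;, &lt;code&gt;Time&lt;/code&gt;) are illustrative and will match whatever your file and override settings produced:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "csv": {
    "field": "message",
    "target_fields": [
      "text_field",
      "Time"
    ],
    "ignore_missing": false
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;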

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgm89yqaqulgrkog509y7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgm89yqaqulgrkog509y7.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Feel free to edit the mapping or ingest pipeline.&lt;/p&gt;

&lt;p&gt;In our case we need to edit the ingest pipeline to add our previously trained and imported model.&lt;/p&gt;

&lt;p&gt;Add an inference processor referencing the model that will run on the data being ingested, as in the image below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu87kayxj9t342qiqrwem.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu87kayxj9t342qiqrwem.png" alt="Image description"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "inference": {
    "model_id": "bhadresh-savani__distilbert-base-uncased-emotion"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After that, add your index name and click Import. If for some reason it doesn't work, repeat the process and check whether you typed something incorrectly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvyxuv1tfeircaq6pplkl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvyxuv1tfeircaq6pplkl.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: What we are doing is adding your model for &lt;strong&gt;inference&lt;/strong&gt; in the &lt;strong&gt;ingest pipeline&lt;/strong&gt;, it doesn't need to be a .csv. Read more about it &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/ingest.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
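&lt;p&gt;You can also sanity-check that an ingest pipeline with this inference processor produces results by using the simulate API. A minimal sketch, assuming a pipeline named &lt;code&gt;my-inference-pipeline&lt;/code&gt; (a hypothetical name, replace it with yours):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;POST _ingest/pipeline/my-inference-pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "text_field": "Excited about the future of Tesla!"
      }
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;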

&lt;p&gt;When it finishes loading, your screen will look like mine; click &lt;code&gt;View index in Discover&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvysyz999acviwjjli77m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvysyz999acviwjjli77m.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you didn't disable &lt;code&gt;Create data view&lt;/code&gt; when you were importing data, you should be able to locate your index by the name you used. Now you can explore your index data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7y3c4dao2p648jpbak6j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7y3c4dao2p648jpbak6j.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next to the word &lt;code&gt;Documents&lt;/code&gt;, click &lt;code&gt;Field statistics&lt;/code&gt;. This is still a beta feature, but it's excellent for exploring your data. As we can see, Elon was feeling joyful in 70% of the analyzed tweets, according to this sentiment analysis model. The second most common sentiment in Elon's tweets was anger, followed by fear.&lt;/p&gt;

&lt;p&gt;Let's click on the lens button on the right side of the screen to open &lt;a href="https://www.elastic.co/guide/en/kibana/current/lens.html" rel="noopener noreferrer"&gt;Kibana Lens&lt;/a&gt; and explore this data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flp2lijn7t9z60yrgt1h2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flp2lijn7t9z60yrgt1h2.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When the screen loads, click and drag the Time field to explore this data considering the date of each tweet.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3s4ouyausor7fjv2hsr4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3s4ouyausor7fjv2hsr4.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With the time field added, some visualization suggestions will appear. I liked one of them, but instead of a 30-day interval I edited it into an annual view. Also try filtering by prediction probability between 0.90 and 1 to keep only higher-confidence results. From here you can have fun with whatever analysis you want to run.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F67i77qwdj2ujmf57j6z5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F67i77qwdj2ujmf57j6z5.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Apparently anger has increased over time, but joy remains the most common sentiment in Elon's tweets. Fear increased until the beginning of 2016 but decreased in 2017.&lt;/p&gt;

&lt;p&gt;There are always several possible interpretations of data: we need to take into account the model used, its accuracy, the quality of our data, the information we seek, the type of analysis, and our own interpretation, context, and knowledge. But I believe it's now possible to see how useful it is to analyze language.&lt;/p&gt;

&lt;p&gt;For example, try running a classification model with the inference data (which is now a new field) to predict sentiment in addition to checking for influencers. Also try importing other models and using other datasets.&lt;/p&gt;

&lt;p&gt;I also imported an &lt;a href="https://huggingface.co/dslim/bert-base-NER" rel="noopener noreferrer"&gt;NER&lt;/a&gt; model to identify entities in the same dataset, so we can start to correlate text topics (keywords) with sentiment. The year Elon talked about Tesla the most in this dataset was 2015, which coincides with the year with the greatest increase in joy.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwbam6v4cnp7nkcls4n5m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwbam6v4cnp7nkcls4n5m.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This &lt;a href="https://money.cnn.com/2015/02/12/investing/tesla-apple-elon-musk/" rel="noopener noreferrer"&gt;news article&lt;/a&gt; is from 2015, and Elon was really positive about Tesla even with the company reporting losses.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flmfgl7j6yni47fq3t6fi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flmfgl7j6yni47fq3t6fi.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Again, these are not necessarily facts. But my goal is to show a little bit of what we can do with NLP analysis and correlation (which &lt;strong&gt;does not imply causation&lt;/strong&gt; 😅).&lt;/p&gt;

&lt;h2&gt;
  
  
  Let's proceed with the last part: how do we run this model on an existing index?
&lt;/h2&gt;

&lt;p&gt;If your data is already &lt;strong&gt;indexed&lt;/strong&gt; and you want to run model inference on it without changing the original index's content, &lt;strong&gt;this is possible&lt;/strong&gt;. If this is your case, let's proceed with this test.&lt;/p&gt;

&lt;p&gt;In the Kibana menu, click &lt;code&gt;Ingest Pipelines&lt;/code&gt;, then &lt;code&gt;Create pipeline&lt;/code&gt; and &lt;code&gt;New pipeline&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feulb9ea55pmg7uda3ync.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feulb9ea55pmg7uda3ync.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Give your pipeline a &lt;code&gt;name&lt;/code&gt; and click &lt;code&gt;Add a processor&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The first step is to rename the field that will be inferred to &lt;code&gt;text_field&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;For that, add the Rename processor: under Field, add the field to be renamed, and under Target field, add &lt;code&gt;text_field&lt;/code&gt;. Then click &lt;code&gt;Add&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqyquh4cy4uxfjoodrg7l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqyquh4cy4uxfjoodrg7l.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now we will add the Inference processor: click &lt;code&gt;Add a processor&lt;/code&gt; again, and under Model ID add your model ID, in our case &lt;code&gt;bhadresh-savani__distilbert-base-uncased-emotion&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Click &lt;code&gt;Add&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Click &lt;code&gt;Create pipeline&lt;/code&gt; and copy the &lt;code&gt;name&lt;/code&gt; of your pipeline, you will need it later.&lt;/p&gt;
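&lt;p&gt;If you prefer Dev Tools over the UI, the same pipeline can be created with a single request. A sketch, assuming the original field is called &lt;code&gt;message&lt;/code&gt; and the pipeline is named &lt;code&gt;my-inference-pipeline&lt;/code&gt; (both hypothetical, adjust to your data):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PUT _ingest/pipeline/my-inference-pipeline
{
  "processors": [
    {
      "rename": {
        "field": "message",
        "target_field": "text_field"
      }
    },
    {
      "inference": {
        "model_id": "bhadresh-savani__distilbert-base-uncased-emotion"
      }
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;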

&lt;p&gt;Now open &lt;code&gt;Dev Tools&lt;/code&gt; and run the following request (adding your source index, dest index and pipeline name):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;POST _reindex
{
  "source": {
    "index": "&amp;lt;your-source-index-name&amp;gt;"
  },
  "dest": {
    "index": "&amp;lt;your-ml-dest-index-name&amp;gt;",
    "pipeline": "&amp;lt;your-pipeline-name&amp;gt;"
  }
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html#docs-reindex-filter-source" rel="noopener noreferrer"&gt;This request&lt;/a&gt; copies documents from a source to a destination. You can copy all documents to the destination index or reindex only a subset of them; you can also use source filtering to reindex a subset of the fields in the original documents.&lt;/p&gt;
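&lt;p&gt;For example, to reindex only a subset of documents through the pipeline, you can add a query to the source. A sketch, assuming the dataset's date field is called &lt;code&gt;Time&lt;/code&gt; and is mapped as a date (as in the Kaggle file used here; adjust the field and range to your data):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;POST _reindex
{
  "source": {
    "index": "&amp;lt;your-source-index-name&amp;gt;",
    "query": {
      "range": {
        "Time": {
          "gte": "2016-01-01"
        }
      }
    }
  },
  "dest": {
    "index": "&amp;lt;your-ml-dest-index-name&amp;gt;",
    "pipeline": "&amp;lt;your-pipeline-name&amp;gt;"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;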

&lt;p&gt;This will take some time, wait for the successful response as in the image below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiwq0wug49rf96y34f9mj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiwq0wug49rf96y34f9mj.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You don't yet have a &lt;a href="https://www.elastic.co/guide/en/kibana/current/data-views.html" rel="noopener noreferrer"&gt;Data View&lt;/a&gt; for this new index, and you need one to access the Elasticsearch data you want to explore. To create it, click &lt;code&gt;Stack Management&lt;/code&gt; in the Kibana menu and then click &lt;code&gt;Data Views&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Click &lt;code&gt;Create new data view&lt;/code&gt; and then, in the Name field, add the name of your new index; in my case it is &lt;code&gt;elon-output-ml&lt;/code&gt;. Click &lt;code&gt;Create data view&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Now open &lt;code&gt;Discover&lt;/code&gt; and select the new index.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbnzjexmxzupe0p7omhgd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbnzjexmxzupe0p7omhgd.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's it: without making changes to your existing index, you now have a new index with this model's results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I hope you enjoy using NLP with the Elastic Stack! Feedback is always welcome.&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;This post is part of a series that covers &lt;a href="https://dev.to/priscilla_parodi/a-guide-to-machine-learning-ai-with-the-elastic-stack-1dkl"&gt;Artificial Intelligence with a focus on Elastic's (Creators of Elasticsearch) Machine Learning solution&lt;/a&gt;, aiming to introduce and exemplify the possibilities and options available, in addition to addressing the context and usability.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>nlp</category>
      <category>tutorial</category>
      <category>elasticsearch</category>
    </item>
    <item>
      <title>Elastic Anomaly Detection - Learning Process and Anomaly Score</title>
      <dc:creator>Priscilla Parodi</dc:creator>
      <pubDate>Mon, 02 Aug 2021 14:39:44 +0000</pubDate>
      <link>https://dev.to/elastic/elastic-anomaly-detection-learning-process-and-anomaly-score-3nl7</link>
      <guid>https://dev.to/elastic/elastic-anomaly-detection-learning-process-and-anomaly-score-3nl7</guid>
      <description>&lt;p&gt;| &lt;a href="https://dev.to/priscilla_parodi/a-guide-to-machine-learning-ai-with-the-elastic-stack-1dkl"&gt;Menu&lt;/a&gt; | &lt;a href="https://dev.to/priscilla_parodi/elastic-anomaly-detection-categorization-5cdg"&gt;Next Post: Elastic Anomaly Detection - Categorization&lt;/a&gt; |&lt;/p&gt;

&lt;p&gt;As the name suggests, the algorithm needs to identify anomalies in the data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But how does the model identify anomalies?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;How do &lt;strong&gt;we&lt;/strong&gt; identify anomalies? &lt;/p&gt;

&lt;p&gt;For example, consider the image below.&lt;/p&gt;

&lt;p&gt;What is abnormal in this image?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7p8u1jgvnezedq5mtqbx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7p8u1jgvnezedq5mtqbx.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What if I add something to this image? Now, considering the updated image below, what is abnormal?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxuje9mwxjmc1n379t5re.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxuje9mwxjmc1n379t5re.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It was probably easier with the second image: the cat is not a dog, which makes the cat, for most people, the anomaly in this image. In this process, you are identifying patterns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Identifying patterns&lt;/strong&gt; is an essential part of our learning process, but the answers are not necessarily obvious, because you know what a cat is and what a dog is, not from the pictures I showed you, I never told you this, but because you learned it during your life.&lt;/p&gt;

&lt;p&gt;We must always remember that the algorithms will only process the data that you choose to share.&lt;/p&gt;

&lt;p&gt;In the case of a child who is still learning the difference between a dog and a cow, for example, we might receive the answer that all animals in the image belong to the same category: animals. This answer is not incorrect; it simply applies different criteria based on similar characteristics observed in the available data.&lt;/p&gt;

&lt;p&gt;If we are seeking a more specific answer, considering all possible details, variables, and behavior, we need to ensure that all data that could contribute to the answer is analyzed over time. The more data we have, the better our understanding will be.&lt;/p&gt;

&lt;p&gt;In the case of a child, for them to identify the cat as 'abnormal' they would need more examples, more data would need to be “analyzed” over time. &lt;strong&gt;The conclusion is the same for the algorithms.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Based on this information, you already know that the question 'What is abnormal?' is answered by taking into account what is considered normal (which can vary), and to determine what is normal, &lt;strong&gt;the algorithm identifies patterns over time.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There are multiple types of Anomaly Detection analyses available in Elastic's ML solution, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Single Metric analysis&lt;/strong&gt;, for jobs that analyze a single time series;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multi-Metric analysis&lt;/strong&gt;, to split a single time series into multiple time series;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Population analysis&lt;/strong&gt;, to identify abnormal behaviors in a homogeneous "population" over a period of time;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://dev.to/priscilla_parodi/elastic-anomaly-detection-categorization-5cdg"&gt;&lt;strong&gt;Categorization analysis&lt;/strong&gt;&lt;/a&gt;, which is a machine learning process that tokenizes a text field, clusters similar data together, and classifies it into categories;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Anomaly Detection feature analyzes the input stream of data, models its behavior using techniques to construct a model that best matches your data, and performs analysis based on the detectors you defined in your job, considering possible rules and dates you want to ignore or disqualify from being modeled.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpx5hjvkc066r639nivt5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpx5hjvkc066r639nivt5.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The blue line in the chart represents the actual data values, while the shaded blue area represents the bounds for the expected values. Initially, the range of expected values is wide due to a limited amount of data in the analyzed time period. Consequently, the model fails to capture the periodicity in the data.&lt;/p&gt;

&lt;p&gt;After processing more data, a model is built with coefficients that result in expected values close to the actual values. This leads to the shaded blue area being close to the blue line. By comparing the values to this area, we can determine if they fall outside of it and monitor the anomaly score to indicate the severity of potential anomalies.&lt;/p&gt;

&lt;h3&gt;
  
  
  Anomaly Score
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;anomaly score&lt;/strong&gt; (severity) is a value from 0 to 100, which indicates the significance of the observed anomaly compared to previously seen anomalies. Highly anomalous values are shown in red.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqxsrhv1xaavfigaayosh.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqxsrhv1xaavfigaayosh.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F064rrp0bs9v32pjn4uzb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F064rrp0bs9v32pjn4uzb.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In order to provide a sensible view of the results, an anomaly score is calculated for each &lt;strong&gt;bucket time interval&lt;/strong&gt; (we use the concept of a bucket to divide up a continuous stream of data into batches, between 10 minutes and 1 hour, for processing).&lt;/p&gt;

&lt;p&gt;When you review your machine learning results, there is a &lt;code&gt;multi_bucket_impact&lt;/code&gt; property that indicates how strongly the final anomaly score is influenced by multi-bucket analysis; anomalies with medium or high impact on multiple buckets are represented with a cross symbol instead of a circle.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk3oxpfwo22imt26014cg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk3oxpfwo22imt26014cg.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;| &lt;a href="https://dev.to/priscilla_parodi/a-guide-to-machine-learning-ai-with-the-elastic-stack-1dkl"&gt;Menu&lt;/a&gt; | &lt;a href="https://dev.to/priscilla_parodi/elastic-anomaly-detection-categorization-5cdg"&gt;Next Post: Elastic Anomaly Detection - Categorization&lt;/a&gt; |&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;This post is part of a series that covers &lt;a href="https://dev.to/priscilla_parodi/a-guide-to-machine-learning-ai-with-the-elastic-stack-1dkl"&gt;Artificial Intelligence with a focus on Elastic's (Creators of Elasticsearch) Machine Learning solution&lt;/a&gt;, aiming to introduce and exemplify the possibilities and options available, in addition to addressing the context and usability.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>machinelearning</category>
      <category>elasticsearch</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Elastic Anomaly Detection - Categorization</title>
      <dc:creator>Priscilla Parodi</dc:creator>
      <pubDate>Mon, 02 Aug 2021 14:39:32 +0000</pubDate>
      <link>https://dev.to/elastic/elastic-anomaly-detection-categorization-5cdg</link>
      <guid>https://dev.to/elastic/elastic-anomaly-detection-categorization-5cdg</guid>
      <description>&lt;p&gt;| &lt;a href="https://dev.to/priscilla_parodi/a-guide-to-machine-learning-ai-with-the-elastic-stack-1dkl"&gt;Menu&lt;/a&gt; | &lt;a href="https://dev.to/priscilla_parodi/elastic-anomaly-detection-and-data-visualizer-handson-3c2j"&gt;Next Post: Elastic Anomaly Detection and Data Visualizer HandsOn&lt;/a&gt;|&lt;/p&gt;

&lt;p&gt;For categorization analysis, the learning process &lt;a href="https://dev.to/priscilla_parodi/elastic-anomaly-detection-learning-process-and-anomaly-score-3nl7"&gt;is the same&lt;/a&gt;, but there are other steps to process the text.&lt;/p&gt;

&lt;p&gt;The input data must be a text field, typically containing repeated elements such as log messages, because this is not natural language processing &lt;a href="https://en.wikipedia.org/wiki/Natural_language_processing" rel="noopener noreferrer"&gt;(NLP)&lt;/a&gt; and it works best on machine-written messages.&lt;/p&gt;

&lt;p&gt;When you create a categorization anomaly detection job, the machine learning model processes the input text into different categories, identifying patterns over time, as you can see in this example:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Input text&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Log message:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Jul 20 15:02:19 localhost sshd[8903]: Invalid user admin from 58.218.92.41 port 26062
Jul 20 15:02:19 localhost sshd[8903]: input_userauth_request: invalid user admin [preauth]
Jul 20 15:02:20 localhost sshd[8903]: Connection closed by 58.218.92.41 port 26062 [preauth]
Jul 20 17:10:23 localhost sshd[2074]: Received disconnect from 41.43.112.199 port 41805:11: disconnected by user
Jul 20 17:10:23 localhost sshd[2074]: Disconnected from 41.43.112.199 port 26062
Jul 20 17:10:23 localhost sshd[2072]: pam_unix (sshd:session): session closed for user ec2-user
Jul 20 19:14:55 localhost sshd[8944]: pam_unix (sshd:session): session closed for user ec2-user by (uid=0)
Jul 20 19:17:22 localhost runner: pam_unix(runuser-1:session): session closed for user ec2-user 
Jul 20 19:17:22 localhost runner: pam_unix(runuser-1:session): session opened for user ec2-user by (uid=0)
Jul 20 19:17:23 localhost runner: pam_unix(runuser-1:session): session closed for user ec2-user 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 1 - Remove mutable text&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Mutable text is excluded so that anomalies or patterns are not identified where there is no relevance, since these values, e.g. dates and times, are always changing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;localhost sshd: Invalid user from port
localhost sshd: input_userauth_request: invalid user [preauth]
localhost sshd: Connection closed by port [preauth]
localhost sshd: Received disconnect from port disconnected by user
localhost sshd: Disconnected from port
localhost sshd: pam_unix session: session closed for user ec2-user
localhost sshd: pam_unix session: session closed for user ec2-user by (uid=0)
localhost runner: pam_unix session: session closed for user ec2-user 
localhost runner: pam_unix session: session opened for user ec2-user by (uid=0)
localhost runner: pam_unix session: session closed for user ec2-user 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
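&lt;p&gt;To make the idea concrete, here is a minimal Python sketch of this step. It is an illustration only, not Elastic's actual implementation: it strips a few common mutable tokens (timestamps, process ids, IP addresses, bare numbers) with regular expressions.&lt;/p&gt;

```python
import re

# Patterns for tokens that change on every message; a sketch only, not
# Elastic's actual tokenizer (which also drops other variable tokens).
MUTABLE = [
    re.compile(r"^\w{3} \d{1,2} \d{2}:\d{2}:\d{2} "),  # leading syslog timestamp
    re.compile(r"\[\d+\]"),                            # process id, e.g. sshd[8903]
    re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"),        # IPv4 address
    re.compile(r"\b\d+\b"),                            # remaining bare numbers (ports, uids)
]

def strip_mutable(line: str) -> str:
    """Remove mutable tokens, keeping only the stable shape of the message."""
    for pattern in MUTABLE:
        line = pattern.sub("", line)
    return re.sub(r"\s+", " ", line).strip()

print(strip_mutable(
    "Jul 20 17:10:23 localhost sshd[2074]: Disconnected from 41.43.112.199 port 26062"
))
# -> localhost sshd: Disconnected from port
```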



&lt;p&gt;&lt;strong&gt;Step 2 - Cluster similar messages together&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A category can correspond to a single line or to several lines that are part of the same task, for example, and follow a common pattern.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;-&amp;gt;mlcategory:1&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;localhost sshd: Invalid user from port&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;-&amp;gt;mlcategory:2&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;localhost sshd: input_userauth_request: invalid user [preauth]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;-&amp;gt;mlcategory:3&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;localhost sshd: Connection closed by port [preauth]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;-&amp;gt;mlcategory:4&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;localhost sshd: Received disconnect from port disconnected by user&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;-&amp;gt;mlcategory:5&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;localhost sshd: Disconnected from port&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;-&amp;gt;mlcategory:6&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;localhost sshd: pam_unix session: session closed for user ec2-user&lt;br&gt;
localhost sshd: pam_unix session: session closed for user ec2-user by (uid=0)&lt;br&gt;
localhost runner: pam_unix session: session closed for user ec2-user &lt;br&gt;
localhost runner: pam_unix session: session opened for user ec2-user by (uid=0)&lt;br&gt;
localhost runner: pam_unix session: session closed for user ec2-user&lt;/code&gt;&lt;/p&gt;
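&lt;p&gt;A toy version of this clustering step in Python, shown only to illustrate the idea: it assigns an incremental mlcategory id to each distinct message template. Elastic actually clusters &lt;em&gt;similar&lt;/em&gt; (not only identical) strings, so near-duplicates can share a category.&lt;/p&gt;

```python
def categorize(templates):
    """Assign an incremental category id to each distinct message template.
    Toy sketch: real categorization groups similar, not just identical, strings."""
    seen = {}
    return [seen.setdefault(t, len(seen) + 1) for t in templates]

print(categorize([
    "localhost sshd: Invalid user from port",
    "localhost sshd: Received disconnect from port disconnected by user",
    "localhost sshd: Invalid user from port",
]))
# -> [1, 2, 1]
```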

&lt;p&gt;&lt;strong&gt;Step 3 - Count per time bucket&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;By analyzing counts per time bucket, the behavior of each cluster can be more easily identified for anomaly checking.&lt;/p&gt;

&lt;p&gt;In the image below you can see an example of the behavior of each mlcategory over time, for further time bucket analysis:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2n0fblbgzsda6nnvv3sv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2n0fblbgzsda6nnvv3sv.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As an example, at a specific time bucket, we could see an mlcategory:1 followed by an mlcategory:4, twice:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;mlcategory:1 -&amp;gt; mlcategory:4 -&amp;gt; mlcategory:1 -&amp;gt; mlcategory:4&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;We could call it bucket 1, for reference, then bucket 2, and so on.&lt;/p&gt;
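&lt;p&gt;The counting step can be sketched in a few lines of Python. This is illustrative only; the 3-hour span and the event stream below are made up:&lt;/p&gt;

```python
from collections import Counter, defaultdict

def bucket_counts(events, span=10_800):
    """Count each mlcategory per fixed-width time bucket (span in seconds;
    10,800 s = 3 h). events: iterable of (epoch_seconds, mlcategory) pairs."""
    counts = defaultdict(Counter)
    for ts, category in events:
        counts[ts // span][category] += 1
    return dict(counts)

# Hypothetical stream: mlcategory 1 followed by 4, twice, inside one bucket,
# then a single mlcategory 6 in the next bucket.
events = [(0, 1), (100, 4), (200, 1), (300, 4), (11_000, 6)]
print(bucket_counts(events))
# -> {0: Counter({1: 2, 4: 2}), 1: Counter({6: 1})}
```

The anomaly detector then models these per-bucket counts for every category over time, flagging buckets whose counts deviate from the learned behavior.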

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzml0mskmfvr95y8dudjy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzml0mskmfvr95y8dudjy.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;| &lt;a href="https://dev.to/priscilla_parodi/a-guide-to-machine-learning-ai-with-the-elastic-stack-1dkl"&gt;Menu&lt;/a&gt; | &lt;a href="https://dev.to/priscilla_parodi/elastic-anomaly-detection-and-data-visualizer-handson-3c2j"&gt;Next Post: Elastic Anomaly Detection and Data Visualizer HandsOn&lt;/a&gt; |&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;This post is part of a series that covers &lt;a href="https://dev.to/priscilla_parodi/a-guide-to-machine-learning-ai-with-the-elastic-stack-1dkl"&gt;Artificial Intelligence with a focus on Elastic's (Creators of Elasticsearch) Machine Learning solution&lt;/a&gt;, aiming to introduce and exemplify the possibilities and options available, in addition to addressing the context and usability.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>machinelearning</category>
      <category>elasticsearch</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Elastic Anomaly Detection and Data Visualizer HandsOn</title>
      <dc:creator>Priscilla Parodi</dc:creator>
      <pubDate>Mon, 02 Aug 2021 14:39:16 +0000</pubDate>
      <link>https://dev.to/elastic/elastic-anomaly-detection-and-data-visualizer-handson-3c2j</link>
      <guid>https://dev.to/elastic/elastic-anomaly-detection-and-data-visualizer-handson-3c2j</guid>
      <description>&lt;p&gt;| &lt;a href="https://dev.to/priscilla_parodi/a-guide-to-machine-learning-ai-with-the-elastic-stack-1dkl"&gt;Menu&lt;/a&gt; | &lt;a href="https://dev.to/priscilla_parodi/elastic-data-frame-outlier-detection-16ni"&gt;Next Post: Elastic Data Frame - Outlier Detection&lt;/a&gt; |&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Note: This HandsOn assumes that you have already followed the step-by-step Setup of your Elastic Cloud account and added the Samples available there to replicate the analysis mentioned here. If not, please, &lt;a href="https://dev.to/priscilla_parodi/handson-setup-elastic-cloud-4j8p"&gt;follow the steps mentioned there&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We've added a &lt;a href="https://dev.to/priscilla_parodi/handson-setup-elastic-cloud-4j8p"&gt;Sample&lt;/a&gt; containing data you don't know anything about.&lt;/p&gt;

&lt;p&gt;As mentioned earlier, it is very &lt;strong&gt;important&lt;/strong&gt; to know the data and the types of data we have in order to know what &lt;strong&gt;kind of analysis we can do.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So, before proceeding with Anomaly Detection we can, for example, use &lt;strong&gt;Data Visualizer&lt;/strong&gt; &lt;code&gt;Kibana&amp;gt;Machine Learning&amp;gt;Data visualizer&lt;/code&gt; to &lt;strong&gt;understand more&lt;/strong&gt; about the available data. &lt;/p&gt;

&lt;p&gt;In a real use case you may prefer to change some fields, mappings, or wrong/empty data, but in this case we will use the available data exactly as it is.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxrr9gk0fd0gt17emd6x7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxrr9gk0fd0gt17emd6x7.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's select the Index Pattern &lt;code&gt;[eCommerce] Orders&lt;/code&gt;. When the page loads you usually won't see any data, because the time interval initially set (15 minutes) is too short for this type of data, which is not continuously being updated. To access all available data, click &lt;code&gt;Use full kibana_sample_data_ecommerce_data&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh7zmjks6vvtg7h3123rc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh7zmjks6vvtg7h3123rc.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now we have some important information: the data type (text/keyword/geo_point/number...), the % of documents containing each field, distinct values, distributions, and the option to visualize the data in a graph (&lt;code&gt;Actions&lt;/code&gt;), something we won't cover at this point. This can help us with the most important part before analyzing data: What answer don't we have? What is this data not telling us that would be important to know, based on the needs that I/my company have?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi5v5kxvavanrxltc0qn0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi5v5kxvavanrxltc0qn0.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's pretend we are an e-commerce owner and we want to analyze our data. The first thing we can notice is that we only have 1 month of data, &lt;code&gt;May 26, 2021 -&amp;gt; Jun 26, 2021&lt;/code&gt;, which may not be enough for a complete analysis, because holidays and events can momentarily change customer behavior.&lt;/p&gt;

&lt;p&gt;Maybe it's our first month with the company. We could do a full analysis after a few years, possibly mapping and adding rules to skip holidays and momentary events, but for now what we want is to start an analysis that will be useful at this point, clearly with limited possibilities.&lt;/p&gt;

&lt;p&gt;Something that makes sense is to understand who our customers are. After all, in just 1 month we already have a total of 4675 events with 3321 distinct values for &lt;code&gt;customer_full_name.keyword&lt;/code&gt;, which means we have a good number of unique customers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0atcakxl1feiexvt3m5o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0atcakxl1feiexvt3m5o.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Another thing that stands out is that these customers are from different continents and countries; we are a global e-commerce company. But how does this &lt;strong&gt;population&lt;/strong&gt; spend its money? Do they &lt;strong&gt;behave similarly&lt;/strong&gt;?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5d3ienbf63bo7mit67xb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5d3ienbf63bo7mit67xb.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's run an Anomaly Detection analysis: go to &lt;code&gt;Machine Learning&amp;gt;Anomaly Detection&amp;gt;Create job&lt;/code&gt;, select &lt;code&gt;[eCommerce] Orders&lt;/code&gt;, then click &lt;code&gt;Use a Wizard&amp;gt;Population&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Again, let’s use the full sample: click &lt;code&gt;Use full kibana_sample_data_ecommerce_data&amp;gt;Next&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;To define a population we need a relationship between the data: not necessarily the same value for all documents, but at least one field in common that characterizes the data as belonging to the same population.&lt;/p&gt;

&lt;p&gt;For the Population field, let's use some location data: &lt;code&gt;geoip.region_name&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Now, considering a population made up of groups from different regions, we want to know how these people spend their money. So, as the metric for identifying abnormal behavior, let's add the sum of &lt;code&gt;taxful_total_price&lt;/code&gt;, to understand whether there is abnormal behavior in the total amount spent over time.&lt;/p&gt;

&lt;p&gt;Your screen should look like this image:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4d75uvfw52g1vhhf1vkj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4d75uvfw52g1vhhf1vkj.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can also set the Bucket Span on the left side. The default is 15 minutes, but you can click &lt;code&gt;Estimate bucket span&lt;/code&gt; to automatically set a good interval for time series analysis based on your data. In my case the estimated span was 3h.&lt;/p&gt;

&lt;p&gt;On the right side you will see the &lt;code&gt;Influencers&lt;/code&gt;, where you can add other fields whose influence on the result you want to see; as you can see, the region will already be there.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe6a3yagdrnniz0ir5nkh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe6a3yagdrnniz0ir5nkh.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click Next and then define a name for the Job ID; I used &lt;code&gt;pop_price_region&lt;/code&gt;. At this time we are not going to add additional or advanced settings. After that, click Next and your webpage should look like this image:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc0d8zycoo74jgmarm9hw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc0d8zycoo74jgmarm9hw.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click Next one more time if your webpage looks like this, otherwise check the error message. Finally, click Create Job. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fej0tn6jcy87wo1vhv5cf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fej0tn6jcy87wo1vhv5cf.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;
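&lt;p&gt;As a side note, the same job can be created without the wizard through the documented &lt;code&gt;PUT _ml/anomaly_detectors/pop_price_region&lt;/code&gt; API. A sketch of the request body is below; it mirrors the wizard choices above, and assumes &lt;code&gt;order_date&lt;/code&gt; as the time field of the sample index (a datafeed pointing at the index would still need to be created before starting the job):&lt;/p&gt;

```json
{
  "analysis_config": {
    "bucket_span": "3h",
    "detectors": [
      {
        "function": "sum",
        "field_name": "taxful_total_price",
        "over_field_name": "geoip.region_name"
      }
    ],
    "influencers": ["geoip.region_name"]
  },
  "data_description": { "time_field": "order_date" }
}
```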

&lt;p&gt;After loading, click View Results. In this case we don't want the job running in real time; if you did, you could enable that.&lt;/p&gt;

&lt;p&gt;A new page will load and, as you can see, we don't have an Anomaly Score &amp;gt; 75, which means we don't have any high-severity events, but we do have two anomalies &amp;gt; 50, shown in orange.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9n65ghu5sgjrih2816fr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9n65ghu5sgjrih2816fr.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The two events with &lt;code&gt;severity&amp;gt;50&lt;/code&gt;, taking into account the population and not just a single metric, came from New York on June 17th 2021 (Current: $998.88 / Typical: $117.59, 8x higher, Probability: 0.00148...) and from Cairo Governorate on June 21st 2021 (Current: $885.97 / Typical: $118.45, 7x higher, Probability: 0.00234). Although all this detailed information is important, it is worth remembering that the severity value is a &lt;strong&gt;normalized value&lt;/strong&gt; from 0-100 that considers the behavior of the whole population in the analyzed period of time. This means that a single data point, alone, is not necessarily relevant; it is possible to find other purchases with a value also 7x higher but with less relevance, for example.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fetoltwcts99c4fym3lds.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fetoltwcts99c4fym3lds.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you want to leave this job running, as mentioned above, you just need to click &lt;code&gt;Machine Learning&amp;gt;Anomaly Detection&amp;gt;(your job)&amp;gt;Start datafeed&lt;/code&gt;. Set the start date and the end time, selecting &lt;code&gt;no end time&lt;/code&gt; to search in real time, and then click Start.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgw0a8qer5oox8ef538u1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgw0a8qer5oox8ef538u1.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also &lt;a href="https://www.elastic.co/guide/en/machine-learning/7.13/ml-configuring-alerts.html" rel="noopener noreferrer"&gt;create alerts&lt;/a&gt; based on severity and connect to services like Email, &lt;em&gt;IBM Resilient, Jira, Microsoft Teams, Slack&lt;/em&gt; or even write to an index or create a &lt;em&gt;Webhook&lt;/em&gt; connector. There are also &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/ml-apis.html" rel="noopener noreferrer"&gt;APIs&lt;/a&gt; to perform machine learning anomaly detection activities.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg22h806b0c8h9bewpm2u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg22h806b0c8h9bewpm2u.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;| &lt;a href="https://dev.to/priscilla_parodi/a-guide-to-machine-learning-ai-with-the-elastic-stack-1dkl"&gt;Menu&lt;/a&gt; | &lt;a href="https://dev.to/priscilla_parodi/elastic-data-frame-outlier-detection-16ni"&gt;Next Post: Elastic Data Frame - Outlier Detection&lt;/a&gt; |&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;This post is part of a series that covers &lt;a href="https://dev.to/priscilla_parodi/a-guide-to-machine-learning-ai-with-the-elastic-stack-1dkl"&gt;Artificial Intelligence with a focus on Elastic's (Creators of Elasticsearch) Machine Learning solution&lt;/a&gt;, aiming to introduce and exemplify the possibilities and options available, in addition to addressing the context and usability.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>machinelearning</category>
      <category>elasticsearch</category>
      <category>tutorial</category>
      <category>beginners</category>
    </item>
    <item>
      <title>HandsOn Setup - Elastic Cloud</title>
      <dc:creator>Priscilla Parodi</dc:creator>
      <pubDate>Mon, 02 Aug 2021 14:39:00 +0000</pubDate>
      <link>https://dev.to/elastic/handson-setup-elastic-cloud-4j8p</link>
      <guid>https://dev.to/elastic/handson-setup-elastic-cloud-4j8p</guid>
      <description>&lt;p&gt;| &lt;a href="https://dev.to/priscilla_parodi/a-guide-to-machine-learning-ai-with-the-elastic-stack-1dkl"&gt;Menu&lt;/a&gt; |&lt;/p&gt;

&lt;p&gt;For HandsOn posts we will use Elastic Cloud. If you don't use Elastic Cloud yet, on &lt;a href="https://ela.st/latam-community"&gt;this link&lt;/a&gt; you can access a 30-day trial. Just add your email to start using your free trial as you can see in the image below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HmqMi_O9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7t21s89n4rac5dyl5u5u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HmqMi_O9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7t21s89n4rac5dyl5u5u.png" alt="Alt Text" width="800" height="329"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After creating your account you will not have any deployments available. To create a new one, click Create Deployment as in the image below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JwMdJzUt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mt5pxurjjjbg1vomwz52.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JwMdJzUt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mt5pxurjjjbg1vomwz52.png" alt="Alt Text" width="800" height="172"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Feel free to choose the settings you prefer, but before creating your deployment you need to customize it to add an ML node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QhcXxS6J--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uv9qaazl3hizr081hlau.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QhcXxS6J--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uv9qaazl3hizr081hlau.png" alt="Alt Text" width="800" height="460"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FCFiS8Mb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tjrl6p93cyd1ev6bhals.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FCFiS8Mb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tjrl6p93cyd1ev6bhals.png" alt="Alt Text" width="800" height="217"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then you can finally create your deployment and open Kibana. At this point, don't worry about the other settings.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hYDJSktD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/r2gj6ujprlgin4lyyq0j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hYDJSktD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/r2gj6ujprlgin4lyyq0j.png" alt="Alt Text" width="800" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When Kibana opens you will see a message like the one in the first image below, suggesting that you start adding data. To replicate the data analysis demonstrated in this series, you will need to add all the available samples.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iIopDPWG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dkyrk0ra5ir4ymghkjia.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iIopDPWG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dkyrk0ra5ir4ymghkjia.png" alt="Alt Text" width="800" height="988"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3blQ8Jgs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3afo2dr4fxb2qexkvf1e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3blQ8Jgs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3afo2dr4fxb2qexkvf1e.png" alt="Alt Text" width="800" height="306"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's it. Now you can proceed with the data analysis examples.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lgmKuMts--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vp2anvl2mix8optzmrch.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lgmKuMts--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vp2anvl2mix8optzmrch.png" alt="Alt Text" width="800" height="753"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/priscilla_parodi/elastic-anomaly-detection-and-data-visualizer-handson-3c2j"&gt;Elastic Anomaly Detection and Data Visualizer HandsOn&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/priscilla_parodi/elastic-data-frame-classification-analysis-handson-4ilb"&gt;Elastic Data Frame - Classification Analysis HandsOn&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/priscilla_parodi/elastic-data-frame-inference-processor-handson-3392"&gt;Elastic Data Frame - Inference Processor HandsOn&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;This post is part of a series that covers &lt;a href="https://dev.to/priscilla_parodi/a-guide-to-machine-learning-ai-with-the-elastic-stack-1dkl"&gt;Artificial Intelligence with a focus on Elastic's (Creators of Elasticsearch, Kibana, Logstash and Beats) Machine Learning solution&lt;/a&gt;, aiming to introduce and exemplify the possibilities and options available, in addition to addressing the context and usability.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>beginners</category>
      <category>tutorial</category>
      <category>elasticsearch</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Elastic Data Frame - Outlier Detection</title>
      <dc:creator>Priscilla Parodi</dc:creator>
      <pubDate>Mon, 02 Aug 2021 14:38:34 +0000</pubDate>
      <link>https://dev.to/elastic/elastic-data-frame-outlier-detection-16ni</link>
      <guid>https://dev.to/elastic/elastic-data-frame-outlier-detection-16ni</guid>
      <description>&lt;p&gt;| &lt;a href="https://dev.to/priscilla_parodi/a-guide-to-machine-learning-ai-with-the-elastic-stack-1dkl"&gt;Menu&lt;/a&gt; | &lt;a href="https://dev.to/priscilla_parodi/elastic-data-frame-regression-analysis-2ge2"&gt;Next Post: Elastic Data Frame - Regression Analysis&lt;/a&gt; |&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Unlike the &lt;a href="https://dev.to/priscilla_parodi/elastic-anomaly-detection-learning-process-and-anomaly-score-3nl7"&gt;Anomaly Detection&lt;/a&gt; models, this is a multivariate analysis: it enables a better understanding of complex behaviors that are described by many features. For this analysis we have 3 models with different algorithms and learning types (Outlier, &lt;a href="https://dev.to/priscilla_parodi/elastic-data-frame-regression-analysis-2ge2"&gt;Regression&lt;/a&gt; and &lt;a href="https://dev.to/priscilla_parodi/elastic-data-frame-classification-analysis-1b3f"&gt;Classification&lt;/a&gt;), and in this post we'll talk about Outlier Detection.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Outlier detection&lt;/strong&gt; identifies unusual data points in the dataset (Unsupervised ML).&lt;/p&gt;

&lt;p&gt;When we talk about time series modeling and population anomaly detection, we look for outliers based on how far a metric is from its modeled normal behavior.&lt;/p&gt;

&lt;p&gt;With Outlier Detection we look at clusters of data and evaluate density and distance using multivariate analysis. We are not interested in tracking the evolution of this dataset over time, as we do in population anomaly detection, and there are no buckets.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SRNjE1cj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cl024oq275dcoie8yya7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SRNjE1cj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cl024oq275dcoie8yya7.png" alt="Alt Text" width="800" height="718"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  Evaluation of Outlier Detection
&lt;/h3&gt;

&lt;p&gt;Outliers may denote errors or unusual behavior. In the Elastic Stack, we use an ensemble of four different distance- and density-based outlier detection methods. Based on this approach, a metric called the local outlier factor is computed for each data point: the higher the local outlier factor, the more outlying the data point.&lt;/p&gt;
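&lt;p&gt;To give an intuition for the distance-based part of that ensemble, here is a small Python sketch of one classic distance-based score: the distance from each point to its k-th nearest neighbour. This is an illustration of the idea only, not Elastic's implementation.&lt;/p&gt;

```python
import math

def kth_nn_distance(points, k=2):
    """Outlier score: distance from each point to its k-th nearest neighbour.
    The larger the distance, the more isolated (outlying) the point."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# A tight cluster plus one far-away point: the last score stands out.
data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = kth_nn_distance(data)
print(scores[-1] > max(scores[:-1]))
# -> True
```

The local outlier factor used by Elastic goes further, comparing each point's local density with that of its neighbours, but the intuition is the same: isolated points score higher.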

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ssxA61tl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5bzxtmxc2eokycb2y4rb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ssxA61tl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5bzxtmxc2eokycb2y4rb.png" alt="Alt Text" width="800" height="611"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;| &lt;a href="https://dev.to/priscilla_parodi/a-guide-to-machine-learning-ai-with-the-elastic-stack-1dkl"&gt;Menu&lt;/a&gt; | &lt;a href="https://dev.to/priscilla_parodi/elastic-data-frame-regression-analysis-2ge2"&gt;Next Post: Elastic Data Frame - Regression Analysis&lt;/a&gt; |&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;This post is part of a series that covers &lt;a href="https://dev.to/priscilla_parodi/a-guide-to-machine-learning-ai-with-the-elastic-stack-1dkl"&gt;Artificial Intelligence with a focus on Elastic's (Creators of Elasticsearch) Machine Learning solution&lt;/a&gt;, aiming to introduce and exemplify the possibilities and options available, in addition to addressing the context and usability.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>machinelearning</category>
      <category>elasticsearch</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Elastic Data Frame - Regression Analysis</title>
      <dc:creator>Priscilla Parodi</dc:creator>
      <pubDate>Mon, 02 Aug 2021 14:38:17 +0000</pubDate>
      <link>https://dev.to/elastic/elastic-data-frame-regression-analysis-2ge2</link>
      <guid>https://dev.to/elastic/elastic-data-frame-regression-analysis-2ge2</guid>
      <description>&lt;p&gt;| &lt;a href="https://dev.to/priscilla_parodi/a-guide-to-machine-learning-ai-with-the-elastic-stack-1dkl"&gt;Menu&lt;/a&gt; | &lt;a href="https://dev.to/priscilla_parodi/elastic-data-frame-classification-analysis-1b3f"&gt;Next Post: Elastic Data Frame - Classification Analysis&lt;/a&gt; |&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Unlike the &lt;a href="https://dev.to/priscilla_parodi/elastic-anomaly-detection-learning-process-and-anomaly-score-3nl7"&gt;Anomaly Detection&lt;/a&gt; models, this is a multi-variate analysis; it enables a better understanding of complex behaviors that are described by many features. For this analysis we have three models with different algorithms and learning types (&lt;a href="https://dev.to/priscilla_parodi/elastic-data-frame-outlier-detection-16ni"&gt;Outlier&lt;/a&gt;, Regression and &lt;a href="https://dev.to/priscilla_parodi/elastic-data-frame-classification-analysis-1b3f"&gt;Classification&lt;/a&gt;), and in this post we'll talk about Regression Analysis.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Regression&lt;/strong&gt; makes predictions on your data after it determines certain relationships among your data points (Supervised ML).&lt;/p&gt;

&lt;p&gt;For example, suppose we are interested in finding the relationship between apartment size and monthly rent in a city. To do this, we need to find the relationship between a number of &lt;strong&gt;features&lt;/strong&gt; and a &lt;strong&gt;target variable&lt;/strong&gt;, to determine whether the target variable can be explained by the feature(s).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ISogRk3K--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jp5os5594ginh5rjdh5q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ISogRk3K--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jp5os5594ginh5rjdh5q.png" alt="Alt Text" width="800" height="215"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This example is a one-dimensional regression problem, because we only have one feature variable (size), but we could easily add more features.&lt;/p&gt;
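&lt;p&gt;To make the one-feature case concrete, here is a minimal ordinary-least-squares sketch with hypothetical size and rent values. Note that this linear fit is only for illustration; Elastic's regression analysis uses gradient-boosted decision trees, not a straight line:&lt;/p&gt;

```python
def fit_line(sizes, rents):
    """Ordinary least squares for one feature: rent = slope * size + intercept.
    Illustrative only; Elastic's regression uses gradient-boosted trees."""
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(rents) / n
    # slope = covariance(x, y) / variance(x)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, rents))
    var = sum((x - mean_x) ** 2 for x in sizes)
    slope = cov / var
    return slope, mean_y - slope * mean_x

sizes = [30, 50, 70, 90]        # square meters (hypothetical)
rents = [600, 900, 1200, 1500]  # monthly rent (hypothetical)
slope, intercept = fit_line(sizes, rents)
# perfectly linear toy data: slope 15.0, intercept 150.0
```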

&lt;h3&gt;
  
  
  Evaluation of the Regression analysis
&lt;/h3&gt;

&lt;p&gt;For the regression analysis we use a variation of the XGBoost algorithm, which combines decision trees with gradient boosting methodologies.&lt;/p&gt;

&lt;p&gt;The two measures that we can use to evaluate regression in the stack are R squared and MSE:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--s1a7sTg6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/oql3xo4eugegi6eppjo4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--s1a7sTg6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/oql3xo4eugegi6eppjo4.png" alt="Alt Text" width="800" height="317"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;R Squared&lt;/strong&gt; (ranges from 0 to 1 | the higher the better): measures goodness of fit, i.e., how well the target variable is explained by the feature variable(s); 1 is a perfect fit.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Mean Squared Error&lt;/strong&gt; (0 or greater | the lower the better): measures the average squared error between the actual and the predicted data points, telling you how close a regression line is to a set of points. It takes the distances from the points (purple dots) to the regression line (blue line), where these distances (red lines) are the “errors”, and squares them.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Rt5oNPm3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6j2wulfvxfyk3tmscsrx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Rt5oNPm3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6j2wulfvxfyk3tmscsrx.png" alt="Alt Text" width="800" height="803"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can measure how well our model is performing by computing the average of the squared differences between the true and predicted values, the Mean Squared Error.&lt;/p&gt;
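&lt;p&gt;Both metrics are simple to compute by hand; a minimal sketch with hypothetical actual and predicted rent values:&lt;/p&gt;

```python
def mse(y_true, y_pred):
    """Mean squared error: average squared difference between actual
    and predicted values (0 is a perfect fit)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """R squared: fraction of the target's variance explained by the
    model (1 is a perfect fit)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

rent_actual = [900, 1100, 1300, 1500]     # hypothetical
rent_predicted = [950, 1050, 1350, 1450]  # hypothetical
print(mse(rent_actual, rent_predicted))        # 2500.0
print(r_squared(rent_actual, rent_predicted))  # 0.95
```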

&lt;p&gt;When you view the regression results in Kibana, it provides information about the analysis, model evaluation metrics, total feature importance values, and a scatterplot matrix.&lt;/p&gt;

&lt;p&gt;| &lt;a href="https://dev.to/priscilla_parodi/a-guide-to-machine-learning-ai-with-the-elastic-stack-1dkl"&gt;Menu&lt;/a&gt; | &lt;a href="https://dev.to/priscilla_parodi/elastic-data-frame-classification-analysis-1b3f"&gt;Next Post: Elastic Data Frame - Classification Analysis&lt;/a&gt; |&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;This post is part of a series that covers &lt;a href="https://dev.to/priscilla_parodi/a-guide-to-machine-learning-ai-with-the-elastic-stack-1dkl"&gt;Artificial Intelligence with a focus on Elastic's (Creators of Elasticsearch) Machine Learning solution&lt;/a&gt;, aiming to introduce and exemplify the possibilities and options available, in addition to addressing the context and usability.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>machinelearning</category>
      <category>elasticsearch</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Elastic Data Frame - Classification Analysis</title>
      <dc:creator>Priscilla Parodi</dc:creator>
      <pubDate>Mon, 02 Aug 2021 14:38:03 +0000</pubDate>
      <link>https://dev.to/elastic/elastic-data-frame-classification-analysis-1b3f</link>
      <guid>https://dev.to/elastic/elastic-data-frame-classification-analysis-1b3f</guid>
      <description>&lt;p&gt;| &lt;a href="https://dev.to/priscilla_parodi/a-guide-to-machine-learning-ai-with-the-elastic-stack-1dkl"&gt;Menu&lt;/a&gt; | &lt;a href="https://dev.to/priscilla_parodi/elastic-data-frame-classification-vs-regression-4cod"&gt;Next Post: Elastic Data Frame - Classification vs Regression&lt;/a&gt; |&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Unlike the &lt;a href="https://dev.to/priscilla_parodi/elastic-anomaly-detection-learning-process-and-anomaly-score-3nl7"&gt;Anomaly Detection&lt;/a&gt; models, this is a multi-variate analysis; it enables a better understanding of complex behaviors that are described by many features. For this analysis we have three models with different algorithms and learning types (&lt;a href="https://dev.to/priscilla_parodi/elastic-data-frame-outlier-detection-16ni"&gt;Outlier&lt;/a&gt;, &lt;a href="https://dev.to/priscilla_parodi/elastic-data-frame-regression-analysis-2ge2"&gt;Regression&lt;/a&gt; and Classification), and in this post we'll talk about Classification Analysis.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Classification&lt;/strong&gt; predicts the class or category of a given data point in a data set (Supervised ML).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uxLqlNL4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/z97o8dgzzombkqd47a5v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uxLqlNL4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/z97o8dgzzombkqd47a5v.png" alt="Alt Text" width="800" height="272"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In regression the goal is to have our prediction close to the actual value. In classification the goal is to be correct (and heavily penalize incorrect predictions).&lt;/p&gt;

&lt;h3&gt;
  
  
  Evaluation of the Classification analysis
&lt;/h3&gt;

&lt;p&gt;The algorithms we use for Classification are boosted trees and the logistic function.&lt;/p&gt;

&lt;p&gt;A logistic function can be fit to the data using a cross-entropy cost function, which essentially computes a measure of dissimilarity between the true (observed) probabilities and the probabilities our model produces. The goal is to make the model output as close as possible to the desired output (the true classes).&lt;/p&gt;
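&lt;p&gt;A minimal sketch of these two pieces, using toy labels and probabilities (illustrative only; Elastic's implementation combines this with boosted trees):&lt;/p&gt;

```python
import math

def sigmoid(z):
    """Logistic function: maps any real value to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))

def cross_entropy(y_true, probs):
    """Average cross-entropy between true labels (0/1) and predicted
    probabilities; lower means the model output is closer to the truth."""
    eps = 1e-12  # clamp probabilities to avoid log(0)
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

labels = [1, 0, 1, 0]
good = [0.9, 0.1, 0.8, 0.2]  # probabilities close to the true classes
bad = [0.5, 0.5, 0.5, 0.5]   # uninformative probabilities
# the better-calibrated probabilities give a lower cross-entropy loss
```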

&lt;p&gt;Kibana provides a normalized confusion matrix that contains the percentage of occurrences where the analysis classified data points correctly, i.e., where the predicted class matches the actual class. For example, &lt;code&gt;TP&lt;/code&gt; and &lt;code&gt;TN&lt;/code&gt; in the image below are good results when they have a higher percentage than &lt;code&gt;FN&lt;/code&gt; and &lt;code&gt;FP&lt;/code&gt;, because this means that the prediction of “true” and “false” matched the real value in most cases. The higher these percentages, the better.&lt;/p&gt;
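&lt;p&gt;A normalized confusion matrix like this can be sketched in a few lines (toy labels, not the Kibana output): for each actual class, we compute the fraction of points predicted as each class, so every row sums to 1:&lt;/p&gt;

```python
from collections import Counter

def normalized_confusion(actual, predicted):
    """Per-class rates: for each actual class, the fraction of points
    predicted as each class (each actual class's rates sum to 1)."""
    counts = Counter(zip(actual, predicted))
    totals = Counter(actual)
    return {(a, p): n / totals[a] for (a, p), n in counts.items()}

actual =    ["true", "true", "true", "false", "false"]
predicted = ["true", "true", "false", "false", "false"]
matrix = normalized_confusion(actual, predicted)
# matrix[("true", "true")]   -> 2/3  (TP rate)
# matrix[("true", "false")]  -> 1/3  (FN rate)
# matrix[("false", "false")] -> 1.0  (TN rate)
```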

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--krMlrOYp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3towpzlxgr6dlifb66pw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--krMlrOYp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3towpzlxgr6dlifb66pw.png" alt="Alt Text" width="800" height="553"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qtsXfT1D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/72458nfcs0hnjocgyhbn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qtsXfT1D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/72458nfcs0hnjocgyhbn.png" alt="Alt Text" width="800" height="389"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Kibana also provides the receiver operating characteristic (ROC) curve as part of the model evaluation. The plot compares the true positive rate (y-axis) to the false positive rate (x-axis) for each class; in this example, true and false. The area under the curve (AUC) is a number between 0 and 1; the higher the AUC, the better the model is at predicting the classes correctly.&lt;/p&gt;
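&lt;p&gt;The AUC can be sketched via its rank-statistic definition, the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one (hypothetical toy scores, not Kibana's implementation):&lt;/p&gt;

```python
def auc(labels, scores):
    """AUC as a rank statistic: fraction of (positive, negative) pairs
    where the positive example receives the higher score (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

y = [1, 1, 0, 0]
s = [0.9, 0.4, 0.6, 0.2]  # one positive is scored below one negative
print(auc(y, s))  # 0.75
```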

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8W2kSnpF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qspaajwehfio6rkay8pu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8W2kSnpF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qspaajwehfio6rkay8pu.png" alt="Alt Text" width="800" height="582"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;| &lt;a href="https://dev.to/priscilla_parodi/a-guide-to-machine-learning-ai-with-the-elastic-stack-1dkl"&gt;Menu&lt;/a&gt; | &lt;a href="https://dev.to/priscilla_parodi/elastic-data-frame-classification-vs-regression-4cod"&gt;Next Post: Elastic Data Frame - Classification vs Regression&lt;/a&gt; |&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;This post is part of a series that covers &lt;a href="https://dev.to/priscilla_parodi/a-guide-to-machine-learning-ai-with-the-elastic-stack-1dkl"&gt;Artificial Intelligence with a focus on Elastic's (Creators of Elasticsearch) Machine Learning solution&lt;/a&gt;, aiming to introduce and exemplify the possibilities and options available, in addition to addressing the context and usability.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>machinelearning</category>
      <category>elasticsearch</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Elastic Data Frame - Classification vs Regression</title>
      <dc:creator>Priscilla Parodi</dc:creator>
      <pubDate>Mon, 02 Aug 2021 14:37:51 +0000</pubDate>
      <link>https://dev.to/elastic/elastic-data-frame-classification-vs-regression-4cod</link>
      <guid>https://dev.to/elastic/elastic-data-frame-classification-vs-regression-4cod</guid>
      <description>&lt;p&gt;| &lt;a href="https://dev.to/priscilla_parodi/a-guide-to-machine-learning-ai-with-the-elastic-stack-1dkl"&gt;Menu&lt;/a&gt; | &lt;a href="https://dev.to/priscilla_parodi/data-preparation-for-data-frame-analysis-with-transforms-2bp6"&gt;Next Post: Data preparation for Data Frame Analysis with Transforms&lt;/a&gt;|&lt;/p&gt;

&lt;p&gt;Regression and classification are both supervised ML algorithms; the difference between the two lies in the kind of problem each solves. In classification analysis we predict class probabilities such as yes/no, true/false, or spam/not spam, while in regression analysis we predict continuous values such as price and salary.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7Vf_0Ck4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kth62bb9fpdq6gaxk8k7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7Vf_0Ck4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kth62bb9fpdq6gaxk8k7.png" alt="Alt Text" width="800" height="442"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;| &lt;a href="https://dev.to/priscilla_parodi/a-guide-to-machine-learning-ai-with-the-elastic-stack-1dkl"&gt;Menu&lt;/a&gt; | &lt;a href="https://dev.to/priscilla_parodi/data-preparation-for-data-frame-analysis-with-transforms-2bp6"&gt;Next Post: Data preparation for Data Frame Analysis with Transforms&lt;/a&gt; |&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;This post is part of a series that covers &lt;a href="https://dev.to/priscilla_parodi/a-guide-to-machine-learning-ai-with-the-elastic-stack-1dkl"&gt;Artificial Intelligence with a focus on Elastic's (Creators of Elasticsearch) Machine Learning solution&lt;/a&gt;, aiming to introduce and exemplify the possibilities and options available, in addition to addressing the context and usability.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>machinelearning</category>
      <category>elasticsearch</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Data preparation for Data Frame Analysis with Transforms</title>
      <dc:creator>Priscilla Parodi</dc:creator>
      <pubDate>Mon, 02 Aug 2021 14:36:58 +0000</pubDate>
      <link>https://dev.to/elastic/data-preparation-for-data-frame-analysis-with-transforms-2bp6</link>
      <guid>https://dev.to/elastic/data-preparation-for-data-frame-analysis-with-transforms-2bp6</guid>
      <description>&lt;p&gt;| &lt;a href="https://dev.to/priscilla_parodi/a-guide-to-machine-learning-ai-with-the-elastic-stack-1dkl"&gt;Menu&lt;/a&gt; | &lt;a href="https://dev.to/priscilla_parodi/trained-models-for-supervised-learning-154n"&gt;Next Post: Trained Models for Supervised Learning&lt;/a&gt; |&lt;/p&gt;

&lt;p&gt;When you are using Data Frames (multi-variate analysis), Transforms can be useful in the data preparation step.&lt;/p&gt;

&lt;p&gt;A transform converts existing Elasticsearch indexes into summary indexes, allowing you to define a pivot, a set of features that transform the index into a different, more digestible format, providing opportunities for new insights and analysis.&lt;/p&gt;

&lt;p&gt;In fact, it performs search aggregations on the source index and indexes the results into the destination index. Therefore, a transform never takes less time or uses fewer resources than the aggregation and indexing processes themselves.&lt;/p&gt;

&lt;p&gt;You can decide whether you want the transform to run once or continuously.&lt;/p&gt;

&lt;p&gt;In this example we have 3 documents from a source index that stores reviews, with these fields: user-id, vendor and review.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Source Index (reviews)
{
...
user-id: 123,
vendor: "abc",
review: 4
},
{
...
user-id: 123,
vendor: "def",
review: 3
},
{
...
user-id: 123,
vendor: "ghi",
review: 5
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With Transforms we can have a Destination Index grouped by user-id, for example, with the number of reviews per user (3 reviews in this case), and a simple average of the reviews: (4+3+5)/3.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Destination Index (reviews-result)
{
...
user-id: 123,
num_reviews(count): 3,
avg_review: 4
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And if the transform runs continuously, the destination index is kept up to date, which means we can have the data we need in the shape we need it, using aggregations such as sum, max, cardinality, etc.&lt;/p&gt;
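&lt;p&gt;The pivot above can be sketched in plain Python to show what the transform computes (illustrative only; in Elasticsearch the work is done by search aggregations on the source index):&lt;/p&gt;

```python
from collections import defaultdict

# source documents, as in the reviews example above
reviews = [
    {"user-id": 123, "vendor": "abc", "review": 4},
    {"user-id": 123, "vendor": "def", "review": 3},
    {"user-id": 123, "vendor": "ghi", "review": 5},
]

# group by user-id (the pivot), then aggregate per group
grouped = defaultdict(list)
for doc in reviews:
    grouped[doc["user-id"]].append(doc["review"])

destination = [
    {"user-id": uid, "num_reviews": len(rs), "avg_review": sum(rs) / len(rs)}
    for uid, rs in grouped.items()
]
# [{'user-id': 123, 'num_reviews': 3, 'avg_review': 4.0}]
```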

&lt;p&gt;| &lt;a href="https://dev.to/priscilla_parodi/a-guide-to-machine-learning-ai-with-the-elastic-stack-1dkl"&gt;Menu&lt;/a&gt; | &lt;a href="https://dev.to/priscilla_parodi/trained-models-for-supervised-learning-154n"&gt;Next Post: Trained Models for Supervised Learning&lt;/a&gt; |&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;This post is part of a series that covers &lt;a href="https://dev.to/priscilla_parodi/a-guide-to-machine-learning-ai-with-the-elastic-stack-1dkl"&gt;Artificial Intelligence with a focus on Elastic's (Creators of Elasticsearch) Machine Learning solution&lt;/a&gt;, aiming to introduce and exemplify the possibilities and options available, in addition to addressing the context and usability.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>machinelearning</category>
      <category>elasticsearch</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Trained Models for Supervised Learning</title>
      <dc:creator>Priscilla Parodi</dc:creator>
      <pubDate>Mon, 02 Aug 2021 14:36:46 +0000</pubDate>
      <link>https://dev.to/elastic/trained-models-for-supervised-learning-154n</link>
      <guid>https://dev.to/elastic/trained-models-for-supervised-learning-154n</guid>
      <description>&lt;p&gt;| &lt;a href="https://dev.to/priscilla_parodi/a-guide-to-machine-learning-ai-with-the-elastic-stack-1dkl"&gt;Menu&lt;/a&gt; | &lt;a href="https://dev.to/priscilla_parodi/inference-for-supervised-learning-66k"&gt;Next Post: Inference for Supervised Learning&lt;/a&gt; |&lt;/p&gt;

&lt;p&gt;When you use a data frame analytics job to perform classification or regression analysis, it creates a &lt;strong&gt;machine learning model&lt;/strong&gt; that is trained and tested against a labelled data set. When you are satisfied with your &lt;strong&gt;trained model&lt;/strong&gt;, you can use it to make predictions against new data.&lt;/p&gt;

&lt;p&gt;To see your available models: Kibana&amp;gt;Machine Learning&amp;gt;Data Frame Analytics&amp;gt;Models&lt;/p&gt;

&lt;p&gt;Alternatively, you can use APIs like &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/7.13/get-trained-models.html#ml-get-trained-models-request"&gt;get trained models&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The following example gets information for all the trained models:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;GET _ml/trained_models/&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Models trained in Elasticsearch are &lt;a href="https://www.elastic.co/guide/en/machine-learning/current/ml-trained-models.html#export-import"&gt;portable&lt;/a&gt; and can be transferred between clusters.&lt;/p&gt;

&lt;p&gt;It is also possible to import a model to your Elasticsearch cluster even if the model is not trained by Elastic Data Frame analytics. &lt;a href="https://eland.readthedocs.io/en/v7.13.0b1/"&gt;&lt;strong&gt;Eland&lt;/strong&gt;&lt;/a&gt; supports &lt;a href="https://www.elastic.co/guide/en/elasticsearch/client/eland/current/machine-learning.html"&gt;importing models&lt;/a&gt; directly through its APIs.&lt;/p&gt;

&lt;p&gt;| &lt;a href="https://dev.to/priscilla_parodi/a-guide-to-machine-learning-ai-with-the-elastic-stack-1dkl"&gt;Menu&lt;/a&gt; | &lt;a href="https://dev.to/priscilla_parodi/inference-for-supervised-learning-66k"&gt;Next Post: Inference for Supervised Learning&lt;/a&gt; |&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;This post is part of a series that covers &lt;a href="https://dev.to/priscilla_parodi/a-guide-to-machine-learning-ai-with-the-elastic-stack-1dkl"&gt;Artificial Intelligence with a focus on Elastic's (Creators of Elasticsearch) Machine Learning solution&lt;/a&gt;, aiming to introduce and exemplify the possibilities and options available, in addition to addressing the context and usability.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>machinelearning</category>
      <category>elasticsearch</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
