DEV Community

Priscilla Parodi for Elastic

Elastic Anomaly Detection - Categorization

| Menu | Next Post: Elastic Anomaly Detection and Data Visualizer HandsOn|

For categorization analysis, the learning process is the same as in other anomaly detection jobs, but the text goes through some additional processing steps first.

The input data must be a text field, typically one containing repeated elements, such as log messages. Categorization is not natural language processing (NLP), so it works best on machine-written messages.

When you create a categorization anomaly detection job, the machine learning model processes the input text into different categories, identifying patterns over time, as you can see in this example:

Input text

Log message:

Jul 20 15:02:19 localhost sshd[8903]: Invalid user admin from 58.218.92.41 port 26062
Jul 20 15:02:19 localhost sshd[8903]: input_userauth_request: invalid user admin [preauth]
Jul 20 15:02:20 localhost sshd[8903]: Connection closed by 58.218.92.41 port 26062 [preauth]
Jul 20 17:10:23 localhost sshd[2074]: Received disconnect from 41.43.112.199 port 41805:11: disconnected by user
Jul 20 17:10:23 localhost sshd[2074]: Disconnected from 41.43.112.199 port 26062
Jul 20 17:10:23 localhost sshd[2072]: pam_unix (sshd:session): session closed for user ec2-user
Jul 20 19:14:55 localhost sshd[8944]: pam_unix (sshd:session): session closed for user ec2-user by (uid=0)
Jul 20 19:17:22 localhost runner: pam_unix(runuser-1:session): session closed for user ec2-user 
Jul 20 19:17:22 localhost runner: pam_unix(runuser-1:session): session opened for user ec2-user by (uid=0)
Jul 20 19:17:23 localhost runner: pam_unix(runuser-1:session): session closed for user ec2-user 

Step 1 - Remove mutable text

Mutable text, e.g., the date and time, is not taken into account. Because these values are always changing, including them would flag anomalies or patterns that have no relevance.

localhost sshd: Invalid user from port
localhost sshd: input_userauth_request: invalid user [preauth]
localhost sshd: Connection closed by port [preauth]
localhost sshd: Received disconnect from port disconnected by user
localhost sshd: Disconnected from port
localhost sshd: pam_unix session: session closed for user ec2-user
localhost sshd: pam_unix session: session closed for user ec2-user by (uid=0)
localhost runner: pam_unix session: session closed for user ec2-user 
localhost runner: pam_unix session: session opened for user ec2-user by (uid=0)
localhost runner: pam_unix session: session closed for user ec2-user 
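
To make Step 1 more concrete, here is a minimal Python sketch of the idea. It is not how Elastic implements it: the regular expressions and the `strip_mutable` helper are assumptions made up for this illustration, and they only cover the syslog timestamp, the process id, IPv4 addresses and bare numbers.

```python
import re

# Hypothetical patterns for this sketch only; they are not Elastic's tokenizer.
TIMESTAMP = re.compile(r"^[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s+")  # "Jul 20 15:02:19 "
PID = re.compile(r"\[\d+\]")                                               # "[8903]"
IPV4 = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")                           # "58.218.92.41"
NUMBER = re.compile(r"\b\d+\b")                                            # "26062"


def strip_mutable(line: str) -> str:
    """Remove values that change on every message and normalize whitespace."""
    # Order matters: IP addresses must go before bare numbers are removed.
    for pattern in (TIMESTAMP, PID, IPV4, NUMBER):
        line = pattern.sub("", line)
    return " ".join(line.split())


if __name__ == "__main__":
    raw = (
        "Jul 20 15:02:20 localhost sshd[8903]: "
        "Connection closed by 58.218.92.41 port 26062 [preauth]"
    )
    print(strip_mutable(raw))
```

Note that this sketch keeps the user name (`admin`) in the "Invalid user" message, while the example above drops it, so treat it as an approximation of the step rather than a reproduction.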

Step 2 - Cluster similar messages together

A cluster can be a single line or several lines that follow the same pattern, for example lines that are part of the same task.

->mlcategory:1
localhost sshd: Invalid user from port

->mlcategory:2
localhost sshd: input_userauth_request: invalid user [preauth]

->mlcategory:3
localhost sshd: Connection closed by port [preauth]

->mlcategory:4
localhost sshd: Received disconnect from port disconnected by user

->mlcategory:5
localhost sshd: Disconnected from port

->mlcategory:6
localhost sshd: pam_unix session: session closed for user ec2-user
localhost sshd: pam_unix session: session closed for user ec2-user by (uid=0)
localhost runner: pam_unix session: session closed for user ec2-user
localhost runner: pam_unix session: session opened for user ec2-user by (uid=0)
localhost runner: pam_unix session: session closed for user ec2-user
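
The clustering idea can be sketched in a few lines of Python. This is a simplified stand-in, not Elastic's algorithm: it greedily compares the token set of each message with the first message of every existing category using Jaccard similarity. The `categorize` helper and the `0.75` threshold are assumptions chosen for this illustration.

```python
from __future__ import annotations


def jaccard(a: set[str], b: set[str]) -> float:
    """Share of tokens two messages have in common (1.0 means identical sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def categorize(messages: list[str], threshold: float = 0.75) -> list[int]:
    """Assign a category id to each message, creating new categories as needed."""
    representatives: list[set[str]] = []  # token set of the first message per category
    assigned: list[int] = []
    for message in messages:
        current = set(message.split())
        for category_id, representative in enumerate(representatives, start=1):
            if jaccard(current, representative) >= threshold:
                assigned.append(category_id)
                break
        else:
            # No existing category is similar enough: open a new one.
            representatives.append(current)
            assigned.append(len(representatives))
    return assigned


if __name__ == "__main__":
    stripped = [
        "localhost sshd: Invalid user from port",
        "localhost sshd: Connection closed by port [preauth]",
        "localhost sshd: pam_unix session: session closed for user ec2-user",
        "localhost sshd: pam_unix session: session closed for user ec2-user by (uid=0)",
        "localhost runner: pam_unix session: session closed for user ec2-user",
    ]
    print(categorize(stripped))
```

The three `pam_unix` lines end up in the same category because they share most of their tokens. The ids are simply assigned in order of appearance, so they will not match the mlcategory numbers above, and with this threshold the "session opened" line would get a category of its own.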

Step 3 - Count per time bucket

By analyzing time buckets, the behavior of each cluster can be identified more easily and checked for anomalies.
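
As a rough sketch of what counting per time bucket means, the Python below floors each event timestamp to the start of its bucket and counts the events per bucket and category. The 15 minute span, the year in the timestamps and the `count_per_bucket` helper are assumptions for this illustration only.

```python
from collections import Counter
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def bucket_start(timestamp: datetime, span: timedelta) -> datetime:
    """Floor a timestamp to the start of the bucket that contains it."""
    return EPOCH + ((timestamp - EPOCH) // span) * span


def count_per_bucket(events, span: timedelta) -> Counter:
    """Count events per (bucket start, category id) pair."""
    counts = Counter()
    for timestamp, category in events:
        counts[(bucket_start(timestamp, span), category)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical (timestamp, mlcategory) pairs, not taken from real data.
    events = [
        (datetime(2020, 7, 20, 15, 2, 19), 1),
        (datetime(2020, 7, 20, 15, 3, 5), 4),
        (datetime(2020, 7, 20, 15, 7, 40), 1),
        (datetime(2020, 7, 20, 15, 9, 12), 4),
        (datetime(2020, 7, 20, 15, 20, 0), 1),
    ]
    for (start, category), count in sorted(count_per_bucket(events, timedelta(minutes=15)).items()):
        print(start, f"mlcategory:{category}", count)
```

A count that is unusually high or low for a category in a given bucket, compared with what the model has learned, is what would stand out as a candidate anomaly.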

The image below shows an example of how each mlcategory behaves over time, ready for further time bucket analysis:

[Image: behavior of each mlcategory over time]

As an example, at a specific time bucket, we could see an mlcategory:1 followed by an mlcategory:4, twice:

mlcategory:1 -> mlcategory:4 -> mlcategory:1 -> mlcategory:4.

For reference, we could call this bucket 1, the next one bucket 2, and so on.
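
For reference, the three steps map onto the configuration of a categorization job roughly as in the sketch below, written as a Python dictionary that could be sent as the body of a create anomaly detection job request (`PUT _ml/anomaly_detectors/<job_id>`). The field name `message`, the time field `@timestamp`, the description and the 15 minute bucket span are assumptions for this illustration; adjust them to your own data.

```python
import json

# Sketch of a categorization job body. All concrete values are assumptions.
job_config = {
    "description": "Count log messages per category (illustrative example)",
    "analysis_config": {
        # Step 3: size of the time buckets the counts are analyzed in.
        "bucket_span": "15m",
        # Steps 1 and 2: the text field whose values are categorized.
        "categorization_field_name": "message",
        "detectors": [
            # Model the count of each category separately.
            {"function": "count", "by_field_name": "mlcategory"}
        ],
    },
    "data_description": {"time_field": "@timestamp"},
}

if __name__ == "__main__":
    print(json.dumps(job_config, indent=2))
```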



This post is part of a series that covers Artificial Intelligence with a focus on the Machine Learning solution from Elastic (creators of Elasticsearch). It aims to introduce and exemplify the possibilities and options available, and to address context and usability.
