Priscilla Parodi for Elastic

Elastic Anomaly Detection - Categorization

| Menu | Next Post: Elastic Anomaly Detection and Data Visualizer HandsOn|

For categorization analysis, the learning process is the same as for other anomaly detection jobs, but additional steps are needed to process the text first.

The input data must be a text field, typically one containing repeated elements, such as log messages. Categorization is not natural language processing (NLP), and it works best on machine-written messages.

When you create a categorization anomaly detection job, the machine learning model groups the input text into categories and identifies patterns over time, as you can see in this example:

Input text

Log message:

Jul 20 15:02:19 localhost sshd[8903]: Invalid user admin from 58.218.92.41 port 26062
Jul 20 15:02:19 localhost sshd[8903]: input_userauth_request: invalid user admin [preauth]
Jul 20 15:02:20 localhost sshd[8903]: Connection closed by 58.218.92.41 port 26062 [preauth]
Jul 20 17:10:23 localhost sshd[2074]: Received disconnect from 41.43.112.199 port 41805:11: disconnected by user
Jul 20 17:10:23 localhost sshd[2074]: Disconnected from 41.43.112.199 port 26062
Jul 20 17:10:23 localhost sshd[2072]: pam_unix (sshd:session): session closed for user ec2-user
Jul 20 19:14:55 localhost sshd[8944]: pam_unix (sshd:session): session closed for user ec2-user by (uid=0)
Jul 20 19:17:22 localhost runner: pam_unix(runuser-1:session): session closed for user ec2-user 
Jul 20 19:17:22 localhost runner: pam_unix(runuser-1:session): session opened for user ec2-user by (uid=0)
Jul 20 19:17:23 localhost runner: pam_unix(runuser-1:session): session closed for user ec2-user 

Step 1 - Remove mutable text

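Mutable text, such as the date and time, is not taken into account. Because these values change with every message, keeping them would surface patterns or anomalies that have no relevance. In the example below, the timestamps, process IDs, IP addresses, and port numbers have been removed.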

localhost sshd: Invalid user from port
localhost sshd: input_userauth_request: invalid user [preauth]
localhost sshd: Connection closed by port [preauth]
localhost sshd: Received disconnect from port disconnected by user
localhost sshd: Disconnected from port
localhost sshd: pam_unix session: session closed for user ec2-user
localhost sshd: pam_unix session: session closed for user ec2-user by (uid=0)
localhost runner: pam_unix session: session closed for user ec2-user 
localhost runner: pam_unix session: session opened for user ec2-user by (uid=0)
localhost runner: pam_unix session: session closed for user ec2-user 
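The idea behind Step 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the regular expressions and the `remove_mutable_text` helper are hypothetical and are not the rules Elastic's categorization analyzer actually applies.

```python
import re

# Hypothetical patterns, chosen only to match the sample log lines above.
MUTABLE_PATTERNS = [
    r"^\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s+",  # leading syslog timestamp, e.g. "Jul 20 15:02:19 "
    r"\[\d+\]",                                   # process ID in brackets, e.g. "[8903]"
    r"\b\d{1,3}(?:\.\d{1,3}){3}\b",               # IPv4 address
    r"(?<!\S)\d+(?!\S)",                          # stand-alone numeric token, e.g. a port number
]


def remove_mutable_text(line: str) -> str:
    """Strip values that change from message to message, then tidy whitespace."""
    for pattern in MUTABLE_PATTERNS:
        line = re.sub(pattern, "", line)
    # Removing tokens leaves double spaces behind; collapse them.
    return " ".join(line.split())


print(remove_mutable_text(
    "Jul 20 15:02:20 localhost sshd[8903]: Connection closed by 58.218.92.41 port 26062 [preauth]"
))
# localhost sshd: Connection closed by port [preauth]
```

This sketch only handles the token types listed in its patterns. It does not drop user names such as `admin` or compound tokens such as `41805:11:`, which the example output above also omits.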

Step 2 - Cluster similar messages together

A category can consist of a single line or of several lines that follow the same pattern, for example lines that are part of the same task.

->mlcategory:1
localhost sshd: Invalid user from port

->mlcategory:2
localhost sshd: input_userauth_request: invalid user [preauth]

->mlcategory:3
localhost sshd: Connection closed by port [preauth]

->mlcategory:4
localhost sshd: Received disconnect from port disconnected by user

->mlcategory:5
localhost sshd: Disconnected from port

->mlcategory:6
localhost sshd: pam_unix session: session closed for user ec2-user
localhost sshd: pam_unix session: session closed for user ec2-user by (uid=0)
localhost runner: pam_unix session: session closed for user ec2-user
localhost runner: pam_unix session: session opened for user ec2-user by (uid=0)
localhost runner: pam_unix session: session closed for user ec2-user
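To make the grouping more concrete, here is a minimal sketch of similarity-based clustering. It is not Elastic's categorization algorithm: it swaps in a simple greedy approach that compares token lists with Python's `difflib.SequenceMatcher`. The `categorize` function and the `0.7` threshold are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def categorize(messages, threshold=0.7):
    """Assign each message to the first category whose representative is similar enough.

    Returns a list of 1-based category ids, one per input message.
    """
    representatives = []  # token list of the first message seen in each category
    labels = []
    for message in messages:
        tokens = message.split()
        for category_id, representative in enumerate(representatives, start=1):
            # Token-level similarity between 0.0 (nothing shared) and 1.0 (identical).
            if SequenceMatcher(None, representative, tokens).ratio() >= threshold:
                labels.append(category_id)
                break
        else:
            # No existing category is close enough: open a new one.
            representatives.append(tokens)
            labels.append(len(representatives))
    return labels


messages = [
    "localhost sshd: Invalid user from port",
    "localhost sshd: Connection closed by port [preauth]",
    "localhost runner: pam_unix session: session closed for user ec2-user",
    "localhost runner: pam_unix session: session opened for user ec2-user by (uid=0)",
]
print(categorize(messages))
# [1, 2, 3, 3]
```

The two `pam_unix` session messages share most of their tokens, so they fall into the same category, much like the last category in the example above.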

Step 3 - Count per time bucket

By counting the messages of each category per time bucket, the behavior of each category can be identified more easily and checked for anomalies.
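The counting step could look like the following sketch. It assumes every message has already been assigned an mlcategory. The `count_per_bucket` helper, the sample events (including the year 2020, which the log lines do not state), and the 15-minute bucket span are all hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)  # fixed reference point used to align bucket boundaries


def count_per_bucket(events, bucket_span=timedelta(minutes=15)):
    """Count events per category within each time bucket.

    events: iterable of (timestamp, category_id) pairs with naive datetimes.
    Returns a dict mapping bucket start time to a Counter of category ids.
    """
    buckets = {}
    for timestamp, category in events:
        # Integer index of the bucket this timestamp falls into.
        index = (timestamp - EPOCH) // bucket_span
        bucket_start = EPOCH + index * bucket_span
        buckets.setdefault(bucket_start, Counter())[category] += 1
    return buckets


# Hypothetical categorized events: (timestamp, mlcategory)
events = [
    (datetime(2020, 7, 20, 15, 2, 19), 1),
    (datetime(2020, 7, 20, 15, 5, 0), 4),
    (datetime(2020, 7, 20, 15, 9, 30), 1),
    (datetime(2020, 7, 20, 15, 14, 59), 4),
    (datetime(2020, 7, 20, 15, 15, 0), 1),
]
for bucket_start, counts in sorted(count_per_bucket(events).items()):
    print(bucket_start, dict(counts))
```

A count that is unusually high or low for a category in a given bucket, compared with its earlier buckets, is the kind of deviation an anomaly detection job looks for.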

The image below shows an example of how each mlcategory behaves over time, which is the basis for the time bucket analysis:

[Image: behavior of each mlcategory over time]

As an example, in a specific time bucket, we could see mlcategory:1 followed by mlcategory:4, twice:

mlcategory:1 -> mlcategory:4 -> mlcategory:1 -> mlcategory:4.

We could call this bucket 1 for reference, the next one bucket 2, and so on.

[Image: mlcategory sequences per time bucket]
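To tie the three steps to an actual job, the sketch below builds a request body for a categorization job as a Python dictionary. The `categorization_field_name` setting and the `mlcategory` keyword belong to Elastic's anomaly detection job configuration. The field names `message` and `@timestamp`, the description, and the `15m` bucket span are assumptions about your data.

```python
import json

# Body for the create anomaly detection job API (PUT _ml/anomaly_detectors/<job_id>).
job_config = {
    "description": "Count log messages per category (illustrative example)",
    "analysis_config": {
        "bucket_span": "15m",                      # assumed bucket length for Step 3
        "categorization_field_name": "message",    # assumed text field to categorize
        "detectors": [
            # Count the messages of each category separately in every bucket.
            {"function": "count", "by_field_name": "mlcategory"}
        ],
    },
    "data_description": {"time_field": "@timestamp"},  # assumed timestamp field
}

print(json.dumps(job_config, indent=2))
```

Splitting the count by `mlcategory` means each category gets its own baseline, so an unusual count in one category is not hidden by the volume of the others.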


This post is part of a series on Artificial Intelligence that focuses on the Machine Learning solution from Elastic (the creators of Elasticsearch). The series aims to introduce and illustrate the possibilities and options available, and to address context and usability.
