Sourav Atta

Posted on May 5, 2021

Writing an effective GROK pattern

#logstash #grok #regex #elasticsearch

Grok is one of the most popular Logstash filters. It parses unstructured log data into a structured, meaningful format.

Logstash ships with about 120 built-in patterns by default. You can find them here: https://github.com/logstash-plugins/logstash-patterns-core/tree/master/patterns

Some of the patterns are also listed at https://github.com/hpcugent/logstash-patterns/blob/master/files/grok-patterns
I personally prefer this second link when constructing grok patterns.

There may be cases where none of these grok patterns fit. For those, grok lets you write custom regular expressions using Oniguruma, the regular expression library it is built on, and combine them with grok patterns to create powerful expressions.

Grok Syntax

%{SYNTAX:SEMANTIC}

SYNTAX is the name of the grok pattern that matches your text
SEMANTIC is the key (field name) under which the matched text is stored
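
To make the syntax concrete, here is a minimal Python sketch of how a %{SYNTAX:SEMANTIC} reference can be expanded into a named capture group. The PATTERNS table and the expand helper are hypothetical, and the regexes are simplified stand-ins for the real pattern definitions; this is an illustration of the idea, not how Logstash implements grok.

```python
import re

# Hypothetical, simplified stand-ins for a few built-in grok patterns.
PATTERNS = {
    "POSINT": r"[1-9][0-9]*",
    "DATA": r".*?",
    "WORD": r"\w+",
}

# Matches %{SYNTAX} or %{SYNTAX:SEMANTIC}.
GROK_REF = re.compile(r"%\{(\w+)(?::(\w+))?\}")

def expand(grok):
    """Replace each %{SYNTAX:SEMANTIC} with a Python named group."""
    def repl(m):
        syntax, semantic = m.group(1), m.group(2)
        body = PATTERNS[syntax]
        if semantic:
            # SEMANTIC becomes the name of the capture group (the key).
            return f"(?P<{semantic}>{body})"
        # No SEMANTIC: match the text but do not store it.
        return f"(?:{body})"
    return GROK_REF.sub(repl, grok)

regex = expand("value=%{POSINT:value}")
match = re.search(regex, "name=notifications.received, value=2")
print(regex)
print(match.groupdict())
```

Here SYNTAX (POSINT) decides what is matched, while SEMANTIC (value) decides the key it is stored under.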

Oniguruma Syntax

(?<field_name>regex pattern)

field_name is the key under which the matched text is stored
regex pattern is the placeholder for your own regular expression
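
If you want to try a named group outside Logstash, note that Python's re module spells it (?P<field_name>...) rather than (?<field_name>...). The sketch below rewrites the opener and runs the result against a fragment of a log line; the value field and the \d+ regex are only illustrative choices.

```python
import re

# Oniguruma-style named group, as it would appear inside a grok expression.
oniguruma = r"value=(?<value>\d+)"

# Python's re module spells named groups (?P<name>...), so rewrite the opener.
# Lookbehinds, (?<= and (?<!, are left untouched by the negative lookahead.
python_regex = re.sub(r"\(\?<(?![=!])", "(?P<", oniguruma)

match = re.search(python_regex, "name=notifications.received, value=2")
print(python_regex)
print(match.group("value"))
```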

How to use?

Let's try to create a pattern to parse unstructured log data.

Sample Log Data

09:33:45,416 (metrics-logger-reporter-1-thread-1) type=GAUGE, name=notifications.received, value=2

Required fields from log data

Field	Field Value
timestamp	09:33:45,416
logthread	metrics-logger-reporter-1-thread-1
type	GAUGE
name	notifications.received
value	2

Grok Pattern

We will use Grok Debugger to test our pattern to match the log data.

Let's break the log data down and pick a pattern that matches each field:

Field	Pattern
timestamp	%{TIME}
type	%{DATA}
name	%{DATA}
value	%{POSINT}

The logthread field can be any combination of alphanumeric characters and hyphens.

So we will use Oniguruma to match the logthread field. Following the Oniguruma syntax, we need a regex pattern that matches the value of logthread.

Constructing Regex Pattern

We can now use Regex Checker to construct and test the regex pattern for the value of the logthread field.

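The non-capturing group (?:[()a-zA-Z\d-]+) wraps a character class followed by a quantifier. Each part works as follows:

[...] matches a single character from the list inside the brackets
() matches a literal ( or ) character (inside a character class, parentheses have no grouping meaning)
a-z matches a single character in the range between a and z
A-Z matches a single character in the range between A and Z
\d matches a digit
- matches the literal character - (it is literal because it is the last item in the class)
+ is a greedy quantifier, i.e. it matches the preceding class between one and unlimited times, as many times as possible

Note that in the final pattern the parentheses around the thread name are matched by \( and \) outside this group, so ( and ) inside the class are not strictly needed for this sample.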
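
As a quick sanity check, the same character class can be tried in Python, whose re module treats this particular class the same way. This is only a sketch for experimenting locally; the second test string is borrowed from the name field of the sample log to show where the class stops.

```python
import re

thread_class = re.compile(r"[()a-zA-Z\d-]+")

# Every character of the thread name is in the class, so the whole string matches.
full = thread_class.fullmatch("metrics-logger-reporter-1-thread-1")
print(full is not None)

# '.' is not in the class, so a full match fails on the name field...
dotted = thread_class.fullmatch("notifications.received")
print(dotted is not None)

# ...and a greedy prefix match stops right before the dot.
prefix = thread_class.match("notifications.received").group()
print(prefix)
```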

Oniguruma

The final Oniguruma pattern for the field logthread:

(?<logthread>(?:[()a-zA-Z\d-]+))

Grok Pattern + Oniguruma (Final Pattern)

The final pattern that will match the log data:

%{TIME:timestamp} \((?<logthread>(?:[()a-zA-Z\d-]+))\) type=%{DATA:type}, name=%{DATA:name}, value=%{POSINT:value}
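
To see how the pieces fit together, here is a rough Python translation of the final pattern. The regexes standing in for %{TIME}, %{DATA} and %{POSINT} are simplified assumptions rather than the exact definitions shipped with Logstash, and the names LINE_RE and fields are hypothetical.

```python
import re

# Simplified stand-ins for the grok pieces (assumptions, not the real definitions):
#   %{TIME}   -> HH:MM:SS with an optional fractional part
#   %{DATA}   -> .*? (lazy, stops at the next literal)
#   %{POSINT} -> positive integer
LINE_RE = re.compile(
    r"(?P<timestamp>\d{2}:\d{2}:\d{2}(?:[:.,]\d+)?) "
    r"\((?P<logthread>[()a-zA-Z\d-]+)\) "
    r"type=(?P<type>.*?), "
    r"name=(?P<name>.*?), "
    r"value=(?P<value>[1-9]\d*)"
)

line = ("09:33:45,416 (metrics-logger-reporter-1-thread-1) "
        "type=GAUGE, name=notifications.received, value=2")

fields = LINE_RE.match(line).groupdict()
for key, val in fields.items():
    print(f"{key}: {val}")
```

Unlike the Grok Debugger output, this sketch only returns the five named fields, because the sub-patterns of the time are not given names here.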

Output of the pattern

{
  "timestamp": [
    [
      "09:33:45,416"
    ]
  ],
  "HOUR": [
    [
      "09"
    ]
  ],
  "MINUTE": [
    [
      "33"
    ]
  ],
  "SECOND": [
    [
      "45,416"
    ]
  ],
  "logthread": [
    [
      "metrics-logger-reporter-1-thread-1"
    ]
  ],
  "type": [
    [
      "GAUGE"
    ]
  ],
  "name": [
    [
      "notifications.received"
    ]
  ],
  "value": [
    [
      "2"
    ]
  ]
}

Conclusion

The combination of grok patterns and Oniguruma is a powerful pair. Together they can help transform complex logs into structured data. Give grok patterns + Oniguruma a try in Logstash!

Let me know in the comments if you have a better way of doing this or run into any problem with the above example.
