<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Leandro Pereira</title>
    <description>The latest articles on DEV Community by Leandro Pereira (@leandrojmp).</description>
    <link>https://dev.to/leandrojmp</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F240219%2F9b6020f9-007a-40c9-a9ca-87a0faafe377.jpg</url>
      <title>DEV Community: Leandro Pereira</title>
      <link>https://dev.to/leandrojmp</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/leandrojmp"/>
    <language>en</language>
    <item>
      <title>logstash: improving performance by replacing grok with dissect</title>
      <dc:creator>Leandro Pereira</dc:creator>
      <pubDate>Sun, 25 Oct 2020 22:45:03 +0000</pubDate>
      <link>https://dev.to/leandrojmp/logstash-melhorando-o-desempenho-trocando-o-grok-pelo-dissect-3hf7</link>
      <guid>https://dev.to/leandrojmp/logstash-melhorando-o-desempenho-trocando-o-grok-pelo-dissect-3hf7</guid>
      <description>&lt;p&gt;I have been using the &lt;a href="https://www.elastic.co/elastic-stack"&gt;elastic&lt;/a&gt; stack for a while now to collect and store logs from the applications, services, and devices I need to monitor. Since &lt;strong&gt;Logstash&lt;/strong&gt; is responsible for receiving, processing, and shipping those logs, making sure it performs well is extremely important.&lt;/p&gt;

&lt;p&gt;Most of the performance problems I have faced were related to configuration mistakes, deployment issues, or incorrect use of some filters, mainly the &lt;code&gt;grok&lt;/code&gt; filter.&lt;/p&gt;

&lt;p&gt;Although &lt;code&gt;grok&lt;/code&gt; is extremely useful, it is a filter based on regular expressions and it needs to validate every configured pattern. Depending on the number of events per second, these validations can put a heavy load on the CPU of the &lt;strong&gt;Logstash&lt;/strong&gt; machine, which can impact the entire monitoring process.&lt;/p&gt;

&lt;h3&gt;
  
  
  Grok or Dissect?
&lt;/h3&gt;

&lt;p&gt;A simple way to optimize a pipeline's performance is to check whether another filter can be used in place of &lt;code&gt;grok&lt;/code&gt;. One filter I have been using a lot as a replacement is &lt;code&gt;dissect&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The main difference between &lt;code&gt;grok&lt;/code&gt; and &lt;code&gt;dissect&lt;/code&gt; is that &lt;code&gt;dissect&lt;/code&gt; does not use regular expressions to parse the message: fields are defined by their position. Everything between a &lt;code&gt;%{&lt;/code&gt; and a &lt;code&gt;}&lt;/code&gt; is treated as a field and everything else is treated as a delimiter, which makes &lt;code&gt;dissect&lt;/code&gt; faster than &lt;code&gt;grok&lt;/code&gt; while also requiring less processing during parsing.&lt;/p&gt;

&lt;p&gt;Consider the following example message:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2020-08-20T21:45:50 127.0.0.1 [1] program.Logger INFO - sample message
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From this message we can extract the following fields:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;timestamp&lt;/code&gt;, &lt;code&gt;ipaddr&lt;/code&gt;, &lt;code&gt;thread&lt;/code&gt;, &lt;code&gt;logger&lt;/code&gt;, &lt;code&gt;loglevel&lt;/code&gt;, &lt;code&gt;msg&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1qWWxhJn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/zv1anjxa2ovnsg76ko66.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1qWWxhJn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/zv1anjxa2ovnsg76ko66.png" alt="exemplo de mensagem"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To parse this message with &lt;code&gt;grok&lt;/code&gt; and &lt;code&gt;dissect&lt;/code&gt; we use the following configurations.&lt;/p&gt;

&lt;h4&gt;
  
  
  grok
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;grok&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;match&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;message&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;%{TIMESTAMP_ISO8601:timestamp} %{IP:ipaddr} &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;[%{INT:thread}&lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;] %{DATA:logger} %{WORD:loglevel} - %{GREEDYDATA:msg}&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  dissect
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;dissect&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;mapping&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;message&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;%{timestamp} %{ipaddr} [%{thread}] %{logger} %{loglevel} - %{msg}&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that the structure is quite similar, but with &lt;code&gt;grok&lt;/code&gt; we need to specify which regular expression each field will be validated against, for example &lt;code&gt;TIMESTAMP_ISO8601&lt;/code&gt; for the &lt;code&gt;timestamp&lt;/code&gt; field or &lt;code&gt;IP&lt;/code&gt; for the &lt;code&gt;ipaddr&lt;/code&gt; field. There is no such validation in &lt;code&gt;dissect&lt;/code&gt;, since the filter simply stores whatever is at that position in the message as the field value, and this ends up being one of the factors that influences the choice between one filter or the other.&lt;/p&gt;
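&lt;p&gt;To make the positional idea concrete, here is a toy sketch in Python of how a &lt;code&gt;dissect&lt;/code&gt;-style parser can work (an illustration of the concept only, not Logstash's actual implementation): everything between &lt;code&gt;%{&lt;/code&gt; and &lt;code&gt;}&lt;/code&gt; names a field, and the literal text between fields is used as a delimiter to slice the message.&lt;/p&gt;

```python
def dissect(message, pattern):
    """Toy positional parser: everything between %{ and } is a field
    name; the literal text between fields is a delimiter.
    (A sketch of the idea, not Logstash's real dissect filter.)"""
    fields = {}
    rest = message
    chunks = pattern.split("%{")
    # chunks[0] is the literal prefix before the first field (may be empty)
    rest = rest[len(chunks[0]):]
    for chunk in chunks[1:]:
        name, delimiter = chunk.split("}", 1)
        if delimiter:
            # consume up to the next delimiter; no regex validation happens
            value, rest = rest.split(delimiter, 1)
        else:
            value, rest = rest, ""
        fields[name] = value
    return fields

line = "2020-08-20T21:45:50 127.0.0.1 [1] program.Logger INFO - sample message"
mapping = "%{timestamp} %{ipaddr} [%{thread}] %{logger} %{loglevel} - %{msg}"
fields = dissect(line, mapping)
```

&lt;p&gt;Note that the sketch never checks what a value looks like; whatever sits between two delimiters is accepted. That is exactly why this approach is cheaper than &lt;code&gt;grok&lt;/code&gt;, and also why it needs messages with a stable structure.&lt;/p&gt;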

&lt;p&gt;If the messages collected in a pipeline always have the same structure, or only a few variations, using &lt;code&gt;dissect&lt;/code&gt; in place of &lt;code&gt;grok&lt;/code&gt; becomes both possible and advantageous, since parsing will be faster and require less processing.&lt;/p&gt;

&lt;p&gt;But is it really faster, and does it really use less processing?&lt;/p&gt;

&lt;h3&gt;
  
  
  Grok vs Dissect
&lt;/h3&gt;

&lt;p&gt;To compare the performance of the two filters I used a simple pipeline with the &lt;code&gt;generator&lt;/code&gt; plugin as the &lt;code&gt;input&lt;/code&gt; and the &lt;code&gt;stdout&lt;/code&gt; plugin as the &lt;code&gt;output&lt;/code&gt;, alternating between the &lt;code&gt;grok&lt;/code&gt; and &lt;code&gt;dissect&lt;/code&gt; filters shown earlier.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;input&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;generator&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;lines&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;2020-08-20T21:45:50 127.0.0.1 [1] program.Logger INFO - sample message&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="nx"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;10000000&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;filter&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="err"&gt;#&lt;/span&gt;
    &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;filtro&lt;/span&gt; &lt;span class="nx"&gt;de&lt;/span&gt; &lt;span class="nx"&gt;grok&lt;/span&gt; &lt;span class="nx"&gt;ou&lt;/span&gt; &lt;span class="nx"&gt;dissect&lt;/span&gt;
    &lt;span class="err"&gt;#&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;sequence&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="nx"&gt;and&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;sequence&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;9999999&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
         &lt;span class="nx"&gt;drop&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
     &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nx"&gt;output&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;stdout&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This pipeline basically generates 10 million messages, applies the filter being tested, &lt;code&gt;grok&lt;/code&gt; or &lt;code&gt;dissect&lt;/code&gt;, and drops every message except the first and the last, which are printed only to mark the start and end of processing.&lt;/p&gt;

&lt;p&gt;The output of this pipeline looks like this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;host&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;elk&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ipaddr&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;127.0.0.1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;message&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;2020-08-20T21:45:50 127.0.0.1 [1] program.Logger INFO - sample message&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;sequence&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;loglevel&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;INFO&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;logger&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;program.Logger&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@version&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;timestamp&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;2020-08-20T21:45:50&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;msg&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;sample message&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;thread&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@timestamp&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;2020&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;08&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;26&lt;/span&gt;&lt;span class="nx"&gt;T02&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;38.127&lt;/span&gt;&lt;span class="nx"&gt;Z&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;host&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;elk&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ipaddr&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;127.0.0.1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;message&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;2020-08-20T21:45:50 127.0.0.1 [1] program.Logger INFO - sample message&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;sequence&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;9999999&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;loglevel&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;INFO&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;logger&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;program.Logger&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@version&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;timestamp&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;2020-08-20T21:45:50&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;msg&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;sample message&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;thread&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@timestamp&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;2020&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;08&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;26&lt;/span&gt;&lt;span class="na"&gt;T02&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;02&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;34.018&lt;/span&gt;&lt;span class="nx"&gt;Z&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since I was interested in comparing the machine's CPU usage while the filters were running, I used &lt;code&gt;nmon&lt;/code&gt; running in the background, collecting metrics every &lt;strong&gt;1&lt;/strong&gt; second for &lt;strong&gt;5&lt;/strong&gt; minutes, which yields &lt;strong&gt;300&lt;/strong&gt; samples, more than enough for this case.&lt;/p&gt;

&lt;p&gt;To run the pipeline I used a virtual machine with &lt;strong&gt;4&lt;/strong&gt; vCPUs and &lt;strong&gt;4 GB&lt;/strong&gt; of RAM running &lt;strong&gt;CentOS 7.8&lt;/strong&gt; and &lt;strong&gt;Logstash&lt;/strong&gt; version &lt;strong&gt;7.9&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Each sample generated by &lt;strong&gt;nmon&lt;/strong&gt; has the following format.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ZZZZ,T0010,00:16:49,21-AUG-2020
CPU001,T0010,50.5,2.0,0.0,46.5,1.0
CPU002,T0010,99.0,0.0,0.0,1.0,0.0
CPU003,T0010,97.0,1.0,0.0,1.0,1.0
CPU004,T0010,96.0,1.0,0.0,3.0,0.0
CPU_ALL,T0010,86.4,0.8,0.0,12.8,0.0,,4
MEM,T0010,3789.0,-0.0,-0.0,1024.0,2903.2,-0.0,-0.0,1024.0,-0.0,281.9,577.3,-1.0,2.0,0.0,181.5
VM,T0010,30,0,0,2384,12760,-1,0,0,0,0,27,0,0,2541,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
PROC,T0010,5,0,2524.7,-1.0,-1.0,-1.0,1.0,-1.0,-1.0,-1.0
NET,T0010,2.6,0.1,2.8,0.1
NETPACKET,T0010,10.9,2.0,12.9,2.0
JFSFILE,T0010,59.0,0.0,0.5,59.0,24.3,4.5,4.5,4.5
DISKBUSY,T0010,0.0,0.0,0.0,0.0
DISKREAD,T0010,0.0,0.0,0.0,0.0
DISKWRITE,T0010,0.0,0.0,0.0,0.0
DISKXFER,T0010,0.0,0.0,0.0,0.0
DISKBSIZE,T0010,0.0,0.0,0.0,0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Of all these lines, only the &lt;strong&gt;CPU_ALL&lt;/strong&gt; line matters, specifically the third column, which corresponds to the average usage percentage across all &lt;strong&gt;CPUs&lt;/strong&gt; at the time of collection.&lt;/p&gt;

&lt;p&gt;By processing the data collected while running the pipeline with &lt;code&gt;grok&lt;/code&gt; and then with &lt;code&gt;dissect&lt;/code&gt;, we can visualize and compare the CPU usage and the execution time.&lt;/p&gt;
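&lt;p&gt;As a sketch (assuming an nmon capture like the abbreviated snapshot above; a real run has one &lt;strong&gt;CPU_ALL&lt;/strong&gt; line per sample), extracting and averaging that third column takes only a few lines of Python:&lt;/p&gt;

```python
import csv
import io

# Abbreviated nmon capture: two snapshots, each with a CPU_ALL line.
nmon_output = """\
ZZZZ,T0010,00:16:49,21-AUG-2020
CPU001,T0010,50.5,2.0,0.0,46.5,1.0
CPU_ALL,T0010,86.4,0.8,0.0,12.8,0.0,,4
ZZZZ,T0011,00:16:50,21-AUG-2020
CPU_ALL,T0011,40.0,1.0,0.0,59.0,0.0,,4
"""

def cpu_all_usage(text):
    """Return the third column of every CPU_ALL line (the CPU usage
    percentage being compared here)."""
    return [float(row[2])
            for row in csv.reader(io.StringIO(text))
            if row and row[0] == "CPU_ALL"]

samples = cpu_all_usage(nmon_output)
average = sum(samples) / len(samples)
```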

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Zz2hXj5z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/xulsg534yzfgi67ka95e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Zz2hXj5z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/xulsg534yzfgi67ka95e.png" alt="comparativo"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The initial processing spike in the chart is caused by &lt;strong&gt;Logstash&lt;/strong&gt; starting up; the plateau that follows corresponds to the processing while the messages are parsed.&lt;/p&gt;

&lt;p&gt;Looking at the chart, we can see that the time to process the 10 million messages is basically the same for both filters in this specific case, but there is a big difference in CPU usage: an average of &lt;strong&gt;40%&lt;/strong&gt; when using the &lt;code&gt;dissect&lt;/code&gt; filter versus an average of more than &lt;strong&gt;60%&lt;/strong&gt; when using the &lt;code&gt;grok&lt;/code&gt; filter.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion and Links
&lt;/h3&gt;

&lt;p&gt;Depending on the number of pipelines and the events per second in each one, taking the time to evaluate replacing the &lt;code&gt;grok&lt;/code&gt; filter with &lt;code&gt;dissect&lt;/code&gt; can considerably improve performance during parsing and ingestion. It can also translate into cost savings on cloud infrastructure, since it allows the use of smaller machines and, in some cases, lower CPU-credit consumption.&lt;/p&gt;

&lt;p&gt;In cases where the collected logs always follow the same structure, as happens with logs from web servers, application servers, firewalls, and routers, or where you control the format of the generated logs, &lt;code&gt;dissect&lt;/code&gt; is the ideal filter when using &lt;strong&gt;Logstash&lt;/strong&gt; to collect and parse logs.&lt;/p&gt;

&lt;h4&gt;
  
  
  Links
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;dissect&lt;/code&gt;: &lt;a href="https://www.elastic.co/guide/en/logstash/current/plugins-filters-dissect.html"&gt;official documentation&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;grok&lt;/code&gt;: &lt;a href="https://www.elastic.co/guide/en/logstash/current/plugins-filters-grok.html"&gt;official documentation&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>logstash</category>
      <category>grok</category>
      <category>dissect</category>
      <category>ptbr</category>
    </item>
    <item>
      <title>logstash: geolocation with geoip</title>
      <dc:creator>Leandro Pereira</dc:creator>
      <pubDate>Sun, 25 Oct 2020 22:08:44 +0000</pubDate>
      <link>https://dev.to/leandrojmp/logstash-geolocation-with-geoip-45m5</link>
      <guid>https://dev.to/leandrojmp/logstash-geolocation-with-geoip-45m5</guid>
      <description>&lt;p&gt;An important part of monitoring devices, applications, or systems exposed to the internet is including information about the origin of the requests or attempts we receive.&lt;/p&gt;

&lt;p&gt;This is useful both on the infrastructure and security side and on the business side. For example, knowing the origin and volume of requests or attempts, we can better plan the geographic distribution of servers, detect compromised credentials or attack attempts, and identify new business opportunities in different locations.&lt;/p&gt;

&lt;p&gt;A very simple way to do this when using the elastic stack as one of the monitoring tools is to use the &lt;code&gt;geoip&lt;/code&gt; filter in logstash.&lt;/p&gt;

&lt;h3&gt;
  
  
  the geoip filter
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;geoip&lt;/code&gt; filter has a very simple function: it looks up an IP address in an internal database, identifies its geolocation, and returns fields such as the country name, country code, city, geographic coordinates, and a few others.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0SEXJdrE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/lxrw6g6kujpv7j7nrmsy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0SEXJdrE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/lxrw6g6kujpv7j7nrmsy.png" alt="geoip filter"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the case of the &lt;code&gt;geoip&lt;/code&gt; filter used by logstash, the internal database with the geolocation information is &lt;code&gt;GeoLite2&lt;/code&gt;, provided by &lt;a href="https://dev.maxmind.com/geoip/geoip2/geolite2/"&gt;maxmind&lt;/a&gt;. Using this filter with an IP address, it is possible to obtain, in addition to the geographic information, data about the Autonomous System (&lt;em&gt;AS&lt;/em&gt;) associated with the IP's routing.&lt;/p&gt;

&lt;p&gt;To use the &lt;code&gt;geoip&lt;/code&gt; filter, your event needs a field whose value is a public IP address, and you also need to create a specific mapping for your index so it can store fields with geolocation data.&lt;/p&gt;

&lt;h3&gt;
  
  
  applying the filter
&lt;/h3&gt;

&lt;p&gt;As an example of how to apply the &lt;code&gt;geoip&lt;/code&gt; filter I will use a &lt;a href="https://github.com/leandrojmp/go-sysmon"&gt;simple API&lt;/a&gt; that I wrote in Go that returns the status of the connections on a Linux machine, emulating part of what &lt;em&gt;netstat&lt;/em&gt; does.&lt;/p&gt;

&lt;p&gt;This API returns a JSON document with the following format.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"srcip"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"10.0.1.100"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"srcport"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;56954&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"dstip"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"151.101.192.133"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"dstport"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;443&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"ESTABLISHED"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The value of the &lt;code&gt;srcip&lt;/code&gt; field is the local machine's IP address and the value of the &lt;code&gt;dstip&lt;/code&gt; field is the external IP address; this is the field that we will use with the &lt;code&gt;geoip&lt;/code&gt; filter.&lt;/p&gt;

&lt;p&gt;To query this API with logstash I will use the &lt;code&gt;http_poller&lt;/code&gt; plugin as the &lt;code&gt;input&lt;/code&gt; of the pipeline; it basically makes requests to an &lt;em&gt;endpoint&lt;/em&gt; on a specified schedule.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;input {
    http_poller {
        urls =&amp;gt; { "api" =&amp;gt; "http://10.0.1.100:5000/netstat" } 
        schedule =&amp;gt; { "every" =&amp;gt; "30s"}
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;The way that logstash receives the data makes no difference for the &lt;code&gt;geoip&lt;/code&gt; filter, you only need a field with a public IP address.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The simplest configuration for the &lt;a href="https://www.elastic.co/guide/en/logstash/current/plugins-filters-geoip.html"&gt;&lt;code&gt;geoip&lt;/code&gt;&lt;/a&gt; filter has only one required option, &lt;code&gt;source&lt;/code&gt;, which is the name of the field containing the IP address whose geolocation data we want to look up.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;filter {
    geoip {
        source =&amp;gt; "dstip"
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When an event passes through this filter successfully, a new field named &lt;code&gt;geoip&lt;/code&gt; is added to the event.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"geoip" =&amp;gt; {
    "country_code2" =&amp;gt; "IS",
    "continent_code" =&amp;gt; "EU",
    "city_name" =&amp;gt; "Reykjavik",
    "country_code3" =&amp;gt; "IS",
    "region_code" =&amp;gt; "1",
    "timezone" =&amp;gt; "Atlantic/Reykjavik",
    "region_name" =&amp;gt; "Capital Region",
    "location" =&amp;gt; {
        "lon" =&amp;gt; -21.9466,
        "lat" =&amp;gt; 64.1432
    },
    "latitude" =&amp;gt; 64.1432,
    "ip" =&amp;gt; "31.209.137.10",
    "country_name" =&amp;gt; "Iceland",
    "postal_code" =&amp;gt; "101",
    "longitude" =&amp;gt; -21.9466
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the &lt;code&gt;geoip&lt;/code&gt; lookup fails, the tag &lt;code&gt;_geoip_lookup_failure&lt;/code&gt; will be added to the event.&lt;/p&gt;

&lt;p&gt;We can change the default behaviour using other options in the filter configuration.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;target&lt;/code&gt;: name of the field where the geolocation data should be stored, the default is &lt;code&gt;geoip&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;default_database_type&lt;/code&gt;: it has only two options, &lt;code&gt;City&lt;/code&gt; and &lt;code&gt;ASN&lt;/code&gt;; the first is the default and returns geographic information about the IP, while the second returns information about the Autonomous System (&lt;em&gt;AS&lt;/em&gt;) associated with the IP.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;fields&lt;/code&gt;: the fields that should be returned, by default all available fields will be returned.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tag_on_failure&lt;/code&gt;: the name of the tag that should be added on failure, the default is &lt;code&gt;_geoip_lookup_failure&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since the idea is to enrich the events, we are going to use two &lt;code&gt;geoip&lt;/code&gt; filters in sequence: one with the option &lt;code&gt;default_database_type&lt;/code&gt; set to &lt;code&gt;City&lt;/code&gt; and the other with the same option set to &lt;code&gt;ASN&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;filter {
    geoip {
        default_database_type =&amp;gt; "City"
        source =&amp;gt; "dstip"
        target =&amp;gt; "geo"
        tag_on_failure =&amp;gt; ["geoip-city-failed"]
    }
    geoip {
        default_database_type =&amp;gt; "ASN"
        source =&amp;gt; "dstip"
        target =&amp;gt; "geo"
        tag_on_failure =&amp;gt; ["geoip-asn-failed"]
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  filtering private addresses
&lt;/h3&gt;

&lt;p&gt;When the field that we use as the source for the &lt;code&gt;geoip&lt;/code&gt; filter can also have private IP addresses, we need to filter out those IPs.&lt;/p&gt;

&lt;p&gt;A simple way to prevent private IPs from passing through the &lt;code&gt;geoip&lt;/code&gt; filter is to use a conditional that adds a &lt;em&gt;tag&lt;/em&gt; to the events that have a private IP, and another conditional that applies the &lt;code&gt;geoip&lt;/code&gt; filter only to events without this tag.&lt;/p&gt;

&lt;p&gt;In our example we need to filter out the network &lt;code&gt;10.0.1.0/24&lt;/code&gt;, the localhost IP &lt;code&gt;127.0.0.1&lt;/code&gt; and the non-routable address &lt;code&gt;0.0.0.0&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if [dstip] =~ "^10.0.*" or [dstip] =~ "^127.0.*" or [dstip] == "0.0.0.0" { 
    mutate {
        add_tag =&amp;gt; ["internal"]
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
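
&lt;p&gt;An alternative to the regular expressions is the &lt;code&gt;cidr&lt;/code&gt; filter, which matches an IP field against a list of networks; the sketch below assumes the same &lt;code&gt;dstip&lt;/code&gt; field and the networks of our example.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cidr {
    address =&amp;gt; ["%{dstip}"]
    network =&amp;gt; ["10.0.1.0/24", "127.0.0.0/8", "0.0.0.0/32"]
    add_tag =&amp;gt; ["internal"]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;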



&lt;p&gt;Now all the events where the destination is a private IP will have the &lt;em&gt;tag&lt;/em&gt; &lt;code&gt;internal&lt;/code&gt; and we can use this &lt;em&gt;tag&lt;/em&gt; to avoid applying the &lt;code&gt;geoip&lt;/code&gt; filter on those events.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if "internal" not in [tags] {
    geoip {
        default_database_type =&amp;gt; "City"
        source =&amp;gt; "dstip"
        target =&amp;gt; "geo"
        tag_on_failure =&amp;gt; ["geoip-city-failed"]
    }
    geoip {
        default_database_type =&amp;gt; "ASN"
        source =&amp;gt; "dstip"
        target =&amp;gt; "geo"
        tag_on_failure =&amp;gt; ["geoip-asn-failed"]
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  mapping
&lt;/h3&gt;

&lt;p&gt;Before we can start analyzing the data in elasticsearch or creating maps in kibana, we need to create a mapping to tell elasticsearch the data type of every field.&lt;/p&gt;

&lt;p&gt;Although elasticsearch can infer the data type and create a mapping at ingestion time, this does not work for the fields that need to store geolocation data; those fields need to be defined manually before the index is created.&lt;/p&gt;

&lt;p&gt;The required part of the mapping is the one that defines the field &lt;code&gt;geo.location&lt;/code&gt; as having the type &lt;code&gt;geo_point&lt;/code&gt;, so if we use an index named &lt;code&gt;endpoints&lt;/code&gt;, we would need to create the index and apply the mapping for this field.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PUT /endpoints
PUT /endpoints/_mapping
{
    "properties": {
        "geo": {
            "properties": {
                "location": {
                    "type": "geo_point"
                }
            }
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This way we can guarantee that the field &lt;code&gt;geo.location&lt;/code&gt; has the type &lt;code&gt;geo_point&lt;/code&gt;, and we can let elasticsearch create the mapping for the other fields when it indexes the first document.&lt;/p&gt;

&lt;p&gt;While this works without any problem, it is better to create a template for our index with the data type of every field.&lt;/p&gt;

&lt;p&gt;In our example we can use the following template.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PUT _template/endpoints
{
    "order" : 0,
    "version" : 1,
    "index_patterns" : [ "endpoints" ],
    "settings" : {
      "index" : {
        "mapping" : {
          "ignore_malformed" : "true"
        },
        "refresh_interval" : "5s",
        "number_of_shards" : "1",
        "number_of_replicas" : "0"
      }
    },
    "mappings" : {
        "properties" : {
            "@timestamp" : {
                "format" : "strict_date_optional_time||epoch_millis",
                "type" : "date"
            },
            "@version" : { "type" : "keyword" },
            "status" : { "type" : "keyword" },
            "srcip" : { "type" : "ip" },
            "srcport" : { "type" : "keyword" },
            "dstip" : { "type" : "ip" },
            "dstport" : { "type" : "keyword" },
            "geo" : {
                "properties" : {
                    "as_org" : { "type" : "keyword" },
                    "asn" : { "type" : "keyword" },
                    "country_code2" : { "type" : "keyword" },
                    "country_code3" : { "type" : "keyword" },
                    "country_name" : { "type" : "keyword" },
                    "continent_code" : { "type" : "keyword" },
                    "city_name" : { "type" : "keyword" },
                    "region_code" : { "type" : "keyword" },
                    "region_name" : { "type" : "keyword" },
                    "postal_code" : { "type" : "keyword" },
                    "ip" : { "type" : "ip" },
                    "location" : { "type" : "geo_point" },
                    "latitude" : { "type" : "float" },
                    "longitude" : { "type" : "float" },
                    "timezone" : { "type" : "keyword" }
                }
            },
            "message" : { "type" : "text" },
            "tags" : { "type" : "keyword"}
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Since the template is only applied at index creation, if the index already exists it is necessary to delete it first with the request &lt;code&gt;DELETE endpoints&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
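
&lt;p&gt;For example, if the index already exists, we can delete it, create it again so the template is applied, and check the resulting mapping.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DELETE /endpoints
PUT /endpoints
GET /endpoints/_mapping
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;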

&lt;p&gt;After we create the template, we can add the elasticsearch output to the pipeline.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;output {
    elasticsearch {
        hosts =&amp;gt; ["http://elk:9200"]
        index =&amp;gt; "endpoints"
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  visualizing the data and creating maps
&lt;/h3&gt;

&lt;p&gt;As soon as we start logstash with the configured pipeline, the data from the API starts to be collected and sent to elasticsearch. After we create an &lt;em&gt;index pattern&lt;/em&gt;, we can visualize the data with the geolocation in kibana.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2_9OvoYf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/qqshyenooo5xcpyh3ccy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2_9OvoYf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/qqshyenooo5xcpyh3ccy.png" alt="kibana discovery"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While our original event only had the information about the destination IP address, using the &lt;code&gt;geoip&lt;/code&gt; filter allowed us to enrich the documents with more information about the destination IP address.&lt;/p&gt;

&lt;p&gt;The field &lt;code&gt;geo.country_name&lt;/code&gt; is an example of enrichment by the &lt;code&gt;geoip&lt;/code&gt; filter with the option &lt;code&gt;default_database_type&lt;/code&gt; set to &lt;code&gt;City&lt;/code&gt;, and the field &lt;code&gt;geo.as_org&lt;/code&gt; is an example of enrichment by the same filter with &lt;code&gt;default_database_type&lt;/code&gt; set to &lt;code&gt;ASN&lt;/code&gt;; this is why we used two &lt;code&gt;geoip&lt;/code&gt; filters in the pipeline.&lt;/p&gt;

&lt;p&gt;We can also plot the data using the &lt;em&gt;Maps&lt;/em&gt; tool in kibana.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CGS0wIT4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/j72x9i55zuoc5eygfrp9.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CGS0wIT4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/j72x9i55zuoc5eygfrp9.gif" alt="geoip map"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Where before our monitoring only had the information about the destination IP, now we have an idea of where this destination is and can visualize it on a map.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rCF5Jd7P--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/w0d984clkmrng55p62cb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rCF5Jd7P--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/w0d984clkmrng55p62cb.png" alt="map zoom"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;An important thing to consider is that the coordinates obtained with the &lt;code&gt;geoip&lt;/code&gt; filter are not exact. Their accuracy varies with the geographic location, being better in some countries than in others, with the type of connection, since IPs from mobile connections are harder to locate, and with the zoom level used in the map.&lt;/p&gt;

&lt;p&gt;This &lt;a href="https://www.maxmind.com/en/geoip2-city-accuracy-comparison"&gt;link&lt;/a&gt; from the &lt;a href="https://dev.maxmind.com/geoip/geoip2/geolite2/"&gt;maxmind&lt;/a&gt; page allows us to check the accuracy in each case.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The result of the geoip filter should never be considered exact and may not correspond to reality.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For more information about the &lt;code&gt;geoip&lt;/code&gt; filter you can check the &lt;a href="https://www.elastic.co/guide/en/logstash/current/plugins-filters-geoip.html"&gt;official documentation&lt;/a&gt; from elastic.&lt;/p&gt;

</description>
      <category>logstash</category>
      <category>geoip</category>
      <category>elasticsearch</category>
    </item>
    <item>
      <title>logstash: improving performance by replacing grok with dissect.</title>
      <dc:creator>Leandro Pereira</dc:creator>
      <pubDate>Sat, 26 Sep 2020 22:14:20 +0000</pubDate>
      <link>https://dev.to/leandrojmp/logstash-improving-performance-by-replacing-grok-with-dissect-2g15</link>
      <guid>https://dev.to/leandrojmp/logstash-improving-performance-by-replacing-grok-with-dissect-2g15</guid>
      <description>&lt;h3&gt;
  
  
  Introduction
&lt;/h3&gt;

&lt;p&gt;For a while now I have been using the &lt;a href="https://www.elastic.co/elastic-stack"&gt;elastic stack&lt;/a&gt; to collect and store logs from applications, services and devices that I need to monitor. Since &lt;strong&gt;Logstash&lt;/strong&gt; is responsible for receiving, parsing and publishing those logs, guaranteeing that it performs well is extremely important.&lt;/p&gt;

&lt;p&gt;A lot of the performance problems that I've had were caused by configuration or implementation errors, or by incorrect usage of some filters, mainly the &lt;code&gt;grok&lt;/code&gt; filter.&lt;/p&gt;

&lt;p&gt;While the &lt;code&gt;grok&lt;/code&gt; filter is extremely useful, it is based on regular expressions and needs to validate every configured expression. Depending on how many events per second logstash receives, those validations can lead to an increased load on the CPU, which can affect the entire monitoring process.&lt;/p&gt;

&lt;h3&gt;
  
  
  Grok or Dissect?
&lt;/h3&gt;

&lt;p&gt;One simple way to improve the performance of a pipeline is to check if it is possible to use another filter instead of &lt;code&gt;grok&lt;/code&gt;. A filter that I've been using as a replacement for &lt;code&gt;grok&lt;/code&gt; is the &lt;code&gt;dissect&lt;/code&gt; filter.&lt;/p&gt;

&lt;p&gt;The main difference between &lt;code&gt;grok&lt;/code&gt; and &lt;code&gt;dissect&lt;/code&gt; is that &lt;code&gt;dissect&lt;/code&gt; does not use regular expressions to parse the message; the fields are defined by their position: everything between a &lt;code&gt;%{&lt;/code&gt; and a &lt;code&gt;}&lt;/code&gt; is seen as a field and everything else is seen as a delimiter. This makes &lt;code&gt;dissect&lt;/code&gt; faster and lighter than &lt;code&gt;grok&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Considering the following example message:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2020-08-20T21:45:50 127.0.0.1 [1] program.Logger INFO - sample message
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can parse that message into the following fields:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;timestamp&lt;/code&gt;, &lt;code&gt;ipaddr&lt;/code&gt;, &lt;code&gt;thread&lt;/code&gt;, &lt;code&gt;logger&lt;/code&gt;, &lt;code&gt;loglevel&lt;/code&gt;, &lt;code&gt;msg&lt;/code&gt;
&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fP5eJ4ID--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/hsa2nh5obrg55iqt4fmu.png" alt="sample message"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To parse that message with &lt;code&gt;grok&lt;/code&gt; and &lt;code&gt;dissect&lt;/code&gt;, we use the following configuration.&lt;/p&gt;

&lt;h4&gt;
  
  
  grok
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;grok&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;match&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;message&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;%{TIMESTAMP_ISO8601:timestamp} %{IP:ipaddr} &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;[%{INT:thread}&lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;] %{DATA:logger} %{WORD:loglevel} - %{GREEDYDATA:msg}&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  dissect
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;dissect&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;mapping&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;message&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;%{timestamp} %{ipaddr} [%{thread}] %{logger} %{loglevel} - %{msg}&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can see that the structure is very similar, but with &lt;code&gt;grok&lt;/code&gt; we need to specify the pattern that validates each field, for example &lt;code&gt;TIMESTAMP_ISO8601&lt;/code&gt; for the field &lt;code&gt;timestamp&lt;/code&gt; or &lt;code&gt;IP&lt;/code&gt; for the field &lt;code&gt;ipaddr&lt;/code&gt;. With &lt;code&gt;dissect&lt;/code&gt; this is not needed, because it takes whatever is at that position as the value of the field, and this is one of the main criteria for choosing between the two.&lt;/p&gt;

&lt;p&gt;If the messages collected by a pipeline always have the same structure, or only small variations, using &lt;code&gt;dissect&lt;/code&gt; instead of &lt;code&gt;grok&lt;/code&gt; is possible and advantageous, since the parsing will be faster and will need less processing power.&lt;/p&gt;

&lt;p&gt;But how much faster is it, and how much less processing power does it need?&lt;/p&gt;

&lt;h3&gt;
  
  
  Grok vs Dissect
&lt;/h3&gt;

&lt;p&gt;To compare the performance of the two filters, I've used a simple pipeline with the &lt;code&gt;generator&lt;/code&gt; plugin as the &lt;code&gt;input&lt;/code&gt; and the &lt;code&gt;stdout&lt;/code&gt; plugin as the &lt;code&gt;output&lt;/code&gt;, running the &lt;code&gt;grok&lt;/code&gt; and &lt;code&gt;dissect&lt;/code&gt; filters shown before on separate runs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;input&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;generator&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;lines&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;2020-08-20T21:45:50 127.0.0.1 [1] program.Logger INFO - sample message&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="nx"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;10000000&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;filter&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="err"&gt;#&lt;/span&gt;
    &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;grok&lt;/span&gt; &lt;span class="nx"&gt;or&lt;/span&gt; &lt;span class="nx"&gt;dissect&lt;/span&gt; &lt;span class="nx"&gt;filter&lt;/span&gt;
    &lt;span class="err"&gt;#&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;sequence&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="nx"&gt;and&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;sequence&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;9999999&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
         &lt;span class="nx"&gt;drop&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
     &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nx"&gt;output&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;stdout&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This pipeline basically generates 10 million messages, applies the specified filter, which can be &lt;code&gt;grok&lt;/code&gt; or &lt;code&gt;dissect&lt;/code&gt;, and uses &lt;code&gt;drop&lt;/code&gt; to discard every message except the first and the last one, which are shown just to mark the beginning and the end of the processing.&lt;/p&gt;

&lt;p&gt;The output of this pipeline is the following.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;host&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;elk&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ipaddr&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;127.0.0.1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;message&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;2020-08-20T21:45:50 127.0.0.1 [1] program.Logger INFO - sample message&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;sequence&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;loglevel&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;INFO&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;logger&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;program.Logger&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@version&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;timestamp&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;2020-08-20T21:45:50&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;msg&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;sample message&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;thread&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@timestamp&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;2020&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;08&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;26&lt;/span&gt;&lt;span class="nx"&gt;T02&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;38.127&lt;/span&gt;&lt;span class="nx"&gt;Z&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;host&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;elk&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ipaddr&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;127.0.0.1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;message&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;2020-08-20T21:45:50 127.0.0.1 [1] program.Logger INFO - sample message&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;sequence&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;9999999&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;loglevel&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;INFO&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;logger&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;program.Logger&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@version&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;timestamp&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;2020-08-20T21:45:50&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;msg&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;sample message&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;thread&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@timestamp&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;2020&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;08&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;26&lt;/span&gt;&lt;span class="na"&gt;T02&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;02&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;34.018&lt;/span&gt;&lt;span class="nx"&gt;Z&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since I was interested in comparing the CPU usage of the machine during the execution of the filters, I've used &lt;code&gt;nmon&lt;/code&gt; running in the background, collecting metrics every &lt;strong&gt;1&lt;/strong&gt; second for &lt;strong&gt;5&lt;/strong&gt; minutes, resulting in &lt;strong&gt;300&lt;/strong&gt; samples, which is more than enough for this case.&lt;/p&gt;

&lt;p&gt;To run the pipeline I've used a VM with &lt;strong&gt;4&lt;/strong&gt; vCPUs and &lt;strong&gt;4 GB&lt;/strong&gt; of RAM, running &lt;strong&gt;CentOS 7.8&lt;/strong&gt; and &lt;strong&gt;Logstash&lt;/strong&gt; version &lt;strong&gt;7.9&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Each sample from &lt;strong&gt;nmon&lt;/strong&gt; has the following format.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ZZZZ,T0010,00:16:49,21-AUG-2020
CPU001,T0010,50.5,2.0,0.0,46.5,1.0
CPU002,T0010,99.0,0.0,0.0,1.0,0.0
CPU003,T0010,97.0,1.0,0.0,1.0,1.0
CPU004,T0010,96.0,1.0,0.0,3.0,0.0
CPU_ALL,T0010,86.4,0.8,0.0,12.8,0.0,,4
MEM,T0010,3789.0,-0.0,-0.0,1024.0,2903.2,-0.0,-0.0,1024.0,-0.0,281.9,577.3,-1.0,2.0,0.0,181.5
VM,T0010,30,0,0,2384,12760,-1,0,0,0,0,27,0,0,2541,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
PROC,T0010,5,0,2524.7,-1.0,-1.0,-1.0,1.0,-1.0,-1.0,-1.0
NET,T0010,2.6,0.1,2.8,0.1
NETPACKET,T0010,10.9,2.0,12.9,2.0
JFSFILE,T0010,59.0,0.0,0.5,59.0,24.3,4.5,4.5,4.5
DISKBUSY,T0010,0.0,0.0,0.0,0.0
DISKREAD,T0010,0.0,0.0,0.0,0.0
DISKWRITE,T0010,0.0,0.0,0.0,0.0
DISKXFER,T0010,0.0,0.0,0.0,0.0
DISKBSIZE,T0010,0.0,0.0,0.0,0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From all those lines, only the one with &lt;strong&gt;CPU_ALL&lt;/strong&gt; matters, more specifically its third column, which has the average usage percentage of all &lt;strong&gt;CPUs&lt;/strong&gt; at the time of the sample.&lt;/p&gt;
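
&lt;p&gt;The extraction of the &lt;strong&gt;CPU_ALL&lt;/strong&gt; values can be sketched with a few lines of Python; the samples below are hypothetical and only the column layout matches the nmon output.&lt;/p&gt;

```python
# Extract the CPU usage values from nmon CSV output.
# A CPU_ALL line looks like: CPU_ALL,T0010,86.4,0.8,0.0,12.8,0.0,,4
# and the third column holds the usage percentage of the sample.

def cpu_all_usage(nmon_lines):
    """Return the usage value of every CPU_ALL sample."""
    usage = []
    for line in nmon_lines:
        fields = line.strip().split(",")
        if fields and fields[0] == "CPU_ALL":
            usage.append(float(fields[2]))
    return usage

# Hypothetical samples following the column layout shown above.
sample = [
    "ZZZZ,T0010,00:16:49,21-AUG-2020",
    "CPU_ALL,T0010,86.4,0.8,0.0,12.8,0.0,,4",
    "MEM,T0010,3789.0,-0.0,-0.0,1024.0",
    "CPU_ALL,T0011,40.2,1.1,0.0,58.7,0.0,,4",
]
print(cpu_all_usage(sample))  # → [86.4, 40.2]
```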

&lt;p&gt;After some manipulation of the data collected during the execution of the pipeline, first with &lt;code&gt;grok&lt;/code&gt; and then with &lt;code&gt;dissect&lt;/code&gt;, we can visualize and compare the CPU usage and the execution time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZLASA4p0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/gxwscw781bqvl45u512z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZLASA4p0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/gxwscw781bqvl45u512z.png" alt="grok vs dissect"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The initial usage spike that we see in the graph is caused by the initialization of &lt;strong&gt;Logstash&lt;/strong&gt;; the following plateau corresponds to the usage during the processing of the messages.&lt;/p&gt;

&lt;p&gt;Looking at the graph we can see that the time to process the 10 million messages is basically the same for both filters in this specific use case, but there is a big difference in CPU usage: an average of &lt;strong&gt;40%&lt;/strong&gt; when using the &lt;code&gt;dissect&lt;/code&gt; filter against an average of more than &lt;strong&gt;60%&lt;/strong&gt; when using &lt;code&gt;grok&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion and Links
&lt;/h3&gt;

&lt;p&gt;Depending on the number of pipelines and the events per second of each one, dedicating some time to check whether &lt;code&gt;grok&lt;/code&gt; can be replaced by &lt;code&gt;dissect&lt;/code&gt; can lead to a huge improvement in data ingestion and analysis. It can also cut costs when your infrastructure is in the cloud, since it makes it possible to use smaller instances and, in some cases, to consume fewer CPU credits.&lt;/p&gt;

&lt;p&gt;In cases where the logs always have the same structure, such as web server logs, application logs, firewalls and routers, or when you control the format of the logs, the &lt;code&gt;dissect&lt;/code&gt; filter is the best choice when using &lt;strong&gt;Logstash&lt;/strong&gt; to collect and parse them.&lt;/p&gt;

&lt;h4&gt;
  
  
  Links
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;dissect&lt;/code&gt;: &lt;a href="https://www.elastic.co/guide/en/logstash/current/plugins-filters-dissect.html"&gt;official documentation&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;grok&lt;/code&gt;: &lt;a href="https://www.elastic.co/guide/en/logstash/current/plugins-filters-grok.html"&gt;official documentation&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;This post was previously published on my blog, at &lt;a href="https://web.leandrojmp.com/posts/en/2020/08/logstash-grok-vs-dissect"&gt;web.leandrojmp.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>logstash</category>
      <category>grok</category>
      <category>dissect</category>
    </item>
  </channel>
</rss>
