<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Zenit Kovačević</title>
    <description>The latest articles on DEV Community by Zenit Kovačević (@zenitk).</description>
    <link>https://dev.to/zenitk</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F373179%2F57f28a80-4927-4816-a653-bfec77a8b176.jpeg</url>
      <title>DEV Community: Zenit Kovačević</title>
      <link>https://dev.to/zenitk</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/zenitk"/>
    <language>en</language>
    <item>
      <title>Import from CSV to Elasticsearch using Logstash</title>
      <dc:creator>Zenit Kovačević</dc:creator>
      <pubDate>Thu, 28 May 2020 00:00:00 +0000</pubDate>
      <link>https://dev.to/zenitk/import-from-csv-to-elasticsearch-using-logstash-3a56</link>
      <guid>https://dev.to/zenitk/import-from-csv-to-elasticsearch-using-logstash-3a56</guid>
      <description>&lt;p&gt;Frequently when we want to test out a new feature in Elasticsearch we need some data to play with. Unfortunately, Kibana and Elasticsearch don’t provide an easy, out-of-the-box way to simply import a CSV.&lt;/p&gt;

&lt;p&gt;That’s why there is Logstash, the &lt;strong&gt;L&lt;/strong&gt; in the well-known ELK stack. Its job is to &lt;em&gt;watch&lt;/em&gt; a data source, &lt;em&gt;process&lt;/em&gt; incoming data, and &lt;em&gt;output&lt;/em&gt; it to specified destinations. Once started, it usually stays on and watches for any changes in the data source.&lt;/p&gt;

&lt;p&gt;Here, I’ll guide you step by step on how to import a sample CSV into Elasticsearch 7.x using Logstash 7.x. By the end, you should be able to tweak the provided example to fit your own needs. There will be plenty of links to the official documentation, but that’s just for convenience.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sample CSV
&lt;/h2&gt;

&lt;p&gt;For this blog, I’ve imported the usual &lt;a href="https://news.ycombinator.com/"&gt;Hackernews stories&lt;/a&gt; from &lt;a href="https://cloud.google.com/bigquery/public-data"&gt;BigQuery Public Data Sets&lt;/a&gt;. The CSV is available &lt;a href="https://www.zenitk.com/assets/posts/2020/import-from-csv-to-elasticsearch-with-logstash/hackernews.csv"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The schema looks like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field name&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;id&lt;/td&gt;
&lt;td&gt;INTEGER&lt;/td&gt;
&lt;td&gt;Unique story ID&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;by&lt;/td&gt;
&lt;td&gt;STRING&lt;/td&gt;
&lt;td&gt;Username of submitter&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;score&lt;/td&gt;
&lt;td&gt;INTEGER&lt;/td&gt;
&lt;td&gt;Story score&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;time&lt;/td&gt;
&lt;td&gt;INTEGER&lt;/td&gt;
&lt;td&gt;Unix time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;time_ts&lt;/td&gt;
&lt;td&gt;TIMESTAMP&lt;/td&gt;
&lt;td&gt;Human readable time in UTC (format: YYYY-MM-DD hh:mm:ss)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;title&lt;/td&gt;
&lt;td&gt;STRING&lt;/td&gt;
&lt;td&gt;Story title&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;url&lt;/td&gt;
&lt;td&gt;STRING&lt;/td&gt;
&lt;td&gt;Story url&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;text&lt;/td&gt;
&lt;td&gt;STRING&lt;/td&gt;
&lt;td&gt;Story text&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;deleted&lt;/td&gt;
&lt;td&gt;BOOLEAN&lt;/td&gt;
&lt;td&gt;Is deleted?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;dead&lt;/td&gt;
&lt;td&gt;BOOLEAN&lt;/td&gt;
&lt;td&gt;Is dead?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;descendants&lt;/td&gt;
&lt;td&gt;INTEGER&lt;/td&gt;
&lt;td&gt;Number of story descendants&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;author&lt;/td&gt;
&lt;td&gt;STRING&lt;/td&gt;
&lt;td&gt;Username of author&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Here are some sample rows (for readability purposes not all columns are selected):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;id&lt;/th&gt;
&lt;th&gt;by&lt;/th&gt;
&lt;th&gt;time_ts&lt;/th&gt;
&lt;th&gt;title&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;7405411&lt;/td&gt;
&lt;td&gt;ferhack&lt;/td&gt;
&lt;td&gt;2014-03-15 17:55:02 UTC&lt;/td&gt;
&lt;td&gt;Hacker&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7541990&lt;/td&gt;
&lt;td&gt;jeassonlens&lt;/td&gt;
&lt;td&gt;2014-04-06 18:29:21 UTC&lt;/td&gt;
&lt;td&gt;Marc Cormier, Real Estate Marketing Specialist&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7163174&lt;/td&gt;
&lt;td&gt;aryekellman01&lt;/td&gt;
&lt;td&gt;2014-02-01 20:10:58 UTC&lt;/td&gt;
&lt;td&gt;Accents from Around The World&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6807522&lt;/td&gt;
&lt;td&gt;wearefivek&lt;/td&gt;
&lt;td&gt;2013-11-27 11:14:52 UTC&lt;/td&gt;
&lt;td&gt;Developer For Small, Creative Digital Design Studio&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6950020&lt;/td&gt;
&lt;td&gt;zeiny&lt;/td&gt;
&lt;td&gt;2013-12-22 11:23:41 UTC&lt;/td&gt;
&lt;td&gt;Psychotropic drug use among people with dementia – a six-month follow-up study&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Prepare Logstash Configuration
&lt;/h2&gt;

&lt;p&gt;First, follow the &lt;a href="https://www.elastic.co/guide/en/logstash/current/installing-logstash.html"&gt;instructions on installing Logstash&lt;/a&gt;. Then, take a brief look at the full Logstash configuration script below. I’ll explain the details further down. The script can be downloaded &lt;a href="https://www.zenitk.com/assets/posts/2020/import-from-csv-to-elasticsearch-with-logstash/import-hackernews.conf"&gt;here&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;input {
  file {
    path =&amp;gt; "/home/zenitk/hackernews.csv"
    start_position =&amp;gt; "beginning"
    sincedb_path =&amp;gt; "/dev/null"
  }
}

filter {

    csv {
        columns =&amp;gt; [
            "id","by","score","time","time_ts","title",
            "url","text","deleted","dead","descendants","author"
        ]
        separator =&amp;gt; ","
        skip_header =&amp;gt; true
    }

    date {
        match =&amp;gt; ["time_ts", "yyyy-MM-dd HH:mm:ss z"]
        target =&amp;gt; "post_date"
        remove_field =&amp;gt; "time_ts"
    }

    mutate {        
        rename =&amp;gt; { 
            "dead" =&amp;gt; "is_dead"
            "id" =&amp;gt; "[@metadata][id]"            
        }

        convert =&amp;gt; {
            "is_dead" =&amp;gt; "boolean"
        }

        remove_field =&amp;gt; [
            "@timestamp", 
            "host",
            "message", 
            "@version", 
            "path", 
            "descendants",
            "time",
            "id"
        ]
    }
}

output {

  elasticsearch { 
      hosts =&amp;gt; ["localhost:9200"]
      index =&amp;gt; "hackernews_import"
      document_id =&amp;gt; "%{[@metadata][id]}"
  }

  stdout {}
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As expected, each Logstash configuration file is defined by three main elements: &lt;strong&gt;input&lt;/strong&gt;, &lt;strong&gt;filter&lt;/strong&gt;, and &lt;strong&gt;output&lt;/strong&gt;. More information can be found in the &lt;a href="https://www.elastic.co/guide/en/logstash/current/configuration-file-structure.html"&gt;documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Input
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;input {
  file {
    path =&amp;gt; "/home/zenitk/hackernews.csv"
    start_position =&amp;gt; "beginning"
    sincedb_path =&amp;gt; "/dev/null"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we define a &lt;em&gt;file&lt;/em&gt; on the local machine as the source. At this point, this could be any text file; Logstash doesn’t care. In the example above, we also say that when it runs it should read the file from the &lt;em&gt;beginning&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.elastic.co/guide/en/logstash/7.x/plugins-inputs-file.html#plugins-inputs-file-path"&gt;path&lt;/a&gt; - the &lt;strong&gt;absolute&lt;/strong&gt; path on your machine&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.elastic.co/guide/en/logstash/7.x/plugins-inputs-file.html#_tracking_of_current_position_in_watched_files"&gt;start_position&lt;/a&gt; - either &lt;em&gt;beginning&lt;/em&gt; or &lt;em&gt;end&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.elastic.co/guide/en/logstash/7.x/plugins-inputs-file.html#_tracking_of_current_position_in_watched_files"&gt;sincedb_path&lt;/a&gt; - the location where Logstash stores how far it has read, so that the next time it starts it can simply continue where it stopped&lt;/p&gt;
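
&lt;p&gt;For example, if you want Logstash to remember its position between runs instead of always re-reading the whole file, point &lt;em&gt;sincedb_path&lt;/em&gt; at a real file (a sketch; the sincedb path here is just a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;input {
  file {
    path =&amp;gt; "/home/zenitk/hackernews.csv"
    start_position =&amp;gt; "beginning"
    # persist the read position between restarts instead of discarding it
    sincedb_path =&amp;gt; "/home/zenitk/.sincedb_hackernews"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;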

&lt;p&gt;&lt;strong&gt;Warning&lt;/strong&gt;: if you’re on Windows, the path separator in &lt;em&gt;path&lt;/em&gt; has to be &lt;strong&gt;/&lt;/strong&gt;, but in &lt;em&gt;sincedb_path&lt;/em&gt; it has to be &lt;strong&gt;\&lt;/strong&gt;. If you know why, let me know!&lt;/p&gt;

&lt;p&gt;More information on the file plugin &lt;a href="https://www.elastic.co/guide/en/logstash/current/plugins-inputs-file.html"&gt;here in the official documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Filter
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;filter {

    csv {
        columns =&amp;gt; [
            "id","by","score","time","time_ts","title",
            "url","text","deleted","dead","descendants","author"
            ]
        separator =&amp;gt; ","
        skip_header =&amp;gt; true
    }

    date {
        match =&amp;gt; ["time_ts", "yyyy-MM-dd HH:mm:ss z"]
        target =&amp;gt; "post_date"
        remove_field =&amp;gt; "time_ts"
    }

    mutate {        
        rename =&amp;gt; { 
            "dead" =&amp;gt; "is_dead"
            "id" =&amp;gt; "[@metadata][id]"            
        }

        convert =&amp;gt; {
            "is_dead" =&amp;gt; "boolean"
        }

        remove_field =&amp;gt; [
            "@timestamp", 
            "host",
            "message", 
            "@version", 
            "path", 
            "descendants",
            "time",
            "id"
        ]
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This section is optional if you just want to import the file and don’t care how it shows up in Elasticsearch. But you should know: this is where &lt;strong&gt;all the fun&lt;/strong&gt; happens!&lt;/p&gt;

&lt;p&gt;With the &lt;a href="https://www.elastic.co/guide/en/logstash/current/plugins-filters-csv.html"&gt;CSV plugin&lt;/a&gt; we tell Logstash how our data is structured. We explicitly write all the column names and skip the header. Note that the filter itself supports &lt;a href="https://www.elastic.co/guide/en/logstash/current/plugins-filters-csv.html#plugins-filters-csv-autodetect_column_names"&gt;auto-detection of the column names&lt;/a&gt; based on the header. However, this requires some fiddling with the internal Logstash configuration; specifically, &lt;em&gt;pipeline.workers&lt;/em&gt; has to be set to 1.&lt;/p&gt;
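
&lt;p&gt;If you want to try the auto-detection out, the CSV filter would look roughly like this (a sketch, not part of the script above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;csv {
    # read the column names from the CSV header line
    autodetect_column_names =&amp;gt; true
    separator =&amp;gt; ","
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then start Logstash with a single worker, e.g. &lt;em&gt;./logstash -f import-hackernews.conf --pipeline.workers 1&lt;/em&gt;, so the header line is guaranteed to be processed first.&lt;/p&gt;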

&lt;p&gt;The &lt;em&gt;time_ts&lt;/em&gt; field is basically just a string, so in the &lt;a href="https://www.elastic.co/guide/en/logstash/current/plugins-filters-date.html#plugins-filters-date-remove_field"&gt;date plugin&lt;/a&gt; we tell logstash that it’s actually a &lt;strong&gt;date&lt;/strong&gt; with the specified format. Then, we tell it to save it to the new &lt;strong&gt;target&lt;/strong&gt; field, and to remove the old one. This way Elasticsearch recognizes it as an actual date.&lt;/p&gt;

&lt;h4&gt;
  
  
  Mutate
&lt;/h4&gt;

&lt;p&gt;Now, the &lt;a href="https://www.elastic.co/guide/en/logstash/current/plugins-filters-mutate.html"&gt;mutate&lt;/a&gt; plugin is where it gets juicy! 🍉&lt;/p&gt;

&lt;p&gt;Here, we’re renaming &lt;strong&gt;dead&lt;/strong&gt; to &lt;strong&gt;is_dead&lt;/strong&gt; , because it’s common courtesy to name your booleans clearly.&lt;/p&gt;

&lt;p&gt;Then, we rename &lt;strong&gt;id&lt;/strong&gt; to &lt;strong&gt;[@metadata][id]&lt;/strong&gt; because we don’t want the id to be saved as a normal field in Elasticsearch; we want to use it as the actual document id. What we actually do here is store the id in a temporary variable that is not passed on to the final output. We use this variable later in the Output section. See the &lt;a href="https://www.elastic.co/blog/logstash-metadata"&gt;Metadata&lt;/a&gt; blog at Elastic for more information.&lt;/p&gt;

&lt;p&gt;Similar to the date field, in &lt;a href="https://www.elastic.co/guide/en/logstash/current/plugins-filters-mutate.html#plugins-filters-mutate-convert"&gt;convert&lt;/a&gt; we need to tell logstash that the string “true” is actually a boolean. No magic happens on its own.&lt;/p&gt;

&lt;p&gt;Since Logstash was primarily designed, well, for logging, it stores a bunch of fields to Elasticsearch like “@timestamp”, “host”, “message”, “@version”, and “path” that we don’t care about, so we drop them with the &lt;a href="https://www.elastic.co/guide/en/logstash/current/plugins-filters-mutate.html#plugins-filters-mutate-remove_field"&gt;remove_field&lt;/a&gt; configuration option. Additionally, we get rid of fields from our CSV that are just taking up space.&lt;/p&gt;

&lt;h3&gt;
  
  
  Output
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;output {

  elasticsearch { 
      hosts =&amp;gt; ["localhost:9200"]
      index =&amp;gt; "hackernews_import"
      document_id =&amp;gt; "%{[@metadata][id]}"
      #cacert =&amp;gt; "/home/hackerman/elastic.cer"
      #user =&amp;gt; "hackerman"
      #password =&amp;gt; "блинчикиandpancakesandcrepes"
  }

  stdout {}
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We output the data both into the standard output (for debugging purposes) and Elasticsearch.&lt;/p&gt;

&lt;p&gt;Now, in the &lt;strong&gt;elasticsearch&lt;/strong&gt; section, the &lt;strong&gt;document_id&lt;/strong&gt; setting tells Logstash to use our metadata field as the document id. The %{ ... } syntax simply references a field.&lt;/p&gt;
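
&lt;p&gt;The same syntax works in most settings, not just &lt;strong&gt;document_id&lt;/strong&gt;. For example (a sketch, not part of the script above), you could write to a daily index:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;elasticsearch {
    hosts =&amp;gt; ["localhost:9200"]
    # %{+YYYY.MM.dd} expands the event's @timestamp into the index name,
    # so you would have to keep @timestamp instead of removing it in the filter
    index =&amp;gt; "hackernews-%{+YYYY.MM.dd}"
    document_id =&amp;gt; "%{[@metadata][id]}"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;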

&lt;p&gt;If you have a custom certificate on your cluster and/or need credentials the commented section has the keywords you’ll need to access your cluster.&lt;/p&gt;

&lt;h2&gt;
  
  
  Run Logstash
&lt;/h2&gt;

&lt;p&gt;Now that we finally have our Logstash configuration file, we can run it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;./logstash -f /home/zenitk/import-hackernews.conf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If everything runs smoothly (disclaimer: usually it doesn’t) you should see something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;. . .

{
    "post_date" =&amp;gt; 2015-08-11T14:58:21.000Z,
       "author" =&amp;gt; "Datafloq",
        "score" =&amp;gt; "1",
           "by" =&amp;gt; "Datafloq",
         "text" =&amp;gt; nil,
      "is_dead" =&amp;gt; true,
      "deleted" =&amp;gt; nil,
         "time" =&amp;gt; "1439305101",
        "title" =&amp;gt; "5 Reasons Why Small Businesses Need Not Be Shy of Big Data",
          "url" =&amp;gt; "https://datafloq.com/read/5-reasons-small-businesses-need-not-shy-big-data/1390"
}
. . .
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we can do a check in Kibana:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# check the data types
GET hackernews_import/_mapping

# check if all the rows are imported
GET hackernews_import/_count
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that I didn’t create the index first, but rather let Elasticsearch dynamically define the fields. This is fine for testing purposes or when just trying out the Logstash script, but for production you should strongly consider creating the index first, with dynamic mapping set to strict. That, however, is a topic of its own.&lt;/p&gt;
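
&lt;p&gt;For illustration, creating the index upfront with strict dynamic mapping would look roughly like this in the Kibana console (a sketch with only a few of the fields; with &lt;em&gt;strict&lt;/em&gt; you would have to declare every field you import):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# any field not declared below is now rejected instead of auto-created
PUT hackernews_import
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "title":     { "type": "text" },
      "by":        { "type": "keyword" },
      "score":     { "type": "integer" },
      "post_date": { "type": "date" },
      "is_dead":   { "type": "boolean" }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;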

&lt;p&gt;Often with CSVs that contain a lot of text, there will be trouble with parsing. If you have a &lt;strong&gt;stdout&lt;/strong&gt; output, it will usually show you when and where parsing failed (although sometimes cryptically). You can correct those rows and simply append them to the end of the file while Logstash is running.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;We have now gone through the whole process of importing CSV data into Elasticsearch.&lt;/p&gt;

&lt;p&gt;Logstash can feel fiddly in the beginning, but once you get familiar with its quirks it can be a potent and easy tool for quick imports. I hope that after going through this blog you feel well-armed to tackle some imports on your own. For instance, how about importing a JSON document? Or from a database? I’m sure you’ll know where to start.&lt;/p&gt;
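
&lt;p&gt;A JSON-lines import, for instance, mostly just swaps the CSV filter for the json codec on the input (a sketch; the file path is a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;input {
  file {
    # one JSON object per line; each line becomes one event
    path =&amp;gt; "/home/zenitk/stories.json"
    start_position =&amp;gt; "beginning"
    sincedb_path =&amp;gt; "/dev/null"
    codec =&amp;gt; "json"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;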

&lt;p&gt;Please hit the comment section if you have some questions or feedback!&lt;/p&gt;

&lt;h2&gt;
  
  
  Other sources
&lt;/h2&gt;

&lt;p&gt;If you’re more of a video person, here is a similar (but a bit older) tutorial:&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/rKy4sFbIZ3U"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

</description>
      <category>logstash</category>
      <category>elasticsearch</category>
      <category>search</category>
    </item>
  </channel>
</rss>
