Daniel Dreier

Finding Un-mapped Fields in Elasticsearch

At work, our centralized logging Elasticsearch cluster recently experienced a "mapping explosion" that led to performance issues and ultimately an outage. This happened because we left dynamic field mapping enabled on our logging indices with no field count limit. In response, we decided that disabling dynamic field mapping was the right course of action. But now we need to know if there are any "legit" fields being used that weren't part of the Index Template.

What is a mapping explosion?

Normally our application logs have anywhere from 6-12 fields on them. Between our various applications we have around 25 fields in our standard Index Template. We also had dynamic field mapping enabled. This means that if a document/log line gets ingested with a field that isn't defined in the Index Template, Elasticsearch adds it to the mapping automatically, picking the field type based on its defaults or any dynamic mapping rules on the index.
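
As a rough illustration of that behavior (the logs-example index and surprise_field here are hypothetical, not from our cluster), indexing a document with a previously unseen field silently grows the mapping:

PUT logs-example/_doc/1
{
  "message": "user login succeeded",
  "surprise_field": "nobody defined this"
}

GET logs-example/_mapping

With Elasticsearch's default dynamic mapping rules, surprise_field would show up in the _mapping response mapped as text with a keyword sub-field.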

For an unknown reason, our Production logging Elasticsearch cluster started receiving logs with log message text as field names. This overloaded the log indices with fields, some growing to over 600. The first symptom we responded to was Elasticsearch data nodes repeatedly crashing: the nodes were already configured with 32GB of Java heap space each, and they were exhausting it within minutes. With the help of Elastic Support, we determined that the "explosion" of fields on the indices was causing the excessive heap usage.

Here are some index field counts from before and during the issue:

{"index_name":"logs-20200427","index_field_count":25}
{"index_name":"logs-20200428","index_field_count":27}
{"index_name":"logs-20200505","index_field_count":626}
{"index_name":"logs-20200506","index_field_count":533}

In order to get the data nodes to stay running for more than a few minutes, we ended up having to delete the worst-offending indices. We also disabled dynamic field mapping to prevent this from happening again.
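
For reference, the change looks roughly like this. This is a minimal sketch using the legacy index template API; the template name, index pattern, and fields are placeholders rather than our real template:

PUT _template/logs
{
  "index_patterns": ["logs-*"],
  "mappings": {
    "dynamic": false,
    "properties": {
      "@timestamp": { "type": "date" },
      "message": { "type": "text" }
    }
  }
}

With "dynamic": false, unmapped fields are still stored in _source but are no longer indexed or searchable. The stricter "dynamic": "strict" setting would reject documents containing unknown fields outright.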

Dynamic mapping is disabled, so are we now missing data?

Not every field being logged was included in the Index Template, so now that dynamic mapping is disabled, those fields are no longer searchable. I did some searching and couldn't find an easy way to get a list of all fields, including unmapped ones, directly from Elasticsearch. Asking on discuss.elastic.co led me down the path of using a scripted query and a scripted terms aggregation. The end result is a list of all fields on the index, mapped or not.

Accessing un-mapped fields

The "raw" document sent to Elasticsearch is kept under the _source field when indexed. _source maintains every field, whether or not it was mapped into a searchable field type. The data in _source can't be queried with the usual Elasticsearch query types, like terms or match. But it can be queried using a script.

In a script field, or in a script inside an aggregation, _source can be accessed via the params object. To figure out where to go from there, I used Painless's built-in debugging capability. Here's an example:

GET _search
{
  "script_fields": {
    "test": {
      "script": {
        "source": "Debug.explain(params._source)"
      }
    }
  }
}

An abbreviated response:

{
  "shard" : 0,
  "index" : "logs",
  "reason" : {
    "type" : "script_exception",
    "reason" : "runtime error",
    "java_class" : "org.elasticsearch.search.lookup.SourceLookup"
  }
}

This shows that params._source is of type org.elasticsearch.search.lookup.SourceLookup, which doesn't appear to be documented in the official Painless API Reference. But I was able to find more information on javadoc.io: SourceLookup is a wrapper around java.util.Map, so calling its keySet() method returns the set of all field names (keys) in _source. This is exactly what I wanted!

Getting the list of fields

The first step I took was writing a query that uses a script field to get the list of all field names from _source. Each document that matches the query now also carries its list of field names.

Here's what that query looks like:

GET _search
{
  "query": {
    "range": {
      "@timestamp": {
        "gte": "now-10m",
        "lte": "now"
      }
    }
  },
  "script_fields": {
    "field_names": {
      "script": {
        "source": "params._source.keySet()"
      }
    }
  }
}

I'm using a range filter to limit the search to documents from the past 10 minutes, because scripted queries and script fields can be expensive to run over large numbers of documents.

This is great, but it gives a list of field names per document. It would be even more useful to get a single list of all field names across all documents. Using the same script with a terms aggregation does the job.

Here's what I used:

GET _search
{
  "size": 0,
  "query": {
    "range": {
      "@timestamp": {
        "gte": "now-10m",
        "lte": "now"
      }
    }
  },
  "aggs": {
    "fields": {
      "terms": {
        "size": 100,
        "script": {
          "source": "params._source.keySet()"
        }
      }
    }
  }
}

Now I have a list of the top 100 field names used in the past 10 minutes. I can compare this against the list of fields we already have defined in our Index Template and add any that are missing. Problem solved!
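
The mapped side of that comparison can be pulled straight from the cluster. A quick sketch, assuming the indices follow a logs-* naming pattern:

GET logs-*/_mapping

Any field name that shows up in the terms aggregation but not in the mapping response is a candidate to add to the Index Template.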

Huge thanks to my teammates for their review and feedback on this article!

Top comments (4)

Benjamin Trent • Edited

Field mapping explosion is a bad problem to have!

Glad you were able to move past it :).

Painless scripting is CRAZY powerful.

If you find yourself using Painless more in the future, I suggest looking into the Painless Lab in Kibana.

Daniel Dreier

That looks awesome, thanks for sharing! We're not on 7.8 yet but it's one more reason to upgrade.

jcampaner

Interesting... thanks for the post. I would have thought 600 fields would be quite manageable. The Elastic Common Schema for consolidated logging has this sort of quantity, if I recall correctly. Do you think the issue was the overall quantity or that there were so many dynamic ones?

Daniel Dreier

To be honest, I'm not entirely sure. Some of the field names themselves were also quite long, up to the 255-character max. So it may have been a combination of that and the number of fields that led Elastic Support to determine that it was the cause of the heap space usage.