
Optimising Highly Indexed Document Storage - Know Your Data (KYD)

Sumit Kumar (sumitkumar1209). Originally published at Medium. 6 min read

This post continues the Know Your Data (KYD) series and its Optimising E-Commerce Data sub-series. In the previous post I covered compression as a way to optimise storage for fields we rarely read manually; here I address the problem of key explosion in a highly indexed document store. In my case the store was Elasticsearch, and I was able to reduce the index size by 60 times after applying multiple data modelling changes.

Back Story:

As we support multiple types of e-commerce clients, we need to serve their users data in the same form the store shows it, which means storing and sending back a lot of different fields from our organisation's datastore.

The problem started when different stores had different fields across categories, and we had to ingest all of them to serve correct data.

Our main principle, never sampling data and serving the same filters as the store with our features on top, was killing us on storage size even when the store had a small catalog. The problem lay in the fields being indexed and the storage cost that the index incurred.

Idea:

The inspiration and motivation to solve this key explosion problem came when we were planning to upgrade Elasticsearch from 2.x to 6.x. Elasticsearch had started restricting the number of fields in a mapping, with a default limit of 1000. Even though this setting is adjustable, keeping the field count low is the preferable state to be in. So I started looking through the indices for places where I could cut the number of indexed fields. The solution came in the form of segregating fields into two groups:

  1. Fields on which only filtering has to be applied
  2. Fields which need to be used for aggregations

The advantage of this segregation was that only the fields requiring aggregation support had to stay indexed as separate keys.

Our initial documents looked similar to the ones below. Some fields, such as _id and price, are common across products, while others change with the product type: apparel has attributes such as size and color, whereas furniture or cutlery has attributes such as material and luster.

[
  {
    "_id": "product id 1",
    "price": 232,
    "size": [
      "x",
      "xl"
    ],
    "color": [
      "red",
      "black"
    ],
    "views_lastweek": 100,
    "views_desktop_lastweek": 80,
    ...
  },
  {
    "_id": "product id 2",
    "price": 14,
    "material": [
      "steel",
      "brass"
    ],
    "luster": [
      "silver"
    ],
    "views_lastweek": 40,
    "views_desktop_lastweek": 20,
    ...
  }
]
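
To get a feel for how close a catalogue is to the mapping field limit, one could count the distinct keys across documents. The sketch below is only an illustration: `distinct_keys` is a hypothetical helper, it looks at top-level keys only, and it uses a trimmed copy of the two sample products above.

```python
from typing import Iterable, Mapping, Set

# Default limit on the number of fields in a mapping, as mentioned above.
DEFAULT_FIELD_LIMIT = 1000


def distinct_keys(docs: Iterable[Mapping[str, object]]) -> Set[str]:
    """Collect every top-level key that appears in at least one document."""
    keys: Set[str] = set()
    for doc in docs:
        keys.update(doc.keys())
    return keys


# Trimmed copies of the two sample products (elided fields left out).
sample_docs = [
    {"_id": "product id 1", "price": 232, "size": ["x", "xl"],
     "color": ["red", "black"], "views_lastweek": 100,
     "views_desktop_lastweek": 80},
    {"_id": "product id 2", "price": 14, "material": ["steel", "brass"],
     "luster": ["silver"], "views_lastweek": 40,
     "views_desktop_lastweek": 20},
]

keys = distinct_keys(sample_docs)
print(len(keys), "distinct keys,", DEFAULT_FIELD_LIMIT - len(keys), "below the default limit")
```

Two products already produce eight distinct keys, and every new product type adds its own attributes on top.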

We will assume that luster never needs to be shown as an option anywhere in the store, while material has to be shown, perhaps in the side navigation/filter widget. Similarly, size need not be shown in the side widget, but color has to be populated there.

Goals:

I decided to approach this problem with three goals in mind:

  1. To reduce the number of keys that need to be indexed.
  2. To not compromise on the quality of the query results we deliver with the old model, and to not remove any functionality.
  3. To keep it scalable enough to handle any number of unique fields a store can have.

Process:

I. Reduce the number of attributes

To achieve the above goals, the primary step was to segregate fields into two types: filter keys and aggregator keys.

Any key that only needs to be filtered on, and never aggregated on, can be removed from the index as a separate field. Its values are instead stored in a generic field shared across the whole product catalog, e.g. tags or attributes.

So, taking the above example and the assumption that size and luster are filter-only keys, I converted my existing data model to:

[
  {
    "_id": "product id 1",
    "price": 232,
    "tags": [
      "size->x",
      "size->xl"
    ],
    "color": [
      "red",
      "black"
    ],
    "views_lastweek": 100,
    "views_desktop_lastweek": 80,
    …
  },
  {
    "_id": "product id 2",
    "price": 14,
    "material": [
      "steel",
      "brass"
    ],
    "tags": [
      "luster->silver"
    ],
    "views_lastweek": 40,
    "views_desktop_lastweek": 20,
    …
  }
]

You will notice that the two indexed keys, size and luster, have been reduced to a single tags field.
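
The transformation can be sketched roughly as follows. This is an illustration rather than the code used in production: `fold_filter_fields` and `tag_filter` are hypothetical helper names, and the only things taken from the example documents are the `tags` field and the `->` separator.

```python
from typing import Any, Dict, Iterable

TAG_SEPARATOR = "->"  # same separator as in the example documents


def fold_filter_fields(doc: Dict[str, Any], filter_only: Iterable[str]) -> Dict[str, Any]:
    """Return a copy of doc with filter-only fields moved into a 'tags' list."""
    filter_only = set(filter_only)
    folded: Dict[str, Any] = {}
    tags = []
    for key, value in doc.items():
        if key in filter_only:
            # Treat scalar and list values alike: one tag per value.
            values = value if isinstance(value, list) else [value]
            tags.extend(f"{key}{TAG_SEPARATOR}{v}" for v in values)
        else:
            folded[key] = value
    if tags:
        # Keep any tags the document already had and append the new ones.
        folded["tags"] = list(folded.get("tags", [])) + tags
    return folded


def tag_filter(field: str, value: str) -> Dict[str, Any]:
    """Build a term query that matches one folded field/value pair."""
    return {"term": {"tags": f"{field}{TAG_SEPARATOR}{value}"}}


product = {"_id": "product id 1", "price": 232,
           "size": ["x", "xl"], "color": ["red", "black"]}
folded = fold_filter_fields(product, filter_only={"size", "luster"})
print(folded)
print(tag_filter("size", "xl"))
```

Filtering on a folded attribute then becomes an exact match on the combined string, while color stays a field of its own and can still be aggregated on.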

II. Reduce the number of Metric Keys

As you can see in the sample above, each product carries some metric keys, and these can multiply to around 100 keys as the number of segments or durations grows.

For example:

Metric type = [views, purchases, add to cart, …]

Duration = [lastweek, lastmonth, yesterday, Jan-2018, etc.]

Segment = [desktop, mobile, tablet, ads, email, etc.]
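
To see how quickly these combine, here is a small counting sketch. It uses only the values named in the lists above (the elided ones are left out) and assumes the key naming seen in the sample documents, i.e. `metric_duration` and `metric_segment_duration`; the spelling `add_to_cart` is my assumption.

```python
from itertools import product

metric_types = ["views", "purchases", "add_to_cart"]
durations = ["lastweek", "lastmonth", "yesterday", "Jan-2018"]
segments = ["desktop", "mobile", "tablet", "ads", "email"]

# Keys without a segment, e.g. views_lastweek
overall_keys = [f"{m}_{d}" for m, d in product(metric_types, durations)]

# Keys with a segment, e.g. views_desktop_lastweek
segment_keys = [f"{m}_{s}_{d}" for m, s, d in product(metric_types, segments, durations)]

metric_keys = overall_keys + segment_keys
print(len(metric_keys))  # 72
```

Even this short list yields 72 keys (12 overall plus 60 segmented), and every additional metric, duration or segment multiplies the total further.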

To reduce these keys, I converted each product's numeric score into a textual value of high, medium or low, based on how the product performs relative to the store in that metric. For example, if the store received 100 add-to-cart events and a well-performing product accounts for around 30 of them, that product is placed in the high segment, while a product with only 1 add-to-cart event is placed in the low segment.

Using this principle, I reduced all the metric keys to 3 keys, one for each of the high, medium and low segments:

[
  {
    "_id": "product id 1",
    "price": 232,
    "tags": [
      "size->x",
      "size->xl"
    ],
    "color": [
      "red",
      "black"
    ],
    "metrics_h": [
      "views_lastweek",
      "views_desktop_lastweek"
    ],
    "metrics_l": [],
    "metrics_m": []
  },
  {
    "_id": "product id 2",
    "price": 14,
    "material": [
      "steel",
      "brass"
    ],
    "tags": [
      "luster->silver"
    ],
    "metrics_m": [
      "views_lastweek",
      "views_desktop_lastweek"
    ],
    "metrics_l": [],
    "metrics_h": []
  }
]

This could have been stored as a single metrics key with values such as views_lastweek->h, but our requirement was index-level boosting per field: fields like “*_h” should have a boost of 30, “*_m” a boost of 20, and so on.
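
A minimal sketch of the bucketing step is shown below. The post does not state the actual cut-offs, so `HIGH_SHARE` and `MEDIUM_SHARE` are purely hypothetical thresholds on a product's share of the store-wide total, and `bucket_metrics` is an illustrative helper name.

```python
from typing import Dict, List

# Hypothetical cut-offs: a product's share of the store-wide total for a metric.
HIGH_SHARE = 0.20
MEDIUM_SHARE = 0.05


def bucket_metrics(product_metrics: Dict[str, float],
                   store_totals: Dict[str, float]) -> Dict[str, List[str]]:
    """Replace numeric metric values with membership in high/medium/low lists."""
    buckets: Dict[str, List[str]] = {"metrics_h": [], "metrics_m": [], "metrics_l": []}
    for name, value in product_metrics.items():
        total = store_totals.get(name, 0)
        # Avoid dividing by zero when the store has no events for this metric.
        share = value / total if total else 0.0
        if share >= HIGH_SHARE:
            buckets["metrics_h"].append(name)
        elif share >= MEDIUM_SHARE:
            buckets["metrics_m"].append(name)
        else:
            buckets["metrics_l"].append(name)
    return buckets


store_totals = {"add_to_cart_lastweek": 100}
strong = bucket_metrics({"add_to_cart_lastweek": 30}, store_totals)
weak = bucket_metrics({"add_to_cart_lastweek": 1}, store_totals)
print(strong)
print(weak)
```

With these assumed thresholds, the product with 30 of the store's 100 add-to-cart events lands in metrics_h and the product with 1 lands in metrics_l, which matches the example given earlier.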

III. Not Indexing field, but keeping in datastore

Some fields in the product catalog had to be delivered as is and were not needed for querying, filtering, aggregating or sorting. Those fields were marked as “index”: false in the mapping, which further reduced the storage needed for the index.

IV. Keeping a low index profile

We tend not to run partial match or match phrase queries against our datastore, which let us declare index_options as docs on most fields to keep the index footprint as low as possible.

{
  "index_options": "docs",
  "type": "keyword"
}
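
Steps III and IV together could be generated with something like the sketch below. It only builds the `properties` part of a mapping as a plain dictionary; `build_properties` and the field name `image_url` are hypothetical, and treating every field as a keyword is a simplification (numeric fields such as price would need their own type).

```python
import json
from typing import Any, Dict, Iterable


def build_properties(indexed_fields: Iterable[str],
                     stored_only_fields: Iterable[str]) -> Dict[str, Any]:
    """Build mapping properties for keyword fields with a minimal index footprint."""
    properties: Dict[str, Any] = {}
    for name in indexed_fields:
        # Step IV: index only the document numbers for each term.
        properties[name] = {"type": "keyword", "index_options": "docs"}
    for name in stored_only_fields:
        # Step III: the value stays in the stored document but is not searchable.
        properties[name] = {"type": "keyword", "index": False}
    return properties


properties = build_properties(
    indexed_fields=["tags", "color", "material", "metrics_h", "metrics_m", "metrics_l"],
    stored_only_fields=["image_url"],  # hypothetical pass-through field
)
print(json.dumps(properties, indent=2))
```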

Benchmarks:

Let’s talk in numbers now.

With the optimisation in step I, I was able to achieve the following:

  1. For one of our clients, indexing time dropped from 60 minutes to 2 minutes, roughly a 30x reduction in the time taken to index the store's product catalog into Elasticsearch. Bulk indexing was used both times.
  2. Index size reduced from 7168 MB (7 GB) to a mere 220 MB.
  3. The number of keys reduced from more than 30k to around 250, which still leaves around 750 keys to spare under the Elasticsearch default limit.

After applying step II of the optimisation, I was able to achieve the following:

  1. The indexing process, already down to 2 minutes, was further reduced to 50 seconds.
  2. Index storage size reduced from 220 MB to 110 MB.
  3. The number of keys reduced further, from 250 to around 200.

With all optimisations in place, after cutting down on indexed fields and marking some fields as non-indexed, the index size of 110 MB was further reduced to 65 MB while still on Elasticsearch 2.x.

Upgrading Elasticsearch to version 6.x reduced the index size further, from 65 MB to 34.7 MB. This is probably because our data is very sparse and Elasticsearch 6 brings a lot of space-saving improvements.
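
To put the index sizes above side by side, the short calculation below computes the cumulative reduction after each stage. The sizes are the ones reported in this section; the stage labels are just descriptive names.

```python
# Index size in MB after each stage, as reported above.
stages = [
    ("original", 7168.0),
    ("step I", 220.0),
    ("step II", 110.0),
    ("non-indexed fields", 65.0),
    ("Elasticsearch 6.x", 34.7),
]

original_size = stages[0][1]
factors = {}
for name, size in stages[1:]:
    # Cumulative reduction relative to the original index size.
    factors[name] = original_size / size
    print(f"{name}: {size} MB, {factors[name]:.1f}x smaller than the original")
```

The data modelling changes in steps I and II account for roughly a 65x reduction, and the final index after the upgrade is about 200x smaller than the original.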

Conclusion:

With all optimisations in place and after upgrading to Elasticsearch 6.x, I was able to reduce an index of 7.1 GB to 34.7 MB and cut the indexing time for 18k large documents from 1 hour to 50 seconds.

The version upgrade from Elasticsearch 2 to 6 saved only about 30 MB, as all the optimisations were already implemented on Elasticsearch 2, which by itself brought the index down from 7.1 GB to around 65 MB. So the version upgrade is a preferred step, but not a mandatory one for implementing these optimisations.

While most of these optimisations are tool specific, one generic conclusion can be drawn: Know Your Data.

P.S.

The Elasticsearch upgrade should be handled together with the removal of the _type field from the mapping.

I am also active in the StackOverflow community, where I have primarily answered Elasticsearch questions.
