<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sumit</title>
    <description>The latest articles on DEV Community by Sumit (@sumit0k).</description>
    <link>https://dev.to/sumit0k</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F129544%2F1f55882a-5f87-4d0e-b6e9-70674660cde3.jpg</url>
      <title>DEV Community: Sumit</title>
      <link>https://dev.to/sumit0k</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sumit0k"/>
    <language>en</language>
    <item>
      <title>Why Postman Data Engineering chose Apache Spark for ETL (Extract-Transform-Load)</title>
      <dc:creator>Sumit</dc:creator>
      <pubDate>Fri, 20 Sep 2019 11:37:30 +0000</pubDate>
      <link>https://dev.to/sumit0k/why-we-chose-apache-spark-for-etl-extract-transform-load-58op</link>
      <guid>https://dev.to/sumit0k/why-we-chose-apache-spark-for-etl-extract-transform-load-58op</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2AG7_FggWo15o5qVbF4COeUA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2AG7_FggWo15o5qVbF4COeUA.png"&gt;&lt;/a&gt;Credit: &lt;a href="http://bit.ly/2D9vPjo" rel="noopener noreferrer"&gt;Undraw&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Even before I joined Postman, my colleagues here were already operating at the scale of 6 million users around the world. The amount of data that needed to be processed to extract meaningful insight was huge. Here is a taste of that scale:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;100+ million requests per day spread across 30+ micro-services&lt;/li&gt;
&lt;li&gt;Around 10 TB of data at rest&lt;/li&gt;
&lt;li&gt;Around 1 TB of internal service logs ingested monthly&lt;/li&gt;
&lt;li&gt;100k+ peak concurrent WebSocket connections&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Back Story:
&lt;/h3&gt;

&lt;p&gt;I was a few months into my time at Postman when I was assigned to a project that needed to handle millions of rows of service logs, spanning GBs of data.&lt;/p&gt;

&lt;p&gt;Without a doubt, we could have processed these logs in vanilla code, perhaps with a few libraries on top. But that would incur more cost in operating and maintaining the code base. Moreover, libraries that are not rigorously tested increase the surface area for bugs, and all the added libraries and infrastructure result in more human hours. So we decided to look for something else.&lt;/p&gt;

&lt;h3&gt;
  
  
  Idea:
&lt;/h3&gt;

&lt;p&gt;With prior experience in distributed systems, the team and I knew their advantages and limitations. Keeping that in mind, the next step for us was to look for solutions in distributed processing. And don’t forget the Map-Reduce functionality.&lt;/p&gt;

&lt;p&gt;Postman believes in a philosophy that human hours are the most valuable resource, so we always strive for a managed solution wherever possible. A managed solution handles upgrades of software and hardware by itself (via the 3rd party); we need to focus only on the logic and nothing else around it.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We were at &lt;a href="http://bit.ly/2TLcJXz" rel="noopener noreferrer"&gt;AWS Community Day 2018, Bengaluru&lt;/a&gt;. If you want to check out the photos, &lt;a href="http://bit.ly/2QLcSZc" rel="noopener noreferrer"&gt;visit here&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For the uninitiated: &lt;a href="https://spark.apache.org/" rel="noopener noreferrer"&gt;What is Apache Spark?&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Debate:
&lt;/h3&gt;

&lt;p&gt;With the above philosophy in mind, we primarily wanted to optimise the following:&lt;/p&gt;

&lt;h4&gt;
  
  
  i. Time for Development
&lt;/h4&gt;

&lt;p&gt;So which method is faster?&lt;/p&gt;

&lt;p&gt;Developing a project in vanilla code in any programming language&lt;br&gt;&lt;br&gt;
OR&lt;br&gt;&lt;br&gt;
using the libraries available in the Apache Spark ecosystem.&lt;/p&gt;

&lt;p&gt;An advantage of vanilla code is that you are familiar with the basic concepts. You may have skills and tricks that make development faster for you. The caveat: can you propagate those same learnings to your team members? You might want tighter control over what you build, but can you be sure you can transfer that knowledge, and that need for control, to other developers? And a blunt question to ask: do they actually need it?&lt;/p&gt;

&lt;p&gt;One can argue that a similar body of knowledge exists within an ecosystem of tools. So what makes the Apache Spark ecosystem, or any other tool ecosystem, weigh so much heavier than the vanilla alternative?&lt;/p&gt;

&lt;p&gt;From my viewpoint:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;the support of a community,&lt;/li&gt;
&lt;li&gt;better documentation of any given methodology,&lt;/li&gt;
&lt;li&gt;established system paradigms,&lt;/li&gt;
&lt;li&gt;and the added advantage of learning from other people’s mistakes.&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;P.S.:&lt;/strong&gt; With Apache Spark being an open source tool, you also don’t lose control.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F517%2F1%2Aic7shvGnlpir0KLZlSZKkQ.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F517%2F1%2Aic7shvGnlpir0KLZlSZKkQ.jpeg"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  ii. Modularisation
&lt;/h4&gt;

&lt;p&gt;In my early years of programming, one of my colleagues planted a beautiful thought in my mind. I can say it actually transformed the way I write code. I won’t put any effort into explaining it; I’ll just casually leave it here.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Code is like poetry.”&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/tamilmani58" rel="noopener noreferrer"&gt;Tamil Selvan&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;I cannot put enough emphasis on how difficult it is to modularise any code, and then to maintain that modularity when you are not the only developer on the project. A single-developer project lets you impose your thoughts (read: opinions) on how the code is structured, which may be good or maybe not. The real test of skill comes when there is more than one contributor to the project: your modularisation should be consumable without you explaining it to each one of them.&lt;/p&gt;

&lt;p&gt;With tools such as Apache Spark, the code you contribute is very small. Why? Because all the nitty-gritty of the core functionality is hidden away in library code. What you are left with is a very small, simple description of the ETL process.&lt;/p&gt;

&lt;p&gt;For example, a small but complete ETL process could be summarised as&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spark.read.parquet("/source/path) # Extract
    .filter(...).agg(...) # Transform
    .write.parquet("/destination/path") # Load
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is a lot of support from the Apache Spark community for simple readers and writers. This lets you easily modularise your code and stick to one paradigm. So what do you get in the end? Beautifully structured code that everyone can understand and contribute to.&lt;/p&gt;
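&lt;p&gt;The shape of such a modularised pipeline can be sketched in plain Python (the function names and sample records here are illustrative; in Spark itself the reader and writer roles are played by &lt;code&gt;spark.read&lt;/code&gt; and &lt;code&gt;DataFrame.write&lt;/code&gt;):&lt;/p&gt;

```python
def extract(rows):
    """Reader: yield raw records from a source."""
    yield from rows

def transform(records):
    """Transformation step, analogous to .filter(...).agg(...)."""
    return [{"user": r["user"], "calls": r["calls"]}
            for r in records if r["calls"] > 0]

def load(records, sink):
    """Writer: persist the transformed records into a sink."""
    sink.extend(records)
    return sink

# Each stage is a separate, swappable unit; the pipeline is just composition.
sink = []
load(transform(extract([{"user": "a", "calls": 3},
                        {"user": "b", "calls": 0}])), sink)
```

Because each stage only agrees on the record shape, any stage can be replaced (a different reader, an extra transform) without touching the others.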

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;P.S.:&lt;/strong&gt; You can always extend the default readers and writers to match your requirements perfectly.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F1%2AgWpLuVf43QxUKBT4B477Gw.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F600%2F1%2AgWpLuVf43QxUKBT4B477Gw.jpeg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  iii. Maintainability
&lt;/h4&gt;

&lt;p&gt;I have a very strong belief in the power of community. For me, a community boils down to something like a league of superheroes fighting for a common goal. Here is a quote that puts my thought process in very simple words.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Two is better than one if two act as one.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Mike_Krzyzewski" rel="noopener noreferrer"&gt;Mike Krzyzewski&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;I believe that because if one fails, the other will help. One could counter with this proverb:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Too many cooks spoil the broth&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Proverb&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;which means that when many people are involved in doing the same thing, the final result will not be good. But that happens when things take place behind the closed doors of a kitchen.&lt;/p&gt;

&lt;p&gt;The open source community has solved this problem beautifully and proved its mettle; the quality and quantity of open source projects and communities are proof of it. Apache Spark is one of them, which means its quality, maintainability, and ease of use are much better. At least better than a few chefs trying to build vanilla products behind closed doors. No shit, Sherlock!&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;P.S.:&lt;/strong&gt; An active open source community means a lot of people are already maintaining the tool and effectively doing the work for you.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  iv. Time to Production
&lt;/h4&gt;

&lt;p&gt;Time to development is one thing; actually deploying the code to production is another. Most of the time, projects stay in “PoC” mode and never come out of it. I have seen some companies run the very same “PoC” on a production server. These “PoC”s are given little thought as to whether they can handle the current traffic, let alone the rate at which load increases.&lt;/p&gt;

&lt;p&gt;The inherent nature of distributed systems such as Apache Spark makes supporting large scale a cakewalk. These systems are designed for this very reason: “to handle scale”.&lt;/p&gt;

&lt;p&gt;A vanilla system, by contrast, is most of the time designed to handle the current load on the presumption that it won’t increase. That might hold in an ideal world, but we don’t live in one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;P.S.:&lt;/strong&gt; While I try to &lt;a href="https://dev.to/sumit0k/optimising-e-commerce-data-2nna"&gt;vertically scale most of the time&lt;/a&gt;, I have done horizontal scaling as well, and it is buttery smooth in distributed systems.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F568%2F1%2AoufjjHZQVsVmCZO5AY1KEw.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F568%2F1%2AoufjjHZQVsVmCZO5AY1KEw.jpeg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  v. After production caveats
&lt;/h4&gt;

&lt;p&gt;Two words:&lt;/p&gt;

&lt;p&gt;Monitoring and Observability.&lt;/p&gt;

&lt;p&gt;I won’t go into the details of these terms, as there is already a lot of content around them. For production deployments you might hope to “do it and forget it” in an ideal world, but again, we don’t live in one. Deploying anything to production brings a lot of itsy-bitsy (and larger) problems. Systems in production should be continuously monitored and designed for observability. Don’t forget the alerts either, which can be subsumed under monitoring. Apache Spark, with its web UI and the added support from AWS, makes for a much better alternative than building custom solutions in vanilla code.&lt;/p&gt;

&lt;p&gt;So do you actually want to reinvent the wheel?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;P.S.:&lt;/strong&gt; Probably you don’t. I am not judging.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion:
&lt;/h3&gt;

&lt;p&gt;To conclude all my blabbering above, here is a &lt;strong&gt;TL;DR&lt;/strong&gt; of why we chose Apache Spark for ETL:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;AWS Support (Our primary requirement)&lt;/li&gt;
&lt;li&gt;Distributed System (To handle scalability)&lt;/li&gt;
&lt;li&gt;Open source (Control)&lt;/li&gt;
&lt;li&gt;Community power (Superpower)&lt;/li&gt;
&lt;li&gt;Documentation (Ease of on-boarding)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I usually write about data. Find more posts from me &lt;a href="https://sumit0k.medium.com/" rel="noopener noreferrer"&gt;on Medium&lt;/a&gt; or on the &lt;a href="https://dev.to/sumit0k"&gt;dev.to&lt;/a&gt; community.&lt;/p&gt;




</description>
      <category>spark</category>
      <category>pyspark</category>
      <category>postman</category>
      <category>etl</category>
    </item>
    <item>
      <title>How I reduced storage cost of ElasticSearch by 60x</title>
      <dc:creator>Sumit</dc:creator>
      <pubDate>Tue, 22 Jan 2019 14:11:01 +0000</pubDate>
      <link>https://dev.to/sumit0k/optimising-highly-indexed-document-storage---know-your-data-kyd-3g42</link>
      <guid>https://dev.to/sumit0k/optimising-highly-indexed-document-storage---know-your-data-kyd-3g42</guid>
      <description>&lt;h3&gt;
  
  
  Optimising Highly Indexed Document Storage — Know Your Data (KYD)
&lt;/h3&gt;

&lt;p&gt;This blog is a continuation of the &lt;a href="https://dev.to/sumit0k/know-your-data-kyd-39a6"&gt;Know Your Data (KYD)&lt;/a&gt; series and the &lt;a href="https://dev.to/sumit0k/optimising-e-commerce-data-2nna"&gt;Optimising E-Commerce Data&lt;/a&gt; sub-series. In the &lt;a href="https://dev.to/sumit0k/optimising-document-based-storage---know-your-data-kyd-1hp8"&gt;previous blog&lt;/a&gt; I mentioned using compression to optimise storage for fields we don’t tend to use manually; here I address the problem of an explosion of keys in a highly indexed document store. In my case that store was Elasticsearch, and I was able to reduce the index size by 60 times after applying multiple data-modelling changes.&lt;/p&gt;

&lt;h4&gt;
  
  
  Back Story:
&lt;/h4&gt;

&lt;p&gt;As we support multiple types of e-commerce clients, we need to serve their users data in the form shown by the store, which means storing, and sending back, a lot of different fields from our organisation’s datastore.&lt;/p&gt;

&lt;p&gt;The problem started when different stores had different fields across categories, and everything had to be ingested by us so we could serve the correct data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F750%2F0%2AgRdNOOKWkniF_zGd" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F750%2F0%2AgRdNOOKWkniF_zGd"&gt;&lt;/a&gt;Source &lt;a href="https://giphy.com/gifs/disneystudios-disney-9V5kjx3MRcVyE3kYpN" rel="noopener noreferrer"&gt;GIPHY&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our main principle of not sampling any data, and of serving the same filters as the store does (with our features on top), was killing us on storage size, even when a store had a small catalog. The problem lay in the fields that were being indexed and the storage cost those indexes incurred.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2ARl2FNrnXDphJbeCB" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2ARl2FNrnXDphJbeCB"&gt;&lt;/a&gt;Source &lt;a href="https://giphy.com/gifs/disneystudios-disney-3Xw6TGuAa39J2X3a1q" rel="noopener noreferrer"&gt;GIPHY&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Idea:
&lt;/h4&gt;

&lt;p&gt;Inspiration and motivation to solve this Key Explosion problem came when we were thinking of upgrading Elasticsearch from version 2.x to 6.x. Elasticsearch had started putting a &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/6.0/mapping.html" rel="noopener noreferrer"&gt;restriction on the number of fields in a mapping, with the default set to 1000&lt;/a&gt;. Even though this setting is adjustable, keeping it low is the preferable state to be in. So I started looking around the indices for places where I could cut the number of fields that need to be indexed. The solution came in the form of segregating fields into two groups:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Fields on which only filtering has to be applied&lt;/li&gt;
&lt;li&gt;Fields which need to be used for aggregations&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The advantage this segregation gave us was a reduction in the number of fields that need to be indexed for aggregation support.&lt;/p&gt;

&lt;p&gt;Our initial documents looked similar to the following. Some fields, like &lt;code&gt;_id&lt;/code&gt; and &lt;code&gt;price&lt;/code&gt;, are common across products, while others change with the product type: apparel will have attributes like &lt;code&gt;size&lt;/code&gt; and &lt;code&gt;color&lt;/code&gt;, while furniture or cutlery will have attributes like &lt;code&gt;material&lt;/code&gt;, &lt;code&gt;luster&lt;/code&gt;, etc.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[
  {
    "\_id": "product id 1",
    "price": 232,
    "size": [
      "x",
      "xl"
    ],
    "color": [
      "red",
      "black"
    ],
    "views\_lastweek": 100,
    "views\_desktop\_lastweek": 80,
    ...
  },
  {
    "\_id": "product id 2",
    "price": 14,
    "material": [
      "steel",
      "brass"
    ],
    "luster": [
      "silver"
    ],
    "views\_lastweek": 40,
    "views\_desktop\_lastweek": 20,
    ...
  }
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We will go with the assumption that we need not show &lt;code&gt;luster&lt;/code&gt; as an option anywhere in the store, while &lt;code&gt;material&lt;/code&gt; has to be shown, perhaps in a side navigation/filter widget. Similarly, &lt;code&gt;size&lt;/code&gt; need not be shown in the side widget while &lt;code&gt;color&lt;/code&gt; has to be populated there.&lt;/p&gt;

&lt;h4&gt;
  
  
  Goals:
&lt;/h4&gt;

&lt;p&gt;I decided to approach this problem with three goals in mind:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;To reduce the number of keys that need to be indexed.&lt;/li&gt;
&lt;li&gt;To not compromise the quality of the query results we deliver with the old model, nor lose any functionality.&lt;/li&gt;
&lt;li&gt;To keep it scalable enough to handle any number of unique fields a store can have.&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  Process:
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;I. Reduce the number of attributes&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
To achieve the above goals, the primary step was to segregate fields into two types: filter keys and aggregator keys.&lt;/p&gt;

&lt;p&gt;Any key that only needs to be filtered on, and never aggregated on, can now be removed from the index and used simply as values in a generic field shared across the whole product catalog, e.g. &lt;code&gt;tags&lt;/code&gt;, &lt;code&gt;attributes&lt;/code&gt;, etc.&lt;/p&gt;
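&lt;p&gt;As a rough illustration in plain Python (the field names follow the example documents; the helper itself is hypothetical, not our production code), folding filter-only fields into a generic &lt;code&gt;tags&lt;/code&gt; field might look like this:&lt;/p&gt;

```python
def fold_into_tags(doc, filter_only_fields):
    """Move filter-only fields out of their own (indexed) keys and into
    a single generic 'tags' field of 'field->value' strings."""
    out = {k: v for k, v in doc.items() if k not in filter_only_fields}
    tags = out.setdefault("tags", [])
    for field in filter_only_fields:
        for value in doc.get(field, []):
            tags.append(f"{field}->{value}")
    return out

product = {"_id": "product id 1", "price": 232,
           "size": ["x", "xl"], "color": ["red", "black"]}
folded = fold_into_tags(product, ["size"])
# 'size' no longer exists as its own indexed key; its values live under 'tags'
```

However many filter-only attributes a store invents, they all land in the one indexed `tags` field.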

&lt;p&gt;So, taking the above example and the assumption that &lt;code&gt;size&lt;/code&gt; and &lt;code&gt;luster&lt;/code&gt; are filter-only keys, I converted the existing data model to&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[
  {
    "\_id": "product id 1",
    "price": 232,
**"tags": [  
      "size-&amp;gt;x",  
      "size-&amp;gt;xl"  
    ],**  
    "color": [
      "red",
      "black"
    ],
    "views\_lastweek": 100,
    "views\_desktop\_lastweek": 80,
    …
  },
  {
    "\_id": "product id 2",
    "price": 14,
    "material": [
      "steel",
      "brass"
    ],
  **"tags": [  
      "luster-&amp;gt;silver"  
    ],**  
    "views\_lastweek": 40,
    "views\_desktop\_lastweek": 20,
    …
  }
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You will notice that we were able to fold the two indexed keys &lt;code&gt;size&lt;/code&gt; and &lt;code&gt;luster&lt;/code&gt; into the single &lt;code&gt;tags&lt;/code&gt; field.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;II. Reduce the number of Metric Keys&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
As you can see in the sample above, each product has some metric keys, which can multiply to around 100 keys as the number of segments or durations grows.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;&lt;br&gt;
Metric type = [views, purchases, add to cart, …],&lt;br&gt;&lt;br&gt;
Duration = [lastweek, lastmonth, yesterday, Jan-2018, etc.],&lt;br&gt;&lt;br&gt;
Segment = [desktop, mobile, tablet, ads, email, etc.]&lt;/p&gt;

&lt;p&gt;To reduce these keys, I evaluated each product’s numeric score per metric and segmented it into a textual value of high, medium, or low, based on how the store is performing on that metric. For example, if a store received 100 add-to-carts in total, a high-performing product with around 30 add-to-carts would be placed in the high segment, whereas a product that received only 1 add-to-cart would be placed in the low segment.&lt;/p&gt;
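&lt;p&gt;A minimal sketch of that bucketing in plain Python (the threshold values here are purely illustrative, not the cut-offs we actually used):&lt;/p&gt;

```python
def bucket_metrics(product_metrics, store_totals, hi=0.2, lo=0.05):
    """Replace per-metric numeric keys with three list-valued keys,
    one per performance segment (high / medium / low)."""
    buckets = {"metrics_h": [], "metrics_m": [], "metrics_l": []}
    for key, value in product_metrics.items():
        share = value / store_totals[key]  # product's share of the store total
        if share >= hi:
            buckets["metrics_h"].append(key)
        elif share >= lo:
            buckets["metrics_m"].append(key)
        else:
            buckets["metrics_l"].append(key)
    return buckets

# Store saw 100 add-to-carts overall; 30 puts this product in the high segment.
bucket_metrics({"addtocart_lastweek": 30}, {"addtocart_lastweek": 100})
```

However many metric/duration/segment combinations exist, each product ends up with only the three segment keys.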

&lt;p&gt;Using this as the principle, I reduced all the metric keys to 3 keys, one for each segment&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[
  {
    "\_id": "product id 1",
    "price": 232,
    "tags": [
      "size-&amp;gt;x",
      "size-&amp;gt;xl"
    ],
    "color": [
      "red",
      "black"
    ],
    **"metrics\_h": [  
      "views\_lastweek",  
      "views\_desktop\_lastweek"  
    ],**  
    "metrics\_l": [

    ],
    "metrics\_m": [

    ]
  },
  {
    "\_id": "product id 2",
    "price": 14,
    "material": [
      "steel",
      "brass"
    ],
    "tags": [
      "luster-&amp;gt;silver"
    ],
  **"metrics\_m": [  
      "views\_lastweek",  
      "views\_desktop\_lastweek"  
    ],**  
    "metrics\_l": [

    ],
    "metrics\_h": [

    ]
  }
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This could have been stored as a single key &lt;code&gt;metrics&lt;/code&gt; with values like &lt;code&gt;views_lastweek -&amp;gt; h&lt;/code&gt;, but our requirement was index-level boosting: fields like “*_h” should have a boost of 30, “*_m” a boost of 20, and so on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;III. Not indexing fields, but keeping them in the datastore&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Some fields in the product catalog were supposed to be delivered as-is and were never queried, filtered, aggregated, or sorted on. Those fields were marked as &lt;code&gt;“index”: false&lt;/code&gt; in the mapping, which further helped reduce the index storage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IV. Keeping a low index profile&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
We tend not to run partial-match or match-phrase queries against our datastore, which gave us the advantage of declaring most of our &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/6.0/index-options.html" rel="noopener noreferrer"&gt;index options as &lt;code&gt;docs&lt;/code&gt;&lt;/a&gt; to keep the lowest index footprint.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "index_options": "docs",
  "type": "keyword"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Benchmarks:
&lt;/h4&gt;

&lt;p&gt;Let’s talk in numbers now.&lt;/p&gt;

&lt;p&gt;With optimisation in step I, I was able to achieve&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;For one of our clients, indexing time dropped from 60 minutes to 2 minutes, a more-than-30x reduction in the time to index the store’s product catalog into Elasticsearch (bulk indexing was used both times).&lt;/li&gt;
&lt;li&gt;Index size reduced from 7168 MB (7 GB) to a mere 220 MB.&lt;/li&gt;
&lt;li&gt;The number of keys reduced from 30k+ to fewer than 250, leaving around 750 keys to spare under the Elasticsearch default limit.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;After applying Step II of the optimisation, I was able to achieve&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The indexing process, previously optimised to 2 minutes, was further reduced to 50 seconds.&lt;/li&gt;
&lt;li&gt;Index storage size reduced from 220 MB to 110 MB.&lt;/li&gt;
&lt;li&gt;The number of keys reduced further, from around 250 to around 200.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With all these optimisations in place, and after cutting down on indexed fields and marking some fields as non-indexed, the index size of 110 MB was further reduced to 65 MB, still on Elasticsearch 2.x.&lt;/p&gt;

&lt;p&gt;Upgrading Elasticsearch to version 6.x gave us a further reduction in index size, from 65 MB to 34.7 MB, likely due to the large amount of sparse data we have and the fact that &lt;a href="https://www.elastic.co/blog/minimize-index-storage-size-elasticsearch-6-0" rel="noopener noreferrer"&gt;Elasticsearch 6 has a lot of space-saving improvements&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Conclusion:
&lt;/h4&gt;

&lt;p&gt;With all optimisations in place and after upgrading to Elasticsearch 6.x, I was able to reduce an index of 7.1 GB to 34.7 MB, and to cut the indexing time for 18k large documents from 1 hour to 50 seconds.&lt;/p&gt;

&lt;p&gt;The version upgrade from Elasticsearch 2 to 6 saved us only around 30 MB, because all the optimisations had already been implemented on Elasticsearch 2, which on its own had brought the index down from 7.1 GB to around 65 MB. So the version upgrade is a preferred step, but not a mandatory one, for implementing these optimisations.&lt;/p&gt;

&lt;p&gt;While most of these optimisations are tool-specific, one generic conclusion can be drawn: Know Your Data.&lt;/p&gt;

&lt;p&gt;P.S.&lt;/p&gt;

&lt;p&gt;An Elasticsearch upgrade should also be handled alongside the &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/6.0/removal-of-types.html" rel="noopener noreferrer"&gt;removal of the &lt;code&gt;_type&lt;/code&gt; field in mappings&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I am also active in the &lt;a href="http://stackoverflow.com/users/4453633/sumit-kumar?tab=profile" rel="noopener noreferrer"&gt;StackOverflow community&lt;/a&gt; and have primarily answered Elasticsearch questions in the past.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Other posts in Series&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/sumit0k/know-your-data-kyd-39a6"&gt;Know Your Data (KYD) - Sumit Kumar&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/sumit0k/optimising-e-commerce-data-2nna"&gt;Optimising E-Commerce Data - Sumit Kumar&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/sumit0k/optimising-document-based-storage---know-your-data-kyd-1hp8"&gt;Optimising Document Based Storage - Know Your Data (KYD)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




</description>
      <category>elasticsearch</category>
      <category>nosql</category>
      <category>database</category>
      <category>storage</category>
    </item>
    <item>
      <title>I saved 7x on storage cost of MongoDB</title>
      <dc:creator>Sumit</dc:creator>
      <pubDate>Tue, 15 Jan 2019 14:06:02 +0000</pubDate>
      <link>https://dev.to/sumit0k/optimising-document-based-storage---know-your-data-kyd-1hp8</link>
      <guid>https://dev.to/sumit0k/optimising-document-based-storage---know-your-data-kyd-1hp8</guid>
      <description>&lt;h3&gt;
  
  
  &lt;strong&gt;Optimising Document Based Storage — &lt;/strong&gt; Know Your Data (KYD)
&lt;/h3&gt;

&lt;p&gt;As mentioned in &lt;a href="https://dev.to/sumit0k/optimising-e-commerce-data-17af-temp-slug-4168828"&gt;the introductory post&lt;/a&gt;, this blog details how we found a solution to the problem of storing serialised HyperLogLog (HLL) registers and were able to reduce our storage costs by 7 times.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YzsScde---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://cdn-images-1.medium.com/max/636/1%2Ax9DD7BVlvFmlS6vW4nU7xA.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YzsScde---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://cdn-images-1.medium.com/max/636/1%2Ax9DD7BVlvFmlS6vW4nU7xA.gif" alt="" width="636" height="288"&gt;&lt;/a&gt;Image Source &lt;a href="https://www.gizmodo.com.au/2015/02/mits-well-on-its-way-to-perfecting-auto-zipping-zippers/"&gt;GIZMODO&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Back Story:
&lt;/h4&gt;

&lt;p&gt;A single string serialisation of an HLL register gives us an 8192-character string, regardless of the value it represents. For a detailed but engaging post on this family of structures, refer to this &lt;a href="http://bit.ly/bloom-filters-medium"&gt;link&lt;/a&gt;; it talks about Bloom filters, and HLL is based on a similar fundamental implementation. I might try to publish a shorter version later.&lt;/p&gt;

&lt;p&gt;The problem we faced was that whether an HLL string represents the value 1, the value 22k, or something larger, it always occupies the same storage space. This meant we stored the same amount of data for every store irrespective of its traffic and catalog size, which boils down to low ROI on small-traffic customers, as they are on lower price plans.&lt;/p&gt;

&lt;p&gt;There was a dire need to optimise this storage, as a large number of small-traffic stores were signing up and our storage costs kept increasing.&lt;/p&gt;

&lt;h4&gt;
  
  
  Idea:
&lt;/h4&gt;

&lt;p&gt;The idea of how to optimise this came after looking at what is actually inside an HLL string: a large number of padding 0s and very few occurrences of any other digit, like a 1 or 2 when it represents a smaller value such as 1. Something like this&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;'000000000000000000000000000.....0000000001000000000........0000000'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;where the first occurrence of 1 was at the 5029th position, with only 0s before it.&lt;/p&gt;

&lt;p&gt;You might have guessed by now that this is perfect data to compress: a lot of repeating characters. If you haven’t, here is a fun link about the &lt;a href="https://en.wikipedia.org/wiki/Zip_bomb"&gt;Zip Bomb&lt;/a&gt; and how petabytes of data were compressed into a 42 KB zip file.&lt;/p&gt;
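&lt;p&gt;A quick sketch of why this works so well, using Python’s built-in &lt;code&gt;zlib&lt;/code&gt; as a stand-in (it is not necessarily the engine we ended up choosing, and the synthetic string below only mimics the shape of a real register):&lt;/p&gt;

```python
import zlib

# A synthetic 8192-character register string: a lone '1' in a sea of zeros.
hll = "0" * 5028 + "1" + "0" * 3163
compressed = zlib.compress(hll.encode())
assert zlib.decompress(compressed).decode() == hll  # lossless round trip
print(len(hll), "->", len(compressed))  # the compressed blob is a tiny fraction of 8192
```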

&lt;p&gt;To analyse which compression engine would suit our use case, there was a war among the many compression engines available, with a lot of blogs to reference too. But each of them has its own advantages and its own use-cases to cater to.&lt;/p&gt;

&lt;p&gt;What we needed to know was which engine is best for our use-case, and the best way to find out was to benchmark the different engines on our own data strings.&lt;/p&gt;
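&lt;p&gt;The benchmarking approach can be sketched with Python’s standard-library codecs; here &lt;code&gt;zlib&lt;/code&gt;, &lt;code&gt;bz2&lt;/code&gt;, and &lt;code&gt;lzma&lt;/code&gt; stand in for the engines actually compared (such as &lt;code&gt;blosclz&lt;/code&gt;, &lt;code&gt;lz4&lt;/code&gt;, and &lt;code&gt;snappy&lt;/code&gt;), and the sample string is synthetic:&lt;/p&gt;

```python
import bz2
import lzma
import time
import zlib

sample = ("0" * 5028 + "1" + "0" * 3163).encode()  # one 8192-byte HLL string

engines = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}

results = {}
for name, (compress, decompress) in engines.items():
    start = time.perf_counter()
    blob = compress(sample)
    elapsed = time.perf_counter() - start
    assert decompress(blob) == sample  # correctness before speed
    results[name] = (len(blob), elapsed)  # compression factor vs. time

for name, (size, elapsed) in results.items():
    print(f"{name}: {size} bytes in {elapsed:.6f}s")
```

The real benchmark ran the same loop over many real register strings of varying cardinality, which is what the graphs below summarise.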

&lt;h4&gt;
  
  
  Goals:
&lt;/h4&gt;

&lt;p&gt;The criteria on which one can judge a compression engine are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Compression Speed&lt;/li&gt;
&lt;li&gt;Decompression Speed&lt;/li&gt;
&lt;li&gt;Compression Factor&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In the best-case scenario I would like all of them in one compression engine, but most of the time you don’t get what you want. So what we needed was:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The fastest decompression speed, as the highest priority, since decompressed data was to be served in real time&lt;/li&gt;
&lt;li&gt;An above-average compression factor, to save storage costs, and&lt;/li&gt;
&lt;li&gt;Last but not least, a good compression speed, to handle scale&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  Benchmarks:
&lt;/h4&gt;

&lt;p&gt;So I ran a set of scripts and put together a bunch of graphs to help me choose the best fighter for my war with storage. In each of the graphs below, the vertical Y axis represents the percentage of data left after the leading zeros in a string, which means the lower the percentage, the lower the value it represents.&lt;/p&gt;

&lt;p&gt;For example, 0.01% represents an HLL string whose cardinality is 1. The legends on the right represent the different compression engines I tested. In all, I did 4 comparisons to finalise which engine would suit our need.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I. Compression Speed Comparison&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this graph the horizontal X axis represents the time taken, in seconds, to compress 8192 bytes of data with each compression engine. No great analytical power is required to see that the fastest engines are &lt;code&gt;blosclz&lt;/code&gt;, &lt;code&gt;lz4&lt;/code&gt; and &lt;code&gt;snappy&lt;/code&gt;. &lt;code&gt;lz4hc&lt;/code&gt;, which is the high-compression configuration of &lt;code&gt;lz4&lt;/code&gt;, comes in 4th in this comparison.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--scgKGzn_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2A2OnhImCRu3AS3bL6" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--scgKGzn_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2A2OnhImCRu3AS3bL6" alt="" width="800" height="460"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;II. Decompression speed comparison&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The 2nd graph visualises the decompression speed of all the engines in seconds. If you remember our priorities from the paragraphs above, our primary need was the fastest decompression speed, and we can see that on average &lt;code&gt;lz4hc&lt;/code&gt; comes first, even though &lt;code&gt;blosclz&lt;/code&gt; was fastest on strings representing smaller values.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qTJ6p3oM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2A3Zi6Ut3Xek9DBru9" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qTJ6p3oM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2A3Zi6Ut3Xek9DBru9" alt="" width="800" height="458"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;III. Decompression Vs. Compression Speed Comparison&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What added another point in favour of &lt;code&gt;lz4hc&lt;/code&gt; was this graph, where I compare the ratio of decompression time to compression time for each engine. Smaller is better: it means the engine spends relatively less of its time decompressing than the other engines do.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--22WFkzSv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2A97Nf4MB3pTM77xiE" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--22WFkzSv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2A97Nf4MB3pTM77xiE" alt="" width="800" height="458"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IV. Compressed Size Comparison&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As is clear from the points above, I was in a dilemma about the final winner of this comparison, and this graph made it all clear. Here we compare the compressed size of the 8192-byte HLL string. We can see that &lt;code&gt;zlib&lt;/code&gt; and &lt;code&gt;zstd&lt;/code&gt; are the clear winners with the consistently smallest sizes, with &lt;code&gt;lz4hc&lt;/code&gt; coming in 3rd.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1mSFqRms--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AwZ9IhOfqmP_rOXWM" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1mSFqRms--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AwZ9IhOfqmP_rOXWM" alt="" width="800" height="458"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Conclusion:
&lt;/h4&gt;

&lt;p&gt;Taking a decision is easy; what is difficult is living with the consequences it can bring.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Nothing is more difficult, and therefore more precious, than to be able to decide.”&lt;/p&gt;

&lt;p&gt;-Napoleon Bonaparte&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When all the data is with you, it’s not difficult to take a decision. Even though &lt;code&gt;blosclz&lt;/code&gt; was the fastest at compression and &lt;code&gt;zstd&lt;/code&gt; was the smallest in size, &lt;code&gt;lz4hc&lt;/code&gt; was the only one that checked all the boxes of our priorities: fastest at decompression, a smaller compressed size than most of the engines in most cases, and a compression time faster than 50% of the engines I benchmarked.&lt;/p&gt;

&lt;p&gt;I went ahead with &lt;code&gt;lz4hc&lt;/code&gt; for our use case, and it helped us reduce our MongoDB storage by a factor of 7 compared to before compression.&lt;/p&gt;
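A minimal sketch of that round trip, assuming the python-lz4 block API linked at the end of the post; if python-lz4 is not installed, standard-library `zlib` stands in so the sketch still runs, and the helper names are mine:

```python
import zlib

try:
    # python-lz4 exposes lz4hc as the high-compression mode of the
    # block format (see the library links at the end of the post).
    import lz4.block

    def pack(data):
        return lz4.block.compress(data, mode="high_compression")

    def unpack(blob):
        return lz4.block.decompress(blob)
except ImportError:
    # Fallback so the sketch runs without python-lz4; zlib stands
    # in for lz4hc here.
    def pack(data):
        return zlib.compress(data)

    def unpack(blob):
        return zlib.decompress(blob)

# A zero-padded register string like the ones discussed above.
register = ("0" * 8191 + "2").encode("ascii")
blob = pack(register)            # this blob is what goes into storage
assert unpack(blob) == register  # the round trip is lossless
print(len(register), "->", len(blob), "bytes")
```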

&lt;p&gt;We used MongoDB’s Binary data type to store the compressed binary data, which further helped with interoperability of the compressed string between the different language libraries for &lt;code&gt;lz4hc&lt;/code&gt;. We primarily use Node.js for serving app users and Python for background processes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;P.S.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
You might not see a reduction in the disk space claimed by MongoDB if you are optimising on the same instance, as &lt;a href="https://cyantificmusings.wordpress.com/2016/04/17/reclaiming-disk-space-from-mongodb/"&gt;MongoDB does not release the disk space back&lt;/a&gt;. Seems like they got inspired by pirates.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4rA90sNL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://cdn-images-1.medium.com/max/720/1%2Az-n0jQdbYbF_XP3YCpmiCA.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4rA90sNL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://cdn-images-1.medium.com/max/720/1%2Az-n0jQdbYbF_XP3YCpmiCA.gif" alt="" width="720" height="300"&gt;&lt;/a&gt;Source &lt;a href="https://giphy.com/gifs/reactionseditor-3o7btQtMuwv4FtXu0M"&gt;GIPHY&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Library Links:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="http://python-lz4.readthedocs.io/en/stable/"&gt;LZ4 compression library bindings for Python - python-lz4 2.1.0 documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://python-blosc.blosc.org/"&gt;Welcome to python-blosc's documentation! - python-blosc 1.5.0 documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Other posts in Series&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/sumit0k/know-your-data-kyd-45e6-temp-slug-3424402"&gt;Know Your Data (KYD) - Sumit Kumar - Medium&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/sumit0k/optimising-e-commerce-data-17af-temp-slug-4168828"&gt;Optimising E-Commerce Data - Sumit Kumar - Medium&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/sumit0k/optimising-highly-indexed-document-storage---know-your-data-kyd-3g42"&gt;Optimising Highly Indexed Document Storage - Know Your Data (KYD)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




</description>
      <category>database</category>
      <category>compression</category>
      <category>mongodb</category>
      <category>nosql</category>
    </item>
    <item>
      <title>Saving 1000s of dollars in storage costs in an early-stage startup</title>
      <dc:creator>Sumit</dc:creator>
      <pubDate>Tue, 08 Jan 2019 12:17:54 +0000</pubDate>
      <link>https://dev.to/sumit0k/optimising-e-commerce-data-2nna</link>
      <guid>https://dev.to/sumit0k/optimising-e-commerce-data-2nna</guid>
      <description>&lt;h3&gt;
  
  
  Optimising E-Commerce Data — Know Your Data (KYD)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YgQsyWS6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://cdn-images-1.medium.com/max/960/1%2AtsdMwVAFslnq5XEHfAqThQ.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YgQsyWS6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://cdn-images-1.medium.com/max/960/1%2AtsdMwVAFslnq5XEHfAqThQ.gif" alt="" width="800" height="241"&gt;&lt;/a&gt;Source &lt;a href="http://www.thewebfusion.com/services/web-development/website-web-design-and-graphic-designing/"&gt;WebFusion&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In my previous organisation (Choice.AI), we had customers (read: e-commerce shops) ranging from small stores handling at most 100–1000 unique visitors per month, to mid-size stores with around 30k–50k unique visitors per month, to large customers with 50k+ regular shoppers on their websites. Based on its traffic size, each store came under a different pricing plan, so our investment in each store also had to be planned based on its ROI.&lt;/p&gt;

&lt;p&gt;To deliver fast and equal value to all types of customers, we built solutions that were independent of store catalog and traffic size. This approach started costing us more as more stores became our clients, because each needed nearly the same investment in computing resources and power.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Z_jDu9Px--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://cdn-images-1.medium.com/max/804/1%2AEe_ftZOzIdkgggs-2j4YqA.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Z_jDu9Px--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://cdn-images-1.medium.com/max/804/1%2AEe_ftZOzIdkgggs-2j4YqA.gif" alt="" width="800" height="713"&gt;&lt;/a&gt;Source &lt;a href="https://bkwebdesigns.com/web-development/"&gt;BK Website Designs&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this series of blog posts I will talk about the major problems we solved by “knowing” and understanding what our data contains. Solving them let us allocate resources to stores based on their catalog and traffic size, without compromising on the quality, relevance and features the organisation offered.&lt;/p&gt;

&lt;h4&gt;
  
  
  Storage of Serialized HyperLogLog (HLL) Registers
&lt;/h4&gt;

&lt;p&gt;Our primary problem was handling aggregated user-event analytics in a document-based data store (MongoDB in our case). Our main goal was to count the number of unique visitors across different dimensions and segments, where dimensions could be campaigns or experiments and segments could be devices or traffic sources.&lt;/p&gt;

&lt;p&gt;To get the number of unique visitors across a span of time known only at query time, we needed to store the HLL registers’ serialised data in a data store, to be deserialised and merged on demand to give the value required.&lt;/p&gt;

&lt;p&gt;This string was 8192 bytes long, to be exact: 8 bits per character and a total length of 8192 characters. That length is independent of the value represented: whether it is as small as 1 or as large as 50k, all the strings are of the same length.&lt;/p&gt;
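A toy HyperLogLog sketch illustrates why the serialised register string has a fixed length regardless of cardinality; the register count, hash, and serialisation format here are illustrative, not the production format:

```python
import hashlib

M = 1024   # number of registers (2**10); the production set-up was larger
P = 10     # index bits

def new_registers():
    return [0] * M

def add(registers, item):
    # 64-bit hash of the item; the first P bits pick a register.
    h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
    idx = h >> (64 - P)
    rest = h & ((1 << (64 - P)) - 1)
    # rank = position of the leftmost 1-bit in the remaining bits
    rank = 64 - P - rest.bit_length() + 1
    registers[idx] = max(registers[idx], rank)

def serialise(registers):
    # Two fixed-width digits per register: the string is the same length
    # whether the set held 1 visitor or 50k.
    return "".join(f"{r:02d}" for r in registers)

small, large = new_registers(), new_registers()
add(small, "visitor-1")
for i in range(50000):
    add(large, f"visitor-{i}")
print(len(serialise(small)), len(serialise(large)))  # both 2048
```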

&lt;h4&gt;
  
  
  Explosion of Keys of Product Catalog
&lt;/h4&gt;

&lt;p&gt;Our customers came from different areas of e-commerce and had different, unique products to offer the end consumer. Because of our primary principle of treating every customer’s data the same way, the most common issue we faced was problems from one customer’s badly structured data interfering with the processing of another customer’s cleanly structured hierarchical data.&lt;/p&gt;

&lt;p&gt;We allowed our customers to import as many product attributes as they wanted, with filtering and aggregation support on all of them. This approach resulted in a key explosion: more fields to index in a highly indexed document storage system (Elasticsearch in our case), which ultimately meant a larger disk footprint and more time to index and to retrieve data on filters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Next Posts in Series&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/sumit0k/optimising-document-based-storage---know-your-data-kyd-1hp8"&gt;Optimising Document Based Storage - Know Your Data (KYD)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/sumit0k/optimising-highly-indexed-document-storage---know-your-data-kyd-3g42"&gt;Optimising Highly Indexed Document Storage - Know Your Data (KYD)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




</description>
      <category>mongodb</category>
      <category>elasticsearch</category>
      <category>data</category>
      <category>engineering</category>
    </item>
    <item>
      <title>Know Your Data (KYD)</title>
      <dc:creator>Sumit</dc:creator>
      <pubDate>Tue, 01 Jan 2019 13:16:00 +0000</pubDate>
      <link>https://dev.to/sumit0k/know-your-data-kyd-39a6</link>
      <guid>https://dev.to/sumit0k/know-your-data-kyd-39a6</guid>
      <description>&lt;h3&gt;
  
  
  Introduction — Know Your Data (KYD)
&lt;/h3&gt;

&lt;p&gt;In this new era of artificial intelligence and machine learning, you need not be an expert to realise that data is the new diamond: the more, the merrier. But unless your diamonds are shiny, polished, multi-faceted and easily accessible with low maintenance, they are nothing more than pieces of rock.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XYm5x021--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AsZ1adV_WEtfozQtp" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XYm5x021--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AsZ1adV_WEtfozQtp" alt="" width="800" height="653"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In my previous organisation (to be called “organisation” from now on), we analysed millions of user events per day and produced shiny, multi-faceted analytics from them. As diamonds come in all shapes and sizes, they should be maintained based on their potential to fetch money, or in simple business terms, their ROI.&lt;/p&gt;

&lt;p&gt;In my current organisation, we handle billions of events per day, mainly comprising communication logs between different microservices.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;On another note, there is a nice, must-read series of write-ups from Ankit Sobti (CTO, Postman) on how Postman implemented microservices. Here is a &lt;a href="https://medium.com/postman-engineering/conquering-the-microservices-dependency-hell-at-postman-with-postman-part-1-introduction-a1ae019bb934"&gt;link to the first one&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Handling data at this scale and complexity results in a plethora of complications and problems. I will write about our approaches to solving them.&lt;/p&gt;

&lt;h3&gt;
  
  
  SPOILER ALERT
&lt;/h3&gt;

&lt;p&gt;…&lt;/p&gt;

&lt;p&gt;…&lt;/p&gt;

&lt;p&gt;The solution always lies in the direction realised only after knowing our data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Next Posts in Series&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/sumitkumar1209/optimising-e-commerce-data-2nna"&gt;Optimising E-Commerce Data - Sumit&lt;/a&gt;&lt;/p&gt;




</description>
      <category>engineering</category>
      <category>data</category>
      <category>datamodeling</category>
      <category>optimisation</category>
    </item>
  </channel>
</rss>
