Mapping with AWS S3 Select — Cost Efficient Solution

Welcome to my first DEV post. I want you to know this is not my first post ever; I'm moving my posts over from Medium because I think this is a better platform for following everything about my profession. Good reading.


Originally posted on April 10th:

We developers love both creating and using high-performing (e.g. fast-responding) functions, software, and applications. Generally, one of our final steps is running speed and load tests on our products, especially when the relevant parts depend on a third-party service.

Nowadays, fast and flexible NoSQL technologies are very popular, and there are many tools that serve responses instantly and consistently. Let's take AWS DynamoDB as an example. However, DynamoDB has one main problem: write requests (INSERT/UPDATE) and their reserved capacity are highly priced. If you want fast responses, you need to insert your data into DynamoDB first, which can get really expensive under a high traffic load.

My former workplace was a platform that linked advertisers (who could be app owners or media companies) with ad publishers. If you work with mobile advertising data, you need solutions for detecting fraudulent traffic. While redirecting advertisement clicks to app stores, we had to make sure that any resulting installs and in-app events came from real users.

As you can guess, we operated on a lot of click, install, and in-app-event data, matching records with each other and flagging untrustworthy ones.


Now let's start with the logic of how we matched all this data under high traffic load, despite the fraudulent traffic. First, we needed a mechanism for generating smart identifiers for our objects, to be shared with external platforms and businesses. Smart IDs should be encrypted with a strong key, to leave no room for ad fakers, but should also carry some information about the advertisement items in progress.

Let's get back to AWS and talk about how we used some of their services. Besides the smart identifiers, we needed to keep some data like the click date-time (to measure the time elapsed between a click and an install or in-app event). Relational databases can be very demanding if you don't have deep pockets and many servers (plus there are maintenance costs to think about).

So, we needed backup storage first, and we decided to go with S3 (Simple Storage Service). You can use either JSON or CSV/TSV format to store traffic data like clicks and in-app events:

```
{id: <strong-generated-id>, ts: 987026400, …}
{id: <strong-generated-id>, ts: 1069974000, …}
```
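As a rough sketch of this write path (using the aws-sdk-go library; the bucket and key names are just the examples from this post, and the `Click` struct is a simplified record of my own):

```go
package main

import (
	"bytes"
	"encoding/json"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

// Click is a simplified traffic record; real records carry more fields.
type Click struct {
	ID string `json:"id"` // encrypted smart identifier
	TS int64  `json:"ts"` // click timestamp (unix seconds)
}

// flushToS3 writes one buffered batch of clicks to S3 as JSON Lines,
// one record per line, so S3 Select can scan the file later.
func flushToS3(svc *s3.S3, bucket, key string, clicks []Click) error {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf) // Encode appends a newline after each record
	for _, c := range clicks {
		if err := enc.Encode(c); err != nil {
			return err
		}
	}
	_, err := svc.PutObject(&s3.PutObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
		Body:   bytes.NewReader(buf.Bytes()),
	})
	return err
}

func main() {
	svc := s3.New(session.Must(session.NewSession()))
	clicks := []Click{{ID: "<strong-generated-id>", TS: 987026400}}
	if err := flushToS3(svc, "traffic", "click+987026400+55533+ZZXX112233VVPP", clicks); err != nil {
		log.Fatal(err)
	}
}
```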

We planned the S3-stored data for backup and side usages only; we also had a NoSQL design for storing and mapping our data, which was DynamoDB. We were inserting tons of click data into DynamoDB after buffering, to match those clicks with installs and events as fast as we could.
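For illustration, here is a minimal sketch of that costly write path with aws-sdk-go, assuming a hypothetical `clicks` table with the fields shown earlier:

```go
package main

import (
	"log"
	"strconv"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/dynamodb"
)

// writeClickBatch inserts one buffered batch of clicks into a hypothetical
// "clicks" table. BatchWriteItem accepts at most 25 items per call, and
// each item consumes write capacity units, the cost driver described above.
func writeClickBatch(db *dynamodb.DynamoDB, ids []string, timestamps []int64) error {
	reqs := make([]*dynamodb.WriteRequest, 0, len(ids))
	for i, id := range ids {
		reqs = append(reqs, &dynamodb.WriteRequest{
			PutRequest: &dynamodb.PutRequest{
				Item: map[string]*dynamodb.AttributeValue{
					"id": {S: aws.String(id)},
					"ts": {N: aws.String(strconv.FormatInt(timestamps[i], 10))},
				},
			},
		})
	}
	_, err := db.BatchWriteItem(&dynamodb.BatchWriteItemInput{
		RequestItems: map[string][]*dynamodb.WriteRequest{"clicks": reqs},
	})
	return err
}

func main() {
	db := dynamodb.New(session.Must(session.NewSession()))
	if err := writeClickBatch(db, []string{"<strong-generated-id>"}, []int64{987026400}); err != nil {
		log.Fatal(err)
	}
}
```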

We did not have to wait long for this design to prove costly. Let's take the February 6th DynamoDB write cost as an example.

February 6th Write Capacity Units (WCU): 138,798
February 6th Cost: $101.06

Looking at this, you can see that one of our biggest expenses was DynamoDB write requests. (On some days the WCU rose to 175K.)


At last, after all of the analysis above, I am ready to explain our solution for reducing the cost. I told you about our smart IDs and how we store them in S3. We noticed that, with minor changes, we could tell which click item was stored in which S3 file. Take a look at this:

Our smart IDs:

```
Encrypt(<execution_ts>+<related_ad_campaign>+<random_alpha_numeric>+<server_identifier>)
Encrypt(987026407+123321+ABcd1234EF56+55533)
```
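The post doesn't name the cipher we used, so purely as an illustrative sketch: AES-GCM in Go could produce such an ID (the key here is a placeholder, and error handling is kept minimal):

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"encoding/base64"
	"fmt"
)

// encryptSmartID seals the concatenated ID fields with AES-GCM and
// base64-encodes the result so it is safe to share externally.
func encryptSmartID(key []byte, plain string) (string, error) {
	block, err := aes.NewCipher(key) // key must be 16, 24, or 32 bytes
	if err != nil {
		return "", err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return "", err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return "", err
	}
	// prepend the nonce so the decrypting side can recover it
	return base64.RawURLEncoding.EncodeToString(gcm.Seal(nonce, nonce, []byte(plain), nil)), nil
}

func main() {
	key := make([]byte, 32) // placeholder; load a real secret key in production
	id, err := encryptSmartID(key, "987026407+123321+ABcd1234EF56+55533")
	if err != nil {
		panic(err)
	}
	fmt.Println(id)
}
```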

So, the S3 keys look like this:

```
S3_key: <item_type>+<buffered_ts>+<server_identifier>+<random_alpha_numeric>
S3_key: click+987026400+55533+ZZXX112233VVPP
```
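A small sketch of how such a key could be derived. Note that the 60-second buffer window is my assumption, inferred from 987026407 rounding down to 987026400 in the examples:

```go
package main

import "fmt"

// buildS3Key derives the object key for a buffered traffic file. Flooring the
// click timestamp to a 60-second window matches the example values above;
// adjust the window to your real buffer interval.
func buildS3Key(itemType string, executionTS int64, serverID, fileRand string) string {
	bufferedTS := executionTS - executionTS%60 // floor the click time to the window
	return fmt.Sprintf("%s+%d+%s+%s", itemType, bufferedTS, serverID, fileRand)
}

func main() {
	fmt.Println(buildS3Key("click", 987026407, "55533", "ZZXX112233VVPP"))
	// Output: click+987026400+55533+ZZXX112233VVPP
}
```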

Now it's easy to determine which traffic item sits in which S3 file when an install or in-app event arrives with the converted click's smart ID. But it's not done yet: a buffered S3 file contains many other click records, and we need to find the correct one from the conversion's ID. Because we stored the click details in JSON or CSV/TSV format, we can use the S3 Select function (https://aws.amazon.com/about-aws/whats-new/2018/04/amazon-s3-select-is-now-generally-available/), like so:

```
func S3Select(bucket, key, itemID) { /* use your AWS library's S3 Select */ }

item = S3Select("traffic", "click+987026400+55533+ZZXX112233VVPP", <encrypted-smart-identifier>)
```
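Here is a sketch of how that pseudocode could be fleshed out with aws-sdk-go's real SelectObjectContent API. The `s3SelectItem` wrapper and its naming are mine, and the bucket/key/ID values are the post's examples; in real code you should also escape `itemID` before interpolating it into the SQL expression:

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

// s3SelectItem pulls the single matching record out of a buffered
// JSON Lines file without downloading the whole object.
func s3SelectItem(svc *s3.S3, bucket, key, itemID string) ([]byte, error) {
	resp, err := svc.SelectObjectContent(&s3.SelectObjectContentInput{
		Bucket:         aws.String(bucket),
		Key:            aws.String(key),
		ExpressionType: aws.String(s3.ExpressionTypeSql),
		Expression:     aws.String(fmt.Sprintf("SELECT * FROM S3Object s WHERE s.id = '%s'", itemID)),
		InputSerialization: &s3.InputSerialization{
			JSON: &s3.JSONInput{Type: aws.String(s3.JSONTypeLines)},
		},
		OutputSerialization: &s3.OutputSerialization{
			JSON: &s3.JSONOutput{},
		},
	})
	if err != nil {
		return nil, err
	}
	defer resp.EventStream.Close()

	// the response arrives as an event stream; collect the record payloads
	var out []byte
	for event := range resp.EventStream.Events() {
		if records, ok := event.(*s3.RecordsEvent); ok {
			out = append(out, records.Payload...)
		}
	}
	return out, resp.EventStream.Err()
}

func main() {
	svc := s3.New(session.Must(session.NewSession()))
	item, err := s3SelectItem(svc, "traffic", "click+987026400+55533+ZZXX112233VVPP", "<encrypted-smart-identifier>")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%s", item)
}
```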

Last of all, by organizing your item identifiers and S3 file names with a consistent logic, you can use the Simple Storage Service 'select' function to attribute your related items cost-effectively. Before finishing this post, I want to share the DynamoDB cost graphs, which include the exact date we updated our attribution logic:

[GR-1: DynamoDB cost graph]

Monthly perspective:

[GR-2: DynamoDB cost graph, monthly view]

The line chart below shows the S3 API request costs for February:

[GR-3: S3 API request costs, February]

As you can see, there was no remarkable change in S3 costs after the mapping logic update.

If fast responses are not essential, if you can handle your object-pairing processes asynchronously, and if you are not extremely wealthy, then you can prefer S3 over DynamoDB for both storing and retrieving your data.
