Ronen Botzer for Aerospike

Posted on Apr 26, 2020

Aerospike Modeling: User Profile Store

#aerospike #database #data

Audience Segmentation for Personalization

I recently published the article “Aerospike Modeling: IoT Sensors” to highlight a different modeling approach when using Aerospike versus Cassandra in IoT use cases. A similar contrast applies when modeling user profiles: Cassandra’s column-oriented ‘many tiny records’ versus Aerospike’s row-oriented ‘fewer, larger records’.

tl;dr

Cassandra databases, including derivatives such as ScyllaDB, have a needle-in-a-haystack problem that affects performance when you need low latency key-value operations. Aerospike is uniquely capable of delivering speed at scale.


Overview

If you haven’t heard the terms user profile store or audience segmentation, I recommend Google Cloud’s article on digital advertising, and adpushup’s What is Audience Segmentation? Aerospike was first used heavily in the Ad Tech ecosystem, so it is not surprising that it’s an effective solution for storing user profiles.

Audience segmentation for real time bidding (RTB) is a special case of user profile stores. It’s a form of personalization that happens tens of millions of times per second, nonstop, as ads are served in real time around the world to people using apps, visiting web pages, and watching streaming content. It’s simple to describe, and generally applicable to other forms of online personalization.

A user is typically identified by a cookie containing a unique ID. Associated with that unique ID is a set of audience segment IDs that a Data Management Platform (DMP) has deduced the user belongs to. This includes demographic, psychographic, behavioral, and geographic segments.

The user is supplied to an RTB ad exchange by a Supply Side Platform (SSP). The exchange matches eyeballs to advertisers by providing the user ID to Demand Side Platforms (DSPs). These DSPs store DMP-provided profiles, and pull up the segments known to match this user ID from their user profile store. With this information, the DSP determines whether it has an ad programmed to target the user’s audience segments. The DSP can then choose to bid in the ad exchange for the right to serve an ad to the user. If it wins the bid, it serves the ad to the user. The entire process lasts around 150ms end to end, and of that, just 20ms is available to decide whether to try to serve an ad to the user. Aerospike evolved to deliver speed at very large scale as the database used by many successful companies in this ecosystem.

Modeling

There may be millions of distinct segments, each segment ID an integer value. An average user may have a thousand audience segments they occupy at each point in time. When a user is assigned a segment, that association is given a specific time-to-live (TTL) value. As the user profile store is continuously refreshed from DMP data, temporary associations (such as a location change due to travel or a short term interest in something) will expire, while strong associations will have their segment’s TTL extended.

Modeling in Cassandra

Let’s consider how you would potentially model this kind of user profile store in Cassandra.

CREATE TABLE userspace.user_segments (
    user_id uuid,
    segment_id int,
    attr1 smallint,
    attr2 smallint,
    PRIMARY KEY (user_id, segment_id)
);

You would now upsert user_id, segment_id data pairs with a TTL, each pair a distinct row.

Modeling in Aerospike

In Aerospike we would keep all of a user’s audience segmentation data in a single record, whose key is the user ID. As a reminder, a record in an Aerospike namespace is uniquely identified by the tuple (namespace, set, user-key). The Aerospike client hashes the (set, user-key) pair through RIPEMD-160 into a 20-byte digest, which is the actual primary index identifier of the record in the namespace. This means that if you keep the default key policy of KEY_DIGEST, storage is saved, because the set (table) name and the 36-character UUID are hashed into 20 bytes of digest.

We will store the user’s segments in a map with a segment ID acting as map key, and the tuple [segment-TTL, {attr1, attr2}] as the map value.

Depending on the precision we desire for the segment TTL, we can use a smaller numeric value than the 8 bytes needed to hold a Unix epoch timestamp. Let’s assume the precision of segment TTLs is hourly, and our application has a local epoch of January 1st 2019, dating to when it was first deployed. The value of the segment-TTL would be an enumeration of the hours since that epoch. For example, December 20th, 2019 at 10am is 8482 hours since the epoch (353 full days plus 10 hours).
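The hourly enumeration is plain date arithmetic. A minimal sketch, assuming naive datetimes that all share one time zone (say UTC); the names LOCAL_EPOCH, hours_since_epoch and from_epoch_hours are illustrative and not taken from the code sample:

```python
from datetime import datetime, timedelta

# Assumed local epoch, per the text: midnight on January 1st 2019.
LOCAL_EPOCH = datetime(2019, 1, 1)


def hours_since_epoch(moment: datetime, epoch: datetime = LOCAL_EPOCH) -> int:
    """Return the whole hours elapsed between the local epoch and `moment`."""
    return int((moment - epoch).total_seconds() // 3600)


def from_epoch_hours(hours: int, epoch: datetime = LOCAL_EPOCH) -> datetime:
    """Convert an hourly segment-TTL value back into a datetime."""
    return epoch + timedelta(hours=hours)
```

For instance, hours_since_epoch(datetime(2019, 1, 2)) gives 24, and from_epoch_hours(8889) lands on January 6th 2020 at 9am.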

So, each user has a map of {segmentID: [segment-TTL, {attr1, attr2}]}, for example:

{ 8457:  [8889, {}],
  12845: [8889, {}],
  42199: [8889, {}],
  43696: [8889, {}],
}

The map ordering options are UNORDERED, K-ORDERED and KV-ORDERED. All map operations can be applied to a map, regardless of its ordering, but it does affect the performance of map operations. I’ll follow the tip that in general, when a namespace is stored on SSD, choosing K-ORDERED gives the best performance.

Storage Space

If you read my article on modeling IoT use cases, you already know that Aerospike’s MessagePack serialization will reduce the storage space needed to store these maps.

CE 500K records — CE-4.8.0, 500K users, each with 1000 segments

Using the Aerospike Enterprise Edition Compression feature further compacts the segmentation data. When I compared running my sample code on CE and EE (using Zstandard level 1 compression), I saw a 0.69 compression ratio, saving about 30% on storage.

EE 500K records — EE-4.8.0, 500K users, each with 1000 segments

Aerospike customers modeling similar data saw significantly better compression ratios of 0.25–0.20. As usual, the compression ratio depends on the codec, as well as the data being compressed.
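The ratios quoted here are compressed size divided by uncompressed size, so the saving is one minus the ratio. A small sketch of that arithmetic (storage_saved is a hypothetical helper name):

```python
def storage_saved(compression_ratio: float) -> float:
    """Fraction of storage saved, where ratio = compressed size / original size."""
    if not 0 < compression_ratio <= 1:
        raise ValueError("compression ratio must be in (0, 1]")
    return 1.0 - compression_ratio
```

A ratio of 0.69 works out to a saving of roughly 31%, while ratios of 0.25 to 0.20 correspond to savings of 75% to 80%.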

storage compression stats — Aerospike Enterprise Edition has namespace level statistics for storage compression

Advantages

This map structure has several advantages. We can use the remove_by_value_interval map operation to trim expired segments. We can use get_by_value_interval to filter segments that have a specific ‘freshness’. We can easily upsert new user segments into the map as they are processed.

Mainly, this allows for orders of magnitude faster retrieval of a user’s segments from the user profile store. In Cassandra, finding a single user requires a query pulling a small number of records from a much larger partition. In Aerospike, this is a single record read, which is always low latency, regardless of the number of records in the cluster.

An ad tech user profile store may have tens of billions of profiles, each with a large number of segments. This is because there are hundreds of millions, even billions, of distinct devices, users browse in incognito mode, and browsers and operating systems anonymize through further means. There are many more ad tech cookies out there than there are humans.

Let’s consider a real use case where Aerospike was chosen over Scylla to replace a petabyte scale Apache Cassandra user profile store. With 50 billion users, and an average of 1000 segments per user, the C* store had 50 trillion rows. It took many seconds to retrieve a given user from this profile store, leading to an approach of ‘pre warming’ whole segments of users in advance into a smaller front-end Aerospike cluster. This was very costly in terms of cluster resources. For Aerospike, there would only be 50 billion records, one per user, and fetching any one of those takes around 1ms.

Summit Talk

Matt Cochran, Director of Data Engineering at The Trade Desk, gave a talk about migrating petabytes of data from Cassandra to Aerospike using this modeling approach.

Code Sample

The code sample I wrote is located in aerospike-examples/modeling-user-segmentation.

Loading Data

I started by running run_workers.sh, which launched ten Python populate_user_profile.py workers at a time, until a few minutes later I had a profile store containing 500K users, each with 1000 random segment IDs. The value of a segment ID is an integer in the range between 0 and 81999.

$ ./run_workers.sh 
Generating users 1 to 5001
Generating users 5001 to 10001
Generating users 10001 to 15001
:

Next I ran update_query_user_profiles.py, which has an --interactive mode to make it easier to see the operations and their results.

Inserting and Updating Segments

I upserted a single segment into a user’s segment map, within a transaction that shows the state of that segment ID before and after.

The original 1000 segments for this user are random, so there’s a chance that this code segment produces an update rather than an insert.

$ python update_query_user_profiles.py --interactive
Upsert segment 64955 => [10581] to user u1
Segment value before: []
Number of segments after upsert: 1001
Segment value after: [64955, [10581, {}]]
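To make the before/after output easier to follow, here is a pure-Python model of what the upsert does to a user’s segment map. It is not the Aerospike client API: in the real code the server applies the map operations to the record, and upsert_segment is a hypothetical name used only for illustration.

```python
def upsert_segment(segments: dict, segment_id: int, ttl_hours: int, attrs=None):
    """Insert or overwrite one segment in a {segment_id: [ttl, attrs]} map.

    Returns (value before, map size after, value after); the value before
    is None when the segment was not yet in the map.
    """
    before = segments.get(segment_id)
    segments[segment_id] = [ttl_hours, dict(attrs or {})]
    return before, len(segments), segments[segment_id]
```

When the segment ID already exists, the size stays the same and only the stored [segment-TTL, attributes] value changes, which is the update case mentioned above.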

Similarly, I upserted 8 more segments at once using map_put_items.

To fetch the 9 most recent segments out of the user’s 1009, I used map_get_by_value to search for any map value matching a list that looks like [10581, *], with the ‘Wildcard’ glob. See the ordering rules for more on how Aerospike compares between list values.

Updating multiple segments for user u1
{ 537: [10581, {}],
  5484: [10581, {}],
  12735: [10581, {}],
  21894: [10581, {}],
  23223: [10581, {}],
  24124: [10581, {}],
  40680: [10581, {}],
  66659: [10581, {}]}
Show all segments with TTL 10581:
[64955, [10581, {}], 537, [10581, {}], 5484, [10581, {}], 
 12735, [10581, {}], 21894, [10581, {}], 23223, [10581, {}], 
 24124, [10581, {}], 40680, [10581, {}], 66659, [10581, {}]]
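Conceptually, matching against [10581, *] selects every map entry whose value is a list that starts with 10581, whatever follows. A pure-Python model of that filter, assuming every value has the [segment-TTL, attributes] shape (segments_with_ttl is a hypothetical name; the server evaluates the real operation):

```python
def segments_with_ttl(segments: dict, ttl_hours: int) -> dict:
    """Return the entries whose value looks like [ttl_hours, <anything>]."""
    return {
        segment_id: value
        for segment_id, value in segments.items()
        if value[0] == ttl_hours
    }
```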

Next I updated a segment’s TTL. As I mentioned in my article Operations on Nested Data Types in Aerospike, to operate on the embedded list holding the segment TTL and associated data, I needed to provide the context of how to get to that list element.

The context is map key 5484 => 0 index of the list stored as this map key’s value

Add 5 hours to the TTL of user u1's segment 5484
[5484, [10586, {}]]
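The two-step context can be pictured as two lookups followed by an increment. A pure-Python model, with a hypothetical function name, of what the nested operation does to the data:

```python
def extend_segment_ttl(segments: dict, segment_id: int, extra_hours: int) -> list:
    """Add hours to one segment's TTL and return [segment_id, new value]."""
    value = segments[segment_id]   # context step 1: map key -> [ttl, attrs] list
    value[0] += extra_hours        # context step 2: list index 0 holds the TTL
    return [segment_id, value]
```

Calling extend_segment_ttl on a map holding {5484: [10581, {}]} with 5 extra hours produces [5484, [10586, {}]], matching the output above.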

Reading a User’s Segment Data

I demonstrated how I can get just the segments that will not be expiring today.

Only get segments outside the specified segment TTL range.

I used the map get_by_value_interval operation to find all the segments whose expiration is between [0, NIL] and [start-of-today, NIL] and specified that I wanted all elements not in that range. Notice the True argument designating the inverse of this range for the Python client’s map_get_by_value_range() helper method.
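Because the lists are compared element by element and NIL sorts below every other value, the interval effectively constrains only the segment TTL. A pure-Python model of the inverted selection, assuming the interval includes its start and excludes its end (the function name is hypothetical):

```python
def segments_outside_ttl_range(segments: dict, start_hours: int, end_hours: int) -> dict:
    """Return entries whose TTL is NOT in the half-open range [start, end)."""
    return {
        segment_id: value
        for segment_id, value in segments.items()
        if not (start_hours <= value[0] < end_hours)
    }
```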

To showcase another capability of the map API, I counted how many segments the user had in a range of segment IDs.

Using the ‘count’ map result type for the map get_by_key_interval operation

Count how many segments u1 has in the segment ID range 8000-9000
15
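Counting keys in a range is cheap when the keys are kept in sorted order, which is one intuition for why a K-ORDERED map suits this kind of query. The sketch below illustrates the idea with binary search over a sorted list of segment IDs; it is not a description of Aerospike’s internals, and it assumes the range includes its start and excludes its end.

```python
from bisect import bisect_left


def count_ids_in_range(sorted_ids: list, start_id: int, end_id: int) -> int:
    """Count IDs in the half-open range [start_id, end_id) of a sorted list."""
    return bisect_left(sorted_ids, end_id) - bisect_left(sorted_ids, start_id)
```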

In the case of fetching a user’s segments, a simple read operation (get) may be preferred because it is the fastest. My code sample is meant to show the expressiveness of Aerospike’s native map and list operations.

Trimming Stale Segments

There is a complementary remove operation for most read operations in the list and map API.

Clean the stale segments for user u1
User u1 had 860 stale segments trimmed, with 149 segments remaining
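In data terms, the cleanup removes every entry whose segment TTL is already in the past and reports how many were dropped. A pure-Python model, with a hypothetical function name and now_hours expressed in hours since the local epoch:

```python
def trim_stale_segments(segments: dict, now_hours: int):
    """Delete segments whose TTL is before now_hours.

    Returns (number trimmed, number remaining).
    """
    stale = [sid for sid, value in segments.items() if value[0] < now_hours]
    for sid in stale:
        del segments[sid]
    return len(stale), len(segments)
```

Note that the model, like the real operation, has to look at every segment in the map to decide whether it is stale.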

As this operation has to inspect whether every segment is inside the specified range, it’s not one to add ahead of every read operation. Instead, it can be called periodically (once an hour, once a day) to perform the cleanup. As of version 4.7 (both Community Edition and Enterprise Edition), this operation can be attached to a background scan, to be applied to all records in a namespace or set.

Conclusion

A row-oriented modeling approach, leveraging the map and list data types, gives Aerospike an advantage in key-value operations over C* implementations, including an advanced C++-based one such as ScyllaDB.

Combined with unique optimizations around NVMe drives, and without depending on large amounts of DRAM, Aerospike provides much higher performance for user profile stores, with a lot less hardware, whether the scale is measured in gigabytes, terabytes, or petabytes.

Originally published on Medium (Aerospike Developer Blog), February 16 2020
