<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Neel Phadnis</title>
    <description>The latest articles on DEV Community by Neel Phadnis (@nphadnis).</description>
    <link>https://dev.to/nphadnis</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F393096%2Fea778238-db3c-430f-9290-2437d9032133.png</url>
      <title>DEV Community: Neel Phadnis</title>
      <link>https://dev.to/nphadnis</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nphadnis"/>
    <language>en</language>
    <item>
      <title>Parallelism with Fine-Grained Streams (Part 2)</title>
      <dc:creator>Neel Phadnis</dc:creator>
      <pubDate>Tue, 17 Jan 2023 16:37:01 +0000</pubDate>
      <link>https://dev.to/aerospike/parallelism-with-fine-grained-streams-part-2-3lgg</link>
      <guid>https://dev.to/aerospike/parallelism-with-fine-grained-streams-part-2-3lgg</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7ylI75eM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1600/0%2AWwTTqEyllcpnzjLD" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7ylI75eM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1600/0%2AWwTTqEyllcpnzjLD" alt="(Source: Photo by Clem Onojeghuo on Unsplash" width="800" height="533"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Source: Photo by Clem Onojeghuo on &lt;a href="https://unsplash.com/"&gt;Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;While it is possible to process a data set using a large number of parallel streams, a higher degree of parallelism may not necessarily be optimal or even possible. This article explores how to think about parallelism and discusses the bottlenecks that limit the achievable level of parallelism. It also highlights the need to perform measurements in the target setup, because many factors cannot be easily quantified.&lt;/p&gt;

&lt;p&gt;This article is a sequel to the blog post &lt;a href="https://developer.aerospike.com/blog/parallel-streams"&gt;Processing Large Data Sets in Fine-Grained Parallel Streams&lt;/a&gt; and the tutorial &lt;a href="https://developer.aerospike.com/tutorials/java/query_splits"&gt;Splitting Large Data Sets for Parallel Processing of Queries&lt;/a&gt;, in which we discussed how large data sets can be efficiently divided into an arbitrary number of splits for processing across multiple workers, in order to achieve high throughput through parallel processing of partition streams. &lt;/p&gt;

&lt;h2&gt;
  
  
  Query Computation Graph
&lt;/h2&gt;

&lt;p&gt;Let's look at a query in isolation, and how it is processed in parallel. For a given query, the platform’s query planner defines a plan, depicted as a computation graph in the following diagram, for how the query will be executed. In the query computation graph, nodes are workers and edges are data streams.&lt;/p&gt;

&lt;p&gt;Please refer to &lt;a href="https://developer.aerospike.com/blog/parallel-streams"&gt;the prior post&lt;/a&gt; for the context and terminology.&lt;/p&gt;

&lt;p&gt;&lt;a href="//./assets/neel-phadnis/query-graph.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DXE9OaIe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/v1/./assets/neel-phadnis/query-graph.png" alt="Query Graph" width="" height=""&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;The plan can consist of multiple stages, each stage having a set of workers processing part of the data, and feeding into the next stage. The first stage typically is data access from the appropriate data source(s), and the last stage involves final processing of results such as sort order and size limit. &lt;/p&gt;

&lt;p&gt;For example, an aggregate computation on a data set may involve two stages of map-reduce, where the first-stage workers retrieve and process their respective data partitions, and the second-stage aggregates results from the first stage. In a join of data from two sources, the first stage may involve workers retrieving filtered and projected data from the two sources, bucketing data for the next stage based on the join predicate, and forwarding respective buckets to next stage workers to perform the join. &lt;/p&gt;
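
&lt;p&gt;The two-stage map-reduce pattern can be illustrated with a minimal Python sketch; the partition data and worker functions are made up for illustration and are not the platform's actual implementation:&lt;/p&gt;

```python
# Two-stage map-reduce sketch: stage-1 workers each aggregate one data
# partition (map); a single stage-2 worker combines the partial results
# (reduce). Partition contents are illustrative.
partitions = [[2, 4, 6], [1, 3, 5], [10, 20]]

def stage1_worker(partition):
    # Partial aggregate over one partition; runs in parallel in practice.
    return sum(partition)

def stage2_worker(partials):
    # Final reduction over the stage-1 outputs.
    return sum(partials)

partials = [stage1_worker(p) for p in partitions]
total = stage2_worker(partials)
print(total)  # 51
```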

&lt;p&gt;The terms upstream and downstream refer to the direction of data flow through the query graph. &lt;/p&gt;

&lt;p&gt;For simplicity, we will focus on the first stage, which is the data access stage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Access Stage
&lt;/h2&gt;

&lt;p&gt;In this stage, the Aerospike data is divided into multiple splits, each accessed and processed by a worker. The data access involves “pushing down” certain operations to the Aerospike cluster in order to minimize data transfer. Such operations include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Filtering: A subset of records are selected based on some condition. Appropriate indexes are used to evaluate the condition. &lt;/li&gt;
&lt;li&gt;Projection: A subset of bins (columns) of each record are selected. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each worker retrieves and processes records, and forwards results to the next stage. The processing complexity determines the worker throughput: a simple aggregation can yield a high throughput of thousands of records per second, whereas a complex transformation involving a database lookup can be much slower, yielding just tens or hundreds of records per second. &lt;/p&gt;

&lt;h2&gt;
  
  
  Optimal Parallelism: Matched Stage Throughputs
&lt;/h2&gt;

&lt;p&gt;Parallelism in each stage is defined by the number of workers in that stage.&lt;/p&gt;

&lt;p&gt;The throughput of the overall computation is dictated by the slowest stage, or the bottleneck. For maximum efficiency, throughputs of all stages should match to avoid idle resources and/or excessive buffering. In other words, the output throughput of any stage should equal the processing capacity of the following stage. &lt;/p&gt;

&lt;p&gt;The throughput of a stage depends on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;worker throughput, which is determined by the computational complexity of processing in that stage, and&lt;/li&gt;
&lt;li&gt;number of workers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The optimizer determines the optimal throughput at each stage based on the resource requirements of processing in the stage, available resources, and other scheduling constraints. The number of workers at each stage is determined to deliver the matched throughput. &lt;/p&gt;
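
&lt;p&gt;As a rough sketch of this sizing rule (all stage names and rates below are hypothetical): divide the target stage throughput by the per-worker throughput of that stage and round up.&lt;/p&gt;

```python
import math

def workers_needed(target_rate, worker_rate):
    # Workers a stage needs to deliver the matched throughput.
    return math.ceil(target_rate / worker_rate)

# Hypothetical pipeline: every stage must sustain 2M records/sec, but
# per-worker rates differ with the stage's computational complexity.
stage_rates = {"access": 10_000, "transform": 2_000, "aggregate": 50_000}
plan = {stage: workers_needed(2_000_000, rate) for stage, rate in stage_rates.items()}
print(plan)  # {'access': 200, 'transform': 1000, 'aggregate': 40}
```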

&lt;p&gt;To simplify the discussion, we will focus on the data access stage.  &lt;/p&gt;

&lt;h2&gt;
  
  
  Methodology
&lt;/h2&gt;

&lt;p&gt;The methodology below involves first understanding the resource limitations that will cap the throughput: typically disk and network bandwidth, and the number and capacity of the cluster nodes. We use a simple data access request, such as a single scan or a query, to make the discussion concrete. After calculating the hardware bottleneck and the throughput limit, we may need to further adjust the device I/O, network bandwidth, and/or cluster resources. With a given hardware configuration, we can then focus on optimizing the software configuration and request parameters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limits and Bottlenecks
&lt;/h2&gt;

&lt;p&gt;The overall throughput is constrained by the following factors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Device I/O bandwidth: Disk I/O is a common bottleneck in databases. Aerospike uses SSD devices as high-density fast storage.&lt;/li&gt;
&lt;li&gt;Database node resources for pushdown processing: Processing such as filtering, bin projection, and UDF execution cannot exceed the available node resources. &lt;/li&gt;
&lt;li&gt;Network bandwidth: The data transfer between the server and worker nodes cannot exceed the network bandwidth.&lt;/li&gt;
&lt;li&gt;Worker resources: The number of workers as well as processing at worker nodes must be within the available capacity.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let us assume the following parameters in the system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Number of database nodes: S&lt;/li&gt;
&lt;li&gt;Number of workers in data access stage: N&lt;/li&gt;
&lt;li&gt;Device I/O bandwidth at a database node: d&lt;/li&gt;
&lt;li&gt;Effective network bandwidth: B&lt;/li&gt;
&lt;li&gt;Record size: R&lt;/li&gt;
&lt;li&gt;Filter selectivity factor (ratio of total records to selected records): F&lt;/li&gt;
&lt;li&gt;Projection factor (reduction in record size due to bin projection): P&lt;/li&gt;
&lt;li&gt;Workers per worker node: 100&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  SSD Throughput
&lt;/h3&gt;

&lt;p&gt;The maximum record I/O throughput at an Aerospike node is d/R. The maximum cumulative record I/O in the cluster is Sxd/R.&lt;/p&gt;

&lt;h4&gt;
  
  
  Example
&lt;/h4&gt;

&lt;p&gt;An SSD on an AWS instance may have an I/O rating of several GBps (say, 4GBps).&lt;br&gt;
Assuming a record size of a few KB (say, 2KB),  this translates into several million records per second.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Node SSD I/O = 4x10^9 / (2x10^3) = 2x10^6 or 2 million records per second&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Also, assuming a cluster size S=10, a cluster can provide device I/O of several tens of millions of records per second.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Cluster SSD I/O = 10 x 2x10^6 = 20x10^6 or 20 million records per second&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If a worker on average can process ten thousand records per second (W=10x10^3), it will take several thousand workers in the data access stage to saturate the disk I/O in such a system.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Max workers in data access stage = 20x10^6 / (10x10^3) = 2x10^3 or 2000  workers.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A lower processing throughput per worker would require a larger number of workers before SSD I/O becomes the bottleneck. Another resource may impose a lower limit on the number of workers.  &lt;/p&gt;
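
&lt;p&gt;The SSD I/O arithmetic above can be captured in a few lines of Python, using the example values (all numbers are illustrative):&lt;/p&gt;

```python
d = 4e9    # device I/O bandwidth per node, bytes/sec (4 GBps)
R = 2e3    # record size, bytes (2 KB)
S = 10     # number of database nodes
W = 1e4    # per-worker processing rate, records/sec

node_ssd_io = d / R               # 2 million records/sec per node
cluster_ssd_io = S * node_ssd_io  # 20 million records/sec in the cluster
max_workers = cluster_ssd_io / W  # 2000 workers saturate SSD I/O

print(node_ssd_io, cluster_ssd_io, max_workers)
```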

&lt;h3&gt;
  
  
  Server Node Throughput
&lt;/h3&gt;

&lt;p&gt;Each Aerospike node reads records from the disk and outputs processed records for the data access workers to consume. Depending on the type of processing, the number and size of the output records will be different. For example, filtering will reduce the number of records, and bin (column) projections will reduce the size of records. &lt;/p&gt;

&lt;p&gt;A query involving a scan with a filter needs to read all records from the device, whereas a secondary-index query needs to read only the records filtered by the secondary index. So:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The max record throughput at an Aerospike node is d/R. This corresponds to the record throughput for a scan without a filter.&lt;/li&gt;
&lt;li&gt;The max node record throughput for a scan with filter: d/(RxF)&lt;/li&gt;
&lt;li&gt;The max node record throughput for a secondary-index query: d/R&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Example
&lt;/h4&gt;

&lt;p&gt;Assuming d, R, and S are the same as above, we have a cluster throughput of several tens of millions of records for a scan query without filtering. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Cluster unfiltered scan throughput = Sxd/R = 10 x 4x10^9 / (2x10^3) = 20x10^6 or 20 million records per second&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Assuming selectivity factor F=10, a cluster can provide several millions of records throughput for filtered scans and several tens of millions of records for a secondary-index query. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Cluster filtered scan throughput = Sxd/(RxF) = 10 x 4x10^9 / (2x10^3 x 10) = 2x10^6 or 2 million records per second&lt;/p&gt;

&lt;p&gt;Cluster secondary-index query throughput = Sxd/R = 10 x 4x10^9 / (2x10^3 ) = 20x10^6 or 20 million records per second&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To saturate the Aerospike cluster throughput in this setup, assuming a processing rate of ten thousand records per second at each worker (W=10^4), it will take thousands of workers for a scan query without filtering as well as a secondary-index query, and hundreds of workers for a scan with filtering.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Max workers for an unfiltered scan = 20x10^6 / (10x10^3) = 2x10^3 or 2000 workers.&lt;/p&gt;

&lt;p&gt;Max workers for a filtered scan = 2x10^6 / (10x10^3) = 2x10^2 or 200 workers.&lt;/p&gt;

&lt;p&gt;Max workers for a secondary-index query = 20x10^6 / (10x10^3) = 2x10^3 or 2000 workers.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Again, a lower processing throughput per worker would require a larger number of workers before Aerospike cluster throughput becomes the bottleneck. Also, another resource may impose a lower limit on the number of workers.   &lt;/p&gt;
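
&lt;p&gt;The three throughput cases above, sketched in Python with the same example values:&lt;/p&gt;

```python
S, d, R, W = 10, 4e9, 2e3, 1e4
F = 10   # filter selectivity factor: 1 in F records is selected

# Maximum cluster record throughput for each access pattern.
unfiltered_scan = S * d / R        # 20 million records/sec
filtered_scan = S * d / (R * F)    # 2 million records/sec
si_query = S * d / R               # 20 million records/sec

# Workers needed to saturate the cluster in each case.
print(unfiltered_scan / W, filtered_scan / W, si_query / W)  # 2000.0 200.0 2000.0
```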

&lt;h4&gt;
  
  
  Complex Pushdown Processing
&lt;/h4&gt;

&lt;p&gt;An aggregation performed on a server node can dramatically reduce the number of result records, typically to just one, and therefore aggregation processing on a server node is not pertinent to this discussion of worker counts. Only one or a very few downstream workers would suffice, and we will ignore this case here. In order to perform complex cluster operations such as aggregations, the number of nodes in the Aerospike cluster, as well as each node’s memory and CPU resources, should be sized appropriately. &lt;/p&gt;

&lt;p&gt;We will assume Aerospike node CPU and memory are not a bottleneck for this discussion. &lt;/p&gt;

&lt;h3&gt;
  
  
  Network Bandwidth
&lt;/h3&gt;

&lt;p&gt;The network I/O at a worker or a server node has an upper limit. For a large worker cluster, the cumulative worker I/O can exceed the subnet bandwidth. Additionally, the traffic to the database may need to traverse routers and gateways which will also impose their own limits. The most stringent of these limits can become the bottleneck.&lt;/p&gt;

&lt;h4&gt;
  
  
  Example
&lt;/h4&gt;

&lt;p&gt;In AWS, the network I/O limit for an instance may range from a few Gbps to hundreds of Gbps. A cluster on a subnet may have a total network capacity of hundreds of Gbps across all nodes. If the Aerospike cluster placement requires access to another VPC in AWS, the VPC gateway I/O limit is on the order of 100 Gbps. &lt;/p&gt;

&lt;p&gt;We can reasonably work with the effective network bandwidth limit (B) of 100 Gbps.&lt;/p&gt;

&lt;p&gt;Compare this to the Aerospike cluster SSD throughput of several tens of GBps, or hundreds of Gbps, and note that the network bandwidth is the smaller of the two, potentially by an order of magnitude. In such a system, the network bandwidth can become the bottleneck. &lt;/p&gt;

&lt;p&gt;Assuming a maximum bandwidth of 100 Gbps and using the prior values for R and W, we will need several hundred workers to saturate the network.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Max workers to saturate network I/O = B / (WxR) = (100/8) x 10^9 / (WxR)&lt;br&gt;
= 12.5x10^9 / (10^4 x 2x10^3) = 6.25x10^2 = 625&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Interestingly, for a scan with filtering and projection, the network bandwidth requirement at the data access stage reduces by a factor of FxP because fewer records of smaller size now need to traverse the network. Assuming F=10 and P=10, the network bandwidth needed just for the data access portion goes down a hundredfold, which can shift the bottleneck to the SSD I/O. Removing the SSD I/O bottleneck may entail adding SSD drives to each Aerospike node.&lt;/p&gt;
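
&lt;p&gt;The network saturation calculation above, as a short Python sketch with the example values:&lt;/p&gt;

```python
B = 100e9 / 8   # effective network bandwidth, bytes/sec (100 Gbps)
W = 1e4         # per-worker processing rate, records/sec
R = 2e3         # record size, bytes

# Each data-access worker pulls W x R bytes/sec over the network.
max_workers = B / (W * R)
print(max_workers)  # 625.0
```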

&lt;h3&gt;
  
  
  Worker Nodes
&lt;/h3&gt;

&lt;p&gt;The number of available worker nodes can itself be the limit. A worker node can run a large number of worker processes (or workers), typically configured at 1-2 times the number of CPU cores. &lt;/p&gt;

&lt;h4&gt;
  
  
  Example
&lt;/h4&gt;

&lt;p&gt;If we assume 100 workers per worker node, the number of data access worker nodes needed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;To saturate the cluster record throughput, the number of worker nodes is up to a few tens:
Worker nodes = range of workers / workers per worker node
= hundreds to thousands / 100 = a few to tens of nodes&lt;/li&gt;
&lt;li&gt;To saturate the effective network bandwidth, the number of worker nodes is in single digits:
Worker nodes = hundreds of workers / 100 = a few nodes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A processing throughput per worker lower than the ten thousand records per second assumed above would mean a larger number of worker nodes before hitting the bottleneck (the network bandwidth in this case). Note that the above gives the nodes needed in the data access stage only, not in the entire worker cluster.&lt;/p&gt;

&lt;h3&gt;
  
  
  Concurrent Jobs
&lt;/h3&gt;

&lt;p&gt;Concurrent computations on the worker cluster share the resources and throughput. &lt;/p&gt;

&lt;h4&gt;
  
  
  Example
&lt;/h4&gt;

&lt;p&gt;If a shared platform requires fair scheduling of, say, 3 similar computations at a time, the resource limit for each computation will be 1/3 of the total available limits. Thus, each query in our scenario can use only a few worker nodes, or a few hundred workers, before reaching the bottleneck, which is the network bandwidth in our example.&lt;/p&gt;

&lt;h2&gt;
  
  
  Another Setup
&lt;/h2&gt;

&lt;p&gt;Let's run through a lower-tier hardware setup for a similar workload of scans and queries:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Number of database nodes: S = 3&lt;/li&gt;
&lt;li&gt;Device I/O bandwidth at a database node: d = 2 GBps&lt;/li&gt;
&lt;li&gt;Effective network bandwidth: B = 50 Gbps&lt;/li&gt;
&lt;li&gt;Workers per worker node: 50&lt;/li&gt;
&lt;li&gt;Record size: R = 2KB&lt;/li&gt;
&lt;li&gt;Filter selectivity factor: F = 10&lt;/li&gt;
&lt;li&gt;Projection factor: P = 10&lt;/li&gt;
&lt;li&gt;Worker throughput: W = 10^3 records per second&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Device I/O limit
&lt;/h4&gt;

&lt;blockquote&gt;
&lt;p&gt;Cluster SSD I/O = Sxd/R = 3 x 2x10^9/2x10^3 = 3 x 10^6 or 3 million records per second&lt;/p&gt;

&lt;p&gt;Max workers in data access stage = 3x10^6 / (10^3) = 3x10^3 or 3000 workers.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Database throughput limit
&lt;/h4&gt;

&lt;blockquote&gt;
&lt;p&gt;Cluster unfiltered scan throughput = Sxd/R = 3 x 2x10^9 / 2x10^3 = 3x10^6 or 3 million records per second&lt;/p&gt;

&lt;p&gt;Max workers for unfiltered scan = 3x10^6 / (10^3) = 3x10^3 or 3000 workers.&lt;/p&gt;

&lt;p&gt;Cluster filtered scan throughput = Sxd/(RxF) = 3 x 2x10^9 / (2x10^3 x 10) = 3x10^5 or 0.3 million records per second&lt;/p&gt;

&lt;p&gt;Max workers for filtered scan = 0.3x10^6 / (10^3) = 0.3x10^3 or 300 workers.&lt;/p&gt;

&lt;p&gt;Cluster secondary-index query throughput = Sxd/R = 3 x 2x10^9 / (2x10^3 ) = 3x10^6 or 3 million records per second&lt;/p&gt;

&lt;p&gt;Max workers for secondary-index query = 3x10^6 / (10^3) = 3x10^3 or 3000 workers.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To saturate the cluster record throughput, the number of worker nodes ranges from single digits to a few tens.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Range of workers / workers per worker node &lt;br&gt;
= 300-3000  / 50  -&amp;gt; 6 - 60 nodes&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Network bandwidth limit
&lt;/h4&gt;

&lt;blockquote&gt;
&lt;p&gt;Max workers to saturate network I/O = B/(WxR) = 50/8 x 10^9 / (WxR) &lt;br&gt;
= 6.25x10^9 / (10^3 x 2x10^3) = 3.125 x 10^3 &lt;br&gt;
= 3125 workers&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To saturate the effective network bandwidth, the number of worker nodes is in the low tens.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Workers / workers per worker node&lt;br&gt;
= 3125  / 50 -&amp;gt; 62.5 nodes&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Concurrency limit
&lt;/h4&gt;

&lt;p&gt;With 3 similar computations at a time, the resource limit for each computation will be 1/3 of the total available limits, or up to 21 worker nodes.&lt;/p&gt;
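
&lt;p&gt;The lower-tier limits above can be recomputed in one place with a short Python sketch (W = 10^3 records per second per worker, as in the calculations above):&lt;/p&gt;

```python
# Lower-tier setup: recompute the limits in one place.
S, d = 3, 2e9            # database nodes, device I/O per node (bytes/sec)
B = 50e9 / 8             # effective network bandwidth, bytes/sec (50 Gbps)
R, F, W = 2e3, 10, 1e3   # record size, selectivity factor, worker rate
workers_per_node = 50

cluster_ssd_io = S * d / R                  # 3 million records/sec
filtered_scan = S * d / (R * F)             # 0.3 million records/sec
net_workers = B / (W * R)                   # 3125 workers saturate the network
net_nodes = net_workers / workers_per_node  # 62.5 worker nodes
per_job_nodes = net_nodes / 3               # about 21 nodes per concurrent job

print(cluster_ssd_io, filtered_scan, net_workers, net_nodes)
```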

&lt;h2&gt;
  
  
  Optimizing Data Access
&lt;/h2&gt;

&lt;p&gt;The problem of optimizing data access is to achieve performance and throughput as close to the hardware limits as possible. &lt;/p&gt;

&lt;p&gt;In general, streamlined processing with fewest context switches provides superior efficiency.  So data access is likely to be more efficient and provide better throughput where: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;data is spread across fewer server nodes as it requires less split-merge overhead,&lt;/li&gt;
&lt;li&gt;access chunk (or page) size is larger as it allows longer uninterrupted runs, and &lt;/li&gt;
&lt;li&gt;asynchronous mode is used as resources are not held up while waiting on response. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At the same time, there are many factors that cannot be easily predicted. Concurrency conflicts, context switching, and flow control delays can lead to unexpected bottlenecks. Suboptimal system configuration, such as missing indexes, or a suboptimal query plan chosen due to incorrect heuristics or metadata, will also lead to throughput surprises. Therefore, the best way to find the optimal parameters is to experiment in the target environment with the target workload.&lt;/p&gt;




&lt;p&gt;The hardware capacities of a given environment will limit the level of parallelism and throughput. Different workloads place different types of resource burden and can expose bottlenecks in different areas. The system should be balanced with the desired workload in mind so that a bottleneck in one area does not waste unused capacity in other areas. While a good understanding of such limits is important, there are many factors that are dynamic and cannot be predicted easily. Therefore, experimenting in the target environment is essential to discover the system model for optimal performance.&lt;/p&gt;




</description>
      <category>aerospike</category>
      <category>bigdata</category>
&lt;category&gt;parallelprocessing&lt;/category&gt;
      <category>queryprocessing</category>
    </item>
    <item>
      <title>Processing Large Data Sets in Fine-Grained Parallel Streams</title>
      <dc:creator>Neel Phadnis</dc:creator>
      <pubDate>Tue, 17 Jan 2023 16:36:37 +0000</pubDate>
      <link>https://dev.to/aerospike/processing-large-data-sets-in-fine-grained-parallel-streams-3o1l</link>
      <guid>https://dev.to/aerospike/processing-large-data-sets-in-fine-grained-parallel-streams-3o1l</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1600%2F0%2ArZiI9aVOg6cvtu7W" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1600%2F0%2ArZiI9aVOg6cvtu7W" alt="(Source: Photo by Dan Gold on Unsplash" width="800" height="533"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Source: Photo by Dan Gold on &lt;a href="https://unsplash.com/" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Aerospike provides several mechanisms for accessing large data sets over parallel streams to match worker throughput in parallel computations. This article explains the key mechanisms, and describes specific schemes for defining data splits and a framework for testing them.&lt;/p&gt;

&lt;p&gt;Follow along in the interactive notebook &lt;a href="https://developer.aerospike.com/tutorials/java/query_splits" rel="noopener noreferrer"&gt;Splitting Large Data Sets for Parallel Processing of Queries&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Parallel Processing of Large Data Sets
&lt;/h2&gt;

&lt;p&gt;In order to process large data sets, a common scheme is to split the data into partitions and assign a worker task to process each partition. The partitioning scheme must have the following properties:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The partitions are collectively exhaustive, meaning they cover the entire data set, and mutually exclusive, meaning they do not overlap.&lt;/li&gt;
&lt;li&gt;They are deterministically and efficiently computed.&lt;/li&gt;
&lt;li&gt;They are accessible in an efficient and flexible manner as required by worker tasks, for example, in smaller chunks at a time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is also critical for the application or platform to have an efficient mechanism to coordinate with workers and aggregate results to benefit from such a partitioning scheme.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Partitions in Aerospike
&lt;/h2&gt;

&lt;p&gt;Aerospike organizes the records in a namespace into 4096 partitions. The partitions are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;uniformly balanced, meaning they hold approximately the same number of records; records are assigned to partitions by hashing their keys with the RIPEMD-160 hash function, and &lt;/li&gt;
&lt;li&gt;uniformly distributed across cluster nodes, meaning each node has about the same number of partitions. To be precise, each node has the same number of partitions if the cluster size is a power of two; otherwise some nodes hold more than others.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All three types of Aerospike indexes - primary, set, and secondary - are partition oriented. That is, they are split by partition at each node (in releases 5.7+), and queries are processed at each node over individual partitions. A client can request that a query be processed over specific partitions so that multiple client workers can work in parallel. It is easy to see how up to 4096 parallel streams, one per partition, can be set up for parallel processing.&lt;/p&gt;

&lt;p&gt;Aerospike queries support pagination: the client can process a chunk of records at a time by repeatedly requesting a certain number of records until all records are retrieved.&lt;/p&gt;
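
&lt;p&gt;Uniform hashing into 4096 partitions can be illustrated with a toy Python sketch. SHA-256 is used here purely as a stand-in hash, and the digest-to-partition mapping is our illustration, not Aerospike's actual RIPEMD-160-based scheme:&lt;/p&gt;

```python
import hashlib

N_PARTITIONS = 4096

def partition_id(key: bytes) -> int:
    # Map a record key to one of 4096 partitions via a digest. SHA-256
    # and this particular mapping are illustrative stand-ins only.
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:2], "little") % N_PARTITIONS

# Uniform hashing balances records across partitions.
counts = [0] * N_PARTITIONS
for i in range(100_000):
    counts[partition_id(f"key-{i}".encode())] += 1
print(min(counts), max(counts))  # each partition holds roughly 100000/4096, i.e. ~24
```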

&lt;h2&gt;
  
  
  Splitting Data Sets Beyond 4096
&lt;/h2&gt;

&lt;p&gt;Many data processing platforms allow more worker tasks than 4096. For example, Spark allows up to 32K worker tasks to run in parallel. Trino allows theoretical concurrency of greater than 4K. &lt;/p&gt;

&lt;p&gt;Aerospike allows for data splits larger than 4096 by allowing a partition to be divided into sub-partitions efficiently. The scheme is based on the &lt;code&gt;digest-modulo&lt;/code&gt; function that can divide a partition into an arbitrary number of non-overlapping and collectively complete sub-partitions. It involves adding the  filter expression &lt;code&gt;digest % N == i for 0 &amp;lt;= i &amp;lt; N&lt;/code&gt;, where the &lt;code&gt;digest&lt;/code&gt; is the hashed key of the record.&lt;/p&gt;

&lt;p&gt;The advantage of the digest-modulo function is that it can be evaluated without reading individual records from the storage device (such as SSDs). Digests of all records are held in the primary index, which resides in memory. Therefore, determining the membership of a digest, and equivalently of the corresponding record, in a sub-partition is fast. Each sub-partition stream needs to read only its records from the potentially slower storage device, although it needs to perform the in-memory digest-modulo evaluation, which is much faster, for all records.&lt;/p&gt;

&lt;p&gt;This scheme works for primary-index and set-index queries because they hold digests of records. The secondary index holds the primary index  location of the record, and a lookup provides the digest information. &lt;/p&gt;
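
&lt;p&gt;The digest-modulo split can be illustrated with plain integers standing in for digests (a Python sketch, not the server implementation):&lt;/p&gt;

```python
# Digest-modulo sub-partitioning: membership is decided from the digest
# alone. Plain integers stand in for record digests here.
N = 3                      # modulo factor: sub-partitions per partition
digests = list(range(20))  # illustrative digests of one partition's records

subparts = {i: [dg for dg in digests if dg % N == i] for i in range(N)}

# The sub-partitions are non-overlapping and collectively exhaustive.
recombined = sorted(dg for members in subparts.values() for dg in members)
print(recombined == digests)  # True
```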

&lt;h2&gt;
  
  
  Defining and Assigning Splits
&lt;/h2&gt;

&lt;p&gt;The problem can be stated as follows: how are splits over a data set defined and assigned to N worker tasks, where N can vary from 1 to an arbitrarily large number? In practice, a given platform imposes an upper bound on N, either as a platform-defined absolute limit or because the overhead of processing a large number of parallel streams and coordinating across them negates the benefits.&lt;/p&gt;

&lt;p&gt;An Aerospike partition ID varies from 0 to 4095. If a partition is divided into sub-partitions, each sub-partition is identified by a tuple of partition-id &lt;code&gt;p&lt;/code&gt;, sub-partition-id &lt;code&gt;s&lt;/code&gt;, and modulo factor &lt;code&gt;m&lt;/code&gt;: &lt;code&gt;(p, s, m)&lt;/code&gt;. Each split needs an assignment of a list of partitions and/or sub-partitions. For example, a split &lt;code&gt;i&lt;/code&gt; can have:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Split i -&amp;gt; [pi1, pi2, pi3, …, (psi1, si1, m), (psi2, si2, m), …]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;It is important to understand what partitions or sub-partitions can be requested in a single Aerospike API call:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Full partitions and sub-partitions cannot be mixed in a call.&lt;/li&gt;
&lt;li&gt;Full partitions must be consecutive in order, or &lt;code&gt;(pstart-id, pcount)&lt;/code&gt;. &lt;/li&gt;
&lt;li&gt;Sub-partitions must be consecutive, belong to consecutive partitions, and use the same modulo factor, or &lt;code&gt;(pi, pcount, sstart-id, scount, m)&lt;/code&gt;. &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The goal is to achieve best efficiency with the operations available in the APIs. &lt;/p&gt;

&lt;p&gt;We will adhere to these constraints in the following discussion.&lt;/p&gt;

&lt;h2&gt;
  
  
  Split Assignment Schemes
&lt;/h2&gt;

&lt;p&gt;We will examine three variations of split assignment. &lt;/p&gt;

&lt;p&gt;If N is the requested number of splits:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;At-most N splits (can be fewer), same sized, one API call per split.&lt;/li&gt;
&lt;li&gt;At-least N splits (can be more), same sized, one API call per split.&lt;/li&gt;
&lt;li&gt;Exactly N splits, same sized, up to three API calls per split.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The first two schemes allocate the same amount of data (full partitions or a single sub-partition) to every split, and return the allowed number of splits closest to N, which must be a factor or multiple of 4096. Each split is processed with one API call. &lt;/p&gt;

&lt;p&gt;The third scheme allows any number of splits, each assigned the same amount of data as partitions and/or sub-partitions. Each split, however, may require up to three API calls. &lt;/p&gt;

&lt;p&gt;These schemes are described in detail below.&lt;/p&gt;

&lt;h3&gt;
  
  
  At-Most N Splits
&lt;/h3&gt;

&lt;p&gt;In this case, the returned splits can be fewer in number, matching the closest lower factor or multiple of 4096.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Case 1: N &amp;lt; 8192: Full partition assignments

&lt;ul&gt;
&lt;li&gt;Returned number of splits F is the closest &lt;em&gt;factor&lt;/em&gt; of 4096 that is &amp;lt;= N.&lt;/li&gt;
&lt;li&gt;Number of partitions in each split, n: 4096/F&lt;/li&gt;
&lt;li&gt;Partitions in split i: &lt;code&gt;(start = i*n, count = n)&lt;/code&gt; &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Case 2: N &amp;gt;= 8192: Sub-partition assignment

&lt;ul&gt;
&lt;li&gt;Returned number of splits M is the closest &lt;em&gt;multiple&lt;/em&gt; of 4096 that is &amp;lt;=  N.&lt;/li&gt;
&lt;li&gt;Number of sub-partitions or modulo-factor, m: M/4096&lt;/li&gt;
&lt;li&gt;Sub-partition in split i: &lt;code&gt;(floor(i/m), i%m, m)&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  At-Least N Splits
&lt;/h3&gt;

&lt;p&gt;In this case, the returned splits can be more in number, matching the closest higher factor or multiple of 4096.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Case 1: N &amp;lt;= 4096: Full partition assignments

&lt;ul&gt;
&lt;li&gt;Returned number of splits F is the closest &lt;em&gt;factor&lt;/em&gt; of 4096 that is &amp;gt;= N.&lt;/li&gt;
&lt;li&gt;Number of partitions in each split, n: 4096/F&lt;/li&gt;
&lt;li&gt;Partitions in split i: &lt;code&gt;(start = i*n, count = n)&lt;/code&gt; &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Case 2: N &amp;gt; 4096: Sub-partition assignment

&lt;ul&gt;
&lt;li&gt;Returned number of splits M is the closest &lt;em&gt;multiple&lt;/em&gt; of 4096 that is &amp;gt;= N.&lt;/li&gt;
&lt;li&gt;Number of sub-partitions or modulo-factor, m: M/4096&lt;/li&gt;
&lt;li&gt;Sub-partition in split i: &lt;code&gt;(floor(i/m), i%m, m)&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
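&lt;p&gt;The At-Least N rules can be sketched the same way (again an illustrative Python approximation of the notebook's Java logic, with assumed names):&lt;/p&gt;

```python
# Illustrative Python sketch of the At-Least N split scheme.
NUM_PARTITIONS = 4096

def at_least_splits(n):
    """Return (actual split count, per-split assignments) for at least n splits."""
    if n <= NUM_PARTITIONS:
        # Case 1: closest factor of 4096 (a power of 2) that is >= n.
        f = 1
        while f < n:
            f *= 2
        per = NUM_PARTITIONS // f
        return f, [("partitions", i * per, per) for i in range(f)]
    # Case 2: closest multiple of 4096 that is >= n (ceiling division).
    m = -(-n // NUM_PARTITIONS)        # sub-partitions per partition
    total = m * NUM_PARTITIONS
    return total, [("sub-partition", i // m, i % m, m) for i in range(total)]
```

Requesting at least 1000 splits returns 1024 splits of 4 partitions each; requesting at least 5000 splits returns 8192 sub-partition splits.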

&lt;h3&gt;
  
  
  Exactly N Splits
&lt;/h3&gt;

&lt;p&gt;In this case, exactly N splits of equal size are created.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each of the 4096 partitions is divided into N sub-partitions, resulting in 4096 * N sub-partitions in total.&lt;/li&gt;
&lt;li&gt;Each split is assigned 4096 sub-partitions in the following manner:

&lt;ul&gt;
&lt;li&gt;Sub-partitions are enumerated partition by partition, from sub-partition 0 to N-1 within each partition, starting at partition 0 and ending at partition 4095.&lt;/li&gt;
&lt;li&gt;The first split is assigned the first 4096 consecutive sub-partitions in this order, the next split gets the following 4096 sub-partitions, and so on.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Thus, sub-partitions in a split fall in one or more of the following three groups, each of which can be retrieved using one API call:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Up to 4095 (possibly none) consecutive sub-partitions in the starting partition&lt;/li&gt;
&lt;li&gt;Up to 4096 (possibly none) consecutive full partitions&lt;/li&gt;
&lt;li&gt;Up to 4095 (possibly none) consecutive sub-partitions in the ending partition&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For example, if 3 splits are desired, each split gets 4096/3 = 1365 1/3 partitions' worth of data. In this scheme, the first split would consist of:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;0 sub-partitions in partition 0&lt;/li&gt;
&lt;li&gt;1365 full partitions: 0-1364&lt;/li&gt;
&lt;li&gt;1 (0th) of 3 sub-partitions in partition 1365&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;And the next (second) split will consist of:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;2 (1-2) of 3 sub-partitions in partition 1365&lt;/li&gt;
&lt;li&gt;1364 full partitions: 1366-2729&lt;/li&gt;
&lt;li&gt;2 (0-1) of 3 sub-partitions in partition 2730&lt;/li&gt;
&lt;/ol&gt;
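&lt;p&gt;The grouping above can be computed directly. The following Python sketch (illustrative only; the notebook implements the algorithm in Java) derives the up-to-three groups for any split:&lt;/p&gt;

```python
NUM_PARTITIONS = 4096

def exactly_n_groups(s, n):
    """Groups of data (each retrievable with one API call) for split s of
    exactly n splits. Global sub-partition index k lies in partition k // n,
    sub-partition k % n; split s owns indexes [s*4096, (s+1)*4096)."""
    lo, hi = s * NUM_PARTITIONS, (s + 1) * NUM_PARTITIONS
    groups = []
    p_lo, r_lo = divmod(lo, n)     # starting partition, first sub-partition in it
    p_hi, r_hi = divmod(hi, n)     # ending partition, sub-partitions taken from it
    if r_lo:                       # 1. partial starting partition
        take = min(n - r_lo, hi - lo)
        groups.append(("subs", p_lo, r_lo, r_lo + take - 1))
        p_lo += 1
    if p_hi > p_lo:                # 2. run of full partitions
        groups.append(("full", p_lo, p_hi - 1))
    if r_hi and p_hi >= p_lo:      # 3. partial ending partition
        groups.append(("subs", p_hi, 0, r_hi - 1))
    return groups
```

Running this for the 3-split example reproduces the group lists given above for the first and second splits.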

&lt;p&gt;The algorithm details with code and examples are available in &lt;a href="https://developer.aerospike.com/tutorials/java/query_splits" rel="noopener noreferrer"&gt;the notebook tutorial&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Alternative Ways of Splitting
&lt;/h3&gt;

&lt;p&gt;Splits can be assigned in many other ways. The notebook, for example, shows two additional variations of the Exactly N Splits scheme; you can also experiment with a scheme of your own in the notebook or in your target environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Parallel Query Framework
&lt;/h2&gt;

&lt;p&gt;The parallel stream processing from the above split assignments can be tested with the following simple framework implemented in &lt;a href="https://developer.aerospike.com/tutorials/java/query_splits" rel="noopener noreferrer"&gt;the notebook tutorial&lt;/a&gt;. It can be tweaked to suit the needs of the intended workload and environment.&lt;/p&gt;

&lt;p&gt;The test data consists of 100K records (can be changed) of ~1KB size, with a secondary index defined on an integer bin.&lt;/p&gt;

&lt;h3&gt;
  
  
  Processing Flow
&lt;/h3&gt;

&lt;p&gt;The processing takes place as follows (tunable parameters are &lt;em&gt;italicized&lt;/em&gt;):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Split assignments are made for the requested &lt;em&gt;number of splits&lt;/em&gt; and the desired &lt;em&gt;split type&lt;/em&gt;. &lt;/li&gt;
&lt;li&gt;The desired &lt;em&gt;number of workers&lt;/em&gt; (threads) are created. All workers start at the same time to process the splits. Each worker thread does the following in a loop until there are no unprocessed splits available:

&lt;ul&gt;
&lt;li&gt;Obtain the &lt;em&gt;next scheduled&lt;/em&gt; split.&lt;/li&gt;
&lt;li&gt;Create one or more query requests over the split’s partitions and sub-partitions and process them sequentially.&lt;/li&gt;
&lt;li&gt;Assign the &lt;em&gt;secondary-index query predicate&lt;/em&gt; depending on the requested query type. &lt;/li&gt;
&lt;li&gt;Create the requested &lt;em&gt;filter expression&lt;/em&gt;. Append it (with AND) to the sub-partition filter expression if one is being used; otherwise use it on its own.&lt;/li&gt;
&lt;li&gt;Process the query with the filter in the requested &lt;em&gt;mode&lt;/em&gt; (sync or async).

&lt;ul&gt;
&lt;li&gt;Get &lt;em&gt;chunk-size&lt;/em&gt; records at a time until all records are retrieved.&lt;/li&gt;
&lt;li&gt;Process the records using the &lt;em&gt;stream processing implementation&lt;/em&gt;. The &lt;a href="https://developer.aerospike.com/tutorials/java/query_splits" rel="noopener noreferrer"&gt;notebook&lt;/a&gt; example has CountAndSum processing that:

&lt;ul&gt;
&lt;li&gt;Aggregates the number of records in a per-worker count.&lt;/li&gt;
&lt;li&gt;Aggregates an integer bin value in a per-worker sum.&lt;/li&gt;
&lt;li&gt;Aggregates the count and sum across all workers at the end.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;/li&gt;

&lt;/ul&gt;

&lt;/li&gt;

&lt;li&gt;Wait for all workers to finish, and output the aggregated results from stream processing.&lt;/li&gt;

&lt;/ul&gt;
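&lt;p&gt;The worker loop above can be approximated in a few lines of Python (a simplified stand-in for the notebook's Java framework: it omits the actual queries and instead runs a caller-supplied function per split):&lt;/p&gt;

```python
import queue
import threading

def run_workers(splits, num_workers, process_split):
    """Each worker pulls the next unprocessed split until none remain;
    per-worker (count, sum) aggregates are merged at the end, as in the
    CountAndSum example."""
    work = queue.Queue()
    for s in splits:
        work.put(s)
    results = []                       # one (count, total) tuple per worker
    lock = threading.Lock()

    def worker():
        count = total = 0
        while True:
            try:
                split = work.get_nowait()
            except queue.Empty:
                break
            # In the real framework this is a (paged) query over the split's
            # partitions/sub-partitions; here process_split yields values.
            for value in process_split(split):
                count += 1
                total += value
        with lock:
            results.append((count, total))

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Aggregate across all workers.
    return sum(c for c, _ in results), sum(t for _, t in results)
```

The final aggregates are independent of how the scheduler interleaves splits among workers, which is the invariant the notebook's CountAndSum check relies on.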

&lt;p&gt;In the CountAndSum example, the total number of processed records and the sum of the integer bin across all records must be the same for a given query predicate and filter irrespective of the number of splits, split type, number of workers, and processing mode.&lt;/p&gt;

&lt;p&gt;A summary of split assignments and worker statistics can be optionally printed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Parameters and Variations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Number of splits&lt;/strong&gt;: Any number of splits over the data set may be requested. Example range 1-10K.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Split type&lt;/strong&gt;: One of the three variations discussed above can be requested: At-Most N, At-Least N, and Exactly N.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Number of workers&lt;/strong&gt;: The desired parallelism in processing; example values range from 1 to 10K.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query index type&lt;/strong&gt;: Either primary- or secondary-index query can be specified.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secondary-index predicate&lt;/strong&gt;: In case of a secondary-index query, a secondary-index predicate is specified. The default secondary-index predicate is &lt;code&gt;50001 &amp;lt;= bin1 &amp;lt;= 100000&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Filter expression&lt;/strong&gt;: An optional filter expression can also be specified. The default filter expression is &lt;code&gt;bin1 % 2 == 0&lt;/code&gt;; that is, only records with an even bin1 value are retrieved.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chunk size&lt;/strong&gt;: The page size for iterative retrieval of records in a split. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Processing mode&lt;/strong&gt;: Either sync or async processing mode to process the query results may be selected.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stream processing&lt;/strong&gt;: How records are aggregated or otherwise processed; can be customized by overriding the abstract class StreamProcessing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Work scheduling&lt;/strong&gt;: How splits are assigned to workers; can be customized by overriding the abstract class WorkScheduling.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;a href="https://developer.aerospike.com/tutorials/java/query_splits" rel="noopener noreferrer"&gt;notebook&lt;/a&gt; illustrates many interesting variations, and you can play with additional ones.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use Cases for Fine-Grained Parallelism
&lt;/h2&gt;

&lt;p&gt;Processing speed can benefit from a very high degree of parallelism for a very large data set processed with transforms, aggregations, and updates. &lt;/p&gt;

&lt;p&gt;Multiple data sets that need to be joined, and require shuffling subsets across a large number of worker nodes, may not benefit from a very high degree of parallelism. In such cases, the cost of transfer of data in subsequent steps across a large number of worker nodes can limit the benefit of fine-grained retrieval streams. A Cost Based Optimizer (CBO) on the processing platform should be able to determine the best level of parallelism for data access from Aerospike for a given query.&lt;/p&gt;

&lt;p&gt;It would be useful to examine simple heuristics for the level of parallelism in complex computations over a data set. In a future post, we will explore the optimal level of parallelism given the potential conflicting goals of throughput, response time, resource cost, and utilization. &lt;/p&gt;

</description>
      <category>watercooler</category>
    </item>
    <item>
      <title>Of Queries and Indexes</title>
      <dc:creator>Neel Phadnis</dc:creator>
      <pubDate>Thu, 13 Oct 2022 21:33:05 +0000</pubDate>
      <link>https://dev.to/aerospike/of-queries-and-indexes-1cl3</link>
      <guid>https://dev.to/aerospike/of-queries-and-indexes-1cl3</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pfyiXTAj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1600/0%2A3EjjAPKeG-tmetcs" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pfyiXTAj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1600/0%2A3EjjAPKeG-tmetcs" alt="(Source: Photo by Jan Antonin Kolar on Unsplash" width="800" height="555"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Source: Photo by Jan Antonin Kolar on &lt;a href="https://unsplash.com/"&gt;Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Queries, scans, indexes, pagination, and parallelism are common concepts in databases, but each database differs in specifics. It is vital to understand the specifics in order to get the most out of a database. In Aerospike, queries and indexes play a key role in realizing its speed-at-scale objective. The goal of this post is to help developers better understand the Aerospike capabilities in these areas.&lt;/p&gt;

&lt;p&gt;A query is a request for data that meets specific criteria. The criteria or conditions that the result must meet are called the query predicate. &lt;/p&gt;

&lt;p&gt;In Aerospike, a query is processed using one of these indexes: the primary index, a set index, or a secondary index.&lt;/p&gt;

&lt;h2&gt;
  
  
  Primary and Set Indexes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Primary Index
&lt;/h3&gt;

&lt;p&gt;In Aerospike, there is a single system-defined primary index per namespace, built on the &lt;code&gt;digest&lt;/code&gt; of records. The digest is a &lt;a href="https://en.wikipedia.org/wiki/RIPEMD"&gt;RIPEMD-160&lt;/a&gt; hash of the tuple &lt;code&gt;(set, user-key)&lt;/code&gt;, where a set (equivalent to a table) is an application-defined grouping of records in a namespace (equivalent to a database or schema), and the user-key is an application-provided id that is unique within the set. The primary index is not optional: it is created automatically, cannot be removed, and the field on which it is defined cannot be changed. An index created on a bin (equivalent to a column) that holds a primary key is considered a secondary index. For example, for a set of records with employee-number as the user-key and an "ssn" bin for social security number, an index created on the ssn bin is a secondary index.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scan or Primary-Index Query
&lt;/h3&gt;

&lt;p&gt;In Aerospike, the general way of processing data requests is a scan with a &lt;code&gt;filter expression&lt;/code&gt; that captures the query predicate. For example, for a request "get records from employees set where employee-number is in range 100-200", a scan is performed with a filter expression to capture the query predicate "employee-number is in range 100-200". The primary index is used to scan the namespace, and therefore a scan is also called a &lt;code&gt;primary-index query&lt;/code&gt;. &lt;/p&gt;

&lt;h3&gt;
  
  
  Set Indexes
&lt;/h3&gt;

&lt;p&gt;A &lt;a href="https://docs.aerospike.com/server/architecture/set-indexes"&gt;set index&lt;/a&gt; can optionally be created for potential performance improvements when querying a set. In the previous example, the request will execute faster by having a set index on the employees set. If a set index has been created, it will be used for a set query instead of the primary index. &lt;/p&gt;

&lt;h3&gt;
  
  
  Order
&lt;/h3&gt;

&lt;p&gt;A query using the primary index or a set index follows the internal, deterministic digest ordering. Because the digest is a hashed value, this order is not meaningful to the application. For example, while employee numbers have a recognizable order, records with the employee-number user-key will be scanned in a seemingly random order. In general, any ordering must be implemented in the application, as query results are typically gathered from multiple partitions on multiple nodes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Query Processing
&lt;/h3&gt;

&lt;p&gt;A &lt;a href="https://docs.aerospike.com/server/guide/expressions#filter-expressions"&gt;filter expression&lt;/a&gt; is a boolean computation of the record metadata and/or data, using supported operators and methods. The expression is specified as part of the operation policy, and is evaluated for every record. Only records that pass the filter are further processed. If no filter expression is specified, all records in the set or namespace are processed.&lt;/p&gt;

&lt;p&gt;A primary index or set index query can be performed in sync or async mode. In the sync mode, the application thread is blocked until the results are returned, whereas in the async mode, the application thread is not blocked, and the results are returned in a callback.&lt;/p&gt;

&lt;p&gt;Code examples of queries using a filter expression can be found &lt;a href="https://developer.aerospike.com/client/usage/queries/basic/primary"&gt;here&lt;/a&gt; and &lt;a href="https://developer.aerospike.com/tutorials/java/sql_select#scan-based-on-expression-filter"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Secondary Indexes
&lt;/h2&gt;

&lt;p&gt;A secondary index can be optionally defined to speed up processing of queries. &lt;/p&gt;

&lt;h3&gt;
  
  
  Mapping Bin or CDT Values to Records
&lt;/h3&gt;

&lt;p&gt;A secondary index is defined on a bin (column) or on elements at any level of a Collection Data Type (CDT, that is, a List or a Map), over integer, string, or geospatial values. A secondary index maps a value to one or more records; it does not locate the value within a record (such as a specific bin or CDT element). When a secondary index is defined on a CDT, all CDT values of the indexed type map to the record. For example, a secondary index on the List [1,2,3] in record R will have the mappings 1-&amp;gt;R, 2-&amp;gt;R, and 3-&amp;gt;R.&lt;/p&gt;

&lt;p&gt;A secondary index is created on a set (table) of records. In Aerospike Database 6.1+, a secondary index created with a null set (or no set parameter) encompasses all records in the namespace. In earlier versions, it would span only the records that were created with a null set parameter. In 6.1+, a secondary index cannot be created on records that are not in any set, and the best practice recommendation is to always create a record in a (non-null) set.&lt;/p&gt;

&lt;h3&gt;
  
  
  Indexed Value Types
&lt;/h3&gt;

&lt;p&gt;It is important to note that an index is strongly typed, meaning it holds only values of a specific type: integer, string, or geospatial. A bin or a CDT element in Aerospike, however, is not strongly typed, and can hold a value of any type. An index maps only values of the index type in the bin or CDT element; other values are ignored. For example, an integer index on the List [1, 2, 3, "a", "b", "c"] will index only 1, 2, and 3, and ignore the string elements. A bin or CDT element can have multiple indexes defined to allow queries on different types of values. In this example, a string index must also be created on the same List in order to retrieve, say, records that have the value "c" in the list using a secondary index.&lt;/p&gt;
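&lt;p&gt;To make the typed-index behavior concrete, here is a minimal Python sketch of an integer index over list bins (illustrative only; Aerospike maintains its indexes server-side):&lt;/p&gt;

```python
from collections import defaultdict

def build_integer_index(records):
    """Model of a strongly typed (integer) index: only integer values in the
    indexed list map to the record; values of other types are ignored."""
    index = defaultdict(set)
    for rec_id, values in records.items():
        for v in values:
            if isinstance(v, int) and not isinstance(v, bool):
                index[v].add(rec_id)
    return index
```

An integer index over the List [1, 2, 3, "a", "b", "c"] in record R thus maps only 1, 2, and 3 to R; a separate string index would be needed to find R by "c".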

&lt;h3&gt;
  
  
  Indexing on Custom Values
&lt;/h3&gt;

&lt;p&gt;In some cases, the values to index may not be available in one place, or may not be stored directly in a bin or CDT element. Consider, for example, a specific object field "a" in an array of objects: [{"a": 1, "b": 11}, {"a": 2, "b": 22}, ...]. In such cases, the values can be copied (or computed) and stored in a bin or a CDT that can then be indexed: here, create and index the List a-values: [1, 2, ...]. The indexed bin or CDT must be kept in sync with changes in the underlying values: if the field "a" is updated in any object, the update must be reflected in the a-values list.&lt;/p&gt;
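&lt;p&gt;A minimal Python sketch of this copy-and-sync approach (bin and field names are illustrative, not an Aerospike API):&lt;/p&gt;

```python
def sync_a_values(record):
    """Maintain an indexable a_values list mirroring the 'a' fields of an
    array of objects stored in the record."""
    record["a_values"] = [obj["a"] for obj in record["objects"]]
    return record

rec = {"objects": [{"a": 1, "b": 11}, {"a": 2, "b": 22}]}
sync_a_values(rec)              # a secondary index would be defined on a_values
rec["objects"][0]["a"] = 9      # an update to field "a" ...
sync_a_values(rec)              # ... must be reflected in a_values
```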

&lt;h3&gt;
  
  
  Uniqueness and Order
&lt;/h3&gt;

&lt;p&gt;A secondary index cannot be defined as unique or sorted. That is, the secondary index does not support the uniqueness constraint on the field, although it can be defined on a bin that holds unique values, such as the ssn bin in the earlier example. As explained above, it is up to the application to order query results. Also, composite indexes over multiple bins are currently not directly supported, but can be implemented as described earlier in the section Indexing on Custom Values.&lt;/p&gt;

&lt;h3&gt;
  
  
  Secondary-Index Query
&lt;/h3&gt;

&lt;p&gt;A query using a secondary index is called a &lt;code&gt;secondary-index query&lt;/code&gt;, to be distinguished from a primary-index query. A secondary-index query will fail if the supporting secondary index does not exist.&lt;/p&gt;

&lt;h3&gt;
  
  
  Query Processing
&lt;/h3&gt;

&lt;p&gt;The secondary index lookup identifies the records to process. A secondary-index query may also specify a filter expression, in which case the secondary-index predicate is processed first, the filter expression is evaluated for the resulting records, and the matching records then are processed further. For efficient processing, the most selective available index should be used for the secondary-index predicate and the remaining condition as the filter expression. For example, to find all black Skoda cars in California, a secondary index on manufacturer and not on color should be used, along with a filter expression for black color.&lt;/p&gt;

&lt;p&gt;Code examples of a secondary-index query can be found &lt;a href="https://developer.aerospike.com/client/usage/queries/basic//secondary"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Find additional details on CDT indexing in the blog post &lt;a href="https://developer.aerospike.com/blog/query-json-documents-faster"&gt;Query JSON Document Faster (and More) with CDT Indexing&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pagination
&lt;/h2&gt;

&lt;p&gt;The application can get the results in a stream of smaller chunks by using pagination. Pagination is supported with all types of queries and indexes.&lt;/p&gt;

&lt;p&gt;The chunk size limit is specified in the max-records policy parameter. Note, a smaller number of records may be returned because the chunk size limit is divided evenly across all server nodes, but the data may be unevenly distributed with respect to the query predicate.&lt;/p&gt;

&lt;p&gt;The same query handle is used to get subsequent chunks until all records are retrieved. &lt;/p&gt;

&lt;p&gt;Check out a concrete pagination example and code &lt;a href="https://developer.aerospike.com/client/usage/queries/basic/primary#pagination"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Parallel Query Processing
&lt;/h2&gt;

&lt;p&gt;Aerospike distributes namespace records in 4096 uniform partitions, and allows separate queries over them for parallelism. Queries can be split into independent parallel sub-queries over one or more partitions, for the needed parallelism to match the required throughput. Further, each partition can be subdivided into N sub-partitions by adding the modulo filter expression &lt;code&gt;digest % N == i&lt;/code&gt; for 0 &amp;lt;= i &amp;lt; N. Note, the filter expression evaluation for the sub-partitions is purely metadata based, digest being record metadata. Since record metadata is held in memory, the evaluation requires no access to data on the SSD. A sub-partition only reads its own records, minimizing the necessary SSD reads across the multiple sub-partitions, resulting in maximum parallel throughput. &lt;/p&gt;
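&lt;p&gt;The modulo scheme is easy to sanity-check outside the database. In this Python sketch, plain integers stand in for real 20-byte record digests; the sub-streams it produces are mutually exclusive and collectively exhaustive:&lt;/p&gt;

```python
def sub_partition(digests, n):
    """Split a stream of record digests into n sub-streams using the
    metadata-only predicate digest % n == i, for 0 <= i < n."""
    return [[d for d in digests if d % n == i] for i in range(n)]
```

Every digest lands in exactly one sub-stream, so n workers can each consume one sub-stream without overlap or gaps.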

&lt;p&gt;Using this scheme, a large number of workers on a platform such as Spark (which supports up to 32K workers) can uniformly spread the data among workers for processing via an equal number of mutually exclusive and collectively exhaustive sub-streams using partition queries in combination with the modulo filter expression as described above. The appropriate data scale, throughput, and latency can be achieved by adjusting the cluster size as well as the number of attached SSD devices per node. &lt;/p&gt;

&lt;h2&gt;
  
  
  Processing Using Indexes
&lt;/h2&gt;

&lt;p&gt;In addition to retrieving records, one can perform additional operations on the selected records:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Project (retrieve) specific bins and computed expressions. &lt;/li&gt;
&lt;li&gt;Further processing of selected records with read or write operations. &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The first is not conceptually different from retrieving entire records, and hence we will not discuss it further. The second is discussed below.&lt;/p&gt;

&lt;p&gt;It is worth mentioning that processing multiple records using indexes is different from batch processing, where records are specified by their keys rather than by a predicate. To learn more, refer to the blog post &lt;a href="https://developer.aerospike.com/blog/batch-operations-in-aerospike"&gt;Batch Operations in Aerospike&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Read and Write Operations
&lt;/h3&gt;

&lt;p&gt;In processing operations using indexes, read and write operations cannot be mixed; either only read or only update operations can be specified for processing. &lt;/p&gt;

&lt;p&gt;A record may match multiple times for a given condition when using a collection index type, and there is no guarantee that a record will be de-duplicated for the same value. In such cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The application must be prepared to handle duplication within results.&lt;/li&gt;
&lt;li&gt;Write operations must deal with any duplication appropriately (for example, make them idempotent or include logic to apply the operations only once).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Read operations
&lt;/h3&gt;

&lt;p&gt;Read operations are specified using: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://developer.aerospike.com/tutorials/java/java-intro_to_transactions#operating-on-complex-data-types"&gt;bin (transaction) operations&lt;/a&gt; for efficient access to complex data types such as HyperLogLog, GeoJSON, Blob, List, and Map, or&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://developer.aerospike.com/tutorials/java/sql_aggregates_1"&gt;a stream UDF&lt;/a&gt; for aggregate processing.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Background Updates
&lt;/h3&gt;

&lt;p&gt;Records can also be updated in a “background mode” in conjunction with a query. Such background updates work differently from read operations: the entire operation is processed in the background. The application can only check the status of a background operation, but cannot obtain granular results from it. Any record specific status must be ascertained, and corrected if necessary, separately. Background updates are an efficient way to update a large number of records. &lt;/p&gt;

&lt;p&gt;Note that updates using indexes are not supported in “foreground” sync and async modes like read operations where the application receives record-specific results. &lt;/p&gt;

&lt;p&gt;Update operations are specified using &lt;a href="https://developer.aerospike.com/tutorials/java/sql_update#list-of-bin-updates"&gt;bin (transaction) operations&lt;/a&gt; or &lt;a href="https://developer.aerospike.com/tutorials/java/sql_update#using-udf"&gt;a record UDF&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Find additional code examples &lt;a href="https://developer.aerospike.com/client/usage/queries/background"&gt;here&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;Queries and indexes are important to realize speed at scale. This post describes key aspects of indexes and queries in Aerospike to help developers better understand these capabilities and utilize them effectively.&lt;/p&gt;




</description>
      <category>aerospike</category>
      <category>query</category>
      <category>index</category>
      <category>secondaryindex</category>
    </item>
    <item>
      <title>Building Large-Scale Real-Time JSON Applications</title>
      <dc:creator>Neel Phadnis</dc:creator>
      <pubDate>Tue, 13 Sep 2022 14:58:17 +0000</pubDate>
      <link>https://dev.to/aerospike/building-large-scale-real-time-json-applications-4npo</link>
      <guid>https://dev.to/aerospike/building-large-scale-real-time-json-applications-4npo</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gdk2QiaK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1600/0%2AIGbSx8ooGN8w5rn_" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gdk2QiaK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1600/0%2AIGbSx8ooGN8w5rn_" alt="(Source: Photo by Wilhelm Gunkel on [Unsplash](https://unsplash.com/) )" width="800" height="503"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Source: Photo by Wilhelm Gunkel on &lt;a href="https://unsplash.com/"&gt;Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;“Real-time describes various operations or processes that respond to inputs reliably within a specified time interval (&lt;a href="https://en.wikipedia.org/wiki/Real-time"&gt;Wikipedia&lt;/a&gt;).” &lt;/p&gt;

&lt;p&gt;Real-time data must be processed soon after it is generated otherwise its value is diminished, and real-time applications must respond within a tight timeframe otherwise the user experience and business results are impaired. It is critical for real-time applications to have reliably fast access to all data, real-time or otherwise. &lt;/p&gt;

&lt;p&gt;The number of real-time interactions between people and devices continues to grow. Leveraging real-time data is still a competitive edge in some areas but its use is expected in others. Up-to-the-moment relevant information is expected to be applied in delivering the best possible customer experience or business decisions.&lt;/p&gt;

&lt;p&gt;Much of the data today is generated, transferred, stored, and consumed in the JSON format, including real-time data such as feeds from IoT sensors and social networks, and prior data such as user profiles and product catalogs. JSON data is thus ubiquitous and growing in use. The best possible real-time decisions, increasingly based on AI/ML algorithms, will be arrived at using continually updated massive data sets. &lt;/p&gt;

&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;This article discusses the database perspective on building large-scale real-time JSON applications and touches upon the following key topics: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What to look for in a real-time data platform&lt;/li&gt;
&lt;li&gt;How to organize JSON documents for speed at scale&lt;/li&gt;
&lt;li&gt;The core JSON functionality required for ease of development&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Database for Large-Scale JSON Applications
&lt;/h2&gt;

&lt;p&gt;The key requirements in a database to build such applications are described below, along with how the Aerospike Database delivers them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reliably fast random access at scale
&lt;/h3&gt;

&lt;p&gt;Reliably fast response time for read and write operations at any scale and any read-write workload mix is required to meet the real-time contract. Aerospike delivers it through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fast and uniform hash-based data distribution to all nodes for optimal resource utilization&lt;/li&gt;
&lt;li&gt;Hybrid Memory Architecture (HMA) to store indexes and data in DRAM, SSD, and other devices to provide cost-effective fast storage capacity&lt;/li&gt;
&lt;li&gt;Optimized processing of writes and garbage collection for predictable response&lt;/li&gt;
&lt;li&gt;One-hop access to all data from the application&lt;/li&gt;
&lt;li&gt;Smart Client that handles cluster transitions and data movements transparently&lt;/li&gt;
&lt;li&gt;Primary and secondary indexes for fast access&lt;/li&gt;
&lt;li&gt;Async and background processing modes for greater efficiency&lt;/li&gt;
&lt;li&gt;Multi-op requests to perform many single-record operations atomically in one request&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Fast ingest rate
&lt;/h3&gt;

&lt;p&gt;The database must support fast ingestion speeds so that surges in real-time data feeds do not overwhelm the system or result in data loss.&lt;/p&gt;

&lt;p&gt;In Aerospike Database 6.0+, &lt;a href="https://developer.aerospike.com/blog/batch-operations-in-aerospike_"&gt;batch operations&lt;/a&gt; for read, write, delete, and UDF operations are supported so that ingest can achieve the necessary high throughput.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fast queries
&lt;/h3&gt;

&lt;p&gt;The database must handle concurrent queries over large data efficiently. To this end, Aerospike provides various indexes and granular control over parallel processing of queries. &lt;/p&gt;

&lt;h3&gt;
  
  
  Convenient JSONPath based access
&lt;/h3&gt;

&lt;p&gt;A JSONPath-based Document API offers a convenient way to access and modify specific elements within a document. Aerospike's support for JSON documents in 6.0+ is discussed below.&lt;/p&gt;

&lt;h3&gt;
  
  
  Rich Document Functionality
&lt;/h3&gt;

&lt;p&gt;JSON documents are stored in the database as a &lt;a href="https://docs.aerospike.com/server/guide/data-types/cdt"&gt;Collection Data Type (CDT)&lt;/a&gt;. CDTs are essentially Map and List data types that offer rich functionality to JSON applications.&lt;/p&gt;

&lt;h4&gt;
  
  
  Efficient storage and transfer
&lt;/h4&gt;

&lt;p&gt;CDTs are stored and transferred efficiently in the &lt;code&gt;MessagePack&lt;/code&gt; format.&lt;/p&gt;

&lt;h4&gt;
  
  
  Rich API
&lt;/h4&gt;

&lt;p&gt;The API supports many common List and Map usages that involve complex processing. These operations are executed entirely on the server, eliminating the need to retrieve data to the client.&lt;/p&gt;

&lt;h4&gt;
  
  
  Well integrated into other performance features
&lt;/h4&gt;

&lt;p&gt;CDTs are well integrated into various performance features including &lt;a href="https://developer.aerospike.com/tutorials/java/expressions"&gt;Expressions&lt;/a&gt;, batch requests, multi-op requests, and secondary indexes. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CDT operations can be used in Expressions that offer efficient server side execution. &lt;/li&gt;
&lt;li&gt;Batch requests allow operations on multiple documents in one request. &lt;/li&gt;
&lt;li&gt;A multi-op request allows many operations on one document to be performed in one request. For instance, in the same request, you can add items to a JSON array, sort it, and get its new size and the top N items in it. &lt;/li&gt;
&lt;li&gt;CDT elements at any nested level can be indexed for fast and convenient access, described further below.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Synchronizing data with other systems
&lt;/h3&gt;

&lt;p&gt;Aerospike offers control over replicating all or a subset of the data efficiently to other Aerospike clusters through &lt;a href="https://aerospike.com/resources/tech-videos/xdr/"&gt;Cross-Data-Center Replication (XDR)&lt;/a&gt;. Edge-core synchronization is often necessary for collecting real-time data as well as delivering a real-time user experience at the edge. Various connectors facilitate convenient and fast synchronization with other systems, as described below.&lt;/p&gt;

&lt;h3&gt;
  
  
  Easy integration with real-time data streams
&lt;/h3&gt;

&lt;p&gt;Aerospike provides &lt;a href="https://docs.aerospike.com/connect"&gt;streaming connectors&lt;/a&gt; to integrate with standard streaming platforms such as Kafka, Pulsar, and JMS, and also allows change-data-capture (CDC) streams to be delivered to any HTTP endpoint.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fast access from data processing and analytics platforms
&lt;/h3&gt;

&lt;p&gt;The Aerospike &lt;a href="https://docs.aerospike.com/connect/data-processing"&gt;Spark&lt;/a&gt; and &lt;a href="https://docs.aerospike.com/connect/data-access"&gt;Presto (Trino)&lt;/a&gt; connectors enable analytics, AI/ML, and other processing on the respective platforms.&lt;/p&gt;

&lt;h2&gt;
  
  
  Organizing for Scale and Speed
&lt;/h2&gt;

&lt;p&gt;A critical part of building large-scale JSON applications is ensuring that JSON objects are organized efficiently in the database for optimal storage and access. &lt;/p&gt;

&lt;p&gt;Documents may be organized in Aerospike in one or more dedicated sets, over one or more namespaces, to reflect ingest, access, and removal patterns. Multiple documents may be grouped and stored in one record, either in separate bins (columns) or as sub-documents in a container group document. Record keys are constructed as a combination of a collection-id and a group-id to provide fast logical access as well as group-oriented enumeration of documents. For example, the ticker data for a stock can be organized in multiple records whose keys consist of the stock symbol (collection-id) + date (group-id).&lt;/p&gt;

&lt;p&gt;Multiple documents can be accessed using a scan with a filter expression, a query on a secondary index, or both. A filter expression tests values and properties of the elements in the JSON, for example, an array larger than a certain size, or a certain value being present in a sub-tree. A secondary index defined on a basic or collection type provides fast value-based queries, as described below.&lt;/p&gt;
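&lt;p&gt;As a rough sketch of the key scheme above, a record's user key can be composed from the collection-id and group-id. The delimiter and helper below are illustrative assumptions, not part of the Aerospike API:&lt;/p&gt;

```java
// Hypothetical helper for composing record user keys from a collection-id
// and a group-id, per the ticker-data example (stock symbol + date).
class DocKeys {
    // The ":" delimiter is an assumption; any scheme that keeps the
    // collection-id prefix intact supports group-oriented enumeration.
    static String composeUserKey(String collectionId, String groupId) {
        return collectionId + ":" + groupId;
    }

    public static void main(String[] args) {
        // All ticker records for a symbol share its prefix.
        System.out.println(composeUserKey("AAPL", "2023-01-17")); // AAPL:2023-01-17
    }
}
```

&lt;p&gt;In the Java client, such a string would typically be supplied as the user key when constructing the record's &lt;code&gt;Key(namespace, set, userKey)&lt;/code&gt;.&lt;/p&gt;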

&lt;h2&gt;
  
  
  Example: Real-Time Events Data
&lt;/h2&gt;

&lt;p&gt;Real-time event streams can be ingested and stored in Aerospike as JSON documents. To allow access by event-id as well as by timestamp, they can be organized as follows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Record key:(namespace, set, &amp;lt;event_id&amp;gt;)
JSON bin:
{ 
    id: &amp;lt;event-id&amp;gt;,
    timestamp: &amp;lt;ts&amp;gt;,
    … 
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Accessing a document by its event-id is a simple record lookup, because the event-id is incorporated in the record key. Exact-match and range queries on the timestamp are possible by defining an integer index on that field. &lt;/p&gt;

&lt;p&gt;For greater scalability, multiple event objects can be grouped in a single document:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Record key:(namespace, set, &amp;lt;group-id&amp;gt;)
JSON bin:
{
    events: [ 
        {
            id: &amp;lt;group-id, event-num&amp;gt;,
            timestamp: &amp;lt;ts&amp;gt;,
            … 
        }, {
        …
        }
    ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The event-id &lt;code&gt;id&lt;/code&gt; contains the group-id and an event-num that is unique within the group. The group-id, which identifies the record, can be a time-period identifier covering all events in the record, such as the day, week, or month in the year, or another logical identifier shared by all events in the record, such as a sensor-id. To access an event directly by its event-id, the group-id is extracted from the event-id, the record is accessed by the group-id, and a JSONPath query is then issued on the matching &lt;code&gt;id&lt;/code&gt; field. Exact-match and range queries on the timestamp can be performed by creating an integer index on the respective fields in the record.&lt;/p&gt;
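&lt;p&gt;The event-id parsing step can be sketched in plain Java. The &lt;code&gt;&amp;lt;group-id&amp;gt;-&amp;lt;event-num&amp;gt;&lt;/code&gt; format and helper names below are illustrative assumptions:&lt;/p&gt;

```java
// Illustrative sketch (not Aerospike API): recover the record's group-id
// from an event-id that embeds it, assuming the format "<group-id>-<event-num>".
class EventIds {
    static String groupId(String eventId) {
        return eventId.substring(0, eventId.lastIndexOf('-'));
    }

    static int eventNum(String eventId) {
        return Integer.parseInt(eventId.substring(eventId.lastIndexOf('-') + 1));
    }

    public static void main(String[] args) {
        String eventId = "sensor17-0042";
        // The record is looked up by the extracted group-id; a JSONPath
        // query on the matching "id" field then selects the event.
        System.out.println(groupId(eventId));  // sensor17
        System.out.println(eventNum(eventId)); // 42
    }
}
```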

&lt;p&gt;Review the blog posts &lt;a href="https://developer.aerospike.com/blog/aerospike-time-series-api"&gt;Aerospike Time Series API&lt;/a&gt; and &lt;a href="https://developer.aerospike.com/blog/data-modeling-for-speed-at-scale-part-2"&gt;Data Modeling for Speed-At-Scale (Part 2)&lt;/a&gt; for further discussion on organizing JSON documents.&lt;/p&gt;

&lt;h2&gt;
  
  
  JSON Support in Aerospike
&lt;/h2&gt;

&lt;p&gt;Aerospike announced support for JSON documents in Database 6.0. The Aerospike Document API provides CRUD operations on a JSON document at locations indicated by a JSONPath. Below are some snippets of the Document API. &lt;/p&gt;

&lt;p&gt;More details on the Document API can be found in the &lt;a href="https://github.com/aerospike/aerospike-document-lib"&gt;GitHub repo&lt;/a&gt;, &lt;a href="https://developer.aerospike.com/tutorials/java/doc_api"&gt;tutorial&lt;/a&gt;, and &lt;a href="https://medium.com/aerospike-developer-blog/aerospike-document-api-jsonpath-queries-bd6260b2d076"&gt;blog post&lt;/a&gt;. &lt;/p&gt;

&lt;h3&gt;
  
  
  Store a JSON file in the database
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Initialize the DocumentClient from AerospikeClient
AerospikeClient aerospikeClient = new AerospikeClient(cPolicy, seedHost, port);
AerospikeDocumentClient documentClient = new AerospikeDocumentClient(aerospikeClient);

// Read the JSON document into a string.
String jsonString = FileUtils.readFileToString(new File(JsonFilePath));

// Convert JSON string to a JsonNode
JsonNode jsonNode = JsonConverters.convertStringToJsonNode(jsonString);

// Add the document to the database
documentClient.put(recordKey, documentBinName, jsonNode);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Get document elements by JSONPath
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Read an element by path
Object docObject = documentClient.get(recordKey, documentBinName, "$.path.to.the.element");
Object anotherDocObject = documentClient.get(recordKey, documentBinName, "$.path.to.array[index]");

// Get instances of a field from array elements
Object fieldFromArray = documentClient.get(recordKey, documentBinName, "$.path.to.array[*].field");
// Get all instances of a field anywhere in the document (deep scan)
Object fieldAnywhere = documentClient.get(recordKey, documentBinName, "$..field");
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Query JSON documents
&lt;/h3&gt;

&lt;p&gt;JSON documents can be indexed for fast queries. In &lt;a href="https://aerospike.com/blog/query-and-data-distribution/"&gt;Aerospike Database 6.1&lt;/a&gt;+, any JSON element may be indexed to support exact match or range queries.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;client.createIndex(policy,namespace,set,indexName,documentBinName,
                indexType, collectionType, contextPath);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A query can be issued using different filters depending on the index type - either a basic type (string or integer) or a collection type (List, MapKeys, MapValues):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Filter filter = Filter.range(documentBinName, fromValue, toValue, contextPath));
Filter filter2 = Filter.contains(documentBinName, collectionType,
                value, contextPath));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In Aerospike Database 6.0+, parallel partition-grained secondary index queries are available to boost throughput in large-scale applications. &lt;/p&gt;

&lt;p&gt;Find more details on indexing JSON documents in the blog post &lt;a href="https://developer.aerospike.com/blog/query-json-documents-faster"&gt;Query JSON Documents Faster&lt;/a&gt; and code examples in the tutorial on &lt;a href="https://developer.aerospike.com/tutorials/java/cdt_indexing"&gt;CDT Indexing&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Find, run, and modify working examples, and also run your own code, in the &lt;a href="https://developer.aerospike.com/tutorials/sandbox"&gt;code sandbox&lt;/a&gt; from your browser. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;. . . .&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Real-time large-scale JSON applications need reliably fast access to data, high ingest rates, powerful queries, rich document functionality, scalability with no practical limit, always-on operation, and integration with streaming and analytical platforms. They need all this at low cost. The &lt;a href="https://aerospike.com/"&gt;Aerospike Real-time Data Platform&lt;/a&gt; provides all this functionality, making it a good choice for building such applications. The Collection Data Types (CDTs) in Aerospike provide powerful support for modeling, organizing, and querying a large JSON document store. Visit the &lt;a href="https://developer.aerospike.com/tutorials"&gt;tutorials&lt;/a&gt; and &lt;a href="https://developer.aerospike.com/tutorials/sandbox"&gt;code sandbox&lt;/a&gt; on the &lt;a href="https://developer.aerospike.com/"&gt;Developer Hub&lt;/a&gt; to explore the capabilities of the platform, and play with the Document API and query capabilities for JSON.&lt;/p&gt;

&lt;h3&gt;
  
  
  Related Links:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://developer.aerospike.com/tutorials/java/cdt_indexing"&gt;CDT Indexing&lt;/a&gt; (interactive tutorial)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.aerospike.com/server/guide/data-types/cdt"&gt;Collection Data Types (CDTs)&lt;/a&gt; (documentation) &lt;/li&gt;
&lt;li&gt;&lt;a href="https://developer.aerospike.com/"&gt;Aerospike Sandbox&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developer.aerospike.com/tutorials/"&gt;Aerospike Tutorials&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developer.aerospike.com"&gt;Aerospike Developer Hub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aerospike.com/"&gt;Aerospike Documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aerospike</category>
      <category>json</category>
      <category>document</category>
      <category>realtime</category>
    </item>
    <item>
      <title>Query JSON Documents Faster (and More) with New CDT Indexing</title>
      <dc:creator>Neel Phadnis</dc:creator>
      <pubDate>Tue, 13 Sep 2022 14:58:06 +0000</pubDate>
      <link>https://dev.to/aerospike/query-json-documents-faster-and-more-with-new-cdt-indexing-1kk2</link>
      <guid>https://dev.to/aerospike/query-json-documents-faster-and-more-with-new-cdt-indexing-1kk2</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1JYcNGCM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1600/0%2Ac6r7hAWc7yVNPttX" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1JYcNGCM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1600/0%2Ac6r7hAWc7yVNPttX" alt="(Source: Photo by Cameron Ballard on [Unsplash](https://unsplash.com/) )" width="800" height="533"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Source: Photo by Cameron Ballard on &lt;a href="https://unsplash.com/"&gt;Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://docs.aerospike.com/server/guide/data-types/cdt"&gt;Collection Data Types (CDTs)&lt;/a&gt; in Aerospike are List and Map. They offer powerful capabilities to model and access your data for speed-at-scale. A major use of the CDTs is to store and process JSON documents efficiently. In the recent Aerospike Database 6.1 release, secondary index capabilities over the CDTs have been enhanced, making the CDTs even more useful and powerful for JSON documents in addition to other uses.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding Context-Path and Path Specifiers
&lt;/h2&gt;

&lt;p&gt;A CDT element is identified by its &lt;code&gt;context-path&lt;/code&gt;. A CDT element’s context-path is defined as the path from the root to the element. &lt;/p&gt;

&lt;p&gt;A context-path is very similar to &lt;a href="https://goessner.net/articles/JsonPath/"&gt;JSONPath&lt;/a&gt; in a JSON document, but differs from JSONPath in some respects. Like JSONPath, a context-path describes the path from the root or the top level of the CDT to a nested element. &lt;/p&gt;

&lt;p&gt;While a node in a JSONPath is an index in an array or a field in a map (object), each node in a context-path uniquely identifies an element at that level by one of the following specifiers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Index (physical position, 0 indexed)&lt;/li&gt;
&lt;li&gt;Key (applicable only to a Map)&lt;/li&gt;
&lt;li&gt;Rank (relative value position, with 0 being the lowest and -1 being the highest)&lt;/li&gt;
&lt;li&gt;Value (the first element with that value)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So a context path is a concatenation of specifiers that identify path nodes. For example, consider a Map:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;{"k1": 1, "k2": 2, "k3": [11, 12, 13], "k4": {"k11": 11, "k22": 22}}&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The context path for the value &lt;code&gt;22&lt;/code&gt; is: &lt;code&gt;By-Key("k4"), By-Key("k22")&lt;/code&gt; or &lt;code&gt;By-Key("k4"), By-Value(22)&lt;/code&gt;. The JSONPath for it is: &lt;code&gt;$.k4.k22&lt;/code&gt;.&lt;/p&gt;
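&lt;p&gt;To make by-key navigation concrete, here is a plain-Java sketch (no Aerospike client required) that walks the Map above one key per level, mirroring the context path &lt;code&gt;By-Key("k4"), By-Key("k22")&lt;/code&gt;:&lt;/p&gt;

```java
import java.util.List;
import java.util.Map;

// Plain-Java model of descending a nested Map by a sequence of By-Key
// specifiers; each key resolves exactly one element at its level.
class ContextPathDemo {
    static final Map<String, Object> DOC = Map.of(
            "k1", 1, "k2", 2,
            "k3", List.of(11, 12, 13),
            "k4", Map.of("k11", 11, "k22", 22));

    @SuppressWarnings("unchecked")
    static Object byKeys(Map<String, Object> root, String... keys) {
        Object node = root;
        for (String k : keys) {
            node = ((Map<String, Object>) node).get(k); // one specifier per level
        }
        return node;
    }

    public static void main(String[] args) {
        System.out.println(byKeys(DOC, "k4", "k22")); // 22
    }
}
```

&lt;p&gt;In the Aerospike Java client, the same path would be expressed with CDT context objects (e.g., &lt;code&gt;CTX.mapKey(...)&lt;/code&gt;) and evaluated on the server rather than by client-side navigation.&lt;/p&gt;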

&lt;p&gt;Consider another nested object represented as a Map at the top level (level 0): it has a List at level 1, and a Map at level 2.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Object = {  “id1”: [ {“a”: 1, “b”: 2}, {“c”: 3, “d”: 4} ],
            “id2”: [ {“e”: 5, “f”: 6}, {“g”: 7, “h”: 8}] }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A context path to the nested element &lt;code&gt;"c"&lt;/code&gt; can look like: &lt;code&gt;By-Key("id1"), By-Index(1), By-Key("c")&lt;/code&gt;. &lt;br&gt;
The JSONPath of &lt;code&gt;"c"&lt;/code&gt; in the corresponding JSON document looks very similar: &lt;code&gt;$.id1[1].c&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Note, however, that &lt;code&gt;"c"&lt;/code&gt; can be reached using other context-paths, such as &lt;code&gt;By-Index(0), By-Rank(1), By-Value(3)&lt;/code&gt;. There are no alternative contiguous JSONPaths to &lt;code&gt;"c"&lt;/code&gt;. Also, a JSONPath can skip a node and can point to more than one element in a JSON document. For example, &lt;code&gt;$.id1..c&lt;/code&gt; will point to all &lt;code&gt;"c"&lt;/code&gt; nodes below &lt;code&gt;"id1"&lt;/code&gt;. A context-path does not allow an interim node to be skipped and cannot point to more than one element in a CDT. So a similar construct &lt;code&gt;By-Key("id1")..By-Key("c")&lt;/code&gt; is not supported.&lt;/p&gt;
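&lt;p&gt;The difference can be illustrated with a small plain-Java sketch: a recursive deep scan emulating &lt;code&gt;$.id1..c&lt;/code&gt; may collect several matches, whereas a context path always resolves to at most one element:&lt;/p&gt;

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Emulates a JSONPath deep scan ($..field): collect every value stored
// under the given field name anywhere below the starting node.
class DeepScan {
    static void collect(Object node, String field, List<Object> out) {
        if (node instanceof Map) {
            Map<?, ?> m = (Map<?, ?>) node;
            Object hit = m.get(field);
            if (hit != null) out.add(hit);                // match at this level
            for (Object v : m.values()) collect(v, field, out);
        } else if (node instanceof List) {
            for (Object v : (List<?>) node) collect(v, field, out);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> doc = Map.of(
                "id1", List.of(Map.of("a", 1, "b", 2), Map.of("c", 3, "d", 4)),
                "id2", List.of(Map.of("e", 5, "f", 6), Map.of("g", 7, "h", 8)));
        List<Object> out = new ArrayList<>();
        collect(doc.get("id1"), "c", out); // emulates $.id1..c
        System.out.println(out); // [3]
    }
}
```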
&lt;h2&gt;
  
  
  New CDT Indexing Capabilities
&lt;/h2&gt;

&lt;p&gt;In a nutshell, in Aerospike Database 6.1 and later, any CDT element can be indexed, irrespective of its nesting level. Specifically, there are two main new capabilities to highlight:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Elements in a CDT can now be indexed based on their "index" (meaning the physical position of the element in the CDT), key, rank, or value. So it is now possible to create a secondary index, say, on the element at rank -1 (the highest value) of a List, so that equality and range queries on the highest value in the List across records can be executed efficiently. For example, retrieve all users that have a personal best score greater than 100 in their lifetime scores List.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;An embedded List or Map can now be indexed. So in a List bin with value &lt;code&gt;[1, 2, 3, "s1", "s2", "s3", [11, 12, 13]]&lt;/code&gt;, the embedded List &lt;code&gt;[11, 12, 13]&lt;/code&gt; can now be indexed. Many complex objects, especially JSON documents, have deep hierarchies. Now, the elements below the top level can be indexed and queried on efficiently. So in a Map bin &lt;code&gt;{"k1": 1, "k2": 2, "k3": [11, 12, 13], "k4": {"k11": 11, "k22": 22}}&lt;/code&gt;, a secondary index can be created on the List &lt;code&gt;"k3"&lt;/code&gt;, the Map &lt;code&gt;"k4"&lt;/code&gt;, as well as all the elements within them. For example, this makes it possible to retrieve all records with a value 31 in the &lt;code&gt;"k3"&lt;/code&gt; list or records with the value of &lt;code&gt;"k11"&lt;/code&gt; in the range 10-20.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The following types of secondary indexes can be created on a CDT element. Note that since a given CDT element can hold a value of any type, only values of the specified type are indexed.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Non-collection type: Index an element for values of one of the following types. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Integer&lt;/li&gt;
&lt;li&gt;String&lt;/li&gt;
&lt;li&gt;Geospatial&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;List collection type: Index List values of an element. All values within a List of one of the following types are indexed.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Integer List values&lt;/li&gt;
&lt;li&gt;String List values&lt;/li&gt;
&lt;li&gt;Geospatial List values&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Map collection type: Index Map values of an element. All Map keys or Map values of one of the following types are indexed.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Integer Map keys&lt;/li&gt;
&lt;li&gt;String Map keys&lt;/li&gt;
&lt;li&gt;Integer Map values&lt;/li&gt;
&lt;li&gt;String Map values&lt;/li&gt;
&lt;li&gt;Geospatial Map values&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Also new in 6.1 is the ability to index all &lt;code&gt;namespace&lt;/code&gt; records, in addition to the previously supported &lt;code&gt;set&lt;/code&gt;-specific indexing. Thus, an index can be defined on a CDT element across the namespace for querying records from the entire namespace.&lt;/p&gt;
&lt;h2&gt;
  
  
  Many Indexes on Same Element
&lt;/h2&gt;

&lt;p&gt;In many cases, multiple indexes are needed on the same CDT element.&lt;/p&gt;
&lt;h3&gt;
  
  
  Multiple Types
&lt;/h3&gt;

&lt;p&gt;Multiple indexes of different types are allowed on the same context-path in order to index the respective data type values. &lt;/p&gt;

&lt;p&gt;As CDTs do not conform to a schema, an element can be of any type. A secondary index is defined for values of a specific type, and only considers values of that type; values of other types at that context path are simply ignored. Thus, an integer index at a Map key or List rank only indexes integer values at that path. So in a record with the List &lt;code&gt;[1, 2, 3, "s1", "s2", "s3"]&lt;/code&gt;, in order to select the record with an equality query on &lt;code&gt;"s3"&lt;/code&gt;, a string index must exist either on all string values in the List or on a specific string element identified by index, rank, or value in the List.&lt;/p&gt;
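&lt;p&gt;A plain-Java sketch of this type-filtering rule (illustrative only; in practice the server applies it while building the index):&lt;/p&gt;

```java
import java.util.List;
import java.util.stream.Collectors;

// Models which entries of a mixed-type List a string index would consider:
// values of other types at the indexed path are ignored.
class TypeFilteredIndex {
    static List<String> stringEntries(List<Object> bin) {
        return bin.stream()
                .filter(v -> v instanceof String) // a string index skips non-strings
                .map(v -> (String) v)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Object> bin = List.of(1, 2, 3, "s1", "s2", "s3");
        System.out.println(stringEntries(bin)); // [s1, s2, s3]
    }
}
```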

&lt;p&gt;JSON documents are saved as CDTs, but are simpler: they have single-type values in a List (numeric or string) and Map keys (string), and therefore need only one index value type on these collection types. Map values, however, can be mixed (string or numeric).&lt;/p&gt;
&lt;h3&gt;
  
  
  Multiple Paths
&lt;/h3&gt;

&lt;p&gt;The same element in a CDT can be arrived at in multiple ways via different context paths using different specifiers. For example, an element at Map index X may also be at rank Y and have key Z. Multiple indexes of the same value type are therefore possible on the same static element. However, they are semantically different.&lt;/p&gt;

&lt;p&gt;Important: As values in a CDT change, the context paths that were pointing to the same element may no longer point to one or the same element. Queries based on indexes defined using different context paths pointing to the same static element in one CDT can yield different results. The application should define the indexes and queries carefully with data and application semantics in mind.&lt;/p&gt;
&lt;h2&gt;
  
  
  Querying Values Across Multiple Elements
&lt;/h2&gt;

&lt;p&gt;In some cases, a List or Map is not readily available for indexing, as the elements may be distributed in multiple places in the CDT. An example is a List of Maps, where we want to find out if a specific key in any Map has a specific value. Such an array (List) of objects (Maps) is a common occurrence in JSON documents, and querying a specific object field for a value may be necessary. For a concrete example, consider the JSON document of Nobel laureates in a year and category:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "year" : "2021",
  "category" : "chemistry",
  "laureates" : [ {
     "id" : "1002",
     "name" : "Benjamin List"
    }, {
     "id" : "1003",
     "name" : "David MacMillan"
  } ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In order to issue the query “find the Nobel prize(s) by the winner’s name”, the &lt;code&gt;name&lt;/code&gt; field in all objects in the &lt;code&gt;laureates&lt;/code&gt; List needs to be indexed across such records. &lt;/p&gt;

&lt;p&gt;While such a collection type is not directly supported for indexing, the following solution can be implemented: Create a separate List of &lt;code&gt;name&lt;/code&gt;s of  &lt;code&gt;laureates&lt;/code&gt; in the record, and index that List. &lt;/p&gt;

&lt;p&gt;&lt;code&gt;laureate_names_in_record: ["Benjamin List", "David MacMillan", ..]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;It is then possible to find the Nobel prize(s) by the winner's name. &lt;/p&gt;

&lt;p&gt;For data that does not change, such as the names of Nobel laureates in the above example, this is relatively straightforward. In order to index values in a record that can change, the indexed List must be kept in sync with the changing values. Depending on where the values are in the CDT and how values are updated, it may be possible to update a value and its index List entry together atomically using the multi-op CDT operations.&lt;/p&gt;
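&lt;p&gt;Building the auxiliary List can be sketched in plain Java (the real update would write this List into its own bin and index it; the class and method names below are illustrative):&lt;/p&gt;

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Derives the laureate_names_in_record List from the laureates array,
// so that an index on it can answer "find the prize by winner's name".
class LaureateNames {
    static List<String> namesOf(List<Map<String, Object>> laureates) {
        return laureates.stream()
                .map(m -> (String) m.get("name")) // pull the queried field out of each object
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Map<String, Object>> laureates = List.of(
                Map.of("id", "1002", "name", "Benjamin List"),
                Map.of("id", "1003", "name", "David MacMillan"));
        System.out.println(namesOf(laureates)); // [Benjamin List, David MacMillan]
    }
}
```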

&lt;h2&gt;
  
  
  Examples
&lt;/h2&gt;

&lt;p&gt;To make this clearer, we will walk through some examples. You can follow along in the interactive notebook &lt;a href="https://developer.aerospike.com/tutorials/java/cdt_indexing"&gt;CDT Indexing&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Examples are shown for two categories of CDT index below:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Non-collection index&lt;/li&gt;
&lt;li&gt;Collection index

&lt;ul&gt;
&lt;li&gt;LIST (all values in the List) for a List, or&lt;/li&gt;
&lt;li&gt;Either MAP_KEYS (all keys) or MAP_VALUES (all values) for a Map&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The CDT model is a superset of JSON, and therefore the examples described in CDT terminology below are also applicable to JSON documents.&lt;/p&gt;

&lt;h3&gt;
  
  
  Non-collection Index
&lt;/h3&gt;

&lt;p&gt;A non-collection index supports equality queries on integer and string values with &lt;code&gt;equal&lt;/code&gt; filter, and range queries on integer values with the &lt;code&gt;range&lt;/code&gt; filter.&lt;/p&gt;

&lt;h4&gt;
  
  
  Equality queries
&lt;/h4&gt;

&lt;p&gt;Get records with a specific integer or string value:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;At a specific index or rank position of a List or a Map

&lt;ul&gt;
&lt;li&gt;records with 100/“ABC” at index 0 of a list or a map&lt;/li&gt;
&lt;li&gt;records with 100/“ABC” at rank -1 (highest value) of a list or a map&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;At a specific key position of a Map 

&lt;ul&gt;
&lt;li&gt;records with 100/“ABC” at key XYZ of a map&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Range queries
&lt;/h4&gt;

&lt;p&gt;Range queries are supported on integer values only. &lt;/p&gt;

&lt;p&gt;Get records having an integer value within a range:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;At a specific index or rank position of a List or a Map

&lt;ul&gt;
&lt;li&gt;records with value in range 1-100 at index 0 of a list or a map&lt;/li&gt;
&lt;li&gt;records with value in range 1-100 at rank -1 (highest value) of a list or a map&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;At a specific key position of a Map 

&lt;ul&gt;
&lt;li&gt;records with value in range 1-100 at key XYZ of a map&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Collection Index
&lt;/h3&gt;

&lt;p&gt;A collection index supports equality queries on integer and string values with the &lt;code&gt;contains&lt;/code&gt; filter, and range queries on integer values with the &lt;code&gt;range&lt;/code&gt; filter. The collection type is &lt;code&gt;LIST&lt;/code&gt;, &lt;code&gt;MAP_KEYS&lt;/code&gt;, or &lt;code&gt;MAP_VALUES&lt;/code&gt;  in the &lt;code&gt;createIndex&lt;/code&gt; Java API.&lt;/p&gt;

&lt;h4&gt;
  
  
  Equality queries
&lt;/h4&gt;

&lt;p&gt;Get records with a specific integer or string value. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In a List

&lt;ul&gt;
&lt;li&gt;records with a list containing 100/“ABC” &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;In a Map’s keys

&lt;ul&gt;
&lt;li&gt;records with a map containing 100/“ABC” as a key&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;In a Map’s values

&lt;ul&gt;
&lt;li&gt;records with a map containing 100/“ABC” as a value&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Range queries
&lt;/h4&gt;

&lt;p&gt;Range queries are supported on integer values only. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In a List

&lt;ul&gt;
&lt;li&gt;records with a list containing a value in range 1-100 &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;In a Map’s keys

&lt;ul&gt;
&lt;li&gt;records with a map containing a key in range 1-100&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;In a Map’s values

&lt;ul&gt;
&lt;li&gt;records with a map containing a value in range 1-100 &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Takeaways
&lt;/h2&gt;

&lt;p&gt;In Aerospike Database 6.1+, any CDT element can be indexed, irrespective of its nesting level. Elements in a CDT can also be indexed based on their position, key, rank, or value. A CDT element can have multiple context-paths with different semantics, and therefore the application should carefully determine the correct context-path while defining an index, with data and application semantics in mind. When values that need to be indexed are not available in one List or Map, consider replicating the values in a separate List for defining an index, and keep the indexed List in sync with the values. View and work on the examples in the notebook &lt;a href="https://developer.aerospike.com/tutorials/java/cdt_indexing"&gt;CDT Indexing&lt;/a&gt; and the &lt;a href="https://developer.aerospike.com/"&gt;Aerospike Sandbox&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Related Links:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://developer.aerospike.com/tutorials/java/cdt_indexing"&gt;CDT Indexing&lt;/a&gt; (interactive tutorial)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.aerospike.com/server/guide/data-types/cdt"&gt;Collection Data Types (CDTs)&lt;/a&gt; (documentation) &lt;/li&gt;
&lt;li&gt;
&lt;a href="https://developer.aerospike.com/tutorials/java/cdt"&gt;Collection Data Types (CDTs)&lt;/a&gt; (interactive tutorials)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developer.aerospike.com/"&gt;Aerospike Sandbox&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developer.aerospike.com/tutorials/"&gt;Aerospike Tutorials&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developer.aerospike.com"&gt;Aerospike Developer Hub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aerospike.com/"&gt;Aerospike Documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aerospike</category>
      <category>json</category>
      <category>cdt</category>
      <category>secondaryindex</category>
    </item>
    <item>
      <title>Aerospike Through SQL</title>
      <dc:creator>Neel Phadnis</dc:creator>
      <pubDate>Tue, 16 Aug 2022 15:57:30 +0000</pubDate>
      <link>https://dev.to/aerospike/aerospike-through-sql-1o4d</link>
      <guid>https://dev.to/aerospike/aerospike-through-sql-1o4d</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--t3B6qFa8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1600/0%2A8f9iXuwsmmzqrG_F" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--t3B6qFa8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1600/0%2A8f9iXuwsmmzqrG_F" alt="(Source: Photo by Alex wong on Unsplash [Unsplash](https://unsplash.com/) )" width="800" height="533"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Source: Photo by Alex wong on &lt;a href="https://unsplash.com/"&gt;Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;SQL is broadly used as a data access language for analytics. Even if you are an application developer, chances are you have used it or at least are familiar with it. &lt;/p&gt;

&lt;p&gt;Aerospike has broad support for SQL, enabling you to use SQL to access Aerospike data in multiple ways.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trino
&lt;/h2&gt;

&lt;p&gt;For analytics, you can access Aerospike data on &lt;a href="https://trino.io/"&gt;Trino&lt;/a&gt; with the Aerospike Trino Connector.&lt;/p&gt;

&lt;p&gt;Through Trino, analytics use cases such as ad-hoc SQL queries, reports, and dashboards have access to data in one or more Aerospike clusters, and they can also merge Aerospike data with data from other sources.&lt;/p&gt;

&lt;p&gt;For more details of the Trino Connector, see the blog posts &lt;a href="https://developer.aerospike.com/blog/deploy-aerospike-and-trino-based-analytics-platform-using-docker"&gt;Deploy Aerospike and Trino based analytics platform using Docker&lt;/a&gt; and &lt;a href="https://developer.aerospike.com/blog/aerospike-trino-connector-chapter-two"&gt;Aerospike Trino Connector - Chapter Two&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Starburst is a SQL-based MPP query engine based on Trino that enables you to run Trino on a single machine or a cluster of machines, on-prem or in the cloud. The blog post &lt;a href="https://developer.aerospike.com/blog/analyze-data-with-aerospike-and-starburst-anywhere"&gt;Analyze Data with Aerospike and Starburst Anywhere&lt;/a&gt; describes how to use &lt;a href="https://www.starburst.io"&gt;Starburst Enterprise&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;The data browser described in the blog post &lt;a href="https://developer.aerospike.com/blog/aerospike-data-browser"&gt;Aerospike Data Browser&lt;/a&gt; uses Trino with the Trino Connector underneath. &lt;/p&gt;

&lt;h2&gt;
  
  
  Spark
&lt;/h2&gt;

&lt;p&gt;You can use Spark SQL to manipulate Aerospike data on the Spark platform. Aerospike Spark Connector provides parallel access to the Aerospike cluster from Spark.&lt;/p&gt;

&lt;p&gt;Spark SQL merges two abstractions: Resilient Distributed Datasets (RDDs) and relational tables. Find examples of importing and storing Aerospike data to and from RDDs in &lt;a href="https://developer.aerospike.com/tutorials/spark"&gt;these Aerospike Spark tutorials&lt;/a&gt;. You can use Spark SQL to manipulate and process data in RDDs.&lt;/p&gt;

&lt;p&gt;More details on the Spark Connector are available in the blog posts &lt;a href="https://medium.com/aerospike-developer-blog/aerospike-is-a-highly-scalable-key-value-database-offering-best-in-class-performance-5922450aaa78"&gt;Using Aerospike Connect for Spark&lt;/a&gt; and &lt;a href="https://developer.aerospike.com/blog/accelerate-spark-queries-with-predicate-pushdown-using-aerospike"&gt;Accelerate Spark queries with Predicate Pushdown using Aerospike&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  JDBC
&lt;/h2&gt;

&lt;p&gt;Application developers can use simple SQL over JDBC with the community-contributed JDBC Connector. &lt;br&gt;
Please read more details in the blog post &lt;a href="https://developer.aerospike.com/blog/introducing-aerospike-jdbc-driver"&gt;Introducing Aerospike JDBC Driver&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;. . . .&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Aerospike API
&lt;/h2&gt;

&lt;p&gt;While the various connectors allow broad SQL access for multiple purposes, they may not be suitable for general applications because they do not expose the full functionality of the Aerospike API. For example, update capabilities are limited through SQL. &lt;/p&gt;

&lt;p&gt;We recommend that you use the Aerospike API to access the database's full functionality and performance. Aerospike, a NoSQL database, does not directly support all SQL features. Conversely, Aerospike has many capabilities that cannot be expressed in SQL. This is to be expected because SQL is designed to provide physical data independence: the user need not worry about physical details of the data such as its distribution, size, selectivity, and indexes. The query optimizer deals with these details and selects the best execution plan. The goal of the Aerospike API, by contrast, is to give developers full control for optimal application performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;. . . .&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This article describes how a developer who is familiar with SQL can quickly implement specific SQL CRUD operations using the Aerospike API. The goal is not to discuss the many mechanisms to control optimal performance (although it points to some of them), but to provide a ramp for a developer who has some knowledge of SQL to map the basic CRUD queries into the Aerospike API. We encourage you to learn about the performance features using the pointers provided.&lt;/p&gt;

&lt;p&gt;While Aerospike supports many languages, we have used the Java client API in our examples as it is most widely used. The functionality is similar across all client libraries, and you can find equivalent functions in each.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;. . . .&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Mapping SQL to Aerospike
&lt;/h2&gt;

&lt;p&gt;While there is no direct mapping of full SQL to the Aerospike API, simple CRUD functionality can be easily mapped because the underlying data models are similar: Aerospike’s set-record-bin organization matches SQL’s table-record-column organization (see below). &lt;/p&gt;

&lt;p&gt;We point out differences and unsupported constructs below. They need to be handled through alternative means such as specific features, libraries, and application code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Similarities
&lt;/h3&gt;

&lt;p&gt;Aerospike has a record-based data model. An Aerospike database holds multiple namespaces, which are equivalent to databases in the relational model. A namespace holds records (rows), organized in sets (tables), that are accessed using a unique key that serves as the record ID. A record can contain one or more bins (columns), and a bin can hold a value of any supported data type. Sets and records do not conform to any schema. The primary index provides fast access to a record by its key, while secondary indexes defined on a bin support content-based access.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;SQL concept&lt;/th&gt;
&lt;th&gt;Aerospike equivalent&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Database or schema&lt;/td&gt;
&lt;td&gt;Namespace&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Table&lt;/td&gt;
&lt;td&gt;Set&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Record&lt;/td&gt;
&lt;td&gt;Record&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Column&lt;/td&gt;
&lt;td&gt;Bin&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Index&lt;/td&gt;
&lt;td&gt;Primary and Secondary indexes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stored Procedures&lt;/td&gt;
&lt;td&gt;User Defined Functions (UDFs)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Differences
&lt;/h3&gt;

&lt;p&gt;Aerospike is a NoSQL database, and its API differs from SQL databases in several ways. Some key differences: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Set: A set is a tag on records that gets created when the first record is inserted in the set. A set is schemaless and can hold records with different bins. &lt;/li&gt;
&lt;li&gt;Record: A record is schemaless, and can hold any combination of bins.&lt;/li&gt;
&lt;li&gt;Bin: A bin is typeless, and can hold a value of any type.
&lt;/li&gt;
&lt;li&gt;Index: Integrity constraints, such as uniqueness, cannot be specified on an index. &lt;/li&gt;
&lt;li&gt;Transactions: All single record requests are transactional. The transaction boundary does not span multiple records. For a detailed discussion, see the blog post &lt;a href="https://aerospike.com/blog/developers-understanding-aerospike-transactions/"&gt;Developers: Understanding Aerospike Transactions&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Constructs Not Directly Supported
&lt;/h3&gt;

&lt;p&gt;Due to the differences in its data and execution models, the Aerospike API does not directly support the following SQL constructs. However, they can be implemented using data modeling, alternative features, and application code, as discussed later.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Join&lt;/li&gt;
&lt;li&gt;Aggregations (max, min, top, average, sum, etc.)&lt;/li&gt;
&lt;li&gt;Order By, Distinct, Union&lt;/li&gt;
&lt;li&gt;Limit&lt;/li&gt;
&lt;li&gt;Constraints: NULL, Foreign Key, Default&lt;/li&gt;
&lt;li&gt;Built-in functions&lt;/li&gt;
&lt;li&gt;View&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;. . . .&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;For the purpose of our discussion, SQL queries can be organized into these categories: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SELECT or read operations, &lt;/li&gt;
&lt;li&gt;CREATE, UPDATE, DELETE or write operations,&lt;/li&gt;
&lt;li&gt;Metadata operations, and&lt;/li&gt;
&lt;li&gt;Other functionality.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use these &lt;a href="https://developer.aerospike.com/tutorials/java/sql-operations"&gt;interactive tutorials&lt;/a&gt; to work along with this text.  &lt;/p&gt;

&lt;h3&gt;
  
  
  A Word on Key, Metadata, Policy, and API Variants
&lt;/h3&gt;

&lt;p&gt;Before we dive in, it is useful to understand the record key, record metadata, operation policies, and API variants. &lt;/p&gt;

&lt;h4&gt;
  
  
  Record Key
&lt;/h4&gt;

&lt;p&gt;Each record is uniquely identified by a key, consisting of the triple (namespace, set, user-key), where user-key is a user-specified id that is unique within the set. The key is hashed into a digest that the server uses to locate the record; the key is returned in all read APIs.&lt;/p&gt;

&lt;h4&gt;
  
  
  Record Metadata
&lt;/h4&gt;

&lt;p&gt;Each record has metadata associated with it: generation (or version) and expiration time (or time-to-live in seconds). This metadata is returned in all read operations. It is possible to retrieve only the metadata without the record's bins through the "getHeader" operation explained below.&lt;/p&gt;

&lt;h4&gt;
  
  
  Policy
&lt;/h4&gt;

&lt;p&gt;Aerospike API calls take a policy parameter that specifies the details of how the request is executed. For example, timeouts, retries, filter expressions, and additional write semantics are specified in the policy object. For each operation below, we note the policy settings that are significant to the operation's semantics.&lt;/p&gt;

&lt;h4&gt;
  
  
  API Variants
&lt;/h4&gt;

&lt;p&gt;The Aerospike API is designed for control and simplicity. As such, a read or write operation that has one form in SQL has multiple variants in the Aerospike API:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;By number of records involved: Single record, batch, and query&lt;/li&gt;
&lt;li&gt;By the processing mode: Sync, async, and background&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the following examples, only the synchronous APIs are shown when available, but you can easily discover the asynchronous variants in the documentation. &lt;/p&gt;

&lt;h2&gt;
  
  
  SQL SELECT and Equivalent Read Operations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Single-Record Read Operations
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Get
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;SQL Query&lt;/th&gt;
&lt;th&gt;Equivalent Aerospike API (Java)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SELECT * FROM namespace.set WHERE id = key&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Record Client::get(Policy policy, Key key)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SELECT bins FROM namespace.set WHERE id = key&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Record Client::get(Policy policy, Key key, String... binNames)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
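&lt;p&gt;To make the mapping concrete, here is a minimal sketch of the single-record reads above using the Java client. The host address, namespace &lt;code&gt;test&lt;/code&gt;, set &lt;code&gt;users&lt;/code&gt;, and bin names are assumptions for illustration, not part of any real deployment.&lt;/p&gt;

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Key;
import com.aerospike.client.Record;

public class GetExample {
    public static void main(String[] args) {
        // Assumes a server running locally; adjust host/port for your cluster.
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);

        // SELECT * FROM test.users WHERE id = 'user1'
        Key key = new Key("test", "users", "user1");
        Record record = client.get(null, key); // null policy uses client defaults

        // SELECT name, age FROM test.users WHERE id = 'user1'
        Record projected = client.get(null, key, "name", "age");

        if (record != null) {
            System.out.println(record.bins);
        }
        client.close();
    }
}
```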

&lt;h4&gt;
  
  
  Existence​
&lt;/h4&gt;

&lt;p&gt;There is a variant of single record retrieval to check a record's existence.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;SQL&lt;/th&gt;
&lt;th&gt;Aerospike&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SELECT EXISTS(SELECT * FROM namespace.set WHERE id = key)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;boolean Client::exists(Policy policy, Key key)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  Metadata​
&lt;/h4&gt;

&lt;p&gt;It is possible to only obtain a record's header info or metadata, consisting of generation (or version) and expiration (time-to-live in seconds).&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;SQL&lt;/th&gt;
&lt;th&gt;Aerospike&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SELECT generation, expiration FROM namespace.set WHERE id = key&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Record Client::getHeader(Policy policy, Key key)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Batch Read Operations
&lt;/h3&gt;

&lt;p&gt;A batch request operates on a list of records identified by the keys provided. It works like single-record retrieval, except that multiple records are returned. &lt;/p&gt;

&lt;p&gt;Batch requests are critical for high performance applications as they eliminate multiple client-server round trips, one for each record.&lt;/p&gt;

&lt;h4&gt;
  
  
  Read
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;SQL&lt;/th&gt;
&lt;th&gt;Aerospike&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SELECT * FROM namespace.set WHERE id IN key-list&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Record[] Client::get(BatchPolicy policy, Key[] keys)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SELECT bins FROM namespace.set WHERE id in key-list&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Record[] Client::get(BatchPolicy policy, Key[] keys, String... binNames)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
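&lt;p&gt;A batch read might look like the following sketch, again with illustrative names. One point worth showing in code: the results are positional, so each returned record lines up with its key.&lt;/p&gt;

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Key;
import com.aerospike.client.Record;
import com.aerospike.client.policy.BatchPolicy;

public class BatchGetExample {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);

        // SELECT * FROM test.users WHERE id IN ('u1', 'u2', 'u3')
        Key[] keys = new Key[] {
            new Key("test", "users", "u1"),
            new Key("test", "users", "u2"),
            new Key("test", "users", "u3")
        };
        Record[] records = client.get(new BatchPolicy(), keys);

        // records[i] corresponds to keys[i]; it is null if that record
        // does not exist.
        for (Record record : records) {
            System.out.println(record == null ? "missing" : record.bins);
        }
        client.close();
    }
}
```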

&lt;h4&gt;
  
  
  Existence​
&lt;/h4&gt;

&lt;p&gt;There is a variant of batch retrieval to check record existence.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;SQL&lt;/th&gt;
&lt;th&gt;Aerospike&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SELECT id, EXISTS(SELECT * FROM namespace.set WHERE id = key) WHERE key IN key-list&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;boolean[] Client::exists(BatchPolicy policy, Key[] keys)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  Metadata​
&lt;/h4&gt;

&lt;p&gt;It is possible to obtain header info or metadata consisting of generation (or version) and expiration time (time-to-live in seconds) for a specified set of records.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;SQL&lt;/th&gt;
&lt;th&gt;Aerospike&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SELECT generation, expiration FROM namespace.set WHERE id IN key-list&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Record[] Client::getHeader(BatchPolicy policy, Key[] keys)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  Composite Batch Read
&lt;/h4&gt;

&lt;p&gt;A more general form of batch reads is also available that provides a union of simple batch results with different namespace, set, and bin specifications. The &lt;code&gt;records&lt;/code&gt; argument takes the input record keys and is populated with record details on return.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;SQL&lt;/th&gt;
&lt;th&gt;Aerospike&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;(SELECT bins1 FROM namespace1.set1 WHERE id IN key-list1)&lt;/code&gt; &lt;br&gt; &lt;code&gt;UNION&lt;/code&gt; &lt;br&gt; &lt;code&gt;(SELECT bins2  FROM namespace2.set2  WHERE id IN key-list2)&lt;/code&gt; &lt;br&gt; &lt;code&gt;UNION ...&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;void Client::get(BatchPolicy policy, List&amp;lt;BatchRead&amp;gt; records)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Predicate-Based Read Operations
&lt;/h2&gt;

&lt;p&gt;In predicate-based read operations (also known as queries), records matching a general predicate or condition are retrieved. In SQL, the predicate is specified in the WHERE clause.&lt;br&gt;
Aerospike provides two ways of performing such a query:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Using a secondary index based predicate, which can optionally be ANDed with an expression filter&lt;/li&gt;
&lt;li&gt;Using a scan (which uses the primary “key” index), which can optionally be ANDed with an expression filter&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Secondary Index Query​
&lt;/h3&gt;

&lt;p&gt;While a query in SQL doesn’t require an index to exist, the query API in Aerospike requires that the corresponding secondary index exists. &lt;/p&gt;

&lt;p&gt;The namespace, set, and secondary index based predicate is specified in the &lt;code&gt;statement&lt;/code&gt; argument. The expression filter is optionally specified in the &lt;code&gt;policy&lt;/code&gt; argument for additional conditions to be ANDed.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;SQL&lt;/th&gt;
&lt;th&gt;Aerospike&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SELECT bins FROM namespace.set WHERE condition&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;RecordSet Client::query(QueryPolicy policy, Statement statement)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
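&lt;p&gt;As a sketch, a range query over a secondary index could look like this; the &lt;code&gt;age&lt;/code&gt; bin and its numeric secondary index are assumed to exist for illustration.&lt;/p&gt;

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.query.Filter;
import com.aerospike.client.query.RecordSet;
import com.aerospike.client.query.Statement;

public class QueryExample {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);

        // SELECT * FROM test.users WHERE age BETWEEN 20 AND 30
        // (requires a numeric secondary index on the 'age' bin)
        Statement stmt = new Statement();
        stmt.setNamespace("test");
        stmt.setSetName("users");
        stmt.setFilter(Filter.range("age", 20, 30));

        RecordSet rs = client.query(null, stmt);
        try {
            // Results stream back; iterate and close when done.
            while (rs.next()) {
                System.out.println(rs.getRecord().bins);
            }
        } finally {
            rs.close();
        }
        client.close();
    }
}
```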

&lt;h3&gt;
  
  
  Scan
&lt;/h3&gt;

&lt;p&gt;The scan operation takes a &lt;code&gt;callback&lt;/code&gt; object whose method is invoked for every record in the result; the call blocks until the scan completes.&lt;br&gt;
The expression filter is optionally specified in the &lt;code&gt;policy&lt;/code&gt; argument.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;SQL&lt;/th&gt;
&lt;th&gt;Aerospike&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SELECT bins FROM namespace.set WHERE condition&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;void Client::scanAll(ScanPolicy policy, String namespace, String setName, ScanCallback callback, String... binNames)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
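&lt;p&gt;The following sketch combines a scan with an expression filter in the policy; the bin names and the predicate are illustrative assumptions.&lt;/p&gt;

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Key;
import com.aerospike.client.Record;
import com.aerospike.client.ScanCallback;
import com.aerospike.client.exp.Exp;
import com.aerospike.client.policy.ScanPolicy;

public class ScanExample {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);

        // SELECT name FROM test.users WHERE age > 21,
        // expressed as a scan with a filter expression in the policy.
        ScanPolicy policy = new ScanPolicy();
        policy.filterExp = Exp.build(Exp.gt(Exp.intBin("age"), Exp.val(21)));

        client.scanAll(policy, "test", "users", new ScanCallback() {
            public void scanCallback(Key key, Record record) {
                // Invoked once per matching record; scanAll blocks until done.
                System.out.println(record.getString("name"));
            }
        }, "name");

        client.close();
    }
}
```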

&lt;h2&gt;
  
  
  SQL CREATE, UPDATE, DELETE and Equivalent Write Operations
&lt;/h2&gt;

&lt;p&gt;Aerospike combines Create and Update in a single write operation. The following &lt;code&gt;record-exists-action&lt;/code&gt; options specified in the write-policy define the operation semantics if the record already exists:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;create-only: Create if record doesn't exist, fail otherwise.&lt;/li&gt;
&lt;li&gt;update: Create if record doesn't exist, update otherwise.&lt;/li&gt;
&lt;li&gt;update-only: Update if record exists, fail otherwise.&lt;/li&gt;
&lt;li&gt;replace: Create if record doesn't exist, replace otherwise.&lt;/li&gt;
&lt;li&gt;replace-only: Replace if record exists, fail otherwise.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;SQL INSERT maps to the &lt;code&gt;create-only&lt;/code&gt; option, and SQL UPDATE maps to the &lt;code&gt;update-only&lt;/code&gt; option. SQL has no way to specify the other options, such as replace, which overwrites an existing record.&lt;/p&gt;

&lt;h3&gt;
  
  
  Single-Record Write Operations
&lt;/h3&gt;

&lt;h4&gt;
  
  
  INSERT and UPDATE
&lt;/h4&gt;

&lt;p&gt;The put operation handles Create (Insert) and Update. &lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;SQL&lt;/th&gt;
&lt;th&gt;Aerospike&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;INSERT INTO namespace.set VALUES (id=key, bin=value, ...)&lt;/code&gt; &lt;br&gt; + &lt;br&gt; &lt;code&gt;UPDATE namespace.set SET (bin=value, ...) WHERE id=key&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;void Client::put(WritePolicy policy, Key key, Bin... bins)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
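&lt;p&gt;The INSERT/UPDATE distinction is made through the write policy's &lt;code&gt;recordExistsAction&lt;/code&gt;, as in this sketch (names and values are illustrative).&lt;/p&gt;

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Bin;
import com.aerospike.client.Key;
import com.aerospike.client.policy.RecordExistsAction;
import com.aerospike.client.policy.WritePolicy;

public class PutExample {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);
        Key key = new Key("test", "users", "user1");

        // INSERT INTO test.users VALUES (id='user1', name='Jane', age=25):
        // CREATE_ONLY fails if the record already exists, like SQL INSERT.
        WritePolicy insertPolicy = new WritePolicy();
        insertPolicy.recordExistsAction = RecordExistsAction.CREATE_ONLY;
        client.put(insertPolicy, key, new Bin("name", "Jane"), new Bin("age", 25));

        // UPDATE test.users SET (age=26) WHERE id='user1':
        // UPDATE_ONLY fails if the record does not exist, like SQL UPDATE.
        WritePolicy updatePolicy = new WritePolicy();
        updatePolicy.recordExistsAction = RecordExistsAction.UPDATE_ONLY;
        client.put(updatePolicy, key, new Bin("age", 26));

        client.close();
    }
}
```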

&lt;h4&gt;
  
  
  Type-Specific Write Operations
&lt;/h4&gt;

&lt;p&gt;Aerospike allows type-specific update operations. For integer and string types, they include the following. The &lt;code&gt;bins&lt;/code&gt; argument holds multiple bin objects, each with the bin name and the operand value.  &lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;SQL&lt;/th&gt;
&lt;th&gt;Aerospike&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;UPDATE namespace.set SET (bin = bin + intval) WHERE id=key&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;void Client::add(WritePolicy policy, Key key, Bin... bins)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;UPDATE namespace.set SET (bin = bin + strval) WHERE id=key&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;void Client::append(WritePolicy policy, Key key, Bin... bins)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;UPDATE namespace.set SET (bin = strval + bin) WHERE id=key&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;void Client::prepend( WritePolicy policy, Key key, Bin... bins)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
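&lt;p&gt;A short sketch of the type-specific writes above; the bin names and operand values are illustrative assumptions.&lt;/p&gt;

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Bin;
import com.aerospike.client.Key;

public class TypeOpsExample {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);
        Key key = new Key("test", "users", "user1");

        // UPDATE test.users SET (visits = visits + 1) WHERE id='user1'
        client.add(null, key, new Bin("visits", 1));

        // Append a string to the 'log' bin's current value.
        client.append(null, key, new Bin("log", "|login"));

        // Prepend a string to the 'title' bin's current value.
        client.prepend(null, key, new Bin("title", "Dr. "));

        client.close();
    }
}
```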

&lt;p&gt;Other type specific operations including on Collection Data Types (CDTs), are described in the &lt;a href="https://docs.aerospike.com/server/guide/data-types/cdt"&gt;documentation&lt;/a&gt; and &lt;a href="https://developer.aerospike.com/tutorials/java/cdt"&gt;tutorials&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  DELETE
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;SQL&lt;/th&gt;
&lt;th&gt;Aerospike&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DELETE FROM namespace.set WHERE id=key&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;void Client::delete(WritePolicy policy, Key key)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Batch Write Operations
&lt;/h3&gt;

&lt;p&gt;A batch write operates on multiple records identified by a list of keys. There are batch APIs for inserting, updating, and deleting multiple records.&lt;br&gt;
Two forms of batch writes are shown below. Other forms, including one with a UDF (described below) and key-specific operations, are described in the blog post &lt;a href="https://developer.aerospike.com/blog/batch-operations-in-aerospike"&gt;Batch Operations&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;The argument &lt;code&gt;ops&lt;/code&gt; is a list of operations to be performed in the specified sequence on each record, and can include read as well as write operations. The argument &lt;code&gt;batchPolicy&lt;/code&gt; contains the specifics of how the batch is processed, whereas the arguments &lt;code&gt;writePolicy&lt;/code&gt; and &lt;code&gt;deletePolicy&lt;/code&gt; have the specifics of how the respective individual record operation is performed.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;SQL&lt;/th&gt;
&lt;th&gt;Aerospike&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;UPDATE namespace.set SET (bin1=fn_1(bin_1), ...) WHERE id in key-list&lt;/code&gt; &lt;br&gt; + &lt;br&gt; &lt;code&gt;SELECT fn_n(bin_n), ... FROM namespace.set WHERE id in key-list&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;BatchResults Client::operate(BatchPolicy batchPolicy, BatchWritePolicy writePolicy, Key[] keys, Operation... ops)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DELETE FROM namespace.set WHERE id in key-list&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;BatchResults Client::delete(BatchPolicy batchPolicy, BatchDeletePolicy deletePolicy, Key[] keys)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
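&lt;p&gt;The batch-write forms above could be sketched as follows. This assumes client and server versions that support batch writes; the keys and bins are illustrative, and null policies use the client defaults.&lt;/p&gt;

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.BatchResults;
import com.aerospike.client.Bin;
import com.aerospike.client.Key;
import com.aerospike.client.Operation;

public class BatchWriteExample {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);
        Key[] keys = new Key[] {
            new Key("test", "users", "u1"),
            new Key("test", "users", "u2")
        };

        // For each key: increment 'visits', then read it back.
        // Reads and writes are combined in one batch request.
        BatchResults results = client.operate(null, null, keys,
            Operation.add(new Bin("visits", 1)),
            Operation.get("visits"));
        System.out.println("all succeeded: " + results.status);

        // DELETE FROM test.users WHERE id in ('u1', 'u2')
        client.delete(null, null, keys);

        client.close();
    }
}
```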

&lt;h3&gt;
  
  
  Predicate-Based Write Operations
&lt;/h3&gt;

&lt;p&gt;Predicate-based updates and deletes are possible by specifying the WHERE condition using the secondary index predicate (specified in a statement object) and expression filter (specified in the write policy) as explained earlier.&lt;/p&gt;

&lt;p&gt;Predicate-based updates and deletes can involve a large number of records, and therefore are processed in background execution mode with the execute API. Sync and callback async modes are not available. Two forms of execute are possible:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Using a list of bin updates and deletes: A multi-op request provides a list of bin operations. Multi-op requests are further described below.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Since &lt;code&gt;execute&lt;/code&gt; runs in background mode with no returned results, the operation list cannot include a read operation, only updates.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;SQL&lt;/th&gt;
&lt;th&gt;Aerospike&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;UPDATE namespace.set SET (bin=value, ...) WHERE condition&lt;/code&gt; &lt;br&gt; + &lt;br&gt; &lt;code&gt;DELETE FROM namespace.set WHERE condition&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ExecuteTask Client::execute(WritePolicy policy, Statement statement, Operation... operations)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ol start="2"&gt;
&lt;li&gt;User Defined Functions (UDFs): UDFs are equivalent to stored procedures, and are described further below. Record-oriented UDFs implement arbitrary logic in a Lua function that is registered with the server and invoked through an API call. &lt;/li&gt;
&lt;/ol&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;SQL&lt;/th&gt;
&lt;th&gt;Aerospike&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;UPDATE namespace.set SET (bin1=fn1(args), ...) WHERE condition&lt;/code&gt; &lt;br&gt; + &lt;br&gt; &lt;code&gt;DELETE FROM namespace.set WHERE condition&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ExecuteTask Client::execute(WritePolicy policy, Statement statement, String packageName, String functionName, Value... functionArgs)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  SQL Stored Procedures and Aerospike User Defined Functions (UDFs)
&lt;/h2&gt;

&lt;p&gt;User Defined Functions (UDFs) are equivalent to stored procedures in SQL systems. A UDF is written in Lua, registered on the server, and invoked on one or more specified records. You can find further details in the documentation on &lt;a href="https://docs.aerospike.com/server/architecture/udf"&gt;User Defined Functions (UDFs)&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;In the following example, the UDF is specified using the arguments &lt;code&gt;packageName&lt;/code&gt; and &lt;code&gt;functionName&lt;/code&gt;, and is supplied the arguments it expects in &lt;code&gt;functionArgs&lt;/code&gt;. The API returns a generic &lt;code&gt;Object&lt;/code&gt;, which can be a single value or a map of key-value pairs. &lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;SQL&lt;/th&gt;
&lt;th&gt;Aerospike&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;EXEC StoredProcedure @arg1 = val1, @arg2 = val2, …&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Object Client::execute(WritePolicy policy, Key key, String packageName, String functionName, Value... functionArgs)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
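&lt;p&gt;A single-record UDF invocation could look like this sketch. The Lua package name &lt;code&gt;scores&lt;/code&gt; and function name &lt;code&gt;double_score&lt;/code&gt; are hypothetical and stand in for a module you have registered on the server.&lt;/p&gt;

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Key;
import com.aerospike.client.Value;

public class UdfExample {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);
        Key key = new Key("test", "users", "user1");

        // EXEC StoredProcedure @multiplier = 2, roughly:
        // invoke function 'double_score' in the registered Lua package
        // 'scores' (both names hypothetical) on one record.
        Object result = client.execute(null, key, "scores", "double_score",
            Value.get(2));
        System.out.println(result);

        client.close();
    }
}
```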

&lt;p&gt;A UDF can have arbitrary logic combining CRUD operations. &lt;/p&gt;

&lt;p&gt;In Aerospike, aggregation functions such as MIN, MAX, AVERAGE, and SUM over multiple records are implemented with Stream UDFs. This article does not cover the specifics of Stream UDFs; please refer to the tutorials on &lt;a href="https://developer.aerospike.com/tutorials/java/sql_aggregates_1"&gt;SQL: Aggregates&lt;/a&gt;. &lt;/p&gt;

&lt;h2&gt;
  
  
  Multi-Op Requests
&lt;/h2&gt;

&lt;p&gt;Multiple bin-level read and write operations can be combined in one request through the &lt;code&gt;operate&lt;/code&gt; API, unlike the dedicated single-operation requests, which allow just one. The operations in the argument &lt;code&gt;operations&lt;/code&gt; are executed atomically, in the order specified. &lt;/p&gt;

&lt;p&gt;Unlike in SQL, read and write operations can be combined in the same request (for single-record and batch requests) as illustrated below.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;SQL&lt;/th&gt;
&lt;th&gt;Aerospike&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;SELECT fn1(bin1), … FROM namespace.set WHERE id=key&lt;/code&gt; &lt;br&gt; + &lt;br&gt; &lt;code&gt;UPDATE namespace.set SET (bin1=fn_n(bin_n), ...) WHERE id=key&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Record Client::operate( WritePolicy policy, Key key, Operation... operations)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
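&lt;p&gt;For example, a read-modify-read sequence on one record can be done atomically in a single round trip, as in this sketch (bin names are illustrative).&lt;/p&gt;

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Bin;
import com.aerospike.client.Key;
import com.aerospike.client.Operation;
import com.aerospike.client.Record;

public class OperateExample {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);
        Key key = new Key("test", "users", "user1");

        // Atomically increment 'visits' and read it back, combining a
        // write and a read in the same request, in the order given.
        Record record = client.operate(null, key,
            Operation.add(new Bin("visits", 1)),
            Operation.get("visits"));
        System.out.println(record.getValue("visits"));

        client.close();
    }
}
```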

&lt;p&gt;Multi-op operate APIs are available for single-record, batch, and query operations. See &lt;a href="https://developer.aerospike.com/tutorials/java/java-intro_to_transactions"&gt;this tutorial&lt;/a&gt; that illustrates multi-ops.&lt;/p&gt;

&lt;h2&gt;
  
  
  Metadata Operations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Namespace Operations
&lt;/h3&gt;

&lt;h4&gt;
  
  
  CREATE Namespace​
&lt;/h4&gt;

&lt;p&gt;There is no API to create a namespace. A namespace is added through the config and requires a server restart.&lt;/p&gt;

&lt;h4&gt;
  
  
  TRUNCATE Namespace​
&lt;/h4&gt;

&lt;p&gt;The truncate API removes all records in a set or the entire namespace. &lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;SQL&lt;/th&gt;
&lt;th&gt;Aerospike&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;TRUNCATE namespace&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;void Client::truncate(InfoPolicy policy, String namespace, String set, Calendar beforeLastUpdate)&lt;/code&gt;, with &lt;code&gt;set&lt;/code&gt; and &lt;code&gt;beforeLastUpdate&lt;/code&gt; passed as null&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  DELETE Namespace​
&lt;/h4&gt;

&lt;p&gt;There is no API to delete a namespace. A namespace has one or more dedicated storage devices, and they must be wiped clean to delete the namespace. &lt;/p&gt;

&lt;h3&gt;
  
  
  Set Operations
&lt;/h3&gt;

&lt;h4&gt;
  
  
  CREATE Set​
&lt;/h4&gt;

&lt;p&gt;There is no explicit operation to create a set. A set is created when the first record is inserted in the set.&lt;/p&gt;

&lt;h4&gt;
  
  
  ALTER Set​
&lt;/h4&gt;

&lt;p&gt;A set is schemaless, and can hold records that have different schemas or bins. A bin has no type associated with it, and can hold values of any type. Therefore, an ALTER operation on a set to modify its schema is not needed.&lt;/p&gt;

&lt;h4&gt;
  
  
  TRUNCATE Set​
&lt;/h4&gt;

&lt;p&gt;All records in a set can be truncated using the truncate API:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;SQL&lt;/th&gt;
&lt;th&gt;Aerospike&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;TRUNCATE namespace.set&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;void Client::truncate(InfoPolicy policy, String namespace, String set, Calendar beforeLastUpdate)&lt;/code&gt;, with &lt;code&gt;beforeLastUpdate&lt;/code&gt; passed as null&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  DROP Set​
&lt;/h4&gt;

&lt;p&gt;There is no notion of deleting a set as a set is just a name that a record is tagged with. The namespace must be deleted to remove the set name.&lt;/p&gt;

&lt;h3&gt;
  
  
  Index Operations
&lt;/h3&gt;

&lt;h4&gt;
  
  
  CREATE Index​
&lt;/h4&gt;

&lt;p&gt;An index is created on a bin for a specific value type. Integer, string, and GeoJSON types are currently supported for indexing. &lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;SQL&lt;/th&gt;
&lt;th&gt;Aerospike&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CREATE Index&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;IndexTask Client::createIndex(Policy policy, String namespace, String setName, String indexName, String binName, IndexType indexType)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
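&lt;p&gt;Creating an index might look like this sketch; the index name &lt;code&gt;idx_age&lt;/code&gt; and the indexed bin are illustrative assumptions.&lt;/p&gt;

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.query.IndexType;
import com.aerospike.client.task.IndexTask;

public class IndexExample {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);

        // Roughly: CREATE INDEX idx_age ON test.users (age), numeric.
        // Index builds run asynchronously on the server; the returned
        // task lets you wait for completion.
        IndexTask task = client.createIndex(null, "test", "users",
            "idx_age", "age", IndexType.NUMERIC);
        task.waitTillComplete();

        client.close();
    }
}
```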

&lt;h4&gt;
  
  
  DROP Index​
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;SQL&lt;/th&gt;
&lt;th&gt;Aerospike&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DROP Index&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;IndexTask Client::dropIndex(Policy policy, String namespace, String setName, String indexName)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  UDF Operations
&lt;/h3&gt;

&lt;h4&gt;
  
  
  CREATE UDF
&lt;/h4&gt;

&lt;p&gt;The arguments &lt;code&gt;clientPath&lt;/code&gt; and &lt;code&gt;serverPath&lt;/code&gt; below define the path to the UDF file on the client and server, respectively.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;SQL&lt;/th&gt;
&lt;th&gt;Aerospike&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CREATE StoredProcedure&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Client::register(Policy policy, String clientPath, String serverPath, Language.LUA)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Other SQL Capabilities
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Join
&lt;/h3&gt;

&lt;p&gt;Most NoSQL databases do not support the Join operation because it is slow and complex at scale. You can avoid Joins by storing the joined objects together in aggregate form. Alternatively, a join can be performed in the application by retrieving the referenced objects.&lt;/p&gt;
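
&lt;p&gt;A minimal sketch of the application-side join, using plain Python dictionaries to stand in for Aerospike sets (the names and layout here are illustrative, not a client API):&lt;/p&gt;

```python
# Sketch: resolving a reference in the application instead of a SQL JOIN.
# The dict-based "stores" stand in for Aerospike sets; in a real app these
# lookups would be client get() calls.
customers = {"c1": {"name": "Ada", "order_ids": ["o1", "o2"]}}
orders = {"o1": {"total": 30}, "o2": {"total": 12}}

def customer_with_orders(cid):
    """Fetch a customer, then fetch each referenced order (the app-side join)."""
    cust = dict(customers[cid])
    cust["orders"] = [orders[oid] for oid in cust.pop("order_ids")]
    return cust

joined = customer_with_orders("c1")
```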

&lt;h3&gt;
  
  
  Limit
&lt;/h3&gt;

&lt;p&gt;The policy parameter &lt;code&gt;max-records&lt;/code&gt; can be specified as a hint. Fewer objects than the limit may be returned because the limit is divided among the participating nodes.&lt;/p&gt;
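
&lt;p&gt;A small sketch of why &lt;code&gt;max-records&lt;/code&gt; is only a hint, assuming the limit is split evenly across nodes:&lt;/p&gt;

```python
# Sketch: the limit is split across cluster nodes, so a node holding fewer
# matching records cannot "lend" its unused share to the others.
def records_returned(limit, per_node_matches):
    nodes = len(per_node_matches)
    share = limit // nodes          # each node gets an equal share of the limit
    return sum(min(share, n) for n in per_node_matches)

# 3 nodes, limit 30: each node may return at most 10 records.
got = records_returned(30, [10, 10, 10])   # every node fills its share: 30
short = records_returned(30, [2, 10, 10])  # first node has only 2 matches: 22
```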

&lt;h3&gt;
  
  
  Order By, Top, Union, Distinct
&lt;/h3&gt;

&lt;p&gt;List operations and Expressions can be used to implement these operations. Alternatively, they can be performed in the application. &lt;/p&gt;

&lt;h3&gt;
  
  
  Aggregations
&lt;/h3&gt;

&lt;p&gt;Aggregations involving Group-By, Having, and Aggregate Functions (such as Max, Min, Top, Average, Sum) can be implemented using Stream UDFs as shown in the tutorials &lt;a href="https://developer.aerospike.com/tutorials/java/sql_aggregates_1"&gt;SQL Aggregates - Part 1&lt;/a&gt; and &lt;a href="https://developer.aerospike.com/tutorials/java/sql_aggregates_2"&gt;Part 2&lt;/a&gt;.&lt;/p&gt;
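
&lt;p&gt;Conceptually, a Stream UDF aggregation is a map step applied to records on each node followed by a reduce step that merges partial results. The sketch below simulates a GROUP BY with SUM in plain Python; a real implementation would be Lua functions registered with the server:&lt;/p&gt;

```python
# Conceptual sketch of a Stream UDF aggregation (GROUP BY region, SUM sales).
records = [
    {"region": "east", "sales": 100},
    {"region": "west", "sales": 50},
    {"region": "east", "sales": 25},
]

def map_record(rec):
    # map step: emit a partial aggregate for one record
    return {rec["region"]: rec["sales"]}

def reduce_partials(a, b):
    # reduce step: merge two partial aggregates
    merged = dict(a)
    for k, v in b.items():
        merged[k] = merged.get(k, 0) + v
    return merged

partials = [map_record(r) for r in records]
totals = partials[0]
for p in partials[1:]:
    totals = reduce_partials(totals, p)
```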

&lt;h3&gt;
  
  
  Constraints
&lt;/h3&gt;

&lt;p&gt;Integrity constraints such as NOT NULL, Foreign Key, and Default should be handled in the application logic. A uniqueness constraint can be enforced within a List or Map.&lt;/p&gt;
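
&lt;p&gt;As a sketch of the uniqueness idea: a map write in create-only mode fails when the key already exists, which an application can treat as a unique-constraint violation (simulated here with a plain dictionary):&lt;/p&gt;

```python
# Sketch: enforcing uniqueness with a Map bin. A create-only map write fails
# if the key is already present, giving the application a uniqueness check.
def map_put_create_only(map_bin, key, value):
    if key in map_bin:
        raise KeyError(f"unique constraint violated: {key!r} already present")
    map_bin[key] = value

emails = {}
map_put_create_only(emails, "ada@example.com", "user-1")
try:
    map_put_create_only(emails, "ada@example.com", "user-2")
    duplicate_rejected = False
except KeyError:
    duplicate_rejected = True
```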

&lt;h3&gt;
  
  
  Built-In Functions
&lt;/h3&gt;

&lt;p&gt;Many built-in functions, such as UPPER and TRIM, can be implemented with Expressions or UDFs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Going Beyond SQL with Aerospike
&lt;/h2&gt;

&lt;p&gt;To get the most out of Aerospike for speed at scale, it is necessary to think beyond SQL. &lt;br&gt;
The process starts with modeling your data for the performance, scale, and other needs of the application. Please review the series &lt;a href="https://developer.aerospike.com/blog/data-modeling-for-speed-at-scale"&gt;Data Modeling for Speed At Scale&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Learn about and use the various performance features that the Aerospike API provides through the &lt;a href="https://docs.aerospike.com/"&gt;documentation&lt;/a&gt; and  &lt;a href="https://developer.aerospike.com/tutorials/"&gt;tutorials&lt;/a&gt;. Examples of such features include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Collection Data Types (CDTs) &lt;/li&gt;
&lt;li&gt;Multi-op requests&lt;/li&gt;
&lt;li&gt;Batch requests&lt;/li&gt;
&lt;li&gt;Expressions&lt;/li&gt;
&lt;li&gt;Secondary indexes&lt;/li&gt;
&lt;li&gt;Set indexes&lt;/li&gt;
&lt;li&gt;Complex Data Types - Binary, HLL, GeoJSON&lt;/li&gt;
&lt;li&gt;User Defined Functions (UDFs)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;You can use SQL to access Aerospike data through the Trino, Spark, and JDBC connectors. While the connectors work well for the environments and purposes they are built for, they do not provide the full Aerospike API functionality that an application may need. Therefore, using the Aerospike API directly is recommended for full functionality and performance. The Aerospike API is designed to give developers of high-performance applications control over performance-specific details so they can make better decisions. &lt;/p&gt;

&lt;p&gt;This article describes how a developer who is familiar with SQL can quickly implement specific SQL CRUD operations using the Aerospike API. Coming from a SQL background, it is important to remember that with NoSQL data modeling you can avoid certain SQL features, such as Join, entirely, and thereby maximize the performance and scale benefits of Aerospike. With the introduction provided in this article, you should be able to take the next step of learning the mechanisms in the Aerospike API that optimize your application’s performance and scale. &lt;/p&gt;

&lt;h3&gt;
  
  
  Related Links:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://developer.aerospike.com/tutorials/java/sql-operations"&gt;SQL Operations&lt;/a&gt; (interactive tutorials)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.aerospike.com/"&gt;Aerospike Documentation&lt;/a&gt; (documentation)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://developer.aerospike.com/tutorials/"&gt;Aerospike Tutorials&lt;/a&gt; (interactive tutorials)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developer.aerospike.com"&gt;Aerospike Developer Hub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://aerospike.com/blog/developers-understanding-aerospike-transactions/"&gt;Developers: Understanding Aerospike Transactions&lt;/a&gt; (blog post)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.aerospike.com/server/guide/data-types/cdt"&gt;Collection Data Types (CDTs)&lt;/a&gt; (documentation) &lt;/li&gt;
&lt;li&gt;
&lt;a href="https://developer.aerospike.com/tutorials/java/cdt"&gt;Collection Data Types (CDTs)&lt;/a&gt; (interactive tutorials)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://developer.aerospike.com/blog/batch-operations-in-aerospike"&gt;Batch Operations&lt;/a&gt;  (blog post)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://developer.aerospike.com/tutorials/java/java-intro_to_transactions"&gt;Introduction to Transactions&lt;/a&gt; (interactive tutorial)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.aerospike.com/server/architecture/udf"&gt;User Defined Functions (UDFs)&lt;/a&gt; (documentation)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://developer.aerospike.com/blog/data-modeling-for-speed-at-scale"&gt;Data Modeling for Speed At Scale&lt;/a&gt; (blog post)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://developer.aerospike.com/blog/deploy-aerospike-and-trino-based-analytics-platform-using-docker"&gt;Deploy Aerospike and Trino based analytics platform using Docker&lt;/a&gt; (blog post)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://developer.aerospike.com/blog/aerospike-trino-connector-chapter-two"&gt;Aerospike Trino Connector - Chapter Two&lt;/a&gt; (blog post)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://developer.aerospike.com/blog/analyze-data-with-aerospike-and-starburst-anywhere"&gt;Analyze Data with Aerospike and Starburst Anywhere&lt;/a&gt; (blog post)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://developer.aerospike.com/blog/aerospike-data-browser"&gt;Aerospike Data Browser&lt;/a&gt; (blog post)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://developer.aerospike.com/tutorials/spark"&gt;Aerospike Spark Tutorials&lt;/a&gt; (interactive tutorials)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://medium.com/aerospike-developer-blog/aerospike-is-a-highly-scalable-key-value-database-offering-best-in-class-performance-5922450aaa78"&gt;Using Aerospike Connect for Spark&lt;/a&gt; (blog post) &lt;/li&gt;
&lt;li&gt;
&lt;a href="https://developer.aerospike.com/blog/accelerate-spark-queries-with-predicate-pushdown-using-aerospike"&gt;Accelerate Spark queries with Predicate Pushdown using Aerospike&lt;/a&gt; (blog post)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://developer.aerospike.com/blog/introducing-aerospike-jdbc-driver"&gt;Introducing Aerospike JDBC Driver&lt;/a&gt; (blog post)&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aerospike</category>
      <category>sql</category>
      <category>api</category>
      <category>crud</category>
    </item>
    <item>
      <title>A Quick Orientation to Aerospike API</title>
      <dc:creator>Neel Phadnis</dc:creator>
      <pubDate>Wed, 20 Jul 2022 16:58:09 +0000</pubDate>
      <link>https://dev.to/aerospike/a-quick-orientation-to-aerospike-api-5g4p</link>
      <guid>https://dev.to/aerospike/a-quick-orientation-to-aerospike-api-5g4p</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--y_ynL0pk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1600/0%2ArQSAKcQxLfgY2G6-" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--y_ynL0pk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1600/0%2ArQSAKcQxLfgY2G6-" alt="(Source: Photo by Jametlene Reskp on [Unsplash](https://unsplash.com/) )" width="800" height="533"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Source: Photo by Jametlene Reskp on &lt;a href="https://unsplash.com/"&gt;Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Aerospike Database and the client API provide a rich set of capabilities that have evolved over more than a decade through an increasing number of mission critical deployments. This post provides a high level view of the Aerospike architecture and API to give developers a broader understanding of its architecture and capabilities, and help them become more productive and effective. This post also points to resources for further exploration of specific areas.&lt;/p&gt;

&lt;p&gt;The post is organized in the following sections:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Core Concepts: Describes the core architecture and data distribution concepts.&lt;/li&gt;
&lt;li&gt;Functional Elements: Describes major elements of functionality in the API.&lt;/li&gt;
&lt;li&gt;Key Specifics: Describes a few things that are useful to know up front.&lt;/li&gt;
&lt;li&gt;Performance Features: Summarizes the common performance enhancing features.&lt;/li&gt;
&lt;li&gt;Useful Libraries: Mentions libraries you should be aware of.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A caveat: Aerospike has client libraries for many languages. Not every aspect described here may apply to all client libraries precisely. While the discussion is broadly applicable to all client libraries, some details may be specific to the Java client library as it is most widely used.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core Concepts
&lt;/h2&gt;

&lt;p&gt;The key architecture concepts to understand include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;data organization in a cluster, &lt;/li&gt;
&lt;li&gt;workings of the client library, &lt;/li&gt;
&lt;li&gt;transaction support and replica consistency, &lt;/li&gt;
&lt;li&gt;server-side execution of complex data type operations, &lt;/li&gt;
&lt;li&gt;various processing modes, and &lt;/li&gt;
&lt;li&gt;primary and secondary index queries.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Data Organization in Cluster
&lt;/h3&gt;

&lt;p&gt;Aerospike is a distributed record-oriented database with support for the document-oriented data model. An Aerospike database holds multiple &lt;code&gt;namespaces&lt;/code&gt;, which are equivalent to databases in the relational model. A namespace holds &lt;code&gt;records&lt;/code&gt; (rows), organized in &lt;code&gt;sets&lt;/code&gt; (tables), and each record is accessed using a unique &lt;code&gt;key&lt;/code&gt; that serves as the record id. A record can contain one or more &lt;code&gt;bins&lt;/code&gt; (columns), and a bin can hold a value of any supported data type. Aerospike supports type-specific operations on Integer, String, List, Map, Geospatial, HyperLogLog, and Blob types. Sets and records do not conform to any schema. The &lt;code&gt;primary index&lt;/code&gt; provides fast access to a record by key, and &lt;code&gt;secondary indexes&lt;/code&gt; support predicate-based access.&lt;/p&gt;

&lt;p&gt;Records are hashed by key across 4096 data &lt;code&gt;partitions&lt;/code&gt; which are uniformly distributed across &lt;code&gt;cluster&lt;/code&gt; nodes. Each data partition is replicated with a &lt;code&gt;Replication Factor&lt;/code&gt; (RF) number of copies for fault tolerance.&lt;/p&gt;
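
&lt;p&gt;The key-to-partition mapping can be sketched as follows. Aerospike actually hashes the key with RIPEMD-160; SHA-256 is used below purely for illustration:&lt;/p&gt;

```python
import hashlib

# Sketch: how a record key maps to one of 4096 data partitions. This uses
# SHA-256 for illustration only; the real implementation hashes the key with
# RIPEMD-160 and derives the partition id from bits of the digest.
N_PARTITIONS = 4096

def partition_id(user_key: str) -> int:
    digest = hashlib.sha256(user_key.encode()).digest()
    return int.from_bytes(digest[:2], "little") % N_PARTITIONS

# The same key always lands on the same partition, which the Smart Client
# uses to route the request to the owning node.
pid = partition_id("user:1001")
```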

&lt;p&gt;Please find more details in the &lt;a href="https://docs.aerospike.com/server/architecture/overview"&gt;architecture overview section of the documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Smart Client
&lt;/h3&gt;

&lt;p&gt;Aerospike has client libraries or &lt;code&gt;clients&lt;/code&gt; that implement the API in multiple languages including Java, C#, Python, Go, REST, Node.js, C, Ruby, and more. A client library simplifies application development by taking care of many complex aspects. It has the smarts to actively track and adapt to the latest cluster state and data distribution, almost working as an extension of the cluster. As such, it is referred to as the Smart Client. The Smart Client implements a common wire protocol for server interactions, directly connects to all nodes in the cluster, determines the specific server nodes for a data request, sends the request to them, and coordinates a response back to the application. It also handles timeouts, automatic retries, connection pooling, request throttling, and replica selection, among other things. &lt;/p&gt;

&lt;p&gt;Please refer to the &lt;a href="https://docs.aerospike.com/server/architecture/clients"&gt;Smart Client section&lt;/a&gt; and the &lt;a href="https://developer.aerospike.com/client"&gt;supported clients&lt;/a&gt; in the documentation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data-Type Server
&lt;/h3&gt;

&lt;p&gt;The API supports many complex data types including List, Map, Blob (or Binary), GeoJSON, and HyperLogLog (HLL). Since many complex data elements can get large in size, Aerospike eliminates expensive client-server data transfer by executing the operations entirely on the server. &lt;/p&gt;

&lt;p&gt;Developers can leverage the following features to minimize client-server data transfers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Complex operations supported in the API. To minimize data transfer, the API provides server-side execution of operations, finer control of the returned data, as well as the ability to customize multi-element processing logic for Collection Data Type (CDT) operations. Please refer to the &lt;a href="https://docs.aerospike.com/server/guide/data-types/cdt"&gt;CDT documentation&lt;/a&gt; and &lt;a href="https://developer.aerospike.com/tutorials/java/cdt"&gt;tutorials&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Expressions allow flexible logic to be computed on the server for filtering, retrieval, and updates. Please see the tutorial on &lt;a href="https://developer.aerospike.com/tutorials/java/expressions"&gt;Expressions&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Lua UDFs: Allow general logic to be computed on the server for retrieval and updates (described further below).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Transactions and Consistency
&lt;/h3&gt;

&lt;p&gt;Aerospike guarantees that all single-record requests are atomic: they either succeed or fail as a whole. Multi-op requests (described below) on a single record are transactional. However, multi-record batch and query operations (also described below) are not transactional, and there is no rollback for partially successful requests. The transactional boundary is the individual record operation within a multi-record request. &lt;/p&gt;

&lt;p&gt;Aerospike replicates data across multiple replicas for resiliency and performance. Replicas are kept in sync by applying each write operation synchronously to all of them. Read replicas are automatically selected based on the consistency requirements specified in the policy. &lt;/p&gt;

&lt;p&gt;Please view the blog post &lt;a href="https://aerospike.com/blog/developers-understanding-aerospike-transactions/"&gt;Developers: Understanding Aerospike Transactions&lt;/a&gt; for details.&lt;/p&gt;

&lt;h3&gt;
  
  
  Processing Modes
&lt;/h3&gt;

&lt;p&gt;Aerospike supports multiple modes to process a request, each with its trade-offs, and the application should choose the appropriate mode. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Synchronous: In the synchronous mode, the client waits for all responses to arrive from the server nodes before handing the cumulative response to the application. The application can spawn multiple threads and process multiple synchronous requests in parallel, one per thread at a time. From the application’s standpoint, synchronous requests can be easier to implement, but may not offer best resource utilization.&lt;/li&gt;
&lt;li&gt;Asynchronous: In the asynchronous mode, the application can submit a request without waiting for the results. The results are processed when they are available in a different &lt;code&gt;callback&lt;/code&gt; thread. Depending on the application’s choice, the client library can call back once for each record response, or just once with all responses. The asynchronous mode has superior efficiency and performance, but may be more complex to implement. Check out the tutorial on &lt;a href="https://developer.aerospike.com/tutorials/java/async_ops"&gt;Asynchronous Operations&lt;/a&gt; for details. &lt;/li&gt;
&lt;li&gt;Background: Common updates to a large number of records that are selected by a query can be performed in a background mode, where no results are returned from the server. The application can query whether a background request is in progress, completed successfully, or failed; and if necessary must determine the details of any failure separately. &lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Primary and Secondary Index Queries
&lt;/h3&gt;

&lt;p&gt;Queries use either the Primary Index or a Secondary Index. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A primary-index query is simply a scan, performed on either a set or the entire namespace. A set index can optionally be created to boost performance, and is automatically used in the scan of that set. If a set index is not available, the entire namespace is scanned to determine the set records. A namespace scan uses the primary index.&lt;/li&gt;
&lt;li&gt;A secondary-index query returns records meeting a predicate or condition supported by the corresponding secondary index: equality and range for Integer, equality for String, contains and is-contained-by for GeoJSON data type. An appropriate secondary index must have been created in order to execute a secondary-index query. &lt;/li&gt;
&lt;/ul&gt;
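
&lt;p&gt;The effect of a secondary index can be sketched with a simple value-to-keys mapping, so that an equality query touches only the matching records instead of scanning everything (a simulation, not the client API):&lt;/p&gt;

```python
from collections import defaultdict

# Sketch: a (simulated) secondary index on the "age" bin. The index maps a
# bin value to the keys of records holding that value, so an equality query
# reads only the matching records rather than scanning the namespace.
records = {
    "k1": {"age": 25, "name": "Ada"},
    "k2": {"age": 30, "name": "Bea"},
    "k3": {"age": 25, "name": "Cal"},
}

age_index = defaultdict(set)
for key, rec in records.items():
    age_index[rec["age"]].add(key)

def query_age_equals(value):
    # equality predicate served entirely from the index
    return sorted(age_index.get(value, set()))

matches = query_age_equals(25)
```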

&lt;p&gt;Note that &lt;code&gt;filter expressions&lt;/code&gt; (described below) provide a powerful mechanism to select records for operations, and are broadly used with queries.&lt;/p&gt;

&lt;p&gt;You can view query examples in the tutorial on &lt;a href="https://developer.aerospike.com/tutorials/java/sql_select"&gt;Implementing SQL: Select&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Functional Elements
&lt;/h2&gt;

&lt;p&gt;This section describes major functional elements including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;single- and multi-record function variants, &lt;/li&gt;
&lt;li&gt;multi-op requests, &lt;/li&gt;
&lt;li&gt;Collection Data Types (CDTs), &lt;/li&gt;
&lt;li&gt;expressions, and &lt;/li&gt;
&lt;li&gt;User Defined Functions (UDFs). &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note that these categories are not exclusive. For example, a multi-op request can involve CDTs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Single and Multi-Record Function Variants
&lt;/h3&gt;

&lt;p&gt;In addition to the execution modes, there are function variants based on the number of records involved.&lt;/p&gt;

&lt;p&gt;For simplicity, Aerospike defines distinct API functions for single-record, batch, and query operations, rather than a single generic function that takes a variable number of records or a query predicate. Thus, there are separate functions for the single-record and batch variants of operations like exists, get, put, append, and operate, as well as for their sync and async invocations. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A single-record request operates on a specified key or record. &lt;/li&gt;
&lt;li&gt;Batch operations operate over multiple keys or records, where each key is specified. Please see the blog post on &lt;a href="https://developer.aerospike.com/blog/batch-operations-in-aerospike"&gt;Batch Operations&lt;/a&gt; for details.&lt;/li&gt;
&lt;li&gt;A query operates on multiple records that are identified by: 

&lt;ul&gt;
&lt;li&gt;a primary-index query (a scan of a set or the entire namespace) or a secondary-index query (a condition or predicate that uses an existing secondary index), &lt;/li&gt;
&lt;li&gt;a filter expression (which does not use or require a secondary index but is calculated on each record), &lt;/li&gt;
&lt;li&gt;both, or &lt;/li&gt;
&lt;li&gt;neither (in which case all records in the specified namespace and set are selected for the operation). &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Query requests can be used for retrieval or update. The latter are executed in the background mode as mentioned earlier. &lt;/p&gt;

&lt;p&gt;You can find examples of the function variants in the tutorials on &lt;a href="https://developer.aerospike.com/tutorials/java/sql-operations"&gt;SQL Operations&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multi-Op Requests
&lt;/h3&gt;

&lt;p&gt;Aerospike allows multiple bin operations to be performed on a record with a single multi-op &lt;code&gt;Operate&lt;/code&gt; request that takes a list of &lt;code&gt;Operation&lt;/code&gt; objects. The operations are performed in the sequence specified. The results are returned for the bins involved: either their final state or the individual operation results in the specified order. CRUD functions including CDT operations can be specified as Operations in the Operate request. Read/Write Expressions that allow server-side bin computations can also be used. Operate can be used for single-record, batch, and query requests in sync, async, and background (query updates only) modes. &lt;/p&gt;
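
&lt;p&gt;The ordered, single-record nature of a multi-op request can be sketched as follows, with (bin, function) pairs standing in for &lt;code&gt;Operation&lt;/code&gt; objects:&lt;/p&gt;

```python
# Sketch: a multi-op request applies an ordered list of bin operations to one
# record atomically. A later operation sees the result of an earlier one.
record = {"count": 1, "tags": ["a"]}

ops = [
    ("count", lambda v: v + 1),          # increment a numeric bin
    ("tags",  lambda v: v + ["b"]),      # append to a list bin
    ("count", lambda v: v * 10),         # operates on the incremented value
]

for bin_name, fn in ops:
    record[bin_name] = fn(record[bin_name])
```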

&lt;p&gt;You can find examples of multi-op requests in the documentation and tutorials, such as this tutorial on &lt;a href="https://developer.aerospike.com/tutorials/java/java-intro_to_transactions"&gt;Introduction to Transactions&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Collection Data Types (CDTs)
&lt;/h3&gt;

&lt;p&gt;Applications commonly use Collection Data Types (CDTs), namely List and Map, to store objects. CDTs are useful to model data for efficient storage, access, and transactional updates. Please see the blog post on &lt;a href="https://developer.aerospike.com/blog/data-modeling-for-speed-at-scale-part-2"&gt;Data Modeling for Speed At Scale (Part 2)&lt;/a&gt; for details.&lt;/p&gt;

&lt;h3&gt;
  
  
  Expressions
&lt;/h3&gt;

&lt;p&gt;Aerospike Expressions are defined using bins and metadata, API functions, and various operators, and are evaluated on the server to filter records (&lt;code&gt;filter expressions&lt;/code&gt;), return computed values (&lt;code&gt;read expressions&lt;/code&gt;), and write to bins (&lt;code&gt;write expressions&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;A filter expression is a general mechanism usable in most operations. An operation is applied to a record only if the filter expression evaluates to true. A filter expression is specified in the &lt;code&gt;policy&lt;/code&gt; parameter of the operation.  &lt;/p&gt;

&lt;p&gt;As a selection mechanism, a filter expression allows general conditions, whereas a secondary-index query offers superior performance. For best performance, use the most selective condition in a secondary-index query when possible.&lt;/p&gt;
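
&lt;p&gt;A filter expression can be thought of as a predicate evaluated per record on the server; the operation applies only to records for which it is true. A plain-Python sketch:&lt;/p&gt;

```python
# Sketch: a filter expression as a per-record predicate. Records that fail
# the predicate are skipped on the server and never sent to the client.
records = [
    {"name": "Ada", "age": 35, "city": "NYC"},
    {"name": "Bea", "age": 17, "city": "SF"},
    {"name": "Cal", "age": 42, "city": "NYC"},
]

# the "filter expression": age at least 21 AND city equals "NYC"
def filter_expr(rec):
    return rec["age"] >= 21 and rec["city"] == "NYC"

selected = [r["name"] for r in records if filter_expr(r)]
```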

&lt;p&gt;Please find details in the workshop on &lt;a href="https://www.youtube.com/watch?v=ebRLnXvpWaI&amp;amp;list=PLGo1-Ya-AEQCdHtFeRpMEg6-1CLO-GI3G&amp;amp;index=9s"&gt;Unleashing the Power of Expressions&lt;/a&gt; and the tutorial on &lt;a href="https://developer.aerospike.com/tutorials/java/expressions"&gt;Expressions in Aerospike&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  User Defined Functions (UDFs)
&lt;/h3&gt;

&lt;p&gt;User Defined Functions (UDFs) allow custom code to be executed on the server. A UDF is written in Lua, registered with the server, and invoked through a request from the client. &lt;br&gt;
There are two distinct types of UDF: Record and Stream. A Record UDF performs a read and/or write operation on a single record, and can be invoked in sync, async, or background mode. A Stream UDF performs an aggregate computation over multiple records selected by a query. A Stream UDF typically also has a client execution phase that is handled by the client library. &lt;/p&gt;

&lt;p&gt;Please be aware that UDFs may not be appropriate for performance-sensitive use cases. For record-oriented functions, Expressions should be the first choice whenever possible for best performance.&lt;/p&gt;

&lt;p&gt;Please find details in &lt;a href="https://docs.aerospike.com/server/architecture/udf"&gt;UDF documentation&lt;/a&gt; and the tutorials on &lt;a href="https://developer.aerospike.com/tutorials/java/sql-operations"&gt;SQL Operations&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Specifics
&lt;/h2&gt;

&lt;p&gt;Some API specifics are useful to know to avoid potential confusion: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the policy parameter, &lt;/li&gt;
&lt;li&gt;data persistence and expiration, &lt;/li&gt;
&lt;li&gt;write semantics, and &lt;/li&gt;
&lt;li&gt;metadata operations.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Policy
&lt;/h3&gt;

&lt;p&gt;Most API calls take a &lt;code&gt;policy&lt;/code&gt; parameter that carries settings affecting both how the request is performed (such as the timeout and retries) and what data it involves (such as the filter expression). Because these details live in the policy object rather than the call parameters, an operation’s exact semantics may not be obvious from the call signature alone. Examples include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;filter-expression&lt;/code&gt;: condition to select records for the operation, &lt;/li&gt;
&lt;li&gt;
&lt;code&gt;send-key&lt;/code&gt;: store the user-key at record creation, &lt;/li&gt;
&lt;li&gt;
&lt;code&gt;record-exists-action&lt;/code&gt;: different behavior of writes depending on whether the record exists or not, &lt;/li&gt;
&lt;li&gt;
&lt;code&gt;generation-policy&lt;/code&gt;: used to isolate concurrent read-write transactions, &lt;/li&gt;
&lt;li&gt;
&lt;code&gt;expiration&lt;/code&gt;: assign time-to-live duration to a record, and&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;durable-delete&lt;/code&gt;: used to prevent deleted records from reappearing after node failure.&lt;/li&gt;
&lt;/ul&gt;
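
&lt;p&gt;For example, the &lt;code&gt;generation-policy&lt;/code&gt; enables optimistic concurrency: a write carries the generation it read, and the server rejects the write if the record has changed since. A simulation of that check-and-set behavior:&lt;/p&gt;

```python
# Sketch: generation-based check-and-set. The record's generation counter
# increments on every write; a write with a stale generation is rejected.
class GenerationError(Exception):
    pass

record = {"bins": {"balance": 100}, "generation": 7}

def write_with_generation(rec, expected_gen, new_bins):
    if rec["generation"] != expected_gen:
        raise GenerationError("record modified since it was read")
    rec["bins"].update(new_bins)
    rec["generation"] += 1

write_with_generation(record, 7, {"balance": 90})      # succeeds, gen is now 8
try:
    write_with_generation(record, 7, {"balance": 80})  # stale generation
    stale_rejected = False
except GenerationError:
    stale_rejected = True
```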

&lt;p&gt;Read more about policies in the &lt;a href="https://docs.aerospike.com/server/guide/policies"&gt;Policies&lt;/a&gt; section of the documentation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Persistence and Expiration
&lt;/h3&gt;

&lt;p&gt;By default, the server applies updates in memory and flushes them to persistent storage at regular intervals. Updates can also be made immediately durable by using the &lt;code&gt;commit-to-device&lt;/code&gt; option in the namespace configuration.&lt;/p&gt;

&lt;p&gt;By default a record is created never to expire, but a different &lt;code&gt;time-to-live (ttl)&lt;/code&gt; can be specified. An expired record is automatically removed and its space reclaimed, relieving applications of lifecycle management for temporary objects.&lt;/p&gt;
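
&lt;p&gt;The expiration behavior can be sketched with a simulated clock (plain integers stand in for real timestamps):&lt;/p&gt;

```python
# Sketch: time-to-live expiration. Once a record's ttl has elapsed, it stops
# being visible and its space is reclaimed; ttl of None means "never expire".
def is_live(record, now):
    ttl = record.get("ttl")
    elapsed = now - record["created"]
    return ttl is None or ttl > elapsed

record_temp = {"created": 100, "ttl": 50}
record_perm = {"created": 100, "ttl": None}

live_at_120 = is_live(record_temp, 120)      # still within its ttl
live_at_200 = is_live(record_temp, 200)      # past its ttl: expired
perm_at_10000 = is_live(record_perm, 10000)  # no ttl: always live
```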

&lt;h3&gt;
  
  
  Write Semantics
&lt;/h3&gt;

&lt;p&gt;In Aerospike, a &lt;code&gt;put&lt;/code&gt; or write combines Create (Insert), Update, and sometimes Delete. The default &lt;code&gt;create-or-replace&lt;/code&gt; semantics of a write can be modified for alternative behaviors.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;record-exists-action&lt;/code&gt; specified within the &lt;code&gt;write-policy&lt;/code&gt; defines the operation semantics when the record already exists, with create/update/update-only/replace/replace-only variants. &lt;/li&gt;
&lt;li&gt;Map updates have a &lt;code&gt;write-mode&lt;/code&gt; of “update”, “create-only”, or “update-only” to control insertion behavior based on the existence of a key.&lt;/li&gt;
&lt;li&gt;A bin is removed from a record by writing a NULL value to it. When the last bin is removed, the record is automatically deleted.&lt;/li&gt;
&lt;/ul&gt;
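
&lt;p&gt;The &lt;code&gt;record-exists-action&lt;/code&gt; variants can be sketched as follows; the action names here are simplified for illustration:&lt;/p&gt;

```python
# Sketch: how record-exists-action changes the semantics of the same put().
class WriteError(Exception):
    pass

def put(store, key, bins, action="create_or_replace"):
    exists = key in store
    if action == "create_only" and exists:
        raise WriteError("record already exists")
    if action == "update_only" and not exists:
        raise WriteError("record does not exist")
    if action == "update" and exists:
        store[key] = {**store[key], **bins}   # merge new bins into the record
    else:
        store[key] = dict(bins)               # create, or replace wholesale

store = {}
put(store, "k1", {"a": 1}, action="create_only")
put(store, "k1", {"b": 2}, action="update")        # merges, keeps bin "a"
try:
    put(store, "k1", {"c": 3}, action="create_only")
    create_only_failed = False
except WriteError:
    create_only_failed = True
```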

&lt;h3&gt;
  
  
  Metadata Operations
&lt;/h3&gt;

&lt;p&gt;Not all metadata operations are available through the API in the client libraries.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Namespace: A namespace is added through the config and requires a server restart. The truncate API removes all records in a set or the entire namespace. &lt;/li&gt;
&lt;li&gt;Set: A set is automatically created when the first record is inserted in the set. Records in a set can be truncated using the truncate API.&lt;/li&gt;
&lt;li&gt;Index: The API supports creation and deletion of a set index and secondary index. A secondary index is defined on a bin or an element in List or Map for a specific value type.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Please view the tutorial &lt;a href="https://developer.aerospike.com/tutorials/java/sql_update"&gt;SQL: Updates&lt;/a&gt; for examples.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance Features
&lt;/h2&gt;

&lt;p&gt;Aerospike is purpose-built to deliver high performance for large data sets with a small cluster size. All features are designed, and tradeoffs made, with this overarching goal. Data modeling is key to achieving the best performance. Please see the blog post &lt;a href="https://developer.aerospike.com/blog/data-modeling-for-speed-at-scale"&gt;Data Modeling for Speed At Scale&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Each application is different, and Aerospike provides flexibility to accommodate custom performance considerations. At the same time, a developer should be aware of the following performance features (described earlier) that are commonly used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Collection Data Types (CDTs)&lt;/li&gt;
&lt;li&gt;Multi-op requests&lt;/li&gt;
&lt;li&gt;Batch requests&lt;/li&gt;
&lt;li&gt;Expressions&lt;/li&gt;
&lt;li&gt;Secondary indexes&lt;/li&gt;
&lt;li&gt;Set indexes&lt;/li&gt;
&lt;li&gt;Complex Data Types - Binary, HLL, GeoJSON&lt;/li&gt;
&lt;li&gt;User Defined Functions (UDFs)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Useful Libraries
&lt;/h2&gt;

&lt;p&gt;Here are some useful libraries. &lt;/p&gt;

&lt;h3&gt;
  
  
  Document API
&lt;/h3&gt;

&lt;p&gt;The Aerospike Document API provides CRUD operations at arbitrary points within a JSON document. It allows a JSONPath argument to specify parts of the document simply and expressively to apply these methods. Check out the tutorial for the &lt;a href="https://developer.aerospike.com/tutorials/java/doc_api"&gt;Document API&lt;/a&gt;.&lt;/p&gt;
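
&lt;p&gt;The idea of addressing a point inside a document by path can be sketched with a toy resolver (this path syntax is illustrative only, not the library’s actual JSONPath support):&lt;/p&gt;

```python
# Sketch: resolving a dotted path into a nested JSON-like document, the idea
# behind the Document API's JSONPath argument. Numeric path parts index lists.
def get_path(doc, path):
    node = doc
    for part in path.strip("$.").split("."):
        if part.isdigit():
            node = node[int(part)]   # list index
        else:
            node = node[part]        # object field
    return node

doc = {"user": {"orders": [{"id": "o1"}, {"id": "o2"}]}}
second_order_id = get_path(doc, "$.user.orders.1.id")
```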

&lt;h3&gt;
  
  
  Java Object Mapper
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://github.com/aerospike/java-object-mapper"&gt;object mapper library&lt;/a&gt; uses Java annotations to define the Aerospike semantics for the saving and loading behavior. Annotations are specified next to the definitions of a class, methods, and fields. The object mapper makes managing persistent data easier to implement, easier to understand, and less error prone. Check out the Java Object Mapper &lt;a href="https://www.youtube.com/watch?v=QQhdX661raM&amp;amp;list=PLGo1-Ya-AEQCdHtFeRpMEg6-1CLO-GI3G&amp;amp;index=6"&gt;workshop&lt;/a&gt; and &lt;a href="https://developer.aerospike.com/tutorials/java/object_mapper"&gt;tutorial&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;The Aerospike client API provides a rich set of capabilities. This post provided a high-level view of the Aerospike architecture and API to give developers a broader understanding of its capabilities and help them become more productive and effective. It offered an orientation to the client API by describing the core concepts, functional elements, key specifics, common performance features, and useful libraries, and pointed out resources for further exploration of specific topics.&lt;/p&gt;

&lt;h3&gt;
  
  
  Related links:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://developer.aerospike.com"&gt;Aerospike Developer Hub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.aerospike.com/server/architecture/overview"&gt;Architecture Overview&lt;/a&gt; (documentation)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.aerospike.com/server/architecture/clients"&gt;Smart Client&lt;/a&gt; (documentation)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://developer.aerospike.com/client"&gt;Clients&lt;/a&gt;  (documentation)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://aerospike.com/blog/developers-understanding-aerospike-transactions/"&gt;Developers: Understanding Aerospike Transactions&lt;/a&gt; (blog post)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://developer.aerospike.com/tutorials/java/sql-operations"&gt;SQLOperations&lt;/a&gt; (interactive tutorials)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://developer.aerospike.com/tutorials/java/async_ops"&gt;Asynchronous Operations&lt;/a&gt;(interactive tutorial)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://developer.aerospike.com/tutorials/java/java-intro_to_transactions"&gt;Introduction to Transactions&lt;/a&gt; (interactive tutorial)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://developer.aerospike.com/blog/batch-operations-in-aerospike"&gt;Batch Operations&lt;/a&gt;  (blog post)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://developer.aerospike.com/tutorials/java/cdt"&gt;Collection Data Types (CDTs)&lt;/a&gt; (interactive tutorials)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://developer.aerospike.com/blog/data-modeling-for-speed-at-scale-part-2"&gt;Data Modeling for Speed At Scale (Part 2)&lt;/a&gt;  (blog post)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.aerospike.com/server/guide/data-types/cdt"&gt;Collection Data Types (CDTs)&lt;/a&gt; (documentation)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.youtube.com/watch?v=ebRLnXvpWaI&amp;amp;list=PLGo1-Ya-AEQCdHtFeRpMEg6-1CLO-GI3G&amp;amp;index=9s"&gt;Unleashing the Power of Expressions&lt;/a&gt; (workshop)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://developer.aerospike.com/tutorials/java/expressions"&gt;Expressions in Aerospike&lt;/a&gt; (interactive tutorial)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.aerospike.com/server/architecture/udf"&gt;User Defined Functions (UDFs)&lt;/a&gt; (documentation)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.aerospike.com/server/guide/policies"&gt;Policies&lt;/a&gt; (documentation)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://developer.aerospike.com/tutorials/java/doc_api"&gt;Document API&lt;/a&gt;  (interactive tutorial)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.youtube.com/watch?v=QQhdX661raM&amp;amp;list=PLGo1-Ya-AEQCdHtFeRpMEg6-1CLO-GI3G&amp;amp;index=6"&gt;Java Object Mapper&lt;/a&gt; (workshop)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/aerospike/java-object-mapper"&gt;Java Object Mapper&lt;/a&gt; (interactive tutorial)&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aerospike</category>
      <category>api</category>
      <category>orientation</category>
      <category>smartclient</category>
    </item>
    <item>
      <title>Data Modeling for Speed At Scale (Part 2)</title>
      <dc:creator>Neel Phadnis</dc:creator>
      <pubDate>Wed, 29 Jun 2022 00:10:57 +0000</pubDate>
      <link>https://dev.to/aerospike/data-modeling-for-speed-at-scale-part-2-2i45</link>
      <guid>https://dev.to/aerospike/data-modeling-for-speed-at-scale-part-2-2i45</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BlPiLNBb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1600/0%2APhgUE2UT719S0iBB" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BlPiLNBb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1600/0%2APhgUE2UT719S0iBB" alt="(Source: Photo by Pietro Jeng on [Unsplash](https://unsplash.com/) )" width="800" height="533"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Source: Photo by Pietro Jeng on &lt;a href="https://unsplash.com/"&gt;Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This post focuses on the use of Collection Data Types (CDTs) for data modeling in Aerospike with a large number of objects. This is Part 2 in the two-part series on Data Modeling. You can find the first post &lt;a href="https://developer.aerospike.com/blog/data-modeling-for-speed-at-scale"&gt;here&lt;/a&gt;. &lt;/p&gt;

&lt;h2&gt;
  
  
  Context
&lt;/h2&gt;

&lt;p&gt;Data Modeling is the exercise of mapping application objects onto the model and mechanisms provided by the database for persistence, performance, consistency, and ease of access.&lt;/p&gt;

&lt;p&gt;Aerospike Database is purpose built for applications that require predictable sub-millisecond access to billions and trillions of objects and need to store many terabytes and petabytes of data, while keeping the cluster size - and therefore the operational costs - small. The goals of large data size and small cluster size mean the capacity of high-speed data storage on each node must be high.&lt;/p&gt;

&lt;p&gt;Aerospike pioneered the database technology to effectively use SSDs to provide high-capacity high-speed persistent storage per node. Among its key innovations are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Accessing SSDs like directly addressable memory, which results in superior performance,&lt;/li&gt;
&lt;li&gt;Supporting a hybrid memory architecture for index and data in DRAM, PMEM, or SSD,&lt;/li&gt;
&lt;li&gt;Implementing proprietary algorithms for consistent, resilient, and scalable storage across cluster nodes, and&lt;/li&gt;
&lt;li&gt;Providing a Smart Client for single-hop access to data that adapts to changes in the cluster.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Therefore, choosing Aerospike Database as the data store is a significant step toward enabling your application for speed at scale. It lets a company of any size leverage large amounts of data to solve real-time business problems and continue to scale in the future while keeping operational costs low.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://developer.aerospike.com/blog/data-modeling-for-speed-at-scale"&gt;Part 1&lt;/a&gt; described many capabilities that Aerospike provides toward speed-at-scale such as indexes, data compression, server-side operations, namespace and cluster configuration, multi-op requests, batch requests, and more. &lt;/p&gt;

&lt;p&gt;This post focuses on Collection Data Types (CDTs), specifically List and Map data types, and discusses how applications can optimize speed as well as storage density by leveraging them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;. . . .&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Collection Data Types (CDTs)
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;List&lt;/code&gt; and &lt;code&gt;Map&lt;/code&gt; are the Collection Data Types (&lt;code&gt;CDTs&lt;/code&gt;) in Aerospike. A List is a tuple or array of values, and a Map is a dictionary of key-value pairs. The element value can be of any supported type, including List and Map, and CDTs can be nested at an arbitrary level. &lt;/p&gt;

&lt;p&gt;CDTs are essential to model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;aggregation of related objects in one record, allowing transactional semantics across multi-object updates,&lt;/li&gt;
&lt;li&gt;containers for collections of objects, to store a large number of objects effectively, and&lt;/li&gt;
&lt;li&gt;complex objects such as JSON documents. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We will briefly describe key CDT concepts like nesting and ordering before diving into specific modeling patterns and techniques. &lt;/p&gt;

&lt;h3&gt;
  
  
  Nested Elements and Context Path
&lt;/h3&gt;

&lt;p&gt;A nested element in a CDT can be accessed directly by using its &lt;code&gt;context-path&lt;/code&gt;. A context-path describes the path from the root or the top level of the CDT to a nested element, where each node in context-path uniquely identifies an element at that level by key, index, value, or rank (value order).  The context-path points to only one element which can be of any data type, and an operation on the element identified by a context-path must be a valid operation on the element’s data type.  &lt;/p&gt;

&lt;p&gt;Consider a nested object represented as a Map (level 0): it has a List at level 1, and a Map at level 2.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Object = {  “id1”: [ {“a”: 1, “b”: 2}, {“c”: 3, “d”: 4} ], 
            “id2”: [ {“e”: 5, “f”: 6}, {“g”: 7, “h”: 8}] }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A context path to the nested element "c" can look like: &lt;code&gt;[By-Key("id1"), By-Index(1), By-Key("c")]&lt;/code&gt;.  &lt;/p&gt;

&lt;p&gt;Note, in the Aerospike API, the top CDT level (level 0) is implied by the bin and only lower level elements require a non-null context-path. &lt;/p&gt;

&lt;p&gt;As a performance and convenience feature, CDTs allow missing interim levels to be created as part of a nested-element create operation, since checking all path nodes before creating a nested element in the application can be inconvenient and slow. For example, adding an element to a Map in a List, neither of which exists, will first create the List, add the Map to the List, and then add the element to the Map. &lt;/p&gt;
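&lt;p&gt;A minimal sketch of this create-missing-levels behavior, in plain Python rather than the Aerospike client API (for simplicity every interim level here is a map, though real context paths can also address elements by index, value, or rank):&lt;/p&gt;

```python
def put_nested(root, context_path, key, value):
    """Walk a context path of map keys, creating missing interim levels."""
    node = root
    for ctx_key in context_path:
        node = node.setdefault(ctx_key, {})  # create the interim level if absent
    node[key] = value

record_bin = {}                              # neither interim level exists yet
put_nested(record_bin, ["id1", "inner"], "c", 3)
# record_bin → {"id1": {"inner": {"c": 3}}}
```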

&lt;h3&gt;
  
  
  Global Ordering of Values
&lt;/h3&gt;

&lt;p&gt;Applications commonly use an Ordered List for use cases such as a leaderboard. In Aerospike, List elements can be of any type, and therefore Aerospike defines a deterministic ordering within and across supported types. &lt;/p&gt;

&lt;p&gt;A CDT is frequently used as a container for objects that are represented as either List or Map. To support retrieval by rank (value-order) as well as by specific value-range, Aerospike defines how List values and Map values are compared. For example, two lists are compared by comparing their respective elements in stored order until the first inequality results or the end of either or both lists is reached. As well, there is a defined order across the data types. For example, any integer value is ranked lower than any string value. You can view the ordering rules &lt;a href="https://docs.aerospike.com/server/guide/data-types/cdt-ordering"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This deterministic comparison of values provides the basis for content or value based selection of elements such as By-Value and By-Value-Range.&lt;/p&gt;
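&lt;p&gt;The effect of such deterministic cross-type ordering can be sketched in plain Python (an illustration with assumed simplified rules, not Aerospike's exact ordering table):&lt;/p&gt;

```python
# Simplified rules for the sketch: integers rank below strings, strings
# below lists; lists compare element by element in stored order.
TYPE_RANK = {int: 0, str: 1, list: 2}

def order_key(value):
    if isinstance(value, list):
        return (TYPE_RANK[list], [order_key(v) for v in value])
    return (TYPE_RANK[type(value)], value)

values = [[1, "b"], 5, "apple", [1, "a"]]
print(sorted(values, key=order_key))
# → [5, 'apple', [1, 'a'], [1, 'b']]: integers first, then strings,
#   then lists compared elementwise until the first inequality
```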

&lt;p&gt;&lt;strong&gt;. . . .&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Modeling Objects with CDTs
&lt;/h2&gt;

&lt;p&gt;CDTs make it possible to aggregate related objects in a single record, and therefore to perform transactional updates across them. &lt;/p&gt;

&lt;p&gt;Application objects can be stored in multiple ways in Aerospike: As a record, as a List, or as a Map.&lt;/p&gt;

&lt;h3&gt;
  
  
  Storing Objects As a Record
&lt;/h3&gt;

&lt;p&gt;Object fields are stored in record bins. Flat objects can be stored in simple bins, that is, without use of CDTs, but objects with array and map fields require CDT bins. &lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Object:  id = 4, name = “Adam Smith”, start-date = 1/1/2015, department = finance, salary = 100000
Record bins: id:4, name: “Adam Smith”, start-date: 1/1/2015, department: finance, salary: 100000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Modeling objects as records has the following advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mapping of object fields to record bins is simple to understand.&lt;/li&gt;
&lt;li&gt;Secondary indexes, which can only be defined at the bin level, enable field-value based access.&lt;/li&gt;
&lt;li&gt;Certain data types like HyperLogLog (HLL) and BLOB can only be stored in a bin.&lt;/li&gt;
&lt;li&gt;XDR sync granularity can be specified at the field (bin) level, allowing greater control and efficiency.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Storing Objects As a List
&lt;/h3&gt;

&lt;p&gt;As a List, an object is stored as a tuple of its field values, where each field is placed at a specific position in the List. &lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Object: id = 4, name = “Adam Smith”, start-date = 1/1/2015, department = finance, salary = 100000
List bin: [4, “Adam Smith”, “20150101”, “finance”, 100000]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Lists offer these unique advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The tuple form eliminates the redundant storage of keys required in a Map. Objects stored as tuples must use an Unordered (more precisely, insertion- or application-ordered) List, which preserves the tuple order irrespective of the field values. The application must manage the object schema, that is, how the tuple order maps to the object fields, as well as schema evolution. The &lt;a href="https://github.com/aerospike/java-object-mapper"&gt;Object Mapper library&lt;/a&gt; can manage these aspects transparently.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Lists also allow convenient value-based selection of objects. The value of a List object is based on the initial field in the tuple. For example, objects represented as [type, size, color] can be retrieved by matching type values.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A wildcard based value match can be conveniently specified. For example, get-by-value(collection, ["type x", *]) will match all List tuples in the collection with "type x" as their first element.&lt;/li&gt;
&lt;li&gt;A value range based selection can be conveniently specified using the value delimiters NIL and INF, with NIL denoting the absolute lowest and INF the absolute highest value. For example, get-by-value-range(collection, ["value1", NIL], ["value2", INF]) selects all List objects with the first element between "value1" and "value2" (both values inclusive).
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
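&lt;p&gt;The wildcard and range selections above can be sketched in plain Python (not the Aerospike client API; WILDCARD, NIL, and INF are stand-ins for the corresponding marker values):&lt;/p&gt;

```python
WILDCARD, NIL, INF = object(), object(), object()

def get_by_value(collection, pattern):
    # A trailing WILDCARD matches any remaining fields of the tuple.
    if pattern and pattern[-1] is WILDCARD:
        prefix = pattern[:-1]
        return [t for t in collection if t[:len(prefix)] == prefix]
    return [t for t in collection if t == pattern]

def get_by_value_range(collection, low, high):
    # Simplification for the sketch: [v1, NIL] .. [v2, INF] selects on
    # the first field, both endpoints inclusive.
    return [t for t in collection if low[0] <= t[0] <= high[0]]

objects = [["type x", 1], ["type y", 2], ["type x", 3]]
get_by_value(objects, ["type x", WILDCARD])   # → [["type x", 1], ["type x", 3]]
get_by_value_range(objects, ["type x", NIL], ["type y", INF])   # → all three
```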

&lt;h3&gt;
  
  
  Storing Objects As a Map
&lt;/h3&gt;

&lt;p&gt;As a Map, an object is stored as field specific key-value pairs. &lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Object: id = 4, name = “Adam Smith”, start-date = 1/1/2015, department = finance, salary = 100000
Map bin: {“id”: 4, “name”: “Adam Smith”, “start-date”: “20150101”, “department”: “finance”, “salary”: 100000}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Advantages of using a Map are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Maps provide a natural way for storing JSON documents. &lt;/li&gt;
&lt;li&gt;The object schema is self-defining, removing the burden of managing it from the application.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Map values are not as convenient to compare for value based access as List values. A wildcard cannot be used to denote a range of Map values; for example, {a: *} or {a: 1, *} cannot be used for value comparison. Matching an exact value or value range requires specifying all keys in the Map, which is impractical for large Maps. For these reasons, objects should be modeled as List tuples if value based access is required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;. . . .&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Speed-At-Scale with CDTs as Containers
&lt;/h2&gt;

&lt;p&gt;CDTs unlock many advantages toward speed-at-scale when they are used as a container for a collection of objects. &lt;/p&gt;

&lt;h3&gt;
  
  
  Performance
&lt;/h3&gt;

&lt;p&gt;Related objects stored in a CDT container can be written or retrieved together in a single operation, which can provide a significant throughput improvement.&lt;/p&gt;

&lt;p&gt;Aerospike CDTs support the usual list and map operations, but more significant are the many common patterns that require more complex processing. These are performed fully on the server side, eliminating transfers of data to the client, and therefore provide superior performance.&lt;/p&gt;

&lt;p&gt;Additional performance aspects include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Multi-element updates are akin to batch operations on a CDT. Applications can get the desired semantics using the constructs provided for handling individual element failures.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Many selection criteria are available:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Access based on value and “rank” (value order) in addition to the normal index or key based access.
&lt;/li&gt;
&lt;li&gt;Single, multi, and range selectors by index, key, value, and rank.&lt;/li&gt;
&lt;li&gt;Vicinity or relative selection, for example, using “relative rank” with respect to a value.&lt;/li&gt;
&lt;li&gt;Negate the selection criterion with a convenient INVERT flag.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Multiple return types allow just the required data to be requested. For example, COUNT for the number of elements selected, and NONE when no values need to be returned such as when adding elements, among others.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;CDT operations can be used in &lt;a href="https://developer.aerospike.com/tutorials/java/expressions"&gt;Expressions&lt;/a&gt;, which is another mechanism for efficient server side execution and minimizing the data transfer cost. Expressions allow server side record selection (filter expression), reading (read expression) or writing (write expression) a computed value.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can review List performance analysis &lt;a href="https://docs.aerospike.com/server/guide/data-types/cdt-list-performance"&gt;here&lt;/a&gt; and Map performance analysis &lt;a href="https://docs.aerospike.com/server/guide/data-types/cdt-map-performance"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scale
&lt;/h3&gt;

&lt;p&gt;Each record in Aerospike incurs a 64-byte storage overhead in the primary index, which is typically stored in DRAM. For a large number of objects, the DRAM size and cost can be significant if each object is stored in its own record. CDTs, when used as containers, can store multiple objects in a single record, providing greater density and scale.&lt;/p&gt;

&lt;p&gt;Further, List provides a compact tuple form for storing objects which eliminates the bin overhead. &lt;/p&gt;

&lt;h3&gt;
  
  
  Ordering
&lt;/h3&gt;

&lt;p&gt;Objects stored as records cannot be retrieved from the server in any specific order. Sorting must be performed on the client side. &lt;/p&gt;

&lt;p&gt;On the other hand, a CDT supports identifying elements by different orders: index or key, as well as value and rank order. For instance, an Ordered List maintains a value-based sort order, allowing easy modeling of common patterns like leaderboards, time-ordered events, and ordered groups. An Unordered List maintains insertion order and can be used to store, for example, a queue, or objects in tuple form as described earlier. Note, such an object should be stored as a tuple with the field significant for value ordering, for example type, id, or timestamp, placed first.&lt;/p&gt;

&lt;p&gt;CDTs maintain internal indexes for fast access by key, index, or value.&lt;/p&gt;

&lt;h3&gt;
  
  
  Enforcing Unique Values
&lt;/h3&gt;

&lt;p&gt;Ordered Lists provide a way to check and enforce uniqueness of values at the time of insertion.&lt;/p&gt;
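&lt;p&gt;A minimal sketch in plain Python (not the Aerospike client API) of how an ordered insert can detect and reject a duplicate value at the insertion point:&lt;/p&gt;

```python
import bisect

def add_unique(ordered, value):
    """Insert value into a sorted list, rejecting duplicates."""
    i = bisect.bisect_left(ordered, value)
    if i < len(ordered) and ordered[i] == value:
        return False            # duplicate detected at the insertion point
    ordered.insert(i, value)
    return True

scores = []
for v in [30, 10, 20, 10]:
    add_unique(scores, v)
# scores → [10, 20, 30]; the second 10 was rejected
```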

&lt;h3&gt;
  
  
  Bin Limits
&lt;/h3&gt;

&lt;p&gt;In Aerospike, bin names can be up to 14 characters long, and distinct bin names in a namespace are limited to 32K. Storing objects as Maps or Lists imposes no limit on the length or the number of distinct object field names. &lt;/p&gt;

&lt;h2&gt;
  
  
  Modeling Object Collections
&lt;/h2&gt;

&lt;p&gt;When an object is stored as a record, storing a collection of objects is a matter of mapping it on one or more sets and one or more namespaces. &lt;a href="https://developer.aerospike.com/blog/data-modeling-for-speed-at-scale"&gt;Part 1&lt;/a&gt; talks about some of the considerations in organizing records in sets and namespaces. &lt;/p&gt;

&lt;p&gt;Multiple related objects can be stored in a single CDT container for access and storage efficiencies as discussed above. If the number of objects in the collection exceeds the record size limit, the collection must be split by some criteria across multiple records. The  considerations in organizing records in sets and namespaces from &lt;a href="https://developer.aerospike.com/blog/data-modeling-for-speed-at-scale"&gt;Part 1&lt;/a&gt; apply in this case. CDT containers work especially well for objects that will be stored or retrieved together. &lt;/p&gt;

&lt;p&gt;When using the CDTs for collections, key design considerations include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Direct access: When direct access to the object in a collection is needed, it can be achieved by proper object id design.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Query access: Multiple objects can be accessed by some criteria with a scan using filter expression, and/or a query using a secondary index.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;How objects in the collection are stored - as a tuple in a List or as key-value pairs in a Map - affects direct access and query. Tuple or List object representation offers better support for both as described earlier. &lt;/p&gt;

&lt;h3&gt;
  
  
  Object ID Design for Direct Access
&lt;/h3&gt;

&lt;p&gt;Objects stored in CDTs may need to be independently accessible. In this case, the object identifier must be designed for direct access.&lt;/p&gt;

&lt;p&gt;To enable direct access using the object id, object ids should contain record id (key), say as a prefix. With this, the record key extracted from the object id can be used to navigate first to the record, and then to the specific object within the CDT. All objects stored in a record will contain the common record key. For example, if a CDT holds all store objects in a region, the record key can be the region id, which is also embedded in all store ids of the region. &lt;/p&gt;

&lt;p&gt;Note, it is not necessary to aggregate objects in a record by a real world relationship. A subset of the hashed object id bits may be used as a record id. See &lt;a href="https://medium.com/aerospike-developer-blog/record-aggregation-in-aerospike-for-performance-and-economy-222673ee5b83"&gt;this blog post&lt;/a&gt; for the details of this scheme.&lt;/p&gt;
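&lt;p&gt;A minimal sketch of this id scheme, assuming a hypothetical id format in which the record key prefixes the object id with a separator:&lt;/p&gt;

```python
# Hypothetical id format: "<record-key><sep><local-id>", e.g. "region7-store42".
def split_object_id(object_id, sep="-"):
    record_key, _, local_id = object_id.partition(sep)
    return record_key, local_id

record_key, store_id = split_object_id("region7-store42")
# record_key ("region7") locates the record; store_id ("store42") then
# locates the object within the record's CDT
```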

&lt;h3&gt;
  
  
  Direct Access with List and Map Object Representation
&lt;/h3&gt;

&lt;p&gt;A List object can be accessed within its container (List or Map) by value using the object id, if the id is stored as the first field. A Map object is not easy to access in a List container because a Map does not allow wildcard based value comparison; for example, access by id like get-by-value(list, {"id": "id1", *}) is not possible.&lt;/p&gt;

&lt;p&gt;Consider a List of Lists:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[   ["id1", 10, 11,…], ["id2", 20, …], 
    ["id3", 30, …], ["id4", 10, 101…]  ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Direct access using id: &lt;code&gt;get-by-value(outerList, ["id1", *])&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Or a Map of Lists&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{   "id1": ["id1", 10, 11,…], "id2": ["id2", 20, …], 
    "id3": ["id3", 30, …], "id4": ["id4", 10, 101…]  }`
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Direct access using id: &lt;code&gt;get-by-key(map, "id1")&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;Now consider a List of Maps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[   {“id”:"id1", “a”: 10, “b”: 11,…}, {‘“id”: "id2", “a”: 20, …}, 
    {“id”: "id3", “a”: 30, …}, {‘“id”: "id4", “a': 10, “b”: 101…}  ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here it is not possible to directly access the object with id "id1" in the List container.&lt;/p&gt;

&lt;p&gt;Or a Map of Maps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{   “id1”: {“id”:"id1", “a”: 10, “b”: 11,…}, “id2”: {“id”: id2’, “a”: 20, …},   
    “id3”: {“id”: "id3", “a”: 30, …}, “id4”: {“id”: "id4", “a': 10, “b”: 101…}  }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Direct access using id: &lt;code&gt;get-by-key(outerMap, "id1")&lt;/code&gt; &lt;/p&gt;
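&lt;p&gt;The direct-access differences between these layouts can be sketched in plain Python (not the Aerospike client API; the helper name is illustrative):&lt;/p&gt;

```python
list_of_lists = [["id1", 10, 11], ["id2", 20]]
map_of_lists = {"id1": ["id1", 10, 11], "id2": ["id2", 20]}
list_of_maps = [{"id": "id1", "a": 10}, {"id": "id2", "a": 20}]
map_of_maps = {"id1": {"id": "id1", "a": 10}, "id2": {"id": "id2", "a": 20}}

def get_by_leading_value(container, object_id):
    # List container, tuple objects: match on the first field
    return next(t for t in container if t[0] == object_id)

assert get_by_leading_value(list_of_lists, "id1") == ["id1", 10, 11]
assert map_of_lists["id1"] == ["id1", 10, 11]      # direct key lookup
assert map_of_maps["id1"]["a"] == 10               # direct key lookup
# list_of_maps offers no leading value or key to match without scanning
# each Map element, mirroring the limitation described above.
```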

&lt;h3&gt;
  
  
  Organizing Collections in Records
&lt;/h3&gt;

&lt;p&gt;Due to the limit on the size of a record, a large object collection may need to be split across multiple CDTs, each stored in a separate record. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Single record collection: The record key represents the collection id.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Multi-record collection: Record keys are generated by appending record-specific group ids to the collection id. For example, the ticker data for a stock can be organized in multiple records whose keys are the stock symbol (collection id) + date (group id).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Organizing Collections in Sets and Namespaces
&lt;/h3&gt;

&lt;p&gt;All record organization concepts apply to records holding collections in CDTs as described in &lt;a href="https://developer.aerospike.com/blog/data-modeling-for-speed-at-scale"&gt;Part 1&lt;/a&gt;. For instance, a multi-record collection can be further organized if necessary in one or more dedicated sets, over one or more namespaces. &lt;/p&gt;

&lt;h2&gt;
  
  
  Querying Collections
&lt;/h2&gt;

&lt;p&gt;A query or predicate based access to multiple objects is provided with these mechanisms: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;In a single CDT:&lt;br&gt;
Internal indexes maintained within a CDT allow fast access to elements by specific keys (Map only), indexes, values, and ranks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Across multiple records:&lt;br&gt;
A secondary index can be defined on List values, Map keys, and Map values, to allow queries across List or Map bins in multiple records. The secondary index efficiently identifies all matching records, and with appropriate CDT operation, specific CDT elements from each record can be retrieved. &lt;br&gt;
(Note indexing at any CDT level is planned for Aerospike Database 6.1. Prior to 6.1, CDT elements below the top level of the CDT can be replicated for indexing in a separate record bin or at the top level of the CDT.)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A filter expression can be specified in an operation so that the operation is applied only to matching records. A filter expression can be defined using the CDTs in the record. For example, a filter expression can select a record if a CDT bin has a nested List larger than a certain size or a certain value exists in a nested Map element.&lt;/p&gt;

&lt;p&gt;As discussed earlier, it is easy to define value based predicates on List tuples using the wildcard, NIL, and INF values. Therefore, to be able to use filter expressions with value based predicates, use tuple representation. Note, the value based predicates cannot use arbitrary fields in the object tuple, only the first field.&lt;/p&gt;

&lt;h2&gt;
  
  
  Managing Temporary Objects
&lt;/h2&gt;

&lt;p&gt;Record-level metadata such as the expiration time and update time are not applicable to individual objects within a CDT if those objects can be updated or expire independently. In such cases, these mechanisms must be implemented on a per-object basis. Such housekeeping expiration of objects can be piggybacked on regular application operations using a multi-op request with CDT delete operations based on rank or value range ("delete all elements with an expiration-time field less than the current time"). &lt;/p&gt;

&lt;p&gt;Index and rank based selection also allows one to keep the size of a CDT capped to a specified maximum.&lt;/p&gt;
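&lt;p&gt;Both trimming patterns can be sketched in plain Python (not the Aerospike client API) for a value-ordered container of [expiration-time, ...] tuples:&lt;/p&gt;

```python
def remove_expired(events, now):
    # value-range removal: drop all tuples whose leading time field is past
    return [e for e in events if e[0] >= now]

def cap_size(events, max_size):
    # index-range removal: keep only the newest max_size tuples
    return events[-max_size:]

events = [[100, "a"], [200, "b"], [300, "c"]]
remove_expired(events, now=150)   # → [[200, "b"], [300, "c"]]
cap_size(events, max_size=2)      # → [[200, "b"], [300, "c"]]
```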

&lt;h2&gt;
  
  
  Integration with Performance Features
&lt;/h2&gt;

&lt;p&gt;CDTs are integrated into and benefit from the other Aerospike performance features discussed in &lt;a href="https://developer.aerospike.com/blog/data-modeling-for-speed-at-scale"&gt;Part 1&lt;/a&gt; including:&lt;/p&gt;

&lt;h3&gt;
  
  
  Sets and set indexes
&lt;/h3&gt;

&lt;p&gt;You can organize a collection efficiently in one or more sets, and define set indexes for scan performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multi-op requests
&lt;/h3&gt;

&lt;p&gt;You can perform multiple operations on a record in a single request. For instance, in the same request, you can add items to a list, sort it, and get its new size and top N items.&lt;/p&gt;

&lt;h3&gt;
  
  
  Batch requests
&lt;/h3&gt;

&lt;p&gt;You can conveniently issue one request for operations on multiple records hosting a collection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;. . . .&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Transactions and Consistency
&lt;/h2&gt;

&lt;p&gt;Data stored in a CDT can be updated in a single transaction because all single record operations support transactional semantics in Aerospike. Different parts within the CDT can be updated or retrieved atomically and efficiently using a single multi-op request.&lt;/p&gt;

&lt;h2&gt;
  
  
  Maximum Record Size
&lt;/h2&gt;

&lt;p&gt;A namespace is configured with a maximum record size, up to an upper limit of 8MB, which also represents the unit of transfer in a device IO operation. Record data cannot exceed the configured maximum record size, which is an important consideration for large-object as well as multi-object record design. The application design may consider workarounds such as &lt;a href="https://github.com/aerospike-examples/adaptive-map"&gt;a multi-record map&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;. . . .&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Examples
&lt;/h2&gt;

&lt;p&gt;The following examples illustrate the use of CDTs. &lt;/p&gt;

&lt;h3&gt;
  
  
  Events Data for Real-Time Applications
&lt;/h3&gt;

&lt;p&gt;Event objects can be stored for access by event-id as well as by timestamp: place them in a Map container keyed by event-id, with each event object stored as a List tuple whose first field is the timestamp.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;{ event-id: [timestamp, other event attributes], … }&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Event-id based access is a simple key access in the Map container. You can retrieve the event that is at or closest to a timestamp with:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;get-by-relative-rank(map, value=timestamp, relative-rank=-1, count=2)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The above operation returns two elements from the events container Map: the one with the value just prior to the timestamp (indicated by the relative-rank of -1), and the one after it, which is either the event at the exact timestamp or the one immediately after. This scheme handles both the case where the timestamp falls outside the range of existing timestamps and the case where no event exists at that exact timestamp.&lt;/p&gt;

&lt;p&gt;The size of the container can be efficiently managed with remove-by-index-range (trim to specific size) or remove-by-value-range (remove old events) operations.&lt;/p&gt;
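&lt;p&gt;The relative-rank lookup can be sketched in plain Python (not the Aerospike client API), with event values as [timestamp, ...] tuples ordered by timestamp:&lt;/p&gt;

```python
def get_by_relative_rank(values, probe, relative_rank, count):
    ordered = sorted(values)
    # rank of the first value >= the probe value
    rank = next((i for i, v in enumerate(ordered) if v >= probe), len(ordered))
    start = max(rank + relative_rank, 0)
    return ordered[start:start + count]

events = [[100, "a"], [200, "b"], [300, "c"]]
get_by_relative_rank(events, probe=[250], relative_rank=-1, count=2)
# → [[200, "b"], [300, "c"]]: the event just before timestamp 250,
#   and the one at or immediately after it
```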

&lt;h3&gt;
  
  
  Rank Ordered Lists
&lt;/h3&gt;

&lt;p&gt;Lists that need to be retrieved by rank, such as players with the highest game scores, blogs with the most views, and videos with the most likes, can be conveniently modeled with an Ordered List container in which each object is stored as a tuple in an Unordered (application-ordered) List.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;[ [score, other attributes], … ]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;To obtain N objects with the highest scores:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;get-by-rank-range(list, start-rank=-N)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;In the above operation, the start-rank indicates the element with the Nth highest score. All elements after that are returned. The operation effectively returns the top N ranked elements. &lt;/p&gt;
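&lt;p&gt;The top-N retrieval can be sketched in plain Python (not the Aerospike client API); rank orders values ascending, so a negative start rank counts back from the highest value:&lt;/p&gt;

```python
def get_by_rank_range(values, start_rank):
    ordered = sorted(values)
    return ordered[start_rank:]   # negative index: the top -start_rank items

scores = [[70, "ann"], [90, "bo"], [80, "cy"], [60, "di"]]
get_by_rank_range(scores, start_rank=-2)   # → [[80, "cy"], [90, "bo"]]
```

The names in the sample data are illustrative; the tuples keep the score first so that value order matches rank order, as described earlier.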

&lt;h3&gt;
  
  
  Efficient Container Computations
&lt;/h3&gt;

&lt;p&gt;Expressions allow operations involving one or more large containers in a record to be performed efficiently  on the server side. For example, the top N elements common to two Lists A and B in a record can be computed on the server side with:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;get-by-rank-range( list=remove-by-value-list( list=A,  value-list=get-list(B), op-flag=INVERT ), start-rank=-N )&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The inner computation is A - (A - B), which yields the elements common to A and B. Note that op-flag=INVERT inverts the selection: instead of returning A - B, the operation returns the elements of A that are not in A - B, i.e., A - (A - B). The outer get-by-rank-range then returns the top N elements of the common list.&lt;/p&gt;
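&lt;p&gt;The following Python sketch is a plain model of the two steps' semantics (not actual Expression syntax, and the sample lists are illustrative): an inverted remove-by-value-list yields the intersection, and a rank-range read then returns the top N:&lt;/p&gt;

```python
def remove_by_value_list(lst, value_list, invert=False):
    """Model of remove-by-value-list: with invert=False it removes the
    listed values (leaving A - B); with invert=True the selection is
    inverted, removing everything else (leaving A - (A - B))."""
    keep = (lambda x: x in value_list) if invert else (lambda x: x not in value_list)
    return [x for x in lst if keep(x)]

def get_by_rank_range(lst, start_rank):
    """Model of get-by-rank-range: elements from start_rank onward."""
    return sorted(lst)[start_rank:]

A = [5, 9, 2, 7, 4]
B = [7, 3, 9, 1]
common = remove_by_value_list(A, B, invert=True)   # elements of A also in B
top_n = get_by_rank_range(common, start_rank=-2)   # top 2 of the common list
print(top_n)
```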

&lt;p&gt;&lt;strong&gt;. . . .&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Object Mapping Libraries
&lt;/h2&gt;

&lt;p&gt;The following libraries are available for applications to use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/aerospike/aerospike-document-lib"&gt;Document API&lt;/a&gt; library to store a JSON document to, and retrieve it from, a Map bin. The library also supports JSONPath queries on a stored document. Check out the &lt;a href="https://developer.aerospike.com/tutorials/java/doc_api"&gt;interactive tutorial&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/aerospike/java-object-mapper"&gt;Java Object Mapper&lt;/a&gt; library for convenient and efficient mapping of objects into Aerospike database with simple annotation of Java classes. Check out the &lt;a href="https://www.youtube.com/watch?v=QQhdX661raM&amp;amp;list=PLGo1-Ya-AEQCdHtFeRpMEg6-1CLO-GI3G&amp;amp;index=6"&gt;workshop&lt;/a&gt; and &lt;a href="https://developer.aerospike.com/tutorials/java/object_mapper"&gt;interactive tutorial&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;. . . .&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Aerospike Database is built for speed at scale, and provides a path for companies of any size to leverage large data for real-time decisions without incurring huge operational costs, and to keep scaling in the future. This blog post described the Collection Data Types (CDTs), how they can be used to model objects for speed at scale, and capabilities like ordering and server-side execution that improve performance and ease of implementation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Related links:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aerospike.com/server/guide/data-types/cdt"&gt;Aerospike Documentation: CDTs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://medium.com/aerospike-developer-blog/record-aggregation-in-aerospike-for-performance-and-economy-222673ee5b83"&gt;Record Aggregation in Aerospike For Performance and Economy&lt;/a&gt; (blog post)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://medium.com/aerospike-developer-blog/aerospike-modeling-iot-sensors-c74e1411d493?source=collection_home---4------20-----------------------"&gt;Aerospike Modeling: IoT Sensors&lt;/a&gt; (blog post)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.youtube.com/watch?v=7rVk0WCRJtQ&amp;amp;list=PLGo1-Ya-AEQD7g9hmXy4eYKsG5PtZPkIT&amp;amp;index=6"&gt;Document Oriented Data Modeling&lt;/a&gt; (workshop)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://developer.aerospike.com/tutorials/java/java-modeling_using_lists"&gt;Modeling Using Lists&lt;/a&gt; (interactive tutorial)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://developer.aerospike.com/tutorials/java/java-modeling_using_maps"&gt;Modeling Using Maps&lt;/a&gt; (interactive tutorial)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://developer.aerospike.com/tutorials/java/java-advanced_collection_data_types"&gt;Advanced Collection Data Types&lt;/a&gt; (interactive tutorial)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://developer.aerospike.com/tutorials/java/doc_api"&gt;Document API for JSON Documents&lt;/a&gt;  (interactive tutorial)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/aerospike/aerospike-document-lib"&gt;Aerospike Document API Library&lt;/a&gt; (github repo)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://medium.com/aerospike-developer-blog/aerospike-document-api-fd8870b4106c"&gt;Aerospike Document API&lt;/a&gt; (blog post)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://medium.com/aerospike-developer-blog/aerospike-document-api-jsonpath-queries-bd6260b2d076"&gt;Aerospike Document API: JSONPath Queries&lt;/a&gt; (blog post)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.youtube.com/watch?v=QQhdX661raM&amp;amp;list=PLGo1-Ya-AEQCdHtFeRpMEg6-1CLO-GI3G&amp;amp;index=6"&gt;Java Object Mapper&lt;/a&gt; (workshop)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://developer.aerospike.com/tutorials/java/object_mapper"&gt;Java Object Mapper&lt;/a&gt; (interactive tutorial)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/aerospike/java-object-mapper"&gt;Java Object Mapper&lt;/a&gt; (github repo)&lt;/li&gt;
&lt;li&gt;&lt;a href="http://developer.aerospike.com/"&gt;Aerospike Developer Hub&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>datamodeling</category>
      <category>speedatscale</category>
      <category>cdt</category>
      <category>nosql</category>
    </item>
    <item>
      <title>Data Modeling for Speed At Scale</title>
      <dc:creator>Neel Phadnis</dc:creator>
      <pubDate>Thu, 02 Jun 2022 22:04:35 +0000</pubDate>
      <link>https://dev.to/aerospike/data-modeling-for-speed-at-scale-m60</link>
      <guid>https://dev.to/aerospike/data-modeling-for-speed-at-scale-m60</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2EJvtWq3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1600/0%2A_ZyySJkVOTNuAE0D" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2EJvtWq3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1600/0%2A_ZyySJkVOTNuAE0D" alt="(Source: Photo by NASA on [Unsplash](https://unsplash.com/) )" width="800" height="385"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Source: Photo by NASA on &lt;a href="https://unsplash.com/"&gt;Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Data Modeling is the exercise of mapping application objects onto the model and mechanisms provided by the database for persistence, performance, consistency, and ease of access. &lt;/p&gt;

&lt;p&gt;Aerospike Database is purpose-built for applications that require predictable sub-millisecond access to billions to trillions of objects and need to store terabytes to petabytes of data, while keeping the cluster size - and therefore the operational costs - small. The goals of large data size and small cluster size mean the capacity of high-speed data storage on each node must be high. &lt;/p&gt;

&lt;p&gt;Aerospike pioneered the database technology to effectively use SSDs to provide high-capacity high-speed persistent storage per node. Among its key innovations are that Aerospike:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Accesses SSDs like direct addressable memory which results in superior performance,&lt;/li&gt;
&lt;li&gt;Supports a hybrid memory architecture for index and data in DRAM, PMEM, or SSD,&lt;/li&gt;
&lt;li&gt;Implements proprietary algorithms for consistent, resilient, and scalable storage across cluster nodes,  and &lt;/li&gt;
&lt;li&gt;Provides Smart Client for a single-hop access to data while adapting to the changes in the cluster.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Therefore, choosing Aerospike Database as the data store is a significant step toward enabling your application for speed at scale.  By choosing the Aerospike Database today, it is possible for a company of any size to leverage large amounts of data to solve real-time business problems and  continue to scale in the future while keeping the operational costs low.&lt;/p&gt;

&lt;p&gt;Data design should take into account many capabilities that Aerospike provides toward speed-at-scale such as data compression, Collection Data Types (CDTs), secondary indexes, multi-op requests, batch requests, server-side operations, cluster organization, and more. We discuss them later in this post.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;. . . .&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  NoSQL Data Modeling Principles
&lt;/h2&gt;

&lt;p&gt;Aerospike is a NoSQL database and does not impose the rigid schema required by relational databases. To enable web-scale applications, Aerospike has a distributed architecture, and allows applications to choose availability or consistency during a network partition per &lt;a href="https://en.wikipedia.org/wiki/CAP_theorem"&gt;the CAP theorem&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Typically, NoSQL data modeling starts with identifying the patterns of access in the application, that is, how the application reads and updates the data. The goal is to organize data for the required performance, efficiency, and consistency. In some NoSQL databases, the design of keys, which serve as handles for access, is an important consideration for colocating records that share a common property value. More on this later.&lt;/p&gt;

&lt;p&gt;Many key data modeling principles prevalent in NoSQL databases also apply to Aerospike, including the use of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Denormalization: Allowing duplication of data, by storing it in multiple places, to simplify and optimize access. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Aggregates: Storing nested entities together in embedded form to simplify and optimize access.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Application joins: Performing joins in the application in rare cases when they are required, for example, to follow the stored references in many-to-many relationships.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Single record transactions: Storing data that must be updated atomically together in one record.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
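&lt;p&gt;As a concrete illustration of these principles, the record shapes below (hypothetical fields, shown as Python dictionaries) contrast an aggregate that embeds its 1:N children with an M:N relationship resolved by an application join:&lt;/p&gt;

```python
# Aggregate: an account record embeds its transactions, so one read
# (and one single-record transaction) covers the whole entity.
account = {
    "account-id": "acct-1",
    "owner": "Ann",
    "transactions": [                      # 1:N children embedded in place
        {"ts": 1700000000, "amount": -25.0},
        {"ts": 1700003600, "amount": 100.0},
    ],
}

# Application join: an M:N relationship stores references instead, and
# the application follows them with a second (possibly batched) lookup.
student = {"student-id": "s-1", "course-ids": ["c-101", "c-202"]}
courses = {"c-101": {"title": "Databases"}, "c-202": {"title": "Systems"}}
enrolled = [courses[cid] for cid in student["course-ids"]]
print(enrolled)
```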

&lt;h3&gt;
  
  
  Modeling Object Relationships
&lt;/h3&gt;

&lt;p&gt;Related objects can be modeled either by holding a reference to the objects or by embedding the objects. The choice involves trade-offs in ease, performance, and consistency, and is governed by two key factors: 1) the cardinality - 1:1, 1:N, M:N - of the relationships, and 2) the access patterns, as described below. Data modeling requires striking the right balance among conflicting goals: for example, while related objects should be embedded for ease and performance of reads, embedding across multiple objects can adversely affect update performance and consistency. &lt;/p&gt;

&lt;p&gt;The following factors will dictate whether to embed or to reference an object:&lt;/p&gt;

&lt;h4&gt;
  
  
  Shared or exclusive relationship
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Exclusive relationships 1:1 or 1:N should be embedded. For example, these 1:1 relationships should be stored together: owner and car, citizen and passport, family and residence; and so should these 1:N relationships: account and transactions, person and properties, and company and brands.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Shared objects with M:N relationships should be stored independently. For example, students and courses, tourists and destinations, and donors and charities.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Being accessed together
&lt;/h4&gt;

&lt;p&gt;If 1:1 or 1:N embedded objects and the parent object are accessed and updated independently of each other, they are candidates for storing separately. For example, owner and car, or person and accounts, can have different operations and access patterns. Aggregates are often not optimal when embedded objects would be frequently and independently updated. For example, a user and their sent- or received-message folders have very different update patterns.&lt;/p&gt;

&lt;h4&gt;
  
  
  Immutability 
&lt;/h4&gt;

&lt;p&gt;If an M:N shared object does not change and also is accessed together with the referring object, it should be embedded with the referring object. For example, travelers and favorite destinations, students and completed courses.&lt;/p&gt;

&lt;h4&gt;
  
  
  Consistency requirements
&lt;/h4&gt;

&lt;p&gt;The application may be able to tolerate temporary inconsistency in a shared object. If an M:N shared object is accessed together with the referring object, updated infrequently, and may remain slightly out-of-date while all its embedded copies are being updated, it is a candidate for embedding. For example, students and current course instructors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;. . . .&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Beyond the standard NoSQL modeling techniques and guidelines, data modeling in Aerospike involves additional considerations as discussed below.&lt;/p&gt;

&lt;h2&gt;
  
  
  Multi-Model Database
&lt;/h2&gt;

&lt;p&gt;Aerospike is a &lt;code&gt;record&lt;/code&gt;-oriented store.  It's easy to view a key-value store as a special case of the record-oriented store, where a record holds just one (nameless) field. &lt;/p&gt;

&lt;p&gt;In the Aerospike data model, a record is a schema-less list of &lt;code&gt;bins&lt;/code&gt; (fields), which means a record can hold a variable number of arbitrary bins. A bin is type-less, which means it can hold a value of any supported type. Aerospike supports scalar types like Integer, String, Boolean, and Double; Collection Data Types (&lt;code&gt;CDTs&lt;/code&gt;) like &lt;code&gt;List&lt;/code&gt; and &lt;code&gt;Map&lt;/code&gt;; and special data types like Blob (bytes), GeoJSON, and HyperLogLog (HLL).&lt;/p&gt;

&lt;p&gt;Records are created in a &lt;code&gt;namespace&lt;/code&gt;. A record is typically assigned to a set (similar to a table) within the namespace. A database can have multiple namespaces, and each namespace has dedicated storage devices and policy for how indexes and data are stored, for example, hybrid DRAM and flash, all flash, and so on.&lt;/p&gt;

&lt;p&gt;Aerospike Collection Data Types provide an efficient way to store hierarchical objects, including JSON documents. Application objects can be stored in multiple ways in Aerospike:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;As a record: Object fields are stored in record bins&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;As a Map: Object fields are stored as key-value pairs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;As a List: Object field values are stored in the List in a specific order.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
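&lt;p&gt;The three storage shapes can be illustrated with a small hypothetical sensor object (the field names and the Python representation are purely illustrative; bins, Maps, and Lists are the actual Aerospike containers):&lt;/p&gt;

```python
# One application object, three storage shapes.
class Sensor:
    def __init__(self, sensor_id, lat, lon):
        self.sensor_id, self.lat, self.lon = sensor_id, lat, lon

s = Sensor("t-17", 47.6, -122.3)

# As a record: each field goes into its own bin (bin-name -> value).
as_record_bins = {"sensor_id": s.sensor_id, "lat": s.lat, "lon": s.lon}

# As a Map: the whole object is one Map bin of key-value pairs.
as_map_bin = {"sensor_id": "t-17", "lat": 47.6, "lon": -122.3}

# As a List: field values in an agreed fixed order (id, lat, lon).
as_list_bin = ["t-17", 47.6, -122.3]
```

&lt;p&gt;The List shape is the most compact but requires the application to fix the field order; the Map shape keeps field names with the values; separate record bins allow per-field operations and secondary indexing.&lt;/p&gt;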

&lt;p&gt;We will defer the discussion of CDTs to a future post.&lt;/p&gt;

&lt;h2&gt;
  
  
  Design of Record Keys
&lt;/h2&gt;

&lt;p&gt;Records are accessed with a unique &lt;code&gt;key&lt;/code&gt;. Aerospike record key (or digest) is a hash of the tuple (set, user-key) and is unique within a namespace, where user-key is an application provided id. &lt;/p&gt;

&lt;p&gt;Unlike some other databases, Aerospike does not provide complex key design schemes for placing records on the same node for locality of access. Aerospike uniformly distributes records across nodes for load balancing, optimal resource utilization, and performance, so no effort need be spent designing keys for colocation. &lt;/p&gt;

&lt;p&gt;At the same time, it is possible for the application to compose the key to quickly access related objects as described below.&lt;/p&gt;
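&lt;p&gt;The key scheme can be sketched as follows in Python. Aerospike actually derives the digest with RIPEMD-160 over the set name and user-key; the SHA-1 stand-in and the ticker keys below are purely illustrative:&lt;/p&gt;

```python
import hashlib

def digest(set_name, user_key):
    """Model of record-digest derivation: a hash of the (set, user-key)
    tuple, unique within a namespace. Aerospike uses RIPEMD-160; SHA-1
    here is only a stand-in to show the idea."""
    return hashlib.sha1(f"{set_name}:{user_key}".encode()).hexdigest()

# Composing the user-key from group-id + sub-group-id lets the
# application address related records directly, e.g. one record of
# ticker data per stock per day.
key_today = digest("ticker", "MSFT:2022-06-02")
key_prior = digest("ticker", "MSFT:2022-06-01")
assert key_today != key_prior   # distinct records, directly addressable
```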

&lt;h2&gt;
  
  
  Modeling Related Objects
&lt;/h2&gt;

&lt;p&gt;There are multiple ways in which related objects can be organized:&lt;/p&gt;

&lt;p&gt;Sets provide a mechanism to keep records organized by some criterion, such as type of objects, metadata vs data, a logical mapping, and so on.  &lt;/p&gt;

&lt;p&gt;Related objects can be held in CDTs, either in one record that has the group-id as its key, or in multiple records whose keys are generated by appending sub-group ids to the group-id. For example, ticker data for a stock can be organized into records by stock (group id) + date (sub-group id).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A List can be used to store a group of related objects as a List of Lists or a List of Maps. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A Map can also be used as a Map of Lists or a Map of Maps.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;CDTs provide many advantages such as greater density of objects per node by reducing the per-record system overhead, powerful server-side operations, as well as element ordering. We will cover use of CDTs in a future post.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding Transactions and Consistency
&lt;/h2&gt;

&lt;p&gt;It is important to &lt;a href="https://aerospike.com/blog/developers-understanding-aerospike-transactions/"&gt;understand transactions in Aerospike&lt;/a&gt; to ensure data consistency and correctness. Aerospike allows a namespace to be configured for Availability or Strong Consistency (SC). Multiple read consistency levels are possible in the SC mode for the application to strike the right balance of performance and level of consistency.&lt;/p&gt;

&lt;p&gt;In the SC mode, Aerospike provides transactional guarantees for single record operations. This includes multiple operations on a single record that can be performed in a single request. Therefore, data that needs to be updated atomically must be stored in one record. CDTs provide an easy way to store such objects in one record.&lt;/p&gt;

&lt;p&gt;While transactional updates are currently not available across multiple records, delayed consistency across multiple records can be achieved through &lt;a href="https://aerospike.com/blog/microservices-with-aerospike/"&gt;known schemes&lt;/a&gt;. &lt;/p&gt;

&lt;h2&gt;
  
  
  Managing Temporary Objects
&lt;/h2&gt;

&lt;p&gt;Aerospike has useful mechanisms that should be leveraged to manage objects with a defined lifespan. Such records can be marked with an expiration time (or time-to-live, TTL; the default is no expiration). Expired objects are automatically removed and their space recovered through garbage collection. This mechanism provides a convenient and efficient way for the application to manage its temporary objects that have a specific lifetime and must be removed after that. &lt;/p&gt;

&lt;p&gt;If data needs to be archived based on some age criterion to another location, sets and secondary indexes can be used to efficiently identify the records to archive. &lt;/p&gt;

&lt;h2&gt;
  
  
  Organizing Namespaces
&lt;/h2&gt;

&lt;p&gt;A namespace’s index and data can be placed in different storage types with different speed, size, and cost characteristics, such as DRAM, PMEM, and SSD. Applications can allocate data to different namespaces depending on the speed and size needs of different objects. &lt;/p&gt;

&lt;p&gt;The “data in index” option is available for high-speed counters: it stores a single numerical value, typically a counter, that is updated at high frequency and for which access speed as well as consistency are critical. For example, it is important to accurately read and update the number of seats available for a popular event when tickets go on sale, to avoid under-booking or over-booking. Similarly, fast objects can be stored in a PMEM or fast-SSD namespace, while a large, low-cost “all-flash” namespace can store objects with less stringent access-latency requirements. &lt;/p&gt;

&lt;p&gt;It is also possible to split an object across multiple namespaces using the same set and user-key, and therefore the same record digest, as an implicit reference. For example, one namespace may hold archived versions, and another the latest version.&lt;/p&gt;

&lt;p&gt;Other namespace configuration options significant for data modeling decisions include maximum record size and choice of Availability versus Strong Consistency.&lt;/p&gt;

&lt;h2&gt;
  
  
  Maximum Record Size
&lt;/h2&gt;

&lt;p&gt;A namespace is configured with a maximum record size, up to an 8MB limit, which also represents the unit of transfer in a device IO operation. Record data cannot exceed the configured maximum record size, which is an important consideration for large-object as well as multi-object record design. The application design may consider workarounds such as &lt;a href="https://github.com/aerospike-examples/adaptive-map"&gt;a multi-record map&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Compressing Data
&lt;/h2&gt;

&lt;p&gt;Using compression can significantly compact data and reduce the data storage requirements, thus increasing the data density per node, reducing the cluster size, and lowering the cost.  &lt;/p&gt;

&lt;p&gt;In addition to improving storage density, compression can also improve wire transfer speed for large objects. Compression can be enabled for efficient client-server data transfer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;. . . .&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Optimizing Speed
&lt;/h2&gt;

&lt;p&gt;To achieve optimal performance, many mechanisms are available in Aerospike including the following.&lt;/p&gt;

&lt;h3&gt;
  
  
  Secondary Indexes 
&lt;/h3&gt;

&lt;p&gt;Scan operations use primary indexes on namespace and sets, whereas query operations use secondary indexes. Secondary indexes can be created on a bin’s Integer, String, and GeoJSON values. Secondary indexes improve query performance, but have a cost of keeping the index in sync when the underlying data is updated. Typically, a secondary index on a field works best for high query/update ratio and high selectivity of the index field.&lt;/p&gt;

&lt;p&gt;In Aerospike Database 6.0+, the application can boost access throughput with hyper-parallel “partition-grained” secondary index queries, in addition to primary index queries from prior releases.&lt;/p&gt;

&lt;h3&gt;
  
  
  Batch Operations
&lt;/h3&gt;

&lt;p&gt;Prior to Aerospike Database 6.0, “read” or “exists” operations on multiple records could be batched in a single request for efficiency and speed. In 6.0, batch write, delete, and UDF operations are also supported. Fast ingests, for example of IoT streams, can get better throughput with batch writes. &lt;/p&gt;

&lt;h3&gt;
  
  
  Server-Side Operations
&lt;/h3&gt;

&lt;p&gt;Expressions and UDFs allow complex operations to be performed on the server, without having to retrieve data to the client first. &lt;/p&gt;

&lt;p&gt;Expressions: Expressions offer a powerful way to define complex logic for server-side evaluation - either to filter records, or to retrieve data and store results. &lt;/p&gt;

&lt;p&gt;UDF: User Defined Functions (UDFs) are defined in Lua for record- and stream-oriented computations. They are invoked through a client request, and executed on server. &lt;/p&gt;

&lt;h3&gt;
  
  
  Sets and Set Indexes
&lt;/h3&gt;

&lt;p&gt;Related records can be organized in sets. To enable fast scans on a set, a set index can be defined. A set index can provide a big performance boost for scans of small sets, compared to the alternative of scanning the entire namespace. &lt;/p&gt;

&lt;p&gt;It is also a lot more efficient to truncate a set as opposed to deleting individual records when the data is no longer needed, and therefore such deletion cohorts may be organized in sets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;. . . .&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Additional Data Design Considerations 
&lt;/h2&gt;

&lt;p&gt;In addition to the data modeling aspects described above, there are Aerospike cluster design aspects that overlap with data design, and affect application performance, reliability, and ease of development. They are briefly described below.&lt;/p&gt;

&lt;h3&gt;
  
  
  Replication for Reliability and Performance
&lt;/h3&gt;

&lt;p&gt;An Aerospike cluster holds a Replication Factor (RF) number of copies of data for reliability and performance. An RF of 2 is typical; it can be raised to 3 for higher resilience, but a larger RF adversely impacts both speed and scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  Synchronous and Asynchronous Replication
&lt;/h3&gt;

&lt;p&gt;An important design decision is whether the data is held in one tightly synchronized cluster across multiple sites or racks, or in multiple clusters loosely synchronized via XDR (Cross-Datacenter Replication). The decision depends on the application's need for consistency, site autonomy, and data regulation requirements. &lt;/p&gt;

&lt;h3&gt;
  
  
  Rack Awareness
&lt;/h3&gt;

&lt;p&gt;For fast local reads and availability, data is replicated in a rack aware fashion where all sites are similar and each site holds its own copy of the entire data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Client-Side Processing
&lt;/h3&gt;

&lt;p&gt;The client connects directly to all server nodes, and there is no coordinator node; as a result, operations like sorting and aggregation involve client-side processing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;. . . .&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Data modeling is the exercise of mapping application objects and access patterns onto the database’s native data model and mechanisms for optimal performance, efficiency, and consistency. Aerospike Database is purpose built for speed at scale, and provides a path to companies of any size to leverage large data for real-time decisions without incurring huge operational cost, and also scale in the future. The blog post described data modeling considerations when designing for speed-at-scale applications with the Aerospike Database.  In a future post, we will describe how CDTs can be used for data modeling.&lt;/p&gt;

&lt;h3&gt;
  
  
  Related links:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://medium.com/aerospike-developer-blog/developers-understanding-aerospike-transactions-1c0ad5cfc357"&gt;Developers: Understanding Aerospike Transactions&lt;/a&gt; (blog post)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://medium.com/aerospike-developer-blog/microservices-with-aerospike-e6dac788b0e6"&gt;Microservices with Aerospike&lt;/a&gt; (blog post)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/aerospike-examples/adaptive-map"&gt;Adaptive Map&lt;/a&gt; (code repo)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developer.aerospike.com"&gt;Aerospike Developer Hub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aerospike.com/"&gt;Aerospike Documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>datamodeling</category>
      <category>speedatscale</category>
      <category>nosql</category>
      <category>documentmodel</category>
    </item>
    <item>
      <title>Resiliency in Aerospike Multi-Site Clusters</title>
      <dc:creator>Neel Phadnis</dc:creator>
      <pubDate>Mon, 15 Jun 2020 16:29:20 +0000</pubDate>
      <link>https://dev.to/aerospike/resiliency-in-aerospike-multi-site-clusters-32n1</link>
      <guid>https://dev.to/aerospike/resiliency-in-aerospike-multi-site-clusters-32n1</guid>
      <description>&lt;p&gt;As part of the recently announced &lt;a href="https://www.aerospike.com/products/database-platform/"&gt;Aerospike Database 5&lt;/a&gt;, multi-site clusters are now a supported configuration. Aerospike database has supported a cluster architecture for almost a decade. This post describes what is different in multi-site clusters. Specifically it describes in greater detail how multi-site clusters provide strong resiliency against a variety of failures&lt;/p&gt;

&lt;h4&gt;
  
  
  Accelerating Global Business Transactions
&lt;/h4&gt;

&lt;p&gt;The changes in the user behavior and expectations brought about by mobile devices and digital transformation are accelerating the trend of global business transactions. Today, users are connected 24x7, across the globe, and expect immediate results whether they are making payments to their friends, ordering a product on a site, or tracking a package. To respond to these changes, businesses must be always-on and respond quickly to their customers and partners that can be anywhere. The database is a key enabler of the technology platform driving these capabilities, and must have the following characteristics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;multi-site: has global footprint to reflect business presence&lt;/li&gt;
&lt;li&gt;always on: is able to automatically recover from a variety of failures&lt;/li&gt;
&lt;li&gt;strongly consistent: provides guarantees against staleness and loss of data&lt;/li&gt;
&lt;li&gt;cost effective: inexpensive to buy and efficient to operate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As we will see below, a multi-site cluster is a good fit for these needs of global business transactions.&lt;/p&gt;

&lt;h1&gt;
  
  
  What are Multi-Site Clusters?
&lt;/h1&gt;

&lt;p&gt;A multi-site cluster is essentially a regular cluster: it has multiple nodes, and the data is sharded in many partitions, each with a number (replication factor) of replicas, and the replicas are evenly distributed across the nodes. But there are several important differences.&lt;/p&gt;

&lt;h2&gt;
  
  
  Geographically Separated
&lt;/h2&gt;

&lt;p&gt;As the name suggests, a multi-site cluster is a cluster that spans multiple sites. Sites can be located anywhere geographically, across the same city or on multiple continents, and involve hybrid and heterogeneous environments consisting of VMs, containers, and bare-metal machines in on-premise data centers, as well as private and public clouds.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F2db4kfhp5h9479wcbn0m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F2db4kfhp5h9479wcbn0m.png" alt="Alt Text" width="800" height="600"&gt;&lt;/a&gt; &lt;em&gt;A 3-site cluster&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Typically, nodes in a multi-site cluster are identically sized and evenly distributed across all sites to enable symmetry in performance and recovery.&lt;/p&gt;

&lt;h2&gt;
  
  
  Strongly Consistent
&lt;/h2&gt;

&lt;p&gt;The key use case for a multi-site cluster is business transactions, and as such, consistency of data across all sites is critical. Therefore, a multi-site cluster is configured in the Strong Consistency (SC) mode. In the SC mode, all replicas are synchronously updated, maintaining a single version of the data so that the replicas are immediately and always consistent. Eventually consistent systems, by contrast, allow multiple versions that must be merged, requiring applications to tolerate stale versions and lost writes.&lt;/p&gt;

&lt;p&gt;The other mode, Available during Partitions (AP), does not provide guarantees against lost or stale data in the event of a cluster split, for instance, and is unsuitable for business transactions that cannot tolerate inconsistent data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rack Aware
&lt;/h2&gt;

&lt;p&gt;An Aerospike cluster can have nodes across multiple physical racks, and rack-aware data distribution ensures balanced distribution of data across all racks. In the context of a multi-site cluster, a rack is equivalent to a site, and rack-aware data distribution ensures no duplicate data at any site when the replication factor (RF, the number of replicas of each data partition) is less than or equal to the number of racks (N). It ensures at most one replica per site when RF &amp;lt; N, and exactly one replica per site when RF = N.&lt;/p&gt;

&lt;p&gt;Typically, a multi-site cluster is configured with RF=N because it provides the best performance and resiliency characteristics. With RF=N, each site has a full copy of the entire data, and therefore all reads can be very fast as they are performed on the local replica. Also, in case of a site failure, the remaining site(s) has all data to serve all requests.&lt;/p&gt;

&lt;p&gt;While multi-site clusters with 2 or 3 sites and a replication factor of 2 or 3, respectively, are common, configurations with more sites and different replication factors are also supported.&lt;/p&gt;
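&lt;p&gt;A toy Python sketch of the rack-aware placement invariant (real Aerospike placement is more sophisticated; the round-robin assignment here is only illustrative) shows that replicas of a partition land on distinct sites whenever RF &amp;lt;= N:&lt;/p&gt;

```python
def place_replicas(partition_id, racks, rf):
    """Toy rack-aware placement: assign the rf replicas of a partition
    to distinct racks, round-robin from a partition-dependent offset.
    This only demonstrates the invariant that no rack holds two
    replicas of the same partition when rf <= number of racks."""
    assert rf <= len(racks)
    start = partition_id % len(racks)
    return [racks[(start + i) % len(racks)] for i in range(rf)]

racks = ["site-A", "site-B", "site-C"]

# RF = N: every site holds exactly one replica of every partition.
for pid in range(6):
    assert sorted(place_replicas(pid, racks, rf=3)) == sorted(racks)

# RF < N: each partition's replicas still land on distinct sites.
print(place_replicas(0, racks, rf=2))
```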

&lt;h1&gt;
  
  
  "Always On" Resiliency
&lt;/h1&gt;

&lt;p&gt;Aerospike clusters support &lt;a href="https://www.aerospike.com/blog/zero-downtime-upgrades-in-aerospike-made-even-simpler/"&gt;zero downtime upgrades&lt;/a&gt; with full data availability. Also in multi-site clusters, all planned events including upgrades, patches, and hardware maintenance can be performed with no interruption to service.&lt;/p&gt;

&lt;p&gt;Multi-site clusters also support automatic recovery from most node, site, and link failures. Automatic recovery and continuity in the event of a site failure is a key reason for businesses to choose multi-site clusters.&lt;/p&gt;

&lt;p&gt;How Aerospike recovers from various failures is explained further below.&lt;/p&gt;

&lt;h4&gt;
  
  
  Fast Failure Detection and Recovery
&lt;/h4&gt;

&lt;p&gt;All nodes maintain the healthy (or original) state of the cluster, called "the roster," which includes the cluster's nodes and the replica-to-node map for all data partitions. Every node exchanges heartbeat messages with every other node in the cluster at a regular (configurable, typically sub-second) interval; each heartbeat message lists all nodes that the sender can communicate with. Using a specific (configurable) number of recent heartbeats, a principal node can quickly determine any failures in the cluster and the new cluster membership (i.e., the nodes that can all see one another), and it disseminates the new cluster definition to all connected nodes. With this information, each node can independently decide which of its roster replicas must be promoted to master to replace failed masters, and which new replicas it must create to replace failed replicas. This process of detecting a failure, recovering from it, and forming a new cluster takes just a few seconds.&lt;/p&gt;

&lt;h4&gt;
  
  
  Availability During Migration
&lt;/h4&gt;

&lt;p&gt;New replicas are populated by migrating data from the master. While migrations typically take longer than a few seconds, especially when they involve inter-site data transfer, the partition remains available for all operations while the migration is in progress.&lt;/p&gt;

&lt;h2&gt;
  
  
  Recovery from Failures
&lt;/h2&gt;

&lt;p&gt;Next, we will look into the details of recovery from various failures. It is important to first note the two invariants of an SC cluster and the partition availability rules during a cluster split, which together ensure strong consistency.&lt;/p&gt;

&lt;h4&gt;
  
  
  SC Cluster Invariants
&lt;/h4&gt;

&lt;p&gt;The following two invariants are preserved in an operational SC cluster at all times. Recovery from a failure ensures these invariants are met before new requests are serviced.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A partition has exactly one master. At no time can there be more than one master in order to preserve a single version of data. Potential race conditions involving an old and a new master are resolved using the higher "regime" number of the newer master. The master is the first available roster replica in the partition's succession list of nodes. A partition's succession list is deterministic and is derived solely from the partition id and node ids.&lt;/li&gt;
&lt;li&gt;A partition has exactly RF replicas. These are the first RF cluster nodes in the partition's succession list of nodes.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Note the second invariant does not mean that all replicas must have all data for continued operation; partitions remain available while background migrations bring replicas to an up-to-date state.&lt;/p&gt;
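
&lt;p&gt;The determinism of the succession list can be sketched as follows. This is an illustrative simulation; the actual Aerospike hash differs, but the property it demonstrates is the same: every node independently computes an identical ordering from just the partition id and the node ids.&lt;/p&gt;

```python
import hashlib

def succession_list(partition_id, node_ids):
    """Derive a deterministic node ordering for a partition.

    Every node computes the same list independently, because the order
    depends only on the partition id and the node ids.
    """
    def rank(node_id):
        data = ("%s:%s" % (partition_id, node_id)).encode()
        return hashlib.sha256(data).hexdigest()
    return sorted(node_ids, key=rank)

# With replication factor RF, the first RF entries are the partition's
# replicas, and the first available one acts as the master.
nodes = ["A", "B", "C", "D"]
replicas = succession_list(7, nodes)[:3]
```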

&lt;h4&gt;
  
  
  Partition Availability Rules
&lt;/h4&gt;

&lt;p&gt;The following three rules dictate whether a data partition can be active in a sub-cluster resulting from a failure. The partition is active if any one of these holds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The sub-cluster has a majority of the nodes and at least one replica of the partition.&lt;/li&gt;
&lt;li&gt;The sub-cluster has exactly half the nodes and holds the master replica of the partition.&lt;/li&gt;
&lt;li&gt;The sub-cluster has all replicas of the partition.&lt;/li&gt;
&lt;/ul&gt;
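
&lt;p&gt;These rules can be expressed as a small decision function. The sketch below is an illustrative Python simulation with hypothetical names, not Aerospike API:&lt;/p&gt;

```python
def partition_available(sub_nodes, full_size, replica_nodes, master_node):
    """Return True if a partition can be active in a sub-cluster.

    sub_nodes:     set of node ids in the sub-cluster after the split
    full_size:     number of nodes in the full (roster) cluster
    replica_nodes: set of node ids holding a replica of this partition
    master_node:   node id holding the partition's master replica
    """
    replicas_here = replica_nodes.intersection(sub_nodes)

    # Rule 1: a majority of the nodes plus at least one replica.
    if len(sub_nodes) * 2 > full_size and replicas_here:
        return True
    # Rule 2: exactly half the nodes plus the master replica.
    if len(sub_nodes) * 2 == full_size and master_node in sub_nodes:
        return True
    # Rule 3: all replicas of the partition.
    return replicas_here == replica_nodes
```

&lt;p&gt;In the 9-node, RF=3 example discussed below, a 6-node sub-cluster satisfies rule 1 for every partition, while the 3-node sub-cluster satisfies none of the rules.&lt;/p&gt;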

&lt;h3&gt;
  
  
  Node Failures
&lt;/h3&gt;

&lt;p&gt;In the following diagram, the cluster consists of 9 nodes across 3 sites with a replication factor of 3. When a node fails, such as Node 1 as shown in the diagram, all partition replicas on the node become unavailable. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fohlp19nwg9clpookkbde.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fohlp19nwg9clpookkbde.png" alt="Alt Text" width="800" height="600"&gt;&lt;/a&gt; &lt;em&gt;Node Failure: Master roles are transferred and new replicas are created&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The cluster detects the node failure and acts to reinstate the two invariants mentioned above.&lt;/p&gt;

&lt;p&gt;For every master replica on Node 1, the next roster replica in the partition's succession list becomes the new master. As illustrated in the diagram, P111-R1 on Node 6 becomes the new master.&lt;/p&gt;

&lt;p&gt;For every replica on Node 1, a new replica is created to preserve the replication factor. Also, per the rack-aware distribution rules, the new replica must reside at Site 1 to preserve one replica per site. As illustrated in the diagram, new replicas P111-R3 on Node 3 and P222-R3 on Node 2 are created.&lt;/p&gt;

&lt;p&gt;With this, the cluster is ready to process new requests while the new replicas continue to be populated from their respective master.&lt;/p&gt;

&lt;h3&gt;
  
  
  Site and Link Failures
&lt;/h3&gt;

&lt;p&gt;The following diagram shows a site failure (Site 1) and link failures (Site 1-Site 2 and Site 1-Site 3). Both result in the same recovery and end state, so let us consider the latter for our discussion.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fc8lya1j09nurp1mtsgz4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fc8lya1j09nurp1mtsgz4.png" alt="Alt Text" width="800" height="600"&gt;&lt;/a&gt; &lt;em&gt;Site or Link Failures: Sites 2 and 3 form the new cluster&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When the links fail, the cluster is split into two sub-clusters: one with nodes 1–3 and the other with nodes 4–9. Each sub-cluster acts to first determine which partitions are active by applying the three rules mentioned above:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The sub-cluster with nodes 4–9 has the majority of the nodes and is therefore the majority sub-cluster. Since it has at least one replica (in fact exactly two, one each at Site 2 and Site 3) for every partition, it can serve all requests for all data. No data is available in the other (minority) sub-cluster.&lt;/li&gt;
&lt;li&gt;No sub-cluster has exactly half the nodes, so this rule is not applicable. (This rule would apply to a 2-site cluster with an even number of nodes.)&lt;/li&gt;
&lt;li&gt;No sub-cluster has all replicas of any partition, so this rule is also not applicable. (With Rack Aware distribution, this rule is never applicable in multi-site clusters.)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The majority sub-cluster then proceeds to reinstate the two invariants: it promotes the appropriate roster replicas to replace the master replicas stranded in the other sub-cluster, and creates new replicas to replace all replicas in the other sub-cluster.&lt;/p&gt;

&lt;p&gt;With this, the cluster is ready to process new requests while the new replicas continue to be populated from their respective master.&lt;/p&gt;

&lt;h3&gt;
  
  
  Return to Healthy State
&lt;/h3&gt;

&lt;p&gt;When failures are fixed, the cluster returns to a healthy state through a process of detection and recovery similar to that described earlier. The roster (i.e., original) masters regain the master role, roster replicas are brought up to date, and any replicas created during the failure are dropped. Requests received by a replica while it is still receiving updates are proxied to the appropriate replica.&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparing Resiliency in 2-Site and 3-Site Clusters
&lt;/h2&gt;

&lt;p&gt;The following table demonstrates how a 3-site (RF=3) cluster provides superior resiliency as compared to a 2-site (RF=2) cluster.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
 &lt;em&gt;A 3-site cluster is more resilient than a 2-site cluster&lt;/em&gt;

&lt;p&gt;Essentially, a 3-site cluster automatically recovers, with full availability of data, from failures that a 2-site cluster cannot: any single-site failure, and node failures spanning any two sites. It can even recover from a two-site failure with manual intervention, by re-rostering the surviving nodes.&lt;/p&gt;

&lt;p&gt;The scenarios are described in more detail below.&lt;/p&gt;

&lt;h4&gt;
  
  
  2-Site Cluster (RF=2)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Node failures&lt;/strong&gt;: The cluster can automatically recover from the failure of up to a minority of nodes on the same site. For node failures across both sites, the cluster will be only partially available, as some partitions would lose both replicas and become unavailable. After a majority node failure, the cluster will neither recover automatically nor remain fully available.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Site failure&lt;/strong&gt;: The cluster can automatically recover from the failure of the minority site (i.e., the site with fewer nodes) in a cluster with an odd number of nodes. When the site with an equal number of nodes (in an even-node cluster) or with the majority of nodes fails, manual intervention to re-roster the remaining nodes is needed to make the cluster fully operational.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Link failure&lt;/strong&gt;: In a cluster with an odd number of nodes, the majority site will remain operational, and so will the applications that can connect to it. In an even-node cluster, each site will remain operational for exactly the half of the partitions for which it holds the master replica. For an application to work, it must be able to connect to all nodes on both sites.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  3-Site Cluster (RF=3)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Node failures&lt;/strong&gt;: Just as in the 2-site case, a 3-site cluster can automatically recover from the failure of up to a minority of nodes, even when the failures span any two sites. For node failures involving all three sites, the cluster will be partially available, as some partitions would lose all three replicas and become unavailable. After a majority node failure, the cluster will neither recover automatically nor remain fully available.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Site failures&lt;/strong&gt;: The cluster can automatically recover from a single site failure. When two sites fail, a manual intervention to re-roster the nodes on the third site is needed to make the cluster fully operational.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Link failures&lt;/strong&gt;: With one or two link failures that still allow two sites to form a majority sub-cluster, automatic recovery is possible. If all three inter-site links fail, the operator can decide to re-roster the nodes at any one of the three sites to create an operational cluster. For an application to work, it must be able to connect to all nodes of the operational cluster.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Other Considerations
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Write Latency&lt;/strong&gt;&lt;br&gt;
Write transactions update all replicas synchronously. Therefore write latency is dictated by the maximum separation between any two sites. This can range from a few milliseconds in a cluster spanning multiple zones in the same region to hundreds of milliseconds for a cluster across multiple continents. Application requirements for strong consistency, disaster recovery, and write latency must be balanced to come up with the optimal cluster design.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Node Sizing&lt;/strong&gt;&lt;br&gt;
Node sizing should take into account the maximum node failures at a site, as with Rack Aware distribution the content and load of the failed nodes will be redistributed to the remaining nodes on the same site. In the extreme case, it is possible for a single node to hold all of a site's replicas and serve all its requests; however, the cluster may not function optimally.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Migration Time&lt;/strong&gt;&lt;br&gt;
With high bandwidth connectivity, a replica can be migrated quickly. Migrations are typically throttled so that the cluster can provide adequate response to the normal workload. Applications are unlikely to experience higher latency during this period if the workload is not bandwidth intensive. Multiple simultaneous node failures may extend the migration duration and latency depending on the configuration and network bandwidth.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Global Data Infrastructure&lt;/strong&gt;&lt;br&gt;
For a different class of applications that require fast write performance, selective replication across sites to meet regulatory requirements, and autonomy of site operations, but can live with less stringent consistency guarantees, Aerospike provides &lt;a href="http://pages.aerospike.com/rs/229-XUE-318/images/Aerospike_Solution_Brief_XDR.pdf"&gt;Cross Data-center Replication (XDR)&lt;/a&gt;. XDR can be combined with multi-site clusters to architect an optimal global data infrastructure that satisfies multiple applications.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Aerospike multi-site clusters span geographically distant sites and provide strong consistency and always-on availability at a low cost, making them a good fit for global business transactions. Multi-site clusters can quickly detect and recover from node, site, and link failures to provide "always-on" availability. Today, many deployments are successfully running mission-critical transactions at scale with multi-site clusters. Multi-site clusters and Cross Data-center Replication (XDR) together provide the capabilities and flexibility to create the optimal global data infrastructure within an enterprise.&lt;/p&gt;

</description>
      <category>distributeddatabases</category>
      <category>faulttolerance</category>
      <category>globaltransactions</category>
      <category>strongconsistency</category>
    </item>
    <item>
      <title>Twelve Do's of Consistency in Aerospike</title>
      <dc:creator>Neel Phadnis</dc:creator>
      <pubDate>Thu, 28 May 2020 00:29:26 +0000</pubDate>
      <link>https://dev.to/aerospike/twelve-do-s-of-consistency-in-aerospike-30ac</link>
      <guid>https://dev.to/aerospike/twelve-do-s-of-consistency-in-aerospike-30ac</guid>
      <description>&lt;p&gt;For applications that demand absolute correctness of data, Aerospike offers the Strong Consistency (SC) mode that guarantees no stale or dirty data is read and no committed data is lost. Aerospike's &lt;a href="https://www.aerospike.com/lp/exploring-data-consistency-aerospike-enterprise-edition/"&gt;strong consistency support&lt;/a&gt; has been independently confirmed through &lt;a href="http://jepsen.io/analyses/aerospike-3-99-0-3"&gt;Jepsen testing&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Developers building such applications should follow the following Twelve Do's of Consistency.&lt;/p&gt;




&lt;h3&gt;
  
  
  1. Model your data for single record atomicity. 
&lt;/h3&gt;

&lt;p&gt;The scope of a transaction in Aerospike is a single request and a single record. In other words, an atomic update can only be performed on a single record. Therefore, model your data so that anything that must be updated in one transaction (atomically) is kept in a single record. &lt;a href="https://www.aerospike.com/blog/modeling-data-aerospike/"&gt;Data modeling techniques&lt;/a&gt; like embedding, linking, and denormalization can be used to achieve this goal.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. Configure the namespace in SC mode by setting strong-consistency to true.
&lt;/h3&gt;

&lt;p&gt;Per the CAP theorem, the system must make a choice between Availability and Consistency if it continues to function during a network partition. Aerospike offers both choices. A namespace (equivalent to a database or schema) in a cluster can be configured in AP (choosing Availability over Consistency) or SC (Strong Consistency, choosing Consistency over Availability) mode. All writes in SC mode are serialized and synchronously replicated to all replicas, ensuring one version and immediate consistency.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. Use the Read-Modify-Write pattern for read-write transactions.
&lt;/h3&gt;

&lt;p&gt;In this pattern, a generation comparison check is included in the write policy. A record's generation is its version, and this check preserves the validity of a write that depends on a previous read. The "Check-And-Set" (CAS) equality check against the read generation fails with a generation error if another write has incremented the generation in the meantime; in that case, the entire Read-Modify-Write sequence must be retried.&lt;/p&gt;
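
&lt;p&gt;The pattern can be sketched with a simple in-memory store. This is an illustrative simulation with hypothetical names, not the Aerospike client API; it shows the retry loop wrapped around the generation check:&lt;/p&gt;

```python
class GenerationError(Exception):
    """The record's generation no longer matches the one that was read."""

class Store:
    # Tiny in-memory stand-in for a record store with a CAS-on-generation
    # write, mimicking the generation-equality check in the write policy.
    def __init__(self):
        self.records = {}          # key: (value, generation)

    def read(self, key):
        return self.records.get(key, (None, 0))

    def write_if_gen(self, key, value, expected_gen):
        _, gen_now = self.read(key)
        if gen_now != expected_gen:
            raise GenerationError()  # another write slipped in between
        self.records[key] = (value, gen_now + 1)

def add_to_balance(store, key, amount, max_attempts=5):
    # Read-Modify-Write: on a generation error, retry the whole sequence.
    for _ in range(max_attempts):
        balance, gen = store.read(key)
        try:
            store.write_if_gen(key, (balance or 0) + amount, gen)
            return True
        except GenerationError:
            continue               # re-read and try again
    return False

store = Store()
add_to_balance(store, "acct", 100)
add_to_balance(store, "acct", 50)
```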


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
 &lt;em&gt;Read-Modify-Write pattern for read-write transactions&lt;/em&gt;




&lt;h3&gt;
  
  
  4. Tag a write with a unique id to confirm if a transaction succeeded or failed.
&lt;/h3&gt;

&lt;p&gt;Uncertainty about a transaction's outcome can arise from client, connection, and server failures. System load can also cause the replication sequence to remain incomplete when the request times out with an "in-doubt" status. There is no transaction handle that the application can use to probe the status in such cases. It must therefore tag the record with a unique id as part of the transaction, which it can later use to check whether the transaction succeeded or failed.&lt;/p&gt;
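
&lt;p&gt;A minimal sketch of the idea, using plain Python structures in place of the client API (the names are illustrative): the tag is written atomically with the data, and a later membership check resolves an in-doubt outcome.&lt;/p&gt;

```python
import uuid

def write_with_tag(record, bin_updates, txn_id):
    # In Aerospike this would be a single multi-operation "operate" call,
    # so the data update and the transaction tag land atomically together.
    record.setdefault("txns", set())
    record.update(bin_updates)
    record["txns"].add(txn_id)

def transaction_succeeded(record, txn_id):
    # After an in-doubt timeout, re-read the record: the tag's presence
    # resolves whether the write actually landed.
    return txn_id in record.get("txns", set())

record = {}
txn_id = str(uuid.uuid4())
write_with_tag(record, {"balance": 100}, txn_id)
```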


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
 &lt;em&gt;Tagging a write with a unique id&lt;/em&gt;




&lt;h3&gt;
  
  
  5. Achieve multi-operation atomicity and only-once effect through Operate, predicate expressions, and various policies.
&lt;/h3&gt;

&lt;p&gt;The Aerospike operation &lt;a href="https://www.aerospike.com/docs/client/python/usage/kvs/write.html#multi-ops"&gt;Operate&lt;/a&gt; allows multiple operations to be performed atomically on a single record. It can be combined with various policies that enable conditional execution to achieve only-once effect. Examples include &lt;a href="https://www.aerospike.com/apidocs/python/predexp.html#module-aerospike.predexp"&gt;predicate expressions&lt;/a&gt; in &lt;a href="https://www.aerospike.com/apidocs/python/client.html#operate-policies"&gt;operate policy&lt;/a&gt;, insertion in map with &lt;a href="https://www.aerospike.com/apidocs/python/client.html#map-policies"&gt;create-only write mode&lt;/a&gt;, insertion in list with &lt;a href="https://www.aerospike.com/apidocs/python/aerospike.html#list-write-flags"&gt;add-unique write flag&lt;/a&gt;, and so on.&lt;/p&gt;




&lt;h3&gt;
  
  
  6. Simplify write transactions by making write only-once (idempotent).
&lt;/h3&gt;

&lt;p&gt;An only-once write (enabled by the mechanisms described in point 5 above) becomes safe to simply retry on failure. A prior success will result in an "already exists" failure, which indicates that the transaction already executed successfully.&lt;/p&gt;
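
&lt;p&gt;A minimal simulation of this retry logic (illustrative names, not client code): the in-doubt first attempt actually lands, and the blind retry maps "already exists" to success.&lt;/p&gt;

```python
class AlreadyExists(Exception):
    """Stand-in for the 'already exists' error on a create-only write."""

class OnceStore:
    # In-memory stand-in for a create-only (only-once) write target.
    def __init__(self):
        self.data = {}

    def create_only(self, key, value):
        if key in self.data:
            raise AlreadyExists()
        self.data[key] = value

def flaky_create(store, key, value):
    # Simulates a write that commits but whose acknowledgement is lost,
    # leaving the client with an in-doubt timeout.
    try:
        store.create_only(key, value)
    finally:
        raise TimeoutError("in doubt")

def write_once(store, key, value):
    # Blind retry is safe: "already exists" proves a prior success.
    try:
        store.create_only(key, value)
        return "written"
    except AlreadyExists:
        return "already-written"

store = OnceStore()
try:
    flaky_create(store, "txn-1", {"amount": 10})
except TimeoutError:
    pass                           # outcome unknown to the caller
result = write_once(store, "txn-1", {"amount": 10})
```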


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
 &lt;em&gt;Safe retries with only-once write transactions&lt;/em&gt;




&lt;h3&gt;
  
  
  7. Record the details for subsequent handling in a batch or manual process if a write's outcome cannot be resolved.
&lt;/h3&gt;

&lt;p&gt;During a long duration cluster split event, the client may be unable to resolve a transaction's outcome. The client can timeout after retries but should record the details needed for external resolution such as the record key, transaction id, and write details.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
 &lt;em&gt;Record transaction details for external resolution&lt;/em&gt;




&lt;h3&gt;
  
  
  8. Choose the optimal read mode. 
&lt;/h3&gt;

&lt;p&gt;There are four SC read modes to choose from: Linearizable, Session, Allow-replica, and Allow-unavailable. All of them guarantee no data loss and no dirty reads, but they differ in their "no stale reads" guarantees as well as in performance. A Linearizable read ensures the latest version across all clients, but it involves checking with all replicas and is therefore the most expensive. Even so, without an additional external synchronization mechanism among clients, the version is not guaranteed to still be the latest by the time it reaches the client. A Session read is faster because it reads directly from the master replica, and is therefore recommended. In a multi-site cluster, local reads are much faster than remote reads; since the master replica may reside at another site, the Allow-replica mode offers much better performance with a no-stale guarantee practically equivalent to the Session mode, and is therefore recommended in multi-site clusters. There are no staleness guarantees with the Allow-unavailable mode, but an application may judiciously leverage it when it knows the data may be stale but can still derive positive value from it.&lt;/p&gt;




&lt;h3&gt;
  
  
  9. Use the default value for max-retries (zero) in write-policy.
&lt;/h3&gt;

&lt;p&gt;The max-retries value indicates the number of retries that the client library will perform automatically in case of a failure. Because the transaction logic is sensitive to the type of failure, a transaction failure must be handled in the application, not automatically by the client library. Therefore use the default value to turn off the automatic retries in the client library.&lt;/p&gt;




&lt;h3&gt;
  
  
  10. For maximum durability, commit each write to the disk on a per-transaction basis using commit-to-device setting.
&lt;/h3&gt;

&lt;p&gt;With this setting, a replica flushes the write buffer to disk before acknowledging back to the master. On a successful write, the application is certain that the update is secure on disk at each replica, achieving the maximum possible durability. Be aware of the performance implications of flushing each write to disk (unless data is in PMEM), and balance them against the desired durability.&lt;/p&gt;




&lt;h3&gt;
  
  
  11. For exactly-once multi-record (non-atomic) updates use the pattern: record atomically - post at-least-once - process only-once.
&lt;/h3&gt;

&lt;p&gt;Aerospike does not support multi-record transactions. To implement exactly-once semantics for multi-record updates, record the event atomically in the first record as part of its update. A separate process collects the recorded event and posts it for processing against the second record. At-least-once delivery is achieved by removing the event only after successful hand-off to (or execution of) the subsequent step, which updates the second record with only-once semantics. Together, this sequence achieves exactly-once execution of multi-record updates. The pattern is explored further in this &lt;a href="https://www.aerospike.com/blog/microservices-with-aerospike/"&gt;post&lt;/a&gt;.&lt;/p&gt;
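
&lt;p&gt;The pattern resembles a transactional outbox. Below is an illustrative in-memory sketch with hypothetical names, showing the three steps: record atomically, post at-least-once, process only-once.&lt;/p&gt;

```python
class Outbox:
    """Sketch of: record atomically, post at-least-once, process only-once."""

    def __init__(self):
        self.records = {}           # first record: data plus embedded events
        self.processed = set()      # event ids already applied downstream
        self.counters = {}          # the "second record" being updated

    def update_with_event(self, key, value, event_id, event):
        # Step 1: the data change and the event are one atomic record write.
        rec = self.records.setdefault(key, {"value": None, "events": {}})
        rec["value"] = value
        rec["events"][event_id] = event

    def deliver(self, key):
        # Step 2: post pending events; redelivery is allowed (at-least-once),
        # and an event is removed only after a successful hand-off.
        for event_id, event in list(self.records[key]["events"].items()):
            self.apply(event_id, event)
            del self.records[key]["events"][event_id]

    def apply(self, event_id, event):
        # Step 3: the downstream update is applied only once per event id.
        if event_id in self.processed:
            return
        self.processed.add(event_id)
        target = event["target"]
        self.counters[target] = self.counters.get(target, 0) + event["amount"]

outbox = Outbox()
outbox.update_with_event("a", 1, "e1", {"target": "b", "amount": 5})
outbox.deliver("a")
outbox.apply("e1", {"target": "b", "amount": 5})   # duplicate post is ignored
```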




&lt;h3&gt;
  
  
  12. Resolve in-flight transactions during crash recovery by recording the transaction intent. 
&lt;/h3&gt;

&lt;p&gt;Before a write request is sent to the server, record the intent so that it can be read and retried, if necessary, during crash recovery. The intent is removed on successful execution as part of normal processing. During recovery, the intent list is read and the pending writes are resolved or retried.&lt;/p&gt;
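
&lt;p&gt;A minimal sketch of an intent log (illustrative only; a real implementation would need durable, crash-safe storage such as atomic file replacement or a reliable queue):&lt;/p&gt;

```python
import json
import os
import tempfile

class IntentLog:
    """Persist write intent before the request; clear it on success."""

    def __init__(self, path):
        self.path = path

    def _load(self):
        if os.path.exists(self.path):
            with open(self.path) as f:
                return json.load(f)
        return {}

    def _save(self, intents):
        with open(self.path, "w") as f:
            json.dump(intents, f)

    def record(self, txn_id, details):
        # Called before sending the write to the server.
        intents = self._load()
        intents[txn_id] = details
        self._save(intents)

    def clear(self, txn_id):
        # Called after the write's success is confirmed.
        intents = self._load()
        intents.pop(txn_id, None)
        self._save(intents)

    def pending(self):
        # On crash recovery, these writes must be resolved or retried.
        return self._load()

log = IntentLog(os.path.join(tempfile.mkdtemp(), "intents.json"))
log.record("t1", {"key": "acct", "op": "add", "amount": 10})
log.record("t2", {"key": "other", "op": "add", "amount": 5})
log.clear("t1")   # the first write completed normally
```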




</description>
      <category>consistency</category>
      <category>aerospike</category>
      <category>nosql</category>
      <category>transactions</category>
    </item>
    <item>
      <title>Resolving Uncertain Transactions in Aerospike</title>
      <dc:creator>Neel Phadnis</dc:creator>
      <pubDate>Thu, 28 May 2020 00:13:04 +0000</pubDate>
      <link>https://dev.to/aerospike/resolving-uncertain-transactions-in-aerospike-14ld</link>
      <guid>https://dev.to/aerospike/resolving-uncertain-transactions-in-aerospike-14ld</guid>
      <description>&lt;p&gt;For applications that desire strong consistency, it is important to not only have all replicas in the cluster be always in sync, but also to have application's knowledge of a transaction's outcome consistent with the actual outcome in the database. If the application is uncertain about a transaction's outcome, it must first resolve it so that it can either do nothing, retry the transaction, or arrange for external resolution.&lt;/p&gt;

&lt;p&gt;There are many situations that leave the client uncertain of a transaction's outcome. &lt;/p&gt;

&lt;p&gt;Let us first look at how a write transaction is processed in Aerospike. Executing a write request involves many interactions between the client, master, and replicas as follows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YLBQZ2AX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://docs.google.com/drawings/d/e/2PACX-1vQW-poyU6SCPmUPJ1ObUbgjNOW-vPBFP2GexFNnvrFx63F9PWK10PFLrXvMm9FU9SibuO0YWllZUyHu/pub%3Fw%3D944%26h%3D495" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YLBQZ2AX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://docs.google.com/drawings/d/e/2PACX-1vQW-poyU6SCPmUPJ1ObUbgjNOW-vPBFP2GexFNnvrFx63F9PWK10PFLrXvMm9FU9SibuO0YWllZUyHu/pub%3Fw%3D944%26h%3D495" alt="A write transaction sequence" width="800" height="420"&gt;&lt;/a&gt; &lt;em&gt;A write transaction sequence&lt;/em&gt; &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The client sends the request to the master.&lt;/li&gt;
&lt;li&gt;Master processes the write locally.&lt;/li&gt;
&lt;li&gt;Master sends updates to all replicas.&lt;/li&gt;
&lt;li&gt;Replicas acknowledge to the master after applying the update locally.&lt;/li&gt;
&lt;li&gt;After receiving acknowledgements from all replicas, master returns success to client.&lt;/li&gt;
&lt;li&gt;Master sends an advisory "replicated" message to all replicas. Replicas are then free to serve this version without having to consult with the master.&lt;/li&gt;
&lt;/ol&gt;

&lt;h1&gt;
  
  
  Sources of uncertainty
&lt;/h1&gt;

&lt;p&gt;Various failures can prevent a write from completing successfully and/or the client from receiving a clear success or failure status. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Client, connection, and master failures&lt;/em&gt;&lt;br&gt;
Failures like a client crash, connection error, node failure, or network partition can happen at any moment in the above sequence. Any of them can leave the client uncertain about a write's outcome. The client must resolve the uncertainty so that it can achieve the intended state.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;In-Doubt status&lt;/em&gt;&lt;br&gt;
The client can also receive the "in-doubt" flag in a timeout error. The flag signifies that the master has still not finished the replication sequence with all replicas. Clearly, the client, connection, and master all must be healthy for a timeout error to be returned with the in-doubt flag set.&lt;/p&gt;

&lt;p&gt;Why wouldn't the master finish the replication sequence? The causes can be many: a slow master, network, or replica; a node or network failure; a storage failure; and potentially others. In such cases, the time for recovery to complete, whether automated or manual, and for the write to be resolved can be unpredictable. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How common are such uncertainties?&lt;/strong&gt;&lt;br&gt;
The frequency of these events is determined by factors like the size of the cluster and the load. The following table shows that these events are too common to ignore if the application desires strong consistency.&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
 &lt;em&gt;Events causing write uncertainty are common&lt;/em&gt; 
&lt;h1&gt;
  
  
  Aerospike model
&lt;/h1&gt;

&lt;p&gt;It is important to understand the transaction model in Aerospike before we describe a solution. Aerospike was designed for speed@scale, with the goal of keeping most common operations simple and predictably fast, and deferring complexity to applications for less common scenarios.&lt;/p&gt;

&lt;p&gt;Transactions in Aerospike span a single request and a single record. The API does not support a notion of a transaction id or a way to query a transaction's status. Therefore, the application must devise its own way to determine a transaction's status.&lt;/p&gt;

&lt;p&gt;The application links with the client library (Smart Client) that directly and transparently connects to all nodes in the cluster and dynamically adapts to cluster changes. Therefore, simply retrying a transaction that failed due to a recoverable cluster failure can result in success.&lt;/p&gt;
&lt;h1&gt;
  
  
  Resolving uncertainty: Potential solutions
&lt;/h1&gt;

&lt;p&gt;Some intuitive potential solutions unfortunately don't always work.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Potential solution 1: A polling read back in a loop until the record generation reflects an update.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Note that the read must look for the generation (i.e., version) of the record that reflects the write. On a timeout after several read attempts, the application may attempt to force the write in question to fail, so that its outcome becomes known, by "touching" the record, which increments its generation without changing any data.&lt;/p&gt;

&lt;p&gt;Case 1. The write sequence has completed. &lt;/p&gt;

&lt;p&gt;Assuming no subsequent updates, the read will return the prior version if the write failed. Otherwise it will return the new version. &lt;br&gt;
However, if there are subsequent updates, the latest version will be returned and there is no way of knowing the outcome of the write in question.&lt;/p&gt;

&lt;p&gt;Case 2. The write sequence is still in progress.&lt;/p&gt;

&lt;p&gt;If the read request goes to the same master, the read will return the prior version if the write has not completed yet. If the client attempts a "touch" in this case, it will be queued behind the original write (and may time out). The write's outcome remains unknown.&lt;/p&gt;

&lt;p&gt;If the read request is directed to a new master because of a cluster transition, the new master will return the new version (i.e., the one updated by the write in question, assuming no subsequent updates) if the write was replicated to a replica in the newly formed cluster before the transition; otherwise the previous version will be returned. If the client attempts a "touch" in this case, the original write will lose out (i.e., fail when the cluster re-forms) to the version in the new cluster regime. Just as in Case 1, if there are subsequent updates in the new cluster, the latest version will be returned and there is no way of knowing the outcome of the write in question.&lt;/p&gt;

&lt;p&gt;Thus, this potential solution cannot be used to resolve a write's outcome.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Potential solution 2: Retry the write&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When a retry can work: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;when the write is idempotent: great, as the application knows which writes are idempotent;&lt;/li&gt;
&lt;li&gt;when the original request has failed: good, if the application knows this, e.g., when there is a timeout with no in-doubt flag; and&lt;/li&gt;
&lt;li&gt;when there is a newly formed cluster and the write was not replicated to it: yes, but the application does not know whether this is the case.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When a retry will not work: &lt;br&gt;
A retry is not safe because it will duplicate the write when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the original write sequence has since completed successfully, or &lt;/li&gt;
&lt;li&gt;the new cluster has the write replicated to it prior to the partition. &lt;/li&gt;
&lt;/ul&gt;
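Taken together, the conditions above amount to a small decision rule. Here is an illustrative sketch in plain Python (the in-doubt flag mirrors the one Aerospike client errors carry; the function and parameter names are invented for illustration):

```python
def can_retry(idempotent: bool, known_failed: bool, in_doubt: bool) -> bool:
    """Decide whether re-issuing a write is safe.

    idempotent   -- re-applying the write cannot change the final state
    known_failed -- the client knows the original write did not commit
                    (e.g., an error returned without the in-doubt flag)
    in_doubt     -- the outcome is unknown (timeout, partition, ...)
    """
    if idempotent:
        return True   # duplicates are harmless
    if known_failed and not in_doubt:
        return True   # the original write never committed
    return False      # retrying may duplicate a committed write
```

The third safe case above (write not replicated to the new cluster) still maps to `False` here, because the application has no way to detect it.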

&lt;p&gt;Again, this potential solution does not work in the general case.&lt;/p&gt;
&lt;h1&gt;
  
  
  A general solution
&lt;/h1&gt;

&lt;p&gt;In order to query the status of a write, the application must tag the write with a unique id that it can use as a transaction handle. This must be done atomically with the write using Aerospike's multi-operation "operate" API.&lt;/p&gt;

&lt;p&gt;It is also very useful to implement a write with "only-once" semantics, so the application can safely retry when there is uncertainty about the outcome. This can be accomplished by storing the txn-id as a unique item in a map or list bin in the record. Aerospike has the create-only map write flag to ensure that the entire multi-operation "operate" succeeds only once. (Other mechanisms such as predicate expressions may be used instead.) Subsequent attempts would result in an "element exists" error, which indicates a prior successful execution of the write.&lt;/p&gt;

&lt;p&gt;This adds a key-value ordered map bin to the record, txns: txn_id =&amp;gt; timestamp:&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;key: { &lt;br&gt;
 … // record data&lt;br&gt;
 txns: map (txn_id =&amp;gt; timestamp) // transaction tag&lt;br&gt;
}&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;Below is the pseudo code for the general solution. &lt;/p&gt;
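The scheme can be sketched in plain Python, simulating the record store in memory (this is illustrative only: `records`, `tagged_write`, and `resolve_and_retry` are invented names, and a real implementation would use Aerospike's multi-operation "operate" API with the create-only map write flag):

```python
import time

class ElementExistsError(Exception):
    """Stands in for Aerospike's 'element exists' error on a create-only map write."""

records = {}  # key -> record dict; stands in for the Aerospike namespace

def tagged_write(key, txn_id, update):
    """Apply `update` to the record atomically, tagging it with txn_id.

    txn_id is inserted into the record's 'txns' map with create-only
    semantics, so a duplicate attempt raises instead of re-applying the write.
    """
    rec = records.setdefault(key, {"txns": {}})
    if txn_id in rec["txns"]:              # the create-only flag would reject this
        raise ElementExistsError(txn_id)
    update(rec)                            # the actual record mutation
    rec["txns"][txn_id] = time.time()      # transaction tag: txn_id => timestamp

def resolve_and_retry(key, txn_id, update):
    """Safely retry a write whose outcome is uncertain (exactly-once)."""
    try:
        tagged_write(key, txn_id, update)
        return "applied"
    except ElementExistsError:
        return "already-applied"           # a prior attempt succeeded
```

On the first call `resolve_and_retry` applies the update and tags it; any retry with the same txn-id hits the "element exists" path and reports the earlier success, leaving the record updated exactly once.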


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
 &lt;em&gt;Resolving uncertainty by tagging a write with a unique id&lt;/em&gt; 

&lt;h1&gt;
  
  
  A simpler solution?
&lt;/h1&gt;

&lt;p&gt;Maintaining a txns map or list in each record, tagging and checking the txn-id on every write, and periodically trimming txns adds significant space and time overhead. Can it be avoided or simplified?&lt;/p&gt;

&lt;p&gt;If consistency is absolutely required, this scheme (or an equivalent variation) is the recommended solution. Absent built-in support in the API or server, it is currently the general way to resolve uncertainty about a transaction's outcome and to ensure "only/exactly once" semantics.&lt;/p&gt;

&lt;p&gt;However, simplifications are possible. Here are some things to consider to devise a simpler solution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Frequency of updates: If writes are rare (e.g., daily update to usage stats), it may be possible to read back and resolve a write's outcome.&lt;/li&gt;
&lt;li&gt;Uniqueness of update: Can a client identify its update some other way (i.e., without a txn-id) when multiple clients write to the same record?&lt;/li&gt;
&lt;li&gt;A handful of write clients: If there are a small number of write clients, a more efficient scheme can be devised such as client-specific versions in their own bins (assuming a client can serialize its writes).&lt;/li&gt;
&lt;li&gt;Likelihood of client, connection, node, or network partition failures: If such failures are rare, an application may decide to live with lost or duplicate writes for less critical data.&lt;/li&gt;
&lt;li&gt;Ability to serialize all writes through external synchronization: A simpler solution can be devised in this case.&lt;/li&gt;
&lt;li&gt;Ability to record uncertain writes and resolve them out of band: Log the details for external resolution and make appropriate data adjustments if necessary.&lt;/li&gt;
&lt;/ul&gt;
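As one example, the "handful of write clients" simplification above might look like the following sketch, where each client keeps a version counter in its own bin instead of a shared txns map (a simulated, illustrative implementation; bin naming and the read-back check are assumptions, and each client must serialize its own writes):

```python
records = {}  # key -> record dict; stands in for the database

def versioned_write(key, client_id, version, update):
    """Apply `update` only if this client's last applied version is older.

    Each writer keeps a monotonically increasing version in its own bin
    (e.g., 'v_clientA'), so an uncertain write is resolved by checking
    that bin rather than maintaining a full txns map per record.
    """
    rec = records.setdefault(key, {})
    bin_name = "v_" + str(client_id)
    if rec.get(bin_name, 0) >= version:
        return "already-applied"     # an earlier attempt committed
    update(rec)                      # the actual record mutation
    rec[bin_name] = version          # record this client's high-water mark
    return "applied"
```

This trades the per-write map overhead for one small bin per writing client, at the cost of requiring per-client write serialization.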

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Strong consistency requires the application to resolve uncertain transaction outcomes and to implement safe retries. In order to query and resolve an uncertain outcome, the application needs to tag each transaction with a unique id. To achieve exactly-once write semantics, the application can use the mechanisms available in Aerospike: the multi-operation "operate" API, Map/List write flags, and predicate expressions. In some cases, knowledge of the data, operations, and architecture can be used to simplify the solution.&lt;/p&gt;

</description>
      <category>transactions</category>
      <category>consistency</category>
      <category>aerospike</category>
      <category>nosql</category>
    </item>
  </channel>
</rss>
