<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Micah Carrick</title>
    <description>The latest articles on DEV Community by Micah Carrick (@micahcarrick).</description>
    <link>https://dev.to/micahcarrick</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F466859%2F88e7b081-396d-48b1-8fa2-6416a9cdacc0.png</url>
      <title>DEV Community: Micah Carrick</title>
      <link>https://dev.to/micahcarrick</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/micahcarrick"/>
    <language>en</language>
    <item>
      <title>Using a YubiKey with AWS CLI Sessions</title>
      <dc:creator>Micah Carrick</dc:creator>
      <pubDate>Sun, 23 Feb 2025 19:34:43 +0000</pubDate>
      <link>https://dev.to/micahcarrick/using-a-yubikey-with-aws-cli-sessions-3lk4</link>
      <guid>https://dev.to/micahcarrick/using-a-yubikey-with-aws-cli-sessions-3lk4</guid>
      <description>&lt;p&gt;This is the bash script I use with &lt;a href="https://developers.yubico.com/yubikey-manager/" rel="noopener noreferrer"&gt;yubikey-manager CLI (ykman)&lt;/a&gt; to create a session for the AWS CLI using a YubiKey as a MFA device. This configuration is specifically for using short-term credentials.&lt;/p&gt;

&lt;p&gt;Using the script avoids having to copy/paste the code obtained from the YubiKey to the &lt;a href="https://docs.aws.amazon.com/cli/latest/reference/sts/get-session-token.html" rel="noopener noreferrer"&gt;&lt;code&gt;get-session-token&lt;/code&gt;&lt;/a&gt; command.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Requirements:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;a href="https://jqlang.org/" rel="noopener noreferrer"&gt;jq&lt;/a&gt; utility&lt;/li&gt;
&lt;li&gt;A &lt;a href="https://aws.amazon.com/blogs/security/use-yubikey-security-key-sign-into-aws-management-console/" rel="noopener noreferrer"&gt;YubiKey MFA device configured for an AWS IAM user&lt;/a&gt; and its serial-number ARN&lt;/li&gt;
&lt;li&gt;AWS CLI configured for short-term credentials per &lt;a href="https://docs.aws.amazon.com/cli/latest/userguide/getting-started-quickstart.html#getting-started-quickstart-new" rel="noopener noreferrer"&gt;Setting up the AWS CLI&lt;/a&gt;. For example:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;~/.aws/config&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[profile my-session]&lt;/span&gt;

&lt;span class="nn"&gt;[profile my-profile]&lt;/span&gt;
&lt;span class="py"&gt;source_profile&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;my-session&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The script first invokes &lt;code&gt;ykman&lt;/code&gt;, which pauses and waits for the button on the YubiKey to be pressed. This produces a code that is passed to &lt;a href="https://docs.aws.amazon.com/cli/latest/reference/sts/get-session-token.html" rel="noopener noreferrer"&gt;&lt;code&gt;get-session-token&lt;/code&gt;&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/env bash&lt;/span&gt;

&lt;span class="c"&gt;# MFA_SERIAL_ARN="arn:aws:iam::[ACCOUNT_ID]:mfa/[IAM_USER]"&lt;/span&gt;
&lt;span class="nv"&gt;MFA_SERIAL_ARN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:iam::111111111111:mfa/jane.doe"&lt;/span&gt;
&lt;span class="nv"&gt;USER_PROFILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"my-profile"&lt;/span&gt;
&lt;span class="nv"&gt;SESSION_PROFILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"my-session"&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Fetching code from Yubikey device"&lt;/span&gt;
&lt;span class="nv"&gt;mfa_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;ykman oath accounts code &lt;span class="nt"&gt;--single&lt;/span&gt; &lt;span class="nv"&gt;$MFA_SERIAL_ARN&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Creating session (code=&lt;/span&gt;&lt;span class="nv"&gt;$mfa_code&lt;/span&gt;&lt;span class="s2"&gt;)"&lt;/span&gt;
&lt;span class="nv"&gt;sts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;aws sts get-session-token &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--duration&lt;/span&gt; 14400 &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--serial-number&lt;/span&gt; &lt;span class="nv"&gt;$MFA_SERIAL_ARN&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--token-code&lt;/span&gt; &lt;span class="nv"&gt;$mfa_code&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--profile&lt;/span&gt; &lt;span class="nv"&gt;$USER_PROFILE&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="nv"&gt;access_key_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sb"&gt;`&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$sts&lt;/span&gt; | jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'.Credentials.AccessKeyId'&lt;/span&gt;&lt;span class="sb"&gt;`&lt;/span&gt;
&lt;span class="nv"&gt;secret_access_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sb"&gt;`&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$sts&lt;/span&gt; | jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'.Credentials.SecretAccessKey'&lt;/span&gt;&lt;span class="sb"&gt;`&lt;/span&gt;
&lt;span class="nv"&gt;session_token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sb"&gt;`&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$sts&lt;/span&gt; | jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'.Credentials.SessionToken'&lt;/span&gt;&lt;span class="sb"&gt;`&lt;/span&gt;
&lt;span class="nv"&gt;expiration&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sb"&gt;`&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$sts&lt;/span&gt; | jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'.Credentials.Expiration'&lt;/span&gt;&lt;span class="sb"&gt;`&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Session expires on: &lt;/span&gt;&lt;span class="nv"&gt;$expiration&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
aws configure &lt;span class="nb"&gt;set &lt;/span&gt;aws_access_key_id &lt;span class="nv"&gt;$access_key_id&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--profile&lt;/span&gt; &lt;span class="nv"&gt;$SESSION_PROFILE&lt;/span&gt;
aws configure &lt;span class="nb"&gt;set &lt;/span&gt;aws_secret_access_key &lt;span class="nv"&gt;$secret_access_key&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--profile&lt;/span&gt; &lt;span class="nv"&gt;$SESSION_PROFILE&lt;/span&gt;
aws configure &lt;span class="nb"&gt;set &lt;/span&gt;aws_session_token &lt;span class="nv"&gt;$session_token&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--profile&lt;/span&gt; &lt;span class="nv"&gt;$SESSION_PROFILE&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output of the script would look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Fetching code from YubiKey device
Touch your YubiKey...
Creating session (code=123456)
Session expires on: 2025-02-23T22:12:29+00:00
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
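&lt;p&gt;As a possible refinement (not part of the original script), a pre-check could skip the YubiKey touch while a previously stored session is still valid. This is a sketch assuming GNU &lt;code&gt;date&lt;/code&gt;, reusing the expiration string captured by the script:&lt;/p&gt;

```shell
# Hypothetical pre-check: only create a new session when the old one has expired.
# Assumes GNU date; expiration is the ISO 8601 string from get-session-token.
expiration="2025-02-23T22:12:29+00:00"
if [ "$(date -u -d "$expiration" +%s)" -gt "$(date -u +%s)" ]; then
    status="valid"
else
    status="expired"
fi
echo "session $status"
```

&lt;p&gt;With the sample timestamp above (now in the past) this prints &lt;code&gt;session expired&lt;/code&gt;, at which point the script would proceed to call &lt;code&gt;ykman&lt;/code&gt; again.&lt;/p&gt;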



</description>
      <category>aws</category>
      <category>iam</category>
      <category>mfa</category>
      <category>yubikey</category>
    </item>
    <item>
      <title>aerospike</title>
      <dc:creator>Micah Carrick</dc:creator>
      <pubDate>Fri, 13 Dec 2024 17:16:12 +0000</pubDate>
      <link>https://dev.to/micahcarrick/aerospike-580j</link>
      <guid>https://dev.to/micahcarrick/aerospike-580j</guid>
      <description></description>
      <category>database</category>
    </item>
    <item>
      <title>Custom Bluesky Handle on AWS with Terraform/OpenTofu</title>
      <dc:creator>Micah Carrick</dc:creator>
      <pubDate>Sat, 23 Nov 2024 17:12:23 +0000</pubDate>
      <link>https://dev.to/micahcarrick/custom-bluesky-handle-on-aws-with-terraformopentofu-46jb</link>
      <guid>https://dev.to/micahcarrick/custom-bluesky-handle-on-aws-with-terraformopentofu-46jb</guid>
      <description>&lt;p&gt;&lt;em&gt;How to set up your custom Bluesky handle using Terraform/OpenTofu with AWS Route53.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;In this post I'll show some example Terraform code to create DNS records in AWS Route53 to use &lt;a href="https://bsky.social/about/blog/3-6-2023-domain-names-as-handles-in-bluesky" rel="noopener noreferrer"&gt;Domain Names as Handles in Bluesky&lt;/a&gt;. This is not just for vanity, it is also one way to verify your account.&lt;/p&gt;

&lt;p&gt;While setting up a DNS record in Route53 is very easy to do using the AWS Console a la "click ops", many of us with a DevOps/SRE background have too many scars from manually provisioning infrastructure--even for our own personal projects.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://bluesky.app/" rel="noopener noreferrer"&gt;Bluesky&lt;/a&gt; is built on the &lt;a href="https://atproto.com/" rel="noopener noreferrer"&gt;AT Protocol&lt;/a&gt;, a decentralized network for social applications. In the AT Protocol your &lt;a href="https://atproto.com/specs/handle" rel="noopener noreferrer"&gt;handle&lt;/a&gt; (eg. &lt;code&gt;your_handle.bsky.social&lt;/code&gt;) is a human-friendly identifier that links to a canonical, permanent &lt;a href="https://atproto.com/specs/did" rel="noopener noreferrer"&gt;decentralized identifier&lt;/a&gt; (aka &lt;code&gt;DID&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;In order to use a custom domain name for your handle (e.g. &lt;code&gt;YOUR_DOMAIN.com&lt;/code&gt;) you create a DNS TXT record in which the host is your handle and the record value contains your &lt;code&gt;DID&lt;/code&gt;.&lt;/p&gt;




&lt;p&gt;First, create an &lt;a href="https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/route53_zone" rel="noopener noreferrer"&gt;&lt;code&gt;aws_route53_zone&lt;/code&gt;&lt;/a&gt; resource if you do not already have a hosted zone for your domain.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_route53_zone"&lt;/span&gt; &lt;span class="s2"&gt;"domain"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"YOUR_DOMAIN"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;(replace &lt;code&gt;YOUR_DOMAIN&lt;/code&gt; with your domain name)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Next, find the DNS record value for your &lt;code&gt;DID&lt;/code&gt; as described in &lt;a href="https://bsky.social/about/blog/4-28-2023-domain-handle-tutorial" rel="noopener noreferrer"&gt;How to verify your Bluesky account&lt;/a&gt;. This value will look something like &lt;code&gt;did=did:plc:YOUR_DID&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Use this value in the list of records for a TXT type &lt;a href="https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/route53_record" rel="noopener noreferrer"&gt;&lt;code&gt;aws_route53_record&lt;/code&gt;&lt;/a&gt; resource.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_route53_record"&lt;/span&gt; &lt;span class="s2"&gt;"TXT_atproto"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;zone_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_route53_zone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;zone_id&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"_atproto.YOUR_DOMAIN"&lt;/span&gt;
  &lt;span class="nx"&gt;type&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"TXT"&lt;/span&gt;
  &lt;span class="nx"&gt;ttl&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;
  &lt;span class="nx"&gt;records&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="s2"&gt;"did=did:plc:YOUR_DID"&lt;/span&gt;
  &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;(replace &lt;code&gt;YOUR_DOMAIN&lt;/code&gt; with your handle host name, and &lt;code&gt;YOUR_DID&lt;/code&gt; with your &lt;code&gt;DID&lt;/code&gt;)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;name&lt;/code&gt; attribute of this &lt;a href="https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/route53_record" rel="noopener noreferrer"&gt;&lt;code&gt;aws_route53_record&lt;/code&gt;&lt;/a&gt; resource can be an apex domain (e.g. &lt;code&gt;YOUR_DOMAIN.com&lt;/code&gt;) or a subdomain (e.g. &lt;code&gt;YOUR_HANDLE.YOUR_DOMAIN.com&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;After you &lt;a href="https://opentofu.org/docs/cli/commands/apply/" rel="noopener noreferrer"&gt;apply&lt;/a&gt; this Terraform/OpenTofu configuration you can verify the DNS record using &lt;code&gt;dig&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; dig TXT _atproto.YOUR_DOMAIN +short 
&lt;span class="s2"&gt;"did=did:plc:YOUR_DID"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, update your handle in your Bluesky account settings as described in &lt;a href="https://bsky.social/about/blog/4-28-2023-domain-handle-tutorial" rel="noopener noreferrer"&gt;How to verify your Bluesky account&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;You can find me on Bluesky as &lt;a href="https://bsky.app/profile/micah.carrick.social" rel="noopener noreferrer"&gt;@micah.carrick.social&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>terraform</category>
      <category>bluesky</category>
      <category>atprotocol</category>
      <category>aws</category>
    </item>
    <item>
      <title>AWS Instances for Aerospike Clusters</title>
      <dc:creator>Micah Carrick</dc:creator>
      <pubDate>Sat, 30 Jan 2021 14:54:35 +0000</pubDate>
      <link>https://dev.to/aerospike/aws-instances-for-aerospike-clusters-1i2m</link>
      <guid>https://dev.to/aerospike/aws-instances-for-aerospike-clusters-1i2m</guid>
      <description>&lt;p&gt;&lt;em&gt;How I select the optimal AWS instance type for running Aerospike&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;One of the most common conversations I have when providing &lt;a href="https://www.aerospike.com/consulting-services/" rel="noopener noreferrer"&gt;consulting services&lt;/a&gt; for Aerospike customers running on AWS is about how to select the optimal AWS instance type.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which instance provides the lowest latency?&lt;/li&gt;
&lt;li&gt;Do I need a "storage optimized" instance type?&lt;/li&gt;
&lt;li&gt;Should I use EBS?&lt;/li&gt;
&lt;li&gt;Can I use a cheaper instance type?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When it comes to running production Aerospike workloads at scale on AWS I generally only look at this short list of instance families:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Instance Family&lt;/th&gt;
&lt;th&gt;Strengths&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;i3en instances&lt;/td&gt;
&lt;td&gt;All the things! Big and fast with lots of storage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;i3 instances&lt;/td&gt;
&lt;td&gt;Lots of storage at a low cost&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;r5/r5d instances&lt;/td&gt;
&lt;td&gt;Fast with lowest cost per GB of DRAM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;c5d instances&lt;/td&gt;
&lt;td&gt;Very fast and low cost per vCPU&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;m5d instances&lt;/td&gt;
&lt;td&gt;Low-latency all-rounder&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;So let's look closer at what I consider when selecting AWS instance types for Aerospike customers and then what each of these instance families is good at and where it might fall short.&lt;/p&gt;




&lt;h2&gt;
  
  
  Instance Type Considerations
&lt;/h2&gt;

&lt;p&gt;None of the technical considerations matter without cost. If price were no object I'd just say run Aerospike on a big ol' cluster consisting of the beefy &lt;code&gt;i3en&lt;/code&gt; instances and be done with it.&lt;/p&gt;

&lt;p&gt;But alas, we live in the real world...&lt;/p&gt;

&lt;h3&gt;
  
  
  Storage
&lt;/h3&gt;

&lt;p&gt;The first consideration for Aerospike on AWS is storage. In the vast majority of use cases, Aerospike's &lt;a href="https://www.aerospike.com/docs/architecture/storage.html" rel="noopener noreferrer"&gt;Hybrid Storage&lt;/a&gt; will be configured to use SSD/Flash.&lt;/p&gt;

&lt;p&gt;This means I want fast, consistent, and reliable NVMe SSD storage. Amazon's network-attached storage, Elastic Block Store (EBS), does not meet that need. EBS is fantastic for many storage needs, but it is not fast, consistent, or reliable when milliseconds matter. No, what I need is Amazon's &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ssd-instance-store.html" rel="noopener noreferrer"&gt;SSD instance store volumes&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;But, not all instance store volumes are equal. Aerospike has &lt;a href="https://www.aerospike.com/docs/operations/plan/ssd/ssd_certification.html#cloud-based-flash" rel="noopener noreferrer"&gt;certified the throughput of AWS SSD volumes&lt;/a&gt; for various instance families. This provides a good baseline for the relative performance expected from any given instance type.&lt;/p&gt;

&lt;p&gt;As a general rule I avoid instance sizes that don't get a full-size SSD for their family. That is, I try not to go smaller than the smallest instance type whose SSD volumes are the same size as those of the largest type in the family (half-size instances and above). This avoids sharing the physical SSD with other tenants on the underlying AWS host.&lt;/p&gt;

&lt;h3&gt;
  
  
  Memory
&lt;/h3&gt;

&lt;p&gt;When Aerospike's Hybrid Storage is configured to store data in memory, RAM is obviously the primary factor in the instance's storage capacity. When configured to store only primary indexes in memory it's still a key factor, but with some nuance.&lt;/p&gt;

&lt;p&gt;Aerospike's primary index entries are a fixed size (64 bytes) per object. That means that if I am storing a large number of small objects then I need a higher memory-to-disk ratio than if I have fewer large objects.&lt;/p&gt;
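&lt;p&gt;The index RAM can be estimated with a back-of-the-envelope calculation. The object count and replication factor below are illustrative, not from any particular deployment:&lt;/p&gt;

```shell
# Back-of-the-envelope primary index sizing: 64 bytes per object, per copy.
objects=2000000000      # 2 billion objects (illustrative)
replication_factor=2    # master copy plus one replica of each index entry
bytes_per_object=64     # fixed primary index entry size
index_gib=$(awk -v n="$objects" -v r="$replication_factor" -v b="$bytes_per_object" \
    'BEGIN { printf "%.0f", n * r * b / (1024 * 1024 * 1024) }')
echo "primary index: ${index_gib} GiB of RAM across the cluster"
```

&lt;p&gt;Two billion 1 KB objects need roughly 238 GiB of index RAM cluster-wide, while the same total bytes stored as two million 1 MB objects would need a thousandth of that.&lt;/p&gt;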

&lt;p&gt;I also want to consider optional features that can leverage RAM such as the &lt;a href="https://www.aerospike.com/docs/reference/configuration/index.html#read-page-cache" rel="noopener noreferrer"&gt;read page cache&lt;/a&gt; and &lt;a href="https://www.aerospike.com/docs/architecture/secondary-index.html" rel="noopener noreferrer"&gt;secondary indexes&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Network
&lt;/h3&gt;

&lt;p&gt;The oft-overlooked network capacity is very important in cloud environments. &lt;/p&gt;

&lt;p&gt;Obviously network throughput and latency are important considerations. However, there is also &lt;a href="https://www.aerospike.com/docs/architecture/clustering.html" rel="noopener noreferrer"&gt;clustering&lt;/a&gt; and the application's usage pattern to consider. Does the application have an aggressive SLA? How will it behave when the network is down? What about when the network is slow or dropping packets?&lt;/p&gt;

&lt;p&gt;A more consistent and reliable network results in a more stable cluster. Sure, Aerospike will automatically handle nodes leaving the cluster due to a flaky network; however, that self-healing means replicating data, and replicating data means spending money on data transfer.&lt;/p&gt;

&lt;p&gt;This means instance &lt;em&gt;size&lt;/em&gt; matters in addition to the type. When I'm looking at the &lt;a href="https://aws.amazon.com/ec2/instance-types/" rel="noopener noreferrer"&gt;AWS instance type documentation&lt;/a&gt; I'm paying special attention to the "Networking Performance" column. When the value is "up to X" then it's a bursty shared network allocation and the &lt;a href="https://en.wikipedia.org/wiki/Cloud_computing_issues#Performance_interference_and_noisy_neighbors" rel="noopener noreferrer"&gt;noisy neighbor effect&lt;/a&gt; will be more pronounced.&lt;/p&gt;

&lt;h3&gt;
  
  
  CPU
&lt;/h3&gt;

&lt;p&gt;I avoid the "burstable" general-purpose instance families like the &lt;code&gt;t2&lt;/code&gt;/&lt;code&gt;t3&lt;/code&gt; instances for production workloads. It's too risky to be able to accidentally push up above the baseline performance and have unexpected disruptions when CPU credits run dry from bursting.&lt;/p&gt;

&lt;p&gt;But beyond that, CPU is rarely a primary consideration in selecting the right instance type. As instance types are sized up to accommodate storage, memory, or network, I more often find CPU to be under-utilized.&lt;/p&gt;

&lt;p&gt;This is an opportunity for optimization as there are a number of optional features that can take advantage of under-utilized CPU:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TLS encryption in transit and encryption at rest can improve security (IMHO encryption should be the rule, not the exception).&lt;/li&gt;
&lt;li&gt;Storage compression can reduce overall storage capacity which means more cost savings.&lt;/li&gt;
&lt;li&gt;Client-server compression on the wire can save money on AWS ingress/egress costs.&lt;/li&gt;
&lt;li&gt;Taking advantage of the rich set of APIs available for complex data types (CDTs) can optimize for both cost and performance by offloading some of the data manipulation to the Aerospike nodes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, for use cases that are not using these features, I consider whether there is an opportunity to make use of the spare CPU to optimize for cost, performance, or both. The inverse is also true: I make sure to budget CPU for use cases that are using these features.&lt;/p&gt;

&lt;h2&gt;
  
  
  Instance Type Selection
&lt;/h2&gt;

&lt;p&gt;The &lt;em&gt;storage optimized&lt;/em&gt; instances are the obvious contender. Both the older &lt;code&gt;i3&lt;/code&gt; and the shiny new &lt;code&gt;i3en&lt;/code&gt; instance families are viable. But, the &lt;em&gt;compute optimized&lt;/em&gt; &lt;code&gt;c5&lt;/code&gt; instances and the &lt;em&gt;memory optimized&lt;/em&gt; &lt;code&gt;r5&lt;/code&gt; are great options depending on the use case, and the &lt;code&gt;m5&lt;/code&gt; instances are a solid choice from the &lt;em&gt;general purpose&lt;/em&gt; category.&lt;/p&gt;

&lt;h3&gt;
  
  
  i3en Instances
&lt;/h3&gt;

&lt;p&gt;When talking about &lt;em&gt;storage optimized&lt;/em&gt; AWS instances the &lt;code&gt;i3en&lt;/code&gt; is the big dog. I was pretty excited about these when AWS announced them just last year.&lt;/p&gt;

&lt;p&gt;First, they have a huge amount of SSD capacity at the lowest cost per GB. The largest &lt;code&gt;i3en.24xlarge&lt;/code&gt; weighs in at a whopping 8 x 7.5 TB SSD volumes for 60 TB of storage per instance at ~$1.58 per GB per year¹.&lt;/p&gt;

&lt;p&gt;Second, they have pretty fast drives. &lt;a href="https://www.aerospike.com/docs/operations/plan/ssd/ssd_certification.html#cloud-based-flash" rel="noopener noreferrer"&gt;We've clocked one of these SSD drives at 162k TPS&lt;/a&gt;. When compared to the &lt;code&gt;i3&lt;/code&gt; at 33k TPS that's a significant improvement!&lt;/p&gt;

&lt;p&gt;When compared with the older &lt;code&gt;i3&lt;/code&gt; instances, these instances have more SSD storage at a lower cost per GB, more SSD throughput, more CPU, more network, and more RAM. So... moar!&lt;/p&gt;

&lt;p&gt;The downsides are that, being relatively new, they may not be available in all regions, and they have a higher cost per GB of RAM than &lt;em&gt;memory optimized&lt;/em&gt; &lt;code&gt;r5&lt;/code&gt; instances.&lt;/p&gt;

&lt;p&gt;I like the &lt;code&gt;i3en&lt;/code&gt; instances for low-latency workloads storing large amounts of data where the memory-to-disk ratio results in a storage-bound cluster. In other words, when the number of instances is selected based on how much data needs to be stored and the indexes will fit into the available memory.&lt;/p&gt;

&lt;p&gt;I also like the &lt;code&gt;i3en&lt;/code&gt; instances for a configuration in which indexes are stored on the SSD as well as the data ("all flash"). This allows for HUGE amounts of data (think hundreds of TB and beyond) without breaking the bank on DRAM.&lt;/p&gt;

&lt;p&gt;For the best performance the &lt;a href="https://www.aerospike.com/docs/operations/plan/ssd/ssd_op.html" rel="noopener noreferrer"&gt;SSD disks should be over-provisioned&lt;/a&gt; by 20%.&lt;/p&gt;
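&lt;p&gt;Assuming over-provisioning here means leaving 20% of each drive unpartitioned, the usable capacity of one of the 7.5 TB volumes works out to:&lt;/p&gt;

```shell
# Usable capacity after leaving 20% of a 7.5 TB volume unpartitioned.
raw_tb=7.5
usable_tb=$(awk -v t="$raw_tb" 'BEGIN { printf "%.1f", t * 0.8 }')
echo "${usable_tb} TB usable per volume"
```

&lt;p&gt;So an &lt;code&gt;i3en.24xlarge&lt;/code&gt; configured this way offers 8 x 6 TB = 48 TB of usable capacity rather than the full 60 TB.&lt;/p&gt;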

&lt;h3&gt;
  
  
  i3 Instances
&lt;/h3&gt;

&lt;p&gt;Before the &lt;code&gt;i3en&lt;/code&gt; instances hit the scene these were the kings of SSD capacity. The largest &lt;code&gt;i3.16xlarge&lt;/code&gt; instance has 8 x 1.9 TB SSD volumes. That's 15.2 TB of storage per instance at ~$2.88 per GB per year¹.&lt;/p&gt;

&lt;p&gt;They also have a very good cost per GB of RAM.&lt;/p&gt;

&lt;p&gt;The downside is that the drives are substantially slower and they have fewer vCPUs than all the other instance families being considered. In short, they are the slowest.&lt;/p&gt;

&lt;p&gt;I like the &lt;code&gt;i3&lt;/code&gt; instances when &lt;code&gt;i3en&lt;/code&gt; is not available and storing large amounts of data. With a fairly low cost per GB of RAM &lt;em&gt;and&lt;/em&gt; a fairly low cost per GB of SSD storage, I also like the &lt;code&gt;i3&lt;/code&gt; as a cost-effective all-rounder for lower throughput workloads.&lt;/p&gt;

&lt;p&gt;For the best performance the &lt;a href="https://www.aerospike.com/docs/operations/plan/ssd/ssd_op.html" rel="noopener noreferrer"&gt;SSD disks should be over-provisioned&lt;/a&gt; by 20%.&lt;/p&gt;

&lt;h3&gt;
  
  
  r5 and r5d Instances
&lt;/h3&gt;

&lt;p&gt;The &lt;em&gt;memory optimized&lt;/em&gt; &lt;code&gt;r5&lt;/code&gt; instances provide the best cost per GB of RAM, which makes them ideal when Aerospike is configured as an in-memory database. The largest &lt;code&gt;r5.24xlarge&lt;/code&gt; instances have 768 GB of RAM per instance at ~$78.84 per GB per year.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;r5d&lt;/code&gt; variant adds 4 x 900 GB SSD volumes for a total of 3.6 TB per instance. These are fairly fast SSD drives which we've clocked at 138k TPS.&lt;/p&gt;

&lt;p&gt;The downside is that the &lt;code&gt;r5d&lt;/code&gt; variant has the &lt;em&gt;highest&lt;/em&gt; cost per GB of SSD storage of all the instance types considered here.&lt;/p&gt;

&lt;p&gt;I like the &lt;code&gt;r5&lt;/code&gt; instances for running Aerospike as an in-memory database and I like the &lt;code&gt;r5d&lt;/code&gt; instances for low-latency use cases where the memory-to-disk ratio results in a memory-bound cluster.&lt;/p&gt;

&lt;h3&gt;
  
  
  c5d Instances
&lt;/h3&gt;

&lt;p&gt;The &lt;em&gt;compute optimized&lt;/em&gt; &lt;code&gt;c5d&lt;/code&gt; instances are all about speed. When latency is the be-all/end-all these are the front runner. They have the most powerful CPUs and the SSD volumes are very fast.&lt;/p&gt;

&lt;p&gt;The downsides are that they have the highest cost per GB of RAM by a wide margin and max out at 192 GB of RAM per instance.&lt;/p&gt;

&lt;p&gt;I like the &lt;code&gt;c5d&lt;/code&gt; instances when the use case is very latency sensitive, is making use of CPU heavy features of Aerospike (encryption, compression, CDTs, etc.), and the memory-to-disk ratio does not result in a memory-bound cluster.&lt;/p&gt;

&lt;h3&gt;
  
  
  Instance Comparison
&lt;/h3&gt;

&lt;p&gt;To do an objective comparison of these instance types based on a specific workload with specific performance requirements, I refer to the following resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://aws.amazon.com/ec2/instance-types/" rel="noopener noreferrer"&gt;Amazon Instance Types&lt;/a&gt; and &lt;a href="https://aws.amazon.com/ec2/pricing/" rel="noopener noreferrer"&gt;Amazon EC2 Pricing&lt;/a&gt; provide up-to-date specs on the various instances and current pricing respectively.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.aerospike.com/docs/operations/plan/ssd/ssd_certification.html#cloud-based-flash" rel="noopener noreferrer"&gt;Aerospike Cloud-Based Flash ACT Results&lt;/a&gt; provides some baseline TPS numbers for various instances, however, I typically do a custom test using the open-source &lt;a href="https://github.com/aerospike/act" rel="noopener noreferrer"&gt;Aerospike Certification Tool (ACT)&lt;/a&gt;. This is the only way to get an accurate profile of the SSD's performance profile for a specific workload.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.aerospike.com/docs/operations/plan/capacity/" rel="noopener noreferrer"&gt;Aerospike Capacity Planning Guide&lt;/a&gt; provides details on how to calculate the RAM, SSD storage, and SSD throughput needs for a specific workload.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The following table compares the maximum size instance types for each family on my short list. The cost is broken down into annual cost per unit of capacity. This makes it easy to take a first pass at which instance family is going to be a good fit for a specific workload.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Instance&lt;/th&gt;
&lt;th&gt;vCPU&lt;/th&gt;
&lt;th&gt;SSD&lt;/th&gt;
&lt;th&gt;DRAM&lt;/th&gt;
&lt;th&gt;TPS²&lt;/th&gt;
&lt;th&gt;Hourly&lt;/th&gt;
&lt;th&gt;per GB SSD&lt;/th&gt;
&lt;th&gt;per GB DRAM&lt;/th&gt;
&lt;th&gt;per 1k TPS²&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;i3en.24xlarge&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;96&lt;/td&gt;
&lt;td&gt;60.0 TB&lt;/td&gt;
&lt;td&gt;768 GB&lt;/td&gt;
&lt;td&gt;1,296k&lt;/td&gt;
&lt;td&gt;$10.848&lt;/td&gt;
&lt;td&gt;$1.58 /year&lt;/td&gt;
&lt;td&gt;$123.74 /year&lt;/td&gt;
&lt;td&gt;$73.32 /year&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;i3.16xlarge&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;64&lt;/td&gt;
&lt;td&gt;15.2 TB&lt;/td&gt;
&lt;td&gt;488 GB&lt;/td&gt;
&lt;td&gt;264k&lt;/td&gt;
&lt;td&gt;$4.992&lt;/td&gt;
&lt;td&gt;$2.88 /year&lt;/td&gt;
&lt;td&gt;$89.61 /year&lt;/td&gt;
&lt;td&gt;$165.64 /year&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;r5d.24xlarge&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;96&lt;/td&gt;
&lt;td&gt;3.6 TB&lt;/td&gt;
&lt;td&gt;768 GB&lt;/td&gt;
&lt;td&gt;552k&lt;/td&gt;
&lt;td&gt;$6.912&lt;/td&gt;
&lt;td&gt;$16.82 /year&lt;/td&gt;
&lt;td&gt;$78.84 /year&lt;/td&gt;
&lt;td&gt;$109.69 /year&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;c5d.24xlarge&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;96&lt;/td&gt;
&lt;td&gt;3.6 TB&lt;/td&gt;
&lt;td&gt;192 GB&lt;/td&gt;
&lt;td&gt;564k&lt;/td&gt;
&lt;td&gt;$4.608&lt;/td&gt;
&lt;td&gt;$11.21 /year&lt;/td&gt;
&lt;td&gt;$210.24 /year&lt;/td&gt;
&lt;td&gt;$71.57 /year&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;m5d.24xlarge&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;96&lt;/td&gt;
&lt;td&gt;3.6 TB&lt;/td&gt;
&lt;td&gt;384 GB&lt;/td&gt;
&lt;td&gt;432k&lt;/td&gt;
&lt;td&gt;$5.424&lt;/td&gt;
&lt;td&gt;$13.20 /year&lt;/td&gt;
&lt;td&gt;$123.74 /year&lt;/td&gt;
&lt;td&gt;$109.99 /year&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A high-throughput workload with a small amount of data is likely to be most cost effective on &lt;code&gt;c5d&lt;/code&gt; instances as they have a low cost per 1k TPS.&lt;/li&gt;
&lt;li&gt;A high-throughput workload with a very large amount of data is likely to be most cost effective on &lt;code&gt;i3en&lt;/code&gt; instances as they have a low cost per 1k TPS but also have a lot of SSD storage.&lt;/li&gt;
&lt;li&gt;A high-throughput workload with a lot of small objects, and thus a high memory-to-disk ratio, is likely to be most cost effective on &lt;code&gt;r5d&lt;/code&gt; instances as they have a low cost per GB of DRAM.&lt;/li&gt;
&lt;li&gt;A low-throughput workload with a lot of data and a moderate-to-high number of objects is likely to be most cost effective on &lt;code&gt;i3&lt;/code&gt; instances as they have both low cost per GB of DRAM and low cost per GB of SSD storage.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;¹ &lt;em&gt;Cost estimates based on largest instance type in family using the full list pricing for on-demand instances in us-east-1 as of September 2020&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;² &lt;em&gt;Per-instance TPS based on a single drive multiplied by total drives per instance. See details and caveats at &lt;a href="https://www.aerospike.com/docs/operations/plan/ssd/ssd_certification.html#cloud-based-flash" rel="noopener noreferrer"&gt;Cloud-Based Flash ACT results&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
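
&lt;p&gt;The per-unit numbers in the table above are simple arithmetic: annual on-demand cost (hourly price × 8,760 hours) divided by the capacity unit. A minimal Python sketch, using the &lt;code&gt;i3en.24xlarge&lt;/code&gt; row as input:&lt;/p&gt;

```python
# Break an instance's on-demand price down into annual cost per unit of
# capacity, reproducing the methodology of the comparison table above.
HOURS_PER_YEAR = 24 * 365  # 8,760 hours

def annual_cost_per_unit(hourly_usd, units):
    """Annual on-demand cost divided by a unit of capacity (GB, 1k TPS, ...)."""
    return round(hourly_usd * HOURS_PER_YEAR / units, 2)

# i3en.24xlarge: $10.848/hour, 60.0 TB SSD, 768 GB DRAM, 1,296k TPS
per_gb_ssd = annual_cost_per_unit(10.848, 60_000)  # $1.58 /year per GB SSD
per_1k_tps = annual_cost_per_unit(10.848, 1_296)   # $73.32 /year per 1k TPS
per_gb_dram = annual_cost_per_unit(10.848, 768)    # roughly $123.7 /year per GB DRAM
```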

&lt;h2&gt;
  
  
  Scaling Strategy (Instance Size vs. Cluster Size)
&lt;/h2&gt;

&lt;p&gt;The instance type selection above pointed out that the smaller sizes within an instance family have burstable network performance and shared SSD controllers due to the number of other tenants on the host. What this amounts to is that these instance sizes will have more &lt;em&gt;variability&lt;/em&gt; in their performance characteristics.&lt;/p&gt;

&lt;p&gt;For smaller workloads there is a trade-off between larger instance sizes and larger cluster sizes. The consideration is having enough instances in the cluster to spread the workload out, which lessens the impact of single-instance maintenance or failure, versus minimizing the variability of the performance. On the other end of the spectrum, as the cluster grows, more nodes can mean longer maintenance windows for rolling updates, and larger instances mean slower cold restarts as indexes are rebuilt.&lt;/p&gt;

&lt;p&gt;There is no hard and fast rule here, and these considerations must be weighed in the context of each organization's budget and operational parameters. However, the approach I take for production clusters is loosely:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Prefer a cluster size of 2x the number of availability zones and don't go below 1x&lt;/li&gt;
&lt;li&gt;Prefer instance sizes with fixed network performance and don't go below 4 vCPUs&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So that means if I'm designing a production cluster in 3 zones based on the &lt;code&gt;r5d&lt;/code&gt; family, the smallest cluster I would recommend is 3x &lt;code&gt;r5d.xlarge&lt;/code&gt; (4 vCPU).&lt;/p&gt;

&lt;p&gt;The scaling strategy would be to scale &lt;em&gt;out&lt;/em&gt; to 6x &lt;code&gt;r5d.xlarge&lt;/code&gt; to get to 2x zones after which it would then scale &lt;em&gt;up&lt;/em&gt; until it gets to the &lt;code&gt;r5d.8xlarge&lt;/code&gt; which have the fixed network.&lt;/p&gt;

&lt;p&gt;However, if the workload is latency sensitive and the variability of the small instances is a concern, then the scaling strategy would instead be to scale &lt;em&gt;up&lt;/em&gt; from the minimum cluster size of 3 until it gets to 3x &lt;code&gt;r5d.8xlarge&lt;/code&gt;, where it has the fixed network. Then it would start scaling &lt;em&gt;out&lt;/em&gt; to 6x &lt;code&gt;r5d.8xlarge&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Once reaching 6x &lt;code&gt;r5d.8xlarge&lt;/code&gt;, scaling &lt;em&gt;up&lt;/em&gt; to the &lt;code&gt;r5d.12xlarge&lt;/code&gt; would be the next step to get to the full-size (not shared) SSD. After that, the decision to continue to scale &lt;em&gt;up&lt;/em&gt; vs. &lt;em&gt;out&lt;/em&gt; is largely dependent on the operational parameters that best suit the organization.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The art of selecting the right instance type is about optimizing the cost to performance ratio.&lt;/p&gt;

&lt;p&gt;One of the beautiful things about running Aerospike in the cloud is that you don't have to get it perfect on day one. You can start with a conservative approach in which you leave room for the application usage patterns to evolve and stabilize before right-sizing based on real-world production metrics. Remember, with Aerospike you can always do rolling updates, with zero downtime, to switch between instance sizes and types or do A/B testing.&lt;/p&gt;

</description>
      <category>aerospike</category>
      <category>aws</category>
      <category>nosql</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Python argument parsing with log level and config file</title>
      <dc:creator>Micah Carrick</dc:creator>
      <pubDate>Tue, 15 Sep 2020 11:56:36 +0000</pubDate>
      <link>https://dev.to/micahcarrick/python-argument-parsing-with-log-level-and-config-file-1d6p</link>
      <guid>https://dev.to/micahcarrick/python-argument-parsing-with-log-level-and-config-file-1d6p</guid>
      <description>&lt;p&gt;&lt;em&gt;Making use of &lt;code&gt;parse_known_args()&lt;/code&gt; to set root log level and parse default values from a configuration file before parsing main args.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Suppose you're banging out a simple command-line utility in Python which has the following requirements:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Root log level can be set with a command-line argument&lt;/li&gt;
&lt;li&gt;Configuration is &lt;em&gt;optionally&lt;/em&gt; read from an external file&lt;/li&gt;
&lt;li&gt;Command-line arguments take precedence over configuration file&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The &lt;code&gt;argparse.ArgumentParser&lt;/code&gt; class includes the &lt;code&gt;parse_known_args()&lt;/code&gt; method, which allows for incrementally parsing the command-line arguments so that you can set up how subsequent arguments are parsed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Set log level from command-line argument
&lt;/h2&gt;

&lt;p&gt;The root logger must be configured &lt;em&gt;before&lt;/em&gt; there is any output. If the argument parsing is going to be doing work that could or should produce log messages but you want the log level to be set from the arguments being parsed, use &lt;code&gt;parse_known_args()&lt;/code&gt; to parse and set the log level first. Then proceed with the rest of the arguments.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
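
&lt;p&gt;A minimal sketch of the technique (the &lt;code&gt;-l&lt;/code&gt; flag matches the examples later in this post; an explicit argv list is passed here for illustration):&lt;/p&gt;

```python
import argparse
import logging

# Parse only the log-level argument first so the root logger can be
# configured before any other parsing work produces log output.
parser = argparse.ArgumentParser(add_help=False)
parser.add_argument('-l', '--log-level', default='WARNING',
                    choices=['DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL'])
args, remaining_argv = parser.parse_known_args(['-l', 'DEBUG', '--other', 'x'])

logging.basicConfig(level=getattr(logging, args.log_level))
logging.getLogger(__name__).info('Log level set: %s', args.log_level)

# ...add_argument() for the remaining options, then parse_args(remaining_argv)...
```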


&lt;p&gt;As an aside, I'm generally a fan of using &lt;a href="https://docs.python.org/3/library/logging.config.html#logging-config-fileformat" rel="noopener noreferrer"&gt;logging config files&lt;/a&gt; to setup more robust logging for Python applications. However, for quick 'n simple command-line scripts I like the simplicity of this approach.&lt;/p&gt;

&lt;h2&gt;
  
  
  Parse defaults from configuration file
&lt;/h2&gt;

&lt;p&gt;A similar technique can be used to parse a configuration filename from the command-line arguments and then use that configuration file to set the default values for the remaining arguments being parsed.&lt;/p&gt;

&lt;p&gt;Suppose a configuration file contains:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[options]
option1=config value
option2=config value
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The way this works is that the default values for subsequent command-line arguments are defined in a dictionary named &lt;code&gt;defaults&lt;/code&gt; rather than using the &lt;code&gt;default&lt;/code&gt; keyword argument with &lt;code&gt;add_argument()&lt;/code&gt;. Then &lt;code&gt;parse_known_args()&lt;/code&gt; is used to parse the configuration filename from the command-line arguments. If it is present, the values are read from the configuration file and used to update the &lt;code&gt;defaults&lt;/code&gt; dictionary. Those defaults are then applied to the remaining command-line arguments with the &lt;code&gt;set_defaults()&lt;/code&gt; method.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
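
&lt;p&gt;A minimal sketch of the technique (flag and option names are illustrative, an explicit argv list is passed, and a stand-in configuration file is written so the example is self-contained):&lt;/p&gt;

```python
import argparse
import configparser
import os
import tempfile

# Defaults live in a dictionary instead of the default= keyword argument.
defaults = {'option1': 'default value', 'option2': 'default value'}

# Stand-in for the example.conf shown above, so this sketch runs as-is.
with tempfile.NamedTemporaryFile('w', suffix='.conf', delete=False) as f:
    f.write('[options]\noption1=config value\noption2=config value\n')
    conf_path = f.name

# First pass: pull out only the configuration filename.
conf_parser = argparse.ArgumentParser(add_help=False)
conf_parser.add_argument('-c', '--config-file')
args, remaining_argv = conf_parser.parse_known_args(
    ['-c', conf_path, '--option1', 'cli value'])

# If a config file was given, its [options] section overrides the defaults.
if args.config_file:
    config = configparser.ConfigParser()
    config.read(args.config_file)
    defaults.update(dict(config.items('options')))

# Second pass: apply the (possibly updated) defaults, then parse the rest.
parser = argparse.ArgumentParser(parents=[conf_parser])
parser.set_defaults(**defaults)
parser.add_argument('--option1')
parser.add_argument('--option2')
args = parser.parse_args(remaining_argv)
os.unlink(conf_path)
```

&lt;p&gt;Command-line values still win because &lt;code&gt;set_defaults()&lt;/code&gt; only supplies values for arguments that were not passed.&lt;/p&gt;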



&lt;h2&gt;
  
  
  Putting it together
&lt;/h2&gt;

&lt;p&gt;See the &lt;a href="https://github.com/MicahCarrick/micahcarrick-posts/blob/master/python-argparse-configfile-loglevel/example.py" rel="noopener noreferrer"&gt;complete example.py source code&lt;/a&gt; which combines both of the above techniques.&lt;/p&gt;

&lt;p&gt;Override default log level:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ./example.py -l DEBUG
INFO:__main__:Log level set: DEBUG
INFO:__main__:Option 1: default value
INFO:__main__:Option 2: default value
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Read values from configuration file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ./example.py -l DEBUG -c example.conf
INFO:__main__:Log level set: DEBUG
INFO:__main__:Loading configuration: example.conf
INFO:__main__:Option 1: config value
INFO:__main__:Option 2: config value
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Override values with command-line arguments:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ./example.py -l DEBUG -c example.conf -1 "cli value"
INFO:__main__:Log level set: DEBUG
INFO:__main__:Loading configuration: example.conf
INFO:__main__:Option 1: cli value
INFO:__main__:Option 2: config value
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>python</category>
    </item>
    <item>
      <title>Aerospike Security Events and Audit Logs</title>
      <dc:creator>Micah Carrick</dc:creator>
      <pubDate>Mon, 14 Sep 2020 23:22:57 +0000</pubDate>
      <link>https://dev.to/micahcarrick/aerospike-security-events-and-audit-logs-25ok</link>
      <guid>https://dev.to/micahcarrick/aerospike-security-events-and-audit-logs-25ok</guid>
      <description>&lt;p&gt;In the &lt;strong&gt;Enterprise Database Security&lt;/strong&gt; session I presented at &lt;a href="https://www.aerospike.com/summit/" rel="noopener noreferrer"&gt;Aerospike Summit 2020&lt;/a&gt; I gave an overview of data protection with Aerospike Enterprise.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://cdn.carrick.tech/micahcarrick-posts/aerospike-summit20-enterprise-database-security/Aerospike%20Summit%202020%20-%20Enterprise%20Database%20Security.pdf" rel="noopener noreferrer"&gt;Download presentation PDF&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;To provide context, refer to the following diagram depicting an Aerospike deployment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjld2wqhhfzccibo0y2k6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjld2wqhhfzccibo0y2k6.png" alt="Aerospike Deployment Diagram" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once all the enterprise security features have been implemented, how do we verify we’re doing any of this right? How do we get visibility into what’s happened in the past, and how do we respond to events as they happen in real time?&lt;/p&gt;

&lt;h2&gt;
  
  
  Security Event Architecture
&lt;/h2&gt;

&lt;p&gt;The diagram below depicts an Aerospike node on the left producing a security audit trail and shipping it to a downstream system via syslog. The rest of the diagram is just one of many possible architectures for consuming Aerospike audit logs. I’ll talk this one through to give you an idea of what’s happening.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnw2fc0hfyaqfno5zbft0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnw2fc0hfyaqfno5zbft0.png" alt="Aerospike Security Event Audit Log Architecture" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First, the audit trail from Aerospike is separate from the standard server logs used for troubleshooting and analysis. It includes events that are relevant for security monitoring such as authentication events, user administration, system administration, etc.&lt;/p&gt;

&lt;p&gt;The audit trail is shipped using the syslog protocol to some type of log collector such as syslog-ng, rsyslog, the Elastic stack, Splunk agents, etc.&lt;/p&gt;

&lt;p&gt;There is often a highly scalable queuing system, something like Kafka, in between the log producers and collectors and the downstream consumers. This avoids tight coupling between the systems allowing producing and consuming at different rates and independently maintaining system components.&lt;/p&gt;

&lt;p&gt;There are also typically events and data from other sources being brought in and ingested by the SIEM and/or log analysis platforms. It is in these platforms that all of this security data can be monitored in real time to detect and respond to potential security threats or data breaches. In addition to real-time monitoring, this also provides security professionals with historical data to use for forensics, audits, training new ML models, etc.&lt;/p&gt;

&lt;p&gt;One final note about the Aerospike audit trail: which events are logged is configurable. It is possible to ship every single data operation, including all reads and writes. In this way, which application or user made which change, and when, is auditable. This is a very common requirement from the data privacy and compliance side.&lt;/p&gt;
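
&lt;p&gt;As a sketch of what that looks like, the security stanza in &lt;code&gt;aerospike.conf&lt;/code&gt; selects the event classes to report and the syslog facility to ship them to (verify option names against the Aerospike documentation for your server version):&lt;/p&gt;

```
security {
    enable-security true

    syslog {
        local 0                      # ship the audit trail to the local0 facility
        report-authentication true   # logins and authentication failures
        report-user-admin true       # user and role changes
        report-sys-admin true        # dynamic server configuration changes
        report-violation true        # permission violations
        # report-data-op can additionally audit data operations per namespace
    }
}
```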

&lt;p&gt;However, I often find that enterprises make exceptions when the scale becomes impractical for the downstream systems. Imagine an Aerospike cluster handling tens of millions of operations per second in a very cost effective way, and then shipping all those events to a downstream system that isn’t designed to scale, doesn’t scale linearly, can’t handle the real-time ingestion, or isn’t cost effective at scale. This comes down to a risk-based business decision and in many cases organizations rely on compensating controls in the application and system access controls to achieve the auditability of data access that they require.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.aerospike.com/docs/operations/configure/security/access-control/index.html#audit-trails" rel="noopener noreferrer"&gt;Aerospike - Audit Trails&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aerospike</category>
      <category>nosql</category>
      <category>security</category>
    </item>
    <item>
      <title>Aerospike Data Protection</title>
      <dc:creator>Micah Carrick</dc:creator>
      <pubDate>Mon, 14 Sep 2020 23:22:33 +0000</pubDate>
      <link>https://dev.to/micahcarrick/aerospike-data-protection-1odc</link>
      <guid>https://dev.to/micahcarrick/aerospike-data-protection-1odc</guid>
      <description>&lt;p&gt;In the &lt;strong&gt;Enterprise Database Security&lt;/strong&gt; session I presented at &lt;a href="https://www.aerospike.com/summit/" rel="noopener noreferrer"&gt;Aerospike Summit 2020&lt;/a&gt; I gave an overview of data protection with Aerospike Enterprise.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://cdn.carrick.tech/micahcarrick-posts/aerospike-summit20-enterprise-database-security/Aerospike%20Summit%202020%20-%20Enterprise%20Database%20Security.pdf" rel="noopener noreferrer"&gt;Download presentation PDF&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;To provide context, refer to the following diagram depicting an Aerospike deployment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frt8t1ntebk5p2859xg7a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frt8t1ntebk5p2859xg7a.png" alt="Aerospike Deployment Diagram" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Even after securing the network and implementing authentication and authorization, remember that we're persisting this data on physical devices somewhere. Even in the cloud there are still racks and servers behind all that "magic".&lt;/p&gt;

&lt;p&gt;So how do we protect that persisted data from unauthorized access and how do we reduce the attack surface?&lt;/p&gt;




&lt;h2&gt;
  
  
  Data Isolation and Encryption
&lt;/h2&gt;

&lt;p&gt;For protecting data at rest, that is, data persisted in the storage layer, we want to look at how we do data isolation and encryption in Aerospike.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdtyt2dhd8n4wsfr8vq27.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdtyt2dhd8n4wsfr8vq27.png" alt="Aerospike Encryption at Rest" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The diagram above depicts two SSD devices physically attached to an Aerospike server node. These are &lt;em&gt;physical devices&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;In this example, each SSD device is also divided into two logical partitions for a total of 4 partitions. These are &lt;em&gt;logical devices&lt;/em&gt; and data stored on one partition is logically separate from data in the other partitions.&lt;/p&gt;

&lt;p&gt;One of the authorization scopes we can apply to a permission with &lt;a href="https://www.aerospike.com/docs/operations/configure/security/access-control/index.html" rel="noopener noreferrer"&gt;Aerospike’s access controls&lt;/a&gt; is &lt;code&gt;namespace&lt;/code&gt;. Namespaces not only provide scope for access controls, they also configure the storage layer. That configuration includes which physical or logical devices the data is persisted to, enabling AES encryption, and which key is used to encrypt and decrypt that data.&lt;/p&gt;

&lt;p&gt;So what that means is, from a data protection standpoint, each namespace provides data isolation from other namespaces. If the user credentials or the encryption key for Namespace #1 were to be compromised, the data stored in Namespace #2 is protected separately.&lt;/p&gt;
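
&lt;p&gt;As a sketch, the relevant namespace storage configuration looks roughly like this (the device and key paths are placeholders; verify option names against the Aerospike documentation for your server version):&lt;/p&gt;

```
namespace namespace1 {
    ...
    storage-engine device {
        device /dev/nvme0n1p1                              # logical partition dedicated to this namespace
        encryption-key-file /etc/aerospike/namespace1.key  # per-namespace encryption key
        encryption aes-256
        ...
    }
}
```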

&lt;h2&gt;
  
  
  OS Hardening and System Access Controls
&lt;/h2&gt;

&lt;p&gt;The data is encrypted in transit with TLS and encrypted at rest using AES encryption on the namespace. So far so good.&lt;/p&gt;

&lt;p&gt;However, the Aerospike process itself is obviously working with the unencrypted data in memory. Standard &lt;a href="https://en.wikipedia.org/wiki/Hardening_%28computing%29" rel="noopener noreferrer"&gt;OS/system hardening&lt;/a&gt; procedures and system access controls are absolutely critical for Aerospike deployments that store sensitive data. I’m not going to get into Linux system hardening as it’s tangential and there are plenty of industry standard tools, processes, and requirements on that front. But I do want to emphasise a best practice:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Keep Aerospike nodes singular in purpose.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In general we don’t recommend running Aerospike alongside any other non-related applications, but it’s especially important when working with sensitive data. It may be tempting or convenient to run some tools or dashboards directly on one of the Aerospike nodes, but that just increases the surface area for potential vulnerabilities.&lt;/p&gt;

&lt;p&gt;The tools or dashboards should not have direct access to sensitive data and therefore should not be running on an Aerospike node.&lt;/p&gt;

&lt;p&gt;A less obvious example of this is the &lt;a href="https://www.aerospike.com/docs/tools/" rel="noopener noreferrer"&gt;Aerospike Tools&lt;/a&gt; package which includes the &lt;code&gt;aql&lt;/code&gt;, &lt;code&gt;asinfo&lt;/code&gt; and &lt;code&gt;asadm&lt;/code&gt; commands among others. These administrative tools are bundled with the Aerospike Server and they are required to be run directly on the Aerospike nodes for a subset of operational tasks. Obtaining a collectinfo is a good example of this.&lt;/p&gt;

&lt;p&gt;However, the tools are also available as a stand-alone package and can be run from a remote server for most tasks. So a best practice is to make the Aerospike Tools available to authorized users on dedicated nodes specifically for that purpose. You will still want an escalation path that allows for node-level diagnostics and troubleshooting, however, that should be the exception and not the rule.&lt;/p&gt;

&lt;h2&gt;
  
  
  Secrets Management
&lt;/h2&gt;

&lt;p&gt;Protecting secrets in general is a very broad topic. So to keep things brief here I just want to enumerate the secrets associated with Aerospike that need to be protected and a couple of common patterns for how that’s done in a Production environment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb8uvdc5hltyriigl9w17.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb8uvdc5hltyriigl9w17.png" alt="Aerospike Secrets Management" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On any given Aerospike server node you may have TLS private keys, encryption-at-rest keys, external authentication (LDAP) credentials, and Cross-Datacenter Replication (XDR) credentials. All of these secrets must be protected.&lt;/p&gt;

&lt;p&gt;These secrets are essentially bits of configuration that you are managing. They are keys and passwords that need to be available to the Aerospike process (&lt;code&gt;asd&lt;/code&gt;) at startup or at runtime. Once those secrets make it to the server your OS hardening and system access controls are in place to protect them. However, the challenge is in managing the full lifecycle of those secrets. They have to be created, they have to get deployed onto the servers, they may need to be revoked, and they will need to be rotated periodically.&lt;/p&gt;

&lt;p&gt;This is a problem well suited for &lt;em&gt;secrets management&lt;/em&gt; tools. Most enterprise security platforms have secrets management built in, all the major cloud providers have dedicated secrets management services, and open-source tools like &lt;a href="https://www.vaultproject.io/" rel="noopener noreferrer"&gt;Vault by Hashicorp&lt;/a&gt; have a vast array of Enterprise features and wide adoption. So let’s discuss a couple of patterns for using secrets management for Aerospike keys and credentials.&lt;/p&gt;

&lt;p&gt;The first pattern, shown as the top half of the diagram above, is to integrate secrets management software into the configuration management workflow. For example, config management tools such as Ansible, Chef, Puppet, etc. can be set up to bring in secrets from the secret store when configuring a node. Aerospike loads the secrets from the filesystem and is completely decoupled from the secrets management system. This has the advantage of being straightforward to set up and compatible with just about any secrets management tool out there. However, it does result in any given secret being in two locations: once in the secret store and once on the Aerospike server. Secret lifecycle management for Aerospike secrets is always a two-step process of updating the secret store and then running the configuration management tool.&lt;/p&gt;

&lt;p&gt;The second pattern, shown as the bottom half of the diagram, is for Aerospike to integrate directly with the secrets management system. However, this requires Aerospike compatibility with a specific secret store. At this point in time Aerospike supports just one direct integration with a secrets management platform and that is the &lt;a href="https://www.hashicorp.com/integrations/aerospike/vault" rel="noopener noreferrer"&gt;Vault Integration with Aerospike&lt;/a&gt;. Vault in turn integrates with a large number of other systems.&lt;/p&gt;

&lt;p&gt;This pattern has the advantage of centrally managing secrets. When implemented correctly it lowers the credential management burden and lowers the risk of compromised secrets.&lt;/p&gt;

&lt;p&gt;However, being a direct integration, this pattern requires that the secret store is able to meet availability and scalability requirements. This means that this pattern is going to be a more complex architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.aerospike.com/docs/operations/configure/security/encryption-at-rest/index.html" rel="noopener noreferrer"&gt;Aerospike - Configuring Encryption at Rest&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.aerospike.com/docs/operations/configure/security/vault/index.html" rel="noopener noreferrer"&gt;Aerospike - Hashicorp Vault Integration&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aerospike</category>
      <category>nosql</category>
      <category>security</category>
    </item>
    <item>
      <title>Aerospike Authentication and Authorization</title>
      <dc:creator>Micah Carrick</dc:creator>
      <pubDate>Mon, 14 Sep 2020 23:22:06 +0000</pubDate>
      <link>https://dev.to/micahcarrick/aerospike-authentication-and-authorization-f0c</link>
      <guid>https://dev.to/micahcarrick/aerospike-authentication-and-authorization-f0c</guid>
      <description>&lt;p&gt;In the &lt;strong&gt;Enterprise Database Security&lt;/strong&gt; session I presented at &lt;a href="https://www.aerospike.com/summit/" rel="noopener noreferrer"&gt;Aerospike Summit 2020&lt;/a&gt; I gave an overview of authentication and authorization with Aerospike Enterprise.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://cdn.carrick.tech/micahcarrick-posts/aerospike-summit20-enterprise-database-security/Aerospike%20Summit%202020%20-%20Enterprise%20Database%20Security.pdf" rel="noopener noreferrer"&gt;Download presentation PDF&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;To provide context, refer to the following diagram depicting an Aerospike deployment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiov3j7nwmb0g8wrkcrai.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiov3j7nwmb0g8wrkcrai.png" alt="Aerospike Deployment Diagram" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the left we have developers building applications and back office jobs that will use an Aerospike database.&lt;/p&gt;

&lt;p&gt;In the middle we have an Aerospike cluster managed by one or more administrative groups such as SREs, DevOps, DBAs, etc.&lt;/p&gt;

&lt;p&gt;And on the right we have downstream systems which ingest and analyze security events and log data for use by information security teams.&lt;/p&gt;

&lt;p&gt;The red group icons depict actors that need to interact with the Aerospike database in some way. How do we control who and what can access this data and how do we manage that within an existing enterprise architecture?&lt;/p&gt;

&lt;p&gt;Identity and Access Management is a huge part of any enterprise IT organization. Aerospike includes a framework for authentication and authorization out-of-the-box, but it also integrates into your existing IAM infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Authentication (AuthN)
&lt;/h2&gt;

&lt;p&gt;Both human users &lt;em&gt;and&lt;/em&gt; machines (applications) need to authenticate when connecting to Aerospike Enterprise.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frfp7pria6gzuo5kzayjo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frfp7pria6gzuo5kzayjo.png" alt="Aerospike Internal Authentication" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You may be pretty familiar with the concept of a human user having to authenticate with a database. You know, it’s like &lt;code&gt;GRANT some_permission to USER 'micah' on some_resource&lt;/code&gt;. But let’s touch on &lt;em&gt;applications&lt;/em&gt; needing to authenticate.&lt;/p&gt;

&lt;p&gt;As discussed in &lt;em&gt;Aerospike Network Security&lt;/em&gt;, the nodes the applications are running on have been authenticated with TLS. But that was about network security: is &lt;em&gt;this server&lt;/em&gt; allowed to communicate with &lt;em&gt;that server&lt;/em&gt;?&lt;/p&gt;

&lt;p&gt;Now we want to authenticate the &lt;em&gt;application&lt;/em&gt; running on that network so that we can later control what the application is authorized to do. In other words, we’re not trying to determine whether that application is allowed to communicate, we’re trying to determine &lt;em&gt;which&lt;/em&gt; application is communicating so we can later control what it is allowed to do. And to do that, we need that application to be authenticated--to identify itself.&lt;/p&gt;

&lt;p&gt;Right out of the box you can enable Aerospike’s internal authentication, which is shown above. Both humans and applications present a username and password combination when connecting to Aerospike and all the user management is done directly within Aerospike.&lt;/p&gt;
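
&lt;p&gt;As a sketch, internal authentication is switched on with a single stanza in &lt;code&gt;aerospike.conf&lt;/code&gt; (verify against the Aerospike documentation for your server version, as the security configuration has evolved across releases):&lt;/p&gt;

```
security {
    enable-security true
    # The server starts with a default admin user (admin/admin); change
    # that password immediately and create least-privilege users and
    # roles for both humans and applications.
}
```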

&lt;p&gt;This works for simple use cases and it’s a no-brainer to set up. However, every organization has its own unique set of IAM requirements: things like password policies, credential lifecycle management, MFA, etc. The nuances and complexity of such systems are best delegated to the purpose-built tools already established in the enterprise IT infrastructure, so Aerospike supports integrating into these systems through &lt;em&gt;external authentication&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy8hzz137y8f0i62xg3td.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy8hzz137y8f0i62xg3td.png" alt="Aerospike External Authentication" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the external authentication setup, Aerospike delegates the credential check to a third-party system. In this case we’re looking at a typical directory to which Aerospike is integrated via LDAP. After a successful authentication, Aerospike will use an access token to authenticate subsequent connections for the lifetime of that token and then go back to the LDAP server as needed to re-authenticate.&lt;/p&gt;

&lt;p&gt;This is a very common setup for human users of the database and in some cases applications as well.&lt;/p&gt;

&lt;p&gt;However, with this architecture, the directory is in the critical path for the functionality of the applications using Aerospike. The LDAP directory may not be designed with the same availability, performance, or scale that an application is being designed for, so it may not be viable for all use cases.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdydvmi4zincy3z96twt5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdydvmi4zincy3z96twt5.png" alt="Aerospike Mixed Authentication" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So now we can look at a pattern that combines both authentication methods. LDAP and the directory are still used for humans to authenticate, but the applications authenticate using the internal system. This removes the LDAP directory from the critical path; however, it presents a different problem: part of the directory’s role in the external authentication setup was centralizing IAM.&lt;/p&gt;

&lt;p&gt;If users are managed directly in Aerospike, how are the credentials going to be provisioned, rotated, and revoked for the applications? How will the organization's policies and regulatory requirements be enforced?&lt;/p&gt;

&lt;p&gt;In smaller organizations or autonomous business units, this may not pose a large problem. But in larger enterprises this becomes untenable.&lt;/p&gt;

&lt;p&gt;This is where secrets management and dynamic credentials can help. Rather than the applications themselves holding credentials to Aerospike, the applications query the secrets management system to obtain the credentials--often short-lived credentials to lower risk.&lt;/p&gt;

&lt;p&gt;The secrets management system has the access necessary to manage the full lifecycle of Aerospike credentials and does so within the domain of the existing centralized IAM.&lt;/p&gt;
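To make the pattern concrete, here is a minimal Python sketch of how an application might cache short-lived credentials and re-fetch them on expiry. The `fetch_credentials` callable is a hypothetical stand-in for a real secrets-manager client call; the class name and TTL are assumptions for illustration, not part of any Aerospike API.

```python
import time

class CredentialCache:
    """Caches short-lived database credentials and re-fetches them on expiry.

    `fetch_credentials` is a hypothetical callable that returns a
    (user, password) tuple from the secrets management system.
    """

    def __init__(self, fetch_credentials, ttl_seconds=300, clock=time.monotonic):
        self._fetch = fetch_credentials
        self._ttl = ttl_seconds
        self._clock = clock          # injectable for testing
        self._creds = None
        self._expires_at = 0.0

    def get(self):
        """Return (user, password), fetching fresh credentials when expired."""
        now = self._clock()
        if self._creds is None or now >= self._expires_at:
            self._creds = self._fetch()
            self._expires_at = now + self._ttl
        return self._creds
```

The application would then call `get()` before opening a database connection, so credential rotation in the secrets manager propagates automatically without redeploying the application.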

&lt;p&gt;With one of these three patterns we’ll have established which user or application is trying to do something, and can now authorize them or it to do so.&lt;/p&gt;

&lt;h2&gt;
  
  
  Authorization (AuthZ)
&lt;/h2&gt;

&lt;p&gt;We can apply access controls to the human users or applications by assigning them to Aerospike &lt;strong&gt;roles&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The role can then be allowed a set of &lt;strong&gt;privileges&lt;/strong&gt;. A privilege consists of a permission to perform some action along with a scope. For example, the permission to read data at a global scope would be one privilege and the permission to read data only for a specific &lt;em&gt;set&lt;/em&gt; in a specific &lt;em&gt;namespace&lt;/em&gt; would be a different privilege.&lt;/p&gt;

&lt;p&gt;This allows for a least-privilege access model in which any database user, be it a human user or application code, can be associated with roles that allow only the access necessary to perform its function.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fijj1ed87twioi6ym8o7f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fijj1ed87twioi6ym8o7f.png" alt="Aerospike Authorization" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s look at these examples using a hypothetical setup for Acme corp.&lt;/p&gt;

&lt;p&gt;First, the &lt;strong&gt;Acme IAM&lt;/strong&gt; role is for an administrative user or system, such as the secrets management system depicted earlier, with a privilege to manage the full lifecycle of Aerospike users globally.&lt;/p&gt;

&lt;p&gt;Next, the &lt;strong&gt;Acme SRE&lt;/strong&gt; role in this example allows site reliability engineers to perform functions to address issues relating to system stability like querying server metrics, gracefully removing a node for maintenance, enabling different log levels, etc.&lt;/p&gt;

&lt;p&gt;Next, the &lt;strong&gt;Acme DBA&lt;/strong&gt; role in this example allows database administrators to perform functions to optimize for specific database use cases like managing secondary indexes, throttling scans, adding/removing user-defined functions, etc.&lt;/p&gt;

&lt;p&gt;The final 3 roles in this example, &lt;strong&gt;Acme App1&lt;/strong&gt;, &lt;strong&gt;Acme App2&lt;/strong&gt;, and &lt;strong&gt;Acme Daily Loader&lt;/strong&gt;, each allow the applications specific access to data, but scoped down to only that which is necessary for the function that application performs. For example, notice that the &lt;strong&gt;Acme App2&lt;/strong&gt; role can only read data from the set named &lt;code&gt;app2&lt;/code&gt; within the namespace &lt;code&gt;ns1&lt;/code&gt;. It will not be allowed to read data from the set that &lt;strong&gt;Acme App1&lt;/strong&gt; uses nor will it be able to write any data at all.&lt;/p&gt;

&lt;p&gt;So this is how you can set up some fine-grained role-based access control for users and applications.&lt;/p&gt;
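As an illustration of the privilege-as-permission-plus-scope idea, here is a small Python model. This sketches the concept only; it is not the Aerospike API, and the role contents are the hypothetical Acme examples from above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Privilege:
    """A permission paired with a scope. namespace=None means global scope;
    set_name=None means the whole namespace."""
    permission: str                  # e.g. "read", "write", "user-admin"
    namespace: Optional[str] = None
    set_name: Optional[str] = None

def is_allowed(privileges, permission, namespace=None, set_name=None):
    """Return True if any privilege grants `permission` at the requested scope."""
    for p in privileges:
        if p.permission != permission:
            continue
        if p.namespace is None:                       # global privilege
            return True
        if p.namespace == namespace and p.set_name in (None, set_name):
            return True
    return False

# Hypothetical roles mirroring the Acme example above.
acme_iam = [Privilege("user-admin")]                  # global scope
acme_app2 = [Privilege("read", "ns1", "app2")]        # read-only, one set
```

With this model, the `acme_app2` role can read `ns1`/`app2` but cannot read App1's set and cannot write anywhere, which is exactly the scoping described above.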

&lt;p&gt;And finally, every role can be assigned a whitelist of IP CIDR ranges from which database users associated with that role can connect. This provides an even finer level of granularity on top of the existing network security.&lt;/p&gt;

&lt;p&gt;For example, maybe a handful of employees can all connect to Aerospike from their workstations within a particular private subnet, but only Alice and Bob can create new users, and only when they do so from their specific personal workstations.&lt;/p&gt;
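Conceptually, the whitelist check is just membership of the connecting client's IP in a set of CIDR ranges. Here is a minimal sketch using Python's standard `ipaddress` module; the function and the example ranges are illustrative, not Aerospike's implementation.

```python
import ipaddress

def connection_allowed(client_ip, whitelist_cidrs):
    """Return True if client_ip falls inside any of the whitelisted CIDR ranges."""
    ip = ipaddress.ip_address(client_ip)
    return any(ip in ipaddress.ip_network(cidr) for cidr in whitelist_cidrs)

# A role-level whitelist: the team subnet plus one specific workstation.
allowed = connection_allowed("192.168.128.10", ["192.168.128.0/25", "10.0.9.7/32"])
```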

</description>
      <category>aerospike</category>
      <category>nosql</category>
      <category>security</category>
    </item>
    <item>
      <title>Aerospike Network Security</title>
      <dc:creator>Micah Carrick</dc:creator>
      <pubDate>Mon, 14 Sep 2020 20:14:12 +0000</pubDate>
      <link>https://dev.to/aerospike/aerospike-network-security-1pni</link>
      <guid>https://dev.to/aerospike/aerospike-network-security-1pni</guid>
      <description>&lt;p&gt;In the &lt;strong&gt;Enterprise Database Security&lt;/strong&gt; session I presented at &lt;a href="https://www.aerospike.com/summit/" rel="noopener noreferrer"&gt;Aerospike Summit 2020&lt;/a&gt; I gave an overview of network security with Aerospike Enterprise.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://cdn.carrick.tech/micahcarrick-posts/aerospike-summit20-enterprise-database-security/Aerospike%20Summit%202020%20-%20Enterprise%20Database%20Security.pdf" rel="noopener noreferrer"&gt;Download presentation PDF&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;To provide context, refer to the following diagram depicting an Aerospike deployment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2oord0ezidowkqs9bzgv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2oord0ezidowkqs9bzgv.png" alt="Aerospike Deployment Diagram" width="800" height="383"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the left we have developers building applications and back office jobs that will use an Aerospike database.&lt;/p&gt;

&lt;p&gt;In the middle we have an Aerospike cluster managed by one or more administrative groups such as SREs, DevOps, DBAs, etc.&lt;/p&gt;

&lt;p&gt;And on the right we have downstream systems which ingest and analyze security events and log data for use by information security teams.&lt;/p&gt;

&lt;p&gt;The red arrows highlight where there is network connectivity between the Aerospike database and other systems or users as well as connectivity in between individual Aerospike nodes. This is where we need to apply network security.&lt;/p&gt;

&lt;h2&gt;
  
  
  Firewall Rules
&lt;/h2&gt;

&lt;p&gt;First, let's look at firewall rules adhering to the &lt;em&gt;principle of least privilege&lt;/em&gt;, in which the firewall blocks all traffic by default and rules are then opened up to allow network access only as needed.&lt;/p&gt;

&lt;p&gt;For an Aerospike cluster there are four types of network traffic that need to be allowed.&lt;/p&gt;

&lt;p&gt;First, every Application node must be allowed to open up a TCP connection to every Aerospike node on the &lt;strong&gt;&lt;em&gt;service&lt;/em&gt;&lt;/strong&gt; port. This is port &lt;code&gt;3000&lt;/code&gt; by convention but all Aerospike network settings can be configured.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhai4yag1bahe79lj1c41.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhai4yag1bahe79lj1c41.png" alt="Aerospike Service Connections" width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The simpler and more common way to set up these firewall rules is to allow the CIDR range for the &lt;em&gt;Application&lt;/em&gt; network to open TCP connections to the CIDR range for the &lt;em&gt;Aerospike&lt;/em&gt; network:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rule&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Port&lt;/th&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;Destination&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ALLOW&lt;/td&gt;
&lt;td&gt;TCP&lt;/td&gt;
&lt;td&gt;3000&lt;/td&gt;
&lt;td&gt;192.168.128.0/25&lt;/td&gt;
&lt;td&gt;10.0.1.0/25&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;However, some security models might require each IP address to be &lt;em&gt;explicitly&lt;/em&gt; allowed:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rule&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Port&lt;/th&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;Destination&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ALLOW&lt;/td&gt;
&lt;td&gt;TCP&lt;/td&gt;
&lt;td&gt;3000&lt;/td&gt;
&lt;td&gt;192.168.128.1&lt;/td&gt;
&lt;td&gt;10.0.1.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ALLOW&lt;/td&gt;
&lt;td&gt;TCP&lt;/td&gt;
&lt;td&gt;3000&lt;/td&gt;
&lt;td&gt;192.168.128.1&lt;/td&gt;
&lt;td&gt;10.0.1.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ALLOW&lt;/td&gt;
&lt;td&gt;TCP&lt;/td&gt;
&lt;td&gt;3000&lt;/td&gt;
&lt;td&gt;192.168.128.2&lt;/td&gt;
&lt;td&gt;10.0.1.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ALLOW&lt;/td&gt;
&lt;td&gt;TCP&lt;/td&gt;
&lt;td&gt;3000&lt;/td&gt;
&lt;td&gt;192.168.128.2&lt;/td&gt;
&lt;td&gt;10.0.1.2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
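Writing out the full cross product of explicit rules by hand gets tedious as clusters grow. A small helper can generate them; this is an illustrative sketch, not tied to any particular firewall tooling.

```python
from itertools import product

def explicit_rules(port, sources, destinations):
    """Expand source and destination host lists into explicit ALLOW rules,
    one per (source, destination) pair, as in the table above."""
    return [("ALLOW", "TCP", port, src, dst)
            for src, dst in product(sources, destinations)]

# Two application nodes times two Aerospike nodes = four service rules.
rules = explicit_rules(3000,
                       ["192.168.128.1", "192.168.128.2"],
                       ["10.0.1.1", "10.0.1.2"])
```

Note that per-IP rules grow as the product of the two node counts, which is one reason the CIDR-range form is the more common choice.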




&lt;p&gt;The next type of network traffic to allow are the &lt;strong&gt;&lt;em&gt;heartbeat&lt;/em&gt;&lt;/strong&gt; connections between each Aerospike node. This is the &lt;a href="https://www.aerospike.com/docs/operations/configure/network/heartbeat/" rel="noopener noreferrer"&gt;clustering protocol&lt;/a&gt; in mesh mode which allows the Aerospike nodes to form a cluster.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1z7vpxgdtahqgolg5q4q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1z7vpxgdtahqgolg5q4q.png" alt="Aerospike Heartbeat Connections" width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every Aerospike node must be allowed to open a TCP connection to every other Aerospike node on the heartbeat port which is 3001 by convention.&lt;/p&gt;

&lt;p&gt;The common rule to allow the heartbeat connectivity in the Aerospike CIDR range:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rule&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Port&lt;/th&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;Destination&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ALLOW&lt;/td&gt;
&lt;td&gt;TCP&lt;/td&gt;
&lt;td&gt;3001&lt;/td&gt;
&lt;td&gt;10.0.1.0/25&lt;/td&gt;
&lt;td&gt;10.0.1.0/25&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Alternatively, the rules to allow explicit IP addresses for Aerospike nodes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rule&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Port&lt;/th&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;Destination&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ALLOW&lt;/td&gt;
&lt;td&gt;TCP&lt;/td&gt;
&lt;td&gt;3001&lt;/td&gt;
&lt;td&gt;10.0.1.1&lt;/td&gt;
&lt;td&gt;10.0.1.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ALLOW&lt;/td&gt;
&lt;td&gt;TCP&lt;/td&gt;
&lt;td&gt;3001&lt;/td&gt;
&lt;td&gt;10.0.1.2&lt;/td&gt;
&lt;td&gt;10.0.1.1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;p&gt;The third type of network traffic to allow are the &lt;strong&gt;&lt;em&gt;fabric&lt;/em&gt;&lt;/strong&gt; connections between each Aerospike node. This connectivity allows data to transfer between the nodes for replication and "migrations" (redistribution of data).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1z7vpxgdtahqgolg5q4q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1z7vpxgdtahqgolg5q4q.png" alt="Aerospike Fabric Connections" width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every Aerospike node must be allowed to open a TCP connection to every other Aerospike node on the fabric port which is 3002 by convention.&lt;/p&gt;

&lt;p&gt;The common rule to allow the fabric connectivity in the Aerospike CIDR block range:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rule&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Port&lt;/th&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;Destination&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ALLOW&lt;/td&gt;
&lt;td&gt;TCP&lt;/td&gt;
&lt;td&gt;3002&lt;/td&gt;
&lt;td&gt;10.0.1.0/25&lt;/td&gt;
&lt;td&gt;10.0.1.0/25&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Alternatively, the rules to allow explicit IP addresses for Aerospike nodes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rule&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Port&lt;/th&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;Destination&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ALLOW&lt;/td&gt;
&lt;td&gt;TCP&lt;/td&gt;
&lt;td&gt;3002&lt;/td&gt;
&lt;td&gt;10.0.1.1&lt;/td&gt;
&lt;td&gt;10.0.1.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ALLOW&lt;/td&gt;
&lt;td&gt;TCP&lt;/td&gt;
&lt;td&gt;3002&lt;/td&gt;
&lt;td&gt;10.0.1.2&lt;/td&gt;
&lt;td&gt;10.0.1.1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Notice that the fabric rules are identical to the heartbeat rules except for the port. If the configured ports are sequential then the rules for heartbeat and fabric can be combined for firewalls that allow specifying port ranges:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rule&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Port&lt;/th&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;Destination&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ALLOW&lt;/td&gt;
&lt;td&gt;TCP&lt;/td&gt;
&lt;td&gt;3001-3002&lt;/td&gt;
&lt;td&gt;10.0.1.0/25&lt;/td&gt;
&lt;td&gt;10.0.1.0/25&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
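The port-range merge above generalizes: any sorted list of configured ports can be collapsed into contiguous ranges. A short illustrative helper (not specific to any firewall product):

```python
def collapse_ports(ports):
    """Collapse ports into contiguous range strings, e.g. [3001, 3002]
    becomes ["3001-3002"], so rules like heartbeat and fabric can be
    merged on firewalls that accept port ranges."""
    ranges = []
    for port in sorted(ports):
        if ranges and port == ranges[-1][1] + 1:
            ranges[-1][1] = port          # extend the current run
        else:
            ranges.append([port, port])   # start a new run
    return [f"{lo}" if lo == hi else f"{lo}-{hi}" for lo, hi in ranges]
```

For the conventional heartbeat and fabric ports this yields a single `3001-3002` rule, while non-sequential ports stay as separate rules.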




&lt;p&gt;The fourth and final type of network traffic is only applicable for deployments which are using &lt;a href="https://www.aerospike.com/docs/architecture/xdr.html" rel="noopener noreferrer"&gt;Cross Datacenter Replication (XDR)&lt;/a&gt; to replicate data between Aerospike clusters in different data centers or cloud regions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmoqskd4jex5pvtj411q9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmoqskd4jex5pvtj411q9.png" alt="Aerospike XDR Connections" width="800" height="326"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;XDR traffic uses the same service connections that applications use. That means that every Aerospike node in the XDR source cluster must be allowed to open a TCP connection to every Aerospike node on the &lt;strong&gt;&lt;em&gt;service&lt;/em&gt;&lt;/strong&gt; port which is 3000 by convention.&lt;/p&gt;

&lt;p&gt;The rule to allow traffic from the XDR source cluster to the XDR destination cluster using CIDR blocks:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rule&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Port&lt;/th&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;Destination&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ALLOW&lt;/td&gt;
&lt;td&gt;TCP&lt;/td&gt;
&lt;td&gt;3000&lt;/td&gt;
&lt;td&gt;10.0.1.0/25&lt;/td&gt;
&lt;td&gt;172.16.0.0/25&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Alternatively, the rules to allow explicit IP addresses:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rule&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Port&lt;/th&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;Destination&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ALLOW&lt;/td&gt;
&lt;td&gt;TCP&lt;/td&gt;
&lt;td&gt;3000&lt;/td&gt;
&lt;td&gt;10.0.1.1&lt;/td&gt;
&lt;td&gt;172.16.0.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ALLOW&lt;/td&gt;
&lt;td&gt;TCP&lt;/td&gt;
&lt;td&gt;3000&lt;/td&gt;
&lt;td&gt;10.0.1.1&lt;/td&gt;
&lt;td&gt;172.16.0.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ALLOW&lt;/td&gt;
&lt;td&gt;TCP&lt;/td&gt;
&lt;td&gt;3000&lt;/td&gt;
&lt;td&gt;10.0.1.2&lt;/td&gt;
&lt;td&gt;172.16.0.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ALLOW&lt;/td&gt;
&lt;td&gt;TCP&lt;/td&gt;
&lt;td&gt;3000&lt;/td&gt;
&lt;td&gt;10.0.1.2&lt;/td&gt;
&lt;td&gt;172.16.0.2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Notice that the XDR rules are identical to the service rules except that the source is the XDR source cluster instead of the Application nodes. The rules for service and XDR can be combined on firewalls that allow specifying multiple CIDR/IP ranges:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rule&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Port&lt;/th&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;Destination&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ALLOW&lt;/td&gt;
&lt;td&gt;TCP&lt;/td&gt;
&lt;td&gt;3000&lt;/td&gt;
&lt;td&gt;172.16.0.0/25,192.168.128.0/25&lt;/td&gt;
&lt;td&gt;10.0.1.0/25&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Encryption in Transit (TLS)
&lt;/h2&gt;

&lt;p&gt;The second part of securing the network is about using TLS to encrypt data in transit and ensure connections are only established with trusted machines on the network.&lt;/p&gt;

&lt;h3&gt;
  
  
  TLS Certificates
&lt;/h3&gt;

&lt;p&gt;We just looked at the four types of network connectivity: &lt;strong&gt;&lt;em&gt;service&lt;/em&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;em&gt;heartbeat&lt;/em&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;em&gt;fabric&lt;/em&gt;&lt;/strong&gt;, and &lt;strong&gt;&lt;em&gt;XDR&lt;/em&gt;&lt;/strong&gt;. Aerospike can be configured to use TLS on each of those types of network connections independently.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz5ootd7axq4afh532bhd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz5ootd7axq4afh532bhd.png" alt="Aerospike TLS Encryption" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;&lt;em&gt;service&lt;/em&gt;&lt;/strong&gt; connections support standard TLS or mutual-authentication TLS, also referred to as mTLS.&lt;/p&gt;

&lt;p&gt;Both modes encrypt the data in transit; however, with standard TLS, only the Aerospike nodes authenticate themselves to the application nodes. With mutual TLS, the Aerospike nodes authenticate themselves to the application nodes, which in turn authenticate themselves to the Aerospike nodes. So it’s a two-way authentication.&lt;/p&gt;

&lt;p&gt;A “bad actor” that found its way into the network somehow could not pretend to be an application node nor pretend to be an Aerospike node without possessing the correct private key.&lt;/p&gt;




&lt;p&gt;When TLS is enabled on the &lt;strong&gt;&lt;em&gt;fabric&lt;/em&gt;&lt;/strong&gt; or &lt;strong&gt;&lt;em&gt;heartbeat&lt;/em&gt;&lt;/strong&gt; connections, they will &lt;em&gt;always&lt;/em&gt; use what amounts to mutual authentication for those Aerospike-to-Aerospike connections. So once again, if a “bad actor” somehow breaches the private network, they could not pretend to be another Aerospike node in the cluster nor decrypt any of the data transferring between nodes without possessing the appropriate private key.&lt;/p&gt;




&lt;p&gt;If you recall, &lt;strong&gt;&lt;em&gt;XDR&lt;/em&gt;&lt;/strong&gt; connectivity is actually just using the service connections. So with XDR, the source cluster acts as the TLS clients, much like the application nodes, and the destination cluster acts as the TLS servers.&lt;/p&gt;




&lt;p&gt;Now, all of these &lt;em&gt;types of connections&lt;/em&gt; are generally configured to use the same server certificate, since they are the same servers; however, they can technically be configured with separate certificates.&lt;/p&gt;

&lt;p&gt;Additionally, every Aerospike node can be configured to use the same certificate, meaning the entire cluster shares that certificate, or every node can be set up with its own unique certificate.&lt;/p&gt;

&lt;p&gt;So that gives us three dimensions to work with: standard vs. mutual TLS on the service connections, individual or shared certificates for each type of connection, and individual or shared certificates on each Aerospike server node.&lt;/p&gt;

&lt;p&gt;Obviously, standard TLS with a single cluster-wide certificate is the simplest in terms of setup and management complexity. And if you’ve spent much time dealing with certificate lifecycle management, one certificate certainly sounds more pleasant to manage than dozens, hundreds, or thousands. And indeed it is.&lt;/p&gt;

&lt;p&gt;But for organizations adopting more of a “zero trust” model, perhaps within environments dealing with highly sensitive data, on networks which the organization has deemed as untrusted such as the public cloud, unique certificates on each node may be required.&lt;/p&gt;

&lt;p&gt;However, most enterprise use cases will fall somewhere in between these two extremes and the flexibility of Aerospike’s TLS configuration will allow it to be tailored to the specific needs of the organization, the environment, and the use case.&lt;/p&gt;

&lt;h3&gt;
  
  
  TLS Cipher Suites
&lt;/h3&gt;

&lt;p&gt;A cipher suite is a set of algorithms that are used in various phases during the TLS communication. The protocol allows for the client and server to negotiate which set of these algorithms both sides support.&lt;/p&gt;

&lt;p&gt;Every TLS connection described in the previous section can be configured to control which cipher suites are allowed and in what order of priority.&lt;/p&gt;

&lt;p&gt;Without going too deep into the weeds about TLS cipher suites, let me just make two points about selecting TLS cipher suites to use with Aerospike Enterprise.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F918jql2taf66y8p7ojf5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F918jql2taf66y8p7ojf5.png" alt="Aerospike TLS Cipher Suites" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Point #1&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Unlike a public, internet-facing application, many Aerospike deployments are done in environments where the organization is in control of both the client and the server. That means that compatibility with public clients like web browsers is not a factor and the list of allowed cipher suites can be narrowed down to just the more current algorithms which provide the best security and performance.&lt;/p&gt;

&lt;p&gt;At the time of this presentation, that is highly likely to mean a cipher suite using AES encryption, which has hardware acceleration built into modern CPUs, with Galois/Counter Mode (GCM), which also typically outperforms previous block cipher modes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Point #2&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Aerospike uses OpenSSL and thus configuring the cipher suites uses the OpenSSL notation. This is recognizable by the use of hyphens, as shown in the top line in the image. Other tools and libraries, such as Java, may use the IANA notation, recognizable by the use of underscores, as shown in the second line. This means that specifying the cipher suites in the Aerospike configuration may use a different notation than other sources you may be referencing.&lt;/p&gt;
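As a rough sketch of the notation difference, the common AES-GCM suite names can be translated from IANA form to OpenSSL form with a few string substitutions. This is illustrative only; the mapping is not fully mechanical across all suites, so a real tool should use a complete lookup (for example, `openssl ciphers -stdname`).

```python
def iana_to_openssl(name):
    """Translate an IANA cipher-suite name (underscores) to the OpenSSL
    name (hyphens). Illustrative: covers common AES-GCM suites only."""
    s = name
    if s.startswith("TLS_"):
        s = s[len("TLS_"):]
    s = s.replace("_WITH_", "_").replace("_", "-")
    # OpenSSL fuses the AES key size into the algorithm token.
    return s.replace("AES-256-", "AES256-").replace("AES-128-", "AES128-")
```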

</description>
      <category>aerospike</category>
      <category>nosql</category>
      <category>security</category>
    </item>
    <item>
      <title>Serialized or Compressed Objects with Aerospike? Consider carefully.</title>
      <dc:creator>Micah Carrick</dc:creator>
      <pubDate>Fri, 11 Sep 2020 09:17:24 +0000</pubDate>
      <link>https://dev.to/aerospike/serialized-or-compressed-objects-with-aerospike-consider-carefully-133l</link>
      <guid>https://dev.to/aerospike/serialized-or-compressed-objects-with-aerospike-consider-carefully-133l</guid>
      <description>&lt;p&gt;&lt;em&gt;Make sure you consider these trade-offs before storing serialized or compressed blobs in Aerospike.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Serializing and compressing data client-side before storing it in a back-end database is a common pattern. I run into this frequently in my work with Aerospike customers, typically in the form of protocol buffers (protobuf) or gzipped JSON. After all, who doesn't want to reduce network bandwidth and storage?&lt;/p&gt;

&lt;p&gt;However, unless the use case is a dumb-simple get/put cache, you may be trading off some powerful Aerospike functionality for very little gain - if any.&lt;/p&gt;

&lt;p&gt;Some of the benefits of serializing and compressing objects client-side include the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Interoperability&lt;/strong&gt;: Application developers can work in language-native constructs and exchange objects in a language-agnostic format. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Network bandwidth&lt;/strong&gt;: Getting and putting objects that are compressed lowers network bandwidth.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Storage space&lt;/strong&gt;: Storing objects that are compressed usually uses less disk space.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are all good things. However, Aerospike provides alternative mechanisms to achieve each of these benefits as well. &lt;a href="https://www.aerospike.com/docs/client/" rel="noopener noreferrer"&gt;Aerospike client libraries&lt;/a&gt; allow application developers to work in language-native constructs for Aerospike data, &lt;a href="https://www.aerospike.com/docs/guide/policies.html" rel="noopener noreferrer"&gt;client policies&lt;/a&gt; can enable compression on the network, and &lt;a href="https://www.aerospike.com/docs/operations/configure/namespace/storage/compression.html" rel="noopener noreferrer"&gt;storage compression&lt;/a&gt; can be enabled and tuned.&lt;/p&gt;

&lt;p&gt;Moreover, Aerospike will add some sensible logic and flexibility without any additional work in the applications.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;When compression is enabled, Aerospike won't store an object compressed if the compressed result is actually &lt;em&gt;larger&lt;/em&gt; than the original.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Compression comes at the cost of CPU. Applications can choose to compress data on the network on a &lt;em&gt;per-transaction&lt;/em&gt; basis and storage compression algorithm/level can be configured on a &lt;em&gt;per-namespace&lt;/em&gt; basis. This allows optimizing to get the most "bang for your buck" per use case.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
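A quick way to see why that first point matters: compressing a tiny payload with Python's standard `gzip` module produces output larger than the input, because the format's header and trailer overhead dominates any savings.

```python
import gzip

tiny = b"x"
compressed = gzip.compress(tiny)
# The gzip header/trailer overhead exceeds any savings on a 1-byte payload,
# so the "compressed" form is larger than the original.
print(len(tiny), len(compressed))
```

Skipping compression in that case, as Aerospike does, avoids paying CPU to make the data bigger.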

&lt;p&gt;So you may already be thinking: "Well, gosh, maybe I don't need to serialize all my objects client-side". But the real trade-off to consider is not about the fact that you get parity with Aerospike out-of-the-box, it's about &lt;em&gt;what you lose&lt;/em&gt; when storing serialized and/or compressed blobs in Aerospike.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;You will lose the ability to do &lt;a href="https://www.aerospike.com/docs/guide/predicate.html" rel="noopener noreferrer"&gt;Predicate Filters&lt;/a&gt; on &lt;a href="https://www.aerospike.com/docs/guide/query.html" rel="noopener noreferrer"&gt;queries&lt;/a&gt; and &lt;a href="https://www.aerospike.com/docs/guide/scan.html" rel="noopener noreferrer"&gt;scans&lt;/a&gt; against the data in the blob&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You will lose the ability to leverage &lt;a href="https://www.aerospike.com/docs/guide/bitwise.html" rel="noopener noreferrer"&gt;Bitwise Operations&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You will lose the ability to use the feature-rich Complex Data Type (CDT) API on &lt;a href="https://www.aerospike.com/docs/guide/cdt-list-ops.html" rel="noopener noreferrer"&gt;Lists&lt;/a&gt; and &lt;a href="https://www.aerospike.com/docs/guide/cdt-map-ops.html" rel="noopener noreferrer"&gt;Maps&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now those are some incredibly useful features - especially those CDT operations. Unless you know you only want a dumb get/put cache, you don't want to miss out on them.&lt;/p&gt;

&lt;p&gt;But you're a geek... I'm a geek... so let's see this in action with some quick 'n dirty Python.&lt;/p&gt;




&lt;h2&gt;
  
  
  Serialized/Compressed Blobs vs Aerospike CDT Performance
&lt;/h2&gt;

&lt;p&gt;As an example, assume a use case which rolls up all purchase transactions by day and stores them in records split by month and account number using a composite key like: &lt;code&gt;monthly:&amp;lt;YYYYMM&amp;gt;:&amp;lt;Account ID&amp;gt;&lt;/code&gt;.&lt;/p&gt;
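&lt;p&gt;Building that composite key is a one-liner in Python. A quick sketch (the &lt;code&gt;monthly_key&lt;/code&gt; helper is illustrative, not part of the scripts discussed in this post):&lt;/p&gt;

```python
from datetime import date

def monthly_key(txn_date: date, account_id: str) -> str:
    """Build the composite primary key for a monthly rollup record."""
    return f"monthly:{txn_date:%Y%m}:{account_id}"

print(monthly_key(date(2019, 1, 15), "00001"))  # monthly:201901:00001
```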

&lt;p&gt;In Python, each record can be represented as a standard dictionary with nested lists and dictionaries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;'monthly:201901:00001' {
  'acct': '00001',
  'loc': 1,
  'txns': {
    '20190101': [
      {
        'txn': 1,
        'ts': 50607338,
        'sku': 5631,
        'cid': "GFOBVQPRCZVT",
        'amt': 873300,
        'qty': 23,
        'code': 'USD'
      },
      { ... }
    ],
    '20190102': [ ... ]
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;txns&lt;/code&gt; key contains a dictionary where each key is a date in &lt;code&gt;YYYYMMDD&lt;/code&gt; format and the value is a list of every transaction for that day.&lt;/p&gt;
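&lt;p&gt;Because the record is just a nested dictionary, application code can work with it using ordinary language constructs. A minimal sketch (the transaction values here are made up):&lt;/p&gt;

```python
record = {
    'acct': '00001',
    'loc': 1,
    'txns': {
        '20190101': [
            {'txn': 1, 'ts': 50607338, 'sku': 5631, 'cid': 'GFOBVQPRCZVT',
             'amt': 873300, 'qty': 23, 'code': 'USD'},
            {'txn': 2, 'ts': 50609001, 'sku': 1204, 'cid': 'XKQWMBHJZLRD',
             'amt': 125000, 'qty': 2, 'code': 'USD'},
        ],
    },
}

# Sum the amounts for one day with plain dictionary/list access.
day_total = sum(t['amt'] for t in record['txns']['20190101'])
print(day_total)  # 998300
```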

&lt;p&gt;In order to highlight the pros and cons of serialization/compression client-side vs. using Aerospike's built-in features, two Aerospike namespaces are set up.&lt;/p&gt;

&lt;p&gt;The first namespace &lt;code&gt;ns1&lt;/code&gt; uses a file storage engine without any compression enabled. It will be used to store the records as blobs that have been serialized as JSON and compressed with zlib level 6 (default) in the Python code.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;namespace&lt;/code&gt; stanza in the &lt;code&gt;aerospike.conf&lt;/code&gt; file looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;namespace ns1 {
    replication-factor 1
    memory-size 2G

    storage-engine device {
        file /opt/aerospike/data/ns1.dat
        filesize 100M
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The second namespace &lt;code&gt;ns2&lt;/code&gt; uses a file storage engine with ZStandard compression level 1 (least amount of compression, best performance). It will be used to store the records as Aerospike CDTs.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;namespace&lt;/code&gt; stanza in the &lt;code&gt;aerospike.conf&lt;/code&gt; file looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;namespace ns2 {
    replication-factor 1
    memory-size 2G

    storage-engine device {
    file /opt/aerospike/data/ns2.dat
    filesize 100M
    compression zstd
            compression-level 1
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using the Python script called &lt;code&gt;generate-data.py&lt;/code&gt;, dummy data is generated using the above data model and loaded into each of the two namespaces. It generates 2 years of historic transaction data for 10 accounts each doing 250 transactions per day.&lt;/p&gt;
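&lt;p&gt;The object counts in the statistics output below follow directly from those parameters:&lt;/p&gt;

```python
accounts = 10
months = 24        # 2 years of history
txns_per_day = 250

records = accounts * months           # one record per account per month
txns_per_record = txns_per_day * 30   # roughly, for a 30-day month

print(records)          # 240
print(txns_per_record)  # 7500
```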

&lt;p&gt;Looking at just the section of &lt;code&gt;generate-data.py&lt;/code&gt; that loads data into the two namespaces: the "blob" version converts the Python object to JSON, compresses the JSON string with zlib, and writes the resulting blob to the &lt;code&gt;ns1&lt;/code&gt; namespace. The "cdt" version writes the Python object as-is to the &lt;code&gt;ns2&lt;/code&gt; namespace.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# write each record
&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;pk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;objects&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;object_type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;blob&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;record_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;zlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compress&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;zlib_level&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;

    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;object_type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cdt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;record_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;

    &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;set_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;record_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
            &lt;span class="n"&gt;policy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;exists&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;aerospike&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;POLICY_EXISTS_CREATE_OR_REPLACE&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;elapsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After &lt;code&gt;generate-data.py&lt;/code&gt; loads the dummy objects into each of the Aerospike namespaces, it outputs some statistics about each namespace:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ python3 generate-data.py 
Aerospike:          127.0.0.1:3000 ns1.example
Run time:           9.354 seconds
Object type:        blob
Object count:       240
Avg object size:    217.0 KiB
Compression ratio:  -
---
Aerospike:          127.0.0.1:3000 ns2.example
Run time:           5.610 seconds
Object type:        cdt
Object count:       240
Avg object size:    182.5 KiB
Compression ratio:  0.349
---
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So right away it is clear that using Aerospike native CDTs with compression enabled results in smaller objects (better storage compression) and faster load times.&lt;/p&gt;

&lt;p&gt;Some of this can be explained by the fact that (a) when Aerospike compresses the data it uses ZStandard compression rather than the zlib used in the Python code, and (b) Aerospike is written in C, a very fast, statically typed, compiled language, while our Python code runs in a slower, dynamically typed, interpreted language. So it is certainly not apples-to-apples.&lt;/p&gt;

&lt;p&gt;However, the two key takeaways here, from the application development perspective, are that the Aerospike compression is essentially free and that you work in your native language types.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use Case: Correct Data with Background Read/Write Scan
&lt;/h2&gt;

&lt;p&gt;To illustrate the value of being able to leverage advanced Aerospike features that are not available when doing client-side serialization/compression, let's take a look at a data correction use case. Suppose that there was a bug in the application that resulted in an incorrect value for the location (&lt;code&gt;loc&lt;/code&gt;) for just one account (&lt;code&gt;acct&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;If the records are serialized and compressed client-side, application code would need to read every record back over the network, into the application RAM, deserialize and decompress it, make the correction, and then write the entire record back over the network.&lt;/p&gt;
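&lt;p&gt;For comparison, the per-record work in that client-side approach would look something like this sketch. It shows only the decompress/patch/recompress step with an in-memory blob; in a real application each &lt;code&gt;blob&lt;/code&gt; would come from an Aerospike &lt;code&gt;get&lt;/code&gt; and the result would go back with a &lt;code&gt;put&lt;/code&gt;, both over the network:&lt;/p&gt;

```python
import json
import zlib

def correct_location(blob: bytes, account: str, bad_loc: int, good_loc: int) -> bytes:
    """Decompress, deserialize, patch, and re-serialize one record blob.

    In the blob-based design all of this work, plus a full get/put network
    round trip, happens in the application for every candidate record.
    """
    record = json.loads(zlib.decompress(blob).decode('utf-8'))
    if record['acct'] == account and record['loc'] == bad_loc:
        record['loc'] = good_loc
    return zlib.compress(json.dumps(record).encode('utf-8'), 6)

# Demo with an in-memory blob standing in for a record read from ns1.
original = {'acct': '00007', 'loc': 5, 'txns': {}}
blob = zlib.compress(json.dumps(original).encode('utf-8'), 6)

fixed = json.loads(zlib.decompress(correct_location(blob, '00007', 5, 2)))
print(fixed['loc'])  # 2
```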

&lt;p&gt;However, if using Aerospike CDTs with server-side compression, the application can initiate a &lt;a href="https://www.aerospike.com/docs/guide/scan.html" rel="noopener noreferrer"&gt;background read/write scan&lt;/a&gt; with a &lt;a href="https://www.aerospike.com/docs/guide/predicate.html" rel="noopener noreferrer"&gt;predicate filter&lt;/a&gt; to do the work entirely on the Aerospike nodes.&lt;/p&gt;

&lt;p&gt;An example of this is illustrated in the Python script &lt;code&gt;correct-data.py&lt;/code&gt;. This script operates on the same data that was generated with &lt;code&gt;generate-data.py&lt;/code&gt; above.&lt;/p&gt;

&lt;p&gt;First, a predicate filter is set up which filters records to only those that have an account ID (&lt;code&gt;acct&lt;/code&gt;) of &lt;code&gt;00007&lt;/code&gt; &lt;strong&gt;AND&lt;/strong&gt; a current location ID (&lt;code&gt;loc&lt;/code&gt;) value of &lt;code&gt;5&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;account_to_correct&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;00007&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="n"&gt;incorrect_location&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;

&lt;span class="n"&gt;predicate_expressions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;

    &lt;span class="c1"&gt;# push expressions to filter by loc=5
&lt;/span&gt;    &lt;span class="n"&gt;predexp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;integer_bin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;loc&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;predexp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;integer_value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;incorrect_location&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;predexp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;integer_equal&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;

    &lt;span class="c1"&gt;# push expression to filter by acct=00007
&lt;/span&gt;    &lt;span class="n"&gt;predexp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string_bin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;acct&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;predexp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string_value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;account_to_correct&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;predexp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string_equal&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;

    &lt;span class="c1"&gt;# filter by the `loc` AND `acct` expressions
&lt;/span&gt;    &lt;span class="n"&gt;predexp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predexp_and&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;policy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;predexp&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;predicate_expressions&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, a background scan is sent to each Aerospike node using the above predicate expressions to filter the scan results, passing an array of write operations to perform on each matching record. In this case, the array contains just one operation: update the location ID (&lt;code&gt;loc&lt;/code&gt;) to &lt;code&gt;2&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;correct_location&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;

&lt;span class="c1"&gt;# Do a background scan, which runs server-side, to update the records that
# match the predicate expression with the correct value for 'loc'.
&lt;/span&gt;&lt;span class="n"&gt;ops&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="n"&gt;operations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;loc&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;correct_location&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;bgscan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;set_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;bgscan&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_ops&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ops&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;scan_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bgscan&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute_background&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;policy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Running background read/write scan. ID: {}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scan_id&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Wait for the background scan to complete.
&lt;/span&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;job_info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scan_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;aerospike&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JOB_SCAN&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;aerospike&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JOB_STATUS_INPROGRESS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;
    &lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.25&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What's that? You're worried about that background read/write scan impacting performance? No worries, Aerospike has that covered by allowing you to throttle the records per second with the &lt;a href="https://www.aerospike.com/docs/reference/configuration/#background-scan-max-rps" rel="noopener noreferrer"&gt;background-scan-max-rps&lt;/a&gt; configuration parameter.&lt;/p&gt;

&lt;p&gt;Consider all the opportunities to optimize cost and performance by sending lightweight binary &lt;em&gt;operations&lt;/em&gt; to the Aerospike database nodes rather than passing ~200 KiB objects back and forth to be processed client-side. Think about how much money you could save! You'll be a hero!&lt;/p&gt;
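&lt;p&gt;A rough back-of-the-envelope using the numbers from the &lt;code&gt;generate-data.py&lt;/code&gt; output above. The ~1 KiB per-record overhead assumed for the scan path is a generous guess, not a measurement:&lt;/p&gt;

```python
records = 240
avg_object_kib = 217.0  # average blob size reported by generate-data.py

# Client-side fix: read and write every candidate record in full.
blob_traffic_kib = records * avg_object_kib * 2  # get + put per record

# Server-side fix: a small scan request plus tiny op payloads.
# Assume a generous ~1 KiB of total overhead per record.
op_traffic_kib = records * 1

print(f"blob approach: ~{blob_traffic_kib:.0f} KiB on the wire")
print(f"scan approach: ~{op_traffic_kib} KiB on the wire")
```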




&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/MicahCarrick/micahcarrick-posts/tree/master/aerospike-serialize-vs-cdt" rel="noopener noreferrer"&gt;View complete source code on Github&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aerospike</category>
      <category>nosql</category>
      <category>protobuf</category>
    </item>
  </channel>
</rss>
