Davide de Paolis for AWS Community Builders

Posted on Dec 13, 2022 • Edited on Feb 3, 2023

AWS Storage Cheat-sheet/Write-up

#aws #storage #techlead #solutionsarchitect

Types of Storage solutions in AWS

Block Storage: data is stored in chunks, on a volume attached to a single instance.
It offers very low-latency data access and is comparable to a hard drive, where you can create volumes and partitions.
File Storage: data is stored as files in directories, within a file system. Think of a Network Attached Storage (NAS) server: it has its own block storage system, but it shares the file system over the network with multiple users. It can be mounted, but once mounted you can't create volumes or partitions on it.
Object Storage: data is stored as objects in a flat address space and referenced by a unique key.
Metadata can be associated with the objects to help categorize and identify them. You can mimic a file hierarchy with a folder structure, but it is only a key prefix, not a real structure.

S3 - Simple Storage Service

Amazon S3 is a fully managed, object-based storage service that is highly available, highly durable, very cost-effective, and widely accessible.
It offers virtually unlimited storage, with individual objects ranging in size from 0 bytes to 5 terabytes.

It provides strong read after write consistency and uses standards-based REST web interfaces to work with objects.

S3 is a regional service, so you need to specify the Region you want to upload your data to (even though S3 in the UI config does not let you select a Region and is shown as global). Nevertheless, Buckets must have a unique name all over the world, across all AWS accounts.
To reduce latency, it is best practice to create Buckets in the Region that is physically closest to your users.

An object consists of:

key ( name of the object/file)
versionId
value ( the data/file)
metadata
subresources

In the console, key prefixes look like real folders, but they are not: there is no folder hierarchy. Slashes make the key look like a file system path, but they are in fact just part of the key.
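To make the "folders are only prefixes" idea concrete, here is a minimal sketch in plain Python (no AWS calls). The key names and the `list_keys` helper are made up for illustration; the helper only mimics how a listing with a prefix and a delimiter groups flat keys into pseudo-folders.

```python
def list_keys(keys, prefix="", delimiter="/"):
    """Mimic an S3-style listing over a flat collection of keys.

    Returns (objects, common_prefixes): keys directly "inside" the prefix,
    and the pseudo-folders one level below it.
    """
    objects, common_prefixes = [], set()
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            # Everything up to the next delimiter is shown as a "folder".
            common_prefixes.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            objects.append(key)
    return objects, sorted(common_prefixes)


# Hypothetical keys: the bucket holds four objects and zero real folders.
bucket = ["readme.txt", "photos/2022/cat.jpg", "photos/2023/dog.jpg", "photos/index.html"]

print(list_keys(bucket))                    # (['readme.txt'], ['photos/'])
print(list_keys(bucket, prefix="photos/"))  # (['photos/index.html'], ['photos/2022/', 'photos/2023/'])
```

The "folders" only exist in the listing result: nothing but the four keys is stored.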

S3 Storage Classes

AWS S3 offers different storage classes you can choose from. First, let's understand the difference between Durability and Availability.

Durability and Availability

Durability: A system that is durable is able to perform its responsibilities over time, even when unexpected events may occur.
Availability: A system that is available is capable of delivering the designed functionality at a given point in time. Highly available systems are those that can withstand some measure of degradation while still remaining available.

from Well-Architected Concepts

To put it simply, Durability measures the (un)likelihood of data loss while Availability refers to the ability to access/retrieve your stored data.

When choosing a storage type and class, ask yourself:

how critical is your data?
how reproducible is it?
how often is the data going to be retrieved?

Objects stored in S3 have eleven nines of durability (99.999999999%): losing data is very unlikely because S3 copies objects across multiple AZs within the Region.
Availability can range from 99.5% to 99.99% depending on the Storage Class.
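A quick back-of-the-envelope calculation helps to read these percentages. The sketch below is plain arithmetic with hypothetical helper names, assuming the percentages are measured over one year of 365 days.

```python
HOURS_PER_YEAR = 365 * 24


def max_downtime_hours(availability_percent):
    """Hours per year a service may be unavailable at the given availability."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


def expected_objects_lost_per_year(object_count, durability_percent):
    """Average number of objects lost per year at the given durability."""
    return object_count * (1 - durability_percent / 100)


for availability in (99.99, 99.9, 99.5):
    print(f"{availability}% -> {max_downtime_hours(availability):.2f} h/year")

# Eleven nines of durability applied to ten million objects.
print(expected_objects_lost_per_year(10_000_000, 99.999999999))
```

In other words, 99.99% availability leaves a bit under one hour of possible downtime per year, 99.5% leaves almost two days, and with eleven nines of durability ten million stored objects would lose on average one object every 10,000 years.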

Available Storage Classes:

S3 Standard: low latency / frequent access
S3 Intelligent Tiering: monitors access patterns and automatically moves objects that have not been accessed to lower-cost access tiers.
S3 Standard Infrequent Access: infrequent access
S3 One Zone-IA: infrequent access, slightly lower availability because data is stored in a single AZ

On top of S3 Standard there is S3 Glacier, an archival solution where we refer to Vaults, rather than Buckets, to better express the type of storage/access provided.
There is no graphical user interface and data is not immediately accessible: retrieving it can take up to a few hours, depending on the retrieval option (Expedited, Standard, Bulk) and the type:

S3 Glacier
S3 Glacier Deep Archive

Versioning

You can use versioning to preserve, retrieve and restore every version of every object in a Bucket.
It is very useful to recover files in case of accidental deletion or overwrite.

Versioning is not active by default. Once enabled, it cannot be disabled, only suspended (versions already created are kept, but no new ones are generated).
Since with S3 you pay for storage, and every version is stored like a new file, you will incur additional costs for every new version.

When you delete an object in a versioning-enabled bucket, all versions remain in the bucket, and Amazon S3 creates a delete marker for the object. To undelete the object, you must delete this delete marker.

If you want to permanently delete the object and all its versions then you need to use CLI/SDK and call DELETE Object specifying versionId.
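The delete-marker behaviour can be sketched with a toy in-memory model. This is not the S3 API: the `VersionedBucket` class and its version ids are invented for illustration, and a delete marker is simply represented as a version whose data is `None`.

```python
import itertools


class VersionedBucket:
    """Toy in-memory model of a versioning-enabled bucket (not the S3 API)."""

    def __init__(self):
        self._versions = {}              # key -> list of (version_id, data or None)
        self._ids = itertools.count(1)

    def put(self, key, data):
        version_id = f"v{next(self._ids)}"
        self._versions.setdefault(key, []).append((version_id, data))
        return version_id

    def delete(self, key):
        # A plain delete removes nothing: it stacks a delete marker on top.
        return self.put(key, None)

    def delete_version(self, key, version_id):
        # Deleting a specific version id is permanent.
        self._versions[key] = [v for v in self._versions[key] if v[0] != version_id]

    def get(self, key):
        versions = self._versions.get(key, [])
        # The latest entry wins; a delete marker makes the object look gone.
        return versions[-1][1] if versions else None


bucket = VersionedBucket()
bucket.put("report.csv", "first draft")
bucket.put("report.csv", "final")
marker = bucket.delete("report.csv")
print(bucket.get("report.csv"))              # None: hidden by the delete marker
bucket.delete_version("report.csv", marker)
print(bucket.get("report.csv"))              # final: removing the marker undeletes
```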

Replication

With Same-Region Replication (SRR) and Cross-Region Replication (CRR), the contents of a bucket can be copied to a bucket in the same Region or in a different Region, or in a different account.

Replication is possible only on buckets that have versioning enabled.
It is important to remember that only new files are replicated ( so it is better to enable it on bucket creation, otherwise you will have to copy existing files yourself).

Static Website Hosting

You can use S3 to serve simple and static websites that require no server-side scripting of any kind. (client side js in the index.html is absolutely fine)
Just select the option Use this bucket to host a website, make the bucket and its contents publicly accessible.

the bucket name must match the domain name

Just remember that the website endpoint URL will be HTTP, so you will likely need CloudFront to serve your static website over HTTPS (check these docs https://aws.amazon.com/premiumsupport/knowledge-center/cloudfront-serve-static-website/)
Set the entry point to index.html (a file that of course must be in your bucket)

S3 can also simply be used to redirect to any domain

Logging

In AWS, there are two ways to log access to S3 storage resources, i.e. buckets and bucket objects:

Server Access Logging: captures details of the requests that are made to a bucket and its objects, and saves the logs to a specific target Bucket (that must be in the same Region).
- conducted on a best-effort basis
- disabled by default
- needs permissions to write logs to the target bucket
- newline-delimited log records, where each record represents one request and consists of space-delimited fields.
Object Logging: logs in JSON format all the actions taken on S3 Objects (like Get, Put and Delete) using AWS CloudTrail.
- can be configured from CloudTrail or directly from the Bucket itself.

Object Lock

Prevents items from being deleted or overwritten.
Can be activated only at Bucket Creation, and only if Versioning is enabled.
It provides a level of compliance known as WORM (Write Once Read Many).

Two possible retention modes exist:

Governance mode: allows you to set a retention period during which the object cannot be deleted; afterwards, it is possible to delete it.
It is important to note that if a User/Role has a specific permission such as s3:BypassGovernanceRetention, it will indeed be possible to bypass object locking.
Compliance mode: also has a retention period, but there is no permission that allows bypassing it.
Object Lock also allows you to put versions under Legal Hold, independently of the retention period, so that even after the retention period has passed those versions still cannot be deleted (unless a user has the specific permission to remove the Legal Hold).
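The rules above can be summarised as a small decision function. This is only a sketch of the logic as described in this section, with hypothetical parameter names; it does not call S3 and treats a legal hold as blocking deletion regardless of the retention period.

```python
from datetime import datetime, timezone


def can_delete_version(mode, retain_until, now, has_bypass_permission=False, legal_hold=False):
    """Toy decision function for deleting a locked object version.

    mode is "GOVERNANCE" or "COMPLIANCE"; retain_until and now are datetimes.
    """
    if legal_hold:
        return False                      # a legal hold blocks deletion until removed
    if now >= retain_until:
        return True                       # retention period is over
    if mode == "GOVERNANCE":
        return has_bypass_permission      # only privileged callers may bypass
    return False                          # COMPLIANCE: nobody can bypass


retain_until = datetime(2030, 1, 1, tzinfo=timezone.utc)
now = datetime(2025, 1, 1, tzinfo=timezone.utc)

print(can_delete_version("GOVERNANCE", retain_until, now))                              # False
print(can_delete_version("GOVERNANCE", retain_until, now, has_bypass_permission=True))  # True
print(can_delete_version("COMPLIANCE", retain_until, now, has_bypass_permission=True))  # False
```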

Cost allocation Tags

As with basically almost every AWS service, S3 offers the possibility of adding Tags ( like environment, project, application and so on) so that you can then use the cost explorer and allocate costs by tags.
Check this post I wrote about it a long time ago for more info.

Multipart Upload

Multipart Upload uploads objects in parts, independently, in parallel and in any order.
It can be used for any object above 5MB but is recommended from 100MB (and it is the only upload option for files bigger than 5GB).
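As a rough illustration of how an object is cut into parts, the sketch below computes the byte ranges of each part. The `plan_parts` helper is hypothetical and performs no upload; it assumes the S3 limits of a 5 MiB minimum part size (except for the last part) and at most 10,000 parts per upload.

```python
MIB = 1024 * 1024
MIN_PART_SIZE = 5 * MIB      # every part except the last must be at least 5 MiB
MAX_PARTS = 10_000           # limit on the number of parts per upload


def plan_parts(object_size, part_size=100 * MIB):
    """Return (part_number, first_byte, last_byte) tuples covering the object."""
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size must be at least 5 MiB")
    parts = []
    for number, start in enumerate(range(0, object_size, part_size), start=1):
        end = min(start + part_size, object_size) - 1
        parts.append((number, start, end))
    if len(parts) > MAX_PARTS:
        raise ValueError("too many parts: choose a larger part_size")
    return parts


parts = plan_parts(250 * MIB)
print(len(parts))    # 3
print(parts[-1])     # (3, 209715200, 262143999)
```

Because each part is described only by its number and byte range, the parts can be sent in parallel and in any order, and a failed part can be retried on its own.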

Transfer acceleration

Transfer acceleration allows you to transfer data in and out of S3 Buckets taking advantage of CloudFront Edge Locations.

Take into account, though, that you will be charged for transferring data into the Bucket (while a normal upload incurs no data transfer charge), and you will pay more to retrieve objects (since they go through the Edge Location), but only if there was an actual benefit.

Enabling it requires just a couple of settings; the only constraint is that the bucket name must be DNS compliant and must not contain periods.

After acceleration is configured, GET and PUT requests must use the new transfer acceleration endpoint.
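A small sketch of the naming constraint and of the resulting endpoint follows. The bucket name is hypothetical and the regular expression is a simplified DNS-compliance check, not the complete set of S3 naming rules; the endpoint follows the `bucketname.s3-accelerate.amazonaws.com` pattern.

```python
import re

# Simplified DNS-compliance check: 3-63 chars, lowercase letters, digits and
# hyphens, starting and ending with a letter or digit. Periods are rejected
# on purpose, because Transfer Acceleration does not accept them.
_NAME_RE = re.compile(r"^[a-z0-9][a-z0-9-]{1,61}[a-z0-9]$")


def accelerate_endpoint(bucket_name):
    """Return the acceleration endpoint, or raise if the name is not usable."""
    if not _NAME_RE.match(bucket_name):
        raise ValueError(f"{bucket_name!r} cannot be used with Transfer Acceleration")
    return f"https://{bucket_name}.s3-accelerate.amazonaws.com"


print(accelerate_endpoint("my-example-bucket"))
# https://my-example-bucket.s3-accelerate.amazonaws.com
```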

Requester Pays

Bucket Owner will pay for data being stored.
Requester will be charged for data being transferred.

Of course, in order to charge the requester, requests to your bucket must be authenticated. To make sure that the requester is aware of these charges, a special header, x-amz-request-payer, must be included in the request.

Multifactor Authentication

S3 offers two features that require MFA as an additional layer of security.

MFA Delete will require the bucket owner to authenticate with two forms of authentication when

Changing the versioning state of your bucket
Permanently deleting an object version

MFA-protected API access is configured via IAM policies, where you can specify which API operations a user is allowed to call and those where user needs to be authenticated with AWS multi-factor authentication (MFA) before you allow them to perform particularly sensitive actions.

Securing and Controlling Access

I have explained the different policies in more detail in the Policy Section of my previous Exam Preparation Post about IAM, but here is just a recap:

Identity-based policies: associated to a user, group or role, specifying actions, conditions
Resource Policies: attached directly to a resource (in this case to the S3 Bucket)

Because of their nature and usage, and because they are independent from IAM, bucket policies can become more complex and are therefore allowed a bigger size (up to 20 KB, compared to just 2 KB for user policies, 5 KB for group policies and 10 KB for role policies).

ACLs: they allow setting different permissions per object, they don't use the same JSON format, and they can express neither explicit denies nor conditions.

ACLs are the legacy access control mechanism that predates IAM and are therefore not recommended.

When to use IAM policies and when to use bucket policies?

IAM policies are better for central management, to reuse policies across multiple buckets, or when a policy affects different AWS services.
The advantage of bucket policies is that they can grant cross-account access without having to create and assume roles, and they are useful in case we reach the size limit of an IAM policy.
If you need cross-account Console access, you will need IAM Roles.

IAM policies, bucket policies and ACLs are NOT mutually exclusive and can co-exist, but it is worth keeping in mind how they are evaluated:

Policies and ACLs are evaluated all together and applied following the principle of least privilege:
we start with an implicit deny and allow only when there is an explicit Allow; an explicit Deny, however, always overrules an explicit Allow.
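The evaluation order can be sketched in a few lines. This is a deliberately simplified model with a hypothetical `is_allowed` helper: it only looks at Effect and Action, and ignores principals, resources, wildcards and conditions.

```python
def is_allowed(action, statements):
    """Evaluate simplified policy statements for one action.

    Each statement is a dict like {"Effect": "Allow", "Action": ["s3:GetObject"]}.
    """
    decision = False                          # implicit deny
    for statement in statements:
        if action not in statement["Action"]:
            continue
        if statement["Effect"] == "Deny":
            return False                      # explicit deny always wins
        if statement["Effect"] == "Allow":
            decision = True                   # explicit allow, unless denied elsewhere
    return decision


identity_policy = [{"Effect": "Allow", "Action": ["s3:GetObject", "s3:PutObject"]}]
bucket_policy = [{"Effect": "Deny", "Action": ["s3:PutObject"]}]
combined = identity_policy + bucket_policy

print(is_allowed("s3:GetObject", combined))     # True: explicit allow
print(is_allowed("s3:PutObject", combined))     # False: explicit deny overrules the allow
print(is_allowed("s3:DeleteObject", combined))  # False: implicit deny
```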

Managing Public Access

In recent years, especially because of loose or wrong configurations, many companies had data breaches; therefore AWS has gradually improved security on S3 and now blocks public access by default. If you want public access, you have to explicitly change the bucket policy and ACL, and AWS will still show you lots of warnings reminding you of the risks.

S3 Encryption

You can protect data in transit using Secure Socket Layer/Transport Layer Security (SSL/TLS) or client-side encryption.

You should allow only encrypted connections over HTTPS (TLS) using the aws:SecureTransport condition on Amazon S3 bucket policies.
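As an illustration, a bucket policy that denies any request not sent over TLS could be built like this. The bucket name and the `Sid` are placeholders; the sketch only assembles and prints the JSON document, it does not apply it to a bucket.

```python
import json


def deny_insecure_transport_policy(bucket_name):
    """Build a bucket policy that rejects any request not sent over TLS."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",
                    f"arn:aws:s3:::{bucket_name}/*",
                ],
                # aws:SecureTransport is "false" when the request used plain HTTP.
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


print(json.dumps(deny_insecure_transport_policy("my-example-bucket"), indent=2))
```

Using a Deny statement (rather than an Allow restricted to HTTPS) means the rule holds even if another policy grants access.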

For protecting data at rest in Amazon S3 you can activate Encryption at Bucket or Object level and enforce it using a policy

Server-Side Encryption

Request Amazon S3 to encrypt your object before saving it on disks in its data centers and then decrypt it when you download the objects.

SSE-S3: uses one of the strongest block ciphers available, 256-bit Advanced Encryption Standard (AES-256) GCM to encrypt your data
SSE-KMS: Uses a key generated and managed by AWS KMS
SSE-C: uses a key generated and managed by you ( the client/customer)

Client-Side Encryption

Client-side encryption is the act of encrypting your data locally to ensure its security as it passes to the Amazon S3 service. Amazon S3 receives your encrypted data; it does not play a role in encrypting or decrypting it (but you can still use keys created in KMS).

When enabling Encryption, only new files will be encrypted - this means if you need encryption for old objects, you will have to re-upload them.

CORS

Cross-origin resource sharing (CORS) defines a way for client web applications that are loaded in one domain to interact with resources in a different domain. With CORS support you can selectively allow cross-origin access to your Amazon S3 resources.
It is worth remembering that a CORS configuration can contain more than one rule and, for example, define different rules for different HTTP methods.
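Here is a sketch of a configuration with two rules and of how a request could be matched against them. The origin is hypothetical, and `matching_rule` is a simplified stand-in for the matching S3 performs: it picks the first rule that fits, and only supports exact origins or a bare `*`.

```python
cors_rules = [
    {   # read-only access from any origin
        "AllowedOrigins": ["*"],
        "AllowedMethods": ["GET", "HEAD"],
    },
    {   # uploads only from one (hypothetical) site
        "AllowedOrigins": ["https://app.example.com"],
        "AllowedMethods": ["PUT", "POST"],
        "AllowedHeaders": ["*"],
        "MaxAgeSeconds": 3000,
    },
]


def matching_rule(origin, method, rules):
    """Return the first rule that allows this origin and method, else None."""
    for rule in rules:
        origin_ok = "*" in rule["AllowedOrigins"] or origin in rule["AllowedOrigins"]
        if origin_ok and method in rule["AllowedMethods"]:
            return rule
    return None


print(matching_rule("https://other.example.org", "GET", cors_rules) is cors_rules[0])  # True
print(matching_rule("https://app.example.com", "PUT", cors_rules) is cors_rules[1])    # True
print(matching_rule("https://other.example.org", "PUT", cors_rules))                   # None
```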

Lifecycle configuration

Lifecycle configurations are the most cost-effective strategy when your objects follow predictable patterns of usage.
Some examples are:

deleting older versions of your objects
cleaning up incomplete multipart uploads
deleting data after some time
moving data to archival storage if you don't need the object to be accessible all the time, but you have to keep it for a long time to comply with some regulation.

In lifecycle management there are 2 types of actions: Transition and Expiration.

Lifecycle configurations are written in XML.
Check this example of a configuration where all the files matching a prefix are moved to S3 Glacier after 90 days and after 1 year are being deleted entirely:

<LifecycleConfiguration>
  <Rule>
    <ID>example-id</ID>
    <Filter>
       <Prefix>my-files-prefix/</Prefix>
    </Filter>
    <Status>Enabled</Status>
    <Transition>
      <Days>90</Days>
      <StorageClass>GLACIER</StorageClass>
    </Transition>
    <Expiration>
      <Days>365</Days>
    </Expiration>
  </Rule>
</LifecycleConfiguration>

Check here for more examples.
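To see how such a rule plays out over time, the following sketch parses the XML above with the standard library and tells what would have happened to an object of a given age. The `state_after` helper is hypothetical and simplified: it assumes objects start in STANDARD and only understands the elements used in this example.

```python
import xml.etree.ElementTree as ET

LIFECYCLE_XML = """
<LifecycleConfiguration>
  <Rule>
    <ID>example-id</ID>
    <Filter>
       <Prefix>my-files-prefix/</Prefix>
    </Filter>
    <Status>Enabled</Status>
    <Transition>
      <Days>90</Days>
      <StorageClass>GLACIER</StorageClass>
    </Transition>
    <Expiration>
      <Days>365</Days>
    </Expiration>
  </Rule>
</LifecycleConfiguration>
"""


def state_after(key, age_days, xml_text=LIFECYCLE_XML):
    """Return "EXPIRED", a storage class name, or "STANDARD" for one object."""
    for rule in ET.fromstring(xml_text).findall("Rule"):
        if rule.findtext("Status") != "Enabled":
            continue
        if not key.startswith(rule.findtext("Filter/Prefix", default="")):
            continue
        expiration = rule.findtext("Expiration/Days")
        if expiration is not None and age_days >= int(expiration):
            return "EXPIRED"
        transition = rule.find("Transition")
        if transition is not None and age_days >= int(transition.findtext("Days")):
            return transition.findtext("StorageClass")
    return "STANDARD"


print(state_after("my-files-prefix/a.log", 10))    # STANDARD
print(state_after("my-files-prefix/a.log", 120))   # GLACIER
print(state_after("my-files-prefix/a.log", 400))   # EXPIRED
print(state_after("other/a.log", 400))             # STANDARD: prefix does not match
```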

Noncurrent versions are not affected by Transition and Expiration actions because they have their own actions (NoncurrentVersionTransition and NoncurrentVersionExpiration): you can choose to keep a maximum number of versions, or set the number of days since the object became noncurrent (meaning it was overwritten or deleted).

Cost considerations of transitions

Transitioning to archival storage reduces storage costs, but you still need to consider what you are transitioning and how frequently, because transitions are charged per request and those charges can offset the storage savings.
To minimise this cost, you should consider transitioning mostly large objects that need to be retained over long periods of time, or consider aggregating several small objects into one large object.

Event Notification

Event Notifications send a notification when an event happens on the bucket (like adding or removing a file).
Destination targets can be, for example:

SNS topics
SQS queues
AWS Lambda functions

S3 Object Lambda

S3 Object Lambda is a relatively recent feature that allows the execution of a Lambda function whenever an Object is being accessed.
It is useful, for example, if multiple applications or users access the same resources but you want to serve slightly different versions of the same object (like resized, or with some information edited).
You can trigger an AWS pre-built function (like PII Access Control/Redaction or Decompression) or a Lambda function that you have written.

Pre-signed URLS

A pre-signed URL is a way of temporarily authorising access for users who might not have an AWS account.

S3 Select and Glacier Select

These features enable applications to retrieve only a subset of data from an object (like a compressed CSV file) by using simple SQL expressions.

Check here for more info.

Amazon Elastic File System

EFS is a file-based storage system that is accessed using the NFS protocol.

EFS provides simple, scalable file storage for use with Amazon EC2 instances. Similarly to file servers, or Storage Area Networks or NAS, Amazon EFS provides the ability for users to browse cloud network resources.

EFS can scale to Petabytes and provides low latency access and high level of throughput (MB/s)

Data is stored redundantly across multiple AZs, and can be accessed by up to thousands of Amazon EC2 and other AWS compute instances running in multiple Availability Zones within the same AWS Region.

You can connect from your VPC using mount targets (network interfaces your instances use to connect to your file system).
If you use multiple AZs, you can create a mount target in each of them.

You can even connect across Regions (and AWS accounts); in that case you need a VPC peering connection (and the mount will use the mount target's IP address, not its DNS name).

It uses a pay for what you use model with no pre-provisioning required.

LifeCycle Management and Backups

Similar to Lifecycle configurations for S3, LifeCycle management will move resources that have been less frequently accessed to the Infrequent Access tier.

Backups for EFS are automatic (done via AWS Backup) and enabled by default.

Storage Classes:

Standard
Infrequent Access

Performance Modes

General Purpose: home directories and general file-sharing environments. Low latency and all-round performance. A file system can support up to 35,000 file operations per second but in General Purpose performance mode, read and write operations consume a different number of file operations. (Read = 1 file operation while write/update = 5 file operations).
Max I/O: offers virtually unlimited amounts of throughput and IOPs, but file operation latency is higher.
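The read/write weighting for General Purpose mode is easy to check with some arithmetic. The sketch below uses the figures quoted in this section (35,000 file operations per second, a read counting as 1 and a write as 5); the helper names are made up for illustration.

```python
MAX_FILE_OPS_PER_SECOND = 35_000   # figure quoted in this section
READ_COST, WRITE_COST = 1, 5       # file operations consumed per request


def file_ops_used(reads_per_second, writes_per_second):
    """File operations consumed per second by a mixed workload."""
    return reads_per_second * READ_COST + writes_per_second * WRITE_COST


def fits_general_purpose(reads_per_second, writes_per_second):
    return file_ops_used(reads_per_second, writes_per_second) <= MAX_FILE_OPS_PER_SECOND


print(file_ops_used(10_000, 4_000))         # 30000
print(fits_general_purpose(10_000, 4_000))  # True
print(fits_general_purpose(0, 7_001))       # False: writes alone exceed the budget
```

A write-only workload therefore tops out at 7,000 writes per second, which is why write-heavy workloads reach the limit much earlier than read-heavy ones.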

Throughput Modes

Bursting: All EFS file systems, regardless of size, can burst up to 100 MiB/s of metered throughput, if they have burst credits.
Provisioned: you set the throughput you need independently of the amount of data stored, and you pay an additional cost for the throughput you provision.

Use Bursting Throughput mode if your workloads are typically spiky. A spiky workload drives high levels of throughput for short periods of time, with otherwise low levels of throughput. For applications that have a relatively constant throughput, use Provisioned mode.

Security and Access Control

First of all, in order to create an EFS file system, you need Allow permissions for a series of actions like CreateFileSystem and CreateMountTarget (among others).

IAM policies can be used to control who can administer the file system.

Security groups control the network traffic that is allowed to reach the file system (remember that a Security Group acts like a firewall) and are therefore useful to control which NFS clients can access your file system (an inbound rule of type NFS, with either the security group of your instances or their IPs as source).

You can control who can have access to what files and directories with POSIX-compliant user and group-level permissions.

Encryption

Encryption can be At Rest and/or In Transit, and can be configured separately.

If you need to ensure data remains secure while being transferred between your EFS file system and your client, you will need Encryption in Transit. For that, TLS protocol will be used.
Encryption in transit is enabled when mounting the file system.

At Rest uses KMS to manage encryption keys
You can enable encryption at rest only at the time of creating the FileSystem.
If you have a file system that is not encrypted, you will have to create a new file system with encryption enabled and then migrate the data.

How do you import your data into EFS?

AWS DataSync is a service designed to move, migrate, and synchronize data from on-prem to EFS ( or other Storage Services)

Examples of use cases:

migrate an NFS file system from EC2 to EFS within same region.
replicate an NFS file system from EC2 in a region to a EFS in another region for disaster recovery
migrate an Amazon EFS file system to another EFS with different configurations ( like lifecycle management, or performance mode)
replicate an EFS file system to another region for disaster recovery

Instance Store Volumes

Instance Store Volumes are high performance local disks that are physically attached to the host computer on which an EC2 instance is running.

Instance store volumes provide ephemeral storage: if the instance is stopped or terminated, your data is lost (in case of a reboot, data will remain intact).

If Instance Store Volumes are Ephemeral, what are their benefits?

very high I/O speed
no additional storage costs ( you are already paying for the instance itself and remember that the bigger the size of the instance the more capacity of store volume you get)

They are usually ideal as a cache or buffer for rapidly changing data, and are often used within a load-balanced group where data is replicated and pooled among the instances.

If you need to persist data which is critical, then use EBS!

Amazon Elastic Block Store

EBS is a block-based storage system for mounting volumes.

EBS volumes are independent of the EC2 instance: these volumes are logically, rather than directly, attached to the instance.

By default a Root EBS is deleted when an instance is terminated, non-root are not deleted by default. Both settings can be changed.

EBS volume itself is only available in a single AZ and can be accessed by instances ONLY from the same AZ.
If you need access from instances in other AZ you can recreate volume from a snapshot and attach it to another instance in another AZ.

You can have multiple EBS volumes attached to an instance, but each EBS can be attached to one instance only

Multi- Attach

EBS can be attached only to one EC2 Instance at a time but recently EBS MultiAttach was introduced.
It has some limitations:

only with Provisioned IOPS SSD volumes
only on instances based on Nitro
still only on same AZ

Multi-Attach does not support I/O fencing, so it is necessary to enforce write ordering between instances to prevent data corruption.

For Multi-Attach volumes, choosing a clustered file system (like GFS2) is recommended.

Volume Types

There are 2 main categories, both with sub-types you can choose from depending on your requirements in terms of cost, IOPS, throughput (MB/s), latency, capacity, and volume durability.

SSD (Solid State Drive): optimized for transactional workloads involving frequent read/write operations with small I/O size, where the dominant performance attribute is IOPS. Use cases are boot volumes for EC2 or databases.

HDD (Hard Disk Drives): optimized for large streaming workloads where the dominant performance attribute is throughput. Use cases are big data and data warehouses. They can't be used as boot volumes.

Check here for more info

Backups

Every write to EBS is replicated multiple times within AZ to prevent loss of data.

EBS provides backup capabilities through snapshots (taken manually or via CloudWatch schedules).
Snapshots are incremental and are stored on S3;
you can recreate a volume from a snapshot.

Even though they are incremental, each snapshot is able to restore the entire volume, so old snapshots can be deleted to save space and storage costs.
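How can snapshots be incremental and still each restore a full volume? One way to picture it is the toy model below, where every snapshot is a complete map of block references but identical block contents are stored only once. This is a conceptual sketch with invented names, not a description of how EBS is implemented internally.

```python
class SnapshotStore:
    """Toy model: each snapshot is a full map of block references, but
    block contents are stored only once and shared between snapshots."""

    def __init__(self):
        self.blocks = {}      # block_id -> data, stored once
        self.snapshots = {}   # snapshot name -> {block index: block_id}

    def take(self, name, volume):
        refs = {}
        for index, data in volume.items():
            block_id = (index, data)          # unchanged content reuses the stored block
            self.blocks.setdefault(block_id, data)
            refs[index] = block_id
        self.snapshots[name] = refs

    def restore(self, name):
        return {index: self.blocks[bid] for index, bid in self.snapshots[name].items()}

    def delete(self, name):
        del self.snapshots[name]
        still_used = {bid for refs in self.snapshots.values() for bid in refs.values()}
        # Only blocks that no remaining snapshot references are freed.
        self.blocks = {bid: d for bid, d in self.blocks.items() if bid in still_used}


store = SnapshotStore()
volume = {0: "boot", 1: "data-v1", 2: "logs-v1"}
store.take("monday", volume)
volume[1] = "data-v2"                        # only one block changes
store.take("tuesday", volume)
print(len(store.blocks))                     # 4 stored blocks, not 6
store.delete("monday")
print(store.restore("tuesday"))              # {0: 'boot', 1: 'data-v2', 2: 'logs-v1'}
```

Deleting the older snapshot frees only the one block that nothing else references, while the newer snapshot still restores the whole volume.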

Since snapshots are saved in S3 which is a Regional service, we can use those snapshot to create a new volume in a different AZ.
We can also copy the snapshot to another region, and from that create a new volume in a different region.

(it is also possible to create an AMI and then share it to create a new volume - even on other accounts - there might be restrictions on encryption)

Amazon Data Lifecycle Manager (DLM)

DLM automates the creation, retention and deletion of EBS snapshots and EBS-backed AMIs.

It helps to:

Protect valuable data by enforcing a regular backup schedule.

Create standardized AMIs that can be refreshed at regular intervals.

Retain backups as required by auditors or internal compliance.

Reduce storage costs by deleting outdated backups.

Create disaster recovery backup policies that back up data to isolated accounts.

Encryption

Amazon EBS encryption uses AWS KMS keys when creating encrypted volumes and snapshots.

When creating an encrypted EBS volume:

data inside the volume is encrypted at rest
data being transferred between the instance and the volume is encrypted in transit
snapshots and volumes created from these snapshots are also encrypted.

Encryption does not affect IOPS performance
EBS uses AES-256 encrypting algorithm

Amazon EBS automatically creates a unique AWS managed key in each Region where you store AWS resources. By default, Amazon EBS uses this KMS key for encryption, but you can specify a symmetric customer managed encryption key that you created as the default KMS key for EBS encryption.
Using a CMK (Customer Managed Key) gives you more flexibility, including the ability to create, rotate, and disable KMS keys.
(more on this here)

RAID with EBS

A Redundant Array of Independent Disks (RAID) is not provided by AWS and must be configured through your OS.

RAID 0 is used for striping data across disks (more performance, but if a disk fails you lose all data across the array).
RAID 1 is used for mirroring data across disks (you use double the amount of storage, but it is safer if a disk fails).
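The difference between striping and mirroring can be shown with byte arrays standing in for disks. This is a toy sketch with a made-up stripe size of 4 bytes; on EC2 the real work is done by the operating system's software RAID, not by application code.

```python
def raid0_write(data, disk_count=2, stripe_size=4):
    """Stripe data across disks: each disk holds only a part of it."""
    disks = [bytearray() for _ in range(disk_count)]
    for n, start in enumerate(range(0, len(data), stripe_size)):
        disks[n % disk_count] += data[start:start + stripe_size]
    return disks


def raid0_read(disks, stripe_size=4):
    """Reassemble striped data; every disk is required."""
    out = bytearray()
    offsets = [0] * len(disks)
    n = 0
    while any(off < len(disk) for off, disk in zip(offsets, disks)):
        i = n % len(disks)
        out += disks[i][offsets[i]:offsets[i] + stripe_size]
        offsets[i] += stripe_size
        n += 1
    return bytes(out)


def raid1_write(data, disk_count=2):
    """Mirror data: every disk holds a full copy."""
    return [bytearray(data) for _ in range(disk_count)]


data = b"ABCDEFGHIJ"
striped = raid0_write(data)
print(striped)                      # [bytearray(b'ABCDIJ'), bytearray(b'EFGH')]
print(raid0_read(striped) == data)  # True, but only while both disks survive
mirrored = raid1_write(data)
print(bytes(mirrored[1]) == data)   # True: either disk alone holds everything
```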

FSx

FSx provides fully managed third-party file systems:

Windows File Server:
- supports Windows native file system features like ACLs, shadow copies and user quotas
- uses NTFS
- mounted using SMB protocol
- pricing options on capacity throughput and backups (and by the fact of it being multi AZ or 1 AZ)
- High Availability
Lustre
- for compute intensive workloads (ML, HPC, Video processing and Financial modeling) - on linux
- works natively with S3
- simpler price format

In both cases you can connect from On-Premises using VPN or Direct Connect

AWS Backup

It's a service to create backup policies and backup plans for multiple AWS Services ( Storage or Databases).
These backups can then be used cross-account too.

Consider both the cost of storing the backups and the cost of restoring them.

Data migration services

If you need to migrate data from on-premises to the cloud, AWS offers different solutions, which I will mention in the Migration Services article.