
Thomas H Jones II

Originally published at thjones2.blogspot.com

Streamed Backups to S3

Introduction/Background

Many would-be users of AWS come to AWS from a legacy hosting background. When moving to AWS, they often ask, "How do I back my stuff up when I no longer have access to my enterprise backup tools?" If they don't, it's a question they should be asking.

AWS provides a number of storage options. Each option has use-cases that it is optimized for. Each also has a combination of performance, feature and pricing tradeoffs (see my document for a quick summary of these tradeoffs). The lowest-cost — and, therefore, typically most attractive for data-retention use-cases typical of backups-related activities — is S3. Further, within S3, there are pricing/capability tiers that are appropriate to different types of backup needs (the following list is organized by price, highest to lowest):

  • If there is a need to perform frequent full or partial recoveries, the S3 Standard tier is probably the best option
  • If recovery-frequency is pretty much "never" (but recoveries need to be quick if there actually is a need to perform them) and the policies governing backups mandate up to a thirty-day recoverability window, the best option is likely the S3 Infrequent Access (IA) tier.
  • If there's generally no need for recovery beyond legal-compliance requirements, or the recovery-time objectives (RTO) for backups will tolerate a multi-hour wait for data to become available, the S3 Glacier tier is probably the best option.

Further, if projects' backup needs span the usage profiles of the previous list, data lifecycle policies can be created that will move data from a higher-cost tier to a lower-cost tier based on time thresholds. To prevent being billed for data that has no further utility, the lifecycle policies can also include an expiration-age. Data that reaches the set expiration age will simply be deleted by AWS and associated charges for the expired data will cease.
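To illustrate the tiering-plus-expiration idea, the following is a minimal sketch of a lifecycle configuration applied via the AWS CLI. The bucket name, "Backups/" prefix and day-thresholds are placeholders rather than recommendations:

    # Transition objects under "Backups/" to cheaper tiers, then expire them.
    cat > lifecycle.json <<'EOF'
    {
      "Rules": [
        {
          "ID": "TierThenExpireBackups",
          "Status": "Enabled",
          "Filter": { "Prefix": "Backups/" },
          "Transitions": [
            { "Days": 30, "StorageClass": "STANDARD_IA" },
            { "Days": 90, "StorageClass": "GLACIER" }
          ],
          "Expiration": { "Days": 365 }
        }
      ]
    }
    EOF

    aws s3api put-bucket-lifecycle-configuration \
      --bucket my-backup-bucket \
      --lifecycle-configuration file://lifecycle.json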

There are a couple of ways to get backup data into S3:

  • Copy: The easiest (and likely most well-known) is to simply copy the data from a host into an S3 bucket. Every file on disk that's copied to S3 exists as an individually downloadable file in S3. Copy operations can be iterative or recursive. If the copy operation takes the form of a recursive copy, the basic directory-relationship between files is preserved (though things like hard- or soft-links get converted into multiple copies of a given file). While this method is easy, it results in a loss of filesystem metadata: not just the previously-mentioned link-style file-data, but ownerships, permissions, MAC labels, etc.
  • Sync: Similarly easy is the "sync" method. Like the basic copy, every file on disk that's synced to S3 exists as an individually downloadable file in S3. The sync operation is inherently recursive. Further, if a copy of a file already exists within S3 at a given location, the sync operation will only overwrite the S3-hosted file if the to-be-copied file is different. This provides good support for incremental-style backups. As with the basic copy-to-S3 method, this method results in the loss of file-link and other filesystem metadata.

Note: if using this method, it is probably a good idea to turn on bucket-versioning to ensure that each version of an uploaded file is kept. This allows a recovery operation to restore a given point-in-time's version of the backed-up file (see the short example after this list).

  • Streaming copy: This method is the least well-known. However, this method can be leveraged to overcome the problem of loss of filesystem metadata. If the stream-to-S3 operation includes an inlined data-encapsulation operation (e.g., piping the stream through the tar utility), filesystem metadata will be preserved.

Note: the cost of preserving metadata via encapsulation is that the encapsulated object is (mostly) opaque to S3. As such, there is not really a good (direct) means by which to emulate an incremental backup operation.
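Expanding on the versioning note above: enabling bucket-versioning is a one-time, one-line operation. A minimal sketch follows; the bucket name is a placeholder, and the credentials in use are assumed to have permission to change the bucket's versioning state.

    # Keep prior versions of objects that later "aws s3 sync" runs overwrite.
    aws s3api put-bucket-versioning \
      --bucket my-backup-bucket \
      --versioning-configuration Status=Enabled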

Technical Implementation

As the title of this article suggests, the technical-implementation focus of this article is on streamed backups to S3.

Most users of S3 are aware of its static file-copy options. That is, copying a file from an EC2 instance directly to S3. When such users want to store files in S3 and need to retain filesystem metadata, they typically either look to things like s3fs or do a staged encapsulation.

The former allows you to treat S3 as though it were a local filesystem. However, for various reasons, many organizations are not comfortable using FUSE-based filesystem implementations, particularly ones from open-source projects (usually due to fears about support if something goes awry).

The latter means using an archiving tool to create a pre-packaged copy of the data: the archive is first staged to disk as a complete file and then copied to S3. Common archiving tools include the Linux Tape ARchive utility (tar), cpio, or even mkisofs/genisoimage. However, if the archiving tool supports reading from STDIN and/or writing to STDOUT, the tool can be used to create an archive directly within S3 using S3's streaming-copy capabilities.
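As a minimal sketch of that streaming-copy capability (the bucket name and source directory are placeholders, and the AWS CLI is assumed to be installed with credentials that can write to the bucket):

    # Stream an archive of /home straight into S3 -- no staging file on disk.
    # A source of "-" tells "aws s3 cp" to read the object's data from STDIN.
    # (If your tar build supports flags like --xattrs, adding them preserves
    # extended-attribute metadata as well.)
    tar -cf - /home | aws s3 cp - s3://my-backup-bucket/Backups/home.tar

    # Spot-check the upload: a destination of "-" writes the object to STDOUT,
    # which tar then parses, listing contents to /dev/null.
    aws s3 cp s3://my-backup-bucket/Backups/home.tar - | tar -tvf - > /dev/null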

Best practice for backups is to ensure that the target data-set is in a consistent state. Generally, this means that the data to be archived is non-changing. This can be done by quiescing a filesystem ...or by snapshotting a filesystem and backing up the snapshot. LVM snapshots will be used to illustrate how to take a consistent backup of a live filesystem (like those used to host the operating system).

Note: this illustration assumes that the filesystem to be backed up is built on top of LVM. If the filesystem is built on a bare (EBS-provided) device, activity on the filesystem will need to be stopped before it can be consistently streamed to S3.

The high-level procedure is as follows:

  1. Create a snapshot of the logical volume hosting the filesystem to be backed up (note that LVM issues an fsfreeze operation before creating the snapshot: this flushes all pending I/Os before making the snapshot, ensuring that the resultant snapshot is in a consistent state). Thin or static-sized snapshots may be selected (thin snapshots are especially useful when snapshotting multiple volumes within the same volume-group, as one has less need to worry about getting the snapshot volume's size-specification correct).
  2. Mount the snapshot
  3. Use the archiving-tool to stream the filesystem data to standard output
  4. Pipe the stream to S3's cp tool, telling it to read from standard input and to write to an object-name in S3
  5. Unmount the snapshot
  6. Delete the snapshot
  7. Validate the backup by using S3's cp tool, telling it to write to standard output, then read that stream back in using the original archiving tool's capability to read from standard input. If the archiving tool has a "test" mode, use that; if it does not, it is likely possible to specify /dev/null as its output destination.
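The following is a minimal, manual sketch of the above for a single logical volume. The volume-group, logical-volume, mountpoint and bucket names are placeholders, and the snapshot sizing is arbitrary:

    #!/bin/bash
    set -euo pipefail

    VG="VolGroup00"               # placeholder volume-group name
    LV="homeVol"                  # placeholder logical-volume name
    SNAP="${LV}_snap"
    MNT="/mnt/${SNAP}"
    BUCKET="my-backup-bucket"     # placeholder S3 bucket

    # 1. Snapshot the volume (LVM flushes pending I/O before the snap is cut).
    lvcreate --snapshot --name "${SNAP}" --extents 10%ORIGIN "/dev/${VG}/${LV}"

    # 2. Mount the snapshot read-only (XFS snapshots may also need "nouuid").
    mkdir -p "${MNT}"
    mount -o ro "/dev/${VG}/${SNAP}" "${MNT}"

    # 3 & 4. Encapsulate with tar and stream the output straight into S3.
    tar -C "${MNT}" -cf - . | aws s3 cp - "s3://${BUCKET}/Backups/${LV}.tar"

    # 5 & 6. Unmount and delete the snapshot.
    umount "${MNT}"
    lvremove -f "/dev/${VG}/${SNAP}"

    # 7. Validate: stream the object back and let tar parse it to /dev/null.
    aws s3 cp "s3://${BUCKET}/Backups/${LV}.tar" - | tar -tf - > /dev/null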

For a basic, automated implementation of the above, see the linked-to tool. Note that this tool is "dumb": it assumes that all logical volumes hosting a filesystem should be backed up. The only argument it takes is the name of the S3 bucket to archive to. The script does only very basic "pre-flight" checking:

  • Ensure that the AWS CLI is found within the script's inherited PATH env.
  • Ensure that either an AWS IAM instance-role is attached to the instance or that IAM user credentials are defined in the script's execution environment (${HOME}/.aws/credentials files are not currently supported). No attempt is made to ensure the instance-role or IAM user has sufficient permissions to write to the selected S3 bucket.
  • Ensure that a bucket-name has been passed (the name is not checked for validity).
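A sketch of roughly-equivalent pre-flight checks (hypothetical, not the linked tool's actual code) might look like:

    # Bail out unless a bucket-name argument was supplied.
    BUCKET="${1:?Usage: $0 <bucket-name>}"

    # Bail out if the AWS CLI isn't in the inherited PATH.
    command -v aws > /dev/null 2>&1 || {
      echo "aws CLI not found in PATH" >&2 ; exit 1 ;
    }

    # Bail out if neither an instance-role nor environment credentials work.
    aws sts get-caller-identity > /dev/null 2>&1 || {
      echo "No usable AWS credentials found" >&2 ; exit 1 ;
    }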

Once the pre-flights pass, the script will attempt to snapshot all volumes hosting a filesystem; mount the snapshots under the /mnt hierarchy (recreating the original volumes' mount-locations, but rooted in /mnt); use the tar utility to encapsulate and stream the to-be-archived data; and use the aws s3 cp utility to write tar's streamed, encapsulated output to the named S3 bucket's "/Backups/" folder. Once aws s3 cp closes the stream without errors, the script will dismount and delete its previously-created snapshots.

Alternatives

As mentioned previously, it's possible to do similar things for filesystems that do not reside on LVM2 logical volumes. However, doing so will require either different methods for putting the backup-set into a consistent state or accepting that the backed-up data may be inconsistent (and possibly even wholly missing "in flight" data).

EBS has the native ability to create copy-on-write snapshots. However, the EBS volume's snapshot capability is generally decoupled from the OS's ability to "pause" a filesystem. One can use a tool (like those in the LxEBSbackups project) to coordinate the pausing of the filesystem so that the EBS snapshot captures a consistent copy of the data (and then unpause the filesystem as soon as the EBS snapshot has been started).
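The general coordination pattern looks something like the sketch below (the volume-ID and mountpoint are placeholders; the actual LxEBSbackups tooling is more involved):

    VOLUME_ID="vol-0123456789abcdef0"   # placeholder EBS volume ID
    MOUNTPOINT="/data"                  # placeholder filesystem to pause

    # Flush pending writes and block new I/O on the filesystem.
    fsfreeze --freeze "${MOUNTPOINT}"

    # The snapshot only needs to be *started* while the filesystem is frozen:
    # EBS fixes the point-in-time image once the API call is accepted.
    aws ec2 create-snapshot \
      --volume-id "${VOLUME_ID}" \
      --description "Consistent snapshot of ${MOUNTPOINT}"

    # Resume normal I/O.
    fsfreeze --unfreeze "${MOUNTPOINT}"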

One can leave the data "as is" in the EBS snapshot, or one can create a volume from the snapshot, mount it to an EC2 instance and execute a streamed archive operation to S3. The former has the value of being low effort. The latter has the benefit of storing the data in lower-priced tiers (even S3 Standard is cheaper than snapshots of EBS volumes) and allowing the backed-up data to be placed under S3 lifecycle policies.

