RJJ Software

Backups Backups Backups

Jamie (dotnetcoreblog) ・9 min read

The cover image for this post is by Markus Spiske over on Unsplash

A Little History

Like many people who came of age in the 1990s, I hadn't taken backups seriously until I got to college.

side note: by "College" I mean in the UK sense. For friends who went through the American style of education, this means non-compulsory education which usually starts at the age of 16 and lasts for 2 years

It was while at college that I suffered my first big data loss.

I'd been working on an app, written in Assembler for a Flight 68K. To run the application, we had to visit the labs and:

  • Boot a device into DOS
  • Start the IDE
  • Load the source code from either hard drive or floppy disk (I was at college in the early 2000s)
  • Build, deploy, and run the application

At some point between making changes on my Windows 2000 machine and loading it into the DOS-based IDE, the source code became corrupted. These days, I would know how to repair the source code

and not just by the use of git reset --hard origin/master or whatever your favourite way is for undoing any local changes

but back then I had no way of knowing how to fix the text files. I also wasn't using source control, because I hadn't been introduced to the topic yet - we were studying electronics.


This would happen again, several months later, with a short story that I was working on.

Those of you who have read my other articles will know that I'm not much of a writer, but there was a time when I wanted to try it out.

I had a short story, stored as a Word document, on a 64 MB

yes, MB

USB memory stick. That memory stick was attached to my keys and was quite bulky. Keys have a certain amount of weight, and USB ports are usually found on the front or back of a PC tower.

Over time, the connection between the USB plug and the board within the memory stick snapped, meaning that I had no (cheap) way of rescuing the data from it.


In both of these real-world examples, there was no real money lost. There was also no reputational damage. And there was definitely no-one holding me to ransom.

But that could happen to you if you're not taking backups.

3-2-1

image by Joshua Golde over on Unsplash

I am in no way a backup and restoration expert, but the most basic thing that you'll need to understand is the 3-2-1 rule. The tl;dr is:

at least 3 copies of the data, stored on 2 different types of storage media, and one copy should be kept offsite

If you are using git (or some other distributed source control system), you are almost doing this already. You have (at least) 2 copies of the source code, and one is offsite. I've said "at least" because there might be others in the team who also have the full git tree.
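To make the rule concrete, here is a minimal sketch of a 3-2-1 check in Python. The `BackupCopy` class, the media labels, and the example copies are all made up for illustration; the only thing taken from the rule itself is the three conditions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    """One copy of the data. Every name here is illustrative."""
    label: str
    media: str      # e.g. "ssd", "hdd", "tape", "cloud"
    offsite: bool


def meets_3_2_1(copies):
    """Return True when the copies satisfy the 3-2-1 rule."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.media for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite


# A git working copy plus a hosted remote: two copies, one offsite.
git_only = [
    BackupCopy("laptop clone", media="ssd", offsite=False),
    BackupCopy("hosted remote", media="cloud", offsite=True),
]

# Adding a third copy (here, a hypothetical local NAS) completes the rule.
with_nas = git_only + [BackupCopy("NAS mirror", media="hdd", offsite=False)]

print(meets_3_2_1(git_only))   # False: only two copies
print(meets_3_2_1(with_nas))   # True
```

This is why git alone gets you "almost" there: the offsite and media conditions are met, but the copy count is one short unless a teammate (or another machine) also holds the full tree.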

But what if you wanted to keep other things backed up?

Drafting a Backup Plan

image created by Med Badr Chemmaoui over on Unsplash

The first things to think about are:

What are you backing up?

Is it just source code? Usually, that is pretty small, compared to raw video footage, photos, or other types of data.

Is it financial data? You might need to look into certain regulations about long term storage for that.

How often does the data change?

If it's source code, this might be hundreds of times per day.

If it's raw video footage, this might be rarely. If you're dealing with raw video footage, you'll likely save edits to that as a separate file or project, depending on what you are creating.

How will you restore the data?

Is this going to be a single restoration process? i.e. restoring everything at once when some calamity happens, like your office flooding?

Are you going to be restoring individual parts, or files, within the backed up data? i.e. some kind of git revert action?

How will you test the restore action?

A lot of people forget this step. What use is a backup of all of your data, if you have no idea whether the restoration steps work?

This can be as simple as spinning up a new VM, pulling the most recent backup, and running through the restore steps. You'll be surprised at how many systems have restoration steps which don't work or aren't kept up to date.

Creating a Backup Plan

image by Alvaro Reyes over on Unsplash

Depending on the type of data that you are backing up, you might have different backup schedules.

Shortly after I graduated (University of Hull, 2008), I started working at a school. I wasn't working in the IT department, but I had a lot of communication with the folks in that team. Since the PCs that they had at the school were disposable, and all of the important data was stored on network shares, they didn't bother running backups on individual machines.

What they did back up was the collection of network shares. Each day, at around 11pm, a cron job would fire and back up all of the network shares to tape.

tape is more reliable than spinning rust

Each Saturday, those tapes would be taken offsite and copied. Once returned, they would be entered into a monthly rotation queue. Each month, the data on the tapes would be replaced. And every six months, the remote tapes would be rotated (and re-used).

This worked for them because the important data would change very slowly. And they accepted that, after an entire year, a backup would be destroyed.

They also had a separate network stack, at a third location. This network stack was isolated from the world

via network cards, anyway

and they would use it as a training and practice ground for restoring the data which had been backed up.

This might be too much for you, or might not be good enough. I decided to mention it because I thought that it was a good middle ground.

My Backup And Restore Plan

image by Plush Design Studio over on Unsplash

For those who don't know, I work on a bunch of podcasts. In fact here is a link to my podchaser profile.

I bring this up because I currently have over 200 GB of raw audio, produced and edited content, and related files combined. I don't know whether you've tried to keep a rotating backup of 200 GB of data, but it requires some forethought.

I had decided early on that, once the data was backed up, I would hardly ever need to restore that data. In fact, it's a point that was reinforced in episode 19 of The .NET Core Podcast, when Richard Campbell told me:

Don't throw anything away. Your legacy is important too. And I think that listeners appreciate that you're evolving.


there was a time when I needed to revisit an episode, but that's a story for another day

And so I would need to back up:

  • raw audio
  • audio projects representing the rendered audio
  • cover art, in super high quality (we're talking 4000×4000 pixel PNGs and the raw GIMP project files)

The raw audio is created several times a week and can range from 60 to 240 minutes per show (remember: I work on several shows). So that is a huge amount of data: each person on the show has their own audio channel, so a 120-minute recording with 4 people actually maps to 480 minutes of raw, uncompressed audio.

for those who don't know, wav audio usually runs into about a GB per hour of audio. You can reduce the space required by using FLAC, but that makes the restoration process a little more complex
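The arithmetic above is easy to script if you want to estimate how quickly your own storage will fill up. This is only a rough sketch: the one-GB-per-hour figure is the rule of thumb mentioned above (real WAV sizes depend on sample rate, bit depth, and channel count), and the function names are made up for illustration.

```python
GB_PER_HOUR = 1.0  # rule of thumb from the article; real rates vary


def raw_audio_minutes(recording_minutes, speakers):
    """Each speaker is recorded on their own channel."""
    return recording_minutes * speakers


def estimated_size_gb(recording_minutes, speakers, gb_per_hour=GB_PER_HOUR):
    """Rough storage needed for the raw audio of one recording session."""
    hours = raw_audio_minutes(recording_minutes, speakers) / 60
    return hours * gb_per_hour


print(raw_audio_minutes(120, 4))   # 480
print(estimated_size_gb(120, 4))   # 8.0
```

So a single 120-minute recording with 4 people comes to roughly 8 GB of raw audio under that assumption, before any project files or rendered output are counted.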

As many will have noticed, I use Pop!_OS which is a distribution of Linux

I also use a MacBook Air for editing while travelling. What's great about Unix-like operating systems is that they usually ship with a CLI app called rsync. Rsync does a lot of things, but the basic process is:

  • an SSH connection is created
  • files are copied over to the remote machine
  • validation is performed to ensure that the files made it to the remote machine

I also have several local Synology DS218s. Each of these devices has two drives installed, configured as RAID 1, which means both drives are exact copies of each other. These devices also run a Linux distribution, so rsync is available there, too.

The reason that I bring up rsync is that I use it to copy these large files across my network. So when I want to create a backup on one of the Synology devices, I run something like:

rsync -avP -e "ssh -i /path/to/ssh/keys -p <port-number>" . <user>@<IP>::/backup/path

This command tells my machine to set up a secure shell (ssh) connection to the remote device found at IP, as a specific user (user), on the supplied port number (port-number)

I'd always recommend changing the default SSH port from 22, as an easy win for SSH security

it also uses an SSH key (/path/to/ssh/keys) to authenticate with the remote device. Once the connection is created, rsync takes over and sends all of the files in the current directory (.) using the following setup:

  • recursive
  • keeping symlinks
  • preserving
    • permissions
    • timestamps
    • groups
    • owners
  • preserving device and special files

and that's just what -a does

in a verbose way (-v), so that I can see if something goes wrong, while showing progress and keeping any partially transferred files so that an interrupted copy can be resumed (-P).

Because each of my devices (except for my Windows laptop) has rsync built-in, I can back up or restore files in either direction.
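If you would rather script this than retype it, here is a hedged sketch of how the same flags could be assembled in Python for both directions. The host, user, key path, and port are hypothetical placeholders, and the remote is written in rsync's plain `user@host:path` remote-shell form, which is an assumption; adjust it to match however your own NAS expects to be addressed.

```python
import shlex


def rsync_command(source, destination, key_path, port):
    """Build an rsync-over-SSH argument list using the flags described above."""
    ssh = f"ssh -i {shlex.quote(key_path)} -p {int(port)}"
    return ["rsync", "-avP", "-e", ssh, source, destination]


# Hypothetical values, for illustration only.
REMOTE = "backup@nas.example:/volume1/podcasts/"
KEY = "/home/me/.ssh/nas_key"
PORT = 2222

backup = rsync_command(".", REMOTE, KEY, PORT)    # local machine -> NAS
restore = rsync_command(REMOTE, ".", KEY, PORT)   # NAS -> local machine

# Print the backup command in a form that can be pasted into a shell.
print(shlex.join(backup))
```

Swapping the source and destination is all it takes to turn a backup into a restore. To actually run one of these, you could pass the list to `subprocess.run(backup, check=True)`; adding rsync's `--dry-run` flag first is a cheap way to see what would be copied.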

On top of that, I have a one-way sync to an offsite cloud backup provider. Each week, my "master" or main NAS performs a one-way sync to my offsite provider: meaning that it will never pull data down from the cloud, but it can write to the cloud.

I can get the data back from the cloud, but I only do this every few months: it is Glacier-like storage, so there is a big cost involved in pulling the data back down.

writes are super cheap, but reads can be expensive

But when I do test a restore, I:

  • take down two of my NAS devices
  • clone the cloud version of the files to one of them
  • recursively verify that both of the NAS devices have the same data

If the data doesn't match, then I know which part of the backup process has failed and can pin-point where the weak step in the chain is.
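The "recursively verify" step can be done in many ways. As one possible sketch (not necessarily how I do it), this Python snippet hashes every file under two directory trees and reports anything missing or different. The function names are made up for illustration, and it assumes both copies are reachable as local paths, for example as mounted network shares.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1024 * 1024):
    """SHA-256 of a file, read in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tree_digests(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_trees(left, right):
    """Return files missing from either side and files whose contents differ."""
    a, b = tree_digests(left), tree_digests(right)
    return {
        "only_left": sorted(a.keys() - b.keys()),
        "only_right": sorted(b.keys() - a.keys()),
        "different": sorted(k for k in a.keys() & b.keys() if a[k] != b[k]),
    }
```

Calling `compare_trees` with the two backup roots returns three lists; if all three are empty, the restored copy matches. Hashing 200 GB takes a while, so this is something to leave running rather than wait on.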


This backup plan isn't for everyone, and you might think that it's a little over the top. But I'd rather it was over the top than suffer some massive data outage.

Conclusion

image by João Silas over on Unsplash

Whether you simply have a couple of USB drives in rotation or a wildly over-complicated setup like mine

remember, it's just for podcast audio, not company data

you really should be looking into having:

at least 3 copies of the data, stored on 2 different types of storage media, and one copy should be kept offsite

But you should also be checking that your backups can be restored, too. Because what's the point of backing things up, if you're not checking that it can be restored?

What's Your Backup and Restore Plan?

Is it as complicated as mine? Is it more complicated? Do you think that mine is too complicated for the type of data that I'm backing up? Should my backup process be simpler?

Let's swap suggestions in the comments.

Discussion (3)

Phil Ashby

Thanks for the reminder Jamie :)

My personal backup and restore testing regime uses rsync and a NAS:

  • everything that matters (family rule) goes on the NAS
  • twice daily: 'rsync --link-dest' copies, from NAS to local USB disk, holding 5 previous syncs to deal with the most common failure - human error :)
  • overnight 'rsync' to offsite archive (VM in Azure) for point in time recovery.
  • restore testing: regular file read runs on offsite (checksum generation)
Joe Zack

So much of the data I care about any more is scattered throughout the clouds. I don't directly back up things like my email anymore because it's difficult to even keep an up-to-date inventory, let alone back it up.

Scary!

*googles how best to back up emails*

Jamie Author

Oh most definitely. If it's a webmail service, I let them worry about backing up my emails. The way I see it: if google mail goes offline, then I have slightly bigger things to worry about.