Friday, October 30, 2009

Backup

Luminous Landscape has an interesting new article on backing up. It is written by Geoff Baehr and he clearly knows what he's talking about. Problem is, he's explained the steps but missed the concept.

A good computer system requires two separate and distinct things, reliability and backup, and I don't think that Geoff made it clear enough which is which.

He talks of redundant arrays using RAID to make sure that data is reliable, but at the same time refers to this as backup. It should not be both.

Reliability is about what happens when your next hard drive fails: how difficult is it going to be, and how long is it going to take, to get you back up and running again? Using RAID 5 or 6, you have the capability to survive a hard drive failure - but of course usually these drives are your data storage drives, not the boot drive which contains your operating system and usually your applications.

Backup has nothing to do with reliability other than that when reliability fails you, that's when you need a backup of your data.

Backups aren't about failing hard drives and redundancy and RAID 6 etc. You don't need any of this because you back up to more than one place, at least one of which is then disconnected from your computer and from any electrical or other connections, so that it isn't subject to power spikes and lightning bolts. If one of the backups is off site, then you are even protected against theft and fire and flood, and it doesn't get much better than that.

What it comes down to is how much time you are willing to spend on doing your backups and how long you are willing to wait to restore your system (which often translates into how much money you are going to lose and how many clients you are going to piss off if you can't access your images).

Geoff mentions using Amazon S3, which looks potentially interesting, though at the usual upload speeds of a cable modem, the initial backup could take weeks. He discusses backing up to a hard drive then mailing the hard drive to Amazon, but frankly, if I have the data on one or more hard drives, why do I need to then mail it - wouldn't it be easier to just store the hard drives at mum's house?
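
To put rough numbers on "could take weeks", here is a back-of-the-envelope sketch in Python. The 500 GB library size and the 1 Mbit/s upload speed are assumptions picked for illustration, not figures from Geoff's article, and the sum ignores protocol overhead and interruptions.

```python
def upload_days(data_gb: float, upload_mbps: float) -> float:
    """Days needed to upload data_gb (decimal gigabytes) at upload_mbps (megabits/s)."""
    bits = data_gb * 8 * 1000**3                 # gigabytes -> bits
    seconds = bits / (upload_mbps * 1_000_000)   # megabits/s -> bits/s
    return seconds / 86_400                      # seconds -> days


# Hypothetical example: 500 GB of images over a 1 Mbit/s uplink.
print(f"{upload_days(500, 1.0):.1f} days")  # prints "46.3 days"
```

Plug in your own library size and measured upload speed to see whether the initial upload is a matter of days or months.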

I use a Drobo for backup, though Geoff does have a good point about a proprietary storage system which might some day be hard to fix if the company goes away. Still, there are enough people using Drobos that I suspect it's going to be accessible for a long time to come. In theory, a Drobo or a RAID system is a mix of both reliability and backup, in that the Drobo can regenerate the data from any failed drive and maybe even two, which arguably isn't strictly necessary in a backup. I don't believe I can regenerate a boot disk from the Drobo, because what's sending data to it is Time Machine, which doesn't back up absolutely everything.

My next move will be to buy an extra external drive and make a copy of the boot drive via SuperDuper. Mind you, I'm still using a G5 so will have to upgrade that first - just as soon as I have some money.

3 comments:

Omar said...

You are correct in separating backup from reliability. However, I think you miss the point with the Amazon backup. I happen to use Jungle Disk to upload to Amazon S3. It's the automatic part that is really appealing to me. I know that every day, all files I create or change will be safely stored off site.

The part about sending the disk is just to shorten the initial upload. I sent 30 GB over several days, but now my daily uploads typically take minutes. If I bring a bunch of pics home, it can take several hours. But again, it's about not even thinking about it. It's just done.

Now, I still like the idea of a backup copy within my home network so I'll never have to download that 30 GB.

The Amazon system isn't cheap, but then I've got data from four computers backed up for 10 to 12 bucks per month.

George Barr said...

Omar:

Thanks for the information - looks like I might just have to explore this Amazon and Jungle Disk thing further.

George

Tommy Williams said...

There's another aspect to reliability beyond the hardware out-and-out failing, and that's the integrity of the data in all the files. There's a step that almost no one ever talks about: making sure that all the bits that make up all your photo files are still good. The drive itself will work, the files will show up in a directory listing, etc., but due to a localized hardware failure, critical bits of data within the files have gone bad.

You typically deal with this by using a hashing mechanism--MD5 or SHA1 or what have you--that can create a "signature" of each file that is much smaller than the file itself but that will give you different results if the data in the file has changed.
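
As a minimal sketch of that idea, here is one way to compute such a signature in Python with the standard `hashlib` module. The function name `file_digest` and the 1 MB chunk size are illustrative choices, not anything prescribed above; reading in chunks just keeps large image files out of memory.

```python
import hashlib


def file_digest(path, algorithm="sha1", chunk_size=1024 * 1024):
    """Return the hex digest of a file, read in chunks to limit memory use."""
    h = hashlib.new(algorithm)  # e.g. "md5" or "sha1"
    with open(path, "rb") as f:
        # iter() keeps calling f.read until it returns b"" at end of file
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

If even a single byte of the file changes, the returned digest changes, so comparing today's digest against one recorded earlier tells you whether the file is still intact.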

But then you need multiple redundant copies of the MD5 hashes and you need some way to easily update those as you change files. This is another reason that I don't like DNG and prefer sidecar XMP files for metadata: my RAW files never change as I manipulate metadata or do different development treatments in Lightroom--only the sidecar XMP files change and these are tiny. And, to be honest, it's not the end of the world if one of those gets lost.
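
One hedged way to handle the bookkeeping is a manifest: a mapping from each file's relative path to its hash, written out to more than one location and compared against a fresh scan later. The sketch below assumes SHA1 and JSON files; the function names and the manifest layout are hypothetical, not a standard tool.

```python
import hashlib
import json
from pathlib import Path


def sha1_of(path, chunk_size=1024 * 1024):
    """SHA1 hex digest of one file, read in chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA1 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha1_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def save_manifest(manifest, destinations):
    """Write identical copies of the manifest to several paths.

    Keep these outside the scanned folder so they do not list themselves.
    """
    text = json.dumps(manifest, indent=2, sort_keys=True)
    for dest in destinations:
        Path(dest).write_text(text)


def compare_manifests(old, new):
    """Return (changed, missing, added) lists of relative paths."""
    changed = sorted(k for k in old if k in new and old[k] != new[k])
    missing = sorted(k for k in old if k not in new)
    added = sorted(k for k in new if k not in old)
    return changed, missing, added
```

For RAW files that are never supposed to change, anything reported as changed points to corruption; for files you edit on purpose, a changed hash may just be an edit, so the manifest needs regenerating after each editing session.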

There's still a hassle with developed files, like TIF files or layered PSDs or whatever, that you might go back and edit repeatedly, but I have an order of magnitude--two orders of magnitude, in fact--fewer of those than I have of RAW files so it's still manageable.