The trouble with Time Machine

9 05 2012

Every now and again Time Machine will spit out a “Can’t perform backup, you must re-create your backup from scratch” or “Can’t attach backup”. For anyone who was relying on its roll-back-in-time feature this is a reasonably depressing message, and it typifies modern operating systems, especially those of the closed-source variety. At some point, having spent all the budget on pretty user interfaces and catered for all the use cases, the deadline-driven environment decides, “Aw, stuff it, we will just pop up a catch-all ‘you’re stuffed, mate’ dialog box”. 99% of users rant and rave, then delete their backup and start again with a sense of injustice. If you’re reading this and have little or no technical knowledge, that’s what you should do now.

If you get down to the bare nuts and bolts you will find that a Time Machine backup is not that dissimilar to a BackupPC backup of 10 years ago. It makes extensive use of hard links to snapshot the state of the disk. It does this in folders with thousands of files, creating a uniformly distributed tree. That all works fine except when it doesn’t. Anyone who has used hard links in anger on a file system will know it tends to put the file system under a lot of stress, resulting in more filesystem corruption than normal. File systems are not that transactional, so if an operation fails part way through, the hard links may start to generate orphaned links.
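For anyone who hasn’t met hard links before, here is a minimal sketch of the snapshot idea in a scratch directory. It is a generic illustration only: the directory and file names are made up, and it makes no claim to match the real on-disk layout Time Machine uses.

```shell
# Scratch area for the demo (hypothetical names throughout).
work=$(mktemp -d "${TMPDIR:-/tmp}/hardlink-demo.XXXXXX")
mkdir "$work/snap1" "$work/snap2"

# First "snapshot" holds the real file.
echo "holiday photo" > "$work/snap1/photo.jpg"

# Second "snapshot": the file is unchanged, so link it instead of copying.
ln "$work/snap1/photo.jpg" "$work/snap2/photo.jpg"

# The second column of ls -l is the link count: two names, one set of data.
links=$(ls -l "$work/snap1/photo.jpg" | awk '{print $2}')
echo "link count: $links"

# Deleting the older snapshot's name leaves the data reachable via the newer one.
rm "$work/snap1/photo.jpg"
survivor=$(cat "$work/snap2/photo.jpg")
echo "still readable: $survivor"

rm -rf "$work"
```

Every unchanged file in every snapshot is another name pointing at the same data, so the file system ends up tracking a huge number of link counts. If one of those updates is interrupted, the count and the names no longer agree, which is the sort of broken chain a file system check complains about.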

Now Time Machine runs fsck_hfs when it attaches a sparse bundle file system, which is what the Time Machine backup is. Unfortunately it doesn’t try that hard to fix any problems it finds, and it couldn’t possibly corrupt its pretty UI by telling the user that there might be a problem with their cherished backup of life’s memories. Not good for marketing, losing your loyal customers’ photos when you promised them it wouldn’t happen. Fortunately, those messages are logged in /var/log/fsck_hfs.log. If you use Time Machine and are finding the attach stage takes forever, take a look in there for the words “FILESYSTEM DIRTY”. That indicates that the last time Time Machine tried to attach the drive, the file system check was unable to check the file system and correct any errors, and so it marked it DIRTY. It is possible to correct one of these filesystems; however, with all those hard links the likelihood is that, even if fsck_hfs -dryf /dev/rdiskXs2 does correct the errors and put it into a FILESYSTEM CLEAN state, it won’t be a usable and valid backup. When your laptop exits your house with a man wearing a stripy jumper and tights over his head, your children (and you) will cry on realising that the backup in the cupboard is corrupt.

What advice can I give you?

  1. Check your backups regularly
  2. If you use Time Machine, open the “Console” program, type DIRTY into the search box and, if you find that word, go out and buy another backup disk… quick.
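If you would rather check from a terminal than from Console, something like the following does the same search. It is only a sketch: count_dirty is a made-up helper name, and it assumes the log lives at the /var/log/fsck_hfs.log path mentioned above and that you are allowed to read it.

```shell
# Count the lines mentioning DIRTY in an fsck_hfs log.
# Usage: count_dirty [logfile]   (defaults to the path given in this post)
count_dirty() {
    log="${1:-/var/log/fsck_hfs.log}"
    if [ ! -r "$log" ]; then
        echo "cannot read $log" >&2
        return 2
    fi
    # grep -c prints 0 and exits non-zero when nothing matches;
    # "|| true" keeps the function's exit status clean in that case.
    grep -c "DIRTY" "$log" || true
}
```

Run count_dirty with no arguments to check the default log. Anything other than 0 means it is time for advice number 2.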

For those who want to try to recover a Time Machine backup:

chflags -R nouchg /Volumes/My\ Time\ Capsule/mylaptop.sparsebundle
hdiutil attach -nomount -noverify -verbose -noautofsck /Volumes/My\ Time\ Capsule/mylaptop.sparsebundle
tail -f /var/log/fsck_hfs.log
# If you see "The Volume could not be repaired"
# then you need to run
fsck_hfs -dryf /dev/rdiskXs2
# where X is the disk number listed when you ran hdiutil attach.
# I can almost guarantee that the disk will not be recoverable and you will see tens of thousands
# of broken hard link chains. Fixing those will probably corrupt the backup,
# which is why this is futile.

If you are using a Time Capsule, power cycle it first, connect your machine to it over 1000BASE-T and make sure no other machines are accessing it. Don’t use Wi-Fi unless you want to grow old and die before the process completes.

 

Update

Perhaps I am being a little unfair here. The same unreliability could happen with any backup mechanism that is vulnerable to corrupted backups as a result of the user shutting the lid, the computer going to sleep, or a power failure. The weakness of Time Machine with a Time Capsule is that it’s all too easy to disconnect the network hard disk image, and once you do that the Time Capsule end has no way of shutting down the backup process safely. Do that enough times (I have found once is enough) and the backup is corrupt and unrecoverable, and even the HFS+ journal can’t recover it.

I was also a bit unfair on BackupPC, which is initiated from the server and so, although it may create nightmare file systems, can leave the backup image in a reasonable state when the server loses sight of the client.

Time Machine on an attached drive appears more reliable, but a lot less useful.

One response

21 05 2012
Ian

Shortly after this post the reason for the Time Capsule dropping its network connection and continually corrupting the sparsebundle filesystem became apparent. The internal power supply failed. I suspect it was producing a less well regulated supply than it should have been. After taking the Time Capsule apart, I found that the power supply has capacitors with bulging ends. (Warning: there are no safety bleed resistors in this power supply, so do not touch anything metal until you have discharged the capacitors with an insulated, high-resistance load, e.g. a voltmeter on its 1000 V range. If you don’t, the switch-mode power supply will kick into life and you will get a shock almost as bad as if the power supply was connected to the mains. If you don’t know how dangerous they can be, don’t even think of opening one up.) On ordering new capacitors from Element14 I noticed that the ones that failed have an expected lifetime of 1000 hours at 105°C. Running 24 hours a day, that’s about 41 days. Normally a power supply would not run at 105°C, and the lifetime of the caps is going to be at least an order of magnitude longer at 60°C. However, the Time Capsule is designed to be silent and sleek, so ventilation is poor, especially through the power supply, and I would not be surprised if it was running at that temperature a lot of the time, especially when the disk is running hard. I am replacing them with marginally more expensive 6000-hour lifetime caps (78c each), and will attempt to improve cooling.

Makes me think capacitor lifetime is a good way to build obsolescence into a consumer product, or am I being cynical?
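
The lifetime figures above can be sanity-checked with a little arithmetic. This sketch assumes the common rule of thumb that an electrolytic capacitor’s life roughly doubles for every 10°C it runs below its rated temperature. That rule is not from the datasheet of these particular parts, and the helper names are made up for illustration.

```shell
# Estimated life in hours under the "doubles every 10 degrees C cooler" rule of thumb.
# $1 = rated life in hours, $2 = rated temperature (C), $3 = operating temperature (C)
cap_life_hours() {
    awk -v h="$1" -v tr="$2" -v t="$3" \
        'BEGIN { printf "%.0f\n", h * 2 ^ ((tr - t) / 10) }'
}

# Convert hours of continuous (24 hours a day) running into days.
cap_days() {
    awk -v h="$1" 'BEGIN { printf "%.1f\n", h / 24 }'
}
```

For example, cap_days 1000 gives the roughly 41 days quoted above, and cap_life_hours 1000 105 60 comes out at a little over 22,000 hours, which fits the “at least an order of magnitude” estimate for a supply running at 60°C.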



