Recover from an accidental disk crash

I had a very bad habit of testing the speed of my disks with dd, very simply:
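A read-only variant of such a speed test can be sketched in Python. Because the device is opened in read mode, a swapped argument cannot overwrite anything. This is only a sketch under assumptions: the function name `read_throughput` and the device path in the usage note are hypothetical, and the measurement is rough (the page cache can inflate the result).

```python
import time


def read_throughput(path, max_bytes=1024 ** 3, chunk_size=1024 * 1024):
    """Read up to max_bytes from path and return (bytes_read, seconds).

    The file is opened with mode "rb", so this function cannot write to it.
    """
    total = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as source:
        while total < max_bytes:
            data = source.read(min(chunk_size, max_bytes - total))
            if not data:  # end of file or device
                break
            total += len(data)
    return total, time.perf_counter() - start
```

For example, `read_throughput("/dev/md0")` would read the first gigabyte of a hypothetical array device, given read permission on it.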

But after too little sleep, I accidentally swapped the arguments and wrote to my disk, by passing it as the ‘of’ argument instead of ‘if’. My disks are in RAID5, to have redundancy and tolerate one failure. Had it been a physical disk, that would have been OK: I would just have had to resync the array. But that would have been too easy: the mistake was made on the array device itself, so the RAID could not recover it. And the first gigabyte of the disk holds critical data… The repair was quite difficult: it took me one day to recover minimally and bring the service back (step 1), but six months to recover fully (step 2). As it could be useful, the main parts are below.

Step 1: recover the disk

With the first gigabyte of data overwritten, the ext4 filesystem was broken, and fsck was unable to recover it because the primary superblock was gone. The solution is to use a backup superblock. To find where the backup superblocks are located, the easiest way is to run mke2fs in test mode. DO NOT FORGET the -n flag… Without it, your disk would be overwritten a second time, and the backup superblocks would be destroyed…
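The output of mke2fs -n remains the reference. As a cross-check that touches no disk at all, the usual backup positions can also be computed. The sketch below assumes the common ext4 defaults: the sparse_super feature (backups in group 1 and in groups that are powers of 3, 5 and 7), 8 × block size blocks per group, and no sparse_super2. The function name is hypothetical, and a filesystem created with other options will have different positions.

```python
def backup_superblock_blocks(total_blocks, block_size=4096):
    """Return the block numbers of backup superblocks for a sparse_super layout."""
    blocks_per_group = 8 * block_size  # default: one bitmap block describes one group
    first_data_block = 1 if block_size == 1024 else 0
    # Number of block groups (ceiling division).
    group_count = -(-(total_blocks - first_data_block) // blocks_per_group)

    groups = {1}
    for base in (3, 5, 7):
        power = base
        while power < group_count:
            groups.add(power)
            power *= base
    return [first_data_block + group * blocks_per_group
            for group in sorted(groups) if group < group_count]
```

With 4 KiB blocks the first candidates are 32768, 98304, 163840 and 229376, which can be compared with what mke2fs -n prints.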

Once you have the superblock positions, you can repair the filesystem:
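A minimal sketch of such a repair, driven from Python. It relies on the e2fsck options -b (backup superblock), -B (block size), -f (force the check) and -y (answer yes to everything). The device name used in the usage note is hypothetical, and running the command needs root and modifies the device.

```python
import subprocess


def e2fsck_command(device, superblock, block_size=4096, assume_yes=False):
    """Build the e2fsck argument list for a repair from a backup superblock."""
    cmd = ["e2fsck", "-f", "-b", str(superblock), "-B", str(block_size)]
    if assume_yes:
        cmd.append("-y")  # answer "yes" to every question
    cmd.append(device)
    return cmd


def run_e2fsck(device, superblock, **options):
    """Run e2fsck and return its exit status (requires root, modifies the device)."""
    return subprocess.run(e2fsck_command(device, superblock, **options)).returncode
```

For example, `e2fsck_command("/dev/md0", 32768)` builds the command for the first backup superblock of a filesystem with 4 KiB blocks.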

It will rebuild the primary superblock, and find a bunch of errors because of the garbage in the first gigabyte… Unless you are a supernatural drive hacker, you will have no choice but to accept all the changes proposed by fsck, while hoping it has not picked the wrong fix… In my case it worked pretty well.

You will need to rerun fsck several times.

You may also find some recurring errors that fsck does not manage to fix. The only way I found to fix them is to use debugfs and remove the offending entries.

Do not forget to rerun fsck several times, until no more errors are found or fixed.
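The repeated runs can be sketched as a loop that stops when e2fsck exits with status 0 (no errors). The exit status is a bit mask: 1 and 2 mean errors were corrected, 4 means errors were left uncorrected, and 8, 16, 32 and 128 mean the run itself failed. The function name and the `run` parameter (which allows replacing the real call) are hypothetical, and this is a sketch rather than a hardened tool.

```python
import subprocess


def fsck_until_clean(device, max_passes=10, run=None):
    """Run `e2fsck -f -y` repeatedly until it exits with status 0.

    Returns the list of exit statuses, one per pass.
    """
    if run is None:
        # Default runner: really invokes e2fsck (requires root).
        run = lambda cmd: subprocess.run(cmd).returncode
    statuses = []
    for _ in range(max_passes):
        status = run(["e2fsck", "-f", "-y", device])
        statuses.append(status)
        if status == 0:  # nothing left to find or fix
            break
        if status & (8 | 16 | 32 | 128):  # operational, usage, cancel or library error
            raise RuntimeError(f"e2fsck failed with exit status {status}")
    return statuses
```

If the last status in the returned list is not 0, some errors survived all passes and need manual attention, for instance with debugfs.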

After all these steps, you should have a working ext filesystem. Because the first gigabyte was overwritten, the root directory (/) has most likely been lost, and you may find an apparently empty filesystem. But do not panic, your files are in lost+found! The filenames will most likely be lost (all entries are renamed #xxxxx), but it should be easy to rename them from their contents and move them back to the root folder (as they are mostly folders, this is easier).
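To help with the renaming, the entries of lost+found can be listed together with a hint about their contents. This is a minimal sketch with a hypothetical function name: for a folder the hint is a few child names, for a file it is the first bytes.

```python
import os


def describe_lost_found(lost_found_dir, preview=5):
    """Yield (name, kind, hint) for each entry recovered into lost+found.

    For a directory the hint is a few of its child names; for a file it is
    the first bytes of its content. Both help to guess the original name.
    """
    for name in sorted(os.listdir(lost_found_dir)):
        path = os.path.join(lost_found_dir, name)
        if os.path.isdir(path):
            yield name, "dir", sorted(os.listdir(path))[:preview]
        elif os.path.isfile(path):
            with open(path, "rb") as handle:
                yield name, "file", handle.read(16)
        else:
            yield name, "other", None
```

Once an entry is identified, `os.rename` can move it back under its original name at the root of the filesystem.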

Phew! You now have a working folder again. You may take a snapshot of it, and you can use it normally.

But you may have lost or altered some files (one gigabyte was overwritten, which may affect several thousand files…). Here comes the second part.

Step 2: recover files

As I am a bit paranoid about my files, I had several safety nets:

  1. A simple file list backup (weekly): to be able to detect file loss
  2. An md5 hash backup of all my files (weekly): to be able to detect file corruption (viruses,…)
  3. A Crashplan offsite backup (immediate or daily): offsite backup, in case anything goes wrong…
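The first two safety nets can be sketched with one small script: the keys of the manifest are the file list, and the values are the md5 hashes. The function names are hypothetical, and the output layout simply mimics the "hash, two spaces, path" lines of md5sum.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hexadecimal md5 of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Return {relative_path: md5_hex} for every regular file under root."""
    manifest = {}
    for folder, _subdirs, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            if os.path.isfile(path) and not os.path.islink(path):
                manifest[os.path.relpath(path, root)] = md5_of_file(path)
    return manifest


def write_manifest(manifest, out_path):
    """Write the manifest as 'hash  path' lines, sorted by path."""
    with open(out_path, "w", encoding="utf-8") as out:
        for rel_path in sorted(manifest):
            out.write(f"{manifest[rel_path]}  {rel_path}\n")
```

Run weekly and stored away from the disk it describes, such a manifest is what makes the later comparison possible.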

The second one was very useful to assess the damage, but it obviously needs to be rebuilt completely on the recovered disk, which takes quite long. The file list can therefore provide a rough result more quickly. By diffing the md5 hashes, you get the list of files which may be corrupted, and thus should be restored from Crashplan.
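The comparison itself can be sketched as follows, assuming both hash lists use md5sum-style lines ("hash, two spaces, path"). The function names are hypothetical. Paths missing from the new list are lost files; paths whose hash changed are candidates for a restore.

```python
def parse_manifest(lines):
    """Parse md5sum-style lines ('hash  path') into {path: hash}."""
    entries = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        digest, _sep, path = line.partition("  ")
        entries[path] = digest
    return entries


def diff_manifests(before, after):
    """Compare two {path: hash} mappings and return (missing, changed) path lists."""
    missing = sorted(p for p in before if p not in after)
    changed = sorted(p for p in before if p in after and before[p] != after[p])
    return missing, changed
```

Only the `missing` list needs the file names, so it can already be produced from the plain file list while the hashes are still being recomputed.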

Crashplan has a powerful GUI, but you must select each file to recover manually, which can be very tedious. If you have sufficient space and bandwidth, you can obviously restore all files, but that was not my case. Moreover, the restore speed can be quite slow; I didn’t work out why. Crashplan has an incredible API which would have been great for restoring files, but sadly it is reserved for the enterprise version. So I tried to automate this task through the GUI. I tested several automation frameworks on Linux and Windows, and the least problematic one was PyWinAuto on Windows. Performance is not good, but it proved to be quite reliable.

Here is the script (prerequisite: PyWinAuto)

And then a script to restore files (check md5 and restore file attributes)
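A minimal sketch of such a restore step: the file coming back from the backup is installed only if its md5 matches the expected hash, then its permissions and modification time are put back. The function name and its parameters are hypothetical, and the sketch assumes the expected hash, mode and mtime were saved somewhere before the accident.

```python
import hashlib
import os
import shutil


def restore_file(restored_path, target_path, expected_md5, mtime=None, mode=None):
    """Copy restored_path to target_path only if its md5 matches expected_md5.

    Returns True when the file was installed, False when the hash differs.
    """
    digest = hashlib.md5()
    with open(restored_path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    if digest.hexdigest() != expected_md5:
        return False  # corrupted restore: leave the target untouched

    os.makedirs(os.path.dirname(target_path) or ".", exist_ok=True)
    shutil.copyfile(restored_path, target_path)
    if mode is not None:
        os.chmod(target_path, mode)
    if mtime is not None:
        os.utime(target_path, (mtime, mtime))  # (access time, modification time)
    return True
```

A False result means the restored copy itself is corrupted, so the file should be requested again or fetched from another source.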

By the way, a simple script to merge folders:
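A minimal sketch of such a merge, with a hypothetical function name: every file of the source tree is moved to the same relative place in the destination, and a file that already exists there is never overwritten but reported instead.

```python
import os
import shutil


def merge_folders(source, destination):
    """Move everything from source into destination without overwriting.

    Returns the relative paths that were skipped because they already exist.
    """
    skipped = []
    for folder, _subdirs, files in os.walk(source):
        rel_folder = os.path.relpath(folder, source)
        target_folder = os.path.normpath(os.path.join(destination, rel_folder))
        os.makedirs(target_folder, exist_ok=True)
        for name in files:
            target = os.path.join(target_folder, name)
            if os.path.exists(target):
                skipped.append(os.path.normpath(os.path.join(rel_folder, name)))
            else:
                shutil.move(os.path.join(folder, name), target)
    return sorted(skipped)
```

The skipped files stay in the source folder, so they can be compared by hand with the version already in place.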

And yeah, you have now restored your files!! (it took me six months…)

Conclusions

  • Prevention is always best: you need to set up a few things to be prepared
  • The three levels are useful:
    • Detect the files you lost (file list)
    • Detect file corruption (md5 hash)
    • Back up contents (offsite)
  • Crashplan does not back up files reliably: approximately 5% of my files were corrupted
