My storage array isn’t as robust as one might desire. It uses four two-bay SATA-to-USB docks, all hanging off a laptop. The laptop bit me by developing a stuck memory bit at around the 300 MB mark. The bit always read zero and corrupted many a directory. Fortunately I didn’t lose a lot of files, and lost none that I wanted to keep. I replaced it with another laptop and the resolve to run regular memory tests on the thing.
Now I’ve got another problem. I decided to move from 6 drives to 8 (they’re all 2 TB), and thus had to reshape the array. This requires removing the write-intent bitmap. The bitmap is valuable because if the array kicks out a drive that is actually fine, the drive can be re-added and only the regions written in the meantime need to be resynced. Without the bitmap, a full resync/rebuild is required, which on my array takes days.
The reshape was supposed to take about 9 days. Power failures, my nemesis, usually aren’t a huge problem because the reshape picks up where it left off. However, after about 5 days of reshaping, sure enough, even though it is winter, when power failures are rare, the power went out for about 60 seconds.
I should have been more careful, but on finding the array in a failed state, I rebooted the system. Unfortunately one of the docks didn’t come up with the rest. Because I run RAID 6 and all but 2 drives were available, the array restarted and the reshape continued with zero redundancy. With no bitmap, I can’t simply re-add the two dropped drives; they have to be rebuilt in full. That can’t be done until the reshape is finished.
This all takes days and days. If a drive fails in the meantime, I lose the whole enchilada. I have my fingers crossed. There’s 25 minutes to go on the reshape, then I’ll see if the resync starts automatically, which I think it will.
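For watching a reshape like this one, the kernel reports progress in /proc/mdstat. Below is a minimal sketch of pulling out the member counts and the progress line. The sample text, device names, and numbers are made up for illustration (they are not output from this array), and the regular expressions assume the usual mdstat layout, so treat them as a starting point rather than a robust parser.

```python
import re

# Illustrative /proc/mdstat excerpt. The numbers and device names are
# invented for this example, not taken from the array in this post.
SAMPLE_MDSTAT = """\
md0 : active raid6 sdb1[0] sdc1[1] sdd1[2] sde1[3] sdf1[4] sdg1[5]
      7813527552 blocks super 1.2 level 6, 512k chunk, algorithm 2 [8/6] [UUUUUU__]
      [===================>.]  reshape = 99.6% (1945567232/1953381888) finish=25.0min speed=5200K/sec
"""


def parse_mdstat(text):
    """Return expected/active member counts and any in-progress operation."""
    status = {}

    # "[8/6]" means 8 members expected, 6 currently active.
    counts = re.search(r"\[(\d+)/(\d+)\]", text)
    if counts:
        status["expected"] = int(counts.group(1))
        status["active"] = int(counts.group(2))
        status["missing"] = status["expected"] - status["active"]

    # The progress line names the operation and gives an estimated finish time.
    progress = re.search(
        r"(reshape|recovery|resync)\s*=\s*([\d.]+)%.*?finish=([\d.]+)min", text
    )
    if progress:
        status["operation"] = progress.group(1)
        status["percent"] = float(progress.group(2))
        status["finish_minutes"] = float(progress.group(3))

    return status


if __name__ == "__main__":
    print(parse_mdstat(SAMPLE_MDSTAT))
```

On a live system the same function could be fed the contents of /proc/mdstat instead of the sample string. With more than one array present it would need to split the text per array first.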
I think mdadm should have a setting to require a minimum number of drives before it will start an array. I could then have set it to the full set during the reshape, or at least n-1, which would have been preferable to the unnecessary stress I’m under right now.
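The kind of guard I have in mind can be sketched in a few lines. This is purely hypothetical logic, not an mdadm feature or API: `may_start` and `max_missing` are names invented for this example, and a real version would have to count the member devices itself before deciding whether to assemble the array.

```python
def may_start(expected, present, max_missing=0):
    """Return True only if at most max_missing member drives are absent.

    expected    -- number of drives the array is supposed to have
    present     -- number of member drives actually detected
    max_missing -- how many absent drives are tolerated (0 = full set only)
    """
    if expected < 1 or not 0 <= present <= expected:
        raise ValueError("present must be between 0 and expected")
    return (expected - present) <= max_missing


if __name__ == "__main__":
    # 8-drive array with one dock (two drives) missing after the reboot.
    print(may_start(8, 6))                 # strict policy: refuse to start
    print(may_start(8, 7, max_missing=1))  # n-1 policy with one drive missing
```

With `max_missing=0` during the reshape, the array would have stayed down until the missing dock was brought back, instead of carrying on with no redundancy.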
OK, the array is rebuilding now. A little over 4 days from now I’ll know whether my data are safe once again. A week or so is a pretty long window to run 6 drives with no redundancy, although the odds are with me. But those odds are 10:1 or 50:1, not the 100000:1 or thereabouts I’d expect with a fully functional RAID 6 array.
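For a rough feel for where numbers like these come from, here is a back-of-the-envelope sketch. It assumes independent drive failures at a constant annual failure rate, and the 5% rate in the example is a placeholder, not a measured figure for these drives. It also ignores unrecoverable read errors during the rebuild and correlated failures (shared docks, shared power), both of which make the real risk worse than this model suggests.

```python
def p_any_failure(drives, days, afr):
    """Probability that at least one of `drives` fails within `days`.

    afr is the assumed annual failure rate per drive (e.g. 0.05 for 5%).
    Failures are treated as independent with a constant rate, which is a
    simplification.
    """
    if not 0.0 <= afr < 1.0:
        raise ValueError("afr must be in [0, 1)")
    per_drive_survival = (1.0 - afr) ** (days / 365.0)
    return 1.0 - per_drive_survival ** drives


if __name__ == "__main__":
    # 6 remaining drives, about a week without redundancy, assumed 5% AFR.
    p = p_any_failure(drives=6, days=7, afr=0.05)
    print(f"chance of losing a drive in the window: {p:.4f}")
```

The exposure grows with both the number of drives and the length of the window, which is why a multi-day reshape followed by a multi-day rebuild is so uncomfortable.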
[4 days later] It finally finished and I am safe once again, with new knowledge to prevent potential tragedy in future. I added the write-intent bitmap back in so power glitches won’t force another full rebuild, and fsck reports the file system as clean, which is a bit of a relief. The file system needed to be resized to use the new space, and that took about 45 minutes and about 8% of the (800 MHz dual-core) CPU.
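As for why the file system needed resizing: RAID 6 spends two drives' worth of space on parity, so going from 6 to 8 drives adds two drives' worth of usable space. A small sketch of the arithmetic, using nominal 2 TB drives and ignoring metadata overhead:

```python
def raid6_usable_tb(drives, drive_tb):
    """Nominal usable capacity of a RAID 6 array, ignoring metadata overhead."""
    if drives < 4:
        raise ValueError("RAID 6 needs at least 4 drives")
    # Two drives' worth of capacity goes to parity.
    return (drives - 2) * drive_tb


if __name__ == "__main__":
    before = raid6_usable_tb(6, 2)
    after = raid6_usable_tb(8, 2)
    print(f"before: {before} TB, after: {after} TB, gained: {after - before} TB")
```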