
Let me tell you about the last time I used RAID-5 in a 4-disk array:

When one of my disks died a sudden death, I swiftly popped in a new one and kicked off the resync process, which took about 48 hours to complete.

14 minutes(!) after the resync finished, a second disk died.

Lesson 1: RAID-5 is pretty useless. When a disk dies you have no more redundancy during the extremely stressful RAID recovery.

Lesson 2: RAID is not an alternative to backups.

@Lyude

It certainly makes it a lot more likely your array will survive in the worst case scenario.

@fribbledom oh yeah. It's not a backup but it helps, especially since the two-disk failure mode is actually super common and kills a lot of people's RAIDs

@fribbledom very true. "RAID is not a backup" is an oft-repeated mantra that still doesn't seem to sink in for some people. Honestly I just use RAID because I can't be bothered to manage multiple mount points haha

@fribbledom Very true. To me, RAID is uptime enhancement. I back my NAS up locally, my PCs to the cloud, and my cloud drives to the NAS.

@fribbledom last time I used RAID-5 and a disk died, I replaced the dead disk and a second disk died during the resync. that's the only time I've ever lost data that I was specifically trying to protect.

@walruslifestyle @fribbledom I remember reading a long article that basically amounted to "For disks bigger than a TB and without special low failure rates (read: expensive), RAID5 almost guarantees another disk will break during rebuild"

Basically, big "normal" disks' failure rates are too low. Expensive "storage" HDs have an order of magnitude better survivability. But bad chance can still bite you in the ass.

@fribbledom RAID5 is a trap. It's a pet peeve of mine that people downtalk btrfs because it doesn't support RAID5/6. Of course not, in any situation requiring RAID you'd be better served with RAID 0, 1, or 10

@fribbledom Yeah, I know a place that suffered a major data loss because they had no backups. Or rather, they were using RAID as the backup and had no other backup at all. Also, they were using the kind of RAID where if you lose one disk you lose everything... so even less of a backup.

@fribbledom I run 3-disk raid1e (or raid10-near2, depending on who you ask)

it seemed balanced to me, and I noted that it gives you essentially the same stats as a 4-disk raid10 but with fewer disks:
- up to one drive failure*
- 50% capacity
- at least 2x the speed of the slowest drive

*yes, 4-disk raid10 can survive 2 drive failures, but only as long as they're the correct drives, and I don't trust that

@fribbledom (and then you potentially have a 4th disk available for backups)
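
A quick illustration of the asterisk above, assuming the usual 4-disk RAID 10 layout of two 2-disk mirror pairs (my own sketch, not the poster's setup):

```python
from itertools import combinations

MIRROR_PAIRS = [(0, 1), (2, 3)]  # assumed layout: two 2-disk mirror pairs

def survives(failed_disks: set) -> bool:
    # The array lives as long as every mirror pair keeps at least one disk.
    return all(not set(pair) <= failed_disks for pair in MIRROR_PAIRS)

two_disk_failures = list(combinations(range(4), 2))
survivable = [f for f in two_disk_failures if survives(set(f))]
print(f"{len(survivable)} of {len(two_disk_failures)} two-disk failures survivable")
# -> 4 of 6: roughly a 2-in-3 chance, hence "only if they're the correct drives".
```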

@fribbledom Rackspace lost my data that way once. The guy who called me was surprised I had backups.

@fribbledom 1/2
"The biggest difference between RAID 5 and RAID 10 is how it rebuilds the disks. RAID 10 only reads the surviving mirror and stores the copy to the new drive you replaced. Your usual read and write operations are virtually unchanged from normal operations."

@fribbledom 2/2
"However, if a drive fails with RAID 5, it needs to read everything on all the remaining drives to rebuild the new, replaced disk. Compared to RAID 10 operations, which reads only the surviving mirror, this extreme load means you have a much higher chance of a second disk failure and data loss."
acronis.com/en-us/articles/wha
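
A rough sketch of the difference described in that quote, with illustrative numbers of my own (4 drives of 4 TB each):

```python
def rebuild_read_tb(level: str, disks: int, disk_tb: float) -> float:
    """Data that must be read to rebuild one failed drive."""
    if level == "raid10":
        # Only the surviving half of the broken mirror pair is read.
        return disk_tb
    if level == "raid5":
        # Every remaining drive is read in full to reconstruct the lost one.
        return (disks - 1) * disk_tb
    raise ValueError(f"unhandled level: {level}")

for level in ("raid10", "raid5"):
    print(f"{level}: {rebuild_read_tb(level, disks=4, disk_tb=4)} TB read to rebuild")
# raid10: 4 TB, raid5: 12 TB -- three times the load on drives of the same age.
```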

@fribbledom I've always just kept all my stuff on one disk and occasionally backed it up to another, or in the last 10 years or so to a cloud thing. I've never messed around with RAID because someone described it to me once and I went "OK so you buy two hard drives, same sort, probably from the same batch, put 'em in the same box, with the same vibration, the same heat cycles, doing the same wear to each... that's not backup, that's figuring out the tolerances at the hard drive factory."

RAID is never a backup.
It just does one or more of these:
1. makes the storage larger, or
2. makes reads/writes faster, or
3. makes the data more resilient to disk failure

But no, RAID is not a backup.
@ifixcoinops @fribbledom

@fribbledom So true! 3 years ago one of my friends was not so lucky and lost a second disk before the resync could complete.

@fribbledom you get regular write speed and 2x read speed in raid 1.

You get 2x write and 4x read in raid 10.

No other levels are worth your attention. 😋
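
As a rule-of-thumb sketch of those multipliers, assuming n identical drives and ideal striping/mirroring (real throughput varies a lot with workload and controller):

```python
def throughput_multipliers(level: str, n: int) -> tuple[float, float]:
    """Return (read_x, write_x) relative to a single drive."""
    if level == "raid1":     # n-way mirror: reads spread across copies, writes hit all
        return (n, 1)
    if level == "raid10":    # n/2 mirror pairs, striped together
        return (n, n / 2)
    if level == "raid0":     # pure striping, no redundancy at all
        return (n, n)
    raise ValueError(f"unhandled level: {level}")

print("raid 1, 2 disks :", throughput_multipliers("raid1", 2))   # (2, 1)
print("raid 10, 4 disks:", throughput_multipliers("raid10", 4))  # (4, 2.0)
```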

@fribbledom I just back my important personal data up regularly and assume I'll lose the rest every few years.

@fribbledom Here's my story from just this past week. I've had a 2-bay Synology with the same 2 disks in RAID-1 for the last ~9yrs. I got an alert on Weds that disk 2 had failed. I immediately made a backup to an external hard drive and ordered a replacement disk. I installed the new disk on Saturday and while the array was rebuilding, disk 1 failed.

RAID saved my neck in that it gave me a short window to make a real backup before the entire array crashed. RAID itself is not a backup.

@fribbledom which drives did you use? When did you buy them?

@fribbledom meaning: Same brand, bought at the same time?
Personal painful experience: Lifetime variance correlates with the production batch. Most likely the source of stories like yours ("Second HD failed during RAID rebuild").
I try to buy at different times from different vendors...

@fribbledom I'm using a RAID-Z2, and bought drives from different suppliers, to minimize chances of multiple drives failing

@fribbledom I really hope that will be enough, otherwise ~20 TB of data would need to be recovered from various sources. (No, I don't have a "one contains everything" backup.)

@kunsi @fribbledom

I can recommend this article.
jrs-s.net/2015/02/06/zfs-you-s

Key points:
1) Mirrors are faster and easier to maintain than other (z)raid levels.
2) Do backups.

@fribbledom I had 4 failing disks (non-zero HW ECC errors) working with 4 redundant copies on btrfs for ~4 years non-stop, with a monthly scrub. Still no data lost, but the ext2 /boot partition is gone.

ZFS on RAID 6 works well. You should have daily backups with 7-day retention, even 30-day if it's in the budget.

@fribbledom we lost a server that was on Raid 5 because the lazy and incompetent admin didn't know a disk had failed. So we lost about 6 hours of customer data plus the downtime.

I looked into it, as I know nothing about such things, realised raid 5 was pants and told him to instruct our IT contractors to rebuild the server with a different configuration.

His reply: aw but I've already told em to do it exactly how it was before and I don't want to email them again.

@fribbledom

Spoiler alert, he didn't. The same thing happened again in 6 months' time.

@fribbledom RAID increases availability by (usually) sparing you a tedious restore from backups. If you're paranoid, you should proactively replace drives on a rolling schedule rather than wait for one to fail.
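
A minimal sketch of what such a rolling schedule could look like, with hypothetical install date and service life:

```python
from datetime import date, timedelta

def rolling_schedule(installed: date, drives: int, service_years: float = 4.0):
    """Spread replacements evenly across the planned service life, so the
    drives never all reach old age at the same time."""
    step = timedelta(days=service_years * 365 / drives)
    return [installed + step * (i + 1) for i in range(drives)]

for i, when in enumerate(rolling_schedule(date(2020, 1, 1), drives=4)):
    print(f"drive {i}: replace around {when}")
```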

@fribbledom I have had even better cases:

- controller glitch: the array falls apart, corrupting some data in the process. In theory recovery is possible, since the disks are physically fine, but it requires quite a bit of knowledge, effort, time and specific equipment.

- 10+ disk array, with one disk set as a "hot spare" so that if a disk fails, the rebuild starts immediately. Power outage: the server powers down correctly, yet on the next start two disks fail simultaneously.

Simple RAIDs don't give you any real guarantees; they only save time on recovery from partial failures. I have had problems even with mirrored RAIDs, mainly due to controller glitches.

@fribbledom
Backups allow you to restore after an admin accidentally the whole /

RAID 60 means that no amount of disk failure will stop an admin from accidentally the whole /

I can only recommend RAID 5 for disks up to 4 TB, and this is for data you either have backed up off site, or you can fetch again (torrents)

@fribbledom If you bought the disks at the same time it might suggest another thing: the life expectancy of modern disks is very predictable!

@fribbledom
Definitely not an alternative to backups (availability, not safety), but I've had several RAID 5 arrays, and several rebuilds, and it's been fine.
I do think, though, that at around 6 disks at the latest, RAID 6 is the much safer option.
I guess it also depends on what disks you use and how old they are. The more probable the next failure, the higher the danger...

Trying BTRFS RAID on my main machine now, but it's already got checksum failures and I haven't worked out how to fix them...

@fribbledom 1/3
Just had this convo come up today with a coworker asking why I don't have the same problems on servers. I thought you might at least be interested.

Higher-end RAID controllers do more than RAID.
There is a feature called "scrubbing", mainly to find and fix bit errors.
But a side effect is that the controller prioritizes host requests, and when the host is "idle" it spends its free time scrubbing in the background.
This means the drives are effectively always under full load.

@fribbledom 2/3
There's zero difference between the high load of scrubbing and the high load of a RAID rebuild.
I also have a drive swap schedule based on the low-end MTBF.

On RAID controllers without this, disk load varies, so there's a higher chance of another disk failing during a rebuild.
Owners also often ignore MTBF because "it's still working", so you end up with older drives, usually past their life expectancy, suddenly seeing a huge load difference.
That's the most likely time for a second drive to die.

@fribbledom 3/3
RAID 6 can only help so much there.
A hot spare is always good, but using it still triggers a rebuild onto it.

But best practice is RAID 6, hot spares, cold spares, background scrubbing, and battery-backed cache.

Plus backups backups backups! With extra backups sprinkled on top!
