Building a Large RAID5

With the development of HDHomeRun DVR progressing at full steam and the Kickstarter funded, I decided I needed a centralized repository for my media collection, so I built a RAID 5 array to hold my Plex library, archived video/photo editing work, and future DVR storage.

I built the RAID array on an Ubuntu 14.04 LTS server running Linux software mdraid on 4 desktop-class Seagate Barracuda 3TB drives (the seemingly ubiquitous ST3000DM001).

The Case Against RAID5

I realize there is a lot of reluctance among sysadmins these days to build large parity-based RAID5/6 arrays (many think using RAID5 on terabyte-sized drives is just asking for trouble), and that’s certainly understandable for mission-critical systems.

Why? Well, it all boils down to something called the unrecoverable read error (URE) rate. For consumer-grade drives that’s something like 1 per 10^14, which means that roughly one in every 100 trillion reads will come back wrong, with no way for the drive to know.

If you’re talking about a drive with 1 trillion bytes, then when you read the entire drive (as you might during a rebuild after a failed drive), the probability of reading any single byte correctly is (1 – 10^-14) – a near certainty! Propagate that probability across 1 trillion bytes, though…

(1 - (1 - 1E-14)^(1E12)) = 0.0099502

that comes out to about a 1% chance that you will have at least one read error on one drive alone. If you have a 9 TB RAID to rebuild (after losing one drive, the three surviving 3 TB drives have to be read in full):

(1 - (1 - 1E-14)^(9E12)) = 0.0860688

that comes out to an 8.6% chance that there will be at least one read error leading to inconsistent data during an array rebuild.
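
If you want to reproduce these numbers yourself, a couple of one-liners will do it (this assumes python3 is installed; expm1/log1p are just there to keep the floating-point math honest at such tiny probabilities):

$ python3 -c "import math; print(-math.expm1(1e12 * math.log1p(-1e-14)))"
$ python3 -c "import math; print(-math.expm1(9e12 * math.log1p(-1e-14)))"

The first prints roughly 0.00995 and the second roughly 0.0861, in line with the figures above.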

As if that wasn’t enough, array build times for large RAIDs are getting insane. (It actually ended up taking about 9 hours to build this one!)

Why I went with RAID5 Anyway

Well, even after all that math, and after people trying to convince me that I was sentencing myself to certain doom, I built it on RAID5 anyway.

Honestly, I really don’t care about an 8.6% chance of a byte being wrong when a hard drive fails and I have to rebuild:

  • I’m using this primarily as a media dump for recorded TV shows and movies and music and such. One byte being wrong happens all the time in streamed over-the-air broadcasts. Your picture just looks weird for a second.
  • Anything critical will be backed up to the cloud – OK, so I mounted /home on this so I would get a lot of space in my home directory for… stuff. I can back that up easily with CrashPlan or something.
  • This is a home server. I won’t lose my job as my own sysadmin if I have to take the array offline for 9 hours to let it rebuild.
  • The 1 per 10^14 URE figure is a worst-case spec; in practice the average URE frequency is probably several orders of magnitude lower.

Building the Array

When getting ready to build the array, I came across another conundrum. Should I use the entire raw device (e.g. /dev/sdc), or should I partition the device and use a partition (e.g. /dev/sdc1)?

The main argument for using the raw device seems to be that you don’t waste space at the beginning and end of the drive. But really, what’s a 100MB buffer zone compared to the full awesomeness of a 3 TB drive?

The main arguments for using partitions are:

  • Manufacturers sometimes differ in the exact number of sectors that make up a nominally identical drive, even if only by a few
  • If you inadvertently connect the drive to a system that isn’t md-RAID-aware, it won’t prompt you to initialize a partition table – which, if you accidentally do, will render that array member non-operational
  • There’s no performance difference provided you align your start sector.

In the event that you swap in a disk with fewer sectors and you are using the raw disk, it is possible, if you are using LVM, to shrink the filesystem, logical volume, and volume group and add the slightly smaller disk with no ill effect. However, this requires that you run LVM on your RAID (which is always a Good Idea) and that you use a filesystem which can be shrunk. This excludes XFS if you are planning on using raw disks!
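
If you’re curious how close your particular drives actually are, you can compare their raw sector counts before committing to a layout. This just prints each drive’s size in 512-byte sectors (substitute your own device names):

# for d in /dev/sdc /dev/sdd /dev/sde /dev/sdf; do echo -n "$d: "; blockdev --getsz "$d"; done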

Actually building the array was quite simple – it just goes to show how adequate planning and research make everything go smoothly!

Once I plugged in the drives and rebooted, I had /dev/sdc, /dev/sdd, /dev/sde, and /dev/sdf online. For each one of the drives, I went into GNU parted and initialized a GPT (GUID Partition Table) – this was required as the drive was too large for a regular DOS MBR.
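
First, though, a quick lsblk is a cheap way to confirm that those four device names really belong to the new 3 TB Seagates and not to a disk you care about:

# lsblk -o NAME,SIZE,MODEL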

# parted /dev/sdc
GNU Parted 2.3
Using /dev/sdc
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) mktable gpt
(parted) mkpart
Partition name? []? raid
File system type? [ext2]?
Start? 1M
End? -100M

This starts the first partition at 1M, which parted snaps to a properly aligned sector (a multiple of the 4k physical sector size), and ends it 100 MB short of the end of the disk. We also need to flag the partition we just created (which should be the first on the disk) as RAID to change its type so that it will be autodetected:

(parted) set 1 raid on

Repeat for all the disks.
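
If you want reassurance that parted put everything where you think it did, it can check the alignment of the new partition and show the resulting layout:

# parted /dev/sdc align-check optimal 1
# parted /dev/sdc print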

Now, create the array in mdadm:

# mdadm --create --verbose /dev/md0 --level=5 --raid-devices=4 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1

Now I can watch as mdadm builds the array!

# watch -n 5 cat /proc/mdstat

Or, in my case, since it was going to take 9 hours to build, I decided to go to sleep.
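
For a more detailed view than /proc/mdstat gives you – array state, rebuild progress, and which device is doing what – there’s also:

# mdadm --detail /dev/md0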

After the build is complete, append the new array to the mdadm.conf file so it will be scanned and assembled automatically at boot:

# mdadm --detail --scan >> /etc/mdadm/mdadm.conf

Note: I tried this before I went to bed, but at that point mdadm was still reporting one of the drives as a spare. That’s expected – mdadm creates a new RAID5 as a degraded array with the final drive as a spare and then rebuilds onto it – and sure enough, all the drives were correctly configured in the morning after the build was done.
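
On Debian/Ubuntu it’s also worth regenerating the initramfs after editing mdadm.conf, so the copy of the config embedded there knows about the new array (otherwise it can come up under a different name, such as /dev/md127, on the next boot):

# update-initramfs -u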

Creating the LVM

Creating LVM is also pretty easy if you’ve done it before. First you need to initialize LVM on your block device to make it a physical volume:

# pvcreate /dev/md0

Then create a volume group that uses just the physical volume:

# vgcreate my_raid_vg1 /dev/md0

(You can name it whatever you want; it doesn’t have to be my_raid_vg1.)
Finally, create some logical volumes. I created one for /var/log and one for /home:

# lvcreate --size 4G --name logs my_raid_vg1
Logical volume "logs" created
# lvcreate --extents 100%FREE --name homes my_raid_vg1
Logical volume "homes" created

You can now see the LVM volumes in /dev/mapper.
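
If you want to sanity-check each layer as you go, the usual summary commands will show the physical volume, volume group, and logical volumes you just created:

# pvs
# vgs
# lvs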

For a more in-depth look at LVM check out The Geek Stuff: How To Create LVM Using vgcreate, lvcreate, and lvextend lvm2 Commands.

Filesystem Selection

So now came the decision of what filesystem to use. The Ext4 vs XFS debate can get pretty heated, but I decided to go with XFS because I’d never used it before, and the design of XFS is particularly optimized for parallel I/O like you would find in a RAID.

To create the filesystem in the most optimized fashion, you need to know the stripe unit (the chunk size written to each drive) and the stripe width (the number of data chunks in a full stripe).

$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md127 : active raid5 sde1[2] sdb1[0] sdd1[1] sdf1[4]
 8790108672 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]

unused devices: <none>

Here, my chunk size is 512k, which means each drive gets 512 kilobytes of a stripe at a time. Because I am using RAID5, each stripe contains 1 parity chunk and (# drives – 1) data chunks – in this case 3. The chunk size (512k) goes into the su (stripe unit) parameter, and the number of data chunks goes into the sw (stripe width) parameter.
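
Spelled out for this particular array, the numbers work out like this:

su (stripe unit)  = 512 KiB per-drive chunk
sw (stripe width) = 4 drives - 1 parity = 3 data chunks
full data stripe  = 3 x 512 KiB = 1536 KiB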

To make the filesystem:

# mkfs -t xfs -d su=512k -d sw=3 /dev/mapper/my_raid_vg1-homes

Now you’re ready to add the entry to fstab:

# echo "/dev/mapper/my_raid_vg1-homes /home xfs rw,inode64 0 2" >> /etc/fstab
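
Before rebooting, you can mount it by hand to confirm that both the fstab entry and the stripe geometry took. xfs_info reports sunit/swidth in filesystem blocks, so with 4k blocks you should see something like sunit=128 and swidth=384 for this layout:

# mount /home
# xfs_info /home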

Now reboot and hope everything works!

Extra reading: https://raid.wiki.kernel.org/index.php/RAID_setup
