Configuring RAID for Disk Arrays

(Updated: Nov 2009)

A number of Linux™ kernels offer software RAID support, by which the kernel organizes disks into a RAID array. All Lustre-supported kernels have software RAID capability, but Lustre has added performance improvements to the RHEL 4 and RHEL 5 kernels that make operations (primarily write performance) even faster. Therefore, if you are using software RAID functionality, we recommend that you use a Lustre-patched RHEL 4 or 5 kernel to take advantage of these performance improvements, rather than a SLES kernel.

The procedure for configuring RAID on a Lustre file system is shown below. For additional information, see Section 10.1: Considerations for Backend Storage and Section 10.2: Insights into Disk Performance Measurement in Chapter 10 in the Lustre Operations Manual.

Enabling Software RAID on Lustre
This procedure describes how to set up software RAID on a Lustre system. It requires use of mdadm, a third-party tool to manage devices using software RAID.

1. Install Lustre, but do not configure it yet. See Installing and Configuring Lustre.

2. Create the RAID array with the mdadm command.

The mdadm command is used to create and manage software RAID arrays in Linux, as well as to monitor the arrays while they are running. To create a RAID array, use the --create option and specify the MD device to create, the array components, and the options appropriate to the array.

Note: For best performance, we generally recommend using disks from as many controllers as possible in one RAID array.

To illustrate how to create a software RAID array for Lustre, the steps below include a worked example that creates a 10-disk RAID 6 array from disks /dev/dsk/c0t0d0 through c0t0d4 and /dev/dsk/c1t0d0 through c1t0d4. This RAID array has no spares.

In the 10-disk RAID 6 array, each stripe holds data on 8 disks and parity on the other 2. The chunk size must be chosen such that <chunk_size> <= 1024KB/8. Therefore, the largest valid chunk size is 128KB.
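The chunk-size arithmetic above can be sketched as a small shell helper. The function name max_chunk_kb and its arguments are illustrative only, and the 1024KB limit is taken from the text above.

```shell
# max_chunk_kb TOTAL_DISKS PARITY_DISKS [MAX_STRIPE_KB]
# Prints the largest chunk size (KB) for which one stripe of data
# (chunk size x data disks) does not exceed MAX_STRIPE_KB (default 1024).
max_chunk_kb() {
    total_disks=$1
    parity_disks=$2
    max_stripe_kb=${3:-1024}
    data_disks=$((total_disks - parity_disks))
    echo $((max_stripe_kb / data_disks))
}

# Worked example: 10-disk RAID 6 (2 parity disks per stripe)
max_chunk_kb 10 2    # prints 128
```

The division is integer division, so for disk counts that do not divide 1024 evenly the result is rounded down.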


 * a. Create a RAID array for an OST. On the OSS, run:
 * $ mdadm --create <array_device> -c <chunk_size> -l <raid_level> -n <active_devices> -x <spare_devices> <block_devices>


 * where:
 * <array_device> RAID array to create, in the form of /dev/mdX
 * <chunk_size> Size of each stripe piece on the array’s disks (in KB); discussed above.
 * <raid_level> Architecture of the RAID array. RAID 5 and RAID 6 are commonly used for OSTs.
 * <active_devices> Number of active disks in the array, including parity disks.
 * <spare_devices> Number of spare disks initially assigned to the array. More disks may be brought in via spare pooling (see below).
 * <block_devices> List of the block devices used for the RAID array; wildcards may be used.


 * For the worked example, the command is:
 * $ mdadm --create /dev/md10 -c 128 -l 6 -n 10 -x 0 /dev/dsk/c0t0d[01234] /dev/dsk/c1t0d[01234]


 * The command displays:
 * mdadm: array /dev/md10 started.


 * We also want an external journal on a RAID 1 device. We create this from two 400MB partitions on separate disks: /dev/dsk/c0t0d20p1 and /dev/dsk/c1t0d20p1.


 * b. Create a RAID array for an external journal. On the OSS, run:
 * $ mdadm --create <array_device> -l <raid_level> -n <active_devices> -x <spare_devices> <block_devices>


 * where:
 * <array_device> RAID array to create, in the form of /dev/mdX
 * <raid_level> Architecture of the RAID array. RAID 1 is recommended for external journals.
 * <active_devices> Number of active disks in the RAID array, including mirrors.
 * <spare_devices> Number of spare disks initially assigned to the RAID array. More disks may be brought in via spare pooling (see below).
 * <block_devices> List of the block devices used for the RAID array; wildcards may be used.


 * For the worked example, the command is:
 * $ mdadm --create /dev/md20 -l 1 -n 2 -x 0 /dev/dsk/c0t0d20p1 /dev/dsk/c1t0d20p1


 * The command displays:
 * mdadm: array /dev/md20 started.


 * We now have two arrays: a RAID 6 array for the OST (/dev/md10), and a RAID 1 array for the external journal (/dev/md20).


 * The arrays will now be re-synced, a process that synchronizes the disks in each array so that their contents match. The arrays may be used during the re-sync (including for formatting the OSTs), but performance will be lower than usual. Re-sync progress can be monitored by reading the /proc/mdstat file.


 * Next, create a RAID array for an MDT. In this example, a RAID 10 array is created with 4 disks: /dev/dsk/c0t0d1, c0t0d3, c1t0d1, and c1t0d3. For smaller arrays, RAID 1 could be used.


 * c. Create a RAID array for an MDT. On the MDS, run:
 * $ mdadm --create <array_device> -l <raid_level> -n <active_devices> -x <spare_devices> <block_devices>


 * where the parameters have the same meaning as in step b.


 * For the worked example, the command is:
 * $ mdadm --create /dev/md10 -l 10 -n 4 -x 0 /dev/dsk/c[01]t0d[13]


 * The command displays:
 * mdadm: array /dev/md10 started.


 * If you are creating many arrays across many servers, we recommend scripting this process.
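
As a starting point for such a script, here is a minimal sketch that wraps the mdadm --create invocation shown in step a. The create_array and run helpers and the DRY_RUN variable are hypothetical names introduced for illustration. By default the script only prints the commands so they can be reviewed before anything is created.

```shell
#!/bin/sh
# Sketch only: create_array, run, and DRY_RUN are illustrative names.
# With DRY_RUN=1 (the default here) commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$@"
    else
        "$@"
    fi
}

# create_array MD_DEVICE CHUNK_KB RAID_LEVEL ACTIVE SPARES BLOCK_DEVICE...
create_array() {
    array=$1 chunk=$2 level=$3 active=$4 spares=$5
    shift 5
    run mdadm --create "$array" -c "$chunk" -l "$level" \
        -n "$active" -x "$spares" "$@"
}

# The worked example from step a
create_array /dev/md10 128 6 10 0 /dev/dsk/c0t0d[01234] /dev/dsk/c1t0d[01234]
```

Running the script with DRY_RUN=0 as root would execute the commands instead of printing them. Distributing the script to many servers is site-specific and is not covered by this sketch.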


 * Note: Do not use the --assume-clean option when creating arrays. This could lead to data corruption on RAID 5 and will cause array checks to show errors with all RAID types.
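
Before moving on, the re-sync progress of the arrays created in step 2 can be checked. The following is a minimal sketch that filters /proc/mdstat down to the progress lines. The function name md_sync_progress is illustrative, and the pattern assumes the usual mdstat layout, in which a progress line contains "resync =", "recovery =", or "check =".

```shell
# md_sync_progress [FILE]
# Prints one line per array that is re-syncing, recovering, or being
# checked. Reads /proc/mdstat unless another file is given.
md_sync_progress() {
    mdstat="${1:-/proc/mdstat}"
    awk '
        /^md[0-9]+ :/ { array = $1 }          # remember the current array name
        /(resync|recovery|check) *=/ {
            sub(/^[ \t]+/, "")                # drop leading indentation
            print array ": " $0
        }
    ' "$mdstat"
}
```

Calling md_sync_progress with no arguments on a server prints nothing once all arrays are fully synchronized.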

3. Set up the mdadm tool.

The mdadm tool enables you to monitor disks for failures (you will receive notification). It also enables you to manage spare disks. When a disk fails, you can use mdadm to make a spare disk active, until such time as the failed disk is replaced.

Here is an example mdadm.conf from an OSS with 7 OSTs including external journals. Note how spare groups are configured, so that OSTs without spares still benefit from the spare disks assigned to other OSTs.

ARRAY /dev/md10 level=raid6 num-devices=10 UUID=e8926d28:0724ee29:65147008:b8df0bd1 spare-group=raids
ARRAY /dev/md11 level=raid6 num-devices=10 spares=1 UUID=7b045948:ac4edfc4:f9d7a279:17b468cd spare-group=raids
ARRAY /dev/md12 level=raid6 num-devices=10 spares=1 UUID=29d8c0f0:d9408537:39c8053e:bd476268 spare-group=raids
ARRAY /dev/md13 level=raid6 num-devices=10 UUID=1753fa96:fd83a518:d49fc558:9ae3488c spare-group=raids
ARRAY /dev/md14 level=raid6 num-devices=10 spares=1 UUID=7f0ad256:0b3459a4:d7366660:cf6c7249 spare-group=raids
ARRAY /dev/md15 level=raid6 num-devices=10 UUID=09830fd2:1cac8625:182d9290:2b1ccf2a spare-group=raids
ARRAY /dev/md16 level=raid6 num-devices=10 UUID=32bf1b12:4787d254:29e76bd7:684d7217 spare-group=raids
ARRAY /dev/md20 level=raid1 num-devices=2 spares=1 UUID=bcfb5f40:7a2ebd50:b3111587:8b393b86 spare-group=journals
ARRAY /dev/md21 level=raid1 num-devices=2 spares=1 UUID=6c82d034:3f5465ad:11663a04:58fbc2d1 spare-group=journals
ARRAY /dev/md22 level=raid1 num-devices=2 spares=1 UUID=7c7274c5:8b970569:03c22c87:e7a40e11 spare-group=journals
ARRAY /dev/md23 level=raid1 num-devices=2 spares=1 UUID=46ecd502:b39cd6d9:dd7e163b:dd9b2620 spare-group=journals
ARRAY /dev/md24 level=raid1 num-devices=2 spares=1 UUID=5c099970:2a9919e6:28c9b741:3134be7e spare-group=journals
ARRAY /dev/md25 level=raid1 num-devices=2 spares=1 UUID=b44a56c0:b1893164:4416e0b8:75beabc4 spare-group=journals
ARRAY /dev/md26 level=raid1 num-devices=2 spares=1 UUID=2adf9d0f:2b7372c5:4e5f483f:3d9a0a25 spare-group=journals

# Email address to notify of events (e.g. disk failures)
MAILADDR admin@example.com
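
To see at a glance how spares are pooled in a configuration like the one above, a short awk-based helper can total the arrays and spares per spare group. This is a sketch. The function name spare_group_summary is illustrative, and the default path /etc/mdadm.conf is an assumption that may differ by distribution.

```shell
# spare_group_summary [CONF_FILE]
# For each spare-group named on an ARRAY line, prints the number of
# arrays in the group and the total of their spares= values.
spare_group_summary() {
    awk '
        $1 == "ARRAY" {
            group = "(none)"; spares = 0
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^spare-group=/) group  = substr($i, 13)
                if ($i ~ /^spares=/)      spares = substr($i, 8) + 0
            }
            arrays[group]++
            total[group] += spares
        }
        END {
            for (g in arrays)
                printf "%s arrays=%d spares=%d\n", g, arrays[g], total[g]
        }
    ' "${1:-/etc/mdadm.conf}" | sort
}
```

For the example configuration above, this reports 7 arrays sharing 3 spares in the raids group and 7 arrays with 7 spares in the journals group.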

4. Set up periodic checks of the RAID array. We recommend checking the software RAID arrays monthly for consistency. This can be done using cron and should be scheduled for an idle period so performance is not affected.

To start a check, write "check" into /sys/block/[ARRAY]/md/sync_action. For example, to check /dev/md10, run this command on the Lustre server:

$ echo check > /sys/block/md10/md/sync_action
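
For the monthly cron job, the same write can be applied to every MD array on the server. The following is a minimal sketch. The function name start_md_checks is illustrative, and skipping arrays whose sync_action is not "idle" is a design choice made here so that a running re-sync is not interrupted.

```shell
# start_md_checks [SYSFS_ROOT]
# Starts a consistency check on every MD array that is currently idle.
# SYSFS_ROOT defaults to /sys/block.
start_md_checks() {
    sysfs_root="${1:-/sys/block}"
    for action_file in "$sysfs_root"/md*/md/sync_action; do
        [ -e "$action_file" ] || continue      # glob matched nothing
        if [ "$(cat "$action_file")" = "idle" ]; then
            echo check > "$action_file"
        else
            echo "skipping $action_file: array is busy" >&2
        fi
    done
}

# From a script run by root's cron during an idle period, call:
#   start_md_checks
```

After a check completes, the kernel reports the number of mismatches found in /sys/block/[ARRAY]/md/mismatch_cnt.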

5. Format the OSTs and MDT, and continue with normal Lustre setup and configuration.

For configuration information, see Installing and Configuring Lustre.

Note: Per Bugzilla 18475, we recommend that stripe_cache_size be set to 16KB (instead of 2KB).

These additional resources may be helpful when enabling software RAID on Lustre:
 * md(4), mdadm(8), mdadm.conf(5) manual pages
 * Linux software RAID wiki
 * Kernel documentation: Documentation/md.txt