Lustre DDN Tuning
(Updated: Feb 2010)
DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT
This content was submitted by an external contributor. We provide this information as a resource for the Lustre™ open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.
This page provides guidelines to configure DDN storage arrays for use with Lustre. For more complete information on DDN tuning, refer to the performance management section of the DDN manual for your product.
This section covers the following DDN arrays:
- S2A 8500
- S2A 9500
- S2A 9550
Setting Readahead and MF
For the S2A DDN 8500 storage array, we recommend that you disable readahead. In a 1000-client system, if each client has up to 8 read RPCs in flight, then this is 8 * 1000 * 1 MB = 8 GB of reads in flight. With a DDN cache in the range of 2 to 5 GB (depending on the model), it is unlikely that the LUN-based readahead would have ANY cache hits even if the file data were contiguous on disk (generally, file data is not contiguous). The Multiplication Factor (MF) also influences the readahead; you should disable it.
CLI commands for the DDN are:
cache prefetch=0
cache MF=off
For the S2A 9500 and S2A 9550 DDN storage arrays, we recommend that you use the above commands to disable readahead.
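The in-flight read estimate above can be reproduced with a few lines of arithmetic. This is only an illustrative sketch: the function name is hypothetical, and decimal units (1 GB = 1000 MB) are assumed so that the result matches the 8 GB figure in the text. The cache sizes are the 2 to 5 GB range quoted above.

```python
def reads_in_flight_mb(clients, rpcs_per_client=8, rpc_size_mb=1):
    """Worst-case read data in flight across all clients, in MB."""
    return clients * rpcs_per_client * rpc_size_mb


in_flight_mb = reads_in_flight_mb(1000)  # 8 * 1000 * 1 MB
in_flight_gb = in_flight_mb / 1000       # decimal GB, as in the text

# Compare against the smallest and largest DDN cache sizes mentioned above.
for cache_gb in (2, 5):
    ratio = in_flight_gb / cache_gb
    print(f"{cache_gb} GB cache: in-flight reads are {ratio:.1f}x the cache size")
```

Even with the largest cache, the data in flight exceeds the cache size, which is why LUN-based readahead is unlikely to produce cache hits.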
Setting Segment Size
The cache segment size noticeably affects I/O performance. Set the cache segment size differently on the MDT (which does small, random I/O) and on the OST (which does large, contiguous I/O). In customer testing, we have found the optimal values to be 64 KB for the MDT and 1 MB for the OST.
Note: The cache size parameter is common to all LUNs on a single DDN and cannot be changed on a per-LUN basis.
These are CLI commands for the DDN.
- For the MDT LUN:
$ cache size=64
Size is in KB; valid values are 64, 128, 256, 512, 1024, and 2048. The default is 128.
- For the OST LUN:
$ cache size=1024
Setting Write-Back Cache
Performance is noticeably improved by running Lustre with write-back cache turned on. However, if the DDN controller crashes, you need to run e2fsck. Even so, the time spent running e2fsck is less than the time lost to the performance hit of running with the write-back cache turned off.
For increased data security and in failover configurations, you may prefer to run with write-back cache off. However, you might experience performance problems with the small writes during journal flush. In this mode, it is highly beneficial to increase the number of OST service threads by setting options ost ost_num_threads=512 in /etc/modprobe.conf. The OST should have enough RAM (about 1.5 MB per thread is preallocated for I/O buffers). Having more I/O threads allows you to have more I/O requests in flight, waiting for the disk to complete the synchronous write.
You have to decide whether performance is more important than the slight risk of data loss and downtime in case of a hardware/software problem on the DDN.
Note: An OSS or MDS node crash poses no risk; the risk arises only if the DDN itself fails.
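To check whether an OSS has enough RAM for a larger thread count, the preallocated buffer memory can be estimated from the figure of about 1.5 MB per thread given above. This is a rough sketch; the function name is hypothetical, and the per-thread figure is approximate.

```python
MB_PER_THREAD = 1.5  # approximate preallocated I/O buffer per thread (from the text)


def ost_thread_buffer_mb(num_threads, mb_per_thread=MB_PER_THREAD):
    """Approximate RAM preallocated for OST service thread I/O buffers, in MB."""
    return num_threads * mb_per_thread


print(ost_thread_buffer_mb(512))  # 768.0
```

With ost_num_threads=512, roughly 768 MB of RAM on the OSS is set aside for I/O buffers.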
Setting maxcmds
For the S2A DDN 8500 array, changing maxcmds to 4 (from the default 2) improved write performance by as much as 30 percent in a particular case. This only works with SATA-based disks and when only one controller of the pair is actually accessing the shared LUNs.
However, this setting comes with a warning. DDN support does not recommend changing this setting from the default. When the value was increased to 5, the same setup experienced serious problems.
The CLI command for the DDN client is provided below (default value is 2).
$ disk maxcmds=3
For S2A DDN 9500/9550 hardware and above, you can safely change the default from 6 to 16. Although the maximum value is 32, values higher than 16 are not currently recommended by DDN support.
Note: For help determining an appropriate maxcmds value, refer to the PDF provided with the DDN firmware. This PDF lists recommended values for that specific firmware version.
Further Tuning Tips
Here are some tips we have drawn from testing at a large installation:
- Use the full device instead of a partition (sda versus sda1). When using the full device, Lustre writes nicely-aligned 1 MB chunks to disk. Partitioning the disk can destroy this alignment and will noticeably impact performance.
- Split the ext3 OST across two LUNs: a small LUN for the ext3 journal and a large one for the data.
- Since Lustre 1.0.4, we supply ext3 mkfs options such as -j and -J when we create the OST, in the following manner (where /dev/sdj has previously been formatted as a journal device). The journal size should not be larger than 1 GB (262144 4 KB blocks), as it can consume up to this amount of RAM on the OSS node per OST.
$ mke2fs -O journal_dev -b 4096 /dev/sdj [optional size]
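The optional size argument in the command above is given in file system blocks, so the 1 GB limit corresponds to 262144 blocks of 4 KB. The sketch below shows the conversion; the function name and the 400 MB example value are hypothetical and chosen only for illustration.

```python
BLOCK_SIZE = 4096            # bytes, matching -b 4096
MAX_JOURNAL_BYTES = 1 << 30  # 1 GB upper bound from the text


def journal_blocks(size_bytes, block_size=BLOCK_SIZE):
    """Convert a journal size in bytes to a count of file system blocks."""
    if size_bytes > MAX_JOURNAL_BYTES:
        raise ValueError("journal should not exceed 1 GB")
    return size_bytes // block_size


print(journal_blocks(1 << 30))    # 262144
print(journal_blocks(400 << 20))  # 102400
```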
Tip: On the S2A DDN 8500 storage array, it is very important to create one OST per tier, especially in write-through mode (see output below). This matters if you have 16 tiers: create 16 OSTs consisting of one tier each, instead of eight made of two tiers each.
- Performance is significantly better on the S2A DDN 9500 and 9550 storage arrays with two tiers per LUN.
- Do NOT partition the DDN LUNs, as this causes all I/O to the LUNs to be misaligned by 512 bytes. The DDN RAID stripes and cachelines are aligned on 1 MB boundaries. Having the partition table on the LUN causes all 1 MB writes to do a read-modify-write on an extra chunk, and ALL 1 MB reads to, instead, read 2 MB from disk into the cache, causing a noticeable performance loss.
- You do not need to lock the small LUNs in cache.
- Configure the MDT on a separate volume that is configured as RAID 1+0. This reduces the MDT I/O and doubles the seek speed.
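The cost of the 512-byte misalignment described above can be illustrated by counting how many 1 MB stripes a single 1 MB I/O touches. This is a simplified model, not a description of the DDN firmware; the function name is hypothetical.

```python
STRIPE = 1 << 20  # 1 MB RAID stripe / cache line size (from the text)


def stripes_touched(offset, length, stripe=STRIPE):
    """Return how many stripe-sized chunks an I/O of `length` bytes at `offset` spans."""
    first = offset // stripe
    last = (offset + length - 1) // stripe
    return last - first + 1


print(stripes_touched(0, STRIPE))    # 1: aligned I/O on the whole device
print(stripes_touched(512, STRIPE))  # 2: same I/O shifted by a 512-byte partition offset
```

An aligned 1 MB request maps to exactly one stripe, while the shifted request straddles two, which matches the extra read-modify-write and the 2 MB reads noted in the list.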
For example, one OST per tier:
|LUN|Label|Owner|Status|Capacity (Mbytes)|Block Size|Tiers|Tier List|
System verify extent: 16 Mbytes
System verify delay: 30