Lustre DDN Tuning
DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT
This content was submitted by an external contributor. We provide this information as a resource for the Lustre open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.
Introduction
This is a guide to configuring DDN storage arrays for use with Lustre. For a complete DDN tuning manual, see Section 3.3 Performance Management of the DDN manual for your product.
Settings
MF, readahead
For a DDN 8500, we recommend that you disable readahead. If you consider a 1000-client system, and each client has up to 8 read RPCs in flight, this means 8 * 1000 * 1MB = 8GB of reads in flight. With DDN cache in the range of 2-5GB (depending on the model), it is unlikely that the LUN-based readahead would have ANY cache hits, even if the file data was contiguous on disk (which, often, it is not). The Multiplication Factor (MF) also influences readahead and should be disabled.
The necessary DDN CLI commands are:
    cache prefetch=0
    cache MF=off
Evidence can be found in the 8500 sgpdd survey: attachment:ddn.xls
For a DDN S2A 9500 or 9550, we also recommend that you disable readahead (using the commands above).
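The in-flight read estimate above can be checked with a few lines of Python. This is only a sketch of the arithmetic in the text; the function name reads_in_flight_mb is hypothetical, and the 1MB RPC size and the 2-5GB cache range are the figures quoted above, not measured values.

```python
def reads_in_flight_mb(clients, rpcs_per_client, rpc_size_mb=1):
    """Worst-case read data in flight across all clients, in MB."""
    return clients * rpcs_per_client * rpc_size_mb


# Figures from the text: 1000 clients, 8 read RPCs each, 1MB per RPC.
in_flight_mb = reads_in_flight_mb(clients=1000, rpcs_per_client=8)
print(f"reads in flight: {in_flight_mb} MB")

# Compare against the quoted DDN cache range (2-5GB, model dependent).
for cache_gb in (2, 5):
    ratio = in_flight_mb / (cache_gb * 1000)
    print(f"{cache_gb}GB cache: in-flight reads are {ratio:.1f}x the cache size")
```

Even at the top of the quoted cache range the in-flight reads exceed the whole cache, which is why prefetched data is unlikely to survive long enough to produce a hit.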
segment size
The cache segment size noticeably affects I/O performance. It should be set differently for the MDT (which does small, random I/Os) and for an OST (which does large, contiguous I/Os). In customer testing, the best values were 64kB for the MDT and 1MB for the OST. Unfortunately, the cache segment size is common to all LUNs on a single DDN and cannot be changed on a per-LUN basis.
The necessary DDN CLI commands are:
    cache size=64    # for MDT LUN. Size is in kB; valid values are 64, 128, 256, 512, 1024, and 2048. Default is 128.
    cache size=1024  # for OST LUN
The effects of cache segment size have not been studied extensively on the S2A 9500 or 9550.
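As a small illustration of the values above, the following Python sketch maps a LUN role to the segment size from the customer testing and rejects sizes the CLI does not list. The helper name cache_size_command and the role keys "mdt" and "ost" are assumptions made for this example; only the command form "cache size=N" and the list of valid sizes come from the text.

```python
# Valid segment sizes in kB, as listed for the "cache size" command.
VALID_SEGMENT_SIZES_KB = (64, 128, 256, 512, 1024, 2048)

# Values reported as best in customer testing (8500 only).
RECOMMENDED_KB = {"mdt": 64, "ost": 1024}


def cache_size_command(role, size_kb=None):
    """Return the DDN CLI line for a role, optionally overriding the size."""
    if size_kb is None:
        size_kb = RECOMMENDED_KB[role.lower()]
    if size_kb not in VALID_SEGMENT_SIZES_KB:
        raise ValueError(f"unsupported segment size: {size_kb} kB")
    return f"cache size={size_kb}"


print(cache_size_command("mdt"))
print(cache_size_command("ost"))
```

Because the setting applies to every LUN on the DDN, only one of these commands can be in effect on a given array at a time.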
Write-back cache
Some customers run with the write-back cache turned ON because it can improve performance noticeably. They accept the risk that a DDN controller crash will require running e2fsck, judging that the time spent on the check is less than the time lost to running with the write-back cache turned off.
Other customers run with the write-back cache OFF, for increased data security and in failover configurations. However, some of these customers experience performance problems with the small writes during journal flush. In this mode it is highly beneficial to also increase the number of OST service threads by adding "options ost ost_num_threads=512" to /etc/modprobe.conf, provided the OSS node has enough RAM (about 1.5MB per thread is preallocated for I/O buffers). Having more I/O threads allows more I/O requests to be in flight while waiting for the disk to complete the synchronous write.
This is a decision each customer must make for themselves: is performance more important than the slight risk of data loss and downtime if there is a hardware or software problem on the DDN? Note that a crash of an OSS or MDS node carries no such risk; the risk arises only if the DDN itself fails.
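Before raising the OST service thread count as suggested above, it is worth estimating the memory that will be preallocated. The Python sketch below uses the approximate 1.5MB per thread figure from the text; the function name ost_thread_buffer_mb is hypothetical, and the result is an estimate rather than an exact accounting of OSS memory use.

```python
def ost_thread_buffer_mb(num_threads, mb_per_thread=1.5):
    """Approximate RAM preallocated for I/O buffers, in MB."""
    return num_threads * mb_per_thread


# ost_num_threads=512, as in the modprobe.conf example.
needed_mb = ost_thread_buffer_mb(512)
print(f"about {needed_mb:.0f} MB preallocated for 512 threads")
```

With 512 threads this comes to roughly 768MB, which must be available on the OSS node in addition to the per-OST journal memory.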
Further Tuning tips
Experiences drawn from testing at a large installation:
- Separate the ext3 OST into two LUNs: one small LUN for the ext3 journal and one large LUN for the data.
- Since Lustre 1.0.4, ext3 mkfs options such as -j and -J can be supplied when the OST is created, as in the following example (where /dev/sdj has previously been formatted as a journal device). The journal should not be larger than about 1GB (262144 4kB blocks), as it can consume up to this amount of RAM on the OSS node per OST.
    # mke2fs -O journal_dev -b 4096 /dev/sdj [optional size]
  In the LMC {config}.sh script:
    ${LMC} --add mds --node io1 --mds iap-mds --dev /dev/sdi --mkfsoptions "-j -J device=/dev/sdj" --failover --group iap-mds
- Very important: on the S2A 8500, testing has shown that one OST should be created per tier, especially in write-through mode (see the illustration below). For example, with 16 tiers, create 16 OSTs of one tier each instead of 8 OSTs of two tiers each.
- On the S2A 9500 and 9550, we measured significantly better performance with 2 tiers per LUN.
- Do NOT partition the DDN LUNs, as this causes ALL I/O to the LUNs to be misaligned by 512 bytes. The DDN RAID stripes and cachelines are aligned on 1MB boundaries, and having the partition table on the LUN causes ALL 1MB writes to do a read-modify-write on an extra chunk, and ALL 1MB reads to instead read 2MB from disk into the cache. This has been shown to cause a noticeable performance loss.
- You are not obliged to lock the small LUNs in cache.
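Two of the figures in the tips above can be verified with simple arithmetic. The Python sketch below checks that 262144 blocks of 4kB make a 1GB journal, and counts how many 1MB-aligned chunks a 1MB I/O touches with and without the 512-byte shift caused by a partition table. The function name chunks_touched is hypothetical, and the model ignores everything about the DDN except the 1MB alignment stated in the text.

```python
MIB = 1024 * 1024

# Journal size limit from the text: 262144 blocks of 4kB each.
journal_bytes = 262144 * 4096
print(journal_bytes == 1024 * MIB)  # the limit is exactly 1GB


def chunks_touched(offset, length, chunk=MIB):
    """Number of chunk-aligned regions covered by [offset, offset + length)."""
    first = offset // chunk
    last = (offset + length - 1) // chunk
    return last - first + 1


# Unpartitioned LUN: a 1MB I/O starts on a 1MB boundary.
print(chunks_touched(offset=0, length=MIB))

# Partitioned LUN: the same I/O is shifted by 512 bytes.
print(chunks_touched(offset=512, length=MIB))
```

The shifted I/O straddles two chunks, which matches the description above of 1MB reads pulling 2MB into the cache and 1MB writes needing a read-modify-write on an extra chunk.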
maxcmds
S2A 8500:
One customer experienced a 30% improvement in write performance by changing this value from the default of 2 to 4. This works only with SATA-based disks and _only_ if you can guarantee that only one controller of the pair will actually be accessing the shared LUNs.
This information comes with a warning, as DDN support do not recommend changing this setting from the default. By increasing the value to 5, the same customer experienced some serious problems.
The necessary DDN CLI command is:
    disk maxcmds=3 # default is 2
S2A 9500/9550: For this hardware, a value of 16 is recommended. The default value is 6. The maximum value is 32 but values above 16 are not currently recommended by DDN support.
Illustration - one OST per Tier
                                       Capacity  Block
LUN  Label  Owner  Status              (Mbytes)  Size   Tiers  Tier list
------------------------------------------------------------------------
  0         1      Ready                         512    1      1
  1         1      Ready                         512    1      2
  2         1      Ready                         512    1      3
  3         1      Ready                         512    1      4
  4         2      Ready [GHS]                          1      5
  5         2      Ready [GHS]                          1      6
  6         2      Critical                      512    1      7
  7         2      Critical                             1      8
 10         1      Cache Locked        64        512    1      1
 11         1      Cache Locked        64        512    1      2
 12         1      Cache Locked        64        512    1      3
 13         1      Cache Locked        64        512    1      4
 14         2      Ready [GHS]         64        512    1      5
 15         2      Ready [GHS]         64        512    1      6
 16         2      Critical            64        512    1      7
 17         2      Critical            64        512    1      8

System verify extent: 16 Mbytes
System verify delay: 30