WARNING: This is the _old_ Lustre wiki, and it is in the process of being retired. The information found here is all likely to be out of date. Please search the new wiki for more up to date information.

Lustre DDN Tuning


DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT

This content was submitted by an external contributor. We provide this information as a resource for the Lustre™ open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.


This page provides guidelines to configure DDN storage arrays for use with Lustre. For more complete information on DDN tuning, refer to the performance management section of the DDN manual for your product.

This section covers the following DDN arrays:

  • S2A 8500
  • S2A 9500
  • S2A 9550

Setting Readahead and MF

For the DDN S2A 8500 storage array, we recommend that you disable readahead. In a 1000-client system, if each client has up to 8 read RPCs of 1 MB each in flight, then this is 8 * 1000 * 1 MB = 8 GB of reads in flight. With a DDN cache in the range of 2 to 5 GB (depending on the model), it is unlikely that the LUN-based readahead would have ANY cache hits, even if the file data were contiguous on disk (generally, file data is not contiguous). The Multiplication Factor (MF) also influences the readahead; you should disable it as well.

CLI commands for the DDN are:

cache prefetch=0
cache MF=off

For the S2A 9500 and S2A 9550 DDN storage arrays, we recommend that you use the above commands to disable readahead.
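
The readahead argument above rests on simple arithmetic, which the following minimal Python sketch reproduces. The function and variable names are illustrative assumptions, not part of any Lustre or DDN tool; only the figures (1000 clients, 8 RPCs per client, 1 MB per RPC, 2 to 5 GB of cache) come from the text, and 1 GB is treated as 1000 MB to match the 8 GB figure.

```python
# Rough estimate of read data in flight compared with the DDN cache size.
# All names are illustrative; only the input figures come from the text.

MB_PER_GB = 1000  # the text rounds 8 * 1000 * 1 MB to 8 GB


def reads_in_flight_mb(clients: int, rpcs_per_client: int, rpc_size_mb: int) -> int:
    """Total read data that may be outstanding at once, in MB."""
    return clients * rpcs_per_client * rpc_size_mb


def cache_oversubscription(in_flight_mb: int, cache_gb: int) -> float:
    """How many times larger the in-flight reads are than the cache."""
    return in_flight_mb / (cache_gb * MB_PER_GB)


in_flight = reads_in_flight_mb(clients=1000, rpcs_per_client=8, rpc_size_mb=1)
print(f"reads in flight: {in_flight} MB")

# Cache sizes at both ends of the 2 to 5 GB range mentioned above.
for cache_gb in (2, 5):
    ratio = cache_oversubscription(in_flight, cache_gb)
    print(f"cache of {cache_gb} GB: in-flight reads are {ratio:.1f}x the cache")
```

Even at the top of the cache range, the outstanding reads exceed the cache, which is why prefetched data is unlikely to survive long enough to be hit.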

Setting Segment Size

The cache segment size noticeably affects I/O performance. Set the cache segment size differently on the MDT (which does small, random I/O) and on the OST (which does large, contiguous I/O). In customer testing, we have found the optimal values to be 64 KB for the MDT and 1 MB for the OST.

Note: The cache size parameter is common to all LUNs on a single DDN and cannot be changed on a per-LUN basis.

These are CLI commands for the DDN.

  • For the MDT LUN:
$ cache size=64
The size is in KB; valid values are 64, 128, 256, 512, 1024, and 2048. The default is 128.
  • For the OST LUN:
$ cache size=1024

Setting Write-Back Cache

Performance is noticeably improved by running Lustre with write-back cache turned on. However, there is a risk that if the DDN controller crashes, you will need to run e2fsck. Even so, running e2fsck takes less time than is lost to the performance hit from running with the write-back cache turned off.

For increased data security and in failover configurations, you may prefer to run with write-back cache off. However, you might experience performance problems with the small writes during journal flush. In this mode, it is highly beneficial to increase the number of OST service threads by setting options ost ost_num_threads=512 in /etc/modprobe.conf. The OSS node should have enough RAM (about 1.5 MB per thread is preallocated for I/O buffers). Having more I/O threads allows you to have more I/O requests in flight, waiting for the disk to complete the synchronous write.
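
As a quick check of the RAM requirement mentioned above, here is a minimal Python sketch. The helper name is hypothetical, and the 1.5 MB per thread figure is the approximate value quoted in the text, so treat the result as an estimate only.

```python
# Estimate the RAM preallocated for OST I/O buffers.
# The helper name is illustrative; 1.5 MB/thread is the approximate
# figure quoted in the text, not a measured value.

MB_PER_THREAD = 1.5


def ost_buffer_ram_mb(num_threads: int, mb_per_thread: float = MB_PER_THREAD) -> float:
    """Approximate RAM (in MB) preallocated for I/O buffers."""
    return num_threads * mb_per_thread


threads = 512  # the ost_num_threads value suggested above
print(f"{threads} threads need about {ost_buffer_ram_mb(threads):.0f} MB for I/O buffers")
```

With 512 threads this comes to roughly 768 MB, which should be budgeted in addition to the rest of the node's memory use.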

You have to decide whether performance is more important than the slight risk of data loss and downtime in case of a hardware/software problem on the DDN.

Note: A crash of an OSS/MDS node does not carry this risk; it arises only if the DDN itself fails.

Further Tuning Tips

Experiences drawn from testing at a large installation:

  • Separate the ext3 OST into two LUNs: one small LUN for the ext3 journal and one large LUN for the data.
  • Since Lustre 1.0.4, ext3 mkfs options such as -j and -J can be supplied when the OST is created, as in the following example (where /dev/sdj has previously been formatted as a journal device). The journal should not be larger than about 1 GB (262144 4 KB blocks), as it can consume up to this amount of RAM on the OSS node per OST.
 # mke2fs -O journal_dev -b 4096 /dev/sdj [optional size]
In the LMC {config}.sh script:
${LMC} --add mds --node io1 --mds iap-mds --dev /dev/sdi --mkfsoptions "-j -J device=/dev/sdj" --failover --group iap-mds
  • Very important: on the S2A 8500, testing has shown that you need to create one OST per tier, especially in write-through mode (see the illustration below). For example, if you have 16 tiers, create 16 OSTs of one tier each instead of 8 OSTs of two tiers each.
  • On the S2A 9500 and 9550, we measured significantly better performance with 2 tiers per LUN.
  • Do NOT partition the DDN LUNs, as this causes ALL I/O to the LUNs to be misaligned by 512 bytes. The DDN RAID stripes and cachelines are aligned on 1 MB boundaries, and having the partition table on the LUN causes ALL 1 MB writes to do a read-modify-write on an extra chunk, and ALL 1 MB reads to instead read 2 MB from disk into the cache. This has been shown to cause a noticeable performance loss.
  • You do not have to lock the small LUNs in cache.
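
Two of the tips above involve arithmetic that is easy to verify: the journal size limit and the cost of a 512-byte misalignment. The following Python sketch checks both. The function names are hypothetical, and it assumes binary units (1 MB = 1024 * 1024 bytes) and 1 MB aligned chunks as described in the text.

```python
# Illustrative check of two figures from the tips above.
# Function names are hypothetical; binary units are an assumption.

KIB = 1024
MIB = 1024 * KIB


def journal_bytes(blocks: int, block_size: int = 4 * KIB) -> int:
    """Size of an ext3 journal given its block count and block size."""
    return blocks * block_size


def chunks_touched(offset: int, length: int, chunk: int = MIB) -> int:
    """Number of chunk-aligned regions that an I/O at `offset` overlaps."""
    if length <= 0:
        return 0
    first = offset // chunk
    last = (offset + length - 1) // chunk
    return last - first + 1


# 262144 blocks of 4 KB is exactly 1 GB (binary), the suggested upper limit.
print(journal_bytes(262144) // MIB, "MB journal")

# An aligned 1 MB I/O stays within one chunk; shifted by 512 bytes it spans two.
print("aligned:   ", chunks_touched(0, MIB), "chunk(s)")
print("misaligned:", chunks_touched(512, MIB), "chunk(s)")
```

The second result illustrates why every 1 MB request on a partitioned LUN ends up touching an extra chunk.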

maxcmds

S2A 8500:

One customer experienced a 30% improvement in write performance by changing this value from the default 2 to 4. This works only with SATA-based disks and _only_ if you can guarantee that only one controller of the pair will actually be accessing the shared LUNs.

This information comes with a warning, as DDN support does not recommend changing this setting from the default. By increasing the value to 5, the same customer experienced some serious problems.

The DDN CLI command needed is:

  disk   maxcmds=3                       # default is 2       

S2A 9500/9550: For this hardware, a value of 16 is recommended. The default value is 6. The maximum value is 32 but values above 16 are not currently recommended by DDN support.

Illustration - one OST per Tier

                                   Capacity  Block
LUN  Label    Owner  Status        (Mbytes)  Size  Tiers Tier list
------------------------------------------------------------------
 0              1    Ready                   512     1   1
 1              1    Ready                   512     1   2
 2              1    Ready                   512     1   3
 3              1    Ready                   512     1   4
 4              2    Ready [GHS]                     1   5
 5              2    Ready [GHS]                     1   6
 6              2    Critical                512     1   7
 7              2    Critical                        1   8
10              1    Cache Locked        64  512     1   1
11              1    Cache Locked        64  512     1   2
12              1    Cache Locked        64  512     1   3
13              1    Cache Locked        64  512     1   4
14              2    Ready [GHS]         64  512     1   5
15              2    Ready [GHS]         64  512     1   6
16              2    Critical            64  512     1   7
17              2    Critical            64  512     1   8
 System verify extent: 16 Mbytes
 System verify delay:  30