
Lustre DDN Tuning

<small>''(Updated: Feb 2010)''</small>

<small>''DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT''</small>

<small>''This content was submitted by an external contributor. We provide this information as a resource for the Lustre™ open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.''</small>

----

This page provides guidelines to configure DDN storage arrays for use with Lustre. For more complete information on DDN tuning, refer to the performance management section of the [http://www.ddnsupport.com/manuals.html DDN manual] for your product.

This section covers the following DDN arrays:
* S2A 8500
* S2A 9500
* S2A 9550

=== Setting Readahead and MF ===
For the S2A DDN 8500 storage array, we recommend that you disable readahead. In a 1000-client system, if each client has up to 8 read RPCs in flight, then this is 8 * 1000 * 1 MB = 8 GB of reads in flight. With a DDN cache in the range of 2 to 5 GB (depending on the model), it is unlikely that the LUN-based readahead would have ANY cache hits even if the file data were contiguous on disk (generally, file data is not contiguous). The Multiplication Factor (MF) also influences the readahead; you should disable it.
 
CLI commands for the DDN are:
 
cache prefetch=0
cache MF=off
 
For the S2A 9500 and S2A 9550 DDN storage arrays, we recommend that you use the above commands to disable readahead.
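
As a quick sanity check of the reasoning above, the following sketch (an illustration only, using the example figures from this section: 1000 clients, 8 read RPCs in flight per client, 1 MB per RPC, and a best-case 5 GB controller cache) compares the read data already in flight with the size of the DDN cache:

 #!/bin/bash
 # Illustrative calculation only; the values are the examples quoted above.
 clients=1000
 rpcs_per_client=8
 rpc_size_mb=1
 cache_mb=$((5 * 1024))    # best case: a 5 GB controller cache
 in_flight_mb=$((clients * rpcs_per_client * rpc_size_mb))
 echo "reads in flight:  ${in_flight_mb} MB"
 echo "controller cache: ${cache_mb} MB"
 if [ "${in_flight_mb}" -gt "${cache_mb}" ]; then
     echo "in-flight reads exceed the cache, so LUN readahead is unlikely to get hits"
 fi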
 
===Setting Segment Size===
The cache segment size noticeably affects I/O performance. Set the cache segment size differently on the MDT (which does small, random I/O) and on the OST (which does large, contiguous I/O). In customer testing, we have found the optimal values to be 64 KB for the MDT and 1 MB for the OST.
 
'''''Note:''''' The cache size parameter is common to all LUNs on a single DDN and cannot be changed on a per-LUN basis.
 
These are CLI commands for the DDN.
 
*For the MDT LUN:
 
$ cache size=64
The size is in KB; valid values are 64, 128, 256, 512, 1024, and 2048. The default is 128.
 
*For the OST LUN:
 
$ cache size=1024
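
Because the setting is array-wide, an array that hosts both MDT and OST LUNs cannot get the ideal value for each. The helper below is a hypothetical sketch of our own (not a DDN tool; the script name and interface are invented for illustration) that prints the suggested value for an array given the roles of its LUNs:

 #!/bin/bash
 # suggest-cache-size.sh -- hypothetical helper, not part of the DDN CLI.
 # Prints a cache segment size suggestion for one array, based on whether it
 # hosts MDT LUNs, OST LUNs, or both, using the values recommended above.
 has_mdt=false; has_ost=false
 for role in "$@"; do
     case "${role}" in
         mdt) has_mdt=true ;;
         ost) has_ost=true ;;
         *)   echo "usage: $0 <mdt|ost> [mdt|ost ...]" >&2; exit 1 ;;
     esac
 done
 if ${has_mdt} && ${has_ost}; then
     echo "array hosts both MDT and OST LUNs: the setting is array-wide, so pick one compromise value and benchmark it"
 elif ${has_mdt}; then
     echo "cache size=64"
 elif ${has_ost}; then
     echo "cache size=1024"
 else
     echo "usage: $0 <mdt|ost> [mdt|ost ...]" >&2; exit 1
 fi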
 
===Setting Write-Back Cache===
Performance is noticeably improved by running Lustre with the write-back cache turned on. The risk is that if the DDN controller crashes, you must run ''e2fsck''; even so, the recovery typically costs less time than the ongoing performance penalty of running with the write-back cache turned off.
 
For increased data security and in failover configurations, you may prefer to run with the write-back cache off. However, you might then see performance problems with the small writes during journal flush. In this mode, it is highly beneficial to increase the number of OST service threads by setting ''options ost ost_num_threads=512'' in ''/etc/modprobe.conf''. Make sure the OSS node has enough RAM (about 1.5 MB per thread is preallocated for I/O buffers). Having more I/O threads allows more I/O requests to be in flight, waiting for the disk to complete the synchronous write.
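
For example (a sketch only; 512 is the thread count suggested above, and the 1.5 MB-per-thread figure is the estimate from this section), you can gauge the buffer memory a given thread count implies before committing it to ''/etc/modprobe.conf'':

 #!/bin/bash
 # Sketch: print the modprobe.conf line for a given OST thread count and a
 # rough estimate of the I/O buffer RAM it preallocates (about 1.5 MB/thread).
 threads=${1:-512}
 echo "options ost ost_num_threads=${threads}    # line for /etc/modprobe.conf"
 awk -v t="${threads}" 'BEGIN { printf "~%d MB preallocated for I/O buffers on the OSS\n", t * 1.5 }'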
 
You have to decide whether performance is more important than the slight risk of data loss and downtime in case of a hardware/software problem on the DDN.
 
'''''Note:''''' An OSS/MDS node crash poses no such risk; the risk arises only if the DDN itself fails.
 
===Setting maxcmds===
 
For the S2A DDN 8500 array, changing ''maxcmds'' to 4 (from the default 2) improved write performance by as much as 30 percent in a particular case. This only works with SATA-based disks and when only one controller of the pair is actually accessing the shared LUNs.
 
However, this setting comes with a warning: DDN support does not recommend changing it from the default. When the value was increased to 5, the same setup experienced serious problems.
 
The DDN CLI command is shown below (the default value is 2).
 
$ disk maxcmds=3
 
For S2A DDN 9500/9550 hardware and above, you can safely change the default from 6 to 16. Although the maximum value is 32, values higher than 16 are not currently recommended by DDN support.
 
'''''Note:''''' For help determining an appropriate ''maxcmds'' value, refer to the PDF provided with the DDN firmware. This PDF lists recommended values for that specific firmware version.
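
As a memory aid, the sketch below (a hypothetical helper of our own, not a DDN utility) prints a starting point per array model based on the values discussed in this section; always defer to the recommendations shipped with your DDN firmware:

 #!/bin/bash
 # maxcmds-hint.sh -- hypothetical helper, not a DDN utility. Prints a
 # conservative 'disk maxcmds=' starting point per array model, per the text.
 model="$1"
 case "${model}" in
     8500)      echo "disk maxcmds=3    # default 2; SATA only, one active controller; 5 caused serious problems at one site" ;;
     9500|9550) echo "disk maxcmds=16   # default 6; DDN support advises staying at or below 16" ;;
     *)         echo "usage: $0 <8500|9500|9550>" >&2; exit 1 ;;
 esac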
 
===Further Tuning Tips===
 
Here are some tips we have drawn from testing at a large installation:


* Use the full device instead of a partition (''sda'' versus ''sda1''). When using the full device, Lustre writes nicely-aligned 1 MB chunks to disk. Partitioning the disk can destroy this alignment and will noticeably impact performance.
* Separate the ''ext3'' OST into two LUNs: a small LUN for the ''ext3'' journal and a big one for the "data".
* Since Lustre 1.0.4, you can supply ''ext3'' ''mkfs'' options (such as ''-j'' and ''-J'') when creating the OST, as in the following example (where ''/dev/sdj'' has previously been formatted as a journal device); a fuller sketch follows the command. The journal size should not be larger than 1 GB (262144 4 KB blocks), as it can consume up to this amount of RAM on the OSS node per OST.

  $ mke2fs -O journal_dev -b 4096 /dev/sdj [optional size]
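
A slightly fuller sketch (device names are placeholders: ''/dev/sdj'' is the journal LUN and ''/dev/sdi'' stands in for the OST data LUN; the 262144-block figure is the 1 GB limit mentioned above):

 # Create the external journal device, capped at 1 GB (262144 blocks of 4 KB):
 $ mke2fs -O journal_dev -b 4096 /dev/sdj 262144
 # Build the ext3 OST file system and point its journal at that device:
 $ mke2fs -j -J device=/dev/sdj -b 4096 /dev/sdi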

'''''Tip:''''' On the S2A DDN 8500 storage array, it is very important to create one OST per TIER, especially in write-through mode (see the example output below). This matters if you have many tiers: with 16 tiers, create 16 OSTs of one tier each instead of eight OSTs of two tiers each.

* On the S2A DDN 9500 and 9550 storage arrays, performance is significantly better with two tiers per LUN.
* Do '''NOT''' partition the DDN LUNs, as this causes all I/O to the LUNs to be misaligned by 512 bytes. The DDN RAID stripes and cache lines are aligned on 1 MB boundaries. Having a partition table on the LUN causes every 1 MB write to do a read-modify-write on an extra chunk, and every 1 MB read to read 2 MB from disk into the cache instead, causing a noticeable performance loss.
* You do not need to lock the small LUNs in cache.
* Configure the MDT on a separate volume configured as RAID 1+0. This reduces the MDT I/O and doubles the seek speed.

----
For example, one OST per tier:

{| border=1 cellspacing=0
|-
! LUN !! Owner !! Status !! Capacity (Mbytes) !! Block Size !! Tiers !! Tier List
|-
|0||1||Ready||102400||512||1||1
|-
|1||1||Ready||102400||512||1||2
|-
|2||1||Ready||102400||512||1||3
|-
|3||1||Ready||102400||512||1||4
|-
|4||2||Ready [GHS]||102400||4096||1||5
|-
|5||2||Ready [GHS]||102400||4096||1||6
|-
|6||2||Critical||102400||512||1||7
|-
|7||2||Critical||102400||4096||1||8
|-
|10||1||Cache Locked||64||512||1||1
|-
|11||1||Ready||64||512||1||2
|-
|12||1||Cache Locked||64||512||1||3
|-
|13||1||Cache Locked||64||512||1||4
|-
|14||2||Ready [GHS]||64||512||1||5
|-
|15||2||Ready [GHS]||64||512||1||6
|-
|16||2||Ready [GHS]||64||4096||1||7
|-
|17||2||Ready [GHS]||64||4096||1||8
|}

System verify extent: 16 Mbytes

System verify delay: 30
