WARNING: This is the _old_ Lustre wiki, and it is in the process of being retired. The information found here is all likely to be out of date. Please search the new wiki for more up to date information.

Lustre DDN Tuning


Latest revision as of 10:11, 22 February 2010

(Updated: Feb 2010)

DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT

This content was submitted by an external contributor. We provide this information as a resource for the Lustre™ open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.


This page provides guidelines to configure DDN storage arrays for use with Lustre. For more complete information on DDN tuning, refer to the performance management section of the DDN manual for your product.

This section covers the following DDN arrays:

  • S2A 8500
  • S2A 9500
  • S2A 9550

Setting Readahead and MF

For the S2A DDN 8500 storage array, we recommend that you disable readahead. In a 1000-client system, if each client has up to 8 read RPCs in flight, then this is 8 * 1000 * 1 MB = 8 GB of reads in flight. With a DDN cache in the range of 2 to 5 GB (depending on the model), it is unlikely that the LUN-based readahead would have ANY cache hits even if the file data were contiguous on disk (generally, file data is not contiguous). The Multiplication Factor (MF) also influences the readahead; you should disable it.

CLI commands for the DDN are:

cache prefetch=0
cache MF=off

For the S2A 9500 and S2A 9550 DDN storage arrays, we recommend that you use the above commands to disable readahead.
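The arithmetic behind this recommendation is easy to reproduce; the sketch below just restates the numbers from the paragraph above, with no DDN-specific commands involved.

```shell
#!/bin/sh
# Worst-case read data in flight, per the 1000-client example above.
CLIENTS=1000
RPCS_PER_CLIENT=8
RPC_SIZE_MB=1
IN_FLIGHT_MB=$(( CLIENTS * RPCS_PER_CLIENT * RPC_SIZE_MB ))
echo "reads in flight: ${IN_FLIGHT_MB} MB (about 8 GB)"
# A 2-5 GB controller cache cannot hold that working set, so
# LUN-based readahead almost never produces a cache hit.
```

With the in-flight read volume exceeding the entire controller cache, prefetched data is evicted before any client asks for it, which is why disabling readahead costs nothing here.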

Setting Segment Size

The cache segment size noticeably affects I/O performance. Set the cache segment size differently on the MDT (which does small, random I/O) and on the OST (which does large, contiguous I/O). In customer testing, we have found the optimal values to be 64 KB for the MDT and 1 MB for the OST.

Note: The cache size parameter is common to all LUNs on a single DDN and cannot be changed on a per-LUN basis.

These are CLI commands for the DDN.

  • For the MDT LUN:

$ cache size=64

The size is in KB; valid values are 64, 128, 256, 512, 1024, and 2048. The default is 128.

  • For the OST LUN:

$ cache size=1024

Setting Write-Back Cache

Performance is noticeably improved by running Lustre with the write-back cache turned on. The risk is that if the DDN controller crashes, you must run e2fsck; even so, that recovery typically takes less time than the cumulative performance penalty of running with the write-back cache turned off.

For increased data security, and in failover configurations, you may prefer to run with the write-back cache off. However, you might then see performance problems with the small writes during journal flush. In this mode, it is highly beneficial to increase the number of OST service threads by adding the line options ost ost_num_threads=512 to /etc/modprobe.conf. The OSS node should have enough RAM (about 1.5 MB per thread is preallocated for I/O buffers). Having more I/O threads allows more I/O requests to be in flight, each waiting for the disk to complete a synchronous write.
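As a rough sizing check for the thread-count increase above, the preallocated buffer RAM can be estimated. The 1.5 MB/thread figure is taken from the text; the option line shown is the one the text names, echoed here rather than written to /etc/modprobe.conf.

```shell
#!/bin/sh
# Option line from the text (append it to /etc/modprobe.conf on the OSS):
echo "options ost ost_num_threads=512"
# Estimate preallocated I/O buffer RAM: ~1.5 MB per service thread.
THREADS=512
BUFFER_KB_PER_THREAD=1536      # 1.5 MB, in KB, to stay in integer arithmetic
TOTAL_MB=$(( THREADS * BUFFER_KB_PER_THREAD / 1024 ))
echo "approx ${TOTAL_MB} MB preallocated for I/O buffers"   # 768 MB for 512 threads
```

This is why the text couples the thread increase to available OSS RAM: at 512 threads, roughly three quarters of a gigabyte is committed to I/O buffers alone.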

You have to decide whether performance is more important than the slight risk of data loss and downtime in case of a hardware/software problem on the DDN.

Note: An OSS/MDS node crash poses no risk of data loss; data is at risk only if the DDN itself fails.

Setting maxcmds

For the S2A DDN 8500 array, changing maxcmds from the default of 2 to 4 improved write performance by as much as 30 percent in one particular case. This works only with SATA-based disks, and only when a single controller of the pair is actually accessing the shared LUNs.

However, this setting comes with a warning: DDN support does not recommend changing it from the default. After increasing the value to 5, the same setup experienced serious problems.

The CLI command for the DDN is shown below (the default value is 2).

$ disk maxcmds=3

For S2A DDN 9500/9550 hardware and above, you can safely change the default from 6 to 16. Although the maximum value is 32, values higher than 16 are not currently recommended by DDN support.

Note: For help determining an appropriate maxcmds value, refer to the PDF provided with the DDN firmware. This PDF lists recommended values for that specific firmware version.

Further Tuning Tips

Here are some tips we have drawn from testing at a large installation:

  • Use the full device instead of a partition (sda versus sda1). When using the full device, Lustre writes nicely-aligned 1 MB chunks to disk. Partitioning the disk can destroy this alignment and will noticeably impact performance.
  • Separate the ext3 OST into two LUNs, a small LUN for the ext3 journal and a big one for the "data".
  • Since Lustre 1.0.4, you can supply ext3 mkfs options (such as -j and -J) when creating the OST, as shown below (where /dev/sdj has previously been formatted as a journal device). The journal size should not exceed 1 GB (262144 4 KB blocks), as the journal can consume up to this amount of RAM on the OSS node per OST.
$ mke2fs -O journal_dev -b 4096 /dev/sdj [optional size]
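The 1 GB bound above can be checked with the block arithmetic, and the two formatting steps written out. The data device name /dev/sdk is a hypothetical addition; only /dev/sdj appears in the text.

```shell
#!/bin/sh
# 262144 blocks of 4 KB = the 1 GB journal upper bound from the text.
BLOCK_SIZE=4096
JOURNAL_BLOCKS=262144
JOURNAL_MB=$(( BLOCK_SIZE * JOURNAL_BLOCKS / 1024 / 1024 ))
echo "journal size: ${JOURNAL_MB} MB"
# The corresponding commands (echoed, not executed; device names illustrative):
echo "mke2fs -O journal_dev -b ${BLOCK_SIZE} /dev/sdj ${JOURNAL_BLOCKS}"
echo "mke2fs -j -J device=/dev/sdj -b ${BLOCK_SIZE} /dev/sdk"
```

Passing the block count explicitly caps the journal device at the recommended size even if the underlying LUN is larger.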

Tip: On the S2A DDN 8500 storage array, create one OST per tier, especially in write-through mode (see the example output below). This matters if you have 16 tiers: create 16 OSTs of one tier each, instead of eight OSTs of two tiers each.

  • Performance is significantly better on the S2A DDN 9500 and 9550 storage arrays with two tiers per LUN.

  • Do NOT partition the DDN LUNs, as this misaligns all I/O to the LUNs by 512 bytes. The DDN RAID stripes and cache lines are aligned on 1 MB boundaries; with a partition table on the LUN, every 1 MB write does a read-modify-write on an extra chunk, and every 1 MB read instead reads 2 MB from disk into the cache, causing a noticeable performance loss.

  • You are not obliged to lock the small LUNs in cache.

  • Configure the MDT on a separate volume configured as RAID 1+0. This reduces MDT I/O and doubles the seek speed.
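The misalignment described in the partitioning item above can be sketched with simple sector arithmetic. The start sector is an assumption for illustration: a bare partition table pushes the first partition's data at least one 512-byte sector into the device.

```shell
#!/bin/sh
# Illustration only: why a partition table breaks 1 MB alignment.
START_SECTOR=1                 # assumed: data begins one sector after the table
SECTOR_BYTES=512
CHUNK_BYTES=$(( 1024 * 1024 )) # DDN RAID stripes/cache lines align on 1 MB
OFFSET=$(( START_SECTOR * SECTOR_BYTES ))
REMAINDER=$(( OFFSET % CHUNK_BYTES ))
if [ "$REMAINDER" -eq 0 ]; then
  echo "partition start is 1 MB aligned"
else
  # Every 1 MB I/O now straddles two stripe/cache-line chunks.
  echo "partition start misaligned by ${REMAINDER} bytes"
fi
```

Using the whole device (sda rather than sda1) keeps the offset at zero, so every 1 MB Lustre write maps onto exactly one DDN chunk.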


For example, one OST per tier:

LUN  Label  Owner  Status        Capacity (MB)  Block Size  Tiers  Tier List
 0          1      Ready         102400         512         1      1
 1          1      Ready         102400         512         1      2
 2          1      Ready         102400         512         1      3
 3          1      Ready         102400         512         1      4
 4          2      Ready [GHS]   102400         4096        1      5
 5          2      Ready [GHS]   102400         4096        1      6
 6          2      Critical      102400         512         1      7
 7          2      Critical      102400         4096        1      8
10          1      Cache Locked      64         512         1      1
11          1      Ready             64         512         1      2
12          1      Cache Locked      64         512         1      3
13          1      Cache Locked      64         512         1      4
14          2      Ready [GHS]       64         512         1      5
15          2      Ready [GHS]       64         512         1      6
16          2      Ready [GHS]       64         4096        1      7
17          2      Ready [GHS]       64         4096        1      8

System verify extent: 16 Mbytes

System verify delay: 30