Lustre Tuning

(Updated: Feb 2010)

Many options in Lustre™ are set by means of kernel module parameters. These parameters are contained in the modprobe.conf file (on SuSE, this may be modprobe.conf.local).

OSS Service Thread Count
The oss_num_threads parameter allows the number of OST service threads to be specified at module load time on the OSS nodes:

options ost oss_num_threads={N}

An OSS can have a maximum of 512 service threads and a minimum of 2 service threads. The number of service threads is a function of how much RAM and how many CPUs are on each OSS node (1 thread / 128MB * num_cpus). If the load on the OSS node is high, new service threads will be started in order to process more requests concurrently, up to 4x the initial number of threads (subject to the maximum of 512). For a 2GB 2-CPU system, the default thread count is 32 and the maximum thread count is 128.
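The sizing rule above can be sketched as a short calculation. This is only an illustration, assuming the formula is applied to total RAM in MB with integer division; the function names are hypothetical and are not part of Lustre.

```python
# Limits quoted in the text above.
OSS_MIN_THREADS = 2
OSS_MAX_THREADS = 512


def default_oss_threads(ram_mb: int, num_cpus: int) -> int:
    """Initial thread count: 1 thread per 128 MB of RAM, per CPU, clamped to 2..512."""
    threads = (ram_mb // 128) * num_cpus
    return max(OSS_MIN_THREADS, min(threads, OSS_MAX_THREADS))


def max_dynamic_oss_threads(ram_mb: int, num_cpus: int) -> int:
    """Upper bound for dynamically started threads: 4x the initial count, capped at 512."""
    return min(4 * default_oss_threads(ram_mb, num_cpus), OSS_MAX_THREADS)


# The 2GB, 2-CPU example from the text.
print(default_oss_threads(2048, 2))      # 32
print(max_dynamic_oss_threads(2048, 2))  # 128
```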

Increasing the size of the thread pool may help when:
 * Several OSTs are exported from a single OSS
 * Back-end storage is running synchronously
 * I/O completions take excessive time due to slow storage

Decreasing the size of the thread pool may help if:
 * The clients are overwhelming the storage capacity
 * There are lots of "slow I/O" or similar messages

Increasing the number of I/O threads allows the kernel and storage to aggregate many writes together for more efficient disk I/O. The OSS thread pool is shared—each thread allocates approximately 1.5 MB (maximum RPC size + 0.5 MB) for internal I/O buffers.

It is very important to consider memory consumption when increasing the thread pool size. Drives are only able to sustain a certain amount of parallel I/O activity before performance is degraded due to the high number of seeks and the OST threads just waiting for I/O. In this situation, it may be advisable to decrease the load by decreasing the number of OST threads.
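To get a rough feel for the memory cost, the per-thread buffer figure quoted above (maximum RPC size + 0.5 MB, about 1.5 MB) can be multiplied out. The sketch below assumes a 1 MB maximum RPC size, which is implied by the 1.5 MB figure but not stated directly; the function name is hypothetical.

```python
def oss_thread_buffer_mb(num_threads: int, max_rpc_mb: float = 1.0) -> float:
    """Approximate I/O buffer memory (in MB) for a given number of OSS threads.

    Assumes each thread allocates (maximum RPC size + 0.5 MB), as described above.
    """
    return num_threads * (max_rpc_mb + 0.5)


print(oss_thread_buffer_mb(128))  # 192.0
print(oss_thread_buffer_mb(512))  # 768.0
```

At the 512-thread maximum this comes to roughly 768 MB of buffers, which is a large share of memory on a small OSS node.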

Determining the optimum number of OST threads is a process of trial and error. You may want to start with a number of OST threads equal to the number of actual disk spindles on the node. If you use RAID, subtract any spindles not used for actual data (e.g., 1 of N spindles for RAID5, 2 of N spindles for RAID6), and monitor the performance of clients during usual workloads. If performance is degraded, increase the thread count and measure again, repeating until performance either reaches a satisfactory level or starts to degrade again.
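The starting point described above can be worked out as follows. This is a minimal sketch that assumes the parity deduction applies once per RAID group; the function name and the input layout are hypothetical.

```python
# Spindles per RAID group that do not hold data, per the text above.
PARITY_SPINDLES = {"none": 0, "raid5": 1, "raid6": 2}


def starting_ost_threads(raid_groups):
    """Suggested initial OST thread count: the number of data spindles on the node.

    raid_groups is a list of (raid_level, spindles_in_group) tuples.
    """
    total = 0
    for level, spindles in raid_groups:
        total += spindles - PARITY_SPINDLES[level]
    return total


# Example: an OSS with two 10-disk RAID6 groups has 16 data spindles.
print(starting_ost_threads([("raid6", 10), ("raid6", 10)]))  # 16
```

The result is only a first guess for the oss_num_threads value; the trial-and-error tuning described above still applies.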

MDS Threads
There is a similar parameter for the number of MDS service threads:

options mds mds_num_threads={N}

At this time, no testing has been done to determine the optimal number of MDS threads. The default number varies based on the server size up to a maximum of 32. The maximum number of threads (MDS_MAX_THREADS) is 512.

Note: The OSS and MDS automatically start new service threads in response to server load, up to a factor of 4 above the initial count. The default is calculated the same way as before. Setting the *_num_threads module parameter (oss_num_threads or mds_num_threads) disables the automatic thread creation behavior.

LNET Tunables
Transmit and receive buffer size: With Lustre release 1.4.7 and later, ksocklnd has separate parameters for the transmit and receive buffers.

options ksocklnd tx_buffer_size=0 rx_buffer_size=0

If these parameters are left at the default (0), the system automatically tunes the transmit and receive buffer size. In almost every case, the defaults produce the best performance. Do not attempt to tune this unless you are a network expert.

irq_affinity: By default, this parameter is on. In the normal case on an SMP system, we would like our network traffic to remain local to a single CPU. This helps to keep the processor cache warm and minimizes the impact of context switches. This is especially helpful when an SMP system has more than one network interface and ideal when the number of interfaces equals the number of CPUs.

If you have an SMP platform with a single fast interface, such as 10Gb Ethernet, and more than two CPUs, you may see performance improve by turning this parameter off. As always, test to compare the impact.

Options for Formatting MDS and OST
The backing file systems on the MDS and OSTs are independent of each other, so the formatting parameters for them should not be the same. The size of the MDS backing file system depends solely on how many inodes you want in the total Lustre file system. It is not related to the size of the aggregate OST space.

Planning for Inodes
Every time you create a file on a Lustre file system, it consumes one inode on the MDS and one inode for each OST object that the file is striped over (normally it is based on the default stripe count option -c, but this may change on a per-file basis). In ext3/ldiskfs file systems, inodes are pre-allocated, so creating a new file does not consume any of the free blocks. However, this also means that the format-time options should be conservative as it is not possible to increase the number of inodes after the file system is formatted. But it is possible to add OSTs with additional space and inodes to the file system.

To be on the safe side, plan for 4KB per inode on the MDS. This is the default. For the OST, the amount of space taken by each object depends entirely upon the usage pattern of the users/applications running on the system. Lustre, by necessity, defaults to a very conservative estimate for the object size (16KB per object). You can almost always increase this for file system installations. Many Lustre file systems have average file sizes over 1MB per object.

Sizing the MDT
When calculating the MDS size, the only important factor is the average size of files to be stored in the file system. If the average file size is, for example, 5MB and you have 100TB of usable OST space, then you need at least 100TB * 1024GB/TB * 1024MB/GB / 5MB/inode = 20 million inodes. We recommend that you have twice the minimum, that is, 40 million inodes in this example. At the default 4KB per inode, this works out to only 160GB of space for the MDS.
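The arithmetic in this example can be reproduced with a few lines of code. The sketch below uses binary units (1 TB = 1024 * 1024 MB) as in the text; the function name and parameter names are hypothetical.

```python
def mdt_size_estimate(ost_space_tb: int, avg_file_mb: int,
                      bytes_per_inode: int = 4096, safety_factor: int = 2):
    """Return (minimum inodes, recommended inodes, MDS space in GB)."""
    ost_space_mb = ost_space_tb * 1024 * 1024
    # Ceiling division: one inode per average-sized file.
    min_inodes = -(-ost_space_mb // avg_file_mb)
    recommended_inodes = safety_factor * min_inodes
    mds_space_gb = recommended_inodes * bytes_per_inode / 1024**3
    return min_inodes, recommended_inodes, mds_space_gb


# 100TB of usable OST space with a 5MB average file size.
print(mdt_size_estimate(100, 5))  # (20971520, 41943040, 160.0)
```

The result matches the figures above: about 20 million inodes as the minimum, about 40 million recommended, and 160GB of MDS space at 4KB per inode.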

Conversely, if you have a very small average file size, for example 4KB, Lustre is not very efficient. This is because you consume as much space on the MDS as you are consuming on the OSTs. This is not a very common configuration for Lustre.

Overriding Default Formatting Options
To override the default formatting options for any of the Lustre backing filesystems, use the --mkfsoptions='backing fs options' argument to mkfs.lustre to pass formatting options to the backing mkfs. For all options to format backing ext3 and ldiskfs filesystems, see the mke2fs(8) man page; this section only discusses some Lustre-specific options.

Number of Inodes for MDS
To override the inode ratio, use the option -i <bytes per inode> (for instance, --mkfsoptions='-i 4096' to create one inode per 4096 bytes of file system space). Alternatively, to specify an absolute number of inodes, use the -N <number of inodes> option. To avoid unintentional mistakes, do not specify the -i option with an inode ratio below one inode per 1024 bytes. Instead, use the -N option.

A 2TB MDS by default will have 512M inodes. The largest currently-supported file system size is 8TB, which would hold 2B inodes. With an MDS inode ratio of 1024 bytes per inode, a 2TB MDS would hold 2B inodes, and a 4TB MDS would hold 4B inodes, which is the maximum number of inodes currently supported by ext3.
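These inode counts follow directly from dividing the device size by the inode ratio. A minimal sketch, assuming binary units (1 TB = 1024^4 bytes) and a hypothetical function name:

```python
def inode_count(device_tb: int, bytes_per_inode: int) -> int:
    """Number of inodes created for a device of the given size and inode ratio."""
    return device_tb * 1024**4 // bytes_per_inode


M = 1024**2  # "M" as used in the text (binary millions)
B = 1024**3  # "B" as used in the text (binary billions)

print(inode_count(2, 4096) // M)  # 512  (2TB at the default 4KB ratio)
print(inode_count(8, 4096) // B)  # 2    (8TB at the default 4KB ratio)
print(inode_count(2, 1024) // B)  # 2    (2TB at 1024 bytes per inode)
print(inode_count(4, 1024) // B)  # 4    (4TB at 1024 bytes per inode)
```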

Inode Size for MDS
Lustre uses "large" inodes on the backing file systems in order to efficiently store Lustre metadata with each file. On the MDS, each inode is at least 512 bytes in size by default, while on the OST each inode is 256 bytes in size. Lustre (or, more specifically, the backing ext3 file system) also needs sufficient space left for other metadata like the journal (up to 400MB), bitmaps and directories. There are also a few regular files that Lustre uses to maintain cluster consistency.

To specify a larger inode size, use the -I <inodesize> option. We do NOT recommend specifying a smaller-than-default inode size, as this can lead to serious performance problems, and you cannot change this parameter after formatting the file system. The inode ratio must always be larger than the inode size.

Number of Inodes for OST
For OST file systems, it is normally advantageous to take local file system usage into account. Try to minimize the number of inodes created on each OST, while keeping enough margin for potential variance in future usage. This helps in reducing the format and e2fsck time, and makes more space available for data. The current default is to create one inode per 16KB of space in the OST file system, but in many environments, this is far too many inodes for the average file size. As a good rule of thumb, the OSTs should have at least:

num_ost_inodes = 4 * <num_mds_inodes> * <default_stripe_count> / <number_osts>

You can specify the number of inodes on the OST file systems via the -N option to --mkfsoptions. Alternatively, if you know the average file size, you can specify the OST inode ratio via -i <average_file_size / (number_of_stripes * 4)>. (For example, if the average file size is 16MB and there are by default 4 stripes per file, then --mkfsoptions='-i 1048576' would be appropriate.)
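The 16MB example can be checked with a short calculation. This sketch assumes the inode ratio is the average object size (average file size divided by stripe count) reduced by the same factor-of-4 margin used in the rule of thumb above; that reading is inferred from the worked example, and the function names are hypothetical.

```python
def ost_bytes_per_inode(avg_file_bytes: int, stripe_count: int, margin: int = 4) -> int:
    """Inode ratio for an OST: average object size divided by a safety margin."""
    return avg_file_bytes // (stripe_count * margin)


def mkfsoptions_for_ratio(bytes_per_inode: int) -> str:
    """Format the ratio as a --mkfsoptions argument for mkfs.lustre."""
    return f"--mkfsoptions='-i {bytes_per_inode}'"


# 16MB average file size, 4 stripes per file.
ratio = ost_bytes_per_inode(16 * 1024 * 1024, 4)
print(ratio)                         # 1048576
print(mkfsoptions_for_ratio(ratio))  # --mkfsoptions='-i 1048576'
```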