FAQ - Sizing
(Updated: Dec 2009)
What is the maximum file system size? What is the largest file system you've tested?
Each backend OST file system is limited to 8 TB on Linux 2.6, a limit imposed by ext3. File systems based on ext4 (SLES11) will soon be able to handle single OSTs of up to 16 TB. A single OSS can host multiple OSTs, and a single Lustre file system can aggregate multiple OSSs. Tests have been run with almost 4000 smaller OSTs, so file systems of 32 PB (with 8 TB OSTs) or 64 PB (with 16 TB OSTs) could be achieved today.
Lustre users already run single production file systems of over 10 PB, using over 1300 OSTs of 8 TB each.
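The capacity figures in this answer follow from multiplying the OST count by the per-OST size. A minimal sketch of that arithmetic, assuming decimal units (1 PB = 1000 TB) and identically sized OSTs; the function name is illustrative and not part of any Lustre tool:

```python
TB_PER_PB = 1000  # assumption: decimal units, 1 PB = 1000 TB


def aggregate_capacity_pb(ost_count, ost_size_tb):
    """Total capacity in PB for ost_count OSTs of ost_size_tb each."""
    return ost_count * ost_size_tb / TB_PER_PB


print(aggregate_capacity_pb(4000, 8))   # 4000 OSTs at the ext3 limit
print(aggregate_capacity_pb(4000, 16))  # 4000 OSTs at the ext4 limit
print(aggregate_capacity_pb(1300, 8))   # the production example
```

With these inputs the three results are 32, 64, and 10.4 PB, matching the figures quoted in the text.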
What is the maximum file system block size?
The basic ext3 block size is 4096 bytes, although in ext4 this could in principle easily be raised to as much as PAGE_SIZE (on IA64 or PPC, for example, where pages can be larger than 4 kB). It is not clear, however, that this is necessary.
Some people confuse block size with extent size or I/O request size -- they are not the same thing. The block size is the basic unit of disk allocation, and for our purposes it seems that 4kB is as good as any. The size of a single file extent, by definition, is almost always larger than 4kB, and ldiskfs extents and mballoc features used by Lustre do a good job of allocating I/O aligned and contiguous disk blocks whenever possible. The I/O request size (the amount of data that we try to read or write in a single request) is usually 1MB (aligned to the start of the LUN) and can be further aggregated by the disk elevator or RAID controller.
What is the maximum single-file size?
On 32-bit clients, the page cache makes it quite difficult to read or write a single file larger than 8 TB. On 64-bit clients, the maximum file size is 2^63 bytes. The current Lustre limit on allocated file space comes from a maximum of 160 stripes per file and a 2 TB limit on each stripe's object (a single file on a current ldiskfs file system), giving a limit of 320 TB per file.
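The 320 TB figure is the product of the stripe limit and the per-stripe size limit. A small sketch of that calculation, assuming the 2 TB ldiskfs limit applies to each stripe's backing object; the constant and function names are made up for illustration:

```python
MAX_STRIPES = 160        # maximum stripes per file (from the text)
MAX_OBJECT_SIZE_TB = 2   # assumption: 2 TB limit applies to each stripe's object


def max_allocated_file_size_tb(stripes=MAX_STRIPES,
                               object_limit_tb=MAX_OBJECT_SIZE_TB):
    """Largest allocated file size in TB when every stripe is full."""
    return stripes * object_limit_tb


def offset_limit_eib(bits=63):
    """Size addressable with a signed 64-bit offset, in EiB (2^60 bytes)."""
    return 2**bits / 2**60


print(max_allocated_file_size_tb())  # 320
print(offset_limit_eib())            # 8.0
```

The second function shows why the 2^63-byte limit on 64-bit clients is not the practical constraint: it is far larger than what 160 full stripes can hold.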
What is the maximum number of files in a single file system? In a single directory?
We use the ext3 hashed directory code, which has a theoretical limit of about 15 million files per directory, at which point the directory grows to more than 2 GB. The maximum number of subdirectories in a single directory is the same as the file limit, about 15 million.
We regularly run tests with ten million files in a single directory. On a properly configured quad-socket MDS with 32 GB of RAM, it is possible to do random lookups in this directory at a rate of 20,000 per second.
A single MDS imposes an upper limit of 4 billion inodes, but the default limit is slightly less than the device size divided by 4 kB, so about 2 billion inodes for an 8 TB MDS file system. This can be increased at initial file system creation time by specifying mkfs options to increase the number of inodes. Production file systems containing over 300 million files exist.
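The default inode count described above can be estimated as the device size divided by 4 kB, capped at the MDS limit. The sketch below assumes binary units (1 TiB = 2^40 bytes) and reads "4 billion" as 2^32; both are assumptions, and the result is an upper bound because the text says the real default is slightly lower:

```python
BYTES_PER_INODE = 4096     # default ratio described in the text (4 kB per inode)
MDS_INODE_CEILING = 2**32  # assumption: "4 billion" interpreted as 2^32
TIB = 2**40                # assumption: binary units for device size


def default_inode_estimate(device_bytes, bytes_per_inode=BYTES_PER_INODE):
    """Upper-bound estimate of the inodes available on an MDS device."""
    return min(device_bytes // bytes_per_inode, MDS_INODE_CEILING)


print(default_inode_estimate(8 * TIB))                        # 2147483648
print(default_inode_estimate(8 * TIB, bytes_per_inode=2048))  # 4294967296
```

Halving the bytes-per-inode ratio (the hypothetical bytes_per_inode parameter here stands in for whatever mkfs option is used) doubles the estimate until it reaches the ceiling.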
With the introduction of clustered metadata servers (Lustre 2.x) and with ZFS-based MDTs, these limits will disappear.
How many OSSs do I need?
The short answer is: As many as you need to achieve the required aggregate I/O throughput.
The long answer is: Each OSS contributes to the total capacity and the aggregate throughput. For example, a 100 TB file system may use 100 single-GigE-connected OSS nodes, each with 1 TB of storage capable of 100 MB/s, providing 10 GB/s of aggregate bandwidth. The same bandwidth and capacity could be provided by four heavy-duty 25 TB OSS servers with three Elan 4 interfaces and 16 FC2 channels, each providing ~2.5 GB/s. In that case, the 25 TB of storage behind each server must itself be capable of 2.5 GB/s.
Each OSS can support a very large number of clients, so we do not advise our customers to use any particular client-to-OSS ratio. Nevertheless, it is common to deploy 1 GB/s of OSS throughput per 1 TFLOP/s of compute power.
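The sizing reasoning in this answer can be sketched as a small calculation: take the larger of the OSS counts required by bandwidth and by capacity. This is only an illustration, assuming 1 GB/s = 1000 MB/s and identical OSS nodes; the function names are hypothetical:

```python
def ceil_div(a, b):
    """Integer ceiling division for positive integers."""
    return -(-a // b)


def oss_nodes_needed(target_bw_mb_s, target_capacity_tb,
                     oss_bw_mb_s, oss_capacity_tb):
    """Number of identical OSS nodes needed to meet both targets."""
    by_bandwidth = ceil_div(target_bw_mb_s, oss_bw_mb_s)
    by_capacity = ceil_div(target_capacity_tb, oss_capacity_tb)
    return max(by_bandwidth, by_capacity)


def target_bandwidth_mb_s(compute_tflops, gb_s_per_tflop=1):
    """Rule of thumb from the text: 1 GB/s of OSS throughput per TFLOP/s."""
    return compute_tflops * gb_s_per_tflop * 1000


print(oss_nodes_needed(10_000, 100, 100, 1))     # 100 GigE-class nodes
print(oss_nodes_needed(10_000, 100, 2_500, 25))  # 4 heavy-duty servers
print(target_bandwidth_mb_s(10))                 # 10000 MB/s for 10 TFLOP/s
```

Both configurations from the example meet the same 10 GB/s and 100 TB targets; the choice between them is a matter of hardware cost and manageability, not of Lustre limits.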
What is the largest possible I/O request?
When most people ask this question, they want to know the maximum buffer size that can safely be passed to a single read() or write() system call. In principle this is limited only by the address space on the client. We have tested single read() and write() system calls of up to 1 GB, so it has not been an issue in practice.
The Lustre I/O subsystem is designed with the understanding that an I/O request travels through many pipelines, and that it's important to keep all pipelines full for maximum performance. So it is not necessary to teach your application to do I/O in very large chunks; Lustre and the page cache will aggregate I/O for you.
Typically, Lustre client nodes will do their best to aggregate I/O into 1 MB chunks on the wire, and to keep between 5 and 10 I/O requests "in flight" at a time, per server. There is still a per-syscall overhead for locking and such, so using 1MB or larger read/write requests will minimize this overhead.
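For applications that want to keep per-syscall overhead low, issuing reads and writes in 1 MB units is straightforward. The following is a generic sketch, not Lustre-specific code: it copies a file with one read() and one write() per 1 MiB chunk using unbuffered Python file objects. The chunk size constant and function name are illustrative choices.

```python
CHUNK_SIZE = 1 << 20  # 1 MiB per request, matching the typical wire chunk size


def copy_in_chunks(src_path, dst_path, chunk_size=CHUNK_SIZE):
    """Copy src_path to dst_path, issuing one read()/write() per chunk.

    Returns the number of bytes copied.
    """
    copied = 0
    # buffering=0 gives raw file objects, so each call maps to one syscall.
    with open(src_path, "rb", buffering=0) as src, \
            open(dst_path, "wb", buffering=0) as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:  # empty bytes object means end of file
                break
            view = memoryview(chunk)
            while view:  # retry if the raw write was short
                written = dst.write(view)
                view = view[written:]
            copied += len(chunk)
    return copied
```

Smaller requests would still work, since the client aggregates I/O, but each extra system call pays the locking overhead again.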
On the OSS, we have gone to significant effort to ensure that these large 1 MB buffers do not get unnecessarily broken up by lower kernel layers. In Linux 2.4, modifications were required to the SCSI layer, block devices, and the QLogic Fibre Channel driver. In newer kernels there is less need to modify the kernel, though the block device tunables are still set for low-latency desktop workloads by default, and need to be tuned for high-bandwidth I/O.
How many nodes can connect to a single Lustre file system?
The largest single production Lustre installation today (2009) has approximately 26,000 nodes. It is the site-wide file system for a handful of different supercomputers.
Although this is the largest set of clusters available to us today, we believe that the architecture is fundamentally capable of supporting many more clients.