WARNING: This is the _old_ Lustre wiki, and it is in the process of being retired. The information found here is all likely to be out of date. Please search the new wiki for more up to date information.

Difference between revisions of "Lustre FAQ"

From Obsolete Lustre Wiki
All daemons on a single node run in the kernel, and therefore share a single address space. Daemons on different nodes communicate through RPC messages; large messages are sent using remote DMA if the fabric supports it.
 
== Sizing ==
[[FAQ - Sizing|Sizing]]
 
 
''' What is the maximum file system size? What is the largest file system you've tested? '''
 
 
 
Each backend OST file system is restricted to a maximum of 8 TB on Linux 2.6 (imposed by ''ext3'').
 
Filesystems that are based on ''ext4'' (SLES11) will soon be able to handle single OSTs of up to 16TB.
 
Of course, it is possible to have multiple OST backends on a single OSS, and to aggregate multiple OSSs within a single Lustre file system.  Tests with almost 4000 smaller OSTs have been run, so with 8 TB or 16 TB OSTs, file systems of 32 PB or 64 PB could be achieved today.
 
 
 
Lustre users already run single production file systems of over 10PB, using over 1300 8TB OSTs.
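The capacity figures above follow from multiplying the OST count by the per-OST limit. A minimal sketch of that arithmetic, assuming decimal units (1 PB = 1000 TB); the function name is illustrative and not part of any Lustre tool:

```python
# Sketch: aggregate capacity = number of OSTs x per-OST size limit.
# Assumption: decimal units, 1 PB = 1000 TB.
TB_PER_PB = 1000


def aggregate_capacity_pb(ost_count: int, ost_size_tb: int) -> float:
    """Return the total file system capacity in PB."""
    return ost_count * ost_size_tb / TB_PER_PB


print(aggregate_capacity_pb(4000, 8))   # ext3-based OSTs (8 TB each)
print(aggregate_capacity_pb(4000, 16))  # ext4-based OSTs (16 TB each)
print(aggregate_capacity_pb(1300, 8))   # the production example above
```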
 
 
 
''' What is the maximum file system block size? '''
 
 
 
The basic ''ext3'' block size is 4096 bytes, although this could in principle be easily changed to as large as PAGE_SIZE (on IA64 or PPC, for example) in ext4. It is not clear, however, that this is necessary.
 
 
 
Some people confuse block size with extent size or I/O request size -- they are not the same thing. The block size is the basic unit of disk allocation, and for our purposes it seems that 4kB is as good as any. The size of a single file extent, by definition, is almost always larger than 4kB, and ''ldiskfs'' extents and ''mballoc'' features used by Lustre do a good job of allocating I/O aligned and contiguous disk blocks whenever possible. The I/O request size (the amount of data that we try to read or write in a single request) is usually 1MB (aligned to the start of the LUN) and can be further aggregated by the disk elevator or RAID controller.
 
 
 
''' What is the maximum single-file size? '''
 
 
 
On 32-bit clients, the page cache makes it quite difficult to read or write a single file larger than 8 TB.  On 64-bit clients, the maximum file size is 2^63 bytes.  A current Lustre limit for allocated file space arises from a maximum of 160 stripes per file and 2 TB per object (each object is a single file on the backend ''ldiskfs'' file system), leading to a limit of 320 TB per file.
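The 320 TB figure is simply the stripe limit multiplied by the per-object limit. A small sketch of the calculation, using the two limits quoted in this answer as default arguments (the function name is illustrative):

```python
# Sketch: allocated-space limit per file = stripes x per-object limit.
def max_striped_file_tb(max_stripes: int = 160, object_limit_tb: int = 2) -> int:
    """Return the largest allocatable file size in TB."""
    return max_stripes * object_limit_tb


print(max_striped_file_tb())

# For comparison, the 64-bit offset limit of 2^63 bytes expressed in EiB.
print(2**63 // 2**60)
```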
 
 
 
''' What is the maximum number of files in a single file system? In a single directory? '''
 
 
 
We use the ''ext3'' hashed directory code, which has a theoretical limit of about 15 million files per directory, at which point the directory grows to more than 2 GB. The maximum number of subdirectories in a single directory is the same as the file limit, about 15 million.
 
 
 
We regularly run tests with ten million files in a single directory. On a properly-configured quad-socket MDS with 32 GB of RAM, it is possible to do random lookups in this directory at a rate of 20,000/second.
 
 
 
A single MDS imposes an upper limit of 4 billion inodes, but the default limit is slightly less than the device size/4kB, so about 2 billion inodes for an 8 TB MDS file system.  This can be increased at initial file system creation time by specifying ''mkfs'' options to increase the number of inodes.  Production file systems containing over 300 million files exist.
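The default inode estimate can be reproduced with a short calculation. This is a rough sketch only: it assumes binary units (8 TB taken as 8 × 2^40 bytes), treats the "4 billion" ceiling as 2^32, and ignores the overhead that makes the real default "slightly less" than the result.

```python
# Sketch: default inode count is roughly device size / 4 kB, capped by the MDS.
BYTES_PER_INODE = 4096   # one inode per 4 kB of device space (from the text)
INODE_LIMIT = 2**32      # assumption: "4 billion" read as 2^32


def default_inode_estimate(device_bytes: int) -> int:
    """Upper estimate of the default number of inodes on an MDS device."""
    return min(device_bytes // BYTES_PER_INODE, INODE_LIMIT)


print(default_inode_estimate(8 * 2**40))  # about 2 billion for an 8 TB device
```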
 
 
 
With the introduction of clustered metadata servers (Lustre 2.x) and with ZFS-based MDTs, these limits will disappear.
 
 
 
''' How many OSSs do I need? '''
 
 
 
The short answer is: As many as you need to achieve the required aggregate I/O throughput.
 
 
 
The long answer is: Each OSS contributes to the total capacity and the aggregate throughput. For example, a 100 TB file system may use 100 single-GigE-connected OSS nodes with 1 TB of 100 MB/sec storage each, providing 10 GB/sec of aggregate bandwidth. The same bandwidth and capacity could be provided with four heavy-duty 25 TB OSS servers with three Elan 4 interfaces and 16 FC2 channels, each providing ~2.5 GB/s in aggregate bandwidth.  The 25 TB of storage must be capable of 2.5 GB/s.
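The two configurations in this example can be compared with a few lines of arithmetic. The sketch below uses a hypothetical `OssTier` class and an `oss_needed` helper (neither is part of Lustre) and works in MB/s to keep the numbers integral:

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class OssTier:
    """A group of identical OSS nodes (illustrative model only)."""
    count: int
    capacity_tb: int       # storage behind each OSS
    bandwidth_mb_s: int    # sustained throughput of each OSS

    def total_capacity_tb(self) -> int:
        return self.count * self.capacity_tb

    def aggregate_bandwidth_mb_s(self) -> int:
        return self.count * self.bandwidth_mb_s


def oss_needed(required_mb_s: int, per_oss_mb_s: int) -> int:
    """Smallest number of OSS nodes that meets the required throughput."""
    return math.ceil(required_mb_s / per_oss_mb_s)


many_small = OssTier(count=100, capacity_tb=1, bandwidth_mb_s=100)
few_large = OssTier(count=4, capacity_tb=25, bandwidth_mb_s=2500)

for tier in (many_small, few_large):
    print(tier.total_capacity_tb(), "TB,", tier.aggregate_bandwidth_mb_s(), "MB/s")
```

Both tiers come out at the same capacity and bandwidth, which is the point of the example: the OSS count follows from the throughput target and the per-server capability, not from a fixed ratio.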
 
 
 
Each OSS can support a very large number of clients, so we do not advise our customers to use any particular client-to-OSS ratio.  Nevertheless, it is common to deploy 1 GB/s of OSS throughput per 1 TFLOP/s of compute power.
 
 
 
''' What is the largest possible I/O request? '''
 
 
 
When most people ask this question, they are asking what is the maximum buffer size that can be safely passed to a ''read()'' or ''write()'' system call. In principle this is limited only by the address space on the client, and we have tested single ''read()'' and ''write()'' system calls up to 1GB in size, so it has not been an issue in reality.
 
 
 
The Lustre I/O subsystem is designed with the understanding that an I/O request travels through many pipelines, and that it's important to keep all pipelines full for maximum performance. So it is not necessary to teach your application to do I/O in very large chunks; Lustre and the page cache will aggregate I/O for you.
 
 
 
Typically, Lustre client nodes will do their best to aggregate I/O into 1 MB chunks on the wire, and to keep between 5 and 10 I/O requests "in flight" at a time, per server.  There is still a per-syscall overhead for locking and such, so using 1MB or larger read/write requests will minimize this overhead.
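An application can keep the per-syscall overhead low by issuing writes of 1 MB or more. The following is a generic POSIX-style sketch, not Lustre-specific code; the function name and the 1 MiB default are assumptions chosen to match the request size mentioned above.

```python
import os

CHUNK_SIZE = 1 << 20  # 1 MiB per write() call (assumed to match the wire size)


def write_in_chunks(path: str, data: bytes, chunk_size: int = CHUNK_SIZE) -> int:
    """Write data to path in chunk_size pieces; return the number of write() calls."""
    calls = 0
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        offset = 0
        while offset < len(view):
            # os.write may write fewer bytes than requested, so advance
            # by the amount actually written.
            written = os.write(fd, view[offset:offset + chunk_size])
            offset += written
            calls += 1
    finally:
        os.close(fd)
    return calls
```

Smaller writes still work, since the client aggregates them, but each call pays the locking overhead described above.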
 
 
 
On the OSS, we have gone to significant effort to ensure that these large 1 MB buffers do not get unnecessarily broken up by lower kernel layers. In Linux 2.4, modifications were required to the SCSI layer, block devices, and the QLogic Fibre Channel driver.  In newer kernels there is less need to modify the kernel, though the block device tunables are still set for low-latency desktop workloads, and need to be tuned for high-bandwidth I/O.
 
 
 
''' How many nodes can connect to a single Lustre file system? '''
 
 
 
The largest single production Lustre installation is approximately 26,000 nodes today (2009). This is the site-wide file system for a handful of different supercomputers.
 
 
 
Although these are the largest clusters available to us today, we believe that the architecture is fundamentally capable of supporting many more clients.
 
  
 
[[FAQ - Installation|Installation]]
 

Revision as of 14:55, 5 February 2010

You will find answers to a variety of questions about Lustre™ below.

Fundamentals

Can you describe the data caching and cache coherency method?

Complete cache coherence is provided for both metadata (names, readdir lists, inode attributes) and file data. Clients and servers both take locks with a distributed lock management service; caches are flushed before locks are released.

Does Lustre separate metadata and file data?

Yes. The entire namespace is stored on Lustre Metadata Servers (MDSs); file data is stored on Lustre Object Storage Servers (OSSs). Note that unlike many block-based clustered file systems where the MDS is still in charge of block allocation, the Lustre MDS is not involved in file I/O in any manner and is not a source of contention for file I/O.

The data for each file may reside in multiple objects on separate servers. Lustre 1.x manages these objects in a RAID-0 (striping) configuration, so each object in a multi-object file contains only a part of the file's data. Future versions of Lustre will allow the user or administrator to choose other striping methods, such as RAID-1 or RAID-5 redundancy.
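To illustrate what RAID-0 striping means for a file's data, the sketch below maps a byte offset in a file to the object that holds it and the offset inside that object. It is a generic RAID-0 layout calculation, not Lustre's actual code; `stripe_size` and `stripe_count` are assumed parameters.

```python
from typing import Tuple


def locate(offset: int, stripe_size: int, stripe_count: int) -> Tuple[int, int]:
    """Map a file offset to (object index, offset within that object)."""
    stripe_number = offset // stripe_size          # which stripe unit of the file
    object_index = stripe_number % stripe_count    # units rotate across the objects
    # Each full rotation adds one stripe unit to every object.
    object_offset = (stripe_number // stripe_count) * stripe_size + offset % stripe_size
    return object_index, object_offset


MIB = 1 << 20
# With 1 MiB stripes over 4 objects, the sixth MiB of the file lands in object 1.
print(locate(5 * MIB + 10, stripe_size=MIB, stripe_count=4))
```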

What is the difference between an OST and an OSS?

As the architecture has evolved, we refined these terms.

An Object Storage Server (OSS) is a server node, running the Lustre software stack. It has one or more network interfaces and usually one or more disks.

An Object Storage Target (OST) is an interface to a single exported backend volume. It is conceptually similar to an NFS export, except that an OST does not contain a whole namespace, but rather file system objects.

Is it common for a single OSS to export more than one OST?

Yes, for example to get around the Linux 2.6 maximum 8 TB partition size. Although Lustre will aggregate multiple OSTs into a single large file system, each individual OST partition is limited to 8 TB.

Does Lustre perform high-level I/O load balancing?

Yes. Because a single file can reside in pieces on many servers, the I/O load for even a single file can be distributed.

Objects are distributed amongst OSTs in a round-robin manner to ensure even load balancing across OSTs and OSS nodes. In Lustre 1.6 and later, if the OSTs are imbalanced in terms of space usage, the MDS will take this into account and allocate a larger fraction of files to OSTs with more free space.
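The behaviour described above can be pictured as round-robin selection that switches to free-space weighting when usage becomes uneven. The sketch below is only an illustration of that idea: the function name, the imbalance threshold and the weighting rule are all hypothetical and are not taken from the Lustre MDS implementation.

```python
import itertools
import random
from typing import Callable, Dict, Optional


def make_allocator(
    free_space: Dict[str, int],
    imbalance_threshold: float = 0.2,   # hypothetical tuning value
    rng: Optional[random.Random] = None,
) -> Callable[[], str]:
    """Return a function that picks the OST for the next object."""
    rng = rng or random.Random()
    ring = itertools.cycle(sorted(free_space))

    def pick() -> str:
        most = max(free_space.values())
        least = min(free_space.values())
        # Balanced (or completely full) OSTs: plain round-robin.
        if most == 0 or (most - least) / most <= imbalance_threshold:
            return next(ring)
        # Imbalanced: favour OSTs in proportion to their free space.
        names = list(free_space)
        weights = [free_space[name] for name in names]
        return rng.choices(names, weights=weights, k=1)[0]

    return pick


pick = make_allocator({"OST0": 100, "OST1": 95, "OST2": 90})
print([pick() for _ in range(4)])
```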

Is there a common synchronized namespace for files and directories?

Yes. All clients that mount the file system will see a single, coherent, synchronized namespace at all times.

Can Lustre be used as part of a "single system image" installation?

Yes. Lustre as the root file system is being used by some installations on both clients and servers.

Do Lustre clients use NFS to reach the servers?

No. Client nodes run a native Lustre client file system driver, which uses the Lustre metadata and object protocols to communicate with the servers. The NFS protocol is not suitable to meet Lustre's metadata, I/O, locking, recovery, or performance requirements.

Does Lustre use or provide a single security domain?

Current versions of Lustre expect the clients and servers to have an identical understanding of UIDs and GIDs, but security is enforced on the MDS by a server-local PAM-managed group database. Lustre supports Access Control Lists (ACLs). Strong security using Kerberos is being developed and will be in a future release.

Does Lustre support the standard POSIX file system APIs?

Yes. Applications that use standard POSIX file system APIs can run on Lustre without modifications.

Is Lustre "POSIX compliant"? Are there any exceptions?

POSIX does not, strictly speaking, say anything about how a file system will operate on multiple clients. However, Lustre conforms to the most reasonable interpretation of what the single-node POSIX requirements would mean in a clustered environment.

For example, the coherency of read and write operations is enforced through the Lustre distributed lock manager; if application threads running on multiple nodes were to try to read and write the same part of a file at the same time, they would both see consistent results.

This is true of all I/O and metadata operations, with two exceptions:

  • atime updates
It is not practical to maintain fully coherent atime updates in a high-performance cluster file system. Lustre will update the atime of files lazily -- if an inode needs to be changed on a disk anyway, we will piggy-back an atime update if needed -- and when files are closed. Clients will refresh a file's atime whenever they read or write objects from that file from the OST(s), but will only do local atime updates for reads from cache.
  • flock/lockf
POSIX and BSD flock/lockf system calls will be completely coherent across the cluster, using the Lustre lock manager, but are not enabled by default today. It is possible to enable client-local flock locking with the -o localflock mount option, or cluster-wide locking with the -o flock mount option. If/when this becomes the default, it is also possible to disable flock for a client with the -o noflock mount option.
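From the application's side, flock is used through the ordinary system call interface. A minimal sketch using Python's standard `fcntl` module follows; the helper name is illustrative. On a client mounted without -o flock or -o localflock, the lock call may be expected to fail, so callers should be prepared to handle an OSError.

```python
import fcntl


def append_with_lock(path: str, line: str) -> None:
    """Append one line to path while holding an exclusive flock on the file."""
    with open(path, "a") as f:
        fcntl.flock(f, fcntl.LOCK_EX)      # blocks until the lock is granted
        try:
            f.write(line + "\n")
            f.flush()                      # push the data out before unlocking
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
```

With -o flock the lock is honoured across all clients; with -o localflock it only coordinates processes on the same client node.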

Prior to Lustre 1.4.1, there was one additional deviation from POSIX, in the area of mmap I/O. In 1.4.1, changes were made to support cache-coherent mmap I/O and robust execution of binaries and shared libraries residing in Lustre. mmap() I/O is now coherent and synchronized via the Lustre lock manager, although there may be pathological cases that remain hazardous for some time.

Can you grow/shrink file systems online?

Lustre 1.6 contains support for online addition of OSTs, either on a new or on an existing OSS. In an upcoming version of Lustre, the recently added support for online resizing of ext3 volumes will provide an additional way of growing file systems. Shrinking is not supported.

Which disk file systems are supported as Lustre back-end file systems?

Lustre includes a patched version of the ext3 file system, called ldiskfs, with additional features such as extents, an efficient multi-block allocator, htree directories, large inodes, extended attributes, transaction optimizations, fine-grained locking, and CPU affinity for critical operations. In newer kernels that support the ext4 file system, this will be used instead of ext3. Work is underway to use the Solaris ZFS file system to increase the scalability and robustness of the back-end file system.

Why did Lustre choose ext3? Do you ever plan to support others?

There are many reasons to choose ext3. One is size; at about 15,000 lines of code, it is extremely understandable, maintainable, and modifiable. Another is reliability; ext3 is proven stable by millions of users, with an excellent file system repair tool.

When we began, there was a big difference between the various Linux file systems, particularly with respect to performance. In the last few years, however, the Lustre team has carried ext3 substantially forward, and it is now competitive with other Linux file systems. Most of the changes made to ext3 for improving Lustre performance have been included into the upstream ext4 filesystem, reducing the number and size of patches in ldiskfs dramatically.

Nevertheless, we had originally planned to support multiple file systems, so Lustre does contain a file system abstraction. In future Lustre releases, we will support ZFS as the backing file system for both OSTs and MDTs.

Why didn't you use IBM's distributed lock manager?

The design of the Lustre DLM borrows heavily from the VAX Clusters DLM, plus extensions that are not found in others. Although we have received some reasonable criticism for not using an existing package (such as IBM's DLM), experience thus far has seemed to indicate that we've made the correct choice: it's smaller, simpler and, at least for our needs, more extensible.

The Lustre DLM, at around 6,000 lines of code, has proven to be a manageable maintenance task, despite its somewhat daunting complexity. The IBM DLM, by comparison, was nearly the size of all of Lustre combined. This is not necessarily a criticism of the IBM DLM, however. To its credit, it is a complete DLM which implements many features which we do not require in Lustre.

In particular, Lustre's DLM is not really distributed, at least not when compared to other such systems. Locks in the Lustre DLM are always managed by the service node, and do not change masters as other systems allow. Omitting features of this type has allowed us to rapidly develop and stabilize the core DLM functionality required by the file system, plus add several extensions of our own (extent locking, intent locking, policy functions, glimpse ASTs, and a different take on lock value blocks).

Are services at user or kernel level? How do they communicate?

All daemons on a single node run in the kernel, and therefore share a single address space. Daemons on different nodes communicate through RPC messages; large messages are sent using remote DMA if the fabric supports it.

Sizing

Installation

Networking

Metadata Servers

Object Servers and I/O Throughput

Recovery

OS Support

Release Testing and Upgrading

Licensing and Support

Glossary

  • ACL: Access Control List
  • DLM: Distributed Lock Manager
  • EA: Extended Attribute
  • FC: Fibre Channel
  • HPC: High-Performance Computing
  • IB: InfiniBand
  • MDS: Metadata Server
  • NAL: Network Abstraction Layer; a software module which provides support for a particular interconnect
  • OSS: Object Storage Server
  • OST: Object Storage Target