Lustre FAQ

Glossary

  • ACL: Access Control List
  • DLM: Distributed Lock Manager
  • EA: Extended Attribute
  • FC: Fibrechannel
  • HPC: High-Performance Computing
  • IB: InfiniBand
  • MDS: Metadata Server
  • NAL: Network Abstraction Layer; a software module which provides support for a particular interconnect
  • OSS: Object Storage Server
  • OST: Object Storage Target (see "What is the difference between an OST and an OSS?" below)


Fundamentals (15)

Can you describe the data caching and cache coherency method?

There is complete cache coherence for both metadata (names, readdir lists, inode attributes) and file data. Clients and servers both take locks with a distributed lock management service; caches are flushed before locks are released.

Does Lustre separate metadata and file data?

Yes. The entire namespace is stored on Lustre Metadata Servers (MDSs); file data is stored on Lustre Object Storage Servers (OSSs).

The data for each file may reside in multiple objects on separate servers. Lustre 1.x manages these objects in a RAID-0 (striping) configuration, so each object in a multi-object file contains only a part of the file's data. Future versions of Lustre will allow the user or administrator to choose other striping methods, such as RAID-1 or RAID-5 redundancy.
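
To make the RAID-0 layout concrete, here is a minimal sketch in Python of how a striped layout maps a byte offset in a file to one of its objects. The stripe size and stripe count are illustrative values chosen for the example, not a statement about Lustre's defaults or its internal code.

    # Sketch of a RAID-0 (striped) layout: map a file offset to an object.
    # Stripe size and count are illustrative, not Lustre defaults.
    def raid0_map(offset, stripe_size=1 << 20, stripe_count=4):
        """Return (object_index, offset_within_object) for a byte offset."""
        stripe_number = offset // stripe_size        # which stripe of the file
        object_index = stripe_number % stripe_count  # which object holds it
        full_rounds = stripe_number // stripe_count  # stripes already on that object
        return object_index, full_rounds * stripe_size + (offset % stripe_size)

    # With 1 MB stripes over 4 objects, offsets 0-4 MB land on objects
    # 0, 1, 2, 3 and then wrap back to object 0.
    for mb in range(5):
        print(mb, raid0_map(mb * (1 << 20)))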

What is the difference between an OST and an OSS?

There is a lot of confusion, and it's mostly our fault; as the architecture evolved, we refined these terms, and it has been difficult to enforce consistent usage.

An Object Storage Server (OSS) is a server node, running the Lustre software stack. It has one or more network interfaces, and usually one or more disks.

An Object Storage Target (OST) is an interface to a single exported backend volume. It is conceptually similar to an NFS export, except that an OST does not contain a whole namespace, but rather file system objects.

Is it common for a single OSS to export more than one OST?

Yes, for example to get around the Linux 2.6 maximum partition size of 8 TB. Although Lustre will aggregate multiple OSTs into a single large file system, each individual OST partition is limited to 8 TB.

Does Lustre perform high-level I/O load balancing?

Yes. Because a single file can reside in pieces on many servers, the I/O load for even a single file can be distributed.

By default, objects are randomly distributed amongst OSTs.

Is there a common synchronized namespace for files and directories?

Yes. All clients which mount the file system will see a single, coherent, synchronized namespace at all times.

Can Lustre be used as part of a "single system image" installation?

Yes. Lustre as the root file system is being used by some installations on both clients and servers, although it will not be officially supported until Lustre 1.6.x (see the roadmap).

Do Lustre clients use NFS to reach the servers?

No. Client nodes run a native Lustre client file system driver, which uses the Lustre metadata and object protocols to communicate with the servers. The NFS protocol is not suitable to meet Lustre's metadata, I/O, locking, recovery, or performance requirements.

Does Lustre use/provide a single security domain?

Current versions of Lustre expect the clients and servers to have an identical understanding of UIDs and GIDs, but security is enforced on the Metadata Server by a server-local PAM-managed group database. Strong security and POSIX Access Control Lists are coming in future versions (see the roadmap).

Does Lustre support the standard POSIX file system APIs?

Yes. Applications which use standard POSIX file system APIs can run on Lustre without modifications.
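
For instance, a program that uses only ordinary POSIX calls needs no changes to run against a Lustre mount. A minimal sketch in Python, where /mnt/lustre is a hypothetical mount point:

    # Plain POSIX file I/O; nothing Lustre-specific is required.
    # /mnt/lustre is a hypothetical mount point used for illustration.
    import os

    fd = os.open("/mnt/lustre/example.dat", os.O_CREAT | os.O_WRONLY, 0o644)
    try:
        os.write(fd, b"hello from an unmodified POSIX application\n")
        os.fsync(fd)   # durability behaves as on any local POSIX file system
    finally:
        os.close(fd)

    print(os.stat("/mnt/lustre/example.dat").st_size)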

Is Lustre "POSIX compliant"? Are there any exceptions?

POSIX does not, strictly speaking, say anything about how a file system will operate on multiple clients. However, Lustre conforms to the most reasonable interpretation of what the single-node POSIX requirements would mean in a clustered environment.

For example, the atomicity of read and write operations is enforced through the Lustre distributed lock manager; if application threads running on multiple nodes were to try to read and write the same part of a file at the same time, they would both see consistent results. This is true of all I/O and metadata operations, with two exceptions:

  • 1. atime updates

It is not practical to maintain fully coherent atime updates in a high-performance cluster file system. Lustre will update the atime of files lazily -- if an inode needs to be changed on disk anyway, we will piggy-back an atime update if needed -- and when files are closed.

  • 2. flock/lockf

When supported, POSIX and BSD flock/lockf system calls will be completely coherent across the cluster, using the Lustre lock manager, but they are not supported today. They will be soon (see the roadmap).

Prior to Lustre 1.4.1, there was one additional deviation from POSIX, in the area of mmap I/O. In 1.4.1, changes were made to support cache-coherent mmap I/O and robust execution of binaries and shared libraries residing in Lustre. mmap() I/O is now coherent and synchronized via the Lustre lock manager, although there may be pathological cases that remain hazardous for some time.

Can you grow/shrink file systems online?

Lustre 1.6 contains support for online addition of OSTs, either on a new or on an existing OSS. In an upcoming version of Lustre, the recently added support for online resizing of ext3 volumes will provide an additional way of growing file systems. Shrinking is not supported.

Which disk file systems are supported as Lustre backend file systems?

Lustre includes a patched version of the ext3 file system, with additional features such as extents, an efficient multi-block allocator, htree directories, large inodes, extended attributes, transaction optimizations, fine-grained locking, and CPU affinity for critical operations. This is the only supported backend file system today.

Why did CFS choose ext3? Do you ever plan to support others?

There are many reasons to choose ext3. One is size; at under 15,000 lines of code, it is extremely understandable, maintainable, and modifiable. Another is reliability; ext3 is proven stable by millions of users, with an excellent file system repair tool.

When we began, there was a big difference between the various Linux file systems, particularly with respect to performance. In the last three years, however, CFS has carried ext3 very substantially forward, and it is now extremely competitive with other Linux file systems.

Nevertheless, we had originally planned to support multiple file systems, so Lustre does contain a file system abstraction. The Solaris Port of the OSS will support ZFS.

Why didn't you use IBM's distributed lock manager?

The design of the Lustre DLM borrows heavily from the VAX Clusters DLM, plus extensions that are not found in others. Although we have received some reasonable criticism for not using an existing package (such as IBM's DLM), experience thus far has seemed to indicate that we've made the correct choice: it's smaller, simpler and, at least for our needs, more extensible.

The Lustre DLM, at around 6,000 lines of code, has proven to be a manageable maintenance task, despite its somewhat daunting complexity. The IBM DLM, by comparison, was nearly the size of all of Lustre combined. This is not necessarily a criticism of the IBM DLM, however; to its credit, it is a complete DLM which implements many features which we do not require in Lustre.

In particular, Lustre's DLM is not really distributed, at least not when compared to other such systems. Locks in the Lustre DLM are always managed by the service node, and do not change masters as other systems allow. Omitting features of this type has allowed us to rapidly develop and stabilize the core DLM functionality required by the file system, plus add several extensions of our own (extent locking, intent locking, policy functions, glimpse ASTs, and a different take on lock value blocks).

Are services at user or kernel level? How do they communicate?

All daemons on a single node run in the kernel, and therefore share a single address space. Daemons on different nodes communicate through RPC messages; large messages are sent using remote DMA if the fabric supports it.

Sizing (7)

What is the maximum file system size? What is the largest file system you've tested?

Each backend OST file system is restricted to a maximum of 2 TB on Linux 2.4 (imposed by the kernel block device), or 8 TB on Linux 2.6 (imposed by ext3). Of course, it is possible to have multiple OST file systems on a single OSS, and to aggregate multiple OSSs. Running with almost 4,000 OSTs has been tried; hence 32 PB file systems can be achieved today.

Lustre users already run single production file systems of 1.4 PB.

What is the maximum file system block size?

The basic ext3 block size is 4096 bytes, although this could in principle be easily changed to a large PAGE_SIZE (on IA64, for example) with a few modifications to ext3. It is not clear, however, that this is necessary.

Some people confuse block size with extent size or I/O request size -- they are not the same thing. The block size is the basic unit of disk allocation, and for our purposes it seems that 4kB is as good as any. The size of a single file extent, by definition, is almost always larger than 4kB, and ext3 improvements in Lustre 1.4.x do a good job of using contiguous disk blocks whenever possible. The I/O request size (the amount of data that we try to read or write in a single request) is much much larger.

What is the maximum single-file size?

On 32-bit clients, the page cache makes it quite difficult to read or write a single file larger than 8 TB. On 64-bit clusters, the maximum file size is 2^64 bytes. A current Lustre limit for allocated file space arises from a maximum of 160 stripes and about 8 TB per stripe, leading to about 1.28 PB per file.
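
As a quick sketch of the arithmetic behind that figure (treating "8 TB" as decimal terabytes):

    # 160 stripes at roughly 8 TB per stripe, as stated above.
    stripes = 160
    bytes_per_stripe = 8 * 10**12
    print(stripes * bytes_per_stripe / 10**15)   # ~1.28 PB per file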

What is the maximum number of files in a single file system? In a single directory?

We use the ext3 hashed directory code, which has a theoretical limit of ~134 million files per directory, at which point the directory grows to more than 2 GB. The maximum number of subdirectories is 32,000 in versions prior to Lustre 1.2.6 and is unlimited in later versions (small ext3 format change).

More realistically, we regularly run tests with ten million files in a single directory. On a properly-configured dual-CPU MDS with 4 GB of RAM, it is possible to do random lookups in this directory at a rate of 5,000 per second.

A single MDS imposes an upper limit of 4 billion inodes, but the default limit is slightly less than the device size / 4 kB, so about 512 million inodes for a 2 TB MDS file system. This can be increased at initial file system creation time by specifying mkfs options. We regularly test with file systems containing approximately 100 million files.
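
The quoted default follows directly from one inode per 4 kB of device space; a sketch of the calculation:

    # Default MDS inode count: roughly device size divided by 4 kB.
    device_size = 2 * 2**40          # a 2 TB MDS device
    bytes_per_inode = 4 * 2**10      # one inode per 4 kB by default
    print(device_size // bytes_per_inode)   # 536,870,912 -- about 512 million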

With the introduction of clustered metadata servers (Lustre 2.0), these limits will disappear.

How many OSSs do I need?

The short answer is: as many as you need to achieve the required aggregate I/O throughput.

The long answer is: each OSS contributes to the total capacity and the aggregate throughput. For example, a 100 TB file system may use 100 OSS nodes, each connected with a single gigabit Ethernet link and holding 1 TB of storage capable of 100 MB/s, providing 10 GB/s of aggregate bandwidth. The same bandwidth and capacity could be provided with four heavy-duty 25 TB OSS servers, each with three Elan 4 interfaces and 16 FC2 channels and providing ~2.5 GB/s in aggregate bandwidth. The 25 TB of storage on each must be capable of 2.5 GB/s.
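
The example above can be worked through as a simple sizing calculation; the sketch below just reproduces the quoted figures and is not a recommendation for any particular hardware:

    # Rough OSS sizing: nodes needed to meet a capacity and bandwidth target.
    def oss_count(target_tb, target_gb_per_s, node_tb, node_mb_per_s):
        by_capacity = -(-target_tb // node_tb)                         # ceiling division
        by_bandwidth = -(-(target_gb_per_s * 1000) // node_mb_per_s)   # GB/s -> MB/s
        return max(by_capacity, by_bandwidth)

    print(oss_count(100, 10, 1, 100))     # 100 gigabit-Ethernet OSS nodes
    print(oss_count(100, 10, 25, 2500))   # 4 heavy-duty OSS servers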

Each OSS can support a very large number of clients, so we do not advise our customers to use any particular client-to-OSS ratio. Nevertheless, it is common to deploy 1 GB/s of OSS throughput per 1 TFLOP/s of compute power.

What is the largest possible I/O request?

When most people ask this question, they are asking what is the maximum buffer size that can be safely passed to a read() or write() system call. In principle this is limited only by the address space on the client, although in practice it is possible to request so much I/O in a single atomic unit that the cluster infrastructure will see this as non-responsiveness and generate timeouts. Depending on your I/O infrastructure, however, this can be 100 MB or more, so it has not been an issue in reality. In any case, we are aware of this limitation, and will work to remove it in a future release.

The Lustre I/O subsystem is designed with the understanding that an I/O request travels through many pipelines, and that it's important to keep all pipelines full for maximum performance. So it is not necessary to teach your application to do I/O in very large chunks; Lustre and the page cache will aggregate I/O for you.

Typically, Lustre client nodes will do their best to aggregate I/O into 1 MB chunks on the wire, and to keep between 5 and 10 I/O requests "in flight" at a time, per server.
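
A rough way to see why a handful of 1 MB RPCs in flight is enough to keep a server's pipeline full: the number needed is roughly bandwidth times round-trip time divided by the RPC size. The bandwidth and latency below are illustrative assumptions, not measured Lustre figures:

    # Bandwidth-delay sketch: RPCs in flight ~ bandwidth * RTT / RPC size.
    rpc_size = 1 * 2**20       # 1 MB RPCs, as described above
    path_bw = 400 * 2**20      # assume ~400 MB/s usable per client-server path
    rtt = 0.010                # assume a 10 ms request round trip
    print(path_bw * rtt / rpc_size)   # ~4 RPCs, in line with the 5-10 kept in flight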

On the OSS, we have gone to significant effort to ensure that these large 1 MB buffers do not get unnecessarily broken up by lower kernel layers. In Linux 2.4, modifications were required to the SCSI layer, block devices, and the QLogic fibrechannel driver. Because nearly 100% of our customers use QLogic fibrechannel cards, we have not yet optimized other drivers in this way.

How many nodes can connect to a single Lustre file system?

The largest single production Lustre installation is approximately 25,000 nodes today (2006). A Cray XT3 cluster of almost 12,000 nodes is currently being deployed, although it will likely run in production as two separate 6,000-node clusters.

Although these are the largest clusters available to us today, we believe that the architecture is fundamentally capable of supporting many tens of thousands of clients.

Installation (13)

Networking (5)

Metadata Servers (10)

Object Servers and I/O Throughput (13)

Recovery (8)

OS Support (9)

Release Testing and Upgrading (4)

Licensing and Support (7)