WARNING: This is the _old_ Lustre wiki, and it is in the process of being retired. The information found here is all likely to be out of date. Please search the new wiki for more up to date information.

Lustre FAQ: Difference between revisions

From Obsolete Lustre Wiki


== Fundamentals (15) ==
=== Can you describe the data caching and cache coherency method?===


There is complete cache coherence for both metadata (names, readdir lists, inode attributes) and file data. Clients and servers both take locks with a distributed lock management service; caches are flushed before locks are released.
=== Does Lustre separate metadata and file data?===
Yes. The entire namespace is stored on Lustre Metadata Servers (MDSs); file data is stored on Lustre Object Storage Servers (OSSs).
The data for each file may reside in multiple objects on separate servers. Lustre 1.x manages these objects in a RAID-0 (striping) configuration, so each object in a multi-object file contains only a part of the file's data. Future versions of Lustre will allow the user or administrator to choose other striping methods, such as RAID-1 or RAID-5 redundancy.
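The RAID-0 layout described above amounts to a fixed offset calculation. A minimal sketch, for illustration only (the function and parameter names are ours, not Lustre internals):

```python
def stripe_location(file_offset, stripe_size, stripe_count):
    """Map a byte offset in a striped file to (object index, offset within that object)."""
    stripe_number = file_offset // stripe_size        # which stripe unit holds this byte
    obj_index = stripe_number % stripe_count          # stripe units go round-robin over objects
    obj_offset = (stripe_number // stripe_count) * stripe_size + file_offset % stripe_size
    return obj_index, obj_offset

# With 3 objects and a 10-byte stripe unit, byte 35 falls in the 4th stripe
# unit, which lives on object 0 at offset 15 within that object.
```

In practice the stripe count and stripe size are chosen per file or directory with the `lfs setstripe` utility.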
=== What is the difference between an OST and an OSS?===
There is a lot of confusion, and it's mostly our fault; as the architecture evolved, we refined these terms, and it has been difficult to enforce consistent usage.
An Object Storage Server (OSS) is a server node, running the Lustre software stack. It has one or more network interfaces, and usually one or more disks.
An Object Storage Target (OST) is an interface to a single exported backend volume. It is conceptually similar to an NFS export, except that an OST does not contain a whole namespace, but rather file system objects.
Is it common for a single OSS to export more than one OST? Yes, for example to get around the Linux 2.6 maximum partition size of 8 TB. Although Lustre will aggregate multiple OSTs into a single large file system, each individual OST partition is limited to 8 TB.
=== Does Lustre perform high-level I/O load balancing?===
Yes. Because a single file can reside in pieces on many servers, the I/O load for even a single file can be distributed.
By default, objects are randomly distributed amongst OSTs.
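A sketch of what that default placement amounts to (illustrative only; this is not the actual allocator, which also weighs factors such as free space):

```python
import random

def choose_osts(num_osts, stripe_count):
    """Pick stripe_count distinct OST indices at random for a new file's objects."""
    return random.sample(range(num_osts), stripe_count)
```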
=== Is there a common synchronized namespace for files and directories?===
Yes. All clients which mount the file system will see a single, coherent, synchronized namespace at all times.
=== Can Lustre be used as part of a "single system image" installation?===
Yes. Lustre as the root file system is being used by some installations on both clients and servers, although it will not be officially supported until Lustre 1.6.x (see the roadmap).
=== Do Lustre clients use NFS to reach the servers?===
No. Client nodes run a native Lustre client file system driver, which uses the Lustre metadata and object protocols to communicate with the servers. The NFS protocol is not suitable to meet Lustre's metadata, I/O, locking, recovery, or performance requirements.
=== Does Lustre use/provide a single security domain?===
Current versions of Lustre expect the clients and servers to have an identical understanding of UIDs and GIDs, but security is enforced on the Metadata Server by a server-local PAM-managed group database. Strong security and POSIX Access Control Lists are coming in future versions (see the roadmap).
=== Does Lustre support the standard POSIX file system APIs?===
Yes. Applications which use standard POSIX file system APIs can run on Lustre without modifications.
=== Is Lustre "POSIX compliant"? Are there any exceptions?===
POSIX does not, strictly speaking, say anything about how a file system will operate on multiple clients. However, Lustre conforms to the most reasonable interpretation of what the single-node POSIX requirements would mean in a clustered environment.
For example, the atomicity of read and write operations is enforced through the Lustre distributed lock manager; if application threads running on multiple nodes were to read and write the same part of a file at the same time, they would both see consistent results. This is true of all I/O and metadata operations, with two exceptions:
* 1. atime updates
It is not practical to maintain fully coherent atime updates in a high-performance cluster file system. Lustre updates the atime of files lazily -- if an inode needs to be changed on disk anyway, an atime update is piggy-backed onto it -- and when files are closed.
* 2. flock/lockf
POSIX and BSD flock/lockf system calls will be completely coherent across the cluster, using the Lustre lock manager, but they are not supported today. Support is coming soon (see the roadmap).
Prior to Lustre 1.4.1, there was one additional deviation from POSIX, in the area of mmap I/O. In 1.4.1, changes were made to support cache-coherent mmap I/O and robust execution of binaries and shared libraries residing in Lustre. mmap() I/O is now coherent and synchronized via the Lustre lock manager, although there may be pathological cases that remain hazardous for some time.
=== Can you grow/shrink file systems online?===
Lustre 1.6 contains support for online addition of OSTs, either on a new or on an existing OSS. In an upcoming version of Lustre, the recently added support for online resizing of ext3 volumes will provide an additional way of growing file systems. Shrinking is not supported.
=== Which disk file systems are supported as Lustre backend file systems?===
Lustre includes a patched version of the ext3 file system, with additional features such as extents, an efficient multi-block allocator, htree directories, large inodes, extended attributes, transaction optimizations, fine-grained locking, and CPU affinity for critical operations. This is the only supported backend file system today.
=== Why did CFS choose ext3? Do you ever plan to support others?===
There are many reasons to choose ext3. One is size; at under 15,000 lines of code, it is extremely understandable, maintainable, and modifiable. Another is reliability; ext3 is proven stable by millions of users, with an excellent file system repair tool.
When we began, there was a big difference between the various Linux file systems, particularly with respect to performance. In the last three years, however, CFS has carried ext3 very substantially forward, and it is now extremely competitive with other Linux file systems.
Nevertheless, we had originally planned to support multiple file systems, so Lustre does contain a file system abstraction. The Solaris Port of the OSS will support ZFS.
=== Why didn't you use IBM's distributed lock manager?===
The design of the Lustre DLM borrows heavily from the VAX Clusters DLM, plus extensions that are not found in others. Although we have received some reasonable criticism for not using an existing package (such as IBM's DLM), experience thus far indicates that we made the correct choice: it's smaller, simpler and, at least for our needs, more extensible.
The Lustre DLM, at around 6,000 lines of code, has proven to be a manageable maintenance task, despite its somewhat daunting complexity. The IBM DLM, by comparison, was nearly the size of all of Lustre combined. This is not necessarily a criticism of the IBM DLM, however; to its credit, it is a complete DLM which implements many features which we do not require in Lustre.
In particular, Lustre's DLM is not really distributed, at least not when compared to other such systems. Locks in the Lustre DLM are always managed by the service node, and do not change masters as other systems allow. Omitting features of this type has allowed us to rapidly develop and stabilize the core DLM functionality required by the file system, plus add several extensions of our own (extent locking, intent locking, policy functions, glimpse ASTs, and a different take on lock value blocks).
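To make the extent-locking extension concrete, here is a much-simplified sketch of when two extent locks conflict. PR and PW are Lustre's protected-read and protected-write lock mode names; the real DLM's compatibility rules are richer than this:

```python
def extents_conflict(lock_a, lock_b):
    """lock = (start, end, mode); end is inclusive, mode is "PR" or "PW"."""
    a_start, a_end, a_mode = lock_a
    b_start, b_end, b_mode = lock_b
    overlap = a_start <= b_end and b_start <= a_end   # byte ranges intersect
    return overlap and "PW" in (a_mode, b_mode)       # two readers never conflict
```

Because locks cover byte extents rather than whole files, many clients can write disjoint regions of the same file concurrently without ever contending for the same lock.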
=== Are services at user or kernel level? How do they communicate?===
All daemons on a single node run in the kernel, and therefore share a single address space. Daemons on different nodes communicate through RPC messages; large messages are sent using remote DMA if the fabric supports it.



Revision as of 23:43, 1 January 2008

Glossary

  • ACL: Access Control List
  • DLM: Distributed Lock Manager
  • EA: Extended Attribute
  • FC: Fibre Channel
  • HPC: High-Performance Computing
  • IB: InfiniBand
  • MDS: Metadata Server
  • NAL: Network Abstraction Layer; a software module which provides support for a particular interconnect
  • OSS: Object Storage Server
  • OST: Object Storage Target (what's the difference?)


Sizing (7)

Installation (13)

Networking (5)

Metadata Servers (10)

Object Servers and I/O Throughput (13)

Recovery (8)

OS Support (9)

Release Testing and Upgrading (4)

Licensing and Support (7)