FAQ - Fundamentals
(Updated: Dec 2009)
Can you describe the data caching and cache coherency method?
Complete cache coherence is provided for both metadata (names, readdir lists, inode attributes) and file data. Clients and servers both take locks with a distributed lock management service; caches are flushed before locks are released.
Does Lustre separate metadata and file data?
Yes. The entire namespace is stored on Lustre Metadata Servers (MDSs); file data is stored on Lustre Object Storage Servers (OSSs). Note that unlike many block-based clustered filesystems where the MDS is still in charge of block allocation, the Lustre MDS is not involved in file IO in any manner and is not a source of contention for file IO.
The data for each file may reside in multiple objects on separate servers. Lustre 1.x manages these objects in a RAID-0 (striping) configuration, so each object in a multi-object file contains only a part of the file's data. Future versions of Lustre will allow the user or administrator to choose other striping methods, such as RAID-1 or RAID-5 redundancy.
What is the difference between an OST and an OSS?
As the architecture has evolved, we refined these terms.
An Object Storage Server (OSS) is a server node, running the Lustre software stack. It has one or more network interfaces and usually one or more disks.
An Object Storage Target (OST) is an interface to a single exported backend volume. It is conceptually similar to an NFS export, except that an OST does not contain a whole namespace, but rather file system objects.
Is it common for a single OSS to export more than one OST?
Yes, for example to get around the Linux 2.6 maximum 8 TB partition size. Although Lustre will aggregate multiple OSTs into a single large file system, the individual OST partitions are 8 TB.
Does Lustre perform high-level I/O load balancing?
Yes. Because a single file can reside in pieces on many servers, the I/O load for even a single file can be distributed.
Objects are distributed amongst OSTs in a round-robin manner to ensure even load balancing across OSTs and OSS nodes. In Lustre 1.6 and later, if the OSTs are imbalanced in terms of space usage, the MDS will take this into account and allocate a larger fraction of files to OSTs with more free space.
Is there a common synchronized namespace for files and directories?
Yes. All clients that mount the file system will see a single, coherent, synchronized namespace at all times.
Can Lustre be used as part of a "single system image" installation?
Yes. Lustre as the root file system is being used by some installations on both clients and servers.
Do Lustre clients use NFS to reach the servers?
No. Client nodes run a native Lustre client file system driver, which uses the Lustre metadata and object protocols to communicate with the servers. The NFS protocol is not suitable to meet Lustre's metadata, I/O, locking, recovery, or performance requirements.
Does Lustre use or provide a single security domain?
Current versions of Lustre expect the clients and servers to have an identical understanding of UIDs and GIDs, but security is enforced on the MDS by a server-local PAM-managed group database. Lustre supports Access Control Lists (ACLs). Strong security using Kerberos is being developed and will be in a future release.
Does Lustre support the standard POSIX file system APIs?
Yes. Applications that use standard POSIX file system APIs can run on Lustre without modifications.
Is Lustre "POSIX compliant"? Are there any exceptions?
POSIX does not, strictly speaking, say anything about how a file system will operate on multiple clients. However, Lustre conforms to the most reasonable interpretation of what the single-node POSIX requirements would mean in a clustered environment.
For example, the coherency of read and write operations are enforced through the Lustre distributed lock manager; if application threads running on multiple nodes were to try to read and write the same part of a file at the same time, they would both see consistent results.
This is true of all I/O and metadata operations, with two exceptions:
- atime updates
- It is not practical to maintain fully coherent atime updates in a high-performance cluster file system. Lustre will update the atime of files lazily -- if an inode needs to be changed on a disk anyway, we will piggy-back an atime update if needed -- and when files are closed. Clients will refresh a file's atime' whenever they read or write objects from that file from the OST(s), but will only do local atime updates for reads from cache.
- POSIX and BSD flock/lockf system calls will be completely coherent across the cluster, using the Lustre lock manager, but are not enabled by default today. It is possible to enable client-local flock locking with the -o localflock mount option, or cluster-wide locking with the -o flock mount option. If/when this becomes the default, it is also possible to disable flock for a client with the -o noflock mount option.
Prior to Lustre 1.4.1, there was one additional deviation from POSIX, in the area of mmap I/O. In 1.4.1, changes were made to support cache-coherent mmap I/O and robust execution of binaries and shared libraries residing in Lustre. mmap() I/O is now coherent and synchronized via the Lustre lock manager, although there may be pathological cases that remain hazardous for some time.
Can you grow/shrink file systems online?
Lustre 1.6 contains support for online addition of OST targets either on a new or on an existing OSS. In an upcoming version of Lustre, the recently added support for online resizing of ext3 volumes will provide an additional way of growing file systems. Shrinking is not supported.
Which disk file systems are supported as Lustre back-end file systems?
Lustre includes a patched version of the ext3 file system, called ldiskfs, with additional features such as extents, an efficient multi-block allocator, htree directories, large inodes, extended attributes, transaction optimizations, fine-grained locking, and CPU affinity for critical operations. In newer kernels that support the ext4 file system, this will be used instead of ext3. Work is underway to use the Solaris ZFS file system to increase the scalability and robustness of the back-end file system.
Why did Lustre choose ext3? Do you ever plan to support others?
There are many reasons to choose ext3. One is size; at about 15,000 lines of code, it is extremely understandable, maintainable, and modifiable. Another is reliability; ext3 is proven stable by millions of users, with an excellent file system repair tool.
When we began, there was a big difference between the various Linux file systems, particularly with respect to performance. In the last few years, however, the Lustre team has carried ext3 substantially forward, and it is now competitive with other Linux file systems. Most of the changes made to ext3 for improving Lustre performance have been included into the upstream ext4 filesystem, reducing the number and size of patches in ldiskfs dramatically.
Nevertheless, we had originally planned to support multiple file systems, so Lustre does contain a file system abstraction. In future Lustre releases, we will support ZFS as the backing file system for both OSTs and MDTs.
Why didn't you use IBM's distributed lock manager?
The design of the Lustre DLM borrows heavily from the VAX Clusters DLM, plus extensions that are not found in others. Although we have received some reasonable criticism for not using an existing package (such as IBM's DLM), experience thus far has seemed to indicate that we've made the correct choice: it's smaller, simpler and, at least for our needs, more extensible.
The Lustre DLM, at around 6,000 lines of code, has proven to be an overseeable maintenance task, despite its somewhat daunting complexity. The IBM DLM, by comparison, was nearly the size of all of Lustre combined. This is not necessarily a criticism of the IBM DLM, however. To its credit, it is a complete DLM which implements many features which we do not require in Lustre.
In particular, Lustre's DLM is not really distributed, at least not when compared to other such systems. Locks in the Lustre DLM are always managed by the service node, and do not change masters as other systems allow. Omitting features of this type has allowed us to rapidly develop and stabilize the core DLM functionality required by the file system, plus add several extensions of our own (extent locking, intent locking, policy functions, glimpse ASTs, and a different take on lock value blocks).
Are services at user or kernel level? How do they communicate?
All daemons on a single node run in the kernel, and therefore share a single address space. Daemons on different nodes communicate through RPC messages; large messages are sent using remote DMA if the fabric supports it.