Lustre FAQ: Difference between revisions

Revision as of 15:32, 22 July 2009

Glossary

ACL: Access Control List
DLM: Distributed Lock Manager
EA: Extended Attribute
FC: Fibrechannel
HPC: High-Performance Computing
IB: InfiniBand
MDS: Metadata Server
NAL: Network Abstraction Layer; a software module which provides support for a particular interconnect
OSS: Object Storage Server
OST: Object Storage Target (what's the difference?)

Fundamentals

Can you describe the data caching and cache coherency method?

There is complete cache coherence for both metadata (names, readdir lists, inode attributes) and file data. Clients and servers both take locks with a distributed lock management service; caches are flushed before locks are released.

Does Lustre separate metadata and file data?

Yes. The entire namespace is stored on Lustre Metadata Servers (MDSs); file data is stored on Lustre Object Storage Servers (OSSs).

The data for each file may reside in multiple objects on separate servers. Lustre 1.x manages these objects in a RAID-0 (striping) configuration, so each object in a multi-object file contains only a part of the file's data. Future versions of Lustre will allow the user or administrator to choose other striping methods, such as RAID-1 or RAID-5 redundancy.

What is the difference between an OST and an OSS?

There is a lot of confusion, and it's mostly our fault; as the architecture evolved, we refined these terms, and it has been difficult to enforce.

An Object Storage Server (OSS) is a server node, running the Lustre software stack. It has one or more network interfaces, and usually one or more disks.

An Object Storage Target (OST) is an interface to a single exported backend volume. It is conceptually similar to an NFS export, except that an OST does not contain a whole namespace, but rather file system objects.

It is common for a single OSS to export more than one OST? Yes, for example to get around the Linux 2.6 maximum 8 TB partition size. Although Lustre will aggregate multiple OSTs into a single large file system, the individual OST partitions are 8 TB.

Does Lustre perform high-level I/O load balancing?

Yes. Because a single file can reside in pieces on many servers, the I/O load for even a single file can be distributed.

Objects are distributed amongst OSTs in a round-robin manner to ensure even load balancing across OSTs and OSS nodes. In Lustre 1.6 and later if the OSTs are imbalanced in terms of space usage the MDS will take this into account and allocate a larger fraction of files to OSTs with more free space.

Is there a common synchronized namespace for files and directories?

Yes. All clients which mount the file system will see a single, coherent, synchronized namespace at all times.

Can Lustre be used as part of a "single system image" installation?

Yes. Lustre as the root file system is being used by some installations on both clients and servers.

Do Lustre clients use NFS to reach the servers?

No. Client nodes run a native Lustre client file system driver, which uses the Lustre metadata and object protocols to communicate with the servers. The NFS protocol is not suitable to meet Lustre's metadata, I/O, locking, recovery, or performance requirements.

Does Lustre use/provide a single security domain?

Current versions of Lustre expect the clients and servers to have an identical understanding of UIDs and GIDs, but security is enforced on the Metadata Server by a server-local PAM-managed group database. Lustre supports Access Control Lists (ACLs). Strong security using Kerberos is being developed and will be in a future release.

Does Lustre support the standard POSIX file system APIs?

Yes. Applications which use standard POSIX file system APIs can run on Lustre without modifications.

Is Lustre "POSIX compliant"? Are there any exceptions?

POSIX does not, strictly speaking, say anything about how a file system will operate on multiple clients. However, Lustre conforms to the most reasonable interpretation of what the single-node POSIX requirements would mean in a clustered environment.

For example, the coherency of read and write operations are enforced through the Lustre distributed lock manager; if application threads running on multiple nodes were to try to read and write the same part of a file at the same time, they would both see consistent results.

This is true of all I/O and metadata operations, with two exceptions:

1. atime updates

It is not practical to maintain fully coherent atime updates in a high-performance cluster file system. Lustre will update the atime of files lazily -- if an inode needs to be changed on disk anyways, we will piggy-back an atime update if needed -- and when files are closed.

2. flock/lockf

POSIX and BSD flock/lockf system calls will be completely coherent across the cluster, using the Lustre lock manager, but are not yet fully supported.

Prior to Lustre 1.4.1, there was one additional deviation from POSIX, in the area of mmap I/O. In 1.4.1, changes were made to support cache-coherent mmap I/O and robust execution of binaries and shared libraries residing in Lustre. mmap() I/O is now coherent and synchronized via the Lustre lock manager, although there may be pathological cases that remain hazardous for some time.

Can you grow/shrink file systems online?

Lustre 1.6 contains support for online addition of OST targets either on a new or on an existing OSS. In an upcoming version of Lustre the recently added support for online resizing of ext3 volumes will provide an additional way of growing file systems. Shrinking is not supported.

Which disk file systems are supported as Lustre back-end file systems?

Lustre includes a patched version of the ext3 file system, called ldiskfs, with additional features such as extents, an efficient multi-block allocator, htree directories, large inodes, extended attributes, transaction optimizations, fine-grained locking, and CPU affinity for critical operations. In newer kernels that support the ext4 file system, this will be used instead of ext3. Work is underway to use the Solaris ZFS filesystem to increase the scalability and robustness of the back-end filesystem.

Why did Lustre choose ext3? Do you ever plan to support others?

There are many reasons to choose ext3. One is size; at under 15,000 lines of code, it is extremely understandable, maintainable, and modifiable. Another is reliability; ext3 is proven stable by millions of users, with an excellent file system repair tool.

When we began, there was a big difference between the various Linux file systems, particularly with respect to performance. In the last three years, however, Lustre has carried ext3 very substantially forward, and it is now extremely competitive with other Linux file systems.

Nevertheless, we had originally planned to support multiple file systems, so Lustre does contain a file system abstraction. The Solaris Port of the OSS will support ZFS.

Why didn't you use IBM's distributed lock manager?

The design of the Lustre DLM borrows heavily from the VAX Clusters DLM, plus extensions that are not found in others. Although we have received some reasonable criticism for not using an existing package (such as IBM's DLM), experience thus far has seemed to indicate that we've made the correct choice: it's smaller, simpler and, at least for our needs, more extensible.

The Lustre DLM, at around 6,000 lines of code, has proven to be an overseeable maintenance task, despite its somewhat daunting complexity. The IBM DLM, by comparison, was nearly the size of all of Lustre combined. This is not necessarily a criticism of the IBM DLM, however; to its credit, it is a complete DLM which implements many features which we do not require in Lustre.

In particular, Lustre's DLM is not really distributed, at least not when compared to other such systems. Locks in the Lustre DLM are always managed by the service node, and do not change masters as other systems allow. Omitting features of this type has allowed us to rapidly develop and stabilize the core DLM functionality required by the file system, plus add several extensions of our own (extent locking, intent locking, policy functions, glimpse ASTs, and a different take on lock value blocks).

Are services at user or kernel level? How do they communicate?

All daemons on a single node run in the kernel, and therefore share a single address space. Daemons on different nodes communicate through RPC messages; large messages are sent using remote DMA if the fabric supports it.

Sizing

What is the maximum file system size? What is the largest file system you've tested?

Each backend OST file system is restricted to a maximum of 2 TB on Linux 2.4 (imposed by the kernel block device), or 8 TB on Linux 2.6 (imposed by ext3). Of course, it is possible to have multiple OST file systems on a single OSS, and to aggregate multiple OSSs. Running with almost 4000 OST's has been tried - hence a 32PB file systems could be achieved today.

Lustre users already run single production filesystems of over 1.9PB, and a 5PB filesystem is being deployed.

What is the maximum file system block size?

The basic ext3 block size is 4096 bytes, although this could in principle be easily changed to a large PAGE_SIZE (on IA64, for example) with a few modifications to ext3. It is not clear, however, that this is necessary.

Some people confuse block size with extent size or I/O request size -- they are not the same thing. The block size is the basic unit of disk allocation, and for our purposes it seems that 4kB is as good as any. The size of a single file extent, by definition, is almost always larger than 4kB, and ext3 improvements in Lustre 1.4.x do a good job of using contiguous disk blocks whenever possible. The I/O request size (the amount of data that we try to read or write in a single request) is usually 1MB (aligned to the start of the LUN) and can be further aggregated by the disk elevator or RAID controller.

What is the maximum single-file size?

On 32-bit clients, the page cache makes it quite difficult to read or write a single file larger than 8 TB. On 64 bit clusters, the maximum file size is 2^64. A current Lustre limit for allocated file space arises from a maximum of 160 stripes and 2TB per file on current ext3 filesystems, leading to a limit of 320TB per file.

What is the maximum number of files in a single file system? In a single directory?

We use the ext3 hashed directory code, which has a theoretical limit of about 125 million files per directory, at which point the directory grows to more than 2 GB. The maximum number of subdirectories is 32,000 in versions prior to Lustre 1.2.6 and has the same limit as regular files in later versions (small ext3 format change).

More realistically, we regularly run tests with ten million files in a single directory. On a properly-configured dual-CPU MDS with 4 GB of ram, it is possible to do random lookups in this directory at a rate of 5,000/second.

A single MDS imposes an upper limit of 4 billion inodes, but the default limit is slightly less than the device size / 4kB, so about 512 million inodes for a 2 TB MDS file system. This can be increased at initial file system creation time by specifying mkfs options. Production file systems containing over 300 million files exist.

With the introduction of clustered metadata servers (Lustre 2.0) and with ZFS-based MDTs, these limits will disappear.

How many OSSs do I need?

The short answer is: as many as you need to achieve the required aggregate I/O throughput.

The long answer is: Each OSS contributes to the total capacity and the aggregate throughput. For example, a 100 TB file system may use 100 single-GigE-connected OSS nodes with 1 TB of 100 MB/sec storage each, providing 10 GB/sec of aggregate bandwidth. The same bandwidth and capacity could be provided with four heavy-duty 25 TB OSS servers with three Elan 4 interfaces and 16 FC2 channels, each providing ~2.5 GB/s in aggregate bandwidth. The 25 TB of storage must be capable of 2.5 GB/s.

Each OSS can support a very large number of clients, so we do not advise our customers to use any particular client-to-OSS ratio. Nevertheless, it is common to deploy 1 GB/s of OSS throughput per 1 TFLOP/s of compute power.

What is the largest possible I/O request?

When most people ask this question, they are asking what is the maximum buffer size that can be safely passed to a read() or write() system call. In principle this is limited only by the address space on the client, although in practice it is possible to request so much I/O in a single atomic unit that the cluster infrastructure will see this as non-responsiveness and generate timeouts. Depending on your I/O infrastructure this can be 100 MB or more, however, so it has not been an issue in reality. In any case, we are aware of this limitation, and will work to remove it in a future release.

The Lustre I/O subsystem is designed with the understanding that an I/O request travels through many pipelines, and that it's important to keep all pipelines full for maximum performance. So it is not necessary to teach your application to do I/O in very large chunks; Lustre and the page cache will aggregate I/O for you.

Typically, Lustre client nodes will do their best to aggregate I/O into 1 MB chunks on the wire, and to keep between 5 and 10 I/O requests "in flight" at a time, per server.

On the OSS, we have gone to significant effort to ensure that these large 1 MB buffers do not get unnecessarily broken up by lower kernel layers. In Linux 2.4, modifications were required to the SCSI layer, block devices, and the QLogic fibrechannel driver. Because nearly 100% of our customers use QLogic fibrechannel cards, we have not yet optimized other drivers in this way.

How many nodes can connect to a single Lustre file system?

The largest single production Lustre installation is approximately 25,000 nodes today (2006). A Cray XT3 cluster of almost 12,000 nodes is currently being deployed, although it will likely run in production as two separate 6,000-node clusters.

Although these are the largest clusters available to us today, we believe that the architecture is fundamentally capable of supporting many tens of thousands of clients.

Installation

Which operating systems are supported as clients and servers?

Please see OS Support.

Can you use NFS or CIFS to reach a Lustre volume?

Yes. Any native Lustre client (running Linux today, by definition) can export a volume using Samba (or NFS, but keep reading). Some people have even built small clusters of these export nodes, to improve overall performance to their non-native clients.

Although NFS export works today, we don't support it because it is extremely slow. Lustre will require some specific optimizations to work around the behaviour of the Linux kernel NFS server. We're working on these, but in the meantime, we suggest Samba.

CIFS export with Samba, even a cluster of such nodes, is possible with one caveat: oplocks and Windows share modes. If you connect to these Samba shares with Windows clients, they will probably make heavy use of share modes and oplocks for locking and synchronization. Samba implements them internally, and does not yet have a clustered mode to coordinate them between multiple servers running on separate nodes. So if you rely on the consistency of these share modes and oplocks, you should use a single node to export CIFS.

What is the typical MDS node configuration?

10,000-node clusters with moderate metadata loads are commonly supported with a 8-code node with 8GB of RAM, providing sustained throughput of over 5,000 ops/second. It is common for these systems to have roughly 200 million files. Even in 10,000-client clusters, the single MDS has been shown not to be a bottleneck under typical HPC loads.

High throughput with very large directories is possible with extra RAM and, optionally, solid-state disks. Typically, write I/O is low, but seek latency is very important, hence RAID-0+1 mirrored storage (RAID-0 striping of multiple RAID-1 mirrored disks) is strongly recommended for the MDS. Access to metadata from RAM can be 10x faster, so RAM more RAM is usually beneficial.

Storage requirements should be sized at approximately 4kB per file, except in unusual circumstances.

What is the typical OSS node configuration?

Multi-core 64-bit servers with good buses are capable of saturating multiple interconnects of any type. These nodes are often dual- or quad- socket and support up to 4 fibrechannel interfaces. RAM is used for locks and metadata caching, and in 1.8 and later extra server RAM will be used for file data caching, though is not strictly required. Aproximately 2 GB/OST is recommended for OSS nodes in 1.8, and twice that if OST failover will also run backup OSTs on the node.

Which architectures are interoperable?

Lustre requires the page size on server nodes (MDS and OSS) to be smaller or the same size as client nodes. Except for this, there are no known obstacles to interoperability, even among heterogeneous client groups and systems with different endianness.

Which storage devices are supported, on MDS and OSS nodes?

Servers support all block storage: fibrechannel, SCSI, SATA, ATA and exotic storage (NVRAM) are supported.

Which storage interconnects are supported?

Just to be clear: Lustre does not require a SAN, nor does it require a fabric like iSCSI. It will work just fine over simple IDE block devices. But because many people already have SANs, or want some amount of shared storage for failover, this is a common question.

For storage behind server nodes, FibreChannel, InfiniBand, iSCSI, or any other block storage protocol can be used. Failover functionality requires shared storage (each partition used active/passive) between a pair of nodes on a fabric like SCSI, FC or SATA.

Are fibrechannel switches necessary? How does HA shared storage work?

Typically, fibrechannel switches are not necessary. Multi-port shared storage for failover is normally configured to be shared between two server nodes on a FC-AL. Shared SCSI and future shared SATA devices will also work.

Backend storage is expected to be cache-coherent between multiple channels reaching the devices. Servers in an OSS failover pair are normally both active in the file system, and can be configured to take over partitions for each other in the case of a failure. MDS failover pairs can also both be active, but only if they serve multiple separate file systems.

Can you put the file system journal on a separate device?

Yes. This can be configured when the backend ext3 file systems are created.

Can you run Lustre on LVM volumes, software RAID, etc?

Yes. You can use any Linux block device as storage for a backend Lustre server file system, including LVM or software RAID devices.

Can you describe the installation process?

The current installation process is straightforward, but manual:

1. Install the provided kernel and Lustre RPMs. 2. A configuration tool assistant can generate a configuration file for simple configurations, or you can build more complex configurations with relatively simple shell scripts. 3. Format and mount the OST and MDT filesystems. The command is usually identical on all nodes, so it's easy to use a utility like pdsh/prun to execute it. 4. Start the clients with "mount", similar to how NFS is mounted.

What is the estimated installation time per compute node?

Assuming that node doesn't require special drivers or kernel configuration, 5 minutes. Install the RPMs, add a line for the filesystem to /etc/fstab. Compute nodes can be installed and started in parallel.

What is the estimated installation time per I/O node?

5-30 minutes, plus formatting time, which can also be done in parallel. The per-node time depends on how easily the commands can be run in parallel on the I/O nodes. Install the RPMs, format the devices, add a line to /etc/fstab for the filesystem, mount.

Networking

Which interconnects and protocols are currently supported?

Today, Lustre supports TCP/IP (commonly over gigabit or 10-gigabit ethernet), Quadrics Elan 3 and 4, OpenIB generation 1, Voltaire IB, OFED, Topspin, Myrinet GM and MX, CISCO, and Cray's Rapid Array and Seastar networks.

The up-to-date versions of each network type are at the beginning of the lnet/ChangeLog file and each release announcement.

Can I use more than one interface of the same type on the same node?

Yes, with Lustre 1.4.6 and later.

Can I use two or more different interconnects on the same node?

Yes, with Lustre 1.4.x, subject to the particular limitations of the interconnect. For example, we are told that it is not possible to use both Elan 3 and Elan 4 in the same node at the same time.

Can I use TCP offload cards?

Probably -- but we've tried many of these cards, and for various reasons we didn't see much improvement, if any. First, because Lustre runs entirely in the kernel, it uses kernel networking APIs which are often not supported (or at least not optimized) by the offload drivers.

Second, the problem isn't the overhead of checksum calculation or the need for interrupt coalescing; lots of commodity ethernet cards already support these features. The big overhead is memory copying and buffering, which these cards rarely do anything to address.

Does Lustre support crazy heterogeneous network topologies?

Yes, although the craziest of them are not yet fully supported.

Because Lustre supports native protocols on top of high speed cluster interconnects (in addition to TCP/IP), some special infrastructure is necessary.

Lustre uses its own implementation of the Portals message passing API, upon which we have implemented Gateway nodes, to route between two native protocols. These are commodity nodes with, for example, both gigabit ethernet and InfiniBand interfaces. The gateway software translates the Portals packets between the interfaces to bridge the two networks.

These routers are in use today, and may become more popular as more enterprises connect multiple clusters with special interconnects to a single global Lustre file system. On the other hand, TCP/IP on GigE is the interconnect of choice for most organizations, which requires no additional Portals routing.

Metadata Servers

How many metadata servers does Lustre support?

Lustre 1.x supports up to two metadata servers (MDSs) per file system in an active/passive failover configuration, meaning that only one server is actually servicing requests as a time.

You can configure multiple active metadata servers today, but they must each serve separate file systems.

Lustre 2.x will introduced the clustered metadata feature, which will permit dozens or hundreds of metadata servers working in parallel for a single file system.

How will clustered metadata work?

At a high level, it is reasonably simple: each directory can be striped over multiple metadata servers, each of which contains a disjoint portion of the namespace. When a client wants to lookup or create a name in that namespace, it uses a hashing algorithm to determine which metadata server holds the information for that name.

When you consider the details of doing this efficiently, coherently, and completely recoverably in the face of any number of different failures, it becomes more complicated.

We have already demonstrated very substantial portions of this functionality, including recovery, as a part of our DoE/NNSA PathForward effort. It will not be production-quality, however, for some time.

Isn't the single metadata server a bottleneck?

Not so far. We regularly perform tests with single directories containing millions of files, and we have several customers with 10,000-node clusters (or larger) and a single metadata server.

Lustre is carefully designed to place the entire burden of file I/O on the Object Storage Servers (OSSs): locking, disk allocation, storage and retrieval, everything. Once the file is opened and the striping information obtained by the client, the metadata server is no longer involved in the business of file I/O.

The Lustre metadata server software is extremely multithreaded, and we have made substantial efforts to reduce the lock contention on the MDS so that many clients can work concurrently. These are the kinds of optimizations which make it possible to do random creations or lookups in a single 10-million-file directory at a rate of more than 5,000 per second. Future versions of Lustre will continue to improve the metadata performance.

If there is a customer need for massive metadata servers prior to the release of clustered metadata, it should be possible to scale quite far using large SMP systems (such as the SunFire x4600 or Bull NovaScale) with large memory spaces and solid-state disks which can utilize large caches to speed MDS operations.

What is the typical MDS node configuration?

Please see Installation.

How do I automate failover of my MDSs?

Please see Recovery.

Do you plan to support MDS failover without shared storage?

No. The extreme complexity that this would introduce does not seem to be warranted by the relatively modest cost of a small amount of shared storage for the metadata servers.

How is metadata allocated on disk?

The standard way that Lustre formats the MDS file system is with 512-byte ext3 inodes, which contain extended attributes (EAs) embedded in the inodes. One use of such an EA is for the file striping data, which tells the clients on which object servers to find the file data. For very widely striped files, this EA may be too large to store in the inode and will be stored in separate blocks. By storing the EA in the inode when possible, we avoid an extra very expensive disk seek.

What is intent locking?

Most file systems operate in one of two modes: a mode in which the server does the metadata modifications, or a mode in which the client can cache metadata updates itself. Both ways have their advantages and disadvantages in certain situations.

Consider the case of 1,000 clients all chdir'ing to /tmp and creating their own output files. If each client locks the directory, adds their file, uploads the modification, and releases the lock, this simple operation will take forever. If the metadata server is able to execute the operations locally and return the results, it should all happen in less than a second.

This is not a contrived example -- Lustre users run applications which do this very thing every hour of every day, for example, to write checkpoint data of a long-running scientific computation.

Consider another very common case, like a user's home directory being used only by one node. In this case, it would be extremely advantageous to allow that node to cache metadata updates in ram, then lazily propagate them back to the MDS. This allows the user to make updates as fast as they can be recorded in ram (until ram is full).

The moral of the store is: in cases of high concurrency, do the updates on the metadata server. In single-user cases, cache updates on the client.

What does this have to do with intent locking? Our protocol bundles up the information for the entire operation with the initial lock request, in the form of a metadata intent, and gives the metadata server the option of whether to execute the operation immediately (and return only a result code), or to return a lock to allow the client to perform writeback caching.

Lustre 1.x does not include a metadata writeback cache on the client, so today's metadata server always executes the operation on the client's behalf. Even without a writeback cache, however, the intent locking infrastructure still provides value. By having all of the information available during the initial lock request, we are able to perform all metadata operations in a single RPC.

How does the metadata locking protocol work?

Prior to Lustre 1.4.x, each metadata server inode was locked as a single unit. When the client wished to cache the existence of a name, or the attributes or directory pages of an inode, it would take and hold such a read lock. When the metadata server modified an inode, it would take a write lock on the same.

There are common cases when even one lock per inode is not enough, however. For example, consider the case of creating the file "dir/file", as would happen during the unpacking of a tar archive. The client would first lookup "dir", and in doing so take a lock to cache the result. It would then ask the metadata server to create "file" inside of it, which would lock "dir" to modify it, thus yanking the lock back from the client. This "ping-pong" back and forth is unnecessary and very inefficient, so Lustre 1.4.x introduced separate locking of different parts of the inode (simple existence, directory pages, and attributes).

Does the MDS do any pre-allocation?

Yes. To enable very fast file creation, the metadata server asks the individual OSTs to pre-create some number of objects, which the MDS can then allocate as file stripes without additional RPCs. These preallocations are replenished asynchronously.

Object Servers and I/O Throughput

What levels of throughput should I expect?

This of course depends on many variables, including the type and number of clients and servers, your network and disk infrastructure, your application's I/O patterns, tuning, and more. With standard HPC workloads and reasonable (ie, not seek-bound, nor extremely small I/O requests, etc) Lustre has demonstrated up to 90% of the system's raw I/O bandwidth capability.

With all of those variables in mind, here are some demonstrated single-server results on customer or demonstration installations of various types:

TCP/IP
- Single-connected GigE: 115 MB/s
- Dual-NIC GigE on a 32-bit OSS: 180 MB/s
- Dual-NIC GigE on a 64-bit OSS: 220 MB/s
- Single-connected 10GigE on a 64-bit OSS: 550 MB/s, 1GB/s on woodcrest
Unoptimized InfiniBand
- Single-port SDR InfiniBand on a 64-bit OSS: 700-900 MB/s
- DDR InfiniBand on a 64-bit OSS: 1500 MB/s

How fast can a single OSS be?

Using Lustre 1.4.0, a single 8-way Bull NovaScale IA-64 OSS, DataDirect Networks storage, and 3 rails of Quadrics Elan 4, a single OSS achieved 2.6 GB/s of sustained end-to-end bandwidth from two 16-way IA-64 client nodes.

Also using Lustre 1.4.0, a single-CPU AMD Opteron using 10-gigabit ethernet has been clocked at 550 MB/s.

How well does Lustre scale as OSSs are added?

Configured properly, it will scale linearly. In demonstrations on a production system of up to 104 Lustre OSSs, each connected with a single gigabit ethernet port, the aggregate sustained bandwidth reached 11.1 GB/s.

How many clients can each OSS support?

The number of clients is not usually a factor in choosing how many OSSs to deploy. Please see Sizing.

What is a typical OSS node configuration?

Please see Installation.

How do I automate failover of my OSSs?

Please see Recovery .

Do you plan to support OSS failover without shared storage?

Yes, Server Network Striping will allow RAID-1 and RAID-5 file I/O features. They will provide redundancy and recoverability in the Lustre object protocol rather than requiring shared storage.

How is file data allocated on disk?

Because the Lustre OSTs mount regular ext3 file systems, you can mount them directly and look at them. If you were to do so, you would see a lot of files with names like "934151", which are object numbers. Inside each object is a file's data, or a portion of that file's data, depending on the striping policy for that file. There is no namespace information stored on the object server at this time.

The allocation of this file data to disk blocks is governed by ext3, although here we have made very substantial improvements. Instead of a long array of individual blocks, Lustre's ext3 manages file data extents, which can dramatically reduce the amount of this metadata for each file, and therefore the amount of seeking and I/O required to read and write it. We also implemented a new buddy block allocator, which can very quickly and without a lot of searching return very large contiguous disk extents.

How does the object locking protocol work?

Before any file data can be modified, or stored in a client cache, a lock must be taken. Each OST runs a lock server, and manages the locking for the stripes of data which reside on that OST. This has two extremely positive effects:

First, it removes a potential bottleneck of a single lock server. As you add object servers, you also add lock server capacity, in addition to disk capacity and bandwidth, and network bandwidth.

Second, it removes the so-called "split-brain" problem common in clustered systems. If the lock service and I/O service reside on different nodes, it is possible for the communications between them to be disrupted, while clients can still access one or both. In that case, data corruption could result because the locking and I/O would no longer be carefully coordinated.

In the Lustre protocol, if a client requests a lock which conflicts with a lock held by another client, a message is sent to the lock holder asking for the lock to be dropped. Before that lock is dropped, the client must write back any cached modifications, and remove all data from its cache for which it will no longer have a lock. Then, and only then, can it drop the lock.

If a client does not drop its lock in a reasonable amount of time (defined by a configurable timeout value) -- perhaps because it has been powered off, or suffered a hardware failure, or for some other reason -- it is evicted from that OST and will not be allowed to execute any operations until it has reconnected. This allows the remainder of the cluster to continue after a node has failed, after a short pause.

Finally, we have implemented a lock manager extension to optimize the very common case of sampling an object's attributes while it is being modified. Many users, for example, will want to track the progress of a job by getting a file listing ("ls -l") in the output directory while the job is writing its data.

Because it is not acceptable to return stale or out-of-date file size information, we must ask the server for a lock on this data. Because we don't actually need the data -- we just need to know how much there is -- we tell the server that instead of a lock it could simply provide the attributes. This is another case of intent locking . If the file is not being actively modified, then the server will grant a lock so that the client can cache the attributes.

Does Lustre support Direct I/O?

Yes. It locks the data to guarantee cluster-wide consistency, just like normal POSIX I/O, but does not cache the data on the client or server. With an RDMA-capable network (anything other than TCP) there is only a single data copy directly from the client RAM to the server RAM and straight to the disk.

Can these locks be disabled?

Yes, but:

It's only safe to do so when you use direct I/O; otherwise you have data in the caches which is not locked. Once that data is in the cache without a lock, it will not be removed except under memory pressure.
In practice, the overhead of these locks has not been shown to be an issue. Databases may or may not be an exception, but in any case, they tend to use direct I/O.

Do you plan to support T-10 object devices?

We are in touch with the T-10 committee. It is not clear to us that recovery and lock management implications for cluster file systems will see sufficient attention in the T-10 standard for this proposal to be viable. The goals of the T-10 committee may not, in the end, line up well with the very strong semantic guarantees that Lustre makes.

Does Lustre support/require special parallel I/O libraries?

Lustre supports them, but by no means requires them. We have found equal performance when using standard POSIX I/O calls, the POSIX ADIO driver, or the MPI/IO libraries, when the IO size is sufficient.

A Lustre-specific MPI/IO ADIO driver has been developed to allow an application to provide hints about how it would like its output files to be striped, and to optimize the IO pattern when many clients are doing small read or write operations to a single file.

Recovery

How do I configure failover services?

Typical failover configurations couple two Lustre MDS or OSS nodes in pairs directly to a multi-port disk array. Object servers are typically active/active, with each serving half of the array, while metadata servers must be active/passive. These array devices typically have redundancy internally, to eliminate them as single points of failure.

This does not typically require a fibrechannel switch.

How do I automate failover of my MDSs/OSSs?

The actual business of automating the decisions about whether a server has failed, and which server should take over the load, is managed by a separate package (our customers have used Red Hat's Cluster Manager and SuSE's Heartbeat).

Completely automated failover also requires some kind of programmatically controllable power switch, because the new "active" MDS must be able to completely power off the failed node. Otherwise, there is a chance that the "dead" node could wake up, start using the disk at the same time, and cause massive corruption.

How necessary is failover, really?

The answer depends on how close to 100% uptime you need to achieve. Failover doesn't protect against the failure of individual disks -- that is handled by software or hardware RAID at the OST and MDT level. Lustre failover is to handle the failure of an MDS or OSS node as a whole which, in our experience, is not very common.

We would suggest that simple RAID-5 or RAID-6 storage is sufficient for most users, with manual restart of failed OSS and MDS nodes, but that the most important production systems should consider failover.

I don't need failover, and don't want shared storage. How will this work?

If Lustre is configured without shared storage for failover, and a server node fails, then a client that tries to use that node will pause until the failed server is returned to operation. After a short delay (a configurable timeout value), applications waiting for those nodes can be aborted with a signal (kill or Ctrl-C), similar to the NFS soft-mount mode.

When the node is returned to service, applications which have not been aborted will continue to run without errors or data loss.

If a node suffers a connection failure, will the node select an alternate route for recovery?

Yes. If a node has multiple network paths, and one fails, it can continue to use the others.

What are the supported hardware methods for HBA, switch, and controller failover?

These are supported to the extent supported by the HBA drivers. If arrays with multiple ports are shared by multiple I/O nodes, Lustre offers 100% transparent failover for I/O and metadata nodes. Applications will see a delay while failover and recovery is in progress, but system calls complete without errors.

Can you describe an example failure scenario, and its resolution?

Although failures are becoming more rare, it is more likely that a node will hang or timeout rather than crash. If a client node hangs or crashes, usually all other client and server nodes are not affected. Normally such a client is rebooted and rejoins the file system. When server nodes hang, they are commonly restarted, merely causing a short delay to applications which try to use that node. Other server nodes or clients are not usually affected.

How are power failures, disk or RAID controller failures, etc. addressed?

If I/O to the storage is interrupted AND the storage device guarantees strict ordering of transactions, then the ext3 journal recovery will restore the file system in a few seconds.

If the file system is damaged through device failures, unordered transactions, or a power loss affecting a storage device's caches, Lustre requires a file system repair. Lustre's tools will reliably repair any damage it can. It will run in parallel on all nodes, but can still be very time consuming for large file systems.

OS Support

Which operating systems are/will be supported?

There are three ways for a file system client to access a Lustre volume: directly, with a native kernel driver; directly, via a userspace library (liblustre); or indirectly, via an NFS or CIFS export.

Today, native kernel drivers exist only for Linux. Other native ports, such as to Windows and Solaris are underway.

liblustre is not yet as widely used as Lustre in the Linux kernel, but it is fundamentally almost entirely the same code. The Cray Red Storm supercomputer runs liblustre on the Catamount operating system on 22000 nodes to provide all client and server storage, including root FS and swap space, so this is a very serious use of liblustre. It could in principle be used on any Unix-like operating system, or even Windows.

CIFS or NFS export will be the solution of choice for most of our customers wishing to integration non-Linux platforms in the short term.

We currently support only Linux as Lustre server nodes. Support for other operating systems such as Solaris may be available at a later date.

With Lustre 1.6.0 Linux 2.4 clients and servers will no longer be supported. See the lustre/ChangeLog and release notes for specific kernel and vendor distributions that are supported.

Why have you decided to patch the Linux kernel?

Lustre's goals are extremely ambitious; there are few, if any, other file systems which attempt such scalability, performance, and consistency guarantees in a single package. Because this hadn't been done before, the infrastructure was not present in the kernel for such a file system.

The majority of the changes were in the VFS layer, implementing an API extension to make intent locking possible. Another very substantial set of changes were made to ext3, to make it more scalable and performant. Some extra symbols need to be exported.

In 1.6 releases Lustre runs without patching the kernel, for Linux kernels 2.6.16 and newer. There are still a small number of kernel patches needed for servers, and to the copy of the ext3 filesystem used by Lustre (ldiskfs), but the majority of the ext3 changes have been incorporated into ext4 in the 2.6.25 kernel.

Are there plans to get these patches into the kernel.org/OSDL kernel?

Yes. The Lustre patches have been extensively reviewed by the Linux kernel community and vendors. Many of the changes are already present in Linux 2.6, and our contributions to ext3 have been delivered upstream in the ext4 filesystem.

With 2.6.16 and later kernels enough of the Lustre patches have been incorporated into the upstream kernel that clients can run without patches. There is a shrinking number of patches needed on the server.

Can I run Lustre without patching my kernel?

The SuSE Linux Enterprise Server 9 kernel contains the Lustre VFS patches needed to run both clients and servers, though additional patches are needed to address some minor defects found after that kernel was released.

It is possible to run Lustre clients on all kernels newer than 2.6.16, and the latest RHEL4 2.6.9 kernel without patching.

Which Linux kernels are supported?

We currently support a number of kernels for Lustre 1.4: the Red Hat Enterprise Linux 3 kernel (based on 2.4.21), SuSE Linux Enterprise Server 9 (based on 2.6.5), SuSE Linux Enterprise Server 10 (based on 2.6.16), and the Red Hat Enterprise Linux 4 kernel (based on 2.6.9).

For Lustre 1.6 the Red Hat Enterprise Linux 5 kernel (based on 2.6.16) is also supported, but RHEL3 is removed.

An up-to-date list of exact kernel versions supported by each release is at the start lustre/ChangeLog and the release notes.

Which Linux distributions are supported?

Because Lustre runs almost entirely in the kernel, there are practically no distribution-specific issues.

Lustre enterprise support customers can download packages which have been tested on RHEL3, RHEL4, SLES9 and SLES10 systems, but are likely to work elsewhere. You are welcome to use whichever distribution you wish, although we do ask that you use one of the Lustre-supported kernels.

What if I don't run one of those kernels?

If you can't or won't run one of the supported kernels, you can also build Lustre for the particular kernel on your client. For servers we still strongly recommend one of the supported kernels, as it's not a matter of "just building from source". There are patches for each kernel, and given the differences between most kernels, these are fairly non-trivial to port. Until there is a customer demand for a given kernel series, the Lustre group simply does not have the resources to maintain those patches.

We expect that this will no longer be an issue in the timeframe of Lustre 1.8, and in the meantime we will support the kernels that our customers use.

Do you support Lustre on an SGI Altix?

We expect that, if you use the SLES9 kernel, it should pretty much work out of the box.

When was Lustre for Linux first used in production?

A pre-1.0 version of Lustre for Linux was first used in a production cluster environment in March 2003.

Release Testing and Upgrading

How does the Lustre group fix issues?

The Lustre group approaches bug tracking and fixing seriously and methodically:

Regression testing: A test is written to reproduce the problem, which is added to the ongoing test suite.
Architecture and Design: Depending on the severity and invasiveness, an update to the architecture description may be written and reviewed by senior architects. A detailed design description for the patch is written and reviewed by principal engineers.
Implementation: Fixes are implemented according to the design description, and added to a bug for review and inspection.
Review and Inspection: A developer or development team will review the code first, and then submit it for a methodical inspection by senior and principal engineers.
Testing: The developer runs a small suite of tests before the code leaves his or her desk, then it's added to a branch for regression testing, and final release testing.

What testing does each version undergo prior to release?

Sun and its vendor and customer partners run a large suite of tests on a number of systems, architectures, kernels, and interconnects, including clusters as large as 400 nodes. Major updates receive testing on the largest clusters available to us, around 1,000 nodes.

Are Lustre releases backwards and forward compatible on the disk? On the wire?

Special care is taken to ensure that any disk format changes -- which are rare to begin with -- are handled transparently in previous and subsequent releases. Before the disk format changes, we release versions which support the new format, so you can safely roll back in case of problems. After the format change, new versions continue to support the old formats for some time, and transparently update disk structures when old versions are encountered.

Before Lustre 1.4.2, the same care was not taken for wire structures, because we've been adding features so quickly that it hasn't been practical. Since 1.4.2 each client and server negotiates a set of supported protocol features between them, allowing new network protocols to be added while not breaking older clients or servers.

Support for running with older protocols are removed on every second major release, so the 1.4.x release will be incompatible with 1.8.x release. Similarly, the 1.6.x release will not be able to interoperate with 2.0.x release.

The Lustre group always tests release 1.x.y with 1.x.(y-1) and 1.(x-2).latest when making a release. It isn't possible to exhaustively test all release combinations, and as a result we only recommend using the most recent version of two successive releases at one time, even though in theory any 1.x release will work with any 1.(x-2) release.

Do you have to reboot to upgrade?

Not unless you upgrade your kernel. It's usually a simple matter of unmounting the file system or stopping the server, as the case may be, installing the new RPMs, and restarting it.

Some of our customers upgrade servers between wire-compatible releases using failover; a service is failed over, the software is updated on the stopped node, the service is failed back, and the failover partner is upgraded in the same way.

Licensing and Support

What is the licensing model for the Lustre file system for Linux?

The Lustre file system for Linux is an Open Source product.

New releases are made available to the general public at the same time as to our paying customers and partners, under the terms and conditions of the GNU GPL.

As we develop ports for operating systems other than Linux, it is highly likely that these will be proprietary. We may also decide to develop some future features as proprietary add-ons. These are easily separable, non-core features primarily of interest to enterprise customers.

Virtually every Open Source company has gone out of business or had to change to a proprietary model; this is our way of balancing the realities of business with the desire to provide an excellent Open Source cluster file system for Linux.

How can you add proprietary features? If you release part of Lustre under the GPL, you have to distribute the code for the entire thing.

Sun owns the copyright to virtually 100% of the relevant Lustre source code, and has a very liberal license to the remainder. This means that we can distribute Lustre under whichever license (or licenses) we choose.

If public money helped fund the development of Lustre, don't the taxpayers own that code?

Some portions of Lustre were developed under the sponsorship of the US Government under Subcontract nos. B514193, B525177, B523817, B536384, 2204-10713, and others. Under the terms of those subcontracts, Sun retained the copyright to all software developed. We released all of this software to the US Government under the GNU GPL.

Many of these efforts were to produce prototypes or beta implementations, not production-quality software. Sun has invested considerable resources in productizing these features, and developing new features entirely on our own.

If Lustre stops getting professional support, the taxpayers' money will have been wasted. To that end, our government sponsors strongly encourage us to build a sustainable business model.

How does the commercial distribution of Lustre work?

Our Lustre enterprise support customers receive unlimited access to the Lustre technical support team, and through them, the Lustre developers. Support contracts are priced according to the number of Lustre clients and servers, with discounts that grow as the cluster size increases.

For more information about Lustre support contracts, please contact lustre-solutions@sun.com.

Will Sun develop custom features for a proprietary product?

You can ask, but generally speaking, no. There are several potential reasons for this:

Ideological: We want to provide a fully-featured cluster file system for Linux which is available to everyone. If some features were developed for your proprietary product, it undermines that effort.

Pragmatic: We don't have the resources to test and support very many different versions. It's best for us if everyone runs software with the same features, because it very substantially reduces the release engineering burden.

Selfish: It would be unfortunate for us if the code that we developed for you, under a proprietary license, was code that we later wanted to use in a different way. We'd rather not paint ourselves into that corner.

What Lustre support services are available?

Sun provides worldwide, 12/5 enterprise support services for the Lustre file system, with guaranteed response times as low as one hour. You can contact us via telephone, email, or our web site, and speak directly to the experts who designed and implemented the file system.

Sun also supports many partners and their products with licensing and support services. For partners with demonstrated expertise and Lustre support experience, substantial discounts are available.

Contract development of new features, or acceleration of our development roadmap features to a guaranteed delivery date, are possible.

Sun also provides public and private on-site training services for your system administrators or support staff. Please check our upcoming course schedule here.

@@ Line 162: / Line 162: @@
 ===What is the typical MDS node configuration?===
-,000-node clusters with moderate metadata loads are commonly supported with a dual-Xeon node with 2GB of RAM, providing sustained throughput of over 1,000 ops/second. It is common for these systems to have roughly 20 million files. Even in 1,000-client clusters, the single MDS has been shown not to be a bottleneck under typical HPC loads.
+,000-node clusters with moderate metadata loads are commonly supported with a 8-code node with 8GB of RAM, providing sustained throughput of over 5,000 ops/second. It is common for these systems to have roughly 200 million files.  Even in 10,000-client clusters, the single MDS has been shown not to be a bottleneck under typical HPC loads.
-High throughput with very large directories is possible with 64-bit architectures and extra RAM. Typically, write I/O is low, but seek latency is very important, hence RAID-0+1 mirrored storage
+High throughput with very large directories is possible with extra RAM and, optionally, solid-state disks. Typically, write I/O is low, but seek latency is very important, hence RAID-0+1 mirrored storage
 (RAID-0 striping of multiple RAID-1 mirrored disks) is strongly recommended for the MDS.  Access to metadata from RAM can be 10x faster, so RAM more RAM is usually beneficial.