WARNING: This is the _old_ Lustre wiki, and it is in the process of being retired. The information found here is all likely to be out of date. Please search the new wiki for more up to date information.

Difference between revisions of "Lustre FAQ"

Revision as of 00:20, 2 January 2008

Glossary

  • ACL: Access Control List
  • DLM: Distributed Lock Manager
  • EA: Extended Attribute
  • FC: Fibrechannel
  • HPC: High-Performance Computing
  • IB: InfiniBand
  • MDS: Metadata Server
  • NAL: Network Abstraction Layer; a software module which provides support for a particular interconnect
  • OSS: Object Storage Server
  • OST: Object Storage Target (what's the difference?)


Fundamentals (15)

Can you describe the data caching and cache coherency method?

There is complete cache coherence for both metadata (names, readdir lists, inode attributes) and file data. Clients and servers both take locks with a distributed lock management service; caches are flushed before locks are released.

Does Lustre separate metadata and file data?

Yes. The entire namespace is stored on Lustre Metadata Servers (MDSs); file data is stored on Lustre Object Storage Servers (OSSs).

The data for each file may reside in multiple objects on separate servers. Lustre 1.x manages these objects in a RAID-0 (striping) configuration, so each object in a multi-object file contains only a part of the file's data. Future versions of Lustre will allow the user or administrator to choose other striping methods, such as RAID-1 or RAID-5 redundancy.
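As an illustration of the RAID-0 layout described above, here is a minimal sketch of how a byte offset in a file could map to one of its objects. The names `stripe_size` and `stripe_count`, the round-robin order, and the 1 MiB example value are assumptions made for this sketch; they are not taken from the Lustre source.

```python
MIB = 1 << 20  # 1 MiB, used here only as an example stripe size


def locate_stripe(offset, stripe_size, stripe_count):
    """Map a file byte offset to (object index, byte offset inside that object).

    Assumes plain round-robin RAID-0 striping: stripe 0 goes to object 0,
    stripe 1 to object 1, and so on, wrapping around after stripe_count.
    """
    if offset < 0 or stripe_size <= 0 or stripe_count <= 0:
        raise ValueError("offset must be >= 0; stripe_size and stripe_count must be > 0")
    stripe_number = offset // stripe_size          # which stripe of the file
    object_index = stripe_number % stripe_count    # which object holds that stripe
    # Each full pass over all objects adds one stripe to every object.
    object_offset = (stripe_number // stripe_count) * stripe_size + offset % stripe_size
    return object_index, object_offset


# A file striped over 4 objects with 1 MiB stripes:
print(locate_stripe(5 * MIB + 10, MIB, 4))  # (1, 1048586)
```

With four objects, the sixth stripe of the file wraps around to the second object, which is why each object holds only part of the file's data.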

What is the difference between an OST and an OSS?

There is a lot of confusion, and it's mostly our fault; as the architecture evolved, we refined these terms, and it has been difficult to enforce.

An Object Storage Server (OSS) is a server node, running the Lustre software stack. It has one or more network interfaces, and usually one or more disks.

An Object Storage Target (OST) is an interface to a single exported backend volume. It is conceptually similar to an NFS export, except that an OST does not contain a whole namespace, but rather file system objects.

Is it common for a single OSS to export more than one OST? Yes, for example to get around the Linux 2.6 maximum 8 TB partition size. Although Lustre will aggregate multiple OSTs into a single large file system, each individual OST partition is limited to 8 TB.

Does Lustre perform high-level I/O load balancing?

Yes. Because a single file can reside in pieces on many servers, the I/O load for even a single file can be distributed.

By default, objects are randomly distributed amongst OSTs.

Is there a common synchronized namespace for files and directories?

Yes. All clients which mount the file system will see a single, coherent, synchronized namespace at all times.

Can Lustre be used as part of a "single system image" installation?

Yes. Lustre as the root file system is being used by some installations on both clients and servers, although it will not be officially supported until Lustre 1.6.x (see the roadmap).

Do Lustre clients use NFS to reach the servers?

No. Client nodes run a native Lustre client file system driver, which uses the Lustre metadata and object protocols to communicate with the servers. The NFS protocol is not suitable to meet Lustre's metadata, I/O, locking, recovery, or performance requirements.

Does Lustre use/provide a single security domain?

Current versions of Lustre expect the clients and servers to have an identical understanding of UIDs and GIDs, but security is enforced on the Metadata Server by a server-local PAM-managed group database. Strong security and POSIX Access Control Lists are coming in future versions (see the roadmap).

Does Lustre support the standard POSIX file system APIs?

Yes. Applications which use standard POSIX file system APIs can run on Lustre without modifications.

Is Lustre "POSIX compliant"? Are there any exceptions?

POSIX does not, strictly speaking, say anything about how a file system will operate on multiple clients. However, Lustre conforms to the most reasonable interpretation of what the single-node POSIX requirements would mean in a clustered environment.

For example, the atomicity of read and write operations is enforced through the Lustre distributed lock manager; if application threads running on multiple nodes were to try to read and write the same part of a file at the same time, they would both see consistent results. This is true of all I/O and metadata operations, with two exceptions:

  • 1. atime updates

It is not practical to maintain fully coherent atime updates in a high-performance cluster file system. Lustre updates the atime of files lazily: if an inode needs to be changed on disk anyway, we piggy-back an atime update if one is needed, and atime is also updated when files are closed.

  • 2. flock/lockf

POSIX and BSD flock/lockf system calls will be completely coherent across the cluster, using the Lustre lock manager, but are not yet supported today. They will be soon (see the roadmap).

Prior to Lustre 1.4.1, there was one additional deviation from POSIX, in the area of mmap I/O. In 1.4.1, changes were made to support cache-coherent mmap I/O and robust execution of binaries and shared libraries residing in Lustre. mmap() I/O is now coherent and synchronized via the Lustre lock manager, although there may be pathological cases that remain hazardous for some time.

Can you grow/shrink file systems online?

Lustre 1.6 contains support for online addition of OSTs, either on a new or on an existing OSS. In an upcoming version of Lustre, the recently added support for online resizing of ext3 volumes will provide an additional way of growing file systems. Shrinking is not supported.

Which disk file systems are supported as Lustre backend file systems?

Lustre includes a patched version of the ext3 file system, with additional features such as extents, an efficient multi-block allocator, htree directories, large inodes, extended attributes, transaction optimizations, fine-grained locking, and CPU affinity for critical operations. This is the only supported backend file system today.

Why did CFS choose ext3? Do you ever plan to support others?

There are many reasons to choose ext3. One is size; at under 15,000 lines of code, it is extremely understandable, maintainable, and modifiable. Another is reliability; ext3 is proven stable by millions of users, with an excellent file system repair tool.

When we began, there was a big difference between the various Linux file systems, particularly with respect to performance. In the last three years, however, CFS has carried ext3 very substantially forward, and it is now extremely competitive with other Linux file systems.

Nevertheless, we had originally planned to support multiple file systems, so Lustre does contain a file system abstraction. The Solaris Port of the OSS will support ZFS.

Why didn't you use IBM's distributed lock manager?

The design of the Lustre DLM borrows heavily from the VAX Clusters DLM, plus extensions that are not found in others. Although we have received some reasonable criticism for not using an existing package (such as IBM's DLM), experience thus far has seemed to indicate that we've made the correct choice: it's smaller, simpler and, at least for our needs, more extensible.

The Lustre DLM, at around 6,000 lines of code, has proven to be a manageable maintenance task, despite its somewhat daunting complexity. The IBM DLM, by comparison, was nearly the size of all of Lustre combined. This is not necessarily a criticism of the IBM DLM, however; to its credit, it is a complete DLM which implements many features which we do not require in Lustre.

In particular, Lustre's DLM is not really distributed, at least not when compared to other such systems. Locks in the Lustre DLM are always managed by the service node, and do not change masters as other systems allow. Omitting features of this type has allowed us to rapidly develop and stabilize the core DLM functionality required by the file system, plus add several extensions of our own (extent locking, intent locking, policy functions, glimpse ASTs, and a different take on lock value blocks).

Are services at user or kernel level? How do they communicate?

All daemons on a single node run in the kernel, and therefore share a single address space. Daemons on different nodes communicate through RPC messages; large messages are sent using remote DMA if the fabric supports it.

Sizing (7)

What is the maximum file system size? What is the largest file system you've tested?

Each backend OST file system is restricted to a maximum of 2 TB on Linux 2.4 (imposed by the kernel block device), or 8 TB on Linux 2.6 (imposed by ext3). Of course, it is possible to have multiple OST file systems on a single OSS, and to aggregate multiple OSSs. Running with almost 4,000 OSTs has been tried, hence 32 PB file systems can be achieved today.

Lustre users already run single production filesystems of 1.4PB.
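The 32 PB figure above follows directly from the per-OST limit. A minimal sketch of the arithmetic, assuming decimal units (1 PB = 1,000 TB) and a hypothetical helper name:

```python
TB_PER_PB = 1000  # decimal units, an assumption of this sketch


def max_filesystem_tb(ost_count, per_ost_limit_tb):
    """Upper bound on file system capacity: every OST at its size limit."""
    return ost_count * per_ost_limit_tb


# About 4,000 OSTs at the Linux 2.6 / ext3 limit of 8 TB each:
print(max_filesystem_tb(4000, 8) / TB_PER_PB)  # 32.0 (PB)
```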

What is the maximum file system block size?

The basic ext3 block size is 4096 bytes, although this could in principle be easily changed to a large PAGE_SIZE (on IA64, for example) with a few modifications to ext3. It is not clear, however, that this is necessary.

Some people confuse block size with extent size or I/O request size -- they are not the same thing. The block size is the basic unit of disk allocation, and for our purposes it seems that 4kB is as good as any. The size of a single file extent, by definition, is almost always larger than 4kB, and ext3 improvements in Lustre 1.4.x do a good job of using contiguous disk blocks whenever possible. The I/O request size (the amount of data that we try to read or write in a single request) is much larger still.

What is the maximum single-file size?

On 32-bit clients, the page cache makes it quite difficult to read or write a single file larger than 8 TB. On 64-bit clients, the maximum file size is 2^64 bytes. A current Lustre limit for allocated file space arises from a maximum of 160 stripes and about 8 TB per stripe, leading to about 1.28 PB per file.
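The per-file limit can be checked the same way. This is only a sketch of the arithmetic in the answer above; the constant names are made up for the example and decimal units are assumed.

```python
MAX_STRIPES = 160        # maximum stripes per file, from the answer above
TB_PER_STRIPE = 8        # approximate per-stripe (per-object) limit in TB

max_file_tb = MAX_STRIPES * TB_PER_STRIPE
print(max_file_tb)          # 1280 (TB)
print(max_file_tb / 1000)   # 1.28 (PB)
```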

What is the maximum number of files in a single file system? In a single directory?

We use the ext3 hashed directory code, which has a theoretical limit of ~134 million files per directory, at which point the directory grows to more than 2 GB. The maximum number of subdirectories is 32,000 in versions prior to Lustre 1.2.6 and is unlimited in later versions (small ext3 format change).

More realistically, we regularly run tests with ten million files in a single directory. On a properly-configured dual-CPU MDS with 4 GB of ram, it is possible to do random lookups in this directory at a rate of 5,000/second.

A single MDS imposes an upper limit of 4 billion inodes, but the default limit is slightly less than the device size / 4kB, so about 512 million inodes for a 2 TB MDS file system. This can be increased at initial file system creation time by specifying mkfs options. We regularly test with file systems containing approximately 100 million files.
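To make the default inode estimate concrete, here is a small sketch. It assumes binary units (2 TB = 2 * 2^40 bytes) and treats the "4 billion" ceiling as 2^32; both are assumptions of this example, and the real default is described above as slightly lower than the value computed here.

```python
BYTES_PER_INODE = 4096        # one inode per 4 kB of device space
MAX_INODES_PER_MDS = 2 ** 32  # "4 billion" ceiling, assumed to mean 2^32


def default_inode_estimate(device_bytes):
    """Rough upper estimate of the default inode count for an MDS device."""
    return min(device_bytes // BYTES_PER_INODE, MAX_INODES_PER_MDS)


two_tb = 2 * 2 ** 40
print(default_inode_estimate(two_tb))  # 536870912, i.e. about 512 million
```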

With the introduction of clustered metadata servers (Lustre 2.0), these limits will disappear.

How many OSSs do I need?

The short answer is: as many as you need to achieve the required aggregate I/O throughput.

The long answer is: Each OSS contributes to the total capacity and the aggregate throughput. For example, a 100 TB file system may use 100 single-gige-connected OSS nodes with 1 TB of 100 MB/sec storage each, providing 10 GB/sec of aggregate bandwidth. The same bandwidth and capacity could be provided with four heavy-duty 25 TB OSS servers with three Elan 4 interfaces and 16 FC2 channels, each providing ~2.5 GB/s in aggregate bandwidth. The 25 TB of storage must be capable of 2.5 GB/s.

Each OSS can support a very large number of clients, so we do not advise our customers to use any particular client-to-OSS ratio. Nevertheless, it is common to deploy 1 GB/s of OSS throughput per 1 TFLOP/s of compute power.
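The sizing reasoning above can be sketched as a short calculation. The function names are hypothetical, and the model assumes that every OSS delivers its full rated throughput, which real deployments may not achieve.

```python
def oss_nodes_needed(target_mb_per_s, per_oss_mb_per_s):
    """Smallest number of OSS nodes whose combined throughput meets the target."""
    if target_mb_per_s < 0 or per_oss_mb_per_s <= 0:
        raise ValueError("target must be >= 0 and per-OSS throughput must be > 0")
    return -(-target_mb_per_s // per_oss_mb_per_s)  # integer ceiling division


def rule_of_thumb_gb_per_s(compute_tflops):
    """Common deployment ratio from the text: 1 GB/s of OSS throughput per TFLOP/s."""
    return compute_tflops * 1.0


# 10 GB/s aggregate from 100 MB/s GigE nodes, or from 2.5 GB/s heavy-duty servers:
print(oss_nodes_needed(10_000, 100))    # 100
print(oss_nodes_needed(10_000, 2_500))  # 4
```

Both configurations reach the same aggregate bandwidth; the choice between many small nodes and a few large ones then depends on cost, storage hardware, and interconnect.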

What is the largest possible I/O request?

When most people ask this question, they are asking what is the maximum buffer size that can be safely passed to a read() or write() system call. In principle this is limited only by the address space on the client, although in practice it is possible to request so much I/O in a single atomic unit that the cluster infrastructure will see this as non-responsiveness and generate timeouts. Depending on your I/O infrastructure, that threshold can be 100 MB or more, however, so it has not been an issue in practice. In any case, we are aware of this limitation, and will work to remove it in a future release.

The Lustre I/O subsystem is designed with the understanding that an I/O request travels through many pipelines, and that it's important to keep all pipelines full for maximum performance. So it is not necessary to teach your application to do I/O in very large chunks; Lustre and the page cache will aggregate I/O for you.

Typically, Lustre client nodes will do their best to aggregate I/O into 1 MB chunks on the wire, and to keep between 5 and 10 I/O requests "in flight" at a time, per server.
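As a rough illustration of what those numbers imply, the sketch below estimates how much data one client may have on the wire at once. The helper name and the example of 8 servers are assumptions for illustration only.

```python
RPC_SIZE_MB = 1  # typical on-the-wire chunk size from the text


def in_flight_mb(server_count, requests_in_flight):
    """Data a single client may have outstanding across all servers, in MB."""
    return server_count * requests_in_flight * RPC_SIZE_MB


# One client talking to 8 servers, with 5 to 10 requests in flight per server:
print(in_flight_mb(8, 5), in_flight_mb(8, 10))  # 40 80
```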

On the OSS, we have gone to significant effort to ensure that these large 1 MB buffers do not get unnecessarily broken up by lower kernel layers. In Linux 2.4, modifications were required to the SCSI layer, block devices, and the QLogic fibrechannel driver. Because nearly 100% of our customers use QLogic fibrechannel cards, we have not yet optimized other drivers in this way.

How many nodes can connect to a single Lustre file system?

The largest single production Lustre installation is approximately 25,000 nodes today (2006). A Cray XT3 cluster of almost 12,000 nodes is currently being deployed, although it will likely run in production as two separate 6,000-node clusters.

Although these are the largest clusters available to us today, we believe that the architecture is fundamentally capable of supporting many tens of thousands of clients.

Installation (13)

Which operating systems are supported as clients and servers?

Please see OS Support.

Can you use NFS or CIFS to reach a Lustre volume?

Yes. Any native Lustre client (running Linux today, by definition) can export a volume using Samba (or NFS, but keep reading). Some people have even built small clusters of these export nodes, to improve overall performance to their non-native clients.

Although NFS export works today, we don't support it because it is extremely slow. Lustre will require some specific optimizations to work around the behaviour of the Linux kernel NFS server. We're working on these, but in the meantime, we suggest Samba.

CIFS export with Samba, even a cluster of such nodes, is possible with one caveat: oplocks and Windows share modes. If you connect to these Samba shares with Windows clients, they will probably make heavy use of share modes and oplocks for locking and synchronization. Samba implements them internally, and does not yet have a clustered mode to coordinate them between multiple servers running on separate nodes. So if you rely on the consistency of these share modes and oplocks, you should use a single node to export CIFS.

What is the typical MDS node configuration?

1,000-node clusters with moderate metadata loads are commonly supported with a dual-Xeon node with 2GB of RAM, providing sustained throughput of over 1,000 ops/second. It is common for these systems to have roughly 20 million files. Even in 1,000-client clusters, the single MDS has been shown not to be a bottleneck under typical loads.

High throughput with very large directories is possible with 64-bit architectures and extra RAM. Typically, write I/O is low, but seek latency is very important, hence RAID-1 mirrored storage can be recommended.

Storage requirements should be sized at approximately 4kB per file, except in unusual circumstances.
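For example, the 4kB-per-file guideline can be turned into a quick estimate. This is a sketch only: it takes 4kB to mean 4,096 bytes and ignores journal and file system overhead.

```python
BYTES_PER_FILE = 4096  # approximate MDS storage per file, from the guideline above


def mds_storage_bytes(file_count):
    """Approximate MDS backing storage needed for file_count files."""
    return file_count * BYTES_PER_FILE


# A system with roughly 20 million files:
print(mds_storage_bytes(20_000_000) / 1e9)  # 81.92 (GB, decimal)
```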

What is the typical OSS node configuration?

IA32 systems support up to 150 MB/s over dual-GigE, and DMA-capable networks are typically limited only by the bus. 64-bit servers with good buses are capable of saturating multiple interconnects of any type. These nodes are often dual-CPU and support up to 4 fibrechannel channels. RAM is used for locks and read caching, but large amounts of RAM are usually not necessary, even for 1,000-node clusters.

Which architectures are interoperable?

Lustre 1.4 requires the page size on server nodes (MDS and OSS) to be no larger than the page size on client nodes. Except for this, there are no known obstacles to interoperability, even among heterogeneous client groups and systems with different endianness.

Which storage devices are supported, on MDS and OSS nodes?

Servers support all block storage, including fibrechannel, SCSI, SATA, ATA, and exotic storage (NVRAM).

Which storage interconnects are supported?

Just to be clear: Lustre does not require a SAN, nor does it require a fabric like iSCSI. It will work just fine over simple IDE block devices. But because many people already have SANs, or want some amount of shared storage for failover, this is a common question.

For storage behind server nodes, FibreChannel, InfiniBand, iSCSI, or any other block storage protocol can be used. Failover functionality requires shared storage (each partition used active/passive) between a pair of nodes on a fabric like SCSI, FC or SATA.

Are fibrechannel switches necessary? How does HA shared storage work?

Typically, fibrechannel switches are not necessary. Multi-port shared storage for failover is normally configured to be shared between two server nodes on a FC-AL. Shared SCSI and future shared SATA devices will also work.

Backend storage is expected to be cache-coherent between multiple channels reaching the devices. Servers in an OSS failover pair are normally both active in the file system, and will take over partitions for each other in the case of a failure. MDS failover pairs can also both be active, but only if they serve multiple separate file systems.

Can you put the file system journal on a separate device?

Yes. This can be configured when the backend ext3 file systems are created.

Can you run Lustre on LVM volumes, software RAID, etc?

Yes. You can use any Linux block device as storage for a backend Lustre server file system, including LVM or software RAID devices.

Can you describe the installation process?

The current installation process is straightforward, but manual:

1. Install the provided kernel and Lustre RPMs.
2. A configuration tool assistant can generate a configuration file for simple configurations, or you can build more complex configurations with relatively simple shell scripts.
3. Format and start the object servers and metadata servers. The command is usually identical on all nodes, so it's easy to use a utility like pdsh/prun to execute it.
4. Start the clients with "mount", similar to how NFS is mounted.

We are in the process of building new tools for our enterprise customers. New un-configured servers will announce themselves to the management node for template/profile-based addition to the cluster. The first such tools will appear in 2005.

What is the estimated installation time per compute node?

Assuming that node doesn't require special drivers or kernel configuration, 5 minutes. Compute nodes can be installed and started in parallel.

What is the estimated installation time per I/O node?

5 minutes, plus formatting time, which can also be done in parallel.

Networking (5)

Which interconnects and protocols are currently supported?

Today, CFS supports Lustre running on TCP/IP (commonly over gigabit or 10-gigabit ethernet), Quadrics Elan 3 and 4, OpenIB generation 1, Voltaire IB (3.4.5+), OFED (1.1), Topspin (3.2.0), GM (Myrinet GM) (2.1.22+), CISCO, and Cray's Rapid Array and Seastar networks.

Can I use more than one interface of the same type on the same node?

Yes, with Lustre 1.4.6 and later.

Can I use two or more different interconnects on the same node?

Yes, with Lustre 1.4.x, subject to the particular limitations of the interconnect. For example, we are told that it is not possible to use both Elan 3 and Elan 4 in the same node at the same time.

Can I use TCP offload cards?

Probably -- but we've tried many of these cards, and for various reasons we didn't see much improvement, if any. First, because Lustre runs entirely in the kernel, it uses kernel networking APIs which are often not supported (or at least not optimized) by the offload drivers.

Second, the problem isn't the overhead of checksum calculation or the need for interrupt coalescing; lots of commodity ethernet cards already support these features. The big overhead is memory copying and buffering, which these cards rarely do anything to address.

Does Lustre support crazy heterogeneous network topologies?

Yes, although the craziest of them are not yet fully supported.

Because Lustre supports native protocols on top of high speed cluster interconnects (in addition to TCP/IP), some special infrastructure is necessary.

Lustre uses its own implementation of the Portals message passing API, upon which we have implemented Gateway nodes, to route between two native protocols. These are commodity nodes with, for example, both gigabit ethernet and Quadrics Elan interfaces. The gateway software translates the Portals packets between the interfaces to bridge the two networks.

These routers are in use today, and may become more popular as more enterprises connect multiple clusters with special interconnects to a single global Lustre file system. On the other hand, TCP/IP on gig-e is the interconnect of choice for most organizations, which requires no additional Portals routing.

Metadata Servers (10)

Object Servers and I/O Throughput (13)

Recovery (8)

OS Support (9)

Release Testing and Upgrading (4)

Licensing and Support (7)