NFS vs. Lustre

(Updated: Oct 2009)

DISCLAIMER - EXTERNAL CONTRIBUTOR CONTENT

''This content was submitted by an external contributor. We provide this information as a resource for the Lustre™ open-source community, but we make no representation as to the accuracy, completeness or reliability of this information.''

The following is based on a post written by Lee Ward and posted on the Lustre-discuss mailing list. A couple of corrections supplied by Daniel Kobras and Nicolas Williams have been added. Further expansion and correction are welcome.

I'll begin by motivating both NFS and Lustre. Why do they exist? What problems do they solve?

NFS
Way back in the day, Ethernet and the concept of a workstation got popular. There were many tools to copy files between machines but few ways to share a name space, that is, to have the directory hierarchy and its content directly accessible to an application on a foreign machine. This made file sharing awkward. The model was to copy the file or files to the workstation where the work was going to be done, do the work, and copy the results back to some, hopefully, well-maintained central machine.

There were solutions to this at the time. I recall an attractive alternative called RFS (I believe) from the Bell Labs folks, via some place in England if I'm remembering right; it's been a looong time, after all. It had issues though. The nastiest issue for me was that if a client went down the service side would freeze, at least partially. Since this could happen willy-nilly, depending on the user's wishes and how well the power button on his workstation was protected, together with the power cord and Ethernet connection, this freezing of service for any amount of time was difficult to accept. This was so even in a rather small collection of machines.

The problem with RFS (?) and its cousins was that they were all stateful. The service side depended on state that was held at the client. If the client went down, the service side couldn't continue without a whole lot of recovery, timeouts, etc. It was a very *annoying* problem.

In the latter half of the 1980s (am I remembering right?) SUN proposed an open protocol called NFS. An implementation using this protocol could do most everything RFS(?) could but it didn't suffer the service-side hangs. It couldn't. It was stateless. If the client went down, the server just didn't care. If the server went down, the client had the opportunity to either give up on the local operation, usually with an error returned, or wait. It was always up to the user and, for client failures, the annoyance was limited to the user(s) on that client.

SUN, also, wisely desired the protocol to be ubiquitous. They published it. They wanted *everyone* to adopt it. More, they would help competitors. SUN held interoperability bake-a-thons to help with this. It looks like they succeeded, all around :)

Let's sum up, then. The goals for NFS were:


 * 1) Share a local file system name space across the network.
 * 2) Do it in a robust, resilient way. Pesky FS issues because some user kicked the cord out of his workstation were unacceptable.
 * 3) Make it ubiquitous. SUN was a workstation vendor. They sold servers but almost everyone had a VAX in their back pocket where they made the infrastructure investment. SUN needed the high-value machines to support this protocol.

Lustre
Lustre has a weird story and I'm not going to go into all of it. The shortest, relevant, part is that while there was at least one solution that DOE/NNSA felt acceptable, GPFS, it was not available on anything other than an IBM platform and because DOE/NNSA had a semi-formal policy of buying from different vendors at each of the three labs we were kind of stuck. Other file systems, existing and imminent, at the time were examined but they were all distributed file systems and we needed IO bandwidth. We needed lots, and lots of bandwidth.

We also needed that ubiquitous thing that SUN had as one of their goals. We didn't want to pay millions of dollars for another GPFS. We felt that would only be painting ourselves into a corner. Whatever we did, the result had to be open. It also had to be attractive to smaller sites as we wanted to turn loose of the thing at some point. If it was attractive for smaller machines we felt we would win in the long term as, eventually, the cost to further and maintain this thing would be spread across the community.

As far as technical goals, I guess we just wanted GPFS, but open. More though, we wanted it to survive in our platform roadmaps for at least a decade. The actual technical requirements for the contract that DOE/NNSA executed with HP (CFS was the sub-contractor responsible for development) can be found here:



LLNL used to host this but it's no longer there? Oh well, hopefully this link will be good for a while, at least.

I'm just going to jump to the end and sum the goals up:


 * 1) It must do everything NFS can. We relaxed the stateless thing though, see the next item for why.
 * 2) It must support full POSIX semantics: last writer wins, POSIX locks, etc.
 * 3) It must support all of the transports we are interested in.
 * 4) It must be scalable, in that we can cheaply attach storage and both performance (reading *and* writing) and capacity within a single mounted file system increase in direct proportion.
 * 5) We wanted it to be easy, administratively. Our goal was that it be no harder than NFS to set up and maintain. We were involving too many folks with PhDs in the operation of our machines at the time. Before you yell FAIL, I'll say we did try. I'll also say we didn't make CFS responsible for this part of the task. Don't blame them overly much, OK?
 * 6) We recognized we were asking for a stateful system; we wanted to mitigate that by having some focus on resiliency. These were big machines and clients died all the time.
 * 7) While not in the SOW, we structured the contract to accomplish some future form of wide acceptance. We wanted it to be ubiquitous.

That's a lot of goals! For the technical ones, the main ones are all pretty much structured to ask two things of what became Lustre. First, give us everything NFS functionally does but go far beyond it in performance. Second, give us everything NFS functionally does but make it completely equivalent to a local file system, semantically.

There's a little more we have to consider. NFS4 is a different beast than NFS2 or NFS3. NFS{2,3} had some serious issues that became more prominent as time went by. First, security: it had none. Folks had bandaged on some different things to try to cure this but they weren't standard across platforms. Second, it couldn't do the full POSIX-required semantics. That was attacked with the NFS lock protocols but it was such an afterthought that it will always remain problematic. Third, new authorization possibilities introduced by Microsoft and then POSIX, called ACLs, had no way of being accomplished.

NFS4 addresses those by:


 * 1) Introducing state. (Lots of resiliency mechanisms were introduced to offset the downside of this, too.)
 * 2) Formalizing and offering standardized authentication headers.
 * 3) Introducing ACLs that map to equivalents in POSIX and Microsoft.

NFS4 implementations are able to handle POSIX advisory locks but, unlike Lustre, they don't support full POSIX filesystem semantics. For example, NFS4 still follows the traditional NFS close-to-open cache consistency model whereas with Lustre, individual writes are atomic and become immediately visible to all clients.

NFSv4 can't handle O_APPEND, and has those close-to-open semantics. Those are the two large departures from POSIX in NFSv4.

NFSv4.1 also adds metadata/data separation and data distribution, much like Lustre, but with the same POSIX semantics departures mentioned above. Also, NFSv4.1's "pNFS" concept doesn't have room for "capabilities" (in the distributed filesystem sense, not in the Linux capabilities sense), which means that OSSs and MDSs have to communicate to get permissions to be enforced. There are also differences with respect to recovery, etc.

One thing about NFS is that it's meant to be neutral w.r.t. the type of filesystem it shares. So NFSv4, for example, has features for dealing with filesystems that don't have a notion of persistent inode number, whereas Lustre has its own on-disk format and therefore can't be used to share just any type of filesystem.
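To make the O_APPEND point above concrete, here is a minimal Python sketch of what POSIX append semantics buy you. It is only an illustration run against a local file system: the file names, record format, and helper names are made up for the example. Two independent descriptors take turns writing to the same file, once with O_APPEND and once without. On an NFS mount the client has to emulate the append by seeking to what it believes is the end of the file, so appenders on different clients can race and overwrite each other; this sketch does not try to reproduce that race.

```python
import os
import tempfile


def interleaved_writes(path, extra_flags, rounds=3):
    """Two independent descriptors take turns writing one record each.

    Returns the lines found in the file afterwards.
    """
    fd_a = os.open(path, os.O_WRONLY | os.O_CREAT | extra_flags, 0o644)
    fd_b = os.open(path, os.O_WRONLY | extra_flags)
    try:
        for i in range(rounds):
            # Each descriptor keeps its own file offset. With O_APPEND the
            # offset is moved to end-of-file atomically before every write.
            os.write(fd_a, f"A{i}\n".encode())
            os.write(fd_b, f"B{i}\n".encode())
    finally:
        os.close(fd_a)
        os.close(fd_b)
    with open(path, "rb") as f:
        return f.read().decode().splitlines()


def demo():
    """Run the experiment with and without O_APPEND on a local temp dir."""
    with tempfile.TemporaryDirectory() as tmp:
        appended = interleaved_writes(os.path.join(tmp, "append.log"), os.O_APPEND)
        plain = interleaved_writes(os.path.join(tmp, "plain.log"), 0)
    return appended, plain


if __name__ == "__main__":
    appended, plain = demo()
    print("with O_APPEND:   ", appended)
    print("without O_APPEND:", plain)
```

With O_APPEND every record survives in the order it was written. Without it, the second descriptor writes over the first one's records because both start at offset zero. A file system that cannot honor O_APPEND across clients leaves applications exposed to the second outcome.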

Strengths and Weaknesses of the Two
NFS4 does most everything Lustre can with one very important exception, IO bandwidth.

Both seem able to deliver metadata performance at roughly the same speeds. File create, delete, and stat rates are about the same. NetApp seems to have a partial enhancement. They bought the Spinnaker goodies some time back and have deployed that technology, and redirection too(?), within their servers. The good thing about that is two users in different directories *could* leverage two servers, independently, and, so, scale metadata performance. It's not guaranteed but at least there is the possibility. If the two users are in the same directory, it's not much different, though, I'm thinking. Someone correct me if I'm wrong?

Both can offer full POSIX now. It's nasty in both cases but, yes, in theory you can export mail directory hierarchies with locking.

The NFS client and server are far easier to set up and maintain. The tools to debug issues are advanced. While the Lustre folks have done much to improve this area, NFS is just leaps and bounds ahead. It's easier to deal with NFS than Lustre. Just far, far easier, still. NFS is just built into everything. My TV has it, for heck's sake. Lustre is, seemingly, always an add-on. It's also a moving target. We're constantly futzing with it, upgrading, and patching. Lustre might be compilable most everywhere we care about but building it isn't trivial. The supplied modules are great but, still, moving targets in that we wait for SUN to catch up to the vendor-supplied changes that affect Lustre. Given Lustre's size and interaction with other components in the OS, that happens far more frequently than desired. NFS just plain wins the ubiquity argument at present.

NFS IO performance does *not* scale. It's still an in-band protocol. The data is carried in the same message as the request and is, practically, limited in size. Reads are more scalable than writes: a popular file segment can be satisfied from the cache on reads, but even that develops issues at some point. For writes, NFS3 and NFS4 help in that they directly support write-behind so that a client doesn't have to wait for data to go to disk, but it's just not enough. If one streams data to/from the store, it can be larger than the cache. A client that might read a file already made "hot" but at a very different rate just loses. A client, writing, is always looking for free memory to buffer content. Again, too many of these, simultaneously, and performance descends to the native speed of the attached back-end store and that store can only get so big.

Lustre IO performance *does* scale. It uses a 3rd-party transfer. Requests are made to the metadata server and IO moves directly between the affected storage component(s) and the client. The more storage components, the less possibility of contention between clients and the more data can be accepted/supplied per unit time. NFS4 has a proposed extension, called pNFS, to address this problem. It just introduces the 3rd-party data transfers that Lustre enjoys. If and when that is a standard, and is well supported by clients and vendors, the really big technical difference will virtually disappear. It's been a long time coming, though. It's still not there. Will it ever be, really?
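The in-band versus 3rd-party contrast can be put into a back-of-the-envelope model. The Python sketch below is deliberately idealized and every number in it is a made-up assumption (per-client and per-server bandwidth in GB/s, client and server counts). It ignores caching, contention, striping imbalance, and network limits. It only shows why aggregate bandwidth is capped by one server in the in-band case and grows with the number of storage servers in the 3rd-party case.

```python
def nfs_aggregate_bandwidth(clients, client_bw, server_bw):
    """In-band model: every byte passes through the single server."""
    return min(clients * client_bw, server_bw)


def lustre_aggregate_bandwidth(clients, client_bw, storage_servers, server_bw):
    """Third-party model: clients exchange data directly with the storage servers."""
    return min(clients * client_bw, storage_servers * server_bw)


def scaling_table(clients=64, client_bw=1.0, server_bw=2.0,
                  server_counts=(1, 4, 16, 64)):
    """Return (storage_servers, aggregate GB/s) rows for the third-party model.

    The default figures are hypothetical and chosen only for illustration.
    """
    return [
        (n, lustre_aggregate_bandwidth(clients, client_bw, n, server_bw))
        for n in server_counts
    ]


if __name__ == "__main__":
    print("NFS-style:", nfs_aggregate_bandwidth(64, 1.0, 2.0), "GB/s")
    for servers, bandwidth in scaling_table():
        print(f"{servers:3d} storage servers: {bandwidth} GB/s")
```

Under these assumed figures the in-band setup stays at the single server's 2.0 GB/s no matter how many clients show up, while the 3rd-party setup grows with each storage server added until the clients themselves become the limit.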

The answer to the NFS vs. Lustre question comes down to the workload for a given application then, since they do have overlap in their solution space. If I were asked to look at a platform and recommend a solution I would worry about IO bandwidth requirements. If the platform in question were read-mostly and, practically, never needed sustained read or write bandwidth, NFS would be an easy choice. I'd even think hard about NFS if the platform created many files but all were very small; today's filers have very respectable IOPS rates. If it came down to IO bandwidth, I'm still on the parallel file system bandwagon. NFS just can't deal with that at present and I do still have the folks, in house, to manage the administrative burden.

Done. That was useful for me. I think five years ago I might have opted for Lustre in the "create many small files" case, where I would consider NFS today, so re-examining the motivations, relative strengths, and weaknesses of both was useful. As I said, I did this more as a self-exercise than anything else but I hope you can find something useful here, too.
