Architecture - IO system
Note: The content on this page reflects the state of design of a Lustre feature at a particular point in time and may contain outdated information.
No serialized calls whenever possible
The client currently serializes the writes to each OSC. If writes hit the cache this is not so serious, but in large-scale clusters the caches fill almost immediately. This is because the aggregate memory bandwidth of 1000 clients (e.g. 2TB/sec in older clusters) far exceeds the disk bandwidth of 30 servers (typically less than 12GB/sec in LLNL clusters).
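To put the mismatch in numbers, here is a small back-of-the-envelope calculation using the figures quoted above. The 1 GB of dirty cache per client is a purely hypothetical value chosen for illustration, and decimal units are assumed.

```python
# Figures quoted in the text; decimal units assumed (1 TB = 1000 GB).
clients = 1000
servers = 30
client_memory_bw_gb_s = 2000.0  # aggregate client memory bandwidth, "2TB/sec"
server_disk_bw_gb_s = 12.0      # aggregate disk bandwidth, upper bound "12GB/sec"

ratio = client_memory_bw_gb_s / server_disk_bw_gb_s
per_client_gb_s = client_memory_bw_gb_s / clients
per_server_gb_s = server_disk_bw_gb_s / servers


def seconds_to_fill(cache_gb, fill_gb_s, drain_gb_s):
    """Time until a cache is full when it is filled faster than it drains."""
    return cache_gb / (fill_gb_s - drain_gb_s)


# HYPOTHETICAL: 1 GB of dirty cache per client, i.e. 1000 GB cluster-wide.
hypothetical_cache_gb = 1.0 * clients
fill_time_s = seconds_to_fill(hypothetical_cache_gb,
                              client_memory_bw_gb_s,
                              server_disk_bw_gb_s)

print(f"clients can produce data {ratio:.0f}x faster than the disks absorb it")
print(f"caches full after about {fill_time_s:.2f} s at full write rate")
```

Under these assumptions the clients outrun the disks by roughly two orders of magnitude, so the cache only hides serialization for a fraction of a second.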
Work well without a cache
The general rule is that everything we build should work optimally without caching and only after that do we consider caching.
Fully parallel dispatch
The correct implementation dispatches "lock ; I/O ; unlock" request groups to all OSCs in parallel. It may still block the system call if one of the OSCs has hit max dirty or has no grants remaining, but it should not block before handing the request to the next OSC.
Without this construction a full pipe on one OSC will starve empty pipes on other OSCs.
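The following is a minimal sketch of the dispatch pattern described above, not Lustre code. `FakeOSC`, `lock_io_unlock` and `dispatch_parallel` are hypothetical names, and a thread pool stands in for whatever asynchronous mechanism the client would really use. One OSC is deliberately out of room; the other two still make progress.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, wait


class FakeOSC:
    """Hypothetical stand-in for an OSC; not a Lustre interface."""

    def __init__(self, name, has_room=True):
        self.name = name
        self.room = threading.Event()  # set while below max dirty / grants remain
        if has_room:
            self.room.set()
        self.completed = []

    def lock_io_unlock(self, chunk):
        # Only this OSC's request group blocks while its pipe is full.
        self.room.wait()
        self.completed.append(chunk)
        return self.name


def dispatch_parallel(pool, stripes):
    """Hand every (osc, chunk) pair to its OSC before waiting on any of them."""
    return {osc.name: pool.submit(osc.lock_io_unlock, chunk)
            for osc, chunk in stripes}


oscs = [FakeOSC("osc0", has_room=False), FakeOSC("osc1"), FakeOSC("osc2")]
stripes = [(osc, f"chunk-for-{osc.name}") for osc in oscs]

with ThreadPoolExecutor(max_workers=len(oscs)) as pool:
    futures = dispatch_parallel(pool, stripes)
    # osc1 and osc2 complete even though osc0 is still blocked.
    wait([futures["osc1"], futures["osc2"]], timeout=5)
    progressed_while_blocked = (futures["osc1"].done(),
                                futures["osc2"].done(),
                                futures["osc0"].done())
    oscs[0].room.set()  # osc0 drains; its request group can now proceed
    wait(list(futures.values()), timeout=5)

all_done = all(f.done() for f in futures.values())
```

The system call as a whole still waits for `osc0` here, but only after every other OSC has already been given its work.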
Accumulate larger IOVs
I've heard people grumble about failure modes and POSIX and about possible deadlock - none of it applies. For O_APPEND and truncate we need to take locks serially, but the IO can still be done similarly.
Server side I/O request handling.
Step 1: Use a Request Scheduler
The server needs a request scheduler that is much less naive than what we do today. Today we handle requests in exactly the order they arrive, which likely creates a near-random disk load when many clients are writing.
The request scheduler (single threaded, or one thread per CPU as a refinement) would inspect lots of requests and build good groups to schedule. "Lots" means, for example, 8 requests from each of up to 100,000 clients; this consumes approximately 3.2GB of cached requests, not including the data. Note - the elevator cannot do this, as it needs the data and there isn't enough RAM on the servers to cache that much data; normally the data comes in at 256x the request size (1MB for a 4K request).
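The sizing above can be checked with a few lines of arithmetic. This sketch assumes that "4K" means 4096 bytes per cached request (so that 256x is exactly 1MB) and reports totals in decimal GB.

```python
requests_per_client = 8
clients = 100_000
request_bytes = 4096      # "4K request" (binary units assumed)
data_multiplier = 256     # data arrives at 256x the request size

queued_requests = requests_per_client * clients
request_cache_bytes = queued_requests * request_bytes
data_per_request = request_bytes * data_multiplier
data_bytes = queued_requests * data_per_request

print(f"queued requests:  {queued_requests:,}")
print(f"request cache:    {request_cache_bytes / 1e9:.2f} GB")
print(f"data per request: {data_per_request / 2**20:.0f} MB")
print(f"data if cached:   {data_bytes / 1e9:.0f} GB")
```

The request descriptors fit in a few GB of server memory, while the corresponding data would run to hundreds of GB, which is why grouping has to happen before the data is transferred.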
The definition of good groups would aim to address the following use cases:
- Accumulate small writes from one client into a group that contains nearly or fully sequential I/O, and hopefully contains large enough extents to saturate the DMU (file system) bandwidth
- Accumulate small writes from all clients to one shared file in groups that are close to sequential
- Accumulate new file creations with small writes in tightly packed regions on the disk, using mechanisms like CROW and Alex's small I/O placement policy.
- Manage fair throughput from all or many clients
As with any policy, a perfect solution is not possible, but we can certainly get much further than what we do today.
We know for a fact, from statistics on execution time, that I/O at the moment can be extremely unfair to some clients. Unfair networks can make this worse. Unfortunate grouping can also easily happen by sending widely separated disk writes down to the DMU (formerly the file system).
The request scheduler will allow the cluster to make much better use of the back end.
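As an illustration of what building good groups could look like, here is a minimal sketch covering only the first two use cases (sequential runs per object, from one or many clients). The `Request` tuple, `build_groups` and the `max_gap` parameter are hypothetical; fairness and placement of new files are not modelled.

```python
from collections import defaultdict, namedtuple

# Hypothetical request descriptor: only metadata, no data payload.
Request = namedtuple("Request", "client obj offset length")


def build_groups(requests, max_gap=0):
    """Group requests per object into runs of (nearly) sequential extents.

    A new group starts whenever the hole between the end of the current run
    and the next request's offset exceeds max_gap bytes.
    """
    by_obj = defaultdict(list)
    for r in requests:
        by_obj[r.obj].append(r)

    groups = []
    for obj in sorted(by_obj):
        run, run_end = [], 0
        for r in sorted(by_obj[obj], key=lambda r: r.offset):
            end = r.offset + r.length
            if run and r.offset - run_end > max_gap:
                groups.append(run)
                run = []
            run_end = end if not run else max(run_end, end)
            run.append(r)
        if run:
            groups.append(run)
    return groups


# Requests in arrival order: interleaved clients, objects and offsets.
arrivals = [
    Request("c1", "objA", 8192, 4096),
    Request("c2", "objA", 0, 4096),
    Request("c3", "objB", 0, 4096),
    Request("c1", "objA", 4096, 4096),
    Request("c2", "objA", 1048576, 4096),
]

groups = build_groups(arrivals)
group_offsets = [[r.offset for r in g] for g in groups]
print(group_offsets)
```

Handled in arrival order, these five requests would seek back and forth; grouped, the three adjacent 4K writes to `objA` become one sequential 12K run.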
Step 2: Network data transfer for good groups
When good groups have been identified, the data for those requests is transferred to the servers.
Data transfer over the network is generally fast and much less sensitive to smaller packets. With 64K transfers we generally are already close to maxing out the network bandwidth. 64K I/Os to the disk are hairy.
We think that a thread per CPU model here is the right kind of pool to do this.
Step 3: Good groups become DMU transactions
The group of requests with data is given to the OSD / DMU subsystem to bundle into a transaction. It should be noted that the DMU currently has a limit of 10MB per transaction.
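The 10MB limit means groups have to be packed into transactions. A minimal greedy sketch follows, assuming 10MB means 10 * 2^20 bytes and that a single group never exceeds the limit (a real implementation would have to split such a group). `pack_transactions` is a hypothetical helper, not an OSD or DMU call.

```python
MB = 1024 * 1024
TXN_LIMIT = 10 * MB  # per-transaction limit from the text (binary MB assumed)


def pack_transactions(group_sizes, limit=TXN_LIMIT):
    """Greedily pack groups, in order, into transactions of at most limit bytes.

    Returns a list of transactions, each a list of group indices.
    """
    txns, current, used = [], [], 0
    for i, size in enumerate(group_sizes):
        if size > limit:
            # ASSUMPTION: oversized groups are rejected here rather than split.
            raise ValueError(f"group {i} ({size} bytes) exceeds the limit")
        if used + size > limit:
            txns.append(current)
            current, used = [], 0
        current.append(i)
        used += size
    if current:
        txns.append(current)
    return txns


sizes = [4 * MB, 4 * MB, 4 * MB, 8 * MB, 1 * MB, 1 * MB]
txns = pack_transactions(sizes)
print(txns)
```

Keeping groups in scheduler order preserves the sequential layout the scheduler worked to build; a smarter packer could trade some of that for fuller transactions.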
This is a lightweight operation except for possible checksumming etc. that the DMU may do. A thread per CPU is all we would need in most cases, but in some cases the DMU might have to read metadata from disk before being able to handle the request, in which case more threads or asynchronous constructions might be desirable.
Step 4: Send replies with recovery information
Replies are sent to the clients with transaction numbers etc., as we do for metadata.
The recovery model we should follow for replay / resending write calls should be the same as what we do for metadata updates. We should be aware of, but not stopped by, the fact that the semantics are actually quite different because data can be modified on the client before it is re-played, while for metadata we replay operations, not state.
The last committed transaction information is also piggybacked, as today - in association with transaction flush.
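The reply handling described here can be sketched as a toy model. The class names, dictionary keys and the `flush` method are all hypothetical; the sketch only shows a client keeping writes for replay until a piggybacked last-committed number covers them. It does not model the client modifying data before replay.

```python
class ToyServer:
    """Hypothetical server: assigns transaction numbers, tracks commits."""

    def __init__(self):
        self.next_transno = 1
        self.last_committed = 0

    def write(self, payload):
        transno = self.next_transno
        self.next_transno += 1
        # The reply carries the new transno plus the piggybacked commit point.
        return {"transno": transno, "last_committed": self.last_committed}

    def flush(self):
        # Everything handed out so far is now on stable storage.
        self.last_committed = self.next_transno - 1


class ToyClient:
    """Hypothetical client: retains uncommitted writes for replay."""

    def __init__(self):
        self.replay = {}  # transno -> payload

    def handle_reply(self, payload, reply):
        self.replay[reply["transno"]] = payload
        committed = [t for t in self.replay if t <= reply["last_committed"]]
        for t in committed:
            del self.replay[t]


server, client = ToyServer(), ToyClient()
for data in (b"a", b"b"):
    client.handle_reply(data, server.write(data))
pending_before_flush = sorted(client.replay)

server.flush()
client.handle_reply(b"c", server.write(b"c"))
pending_after_flush = sorted(client.replay)
print(pending_before_flush, pending_after_flush)
```

After a server restart, whatever is still in `replay` is what this toy client would resend.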
Step 5: The transaction is flushed.
Like before, the DMU should be able to build big transactions and achieve full disk bandwidth without caching in simple cases.
We agree that some form of plug/unplug or iov handling may be desirable (and we can add this in due course), but our very coarse sgp_dd benchmark can achieve amazing throughput, as can concurrently running direct I/O jobs. First get a good understanding of what the DMU does with a batch of transactions that form a good group as described above - it might be very good!
The amount of threading (in the absence of an async I/O interface) for the lowest level of the DMU should be enough to saturate the disk system - e.g. perhaps one thread per disk on a Thumper, and 8 threads for a DDN that sits inside a DMU disk pool.