Architecture - Network Request Scheduler

Definitions

 * NRS : Network Request Scheduler.


 * RPC Concurrency : The number of RPC requests in flight between a given client and a server.


 * Active fan-out : The number of clients with in-flight requests to a given server at a given time.


 * Offset stream : The sequence of file or disk offsets in a stream of I/O requests.


 * "Before" relation (&le;) : File system operations that require ordering for correctness are related by "&le;".  For 2 operations a and b, if a &le; b, then operation a must complete reading/writing file system state before operation b can start.


 * POP : Partial Order Preservation. A filesystem's POP capability describes how its servers handle any "before" relations required on RPC sent to them.  Servers with no POP capability have no concept of any "before" relation on incoming RPCs so clients are completely responsible for preserving it.  Servers with local POP capability preserve the "before" relation within a single server, but clients are responsible for preserving any required order on RPCs sent to different servers.  A set of servers with global POP capability preserves the "before" relation on all RPCs.
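
The division of ordering responsibility between client and server can be illustrated with a small sketch (Python is used purely for illustration; the predicate and the capability names are hypothetical, not Lustre API):

```python
def client_must_serialize(pop, same_server):
    """Return True if the client must wait for RPC a to complete
    before issuing RPC b, given a <= b and the servers' POP capability.

    pop:         "none", "local" or "global" (per the definitions above)
    same_server: True if both RPCs target the same server
    """
    if pop == "none":      # servers know nothing of ordering
        return True
    if pop == "local":     # server preserves order only within itself
        return not same_server
    if pop == "global":    # the server set preserves order everywhere
        return False
    raise ValueError(pop)
```

For example, with local POP capability a client may pipeline dependent RPCs to one server, but must still serialize dependent RPCs sent to different servers.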

Summary
The Network Request Scheduler manages incoming RPC requests on a server to provide improved and consistent performance. It does this primarily by ordering request execution to avoid client starvation and to present a workload to the backend filesystem that can be optimized more easily. It may also change RPC concurrency as active fan-out varies to reduce latency seen by the client and limit request buffering on the server.

POP Capability
The NRS must implement any POP capability its clients require.

Current Lustre servers have no POP capability; therefore clients may never issue RPCs concurrently that have a "before" relation - viz. metadata RPCs are synchronous, and dirty data must have been written back before locks can start to be released. This leaves the NRS free to reorder all incoming RPCs.

Any POP capability should permit better RPC pipelining for improved throughput to single clients and better latency hiding when resolving lock conflicts.

The implementation may choose to implement a very simple POP capability that only works for the most important use cases, since it can revert to synchronous client behaviour in complex cases.

An implementation may create additional "before" relations between RPCs provided they do not conflict with any "real" ordering (i.e. no cycles in the global "before" graph). This may allow a more compact "wire" representation of the "before" relation and/or just a simpler overall implementation, at the expense of reducing the scope to optimize request order.
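
One way to check that an artificial "before" edge does not conflict with the "real" ordering is a reachability test on the global "before" graph - a minimal sketch, using a hypothetical adjacency-set representation:

```python
from collections import defaultdict

def acyclic_after_adding(edges, new_edge):
    """Return True if the "before" graph remains acyclic when new_edge
    is added.  edges maps each operation to the set of operations that
    must come after it; new_edge is a pair (a, b) meaning a <= b."""
    g = defaultdict(set)
    for op, succs in edges.items():
        g[op] |= set(succs)
    a, b = new_edge
    g[a].add(b)
    # adding a <= b creates a cycle iff a is already reachable from b
    seen, stack = set(), [b]
    while stack:
        n = stack.pop()
        if n == a:
            return False
        if n in seen:
            continue
        seen.add(n)
        stack.extend(g[n])
    return True
```

With existing relations a &le; b &le; c, adding a &le; c is harmless (it is already implied transitively), while adding c &le; a would create a cycle and must be rejected.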

Consider RPC requests a &le; b. Implementations that could allow request b to reach a server before request a will have to log completed requests for the duration of a server epoch.

A global POP capability seems to require too much, and too fine-grained, inter-server communication to implement efficiently. It should probably not be considered unless a significant use case arises.

Scalability
The number of RPC requests the server may buffer at any time is the product of RPC concurrency and active fan-out - i.e. potentially many thousands of requests. Request scheduling operations should therefore have at most O(log(n)) complexity, where n is the number of buffered requests.
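
A binary heap is one structure that meets this bound. The sketch below is illustrative only: the (priority, offset) sort key is an assumption for the example, not a statement of the actual scheduling policy.

```python
import heapq

class RequestScheduler:
    """Illustrative request queue with O(log n) enqueue and dequeue,
    using Python's binary heap (heapq)."""

    def __init__(self):
        self._heap = []
        self._seq = 0          # tie-breaker keeps ordering stable

    def enqueue(self, priority, offset, req):
        # O(log n): push keyed by (priority, offset, arrival order)
        heapq.heappush(self._heap, (priority, offset, self._seq, req))
        self._seq += 1

    def dequeue(self):
        # O(log n): pop the most urgent (then lowest-offset) request
        return heapq.heappop(self._heap)[3]
```

Even with thousands of buffered requests, each scheduling operation touches only O(log n) heap entries.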

Offset Stream Consistency
The backend filesystem allocator determines the disk offset stream when a given file is first written. It may even turn a random file offset stream into a substantially sequential disk offset stream. The disk offset stream is repeated when the file is read, provided the file offset stream hasn't changed. Request ordering should therefore be as reproducible as possible in the face of ordering "noise" caused by network unfairness or client races.

Clients should pass a "hint" in RPC requests to ensure related offset streams can be identified, reordered and merged consistently on a multi-user cluster. This "hint" should also be passed through to the backend file system and used by its allocator. The "hint" may also become the basis of a resource reservation system to guarantee share of server resource to concurrent jobs.
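
A minimal sketch of hint-based stream separation (the pair representation and the function name are hypothetical): requests carrying the same hint are grouped into one stream and reordered within it, so one job's ordering is reproducible regardless of how other jobs' requests interleave on the wire.

```python
from collections import defaultdict

def merge_streams(requests):
    """requests: list of (hint, offset) pairs arriving from many clients.
    Group by hint so each job's offset stream is reordered independently
    and reproducibly, insulating it from other jobs' arrival "noise"."""
    streams = defaultdict(list)
    for hint, offset in requests:
        streams[hint].append(offset)
    # sort each stream by offset; iterate hints in a fixed order
    # so the overall result is deterministic
    return {h: sorted(offs) for h, offs in sorted(streams.items())}
```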

Request Priority
Request priorities enable important requests to be serviced with lower latency - e.g. writes required to clean a cache on a locking conflict. Note that high priority requests must not break any POP requirements.

RPC Concurrency
There are conflicting pressures on RPC concurrency. It should be high when maximum individual client performance is required - e.g. when active fan-out is low on the server and there is spare server bandwidth, or when a client must clean its cache on a lock conflict. It should be low at times of high active fan-out to reduce buffering required on the server and to limit the latency of individual client requests.
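
One simple way to reconcile these pressures is to divide a fixed pool of server request credits among the clients currently talking to the server, clamped between a floor and a ceiling. All the numbers below are illustrative assumptions, not tuned values:

```python
def rpc_concurrency(active_fanout, server_credits=256, lo=1, hi=32):
    """Per-client RPC concurrency as a function of active fan-out.

    With few active clients each gets the ceiling (hi) for maximum
    individual throughput; as fan-out grows, each client's share of
    the credit pool shrinks toward the floor (lo), bounding server
    buffering and per-request latency."""
    if active_fanout <= 0:
        return hi
    return max(lo, min(hi, server_credits // active_fanout))
```

A real implementation would also need to grant temporary extra credits to a client cleaning its cache on a lock conflict, per the priority discussion above.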

Extensibility
The NRS must interoperate with non-NRS-aware clients and peers, making "best efforts" scheduling decisions for them. The same policy must apply across successive client and server versions.