Architecture - Write Back Cache

Summary
The meta-data write-back cache (WBC, or MDWBC where ambiguity is possible) allows client meta-data operations to be delayed and batched. This increases client throughput and improves both network utilization and server efficiency.

Definitions

 * (MD)WBC : (Meta-data) Write-Back Cache


 * MD operation : A meta-data operation performed on a client (create, unlink, rename, etc.)


 * MD batch : A group of MD operations performed by a client such that: (a) the batch transforms the file system from one consistent state to another, (b) no other client depends on seeing the file system in any state where some, but not all of the MD operations in the batch are in effect.


 * reintegration : The process of applying an MD batch on a server. Reintegration executes all the MD operations in the batch and changes the file system from one consistent state to another.


 * dependency : A situation in which an MD operation modifies multiple separate pieces of client state that are otherwise not related. These dependent pieces of state have to be reintegrated atomically (in the database ACID sense). For example:


   * link and unlink introduce a dependency between the directory where the entry is added or removed, and the target object whose nlink count is updated.


   * cross-directory rename makes the parent directories dependent.


   * unlinking the last name of a file introduces a dependency between the file inode and the inodes of its stripe objects that are to be destroyed.


 * coordinated reintegration : A special case of reintegration that occurs when the client cache contains dependent state pertaining to multiple servers. In this case the servers have to act in concert to guarantee consistency. Coordinated reintegration is originated by the client, which sends dependent batches to the servers in parallel. One (or more) of the servers assumes the role of coordinator, and uses persistent logs together with the CUT mechanism to either commit or roll back the distributed transaction.


 * object-of-conflict : An object that is in the extent of a lock owned by a client and also in the extent of some conflicting lock that another client is attempting to acquire. I.e., an object where the locks "intersect". A single pair of conflicting locks can have more than one object-of-conflict. This term is used in the QAS descriptions.
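
The object-of-conflict definition can be illustrated with a minimal sketch. It assumes, purely for illustration, that a lock extent can be modelled as a plain set of object identifiers; the function name and the identifiers are hypothetical and are not part of the design.

```python
def objects_of_conflict(held_extent, requested_extent):
    """Return the objects covered by both lock extents.

    Assumption: an extent is modelled as a set of object identifiers.
    The result is sorted only to make the output stable.
    """
    return sorted(set(held_extent) & set(requested_extent))


# Hypothetical extents of a held lock and of a conflicting lock request.
held = {"dir:/a", "file:/a/x", "file:/a/y"}
requested = {"file:/a/y", "file:/a/x", "dir:/b"}

# One pair of conflicting locks, two objects-of-conflict.
conflicts = objects_of_conflict(held, requested)
```

With these inputs a single pair of locks yields two objects-of-conflict, which is the "more than one" case mentioned in the definition.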

Requirements

 * scalability : a client should be able to execute 32K creations of 1-64 KB files per second. Files may be created in different directories, with file counts per directory ranging from 1K to 100K.


 * correctness : reintegration changes the file system from one globally consistent state to another.


 * transactionality : reintegration assures that the disk image of the file system is consistent. This implies that reintegration is either done completely within a single transaction, or the batch contains enough information to cut reintegration into smaller pieces, each preserving consistency.


 * concurrency : when a client surrenders a meta-data lock it only flushes enough of its cache to guarantee correctness (i.e., flushing the whole meta-data cache is not necessary).
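
One way to read the concurrency requirement is sketched below. It assumes that the cached operations form an ordered log, that each operation names the objects it touches, and that updates to any one object must reach the server in log order. Under those assumptions, surrendering a lock requires flushing only the operations that touch the surrendered objects, plus every earlier operation they transitively depend on. `MdOp` and `ops_to_flush` are hypothetical names; the design does not prescribe this algorithm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MdOp:
    name: str
    objects: frozenset  # objects whose state this operation modifies


def ops_to_flush(log, surrendered):
    """Select the operations that must be flushed, in log order.

    The log is scanned from newest to oldest. An operation is selected if
    it touches an object that is already needed; its other objects then
    become needed too, which pulls in earlier dependent operations.
    """
    needed = set(surrendered)
    selected = []
    for op in reversed(log):
        if op.objects & needed:
            selected.append(op)
            needed |= op.objects
    selected.reverse()
    return selected


log = [
    MdOp("create /a/f", frozenset({"/a", "f"})),
    MdOp("create /b/g", frozenset({"/b", "g"})),
    MdOp("rename /a/f -> /c/f", frozenset({"/a", "/c", "f"})),
    MdOp("create /b/h", frozenset({"/b", "h"})),
]

# Surrendering the lock on /c drags in the rename and the create it depends on,
# but leaves both operations under /b in the cache.
flush = ops_to_flush(log, {"/c"})
```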

Details
Instead of immediately sending MD operations to the server and waiting for their execution, the client caches them in some form, simulating their local effects (creating, modifying, and deleting VFS and VM entities such as inodes, dentries, pages, etc.). Later, a batched description of the cached operations is sent to the server and executed there.
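
The caching scheme above could look roughly like the following sketch. It is a toy model, not the client implementation: a dictionary stands in for the VFS entities (dentries, inodes), the cache only knows about names created locally, and the class and method names are hypothetical.

```python
class ClientMdCache:
    """Toy model of a client that delays MD operations."""

    def __init__(self):
        self.namespace = {}  # path -> attributes; stand-in for dentries/inodes
        self.pending = []    # cached MD operations not yet sent to the server

    def create(self, path):
        if path in self.namespace:
            raise FileExistsError(path)
        self.namespace[path] = {"nlink": 1}    # simulate the local effect
        self.pending.append(("create", path))  # remember it for the server

    def unlink(self, path):
        if path not in self.namespace:
            raise FileNotFoundError(path)
        del self.namespace[path]
        self.pending.append(("unlink", path))

    def rename(self, old, new):
        if old not in self.namespace:
            raise FileNotFoundError(old)
        self.namespace[new] = self.namespace.pop(old)
        self.pending.append(("rename", old, new))

    def take_batch(self):
        """Hand over all cached operations and start a new, empty batch."""
        batch, self.pending = self.pending, []
        return batch


cache = ClientMdCache()
cache.create("/d/f")
cache.rename("/d/f", "/d/g")
visible = sorted(cache.namespace)  # local effects are visible immediately
batch = cache.take_batch()         # what would later be sent to the server
```

No server round trip happens in `create` or `rename`; the server sees the operations only when the batch is taken and sent.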

Two important aspects of the WBC are how MD batches are stored on the client and transported over the network. Possible extremes include pure (logical) logging where every operation is represented as a separate entity, and pure physical logging (aka "bulk state update") where only the latest state is maintained.
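
The difference between the two extremes can be made concrete with a small sketch. It assumes a hypothetical log format of tuples and models the "latest state" as one record per path, with `None` as a deletion marker; neither format is taken from the design.

```python
def collapse(log):
    """Fold a logical log into the latest state per path (None = deleted).

    Assumption: the log is valid, e.g. no chmod follows an unlink.
    """
    state = {}
    for op in log:
        kind, path = op[0], op[1]
        if kind == "create":
            state[path] = {"mode": op[2]}
        elif kind == "chmod":
            entry = state.get(path) or {}
            state[path] = {**entry, "mode": op[2]}
        elif kind == "unlink":
            state[path] = None
        else:
            raise ValueError(f"unknown operation: {kind}")
    return state


# Logical log: every operation is kept as a separate entity.
log = [
    ("create", "/d/f", 0o644),
    ("chmod", "/d/f", 0o600),
    ("create", "/d/tmp", 0o644),
    ("unlink", "/d/tmp"),
]

# Bulk state update: only the latest state of each object survives.
state = collapse(log)
```

Four log records collapse into two state records. The intermediate history (the original mode of `/d/f`, the short life of `/d/tmp`) is lost, which is why a pure state update cannot be cut into smaller consistent pieces.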

The current design is to store cached MD updates as some sort of log in client memory and to transmit an MD batch as a bulk state update. Storing modifications as a log has the following advantages:


 * it is possible to create finer-grained batches, i.e., to reduce the amount of flushed state by flushing only a portion of the modified state for a given object;


 * resend and replay are simplified;


 * a higher degree of concurrency during reintegration seems possible: to do reintegration, the client "cuts" a certain prefix of the log and starts reintegrating it with the server. In the meantime, operations on the objects involved in the reintegration can continue. This seems important, as reintegration of a large batch can take a (relatively) long time and stop-the-world cache flushing is undesirable.

The disadvantage is an increased memory footprint (or, equivalently, more frequent reintegration).
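
The log-cutting idea and its memory cost can be sketched as follows. This is an illustration under simplifying assumptions (a single in-memory queue, no locking, no network), and `MdLog` and its methods are hypothetical names.

```python
from collections import deque


class MdLog:
    """Toy client log that supports cutting a prefix for reintegration."""

    def __init__(self):
        self.active = deque()  # operations still accumulating
        self.in_flight = []    # batches cut off and being reintegrated

    def append(self, op):
        self.active.append(op)

    def cut(self, count):
        """Detach the oldest `count` operations as one batch."""
        count = min(count, len(self.active))
        batch = [self.active.popleft() for _ in range(count)]
        self.in_flight.append(batch)
        return batch

    def committed(self, batch):
        """The server has applied the batch; its memory can be released."""
        self.in_flight.remove(batch)

    def footprint(self):
        """Number of operations the client still has to keep in memory."""
        return len(self.active) + sum(len(b) for b in self.in_flight)


mdlog = MdLog()
for op in ("op1", "op2", "op3"):
    mdlog.append(op)

batch = mdlog.cut(2)   # start reintegrating the prefix [op1, op2]
mdlog.append("op4")    # new operations continue without waiting
during = mdlog.footprint()

mdlog.committed(batch)
after = mdlog.footprint()
```

While the batch is in flight the client holds both the cut prefix and the new operations, which is the memory-footprint cost noted above.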

The advantage of sending and applying updates as a batch is that it off-loads work from the server, effectively bringing meta-data operations closer to data operations: ideally, a bulk update of directory pages can be very similar to a bulk update of regular file pages. The disadvantages are


 * the need for a high level of trust in the clients, as they are permitted to carry out complex meta-data modifications whose consistency cannot be proven by the server, and


 * the need to apply the batch as a single transaction, as it cannot be split into smaller pieces.

Quality Attribute Scenarios

 * sub-tree-operations


 * sub-tree-conflict


 * undo


 * data-consistency


 * unlink


 * recovery


 * dependency


 * rename

A special case of dependency in which the dependency is bi-directional: both parent directories depend on each other.


 * CMD-rename

Issues
It seems that scalability favors sending MD operations as a bulk state update, while data-consistency and avoiding stop-the-world flushing are easier to achieve with a log-based representation.

Clustered meta-data: suppose that in a CMD setup a client renames a file, moving its name from one server to another. The correctness requirement in this case means that either both servers reintegrate the changes or neither of them does, which (it seems) implies a CMD roll-back, originated and controlled by the client.
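
The all-or-nothing requirement for this rename can be sketched with a generic "apply tentatively, then commit or undo" scheme driven by the client. This is not the CUT mechanism and not the actual CMD protocol; the in-memory undo list merely stands in for a persistent log, failures are simulated with a flag, and all names are hypothetical.

```python
class Server:
    """Toy meta-data server holding a set of directory entries."""

    def __init__(self, name, names=(), fail=False):
        self.name = name
        self.names = set(names)
        self.fail = fail  # simulate a server that cannot apply its batch
        self.undo = []    # stand-in for a persistent undo log

    def apply(self, batch):
        """Tentatively apply ("add" | "remove", entry) updates."""
        if self.fail:
            raise RuntimeError(f"{self.name}: cannot reintegrate")
        for action, entry in batch:
            if action == "add":
                self.names.add(entry)
                self.undo.append(("remove", entry))
            else:
                self.names.remove(entry)
                self.undo.append(("add", entry))

    def commit(self):
        self.undo.clear()

    def rollback(self):
        """Revert tentative updates, newest first."""
        while self.undo:
            action, entry = self.undo.pop()
            if action == "add":
                self.names.add(entry)
            else:
                self.names.discard(entry)


def reintegrate(batches):
    """Client-driven reintegration: every server commits, or none does."""
    try:
        for server, batch in batches:
            server.apply(batch)
    except Exception:
        for server, _ in batches:
            server.rollback()  # no-op on servers that applied nothing
        return False
    for server, _ in batches:
        server.commit()
    return True


# Rename moves the name "old" on mds1 to the name "new" on mds2.
mds1 = Server("mds1", names={"old"})
mds2 = Server("mds2", fail=True)
rename = [(mds1, [("remove", "old")]), (mds2, [("add", "new")])]

first = reintegrate(rename)     # mds2 fails, so mds1 is rolled back
names_after_failure = set(mds1.names)

mds2.fail = False
second = reintegrate(rename)    # both servers apply and commit
```

A real implementation would also have to survive client and server crashes between the apply and commit steps, which is what the persistent logs are for; the sketch ignores that entirely.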

Cross-MDS MD dependencies introduce the danger of cascading evictions (much like cross-OST locks do).

Cross-MDS operations together with batching require the MDT to be able to coordinate a distributed operation from any point. E.g., a cross-ref unlink RPC may arrive at either the server holding the directory or the server holding the object, and both cases have to be handled; similarly for rename, etc. It seems logical that in the first version cross-ref operations (deemed to be rare) are not cached, so as to avoid server modifications.

Effort decomposition
The following table also includes a non-exhaustive list of the sub-components of Epochs and Sub Tree Locks.

C-* tasks are for the client, S-* tasks are for the server. Dependencies marked with (*) are weak.