WARNING: This is the _old_ Lustre wiki, and it is in the process of being retired. The information found here is all likely to be out of date. Please search the new wiki for more up to date information.
Architecture - Write Back Cache
Note: The content on this page reflects the state of design of a Lustre feature at a particular point in time and may contain outdated information.
Summary
The meta-data write-back cache (WBC, or MDWBC where there is a possibility of confusion) allows client meta-data operations to be delayed and batched. This increases client throughput and improves both network utilization and server efficiency.
Definitions
- (MD)WBC
- (Meta-data) Write-Back Cache
- MD operation
- A meta-data change performed by a client that changes the namespace in a consistent manner (e.g. create file, unlink, rename, etc.). An MD operation is normally composed of multiple MD updates that are performed on one or more MDT devices.
- MD update
- A low-level change to a single meta-data storage target; MD updates form the building blocks of an MD operation (e.g. insert directory entry, increment link count, change timestamp). Individual MD updates do not necessarily leave the filesystem in a consistent state, and need to be applied to the filesystem atomically in order to ensure consistency.
- MD batch
- A group of MD updates performed by a client such that: (a) the batch transforms the file system from one consistent state to another, (b) no other client depends on seeing the file system in any state where some, but not all of the MD operations in the batch are in effect.
- reintegration
- The process of applying an MD batch on a server. Reintegration executes all the MD operations in the batch and changes the file system from one consistent state to another.
- dependency
- A situation in which an MD operation modifies multiple separate pieces of client state that are otherwise not related. These dependent pieces of state have to be reintegrated atomically (in the data-base ACID sense). For example:
- link and unlink introduce a dependency between the directory where the entry is added or removed, and the target object whose nlink count is updated.
- cross-directory rename makes the parent directories dependent.
- unlinking the last name of a file introduces a dependency between the file inode and its stripe objects that are to be destroyed.
- coordinated reintegration
- A special case of reintegration that occurs when the client cache contains dependent state pertaining to multiple servers. In this case the servers have to act in concert to guarantee consistency. Coordinated reintegration is initiated by the client, which sends dependent batches to the servers in parallel. One (or more) server assumes the role of coordinator and uses persistent logs, together with the CUT mechanism, to either commit or roll back the distributed transaction.
- object-of-conflict
- An object in the extent of a lock owned by a client that is also in the extent of a conflicting lock that another client is attempting to acquire, i.e., an object where locks "intersect". A single pair of conflicting locks can have more than one object-of-conflict. This term is used in the QAS descriptions.
Requirements
- scalability
- a client should be able to execute 32K creations of 1--64KB files per second. Files may be created in different directories, with per-directory file counts ranging from 1K to 100K.
- correctness
- reintegration changes the file system from one globally consistent state to another.
- transactionality
- reintegration assures that the disk image of the file system is consistent. This implies that reintegration is either done completely within a single transaction, or the batch contains enough information to cut reintegration into smaller pieces, each preserving consistency.
- concurrency
- when a client surrenders a meta-data lock it only flushes enough of its cache to guarantee correctness (i.e., flushing the whole meta-data cache is not necessary).
Details
Instead of immediately sending MD operations to the server and waiting for their execution, the client caches them in some form, simulating their local effects (creating, modifying, and deleting VFS and VM entities such as inodes, dentries, pages, etc.). Later, a batched description of the cached operations is sent to the server and executed there.
Two important aspects of the WBC are how MD batches are stored on the client and transported over the network. Possible extremes include pure (logical) logging where every operation is represented as a separate entity, and pure physical logging (aka "bulk state update") where only the latest state is maintained.
The current design is to store cached MD updates as a log in client memory, and to transmit the MD batch as a bulk state update. Storing modifications as a log has the following advantages:
- it is possible to create finer-grained batches, i.e., to reduce the amount of flushed state by flushing only a portion of the modified state for a given object;
- resend and replay are simplified;
- a higher degree of concurrency during reintegration seems possible: to reintegrate, the client "cuts" a certain prefix of the log and starts reintegrating it with the server. Meanwhile, operations on the objects involved in reintegration can continue. This seems important, as reintegration of a large batch can take a (relatively) long time, and stop-the-world cache flushing is undesirable.
The disadvantage is an increased memory footprint (or, equivalently, more frequent reintegration).
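The log representation and prefix "cut" described above can be sketched as follows. This is a minimal illustration with hypothetical names (`MDUpdateLog`, `cut_prefix`, the update tuples), not the actual client code:

```python
from collections import deque

class MDUpdateLog:
    """Sketch of the client-side log of cached MD updates (illustrative)."""

    def __init__(self):
        self._log = deque()  # updates kept in application order

    def record(self, update):
        """Append a low-level MD update, e.g. ('create', parent_fid, name)."""
        self._log.append(update)

    def cut_prefix(self, max_updates):
        """'Cut' a prefix of the log to form a batch for reintegration.

        Updates recorded after the cut stay in the log, so local
        operations can continue while the batch is in flight.
        """
        batch = []
        while self._log and len(batch) < max_updates:
            batch.append(self._log.popleft())
        return batch

log = MDUpdateLog()
log.record(('create', 'fid1', 'a'))
log.record(('setattr', 'fid1', {'mtime': 1}))
log.record(('create', 'fid2', 'b'))
batch = log.cut_prefix(2)        # prefix goes into an RPC batch
log.record(('unlink', 'fid2'))   # local operations continue meanwhile
```

The point of the deque is that reintegration consumes only a prefix while new updates keep arriving at the tail, which is what makes stop-the-world flushing avoidable.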
The advantage of sending and applying updates as a batch is off-loading work from the server, effectively making meta-data operations resemble data operations; ideally, a bulk update of directory pages can be very similar to a bulk update of regular file pages. The disadvantages are:
- the necessity of a high level of trust in the clients, as they are permitted to carry out complex meta-data modifications whose consistency cannot be verified by the server, and
- the necessity of applying the batch as a single transaction, as it cannot be split into smaller pieces.
Use Cases
id | quality attribute | summary |
---|---|---|
sub-tree-operations | performance | A client creates a new sub-directory and populates it with a large number of files (and sub-directories, recursively) |
sub-tree-conflict | usability | A client creates a new sub-directory and populates it with a large number of files (and sub-directories, recursively). Another client obtains a conflicting lock on this sub-directory |
undo | performance | A client creates a new sub-directory, populates it with some number of files and then removes them all |
data-consistency | usability | A client executes data and meta-data operations on existing files. Another client obtains a conflicting lock on some data. |
unlink | usability, performance | A client removes a number of (not hard-linked) files and sends a batched update to the server. On successful execution of the batch, the client sends DESTROY RPCs to the OSTs. |
recovery | usability | (A) A client performs a number of MD operations. (B) The client sends the batch to the server. (C) The server executes the batch. (D) The client gets a reply. (E) The server commits the batch. (F) The client gets the commit notification. The server crashes at any of A, B, C, D, E, or F. |
dependency | usability | A client performs an MD operation involving more than one object (link, unlink, etc.). The lock protecting one of the objects involved is revoked. |
rename | usability | A client renames a file across directories. The lock protecting one of these directories is revoked. |
CMD-rename | usability, scalability | A client renames a file across directories located on different MD servers. A lock protecting one of these directories is revoked. |
Quality Attribute Scenarios
- sub-tree-operations
Scenario: | A client creates a new sub-directory and populates it with a large number of small files (and sub-directories, recursively) | |
Business Goals: | achieve high client throughput on small file creations | |
Relevant QA's: | performance | |
details | Stimulus source: | client application |
Stimulus: | stream of meta-data and data operations | |
Environment: | isolated directory subtree, not accessed by other clients | |
Artifact: | lustre client | |
Response: | The client executes operations locally, modifying VFS and VM objects in memory. Once enough MD operations are cached to form an efficient RPC, a batch is sent (possibly to multiple servers in parallel), while local operations continue without a slowdown. Ongoing MD operations require very little communication with the servers, as critical resources (object identifiers, disk space on the OSS servers, locks on meta-data objects) are leased to the client in advance (in the form of fid sequences, grants, and sub-tree locks, respectively). | |
Response measure: | the client should be able to execute 32K creations of 1--64KB files per second. Files may be created in different directories, with per-directory file counts ranging from 1K to 100K. The creation test has to run for 1, 5, and 10 minutes (1.9M, 9.6M, and 19.2M files total, respectively). | |
Questions: | How is the batch represented on the client? How is it transmitted over the network to the server? What forms of concurrency control are used? | |
Issues: | see questions. |
- sub-tree-conflict
Scenario: | A client creates a new sub-directory and populates it with a large number of files (and sub-directories, recursively). Another client obtains a conflicting lock on this sub-directory | |
Business Goals: | a consistent file system picture for both clients; low latency for the lock acquisition operation. | |
Relevant QA's: | performance, scalability, usability | |
details | Stimulus: | conflicting lock |
Stimulus source: | other client | |
Environment: | shared directory | |
Artifact: | client MD cache. | |
Response: | lock invalidation, including a flush of the dependent state. Cached updates for the object-of-conflict are taken, and a "minimal batch" is built as the transitive closure of all modifications that depend on modifications already in the batch. The caching policy guarantees that the minimal batch fits into a single RPC (per server). If the resulting batch is smaller than the maximal RPC size, additional state is flushed according to some policy (e.g., the oldest cached updates are flushed). | |
Response measure: | locking latency. Latency should be O(cached_state(object-of-conflict)), where cached_state(X) is the amount of cached MD state for the object X. | |
Questions: | how much state to flush? Do we need something similar to ASYNC_URGENT as in data-cache? | |
Issues: | see questions. |
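The "minimal batch" construction in the sub-tree-conflict response amounts to a transitive closure over a dependency graph. A minimal sketch under the assumption of a per-object dependency map; all names (`minimal_batch`, `deps`, the update strings) are illustrative, not the Lustre code:

```python
def minimal_batch(conflict_objects, updates, deps):
    """Build the 'minimal batch' for a lock conflict as the transitive
    closure of dependent modifications.

    conflict_objects: objects in the extent of the conflicting lock
    updates: dict mapping object -> list of cached updates for it
    deps:    dict mapping object -> set of objects whose cached state
             must be reintegrated atomically with it
    """
    to_flush, queue = set(), list(conflict_objects)
    while queue:
        obj = queue.pop()
        if obj in to_flush:
            continue
        to_flush.add(obj)
        queue.extend(deps.get(obj, ()))  # pull in dependent objects
    # the batch is the union of cached updates for every flushed object
    return {obj: updates[obj] for obj in to_flush if obj in updates}

# e.g. an unlink makes the directory depend on the file inode,
# while an unrelated object stays cached
updates = {'dir': ['dirent_del a'], 'inode': ['nlink_dec'], 'other': ['setattr']}
deps = {'dir': {'inode'}}
batch = minimal_batch(['dir'], updates, deps)
```

Only `dir` and `inode` end up in the batch; `other` stays cached, which is the concurrency requirement that surrendering a lock must not flush the whole cache.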
- undo
Scenario: | A client creates a new sub-directory, populates it with some number of files, and then removes them all | |
Business Goals: | efficient handling of temporary files | |
Relevant QA's: | performance, usability | |
details | Stimulus: | short-lived temporary files |
Stimulus source: | client application | |
Environment: | isolated directory | |
Artifact: | client MD cache. | |
Response: | cancellation of the cached state. To achieve this, before forming the batch, the log of cached updates is preprocessed by replacing a group of operations with a smaller number of operations where possible (e.g., creation and removal of the same file cancel each other, leaving only an atime/mtime update on the parent directory, and atime/mtime updates on the parent cancel each other, leaving only the latest one). | |
Response measure: | the size of the resulting batch (which ideally contains only time updates for the parent directory) should be O(1), i.e., independent of the number of files created and unlinked. | |
Questions: | What about audit on the server? How can state cancellation be implemented efficiently? | |
Issues: | see questions. |
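The cancellation preprocessing in the undo response can be sketched as a pass over the update log. This is a simplified illustration with hypothetical operation tuples; real cancellation would also have to respect dependencies and server-side audit requirements:

```python
def cancel_updates(log):
    """Preprocess the update log before batching: a create that is later
    undone by an unlink of the same object is dropped together with the
    unlink, and repeated time updates collapse to the latest one.
    """
    created, out = set(), []
    for op, obj in log:
        if op == 'create':
            created.add(obj)
            out.append((op, obj))
        elif op == 'unlink' and obj in created:
            # the create and the unlink cancel each other
            created.discard(obj)
            out = [u for u in out if u != ('create', obj)]
        else:
            out.append((op, obj))
    # collapse duplicate time updates per object, keeping only the latest
    seen, collapsed = set(), []
    for op, obj in reversed(out):
        if op == 'touch' and (op, obj) in seen:
            continue
        seen.add((op, obj))
        collapsed.append((op, obj))
    collapsed.reverse()
    return collapsed

log = [('create', 'f1'), ('touch', 'dir'), ('create', 'f2'),
       ('touch', 'dir'), ('unlink', 'f1'), ('unlink', 'f2')]
result = cancel_updates(log)  # only the parent-directory time update survives
```

This matches the O(1) response measure: however many files are created and unlinked, the surviving batch is just the parent-directory time update.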
- data-consistency
Scenario: | A client executes data and meta-data operations on existing files, when a conflicting lock on some data is requested by another client | |
Business Goals: | maintain desired level of visible file system consistency | |
Relevant QA's: | usability | |
details | Stimulus: | conflicting access to the data range |
Stimulus source: | other client | |
Environment: | file/object data and associated meta-data. | |
Artifact: | client cached data and meta-data | |
Response: | flush of the data and some meta-data. Details are similar to the sub-tree-conflict case, mutatis mutandis. | |
Response measure: | size of resulting batch(es) should be O(cached_state(conflicting-extent)) + O(1), where O(1) is for meta-data update. | |
Questions: | What consistency between data and meta-data do we want? | |
Issues: | see questions. |
- unlink
Scenario: | A client removes a number of (not hard-linked) files and sends a batched update to the server. On successful execution of the batch, the client sends DESTROY RPCs to the OSTs. | |
Business Goals: | client-originated unlink | |
Relevant QA's: | usability, performance, scalability | |
details | Stimulus: | batched flush of MD operations |
Stimulus source: | MD cache | |
Environment: | cached unlink operation | |
Artifact: | flush and reintegration | |
Response: | reintegration on the MD server, and file body destruction on the OSS servers. Several implementation strategies are possible. The simplest is to emulate the current "client-originated DESTROY" design, where the client sends an UNLINK RPC to the MD server and receives an "unlink cookie" (stored by the server in a transactional persistent log) that is then broadcast in parallel to all OST servers involved, together with the DESTROY RPC. In the case of batched updates this design becomes complicated, as a single batch can contain multiple unlinks.
To achieve an even higher degree of concurrency, a range of cookies can be leased to the client (in the same vein as a range of fid sequences is). In that design, the client sends RPCs to the MD and OSS servers in parallel. Every server stores the received cookie in a persistent log, transactionally with performing the operation in question. Every OSS then contacts the coordinator (which is the corresponding MDS) and reports operation completion, allowing the coordinator to cancel the llog entry. The additional failure mode introduced by this scenario is when at least one OSS has received and carried out the operation, while the RPC sent to the coordinator was either lost or had not yet arrived when the OSS reported operation completion. | |
Response measure: | RPC concurrency level. The client sends DESTROY RPCs in parallel. | |
Questions: | ||
Issues: | some form of distributed transaction commit is probably the cleanest way to implement this. |
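The leased-cookie scheme in the unlink response could look roughly like the following. This is a toy single-process model with illustrative names (`Coordinator`, `lease_cookies`), not the Lustre llog API:

```python
class Coordinator:
    """Toy model of the MDS acting as unlink coordinator (illustrative)."""

    def __init__(self):
        self.next_cookie = 0
        self.llog = set()  # stands in for the persistent log of in-flight unlinks

    def lease_cookies(self, n):
        """Lease a range of unlink cookies to a client, in the same vein
        as a range of fid sequences."""
        first = self.next_cookie
        self.next_cookie += n
        return range(first, first + n)

    def record(self, cookie):
        """Store the cookie transactionally with the MD-side unlink."""
        self.llog.add(cookie)

    def oss_completed(self, cookie):
        """An OSS reports DESTROY completion: cancel the llog entry."""
        self.llog.discard(cookie)

coord = Coordinator()
cookies = list(coord.lease_cookies(2))
for c in cookies:
    coord.record(c)              # MDS side: unlinks logged
coord.oss_completed(cookies[0])  # one OSS reports its DESTROY done
# the second cookie is still in the llog: after a crash, its entry
# tells recovery which DESTROY may still need to be redone
```

The surviving llog entry is exactly the state a distributed-commit protocol would use to resolve the lost-completion failure mode described above.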
- recovery
Scenario: | (A) A client performs a number of MD operations. (B) The client sends the batch to the server. (C) The server executes the batch. (D) The client gets a reply. (E) The server commits the batch. (F) The client gets the commit notification. The server crashes at any of A, B, C, D, E, or F. | |
Business Goals: | service availability in the presence of server failures. | |
Relevant QA's: | usability | |
details | Stimulus: | transient server or network failure |
Stimulus source: | act of god | |
Environment: | client and server connected by a network, all of which can fail. | |
Artifact: | recovery mechanism | |
Response: | client recovery | |
Response measure: | successful recovery completion. The usual recovery guarantees apply. The client has to keep up to O(cached_state) of state for replay. | |
Questions: | In what form is the batch kept in memory for recovery purposes? | |
Issues: | see questions. |
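The replay requirement above (the client keeps up to O(cached_state) of state until the commit notification) can be illustrated with a toy replay queue; all names are hypothetical:

```python
class ReplayQueue:
    """Toy model: the client retains every sent batch until the server
    acknowledges commit, so it can replay after a server restart."""

    def __init__(self):
        self.unacked = {}  # batch id -> batch; this is the O(cached_state) cost

    def sent(self, batch_id, batch):
        """A batch was sent; keep it until the commit notification."""
        self.unacked[batch_id] = batch

    def committed(self, batch_id):
        """Commit notification received: the batch can be discarded."""
        self.unacked.pop(batch_id, None)

    def replay(self):
        """On reconnect after a server crash, resend unacknowledged
        batches in their original order."""
        return [self.unacked[i] for i in sorted(self.unacked)]

q = ReplayQueue()
q.sent(1, ['create a'])
q.sent(2, ['unlink b'])
q.committed(1)          # batch 1 committed; batch 2 lost in the crash
pending = q.replay()    # batch 2 must be resent
```

Discarding on commit, not on reply, is what distinguishes points D and E in the scenario: a crash between them still requires replay.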
- dependency
Scenario: | A client performs an MD operation involving more than one object (link, unlink, etc.). The lock protecting one of the objects involved is revoked. | |
Business Goals: | maintain desired level of visible file system consistency | |
Relevant QA's: | usability | |
details | Stimulus: | revocation of lock on one of the dependent objects. |
Stimulus source: | other client | |
Environment: | dependent state in MD cache | |
Artifact: | batched cache flush | |
Response: | flush of dependent state | |
Response measure: | the amount of state flushed. In the worst case, a flush to all MD servers might be needed. All RPCs can be sent in parallel. | |
Questions: | how to track dependencies? | |
Issues: | see questions. |
- rename
A special case of dependency in which the dependency is bi-directional: both parent directories depend on each other.
- CMD-rename
Scenario: | A client renames a file across directories located on different MD servers. The lock protecting one of these directories is revoked. | |
Business Goals: | maintain desired level of visible file system consistency | |
Relevant QA's: | usability, scalability | |
details | Stimulus: | revocation of lock on one of the inter-dependent objects. |
Stimulus source: | other client | |
Environment: | dependent state in MD cache | |
Artifact: | batched cache flush | |
Response: | flush of dependent state to both MD servers. | |
Response measure: | the amount of state flushed. Updates to both parent directories have to be flushed in parallel. | |
Questions: | how to coordinate reintegration on multiple servers? | |
Issues: | see questions. |
Lower level choices
Description | Quality | Semantics |
---|---|---|
keep-what | performance, semantics | does the client keep cached operations as a log, as accumulated state, or as a hybrid thereof |
send-what | performance | how cached operations are transferred over the network: as a log or as a bulk state update |
cache-flush | usability | what triggers cache flush |
consistency | usability, semantics | what consistency guarantees WBC provides w.r.t. meta-data visibility by other clients |
data-consistency | usability, semantics | what consistency guarantees WBC provides w.r.t. ordering between data and meta-data operations |
lock-conflicts | usability, scalability | how lock conflicts are handled |
stop-the-world | usability | what form of concurrency control is used to achieve consistency and scalability during batched send |
recovery | correctness | how server failure is handled |
client-originated-ops | scalability | whether client-originated unlink is implementable |
Issues
It seems that scalability favors at least sending MD operations in the form of a bulk state update, while data-consistency and stop-the-world are easier to achieve with a log-based representation.
Clustered meta-data: suppose that in a CMD setup a client renames a file, moving its name from one server to another. The correctness requirement in this case means that either both servers reintegrate the changes, or neither does, which (it seems) implies a CMD roll-back originated and controlled by the client.
Cross-MDS MD dependencies introduce the danger of cascading evictions (much like cross-OST locks do).
Cross-MDS operations together with batching require that an MDT be able to coordinate a distributed operation from any point; e.g., the situation has to be handled where a cross-ref unlink RPC arrives at either the server holding the directory or the server holding the object, and similarly for rename, etc. It seems logical that in the first version cross-ref operations (deemed to be rare) are not cached, so as to avoid server modifications.
Effort decomposition
The following table also includes (a non-exhaustive list of) the sub-components of Epochs and Sub Tree Locks.
C-* tasks are for the client, S-* tasks are for the server. Dependencies marked with (*) are weak.
Component | Sub-component | scope | depends upon |
---|---|---|---|
WBC | C-VFS-MM | integration with vfs: inodes, dentries, memory pressure. Executing operation effects locally. | |
C-ops-caching | tracking operations: list vs. fragments. Tracking dependencies. | ||
C-write-out | policy deciding when to write-out cached state updates, and with what granularity: age, amount, max-in-flight | C-grants | |
C-dir-pages | caching of directory pages and using them for local lookups | C-VFS-MM | |
C-new-files | creation of new files locally | C-VFS-MM | |
C-new-objects | creation of new objects locally | S-ost-fid C-VFS-MM
C-DLM | invoking reintegration on a lock cancel, lock weighting | C-ops-caching | |
C-data | dependencies between cached data and meta-data | C-ops-caching | |
C-IO | switching between whole-file mds-based locking and extent locking | ||
C-grants | unified resource range leasing mechanism | S-grants | |
S-grants | unified resource range leasing mechanism | S-ost-fid | |
C-misc | sync, fsync, compatibility flag, mount option | C-ops-caching | |
S-misc | compatibility flags | ||
STL | C-policy | track usage statistics and use them to decide when to ask for an STL | |
S-policy | track usage statistics and use them to decide when to grant an STL | ||
EPOCHS | formalization | formal reintegration model with "proofs" of recovery correctness and concurrency control description | |
C-reintegration | reintegration, including concurrency control, integration with ptlrpc | S-compound S-reintegration | |
S-compound | implementation of the compound operations on the server | ||
S-reintegration | reintegration of batches on the server, thread scheduling | ||
S-undo | keeping undo logs | S-gc(*) | |
S-cuts | implementation of the CUTs algorithm | ||
C-gc | garbage collection: when to discard cached batches | ||
S-gc | garbage collection: when to discard undo logs | ||
C-recovery | replay, including optional optimistic "pre-replay" | ||
S-recovery-0 | roll-back of the uncommitted epochs | S-gc | |
S-recovery-1 | roll-forward from the clients | C-gc | |
EXTERNAL | S-ost-fid | ost understanding fids, and granting fid sequences to the clients |