Note: The content on this page reflects the state of design of a Lustre feature at a particular point in time and may contain outdated information.
Definitions
- CROW (CReate On Write)
- A technique that optimizes create performance by deferring the actual creation of OSS objects until the first modifying event (write or setattr) occurs.
CROW Architecture
MDS_LOV | Allocates object FIDs and stores them in the LOV EA without sending an object-creation RPC to the OST
obdfilter | Creates the object on the first write/setattr request
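The division of work between the two components can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the `Mds` and `Ost` classes, their methods, and the integer FIDs are hypothetical stand-ins for illustration, not Lustre code or Lustre data formats.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Ost:
    """Toy OST (hypothetical): objects appear on the first write."""
    objects: dict = field(default_factory=dict)  # FID -> bytearray

    def write(self, fid, offset, data):
        # Create-on-write: materialize the object if it does not exist yet.
        obj = self.objects.setdefault(fid, bytearray())
        if len(obj) < offset:
            obj.extend(b"\0" * (offset - len(obj)))  # zero-fill any hole
        obj[offset:offset + len(data)] = data


@dataclass
class Mds:
    """Toy MDS (hypothetical): create only allocates FIDs into the LOV EA."""
    lov_ea: dict = field(default_factory=dict)  # name -> [(ost index, FID)]
    _next_fid: count = field(default_factory=lambda: count(1))

    def create(self, name, stripe_count):
        stripes = [(i, next(self._next_fid)) for i in range(stripe_count)]
        self.lov_ea[name] = stripes  # note: no request is sent to any OST
        return stripes


osts = [Ost(), Ost()]
mds = Mds()
stripes = mds.create("file", stripe_count=2)

# Right after create, the file has stripe info but no OST objects.
objects_after_create = [len(o.objects) for o in osts]

# The first write to stripe 0 creates that object, and only that one.
ost_index, fid = stripes[0]
osts[ost_index].write(fid, 0, b"hello")
```

In this model the second stripe never gets an object because nothing was written to it, which corresponds to the "objects" use case: an MD object with stripe info but without OSS objects.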
Use Cases
Summary
id | quality attribute | summary
object_creation | performance, usability | During a create operation, the MDS only saves the object FIDs in the EA. Objects are created on the OSTs at the first write/setattr. The precreation mechanism is not used.
wrong_creation | usability | An attempt to re-create a just-destroyed object is properly recognized, and the object is not re-created.
objects | usability | An MD object can exist with stripe info but without the OSS objects themselves.
fid | usability | Uniform FIDs are used to identify objects. The MD FID and OSS FIDs are cross-referenced using EAs.
cache | availability, usability | The client allocates both MD and OSS FIDs before sending the create request to the MDS.
recovery | availability, usability | Recoverable from single-point failures through re-creation of lost objects and/or deletion of orphaned objects.
cmd | usability | CMD is supported.
tests | testability | Sanity and recovery tests.
quota | security, usability | Compatible with quota.
scalability | availability, usability | More scalable than the pre-creation mechanism.
inodes_reservation | usability | The OST should have enough inodes for all allocated FIDs.
wrong_creation
Scenario: | OST objects that are being unlinked are re-created by a setattr.
Business Goals: | Avoid re-creation of objects on the OST
Relevant QA's: | Usability & availability
details
Stimulus: | OST objects are unlinked, then a late setattr or write arrives and creates the objects again.
Stimulus source: | Object destroy and setattr reach the OST from different nodes, so a setattr can arrive later in case of a network or server failure and the consequent recovery.
Environment: | MDS, OST
Artifact: | There is no way to know whether the object was already destroyed or has just not been created yet.
Response: | Determine the state of a non-existent OST object (not yet created, or already destroyed), or serialize setattr vs. unlink cluster-wide.
Response measure: | Objects are not re-created.
Questions: | None.
Issues: | There is no clear understanding yet of how to achieve the goal. The key could be the MDS FID of the object: if it exists, the OST objects have not been created yet; otherwise they have already been destroyed. Serialize setattr vs. unlink on the MDS? (bzzz)
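The idea raised under Issues (use the existence of the MDS FID to tell "not yet created" from "already destroyed") can be sketched as follows. This is an illustration of that open idea only, not a settled design: the `Ost` class, the `md_fid_exists` callable, and the way the OST would learn the MD FID are all hypothetical, and the sketch ignores the race between the check and a concurrent unlink, which is what the alternative (serializing setattr vs. unlink) would address.

```python
import errno


class Ost:
    """Toy OST (hypothetical) that refuses to re-create destroyed objects."""

    def __init__(self, md_fid_exists):
        # md_fid_exists: hypothetical callable answering
        # "does this MD FID still exist on the MDS?"
        self.md_fid_exists = md_fid_exists
        self.objects = {}  # OST FID -> attribute dict

    def setattr(self, ost_fid, md_fid, attrs):
        if ost_fid not in self.objects:
            # Missing object: either never created or already destroyed.
            if not self.md_fid_exists(md_fid):
                raise OSError(errno.ENOENT, "object already destroyed")
            self.objects[ost_fid] = {}  # create on first setattr
        self.objects[ost_fid].update(attrs)

    def destroy(self, ost_fid):
        self.objects.pop(ost_fid, None)


mds_fids = {100}  # MD FIDs currently present on the toy MDS
ost = Ost(mds_fids.__contains__)

# First setattr: MD FID exists, so the object is created.
ost.setattr(1, 100, {"uid": 500})
created_on_first_setattr = 1 in ost.objects

# Unlink: the MD FID goes away and the OST object is destroyed.
mds_fids.discard(100)
ost.destroy(1)

# A late setattr arrives after the unlink and must not re-create the object.
try:
    ost.setattr(1, 100, {"uid": 501})
    late_rejected = False
except OSError as exc:
    late_rejected = exc.errno == errno.ENOENT
```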
cache
Scenario: | The FIDs for OST objects, if allocated on the MDS, can become available later than the client needs them.
Business Goals: | Caching MDS and disconnected operations should work
Relevant QA's: | Usability
details
Stimulus: | Although the creation of OST objects is postponed, their FIDs should be allocated during create and saved on the MDS in the LOV EA. Doing that allocation on the MDS can cause problems.
Stimulus source: | Caching MDS, disconnected operations
Environment: | Client
Artifact: | With a caching MDS or disconnected operations, the client does the MDS's job locally and can send requests to the MDS with a delay. Meanwhile, the client needs the FIDs of the OST objects without delay in order to work with the OSTs.
Response: | Allocate the OST FIDs on the client during create and pass them to the MDS along with the create request in the LOV EA data.
Response measure: | The client has a valid LOV EA with OST object FIDs right after the create operation.
Questions: | If clients create OST objects before notifying the MDS, there is no way for the MDS/OST to clean up orphan objects if the client crashes before sending the LOV EA to the MDS. Possibly the MDS would need to track the last-used objid for each sequence, and clients would need to flush files + LOV EAs to the MDS in objid order?
Issues: | None.
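The response described above (the client allocates OST FIDs itself and ships them in the create request) might look roughly like this. It is a minimal sketch with assumed names: `Fid`, `FidAllocator`, `client_create`, the request dictionary layout, and the sequence numbers are all hypothetical, and a FID is modeled simply as a (sequence, objid) pair.

```python
from collections import namedtuple

# Hypothetical FID model: a sequence plus an object id within that sequence.
Fid = namedtuple("Fid", "seq oid")


class FidAllocator:
    """Hands out FIDs from one sequence held by the client (hypothetical)."""

    def __init__(self, seq):
        self.seq = seq
        self.last_oid = 0  # last-used objid in this sequence

    def alloc(self):
        self.last_oid += 1
        return Fid(self.seq, self.last_oid)


def client_create(name, stripe_count, md_alloc, ost_allocs):
    """Build a create request locally, without contacting the MDS."""
    md_fid = md_alloc.alloc()
    # One OST FID per stripe, taken round-robin from per-OST sequences.
    lov_ea = [ost_allocs[i % len(ost_allocs)].alloc()
              for i in range(stripe_count)]
    return {"name": name, "md_fid": md_fid, "lov_ea": lov_ea}


md_alloc = FidAllocator(seq=0x100)
ost_allocs = [FidAllocator(seq=0x200), FidAllocator(seq=0x201)]

# The client has a usable LOV EA immediately after each local create.
pending = [
    client_create("a", 2, md_alloc, ost_allocs),
    client_create("b", 2, md_alloc, ost_allocs),
]

# Later (possibly after a disconnected period) the requests reach the MDS,
# which stores the LOV EA it was given instead of allocating FIDs itself.
mds_namespace = {}
for req in pending:
    mds_namespace[req["name"]] = (req["md_fid"], req["lov_ea"])
```

Replaying `pending` in order keeps objids monotonically increasing per sequence, which is the property the question above suggests the MDS might rely on for orphan cleanup.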
inodes_reservation
Scenario: | FIDs are allocated during create(), but inodes on the OST are created later, so there may be no free inodes left for FIDs that are already allocated.
Business Goals: | Every object FID should have an inode
Relevant QA's: | Usability
details
Stimulus: | The object FIDs are allocated and stored in the LOV EA on the MDS, but there are no free inodes on the OST at the moment of the write/setattr.
Stimulus source: | Applications
Environment: | OST
Artifact: | FID allocation happens earlier than inode allocation on the OST.
Response: | Reserve OST inodes for future FID allocations.
Response measure: | Any FID allocated for an object shall get an inode on the OST, or the allocation should fail.
Questions: | The DMU does not have a (practical) inode count limit like ldiskfs does, but will return ENOSPC when there is no free space left to create a new inode. This is equivalent to ENOSPC due to no free space for data, so maybe no reservation is needed in this case.
Issues: | None.
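The response measure (every allocated FID either gets an inode or the allocation fails) can be modeled with a simple reservation counter. This is a sketch under assumptions: `OstInodePool`, `reserve`, and `first_write` are hypothetical names, and the model applies to a backend with a fixed inode count rather than to the DMU case raised in the question above.

```python
import errno


class OstInodePool:
    """Toy OST inode accounting (hypothetical) with per-FID reservations."""

    def __init__(self, free_inodes):
        self.free = free_inodes  # inodes neither reserved nor in use
        self.reserved = set()    # FIDs allocated but not yet created
        self.objects = set()     # FIDs whose objects exist

    def reserve(self, fid):
        # Called at FID allocation time: fail early if no inode can back it.
        if self.free == 0:
            raise OSError(errno.ENOSPC, "no free inodes to back a new FID")
        self.free -= 1
        self.reserved.add(fid)

    def first_write(self, fid):
        # Create-on-write consumes the reservation made at allocation time.
        if fid in self.objects:
            return
        self.reserved.remove(fid)  # raises KeyError if never reserved
        self.objects.add(fid)


pool = OstInodePool(free_inodes=1)
pool.reserve("fid-1")

# The second allocation fails at create time, not later at write time.
try:
    pool.reserve("fid-2")
    second_errno = None
except OSError as exc:
    second_errno = exc.errno

# The reserved FID is guaranteed an inode when the first write arrives.
pool.first_write("fid-1")
```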
Implementation Constraints
1. Use existing API and protocols