WARNING: This is the _old_ Lustre wiki, and it is in the process of being retired. The information found here is all likely to be out of date. Please search the new wiki for more up to date information.

Architecture - CROW


Note: The content on this page reflects the state of design of a Lustre feature at a particular point in time and may contain outdated information.

Definitions

CROW (CReate On Write)
A technique that improves create performance by deferring the actual creation of OSS objects until the first modifying event (write or setattr) occurs.

CROW Architecture

MDS_LOV: Allocates object FIDs and stores them in the LOV EA without sending an object creation RPC to the OST.
obdfilter: Implements object creation during the first write/setattr request.
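
The division of work described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Lustre code: the class names `Mds` and `Ost`, the use of plain integers as FIDs, and the in-memory dictionaries are all hypothetical, and no RPCs are modeled. It shows that create touches only the LOV EA, and that an OST object first appears when it is written to.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Ost:
    """Toy OST: objects come into existence on the first write or setattr."""
    objects: dict = field(default_factory=dict)  # fid -> {"data", "attrs"}

    def _get_or_create(self, fid):
        # CROW: no create request was ever sent, so create lazily on first modify.
        return self.objects.setdefault(fid, {"data": bytearray(), "attrs": {}})

    def write(self, fid, offset, payload):
        data = self._get_or_create(fid)["data"]
        if len(data) < offset:
            data.extend(b"\0" * (offset - len(data)))  # zero-fill any hole
        data[offset:offset + len(payload)] = payload

    def setattr(self, fid, **attrs):
        self._get_or_create(fid)["attrs"].update(attrs)


@dataclass
class Mds:
    """Toy MDS: create() only records FIDs in the LOV EA, no OST round trip."""
    lov_ea: dict = field(default_factory=dict)  # name -> [(ost_index, fid)]
    _next_fid: count = field(default_factory=lambda: count(1))

    def create(self, name, stripe_count):
        stripes = [(idx, next(self._next_fid)) for idx in range(stripe_count)]
        self.lov_ea[name] = stripes
        return stripes


osts = [Ost(), Ost()]
mds = Mds()
stripes = mds.create("file", stripe_count=2)
created_before_write = sum(len(o.objects) for o in osts)

ost_index, fid = stripes[0]
osts[ost_index].write(fid, 0, b"hello")
created_after_write = sum(len(o.objects) for o in osts)
```

In this model a file that is created but never written costs no OST objects at all, which is the case CROW is meant to make cheap.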

Use Cases

Summary

id | quality attribute | summary
object_creation | performance, usability | Save only the object FIDs in the EA during the create operation on the MDS. Create the objects on the OSTs at the first write/setattr. Do not use the precreation mechanism.
wrong_creation | usability | Attempts to re-create just-destroyed objects are recognized, and the objects are not re-created.
objects | usability | The MD object can exist with stripe info but without the OSS object itself.
fid | usability | Uniform FIDs are used to identify objects. The MD FID and OSS FIDs cross-reference each other through EAs.
cache | availability, usability | The client allocates both MD and OSS FIDs before sending the create request to the MDS.
recovery | availability, usability | Recoverable through re-creation of lost objects and/or deletion of orphaned objects in case of single-point failures.
cmd | usability | CMD is supported.
tests | testability | Covered by sanity and recovery tests.
quota | security, usability | Compatible with quota.
scalability | availability, usability | More scalable than the precreation mechanism.
inodes_reservation | usability | The OST should have enough inodes for all allocated FIDs.

wrong_creation

Scenario: OST objects that have been unlinked are re-created by a late setattr.
Business Goals: Avoid re-creation of objects on OST
Relevant QA's: Usability & availability
Stimulus: OST objects are unlinked, then a late setattr or write arrives and creates the objects again.
Stimulus source: Object destroy and setattr arrive at the OST from different nodes, so a setattr can arrive late after a network or server failure and the subsequent recovery.
Environment: MDS, OST
Artifact: There is no way to tell whether an object was already destroyed or simply not created yet.
Response: Determine the state of non-existent OST objects (not yet created, or already destroyed), or serialize setattr against unlink cluster-wide.
Response measure: Objects are not re-created.
Questions: No.
Issues: There is no clear understanding yet of how to achieve the goal. The key could be the MDS FID of the object: if it exists, the OST objects have not been created yet; otherwise they have already been destroyed.

serialize setattr vs unlink on MDS? bzzz
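
The idea raised under Issues, using the MD FID as the key, could look roughly like the sketch below. It is illustrative only: the page itself says the approach is not settled, and how an OST would learn whether the MD object still exists is exactly the open question. Here that knowledge is assumed to be available as a plain set, `live_md_fids`; the function names and the integer FIDs are hypothetical as well.

```python
from enum import Enum


class ObjState(Enum):
    EXISTS = "exists"
    NOT_YET_CREATED = "not yet created"
    DESTROYED = "destroyed"


def classify(ost_objects, live_md_fids, ost_fid, md_fid):
    """Classify an OST object, using the parent MD FID as the key."""
    if ost_fid in ost_objects:
        return ObjState.EXISTS
    # The OST object is missing: the MD object decides which case this is.
    if md_fid in live_md_fids:
        return ObjState.NOT_YET_CREATED
    return ObjState.DESTROYED


def handle_setattr(ost_objects, live_md_fids, ost_fid, md_fid, attrs):
    """Apply setattr, creating the object only if it was never created."""
    state = classify(ost_objects, live_md_fids, ost_fid, md_fid)
    if state is ObjState.DESTROYED:
        return False  # late setattr after unlink: do not re-create
    ost_objects.setdefault(ost_fid, {}).update(attrs)
    return True


live_md_fids = {100}   # assumption: MD objects known to still exist
ost_objects = {}       # OST FID -> attributes

first = handle_setattr(ost_objects, live_md_fids, 7, 100, {"uid": 500})

# Unlink: the MD object and its OST object both go away.
live_md_fids.discard(100)
ost_objects.pop(7, None)

late = handle_setattr(ost_objects, live_md_fids, 7, 100, {"uid": 0})
```

The first setattr creates the object because its MD object is still alive; the late one is rejected, which matches the response measure that objects are not re-created.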

cache

Scenario: If the FIDs for OST objects are allocated on the MDS, they can become available later than the client needs them.
Business Goals: Caching MDS and disconnected operations should work
Relevant QA's: Usability
Stimulus: Although creation of the OST objects is postponed, their FIDs must be allocated during create and saved in the LOV EA on the MDS. Allocating them on the MDS can cause problems.
Stimulus source: Caching MDS, disconnected operations
Environment: Client
Artifact: With a caching MDS or disconnected operation, the client does the MDS's job locally and sends its requests to the MDS with a delay. Meanwhile, the client needs the OST object FIDs without delay in order to work with the OSTs.
Response: Allocate the OST FIDs on the client during create and pass them to the MDS along with the create request, in the LOV EA data.
Response measure: Client has valid LOV EA with OST object FIDs right after create operation.
Questions: If the clients are doing OST object creation before notifying the MDS then there is no way for the MDS/OST to clean up orphan objects if the client crashes before sending LOV EA to the MDS. Possibly the MDS would need to track the last-used objid for each sequence, and clients need to flush files+LOV_EAs to the MDS in objid order?
Issues: No.
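
A minimal sketch of the response above, together with the ordering idea floated under Questions, is given below. Everything in it is an assumption made for illustration: FIDs are reduced to `(sequence, objid)` pairs, the client is assumed to have been granted a sequence already, and `ClientFidAllocator`, `Mds.create` and the `last_used` map are hypothetical names, not Lustre interfaces.

```python
from dataclasses import dataclass, field


@dataclass
class ClientFidAllocator:
    """Hands out OST FIDs locally from a sequence granted to this client."""
    seq: int
    next_oid: int = 1

    def alloc(self, stripe_count):
        fids = [(self.seq, self.next_oid + i) for i in range(stripe_count)]
        self.next_oid += stripe_count
        return fids


@dataclass
class Mds:
    lov_ea: dict = field(default_factory=dict)     # name -> [(seq, oid)]
    last_used: dict = field(default_factory=dict)  # seq -> highest oid seen

    def create(self, name, fids):
        """Store the client-supplied FIDs; insist on objid order per sequence."""
        for seq, oid in fids:
            if oid <= self.last_used.get(seq, 0):
                raise ValueError(f"objid {oid} out of order for sequence {seq}")
            self.last_used[seq] = oid
        self.lov_ea[name] = list(fids)


alloc = ClientFidAllocator(seq=0x400)
mds = Mds()

fids_a = alloc.alloc(2)  # usable against the OSTs immediately
fids_b = alloc.alloc(2)

# The create requests reach the MDS later, flushed in objid order.
mds.create("a", fids_a)
mds.create("b", fids_b)
```

Because the MDS records the last-used objid per sequence, anything above that value after a client crash could be treated as a potential orphan. Whether that is sufficient for cleanup is the question the page leaves open.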

inodes_reservation

Scenario: FIDs are allocated during create(), but the inodes on the OST are created later, so there may be no free inodes left for FIDs that are already allocated.
Business Goals: Every allocated object FID should have an inode.
Relevant QA's: Usability
Stimulus: The object FIDs are allocated and stored in the LOV EA on the MDS, but there are no free inodes on the OST at the moment of write/setattr.
Stimulus source: Applications
Environment: OST
Artifact: FID allocation happens earlier than inode allocation on the OST.
Response: Reserve OST inodes for future FID allocations.
Response measure: Every allocated object FID either gets an inode on the OST, or the allocation fails.
Questions: The DMU does not have a (practical) inode count limit like ldiskfs does, but will return ENOSPC when there is no free space left to create a new inode. This is equivalent to ENOSPC due to no free space for data, so maybe no reservation is needed in this case.
Issues: No.
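
The reservation response can be sketched as simple accounting on the OST side. This is an illustrative model only: `OstInodePool` and its methods are hypothetical names, and the page does not specify how or when a reservation would be communicated. The point shown is that a FID allocation which cannot be backed by an inode fails up front with ENOSPC, rather than failing later at write/setattr.

```python
import errno
import os


class OstInodePool:
    """Toy inode accounting for one OST."""

    def __init__(self, free_inodes):
        self.free = free_inodes  # inodes not promised to anyone
        self.reserved = 0        # inodes promised to allocated FIDs

    def reserve(self, n):
        """Called at FID allocation time; fails instead of over-committing."""
        if n > self.free:
            raise OSError(errno.ENOSPC, os.strerror(errno.ENOSPC))
        self.free -= n
        self.reserved += n

    def consume(self):
        """Called at first write/setattr, when the object is really created."""
        if self.reserved == 0:
            raise RuntimeError("object creation without a reservation")
        self.reserved -= 1


pool = OstInodePool(free_inodes=2)
pool.reserve(2)  # two FIDs allocated, two inodes promised

try:
    pool.reserve(1)  # a third FID cannot be backed by an inode
    third_ok = True
    third_errno = None
except OSError as exc:
    third_ok = False
    third_errno = exc.errno

pool.consume()  # first write to one of the reserved objects
```

On a backend without a practical inode limit, as noted under Questions for the DMU, this accounting might be unnecessary.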

Implementation Constraints

1. Use existing APIs and protocols.