Note: The content on this page reflects the state of design of a Lustre feature at a particular point in time and may contain outdated information.
Summary
Version Based Recovery (VBR) is a recovery mechanism that allows clients to recover in a less strict order and even lets a client replay requests long after the main recovery has completed. Independent changes should be recovered even if some clients are missing.
Definitions
- transno: transaction number, unique per disk filesystem over its whole life cycle
- version: a unique version of an object; every change to the object changes its version
- pre-version: the version of an object before a change
- post-version: the version of an object after a change
- transno-based recovery: the existing recovery, where all dependencies are tracked using transno and replay is done in transno order
- version-based recovery: the new recovery, where every object has a version and every change is applied only if its pre-version matches the current object version; thus dependencies are tracked per object (see the sketch after this list)
- applicable request: a request whose pre-version matches the current version of the object(s)
- orphan: a file or directory opened by one or more clients and unlinked since then
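To make these definitions concrete, the fragment below sketches how pre-version, post-version and applicability relate. It is illustrative C only, not the actual Lustre data structures; vbr_object, vbr_request and vbr_applicable are made-up names.

    #include <stdint.h>
    #include <stdbool.h>

    /* Hypothetical, simplified view of the state VBR tracks; not the real
     * Lustre structures. */
    struct vbr_object {
        uint64_t version;      /* current object version, updated on every change */
    };

    struct vbr_request {
        uint64_t transno;      /* transaction number assigned by the server */
        uint64_t pre_version;  /* object version before the change */
        uint64_t post_version; /* object version after the change (== transno) */
    };

    /* A replay is "applicable" only if the object is still in the state the
     * original execution saw, i.e. its current version equals the request's
     * pre-version.  This per-object check replaces the strict transno
     * ordering of the old recovery. */
    static bool vbr_applicable(const struct vbr_object *obj,
                               const struct vbr_request *req)
    {
        return obj->version == req->pre_version;
    }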
Requirements
- better recoverability in case of a missed client
- allow late clients to continue their work with no application-visible errors
- no performance penalty (IOW, no additional seeks to access/update versions)
- compatibility, or clearly stated incompatibility through a connect flag?
Use Cases
ID | Quality Attribute | Summary
regular request | availability, performance | how regular requests are handled with regard to VBR
regular replay | availability, usability | client has reconnected in time and participates in regular recovery
missed regular replay | availability | during regular replay the server observes a "gap" in the replay sequence
missed reconnect | availability | in contrast with the old recovery, missing clients are kept around for a while
late reconnect | availability | client hasn't connected in time, but its export is kept in the system
late replay | availability | client has connected after regular recovery and has requests to replay
version mismatch | availability | during late replay the server observes a mismatch between the request's pre-version and the object's version
late lock replay | availability | during late replay the server may observe a conflicting lock granted after recovery
stale export auto cleanup | usability | client hasn't connected within a given time after regular recovery
stale export manual cleanup | usability | administrator wants to clean stale exports manually
permissions | security, performance | a lost replay may affect accessibility of other objects
VBR disabled | security, performance | ability to disable VBR
file orphan | availability | a late client might have open unlinked files
OST orphans | availability | how to keep OST objects created by a late client
Quality Attribute Scenarios
Regular request
Scenario: regular request
Business Goals: better recoverability (fewer evictions and lost changes) in case of a missed client
Relevant QA's: availability
Details:
Stimulus source: application
Stimulus: a request modifying the file system
Environment: no recovery is in progress
Artifact: object(s) get a new version; the export updates its on-disk 'last-used' timestamp
Response: the new version is set to the request's transno; the reply is tagged with the previous version of the object and the new version of the object (the transno), as sketched below
Response measure:
Questions:
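A minimal sketch of this response path, building on the hypothetical vbr_object/vbr_request structures from the Definitions section; vbr_reply and vbr_execute_regular are likewise made-up names, not the real server code:

    /* Hypothetical reply fields carrying version information back to the
     * client. */
    struct vbr_reply {
        uint64_t pre_version;  /* object version before this change */
        uint64_t post_version; /* object version after this change (== transno) */
    };

    /* Regular (non-recovery) request: remember the old version, bump the
     * object's version to the request's transno and tag the reply with both,
     * so the client can resend them as pre-/post-version during replay. */
    static void vbr_execute_regular(struct vbr_object *obj,
                                    struct vbr_request *req,
                                    struct vbr_reply *rep)
    {
        rep->pre_version  = obj->version;
        obj->version      = req->transno;  /* new version == transno */
        rep->post_version = obj->version;
        req->pre_version  = rep->pre_version;
        req->post_version = rep->post_version;
    }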
Regular replay
Scenario: regular replay
Business Goals: compatibility with old clients
Relevant QA's: availability, usability
Details:
Stimulus source: client with a replied but non-committed request
Stimulus: replied non-committed request sent by the client
Environment: transno-based recovery is in progress
Artifact: the object's version changes
Response: a version mismatch check is done if versions are supplied by the client; the new version is set from the request's post-version, or from the transno for old clients (see the sketch below)
Response measure:
Questions:
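A sketch of the mismatch check and the old-client fallback described above, again assuming the hypothetical structures from the Definitions section (vbr_replay_regular is an illustrative name):

    /* Regular replay: if the client supplied versions (a VBR-aware client),
     * check for a mismatch before re-applying; old clients supply none, so
     * the object version simply becomes the replayed request's transno. */
    static int vbr_replay_regular(struct vbr_object *obj,
                                  const struct vbr_request *req,
                                  bool client_supplied_versions)
    {
        if (client_supplied_versions) {
            if (obj->version != req->pre_version)
                return -1;                  /* version mismatch */
            obj->version = req->post_version;
        } else {
            obj->version = req->transno;    /* old-client fallback */
        }
        return 0;                           /* replay applied */
    }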
Missed regular replay
Scenario: missed regular replay
Business Goals: keep the cluster recovering changes independently of the missing ones
Relevant QA's: availability
Details:
Stimulus source: some client hasn't reconnected in time and recovery started without it
Stimulus: a request from the missing client, needed to continue transno-based recovery, is absent
Environment: transno-based recovery is in progress
Artifact: the server switches to version-based recovery mode
Response: applicable requests are executed and replied to in order to get more replays from clients and continue recovery; old clients with non-empty replay queues are evicted (gap handling is sketched below)
Response measure: in finite time all reconnected clients should be able to continue work with no application-visible errors, provided the missed replay(s) don't affect them
Questions:
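The gap detection that triggers the switch could look roughly like the sketch below; recovery_mode and check_replay_gap are illustrative names, not the actual server implementation:

    /* During transno-based replay the server expects a contiguous transno
     * sequence.  When the next queued replay does not match the expected
     * transno (a "gap" left by a missing client), the server switches to
     * version-based mode and continues with whatever replays are applicable
     * instead of waiting forever. */
    enum recovery_mode { RECOVERY_TRANSNO, RECOVERY_VERSION };

    static enum recovery_mode check_replay_gap(uint64_t expected_transno,
                                               uint64_t next_queued_transno)
    {
        if (next_queued_transno == expected_transno)
            return RECOVERY_TRANSNO;   /* sequence intact, keep transno order */
        return RECOVERY_VERSION;       /* gap found: fall back to VBR */
    }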
Missed reconnect
Scenario: missed reconnect
Business Goals: allow recovery to start before all clients have reconnected
Relevant QA's: availability
Details:
Stimulus source: network or internal problems preventing a client from reconnecting in time
Stimulus: timeout for client reconnection
Environment: failover has started
Artifact: exports are marked stale
Response: the server proceeds with transno-based recovery
Response measure: connected clients should be able to start replay
Questions:
Late reconnect
Scenario: late reconnect
Business Goals: allow late clients to reconnect once the main recovery is done and give them a chance to continue their work unaffected
Relevant QA's: availability
Details:
Stimulus source: network or internal problems preventing the client from reconnecting in time
Stimulus: client
Environment: transno-based recovery is finished
Artifact: the existing export loaded from persistent storage is used; the export protects orphans from removal
Response: the client is allowed to replay its changes and in-core state (open files, locks); the client is told its own last committed transno (not the global last_committed). Old clients are recognized by the server via the connect flags mechanism; the server evicts an old client immediately if it is late.
Response measure: no visible application changes if the client is working on its own set of objects
Questions:
Late replay
Scenario: late replay
Business Goals: allow late clients to replay changes and in-core state, giving them a chance to continue their work unaffected
Relevant QA's: availability
Details:
Stimulus source: late client
Stimulus: replay request
Environment: transno-based recovery is already finished
Artifact: a new version in the modified object(s), if the request is applicable
Response: if the request is applicable, the server grabs all needed locks (to invalidate any cache clients might have obtained before) and executes the late request, setting the object's version to the request's post-version (see the sketch below)
Response measure: no visible application changes if the client is working on its own set of objects
Questions:
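A rough sketch of this flow, assuming the hypothetical structures from the Definitions section; the lock callbacks stand in for whatever locking the server actually uses:

    /* Late replay after the main recovery window: take the locks covering the
     * object (so any cache other clients built in the meantime is
     * invalidated), apply the request only if it is still applicable, and
     * move the version forward to the request's post-version. */
    static int vbr_replay_late(struct vbr_object *obj,
                               const struct vbr_request *req,
                               void (*lock_object)(struct vbr_object *),
                               void (*unlock_object)(struct vbr_object *))
    {
        int rc = 0;

        lock_object(obj);
        if (obj->version == req->pre_version)
            obj->version = req->post_version;  /* applicable: apply the change */
        else
            rc = -1;                           /* version mismatch */
        unlock_object(obj);
        return rc;
    }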
Version mismatch
Scenario: version mismatch
Business Goals: prevent conflicting changes to the file system
Relevant QA's: availability
Details:
Stimulus source: late client
Stimulus: late replay
Environment: transno-based recovery is already finished
Artifact: no changes to the underlying filesystem
Response: the server allows some time for other clients to reconnect and recover the object to the needed state; otherwise the late replay is discarded with a proper reply to the client, and the client drops the request from its replay queue. Note that if the server allows no such time, replay requests are very likely to arrive out of order, meaning replay failure. (The grace-window decision is sketched below.)
Response measure:
Questions: is it possible to propagate the error to the application?
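A minimal sketch of the grace-window decision, assuming some tunable window; vbr_keep_waiting is an illustrative name:

    #include <time.h>

    /* Version mismatch on a late replay: rather than discarding immediately,
     * the server can hold the request for a grace window so other late
     * clients get a chance to reconnect and bring the object to the expected
     * version.  Only when the window expires is the replay rejected, which
     * tells the client to drop it from its replay queue. */
    static int vbr_keep_waiting(time_t first_seen, time_t now,
                                time_t grace_window)
    {
        if (now - first_seen < grace_window)
            return 1;   /* keep the replay queued, retry it later */
        return 0;       /* give up: reply with an error so the client drops it */
    }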
Late lock replay
Scenario: late lock replay
Business Goals: allow clients to keep the caches associated with their ldlm locks while preventing stale cache
Relevant QA's: availability, performance
Details:
Stimulus source: late client
Stimulus: lock replay
Environment: transno-based recovery is already finished; a conflicting lock is found or the object has changed since the lock was granted
Artifact: the client's lock is canceled
Response: the server finds a conflicting lock or a changed object and replies with an error; the client cancels its own lock and releases the local cache if needed (see the sketch below)
Response measure: no stale data or metadata should be exposed to the application
Questions:
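A sketch of the acceptance check for a replayed lock, with illustrative names and an assumed per-lock record of the object version at grant time:

    /* Late lock replay: the replayed lock can only be restored if no
     * conflicting lock was granted after recovery and the protected object
     * has not changed since the lock was originally granted; otherwise the
     * server replies with an error and the client cancels the lock and drops
     * its cached data. */
    static bool vbr_lock_replay_ok(bool conflicting_lock_found,
                                   uint64_t version_now,
                                   uint64_t version_at_grant)
    {
        if (conflicting_lock_found)
            return false;              /* refuse: conflicting lock exists */
        if (version_now != version_at_grant)
            return false;              /* refuse: object changed meanwhile */
        return true;                   /* safe to restore the lock */
    }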
Stale export auto cleanup
Scenario: stale export auto cleanup
Business Goals: a mechanism to clean up stale exports after a tunable expiration
Relevant QA's: usability
Details:
Stimulus source: stale export monitor
Stimulus: the given time has expired since the last-used timestamp stored in the on-disk export structure
Environment: virtually any environment
Artifact: the stale export is removed
Response: all related resources are freed: the record in last_rcvd is marked free, orphans are removed, and their OST objects are scheduled for removal (the expiration check is sketched below)
Response measure: a client can't connect to this export anymore
Questions:
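A minimal sketch of the expiration check, assuming the last-used timestamp is readable from the on-disk export record and the expiration is a tunable:

    #include <time.h>

    /* Stale export monitor: an export that missed recovery is kept, but only
     * until its on-disk last-used timestamp is older than a tunable
     * expiration.  After that the export and everything hanging off it
     * (last_rcvd slot, orphans, their OST objects) can be reclaimed. */
    static bool export_expired(time_t last_used, time_t now, time_t expire_after)
    {
        return now - last_used > expire_after;
    }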
Stale export manual cleanup
Scenario: stale export manual cleanup
Business Goals: a mechanism allowing the administrator to clean up stale exports
Relevant QA's: usability
Details:
Stimulus source: administrator
Stimulus: control utility
Environment: virtually any environment
Artifact: the stale export is removed
Response: the administrator can list all stale exports with their last-used timestamps and send a request to remove some of them. All related resources are freed: the record in last_rcvd is marked free, orphans are removed, and their OST objects are scheduled for removal
Response measure: a client can't connect to this export anymore
Questions:
Permissions
Scenario: permissions
Business Goals: allow the administrator to control how the server handles requests that change permissions
Relevant QA's: security, performance
Details:
Stimulus source: administrator
Stimulus: utility
Environment: any
Artifact: per-server variable
Response: depending on this variable, the server may turn requests changing permissions into synchronous operations (see the sketch below)
Response measure: with sync mode enabled, performance of such requests drops significantly
Questions: should we consider more complex models of tracking real dependencies? Where should we describe how async requests can affect security badly?
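A minimal sketch of such a per-server switch; the variable and function names are purely illustrative:

    /* Per-server tunable controlling how permission-changing requests are
     * handled: when strict mode is on, such requests are committed
     * synchronously, so a lost asynchronous replay can never silently widen
     * access. */
    static bool vbr_sync_permission_changes;   /* set by the administrator */

    static bool request_needs_sync_commit(bool changes_permissions)
    {
        return changes_permissions && vbr_sync_permission_changes;
    }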
VBR disabled
Scenario: VBR disabled
Business Goals: an administrator who wants bullet-proof security can trade recoverability for speed (no synchronous ops)
Relevant QA's: security, performance
Details:
Stimulus source: administrator
Stimulus: utility
Environment: recovery is in progress
Artifact: the version-based phase of recovery is disabled
Response: once a gap is met, transno-based recovery finishes, all late clients are evicted, and all clients with requests still to be replayed are evicted
Response measure:
Questions:
File orphan
Scenario: file orphan
Business Goals: allow late clients to keep working with their orphan files; don't lose them until the corresponding stale export is removed
Relevant QA's: usability
Details:
Stimulus source: application
Stimulus: unlink request
Environment: the file is open by a client
Artifact: a link to the object from the special PENDING directory
Response: the link is created at unlink time; once some export is destroyed, the server schedules the orphan cleanup procedure. In that procedure the server scans the PENDING directory, finds all orphans that cannot be referenced by the remaining exports, and removes them. An orphan is considered unreferenced if the last_used timestamps of all exports are newer than the orphan's ctime (since ctime is the time of the unlink and no open can be made after the final unlink, no export with last_used newer than ctime could have it open). The removal rule is sketched below.
Response measure: a late client should be able to keep working with an orphan file with no visible application errors
Questions:
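A sketch of the removal rule, assuming the last_used timestamps of the remaining exports are available as an array; orphan_can_be_removed is an illustrative name:

    #include <stdbool.h>
    #include <stddef.h>
    #include <time.h>

    /* Orphan cleanup rule from the scenario above: the orphan's ctime is the
     * time of the final unlink, and no new open can happen after that, so
     * only an export whose last-used timestamp is not newer than the
     * orphan's ctime can still hold the file open.  If every remaining
     * export's last_used is newer, nothing can reference the orphan and it
     * is safe to remove. */
    static bool orphan_can_be_removed(time_t orphan_ctime,
                                      const time_t *export_last_used, size_t n)
    {
        size_t i;

        for (i = 0; i < n; i++) {
            if (export_last_used[i] <= orphan_ctime)
                return false;   /* this export may still have the file open */
        }
        return true;            /* no export can reference the orphan */
    }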
OST orphans
Scenario: OST orphans
Business Goals: recover the late client's data as much as possible
Relevant QA's: availability
Details:
Stimulus source:
Stimulus:
Environment:
Artifact:
Response:
Response measure:
Questions:
Implementation details
- 1.6/1.8 rely on the ability of the underlying disk filesystem to recreate an inode with a given ino (the wantedi patch in ldiskfs); the ino space is very limited and the disk filesystem reuses inos in an uncontrollable manner, so a late replay can find its ino already in use. Currently this is fatal for the server. We can either reject such a replay (and the efficiency of VBR for 1.6/1.8 suffers) or try to update all of the client's state (inode in icache, locks, etc.). FIDs (appearing in 2.0?) aren't reused, so the problem disappears with them.