Architecture - Version Based Recovery


Summary

Version Based Recovery (VBR) is a recovery mechanism that allows clients to recover in a less strict order, and even lets a client replay requests long after the main recovery has completed. Independent changes should be recoverable even if some clients are missing.

Definitions

transno
a transaction number, unique per disk filesystem throughout its whole life cycle
version
a unique version of an object; every change to the object changes its version
pre-version
the version of an object before a change
post-version
the version of an object after a change
transno-based recovery
the existing recovery scheme, where all dependencies are tracked using transnos and replay is done in transno order
version-based recovery
the new recovery scheme, where every object has a version and every change is applied only if its pre-version matches the current object version; thus dependencies are tracked per object
applicable request
a request whose pre-version matches the current version of the object(s) it touches (see the sketch after these definitions)
orphan
a file or directory opened by one or more clients and unlinked since then
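
A minimal sketch of the applicability test these definitions imply, in plain C; every name here (vbr_version_t, vbr_object, vbr_req, vbr_req_applicable) is hypothetical and is not the Lustre API:

  #include <stdbool.h>
  #include <stdint.h>

  /* In this design an object's version is simply the transno of the last
   * change applied to it. */
  typedef uint64_t vbr_version_t;

  struct vbr_object { vbr_version_t version; };    /* current version */
  struct vbr_req {
          vbr_version_t pre_version;   /* object version before the change */
          vbr_version_t post_version;  /* object version after the change */
  };

  /* A replayed request is applicable only if its recorded pre-version
   * matches the object's current version; otherwise an intervening change
   * is missing and the replay cannot be applied safely. */
  bool vbr_req_applicable(const struct vbr_req *req,
                          const struct vbr_object *obj)
  {
          return req->pre_version == obj->version;
  }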

Requirements

  1. better recoverability when a client is missing
  2. allow late clients to continue their work with no application-visible errors
  3. no performance penalty (in other words, no additional seeks to access or update versions)
  4. compatibility, or clearly stated incompatibility, signaled through a connect flag?

Use Cases

ID | Quality Attributes | Summary
regular request | availability, performance | how regular requests are handled with regard to VBR
regular replay | availability, usability | the client has reconnected in time and participates in regular recovery
missed regular replay | availability | during regular replay the server observes a "gap" in the replay sequence
missed reconnect | availability | in contrast with the old recovery, missing clients are kept around for a while
late reconnect | availability | the client has not connected in time, but its export is kept in the system
late replay | availability | the client has connected after regular recovery and has requests to replay
version mismatch | availability | during late replay the server observes a mismatch between a request's pre-version and the object's version
late lock replay | availability | during late replay the server may observe a conflicting lock granted after recovery
stale export auto cleanup | usability | the client has not connected within a given time after regular recovery
stale export manual cleanup | usability | the administrator wants to clean stale exports manually
permissions | security, performance | a lost replay may affect the accessibility of other objects
VBR disabled | security, performance | the ability to disable VBR
file orphan | availability | a late client might have open unlinked files
OST orphans | availability | how to keep OST objects created by a late client

Quality Attribute Scenarios

Regular request

Scenario: regular request
Business Goals: better recoverability (fewer evictions and lost changes) when a client is missing
Relevant QA's: availability
Stimulus source: application
Stimulus: a request modifying the file system
Environment: no recovery is in progress
Artifact: object(s) get a new version; the export updates its on-disk 'last-used' timestamp
Response: the new version is set to the request's transno; the reply is tagged with the previous version of the object and its new version (the transno), as in the sketch below
Response measure:
Questions:
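
A sketch of this response, building on the hypothetical types from the sketch in Definitions; the point is only that the new version is the transno and the reply carries both versions:

  /* Handle a regular modifying request: advance the object's version to
   * this transaction's transno and tag the reply with both the pre- and
   * post-versions so the client can replay the request later if needed. */
  struct vbr_reply { vbr_version_t pre_version, post_version; };

  void vbr_handle_modify(struct vbr_object *obj, uint64_t transno,
                         struct vbr_reply *reply)
  {
          reply->pre_version  = obj->version;   /* version before the change */
          obj->version        = transno;        /* new version == transno */
          reply->post_version = obj->version;   /* version after the change */
  }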

Regular replay

Scenario: regular replay
Business Goals: compatibility with old clients
Relevant QA's: availability, usability
Stimulus source: a client with a replied but non-committed request
Stimulus: the replied, non-committed request is resent by the client
Environment: transno-based recovery is in progress
Artifact: the object's version changes
Response: a version mismatch check is done if versions are supplied by the client; the new version is set from the request's post-version, or from the transno for old clients (see the sketch below)
Response measure:
Questions:
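
A sketch of that branch, again with the hypothetical types above; old clients predate versions, so their replays fall back to plain transno behavior:

  /* Replay during regular recovery. Returns 0 on success and -1 on a
   * version mismatch (hypothetical convention). */
  int vbr_handle_replay(struct vbr_object *obj, const struct vbr_req *req,
                        uint64_t transno, bool client_sent_versions)
  {
          if (!client_sent_versions) {
                  /* Old client: no versions in the request; behave like
                   * transno-based recovery. */
                  obj->version = transno;
                  return 0;
          }
          if (req->pre_version != obj->version)
                  return -1;                    /* version mismatch */
          obj->version = req->post_version;     /* apply recorded post-version */
          return 0;
  }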

Missed regular replay

Scenario: missed regular replay
Business Goals: keep the cluster recovering changes that are independent of the missing ones
Relevant QA's: availability
Stimulus source: some client has not reconnected in time and recovery has started without it
Stimulus: a request from the missing client, needed to continue transno-based recovery, never arrives
Environment: transno-based recovery is in progress
Artifact: the server switches to version-based recovery mode
Response: applicable requests are executed and replied to in order to obtain more replays from clients and continue recovery; old clients with non-empty replay queues are evicted (see the sketch below)
Response measure: in finite time, all reconnected clients should be able to continue work with no application-visible errors if the missed replay(s) do not affect them
Questions:
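
A sketch of the mode switch, with gap detection reduced to a transno comparison; everything here is hypothetical and builds on the earlier sketches:

  enum vbr_mode { VBR_TRANSNO_ORDER, VBR_VERSION_ORDER };

  /* Process one queued replay. 'next_transno' is the transaction the
   * transno-based phase is currently waiting for. */
  void vbr_process_replay(enum vbr_mode *mode, uint64_t next_transno,
                          uint64_t req_transno, struct vbr_object *obj,
                          const struct vbr_req *req)
  {
          if (*mode == VBR_TRANSNO_ORDER && req_transno > next_transno)
                  *mode = VBR_VERSION_ORDER;   /* gap found: a client is missing */

          if (*mode == VBR_TRANSNO_ORDER) {
                  obj->version = req->post_version;  /* in-order replay */
          } else if (vbr_req_applicable(req, obj)) {
                  obj->version = req->post_version;  /* per-object dependency ok */
          }
          /* Inapplicable requests stay queued; see "Version mismatch". */
  }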

Missed reconnect

Scenario: missed reconnect
Business Goals: allow recovery to start before all clients have reconnected
Relevant QA's: availability
Stimulus source: network or internal problems preventing a client from reconnecting in time
Stimulus: the timeout for client reconnection expires
Environment: failover has started
Artifact: exports are marked stale
Response: server proceeds with transno-based recovery
Response measure: connected clients should be able to start replay
Questions:

Late reconnect

Scenario: late reconnect
Business Goals: allow late clients to reconnect once the main recovery is done and give them a chance to continue their work unaffected
Relevant QA's: availability
Stimulus source: network or internal problems preventing a client from reconnecting in time
Stimulus: the client reconnects
Environment: transno-based recovery is finished
Artifact: the existing export loaded from persistent storage is used; the export protects orphans from removal
Response: the client is allowed to replay its changes and in-core state (open files, locks); the client is told its own last committed transno (not the global last_committed). Old clients are recognized by the server via the connect flags mechanism; the server evicts an old client immediately if it is late.
Response measure: no application-visible changes if the client is working on its own set of objects
Questions:

Late replay

Scenario: late replay
Business Goals: allow late clients to replay changes and in-core state, to give them a chance to continue their work unaffected
Relevant QA's: availability
Stimulus source: a late client
Stimulus: a replay request
Environment: transno-based recovery is already finished
Artifact: a new version on the modified object(s), if the request is applicable
Response: if the request is applicable, the server grabs all needed locks (to invalidate any cache clients might have acquired earlier) and executes the late request, setting the object's version to the request's post-version (see the sketch below)
Response measure: no application-visible changes if the client is working on its own set of objects
Questions:
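
A sketch of that path; vbr_lock/vbr_unlock are hypothetical stand-ins for whatever locking the server uses to invalidate client caches:

  /* Hypothetical stand-ins for the server's locking primitives. */
  static void vbr_lock(struct vbr_object *obj)   { (void)obj; /* take lock */ }
  static void vbr_unlock(struct vbr_object *obj) { (void)obj; /* drop lock */ }

  /* Execute one late replay after regular recovery has finished.
   * Returns 0 on success, -1 if the request is not applicable. */
  int vbr_late_replay(struct vbr_object *obj, const struct vbr_req *req)
  {
          int rc = -1;

          vbr_lock(obj);            /* invalidates caches built after recovery */
          if (vbr_req_applicable(req, obj)) {
                  obj->version = req->post_version;
                  rc = 0;
          }
          vbr_unlock(obj);
          return rc;                /* -1 leads to the "version mismatch" case */
  }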

Version mismatch

Scenario: version mismatch
Business Goals: prevent conflicting changes to file system
Relevant QA's: availability
Stimulus source: a late client
Stimulus: a late replay
Environment: transno-based recovery is already finished
Artifact: no changes to the underlying filesystem
Response: the server allows some time for other clients to reconnect and recover the object to the needed state; otherwise this late replay is discarded with a proper reply to the client, and the client drops the request from its replay queue. Note that if the server allows no time, replay requests are very likely to arrive out of order, which means replay failure. (See the sketch below.)
Response measure:
Questions: is it possible to propagate the error to the application?
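
One possible shape for the "allow some time" policy; the grace period and its value are invented for illustration and are not from the design:

  #include <time.h>

  #define VBR_MISMATCH_GRACE 60   /* seconds; hypothetical tunable */

  enum vbr_verdict { VBR_WAIT, VBR_DISCARD };

  /* Decide the fate of a mismatched late replay: keep it parked while
   * other clients may still bring the object to the needed version,
   * then discard it. */
  enum vbr_verdict vbr_mismatch_policy(time_t first_seen)
  {
          if (time(NULL) - first_seen < VBR_MISMATCH_GRACE)
                  return VBR_WAIT;     /* retry the applicability check later */
          return VBR_DISCARD;          /* reply with an error; client drops it */
  }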

Late lock replay

Scenario: late lock replay
Business Goals: allow clients to preserve the caches associated with their ldlm locks while preventing stale caches
Relevant QA's: availability, performance
Stimulus source: a late client
Stimulus: a lock replay
Environment: transno-based recovery is already finished; a conflicting lock is found or the object has changed since the lock was granted
Artifact: the client's lock is canceled
Response: the server finds a conflicting lock or a changed object and replies with an error; the client cancels its own lock and releases the local cache if needed (see the sketch below)
Response measure: no stale data and metadata should be exposed to application
Questions:
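
A sketch of that check; vbr_lock_conflicts is a hypothetical predicate standing in for a scan of the granted lock queues, and the version comparison catches objects changed since grant time:

  struct vbr_lock_replay {
          struct vbr_object *obj;
          vbr_version_t      granted_version;  /* object version at grant time */
  };

  /* Hypothetical predicate: does another granted lock conflict with this
   * one? A real server would scan its granted lock queues here. */
  static bool vbr_lock_conflicts(const struct vbr_lock_replay *lr)
  {
          (void)lr;
          return false;
  }

  /* Return 0 if the lock can be re-granted, -1 if the client must cancel
   * it (and drop the cache it protects). */
  int vbr_late_lock_replay(const struct vbr_lock_replay *lr)
  {
          if (vbr_lock_conflicts(lr))
                  return -1;                    /* conflicting lock granted */
          if (lr->obj->version != lr->granted_version)
                  return -1;                    /* object changed since grant */
          return 0;
  }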

Stale export auto cleanup

Scenario: stale export auto cleanup
Business Goals: a mechanism to clean up stale exports after a tunable expiration time
Relevant QA's: usability
Stimulus source: the stale export monitor
Stimulus: the given time has expired since the 'last used' timestamp stored in the on-disk export structure
Environment: virtually any environment
Artifact: the stale export is removed
Response: all related resources are freed: the record in last_rcvd is marked free, orphans are removed, and their OST objects are scheduled for removal (see the sketch below)
Response measure: the client can no longer connect to this export
Questions:
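
A sketch of the monitor's expiry test; the export layout and the expiry parameter shown here are made up:

  #include <stdbool.h>
  #include <time.h>

  struct vbr_export {
          time_t last_used;    /* on-disk 'last used' timestamp */
          bool   stale;        /* marked stale at missed reconnect */
  };

  /* True when a stale export has been idle longer than the tunable expiry
   * and should have its last_rcvd slot freed and its orphans cleaned up. */
  bool vbr_export_expired(const struct vbr_export *exp, time_t expiry_secs)
  {
          return exp->stale && (time(NULL) - exp->last_used) > expiry_secs;
  }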

Stale export manual cleanup

Scenario: stale export manual cleanup
Business Goals: a mechanism allowing the administrator to clean up stale exports
Relevant QA's: usability
Stimulus source: the administrator
Stimulus: a control utility
Environment: virtually any environment
Artifact: the stale export is removed
Response: the administrator can list all stale exports with their last-used timestamps and request the removal of some of them. All related resources are freed: the record in last_rcvd is marked free, orphans are removed, and their OST objects are scheduled for removal.
Response measure: the client can no longer connect to this export
Questions:

Permissions

Scenario: permissions
Business Goals: allow the administrator to control the way the server handles requests that change permissions
Relevant QA's: security, performance
Stimulus source: the administrator
Stimulus: a utility
Environment: any
Artifact: a per-server variable
Response: depending on this variable, the server may turn requests that change permissions into synchronous operations
Response measure: with sync mode enabled, the performance of such requests drops significantly
Questions: should we consider more complex models of tracking real dependencies? Where should we document that async requests can badly affect security?

VBR disabled

Scenario: VBR disabled
Business Goals: an administrator who wants bullet-proof security can trade recovery for speed (no synchronous operations)
Relevant QA's: security, performance
Stimulus source: the administrator
Stimulus: a utility
Environment: recovery is in progress
Artifact: the version-based phase of recovery is disabled
Response: once a gap is met, transno-based recovery finishes; all late clients are evicted, and all clients with requests still to be replayed are evicted
Response measure:
Questions:

File orphan

Scenario: file orphan
Business Goals: allow late clients to keep working with their orphan files; do not lose them until the corresponding stale export is removed
Relevant QA's: usability
Stimulus source: application
Stimulus: an unlink request
Environment: the file is open on some client
Artifact: a link to the object from the special PENDING directory
Response: the link is created at unlink time; once some export is destroyed, the server schedules the orphan cleanup procedure. In that procedure the server scans the PENDING directory, finds all orphans that can no longer be referenced by the remaining exports, and removes them. An orphan is considered unreferenced if the last_used timestamps of all exports are newer than the orphan's ctime (the ctime is the time of the unlink, and no open can be made after the final unlink, so no export with last_used newer than the ctime could have it open). See the sketch below.
Response measure: a late client should be able to keep working with an orphan file with no application-visible errors
Questions:
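
A sketch of the unreferenced-orphan test described above, with the remaining exports reduced to a plain array of last_used timestamps; all names are hypothetical:

  #include <stdbool.h>
  #include <stddef.h>
  #include <time.h>

  /* An orphan is unreferenced once every export's last_used timestamp is
   * newer than the orphan's ctime: the ctime is the time of the final
   * unlink, no open can happen after it, so an export that last talked to
   * the server after that point cannot still hold the file open. */
  bool vbr_orphan_unreferenced(time_t orphan_ctime,
                               const time_t *export_last_used, size_t n)
  {
          for (size_t i = 0; i < n; i++)
                  if (export_last_used[i] <= orphan_ctime)
                          return false;   /* this export may hold it open */
          return true;                    /* safe to remove from PENDING */
  }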

OST orphans

Scenario: OST orphans
Business Goals: recover as much of a late client's data as possible
Relevant QA's: availability
Stimulus source:
Stimulus:
Environment:
Artifact:
Response:
Response measure:
Questions:

Implementation details

  1. 1.6/1.8 rely on the ability of the underlying disk filesystem to recreate an inode with a given ino (the wantedi patch in ldiskfs); the ino space is very limited and the disk filesystem reuses inos in an uncontrollable manner, so a late replay can find its ino already in use. Currently this is fatal for the server. We can either reject such a replay (and the efficiency of VBR for 1.6/1.8 suffers) or try to update all of the client's state (inode in icache, locks, etc.). FIDs (appearing in 2.0?) are not reused, so the problem disappears with them.