Architecture - Version Based Recovery

Note: The content on this page reflects the state of design of a Lustre feature at a particular point in time and may contain outdated information.

Summary

Version Based Recovery (VBR) is a recovery mechanism that allows clients to recover in a less strict order, and even allows a client to replay requests long after the main recovery has completed. Independent changes should be recovered even if some clients are missing.

Definitions

transno: transaction number, unique per disk filesystem throughout its whole life cycle
version: a unique version of an object; every change to the object changes its version
pre-version: the version of an object before a change
post-version: the version of an object after a change
transno-based recovery: the existing recovery, where all dependencies are tracked using transno and replay is done in transno order
version-based recovery: the new recovery, where every object has a version and every change is applied only if its pre-version matches the current object version; thus dependencies are tracked per object
applicable request: a request whose pre-version(s) match the current version(s) of the object(s) it touches
orphan: a file or directory that is still open by one or more clients but has since been unlinked
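
The "applicable request" definition can be illustrated with a minimal sketch. The names ReplayRequest, is_applicable and apply_replay are hypothetical and do not come from the Lustre sources; the sketch also assumes that the post-version of a replayed change is the request's transno, as the regular request scenario describes.

```python
from dataclasses import dataclass, field


@dataclass
class ReplayRequest:
    transno: int
    # object id -> version the client saw before making the change
    pre_versions: dict = field(default_factory=dict)


def is_applicable(request, current_versions):
    """True if every touched object is still at the request's pre-version."""
    return all(
        current_versions.get(obj) == pre
        for obj, pre in request.pre_versions.items()
    )


def apply_replay(request, current_versions):
    """Apply the request only if it is applicable; return whether it was applied."""
    if not is_applicable(request, current_versions):
        return False
    for obj in request.pre_versions:
        # Assumption: the post-version equals the request's transno.
        current_versions[obj] = request.transno
    return True


versions = {"a": 10, "b": 12}
applied = apply_replay(ReplayRequest(15, {"a": 10}), versions)
rejected = apply_replay(ReplayRequest(16, {"a": 10}), versions)
```

The second request is rejected because object "a" has already moved past version 10; object "b" is untouched, so changes to it remain replayable.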

Requirements

better recoverability in case of missed client
allow late clients to continue their work with no application visible errors
no performance penalty (IOW, no additional seeks to access/update versions)
compatibility or clearly stated incompatibility through connect flag?

Use Cases

ID	Quality Attribute	Summary
regular request	availability, performance	how regular requests are handled with regard to VBR
regular replay	availability, usability	client has reconnected in time and participates in regular recovery
missed regular replay	availability	during regular replay server observes "gap" in replay sequence
missed reconnect	availability	in contrast with old recovery we keep missing clients for a while
late reconnect	availability	client hasn't connected in time, but we keep its export in system
late replay	availability	client has connected after regular recovery and has requests to replay
version mismatch	availability	during late replay server observes a mismatch between request's pre-version and object's version
late lock replay	availability	during late replay server may observe conflicting lock granted after recovery
stale export auto cleanup	usability	client hasn't connected in given time after regular recovery
stale export manual cleanup	usability	administrator wants to clean stale exports manually
permissions	security, performance	lost replay may affect accessibility of other objects
VBR disabled	security, performance	ability to disable VBR
file orphan	availability	late client might have open unlinked files
OST orphans	availability	how to keep OST objects created by late client

Quality Attribute Scenarios

Regular request

Scenario:		regular request
Business Goals:		better recoverability (fewer evictions and fewer lost changes) when a client is missing
Relevant QA's:		availability
details	Stimulus source:	application
	Stimulus:	request modifying file system
	Environment:	no recovery is in progress
	Artifact:	object(s) get new version, export updates on-disk 'last-used' timestamp
	Response:	new version is set to request's transno, reply is tagged with previous version of object and new version of object (transno)
	Response measure:
Questions:
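
The response described above can be sketched as follows. This is only an illustration under stated assumptions: the Server class, its fields and the reply layout are hypothetical, and version 0 is used here to mean "never modified".

```python
import itertools


class Server:
    def __init__(self, last_transno=0):
        self._next = itertools.count(last_transno + 1)
        self.versions = {}    # object id -> current version
        self.last_used = {}   # client id -> 'last-used' timestamp of its export

    def modify(self, client, obj, now):
        """Handle a modifying request outside of recovery."""
        transno = next(self._next)
        pre_version = self.versions.get(obj, 0)  # assumption: 0 = never modified
        self.versions[obj] = transno             # new version is the transno
        self.last_used[client] = now             # export timestamp is refreshed
        # The reply carries both the previous and the new version.
        return {
            "transno": transno,
            "pre_version": pre_version,
            "post_version": transno,
        }


srv = Server(last_transno=100)
first = srv.modify("client-a", "file1", now=1000)
second = srv.modify("client-a", "file1", now=1005)
```

The pre-version in the second reply equals the post-version of the first, which is what lets a later replay be checked against the object's current version.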

Regular replay

Scenario:		regular replay
Business Goals:		compatibility with old clients
Relevant QA's:		availability, usability
details	Stimulus source:	client with replied non-committed request
	Stimulus:	replied non-committed request sent by client
	Environment:	transno-based recovery in progress
	Artifact:	object's version changes
	Response:	a version-mismatch check is done if versions are supplied by the client; the new version is set from the request's post-version, or from its transno for old clients
	Response measure:
Questions:

Missed regular replay

Scenario:		missed regular replay
Business Goals:		keep cluster recovering changes independent of missing one
Relevant QA's:		availability
details	Stimulus source:	some client hasn't reconnected in time and recovery started without it
	Stimulus:	missing request from missing client needed to continue transno-based recovery
	Environment:	transno-based recovery is in progress
	Artifact:	server switches to version-based recovery mode
	Response:	applicable requests are executed and replied to, in order to get more replays from clients and continue recovery; old clients with non-empty replay queues are evicted
	Response measure:	in finite time all reconnected clients should be able to continue work with no application visible errors if missed replay(s) don't affect them
Questions:
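
A minimal sketch of the switch from transno-based to version-based recovery is given below. It assumes, for illustration only, that replays are (transno, pre-versions) pairs, that a gap is any transno other than the next expected one, and that version 0 means "never modified"; the function name recover is hypothetical.

```python
def recover(requests, versions, next_transno):
    """Replay requests in transno order; switch to version checks on a gap.

    requests: list of (transno, {object id: pre-version})
    Returns (final mode, applied transnos, skipped transnos).
    """
    mode = "transno"
    applied, skipped = [], []
    for transno, pre_versions in sorted(requests, key=lambda r: r[0]):
        if mode == "transno" and transno != next_transno:
            mode = "version"  # a replay from a missing client is absent
        if mode == "version":
            mismatch = any(
                versions.get(obj, 0) != pre for obj, pre in pre_versions.items()
            )
            if mismatch:
                skipped.append(transno)  # depends on a change that was lost
                continue
        for obj in pre_versions:
            versions[obj] = transno
        applied.append(transno)
        next_transno = transno + 1
    return mode, applied, skipped


# Transno 7 (which changed "b" from version 5 to 7) belongs to a missing client.
versions = {"a": 4, "b": 5}
replays = [(6, {"a": 4}), (8, {"a": 6}), (9, {"b": 7})]
mode, applied, skipped = recover(replays, versions, next_transno=6)
```

Request 8 only touches "a", so it is independent of the missing change and is still recovered; request 9 depends on the lost change to "b" and is not applied.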

Missed reconnect

Scenario:		missed reconnect
Business Goals:		allow recovery to start before all clients have reconnected
Relevant QA's:		availability
details	Stimulus source:	network or internal problems preventing client from reconnecting in time
	Stimulus:	timeout for clients reconnecting
	Environment:	failover has started
	Artifact:	exports are marked stale
	Response:	server proceeds with transno-based recovery
	Response measure:	connected clients should be able to start replay
Questions:

Late reconnect

Scenario:		late reconnect
Business Goals:		allow late clients to reconnect once main recovery is done and give them a chance to continue their work unaffected
Relevant QA's:		availability
details	Stimulus source:	network or internal problems preventing client from reconnecting in time
	Stimulus:	client
	Environment:	transno-based recovery is finished
	Artifact:	existing export loaded from persistent storage is used, export protects orphan from removal
	Response:	the client is allowed to replay its changes and in-core state (open files, locks); the client is told its own last committed transno (not the global last_committed). Old clients are recognized by the server via the connect flags mechanism. The server evicts an old client immediately if it is late.
	Response measure:	no visible application changes if client is working on own set of objects
Questions:
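
The connect-time decision for a late client can be sketched as below. The function name, the dictionary layout of the export and the supports_vbr flag (standing in for the connect flag) are all hypothetical.

```python
def handle_late_connect(export, supports_vbr):
    """Decide how to answer a client that connects after main recovery.

    export: state loaded from persistent storage for this client,
            including its own 'last_committed' transno.
    supports_vbr: stands in for the connect flag an old client would lack.
    """
    if not supports_vbr:
        # Old clients cannot do version-based replay, so they are evicted.
        return {"status": "evicted"}
    # The client learns its own last committed transno, not the global one,
    # so it keeps every request the server has not committed for it.
    return {"status": "connected", "last_committed": export["last_committed"]}


old_client = handle_late_connect({"last_committed": 40}, supports_vbr=False)
new_client = handle_late_connect({"last_committed": 40}, supports_vbr=True)
```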

Late replay

Scenario:		late replay
Business Goals:		allow late clients to replay changes and in-core state, giving them a chance to continue their work unaffected
Relevant QA's:		availability
details	Stimulus source:	late client
	Stimulus:	replay request
	Environment:	transno-based recovery is already finished
	Artifact:	new version in modified object(s), if request is applicable
	Response:	if the request is applicable, the server grabs all needed locks (to invalidate any cache that clients might have obtained earlier) and executes the late request, setting the object's version to the request's post-version
	Response measure:	no visible application changes if client is working on own set of objects
Questions:

Version mismatch

Scenario:		version mismatch
Business Goals:		prevent conflicting changes to file system
Relevant QA's:		availability
details	Stimulus source:	late client
	Stimulus:	late replay
	Environment:	transno-based recovery is already finished
	Artifact:	no changes to underlying filesystem
	Response:	the server allows some time for other clients to reconnect and recover the object to the needed state; otherwise this late replay is discarded with a proper reply to the client, and the client drops the request from its replay queue. Note that if the server allows no time, replay requests will very likely arrive out of order, causing replay failure.
	Response measure:
Questions:		is it possible to propagate error to application?
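
Why the server should wait before discarding a mismatching replay can be shown with a small sketch. It assumes pending late replays are (transno, pre-versions) pairs that may arrive in any order; the function name drain_late_replays is hypothetical, and the grace period itself is not modelled.

```python
def drain_late_replays(pending, versions):
    """Apply queued late replays as soon as they become applicable.

    pending: list of (transno, {object id: pre-version}), in arrival order.
    Returns the transnos that are still not applicable; those would be
    discarded once the grace period expires.
    """
    progress = True
    while progress:
        progress = False
        for request in list(pending):
            transno, pre_versions = request
            if all(versions.get(o, 0) == v for o, v in pre_versions.items()):
                for obj in pre_versions:
                    versions[obj] = transno
                pending.remove(request)
                progress = True
    return [transno for transno, _ in pending]


versions = {"f": 10}
# Request 12 arrives before request 11, on which it depends.
pending = [(12, {"f": 11}), (11, {"f": 10}), (20, {"f": 15})]
waiting = drain_late_replays(pending, versions)
```

Discarding request 12 on first sight would lose a change that becomes applicable once request 11 is replayed; request 20 never becomes applicable and is the one that ends up discarded.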

Late lock replay

Scenario:		late lock replay
Business Goals:		allow clients to save their caches associated with ldlm locks and prevent stale cache
Relevant QA's:		availability, performance
details	Stimulus source:	late client
	Stimulus:	lock replay
	Environment:	transno-based recovery is already finished, conflicting lock is found or object has changed since granting
	Artifact:	client's lock is canceled
	Response:	the server finds a conflicting lock or a changed object and replies with an error; the client cancels its own lock and releases its local cache if needed
	Response measure:	no stale data and metadata should be exposed to application
Questions:

Stale export auto cleanup

Scenario:		stale export auto cleanup
Business Goals:		mechanism to clean up stale exports after a tunable expiration time
Relevant QA's:		usability
details	Stimulus source:	stale export monitor
	Stimulus:	the configured time has expired since the last-used timestamp stored in the on-disk export structure
	Environment:	virtually in any environment
	Artifact:	stale export is removed
	Response:	all related resources are freed: record in last_rcvd is marked free, orphans are removed, their OST objects are scheduled for removal
	Response measure:	client can't connect to this export anymore
Questions:
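
The expiration test performed by the stale export monitor can be sketched as follows. The data layout and the max_age tunable are assumptions made for the example.

```python
def expired_stale_exports(exports, now, max_age):
    """Return the names of stale exports whose last-used time is too old.

    exports: name -> {"stale": bool, "last_used": timestamp in seconds}
    max_age: hypothetical tunable, in seconds
    """
    return sorted(
        name
        for name, exp in exports.items()
        if exp["stale"] and now - exp["last_used"] > max_age
    )


exports = {
    "client-a": {"stale": True, "last_used": 1000},
    "client-b": {"stale": True, "last_used": 9000},
    "client-c": {"stale": False, "last_used": 500},
}
to_remove = expired_stale_exports(exports, now=10000, max_age=3600)
```

Only client-a is selected: client-b is stale but still within the expiration window, and client-c is not stale at all.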

Stale export manual cleanup

Scenario:		stale export manual cleanup
Business Goals:		mechanism to allow the administrator to clean up stale exports
Relevant QA's:		usability
details	Stimulus source:	administrator
	Stimulus:	control utility
	Environment:	virtually in any environment
	Artifact:	stale export is removed
	Response:	the administrator can list all stale exports with their last-used timestamps and send a request to remove some of them. All related resources are freed: the record in last_rcvd is marked free, orphans are removed, their OST objects are scheduled for removal
	Response measure:	client can't connect to this export anymore
Questions:

Permissions

Scenario:		permissions
Business Goals:		allow the administrator to control how the server handles requests that change permissions
Relevant QA's:		security, performance
details	Stimulus source:	administrator
	Stimulus:	utility
	Environment:	any
	Artifact:	per-server variable
	Response:	depending on this variable, the server may turn requests that change permissions into synchronous operations
	Response measure:	with sync mode enabled, performance of such requests drops significantly
Questions:		should we consider more complex models of tracking real dependencies? where should we describe how async requests can affect security badly?

VBR disabled

Scenario:		VBR disabled
Business Goals:		an administrator who wants bulletproof security can keep speed (no synchronous ops) at the cost of recovery
Relevant QA's:		security, performance
details	Stimulus source:	administrator
	Stimulus:	utility
	Environment:	recovery is in progress
	Artifact:	version-based phase of recovery disabled
	Response:	once a gap is encountered, transno-based recovery finishes, all late clients are evicted, and all clients with requests still to be replayed are evicted
	Response measure:
Questions:

File orphan

Scenario:		file orphan
Business Goals:		allow late clients to keep working with their orphan files; don't lose them until the corresponding stale export is removed
Relevant QA's:		usability
details	Stimulus source:	application
	Stimulus:	unlink request
	Environment:	file is open by client
	Artifact:	link to object from special PENDING directory
	Response:	the link is created at unlink time. Once an export is destroyed, the server schedules the orphan cleanup procedure, which scans the PENDING directory, finds all orphans that can no longer be referenced by the remaining exports, and removes them. An orphan is considered unreferenced if the last_used timestamps of all exports are newer than the orphan's ctime (ctime is the time of unlink and no open can be made after the final unlink, so no export with last_used newer than ctime could open it)
	Response measure:	late client should be able to keep working with orphan file with no visible application errors
Questions:
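
The ctime rule from the response above can be sketched as a simple filter. The function name and the data layout are hypothetical; the sketch applies the rule exactly as stated and treats the case with no remaining exports as "everything is removable".

```python
def removable_orphans(orphans, exports_last_used):
    """Select orphans in PENDING that no remaining export can reference.

    orphans: orphan name -> ctime (time of the final unlink)
    exports_last_used: last_used timestamps of all remaining exports
    """
    return sorted(
        name
        for name, ctime in orphans.items()
        # Unreferenced if every remaining export was last used after the unlink.
        if all(last_used > ctime for last_used in exports_last_used)
    )


orphans = {"orphan-1": 100, "orphan-2": 300}
remaining = [200, 400]
to_delete = removable_orphans(orphans, remaining)
```

Here orphan-2 is kept because one remaining export was last used at time 200, before the unlink at time 300, so under the stated rule it may belong to a late client that still has the file open.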

OST orphans

Scenario:		OST orphans
Business Goals:		recover late client's data as much as possible
Relevant QA's:		availability
details	Stimulus source:
	Stimulus:
	Environment:
	Artifact:
	Response:
	Response measure:
Questions:

Implementation details

1.6/1.8 rely on the ability of the underlying disk filesystem to recreate an inode with a given ino (wantedi patch in ldiskfs). The ino space is very limited and the disk filesystem reuses inode numbers in an uncontrollable manner, so a late replay can find its ino already in use. Currently this is fatal for the server. We can either reject such a replay (and the efficiency of VBR for 1.6/1.8 suffers) or try to update all of the client's state (inode in icache, locks, etc). FIDs (appear in 2.0?) aren't reused, so the problem disappears with them.
