Architecture - Caching OSS

Note: The content on this page reflects the state of design of a Lustre feature at a particular point in time and may contain outdated information.

Summary

Caching OSS introduces a cache on the OSS side.

Definitions

transno: transaction number, generated by the server.

committed transno: the greatest transno known to be committed. The current recovery model implies that all transactions with a transno less than the committed one are committed as well.
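
The definition above can be illustrated with a minimal sketch. The function name and the sample numbers are hypothetical and are not taken from the Lustre code; the only rule assumed is the one stated above, that every transno up to and including the committed transno is committed.

```python
def is_committed(transno: int, last_committed: int) -> bool:
    """Return True if the transaction is covered by the committed transno.

    Assumes the recovery model described above: everything with a
    transno up to and including last_committed is on stable storage.
    """
    return transno <= last_committed


# Hypothetical example: transnos of requests a client still retains.
pending = [5, 6, 7, 8]
last_committed = 6  # value reported by the server (made up for the example)

# Only requests beyond the committed transno must be kept for replay.
still_retained = [t for t in pending if not is_committed(t, last_committed)]
```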

Requirements

support a writeback cache on the OST
support cached reads on the OST
follow the same recovery model (atomic updates, executed-once semantics, replay of changes from clients)
no performance penalty in the majority of cases

Use Cases

ID	Quality Attribute	Summary
write	usability	we can't free pages as soon as the reply for OST_WRITE is received; we also shouldn't confuse the VM with written but non-freeable pages (unstable pages in Linux?)
full page truncate	usability	retain pages referenced by non-committed write requests
partial page truncate	usability	corrupted data in pages referenced by non-committed write requests
rewrite on client side	availability	what do we do if an application on the client side modifies data that was sent to the OSS but is not committed yet?
dependent read/write	availability	how does the server handle access to data that is not committed yet? commit-on-share?
enomem	usability	how can the client ask the server to flush its cache in order to free the client's memory?
order of flushing	availability	do all disk filesystems flush in strict order? (data=ordered in ldiskfs?)
non-blocking flushing	performance	flushing of data from a previous transaction shouldn't block newer transactions in most cases
grants	usability	how does the server account for cached data?
few reqs in flight	availability, performance	do we follow the current recovery model with a per-request entry in last_rcvd?
sync	performance	can we implement fine-grained sync?
lockless	availability	how do we deal with lockless IO?
som	availability	integration with Size-on-MDS?
vbr	availability	integration with version based recovery

Quality Attribute Scenarios

Full write

Scenario:		full page(s) write RPC
Business Goals:		allow recoverable writeback cache on OSS
Relevant QA's:		usability
details	Stimulus source:	application
	Stimulus:	write syscall or modification via mmap
	Environment:
	Artifact:	client's and server's cache
	Response:	the client drops the dirty bit on its pages and the OST marks its own pages dirty. The OST replies with a transno and the client puts the request onto the retained queue. The request pins the data pages involved
	Response measure:	In most cases the response time should be close to RTT*2 (OST_WRITE request, BULK GET, BULK PUT, OST_WRITE reply)
Questions:
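
A rough sketch of the client-side response described above: requests are retained together with the pages they pin, and pages become freeable only once the server's committed transno covers every request referencing them. All class and method names here are hypothetical. The sketch also assumes that requests are retained in transno order, which the text does not guarantee.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class WriteRequest:
    transno: int   # transno returned in the OST_WRITE reply
    pages: list    # identifiers of the data pages pinned by this request


class RetainedQueue:
    """Hypothetical client-side queue of non-committed OST_WRITE requests."""

    def __init__(self):
        self._queue = deque()
        self.pinned = {}  # page id -> number of retained requests pinning it

    def retain(self, req: WriteRequest) -> None:
        # Called when the reply arrives: keep the request for replay
        # and pin every page it refers to.
        self._queue.append(req)
        for page in req.pages:
            self.pinned[page] = self.pinned.get(page, 0) + 1

    def on_commit(self, last_committed: int) -> list:
        # Drop requests covered by the committed transno and return
        # the pages that are no longer pinned by any request.
        freed = []
        while self._queue and self._queue[0].transno <= last_committed:
            req = self._queue.popleft()
            for page in req.pages:
                self.pinned[page] -= 1
                if self.pinned[page] == 0:
                    del self.pinned[page]
                    freed.append(page)
        return freed
```

A page shared by two overlapping writes stays pinned until both requests are committed, which is why the pin count is kept per page rather than per request.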

Full page truncate

Scenario:		full page truncate
Business Goals:		allow recoverable truncate over data cached on the OST
Relevant QA's:		usability
details	Stimulus source:	application
	Stimulus:	truncate syscall or open(O_TRUNC)
	Environment:
	Artifact:	client's and server's cache
	Response:	pages are dropped from the server cache. The corresponding pages in the client's cache are made unavailable to applications, but they can be retained for OST_WRITE requests that are not committed yet
	Response measure:
Questions:		do we still need synchronous OST_PUNCH all the time?

Partial page truncate

Scenario:		partial page truncate
Business Goals:		allow recoverable truncate over data cached on the OST
Relevant QA's:		usability
details	Stimulus source:	application
	Stimulus:	truncate syscall or open(O_TRUNC)
	Environment:
	Artifact:	client's and server's cache
	Response:	make a copy to preserve pages referenced by non-committed writes ???
	Response measure:
Questions:

Rewrite on client side

Scenario:		rewrite on client side
Business Goals:		preserve correct data in case of partial replay
Relevant QA's:		availability
details	Stimulus source:	transient server failure
	Stimulus:	it happens
	Environment:
	Artifact:	client's cache and retain queue
	Response:	if a page is referenced by a non-committed request, allocate and use a new one ???
	Response measure:	given overlapping write1 and write2 requests from a client, where the client was able to replay only write1 and not write2, write1 must not carry data from write2.
Questions:		perhaps we should not follow this strict model and do something a bit simpler? There are some doubts that this can be implemented easily for the mmap case.
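
The copy-on-write response proposed above (and marked as uncertain) could look roughly like the following sketch. The `Page` and `ClientCache` names and the pin counter are hypothetical; the sketch covers only the regular write path, not mmap, which the questions above flag as the harder case.

```python
class Page:
    def __init__(self, data: bytes):
        self.data = bytearray(data)
        self.pins = 0  # number of non-committed requests referencing this page


class ClientCache:
    """Hypothetical client page cache with copy-on-write for pinned pages."""

    def __init__(self):
        self.pages = {}  # page index -> Page currently visible to applications

    def write(self, index: int, data: bytes) -> Page:
        page = self.pages.get(index)
        if page is None:
            page = Page(b"\0" * len(data))
            self.pages[index] = page
        elif page.pins > 0:
            # The old page is still referenced by a retained request:
            # leave it untouched and continue on a private copy.
            page = Page(bytes(page.data))
            self.pages[index] = page
        page.data[:len(data)] = data
        return page
```

With this scheme a replay of write1 uses the page it pinned, which still holds the data as it was when write1 was sent, even if the application has since overwritten the same range.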

Dependent read/write

Scenario:		dependent read/write
Business Goals:		No data to be corrupted in case of partial recovery
Relevant QA's:		availability
details	Stimulus source:	read/write overlapping non-committed read/write from different client
	Stimulus:	applications
	Environment:	applications using shared files
	Artifact:	server's and client's cache
	Response:	Depending on configuration, the server may either forcefully flush its own cache and the client's cache when a client accesses non-committed data, or allow the client to use the non-committed data.
	Response measure:	Overlapping accesses shouldn't observe different data after failover (even with partial recovery). Any overlapping access causes IO on the server, which degrades performance
Questions:
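
The two configurable behaviours in this scenario can be sketched as follows. This is an illustration only: the tracker class, the `commit_on_share` flag name, and the half-open extent representation are assumptions, and the actual flush is replaced by clearing the list of non-committed extents.

```python
def overlaps(a, b) -> bool:
    """True if half-open byte extents a=(start, end) and b=(start, end) overlap."""
    return a[0] < b[1] and b[0] < a[1]


class DependencyTracker:
    """Hypothetical server-side tracker of non-committed extents per client."""

    def __init__(self, commit_on_share: bool = True):
        self.commit_on_share = commit_on_share
        self.uncommitted = []  # list of (client, extent)

    def record_write(self, client: str, extent) -> None:
        self.uncommitted.append((client, extent))

    def access(self, client: str, extent) -> bool:
        """Return True if this access forced a flush of cached data."""
        shared = any(
            owner != client and overlaps(ext, extent)
            for owner, ext in self.uncommitted
        )
        if shared and self.commit_on_share:
            # Stand-in for flushing server and client caches to disk.
            self.uncommitted.clear()
            return True
        return False
```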

ENOMEM on client

Scenario:		ENOMEM on client
Business Goals:		Smooth IO on clients
Relevant QA's:		performance
details	Stimulus source:	The client can't find freeable memory, but some amount of memory is pinned by non-committed requests
	Stimulus:	applications
	Environment:	Client with less memory than the OST's cache
	Artifact:	OST's and client's cache
	Response:	The client should be able to detect such a situation and ask the OST to flush its cache
	Response measure:	IO shouldn't stall for long on the client
Questions:		How much of the OST's cache should be flushed? Can we do a partial flush to avoid situations where one client thrashes the servers?

Order of flushing

Scenario:		order of flushing
Business Goals:		Stable and simple recovery code
Relevant QA's:		availability
details	Stimulus source:	multiple OST_WRITE requests
	Stimulus:	clients flushing their caches
	Environment:	OST
	Artifact:	order of flushing
	Response:	a write request with transno X is expected to be committed if a write request with transno Y is committed and X < Y
	Response measure:	no data loss after full successful failover. IOW, no lost transactions
Questions:
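
A minimal sketch of flushing in transno order, under the assumption that the server can pick which cached write goes to disk next. The `OrderedFlusher` class is hypothetical and the disk write itself is omitted; the point is only that the committed transno advances monotonically, so no lower transno is left behind.

```python
import heapq


class OrderedFlusher:
    """Hypothetical server-side cache that commits writes in transno order."""

    def __init__(self):
        self._dirty = []          # min-heap of (transno, payload)
        self.last_committed = 0

    def add(self, transno: int, payload: bytes) -> None:
        heapq.heappush(self._dirty, (transno, payload))

    def flush(self, max_requests: int) -> list:
        """Flush up to max_requests cached writes, lowest transno first."""
        flushed = []
        while self._dirty and len(flushed) < max_requests:
            transno, _payload = heapq.heappop(self._dirty)
            # A real server would write the payload to disk here.
            self.last_committed = transno
            flushed.append(transno)
        return flushed
```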

Non-blocking flushing

Scenario:		non-blocking flushing
Business Goals:		Good performance; storage shouldn't sit idle while clients are submitting data to the server
Relevant QA's:		performance
details	Stimulus source:	incoming writes cause an existing disk transaction to be flushed
	Stimulus:	closed disk transaction
	Environment:	OST under load
	Artifact:	flushing code of disk filesystem
	Response:	the flushing code of the disk filesystem should flush old transaction(s) without blocking ongoing activity much
	Response measure:	there should be no gap in storage activity while the OST is fed with enough data from clients
Questions:

Grants

Scenario:		Grants
Business Goals:		Grants should be working with caching OST
Relevant QA's:		usability
details	Stimulus source:	write request to OST
	Stimulus:	space accounting
	Environment:	OST
	Artifact:	grants accounting
	Response:	the disk filesystem or the grants code should track the amount of data cached and the amount of disk space that may be required to flush it
	Response measure:	No -ENOSPC is allowed for data the client has put in its own cache (this is an existing requirement for grants)
Questions:
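
One way to picture the accounting described in this scenario is the sketch below. It is an assumption-laden simplification: the class and its fields are hypothetical, and it ignores the metadata overhead that the real space estimate would have to include. Space is treated as used as soon as it is either granted to a client or held in the server cache.

```python
class GrantAccounting:
    """Hypothetical space accounting for a caching OST."""

    def __init__(self, free_bytes: int):
        self.free_bytes = free_bytes  # free space on disk
        self.granted = 0              # promised to clients, not yet received
        self.cached = 0               # in the server cache, not yet on disk

    def grant(self, nbytes: int) -> int:
        # Never promise space that cached or already granted data may need.
        available = self.free_bytes - self.granted - self.cached
        given = max(0, min(nbytes, available))
        self.granted += given
        return given

    def write_cached(self, nbytes: int) -> None:
        # Data arrives under an existing grant and moves into the server cache.
        if nbytes > self.granted:
            raise ValueError("write exceeds outstanding grant")
        self.granted -= nbytes
        self.cached += nbytes

    def flushed(self, nbytes: int) -> None:
        # Cached data reached disk and now consumes real space.
        self.cached -= nbytes
        self.free_bytes -= nbytes
```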

Few requests in flight

Scenario:		few requests in flight
Business Goals:		Allow recoverable and well performing writes
Relevant QA's:		availability, performance
details	Stimulus source:	applications
	Stimulus:	few write requests in flight
	Environment:
	Artifact:	last_rcvd file
	Response:	store the result of request handling until the reply is known to be received
	Response measure:	Results for all requests that were committed but not replied to should be found in last_rcvd upon recovery
Questions:		Do we really need to follow these strict rules for data? Can we treat some of the cases as IO errors to relax the rules and simplify the code? How do we know that a reply was received, so that we can free the corresponding slot in last_rcvd? ACKs seem to be too expensive.
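
The per-request bookkeeping discussed above can be pictured with the following sketch. It is not the last_rcvd on-disk format: the `ReplySlots` class and the (client, xid) key are hypothetical, and the sketch deliberately leaves open how the server learns that a reply was received, since that is the unresolved question.

```python
class ReplySlots:
    """Hypothetical in-memory model of per-request result slots."""

    def __init__(self):
        self._slots = {}  # (client, xid) -> stored result

    def store(self, client: str, xid: int, result: dict) -> None:
        # Keep the result so a resent request can be answered without
        # executing the operation a second time.
        self._slots[(client, xid)] = result

    def lookup(self, client: str, xid: int):
        # Returns the stored result, or None if the request is unknown.
        return self._slots.get((client, xid))

    def release(self, client: str, xid: int) -> None:
        # Called once the reply is known to be received, by whatever
        # mechanism is eventually chosen.
        self._slots.pop((client, xid), None)

    def in_use(self) -> int:
        return len(self._slots)
```

With several requests in flight per client, the number of slots in use grows with the number of unacknowledged replies, which is what makes the release mechanism a performance concern.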

Decomposition

Server side

transno generation for OST_WRITE
flush in order of transno

Client side

copy-on-write mechanism for regular writes
copy-on-write mechanism for mmap'ed pages
OST_WRITE replay and free upon commit

Implementation Details

Should we consider a different recovery model for data? For metadata we use a model where state is reproduced by replaying all requests, and each request carries all the data it requires (IOW, it doesn't reference pages, inodes, dentries, etc.). We can't follow exactly this model for data, as it would imply copying all data for each request. So we have to reference external data from retained requests, but external data is shared and can change by the time of replay. Would it make sense to use a "state" model for data, in which we reproduce only the current state rather than replaying all previous states?

Can we support asynchronous truncate? If so, how does the client know that a page was truncated, so that it doesn't try to read it from the server? How do we truncate upon lock cancel?

Should we take NRS into account? To what extent?
