
Architecture - Caching OSS

From Obsolete Lustre Wiki
Revision as of 16:53, 18 January 2010 by Docadmin

Note: The content on this page reflects the state of design of a Lustre feature at a particular point in time and may contain outdated information.


Caching OSS introduces a cache on the OSS side.


transno
transaction number; the server generates them.
committed transno
the greatest transno known to be committed. The current recovery model implies that all transactions with a transno less than the committed one are committed as well.


  1. support writeback cache on OST
  2. support cached reads on OST
  3. follow the same recovery model (atomic updates, executed-once semantics, replay of changes from clients)
  4. no performance penalty in the majority of cases

Use Cases

| ID | Quality Attribute | Summary |
|---|---|---|
| write | usability | We can't free pages once the reply to OST_WRITE is received; we also shouldn't confuse the VM with written but non-freeable pages (unstable pages in Linux?) |
| full page truncate | usability | Retain pages referenced by non-committed write requests |
| partial page truncate | usability | Corrupted data in pages referenced by non-committed write requests |
| rewrite on client side | availability | What do we do if an application on the client modifies data that was sent to the OSS but is not committed yet? |
| dependent read/write | availability | How does the server handle access to data that is not committed yet? Commit-on-share? |
| enomem | usability | How can a client ask the server to flush its cache in order to free the client's memory? |
| order of flushing | availability | Do all disk filesystems flush in strict order? (data=ordered in ldiskfs?) |
| non-blocking flushing | performance | Flushing data from a previous transaction shouldn't block newer transactions in most cases |
| grants | usability | How does the server account for cached data? |
| few reqs in flight | availability, performance | Do we follow the current recovery model with a per-request entry in last_rcvd? |
| sync | performance | Can we implement fine-grained sync? |
| lockless | availability | How do we deal with lockless IO? |
| som | availability | Integration with Size-on-MDS? |
| vbr | availability | Integration with version-based recovery |

Quality Attribute Scenarios

Full write

Scenario: full page(s) write RPC
Business Goals: allow recoverable writeback cache on OSS
Relevant QA's: usability
Stimulus source: application
Stimulus: write syscall or modification via mmap
Artifact: client's and server's cache
Response: the client clears the dirty bit on its pages and the OST marks its own pages dirty. The OST replies with a transno and the client puts the request onto the retained queue. The request pins the data pages involved.
Response measure: in most cases the response time should be close to RTT*2 (OST_WRITE request, BULK GET, BULK PUT, OST_WRITE reply)
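
The retained queue described in this scenario can be sketched as follows. This is only an illustration under stated assumptions: the names `RetainedWrite`, `RetainedQueue`, `on_write_reply` and `on_commit` are hypothetical and do not come from the Lustre code, and pages are represented by plain identifiers.

```python
from dataclasses import dataclass, field


@dataclass
class RetainedWrite:
    transno: int   # transaction number the OST returned in its reply
    pages: tuple   # identifiers of the client pages pinned by this request


@dataclass
class RetainedQueue:
    """Hypothetical client-side queue of replied but not yet committed writes."""
    requests: list = field(default_factory=list)

    def on_write_reply(self, transno, pages):
        # The reply has arrived, but the data may live only in the OST cache,
        # so the request keeps its pages pinned in case it must be replayed.
        self.requests.append(RetainedWrite(transno, tuple(pages)))

    def on_commit(self, committed_transno):
        # Every request with transno <= committed_transno is on disk,
        # so it can be dropped together with its page pins.
        released = [r for r in self.requests if r.transno <= committed_transno]
        self.requests = [r for r in self.requests if r.transno > committed_transno]
        return released

    def pinned_pages(self):
        # Pages that must not be freed or reused yet.
        return {p for r in self.requests for p in r.pages}
```

A page that appears in several retained requests stays pinned until the last of them is committed.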

Full page truncate

Scenario: full page truncate
Business Goals: allow recoverable truncate over data cached on the OST
Relevant QA's: usability
Stimulus source: application
Stimulus: truncate syscall or open(O_TRUNC)
Artifact: client's and server's cache
Response: pages are dropped from the server cache. The corresponding pages in the client's cache are made unavailable to applications, but they can be retained for OST_WRITE requests that are not committed yet
Response measure:
Questions: do we still need synchronous OST_PUNCH all the time?

Partial page truncate

Scenario: partial page truncate
Business Goals: allow recoverable truncate over data cached on the OST
Relevant QA's: usability
Stimulus source: application
Stimulus: truncate syscall or open(O_TRUNC)
Artifact: client's and server's cache
Response: make a copy to preserve pages referenced by non-committed writes ???
Response measure:

Rewrite on client side

Scenario: rewrite on client side
Business Goals: preserve correct data in case of partial replay
Relevant QA's: availability
Stimulus source: transient server failure
Stimulus: it happens
Artifact: client's cache and retain queue
Response: if a page is referenced by a non-committed request, allocate and use a new one ???
Response measure: there were overlapping write1 and write2 requests from a client. The client was able to replay write1 only and not write2. write1 must not carry data from write2.
Questions: perhaps we had better not follow this strict model and do something a bit simpler? There are some doubts that this can be implemented easily for the mmap case.
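
A minimal sketch of the "allocate and use a new page" response, assuming full-page writes only. `ClientPage`, `ClientCache` and the `pins` counter are hypothetical names used for illustration; the mmap case, which the question above flags as hard, is not covered.

```python
class ClientPage:
    """Hypothetical client cache page; `pins` counts retained requests using it."""

    def __init__(self, data):
        self.data = bytearray(data)
        self.pins = 0


class ClientCache:
    def __init__(self):
        self.pages = {}  # page index -> ClientPage currently visible to applications

    def write(self, index, data):
        page = self.pages.get(index)
        if page is None or page.pins > 0:
            # No page yet, or the page is referenced by a non-committed request:
            # install a fresh page so the old contents stay intact for replay.
            page = ClientPage(data)
            self.pages[index] = page
        else:
            # Nobody else references the page, so it is safe to modify in place.
            page.data[:] = data
        return page

    def send_write(self, index):
        # Build a (simplified) write request: it references the current page
        # and pins it until the request is known to be committed.
        page = self.pages[index]
        page.pins += 1
        return page
```

With this scheme, a replay of write1 sends the page it pinned and therefore cannot carry data from a later write2 to the same offset.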

Dependent read/write

Scenario: dependent read/write
Business Goals: No data to be corrupted in case of partial recovery
Relevant QA's: availability
Stimulus source: read/write overlapping a non-committed read/write from a different client
Stimulus: applications
Environment: applications using shared files
Artifact: server's and client's cache
Response: depending on configuration, the server may either forcefully flush its own and the client's cache when a client accesses non-committed data, or allow the client to use the non-committed data.
Response measure: overlapping accesses shouldn't see different data after failover (even with partial recovery). Any overlapping access causes IO on the server, which degrades performance
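
The two configurations in this scenario (forced flush versus allowing access to non-committed data) could look roughly like this on the server. This is a simplified sketch: `OstObject`, its `commit_on_share` switch and the extent list are hypothetical, only uncommitted writes are tracked, and `flush` merely stands in for a forced commit of both caches.

```python
class OstObject:
    """Hypothetical per-object state on the OST."""

    def __init__(self, commit_on_share=True):
        self.commit_on_share = commit_on_share
        self.uncommitted = []  # (client, start, end) extents, end exclusive
        self.flushes = 0

    def _overlaps_other_client(self, client, start, end):
        # True if [start, end) intersects an uncommitted extent of another client.
        return any(c != client and s < end and start < e
                   for c, s, e in self.uncommitted)

    def flush(self):
        # Stand-in for forcing the cached data to disk.
        self.uncommitted.clear()
        self.flushes += 1

    def access(self, client, start, end, is_write):
        if self.commit_on_share and self._overlaps_other_client(client, start, end):
            # Dependent access: commit first so both clients see stable data.
            self.flush()
        if is_write:
            self.uncommitted.append((client, start, end))
```

The extra `flush` on every shared access is exactly the IO cost noted in the response measure.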

ENOMEM on client

Scenario: ENOMEM on client
Business Goals: Smooth IO on clients
Relevant QA's: performance
Stimulus source: client can't find freeable memory, but some amount of memory is pinned by non-committed requests
Stimulus: applications
Environment: client with less memory than the OST's cache
Artifact: OST's and client's cache
Response: the client should be able to detect such a situation and ask the OST to flush its cache
Response measure: IO shouldn't stall for long on the client
Questions: How much of the OST's cache should be flushed? Can we do a partial flush to avoid situations where one client thrashes the servers?

Order of flushing

Scenario: order of flushing
Business Goals: Stable and simple recovery code
Relevant QA's: availability
Stimulus source: multiple OST_WRITE requests
Stimulus: clients flushing their caches
Environment: OST
Artifact: order of flushing
Response: a write request with transno X is expected to be committed if a write request with transno Y is committed and X < Y
Response measure: no data loss after full successful failover. IOW, no lost transactions
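
The ordering rule can be illustrated with a small sketch of how a committed transno might be derived and used. It assumes, for illustration only, that transnos are consecutive integers; `committed_transno` and `needs_replay` are hypothetical helper names.

```python
def committed_transno(last_committed, flushed):
    """Advance the committed transno through consecutively flushed transnos.

    `flushed` is the set of transnos whose data has reached disk. The result
    is the greatest transno T such that every transno in
    (last_committed, T] is in `flushed`.
    """
    t = last_committed
    while t + 1 in flushed:
        t += 1
    return t


def needs_replay(retained, committed):
    # After a failover, only requests above the committed transno are replayed,
    # in transno order.
    return sorted(t for t in retained if t > committed)
```

If the disk filesystem flushed out of order (say transno 7 before 6), the committed transno could not advance past 5, which is why strict ordering keeps the recovery code simple.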

Non-blocking flushing

Scenario: non-blocking flushing
Business Goals: Good performance; storage shouldn't sit idle while clients are submitting data to the server
Relevant QA's: performance
Stimulus source: incoming writes cause the existing disk transaction to be flushed
Stimulus: closed disk transaction
Environment: OST under load
Artifact: flushing code of the disk filesystem
Response: the flushing code of the disk filesystem should flush old transaction(s) without blocking ongoing activity much
Response measure: there should be no gap in storage activity while the OST is fed with enough data from clients


Grants

Scenario: grants
Business Goals: Grants should work with a caching OST
Relevant QA's: usability
Stimulus source: write request to the OST
Stimulus: space accounting
Environment: OST
Artifact: grants accounting
Response: the disk filesystem or the grants code should track the amount of cached data and the amount of disk space that may be required to flush it
Response measure: no -ENOSPC is allowed for data the client has put in its own cache (this is an existing requirement for grants)
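
One way to picture the accounting asked for here is the sketch below. It is an assumption-laden illustration, not the existing grants code: `GrantAccounting` and its fields are hypothetical, and metadata or allocation overhead needed to flush cached data is ignored.

```python
import errno


class GrantAccounting:
    """Hypothetical space accounting for an OST with a writeback cache."""

    def __init__(self, free_bytes):
        self.free_bytes = free_bytes  # free space on disk
        self.granted = {}             # client -> granted bytes not yet used
        self.cached = 0               # bytes in the OST cache, not yet flushed

    def available(self):
        # Space that is neither promised to clients nor owed to cached data.
        return self.free_bytes - sum(self.granted.values()) - self.cached

    def grant(self, client, nbytes):
        given = max(0, min(nbytes, self.available()))
        self.granted[client] = self.granted.get(client, 0) + given
        return given

    def write(self, client, nbytes):
        # Data the client cached under a grant must never fail with -ENOSPC;
        # only writes beyond the grant are refused.
        if nbytes > self.granted.get(client, 0):
            raise OSError(errno.ENOSPC, "write exceeds grant")
        self.granted[client] -= nbytes
        self.cached += nbytes

    def flushed(self, nbytes):
        # Cached data reached disk: it now consumes real free space.
        self.cached -= nbytes
        self.free_bytes -= nbytes
```

Counting `cached` against available space keeps the server from granting the same disk blocks twice while data sits in its cache.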

Few requests in flight

Scenario: few requests in flight
Business Goals: Allow recoverable and well performing writes
Relevant QA's: availability, performance
Stimulus source: applications
Stimulus: few write requests in flight
Artifact: last_rcvd file
Response: store the result of request handling until the reply is known to be received
Response measure: results for all committed requests whose replies were not received should be found in last_rcvd upon recovery
Questions: Do we really need to follow these strict rules for data? Can we treat some of the cases as IO errors to relax the rules and simplify the code? How do we know that the reply has been received, so that we can free the corresponding slot in last_rcvd? ACKs seem to be too expensive.
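
To make the slot question concrete, here is a sketch of per-request result slots. The in-memory `ReplySlots` class is a hypothetical stand-in for last_rcvd, and the `reply_seen` hook assumes, purely for illustration, that the client reports received replies by piggybacking their xids on a later request instead of sending explicit ACKs; the design above does not settle this.

```python
class ReplySlots:
    """Hypothetical stand-in for per-request result slots in last_rcvd."""

    def __init__(self):
        self.slots = {}  # (client, xid) -> (transno, result)

    def store(self, client, xid, transno, result):
        # Record the outcome so a resent request can be answered after
        # recovery without executing it a second time.
        self.slots[(client, xid)] = (transno, result)

    def lookup(self, client, xid):
        # Returns the stored (transno, result) for a resent request, or None.
        return self.slots.get((client, xid))

    def reply_seen(self, client, seen_xids):
        # Assumed mechanism: the client lists xids whose replies it received,
        # which lets the server free those slots.
        for xid in seen_xids:
            self.slots.pop((client, xid), None)
```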

QAS template
Business Goals:
Relevant QA's:
Stimulus source:
Response measure:


Server side

  1. transno generation for OST_WRITE
  2. flush in order of transno

Client side

  1. copy-on-write mechanism for regular writes
  2. copy-on-write mechanism for mmap'ed pages
  3. OST_WRITE replay and free upon commit

Implementation Details

Should we consider a different recovery model for data? For metadata we use a model where state is reproduced by replaying all requests, and each request carries all the data it needs (IOW, it doesn't refer to pages, inodes, dentries, etc). We can't follow exactly this model for data, as it would imply copying all data for each request. So retained requests have to refer to external data, but that data is shared and can change by the time of replay. Would it make sense to use a "state" model for data, where we reproduce only the current state rather than replaying all previous states?

Can we support asynchronous truncate? If so, how does the client know that a page was truncated, so that it doesn't try to read it from the server? How do we truncate upon lock cancel?

Should we take NRS into account? To what extent?