Architecture - Caching OSS
Note: The content on this page reflects the state of design of a Lustre feature at a particular point in time and may contain outdated information.
Summary
Caching OSS introduces a data cache (writeback and read) on the OSS side.
Definitions
- transno: a transaction number generated by the server.
- committed transno: the greatest transno known to be committed. The current recovery model implies that all transactions with a transno lower than the committed one are committed as well.
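For illustration, a minimal sketch (hypothetical names, not actual Lustre symbols) of the check this model allows: a retained request is known to be durable once its transno does not exceed the committed transno, because everything below the committed transno is committed too.

```c
#include <stdbool.h>
#include <stdint.h>

/* hypothetical retained-request record on the client */
struct retained_req {
        uint64_t transno;       /* transno assigned by the server's reply */
};

/*
 * A retained request (and the pages it pins) may be released once its
 * transno is less than or equal to the committed transno reported by the
 * server: by the recovery model, all lower transnos are committed as well.
 */
static bool req_committed(const struct retained_req *req,
                          uint64_t committed_transno)
{
        return req->transno != 0 && req->transno <= committed_transno;
}
```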
Requirements
- support a writeback cache on the OST
- support cached reads on the OST
- follow the same recovery model (atomic updates, executed-once semantics, replay of changes from clients)
- no performance penalty in the majority of cases
Use Cases
| ID | Quality Attribute | Summary |
|---|---|---|
| write | usability | pages can no longer be freed as soon as the reply for OST_WRITE is received; we also should not confuse the VM with written but non-freeable pages (unstable pages in Linux?) |
| full page truncate | usability | retain pages referred to by a non-committed write request |
| partial page truncate | usability | avoid corrupting data in pages referred to by a non-committed write request |
| rewrite on client side | availability | what do we do if an application on the client side modifies data already sent to the OSS but not yet committed |
| dependent read/write | availability | how does the server handle access to not-yet-committed data; commit-on-share? |
| enomem | usability | how can a client ask the server to flush its cache in order to free the client's memory |
| order of flushing | availability | do all disk filesystems flush in strict order? (data=ordered in ldiskfs?) |
| non-blocking flushing | performance | flushing data from a previous transaction should not block newer transactions in most cases |
| grants | usability | how does the server account for cached data |
| few reqs in flight | availability, performance | do we follow the current recovery model with a per-request entry in last_rcvd? |
| sync | performance | can we implement fine-grained sync? |
| lockless | availability | how do we handle lockless IO? |
| som | availability | integration with Size-on-MDS? |
| vbr | availability | integration with version-based recovery? |
Quality Attribute Scenarios
Full write
| Scenario | full page(s) write RPC |
|---|---|
| Business Goals | allow a recoverable writeback cache on the OSS |
| Relevant QA's | usability |
| Stimulus source | application |
| Stimulus | write syscall or modification via mmap |
| Environment | |
| Artifact | client's and server's caches |
| Response | the client clears the dirty bit on its pages and the OST marks its own pages dirty; the OST replies with a transno and the client puts the request onto the retained queue, which pins the involved data pages (see the sketch below) |
| Response measure | in most cases the response time should be close to RTT*2 (OST_WRITE request, BULK GET, BULK PUT, OST_WRITE reply) |
| Questions | |
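The client-side response above could look roughly like the sketch below. All names are hypothetical, not actual Lustre symbols: on the OST_WRITE reply the request is kept on a retained queue with its pages pinned; when the server's committed transno advances, every retained request covered by it is released.

```c
#include <stddef.h>
#include <stdint.h>

struct page;    /* stands in for the client's cached data pages */

/* hypothetical client-side bookkeeping for one sent OST_WRITE */
struct retained_write {
        struct retained_write *rw_next;
        uint64_t rw_transno;    /* transno from the OST_WRITE reply */
        struct page **rw_pages; /* pages pinned until the write commits */
        int rw_npages;
};

/* on OST_WRITE reply: remember the transno and keep the request queued */
static void write_reply_received(struct retained_write **queue,
                                 struct retained_write *rw, uint64_t transno)
{
        rw->rw_transno = transno;
        rw->rw_next = *queue;
        *queue = rw;            /* request and its pages stay pinned */
}

/* on a new committed transno: drop every request that is now durable */
static void commit_advanced(struct retained_write **queue, uint64_t committed)
{
        struct retained_write **pp = queue;

        while (*pp != NULL) {
                if ((*pp)->rw_transno <= committed) {
                        struct retained_write *done = *pp;

                        *pp = done->rw_next;
                        /* unpin done->rw_pages and free the request here */
                } else {
                        pp = &(*pp)->rw_next;
                }
        }
}
```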
Full page truncate
| Scenario | full page truncate |
|---|---|
| Business Goals | allow recoverable truncate over data cached on the OST |
| Relevant QA's | usability |
| Stimulus source | application |
| Stimulus | truncate syscall or open(O_TRUNC) |
| Environment | |
| Artifact | client's and server's caches |
| Response | pages are dropped from the server cache; the corresponding pages in the client's cache are made unavailable to applications, but they can be retained for not-yet-committed OST_WRITE requests |
| Response measure | |
| Questions | do we still need a synchronous OST_PUNCH all the time? |
Partial page truncate
| Scenario | partial page truncate |
|---|---|
| Business Goals | allow recoverable truncate over data cached on the OST |
| Relevant QA's | usability |
| Stimulus source | application |
| Stimulus | truncate syscall or open(O_TRUNC) |
| Environment | |
| Artifact | client's and server's caches |
| Response | make a copy to preserve pages referred to by non-committed writes ??? |
| Response measure | |
| Questions | |
Rewrite on client side
| Scenario | rewrite on client side |
|---|---|
| Business Goals | preserve correct data in case of partial replay |
| Relevant QA's | availability |
| Stimulus source | transient server failure |
| Stimulus | it happens |
| Environment | |
| Artifact | client's cache and retain queue |
| Response | if a page is referred to by a non-committed request, allocate and use a new one ??? (see the sketch below) |
| Response measure | suppose there were overlapping write1 and write2 requests from a client, and the client was able to replay write1 only, not write2; write1 must not carry data from write2 |
| Questions | should we perhaps not follow this strict model and do something a bit simpler? there are some doubts this can be implemented easily for the mmap case |
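One possible shape of the copy-on-write response above, as a sketch under hypothetical names (cached_page and page_prepare_write are illustrative, not existing code): before an application modifies a page still referenced by a retained request, the client substitutes a private copy, so the retained request keeps the bytes it originally sent and a replay of write1 cannot carry data from write2.

```c
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096

/* hypothetical cached page with a count of retained requests referencing it */
struct cached_page {
        char cp_data[PAGE_SIZE];
        int cp_retained_refs;   /* non-committed OST_WRITEs still pointing here */
};

/*
 * Before an application rewrite: if a retained request still references the
 * page, hand the application a fresh copy and leave the original bytes with
 * the retained request, so a later replay does not carry the new data.
 */
static struct cached_page *page_prepare_write(struct cached_page *pg)
{
        struct cached_page *copy;

        if (pg->cp_retained_refs == 0)
                return pg;      /* nothing pinned, modify in place */

        copy = malloc(sizeof(*copy));
        if (copy == NULL)
                return NULL;    /* caller must handle allocation failure */

        memcpy(copy->cp_data, pg->cp_data, PAGE_SIZE);
        copy->cp_retained_refs = 0;
        return copy;            /* retained request keeps the old page */
}
```

Doing the same for pages written through mmap is the doubtful part noted above, since the application modifies the mapped page directly rather than calling into the filesystem first.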
Dependent read/write
| Scenario | dependent read/write |
|---|---|
| Business Goals | no data corruption in case of partial recovery |
| Relevant QA's | availability |
| Stimulus source | read/write overlapping a non-committed read/write from a different client |
| Stimulus | applications |
| Environment | applications using shared files |
| Artifact | server's and client's caches |
| Response | depending on configuration, the server may forcefully flush its own and the client's cache when a client accesses non-committed data, or allow the client to use the non-committed data (see the sketch below) |
| Response measure | overlapping accesses should not see different data after failover (even with partial recovery); any overlapping access causes IO on the server, which degrades performance |
| Questions | |
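A sketch of how the server-side decision could look, with hypothetical types and a policy flag standing in for the open configuration question: if another client touches a range still covered by uncommitted data, the server either forces that range to commit first (commit-on-share) or serves the uncommitted data directly.

```c
#include <stdbool.h>
#include <stdint.h>

/* hypothetical description of uncommitted cached data on the OST */
struct uncommitted_extent {
        struct uncommitted_extent *ue_next;
        uint64_t ue_start, ue_end;      /* byte range, inclusive */
        int ue_owner;                   /* client that wrote it */
};

/* does [start, end] overlap data another client wrote but has not committed? */
static bool range_shared_uncommitted(const struct uncommitted_extent *list,
                                     uint64_t start, uint64_t end, int client)
{
        for (; list != NULL; list = list->ue_next)
                if (list->ue_owner != client &&
                    start <= list->ue_end && end >= list->ue_start)
                        return true;
        return false;
}

/*
 * Commit-on-share style policy: when the configuration asks for strong
 * guarantees, force the overlapping range to disk before serving the access;
 * otherwise serve the uncommitted data directly (faster, weaker guarantee).
 */
static void handle_shared_access(const struct uncommitted_extent *list,
                                 uint64_t start, uint64_t end, int client,
                                 bool commit_on_share)
{
        if (commit_on_share &&
            range_shared_uncommitted(list, start, end, client)) {
                /* force the overlapping transactions to commit first */
        }
        /* then serve the read or write, possibly from uncommitted cache */
}
```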
ENOMEM on client
| Scenario | ENOMEM on client |
|---|---|
| Business Goals | smooth IO on clients |
| Relevant QA's | performance |
| Stimulus source | the client cannot find freeable memory, while a large amount of memory is pinned by non-committed requests |
| Stimulus | applications |
| Environment | a client with less memory than the OST's cache |
| Artifact | OST's and client's caches |
| Response | the client should be able to detect such a situation and ask the OST to flush its cache (see the sketch below) |
| Response measure | IO should not stall for long on the client |
| Questions | how much of the OST's cache should be flushed? can we do a partial flush to avoid situations where one client thrashes the servers? |
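For illustration, a rough sketch of the client-side trigger, with a made-up threshold and hypothetical field names; the actual policy and the RPC used to request the flush are open questions.

```c
#include <stdbool.h>
#include <stdint.h>

/* hypothetical client memory accounting */
struct client_mem {
        uint64_t cm_reclaimable;        /* bytes the VM could normally free */
        uint64_t cm_pinned;             /* bytes pinned by non-committed writes */
};

/* hypothetical policy: ask for a flush when pinned data dominates */
static bool should_request_flush(const struct client_mem *cm)
{
        /* e.g. pinned data is more than 3/4 of reclaimable memory */
        return cm->cm_pinned * 4 > cm->cm_reclaimable * 3;
}

static void memory_pressure(const struct client_mem *cm)
{
        if (should_request_flush(cm)) {
                /*
                 * Ask the OST to commit cached data; a partial flush
                 * (oldest transactions first) would avoid one client
                 * forcing a full server-wide sync.
                 */
        }
}
```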
Order of flushing
| Scenario | order of flushing |
|---|---|
| Business Goals | stable and simple recovery code |
| Relevant QA's | availability |
| Stimulus source | multiple OST_WRITE requests |
| Stimulus | clients flushing their caches |
| Environment | OST |
| Artifact | order of flushing |
| Response | a write request with transno X is expected to be committed only if every write request with transno Y < X is committed as well, i.e. writes are flushed in transno order (see the sketch below) |
| Response measure | no data loss after a fully successful failover; in other words, no lost transactions |
| Questions | |
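A small sketch of the server-side invariant, assuming a hypothetical per-transaction list kept in ascending transno order: the committed transno advertised to clients may only advance over a contiguous run of on-disk transactions.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative only: per-transaction state on the OST for transactions that
 * have not yet been advertised as committed, in ascending transno order.
 */
struct transaction {
        struct transaction *t_next;
        uint64_t t_transno;
        bool t_on_disk;         /* flushed by the disk filesystem */
};

/*
 * The committed transno the server may advertise: the highest transno such
 * that it and every lower transno are already on disk.  Flushing writes in
 * transno order keeps this value advancing without gaps, so no transaction
 * below the advertised value can be lost.
 */
static uint64_t committed_transno(const struct transaction *pending,
                                  uint64_t last_advertised)
{
        uint64_t committed = last_advertised;

        for (; pending != NULL && pending->t_on_disk; pending = pending->t_next)
                committed = pending->t_transno;
        return committed;
}
```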
Non-blocking flushing
| Scenario | non-blocking flushing |
|---|---|
| Business Goals | good performance; the storage should not be idling while clients keep feeding the server with data |
| Relevant QA's | performance |
| Stimulus source | incoming writes cause an existing disk transaction to be flushed |
| Stimulus | a closed disk transaction |
| Environment | OST under load |
| Artifact | flushing code of the disk filesystem |
| Response | the flushing code of the disk filesystem should flush old transaction(s) without blocking current activity much |
| Response measure | there should be no gap in storage activity while the OST is fed with enough data from clients |
| Questions | |
Grants
| Scenario | Grants |
|---|---|
| Business Goals | grants should keep working with a caching OST |
| Relevant QA's | usability |
| Stimulus source | a write request to the OST |
| Stimulus | space accounting |
| Environment | OST |
| Artifact | grants accounting |
| Response | the disk filesystem or the grants code should track the amount of data cached and the possible amount of disk space required to flush it (see the sketch below) |
| Response measure | no -ENOSPC is allowed for data a client has put in its own cache (this is an existing requirement for grants) |
| Questions | |
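A sketch of the accounting the grants code would need, with hypothetical fields: dirty data cached on the OST must be counted against free space together with outstanding grants, so that flushing the cache can never fail with -ENOSPC.

```c
#include <stdbool.h>
#include <stdint.h>

/* hypothetical space accounting on the OST */
struct ost_space {
        uint64_t os_free;       /* free bytes on the underlying filesystem */
        uint64_t os_granted;    /* bytes promised to clients, not yet written */
        uint64_t os_cached;     /* dirty bytes cached on the OST, not yet flushed */
        uint64_t os_overhead;   /* worst-case metadata overhead for a flush */
};

/*
 * Can the OST accept more dirty data into its cache?  Granted space plus
 * already-cached data plus the new bytes and metadata overhead must still
 * fit into free space, so flushing the cache can never return -ENOSPC.
 */
static bool can_cache_write(const struct ost_space *os, uint64_t bytes)
{
        return os->os_granted + os->os_cached + bytes + os->os_overhead
                <= os->os_free;
}
```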
Few requests in flight
| Scenario | few requests in flight |
|---|---|
| Business Goals | allow recoverable and well-performing writes |
| Relevant QA's | availability, performance |
| Stimulus source | applications |
| Stimulus | a few write requests in flight |
| Environment | |
| Artifact | last_rcvd file |
| Response | store the result of request handling until the reply is known to have been received (see the sketch below) |
| Response measure | the results of all committed but not-yet-replied requests should be found in last_rcvd upon recovery |
| Questions | do we really need to follow these strict rules for data? can we treat some of these cases as IO errors to relax the rules and simplify the code? how do we know a reply was received so that we can free the corresponding slot in last_rcvd? an ACK seems to be too expensive |
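For illustration, a sketch of per-request result retention with a hypothetical slot layout (not the actual on-disk last_rcvd format): the result of a write is kept until the reply is known to have reached the client, so a committed but unreplied request can be answered again during recovery.

```c
#include <stdbool.h>
#include <stdint.h>

/* hypothetical in-memory view of one reply-retention slot */
struct reply_slot {
        uint64_t rs_transno;    /* transaction the result belongs to */
        uint64_t rs_xid;        /* request identifier from the client */
        int rs_result;          /* return code to resend on replay */
        bool rs_in_use;
};

/* record the outcome; the slot must survive until the reply is acknowledged */
static void record_result(struct reply_slot *slot, uint64_t transno,
                          uint64_t xid, int result)
{
        slot->rs_transno = transno;
        slot->rs_xid = xid;
        slot->rs_result = result;
        slot->rs_in_use = true;
}

/*
 * Freeing the slot requires knowing that the client saw the reply; an
 * explicit ACK per reply is one (expensive) way, piggybacking that
 * information on the next request from the same client is another.
 */
static void reply_acknowledged(struct reply_slot *slot)
{
        slot->rs_in_use = false;
}
```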
- QAS template
| Scenario | |
|---|---|
| Business Goals | |
| Relevant QA's | |
| Stimulus source | |
| Stimulus | |
| Environment | |
| Artifact | |
| Response | |
| Response measure | |
| Questions | |
Decomposition
Server side
- transno generation for OST_WRITE
- flush in order of transno
Client side
- copy-on-write mechanism for regular writes
- copy-on-write mechanism for mmap'ed pages
- OST_WRITE replay, and freeing of retained requests upon commit
Implementation Details
Should we consider a different recovery model for data? For metadata we use a model where the state is reproduced by replaying all requests, and each request carries all required data (in other words, it does not refer to pages, inodes, dentries, etc.). We cannot follow exactly this model for data, since that would imply making a copy of all data for each request. So retained requests have to refer to external data, but that external data is shared and may change by the moment of replay. Would it make sense to use a "state" model for data, where we reproduce only the current state rather than replaying all previous states?
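To make the trade-off concrete, a sketch of the two request shapes under hypothetical structure names: in the metadata-style replay model each retained request owns a private copy of its data, while in a "state" model retained requests only reference shared cache pages and recovery reproduces the current state.

```c
#include <stdint.h>

struct page;    /* shared client cache page */

/*
 * Replay model (as used for metadata): every retained request is
 * self-contained, so it must carry a private copy of the data it wrote.
 * Correct, but roughly doubles the memory used for cached writes.
 */
struct replay_req {
        uint64_t r_transno;
        char *r_data_copy;      /* private copy of the written bytes */
        uint64_t r_offset, r_len;
};

/*
 * State model: retained requests only reference the shared, possibly
 * since-modified cache pages; recovery reproduces the current state
 * rather than each intermediate write.
 */
struct state_req {
        uint64_t s_transno;
        struct page **s_pages;  /* shared pages, may contain newer data */
        uint64_t s_offset, s_len;
};
```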
Can we support asynchronous truncate? If so, how does the client know that a page was truncated so that it does not try to read it from the server? How do we handle truncate upon lock cancel?
Should we take NRS into account? To what extent?