WARNING: This is the _old_ Lustre wiki, and it is in the process of being retired. The information found here is all likely to be out of date. Please search the new wiki for more up to date information.
Architecture - OSS-on-DMU
Note: The content on this page reflects the state of design of a Lustre feature at a particular point in time and may contain outdated information.
Definitions
DMU - Data Management Unit, the core of the ZFS filesystem, implementing object storage, transactions, snapshots, and pool management.
ZAP - an indexing subsystem of the DMU that operates on sets of key->value pairs
FID - cluster-wide ID of any Lustre object, including objects on MDSes and OSSes
Requirements
id | quality | trigger | affected | description |
dio | performance | reads/writes | fsfilt | reads/writes should be zero-copy to allow high throughput |
cache | performance | small writes, reads | obdfilter/dmu | cache should be used to aggregate writes and avoid repeating reads (clients booting from lustre) |
90% of bandwidth | performance | reads, writes | dmu | is it our responsibility? is it 90%? |
mount | usability | startup | obdfilter/OSD/fsfilt | a standalone uOSS process attaches to the requested pool and starts serving it via the Lustre protocol |
stats | usability | explicit request | OST/obdfilter/OSD/fsfilt/dmu | users should be given runtime information for all components |
control interface | usability | explicit request | OST/obdfilter/OSD/fsfilt/dmu | it should be possible to control the runtime behavior of all components |
read/write | usability | data access | obdfilter/fsfilt | |
lockless read/write | usability | data access | obdfilter | in some cases it's more efficient to switch to a cache-disabled, FTP-like protocol to maintain good performance |
create on write | usability | | obdfilter | first write/setattr creates object |
interoperability | usability | fids enabled | OSD | different on-disk layout depending on fids enabled or not |
grants | usability | | obdfilter | clients should be given a limited amount of writeback cache to avoid late -ENOSPC |
disk quota | usability | | dmu | probably regular per-user/group quota, to be implemented in the DMU |
rich OSD/fsfilt API | modifiability | different backends | OSD/fsfilt | hide DMU specifics in fsfilt, keep OSD API abstract enough |
compile | modifiability | developer | libcfs/build | first targets are Solaris and Linux, but it's likely there will be more platforms |
transaction callbacks | availability | cluster failures | fsfilt/dmu | obdfilter to be notified once transaction is committed |
orphans | availability | cluster failures | | a mechanism to track and clean up unreferenced OST objects |
aborted write | availability | cluster failures | | regular recovery to be used: every write is assigned a transno and flushed in transno order |
aborted destroy | availability | cluster failures | | llog record on MDS is canceled upon destroy commit |
aborted setattr | availability | cluster failures | | regular recovery to be used |
aborted setuid/setgid | availability | cluster failures | | llog record on MDS is canceled upon commit (if requested) |
aborted size change | availability | cluster failures | | every time size/blocks change with a new IO epoch, an llog record is generated |
data rollback | availability | cluster failures | obdfilter/fsfilt/dmu | required for SNS |
ZFS compatibility | availability | any modification | dmu/fsfilt | dmu/fsfilt should maintain underlying filesystem compatible with ZFS |
failure simulation | testability | the more, the better | all | keep existing failure simulation with OBD_FAIL |
capabilities | security | any access | ost | with capabilities enabled, any access to be signed and checked with key provided by MDS |
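The "grants" requirement above can be illustrated with a small model. This is a minimal sketch, not the obdfilter implementation: the class name `GrantingServer`, its methods, and the byte counts are all hypothetical. The idea shown is that the server promises space to a client before the client caches writes, and never promises more than it has, so a cached write cannot fail with -ENOSPC long after the application issued it.

```python
import errno


class GrantingServer:
    """Toy model of an OST handing out write grants (all names hypothetical)."""

    def __init__(self, free_bytes):
        self.free_bytes = free_bytes   # space not yet promised to any client
        self.granted = {}              # client id -> bytes promised but unwritten

    def request_grant(self, client, wanted):
        """Promise up to `wanted` bytes; never promise more than is free."""
        given = min(wanted, self.free_bytes)
        self.free_bytes -= given
        self.granted[client] = self.granted.get(client, 0) + given
        return given

    def write(self, client, nbytes):
        """Consume previously granted space; refuse writes beyond the grant."""
        if nbytes > self.granted.get(client, 0):
            return -errno.ENOSPC
        self.granted[client] -= nbytes
        return nbytes


server = GrantingServer(free_bytes=100)
first = server.request_grant("client-a", 80)    # fully satisfied
second = server.request_grant("client-b", 80)   # only 20 bytes are left to promise
```

In this model a client that limits its writeback cache to the size of its grant learns about a full target when the grant shrinks, rather than when a cached write is finally flushed.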
Implementation Constraints
- use ZAP for fid->dnode mapping
- use DMU's transactions
- single process to serve all DMU pools
- synchronous IO from libzpool to system (Solaris doesn't do AIO well)
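The first constraint, a fid->dnode mapping kept in a ZAP, amounts to a key->value index. The sketch below is only an illustration under stated assumptions: a plain dict stands in for the ZAP object, the `Fid` field names and widths and the key format produced by `fid_key` are assumptions made for the example, and none of this is the DMU API.

```python
from typing import NamedTuple, Optional


class Fid(NamedTuple):
    """Hypothetical FID layout; field names and widths are assumptions."""
    seq: int
    oid: int
    ver: int


def fid_key(fid: Fid) -> str:
    """Serialize a FID into a fixed-width string key for the index."""
    return f"{fid.seq:016x}:{fid.oid:08x}:{fid.ver:08x}"


class FidIndex:
    """Stand-in for a ZAP object: a dict of key -> dnode number."""

    def __init__(self):
        self._entries = {}

    def insert(self, fid: Fid, dnode: int) -> None:
        key = fid_key(fid)
        if key in self._entries:
            raise KeyError(f"FID already mapped: {key}")
        self._entries[key] = dnode

    def lookup(self, fid: Fid) -> Optional[int]:
        """Return the dnode number for a FID, or None if it is not mapped."""
        return self._entries.get(fid_key(fid))

    def remove(self, fid: Fid) -> None:
        self._entries.pop(fid_key(fid), None)


index = FidIndex()
example_fid = Fid(seq=0x200000401, oid=1, ver=0)
index.insert(example_fid, 42)
```

A fixed-width key keeps every FID the same length in the index, which is a design choice of this sketch rather than something the page specifies.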
OSS on DMU Architecture
The OSS architecture consists of a few components:
fsfilt | hides filesystem specifics (in this case the DMU) from upper layers, since many of the APIs we're going to use are not standard. Provides data, index and transaction management. |
OSD | implements Object Storage Device with objects addressable by Lustre FIDs, uses fsfilt to manipulate filesystem's objects, indices and transactions. |
obdfilter | Implements grants, quota, distributed locking, wraps transactions, capabilities, Size-on-MDS |
OST | RPCs: receiving, parsing, swabbing, bulks, replying |
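The layering in the table above can be sketched as a write request passing down through the four components, with a commit notification travelling back up. This is a toy model under loud assumptions: every class, method, and field name is hypothetical, the backend is an in-memory dict rather than the DMU, and grants, locking, quota, and capabilities are left out.

```python
class Fsfilt:
    """Hides the backend; here the 'filesystem' is an in-memory dict."""

    def __init__(self):
        self.objects = {}    # backend object id -> bytearray
        self.pending = []    # commit callbacks of uncommitted writes

    def write(self, obj_id, offset, data, on_commit):
        buf = self.objects.setdefault(obj_id, bytearray())
        end = offset + len(data)
        if len(buf) < end:
            buf.extend(bytes(end - len(buf)))   # zero-fill up to the write end
        buf[offset:end] = data
        self.pending.append(on_commit)

    def commit(self):
        """Pretend the transaction group reached disk; fire the callbacks."""
        callbacks, self.pending = self.pending, []
        for callback in callbacks:
            callback()


class Osd:
    """Addresses objects by FID and maps them to backend object ids."""

    def __init__(self, fsfilt):
        self.fsfilt = fsfilt
        self.fid_to_obj = {}

    def write(self, fid, offset, data, on_commit):
        obj_id = self.fid_to_obj.setdefault(fid, len(self.fid_to_obj) + 1)
        self.fsfilt.write(obj_id, offset, data, on_commit)


class Obdfilter:
    """Wraps each write in a numbered transaction and tracks commits."""

    def __init__(self, osd):
        self.osd = osd
        self.next_transno = 1
        self.last_committed = 0

    def write(self, fid, offset, data):
        transno = self.next_transno
        self.next_transno += 1
        self.osd.write(fid, offset, data, lambda: self._committed(transno))
        return transno

    def _committed(self, transno):
        self.last_committed = max(self.last_committed, transno)


class Ost:
    """Parses a request (a dict here, standing in for an RPC) and replies."""

    def __init__(self, obdfilter):
        self.obdfilter = obdfilter

    def handle_write(self, request):
        transno = self.obdfilter.write(
            request["fid"], request["offset"], request["data"])
        return {"status": 0, "transno": transno}


fsfilt = Fsfilt()
ost = Ost(Obdfilter(Osd(fsfilt)))
reply = ost.handle_write({"fid": "0x1:0x2:0x0", "offset": 2, "data": b"abc"})
```

Calling `fsfilt.commit()` afterwards advances `last_committed` in the obdfilter layer, which mirrors the "transaction callbacks" requirement: the upper layer is told once a transaction is durable.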