WARNING: This is the _old_ Lustre wiki, and it is in the process of being retired. The information found here is all likely to be out of date. Please search the new wiki for more up to date information.

Architecture - OSS-on-DMU


Note: The content on this page reflects the state of design of a Lustre feature at a particular point in time and may contain outdated information.

Definitions

DMU - Data Management Unit, the core of the ZFS filesystem; it implements object storage, transactions, snapshots and pool management.

ZAP - an indexing subsystem of the DMU that operates on sets of key->value pairs.

FID - cluster-wide ID of any Lustre object, including objects on MDSes and OSSes.


Requirements

| id | quality | trigger | affected | description |
|----|---------|---------|----------|-------------|
| dio | performance | reads/writes | fsfilt | reads/writes should be zero-copy to allow high throughput |
| cache | performance | small writes, reads | obdfilter/dmu | cache should be used to aggregate writes and avoid repeating reads (clients booting from Lustre) |
| 90% of bandwidth | performance | reads, writes | dmu | is it our responsibility? is it 90%? |
| mount | usability | startup | obdfilter/OSD/fsfilt | standalone uOSS process attaches to the requested pool and starts serving it via the Lustre protocol |
| stats | usability | explicit request | OST/obdfilter/OSD/fsfilt/dmu | we should provide the user with runtime information for all components |
| control interface | usability | explicit request | OST/obdfilter/OSD/fsfilt/dmu | be able to control the runtime behavior of all components |
| read/write | usability | data access | obdfilter/fsfilt | |
| lockless read/write | usability | data access | obdfilter | in some cases it's more efficient to switch to a cache-disabled, ftp-like protocol to maintain good performance |
| create on write | usability | | obdfilter | first write/setattr creates the object |
| interoperability | usability | fids enabled | OSD | different on-disk layout depending on whether fids are enabled |
| grants | usability | | obdfilter | clients should be given a limited amount of writeback cache to avoid late -ENOSPC |
| disk quota | usability | | dmu | probably regular per-user/group quota, to be implemented in DMU |
| rich OSD/fsfilt API | modifiability | different backends | OSD/fsfilt | hide DMU specifics in fsfilt, keep the OSD API abstract enough |
| compile | modifiability | developer | libcfs/build | first targets are Solaris and Linux, but it's likely there will be more platforms |
| transaction callbacks | availability | cluster failures | fsfilt/dmu | obdfilter to be notified once a transaction is committed |
| orphans | availability | cluster failures | | a mechanism to track and clean unreferenced OST objects |
| aborted write | availability | cluster failures | | regular recovery to be used: every write is assigned a transno and flushed in transno order |
| aborted destroy | availability | cluster failures | | llog record on MDS is canceled upon destroy commit |
| aborted setattr | availability | cluster failures | | regular recovery to be used |
| aborted setuid/setgid | availability | cluster failures | | llog record on MDS is canceled upon commit (if requested) |
| aborted size change | availability | cluster failures | | every time size/blocks change with a new IO epoch, an llog record is generated |
| data rollback | availability | cluster failures | obdfilter/fsfilt/dmu | required for SNS |
| ZFS compatibility | availability | any modification | dmu/fsfilt | dmu/fsfilt should keep the underlying filesystem compatible with ZFS |
| failure simulation | testability | the more, the better | all | keep existing failure simulation with OBD_FAIL |
| capabilities | security | any access | ost | with capabilities enabled, any access is to be signed and checked with a key provided by the MDS |
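
The "grants" requirement above can be illustrated with a small model. This is only a sketch under stated assumptions: the class `GrantManager`, its methods and the per-client cap are hypothetical names invented for illustration, not the obdfilter interface. The idea it shows is that the server promises space to a client before the client caches dirty data, so a shortage is reported at write time instead of much later at flush time.

```python
import errno


class GrantManager:
    """Toy model of OST space grants (all names here are hypothetical)."""

    def __init__(self, free_bytes, max_grant_per_client):
        self.free_bytes = free_bytes          # space not yet promised to anyone
        self.max_grant = max_grant_per_client
        self.granted = {}                     # client id -> bytes currently granted

    def request_grant(self, client, wanted):
        """Promise up to `wanted` bytes to a client; return the bytes granted."""
        held = self.granted.get(client, 0)
        amount = min(wanted, self.max_grant - held, self.free_bytes)
        amount = max(amount, 0)
        self.free_bytes -= amount
        self.granted[client] = held + amount
        return amount

    def cached_write(self, client, nbytes):
        """Check a write against the grant: 0 on success, -ENOSPC otherwise."""
        held = self.granted.get(client, 0)
        if nbytes > held:
            return -errno.ENOSPC              # fail early, before data is cached
        self.granted[client] = held - nbytes  # the write consumes the grant
        return 0
```

Because granted space is subtracted from `free_bytes` up front, the sum of all outstanding grants can never exceed what the backend can actually store, which is what prevents the late -ENOSPC.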

Implementation Constraints

  1. use ZAP for fid->dnode mapping
  2. use DMU's transactions
  3. single process to serve all DMU pools
  4. synchronous IO from libzpool to system (Solaris doesn't do AIO well)
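
The first two constraints can be sketched together: a FID is packed into a fixed-size key, and the fid->dnode entry only becomes visible once the enclosing transaction commits. The sketch below is an in-memory stand-in, not the ZAP or DMU API. `Fid`, `fid_key`, `ToyTx` and `FidIndex` are hypothetical names, and the 64-bit sequence / 32-bit object id / 32-bit version layout is an assumption based on how Lustre FIDs are commonly laid out.

```python
import struct
from collections import namedtuple

# Assumed FID layout: 64-bit sequence, 32-bit object id, 32-bit version.
Fid = namedtuple("Fid", "seq oid ver")


def fid_key(fid):
    """Pack a FID into a fixed 16-byte big-endian key for the index."""
    return struct.pack(">QII", fid.seq, fid.oid, fid.ver)


class ToyTx:
    """Stand-in for a DMU transaction: updates are buffered until commit."""

    def __init__(self, entries):
        self.entries = entries   # the dict that backs the index
        self.pending = {}

    def add(self, key, value):
        self.pending[key] = value

    def commit(self):
        self.entries.update(self.pending)
        self.pending = {}


class FidIndex:
    """Stand-in for a ZAP object mapping FID keys to dnode numbers."""

    def __init__(self):
        self.entries = {}

    def insert(self, tx, fid, dnode):
        # The mapping is recorded in the transaction, not applied directly.
        tx.add(fid_key(fid), dnode)

    def lookup(self, fid):
        return self.entries.get(fid_key(fid))
```

A usage example: create `idx = FidIndex()` and `tx = ToyTx(idx.entries)`, call `idx.insert(tx, Fid(1, 7, 0), 1234)`, and note that `idx.lookup(Fid(1, 7, 0))` returns `None` until `tx.commit()` is called.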

OSS on DMU Architecture

The OSS architecture consists of a few components:

fsfilt - hides filesystem specifics (in this case the DMU) from the upper layers, since many of the APIs we are going to use are not standard. Provides data, index and transaction management.
OSD - implements an Object Storage Device whose objects are addressable by Lustre FIDs; uses fsfilt to manipulate the filesystem's objects, indices and transactions.
obdfilter - implements grants, quota, distributed locking, capabilities and Size-on-MDS; wraps transactions.
OST - handles RPCs: receiving, parsing, swabbing, bulk transfers and replying.
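
To make the layering concrete, here is a minimal sketch of how a write request might pass down through the four components. Everything in it is an assumption for illustration: the class and method names are hypothetical, the backend is an in-memory dict rather than the DMU, and only the "create on write" and "transaction callbacks" requirements are modelled.

```python
import errno


class Fsfilt:
    """Hides the backend (here a dict standing in for the DMU)."""

    def __init__(self):
        self.objects = {}        # backend object id -> bytearray
        self.next_id = 1
        self.callbacks = []      # commit callbacks waiting for the next commit

    def create(self):
        oid = self.next_id
        self.next_id += 1
        self.objects[oid] = bytearray()
        return oid

    def write(self, oid, offset, data):
        buf = self.objects[oid]
        end = offset + len(data)
        if len(buf) < end:
            buf.extend(b"\0" * (end - len(buf)))   # zero-fill any hole
        buf[offset:end] = data

    def on_commit(self, callback):
        self.callbacks.append(callback)

    def commit(self):
        pending, self.callbacks = self.callbacks, []
        for cb in pending:
            cb()


class Osd:
    """Objects addressable by FID, built on fsfilt objects and an index."""

    def __init__(self, fsfilt):
        self.fsfilt = fsfilt
        self.index = {}          # FID -> backend object id

    def lookup(self, fid):
        return self.index.get(fid)

    def create(self, fid):
        self.index[fid] = self.fsfilt.create()
        return self.index[fid]


class Obdfilter:
    """Create-on-write, transno assignment and commit notification."""

    def __init__(self, osd):
        self.osd = osd
        self.last_transno = 0
        self.last_committed = 0

    def write(self, fid, offset, data):
        oid = self.osd.lookup(fid)
        if oid is None:                      # first write creates the object
            oid = self.osd.create(fid)
        self.last_transno += 1
        transno = self.last_transno
        self.osd.fsfilt.write(oid, offset, data)
        self.osd.fsfilt.on_commit(lambda: self._committed(transno))
        return transno

    def _committed(self, transno):
        self.last_committed = max(self.last_committed, transno)


class Ost:
    """RPC front end: unpack a request, call obdfilter, build a reply."""

    def __init__(self, obdfilter):
        self.obdfilter = obdfilter

    def handle(self, request):
        if request["op"] == "write":
            transno = self.obdfilter.write(
                request["fid"], request["offset"], request["data"])
            return {"status": 0, "transno": transno}
        return {"status": -errno.ENOSYS}     # other operations are not modelled
```

In this sketch the reply carries the transno immediately, while `last_committed` only advances when the backend calls the commit callback. That gap is what recovery relies on: a client would need to keep a request for replay until its transno is known to be committed.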