WARNING: This is the _old_ Lustre wiki, and it is in the process of being retired. The information found here is all likely to be out of date. Please search the new wiki for more up to date information.
Architecture - MDS-on-DMU
Note: The content on this page reflects the state of design of a Lustre feature at a particular point in time and may contain outdated information.
Definitions
- dmu
- Data Management Unit: a set of internal zfs interfaces implementing data objects, disk space management, and transactions. dmu operates on top of the volume manager and provides services to zpl (ZFS POSIX Layer). dmu also exists as a user-level library that has been ported to multiple platforms.
- osd
- Object Storage Device. Bottom layer in mds stack (as per CMD3 server architecture). osd implements transactions, data-objects, indices, local locking, object attributes, and extended attributes.
- mdd
- Meta-Data Device. mds layer implementing posix functionality on top of osd. Implements name-space operations (link, unlink, readdir), permission checks (i_mode and acls), fine-grained pdirops locking of directories, lov attributes.
- cmm
- Clustered Meta-data Module. Optional layer implementing clustered meta-data on top of mdd (responsible for local meta-data) and mdc (used to manipulate meta-data on other md servers in the cluster).
- mdt
- Meta-Data Target. Topmost layer in mds stack. Responsible for all things networking: receiving and unpacking requests, sending replies, recovery, distributed locking.
- zap
- generic key->value indexing mechanism implemented in dmu. Used by zpl to implement both usual posix directory service and ea support.
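The layering described by these definitions can be pictured with a small toy model. This is only an illustrative sketch: the `Layer` class, `build_stack` function, and role strings are hypothetical names made up for this example and do not correspond to any Lustre interface. It shows mdt on top, the optional cmm layer, then mdd, with osd at the bottom.

```python
# Toy model of the mds stack from the definitions above.
# All names here (Layer, build_stack) are hypothetical, not Lustre APIs.


class Layer:
    def __init__(self, name, role, lower=None):
        self.name = name
        self.role = role
        self.lower = lower  # next layer down, or None at the bottom

    def path(self):
        """Return layer names from this layer down to the bottom."""
        names = [self.name]
        layer = self.lower
        while layer is not None:
            names.append(layer.name)
            layer = layer.lower
        return names


def build_stack(clustered=False):
    """Assemble mdt -> [cmm] -> mdd -> osd; cmm is the optional layer."""
    osd = Layer("osd", "transactions, objects, indices, attributes")
    mdd = Layer("mdd", "posix name-space operations", lower=osd)
    below_mdt = mdd
    if clustered:
        below_mdt = Layer("cmm", "clustered meta-data", lower=mdd)
    return Layer("mdt", "networking, recovery, distributed locking",
                 lower=below_mdt)
```

Calling `build_stack(clustered=True).path()` lists the four layers from top to bottom; without `clustered`, cmm is omitted and mdt sits directly on mdd.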
Use Cases
Summary
Description | Quality | Semantics |
---|---|---|
async-txn | performance, scalability | asynchronous transactions are supported. |
recovery | availability | single-failure recovery is implemented. |
cache | performance, scalability | meta-data has to be cached on the server. |
consistency | performance, scalability, availability | distributed consistency between md and os servers is maintained through llogs. |
cmd | performance, scalability | clustered meta-data are supported. |
splitd | performance, scalability | split directories. Do we want this? Hopefully not. |
pdirops | performance, scalability | fine-grained name-space locking. |
improvements | performance, scalability | features like directory read-ahead, early lock cancellation, server-driven lock lru resizing, version-based recovery, etc. are supported. |
layering | testability, modifiability | CMD3 layering is preserved. |
rollback | availability | distributed transaction roll-back in CMD configurations. |
fid | usability | files and objects are uniformly identified by fid. |
posix | usability, security | posix semantics and posix interfaces are supported except where unreasonable (e.g., atime). |
layout | usability | support various file layout formats (striping, join-file, etc.) through common interface. |
stats | testability, usability, performance | measurements of run-time behavior of every module are collected and exported to the user. |
back-end | modifiability, usability | all back-end specific functionality (ldiskfs vs. ZFS) is encapsulated in a few modules with well-defined interfaces. |
platform | modifiability, usability | all platform specific functionality (kernel vs. user space) is encapsulated in a few modules with well-defined interfaces. |
op-rate | performance | performance comparable with kernel version. |
osd | all | the following osd-specific qualities are included from OSS-on-DMU by reference: control interface, interoperability, disk quota, orphans, aborted setuid/setgid, ZFS compatibility, failure simulation, capabilities. |
Features and Functional Behaviour
Description | Note | Semantics |
---|---|---|
txn:open | | opening transaction. Mapping dmu transaction state machine to model exported by osd. |
txn:credits | | mapping dt_txn interface into form suitable for use by creditless transaction engines similar to dmu. |
txn:call-backs | | call-backs invoked by transaction engine when transaction state changed. |
obj:alloc | | allocation hint interface is general enough to be suitable for both dmu and ldiskfs. |
obj:attr | | object attribute interface is suitable for dmu. |
obj:xattr | | ea interface is suitable for dmu. |
obj:dir | * | creation of zfs compatible directory (specifically, insertion of dot and dot-dot by mdd). |
obj:io | | locking (i.e., scalability concerns) for non-bulk ->dt_{read,write} methods. |
dir:pdirops | * | interaction between mdd-based pdirops implementation and dmu. |
dir:features | | mapping of struct dt_index_features onto dmu zap interface. |
dir:ops | | mapping struct dt_index_operations onto dmu zap interface. |
dir:it | | mapping struct dt_it_ops onto dmu zap interface. |
fs:statfs | | mapping file system statistics into kstat form (including ->f_files and ->f_ffree). |
fs:testing | | handle 'sync' and 'set read-only' requests. |
osd:share | | share common code between ldiskfs and dmu based versions of osd (capability handling, local locking, reference counting, etc.). |
osd:fid->objid | | implementation of persistent fid-to-object mapping (aka object index, aka oi) as a zap. |
osd:fid-dirent | | emulation of fids in directory entries by storing fids in ea of object. (Consider using small scratch pad area in dnode.) |
cmm:remote-fid | | management of proxy objects, serving as place-holders for remote objects. Alternatively use symbolic links. |
* — possibly requires changes outside of osd.
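To make three of the rows above more concrete (osd:fid->objid, osd:fid-dirent, and obj:dir), here is a minimal toy model. It is a sketch under stated assumptions, not the actual design: a zap is modelled as a plain Python dict, and `Fid`, `ToyOsd`, `EA_FID`, `mkdir`, and `lookup` are hypothetical names introduced only for illustration. Directory entries store local object ids, the fid of each object lives in its ea, and the object index maps fids back to object ids.

```python
# Toy model only: a "zap" is modelled as a dict (key -> value).
# Fid, ToyOsd, EA_FID, mkdir and lookup are hypothetical names.
from collections import namedtuple

Fid = namedtuple("Fid", ["seq", "oid", "ver"])

EA_FID = "fid"  # hypothetical name of the ea that holds an object's fid


class ToyOsd:
    def __init__(self):
        self.oi = {}       # object index: fid -> local object id
        self.objects = {}  # object id -> {"ea": {...}, "entries": {...}}
        self.next_objid = 1

    def create(self, fid):
        """Allocate an object, record its fid in an ea and in the oi."""
        if fid in self.oi:
            raise ValueError("fid already in object index")
        objid = self.next_objid
        self.next_objid += 1
        self.objects[objid] = {"ea": {EA_FID: fid}, "entries": {}}
        self.oi[fid] = objid
        return objid

    def objid_of(self, fid):
        return self.oi[fid]

    def fid_of(self, objid):
        return self.objects[objid]["ea"][EA_FID]

    def insert(self, dir_objid, name, child_objid):
        self.objects[dir_objid]["entries"][name] = child_objid


def mkdir(osd, parent_fid, name, fid):
    """Create a directory; the caller (mdd's role) inserts "." and ".."."""
    parent = osd.objid_of(parent_fid)
    child = osd.create(fid)
    osd.insert(child, ".", child)
    osd.insert(child, "..", parent)
    osd.insert(parent, name, child)
    return child


def lookup(osd, parent_fid, name):
    """Resolve a name to a fid: the dirent yields an object id, the ea a fid."""
    parent = osd.objid_of(parent_fid)
    child = osd.objects[parent]["entries"][name]
    return osd.fid_of(child)
```

In this model a name lookup costs one directory read plus one ea read per entry, which is the overhead the scratch-pad suggestion in the osd:fid-dirent row would aim to reduce.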