Architecture - MDS-on-DMU

From Obsolete Lustre Wiki
Revision as of 10:58, 22 January 2010 by Docadmin

Note: The content on this page reflects the state of design of a Lustre feature at a particular point in time and may contain outdated information.

Definitions

dmu
Data Management Unit: a set of internal ZFS interfaces implementing data objects, disk space management, and transactions. The dmu operates on top of the volume manager and provides services for the zpl (ZFS POSIX Layer). The dmu also exists as a user-level library that has been ported to multiple platforms.
osd
Object Storage Device. Bottom layer in the mds stack (as per the CMD3 server architecture). The osd implements transactions, data objects, indices, local locking, object attributes, and extended attributes.
mdd
Meta-Data Device. The mds layer implementing posix functionality on top of osd. Implements name-space operations (link, unlink, readdir), permission checks (i_mode and acls), fine-grained pdirops locking of directories, and lov attributes.
cmm
Clustered Meta-data Module. Optional layer implementing clustered meta-data on top of mdd (responsible for local meta-data) and mdc (used to manipulate meta-data on other md servers in the cluster).
mdt
Meta-Data Target. Topmost layer in the mds stack. Responsible for all things networking: receiving and unpacking requests, sending replies, recovery, and distributed locking.
zap
Generic key->value indexing mechanism implemented in the dmu. Used by the zpl to implement both the usual posix directory service and ea support.
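
The layering in these definitions can be illustrated with a small toy model. This is only a sketch under stated assumptions: it is not Lustre code, the class and method names (`Osd`, `Mdd`, `index_insert`, and so on) are hypothetical, and a Python dict stands in for a zap-like key->value index. It shows an mdd-style layer implementing link, unlink, and readdir purely in terms of the index operations of the layer below it.

```python
# Toy model of the osd/mdd split; not Lustre code.
# All class and method names are hypothetical.


class Osd:
    """Bottom layer: objects plus key->value indices (zap-like)."""

    def __init__(self):
        self._indices = {}   # object id -> {key: value}
        self._next_id = 1

    def create_index(self):
        obj_id = self._next_id
        self._next_id += 1
        self._indices[obj_id] = {}
        return obj_id

    def index_insert(self, obj_id, key, value):
        index = self._indices[obj_id]
        if key in index:
            raise FileExistsError(key)
        index[key] = value

    def index_delete(self, obj_id, key):
        del self._indices[obj_id][key]   # KeyError if the key is absent

    def index_iterate(self, obj_id):
        return sorted(self._indices[obj_id].items())


class Mdd:
    """posix name-space layer that uses only the Osd interface."""

    def __init__(self, osd):
        self.osd = osd

    def mkdir(self, parent=None):
        d = self.osd.create_index()
        # In this model the upper layer, not the osd, adds dot and dot-dot.
        self.osd.index_insert(d, ".", d)
        self.osd.index_insert(d, "..", d if parent is None else parent)
        return d

    def link(self, directory, name, target):
        self.osd.index_insert(directory, name, target)

    def unlink(self, directory, name):
        self.osd.index_delete(directory, name)

    def readdir(self, directory):
        return [name for name, _ in self.osd.index_iterate(directory)]


mdd = Mdd(Osd())
root = mdd.mkdir()
sub = mdd.mkdir(root)
mdd.link(root, "sub", sub)
print(mdd.readdir(root))  # ['.', '..', 'sub']
```

Because `Mdd` touches storage only through `Osd` methods, a different back end could be swapped in underneath without changing the name-space code, which is the point of the layering.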

Use Cases

Summary

Description | Quality | Semantics
async-txn | performance, scalability | asynchronous transactions are supported.
recovery | availability | single-failure recovery is implemented.
cache | performance, scalability | meta-data has to be cached on the server.
consistency | performance, scalability, availability | distributed consistency between md and os servers is maintained through llogs.
cmd | performance, scalability | clustered meta-data are supported.
splitd | performance, scalability | split directories. Do we want this? Hopefully not.
pdirops | performance, scalability | fine-grained name-space locking.
improvements | performance, scalability | features like directory read-ahead, early lock cancellation, server-driven lock lru resizing, version-based recovery, etc. are supported.
layering | testability, modifiability | CMD3 layering is preserved.
rollback | availability | distributed transaction roll-back in CMD configurations.
fid | usability | files and objects are uniformly identified by fid.
posix | usability, security | posix semantics and posix interfaces are supported except where unreasonable (e.g., atime).
layout | usability | various file layout formats (striping, join-file, etc.) are supported through a common interface.
stats | testability, usability, performance | measurements of run-time behavior of every module are collected and exported to the user.
back-end | modifiability, usability | all back-end specific functionality (ldiskfs vs. ZFS) is encapsulated into a few modules with well-defined interfaces.
platform | modifiability, usability | all platform specific functionality (kernel vs. user space) is encapsulated into a few modules with well-defined interfaces.
op-rate | performance | performance comparable with the kernel version.
osd | | the following osd-specific qualities are included from OSS-on-DMU by reference: control interface, interoperability, disk quota, orphans, aborted setuid/setgid, ZFS compatibility, failure simulation, capabilities.

Features and Functional Behaviour

Description | Semantics
txn:open | opening a transaction. Mapping the dmu transaction state machine to the model exported by osd.
txn:credits | mapping the dt_txn interface into a form suitable for use by creditless transaction engines similar to dmu.
txn:call-backs | call-backs invoked by the transaction engine when the transaction state changes.
obj:alloc | allocation hint interface is general enough to be suitable for both dmu and ldiskfs.
obj:attr | object attribute interface is suitable for dmu.
obj:xattr | ea interface is suitable for dmu.
obj:dir * | creation of a zfs-compatible directory (specifically, insertion of dot and dot-dot by mdd).
obj:io | locking (i.e., scalability concerns) for non-bulk ->dt_{read,write} methods.
dir:pdirops * | interaction between the mdd-based pdirops implementation and dmu.
dir:features | mapping of struct dt_index_features onto the dmu zap interface.
dir:ops | mapping of struct dt_index_operations onto the dmu zap interface.
dir:it | mapping of struct dt_it_ops onto the dmu zap interface.
fs:statfs | mapping file system statistics into kstat form (including ->f_files and ->f_ffree).
fs:testing | handle 'sync' and 'set read-only' requests.
osd:share | share common code between ldiskfs and dmu based versions of osd (capability handling, local locking, reference counting, etc.).
osd:fid->objid | implementation of the persistent fid-to-object mapping (aka object index, aka oi) as a zap.
osd:fid-dirent | emulation of fids in directory entries by storing fids in an ea of the object. (Consider using the small scratch pad area in the dnode.)
cmm:remote-fid | management of proxy objects, serving as place-holders for remote objects. Alternatively use symbolic links.
* possibly requires changes outside of osd.
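
The txn:* and osd:fid->objid items can be made more concrete with a toy model. The sketch below is not Lustre or ZFS code, and every name in it (`CreditlessTxn`, `OsdTxn`, `ObjectIndex`, the state names) is hypothetical. It assumes a creditless engine that wants to be told which objects a transaction will touch rather than how many blocks it may dirty, and it models a fid as a plain tuple. The front end accepts a credit count for interface compatibility but does not forward it, state changes trigger call-backs, and the object index is a key->value table that may only be updated inside a running transaction.

```python
# Toy model only; not Lustre or ZFS code. Every name here is hypothetical.


class CreditlessTxn:
    """Stand-in for a dmu-like engine: declare holds, assign, commit."""

    def __init__(self):
        self.state = "open"
        self.holds = set()
        self.callbacks = []

    def _set_state(self, new_state):
        self.state = new_state
        # txn:call-backs: notify listeners on every state change.
        for callback in self.callbacks:
            callback(self, new_state)

    def hold(self, obj):
        if self.state != "open":
            raise RuntimeError("holds must be declared before assignment")
        self.holds.add(obj)

    def assign(self):
        if self.state != "open":
            raise RuntimeError("transaction already assigned")
        self._set_state("assigned")

    def commit(self):
        if self.state != "assigned":
            raise RuntimeError("transaction was never assigned")
        self._set_state("committed")


class OsdTxn:
    """Credit-style front end of the kind an osd might export."""

    def __init__(self, ncredits):
        # txn:credits: the count is accepted for interface compatibility
        # but never forwarded, since the engine below has no credits.
        self.ncredits = ncredits
        self.engine = CreditlessTxn()

    def declare(self, obj):
        self.engine.hold(obj)

    def on_state_change(self, callback):
        self.engine.callbacks.append(callback)

    def start(self):
        self.engine.assign()

    def stop(self):
        self.engine.commit()


class ObjectIndex:
    """osd:fid->objid: fid -> object id, kept in a zap-like table."""

    NAME = "oi"

    def __init__(self):
        self._zap = {}

    def insert(self, txn, fid, objid):
        engine = txn.engine
        if engine.state != "assigned" or self.NAME not in engine.holds:
            raise RuntimeError("object index must be held by a running txn")
        self._zap[fid] = objid

    def lookup(self, fid):
        return self._zap[fid]


oi = ObjectIndex()
states = []
txn = OsdTxn(ncredits=8)
txn.declare(ObjectIndex.NAME)
txn.on_state_change(lambda engine, state: states.append(state))
txn.start()
oi.insert(txn, (0x1, 1, 0), 17)   # fid modelled as a plain tuple
txn.stop()
print(states)  # ['assigned', 'committed']
```

Keeping the credit count at the front end and the holds in the engine is one possible way to let callers written against a credit-based interface run unchanged on a creditless back end; the actual mapping chosen for osd may differ.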




References

ZFS for Lustre

OSS-on-DMU