Architecture - Metadata API
Note: The content on this page reflects the state of design of a Lustre feature at a particular point in time and may contain outdated information.
Summary
The Metadata API (MdAPI) is the set of methods the llite layer uses to access and manipulate metadata. This page starts by describing the existing API. Over time, MdAPI should evolve to become clear and sufficiently portable.
Definitions
- Lock
- LDLM lock
- resolving
- a triple of parent/name pointing to child
Requirements
- clear and sufficient (no tricks needed to use it)
- portable
Cache Model
The current implementation assumes four types of cache:
- resolving (dentry in linux)
- attributes (inode in linux)
- directory (pagecache to store directory entries in linux)
- open files
All of them except open files are maintained by the kernel with help from the filesystem driver (the llite module of Lustre).
All these caches are protected by DLM locks. A special type of lock - inode bitlock - was introduced. Each lock may have 3 bits:
- LOOKUP - protects all resolvings to given object (all because of hardlinks) + permissions
- UPDATE - protects all metadata attributes except permissions
- OPEN - allows client to cache open state of a file
Depending on the operation, the client combines bits to access data. For example, an intermediate lookup only needs to make sure the component still exists and its permissions haven't changed, so only the LOOKUP bit is required. Once the resolving of an intermediate component is cached, the client doesn't need to look it up on the server again. And because operations within a directory don't need the LOOKUP bit, the client's resolving stays in cache even if another client creates or unlinks a file in that directory.
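The bit logic described above can be sketched in a few lines of C. This is only an illustration: the enum names and bit values are placeholders (the page names the bits but not their encoding), and the helper functions `cache_valid()` and `revoke_bits()` are hypothetical. The sketch also assumes that a create or unlink by another client revokes only the UPDATE bit on the directory.

```c
/* Placeholder encoding of the three inode bits; values are illustrative. */
enum ibits {
    IBIT_LOOKUP = 1 << 0, /* resolvings to the object + permissions */
    IBIT_UPDATE = 1 << 1, /* all other metadata attributes */
    IBIT_OPEN   = 1 << 2  /* cached open state */
};

/* A cached item is usable only while every bit it depends on is still held. */
static int cache_valid(unsigned held, unsigned needed)
{
    return (held & needed) == needed;
}

/* Model of a lock cancellation: drop the revoked bits, keep the rest. */
static unsigned revoke_bits(unsigned held, unsigned revoked)
{
    return held & ~revoked;
}
```

For example, a client holding LOOKUP and UPDATE on a directory that loses UPDATE (assumed here to be what a remote create triggers) can still use its cached resolving, since that depends on LOOKUP alone.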
Intents
Intents are a mechanism for saving RPCs, which are the most expensive part of metadata handling because of network latency.
Consider open("file", O_CREAT). The Linux VFS breaks this request into a series of calls to the filesystem driver:
- lookup to find file if it exists
- if file doesn't exist, then create is called
- then open is called
Thus, 2 or 3 RPCs would be needed if we followed the usual model. Instead, we send a lookup RPC with a special structure called an intent, which describes the real intent of the operation. The MDS does all the required work, and create and open then use the results returned to the first operation.
In reality, any operation starts with locking, so it is usually the ENQUEUE RPC that carries the intent.
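The RPC saving can be made concrete with a toy model. The sketch below is not Lustre code: `struct mds_sim` and all function names are hypothetical, and the "server" is just a counter plus a flag. It only illustrates why the usual model costs 2 or 3 round trips while the intent model costs one.

```c
/* Toy MDS: remembers whether the file exists and counts RPCs received. */
struct mds_sim {
    int file_exists;
    int rpcs;
};

static int rpc_lookup(struct mds_sim *m)  { m->rpcs++; return m->file_exists; }
static void rpc_create(struct mds_sim *m) { m->rpcs++; m->file_exists = 1; }
static void rpc_open(struct mds_sim *m)   { m->rpcs++; }

/* Usual model: every VFS call turns into its own RPC. */
static int open_creat_plain(struct mds_sim *m)
{
    if (!rpc_lookup(m))
        rpc_create(m);
    rpc_open(m);
    return m->rpcs;
}

/* Intent model: one ENQUEUE carries the open/create intent. */
static int open_creat_intent(struct mds_sim *m)
{
    m->rpcs++;               /* the single ENQUEUE RPC */
    if (!m->file_exists)     /* this part runs on the MDS */
        m->file_exists = 1;
    /* client-side create and open reuse the reply: no further RPCs */
    return m->rpcs;
}
```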
A few intents are currently defined:
intent | purpose | details |
---|---|---|
IT_OPEN | open file/directory | may return LOOKUP + OPEN lock |
IT_CREAT | create file | used with IT_OPEN, may return LOOKUP + OPEN lock |
IT_READDIR | readdir | used to access/cache directories, may return LOOKUP + UPDATE lock |
IT_GETATTR | lookup/getattr | used to lookup and get attr (final lookup), may return LOOKUP + UPDATE lock |
IT_LOOKUP | lookup | used for intermediate lookups, may return LOOKUP lock |
IT_UNLINK | unused | -- |
IT_TRUNC | unused | -- |
IT_GETXATTR | unused | -- |
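The table above can also be read as a lookup table from intent to the lock bits a reply may carry. The following sketch encodes it that way. The bit values, `struct intent_row`, and `intent_find()` are hypothetical names introduced for illustration; only the intent names and the "may return" combinations come from the table.

```c
#include <stddef.h>
#include <string.h>

/* Placeholder bit values for the three inode bits. */
enum { BIT_LOOKUP = 1, BIT_UPDATE = 2, BIT_OPEN = 4 };

struct intent_row {
    const char *name;       /* intent name as listed in the table */
    unsigned    may_return; /* lock bits the reply may carry */
    int         used;       /* 0 for intents the table marks unused */
};

static const struct intent_row intent_table[] = {
    { "IT_OPEN",     BIT_LOOKUP | BIT_OPEN,   1 },
    { "IT_CREAT",    BIT_LOOKUP | BIT_OPEN,   1 },
    { "IT_READDIR",  BIT_LOOKUP | BIT_UPDATE, 1 },
    { "IT_GETATTR",  BIT_LOOKUP | BIT_UPDATE, 1 },
    { "IT_LOOKUP",   BIT_LOOKUP,              1 },
    { "IT_UNLINK",   0,                       0 },
    { "IT_TRUNC",    0,                       0 },
    { "IT_GETXATTR", 0,                       0 },
};

/* Returns the row for the named intent, or NULL if it is not in the table. */
static const struct intent_row *intent_find(const char *name)
{
    size_t n = sizeof(intent_table) / sizeof(intent_table[0]);

    for (size_t i = 0; i < n; i++)
        if (strcmp(intent_table[i].name, name) == 0)
            return &intent_table[i];
    return NULL;
}
```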
Recovery
Use Cases
ID | Quality Attribute | Summary |
---|---|---|
intermediate lookup | usability | lookup of intermediate component of filename |
final lookup | usability | lookup of final component of filename |
cached lookup | performance | how can user check resolving is still valid |
resolving invalidate | usability | how to invalidate cached resolving |
cached negative resolving | performance | how can user check name still does not exist |
negative resolving invalidate | usability | how can user check name is created now |
mkdir | usability | creating of new directory |
rmdir | usability | directory removal |
link | usability | |
unlink | usability | |
unlink open file | usability, availability | file should be protected from removal until it's closed |
rename | usability | |
stat | usability | |
cached stat | performance | |
stat invalidate | usability | |
readdir | usability | |
cached readdir | performance | |
readdir invalidate | usability | |
open | usability | |
cached open | performance | |
open invalidate | usability | |
close | usability | |
open recovery | availability | during MDS failover, clients recover opens to provide posix semantics of unlink |
open by fid | availability | used by nfsd |
lockless | usability | in some cases results can return with no lock |
Existing MdAPI
Data Structures
    struct lustre_intent_data {
        int   it_disposition;
        int   it_status;
        __u64 it_lock_handle;
        void *it_data;
        int   it_lock_mode;
    };

    struct lookup_intent {
        int it_op;
        int it_flags;
        int it_create_mode;
        union {
            struct lustre_intent_data lustre;
        } d;
    };

    struct mdc_op_data {
        struct ll_fid fid1;
        struct ll_fid fid2;
        struct ll_fid fid3;
        struct ll_fid fid4;
        __u64         mod_time;
        const char   *name;
        int           namelen;
        __u32         create_mode;
        __u32         suppgids[2];
        void         *data;
    };

    struct lustre_md {
        struct mds_body      *body;
        struct lov_stripe_md *lsm;
    #ifdef CONFIG_FS_POSIX_ACL
        struct posix_acl     *posix_acl;
    #endif
    };
mdc_intent_lock
int mdc_intent_lock(struct obd_export *exp, struct mdc_op_data *op_data, void *lmm, int lmmsize, struct lookup_intent *it, int lookup_flags, struct ptlrpc_request **reqp, ldlm_blocking_callback cb_blocking, int extra_lock_flags)
- exp
- export for the connection to MDS
- op_data
- fids/name/etc - operation args
- lmm
- buffer to store striping info
- lmmsize
- available space in lmm buffer
- it
- intent
- lookup_flags
- unused currently
- reqp
- where to store address of request (to be used to access server's reply)
- cb_blocking
- callback routine, called when the corresponding lock is being cancelled, to invalidate the cache
- extra_lock_flags
- extra flags to be passed to LDLM
mdc_intent_lock() is the main entry point for many metadata operations. As described above, some metadata operations are implemented with intents. Without intents, an operation could look like the following:
lock(dir); lookup(dir, name); unlock(dir);
With intents (the current implementation), it looks like this:
lock(dir, intent, reply); find lookup result in reply;
mdc_intent_lock() is the lock() in this pseudo-code.
mdc_intent_lock() is used to look up names, retrieve attributes, create regular files, and open files and directories. It usually returns a DLM lock for the resulting object. The actual bits of the lock may depend on the intent and on the server's policy. If mdc_intent_lock() returns with a lock, the data in the reply may be cached.
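The last rule (cache the reply only when a lock came back) could be expressed on the caller's side roughly as follows. This is a sketch under assumptions: `struct intent_data_sketch` is a reduced copy of lustre_intent_data with `uint64_t` standing in for `__u64`, a zero `it_lock_handle` is assumed to mean "no lock granted", a non-zero `it_status` is assumed to mean failure, and `classify_reply()` is a hypothetical helper.

```c
#include <stdint.h>

/* Reduced copy of lustre_intent_data; uint64_t stands in for __u64. */
struct intent_data_sketch {
    int      it_disposition;
    int      it_status;
    uint64_t it_lock_handle;
    int      it_lock_mode;
};

enum reply_use {
    REPLY_ERROR,    /* operation failed: nothing to use */
    REPLY_USE_ONCE, /* lockless result: valid now, must not be cached */
    REPLY_CACHE     /* lock granted: reply data may be cached */
};

/* Assumes: non-zero it_status is an error, zero it_lock_handle is "no lock". */
static enum reply_use classify_reply(const struct intent_data_sketch *d)
{
    if (d->it_status != 0)
        return REPLY_ERROR;
    if (d->it_lock_handle == 0)
        return REPLY_USE_ONCE;
    return REPLY_CACHE;
}
```

The `REPLY_USE_ONCE` case corresponds to the "lockless" use case listed earlier, where results can return with no lock.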
mdc_set_lock_data
void mdc_set_lock_data(__u64 *l, void *data)
- l
- lock handle
- data
- data to be stored in lock's private field
This function associates a lock with a cache entity, so that when the lock is being cancelled the entity can be found easily.
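The association works roughly like the sketch below. It is a simplified model, not the real implementation: the actual function takes a lock handle (`__u64 *l`), whereas here a hypothetical `struct lock_sketch` is passed directly, and `struct cache_entry`, `set_lock_data()`, and `blocking_cb()` are illustrative names.

```c
#include <stddef.h>

/* Hypothetical lock object with a private data field. */
struct lock_sketch {
    void *l_data;
};

/* Hypothetical cache entity (for example, an inode-like object). */
struct cache_entry {
    int valid;
};

/* Attach the cache entity to the lock right after the lock is granted. */
static void set_lock_data(struct lock_sketch *lock, void *data)
{
    lock->l_data = data;
}

/* Blocking callback: the lock is being cancelled, so invalidate the entity. */
static void blocking_cb(struct lock_sketch *lock)
{
    struct cache_entry *entry = lock->l_data;

    if (entry != NULL)
        entry->valid = 0;
    lock->l_data = NULL;
}
```

Because the lock carries a pointer to the entity, the callback needs no search through the cache to find what to invalidate.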
mdc_readpage
int mdc_readpage(struct obd_export *exp, struct ll_fid *fid, __u64 offset, struct page *page, struct ptlrpc_request **request)
- exp
- export for connection to MDS
- fid
- fid of the directory to be read
- offset
- where to start reading from
- page
- where to store the result (still Linux-specific)
- request
- request where newer attributes of the directory can be fetched from
mdc_setattr
int mdc_setattr(struct obd_export *exp, struct mdc_op_data *op_data, struct iattr *iattr, void *ea, int ealen, void *ea2, int ea2len, struct ptlrpc_request **request)
- exp
- connection to MDS
- op_data
- operation arguments
- iattr
- new attributes (still platform-specific)
- ea
- striping info buffer
- ealen
- striping info buffer length
- ea2
- currently unused
- ea2len
- currently unused
- request
- where address of request will be stored
mdc_setattr() sends a synchronous RPC to the MDS to change attributes. It does not return a lock, so the attributes in the reply must not be cached.
mdc_getattr
int mdc_getattr(struct obd_export *exp, struct ll_fid *fid, obd_valid valid, unsigned int ea_size, struct ptlrpc_request **request)
- exp
- connection to MDS
- fid
- object of interest
- valid
- attributes of interest
- ea_size
- how many bytes of striping info client can store
- request
- where address of request will be stored (to access reply)
mdc_getattr() sends a synchronous RPC to the MDS. It never returns a lock.
mdc_close
int mdc_close(struct obd_export *exp, struct obdo *oa, struct obd_client_handle *och, struct ptlrpc_request **request)
- exp
- export for connection to MDS
- oa
- och
- openhandle returned by mdc_intent_lock() at open time
- request
- where address of request will be stored (to access reply)
mdc_link
int mdc_link(struct obd_export *exp, struct mdc_op_data *op_data, struct ptlrpc_request **request)
mdc_unlink
int mdc_unlink(struct obd_export *exp, struct mdc_op_data *op_data, struct ptlrpc_request **request)
mdc_create
int mdc_create(struct obd_export *exp, struct mdc_op_data *op_data, const void *data, int datalen, int mode, __u32 uid, __u32 gid, __u32 cap_effective, __u64 rdev, struct ptlrpc_request **request)
mdc_rename
int mdc_rename(struct obd_export *exp, struct mdc_op_data *op_data, const char *old, int oldlen, const char *new, int newlen, struct ptlrpc_request **request)
mdc_req2lustre_md
int mdc_req2lustre_md(struct ptlrpc_request *req, int offset, struct obd_export *exp, struct lustre_md *md)
Examples
intermediate lookup
The VFS parses the filename, finds an intermediate (not last) component, and starts the lookup:
- the local cache (dcache) is checked
- if the dentry is found in the dcache, revalidation is called (llite's revalidation checks whether a lock with the LOOKUP bit is still granted to the client)
- if the dentry isn't found in the dcache, the lookup method is called
- llite's lookup prepares an intent and sends an ENQUEUE RPC with the intent
- the MDS receives the RPC, finds the intent, and executes it: the name is looked up to a fid, the corresponding attributes are fetched, and a DLM lock (LOOKUP) is acquired for this resolving. All of this is put in the reply
- the client gets the fid and attributes, finds or creates an inode for the given fid, and adds one more entry to the dcache
    ll_lookup_it()
    mdc_intent_lock(IT_LOOKUP)
    mdc_enqueue()
    iget()
    ldlm_lock_decref()
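The steps above can be modelled with a small client-side simulation. Everything in it is hypothetical (`struct dentry_sketch`, `struct client_sketch`, `enqueue_lookup()`, `lookup_component()`, and the made-up fid values); it does not call any real llite or MDC function. It only shows the control flow: a cached dentry with a held LOOKUP lock avoids the RPC, while a missing or stale one costs one ENQUEUE.

```c
#include <stddef.h>
#include <string.h>

enum { MAX_DENTRIES = 8 };

/* Hypothetical cached resolving: name -> fid, guarded by a LOOKUP lock. */
struct dentry_sketch {
    char          name[32];
    unsigned long fid;
    int           lookup_lock_held;
    int           used;
};

struct client_sketch {
    struct dentry_sketch dcache[MAX_DENTRIES];
    int                  enqueue_rpcs; /* ENQUEUE RPCs sent so far */
};

static struct dentry_sketch *dcache_find(struct client_sketch *c, const char *name)
{
    for (int i = 0; i < MAX_DENTRIES; i++)
        if (c->dcache[i].used && strcmp(c->dcache[i].name, name) == 0)
            return &c->dcache[i];
    return NULL;
}

/* Stand-in for ENQUEUE with an IT_LOOKUP intent; the fid is made up. */
static unsigned long enqueue_lookup(struct client_sketch *c, const char *name)
{
    c->enqueue_rpcs++;
    return 1000UL + (unsigned long)strlen(name);
}

static unsigned long lookup_component(struct client_sketch *c, const char *name)
{
    struct dentry_sketch *d = dcache_find(c, name);

    /* Revalidation: the cached resolving is good while LOOKUP is held. */
    if (d != NULL && d->lookup_lock_held)
        return d->fid;

    /* Miss or stale entry: ask the server, which also grants LOOKUP. */
    unsigned long fid = enqueue_lookup(c, name);

    /* Reuse the stale slot if there is one, otherwise take a free slot. */
    for (int i = 0; i < MAX_DENTRIES && d == NULL; i++)
        if (!c->dcache[i].used)
            d = &c->dcache[i];

    if (d != NULL) { /* if the cache is full, the result is simply not cached */
        strncpy(d->name, name, sizeof(d->name) - 1);
        d->name[sizeof(d->name) - 1] = '\0';
        d->fid = fid;
        d->lookup_lock_held = 1;
        d->used = 1;
    }
    return fid;
}
```

In this model, clearing `lookup_lock_held` plays the role of the blocking callback cancelling the LOOKUP lock, after which the next lookup of the same name goes back to the server.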