Architecture - Metadata API

Note: The content on this page reflects the state of design of a Lustre feature at a particular point in time and may contain outdated information.

Summary
The Metadata API (MdAPI) is the set of methods the llite layer uses to access and manipulate metadata. This page starts by describing the existing API. Over time, MdAPI should evolve to become clear and sufficiently portable.

Definitions

 * Lock: LDLM lock
 * resolving: a parent/name/child triple, that is, a name in a parent directory that points to a child object

Requirements

 * 1) clear and sufficient (no tricks needed to use it)
 * 2) portable

Cache Model
The current implementation assumes four types of cache:
 * 1) resolving (dentry in Linux)
 * 2) attributes (inode in Linux)
 * 3) directory (page cache used to store directory entries in Linux)
 * 4) open files

All of them except open files are maintained by the kernel with help from the filesystem driver (the llite module of Lustre).

All these caches are protected by DLM locks. A special type of lock, the inode bitlock, was introduced. Each lock may carry up to 3 bits:
 * 1) LOOKUP - protects all resolvings to the given object (all, because of hardlinks) plus permissions
 * 2) UPDATE - protects all metadata attributes except permissions
 * 3) OPEN - allows the client to cache the open state of a file

Depending on the actual operation, the client combines bits to access data. For example, an intermediate lookup only needs to make sure the component still exists and its permissions haven't changed, so only the LOOKUP bit is required. Once the resolving of an intermediate component is cached, the client doesn't need to look it up on the server again. And since operations within a directory don't need the LOOKUP bit, the client's resolving stays in cache even if another client creates or unlinks a file in that directory.
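The bit combination described above can be sketched in C. This is only an illustration: the identifiers and bit values below are hypothetical (only the three bit names come from the text), and lock modes are ignored entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names and values; only LOOKUP/UPDATE/OPEN come from the text. */
enum ibits {
    IBIT_LOOKUP = 1u << 0, /* resolvings to the object, plus permissions */
    IBIT_UPDATE = 1u << 1, /* all other metadata attributes */
    IBIT_OPEN   = 1u << 2  /* cached open state of a file */
};

/* The scheme only works if every bit is distinct. */
static_assert((IBIT_LOOKUP & IBIT_UPDATE) == 0 &&
              (IBIT_LOOKUP & IBIT_OPEN) == 0 &&
              (IBIT_UPDATE & IBIT_OPEN) == 0,
              "inode lock bits must not overlap");

/* A cached lock covers an operation if it holds every bit the operation needs. */
bool ibits_covers(uint32_t held, uint32_t needed)
{
    return (held & needed) == needed;
}

/* In this simplified model, two bit locks conflict only if they share a bit. */
bool ibits_conflict(uint32_t held, uint32_t requested)
{
    return (held & requested) != 0;
}
```

In this model a client holding only LOOKUP on a directory is not disturbed by another client that needs UPDATE on the same directory, which matches the create/unlink example in the paragraph above.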

Intents
Intents are a mechanism to save RPCs, which are the most expensive part of metadata handling due to network latency.

Consider open("file", O_CREAT). The Linux VFS breaks this request into a series of calls to the filesystem driver:
 * 1) lookup to find file if it exists
 * 2) if file doesn't exist, then create is called
 * 3) then open is called

Thus, 2 or 3 RPCs could be needed if we followed the usual model. Instead, we send the lookup RPC with a special structure called an intent, which describes the real intent of the operation. The MDS does all the required work, and then create and open use the results returned to the first operation.

In reality, any operation starts with locking, so it is usually the ENQUEUE RPC that carries the intent.
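The RPC saving can be illustrated with a toy model of open("file", O_CREAT). Everything here is hypothetical (the toy_ names, the single-file server state); it only counts round trips and does not reflect the real MDS code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy MDS state for a single file name. All names here are hypothetical. */
struct toy_mds {
    bool file_exists;
    bool file_open;
    int  rpcs_handled; /* number of requests the server has seen */
};

/* Usual model: every VFS call becomes its own RPC. */
bool toy_rpc_lookup(struct toy_mds *mds)
{
    assert(mds != NULL);
    mds->rpcs_handled++;
    return mds->file_exists;
}

void toy_rpc_create(struct toy_mds *mds)
{
    assert(mds != NULL);
    mds->rpcs_handled++;
    mds->file_exists = true;
}

void toy_rpc_open(struct toy_mds *mds)
{
    assert(mds != NULL);
    mds->rpcs_handled++;
    mds->file_open = true;
}

/* open(O_CREAT) without intents: lookup, optional create, then open. */
void toy_open_creat_classic(struct toy_mds *mds)
{
    if (!toy_rpc_lookup(mds))
        toy_rpc_create(mds);
    toy_rpc_open(mds);
}

/* open(O_CREAT) with an intent: one request, the server does all the work. */
void toy_open_creat_intent(struct toy_mds *mds)
{
    assert(mds != NULL);
    mds->rpcs_handled++;
    if (!mds->file_exists)
        mds->file_exists = true; /* create on behalf of the client */
    mds->file_open = true;       /* open on behalf of the client */
}
```

Starting from a server where the file does not exist, the classic path costs three requests in this model and the intent path costs one.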

A few intents are currently defined:

Data Structures
struct lustre_intent_data {
        int      it_disposition;
        int      it_status;
        __u64    it_lock_handle;
        void    *it_data;
        int      it_lock_mode;
};

struct lookup_intent {
        int    it_op;
        int    it_flags;
        int    it_create_mode;
        union {
                struct lustre_intent_data lustre;
        } d;
};

struct mdc_op_data {
        struct ll_fid   fid1;
        struct ll_fid   fid2;
        struct ll_fid   fid3;
        struct ll_fid   fid4;
        __u64           mod_time;
        const char     *name;
        int             namelen;
        __u32           create_mode;
        __u32           suppgids[2];
        void           *data;
};

struct lustre_md {
        struct mds_body        *body;
        struct lov_stripe_md   *lsm;
#ifdef CONFIG_FS_POSIX_ACL
        struct posix_acl       *posix_acl;
#endif
};
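As a rough sketch of how a caller might prepare a lookup_intent for open(..., O_CREAT), the following copies the two intent structures from above (with uint64_t standing in for __u64 so it compiles on its own). The IT_OPEN and IT_CREAT values, the helper name intent_init_open, and the choice to store the raw open flags in it_flags are assumptions made for illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative values only; the text does not list the defined intents. */
#define IT_OPEN  0x0001
#define IT_CREAT 0x0002

/* Copies of the structures above, with uint64_t in place of __u64. */
struct lustre_intent_data {
    int       it_disposition;
    int       it_status;
    uint64_t  it_lock_handle;
    void     *it_data;
    int       it_lock_mode;
};

struct lookup_intent {
    int it_op;
    int it_flags;
    int it_create_mode;
    union {
        struct lustre_intent_data lustre;
    } d;
};

/* Hypothetical helper: describe an open, optionally with create, as one intent. */
void intent_init_open(struct lookup_intent *it, int open_flags, int create_mode)
{
    assert(it != NULL);
    memset(it, 0, sizeof(*it));      /* no lock handle, no disposition yet */
    it->it_op = IT_OPEN;
    if (open_flags & O_CREAT)
        it->it_op |= IT_CREAT;       /* server may create the file if missing */
    it->it_flags = open_flags;       /* assumption: raw open flags are kept */
    it->it_create_mode = create_mode;
}
```

The fields in d.lustre start out zeroed; in this sketch they are meant to be filled in from the server's reply.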

mdc_intent_lock
int mdc_intent_lock(struct obd_export *exp, struct mdc_op_data *op_data,
                    void *lmm, int lmmsize, struct lookup_intent *it,
                    int lookup_flags, struct ptlrpc_request **reqp,
                    ldlm_blocking_callback cb_blocking, int extra_lock_flags)


 * exp: export for the connection to MDS
 * op_data: fids/name/etc - operation args
 * lmm: buffer to store striping info
 * lmmsize: available space in lmm buffer
 * it: intent
 * lookup_flags: unused currently
 * reqp: where to store address of request (to be used to access server's reply)
 * cb_blocking: callback routine, called when the corresponding lock is being cancelled, to invalidate the cache
 * extra_lock_flags: extra flags to be passed to LDLM

mdc_intent_lock is the major entry point for many metadata operations. As described above, some metadata operations are implemented with intents. Without intents, an operation could look like the following:

lock(dir); lookup(dir, name); unlock(dir);

With intents (the current behaviour), it looks this way:

lock(dir, intent, reply); find lookup result in reply;

mdc_intent_lock plays the role of lock() in this pseudo-code.

mdc_intent_lock is used to look up a name, retrieve attributes, create regular files, and open files/directories. It usually returns a DLM lock for the resulting object. The actual bits of the lock may depend on the intent and the server's policy. If mdc_intent_lock returns with a lock, then the data in the reply may be cached.
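The caching rule can be sketched with a toy stand-in for lock(dir, intent, reply). The toy_ names, the reply layout, and the one-entry directory are all hypothetical; the only point carried over from the text is that reply data is cached only when a lock came back with it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical reply: what a toy server returns for lock(dir, intent). */
struct toy_reply {
    uint64_t lock_handle; /* 0 means no lock was granted */
    uint64_t child_fid;   /* lookup result; 0 means name not found */
};

/*
 * Toy server with a directory holding a single entry "a" -> fid 42.
 * One call performs both the locking and the lookup.
 * Returns 0 if the name was found, -1 otherwise.
 */
int toy_intent_lock(const char *name, bool grant_lock, struct toy_reply *reply)
{
    assert(name != NULL && reply != NULL);
    reply->child_fid = (strcmp(name, "a") == 0) ? 42 : 0;
    reply->lock_handle = grant_lock ? 0x1001 : 0;
    return (reply->child_fid != 0) ? 0 : -1;
}

/* Client-side rule: reply data may be cached only under a granted lock. */
bool toy_reply_cacheable(const struct toy_reply *reply)
{
    assert(reply != NULL);
    return reply->lock_handle != 0;
}
```

The grant_lock parameter stands in for the server's policy, which in this model alone decides whether the client may keep the result.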

mdc_set_lock_data
void mdc_set_lock_data(__u64 *l, void *data)


 * l: lock handle
 * data: data to be stored in lock's private field

This function associates a lock with a cache entity so that, when the lock is being cancelled, the entity can be found easily.
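A minimal sketch of this association, assuming a lock object with a private pointer and a blocking callback that receives the lock. The toy_ types and functions are hypothetical and do not mirror the LDLM structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cache entity, standing in for an inode or dentry. */
struct toy_inode {
    bool attrs_valid;
};

/* Hypothetical lock with a private data field. */
struct toy_lock {
    void *private_data;
};

/* Plays the role of mdc_set_lock_data: remember what the lock protects. */
void toy_set_lock_data(struct toy_lock *lock, void *data)
{
    assert(lock != NULL);
    lock->private_data = data;
}

/* Blocking callback: on cancel, find the entity via the lock and invalidate it. */
void toy_blocking_cb(struct toy_lock *lock)
{
    struct toy_inode *inode;

    assert(lock != NULL);
    inode = lock->private_data;
    if (inode != NULL)
        inode->attrs_valid = false; /* cached attributes are no longer protected */
    lock->private_data = NULL;      /* a second cancel finds nothing to do */
}
```

Without the stored pointer, the callback would have to search the cache for whatever the cancelled lock used to protect.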

mdc_readpage
int mdc_readpage(struct obd_export *exp, struct ll_fid *fid, __u64 offset,
                 struct page *page, struct ptlrpc_request **request)


 * exp: export for connection to MDS
 * fid: fid of the directory to be read
 * offset: where to start reading from
 * page: where to store the result (still Linux-specific)
 * request: request from which newer attributes of the directory can be fetched

mdc_setattr
int mdc_setattr(struct obd_export *exp, struct mdc_op_data *op_data,
                struct iattr *iattr, void *ea, int ealen, void *ea2, int ea2len,
                struct ptlrpc_request **request)


 * exp: connection to MDS
 * op_data: operation arguments
 * iattr: new attributes (still platform-specific)
 * ea: striping info buffer
 * ealen: striping info buffer length
 * ea2: currently unused
 * ea2len: currently unused
 * request: where the address of the request will be stored

mdc_setattr sends a synchronous RPC to the MDS to change attributes. It does not return any lock, so the attributes in the reply must not be cached.

mdc_getattr
int mdc_getattr(struct obd_export *exp, struct ll_fid *fid,
                obd_valid valid, unsigned int ea_size,
                struct ptlrpc_request **request)


 * exp: connection to MDS
 * fid: object of interest
 * valid: attributes of interest
 * ea_size: how many bytes of striping info the client can store
 * request: where the address of the request will be stored (to access the reply)

mdc_getattr sends a synchronous RPC to the MDS. It never returns a lock.

mdc_close
int mdc_close(struct obd_export *exp, struct obdo *oa,
              struct obd_client_handle *och, struct ptlrpc_request **request)


 * exp: export for connection to MDS
 * oa:
 * och: open handle returned by mdc_intent_lock at open time
 * request: where the address of the request will be stored (to access the reply)

mdc_link
int mdc_link(struct obd_export *exp, struct mdc_op_data *op_data,
             struct ptlrpc_request **request)

mdc_unlink
int mdc_unlink(struct obd_export *exp, struct mdc_op_data *op_data,
               struct ptlrpc_request **request)

mdc_create
int mdc_create(struct obd_export *exp, struct mdc_op_data *op_data,
               const void *data, int datalen, int mode, __u32 uid, __u32 gid,
               __u32 cap_effective, __u64 rdev, struct ptlrpc_request **request)

mdc_rename
int mdc_rename(struct obd_export *exp, struct mdc_op_data *op_data,
               const char *old, int oldlen, const char *new, int newlen,
               struct ptlrpc_request **request)

mdc_req2lustre_md
int mdc_req2lustre_md(struct ptlrpc_request *req, int offset,
                      struct obd_export *exp,
                      struct lustre_md *md)

intermediate lookup
The VFS parses the filename, finds an intermediate (not last) component and starts the lookup:
 * 1) the local cache (dcache) is checked
 * 2) if the dentry is found in dcache, then revalidation is called (llite's revalidation checks whether a lock with the LOOKUP bit is still granted to the client)
 * 3) if the dentry isn't found in dcache, then the lookup method is called
 * 4) llite's lookup prepares an intent and sends an ENQUEUE RPC with the intent
 * 5) the MDS receives the RPC, finds the intent and executes it: the name is looked up to a fid, the corresponding attributes are fetched, and a DLM lock is acquired for this resolving (LOOKUP). All of this is put in the reply
 * 6) the client gets the fid and attributes, finds or creates a new inode for the given fid, and adds one more entry into dcache

ll_lookup_it
mdc_intent_lock(IT_LOOKUP)
mdc_enqueue
iget
ldlm_lock_decref
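The six steps above can be sketched as a toy client-side cache. All identifiers are hypothetical, the "server" simply derives a fid from the name length, and the cache is a tiny fixed array, so this only illustrates when an ENQUEUE would be sent, not how llite or the dcache actually work.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_DCACHE_SIZE 8
#define TOY_NAME_MAX    32

/* Hypothetical dentry: a cached resolving plus the state of its LOOKUP lock. */
struct toy_dentry {
    char          name[TOY_NAME_MAX];
    unsigned long fid;
    bool          lookup_lock_granted;
    bool          in_use;
};

struct toy_client {
    struct toy_dentry dcache[TOY_DCACHE_SIZE];
    int               enqueue_rpcs; /* ENQUEUE requests sent so far */
};

/* Stand-in for ENQUEUE with a lookup intent; the fid is made up from the name. */
unsigned long toy_enqueue_lookup(struct toy_client *c, const char *name)
{
    c->enqueue_rpcs++;
    return 100 + (unsigned long)strlen(name);
}

/* Step 1: check the local cache. */
struct toy_dentry *toy_dcache_find(struct toy_client *c, const char *name)
{
    for (int i = 0; i < TOY_DCACHE_SIZE; i++)
        if (c->dcache[i].in_use && strcmp(c->dcache[i].name, name) == 0)
            return &c->dcache[i];
    return NULL;
}

/* Returns the fid of the component, or 0 if the toy cache is full. */
unsigned long toy_lookup_component(struct toy_client *c, const char *name)
{
    struct toy_dentry *d;

    assert(c != NULL && name != NULL && strlen(name) < TOY_NAME_MAX);
    d = toy_dcache_find(c, name);

    /* Step 2: revalidation succeeds while the LOOKUP lock is still granted. */
    if (d != NULL && d->lookup_lock_granted)
        return d->fid;

    /* Step 3: not cached, so pick a free slot for the new entry. */
    if (d == NULL) {
        for (int i = 0; i < TOY_DCACHE_SIZE && d == NULL; i++)
            if (!c->dcache[i].in_use)
                d = &c->dcache[i];
        if (d == NULL)
            return 0; /* a real cache would evict instead */
        strcpy(d->name, name); /* length checked by the assert above */
        d->in_use = true;
    }

    /* Steps 4-6: ask the server, then cache fid and lock state. */
    d->fid = toy_enqueue_lookup(c, name);
    d->lookup_lock_granted = true;
    return d->fid;
}
```

A second lookup of the same component sends no request in this model; clearing lookup_lock_granted (as a blocking callback would) forces the next lookup back to the server.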

final lookup
New_Metadata_API