WARNING: This is the _old_ Lustre wiki, and it is in the process of being retired. The information found here is all likely to be out of date. Please search the new wiki for more up to date information.

Architecture - Metadata API

Note: The content on this page reflects the state of design of a Lustre feature at a particular point in time and may contain outdated information.

Summary

The Metadata API (MdAPI) is the set of methods the llite layer uses to access and manipulate metadata. This page begins by describing the existing API. Over time, MdAPI should evolve to become clear and sufficiently portable.

Definitions

Lock
an LDLM lock
resolving
a (parent, name, child) triple: a name in a parent directory that points to a child object


Requirements

  1. clear and sufficient (no tricks needed to use it)
  2. portable

Cache Model

The current implementation assumes four types of cache:

  1. resolving (dentry in Linux)
  2. attributes (inode in Linux)
  3. directory (page cache used to store directory entries in Linux)
  4. open files

All of them except open files are maintained by the kernel with help from the filesystem driver (the llite module of Lustre).

All these caches are protected by DLM locks. A special lock type, the inode bit lock, was introduced for this purpose. Each lock may carry up to three bits:

  1. LOOKUP - protects all resolvings to the given object (all of them, because of hard links) plus permissions
  2. UPDATE - protects all metadata attributes except permissions
  3. OPEN - allows the client to cache the open state of a file

Depending on the operation, the client combines bits to access the data it needs. For example, an intermediate lookup only needs to make sure that the component still exists and that its permissions have not changed, so only the LOOKUP bit is required. Once the resolving of an intermediate component is cached, the client does not need to look it up on the server again. And because operations within a directory do not need the LOOKUP bit, the client's resolving stays in the cache even if another client creates or unlinks a file in that directory.
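The bit combination described above can be sketched in C as a simple mask check. This is only an illustration: the constant names, their values, and both helper functions are hypothetical and not taken from the Lustre sources. It also assumes that a create or unlink inside a directory conflicts only with the UPDATE bit held on that directory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit values, chosen only for this sketch. */
enum {
    SKETCH_BIT_LOOKUP = 1u << 0, /* resolvings to the object + permissions */
    SKETCH_BIT_UPDATE = 1u << 1, /* all other metadata attributes */
    SKETCH_BIT_OPEN   = 1u << 2  /* cached open state */
};

/* True if the bits the client holds cover every bit the operation needs. */
bool sketch_lock_covers(uint32_t held, uint32_t needed)
{
    return (held & needed) == needed;
}

/*
 * Assumption of this sketch: when another client creates or unlinks an
 * entry in a directory, only the UPDATE bit on that directory is revoked.
 * The LOOKUP bit, and with it the cached resolving, survives.
 */
uint32_t sketch_after_dir_modified(uint32_t held)
{
    return held & ~(uint32_t)SKETCH_BIT_UPDATE;
}
```

With these helpers, a client holding LOOKUP and UPDATE on a directory can still satisfy an intermediate lookup from its cache after the directory is modified, but has to refetch the attributes.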

Intents

Intents are a mechanism for saving RPCs, which are the most expensive part of metadata handling because of network latency.

Consider open("file", O_CREAT). The Linux VFS breaks this request into a series of calls to the filesystem driver:

  1. lookup to find file if it exists
  2. if file doesn't exist, then create is called
  3. then open is called

Thus 2 or 3 RPCs could be needed if we followed the usual model. Instead, we send the lookup RPC with a special structure called an intent, which describes the real intent of the operation. The MDS does all the required work, and create and open then use the results returned for the first operation.

In reality every operation starts with locking, so it is usually the ENQUEUE RPC that carries the intent.

A few intents are currently defined:
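To make the saving concrete, the following toy model counts RPCs for open("file", O_CREAT) under both models. It is a sketch under stated assumptions: `struct sketch_mds` and both functions are hypothetical, and the "server" is just a counter, not real Lustre code.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy server state: whether the file exists and how many RPCs were served. */
struct sketch_mds {
    bool file_exists;
    int  rpc_count;
};

/* Usual model: one RPC for each call the VFS makes into the driver. */
void sketch_open_creat_plain(struct sketch_mds *mds)
{
    mds->rpc_count++;                 /* lookup */
    if (!mds->file_exists) {
        mds->rpc_count++;             /* create */
        mds->file_exists = true;
    }
    mds->rpc_count++;                 /* open */
}

/* Intent model: a single enqueue RPC carries the open/create intent. */
void sketch_open_creat_intent(struct sketch_mds *mds)
{
    mds->rpc_count++;                 /* ENQUEUE with intent */
    if (!mds->file_exists)
        mds->file_exists = true;      /* server creates the file itself */
    /* The later create and open calls reuse the reply: no further RPCs. */
}
```

In this model the plain path costs 3 RPCs for a new file and 2 for an existing one, while the intent path always costs 1.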

intent       purpose              details
IT_OPEN      open file/directory  may return LOOKUP + OPEN lock
IT_CREAT     create file          used with IT_OPEN, may return LOOKUP + OPEN lock
IT_READDIR   readdir              used to access/cache directories, may return LOOKUP + UPDATE lock
IT_GETATTR   lookup/getattr       used to lookup and get attr (final lookup), may return LOOKUP + UPDATE lock
IT_LOOKUP    lookup               used for intermediate lookups, may return LOOKUP lock
IT_UNLINK    unused               --
IT_TRUNC     unused               --
IT_GETXATTR  unused               --
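The table above can be read as a mapping from intent to the largest set of lock bits a reply may carry. A minimal C sketch of that mapping follows. The enum values and bit constants are hypothetical (only the names echo the table), and the server may grant fewer bits than listed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical lock bit values for this sketch. */
#define SK_LOOKUP 0x1u
#define SK_UPDATE 0x2u
#define SK_OPEN   0x4u

/* Hypothetical enumeration mirroring the intent names in the table. */
enum sketch_intent {
    SK_IT_OPEN,
    SK_IT_CREAT,
    SK_IT_READDIR,
    SK_IT_GETATTR,
    SK_IT_LOOKUP,
    SK_IT_UNLINK,
    SK_IT_TRUNC,
    SK_IT_GETXATTR
};

/* Largest set of bits the table says a reply to this intent may return. */
uint32_t sketch_intent_max_bits(enum sketch_intent it)
{
    switch (it) {
    case SK_IT_OPEN:
    case SK_IT_CREAT:
        return SK_LOOKUP | SK_OPEN;
    case SK_IT_READDIR:
    case SK_IT_GETATTR:
        return SK_LOOKUP | SK_UPDATE;
    case SK_IT_LOOKUP:
        return SK_LOOKUP;
    default:
        return 0;                     /* unused intents return no lock */
    }
}
```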

Recovery

Use Cases

ID                             Quality Attribute        Summary
intermediate lookup            usability                lookup of intermediate component of filename
final lookup                   usability                lookup of final component of filename
cached lookup                  performance              how can user check resolving is still valid
resolving invalidate           usability                how to invalidate cached resolving
cached negative resolving      performance              how can user check name still does not exist
negative resolving invalidate  usability                how can user check name is created now
mkdir                          usability                creating of new directory
rmdir                          usability                directory removal
link                           usability
unlink                         usability
unlink open file               usability, availability  file should be protected from removal until it's closed
rename                         usability
stat                           usability
cached stat                    performance
stat invalidate                usability
readdir                        usability
cached readdir                 performance
readdir invalidate             usability
open                           usability
cached open                    performance
open invalidate                usability
close                          usability
open recovery                  availability             during MDS failover, clients recover opens to provide POSIX semantics of unlink
open by fid                    availability             used by nfsd
lockless                       usability                in some cases results can return with no lock

Existing MdAPI

Data Structures

struct lustre_intent_data {
        int       it_disposition;
        int       it_status;
        __u64     it_lock_handle;
        void     *it_data;
        int       it_lock_mode;
};
struct lookup_intent {
        int     it_op;
        int     it_flags; 
        int     it_create_mode;
        union {
                struct lustre_intent_data lustre;
        } d;
};
struct mdc_op_data {
        struct ll_fid    fid1;
        struct ll_fid    fid2;
        struct ll_fid    fid3;
        struct ll_fid    fid4;
        __u64            mod_time;
        const char      *name;
        int              namelen;
        __u32            create_mode;
        __u32            suppgids[2];
        void            *data;
};

struct lustre_md {
        struct mds_body         *body;
        struct lov_stripe_md    *lsm;
#ifdef CONFIG_FS_POSIX_ACL
        struct posix_acl        *posix_acl;
#endif
};

mdc_intent_lock

int mdc_intent_lock(struct obd_export *exp, struct mdc_op_data *op_data,
                    void *lmm, int lmmsize, struct lookup_intent *it,
                    int lookup_flags, struct ptlrpc_request **reqp,
                    ldlm_blocking_callback cb_blocking, int extra_lock_flags)
exp
export for the connection to MDS
op_data
fids/name/etc - operation args
lmm
buffer to store striping info
lmmsize
available space in lmm buffer
it
intent
lookup_flags
unused currently
reqp
where to store address of request (to be used to access server's reply)
cb_blocking
callback routine, called when the corresponding lock is being cancelled, used to invalidate the cache
extra_lock_flags
extra flags to be passed to LDLM

mdc_intent_lock() is the main entry point for many metadata operations. As described above, some metadata operations are implemented with intents. Without intents, an operation could look like this:

lock(dir);
lookup(dir, name);
unlock(dir);

With intents (the current implementation) it looks like this:

lock(dir, intent, reply);
find lookup result in reply;

mdc_intent_lock() is the lock() call in this pseudocode.

mdc_intent_lock() is used to look up a name, retrieve attributes, create regular files, and open files and directories. It usually returns a DLM lock for the resulting object. The actual bits of the lock may depend on the intent and on server policy. If mdc_intent_lock() returns with a lock, the data in the reply may be cached.
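The caching rule can be illustrated with simplified copies of the intent structures from the Data Structures section. This is a sketch, not the real client code: `sketch_fake_enqueue()` and `sketch_reply_cacheable()` are hypothetical helpers, `uint64_t` stands in for `__u64`, and the convention that a zero `it_lock_mode` means "no lock returned" is an assumption of this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified copies of the structures listed under Data Structures. */
struct lustre_intent_data {
    int       it_disposition;
    int       it_status;
    uint64_t  it_lock_handle;
    void     *it_data;
    int       it_lock_mode;
};

struct lookup_intent {
    int it_op;
    int it_flags;
    int it_create_mode;
    union {
        struct lustre_intent_data lustre;
    } d;
};

/* Hypothetical stand-in for the enqueue: fills the intent as a reply would. */
void sketch_fake_enqueue(struct lookup_intent *it, bool server_grants_lock)
{
    it->d.lustre.it_status = 0;
    it->d.lustre.it_lock_mode = server_grants_lock ? 1 : 0;
    it->d.lustre.it_lock_handle = server_grants_lock ? 0xabcdULL : 0;
}

/* Assumption: reply data may be cached only if a lock came back with it. */
bool sketch_reply_cacheable(const struct lookup_intent *it)
{
    return it->d.lustre.it_status == 0 && it->d.lustre.it_lock_mode != 0;
}
```

A caller would check `sketch_reply_cacheable()` before populating its dentry or inode cache from the reply, and would use the result once without caching it otherwise.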

mdc_set_lock_data

void mdc_set_lock_data(__u64 *l, void *data)
l
lock handle
data
data to be stored in lock's private field

This function associates a lock with a cache entity so that, when the lock is being cancelled, the entity can be found easily.
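The pattern can be sketched as follows: the cached entity is stored in the lock's private field, and the blocking callback later uses that field to find and invalidate it. All types and functions here are hypothetical simplifications. The real mdc_set_lock_data() takes a lock handle rather than a lock pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cached entity, standing in for an inode. */
struct sketch_inode {
    bool attrs_valid;
};

/* Hypothetical lock with a private data field. */
struct sketch_lock {
    void *data;
};

/* Associate the lock with the cache entity it protects. */
void sketch_set_lock_data(struct sketch_lock *lock, void *data)
{
    lock->data = data;
}

/*
 * Blocking callback: runs when the lock is being cancelled. It finds the
 * entity through the lock's private field and invalidates the cached state.
 */
void sketch_blocking_cb(struct sketch_lock *lock)
{
    struct sketch_inode *inode = lock->data;

    if (inode != NULL)
        inode->attrs_valid = false;
    lock->data = NULL;
}
```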

mdc_readpage

int mdc_readpage(struct obd_export *exp, struct ll_fid *fid, __u64 offset,
                 struct page *page, struct ptlrpc_request **request)
exp
export for connection to MDS
fid
fid of the directory to be read
offset
where to start reading from
page
where to store the result (still Linux-specific)
request
request where newer attributes of the directory can be fetched from

mdc_setattr

int mdc_setattr(struct obd_export *exp, struct mdc_op_data *op_data,
                struct iattr *iattr, void *ea, int ealen, void *ea2, int ea2len,
                struct ptlrpc_request **request)
exp
connection to MDS
op_data
operation arguments
iattr
new attributes (still platform-specific)
ea
striping info buffer
ealen
striping info buffer length
ea2
currently unused
ea2len
currently unused
request
where address of request will be stored

mdc_setattr() sends a synchronous RPC to the MDS to change attributes. It does not return a lock, so the attributes in the reply must not be cached.

mdc_getattr

int mdc_getattr(struct obd_export *exp, struct ll_fid *fid,
                obd_valid valid, unsigned int ea_size,
                struct ptlrpc_request **request)
exp
connection to MDS
fid
object of interest
valid
attributes of interest
ea_size
how many bytes of striping info client can store
request
where address of request will be stored (to access reply)

mdc_getattr() sends a synchronous RPC to the MDS. It never returns a lock.

mdc_close

int mdc_close(struct obd_export *exp, struct obdo *oa,
              struct obd_client_handle *och, struct ptlrpc_request **request)
exp
export for connection to MDS
oa
och
open handle returned by mdc_intent_lock() at open time
request
where address of request will be stored (to access reply)

mdc_link

int mdc_link(struct obd_export *exp, struct mdc_op_data *op_data,
             struct ptlrpc_request **request)

mdc_unlink

int mdc_unlink(struct obd_export *exp, struct mdc_op_data *op_data,
               struct ptlrpc_request **request)

mdc_create

int mdc_create(struct obd_export *exp, struct mdc_op_data *op_data,
               const void *data, int datalen, int mode, __u32 uid, __u32 gid,
               __u32 cap_effective, __u64 rdev, struct ptlrpc_request **request)

mdc_rename

int mdc_rename(struct obd_export *exp, struct mdc_op_data *op_data,
               const char *old, int oldlen, const char *new, int newlen,
               struct ptlrpc_request **request)

mdc_req2lustre_md

int mdc_req2lustre_md(struct ptlrpc_request *req, int offset,
                      struct obd_export *exp,
                      struct lustre_md *md)


Examples

intermediate lookup

The VFS parses the filename, finds an intermediate (not last) component, and starts a lookup:

  1. the local cache (dcache) is checked
  2. if the dentry is found in the dcache, revalidation is called (llite's revalidation checks whether a lock with the LOOKUP bit is still granted to the client)
  3. if the dentry is not found in the dcache, the lookup method is called
  4. llite's lookup prepares an intent and sends an ENQUEUE RPC carrying it
  5. the MDS receives the RPC, finds the intent, and executes it: the name is resolved to a fid, the corresponding attributes are fetched, and a DLM lock (LOOKUP) is acquired for this resolving. All of this is put in the reply
  6. the client gets the fid and attributes, finds or creates the inode for the given fid, and adds one more entry to the dcache

The corresponding call sequence:

ll_lookup_it()
  mdc_intent_lock(IT_LOOKUP)
    mdc_enqueue()
  iget()
  ldlm_lock_decref()
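The six steps of the intermediate lookup can be modelled with a toy client-side cache. This is a sketch under stated assumptions: every `sk_` name is hypothetical, the "MDS" is a local hash function instead of an RPC, and the dcache is a small fixed array keyed by name only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SK_DCACHE_SIZE 8

/* Toy dentry: a cached resolving plus the state of its LOOKUP lock. */
struct sk_dentry {
    char     name[32];
    uint64_t fid;
    bool     in_use;
    bool     lookup_lock_granted;
};

struct sk_client {
    struct sk_dentry dcache[SK_DCACHE_SIZE];
    int              enqueue_rpcs;    /* number of ENQUEUE RPCs sent */
};

/* Fake MDS: resolves a name to a fid by hashing it. */
static uint64_t sk_mds_resolve(const char *name)
{
    uint64_t fid = 1;

    while (*name)
        fid = fid * 31 + (unsigned char)*name++;
    return fid;
}

static struct sk_dentry *sk_dcache_find(struct sk_client *c, const char *name)
{
    for (int i = 0; i < SK_DCACHE_SIZE; i++)
        if (c->dcache[i].in_use && strcmp(c->dcache[i].name, name) == 0)
            return &c->dcache[i];
    return NULL;
}

uint64_t sk_lookup(struct sk_client *c, const char *name)
{
    struct sk_dentry *d = sk_dcache_find(c, name);   /* step 1 */

    if (d != NULL && d->lookup_lock_granted)         /* step 2: revalidate */
        return d->fid;

    c->enqueue_rpcs++;                               /* steps 3-5: ENQUEUE */
    uint64_t fid = sk_mds_resolve(name);

    if (d == NULL) {                                 /* step 6: new entry */
        for (int i = 0; i < SK_DCACHE_SIZE && d == NULL; i++)
            if (!c->dcache[i].in_use)
                d = &c->dcache[i];
    }
    if (d != NULL) {                                 /* cache if there is room */
        strncpy(d->name, name, sizeof(d->name) - 1);
        d->name[sizeof(d->name) - 1] = '\0';
        d->fid = fid;
        d->in_use = true;
        d->lookup_lock_granted = true;
    }
    return fid;
}
```

Clearing `lookup_lock_granted` on an entry plays the role of a lock cancellation: the next lookup of that name fails revalidation and goes back to the server.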

final lookup

New Metadata API