WARNING: This is the _old_ Lustre wiki, and it is in the process of being retired. The information found here is all likely to be out of date. Please search the new wiki for more up to date information.

Architecture - New Metadata API


Note: The content on this page reflects the state of design of a Lustre feature at a particular point in time and may contain outdated information.

Requirements

  1. clear, simple, no tricks
  2. portable
  3. support for CMD
  4. support for WBC
  5. support for several llite-like layers at the same time (Lustre, pNFS, etc.)

Proposal

  1. MDAPI consists of primitive metadata operations: lookup, getattr by fid, readdir, create, unlink, open, close, mkdir, rmdir, rename
  2. MDD uses the same API, so that a client can talk to MDD directly
  3. everything network-related (intents, recovery) is extracted into MDC-MDT
  4. open is part of the in-core state needed only for remote clients, so it belongs to MDT-MDC only
  5. an MDD may run on the server and another MDD on the client at the same time. MDD on the server is better for contended resources (such as a /tmp directory); MDD on the client acts as the metadata writeback cache
  6. MDC should provide a notion of context in which the results of a compound operation can be stored; this context is opened and closed by llite (ldlm could probably help, as we hold the lock until the end of the operation)
  7. LMV is a request forwarder only (do we still need striped directories?)
  8. with CMD, transactions are exposed above MDC, as a single transaction can update several servers

Figure: Mdapi.png. Legend:
  1. red lines - Metadata API
  2. blue lines - OSD API
  3. black lines - RPC
  4. Llite - OS-specific part, takes care of caching and locking
  5. MDC - client component turning a set of MDAPI calls into a single RPC
  6. MDT - server component turning an RPC with intent into a set of MDAPI calls
  7. ROSD - Remote OSD, client side sending OSD requests over the network
  8. TOSD - Target OSD, server side receiving OSD requests from the network
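
The idea that MDC and MDD expose the same API, so that llite can sit on top of either one, can be pictured as a single table of function pointers with interchangeable backends. The following is only a minimal sketch under that assumption: `md_ops`, `md_fid`, and every `toy_*` name are hypothetical and are not taken from Lustre code, and the "RPC" is just a counter.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified fid; real Lustre fids carry more fields. */
struct md_fid {
    unsigned long long seq;
    unsigned int oid;
};

/* One operations table shared by every layer that speaks the metadata API. */
struct md_ops {
    const char *name;
    /* Returns 0 and fills *out on success, -1 if the name is absent. */
    int (*lookup)(const struct md_fid *parent, const char *name,
                  struct md_fid *out);
};

static int toy_rpc_count; /* counts simulated network round trips */

/* "MDD-like" backend: resolves names locally from a one-entry directory. */
static int toy_mdd_lookup(const struct md_fid *parent, const char *name,
                          struct md_fid *out)
{
    assert(parent != NULL && name != NULL && out != NULL);
    if (strcmp(name, "foo") != 0)
        return -1;
    out->seq = parent->seq;
    out->oid = parent->oid + 1;
    return 0;
}

/* "MDC-like" backend: same signature, but goes through a pretend RPC. */
static int toy_mdc_lookup(const struct md_fid *parent, const char *name,
                          struct md_fid *out)
{
    toy_rpc_count++;
    /* Stands in for MDT calling MDD on the server side. */
    return toy_mdd_lookup(parent, name, out);
}

static const struct md_ops toy_mdd_ops = { "mdd", toy_mdd_lookup };
static const struct md_ops toy_mdc_ops = { "mdc", toy_mdc_lookup };

/* llite-like caller: it does not care which backend is plugged in. */
static int toy_llite_lookup(const struct md_ops *ops,
                            const struct md_fid *parent, const char *name,
                            struct md_fid *out)
{
    return ops->lookup(parent, name, out);
}
```

Calling `toy_llite_lookup()` with `&toy_mdd_ops` resolves the name without touching `toy_rpc_count`, while `&toy_mdc_ops` returns the same fid at the cost of one simulated round trip.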

API

context - a container for the results of do-ahead operations (intents)

md_init_ctxt()   -- initialize context
md_fini_ctxt()   -- release context
md_lookup()      -- look up name in given parent directory, returning fid (NULL if not found)
md_getattr()     -- fetch attributes for given fid
md_setattr()     -- set attributes for given fid
md_create()      -- create file/directory with given name in given parent
md_open()        -- open file/directory by fid, returns open handle
md_close()       -- close given open handle
md_readdir()     -- return set of directory entries from given fid/hash/offset?
md_unlink()      -- unlink name in given directory (may cause file removal)
md_symlink()     -- create symbolic link
md_rename()      -- rename ...
md_xattr_*()     -- set of calls to manipulate extended attributes
md_statfs()      -- return fs stats (total/available blocks/files, etc)
md_enqueue()     -- enqueue lock (can take intent from context)
md_decref_lock() -- release lock reference (does not cancel the lock)
md_cancel()      -- cancel lock
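
How a context could let later calls be served from the reply to an earlier intent can be sketched as follows. This is a toy model under stated assumptions, not the proposed implementation: all `toy_*` names, the flag values, the fixed fid 42, and the RPC counter are hypothetical stand-ins for `md_init_ctxt()`, `md_enqueue()`, `md_lookup()`, `md_getattr()`, and `md_fini_ctxt()`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical intent flags; the values are illustrative only. */
#define TOY_IT_OPEN  0x1
#define TOY_IT_CREAT 0x2

struct toy_attr {
    unsigned int mode;
    unsigned long long size;
};

/* Context: remembers the intent and whatever the do-ahead reply carried. */
struct toy_ctxt {
    int intent;
    int have_fid;
    int have_attr;
    unsigned long long fid;
    struct toy_attr attr;
};

static int toy_rpcs; /* number of simulated round trips to the server */

static void toy_init_ctxt(struct toy_ctxt *c, int intent)
{
    assert(c != NULL);
    memset(c, 0, sizeof(*c));
    c->intent = intent;
}

static void toy_fini_ctxt(struct toy_ctxt *c)
{
    assert(c != NULL);
    memset(c, 0, sizeof(*c));
}

/* Pretend server: every call is one round trip and describes one object. */
static void toy_server_reply(unsigned long long *fid, struct toy_attr *attr)
{
    toy_rpcs++;
    if (fid != NULL)
        *fid = 42;
    if (attr != NULL) {
        attr->mode = 0100644;
        attr->size = 0;
    }
}

/* Enqueue with intent: one round trip whose reply is stored in the context. */
static void toy_enqueue(struct toy_ctxt *c)
{
    toy_server_reply(&c->fid, &c->attr);
    c->have_fid = 1;
    c->have_attr = 1;
}

/* Lookup: answered from the context if possible, else one more round trip. */
static unsigned long long toy_lookup(struct toy_ctxt *c)
{
    if (!c->have_fid) {
        toy_server_reply(&c->fid, NULL);
        c->have_fid = 1;
    }
    return c->fid;
}

/* Getattr: same pattern as lookup. */
static struct toy_attr toy_getattr(struct toy_ctxt *c)
{
    if (!c->have_attr) {
        toy_server_reply(NULL, &c->attr);
        c->have_attr = 1;
    }
    return c->attr;
}
```

With `toy_enqueue()` called first, the following `toy_lookup()` and `toy_getattr()` cost no further round trips; on a fresh context without an enqueue, the same two calls cost one round trip each.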


Use Cases

final lookup with IT_OPEN + IT_CREAT

The two traces below show the non-WBC case first, then the WBC case.

non-WBC:
ll_lookup()
 ctxt = md_init_ctxt(IT_OPEN | IT_CREAT, parent, name, ...)
 lockh = md_enqueue(ctxt, <parent fid>, LCK_PR, ...);
   mdc_enqueue(ctxt, ...)
     mdt_enqueue()
       lock(pfid, UPDATE, PR)
       pfid = find_obj_by_fid()
       fid = md_lookup(pfid, name)
       if (fid == NULL)
         tx = tx_create()
         tx_declare_write(tx, last_rcvd_file)
         md_declare_create(tx, pfid, name)
           mdd_declare_create(tx, pfid, name)
             osd_tx_declare_insert()
             osd_tx_declare_new_obj()
         md_declare_set_xattr(tx, fid, LOVEA)
           mdd_declare_set_xattr()
             osd_tx_declare_set_xattr()
         md_tx_start(tx)
           mdd_tx_start(tx)
             osd_tx_start(tx)
         fid = md_create(tx, pfid, name)
           mdd_create(tx)
             fid = osd_new_object()
             osd_insert_index(pfid, name, fid)
         md_set_xattr(tx, fid, LOVEA)
           mdd_set_xattr(tx, fid, LOVEA)
             osd_set_xattr(tx, fid, LOVEA)
         mdt_update_last_rcvd()
         md_tx_commit(tx)
           mdd_tx_commit(tx)
             osd_tx_commit(tx)
       md_getattr(fid, &attr)
       lock(fid, OPEN, PR)
       unlock(pfid)
       /* pack lock&attr in reply */
     /* mdc unpacks the reply and stores it in ctxt */
 md_lookup(ctxt, parent, name, &fid)
   mdc_lookup(ctxt, parent, name, &fid)
     /*
 md_getattr(ctxt, fid, &attr)
   mdc_getattr(ctxt, fid, &attr)
 inode = iget(FID2INO(fid), ll_test_inode, ll_set_inode, &attr);
 md_open(ctxt, fid, mode)
   mdc_open(ctxt, fid, mode)
 md_decref_lock(lockh);
 md_fini_ctxt(ctxt);

WBC:
ll_lookup()
 ctxt = md_init_ctxt(IT_OPEN | IT_CREAT, parent, name, ...)
 lockh = md_enqueue(ctxt, <parent fid>, LCK_PR, ...);
 md_lookup(ctxt, parent, name, &fid)
   mdd_lookup(ctxt, parent, name, fid)
     rosd_lookup(ctxt, parent, name, fid)
 if (fid == NULL)
   md_tx_create()
     mdd_tx_create()
       rosd_tx_create()
   md_tx_declare_create()
     mdd_tx_declare_create()
       rosd_tx_declare_create()
   md_tx_declare_set_xattr()
     mdd_tx_declare_set_xattr()
       rosd_tx_declare_set_xattr()
   md_tx_start()
     mdd_tx_start()
       rosd_tx_start()
   md_create(parent, name)
     mdd_create()
       rosd_new_object()
       rosd_insert_index(pfid, name, fid)
   md_set_xattr(tx, fid, LOVEA)
     mdd_set_xattr()
       rosd_set_xattr()
   md_tx_commit()
   /* put details in the context */
 md_getattr(ctxt, fid, &attr)
   mdd_getattr(ctxt, fid, &attr)
     rosd_getattr()
 inode = iget(FID2INO(fid), ll_test_inode, ll_set_inode, &attr);
 md_open(ctxt, fid, mode)
   /* here we somehow pin the object */
   /* not sure whether this is the responsibility of MDD or of an upper layer */
   mdd_open(ctxt, fid, mode)
 md_decref_lock(lockh);
 md_fini_ctxt(ctxt);
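
Both traces follow the same transaction pattern: create a transaction, declare every update that will be made, start it, perform the updates, then commit. A minimal sketch of that ordering is given below. It assumes that declarations reserve a number of updates and that an update without a prior declaration should fail; all `toy_*` names and the counting scheme are hypothetical and only model the sequence of calls, not the real MDD/OSD transaction code.

```c
#include <assert.h>
#include <stddef.h>

enum toy_tx_state { TOY_TX_DECLARING, TOY_TX_STARTED, TOY_TX_COMMITTED };

struct toy_tx {
    enum toy_tx_state state;
    int declared; /* updates reserved during the declare phase */
    int used;     /* updates actually performed after start */
};

static void toy_tx_create(struct toy_tx *tx)
{
    assert(tx != NULL);
    tx->state = TOY_TX_DECLARING;
    tx->declared = 0;
    tx->used = 0;
}

/* Declare phase: only legal before the transaction is started. */
static int toy_tx_declare(struct toy_tx *tx, int nr_updates)
{
    if (tx->state != TOY_TX_DECLARING || nr_updates <= 0)
        return -1;
    tx->declared += nr_updates;
    return 0;
}

static int toy_tx_start(struct toy_tx *tx)
{
    if (tx->state != TOY_TX_DECLARING)
        return -1;
    tx->state = TOY_TX_STARTED;
    return 0;
}

/* Perform one update; fails if not started or if nothing is left declared. */
static int toy_tx_update(struct toy_tx *tx)
{
    if (tx->state != TOY_TX_STARTED || tx->used >= tx->declared)
        return -1;
    tx->used++;
    return 0;
}

static int toy_tx_commit(struct toy_tx *tx)
{
    if (tx->state != TOY_TX_STARTED)
        return -1;
    tx->state = TOY_TX_COMMITTED;
    return 0;
}

/* Mirrors the create path: declare everything, start, execute, commit. */
static int toy_create_with_xattr(struct toy_tx *tx)
{
    toy_tx_create(tx);
    if (toy_tx_declare(tx, 2) != 0) /* index insert + new object */
        return -1;
    if (toy_tx_declare(tx, 1) != 0) /* xattr (LOVEA in the trace) */
        return -1;
    if (toy_tx_start(tx) != 0)
        return -1;
    if (toy_tx_update(tx) != 0 ||   /* new object */
        toy_tx_update(tx) != 0 ||   /* index insert */
        toy_tx_update(tx) != 0)     /* set xattr */
        return -1;
    return toy_tx_commit(tx);
}
```

In this model an update attempted before `toy_tx_start()`, after `toy_tx_commit()`, or beyond what was declared returns -1, which is one plausible reason for keeping the declare calls separate from the calls that do the work.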

intermediate lookup

intermediate revalidate

final revalidate

mkdir

unlink/rmdir

rename

readdir

lock cancel

Problems

  1. without intents in the VFS, stat(2) may have to issue 2 RPCs to the MDS: one from ->lookup(), another from ->getattr(). Is a solution needed?


Implementation steps

  1. restructure llite and MDC to use the new API
  2. IFF we want MDWBC implemented this way, restructure MDT/MDD and implement ROSD/TOSD