Architecture - Lustre Logging API

Lustre Logging API

Introduction
Lustre needs a logging API in numerous places (orphan recovery, RAID1 synchronization, configuration), all associated with updates of persistent information on multiple systems. Generally, logs are written transactionally and cancelled when a commit on another system completes.

Log records are stored in log objects. Log objects are currently implemented as files, and possibly, in some minor ways, the APIs below reflect this. In this discussion, we speak of log objects and sometimes of llogs (lustre-logs).

API Requirements
Some of the key requirements that define the design of these APIs are:
 * The API should be usable through methods
 * The methods should not reveal if the API is being used locally or invoked remotely
 * Logs only grow
 * Logs can be removed; remote callers may not assume that open logs will remain available
 * Access to logs should be through stateless APIs that can be invoked remotely
 * Access to logs should go through some kind of authorization/authentication system

Logs.
(1) Log objects can be identified in two ways:
 * (a) Through a name. The interpretation of the name is up to the driver servicing the call. Typical examples of named logs are files identified by a path name, text versions of UUIDs, and profile names.
 * (b) Through an object identifier or llog-log identifier. A directory of llogs which can look up a name to get an id can provide translation from a naming system to an id-based system. In our implementation, we use a file system directory to provide this catalog function.

(2) Logs only contain records

(3) Records in the logs have the following structure:
 * llog_rec_hdr: a header indicating the index, length and type. The header is 16 bytes long
 * Body, which is an opaque, 32-bit aligned blob
 * llog_rec_tail: length and index of the record for walking backwards. It is 8 bytes long

(4) The first record in every log is a 4K long llog_log_rec. The body of this record contains:
 * a bitmap of records that have been allocated; bit 0 is set immediately because the header itself occupies it
 * A collection of log records behind the header

(5) Records can be accessed by:
 * iterating through a specific log
 * providing a llog_cookie, which contains the struct llog_logid of the log and the offset in the log file where the record resides.

(6) Some logs are potentially very large, for example replication logs, and require a hierarchical structure. A catalog of logs is held at the top level. In some cases the catalog structure is two levels deep:
 * A catalog API is provided which exploits the lower lustre log API
 * Catalog entries are log entries in the catalog log which contain the log id of the log file concerned.
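The header/body/tail framing can be sketched in a few lines of C. This is an illustration only: the struct and function names (rec_hdr, rec_tail, rec_append, rec_prev) and the field order are assumptions, not the Lustre definitions. It assumes a 16-byte header and a tail that holds just a length and an index, and it shows why the tail makes backward walking possible.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout: 16-byte header, tail with length and index only. */
struct rec_hdr  { uint32_t len; uint32_t index; uint32_t type; uint32_t pad; };
struct rec_tail { uint32_t len; uint32_t index; };

/* Total length of a record whose body is body_len bytes; the body is
 * padded up to a 32-bit boundary. */
static uint32_t rec_total_len(uint32_t body_len)
{
    uint32_t padded = (body_len + 3u) & ~3u;

    return (uint32_t)(sizeof(struct rec_hdr) + padded + sizeof(struct rec_tail));
}

/* Append one record at offset off; returns the offset just past it.
 * The caller must provide a buffer that is large enough. */
static size_t rec_append(unsigned char *buf, size_t off, uint32_t index,
                         uint32_t type, const void *body, uint32_t body_len)
{
    struct rec_hdr hdr = { rec_total_len(body_len), index, type, 0 };
    struct rec_tail tail = { hdr.len, index };
    size_t body_off = off + sizeof(hdr);

    memcpy(buf + off, &hdr, sizeof(hdr));
    memset(buf + body_off, 0, hdr.len - sizeof(hdr) - sizeof(tail));
    if (body_len > 0)
        memcpy(buf + body_off, body, body_len);
    memcpy(buf + off + hdr.len - sizeof(tail), &tail, sizeof(tail));
    return off + hdr.len;
}

/* Given the offset just past a record, find where that record starts by
 * reading only its tail. The record index is stored in *index. */
static size_t rec_prev(const unsigned char *buf, size_t end, uint32_t *index)
{
    struct rec_tail tail;

    assert(end >= sizeof(tail));
    memcpy(&tail, buf + end - sizeof(tail), sizeof(tail));
    assert(tail.len <= end);
    *index = tail.index;
    return end - tail.len;
}
```

Starting from the end of the log, repeated calls to rec_prev visit the records in reverse order without scanning from the beginning.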

Logging contexts.
Each obd device has an array of logging contexts (struct llog_ctxt). The contexts contain:
(1) The generation of the logs. This is a 128 bit integer consisting of the mount count of the originating device and the connection count to the replicators.
(2) A handle to an open log (struct llog_handle *loc_handle)
(3) A pointer to the logging commit daemon (struct llog_canceld_ctxt *loc_llcd)
(4) A pointer to the containing obd (struct obd_device *loc_obd)
(5) An export to the storage obd for the logs (struct obd_export *loc_exp)
(6) A method table (struct llog_operations *loc_logops):
 * lop_destroy: destroy a log
 * lop_create: create/open a log
 * lop_next_block: read next block in a log
 * lop_close: close a log
 * lop_read_header: read the header in a log
 * lop_setup: set up a logging subsystem
 * lop_add: add a record to a log
 * lop_cancel: cancel a log record
 * lop_connect: start a logging connection. This is called by the originator to initiate cancellation handling and log recovery processing on the replicator’s side. The originator calls this from a few places in the recovery state machine.
 * lop_write_rec: write a log record
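The combination of a context and a method table is what lets callers stay unaware of whether a log is local or remote. A minimal sketch of that dispatch pattern follows. All names and signatures here (struct ctxt, struct log_ops, ctxt_add, the in-memory backend) are hypothetical simplifications and are not the signatures used by struct llog_operations.

```c
#include <assert.h>
#include <stddef.h>

struct ctxt;

/* A two-method subset of a method table; signatures are assumptions. */
struct log_ops {
    int (*lop_add)(struct ctxt *ctxt, const void *rec, size_t len);
    int (*lop_cancel)(struct ctxt *ctxt, int index);
};

struct ctxt {
    const struct log_ops *ops;   /* method table */
    void *private_data;          /* backend state, e.g. an open log */
};

/* Callers use these wrappers and never see which backend is in use. */
static int ctxt_add(struct ctxt *ctxt, const void *rec, size_t len)
{
    assert(ctxt != NULL && ctxt->ops != NULL);
    return ctxt->ops->lop_add ? ctxt->ops->lop_add(ctxt, rec, len) : -1;
}

static int ctxt_cancel(struct ctxt *ctxt, int index)
{
    assert(ctxt != NULL && ctxt->ops != NULL);
    return ctxt->ops->lop_cancel ? ctxt->ops->lop_cancel(ctxt, index) : -1;
}

/* In-memory backend standing in for a local log. It only counts live
 * records and does not detect a record being cancelled twice. */
struct mem_log { int next_index; int live; };

static int mem_add(struct ctxt *ctxt, const void *rec, size_t len)
{
    struct mem_log *log = ctxt->private_data;

    (void)rec;
    (void)len;
    log->live++;
    return log->next_index++;    /* index of the new record */
}

static int mem_cancel(struct ctxt *ctxt, int index)
{
    struct mem_log *log = ctxt->private_data;

    if (index < 1 || index >= log->next_index || log->live == 0)
        return -1;
    log->live--;
    return 0;
}

static const struct log_ops mem_ops = { mem_add, mem_cancel };
```

A remote backend would supply a different table whose methods queue RPCs, while code that calls ctxt_add and ctxt_cancel stays unchanged.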

Llog connections and the Cancellation API
This section describes the typical use of the logging API to manage distributed commit of related persistent updates. The next section describes the recovery in case of network or system failures. We consider systems that make related updates and use the following definitions:
 * Originator: the first system performing a transaction
 * Replicators: one or more other systems performing a related persistent update

The key requirement is that the replicators must complete their updates if the originators do, even if the originating systems crash or the replicators roll back. Note that we do not require that the system remains invariant under rollback of the originator.

This goal is achieved by transactionally recording the originator’s action in a log. When the replicator’s related action commits, it cancels the log entry on the originator. In the subsequent sections, we describe the handshake and protocols involved.

Llog connections.
In order to process cancellation and recovery actions, the originators and replicators use a ptlrpc connection to execute remote procedure calls. The connection can be set up on the originator or the replicator and we call the system setting up the connection the initiator and the target of that connection event the receptor. The connection is used symmetrically, that is, the originator and replicator can either be the initiator or the receptor. The obd device structure has an optional llog_obd_ctxt which holds a pointer to the import to be used for queuing rpc’s.
 * The originator and the replicator establish a connection. These are the usual connections used by other subsystems.
 * The logging subsystem on the originator invokes the lop_connect method towards the replicator. The lop_connect call sends the logid’s of the open catalog from the originator to the replicator.
 * Just prior to sending this, the originator context increases its generation and includes the generation and the logid in the lop_connect method, usually calling llog_orig_connect.
 * The replicator now receives a llog_connect RPC. The handler is the replicator’s lop_connect (usually llog_repl_connect). This method first increases the llcd’s generation, then initiates processing of the logs.

The cancellation daemon.
A replicator runs a subsystem responsible for collecting pages of cookies and sending them to the originator for cancellation of the origin log records. This is done as a side effect of committing the replicating transaction on the replicator.

A key element in the cancellation is to distinguish between old and new cookies. Old cookies are those that have a generation smaller than the current generation, new cookies have the current generation. The generation is present in the llog_context, hence it is both on the server and on the client. The cancellation context is responsible for the queueing of cancel cookies. For each originator it is in one of two states:
(1) Accepting cookies for cancellation
(2) Dropping cookies for cancellation

The context switches from 1 to 2 if a timeout occurs on the cancellation rpc. It switches from 2 to 1 in two cases:
(1) A cookie is presented with an llog_generation bigger than the one held in the context
(2) The replicator receives a llog_connect method (which will also carry a new llog_generation)

The llog_generation is an increasing sequence of 128 bit integers whose high-order bits are the boot count of the originator and whose low-order bits are the obd_conncnt between the originator and the replicator. The originator increases its generation just before sending the llog_connect call; the replicator increases it just prior to beginning the handling of recovery when receiving an llog_connect call.

Normal operation.
Under normal operation, the originator performs a transaction and as a part of the transaction, writes a log record for each replicator. The following steps are then followed to ensure that the replicator is updated with a copy:


 * The log record creation, done with lop_add, produces a log_cookie
 * The log_cookie is sent to the replicator, through a means that we do not discuss here.
 * The replicator performs the related transaction and executes a commit callback for that. The callback indicates that the log_cookie can be put up for cancellation. The function lop_cancel is responsible for this queuing of the cancellation.
 * When the replicator has a page full of cancellation cookies, it sends the cookies to the originator
 * The originator cancels the log records associated with the cookies and cleans up the empty log files. The handling function is llog_handle_cancel and it invokes the originator’s lop_cancel functions to remove the log record.
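The step in which the replicator collects cookies until a page is full can be sketched as below. The page size, the cookie fields, and every name (cookie_page, page_queue, page_flush, count_sent) are hypothetical. The force argument stands in for a request to cancel immediately rather than wait for a full page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define COOKIES_PER_PAGE 4       /* tiny page, for illustration only */

struct cookie { uint64_t log_id; uint32_t index; };   /* hypothetical fields */

/* Transport callback: sends a batch of cookies to the originator. */
typedef int (*send_fn)(const struct cookie *cookies, size_t count, void *arg);

struct cookie_page {
    struct cookie cookies[COOKIES_PER_PAGE];
    size_t count;
    send_fn send;
    void *send_arg;
};

/* Send whatever is queued. The page is emptied even if the send fails,
 * so cookies are dropped on failure rather than retried. */
static int page_flush(struct cookie_page *p)
{
    int rc = 0;

    if (p->count > 0) {
        rc = p->send(p->cookies, p->count, p->send_arg);
        p->count = 0;
    }
    return rc;
}

/* Queue one cookie from a commit callback; flush when the page is full
 * or when the caller forces an immediate cancellation. */
static int page_queue(struct cookie_page *p, struct cookie c, int force)
{
    assert(p->count < COOKIES_PER_PAGE);
    p->cookies[p->count++] = c;
    if (p->count == COOKIES_PER_PAGE || force)
        return page_flush(p);
    return 0;
}

/* Example transport that only counts how many cookies were sent. */
static int count_sent(const struct cookie *cookies, size_t count, void *arg)
{
    (void)cookies;
    *(size_t *)arg += count;
    return 0;
}
```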

The replication scenarios are closely related to commit callbacks and RPCs, the key differences are:


 * The commit callbacks with transaction numbers involve a volatile client and a persistent server
 * The transaction sequence is determined by the server in the volatile-persistent case, and by the originator in the replicating case

Deletion of files.
Change needs to be replicated from MDS (originator) to OST’s (replicators):
 * The OSC’s used by the LOV on the MDS act as originator for the change log, using the storage and disk transactions offered by the MDS:

– OSC’s write log records for file unlink events. This is done through an obd api which stacks the MDS on the LOV on the OSC’s. Such events are caused by unlink calls, by closing open but unlinked files, by removing orphans (which is recovery from failed closes) and by renaming inodes when they clobber.
– The OSC’s create cookies to be returned to OSTs. These cookies are piggybacked on the replies of unlink, close and rename calls. In the case of removing orphans the cookies are passed to obd_destroy calls executed on the MDS.


 * OST’s act as replicators, they must delete the objects associated with the inode.

– Remove objects
– Pass OSC generated cookies as parameters to obd_destroy transactions
– Collect cookies in pages for bulk cancellation RPCs to the OSC on MDS
– Cancel records on the OSCs on MDS

File size changes.
 * Changes originate on OSTs, these need to be implemented on the MDS

– Upon the first file size change in an I/O epoch, the OST:
 * Writes a new size change record for the new epoch
 * Records the size of the previous epoch in the record
 * Records the object id of the previous epoch in the record
 * Generates a cancellation cookie

–	When MDS knows the epoch has ended:


 * It obtains the size at completion of the epoch from client (or exceptionally from the OST)
 * It obtains cancellation cookies for each OST from the client or from the OSTs
 * It postpones starting a new epoch until the size is known
 * It starts a setattr transaction to store the size
 * When it commits, it cancels the records on the OSTs

RAID1 OST.

 * The primary is the originator, the secondary is the replicator
 * Writes on the primary are accompanied by a change record for an extent

Cancellation timeouts.
If the replicator times out during cancellation, it will continue to process the transactions with cookies. The cancellation context will drop the cookies.

The timeout will indicate to the system that the connection must be recovered.

Llog recovery handling
When the replicator receives an llog_connect rpc, it increases the llcd’s generation, and then spawns a thread to handle the processing of catalogs for the context. For each of the catalogs it is handling, it fetches the catalog’s log_id through an obd_get_cat_info call. When it has received the catalog logid, the replicator calls sync and proceeds with llog_cat_process:


 * It only processes records in logs from previous log connection generations.
 * The catalog processing repeats operations that should have been performed by the initiator earlier
 * The replicator must be able to distinguish:
– Done: the operation already took place. If so it queues a commit cancellation cookie which will cancel the log record which it found in the catalog’s log that is being processed. Because sync was called there is no question that this cancellation is for a committed replicating action.
– Not done: the operation was not performed. The replicator performs the action, as it usually does, and queues a commit cookie to initiate cancellation of the log record.
 * When log processing completes, an obd-method is called to indicate to the system that logs have been fully processed. In the case of size recovery, this means that the MDS can resume caching file sizes and guarantee their correctness.
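The done/not-done decision made during catalog processing can be sketched as a loop over records. The record and replica structures below are hypothetical stand-ins. The sketch assumes that each record names one object and that the replicator can tell locally whether the action for that object has already been applied.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_OBJECTS 8

struct rec { unsigned gen; int object; };    /* hypothetical log record */

struct replica {
    bool done[MAX_OBJECTS];  /* done[i]: action for object i already applied */
    size_t replayed;         /* actions performed during recovery */
    size_t cancels;          /* cancellation cookies queued */
};

/* Process records from earlier generations only. Every such record ends
 * with a queued cancellation, whether or not the action had to be redone. */
static void recover(struct replica *r, const struct rec *recs, size_t n,
                    unsigned cur_gen)
{
    for (size_t i = 0; i < n; i++) {
        const struct rec *rec = &recs[i];

        if (rec->gen >= cur_gen)
            continue;                     /* current generation: skip */
        assert(rec->object >= 0 && rec->object < MAX_OBJECTS);
        if (!r->done[rec->object]) {      /* not done: perform the action */
            r->done[rec->object] = true;
            r->replayed++;
        }
        r->cancels++;                     /* done or redone: queue a cookie */
    }
}
```

Because the action is skipped when it already took place, running the same recovery pass twice leaves the replica in the same state.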

Log removal failure.
If an initiator crashes during log removal, the log entries may re-appear after recovery. It is important that the removal of a log from a catalog and the removal of the log file are atomic and idempotent. Upon re-connection, the replicator will again process the log.

File size recovery.
The recovery of orphan deletion is adequately described by 1.5.1. In the case of file size recovery, things are more complicated.

Llog OBD methods.
There is only one obd method related to llog, which is llog_init.

llog_init.
This obd method initializes the logging subsystem for an obd. It sets the methods and propagates calls to dependent obd’s.

llog_cat_initialize.
There is a simple master function llog_cat_initialize for catalog setup that uses an array of object id’s stored on the storage obd of the logging. The logids are stored in array form and given to the logging contexts during the lop_setup calls made by llog_init. It uses support from lvfs to read and write the catalog entries and create or remove them.

Log method table API
Logs can be opened and/or created; this fills in a log handle. The log handle can then be used through the log handle API.

Prototype.
int llog_create(struct obd_device *obd, struct llog_handle **, struct llog_logid *, char

Description.
If the log_id is not null, open an existing log with this ID. If the name is not NULL, open or create a log with that name. Otherwise open a nameless log. The object id of the log is stored in the handle upon success of opening or creation.

Prototype.

 * int llog_close(struct llog_handle *loghandle);

Description.
Close the log and free the handle. remove the handle from the catalog’s list of open handles. If the log has a flag set of destroy if empty, the log may be zapped.

Prototype.

 * int llog_destroy(struct llog_handle *loghandle);

Description.
Destroy the log object and close the handle.

Prototype.
int llog_write_rec(struct llog_handle *handle, struct llog_rec_hdr *rec, struct llog_cookie

Description.
Write a record in the log. If buf is NULL, the record is complete. If buf is not NULL, it is inserted in the middle. Records are a multiple of 128 bits in size and have a header and tail. Write the cookie for the entry into the cookie pointer.

Prototype.
int llog_next_block(struct llog_handle *h, int curr_idx, int next_idx, __u64 *offset,

Description.
Index curr_idx is in the block at *offset. Set *offset to the block offset of record next_idx. Copy len bytes from the start of that block into the buffer buf.

Prototype.
int *lop_read_header(struct llog_handle *loghandle);

Description.
Read the header of the log into the handle and also read the last rec_tail in the log to find the last index that was used in the log.

Prototype.
int llog_init_handle(struct llog_handle *handle, int flags, struct obd_uuid *);

Description.
Initialize the handle, try to read it from the log file. But if the log does not have a header built, build it from the arguments. If the header is read, verify the flags and UUID in the log equal those of the arguments.

Prototype.
int llog_add_record(struct llog_handle *cathandle, struct llog_trans_hdr *rec, struct

Prototype.
int llog_delete_record(struct llog_handle *loghandle, struct llog_handle *cathandle);

Prototype.
int llog_cancel_record(struct llog_handle *cathandle, int count, struct llog_cookie *cookie);

Description.
For each cookie in the cookie array, we clear the log in-use bit and either:


 * Mark it free in the catalog header and delete it if it is empty
 * Just write out the log header if the log is not empty

The cookies may be in different log files, so we need to get new logs each time.
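Clearing an in-use bit and detecting an empty log can be sketched with a small bitmap header. The structure and function names (log_hdr, hdr_add, hdr_cancel) and the 64-record limit are assumptions for illustration. The sketch follows the convention that bit 0 belongs to the header itself and that indices are never reused, since logs only grow.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LOG_MAX_RECS 64u

struct log_hdr {
    uint32_t bitmap[LOG_MAX_RECS / 32];  /* one in-use bit per record index */
    uint32_t count;                      /* bits set, header included */
    uint32_t last_index;                 /* highest index handed out */
};

static void hdr_init(struct log_hdr *h)
{
    assert(h != NULL);
    memset(h, 0, sizeof(*h));
    h->bitmap[0] = 1u;                   /* bit 0: the header record */
    h->count = 1;
}

/* Allocate the next record index, or return -1 when the log is full. */
static int hdr_add(struct log_hdr *h)
{
    uint32_t idx = h->last_index + 1;

    if (idx >= LOG_MAX_RECS)
        return -1;
    h->bitmap[idx / 32] |= 1u << (idx % 32);
    h->last_index = idx;
    h->count++;
    return (int)idx;
}

/* Clear the in-use bit for idx. Returns 1 if only the header remains
 * (the log can be destroyed), 0 if records remain (just write the header
 * back), or -1 if idx is the header, out of range, or not in use. */
static int hdr_cancel(struct log_hdr *h, uint32_t idx)
{
    uint32_t mask = 1u << (idx % 32);

    if (idx == 0 || idx >= LOG_MAX_RECS || !(h->bitmap[idx / 32] & mask))
        return -1;
    h->bitmap[idx / 32] &= ~mask;
    h->count--;
    return h->count == 1 ? 1 : 0;
}
```

A caller handling an array of cookies would look up the log for each cookie, call hdr_cancel, and on a return of 1 also free the log's entry in the catalog.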

Prototype.
int llog_next_block(struct llog_handle *handle, int curr_idx, int next_idx, __u64 *curr_offset,

Description.
Return the block in the log that contains record with index next_idx. The curr_idx at the offset curr_offset is used to optimize the search.

Sample Method Table Descriptions
The obd_llog api

The obd_llog api has several methods (setup, cleanup, add, cancel) as part of the OBD operations. These operations have 3 implementations:
 * mds_obd_llog_*: simply redirects and uses the method of mds_osc_obd, which is normally the LOV running on the MDS, to reach the OST’s.
 * lov_obd_llog_*: calls the method on all relevant OSC devices attached to the LOV. A parameter including striping information of the inode is included to determine which OSC’s should generate a log record for their replicating OST.
 * A more interesting implementation is the collection of methods that is used by the OSC on the MDS and by the OBDFILTER:
– llog_obd_setup: sets up a catalog entry based on a log id.
– llog_obd_cleanup: cleans up all catalog entries in the array
– llog_obd_origin_add: adds a record using the catalog in the llog_obd_ctxt array of handles
– llog_obd_repl_cancel: queues a cookie for cancellation on the replicator.

obd_llog_setup(struct obd_device *obd, struct obd_device *disk_obd, int index, int count, struct llog_logid *idarray).
To activate the catalogs for logging and make their headers and file handles available is fairly involved. Each system that requires catalogs manages an array of catalogs. This function is given an array of logid’s and an index. The index pertains to the array of logs used by an originator, the array of logid’s is an array with an entry for each osc in the lov stripe descriptor.

obd_llog_cleanup(struct obd_device *).
Cleans up all initialized catalog handles for a device.

int llog_obd_origin_add(struct obd_export *exp, int index, struct llog_rec_hdr *rec, struct lov_stripe_md *lsm, struct llog_cookie *logcookies, int numcookies).
Adds a record to the catalog at index index. The lsm is used to identify how to descend an LOV device. The cookies are generated for each record that is added.

int llog_obd_repl_cancel(struct obd_device *obd, struct lov_stripe_md *lsm, int count, struct llog_cookie *cookies, int flags).
Queue the cookies for cancellation. Flags can be 0 or LLC_CANCEL_NOW for immediate cancellation.

Configuration Logs
Configuration of Lustre is arranged by using llogs with records that describe the configuration. The first time a configuration is written it is given a version of 0. Each record is numbered. Configurations can then be updated, which results in:
(1) a new configuration log
(2) a change descriptor with the previous configuration

Configurations are then recorded on the configuration obd. At any time there are stored:

(1) One full configuration log (for the current version)
(2) A collection of change descriptors for every change made since the initial configuration.

A client uses the configuration logs in two ways:
 * On startup it fetches the full current configuration log from the configuration obd and processes the records to complete the mount command
 * A client can also receive a signal that it needs to refresh its configuration. This signal can be an ioctl, /proc/sys file or lock revocation callback. When the client gets this signal it:
– Determines its current version of the configuration
– Asks the config obd for the latest version
– Fetches the change logs to change the current configuration to the latest one

The last operation is done with llog_process, using a suitable callback function, as well as the logs that the client has in memory.
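Processing a log through a callback, as in the refresh case, can be sketched as follows. This is not the llog_process interface: cfg_process, cfg_apply_change, and the record layout are hypothetical. The sketch further assumes that each change descriptor moves the configuration forward by exactly one version.

```c
#include <assert.h>
#include <stddef.h>

struct cfg_rec { int version; };         /* version this change produces */

typedef int (*cfg_cb)(const struct cfg_rec *rec, void *data);

/* Walk the records in order and stop at the first callback failure. */
static int cfg_process(const struct cfg_rec *recs, size_t n, cfg_cb cb,
                       void *data)
{
    assert(cb != NULL);
    for (size_t i = 0; i < n; i++) {
        int rc = cb(&recs[i], data);

        if (rc != 0)
            return rc;
    }
    return 0;
}

struct client_cfg { int version; int applied; };

/* Callback: skip changes the client already has, apply the next one in
 * sequence, and fail if a change descriptor is missing. */
static int cfg_apply_change(const struct cfg_rec *rec, void *data)
{
    struct client_cfg *cfg = data;

    if (rec->version <= cfg->version)
        return 0;                        /* already applied */
    if (rec->version != cfg->version + 1)
        return -1;                       /* gap in the change logs */
    cfg->version = rec->version;
    cfg->applied++;
    return 0;
}
```

A client at version 1 that is handed changes 1 through 3 would skip the first and apply the other two.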

Size Recovery
This section contains a discussion of the recovery of MDS cached sizes from OST’s. The MDS sees open calls which precede any I/O on a file. When an open request reaches the MDS the file inode is in one of two states:

 * quiescent: No I/O is currently happening on the inode
 * I/O epoch: The inode is in I/O epoch k.

If no I/O epoch is active the MDS starts a new one. The epoch number will be a random number from boot time which is increased each time a new epoch is started. A fairly complicated sequence of events involving the inode may now ensue, such as many other openers. Eventually the clients will all close the file and flush their data. The simplest epoch management scheme is:


 * 1) open: file is opened for write
 * 2) closed and flushed: all clients have closed and flushed data
 * 3) MDS changes file size and ends epoch

When a client closes the file, has no dirty data outstanding and knows the file size and OST size update cookies authoritatively it will include them with the close call to the MDS. The MDS will initiate the setattr to update its cached file size and use the MDS cookies.

When a client closes but doesn’t satisfy some of these conditions it will still make a close call to the MDS. The MDS will know if this is the last client closing the file. If so, it will indicate in its response to the client that it requires the client to obtain the file size and cookies and make an additional setattr call to the MDS with the cookies. The client can flush its data and force a flush of other clients’ data through the DLM. An obd_getattr call will obtain the file size and cookies for a particular epoch. A slightly more lax scheme is to allow the client to update the MDS even when it has not yet flushed all dirty data to the inode.

The epoch ends when the MDS receives the setattr call.
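The simplest epoch scheme can be sketched as a small state machine on the MDS side. All names here (mds_inode, mds_open, mds_close, mds_setattr_size) are hypothetical, and the sketch leaves out cookies, the DLM, and the case where the MDS postpones a new open until the size is known.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct mds_inode {
    bool     epoch_active;   /* false: quiescent */
    uint64_t epoch;          /* current epoch number while active */
    unsigned openers;        /* clients that have the file open */
    uint64_t cached_size;    /* size cached on the MDS */
};

/* Open for write: start a new epoch if the inode is quiescent.
 * next_epoch is the counter from which epoch numbers are drawn. */
static uint64_t mds_open(struct mds_inode *ino, uint64_t *next_epoch)
{
    if (!ino->epoch_active) {
        ino->epoch = (*next_epoch)++;
        ino->epoch_active = true;
    }
    ino->openers++;
    return ino->epoch;
}

/* Close: returns true when this was the last opener, which is when the
 * MDS needs the final size and the cookies for the epoch. */
static bool mds_close(struct mds_inode *ino)
{
    assert(ino->openers > 0);
    return --ino->openers == 0;
}

/* setattr with the size for an epoch: stores the size and ends the epoch.
 * Refused while clients still have the file open or if the epoch number
 * does not match. */
static int mds_setattr_size(struct mds_inode *ino, uint64_t epoch,
                            uint64_t size)
{
    if (!ino->epoch_active || ino->openers != 0 || epoch != ino->epoch)
        return -1;
    ino->cached_size = size;
    ino->epoch_active = false;
    return 0;
}
```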

The OST should pin the inode in memory and remember the MDS epoch in volatile data. Perhaps it takes a refcount for each client writing to the inode. Each client can indicate to the OST when it