Logging API

= Lustre Logging API =

In numerous places Lustre needs a logging API. Generally log records are written transactionally and canceled when a commit on another system completes. There are many uses for this, all associated with updates of persistent information on multiple systems.

Log records are stored in log objects. Log objects are currently implemented as files, and in some minor ways the API below may reflect this. We speak of log objects and sometimes of llogs (Lustre logs).

= API Requirements =


 * The API should be usable through methods.
 * The methods should mask whether the API is used remotely or locally.
 * Logs only grow
 * Logs can be removed, and remote callers may not assume open logs remain available


 * Access to logs should be mostly through a stateless API that can be called remotely
 * Access to logs should go through some kind of authentication/authorization system

= Fundamental Data Structures =


 * Log objects can be identified in two ways:
 * Through a name:
 * the interpretation of the name is up to the driver servicing the call.
 * typical examples of named logs are files identified by a pathname
 * text versions of UUID's
 * profile names
 * Through an object id, the llog_logid.


 * A directory of llogs which can look up a name to get an id can provide translation from a naming system to an id based access system. In our implementation we use a file system directory to provide this catalog function.
 * Logs only contain records
 * Records in logs have the following structure:
 * llog_rec_hdr: a header, indicating the index, length and type; the header is 16 bytes in length
 * body: an opaque 32 bit aligned blob; the body is 32 byte aligned
 * llog_rec_tail: length and index of the record again, for walking backward; the tail is 16 bytes in length.
 * The first record in every log is a 4K header record, llog_log_rec, which has in its body:
 * a bitmap of records that are allocated; bit 0 is set immediately because the header itself occupies it
 * A collection of log records behind the header
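
The record layout and the allocation bitmap described above can be sketched in C as follows. This is only an illustration under stated assumptions: the `sketch_*` names, the padding fields and the choice to fill the whole header body with the bitmap are hypothetical, and the sizes simply follow the text (16-byte header, 16-byte tail, 4K header record).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_LOG_HDR_SIZE 4096 /* the first record is a 4K header record */

/* Hypothetical 16-byte record header: length, index and type. */
struct sketch_rec_hdr {
    uint32_t len;     /* total record length, header and tail included */
    uint32_t index;   /* index of the record within the log */
    uint32_t type;    /* record type */
    uint32_t padding; /* assumed padding up to 16 bytes */
};

/* Hypothetical 16-byte record tail: repeats length and index so that
 * a reader can walk the log backward. */
struct sketch_rec_tail {
    uint32_t padding[2]; /* assumed padding up to 16 bytes */
    uint32_t len;
    uint32_t index;
};

#define SKETCH_BITMAP_BYTES \
    (SKETCH_LOG_HDR_SIZE - sizeof(struct sketch_rec_hdr) - sizeof(struct sketch_rec_tail))

/* Header record; in this sketch its body holds only the bitmap. */
struct sketch_log_hdr {
    struct sketch_rec_hdr hdr;
    uint8_t bitmap[SKETCH_BITMAP_BYTES];
    struct sketch_rec_tail tail;
};

/* Initialize an empty log header: bit 0 is set right away because the
 * header record itself occupies index 0. */
static void sketch_log_hdr_init(struct sketch_log_hdr *h)
{
    memset(h, 0, sizeof(*h));
    h->hdr.len = SKETCH_LOG_HDR_SIZE;
    h->tail.len = SKETCH_LOG_HDR_SIZE;
    h->bitmap[0] |= 1u;
}

/* The helpers below do no bounds checking; callers must keep
 * index < SKETCH_BITMAP_BYTES * 8. */
static bool sketch_bit_test(const struct sketch_log_hdr *h, uint32_t index)
{
    return ((h->bitmap[index / 8] >> (index % 8)) & 1u) != 0;
}

static void sketch_bit_set(struct sketch_log_hdr *h, uint32_t index)
{
    h->bitmap[index / 8] |= (uint8_t)(1u << (index % 8));
}

static void sketch_bit_clear(struct sketch_log_hdr *h, uint32_t index)
{
    h->bitmap[index / 8] &= (uint8_t)~(1u << (index % 8));
}
```

In this sketch, writing a record would set its bit and cancelling a record would clear it; a log whose only remaining bit is bit 0 holds no live records.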


 * Records can be accessed:
 * by iterating through a specific log
 * by providing a llog_cookie


 * Some logs are potentially very large, for example, replication logs, and require a hierarchical structure, where a catalog of logs is held at the top level. In some cases the catalog structure is two deep.
 * A catalog API is provided which exploits the lower Lustre log API
 * Catalog entries

= Cancellation API =

This describes typical use of the logging API to manage distributed commit of related distributed persistent updates.

We consider systems that make related updates:


 * originator:: the first system performing a transaction
 * replicators:: one or more other systems performing a related update

The key requirement is that the replicators must complete their updates. This is accomplished by transactionally recording the originator's action in a log. When the replicator's related action completes, it cancels the log entry on the originator. Here we describe the handshakes and protocols involved.

Normal Operation

 * Originator performs a transaction and as part of the transaction writes a log record for each replicator.
 * The log record creation produces a log_cookie.
 * The log_cookie is sent to the replicator.
 * The replicator performs the related transaction and executes a commit callback for that.
 * The callback puts the log_cookie up for cancellation
 * When the replicator has a page full of cancellation cookies it sends the cookies to the originator
 * The originator cancels the log records associated with the cookies and cleans up empty log files
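
The batching step in the list above (collect cookies until a page is full, then send them to the originator) could look roughly like this. The cookie layout, the 4096-byte page size and all `sketch_*` names are assumptions made for illustration; `sketch_count_send` merely stands in for the bulk cancellation RPC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096 /* assumed page size */

/* Hypothetical cookie: identifies one record in one log. */
struct sketch_cookie {
    uint64_t log_id;  /* which log the record lives in */
    uint32_t index;   /* record index inside that log */
    uint32_t padding;
};

#define SKETCH_COOKIES_PER_PAGE (SKETCH_PAGE_SIZE / sizeof(struct sketch_cookie))

/* Stand-in for the RPC that ships a page of cookies to the originator. */
typedef void (*sketch_send_fn)(const struct sketch_cookie *cookies,
                               size_t count, void *arg);

struct sketch_cookie_batch {
    struct sketch_cookie cookies[SKETCH_COOKIES_PER_PAGE];
    size_t count;
};

/* Called from the replicator's commit callback. Queues one cookie and
 * sends the whole page once it is full. Returns 1 if a page was sent,
 * 0 otherwise. */
static int sketch_queue_cookie(struct sketch_cookie_batch *b,
                               struct sketch_cookie c,
                               sketch_send_fn send, void *arg)
{
    b->cookies[b->count++] = c;
    if (b->count < SKETCH_COOKIES_PER_PAGE)
        return 0;
    send(b->cookies, b->count, arg);
    b->count = 0;
    return 1;
}

/* Example sender used for experiments: only counts the cookies sent. */
static void sketch_count_send(const struct sketch_cookie *cookies,
                              size_t count, void *arg)
{
    (void)cookies;
    *(size_t *)arg += count;
}
```

Batching keeps the number of cancellation RPCs small; the cost is that log records on the originator stay allocated until the replicator's page fills up.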

Notes


 * 1)  These replication scenarios are closely related to commit callbacks and RPCs. Key differences are:
 * The commit callbacks with transaction numbers involve a volatile client and a persistent server
 * The transaction sequence is determined by the server in the volatile-persistent case and by the originator in the replicating case


 * === Examples ===
 * ==== deletion of files ====


 * change needs to be replicated from mds to ost's
 * osc's on mds act as originator for the change log, using storage from the mds
 * osc's write log records
 * osc's generate cancel cookies
 * osc's send cookies to ost's
 * ost's act as replicators
 * remove objects
 * pass osc generated cookies as parameters to obd_destroy transactions
 * collect cookies in pages for bulk cancellation rpcs to the osc on mds
 * cancel records on the osc's on mds


 * ==== file size changes ====


 * changes originate on the OST's and need to be applied on the MDS
 * Upon the first file size change in an io epoch on OST:
 * writes a new size change record for new epoch
 * records the size of the previous epoch in the record
 * records the object id of the previous epoch in the record
 * it generates a cancellation cookie
 * when MDS knows epoch has ended
 * it obtains the size at completion of epoch from client (or exceptionally from the OST)
 * it obtains cancellation cookies for each OST from the client or from the OST's
 * it postpones starting a new epoch until the size is known
 * it starts a setattr transaction to store the size
 * when it commits, it cancels the records on the OST's


 * RAID1 OST
 * The primary is the originator, the secondary the replica
 * Writes on the primary are accompanied by a change record for an extent

Connections
In order to process cancellation and recovery actions the originators and replicators use a ptlrpc connection to execute remote procedure calls. The connection can be set up on the originator or the replicator and we call the system setting up the connection the initiator.


 * The originator and the replicator establish a connection.
 * The logging subsystem on each end of the connection receives a log connection active completion event
 * if connections are active log_cookies can be accepted for cancellation by the cancellation daemons
 * if connections are not active the log_cookies are not collected for cancellation
 * A generation number is associated with the connections
 * it increases for each connection complete event
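
A minimal sketch of the connection state described above, assuming a per-connection structure with an active flag and a generation counter. All names are hypothetical; the text does not say how the state is stored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-connection logging state. */
struct sketch_log_conn {
    bool active;         /* set by the "log connection active" event */
    uint32_t generation; /* increases on every connection complete event */
};

/* Handle a connection complete event. */
static void sketch_conn_complete(struct sketch_log_conn *c)
{
    c->generation++;
    c->active = true;
}

/* Handle loss of the connection (for example after a timeout). */
static void sketch_conn_lost(struct sketch_log_conn *c)
{
    c->active = false;
}

/* The cancellation daemon only collects cookies while the connection
 * is active; otherwise the cookie is not collected. */
static bool sketch_accept_cookie(const struct sketch_log_conn *c)
{
    return c->active;
}
```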

Cancellation timeout

 * If the replicator times out during cancellation
 * It will continue to process the transactions with cookies
 * The cancellation daemon will drop the cookies

Initiator Connect Event
When the replicator receives a connect complete event it


 * Fetches the catalog's log_id for the catalog it needs to process through a get_cat_info call
 * The replicator needs to know what the catalog logid is for processing
 * When it has received the catalog it calls process catalog:
 * The replicator calls sync
 * It only processes records in logs from previous log connection generations
 * The catalog processing repeats operations that should have been performed by the initiator earlier
 * The replicator must be able to distinguish:
 * the operation already took place
 * the operation was not performed
 * If the operations are performed:
 * During this processing the replicator generates cookies as it normally does
 * Upon commit cookies are queued on the replicator for bulk cancellation of log records
 * If the update was already performed:
 * the replicator can queue a cancellation cookie back immediately
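
The decision the replicator takes for each record during catalog processing can be summarized in a small function. This is a sketch only: the enum, the function and the idea of passing the log's generation as a plain integer are assumptions, not part of the API described here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical outcomes for one record seen during catalog processing. */
enum sketch_replay_action {
    SKETCH_SKIP,       /* log belongs to the current generation */
    SKETCH_CANCEL_NOW, /* update already applied: queue the cookie at once */
    SKETCH_REDO        /* update missing: perform it, cancel upon commit */
};

/* log_gen:  connection generation under which the log was written
 * conn_gen: generation of the connection that just completed
 * already_applied: whether the replicator can tell the update took place */
static enum sketch_replay_action
sketch_replay_decide(uint32_t log_gen, uint32_t conn_gen, bool already_applied)
{
    /* Only logs from previous connection generations are processed. */
    if (log_gen >= conn_gen)
        return SKETCH_SKIP;
    if (already_applied)
        return SKETCH_CANCEL_NOW;
    return SKETCH_REDO;
}
```

How `already_applied` is computed depends on the operation; for an object removal, for instance, it might be a check whether the object still exists.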

Log Removal Failure



 * If an initiator crashes during log removal, the log entries may re-appear after recovery
 * It is important that the removal of a log from a catalog and the removal of the log file are atomic
 * Upon reconnection the replicator will again process the log

= Log API's =

Low Level Log Operations
Logs can be opened and/or created. This fills in a log handle. The log handle can be used through the log handle api.


 * int llog_create(struct obd_device *obd, struct llog_handle **, struct llog_logid *, char *name);


 * If llog_id is not null open an existing log with this ID. If name is not null, open or create a log with that name. Otherwise open a nameless log.  The object id of the log is stored in the handle upon success of opening or creation.
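
The argument handling described for llog_create can be sketched as a small selection function. The names are hypothetical, and the precedence of the ID over the name when both are given is an assumption read from the order of the sentences above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct llog_logid. */
struct sketch_logid {
    uint64_t oid;
};

enum sketch_open_mode {
    SKETCH_OPEN_BY_ID,             /* open an existing log with this ID */
    SKETCH_OPEN_OR_CREATE_BY_NAME, /* open or create a log with that name */
    SKETCH_NAMELESS                /* neither given: use a nameless log */
};

/* Pick the mode from the two optional arguments. */
static enum sketch_open_mode
sketch_pick_open_mode(const struct sketch_logid *logid, const char *name)
{
    if (logid != NULL)
        return SKETCH_OPEN_BY_ID; /* assumed to win over the name */
    if (name != NULL)
        return SKETCH_OPEN_OR_CREATE_BY_NAME;
    return SKETCH_NAMELESS;
}
```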


 * int llog_close(struct llog_handle *loghandle);


 * Close the log and free the handle. Remove the handle from the catalog's list of open handles. If the log has the destroy-if-empty flag set, the log may be zapped.


 * int llog_destroy(struct llog_handle *);


 * Destroy the log object and close the handle.


 * int llog_write_rec(struct llog_handle *handle, struct llog_rec_hdr *rec, struct llog_cookie *cookie, int cookie_count, void *buf);


 * Write a record in the log. If buf is null the record is complete; if buf is not null it is inserted in the middle.  Records are multiples of 128 bits in size and have a hdr and tail. Write the cookie for the entry into the cookie pointer. (cookie_count is probably a mistake).


 * int llog_next_block(struct llog_handle *h, int curr_idx, int next_idx, __u64 *offset, void *buf, int len);


 * Index curr_idx is in the block at offset *offset. Set *offset to the block offset of record next_idx.  Copy len bytes from the start of that block into the buffer buf.


 * int (*lop_read_header)(struct llog_handle *handle);


 * Read the header of the log into the handle and also read the last rec_tail in the log to find the last index that was used in the log.

Higher Level Log Operations

 * int llog_init_handle(struct llog_handle *handle, int flags, struct obd_uuid *uuid);


 * Initialize the handle. Try to read it from the log file, but if the log has no header yet, build it from the arguments.  If the header is read, verify the flags and UUID in the log equal those of the arguments.

OBD Level Log Operations

 * int llog_add_record(struct llog_handle *cathandle, struct llog_trans_hdr *rec, struct llog_cookie *logcookies, void *buf)


 * int llog_delete_log(struct llog_handle *cathandle, struct llog_handle *loghandle)


 * int llog_cancel_records(struct llog_handle *cathandle, int count, struct llog_cookie *cookies)


 * For each cookie in the cookie array, we clear the log in-use bit and either:
 * the log is empty, so mark it free in the catalog header and delete it
 * the log is not empty, just write out the log header
 * The cookies may be in different log files, so we need to get new logs each time.
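
The cancellation loop described above can be modeled in memory as follows. This is a sketch under stated assumptions: each log is reduced to a 64-bit bitmap, the catalog is a plain array, and the `deleted` and `header_writes` fields only record what a real implementation would do on disk. None of the names come from Lustre.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical in-memory model of one log in a catalog. */
struct sketch_log {
    uint64_t bitmap;   /* bit 0 is the header, other bits are live records */
    bool deleted;      /* log was marked free in the catalog and removed */
    int header_writes; /* how often the log header was written back */
};

/* Hypothetical cookie: slot of the log in the catalog plus record index. */
struct sketch_cancel_cookie {
    int log;
    uint32_t index;
};

/* Cancel the records named by the cookies. Returns the number of records
 * actually cancelled, or -1 if a cookie names a log outside the catalog. */
static int sketch_cancel_records(struct sketch_log *logs, int nlogs,
                                 const struct sketch_cancel_cookie *cookies,
                                 int count)
{
    int cancelled = 0;

    for (int i = 0; i < count; i++) {
        const struct sketch_cancel_cookie *ck = &cookies[i];

        if (ck->log < 0 || ck->log >= nlogs)
            return -1;

        /* Cookies may point into different logs, so look the log up
         * again for every cookie. */
        struct sketch_log *log = &logs[ck->log];

        /* Index 0 is the header and can never be cancelled. */
        if (ck->index == 0 || ck->index >= 64 || log->deleted)
            continue;

        uint64_t bit = UINT64_C(1) << ck->index;
        if (!(log->bitmap & bit))
            continue; /* already cancelled */

        log->bitmap &= ~bit; /* clear the in-use bit */
        cancelled++;

        if (log->bitmap == 1)
            log->deleted = true;  /* empty: free it in the catalog, delete */
        else
            log->header_writes++; /* not empty: just write the header */
    }
    return cancelled;
}
```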


 * int llog_next_block(struct llog_handle *loghandle, int cur_idx, int next_idx, __u64 *cur_offset, void *buf, int len)


 * Return the block in the log that contains record with index next_idx. The current idx at offset cur_offset is used to optimize the search.


 * typedef int (*llog_cb_t)(struct llog_handle *, struct llog_trans_hdr *rec, void *data);
 * int llog_process_log(struct llog_handle *loghandle, llog_cb_t cb, void *data)


 * Call the callback function cb
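
As an illustration of callback-driven processing, the sketch below walks a buffer of records and calls a callback on each one. It assumes that the callback is invoked once per record, that a non-zero return value stops the walk, and that records are laid out back to back with the length in a 16-byte header; the text above states none of this explicitly, and all `sketch_*` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 16-byte record header, as in the record layout above. */
struct sketch_hdr {
    uint32_t len; /* total record length, header included */
    uint32_t index;
    uint32_t type;
    uint32_t padding;
};

typedef int (*sketch_cb_t)(const struct sketch_hdr *rec, void *data);

/* Call cb for every record in buf. Returns 0 when the whole buffer was
 * processed, the callback's value if it returned non-zero, or -1 if a
 * record length is inconsistent with the buffer. */
static int sketch_process_log(const unsigned char *buf, size_t size,
                              sketch_cb_t cb, void *data)
{
    size_t off = 0;

    while (off + sizeof(struct sketch_hdr) <= size) {
        struct sketch_hdr hdr;

        /* Copy the header out to avoid unaligned access into buf. */
        memcpy(&hdr, buf + off, sizeof(hdr));
        if (hdr.len < sizeof(hdr) || hdr.len > size - off)
            return -1;

        int rc = cb(&hdr, data);
        if (rc != 0)
            return rc;

        off += hdr.len;
    }
    return 0;
}

/* Example callback: counts the records it is shown. */
static int sketch_count_cb(const struct sketch_hdr *rec, void *data)
{
    (void)rec;
    (*(int *)data)++;
    return 0;
}
```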