In numerous places Lustre needs a logging API. Generally log records are written transactionally and canceled when a commit on another system completes. There are many uses for this, all associated with updates of persistent information on multiple systems.
Log records are stored in log objects. Log objects are currently implemented as files, and in some minor ways the API below may reflect this. We speak of log objects and sometimes of llogs (Lustre logs).
API requirements include:
- The API should be usable through methods.
- The methods should mask whether the API is used remotely or locally.
- Logs only grow.
- Logs can be removed, and remote callers may not assume open logs remain available.
- Access to logs should be mostly through a stateless API that can be called remotely.
- Access to logs should go through some kind of authentication/authorization system.
Fundamental Data Structures
Log objects can be identified in two ways:
- Through a name:
- - The interpretation of the name is up to the driver servicing the call.
- - Typical examples of named logs are files identified by a pathname, text versions of UUIDs, and profile names.
- Through an object id, the llog_logid.
A directory of llogs which can look up a name to get an id can provide translation from a naming system to an id based access system. In our implementation we use a file system directory to provide this catalog function.
Logs only contain records. Records in logs have the following structure:
- llog_rec_hdr: a header indicating the index, length and type; the header is 16 bytes long.
- Body: an opaque, 32-bit aligned blob.
- llog_rec_tail: the length and index of the record again, for walking backward; the tail is 16 bytes long.
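The layout above can be sketched in C as follows. This is an illustration only, not the Lustre definition: the struct names, field names and padding words are assumptions chosen to match the sizes quoted above. The sketch shows why the length and index are repeated in the tail: a reader positioned at the end of a record can find where it starts without scanning from the beginning of the log.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical header: length, index and type of the record. */
struct sketch_rec_hdr {
    uint32_t len;     /* total record length: header + body + tail */
    uint32_t index;   /* index of the record within its log */
    uint32_t type;    /* record type */
    uint32_t pad;     /* assumed padding */
};

/* Hypothetical tail: repeats length and index for backward walks. */
struct sketch_rec_tail {
    uint32_t len;
    uint32_t index;
    uint32_t pad[2];  /* assumed padding */
};

/* Append one record at offset off and return the offset just past it.
 * The body is zero-padded to a multiple of 4 bytes (32-bit alignment). */
static size_t sketch_append(unsigned char *buf, size_t off, uint32_t index,
                            uint32_t type, const void *body, uint32_t body_len)
{
    uint32_t padded = (body_len + 3u) & ~3u;
    struct sketch_rec_hdr hdr = {0};
    struct sketch_rec_tail tail = {0};

    hdr.len = (uint32_t)(sizeof(hdr) + padded + sizeof(tail));
    hdr.index = index;
    hdr.type = type;
    tail.len = hdr.len;
    tail.index = index;

    memcpy(buf + off, &hdr, sizeof(hdr));
    memset(buf + off + sizeof(hdr), 0, padded);
    if (body_len != 0)
        memcpy(buf + off + sizeof(hdr), body, body_len);
    memcpy(buf + off + sizeof(hdr) + padded, &tail, sizeof(tail));
    return off + hdr.len;
}

/* Given the offset just past a record, return the offset where that
 * record starts and store its index in *index. Only the tail is read. */
static size_t sketch_prev(const unsigned char *buf, size_t end, uint32_t *index)
{
    struct sketch_rec_tail tail;

    memcpy(&tail, buf + end - sizeof(tail), sizeof(tail));
    *index = tail.index;
    return end - tail.len;
}
```

Calling `sketch_prev` repeatedly, starting from the end of the last record, visits the records in reverse order until offset 0 is reached.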
The first record in every log is a 4K header record, llog_log_rec, whose body holds a bitmap of the records that are allocated. Bit 0 is set immediately because the header itself occupies it. The header is followed by a collection of log records.
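A minimal sketch of the allocation bitmap, assuming a hypothetical in-memory structure `sketch_log_hdr` and a made-up capacity; neither is taken from the Lustre sources. It shows bit 0 being reserved for the header and how clearing the last record bit reveals that only the header remains.

```c
#include <stdint.h>
#include <string.h>

#define SKETCH_BITMAP_BITS 1024   /* hypothetical capacity, not Lustre's value */

/* Hypothetical stand-in for the body of the header record. */
struct sketch_log_hdr {
    uint32_t bitmap[SKETCH_BITMAP_BITS / 32];
    uint32_t count;               /* bits currently set, header included */
};

static void sketch_hdr_init(struct sketch_log_hdr *h)
{
    memset(h, 0, sizeof(*h));
    h->bitmap[0] = 1u;            /* bit 0: the header record itself */
    h->count = 1;
}

static int sketch_bit_test(const struct sketch_log_hdr *h, uint32_t idx)
{
    return (int)((h->bitmap[idx / 32] >> (idx % 32)) & 1u);
}

/* Mark record idx as allocated.
 * Returns 0 on success, -1 if idx is out of range or already allocated. */
static int sketch_bit_set(struct sketch_log_hdr *h, uint32_t idx)
{
    if (idx >= SKETCH_BITMAP_BITS || sketch_bit_test(h, idx))
        return -1;
    h->bitmap[idx / 32] |= 1u << (idx % 32);
    h->count++;
    return 0;
}

/* Cancel record idx.
 * Returns 1 if only the header remains, 0 if other records are still
 * allocated, -1 if idx is the header, out of range or not allocated. */
static int sketch_bit_clear(struct sketch_log_hdr *h, uint32_t idx)
{
    if (idx == 0 || idx >= SKETCH_BITMAP_BITS || !sketch_bit_test(h, idx))
        return -1;
    h->bitmap[idx / 32] &= ~(1u << (idx % 32));
    h->count--;
    return h->count == 1;
}
```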
Records can be accessed by:
- Iterating through a specific log
- Providing a llog_cookie
Some logs are potentially very large (replication logs, for example) and require a hierarchical structure in which a catalog of logs is held at the top level. In some cases the catalog structure is two levels deep.
A catalog API is provided which exploits the lower Lustre log API.
- Catalog entries contain an llog_cookie which references another llog object
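The catalog hierarchy can be pictured with a small in-memory model. Everything here is hypothetical: the contents of `sketch_logid` and `sketch_cookie`, the fixed array sizes, and the idea of keeping a pointer next to each catalog entry as a stand-in for opening the referenced log. The sketch resolves a cookie in two steps: find the catalog entry whose log id matches, then look up the record index inside that log.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_RECS 8
#define SKETCH_MAX_LOGS 4

/* Hypothetical log id and cookie; the real contents may differ. */
struct sketch_logid  { uint64_t oid; uint32_t gen; };
struct sketch_cookie { struct sketch_logid id; uint32_t index; };

/* A plain log: slot 0 stands for the header record. */
struct sketch_log {
    struct sketch_logid id;
    int used[SKETCH_MAX_RECS];
    int payload[SKETCH_MAX_RECS];
};

/* A catalog: each entry is a cookie naming another log. */
struct sketch_catalog {
    struct sketch_cookie entries[SKETCH_MAX_LOGS];
    struct sketch_log *logs[SKETCH_MAX_LOGS];  /* stand-in for opening the log */
    int nr;
};

static int sketch_id_eq(const struct sketch_logid *a,
                        const struct sketch_logid *b)
{
    return a->oid == b->oid && a->gen == b->gen;
}

/* Resolve a cookie to the payload of the record it names, or NULL if the
 * log is not in the catalog or the record is not allocated. */
static const int *sketch_lookup(const struct sketch_catalog *cat,
                                const struct sketch_cookie *c)
{
    for (int i = 0; i < cat->nr; i++) {
        const struct sketch_log *log;

        if (!sketch_id_eq(&cat->entries[i].id, &c->id))
            continue;
        log = cat->logs[i];
        if (c->index == 0 || c->index >= SKETCH_MAX_RECS ||
            !log->used[c->index])
            return NULL;
        return &log->payload[c->index];
    }
    return NULL;
}
```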
This section describes the typical use of the logging API to manage the distributed commit of related persistent updates.
We consider systems that make related updates:
- originator:: the first system performing a transaction
- replicators:: one or more other systems performing a related update
The key requirement is that the replicators must complete their updates. This is accomplished by transactionally recording the originator's action in a log. When the replicator's related action completes, it cancels the log entry on the originator. Here we describe the handshakes and protocols involved.
Normal operation is described below:
- Originator performs a transaction and as part of the transaction writes a log record for each replicator.
- The log record creation produces a log_cookie.
- The log_cookie is sent to the replicator.
- The replicator performs the related transaction and executes a commit callback for that.
- The callback puts the log_cookie up for cancellation.
- When the replicator has a page full of cancellation cookies it sends the cookies to the originator.
- The originator cancels the log records associated with the cookies and cleans up empty log files.
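The steps above can be walked through with a single-process simulation. It is a sketch under several assumptions: the originator and replicator are plain structs in one address space, a direct function call stands in for the cancellation RPC, and `SKETCH_BATCH` stands in for "a page full" of cookies. None of the names come from Lustre.

```c
#include <stdint.h>
#include <string.h>

#define SKETCH_LOG_SLOTS 16
#define SKETCH_BATCH 4            /* stand-in for "a page full" of cookies */

struct sketch_cookie { uint32_t log; uint32_t index; };

struct sketch_originator {
    int in_use[SKETCH_LOG_SLOTS]; /* slot 0 is the log header */
    int live;                     /* records not yet cancelled */
    uint32_t next;                /* next free record index */
    int log_removed;              /* set once the empty log is cleaned up */
};

struct sketch_replicator {
    struct sketch_cookie queue[SKETCH_BATCH];
    int queued;
};

static void sketch_orig_init(struct sketch_originator *o)
{
    memset(o, 0, sizeof(*o));
    o->in_use[0] = 1;
    o->next = 1;
}

/* The originator's transaction writes a log record and yields a cookie. */
static int sketch_orig_write(struct sketch_originator *o,
                             struct sketch_cookie *c)
{
    if (o->next >= SKETCH_LOG_SLOTS)
        return -1;
    o->in_use[o->next] = 1;
    o->live++;
    c->log = 1;                   /* single hypothetical log */
    c->index = o->next++;
    return 0;
}

/* The originator cancels the records named by the cookies and cleans up
 * the log once no record is left in it. */
static void sketch_orig_cancel(struct sketch_originator *o,
                               const struct sketch_cookie *c, int n)
{
    for (int i = 0; i < n; i++) {
        uint32_t idx = c[i].index;

        if (idx == 0 || idx >= SKETCH_LOG_SLOTS || !o->in_use[idx])
            continue;             /* unknown or already cancelled */
        o->in_use[idx] = 0;
        o->live--;
    }
    if (o->live == 0)
        o->log_removed = 1;
}

/* Commit callback on the replicator: queue the cookie and send the batch
 * once it is full. Returns the number of cookies sent (0 if none). */
static int sketch_repl_commit_cb(struct sketch_replicator *r,
                                 struct sketch_originator *o,
                                 struct sketch_cookie c)
{
    int sent;

    r->queue[r->queued++] = c;
    if (r->queued < SKETCH_BATCH)
        return 0;
    sketch_orig_cancel(o, r->queue, r->queued);
    sent = r->queued;
    r->queued = 0;
    return sent;
}
```

Nothing is cancelled on the originator until the replicator's commit callback has run for a full batch, so a replicator crash before commit leaves the log records in place.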
These replication scenarios are closely related to commit callbacks and RPCs. Key differences are:
- The commit callbacks with transaction numbers involve a volatile client and persistent server.
- The transaction sequence is determined by the server in the volatile-persistent case and by the originator in the replicating case.
Deletion of files
- The change needs to be replicated from the MDS to the OSTs.
- OSCs on the MDS act as originators for the change log, using storage from the MDS:
- - OSCs write log records
- - OSCs generate cancel cookies
- - OSCs send cookies to OSTs
- OSTs act as replicators:
- - Remove objects
- - Pass OSC-generated cookies as parameters to obd_destroy transactions
- - Collect cookies in pages for bulk cancellation RPCs to the OSC on the MDS
- - Cancel records on the OSCs on the MDS
File size changes
Changes that originate on OSTs need to be applied on the MDS. Upon the first write that causes a file size change in an I/O epoch on an OST, the OST will:
- Write a new size change record for new epoch
- Record the size of the previous epoch in the record
- Record the object id of the previous epoch in the record
- Generate a cancellation cookie
When the MDS knows the epoch has ended, the MDS will:
- Obtain the size at completion of epoch from client (or exceptionally from the OST)
- Obtain cancellation cookies for each OST from the client or from the OST(s)
- Postpone starting a new epoch until the file size is known
- Start a setattr transaction to store the size on the MDS inode
- When the setattr transaction commits, the MDS sends the cancellation cookies to the OST(s)
- The OSTs will cancel the size change llog records on the OST(s)
Server Network Striping (Mirrored objects)
- The primary is the originator, the secondary the replicator
- Writes on the primary are accompanied by a change record for an extent
In order to process cancellation and recovery actions, the originators and replicators use a ptlrpc connection to execute remote procedure calls. The connection can be set up on the originator or the replicator, and the system setting up the connection is called the initiator.
The steps include:
- The originator and the replicator establish a connection.
- The logging subsystem on each end of the connection receives a log connection active completion event.
- If connections are active, log_cookies can be accepted for cancellation by the cancellation daemons.
- If connections are not active, the log_cookies are not collected for cancellation.
- A generation number is associated with the connections.
- - It increases for each connection complete event.
If the replicator times out during cancellation, it will continue to process the transactions with cookies. Then the cancellation daemon will drop the cookies.
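The connection rules can be sketched as a small state holder. The struct, its fields and the fixed queue size are hypothetical. The sketch assumes that a timeout is modelled as the connection becoming inactive, at which point queued cookies are dropped, and that the log records they name remain on the originator.

```c
#include <stdint.h>

#define SKETCH_MAX_COOKIES 8

/* Hypothetical per-connection logging state. */
struct sketch_log_conn {
    int active;                           /* connection currently usable */
    uint32_t generation;                  /* bumped per connect complete event */
    uint64_t pending[SKETCH_MAX_COOKIES]; /* cookies queued for cancellation */
    int nr_pending;
};

/* Connection complete event: mark active and start a new generation. */
static void sketch_conn_complete(struct sketch_log_conn *c)
{
    c->active = 1;
    c->generation++;
}

/* Timeout or disconnect: stop collecting and drop what was queued.
 * The corresponding log records are not cancelled. */
static void sketch_conn_lost(struct sketch_log_conn *c)
{
    c->active = 0;
    c->nr_pending = 0;
}

/* Offer a cookie to the cancellation queue.
 * Returns 1 if it was accepted, 0 if the connection is inactive or the
 * queue is full. */
static int sketch_offer_cookie(struct sketch_log_conn *c, uint64_t cookie)
{
    if (!c->active || c->nr_pending == SKETCH_MAX_COOKIES)
        return 0;
    c->pending[c->nr_pending++] = cookie;
    return 1;
}
```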
Initiator Connect Event
When the replicator receives a connect complete event, it does the following:
- Fetches the catalog's log_id for the catalog it needs to process through a get_cat_info call
- - The replicator needs to know what the catalog logid is for processing
- When it has received the catalog, it calls process catalog:
- - The replicator calls sync
- - It only processes records in logs from previous log connection generations
- - The catalog processing repeats operations that should have been performed by the initiator earlier
The replicator must be able to distinguish whether:
- The operation already took place
- The operation was not performed
If the operations are performed during this processing, the replicator generates cookies as it normally does. Upon commit, cookies are queued on the replicator for bulk cancellation of log records. If the update was already performed, the replicator can queue a cancellation cookie back immediately.
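A sketch of the catalog processing pass, under simplifying assumptions: each record carries the connection generation in which it was written, the replicator can test whether an update was applied with a simple per-object flag, and commit is immediate, so a cookie is queued right after a redo. All names are hypothetical.

```c
#include <stdint.h>

#define SKETCH_MAX 16

/* Hypothetical log record as seen during catalog processing. */
struct sketch_rec {
    uint32_t index;       /* record index, used here as the cookie */
    uint32_t generation;  /* connection generation that wrote it */
    uint64_t object;      /* object the update applies to */
};

struct sketch_replica {
    int applied[SKETCH_MAX];     /* applied[object] != 0 once the update is done */
    uint32_t cancel[SKETCH_MAX]; /* cookies queued for bulk cancellation */
    int nr_cancel;
    int redone;                  /* updates repeated during this pass */
};

/* Process records written in generations before cur_gen. Records from the
 * current generation are left to normal operation. */
static void sketch_process_catalog(struct sketch_replica *r,
                                   const struct sketch_rec *recs, int n,
                                   uint32_t cur_gen)
{
    for (int i = 0; i < n; i++) {
        const struct sketch_rec *rec = &recs[i];

        if (rec->generation >= cur_gen)
            continue;
        if (rec->object >= SKETCH_MAX || r->nr_cancel == SKETCH_MAX)
            continue;
        if (!r->applied[rec->object]) {
            /* The operation was not performed: repeat it now. */
            r->applied[rec->object] = 1;
            r->redone++;
        }
        /* Already applied or just redone: the record can be cancelled. */
        r->cancel[r->nr_cancel++] = rec->index;
    }
}
```

The pass is safe to repeat: running it again over the same records redoes nothing, because every object is already marked as applied.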
Log Removal Failure
If an initiator crashes during log removal, the log entries may re-appear after recovery. It is important that the removal of a log from a catalog and the removal of the log file are atomic. Upon reconnection, the replicator will again process the log.
Log operations are described in this section.
Low Level Log Operations
Logs can be opened and/or created. This fills in a log handle. The log handle can be used through the log handle API. The low-level log operations are described below, each followed by its prototype:
- If llog_id is not null, open an existing log with this ID. If name is not null, open or create a log with that name. Otherwise open a nameless log. The object id of the log is stored in the handle when the open or create succeeds.
- int llog_create(struct obd_device *obd, struct llog_handle **, struct llog_logid *, char *name);
- Close the log and free the handle. Remove the handle from the catalog's list of open handles. If the log has a flag set to destroy it when empty, the log may be destroyed.
- int llog_close(struct llog_handle *loghandle);
- Destroy the log object and close the handle.
- int llog_destroy(struct llog_handle *);
- Write a record in the log. If buf is NULL, the record is complete; if buf is not NULL, it is inserted in the middle. Records are multiples of 128 bits in size and have a header and a tail. Write the cookie for the entry into the cookie pointer. (cookie_count is probably a mistake.)
- int llog_write_rec(struct llog_handle *handle, struct llog_rec_hdr *rec, struct llog_cookie *cookie, int cookie_count, void *buf);
- Index curr_idx is in the block at offset *offset. Set *offset to the block offset of record next_idx. Copy len bytes from the start of that block into the buffer buf.
- int llog_next_block(struct llog_handle *h, int curr_idx, int next_idx, __u64 *offset, void *buf, int len);
- Read the header of the log into the handle and also read the last rec_tail in the log to find the last index that was used in the log.
- int (*lop_read_header)(struct llog_handle *handle);
Higher Level Log Operations
- Initialize the handle. Try to read it from the log file, but if the log has no header yet, build it from the arguments. If the header is read, verify the flags and UUID in the log equal those of the arguments.
- int llog_init_handle(struct llog_handle *handle, int flags, struct obd_uuid *uuid);
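The read-or-build behavior described for llog_init_handle can be illustrated with a toy model. This is not the Lustre implementation: `sketch_hdr` is a hypothetical in-memory stand-in for the header stored in the log file, and the return value of -1 for a mismatch is an assumption.

```c
#include <string.h>

/* Hypothetical stand-in for the header stored in the log file. */
struct sketch_hdr {
    int present;     /* 0 while the log has no header yet */
    int flags;
    char uuid[40];
};

/* If the log has no header, build one from the arguments. Otherwise verify
 * that the stored flags and UUID equal the arguments.
 * Returns 0 on success, -1 on mismatch. */
static int sketch_init_handle(struct sketch_hdr *on_disk, int flags,
                              const char *uuid)
{
    if (!on_disk->present) {
        on_disk->present = 1;
        on_disk->flags = flags;
        strncpy(on_disk->uuid, uuid, sizeof(on_disk->uuid) - 1);
        on_disk->uuid[sizeof(on_disk->uuid) - 1] = '\0';
        return 0;
    }
    if (on_disk->flags != flags || strcmp(on_disk->uuid, uuid) != 0)
        return -1;
    return 0;
}
```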
OBD Level Log Operations
int llog_add_record(struct llog_handle *cathandle, struct llog_trans_hdr *rec, struct llog_cookie *logcookies, void *buf);
int llog_delete_log(struct llog_handle *cathandle, struct llog_handle *loghandle);
int llog_cancel_records(struct llog_handle *cathandle, int count, struct llog_cookie *cookies);
For each cookie in the cookie array, we clear the log in-use bit and either:
- The log is empty, so mark it free in the catalog header and delete it
- The log is not empty, just write out the log header
The cookies may be in different log files, so we need to get new logs each time.
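The cancellation loop can be sketched against an in-memory catalog. The structures are hypothetical: each log keeps a one-word bitmap with bit 0 reserved for its header, the catalog keeps one bit per log slot, and a counter stands in for writing the log header back to storage.

```c
#include <stdint.h>

#define SKETCH_LOGS 4
#define SKETCH_RECS 32

struct sketch_cookie { uint32_t log; uint32_t index; };

struct sketch_log {
    uint32_t bitmap;      /* bit 0: header; bits 1..31: records in use */
    int header_writes;    /* stand-in for writing the header to storage */
};

struct sketch_catalog {
    uint32_t bitmap;      /* bit l set: log slot l is in use */
    struct sketch_log logs[SKETCH_LOGS];
};

/* Cancel the records named by the cookies. The log is looked up again for
 * each cookie, since consecutive cookies may name different logs.
 * Returns the number of records actually cancelled. */
static int sketch_cancel(struct sketch_catalog *cat, int count,
                         const struct sketch_cookie *cookies)
{
    int cancelled = 0;

    for (int i = 0; i < count; i++) {
        uint32_t l = cookies[i].log;
        uint32_t idx = cookies[i].index;
        struct sketch_log *log;

        if (l >= SKETCH_LOGS || !(cat->bitmap & (1u << l)))
            continue;                    /* no such log in the catalog */
        if (idx == 0 || idx >= SKETCH_RECS)
            continue;                    /* header or out of range */
        log = &cat->logs[l];
        if (!(log->bitmap & (1u << idx)))
            continue;                    /* already cancelled */

        log->bitmap &= ~(1u << idx);     /* clear the in-use bit */
        cancelled++;
        if (log->bitmap == 1u) {
            /* Only the header is left: free the catalog slot and
             * delete the log. */
            cat->bitmap &= ~(1u << l);
            log->bitmap = 0;
        } else {
            /* Records remain: just write out the log header. */
            log->header_writes++;
        }
    }
    return cancelled;
}
```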
- Return the block in the log that contains the record with index next_idx. The current index cur_idx at offset cur_offset is used to optimize the search.
- int llog_next_block(struct llog_handle *loghandle, int cur_idx, int next_idx, __u64 *cur_offset, void *buf, int len);
- Process the log, calling the callback function cb on its records.
- typedef int (*llog_cb_t)(struct llog_handle *, struct llog_trans_hdr *rec, void *data);
- int llog_process_log(struct llog_handle *loghandle, llog_cb_t cb, void *data);
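The callback style of log processing can be sketched with an in-memory log. The types mirror the shape of llog_cb_t but are hypothetical, and two behaviors are assumptions rather than documented facts: only records whose bitmap bit is set are visited, and a nonzero return from the callback stops the walk and is passed back to the caller.

```c
#include <stdint.h>

#define SKETCH_RECS 32

struct sketch_rec { uint32_t index; uint32_t type; };

struct sketch_log {
    uint32_t bitmap;                    /* bit 0: header; others: records */
    struct sketch_rec recs[SKETCH_RECS];
};

typedef int (*sketch_cb_t)(struct sketch_log *log, struct sketch_rec *rec,
                           void *data);

/* Call cb for every allocated record, in index order. */
static int sketch_process_log(struct sketch_log *log, sketch_cb_t cb,
                              void *data)
{
    for (uint32_t i = 1; i < SKETCH_RECS; i++) {
        int rc;

        if (!(log->bitmap & (1u << i)))
            continue;                   /* free or cancelled slot */
        rc = cb(log, &log->recs[i], data);
        if (rc != 0)
            return rc;                  /* assumed: callback aborts the walk */
    }
    return 0;
}

/* Example callback: count the records visited. */
static int sketch_count_cb(struct sketch_log *log, struct sketch_rec *rec,
                           void *data)
{
    (void)log;
    (void)rec;
    ++*(int *)data;
    return 0;
}
```

The `data` pointer lets the caller pass state to the callback without globals; here it carries a counter.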