Architecture - Changelogs

Summary
A changelog is a log of data or metadata changes. In general, these will track filesystem operations conveyed via one or more RPCs. Changelogs are used by consumers such as userspace audit logs, mirroring OSTs or files, database feeds, etc. Changelog entries are stored persistently and transactionally, and are removed once their consumers have finished processing them.

There are 3 subflavors of changelogs (we intend to use the same changelog facility for all).
 * 1) rollback (undo) logs - used for filesystem recovery
 * 2) replication logs - used to propagate changes from a master server to a replica
 * 3) audit logs - record auditable actions (file create, access violation, etc.)  Will typically be converted into a feed for userspace-level usage.

Definitions

 * server : a Lustre metadata (MD) or object storage (OS) server participating in changelog generation.
 * server operation : a Lustre operation reflected in the changelog. An operation may correspond to a single RPC, part of an RPC, or multiple RPCs.
 * log entry : a representation of a server operation in the changelog.
 * changelog : a list of log entries, describing operations performed by a particular server.
 * feed : a representation of (or layer upon) a changelog intended for a userspace consumer.
 * consumer : an entity receiving the changelog or feed entries; may be a Lustre-internal consumer (replication) or an external consumer (database feed).

Scope
We take changelogs to mean per-server changelogs. Changelogs may be further restricted to the portion of a particular fileset residing on each server.

Transactionality
Changelogs must be persistent across server failure and restarts in order to provide recovery information. Changelogs should be written transactionally on a journaled filesystem. The actions described by each changelog entry must be marked as incomplete until the results of those actions are fully committed to persistent storage. Furthermore, due to rollback requirements, even committed transactions may need to be rolled back to an epoch boundary.

Feeds must also be persistent in some cases (database_sync), but due to different filtering and auditing requirements, it may make sense to store each feed as a separate (simplified) log.

Since changelog entries will be written frequently, it may be desirable to store them on a high-speed medium (potentially a flash drive) where possible. In any case, we should have provisions to locate changelogs either on separate devices or colocated with the target disk. In some cases RAM-only changelogs may be useful.

Consistency
The changelogs for individual servers should be able to be combined into a consistent total filesystem changelog. A filesystem snapshot plus the subsequent combined changelog should be able to mirror the current state of the filesystem.

Complete synchronization of log entries implies global event ordering in the Lamport timestamp sense. However, since many operations will be confined to an individual server (e.g. metadata operations for a particular subdirectory), a strict total ordering requirement may not be needed. Instead, a causal ordering condition may be sufficient (ref vector clocks). File or server-based versioning, perhaps with occasional synchronizing event markers (epoch boundary markers?) included in distributed RPCs, could ensure that inter-server operations are ordered correctly. Independent operations, each confined to a single server, may not be strictly ordered relative to one another (e.g. a write on OST1 for file1 and a rename of file2 on MDT2 may be reported in either order).
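The causal ordering condition can be illustrated with a small vector clock comparison. This is only a sketch of the general technique, not a description of any Lustre mechanism: the `VectorClock` type and the `happened_before` and `concurrent` helpers are hypothetical names introduced for illustration, and the stamps in the usage lines are made up.

```python
from typing import Dict

# Hypothetical stamp: server name -> number of events seen from that server.
VectorClock = Dict[str, int]


def happened_before(a: VectorClock, b: VectorClock) -> bool:
    """True if the event stamped `a` causally precedes the event stamped `b`."""
    servers = set(a) | set(b)
    no_later = all(a.get(s, 0) <= b.get(s, 0) for s in servers)
    strictly_earlier = any(a.get(s, 0) < b.get(s, 0) for s in servers)
    return no_later and strictly_earlier


def concurrent(a: VectorClock, b: VectorClock) -> bool:
    """True if neither event causally precedes the other."""
    return not happened_before(a, b) and not happened_before(b, a)


# Independent operations on different servers: either report order is valid.
write_file1 = {"OST1": 1}
rename_file2 = {"MDT2": 1}
print(concurrent(write_file1, rename_file2))  # True

# A distributed operation whose RPC carried the MDT stamp to the OST.
create_on_mdt = {"MDT2": 2}
object_on_ost = {"MDT2": 2, "OST1": 2}
print(happened_before(create_on_mdt, object_on_ost))  # True
```

Under this scheme a consumer merging per-server logs would only need to respect `happened_before`; entries for which `concurrent` holds could be emitted in any order.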

Completeness
Some changes masked by client or proxy caching servers will not be reported to servers, and so will not be included in server changelogs. For example, if the caching mechanism never reports a create/write/unlink of a temporary file to the server, only a directory mtime change might show up in the server changelog. Similarly, a write of bytes 1-10 followed by a write of bytes 11-20 might appear in the changelog as a single write of bytes 1-20. Note that depending on the cache flushing policy, the data for repeated writes to the same extent may never be sent to the server, and therefore no rollback will be possible to intermediate points. Server changelogs will reflect only the net result of flushed operations.
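The "net result" effect on write extents can be sketched as follows. This is an illustration of what a cache might flush, assuming inclusive byte ranges as in the example above; the `coalesce` function and the `Extent` type are hypothetical and do not correspond to any Lustre interface.

```python
from typing import List, Tuple

# Hypothetical representation: inclusive (first_byte, last_byte) range.
Extent = Tuple[int, int]


def coalesce(extents: List[Extent]) -> List[Extent]:
    """Merge overlapping or adjacent write extents into their net result."""
    merged: List[Extent] = []
    for start, end in sorted(extents):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or directly follows the previous extent: extend it.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


print(coalesce([(1, 10), (11, 20)]))  # [(1, 20)]
print(coalesce([(1, 10), (1, 10)]))   # [(1, 10)]
```

In the second call the two writes to the same extent collapse into one, so the server never sees the intermediate data and cannot roll back to it.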

For auditing purposes, atime logging might be required on some feeds. This would be limited to our current 'lazy atime' at best, again potentially masked by client-side read cache.

Retention
A distributed transaction may involve multiple file operations on different servers. To recover the filesystem to a consistent state when only some components of the transaction are committed, we require the ability to roll back the committed components. This may rely on some kind of external snapshot reference, COW semantics, inclusion of the changed data in the changelog, etc.

This makes the retention policy more complex, in that changelog records may not be discarded even after the component has been committed to disk -- the entire distributed transaction must be committed. Beyond that, if arbitrary rollback abilities are desired, then changelog records may be required until the previous filesystem checkpoint, consistent cut, or snapshot. User-selectable retention policies may be required.
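The basic retention rule (a record may be dropped only once every component of its distributed transaction is committed) might be modeled like this. It is a minimal sketch under assumed names: `Record`, `txn_id`, and `discardable_txns` are hypothetical, and stricter policies such as retaining records back to the previous checkpoint are not modeled.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set


@dataclass(frozen=True)
class Record:
    """Hypothetical changelog record for one component of a transaction."""
    server: str
    txn_id: int
    committed: bool


def discardable_txns(records: Iterable[Record]) -> Set[int]:
    """Return ids of transactions whose components are all committed."""
    by_txn: Dict[int, List[Record]] = {}
    for rec in records:
        by_txn.setdefault(rec.txn_id, []).append(rec)
    return {txn for txn, parts in by_txn.items()
            if all(p.committed for p in parts)}


records = [
    Record("MDT0", txn_id=1, committed=True),
    Record("OST1", txn_id=1, committed=False),  # still in flight
    Record("MDT0", txn_id=2, committed=True),
    Record("OST2", txn_id=2, committed=True),
]
print(discardable_txns(records))  # {2}
```

The MDT0 record for transaction 1 is already on disk, yet it must be kept because the OST1 component could still fail and force a rollback.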

Scalability
Replication of a large filesystem must not harm client responsiveness. Replication must be spread out over a large percentage of the servers of both the source and destination clusters.

Feed Synchronization
To ensure feed entries are not lost, for example due to output buffer overruns, some kind of completion signal must be returned to Lustre. This may be a per-entry completion callback, or potentially just a blocking pipe. Lustre must correctly handle buffer overflow conditions, for example if the consumer dies.

Log Content
I.  MDT changes (create/delete/rename): filename,perm,etc. old and new values, object/version list

II. OST metadata changes (append/modify/truncate/redirect): file/object name / version, extents

III. OST data changes: references to old data blocks or previous snapshot may be desired for rollback. Note that this would not involve copying the data to disk again, but merely adding a reference to the on-disk old data. These would be cleaned as the log entries were canceled.

IV. Synchronization items: epoch, object version, generation number, etc.

For auditing:
 * V.  Access time: atime/mtime on OST, ctime/mtime on MDT
 * VI. Identification: nid, pid, uid if available
 * VII. Permission failures

Feed API
The user API for feeds should include the following features:
 * Userspace data output stream
 * Register event filters
 * Register consumer completion callback
 * Multiple consumer capability

Implementation Notes
Implementation constraints:

Base on llogs
Changelogs will be implemented as a flavor of Lustre llogs after suitable, relatively minor enhancements are made to the llog facility. Specifically, there will be multiple replicators (consumers) for some records.

Created only on demand
Changelog entries are created only when required by one or more existing consumers, and are cancelled when those consumers have finished processing the change.

Independent per server
Changelogs are per-server, and may be further restricted to a particular fileset. Changelogs from different servers / filesets will not be recombined by Lustre.

Feeds as files
User-level access to a feed will take place via a virtual file type mechanism, similar to proc. Feed entries will be read out of $MNT/.lustre/feed. A consumer completion callback (or some other signal) must be received before the next changelog entry is presented (multiple entries? asynchronous cancellation?). A timeout would be used to detect a dead consumer, at which point we abort the feed (raise a signal? continue recording non-cancelled entries forever?). Upon recovery, we restart the feed from the last uncanceled entry (when the consumer re-registers).
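The one-entry-at-a-time handshake and the restart behaviour can be sketched with an in-memory model. This does not read $MNT/.lustre/feed and is not the feed interface itself: the `Feed` class and its `read`, `ack`, and `restart` methods are hypothetical names, and the open questions above (batching, timeouts) are left out.

```python
from typing import Iterable, List, Optional


class Feed:
    """In-memory model: present one entry, wait for completion, then advance."""

    def __init__(self, entries: Iterable[str]) -> None:
        self._entries: List[str] = list(entries)  # uncanceled entries, oldest first
        self._outstanding = False

    def read(self) -> Optional[str]:
        """Return the next entry, or None while one is unacknowledged or none remain."""
        if self._outstanding or not self._entries:
            return None
        self._outstanding = True
        return self._entries[0]

    def ack(self) -> None:
        """Completion signal from the consumer: cancel the presented entry."""
        if not self._outstanding:
            raise RuntimeError("no entry is awaiting completion")
        self._entries.pop(0)
        self._outstanding = False

    def restart(self) -> None:
        """Consumer re-registered after a failure: resume at the last uncanceled entry."""
        self._outstanding = False


feed = Feed(["create a", "unlink b"])
print(feed.read())  # create a
print(feed.read())  # None (previous entry not yet acknowledged)
feed.restart()      # consumer died and re-registered
print(feed.read())  # create a (presented again)
feed.ack()
print(feed.read())  # unlink b
```

Because an entry is removed only in `ack`, a consumer that dies mid-entry sees that entry again after `restart`, so consumers in this model would need to tolerate duplicates.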

Single consumer per feed
In order to easily define separate filters and completion callbacks, we will generate a single feed per consumer. A single changelog may drive multiple feeds; each feed will take a reference on each changelog entry. We will need to identify the consumer to the feed during re-registration after recovery.
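The per-entry reference scheme might look roughly like the following. It is a sketch only: the `Changelog` class, its `append` and `complete` methods, and the consumer names are hypothetical, and filters are ignored, so every registered feed references every entry.

```python
from typing import Dict, Iterable, Optional, Set


class Changelog:
    """One per-server changelog whose entries are shared by several feeds."""

    def __init__(self, consumers: Iterable[str]) -> None:
        self.consumers: Set[str] = set(consumers)
        # entry index -> feeds that still hold a reference on that entry
        self.pending: Dict[int, Set[str]] = {}
        self.next_index = 1

    def append(self) -> Optional[int]:
        """Record an entry if any feed wants it; each feed takes a reference."""
        if not self.consumers:
            return None  # entries are created only on demand
        index = self.next_index
        self.next_index += 1
        self.pending[index] = set(self.consumers)
        return index

    def complete(self, index: int, consumer: str) -> bool:
        """Drop one feed's reference; cancel the entry once none remain."""
        holders = self.pending[index]
        holders.discard(consumer)
        if not holders:
            del self.pending[index]
            return True
        return False


log = Changelog(["replicator", "audit"])
entry = log.append()
print(log.complete(entry, "replicator"))  # False, audit still holds a reference
print(log.complete(entry, "audit"))       # True, entry cancelled
```

Keeping the set of holders by consumer name, rather than a bare counter, is what would let a re-registering consumer be matched to the entries it still owes after recovery.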

Feed registration
Feed consumer registration therefore includes:
 * 1) changelog identifier (OST0003)
 * 2) consumer identifier (process name?  job id?)
 * 3) filter definition
 * 4) policy flags (recovery_required, no_callback, batch_cancel, timeout)
 * 5) last entry identifier (for recovery)
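The registration items listed above could be grouped into a single record along these lines. The field names, the `Policy` enum, and the dictionary shape of a changelog entry are assumptions made for illustration. In particular, `timeout` is modeled as a plain flag here, although a real interface would presumably carry a duration.

```python
from dataclasses import dataclass
from enum import Flag, auto
from typing import Callable, Dict, Optional


class Policy(Flag):
    """Hypothetical encoding of the policy flags named in the list above."""
    RECOVERY_REQUIRED = auto()
    NO_CALLBACK = auto()
    BATCH_CANCEL = auto()
    TIMEOUT = auto()


@dataclass
class FeedRegistration:
    changelog_id: str                            # 1) e.g. "OST0003"
    consumer_id: str                             # 2) process name or job id
    entry_filter: Callable[[Dict[str, str]], bool]  # 3) filter definition
    policy: Policy = Policy(0)                   # 4) no flags set by default
    last_entry: Optional[int] = None             # 5) for recovery, None on first registration

    def accepts(self, entry: Dict[str, str]) -> bool:
        """Apply this consumer's filter to one changelog entry."""
        return self.entry_filter(entry)


reg = FeedRegistration(
    changelog_id="OST0003",
    consumer_id="audit-daemon",
    entry_filter=lambda e: e.get("type") == "unlink",
    policy=Policy.RECOVERY_REQUIRED | Policy.BATCH_CANCEL,
    last_entry=41,
)
print(Policy.BATCH_CANCEL in reg.policy)  # True
print(reg.accepts({"type": "create"}))    # False
```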