Architecture - Changelogs
A changelog is a log of data or metadata changes. In general, these track filesystem operations conveyed via one or more RPCs. Changelogs are used by consumers such as userspace audit logs, mirroring OSTs or files, database feeds, etc. Changelogs are stored persistently and transactionally, and entries are removed upon completion.
There are three subflavors of changelogs (we intend to use the same changelog facility for all):
- rollback (undo) logs - used for filesystem recovery
- replication logs - used to propagate changes from a master server to a replica
- audit logs - record auditable actions (file create, access violation, etc.); these will typically be converted into a feed for userspace-level usage.
Definitions:
- server - a Lustre MD or OS server participating in changelog generation.
- server operation - a Lustre operation reflected in a changelog. An operation may correspond to a single RPC, part of an RPC, or multiple RPCs.
- log entry - a representation of a server operation in the changelog.
- changelog - a list of log entries, describing operations performed by a particular server.
- feed - a representation of (or layer upon) a changelog intended for a userspace consumer.
- consumer - an entity receiving the changelog or feed entries; may be a Lustre-internal consumer (replication) or an external consumer (database feed).
| Requirement | Quality attributes | Notes |
|---|---|---|
| scope | usability, performance, scalability | changelog per server or fileset |
| capability | usability | records should allow redo, undo, pathname reconstruction |
| completeness | usability | reliable reconstruction of total filesystem changelog from per-server changelogs |
| precision | usability | operations that are allowed to be omitted from the changelog |
| granularity | usability | scope of recorded server operations |
| retention | usability, scalability | when changelog entries are discarded |
| throughput | performance, scalability | updating the changelog doesn't hamper normal filesystem activity |
| multiplexing | usability | changelog can be delivered to multiple consumers |
| transactionality | usability | entries are not discarded until the consumer has indicated completion |
| filtering | usability, performance, scalability | only a subset of "interesting" events are recorded/reported |
| Use case | Quality attributes | Description |
|---|---|---|
| audit | feature | audit trail generated from feed |
| database_sync | scalability, consistency | use feed to keep a database synchronized with the filesystem |
| replication | scalability | use changelogs to drive server/filesystem replicas |
| reintegration | consistency | replay changelogs from caches and proxies to reintegrate changes back to master servers |
| rollback | recovery | roll back to a previous cluster-wide consistent state |
- Scenario: audit trail generated from feed.
- Business Goals: detailed audit of file access or access failures.
- Relevant QAs: scalability, precision.
- Stimulus: user on client 1 opens file A, reads file A, removes file A. User on client 2 attempts to open file B, but fails with EACCES.
- Environment: A and B are on MDT0001, where the sysadmin has set up a Lustre audit feed with mask IN_ALL_EVENTS.
- Artifact: feed, presented as a file under /mnt/<server>/.lustre/audit.
- Response: feed is updated with event records as clients execute transactions: IN_OPEN, IN_ACCESS, IN_DELETE, IN_OPEN. Feed is synchronous with transactions. Entries remain in the feed file until the next entry is read from the file.
- Response measure: all events are indicated in the feed file.
- Issues: need a user-settable retention policy on feed entries - read_once, explicit_cancel, time_out, etc.
- Scenario: external database is kept up to date with global filesystem metadata changes.
- Business Goals: provide a feed to a database consumer for customer-specific purposes (audit, query, HSM, etc.).
- Relevant QAs: scalability, precision.
- Stimulus: filesystem metadata changes, data changes, and access events.
- Stimulus source: client filesystem usage.
- Environment: database input agents reading audit feeds on multiple servers.
- Artifact: audit feed file on each server.
- Response: filesystem events are integrated transactionally into the database.
- Response measure: all events are collected in the database, even in the event of power loss.
- Questions: is cross-server synchronization required? E.g. a single event results in two changelog entries on two servers - do these need to be reconciled before integration into the database?
- Scenario: changelog from a primary server provides replication instructions on a backup server.
- Business Goals: efficient replication of filesystem changes.
- Stimulus: file or metadata on primary servers is modified.
- Stimulus source: normal client filesystem usage.
- Environment: replicated filesystem: primary filesystem replicated on secondary servers. The secondary filesystem may have a different layout than the primary.
- Response: relevant changelog entries are replayed on secondary servers, bringing the secondary up to date.
- Response measure: the copy of the filesystem on secondary servers matches the primary within X time measure. The secondary is always internally consistent.
- Issues: replication must be fully scalable.
- Scenario: replay changelogs from caches and proxies to reintegrate changes on primary servers.
- Business Goals: help facilitate efficient caches/proxies.
- Stimulus: file or metadata is changed on a caching server.
- Stimulus source: client filesystem usage.
- Environment: filesystem is cached in front of the "main" servers.
- Response: relevant changelog entries are batched and replayed on primary servers.
- Response measure: primary copies of files reflect all cached changes.
- Scenario: roll back to a previous consistent state for stripe or CMD recovery.
- Business Goals: help facilitate distributed recovery.
- Stimulus: an external mechanism decides that some committed server operations must be rolled back for overall filesystem consistency.
- Stimulus source: recovery mechanism (not defined by this arch).
- Environment: some distributed ops are committed, others are not.
- Response: committed ops are backed out of some servers.
- Response measure: filesystem is returned to a globally consistent state.
- Questions: what mechanism identifies which ops should be backed out? Is it possible to avoid committing those ops in the first place?
We take changelogs to mean per-server changelogs. Changelogs may be further restricted to the portion of a particular fileset residing on each server.
Changelogs must be persistent across server failure and restarts in order to provide recovery information. Changelogs should be written transactionally on a journaled filesystem. The actions described by each changelog entry must be marked as incomplete until the results of those actions are fully committed to persistent storage. Furthermore, due to rollback requirements, even committed transactions may need to be rolled back to an epoch boundary.
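As a minimal sketch of this two-step marking (the names `cl_entry` and `CLF_*` are illustrative only, not the actual llog record format):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag bits for a changelog entry. An entry is written
 * CLF_INCOMPLETE first, and only flipped to CLF_COMMITTED once the
 * operation it describes is fully committed to persistent storage. */
#define CLF_INCOMPLETE 0x1
#define CLF_COMMITTED  0x2

struct cl_entry {
    uint64_t ce_index;  /* position in the per-server changelog */
    uint32_t ce_flags;  /* CLF_* above */
};

/* Step 1: record the operation before (or with) the operation itself. */
static void cl_entry_write(struct cl_entry *e, uint64_t index)
{
    e->ce_index = index;
    e->ce_flags = CLF_INCOMPLETE;
}

/* Step 2: when the transaction commit callback fires, mark complete. */
static void cl_entry_commit(struct cl_entry *e)
{
    e->ce_flags = (e->ce_flags & ~CLF_INCOMPLETE) | CLF_COMMITTED;
}
```

Even after this second step, the entry may still need to be retained for rollback to an epoch boundary, as noted above.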
Feeds must also be persistent in some cases (database_sync), but due to different filtering and auditing requirements, it may make sense to store each feed as a separate (simplified) log.
Since changelog entries will be written frequently, it may be desirable to store them on a high-speed medium (potentially a flash drive) where possible. In any case, we should have provisions to locate changelogs on separate devices or colocate them with the target disk. In some cases RAM-only changelogs may be useful.
The changelogs for individual servers should be able to be combined into a consistent total filesystem changelog. A filesystem snapshot plus the subsequent combined changelog should be able to mirror the current state of the filesystem.
Complete synchronization of log entries implies global event ordering in the Lamport timestamp sense. However, since many operations will be confined to an individual server (e.g. metadata operations for a particular subdirectory), a strict total ordering requirement may not be needed. Instead, a causal ordering condition may be sufficient (cf. vector clocks). File- or server-based versioning, perhaps with occasional synchronizing event markers (epoch boundary markers?) included in distributed RPCs, could ensure that inter-server operations are ordered correctly, while independent operations may not be strictly ordered (e.g. a write on OST1 for file1 and a rename of file2 on MDT2 may be reported in either order).
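A sketch of the causal-ordering check this implies, using a fixed-size vector clock with one slot per server (`NSERVERS`, `vc_compare`, and the `VC_*` results are hypothetical names for illustration):

```c
#include <assert.h>
#include <stdint.h>

#define NSERVERS 4  /* assumed cluster size for this sketch */

/* Two operations are ordered only when one clock dominates the other;
 * otherwise they are concurrent and may be reported/replayed in either
 * order, as in the write-on-OST1 / rename-on-MDT2 example. */
enum vc_order { VC_BEFORE, VC_AFTER, VC_CONCURRENT, VC_EQUAL };

static enum vc_order vc_compare(const uint64_t a[NSERVERS],
                                const uint64_t b[NSERVERS])
{
    int a_le_b = 1, b_le_a = 1;
    for (int i = 0; i < NSERVERS; i++) {
        if (a[i] > b[i]) a_le_b = 0;
        if (b[i] > a[i]) b_le_a = 0;
    }
    if (a_le_b && b_le_a) return VC_EQUAL;
    if (a_le_b) return VC_BEFORE;
    if (b_le_a) return VC_AFTER;
    return VC_CONCURRENT;
}
```

A clock that dominates another must be ordered after it when combining per-server changelogs; concurrent entries need no fixed relative order.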
Some changes masked by client or proxy caching servers will not be reported to servers, and so will not be included in server changelogs. (e.g. if caching mechanism never reports a file create/write/unlink of a tempfile to the server, only directory mtime change might show up in server changelog. Or write bytes 1-10, then bytes 11-20 might appear in changelog as write bytes 1-20.) Note that depending on the cache flushing policy, the data for repeated writes to the same extent may never be sent to the server and therefore no rollback will be possible to intermediate points. Server changelogs will reflect only the net result of flushed operations.
For auditing purposes, atime logging might be required on some feeds. This would be limited to our current 'lazy atime' at best, again potentially masked by client-side read cache.
A distributed transaction may involve multiple file operations on different servers. For recovery of the FS to a consistent state in case only some components of the transaction are committed, we require the ability to rollback the committed components. This may be some kind of external snapshot reference, or COW semantics, or inclusion of the changed data in the changelog, etc.
This makes the retention policy more complex, in that changelog records may not be discarded even after the component has been committed to disk -- the entire distributed transaction must be committed. Beyond that, if arbitrary rollback abilities are desired, then changelog records may be required until the previous filesystem checkpoint, consistent cut, or snapshot. User-selectable retention policies may be required.
Replication of a large filesystem must not harm client responsiveness. Replication must be spread out over a large percentage of the servers of both the source and destination clusters.
To ensure feed entries are not lost due to e.g. output buffer overruns, some kind of completion signal must be returned to Lustre. This may be a per-entry completion callback, or potentially just a blocking pipe. Lustre must correctly handle buffer overflow conditions in case e.g. the consumer dies.
- I. MDT changes (create/delete/rename): filename, permissions, etc.; old and new values; object/version list
- II. OST metadata changes (append/modify/truncate/redirect): file/object name/version, extents
- III. OST data changes: references to old data blocks or a previous snapshot may be desired for rollback. Note that this would not involve copying the data to disk again, but merely adding a reference to the on-disk old data; these references would be cleaned up as the log entries were cancelled.
- IV. Synchronization items: epoch, object version, generation number, etc.
- V. Access times: atime/mtime on OST, ctime/mtime on MDT
- VI. Identification: nid, pid, uid if available
- VII. Permission failures
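The record categories above might map onto a single record layout along these lines (a sketch; `cl_rec` and its fields are hypothetical, not the real changelog record format):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record types covering categories I-VII above. */
enum cl_rec_type {
    CL_MDT_CHANGE,   /* I:   create/delete/rename */
    CL_OST_META,     /* II:  append/modify/truncate/redirect */
    CL_OST_DATA,     /* III: reference to old data blocks */
    CL_SYNC,         /* IV:  epoch / version marker */
    CL_TIME,         /* V:   atime/mtime/ctime update */
    CL_IDENT,        /* VI:  nid/pid/uid */
    CL_PERM_FAIL     /* VII: permission failure */
};

struct cl_rec {
    uint32_t cr_type;       /* enum cl_rec_type */
    uint64_t cr_epoch;      /* IV: synchronization epoch */
    uint64_t cr_version;    /* IV: object version */
    uint64_t cr_time;       /* V:  timestamp, when relevant */
    uint32_t cr_uid;        /* VI: user id, if available */
    uint32_t cr_pid;        /* VI: process id, if available */
    uint16_t cr_namelen;    /* I:  length of cr_name */
    char     cr_name[256];  /* I:  filename for MDT operations */
};

static void cl_rec_init(struct cl_rec *r, enum cl_rec_type t,
                        const char *name)
{
    memset(r, 0, sizeof(*r));
    r->cr_type = t;
    if (name) {
        r->cr_namelen = (uint16_t)strlen(name);
        strncpy(r->cr_name, name, sizeof(r->cr_name) - 1);
    }
}
```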
The user API for feeds should include the following features:
- Userspace data output stream
- Register event filters
- Register consumer completion callback
- Multiple consumer capability
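An in-memory sketch of the filter, completion-callback, and multiple-consumer features together (`feed_register`, `feed_dispatch`, and the mask-based filter are hypothetical names; real consumers would be separate userspace processes):

```c
#include <assert.h>
#include <stdint.h>

#define MAX_CONSUMERS 4

typedef void (*feed_cb)(uint64_t event_id, void *arg);

/* Each consumer registers an event mask (filter) and a completion
 * callback; an event is delivered only to consumers whose mask
 * matches its type. */
struct feed_consumer {
    uint32_t mask;   /* event filter */
    feed_cb  done;   /* completion callback */
    void    *arg;
    int      in_use;
};

static struct feed_consumer consumers[MAX_CONSUMERS];
static int acks;  /* total completion callbacks seen (for the demo) */

static void count_ack(uint64_t event_id, void *arg)
{
    (void)event_id; (void)arg;
    acks++;
}

static int feed_register(uint32_t mask, feed_cb done, void *arg)
{
    for (int i = 0; i < MAX_CONSUMERS; i++) {
        if (!consumers[i].in_use) {
            consumers[i] = (struct feed_consumer){ mask, done, arg, 1 };
            return i;
        }
    }
    return -1;
}

/* Deliver one event; returns how many consumers it matched. */
static int feed_dispatch(uint32_t event_type, uint64_t event_id)
{
    int matched = 0;
    for (int i = 0; i < MAX_CONSUMERS; i++) {
        if (consumers[i].in_use && (consumers[i].mask & event_type)) {
            consumers[i].done(event_id, consumers[i].arg);
            matched++;
        }
    }
    return matched;
}
```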
Based on llogs
Changelogs will be implemented as a flavor of Lustre llogs after suitable, relatively minor enhancements are made to the llog facility. Specifically, there will be multiple replicators (consumers) for some records.
Created only on demand
Changelog entries are created only when required by one or more existing consumers, and are cancelled when those consumers have finished processing the change.
Independent per server
Changelogs are per-server, and may be further restricted to a particular fileset. Changelogs from different servers / filesets will not be recombined by Lustre.
Feeds as files
User-level access to a feed will take place via a virtual file mechanism, similar to proc. Feed entries will be read out of $MNT/.lustre/feed. A consumer completion callback must be called (or some other signal given) before the next changelog entry is presented (multiple entries? asynchronous cancellation?). A timeout would be used to detect a dead consumer, at which point we abort the feed (raise a signal? continue recording non-cancelled entries forever?). Upon recovery, we restart the feed from the last uncancelled entry (when the consumer re-registers).
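The read-then-acknowledge contract described above can be sketched as follows (`feed_file`, `feed_read`, and `feed_ack` are hypothetical in-memory stand-ins; the real interface would be the virtual file):

```c
#include <assert.h>
#include <stdint.h>

/* The next entry is not presented until the consumer has signalled
 * completion of the current one; on recovery the cursor restarts at
 * the last unacknowledged entry. */
struct feed_file {
    uint64_t next;     /* index of the next entry to present */
    uint64_t last;     /* highest entry recorded so far */
    int      pending;  /* an entry was read but not acknowledged */
};

/* Returns the entry index, or -1 if the previous read is
 * unacknowledged or no entries remain. */
static int64_t feed_read(struct feed_file *f)
{
    if (f->pending || f->next > f->last)
        return -1;
    f->pending = 1;
    return (int64_t)f->next;
}

/* Consumer completion signal: only now may the entry be cancelled. */
static void feed_ack(struct feed_file *f)
{
    if (f->pending) {
        f->pending = 0;
        f->next++;
    }
}
```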
Single consumer per feed
In order to easily define separate filters and completion callbacks, we will generate a single feed per consumer. A single changelog may drive multiple feeds; each feed will take a reference on each changelog entry. We will need to identify the consumer to the feed during re-registration after recovery.
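The per-entry reference taken by each feed might look like this (a sketch with hypothetical names; the real accounting would live in the llog layer):

```c
#include <assert.h>

/* Each feed driven by a changelog takes a reference on every entry;
 * an entry may only be cancelled from the changelog once all feeds
 * have released it. */
struct cl_entry_ref {
    int refs;       /* one per feed that still needs this entry */
    int cancelled;  /* set when the last reference is dropped */
};

static void entry_get(struct cl_entry_ref *e)
{
    e->refs++;
}

static void entry_put(struct cl_entry_ref *e)
{
    if (--e->refs == 0)
        e->cancelled = 1;  /* now safe to discard the llog record */
}
```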
Feed consumer registration therefore includes:
- changelog identifier (OST0003)
- consumer identifier (process name? job id?)
- filter definition
- policy flags (recovery_required, no_callback, batch_cancel, timeout)
- last entry identifier (for recovery)
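The registration fields listed above could be collected into a single descriptor (a hypothetical sketch; field names, sizes, and the FEED_* flag values are illustrative, not a real Lustre API):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Policy flags matching the list above. */
#define FEED_RECOVERY_REQUIRED 0x1
#define FEED_NO_CALLBACK       0x2
#define FEED_BATCH_CANCEL      0x4
#define FEED_TIMEOUT           0x8

struct feed_registration {
    char     fr_changelog[16];  /* changelog identifier, e.g. "OST0003" */
    char     fr_consumer[32];   /* consumer identifier (process/job) */
    uint32_t fr_filter_mask;    /* filter definition */
    uint32_t fr_flags;          /* FEED_* policy flags */
    uint64_t fr_last_entry;     /* resume point for recovery */
};

/* On re-registration after recovery, the consumer supplies the last
 * entry it processed so the feed resumes from the next uncancelled one. */
static void feed_reg_init(struct feed_registration *r,
                          const char *changelog, const char *consumer,
                          uint32_t mask, uint32_t flags, uint64_t last)
{
    memset(r, 0, sizeof(*r));
    strncpy(r->fr_changelog, changelog, sizeof(r->fr_changelog) - 1);
    strncpy(r->fr_consumer, consumer, sizeof(r->fr_consumer) - 1);
    r->fr_filter_mask = mask;
    r->fr_flags = flags;
    r->fr_last_entry = last;
}
```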