
Architecture - Changelogs

From Obsolete Lustre Wiki

Revision as of 21:01, 14 January 2010



A changelog is a log of data or metadata changes. In general, these track filesystem operations conveyed via one or more RPCs. Changelogs are used by consumers such as userspace audit logs, mirroring of OSTs or files, database feeds, etc. Changelogs are stored persistently and transactionally, and entries are removed upon completion.

There are three subflavors of changelogs (we intend to use the same changelog facility for all three):

  1. rollback (undo) logs - used for filesystem recovery
  2. replication logs - used to propagate changes from a master server to a replica
  3. audit logs - record auditable actions (file create, access violation, etc.); these will typically be converted into a feed for userspace consumption.


server
Lustre MDT or OST server participating in changelog generation.
server operation
Lustre operation reflected in the changelog. An operation may correspond to a single RPC, part of an RPC, or multiple RPCs.
log entry
a representation of a server operation in the changelog.
changelog
a list of log entries describing the operations performed by a particular server.
feed
a representation of (or layer upon) a changelog intended for a userspace consumer.
consumer
an entity receiving changelog or feed entries; may be a Lustre-internal consumer (replication) or an external consumer (database feed).
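The terms above can be made concrete with a small sketch. The struct below is purely illustrative; the field names and layout are assumptions, not the actual Lustre llog record format:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative changelog entry: one log entry describes one server
 * operation. Field names and layout are assumptions, not the actual
 * Lustre on-disk record format. */
struct cl_entry {
        uint64_t ce_index;     /* position in this server's changelog */
        uint64_t ce_epoch;     /* synchronization marker for cross-server ordering */
        uint32_t ce_optype;    /* create, unlink, rename, write, ... */
        uint32_t ce_flags;     /* e.g. "results committed to persistent storage" */
        char     ce_name[256]; /* target name, when the operation has one */
};
```

A changelog is then an ordered sequence of such entries per server, and a feed is a filtered view of that sequence presented to one consumer.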


Attribute         Quality                               Semantics
scope             usability, performance, scalability   changelog per server or fileset
capability        usability                             records should allow redo, undo, and pathname reconstruction
completeness      usability                             reliable reconstruction of the total filesystem changelog from per-server changelogs
precision         usability                             which operations are allowed to be omitted from the changelog
granularity       usability                             scope of recorded server operations
retention         usability, scalability                when changelog entries are discarded
throughput        performance, scalability              updating the changelog does not hamper normal filesystem activity
multiplexing      usability                             changelog can be delivered to multiple consumers
transactionality  usability                             entries are not discarded until the consumer has indicated completion
filtering         usability, performance, scalability   only a subset of "interesting" events are recorded/reported

Use cases

ID             Quality attribute         Summary
audit          feature                   audit trail generated from feed
database_sync  scalability, consistency  use feed to keep a database synchronized with the filesystem
replication    scalability               use changelogs to drive server/filesystem replicas
reintegration  consistency               replay changelogs from caches and proxies to reintegrate changes back to master servers
rollback       recovery                  roll back to a previous cluster-wide consistent state


Scenario: audit trail generated from feed.
Business Goals: Detailed audit of file access or access failures
Relevant QAs: scalability, precision
Stimulus: A user on client 1 opens file A, reads file A, and removes file A. A user on client 2 attempts to open file B, but fails with EACCES.
Stimulus source: Client
Environment: A and B are on MDT0001, where sysadmin has set up a Lustre audit feed with mask IN_ALL_EVENTS
Artifact: feed, presented as a file under /mnt/<server>/.lustre/audit
Response: feed is updated with event records as clients execute transactions: IN_OPEN, IN_ACCESS, IN_DELETE, IN_OPEN. The feed is synchronous with the transactions. Entries remain in the feed file until the next entry is read from the file.
Response measure: all events are indicated in the feed file
Issues: Need a user-settable retention policy on feed entries - read_once, explicit_cancel, time_out, etc.


Scenario: external database is kept up to date with global filesystem metadata changes
Business Goals: Provide feed to database consumer for customer-specific purposes (audit, query, HSM, etc.)
Relevant QAs: scalability, precision
Stimulus: Filesystem metadata changes, data changes, and access events
Stimulus source: Client filesystem usage
Environment: Database input agents reading audit feeds on multiple servers
Artifact: audit feed file on each server
Response: Filesystem events are integrated transactionally into database
Response measure: All events are collected in database, even in the event of power loss
Questions: Is cross-server synchronization required? E.g. a single event results in two changelog entries on two servers - do these need to be reconciled before integration into the database?


Scenario: changelog from a primary server provides replication instructions on a backup server
Business Goals: Efficient replication of filesystem changes
Relevant QAs: scalability
Stimulus: File data or metadata on the primary servers is modified
Stimulus source: normal client filesystem usage
Environment: replicated filesystem: primary filesystem replicated on secondary servers. Secondary filesystem may have a different layout than primary.
Artifact: changelog
Response: relevant changelog entries are replayed on secondary servers, bringing secondary up to date.
Response measure: copy of filesystem on secondary servers matches primary within X time measure. Secondary is always internally consistent.
Issues: Replication must be fully scalable


Scenario: replay changelogs from caches and proxies to reintegrate changes on primary servers
Business Goals: help facilitate efficient caches / proxies
Relevant QAs: consistency
Stimulus: File data or metadata is changed on a caching server
Stimulus source: client filesystem usage
Environment: filesystem is cached before "main" servers
Artifact: changelog
Response: relevant changelog entries are batched and replayed on primary servers
Response measure: Primary copies of files reflect all cached changes


Scenario: roll back to previous consistent state for stripe or CMD recovery
Business Goals: help facilitate distributed recovery
Relevant QAs: recovery
Stimulus: An external mechanism decides that some committed server operations must be rolled back for overall filesystem consistency
Stimulus source: recovery mechanism (not defined by this arch)
Environment: some distributed ops are committed, others are not
Artifact: changelog
Response: committed ops are backed out of some servers
Response measure: filesystem is returned to globally consistent state
Questions: What mechanism identifies which ops should be backed out? Is it possible to avoid committing those ops in the first place?



We take changelogs to mean per-server changelogs. Changelogs may be further restricted to the portion of a particular fileset residing on each server.


Changelogs must be persistent across server failures and restarts in order to provide recovery information. Changelogs should be written transactionally on a journaled filesystem. The actions described by each changelog entry must be marked as incomplete until the results of those actions are fully committed to persistent storage. Furthermore, due to rollback requirements, even committed transactions may need to be rolled back to an epoch boundary.

Feeds must also be persistent in some cases (database_sync), but due to different filtering and auditing requirements, it may make sense to store each feed as a separate (simplified) log.

Since changelog entries will be written frequently, it may be desirable to store them on a high-speed medium (potentially a flash device) where possible. In any case, we should have provisions to locate changelogs on separate devices or colocate them with the target disk. In some cases RAM-only changelogs may be useful.


The changelogs for individual servers should be able to be combined into a consistent total filesystem changelog. A filesystem snapshot plus the subsequent combined changelog should be able to mirror the current state of the filesystem.

Complete synchronization of log entries implies global event ordering in the Lamport-timestamp sense. However, since many operations are confined to an individual server (e.g. metadata operations for a particular subdirectory), a strict total ordering may not be needed; a causal ordering condition may be sufficient (cf. vector clocks). File- or server-based versioning, perhaps with occasional synchronizing event markers (epoch boundary markers?) included in distributed RPCs, could ensure that inter-server operations are ordered correctly, while independent operations on different servers need not be strictly ordered (e.g. a write on OST1 for file1 and a rename of file2 on MDT2 may be reported in either order).
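As a sketch of the causal-ordering condition (illustrative, not Lustre code; the server count and clock representation are assumptions), a vector-clock comparison distinguishes ordered operations from independent ones:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NSRV 4  /* illustrative number of servers */

/* Entry A happened-before entry B iff A's clock is <= B's in every
 * component and strictly less in at least one. If neither A < B nor
 * B < A holds, the operations are independent and may be reported in
 * either order (the OST1-write / MDT2-rename case above). */
static bool happened_before(const uint64_t a[NSRV], const uint64_t b[NSRV])
{
        bool strictly_less = false;

        for (int i = 0; i < NSRV; i++) {
                if (a[i] > b[i])
                        return false;
                if (a[i] < b[i])
                        strictly_less = true;
        }
        return strictly_less;
}
```

Only the causally related pairs constrain the order in which the combined changelog must present entries.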


Some changes masked by client or proxy caching servers will not be reported to servers, and so will not be included in server changelogs. For example, if the caching mechanism never reports the create/write/unlink of a tempfile to the server, only the directory mtime change might show up in the server changelog; similarly, a write of bytes 1-10 followed by a write of bytes 11-20 might appear in the changelog as a single write of bytes 1-20. Note that, depending on the cache flushing policy, the data for repeated writes to the same extent may never be sent to the server, in which case no rollback to intermediate points is possible. Server changelogs reflect only the net result of flushed operations.

For auditing purposes, atime logging might be required on some feeds. This would be limited to our current 'lazy atime' at best, and is again potentially masked by client-side read caching.


A distributed transaction may involve multiple file operations on different servers. To recover the filesystem to a consistent state when only some components of the transaction are committed, we require the ability to roll back the committed components. The rollback information may be some kind of external snapshot reference, COW semantics, inclusion of the changed data in the changelog, etc.

This makes the retention policy more complex: changelog records may not be discarded even after a component has been committed to disk -- the entire distributed transaction must be committed first. Beyond that, if arbitrary rollback is desired, changelog records may need to be retained back to the previous filesystem checkpoint, consistent cut, or snapshot. User-selectable retention policies may be required.


Replication of a large filesystem must not harm client responsiveness. Replication work must be spread out over a large percentage of the servers in both the source and destination clusters.

Feed Synchronization

To ensure feed entries are not lost due to, e.g., output buffer overruns, some kind of completion signal must be returned to Lustre. This may be a per-entry completion callback, or potentially just a blocking pipe. Lustre must correctly handle buffer overflow conditions, e.g. if the consumer dies.
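One way to sketch this completion accounting (illustrative only; not the Lustre interface): the feed keeps a produce index and a cancel cursor, and only entries below the cursor may be reclaimed:

```c
#include <assert.h>
#include <stdint.h>

/* "Entries are not discarded until the consumer signals completion":
 * a feed tracks a cancel cursor; names are illustrative, not Lustre's. */
struct feed {
        uint64_t f_next_index; /* index of the next entry to be produced */
        uint64_t f_cancel_to;  /* entries below this index are complete  */
};

static uint64_t feed_produce(struct feed *f)
{
        return f->f_next_index++;
}

/* Consumer completion signal: entries up to and including 'index' are done. */
static void feed_cancel(struct feed *f, uint64_t index)
{
        if (index + 1 > f->f_cancel_to)
                f->f_cancel_to = index + 1;
}

/* Number of entries that must still be retained for this consumer. */
static uint64_t feed_pending(const struct feed *f)
{
        return f->f_next_index - f->f_cancel_to;
}
```

If the pending count grows without bound (a dead consumer), the feed must either be aborted or spilled, which is the overflow condition noted above.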

Log Content

I. MDT changes (create/delete/rename): filename,perm,etc. old and new values, object/version list

II. OST metadata changes (append/modify/truncate/redirect): file/object name / version, extents

III. OST data changes: references to old data blocks or previous snapshot may be desired for rollback. Note that this would not involve copying the data to disk again, but merely adding a reference to the on-disk old data. These would be cleaned as the log entries were canceled.

IV. Synchronization items: epoch, object version, generation number, etc.

For auditing:

V. Access time
atime/mtime on OST, ctime/mtime on MDT
VI. Identification
nid, pid, uid if available
VII. Permission failures
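The record classes above could be tagged with an enum along the following lines; the names and numbering are assumptions, not the actual on-disk format:

```c
#include <assert.h>

/* Illustrative record classes, mirroring list items I-VII above. */
enum cl_rec_class {
        CL_REC_MDT_CHANGE, /* I:   create/delete/rename with old+new values  */
        CL_REC_OST_MD,     /* II:  append/modify/truncate/redirect + extents */
        CL_REC_OST_DATA,   /* III: references to old data blocks (rollback)  */
        CL_REC_SYNC,       /* IV:  epoch / version / generation markers      */
        CL_REC_ATIME,      /* V:   access-time updates (audit feeds only)    */
        CL_REC_IDENT,      /* VI:  nid/pid/uid of the originator             */
        CL_REC_PERM_FAIL,  /* VII: permission failures                       */
};
```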

Feed API

The user API for feeds should include the following features:

  - Userspace data output stream 
  - Register event filters
  - Register consumer completion callback
  - Multiple consumer capability
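The event-filter feature might look like the following sketch. The mask constants echo the inotify-style names used in the audit scenario, but their values, and the whole interface, are assumptions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical event bits a consumer can register interest in. */
#define CL_OPEN   (1u << 0)
#define CL_ACCESS (1u << 1)
#define CL_DELETE (1u << 2)
#define CL_CREATE (1u << 3)

struct feed_filter {
        uint32_t ff_mask; /* events this consumer wants reported */
};

/* Only entries matching the registered mask are delivered to the feed. */
static bool feed_filter_match(const struct feed_filter *ff, uint32_t event)
{
        return (ff->ff_mask & event) != 0;
}
```

Per-consumer filters of this shape are what make a single changelog able to drive several differently filtered feeds.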

Implementation Notes

Implementation constraints:

Base on llogs

Changelogs will be implemented as a flavor of Lustre llogs after suitable, relatively minor enhancements are made to the llog facility. Specifically, there will be multiple replicators (consumers) for some records.

Created only on demand

Changelog entries are created only when required by one or more existing consumers, and are cancelled when those consumers have finished processing the change.

Independent per server

Changelogs are per-server, and may be further restricted to a particular fileset. Changelogs from different servers / filesets will not be recombined by Lustre.

Feeds as files

User-level access to a feed will take place via a virtual file mechanism, similar to procfs. Feed entries will be read out of $MNT/.lustre/feed. A consumer completion callback (or some other signal) must be invoked before the next changelog entry is presented (multiple entries? asynchronous cancellation?). A timeout would be used to detect a dead consumer, at which point we abort the feed (raise a signal? continue recording non-cancelled entries forever?). Upon recovery, the feed restarts from the last uncancelled entry once the consumer re-registers.
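A hypothetical userspace consumer of such a feed file could look like the sketch below. The line-oriented record format, and the convention that the next read acts as the completion signal for the previous entry, are both assumptions:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Read feed entries (one per line) into 'entries', up to 'max'.
 * Under the assumed "read_once" retention policy, each successful read
 * implicitly cancels the previous entry. Returns the number read. */
static int consume_feed(FILE *feed, char entries[][64], int max)
{
        int n = 0;
        char line[64];

        while (n < max && fgets(line, sizeof(line), feed) != NULL) {
                line[strcspn(line, "\n")] = '\0';
                strcpy(entries[n++], line);
        }
        return n;
}
```

A real consumer would open $MNT/.lustre/feed and block in the read when no entries are pending.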

Single consumer per feed

In order to easily define separate filters and completion callbacks, we will generate a single feed per consumer. A single changelog may drive multiple feeds; each feed will take a reference on each changelog entry. We will need to identify the consumer to the feed during re-registration after recovery.

Feed registration

Feed consumer registration therefore includes:

  1. changelog identifier (e.g. OST0003)
  2. consumer identifier (process name? job id?)
  3. filter definition
  4. policy flags (recovery_required, no_callback, batch_cancel, timeout)
  5. last entry identifier (for recovery)
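These registration parameters could be carried in a struct like the following sketch; the field names and flag values are illustrative, not the actual Lustre interface:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical policy flags, mirroring item 4 above. */
#define FEEDF_RECOVERY_REQUIRED 0x1
#define FEEDF_NO_CALLBACK       0x2
#define FEEDF_BATCH_CANCEL      0x4
#define FEEDF_TIMEOUT           0x8

/* One registration per consumer, matching items 1-5 above. */
struct feed_registration {
        char     fr_changelog[16]; /* 1. changelog identifier, e.g. "OST0003"   */
        char     fr_consumer[32];  /* 2. consumer identifier (process/job name) */
        uint32_t fr_filter_mask;   /* 3. filter definition                      */
        uint32_t fr_flags;         /* 4. policy flags                           */
        uint64_t fr_last_index;    /* 5. last entry seen, for recovery restart  */
};
```

On re-registration after recovery, fr_consumer and fr_last_index together let the feed resume from the first uncancelled entry.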


bug 14169
