Architecture - Changelogs

Note: The content on this page reflects the state of design of a Lustre feature at a particular point in time and may contain outdated information.

Summary

A changelog is a log of data or metadata changes. In general, these will track filesystem operations conveyed via one or more RPCs. Changelogs are used by consumers such as userspace audit logs, mirroring OSTs or files, database feeds, etc. Changelogs are stored persistently and transactionally and are removed upon completion.

There are 3 subflavors of changelogs (we intend to use the same changelog facility for all).

rollback (undo) logs - used for filesystem recovery
replication logs - used to propagate changes from a master server to a replica
audit logs - record auditable actions (file create, access violation, etc.) Will typically be converted into a feed for userspace-level usage.

Definitions

server: lustre md or os server participating in changelog generation.
server operation: lustre operation reflected in changelog. Operation may correspond to a single rpc, or part of an rpc, or multiple rpc's.
log entry: a representation of a server operation in the changelog.
changelog: a list of log entries, describing operations performed by a particular server.
feed: an representation of (or layer upon) a changelog intended for an userspace consumer.
consumer: an entity receiving the changelog or feed entries; may be a Lustre-internal consumer (replication) or an external consumer (database feed).

Qualities

Description	Quality	Semantics
scope	usability, performance, scalability	changelog per server or fileset
capability	usability	records should allow redo, undo, pathname reconstruction
completeness	usability	reliable reconstruction of total file system changelog from per-server changelogs.
precision	usability	operations that are allowed to be omitted from the changelog.
granularity	usability	scope of recorded server operations.
retention	usability, scalability	when changelog entries are discarded.
throughput	performance, scalability	updating changelog doesn't hamper normal filesystem activity.
multiplexing	usability	changelog can be delivered to multiple consumers.
transactionality	usability	entries are not discarded until consumer has indicated completion
filtering	usability, performance, scalability	only a subset of "interesting" events are recorded/reported

Use cases

ID	Quality attribute	Summary
audit	feature	audit trail generated from feed.
database_sync	scalability, consistency	use feed to keep database synchronized with filesystem.
replication	scalability	use changelogs to drive server/filesystem replicas
reintegration	consistency	replay changelogs from caches and proxies to reintegrate changes back to master servers
rollback	recovery	roll back to previous cluster-wide consistent state.

audit

Scenario:		audit trail generated from feed.
Business Goals:		Detailed audit of file access or access failures
Relevant QA's:		scalability, precision
details	Stimulus:	User on client 1 opens file A, reads file A, removes file A. User on client 2 attempts to open file B, but fails EACCESS.
	Stimulus source:	Client
	Environment:	A and B are on MDT0001, where sysadmin has set up a Lustre audit feed with mask IN_ALL_EVENTS
	Artifact:	feed, presented as a file under /mnt/<server>/.lustre/audit
	Response:	feed is updated with event records as clients execute transactions: IN_OPEN,IN_ACCESS,IN_DELETE,IN_OPEN. Feed is synchronous with transactions. Entries remain in feed file until the next entry is read from the file.
	Response measure:	all events are indicated in the feed file
Questions:
Issues:		Need a user-settable retention policy on feed entries - read_once, explicit_cancel, time_out, etc.

database_sync

Scenario:		external database is kept up to date with global filesystem metadata changes
Business Goals:		Provide feed to database consumer for customer-specific purposes (audit, query, HSM, etc.)
Relevant QA's:		scalability, precision
details	Stimulus:	Filesystem metadata changes, data changes, and access events
	Stimulus source:	Client filesystem usage
	Environment:	Database input agents reading audit feeds on multiple servers
	Artifact:	audit feed file on each server
	Response:	Filesystem events are integrated transactionally into database
	Response measure:	All events are collected in database, even in the event of power loss
Questions:		Is cross-server synchronization required? E.g. a single event results in two changelog entries on two servers - do these need to be reconciled before integration into the database?
Issues:

replication

Scenario:		changelog from a primary server provides replication instructions on a backup server
Business Goals:		Efficient replication of filesystem changes
Relevant QA's:		scalability
details	Stimulus:	file or metadata on primary servers is modified
	Stimulus source:	normal client filesystem usage
	Environment:	replicated filesystem: primary filesystem replicated on secondary servers. Secondary filesystem may have a different layout than primary.
	Artifact:	changelog
	Response:	relevant changelog entries are replayed on secondary servers, bringing secondary up to date.
	Response measure:	copy of filesystem on secondary servers matches primary within X time measure. Secondary is always internally consistent.
Questions:
Issues:		Replication must be fully scalable

reintegration

Scenario:		replay changelogs from caches and proxies to reintegrate changes on primary servers
Business Goals:		help facilitate efficient caches / proxies
Relevant QA's:		consistency
details	Stimulus:	file or metadata is changed on caching server
	Stimulus source:	client filesystem usage
	Environment:	filesystem is cached before "main" servers
	Artifact:	changelog
	Response:	relevant changelog entries are batched and replayed on primary servers
	Response measure:	Primary copies of files reflect all cached changes
Questions:
Issues:

rollback

Scenario:		roll back to previous consistent state for stripe or CMD recovery
Business Goals:		help facilitate distributed recovery
Relevant QA's:		recovery
details	Stimulus:	external mechanism decides that some committed server operations must be rolled back for overall filesystem consistency
	Stimulus source:	recovery mechanism (not defined by this arch)
	Environment:	some distributed ops are committed, others are not
	Artifact:	changelog
	Response:	committed ops are backed out of some servers
	Response measure:	filesystem is returned to globally consistent state
Questions:		What mechanism identifies which ops should be backed out? Is it possible to avoid committing those ops in the first place?
Issues:

Requirements

Scope

We take changelogs to mean per-server changelogs. Changelogs may be further restricted to the portion of a particular fileset residing on each server.

Transactionality

Changelogs must be persistent across server failure and restarts in order to provide recovery information. Changelogs should be written transactionally on a journaled filesystem. The actions described by each changelog entry must be marked as incomplete until the results of those actions are fully committed to persistent storage. Furthermore, due to rollback requirements, even committed transactions may need to be rolled back to an epoch boundary.

Feeds must also be persistent in some cases (database_sync), but due to different filtering and auditing requirements, it may make sense to store each feed as a separate (simplified) log.

Since changelog entries will be written frequently, it may be desireable to store them on a high-speed medium (potentially flash drive) where possible. In any case, we should have provisions to locate changelogs on separate devices or colocated with the target disk. In some cases ram-only changelogs may be useful.

Consistency

The changelogs for individual servers should be able to be combined into a consistent total filesystem changelog. A filesystem snapshot plus the subsequent combined changelog should be able to mirror the current state of the filesystem.

Complete synchronization of log entries implies global event ordering in the Lamport timestamp sense. However, since many operations will be confined to an individual server (e.g. metadata operations for a particular subdirectory), a strict total ordering requirement may not be needed. Instead, a causal ordering condition may be sufficient (ref vector clocks). File or server-based versioning, perhaps with occasional synchronizing event markers (epoch boundary markers?) (included in distributed RPCs) could insure that inter-server operations would be ordered correctly but independent, intra-server operations may not be strictly ordered (e.g. write on OST1 for file1, rename file2 on MDT2 may be reported in either order).

Completeness

Some changes masked by client or proxy caching servers will not be reported to servers, and so will not be included in server changelogs. (e.g. if caching mechanism never reports a file create/write/unlink of a tempfile to the server, only directory mtime change might show up in server changelog. Or write bytes 1-10, then bytes 11-20 might appear in changelog as write bytes 1-20.) Note that depending on the cache flushing policy, the data for repeated writes to the same extent may never be sent to the server and therefore no rollback will be possible to intermediate points. Server changelogs will reflect only the net result of flushed operations.

For auditing purposes, atime logging might be required on some feeds. This would be limited to our current 'lazy atime' at best, again potentially masked by client-side read cache.

Retention

A distributed transaction may involve multiple file operations on different servers. For recovery of the FS to a consistent state in case only some components of the transaction are committed, we require the ability to rollback the committed components. This may be some kind of external snapshot reference, or COW semantics, or inclusion of the changed data in the changelog, etc.

This makes the retention policy more complex, in that changelog records may not be discarded even after the component has been committed to disk -- the entire distributed transaction must be committed. Beyond that, if arbitrary rollback abilities are desired, then changelog records may be required until the previous filesystem checkpoint, consistent cut, or snapshot. User-selectable retention policies may be required.

Scalablity

Replication of a large filesystem must not harm client responsiveness. Replication must be spread out over a large percentage of the servers of both the source and destination clusters

Feed Synchronization

To insure feed entries are not lost due to e.g. output buffer overruns, some kind of completion signal must be returned to Lustre. This may be a per-entry completion callback, or potentially just a blocking pipe. Lustre must correctly handle buffer overflow conditions in case e.g. consumer dies.

Log Content

I. MDT changes (create/delete/rename): filename,perm,etc. old and new values, object/version list

II. OST metadata changes (append/modify/truncate/redirect): file/object name / version, extents

III. OST data changes: references to old data blocks or previous snapshot may be desired for rollback. Note that this would not involve copying the data to disk again, but merely adding a reference to the on-disk old data. These would be cleaned as the log entries were canceled.

IV. Synchronization items: epoch, object version, generation number, etc.

For auditing:

V. Access time: atime/mtime on OST, ctime/mtime on MDT
VI. Identification: nid, pid, uid if available
VII. Permission failures

Feed API

The user API for feeds should include the following features:

  - Userspace data output stream 
  - Register event filters
  - Register consumer completion callback
  - Multiple consumer capability

Implementation Notes

Implementation contraints:

Base on llogs

Changelogs will be implemented as a flavor of Lustre llogs after suitable, relatively minor enhancements are made to the llog facility. Specifically, there will be multiple replicators (consumers) for some records.

Created only on demand

Changelog entries are created only when required by an existing consumer(s), and are cancelled when that consumer(s) has finished processing the change.

Independent per server

Changelogs are per-server, and may be further restricted to a particular fileset. Changelogs from different servers / filesets will not be recombined by Lustre.

Feeds as files

User-level access to a feed will take place via a virtual file type mechanism, similar to proc. Feed entries will be read out of $MNT/.lustre/feed. A consumer completion callback must be called (or some other signal) before the next changelog entry is presented (multiple entries? asyncronous cancellation?). A timeout would be used to detect a dead consumer, at which point we abort the feed (raise a signal? continue recording non-cancelled entries forever?). Upon recovery, we restart feed from last uncanceled entry (when consumer re-registers).

Single consumer per feed

In order to easily define separate filters and completion callbacks, we will generate a single feed per consumer. A single changelog may drive multiple feeds; each feed will take a reference on each changelog entry. We will need to identify the consumer to the feed during re-registration after recovery.

Feed registration

Feed consumer registration therefore includes:

changelog identifier (OST0003)
consumer identifier (process name? job id?)
filter definition
policy flags (recovery_required, no_callback, batch_cancel, timeout)
last entry identifier (for recovery)

References

bug 14169

WARNING: This is the _old_ Lustre wiki, and it is in the process of being retired. The information found here is all likely to be out of date. Please search the new wiki for more up to date information.